Closed Bug 1188018 · Opened 9 years ago · Closed 5 years ago

Investigate persistent ELB-level 5XX errors from pipeline edge server

Categories

(Data Platform and Tools :: General, defect, P3)

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: whd, Assigned: whd)

Details

(Whiteboard: [DataOps])

We have a background 5XX rate from the tee ELB that does not correspond to 5XXs from either the old or new telemetry backends, nor their ELBs. This is puzzling and I haven't been able to determine the root cause. The errors amount to about 0.2-0.5/s, approximately 0.1% of requests (implying total traffic on the order of 200-500 rps). The client should resend the data when this happens, so we shouldn't be losing data, but we should still figure out what is causing it and fix it.

I enabled ELB logging on the tee ELB and pulled down nginx logs from the tee servers for the same period to compare. The only thing I've determined is that some requests that return a 504 from the ELB are "seen" by the tee nginx/openresty process, which logs a 408, while others don't show up in the nginx logs at all.
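For reference, here is a sketch (Python) of the kind of cross-check this involves. It's illustrative, not the actual tooling: it assumes the classic ELB access-log format and nginx's combined log format, and the file names and the 2s match window are made up.

#!/usr/bin/env python
# Cross-reference ELB 5XX entries with nginx logs from the tee hosts.
from datetime import datetime, timedelta
import shlex

WINDOW = timedelta(seconds=2)

def parse_elb(line):
    # Classic ELB format: timestamp elb client:port backend:port req_t
    # backend_t resp_t elb_status backend_status received sent "request" ...
    f = shlex.split(line)
    ts = datetime.strptime(f[0][:26], "%Y-%m-%dT%H:%M:%S.%f")
    return ts, f[7], f[11]

def parse_nginx(line):
    # Combined format: $remote_addr - $remote_user [$time_local] "$request" $status ...
    f = shlex.split(line.replace("[", '"').replace("]", '"'))
    ts = datetime.strptime(f[3], "%d/%b/%Y:%H:%M:%S %z")
    # ELB logs are UTC; this assumes the tee hosts log UTC as well.
    return ts.replace(tzinfo=None), f[4], f[5]

elb_5xx = [parse_elb(l) for l in open("elb_access.log")
           if l.split()[7].startswith("5")]
nginx = [parse_nginx(l) for l in open("nginx_access.log")]

for ts, status, request in elb_5xx:
    hits = [n for n in nginx if n[1] == request and abs(n[0] - ts) <= WINDOW]
    if hits:
        print(ts, "ELB", status, "-> nginx logged", [h[2] for h in hits])
    else:
        print(ts, "ELB", status, "-> not seen by nginx")

Matching on the request line plus a timestamp window is only a heuristic, but it's enough to split the 5XXs into "nginx saw it and logged something (e.g. a 408)" versus "never reached nginx at all".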

This should be reproducible with a load test, which I haven't done yet. I'm working on generating rudimentary load-test input from landfill to better reproduce actual load in stage.
I was able to produce 504s and then 503s under a fairly small load: 160 rps, using a ping that is only 3 KB in size. Is the staging tee identical to the prod tee?

Transactions:		       26270 hits
Availability:		       95.80 %
Response time:		        0.27 secs
Transaction rate:	      160.01 trans/sec
Successful transactions:       26270
Failed transactions:	        1151
Longest transaction:	       29.01
Shortest transaction:	        0.00
Status code definitions:

408 Request Timeout
503 Service Unavailable
504 Gateway Timeout

So they all seem to relate to backend timeouts / failures.
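For anyone trying to reproduce this, here is a minimal Python stand-in for the load test above. The endpoint URL is hypothetical; the ~3 KB payload and 160 rps target mirror the numbers above, and everything else is illustrative.

#!/usr/bin/env python
# Drive a steady request rate at the tee and tally the resulting status codes.
import collections
import concurrent.futures
import time

import requests

URL = "https://stage-tee.example.com/submit/telemetry/test"  # hypothetical
PAYLOAD = b"x" * 3 * 1024   # ~3 KB ping, as in the test above
RATE = 160                  # target requests per second
DURATION = 60               # seconds

def send():
    try:
        return requests.post(URL, data=PAYLOAD, timeout=30).status_code
    except requests.RequestException as e:
        return type(e).__name__

futures = []
with concurrent.futures.ThreadPoolExecutor(max_workers=64) as pool:
    start = time.monotonic()
    sent = 0
    while True:
        elapsed = time.monotonic() - start
        if elapsed >= DURATION:
            break
        # Pace submissions so we average RATE requests per second.
        while sent < elapsed * RATE:
            futures.append(pool.submit(send))
            sent += 1
        time.sleep(0.005)

print(collections.Counter(f.result() for f in futures))

Tallying by status code (or exception name) makes it easy to see whether the 503s/504s scale with offered load.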

relud, are you able to check the config on stage / prod?
Flags: needinfo?(dthornton)
It appears that the stage and prod configurations differ in one way: ELB -> Connection Settings -> Idle Timeout is 60s in stage and 120s in prod. Everything else looks the same.
Flags: needinfo?(dthornton)
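As an aside, this kind of config diff can be pulled programmatically rather than by clicking through the console. A sketch assuming classic ELBs; the load balancer names and region are hypothetical:

#!/usr/bin/env python
# Compare ELB idle timeouts across environments.
import boto3

elb = boto3.client("elb", region_name="us-west-2")  # region is an assumption

for name in ("pipeline-tee-stage", "pipeline-tee-prod"):  # hypothetical names
    attrs = elb.describe_load_balancer_attributes(LoadBalancerName=name)
    idle = attrs["LoadBalancerAttributes"]["ConnectionSettings"]["IdleTimeout"]
    print("%s: idle timeout = %ss" % (name, idle))

Idle timeouts are worth scrutinizing here generally: if the ELB's idle timeout exceeds the backend's keepalive timeout, nginx can close an idle connection the ELB still considers live, and the next request proxied down it fails at the ELB, which is one classic source of intermittent ELB 504s.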
Assignee: nobody → whd
Iteration: --- → 43.1 - Aug 24
Priority: -- → P2
Priority: P2 → P1
Investigated and determined not to be urgent.
Priority: P1 → P2
Whiteboard: [SvcOps]
Priority: P2 → P3
Depending on whether this goes away after bug #1340754 is completed, I will rename this bug or close it.
Summary: Investigate persistent ELB-level 5XX errors from pipeline tee server → Investigate persistent ELB-level 5XX errors from pipeline edge server
Did this go away after bug 1340754 landed?
Flags: needinfo?(whd)
(In reply to Mark Reid [:mreid] from comment #6)
> Did this go away after bug 1340754 landed?

No, hence the bug title change. I estimate the rate to be about 1-2/s now, which amounts to ~100k 5XXs a day (1-2/s × 86,400 s/day ≈ 90k-170k). Filing a support request with AWS to investigate this is in my backlog.
Flags: needinfo?(whd)
Component: Metrics: Pipeline → Pipeline Ingestion
Product: Cloud Services → Data Platform and Tools
Whiteboard: [SvcOps] → [DataOps]

3.5 years! With the move to GCP infra I'm no longer planning to track down the root cause of this in AWS. The GCP infra may end up exhibiting similar behavior, in which case we can file a new bug and track it down there.

Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → WONTFIX
Component: Pipeline Ingestion → General