Closed
Bug 1188018
Opened 9 years ago
Closed 5 years ago
Investigate persistent ELB-level 5XX level errors from pipeline edge server
Categories
(Data Platform and Tools :: General, defect, P3)
Tracking
(Not tracked)
RESOLVED
WONTFIX
People
(Reporter: whd, Assigned: whd)
Details
(Whiteboard: [DataOps])
We have a background 5XX rate from the tee ELB that does not correspond to 5XXs from either the old or new telemetry backends, nor their ELBs. This is puzzling and I haven't been able to determine the root cause. These errors amount to about 0.2-0.5/s in total, which is approximately 0.1% of requests. The client should resend the data when this happens, so we shouldn't be experiencing data loss, but we should still figure out what is causing it and fix it.
I enabled ELB logging on the tee server ELB and pulled down nginx logs from the tee servers for the same period to compare. The only thing I've determined is that some requests that return a 504 from the ELB are "seen" by the tee nginx/openresty process, which logs a 408, but others don't appear to show up at all. This should be reproducible with a load test, which I haven't done yet. I'm working on a rudimentary load test based on output from landfill to better reproduce actual load in stage.
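A minimal sketch of the log comparison described above, assuming both the ELB and nginx logs have already been parsed into (time, client IP, status) records. The `Entry` type, the `match_elb_errors` helper, the 60-second matching window, and the sample entries are all hypothetical and made up for illustration; they are not taken from the real logs or tooling.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Entry:
    """One parsed log record (hypothetical shape, not a real log format)."""
    when: datetime
    client_ip: str
    status: int


def match_elb_errors(elb_entries, nginx_entries, window=timedelta(seconds=60)):
    """For each ELB 5XX entry, collect the nginx statuses logged for the
    same client IP within `window` of the ELB timestamp."""
    results = []
    for e in elb_entries:
        if not 500 <= e.status <= 599:
            continue
        statuses = sorted(
            n.status
            for n in nginx_entries
            if n.client_ip == e.client_ip and abs(n.when - e.when) <= window
        )
        results.append((e, statuses))
    return results


# Made-up entries for illustration only (arbitrary date, documentation IPs).
t0 = datetime(2000, 1, 1, 12, 0, 0)
elb = [
    Entry(t0, "203.0.113.10", 504),
    Entry(t0 + timedelta(seconds=5), "203.0.113.11", 504),
    Entry(t0 + timedelta(seconds=6), "203.0.113.12", 200),
]
nginx = [
    Entry(t0 + timedelta(seconds=1), "203.0.113.10", 408),
    Entry(t0 + timedelta(seconds=6), "203.0.113.12", 200),
]

for entry, statuses in match_elb_errors(elb, nginx):
    label = statuses if statuses else "not seen by nginx"
    print(entry.client_ip, entry.status, "->", label)
```

Matching on client IP plus a time window is only one possible join key; if the logs carry a shared request identifier, that would likely be more reliable.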
Comment 1•9 years ago
I was able to produce 504's and then 503's under a fairly small load: 160rps, using a ping that is only 3kb in size. Is the staging tee identical to the prod tee?
Transactions: 26270 hits
Availability: 95.80 %
Response time: 0.27 secs
Transaction rate: 160.01 trans/sec
Successful transactions: 26270
Failed transactions: 1151
Longest transaction: 29.01
Shortest transaction: 0.00
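As a quick check on the numbers above, the reported availability is consistent with successful transactions divided by all attempted transactions. This is a sketch under the assumption that the load-test tool computes availability that way; the function name is illustrative.

```python
def availability_percent(successful, failed):
    """Share of attempted transactions that succeeded, as a percentage.

    Assumes availability = successful / (successful + failed), which is an
    assumption about the load-test tool, not something stated in its output.
    """
    return 100 * successful / (successful + failed)


successful = 26270  # "Successful transactions" from the summary above
failed = 1151       # "Failed transactions" from the summary above

print(f"availability: {availability_percent(successful, failed):.2f} %")
print(f"failure rate: {100 - availability_percent(successful, failed):.2f} %")
```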
Comment 2•9 years ago
Status code definitions:
408 Request Timeout
503 Service Unavailable
504 Gateway Timeout
So they all seem to relate to backend timeouts / failures. relud, are you able to check the config on stage / prod?
Flags: needinfo?(dthornton)
Comment 3•9 years ago
it appears that the stage and prod configurations differ in one way: elb -> connection settings -> idle timeout is 60 in stage and 120 in prod. everything else looks the same.
Flags: needinfo?(dthornton)
Updated•9 years ago
Assignee: nobody → whd
Iteration: --- → 43.1 - Aug 24
Priority: -- → P2
Updated•9 years ago
Priority: P2 → P1
Updated•8 years ago
Whiteboard: [SvcOps]
Updated•8 years ago
Priority: P2 → P3
Comment 5•7 years ago
Depending on whether this goes away after bug #1340754 is completed, I will rename this bug or close it.
Updated•7 years ago
Summary: Investigate persistent ELB-level 5XX level errors from pipeline tee server → Investigate persistent ELB-level 5XX level errors from pipeline edge server
Comment 7•7 years ago
(In reply to Mark Reid [:mreid] from comment #6)
> Did this go away after bug 1340754 landed?
No, hence the bug title change. I estimate the rate to be about 1-2/s now, which amounts to ~100k 5XX a day. I have in my backlog to file a support request with AWS to investigate this.
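For reference, the per-second estimate converts to a daily count as sketched below; the ~100k/day figure corresponds roughly to the lower end of the 1-2/s range. The helper name is illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400 seconds


def errors_per_day(rate_per_second):
    """Convert a steady per-second error rate into a daily count."""
    return rate_per_second * SECONDS_PER_DAY


for rate in (1, 2):
    print(f"{rate}/s -> {errors_per_day(rate):,} per day")
```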
Flags: needinfo?(whd)
Updated•7 years ago
Component: Metrics: Pipeline → Pipeline Ingestion
Product: Cloud Services → Data Platform and Tools
Updated•6 years ago
Whiteboard: [SvcOps] → [DataOps]
Comment 8•5 years ago
3.5 years! With the move to GCP infra I'm not planning on finally tracking down the root cause of this in AWS. The GCP infra may end up having similar behavior, for which we can file a new bug and track it down.
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → WONTFIX
Updated•2 years ago
Component: Pipeline Ingestion → General