Closed Bug 1188018 · Opened 9 years ago · Closed 5 years ago

Investigate persistent ELB-level 5XX errors from pipeline edge server

Categories

(Data Platform and Tools :: General, defect, P3)

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: whd, Assigned: whd)

Details

(Whiteboard: [DataOps])

We have a background 5XX rate from the tee ELB that does not correspond to 5XXs from either the old or new telemetry backends, nor their ELBs. This is puzzling and I haven't been able to determine the root cause. The errors amount to about 0.2-0.5/s, approximately 0.1% of requests (implying total traffic on the order of 200-500 rps). The client should resend the data when this happens, so we shouldn't be losing data, but we should still figure out what is causing it and fix it.

I enabled ELB logging on the tee ELB and pulled down nginx logs from the tee servers for the same period to compare. The only thing I've determined is that some requests that return a 504 from the ELB are "seen" by the tee nginx/openresty process, which logs a 408, while others don't show up in the nginx logs at all.
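For reference, here is a sketch (Python) of the kind of cross-check this involves. It's illustrative, not the actual tooling: it assumes the classic ELB access-log format and nginx's combined log format, and the file names and the 2s match window are made up.

#!/usr/bin/env python
# Cross-reference ELB 5XX entries with nginx logs from the tee hosts.
from datetime import datetime, timedelta
import shlex

WINDOW = timedelta(seconds=2)

def parse_elb(line):
    # Classic ELB format: timestamp elb client:port backend:port req_t
    # backend_t resp_t elb_status backend_status received sent "request" ...
    f = shlex.split(line)
    ts = datetime.strptime(f[0][:26], "%Y-%m-%dT%H:%M:%S.%f")
    return ts, f[7], f[11]

def parse_nginx(line):
    # Combined format: $remote_addr - $remote_user [$time_local] "$request" $status ...
    f = shlex.split(line.replace("[", '"').replace("]", '"'))
    ts = datetime.strptime(f[3], "%d/%b/%Y:%H:%M:%S %z")
    # ELB logs are UTC; this assumes the tee hosts log UTC as well.
    return ts.replace(tzinfo=None), f[4], f[5]

elb_5xx = [parse_elb(l) for l in open("elb_access.log")
           if l.split()[7].startswith("5")]
nginx = [parse_nginx(l) for l in open("nginx_access.log")]

for ts, status, request in elb_5xx:
    hits = [n for n in nginx if n[1] == request and abs(n[0] - ts) <= WINDOW]
    if hits:
        print(ts, "ELB", status, "-> nginx logged", [h[2] for h in hits])
    else:
        print(ts, "ELB", status, "-> not seen by nginx")

Matching on the request line plus a timestamp window is only a heuristic, but it's enough to split the 5XXs into "nginx saw it and logged something (e.g. a 408)" versus "never reached nginx at all".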

This should be reproducible with a load test, which I haven't done yet. I'm working on generating rudimentary load-test input from landfill to better reproduce actual load in stage.
I was able to produce 504s and then 503s under a fairly small load: 160 rps, using a ping that is only 3 KB in size. Is the staging tee identical to the prod tee?

Transactions:		       26270 hits
Availability:		       95.80 %
Response time:		        0.27 secs
Transaction rate:	      160.01 trans/sec
Successful transactions:       26270
Failed transactions:	        1151
Longest transaction:	       29.01
Shortest transaction:	        0.00
Status code definitions:

408 Request Timeout
503 Service Unavailable
504 Gateway Timeout

So they all seem to relate to backend timeouts / failures.
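For anyone trying to reproduce this, here is a minimal Python stand-in for the load test above. The endpoint URL is hypothetical; the ~3 KB payload and 160 rps target mirror the numbers above, and everything else is illustrative.

#!/usr/bin/env python
# Drive a steady request rate at the tee and tally the resulting status codes.
import collections
import concurrent.futures
import time

import requests

URL = "https://stage-tee.example.com/submit/telemetry/test"  # hypothetical
PAYLOAD = b"x" * 3 * 1024   # ~3 KB ping, as in the test above
RATE = 160                  # target requests per second
DURATION = 60               # seconds

def send():
    try:
        return requests.post(URL, data=PAYLOAD, timeout=30).status_code
    except requests.RequestException as e:
        return type(e).__name__

futures = []
with concurrent.futures.ThreadPoolExecutor(max_workers=64) as pool:
    start = time.monotonic()
    sent = 0
    while True:
        elapsed = time.monotonic() - start
        if elapsed >= DURATION:
            break
        # Pace submissions so we average RATE requests per second.
        while sent < elapsed * RATE:
            futures.append(pool.submit(send))
            sent += 1
        time.sleep(0.005)

print(collections.Counter(f.result() for f in futures))

Tallying by status code (or exception name) makes it easy to see whether the 503s/504s scale with offered load.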

relud, are you able to check the config on stage / prod?
Flags: needinfo?(dthornton)
It appears that the stage and prod configurations differ in one way: ELB -> Connection Settings -> Idle Timeout is 60s in stage and 120s in prod. Everything else looks the same.
Flags: needinfo?(dthornton)
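As an aside, this kind of config diff can be pulled programmatically rather than by clicking through the console. A sketch assuming classic ELBs; the load balancer names and region are hypothetical:

#!/usr/bin/env python
# Compare ELB idle timeouts across environments.
import boto3

elb = boto3.client("elb", region_name="us-west-2")  # region is an assumption

for name in ("pipeline-tee-stage", "pipeline-tee-prod"):  # hypothetical names
    attrs = elb.describe_load_balancer_attributes(LoadBalancerName=name)
    idle = attrs["LoadBalancerAttributes"]["ConnectionSettings"]["IdleTimeout"]
    print("%s: idle timeout = %ss" % (name, idle))

Idle timeouts are worth scrutinizing here generally: if the ELB's idle timeout exceeds the backend's keepalive timeout, nginx can close an idle connection the ELB still considers live, and the next request proxied down it fails at the ELB, which is one classic source of intermittent ELB 504s.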
Assignee: nobody → whd
Iteration: --- → 43.1 - Aug 24
Priority: -- → P2
Priority: P2 → P1
Investigated and determined not to be urgent.
Priority: P1 → P2
Whiteboard: [SvcOps]
Priority: P2 → P3
Depending on whether this goes away after bug #1340754 is completed, I will rename this bug or close it.
Summary: Investigate persistent ELB-level 5XX errors from pipeline tee server → Investigate persistent ELB-level 5XX errors from pipeline edge server
Did this go away after bug 1340754 landed?
Flags: needinfo?(whd)
(In reply to Mark Reid [:mreid] from comment #6)
> Did this go away after bug 1340754 landed?

No, hence the bug title change. I estimate the rate to be about 1-2/s now, which amounts to ~100k 5XXs a day (1-2/s × 86,400 s/day ≈ 90k-170k). Filing a support request with AWS to investigate this is in my backlog.
Flags: needinfo?(whd)
Component: Metrics: Pipeline → Pipeline Ingestion
Product: Cloud Services → Data Platform and Tools
Whiteboard: [SvcOps] → [DataOps]

3.5 years! With the move to GCP infra I'm no longer planning to track down the root cause of this in AWS. The GCP infra may end up exhibiting similar behavior, in which case we can file a new bug and track it down there.

Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → WONTFIX
Component: Pipeline Ingestion → General