Investigate persistent ELB-level 5XX level errors from pipeline edge server

Status: NEW
Product: Data Platform and Tools
Component: Pipeline Ingestion
Priority: P3 (normal)
Reporter: whd; Assigned to: whd
Whiteboard: [SvcOps]
Reported: 2 years ago; last modified: 5 months ago
Description

2 years ago
We have a background rate of 5XX responses from the tee ELB that does not correspond to 5XXs from either the old or the new telemetry backend, nor from their ELBs. This is puzzling and I haven't been able to determine the root cause. The total traffic amounts to about 0.2-0.5/s, which is approximately 0.1% of requests. The client should resend the data when this happens, so we shouldn't be experiencing data loss, but we should still figure out what is causing it and fix it.

I enabled ELB logging on the tee server ELB and pulled down nginx logs from the tee servers for the same period to compare. The only thing I've determined is that some requests that return a 504 from the ELB are "seen" by the tee nginx/openresty process, which logs a 408, but others don't appear in the nginx logs at all.
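
To make that comparison repeatable, here is a sketch of tallying ELB-side 5XXs by whether the backend ever responded. The field positions follow the Classic ELB access-log format; the function name and the sample handling are illustrative, not taken from this bug.

```python
# Sketch: tally ELB-reported 5XXs, split by whether the backend responded.
# Assumes the Classic ELB access-log format, where the 6th field is
# backend_processing_time (-1 if the backend never answered) and the
# 8th field is the ELB status code.
import shlex

def summarize_5xx(lines):
    """Return (reached, unreached) counts for ELB 5XX responses."""
    reached, unreached = 0, 0
    for line in lines:
        fields = shlex.split(line)        # keeps the quoted request intact
        elb_status = fields[7]            # ELB-side status code
        backend_time = float(fields[5])   # -1 means backend never answered
        if elb_status.startswith("5"):
            if backend_time < 0:
                unreached += 1            # candidate "never seen by nginx" case
            else:
                reached += 1
    return reached, unreached
```

Requests in the `unreached` bucket would be the ones worth grepping the nginx logs for, since those are the 504s with no obvious backend counterpart.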

This should be reproducible with a load test, which I haven't done yet. I'm working on a rudimentary load test based on output from landfill, to better reproduce actual load in stage.

Comment 1

2 years ago
I was able to produce 504s and then 503s under a fairly small load: 160 rps, using a ping that is only 3 KB in size. Is the staging tee identical to the prod tee?

Transactions:              26270 hits
Availability:              95.80 %
Response time:              0.27 secs
Transaction rate:         160.01 trans/sec
Successful transactions:   26270
Failed transactions:        1151
Longest transaction:       29.01
Shortest transaction:       0.00
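
As a sanity check on the siege output above, the reported availability is consistent with successes over total attempts:

```python
# Sanity check on the siege numbers: availability is successful
# transactions over total attempts (hits + failed).
hits, failed = 26270, 1151
availability = 100 * hits / (hits + failed)
print(f"{availability:.2f} %")   # 95.80 %, matching the siege report
```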

Comment 2

2 years ago
Status code definitions:

408 Request Timeout
503 Service Unavailable
504 Gateway Timeout

So they all seem to relate to backend timeouts / failures.

relud, are you able to check the config on stage / prod?
Flags: needinfo?(dthornton)

Comment 3

2 years ago
It appears that the stage and prod configurations differ in one way: ELB -> connection settings -> idle timeout is 60 in stage and 120 in prod. Everything else looks the same.
Flags: needinfo?(dthornton)
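
That idle-timeout difference is worth checking against the backend's keep-alive settings. One well-known cause of ELB 504s with no corresponding backend log line (not confirmed as the cause here) is nginx closing an idle keep-alive connection that the ELB still considers open; AWS recommends keeping the backend's keep-alive timeout above the ELB idle timeout. A sketch with illustrative values only, not the actual tee config:

```nginx
# Illustrative values: keep nginx's keep-alive timeout above the ELB idle
# timeout (60s stage / 120s prod per the comment above) so the ELB, not
# nginx, is the side that closes idle connections.
http {
    keepalive_timeout  150s;    # > 120s prod ELB idle timeout
    keepalive_requests 1000;    # don't recycle connections too aggressively
}
```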

Updated

2 years ago
Assignee: nobody → whd
Iteration: --- → 43.1 - Aug 24
Priority: -- → P2

Updated

2 years ago
Priority: P2 → P1

Comment 4

2 years ago
Investigated and determined not to be urgent.
Priority: P1 → P2

Updated

2 years ago
Whiteboard: [SvcOps]

Updated

2 years ago
Priority: P2 → P3

Comment 5

8 months ago
Depending on whether this goes away after bug #1340754 is completed, I will rename this bug or close it.

Updated

8 months ago
Summary: Investigate persistent ELB-level 5XX level errors from pipeline tee server → Investigate persistent ELB-level 5XX level errors from pipeline edge server

Comment 6

5 months ago
Did this go away after bug 1340754 landed?
Flags: needinfo?(whd)

Comment 7

5 months ago
(In reply to Mark Reid [:mreid] from comment #6)
> Did this go away after bug 1340754 landed?

No, hence the bug title change. I estimate the rate to be about 1-2/s now, which amounts to ~100k 5XX a day. Filing a support request with AWS to investigate this is in my backlog.
Flags: needinfo?(whd)
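
The daily estimate checks out as back-of-envelope arithmetic:

```python
# 1-2 errors/sec sustained over a day (86400 seconds):
low, high = 1 * 86400, 2 * 86400
print(low, high)   # 86400 172800, consistent with "~100k 5XX a day"
```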

Updated

5 months ago
Component: Metrics: Pipeline → Pipeline Ingestion
Product: Cloud Services → Data Platform and Tools