We have a background 5XX rate from the tee ELB that does not correspond to 5XXs from either the old or new telemetry backends, nor their ELBs. This is puzzling and I haven't been able to determine the root cause. The 5XXs amount to about 0.2-0.5/s, which is approximately 0.1% of requests. The client should resend the data when this happens, so we shouldn't be experiencing data loss, but we should still figure out what is causing it and fix it. I enabled ELB logging on the tee server ELB and pulled down nginx logs from the tee servers for the same period to compare. The only thing I've determined is that some requests that return a 504 from the ELB are "seen" by the tee nginx/openresty process and logged as a 408, but others don't appear to show up at all. This should be reproducible with a load test, which I haven't done yet. I'm working on a rudimentary load test built from landfill output to better reproduce actual load in stage.
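For anyone repeating the log comparison, a minimal sketch of tallying the ELB side might look like the following. It assumes the Classic Load Balancer access log layout, where the ELB status code and the backend status code are the 8th and 9th space-separated fields. The function name and the sample lines are hypothetical and synthetic, not taken from the real logs.

```python
from collections import Counter

# Zero-based field positions in a Classic Load Balancer access log line
# (assumed layout: timestamp, elb, client:port, backend:port, three
# processing times, elb_status_code, backend_status_code, ...).
ELB_STATUS = 7
BACKEND_STATUS = 8


def tally_status_pairs(lines):
    """Count (elb_status_code, backend_status_code) pairs for ELB 5XX responses."""
    pairs = Counter()
    for line in lines:
        # The first nine fields contain no spaces, so a bounded split is
        # enough and avoids having to parse the quoted request/user-agent.
        fields = line.split(" ", BACKEND_STATUS + 1)
        if len(fields) <= BACKEND_STATUS:
            continue  # skip malformed or truncated lines
        elb_status = fields[ELB_STATUS]
        backend_status = fields[BACKEND_STATUS]
        if elb_status.startswith("5"):
            pairs[(elb_status, backend_status)] += 1
    return pairs


# Synthetic sample lines for illustration only.
SAMPLE = [
    '2017-03-01T00:00:00.000000Z tee-elb 192.0.2.1:1234 10.0.0.1:80 0.00004 0.001 0.00002 200 200 3000 0 "POST https://example.invalid:443/submit HTTP/1.1" "ua" - -',
    '2017-03-01T00:00:01.000000Z tee-elb 192.0.2.1:1235 - -1 -1 -1 504 0 0 0 "POST https://example.invalid:443/submit HTTP/1.1" "ua" - -',
    '2017-03-01T00:00:02.000000Z tee-elb 192.0.2.1:1236 - -1 -1 -1 504 0 0 0 "POST https://example.invalid:443/submit HTTP/1.1" "ua" - -',
    '2017-03-01T00:00:03.000000Z tee-elb 192.0.2.1:1237 - -1 -1 -1 503 0 0 0 "POST https://example.invalid:443/submit HTTP/1.1" "ua" - -',
]

if __name__ == "__main__":
    for (elb_status, backend_status), count in sorted(tally_status_pairs(SAMPLE).items()):
        print(elb_status, backend_status, count)
```

Grouping by both codes helps separate requests where the backend answered from those where it apparently never did, which is the distinction that matters here.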
I was able to produce 504s and then 503s under a fairly small load: 160 rps, using a ping that is only 3 KB in size. Is the staging tee identical to the prod tee?
Transactions: 26270 hits
Availability: 95.80 %
Response time: 0.27 secs
Transaction rate: 160.01 trans/sec
Successful transactions: 26270
Failed transactions: 1151
Longest transaction: 29.01
Shortest transaction: 0.00
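As a sanity check on the numbers above, the availability figure appears to be successful transactions divided by successful plus failed. A small sketch of that arithmetic, assuming that is how the load test tool computes it:

```python
# Figures copied from the load test output above.
successful = 26270
failed = 1151

# Assumed definition: share of all attempted transactions that succeeded.
total = successful + failed
availability = 100 * successful / total
failure_rate = 100 * failed / total

print(f"Availability: {availability:.2f} %")
print(f"Failure rate: {failure_rate:.2f} %")
```

Under that assumption roughly 4.2% of requests failed at 160 rps, which is far above the ~0.1% background rate seen in prod.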
Status code definitions:
408 Request Timeout
503 Service Unavailable
504 Gateway Timeout
So they all seem to relate to backend timeouts / failures. relud, are you able to check the config on stage / prod?
it appears that the stage and prod configurations differ in one way: elb -> connection settings -> idle timeout is 60 seconds in stage and 120 seconds in prod. everything else looks the same.
Investigated and determined not to be urgent.
Depending on whether this goes away after bug #1340754 is completed, I will rename this bug or close it.
Did this go away after bug 1340754 landed?
(In reply to Mark Reid [:mreid] from comment #6)
> Did this go away after bug 1340754 landed?
No, hence the bug title change. I estimate the rate to be about 1-2/s now, which amounts to ~100k 5XX a day. I have in my backlog to file a support request with AWS to investigate this.
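For reference, the per-second estimate converts to a daily count as sketched below. The helper name is hypothetical, and this is only the arithmetic, not a measurement.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def errors_per_day(rate_per_second):
    """Convert a steady per-second error rate into a daily count."""
    return rate_per_second * SECONDS_PER_DAY


if __name__ == "__main__":
    # The estimated range from the comment above: 1-2 5XX responses per second.
    for rate in (1, 2):
        print(f"{rate}/s is about {errors_per_day(rate):,} 5XX per day")
```

So ~100k a day corresponds to the lower end of the 1-2/s range; a sustained 2/s would be closer to 170k.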