Many tests failing on treeherder with "ERROR 504: Gateway Time-out." or "ERROR 500: Internal Server Error."





(Reporter: aryx, Unassigned)




(1 attachment)

There are issues with

Connecting to||:443... connected.
HTTP request sent, awaiting response... 504 Gateway Time-out
01:04:51 ERROR 504: Gateway Time-out.

--2015-11-28 00:40:47--
Connecting to||:443... connected.
HTTP request sent, awaiting response... 500 Internal Server Error
2015-11-28 00:40:47 ERROR 500: Internal Server Error.
Raising severity as this is a tree closure issue.
Severity: normal → blocker
Created attachment 8693201 [details]

Definitely an event on hg.m.o - see attached graphs.

Source URLs for copy/paste messages in comment 0 & comment 1 are:

These show the problem occurred from both a spot instance in use1 and hardware in scl3. Health reports are to be attached; both hosts have completed a clean run since the failed run.
Currently having issues attaching files to BMO -- will attach them later.
I have what was requested in bug 1225281 running on CRON on various hg hosts. It's set up to email me if there is a failure. Around 0029 PDT this morning, a number of alerts started spewing due to apparent issues with hgweb1. Those alerts persisted until they cleared around 0830-0836 PDT.
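The actual monitoring script from bug 1225281 isn't shown in this bug. A minimal sketch of such a cron-driven check, assuming hypothetical host URLs and a simple "alert on any non-200" rule (the mail step is elided), might look like:

```python
import urllib.error
import urllib.request
from email.message import EmailMessage

# Hypothetical hosts to probe; the real hosts from bug 1225281 are not shown here.
HOSTS = ["https://hgweb1.example.com/", "https://hgweb2.example.com/"]

def probe(url, timeout=30):
    """Return the HTTP status for url; 0 on a connection-level failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as e:
        return e.code
    except OSError:
        return 0

def failures(statuses):
    """Keep only the (url, status) pairs that are not HTTP 200."""
    return [(url, code) for url, code in statuses if code != 200]

def run_checks():
    results = [(url, probe(url)) for url in HOSTS]
    bad = failures(results)
    if bad:
        msg = EmailMessage()
        msg["Subject"] = "hgweb check failed"
        msg.set_content("\n".join(f"{u}: HTTP {c}" for u, c in bad))
        # smtplib.SMTP("localhost").send_message(msg)  # mail delivery elided
    return bad
```

Run from cron (e.g. every 5 minutes), this would start spewing alerts as soon as hgweb1 began returning 500/504s, which matches the alert window described above.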

I suspect something happened with hgweb1 and it started erroring on HTTP requests. I suspect the 504s are from httpd not connecting to the WSGI process tree. I would think the load balancer would translate HTTP 504s into taking the host out of service automatically. Perhaps it didn't do that, or perhaps hgweb1 was bouncing between up and down in the zlb.
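The zlb's actual health-check configuration isn't documented in this bug. A toy model of the usual rule, where a host is marked down after N consecutive failed checks and back up after M consecutive passes (both thresholds are assumptions), shows how a host can end up "bouncing" when responses alternate between 5xx and 200:

```python
class HostHealth:
    """Toy passive health check: host goes down after `fail_n` consecutive
    failures and comes back up after `pass_n` consecutive successes.
    Thresholds are illustrative, not the zlb's real settings."""

    def __init__(self, fail_n=3, pass_n=2):
        self.fail_n = fail_n
        self.pass_n = pass_n
        self.up = True
        self._streak = 0  # consecutive results counting toward a state flip

    def observe(self, status_code):
        ok = 200 <= status_code < 500  # treat 5xx (e.g. 504) as a failure
        if self.up:
            self._streak = self._streak + 1 if not ok else 0
            if self._streak >= self.fail_n:
                self.up, self._streak = False, 0
        else:
            self._streak = self._streak + 1 if ok else 0
            if self._streak >= self.pass_n:
                self.up, self._streak = True, 0
        return self.up
```

With hysteresis like this, a host intermittently swapping (failing some checks, passing others) never accumulates enough consecutive failures to be taken out of service cleanly, which would explain why hgweb1 kept receiving traffic.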

It's a Saturday and I don't feel like spending more time on this. Perhaps taking hgweb1 out of the zlb and leaving this open for triage on Monday is prudent.
Actually, no. #sysadmins said a bunch of hgweb machines ran out of memory around 0030 PDT and were swapping hard. I guess hgweb1 was hit harder than the others for some reason. Who knows.
Assignee: relops → nobody
Component: RelOps → Mercurial:
Product: Infrastructure & Operations → Developer Services
QA Contact: arich → hwine
I haven't dug into the logs, but later in the day we had a single IP flood with requests. See bug 1228806.

I'm being lazy and assuming the two events are linked. If we ever need to know for certain, the logs are there for later analysis.
Last Resolved: 3 years ago
Resolution: --- → WORKSFORME