Many tests failing on treeherder with "ERROR 504: Gateway Time-out." or "ERROR 500: Internal Server Error."

Status: RESOLVED WORKSFORME
Product: Developer Services
Component: Mercurial: hg.mozilla.org
Severity: blocker
Reporter: aryx
Assignee: Unassigned

(1 attachment)

"ERROR 504: Gateway Time-out."
https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=52d0c4ad8be5

"ERROR 500: Internal Server Error."
https://treeherder.mozilla.org/#/jobs?repo=fx-team&revision=24282235336d
There are issues with hg.mozilla.org:

--01:03:50--  https://hg.mozilla.org/build/tools/raw-file/default/buildfarm/utils/archiver_client.py
           => `archiver_client.py'
Resolving hg.mozilla.org... 63.245.215.102, 63.245.215.25
Connecting to hg.mozilla.org|63.245.215.102|:443... connected.
HTTP request sent, awaiting response... 504 Gateway Time-out
01:04:51 ERROR 504: Gateway Time-out.

--2015-11-28 00:40:47--  https://hg.mozilla.org/build/tools/raw-file/default/buildfarm/utils/archiver_client.py
Resolving hg.mozilla.org... 63.245.215.102, 63.245.215.25
Connecting to hg.mozilla.org|63.245.215.102|:443... connected.
HTTP request sent, awaiting response... 500 Internal Server Error
2015-11-28 00:40:47 ERROR 500: Internal Server Error.
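
For reference, the failing step is just an unauthenticated GET of archiver_client.py; a minimal retry-with-backoff sketch of that fetch (a hypothetical helper, not the actual harness code) would look roughly like this:

import time
import urllib.error
import urllib.request

URL = ("https://hg.mozilla.org/build/tools/raw-file/default/"
       "buildfarm/utils/archiver_client.py")

def fetch_with_retries(url, attempts=5, delay=10):
    # Retry transient 5xx responses (the 500/504 seen above) with growing delays.
    for attempt in range(1, attempts + 1):
        try:
            with urllib.request.urlopen(url, timeout=60) as resp:
                return resp.read()
        except urllib.error.HTTPError as e:
            if e.code < 500 or attempt == attempts:
                raise
            time.sleep(delay * attempt)

if __name__ == "__main__":
    with open("archiver_client.py", "wb") as f:
        f.write(fetch_with_retries(URL))
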
Raising severity as this is a tree closure issue.
Severity: normal → blocker
Created attachment 8693201 [details]
20151128-0230PT-hgstats.pdf

Definitely an event on hg.m.o - see attached graphs.

Source URLs for copy/paste messages in comment 0 & comment 1 are:
https://treeherder.mozilla.org/logviewer.html#?job_id=5958977&repo=fx-team#L94
https://treeherder.mozilla.org/logviewer.html#?job_id=17955233&repo=mozilla-inbound#L106

These show the problem occurred on both a spot instance in use1 and hardware in scl3. Health reports to be attached, but both hosts have completed a clean run since the failed run.
Currently having an issue attaching files to bmo -- will attach later.
Saw the issue again for Windows 8 x64 pgo: https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=c33072613b5e
I have what was requested in bug 1225281 running on cron on various hg hosts. It's set up to email me if there is a failure. Around 0029 PDT this morning, a number of alerts started spewing due to apparent issues with hgweb1. Those alerts persisted until they cleared around 0830-0836 PDT.
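
The bug 1225281 script itself isn't attached to this bug; purely as an illustration, a cron-driven check that emails on failure could be as small as the following (the alert address is a placeholder, and the checked URL is just the one from the logs above):

#!/usr/bin/env python3
# Illustrative only -- run from cron, e.g. */5 * * * * /usr/local/bin/check_hgweb.py
import smtplib
import socket
import urllib.error
import urllib.request
from email.mime.text import MIMEText

CHECK_URL = ("https://hg.mozilla.org/build/tools/raw-file/default/"
             "buildfarm/utils/archiver_client.py")
ALERT_TO = "ops@example.com"  # placeholder address

def check(url):
    try:
        with urllib.request.urlopen(url, timeout=30) as resp:
            return resp.status == 200, "HTTP %d" % resp.status
    except (urllib.error.URLError, socket.timeout) as e:
        return False, str(e)

def alert(detail):
    msg = MIMEText("hgweb check failed from %s: %s" % (socket.gethostname(), detail))
    msg["Subject"] = "hgweb health check failure"
    msg["From"] = "cron@" + socket.gethostname()
    msg["To"] = ALERT_TO
    with smtplib.SMTP("localhost") as s:
        s.send_message(msg)

if __name__ == "__main__":
    ok, detail = check(CHECK_URL)
    if not ok:
        alert(detail)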

I suspect something happened with hgweb1 and it started erroring on HTTP requests. I suspect the 504s are from httpd not connecting to the WSGI process tree. I would think the load balancer would translate HTTP 504s into taking the host out of service automatically. Perhaps it didn't do that. Or perhaps hgweb1 was bouncing between up and down in the zlb.

It's a Saturday and I don't feel like spending more time on this. Perhaps taking hgweb1 out of the zlb and leaving this open for triage on Monday is prudent.
Actually, no. #sysadmins said a bunch of hgweb machines ran out of memory around 0030 PDT and were swapping hard. I guess hgweb1 was hit harder than the others for some reason. Who knows.
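
Either way, a quick way to see whether any individual hgweb node is still unhealthy is to probe each backend directly instead of going through the zlb. A rough sketch only: the host list and internal domain below are guesses rather than the real inventory, and TLS/Host-header handling is glossed over.

import urllib.error
import urllib.request

HOSTS = ["hgweb1", "hgweb2", "hgweb3", "hgweb4"]  # hypothetical backend list
PATH = "/build/tools/raw-file/default/buildfarm/utils/archiver_client.py"

for host in HOSTS:
    url = "https://%s.example.internal%s" % (host, PATH)  # placeholder domain
    try:
        with urllib.request.urlopen(url, timeout=30) as resp:
            print("%-8s HTTP %d" % (host, resp.status))
    except urllib.error.HTTPError as e:
        print("%-8s HTTP %d" % (host, e.code))
    except urllib.error.URLError as e:
        print("%-8s error: %s" % (host, e.reason))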

Comment 8

28 automation job failures were associated with this bug yesterday.

Repository breakdown:
* fx-team: 16
* mozilla-inbound: 11
* b2g-inbound: 1

Platform breakdown:
* linux32: 9
* windows8-64: 6
* android-2-3-armv7-api9: 5
* linux64: 3
* osx-10-6: 2
* windowsxp: 1
* osx-10-10: 1
* b2g-emu-ics: 1

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1228725&startday=2015-11-28&endday=2015-11-28&tree=all

Comment 9

29 automation job failures were associated with this bug in the last 7 days.

Repository breakdown:
* fx-team: 17
* mozilla-inbound: 11
* b2g-inbound: 1

Platform breakdown:
* linux32: 9
* windows8-64: 6
* android-2-3-armv7-api9: 5
* linux64: 3
* osx-10-6: 2
* osx-10-10: 2
* windowsxp: 1
* b2g-emu-ics: 1

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1228725&startday=2015-11-23&endday=2015-11-29&tree=all
Assignee: relops → nobody
Component: RelOps → Mercurial: hg.mozilla.org
Product: Infrastructure & Operations → Developer Services
QA Contact: arich → hwine
I haven't dug into the logs, but later in the day we had a single IP flood hg.mo with requests. See bug 1228806.

I'm being lazy and assuming the two events are linked. If we ever need to know for certain, the logs are there for later analysis.
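
If anyone does go back to the logs, finding the flooding client is mostly a matter of counting requests per source IP. A quick sketch, assuming standard combined-format httpd access logs (the log path in the usage comment is a placeholder):

import sys
from collections import Counter

def top_clients(log_path, n=10):
    # The client address is the first whitespace-separated field on each line.
    counts = Counter()
    with open(log_path) as f:
        for line in f:
            counts[line.split(" ", 1)[0]] += 1
    return counts.most_common(n)

if __name__ == "__main__":
    # e.g. python3 top_clients.py /var/log/httpd/hg_access.log
    for ip, hits in top_clients(sys.argv[1]):
        print("%8d  %s" % (hits, ip))
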
Status: NEW → RESOLVED
Last Resolved: 2 years ago
Resolution: --- → WORKSFORME