Closed Bug 588711 Opened 10 years ago Closed 10 years ago

[tracking bug] Frequent build automation failures due to "Gateway Time-out" during network activity related steps

Categories

(mozilla.org Graveyard :: Server Operations, task, blocker)


Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: dholbert, Assigned: aravind)


Details

I've seen at least 10 TryServer builds fail today due to "abort: HTTP Error 504: Gateway Time-out" during hg clone.

The logs all end with something like this:
{
argv: ['/tools/python/bin/hg', 'clone', '--verbose', '--noupdate', '--rev', '583ae843a43488f3bfda25ff4615887cfc8a57f5', u'http://hg.mozilla.org/try', '/builds/slave/tryserver-linux64-debug/build']
 environment:
[...]
 using PTY: False
requesting all changes
abort: HTTP Error 504: Gateway Time-out
elapsedTime=1800.137150
program finished with exit code 255
}
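(For illustration only: the slaves sat in "requesting all changes" until the 1800-second step timeout fired. Since these 504s were transient, one interim mitigation is to wrap the clone step in a retry-with-backoff helper. This is a sketch of the idea, not what the buildbot automation actually does; the helper name and limits are made up.)

```shell
#!/bin/sh
# Hypothetical retry wrapper: re-run a command up to $max times,
# doubling the delay between attempts, since the 504s here were transient.
retry() {
  n=0
  max=3
  delay=1
  until "$@"; do
    n=$((n + 1))
    if [ "$n" -ge "$max" ]; then
      echo "giving up after $max attempts" >&2
      return 1
    fi
    echo "attempt $n failed; retrying in ${delay}s" >&2
    sleep "$delay"
    delay=$((delay * 2))
  done
}

# Example (same shallow clone the slaves run):
# retry hg clone --noupdate --rev 583ae843a43488f3bfda25ff4615887cfc8a57f5 \
#   http://hg.mozilla.org/try build
retry true && echo "ok"
```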

List of failure logs with this problem (all of which just completed recently):
==============================================================================
http://tinderbox.mozilla.org/showlog.cgi?log=MozillaTry/1282203583.1282205659.24030.gz
http://tinderbox.mozilla.org/showlog.cgi?log=MozillaTry/1282203477.1282205509.23531.gz
http://tinderbox.mozilla.org/showlog.cgi?log=MozillaTry/1282203476.1282205520.23582.gz
http://tinderbox.mozilla.org/showlog.cgi?log=MozillaTry/1282203475.1282205314.22453.gz
http://tinderbox.mozilla.org/showlog.cgi?log=MozillaTry/1282203475.1282205501.23520.gz
http://tinderbox.mozilla.org/showlog.cgi?log=MozillaTry/1282203475.1282205496.23498.gz
http://tinderbox.mozilla.org/showlog.cgi?log=MozillaTry/1282203475.1282205295.22395.gz
http://tinderbox.mozilla.org/showlog.cgi?log=MozillaTry/1282203474.1282205288.22338.gz
http://tinderbox.mozilla.org/showlog.cgi?log=MozillaTry/1282203475.1282205314.22454.gz

And here's one that failed with this error earlier today:
http://tinderbox.mozilla.org/showlog.cgi?log=MozillaTry/1282178040.1282180111.11311.gz
OS: Linux → All
Hardware: x86 → All
I initially thought this might be a case of "try server hg repository has too many heads & needs a fresh start", but I don't think that's actually what's going on.

I just tried running the |hg| command from comment 0 locally (at home)...
> hg clone --verbose --noupdate --rev 583ae843a43488f3bfda25ff4615887cfc8a57f5 http://hg.mozilla.org/try
...and it gets past the "requesting all changes" stage and on to "adding changesets" pretty much instantaneously.

So, this looks like a case of network congestion, I guess?
(In reply to comment #1)
> So, this looks like a case of network congestion, I guess?
(I mean: sporadic congestion, affecting the builders)
(In reply to comment #1)
> I initially thought this might be a case of "try server hg repository has too
> many heads & needs a fresh start", but I don't think that's actually what's
> going on.

Definitely not, because it does shallow clones (only the required changesets for the head you want).

> So, this looks like a case of network congestion, I guess?

504 Gateway Time-out means that the proxy made an HTTP request to the HG server, which didn't respond in a timely manner (in this case, ~30min). Generally, it indicates issues on the HG server, not network congestion.
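(The distinction matters for triage. A rough decision table, illustrative only: the mappings below are the standard HTTP status-code meanings, not anything specific to this proxy setup, and the `classify` helper is hypothetical.)

```shell
# Hypothetical triage helper: map the HTTP status seen by the clone step
# to where the fault likely lies. An empty status means the failure happened
# below HTTP (socket reset/timeout), i.e. an actual network problem.
classify() {
  case "$1" in
    504) echo "proxy reached hg, but hg did not answer in time" ;;
    502) echo "proxy got a bad or empty response from hg" ;;
    "")  echo "no HTTP response at all: network trouble on the client side of the proxy" ;;
    *)   echo "other HTTP error: $1" ;;
  esac
}

classify 504
```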

Moving this to Server Ops. Marked critical -- but it's really a blocker once it starts happening again.
Assignee: nobody → server-ops
Severity: normal → critical
Component: Release Engineering → Server Operations
QA Contact: release → mrz
Assignee: server-ops → aravind
Here's a tally of failure times from the last 24h (times listed are start times in PDT of the hg clone):
August 18:
11:45am - 12pm, 5 failures
12:15pm - 12:30, 2 failures
3:15 - 3:30, 5 failures
5 - 5:15, 22 failures
5:15 - 5:30, 11 failures
5:30 - 5:45, 22 failures
5:45 - 6, 7 failures
8:30 - 8:45, 10 failures

August 19th:
12am - 12:15am, 2 failures
12:15am - 12:30, 3 failures
12:30 - 12:45, 11 failures
2:30 - 2:45, 11 failures
3:30 - 3:45, 5 failures
7:30 - 7:45, 12 failures
Duplicate of this bug: 588411
I got the caching layer out of the picture and am sending try requests to the backend server directly.  This should hopefully help with the problem.  Please let me know if you notice any more try failures (since 6:30 AM PST).
Getting varnish out of the picture seems to have helped. It's not a permanent fix; that will be Build & Release reworking the way they do checkouts. Please re-open if you notice this issue again.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
We had some more failures last night, unfortunately:
Tryserver Mercurial failures on 2010/08/24
20:15: 1
21:15: 12
21:30: 14
21:45: 6
23:30: 2

I'm not sure what else we can do besides wait for the real fix in bug 589885.
Ok - reopening per comment 7, then.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
(Even if there's nothing that can be done here at the moment, it's probably best to have an open bug tracking this, so that other people who hit this issue can find it more easily.)
More than 10 builds had problems at 18:34 today, with
  abort: HTTP Error 502: Bad Gateway
Whiteboard: [pending build fix]
Yesterday, 13 Sep 2010, we had about 4 hours of intermittent 502 gateway errors, happening with increasing frequency from 10am EDT until IT restarted the varnish proxy.

It impacted more than 50 builds and triggered many more retries.
Summary: Frequent TryServer failures due to "abort: HTTP Error 504: Gateway Time-out" during hg clone → Frequent build automation failures due to "Gateway Time-out" during network activity related steps
Summary: Frequent build automation failures due to "Gateway Time-out" during network activity related steps → [tracking bug] Frequent build automation failures due to "Gateway Time-out" during network activity related steps
Adding more details: two nightlies and a chunk of the mobile/Maemo builds during the late overnight also had hg 502 failures.
Raising priority, as these hg failures are causing builds and tests to fail in production. This caused delays yesterday and last night, which led to long wait times. It also requires a lot of manual re-queuing of jobs.


Removing whiteboard "[pending build fix]", because while comment #8 is about enhancements to RelEng infrastructure, it is orthogonal to what needs to be fixed here.
Severity: critical → blocker
Whiteboard: [pending build fix] (removed)
from quick chat with aravind:

1) The errors yesterday morning were caused by someone running a spider on hg.m.o.

2) Unknown what caused the errors last night.
Is this still consistently happening?  Is there a reproducible test case I can use?
No occurrences have been reported to me today. The checkin/checkout activity seems much lighter than earlier.

I will close this, with the option to reopen if something spikes - thanks.
Status: REOPENED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Product: mozilla.org → mozilla.org Graveyard