Closed Bug 588711 Opened 10 years ago Closed 10 years ago

[tracking bug] Frequent build automation failures due to "Gateway Time-out" during network activity related steps

Categories

(mozilla.org Graveyard :: Server Operations, task, blocker)


Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: dholbert, Assigned: aravind)


Details

I've seen at least 10 TryServer builds fail today due to "abort: HTTP Error 504: Gateway Time-out" during hg clone.

The logs all end with something like this:
{
argv: ['/tools/python/bin/hg', 'clone', '--verbose', '--noupdate', '--rev', '583ae843a43488f3bfda25ff4615887cfc8a57f5', u'http://hg.mozilla.org/try', '/builds/slave/tryserver-linux64-debug/build']
 environment:
[...]
 using PTY: False
requesting all changes
abort: HTTP Error 504: Gateway Time-out
elapsedTime=1800.137150
program finished with exit code 255
}
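(For illustration only: the slaves sat in "requesting all changes" until the 1800-second step timeout fired. Since these 504s were transient, one interim mitigation is to wrap the clone step in a retry-with-backoff helper. This is a sketch of the idea, not what the buildbot automation actually does; the helper name and limits are made up.)

```shell
#!/bin/sh
# Hypothetical retry wrapper: re-run a command up to $max times,
# doubling the delay between attempts, since the 504s here were transient.
retry() {
  n=0
  max=3
  delay=1
  until "$@"; do
    n=$((n + 1))
    if [ "$n" -ge "$max" ]; then
      echo "giving up after $max attempts" >&2
      return 1
    fi
    echo "attempt $n failed; retrying in ${delay}s" >&2
    sleep "$delay"
    delay=$((delay * 2))
  done
}

# Example (same shallow clone the slaves run):
# retry hg clone --noupdate --rev 583ae843a43488f3bfda25ff4615887cfc8a57f5 \
#   http://hg.mozilla.org/try build
retry true && echo "ok"
```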

List of failure logs with this problem (all of which just completed recently):
==============================================================================
http://tinderbox.mozilla.org/showlog.cgi?log=MozillaTry/1282203583.1282205659.24030.gz
http://tinderbox.mozilla.org/showlog.cgi?log=MozillaTry/1282203477.1282205509.23531.gz
http://tinderbox.mozilla.org/showlog.cgi?log=MozillaTry/1282203476.1282205520.23582.gz
http://tinderbox.mozilla.org/showlog.cgi?log=MozillaTry/1282203475.1282205314.22453.gz
http://tinderbox.mozilla.org/showlog.cgi?log=MozillaTry/1282203475.1282205501.23520.gz
http://tinderbox.mozilla.org/showlog.cgi?log=MozillaTry/1282203475.1282205496.23498.gz
http://tinderbox.mozilla.org/showlog.cgi?log=MozillaTry/1282203475.1282205295.22395.gz
http://tinderbox.mozilla.org/showlog.cgi?log=MozillaTry/1282203474.1282205288.22338.gz
http://tinderbox.mozilla.org/showlog.cgi?log=MozillaTry/1282203475.1282205314.22454.gz

And here's one that failed with this error earlier today:
http://tinderbox.mozilla.org/showlog.cgi?log=MozillaTry/1282178040.1282180111.11311.gz
OS: Linux → All
Hardware: x86 → All
I initially thought this might be a case of "try server hg repository has too many heads & needs a fresh start", but I don't think that's actually what's going on.

I just tried running the |hg| command from comment 0 locally (at home)...
> hg clone --verbose --noupdate --rev 583ae843a43488f3bfda25ff4615887cfc8a57f5 http://hg.mozilla.org/try
...and it gets past the "requesting all changes" stage and on to "adding changesets" pretty much instantaneously.

So, this looks like a case of network congestion, I guess?
(In reply to comment #1)
> So, this looks like a case of network congestion, I guess?
(I mean: sporadic congestion, affecting the builders)
(In reply to comment #1)
> I initially thought this might be a case of "try server hg repository has too
> many heads & needs a fresh start", but I don't think that's actually what's
> going on.

Definitely not, because it does shallow clones (only the required changesets for the head you want).

> So, this looks like a case of network congestion, I guess?

504 Gateway Time-out means that the proxy made an HTTP request to the HG server, which didn't respond in a timely manner (in this case, ~30min). Generally, it indicates issues on the HG server, not network congestion.
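(The distinction matters for triage. A rough decision table, illustrative only: the mappings below are the standard HTTP status-code meanings, not anything specific to this proxy setup, and the `classify` helper is hypothetical.)

```shell
# Hypothetical triage helper: map the HTTP status seen by the clone step
# to where the fault likely lies. An empty status means the failure happened
# below HTTP (socket reset/timeout), i.e. an actual network problem.
classify() {
  case "$1" in
    504) echo "proxy reached hg, but hg did not answer in time" ;;
    502) echo "proxy got a bad or empty response from hg" ;;
    "")  echo "no HTTP response at all: network trouble on the client side of the proxy" ;;
    *)   echo "other HTTP error: $1" ;;
  esac
}

classify 504
```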

Moving this to Server Ops. Marked critical -- but it's really a blocker once it starts happening again.
Assignee: nobody → server-ops
Severity: normal → critical
Component: Release Engineering → Server Operations
QA Contact: release → mrz
Assignee: server-ops → aravind
Here's a tally of failure times from the last 24h (times listed are start times in PDT of the hg clone):
August 18:
11:45am - 12pm, 5 failures
12:15pm - 12:30, 2 failures
3:15 - 3:30, 5 failures
5 - 5:15, 22 failures
5:15 - 5:30, 11 failures
5:30 - 5:45, 22 failures
5:45 - 6, 7 failures
8:30 - 8:45, 10 failures

August 19th:
12am - 12:15am, 2 failures
12:15am - 12:30, 3 failures
12:30 - 12:45, 11 failures
2:30 - 2:45, 11 failures
3:30 - 3:45, 5 failures
7:30 - 7:45, 12 failures
Duplicate of this bug: 588411
I got the caching layer out of the picture and am sending try requests to the backend server directly.  This should hopefully help with the problem.  Please let me know if you notice any more try failures (since 6:30 AM PST).
Getting varnish out of the picture seems to have helped. It's not a permanent fix; that will be Build & Release reworking the way they do checkouts. Please re-open if you notice this issue again.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
We had some more failures last night, unfortunately:
Tryserver Mercurial failures on 2010/08/24
20:15: 1
21:15: 12
21:30: 14
21:45: 6
23:30: 2

I'm not sure what else we can do besides wait for the real fix in bug 589885.
Ok - reopening per comment 7, then.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
(Even if there's nothing that can be done here at the moment, it's probably best to have an open bug tracking this, so that other people who hit this issue can find it more easily.)
More than 10 builds had problems at 18:34 today, with
  abort: HTTP Error 502: Bad Gateway
Whiteboard: [pending build fix]
Yesterday, 13 Sep 2010, we had about 4 hours of intermittent 502 gateway errors, happening with increasing frequency from 10am EDT until IT restarted the varnish proxy.

It impacted more than 50 builds and triggered many more retries.
Summary: Frequent TryServer failures due to "abort: HTTP Error 504: Gateway Time-out" during hg clone → Frequent build automation failures due to "Gateway Time-out" during network activity related steps
Summary: Frequent build automation failures due to "Gateway Time-out" during network activity related steps → [tracking bug] Frequent build automation failures due to "Gateway Time-out" during network activity related steps
Adding more details: two nightlies and a chunk of the mobile/Maemo builds during the late overnight also had hg 502 failures.
Raising priority, as these hg failures are causing builds and tests to fail in production. This caused delays yesterday and last night, which led to long wait times. It also requires a lot of manual re-queuing of jobs.


Removing whiteboard "[pending build fix]", because while comment #8 is about enhancements to RelEng infrastructure, it is orthogonal to what needs to be fixed here.
Severity: critical → blocker
Whiteboard: [pending build fix] (removed)
from quick chat with aravind:

1) The errors yesterday morning were caused by someone running a spider on hg.m.o.

2) Unknown what caused the errors last night.
Is this still consistently happening?  Is there a reproducible test case I can use?
No occurrences have been reported to me today. The checkin/checkout activity seems much lighter than earlier.

I will close this, with the option to reopen if something spikes - thanks.
Status: REOPENED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Product: mozilla.org → mozilla.org Graveyard