Closed Bug 693202 Opened 13 years ago Closed 12 years ago

Some full clones of source repos timing out

Categories

(mozilla.org Graveyard :: Server Operations, task, P3)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: philor, Assigned: cshields)

References

Details

(Whiteboard: [buildduty])

The first time I remember seeing it was on the first push to Electrolysis in more than a month, so I blew it off. Then I saw it here and there on days-old jobs on rarely used trees where I rarely look, and blew it off as maybe a time when hg.m.o was intermittently broken. But, really, there are just too many of these, particularly considering that they mostly happen on trees where nobody seems to know that they can and should file a bug about it, and that they can and should retrigger the job.

https://tbpl.mozilla.org/php/getParsedLog.php?id=6729456&tree=UX
https://tbpl.mozilla.org/php/getParsedLog.php?id=6729485&tree=UX
https://tbpl.mozilla.org/php/getParsedLog.php?id=6710130&tree=Services-Central
https://tbpl.mozilla.org/php/getParsedLog.php?id=6709676&tree=Services-Central
https://tbpl.mozilla.org/php/getParsedLog.php?id=6710131&tree=Services-Central
I think these are slaves which don't have a copy of the repo in their hg share dir; they attempt a full clone, and buildbot interrupts it after an hour, i.e. the hg server is responding too slowly.
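For reference, the clobber path here looks roughly like this (a paraphrase of what hgtool does, not its actual code; the repo and revision are taken from the UX logs in this bug, and the build directory path is made up):

# 1) No copy of the repo under the share dir, so do a full clone into it;
#    this is the step that runs into buildbot's one-hour no-output timeout:
hg clone -U -r 70e30fe7d2ab198263e535ecb6810149d5b1ecd8 http://hg.mozilla.org/projects/ux /builds/hg-shared/projects/ux
# 2) Share that store into the working directory and update to the revision
#    (assumes the share extension is enabled):
hg share /builds/hg-shared/projects/ux /builds/slave/ux-build/build
hg -R /builds/slave/ux-build/build update -r 70e30fe7d2ab198263e535ecb6810149d5b1ecd8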

(In reply to Phil Ringnalda (:philor) from comment #0)
> https://tbpl.mozilla.org/php/getParsedLog.php?id=6729456&tree=UX

A clone started at Fri Oct 7 09:46:08 2011:
 hg clone -U -r 70e30fe7d2ab198263e535ecb6810149d5b1ecd8 http://hg.mozilla.org/projects/ux /builds/hg-shared/projects/ux
It hadn't finished an hour later, so buildbot killed the job. The same slave (linux-ix-slave18) cloned that repo in 1657 seconds on Oct 9 at 13:44.

> https://tbpl.mozilla.org/php/getParsedLog.php?id=6729485&tree=UX

Started cloning ux at Fri Oct 7 09:49:27 2011 (so very close to the one above), buildbot timed it out an hour later when it wasn't done. (w64-ix-slave12)

> https://tbpl.mozilla.org/php/getParsedLog.php?id=6710130&tree=Services-Central
> https://tbpl.mozilla.org/php/getParsedLog.php?id=6709676&tree=Services-Central
> https://tbpl.mozilla.org/php/getParsedLog.php?id=6710131&tree=Services-Central

Three failures on the same revision (6270ac0e7f96):
* linux64-ix-slave21 - started cloning Thu Oct 6 12:42:09 2011, buildbot killed it after an hour
* mw32-ix-slave02 - started cloning Thu Oct 6 12:43:50 2011, got 'HTTP Error 504: Gateway Time-out' after 30 minutes
* w64-ix-slave06 - started cloning Thu Oct 6 12:42:24 2011, buildbot killed it after an hour

I don't see any time-correlated alerts for dm-vcview* from nagios, but there were a couple of load spikes on dm-vcview04 on the 7th. Four of the five slaves are in the SCL1 colo, while mw32-ix-slave02 is in the mtv office, so it might have been network congestion (we know SCL1 <---> SJC1 is busy, and hg.m.o is in SJC1).

Since it's now after the fact, I suggest we keep this open for a few days. If it recurs, it should go over to Server Ops ASAP for diagnosis.
Priority: -- → P3
Summary: hgtool driven clobbers frequently time out → Some full clones of source repos timing out
Whiteboard: [buildduty]
https://tbpl.mozilla.org/php/getParsedLog.php?id=6758612&tree=Build-System

Error pulling changes into e:\builds\moz2_slave\bld-system-w64\build from http://hg.mozilla.org/projects/build-system; clobbering
command: START
command: hg clone -r 6e945f23126536783fce6ef618162f48b1491ea9 http://hg.mozilla.org/projects/build-system e:\\\\builds\\\\moz2_slave\\\\bld-system-w64\\\\build
command: cwd: e:\builds\moz2_slave\bld-system-w64
command: output:
requesting all changes
abort: HTTP Error 504: Gateway Time-out
command: END (1800.09s elapsed)

Traceback (most recent call last):
  File "e:/builds/moz2_slave/bld-system-w64/tools/buildfarm/utils/hgtool.py", line 75, in <module>
    shareBase=options.shared_dir)
  File "e:/builds/moz2_slave/bld-system-w64/tools/buildfarm/utils\../../lib/python\util\hg.py", line 316, in mercurial
    return clone(repo, dest, branch, revision, update_dest=update_dest)
  File "e:/builds/moz2_slave/bld-system-w64/tools/buildfarm/utils\../../lib/python\util\hg.py", line 125, in clone
    run_cmd(cmd)
  File "e:/builds/moz2_slave/bld-system-w64/tools/buildfarm/utils\../../lib/python\util\commands.py", line 42, in run_cmd
    return subprocess.check_call(cmd, **kwargs)
  File "c:\mozilla-build\python\lib\subprocess.py", line 498, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['hg', 'clone', '-r', u'6e945f23126536783fce6ef618162f48b1491ea9', 'http://hg.mozilla.org/projects/build-system', 'e:\\builds\\moz2_slave\\bld-system-w64\\build']' returned non-zero exit status -1
program finished with exit code 1
elapsedTime=1800.399000
(In reply to Phil Ringnalda (:philor) from comment #2)
> https://tbpl.mozilla.org/php/getParsedLog.php?id=6758612&tree=Build-System

Started at Sun Oct 9 15:29:28 2011, got the 504 30 minutes later.
Over to Server Ops. This causes delays to developers - more than 1 hour to get a failure, then they have to retry the build and hope it works.
Assignee: nobody → server-ops
Severity: normal → critical
Component: Release Engineering → Server Operations
QA Contact: release → cshields
cc'ing Amy from RelOps to keep them in the loop
https://tbpl.mozilla.org/php/getParsedLog.php?id=6758722&tree=Build-System
w32-ix-slave41
2011-10-09 15:26:56 PDT
Error pulling changes into e:\builds\moz2_slave\bld-system-w32\build from http://hg.mozilla.org/projects/build-system; clobbering
command: START
command: hg clone -r 6e945f23126536783fce6ef618162f48b1491ea9 http://hg.mozilla.org/projects/build-system e:\\\\builds\\\\moz2_slave\\\\bld-system-w32\\\\build
command: cwd: e:\builds\moz2_slave\bld-system-w32
command: output:

command timed out: 3600 seconds without output, attempting to kill
There's an 'Hg' cluster in ganglia, but I don't know which machine I should be looking at these days for non-try requests. There is a load spike on dm-vcview04 at the same time as the failure in comment #2, but I thought that was try-only (at least when aravind was maintaining things).
Assignee: server-ops → shyam
Okay, so it seems like the servers aren't returning anything:

send(3, "GET /projects/build-system?cmd=c"..., 272, 0) = 272
recv(3,  <unfinished ...>

real	4m25.905s
user	0m0.242s
sys	0m2.914s

Just sat there at the recv till I killed it. This happens both externally (outside the DC, on 63.245.208.189) and internally (on 10.2.74.67). Poking further.
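A rough reconstruction of that check, for anyone who wants to reproduce it (the exact invocation isn't pasted here, so treat this as an approximation; the repo and revision are the ones from comment #2):

# Trace only the network syscalls of a clone attempt and see how long it sits
# in recv(); hg.mozilla.org can be pinned to a specific frontend IP via
# /etc/hosts to compare the external and internal addresses mentioned above.
time strace -f -e trace=network hg clone -U -r 6e945f23126536783fce6ef618162f48b1491ea9 http://hg.mozilla.org/projects/build-system /tmp/bs-strace-test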
Tried this again (and this is just for information, I'm still debugging):

[shyam@boris tmphg]$ time hg clone -r 6e945f23126536783fce6ef618162f48b1491ea9 http://hg.mozilla.org/projects/build-system .
adding changesets
adding manifests
adding file changes
added 78108 changesets with 371460 changes to 80319 files
updating to branch default
52072 files updated, 0 files merged, 0 files removed, 0 files unresolved

real	10m46.862s
user	9m1.742s
sys	0m38.605s
A test I ran on my home laptop (just adding to the info "noise"):

[] moz:temp bear$ time hg clone -r 6e945f23126536783fce6ef618162f48b1491ea9 http://hg.mozilla.org/projects/build-system 
destination directory: build-system
requesting all changes
adding changesets
adding manifests
adding file changes
added 78108 changesets with 371460 changes to 80319 files
updating to branch default
52072 files updated, 0 files merged, 0 files removed, 0 files unresolved

real	12m22.126s
user	2m39.633s
sys	0m42.694s

[] moz:temp bear$ du -h -s build-system
1.2G	build-system
More timeouts (IRC timestamps are EST):

[01:33]  <fox2mike> bear-afk: are we still seeing time-outs? long clones? 
[01:33]  <bear-afk> let me look
[01:38]  <bear-afk> fox2mike: https://tbpl.mozilla.org/php/getParsedLog.php?id=6612293&tree=Mozilla-Release&full=1
[01:38]  <bear-afk> https://tbpl.mozilla.org/php/getParsedLog.php?id=6612271&tree=Mozilla-Release
[01:39]  <fox2mike> so it's still happening
[01:39]  <bear-afk> yes, 2 in the latest set of pushes
[01:40]  <bear-afk> and only for the m-r hg repo
[01:40]  <bear-afk> smaller tool clones worked
[01:42]  <fox2mike> I'm thinking it's varnish, might need to flush caches. discussing with a few folks
[01:42]  <bear-afk> k
[01:42]  <bear-afk> another from build-system: https://tbpl.mozilla.org/php/getParsedLog.php?id=6758722&tree=Build-System
Okay, so we had a little chat and think that the varnish cache was the culprit. 

That has now been flushed. I'm keeping my fingers crossed and if we were right, we shouldn't see any more of these issues. Initial clones might be slower (but shouldn't fail) and it should all catch up after that. 

Please update the bug with any failures henceforth. That means we were wrong and it's failing elsewhere too :) and more poking is to be done.
https://tbpl.mozilla.org/php/getParsedLog.php?id=6767229&tree=Mozilla-Beta
linux-ix-slave17
2011-10-10 08:20:00 PDT
command timed out: 3600 seconds without output, attempting to kill
More poking is occurring.
Assignee: shyam → bkero
Attempting to upgrade varnish from 2.1.2 (SVN) to 2.1.5.  This will also have the consequence of restarting the daemon, which has been known to temporarily solve slowness/timeout issues.
Varnish has been upgraded and restarted.  I'm noting this here so build can have a reference point if they notice issues starting around this time.
philor: have the builds gone any smoother with the upgrade/restart?
There was one failure to clone that ran between 11:42 and 12:42 (w64-ix-slave11.build.m.o doing a jaegermonkey win64 build). We probably need to wait a day or so before we can call this fixed, given it took some hours to go bad after comment #12.
We had three on DevTools at 18:09 to 19:09 yesterday (the 11th), eg
https://tbpl.mozilla.org/php/getParsedLog.php?id=6798662&tree=Devtools
Thanks for the continued starrings and updates.  This is still being investigated, but it's hard to nail down since it's sporadic.
I've tracked this down to an issue with how varnish caches objects.  A write-up of the feature I plan to implement to fix this has been posted here: https://www.varnish-software.com/blog/streaming-varnish-30

It essentially boils down to this:

hg clones consist of 3 main HTTP calls (sketched below):
1) a GET request to hgweb to determine permissions and capabilities (47 bytes)
2) a GET request to resolve the requested commit hash to an internal hash (13 bytes)
3) a GET request for the data stream from commit 0 to the desired commit (roughly 700 megabytes)
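For illustration, the three calls look roughly like this on the wire (reconstructed from the Mercurial wire protocol and the nginx logs later in this bug, not captured output; the revision is the build-system one used earlier):

# 1) capabilities/permission check (tiny response)
curl -s 'http://hg.mozilla.org/projects/build-system?cmd=capabilities'
# 2) resolve the requested revision to a full node (tiny response)
curl -s 'http://hg.mozilla.org/projects/build-system?cmd=lookup&key=6e945f23126536783fce6ef618162f48b1491ea9'
# 3) the changegroup stream from rev 0 up to that head (~700 MB); this is the
#    request varnish currently buffers in full before sending the first byte
curl -s -o /dev/null 'http://hg.mozilla.org/projects/build-system?cmd=changegroupsubset&bases=0000000000000000000000000000000000000000&heads=6e945f23126536783fce6ef618162f48b1491ea9'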

The way that varnish handles request #3 is to retrieve the entire thing from the backend before serving anything to the client.

Starting in Varnish version 3.0, a do_stream function allows objects to be cached and sent to clients at the same time.  I've constructed a virtual machine running Varnish 3.0 which interfaces to the same backend servers the real varnish server does.  The results are as follows:

Old method (no streaming, uncached): 33 minutes, 26 seconds
New method (streaming, uncached): 22 minutes, 19 seconds
New method (streaming, cached): 11 minutes, 50 seconds

The plan is to deploy this on Tuesday, 2011-10-18.  These are the steps required; downtime should be under 1 minute, although plan for 20 minutes in case (heh) of complications:

1) Add varnish-release-3.0.0 RPM from varnish-software.com to the mozilla RPM repository
2) ssh dm-vcview04 'yum remove varnish varnish-libs'
3) yum install varnish-release
4) Add the following lines to /etc/varnish/hgweb.vcl file in the vcl_fetch() function:
                if (req.url ~ "cmd=changegroupsubset") {
                    set beresp.do_stream = true;
                }
5) ssh dm-vcview04 service varnish restart

Failback option is to:
1) Remove the lines from /etc/varnish/hgweb.vcl
2) yum remove varnish-release
3) yum install varnish
4) service varnish restart
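A quick sanity check after the deploy (not part of the plan above, just a suggestion; the /tmp path is an example):

# Confirm the daemon is actually running 3.0.x after the restart:
ssh dm-vcview04 varnishd -V
# Re-time an uncached full clone and compare against the streaming numbers above:
time hg clone -U -r 6e945f23126536783fce6ef618162f48b1491ea9 http://hg.mozilla.org/projects/build-system /tmp/bs-streaming-test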
Ben, thanks for doing the debugging and replicating to figure this out.

The planned interruption here is similar to the varnish restarts we've done a few times earlier in this bug.  As such, I don't think this requires a pre-planned tree closure.  At worst, a few jobs may need to be re-triggered due to failing clones during the restart - similar to the problem this is intended to solve.

I'd like to get an ack from buildduty on Monday that this is OK.  If not, please set the needs-treeclosure? flag and we'll follow the usual treeclosure process here.
I've talked with catlee and we've scheduled this for 7AM PST on Tuesday.
I deployed the upgrade this morning during a window from 07:10 to 07:13.  The upgrade was successful, and I am currently doing a mock checkout to compare checkout times against the old non-streaming version.

My uncached checkout took 22 minutes 41 seconds

Could I get someone from build to comment if clones are quicker and more reliable?  Then we can close this bug out.
I'll keep an eye on the tree today looking for clone timeouts, but we'll need several days of continued good performance before I could verify this as fixed.
Depends on: 695467
This is affecting the FF8 beta4 release.
I've closed the trees to slow down the HG load and hopefully get the release to go through.
Blocks: 695429
Severity: critical → blocker
(In reply to Armen Zambrano G. [:armenzg] - Release Engineer from comment #31)
> This is affecting the FF8 beta4 release.
> I've closed the trees to slow down the HG load and hopefully get the release
> to go through.

FTR this also affected Thunderbird nightly builds as well as our release builds.
As mentioned on irc, ops has been looking at this this morning since before Armen's report.  It's unclear whether or not this has anything to do with the varnish upgrade that was done earlier this week.
Could this varnish issue be affecting developer.mozilla.org? It has become highly unresponsive since the hg thing cropped up.
I don't know of any relation or dependency of code between either of them, but maybe someone else cc'd on this bug does.
I believe the load could have been caused by a configuration change I made to the varnish caching daemon yesterday, after some developers reported receiving cached data and old tips during some clones.  I added the lines:

if (req.url ~ "cmd=changegroupsubset") {
    return (pass);
}

Upon commenting that section out today, the load on the backend servers has dropped back to normal levels.

Some other changes yesterday might have also fixed the old tip problem.
Severity: blocker → major
Could build please confirm that clones are working reliably again?
Some release jobs that started 45 mins ago have failed, but it is also the first time that 2-3 release jobs have passed the hg clone/update step.

Perhaps in 30 mins we should be past the hump for the releases.
We could be opening the trees anytime soon.
I was celebrating too fast.

I am getting this on the win32 and win32 xulrunner *release* jobs even after having clobbered them.

'd:\\mozilla-build\\hg\\hg.exe' 'clone' '--verbose' '--noupdate' u'http://hg.mozilla.org/releases/mozilla-beta' 'build'
#########
requesting all changes
adding changesets
adding manifests
adding file changes
transaction abort!
rollback completed
** unknown exception encountered, please report by visiting
**  http://mercurial.selenic.com/wiki/BugTracker
** Python 2.6.6 (r266:84297, Aug 24 2010, 18:46:32) [MSC v.1500 32 bit (Intel)]
** Mercurial Distributed SCM (version 1.7.5)
** Extensions loaded: win32text, graphlog, share, purge
Traceback (most recent call last):
  File "hg", line 38, in <module>
  File "mercurial\dispatch.pyc", line 16, in run
  File "mercurial\dispatch.pyc", line 36, in dispatch
  File "mercurial\dispatch.pyc", line 58, in _runcatch
  File "mercurial\dispatch.pyc", line 593, in _dispatch
  File "mercurial\dispatch.pyc", line 401, in runcommand
  File "mercurial\dispatch.pyc", line 644, in _runcommand
  File "mercurial\dispatch.pyc", line 598, in checkargs
  File "mercurial\dispatch.pyc", line 591, in <lambda>
  File "mercurial\util.pyc", line 426, in check
  File "mercurial\commands.pyc", line 736, in clone
  File "mercurial\hg.pyc", line 337, in clone
  File "mercurial\localrepo.pyc", line 1886, in clone
  File "mercurial\localrepo.pyc", line 1295, in pull
  File "mercurial\localrepo.pyc", line 1739, in addchangegroup
  File "mercurial\revlog.pyc", line 1381, in addgroup
  File "mercurial\revlog.pyc", line 1220, in _addrevision
mpatch.mpatchError: patch cannot be decoded
For the release we have now made it through the hg steps. Let's see how the L10n repacks behave once they get triggered.
Due to complications with the upgrade, we have completely backed out of the varnish upgrade and config changes and are now back to the same state we were in on Monday (i.e. there will still be intermittent timeouts under load).  The sequence of events:

1) Upgrade varnish to fix the timeouts, and fix scripts to use the new admin interface.
2) Discover that some (all?) 32 bit machines are throwing errors when trying to clone after the upgrade.
3) Make configuration changes to varnish to fix the errors.
4) Discover that the configuration changes are severely overloading the CPUs on the vcview servers and hg is non-functional under load.
5) Try various steps to bring down the load.
6) Back out the configuration change and verify that load comes down.  Some (all?) 32bit machines still can not successfully clone (as in step #2)
7) Back out the varnish upgrade and script changes entirely.
https://tbpl.mozilla.org/php/getParsedLog.php?id=6939540&tree=Mozilla-Inbound - I should have waited for the third one on that push, too.
I'm not sure what's left to be done without a redesign.

The error logs are as follows:

From nginx:

2011/10/19 15:11:09 [error] 25442#0: *236253 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 10.2.90.111, server: hg.mozilla.org, request: "GET /integration/mozilla-inbound?bases=0000000000000000000000000000000000000000&cmd=changegroupsubset&heads=a4bd037eb2524a37d1fd29fe01c9c0ff644a15b8 HTTP/1.1", upstream: "http://10.2.74.71:80/integration/mozilla-inbound?bases=0000000000000000000000000000000000000000&cmd=changegroupsubset&heads=a4bd037eb2524a37d1fd29fe01c9c0ff644a15b8", host: "hg.mozilla.org"
2011/10/19 15:11:57 [error] 25442#0: *237859 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 10.12.49.62, server: hg.mozilla.org, request: "GET /integration/mozilla-inbound?bases=0000000000000000000000000000000000000000&cmd=changegroupsubset&heads=a4bd037eb2524a37d1fd29fe01c9c0ff644a15b8 HTTP/1.1", upstream: "http://10.2.74.71:80/integration/mozilla-inbound?bases=0000000000000000000000000000000000000000&cmd=changegroupsubset&heads=a4bd037eb2524a37d1fd29fe01c9c0ff644a15b8", host: "hg.mozilla.org"
2011/10/19 15:16:20 [crit] 25442#0: *264046 SSL_write() failed (SSL:) while sending to client, client: 95.178.195.205, server: hg.mozilla.org, request: "GET /mozilla-central/json-pushes?full=1&maxhours=24 HTTP/1.1", upstream: "http://10.2.74.71:80/mozilla-central/json-pushes?full=1&maxhours=24", host: "hg.mozilla.org", referrer: "https://tbpl.mozilla.org/"

There are no relevant error logs on the dm-vcview webheads.


The timeouts are:

nginx: (all in seconds)
    proxy_read_timeout 1800; 
    proxy_connect_timeout 1800;
    client_body_timeout 120;
    client_header_timeout 120;
    keepalive_timeout 120;
    send_timeout 120;

varnish:
    .connect_timeout = 600s;
    .first_byte_timeout = 600s;
    .between_bytes_timeout = 600s;

I think the keepalive timeouts and client timeouts in nginx are on the low side.
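If we do want to raise them, a minimal sketch of the change (the values and the config path are illustrative, nothing has been agreed yet):

# In the hg vhost config (path assumed, e.g. /etc/nginx/conf.d/hgweb.conf),
# bump the client-side timeouts:
#     client_body_timeout   600;
#     client_header_timeout 600;
#     keepalive_timeout     600;
#     send_timeout          600;
# then validate and apply without dropping in-flight transfers:
nginx -t && service nginx reload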
(In reply to Ben Kero [:bkero] from comment #47)
> I'm not sure what's left to be done without a redesign.

I think that's probably best regardless. I know you're picking up other people's trash right now, bkero, but would you mind getting a bug on file for that?

> I think the keepalive timeouts and client timeouts in nginx are on the low
> side.

I'm assuming you're talking about the 120s values here? 

Most of my clones today that were timing out were doing so reasonably quickly, i.e. <= 6 minutes, so I can see increasing these timeouts helping some. How disruptive is changing nginx settings? Will we burn clones-in-progress?
I've created bug 695934 to track the rebuild.

I can try adjusting the nginx settings using 'service nginx reload', which should set the new config settings without affecting existing connections.
(In reply to Ben Kero [:bkero] from comment #50) 
> I can try adjusting the nginx settings using 'service nginx reload', which
> should set the new config settings without affecting existing connections.

bkero: has this been tried yet? If not, can we schedule a time this week to try?

Also, while we're trying to think of other possible fixes, are there any disk issues in play on the hg servers, e.g. are we using NFS from a netapp and is it slowing things down? 

I just don't know what has already been tried, and it seems (to me) like we're running out of things *to* try on the IT side.

FWIW, we're pushing hard on the hg mirrors work on the releng side too.
The new timeout settings have already been tried; I adjusted them last week and issued a 'service nginx reload'.  Judging by the recent failed logs, it didn't help at all.

I've looked into disk issues; there are none.  IOWait on the dm-vcview hosts is below 2%.  The problem is CPU utilization: hgweb is very CPU intensive, and multiple simultaneous checkouts create more load than the CPUs can handle.  I've run out of things to try on the IT side; each change I've made has had to be backed out because of increased load issues or miscached fragments.

The only solution left for this is for the system to be rebuilt.  I've filed a bug for that already (linked earlier).
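For the record, the kind of check behind those numbers looks something like this (the commands are an approximation, not pasted from the actual investigation):

# wa (iowait) staying around 0-2% while user CPU is pegged points at hgweb
# itself rather than the disks:
ssh dm-vcview04 'vmstat 5 3'
ssh dm-vcview04 'top -b -n 1 | head -20'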
philor, can you please let me know if this happens again after 1/14/12 08:30 PST?
I can reproduce.
A full clone of mozilla-inbound failed on 15-Jan at 20:00 JST (03:00 PST).

$ /c/mozilla-build/hg/hg clone http://hg.mozilla.org/integration/mozilla-inbound/ /d/srcinbound
requesting all changes
adding changesets
adding manifests
transaction abort!
rollback completed
abort: premature EOF reading chunk (got 34497 bytes, expected 48897)
Closing this now.  Alice, we have rebuilt the hg infrastructure today, meaning that you shouldn't see that error again.

Similarly with build network clones, philor.
Assignee: bkero → cshields
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Product: mozilla.org → mozilla.org Graveyard