Closed Bug 693202 Opened 13 years ago Closed 12 years ago

Some full clones of source repos timing out

Categories

(mozilla.org Graveyard :: Server Operations, task, P3)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: philor, Assigned: cshields)

References

Details

(Whiteboard: [buildduty])

The first time I remember seeing it was on the first push to Electrolysis in more than a month, so I blew it off. Then I saw it here and there on days-old jobs on rarely used trees where I rarely look, and blew it off as maybe a time when hg.m.o was intermittently broken. But, really, there are just too many of these, particularly considering that they mostly happen on trees where nobody seems to know that they can and should file a bug about it, and that they can and should retrigger the job.

https://tbpl.mozilla.org/php/getParsedLog.php?id=6729456&tree=UX
https://tbpl.mozilla.org/php/getParsedLog.php?id=6729485&tree=UX
https://tbpl.mozilla.org/php/getParsedLog.php?id=6710130&tree=Services-Central
https://tbpl.mozilla.org/php/getParsedLog.php?id=6709676&tree=Services-Central
https://tbpl.mozilla.org/php/getParsedLog.php?id=6710131&tree=Services-Central
I think these are slaves which don't have a copy of the repo in their hg share dir; they attempt a full clone, and buildbot interrupts it after an hour, i.e. the hg server is responding too slowly.
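For reference, the clobber path here looks roughly like this (a paraphrase of what hgtool does, not its actual code; the repo and revision are taken from the UX logs in this bug, and the build directory path is made up):

# 1) No copy of the repo under the share dir, so do a full clone into it;
#    this is the step that runs into buildbot's one-hour no-output timeout:
hg clone -U -r 70e30fe7d2ab198263e535ecb6810149d5b1ecd8 http://hg.mozilla.org/projects/ux /builds/hg-shared/projects/ux
# 2) Share that store into the working directory and update to the revision
#    (assumes the share extension is enabled):
hg share /builds/hg-shared/projects/ux /builds/slave/ux-build/build
hg -R /builds/slave/ux-build/build update -r 70e30fe7d2ab198263e535ecb6810149d5b1ecd8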

(In reply to Phil Ringnalda (:philor) from comment #0)
> https://tbpl.mozilla.org/php/getParsedLog.php?id=6729456&tree=UX

A clone started at Fri Oct 7 09:46:08 2011:
 hg clone -U -r 70e30fe7d2ab198263e535ecb6810149d5b1ecd8 http://hg.mozilla.org/projects/ux /builds/hg-shared/projects/ux
It hadn't finished an hour later, so buildbot killed the job. The same slave (linux-ix-slave18) cloned that repo in 1657 seconds on Oct 9 at 13:44.

> https://tbpl.mozilla.org/php/getParsedLog.php?id=6729485&tree=UX

Started cloning ux at Fri Oct 7 09:49:27 2011 (so very close to the one above), buildbot timed it out an hour later when it wasn't done. (w64-ix-slave12)

> https://tbpl.mozilla.org/php/getParsedLog.php?id=6710130&tree=Services-Central
> https://tbpl.mozilla.org/php/getParsedLog.php?id=6709676&tree=Services-Central
> https://tbpl.mozilla.org/php/getParsedLog.php?id=6710131&tree=Services-Central

Three failures on the same revision (6270ac0e7f96):
* linux64-ix-slave21 - started cloning Thu Oct 6 12:42:09 2011, buildbot killed it after an hour
* mw32-ix-slave02 - started cloning Thu Oct 6 12:43:50 2011, got 'HTTP Error 504: Gateway Time-out' after 30 minutes
* w64-ix-slave06 - started cloning Thu Oct 6 12:42:24 2011, buildbot killed it after an hour

I don't see any time-correlated alerts for dm-vcview* from nagios, but there were a couple of load spikes on dm-vcview04 on the 7th. Four of the five slaves are in the SCL1 colo, while mw32-ix-slave02 is in the mtv office, so it might have been network congestion (we know SCL1 <---> SJC1 is busy, and hg.m.o is in SJC1).

Since it's now after the fact, I suggest we keep this open for a few days. If it recurs, it should go over to Server Ops ASAP for diagnosis.
Priority: -- → P3
Summary: hgtool driven clobbers frequently time out → Some full clones of source repos timing out
Whiteboard: [buildduty]
https://tbpl.mozilla.org/php/getParsedLog.php?id=6758612&tree=Build-System

Error pulling changes into e:\builds\moz2_slave\bld-system-w64\build from http://hg.mozilla.org/projects/build-system; clobbering
command: START
command: hg clone -r 6e945f23126536783fce6ef618162f48b1491ea9 http://hg.mozilla.org/projects/build-system e:\\\\builds\\\\moz2_slave\\\\bld-system-w64\\\\build
command: cwd: e:\builds\moz2_slave\bld-system-w64
command: output:
requesting all changes
abort: HTTP Error 504: Gateway Time-out
command: END (1800.09s elapsed)

Traceback (most recent call last):
  File "e:/builds/moz2_slave/bld-system-w64/tools/buildfarm/utils/hgtool.py", line 75, in <module>
    shareBase=options.shared_dir)
  File "e:/builds/moz2_slave/bld-system-w64/tools/buildfarm/utils\../../lib/python\util\hg.py", line 316, in mercurial
    return clone(repo, dest, branch, revision, update_dest=update_dest)
  File "e:/builds/moz2_slave/bld-system-w64/tools/buildfarm/utils\../../lib/python\util\hg.py", line 125, in clone
    run_cmd(cmd)
  File "e:/builds/moz2_slave/bld-system-w64/tools/buildfarm/utils\../../lib/python\util\commands.py", line 42, in run_cmd
    return subprocess.check_call(cmd, **kwargs)
  File "c:\mozilla-build\python\lib\subprocess.py", line 498, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['hg', 'clone', '-r', u'6e945f23126536783fce6ef618162f48b1491ea9', 'http://hg.mozilla.org/projects/build-system', 'e:\\builds\\moz2_slave\\bld-system-w64\\build']' returned non-zero exit status -1
program finished with exit code 1
elapsedTime=1800.399000
(In reply to Phil Ringnalda (:philor) from comment #2)
> https://tbpl.mozilla.org/php/getParsedLog.php?id=6758612&tree=Build-System

Started at Sun Oct 9 15:29:28 2011, got the 504 30 minutes later.
Over to Server Ops. This causes delays to developers - more than 1 hour to get a failure, then they have to retry the build and hope it works.
Assignee: nobody → server-ops
Severity: normal → critical
Component: Release Engineering → Server Operations
QA Contact: release → cshields
cc'ing Amy from RelOps to keep them in the loop
https://tbpl.mozilla.org/php/getParsedLog.php?id=6758722&tree=Build-System
w32-ix-slave41
2011-10-09 15:26:56 PDT
Error pulling changes into e:\builds\moz2_slave\bld-system-w32\build from http://hg.mozilla.org/projects/build-system; clobbering
command: START
command: hg clone -r 6e945f23126536783fce6ef618162f48b1491ea9 http://hg.mozilla.org/projects/build-system e:\\\\builds\\\\moz2_slave\\\\bld-system-w32\\\\build
command: cwd: e:\builds\moz2_slave\bld-system-w32
command: output:

command timed out: 3600 seconds without output, attempting to kill
There's an 'Hg' cluster in ganglia, but I don't know which machine I should be looking at these days for non-try requests. There is a load spike on dm-vcview04 at the same time as the failure in comment #2, but I thought that was try-only (at least when aravind was maintaining things).
Assignee: server-ops → shyam
Okay, so it seems like the servers aren't returning anything:

send(3, "GET /projects/build-system?cmd=c"..., 272, 0) = 272
recv(3,  <unfinished ...>

real	4m25.905s
user	0m0.242s
sys	0m2.914s

Just sat there at the recv till I killed it. This happens both externally (outside the DC, on 63.245.208.189) and internally (on 10.2.74.67). Poking further.
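A rough reconstruction of that check, for anyone who wants to reproduce it (the exact invocation isn't pasted here, so treat this as an approximation; the repo and revision are the ones from comment #2):

# Trace only the network syscalls of a clone attempt and see how long it sits
# in recv(); hg.mozilla.org can be pinned to a specific frontend IP via
# /etc/hosts to compare the external and internal addresses mentioned above.
time strace -f -e trace=network hg clone -U -r 6e945f23126536783fce6ef618162f48b1491ea9 http://hg.mozilla.org/projects/build-system /tmp/bs-strace-test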
Tried this again (and this is just for information, I'm still debugging):

[shyam@boris tmphg]$ time hg clone -r 6e945f23126536783fce6ef618162f48b1491ea9 http://hg.mozilla.org/projects/build-system .
adding changesets
adding manifests
adding file changes
added 78108 changesets with 371460 changes to 80319 files
updating to branch default
52072 files updated, 0 files merged, 0 files removed, 0 files unresolved

real	10m46.862s
user	9m1.742s
sys	0m38.605s
A test I ran on my home laptop (just adding to the info "noise"):

[] moz:temp bear$ time hg clone -r 6e945f23126536783fce6ef618162f48b1491ea9 http://hg.mozilla.org/projects/build-system 
destination directory: build-system
requesting all changes
adding changesets
adding manifests
adding file changes
added 78108 changesets with 371460 changes to 80319 files
updating to branch default
52072 files updated, 0 files merged, 0 files removed, 0 files unresolved

real	12m22.126s
user	2m39.633s
sys	0m42.694s

[] moz:temp bear$ du -h -s build-system
1.2G	build-system
More timeouts (IRC timestamps are EST):

[01:33]  <fox2mike> bear-afk: are we still seeing time-outs? long clones? 
[01:33]  <bear-afk> let me look
[01:38]  <bear-afk> fox2mike: https://tbpl.mozilla.org/php/getParsedLog.php?id=6612293&tree=Mozilla-Release&full=1
[01:38]  <bear-afk> https://tbpl.mozilla.org/php/getParsedLog.php?id=6612271&tree=Mozilla-Release
[01:39]  <fox2mike> so it's still happening
[01:39]  <bear-afk> yes, 2 in the latest set of pushes
[01:40]  <bear-afk> and only for the m-r hg repo
[01:40]  <bear-afk> smaller tool clones worked
[01:42]  <fox2mike> I'm thinking it's varnish, might need to flush caches. discussing with a few folks
[01:42]  <bear-afk> k
[01:42]  <bear-afk> another from build-system: https://tbpl.mozilla.org/php/getParsedLog.php?id=6758722&tree=Build-System
Okay, so we had a little chat and think that the varnish cache was the culprit. 

That has now been flushed. I'm keeping my fingers crossed and if we were right, we shouldn't see any more of these issues. Initial clones might be slower (but shouldn't fail) and it should all catch up after that. 

Please update the bug with any failures henceforth. That means we were wrong and it's failing elsewhere too :) and more poking is to be done.
https://tbpl.mozilla.org/php/getParsedLog.php?id=6767229&tree=Mozilla-Beta
linux-ix-slave17
2011-10-10 08:20:00 PDT
command timed out: 3600 seconds without output, attempting to kill
More poking is occurring.
Assignee: shyam → bkero
Attempting to upgrade varnish from 2.1.2 (SVN) to 2.1.5.  This will also have the consequence of restarting the daemon, which has been known to temporarily solve slowness/timeout issues.
Varnish has been upgraded and restarted.  I'm noting this here so build can have a reference point if they notice issues starting around this time.
philor: have the builds gone any smoother with the upgrade/restart?
There was one failure to clone that ran between 11:42 and 12:42 (w64-ix-slave11.build.m.o doing a jaegermonkey win64 build). We probably need to wait a day or so before we can call this fixed, given it took some hours to go bad after comment #12.
We had three on DevTools at 18:09 to 19:09 yesterday (the 11th), eg
https://tbpl.mozilla.org/php/getParsedLog.php?id=6798662&tree=Devtools
Thanks for the continued starrings and updates.  This is still being investigated, but it's hard to nail down since it's sporadic.
I've tracked this down to an issue with how varnish caches objects.  A write-up of the feature I plan to implement to fix this has been posted here: https://www.varnish-software.com/blog/streaming-varnish-30

It essentially boils down to this:

hg clones consist of 3 main HTTP calls (sketched below):
1) a GET request to hgweb to determine permissions and capabilities (47 bytes)
2) a GET request to resolve the requested commit hash to an internal hash (13 bytes)
3) a GET request for the data stream from commit 0 to the desired commit (roughly 700 megabytes)
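For illustration, the three calls look roughly like this on the wire (reconstructed from the Mercurial wire protocol and the nginx logs later in this bug, not captured output; the revision is the build-system one used earlier):

# 1) capabilities/permission check (tiny response)
curl -s 'http://hg.mozilla.org/projects/build-system?cmd=capabilities'
# 2) resolve the requested revision to a full node (tiny response)
curl -s 'http://hg.mozilla.org/projects/build-system?cmd=lookup&key=6e945f23126536783fce6ef618162f48b1491ea9'
# 3) the changegroup stream from rev 0 up to that head (~700 MB); this is the
#    request varnish currently buffers in full before sending the first byte
curl -s -o /dev/null 'http://hg.mozilla.org/projects/build-system?cmd=changegroupsubset&bases=0000000000000000000000000000000000000000&heads=6e945f23126536783fce6ef618162f48b1491ea9'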

The way that varnish handles request #3 is to retrieve the entire thing from the backend before serving anything to the client.

Starting in Varnish version 3.0, a do_stream function allows objects to be cached and sent to clients at the same time.  I've constructed a virtual machine running Varnish 3.0 which interfaces to the same backend servers the real varnish server does.  The results are as follows:

Old method (no streaming, uncached): 33 minutes, 26 seconds
New method (streaming, uncached): 22 minutes, 19 seconds
New method (streaming, cached): 11 minutes, 50 seconds

The plan is to deploy this on Tuesday, 2011-10-18.  These are the steps required; downtime should be under 1 minute, although plan for 20 minutes in case (heh) of complications:

1) Add varnish-release-3.0.0 RPM from varnish-software.com to the mozilla RPM repository
2) ssh dm-vcview04 'yum remove varnish varnish-libs'
3) yum install varnish-release
4) Add the following lines to /etc/varnish/hgweb.vcl file in the vcl_fetch() function:
                if (req.url ~ "cmd=changegroupsubset") {
                    set beresp.do_stream = true;
                }
5) ssh dm-vcview04 service varnish restart

Failback option is to:
1) Remove the lines from /etc/varnish/hgweb.vcl
2) yum remove varnish-release
3) yum install varnish
4) service varnish restart
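A quick sanity check after the deploy (not part of the plan above, just a suggestion; the /tmp path is an example):

# Confirm the daemon is actually running 3.0.x after the restart:
ssh dm-vcview04 varnishd -V
# Re-time an uncached full clone and compare against the streaming numbers above:
time hg clone -U -r 6e945f23126536783fce6ef618162f48b1491ea9 http://hg.mozilla.org/projects/build-system /tmp/bs-streaming-test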
Ben, thanks for doing the debugging and replicating to figure this out.

The planned interruption here is similar to the varnish restarts we've done a few times earlier in this bug.  As such, I don't think this requires a pre-planned tree closure.  At worst, a few jobs may need to be re-triggered due to failing clones during the restart - similar to the problem this is intended to solve.

I'd like to get an ack from buildduty on Monday that this is OK.  If not, please set the needs-treeclosure? flag and we'll follow the usual treeclosure process here.
I've talked with catlee and we've scheduled this for 7AM PST on Tuesday.
I deployed the upgrade this morning during a window from 07:10 to 07:13.  The upgrade was successful, and I am currently doing a mock checkout to compare checkout times against the old non-streaming version.

My uncached checkout took 22 minutes 41 seconds

Could I get someone from build to comment if clones are quicker and more reliable?  Then we can close this bug out.
I'll keep an eye on the tree today looking for clone timeouts, but we'll need several days of continued good performance before I could verify this as fixed.
Depends on: 695467
This is affecting the FF8 beta4 release.
I've closed the trees to slow down the HG load and hopefully get the release to go through.
Blocks: 695429
Severity: critical → blocker
(In reply to Armen Zambrano G. [:armenzg] - Release Engineer from comment #31)
> This is affecting the FF8 beta4 release.
> I've closed the trees to slow down the HG load and hopefully get the release
> to go through.

FTR this also affected Thunderbird nightly builds as well as our release builds.
As mentioned on irc, ops has been looking at this this morning since before Armen's report.  It's unclear whether or not this has anything to do with the varnish upgrade that was done earlier this week.
Could this varnish issue be affecting developer.mozilla.org? It has become highly unresponsive since the hg thing cropped up.
I don't know of any relation or dependency of code between either of them, but maybe someone else cc'd on this bug does.
I believe the load could have been caused by a configuration change I made to the varnish caching daemon yesterday, after some developers reported receiving cached data and old tips during some clones.  I added the lines:

if (req.url ~ "cmd=changegroupsubset") {
    return (pass);
}

Upon commenting that section out today, the load on the backend servers has dropped back to normal levels.

Some other changes yesterday might have also fixed the old tip problem.
Severity: blocker → major
Could build please confirm that clones are working reliably again?
Some release jobs that started 45 mins ago have failed, but it is also the first time that 2-3 release jobs have passed the hg clone/update step.

Perhaps in 30 mins we should be past the hump for the releases.
We could be opening the trees anytime soon.
I was celebrating too fast.

I am getting this on the win32 and win32 xulrunner *release* jobs even after having clobbered them.

'd:\\mozilla-build\\hg\\hg.exe' 'clone' '--verbose' '--noupdate' u'http://hg.mozilla.org/releases/mozilla-beta' 'build'
#########
requesting all changes
adding changesets
adding manifests
adding file changes
transaction abort!
rollback completed
** unknown exception encountered, please report by visiting
**  http://mercurial.selenic.com/wiki/BugTracker
** Python 2.6.6 (r266:84297, Aug 24 2010, 18:46:32) [MSC v.1500 32 bit (Intel)]
** Mercurial Distributed SCM (version 1.7.5)
** Extensions loaded: win32text, graphlog, share, purge
Traceback (most recent call last):
  File "hg", line 38, in <module>
  File "mercurial\dispatch.pyc", line 16, in run
  File "mercurial\dispatch.pyc", line 36, in dispatch
  File "mercurial\dispatch.pyc", line 58, in _runcatch
  File "mercurial\dispatch.pyc", line 593, in _dispatch
  File "mercurial\dispatch.pyc", line 401, in runcommand
  File "mercurial\dispatch.pyc", line 644, in _runcommand
  File "mercurial\dispatch.pyc", line 598, in checkargs
  File "mercurial\dispatch.pyc", line 591, in <lambda>
  File "mercurial\util.pyc", line 426, in check
  File "mercurial\commands.pyc", line 736, in clone
  File "mercurial\hg.pyc", line 337, in clone
  File "mercurial\localrepo.pyc", line 1886, in clone
  File "mercurial\localrepo.pyc", line 1295, in pull
  File "mercurial\localrepo.pyc", line 1739, in addchangegroup
  File "mercurial\revlog.pyc", line 1381, in addgroup
  File "mercurial\revlog.pyc", line 1220, in _addrevision
mpatch.mpatchError: patch cannot be decoded
For the release we have now made it through the hg steps. Let's see how the L10n repacks behave once they get triggered.
Due to complications with the upgrade, we have completely backed out of the varnish upgrade and config changes and are now back to the same state we were in on Monday (i.e. there will still be intermittent timeouts under load).  The sequence of events:

1) Upgrade varnish to fix the timeouts, and fix scripts to use the new admin interface.
2) Discover that some (all?) 32 bit machines are throwing errors when trying to clone after the upgrade.
3) Make configuration changes to varnish to fix the errors.
4) Discover that the configuration changes are severely overloading the CPUs on the vcview servers and hg is non-functional under load.
5) Try various steps to bring down the load.
6) Back out the configuration change and verify that load comes down.  Some (all?) 32bit machines still can not successfully clone (as in step #2)
7) Back out the varnish upgrade and script changes entirely.
https://tbpl.mozilla.org/php/getParsedLog.php?id=6939540&tree=Mozilla-Inbound - I should have waited for the third one on that push, too.
I'm not sure what's left to be done without a redesign.

The error logs are as follows:

From nginx:

2011/10/19 15:11:09 [error] 25442#0: *236253 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 10.2.90.111, server: hg.mozilla.org, request: "GET /integration/mozilla-inbound?bases=0000000000000000000000000000000000000000&cmd=changegroupsubset&heads=a4bd037eb2524a37d1fd29fe01c9c0ff644a15b8 HTTP/1.1", upstream: "http://10.2.74.71:80/integration/mozilla-inbound?bases=0000000000000000000000000000000000000000&cmd=changegroupsubset&heads=a4bd037eb2524a37d1fd29fe01c9c0ff644a15b8", host: "hg.mozilla.org"
2011/10/19 15:11:57 [error] 25442#0: *237859 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 10.12.49.62, server: hg.mozilla.org, request: "GET /integration/mozilla-inbound?bases=0000000000000000000000000000000000000000&cmd=changegroupsubset&heads=a4bd037eb2524a37d1fd29fe01c9c0ff644a15b8 HTTP/1.1", upstream: "http://10.2.74.71:80/integration/mozilla-inbound?bases=0000000000000000000000000000000000000000&cmd=changegroupsubset&heads=a4bd037eb2524a37d1fd29fe01c9c0ff644a15b8", host: "hg.mozilla.org"
2011/10/19 15:16:20 [crit] 25442#0: *264046 SSL_write() failed (SSL:) while sending to client, client: 95.178.195.205, server: hg.mozilla.org, request: "GET /mozilla-central/json-pushes?full=1&maxhours=24 HTTP/1.1", upstream: "http://10.2.74.71:80/mozilla-central/json-pushes?full=1&maxhours=24", host: "hg.mozilla.org", referrer: "https://tbpl.mozilla.org/"

There are no relevant error logs on the dm-vcview webheads.


The timeouts are:

nginx: (all in seconds)
    proxy_read_timeout 1800; 
    proxy_connect_timeout 1800;
    client_body_timeout 120;
    client_header_timeout 120;
    keepalive_timeout 120;
    send_timeout 120;

varnish:
    .connect_timeout = 600s;
    .first_byte_timeout = 600s;
    .between_bytes_timeout = 600s;

I think the keepalive timeouts and client timeouts in nginx are on the low side.
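If we do want to raise them, a minimal sketch of the change (the values and the config path are illustrative, nothing has been agreed yet):

# In the hg vhost config (path assumed, e.g. /etc/nginx/conf.d/hgweb.conf),
# bump the client-side timeouts:
#     client_body_timeout   600;
#     client_header_timeout 600;
#     keepalive_timeout     600;
#     send_timeout          600;
# then validate and apply without dropping in-flight transfers:
nginx -t && service nginx reload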
(In reply to Ben Kero [:bkero] from comment #47)
> I'm not sure what's left to be done without a redesign.

I think that's probably best regardless. I know you're picking up other people's trash right now, bkero, but would you mind getting a bug on file for that?

> I think the keepalive timeouts and client timeouts in nginx are on the low
> side.

I'm assuming you're talking about the 120s values here? 

Most of my clones today that were timing out were doing so reasonably quickly, i.e. <= 6 minutes, so I can see increasing these timeouts helping some. How disruptive is changing nginx settings? Will we burn clones-in-progress?
I've created bug 695934 to track the rebuild.

I can try adjusting the nginx settings using 'service nginx reload', which should set the new config settings without affecting existing connections.
(In reply to Ben Kero [:bkero] from comment #50) 
> I can try adjusting the nginx settings using 'service nginx reload', which
> should set the new config settings without affecting existing connections.

bkero: has this been tried yet? If not, can we schedule a time this week to try?

Also, while we're trying to think of other possible fixes, are there any disk issues in play on the hg servers, e.g. are we using NFS from a netapp and is it slowing things down? 

I just don't know what has already been tried, and it seems (to me) like we're running out of things *to* try on the IT side.

FWIW, we're pushing hard on the hg mirrors work on the releng side too.
The new timeout settings have already been tried; I adjusted them last week and issued a 'service nginx reload'.  Judging by the recent failed logs, it didn't help at all.

I've looked into disk issues; there are none.  IOWait on the dm-vcview hosts is below 2%.  The problem is CPU utilization: hgweb is very CPU intensive, and multiple simultaneous checkouts create more load than the CPUs can handle.  I've run out of things to try on the IT side; each change I've made has had to be backed out because of increased load issues or miscached fragments.

The only solution left for this is for the system to be rebuilt.  I've filed a bug for that already (linked earlier).
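For the record, the kind of check behind those numbers looks something like this (the commands are an approximation, not pasted from the actual investigation):

# wa (iowait) staying around 0-2% while user CPU is pegged points at hgweb
# itself rather than the disks:
ssh dm-vcview04 'vmstat 5 3'
ssh dm-vcview04 'top -b -n 1 | head -20'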
philor, can you please let me know if this happens again after 1/14/12 08:30 PST?
I can reproduce.
A full clone of mozilla-inbound failed on 15-Jan at 20:00 JST (03:00 PST).

$ /c/mozilla-build/hg/hg clone http://hg.mozilla.org/integration/mozilla-inbound/ /d/srcinbound
requesting all changes
adding changesets
adding manifests
transaction abort!
rollback completed
abort: premature EOF reading chunk (got 34497 bytes, expected 48897)
Closing this now.  Alice, we have rebuilt the hg infrastructure today, meaning that you shouldn't see that error again.

Similarly with build network clones, philor.
Assignee: bkero → cshields
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Product: mozilla.org → mozilla.org Graveyard