Closed
Bug 693202
Opened 13 years ago
Closed 12 years ago
Some full clones of source repos timing out
Categories
(mozilla.org Graveyard :: Server Operations, task, P3)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: philor, Assigned: cshields)
References
Details
(Whiteboard: [buildduty])
The first time I remember seeing it was on the first push to Electrolysis in more than a month, so I blew it off. Then I saw it here and there on days-old jobs on rarely used trees where I rarely look, and blew it off as maybe a time when hg.m.o was intermittently broken. But, really, there's just too many of these, particularly considering that they mostly happen on trees where nobody seems to know that they can and should file it, and they can and should retrigger it.

https://tbpl.mozilla.org/php/getParsedLog.php?id=6729456&tree=UX
https://tbpl.mozilla.org/php/getParsedLog.php?id=6729485&tree=UX
https://tbpl.mozilla.org/php/getParsedLog.php?id=6710130&tree=Services-Central
https://tbpl.mozilla.org/php/getParsedLog.php?id=6709676&tree=Services-Central
https://tbpl.mozilla.org/php/getParsedLog.php?id=6710131&tree=Services-Central
Comment 1•13 years ago
I think these are slaves which don't have a copy of the repo in their hg share dir, then attempt to clone, and buildbot interrupts the clone after an hour, i.e. the hg server is responding too slowly.

(In reply to Phil Ringnalda (:philor) from comment #0)
> https://tbpl.mozilla.org/php/getParsedLog.php?id=6729456&tree=UX

A clone started at Fri Oct 7 09:46:08 2011:
hg clone -U -r 70e30fe7d2ab198263e535ecb6810149d5b1ecd8 http://hg.mozilla.org/projects/ux /builds/hg-shared/projects/ux
It hadn't finished 1 hour later, so buildbot killed the job. The same slave (linux-ix-slave18) cloned that repo in 1657 seconds on Oct 9 13:44.

> https://tbpl.mozilla.org/php/getParsedLog.php?id=6729485&tree=UX

Started cloning ux at Fri Oct 7 09:49:27 2011 (so very close to the one above); buildbot timed it out an hour later when it wasn't done. (w64-ix-slave12)

> https://tbpl.mozilla.org/php/getParsedLog.php?id=6710130&tree=Services-Central
> https://tbpl.mozilla.org/php/getParsedLog.php?id=6709676&tree=Services-Central
> https://tbpl.mozilla.org/php/getParsedLog.php?id=6710131&tree=Services-Central

Three failures on the same revision (6270ac0e7f96):
* linux64-ix-slave21 - started cloning Thu Oct 6 12:42:09 2011, buildbot killed it after an hour
* mw32-ix-slave02 - started cloning Thu Oct 6 12:43:50 2011, got 'HTTP Error 504: Gateway Time-out' after 30 minutes
* w64-ix-slave06 - started cloning Thu Oct 6 12:42:24 2011, buildbot killed it after an hour

I don't see any time-correlated alerts for dm-vcview* from nagios, but there were a couple of load spikes on dm-vcview04 on the 7th. Four of the five slaves are in the SCL1 colo, while mw32-ix-slave02 is in the mtv office, so this might have been network congestion (we know SCL1 <---> SJC1 is busy, and hg.m.o is in SJC1).

Since it's now after-the-fact, I suggest we keep this open for a few days. If it recurs then it should go over to Server Ops asap for diagnosis.
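The failure mode here is buildbot's inactivity watchdog: the job is killed after a fixed interval with no output, regardless of whether the clone is still making progress on the wire. A minimal Python sketch of that kind of watchdog (illustrative only, not buildbot's actual implementation):

```python
import subprocess
import sys
import threading
import time

def run_with_inactivity_timeout(cmd, timeout):
    """Run cmd, killing it if it produces no output for `timeout` seconds.

    Returns (exit_code, killed). Mimics the "command timed out: N seconds
    without output, attempting to kill" behaviour seen in these build logs;
    this is a toy sketch, not buildbot code.
    """
    proc = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT)
    last_output = [time.monotonic()]
    killed = [False]

    def watchdog():
        while proc.poll() is None:
            if time.monotonic() - last_output[0] > timeout:
                killed[0] = True
                proc.kill()
                return
            time.sleep(0.05)

    t = threading.Thread(target=watchdog, daemon=True)
    t.start()
    for _ in iter(proc.stdout.readline, b""):
        last_output[0] = time.monotonic()  # any output line resets the timer
    proc.wait()
    t.join()
    return proc.returncode, killed[0]
```

Note that a clone which streams data but prints nothing (hg's "requesting all changes" phase) looks identical to a hung clone from the watchdog's point of view, which is why slow-but-alive clones get killed at the hour mark.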
Priority: -- → P3
Summary: hgtool driven clobbers frequently time out → Some full clones of source repos timing out
Whiteboard: [buildduty]
Reporter
Comment 2•13 years ago
https://tbpl.mozilla.org/php/getParsedLog.php?id=6758612&tree=Build-System

Error pulling changes into e:\builds\moz2_slave\bld-system-w64\build from http://hg.mozilla.org/projects/build-system; clobbering
command: START
command: hg clone -r 6e945f23126536783fce6ef618162f48b1491ea9 http://hg.mozilla.org/projects/build-system e:\\\\builds\\\\moz2_slave\\\\bld-system-w64\\\\build
command: cwd: e:\builds\moz2_slave\bld-system-w64
command: output:
requesting all changes
abort: HTTP Error 504: Gateway Time-out
command: END (1800.09s elapsed)
Traceback (most recent call last):
  File "e:/builds/moz2_slave/bld-system-w64/tools/buildfarm/utils/hgtool.py", line 75, in <module>
    shareBase=options.shared_dir)
  File "e:/builds/moz2_slave/bld-system-w64/tools/buildfarm/utils\../../lib/python\util\hg.py", line 316, in mercurial
    return clone(repo, dest, branch, revision, update_dest=update_dest)
  File "e:/builds/moz2_slave/bld-system-w64/tools/buildfarm/utils\../../lib/python\util\hg.py", line 125, in clone
    run_cmd(cmd)
  File "e:/builds/moz2_slave/bld-system-w64/tools/buildfarm/utils\../../lib/python\util\commands.py", line 42, in run_cmd
    return subprocess.check_call(cmd, **kwargs)
  File "c:\mozilla-build\python\lib\subprocess.py", line 498, in check_call
    raise CalledProcessError(retcode, cmd)
subprocess.CalledProcessError: Command '['hg', 'clone', '-r', u'6e945f23126536783fce6ef618162f48b1491ea9', 'http://hg.mozilla.org/projects/build-system', 'e:\\builds\\moz2_slave\\bld-system-w64\\build']' returned non-zero exit status -1
program finished with exit code 1
elapsedTime=1800.399000
Comment 3•13 years ago
(In reply to Phil Ringnalda (:philor) from comment #2)
> https://tbpl.mozilla.org/php/getParsedLog.php?id=6758612&tree=Build-System

Started at Sun Oct 9 15:29:28 2011, got the 504 30 minutes later.
Comment 4•13 years ago
Over to Server Ops. This causes delays to developers - more than 1 hour to get a failure, then they have to retry the build and hope it works.
Assignee: nobody → server-ops
Severity: normal → critical
Component: Release Engineering → Server Operations
QA Contact: release → cshields
Comment 5•13 years ago
cc'ing Amy from RelOps to keep them in the loop
Reporter
Comment 6•13 years ago
https://tbpl.mozilla.org/php/getParsedLog.php?id=6758722&tree=Build-System
w32-ix-slave41 2011-10-09 15:26:56 PDT

Error pulling changes into e:\builds\moz2_slave\bld-system-w32\build from http://hg.mozilla.org/projects/build-system; clobbering
command: START
command: hg clone -r 6e945f23126536783fce6ef618162f48b1491ea9 http://hg.mozilla.org/projects/build-system e:\\\\builds\\\\moz2_slave\\\\bld-system-w32\\\\build
command: cwd: e:\builds\moz2_slave\bld-system-w32
command: output:
command timed out: 3600 seconds without output, attempting to kill
Comment 7•13 years ago
There's an 'Hg' cluster in ganglia, but I don't know which machine I should be looking at these days for non-try requests. There is a load spike on dm-vcview04 at the same time as the failure for comment #2, but I thought that was try only (at least when aravind was maintaining things).
Updated•13 years ago
Assignee: server-ops → shyam
Comment 8•13 years ago
Okay, so it seems like the servers aren't returning anything:

send(3, "GET /projects/build-system?cmd=c"..., 272, 0) = 272
recv(3, <unfinished ...>

real 4m25.905s
user 0m0.242s
sys 0m2.914s

Just sat there at the recv till I killed it. This happens both externally (outside the DC, on 63.245.208.189) and internally (on 10.2.74.67). Poking further.
Comment 9•13 years ago
Tried this again (and this is just for information, I'm still debugging):

[shyam@boris tmphg]$ time hg clone -r 6e945f23126536783fce6ef618162f48b1491ea9 http://hg.mozilla.org/projects/build-system .
adding changesets
adding manifests
adding file changes
added 78108 changesets with 371460 changes to 80319 files
updating to branch default
52072 files updated, 0 files merged, 0 files removed, 0 files unresolved

real 10m46.862s
user 9m1.742s
sys 0m38.605s
Comment 10•13 years ago
A test I ran on my home laptop (just adding to the info "noise"):

[] moz:temp bear$ time hg clone -r 6e945f23126536783fce6ef618162f48b1491ea9 http://hg.mozilla.org/projects/build-system
destination directory: build-system
requesting all changes
adding changesets
adding manifests
adding file changes
added 78108 changesets with 371460 changes to 80319 files
updating to branch default
52072 files updated, 0 files merged, 0 files removed, 0 files unresolved

real 12m22.126s
user 2m39.633s
sys 0m42.694s

[] moz:temp bear$ du -h -s build-system
1.2G build-system
Comment 11•13 years ago
More timeouts (irc timestamps are EST):

[01:33] <fox2mike> bear-afk: are we still seeing time-outs? long clones?
[01:33] <bear-afk> let me look
[01:38] <bear-afk> fox2mike: https://tbpl.mozilla.org/php/getParsedLog.php?id=6612293&tree=Mozilla-Release&full=1
[01:38] <bear-afk> https://tbpl.mozilla.org/php/getParsedLog.php?id=6612271&tree=Mozilla-Release
[01:39] <fox2mike> so it's still happening
[01:39] <bear-afk> yes, 2 in the latest set of pushes
[01:40] <bear-afk> and only for the m-r hg repo
[01:40] <bear-afk> smaller tool clones worked
[01:42] <fox2mike> I'm thinking it's varnish, might need to flush caches. discussing with a few folks
[01:42] <bear-afk> k
[01:42] <bear-afk> another from build-system: https://tbpl.mozilla.org/php/getParsedLog.php?id=6758722&tree=Build-System
Comment 12•13 years ago
Okay, so we had a little chat and think that the varnish cache was the culprit. That has now been flushed. I'm keeping my fingers crossed and if we were right, we shouldn't see any more of these issues. Initial clones might be slower (but shouldn't fail) and it should all catch up after that. Please update the bug with any failures henceforth. That means we were wrong and it's failing elsewhere too :) and more poking is to be done.
Reporter
Comment 13•13 years ago
https://tbpl.mozilla.org/php/getParsedLog.php?id=6767229&tree=Mozilla-Beta
linux-ix-slave17 2011-10-10 08:20:00 PDT
command timed out: 3600 seconds without output, attempting to kill
Comment 14•13 years ago
More poking is occurring.
Updated•13 years ago
Assignee: shyam → bkero
Comment 15•13 years ago
Attempting to upgrade varnish from 2.1.2 (SVN) to 2.1.5. This will also have the consequence of restarting the daemon, which has been known to temporarily solve slowness/timeout issues.
Comment 16•13 years ago
Varnish has been upgraded and restarted. I'm noting this here so build can have a reference point if they notice issues starting around this time.
Comment 17•13 years ago
philor: have the builds gone any smoother with the upgrade/restart?
Comment 18•13 years ago
There was one failure to clone that ran between 11:42 and 12:42 (w64-ix-slave11.build.m.o doing a jaegermonkey win64 build). We probably need to wait a day or so before we can call this fixed, given it took some hours to go bad after comment #12.
Reporter
Comment 19•13 years ago
https://tbpl.mozilla.org/php/getParsedLog.php?id=6775327&tree=Mozilla-Beta
w32-ix-slave36 2011-10-10 16:17:09 PDT :(
Reporter
Comment 20•13 years ago
https://tbpl.mozilla.org/php/getParsedLog.php?id=6790484&tree=Services-Central
linux-ix-slave22 2011-10-11 09:40:12 PDT
Reporter
Comment 21•13 years ago
https://tbpl.mozilla.org/php/getParsedLog.php?id=6813982&tree=Fx-Team
linux-ix-slave40 2011-10-12 10:10:55 PDT
Comment 22•13 years ago
We had three on DevTools at 18:09 to 19:09 yesterday (the 11th), eg https://tbpl.mozilla.org/php/getParsedLog.php?id=6798662&tree=Devtools
Reporter
Comment 23•13 years ago
https://tbpl.mozilla.org/php/getParsedLog.php?id=6827650&tree=Fx-Team
w64-ix-slave07 2011-10-13 04:02:23 PDT
Comment 25•13 years ago
Thanks for the continued starrings and updates. This is still being investigated, but it's hard to nail down since it's sporadic.
Comment 26•13 years ago
I've tracked this down to an issue with how varnish caches objects. A write-up of the feature I plan to implement to fix this has been posted here: https://www.varnish-software.com/blog/streaming-varnish-30

It essentially boils down to this: hg clones consist of 3 main HTTP calls:
1) GET request to hgweb to determine permissions and capabilities, 47 bytes
2) GET request to resolve the commit hash to an internal hash, 13 bytes
3) GET request for the data stream from commit 0 to the desired commit, 700ish megabytes

The way that varnish handles request #3 is to retrieve the entire thing from the backend before serving anything to the client. Starting in Varnish version 3.0, a do_stream setting allows objects to be cached and sent to clients at the same time.

I've constructed a virtual machine running Varnish 3.0 which interfaces to the same backend servers the real varnish server does. The results are as follows:

Old method (no streaming, uncached): 33 minutes 26 seconds
New method (streaming, uncached): 22 minutes 19 seconds
New method (streaming, cached): 11 minutes 50 seconds

The plan is to deploy this on Tuesday, 2011-10-18. These are the steps required; downtime should be under 1 minute, although plan for 20 minutes in case (heh) of complications:

1) Add the varnish-release-3.0.0 RPM from varnish-software.com to the mozilla RPM repository
2) ssh dm-vcview04 yum remove varnish && varnish-libs
3) yum install varnish-release
4) Add the following lines to the /etc/varnish/hgweb.vcl file in the vcl_fetch() function:

if (req.url ~ "cmd=changegroupsubset") {
    set beresp.do_stream = true;
}

5) ssh dm-vcview04 service varnish restart

Failback option is to:
1) Remove the lines from /etc/varnish/hgweb.vcl
2) yum remove varnish-release
3) yum install varnish
4) service varnish restart
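The difference between the buffered behaviour and do_stream described above can be illustrated with a toy Python model (illustrative only, not varnish code): a store-and-forward proxy sends the client nothing until the whole object has been fetched from the backend, so the client's time-to-first-byte grows with object size, while a streaming proxy relays chunks as they arrive.

```python
import time

def backend(chunks, delay):
    """Simulated slow backend: yields `chunks` chunks, `delay` seconds apart."""
    for _ in range(chunks):
        time.sleep(delay)
        yield b"x" * 1024

def proxy_store_and_forward(source):
    """Pre-3.0 behaviour: fetch the whole object, then serve it."""
    body = b"".join(source)  # client sees nothing during the backend fetch
    yield body

def proxy_streaming(source):
    """do_stream behaviour: relay each chunk as it arrives."""
    for chunk in source:
        yield chunk

def time_to_first_byte(proxy, chunks=20, delay=0.01):
    """Measure how long the client waits before the first byte arrives."""
    start = time.monotonic()
    gen = proxy(backend(chunks, delay))
    next(gen)                # first byte reaches the client
    ttfb = time.monotonic() - start
    for _ in gen:            # drain the rest of the response
        pass
    return ttfb
```

With store-and-forward, the client's inactivity clock runs for the entire backend fetch, which is why a large clone through the cache could trip a timeout even when the backend was making steady progress.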
Comment 27•13 years ago
Ben, thanks for doing the debugging and replicating to figure this out. The planned interruption here is similar to the varnish restarts we've done a few times earlier in this bug. As such, I don't think this requires a pre-planned tree closure. At worst, a few jobs may need to be re-triggered due to failing clones during the restart - similar to the problem this is intended to solve. I'd like to get an ack from buildduty on Monday that this is OK. If not, please set the needs-treeclosure? flag and we'll follow the usual treeclosure process here.
Comment 28•13 years ago
I've talked with catlee and we've scheduled this for 7AM PST on Tuesday.
Comment 29•13 years ago
I deployed the upgrade this morning during a window from 07:10 to 07:13. The upgrade was successful, and I am currently doing a mock checkout to compare checkout times against the old non-streaming version.

My uncached checkout took 22 minutes 41 seconds.

Could someone from build comment on whether clones are quicker and more reliable? Then we can close this bug out.
Comment 30•13 years ago
I'll keep an eye on the tree today looking for clone timeouts, but we'll need several days of continued good performance before I could verify this as fixed.
Comment 31•13 years ago
This is affecting the FF8 beta4 release. I've closed the trees to slow down the HG load and hopefully get the release to go through.
Blocks: 695429
Severity: critical → blocker
Comment 32•13 years ago
(In reply to Armen Zambrano G. [:armenzg] - Release Engineer from comment #31)
> This is affecting the FF8 beta4 release.
> I've closed the trees to slow down the HG load and hopefully get the release to go through.

FTR this also affected Thunderbird nightly builds as well as our release builds.
Comment 33•13 years ago
As mentioned on irc, ops has been looking at this this morning since before Armen's report. It's unclear whether or not this has anything to do with the varnish upgrade that was done earlier this week.
Comment 34•13 years ago
Could this varnish issue be affecting developer.mozilla.org? It has become highly unresponsive since the hg thing cropped up.
Comment 35•13 years ago
I don't know of any relation or dependency of code between either of them, but maybe someone else cc'd on this bug does.
Comment 36•13 years ago
I believe the load could have been caused by a configuration change in the varnish caching daemon yesterday. I added the lines:

if (req.url ~ "cmd=changegroupsubset") {
    return (pass);
}

yesterday due to some developer reports of receiving cached data and old tips during some clones. Upon commenting that section out today, the load on the backend servers has dropped to normal levels. Some other changes yesterday might have also fixed the old-tip problem.
Severity: blocker → major
Comment 37•13 years ago
Could build please confirm that clones are working reliably again?
Comment 38•13 years ago
Some release jobs that started 45 mins ago have failed, but it is also the first time that 2-3 release jobs have passed the hg clone/update step. Perhaps in 30 mins we should be past the hump for the releases. We could be opening the trees anytime soon.
Comment 39•13 years ago
I was celebrating too fast. I am getting this on the win32 and win32 xulrunner *release* jobs even after having clobbered them.

'd:\\mozilla-build\\hg\\hg.exe' 'clone' '--verbose' '--noupdate' u'http://hg.mozilla.org/releases/mozilla-beta' 'build'
#########
requesting all changes
adding changesets
adding manifests
adding file changes
transaction abort!
rollback completed
** unknown exception encountered, please report by visiting
** http://mercurial.selenic.com/wiki/BugTracker
** Python 2.6.6 (r266:84297, Aug 24 2010, 18:46:32) [MSC v.1500 32 bit (Intel)]
** Mercurial Distributed SCM (version 1.7.5)
** Extensions loaded: win32text, graphlog, share, purge
Traceback (most recent call last):
  File "hg", line 38, in <module>
  File "mercurial\dispatch.pyc", line 16, in run
  File "mercurial\dispatch.pyc", line 36, in dispatch
  File "mercurial\dispatch.pyc", line 58, in _runcatch
  File "mercurial\dispatch.pyc", line 593, in _dispatch
  File "mercurial\dispatch.pyc", line 401, in runcommand
  File "mercurial\dispatch.pyc", line 644, in _runcommand
  File "mercurial\dispatch.pyc", line 598, in checkargs
  File "mercurial\dispatch.pyc", line 591, in <lambda>
  File "mercurial\util.pyc", line 426, in check
  File "mercurial\commands.pyc", line 736, in clone
  File "mercurial\hg.pyc", line 337, in clone
  File "mercurial\localrepo.pyc", line 1886, in clone
  File "mercurial\localrepo.pyc", line 1295, in pull
  File "mercurial\localrepo.pyc", line 1739, in addchangegroup
  File "mercurial\revlog.pyc", line 1381, in addgroup
  File "mercurial\revlog.pyc", line 1220, in _addrevision
mpatch.mpatchError: patch cannot be decoded
Comment 40•13 years ago
For the release we have now made it through the hg steps. Let's see how the L10n repacks behave once they get triggered.
Comment 41•13 years ago
Due to complications with the upgrade, we have completely backed out of the varnish upgrade and config changes and are now back to the same state we were in on Monday (e.g. there will still be intermittent timeouts under load).

The sequence of events:
1) Upgrade varnish to fix the timeouts, and fix scripts to use the new admin interface.
2) Discover that some (all?) 32-bit machines are throwing errors when trying to clone after the upgrade.
3) Make configuration changes to varnish to fix the errors.
4) Discover that the configuration changes are severely overloading the CPUs on the vcview servers and hg is non-functional under load.
5) Try various steps to bring down the load.
6) Back out the configuration change and verify that load comes down. Some (all?) 32-bit machines still can not successfully clone (as in step #2).
7) Back out the varnish upgrade and script changes entirely.
Reporter
Comment 43•13 years ago
https://tbpl.mozilla.org/php/getParsedLog.php?id=6939155&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=6939150&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=6939124&tree=Mozilla-Inbound

Tough to say whether we're right back where we were, or actually worse off, since that's three on a single push.
Reporter
Comment 44•13 years ago
https://tbpl.mozilla.org/php/getParsedLog.php?id=6939512&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=6939508&tree=Mozilla-Inbound
Reporter
Comment 45•13 years ago
https://tbpl.mozilla.org/php/getParsedLog.php?id=6939540&tree=Mozilla-Inbound - I should have waited for the third one on that push, too.
Reporter
Comment 46•13 years ago
https://tbpl.mozilla.org/php/getParsedLog.php?id=6939622&tree=Mozilla-Inbound
https://tbpl.mozilla.org/php/getParsedLog.php?id=6939682&tree=Mozilla-Inbound
Comment 47•13 years ago
I'm not sure what's left to be done without a redesign. The error logs are as follows.

From nginx:

2011/10/19 15:11:09 [error] 25442#0: *236253 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 10.2.90.111, server: hg.mozilla.org, request: "GET /integration/mozilla-inbound?bases=0000000000000000000000000000000000000000&cmd=changegroupsubset&heads=a4bd037eb2524a37d1fd29fe01c9c0ff644a15b8 HTTP/1.1", upstream: "http://10.2.74.71:80/integration/mozilla-inbound?bases=0000000000000000000000000000000000000000&cmd=changegroupsubset&heads=a4bd037eb2524a37d1fd29fe01c9c0ff644a15b8", host: "hg.mozilla.org"

2011/10/19 15:11:57 [error] 25442#0: *237859 upstream timed out (110: Connection timed out) while reading response header from upstream, client: 10.12.49.62, server: hg.mozilla.org, request: "GET /integration/mozilla-inbound?bases=0000000000000000000000000000000000000000&cmd=changegroupsubset&heads=a4bd037eb2524a37d1fd29fe01c9c0ff644a15b8 HTTP/1.1", upstream: "http://10.2.74.71:80/integration/mozilla-inbound?bases=0000000000000000000000000000000000000000&cmd=changegroupsubset&heads=a4bd037eb2524a37d1fd29fe01c9c0ff644a15b8", host: "hg.mozilla.org"

2011/10/19 15:16:20 [crit] 25442#0: *264046 SSL_write() failed (SSL:) while sending to client, client: 95.178.195.205, server: hg.mozilla.org, request: "GET /mozilla-central/json-pushes?full=1&maxhours=24 HTTP/1.1", upstream: "http://10.2.74.71:80/mozilla-central/json-pushes?full=1&maxhours=24", host: "hg.mozilla.org", referrer: "https://tbpl.mozilla.org/"

No relevant error logs are on the dm-vcview webheads.

The timeouts are:

nginx (all in seconds):
proxy_read_timeout 1800;
proxy_connect_timeout 1800;
client_body_timeout 120;
client_header_timeout 120;
keepalive_timeout 120;
send_timeout 120;

varnish:
.connect_timeout = 600s;
.first_byte_timeout = 600s;
.between_bytes_timeout = 600s;

I think the keepalive timeouts and client timeouts in nginx are on the low side.
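With layered proxies like this, the timeout that actually fires for any phase of a request is the minimum across the layers. A toy calculation using the values quoted above (the 3600 s no-output kill is buildbot's, taken from the failure logs earlier in this bug; which layer enforces what is my reading, not a definitive map):

```python
# Timeout values in seconds, as quoted in this comment plus buildbot's
# no-output watchdog from the earlier failure logs.
timeouts = {
    "time-to-first-response-byte": {
        "nginx proxy_read_timeout": 1800,
        "varnish first_byte_timeout": 600,
        "buildbot no-output kill": 3600,
    },
    "gap-between-body-bytes": {
        "nginx proxy_read_timeout": 1800,
        "varnish between_bytes_timeout": 600,
        "buildbot no-output kill": 3600,
    },
}

def binding_timeout(phase):
    """Return the (layer, seconds) pair that fires first for a given phase."""
    return min(timeouts[phase].items(), key=lambda kv: kv[1])

for phase in timeouts:
    layer, limit = binding_timeout(phase)
    print(f"{phase}: {layer} fires first, at {limit}s")
```

On this reading, raising the nginx 120 s client timeouts would not change the binding constraint for a stalled clone: varnish's 600 s timeouts sit below everything else in the chain.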
Reporter
Comment 48•13 years ago
https://tbpl.mozilla.org/php/getParsedLog.php?id=6941142&tree=Mozilla-Inbound
Comment 49•13 years ago
(In reply to Ben Kero [:bkero] from comment #47)
> I'm not sure what's left to be done without a redesign.

I think that's probably best regardless. I know you're picking up other people's trash right now, bkero, but would you mind getting a bug on file for that?

> I think the keepalive timeouts and client timeouts in nginx are on the low side.

I'm assuming you're talking about the 120s values here? Most of my clones today that were timing out were doing so reasonably quickly, i.e. <= 6 minutes, so I can see increasing these timeouts helping some. How disruptive is changing nginx settings? Will we burn clones-in-progress?
Comment 50•13 years ago
I've created a bug, 695934 to track the rebuild. I can try adjusting the nginx settings using 'service nginx reload', which should set the new config settings without affecting existing connections.
Reporter
Comment 53•13 years ago
https://tbpl.mozilla.org/php/getParsedLog.php?id=6970193&tree=Firefox
Comment 54•13 years ago
(In reply to Ben Kero [:bkero] from comment #50)
> I can try adjusting the nginx settings using 'service nginx reload', which
> should set the new config settings without affecting existing connections.

bkero: has this been tried yet? If not, can we schedule a time this week to try? Also, while we're trying to think of other possible fixes, are there any disk issues in play on the hg servers, e.g. are we using NFS from a netapp and is it slowing things down? I just don't know what has already been tried, and it seems (to me) like we're running out of things *to* try on the IT side. FWIW, we're pushing hard on the hg mirrors work on the releng side too.
Reporter
Comment 55•13 years ago
https://tbpl.mozilla.org/php/getParsedLog.php?id=7042678&tree=Mozilla-Inbound
Comment 57•13 years ago
The new timeout settings have already been tried; I adjusted them last week and issued a 'service nginx reload'. Judging by the recent failed logs, it didn't help at all.

I've looked into disk issues; there are none. IOWait on the dm-vcview hosts is below 2%. The problem is CPU utilization: hgweb is very CPU intensive, and doing multiple checkouts creates more load than there are CPUs.

I've run out of things to try on the IT side; each change I've made has had to be backed out because of increased load issues or miscached fragments. The only solution left for this is for the system to be rebuilt. I've filed a bug for that already (linked earlier).
Reporter
Comment 58•13 years ago
https://tbpl.mozilla.org/php/getParsedLog.php?id=7052157&tree=UX
Reporter
Comment 59•13 years ago
https://tbpl.mozilla.org/php/getParsedLog.php?id=7147211&tree=Firefox
Assignee
Comment 60•13 years ago
philor, can you please let me know if this happens again after 1/14/12 08:30 PST?
Comment 61•12 years ago
I can reproduce. A full clone of mozilla-inbound failed on 15-Jan 20:00 JST (03:00 PST).

$ /c/mozilla-build/hg/hg clone http://hg.mozilla.org/integration/mozilla-inbound/ /d/srcinbound
requesting all changes
adding changesets
adding manifests
transaction abort!
rollback completed
abort: premature EOF reading chunk (got 34497 bytes, expected 48897)
Assignee
Comment 62•12 years ago
Closing this now. Alice, we have rebuilt the hg infrastructure today, meaning that you shouldn't see that error again. Similarly with build network clones, philor.
Assignee: bkero → cshields
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Updated•9 years ago
Product: mozilla.org → mozilla.org Graveyard