Closed
Bug 511258
Opened 14 years ago
Closed 14 years ago
Build slaves failing to pull from hg.m.o intermittently
Categories
(mozilla.org Graveyard :: Server Operations, task)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: nthomas, Assigned: chizu)
References
Details
Recent events:
* 1.9.2 branch created Aug 13, adding more builders
* IT changes to how hg clients pull data on Aug 17 [hg.m.o -> dm-hg01/02 -> vcview01 for hg clients, or vcview03 for other hg-related activity, browsing etc]
* 20090818 nightly builds fail out when trying to pull source from hg. Some 38 builds start at 3:02 PDT and request full source trees. Starting roughly half at a time worked better.

Steps already taken:
* Aravind made hg.m.o config changes at 4am PDT today, with further tweaks pushed at 6am
* RelEng has spaced out nightly builds more (bug 511126); two branches now start at each of 3:02, 3:32, and 4:02

Filing this because there are still intermittent problems pulling/updating; details to follow. Aravind is going to set up munin/cacti load monitoring.
Reporter
Comment 1•14 years ago
Tue Aug 18 09:12:44 - 09:17:31: moz2-linux-slave25 on cloning projects/electrolysis
  adding file changes
  transaction abort!
  rollback completed
  abort: premature EOF reading chunk (got 4337 bytes, expected 12856)

Tue Aug 18 12:34:28: moz2-win32-slave14 on cloning build/tools
  adding file changes
  transaction abort!
  rollback completed
  abort: consistency error adding group!

Tue Aug 18 13:15:07 - 13:16:32: moz2-win32-slave22 hits the same error cloning releases/mozilla-1.9.2. Successfully cloned build/tools in the ten seconds before that.

Tue Aug 18 14:18:15 - 14:18:19: moz2-win32-slave29, transaction abort on cloning build/tools
Reporter
Comment 2•14 years ago
Tue Aug 18 14:18:15 2009 - 14:18:18: moz2-linux-slave19, transaction abort trying to clone build/tools
Reporter
Comment 3•14 years ago
Tue Aug 18 16:36:46 (0 secs): moz2-win32-slave34, cloning tracemonkey
  abort: HTTP Error 500: Internal Server Error
Tue Aug 18 16:37:18 (0 secs): moz2-darwin9-slave15, same as above
Tue Aug 18 16:38:40 (0 secs): moz2-win32-slave25, same as above
Tue Aug 18 16:39:48 (0 secs): moz2-linux-slave13, same as above
Comment 5•14 years ago
I don't know if this is any help, but the Thunderbird build boxes have been having these kinds of hg problems for a long time now. My understanding so far was that hg access from the outside world vs. from the inside (moco build network) was different. Could the recent changes have made access from the moco build network more similar to the path from the outside? I can get timing data for some of the recent Thunderbird failures if that would help.
Reporter
Comment 6•14 years ago
Haven't seen any problems in Firefox builds since 16:45 PDT.
Reporter
Comment 7•14 years ago
Not out of the woods yet:

Tue Aug 18 19:35:43 2009 (53 secs): moz2-linux-slave10 cloning mozilla-central:
  adding file changes
  transaction abort!
  rollback completed
  abort: premature EOF reading chunk (got 2192938 bytes, expected 2235011)
Comment 8•14 years ago
Tue Aug 18 21:01:54 2009 (3 mins, 17 secs): moz2-win32-slave01 cloning mozilla-central:
  abort: premature EOF reading chunk (got 3882 bytes, expected 28711)
Comment 9•14 years ago
Tue Aug 18 22:53:45 2009 (3 mins, 47 secs): moz2-linux-slave06 cloning mozilla-central, with something slightly different for variety's sake:
  transaction abort!
  requesting all changes
  adding changesets
  adding manifests
  adding file changes
  rollback completed
  abort: consistency error adding group!
(which is a pretty weird order of output)
Comment 10•14 years ago
WINNT 5.2 mozilla-1.9.2 leak test build on 2009/08/18 22:05:19, moz2-win32-slave02:
  requesting all changes
  adding changesets
  adding manifests
  adding file changes
  transaction abort!
  rollback completed
  abort: premature EOF reading chunk (got 2227094 bytes, expected 2235011)

and Linux mozilla-1.9.2 unit test on 2009/08/18 22:05:19, moz2-linux-slave07:
  requesting all changes
  adding changesets
  adding manifests
  adding file changes
  transaction abort!
  rollback completed
  abort: premature EOF reading chunk (got 237 bytes, expected 304)
Comment 11•14 years ago
I tried submitting try builds 2009/08/18 15:11:13 and 2009/08/19 00:44:03. Windows, Mac and Linux have all had failures, with highlights of:
  abort: premature EOF reading chunk (got 237 bytes, expected 304)
  abort: consistency error adding group!
  abort: integrity check failed on 00manifest.d:15143!
Comment 12•14 years ago
WINCE 5.0 mozilla-1.9.2 build on 2009/08/19 01:17:09, moz2-win32-slave02:
  requesting all changes
  adding changesets
  adding manifests
  adding file changes
  transaction abort!
  rollback completed
  abort: consistency error adding group!
Reporter
Comment 13•14 years ago
Nagios has been saying:
  dm-vcview01:hplog is WARNING: WARNING 0005: Server Blade Enclosure LAN Settings Changed (Enclosure Serial Number USE901JS0V)
Doesn't appear to be flapping.
Comment 14•14 years ago
There are currently four boxes on mozilla-central and two on mozilla-1.9.2 showing this problem:
- WINNT 5.2 mozilla-central unit test
- WINCE 5.0 mozilla-central build
- Linux x86-64 mozilla-central nightly
- Linux mozilla-central leak test build
- Linux mozilla-1.9.2 nightly
- Linux mozilla-1.9.2 unit test
Comment 15•14 years ago
Tons more overnight:
Maemo 1.9.2 nightly @ Wed Aug 19 01:06:16 2009
WinMO Electrolysis nightly @ Wed Aug 19 01:06:06 2009
WINCE 5.0 1.9.2 build @ Wed Aug 19 01:18:57 2009
Linux 1.9.2 unit test @ Wed Aug 19 01:27:59 2009
Linux x86-64 m-c nightly @ Wed Aug 19 03:04:27 2009
Linux 1.9.2 nightly @ Wed Aug 19 03:36:27 2009
OS X 1.9.2 xulrunner @ Wed Aug 19 03:36:38 2009
OS X places nightly @ Wed Aug 19 04:06:01 2009
Linux x86-64 1.9.2 xulrunner @ Wed Aug 19 04:40:11 2009
win32 m-c unit test @ Wed Aug 19 04:47:53 2009
WINCE 5.0 m-c build @ Wed Aug 19 04:53:40 2009
Linux m-c leak test @ Wed Aug 19 04:57:55 2009
Linux electrolysis @ Wed Aug 19 05:36:18 2009

There's no sign of this slowing down.
Severity: major → critical
Comment 16•14 years ago
Oh and if you hadn't guessed, this has been affecting Thunderbird & Sunbird builds as well (yesterday and today) - mainly the nightlies that all start at 3am PST.
Comment 17•14 years ago
We (me and another guy from the hg crew) think this can only be caused by network corruption, so I'm afraid there might not be much I can do.
Severity: critical → major
Comment 18•14 years ago
Apparently Aravind is on vacation, unassigning this from him.
Assignee: aravind → server-ops
Severity: major → critical
Comment 19•14 years ago
[61] dm-vcview01:hplog is WARNING: WARNING 0005: Server Blade Enclosure LAN Settings Changed (Enclosure Serial Number USE901JS0V)

Been getting those for a while; no idea if it's related to the problem.
Comment 20•14 years ago
The switch doesn't see any bouncing port, and if it were bouncing, bonding should handle that. That IML message looks old (it was from 2008), so I'd discount it altogether.
Comment 21•14 years ago
dm-vcview02 was not responding for some reason. Load was 0 on the box but it was extremely sluggish. I've since kicked the box and brought it back up and it's helping take some of the load off the other two.
Assignee: server-ops → shyam
Comment 22•14 years ago
(In reply to comment #15)
> Linux m-c leak test @ Wed Aug 19 04:57:55 2009
> Linux electrolysis @ Wed Aug 19 05:36:18 2009

These are the last two I have seen on both masters.
Assignee: shyam → server-ops
Comment 23•14 years ago
Please don't reset the assignee; it sends pages to multiple folks in IT. Are those times PDT? If so, that was almost 2 hours ago.
Assignee: server-ops → shyam
Comment 24•14 years ago
(In reply to comment #23)
> Are those times PDT? If so that was almost 2 hours ago.

It is PDT.

Linux electrolysis unit test @ Wed Aug 19 07:42:20 2009
Win32 electrolysis leak test @ Wed Aug 19 07:36:44 2009
Comment 25•14 years ago
Okay, I've made one more tweak to streamline traffic. Let's see if we see any more failures.
Comment 26•14 years ago
This could be a different bug (possibly caused by the earlier problems, but I don't know enough to differentiate). I just pushed to the try server and got:
  requesting all changes
  adding changesets
  adding manifests
  transaction abort!
  rollback completed
  abort: integrity check failed on 00manifest.d:20033!
on WINNT 5.2 try hg build on 2009/08/19 07:44:42 (PDT).
Comment 27•14 years ago
I've managed to catch one such failure when pulling from build/buildbot-configs; I have debug logs, and I've also put up a tarball of the busted repository. In its current state, it's simply impossible to ever pull from it again. http://gozer.ectoplasm.org/tmp/buildbot-configs-broken.tar.bz2
Assignee: shyam → server-ops
Comment 28•14 years ago
Taking it off critical because it keeps paging us and people keep changing the assignee.
Severity: critical → major
Comment 29•14 years ago
And I got this log from a client that only does

  $> while [ 1 ] ; do date; hg pull -r tip; sleep 60; done

Tue Aug 18 05:19:02 PDT 2009
using https://hg.mozilla.org/build/buildbot-configs
pulling from https://hg.mozilla.org/build/buildbot-configs
sending capabilities command
capabilities: unbundle=HG10GZ,HG10BZ,HG10UN lookup changegroupsubset
sending lookup command
searching for changes
no changes found

Tue Aug 18 05:20:01 PDT 2009
using https://hg.mozilla.org/build/buildbot-configs
pulling from https://hg.mozilla.org/build/buildbot-configs
sending capabilities command
capabilities: unbundle=HG10GZ,HG10BZ,HG10UN lookup changegroupsubset
sending lookup command
searching for changes
sending branches command
examining 303ae2bc331c:77547da571d1
found incomplete branch 303ae2bc331c:77547da571d1
sending between command
narrowing 1:7 e66d476658c0
narrowing 2:7 8a50eacda69f
found new branch changeset e66d476658c0
found new changesets starting at e66d476658c0
1 total queries
sending changegroupsubset command
adding changesets
add changeset e66d476658c0
add changeset 303ae2bc331c
adding manifests
adding file changes
adding mozilla2/config.py revisions
transaction abort!
rollback completed
Traceback (most recent call last):
  File "/opt/csw/lib/python/site-packages/mercurial/dispatch.py", line 45, in _runcatch
    return _dispatch(ui, args)
  File "/opt/csw/lib/python/site-packages/mercurial/dispatch.py", line 364, in _dispatch
    ret = _runcommand(ui, options, cmd, d)
  File "/opt/csw/lib/python/site-packages/mercurial/dispatch.py", line 417, in _runcommand
    return checkargs()
  File "/opt/csw/lib/python/site-packages/mercurial/dispatch.py", line 373, in checkargs
    return cmdfunc()
  File "/opt/csw/lib/python/site-packages/mercurial/dispatch.py", line 356, in <lambda>
    d = lambda: func(ui, repo, *args, **cmdoptions)
  File "/opt/csw/lib/python/site-packages/mercurial/commands.py", line 2053, in pull
    modheads = repo.pull(other, heads=revs, force=opts['force'])
  File "/opt/csw/lib/python/site-packages/mercurial/localrepo.py", line 1484, in pull
    return self.addchangegroup(cg, 'pull', remote.url())
  File "/opt/csw/lib/python/site-packages/mercurial/localrepo.py", line 2016, in addchangegroup
    if fl.addgroup(chunkiter, revmap, trp) is None:
  File "/opt/csw/lib/python/site-packages/mercurial/revlog.py", line 1182, in addgroup
    raise LookupError(p, self.indexfile, _('unknown parent'))
LookupError: data/mozilla2/config.py.i@11b629ac50b9: unknown parent
abort: data/mozilla2/config.py.i@11b629ac50b9: unknown parent!

Once that one happens, the checkout is stuck and will never be able to pull successfully.
Comment 30•14 years ago
(In reply to comment #28)
> Taking it off critical because it keeps paging us and people keep changing the
> assignee.

You're supposed to take it off critical when it's fixed, right?
Comment 31•14 years ago
osx m-c test mochitests @ Wed Aug 19 08:49:40 2009
Linux Fennec desktop m-c build @ Wed Aug 19 08:53:41 2009
Comment 32•14 years ago
(In reply to comment #31)
> osx m-c test mochitests @ Wed Aug 19 08:49:40 2009

destination directory: tools
requesting all changes
adding changesets
adding manifests
adding file changes
transaction abort!
rollback completed
abort: premature EOF reading chunk (got 1739812 bytes, expected 1872976)

> Linux Fennec desktop m-c build @ Wed Aug 19 08:53:41 2009

destination directory: tools
requesting all changes
adding changesets
adding manifests
adding file changes
transaction abort!
rollback completed
abort: consistency error adding group!

(In reply to comment #24)
> Linux electrolysis unit test @ Wed Aug 19 07:42:20 2009

requesting all changes
adding changesets
adding manifests
adding file changes
transaction abort!
rollback completed
abort: consistency error adding group!

> Win32 electrolysis leak test @ Wed Aug 19 07:36:44 2009

requesting all changes
adding changesets
adding manifests
adding file changes
transaction abort!
rollback completed
abort: integrity check failed on data/db/sqlite3/src/sqlite3.c.d:22!
Comment 33•14 years ago
Is this related?

http://tinderbox.mozilla.org/showlog.cgi?log=MozillaTry/1250696802.1250697083.9692.gz

/tools/python/bin/hg clone --verbose --noupdate http://hg.mozilla.org/try /builds/slave/sendchange-linux-hg/build
 in dir /builds/slave/sendchange-linux-hg (timeout 3600 secs)
...
abort: HTTP Error 414: Request-URI Too Large
requesting all changes
elapsedTime=0.747700
program finished with exit code 255
(In reply to comment #17)
> We (me and another guy from the hg crew) think this can only be caused by
> network corruption, so I'm afraid there might not be much I can do.

What do you mean by "network corruption"? Isn't this all over TCP, which has checksums and such? And shouldn't mercurial be robust in the face of such aborts due to abrupt connection termination?
Comment 35•14 years ago
(In reply to comment #33)
> Is this related?
>
> http://tinderbox.mozilla.org/showlog.cgi?log=MozillaTry/1250696802.1250697083.9692.gz
>
> abort: HTTP Error 414: Request-URI Too Large

This is a separate issue with the Try server. It means we've grown too many heads there and it needs to be pruned.
Comment 36•14 years ago
I vaguely remember having to do the math for TCP/IP checksum reliability years ago. It's in the ballpark of one uncorrected error per 10 million packets with single-bit errors. Ethernet is pretty reliable these days, so you don't actually have a bit error for every packet. In other words, the probability of an uncorrected checksum failure is somewhere between ridiculously unlikely and astronomical.
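A rough sketch of the arithmetic behind this kind of estimate. The bit-error rate below is an assumed illustrative value, not a measurement of this network, and note that the 16-bit ones'-complement TCP checksum always catches single-bit errors; the 1-in-65536 miss rate applies to random multi-bit corruption:

```python
# Back-of-envelope estimate of how often corruption slips past the TCP
# checksum. All inputs are illustrative assumptions, not measured values.

CHECKSUM_MISS = 1.0 / 2**16   # fraction of random error patterns that evade
                              # a 16-bit checksum
PACKET_BITS = 1500 * 8        # one Ethernet MTU-sized packet
BIT_ERROR_RATE = 1e-10        # assumed residual error rate after the link CRC

p_corrupt_packet = 1 - (1 - BIT_ERROR_RATE) ** PACKET_BITS
p_undetected = p_corrupt_packet * CHECKSUM_MISS

print(f"P(packet corrupted)           ~= {p_corrupt_packet:.2e}")
print(f"P(corruption passes checksum) ~= {p_undetected:.2e}")
```

With these assumptions the undetected-corruption rate comes out around one packet in tens of billions, which is why sustained, repeatable corruption points at something other than random line noise.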
Comment 37•14 years ago
(In reply to comment #35)
> This is a separate issue with the Try server. It means we've grown too many
> heads there and it needs to be pruned.

I filed bug 511471, which will get dealt with tomorrow morning, fwiw.
Comment 38•14 years ago
- Linux m-c test everythingelse @ Wed Aug 19 09:29:36 2009
  destination directory: tools
  requesting all changes
  adding changesets
  adding manifests
  adding file changes
  transaction abort!
  rollback completed
  abort: consistency error adding group!

- Linux tracemonkey leak test build @ Wed Aug 19 10:13:44 2009
  requesting all changes
  adding changesets
  adding manifests
  adding file changes
  transaction abort!
  rollback completed
  abort: integrity check failed on data/db/sqlite3/src/sqlite3.c.d:22!

- Linux electrolysis build @ Wed Aug 19 10:34:31 2009
  requesting all changes
  adding changesets
  adding manifests
  adding file changes
  transaction abort!
  rollback completed
  abort: premature EOF reading chunk (got 10222 bytes, expected 28711)

- Linux Fennec Desktop electrolysis @ Wed Aug 19 11:35:50 2009
  requesting all changes
  adding changesets
  adding manifests
  adding file changes
  transaction abort!
  rollback completed
  abort: consistency error adding group!
Assignee
Comment 39•14 years ago
The cause of this seems to be queries stacking up on the mercurial backend servers. Yesterday one of the mercurial servers was switched from a VM to hardware, which might be related.
Comment 40•14 years ago
No more incidents at this point. Last incident at 11:35:50 mentioned in comment 38. I will check once more before I head out.
Assignee
Comment 41•14 years ago
Load seems more evenly spread out now. Shyam's enabled performance monitoring in munin for these boxes, and we've got one more mercurial server in progress if that becomes needed. We'll monitor the data munin collects for any further problems.
Comment 42•14 years ago
- Linux tracemonkey build @ Wed Aug 19 12:50:36 2009
  requesting all changes
  adding changesets
  adding manifests
  adding file changes
  transaction abort!
  rollback completed
  abort: premature EOF reading chunk (got 2149500 bytes, expected 2235011)
Comment 43•14 years ago
Another one on try: http://tinderbox.mozilla.org/showlog.cgi?log=MozillaTry/1250713363.1250713512.7406.gz&fulltext=1
Assignee
Comment 44•14 years ago
(In reply to comment #42)
> - Linux tracemonkey build @ Wed Aug 19 12:50:36 2009
> abort: premature EOF reading chunk (got 2149500 bytes, expected 2235011)

This was a proxy restart for a config change.

(In reply to comment #43)
> Another one on try:
> http://tinderbox.mozilla.org/showlog.cgi?log=MozillaTry/1250713363.1250713512.7406.gz&fulltext=1

This is the bug popping up again. Munin shows a spike in NFS traffic around then. We could be starving some hg processes for I/O by overloading our NFS capacity. Looking into this now.
Reporter
Comment 45•14 years ago
WINNT 5.2 tracemonkey build between Wed Aug 19 15:19:56 PDT and 15:22:10, on cloning tracemonkey:
  transaction abort!
  rollback completed
  abort: premature EOF reading chunk (got 13842 bytes, expected 15020)
Comment 46•14 years ago
Linux tracemonkey unit test on 2009/08/19 15:28:09
  rollback completed
  abort: premature EOF reading chunk (got 6186 bytes, expected 15020)
  elapsedTime=72.905727
  program finished with exit code 255

Linux tracemonkey unit test on 2009/08/19 15:35:07
  rollback completed
  abort: consistency error adding group!
  elapsedTime=284.373095
  program finished with exit code 255

WINNT 5.2 tracemonkey build on 2009/08/19 15:16:19
  transaction abort!
  rollback completed
  abort: premature EOF reading chunk (got 13842 bytes, expected 15020)
  elapsedTime=133.859000
  program finished with exit code -1

WINNT 5.2 tracemonkey build on 2009/08/19 15:29:24
  rollback completed
  abort: premature EOF reading chunk (got 154 bytes, expected 349)
  elapsedTime=27.860000
  program finished with exit code -1

Could we at least have a work-around in the build script that tries the pull 3 times in a row or something like that?
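A minimal sketch of such a retry wrapper, assuming a POSIX-shell build step; the function name, retry count, pause, and example URL are all illustrative:

```shell
#!/bin/sh
# Hypothetical retry wrapper for flaky "hg clone"/"hg pull" steps: run the
# given command up to 3 times before giving up, as suggested above.
retry_hg() {
    attempts=3
    while [ "$attempts" -gt 0 ]; do
        "$@" && return 0                     # success: stop retrying
        attempts=$((attempts - 1))
        echo "command failed; $attempts attempt(s) left" >&2
        [ "$attempts" -gt 0 ] && sleep 2     # short pause; tune for real use
    done
    return 1                                 # every attempt failed
}

# Example usage (URL as used elsewhere in this bug):
# retry_hg hg clone http://hg.mozilla.org/mozilla-central build
```

This only papers over transient server-side failures; a clone that aborts three times in a row still fails the build step, which is probably the right behavior.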
Comment 47•14 years ago
Summary of a quick chat with chizu as he headed out to the colo: hg.m.o is actually a round robin across 2 VMs and 1 physical machine. Yesterday this was changed to 1 VM and 2 physical machines. It turns out the physical machines have less RAM than the VM. While I/O load on the physical machines is OK, this is causing the file cache on the physical machines to get borked. To fix this, chizu is going to upgrade the RAM on each physical machine, one at a time. Each machine should take <10 mins to upgrade. Technically, powering down one machine at a time like this does not mean closing hg.m.o, but given that the two remaining machines are low on RAM, it's unlikely that failover to them will go well. Therefore we'll do a quick tree closure as soon as chizu is at the colo. (/me has also notified the sheriff of the upcoming downtime)
Comment 48•14 years ago
dm-vcview01 and 03 were upgraded from 4GB RAM to 8GB RAM and restarted.

We *think* the hg problem is fixed. But obviously, if you see/hear otherwise, we'd like to know!
Comment 49•14 years ago
(In reply to comment #48)
> dm-vcview01, 03 were upgraded from 4GB ram to 8GB ram and restarted.
>
> We *think* the hg problem is fixed.

What was the root cause?
I think mrz is going to do a post-mortem description later, but people have been working on this stuff flat-out since yesterday, so I'm happy to let them catch their breath before they dissect the causes. :-)
Reporter
Comment 51•14 years ago
Not much activity on the main branches since 1800 PDT to test the fix, so I've fired off about 12 builds on mozilla-1.9.2. Still getting a lot of transaction abort messages on try server builds, so perhaps we should go ahead with bug 511471 asap. There are about 590 heads on that repo at the moment.
Comment 52•14 years ago
(In reply to comment #50)
> I think mrz is going to do a post-mortem description later, but people have
> been working on this stuff flat-out since yesterday, so I'm happy to let them
> catch their breath before they dissect the causes. :-)

Adding more RAM as a quick fix is cool. There can't be a post-mortem until the root cause is determined.
Comment 53•14 years ago
Multiple failures on the TM tree:
  rollback completed
  abort: premature EOF reading chunk (got 26366 bytes, expected 42292)
  elapsedTime=163.192947
  program finished with exit code 255
Assignee
Comment 54•14 years ago
(In reply to comment #53)
> Multiple failures on TM tree.
>
> abort: premature EOF reading chunk (got 26366 bytes, expected 42292)

If these happened at 7:49, it was caused by a configuration change that needed an apache reload.
Just to plan out stuff over the next while, is there a point at which we'll just roll back all changes (any new build hosts, any config changes, any underlying VM/hardware moves, etc.) to the last known good state? I understand and appreciate that people are working hard to move forward to a better state with faster hardware and so forth, but we're hurting pretty badly right now, and it's not clear what the plan is now that the additional RAM didn't make things all better.

Let me know if there are things I can do (get the tree closed, have people look at python core dumps, pull in extra analysis help, anything) to help move this forward; this is basically the most important thing in my life right now. (I'm 647.296.1922 if you think of something you need during the night.)
Assignee
Comment 56•14 years ago
Some updates:
* Bug 511471 was tried without improvement.
* Looking at the history of what's related to this, it seems it's been a mistake to follow the VM -> hardware change, as this predates it.
* RAM was 4GB on the dm-vcview01 VM and 4GB on the dm-vcview01 hardware. This is now 8GB, which should improve performance, but does not affect the timeouts.
* Configuration changes were made on dm-vcview01, but have been reverted to the original VM configs.
Comment 57•14 years ago
Not sure if this is related:

WINCE 5.0 mozilla-1.9.2 build on 2009/08/19 22:44:43
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox3.6/1250747083.1250747193.22109.gz
  requesting all changes
  adding changesets
  adding manifests
  adding file changes
  transaction abort!
  rollback completed
  abort: integrity check failed on data/db/sqlite3/src/sqlite3.c.d:22!
  elapsedTime=96.688000
  program finished with exit code -1

and Linux mozilla-1.9.2 build on 2009/08/19 22:44:43
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox3.6/1250747083.1250747193.22109.gz
  requesting all changes
  adding changesets
  adding manifests
  adding file changes
  transaction abort!
  rollback completed
  abort: integrity check failed on data/db/sqlite3/src/sqlite3.c.d:22!
  elapsedTime=117.885569
  program finished with exit code 255
Comment 58•14 years ago
(In reply to comment #55)
> Just to plan out stuff over the next while, is there a point at which we'll
> just roll back all changes (any new build hosts, any config changes, any
> underlying VM/hardware moves, etc.) to the last known good state?

I mentioned this to shaver out-of-band but will include it here too. Nothing on the system side had changed prior to this bug opening. Build increased load last Thursday, but IT hadn't changed any systems or configs. The only changes we've made have been reactive, and I suspect we'd be in a worse state if we reverted to the VMs, which were clearly CPU-starved.

Point #2 in comment #0 really means to track the migration from VM to physical server (and nthomas said so too on IRC). The theory right now is that this has been happening for a while and has only recently gotten bad enough to be noticeable. Aravind has increased the timeouts on nginx, so Apache shouldn't have any more "failure to send data" errors.
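The actual hg.m.o configuration isn't shown in this bug, but the nginx knobs for the kind of timeout increase mentioned above are of this general shape; the upstream name, location, and values are illustrative assumptions:

```nginx
# Illustrative reverse-proxy timeouts; not the real hg.mozilla.org config.
location / {
    proxy_pass            http://hg_backends;
    proxy_connect_timeout 30s;
    proxy_send_timeout    600s;   # allow slow clients on large changegroups
    proxy_read_timeout    600s;   # allow slow backend responses
}
```

Raising these hides slow backends rather than fixing them, which matches the "reactive" framing in the comment above.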
Comment 59•14 years ago
On the backend (read only) servers that serve this content, apache says this for a try repo clone:

10.2.74.13 - - [19/Aug/2009:23:13:40 -0700] "GET /try?bases=0000000000000000000000000000000000000000&cmd=changegroupsubset&heads=c4cdfd9a2881f088dca2010604182addbbb2fe38+3c5a5cc10ea95b787198a1e7817af24fad7427b5+f34211125e93965d8e908e4ec11c0740ef5ccf97+dc02eff6d8c0164cd9cae0151b8c6a8b914c7c8d+b42a1594f61338e48a8fab40a8a2cbfee0120e9d+cab97aca485d30972f9a624eb92ca5e027571692+c4ebfef7afa6d7b417c9476bceb27053d4118157 HTTP/1.0" 200 148024974 "-" "mercurial/proto-1.0"

That's one that works. However, I occasionally see entries that say:

10.2.74.13 - - [19/Aug/2009:23:13:39 -0700] "GET /try?bases=0000000000000000000000000000000000000000&cmd=changegroupsubset&heads=c4cdfd9a2881f088dca2010604182addbbb2fe38+3c5a5cc10ea95b787198a1e7817af24fad7427b5+f34211125e93965d8e908e4ec11c0740ef5ccf97+dc02eff6d8c0164cd9cae0151b8c6a8b914c7c8d+b42a1594f61338e48a8fab40a8a2cbfee0120e9d+cab97aca485d30972f9a624eb92ca5e027571692+c4ebfef7afa6d7b417c9476bceb27053d4118157 HTTP/1.0" 200 56464598 "-" "mercurial/proto-1.0"

If you look carefully, you will see that the number of bytes returned is different for the exact same query. Those, I am willing to bet, are the ones that fail. Why apache returns a 200 status code but fewer bytes is still a mystery. Looking into that some more.
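One mechanical way to spot these truncated responses is to compare byte counts across identical request lines. The sketch below runs against a synthetic two-line log in the same common log format (byte counts taken from the entries above; the URL is shortened); in that format field $7 is the request URL and $10 the response size:

```shell
# Build a two-line sample log (synthetic, with the byte counts quoted in the
# comment above), then flag identical changegroupsubset requests that
# returned different byte counts.
cat > sample_access.log <<'EOF'
10.2.74.13 - - [19/Aug/2009:23:13:40 -0700] "GET /try?cmd=changegroupsubset HTTP/1.0" 200 148024974 "-" "mercurial/proto-1.0"
10.2.74.13 - - [19/Aug/2009:23:13:39 -0700] "GET /try?cmd=changegroupsubset HTTP/1.0" 200 56464598 "-" "mercurial/proto-1.0"
EOF

# $7 is the request URL, $10 the response size, in the common log format.
awk '$7 ~ /cmd=changegroupsubset/ {
         if (($7 in bytes) && bytes[$7] != $10)
             print "size mismatch for " $7 ": " bytes[$7] " vs " $10
         bytes[$7] = $10
     }' sample_access.log
```

Run against a real access log this would print one line per repeated request whose response size changed, which is exactly the symptom described above.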
Comment 60•14 years ago
So based on that last comment, I changed apache to isolate the mod_wsgi instances between requests, so they don't trample on each other and fuck up my otherwise beautiful mercurial setup: apache now uses a new thread for every new mercurial request. Aki tested this, and he was able to consistently check out the try repo 10 times in parallel (in two different tests; I am calling that resolved beyond a doubt). As always, please re-open if things are still broken.
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
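The bug doesn't include the actual Apache directives used for the fix in comment 60, but a mod_wsgi daemon-mode setup of that general shape (single-threaded worker processes, recycled aggressively so requests can't trample each other) might look like this; every name, path, and value here is an illustrative assumption:

```apache
# Illustrative mod_wsgi daemon-mode isolation; not the real hg.m.o config.
WSGIDaemonProcess hgweb processes=8 threads=1 maximum-requests=1
WSGIScriptAlias / /var/hg/hgweb.wsgi
<Location />
    WSGIProcessGroup hgweb
</Location>
```

With threads=1 each worker process serves one mercurial request at a time, and maximum-requests=1 replaces the process after each request, trading throughput for the request isolation described in comment 60.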
Comment 61•14 years ago
gd and this is why we celebrate sysadmin appreciation day (http://www.sysadminday.com/). And you're on vacation, no less! Thanks!
Comment 62•14 years ago
Fired off a bunch of nightlies (>10 I think) across the two production-masters. They've all passed the hg clone steps and hit the compile steps. I'm ready to call this fixed, but the nightlies tonight should give us an additional level of verification. Thanks everyone!
Comment 63•14 years ago
(In reply to comment #62)
> Fired off a bunch of nightlies (>10 I think) across the two production-masters.
> They've all passed the hg clone steps and hit the compile steps. I'm ready to
> call this fixed, but the nightlies tonight should give us an additional level
> of verification.
>
> Thanks everyone!

Cool, let's hope the hack holds up. Thanks to Nick and Aki for being very patient and helping me isolate and test the patch, and to justdave, chizu and fox2mike for seeing this through all the permutations and combinations.
Comment 64•14 years ago
Win2k3 comm-central build on 2009/08/20 16:49:24
http://tinderbox.mozilla.org/showlog.cgi?log=Thunderbird/1250812164.1250812540.26613.gz
  pulling from http://hg.mozilla.org/mozilla-central
  searching for changes
  adding changesets
  adding manifests
  adding file changes
  transaction abort!
  rollback completed
Severity: blocker → major
Comment 65•14 years ago
Win2k3 comm-1.9.1 check on 2009/08/20 12:33:12
http://tinderbox.mozilla.org/showlog.cgi?log=Thunderbird3.0/1250796792.1250797958.22186.gz
  hg clone http://hg.mozilla.org/comm-central
  requesting all changes
  adding changesets
  adding manifests
  adding file changes
  transaction abort!
  rollback completed
  abort: connection ended unexpectedly
Comment 66•14 years ago
And two more from overnight:

Win2k3 comm-central build, Date: 2009/08/20 22:17:51
http://tinderbox.mozilla.org/showlog.cgi?log=Thunderbird/1250831871.1250832608.25758.gz
  pulling from http://hg.mozilla.org/mozilla-central
  searching for changes
  adding changesets
  adding manifests
  adding file changes

Win2k3 comm-1.9.1 nightly, Date: 2009/08/21 03:00:01
http://tinderbox.mozilla.org/showlog.cgi?log=Thunderbird3.0/1250848801.1250851373.6971.gz
  requesting all changes
  adding changesets
  adding manifests
  adding file changes
  transaction abort!
  rollback completed
Comment 67•14 years ago
What IP address are those boxes getting for "hg.mozilla.org"?
Comment 68•14 years ago
Seems to be round-robin between 2 IPs:

63.245.208.189 dm-hg02.mozilla.org.
63.245.208.188 dm-hg01.mozilla.org.
Comment 69•14 years ago
Win2k3 comm-central nightly, Date: 2009/08/22 03:00:00
http://tinderbox.mozilla.org/showlog.cgi?log=Thunderbird/1250935200.1250936732.2481.gz

One thing worth noting is that this particular clone died almost precisely 20 minutes after it started.
Comment 70•14 years ago
(In reply to comment #68)
> Seems to be round-robin between 2 IPs:
>
> 63.245.208.189 dm-hg02.mozilla.org.
> 63.245.208.188 dm-hg01.mozilla.org.

chizu, are those the ones he should be hitting externally? vcview03 isn't in that list.
Comment 71•14 years ago
Five more overnight, with a relatively new one on Linux:

Win2k3 comm-central nightly, Date: 2009/08/25 06:37:34
http://tinderbox.mozilla.org/showlog.cgi?log=Thunderbird/1251207454.1251208548.19215.gz
  [hg] transaction abort!

Win2k3 comm-central nightly, Date: 2009/08/25 04:02:37
http://tinderbox.mozilla.org/showlog.cgi?log=Thunderbird/1251198157.1251199875.21202.gz
  [hg] transaction abort!

Win2k3 comm-central nightly, Date: 2009/08/25 03:16:08
http://tinderbox.mozilla.org/showlog.cgi?log=Thunderbird/1251195368.1251196665.16025.gz
  [hg] transaction abort!

Win2k3 comm-1.9.1 nightly, Date: 2009/08/25 03:19:15
http://tinderbox.mozilla.org/showlog.cgi?log=Thunderbird3.0/1251195555.1251197210.22055.gz
  [hg] transaction abort!

Linux comm-1.9.1 nightly, Date: 2009/08/25 03:00:00
http://tinderbox.mozilla.org/showlog.cgi?log=Thunderbird3.0/1251194400.1251194548.25822.gz
  raise IOError(None, _('connection ended unexpectedly'))
  IOError: (None, 'connection ended unexpectedly')
  abort: connection ended unexpectedly
Comment 72•14 years ago
One theory worth considering is that our connectivity to hg.m.o is over a 100Mb link, and I'll guess the moco build machines can get at it via gigabit. This would mean everything we do would be at least 10 times slower, and could make timeouts/bailouts on the server side more likely?
Comment 73•14 years ago
If you're bandwidth constrained you should see that in any trend graphs.
Comment 74•14 years ago
Looking at bandwidth usage graphs, specifically for my uplink to MoCo, I can see we are nowhere near any line limit. Thinking about it, in my various hg benchmarks I've never found hg able to get close to wire speed, so it's probably a red herring.

For fun and profit, try and see how fast you can grab http://hg.mozilla.org/mozilla-central/archive/tip.tar.bz2; 600K/s is pretty much the best I've ever seen.
Comment 75•14 years ago
Win2k3 comm-central bloat, Date: 2009/08/25 23:48:10
http://tinderbox.mozilla.org/showlog.cgi?log=Thunderbird/1251269290.1251269528.7496.gz
  abort: connection ended unexpectedly
  Time: real 95.972 secs (user 3.906+0.000 sys 0.891+0.000)
Reporter
Comment 76•14 years ago
There are still failures occurring on MoCo builders, but only infrequently. E.g. "WINNT 5.2 places nightly" at Sun Aug 30 04:02:00 2009, cloning build/tools:
  destination directory: tools
  requesting all changes
  abort: HTTP Error 502: Bad Gateway

"Linux try hg build" at Sat Aug 29 13:22:36 2009:
  transaction abort!
  rollback completed
  abort: connection ended unexpectedly

And also at Fri Aug 28 17:11:57 2009.
Reporter
Comment 77•14 years ago
Failed to pull build/tools a bit after 2009/09/06 04:02:00
http://tinderbox.mozilla.org/showlog.cgi?log=Electrolysis/1252234920.1252238938.24591.gz
  adding file changes
  transaction abort!
  rollback completed
  abort: premature EOF reading chunk (got 700601 bytes, expected 2069754)

Much the same, only on Linux, in http://tinderbox.mozilla.org/showlog.cgi?log=Electrolysis/1252234920.1252236746.30486.gz
Updated•9 years ago
Product: mozilla.org → mozilla.org Graveyard