Closed Bug 511258 Opened 14 years ago Closed 14 years ago

Build slaves failing to pull from hg.m.o intermittently

Categories

(mozilla.org Graveyard :: Server Operations, task)

Hardware: x86
OS: All
Type: task
Priority: Not set
Severity: major

Tracking

(Not tracked)

Status: RESOLVED FIXED

People

(Reporter: nthomas, Assigned: chizu)


Recent events:
* 1.9.2 branch created Aug 13, adding more builders
* IT changes to how hg clients pull data on Aug 17 [hg.m.o -> dm-hg01/02 -> vcview01 for hg clients, or vcview03 for other hg-related activity (browsing, etc.)]
* 20090818 nightly builds failed when trying to pull source from hg. Some 38 builds start at 3:02 PDT and request full source trees; starting roughly half at a time worked better.

Steps already taken:
* Aravind made hg.m.o config changes at 4am PDT today, and further tweaks pushed at 6am
* RelEng has spaced out nightly builds more (bug 511126), now two branches starting at 3:02, 3:32, and 4:02.

Filing this because there are still intermittent problems pulling/updating; details to follow. Aravind is going to set up munin/cacti load monitoring.

Tue Aug 18 09:12:44 - 09:17:31: moz2-linux-slave25 on cloning projects/electrolysis
  adding file changes
  transaction abort!
  rollback completed
  abort: premature EOF reading chunk (got 4337 bytes, expected 12856)

Tue Aug 18 12:34:28: moz2-win32-slave14 on cloning build/tools
  adding file changes
  transaction abort!
  rollback completed
  abort: consistency error adding group!

Tue Aug 18 13:15:07 - 13:16:32: moz2-win32-slave22 hits the same error cloning releases/mozilla-1.9.2. It had successfully cloned build/tools in the ten seconds before that.
 
Tue Aug 18 14:18:15 - 14:18:19: moz2-win32-slave29, transaction abort on cloning build/tools
Tue Aug 18 14:18:15 2009 - 14:18:18: moz2-linux-slave19, transaction abort trying to clone build/tools
Tue Aug 18 16:36:46 (0 secs): moz2-win32-slave34, cloning tracemonkey
  abort: HTTP Error 500: Internal Server Error

Tue Aug 18 16:37:18 (0 secs): moz2-darwin9-slave15, same as above
Tue Aug 18 16:38:40 (0 secs): moz2-win32-slave25, same as above
Tue Aug 18 16:39:48 (0 secs): moz2-linux-slave13, same as above
I don't know if this is any help, but the Thunderbird build boxes have been having this kind of hg problem for a long time now. My understanding so far was that hg access from the outside world vs. from the inside (the moco build network) was different. Could the recent changes have made access from the moco build network more similar to the path from the outside?

I can get timing data for some of the recent Thunderbird failures if that would help.

Haven't seen any problems in Firefox builds since 16:45 PDT.
Not out of the woods yet:

Tue Aug 18 19:35:43 2009 (53 secs): moz2-linux-slave10 cloning mozilla-central:
  adding file changes
  transaction abort!
  rollback completed
  abort: premature EOF reading chunk (got 2192938 bytes, expected 2235011)
Tue Aug 18 21:01:54 2009 (3 mins, 17 secs): moz2-win32-slave01 cloning mozilla-central:
abort: premature EOF reading chunk (got 3882 bytes, expected 28711)
Tue Aug 18 22:53:45 2009 (3 mins, 47 secs): moz2-linux-slave06 cloning mozilla-central, with something slightly different for variety's sake

transaction abort!
requesting all changes
adding changesets
adding manifests
adding file changes
rollback completed
abort: consistency error adding group!

(which is a pretty weird order of output)
mozilla-1.9.2

WINNT 5.2 mozilla-1.9.2 leak test build on 2009/08/18 22:05:19
moz2-win32-slave02

requesting all changes
adding changesets
adding manifests
adding file changes
transaction abort!
rollback completed
abort: premature EOF reading chunk (got 2227094 bytes, expected 2235011)

and

Linux mozilla-1.9.2 unit test on 2009/08/18 22:05:19
moz2-linux-slave07

requesting all changes
adding changesets
adding manifests
adding file changes
transaction abort!
rollback completed
abort: premature EOF reading chunk (got 237 bytes, expected 304)
I tried submitting try builds 2009/08/18 15:11:13 and 2009/08/19 00:44:03. Windows, Mac and Linux have all had failures with highlights of:

abort: premature EOF reading chunk (got 237 bytes, expected 304)
abort: consistency error adding group!
abort: integrity check failed on 00manifest.d:15143!
WINCE 5.0 mozilla-1.9.2 build on 2009/08/19 01:17:09
moz2-win32-slave02

requesting all changes
adding changesets
adding manifests
adding file changes
transaction abort!
rollback completed
abort: consistency error adding group!
Nagios has been saying:
dm-vcview01:hplog is WARNING: WARNING 0005: Server Blade Enclosure LAN Settings Changed (Enclosure Serial Number USE901JS0V)

Doesn't appear to be flapping.
There are currently four boxes on mozilla-central and two on mozilla-1.9.2 showing this problem:

- WINNT 5.2 mozilla-central unit test %
- WINCE 5.0 mozilla-central build %
- Linux x86-64 mozilla-central nightly %
- Linux mozilla-central leak test build %

- Linux mozilla-1.9.2 nightly %
- Linux mozilla-1.9.2 unit test %
Tons more overnight:

Maemo 1.9.2 nightly @ Wed Aug 19 01:06:16 2009
WinMO Electrolysis nightly @ Wed Aug 19 01:06:06 2009
WINCE 5.0 1.9.2 build @ Wed Aug 19 01:18:57 2009
Linux 1.9.2 unit test @ Wed Aug 19 01:27:59 2009
Linux x86-64 m-c nightly @ Wed Aug 19 03:04:27 2009
Linux 1.9.2 nightly @ Wed Aug 19 03:36:27 2009
OS X 1.9.2 xulrunner @ Wed Aug 19 03:36:38 2009
OS X places nightly @ Wed Aug 19 04:06:01 2009
Linux x86-64 1.9.2 xulrunner @ Wed Aug 19 04:40:11 2009
win32 m-c unit test @ Wed Aug 19 04:47:53 2009
WINCE 5.0 m-c build @ Wed Aug 19 04:53:40 2009
Linux m-c leak test @ Wed Aug 19 04:57:55 2009
Linux electrolysis @ Wed Aug 19 05:36:18 2009

There's no sign of this slowing down.
Severity: major → critical
Oh and if you hadn't guessed, this has been affecting Thunderbird & Sunbird builds as well (yesterday and today) - mainly the nightlies that all start at 3am PST.
We (me and another guy from the hg crew) think this can only be caused by network corruption, so I'm afraid there might not be much I can do.
Severity: critical → major
Apparently Aravind is on vacation, unassigning this from him.
Assignee: aravind → server-ops
Severity: major → critical
[61] dm-vcview01:hplog is WARNING: WARNING 0005: Server Blade Enclosure LAN Settings Changed (Enclosure Serial Number USE901JS0V)

Been getting those for a while; no idea if it's related to the problem.

The switch doesn't see any bouncing port, and if it were bouncing, bonding should handle that.

That IML message looks old (it was in 2008), so I'd discount that altogether.
dm-vcview02 was not responding for some reason. Load was 0 on the box but it was extremely sluggish.

I've since kicked the box and brought it back up and it's helping take some of the load off the other two.
Assignee: server-ops → shyam
(In reply to comment #15)
> Linux m-c leak test @ Wed Aug 19 04:57:55 2009
> Linux electrolysis @ Wed Aug 19 05:36:18 2009
> 
These are the last two I have seen in both masters
Assignee: shyam → server-ops
Please don't reset the assignee; it sends pages to multiple folks at IT.

Are those times PDT? If so that was almost 2 hours ago.
Assignee: server-ops → shyam
(In reply to comment #23)
> 
> Are those times PDT? If so that was almost 2 hours ago.
It is PDT

Linux electrolysis unit test @ Wed Aug 19 07:42:20 2009
Win32 electrolysis leak test @ Wed Aug 19 07:36:44 2009
Okay, I've made one more tweak to streamline traffic. Let's see if we see any more failures.
This could be a different bug (possibly caused by the earlier problems, but I don't know enough to differentiate). I just pushed to the try server and got:

requesting all changes
adding changesets
adding manifests
transaction abort!
rollback completed
abort: integrity check failed on 00manifest.d:20033!

On WINNT 5.2 try hg build on 2009/08/19 07:44:42 (PDT)
I've managed to catch one such failure when pulling from build/buildbot-configs; I have debug logs of it, and I've also put up a tarball of the busted repository. In its current state, it's simply impossible to ever pull from it again.

http://gozer.ectoplasm.org/tmp/buildbot-configs-broken.tar.bz2
Assignee: shyam → server-ops
Taking it off critical because it keeps paging us and people keep changing the assignee.
Severity: critical → major
And I got this log from a client that only does

$> while [ 1 ] ; do date; hg pull -r tip; sleep 60; done


Tue Aug 18 05:19:02 PDT 2009
using https://hg.mozilla.org/build/buildbot-configs
pulling from https://hg.mozilla.org/build/buildbot-configs
sending capabilities command
capabilities: unbundle=HG10GZ,HG10BZ,HG10UN lookup changegroupsubset
sending lookup command
searching for changes
no changes found
Tue Aug 18 05:20:01 PDT 2009
using https://hg.mozilla.org/build/buildbot-configs
pulling from https://hg.mozilla.org/build/buildbot-configs
sending capabilities command
capabilities: unbundle=HG10GZ,HG10BZ,HG10UN lookup changegroupsubset
sending lookup command
searching for changes
sending branches command
examining 303ae2bc331c:77547da571d1
found incomplete branch 303ae2bc331c:77547da571d1
sending between command
narrowing 1:7 e66d476658c0
narrowing 2:7 8a50eacda69f
found new branch changeset e66d476658c0
found new changesets starting at e66d476658c0
1 total queries
sending changegroupsubset command
adding changesets
add changeset e66d476658c0
add changeset 303ae2bc331c
adding manifests
adding file changes
adding mozilla2/config.py revisions
transaction abort!
rollback completed
Traceback (most recent call last):
  File "/opt/csw/lib/python/site-packages/mercurial/dispatch.py", line 45, in _runcatch
    return _dispatch(ui, args)
  File "/opt/csw/lib/python/site-packages/mercurial/dispatch.py", line 364, in _dispatch
    ret = _runcommand(ui, options, cmd, d)
  File "/opt/csw/lib/python/site-packages/mercurial/dispatch.py", line 417, in _runcommand
    return checkargs()
  File "/opt/csw/lib/python/site-packages/mercurial/dispatch.py", line 373, in checkargs
    return cmdfunc()
  File "/opt/csw/lib/python/site-packages/mercurial/dispatch.py", line 356, in <lambda>
    d = lambda: func(ui, repo, *args, **cmdoptions)
  File "/opt/csw/lib/python/site-packages/mercurial/commands.py", line 2053, in pull
    modheads = repo.pull(other, heads=revs, force=opts['force'])
  File "/opt/csw/lib/python/site-packages/mercurial/localrepo.py", line 1484, in pull
    return self.addchangegroup(cg, 'pull', remote.url())
  File "/opt/csw/lib/python/site-packages/mercurial/localrepo.py", line 2016, in addchangegroup
    if fl.addgroup(chunkiter, revmap, trp) is None:
  File "/opt/csw/lib/python/site-packages/mercurial/revlog.py", line 1182, in addgroup
    raise LookupError(p, self.indexfile, _('unknown parent'))
LookupError: data/mozilla2/config.py.i@11b629ac50b9: unknown parent
abort: data/mozilla2/config.py.i@11b629ac50b9: unknown parent!

Once that one happens, the checkout is stuck and will never be able to pull successfully.
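
For what it's worth, one way a build script could detect and recover from a wedged clone like that is to run "hg verify" and throw the clone away when verification fails; a rough sketch (the URL and path are placeholders, and this is not what the build steps currently do):

  # Sketch: detect a corrupted local clone with "hg verify" and re-clone it.
  # The URL and path are placeholders; this is not what the build steps do today.
  import shutil
  import subprocess

  def ensure_clean_clone(repo_url, dest):
      verify = subprocess.run(["hg", "-R", dest, "verify"])
      if verify.returncode == 0:
          return
      # Verification failed (or the clone is missing): start over from scratch.
      shutil.rmtree(dest, ignore_errors=True)
      subprocess.run(["hg", "clone", repo_url, dest], check=True)

  ensure_clean_clone("https://hg.mozilla.org/build/buildbot-configs", "buildbot-configs")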
(In reply to comment #28)
> Taking it off critical because it keeps paging us and people keep changing the
> assignee.

you're supposed to take it off critical when it's fixed, right?
osx m-c test mochitests @ Wed Aug 19 08:49:40 2009
Linux Fennec desktop m-c build @ Wed Aug 19 08:53:41 2009
(In reply to comment #31)
> osx m-c test mochitests @ Wed Aug 19 08:49:40 2009
destination directory: tools
requesting all changes
adding changesets
adding manifests
adding file changes
transaction abort!
rollback completed
abort: premature EOF reading chunk (got 1739812 bytes, expected 1872976)

> Linux Fennec desktop m-c build @ Wed Aug 19 08:53:41 2009
destination directory: tools
requesting all changes
adding changesets
adding manifests
adding file changes
transaction abort!
rollback completed
abort: consistency error adding group!

(In reply to comment #24)
> Linux electrolysis unit test @ Wed Aug 19 07:42:20 2009
requesting all changes
adding changesets
adding manifests
adding file changes
transaction abort!
rollback completed
abort: consistency error adding group!

> Win32 electrolysis leak test @ Wed Aug 19 07:36:44 2009
requesting all changes
adding changesets
adding manifests
adding file changes
transaction abort!
rollback completed
abort: integrity check failed on data/db/sqlite3/src/sqlite3.c.d:22!
Is this related?

http://tinderbox.mozilla.org/showlog.cgi?log=MozillaTry/1250696802.1250697083.9692.gz

/tools/python/bin/hg clone --verbose --noupdate http://hg.mozilla.org/try /builds/slave/sendchange-linux-hg/build
 in dir /builds/slave/sendchange-linux-hg (timeout 3600 secs)
 ...
abort: HTTP Error 414: Request-URI Too Large
requesting all changes
elapsedTime=0.747700
program finished with exit code 255
(In reply to comment #17)
> We (me and another guy from the hg crew) think this can only be caused by
> network corruption, so I'm afraid there might not be much I can do.

What do you mean by "network corruption"?  Isn't this all over TCP, which has checksums and such?

And shouldn't mercurial be robust in the face of such aborts due to abrupt connection termination?
(In reply to comment #33)
> Is this related?
> 
> http://tinderbox.mozilla.org/showlog.cgi?log=MozillaTry/1250696802.1250697083.9692.gz
> 
> /tools/python/bin/hg clone --verbose --noupdate http://hg.mozilla.org/try
> /builds/slave/sendchange-linux-hg/build
>  in dir /builds/slave/sendchange-linux-hg (timeout 3600 secs)
>  ...
> abort: HTTP Error 414: Request-URI Too Large
> requesting all changes
> elapsedTime=0.747700
> program finished with exit code 255

This is a separate issue with the Try server. It means we've grown too many heads there and it needs to be pruned.
I vaguely remember having to do the math for TCP/IP checksum reliability years ago. It's in the ballpark of one uncorrected error per 10 million packets with single-bit errors. Ethernet is pretty reliable these days, so you don't actually have a bit error in every packet. In other words, the probability of an uncorrected checksum failure is somewhere between ridiculously unlikely and astronomical.
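
As a rough back-of-envelope illustration of that (the input numbers below are assumptions for the sake of the calculation, not measurements): the 16-bit TCP checksum misses roughly 1 in 65536 of the random corruptions that slip past the Ethernet CRC, so the undetected-error rate is the link-level corruption rate divided by about 65536.

  # Back-of-envelope estimate of undetected TCP corruption during a clone.
  # Every input below is an assumed, illustrative value, not a measurement.
  MSS = 1460              # assumed TCP payload bytes per segment
  clone_bytes = 150e6     # assumed changegroup size for a large clone (~150 MB)
  p_corrupt = 1e-7        # assumed fraction of segments corrupted yet passing the link CRC
  p_miss = 1.0 / 2**16    # rough fraction of random corruptions a 16-bit checksum misses

  segments = clone_bytes / MSS
  undetected_per_clone = segments * p_corrupt * p_miss
  print("segments per clone: %.0f" % segments)
  print("expected undetected corrupt segments per clone: %.1e" % undetected_per_clone)
  # With these numbers: ~1.6e-7, i.e. roughly one silently corrupted clone in
  # several million -- nowhere near the failure rate being reported in this bug.

Even with a fairly pessimistic corruption rate, that works out to far fewer undetected errors than the number of aborts we're seeing here.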
(In reply to comment #35)
> (In reply to comment #33)
> > Is this related?
> > 
> > http://tinderbox.mozilla.org/showlog.cgi?log=MozillaTry/1250696802.1250697083.9692.gz
> > 
> > /tools/python/bin/hg clone --verbose --noupdate http://hg.mozilla.org/try
> > /builds/slave/sendchange-linux-hg/build
> >  in dir /builds/slave/sendchange-linux-hg (timeout 3600 secs)
> >  ...
> > abort: HTTP Error 414: Request-URI Too Large
> > requesting all changes
> > elapsedTime=0.747700
> > program finished with exit code 255
> 
> This is a separate issue with the Try server. It means we've grown too many
> heads there and it needs to be pruned.

I filed bug 511471, which will get dealt with tomorrow morning, fwiw
- Linux m-c test everythingelse @ Wed Aug 19 09:29:36 2009
destination directory: tools
requesting all changes
adding changesets
adding manifests
adding file changes
transaction abort!
rollback completed
abort: consistency error adding group!

- Linux tracemonkey leak test build @ Wed Aug 19 10:13:44 2009
requesting all changes
adding changesets
adding manifests
adding file changes
transaction abort!
rollback completed
abort: integrity check failed on data/db/sqlite3/src/sqlite3.c.d:22!

- Linux electrolysis build @ Wed Aug 19 10:34:31 2009
requesting all changes
adding changesets
adding manifests
adding file changes
transaction abort!
rollback completed
abort: premature EOF reading chunk (got 10222 bytes, expected 28711)

- Linux Fennec Desktop electrolysis @ Wed Aug 19 11:35:50 2009
requesting all changes
adding changesets
adding manifests
adding file changes
transaction abort!
rollback completed
abort: consistency error adding group!
Assignee: server-ops → thardcastle
Severity: major → blocker
The cause of this seems to be queries stacking up on the mercurial backend servers. Yesterday one of the mercurial servers was switched from a VM to hardware, which might be related.
No more incidents at this point.
Last incident at 11:35:50 mentioned in comment 38.
I will check once more before I head out.
Load seems more evenly spread out now.

Shyam's enabled performance monitoring in munin for these boxes, and we've got one more mercurial server in progress if that becomes needed. We'll monitor the data munin collects for any further problems.
- Linux tracemonkey build @ Wed Aug 19 12:50:36 2009
requesting all changes
adding changesets
adding manifests
adding file changes
transaction abort!
rollback completed
abort: premature EOF reading chunk (got 2149500 bytes, expected 2235011)
(In reply to comment #42)
> - Linux tracemonkey build @ Wed Aug 19 12:50:36 2009
> requesting all changes
> adding changesets
> adding manifests
> adding file changes
> transaction abort!
> rollback completed
> abort: premature EOF reading chunk (got 2149500 bytes, expected 2235011)
This was a proxy restart for a config change.


(In reply to comment #43)
> Another one on try:
> http://tinderbox.mozilla.org/showlog.cgi?log=MozillaTry/1250713363.1250713512.7406.gz&fulltext=1
This is the bug popping up again. Munin shows a spike in nfs traffic around then.

We could be starving some Hg processes for I/O by overloading our nfs capacity. Looking into this now.
WINNT 5.2 tracemonkey build between Wed Aug 19 15:19:56 PDT and 15:22:10, on cloning tracemonkey:
  transaction abort!
  rollback completed
  abort: premature EOF reading chunk (got 13842 bytes, expected 15020)
Linux tracemonkey unit test on 2009/08/19 15:28:09

rollback completed
abort: premature EOF reading chunk (got 6186 bytes, expected 15020)
elapsedTime=72.905727
program finished with exit code 255
=== Output ended ===
======== BuildStep ended ========

Linux tracemonkey unit test on 2009/08/19 15:35:07

rollback completed
abort: consistency error adding group!
elapsedTime=284.373095
program finished with exit code 255
=== Output ended ===
======== BuildStep ended ========

WINNT 5.2 tracemonkey build on 2009/08/19 15:16:19

transaction abort!
rollback completed
abort: premature EOF reading chunk (got 13842 bytes, expected 15020)
elapsedTime=133.859000
program finished with exit code -1
=== Output ended ===
======== BuildStep ended ========


WINNT 5.2 tracemonkey build on 2009/08/19 15:29:24

rollback completed
abort: premature EOF reading chunk (got 154 bytes, expected 349)
elapsedTime=27.860000
program finished with exit code -1
=== Output ended ===
======== BuildStep ended ========

Could we at least have a work-around in the build script that tries to pull 3 times in a row or something like that?
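
For reference, such a workaround could look roughly like this (a minimal sketch, not the actual buildbot step; the repository URL, destination, and retry/backoff values are illustrative):

  # A sketch of a retry wrapper around "hg clone"; not the actual buildbot step.
  # The repository URL, destination, and retry/backoff values are illustrative.
  import shutil
  import subprocess
  import sys
  import time

  def clone_with_retries(repo_url, dest, attempts=3, delay=30):
      for attempt in range(1, attempts + 1):
          result = subprocess.run(["hg", "clone", "--noupdate", repo_url, dest])
          if result.returncode == 0:
              return True
          print("clone attempt %d failed (exit %d), retrying in %ds"
                % (attempt, result.returncode, delay), file=sys.stderr)
          # A failed clone can leave a partial directory behind; remove it first.
          shutil.rmtree(dest, ignore_errors=True)
          time.sleep(delay)
      return False

  if __name__ == "__main__":
      ok = clone_with_retries("https://hg.mozilla.org/mozilla-central", "mozilla-central")
      sys.exit(0 if ok else 1)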
Summary of a quick chat with chizu as he headed out to the colo.

hg.m.o is actually a round robin across 2 VMs and 1 physical machine. Yesterday this was changed to 1 VM and 2 physical machines.

Turns out the physical machines have less RAM than the VM. While I/O load on the physical machines is OK, this is causing the file cache on them to get borked. To fix this, Chizu is going to upgrade the RAM on each physical machine, one at a time. Each machine should take <10 mins to upgrade.

Technically, powering down one machine at a time like this does not mean closing hg.m.o, but given that the two machines are low on RAM, it's unlikely that failover to the remaining machines will go well. Therefore we'll do a quick tree closure as soon as chizu is in the colo.

(/me has also notified sheriff of upcoming downtime)
dm-vcview01, 03 were upgraded from 4GB ram to 8GB ram and restarted.  

We *think* the hg problem is fixed. But obviously if you see/hear otherwise, we'd like to know!
(In reply to comment #48)
> dm-vcview01, 03 were upgraded from 4GB ram to 8GB ram and restarted.  
> 
> We *think* the hg problem is fixed.

What was the root cause?
I think mrz is going to do a post-mortem description later, but people have been working on this stuff flat-out since yesterday, so I'm happy to let them catch their breath before they dissect the causes. :-)
Not much activity on the main branches since 1800 PDT to test the fix, so I've fired off about 12 builds on mozilla-1.9.2. 

Still getting a lot of transaction abort messages on try server builds, so perhaps we should go ahead with bug 511471 asap. There are about 590 heads on that repo at the moment.
(In reply to comment #50)
> I think mrz is going to do a post-mortem description later, but people have
> been working on this stuff flat-out since yesterday, so I'm happy to let them
> catch their breath before they dissect the causes. :-)

Adding more RAM as a quick fix is cool. There can't be a post-mortem until the root cause is determined.
Multiple failures on TM tree.

rollback completed
abort: premature EOF reading chunk (got 26366 bytes, expected 42292)
elapsedTime=163.192947
program finished with exit code 255
=== Output ended ===
======== BuildStep ended ========
(In reply to comment #53)
> Multiple failures on TM tree.
> 
> rollback completed
> abort: premature EOF reading chunk (got 26366 bytes, expected 42292)
> elapsedTime=163.192947
> program finished with exit code 255
> === Output ended ===
> ======== BuildStep ended ========

If these happened at 7:49, it was caused by a configuration change that needed an apache reload.
Just to plan out stuff over the next while, is there a point at which we'll just roll back all changes (any new build hosts, any config changes, any underlying VM/hardware moves, etc.) to the last known good state?  I understand and appreciate that people are working hard to move forward to a better state with faster hardware and so forth, but we're hurting pretty badly right now, and it's not clear what the plan is now that the additional RAM didn't make things all better.

Let me know if there are things I can do (get the tree closed, have people look at python core dumps, pull in extra analysis help, anything) to help move it forward; this is basically the most important thing in my life right now.

(I'm 647.296.1922 if you think of something you need during the night.)
Some updates:
* bug 511471 was tried without improvement.
* Looking at the history of what's related to this, it seems it was a mistake to focus on the VM -> hardware change, as the problem predates it.
* RAM was 4GB on the dm-vcview01 VM and 4GB on dm-vcview01 hardware. This is now 8GB, which should improve performance, but does not affect the timeouts.
* Configuration changes were made on dm-vcview01, but have been reverted to the original VM configs.
Not sure if this is related:

WINCE 5.0 mozilla-1.9.2 build on 2009/08/19 22:44:43
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox3.6/1250747083.1250747193.22109.gz
requesting all changes
adding changesets
adding manifests
adding file changes
transaction abort!
rollback completed
abort: integrity check failed on data/db/sqlite3/src/sqlite3.c.d:22!
elapsedTime=96.688000
program finished with exit code -1

and

Linux mozilla-1.9.2 build on 2009/08/19 22:44:43
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox3.6/1250747083.1250747193.22109.gz
requesting all changes
adding changesets
adding manifests
adding file changes
transaction abort!
rollback completed
abort: integrity check failed on data/db/sqlite3/src/sqlite3.c.d:22!
elapsedTime=117.885569
program finished with exit code 255
(In reply to comment #55)
> Just to plan out stuff over the next while, is there a point at which we'll
> just roll back all changes (any new build hosts, any config changes, any
> underlying VM/hardware moves, etc.) to the last known good state? 

I mentioned this to shaver out-of-band but will include it here too.

Nothing on the system side had changed prior to this bug being opened.  Build increased its load last Thursday, but IT hadn't changed any systems or configs.

The only changes we've made have been reactive, and I suspect we'd be in a worse state if we reverted to the VMs, which were clearly CPU-starved.

Point #2 in comment #0 really refers to the migration from VM to physical server (and nthomas said so on IRC too).

The theory right now is that this has been happening for a while and has only recently gotten bad enough to be noticeable.

Aravind has increased the timeouts on nginx, so Apache shouldn't have any more "failure to send data" errors.
On the backend (read-only) servers that serve this content, apache logs this for a try repo clone.

10.2.74.13 - - [19/Aug/2009:23:13:40 -0700] "GET /try?bases=0000000000000000000000000000000000000000&cmd=changegroupsubset&heads=c4cdfd9a2881f088dca2010604182addbbb2fe38+3c5a5cc10ea95b787198a1e7817af24fad7427b5+f34211125e93965d8e908e4ec11c0740ef5ccf97+dc02eff6d8c0164cd9cae0151b8c6a8b914c7c8d+b42a1594f61338e48a8fab40a8a2cbfee0120e9d+cab97aca485d30972f9a624eb92ca5e027571692+c4ebfef7afa6d7b417c9476bceb27053d4118157 HTTP/1.0" 200 148024974 "-" "mercurial/proto-1.0"


That's one that works.  However, I occasionally see entries that say:
10.2.74.13 - - [19/Aug/2009:23:13:39 -0700] "GET /try?bases=0000000000000000000000000000000000000000&cmd=changegroupsubset&heads=c4cdfd9a2881f088dca2010604182addbbb2fe38+3c5a5cc10ea95b787198a1e7817af24fad7427b5+f34211125e93965d8e908e4ec11c0740ef5ccf97+dc02eff6d8c0164cd9cae0151b8c6a8b914c7c8d+b42a1594f61338e48a8fab40a8a2cbfee0120e9d+cab97aca485d30972f9a624eb92ca5e027571692+c4ebfef7afa6d7b417c9476bceb27053d4118157 HTTP/1.0" 200 56464598 "-" "mercurial/proto-1.0"


If you look carefully, you will see that the number of bytes returned is different for the exact same query.  Those, I am willing to bet, are the ones that fail.  Why apache returns a 200 status code but fewer bytes is still a mystery.  Looking into that some more.
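
One way to spot those truncated responses systematically would be something like the following sketch (the log path and line format are assumptions based on the entries quoted above): group identical changegroupsubset requests and flag any whose 200-response byte counts disagree.

  # Sketch: flag identical mercurial changegroupsubset requests whose response
  # sizes differ in an Apache access log. Log path and format are assumptions.
  import re
  import sys
  from collections import defaultdict

  # Matches: '"GET /try?...cmd=changegroupsubset... HTTP/1.0" 200 148024974'
  LINE_RE = re.compile(r'"GET (\S*cmd=changegroupsubset\S*) HTTP/[\d.]+" (\d{3}) (\d+)')

  sizes = defaultdict(set)
  with open(sys.argv[1] if len(sys.argv) > 1 else "access_log") as log:
      for line in log:
          m = LINE_RE.search(line)
          if m and m.group(2) == "200":
              sizes[m.group(1)].add(int(m.group(3)))

  for query, byte_counts in sizes.items():
      if len(byte_counts) > 1:
          # Same query, different 200-response byte counts: likely truncation.
          print("%s\n  sizes seen: %s" % (query, sorted(byte_counts)))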
So, based on that last comment, I changed apache to isolate the mod_wsgi instances between requests, so they don't trample on each other and fuck up my otherwise beautiful mercurial setup.  I changed apache to use a new thread for every new mercurial request.  Aki tested this, and he was able to consistently check out the try repo 10 times in parallel (in two different tests - I am calling that resolved beyond a doubt).

As always, please re-open if things are still broken.
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
gd, and this is why we celebrate sysadmin appreciation day (http://www.sysadminday.com/).  And you're on vacation, no less!

Thanks!
Fired off a bunch of nightlies (>10 I think) across the two production-masters.  They've all passed the hg clone steps and hit the compile steps.  I'm ready to call this fixed, but the nightlies tonight should give us an additional level of verification.

Thanks everyone!
(In reply to comment #62)
> Fired off a bunch of nightlies (>10 I think) across the two production-masters.
>  They've all passed the hg clone steps and hit the compile steps.  I'm ready to
> call this fixed, but the nightlies tonight should give us an additional level
> of verification.
> 
> Thanks everyone!

Cool, let's hope the hack holds up.  Thanks to Nick and Aki for being very patient and helping me isolate and test the patch.

Thanks to justdave, chizu and fox2mike for seeing this through all the permutations and combinations.
Win2k3 comm-central build on 2009/08/20 16:49:24
http://tinderbox.mozilla.org/showlog.cgi?log=Thunderbird/1250812164.1250812540.26613.gz
pulling from http://hg.mozilla.org/mozilla-central
searching for changes
adding changesets
adding manifests
adding file changes
transaction abort!
rollback completed
Severity: blocker → major
Win2k3 comm-1.9.1 check on 2009/08/20 12:33:12
http://tinderbox.mozilla.org/showlog.cgi?log=Thunderbird3.0/1250796792.1250797958.22186.gz
hg clone http://hg.mozilla.org/comm-central
requesting all changes
adding changesets
adding manifests
adding file changes
transaction abort!
rollback completed
abort: connection ended unexpectedly
And two more from overnight:

Win2k3 comm-central build Date: 2009/08/20 22:17:51
http://tinderbox.mozilla.org/showlog.cgi?log=Thunderbird/1250831871.1250832608.25758.gz
pulling from http://hg.mozilla.org/mozilla-central
searching for changes
adding changesets
adding manifests
adding file changes

Win2k3 comm-1.9.1 nightly Date: 2009/08/21 03:00:01
http://tinderbox.mozilla.org/showlog.cgi?log=Thunderbird3.0/1250848801.1250851373.6971.gz
requesting all changes
adding changesets
adding manifests
adding file changes
transaction abort!
rollback completed
What IP address are those boxes getting for "hg.mozilla.org"?
Seems to be round-robin between 2 IPs:

63.245.208.189 dm-hg02.mozilla.org.
63.245.208.188 dm-hg01.mozilla.org.
Win2k3 comm-central nightly Date: 2009/08/22 03:00:00
http://tinderbox.mozilla.org/showlog.cgi?log=Thunderbird/1250935200.1250936732.2481.gz

One thing worth noting is that this particular clone died almost precisely 20 minutes after it started.
(In reply to comment #68)
> Seems to be round-robin between 2 IPs:
> 
> 63.245.208.189 dm-hg02.mozilla.org.
> 63.245.208.188 dm-hg01.mozilla.org.

chizu, are those the ones he should be hitting externally?  vcview03 isn't in that list.
Five more overnight, with a relatively new one on Linux:

Win2k3 comm-central nightly Date: 2009/08/25 06:37:34
http://tinderbox.mozilla.org/showlog.cgi?log=Thunderbird/1251207454.1251208548.19215.gz
[hg] transaction abort!

Win2k3 comm-central nightly Date: 2009/08/25 04:02:37
http://tinderbox.mozilla.org/showlog.cgi?log=Thunderbird/1251198157.1251199875.21202.gz
[hg] transaction abort!

Win2k3 comm-central nightly Date: 2009/08/25 03:16:08
http://tinderbox.mozilla.org/showlog.cgi?log=Thunderbird/1251195368.1251196665.16025.gz
[hg] transaction abort!

Win2k3 comm-1.9.1 nightly Date: 2009/08/25 03:19:15
http://tinderbox.mozilla.org/showlog.cgi?log=Thunderbird3.0/1251195555.1251197210.22055.gz
[hg] transaction abort!

Linux comm-1.9.1 nightly Date: 2009/08/25 03:00:00
http://tinderbox.mozilla.org/showlog.cgi?log=Thunderbird3.0/1251194400.1251194548.25822.gz
raise IOError(None, _('connection ended unexpectedly'))
IOError: (None, 'connection ended unexpectedly')
abort: connection ended unexpectedly
One theory worth considering is that our connectivity to hg.m.o is over a 100Mb link, and I'll guess the moco build machines can get at it via gigabit. This would mean everything we do would be at least 10 times slower, and could make timeouts/bailouts on the server side more likely?
If you're bandwidth-constrained, you should see that in any trend graphs.
Looking at bandwidth usage graphs, specifically for my uplink to MoCo, I can see we are nowhere near any line limit. Thinking about it, in my various hg benchmarks, I've never found hg to be able to get close to wire speed, so it's probably a red herring.

For fun and profit, try and see how fast you can grab http://hg.mozilla.org/mozilla-central/archive/tip.tar.bz2

600K/s is pretty much the best I've ever seen.
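
For anyone who wants a number rather than a feel, a quick throughput check could look like this (a sketch; the URL is the one above, the chunk size is arbitrary):

  # Sketch: measure average download throughput for a single URL.
  # The chunk size and the use of urllib are arbitrary choices for illustration.
  import time
  import urllib.request

  URL = "http://hg.mozilla.org/mozilla-central/archive/tip.tar.bz2"

  start = time.time()
  total = 0
  with urllib.request.urlopen(URL) as resp:
      while True:
          chunk = resp.read(64 * 1024)
          if not chunk:
              break
          total += len(chunk)

  elapsed = time.time() - start
  print("%d KB in %.1fs = %.0f KB/s" % (total // 1024, elapsed, total / 1024 / elapsed))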
Win2k3 comm-central bloat Date: 2009/08/25 23:48:10
http://tinderbox.mozilla.org/showlog.cgi?log=Thunderbird/1251269290.1251269528.7496.gz
abort: connection ended unexpectedly
Time: real 95.972 secs (user 3.906+0.000 sys 0.891+0.000)
There are still failures occurring on MoCo builders, but only infrequently. E.g.:

"WINNT 5.2 places nightly" at Sun Aug 30 04:02:00 2009, cloning build/tools:
destination directory: tools
requesting all changes
abort: HTTP Error 502: Bad Gateway

"Linux try hg build" at Sat Aug 29 13:22:36 2009
transaction abort!
rollback completed
abort: connection ended unexpectedly
And also at Fri Aug 28 17:11:57 2009.
Failed to pull build/tools a bit after 2009/09/06 04:02:00
http://tinderbox.mozilla.org/showlog.cgi?log=Electrolysis/1252234920.1252238938.24591.gz

adding file changes
transaction abort!
rollback completed
abort: premature EOF reading chunk (got 700601 bytes, expected 2069754)

Much the same, only on Linux, in
http://tinderbox.mozilla.org/showlog.cgi?log=Electrolysis/1252234920.1252236746.30486.gz
Product: mozilla.org → mozilla.org Graveyard