Closed Bug 511258 Opened 14 years ago Closed 14 years ago

Build slaves failing to pull from hg.m.o intermittently

Categories

(mozilla.org Graveyard :: Server Operations, task)

Hardware: x86
OS: All
Type: task
Priority: Not set
Severity: major

Tracking

(Not tracked)

Status: RESOLVED FIXED

People

(Reporter: nthomas, Assigned: chizu)


Recent events:
* 1.9.2 branch created Aug 13, adding more builders
* IT changes to how hg clients pull data on Aug 17 [hg.m.o -> dm-hg01/02 -> vcview01 for hg clients, or vcview03 for other hg-related activity (browsing, etc.)]
* 20090818 nightly builds failed when trying to pull source from hg. Some 38 builds start at 3:02 PDT and request full source trees; starting roughly half at a time worked better.

Steps already taken:
* Aravind made hg.m.o config changes at 4am PDT today, and further tweaks pushed at 6am
* RelEng has spaced out nightly builds more (bug 511126), now two branches starting at 3:02, 3:32, and 4:02.

Filing this because there are still intermittent problems pulling/updating; details to follow. Aravind is going to set up munin/cacti load monitoring.

Tue Aug 18 09:12:44 - 09:17:31: moz2-linux-slave25 on cloning projects/electrolysis
  adding file changes
  transaction abort!
  rollback completed
  abort: premature EOF reading chunk (got 4337 bytes, expected 12856)

Tue Aug 18 12:34:28: moz2-win32-slave14 on cloning build/tools
  adding file changes
  transaction abort!
  rollback completed
  abort: consistency error adding group!

Tue Aug 18 13:15:07 - 13:16:32: moz2-win32-slave22 hits the same error cloning releases/mozilla-1.9.2. It had successfully cloned build/tools in the ten seconds before that.
 
Tue Aug 18 14:18:15 - 14:18:19: moz2-win32-slave29, transaction abort on cloning build/tools
Tue Aug 18 14:18:15 2009 - 14:18:18: moz2-linux-slave19, transaction abort trying to clone build/tools
Tue Aug 18 16:36:46 (0 secs): moz2-win32-slave34, cloning tracemonkey
  abort: HTTP Error 500: Internal Server Error

Tue Aug 18 16:37:18 (0 secs): moz2-darwin9-slave15, same as above
Tue Aug 18 16:38:40 (0 secs): moz2-win32-slave25, same as above
Tue Aug 18 16:39:48 (0 secs): moz2-linux-slave13, same as above
I don't know if this is any help, but the Thunderbird build boxes have been having this kind of hg problem for a long time now. My understanding so far was that hg access from the outside world vs. from the inside (the moco build network) was different. Could the recent changes have made access from the moco build network more similar to the path from the outside?

I can get timing data for some of the recent Thunderbird failures if that would help.

Haven't seen any problems in Firefox builds since 16:45 PDT.
Not out of the woods yet:

Tue Aug 18 19:35:43 2009 (53 secs): moz2-linux-slave10 cloning mozilla-central:
  adding file changes
  transaction abort!
  rollback completed
  abort: premature EOF reading chunk (got 2192938 bytes, expected 2235011)
Tue Aug 18 21:01:54 2009 (3 mins, 17 secs): moz2-win32-slave01 cloning mozilla-central:
abort: premature EOF reading chunk (got 3882 bytes, expected 28711)
Tue Aug 18 22:53:45 2009 (3 mins, 47 secs): moz2-linux-slave06 cloning mozilla-central, with something slightly different for variety's sake

transaction abort!
requesting all changes
adding changesets
adding manifests
adding file changes
rollback completed
abort: consistency error adding group!

(which is a pretty weird order of output)
mozilla-1.9.2

WINNT 5.2 mozilla-1.9.2 leak test build on 2009/08/18 22:05:19
moz2-win32-slave02

requesting all changes
adding changesets
adding manifests
adding file changes
transaction abort!
rollback completed
abort: premature EOF reading chunk (got 2227094 bytes, expected 2235011)

and

Linux mozilla-1.9.2 unit test on 2009/08/18 22:05:19
moz2-linux-slave07

requesting all changes
adding changesets
adding manifests
adding file changes
transaction abort!
rollback completed
abort: premature EOF reading chunk (got 237 bytes, expected 304)
I tried submitting try builds 2009/08/18 15:11:13 and 2009/08/19 00:44:03. Windows, Mac and Linux have all had failures with highlights of:

abort: premature EOF reading chunk (got 237 bytes, expected 304)
abort: consistency error adding group!
abort: integrity check failed on 00manifest.d:15143!
WINCE 5.0 mozilla-1.9.2 build on 2009/08/19 01:17:09
moz2-win32-slave02

requesting all changes
adding changesets
adding manifests
adding file changes
transaction abort!
rollback completed
abort: consistency error adding group!
Nagios has been saying:
dm-vcview01:hplog is WARNING: WARNING 0005: Server Blade Enclosure LAN Settings Changed (Enclosure Serial Number USE901JS0V)

Doesn't appear to be flapping.
There are currently four boxes on mozilla-central and two on mozilla-1.9.2 showing this problem:

- WINNT 5.2 mozilla-central unit test %
- WINCE 5.0 mozilla-central build %
- Linux x86-64 mozilla-central nightly %
- Linux mozilla-central leak test build %

- Linux mozilla-1.9.2 nightly %
- Linux mozilla-1.9.2 unit test %
Tons more overnight:

Maemo 1.9.2 nightly @ Wed Aug 19 01:06:16 2009
WinMO Electrolysis nightly @ Wed Aug 19 01:06:06 2009
WINCE 5.0 1.9.2 build @ Wed Aug 19 01:18:57 2009
Linux 1.9.2 unit test @ Wed Aug 19 01:27:59 2009
Linux x86-64 m-c nightly @ Wed Aug 19 03:04:27 2009
Linux 1.9.2 nightly @ Wed Aug 19 03:36:27 2009
OS X 1.9.2 xulrunner @ Wed Aug 19 03:36:38 2009
OS X places nightly @ Wed Aug 19 04:06:01 2009
Linux x86-64 1.9.2 xulrunner @ Wed Aug 19 04:40:11 2009
win32 m-c unit test @ Wed Aug 19 04:47:53 2009
WINCE 5.0 m-c build @ Wed Aug 19 04:53:40 2009
Linux m-c leak test @ Wed Aug 19 04:57:55 2009
Linux electrolysis @ Wed Aug 19 05:36:18 2009

There's no sign of this slowing down.
Severity: major → critical
Oh and if you hadn't guessed, this has been affecting Thunderbird & Sunbird builds as well (yesterday and today) - mainly the nightlies that all start at 3am PST.
We (me and another guy from the hg crew) think this can only be caused by network corruption, so I'm afraid there might not be much I can do.
Severity: critical → major
Apparently Aravind is on vacation, unassigning this from him.
Assignee: aravind → server-ops
Severity: major → critical
[61] dm-vcview01:hplog is WARNING: WARNING 0005: Server Blade Enclosure LAN Settings Changed (Enclosure Serial Number USE901JS0V)

Been getting those for a while; no idea if it's related to the problem.

The switch doesn't see any bouncing port, and if it were bouncing, bonding should handle that.

That IML message looks old (it was in 2008), so I'd discount that altogether.
dm-vcview02 was not responding for some reason. Load was 0 on the box but it was extremely sluggish.

I've since kicked the box and brought it back up and it's helping take some of the load off the other two.
Assignee: server-ops → shyam
(In reply to comment #15)
> Linux m-c leak test @ Wed Aug 19 04:57:55 2009
> Linux electrolysis @ Wed Aug 19 05:36:18 2009
> 
These are the last two I have seen in both masters
Assignee: shyam → server-ops
Please don't reset the assignee; it sends pages to multiple folks at IT.

Are those times PDT? If so that was almost 2 hours ago.
Assignee: server-ops → shyam
(In reply to comment #23)
> 
> Are those times PDT? If so that was almost 2 hours ago.
It is PDT

Linux electrolysis unit test @ Wed Aug 19 07:42:20 2009
Win32 electrolysis leak test @ Wed Aug 19 07:36:44 2009
Okay, I've made one more tweak to streamline traffic. Let's see if we see any more failures.
This could be a different bug (possibly caused by the earlier problems, but I don't know enough to differentiate). I just pushed to the try server and got:

requesting all changes
adding changesets
adding manifests
transaction abort!
rollback completed
abort: integrity check failed on 00manifest.d:20033!

On WINNT 5.2 try hg build on 2009/08/19 07:44:42 (PDT)
I've managed to catch one such failure when pulling from build/buildbot-configs; I have debug logs of it, and I've also put up a tarball of the busted repository. In its current state, it's simply impossible to ever pull from it again.

http://gozer.ectoplasm.org/tmp/buildbot-configs-broken.tar.bz2
Assignee: shyam → server-ops
Taking it off critical because it keeps paging us and people keep changing the assignee.
Severity: critical → major
And I got this log from a client that only does

$> while [ 1 ] ; do date; hg pull -r tip; sleep 60; done


Tue Aug 18 05:19:02 PDT 2009
using https://hg.mozilla.org/build/buildbot-configs
pulling from https://hg.mozilla.org/build/buildbot-configs
sending capabilities command
capabilities: unbundle=HG10GZ,HG10BZ,HG10UN lookup changegroupsubset
sending lookup command
searching for changes
no changes found
Tue Aug 18 05:20:01 PDT 2009
using https://hg.mozilla.org/build/buildbot-configs
pulling from https://hg.mozilla.org/build/buildbot-configs
sending capabilities command
capabilities: unbundle=HG10GZ,HG10BZ,HG10UN lookup changegroupsubset
sending lookup command
searching for changes
sending branches command
examining 303ae2bc331c:77547da571d1
found incomplete branch 303ae2bc331c:77547da571d1
sending between command
narrowing 1:7 e66d476658c0
narrowing 2:7 8a50eacda69f
found new branch changeset e66d476658c0
found new changesets starting at e66d476658c0
1 total queries
sending changegroupsubset command
adding changesets
add changeset e66d476658c0
add changeset 303ae2bc331c
adding manifests
adding file changes
adding mozilla2/config.py revisions
transaction abort!
rollback completed
Traceback (most recent call last):
  File "/opt/csw/lib/python/site-packages/mercurial/dispatch.py", line 45, in _runcatch
    return _dispatch(ui, args)
  File "/opt/csw/lib/python/site-packages/mercurial/dispatch.py", line 364, in _dispatch
    ret = _runcommand(ui, options, cmd, d)
  File "/opt/csw/lib/python/site-packages/mercurial/dispatch.py", line 417, in _runcommand
    return checkargs()
  File "/opt/csw/lib/python/site-packages/mercurial/dispatch.py", line 373, in checkargs
    return cmdfunc()
  File "/opt/csw/lib/python/site-packages/mercurial/dispatch.py", line 356, in <lambda>
    d = lambda: func(ui, repo, *args, **cmdoptions)
  File "/opt/csw/lib/python/site-packages/mercurial/commands.py", line 2053, in pull
    modheads = repo.pull(other, heads=revs, force=opts['force'])
  File "/opt/csw/lib/python/site-packages/mercurial/localrepo.py", line 1484, in pull
    return self.addchangegroup(cg, 'pull', remote.url())
  File "/opt/csw/lib/python/site-packages/mercurial/localrepo.py", line 2016, in addchangegroup
    if fl.addgroup(chunkiter, revmap, trp) is None:
  File "/opt/csw/lib/python/site-packages/mercurial/revlog.py", line 1182, in addgroup
    raise LookupError(p, self.indexfile, _('unknown parent'))
LookupError: data/mozilla2/config.py.i@11b629ac50b9: unknown parent
abort: data/mozilla2/config.py.i@11b629ac50b9: unknown parent!

Once that one happens, the checkout is stuck and will never be able to pull successfully.
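
For what it's worth, one way a build script could detect and recover from a wedged clone like that is to run "hg verify" and throw the clone away when verification fails; a rough sketch (the URL and path are placeholders, and this is not what the build steps currently do):

  # Sketch: detect a corrupted local clone with "hg verify" and re-clone it.
  # The URL and path are placeholders; this is not what the build steps do today.
  import shutil
  import subprocess

  def ensure_clean_clone(repo_url, dest):
      verify = subprocess.run(["hg", "-R", dest, "verify"])
      if verify.returncode == 0:
          return
      # Verification failed (or the clone is missing): start over from scratch.
      shutil.rmtree(dest, ignore_errors=True)
      subprocess.run(["hg", "clone", repo_url, dest], check=True)

  ensure_clean_clone("https://hg.mozilla.org/build/buildbot-configs", "buildbot-configs")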
(In reply to comment #28)
> Taking it off critical because it keeps paging us and people keep changing the
> assignee.

you're supposed to take it off critical when it's fixed, right?
osx m-c test mochitests @ Wed Aug 19 08:49:40 2009
Linux Fennec desktop m-c build @ Wed Aug 19 08:53:41 2009
(In reply to comment #31)
> osx m-c test mochitests @ Wed Aug 19 08:49:40 2009
destination directory: tools
requesting all changes
adding changesets
adding manifests
adding file changes
transaction abort!
rollback completed
abort: premature EOF reading chunk (got 1739812 bytes, expected 1872976)

> Linux Fennec desktop m-c build @ Wed Aug 19 08:53:41 2009
destination directory: tools
requesting all changes
adding changesets
adding manifests
adding file changes
transaction abort!
rollback completed
abort: consistency error adding group!

(In reply to comment #24)
> Linux electrolysis unit test @ Wed Aug 19 07:42:20 2009
requesting all changes
adding changesets
adding manifests
adding file changes
transaction abort!
rollback completed
abort: consistency error adding group!

> Win32 electrolysis leak test @ Wed Aug 19 07:36:44 2009
requesting all changes
adding changesets
adding manifests
adding file changes
transaction abort!
rollback completed
abort: integrity check failed on data/db/sqlite3/src/sqlite3.c.d:22!
Is this related?

http://tinderbox.mozilla.org/showlog.cgi?log=MozillaTry/1250696802.1250697083.9692.gz

/tools/python/bin/hg clone --verbose --noupdate http://hg.mozilla.org/try /builds/slave/sendchange-linux-hg/build
 in dir /builds/slave/sendchange-linux-hg (timeout 3600 secs)
 ...
abort: HTTP Error 414: Request-URI Too Large
requesting all changes
elapsedTime=0.747700
program finished with exit code 255
(In reply to comment #17)
> We (me and another guy from the hg crew) think this can only be caused by
> network corruption, so I'm afraid there might not be much I can do.

What do you mean by "network corruption"?  Isn't this all over TCP, which has checksums and such?

And shouldn't mercurial be robust in the face of such aborts due to abrupt connection termination?
(In reply to comment #33)
> Is this related?
> 
> http://tinderbox.mozilla.org/showlog.cgi?log=MozillaTry/1250696802.1250697083.9692.gz
> 
> /tools/python/bin/hg clone --verbose --noupdate http://hg.mozilla.org/try
> /builds/slave/sendchange-linux-hg/build
>  in dir /builds/slave/sendchange-linux-hg (timeout 3600 secs)
>  ...
> abort: HTTP Error 414: Request-URI Too Large
> requesting all changes
> elapsedTime=0.747700
> program finished with exit code 255

This is a separate issue with the Try server. It means we've grown too many heads there and it needs to be pruned.
I vaguely remember having to do the math for TCP/IP checksum reliability years ago. It's in the ballpark of one uncorrected error per 10 million packets with single-bit errors. Ethernet is pretty reliable these days, so you don't actually have a bit error in every packet. In other words, the probability of an uncorrected checksum failure is somewhere between ridiculously unlikely and astronomical.
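
As a rough back-of-envelope illustration of that (the input numbers below are assumptions for the sake of the calculation, not measurements): the 16-bit TCP checksum misses roughly 1 in 65536 of the random corruptions that slip past the Ethernet CRC, so the undetected-error rate is the link-level corruption rate divided by about 65536.

  # Back-of-envelope estimate of undetected TCP corruption during a clone.
  # Every input below is an assumed, illustrative value, not a measurement.
  MSS = 1460              # assumed TCP payload bytes per segment
  clone_bytes = 150e6     # assumed changegroup size for a large clone (~150 MB)
  p_corrupt = 1e-7        # assumed fraction of segments corrupted yet passing the link CRC
  p_miss = 1.0 / 2**16    # rough fraction of random corruptions a 16-bit checksum misses

  segments = clone_bytes / MSS
  undetected_per_clone = segments * p_corrupt * p_miss
  print("segments per clone: %.0f" % segments)
  print("expected undetected corrupt segments per clone: %.1e" % undetected_per_clone)
  # With these numbers: ~1.6e-7, i.e. roughly one silently corrupted clone in
  # several million -- nowhere near the failure rate being reported in this bug.

Even with a fairly pessimistic corruption rate, that works out to far fewer undetected errors than the number of aborts we're seeing here.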
(In reply to comment #35)
> (In reply to comment #33)
> > Is this related?
> > 
> > http://tinderbox.mozilla.org/showlog.cgi?log=MozillaTry/1250696802.1250697083.9692.gz
> > 
> > /tools/python/bin/hg clone --verbose --noupdate http://hg.mozilla.org/try
> > /builds/slave/sendchange-linux-hg/build
> >  in dir /builds/slave/sendchange-linux-hg (timeout 3600 secs)
> >  ...
> > abort: HTTP Error 414: Request-URI Too Large
> > requesting all changes
> > elapsedTime=0.747700
> > program finished with exit code 255
> 
> This is a separate issue with the Try server. It means we've grown too many
> heads there and it needs to be pruned.

I filed bug 511471, which will get dealt with tomorrow morning, fwiw
- Linux m-c test everythingelse @ Wed Aug 19 09:29:36 2009
destination directory: tools
requesting all changes
adding changesets
adding manifests
adding file changes
transaction abort!
rollback completed
abort: consistency error adding group!

- Linux tracemonkey leak test build @ Wed Aug 19 10:13:44 2009
requesting all changes
adding changesets
adding manifests
adding file changes
transaction abort!
rollback completed
abort: integrity check failed on data/db/sqlite3/src/sqlite3.c.d:22!

- Linux electrolysis build @ Wed Aug 19 10:34:31 2009
requesting all changes
adding changesets
adding manifests
adding file changes
transaction abort!
rollback completed
abort: premature EOF reading chunk (got 10222 bytes, expected 28711)

- Linux Fennec Desktop electrolysis @ Wed Aug 19 11:35:50 2009
requesting all changes
adding changesets
adding manifests
adding file changes
transaction abort!
rollback completed
abort: consistency error adding group!
Assignee: server-ops → thardcastle
Severity: major → blocker
The cause of this seems to be queries stacking up on the mercurial backend servers. Yesterday one of the mercurial servers was switched from a VM to hardware, which might be related.
No more incidents at this point.
Last incident at 11:35:50 mentioned in comment 38.
I will check once more before I head out.
Load seems more evenly spread out now.

Shyam's enabled performance monitoring in munin for these boxes, and we've got one more mercurial server in progress if that becomes needed. We'll monitor the data munin collects for any further problems.
- Linux tracemonkey build @ Wed Aug 19 12:50:36 2009
requesting all changes
adding changesets
adding manifests
adding file changes
transaction abort!
rollback completed
abort: premature EOF reading chunk (got 2149500 bytes, expected 2235011)
(In reply to comment #42)
> - Linux tracemonkey build @ Wed Aug 19 12:50:36 2009
> requesting all changes
> adding changesets
> adding manifests
> adding file changes
> transaction abort!
> rollback completed
> abort: premature EOF reading chunk (got 2149500 bytes, expected 2235011)
This was a proxy restart for a config change.


(In reply to comment #43)
> Another one on try:
> http://tinderbox.mozilla.org/showlog.cgi?log=MozillaTry/1250713363.1250713512.7406.gz&fulltext=1
This is the bug popping up again. Munin shows a spike in nfs traffic around then.

We could be starving some Hg processes for I/O by overloading our nfs capacity. Looking into this now.
WINNT 5.2 tracemonkey build between Wed Aug 19 15:19:56 PDT and 15:22:10, on cloning tracemonkey:
  transaction abort!
  rollback completed
  abort: premature EOF reading chunk (got 13842 bytes, expected 15020)
Linux tracemonkey unit test on 2009/08/19 15:28:09

rollback completed
abort: premature EOF reading chunk (got 6186 bytes, expected 15020)
elapsedTime=72.905727
program finished with exit code 255
=== Output ended ===
======== BuildStep ended ========

Linux tracemonkey unit test on 2009/08/19 15:35:07

rollback completed
abort: consistency error adding group!
elapsedTime=284.373095
program finished with exit code 255
=== Output ended ===
======== BuildStep ended ========

WINNT 5.2 tracemonkey build on 2009/08/19 15:16:19

transaction abort!
rollback completed
abort: premature EOF reading chunk (got 13842 bytes, expected 15020)
elapsedTime=133.859000
program finished with exit code -1
=== Output ended ===
======== BuildStep ended ========


WINNT 5.2 tracemonkey build on 2009/08/19 15:29:24

rollback completed
abort: premature EOF reading chunk (got 154 bytes, expected 349)
elapsedTime=27.860000
program finished with exit code -1
=== Output ended ===
======== BuildStep ended ========

Could we at least have a work-around in the build script that tries to pull 3 times in a row or something like that?
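
For reference, such a workaround could look roughly like this (a minimal sketch, not the actual buildbot step; the repository URL, destination, and retry/backoff values are illustrative):

  # A sketch of a retry wrapper around "hg clone"; not the actual buildbot step.
  # The repository URL, destination, and retry/backoff values are illustrative.
  import shutil
  import subprocess
  import sys
  import time

  def clone_with_retries(repo_url, dest, attempts=3, delay=30):
      for attempt in range(1, attempts + 1):
          result = subprocess.run(["hg", "clone", "--noupdate", repo_url, dest])
          if result.returncode == 0:
              return True
          print("clone attempt %d failed (exit %d), retrying in %ds"
                % (attempt, result.returncode, delay), file=sys.stderr)
          # A failed clone can leave a partial directory behind; remove it first.
          shutil.rmtree(dest, ignore_errors=True)
          time.sleep(delay)
      return False

  if __name__ == "__main__":
      ok = clone_with_retries("https://hg.mozilla.org/mozilla-central", "mozilla-central")
      sys.exit(0 if ok else 1)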
Summary of a quick chat with chizu as he headed out to the colo.

hg.m.o is actually a round robin across 2 VMs and 1 physical machine. Yesterday this was changed to 1 VM and 2 physical machines.

Turns out the physical machines have less RAM than the VM. While I/O load on the physical machines is OK, this is causing the file cache on them to get borked. To fix this, Chizu is going to upgrade the RAM on each physical machine, one at a time. Each machine should take <10 mins to upgrade.

Technically, powering down one machine at a time like this does not mean closing hg.m.o, but given that the two machines are low on RAM, it's unlikely that failover to the remaining machines will go well. Therefore we'll do a quick tree closure as soon as chizu is in the colo.

(/me has also notified sheriff of upcoming downtime)
dm-vcview01, 03 were upgraded from 4GB ram to 8GB ram and restarted.  

We *think* the hg problem is fixed. But obviously if you see/hear otherwise, we'd like to know!
(In reply to comment #48)
> dm-vcview01, 03 were upgraded from 4GB ram to 8GB ram and restarted.  
> 
> We *think* the hg problem is fixed.

What was the root cause?
I think mrz is going to do a post-mortem description later, but people have been working on this stuff flat-out since yesterday, so I'm happy to let them catch their breath before they dissect the causes. :-)
Not much activity on the main branches since 1800 PDT to test the fix, so I've fired off about 12 builds on mozilla-1.9.2. 

Still getting a lot of transaction abort messages on try server builds, so perhaps we should go ahead with bug 511471 asap. There are about 590 heads on that repo at the moment.
(In reply to comment #50)
> I think mrz is going to do a post-mortem description later, but people have
> been working on this stuff flat-out since yesterday, so I'm happy to let them
> catch their breath before they dissect the causes. :-)

Adding more RAM as a quick fix is cool. There can't be a post-mortem until the root cause is determined.
Multiple failures on TM tree.

rollback completed
abort: premature EOF reading chunk (got 26366 bytes, expected 42292)
elapsedTime=163.192947
program finished with exit code 255
=== Output ended ===
======== BuildStep ended ========
(In reply to comment #53)
> Multiple failures on TM tree.
> 
> rollback completed
> abort: premature EOF reading chunk (got 26366 bytes, expected 42292)
> elapsedTime=163.192947
> program finished with exit code 255
> === Output ended ===
> ======== BuildStep ended ========

If these happened at 7:49, it was caused by a configuration change that needed an apache reload.
Just to plan out stuff over the next while, is there a point at which we'll just roll back all changes (any new build hosts, any config changes, any underlying VM/hardware moves, etc.) to the last known good state?  I understand and appreciate that people are working hard to move forward to a better state with faster hardware and so forth, but we're hurting pretty badly right now, and it's not clear what the plan is now that the additional RAM didn't make things all better.

Let me know if there are things I can do (get the tree closed, have people look at python core dumps, pull in extra analysis help, anything) to help move it forward; this is basically the most important thing in my life right now.

(I'm 647.296.1922 if you think of something you need during the night.)
Some updates:
* bug 511471 was tried without improvement.
* Looking at the history of what's related to this, it seems it was a mistake to focus on the VM -> hardware change, as the problem predates it.
* RAM was 4GB on the dm-vcview01 VM and 4GB on dm-vcview01 hardware. This is now 8GB, which should improve performance, but does not affect the timeouts.
* Configuration changes were made on dm-vcview01, but have been reverted to the original VM configs.
Not sure if this is related:

WINCE 5.0 mozilla-1.9.2 build on 2009/08/19 22:44:43
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox3.6/1250747083.1250747193.22109.gz
requesting all changes
adding changesets
adding manifests
adding file changes
transaction abort!
rollback completed
abort: integrity check failed on data/db/sqlite3/src/sqlite3.c.d:22!
elapsedTime=96.688000
program finished with exit code -1

and

Linux mozilla-1.9.2 build on 2009/08/19 22:44:43
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox3.6/1250747083.1250747193.22109.gz
requesting all changes
adding changesets
adding manifests
adding file changes
transaction abort!
rollback completed
abort: integrity check failed on data/db/sqlite3/src/sqlite3.c.d:22!
elapsedTime=117.885569
program finished with exit code 255
(In reply to comment #55)
> Just to plan out stuff over the next while, is there a point at which we'll
> just roll back all changes (any new build hosts, any config changes, any
> underlying VM/hardware moves, etc.) to the last known good state? 

I mentioned this to shaver out-of-band but will include it here too.

Nothing on the system side had changed prior to this bug being opened.  Build increased its load last Thursday, but IT hadn't changed any systems or configs.

The only changes we've made have been reactive, and I suspect we'd be in a worse state if we reverted to the VMs, which were clearly CPU-starved.

Point #2 in comment #0 really refers to the migration from VM to physical server (and nthomas said so on IRC too).

The theory right now is that this has been happening for a while and has only recently gotten bad enough to be noticeable.

Aravind has increased the timeouts on nginx, so Apache shouldn't have any more "failure to send data" errors.
On the backend (read-only) servers that serve this content, apache logs this for a try repo clone.

10.2.74.13 - - [19/Aug/2009:23:13:40 -0700] "GET /try?bases=0000000000000000000000000000000000000000&cmd=changegroupsubset&heads=c4cdfd9a2881f088dca2010604182addbbb2fe38+3c5a5cc10ea95b787198a1e7817af24fad7427b5+f34211125e93965d8e908e4ec11c0740ef5ccf97+dc02eff6d8c0164cd9cae0151b8c6a8b914c7c8d+b42a1594f61338e48a8fab40a8a2cbfee0120e9d+cab97aca485d30972f9a624eb92ca5e027571692+c4ebfef7afa6d7b417c9476bceb27053d4118157 HTTP/1.0" 200 148024974 "-" "mercurial/proto-1.0"


That's one that works.  However, I occasionally see entries that say:
10.2.74.13 - - [19/Aug/2009:23:13:39 -0700] "GET /try?bases=0000000000000000000000000000000000000000&cmd=changegroupsubset&heads=c4cdfd9a2881f088dca2010604182addbbb2fe38+3c5a5cc10ea95b787198a1e7817af24fad7427b5+f34211125e93965d8e908e4ec11c0740ef5ccf97+dc02eff6d8c0164cd9cae0151b8c6a8b914c7c8d+b42a1594f61338e48a8fab40a8a2cbfee0120e9d+cab97aca485d30972f9a624eb92ca5e027571692+c4ebfef7afa6d7b417c9476bceb27053d4118157 HTTP/1.0" 200 56464598 "-" "mercurial/proto-1.0"


If you look carefully, you will see that the number of bytes returned is different for the exact same query.  Those, I am willing to bet, are the ones that fail.  Why apache returns a 200 status code but fewer bytes is still a mystery.  Looking into that some more.
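
One way to spot those truncated responses systematically would be something like the following sketch (the log path and line format are assumptions based on the entries quoted above): group identical changegroupsubset requests and flag any whose 200-response byte counts disagree.

  # Sketch: flag identical mercurial changegroupsubset requests whose response
  # sizes differ in an Apache access log. Log path and format are assumptions.
  import re
  import sys
  from collections import defaultdict

  # Matches: '"GET /try?...cmd=changegroupsubset... HTTP/1.0" 200 148024974'
  LINE_RE = re.compile(r'"GET (\S*cmd=changegroupsubset\S*) HTTP/[\d.]+" (\d{3}) (\d+)')

  sizes = defaultdict(set)
  with open(sys.argv[1] if len(sys.argv) > 1 else "access_log") as log:
      for line in log:
          m = LINE_RE.search(line)
          if m and m.group(2) == "200":
              sizes[m.group(1)].add(int(m.group(3)))

  for query, byte_counts in sizes.items():
      if len(byte_counts) > 1:
          # Same query, different 200-response byte counts: likely truncation.
          print("%s\n  sizes seen: %s" % (query, sorted(byte_counts)))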
So, based on that last comment, I changed apache to isolate the mod_wsgi instances between requests, so they don't trample on each other and fuck up my otherwise beautiful mercurial setup.  I changed apache to use a new thread for every new mercurial request.  Aki tested this, and he was able to consistently check out the try repo 10 times in parallel (in two different tests - I am calling that resolved beyond a doubt).

As always, please re-open if things are still broken.
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
gd, and this is why we celebrate sysadmin appreciation day (http://www.sysadminday.com/).  And you're on vacation, no less!

Thanks!
Fired off a bunch of nightlies (>10 I think) across the two production-masters.  They've all passed the hg clone steps and hit the compile steps.  I'm ready to call this fixed, but the nightlies tonight should give us an additional level of verification.

Thanks everyone!
(In reply to comment #62)
> Fired off a bunch of nightlies (>10 I think) across the two production-masters.
>  They've all passed the hg clone steps and hit the compile steps.  I'm ready to
> call this fixed, but the nightlies tonight should give us an additional level
> of verification.
> 
> Thanks everyone!

Cool, let's hope the hack holds up.  Thanks to Nick and Aki for being very patient and helping me isolate and test the patch.

Thanks to justdave, chizu and fox2mike for seeing this through all the permutations and combinations.
Win2k3 comm-central build on 2009/08/20 16:49:24
http://tinderbox.mozilla.org/showlog.cgi?log=Thunderbird/1250812164.1250812540.26613.gz
pulling from http://hg.mozilla.org/mozilla-central
searching for changes
adding changesets
adding manifests
adding file changes
transaction abort!
rollback completed
Severity: blocker → major
Win2k3 comm-1.9.1 check on 2009/08/20 12:33:12
http://tinderbox.mozilla.org/showlog.cgi?log=Thunderbird3.0/1250796792.1250797958.22186.gz
hg clone http://hg.mozilla.org/comm-central
requesting all changes
adding changesets
adding manifests
adding file changes
transaction abort!
rollback completed
abort: connection ended unexpectedly
And two more from overnight:

Win2k3 comm-central build Date: 2009/08/20 22:17:51
http://tinderbox.mozilla.org/showlog.cgi?log=Thunderbird/1250831871.1250832608.25758.gz
pulling from http://hg.mozilla.org/mozilla-central
searching for changes
adding changesets
adding manifests
adding file changes

Win2k3 comm-1.9.1 nightly Date: 2009/08/21 03:00:01
http://tinderbox.mozilla.org/showlog.cgi?log=Thunderbird3.0/1250848801.1250851373.6971.gz
requesting all changes
adding changesets
adding manifests
adding file changes
transaction abort!
rollback completed
What IP address are those boxes getting for "hg.mozilla.org"?
Seems to be round-robin between 2 IPs:

63.245.208.189 dm-hg02.mozilla.org.
63.245.208.188 dm-hg01.mozilla.org.
Win2k3 comm-central nightly Date: 2009/08/22 03:00:00
http://tinderbox.mozilla.org/showlog.cgi?log=Thunderbird/1250935200.1250936732.2481.gz

One thing worth noting is that this particular clone died almost precisely 20 minutes after it started.
(In reply to comment #68)
> Seems to be round-robin between 2 IPs:
> 
> 63.245.208.189 dm-hg02.mozilla.org.
> 63.245.208.188 dm-hg01.mozilla.org.

chizu, are those the ones he should be hitting externally?  vcview03 isn't in that list.
Five more overnight, with a relatively new one on Linux:

Win2k3 comm-central nightly Date: 2009/08/25 06:37:34
http://tinderbox.mozilla.org/showlog.cgi?log=Thunderbird/1251207454.1251208548.19215.gz
[hg] transaction abort!

Win2k3 comm-central nightly Date: 2009/08/25 04:02:37
http://tinderbox.mozilla.org/showlog.cgi?log=Thunderbird/1251198157.1251199875.21202.gz
[hg] transaction abort!

Win2k3 comm-central nightly Date: 2009/08/25 03:16:08
http://tinderbox.mozilla.org/showlog.cgi?log=Thunderbird/1251195368.1251196665.16025.gz
[hg] transaction abort!

Win2k3 comm-1.9.1 nightly Date: 2009/08/25 03:19:15
http://tinderbox.mozilla.org/showlog.cgi?log=Thunderbird3.0/1251195555.1251197210.22055.gz
[hg] transaction abort!

Linux comm-1.9.1 nightly Date: 2009/08/25 03:00:00
http://tinderbox.mozilla.org/showlog.cgi?log=Thunderbird3.0/1251194400.1251194548.25822.gz
raise IOError(None, _('connection ended unexpectedly'))
IOError: (None, 'connection ended unexpectedly')
abort: connection ended unexpectedly
One theory worth considering is that our connectivity to hg.m.o is over a 100Mb link, and I'll guess the moco build machines can get at it via gigabit. This would mean everything we do would be at least 10 times slower, and could make timeouts/bailouts on the server side more likely?
If you're bandwidth-constrained, you should see that in any trend graphs.
Looking at bandwidth usage graphs, specifically for my uplink to MoCo, I can see we are nowhere near any line limit. Thinking about it, in my various hg benchmarks, I've never found hg to be able to get close to wire speed, so it's probably a red herring.

For fun and profit, try and see how fast you can grab http://hg.mozilla.org/mozilla-central/archive/tip.tar.bz2

600K/s is pretty much the best I've ever seen.
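
For anyone who wants a number rather than a feel, a quick throughput check could look like this (a sketch; the URL is the one above, the chunk size is arbitrary):

  # Sketch: measure average download throughput for a single URL.
  # The chunk size and the use of urllib are arbitrary choices for illustration.
  import time
  import urllib.request

  URL = "http://hg.mozilla.org/mozilla-central/archive/tip.tar.bz2"

  start = time.time()
  total = 0
  with urllib.request.urlopen(URL) as resp:
      while True:
          chunk = resp.read(64 * 1024)
          if not chunk:
              break
          total += len(chunk)

  elapsed = time.time() - start
  print("%d KB in %.1fs = %.0f KB/s" % (total // 1024, elapsed, total / 1024 / elapsed))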
Win2k3 comm-central bloat Date: 2009/08/25 23:48:10
http://tinderbox.mozilla.org/showlog.cgi?log=Thunderbird/1251269290.1251269528.7496.gz
abort: connection ended unexpectedly
Time: real 95.972 secs (user 3.906+0.000 sys 0.891+0.000)
There are still failures occurring on MoCo builders, but only infrequently. E.g.:

"WINNT 5.2 places nightly" at Sun Aug 30 04:02:00 2009, cloning build/tools:
destination directory: tools
requesting all changes
abort: HTTP Error 502: Bad Gateway

"Linux try hg build" at Sat Aug 29 13:22:36 2009
transaction abort!
rollback completed
abort: connection ended unexpectedly
And also at Fri Aug 28 17:11:57 2009.
Failed to pull build/tools a bit after 2009/09/06 04:02:00
http://tinderbox.mozilla.org/showlog.cgi?log=Electrolysis/1252234920.1252238938.24591.gz

adding file changes
transaction abort!
rollback completed
abort: premature EOF reading chunk (got 700601 bytes, expected 2069754)

Much the same, only on Linux, in
http://tinderbox.mozilla.org/showlog.cgi?log=Electrolysis/1252234920.1252236746.30486.gz
Product: mozilla.org → mozilla.org Graveyard