Closed Bug 1333348 Opened 7 years ago Closed 7 years ago

Intermittent abort: missing support for negative part header size: -424604107

Categories

(Developer Services :: Mercurial: hg.mozilla.org, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED DUPLICATE of bug 1291926

People

(Reporter: intermittent-bug-filer, Unassigned)

References

Details

(Keywords: intermittent-failure, Whiteboard: [stockwell unknown])

also hit autoland now
gps: could you take a look ?
Flags: needinfo?(gps)
This failure confounds me.

robustcheckout is retrying as it is supposed to. However, the error keeps manifesting. But it isn't the same error and it doesn't happen in exactly the same place.

On retry, the stream from the server should be identical to what was requested before, as the client's state hasn't changed. But the failure occurs in different locations.

It certainly looks like bits are getting corrupted somehow. But until I get a dump of data over the wire that can reproduce this so I can step through with a debugger, I'm not going to be able to pinpoint this.

What's really odd is we only seem to be seeing these consistent failures on Windows. It is possible we're looking at a bug with TLS decoding or Python memory somehow getting corrupted inside a C library. I'm kinda curious where the python.exe on these Windows instances comes from...

pmoore: can you shed light on the source of python.exe on Windows TC workers?
Flags: needinfo?(gps) → needinfo?(pmoore)
I think this is a buildbot slave, so redirecting to grenade.

Python on the TC builders comes from Mozilla Build 2.2.0:

  * https://github.com/mozilla-releng/OpenCloudConfig/blob/8dc9f3e087921e483c07bd47c1488273de457da0/userdata/Manifest/gecko-1-b-win2012.json#L445
Flags: needinfo?(pmoore) → needinfo?(rthijssen)
b-2008-spot-009 was already terminated but i got the output below from b-2008-spot-089 which wasn't:

C:\Users\Administrator>where hg
C:\mozilla-build\hg\hg.exe

C:\Users\Administrator>hg --version
Mercurial Distributed SCM (version 3.9.1)
(see https://mercurial-scm.org for more information)

Copyright (C) 2005-2016 Matt Mackall and others
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

C:\Users\Administrator>where python
C:\mozilla-build\python27\python.exe
C:\mozilla-build\buildbotve\Scripts\python.exe

C:\Users\Administrator>python --version
Python 2.7.5
Flags: needinfo?(rthijssen)
philor found this issue on a Linux machine. So it isn't isolated to Windows. But I'm still concerned about the relative frequency on Windows. We run a lot more jobs on Linux and the fact we hardly see this failure on Linux is alarming.

I think triaging this will require me to RDP into an instance that can reproduce the issue. And given the nature of the problem I may need to debug minutes after it occurs before state on hg.mozilla.org changes.

I'm not sure when I'll have time to look into this, however...
See Also: → 1340630
:mshal, any updates here?
Flags: needinfo?(mshal)
I'm not working on this, and I'm not really an hg person so I'm not sure how to fix it.

IMO there are really two problems here:

1) The hg server returns bogus data sometimes, which seems to be more easily triggered when under load (eg: a large checkout). This results in things like the negative header size, and I speculate it could also result in the struct error in bug 1340630. Someone on the vcs team who is familiar with hg server management can probably help here.

2) The hg client (Windows builder) pulls an inordinate amount of data for the number of changesets it grabs. If I'm reading the hg output correctly, it looks like it pulls a 1.7GB bundle even for only 170 changesets. Again here we'll probably need someone more familiar with hg than I :/

applying clone bundle from https://s3-us-west-2.amazonaws.com/moz-hg-bundles-us-west-2/mozilla-unified/99f0792ae01e564e9b17a54b5010cb58eeb9b274.packed1-gd.hg
316715 files to transfer, 1.70 GB of data
transferred 1.70 GB in 241.4 seconds (7.22 MB/sec)
finished applying clone bundle
searching for changes
adding changesets
adding manifests
adding file changes
added 170 changesets with 384 changes to 291 files (+1 heads)

changesets [============================================================>] 1/1
                                                                               
searching for changes
adding changesets
adding manifests
adding file changes
added 1 changesets with 0 changes to 0 files (-1 heads)

I suspect if we fixed pulling tons of data on every build that we'd see this a lot less, though really both issues should be fixed.
Flags: needinfo?(mshal)
I'm not sure we have an hg expert other than gps, so this may need to wait till he returns.
Whiteboard: [stockwell unknown]
The negative part header size is likely a bit flip. Why, I don't know. There have been reports of this to upstream Mercurial as well. Appears to be primarily on Windows.

This is a non-trivial bug to investigate and will likely require several days of someone's time.
Looking at logs, this appears to be a dupe of bug 1291926. I'm seeing the same behavior: really slow application of manifests followed by a "stream ended unexpectedly" error. This screams of the network timeout I diagnosed in bug 1291926.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → DUPLICATE
You need to log in before you can comment on or make changes to this bug.