Intermittent abort: missing support for negative part header size: -424604107

Status

RESOLVED DUPLICATE of bug 1291926

Product: Developer Services
Component: Mercurial: hg.mozilla.org
Opened: 9 months ago
Closed: 6 months ago

People

Reporter: Treeherder Bug Filer
Assignee: Unassigned

Tracking

Keywords: intermittent-failure

Details

Whiteboard: [stockwell unknown]

(Reporter)

Description

9 months ago
Filed by: cbook [at] mozilla.com

https://treeherder.mozilla.org/logviewer.html#?job_id=71422832&repo=autoland

https://archive.mozilla.org/pub/firefox/tinderbox-builds/autoland-win64/1485237043/autoland-win64-bm70-build1-build878.txt.gz
Also hit autoland now.
gps: could you take a look?
Flags: needinfo?(gps)

Comment 3

9 months ago
This failure confounds me.

robustcheckout is retrying as it is supposed to, yet the error keeps manifesting. It isn't the same error each time, nor does it happen in exactly the same place.

On retry, the stream from the server should be identical to what was requested before, as the client's state hasn't changed. But the failure occurs in different locations.
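
(For readers unfamiliar with robustcheckout: the retry behavior described here is roughly the pattern sketched below. This is an illustrative sketch only, with hypothetical names; it is not robustcheckout's actual implementation.)

    import time

    def pull_with_retries(pull_once, max_attempts=3, delay=2.0):
        # `pull_once` is a hypothetical callable that performs the network
        # pull and raises an exception (e.g. an hg abort) on failure.
        for attempt in range(1, max_attempts + 1):
            try:
                return pull_once()
            except Exception as exc:
                # Local state is unchanged after a failed pull, so the server
                # should stream the same data again on the next attempt.
                if attempt == max_attempts:
                    raise
                print("pull failed (%s); retrying %d/%d" % (exc, attempt, max_attempts))
                time.sleep(delay)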

It certainly looks like bits are getting corrupted somehow. But until I can capture a dump of the data sent over the wire that reproduces this and step through it with a debugger, I'm not going to be able to pinpoint the cause.

What's really odd is that we only seem to be seeing these failures consistently on Windows. It is possible we're looking at a bug in TLS decoding, or Python memory somehow getting corrupted inside a C library. I'm kinda curious where the python.exe on these Windows instances comes from...

pmoore: can you shed light on the source of python.exe on Windows TC workers?
Flags: needinfo?(gps) → needinfo?(pmoore)

Comment 4

9 months ago
28 failures in 152 pushes (0.184 failures/push) were associated with this bug yesterday.  

Repository breakdown:
* autoland: 27
* mozilla-inbound: 1

Platform breakdown:
* windows8-64: 17
* windowsxp: 11

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1333348&startday=2017-01-24&endday=2017-01-24&tree=all

I think this is a buildbot slave, so redirecting to grenade.

Python on the TC builders comes from Mozilla Build 2.2.0:

  * https://github.com/mozilla-releng/OpenCloudConfig/blob/8dc9f3e087921e483c07bd47c1488273de457da0/userdata/Manifest/gecko-1-b-win2012.json#L445
Flags: needinfo?(pmoore) → needinfo?(rthijssen)

b-2008-spot-009 was already terminated, but I got the output below from b-2008-spot-089, which wasn't:

C:\Users\Administrator>where hg
C:\mozilla-build\hg\hg.exe

C:\Users\Administrator>hg --version
Mercurial Distributed SCM (version 3.9.1)
(see https://mercurial-scm.org for more information)

Copyright (C) 2005-2016 Matt Mackall and others
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

C:\Users\Administrator>where python
C:\mozilla-build\python27\python.exe
C:\mozilla-build\buildbotve\Scripts\python.exe

C:\Users\Administrator>python --version
Python 2.7.5
Flags: needinfo?(rthijssen)

Comment 7

9 months ago
philor found this issue on a Linux machine, so it isn't isolated to Windows. But I'm still concerned about the relative frequency on Windows: we run a lot more jobs on Linux, and the fact that we hardly ever see this failure there is alarming.

I think triaging this will require me to RDP into an instance that can reproduce the issue. And given the nature of the problem, I may need to debug it within minutes of it occurring, before state on hg.mozilla.org changes.

I'm not sure when I'll have time to look into this, however...

Comment 8

9 months ago
21 failures in 141 pushes (0.149 failures/push) were associated with this bug yesterday.  

Repository breakdown:
* autoland: 12
* mozilla-inbound: 7
* mozilla-central: 1
* mozilla-aurora: 1

Platform breakdown:
* windowsxp: 11
* windows8-64: 3
* linux64: 3
* linux32: 2
* android-4-2-x86: 1
* android-4-0-armv7-api15: 1

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1333348&startday=2017-01-27&endday=2017-01-27&tree=all

Comment 9

9 months ago
70 failures in 749 pushes (0.093 failures/push) were associated with this bug in the last 7 days. 

This is the #27 most frequent failure this week. 

** This failure happened more than 50 times this week! Resolving this bug is a high priority. **

Repository breakdown:
* autoland: 56
* mozilla-inbound: 10
* mozilla-central: 2
* mozilla-aurora: 2

Platform breakdown:
* windowsxp: 32
* windows8-64: 27
* linux64: 4
* android-4-0-armv7-api15: 4
* linux32: 2
* android-4-2-x86: 1

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1333348&startday=2017-01-23&endday=2017-01-29&tree=all

Duplicate of this bug: 1309715

Comment 11

9 months ago
27 failures in 733 pushes (0.037 failures/push) were associated with this bug in the last 7 days.  

Repository breakdown:
* autoland: 16
* mozilla-inbound: 7
* mozilla-aurora: 3
* mozilla-central: 1

Platform breakdown:
* windowsxp: 18
* windows8-64: 7
* linux64: 2

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1333348&startday=2017-01-30&endday=2017-02-05&tree=all

Comment 12

8 months ago
11 failures in 836 pushes (0.013 failures/push) were associated with this bug in the last 7 days.  
Repository breakdown:
* autoland: 4
* mozilla-inbound: 3
* mozilla-aurora: 2
* try: 1
* mozilla-central: 1

Platform breakdown:
* windowsxp: 6
* windows8-64: 4
* android-4-0-armv7-api15: 1

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1333348&startday=2017-02-06&endday=2017-02-12&tree=all

Updated

8 months ago
See Also: → bug 1340630

Comment 13

8 months ago
29 failures in 833 pushes (0.035 failures/push) were associated with this bug in the last 7 days.  
Repository breakdown:
* autoland: 18
* mozilla-inbound: 5
* try: 3
* mozilla-aurora: 2
* mozilla-central: 1

Platform breakdown:
* windowsxp: 17
* windows8-64: 8
* linux64: 4

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1333348&startday=2017-02-13&endday=2017-02-19&tree=all

:mshal, any updates here?
Flags: needinfo?(mshal)

Comment 15

8 months ago
18 failures in 812 pushes (0.022 failures/push) were associated with this bug in the last 7 days.  
Repository breakdown:
* mozilla-inbound: 8
* autoland: 5
* mozilla-aurora: 4
* try: 1

Platform breakdown:
* windowsxp: 12
* windows8-64: 5
* android-4-2-x86: 1

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1333348&startday=2017-02-20&endday=2017-02-26&tree=all

I'm not working on this, and I'm not really an hg person, so I'm not sure how to fix it.

IMO there are really two problems here:

1) The hg server sometimes returns bogus data, which seems to be more easily triggered when under load (e.g. a large checkout). This results in things like the negative header size, and I speculate it could also cause the struct error in bug 1340630. Someone on the vcs team who is familiar with hg server management can probably help here.

2) The hg client (Windows builder) pulls an inordinate amount of data for the number of changesets it grabs. If I'm reading the hg output correctly, it pulls a 1.7 GB bundle even though only 170 changesets are added (see the rough arithmetic below the log excerpt). Again, we'll probably need someone more familiar with hg than I am here. :/

applying clone bundle from https://s3-us-west-2.amazonaws.com/moz-hg-bundles-us-west-2/mozilla-unified/99f0792ae01e564e9b17a54b5010cb58eeb9b274.packed1-gd.hg
316715 files to transfer, 1.70 GB of data
transferred 1.70 GB in 241.4 seconds (7.22 MB/sec)
finished applying clone bundle
searching for changes
adding changesets
adding manifests
adding file changes
added 170 changesets with 384 changes to 291 files (+1 heads)

changesets [============================================================>] 1/1
                                                                               
searching for changes
adding changesets
adding manifests
adding file changes
added 1 changesets with 0 changes to 0 files (-1 heads)

I suspect if we fixed pulling tons of data on every build that we'd see this a lot less, though really both issues should be fixed.
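
(For a rough sense of scale behind point 2, taking the log excerpt above at face value: the 1.70 GB figure is the mozilla-unified clone bundle, with the 170 changesets then pulled incrementally on top of it. A back-of-the-envelope calculation, not a measurement:)

    # Figures copied from the hg output above.
    bundle_gb = 1.70              # "transferred 1.70 GB" -- the packed1 clone bundle
    incremental_changesets = 170  # "added 170 changesets with 384 changes to 291 files"

    # Transfer attributed to each changeset if the whole download is charged
    # to this one checkout, which is the reading the comment above takes.
    mb_per_changeset = bundle_gb * 1024 / incremental_changesets
    print("%.1f MB transferred per changeset" % mb_per_changeset)  # roughly 10 MB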
Flags: needinfo?(mshal)

I'm not sure we have an hg expert other than gps, so this may need to wait till he returns.
Whiteboard: [stockwell unknown]

Comment 18

8 months ago
20 failures in 783 pushes (0.026 failures/push) were associated with this bug in the last 7 days.  
Repository breakdown:
* mozilla-esr52: 9
* autoland: 3
* mozilla-release: 2
* mozilla-central: 2
* mozilla-aurora: 2
* mozilla-inbound: 1
* mozilla-beta: 1

Platform breakdown:
* windowsxp: 13
* windows8-64: 3
* linux64: 2
* android-4-2-x86: 1
* android-4-0-armv7-api15: 1

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1333348&startday=2017-02-27&endday=2017-03-05&tree=all

Comment 19

7 months ago
8 failures in 777 pushes (0.01 failures/push) were associated with this bug in the last 7 days.   

Repository breakdown:
* mozilla-esr52: 4
* autoland: 4

Platform breakdown:
* windowsxp: 5
* windows8-64: 3

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1333348&startday=2017-03-13&endday=2017-03-19&tree=all

Comment 20

7 months ago
The negative part header size is likely a bit flip. Why, I don't know. There have been reports of this to upstream Mercurial as well. It appears to happen primarily on Windows.

This is a non-trivial bug to investigate and will likely require several days of someone's time.
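
(To make the bit-flip theory concrete, here is an illustrative sketch. It assumes the part header size is read off the wire as a 4-byte big-endian signed integer, which is what a negative value in the abort message implies; the "good" value below is made up for illustration and this is not a reproduction of the actual failure.)

    import struct

    # A plausible, in-range part header size as it would appear on the wire.
    good = struct.pack(">i", 1722879541)

    # Flip the most significant bit of the first byte, as a single corrupted
    # bit in transit or in memory would.
    corrupted = bytearray(good)
    corrupted[0] ^= 0x80

    print(struct.unpack(">i", good)[0])              # 1722879541
    print(struct.unpack(">i", bytes(corrupted))[0])  # -424604107, the value in this bug's summary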

Comment 21

7 months ago
12 failures in 898 pushes (0.013 failures/push) were associated with this bug in the last 7 days.   

Repository breakdown:
* autoland: 5
* mozilla-esr52: 3
* mozilla-inbound: 1
* mozilla-central: 1
* mozilla-beta: 1
* mozilla-aurora: 1

Platform breakdown:
* windowsxp: 6
* windows8-64: 6

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1333348&startday=2017-03-20&endday=2017-03-26&tree=all

Comment 22

7 months ago
8 failures in 845 pushes (0.009 failures/push) were associated with this bug in the last 7 days.   

Repository breakdown:
* autoland: 5
* mozilla-inbound: 1
* mozilla-esr52: 1
* mozilla-central: 1

Platform breakdown:
* windowsxp: 6
* windows8-64: 2

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1333348&startday=2017-03-27&endday=2017-04-02&tree=all

Comment 23

7 months ago
5 failures in 867 pushes (0.006 failures/push) were associated with this bug in the last 7 days.   

Repository breakdown:
* mozilla-beta: 4
* mozilla-release: 1

Platform breakdown:
* linux64: 3
* windowsxp: 1
* windows8-64: 1

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1333348&startday=2017-04-03&endday=2017-04-09&tree=all

Comment 24

6 months ago
Looking at logs, this appears to be a dupe of bug 1291926. I'm seeing the same behavior: really slow application of manifests followed by a "stream ended unexpectedly" error. This screams of the network timeout I diagnosed in bug 1291926.
Status: NEW → RESOLVED
Last Resolved: 6 months ago
Resolution: --- → DUPLICATE
Duplicate of bug: 1291926