1333348 - Intermittent abort: missing support for negative part header size: -424604107

Reporter

Description

7 years ago

treeherder

Filed by: cbook [at] mozilla.com

https://treeherder.mozilla.org/logviewer.html#?job_id=71422832&repo=autoland

https://archive.mozilla.org/pub/firefox/tinderbox-builds/autoland-win64/1485237043/autoland-win64-bm70-build1-build878.txt.gz

Carsten Book [:Tomcat]

Comment 1

7 years ago

This has also hit autoland now.

Carsten Book [:Tomcat]

Comment 2

7 years ago

gps: could you take a look?

Flags: needinfo?(gps)

Gregory Szorc [:gps]

Comment 3

7 years ago

This failure confounds me.

robustcheckout is retrying as it is supposed to. However, the error keeps manifesting. But it isn't the same error and it doesn't happen in exactly the same place.

On retry, the stream from the server should be identical to what was requested before, as the client's state hasn't changed. But the failure occurs in different locations.

It certainly looks like bits are getting corrupted somehow. But until I get a dump of data over the wire that can reproduce this so I can step through with a debugger, I'm not going to be able to pinpoint this.
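If a wire dump from two attempts were available, one way to test the "identical stream on retry" expectation would be to hash each attempt and locate the first byte where they differ. The following is only a minimal sketch under that assumption: `stream_digest` and `first_divergence` are hypothetical helper names, not part of robustcheckout or Mercurial, and the sample bytes are made up.

```python
import hashlib


def stream_digest(chunks):
    """Return the SHA-256 hex digest of an iterable of byte chunks."""
    h = hashlib.sha256()
    for chunk in chunks:
        h.update(chunk)
    return h.hexdigest()


def first_divergence(a, b):
    """Return the offset of the first differing byte, or None if a == b."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    if len(a) != len(b):
        # One capture is a strict prefix of the other (e.g. a truncated stream).
        return min(len(a), len(b))
    return None


# Made-up stand-ins for two captured attempts; real data would come from a dump.
attempt_1 = b"HG20" + bytes(range(64))
attempt_2 = bytearray(attempt_1)
attempt_2[10] ^= 0x80  # simulate a single corrupted bit in the second capture
attempt_2 = bytes(attempt_2)

if stream_digest([attempt_1]) != stream_digest([attempt_2]):
    print("attempts differ at byte offset", first_divergence(attempt_1, attempt_2))
```

Differing digests for the same request would point at corruption in transit or on the client rather than at a deterministic server-side problem.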

What's really odd is we only seem to be seeing these consistent failures on Windows. It is possible we're looking at a bug with TLS decoding or Python memory somehow getting corrupted inside a C library. I'm kinda curious where the python.exe on these Windows instances comes from...

pmoore: can you shed light on the source of python.exe on Windows TC workers?

Flags: needinfo?(gps) → needinfo?(pmoore)

Comment hidden (Intermittent Failures Robot)

28 failures in 152 pushes (0.184 failures/push) were associated with this bug yesterday.  

Repository breakdown:
* autoland: 27
* mozilla-inbound: 1

Platform breakdown:
* windows8-64: 17
* windowsxp: 11

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1333348&startday=2017-01-24&endday=2017-01-24&tree=all

Pete Moore [:pmoore][:pete]

Comment 5

7 years ago

I think this is a buildbot slave, so redirecting to grenade.

Python on the TC builders comes from Mozilla Build 2.2.0:

  * https://github.com/mozilla-releng/OpenCloudConfig/blob/8dc9f3e087921e483c07bd47c1488273de457da0/userdata/Manifest/gecko-1-b-win2012.json#L445

Flags: needinfo?(pmoore) → needinfo?(rthijssen)

Rob Thijssen [:grenade (EET/UTC+0300)]

Comment 6

7 years ago

b-2008-spot-009 was already terminated, but I got the output below from b-2008-spot-089, which wasn't:

C:\Users\Administrator>where hg
C:\mozilla-build\hg\hg.exe

C:\Users\Administrator>hg --version
Mercurial Distributed SCM (version 3.9.1)
(see https://mercurial-scm.org for more information)

Copyright (C) 2005-2016 Matt Mackall and others
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

C:\Users\Administrator>where python
C:\mozilla-build\python27\python.exe
C:\mozilla-build\buildbotve\Scripts\python.exe

C:\Users\Administrator>python --version
Python 2.7.5

Flags: needinfo?(rthijssen)

Gregory Szorc [:gps]

Comment 7

7 years ago

philor found this issue on a Linux machine. So it isn't isolated to Windows. But I'm still concerned about the relative frequency on Windows. We run a lot more jobs on Linux and the fact we hardly see this failure on Linux is alarming.

I think triaging this will require me to RDP into an instance that can reproduce the issue. And given the nature of the problem I may need to debug minutes after it occurs before state on hg.mozilla.org changes.

I'm not sure when I'll have time to look into this, however...

Comment hidden (Intermittent Failures Robot)

21 failures in 141 pushes (0.149 failures/push) were associated with this bug yesterday.  

Repository breakdown:
* autoland: 12
* mozilla-inbound: 7
* mozilla-central: 1
* mozilla-aurora: 1

Platform breakdown:
* windowsxp: 11
* windows8-64: 3
* linux64: 3
* linux32: 2
* android-4-2-x86: 1
* android-4-0-armv7-api15: 1

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1333348&startday=2017-01-27&endday=2017-01-27&tree=all

Comment hidden (Intermittent Failures Robot)

70 failures in 749 pushes (0.093 failures/push) were associated with this bug in the last 7 days. 

This is the #27 most frequent failure this week. 

** This failure happened more than 50 times this week! Resolving this bug is a high priority. **

Repository breakdown:
* autoland: 56
* mozilla-inbound: 10
* mozilla-central: 2
* mozilla-aurora: 2

Platform breakdown:
* windowsxp: 32
* windows8-64: 27
* linux64: 4
* android-4-0-armv7-api15: 4
* linux32: 2
* android-4-2-x86: 1

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1333348&startday=2017-01-23&endday=2017-01-29&tree=all

Comment hidden (Intermittent Failures Robot)

27 failures in 733 pushes (0.037 failures/push) were associated with this bug in the last 7 days.  

Repository breakdown:
* autoland: 16
* mozilla-inbound: 7
* mozilla-aurora: 3
* mozilla-central: 1

Platform breakdown:
* windowsxp: 18
* windows8-64: 7
* linux64: 2

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1333348&startday=2017-01-30&endday=2017-02-05&tree=all

Comment hidden (Intermittent Failures Robot)

11 failures in 836 pushes (0.013 failures/push) were associated with this bug in the last 7 days.

Repository breakdown:
* autoland: 4
* mozilla-inbound: 3
* mozilla-aurora: 2
* try: 1
* mozilla-central: 1

Platform breakdown:
* windowsxp: 6
* windows8-64: 4
* android-4-0-armv7-api15: 1

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1333348&startday=2017-02-06&endday=2017-02-12&tree=all

Michael Shal [:mshal]

Updated

7 years ago

Comment 14

7 years ago

:mshal, any updates here?

Flags: needinfo?(mshal)

Comment hidden (Intermittent Failures Robot)

18 failures in 812 pushes (0.022 failures/push) were associated with this bug in the last 7 days.

Repository breakdown:
* mozilla-inbound: 8
* autoland: 5
* mozilla-aurora: 4
* try: 1

Platform breakdown:
* windowsxp: 12
* windows8-64: 5
* android-4-2-x86: 1

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1333348&startday=2017-02-20&endday=2017-02-26&tree=all

Michael Shal [:mshal]

Comment 16

7 years ago

I'm not working on this, and I'm not really an hg person so I'm not sure how to fix it.

IMO there are really two problems here:

1) The hg server sometimes returns bogus data, which seems to be more easily triggered under load (e.g., a large checkout). This results in things like the negative header size, and I speculate it could also result in the struct error in bug 1340630. Someone on the vcs team who is familiar with hg server management can probably help here.

2) The hg client (Windows builder) pulls an inordinate amount of data for the number of changesets it grabs. If I'm reading the hg output correctly, it looks like it pulls a 1.7GB bundle even for only 170 changesets. Again here we'll probably need someone more familiar with hg than I :/

applying clone bundle from https://s3-us-west-2.amazonaws.com/moz-hg-bundles-us-west-2/mozilla-unified/99f0792ae01e564e9b17a54b5010cb58eeb9b274.packed1-gd.hg
316715 files to transfer, 1.70 GB of data
transferred 1.70 GB in 241.4 seconds (7.22 MB/sec)
finished applying clone bundle
searching for changes
adding changesets
adding manifests
adding file changes
added 170 changesets with 384 changes to 291 files (+1 heads)

changesets [============================================================>] 1/1
                                                                               
searching for changes
adding changesets
adding manifests
adding file changes
added 1 changesets with 0 changes to 0 files (-1 heads)

I suspect if we fixed pulling tons of data on every build that we'd see this a lot less, though really both issues should be fixed.

Flags: needinfo?(mshal)

Amy Rich [:arr] [:arich]

Comment 17

7 years ago

I'm not sure we have an hg expert other than gps, so this may need to wait till he returns.

Joel Maher ( :jmaher ) (UTC -8)

Updated

7 years ago

Whiteboard: [stockwell unknown]

Comment hidden (Intermittent Failures Robot)

20 failures in 783 pushes (0.026 failures/push) were associated with this bug in the last 7 days.

Repository breakdown:
* mozilla-esr52: 9
* autoland: 3
* mozilla-release: 2
* mozilla-central: 2
* mozilla-aurora: 2
* mozilla-inbound: 1
* mozilla-beta: 1

Platform breakdown:
* windowsxp: 13
* windows8-64: 3
* linux64: 2
* android-4-2-x86: 1
* android-4-0-armv7-api15: 1

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1333348&startday=2017-02-27&endday=2017-03-05&tree=all

Comment hidden (Intermittent Failures Robot)

8 failures in 777 pushes (0.01 failures/push) were associated with this bug in the last 7 days.   

Repository breakdown:
* mozilla-esr52: 4
* autoland: 4

Platform breakdown:
* windowsxp: 5
* windows8-64: 3

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1333348&startday=2017-03-13&endday=2017-03-19&tree=all

Gregory Szorc [:gps]

Comment 20

7 years ago

The negative part header size is likely a bit flip. Why it happens, I don't know. There have been reports of this to upstream Mercurial as well. It appears to occur primarily on Windows.

This is a non-trivial bug to investigate and will likely require several days of someone's time.
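To illustrate the bit-flip hypothesis, here is a minimal sketch. It assumes the client decodes the part header size as a big-endian signed 32-bit integer (my understanding of Mercurial's bundle2 framing, not something stated in this bug). The helper names and the "intact" size of 30 are made up for illustration.

```python
import struct


def read_part_header_size(raw):
    """Decode 4 bytes as a big-endian signed 32-bit integer."""
    (size,) = struct.unpack(">i", raw)
    return size


def flip_bit(raw, bit):
    """Return a copy of raw with one bit flipped (bit 0 = top bit of byte 0)."""
    data = bytearray(raw)
    data[bit // 8] ^= 0x80 >> (bit % 8)
    return bytes(data)


intact = struct.pack(">i", 30)   # hypothetical, plausible small header size
corrupt = flip_bit(intact, 0)    # flip the sign bit only

print(read_part_header_size(intact))    # small positive size
print(read_part_header_size(corrupt))   # negative: the sign bit is now set

# The value from this bug's title, re-encoded: its top bit is set, which is
# all that is needed for the decoded size to come out negative.
reported = struct.pack(">i", -424604107)
print(bool(reported[0] & 0x80))
```

Any corruption that sets the top bit of the first size byte produces a negative value under this decoding, so the check that raises the abort is a cheap detector for damaged framing.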

Comment hidden (Intermittent Failures Robot)

12 failures in 898 pushes (0.013 failures/push) were associated with this bug in the last 7 days.   

Repository breakdown:
* autoland: 5
* mozilla-esr52: 3
* mozilla-inbound: 1
* mozilla-central: 1
* mozilla-beta: 1
* mozilla-aurora: 1

Platform breakdown:
* windowsxp: 6
* windows8-64: 6

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1333348&startday=2017-03-20&endday=2017-03-26&tree=all

Comment hidden (Intermittent Failures Robot)

8 failures in 845 pushes (0.009 failures/push) were associated with this bug in the last 7 days.   

Repository breakdown:
* autoland: 5
* mozilla-inbound: 1
* mozilla-esr52: 1
* mozilla-central: 1

Platform breakdown:
* windowsxp: 6
* windows8-64: 2

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1333348&startday=2017-03-27&endday=2017-04-02&tree=all

Comment hidden (Intermittent Failures Robot)

5 failures in 867 pushes (0.006 failures/push) were associated with this bug in the last 7 days.   

Repository breakdown:
* mozilla-beta: 4
* mozilla-release: 1

Platform breakdown:
* linux64: 3
* windowsxp: 1
* windows8-64: 1

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1333348&startday=2017-04-03&endday=2017-04-09&tree=all
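The failures/push figures in the weekly robot comments are simply failures divided by pushes, rounded to three decimals. A small sketch that recomputes them from the counts quoted in this bug (the function name is hypothetical; the week labels are the `startday` values from the report links):

```python
# (week start, failures, pushes), copied from the 7-day robot comments above.
weekly = [
    ("2017-01-23", 70, 749),
    ("2017-01-30", 27, 733),
    ("2017-02-06", 11, 836),
    ("2017-02-20", 18, 812),
    ("2017-02-27", 20, 783),
    ("2017-03-13", 8, 777),
    ("2017-03-20", 12, 898),
    ("2017-03-27", 8, 845),
    ("2017-04-03", 5, 867),
]


def failures_per_push(failures, pushes):
    """Return the failure rate rounded to three decimals, as the robot reports it."""
    return round(failures / pushes, 3)


for start, failures, pushes in weekly:
    print(start, failures_per_push(failures, pushes))
```

Laid out this way, the rate is highest in the first reported week and lowest in the last, though it does not fall monotonically in between.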

Gregory Szorc [:gps]

Comment 24

7 years ago

Looking at logs, this appears to be a dupe of bug 1291926. I'm seeing the same behavior: really slow application of manifests followed by a "stream ended unexpectedly" error. This screams of the network timeout I diagnosed in bug 1291926.

Status: NEW → RESOLVED

Closed: 7 years ago

Resolution: --- → DUPLICATE