1291926 - Intermittent cloning failing with abort: stream ended unexpectedly (got 131914 bytes, expected 1630418948)

Reporter

Description

•

8 years ago

treeherder

Filed by: wkocher [at] mozilla.com

https://treeherder.mozilla.org/logviewer.html#?job_id=33260211&repo=mozilla-inbound

http://archive.mozilla.org/pub/spidermonkey/tinderbox-builds/mozilla-inbound-win32-debug/mozilla-inbound_win32-debug_spidermonkey-plaindebug-bm91-build1-build276.txt.gz

Wes Kocher (:KWierso) (Not reading bugmail; email directly if needed)

Comment 1

•

8 years ago

The traceback I'm seeing is:
 Mercurial Distributed SCM (version 3.7.3)
 (see https://mercurial-scm.org for more information)
 
 Copyright (C) 2005-2016 Matt Mackall and others
 This is free software; see the source for copying conditions. There is NO
 warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
 + hgargs='--sharebase c:/builds/hg-shared'
 + hgargs='--sharebase c:/builds/hg-shared --revision d3b50142a70cda11137933c91c587b37f1203f95'
 + hgargs='--sharebase c:/builds/hg-shared --revision d3b50142a70cda11137933c91c587b37f1203f95 --upstream https://hg.mozilla.org/mozilla-unified'
 + '[' -z '' ']'
 + hgargs='--sharebase c:/builds/hg-shared --revision d3b50142a70cda11137933c91c587b37f1203f95 --upstream https://hg.mozilla.org/mozilla-unified --purge'
 + hg --config extensions.robustcheckout=/c/builds/moz2_slave/m-in_w32-d_sm-plaindebug-00000/scripts/hgext/robustcheckout.py robustcheckout --sharebase c:/builds/hg-shared --revision d3b50142a70cda11137933c91c587b37f1203f95 --upstream https://hg.mozilla.org/mozilla-unified --purge https://hg.mozilla.org/integration/mozilla-inbound src
 ensuring https://hg.mozilla.org/integration/mozilla-inbound@d3b50142a70cda11137933c91c587b37f1203f95 is available at src
 (cloning from upstream repo https://hg.mozilla.org/mozilla-unified)
 (sharing from existing pooled repository 8ba995b74e18334ab3707f27e9eb8f4e37ba3d29)
 searching for changes
 adding changesets
 adding manifests
 adding file changes
 transaction abort!
 rollback completed
 Traceback (most recent call last):
   File "mercurial\dispatch.pyc", line 191, in _runcatch
   File "mercurial\dispatch.pyc", line 924, in _dispatch
   File "mercurial\dispatch.pyc", line 681, in runcommand
   File "mercurial\extensions.pyc", line 195, in closure
   File "hgext\color.pyc", line 518, in colorcmd
   File "mercurial\dispatch.pyc", line 1055, in _runcommand
   File "mercurial\dispatch.pyc", line 1015, in checkargs
   File "mercurial\dispatch.pyc", line 921, in <lambda>
   File "mercurial\util.pyc", line 991, in check
   File "c:/builds/moz2_slave/m-in_w32-d_sm-plaindebug-00000/scripts/hgext/robustcheckout.py", line 149, in robustcheckout
   File "c:/builds/moz2_slave/m-in_w32-d_sm-plaindebug-00000/scripts/hgext/robustcheckout.py", line 229, in _docheckout
   File "mercurial\hg.pyc", line 489, in clone
   File "mercurial\hg.pyc", line 380, in clonewithshare
   File "mercurial\exchange.pyc", line 1188, in pull
   File "mercurial\exchange.pyc", line 1329, in _pullbundle2
   File "mercurial\bundle2.pyc", line 355, in processbundle
   File "mercurial\bundle2.pyc", line 765, in iterparts
   File "mercurial\bundle2.pyc", line 772, in _readpartheader
   File "mercurial\bundle2.pyc", line 602, in _unpack
   File "mercurial\bundle2.pyc", line 607, in _readexact
   File "mercurial\changegroup.pyc", line 43, in readexactly
 Abort: stream ended unexpectedly (got 0 bytes, expected 4)
abort: stream ended unexpectedly (got 0 bytes, expected 4)


Which sounds like a #vcs problem?

Component: JavaScript Engine → Mercurial: hg.mozilla.org

Flags: needinfo?(gps)

Product: Core → Developer Services

QA Contact: hwine

Wes Kocher (:KWierso) (Not reading bugmail; email directly if needed)

Comment 2

•

8 years ago

Which seems more likely since I just saw one of these failures on esr45: https://treeherder.mozilla.org/logviewer.html#?job_id=72830&repo=mozilla-esr45

Gregory Szorc [:gps]

Comment 3

•

8 years ago

These failures suddenly started appearing on SpiderMonkey builds because bug 1291058 changed some SpiderMonkey jobs to use the robustcheckout extension. That change was to the build/tools repo, which means *all* trees got the change as soon as it was pushed.

AFAIK, we're not seeing these failures elsewhere. Or if we are, retries in mozharness or similar are paving over them. robustcheckout is being successfully used on Linux and OS X in buildbot and TC, so I don't think there's anything wrong with robustcheckout.

"stream ended unexpectedly" in this stack means the client was receiving data from the server and for whatever reason didn't receive all the data it was expecting. There are a few general explanations:

1) The server is sending malformed data
2) There is a bug in Mercurial's stream reading code
3) The data is getting corrupted in transit
4) Connections are dropping

My bet is on #4. We've had reports of dropped connections to hg.mozilla.org before.
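
To make the failure mode concrete, here is a minimal sketch of an "exact read" helper in the spirit of the `readexactly` frame at the bottom of the traceback in comment 1. It is not Mercurial's source: the `Abort` class and the `read_exactly` name are local stand-ins chosen for illustration. Explanations 1 through 4 all look the same from this function's point of view, which is why the message alone cannot tell them apart.

```python
import io


class Abort(Exception):
    """Hypothetical stand-in for Mercurial's error.Abort."""


def read_exactly(stream, n):
    """Read exactly n bytes from stream, or raise Abort if fewer arrive."""
    data = stream.read(n)
    if len(data) < n:
        # The stream hit EOF early: dropped connection, truncated
        # response, or a sender that stopped writing.
        raise Abort(
            "stream ended unexpectedly (got %d bytes, expected %d)"
            % (len(data), n)
        )
    return data


if __name__ == "__main__":
    # An empty stream where a 4-byte header was expected.
    try:
        read_exactly(io.BytesIO(b""), 4)
    except Abort as exc:
        print("abort: %s" % exc)
```

The "got 0 bytes, expected 4" variant in comment 1 corresponds to the connection ending exactly at a header boundary, before any of the next 4-byte size field arrived.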

We could pave over the problem by following up on bug 1291058 to retry after failure.

We could also loop in some IT folk to investigate dropped connections. FWIW, the hg.mozilla.org server logs report a few hundred aborted requests due to "broken pipe." Of course, that could be anything from a client terminating a process to a network hiccup of many varieties. Actually figuring out the cause of the network hiccup likely requires packet capturing at as many locations in the link between client and server as possible. And with the volume of traffic we issue to hg.mozilla.org, that can be very difficult. Capturing on a client and seeing which end aborts the TCP connection could be a good start...

Depends on: 1291058

Flags: needinfo?(gps)

Phil Ringnalda (:philor)

Comment 4

•

8 years ago

Retries mostly paving over them, and our massive ability to ignore infra intermittents preventing anyone from hearing about the ones where they don't: https://treeherder.mozilla.org/logviewer.html#?job_id=1490705&repo=autoland is a buildbot Windows build failing out after five stream ends.

Comment hidden (Intermittent Failures Robot)

38 automation job failures were associated with this bug in the last 7 days.

Repository breakdown:
* mozilla-inbound: 28
* mozilla-central: 7
* autoland: 2
* mozilla-esr45: 1

Platform breakdown:
* windowsxp: 35
* windows8-64: 1
* windows7-32-vm: 1
* windows7-32: 1

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1291926&startday=2016-08-01&endday=2016-08-07&tree=all

Gregory Szorc [:gps]

Comment 6

•

8 years ago

I find it strange that this stopped occurring in the past few days. That or we haven't been starring it.

Phil Ringnalda (:philor)

Comment 7

•

8 years ago

The latter, e.g. https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=92a9c724b77fba587edd4d5a662fbee29c5ea0c2&filter-searchStr=win%20spider

Phil Ringnalda (:philor)

Comment 8

•

8 years ago

Well, a mix of the two would be more accurate. Odds are very good that if we kept a constant number of spidermonkey jobs going through the high-load US daytime, the low-load US nighttime, and the no-load weekend, this would happen much more frequently during the high-load times. But spidermonkey jobs are only triggered when a patch touches js/src/, and people who touch js/src/ almost never push during weekends.

Gregory Szorc [:gps]

Comment 9

•

8 years ago

03:44 < philor> on the bright side, after five abort: stream ended unexpectedly clone failures in a row I finally did what I should have done after the first one, updated hg, and succeeded the next attempt

I don't recall an upstream bug that would cause this failure. But, uh, I guess upgrading Mercurial can't hurt.

FWIW, I'm going to make a push to upgrade to 3.9.1 everywhere once it is released in a few weeks.

Comment hidden (Intermittent Failures Robot)

18 automation job failures were associated with this bug in the last 7 days.

Repository breakdown:
* mozilla-inbound: 11
* autoland: 5
* mozilla-central: 2

Platform breakdown:
* windowsxp: 18

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1291926&startday=2016-08-15&endday=2016-08-21&tree=all

Comment hidden (Intermittent Failures Robot)

16 automation job failures were associated with this bug yesterday.

Repository breakdown:
* mozilla-inbound: 7
* autoland: 4
* fx-team: 3
* mozilla-central: 2

Platform breakdown:
* windowsxp: 15
* linux64: 1

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1291926&startday=2016-08-22&endday=2016-08-22&tree=all

Comment hidden (Intermittent Failures Robot)

19 automation job failures were associated with this bug in the last 7 days.

Repository breakdown:
* mozilla-inbound: 7
* autoland: 6
* mozilla-central: 3
* fx-team: 3

Platform breakdown:
* windowsxp: 18
* linux64: 1

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1291926&startday=2016-08-22&endday=2016-08-28&tree=all

Phil Ringnalda (:philor)

Comment 13

•

8 years ago

Today's 13 episodes were brought to you by the letter "Really Crappy Network In General."

I do think it was interesting, though, that during the same period of crappy network there was a single Windows opt build on a single push on autoland which failed (despite the 5 retries) three times, all on the same AWS instance, until I finally got tired of seeing it fail, terminated that instance, and did fine with the next one that took the job.

Comment hidden (Intermittent Failures Robot)

14 automation job failures were associated with this bug in the last 7 days.

Repository breakdown:
* fx-team: 7
* mozilla-central: 3
* autoland: 3
* mozilla-inbound: 1

Platform breakdown:
* windowsxp: 14

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1291926&startday=2016-08-29&endday=2016-09-04&tree=all

Comment hidden (Intermittent Failures Robot)

17 automation job failures were associated with this bug in the last 7 days.

Repository breakdown:
* mozilla-aurora: 17

Platform breakdown:
* windowsxp: 10
* windows8-64: 7

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1291926&startday=2016-09-19&endday=2016-09-25&tree=all

Phil Ringnalda (:philor)

Updated

•

8 years ago

Summary: Intermittent Windows spidermonkey builds failing with abort: stream ended unexpectedly (got 131914 bytes, expected 1630418948) → Intermittent Windows builds failing with abort: stream ended unexpectedly (got 131914 bytes, expected 1630418948)

Comment hidden (Intermittent Failures Robot)

32 automation job failures were associated with this bug in the last 7 days.

Repository breakdown:
* mozilla-inbound: 15
* autoland: 11
* mozilla-central: 4
* mozilla-aurora: 1
* fx-team: 1

Platform breakdown:
* windowsxp: 28
* windows8-64: 4

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1291926&startday=2016-10-10&endday=2016-10-16&tree=all

Comment hidden (Intermittent Failures Robot)

18 failures in 113 pushes (0.159 failures/push) were associated with this bug yesterday.  

Repository breakdown:
* autoland: 8
* mozilla-inbound: 5
* try: 3
* mozilla-central: 2

Platform breakdown:
* windowsxp: 16
* windows8-64: 2

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1291926&startday=2017-01-04&endday=2017-01-04&tree=all

Comment hidden (Intermittent Failures Robot)

20 failures in 132 pushes (0.152 failures/push) were associated with this bug yesterday.  

Repository breakdown:
* autoland: 10
* mozilla-inbound: 4
* mozilla-esr45: 4
* mozilla-central: 2

Platform breakdown:
* windowsxp: 17
* windows8-64: 3

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1291926&startday=2017-01-05&endday=2017-01-05&tree=all

Hal Wine [:hwine] use NI!

Updated

•

7 years ago

QA Contact: hwine → klibby

Joel Maher ( :jmaher ) (UTC -8)

Comment 19

•

7 years ago

This just started happening 3 days ago on win32 builds and spidermonkey builds on win32.

:gps, did something in our VCS change on Jan 3rd?

Flags: needinfo?(gps)

Comment hidden (Intermittent Failures Robot)

27 failures in 134 pushes (0.201 failures/push) were associated with this bug yesterday.  

Repository breakdown:
* autoland: 18
* mozilla-inbound: 6
* mozilla-aurora: 3

Platform breakdown:
* windowsxp: 22
* windows8-64: 5

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1291926&startday=2017-01-06&endday=2017-01-06&tree=all

Comment hidden (Intermittent Failures Robot)

85 failures in 563 pushes (0.151 failures/push) were associated with this bug in the last 7 days. 

This is the #10 most frequent failure this week. 

** This failure happened more than 50 times this week! Resolving this bug is a high priority. **

Repository breakdown:
* autoland: 38
* mozilla-inbound: 21
* mozilla-central: 16
* mozilla-esr45: 4
* try: 3
* mozilla-aurora: 3

Platform breakdown:
* windowsxp: 74
* windows8-64: 10
* linux64: 1

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1291926&startday=2017-01-02&endday=2017-01-08&tree=all

Gregory Szorc [:gps]

Comment 22

•

7 years ago

(In reply to Joel Maher ( :jmaher) from comment #19)
> This just started happening 3 days ago on win32 builds and spidermonkey
> builds on win32.
> 
> :gps, did something in our VCS change on Jan 3rd?

There were no deployments to hg.mozilla.org on January 3rd. However, it's quite possible some other infrastructure/network work happened around that time, as nearly the whole world practices infrastructure freezes over the holidays and then changes things like crazy in early January.

Still, this needs to be fixed by making `hg robustcheckout` retry after failure.

Flags: needinfo?(gps)

Comment hidden (Intermittent Failures Robot)

23 failures in 128 pushes (0.18 failures/push) were associated with this bug yesterday.  

Repository breakdown:
* autoland: 12
* mozilla-inbound: 7
* mozilla-aurora: 2
* mozilla-central: 1
* mozilla-beta: 1

Platform breakdown:
* windowsxp: 20
* windows8-64: 3

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1291926&startday=2017-01-09&endday=2017-01-09&tree=all

Phil Ringnalda (:philor)

Updated

•

7 years ago

Severity: normal → major

Ryan VanderMeulen [:RyanVM]

Comment 26

•

7 years ago

(In reply to Gregory Szorc [:gps] from comment #22)
> Still, this needs to be fixed by making `hg robustcheckout` retry after
> failure.

Is there a bug filed for this that we can mark as a dep?

Flags: needinfo?(gps)

Gregory Szorc [:gps]

Comment 27

•

7 years ago

There are a number of bugs all reporting similar symptoms. I've been reluctant to dupe them out of fear it will confuse starring.

Anyway, it seemed like the frequency of the failures was low enough to not warrant the time to fix it. But comment #21 says this was the #10 failure last week and is pretty forcefully worded. So I suppose we should fix this...

Assignee: nobody → gps

Status: NEW → ASSIGNED

Flags: needinfo?(gps)

Ryan VanderMeulen [:RyanVM]

Comment 28

•

7 years ago

At least on the release branches, I'm seeing more failing Windows SM jobs than passing at this point.

Comment hidden (Intermittent Failures Robot)

38 failures in 128 pushes (0.297 failures/push) were associated with this bug yesterday.  

Repository breakdown:
* autoland: 20
* mozilla-esr45: 5
* mozilla-aurora: 4
* mozilla-inbound: 3
* mozilla-central: 3
* mozilla-beta: 3

Platform breakdown:
* windowsxp: 27
* windows8-64: 11

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1291926&startday=2017-01-10&endday=2017-01-10&tree=all

Comment hidden (mozreview-request)

A future commit will introduce a 2nd call site for this pattern and
additional complexity for examining the Abort. So factor it into a
reusable function.

Review commit: https://reviewboard.mozilla.org/r/103728/diff/#index_header
See other reviews: https://reviewboard.mozilla.org/r/103728/

Comment hidden (mozreview-request)

We're seeing a lot of "stream ended unexpectedly" failures when
cloning/pulling in automation. As far as I can tell, this is due
to dropped connections. Unreliable networks are unreliable.

This commit adds a test to reproduce the failure by implementing an
extension that wraps the server response serving function and has it
return early after a configured number of bytes have been sent.
This test will allow us to verify that automatic retry logic
(to be introduced in the next commit) works as advertised.

Review commit: https://reviewboard.mozilla.org/r/103730/diff/#index_header
See other reviews: https://reviewboard.mozilla.org/r/103730/
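
The idea of "return early after a configured number of bytes" can be sketched independently of Mercurial as a wrapper around an iterable of response chunks. This is only an illustration under assumed names: `truncate_after` and its `limit` parameter are hypothetical, and the real test extension in the review hooks into the server's response path rather than a plain generator.

```python
def truncate_after(chunks, limit):
    """Yield bytes from chunks, stopping once limit bytes have been sent.

    Simulates a server that drops the connection part-way through a
    response, so a client expecting more data sees a short stream.
    """
    sent = 0
    for chunk in chunks:
        remaining = limit - sent
        if remaining <= 0:
            return  # budget used up: end the response early
        piece = chunk[:remaining]
        sent += len(piece)
        yield piece


if __name__ == "__main__":
    response = [b"abc", b"defg", b"hi"]
    print(b"".join(truncate_after(response, 5)))
```

A client reading the truncated output with an exact-length read would then fail the same way the automation jobs do, which is what makes the retry behavior testable.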

Comment hidden (mozreview-request)

The most common network failure we've seen in automation is
"stream ended unexpectedly." In fact, this appears to be the only
network error we see with any frequency.

This commit introduces retry logic for network operations. The
default behavior is to attempt network pulls up to 3 times.

The code doesn't catch all network failures. But it catches the
big one: "stream ended unexpectedly," which is raised when
reading from changegroups, which constitute the bulk of bytes over
the wire for `hg pull` operations.

FWIW, I'm not sure of an exhaustive list of network related
exceptions Mercurial can emit. We'll likely have to follow up and
add more exception detection. Ideally, we'd implement retry logic
upstream. Perfect is the enemy of good.

Review commit: https://reviewboard.mozilla.org/r/103732/diff/#index_header
See other reviews: https://reviewboard.mozilla.org/r/103732/
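
A rough sketch of the retry behavior described in this commit message follows, assuming a default of 3 attempts and matching on the error text. The names `pull_with_retries` and `Abort` are hypothetical stand-ins, not the extension's actual code, and the real patch may detect the error differently.

```python
class Abort(Exception):
    """Hypothetical stand-in for Mercurial's error.Abort."""


def pull_with_retries(pull, attempts=3):
    """Call pull() up to `attempts` times.

    Only "stream ended unexpectedly" aborts are retried; any other
    error, or the final failed attempt, is re-raised to the caller.
    """
    for attempt in range(1, attempts + 1):
        try:
            return pull()
        except Abort as exc:
            retryable = "stream ended unexpectedly" in str(exc)
            if not retryable or attempt == attempts:
                raise


if __name__ == "__main__":
    calls = []

    def flaky_pull():
        calls.append(1)
        if len(calls) < 3:
            raise Abort("stream ended unexpectedly (got 0 bytes, expected 4)")
        return "pulled"

    print(pull_with_retries(flaky_pull), "after", len(calls), "attempts")
```

Retrying only on a known message keeps unrelated aborts (bad revision, permission errors) failing fast, at the cost of missing other network errors, which matches the caveat in the commit message.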

Comment hidden (Intermittent Failures Robot)

35 failures in 155 pushes (0.226 failures/push) were associated with this bug yesterday.  

Repository breakdown:
* autoland: 10
* mozilla-aurora: 9
* mozilla-beta: 6
* mozilla-inbound: 5
* mozilla-central: 5

Platform breakdown:
* windowsxp: 35

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1291926&startday=2017-01-11&endday=2017-01-11&tree=all

Comment hidden (Intermittent Failures Robot)

46 failures in 137 pushes (0.336 failures/push) were associated with this bug yesterday.  

Repository breakdown:
* autoland: 16
* mozilla-beta: 14
* mozilla-aurora: 8
* mozilla-inbound: 6
* mozilla-esr45: 2

Platform breakdown:
* windowsxp: 39
* windows8-64: 7

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1291926&startday=2017-01-12&endday=2017-01-12&tree=all

Comment hidden (mozreview-request)

Comment on attachment 8825614 [details]
robustcheckout: factor out code for handling error.Abort during a pull;

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/103728/diff/1-2/

Comment hidden (mozreview-request)

Comment on attachment 8825615 [details]
robustcheckout: add test for server failure during clone (bug 1291926);

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/103730/diff/1-2/

Comment hidden (mozreview-request)

Comment on attachment 8825616 [details]
robustcheckout: retry after "stream ended unexpectedly" (bug 1291926);

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/103732/diff/1-2/

Comment hidden (Intermittent Failures Robot)

22 failures in 118 pushes (0.186 failures/push) were associated with this bug yesterday.  

Repository breakdown:
* mozilla-inbound: 12
* mozilla-central: 5
* autoland: 5

Platform breakdown:
* windowsxp: 20
* windows8-64: 2

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1291926&startday=2017-01-13&endday=2017-01-13&tree=all

Comment hidden (Intermittent Failures Robot)

195 failures in 722 pushes (0.27 failures/push) were associated with this bug in the last 7 days. 

This is the #5 most frequent failure this week. 

** This failure happened more than 50 times this week! Resolving this bug is a high priority. **

Repository breakdown:
* autoland: 75
* mozilla-inbound: 44
* mozilla-beta: 28
* mozilla-aurora: 23
* mozilla-central: 18
* mozilla-esr45: 7

Platform breakdown:
* windowsxp: 171
* windows8-64: 24

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1291926&startday=2017-01-09&endday=2017-01-15&tree=all

Phil Ringnalda (:philor)

Comment 40

•

7 years ago

If you want something else to retry right now, you could save me cloning this bug the instant it closes with the new summary "Intermittent Windows builds failing with abort: missing support for negative part header size: -1671002596" by also retrying on that.

Though, since 33 of last week's 195 were browser builds, which already retry, I guess I'm going to be cloning it pretty much the instant it closes anyway, aren't I?

Comment hidden (Intermittent Failures Robot)

69 failures in 96 pushes (0.719 failures/push) were associated with this bug yesterday.  

Repository breakdown:
* mozilla-inbound: 37
* autoland: 24
* mozilla-release: 6
* mozilla-central: 1
* mozilla-beta: 1

Platform breakdown:
* windowsxp: 64
* windows8-64: 5

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1291926&startday=2017-01-16&endday=2017-01-16&tree=all

Comment hidden (Intermittent Failures Robot)

78 failures in 165 pushes (0.473 failures/push) were associated with this bug yesterday.  

Repository breakdown:
* autoland: 35
* mozilla-inbound: 23
* mozilla-aurora: 8
* mozilla-central: 6
* mozilla-release: 4
* mozilla-beta: 2

Platform breakdown:
* windowsxp: 67
* windows8-64: 11

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1291926&startday=2017-01-18&endday=2017-01-18&tree=all

:glob ✱

Comment 43

•

7 years ago

mozreview-review

Comment on attachment 8825614 [details]
robustcheckout: factor out code for handling error.Abort during a pull;

https://reviewboard.mozilla.org/r/103728/#review106540

Attachment #8825614 - Flags: review?(glob) → review+

Comment hidden (Intermittent Failures Robot)

18 failures in 115 pushes (0.157 failures/push) were associated with this bug yesterday.  

Repository breakdown:
* mozilla-inbound: 7
* mozilla-aurora: 7
* autoland: 3
* mozilla-central: 1

Platform breakdown:
* windowsxp: 16
* windows8-64: 2

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1291926&startday=2017-01-19&endday=2017-01-19&tree=all

:glob ✱

Comment 45

•

7 years ago

mozreview-review

Comment on attachment 8825615 [details]
robustcheckout: add test for server failure during clone (bug 1291926);

https://reviewboard.mozilla.org/r/103730/#review106550

::: hgext/robustcheckout/tests/badserver.py:16
(Diff revision 2)
> +
> +            if untilgoodcount:

nit: this 2nd if is redundant

Attachment #8825615 - Flags: review?(glob) → review+

:glob ✱

Comment 46

•

7 years ago

mozreview-review

Comment on attachment 8825616 [details]
robustcheckout: retry after "stream ended unexpectedly" (bug 1291926);

https://reviewboard.mozilla.org/r/103732/#review106542

lgtm

Attachment #8825616 - Flags: review?(glob) → review+

Gregory Szorc [:gps]

Comment 47

•

7 years ago

mozreview-review-reply

Comment on attachment 8825615 [details]
robustcheckout: add test for server failure during clone (bug 1291926);

https://reviewboard.mozilla.org/r/103730/#review106550

> nit: this 2nd if is redundant

But it isn't! Consider the case where it reads the string "0", which is converted to the integer 0, which is falsy. I'll add an inline comment.
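
The point can be illustrated with a small standalone sketch. The function name and structure here are hypothetical and are not the code under review in badserver.py; the sketch only shows why a truthiness check on the raw string and a truthiness check on the converted integer are different tests.

```python
def parse_counter(raw):
    """Return a positive counter parsed from raw, or None.

    raw is the text read from a file or config value (hypothetical input).
    """
    if raw:  # first check: was anything read at all?
        count = int(raw)
        # Second check: "0" is a non-empty, truthy string, but int("0")
        # is 0, which is falsy, so this test is not redundant.
        if count:
            return count
    return None


if __name__ == "__main__":
    for raw in ("", "0", "3"):
        print(repr(raw), "->", parse_counter(raw))
```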

Comment hidden (mozreview-request)

Comment on attachment 8825614 [details]
robustcheckout: factor out code for handling error.Abort during a pull;

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/103728/diff/2-3/

Comment hidden (mozreview-request)

Comment on attachment 8825615 [details]
robustcheckout: add test for server failure during clone (bug 1291926);

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/103730/diff/2-3/

Comment hidden (mozreview-request)

Comment on attachment 8825616 [details]
robustcheckout: retry after "stream ended unexpectedly" (bug 1291926);

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/103732/diff/2-3/

Comment hidden (Intermittent Failures Robot)

39 failures in 143 pushes (0.273 failures/push) were associated with this bug yesterday.  

Repository breakdown:
* mozilla-inbound: 17
* autoland: 11
* mozilla-central: 8
* mozilla-aurora: 3

Platform breakdown:
* windowsxp: 36
* windows8-64: 2
* osx-10-7: 1

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1291926&startday=2017-01-20&endday=2017-01-20&tree=all

Comment hidden (Intermittent Failures Robot)

234 failures in 690 pushes (0.339 failures/push) were associated with this bug in the last 7 days. 

This is the #1 most frequent failure this week. 

** This failure happened more than 50 times this week! Resolving this bug is a high priority. **

Repository breakdown:
* mozilla-inbound: 97
* autoland: 79
* mozilla-aurora: 24
* mozilla-central: 16
* mozilla-release: 10
* mozilla-beta: 3
* ash: 3
* mozilla-esr45: 2

Platform breakdown:
* windowsxp: 211
* windows8-64: 22
* osx-10-7: 1

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1291926&startday=2017-01-16&endday=2017-01-22&tree=all

Pulsebot

Comment 53

•

7 years ago

Pushed by gszorc@mozilla.com:
https://hg.mozilla.org/hgcustom/version-control-tools/rev/788a0e0f3ba3
robustcheckout: add test for server failure during clone ; r=glob
https://hg.mozilla.org/hgcustom/version-control-tools/rev/de41dae85307
robustcheckout: retry after "stream ended unexpectedly" ; r=glob

Status: ASSIGNED → RESOLVED

Closed: 7 years ago

Resolution: --- → FIXED

Pulsebot

Comment 54

•

7 years ago

Pushed by gszorc@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/4f044cabf9b1
Vendor latest robustcheckout extension; r=me

Phil Ringnalda (:philor)

Comment 55

•

7 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/4f044cabf9b1

Ryan VanderMeulen [:RyanVM]

Comment 56

•

7 years ago

bugherder uplift

https://hg.mozilla.org/releases/mozilla-beta/rev/075069fc3743

Ryan VanderMeulen [:RyanVM]

Comment 57

•

7 years ago

These patches definitely improved the situation, but there are still some occasional failures creeping through.
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1291926&endday=2017-01-24&startday=2017-01-23&tree=all

Comment hidden (Intermittent Failures Robot)

34 failures in 749 pushes (0.045 failures/push) were associated with this bug in the last 7 days.  

Repository breakdown:
* mozilla-inbound: 13
* mozilla-aurora: 9
* mozilla-beta: 8
* autoland: 2
* mozilla-release: 1
* mozilla-central: 1

Platform breakdown:
* windowsxp: 25
* windows8-64: 5
* android-4-0-armv7-api15: 2
* linux64: 1
* linux32: 1

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1291926&startday=2017-01-23&endday=2017-01-29&tree=all

Phil Ringnalda (:philor)

Comment 59

•

7 years ago

And even after the update of the robustcheckout in build/tools/, the SpiderMonkey builds are still only attempting a single time.

Comment hidden (Intermittent Failures Robot)

7 failures in 733 pushes (0.01 failures/push) were associated with this bug in the last 7 days.  

Repository breakdown:
* mozilla-inbound: 3
* mozilla-aurora: 3
* autoland: 1

Platform breakdown:
* windowsxp: 5
* windows8-64: 2

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1291926&startday=2017-01-30&endday=2017-02-05&tree=all

Comment hidden (Intermittent Failures Robot)

5 failures in 836 pushes (0.006 failures/push) were associated with this bug in the last 7 days.

Repository breakdown:
* mozilla-beta: 4
* mozilla-esr45: 1

Platform breakdown:
* windowsxp: 5

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1291926&startday=2017-02-06&endday=2017-02-12&tree=all

Comment hidden (Intermittent Failures Robot)

13 failures in 833 pushes (0.016 failures/push) were associated with this bug in the last 7 days.

Repository breakdown:
* mozilla-aurora: 10
* mozilla-beta: 2
* mozilla-inbound: 1

Platform breakdown:
* windowsxp: 6
* windows8-64: 2
* linux64: 2
* android-4-0-armv7-api15: 2
* linux32: 1

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1291926&startday=2017-02-13&endday=2017-02-19&tree=all

Joel Maher ( :jmaher ) (UTC -8)

Updated

•

7 years ago

Whiteboard: [stockwell fixed]

Comment hidden (Intermittent Failures Robot)

24 failures in 182 pushes (0.132 failures/push) were associated with this bug yesterday.

Repository breakdown:
* mozilla-esr52: 22
* mozilla-beta: 1
* mozilla-aurora: 1

Platform breakdown:
* windowsxp: 17
* windows8-64: 7

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1291926&startday=2017-02-24&endday=2017-02-24&tree=all

Comment hidden (Intermittent Failures Robot)

35 failures in 812 pushes (0.043 failures/push) were associated with this bug in the last 7 days. 

This is the #49 most frequent failure this week. 

** This failure happened more than 30 times this week! Resolving this bug is a high priority. **

** Try to resolve this bug as soon as possible. If unresolved for 2 weeks, the affected test(s) may be disabled. **

Repository breakdown:
* mozilla-esr52: 22
* mozilla-aurora: 11
* mozilla-beta: 2

Platform breakdown:
* windowsxp: 21
* windows8-64: 9
* linux64: 3
* linux32: 1
* android-4-0-armv7-api15: 1

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1291926&startday=2017-02-20&endday=2017-02-26&tree=all

Comment hidden (Intermittent Failures Robot)

20 failures in 783 pushes (0.026 failures/push) were associated with this bug in the last 7 days.

Repository breakdown:
* mozilla-aurora: 10
* mozilla-esr52: 8
* mozilla-release: 1
* mozilla-beta: 1

Platform breakdown:
* windowsxp: 9
* windows8-64: 6
* linux32: 3
* linux64: 2

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1291926&startday=2017-02-27&endday=2017-03-05&tree=all

Comment hidden (Intermittent Failures Robot)

10 failures in 790 pushes (0.013 failures/push) were associated with this bug in the last 7 days.   

Repository breakdown:
* mozilla-esr52: 9
* mozilla-beta: 1

Platform breakdown:
* windowsxp: 5
* windows8-64: 4
* android-4-0-armv7-api15: 1

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1291926&startday=2017-03-06&endday=2017-03-12&tree=all

Comment hidden (Intermittent Failures Robot)

19 failures in 777 pushes (0.024 failures/push) were associated with this bug in the last 7 days.   

Repository breakdown:
* mozilla-beta: 11
* mozilla-esr52: 5
* try: 2
* mozilla-aurora: 1

Platform breakdown:
* windowsxp: 6
* android-4-0-armv7-api15: 5
* windows8-32: 3
* linux64: 3
* linux32: 1
* android-4-2-x86: 1

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1291926&startday=2017-03-13&endday=2017-03-19&tree=all

Ryan VanderMeulen [:RyanVM]

Comment 68

•

7 years ago

This still strikes pretty frequently. Is there something we can do to make cloning more reliable? Is there anything telling about it predominantly affecting SM jobs?

Flags: needinfo?(gps)

Comment hidden (Intermittent Failures Robot)

6 failures in 898 pushes (0.007 failures/push) were associated with this bug in the last 7 days.   

Repository breakdown:
* mozilla-esr52: 4
* mozilla-beta: 2

Platform breakdown:
* windowsxp: 6

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1291926&startday=2017-03-20&endday=2017-03-26&tree=all

Ryan VanderMeulen [:RyanVM]

Comment 70

•

7 years ago

Is it possible we're doing something silly with the mozilla-unified repo being baked into the AMI for these jobs? Just seems really weird that Windows SM jobs seem so disproportionately affected by this problem. AFAICT, we're pretty much hitting this issue constantly, it's just that the retries cover it up when we're lucky. Wonder how much machine time and money we're burning on this right now...

Flags: needinfo?(mcornmesser)

Mark Cornmesser [:markco] OOO 2024/04/15

Comment 71

•

7 years ago

Currently we are not baking any of the repos into the AMI. We tried that previously but it was not successful.

Flags: needinfo?(mcornmesser)

Gregory Szorc [:gps]

Comment 72

•

7 years ago

There is talk upstream about this bug. glandium was able to reproduce it reliably on Linux using signals. We suspect something in Mercurial or Python isn't retrying an interrupted system call.

Upstream has also noticed this error appears more frequently on Windows. I wouldn't be surprised if we're running into a CPython bug.

I have ideas for fixing this. But I'll likely need to deploy a hacked up version of Mercurial to Windows to prove its effectiveness. Bug 1351513 was the first step of that. Next, I'll need to produce a Mercurial Windows installer. This is quite the rabbit hole...

I am actively looking at this.
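
To make the interrupted-system-call theory concrete, here is a minimal Python sketch of the retry pattern a blocking read would need. `retry_on_eintr` and `flaky_read` are hypothetical names for illustration, not Mercurial or CPython APIs, and where the missing retry actually lives has not been established in this bug.

```python
import errno


def retry_on_eintr(func, *args, **kwargs):
    """Call func, retrying while it fails with EINTR (hypothetical helper)."""
    while True:
        try:
            return func(*args, **kwargs)
        except OSError as exc:
            # A signal interrupted the call; anything else is a real error.
            if exc.errno != errno.EINTR:
                raise


# Stand-in for a socket read that is interrupted twice before succeeding.
attempts = {"count": 0}


def flaky_read():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise OSError(errno.EINTR, "Interrupted system call")
    return b"changegroup data"


data = retry_on_eintr(flaky_read)
```

Python 3.5 and later retry most such calls automatically (PEP 475), so a wrapper like this mainly matters on older interpreters.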

Flags: needinfo?(gps)

Comment hidden (Intermittent Failures Robot)

10 failures in 845 pushes (0.012 failures/push) were associated with this bug in the last 7 days.   

Repository breakdown:
* mozilla-esr52: 6
* mozilla-beta: 3
* mozilla-aurora: 1

Platform breakdown:
* windowsxp: 9
* windows8-64: 1

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1291926&startday=2017-03-27&endday=2017-04-02&tree=all

Ryan VanderMeulen [:RyanVM]

Comment 74

•

7 years ago

For whatever reason, this is permafailing Linux x64 Addon builds on Beta since yesterday, no matter how many retriggers I do.
https://treeherder.mozilla.org/logviewer.html#?job_id=89178198&repo=mozilla-beta

Flags: needinfo?(gps)

Ryan VanderMeulen [:RyanVM]

Updated

•

7 years ago

Status: RESOLVED → REOPENED

Resolution: FIXED → ---

Comment hidden (Intermittent Failures Robot)

35 failures in 170 pushes (0.206 failures/push) were associated with this bug yesterday.   

Repository breakdown:
* mozilla-beta: 35

Platform breakdown:
* linux64: 25
* windowsxp: 7
* windows8-64: 3

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1291926&startday=2017-04-06&endday=2017-04-06&tree=all

Gregory Szorc [:gps]

Comment 76

•

7 years ago

I think the majority of failures are due to a network timeout, similar to bug 1338530 comment #17. I'm attempting to prove that by reproducing on a local machine while packet tracing...

Gregory Szorc [:gps]

Updated

•

7 years ago

Depends on: 1354625

Gregory Szorc [:gps]

Comment 77

•

7 years ago

I think there are multiple root causes of this "stream ended unexpectedly" issue.

I think the single biggest cause is a lethargic client becoming network idle resulting in the TCP connection being dropped. Lesser causes include normal network unreliability and failure of Mercurial to handle interrupted system calls (which glandium has reported upstream as a known cause of "stream ended unexpectedly").

I'm now able to reliably reproduce this failure on Linux, which means it is only a matter of time before I confirm the root cause(s) and devise a fix.

Flags: needinfo?(gps)

Gregory Szorc [:gps]

Updated

•

7 years ago

Status: REOPENED → ASSIGNED

Gregory Szorc [:gps]

Comment 78

•

7 years ago

I've got a packet capture from client to hg.mozilla.org along with an observed "stream ended unexpectedly." Unfortunately it is encrypted. But I think there's a smoking gun of a server-initiated drop after idle.

311.032340 hg->client encrypted data (preceded by kilobytes of similar packets)
311.071510 client->hg ACK
311.331237 hg->client TCP Window Full
311.331254 client->hg TCP ZeroWindow
311.585893 hg->client TCP Keep-Alive
...
repeated ZeroWindow and Keep-Alive pairs following what looks like exponential backoff
436.112884 hg->client TCP Keep-Alive
436.112899 client->hg TCP ZeroWindow
455.004558 client->hg TCP Window Update (ACK)
455.016535 hg->client TCP segment
...
repeated TCP data from server to client for a little while
...
456.392987 hg->client encrypted data (last presumed Mercurial data from server)
456.422323 client->hg ACK
466.727988 hg->client TLS Alert 21 (Encrypted Alert)
466.728028 client->hg ACK
466.728061 hg->client FIN, ACK
466.771518 client->hg ACK
478.655037 hg->client FIN, ACK
478.666036 client->hg ACK

What looks to be happening is the server sends a bunch of data to the client. The client can't keep up on the consuming end and it keeps telling the server to wait. Eventually, the client is ready to accept more data. The server gladly obliges. But, moments later the server ends the TLS session and wants to drop the connection!
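
The two intervals that matter in the capture can be worked out directly from the timestamps. This is only arithmetic on the numbers quoted above; the variable names are mine.

```python
from decimal import Decimal

# Timestamps (seconds) copied from the capture above.
first_zero_window = Decimal("311.331254")
window_update = Decimal("455.004558")
last_data = Decimal("456.392987")
tls_alert = Decimal("466.727988")

# How long the client kept advertising a zero receive window.
stall_seconds = window_update - first_zero_window

# How long after the last data packet the server sent the TLS alert.
alert_delay_seconds = tls_alert - last_data

print(stall_seconds)        # 143.673304
print(alert_delay_seconds)  # 10.335001
```

So the client refused new data for well over two minutes, and the server tore the session down about ten seconds after the final burst of data.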

I suspect what's happening is that during the long period where the Mercurial client isn't accepting any new data from the server, the TCP connection between the load balancer and the HTTP server on the origin server is dropped or gets into an aborted state. This could be any number of timeouts, including an idle timeout inside mod_wsgi. Instead of dropping the TCP connection right there and then, the connection is kept alive. When the Mercurial client comes back, the load balancer flushes whatever data it has buffered. Once that is flushed, it initiates shutdown.

There is some speculation in the previous paragraph. I would need to packet trace between the load balancer and the httpd server to see exactly what's going on.

The whole issue stems from the Mercurial client being idle over the network for an extended period of time (because it is being slow). (I reproduced this behavior by introducing a sleep(1) between applying manifests. But the "stream ended unexpectedly" doesn't occur until the beginning of filelog updates.)

Also, I do see an OSError(Connected ended unexpectedly) being raised by Python. But it is being swallowed by Mercurial, leading to this custom "stream ended unexpectedly" error. That's worthy of a fix upstream.

I think there is an idle timeout somewhere between zlb <-> httpd <-> mod_wsgi that we need to increase to prevent premature connection dropping...

Gregory Szorc [:gps]

Comment 79

•

7 years ago

atoll: could you please read comment #78 and add any zlb-related expertise you may have?

Also, is it difficult to get a dummy load balancer entry routing to a specific hgweb host? I'd love to packet trace what's happening between zlb and hgweb to confirm/refute suspicions. But that's difficult to do with the firehose of production traffic in the way.

(I'm about to start my weekend. So there's no rush to do anything before Monday.)

Flags: needinfo?(rsoderberg)

:Atoll

Comment 80

•

7 years ago

TS rule, approximately:

If (request.getHeader("X-GPS") == "1") {
  pool.use("hgweb-pool", "1.2.3.4", "443");
}

If you pass request header X-GPS, it will assign your request to the listed IP:Port from the named pool. If request headers are hard, use request.getRemoteIp() instead. Note that Zeus will ignore the usual draining/disabled settings for the node you select with this method.

More later, mandatory PTO

Comment hidden (Intermittent Failures Robot)

52 failures in 867 pushes (0.06 failures/push) were associated with this bug in the last 7 days. 

This is the #32 most frequent failure this week.  

** This failure happened more than 30 times this week! Resolving this bug is a high priority. **

** Try to resolve this bug as soon as possible. If unresolved for 2 weeks, the affected test(s) may be disabled. ** 

Repository breakdown:
* mozilla-beta: 52

Platform breakdown:
* linux64: 32
* windowsxp: 13
* windows8-64: 7

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1291926&startday=2017-04-03&endday=2017-04-09&tree=all

Gregory Szorc [:gps]

Comment 82

•

7 years ago

Thanks for the info, atoll!

fubar: could you please configure a TS rule (similar to comment #80) to route requests to say hgweb11's IP? I'll then be able to take that host out of service and packet capture my way to glory.

Flags: needinfo?(klibby)

Kendall Libby [:fubar] (he/him)

Comment 83

•

7 years ago

(In reply to Gregory Szorc [:gps] from comment #82)
> fubar: could you please configure a TS rule (similar to comment #80) to
> route requests to say hgweb11's IP? I'll then be able to take that host out
> of service and packet capture my way to glory.

Done. TS rule "hg-debug" created; add the header "X-HG-DEBUG: 1" and all traffic will go to hgweb11.
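
For anyone wanting to exercise the rule from a script, here is a minimal Python sketch that attaches the header to a plain HTTP request. It only builds the request and sends nothing. `build_debug_request` is a hypothetical helper, and this thread does not say how the header was attached to the Mercurial client itself.

```python
import urllib.request


def build_debug_request(url, header="X-HG-DEBUG", value="1"):
    """Return an unsent Request carrying the debug routing header."""
    request = urllib.request.Request(url)
    request.add_header(header, value)
    return request


request = build_debug_request("https://hg.mozilla.org/mozilla-unified")

# urllib normalizes the capitalization of stored header names; HTTP header
# names are case-insensitive, so compare in lowercase.
headers = {name.lower(): val for name, val in request.header_items()}
```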

Flags: needinfo?(klibby)

Gregory Szorc [:gps]

Comment 84

•

7 years ago

I have a packet capture from hgweb11 with the issue reproducing on the client. I still have to sort through that, but there is an error in the httpd server logs:

  (70007)The timeout specified has expired: [client XXXX:YY] mod_wsgi (pid=XXX): Failed to proxy response to client

I think I know what timeout it is referring to. Let me look at the packet trace to confirm the numbers line up.

Gregory Szorc [:gps]

Comment 85

•

7 years ago

Confirming my suspicions, the "timeout" mentioned in the log message references the "socket-timeout" WSGIDaemonProcess option, which defaults to httpd's Timeout value which is currently set to 60. If I tweak socket-timeout, I can get that error message to appear sooner or later.

Here's what's happening from the Apache level.

1. Parent httpd process accepts a new socket and hands a file descriptor to worker process
2. httpd worker finds an available WSGI worker process and starts communicating with it. The httpd worker process essentially acts as a proxy between the TCP socket (speaking HTTP) and the WSGI worker (speaking WSGI)
3. WSGI worker produces data pretty fast and sends it over the network (to the load balancer) which receives it initially pretty fast
4. Within a few seconds, httpd is unable to writev() WSGI data to the socket. First call returns partial result. Second call fails with EAGAIN. This means the write would block.
5. httpd worker issues poll({fd, events=POLLOUT}, 1, <socket-timeout>)
6. Often dozens of seconds later, poll() returns.
7. httpd writev() flushes data remaining from last partial call.
8. httpd read()s 8000 bytes from WSGI and writev()s to socket in rapid succession. This manages to send a little over 256,000 bytes to the socket.
9. Eventually a poll() of the socket fails. httpd writes the "The timeout specified has expired" message to the error log.
10. httpd worker closes file descriptor attached to socket
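
The steps above can be imitated on loopback with a short Python sketch: a sender with a write timeout and a peer that connects but never reads. This is a toy model under my own assumptions, not httpd or mod_wsgi code. `serve_once` and `run_demo` are hypothetical names, and the 0.5 second timeout stands in for socket-timeout.

```python
import socket
import threading


def serve_once(listener, write_timeout, result):
    """Send data until the peer stops reading for write_timeout seconds."""
    conn, _ = listener.accept()
    conn.settimeout(write_timeout)  # plays the role of socket-timeout
    chunk = b"x" * 65536
    try:
        while True:
            conn.sendall(chunk)  # blocks once the kernel buffers are full
            result["sent"] = result.get("sent", 0) + len(chunk)
    except socket.timeout:
        result["outcome"] = "timeout"  # analogous to step 9
    finally:
        conn.close()  # analogous to step 10


def run_demo(write_timeout=0.5):
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)
    result = {}
    server = threading.Thread(
        target=serve_once, args=(listener, write_timeout, result))
    server.start()
    # The client connects but never reads, like a busy Mercurial client.
    client = socket.create_connection(listener.getsockname())
    server.join(timeout=30)
    client.close()
    listener.close()
    return result


outcome = run_demo()
```

Once the kernel send and receive buffers fill, the sender's write blocks and then times out, which is the same shape as the poll() failure in step 9.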

From the behavior of poll(), read(), and writev(), network writes from httpd to the zlb occur in bursts of ~256kb.

From packet capturing on the client, it appears to receive from the network in bursts of ~128kb. This is likely DECOMPRESSION_RECOMMENDED_INPUT_SIZE from the zstd decompressor wanting ~128kb at a time.

What I find interesting here is that communication between the zlb and httpd occurs at half the rate as between client and zlb. What this tells me is that the buffer on the zlb likely waits until it is empty or has reached a low water mark before asking httpd for more data. I've commonly seen this implemented as a circular buffer in load balancers because that model tends to yield less latency (data can be served to client immediately instead of waiting for origin server to deliver more data). I guess this flavor of load balancer does things differently. Maybe they are optimizing for literal data pass-through. Who knows. Load balancers are magic black boxes. It isn't terribly important.

Now, what's somewhat wonky is the behavior after httpd calls close() on the socket: it lingers for over a minute before it is closed at the TCP level! After the socket is close()d, the zlb and httpd keep sending TCP ZeroWindow and Keep-Alive messages to each other. Interspersed with those is actual data from httpd to the zlb. This must be data buffered by the kernel because httpd is fully detached from the socket when the data goes over the wire. Because dozens of seconds can pass between the client requesting 128kb chunks, the time difference between the server giving up on the HTTP request and the TCP socket being flushed and closed can be over a minute.

Flags: needinfo?(rsoderberg)

:Atoll

Comment 86

•

7 years ago

You might find some value reading over the Pool options in Zeus. There are a *lot* of finicky detailed TCP options that seem remarkably relevant here. I believe there's a specific "send every packet when it arrives, rather than being efficient" option. I'll share you a link to the relevant Zeus docs folder.

Gregory Szorc [:gps]

Comment 87

•

7 years ago

So, as best I can tell, WSGIDaemonProcess's socket-timeout (https://modwsgi.readthedocs.io/en/develop/configuration-directives/WSGIDaemonProcess.html) is driving the timeouts. The current value is 60s. Raising that timeout should reduce how many requests the server terminates prematurely. We should probably increase it. A risk to doing so is this opens us up to idle connection DoS. But that can be mitigated at the zlb layer.

Compounding the problem is zlb's buffering strategy, which appears to only request data from the origin server when a low water mark is hit, not continuously. This means traffic between hg and zlb doesn't "reset" a timeout in httpd. We have increased the buffer previously to accommodate very large HTTP request headers. We now use HTTP POST for passing this data, so we can likely revert to the default buffer size. This should result in more interaction between zlb and httpd, mitigating the chances of hitting a timeout on httpd.

Further compounding the problem is Mercurial wanting zstd data in 128kb chunks. If it requested data in smaller chunks, it would communicate with the server more often, resetting any idle timeouts in the process. 128kb is a large chunk for network I/O. So Mercurial should probably change this buffer size for network-originated streams. That being said, compression is special. N input bytes can produce M>>N output bytes. If data compresses extremely well or the repository is slow to apply the incoming data, very small network traffic could make the Mercurial operation so slow that the socket becomes idle and times out. This is essentially what happened in automation last week: Windows workers were already relatively slow applying changegroup data. By converting the server-side repos to generaldelta, we made them even slower. And by rolling out zstd, we decreased the frequency of interaction between client and server (due to zstd's large input buffer size), increasing the probability of timeouts.
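
A rough way to see the chunk-size effect: the gap between client reads is the chunk size divided by how fast the client works through compressed input. The sketch below uses a made-up consumption rate and a made-up smaller chunk size purely for illustration. Only the 60s timeout and the ~128kb chunk come from this bug, and the model ignores zlb buffering entirely.

```python
def seconds_between_reads(chunk_bytes, consume_bytes_per_second):
    """Time the client spends on one chunk before it reads again."""
    return chunk_bytes / consume_bytes_per_second


SOCKET_TIMEOUT = 60       # seconds, the value cited in this bug
ZSTD_CHUNK = 128 * 1024   # ~128kb input chunk
SMALL_CHUNK = 16 * 1024   # hypothetical smaller read size

# Hypothetical rate: a slow client getting through 1 KiB of compressed
# input per second.
rate = 1024

big_gap = seconds_between_reads(ZSTD_CHUNK, rate)     # 128.0
small_gap = seconds_between_reads(SMALL_CHUNK, rate)  # 16.0
```

At that assumed rate the 128kb reads leave the socket idle for longer than the timeout, while the smaller reads stay comfortably inside it.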

What a nasty and complex bug. It's getting to be the end of the work day for me. So I'm going to grab a beer so I can relax the brain cells that toiled over this bug. I'll work with fubar or dhouse tomorrow to roll out fixes for various buffer sizes and timeouts.

:Atoll

Comment 88

•

7 years ago

We could have Zeus talk directly to uWSGI, if you prefer that someday.

:Atoll

Comment 89

•

7 years ago

Oh, I forgot: Yes, ++ to raising HG timeout significantly, *only* for authenticated sessions (if you can possibly determine that).

Gregory Szorc [:gps]

Comment 90

•

7 years ago

(In reply to Richard Soderberg [:atoll] from comment #89)
> Oh, I forgot: Yes, ++ to raising HG timeout significantly, *only* for
> authenticated sessions (if you can possibly determine that).

https://hg.mozilla.org/ is completely unauthenticated. Best we can do is look at source IP. And since so much of our automation operates as spot instances in AWS, unless we are using a special network link between AWS and hg.mo and can identify that in the HTTP request, I'm not sure there's much we can do.

I'm inclined to jack up timeouts on the HTTP origin servers and have zlb deal with handling idle.

Gregory Szorc [:gps]

Comment 91

•

7 years ago

Upon further inspection, the zlb buffer from server to client appears to be 64kb. Not sure why I was seeing 256kb chunks in my packet captures.

Comment hidden (mozreview-request)

This should hopefully make many of the Mercurial client failures
reported in this bug go away. We had ~8000 of these "failed to proxy
response to client" errors in March. And the rate went up last week
when we converted various server repos to generaldelta. So we should
know relatively quickly if this change reduces the failure rate.

Currently, the load balancer is not enforcing an idle timeout on
connections. We should consider changing that. And once we do, we
can increase Timeout to effectively infinity, since as the in-line
comment explains, the thing it is measuring isn't terribly
important so it doesn't add much value.

Review commit: https://reviewboard.mozilla.org/r/128980/diff/#index_header
See other reviews: https://reviewboard.mozilla.org/r/128980/

Kendall Libby [:fubar] (he/him)

Comment 93

•

7 years ago

mozreview-review

Comment on attachment 8857080 [details]
ansible/hg-web: increase network timeout from 60s to 120s (bug 1291926);

https://reviewboard.mozilla.org/r/128980/#review131528

shipit

Attachment #8857080 - Flags: review?(klibby) → review+

Pulsebot

Comment 94

•

7 years ago

Pushed by gszorc@mozilla.com:
https://hg.mozilla.org/hgcustom/version-control-tools/rev/ef32c182a737
ansible/hg-web: increase network timeout from 60s to 120s ; r=fubar

Status: ASSIGNED → RESOLVED

Closed: 7 years ago → 7 years ago

Resolution: --- → FIXED

Gregory Szorc [:gps]

Comment 95

•

7 years ago

Deployed the timeout bump to production and bounced all servers to pick up the change. No HTTP requests were harmed in the process.

I'm going to keep the bug open to track other things related to "stream ended unexpectedly" errors. And I want to be sure the 120s timeout is sufficient.

Status: RESOLVED → REOPENED

Resolution: FIXED → ---

Gregory Szorc [:gps]

Comment 96

•

7 years ago

There were ~8000 of these "Failed to proxy response to client" errors in March. And April was looking to be worse.

It is still too early to be conclusive, but in the few hours since this was deployed, there has been exactly 1 of these failures. So I'm optimistic the increased timeout is having the intended effect.
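
For tracking this over time, counting the error is a one-liner over the httpd error log. A minimal Python sketch follows; how the ~8000 figure was actually obtained is not stated here, `count_proxy_failures` is a hypothetical helper, and the sample line is the redacted one quoted in comment 84.

```python
def count_proxy_failures(lines):
    """Count error-log lines reporting a failed proxy write to the client."""
    needle = "Failed to proxy response to client"
    return sum(1 for line in lines if needle in line)


sample = [
    "(70007)The timeout specified has expired: [client XXXX:YY] "
    "mod_wsgi (pid=XXX): Failed to proxy response to client",
    "some unrelated log line",
]
count = count_proxy_failures(sample)
```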

Emma Humphries ☕️🎸🧞‍♀️✨ (she/they) [:emceeaich] (Pacific Time) use needinfo

Updated

•

7 years ago

Status: REOPENED → RESOLVED

Closed: 7 years ago → 7 years ago

Keywords: bulk-close-intermittents

Resolution: --- → INCOMPLETE

Sebastian Hengst [:aryx] (needinfo me if it's about an intermittent or backout)

Comment 100

•

5 years ago

Reopening as this is still an issue (e.g. :pbro and I have hit it in the last week). Currently people have to find out on their own that they should manually download an hg bundle if they hit the issue: https://developer.mozilla.org/en-US/docs/Mozilla/Developer_guide/Source_Code/Mercurial/Bundles

Status: RESOLVED → REOPENED

Keywords: bulk-close-intermittents

Resolution: INCOMPLETE → ---

Whiteboard: [stockwell fixed]

Mitchell Hentges [:mhentges] 🦀

Comment 101

•

4 years ago

I'm doing some testing to see if I can reproduce a failed/timing out hg clone due to a bad environment.
In my tests yesterday, I didn't see any failures after ~an hour or two.

My environment was set up with:

* A process constantly stressing the hard-drive: stress -d 4
* A hampered network connection configured with: tc qdisc add dev ens4 root netem delay 10ms 20ms corrupt 10% reorder 25% 50% loss 5%
* An hg clone with:

hg robustcheckout --sharebase hg-sharebase --purge --config hostsecurity.hg.mozilla.org:fingerprints=sha256:17:38:aa:92:0b:84:3e:aa:8e:52:52:e9:4c:2f:98:a9:0e:bf:6c:3e:e9:15:ff:0a:29:80:f7:06:02:5b:e8:48,sha256:8e:ad:f7:6a:eb:44:06:15:ed:f3:e4:69:a6:64:60:37:2d:ff:98:88:37:bf:d7:b8:40:84:01:48:9c:26:ce:d9 --upstream https://hg.mozilla.org/mozilla-unified --revision 6c3cd02d3533602e6a08b200973e335065f27fa5 https://hg.mozilla.org/try src

(I yoinked the hg command from a CI job)

I'll try a clone instead of a robustcheckout to see if that will trigger the issues.

Sebastian Hengst [:aryx] (needinfo me if it's about an intermittent or backout)

Updated

•

3 years ago

Updated

•

1 year ago

Assignee: gps → nobody

Summary: Intermittent Windows builds failing with abort: stream ended unexpectedly (got 131914 bytes, expected 1630418948) → Intermittent cloning failing with abort: stream ended unexpectedly (got 131914 bytes, expected 1630418948)

robustcheckout: factor out code for handling error.Abort during a pull; 7 years ago Gregory Szorc [:gps] 59 bytes, text/x-review-board-request	glob : review+	Details
robustcheckout: add test for server failure during clone (bug 1291926); 7 years ago Gregory Szorc [:gps] 59 bytes, text/x-review-board-request	glob : review+	Details
robustcheckout: retry after "stream ended unexpectedly" (bug 1291926); 7 years ago Gregory Szorc [:gps] 59 bytes, text/x-review-board-request	glob : review+	Details
ansible/hg-web: increase network timeout from 60s to 120s (bug 1291926); 7 years ago Gregory Szorc [:gps] 59 bytes, text/x-review-board-request	fubar : review+	Details