Open Bug 1291926 Opened 4 years ago Updated 26 days ago

Intermittent Windows builds failing with abort: stream ended unexpectedly (got 131914 bytes, expected 1630418948)


(Developer Services :: Mercurial:, defect, major)

Not set


(Not tracked)



(Reporter: intermittent-bug-filer, Assigned: gps)



(Keywords: intermittent-failure)


(4 files)

The traceback I'm seeing is:
 Mercurial Distributed SCM (version 3.7.3)
 (see for more information)
 Copyright (C) 2005-2016 Matt Mackall and others
 This is free software; see the source for copying conditions. There is NO
 + hgargs='--sharebase c:/builds/hg-shared'
 + hgargs='--sharebase c:/builds/hg-shared --revision d3b50142a70cda11137933c91c587b37f1203f95'
 + hgargs='--sharebase c:/builds/hg-shared --revision d3b50142a70cda11137933c91c587b37f1203f95 --upstream'
 + '[' -z '' ']'
 + hgargs='--sharebase c:/builds/hg-shared --revision d3b50142a70cda11137933c91c587b37f1203f95 --upstream --purge'
 + hg --config extensions.robustcheckout=/c/builds/moz2_slave/m-in_w32-d_sm-plaindebug-00000/scripts/hgext/ robustcheckout --sharebase c:/builds/hg-shared --revision d3b50142a70cda11137933c91c587b37f1203f95 --upstream --purge src
 ensuring is available at src
 (cloning from upstream repo
 (sharing from existing pooled repository 8ba995b74e18334ab3707f27e9eb8f4e37ba3d29)
 searching for changes
 adding changesets
 adding manifests
 adding file changes
 transaction abort!
 rollback completed
 Traceback (most recent call last):
   File "mercurial\dispatch.pyc", line 191, in _runcatch
   File "mercurial\dispatch.pyc", line 924, in _dispatch
   File "mercurial\dispatch.pyc", line 681, in runcommand
   File "mercurial\extensions.pyc", line 195, in closure
   File "hgext\color.pyc", line 518, in colorcmd
   File "mercurial\dispatch.pyc", line 1055, in _runcommand
   File "mercurial\dispatch.pyc", line 1015, in checkargs
   File "mercurial\dispatch.pyc", line 921, in <lambda>
   File "mercurial\util.pyc", line 991, in check
   File "c:/builds/moz2_slave/m-in_w32-d_sm-plaindebug-00000/scripts/hgext/", line 149, in robustcheckout
   File "c:/builds/moz2_slave/m-in_w32-d_sm-plaindebug-00000/scripts/hgext/", line 229, in _docheckout
   File "mercurial\hg.pyc", line 489, in clone
   File "mercurial\hg.pyc", line 380, in clonewithshare
   File "mercurial\exchange.pyc", line 1188, in pull
   File "mercurial\exchange.pyc", line 1329, in _pullbundle2
   File "mercurial\bundle2.pyc", line 355, in processbundle
   File "mercurial\bundle2.pyc", line 765, in iterparts
   File "mercurial\bundle2.pyc", line 772, in _readpartheader
   File "mercurial\bundle2.pyc", line 602, in _unpack
   File "mercurial\bundle2.pyc", line 607, in _readexact
   File "mercurial\changegroup.pyc", line 43, in readexactly
 Abort: stream ended unexpectedly (got 0 bytes, expected 4)
abort: stream ended unexpectedly (got 0 bytes, expected 4)

Which sounds like a #vcs problem?
Component: JavaScript Engine → Mercurial:
Flags: needinfo?(gps)
Product: Core → Developer Services
QA Contact: hwine
The reason we're seeing these on SpiderMonkey builds all of a sudden is because bug 1291058 changed some spidermonkey jobs to use the robustclone extension. That change was to the build/tools repo, which means *all* trees got the change as soon as it was pushed.

AFAIK, we're not seeing these failures elsewhere. Or if we are, retries in mozharness or similar are paving over them. robustclone is being successfully used on Linux and OS X in buildbot and TC, so I don't think there's anything wrong with robustclone.

"stream ended unexpectedly" in this stack means the client was receiving data from the server and for whatever reason didn't receive all the data it was expecting. There are a few general explanations:

1) The server is sending malformed data
2) There is a bug in Mercurial's stream reading code
3) The data is getting corrupted in transit
4) Connections are dropping
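
For reference, the guard raising this error at the bottom of the traceback is a short-read check: read exactly N bytes and abort if fewer arrive. A minimal sketch modeled on the error message (the real Mercurial changegroup.readexactly raises error.Abort; names and exception type here are illustrative):

```python
import io

def readexactly(stream, n):
    """Read exactly n bytes from stream; raise if the stream ends early.
    Sketch of Mercurial's changegroup.readexactly-style guard."""
    s = stream.read(n)
    if len(s) < n:
        raise EOFError(
            "stream ended unexpectedly (got %d bytes, expected %d)"
            % (len(s), n))
    return s

# A 4-byte part-header read against an exhausted stream reproduces the
# "got 0 bytes, expected 4" flavor seen in this traceback.
try:
    readexactly(io.BytesIO(b""), 4)
except EOFError as e:
    print(e)
```

Note the guard itself cannot tell a cleanly closed connection from a dropped one; it only sees fewer bytes than expected.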

My bet is on #4. We've had reports of dropped connections to before.

We could pave over the problem by following up on bug 1291058 to retry after failure.

We could also loop in some IT folk to investigate dropped connections. FWIW, the server logs report a few hundred aborted requests due to "broken pipe." Of course, that could be anything from a client terminating a process to a network hiccup of many varieties. Actually figuring out the cause of the network hiccup likely requires packet capturing at as many locations in the link between client and server as possible. And with the volume of traffic we issue to, that can be very difficult. Capturing on a client and seeing which end aborts the TCP connection could be a good start...
Depends on: 1291058
Flags: needinfo?(gps)
Retries mostly paving over them, and our massive ability to ignore infra intermittents preventing anyone from hearing about the ones where they don't: is a buildbot Windows build failing out after five stream ends.
I find it strange that this stopped occurring in the past few days. That or we haven't been starring it.
Well, a mix of the two would be more accurate: odds are very good that if we kept a constant number of spidermonkey jobs going through both the high-load US daytime and the low-load US nighttime and the no-load weekend, this would happen much more frequently during the high-load times, but, spidermonkey jobs are only triggered when a patch touches js/src/, and people who touch js/src/ almost never push during weekends.
03:44 < philor> on the bright side, after five abort: stream ended unexpectedly clone failures in a row I finally did what I should have done after the first one, updated hg, and succeeded the next attempt

I don't recall an upstream bug that would cause this failure. But, uh, I guess upgrading Mercurial can't hurt.

FWIW, I'm going to make a push to upgrade to 3.9.1 everywhere once it is released in a few weeks.
Today's 13 episodes were brought to you by the letter "Really Crappy Network In General."

I do think it was interesting, though, that during the same period of crappy network there was a single Windows opt build on a single push on autoland which failed (despite the 5 retries) three times, all on the same AWS instance, until I finally got tired of seeing it fail, terminated that instance, and did fine with the next one that took the job.
Summary: Intermittent Windows spidermonkey builds failing with abort: stream ended unexpectedly (got 131914 bytes, expected 1630418948) → Intermittent Windows builds failing with abort: stream ended unexpectedly (got 131914 bytes, expected 1630418948)
QA Contact: hwine → klibby
This just started happening 3 days ago on win32 builds and spidermonkey builds on win32.

:gps, did something in our VCS change on Jan 3rd?
Flags: needinfo?(gps)
(In reply to Joel Maher ( :jmaher) from comment #19)
> This just started happening 3 days ago on win32 builds and spidermonkey
> builds on win32.
> :gps, did something in our VCS change on Jan 3rd?

There were no deployments to on January 3rd. However, it's quite possible some other infrastructure/network work happened around that time, as nearly the whole world practices infrastructure freezes over the holidays and then changes things like crazy in early January.

Still, this needs to be fixed by making `hg robustcheckout` retry after failure.
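
The shape of such a fix is a bounded retry loop around the checkout step that only retries on the transient stream error. A sketch under assumptions: `docheckout` stands in for robustcheckout's internal checkout function, and the real extension would match Mercurial's error.Abort rather than IOError:

```python
def checkout_with_retries(docheckout, max_attempts=3):
    """Retry a checkout when it aborts with a transient stream error.
    Illustrative only; not the actual robustcheckout implementation."""
    for attempt in range(1, max_attempts + 1):
        try:
            return docheckout()
        except IOError as e:
            if "stream ended unexpectedly" not in str(e):
                raise               # not a transient network failure
            if attempt == max_attempts:
                raise               # retries exhausted
            print("transient abort on attempt %d; retrying" % attempt)

# Simulated checkout that fails twice before succeeding.
calls = {"n": 0}
def flaky_checkout():
    calls["n"] += 1
    if calls["n"] < 3:
        raise IOError("stream ended unexpectedly (got 0 bytes, expected 4)")
    return "checked out"

print(checkout_with_retries(flaky_checkout))
```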
Flags: needinfo?(gps)
Duplicate of this bug: 1304471
Severity: normal → major
Duplicate of this bug: 1329935
(In reply to Gregory Szorc [:gps] from comment #22)
> Still, this needs to be fixed by making `hg robustcheckout` retry after
> failure.

Is there a bug filed for this that we can mark as a dep?
Flags: needinfo?(gps)
There are a number of bugs all reporting similar symptoms. I've been reluctant to dupe them out of fear it will confuse starring.

Anyway, it seemed like the frequency of the failures was low enough to not warrant the time to fix it. But comment #21 says this was the #10 failure last week and is pretty forcefully worded. So I suppose we should fix this...
Assignee: nobody → gps
Flags: needinfo?(gps)
At least on the release branches, I'm seeing more failing Windows SM jobs than passing at this point.
If you want something else to retry right now, you could save me cloning this bug the instant it closes with the new summary "Intermittent Windows builds failing with abort: missing support for negative part header size: -1671002596" by also retrying on that.

Though, since 33 of last week's 195 were browser builds, which already retry, I guess I'm going to be cloning it pretty much the instant it closes anyway, aren't I?
Comment on attachment 8825614 [details]
robustcheckout: factor out code for handling error.Abort during a pull;
Attachment #8825614 - Flags: review?(glob) → review+
Comment on attachment 8825615 [details]
robustcheckout: add test for server failure during clone (bug 1291926);

::: hgext/robustcheckout/tests/
(Diff revision 2)
> +
> +            if untilgoodcount:

nit: this 2nd if is redundant
Attachment #8825615 - Flags: review?(glob) → review+
Comment on attachment 8825616 [details]
robustcheckout: retry after "stream ended unexpectedly" (bug 1291926);

Attachment #8825616 - Flags: review?(glob) → review+
Comment on attachment 8825615 [details]
robustcheckout: add test for server failure during clone (bug 1291926);

> nit: this 2nd if is redundant

But it isn't! Consider the case where it reads the string "0", which is converted to int(0), which is Falsy. I'll add an inline comment.
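
Concretely, the distinction the second `if` preserves (a standalone sketch; the function and file format here are illustrative, not the actual test server's code):

```python
def remaining_failures(raw):
    """Parse a sentinel value holding how many requests should still fail."""
    if raw:                 # "" -> no sentinel configured at all
        count = int(raw)
        if count:           # "0" -> int(0) -> falsy: no failures left
            return count
    return None

assert remaining_failures("") is None
assert remaining_failures("0") is None   # why the 2nd if is not redundant
assert remaining_failures("2") == 2
```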
Pushed by
robustcheckout: add test for server failure during clone ; r=glob
robustcheckout: retry after "stream ended unexpectedly" ; r=glob
Closed: 3 years ago
Resolution: --- → FIXED
Pushed by
Vendor latest robustcheckout extension; r=me
These patches definitely improved the situation, but there are still occasional failures creeping through.
And even after the update of the robustcheckout in build/tools/, the SpiderMonkey builds are still only attempting a single time.
Whiteboard: [stockwell fixed]
This still strikes pretty frequently. Is there something we can do to make cloning more reliable? Is there anything telling about it predominantly affecting SM jobs?
Flags: needinfo?(gps)
Is it possible we're doing something silly with the mozilla-unified repo being baked into the AMI for these jobs? Just seems really weird that Windows SM jobs seem so disproportionately affected by this problem. AFAICT, we're pretty much hitting this issue constantly, it's just that the retries cover it up when we're lucky. Wonder how much machine time and money we're burning on this right now...
Flags: needinfo?(mcornmesser)
Currently we are not baking any of the repos into the AMI. We tried that previously but it was not successful.
Flags: needinfo?(mcornmesser)
There is talk upstream about this bug. glandium was able to reproduce it reliably on Linux using signals. We suspect something in Mercurial or Python isn't retrying an interrupted system call.

Upstream has also noticed this error appears more frequently on Windows. I wouldn't be surprised if we're running into a CPython bug.

I have ideas for fixing this. But I'll likely need to deploy a hacked up version of Mercurial to Windows to prove its effectiveness. Bug 1351513 was the first step of that. Next, I'll need to produce a Mercurial Windows installer. This is quite the rabbit hole...

I am actively looking at this.
Flags: needinfo?(gps)
For whatever reason, this is permafailing Linux x64 Addon builds on Beta since yesterday, no matter how many retriggers I do.
Flags: needinfo?(gps)
Resolution: FIXED → ---
I think the majority of failures are due to a network timeout, similar to bug 1338530 comment #17. I'm attempting to prove that by reproducing on a local machine while packet tracing...
Depends on: 1354625
I think there are multiple root causes of this "stream ended unexpectedly" issue.

I think the single biggest cause is a lethargic client becoming network idle resulting in the TCP connection being dropped. Lesser causes include normal network unreliability and failure of Mercurial to handle interrupted system calls (which glandium has reported upstream as a known cause of "stream ended unexpectedly").

I'm now able to reliably reproduce this failure on Linux. Which means it is only a matter of time before I confirm the root cause(s) and devise a fix.
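
The interrupted-system-call failure mode glandium reproduced with signals looks roughly like this: a read interrupted by a signal fails with EINTR, and a caller that doesn't retry sees a short stream. A sketch (Python >= 3.5 retries EINTR automatically per PEP 475, but the Python 2.7 bundled with Mercurial at the time surfaces it; the helper name is illustrative):

```python
import errno

def read_retrying_eintr(read_once, n):
    """Retry a read when it is interrupted by a signal (EINTR)."""
    while True:
        try:
            return read_once(n)
        except OSError as e:
            if e.errno != errno.EINTR:
                raise
            # interrupted before any data arrived: safe to retry

# Simulate a read interrupted once by a signal, then succeeding.
state = {"interrupted": False}
def flaky_read(n):
    if not state["interrupted"]:
        state["interrupted"] = True
        raise OSError(errno.EINTR, "Interrupted system call")
    return b"x" * n

print(read_retrying_eintr(flaky_read, 4))
```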
Flags: needinfo?(gps)
I've got a packet capture from client to along with an observed "stream ended unexpectedly." Unfortunately it is encrypted. But I think there's a smoking gun of a server-initiated drop after idle.

311.032340 hg->client encrypted data (preceded by kilobytes of similar packets)
311.071510 client->hg ACK
311.331237 hg->client TCP Window Full
311.331254 client->hg TCP ZeroWindow
311.585893 hg->client TCP Keep-Alive
repeated ZeroWindow and Keep-Alive pairs following what looks like exponential backoff
436.112884 hg->client TCP Keep-Alive
436.112899 client->hg TCP ZeroWindow
455.004558 client->hg TCP Window Update (ACK)
455.016535 hg->client TCP segment
repeated TCP data from server to client for a little while
456.392987 hg->client encrypted data (last presumed Mercurial data from server)
456.422323 client->hg ACK
466.727988 hg->client TLS Alert 21 (Encrypted Alert)
466.728028 client->hg ACK
466.728061 hg->client FIN, ACK
466.771518 client->hg ACK
478.655037 hg->client FIN, ACK
478.666036 client->hg ACK

What looks to be happening is the server sends a bunch of data to the client. The client can't keep up on the consuming end and it keeps telling the server to wait. Eventually, the client is ready to accept more data. The server gladly obliges. But, moments later the server ends the TLS session and wants to drop the connection!

I suspect what's happening is that during the long period where the Mercurial client isn't accepting any new data from the server, the TCP connection between the load balancer and the HTTP server on the origin server is dropped or gets in an aborted state. This could be any number of timeouts, including an idle timeout inside mod_wsgi. Instead of dropping the TCP connection right there and then, the connection is kept alive. When the Mercurial client comes back, the load balancer flushes whatever data is has buffered. Once that is flushed, it initiates shutdown.

There is some speculation in the previous paragraph. I would need to packet trace between the load balancer and the httpd server to see exactly what's going on.

The whole issue stems from the Mercurial client being idle over the network for an extended period of time (because it is being slow). (I reproduced this behavior by introducing a sleep(1) between applying manifests. But the "stream ended unexpectedly" doesn't occur until the beginning of filelog updates.)

 Also, I do see an OSError("Connection ended unexpectedly") being raised by Python. But it is being swallowed by Mercurial, leading to this custom "stream ended unexpectedly" error. That's worthy of a fix upstream.
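
The swallowing pattern looks roughly like this anti-pattern sketch (illustrative only, not Mercurial's actual code): a read wrapper treats any OSError as end-of-stream, so the caller only ever reports the generic short read and the real cause is lost.

```python
def read_swallowing_errors(read_once, n):
    """Anti-pattern: a network error is reported as end-of-stream, so
    the caller sees 'got 0 bytes, expected N' instead of the real
    connection failure."""
    try:
        return read_once(n)
    except OSError:
        return b""  # real error swallowed here

def broken_read(n):
    raise OSError("Connection ended unexpectedly")

data = read_swallowing_errors(broken_read, 4)
print("got %d bytes, expected %d" % (len(data), 4))
```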

I think there is an idle timeout somewhere between zlb <-> httpd <-> mod_wsgi that we need to increase to prevent premature connection dropping...
atoll: could you please read comment #78 and add any zlb-related expertise you may have?

Also, is it difficult to get a dummy load balancer entry routing to a specific hgweb host? I'd love to packet trace what's happening between zlb and hgweb to confirm/refute suspicions. But that's difficult to do with the firehose of production traffic in the way.

(I'm about to start my weekend. So there's no rush to do anything before Monday.)
Flags: needinfo?(rsoderberg)
TS rule, approximately:

If (request.getHeader("X-GPS") == "1") {
  pool.use("hgweb-pool", "", "443");
}

If you pass request header X-GPS, it will assign your request to the listed IP:Port from the named pool. If request headers are hard, use request.getRemoteIp() instead. Note that Zeus will ignore the usual draining/disabled Settings for the node you select with this method.

More later, mandatory PTO
Thanks for the info, atoll!

fubar: could you please configure a TS rule (similar to comment #80) to route requests to say hgweb11's IP? I'll then be able to take that host out of service and packet capture my way to glory.
Flags: needinfo?(klibby)
(In reply to Gregory Szorc [:gps] from comment #82)
> fubar: could you please configure a TS rule (similar to comment #80) to
> route requests to say hgweb11's IP? I'll then be able to take that host out
> of service and packet capture my way to glory.

Done. TS rule "hg-debug" created; add the header "X-HG-DEBUG: 1" and all traffic will go to hgweb11.
Flags: needinfo?(klibby)
I have a packet capture from hgweb11 with the issue reproducing on the client. I still have to sort through that, but there is an error in the httpd server logs:

  (70007)The timeout specified has expired: [client XXXX:YY] mod_wsgi (pid=XXX): Failed to proxy response to client

I think I know what timeout it is referring to. Let me look at the packet trace to confirm the numbers line up.
Confirming my suspicions, the "timeout" mentioned in the log message references the "socket-timeout" WSGIDaemonProcess option, which defaults to httpd's Timeout value which is currently set to 60. If I tweak socket-timeout, I can get that error message to appear sooner or later.

Here's what's happening from the Apache level.

1. Parent httpd process accepts a new socket and hands a file descriptor to worker process
2. httpd worker finds an available WSGI worker process and starts communicating with it. The httpd worker process essentially acts as a proxy between the TCP socket (speaking HTTP) and the WSGI worker (speaking WSGI)
3. WSGI worker produces data pretty fast and sends it over the network (to the load balancer) which receives it initially pretty fast
4. Within a few seconds, httpd is unable to writev() WSGI data to the socket. First call returns partial result. Second call fails with EAGAIN. This means the write would block.
5. httpd worker issues poll({fd, events=POLLOUT}, 1, <socket-timeout>)
6. Often dozens of seconds later, poll() returns.
7. httpd writev() flushes data remaining from last partial call.
8. httpd read()s 8000 bytes from WSGI and writev()s to socket in rapid succession. This manages to send a little over 256,000 bytes to the socket. 
9. Eventually a poll() of the socket fails. httpd writes the "The timeout specified has expired" message to the error log.
10. httpd worker closes file descriptor attached to socket

From the behavior of poll(), read(), and writev(), network writes from httpd to the zlb occur in bursts of ~256kb.

From packet capturing on the client, it appears to receive from the network in bursts of ~128kb. This is likely DECOMPRESSION_RECOMMENDED_INPUT_SIZE from the zstd decompressor wanting ~128kb at a time.

What I find interesting here is that communication between the zlb and httpd occurs at half the rate as between client and zlb. What this tells me is that the buffer on the zlb likely waits until it is empty or has reached a low water mark before asking httpd for more data. I've commonly seen this implemented as a circular buffer in load balancers because that model tends to yield less latency (data can be served to client immediately instead of waiting for origin server to deliver more data). I guess this flavor of load balancer does things differently. Maybe they are optimizing for literal data pass-through. Who knows. Load balancers are magic black boxes. It isn't terribly important.

Now, what's somewhat wonky is the behavior after httpd calls close() on the socket: it lingers for over a minute before it is closed at the TCP level! After the socket is close()d, TCP between the zlb and httpd keep sending ZeroWindow and Keep-Alive messages between each other. Interspersed with those are actual data from httpd to the zlb. This must be data buffered by the kernel because httpd is fully detached from the socket when the data goes over the wire. Because dozens of seconds can pass between the client requesting 128kb chunks, the time difference between the server giving up on the HTTP request and the TCP socket being flushed and closed can be over a minute.
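
Steps 4-6 above can be reproduced in miniature with stdlib sockets: fill a socket's send buffer until the write would block (EAGAIN), then poll for writability with a timeout, which is essentially what httpd does with socket-timeout. The 200ms timeout here stands in for the 60s production value:

```python
import select
import socket

# Peer `b` (playing the zlb) never reads, so `a`'s send buffer fills.
a, b = socket.socketpair()
a.setblocking(False)
sent = 0
try:
    while True:
        sent += a.send(b"\x00" * 65536)
except BlockingIOError:          # EAGAIN: no room left in the buffer
    pass

# httpd's step 5: poll({fd, events=POLLOUT}, timeout). With the peer
# idle the buffer never drains, so poll() times out with no events.
p = select.poll()
p.register(a.fileno(), select.POLLOUT)
events = p.poll(200)             # ms; stands in for <socket-timeout>
print(sent > 0, events)
a.close()
b.close()
```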
Flags: needinfo?(rsoderberg)
You might find some value reading over the Pool options in Zeus. There are a *lot* of finicky detailed TCP options that seem remarkably relevant here. I believe there's a specific "send every packet when it arrives, rather than being efficient" option. I'll share you a link to the relevant Zeus docs folder.
So, as best I can tell, WSGIDaemonProcess's socket-timeout option is driving the timeouts. The current value is 60s. Raising that timeout will cause the server to stop prematurely serving as many requests. We should probably increase it. A risk to doing so is this opens us up to idle connection DoS. But that can be mitigated at the zlb layer.

Compounding the problem is zlb's buffering strategy, which appears to only request data from the origin server when a low water mark is hit, not continuously. This means traffic between hg and zlb doesn't "reset" a timeout in httpd. We have increased the buffer previously to accommodate very large HTTP request headers. We now use HTTP POST for passing this data, so we can likely revert to the default buffer size. This should result in more interaction between zlb and httpd, mitigating the chances of hitting a timeout on httpd.

Further compounding the problem is Mercurial wanting zstd data in 128kb chunks. If it requested data in smaller chunks, it would communicate with the server more often, resetting any idle timeouts in the process. 128kb is a large chunk for network I/O. So Mercurial should probably change this buffer size for network-originated streams. That being said, compression is special. N input bytes can produce M>>N output bytes. If data compresses extremely well or the repository is slow to apply the incoming data, very small network traffic could make the Mercurial operation so slow that the socket becomes idle and times out. This is essentially what happened in automation last week: Windows workers were already relatively slow applying changegroup data. By converting the server-side repos to generaldelta, we made them even slower. And by rolling out zstd, we decreased the frequency of interaction between client and server (due to zstd's large input buffer size), increasing the probability of timeouts.
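
The amplification can be illustrated with stdlib zlib (the production stream is zstd, but the principle is identical): a few KB pulled off the wire can expand into many MB for the client to apply, and while it applies that data the TCP connection sits idle.

```python
import zlib

# Highly compressible payload standing in for changegroup data.
payload = b"changegroup " * 1_000_000        # ~12 MB uncompressed
compressed = zlib.compress(payload, 9)       # a tiny fraction of that

d = zlib.decompressobj()
out = d.decompress(compressed) + d.flush()
ratio = len(out) / len(compressed)
print("%d bytes on the wire -> %d bytes to apply (%.0fx)"
      % (len(compressed), len(out), ratio))
```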

What a nasty and complex bug. It's getting to be the end of the work day for me. So I'm going to grab a beer so I can relax the brain cells that toiled over this bug. I'll work with fubar or dhouse tomorrow to roll out fixes for various buffer sizes and timeouts.
We could have Zeus talk directly to uWSGI, if you prefer that someday.
Oh, I forgot: Yes, ++ to raising HG timeout significantly, *only* for authenticated sessions (if you can possibly determine that).
(In reply to Richard Soderberg [:atoll] from comment #89)
> Oh, I forgot: Yes, ++ to raising HG timeout significantly, *only* for
> authenticated sessions (if you can possibly determine that). is completely unauthenticated. Best we can do is look at source IP. At since so much of our automation operates as spot instances in AWS, unless we are using a special network link between AWS and and can identify that in the HTTP request, I'm not sure there's much we can do.

I'm inclined to jack up timeouts on the HTTP origin servers and have zlb deal with handling idle.
Upon further inspection, the zlb buffer from server to client appears to be 64kb. Not sure why I was seeing 256kb chunks in my packet captures.
Comment on attachment 8857080 [details]
ansible/hg-web: increase network timeout from 60s to 120s (bug 1291926);

Attachment #8857080 - Flags: review?(klibby) → review+
Pushed by
ansible/hg-web: increase network timeout from 60s to 120s ; r=fubar
Closed: 3 years ago3 years ago
Resolution: --- → FIXED
Deployed the timeout bump to production and bounced all servers to pick up the change. No HTTP requests were harmed in the process.

I'm going to keep the bug open to track other things related to "stream ended unexpectedly" errors. And I want to be sure the 120s timeout is sufficient.
Resolution: FIXED → ---
There were ~8000 of these "Failed to proxy response to client" errors in March. And April was looking to be worse.

It is still too early to be conclusive, but in the few hours since this was deployed, there has been exactly 1 of these failures. So I'm optimistic the increased timeout is having the intended effect.
Duplicate of this bug: 1333348
Duplicate of this bug: 1348637
Duplicate of this bug: 1340630
Closed: 3 years ago3 years ago
Resolution: --- → INCOMPLETE

Reopening as this is still an issue (e.g. :pbro and I have hit it in the last week). Currently people have to find out that they should manually download an hg bundle if they hit the issue:

Resolution: INCOMPLETE → ---
Whiteboard: [stockwell fixed]

I'm doing some testing to see if I can reproduce a failed/timing out hg clone due to a bad environment.
In my tests yesterday, I didn't see any failures after ~an hour or two.

My environment was set up with:

  • A process constantly stressing the hard-drive: stress -d 4
  • A hampered network connection configured with: tc qdisc add dev ens4 root netem delay 10ms 20ms corrupt 10% reorder 25% 50% loss 5%
  • An hg clone with:
hg robustcheckout --sharebase hg-sharebase --purge --config,sha256:8e:ad:f7:6a:eb:44:06:15:ed:f3:e4:69:a6:64:60:37:2d:ff:98:88:37:bf:d7:b8:40:84:01:48:9c:26:ce:d9 --upstream --revision 6c3cd02d3533602e6a08b200973e335065f27fa5 src

(I yoinked the hg command from a CI job)

I'll try a clone instead of a robustcheckout to see if that will trigger the issue.