Open Bug 1304942 Opened 8 years ago Updated 28 days ago

Intermittent-infra abort: Connection reset by peer cloning mozilla-unified

Categories

(Developer Services :: Mercurial: robustcheckout, defect)

Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

People

(Reporter: intermittent-bug-filer, Unassigned)

Details

(Keywords: intermittent-failure, Whiteboard: [stockwell infra])

I think `hg robustcheckout` should handle the retry logic. I'm also tempted to put fingerprint lookup logic in robustcheckout as well. I'd *really* like to get the automation logic for hg interaction to "just run `hg robustcheckout`."
Component: General → Mercurial: robustcheckout
Product: Taskcluster → Developer Services
I was about to close this as a dupe of bug 1317594, but https://treeherder.mozilla.org/logviewer.html#?repo=autoland&job_id=90756909&lineNumber=1-120 is an instance of the error with the robustcheckout improvements from that bug. So this appears to still be a problem.
Assignee: nobody → gps
Status: NEW → ASSIGNED
Would this just be an additional case in https://dxr.mozilla.org/hgcustom_version-control-tools/rev/6a20a18a2d58b167d2180f6604866b4149bba5c2/hgext/robustcheckout/__init__.py#362? Something like `elif e.args[0].startswith(_('Connection reset by peer'))`? I know that function is called `handlepullerror`, but is it called on clone errors, too?
Flags: needinfo?(gps)
I /think/ so. Except "Connection reset by peer" doesn't come from Mercurial, so it doesn't need the `_()` translation.

Also, I wouldn't be surprised if the underlying exception weren't a mercurial.error.Abort at this point; I'd expect an HTTP-level failure to surface as a URLError. So this may need to go in the `elif isinstance(e, urllib2.URLError)` block below.

Also, if you'll be changing that code, it would be really handy to log the exact exception type. That will make it much easier to chase down the long tail of exceptions we need to catch and retry.
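A rough sketch of the kind of check being discussed, purely as an illustration (the helper name, signature, and the print-based logging are hypothetical; the real handlepullerror in robustcheckout is structured differently):

```python
# Hypothetical Python 2-era sketch; names and structure are illustrative,
# not the actual robustcheckout code.
import socket
import urllib2

from mercurial import error


def shouldretry(e):
    """Return True if the exception looks like a transient network failure."""
    # Logging the exact exception type makes it easier to chase down the
    # long tail of errors worth retrying.
    print('pull/clone failed with %r: %s' % (type(e), e))

    # "Connection reset by peer" comes from the OS, not Mercurial, so no
    # _() translation is needed when matching the message.
    if isinstance(e, error.Abort):
        return bool(e.args) and 'Connection reset by peer' in e.args[0]

    # HTTP-level failures typically surface as a URLError wrapping a
    # socket.error (ECONNRESET and friends) rather than an Abort.
    if isinstance(e, urllib2.URLError):
        return isinstance(getattr(e, 'reason', None), socket.error)

    return False
```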
Flags: needinfo?(gps)
There were 33 failures related to this bug in the last 7 days. 

:KWierso, can you please take a look at this since it spiked in the last few days?
Flags: needinfo?(kwierso)
Whiteboard: [stockwell infra]
The failure rate here has declined; I assume we had an infrastructure issue.
Flags: needinfo?(kwierso)
This spiked a bit on the 19th due to an infra issue, and the failure rate has since declined.
This spiked again between the 26th and 27th, and the failure rate has since declined.
I got a few of these after triggering dozens of retries for a build task. One failure is https://public-artifacts.taskcluster.net/doi2NReFTOiZSmtchGgkfA/0/public/logs/live_backing.log.

What caught my eye is that before the failure, the streaming clone was *really* slow: >10x slower than normal.

I suspect my failures (and likely a large percentage of those in the wild) are due to S3 throttling the Mercurial bundle request. I bet the potentially dozens of machines fetching the clone bundle from S3 at nearly the same time trigger some DoS or other abuse-mitigation defense.

We /can/ work around this in robustcheckout by retrying the request. But if we're hitting S3 "scaling limits," that's a dangerous issue to be wallpapering over. Since it occurs so infrequently today, it is probably OK to work around. But if we start cloning from test tasks (which will increase clone volume substantially), we should really keep an eye on S3 throttling issues.
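For the retry workaround, the general shape would be a backoff-with-jitter wrapper around the bundle fetch. A minimal sketch, assuming a stand-in fetch() callable rather than the real clonebundle code path:

```python
# Hypothetical sketch of retrying a throttled bundle fetch; not the actual
# robustcheckout implementation.
import random
import time


def withretries(fetch, attempts=5, basedelay=1.0):
    """Call fetch(), retrying transient failures with jittered backoff."""
    for attempt in range(attempts):
        try:
            return fetch()
        except IOError as e:
            if attempt + 1 == attempts:
                raise
            # Back off exponentially, with jitter so dozens of workers that
            # started cloning at the same time don't retry in lockstep and
            # re-trigger the same S3 throttling.
            delay = basedelay * (2 ** attempt) + random.uniform(0, 1)
            print('fetch failed (%s); retrying in %.1fs' % (e, delay))
            time.sleep(delay)
```

The jitter is the important part if the root cause really is many machines hitting S3 at once: without it, all the workers would retry at the same moment and hit the same throttle again.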
There have been a total of 32 failures in the last week, according to Orange Factor.

Occurrences per platform:
-Linux x64: 23
-osx-cross: 5
-Linux: 2
-linux64-noopt: 1
-android-4-0-armv7-api16: 1

Occurrences per build type:
-debug: 14
-opt: 12
-asan: 4
-pgo: 2

Here is a recent log file and a snippet with the failure:
https://treeherder.mozilla.org/logviewer.html#?repo=autoland&job_id=159141404&lineNumber=47

[vcs 2018-01-29T20:10:20.217Z] abort: Connection reset by peer
[taskcluster 2018-01-29 20:10:20.436Z] === Task Finished ===
[taskcluster 2018-01-29 20:10:20.562Z] Artifact "public/build" not found at "/builds/worker/artifacts/"
[taskcluster 2018-01-29 20:10:21.289Z] Unsuccessful task run with exit code: 255 completed in 325.627 seconds

:gps, could you please take a look?
Flags: needinfo?(gps)
This bug continues to fail. It failed 31 times in the last 7 days on Linux and Android, affecting opt and debug build types.
Link to a recent log: https://treeherder.mozilla.org/logviewer.html#?job_id=173805243&repo=mozilla-inbound&lineNumber=34

:gps Could you please take another look?
:gbrown, this bug does not have a triage owner; can you please take a look?
Flags: needinfo?(gbrown)
I note comment 45 and comment 62: it seems like there is some hope of avoiding these ongoing failures. :gps is likely our best bet for resolving this, and he is assigned and needinfo'd. Hopefully he can provide an update soon.
Flags: needinfo?(gbrown)
The incidence rate of this is low, and I suspect it is the same underlying issue explained in bug 1371378 comment 58. tl;dr: it may require an upstream fix to fully address.
Flags: needinfo?(gps)
Assignee: gps → nobody
Status: ASSIGNED → NEW

Some of the recent failures here are due to the GCP migration, which was rolled back. Please disregard.

Connor, could you take a look at the increased failure rate here? It might be related to the frequent timeouts in builds.

Flags: needinfo?(sheehan)

I don't have any meaningful updates here, sorry to leave you hanging. Please re-needinfo me if more action is warranted on this bug.

Flags: needinfo?(sheehan)