Open Bug 1304942 Opened 8 years ago Updated 28 days ago

Intermittent-infra abort: Connection reset by peer cloning mozilla-unified

Categories

(Developer Services :: Mercurial: robustcheckout, defect)

Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

People

(Reporter: intermittent-bug-filer, Unassigned)

Details

(Keywords: intermittent-failure, Whiteboard: [stockwell infra])

I think `hg robustcheckout` should handle the retry logic. I'm also tempted to put fingerprint lookup logic in robustcheckout as well. I'd *really* like to get the automation logic for hg interaction to "just run `hg robustcheckout`."
Component: General → Mercurial: robustcheckout
Product: Taskcluster → Developer Services
I was about to close this as a dupe of bug 1317594, but https://treeherder.mozilla.org/logviewer.html#?repo=autoland&job_id=90756909&lineNumber=1-120 is an instance of the error with the robustcheckout improvements from that bug. So this appears to still be a problem.
Assignee: nobody → gps
Status: NEW → ASSIGNED
Would this just be an additional case in https://dxr.mozilla.org/hgcustom_version-control-tools/rev/6a20a18a2d58b167d2180f6604866b4149bba5c2/hgext/robustcheckout/__init__.py#362? Something like `elif e.args[0].startswith(_('Connection reset by peer'))`? I know that function is called `handlepullerror`, but is it called on clone errors, too?
Flags: needinfo?(gps)
I /think/ so. Except "Connection reset by peer" doesn't come from Mercurial, so it doesn't need the `_()` translation.

Also, I wouldn't be surprised if the underlying exception weren't a mercurial.error.Abort at this point; I'd expect an HTTP-level failure to surface as a URLError. So this may need to go in the `elif isinstance(e, urllib2.URLError)` block below.

Also, if you'll be changing that code, it would be really handy to log the exact exception type. That will make it much easier to chase down the long tail of exceptions we need to catch and retry.
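A rough sketch of the kind of check being discussed, purely as an illustration (the helper name, signature, and the print-based logging are hypothetical; the real handlepullerror in robustcheckout is structured differently):

```python
# Hypothetical Python 2-era sketch; names and structure are illustrative,
# not the actual robustcheckout code.
import socket
import urllib2

from mercurial import error


def shouldretry(e):
    """Return True if the exception looks like a transient network failure."""
    # Logging the exact exception type makes it easier to chase down the
    # long tail of errors worth retrying.
    print('pull/clone failed with %r: %s' % (type(e), e))

    # "Connection reset by peer" comes from the OS, not Mercurial, so no
    # _() translation is needed when matching the message.
    if isinstance(e, error.Abort):
        return bool(e.args) and 'Connection reset by peer' in e.args[0]

    # HTTP-level failures typically surface as a URLError wrapping a
    # socket.error (ECONNRESET and friends) rather than an Abort.
    if isinstance(e, urllib2.URLError):
        return isinstance(getattr(e, 'reason', None), socket.error)

    return False
```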
Flags: needinfo?(gps)
There were 33 failures related to this bug in the last 7 days. 

:KWierso, can you please take a look at this since it spiked in the last few days?
Flags: needinfo?(kwierso)
Whiteboard: [stockwell infra]
The failure rate here has declined; I assume we had an infrastructure issue.
Flags: needinfo?(kwierso)
This spiked a bit on the 19th due to an infra issue, and the failure rate has since declined.
This spiked again between the 26th and 27th, and the failure rate has since declined.
I got a few of these after triggering dozens of retries for a build task. One failure is https://public-artifacts.taskcluster.net/doi2NReFTOiZSmtchGgkfA/0/public/logs/live_backing.log.

What caught my eye is that before the failure, the streaming clone was *really* slow: >10x slower than normal.

I suspect my failures (and likely a large percentage of those in the wild) are due to S3 throttling the Mercurial bundle request. I bet the potentially dozens of machines fetching the clone bundle from S3 at nearly the same time trigger some DoS or other abuse-mitigation defense.

We /can/ work around this in robustcheckout by retrying the request. But if we're hitting S3 "scaling limits," that's a dangerous issue to be wallpapering over. Since it occurs so infrequently today, it is probably OK to work around. But if we start cloning from test tasks (which will increase clone volume substantially), we should really keep an eye on S3 throttling issues.
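For the retry workaround, the general shape would be a backoff-with-jitter wrapper around the bundle fetch. A minimal sketch, assuming a stand-in fetch() callable rather than the real clonebundle code path:

```python
# Hypothetical sketch of retrying a throttled bundle fetch; not the actual
# robustcheckout implementation.
import random
import time


def withretries(fetch, attempts=5, basedelay=1.0):
    """Call fetch(), retrying transient failures with jittered backoff."""
    for attempt in range(attempts):
        try:
            return fetch()
        except IOError as e:
            if attempt + 1 == attempts:
                raise
            # Back off exponentially, with jitter so dozens of workers that
            # started cloning at the same time don't retry in lockstep and
            # re-trigger the same S3 throttling.
            delay = basedelay * (2 ** attempt) + random.uniform(0, 1)
            print('fetch failed (%s); retrying in %.1fs' % (e, delay))
            time.sleep(delay)
```

The jitter is the important part if the root cause really is many machines hitting S3 at once: without it, all the workers would retry at the same moment and hit the same throttle again.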
There have been a total of 32 failures in the last week, according to Orange Factor.

Occurrences per platform:
-Linux x64: 23
-osx-cross: 5
-Linux: 2
-linux64-noopt: 1
-android-4-0-armv7-api16: 1

Occurrences per build type:
-debug: 14
-opt: 12
-asan: 4
-pgo: 2

Here is a recent log file and a snippet with the failure:
https://treeherder.mozilla.org/logviewer.html#?repo=autoland&job_id=159141404&lineNumber=47

[vcs 2018-01-29T20:10:20.217Z] abort: Connection reset by peer
[taskcluster 2018-01-29 20:10:20.436Z] === Task Finished ===
[taskcluster 2018-01-29 20:10:20.562Z] Artifact "public/build" not found at "/builds/worker/artifacts/"
[taskcluster 2018-01-29 20:10:21.289Z] Unsuccessful task run with exit code: 255 completed in 325.627 seconds

:gps, could you please take a look?
Flags: needinfo?(gps)
This bug continues to fail. It failed 31 times in the last 7 days on Linux and Android, affecting opt and debug build types.
Link to a recent log: https://treeherder.mozilla.org/logviewer.html#?job_id=173805243&repo=mozilla-inbound&lineNumber=34

:gps Could you please take another look?
:gbrown, this bug does not have a triage owner; can you please take a look?
Flags: needinfo?(gbrown)
I note comment 45 and comment 62: it seems like there is some hope of avoiding these ongoing failures. :gps is likely our best bet for resolving this, and he is assigned and needinfo'd. Hopefully he can provide an update soon.
Flags: needinfo?(gbrown)
The incidence rate of this is low, and I suspect it is the same underlying issue explained in bug 1371378 comment 58. tl;dr: it may require an upstream fix to fully address.
Flags: needinfo?(gps)
Assignee: gps → nobody
Status: ASSIGNED → NEW

Some of the recent failures here are due to the GCP migration, which was rolled back. Please disregard.

Connor, could you take a look at the increased failure rate here? It might be related to the frequent timeouts in builds.

Flags: needinfo?(sheehan)

I don't have any meaningful updates here, sorry to leave you hanging. Please re-needinfo me if more action is warranted on this bug.

Flags: needinfo?(sheehan)