This just burned a Windows Beta job and will delay go-to-build as a result, so I'm setting the severity to Critical here, especially because it means we have to do the |taskcluster task rerun| dance (which few people have the ability to do) and risk hitting bug 1381768 on the retrigger.

These failures show up in Treeherder as infra exceptions (purple) and the TH log parser isn't able to see anything useful to highlight, so I fully expect this bug to get little in the way of starring activity, but a cursory glance at TH suggests it's happening at least daily.

Examples:
https://queue.taskcluster.net/v1/task/S7va0KV_QW2YHzcDy7Ejtw/runs/0/artifacts/public/logs/live_backing.log
https://queue.taskcluster.net/v1/task/PPUAp2JZSuKILy-wiK_fsQ/runs/0/artifacts/public/logs/live_backing.log

Is there something we can do to be more tolerant of these failures if we can't make them go away outright?
It looks like these are timeouts talking to S3. Pete, can you have a look? I wonder if we're just doing too many parallel uploads and thus slowing down too much?
So I think we can improve the situation here. What sucks is that S3 is returning an HTTP 400 response when there is a delay sending data:

> Error uploading artifact: (Permanent) HTTP response code 400

And, as https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/400 says:

> The HTTP 400 Bad Request response status code indicates that the server could not understand the
> request due to invalid syntax. The client should not repeat this request without modification.

So since we get a 400 response, we intentionally don't retry. Really, I think AWS shouldn't return HTTP 400 here, since it might not be a client issue but network congestion etc. I'd propose we special-case this particular failure and make it retry.

Note, we (rather inefficiently) upload artifacts in series rather than in parallel, so there should only be one upload running at a time.
(In reply to Pete Moore [:pmoore][:pete] from comment #2)
> What sucks is that S3 is returning an HTTP 400 response when there is a delay sending data:

For context, the failure message is: "Your socket connection to the server was not read from or written to within the timeout period. Idle connections will be closed."
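To make the proposal above concrete, here is a minimal sketch of special-casing this one failure. It is not the generic-worker code (which is written in Go, and whose fix may simply retry every 400); the names `is_retryable` and `upload_with_retries`, the `put` callback, the retry count and the backoff delays are all hypothetical. The only thing taken from this bug is the failure message text, and matching on that text in the response body is itself an assumption.

```python
import time
from typing import Callable, Tuple

# Failure message quoted in this bug. Matching on it is an assumption of this
# sketch, not something S3 documents as a stable contract.
S3_IDLE_TIMEOUT_TEXT = (
    "Your socket connection to the server was not read from or written to "
    "within the timeout period"
)


def is_retryable(status: int, body: str) -> bool:
    """Decide whether a failed upload response is worth retrying."""
    if 500 <= status < 600:
        # Server-side errors are treated as transient.
        return True
    # Special case: a 400 that is really an idle-socket timeout.
    return status == 400 and S3_IDLE_TIMEOUT_TEXT in body


def upload_with_retries(
    put: Callable[[], Tuple[int, str]],
    max_attempts: int = 5,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[int, str, int]:
    """Call put() until it succeeds, fails permanently, or attempts run out.

    put() is a hypothetical callback that performs one upload and returns
    (status_code, response_body). Returns (status, body, attempts_made).
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        status, body = put()
        if 200 <= status < 300:
            return status, body, attempt
        if not is_retryable(status, body) or attempt == max_attempts:
            # Permanent failure, or nothing left to try.
            return status, body, attempt
        # Exponential backoff: base_delay, 2*base_delay, 4*base_delay, ...
        sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")
```

Any other 400 (for example a genuinely malformed request) still fails on the first attempt, which keeps the "don't repeat without modification" behaviour for real client errors.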
a year ago
Assignee: nobody → pmoore
Created attachment 8902648
Github Pull Request for generic-worker

This should do it.
Attachment #8902648 - Flags: review?(dustin)
Commit pushed to master at https://github.com/taskcluster/generic-worker

https://github.com/taskcluster/generic-worker/commit/e96b1bdd5adb2ac91db5c5a338d67617511a4b0b
Bug 1394557 - retry artifact uploads with HTTP 400 status code response (#63)
Release generic-worker 10.2.1 in progress, should appear here:

* https://github.com/taskcluster/generic-worker/releases/tag/v10.2.1

We'll still need to roll it out in https://github.com/mozilla-releng/OpenCloudConfig
3 failures in 939 pushes (0.003 failures/push) were associated with this bug in the last 7 days.

Repository breakdown:
* mozilla-beta: 3

Platform breakdown:
* windows2012-32: 2
* windows2012-64-devedition: 1

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1394557&startday=2017-08-28&endday=2017-09-03&tree=all
Comment on attachment 8902648
Github Pull Request for generic-worker

Carrying over the approved review from GitHub.
Attachment #8902648 - Flags: review?(dustin) → review+
Looks like the builders were updated to 10.2.2 recently which should include this fix: https://github.com/mozilla-releng/OpenCloudConfig/commit/25a2cc91a604f132aea2f30f35ab18a83df4fc8f
Status: NEW → RESOLVED
Last Resolved: 11 months ago
Resolution: --- → FIXED
Word of warning: test workers (win7/win10) can also be hit by this in test tasks - that should disappear when bug 1399401 lands.
Reopening as this still affects testers, until bug 1399401 lands...
Status: RESOLVED → REOPENED
Depends on: 1399401
Resolution: FIXED → ---
8 months ago
Severity: critical → normal
Deployed to all gecko windows workers in bug 1399401.
Status: REOPENED → RESOLVED
Last Resolved: 11 months ago → 5 months ago
Resolution: --- → FIXED