Intermittent "400 Bad Request" errors when uploading to the TC queue causing Windows job failures

RESOLVED FIXED

a year ago
17 days ago

People

(Reporter: RyanVM, Assigned: pmoore)


Description

a year ago
This just burned a Windows Beta job and will delay go-to-build as a result, so I'm setting the severity to Critical here, especially because it means we have to do the |taskcluster task rerun| dance (which few people have the ability to do) and risk hitting bug 1381768 on the retrigger.

These failures show up in Treeherder as infra exceptions (purple) and the TH log parser isn't able to see anything useful to highlight, so I fully expect this bug to get little in the way of starring activity, but a cursory glance at TH suggests it's happening on a daily basis at least.

Examples:
https://queue.taskcluster.net/v1/task/S7va0KV_QW2YHzcDy7Ejtw/runs/0/artifacts/public/logs/live_backing.log
https://queue.taskcluster.net/v1/task/PPUAp2JZSuKILy-wiK_fsQ/runs/0/artifacts/public/logs/live_backing.log

Is there something we can do to be more tolerant of these failures if we can't make them go away outright?
It looks like these are timeouts talking to S3.

Pete, can you have a look?  I wonder if we're just doing too many parallel uploads and thus slowing down too much?
Flags: needinfo?(pmoore)
So I think we can improve the situation here. What sucks is that S3 is returning an HTTP 400 response when there is a delay sending data:

> Error uploading artifact: (Permanent) HTTP response code 400

And, as https://developer.mozilla.org/en-US/docs/Web/HTTP/Status/400 says:

> The HTTP 400 Bad Request response status code indicates that the server could not understand the
> request due to invalid syntax. The client should not repeat this request without modification.

So since we get a 400 response, we intentionally don't retry. Really, I think AWS shouldn't respond with HTTP 400 here, since the cause might not be a client issue but network congestion, etc.

I'd propose we special-case this particular failure and make it retry.

Note that we (rather inefficiently) upload artifacts in series rather than in parallel, so there should only be one upload running at a time.
Flags: needinfo?(pmoore)
(In reply to Pete Moore [:pmoore][:pete] from comment #2)
> What sucks is that S3 is returning an HTTP 400 response when there is a delay sending data:

For context, the failure message is:

  "Your socket connection to the server was not read from or written to within the timeout period. Idle connections will be closed."
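
To illustrate the special case proposed above, here is a minimal Go sketch. It is not the code from the attached pull request. It assumes the worker has the HTTP status code and the response body of the failed upload available, and that S3 reports this condition with the error code RequestTimeout in its XML error document. The names s3Error and isRetryable are hypothetical.

```go
package main

import (
	"encoding/xml"
	"fmt"
	"net/http"
)

// s3Error holds the two fields we care about from the XML error
// document that S3 sends in the body of a failed request.
// (Hypothetical type, for illustration only.)
type s3Error struct {
	Code    string `xml:"Code"`
	Message string `xml:"Message"`
}

// isRetryable reports whether a failed artifact upload is worth retrying.
// 5xx responses are treated as transient. A 400 is normally permanent,
// except when S3 says the socket timed out, which may be caused by
// network congestion rather than by a malformed request.
func isRetryable(statusCode int, body []byte) bool {
	if statusCode >= 500 {
		return true
	}
	if statusCode != http.StatusBadRequest {
		return false
	}
	var e s3Error
	if err := xml.Unmarshal(body, &e); err != nil {
		// Unparseable body: keep the existing "permanent" behaviour.
		return false
	}
	return e.Code == "RequestTimeout"
}

func main() {
	timeout := []byte(`<Error><Code>RequestTimeout</Code>` +
		`<Message>Your socket connection to the server was not read from ` +
		`or written to within the timeout period. Idle connections will be ` +
		`closed.</Message></Error>`)
	other := []byte(`<Error><Code>InvalidArgument</Code></Error>`)

	fmt.Println("timeout 400 retryable:", isRetryable(400, timeout))
	fmt.Println("other 400 retryable:", isRetryable(400, other))
}
```

Matching on the error code rather than on the message text keeps the check narrow, so every other 400 is still treated as permanent.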
Assignee: nobody → pmoore
Created attachment 8902648 [details] [review]
Github Pull Request for generic-worker

This should do it.
Attachment #8902648 - Flags: review?(dustin)
The release of generic-worker 10.2.1 is in progress and should appear here:

* https://github.com/taskcluster/generic-worker/releases/tag/v10.2.1

We'll still need to roll it out in https://github.com/mozilla-releng/OpenCloudConfig
3 failures in 939 pushes (0.003 failures/push) were associated with this bug in the last 7 days.   

Repository breakdown:
* mozilla-beta: 3

Platform breakdown:
* windows2012-32: 2
* windows2012-64-devedition: 1

For more details, see:
https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1394557&startday=2017-08-28&endday=2017-09-03&tree=all

Comment 8

11 months ago
Comment on attachment 8902648 [details] [review]
Github Pull Request for generic-worker

Carrying over the approved review from GitHub.
Attachment #8902648 - Flags: review?(dustin) → review+

Comment 9

11 months ago
Looks like the builders were recently updated to 10.2.2, which should include this fix:
https://github.com/mozilla-releng/OpenCloudConfig/commit/25a2cc91a604f132aea2f30f35ab18a83df4fc8f
Status: NEW → RESOLVED
Last Resolved: 11 months ago
Resolution: --- → FIXED
Word of warning: test workers (win7/win10) can also hit this in test tasks. That should disappear when bug 1399401 lands.
Reopening as this still affects testers, until bug 1399401 lands...
Status: RESOLVED → REOPENED
Depends on: 1399401
Resolution: FIXED → ---
Severity: critical → normal
Duplicate of this bug: 1450933
Deployed to all gecko windows workers in bug 1399401.
Status: REOPENED → RESOLVED
Last Resolved: 11 months ago → 5 months ago
Resolution: --- → FIXED