Closed Bug 1618111 Opened 5 years ago Closed 5 years ago

sccache write errors on at least some Windows builds on try

Categories

(Cloud Services :: Operations: Taskcluster, defect)

Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED INVALID

People

(Reporter: glandium, Assigned: edunham)

Details

I noticed by chance that sccache write errors were occurring on some, but not all, Windows builds on try. It's plausible that the bucket permissions are not set up properly, or something along those lines.

The sccache logs say:

Cache write error: Error(Msg("failed to put cache entry in s3"), State { next_error: Some(Error(BadHTTPStatus(400), State { next_error: None, backtrace: InternalBacktrace })), backtrace: InternalBacktrace })

See https://treeherder.mozilla.org/perf.html#/graphs?highlightAlerts=1&series=try,2105050,1,2&timerange=1209600

This doesn't happen on autoland.

The few logs I've looked at randomly were all on eu-central-1b or eu-central-1c, using the taskcluster-level-1-sccache-eu-central-1 bucket.

Assignee: nobody → edunham
Component: Operations and Service Requests → Operations: Taskcluster
Product: Taskcluster → Cloud Services

A 400 error strongly suggests that the problem is client-side; I'd expect to see 5xx from a bucket permissions problem. Is it possible, using your tooling, to inspect an individual cache write request that got an error and compare it to one that didn't?
The sccache buckets are all in the mozilla-taskcluster AWS account. I think I accidentally broke my access to it when I switched my 2-factor auth to a new phone, so I'll get that fixed tomorrow and then take a closer look at how taskcluster-level-1-sccache-eu-central-1 might differ from the others.

Is this still a cause for concern? I see the treeherder graph shows errors much less frequently now.

I think edunham is slightly wrong, in that we'd expect to see 403 for a permissions problem. However, this error is 400 not 403, which does suggest an issue with the request the client is making.

Is it possible to get the sccache code to report the actual error code returned in the response from S3? Those are listed at https://docs.aws.amazon.com/AmazonS3/latest/API/ErrorResponses.html#ErrorCodeList , and you can see they let you distinguish between the many possible causes of a 400 response.

(In reply to Brian Pitts from comment #2)
> Is it possible to get the sccache code to report the actual error code returned in the response from S3? Those are listed at https://docs.aws.amazon.com/AmazonS3/latest/API/ErrorResponses.html#ErrorCodeList , and you can see they let you distinguish between the many possible causes of a 400 response.

I think this is possible. It would involve either updating sccache's S3 code:

https://github.com/mozilla/sccache/blob/master/src/simples3/s3.rs

or updating the Rusoto patch (assuming Rusoto has better/richer S3 HTTP error handling than the current code):

https://github.com/mozilla/sccache/pull/522
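For illustration only, here is a rough sketch of what surfacing the S3 error code could look like. It is not sccache's or Rusoto's actual code: `xml_tag_text` and `describe_s3_failure` are hypothetical helper names, and the sketch assumes the caller already has the HTTP status and the response body as a string. It relies only on the fact that S3 error responses are XML documents carrying `<Code>` and `<Message>` elements, and it uses a naive string search rather than a real XML parser.

```rust
/// Returns the text between `<tag>` and `</tag>` in `body`, if both are present.
/// Naive string search (no XML parser); assumes a flat S3 error document.
fn xml_tag_text<'a>(body: &'a str, tag: &str) -> Option<&'a str> {
    let open = format!("<{}>", tag);
    let close = format!("</{}>", tag);
    // Byte offset just past the opening tag.
    let start = body.find(&open)? + open.len();
    // Byte offset of the matching closing tag, searched from `start`.
    let end = start + body[start..].find(&close)?;
    Some(body[start..end].trim())
}

/// Hypothetical helper: builds a log line that includes the S3 error code
/// and message alongside the bare HTTP status.
fn describe_s3_failure(status: u16, body: &str) -> String {
    let code = xml_tag_text(body, "Code").unwrap_or("<unknown>");
    let message = xml_tag_text(body, "Message").unwrap_or("");
    format!("S3 request failed: HTTP {} {}: {}", status, code, message)
}
```

With something along these lines, a failed PUT would log an S3 code such as RequestTimeout or InvalidDigest next to the status, rather than only BadHTTPStatus(400).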

I'm going to close this since I don't think there's anything we can do to help on the Taskcluster operations side. Feel free to reopen if that changes.

Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → INVALID