Pervasive sccache cache write errors

RESOLVED FIXED in Firefox 55

Status

defect
RESOLVED FIXED

People

(Reporter: chmanchester, Assigned: ted)

Tracking

unspecified
mozilla55

Firefox Tracking Flags

(firefox55 fixed)

I noticed a pretty sizeable build time regression on autoland, inbound, and try starting around Tuesday afternoon: https://treeherder.mozilla.org/perf.html#/graphs?timerange=604800&series=%5Bmozilla-central,4044b74c437dfc672f4615a746ea01f6e4c0312d,1,2%5D&series=%5Bautoland,077c454bbb47966e9661e9b00ba7100f14bbd6c9,1,2%5D&series=%5Bmozilla-inbound,4044b74c437dfc672f4615a746ea01f6e4c0312d,1,2%5D&series=%5Bautoland,4044b74c437dfc672f4615a746ea01f6e4c0312d,1,2%5D&series=%5Bmozilla-central,077c454bbb47966e9661e9b00ba7100f14bbd6c9,1,2%5D&series=%5Bmozilla-inbound,077c454bbb47966e9661e9b00ba7100f14bbd6c9,1,2%5D

It starts with seemingly innocuous changesets, and pushing a revision before the regression range to try doesn't improve the build time. Looking at the sccache stats, we stopped getting cache hits and started seeing pervasive cache write errors around this time.
I looked at a log from a try push I did and the answer was actually staring me right in the face:
https://public-artifacts.taskcluster.net/OaqLMJqDR3qXW_Xta1fu2w/0/public/logs/live_backing.log
```
[task 2017-04-06T10:27:47.426667Z] 
[task 2017-04-06T10:27:47.426679Z] if [[ -n ${USE_SCCACHE} ]]; then
[task 2017-04-06T10:27:47.426698Z]     # Point sccache at the Taskcluster proxy for AWS credentials.
[task 2017-04-06T10:27:47.426739Z]     export AWS_IAM_CREDENTIALS_URL="http://taskcluster/auth/v1/aws/s3/read-write/taskcluster-level-${MOZ_SCM_LEVEL}-sccache-${TASKCLUSTER_WORKER_GROUP%?}/?format=iam-role-compat"
[task 2017-04-06T10:27:47.426755Z] fi
[task 2017-04-06T10:27:47.426764Z] + [[ -n 1 ]]
[task 2017-04-06T10:27:47.426810Z] + export 'AWS_IAM_CREDENTIALS_URL=http://taskcluster/auth/v1/aws/s3/read-write/taskcluster-level-1-sccache-us-east-/?format=iam-role-compat'
[task 2017-04-06T10:27:47.426867Z] + AWS_IAM_CREDENTIALS_URL='http://taskcluster/auth/v1/aws/s3/read-write/taskcluster-level-1-sccache-us-east-/?format=iam-role-compat'
```

The bucket name ends with `us-east-`, which isn't right. The `${TASKCLUSTER_WORKER_GROUP%?}` expansion removes the last character from that environment variable. That was correct while the value had a trailing availability zone letter, but that is no longer the case, per the top of that log file:
[taskcluster 2017-04-06 10:24:01.021Z] Worker Group: us-east-1
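
To make the failure concrete, here is a minimal shell sketch of what the `%?` expansion does to the two formats. The value `us-east-1a` is only an assumed example of the older region-plus-zone format (the log shows just the new value, `us-east-1`), and `strip_last_char` is a hypothetical helper, not something from the build scripts.

```shell
# Hypothetical helper: mirrors the ${VAR%?} expansion, which removes the
# shortest suffix matching "?" (that is, exactly one trailing character).
strip_last_char() {
    printf '%s\n' "${1%?}"
}

# Assumed old format: region plus availability zone letter.
strip_last_char "us-east-1a"   # prints us-east-1

# New format, as reported at the top of the log.
strip_last_char "us-east-1"    # prints us-east-
```

With the new format, the expansion eats the final digit of the region instead of a zone letter, which is how the truncated bucket name ends up in `AWS_IAM_CREDENTIALS_URL`.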

This was broken by this merge:
https://github.com/taskcluster/docker-worker/commit/000a5191f566a13c27206f597db78431c64104f8

Specifically, this commit removed the availability zone letter from the workerGroup, which is what sets the `TASKCLUSTER_WORKER_GROUP` environment variable:
https://github.com/taskcluster/docker-worker/commit/985e123ebc8e06e6f6fa89c5b3872633193a9c9d

On the plus side, this is easy to fix!
Assignee: nobody → ted
garndt said he's going to back that change out and deploy new images.
Assignee: ted → garndt
Component: Build Config → Docker-Worker
Product: Core → Taskcluster
Duplicate of this bug: 1354061
Never mind, we'll fix this in the build system instead.
Assignee: garndt → ted
Component: Docker-Worker → Build Config
Product: Taskcluster → Core
Comment on attachment 8855355 [details]
bug 1350093 - fix sccache configuration to handle changes in the format of TASKCLUSTER_WORKER_GROUP.

https://reviewboard.mozilla.org/r/127200/#review129958

I think this is fine, though I don't think it'd be that burdensome to rewrite the logic and dispense with `availability_zone`, since it's only used in this script, and only to figure out the region, which we already have, right?

::: build/mozconfig.cache:54
(Diff revision 1)
> +            # here simpler.x
> +            availability_zone="${TASKCLUSTER_WORKER_GROUP}x"

You have an extra character in your comment as well. :)

Maybe the comment should read something like:

"TASKCLUSTER_WORKER_GROUP used to be the region plus the availability zone, but it has since been changed to be only the region.  In order to avoid changing all the logic below that depends on the formatting of availability_zone, we simply tack on a character to TASKCLUSTER_WORKER_GROUP to make it mimic the previous semantics."

Or am I overthinking this because all this is new to me?  Not sure!
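
For readers following along, the padding approach in the quoted diff can be sketched roughly as below. This is an illustration under assumptions, not the contents of `build/mozconfig.cache`: the function name `sccache_bucket` is hypothetical, and the bucket name pattern is inferred from the `AWS_IAM_CREDENTIALS_URL` in the log.

```shell
# Hypothetical sketch: derive the bucket name from the SCM level and the
# worker group, assuming the worker group now holds only the region.
sccache_bucket() {
    level="$1"
    worker_group="$2"

    # Tack on a placeholder character so the value looks like the old
    # "region plus zone letter" format that later logic expects.
    availability_zone="${worker_group}x"

    # Existing-style logic: drop the trailing character to get the region.
    region="${availability_zone%?}"

    printf 'taskcluster-level-%s-sccache-%s\n' "$level" "$region"
}

sccache_bucket 1 us-east-1   # prints taskcluster-level-1-sccache-us-east-1
```

The padding keeps the downstream stripping logic untouched, at the cost of a variable whose name no longer describes its contents.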

Since `TASKCLUSTER_WORKER_GROUP` has these semantics now, would it be reasonable to match it against the known regions we use--us-{east,west}-{1,2}?--so we fail faster if we change the syntax of this variable next time?
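
A fail-fast check along those lines might look like the following sketch. The region list is an assumption taken from the us-{east,west}-{1,2} guess above and would need confirming against the regions actually in use; `check_worker_group` is a hypothetical name.

```shell
# Hypothetical validation: accept only the regions we expect, so a future
# format change to the variable fails loudly instead of silently producing
# a bad bucket name.
check_worker_group() {
    case "$1" in
        us-east-1|us-east-2|us-west-1|us-west-2)
            return 0
            ;;
        *)
            echo "Unexpected TASKCLUSTER_WORKER_GROUP: '$1'" >&2
            return 1
            ;;
    esac
}

# Example use: report a problem without aborting this demonstration.
check_worker_group "us-east-1" || echo "worker group check failed" >&2
```

A check like this would reject a value in the older zone-suffixed style as well as an empty value, so any future change to the variable's format surfaces at configuration time rather than as cache write errors.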
Attachment #8855355 - Flags: review?(nfroyd) → review+
I would like to get rid of most of this file at some point; it has always been overly complicated. For this patch I just wanted to make the smallest changes I could to get things working again.

Ideally we'd find a better place to set these values, but I haven't worked that out yet.
(In reply to Ted Mielczarek [:ted.mielczarek] from comment #9)
> I would like to get rid of most of this file at some point, it has always
> been overly-complicated. For this patch I just wanted to make the smallest
> changes I could to get things working again.

This works for me.
https://hg.mozilla.org/integration/mozilla-inbound/rev/cdb69724b904d4fdd66e3ccf66f0b8c162511479
bug 1350093 - fix sccache configuration to handle changes in the format of TASKCLUSTER_WORKER_GROUP. r=froydnj
https://hg.mozilla.org/mozilla-central/rev/cdb69724b904
Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla55
Product: Core → Firefox Build System