Pervasive sccache cache write errors

RESOLVED FIXED in Firefox 55

Status

Core
Build Config
RESOLVED FIXED
7 months ago
7 months ago

People

(Reporter: chmanchester, Assigned: ted)

Tracking

unspecified
mozilla55
Points:
---

Firefox Tracking Flags

(firefox55 fixed)

Details

Attachments

(1 attachment)

(Reporter)

Description

7 months ago
I noticed a pretty sizeable build time regression on autoland, inbound, and try starting around Tuesday afternoon: https://treeherder.mozilla.org/perf.html#/graphs?timerange=604800&series=%5Bmozilla-central,4044b74c437dfc672f4615a746ea01f6e4c0312d,1,2%5D&series=%5Bautoland,077c454bbb47966e9661e9b00ba7100f14bbd6c9,1,2%5D&series=%5Bmozilla-inbound,4044b74c437dfc672f4615a746ea01f6e4c0312d,1,2%5D&series=%5Bautoland,4044b74c437dfc672f4615a746ea01f6e4c0312d,1,2%5D&series=%5Bmozilla-central,077c454bbb47966e9661e9b00ba7100f14bbd6c9,1,2%5D&series=%5Bmozilla-inbound,077c454bbb47966e9661e9b00ba7100f14bbd6c9,1,2%5D

It starts with seemingly innocuous changesets, and pushing a revision before the regression range to try doesn't improve the build time. Looking at the sccache stats, we stopped getting cache hits and started seeing pervasive cache write errors around this time.
(Assignee)

Comment 1

7 months ago
Nothing around sccache in tree has changed for quite a while, so I don't think this is a build config issue. I think it's got to be a network issue or something like that. Maybe our network flows to S3 got screwed up, or some taskcluster-worker change landed that changed things?

If I add Windows buildbot+taskcluster builds to that graph I don't see the same regression with them:
https://treeherder.mozilla.org/perf.html#/graphs?timerange=604800&series=%5Bmozilla-inbound,4044b74c437dfc672f4615a746ea01f6e4c0312d,1,2%5D&series=%5Bautoland,077c454bbb47966e9661e9b00ba7100f14bbd6c9,1,2%5D&series=%5Bmozilla-central,4044b74c437dfc672f4615a746ea01f6e4c0312d,1,2%5D&series=%5Bautoland,4044b74c437dfc672f4615a746ea01f6e4c0312d,1,2%5D&series=%5Bmozilla-central,077c454bbb47966e9661e9b00ba7100f14bbd6c9,1,2%5D&series=%5Bmozilla-inbound,077c454bbb47966e9661e9b00ba7100f14bbd6c9,1,2%5D&series=%5Bmozilla-inbound,be7331d4f74b6970f67fe9da80f4d4d90ef60b73,1,2%5D&series=%5Bmozilla-inbound,be465112346255e89dd26003061b01f27cc7fd39,1,2%5D
(Assignee)

Comment 2

7 months ago
There's a giant pile of perfherder alerts from this:
https://treeherder.mozilla.org/perf.html#/alerts?id=5859
https://treeherder.mozilla.org/perf.html#/alerts?id=5613
https://treeherder.mozilla.org/perf.html#/alerts?id=5615
https://treeherder.mozilla.org/perf.html#/alerts?id=5616
https://treeherder.mozilla.org/perf.html#/alerts?id=5614
https://treeherder.mozilla.org/perf.html#/alerts?id=5597
https://treeherder.mozilla.org/perf.html#/alerts?id=5611
https://treeherder.mozilla.org/perf.html#/alerts?id=5593
https://treeherder.mozilla.org/perf.html#/alerts?id=5606

That may not be the full set. We should get these alerts to send email to dev-builds so we ensure they get noticed.
(Assignee)

Comment 3

7 months ago
I looked at a log from a try push I did and the answer was actually staring me right in the face:
https://public-artifacts.taskcluster.net/OaqLMJqDR3qXW_Xta1fu2w/0/public/logs/live_backing.log
```
[task 2017-04-06T10:27:47.426667Z] 
[task 2017-04-06T10:27:47.426679Z] if [[ -n ${USE_SCCACHE} ]]; then
[task 2017-04-06T10:27:47.426698Z]     # Point sccache at the Taskcluster proxy for AWS credentials.
[task 2017-04-06T10:27:47.426739Z]     export AWS_IAM_CREDENTIALS_URL="http://taskcluster/auth/v1/aws/s3/read-write/taskcluster-level-${MOZ_SCM_LEVEL}-sccache-${TASKCLUSTER_WORKER_GROUP%?}/?format=iam-role-compat"
[task 2017-04-06T10:27:47.426755Z] fi
[task 2017-04-06T10:27:47.426764Z] + [[ -n 1 ]]
[task 2017-04-06T10:27:47.426810Z] + export 'AWS_IAM_CREDENTIALS_URL=http://taskcluster/auth/v1/aws/s3/read-write/taskcluster-level-1-sccache-us-east-/?format=iam-role-compat'
[task 2017-04-06T10:27:47.426867Z] + AWS_IAM_CREDENTIALS_URL='http://taskcluster/auth/v1/aws/s3/read-write/taskcluster-level-1-sccache-us-east-/?format=iam-role-compat'
```

The bucket name ends with `us-east-`, which isn't right. The `${TASKCLUSTER_WORKER_GROUP%?}` bit removes the last character from that environment variable, because the value used to have a trailing letter. That is no longer the case, per the top of that log file:
[taskcluster 2017-04-06 10:24:01.021Z] Worker Group: us-east-1
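
To illustrate the failure mode described above, here is a minimal sketch of how the `%?` suffix removal behaves on both formats. The old-style value `us-east-1a` is an assumed example (the log only shows the new value, `us-east-1`), and `strip_last_char` is a hypothetical helper, not something in the tree.

```shell
#!/usr/bin/env bash
# Hypothetical helper: applies the same ${var%?} expansion the task script uses,
# which unconditionally drops the final character of its argument.
strip_last_char() {
    printf '%s\n' "${1%?}"
}

old_style="us-east-1a"   # assumed example of the old format (region + zone letter)
new_style="us-east-1"    # value shown at the top of the log

strip_last_char "$old_style"   # prints: us-east-1
strip_last_char "$new_style"   # prints: us-east-
```

With the new format the expansion eats the region digit, which matches the `sccache-us-east-` bucket name in the log.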

This was broken by this merge:
https://github.com/taskcluster/docker-worker/commit/000a5191f566a13c27206f597db78431c64104f8

Specifically this commit, which removed the availability zone letter from the workerGroup, which is what sets the TASKCLUSTER_WORKER_GROUP environment variable:
https://github.com/taskcluster/docker-worker/commit/985e123ebc8e06e6f6fa89c5b3872633193a9c9d

On the plus side, this is easy to fix!
Assignee: nobody → ted
(Assignee)

Comment 4

7 months ago
garndt said he's going to back that change out and deploy new images.
Assignee: ted → garndt
Component: Build Config → Docker-Worker
Product: Core → Taskcluster
(Assignee)

Updated

7 months ago
Duplicate of this bug: 1354061
(Assignee)

Comment 6

7 months ago
...never mind, we'll fix this in the build system.
Assignee: garndt → ted
Component: Docker-Worker → Build Config
Product: Taskcluster → Core
Comment hidden (mozreview-request)

Comment 8

7 months ago
mozreview-review
Comment on attachment 8855355 [details]
bug 1350093 - fix sccache configuration to handle changes in the format of TASKCLUSTER_WORKER_GROUP.

https://reviewboard.mozilla.org/r/127200/#review129958

I think this is fine, though I don't think it'd be that burdensome to rewrite the logic and dispense with `availability_zone`, since it's only used in this script, and only to figure out the region, which we already have, right?
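
As a rough sketch of what dispensing with `availability_zone` could look like, assuming `TASKCLUSTER_WORKER_GROUP` now holds just the region: the URL shape is copied from the task log in comment 3, while `sccache_credentials_url` is a hypothetical helper name and this is not the patch under review.

```shell
#!/usr/bin/env bash
# Hypothetical helper: builds the credentials URL from the SCM level and the
# region directly, with no trailing-character stripping.
# Assumption: the second argument is already a bare region such as "us-east-1".
sccache_credentials_url() {
    local level="$1" region="$2"
    printf 'http://taskcluster/auth/v1/aws/s3/read-write/taskcluster-level-%s-sccache-%s/?format=iam-role-compat\n' \
        "$level" "$region"
}

# Example with the values seen in the log (level 1, worker group us-east-1).
sccache_credentials_url 1 "us-east-1"
```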

::: build/mozconfig.cache:54
(Diff revision 1)
> +            # here simpler.x
> +            availability_zone="${TASKCLUSTER_WORKER_GROUP}x"

You have an extra character in your comment as well. :)

Maybe the comment should read something like:

"TASKCLUSTER_WORKER_GROUP used to be the region plus the availability zone, but it has since been changed to be only the region. In order to avoid changing all the logic below that depends on the formatting of availability_zone, we simply tack on a character to TASKCLUSTER_WORKER_GROUP to make it mimic the previous semantics."

Or am I overthinking this because all this is new to me?  Not sure!
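
For what it's worth, the padding trick in the quoted diff can be checked in isolation. This is only a sketch: the two function names are hypothetical, and it assumes the downstream logic recovers the region by dropping the last character, as the `%?` expansion in the task log does.

```shell
#!/usr/bin/env bash
# Hypothetical helper: mimic the old "region + zone letter" format by
# appending a placeholder character, as the patch does with "x".
legacy_availability_zone() {
    printf '%s\n' "${1}x"
}

# Hypothetical helper: what logic written for the old format would do
# to get back to the region (drop the final character).
region_from_availability_zone() {
    printf '%s\n' "${1%?}"
}

az="$(legacy_availability_zone "us-east-1")"
echo "$az"                                    # prints: us-east-1x
region_from_availability_zone "$az"           # prints: us-east-1
```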

Since `TASKCLUSTER_WORKER_GROUP` has these semantics now, would it be reasonable to match it against the known regions we use (us-{east,west}-{1,2}?) so we fail faster if the syntax of this variable changes again?
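
Something along these lines is what I have in mind, as a sketch only. The region list is my guess from above and may not match what we actually deploy to, and `validate_worker_group` is a hypothetical name.

```shell
#!/usr/bin/env bash
# Hypothetical check: accept only the regions guessed at above and reject
# anything else (including the old "region + zone letter" format) early.
validate_worker_group() {
    case "$1" in
        us-east-1|us-east-2|us-west-1|us-west-2)
            return 0
            ;;
        *)
            echo "unexpected TASKCLUSTER_WORKER_GROUP: '$1'" >&2
            return 1
            ;;
    esac
}

# Demo: a good value passes, a truncated one is rejected.
if validate_worker_group "us-east-1"; then
    echo "us-east-1 accepted"
fi
if ! validate_worker_group "us-east-" 2>/dev/null; then
    echo "us-east- rejected"
fi
```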
Attachment #8855355 - Flags: review?(nfroyd) → review+
(Assignee)

Comment 9

7 months ago
I would like to get rid of most of this file at some point, it has always been overly-complicated. For this patch I just wanted to make the smallest changes I could to get things working again.

Ideally we'd find a better place to set these values, but I haven't worked that out yet.

Comment 10

(In reply to Ted Mielczarek [:ted.mielczarek] from comment #9)
> I would like to get rid of most of this file at some point, it has always
> been overly-complicated. For this patch I just wanted to make the smallest
> changes I could to get things working again.

This works for me.
(Assignee)

Comment 11

7 months ago
https://hg.mozilla.org/integration/mozilla-inbound/rev/cdb69724b904d4fdd66e3ccf66f0b8c162511479
bug 1350093 - fix sccache configuration to handle changes in the format of TASKCLUSTER_WORKER_GROUP. r=froydnj

Comment 12

7 months ago
bugherder
https://hg.mozilla.org/mozilla-central/rev/cdb69724b904
Status: NEW → RESOLVED
Last Resolved: 7 months ago
status-firefox55: --- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla55