Closed Bug 1187257 Opened 9 years ago Closed 8 years ago

Enable sccache on taskcluster builds

Categories

(Taskcluster :: General, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: glandium, Assigned: ted)

Sccache should be enabled on taskcluster builds. The problem is that since taskcluster doesn't separate try and non-try, it's not possible to use IAM roles like we do on linux ec2 instances to give write access to different S3 buckets. The other problem is that the sccache setup in the tree (see build/mozconfig.cache) relies on buildbot master hostnames to choose the right bucket.
Component: General Automation → General
Product: Release Engineering → Taskcluster
QA Contact: catlee
this will need to be fixed or disabled for the work on windows builds
Assignee: nobody → garndt
Assignee: garndt → nobody
Depends on: 1269355
Assignee: nobody → rthijssen
Assignee: rthijssen → nobody
Depends on: 1279957
Depends on: 1284492
just an update on the windows front:

tc windows builders use an iam profile to give builders access to an iam policy relevant to the repo level.

mozconfig.cache was updated some months ago to provide the relevant bucket config for tc win builds. build logs include lines like these, which demonstrate that the config is being applied:
08:16:28     INFO -      export SCCACHE_BUCKET=taskcluster-level-1-sccache-eu-central-1
08:16:28     INFO -      export SCCACHE_NAMESERVER=169.254.169.253
08:16:28     INFO -      MOZ_PREFLIGHT_ALL+=build/sccache.mk
08:16:28     INFO -      MOZ_POSTFLIGHT_ALL+=build/sccache.mk
08:16:28     INFO -      UPLOAD_EXTRA_FILES+=sccache.log.gz
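
The bucket name in the log lines above looks like it combines the repo level with the AWS region. A minimal Python sketch of that naming, assuming the pattern `taskcluster-level-<level>-sccache-<region>` holds beyond the single example shown. The function name `sccache_bucket` is hypothetical and this is not the code in build/mozconfig.cache:

```python
def sccache_bucket(level: int, region: str) -> str:
    """Build a bucket name from a repo level and an AWS region.

    The pattern is inferred from one logged value
    (taskcluster-level-1-sccache-eu-central-1); treat it as an assumption.
    """
    if level < 1:
        raise ValueError("repo level must be a positive integer")
    return f"taskcluster-level-{level}-sccache-{region}"


print(sccache_bucket(1, "eu-central-1"))
```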

sccache is not working on windows builds. the reason is not clear but the us-east-1 and eu-central-1 sccache buckets are empty. us-west-1 and us-west-2 buckets are not empty but must have been populated by some other worker type which uses the same buckets.

current iam profiles are:
- arn:aws:iam::692406183521:instance-profile/taskcluster-level-1-sccache
- arn:aws:iam::692406183521:instance-profile/taskcluster-level-3-sccache

current iam policies are:
- arn:aws:iam::692406183521:policy/taskcluster-level-1-sccache
- arn:aws:iam::692406183521:policy/taskcluster-level-3-sccache

in an effort to understand why sccache is not working on tc win builds, today i modified the policies to include get and put ACL permissions which now include:
- s3:DeleteObject
- s3:GetObject
- s3:GetObjectAcl
- s3:PutObject
- s3:PutObjectAcl
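
For reference, the five actions listed above could be expressed as a single IAM policy statement. This is a minimal Python sketch, not the actual taskcluster-level-1-sccache policy: the `Resource` ARN (every object in one bucket) and the function name `sccache_policy` are assumptions, and the bucket name is the one from the build log quoted earlier.

```python
import json

# Actions taken from the list in this comment.
SCCACHE_ACTIONS = [
    "s3:DeleteObject",
    "s3:GetObject",
    "s3:GetObjectAcl",
    "s3:PutObject",
    "s3:PutObjectAcl",
]


def sccache_policy(bucket: str) -> dict:
    """Return an IAM policy document granting the actions on one bucket.

    Scoping the statement to arn:aws:s3:::<bucket>/* is an assumption;
    the real policy may be scoped differently.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": SCCACHE_ACTIONS,
                "Resource": f"arn:aws:s3:::{bucket}/*",
            }
        ],
    }


print(json.dumps(sccache_policy("taskcluster-level-1-sccache-eu-central-1"), indent=2))
```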

late in the build logs we see this message:
09:05:33     INFO -  python2.7 z:/task_1478244397/build/src/sccache/sccache.py 2>&1 | gzip > z:/task_1478244397/build/src/obj-firefox/dist/sccache.log.gz
which indicates that the build is providing some debug output from sccache, however this log does not make its way into tc artifacts. i will experiment on try to see if i can get this log into artifacts and understand why sccache is not being utilised.
Here's a try push I have that spits the sccache stats to the build log:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=1c7b7676fb5c3b5883cc2c4566785376c0aafe98

You can search the logs for "===SCCACHE STATS===".
glandium: is there a good way to get more output from the sccache process during the build? using ted's hack, i get this output in build logs:

12:30:24     INFO -  ===SCCACHE STATS===
12:30:24     INFO -  bash -c "python2.7 z:/task_1478256962/build/src/sccache/sccache.py 2>&1 | tee >(gzip > z:/task_1478256962/build/src/obj-firefox/dist/sccache.log.gz)"
12:30:24     INFO -  bash: cannot make pipe for process substitution: Function not implemented
12:30:26     INFO -  sccache: Terminated sccache server
12:30:26     INFO -  sccache: Cache hits: 0
12:30:26     INFO -  sccache: Cache misses: 3220
12:30:26     INFO -  sccache: Failure to cache: 69
12:30:26     INFO -  sccache: Non-cachable calls: 5
12:30:26     INFO -  sccache: Did not cache: 3197
12:30:26     INFO -  sccache: Non-compilation calls: 252
12:30:26     INFO -  sccache: Max processes used: 0
12:30:26     INFO -  ===================

i'd like to see the exceptions from the failures to cache to understand if we have a permissions problem or if it's something else.
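
The stats lines above can be pulled out of a build log with a short script, which makes it easier to compare try pushes. A minimal Python sketch, assuming the `sccache: <name>: <number>` line format shown in this comment; the names `parse_stats` and `hit_rate` are hypothetical and not part of sccache:

```python
import re

# Matches e.g. "12:30:26     INFO -  sccache: Cache hits: 0".
# Lines without a trailing number ("Terminated sccache server") are skipped.
STAT_LINE = re.compile(r"sccache: (?P<name>[A-Za-z][A-Za-z -]*?): (?P<value>\d+)\s*$")


def parse_stats(log_text: str) -> dict:
    """Collect the numeric sccache stats from build-log text."""
    stats = {}
    for line in log_text.splitlines():
        match = STAT_LINE.search(line)
        if match:
            stats[match.group("name")] = int(match.group("value"))
    return stats


def hit_rate(stats: dict) -> float:
    """Return hits / (hits + misses), or 0.0 when nothing was looked up."""
    hits = stats.get("Cache hits", 0)
    misses = stats.get("Cache misses", 0)
    total = hits + misses
    return hits / total if total else 0.0


sample = """\
12:30:26     INFO -  sccache: Terminated sccache server
12:30:26     INFO -  sccache: Cache hits: 0
12:30:26     INFO -  sccache: Cache misses: 3220
12:30:26     INFO -  sccache: Failure to cache: 69
"""
stats = parse_stats(sample)
print(stats, hit_rate(stats))
```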
Flags: needinfo?(mh+mozilla)
We had issues when trying this out with linux some time back, which are the files I think you see in s3. The issue was with PutObjectAcl, at least with sccache v1.  public-read was being set on the object which caused the put operation to fail because we, by default, do not give out that permission.  Adding it to the IAM profile is a good step to seeing if that gets us further!  However, for any task that uses temporary s3 credentials from taskcluster (different from IAM profiles), they will not get this permission.

I think there were some concerns about allowing tasks to set ACLs on objects within TaskCluster.  Jonas might recall the discussion/reasoning on that.
Flags: needinfo?(jopsen)
I tweaked sccache2 to stop setting the ACL and did a try push with that build and it still failed to cache:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=199b4a20609c7a053d1414c484408b7421de8cee
21:02:35     INFO -  Cache write errors 3187

I have another patch I can push on top of that to get verbose logging, I'll try that out shortly.
the build worker types included an AWS_CREDENTIAL_FILE env var which i have just removed (https://github.com/mozilla-releng/OpenCloudConfig/commit/78fc2860fc339087718ff99295c7af50e34435d1) as it may have been interfering with the mechanism for using the iam role to authenticate with s3. it will take a few hours for the amis to rebuild and for new workers to spawn without that var set. you can of course delete that env var in try pushes to get the same effect.
> glandium: is there a good way to get more output from the sccache process during the build?

Unfortunately not without modifying sccache itself.
Flags: needinfo?(mh+mozilla)
OK, it took longer than expected because I had rebased my NSS patches (which my sccache2 patches were applied on top of) and I had to fix some things to get those to build again, but I just did a try push which will dump the sccache2 log:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=0f8a4ef42e11821273f319d61b4cc8246cf7cba3

This should hopefully help pinpoint why things aren't working.
@garndt:
  Problem with ACL is that it allows uploader to persist permissions.
  Which breaks the concept of tracking authority in taskcluster-auth.

If running in the context of docker-worker with the auth-proxy enabled:
  curl http://taskcluster/auth/v1/aws/s3/read-write/<bucket>/<prefix>
Should give temporary S3 credentials, assuming task.scopes contains:
  auth:aws-s3:read-write:<bucket>/<prefix>

I'm not sure if generic-worker has an auth-proxy concept.
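
The URL and scope shapes above can be sketched as two small helpers. This is a minimal Python sketch that only builds the strings shown in this comment and makes no request. The function names are hypothetical, the prefix `cache/` is a made-up example, and no URL-encoding of the prefix is attempted here, which a real client may need:

```python
def s3_credentials_url(bucket: str, prefix: str, proxy: str = "http://taskcluster") -> str:
    """URL to request temporary read-write S3 credentials via the auth-proxy."""
    return f"{proxy}/auth/v1/aws/s3/read-write/{bucket}/{prefix}"


def required_scope(bucket: str, prefix: str) -> str:
    """Scope that task.scopes must contain for the request above to succeed."""
    return f"auth:aws-s3:read-write:{bucket}/{prefix}"


bucket = "taskcluster-level-1-sccache-eu-central-1"  # from the build log earlier in this bug
prefix = "cache/"  # hypothetical prefix, for illustration only
print(s3_credentials_url(bucket, prefix))
print(required_scope(bucket, prefix))
```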
Flags: needinfo?(jopsen)
...so I screwed up the "dump the logs" bit of that patch, but the builds did dump stats, and I noted that it only showed 25 cache write errors, which is way less than it was showing before.
OK, so the S3 bucket is working fine for the Windows builds:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=e94a641026eb7ec98a55b902c1786a762a876702

...but:
04:03:58     INFO -  Cache hits                                                                                                                                                                              0
04:03:58     INFO -  Cache misses                                                                                                                                                                         3192

I was confused about this for a bit, but then I looked at the full logs and realized:
04:03:59     INFO -  [2016-11-09][03:24:49][DEBUG] parse_arguments: `-Fofallible.obj -c -Iz:/task_1478657520/build/src/obj-firefox/dist/stl_wrappers -DNDEBUG=1 -DTRIMMED=1 -Dmozilla_Char16_h -Iz:/task_1478657520/build/src/memory/fallible -Iz:/task_1478657520/build/src/obj-firefox/memory/fallible -Iz:/task_1478657520/build/src/obj-firefox/dist/include -Iz:/task_1478657520/build/src/obj-firefox/dist/include/nspr -Iz:/task_1478657520/build/src/obj-firefox/dist/include/public/nss -MD -FI z:/task_1478657520/build/src/obj-firefox/mozilla-config.h -DMOZILLA_CLIENT -deps.deps/fallible.obj.pp -TP -nologo -wd5026 -wd5027 -Zc:sizedDealloc- -Zc:threadSafeInit- -wd4091 -wd4577 -D_HAS_EXCEPTIONS=0 -W3 -Gy -Zc:inline -FS -Gw -wd4251 -wd4244 -wd4267 -wd4345 -wd4351 -wd4800 -wd4819 -wd4595 -we4553 -GR- -Z7 -O1 -Oi -Oy- -WX -Zl z:/task_1478657520/build/src/memory/fallible/fallible.cpp`

sccache uses the full compiler commandline as an input to the hash that forms the cache key, and the generic worker has the task ID baked into the full path, so we'll never get any sccache cache hits with the current setup on the generic worker.
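
The effect is easy to reproduce outside sccache. The Python sketch below hashes two otherwise identical command lines that differ only in the task directory and shows that the keys differ, then shows that they match once the task ID is stripped from the paths. It is an illustration under assumptions: SHA-256 over the joined arguments stands in for sccache's real key (which has other inputs as well), and `cache_key` and `normalize` are hypothetical helpers, not the fix that was eventually applied.

```python
import hashlib
import re

# Per-task checkout directory seen in the generic worker logs, e.g. z:/task_1478657520/
TASK_DIR = re.compile(r"z:/task_\d+/")


def cache_key(args: list) -> str:
    """Hash a compiler command line (stand-in for the real cache key)."""
    digest = hashlib.sha256()
    for arg in args:
        digest.update(arg.encode("utf-8"))
        digest.update(b"\0")  # separator so ["ab", "c"] != ["a", "bc"]
    return digest.hexdigest()


def normalize(args: list) -> list:
    """Replace the per-task directory with a fixed placeholder."""
    return [TASK_DIR.sub("z:/task/", arg) for arg in args]


def command_for(task_id: str) -> list:
    """A shortened version of the command line logged in this comment."""
    root = f"z:/task_{task_id}/build/src"
    return ["-Fofallible.obj", "-c", f"-I{root}/memory/fallible",
            f"{root}/memory/fallible/fallible.cpp"]


first = command_for("1478657520")
second = command_for("1478244397")
print(cache_key(first) == cache_key(second))                        # False: task ID is in the key
print(cache_key(normalize(first)) == cache_key(normalize(second)))  # True once paths are stable
```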
Depends on: 1316329
I've outlined a bucket and permission setup here:
  https://public.etherpad-mozilla.org/p/taskcluster-scache-setup

Please feel free to insert some comments..

@dustin, I'm hoping you can help naming the project that the roles should be created under.
         Or we can choose not to create roles, I guess they are just convenient constructs.
Flags: needinfo?(dustin)
See Also: → 1316410
Flags: needinfo?(dustin)
Fixed!
Assignee: nobody → ted
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED