Bug 1187257
Opened 9 years ago
Closed 8 years ago
Enable sccache on taskcluster builds
Categories
(Taskcluster :: General, defect)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: glandium, Assigned: ted)
Sccache should be enabled on taskcluster builds. The problem is that since taskcluster doesn't separate try and non-try, it's not possible to use IAM roles like we do on linux ec2 instances to give write access to different S3 buckets. The other problem is that the sccache setup in the tree (see build/mozconfig.cache) relies on buildbot master hostnames to choose the right bucket.
Updated•9 years ago
Component: General Automation → General
Product: Release Engineering → Taskcluster
QA Contact: catlee
Comment 2•8 years ago
this will need to be fixed or disabled for the work on windows builds
Updated•8 years ago
Assignee: nobody → garndt
Updated•8 years ago
Assignee: garndt → nobody
Updated•8 years ago
Assignee: nobody → rthijssen
Updated•8 years ago
Assignee: rthijssen → nobody
Comment 3•8 years ago
just an update on the windows front: tc windows builders use an iam profile to give builders access to an iam policy relevant to the repo level. mozconfig.cache was updated some months ago to provide the relevant bucket config for tc win builds. build logs include lines like these, which demonstrate that the config is being applied:

08:16:28 INFO - export SCCACHE_BUCKET=taskcluster-level-1-sccache-eu-central-1
08:16:28 INFO - export SCCACHE_NAMESERVER=169.254.169.253
08:16:28 INFO - MOZ_PREFLIGHT_ALL+=build/sccache.mk
08:16:28 INFO - MOZ_POSTFLIGHT_ALL+=build/sccache.mk
08:16:28 INFO - UPLOAD_EXTRA_FILES+=sccache.log.gz

sccache is not working on windows builds. the reason is not clear, but the us-east-1 and eu-central-1 sccache buckets are empty. the us-west-1 and us-west-2 buckets are not empty, but must have been populated by some other worker type which uses the same buckets.

current iam profiles are:
- arn:aws:iam::692406183521:instance-profile/taskcluster-level-1-sccache
- arn:aws:iam::692406183521:instance-profile/taskcluster-level-3-sccache

current iam policies are:
- arn:aws:iam::692406183521:policy/taskcluster-level-1-sccache
- arn:aws:iam::692406183521:policy/taskcluster-level-3-sccache

in an effort to understand why sccache is not working on tc win builds, today i modified the policies to add get and put ACL permissions. the policies now include:
- s3:DeleteObject
- s3:GetObject
- s3:GetObjectAcl
- s3:PutObject
- s3:PutObjectAcl

late in the build logs we see this message:

09:05:33 INFO - python2.7 z:/task_1478244397/build/src/sccache/sccache.py 2>&1 | gzip > z:/task_1478244397/build/src/obj-firefox/dist/sccache.log.gz

which indicates that the build is providing some debug output from sccache. however, this log does not make its way into tc artifacts. i will experiment on try to see if i can get this log into artifacts and understand why sccache is not being utilised.
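For readers following along, the bucket name in the log lines above appears to combine the repo level and the AWS region. The Python sketch below reproduces those two export lines under that assumption. The naming pattern is inferred from a single log line, and `sccache_bucket` and `sccache_exports` are hypothetical helpers, not code from build/mozconfig.cache.

```python
def sccache_bucket(level, region):
    """Assumed pattern, inferred from the one bucket name in the build log."""
    return "taskcluster-level-%d-sccache-%s" % (level, region)


def sccache_exports(level, region, nameserver="169.254.169.253"):
    """Return export lines shaped like the ones seen in the build log."""
    return [
        "export SCCACHE_BUCKET=%s" % sccache_bucket(level, region),
        "export SCCACHE_NAMESERVER=%s" % nameserver,
    ]


exports = sccache_exports(1, "eu-central-1")
for line in exports:
    print(line)
```

Whether the level-3 buckets follow the same pattern is not confirmed anywhere in this bug.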
Comment 4•8 years ago (Assignee)
Here's a try push I have that spits the sccache stats to the build log:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=1c7b7676fb5c3b5883cc2c4566785376c0aafe98
You can search the logs for "===SCCACHE STATS===".
Comment 5•8 years ago
glandium: is there a good way to get more output from the sccache process during the build? using ted's hack, i get this output in build logs:

12:30:24 INFO - ===SCCACHE STATS===
12:30:24 INFO - bash -c "python2.7 z:/task_1478256962/build/src/sccache/sccache.py 2>&1 | tee >(gzip > z:/task_1478256962/build/src/obj-firefox/dist/sccache.log.gz)"
12:30:24 INFO - bash: cannot make pipe for process substitution: Function not implemented
12:30:26 INFO - sccache: Terminated sccache server
12:30:26 INFO - sccache: Cache hits: 0
12:30:26 INFO - sccache: Cache misses: 3220
12:30:26 INFO - sccache: Failure to cache: 69
12:30:26 INFO - sccache: Non-cachable calls: 5
12:30:26 INFO - sccache: Did not cache: 3197
12:30:26 INFO - sccache: Non-compilation calls: 252
12:30:26 INFO - sccache: Max processes used: 0
12:30:26 INFO - ===================

i'd like to see the exceptions from the failures to cache to understand if we have a permissions problem or if it's something else.
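The stats block above is regular enough to pull out of a build log with a small script. This is a minimal Python sketch, assuming the `sccache: <name>: <number>` line format shown in this comment; `parse_sccache_stats` is a hypothetical helper and not part of sccache or mozharness.

```python
import re

# Matches lines such as "12:30:26 INFO - sccache: Cache misses: 3220".
# Lines without a trailing number (e.g. "Terminated sccache server") are skipped.
STAT_LINE = re.compile(r"sccache: (?P<name>[A-Za-z][A-Za-z \-]*): (?P<value>\d+)\s*$")


def parse_sccache_stats(log_text):
    """Return a dict mapping each stat name to its integer value."""
    stats = {}
    for line in log_text.splitlines():
        match = STAT_LINE.search(line)
        if match:
            stats[match.group("name")] = int(match.group("value"))
    return stats


sample_log = """\
12:30:26 INFO - sccache: Terminated sccache server
12:30:26 INFO - sccache: Cache hits: 0
12:30:26 INFO - sccache: Cache misses: 3220
12:30:26 INFO - sccache: Failure to cache: 69
12:30:26 INFO - sccache: Non-cachable calls: 5
"""

stats = parse_sccache_stats(sample_log)
lookups = stats["Cache hits"] + stats["Cache misses"]
hit_rate = stats["Cache hits"] / float(lookups) if lookups else 0.0
print(stats, hit_rate)
```

A hit rate of zero across several thousand lookups, as here, suggests a systematic problem rather than a cold cache, but the stats alone do not say which.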
Flags: needinfo?(mh+mozilla)
Comment 6•8 years ago
We had issues when trying this out with linux some time back, which I think explains the files you see in s3. The issue was with PutObjectAcl, at least with sccache v1. public-read was being set on the object, which caused the put operation to fail because we, by default, do not give out that permission. Adding it to the IAM profile is a good step to seeing if that gets us further! However, any task that uses temporary s3 credentials from taskcluster (different from IAM profiles) will not get this permission. I think there were some concerns about allowing tasks to set ACLs on objects within TaskCluster. Jonas might recall the discussion/reasoning on that.
Flags: needinfo?(jopsen)
Comment 7•8 years ago (Assignee)
I tweaked sccache2 to stop setting the ACL and did a try push with that build, and it still failed to cache:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=199b4a20609c7a053d1414c484408b7421de8cee

21:02:35 INFO - Cache write errors 3187

I have another patch I can push on top of that to get verbose logging; I'll try that out shortly.
Comment 8•8 years ago
the build worker types included an AWS_CREDENTIAL_FILE env var which i have just removed (https://github.com/mozilla-releng/OpenCloudConfig/commit/78fc2860fc339087718ff99295c7af50e34435d1) as it may have been interfering with the mechanism for using the iam role to authenticate with s3. it will take a few hours for the amis to rebuild and for new workers to spawn without that var set. you can of course delete that env var in try pushes to get the same effect.
Comment 9•8 years ago (Reporter)
> glandium: is there a good way to get more output from the sccache process during the build?
Unfortunately not without modifying sccache itself.
Flags: needinfo?(mh+mozilla)
Comment 10•8 years ago (Assignee)
OK, it took longer than expected because I had rebased my NSS patches (which my sccache2 patches were applied on top of) and I had to fix some things to get those to build again, but I just did a try push which will dump the sccache2 log: https://treeherder.mozilla.org/#/jobs?repo=try&revision=0f8a4ef42e11821273f319d61b4cc8246cf7cba3 This should hopefully help pinpoint why things aren't working.
Comment 11•8 years ago
@garndt: The problem with ACLs is that they allow the uploader to persist permissions, which breaks the concept of tracking authority in taskcluster-auth.

If running in the context of docker-worker with the auth-proxy enabled:

curl http://taskcluster/auth/v1/aws/s3/read-write/<bucket>/<prefix>

should give temporary S3 credentials, assuming task.scopes contains:

auth:aws-s3:read-write:<bucket>/<prefix>

I'm not sure if generic-worker has an auth-proxy concept.
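The relationship between the proxy URL and the scope in the comment above can be sketched in Python. This only builds the two strings and makes no network request. `s3_credentials_url`, `required_scope`, `task_may_request`, and the `example-prefix/` value are hypothetical names for illustration, and the exact-match check is a simplification that ignores Taskcluster's trailing `*` scope wildcards.

```python
def s3_credentials_url(bucket, prefix, level="read-write", proxy="http://taskcluster"):
    """Build the auth-proxy URL in the form quoted in the comment above."""
    return "{}/auth/v1/aws/s3/{}/{}/{}".format(proxy, level, bucket, prefix)


def required_scope(bucket, prefix, level="read-write"):
    """Build the scope the task needs for that request."""
    return "auth:aws-s3:{}:{}/{}".format(level, bucket, prefix)


def task_may_request(task_scopes, bucket, prefix, level="read-write"):
    """Simplified check: exact scope match only, no wildcard expansion."""
    return required_scope(bucket, prefix, level) in task_scopes


bucket = "taskcluster-level-1-sccache-eu-central-1"
prefix = "example-prefix/"  # hypothetical prefix, not taken from this bug

url = s3_credentials_url(bucket, prefix)
scope = required_scope(bucket, prefix)
print(url)
print(scope)
print(task_may_request([scope], bucket, prefix))
```

The point of routing through scopes is that access stays tied to what the task was granted, which an object ACL set by the uploader would bypass.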
Flags: needinfo?(jopsen)
Comment 12•8 years ago (Assignee)
...so I screwed up the "dump the logs" bit of that patch, but the builds did dump stats, and I noted that it only showed 25 cache write errors, which is way less than it was showing before.
Comment 13•8 years ago (Assignee)
OK, so the S3 bucket is working fine for the Windows builds:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=e94a641026eb7ec98a55b902c1786a762a876702

...but:

04:03:58 INFO - Cache hits 0
04:03:58 INFO - Cache misses 3192

I was confused about this for a bit, but then I looked at the full logs and realized:

04:03:59 INFO - [2016-11-09][03:24:49][DEBUG] parse_arguments: `-Fofallible.obj -c -Iz:/task_1478657520/build/src/obj-firefox/dist/stl_wrappers -DNDEBUG=1 -DTRIMMED=1 -Dmozilla_Char16_h -Iz:/task_1478657520/build/src/memory/fallible -Iz:/task_1478657520/build/src/obj-firefox/memory/fallible -Iz:/task_1478657520/build/src/obj-firefox/dist/include -Iz:/task_1478657520/build/src/obj-firefox/dist/include/nspr -Iz:/task_1478657520/build/src/obj-firefox/dist/include/public/nss -MD -FI z:/task_1478657520/build/src/obj-firefox/mozilla-config.h -DMOZILLA_CLIENT -deps.deps/fallible.obj.pp -TP -nologo -wd5026 -wd5027 -Zc:sizedDealloc- -Zc:threadSafeInit- -wd4091 -wd4577 -D_HAS_EXCEPTIONS=0 -W3 -Gy -Zc:inline -FS -Gw -wd4251 -wd4244 -wd4267 -wd4345 -wd4351 -wd4800 -wd4819 -wd4595 -we4553 -GR- -Z7 -O1 -Oi -Oy- -WX -Zl z:/task_1478657520/build/src/memory/fallible/fallible.cpp`

sccache uses the full compiler commandline as an input to the hash that forms the cache key, and the generic worker has the task ID baked into the full path, so we'll never get any sccache cache hits with the current setup on the generic worker.
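To illustrate the effect described above, here is a minimal Python sketch. It is not sccache's actual hashing code: `cache_key`, `normalize_task_dir`, and `compile_args` are hypothetical helpers, and SHA-256 over the joined arguments is an assumption chosen only to show that any hash over the full command line changes when the task directory changes.

```python
import hashlib
import re


def cache_key(args):
    """Hash the full argument list (a stand-in for a compiler cache key)."""
    joined = "\0".join(args).encode("utf-8")
    return hashlib.sha256(joined).hexdigest()


def normalize_task_dir(args):
    """Hypothetical mitigation: replace the per-task directory with a fixed token."""
    return [re.sub(r"task_\d+", "task_X", arg) for arg in args]


def compile_args(task_id):
    """Build a shortened command line with the task ID baked into each path."""
    base = "z:/task_%d/build/src" % task_id
    return [
        "-Fofallible.obj",
        "-c",
        "-I%s/memory/fallible" % base,
        "-FI",
        "%s/obj-firefox/mozilla-config.h" % base,
        "%s/memory/fallible/fallible.cpp" % base,
    ]


first = compile_args(1478657520)
second = compile_args(1478244397)

# Same source file and flags, different task directories: the keys differ.
keys_differ = cache_key(first) != cache_key(second)

# With the task directory normalized, the two builds hash identically.
normalized_match = (
    cache_key(normalize_task_dir(first)) == cache_key(normalize_task_dir(second))
)
print(keys_differ, normalized_match)
```

Normalizing inside the cache is only one possible direction. Building from a path that does not vary per task would have the same effect on the key, and this bug does not record which approach was taken.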
Comment 14•8 years ago
I've outlined a bucket and permission setup here:
https://public.etherpad-mozilla.org/p/taskcluster-scache-setup
Please feel free to insert some comments.

@dustin, I'm hoping you can help with naming the project that the roles should be created under. Or we can choose not to create roles; I guess they are just convenient constructs.
Flags: needinfo?(dustin)
Updated•8 years ago
Flags: needinfo?(dustin)
Updated•8 years ago (Assignee)
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED