[sccache] Migrate sccache to new deployment
Categories: Taskcluster :: Operations and Service Requests (task)
Tracking: Not tracked
People: Reporter: dustin, Assigned: dustin
Keywords: leave-open
Attachments: 2 files
I don't know much about this. What I do know:
There are a bunch of S3 buckets named "...-sccache" that contain cache data for Firefox builds. Workers have access to these buckets (somehow). They're in the Taskcluster production AWS account right now, but could probably be moved elsewhere if necessary. These are separate from Taskcluster artifacts (and are a fairly Mozilla-specific thing).
Comment 1 • 6 years ago (Assignee)
Next steps:
- gather data on how this works now
- work with chmanchester, grenade, and cloudops to talk about how we want it to work in the new deployment
- use the same AWS account?
- does windows currently use an IAM role, and should that change to use auth.awsSTSCredentials?
Wander, is this something you could work on?
Comment 2 • 6 years ago
(In reply to Dustin J. Mitchell [:dustin] (he/him) from comment #1)
Next steps:
- gather data on how this works now
- work with chmanchester, grenade, and cloudops to talk about how we want it to work in the new deployment
- use the same AWS account?
- does windows currently use an IAM role, and should that change to use auth.awsSTSCredentials?
Wander, is this something you could work on?
Not right now, but I might have some time by the end of the quarter.
Comment 3 • 6 years ago (Assignee)
OK, sounds like we need to find someone else then, to get this set up by the early-August deadline.
Comment 4 • 6 years ago
(In reply to Dustin J. Mitchell [:dustin] (he/him) from comment #1)
- gather data on how this works now
Let's concentrate on this part for now.
:pmoore, :wcosta - can you describe how this works in generic-worker/docker-worker right now? Do workers deployed by OCC behave differently, i.e. do we need to loop in :grenade immediately?
Comment 5 • 6 years ago
(In reply to Chris Cooper [:coop] pronoun: he from comment #4)
(In reply to Dustin J. Mitchell [:dustin] (he/him) from comment #1)
- gather data on how this works now
Let's concentrate on this part for now.
:pmoore, :wcosta - can you describe how this works in generic-worker/docker-worker right now? Do workers deployed by OCC behave differently, i.e. do we need to loop in :grenade immediately?
At least from the docker-worker POV, it has nothing to do with the worker or its AMI. Everything is set up in the relevant docker image used to run the build task. I am not sure how sccache is started, but once it is, it must use tc-auth to get temporary credentials. I am also unaware of how sccache interacts with the build system.
Comment 6 • 6 years ago
I'm not too sure how these buckets are managed/configured - Rob, do you know?
Comment 7 • 6 years ago
i believe the buckets were manually created and are not managed by automation (there is a lot of room for improvement here, e.g. terraform or some other automated, auditable, source-controlled approach).
you can see the buckets by searching for sccache at https://s3.console.aws.amazon.com/s3/.
windows builders use an iam role to access the bucket.
- the role is applied by the provisioner based on the worker's provisioner config
- the role is assigned to the provisioner config based on occ config like this: https://github.com/mozilla-releng/OpenCloudConfig/blob/f5250a9/userdata/Manifest/gecko-3-b-win2012.json#L1325-L1327
going forward it would be useful if we lost the iam role on windows in favour of the same mechanisms used by linux workers to obtain temporary credentials from tc-auth, but that involves more than trivial effort. i don't believe security would be improved, since iam also uses temporary rather than permanent credentials, shippable builds don't use sccache, and bucket access is scm-level specific. however, there must be some value in matching the implementations, and i believe the finer-grained controls used by the tc-auth approach must have some value or they wouldn't have been implemented.
there was some discussion by security folk about disabling sccache until we have a better sccache use story, so it would be worth speaking to ajvb too.
Comment 8 • 6 years ago (Assignee)
Removing sccache during the transition would certainly make things easier. That said, all of the requirements to support the tc-auth-based approach are already in place, while per-worker IAM roles are not currently supported by worker-manager. So, switching Windows to use the tc-auth approach would make it easy to enable this functionality in the new deployment.
Comment 9 • 6 years ago (Assignee)
so it would be worth speaking to ajvb too.
^^ the reason for the needinfo :)
It's my understanding that the credentials from tc-auth are handled in-tree, removing the necessity of considering them when deploying workers. So hopefully the change wouldn't be too troublesome, since the code is already there to handle it. Aside from the benefit of matching implementations, the security improvements include the ability to audit access to the buckets and control it at a more fine-grained level.
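(For reference, a minimal sketch of what "temporary credentials from tc-auth" looks like from the client side, using the Taskcluster Python client's awsS3Credentials method. The root URL, client credentials, and bucket name below are placeholders I've chosen for illustration, not values from this bug.)
# Sketch only: fetch temporary S3 credentials from the Taskcluster auth service.
# Requires the `taskcluster` Python client and a client whose scopes include
# auth:aws-s3:read-write:<bucket>/* (placeholder values below).
import taskcluster

auth = taskcluster.Auth({
    "rootUrl": "https://tc.example.com",                        # placeholder deployment root URL
    "credentials": {"clientId": "...", "accessToken": "..."},   # placeholder credentials
})

resp = auth.awsS3Credentials("read-write", "taskcluster-level-1-sccache-us-west-2", "")
creds = resp["credentials"]   # temporary accessKeyId / secretAccessKey / sessionToken
print(creds["accessKeyId"], "expires", resp["expires"])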
Comment 10 • 6 years ago (Assignee)
So, [s]witching Windows to use the tc-auth approach would make it easy to enable this functionality in the new deployment.
Rob, I can't find the communication now but my impression is things are headed in this direction anyway. Can we make this the plan of record for the September cutover to the new cluster?
Comment 11 • 6 years ago
(In reply to Dustin J. Mitchell [:dustin] (he/him) from comment #9)
so it would be worth speaking to ajvb too.
^^ the reason for the needinfo :)
It's my understanding that the credentials from tc-auth are handled in-tree, removing the necessity of considering them when deploying workers. So hopefully the change wouldn't be too troublesome, since the code is already there to handle it. Aside from the benefit of matching implementations, the security improvements include the ability to audit access to the buckets and control it at a more fine-grained level.
That makes sense to me.
(In reply to Rob Thijssen [:grenade (EET/UTC+0300)] from comment #7)
...
there was some discussion by security folk about disabling sccache until we have a better sccache use story, so it would be worth speaking to ajvb too.
Just to give context to this, we had a discussion right before Whistler '19 about L1 workers being able to potentially poison sccache that would then be used by L1 workers. The current idea was to remove the L1 sccache deployment and to have L1 workers read from the L3 sccache and not write anything. This hasn't been followed up on yet, but just wanted to add the context.
Comment 12 • 6 years ago
(In reply to Dustin J. Mitchell [:dustin] (he/him) from comment #10)
So, [s]witching Windows to use the tc-auth approach would make it easy to enable this functionality in the new deployment.
Rob, I can't find the communication now but my impression is things are headed in this direction anyway. Can we make this the plan of record for the September cutover to the new cluster?
i think so but someone will have to pick up the work. from a windows infra perspective, all we'd do is remove the iam roles that we currently apply to gecko-[1-3]-b-win2012.
the harder part is:
- the build system or windows build configurations will need to be modified to obtain and use credentials from tc-auth since they won't be available at the worker instance level from iam.
- does generic-worker already support requesting and using sccache credentials if the builds request it? i have no idea. i'm under the impression that the linux builds which do this are using docker worker but i may be out of the loop on that.
Comment 13 • 6 years ago (Assignee)
does generic-worker already support requesting and using sccache credentials if the builds request it? i have no idea. i'm under the impression that the linux builds which do this are using docker worker but i may be out of the loop on that.
Yes, it's in-task, and uses the taskcluster-proxy, which generic-worker also supports.
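(Roughly, the moving parts look like the sketch below. This is an illustrative fragment, not an actual in-tree task definition; the field names follow the docker-worker/generic-worker payload schemas as I understand them, and the bucket name is just an example.)
task_fragment = {
    "scopes": [
        # lets auth.awsS3Credentials hand out read-write credentials for the bucket
        "auth:aws-s3:read-write:taskcluster-level-1-sccache-us-west-2/*",
    ],
    "payload": {
        # with the proxy enabled, in-task code can hit http://taskcluster/...
        # and the proxy attaches the task's credentials/scopes
        "features": {"taskclusterProxy": True},
        "env": {
            "SCCACHE_BUCKET": "taskcluster-level-1-sccache-us-west-2",
            "AWS_IAM_CREDENTIALS_URL": (
                "http://taskcluster/auth/v1/aws/s3/read-write/"
                "taskcluster-level-1-sccache-us-west-2/?format=iam-role-compat"
            ),
        },
    },
}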
Comment 14 • 6 years ago (Assignee)
I'm hoping someone can make some relatively quick in-tree changes to support this.
Comment 15 • 6 years ago (Assignee)
I'll at least see what those in-tree changes might look like. I do want to avoid our team owning this, however!
Comment 16 • 6 years ago (Assignee)
https://github.com/mozilla/sccache/pull/492 to document the technique used for AWS + linux
Comment 17 • 6 years ago (Assignee)
OK, so here's how Linux works:
- https://searchfox.org/mozilla-central/source/taskcluster/scripts/builder/build-linux.sh sets AWS_IAM_CREDENTIALS_URL before run-task runs
- this URL points at the Taskcluster auth service's awsS3Credentials endpoint (reached through the taskcluster proxy)
- sccache reads that env var and uses it in place of the EC2 metadata service to fetch credentials
So, I think all we need to do is set that env var in windows CI as well.
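(A rough Python rendering of what sccache does with that env var, for readers unfamiliar with it -- sccache itself does this in Rust. The field names follow the iam-role-compat response, which mimics the EC2 instance-metadata credential document; treat them as an assumption rather than a spec.)
import json, os, urllib.request

# e.g. http://taskcluster/auth/v1/aws/s3/read-write/<bucket>/?format=iam-role-compat
url = os.environ["AWS_IAM_CREDENTIALS_URL"]
with urllib.request.urlopen(url) as resp:
    doc = json.load(resp)

access_key = doc["AccessKeyId"]      # sccache looks for this key (see comment 36)
secret_key = doc["SecretAccessKey"]
session_token = doc.get("Token")     # STS session token
print("credentials expire at", doc.get("Expiration"))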
Comment 18 • 6 years ago (Assignee)
From what I can tell, on Windows things work like this:
- task.payload.command has a run-task invocation that boils down to run-task [options] %GECKO_PATH%/testing/mozharness/scripts/fx_desktop_build.py [options], so run-task starts the mozharness script, which then does the build. From what I can tell, all builds use fx_desktop_build.py.
Under the hood, build-linux.sh (mentioned in the previous comment) runs a similar thing, using $MOZHARNESS_SCRIPT. For all the things that use SCCACHE, that variable is set to fx_desktop_build.py, so that seems to be the common denominator for things that use SCCACHE.
dustin@lamport ~/p/m-c (bug1562686) $ jq 'values | .[] | select(.task.payload.env.MOZHARNESS_SCRIPT != null and .task.payload.env.MOZHARNESS_SCRIPT != "mozharness/scripts/fx_desktop_build.py") | .kind' tasks.json | sort -u
"generate-profile"
"l10n"
"nightly-l10n"
"openh264-plugin"
"release-generate-checksums"
"repackage"
"repackage-l10n"
"test"
"webrender"
dustin@lamport ~/p/m-c (bug1562686) $ jq 'values | .[] | select(.task.payload.env.USE_SCCACHE == "1") | .kind' tasks.json | sort -u
"build"
"build-fat-aar"
"instrumented-build"
"searchfox"
"static-analysis-autotest"
"valgrind"
those sets do not overlap.
So, I think the fix here is to remove the env var setting from build-linux.sh and add it in fx_desktop_build.py. My mozharness-fu is weak, but I can try to make that work. Chris, does that sound right?
Comment 19 • 6 years ago
That sounds right. The other place to consider adding this would be mozconfig.cache.
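(To make the shape of that change concrete, here is a hypothetical helper -- names are mine, not in-tree code -- that builds the env var from the bucket name; the URL format matches the one used on Linux above.)
def iam_credentials_url(bucket):
    """Build the AWS_IAM_CREDENTIALS_URL for a given sccache bucket,
    pointing at the auth service via the taskcluster proxy."""
    return ("http://taskcluster/auth/v1/aws/s3/read-write/"
            "%s/?format=iam-role-compat" % bucket)

def sccache_env(bucket):
    # Hypothetical: the env a build configuration would export for sccache.
    return {
        "SCCACHE_BUCKET": bucket,
        "AWS_IAM_CREDENTIALS_URL": iam_credentials_url(bucket),
    }

print(sccache_env("taskcluster-level-1-sccache-us-west-2"))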
Comment 20 • 6 years ago (Assignee)
Comment 21 • 6 years ago (Assignee)
I created aws-provisioner workerType gecko-1-b-win2012-no-isntprof to test this. Including the typo!
Comment 22 • 6 years ago (Assignee)
Comment 23 • 6 years ago (Assignee)
Rob, I'm not able to get a worker to claim tasks if I remove the instance profile from its worker type configuration
https://tools.taskcluster.net/aws-provisioner/gecko-1-b-win2012-no-isntprof/resources
https://tools.taskcluster.net/groups/WK_MfwYhTdm8gwPkn8_vxg/tasks/WK_MfwYhTdm8gwPkn8_vxg/runs/0
any idea what would cause that?
Comment 24 • 6 years ago (Assignee)
(I also can't find the logs for that worker in papertrail -- I can find something under system ip-10-144-31-14, but that's for a docker-worker instance that started and ended before this worker did)
Comment 25 • 6 years ago
most likely the same issue as described in bug 1572089, comment 10. simplest thing is just to use worker type gecko-1-b-win2012-beta, which exists for tests of this nature, unless there's something special about the worker configuration that needs to be kept separate. if so, i can create the gecko-1-b-win2012-no-isntprof worker type in occ for you. let me know.
Comment 26 • 6 years ago (Assignee)
OK, I'll just use beta, then. Thanks!
Comment 27 • 6 years ago (Assignee)
removing
  "IamInstanceProfile": {
    "Arn": "arn:aws:iam::692406183521:instance-profile/taskcluster-level-1-sccache"
  },
from instanceTypes[*].launchSpec in https://tools.taskcluster.net/aws-provisioner/gecko-1-b-win2012-beta/edit
Comment 28 • 6 years ago (Assignee)
Comment 29 • 6 years ago (Assignee)
Oh, boo, sccache is disabled for that task.
https://taskcluster-artifacts.net/eri8v_eTT7uU0Ixex-cwKw has sccache enabled, as seen in https://taskcluster-artifacts.net/eri8v_eTT7uU0Ixex-cwKw/0/public/build/sccache.log. I'll try a copy of that task on the beta workerType.
https://tools.taskcluster.net/groups/fYcNZJlgRN-FEZ7vHV473A/tasks/fYcNZJlgRN-FEZ7vHV473A/details
Comment 30 • 6 years ago (Assignee)
https://taskcluster-artifacts.net/fYcNZJlgRN-FEZ7vHV473A/0/public/build/sccache.log looks like it ran sccache
So, I'm reasonably confident that this works.
Comment 31 • 6 years ago (Assignee)
Wait, that was a successful run without the patch. And mysteriously, the IamInstanceProfile has returned in the beta worker type. I guess OCC does that sometimes? So, it proves nothing.
Well, it proves that I don't know how to test this.
Comment 32 • 6 years ago (Assignee)
OK, modified the workerType definition and got aws-provisioner to start an instance before things got reverted. I've confirmed that the instance on which https://tools.taskcluster.net/groups/am0dja3sQtCL_6sCqybfmA/tasks/MvX1D5g0RgyReA_G4m8t2A/runs/0 is running, i-0de1cb4a494e24db8, has an empty "IAM Role" in the UI, whereas other gecko-1-b-win2012 instances have taskcluster-level-1-sccache in that spot. So, if that task passes, then I will be justifiably reasonably confident that this works.
Comment 33 • 6 years ago (Assignee)
https://taskcluster-artifacts.net/MvX1D5g0RgyReA_G4m8t2A/0/public/build/sccache.log doesn't look promising:
DEBUG 2019-08-13T20:56:45Z: sccache::compiler::compiler: [host_DiagnosticsMatcher.obj]: Cache write error: Error(Msg("failed to get AWS credentials"), State { next_error: Some(Error(Msg("Couldn\'t find AWS credentials in environment, credentials file, or IAM role."), State { next_error: None, backtrace: InternalBacktrace })), backtrace: InternalBacktrace })
DEBUG 2019-08-13T20:56:45Z: sccache::server: Error executing cache write: failed to get AWS credentials
Comment 34 • 6 years ago (Assignee)
Ah, earlier in the log:
DEBUG 2019-08-13T20:56:43Z: sccache::simples3::credential: Attempting to fetch credentials from http://taskcluster/auth/v1/aws/s3/read-write/taskcluster-level-1-sccache-/?format=iam-role-compat
so it is using the URL. However, the task does not have the relevant scope.
Comment 35 • 6 years ago (Assignee)
Comment 36 • 6 years ago (Assignee)
DEBUG 2019-08-14T13:47:20Z: sccache::simples3::credential: Attempting to fetch credentials from http://taskcluster/auth/v1/aws/s3/read-write/taskcluster-level-1-sccache-/?format=iam-role-compat
WARN 2019-08-14T13:47:20Z: sccache::simples3::credential: Failed to fetch IAM credentials: Couldn't find AccessKeyId in response.
It looks like that URL is incorrect.
Comment 37 • 6 years ago (Assignee)
Ugh, generic-worker doesn't set TASKCLUSTER_WORKER_GROUP. But there's already code in that file that determines the bucket name... one more try.
https://treeherder.mozilla.org/#/jobs?repo=try&revision=015b2f9e607a5fec1c0a7d61d9a59881e8a3c7ea
Comment 38 • 6 years ago (Assignee)
DEBUG 2019-08-14T17:14:39Z: sccache::server: [target_lexicon]: Cache write finished in 0.142 s
woo!
I've changed the patch quite a bit, so I'll push a new rev.
Comment 39 • 6 years ago (Assignee)
In the main log,
[task 2019-08-13T20:51:20.400Z] 20:51:20 INFO - export AWS_IAM_CREDENTIALS_URL=http://taskcluster/auth/v1/aws/s3/read-write/taskcluster-level-1-sccache-/?format=iam-role-compat
Comment 40 • 6 years ago (Assignee)
The latest run with the updated patch was green. I thought I had commented here, but apparently not! Anyway, this works on a windows worker without an instance profile.
Comment 41 • 6 years ago
Comment 42 • 6 years ago (Assignee)
Comment 43 • 6 years ago
Backed out for build bustages.
Failure log: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=262062753&repo=autoland&lineNumber=49750
Backout: https://hg.mozilla.org/integration/autoland/rev/95e2662a28ffdbd442be279469797113b0c3edc7
Comment 44 • 6 years ago (Assignee)
Weird..
sccache.log looks good:
DEBUG 2019-08-16T19:34:05Z: sccache::simples3::credential: Using AWS credentials from IAM
DEBUG 2019-08-16T19:34:05Z: sccache::simples3::s3: PUT http://taskcluster-level-3-sccache-us-east-1.s3.amazonaws.com/3/1/0/310a851cfae971d940681332d950f5972b726521b5edf4331b33b53ec2223464b33244ceed06e55867115632a853b162d6ced4200ec612d348749a24df15f74d
and in the main log..
[fetches 2019-08-16T19:31:22.420Z] Extracting /builds/worker/fetches/sccache.tar.xz to /builds/worker/fetches
...
[fetches 2019-08-16T19:31:23.480Z] /builds/worker/fetches/sccache.tar.xz extracted in 1.060s
[fetches 2019-08-16T19:31:23.480Z] Removing /builds/worker/fetches/sccache.tar.xz
..
[task 2019-08-16T19:31:57.798Z] 19:31:57 INFO - export AWS_IAM_CREDENTIALS_URL=http://taskcluster/auth/v1/aws/s3/read-write/taskcluster-level-3-sccache-us-east-1/?format=iam-role-compat
...
[task 2019-08-16T19:31:57.798Z] 19:31:57 INFO - MOZBUILD_MANAGE_SCCACHE_DAEMON=/builds/worker/fetches/sccache/sccache
...
[task 2019-08-16T19:32:00.511Z] 19:32:00 INFO - checking for ccache... /builds/worker/fetches/sccache/sccache
...
[task 2019-08-16T19:33:15.016Z] 19:33:15 INFO - env RUST_LOG=sccache=debug SCCACHE_ERROR_LOG=/builds/worker/artifacts/sccache.log /builds/worker/fetches/sccache/sccache --start-server
[task 2019-08-16T19:33:15.019Z] 19:33:15 INFO - DEBUG 2019-08-16T19:33:15Z: sccache::config: Attempting to read config file at "/builds/worker/.config/sccache/config"
[task 2019-08-16T19:33:15.019Z] 19:33:15 INFO - DEBUG 2019-08-16T19:33:15Z: sccache::config: Couldn't open config file: No such file or directory (os error 2)
[task 2019-08-16T19:33:15.019Z] 19:33:15 INFO - Starting sccache server...
[task 2019-08-16T19:33:15.020Z] 19:33:15 INFO - DEBUG 2019-08-16T19:33:15Z: sccache::config: Attempting to read config file at "/builds/worker/.config/sccache/config"
[task 2019-08-16T19:33:15.020Z] 19:33:15 INFO - DEBUG 2019-08-16T19:33:15Z: sccache::config: Couldn't open config file: No such file or directory (os error 2)
...
[task 2019-08-16T19:54:16.041Z] 19:54:16 INFO - Calling ['/builds/worker/workspace/build/src/obj-x86_64-pc-linux-gnu/_virtualenvs/init/bin/python', 'mach', 'valgrind-test'] with output_timeout 2400
[task 2019-08-16T19:54:16.782Z] 19:54:16 INFO - Error running mach:
[task 2019-08-16T19:54:16.782Z] 19:54:16 INFO - ['valgrind-test']
[task 2019-08-16T19:54:16.782Z] 19:54:16 INFO - The error occurred in code that was called by the mach command. This is either
[task 2019-08-16T19:54:16.782Z] 19:54:16 INFO - a bug in the called code itself or in the way that mach is calling it.
[task 2019-08-16T19:54:16.782Z] 19:54:16 INFO - You can invoke |./mach busted| to check if this issue is already on file. If it
[task 2019-08-16T19:54:16.782Z] 19:54:16 INFO - isn't, please use |./mach busted file| to report it. If |./mach busted| is
[task 2019-08-16T19:54:16.782Z] 19:54:16 INFO - misbehaving, you can also inspect the dependencies of bug 1543241.
[task 2019-08-16T19:54:16.782Z] 19:54:16 INFO - If filing a bug, please include the full output of mach, including this error
[task 2019-08-16T19:54:16.782Z] 19:54:16 INFO - message.
[task 2019-08-16T19:54:16.782Z] 19:54:16 INFO - The details of the failure are as follows:
[task 2019-08-16T19:54:16.782Z] 19:54:16 INFO - MetaCharacterException
[task 2019-08-16T19:54:16.782Z] 19:54:16 INFO - File "/builds/worker/workspace/build/src/build/valgrind/mach_commands.py", line 107, in valgrind_test
[task 2019-08-16T19:54:16.782Z] 19:54:16 INFO - env.update(self.extra_environment_variables)
[task 2019-08-16T19:54:16.782Z] 19:54:16 INFO - File "/builds/worker/workspace/build/src/python/mozbuild/mozbuild/util.py", line 980, in __get__
[task 2019-08-16T19:54:16.782Z] 19:54:16 INFO - setattr(instance, name, self.func(instance))
[task 2019-08-16T19:54:16.782Z] 19:54:16 INFO - File "/builds/worker/workspace/build/src/python/mozbuild/mozbuild/base.py", line 397, in extra_environment_variables
[task 2019-08-16T19:54:16.782Z] 19:54:16 INFO - exports = shellutil.split(line)[1:]
[task 2019-08-16T19:54:16.782Z] 19:54:16 INFO - File "/builds/worker/workspace/build/src/python/mozbuild/mozbuild/shellutil.py", line 177, in split
[task 2019-08-16T19:54:16.782Z] 19:54:16 INFO - return _ClineSplitter(s).result
[task 2019-08-16T19:54:16.782Z] 19:54:16 INFO - File "/builds/worker/workspace/build/src/python/mozbuild/mozbuild/shellutil.py", line 65, in __init__
[task 2019-08-16T19:54:16.782Z] 19:54:16 INFO - self._parse_unquoted()
[task 2019-08-16T19:54:16.782Z] 19:54:16 INFO - File "/builds/worker/workspace/build/src/python/mozbuild/mozbuild/shellutil.py", line 117, in _parse_unquoted
[task 2019-08-16T19:54:16.782Z] 19:54:16 INFO - raise MetaCharacterException(match['special'])
[task 2019-08-16T19:54:17.075Z] 19:54:17 ERROR - Return code: 1
...
[task 2019-08-16T19:54:17.510Z] 19:54:17 INFO - Running post-run listener: _shutdown_sccache
[task 2019-08-16T19:54:17.510Z] 19:54:17 INFO - Running command: ['/builds/worker/workspace/build/src/sccache/sccache', '--stop-server'] in /builds/worker/workspace/build/src
[task 2019-08-16T19:54:17.510Z] 19:54:17 INFO - Copy/paste: /builds/worker/workspace/build/src/sccache/sccache --stop-server
[task 2019-08-16T19:54:17.514Z] 19:54:17 ERROR - caught OS error 2: No such file or directory while running ['/builds/worker/workspace/build/src/sccache/sccache', '--stop-server']
I suspect that last bit about sccache not being found isn't fatal (I'm guessing it's a leftover from when sccache was in-tree?), and the MetaCharacterException is what did us in. Indeed:
>>> from mozbuild.shellutil import split
>>> split('foo=1')
['foo=1']
>>> split('foo=1?fbo')
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/dustin/p/m-c/python/mozbuild/mozbuild/shellutil.py", line 177, in split
return _ClineSplitter(s).result
File "/home/dustin/p/m-c/python/mozbuild/mozbuild/shellutil.py", line 65, in __init__
self._parse_unquoted()
File "/home/dustin/p/m-c/python/mozbuild/mozbuild/shellutil.py", line 117, in _parse_unquoted
raise MetaCharacterException(match['special'])
mozbuild.shellutil.MetaCharacterException
so apparently this shell-parsing library can't handle ? in a string.
Comment 45 • 6 years ago (Assignee)
I'll try a try push with
- mk_add_options "export AWS_IAM_CREDENTIALS_URL=http://taskcluster/auth/v1/aws/s3/read-write/${bucket}/?format=iam-role-compat"
+ mk_add_options "export 'AWS_IAM_CREDENTIALS_URL=http://taskcluster/auth/v1/aws/s3/read-write/${bucket}/?format=iam-role-compat'"
Comment 46 • 6 years ago (Assignee)
https://treeherder.mozilla.org/#/jobs?repo=try&revision=455dbade32df1f819caf7932e2f4bc8c5bb542ee
(the V job was the only one to fail in the autoland push)
Comment 47 • 6 years ago (Assignee)
[task 2019-08-19T16:12:09.516Z] 16:12:09 ERROR - Traceback (most recent call last):
[task 2019-08-19T16:12:09.516Z] 16:12:09 INFO - File "/builds/worker/workspace/build/src/configure.py", line 133, in <module>
[task 2019-08-19T16:12:09.516Z] 16:12:09 INFO - sys.exit(main(sys.argv))
[task 2019-08-19T16:12:09.516Z] 16:12:09 INFO - File "/builds/worker/workspace/build/src/configure.py", line 39, in main
[task 2019-08-19T16:12:09.516Z] 16:12:09 INFO - sandbox.run(os.path.join(os.path.dirname(__file__), 'moz.configure'))
[task 2019-08-19T16:12:09.516Z] 16:12:09 INFO - File "/builds/worker/workspace/build/src/python/mozbuild/mozbuild/configure/__init__.py", line 497, in run
[task 2019-08-19T16:12:09.517Z] 16:12:09 INFO - func(*args)
[task 2019-08-19T16:12:09.517Z] 16:12:09 INFO - File "/builds/worker/workspace/build/src/python/mozbuild/mozbuild/configure/__init__.py", line 541, in _value_for
[task 2019-08-19T16:12:09.517Z] 16:12:09 INFO - return self._value_for_depends(obj)
[task 2019-08-19T16:12:09.517Z] 16:12:09 INFO - File "/builds/worker/workspace/build/src/python/mozbuild/mozbuild/util.py", line 961, in method_call
[task 2019-08-19T16:12:09.517Z] 16:12:09 INFO - cache[args] = self.func(instance, *args)
[task 2019-08-19T16:12:09.519Z] 16:12:09 INFO - File "/builds/worker/workspace/build/src/python/mozbuild/mozbuild/configure/__init__.py", line 550, in _value_for_depends
[task 2019-08-19T16:12:09.519Z] 16:12:09 INFO - value = obj.result()
[task 2019-08-19T16:12:09.519Z] 16:12:09 INFO - File "/builds/worker/workspace/build/src/python/mozbuild/mozbuild/util.py", line 961, in method_call
[task 2019-08-19T16:12:09.519Z] 16:12:09 INFO - cache[args] = self.func(instance, *args)
[task 2019-08-19T16:12:09.519Z] 16:12:09 INFO - File "/builds/worker/workspace/build/src/python/mozbuild/mozbuild/configure/__init__.py", line 156, in result
[task 2019-08-19T16:12:09.519Z] 16:12:09 INFO - return self._func(*resolved_args)
[task 2019-08-19T16:12:09.519Z] 16:12:09 INFO - File "/builds/worker/workspace/build/src/python/mozbuild/mozbuild/configure/__init__.py", line 1125, in wrapped
[task 2019-08-19T16:12:09.519Z] 16:12:09 INFO - return new_func(*args, **kwargs)
[task 2019-08-19T16:12:09.519Z] 16:12:09 INFO - File "/builds/worker/workspace/build/src/build/moz.configure/rust.configure", line 60, in unwrap
[task 2019-08-19T16:12:09.519Z] 16:12:09 INFO - (retcode, stdout, stderr) = get_cmd_output(prog, '+stable')
[task 2019-08-19T16:12:09.519Z] 16:12:09 INFO - File "/builds/worker/workspace/build/src/python/mozbuild/mozbuild/configure/__init__.py", line 1125, in wrapped
[task 2019-08-19T16:12:09.519Z] 16:12:09 INFO - return new_func(*args, **kwargs)
[task 2019-08-19T16:12:09.519Z] 16:12:09 INFO - File "/builds/worker/workspace/build/src/build/moz.configure/util.configure", line 46, in get_cmd_output
[task 2019-08-19T16:12:09.519Z] 16:12:09 INFO - log.debug('Executing: `%s`', quote(*args))
[task 2019-08-19T16:12:09.519Z] 16:12:09 INFO - File "/builds/worker/workspace/build/src/python/mozbuild/mozbuild/shellutil.py", line 210, in quote
[task 2019-08-19T16:12:09.519Z] 16:12:09 INFO - return ' '.join(_quote(s) for s in strings)
[task 2019-08-19T16:12:09.519Z] 16:12:09 INFO - File "/builds/worker/workspace/build/src/python/mozbuild/mozbuild/shellutil.py", line 210, in <genexpr>
[task 2019-08-19T16:12:09.519Z] 16:12:09 INFO - return ' '.join(_quote(s) for s in strings)
[task 2019-08-19T16:12:09.519Z] 16:12:09 INFO - File "/builds/worker/workspace/build/src/python/mozbuild/mozbuild/shellutil.py", line 198, in _quote
[task 2019-08-19T16:12:09.519Z] 16:12:09 INFO - return t("'%s'") % s.replace(t("'"), t("'\\''"))
[task 2019-08-19T16:12:09.519Z] 16:12:09 ERROR - TypeError: cannot create 'NoneType' instances
I can't reproduce this locally. Chris, how would I go about setting a variable in a mozconfig where the value contains a "?" character?
Comment 48 • 6 years ago (Assignee)
(change ni since Chris is away and :glandium is all over shellutil.py)
Comment 49 • 6 years ago
The one with mk_add_options "export 'AWS_IAM_CREDENTIALS_URL=...'" failed because that somehow broke the mozconfig shell script and CARGO ended up not being set.
The one with mk_add_options "export AWS_IAM_CREDENTIALS_URL=..." failed because MozbuildObject.extra_environment_variables uses a function meant to split shell commands to read what is, in fact, make syntax for setting an environment variable.
Even if the first mk_add_options variant didn't break CARGO, it would still be wrong: export FOO = 'foo' in a makefile literally exports "'foo'":
$ cat /tmp/test.mk
export FOO = 'foo'
foo:
	echo $$FOO
$ make -f /tmp/test.mk
echo $FOO
'foo'
So the real solution on the mozconfig side is the second one. And MozbuildObject.extra_environment_variables should be fixed to read .mozconfig.mk as a makefile, presumably with pymake.
OTOH, the only thing using MozbuildObject.extra_environment_variables is build/valgrind/mach_commands.py, and that was explicitly added for automation, not developer builds, back when we were pulling Gtk from tooltool and needed to set plenty of environment variables for that. But that's not the case anymore. That is, the code that was added in bug 1187245 and that required me to add MozbuildObject.extra_environment_variables is gone as of bug 1426785.
So I'd just backout the remaining half of bug 1187245.
Comment 50 • 6 years ago (Assignee)
Comment 51 • 6 years ago (Assignee)
Depends on D41454
Comment 52 • 6 years ago
Comment 53 • 6 years ago
bugherder
Comment 54 • 6 years ago
Backed out as per glandium's request.
Backout: https://hg.mozilla.org/integration/autoland/rev/760f1b14b2a45a091f061b367ac474cdc8c13594
Comment 55 • 6 years ago
I'll elaborate a little on why: sccache writes were failing, which led to decreasing sccache hits, and increasing build times.
Comment 56 • 6 years ago
Ionut: FYI, this is the cause for the large number of build metrics alerts, ranging from sccache write error increases, to sccache hit decreases, to build time increases. Do note, though, that some of the alerts are also about bug 1575471.
Comment 57 • 6 years ago
(In reply to Razvan Maries from comment #54)
Backed out as per glandium's request.
Backout: https://hg.mozilla.org/integration/autoland/rev/760f1b14b2a45a091f061b367ac474cdc8c13594
This is weird because from alert 22628 I can see that a642029c8e7e Bug 1528697 and 4b5339bfcdaa Bug 1573501 seem to have caused the regressions.
Comment 58 • 6 years ago
(In reply to Alexandru Ionescu :alexandrui from comment #57)
(In reply to Razvan Maries from comment #54)
Backed out as per glandium's request.
Backout: https://hg.mozilla.org/integration/autoland/rev/760f1b14b2a45a091f061b367ac474cdc8c13594
This is weird because from alert 22628 I can see that a642029c8e7e Bug 1528697 and 4b5339bfcdaa Bug 1573501 seem to have caused the regressions.
The effect is not immediate because it's related to caching. You can see everything going back in order exactly on the backout (starting with the sccache write errors).
Comment 59 • 6 years ago (Assignee)
How can you tell writes are failing? I had observed writes succeeding in my pushes.
Comment 60 • 6 years ago
(In reply to Mike Hommey [:glandium] from comment #56)
Ionut: FYI, this is the cause for the large number of build metrics alerts, ranging from sccache write error increases, to sccache hit decreases, to build time increases. Do note, though, that some of the alerts are also about bug 1575471.
Thanks for the heads up! Alexandru, FYI.
Comment 61 • 6 years ago (Assignee)
Ah, I didn't see that alert link before -- that led me to the quarry. Looking at https://tools.taskcluster.net/groups/c1hpOdxxQbmHNasJh46zuA/tasks/SPf7ZDCCQTeFEIt-iKnbZg/runs/0/artifacts I see in sccache.log:
WARN 2019-08-21T23:45:20Z: sccache::simples3::credential: Failed to fetch IAM credentials: Didn't get a parseable response body from instance role details
DEBUG 2019-08-21T23:45:20Z: sccache::compiler::compiler: [arm.o]: Cache write error: Error(Msg("failed to get AWS credentials"), State { next_error: Some(Error(Msg("Couldn\'t find AWS credentials in environment, credentials file, or IAM role."), State { next_error: None, backtrace: InternalBacktrace })), backtrace: InternalBacktrace })
and in the task log:
[task 2019-08-21T23:42:17.623Z] 23:42:17 INFO - export SCCACHE_BUCKET=taskcluster-level-3-sccache-us-west-2
[task 2019-08-21T23:42:17.623Z] 23:42:17 INFO - export 'AWS_IAM_CREDENTIALS_URL=http://taskcluster/auth/v1/aws/s3/read-write/taskcluster-level-3-sccache-us-west-2/?format=iam-role-compat
The task has scope (among others)
assume:project:taskcluster:gecko:level-3-sccache-buckets
which expands to
auth:aws-s3:read-write:taskcluster-level-3-sccache-us-west-2/*
That should be sufficient (and it is the same role and scope it's been using since forever).
The task has taskcluster-proxy enabled. In that condition, I can successfully fetch credentials.
I don't see any logged calls to the auth service's awsS3Credentials method at that time, or in fact at any time from that worker's IP (I just see auth.expandScopes calls from docker-worker). I do see such an awsS3Credentials call for my test task.
Comment 62 • 6 years ago (Assignee)
This bit of code
let body = body
.map_err(|_e| "Didn't get a parseable response body from instance role details".into())
.and_then(|body| {
String::from_utf8(body).chain_err(|| "failed to read iam role response")
});
does a great job of discarding a useful error message and replacing it with a misleading one. There's nothing parsing-related going on here. Rather, there's an HTTP error of some sort. What sort, we don't know. So it seems clear that sccache is -- for whatever reason -- failing to talk to the taskcluster proxy.
Comment 63 • 6 years ago (Assignee)
I do see
Aug 21 23:39:44 docker-worker.aws-provisioner.us-west-2c.ami-0beb39c669e36dc9b.m5-4xlarge.i-0c190a9614ea4bab4 docker-worker: {"type":"ensure image","source":"top","provisionerId":"aws-provisioner-v1","workerId":"i-0c190a9614ea4bab4","workerGroup":"us-west-2","workerType":"gecko-3-b-linux","workerNodeType":"m5.4xlarge","image":{"name":"taskcluster/taskcluster-proxy:5.1.0","type":"docker-image"}}
in the worker logs, suggesting it is correctly loading the proxy. So most likely, sccache is hitting the AWS metadata endpoint and getting an error which it is helpfully obscuring (I'll make a patch..).
The unclosed ' in
[task 2019-08-21T23:42:17.623Z] 23:42:17 INFO - export 'AWS_IAM_CREDENTIALS_URL=http://taskcluster/auth/v1/aws/s3/read-write/taskcluster-level-3-sccache-us-west-2/?format=iam-role-compat
is curious. In fact, I bet that's it. So, how did this work in try??
Comment 64 • 6 years ago (Assignee)
Comment 65 • 6 years ago (Assignee)
OK, I see 0's for cache write errors in various builds I clicked on, and
DEBUG 2019-08-22T14:41:54Z: sccache::server: [bzip2_sys]: Cache write finished in 0.534 s
in one sccache.log for linux and
DEBUG 2019-08-22T14:51:24Z: sccache::simples3::s3: PUT http://taskcluster-level-1-sccache-us-west-2.s3.amazonaws.com/5/c/0/5c0fd5994ae431164601cd8560585404d31f1d80d876b033f6c99a4e5563f7b28992cd979d1398013fff128b787602bd0f1f71a74fbef9d2e39095f4ede6098e
INFO 2019-08-22T14:51:24Z: sccache::simples3::s3: Read 21174 bytes from http://taskcluster-level-1-sccache-us-west-2.s3.amazonaws.com/7/9/1/791fcead20fcfc79ec8957a62411df71b44d1c2c7e1590e684a60979aab36f9bc5f25cc0b682d02f24f5e2b12249ee3896507bbb41307e8975169c8acb675c8e
DEBUG 2019-08-22T14:51:24Z: sccache::compiler::compiler: [unistr_cnv.obj]: Cache hit in 0.030 s
DEBUG 2019-08-22T14:51:24Z: sccache::compiler::compiler: [cstr]: Stored in cache successfully!
in one for Windows.
Alexandru, do you see any other issues with this try push?
Comment 66 • 6 years ago (Assignee)
Maybe that's a better question for :glandium
Comment 67 • 6 years ago
I don't know what I should be looking for but the Windows 2012 opt has sccache write errors.
Comment 68 • 6 years ago (Assignee)
Hm, so it worked for most jobs, just not that one? Notably, that's in eu-central-1. I can see calls to the auth.awsS3Credentials endpoint:
2019-08-22 14:58:37.204
Fields: {
apiVersion: "v1"
clientId: "task-client/Y8WnX6jQSSKGv8votzJCKw/0/on/eu-central-1/i-01d53ffa6138d524a/until/1566487069.247"
duration: 58.736319
expires: "2019-08-22T15:17:49.247Z"
hasAuthed: true
method: "GET"
name: "awsS3Credentials"
public: false
query: {…}
resource: "/aws/s3/read-write/taskcluster-level-1-sccache-eu-central-1/"
satisfyingScopes: [1]
sourceIp: "3.120.111.87"
statusCode: 200
v: 1
}
which is exactly the time that sccache says so in the logs:
DEBUG 2019-08-22T14:58:37Z: sccache::simples3::credential: Using AWS credentials from IAM
DEBUG 2019-08-22T14:58:37Z: sccache::simples3::s3: PUT http://taskcluster-level-1-sccache-eu-central-1.s3.amazonaws.com/5/c/c/5cc399006fa7402cdb7a6920e6758ddc198607f5b4e3c18e7bf7c148946cc5ac82c2438a09f6deab47c461f4daa31c6773bd6a532f77011d2b5d6142c2d17585
DEBUG 2019-08-22T14:58:37Z: sccache::compiler::compiler: [host_AssertAssignmentChecker.obj]: Cache write error: Error(Msg("failed to put cache entry in s3"), State { next_error: Some(Error(BadHTTPStatus(400), State { next_error: None, backtrace: InternalBacktrace })), backtrace: InternalBacktrace })
DEBUG 2019-08-22T14:58:37Z: sccache::server: Error executing cache write: failed to put cache entry in s3
and I can confirm that the auth service has the policy necessary for the eu-central-1 bucket:
{
"Version": "2012-10-17",
"Statement": [
{
"Sid": "Stmt1467642244000",
"Effect": "Allow",
"Action": [
"s3:DeleteObject",
"s3:GetObject",
"s3:PutObject"
],
"Resource": [
"arn:aws:s3:::taskcluster-level-1-sccache-eu-central-1/*",
"arn:aws:s3:::taskcluster-level-1-sccache-us-east-1/*",
"arn:aws:s3:::taskcluster-level-1-sccache-us-west-1/*",
"arn:aws:s3:::taskcluster-level-1-sccache-us-west-2/*"
]
},
{
"Sid": "Stmt1467642244001",
"Effect": "Allow",
"Action": [
"s3:GetBucketLocation",
"s3:GetBucketTagging",
"s3:ListBucket"
],
"Resource": [
"arn:aws:s3:::taskcluster-level-1-sccache-eu-central-1",
"arn:aws:s3:::taskcluster-level-1-sccache-us-east-1",
"arn:aws:s3:::taskcluster-level-1-sccache-us-west-1",
"arn:aws:s3:::taskcluster-level-1-sccache-us-west-2"
]
}
]
}
so, why does S3 respond with a 400 status?
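(One debugging step that might help here -- a suggestion, not something tried in this bug: confirm the bucket's region and that a signed PUT works at all, since region/signature mismatches are a common cause of S3 400 responses. Assumes boto3; the key name is a placeholder.)
import boto3

bucket = "taskcluster-level-1-sccache-eu-central-1"

s3 = boto3.client("s3")  # or pass the temporary accessKeyId/secretAccessKey/sessionToken
location = s3.get_bucket_location(Bucket=bucket)["LocationConstraint"] or "us-east-1"
print("bucket region:", location)

# retry the PUT against a client pinned to the bucket's actual region
regional = boto3.client("s3", region_name=location)
regional.put_object(Bucket=bucket, Key="debug/ping", Body=b"ping")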
Comment 69 • 6 years ago (Assignee)
Hm, all of the sccache eu-central-1 buckets are empty. It looks like level 3 doesn't run in eu-central-1 at all, and based on the empty bucket I think that all try jobs running in eu-central-1 have been unable to write to caches for whatever reason since forever, and that has just gone unnoticed. It'd be nice to fix that, but not in this bug.
So, I've retriggered that job. If it comes back without any write errors, then I think this can be landed. OK by you, :glandium?
Comment 70 • 6 years ago (Assignee)
Of five retriggers, four are in eu-central-1, and one is in us-west-1.
Comment 71 • 6 years ago (Assignee)
Indeed, the us-west-1 task has zero write errors. So, that's https://bugzilla.mozilla.org/show_bug.cgi?id=1576032. I think this is clear to land.
Comment 73 • 6 years ago
Hi,
Just for the record:
this push:
(In reply to Pulsebot from comment #52)
Pushed by dmitchell@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/fe7b9445e1d3
use AWS_IAM_CREDENTIALS_URL for all S3 sccache invocations r=chmanchester
https://hg.mozilla.org/integration/autoland/rev/0ce37eda652a
revert remaining unnecessary bit of bug 1187245; r=glandium
caused alert summary 22611 (about 150 alerts) - regression
and this push (its backout):
(In reply to Razvan Maries from comment #54)
Backed out as per glandium's request.
Backout: https://hg.mozilla.org/integration/autoland/rev/760f1b14b2a45a091f061b367ac474cdc8c13594
caused alert summary 22698 (about 60 sccache and about 120 build times) - improvement. Probably more improvements of build times to come.
Note that Treeherder apparently wasn't able to detect the sccache regressions.
Comment 74 • 6 years ago
(In reply to Dustin J. Mitchell [:dustin] (he/him) from comment #65)
Alexandru, do you see any other issues with this try push?
What do you mean by other issues? I see all green, which at first sight seems to be good. I'm not so familiar with the details of the build-times code, but if you tell me what I should look for, I will.
Comment 75 • 6 years ago (Assignee)
Thanks Alexandru -- I think :glandium found what I needed. Also, note that bug 1576032 will likely lead to improvements in sccache performance on try.
Comment 76 • 6 years ago
Comment 77 • 6 years ago (Assignee)
0 write errors on https://treeherder.mozilla.org/#/jobs?repo=autoland&selectedJob=263139872 and similar jobs on later pushes. I think we're OK here.
Comment 78 • 6 years ago
bugherder
Comment 79 • 5 years ago (Assignee)
This appears to have stuck! I'll come back later this week to land the OCC change.
Comment 80 • 5 years ago (Assignee)
tomprince|pto> dustin: a=tomprince to land that everywhere
There's not a big rush, so Tom has agreed to do that when he's back (and it will be on beta by then anyway).
Comment 81 • 5 years ago (Assignee)
Tom, did this get uplifted?
Comment 82 • 5 years ago
bugherder uplift
Comment 83 • 5 years ago
bugherder uplift
Comment 84 • 5 years ago
bugherder uplift
Comment 85 • 5 years ago (Assignee)
Thank you!!
Comment 86 • 5 years ago (Assignee)
OK, I'll count this as done. The OCC change isn't landed yet, but it's not hurting anything and OCC will be unused soon enough.