Closed Bug 1562686 Opened 6 months ago Closed 3 months ago

[sccache] Migrate sccache to new deployment

Categories

(Taskcluster :: Operations and Service Requests, task)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: dustin, Assigned: dustin)

References

Details

(Keywords: leave-open)

Attachments

(2 files)

I don't know much about this. What I do know:

There are a bunch of S3 buckets named "...-sccache" that contain cache data for Firefox builds. Workers have access to these buckets (somehow). They're in the Taskcluster production AWS account right now, but could probably be moved elsewhere if necessary. These are separate from Taskcluster artifacts (and are a fairly Mozilla-specific thing).

Next steps:

  • gather data on how this works now
  • work with chmanchester, grenade, and cloudops to talk about how we want it to work in the new deployment
    • use the same AWS account?
    • does windows currently use an IAM role, and should that change to use auth.awsSTSCredentials?

Wander, is this something you could work on?

Flags: needinfo?(wcosta)

(In reply to Dustin J. Mitchell [:dustin] (he/him) from comment #1)

Next steps:

  • gather data on how this works now
  • work with chmanchester, grenade, and cloudops to talk about how we want
    it to work in the new deployment
    • use the same AWS account?
    • does windows currently use an IAM role, and should that change to use
      auth.awsSTSCredentials?

Wander, is this something you could work on?

Not right now, but I might have some time by the end of the quarter.

Flags: needinfo?(wcosta)

OK, sounds like we need to find someone else then, to get this set up by the early-August deadline.

Flags: needinfo?(coop)

(In reply to Dustin J. Mitchell [:dustin] (he/him) from comment #1)

  • gather data on how this works now

Let's concentrate on this part for now.

:pmoore, :wcosta - can you describe how this works in generic-worker/docker-worker right now? Do workers deployed by OCC behave differently, i.e. do we need to loop in :grenade immediately?

Flags: needinfo?(wcosta)
Flags: needinfo?(pmoore)
Flags: needinfo?(coop)

(In reply to Chris Cooper [:coop] pronoun: he from comment #4)

(In reply to Dustin J. Mitchell [:dustin] (he/him) from comment #1)

  • gather data on how this works now

Let's concentrate on this part for now.

:pmoore, :wcosta - can you describe how this works in
generic-worker/docker-worker right now? Do workers deployed by OCC behave
differently, i.e. do we need to loop in :grenade immediately?

At least from docker-worker POV, it relates nothing with the worker or its AMI. Everything is set up in the relevant docker image used to run the build task. I am not sure how sccache is started, but once it is, it must use tc-auth to get temporary credentials. I am also unaware of how sccache interacts with the build system.

Flags: needinfo?(wcosta)

I'm not too sure how these buckets are managed/configured - Rob, do you know?

Flags: needinfo?(pmoore) → needinfo?(rthijssen)

i believe the buckets were manually created and are not managed by automation (there is a lot of room for improvement here. eg: terraform or some other automated, auditable, source-controlled approach).

you can see the buckets by searching for sccache at https://s3.console.aws.amazon.com/s3/.

windows builders use an iam role to access the bucket.

going forward it would be useful if we lost the iam role on windows in favour of the same mechanism linux workers use to obtain temporary credentials from tc-auth, but that involves more than trivial effort. i don't believe security would be improved, since iam also uses temporary rather than permanent credentials, shippable builds don't use sccache, and bucket access is scm-level specific. however, there must be some value in matching the implementations, and i believe the finer-grained controls used by the tc-auth approach must have some value or they wouldn't have been implemented.

there was some discussion by security folk about disabling sccache until we have a better sccache use story, so it would be worth speaking to ajvb too.

Flags: needinfo?(rthijssen)

Removing during the transition would certainly make things easier. That said, all of the requirements to support the tc-auth-based approach are in place already, while per-worker IAM roles are not currently supported by worker-manager. So, switching Windows to use the tc-auth approach would make it easy to enable this functionality in the new deployment.

Flags: needinfo?(abahnken)

Comment 7:

so it would be worth speaking to ajvb too.

^^ the reason for the needinfo :)

It's my understanding that the credentials from tc-auth are handled in-tree, removing the necessity of considering them when deploying workers. So hopefully the change wouldn't be too troublesome, since the code is already there to handle it. Aside from the benefit of matching implementations, the security improvements include the ability to audit access to the buckets and control it at a more fine-grained level.

So, [s]witching Windows to use the tc-auth approach would make it easy to enable this functionality in the new deployment.

Rob, I can't find the communication now but my impression is things are headed in this direction anyway. Can we make this the plan of record for the September cutover to the new cluster?

Flags: needinfo?(rthijssen)

(In reply to Dustin J. Mitchell [:dustin] (he/him) from comment #9)

Comment 7:

so it would be worth speaking to ajvb too.

^^ the reason for the needinfo :)

It's my understanding that the credentials from tc-auth are handled in-tree, removing the necessity of considering them when deploying workers. So hopefully the change wouldn't be too troublesome, since the code is already there to handle it. Aside from the benefit of matching implementations, the security improvements include the ability to audit access to the buckets and control it at a more fine-grained level.

That makes sense to me.

(In reply to Rob Thijssen [:grenade (EET/UTC+0300)] from comment #7)

...
there was some discussion by security folk about disabling sccache until we have a better sccache use story, so it would be worth speaking to ajvb too.

Just to give context to this, we had a discussion right before Whistler '19 about L1 workers being able to potentially poison sccache that would then be used by L1 workers. The current idea was to remove the L1 sccache deployment and to have L1 workers read from the L3 sccache and not write anything. This hasn't been followed up on yet, but just wanted to add the context.

Flags: needinfo?(abahnken)

(In reply to Dustin J. Mitchell [:dustin] (he/him) from comment #10)

So, [s]witching Windows to use the tc-auth approach would make it easy to enable this functionality in the new deployment.

Rob, I can't find the communication now but my impression is things are headed in this direction anyway. Can we make this the plan of record for the September cutover to the new cluster?

i think so but someone will have to pick up the work. from a windows infra perspective, all we'd do is remove the iam roles that we currently apply to gecko-[1-3]-b-win2012.

the harder part is:

  • the build system or windows build configurations will need to be modified to obtain and use credentials from tc-auth since they won't be available at the worker instance level from iam.
  • does generic-worker already support requesting and using sccache credentials if the builds request it? i have no idea. i'm under the impression that the linux builds which do this are using docker worker but i may be out of the loop on that.
Flags: needinfo?(rthijssen)

does generic-worker already support requesting and using sccache credentials if the builds request it? i have no idea. i'm under the impression that the linux builds which do this are using docker worker but i may be out of the loop on that.

Yes, it's in-task, and uses the taskcluster-proxy, which generic-worker also supports.
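For reference, the in-task flow amounts to an HTTP GET against the taskcluster-proxy, which exchanges the task's scopes for temporary S3 credentials. A minimal Python sketch of that exchange, assuming the `iam-role-compat` endpoint format that appears in the task logs later in this bug; `fetch_credentials` is a hypothetical helper, not actual in-tree code:

```python
import json
import urllib.request

# taskcluster-proxy hostname, only resolvable from inside a running task.
PROXY = "http://taskcluster"

def sccache_credentials_url(bucket: str) -> str:
    """Build the iam-role-compat credentials URL seen in the task logs."""
    return (f"{PROXY}/auth/v1/aws/s3/read-write/{bucket}/"
            "?format=iam-role-compat")

def fetch_credentials(bucket: str) -> dict:
    # The proxy re-signs this request with the task's credentials, so the
    # task must carry the auth:aws-s3:read-write:<bucket>/* scope.
    with urllib.request.urlopen(sccache_credentials_url(bucket)) as resp:
        return json.load(resp)
```

sccache reads the same URL from AWS_IAM_CREDENTIALS_URL, so no worker-level IAM configuration is needed.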

Assignee: nobody → dustin

I'm hoping someone can make some relatively quick in-tree changes to support this.

Assignee: dustin → nobody

I'll at least see what those in-tree changes might look like. I do want to avoid our team owning this, however!

Assignee: nobody → dustin

https://github.com/mozilla/sccache/pull/492 to document the technique used for AWS + linux

OK, so here's how Linux works: build-linux.sh sets AWS_IAM_CREDENTIALS_URL to point at the taskcluster proxy, and sccache fetches temporary credentials from that URL.

So, I think all we need to do is set that env var in windows CI as well.

From what I can tell, on Windows things work like this:

  • task.payload.command has a run-task invocation that boils down to run-task [options] %GECKO_PATH%/testing/mozharness/scripts/fx_desktop_build.py [options], so run-task starts the mozharness script, which then does the build. From what I can tell, all builds use fx_desktop_build.py

Under the hood, build-linux.sh (mentioned in previous comment) runs a similar thing, using $MOZHARNESS_SCRIPT. For all things that use SCCACHE, that variable is set to fx_desktop_build.py, so that seems to be the common denominator for things that use SCCACHE.

dustin@lamport ~/p/m-c (bug1562686) $ jq 'values | .[] | select(.task.payload.env.MOZHARNESS_SCRIPT != null and .task.payload.env.MOZHARNESS_SCRIPT != "mozharness/scripts/fx_desktop_build.py") | .kind' tasks.json | sort -u
"generate-profile"
"l10n"
"nightly-l10n"
"openh264-plugin"
"release-generate-checksums"
"repackage"
"repackage-l10n"
"test"
"webrender"
dustin@lamport ~/p/m-c (bug1562686) $ jq 'values | .[] | select(.task.payload.env.USE_SCCACHE == "1") | .kind' tasks.json | sort -u
"build"
"build-fat-aar"
"instrumented-build"
"searchfox"
"static-analysis-autotest"
"valgrind"

those sets do not overlap.

So, I think the fix here is to remove the env var setting from build-linux.sh and add it in fx_desktop_build.py. My mozharness-fu is weak, but I can try to make that work. Chris, does that sound right?

Flags: needinfo?(cmanchester)

That sounds right. The other place to consider adding this would be mozconfig.cache.

Flags: needinfo?(cmanchester)

I created aws-provisioner workerType gecko-1-b-win2012-no-isntprof to test this. Including the typo!

Rob, I'm not able to get a worker to claim tasks if I remove the instance profile from its worker type configuration
https://tools.taskcluster.net/aws-provisioner/gecko-1-b-win2012-no-isntprof/resources
https://tools.taskcluster.net/groups/WK_MfwYhTdm8gwPkn8_vxg/tasks/WK_MfwYhTdm8gwPkn8_vxg/runs/0
any idea what would cause that?

Flags: needinfo?(rthijssen)

(I also can't find the logs for that worker in papertrail -- I can find something under system ip-10-144-31-14 but that's for a docker-worker instance that started and ended before this worker did)

most likely the same issue as described in bug 1572089, comment 10. simplest thing is just to use worker type gecko-1-b-win2012-beta, which exists for tests of this nature, unless there's something special about the worker configuration that needs to be kept separate. if so, i can create the gecko-1-b-win2012-no-isntprof worker type in occ for you. let me know.

Flags: needinfo?(rthijssen)

OK, I'll just use beta, then. Thanks!

removing

        "IamInstanceProfile": {
          "Arn": "arn:aws:iam::692406183521:instance-profile/taskcluster-level-1-sccache"
        },

from `instanceTypes[*].launchSpec` in https://tools.taskcluster.net/aws-provisioner/gecko-1-b-win2012-beta/edit

https://taskcluster-artifacts.net/fYcNZJlgRN-FEZ7vHV473A/0/public/build/sccache.log looks like it ran sccache

So, I'm reasonably confident that this works.

Wait, that was a successful run without the patch. And mysteriously, the IamInstanceProfile has returned in the beta worker type. I guess OCC does that sometimes? So, it proves nothing.

Well, it proves that I don't know how to test this.

OK, modified the workerType definition and got aws-provisioner to start an instance before things got reverted. I've confirmed that the instance on which
https://tools.taskcluster.net/groups/am0dja3sQtCL_6sCqybfmA/tasks/MvX1D5g0RgyReA_G4m8t2A/runs/0
is running, i-0de1cb4a494e24db8, has an empty "IAM Role" in the UI, whereas other gecko-1-b-win2012 instances have taskcluster-level-1-sccache in that spot. So, if that task passes, then I will be justifiably reasonably confident that this works.

https://taskcluster-artifacts.net/MvX1D5g0RgyReA_G4m8t2A/0/public/build/sccache.log doesn't look promising:

DEBUG 2019-08-13T20:56:45Z: sccache::compiler::compiler: [host_DiagnosticsMatcher.obj]: Cache write error: Error(Msg("failed to get AWS credentials"), State { next_error: Some(Error(Msg("Couldn\'t find AWS credentials in environment, credentials file, or IAM role."), State { next_error: None, backtrace: InternalBacktrace })), backtrace: InternalBacktrace })
DEBUG 2019-08-13T20:56:45Z: sccache::server: Error executing cache write: failed to get AWS credentials

Ah, earlier in the log:

DEBUG 2019-08-13T20:56:43Z: sccache::simples3::credential: Attempting to fetch credentials from http://taskcluster/auth/v1/aws/s3/read-write/taskcluster-level-1-sccache-/?format=iam-role-compat

so it is using the URL. However, the task does not have the relevant scope.

DEBUG 2019-08-14T13:47:20Z: sccache::simples3::credential: Attempting to fetch credentials from http://taskcluster/auth/v1/aws/s3/read-write/taskcluster-level-1-sccache-/?format=iam-role-compat
WARN 2019-08-14T13:47:20Z: sccache::simples3::credential: Failed to fetch IAM credentials: Couldn't find AccessKeyId in response.

It looks like that URL is incorrect.

Ugh, generic-worker doesn't set TASKCLUSTER_WORKER_GROUP. But there's already code in that file that determines the bucket name... one more try.
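The trailing dash in the failing URL above suggests the region suffix came out empty. A sketch of the bucket-name construction implied by the logs (the helper name is illustrative, and the assumption that the region suffix comes from TASKCLUSTER_WORKER_GROUP is inferred from this comment, not from the actual in-tree code):

```python
import os

def sccache_bucket(level: str, worker_group: str) -> str:
    """Bucket names look like taskcluster-level-3-sccache-us-east-1.

    When worker_group is empty (generic-worker doesn't set
    TASKCLUSTER_WORKER_GROUP), the name ends with a dangling '-',
    matching the bad URL seen in the sccache log above.
    """
    return f"taskcluster-level-{level}-sccache-{worker_group}"

# With the variable unset, the broken name from the log is reproduced:
bucket = sccache_bucket("1", os.environ.get("TASKCLUSTER_WORKER_GROUP", ""))
```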

https://treeherder.mozilla.org/#/jobs?repo=try&revision=015b2f9e607a5fec1c0a7d61d9a59881e8a3c7ea

DEBUG 2019-08-14T17:14:39Z: sccache::server: [target_lexicon]: Cache write finished in 0.142 s

woo!

I've changed the patch quite a bit, so I'll push a new rev.

In the main log,

[task 2019-08-13T20:51:20.400Z] 20:51:20     INFO -      export AWS_IAM_CREDENTIALS_URL=http://taskcluster/auth/v1/aws/s3/read-write/taskcluster-level-1-sccache-/?format=iam-role-compat
Blocks: 1573977

The latest run with the updated patch was green. I thought I had commented here, but apparently not! Anyway, this works on a windows worker without an instance profile.

Pushed by dmitchell@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/91eca815c9fc
use AWS_IAM_CREDENTIALS_URL for all S3 sccache invocations r=chmanchester

Weird...
sccache.log looks good:

DEBUG 2019-08-16T19:34:05Z: sccache::simples3::credential: Using AWS credentials from IAM
DEBUG 2019-08-16T19:34:05Z: sccache::simples3::s3: PUT http://taskcluster-level-3-sccache-us-east-1.s3.amazonaws.com/3/1/0/310a851cfae971d940681332d950f5972b726521b5edf4331b33b53ec2223464b33244ceed06e55867115632a853b162d6ced4200ec612d348749a24df15f74d

and in the main log..

[fetches 2019-08-16T19:31:22.420Z] Extracting /builds/worker/fetches/sccache.tar.xz to /builds/worker/fetches
...
[fetches 2019-08-16T19:31:23.480Z] /builds/worker/fetches/sccache.tar.xz extracted in 1.060s
[fetches 2019-08-16T19:31:23.480Z] Removing /builds/worker/fetches/sccache.tar.xz
...
[task 2019-08-16T19:31:57.798Z] 19:31:57     INFO -      export AWS_IAM_CREDENTIALS_URL=http://taskcluster/auth/v1/aws/s3/read-write/taskcluster-level-3-sccache-us-east-1/?format=iam-role-compat
...
[task 2019-08-16T19:31:57.798Z] 19:31:57     INFO -      MOZBUILD_MANAGE_SCCACHE_DAEMON=/builds/worker/fetches/sccache/sccache
...
[task 2019-08-16T19:32:00.511Z] 19:32:00     INFO -  checking for ccache... /builds/worker/fetches/sccache/sccache
...
[task 2019-08-16T19:33:15.016Z] 19:33:15     INFO -  env RUST_LOG=sccache=debug SCCACHE_ERROR_LOG=/builds/worker/artifacts/sccache.log /builds/worker/fetches/sccache/sccache --start-server
[task 2019-08-16T19:33:15.019Z] 19:33:15     INFO -  DEBUG 2019-08-16T19:33:15Z: sccache::config: Attempting to read config file at "/builds/worker/.config/sccache/config"
[task 2019-08-16T19:33:15.019Z] 19:33:15     INFO -  DEBUG 2019-08-16T19:33:15Z: sccache::config: Couldn't open config file: No such file or directory (os error 2)
[task 2019-08-16T19:33:15.019Z] 19:33:15     INFO -  Starting sccache server...
[task 2019-08-16T19:33:15.020Z] 19:33:15     INFO -  DEBUG 2019-08-16T19:33:15Z: sccache::config: Attempting to read config file at "/builds/worker/.config/sccache/config"
[task 2019-08-16T19:33:15.020Z] 19:33:15     INFO -  DEBUG 2019-08-16T19:33:15Z: sccache::config: Couldn't open config file: No such file or directory (os error 2)
...
[task 2019-08-16T19:54:16.041Z] 19:54:16     INFO - Calling ['/builds/worker/workspace/build/src/obj-x86_64-pc-linux-gnu/_virtualenvs/init/bin/python', 'mach', 'valgrind-test'] with output_timeout 2400
[task 2019-08-16T19:54:16.782Z] 19:54:16     INFO -  Error running mach:
[task 2019-08-16T19:54:16.782Z] 19:54:16     INFO -      ['valgrind-test']
[task 2019-08-16T19:54:16.782Z] 19:54:16     INFO -  The error occurred in code that was called by the mach command. This is either
[task 2019-08-16T19:54:16.782Z] 19:54:16     INFO -  a bug in the called code itself or in the way that mach is calling it.
[task 2019-08-16T19:54:16.782Z] 19:54:16     INFO -  You can invoke |./mach busted| to check if this issue is already on file. If it
[task 2019-08-16T19:54:16.782Z] 19:54:16     INFO -  isn't, please use |./mach busted file| to report it. If |./mach busted| is
[task 2019-08-16T19:54:16.782Z] 19:54:16     INFO -  misbehaving, you can also inspect the dependencies of bug 1543241.
[task 2019-08-16T19:54:16.782Z] 19:54:16     INFO -  If filing a bug, please include the full output of mach, including this error
[task 2019-08-16T19:54:16.782Z] 19:54:16     INFO -  message.
[task 2019-08-16T19:54:16.782Z] 19:54:16     INFO -  The details of the failure are as follows:
[task 2019-08-16T19:54:16.782Z] 19:54:16     INFO -  MetaCharacterException
[task 2019-08-16T19:54:16.782Z] 19:54:16     INFO -    File "/builds/worker/workspace/build/src/build/valgrind/mach_commands.py", line 107, in valgrind_test
[task 2019-08-16T19:54:16.782Z] 19:54:16     INFO -      env.update(self.extra_environment_variables)
[task 2019-08-16T19:54:16.782Z] 19:54:16     INFO -    File "/builds/worker/workspace/build/src/python/mozbuild/mozbuild/util.py", line 980, in __get__
[task 2019-08-16T19:54:16.782Z] 19:54:16     INFO -      setattr(instance, name, self.func(instance))
[task 2019-08-16T19:54:16.782Z] 19:54:16     INFO -    File "/builds/worker/workspace/build/src/python/mozbuild/mozbuild/base.py", line 397, in extra_environment_variables
[task 2019-08-16T19:54:16.782Z] 19:54:16     INFO -      exports = shellutil.split(line)[1:]
[task 2019-08-16T19:54:16.782Z] 19:54:16     INFO -    File "/builds/worker/workspace/build/src/python/mozbuild/mozbuild/shellutil.py", line 177, in split
[task 2019-08-16T19:54:16.782Z] 19:54:16     INFO -      return _ClineSplitter(s).result
[task 2019-08-16T19:54:16.782Z] 19:54:16     INFO -    File "/builds/worker/workspace/build/src/python/mozbuild/mozbuild/shellutil.py", line 65, in __init__
[task 2019-08-16T19:54:16.782Z] 19:54:16     INFO -      self._parse_unquoted()
[task 2019-08-16T19:54:16.782Z] 19:54:16     INFO -    File "/builds/worker/workspace/build/src/python/mozbuild/mozbuild/shellutil.py", line 117, in _parse_unquoted
[task 2019-08-16T19:54:16.782Z] 19:54:16     INFO -      raise MetaCharacterException(match['special'])
[task 2019-08-16T19:54:17.075Z] 19:54:17    ERROR - Return code: 1
...
[task 2019-08-16T19:54:17.510Z] 19:54:17     INFO - Running post-run listener: _shutdown_sccache
[task 2019-08-16T19:54:17.510Z] 19:54:17     INFO - Running command: ['/builds/worker/workspace/build/src/sccache/sccache', '--stop-server'] in /builds/worker/workspace/build/src
[task 2019-08-16T19:54:17.510Z] 19:54:17     INFO - Copy/paste: /builds/worker/workspace/build/src/sccache/sccache --stop-server
[task 2019-08-16T19:54:17.514Z] 19:54:17    ERROR - caught OS error 2: No such file or directory while running ['/builds/worker/workspace/build/src/sccache/sccache', '--stop-server']

I suspect that last bit about sccache not being found isn't fatal (I'm guessing it's a leftover from when sccache was in-tree?), and the MetaCharacterException is what did us in. Indeed:

>>> from mozbuild.shellutil import split
>>> split('foo=1')
['foo=1']
>>> split('foo=1?fbo')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/dustin/p/m-c/python/mozbuild/mozbuild/shellutil.py", line 177, in split
    return _ClineSplitter(s).result
  File "/home/dustin/p/m-c/python/mozbuild/mozbuild/shellutil.py", line 65, in __init__
    self._parse_unquoted()
  File "/home/dustin/p/m-c/python/mozbuild/mozbuild/shellutil.py", line 117, in _parse_unquoted
    raise MetaCharacterException(match['special'])
mozbuild.shellutil.MetaCharacterException

so apparently this shell-parsing library can't handle a `?` in a string.

Flags: needinfo?(dustin)

I'll try a try push with

-        mk_add_options "export AWS_IAM_CREDENTIALS_URL=http://taskcluster/auth/v1/aws/s3/read-write/${bucket}/?format=iam-role-compat"
+        mk_add_options "export 'AWS_IAM_CREDENTIALS_URL=http://taskcluster/auth/v1/aws/s3/read-write/${bucket}/?format=iam-role-compat'"
[task 2019-08-19T16:12:09.516Z] 16:12:09    ERROR -  Traceback (most recent call last):
[task 2019-08-19T16:12:09.516Z] 16:12:09     INFO -    File "/builds/worker/workspace/build/src/configure.py", line 133, in <module>
[task 2019-08-19T16:12:09.516Z] 16:12:09     INFO -      sys.exit(main(sys.argv))
[task 2019-08-19T16:12:09.516Z] 16:12:09     INFO -    File "/builds/worker/workspace/build/src/configure.py", line 39, in main
[task 2019-08-19T16:12:09.516Z] 16:12:09     INFO -      sandbox.run(os.path.join(os.path.dirname(__file__), 'moz.configure'))
[task 2019-08-19T16:12:09.516Z] 16:12:09     INFO -    File "/builds/worker/workspace/build/src/python/mozbuild/mozbuild/configure/__init__.py", line 497, in run
[task 2019-08-19T16:12:09.517Z] 16:12:09     INFO -      func(*args)
[task 2019-08-19T16:12:09.517Z] 16:12:09     INFO -    File "/builds/worker/workspace/build/src/python/mozbuild/mozbuild/configure/__init__.py", line 541, in _value_for
[task 2019-08-19T16:12:09.517Z] 16:12:09     INFO -      return self._value_for_depends(obj)
[task 2019-08-19T16:12:09.517Z] 16:12:09     INFO -    File "/builds/worker/workspace/build/src/python/mozbuild/mozbuild/util.py", line 961, in method_call
[task 2019-08-19T16:12:09.517Z] 16:12:09     INFO -      cache[args] = self.func(instance, *args)
[task 2019-08-19T16:12:09.519Z] 16:12:09     INFO -    File "/builds/worker/workspace/build/src/python/mozbuild/mozbuild/configure/__init__.py", line 550, in _value_for_depends
[task 2019-08-19T16:12:09.519Z] 16:12:09     INFO -      value = obj.result()
[task 2019-08-19T16:12:09.519Z] 16:12:09     INFO -    File "/builds/worker/workspace/build/src/python/mozbuild/mozbuild/util.py", line 961, in method_call
[task 2019-08-19T16:12:09.519Z] 16:12:09     INFO -      cache[args] = self.func(instance, *args)
[task 2019-08-19T16:12:09.519Z] 16:12:09     INFO -    File "/builds/worker/workspace/build/src/python/mozbuild/mozbuild/configure/__init__.py", line 156, in result
[task 2019-08-19T16:12:09.519Z] 16:12:09     INFO -      return self._func(*resolved_args)
[task 2019-08-19T16:12:09.519Z] 16:12:09     INFO -    File "/builds/worker/workspace/build/src/python/mozbuild/mozbuild/configure/__init__.py", line 1125, in wrapped
[task 2019-08-19T16:12:09.519Z] 16:12:09     INFO -      return new_func(*args, **kwargs)
[task 2019-08-19T16:12:09.519Z] 16:12:09     INFO -    File "/builds/worker/workspace/build/src/build/moz.configure/rust.configure", line 60, in unwrap
[task 2019-08-19T16:12:09.519Z] 16:12:09     INFO -      (retcode, stdout, stderr) = get_cmd_output(prog, '+stable')
[task 2019-08-19T16:12:09.519Z] 16:12:09     INFO -    File "/builds/worker/workspace/build/src/python/mozbuild/mozbuild/configure/__init__.py", line 1125, in wrapped
[task 2019-08-19T16:12:09.519Z] 16:12:09     INFO -      return new_func(*args, **kwargs)
[task 2019-08-19T16:12:09.519Z] 16:12:09     INFO -    File "/builds/worker/workspace/build/src/build/moz.configure/util.configure", line 46, in get_cmd_output
[task 2019-08-19T16:12:09.519Z] 16:12:09     INFO -      log.debug('Executing: `%s`', quote(*args))
[task 2019-08-19T16:12:09.519Z] 16:12:09     INFO -    File "/builds/worker/workspace/build/src/python/mozbuild/mozbuild/shellutil.py", line 210, in quote
[task 2019-08-19T16:12:09.519Z] 16:12:09     INFO -      return ' '.join(_quote(s) for s in strings)
[task 2019-08-19T16:12:09.519Z] 16:12:09     INFO -    File "/builds/worker/workspace/build/src/python/mozbuild/mozbuild/shellutil.py", line 210, in <genexpr>
[task 2019-08-19T16:12:09.519Z] 16:12:09     INFO -      return ' '.join(_quote(s) for s in strings)
[task 2019-08-19T16:12:09.519Z] 16:12:09     INFO -    File "/builds/worker/workspace/build/src/python/mozbuild/mozbuild/shellutil.py", line 198, in _quote
[task 2019-08-19T16:12:09.519Z] 16:12:09     INFO -      return t("'%s'") % s.replace(t("'"), t("'\\''"))
[task 2019-08-19T16:12:09.519Z] 16:12:09    ERROR -  TypeError: cannot create 'NoneType' instances

I can't reproduce this locally. Chris, how would I go about setting a variable in a mozconfig where the value contains a `?`?

Flags: needinfo?(cmanchester)

(change ni since Chris is away and :glandium is all over shellutil.py)

Flags: needinfo?(cmanchester) → needinfo?(mh+mozilla)

The one with mk_add_options "export 'AWS_IAM_CREDENTIALS_URL=...'" failed because the quoting broke the mozconfig shell script and CARGO ended up not being set.
The one with mk_add_options "export AWS_IAM_CREDENTIALS_URL=..." failed because MozbuildObject.extra_environment_variables uses a function meant to split shell commands to read what is, in fact, make syntax for setting an environment variable.

Even if the first mk_add_options variant didn't break CARGO, it would still be wrong: export FOO = 'foo' in a makefile literally exports "'foo'":

$ cat /tmp/test.mk
export FOO = 'foo'

foo:
	echo $$FOO
$ make -f /tmp/test.mk
echo $FOO
'foo'

So the real solution on the mozconfig side is the second one. And MozbuildObject.extra_environment_variables should be fixed to read .mozconfig.mk as a makefile, presumably with pymake.
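Before concluding (below) that the caller can simply be removed, the make-parsing direction could be sketched like this; a hypothetical helper handling only simple `export NAME = value` lines, not landed code:

```python
import re

# Matches make-style "export NAME = value" lines from .mozconfig.mk.
_EXPORT_RE = re.compile(r"^export\s+(\w+)\s*=\s*(.*)$")

def parse_mozconfig_mk(text: str) -> dict:
    """Read exported variables as make assignments, not shell commands.

    Unlike shellutil.split(), this never chokes on shell metacharacters
    like '?' in the value, because make treats the value as literal text.
    """
    env = {}
    for line in text.splitlines():
        m = _EXPORT_RE.match(line.strip())
        if m:
            env[m.group(1)] = m.group(2)
    return env
```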

OTOH, the only thing using MozbuildObject.extra_environment_variables is build/valgrind/mach_commands.py and that was explicitly added for automation, not developer builds, back when we were pulling Gtk from tooltool and needed to set plenty of environment variables for that. But that's not the case anymore. That is, the code that was added in bug 1187245 and that required me to add MozbuildObject.extra_environment_variables is gone as of bug 1426785.

So I'd just backout the remaining half of bug 1187245.

Flags: needinfo?(mh+mozilla)
Summary: Migrate sccache to new deployment → [sccache] Migrate sccache to new deployment
Pushed by dmitchell@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/fe7b9445e1d3
use AWS_IAM_CREDENTIALS_URL for all S3 sccache invocations r=chmanchester
https://hg.mozilla.org/integration/autoland/rev/0ce37eda652a
revert remaining unnecessary bit of bug 1187245; r=glandium
Flags: needinfo?(dustin)

I'll elaborate a little on why: sccache writes were failing, which led to decreasing sccache hits and increasing build times.

Ionut: FYI, this is the cause for the large number of build metrics alerts, ranging from sccache write error increases, to sccache hit decreases, to build time increases. Do note, though, that some of the alerts are also about bug 1575471.

Flags: needinfo?(igoldan)

(In reply to Razvan Maries from comment #54)

Backed out as per glandium's request.

Backout: https://hg.mozilla.org/integration/autoland/rev/760f1b14b2a45a091f061b367ac474cdc8c13594

This is weird because from alert 22628 I can see that a642029c8e7e (bug 1528697) and 4b5339bfcdaa (bug 1573501) seem to have caused the regressions.

(In reply to Alexandru Ionescu :alexandrui from comment #57)

(In reply to Razvan Maries from comment #54)

Backed out as per glandium's request.

Backout: https://hg.mozilla.org/integration/autoland/rev/760f1b14b2a45a091f061b367ac474cdc8c13594

This is weird because from alert 22628 I can see that a642029c8e7e (bug 1528697) and 4b5339bfcdaa (bug 1573501) seem to have caused the regressions.

The effect is not immediate because it's related to caching. You can see everything going back in order exactly on the backout (starting with the sccache write errors).

How can you tell writes are failing? I had observed writes succeeding in my pushes.

Flags: needinfo?(dustin) → needinfo?(mh+mozilla)

(In reply to Mike Hommey [:glandium] from comment #56)

Ionut: FYI, this is the cause for the large number of build metrics alerts, ranging from sccache write error increases, to sccache hit decreases, to build time increases. Do note, though, that some of the alerts are also about bug 1575471.

Thanks for the heads up! Alexandru, FYI.

Flags: needinfo?(igoldan) → needinfo?(alexandru.ionescu)

Ah, I didn't see that alert link before -- that led me to the quarry. Looking at
https://tools.taskcluster.net/groups/c1hpOdxxQbmHNasJh46zuA/tasks/SPf7ZDCCQTeFEIt-iKnbZg/runs/0/artifacts
I see in sccache.log

 WARN 2019-08-21T23:45:20Z: sccache::simples3::credential: Failed to fetch IAM credentials: Didn't get a parseable response body from instance role details
DEBUG 2019-08-21T23:45:20Z: sccache::compiler::compiler: [arm.o]: Cache write error: Error(Msg("failed to get AWS credentials"), State { next_error: Some(Error(Msg("Couldn\'t find AWS credentials in environment, credentials file, or IAM role."), State { next_error: None, backtrace: InternalBacktrace })), backtrace: InternalBacktrace })

and in the task log:

[task 2019-08-21T23:42:17.623Z] 23:42:17     INFO -      export SCCACHE_BUCKET=taskcluster-level-3-sccache-us-west-2
[task 2019-08-21T23:42:17.623Z] 23:42:17     INFO -      export 'AWS_IAM_CREDENTIALS_URL=http://taskcluster/auth/v1/aws/s3/read-write/taskcluster-level-3-sccache-us-west-2/?format=iam-role-compat

The task has scope (among others)

assume:project:taskcluster:gecko:level-3-sccache-buckets

which expands to

auth:aws-s3:read-write:taskcluster-level-3-sccache-us-west-2/*

that should be sufficient (and is the same role and scope it's been using since forever)

The task has taskcluster-proxy enabled. In that condition, I can successfully fetch credentials.

I don't see any logged calls to the auth service's awsS3Credentials method at that time, or in fact at any time from that worker's IP (I just see auth.expandScopes calls from docker-worker). I do see such an awsS3Credentials call for my test task.

Flags: needinfo?(mh+mozilla)

This bit of code

https://github.com/mozilla/sccache/blob/6e3295a22283d6143859c8838f583cb37c176e03/src/simples3/credential.rs#L362

        let body = body
            .map_err(|_e| "Didn't get a parseable response body from instance role details".into())
            .and_then(|body| {
                String::from_utf8(body).chain_err(|| "failed to read iam role response")
            });

does a great job of discarding a useful error message and replacing it with a misleading one. There's nothing parsing-related going on here. Rather, there's an HTTP error of some sort. What sort, we don't know. So it seems clear that sccache is -- for whatever reason -- failing to talk to the taskcluster proxy.

I do see

Aug 21 23:39:44 docker-worker.aws-provisioner.us-west-2c.ami-0beb39c669e36dc9b.m5-4xlarge.i-0c190a9614ea4bab4 docker-worker: {"type":"ensure image","source":"top","provisionerId":"aws-provisioner-v1","workerId":"i-0c190a9614ea4bab4","workerGroup":"us-west-2","workerType":"gecko-3-b-linux","workerNodeType":"m5.4xlarge","image":{"name":"taskcluster/taskcluster-proxy:5.1.0","type":"docker-image"}} 

in the worker logs, suggesting it is correctly loading the proxy. So most likely, sccache is hitting the AWS metadata endpoint and getting an error which it is helpfully obscuring (I'll make a patch...).

The unclosed ' in

[task 2019-08-21T23:42:17.623Z] 23:42:17     INFO -      export 'AWS_IAM_CREDENTIALS_URL=http://taskcluster/auth/v1/aws/s3/read-write/taskcluster-level-3-sccache-us-west-2/?format=iam-role-compat

is curious. In fact, I bet that's it. So, how did this work in try??
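The effect of an unbalanced quote is easy to reproduce (a hypothetical reproduction with a made-up variable name, not the actual task setup): the shell hits a syntax error and the export never happens at all.

```python
import subprocess

# Balanced quotes: the export works and the value is visible.
ok = subprocess.run(
    ["sh", "-c", "export 'FOO=http://taskcluster/example'; printenv FOO"],
    capture_output=True, text=True)
print(ok.stdout.strip())  # prints: http://taskcluster/example

# Unterminated quote: the shell rejects the whole command with a
# syntax error, so FOO is never set.
bad = subprocess.run(
    ["sh", "-c", "export 'FOO=http://taskcluster/example; printenv FOO"],
    capture_output=True, text=True)
print(bad.returncode != 0)  # prints: True
```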

OK, I see 0's for cache write errors in various builds I clicked on, and

DEBUG 2019-08-22T14:41:54Z: sccache::server: [bzip2_sys]: Cache write finished in 0.534 s

in one sccache.log for linux and

DEBUG 2019-08-22T14:51:24Z: sccache::simples3::s3: PUT http://taskcluster-level-1-sccache-us-west-2.s3.amazonaws.com/5/c/0/5c0fd5994ae431164601cd8560585404d31f1d80d876b033f6c99a4e5563f7b28992cd979d1398013fff128b787602bd0f1f71a74fbef9d2e39095f4ede6098e
 INFO 2019-08-22T14:51:24Z: sccache::simples3::s3: Read 21174 bytes from http://taskcluster-level-1-sccache-us-west-2.s3.amazonaws.com/7/9/1/791fcead20fcfc79ec8957a62411df71b44d1c2c7e1590e684a60979aab36f9bc5f25cc0b682d02f24f5e2b12249ee3896507bbb41307e8975169c8acb675c8e
DEBUG 2019-08-22T14:51:24Z: sccache::compiler::compiler: [unistr_cnv.obj]: Cache hit in 0.030 s
DEBUG 2019-08-22T14:51:24Z: sccache::compiler::compiler: [cstr]: Stored in cache successfully!

in one for Windows.
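As an aside, the PUT/GET URLs above show the cache-key layout in use: the first three hex characters of the digest become single-character key prefixes. This is inferred from the logged URLs, so treat it as an observation rather than a spec:

```python
def sccache_s3_key(digest: str) -> str:
    """Key layout as observed in the logged URLs: the first three hex
    characters of the digest are used as one-character prefixes."""
    return f"{digest[0]}/{digest[1]}/{digest[2]}/{digest}"

print(sccache_s3_key("5c0fd5994ae43116"))  # prints: 5/c/0/5c0fd5994ae43116
```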

Alexandru, do you see any other issues with this try push?

Maybe that's a better question for :glandium

Flags: needinfo?(mh+mozilla)

I don't know what I should be looking for, but the Windows 2012 opt build has sccache write errors.


Flags: needinfo?(mh+mozilla)

Hm, so it worked for most jobs, just not that one? Notably, that's in eu-central-1. I can see calls to the auth.awsS3Credentials endpoint:

2019-08-22 14:58:37.204
 Fields: {
   apiVersion: "v1"    
   clientId: "task-client/Y8WnX6jQSSKGv8votzJCKw/0/on/eu-central-1/i-01d53ffa6138d524a/until/1566487069.247"    
   duration: 58.736319    
   expires: "2019-08-22T15:17:49.247Z"    
   hasAuthed: true    
   method: "GET"    
   name: "awsS3Credentials"    
   public: false    
   query: {…}    
   resource: "/aws/s3/read-write/taskcluster-level-1-sccache-eu-central-1/"    
   satisfyingScopes: [1]    
   sourceIp: "3.120.111.87"    
   statusCode: 200    
   v: 1    
  }

which is exactly the time that sccache says so in the logs:

DEBUG 2019-08-22T14:58:37Z: sccache::simples3::credential: Using AWS credentials from IAM
DEBUG 2019-08-22T14:58:37Z: sccache::simples3::s3: PUT http://taskcluster-level-1-sccache-eu-central-1.s3.amazonaws.com/5/c/c/5cc399006fa7402cdb7a6920e6758ddc198607f5b4e3c18e7bf7c148946cc5ac82c2438a09f6deab47c461f4daa31c6773bd6a532f77011d2b5d6142c2d17585
DEBUG 2019-08-22T14:58:37Z: sccache::compiler::compiler: [host_AssertAssignmentChecker.obj]: Cache write error: Error(Msg("failed to put cache entry in s3"), State { next_error: Some(Error(BadHTTPStatus(400), State { next_error: None, backtrace: InternalBacktrace })), backtrace: InternalBacktrace })
DEBUG 2019-08-22T14:58:37Z: sccache::server: Error executing cache write: failed to put cache entry in s3

and I can confirm that the auth service has the policy necessary for the eu-central-1 bucket:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "Stmt1467642244000",
            "Effect": "Allow",
            "Action": [
                "s3:DeleteObject",
                "s3:GetObject",
                "s3:PutObject"
            ],
            "Resource": [
                "arn:aws:s3:::taskcluster-level-1-sccache-eu-central-1/*",
                "arn:aws:s3:::taskcluster-level-1-sccache-us-east-1/*",
                "arn:aws:s3:::taskcluster-level-1-sccache-us-west-1/*",
                "arn:aws:s3:::taskcluster-level-1-sccache-us-west-2/*"
            ]
        },
        {
            "Sid": "Stmt1467642244001",
            "Effect": "Allow",
            "Action": [
                "s3:GetBucketLocation",
                "s3:GetBucketTagging",
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::taskcluster-level-1-sccache-eu-central-1",
                "arn:aws:s3:::taskcluster-level-1-sccache-us-east-1",
                "arn:aws:s3:::taskcluster-level-1-sccache-us-west-1",
                "arn:aws:s3:::taskcluster-level-1-sccache-us-west-2"
            ]
        }
    ]
}
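As a sanity check, the Allow statements above can be matched mechanically. This is a minimal sketch (Allow statements only; no Deny, Condition, or the rest of IAM's real semantics, and simple prefix globbing rather than full ARN matching):

```python
import fnmatch

def policy_allows(policy: dict, action: str, resource: str) -> bool:
    """Return True if any Allow statement matches both the action and
    the resource. Minimal sketch: ignores Deny, Condition, NotAction."""
    for stmt in policy["Statement"]:
        if stmt.get("Effect") != "Allow":
            continue
        if any(fnmatch.fnmatchcase(action, a) for a in stmt["Action"]) and \
           any(fnmatch.fnmatchcase(resource, r) for r in stmt["Resource"]):
            return True
    return False

# Subset of the policy above:
policy = {
    "Statement": [{
        "Effect": "Allow",
        "Action": ["s3:DeleteObject", "s3:GetObject", "s3:PutObject"],
        "Resource": ["arn:aws:s3:::taskcluster-level-1-sccache-eu-central-1/*"],
    }]
}
print(policy_allows(
    policy, "s3:PutObject",
    "arn:aws:s3:::taskcluster-level-1-sccache-eu-central-1/5/c/c/somekey",
))  # prints: True
```

A permissions failure would also typically surface as a 403 rather than a 400, which points away from the policy as the culprit here.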

so, why does S3 respond with a 400 status?

Hm, all of the sccache eu-central-1 buckets are empty. It looks like level 3 doesn't run in eu-central-1 at all, and based on the empty buckets I think that all try jobs running in eu-central-1 have been unable to write to caches for whatever reason since forever, and that's just gone unnoticed. It'd be nice to fix that, but not in this bug.

So, I've retriggered that job. If it comes back without any write errors, then I think this can be landed. OK by you, :glandium?

Flags: needinfo?(mh+mozilla)

Of five retriggers, four are in eu-central-1, and one is in us-west-1.

Indeed, the us-west-1 task has zero write errors. So, that's https://bugzilla.mozilla.org/show_bug.cgi?id=1576032. I think this is clear to land.

Fair enough.

Flags: needinfo?(mh+mozilla)

Hi,
Just for the record:

this push:
(In reply to Pulsebot from comment #52)

Pushed by dmitchell@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/fe7b9445e1d3
use AWS_IAM_CREDENTIALS_URL for all S3 sccache invocations r=chmanchester
https://hg.mozilla.org/integration/autoland/rev/0ce37eda652a
revert remaining unnecessary bit of bug 1187245; r=glandium

caused alert summary 22611 (about 150 alerts) - regression
and this push (its backout):
(In reply to Razvan Maries from comment #54)

Backed out as per glandium's request.

Backout: https://hg.mozilla.org/integration/autoland/rev/760f1b14b2a45a091f061b367ac474cdc8c13594

caused alert summary 22698 (about 60 sccache alerts and about 120 build-time alerts) - improvement. More build-time improvements are probably still to come.
Note that Treeherder apparently wasn't able to detect the sccache regressions.

(In reply to Dustin J. Mitchell [:dustin] (he/him) from comment #65)

Alexandru, do you see any other issues with this try push?

What do you mean by other issues? I see all green, which at first sight seems good. I'm not so familiar with the details of the build-times code, but if you tell me what I should look for, I will.

Flags: needinfo?(alexandru.ionescu)

Thanks Alexandru -- I think :glandium found what I needed. Also, note that bug 1576032 will likely lead to improvements in sccache performance on try.

Pushed by dmitchell@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/f266a7b397c1
use AWS_IAM_CREDENTIALS_URL for all S3 sccache invocations r=chmanchester
https://hg.mozilla.org/integration/autoland/rev/a04fc912928e
revert remaining unnecessary bit of bug 1187245; r=glandium

0 write errors on
https://treeherder.mozilla.org/#/jobs?repo=autoland&selectedJob=263139872
and similar jobs on later pushes. I think we're OK here.

This appears to have stuck! I'll come back later this week to land the OCC change.

tomprince|pto> dustin: a=tomprince to land that everywhere

There's not a big rush, so Tom has agreed to do that when he's back (and it will be on beta by then anyway).

Flags: needinfo?(mozilla)

Tom, did this get uplifted?

Thank you!!

OK, I'll count this as done. The OCC change isn't landed yet, but it's not hurting anything and OCC will be unused soon enough.

Status: NEW → RESOLVED
Closed: 3 months ago
Resolution: --- → FIXED
Flags: needinfo?(mozilla)