Open Bug 1547111 Opened 2 years ago Updated 6 months ago

Migrate tier 1 builds from AWS to GCP

Categories

(Taskcluster :: General, defect)

Priority: Not set
Severity: normal

Tracking

(Not tracked)

REOPENED

People

(Reporter: coop, Assigned: dhouse)

References

(Depends on 4 open bugs, Blocks 1 open bug)

Details

(Keywords: leave-open)

Attachments

(16 files, 1 obsolete file)


We have tier 3 builds in GCP on all platforms. Now we need to take stock of what's still missing and what we need to do to migrate the tier 1 builds from AWS to GCP.

The #1 blocker right now is the lack of sccache in GCP (bug 1539961). This prevents some debug builds and other slower variants from completing reliably within their current time limits. This is compounded by the fact that we don't yet have access to compute-optimized instances in GCP.
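
For context, sccache chooses its storage backend from environment variables in the task payload, so the GCP work amounts to pointing builders at a GCS bucket per trust level. A minimal sketch of what that configuration might look like (the bucket name is made up, and exact variable names depend on the sccache version in use):

  # Hypothetical fragment of a build task payload; values are illustrative.
  payload:
    env:
      SCCACHE_GCS_BUCKET: sccache-l1-us-central1   # assumed per-level bucket
      SCCACHE_GCS_RW_MODE: READ_WRITE              # builders write, consumers read
      # Credentials would come from the Taskcluster auth service's GCP
      # access-token support rather than a key baked into the image.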

Once sccache is running, there are a bunch of build variants that still need to be set up in GCP:

  • spidermonkey (linux/win)
  • aarch64 (linux/win)
  • asan/asan reporter (linux/mac/win)
  • ccov (linux/mac/win)
  • noopt (linux/mac/win)
  • searchfox (linux/mac/win)
  • PGO builds (all platforms)

Some of those require scriptworker (esp. for signing), although the existing scriptworker pools in AWS can be used transitionally.

We can't start using GCP to produce binaries that we ship to end users (or data we rely on) until we have done a security review of the GCP platform.

We'll also want specific RRAs for the following services that are changing or being created fresh in GCP:

  • worker manager
  • GCP provider
  • helm deployment process of Taskcluster services in GCP

Finally, there are a few accessory services that, like sccache, should live next to workers in any new cloud providers, although these are nice-to-haves in that they will reduce costs and/or speed up the build process by minimizing network transfer. This includes services like:

  • hg mirror
  • cloud-mirror (will be object service eventually)

I'll add to this bug as I think of other things, and will add bug numbers as I find/file them. Please do the same.

Depends on: 1539961

(In reply to Chris Cooper [:coop] pronoun: he from comment #1)

> Some of those require scriptworker (esp. for signing), although the existing scriptworker pools in AWS can be used transitionally.

We also need to make sure that workers in GCP can generate the proper signatures for chain-of-trust (bug 1545524).

Depends on: 1545524

Results from analyzing the builds are in bug 1546414.

Key findings:

  • Build times are much slower for almost all build types.
  • Intermittent failures in the builds.

I'm working through some issues with sccache on Windows in bug 1549346, and I also need to modify our provisioning script to provision builders in the US only.

Depends on: 1549346, 1550468
Depends on: 1572236

Overall, there are some concerns around performance but no showstoppers.

Builds are slower than expected, perhaps to the tune of 30-40%. In particular, Windows PGO builds are slower, and tasks that have relatively short timeouts do hit those timeouts frequently, e.g. symbol uploads.

I wouldn't be comfortable moving builds over if they are 30% slower, but there are some mitigations we can put in place:

  • Optimize hg for GCP the same way we did for AWS - bug 1585133
  • Extend timeouts for some jobs (e.g. symbol uploads). I increased the timeout for symbol upload jobs to 30min and they all completed successfully.
  • Temporarily increase instance specs (e.g. Windows PGO builds). These are currently running on n1-highcpu-32 so we could bump that up to n1-highcpu-64.
  • Wait for n2 instances to become available, or create worker pools in us-central where beta n2s are already available. This might be useful for Windows instances.

I think we should wait for bug 1585133 to land because that will improve things across the board.
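
For the instance-spec mitigation, the change would be a small edit to the GCP worker-pool definitions in ci-configuration. A rough sketch of the shape (pool name, provider id, and field names are illustrative, not the exact schema):

  # Illustrative only; the real definitions live in ci-configuration and
  # the schema may differ.
  gecko-3/b-win2012-gcp:
    provider: fxci-level3-gcp
    config:
      machine_type: n1-highcpu-64   # bumped from n1-highcpu-32 for Windows PGO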

Attachment #9099440 - Attachment is obsolete: true
Depends on: 1585133

:bc - Connor's patch in bug 1585133 seems to have stuck. Have you been able to generate new perf numbers for validation? Is it worth creating a separate bug just to cover that work?

Flags: needinfo?(bob)

I have been working with a try push where I have been collecting the test results from using the linux builder workers as test workers. I have one iteration of the builds and 5 iterations of the tests so far and plan to add more build iterations to collect the statistics for the builds when I complete the tests. I don't think there is a need for a separate bug for now. I will update this bug with the build performance comparison when I have completed it. I plan to document the test results for the linux builder/test workers in bug 1577276.

Flags: needinfo?(bob)

Coop: I did a new try run with 20 builds. The resulting google sheet looks much better than before.

I found out last week that both the n2 and c2 instance families have hit general availability (GA) in GCP, so I did some renewed timings with both of them using my previous methodology with plain builds:

n2: https://treeherder.mozilla.org/#/jobs?repo=try&revision=65700ea553546df1ef64b032c70a8f7319340f37
Avg task time: 1381.77s (~23.0m)
Std. dev: 48.46
CoV: 0.04

c2: https://treeherder.mozilla.org/#/jobs?repo=try&revision=7e705f2899f617e1bcba10e7c45b67a63008f1f9
Avg task time: 1272.52s (~21.2m)
Std. dev: 21.75
CoV: 0.02
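
For anyone reading along, the CoV figures above are just the standard deviation normalized by the mean; for the n2 numbers, for example:

  \mathrm{CoV} = \frac{\sigma}{\mu} = \frac{48.46}{1381.77} \approx 0.035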

Synopsis: both instance types are still faster than our current AWS instances, and are several minutes faster than they were just a few months ago, likely due to the hg speed-ups. Google tells me they won't have enough capacity to run our peak workloads on c2, but n2 can shoulder the burden.

We're not planning on migrating anything this week, but I'll roll some patches to:

  1. Change our current GCP builders to n2 instances. This can land this week
  2. Migrate POSIX build load from AWS -> GCP, to be landed early next week after the migration/TCW

Adding bug 1546801 as a dependency, not for technical reasons but to avoid disrupting that migration by changing the build platform at the same time.

Depends on: tc-cloudops
Depends on: 1594583
Depends on: 1595623

I'm driving this, if not doing the work.

Assignee: nobody → coop
Status: NEW → RESOLVED
Closed: 1 year ago
Resolution: --- → FIXED

Didn't mean to close this.

Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Depends on: 1597996
Depends on: 1587958
Depends on: 1598295
Depends on: 1601736
Depends on: 1604196
Depends on: 1607241

We are switching tier-1 builds to GCP.

Attachment #9121113 - Attachment description: Bug 1547111: Remove tier-3 GCP builds r=tomprince → Bug 1547111: Switch GCP and AWS tier builds r=tomprince
Attachment #9121113 - Attachment description: Bug 1547111: Switch GCP and AWS tier builds r=tomprince → Bug 1547111: Replace `-gcp` builds with `-aws` builds r=tomprince
Pushed by mozilla@hocat.ca:
https://hg.mozilla.org/integration/autoland/rev/540db822a1d4
Replace `-gcp` builds with `-aws` builds r=tomprince

There was a high rate of tasks failing with claim-expired, and it appears that sccache is not working (the tasks are getting write errors). Because of that and bug 1609568, I've backed it out.

Flags: needinfo?(wcosta)
Regressions: 1609568
Backout by dvarga@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/daf3b53c3efa
Backed out changeset 540db822a1d4 for causing bug 1609568

It feels like the worker-pool is lacking the scope auth:gcp:access-token:sccache-3/sccache-l{level}-us-central1@sccache-3.iam.gserviceaccount.com. I believe this is a misconfiguration in ci-configuration, but I could not track down where. Tom, can you figure this out?
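
For reference, scopes like that are normally handed to worker pools through grants in ci-configuration's grants.yml; a hedged sketch of the kind of entry involved (the pool name below is illustrative, not the actual one):

  # Sketch only; the real grant lives in ci-configuration/grants.yml.
  - grant:
      - auth:gcp:access-token:sccache-3/sccache-l3-us-central1@sccache-3.iam.gserviceaccount.com
    to:
      - worker-pool:gecko-3/b-linux-gcp   # illustrative pool id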

Flags: needinfo?(wcosta) → needinfo?(mozilla)

It feels like everything is configured correctly. I opened a new pull request so that sccache reports an error when the access-token request fails. The only remaining piece of the puzzle I can see is taskcluster-auth. :edunham, could you confirm sccache-l3-us-central1@sccache-3.iam.gserviceaccount.com is in the auth service's allow list?

Flags: needinfo?(mozilla) → needinfo?(edunham)
Regressions: 1609949

Wander, sccache-l3-us-central1@sccache-3.iam.gserviceaccount.com is in the allowedServiceAccounts list configured for the sccache-3 project in gcp_credentials_allowed_projects for the auth service in firefoxci.

Stage previously had some L3 configuration which it turns out should never have been there, so that is being removed now. Stage has the same L1 and L2 accounts configured as firefoxci does, though.
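
For anyone following along, the setting being referenced is the auth service's GCP credentials configuration, which (per the Taskcluster deployment docs) maps a project to the service accounts it may issue tokens for. Roughly, with credentials elided:

  # Sketch of gcp_credentials_allowed_projects for the firefoxci auth service;
  # structure per the Taskcluster docs, accounts taken from this bug.
  sccache-3:
    credentials: {}   # the project's service-account key, omitted here
    allowedServiceAccounts:
      - sccache-l1-us-central1@sccache-3.iam.gserviceaccount.com
      - sccache-l3-us-central1@sccache-3.iam.gserviceaccount.com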

Flags: needinfo?(edunham)

(In reply to Pulsebot from comment #24)

> Backout by dvarga@mozilla.com:
> https://hg.mozilla.org/integration/autoland/rev/daf3b53c3efa
> Backed out changeset 540db822a1d4 for causing bug 1609568

== Change summary for alert #24699 (as of Thu, 16 Jan 2020 07:26:06 GMT) ==

Improvements:

24% build times windows2012-64-noopt debug taskcluster-c5.4xlarge 2,609.43 -> 1,992.32

For up to date results, see: https://treeherder.mozilla.org/perf.html#/alerts?id=24699

The scopes that are used are managed by the
project:taskcluster:{trust_domain}:level-{level}-sccache-buckets
role that is added a few lines above.

Attachment #9122369 - Attachment description: Bug 1547111: Fix sccache bucket perms for level 2/3 workers r=callek → Bug 1547111: Fix sccache bucket perms for level 2/3 workers r=Callek
Pushed by mozilla@hocat.ca:
https://hg.mozilla.org/integration/autoland/rev/e1a3a62f2035
Remove incorrect GCP sccache scope; r=Callek
Depends on: 1611255

The scope was added to tasks using sccache, but does not correspond to an
actual bucket. Now that the code that added that scope is gone, we can remove
the scope.

Depends on: 1609568
Depends on: 1609595

To be explicit: when we do the cutover, we want the try and release builds (levels 1-3) to switch from AWS to GCP, but we also want the tier 3 validation builds that we've been running in GCP to switch back to AWS. This is important for making sure builds continue to work in AWS in case we need to switch back from GCP to AWS at some point in the future.

Pushed by wcosta@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/9819d9e38727
Replace `-gcp` builds with `-aws` builds r=tomprince
Depends on: 1614852

:fubar is going to drive this going forward.

Assignee: coop → klibby

:tomprince, here is the NI I mentioned yesterday. I verified my test results from April 23 and found a 3% failure rate on bundles, but the failures were caught immediately (and did not cause a delayed timeout). I'll work on the image update in another bug and the ci-config test here.

  1. Can you test, or direct me on how to test, moving the linux (+android +macos) builds to GCP in the Taskcluster staging env, or do we just need to test it in production like before? (I am guessing we want to keep some "shadow" builds in AWS for a few release cycles in case things break in GCP and we need to switch back.)

  2. I also want to test smaller instance sizes to minimize cost. Is that something I could test with ci-configuration in the Taskcluster staging env? I want to target n2-standard, pending Google's confirmation that they now have the capacity. I can manually test instances in the Google projects, but I'd like to test it "the right way"(?) with ci-config in staging so that we have more confidence in the results when applying the change to production.

Assignee: klibby → dhouse
Flags: needinfo?(mozilla)

(Tom met with me over Zoom and explained the ci-admin/ci-config staging testing, and we planned the steps for this.)

Flags: needinfo?(mozilla)

:fubar, have you heard from our GCP person about capacity for c2 (or, if not c2, then n2)? Re: coop's numbers from https://docs.google.com/spreadsheets/d/1Qe7MFyvce59Oqtugm62-Fs9H9f3shNKHhcwhLOjuVRc/edit#gid=0, c2-standard-16 looks best, but the note says "capacity limited".

Flags: needinfo?(klibby)

Did someone ever look at the bad performance from the sccache bucket? If I look at a random recent build on GCP, I see that 14.4% of the requests to the bucket take more than 1s (going up to 12.3s!), while the corresponding build on AWS has only 0.26% requests taking more than 1s (and max 3s).

Interestingly, it's the same rust crate cache hit that yields the max time in both cases, which got me to look further: the cumulative download time is 483s on AWS (average 0.11s) and 2029.3s on GCP (average 0.49s). It seems that, overall, the sccache bucket on GCP is 4-5 times as slow as the one on AWS.

This has a noticeable impact on build times.

In GCP, we have different GCP projects (represented by different worker-manager providers) for level-1, level-3, and test workers. Thus we need to be able to vary the provider in variant worker-pool families.
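
A rough sketch of what that enables in the pool definitions (field names and provider ids are illustrative; the actual schema is whatever ci-admin's worker-pool generation expects):

  # Illustrative only: a pool family whose variants can now override the provider.
  gecko/b-linux-gcp:
    provider: fxci-level3-gcp        # default provider for the family
    variants:
      - suffix: dev
        provider: fxci-level1-gcp    # variant built in the level-1 project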

Pushed by mozilla@hocat.ca:
https://hg.mozilla.org/ci/ci-admin/rev/f07536224105
Allow varying the worker-pool provider in variants; r=aki

(In reply to Mike Hommey [:glandium] from comment #47)

> Did someone ever look at the bad performance from the sccache bucket? If I look at a random recent build on GCP, I see that 14.4% of the requests to the bucket take more than 1s (going up to 12.3s!), while the corresponding build on AWS has only 0.26% requests taking more than 1s (and max 3s).
>
> Interestingly, it's the same rust crate cache hit that yields the max time in both cases, which got me to look further: the cumulative download time is 483s on AWS (average 0.11s) and 2029.3s on GCP (average 0.49s). It seems that, overall, the sccache bucket on GCP is 4-5 times as slow as the one on AWS.
>
> This has a noticeable impact on build times.

Thanks :glandium for catching this! I did not look at sccache, and I don't know if anyone else did. Is this a per-task cumulative sum, or a sum across all tasks for a period?

Do you know who has worked on sccache for GCP (or AWS)? If not, I'll need to ask around or search bugs.

Flags: needinfo?(mh+mozilla)

(In reply to Dave House [:dhouse] from comment #51)

> Thanks :glandium for catching this! I did not look at sccache, and I don't know if anyone else did. Is this a per-task cumulative sum, or a sum across all tasks for a period?

Cumulative sum for one random build.

> Do you know who has worked on sccache for GCP (or AWS)? If not, I'll need to ask around or search bugs.

No idea.

Flags: needinfo?(mh+mozilla)
Pushed by mozilla@hocat.ca:
https://hg.mozilla.org/ci/ci-configuration/rev/36f582746ac6
Refactor gcp workers using variants; r=aki

Fubar is checking with Google on n2-standard-16 instance capacity.

Flags: needinfo?(klibby)

:miles, do you have a docker-worker GCP image ready that we could switch to in ci-config?

I see these from May 27th in the taskcluster-imaging project:
docker-worker-gcp-community-googlecompute-2020-05-27t20-17-36z
docker-worker-gcp-googlecompute-2020-05-27t18-59-51z

Are you running this "-community-" image in community? Could you (or I) switch firefox-ci to use this image, or is there a better one?

https://hg.mozilla.org/ci/ci-configuration/file/tip/worker-images.yml#l12:
monopacker-docker-worker-current: monopacker-docker-worker-2020-02-07t09-14-17z
monopacker-docker-worker-trusted-current: monopacker-docker-worker-gcp-trusted-2020-02-13t03-22-56z
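
If one of the May 27 images checks out, the switch would presumably be an update to those aliases in worker-images.yml, along the lines of the sketch below (the trusted/level-3 image name is a placeholder, since it hasn't been baked yet):

  # Hypothetical update; the level-3 image name is a placeholder.
  monopacker-docker-worker-current: docker-worker-gcp-googlecompute-2020-05-27t18-59-51z
  monopacker-docker-worker-trusted-current: <new-gcp-trusted-l3-image>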

Flags: needinfo?(miles)

:miles, thank you for contacting me yesterday. Will you build a level-3 image to match the level-1 one, or can I use the same image? If not, who do I need to get the secrets from? (I'm assuming secrets are still baked into the images, which would require us to make separate images.)

Depends on: 1643562

Some related discussion of this appeared in #firefox-ci today starting at 2:10 pacific:

dustin
who typically bakes the docker-worker images in GCP?
I can, but there's a 49% chance I'll get the secrets wrong, so if someone else typically does it, that'd be great
last time was 2020-02-07
for example I see a secret named docker-worker/yaml/firefox-tc-production-l3-new.yaml, but that's from March 5
and I don't see an l1
for production
aki
hm, not sure
https://hg.mozilla.org/ci/ci-configuration/file/tip/grants.yml#l2300 points to miles and wander
miles
dustin: the naming scheme is lacking, looks like we are indeed missing production-l1
-new and -old was from the rotation in march
wander baked some images 5/27 that haven't been entered, we should probably re-do that at this point
because CoT isn't used for L1 I think the staging-l1 yaml has been used for all L1 images

(In reply to Dave House [:dhouse] from comment #58)

> :miles, thank you for contacting me yesterday. Will you build a level-3 image to match the level-1 one, or can I use the same image? If not, who do I need to get the secrets from? (I'm assuming secrets are still baked into the images, which would require us to make separate images.)

:miles, could you build GCP images for level 1 and level 3 for this? (Or, if I need to do it, who do I get the secrets from? Also, I saw the refactoring and changes in the repo; is monopacker in a ready state to build for GCP?)

bug 1643562 also needs new docker-worker images baked for gcp. I don't have the necessary access.

Flags: needinfo?(miles)