Closed Bug 1547111 Opened 5 years ago Closed 2 years ago

Migrate shippable builds from AWS to GCP

Categories

(Release Engineering :: Firefox-CI Administration, task)

Type: task
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: coop, Assigned: masterwayz)

References

Details

Attachments

(17 files, 1 obsolete file)


We have tier 3 builds in GCP on all platforms. Now we need to take stock of what's still missing and what we need to do to migrate the tier 1 builds from AWS to GCP.

The #1 blocker right now is the lack of sccache in GCP (bug 1539961). This prevents some debug builds and other slower variants from completing reliably within their existing time limits. This is compounded by not yet having access to compute-optimized instances in GCP.

Once sccache is running, there are a bunch of build variants that still need to be set up in GCP:

  • spidermonkey (linux/win)
  • aarch64 (linux/win)
  • asan/asan reporter (linux/mac/win)
  • ccov (linux/mac/win)
  • noopt (linux/mac/win)
  • searchfox (linux/mac/win)
  • PGO builds (all platforms)

Some of those require scriptworker (esp. for signing), although the existing scriptworker pools in AWS can be used transitionally.

We can't start using GCP to produce binaries that we ship to end users (or data we rely on) until we have done a security review of the GCP platform.

We'll also want specific RRAs for the following services that are changing or being created fresh in GCP:

  • worker manager
  • GCP provider
  • helm deployment process of Taskcluster services in GCP

Finally, there are a few accessory services that, like sccache, should live next to workers in any new cloud providers, although these are nice-to-haves in that they will reduce costs and/or speed up the build process by minimizing network transfer. This includes services like:

  • hg mirror
  • cloud-mirror (will be object service eventually)

I'll add to this bug as I think of other things, and will add bug numbers as I find/file them. Please do the same.

Depends on: 1539961

(In reply to Chris Cooper [:coop] pronoun: he from comment #1)

Some of those require scriptworker (esp. for signing), although the existing scriptworker pools in AWS can be used transitionally.

We also need to make sure that workers in GCP can generate the proper signatures for chain-of-trust (bug 1545524).

Depends on: 1545524

Results from analyzing the builds are documented in bug 1546414.

Key findings:

  • build times are much slower for almost all build types
  • intermittent failures in the builds

I'm working through some issues with sccache on Windows in bug 1549346 and also need to modify our provisioning script to provision builders in the US only.

Depends on: 1549346, 1550468
Depends on: 1572236

Overall, there are some concerns around performance but no showstoppers.

Builds are slower than expected, perhaps to the tune of 30-40%. In particular, Windows PGO builds are slower, and tasks that have relatively short timeouts do hit those timeouts frequently, e.g. symbol uploads.

I wouldn't be comfortable moving builds over if they are 30% slower, but there are some mitigations we can put in place:

  • Optimize hg for GCP the same way we did for AWS - bug 1585133
  • Extend timeouts for some jobs (e.g. symbol uploads). I increased the timeout for symbol upload jobs to 30min and they all completed successfully.
  • Temporarily increase instance specs (e.g. Windows PGO builds). These are currently running on n1-highcpu-32 so we could bump that up to n1-highcpu-64.
  • Wait for n2 instances to become available, or create worker pools in us-central where beta n2s are already available. This might be useful for Windows instances.

I think we should wait for bug 1585133 to land because that will improve things across the board.

Attachment #9099440 - Attachment is obsolete: true
Depends on: 1585133

:bc - Connor's patch in bug 1585133 seems to have stuck. Have you been able to generate new perf numbers for validation? Is it worth creating a separate bug just to cover that work?

Flags: needinfo?(bob)

I have been working with a try push where I have been collecting the test results from using the linux builder workers as test workers. I have one iteration of the builds and 5 iterations of the tests so far and plan to add more build iterations to collect the statistics for the builds when I complete the tests. I don't think there is a need for a separate bug for now. I will update this bug with the build performance comparison when I have completed it. I plan to document the test results for the linux builder/test workers in bug 1577276.

Flags: needinfo?(bob)

Coop: I did a new try run with 20 builds. The resulting google sheet looks much better than before.

I found out last week that both the n2 and c2 instance families have hit general availability (GA) in GCP, so I did some renewed timings with both of them using my previous methodology with plain builds:

n2: https://treeherder.mozilla.org/#/jobs?repo=try&revision=65700ea553546df1ef64b032c70a8f7319340f37
Avg task time: 1381.77s (~23.0m)
Std. dev: 48.46
CoV: 0.04

c2: https://treeherder.mozilla.org/#/jobs?repo=try&revision=7e705f2899f617e1bcba10e7c45b67a63008f1f9
Avg task time: 1272.52s (~21.2m)
Std. dev: 21.75
CoV: 0.02

Synopsis: both instance types are still faster than our current AWS instances, and are several minutes faster than they were just a few months ago, likely due to the hg speed-ups. Google tells me they won't have enough capacity to run our peak workloads on c2, but n2 can shoulder the burden.
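
For reference, here is a minimal sketch (in Python) of how the averages, standard deviations and CoV values above could be recomputed, assuming the per-task durations in seconds have already been extracted from Treeherder into a list. The sample values below are made up for illustration, not the actual n2/c2 task times:

import statistics

def summarize(label, durations):
    # Average, standard deviation and coefficient of variation for a
    # list of task durations (in seconds).
    avg = statistics.mean(durations)
    std = statistics.stdev(durations)
    print(f"{label} avg task time: {avg:.2f}s (~{avg / 60:.1f}m)")
    print(f"{label} std. dev: {std:.2f}")
    print(f"{label} CoV: {std / avg:.2f}")

# Hypothetical durations, not the real try-push numbers.
summarize("n2", [1350.0, 1402.5, 1391.2, 1377.8, 1365.1])
summarize("c2", [1260.4, 1285.9, 1270.0, 1291.3, 1255.0])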

We're not planning on migrating anything this week, but I'll roll some patches to:

  1. Change our current GCP builders to n2 instances. This can land this week
  2. Migrate POSIX build load from AWS -> GCP, to be landed early next week after the migration/TCW

Adding bug 1546801 as a dependency, not for technical reasons but to avoid disrupting that migration by changing the build platform at the same time.

Depends on: tc-cloudops
Depends on: 1594583
Depends on: 1595623

I'm driving this, if not doing the work.

Assignee: nobody → coop
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED

Didn't mean to close this.

Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Depends on: 1597996
Depends on: 1587958
Depends on: 1598295
Depends on: 1601736
Depends on: 1604196
Depends on: 1607241

We are switching tier-1 builds to GCP.

Attachment #9121113 - Attachment description: Bug 1547111: Remove tier-3 GCP builds r=tomprince → Bug 1547111: Switch GCP and AWS tier builds r=tomprince
Attachment #9121113 - Attachment description: Bug 1547111: Switch GCP and AWS tier builds r=tomprince → Bug 1547111: Replace `-gcp` builds with `-aws` builds r=tomprince
Pushed by mozilla@hocat.ca:
https://hg.mozilla.org/integration/autoland/rev/540db822a1d4
Replace `-gcp` builds with `-aws` builds r=tomprince

There was a high rate of tasks failing with claim-expired, and it appears that sccache is not working (the tasks are getting write errors). Because of that and bug 1609568, I've backed it out.

Flags: needinfo?(wcosta)
Regressions: 1609568
Backout by dvarga@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/daf3b53c3efa
Backed out changeset 540db822a1d4 for causing bug 1609568

It feels like the worker-pool is lacking the scope auth:gcp:access-token:sccache-3/sccache-l{level}-us-central1@sccache-3.iam.gserviceaccount.com. I believe this is a misconfiguration in ci-configuration, but I could not track down where. Tom, can you figure this out?

Flags: needinfo?(wcosta) → needinfo?(mozilla)

It feels like everything is configured correctly. I opened a new pull request so that sccache reports an error when the request for an access token fails. The only remaining piece of the puzzle that I see is taskcluster-auth. :edunham, could you confirm that sccache-l3-us-central1@sccache-3.iam.gserviceaccount.com is in the auth allow list?

Flags: needinfo?(mozilla) → needinfo?(edunham)
Regressions: 1609949

Wander, sccache-l3-us-central1@sccache-3.iam.gserviceaccount.com is in the allowedServiceAccounts list configured for the sccache-3 project in gcp_credentials_allowed_projects for the auth service in firefoxci.

Stage previously had some L3 configuration which it turns out should never have been there, so that is being removed now. Stage has the same L1 and L2 accounts configured as firefoxci does, though.

Flags: needinfo?(edunham)

(In reply to Pulsebot from comment #24)

Backout by dvarga@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/daf3b53c3efa
Backed out changeset 540db822a1d4 for causing bug 1609568

== Change summary for alert #24699 (as of Thu, 16 Jan 2020 07:26:06 GMT) ==

Improvements:

24% improvement in build times: windows2012-64-noopt debug (taskcluster-c5.4xlarge), 2,609.43 s -> 1,992.32 s

For up to date results, see: https://treeherder.mozilla.org/perf.html#/alerts?id=24699

The scopes that are used are managed by the project:taskcluster:{trust_domain}:level-{level}-sccache-buckets role that is added a few lines above.

Attachment #9122369 - Attachment description: Bug 1547111: Fix sccache bucket perms for level 2/3 workers r=callek → Bug 1547111: Fix sccache bucket perms for level 2/3 workers r=Callek
Pushed by mozilla@hocat.ca:
https://hg.mozilla.org/integration/autoland/rev/e1a3a62f2035
Remove incorrect GCP sccache scope; r=Callek
Keywords: leave-open
Depends on: 1611255

The scope was added to tasks using sccache, but does not correspond to an actual bucket. Now that the code that added that scope is gone, we can remove the scope.

Depends on: 1609568
Depends on: 1609595

To be explicit, when we do the cutover, we want the try and release builds (levels 1-3) to switch from AWS to GCP, but we also want the tier 3 validation builds that we've been running in GCP to switch back to AWS. This is important for making sure builds continue to work in AWS in case we need to switch back from GCP to AWS at some point in the future.

Pushed by wcosta@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/9819d9e38727
Replace `-gcp` builds with `-aws` builds r=tomprince
Depends on: 1614852

:fubar is going to drive this going forward.

Assignee: coop → klibby

:tomprince, here is the NI I mentioned yesterday. I verified my test results from April 23 and found a 3% failure rate on bundles, but the failures were caught immediately (and did not cause a delayed timeout). I'll work on the image update in another bug and the ci-config test here.

  1. Can you, or can you direct me how to, test moving the Linux (+Android, +macOS) builds to GCP in the Taskcluster staging env -- or do we just need to test it in production like before? (I am guessing we want to keep some "shadow" builds in AWS for a few release cycles in case things break in GCP and we need to switch back.)

  2. I also want to test smaller instance sizes to minimize cost. Is that something I could test with ci-configuration in the Taskcluster staging environment? I want to target n2-standard -- pending Google's confirmation that they have the capacity now. I can manually test instances in the Google projects, but I'd like to test it "the right way"(?) with ci-config in staging so that we can have more confidence in the results when applying the change to production.

Assignee: klibby → dhouse
Flags: needinfo?(mozilla)

(tom met with me over zoom and explained the ci-admin/ci-config staging testing and we planned the steps for this)

Flags: needinfo?(mozilla)

:fubar, have you heard from our GCP contact about capacity for c2 (or, if not c2, then n2)? Re: coop's numbers from https://docs.google.com/spreadsheets/d/1Qe7MFyvce59Oqtugm62-Fs9H9f3shNKHhcwhLOjuVRc/edit#gid=0, c2-standard-16 looks best, but the note says "capacity limited".

Flags: needinfo?(klibby)

Did someone ever look at the bad performance from the sccache bucket? If I look at a random recent build on GCP, I see that 14.4% of the requests to the bucket take more than 1s (going up to 12.3s!), while the corresponding build on AWS has only 0.26% requests taking more than 1s (and max 3s).

Interestingly, it's the same Rust crate cache hit that yields the max time in both cases, which got me to look further: the cumulative download time is 483s on AWS (average 0.11s) and 2029.3s on GCP (average 0.49s). It seems that, overall, the sccache bucket on GCP is 4-5 times as slow as the one on AWS.

This has a noticeable impact on build times.
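
A rough sketch, again in Python, of the kind of comparison described above, assuming the per-request sccache download times in seconds have already been extracted from the build logs into lists; the log parsing itself is not shown and the numbers below are placeholders, not the real AWS/GCP data:

def summarize_bucket(name, times):
    # Fraction of requests slower than 1s, plus cumulative and average
    # download time, for one build's sccache requests.
    slow = sum(1 for t in times if t > 1.0)
    total = sum(times)
    print(f"{name}: {slow / len(times):.2%} of requests over 1s "
          f"(max {max(times):.1f}s), cumulative {total:.1f}s, "
          f"average {total / len(times):.2f}s")

# Placeholder timings for illustration only.
summarize_bucket("AWS", [0.08, 0.11, 0.09, 2.9, 0.10])
summarize_bucket("GCP", [0.35, 0.52, 1.4, 12.3, 0.48])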

In GCP, we have different GCP projects (represented by different worker manager providers) for level-1, level-3 and test workers. Thus we need to be able to vary the provider in variant worker-pool families.

Pushed by mozilla@hocat.ca:
https://hg.mozilla.org/ci/ci-admin/rev/f07536224105
Allow varying the worker-pool provider in variants; r=aki

(In reply to Mike Hommey [:glandium] from comment #47)

Did someone ever look at the bad performance from the sccache bucket? If I look at a random recent build on GCP, I see that 14.4% of the requests to the bucket take more than 1s (going up to 12.3s!), while the corresponding build on AWS has only 0.26% requests taking more than 1s (and max 3s).

Interestingly, it's the same Rust crate cache hit that yields the max time in both cases, which got me to look further: the cumulative download time is 483s on AWS (average 0.11s) and 2029.3s on GCP (average 0.49s). It seems that, overall, the sccache bucket on GCP is 4-5 times as slow as the one on AWS.

This has a noticeable impact on build times.

Thanks :glandium for seeing this! I did not look at sccache, and I don't know if anyone else did. Is this an average per task cumulative sum, or across all tasks for a period?

Do you know who has worked on the sccache for gcp (or aws)? If not, I'll need to ask around or search bugs.

Flags: needinfo?(mh+mozilla)

(In reply to Dave House [:dhouse] from comment #51)

Thanks :glandium for seeing this! I did not look at sccache, and I don't know if anyone else did. Is this an average per task cumulative sum, or across all tasks for a period?

Cumulative sum for one random build.

Do you know who has worked on the sccache for gcp (or aws)? If not, I'll need to ask around or search bugs.

No idea.

Flags: needinfo?(mh+mozilla)
Pushed by mozilla@hocat.ca:
https://hg.mozilla.org/ci/ci-configuration/rev/36f582746ac6
Refactor gcp workers using variants; r=aki

fubar is checking with Google on the n2-standard-16 instance capacity.

Flags: needinfo?(klibby)

:miles do you have a docker-worker GCP image ready that we could switch to in ci-config?

I see these from May 27th in the taskcluster-imaging project:
docker-worker-gcp-community-googlecompute-2020-05-27t20-17-36z
docker-worker-gcp-googlecompute-2020-05-27t18-59-51z

Are you running this "-community-" image in community, and could you (or I) switch firefox-ci to use this image, or is there a better one?

https://hg.mozilla.org/ci/ci-configuration/file/tip/worker-images.yml#l12:
monopacker-docker-worker-current: monopacker-docker-worker-2020-02-07t09-14-17z
monopacker-docker-worker-trusted-current: monopacker-docker-worker-gcp-trusted-2020-02-13t03-22-56z

Flags: needinfo?(miles)

:miles thank you for contacting me yesterday. Will you build a level 3 image to match the level 1, or can I use the same image, or who do I need to get the secrets from? (I'm assuming secrets are still baked into the images and require us to make separate images.)

Depends on: 1643562

Some related discussion of this appeared in #firefox-ci today starting at 2:10 pacific:

dustin
who typically bakes the docker-worker images in GCP?
I can, but there's a 49% chance I'll get the secrets wrong, so if someone else typically does it, that'd be great
last time was 2020-02-07
for example I see a secret named docker-worker/yaml/firefox-tc-production-l3-new.yaml, but that's from March 5
and I don't see an l1
for production
aki
hm, not sure
https://hg.mozilla.org/ci/ci-configuration/file/tip/grants.yml#l2300 points to miles and wander
miles
dustin: the naming scheme is lacking, looks like we are indeed missing production-l1
-new and -old was from the rotation in march
wander baked some images 5/27 that haven't been entered, we should probably re-do that at this point
because CoT isn't used for L1 I think the staging-l1 yaml has been used for all L1 images

(In reply to Dave House [:dhouse] from comment #58)

:miles thank you for contacting me yesterday. Will you build a level 3 image to match the level 1, or can I use the same image, or who do I need to get the secrets from? (I'm assuming secrets are still baked into the images and require us to make separate images.)

:miles could you build a GCP image for level 1 and level 3 for this? (Or, if I need to do it, who do I get the secrets from? I also saw the refactoring and changes in the repo; is monopacker in a ready state to build for GCP?)

bug 1643562 also needs new docker-worker images baked for GCP. I don't have the necessary access.

Flags: needinfo?(miles)
Component: General → Firefox-CI Administration
Product: Taskcluster → Release Engineering
QA Contact: mtabara
Type: defect → task
Depends on: 1749810
Depends on: 1749820
No longer blocks: tc-gcp

"Main" = everything that is not debug. Debug will have its own bug so we can use it for testing and getting started on this work.

QA Contact: mtabara → mgoossens
Summary: Migrate tier 1 builds from AWS to GCP → Migrate main tier 1 builds from AWS to GCP

Migration of debug builds is almost ready to land over in bug 1757602. In the meantime, I want to figure out what else is blocking shippable builds.

Hal, is there anything left to do on the SecOps side of things around migrating Linux shippable builds from AWS -> GCP?

Flags: needinfo?(hwine)

(In reply to Andrew Halberstadt [:ahal] from comment #64)

Hal, is there anything left to do on the SecOps side of things around migrating Linux shippable builds from AWS -> GCP?

I'm not sure we (secops) have been involved in this yet. :/

If I understand the situation correctly:

  • these Linux builds would be the first release builds to be shipped from GCP (it looks like the other OSes are still pending)
  • the scope of this bug is only linux builds (as there are other bugs for the other platforms)
  • we haven't yet done any security evaluation of Taskcluster in GCP, AFAIK (:ajvb to confirm)
  • there appear to be a couple of key elements not yet completed:
    • bug 1597771 should be completed first (or we need a longer discussion)
    • bug 1587958 is of interest, as these are the hg-mirrors-that-matter, and no review has been done yet

When I say "review" above, I believe the scope is more of a "mini RRA". We (secops) would want to go over any changes in workflow, permissions, and access between the aws tooling and the gcp tooling. The way things work isn't identical between the systems.

:ahal - what, if any, release builds have we been doing in GCP? If none, what other builds are running in gcp?
:ajvb - has secops (i.e. you 😏) done any review of tc-in-gcp yet?

Flags: needinfo?(hwine)
Flags: needinfo?(ahal)
Flags: needinfo?(abahnken)

There are currently no shippable builds happening in GCP. We have debug and opt builds running in GCP for Linux, Android, and OSX / Windows (cross-compiled). Come to think of it, all our builds might be on Linux due to cross-compiling. If there are release builds happening on other platforms, we can certainly punt on those and just drill down on Linux for now.

Flags: needinfo?(ahal)

:hwine - Nope! I had been involved in the original conversations (now over 2 years ago! wow) but have not been involved since then. I think your description of a "mini RRA" sounds great.

Flags: needinfo?(abahnken)

:ahal is there a way to turn off interactive tasks for a pool? I tested and was able to create one for the L3 GCP builder pool (I hit errors downloading the web interface because of path length, but it created the interactive task). We'll need to make sure the GCP firewalls are blocking incoming connections for the L3 builders, but turning the feature off would be "nice" so devs don't expect it to work.

Flags: needinfo?(ahal)

Hm, I'm not aware of any way to do this, no. I think even for AWS-based pools it's possible to create interactive level 3 tasks; it's just not possible to connect to them.

I agree that a scope error, or better yet not offering the "Create Interactive" button in the first place, would be a much better user experience, though that's probably out of scope for this bug. I think solving this will likely involve changes in Taskcluster itself.

Flags: needinfo?(ahal)

:ajvb -- long ago, ulfr raised this issue, and (name forgotten) was supposed to default this feature to "disabled", and then only explicitly enable for "try" (or some other non-L3 set of machines). Does that ring a bell?

Flags: needinfo?(abahnken)

(In reply to Hal Wine [:hwine] (use NI) from comment #70)

:ajvb -- long ago, ulfr raised this issue, and (name forgotten) was supposed to default this feature to "disabled", and then only explicitly enable for "try" (or some other non-L3 set of machines). Does that ring a bell?

No it doesn't :/ - But I do agree that this should be fixed.

Flags: needinfo?(abahnken)

So, for the purposes of the tier 1 migration, we only care about keeping the same blocking of access to the interactive feature, but:

  • :ahal can you open an appropriate TC bug to allow disabling the interactive task feature, please? CC both AJ & myself on that
  • :dhouse - how do you suggest we verify and/or monitor the L3 firewall setting for this, as you mentioned in comment #68?
Flags: needinfo?(dhouse)
Flags: needinfo?(ahal)

(In reply to Hal Wine [:hwine] (use NI) from comment #72)

  • :ahal can you open an appropriate TC bug to allow disabling the interactive task feature, please? CC both AJ & myself on that

I had previously filed https://github.com/taskcluster/taskcluster/issues/5225 which was similar. I tweaked the title and added some extra context. CC'ed you both.

Flags: needinfo?(ahal)
Summary: Migrate main tier 1 builds from AWS to GCP → Migrate shippable builds from AWS to GCP
No longer depends on: 1587958

I was looking at the RRA doc, and I believe we're good to go here (maybe other than :dhouse's pending needinfo): the CoT key has been updated in the image. There was also a recommendation around artifact storage, but that shouldn't be changing with this patch (we're just switching from EC2 -> GCE without touching the artifact storage). I also attempted to create an interactive task for a non-shippable build (which is running in GCP) and got the following error in my browser console:

Firefox can’t establish a connection to the server at wss://skkf5eiaaaayfup6iccmlbrvg4q2ik7mbtllkknbu7bcrxhq.taskcluster-worker.net:50724/1QXZS924S1axOiJ81jv0pg/shell.sock?...

So looks like the firewall is working \o/

Barring objections, and with a green light from relman, we'd like to attempt switching this over on Friday. Hal, can you think of any other reasons to hold off?

Flags: needinfo?(hwine)

I chatted with Hal out of band. He has at least two requirements before we proceed here:

  1. Resolve bug 1597771, either by investigating and deciding that the protections we currently have in place are good enough (maybe things have changed since that bug was filed), or by implementing the same safeguards we have in AWS if they are still missing.

  2. Make sure all non-SRE access to the project in GCP is revoked.

I'm about to head out on PTO -- get signoff from :rforsythe before flipping the switch, please. Open items are in comment 75.

Flags: needinfo?(hwine)

:aj do you have admin access to remove Michelle from the relops folder? https://console.cloud.google.com/iam-admin/iam?authuser=1&folder=723902893592 I no longer have full admin access.
At this point in the migration, we can remove that access, and Michelle can coordinate any needed changes through relops.

Flags: needinfo?(dhouse) → needinfo?(abahnken)

:dhouse - I do not have permissions to do this. I'd imagine the owning ops folks of this folder or a GCP admin from SRE can take care of this.

Flags: needinfo?(abahnken)

Hi Chris, is Dave's comment 77 something you can help out with?

Flags: needinfo?(cvalaas)

Nope, I don't have any permissions on that folder. :jason should be able to - or at least discover who does have permissions there.

Flags: needinfo?(cvalaas) → needinfo?(jthomas)
Flags: needinfo?(jthomas)
Assignee: dhouse → mgoossens

Looks like OPST-776 is complete, and all of Hal's remaining concerns from comment 75 have been addressed.

Since this change just rides the trains, I'm planning to:

  1. Land it on autoland
  2. Trigger some shippable builds
  3. Manually run verify_cot to test it out

If it doesn't work, I'll ask sheriffs to back out. If it does, the next test will be nightlies. Then sometime after all-hands we can uplift this to ESR (probably fine to let this ride the trains to release).

Hey Hal, did you want to do a final sign-off here?

Flags: needinfo?(hwine)

(In reply to Andrew Halberstadt [:ahal] from comment #84)

Hey Hal, did you want to do a final sign-off here?

lgtm
X <== marks the spot

Flags: needinfo?(hwine)
Pushed by ahalberstadt@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/7c0a787fe65a
Migrate shippable builds from AWS to GCP r=ahal,jmaher

And done!

Status: REOPENED → RESOLVED
Closed: 5 years ago -> 2 years ago
Keywords: leave-open
Resolution: --- → FIXED