Closed Bug 1465851 Opened 3 years ago Closed 2 years ago

ci-admin: manage workerTypes

Categories

(Firefox Build System :: Task Configuration, task)

task
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: dustin, Assigned: dustin)

References

(Blocks 1 open bug)

Details

Attachments

(5 files, 1 obsolete file)

Once we do not have secrets in workerTypes, ci-admin can manage them.  This will allow managing the list of available workerTypes, AMIs, security groups, regions, the whole lot.

As part of this, we can also add workerType "aliases", so that we can retire workerTypes.  The in-tree code would look up its locally-configured workerType and, if that is an alias for another, use the aliased workerType instead.
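A minimal sketch of how that alias lookup might work, assuming aliases are stored as a simple name-to-name mapping. The function name `resolve_worker_type` and the entries in `ALIASES` are hypothetical and only illustrate the idea; nothing here reflects actual in-tree code.

```python
# Hypothetical alias table: retired workerType -> workerType to use instead.
# The pairings below are invented for illustration only.
ALIASES = {
    "opt-linux64": "gecko-1-b-linux",
    "dbg-linux64": "gecko-1-b-linux",
}


def resolve_worker_type(name, aliases):
    """Follow alias links until reaching a name that is not itself an alias.

    Raises ValueError if the aliases form a cycle.
    """
    seen = [name]
    while name in aliases:
        name = aliases[name]
        if name in seen:
            raise ValueError("alias cycle: " + " -> ".join(seen + [name]))
        seen.append(name)
    return name
```

A workerType that is not listed as an alias resolves to itself, so in-tree configuration that already uses a current name would be unaffected.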
We can also configure metadata about the workerTypes in the worker manager with this system.
Since worker types will be defined in code in ci-admin, presumably this will mean that the JSON actually used for the final definition is derived. And this means that we can leave detailed inline comments in ci-admin. If true, this will make me *extremely* happy. You don't know how many times I've fretted over changing a worker definition because I was full of FUD over the history of the settings.
One of my OKRs should be making you happy!

That said, this bug depends on getting secrets out of those workerType definitions, which has been in progress at a low level for quite a while now.
Dustin: does this also imply maintaining a history of workerTypes and associated deployments?
Yes :)
Assignee: dustin → nobody
Assignee: nobody → dustin

Here are the descriptions of the AMIs for each of the gecko-related workerTypes:

ami-test                       692406183521/taskcluster-docker-worker-overlay2-1555602245
ami-test-pv                    692406183521/taskcluster-docker-worker-overlay2-PV-1516498823
android-api-15                 692406183521/taskcluster-docker-worker-overlay2-1555602245
balrog                         692406183521/taskcluster-docker-worker-overlay2-1555602245
dbg-linux32                    692406183521/taskcluster-docker-worker-overlay2-1555602245
dbg-linux64                    692406183521/taskcluster-docker-worker-overlay2-1555602245
dbg-macosx64                   692406183521/taskcluster-docker-worker-overlay2-1555602245
desktop-test                   692406183521/taskcluster-docker-worker-overlay2-1555602245
desktop-test-large             692406183521/taskcluster-docker-worker-overlay2-1555602245
desktop-test-xlarge            692406183521/taskcluster-docker-worker-overlay2-1555602245
fp-gecko-3-b-linux             692406183521/taskcluster-docker-worker-overlay2-trusted-1532606009
gecko-1-b-android              692406183521/taskcluster-docker-worker-overlay2-1555602245
gecko-1-b-linux                692406183521/taskcluster-docker-worker-overlay2-1555602245
gecko-1-b-linux-gps            692406183521/taskcluster-docker-worker-overlay2-1525347283
gecko-1-b-linux-large          692406183521/taskcluster-docker-worker-overlay2-1555602245
gecko-1-b-linux-usw2           692406183521/taskcluster-docker-worker-overlay2-1532606008
gecko-1-b-linux-xlarge         692406183521/taskcluster-docker-worker-overlay2-1555602245
gecko-1-b-macosx64             692406183521/taskcluster-docker-worker-overlay2-1555602245
gecko-1-b-win2012              Gecko builder for Windows; TaskCluster worker type: gecko-1-b-win2012, OCC version cae6a929f27c, https://github.com/mozilla-releng/OpenCloudConfig.git/tree/cae6a929f27cd1c24b2a3a7302d25abc1e91ad4b}
gecko-1-b-win2012-beta         Gecko builder for Windows; TaskCluster worker type: gecko-1-b-win2012-beta, OCC version 1efc404df03a, https://github.com/mozilla-releng/OpenCloudConfig.git/tree/1efc404df03a9e392b7ae67319bf7a3055bf25d8}
gecko-1-decision               692406183521/taskcluster-docker-worker-overlay2-1555602245
gecko-1-images                 692406183521/taskcluster-docker-worker-overlay2-1555602245
gecko-1-linux-shared           692406183521/taskcluster-docker-worker-overlay2-1521032225
gecko-2-b-android              692406183521/taskcluster-docker-worker-overlay2-1555602245
gecko-2-b-linux                692406183521/taskcluster-docker-worker-overlay2-1555602245
gecko-2-b-linux-large          692406183521/taskcluster-docker-worker-overlay2-1555602245
gecko-2-b-linux-xlarge         692406183521/taskcluster-docker-worker-overlay2-1555602245
gecko-2-b-macosx64             692406183521/taskcluster-docker-worker-overlay2-1555602245
gecko-2-b-win2012              Gecko builder for Windows; TaskCluster worker type: gecko-2-b-win2012, OCC version cae6a929f27c, https://github.com/mozilla-releng/OpenCloudConfig.git/tree/cae6a929f27cd1c24b2a3a7302d25abc1e91ad4b}
gecko-2-decision               692406183521/taskcluster-docker-worker-overlay2-1555602245
gecko-2-images                 692406183521/taskcluster-docker-worker-overlay2-1555602245
gecko-2-linux-shared           692406183521/taskcluster-docker-worker-overlay2-1521032225
gecko-3-b-android              692406183521/taskcluster-docker-worker-overlay2-trusted-1555602246
gecko-3-b-linux                692406183521/taskcluster-docker-worker-overlay2-trusted-1555602246
gecko-3-b-linux-large          692406183521/taskcluster-docker-worker-overlay2-trusted-1555602246
gecko-3-b-linux-xlarge         692406183521/taskcluster-docker-worker-overlay2-trusted-1555602246
gecko-3-b-macosx64             692406183521/taskcluster-docker-worker-overlay2-trusted-1555602246
gecko-3-b-win2012              Gecko builder for Windows; TaskCluster worker type: gecko-3-b-win2012, OCC version cae6a929f27c, https://github.com/mozilla-releng/OpenCloudConfig.git/tree/cae6a929f27cd1c24b2a3a7302d25abc1e91ad4b}
gecko-3-b-win2012-c4           Gecko builder for Windows; TaskCluster worker type: gecko-3-b-win2012-c4, OCC version cae6a929f27c, https://github.com/mozilla-releng/OpenCloudConfig.git/tree/cae6a929f27cd1c24b2a3a7302d25abc1e91ad4b}
gecko-3-b-win2012-c5           Gecko builder for Windows; TaskCluster worker type: gecko-3-b-win2012-c5, OCC version cae6a929f27c, https://github.com/mozilla-releng/OpenCloudConfig.git/tree/cae6a929f27cd1c24b2a3a7302d25abc1e91ad4b}
gecko-3-decision               692406183521/taskcluster-docker-worker-overlay2-trusted-1555602246
gecko-3-images                 692406183521/taskcluster-docker-worker-overlay2-trusted-1555602246
gecko-3-linux-shared           692406183521/taskcluster-docker-worker-overlay2-1521032225
gecko-3-t-linux-xlarge         692406183521/taskcluster-docker-worker-overlay2-trusted-1555602246
gecko-misc                     692406183521/taskcluster-docker-worker-overlay2-1555602245
gecko-t-linux-large            692406183521/taskcluster-docker-worker-overlay2-1555602245
gecko-t-linux-xlarge           692406183521/taskcluster-docker-worker-overlay2-1555602245
gecko-t-win10-64               Gecko tester for Windows 10 64 bit; TaskCluster worker type: gecko-t-win10-64, OCC version cae6a929f27c, https://github.com/mozilla-releng/OpenCloudConfig.git/tree/cae6a929f27cd1c24b2a3a7302d25abc1e91ad4b}
gecko-t-win10-64-alpha         Gecko tester for Windows 10 64 bit; TaskCluster worker type: gecko-t-win10-64-alpha, OCC version cae6a929f27c, https://github.com/mozilla-releng/OpenCloudConfig.git/tree/cae6a929f27cd1c24b2a3a7302d25abc1e91ad4b}
gecko-t-win10-64-beta          Gecko tester for Windows 10 64 bit; TaskCluster worker type: gecko-t-win10-64-beta, OCC version 1efc404df03a, https://github.com/mozilla-releng/OpenCloudConfig.git/tree/1efc404df03a9e392b7ae67319bf7a3055bf25d8}
gecko-t-win10-64-cu            Gecko tester for Windows 10 64 bit; TaskCluster worker type: gecko-t-win10-64-cu, OCC version cae6a929f27c, https://github.com/mozilla-releng/OpenCloudConfig.git/tree/cae6a929f27cd1c24b2a3a7302d25abc1e91ad4b}
gecko-t-win10-64-gpu           Gecko tester for Windows 10 64 bit; TaskCluster worker type: gecko-t-win10-64-gpu, OCC version cae6a929f27c, https://github.com/mozilla-releng/OpenCloudConfig.git/tree/cae6a929f27cd1c24b2a3a7302d25abc1e91ad4b}
gecko-t-win10-64-gpu-a         Gecko tester for Windows 10 64 bit; TaskCluster worker type: gecko-t-win10-64-gpu-a, OCC version cae6a929f27c, https://github.com/mozilla-releng/OpenCloudConfig.git/tree/cae6a929f27cd1c24b2a3a7302d25abc1e91ad4b}
gecko-t-win10-64-gpu-b         Gecko tester for Windows 10 64 bit; TaskCluster worker type: gecko-t-win10-64-gpu-b, OCC version 1efc404df03a, https://github.com/mozilla-releng/OpenCloudConfig.git/tree/1efc404df03a9e392b7ae67319bf7a3055bf25d8}
gecko-t-win7-32                Gecko test worker for Windows 7 32 bit; TaskCluster worker type: gecko-t-win7-32, OCC version 474f7675e40a, https://github.com/mozilla-releng/OpenCloudConfig.git/tree/474f7675e40a07399d9c56562930edb1af148194}
gecko-t-win7-32-beta           Gecko test worker for Windows 7 32 bit; TaskCluster worker type: gecko-t-win7-32-beta, OCC version 421ceabe3446, https://github.com/mozilla-releng/OpenCloudConfig.git/tree/421ceabe3446014a4ebb06672ffee975b06cd3a1}
gecko-t-win7-32-cu             Gecko test worker for Windows 7 32 bit; TaskCluster worker type: gecko-t-win7-32-cu, OCC version 474f7675e40a, https://github.com/mozilla-releng/OpenCloudConfig.git/tree/474f7675e40a07399d9c56562930edb1af148194}
gecko-t-win7-32-gpu            Gecko test worker for Windows 7 32 bit; TaskCluster worker type: gecko-t-win7-32-gpu, OCC version 474f7675e40a, https://github.com/mozilla-releng/OpenCloudConfig.git/tree/474f7675e40a07399d9c56562930edb1af148194}
gecko-t-win7-32-gpu-b          Gecko test worker for Windows 7 32 bit; TaskCluster worker type: gecko-t-win7-32-gpu-b, OCC version a7677c181815, https://github.com/mozilla-releng/OpenCloudConfig.git/tree/a7677c181815298e015a731dbecba20e9c903ff9}
github-worker                  692406183521/taskcluster-docker-worker-overlay2-1555602245
hg-worker                      692406183521/taskcluster-docker-worker-overlay2-1555602245
mobile-1-b-andrcmp             692406183521/taskcluster-docker-worker-overlay2-1555602245
mobile-1-b-fenix               692406183521/taskcluster-docker-worker-overlay2-1555602245
mobile-1-b-ref-browser         692406183521/taskcluster-docker-worker-overlay2-1555602245
mobile-1-decision              692406183521/taskcluster-docker-worker-overlay2-1555602245
mobile-1-images                692406183521/taskcluster-docker-worker-overlay2-1555602245
mobile-3-b-andrcmp             692406183521/taskcluster-docker-worker-overlay2-trusted-1555602246
mobile-3-b-fenix               692406183521/taskcluster-docker-worker-overlay2-trusted-1555602246
mobile-3-b-ref-browser         692406183521/taskcluster-docker-worker-overlay2-trusted-1555602246
mobile-3-decision              692406183521/taskcluster-docker-worker-overlay2-trusted-1555602246
mobile-3-images                692406183521/taskcluster-docker-worker-overlay2-trusted-1555602246
mozillaonline-1-b-linux        692406183521/taskcluster-docker-worker-overlay2-1555602245
mozillaonline-3-b-linux        692406183521/taskcluster-docker-worker-overlay2-1555602245
mulet-debug                    692406183521/taskcluster-docker-worker-overlay2-1555602245
mulet-opt                      692406183521/taskcluster-docker-worker-overlay2-1555602245
nss-win2012r2                  firefox desktop builds on windows - taskcluster worker - version CdcEDDWnQemvUVYvrNE_RA
nss-win2012r2-new              firefox desktop builds on windows - taskcluster worker - version Y8MTXWoeTOyC1y2r8H-L7Q
opt-linux32                    692406183521/taskcluster-docker-worker-overlay2-1555602245
opt-linux64                    692406183521/taskcluster-docker-worker-overlay2-1555602245
opt-macosx64                   692406183521/taskcluster-docker-worker-overlay2-1555602245
rustbuild                      692406183521/taskcluster-docker-worker-overlay2-1555602245
spidermonkey                   692406183521/taskcluster-docker-worker-overlay2-1555602245
symbol-upload                  692406183521/taskcluster-docker-worker-overlay2-1555602245
taskcluster-generic            692406183521/taskcluster-docker-worker-overlay2-1555602245
taskcluster-images             692406183521/taskcluster-docker-worker-overlay2-trusted-1555602246
version-control-tools          692406183521/taskcluster-docker-worker-overlay2-1532606008
win2012r2                      firefox desktop builds on windows - taskcluster worker - version S7C-no6lTuiuECEONV4qOA

We won't be managing Windows workerTypes initially, so those are safe to ignore, and the rest look pretty straightforward.

Blocks: 1552242

I had a hard think about this and came up with a pretty minimal approach to start with, allowing room to grow.

There are a few things I would like to accomplish:

  • get a human-written, human-reviewed change history for worker types so we're not wondering "who broke X" or "why Y is configured that way"
  • simplify configuration of workerTypes so it's more human-readable, with fewer illegible sg-abcdef and ami-12345678 identifiers
  • support the transition to worker-manager at the same time as we transition from https://taskcluster.net to the new deployment

This set of patches is a start toward the first point, covering all docker-worker workers for which there are scope grants in grants.yml. It does not address the second point at all. I think we should handle that in worker-manager, although I don't think anyone knows how yet. It does address the third point in that it treats aws-provisioner workerTypes as distinct from worker-manager workerTypes, and will configure only the former on https://taskcluster.net, and (once they are implemented) only the latter on any other deployment. Since we're also changing the provisionerId of all of those workers, that works out just fine.
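A rough sketch of the per-deployment split described above, assuming each configured entry carries a `kind` field naming the service that manages it. The field name, the `pools_for_deployment` helper, and the example root URLs other than https://taskcluster.net are all hypothetical.

```python
LEGACY_ROOT_URL = "https://taskcluster.net"


def pools_for_deployment(pools, root_url):
    """Select the entries that should be configured on a given deployment.

    aws-provisioner workerTypes apply only to the legacy deployment;
    worker-manager entries apply only to every other deployment.
    """
    if root_url.rstrip("/") == LEGACY_ROOT_URL:
        wanted = "aws-provisioner"
    else:
        wanted = "worker-manager"
    return [p for p in pools if p["kind"] == wanted]
```

Because the two kinds never overlap on a single deployment, the same configuration file could describe both during the transition.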

The worker-types.yml in the fourth patch is generated from a script and when run through ci-admin diff produces no differences. AWS Provisioner has some janky bits in that there are a bunch of unused properties and lots of rarely-used properties that default to empty containers. So I added some light editing to the to_api/from_api functions and to the generator to allow worker-types.yml to omit all but the relevant details. With this in place, at least a single worker-type in that file should fit on a single editor screen.
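To illustrate the kind of light editing meant here, a minimal sketch that omits default-valued properties when reading from the API and restores them when writing back. The `to_api`/`from_api` names come from the patches, but these bodies and the `DEFAULTS` table are assumptions for illustration, not the code in the patchset; the real AWS Provisioner schema has more properties.

```python
import copy

# Assumed defaults for rarely-used properties (illustrative only).
DEFAULTS = {
    "secrets": {},
    "scopes": [],
    "userData": {},
    "availabilityZones": [],
    "description": "",
}


def from_api(api_def):
    """Drop properties that equal their default, for a compact YAML entry."""
    return {
        key: value
        for key, value in api_def.items()
        if key not in DEFAULTS or value != DEFAULTS[key]
    }


def to_api(compact):
    """Restore omitted defaults to rebuild the full API payload."""
    full = copy.deepcopy(DEFAULTS)
    full.update(copy.deepcopy(compact))
    return full
```

As long as `to_api(from_api(x)) == x` holds for every existing definition, a diff against production should stay empty, which is the "changes nothing" property described above.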

Future plans:

  • update the docker-worker deployment process to modify ci-configuration instead of calling API methods directly
  • bring generic-worker workers into the fold, adding them to ci-configuration and modifying occ and the generic-worker deploy process to modify ci-configuration instead of calling API methods directly
  • configure worker-manager workerTypes with this tool, too
    • possibly add code to support whatever we do to address the second point above ("simplify configuration ..")

Before I get into code review, I'd like a, say, 50% review from a broad swath of you as to the overall direction here. You can probably get that just by looking at D32085 and reading this comment.

Flags: needinfo?(wcosta)
Flags: needinfo?(rthijssen)
Flags: needinfo?(pmoore)
Flags: needinfo?(mozilla)
Flags: needinfo?(bugzeeeeee)
Flags: needinfo?(bstack)

I cursorily looked over the non-D32085 patches and they make sense to me. D32085 itself is really neat and something I've wanted for a looong time. It is quite easy to read compared to aws-provisioner UI, even without the improvements you plan on making.

On point 2 -- figuring out where to draw the line for what magic worker-manager provides will be quite interesting. I don't have a great idea of how that process will work in the cleanest way. I think most of the identifiers like sg-abcdef and ami-12345678 will come from tooling like terraform/packer/etc so keeping the magic in ci-admin based on config files actually still makes the most sense to me? I am not holding this position strongly however.

Flags: needinfo?(bstack)

lgtm

Flags: needinfo?(rthijssen)

(In reply to Dustin J. Mitchell [:dustin] pronoun: he from comment #11)

The worker-types.yml in the fourth patch is generated from a script and when run through ci-admin diff produces no differences.

Is the script it was generated from included in the patchset?

Flags: needinfo?(pmoore)

Is the script it was generated from included in the patchset

No, the script is a terrible hack job. I only mentioned that to indicate that (a) I can "rebase" this patchset over any worker-type modifications that occur before it lands and (b) this has been tested to have zero impact on production. I'll attach the script here but it's not suitable for landing anywhere.

Attached file wt2.py (obsolete) —

used-workertypes.txt is the list of workertypes for which scopes are granted, and existing-workertypes.json is the set of existing workertypes downloaded out-of-band from the production AWS provisioner. The images-*.json are lists of AMIs pulled from AWS with the AMI description in them.

(In reply to Dustin J. Mitchell [:dustin] pronoun: he from comment #11)

I would like to leave the management of AMI IDs to docker-worker. Today it is pretty simple and fast to roll back a deployment in case of bustage. If we delegate to ci-admin, this would mean writing a new patch to revert the changes, getting it reviewed, and then having someone with the necessary permissions run it.

Flags: needinfo?(wcosta)

(In reply to Wander Lairson Costa [:wcosta] from comment #17)

I would like to leave the management of AMI IDs to docker-worker. Today it is pretty simple and fast to roll back a deployment in case of bustage. If we delegate to ci-admin, this would mean writing a new patch to revert the changes, getting it reviewed, and then having someone with the necessary permissions run it.

How does this work in generic-worker?

I don't want to design this system around docker-worker, but if it also benefits generic-worker deployments, then we should consider it.

Flags: needinfo?(pmoore)

Wander and I chatted a little. There are some circumstances that allay the rollback concerns: rolling back a ci-configuration change is a normal and quick process (and can even be done without landing the rollback, since ci-admin is run by hand). Also, once staging is running in the next month or so, we'll be able to test -- both at the ami-test level and at the level of a full gecko push -- in staging, so hopefully there will be fewer calls for rollbacks. And, ci-admin can configure staging, too, so we can be confident it's running the same configuration.

We talked about how a docker-worker build process will feed into this configuration. Editing yaml files in place is not a great option. Wander suggested having ci-admin look into the GitHub releases (https://github.com/taskcluster/docker-worker/releases/, but via the API) to translate a version string in worker-types.yml into a set of AMI IDs. That's a little unusual for ci-admin since it means it's reaching out to an external service for deployment data, but maybe that's OK. Tom, what do you think? An alternative might be to reflect that release data into ci-configuration as static data, using some script to update that static data.

This is looking good. A few comments:

  • whether or not worker-manager ends up having configuration for subnets and the like, I don't think we want to configure those by region in each worker type. Similarly, given that we only have a handful of sets of AMIs, I don't think we want those listed in the worker-types themselves. That suggests that we want an additional set of configurations for both of those, which get referenced by the worker-types.
  • I'd rather not refer directly to GitHub release artifacts, largely because they are mutable, and so while I don't expect them to be deliberately changed, their history isn't easily auditable. So, I'd prefer to have a script that pulls the AMI info and vendors it in the ci-config repository. And we can investigate automating running that with a scriptworker at some point in the future.
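A sketch of what both suggestions could look like together, assuming the vendored AMI info is stored as named image sets keyed by region and each worker type refers to a set by name. The `IMAGE_SETS` layout, the `image_set` and `regions` keys, the `resolve_images` helper, and the AMI IDs are all placeholders, not real configuration.

```python
# Hypothetical vendored data: image set name -> region -> AMI ID.
# The AMI IDs are placeholders.
IMAGE_SETS = {
    "docker-worker": {
        "us-east-1": "ami-00000001",
        "us-west-2": "ami-00000002",
    },
    "docker-worker-trusted": {
        "us-east-1": "ami-00000003",
        "us-west-2": "ami-00000004",
    },
}


def resolve_images(worker_type, image_sets):
    """Return {region: AMI ID} for the regions a worker type runs in."""
    set_name = worker_type["image_set"]
    if set_name not in image_sets:
        raise ValueError("unknown image set: " + set_name)
    images = image_sets[set_name]
    missing = [r for r in worker_type["regions"] if r not in images]
    if missing:
        raise ValueError(
            "image set %s has no AMI for: %s" % (set_name, ", ".join(missing))
        )
    return {region: images[region] for region in worker_type["regions"]}
```

With this shape, rolling a new AMI out would be a reviewed change to the vendored image-set data rather than an edit to every worker type that uses it.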
Flags: needinfo?(mozilla)

To the first point, I played around with that for a bit, and it's not clear how best to handle that, or where to stop with the abstraction. Since this is to support configs for a deprecated service (aws-provisioner), and since we don't have any history for why worker-type X is subtly different from worker-type Y, I don't think it's worth over-engineering the DRYness of the representation. It's also a little premature, as for example generic-worker uses a distinct AMI and distinct userdata for each workerType, and that won't abstract the same way.

In general, I see this as an initial beachhead into managing workertypes (or as we will call them soon, worker pools). There are a bunch of next-steps to take, and as we do so I'd like to stick to the easy verification of "changes nothing". In a few weeks, as we implement worker-manager, I'd like to simplify the config definitions within that tool, hopefully resulting in simpler configs in ci-configuration as well. We're still not sure how to do that.

Vendoring the release data sounds like a good idea, too. Since docker-worker is also not long for this world, I don't think there's a use in automating it.

Attachment #9066582 - Attachment description: Bug 1465851 - read worker-types.yml → Bug 1465851 - read worker-{pools,images}.yaml from ci-config
Attachment #9066702 - Attachment is obsolete: true
Attached file wt2.py
Attachment #9066587 - Attachment description: Bug 1465851 - add worker-types.yml based on current docker-worker deployment → Bug 1465851 - add worker info based on current docker-worker deployment
Attachment #9066587 - Attachment description: Bug 1465851 - add worker info based on current docker-worker deployment → Bug 1465851 - add worker info based on current docker-worker deployment r=tomprince

lgtm

Flags: needinfo?(bugzeeeeee)
Attachment #9066581 - Attachment description: Bug 1465851 - add support for AWS Provisioner WorkerType objects → Bug 1465851 - add support for AWS Provisioner WorkerType objects r=tomprince
Attachment #9066582 - Attachment description: Bug 1465851 - read worker-{pools,images}.yaml from ci-config → Bug 1465851 - read worker-{pools,images}.yaml from ci-config r=tomprince
Attachment #9066583 - Attachment description: Bug 1465851 - generate AWS provisioner workerTypes → Bug 1465851 - generate AWS provisioner workerTypes r=tomprince
Flags: needinfo?(pmoore)
Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Blocks: 1566931