1574648 - Migrate servo to community taskcluster deployment

Reporter

Description

•

5 years ago

Make a plan to move servo to the new community deployment.

Notes:
have a manually set up macOS builder in Toronto (tor) running 11.x
have some macstadium workers running 14.1.x
have a custom Windows AMI
using treeherder integration

Dustin J. Mitchell [:dustin] (he/him)

Reporter

Updated

•

5 years ago

Depends on: 1574651

Simon Sapin (:SimonSapin)

Comment 1

•

5 years ago

I think WebRender and Servo should be considered separately.

CI for https://github.com/servo/webrender is owned by the Gecko gfx team and uses two worker types:

aws-provisioner-v1/github-worker
localprovisioner/webrender-ci-osx with one mac mini in the Toronto office. I don’t know who has SSH access to that machine

CI for https://github.com/servo/servo/ is owned by the Servo team (managed in tree as much as possible) and has:

aws-provisioner-v1/servo-docker-worker, set up by :wcosta
aws-provisioner-v1/servo-docker-untrusted, a copy of servo-docker-worker with fewer scopes. Used for pre-review testing of pull requests (so anyone can run anything there)
aws-provisioner-v1/servo-win2016, with a custom AMI running generic-worker
proj-servo/macos, with multiple machines from Macstadium running generic-worker
proj-servo/docker-worker-kvm, disabled at the moment because of a perma-failure. So it’s not a priority, but we may want to bring it back at some point. This is for running tests in an Android emulator capable of OpenGL ES 3, which requires CPU acceleration, which requires KVM, which in turn requires VT-x CPU instructions that are not available inside AWS EC2 VMs. So we used https://github.com/taskcluster/taskcluster-infrastructure/tree/master/modules/docker-worker to deploy docker-worker to dedicated hardware from Packet.net

Dustin J. Mitchell [:dustin] (he/him)

Reporter

Comment 2

•

5 years ago

Notes from meeting with :SimonSapin and :jdm:

The two docker-worker worker types are configured identically (or soon will be, when staging's docker-worker is upgraded); the difference in scopes is associated with the roles tc-github assigns to the task, and the different worker-types serve merely as a boundary to prevent cross-contamination.
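To illustrate the boundary described above, a decision task might route unreviewed pull-request work to the untrusted worker type and everything else to the trusted one. This is only a minimal sketch, not Servo's actual decision code: the function name and the `event_kind` labels are hypothetical, and only the two worker-type names come from comment 1.

```python
# Worker-type names are taken from comment 1; everything else is illustrative.
TRUSTED_WORKER_TYPE = "servo-docker-worker"
UNTRUSTED_WORKER_TYPE = "servo-docker-untrusted"


def docker_worker_type(event_kind):
    """Pick a docker-worker worker type for a task.

    `event_kind` is a hypothetical label for what triggered the decision
    task, e.g. "pull_request" for pre-review testing or "push" for
    reviewed code.
    """
    if event_kind == "pull_request":
        # Anyone can run anything here, so keep it off the trusted workers.
        return UNTRUSTED_WORKER_TYPE
    return TRUSTED_WORKER_TYPE
```

The scopes themselves would still come from the roles tc-github assigns; the worker type only keeps the two kinds of work on separate machines.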

The servo-win2016* worker-types (including -staging) are based on a custom AMI residing in the servo AWS account and generated by a Python script that runs a powershell script in an instance. It currently only generates an AMI in one region.

The macstadium workers are provisioned using salt: https://github.com/servo/servo/tree/master/etc/taskcluster/macos

The packet worker-type can probably be left more-or-less intact. If we get bare-metal working in EC2, we could transition to that, but later.

There is a daily hook -- https://tools.taskcluster.net/hooks/project-servo/daily -- that runs the decision task, just as pushes and PRs do.

The "Treeherder Question" could have one of three answers:

Treeherder can ingest these messages
Servo could run in the Firefox-CI deployment
We could run a distinct Treeherder instance for the community deployment.
The first option is preferred, and that's bug 1574651.

As for administration, servo would prefer to be able to admin their own resources within the larger scope of the community deployment, rather than waiting for PRs to some configuration repository to be approved and merged.

It may be beneficial to do a partial migration, running both in parallel for a while. The decision task could be modified to, for example, only run the docker-worker tasks in the new deployment while running everything in the old deployment.
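One way the partial migration could look in a decision task is sketched here. It assumes each task is a dict with a hypothetical `worker_kind` key and that the decision task knows whether it is running in the "old" or "new" deployment; both labels are invented for this example.

```python
def tasks_for_deployment(tasks, deployment):
    """Return the subset of `tasks` that should run in `deployment`.

    `tasks` is a list of dicts with a hypothetical "worker_kind" key;
    `deployment` is "old" or "new" (labels invented for this sketch).
    """
    if deployment == "old":
        # The old deployment keeps running everything during the transition.
        return list(tasks)
    if deployment == "new":
        # Only docker-worker tasks are mirrored to the new deployment at first.
        return [t for t in tasks if t["worker_kind"] == "docker-worker"]
    raise ValueError(f"unknown deployment: {deployment!r}")
```

Widening the filter one worker kind at a time would let each platform be verified in the new deployment before the old one is switched off.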

Simon Sapin (:SimonSapin)

Comment 3

•

5 years ago

> As for administration, servo would prefer to be able to admin their own resources within the larger scope of the community deployment, rather than waiting for PRs to some configuration repository to be approved and merged.

It would also be fine to use a shared configuration repository if we (the Servo team) have access to review and deploy PRs to it. (With the understanding that we only use this access for PRs that only affect Servo’s CI.)

Dustin J. Mitchell [:dustin] (he/him)

Reporter

Comment 4

•

5 years ago

It looks like the Treeherder changes are tractable, and I'm working on them now. One question, though: do you use actions like retrigger in Treeherder, or just use it as a status display?

Simon Sapin (:SimonSapin)

Comment 5

•

5 years ago

Assuming "you" means the Servo team: I personally don’t use actions from Treeherder and usually don’t log into Treeherder at all. Josh, how about you? Or, as far as you know, other people on the team?

Flags: needinfo?(josh)

Dustin J. Mitchell [:dustin] (he/him)

Reporter

Comment 6

•

5 years ago

Haha, yeah, I guess the informal English pronoun should have been y'all :)

Josh Matthews [:jdm]

Comment 7

•

5 years ago

My attempts to use actions from the treeherder interface have been thwarted by https://github.com/servo/servo/issues/23217, so all my actions happen through the taskcluster interface.

Flags: needinfo?(josh)

Dustin J. Mitchell [:dustin] (he/him)

Reporter

Updated

•

5 years ago

Assignee: dustin → pmoore

Simon Sapin (:SimonSapin)

Comment 8

•

5 years ago

•

Edited

Hi Dustin. With bug 1574651 fixed, what are the next steps? Is the new deployment available at least for testing? Can we have both enabled on the same GitHub repository, during a transition period?

Flags: needinfo?(dustin)

Dustin J. Mitchell [:dustin] (he/him)

Reporter

Comment 9

•

5 years ago

Pete will be working with you on this.

Flags: needinfo?(dustin) → needinfo?(pmoore)

Dustin J. Mitchell [:dustin] (he/him)

Reporter

Updated

•

5 years ago

Assignee: pmoore → bstack

Simon Sapin (:SimonSapin)

Comment 10

•

5 years ago

> Can we have both enabled on the same GitHub repository, during a transition period?

I’ve submitted https://github.com/taskcluster/taskcluster/pull/1738 as an attempt to make this possible.

Dustin J. Mitchell [:dustin] (he/him)

Reporter

Comment 11

•

5 years ago

Thanks! That the idea was invented twice in the same day is a confirmation that the change is a good idea. So yes, let's get that landed and we can make a TC release. Those generally go live in about a day.

The new integration is https://github.com/apps/community-tc-integration.

About 16h ago we switched again and bstack is now assigned (sorry for the churn). We're aware that this is work that we've created for you, so we are happy to make PRs, test changes, etc. I'll leave it to bstack to figure out the next steps.

If it wasn't clear from above, the Treeherder issue was resolved to "Treeherder can ingest these messages" -- Treeherder will listen to messages from both the firefox-ci and community-tc deployments, and will "remember" which one is which.

Flags: needinfo?(pmoore)

Simon Sapin (:SimonSapin)

Comment 12

•

5 years ago

Is there a timeline for the current https://taskcluster.net deployment going away?

> we are happy to make PRs, test changes, etc.

Since you’re offering :) I guess some good next steps would be:

Ensure that we can distinguish in the GitHub Status API entries from both deployments. A different context string would be best. In handlers.js this appears to be based on this.context.cfg.app.statusContext
- Alternatively, maybe Servo needs to migrate to the Checks API first
Set up a servo “project” and an initial worker pool that runs docker-worker
Figure out how to give administrative access to the above to select Servo contributors, ideally even if they don’t have a Mozilla LDAP account.
A PR to https://github.com/servo/servo:
- Making .taskcluster.yml use https://github.com/taskcluster/taskcluster/pull/1738 so that GitHub push events and PR events trigger a decision task in both deployments
- Making etc/taskcluster/decision_task.py run pprint.pprint(os.environ) then exit early when running on the new deployment
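The last step could be sketched roughly as follows. This is not the real etc/taskcluster/decision_task.py: it assumes the deployments can be told apart through a `TASKCLUSTER_ROOT_URL` environment variable and that the new deployment's root URL is the one shown, neither of which is stated in this bug. The function names and return strings are placeholders.

```python
import os
import pprint

# Assumption: tasks can tell deployments apart by this environment variable,
# and this is the new deployment's root URL. Neither is stated in this bug.
ROOT_URL_VARIABLE = "TASKCLUSTER_ROOT_URL"
NEW_DEPLOYMENT_ROOT_URL = "https://community-tc.services.mozilla.com"


def running_on_new_deployment(environ):
    """Return True if `environ` looks like the new deployment's environment."""
    root_url = environ.get(ROOT_URL_VARIABLE, "")
    return root_url.rstrip("/") == NEW_DEPLOYMENT_ROOT_URL


def main(environ=None):
    environ = os.environ if environ is None else environ
    if running_on_new_deployment(environ):
        # Dump the environment to see what the new deployment provides,
        # then stop before scheduling any real work.
        pprint.pprint(dict(environ))
        return "exited early"
    return "ran full decision task"  # placeholder for the existing logic
```

Exiting early keeps the new deployment harmless while still confirming that GitHub events reach it and start a decision task.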

I think the first point is most important to safely enable https://github.com/apps/community-tc-integration on servo/servo without disrupting its CI.
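To show why distinct context strings matter, here is a minimal sketch of how a consumer of the GitHub Status API (a merge bot, for instance) could separate entries from the two deployments. The `context` key is part of the Status API's entries; the deployment labels and the context prefixes passed in are hypothetical, since the actual strings are not given in this bug.

```python
def split_statuses_by_deployment(statuses, context_prefixes):
    """Group commit statuses by the deployment that posted them.

    `statuses` is a list of dicts with a "context" key. `context_prefixes`
    maps a deployment label to the context prefix it is assumed to use;
    labels and prefixes are hypothetical. Unmatched statuses are collected
    under the "unknown" label.
    """
    grouped = {label: [] for label in context_prefixes}
    grouped["unknown"] = []
    # Try longer prefixes first so that one prefix being the start of
    # another cannot cause a misclassification.
    ordered = sorted(
        context_prefixes.items(), key=lambda item: len(item[1]), reverse=True
    )
    for status in statuses:
        for label, prefix in ordered:
            if status["context"].startswith(prefix):
                grouped[label].append(status)
                break
        else:
            grouped["unknown"].append(status)
    return grouped
```

If both deployments reported under the same context, no grouping like this would be possible and the two sets of results would overwrite each other on the commit.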

> treeherder will listen to messages from both the firefox-ci and community-tc deployments

And also the current deployment, as long as it still exists?

Unfortunately, the docs for treeherder integration seem to have been removed :( Is specifying a tc-treeherder.v2._/${tree}.${sha} route on tasks still what we need in community-tc?
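For reference, the route pattern quoted above expands like this. The helper name is hypothetical, and whether this route is still what community-tc expects is exactly the open question, so treat it as a sketch of the old convention only.

```python
def treeherder_route(tree, sha):
    """Build a route following the tc-treeherder.v2._/${tree}.${sha} pattern."""
    return f"tc-treeherder.v2._/{tree}.{sha}"
```

For a tree named `servo-auto` and a commit hash `abc123`, this yields `tc-treeherder.v2._/servo-auto.abc123`.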

Dustin J. Mitchell [:dustin] (he/him)

Reporter

Comment 13

•

5 years ago

> Set up a servo “project” and an initial worker pool that runs docker-worker

Done - https://github.com/mozilla/community-tc-config/pull/34. That's already applied, too, in the interest of expediency.

> Figure out how to give administrative access to the above to select Servo contributors, ideally even if they don’t have a Mozilla LDAP account.

Done -- community-tc uses GitHub auth!

Dustin J. Mitchell [:dustin] (he/him)

Reporter

Comment 14

•

5 years ago

Comment 13 was based on an irc conversation this morning, in an attempt to be expedient. So I didn't answer the rest of the questions in comment 12. Sorry to interrupt, brian!

Flags: needinfo?(bstack)

Simon Sapin (:SimonSapin)

Comment 15

•

5 years ago

After spending some time looking at tc-admin and community-tc-config (and with IRC help, thanks Dustin!) I came up with https://github.com/servo/taskcluster-config. I’m glad to be able to version-control all this.

One aspect that is still not clear to me is the cloud-provider-specific parts of worker pools. I think I’ll likely cargo-cult community-tc-config’s workers.py.

Simon Sapin (:SimonSapin)

Updated

•

5 years ago

Depends on: 1591591

Dustin J. Mitchell [:dustin] (he/him)

Reporter

Comment 16

•

5 years ago

I think that's fine.

Note, too, that we can, if you wish, continue to manage your worker pools while letting you manage everything else -- the advantage is that we then take care of upgrading worker versions, mitigating AWS or GCP issues, etc.

Simon Sapin (:SimonSapin)

Comment 17

•

5 years ago

Could we do that for only some worker pools?

For pools running docker-worker that sounds great. We’d like to have some control over instance types[1] and min/max capacity, but changes there should be infrequent enough that going through a PR to mozilla/community-tc-config sounds fine. We might need more frequent changes to the scopes granted to those workers but that can be managed separately, right?
For static workers (such as those running macOS) I don’t know which is preferable. There isn’t much configuration for them in worker-manager, is there?
For Windows it may be better that we manage the VM image, in order to be able for example to install another MSVC component in it. Deploying a new image sounds easier if that worker pool is managed in servo/taskcluster-config.

[1] By the way, I’d like at some point to benchmark different instance types. Compiling Servo benefits from more CPU cores but only to a point, so there’s likely a sweet spot to balance speed vs. cost.
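The speed/cost trade-off in the footnote comes down to simple arithmetic: the cost of one build is the hourly price times the build duration. A minimal sketch follows, with hypothetical function names and no real prices or timings; the measurements would have to come from an actual benchmark.

```python
def cost_per_build(hourly_price, build_seconds):
    """Cost of a single build, in the same currency as `hourly_price`."""
    return hourly_price * build_seconds / 3600.0


def cheapest_instance_type(measurements):
    """Return the instance type with the lowest cost per build.

    `measurements` maps an instance type name to a
    (hourly_price, build_seconds) pair obtained from benchmarking.
    """
    return min(
        measurements,
        key=lambda name: cost_per_build(*measurements[name]),
    )
```

This picks purely on cost. A larger instance can come out cheaper per build if it finishes proportionally faster, and finding a real sweet spot would also mean weighing how much the shorter wait is worth.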

Dustin J. Mitchell [:dustin] (he/him)

Reporter

Comment 18

•

5 years ago

> Could we do that for only some worker pools?

No problem! And the distinction you've described sounds like a good one.

Brian Stack [:bstack]

Assignee

Updated

•

5 years ago

Flags: needinfo?(bstack)

Dustin J. Mitchell [:dustin] (he/him)

Reporter

Comment 19

•

5 years ago

Simon, two things:

How's it going? Can we help? Are we on-track to turn off https://taskcluster.net in a week?
Since we are considering webrender a different project, there's a bit of a conflict inherent in configuring the servo project to manage the entire servo org, which includes servo/webrender. In https://github.com/mozilla/community-tc-config/pull/50/files#diff-70800a36c38d85e21390f3af255c8420L154 :pmoore has changed that to just manage servo/servo. Are there other repos in that org that should be included as well?

Simon Sapin (:SimonSapin)

Comment 20

•

5 years ago

> Are we on-track to turn off https://taskcluster.net in a week?

No, not at all. I was not aware of this target date, despite asking in comment 12 :/ Bug 1591591 is still something I hoped we could do before starting the migration for servo/servo.

> :pmoore has changed that to just manage servo/servo.

That sounds OK.

> Are there other repos in that org that should be included as well?

No, as far as I remember servo/servo and servo/webrender are the only two under servo/ using Taskcluster at the moment. When we want more, PRs to mozilla/community-tc-config to add them on a case-by-case basis sound fine.

Dustin J. Mitchell [:dustin] (he/him)

Reporter

Comment 21

•

5 years ago

https://github.com/mozilla/community-tc-config/pull/54 for worker pools

Dustin J. Mitchell [:dustin] (he/him)

Reporter

Comment 22

•

5 years ago

I gave Simon a user (SimonSapin) in the community workers AWS account with EC2 Read-Only access, in order to set up and debug the win2016 images expeditiously. We can remove that once it's in place (and once we have better mechanics for debugging worker instances that don't require EC2 access).

Simon Sapin (:SimonSapin)

Comment 23

•

5 years ago

With aws-provisioner I used this when instances running a new AMI were starting but not picking up tasks, in order to find their public IP address and RDP in so that I could read generic-worker log files.

> We can remove that once it's in place

I may need this again for future AMI updates, until there’s some other way to find the IP address or read logs.

Simon Sapin (:SimonSapin)

Comment 24

•

5 years ago

With help on IRC from Dustin and tomprince I’ve now managed to configure a Windows / AWS worker pool and run a task.

Bug 1591591 is the next blocker.

Dustin J. Mitchell [:dustin] (he/him)

Reporter

Updated

•

5 years ago

Depends on: 1593543

Dustin J. Mitchell [:dustin] (he/him)

Reporter

Comment 25

•

5 years ago

I believe bug 1593543 is now fixed, and anyhow had a temporary workaround.

Is there anything else I can do to assist or unblock?

Dustin J. Mitchell [:dustin] (he/him)

Reporter

Comment 26

•

5 years ago

Per irc, this is all set! THANK YOU!

Status: NEW → RESOLVED

Closed: 5 years ago

Resolution: --- → FIXED

Simon Sapin (:SimonSapin)

Comment 27

•

5 years ago

https://treeherder.allizom.org/#/jobs?repo=servo-auto is showing task data as expected (and I hear treeherder.mozilla.org will too soon)
https://github.com/servo/servo/pull/24689 has landed
https://github.com/servo/saltfs/pull/986 is deployed
https://github.com/apps/taskcluster is uninstalled from servo/servo
https://github.com/servo/servo/pull/24697 cleans up some loose ends

I think we’re done! Just in time for tomorrow.