Closed Bug 1574648 Opened 5 years ago Closed 5 years ago

Migrate servo to community taskcluster deployment

Categories

(Taskcluster :: Operations and Service Requests, task)


Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: dustin, Assigned: bstack)


Make a plan to move servo to the new community deployment.

Notes:
have a manually set up macos builder in tor running 11.x
have some macstadium workers running 14.1.x
have a custom Windows AMI
using treeherder integration

Depends on: 1574651

I think WebRender and Servo should be considered separately.

CI for https://github.com/servo/webrender is owned by the Gecko gfx team and uses two worker types:

  • aws-provisioner-v1/github-worker
  • localprovisioner/webrender-ci-osx with one mac mini in the Toronto office. I don’t know who has SSH access to that machine.

CI for https://github.com/servo/servo/ is owned by the Servo team (managed in tree as much as possible) and has:

  • aws-provisioner-v1/servo-docker-worker, set up by :wcosta
  • aws-provisioner-v1/servo-docker-untrusted, a copy of servo-docker-worker with fewer scopes. Used for pre-review testing of pull requests (so anyone can run anything there)
  • aws-provisioner-v1/servo-win2016, with a custom AMI running generic-worker
  • proj-servo/macos, with multiple machines from Macstadium running generic-worker
  • proj-servo/docker-worker-kvm, disabled at the moment because of a perma-failure. So it’s not a priority, but we may want to bring it back at some point. This is for running tests in an Android emulator capable of OpenGLES 3, which requires CPU acceleration, which requires running KVM, which requires VT-x CPU instructions that are not available inside AWS EC2 VMs. So we used https://github.com/taskcluster/taskcluster-infrastructure/tree/master/modules/docker-worker to deploy docker-worker to dedicated hardware from Packet.net

Notes from meeting with :SimonSapin and :jdm:

The two docker-worker worker types are configured identically (or soon will be, when staging's docker-worker is upgraded); the difference in scopes is associated with the roles tc-github assigns to the task, and the different worker-types serve merely as a boundary to prevent cross-contamination.

The servo-win2016* worker-types (including -staging) are based on a custom AMI residing in the servo AWS account and generated by a Python script that runs a powershell script in an instance. It currently only generates an AMI in one region.

The macstadium workers are provisioned using salt: https://github.com/servo/servo/tree/master/etc/taskcluster/macos

The packet worker-type can probably be left more-or-less intact. If we get bare-metal working in EC2, we could transition to that, but later.

There is a daily hook -- https://tools.taskcluster.net/hooks/project-servo/daily -- that runs the decision task, just as pushes and PRs do.

The "Treeherder Question" could have one of three answers:

  • Treeherder can ingest these messages
  • Servo could run in the Firefox-CI deployment
  • We could run a distinct Treeherder instance for the community deployment.
    The first option is preferred, and that's bug 1574651.

As for administration, servo would prefer to be able to admin their own resources within the larger scope of the community deployment, rather than waiting for PRs to some configuration repository to be approved and merged.

It may be beneficial to do a partial migration, running both in parallel for a while. The decision task could be modified to, for example, only run the docker-worker tasks in the new deployment while running everything in the old deployment.
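
A minimal sketch of that split, assuming the decision task already knows which deployment it is running in. The task list, the worker_impl field, and the function name are hypothetical and only illustrate the idea; the real decision task builds its tasks differently.

```python
# Hypothetical task descriptions, for illustration only.
ALL_TASKS = [
    {"name": "linux-build", "worker_impl": "docker-worker"},
    {"name": "windows-build", "worker_impl": "generic-worker"},
    {"name": "macos-build", "worker_impl": "generic-worker"},
]


def tasks_for_deployment(tasks, is_new_deployment):
    """Select the tasks to schedule in the current deployment.

    The old deployment keeps running everything; the new deployment
    only runs the docker-worker tasks during the transition.
    """
    if not is_new_deployment:
        return list(tasks)
    return [t for t in tasks if t["worker_impl"] == "docker-worker"]
```

With this shape, widening the migration later means relaxing one filter rather than touching each task definition.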

As for administration, servo would prefer to be able to admin their own resources within the larger scope of the community deployment, rather than waiting for PRs to some configuration repository to be approved and merged.

It would also be fine to use a shared configuration repository if we (the Servo team) have access to review and deploy PRs to it. (With the understanding that we only use this access for PRs that only affect Servo’s CI.)

It looks like the Treeherder changes are tractable, and I'm working on them now. One question, though: do you use actions like retrigger in Treeherder, or just use it as a status display?

Assuming "you" means the Servo team: I personally don’t use actions from Treeherder and usually don’t log into Treeherder at all. Josh, how about you? Or as far as you know, other people on the team?

Flags: needinfo?(josh)

Haha, yeah, I guess the informal English pronoun should have been y'all :)

My attempts to use actions from the treeherder interface have been thwarted by https://github.com/servo/servo/issues/23217, so all my actions happen through the taskcluster interface.

Flags: needinfo?(josh)
Assignee: dustin → pmoore

Hi Dustin. With bug 1574651 fixed, what are the next steps? Is the new deployment available at least for testing? Can we have both enabled on the same GitHub repository, during a transition period?

Flags: needinfo?(dustin)

Pete will be working with you on this.

Flags: needinfo?(dustin) → needinfo?(pmoore)
Assignee: pmoore → bstack

Can we have both enabled on the same GitHub repository, during a transition period?

I’ve submitted https://github.com/taskcluster/taskcluster/pull/1738 as an attempt to make this possible.

Thanks! That the idea was invented twice in the same day is a confirmation that the change is a good idea. So yes, let's get that landed and we can make a TC release. Those generally go live in about a day.

The new integration is https://github.com/apps/community-tc-integration.

About 16h ago we switched again and bstack is now assigned (sorry for the churn). We're aware that this is work that we've created for you, so we are happy to make PRs, test changes, etc. I'll leave it to bstack to figure out the next steps.

If it wasn't clear from above, the Treeherder issue was resolved to "Treeherder can ingest these messages" -- Treeherder will listen to messages from both the firefox-ci and community-tc deployments, and will "remember" which one is which.

Flags: needinfo?(pmoore)

Is there a timeline for the current https://taskcluster.net deployment going away?

we are happy to make PRs, test changes, etc.

Since you’re offering :) I guess some good next steps would be:

  • Ensure that we can distinguish in the GitHub Status API entries from both deployments. A different context string would be best. In handlers.js this appears to be based on this.context.cfg.app.statusContext
  • Set up a servo “project” and an initial worker pool that runs docker-worker
  • Figure out how to give administrative access to the above to select Servo contributors, ideally even if they don’t have a Mozilla LDAP account.
  • A PR to https://github.com/servo/servo:
    • Making .taskcluster.yml use https://github.com/taskcluster/taskcluster/pull/1738 so that GitHub push events and PR events trigger a decision task in both deployments
    • Making etc/taskcluster/decision_task.py run pprint.pprint(os.environ) then exit early when running on the new deployment
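
The last step could look roughly like the sketch below. It assumes the new deployment exposes its root URL to tasks through a TASKCLUSTER_ROOT_URL environment variable and that the old deployment either leaves it unset or points it at https://taskcluster.net; both are assumptions to verify, and the function names are hypothetical.

```python
import os
import pprint

# Assumption: root URL of the old deployment.
OLD_ROOT_URL = "https://taskcluster.net"


def is_new_deployment(environ):
    """Guess the deployment from the (assumed) TASKCLUSTER_ROOT_URL variable."""
    root_url = environ.get("TASKCLUSTER_ROOT_URL", OLD_ROOT_URL)
    return root_url.rstrip("/") != OLD_ROOT_URL


def run_decision_task(environ=os.environ):
    """Return True if the task stopped early because it runs on the new deployment."""
    if is_new_deployment(environ):
        # Dump the environment so we can see what the new deployment provides.
        pprint.pprint(dict(environ))
        return True
    # ... the existing decision-task logic would continue here ...
    return False
```

Printing the environment first gives a cheap way to confirm that tasks are triggered at all before any real work is moved over.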

I think the first point is most important to safely enable https://github.com/apps/community-tc-integration on servo/servo without disrupting its CI.
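
To illustrate why the context string matters, here is a small sketch that reduces a commit's status entries to the latest state per context, the way a consumer of the GitHub Status API typically reads them. The context strings in the usage note are hypothetical; the entries are assumed to be ordered newest first.

```python
def latest_state_by_context(statuses):
    """Map each context to its most recent state.

    `statuses` is assumed to be ordered newest first, with each entry
    carrying the "context" and "state" fields of a GitHub commit status.
    """
    latest = {}
    for status in statuses:
        # Keep only the first (newest) entry seen for each context.
        latest.setdefault(status["context"], status["state"])
    return latest
```

If both deployments report under one shared context, a "success" from one can hide a "failure" from the other, since only the newest entry per context counts. With two distinct contexts (say a hypothetical "Taskcluster (push)" and "Community-TC (push)"), each deployment's result stays visible on its own.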

treeherder will listen to messages from both the firefox-ci and community-tc deployments

And also the current deployment, as long as it still exists?

Unfortunately, the docs for treeherder integration seem to have been removed :( Is specifying a tc-treeherder.v2._/${tree}.${sha} route on tasks still what we need in community-tc?
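
For reference, this is roughly how such a route would be attached to a task definition, taking the route format exactly as quoted above. Whether community-tc still expects this format is the open question; the helper name and the example tree and sha values are hypothetical.

```python
def add_treeherder_route(task, tree, sha):
    """Append the Treeherder route (format as quoted above) to a task definition."""
    route = "tc-treeherder.v2._/{}.{}".format(tree, sha)
    # Task definitions carry their routes in a "routes" list.
    task.setdefault("routes", []).append(route)
    return task
```

For example, add_treeherder_route({}, "some-tree", "abc123") yields a task whose routes list contains "tc-treeherder.v2._/some-tree.abc123".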

Set up a servo “project” and an initial worker pool that runs docker-worker

Done - https://github.com/mozilla/community-tc-config/pull/34. That's already applied, too, in the interest of expediency.

Figure out how to give administrative access to the above to select Servo contributors, ideally even if they don’t have a Mozilla LDAP account.

Done -- community-tc uses GitHub auth!

Comment 13 was based on an irc conversation this morning, in an attempt to be expedient. So I didn't answer the rest of the questions in comment 12. Sorry to interrupt, brian!

Flags: needinfo?(bstack)

After spending some time looking at tc-admin and community-tc-config (and with IRC help, thanks Dustin!) I came up with https://github.com/servo/taskcluster-config. I’m glad to be able to version-control all this.

One aspect that is still not clear to me is the cloud-provider-specific parts of worker pools. I think I’ll likely cargo-cult community-tc-config’s workers.py.

Depends on: 1591591

I think that's fine.

Note, too, that if you wish, we can continue to manage your worker pools while letting you manage everything else -- the advantage is that we then take care of upgrading worker versions, mitigating AWS or GCP issues, etc.

Could we do that for only some worker pools?

  • For pools running docker-worker that sounds great. We’d like to have some control over instance types[1] and min/max capacity, but changes there should be infrequent enough that going through a PR to mozilla/community-tc-config sounds fine. We might need more frequent changes to the scopes granted to those workers but that can be managed separately, right?

  • For static workers (such as those running macOS) I don’t know which is preferable. There isn’t much configuration for them in worker-manager, is there?

  • For Windows it may be better that we manage the VM image, in order to be able for example to install another MSVC component in it. Deploying a new image sounds easier if that worker pool is managed in servo/taskcluster-config.

[1] By the way I’d like at some point to benchmark different instance types. Compiling Servo benefits from more CPU cores but only to a point, so there’s likely a sweet spot to balance speed vs. cost.

Could we do that for only some worker pools?

No problem! And the distinction you've described sounds like a good one.

Flags: needinfo?(bstack)

Simon, two things:

Are we on-track to turn off https://taskcluster.net in a week?

No, not at all. I was not aware of this target date, despite asking in comment 12 :/ Bug 1591591 is still something I hoped we could do before starting the migration for servo/servo.

:pmoore has changed that to just manage servo/servo.

That sounds OK.

Are there other repos in that org that should be included as well?

No, as far as I remember servo/servo and servo/webrender are the only two under servo/ using Taskcluster at the moment. When we want more, PRs to mozilla/community-tc-config to add them on a case-by-case basis sound fine.

I gave Simon a user (SimonSapin) in the community workers AWS account with EC2 Read-Only access, in order to set up and debug the win2016 images expeditiously. We can remove that once it's in place (and once we have better mechanics for debugging worker instances that don't require EC2 access).

With aws-provisioner I used this when instances running a new AMI were starting but not picking up tasks, in order to find their public IP address and RDP in so that I could read generic-worker log files.

We can remove that once it's in place

I may need this again for future AMI updates, until there’s some other way to find the IP address or read logs.

With help on IRC from Dustin and tomprince I’ve now managed to configure a Windows / AWS worker pool and run a task.

Bug 1591591 is the next blocker.

Depends on: 1593543

I believe bug 1593543 is now fixed, and anyhow had a temporary workaround.

Is there anything else I can do to assist or unblock?

Per irc, this is all set! THANK YOU!

Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED

I think we’re done! Just in time for tomorrow.
