1636245 - Create new worker pool for NSS ARM workers

Reporter

Description

•

4 years ago

As part of the current cost reduction efforts, we're trying to reduce the number of cloud providers we support, specifically by exiting packet.net. Over in bug 1632599, we're trying to move Mozilla CI workloads out of packet.net and into AWS now that Amazon supports bare metal instance types. There is currently a single, statically-provisioned ARM machine that was set up for NSS testing; it is the last machine we have running in packet.net.

Amazon has a variety of ARM options now. We'd like to transition NSS to AWS from packet.net.

Here are the specs for the current machine in packet.net:

PROC: 2 x Cavium ThunderX CN8890 @2GHz
RAM: 128GB
DISK: 1 x 340GB SSD
NIC: 2 x 10Gbps Bonded Ports

I'd humbly suggest that this is over-provisioned for running a single CI worker. I think an a1.large instance in AWS would be sufficient for the new pool. I also think that we can get away with a very small pool here, capping the max # of instances at 5 (or fewer) given that check-ins are infrequent and a single worker is keeping up just fine right now.

Note: I've cc-ed some Taskcluster folks here because I'm unsure about the vintage of docker-worker that's running on ARM in packet.net. We may need to engage with releng to get something running on ARM in AWS.

See also bug 1594891 where releng inherited many of the other NSS worker pools.

Chris Cooper [:coop] (he/him)

Reporter

Comment 1

•

4 years ago

(In reply to Chris Cooper [:coop] pronoun: he from comment #0)

I'd humbly suggest that this is over-provisioned for running a single CI worker. I think an a1.large instance in AWS would be sufficient for the new pool. I also think that we can get away with a very small pool here, capping the max # of instances at 5 (or fewer) given that check-ins are infrequent and a single worker is keeping up just fine right now.

Having now verified that I can connect to the NSS machine in packet.net, I can see that it's actually running with capacity=20 which makes more sense given the beefiness of the machine.

I'd suggest stepping down to an a1.medium instance in AWS, but having a max pool size of 20 and running 1 worker per instance.
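
To make the capacity arithmetic concrete, here is a minimal Python sketch of the proposed pool. The field names (`minCapacity`, `maxCapacity`, `launchConfigs`, `capacityPerInstance`) are assumed to resemble a worker-manager AWS pool definition, and the region is only a placeholder; none of this is an actual deployed config.

```python
# Sketch only: field names are assumed to resemble a worker-manager AWS pool
# definition, and the region is a placeholder rather than a decision.
proposed_pool = {
    "minCapacity": 0,
    "maxCapacity": 20,
    "launchConfigs": [
        {
            "region": "us-west-2",     # placeholder
            "capacityPerInstance": 1,  # one worker (task slot) per instance
            "launchConfig": {"InstanceType": "a1.medium"},
        }
    ],
}

# The single packet.net machine runs with capacity=20.
PACKET_CAPACITY = 20


def max_instances(pool):
    """Upper bound on instances: maxCapacity over the smallest per-instance capacity."""
    per_instance = min(lc["capacityPerInstance"] for lc in pool["launchConfigs"])
    return pool["maxCapacity"] // per_instance


def max_concurrent_tasks(pool):
    """Total task slots the pool can offer when fully scaled up."""
    return pool["maxCapacity"]


if __name__ == "__main__":
    print("instances at full scale:", max_instances(proposed_pool))
    print("task slots at full scale:", max_concurrent_tasks(proposed_pool))
```

Under these assumptions, 20 small instances with one slot each offer the same peak concurrency as the single machine with capacity=20, while letting the pool shrink when idle.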

Chris Cooper [:coop] (he/him)

Reporter

Updated

•

4 years ago

Priority: -- → P1

Regressed by: 1636101

Chris Cooper [:coop] (he/him)

Reporter

Comment 2

•

4 years ago

:miles is looking to get an ARM64 docker-worker image running in AWS. Once that's in place we can figure out next steps.

IIRC docker-worker needed some code tweaks to work on ARM and I'm not sure those were ever landed. I guess we'll find out soon.

Chris Cooper [:coop] (he/him)

Reporter

Comment 3

•

4 years ago

(In reply to Chris Cooper [:coop] pronoun: he from comment #2)

IIRC docker-worker needed some code tweaks to work on ARM and I'm not sure those were ever landed. I guess we'll find out soon.

Found a branch that may help with this: https://github.com/taskcluster/docker-worker/compare/packet-net

Miles Crabill [:miles]

Comment 4

•

4 years ago

I did some hacking on this, here's some of that state:

Used a test version of docker, see here: https://www.docker.com/blog/getting-started-with-docker-for-arm-on-linux/
Compiled worker-runner start-worker on my test box because our releases don't include arm64
Installed docker-worker dependencies, was able to run docker-worker independently and via worker-runner

I have some WIP scripts for this / modifications to monopacker and some notes to pick this back up after the current sprint.

I wasn't able to claim tasks in my dev environment due to docker-worker reporting capacity: 0. This is something we should be able to work around with some small code edits, as NSS doesn't use video / audio loopbacks.
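
As a rough illustration of how missing loopback devices could lead to capacity: 0, consider a worker that caps its task slots by the number of each required device it can find. This is a simplified Python model, not docker-worker's actual code; the function name and device names are hypothetical.

```python
def effective_capacity(configured, available_devices, required_devices):
    """Task slots on offer: configured capacity capped by each required device count.

    available_devices maps a device name to how many of that device exist on the host.
    A required device that is missing counts as zero, which forces capacity to 0.
    """
    limits = [available_devices.get(name, 0) for name in required_devices]
    return min([configured] + limits)


if __name__ == "__main__":
    # Host without loopback devices, worker still requiring them.
    print(effective_capacity(1, {}, ["loopbackVideo", "loopbackAudio"]))
    # Same host once the device requirement is dropped.
    print(effective_capacity(1, {}, []))
```

In this model, dropping the device requirement restores the configured capacity without any kernel changes.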

Chris Cooper [:coop] (he/him)

Reporter

Comment 5

•

4 years ago

(In reply to Miles Crabill [:miles] [also mcrabill@mozilla.com] from comment #4)

I wasn't able to claim tasks in my dev environment due to docker-worker reporting capacity: 0. This is something we should be able to work around with some small code edits, as NSS doesn't use video / audio loopbacks.

:kjacobs - I think this is an important point to clarify. Do any of the NSS workloads that would be running on these workers require either the audio and/or video loopback devices that we create for Firefox testing? If we don't need to customize a kernel to add these devices on ARM, our path gets a lot simpler (and quicker), but I'd like confirmation before we proceed.

Flags: needinfo?(kjacobs.bugzilla)

Kevin Jacobs [:kjacobs]

Comment 6

•

4 years ago

That's correct - NSS has no need for these devices. Thanks for checking.

Flags: needinfo?(kjacobs.bugzilla)

Comment hidden (Intermittent Failures Robot)

Kevin Jacobs [:kjacobs]

Comment 8

•

4 years ago

:coop or :miles, is there any updated ETA on worker availability?

Flags: needinfo?(coop)

Chris Cooper [:coop] (he/him)

Reporter

Comment 9

•

4 years ago

(In reply to Kevin Jacobs [:kjacobs] from comment #8)

:coop or :miles, is there any updated ETA on worker availability?

Sorry, I should have updated. :miles will be back to this tomorrow. With the streamlined kernel requirements we should have something for validation shortly (1-2 days).

Assignee: nobody → miles

Status: NEW → ASSIGNED

Flags: needinfo?(coop)

Miles Crabill [:miles]

Comment 10

•

4 years ago

I've been working on this a bit more. I'm shooting to have a working instance to test on by EOD tomorrow, and will test manually baking an AMI from that instance so that we have a replacement strategy should it fail.

Due to the level of customizations / hackery I haven't built an entirely new set of scripts to automate creating the image, but I've taken notes on the changes I've made so we can adapt from there.

Miles Crabill [:miles]

Comment 11

•

4 years ago

•

Edited

I have an arm64 instance that I've claimed some tasks on [0] in stage, so I'm going to create a worker in production now. That should make it easier to test for NSS because of the lack of convenient ways to trigger pushes in stage.

The instance is configured as a static worker using the static provisioner and a test worker-pool / worker-type that I've configured.

[0] https://stage.taskcluster.nonprod.cloudops.mozgcp.net/tasks/OHdGKi-8TN-ufvKvHLnJ_Q

Miles Crabill [:miles]

Comment 12

•

4 years ago

I duplicated my test worker in production but need the credentials to the client reset as they were lost on the old machine.

Tom, could you please reset the static secret for https://firefox-ci-tc.services.mozilla.com/auth/clients/project%2Fnss-nspr%2Faarch64 and send it to me? Not sure if you're most applicable, but I figured you would have access.

Flags: needinfo?(mozilla)

Chris Cooper [:coop] (he/him)

Reporter

Comment 13

•

4 years ago

(In reply to Miles Crabill [:miles] [also mcrabill@mozilla.com] from comment #12)

I duplicated my test worker in production but need the credentials to the client reset as they were lost on the old machine.

None of them are in our password repo?

Flags: needinfo?(miles)

Tom Prince [:tomprince]

Comment 14

•

4 years ago

Attached file Bug 1636245: Grant taskcluster team permission to reset NSS worker access tokens; r?Callek — Details

This also removes the scopes needed to create the client. The clients already exist,
and will be managed automatically once Bug 1632009 lands.

Tom Prince [:tomprince]

Comment 15

•

4 years ago

I've granted the taskcluster team permissions to reset the access token.

None of them are in our password repo?

I'd generally opt to regenerate access tokens, rather than store them separately from the deployment infrastructure for them.

Flags: needinfo?(mozilla)

Flags: needinfo?(miles)

Pulsebot

Comment 16

•

4 years ago

Pushed by mozilla@hocat.ca:
https://hg.mozilla.org/ci/ci-configuration/rev/db8d027d3fbd
Grant taskcluster team permission to reset NSS worker access tokens; r=Callek

Miles Crabill [:miles]

Comment 17

•

4 years ago

The production worker is up and running, the worker-pool matches the one that the client had access to, localprovisioner/nss-aarch64. It looks like there is a different naming scheme for the other nss related worker-pools in Firefox CI, they are prefixed with nss/.

Because of the client structure I configured the worker as standalone rather than static, which should be fine but means that credentials are long-lived and tied to the client.

Who should be given access to the box? Is there a meaningful distinction between L1 and L3 for this worker as in other nss worker types?

Kevin Jacobs [:kjacobs]

Comment 18

•

4 years ago

•

Edited

I can confirm that the new worker is functioning as expected. Thanks!

I'm not sure what "L1" and "L3" refer to. Can you give a little more information (or point to where these are used in the other workers)?

I believe all of the current NSS team (myself, :jcj, and :beurdouche) had SSH access on the old machine. It would be good if we can restore that.

Chris Cooper [:coop] (he/him)

Reporter

Comment 19

•

4 years ago

(In reply to Miles Crabill [:miles] [also mcrabill@mozilla.com] from comment #17)

The production worker is up and running, the worker-pool matches the one that the client had access to, localprovisioner/nss-aarch64. It looks like there is a different naming scheme for the other nss related worker-pools in Firefox CI, they are prefixed with nss/.

Because of the client structure I configured the worker as standalone rather than static, which should be fine but means that credentials are long-lived and tied to the client.

Who should be given access to the box? Is there a meaningful distinction between L1 and L3 for this worker as in other nss worker types?

Thanks for this, Miles. I'm sure NSS is glad to get their ARM coverage back.

Now that we're on AWS, we do have the opportunity to make these pools flexible, i.e. livestock not pets. The frequency of check-ins on NSS is much lower, so we will absolutely spend less money if we can spin the pool down when there are no jobs pending. It may also depend on whether the NSS team still needs direct access to the worker (see below). I'll file a follow-up bug for that.

(In reply to Kevin Jacobs [:kjacobs] from comment #18)

I'm not sure what "L1" and "L3" refer to. Can you give a little more information (or point to where these are used in the other workers)?

The level maps to the trust level of the machine. For NSS, the L1 machines would map to the nss-try tree and L3 would map to the nss tree. The distinction exists because (at least in the Firefox case), the restrictions around who can push to Try are much lower, and you don't want random jobs from Try poisoning future nightly/release builds.

If you're not expecting to have Try coverage for ARM, then we can eliminate the need for L1 builders. However, it looks like the task list is the same on Try.
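
The level split can be pictured with a small Python sketch. The tree names come from the explanation above; the pool-name pattern and the strict same-level rule are simplifying assumptions for illustration, not the actual Firefox CI configuration.

```python
# Tree names are from the comment above; everything else is hypothetical.
TREE_LEVELS = {"nss-try": 1, "nss": 3}


def pool_for_tree(tree):
    """Return a (hypothetical) worker pool name for the tree's trust level."""
    level = TREE_LEVELS[tree]
    return f"nss-{level}/aarch64"


def may_run(tree, pool_level):
    """Simplified rule: a pool only accepts tasks from trees at its own level."""
    return TREE_LEVELS[tree] == pool_level


if __name__ == "__main__":
    for tree in TREE_LEVELS:
        print(tree, "->", pool_for_tree(tree))
```

Keeping Try pushes off the L3 pool is what prevents a low-trust job from touching a machine that later produces trusted builds.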

I believe all of the current NSS team (myself, :jcj, and :beurdouche) had SSH access on the old machine. It would be good if we can restore that.

What's the use-case here? Ideally releng will manage these pools for you going forward and you shouldn't need to access them directly.

Chris Cooper [:coop] (he/him)

Reporter

Comment 20

•

4 years ago

(In reply to Chris Cooper [:coop] pronoun: he from comment #19)

What's the use-case here? Ideally releng will manage these pools for you going forward and you shouldn't need to access them directly.

Pinging :kjacobs ^^

Flags: needinfo?(kjacobs.bugzilla)

Chris Cooper [:coop] (he/him)

Reporter

Updated

•

4 years ago

Blocks: 1648080

Chris Cooper [:coop] (he/him)

Reporter

Comment 21

•

4 years ago

(In reply to Chris Cooper [:coop] pronoun: he from comment #19)

I'll file a follow-up bug for that.

Bug 1648080

Kevin Jacobs [:kjacobs]

Comment 22

•

4 years ago

(In reply to Chris Cooper [:coop] pronoun: he from comment #19)

(In reply to Miles Crabill [:miles] [also mcrabill@mozilla.com] from comment #17)

The production worker is up and running, the worker-pool matches the one that the client had access to, localprovisioner/nss-aarch64. It looks like there is a different naming scheme for the other nss related worker-pools in Firefox CI, they are prefixed with nss/.

Because of the client structure I configured the worker as standalone rather than static, which should be fine but means that credentials are long-lived and tied to the client.

Who should be given access to the box? Is there a meaningful distinction between L1 and L3 for this worker as in other nss worker types?

Thanks for this, Miles. I'm sure NSS is glad to get their ARM coverage back.

Now that we're on AWS, we do have the opportunity to make these pools flexible, i.e. livestock not pets. The frequency of check-ins on NSS is much lower, so we will absolutely spend less money if we can spin the pool down when there are no jobs pending. It may also depend on whether the NSS team still needs direct access to the worker (see below). I'll file a follow-up bug for that.

(In reply to Kevin Jacobs [:kjacobs] from comment #18)

I'm not sure what "L1" and "L3" refer to. Can you give a little more information (or point to where these are used in the other workers)?

The level maps to the trust level of the machine. For NSS, the L1 machines would map to the nss-try tree and L3 would map to the nss tree. The distinction exists because (at least in the Firefox case), the restrictions around who can push to Try are much lower, and you don't want random jobs from Try poisoning future nightly/release builds.

If you're not expecting to have Try coverage for ARM, then we can eliminate the need for L1 builders. However, it looks like the task list is the same on Try.

Yes, sounds like we'll want to keep both.

I believe all of the current NSS team (myself, :jcj, and :beurdouche) had SSH access on the old machine. It would be good if we can restore that.

What's the use-case here? Ideally releng will manage these pools for you going forward and you shouldn't need to access them directly.

Primarily testing security patches, which become public once pushed to nss-try. Having access to the box allows spinning up a local docker container and testing the patch manually. Maybe there's some alternative that would still enable that use case?

Flags: needinfo?(kjacobs.bugzilla)

Tom Prince [:tomprince]

Comment 23

•

4 years ago

(In reply to Kevin Jacobs [:kjacobs] from comment #22)

What's the use-case here? Ideally releng will manage these pools for you going forward and you shouldn't need to access them directly.

Primarily testing security patches, which become public once pushed to nss-try. Having access to the box allows spinning up a local docker container and testing the patch manually. Maybe there's some alternative that would still enable that use case?

Do you need access to a worker explicitly, or would access to any arm machine work? Given that AWS has arm machines, would having access to a different EC2 arm machine (either always on, or on demand) in a different account work for manual testing?

Kevin Jacobs [:kjacobs]

Comment 24

•

4 years ago

(In reply to Tom Prince [:tomprince] from comment #23)

(In reply to Kevin Jacobs [:kjacobs] from comment #22)

What's the use-case here? Ideally releng will manage these pools for you going forward and you shouldn't need to access them directly.

Primarily testing security patches, which become public once pushed to nss-try. Having access to the box allows spinning up a local docker container and testing the patch manually. Maybe there's some alternative that would still enable that use case?

Do you need access to a worker explicitly, or would access to any arm machine work? Given that AWS has arm machines, would having access to a different EC2 arm machine (either always on, or on demand) in a different account work for manual testing?

No, any aarch64 machine should do the trick. The worker was just convenient and available.

Kevin Jacobs [:kjacobs]

Comment 25

•

4 years ago

Tom, is there a process that we should go through in order to get access to one of these arm machines for testing?

Flags: needinfo?(mozilla)

Tom Prince [:tomprince]

Comment 26

•

4 years ago

This would probably make the most sense outside of taskcluster, so redirecting to :fubar, who I think manages our AWS accounts.

Flags: needinfo?(mozilla) → needinfo?(klibby)

J.C. Jones [:jcj] (he/him)

Comment 27

•

4 years ago

I think we can do this inside the NSS team ourselves, actually. I'll DM you, klibby, if we encounter issues.

Flags: needinfo?(klibby)

Chris Cooper [:coop] (he/him)

Reporter

Comment 28

•

4 years ago

:miles - what are the next steps here? Are we converting this to a managed pool, or leaving it in localprovisioner?

Flags: needinfo?(miles)

Christian Holler (:decoder)

Comment 29

•

4 years ago

This worker type has lots of potential beyond NSS. In particular the JS team would be interested in having native AArch64 builds and tests on such machines, at least for the JS shell.

Pete Moore [:pmoore][:pete]

Comment 30

•

4 years ago

(In reply to Miles Crabill [:miles] [also mcrabill@mozilla.com] from comment #10)

Due to the level of customizations / hackery I haven't built an entirely new set of scripts to automate creating the image, but I've taken notes on the changes I've made so we can adapt from there.

Hey Miles, could you put a copy of the notes (or a link to them) in this bug? Thanks!

Chris Cooper [:coop] (he/him)

Reporter

Updated

•

4 years ago

Flags: needinfo?(miles)

Chris Cooper [:coop] (he/him)

Reporter

Updated

•

4 years ago

Assignee: miles → nobody

Status: ASSIGNED → NEW

Pete Moore [:pmoore][:pete]

Comment 31

•

4 years ago

(In reply to Pete Moore [:pmoore][:pete] from comment #30)

Hey Miles, could you put a copy of the notes (or a link to them) in this bug? Thanks!

Notes from Miles: https://github.com/taskcluster/taskcluster/issues/3524#issue-703707552

Pete Moore [:pmoore][:pete]

Comment 32

•

4 years ago

Coop, should we close this?

I believe the workers are running, and the only caveat is that the machine images were manually created. Issue 3524 (see comment 31) tracks automating this process for future linux/aarch64 workers.

Flags: needinfo?(coop)

Miles Crabill [:miles]

Comment 33

•

4 years ago

👋!

The important piece is that, because that instance was manually created, the data is lost if it is deleted. Two good steps to take to make things more recoverable/durable would be: 1. disable volume deletion on instance termination so that the root volume can be reused on another instance; 2. take a snapshot of the current instance state (or for bonus points make an AMI out of it; note that if you make an AMI you should verify the docker-worker relevant services are set to start on boot).
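
The two steps above could be scripted roughly as follows. This is a sketch assuming a boto3 EC2 client is passed in as `ec2`; the helper names are made up for illustration, and the instance ID, volume ID, and root device name all have to be looked up for the real instance.

```python
def preserve_root_volume(ec2, instance_id, device_name):
    """Step 1: keep the root volume around when the instance is terminated."""
    ec2.modify_instance_attribute(
        InstanceId=instance_id,
        BlockDeviceMappings=[
            {"DeviceName": device_name, "Ebs": {"DeleteOnTermination": False}}
        ],
    )


def snapshot_volume(ec2, volume_id, description):
    """Step 2: snapshot the volume and return the new snapshot ID."""
    response = ec2.create_snapshot(VolumeId=volume_id, Description=description)
    return response["SnapshotId"]


def bake_ami(ec2, instance_id, name, no_reboot=True):
    """Bonus step: create an AMI from the instance and return the image ID.

    no_reboot=True avoids interrupting the running worker, at the cost of
    weaker file-system consistency guarantees for the resulting image.
    """
    response = ec2.create_image(InstanceId=instance_id, Name=name, NoReboot=no_reboot)
    return response["ImageId"]
```

With boto3 installed and credentials configured, `ec2` would come from `boto3.client("ec2", region_name=...)`. An instance launched from the resulting AMI should still be checked by hand to confirm that the docker-worker services start on boot.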

Chris Cooper [:coop] (he/him)

Reporter

Comment 34

•

4 years ago

(In reply to Pete Moore [:pmoore][:pete] from comment #32)

Coop, should we close this?

Sure, we can close this. I'll file follow-up issues for Miles' call-outs in comment #33.

Status: NEW → RESOLVED

Closed: 4 years ago

Flags: needinfo?(coop)

Resolution: --- → FIXED

Chris Cooper [:coop] (he/him)

Reporter

Comment 35

•

4 years ago

(In reply to Miles Crabill [:miles] from comment #33)
2. take a snapshot of the current instance state (or for bonus points make an AMI out of it; note that if you make an AMI you should verify the docker-worker relevant services are set to start on boot).

I took a snapshot of the running instance. The snapshot is in us-west-2: snap-0a41f3f82635cd474.

Christian Holler (:decoder)

Updated

•

4 years ago

Blocks: 1675561