Closed Bug 1594891 Opened 2 years ago Closed 2 years ago

stand up NSS CI in new Taskcluster Firefox-ci environment, have releng take administration of NSS going forward

Categories

(Release Engineering :: General, enhancement)

enhancement
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: jlund, Unassigned)

Attachments

(2 files)

NSS is going to be part of FF-CI.

Needs workers, tooltool support, updated tc clients, and to point to the new Taskcluster root url.

Blocks: 1591152

https://treeherder.mozilla.org/#/jobs?repo=nss-try&revision=b775e9d2d4c027df3b599477571ad0bb7bf12a92 is a start -- replaces hard-coded taskcluster.net references and updates the taskcluster client version. It doesn't update tooltool support or switch it to use new workers.

I'm not totally sure what tooltool support entails, unfortunately. Does that mean publishing tooltool manifests for nss artifacts?

  • Update the Taskcluster client used in the decision task to one that
    understands Taskcluster rootUrls.
  • Update scripts that fetch content to use the TASKCLUSTER_ROOT_URL
    • the absence of this variable signals an "old" worker, so we use an "old" URL
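
A minimal sketch of that fallback, for illustration only. The function name `api_url` and the example root URL in the usage note are assumptions, not taken from the NSS scripts; the two URL shapes follow the usual Taskcluster convention of per-service hostnames under taskcluster.net for the legacy deployment and `<rootUrl>/api/<service>/...` for rootUrl-aware deployments.

```python
import os

# Root URL of the legacy deployment, which used per-service hostnames.
LEGACY_ROOT_URL = "https://taskcluster.net"


def api_url(service, version, path, env=None):
    """Build a Taskcluster API URL (hypothetical helper).

    If TASKCLUSTER_ROOT_URL is unset we assume an "old" worker and
    return an "old"-style URL such as https://queue.taskcluster.net/v1/...
    """
    env = os.environ if env is None else env
    root_url = env.get("TASKCLUSTER_ROOT_URL", "").rstrip("/")
    path = path.lstrip("/")
    if not root_url or root_url == LEGACY_ROOT_URL:
        # Old worker or legacy deployment: service name is part of the host.
        return f"https://{service}.taskcluster.net/{version}/{path}"
    # New deployment: everything lives under <rootUrl>/api/<service>/.
    return f"{root_url}/api/{service}/{version}/{path}"
```

For example, `api_url("queue", "v1", "task/abc", env={})` yields the legacy form, while passing `env={"TASKCLUSTER_ROOT_URL": "https://firefox-ci-tc.services.mozilla.com"}` yields a URL under that root.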

OK, miles and I have had a deep look at this.

The nss source is mostly in good shape: the worker pools have been renamed and no further renaming will be required. The patch above needed adjustments for the situations where TASKCLUSTER_ROOT_URL isn't set (which is the case on the old docker-worker version on the packet.net laptop).

As for tooltool, the hypothesis is that it's fine as-is.

Here's the state of workers:

  • nss-1/win2012r2 is confirmed to be working in the new deployment (AWS instances)
  • localprovisioner/nss-macos-10-12 is backed by 10 instances of generic-worker on a macstadium host at 208.52.182.28. That has my SSH keys and miles' on it now, and :kjacobs also has access
  • localprovisioner/nss-aarch64 is backed by a host in packet.net, to which miles has access. It's got an ancient version of docker-worker on it.

During the tree-closing window, the plan is:

  • log in to the macstadium host and reconfigure all 10 instances, using the same clientId and accessToken:
    • remove all *BaseURL configuration
    • update accessToken
    • update rootURL
    • start a worker manually to verify
    • then reboot and see that all workers start correctly with the correct context, etc.
  • log in to the packet.net host and (upgrade docker and)? reconfigure its credentials
    • details TBD
  • run a try push
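
The mac reconfiguration step could be scripted roughly as follows. This is a sketch under assumptions: it treats each generic-worker.config as a flat JSON object, and the helper names (`reconfigure`, `reconfigure_file`) are made up for illustration. It has not been run against the real hosts.

```python
import json


def reconfigure(config, root_url, access_token):
    """Return a copy of a worker config pointed at a new deployment.

    Drops every *BaseURL setting, then sets rootURL and accessToken.
    All other settings (including clientId) are left untouched.
    """
    updated = {k: v for k, v in config.items() if not k.endswith("BaseURL")}
    updated["rootURL"] = root_url
    updated["accessToken"] = access_token
    return updated


def reconfigure_file(path, root_url, access_token):
    """Rewrite one config file in place (assumes the file is JSON)."""
    with open(path) as f:
        config = json.load(f)
    with open(path, "w") as f:
        json.dump(reconfigure(config, root_url, access_token), f, indent=2)
        f.write("\n")
```

Looping `reconfigure_file` over the ten worker directories would cover the first three sub-steps; starting a worker manually and rebooting still need to be done by hand.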

To get there:

  • [DONE] verify that mac workers can talk to the new deployment by reconfiguring one of them
  • verify that an updated docker-worker version can talk to the legacy deployment
  • verify that an updated docker-worker version can talk to the firefox-ci deployment
  • with at least one of each kind of worker, run a try push in the firefox-ci deployment
    • this will validate the assumption about tooltool

The bottom line is that we should be able to make this transition with no downtime for NSS outside of the TCW, and with no need for NSS staff to be available on Saturday. We'll know for sure when the "to get there" items are complete.

In the process of reconfiguring the nss-aarch64 worker in packet.net, the box disconnected me and went dark to ping and ssh. I wasn’t totally sure of the cause; here is what I had done:

  • Backed up the existing docker-worker checkout to /home/ci/docker-worker-bak
  • Backed up the existing service definition /lib/systemd/system/docker-worker.service to /home/ci/service.bak
  • Edited /etc/docker-worker.conf to add TASKCLUSTER_ROOT_URL env var
  • Used nvm to install node v8.15.0 to test with a checkout of docker-worker v201911061915
  • Edited /lib/systemd/system/docker-worker.service to point to that checkout

And then the server disconnected me, went dark to ping and ssh for a period, at some point let me back in, then disconnected me again.
I waited ~30 minutes without being able to get back into the box, then triggered a reboot of the instance via the packet.net web UI.

After the reboot the instance responded to ping for a period, then went dark again. I’m going to engage packet.net support to try to resolve this tomorrow.

For now, the impact is that aarch64 tasks will not run.

Status Update:

  • We ran a try push in the new deployment!
    https://treeherder-prototype.herokuapp.com/#/jobs?repo=nss-try&revision=cc8b3f93ff090c267734609d1ab2918f1049a58a&selectedJob=274493469
    I see some orange there, but I can't tell what's wrong. Any assistance interpreting that information would be helpful! A few thoughts:
    • this try push re-generated all of the docker images, so it's possible that one of those images includes an incompatible "latest" version of some dependency
    • these workers are identical in configuration to those in this push which succeeded
  • The packet.net laptop appears to be down, as described in the previous comment.
  • The mac workers don't work when run from the command-line, and I don't have sudo access to reboot the host to get them to start automatically. @franziskus (as I think the next person to have working hours) can you figure out a way for me to be able to reboot the host? Either a sudoers entry, or admin password, or any other solution. I'll need to do it again on Saturday. 9 of the 10 workers continue to function as usual, so mac CI can continue.

Thanks for any help with the above from any of the NSS folks.

(In reply to Dustin J. Mitchell [:dustin] (he/him) from comment #6)

I've restarted worker 3 and I reran task u2GyIr2XQteNJ8s6udPqdw which has now turned green! 🥳

To stop one of the ten workers on 208.52.182.28 (workers numbered from 0 to 9):

sudo launchctl unload /Library/LaunchDaemons/net.generic.worker.[0-9].plist

To start a worker:

sudo launchctl load /Library/LaunchDaemons/net.generic.worker.[0-9].plist

To tail the logs of a worker:

tail -1000f /Users/administrator/worker[0-9]/generic-worker.log

I've added the administrator credentials for both NSS macstadium machines to the taskcluster team password store, together with the instructions above.

Config files of workers:

/Users/administrator/worker[0-9]/generic-worker.config
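
For anyone scripting this, here is a small sketch that expands the `[0-9]` placeholder above into concrete paths and commands for a single worker. The function names are hypothetical; the paths and the launchctl invocations are the ones listed above.

```python
def worker_paths(n):
    """Return the plist, log, and config paths for worker n (0 to 9)."""
    if n not in range(10):
        raise ValueError("worker number must be between 0 and 9")
    return {
        "plist": f"/Library/LaunchDaemons/net.generic.worker.{n}.plist",
        "log": f"/Users/administrator/worker{n}/generic-worker.log",
        "config": f"/Users/administrator/worker{n}/generic-worker.config",
    }


def restart_commands(n):
    """Return the stop and start commands for worker n as argument lists."""
    plist = worker_paths(n)["plist"]
    return [
        ["sudo", "launchctl", "unload", plist],  # stop the worker
        ["sudo", "launchctl", "load", plist],    # start it again
    ]
```

The argument lists can be passed to `subprocess.run` on the host, or joined with spaces and pasted into a shell.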

Note the workers are running the simple engine, not the multiuser engine:

administrators-Mac-mini-98:~ administrator$ /usr/local/bin/generic-worker --version
generic-worker (simple engine) 15.1.4 [ revision: https://github.com/taskcluster/generic-worker/commits/c407e45e3f019599005971b30993f29eb3c59b0d ]
administrators-Mac-mini-98:~ administrator$ 
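
If this check ever needs to be repeated across hosts, the engine and version can be pulled out of that output with a short parser. This is a sketch that assumes the `--version` line always has the shape shown above; `parse_version_line` is a made-up name.

```python
import re

# Matches e.g. "generic-worker (simple engine) 15.1.4 [ revision: ... ]"
VERSION_RE = re.compile(
    r"generic-worker \((?P<engine>[\w-]+) engine\) (?P<version>\S+)"
)


def parse_version_line(line):
    """Return (engine, version) from a generic-worker --version line."""
    match = VERSION_RE.search(line)
    if match is None:
        raise ValueError(f"unrecognized version output: {line!r}")
    return match.group("engine"), match.group("version")
```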

(In reply to Miles Crabill [:miles] [also mcrabill@mozilla.com] from comment #5)

Cross posting with slack:

docker-worker is probably failing to start and then rebooting the machine. See https://github.com/taskcluster/docker-worker/blob/master/deploy/template/usr/local/bin/start-docker-worker#L5

Fresh try push to see the mac jobs run and to see if those oranges in the previous job reproduce here. Also, once the nss-aarch64 host is reconfigured to the new deployment, we should see those jobs run.

Updates:

The latest try push looks good -- there were lots of retries of the aarch64 worker as we worked to get it set up, but it ran some green jobs after that. Mac, Windows, and Linux are all green.

The aarch64 worker is still configured to point to the new (firefox-ci) deployment. Per Kevin we will leave it there to avoid further churn and simplify things tomorrow. So, no coverage on aarch64 until that's over.

We will be transitioning the mac workers during the TCW tomorrow. Otherwise, I think we're good!

This appears done!

Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED

Sounds like we need to fix up tooltool here. I'm re-opening until comment 13 lands.

Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Status: REOPENED → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
See Also: → 1636245