Closed Bug 1542819 Opened 1 year ago Closed 5 months ago

migrate signing workers to cloudops

Categories

(Release Engineering :: General, task)


Tracking

(firefox71 fixed)

RESOLVED FIXED

People

(Reporter: rail, Assigned: catlee)

References

(Blocks 1 open bug)

Details

Attachments

(3 files)

We plan to migrate our signing workers to GCP/cloudops as one of the first workers.

Listing their requirements:

  1. The workers need to be able to communicate with the signing servers, which are in MDC1 and MDC2. This may require some kind of tunnel. We should probably use at least 2 k8s clusters, each in the region closest to MDC1 or MDC2.

  2. The workers need to be able to communicate with the autograph server. AFAIK, we can use IP-based whitelisting for this.

  3. The workers will use the Dockerflow approach, except for the health check, which will be a CLI tool instead of an HTTP API endpoint.

  4. There will be a special docker instance to handle autoscaling for these workers (1 per cluster).
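
For item 3, a CLI health check could look roughly like the sketch below. It is only an illustration of the idea (exit code 0 when healthy, non-zero otherwise); the function names and the individual checks are hypothetical and not taken from any existing worker code.

```python
from typing import Callable, Dict


def run_checks(checks: Dict[str, Callable[[], bool]]) -> Dict[str, bool]:
    """Run each named check; a check that raises counts as failed."""
    results = {}
    for name, check in checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            results[name] = False
    return results


def exit_code(results: Dict[str, bool]) -> int:
    """Return 0 if every check passed, 1 otherwise."""
    return 0 if all(results.values()) else 1


def main(checks: Dict[str, Callable[[], bool]]) -> int:
    """Print one line per check and return the process exit code."""
    results = run_checks(checks)
    for name, ok in results.items():
        print(f"{name}: {'ok' if ok else 'FAILED'}")
    return exit_code(results)
```

A wrapper script would pass in the real checks (for example, "can I reach the signing servers") and call `sys.exit(main(checks))`, so that the container's probe can run the command and look only at the exit status.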

Anything that I missed?

Depends on: 1543563
Type: defect → task

@catlee did some initial work here to test things in our dev environment.

He pushed his signingscript code to the dev branch and pushed a try submission. The tasks https://tools.taskcluster.net/groups/O-oe5xMyQomr_ZFaI5fmcQ/tasks/GVr32A9KQs2wLvXdfFJ6yw/details were stuck in pending, as no worker claimed them.

Rok and I did some debugging earlier today and found three potential culprits:
a) dev was not up to date with Rail's recent work cleaning up the signingscript repo

Solution: merge master branch to dev to be up-to-date

b) there's a constant CrashLoopBackOff error that shows up when recreating the deployments in CloudOps.

There are more docs on this in here that we can start from.

c) the workerType is wrong.

Solution: It expects gecko-1-signing-dev but instead was fed gecko-t-signing-dev in here

For this reason, the correct workers are claiming work, but the submitted tasks are queued under the wrong workerType.
I'm stealing @catlee's try submission, applying the three solutions above, and trying this again.
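
To catch this kind of mismatch before submitting, something like the following could be run against a task definition. It is a minimal sketch: `worker_type_mismatch` is a made-up helper, and it assumes only that the task definition is a dict carrying a `workerType` field.

```python
from typing import Optional


def worker_type_mismatch(task_definition: dict, expected_worker_type: str) -> Optional[str]:
    """Return a message if the task targets a different workerType, else None."""
    actual = task_definition.get("workerType")
    if actual == expected_worker_type:
        return None
    return (
        f"task targets workerType {actual!r} "
        f"but the workers poll {expected_worker_type!r}"
    )


# The two values seen in this bug:
task = {"workerType": "gecko-t-signing-dev"}
print(worker_type_mismatch(task, "gecko-1-signing-dev"))
```

A task with the wrong workerType is not rejected; it just sits in a queue that nobody polls, which matches the "pending forever" symptom.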

Once we have a dev signing worker running successfully, we can start rolling out the production ones.

More debugging information: the commit was https://github.com/mozilla-releng/signingscript/commit/fef19a132e16f76d4baa7944466901276c373d35, which corresponds to the latest tag pushed under mozilla/releng-signingscript. So the push succeeded. If there's an error, it must be somewhere in the CloudOps world, when (re)pulling the image or deploying it. Digging through.

Pushed a new try submission https://treeherder.mozilla.org/#/jobs?repo=try&revision=10e2e458dc84965b894dcb79f6712c2d9678bc93 with catlee's work + fixing the workerTypes and some fuzzy signing jobs.

Attempting to fix the GCP side as well.

Some progress here, the jobs I scheduled earlier barfed in signingscript.

  File "/app/signingscript/sign.py", line 131, in get_suitable_signing_servers
    f"No signing servers found with cert type {cert_type} and formats {signing_formats}"
signingscript.exceptions.SigningScriptError: No signing servers found with cert type project:releng:signing:cert:dep-signing and formats ['autograph_authenticode']
exit code: 5

The job itself passes CoT and breaks in signingscript.

Which means:
a) we need to fix credentials in signingscript
b) the GCP CrashLoopBackOff is quite misleading, for now.
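
To illustrate why a gap in the server config produces this exact failure, here is a simplified stand-in for the lookup. This is not the signingscript implementation: `SigningServer`, `find_servers`, the config shape and the "any format overlaps" matching rule are all assumptions made for the example.

```python
from typing import Dict, List, NamedTuple, Sequence


class SigningServer(NamedTuple):
    address: str
    formats: frozenset


class NoSigningServerError(Exception):
    """Raised when no configured server can handle the request."""


def find_servers(
    config: Dict[str, List[SigningServer]],
    cert_type: str,
    signing_formats: Sequence[str],
) -> List[SigningServer]:
    """Return servers listed under cert_type that support any requested format."""
    wanted = set(signing_formats)
    servers = [s for s in config.get(cert_type, []) if wanted & s.formats]
    if not servers:
        raise NoSigningServerError(
            f"No signing servers found with cert type {cert_type} "
            f"and formats {list(signing_formats)}"
        )
    return servers
```

With a lookup like this, a worker whose config has no entry for the dep-signing cert type, or no server advertising `autograph_authenticode`, fails at task time rather than at startup, so the task gets past CoT and only then errors out.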

Earlier I pushed to dev a backed-out revision from catlee's. I put that back and redeployed. Jenkins failed so we need to understand what's going on there. Somehow the tasks do run, even though the CloudOps deployment fails. I reran and still hit the same error. Moreover, the worker-id is similar, which based on this (TIL those are generated randomly) means it's hitting the same worker-id in GCP.

So I'm thinking of the following scenario:

  • regardless of me pushing new images to Docker on dev, Jenkins fails to deploy.
  • GCP isn't able to deploy the new image, so tasks are talking to workers from an older version, which doesn't have @catlee's signingscript changes
  • since signingscript complains about the missing credentials, I'm thinking the deployment fails for the same reason: some corresponding env var is missing in the container
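
If the missing-env-var theory is right, a startup check along these lines would make the failure explicit instead of surfacing as a crash loop. This is a sketch only; the variable names in the usage line are placeholders, not the real names used by the deployment.

```python
import os
from typing import Iterable, List, Mapping, Optional


def missing_env_vars(
    required: Iterable[str],
    environ: Optional[Mapping[str, str]] = None,
) -> List[str]:
    """Return the required variable names that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in required if not env.get(name)]


# Placeholder names, for illustration only.
print(missing_env_vars(["EXAMPLE_SIGNING_USER", "EXAMPLE_SIGNING_PASSWORD"]))
```

Logging the returned list before the worker starts would show up in the container logs and point directly at the misconfigured secret.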
Assignee: rail → catlee
Pushed by catlee@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/e99752b85f6c
Use GCP signing workers r=mtabara
Status: NEW → RESOLVED
Closed: 5 months ago
Resolution: --- → FIXED
Pushed by mozilla@jorgk.com:
https://hg.mozilla.org/comm-central/rev/919bb48457aa
Port bug 1542819 - Use GCP signing workers. rs=bustage-fix
Pushed by catlee@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/61f8731a42cb
Use worker aliases for signing to unbreak TB. r=tomprince
Pushed by ccoroiu@mozilla.com:
https://hg.mozilla.org/mozilla-central/rev/480000073f46
Use worker aliases for signing to unbreak TB. r=tomprince a=Aryx

We've had to bump up the CPU/memory limits for these workers since they were getting OOM killed a lot.

Regressions: 1580054
Regressions: 1593816
No longer regressions: 1593816
See Also: → 1593816