Closed Bug 1542819 Opened 3 years ago Closed 3 years ago

migrate signing workers to cloudops


(Release Engineering :: General, task)

Not set


(firefox71 fixed)

Tracking Status
firefox71 --- fixed


(Reporter: rail, Assigned: catlee)




(3 files)

We plan to migrate our signing workers to GCP/cloudops as one of the first workers.

Listing their requirements:

  1. The workers need to be able to communicate with the signing servers, which are in MDC1 and MDC2. This may require some kind of tunnel. We should probably use at least 2 k8s clusters, each in the region closest to MDC{1,2}.

  2. The workers need to be able to communicate with the autograph server. AFAIK, we can use IP-based whitelisting to do this.

  3. The workers will follow the Dockerflow approach, except that the health check will be a CLI tool instead of an HTTP API endpoint.

  4. There will be a special Docker instance to handle autoscaling for these workers (1 per cluster).

Anything that I missed?
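
To make requirement 3 more concrete, here is a minimal sketch of what a CLI-style health check could look like. This bug does not describe what the real check does; the TCP reachability test, the `can_connect` and `main` names, and the `HOST:PORT` argument format are all assumptions made for illustration.

```python
import socket
import sys


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def main(argv):
    """Check each HOST:PORT argument (hypothetical format).

    Returns 0 if every target is reachable, 1 if any is unreachable,
    and 2 on a usage error, so an orchestrator can act on the exit code.
    """
    if not argv:
        print("usage: healthcheck HOST:PORT [HOST:PORT ...]", file=sys.stderr)
        return 2
    healthy = True
    for target in argv:
        host, _, port = target.rpartition(":")
        if not host or not port.isdigit():
            print(f"bad target: {target}", file=sys.stderr)
            return 2
        if not can_connect(host, int(port)):
            print(f"unreachable: {target}", file=sys.stderr)
            healthy = False
    return 0 if healthy else 1
```

A script entry point would pass `sys.argv[1:]` to `main` and hand the result to `sys.exit`, so the container runtime sees a non-zero exit code when a target is unreachable.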

Type: defect → task

@catlee did some initial work here to test things in our dev environment.

He pushed his signingscript code to the dev branch and pushed a try submission. The tasks sat in the pending state because no worker claimed them.

Rok and I did some debugging earlier today and we found three potential culprits:
a) dev was not up to date with Rail's recent work cleaning up the signingscript repo

Solution: merge the master branch into dev to bring it up to date

b) a constant CrashLoopBackOff error shows up when recreating the deployments in CloudOps.

There are more docs on this that we can start from.

c) the workerType is wrong.

Solution: It expects gecko-1-signing-dev but was fed gecko-t-signing-dev instead

As a result, the correct workers are claiming work, but the submitted tasks are queued under the wrong workerType.
I'm stealing @catlee's try submission, applying the three solutions above, and trying this again.
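
A quick way to catch this kind of mismatch before submitting is to compare each task's workerType against the one the workers actually claim from. The sketch below assumes task definitions are available as plain dicts keyed by label; the `find_mismatched_tasks` helper and the task labels are hypothetical, and only the two workerType values come from this bug.

```python
def find_mismatched_tasks(tasks, expected_worker_type):
    """Return the labels of tasks whose workerType is not the expected one."""
    return sorted(
        label
        for label, task in tasks.items()
        if task.get("workerType") != expected_worker_type
    )


# Hypothetical task definitions, reduced to the one field that matters here.
tasks = {
    "signing-a": {"workerType": "gecko-t-signing-dev"},
    "signing-b": {"workerType": "gecko-1-signing-dev"},
}

# The dev workers claim from gecko-1-signing-dev, so signing-a would stay pending.
mismatched = find_mismatched_tasks(tasks, "gecko-1-signing-dev")
print(mismatched)  # ['signing-a']
```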

Once we have a dev signing worker running successfully, we can start rolling out the production ones.

More debugging information: the commit corresponds to the latest tag that was pushed under mozilla/releng-signingscript, so the push succeeded. If there's an error, it must be somewhere in the CloudOps world when (re)pulling the image or deploying it. Digging through.

Pushed a new try submission with catlee's work + fixing the workerTypes and some fuzzy signing jobs.

Attempting to fix the GCP side as well.

Some progress here, the jobs I scheduled earlier barfed in signingscript.

  File "/app/signingscript/", line 131, in get_suitable_signing_servers
    f"No signing servers found with cert type {cert_type} and formats {signing_formats}"
signingscript.exceptions.SigningScriptError: No signing servers found with cert type project:releng:signing:cert:dep-signing and formats ['autograph_authenticode']
exit code: 5

The job itself passes CoT and breaks in signingscript.

Which means:
a) we need to fix credentials in signingscript
b) the GCP CrashLoopBackOff is quite misleading, for now.
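
To illustrate why missing server configuration surfaces as this exception, here is a simplified sketch of a lookup by cert type and format. It is not the actual signingscript implementation: the config layout, the rule that a server must support every requested format, and the example server address are assumptions. Only the function name and the error message format are taken from the traceback above.

```python
class SigningScriptError(Exception):
    """Stand-in for signingscript.exceptions.SigningScriptError."""


def get_suitable_signing_servers(signing_servers, cert_type, signing_formats):
    """Return the servers for cert_type that support all requested formats.

    signing_servers maps a cert type to a list of dicts with a "formats"
    list (assumed layout). Raises SigningScriptError when nothing matches.
    """
    candidates = signing_servers.get(cert_type, [])
    suitable = [
        server
        for server in candidates
        if set(signing_formats) <= set(server["formats"])
    ]
    if not suitable:
        raise SigningScriptError(
            f"No signing servers found with cert type {cert_type} "
            f"and formats {signing_formats}"
        )
    return suitable


# Hypothetical config with no autograph_authenticode entry for dep-signing.
servers = {
    "project:releng:signing:cert:dep-signing": [
        {"server": "signing.example.invalid:9110", "formats": ["gpg"]},
    ],
}

try:
    get_suitable_signing_servers(
        servers,
        "project:releng:signing:cert:dep-signing",
        ["autograph_authenticode"],
    )
except SigningScriptError as exc:
    error_message = str(exc)
    print(error_message)
```

Under these assumptions, a worker whose config or credentials lack an entry for the requested format fails in this lookup, after CoT verification has already passed.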

Earlier I pushed a backout of catlee's revision to dev. I have now put that revision back and redeployed. Jenkins failed, so we need to understand what's going on there. Somehow the tasks do run, even though the CloudOps deployment fails. I reran and still hit the same error. Moreover, the worker-id is similar, which (TIL those are generated randomly) suggests the tasks are hitting the same worker in GCP.

So I'm thinking of the following scenario:

  • regardless of my pushing new images to Docker on dev, Jenkins fails to deploy.
  • GCP isn't able to deploy the new image, so tasks are talking to workers running an older version, which doesn't have @catlee's signingscript changes
  • since signingscript complains about the missing credentials, I suspect the deployment fails for the same reason: some corresponding env var is missing in the container
Assignee: rail → catlee
Pushed by
Use GCP signing workers r=mtabara
Closed: 3 years ago
Resolution: --- → FIXED
Pushed by
Port bug 1542819 - Use GCP signing workers. rs=bustage-fix
Pushed by
Use worker aliases for signing to unbreak TB. r=tomprince
Pushed by
Use worker aliases for signing to unbreak TB. r=tomprince a=Aryx

We've had to bump up the CPU/memory limits for these workers since they were getting OOM killed a lot.

Regressions: 1580054
No longer regressions: 1593816
See Also: → 1593816