migrate signing workers to cloudops
Categories
(Release Engineering :: General, task)
Tracking
(firefox71 fixed)
| | Tracking | Status |
| --- | --- | --- |
| firefox71 | --- | fixed |
People
(Reporter: rail, Assigned: catlee)
References
(Blocks 1 open bug)
Details
Attachments
(3 files)
We plan to migrate our signing workers to GCP/CloudOps as one of the first worker types to move.
Listing their requirements:
- The workers need to be able to communicate with the signing servers, which are in MDC1 and MDC2. This may require some kind of tunnel. We should probably use at least two k8s clusters, each in the region closest to MDC1 and MDC2 respectively.
- The workers need to be able to communicate with the autograph server. AFAIK, we can use IP-based whitelisting for this.
- The workers will use the Dockerflow approach, except for the health check, which will be a CLI tool instead of an HTTP API endpoint (see the sketch after this list).
- There will be a special Docker instance (one per cluster) to handle autoscaling for these workers.
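As a rough illustration of the CLI health check idea, something along these lines could be run as a Kubernetes exec probe instead of Dockerflow's usual HTTP heartbeat endpoint. The file path, env var names, and threshold below are made-up placeholders, not the actual implementation:

```python
#!/usr/bin/env python3
"""Illustrative CLI health check; paths and variable names are hypothetical."""
import os
import sys
import time

# Hypothetical file the worker would touch after each successful poll/claim.
ACTIVITY_FILE = os.environ.get("ACTIVITY_FILE", "/app/state/last_activity")
MAX_AGE_SECONDS = int(os.environ.get("HEALTHCHECK_MAX_AGE", "600"))


def main() -> int:
    try:
        age = time.time() - os.path.getmtime(ACTIVITY_FILE)
    except OSError:
        print("unhealthy: no activity recorded yet", file=sys.stderr)
        return 1
    if age > MAX_AGE_SECONDS:
        print(f"unhealthy: last activity {age:.0f}s ago", file=sys.stderr)
        return 1
    print("healthy")
    return 0


if __name__ == "__main__":
    sys.exit(main())
```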
Anything that I missed?
Reporter
Updated•2 years ago
Comment 1•2 years ago
@catlee did some initial work here to test things in our dev environment.
He pushed his signingscript code to the dev branch and pushed a try submission. The tasks in https://tools.taskcluster.net/groups/O-oe5xMyQomr_ZFaI5fmcQ/tasks/GVr32A9KQs2wLvXdfFJ6yw/details sat in the pending state because no worker picked up the job.
Rok and I did some debugging earlier today and we found three potential culprits:
a) dev was not up to date with Rail's recent work cleaning up the signingscript repo.
Solution: merge the master branch into dev to bring it up to date.
b) there's a constant CrashLoopBackOff error that shows up when recreating the deployments in CloudOps.
There's more documentation on this in here that we can start from.
c) the workerType is wrong.
Solution: it expects gecko-1-signing-dev but was instead fed gecko-t-signing-dev in here.
Because of this, the correct workers are claiming work, but the submitted tasks are queued against the wrong workerType.
I'm stealing @catlee's try submission, applying the three solutions above, and trying again.
Once we have a dev signing worker running successfully, we can start rolling out the production ones.
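To make culprit c) concrete: a task only gets claimed when its provisionerId/workerType pair matches what the worker polls for; otherwise it sits in pending forever, which is exactly the symptom above. A minimal sketch of that check (the provisioner name and config keys here are assumptions, not the actual deployment config):

```python
# Illustrative only; the provisioner name and config keys are assumptions.
worker_config = {
    "provisioner_id": "scriptworker-prov-v1",
    "worker_type": "gecko-1-signing-dev",    # queue the worker actually polls
}

submitted_task = {
    "provisionerId": "scriptworker-prov-v1",
    "workerType": "gecko-t-signing-dev",     # queue the try push submitted to
}

# The task is only claimed when both values match; a mismatch leaves it pending.
if (worker_config["provisioner_id"], worker_config["worker_type"]) != (
    submitted_task["provisionerId"],
    submitted_task["workerType"],
):
    raise SystemExit(
        "workerType mismatch: worker polls {}, task was submitted to {}".format(
            worker_config["worker_type"], submitted_task["workerType"]
        )
    )
```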
Comment 2•2 years ago
More debugging information: the commit was https://github.com/mozilla-releng/signingscript/commit/fef19a132e16f76d4baa7944466901276c373d35, which corresponds to the latest tag that was pushed under mozilla/releng-signingscript. So we pushed successfully. If there's an error, it must be somewhere in CloudOps world when (re)pulling the image or deploying it. Digging through.
Comment 3•2 years ago
Pushed a new try submission https://treeherder.mozilla.org/#/jobs?repo=try&revision=10e2e458dc84965b894dcb79f6712c2d9678bc93 with catlee's work plus the workerType fix, and some signing jobs selected via try fuzzy.
Attempting to fix the GCP side as well.
Comment 4•2 years ago
Some progress here: the jobs I scheduled earlier barfed in signingscript.
File "/app/signingscript/sign.py", line 131, in get_suitable_signing_servers
f"No signing servers found with cert type {cert_type} and formats {signing_formats}"
signingscript.exceptions.SigningScriptError: No signing servers found with cert type project:releng:signing:cert:dep-signing and formats ['autograph_authenticode']
exit code: 5
The job itself passes CoT and breaks in signingscript.
Which means:
a) we need to fix credentials in signingscript
b) the GCP CrashLoopBackOff is quite misleading, for now.
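For context, roughly what the failing lookup is doing (a simplified approximation of signingscript's get_suitable_signing_servers, not the exact code): it filters the configured signing servers by cert type and requested formats and raises when nothing matches, so a worker deployed without the right server/credentials config hits exactly this error:

```python
# Simplified approximation of the lookup that fails above; the real
# signingscript code differs in the details.
class SigningScriptError(Exception):
    pass


def get_suitable_signing_servers(signing_servers, cert_type, signing_formats):
    """Return configured servers for `cert_type` that support the requested formats.

    `signing_servers` maps cert type -> list of (server, formats) pairs, as
    loaded from the worker's signing server/credentials config.
    """
    suitable = [
        server
        for server, formats in signing_servers.get(cert_type, [])
        if set(signing_formats) & set(formats)
    ]
    if not suitable:
        raise SigningScriptError(
            f"No signing servers found with cert type {cert_type} "
            f"and formats {signing_formats}"
        )
    return suitable


# A worker deployed without the dep-signing pool configured reproduces the log:
try:
    get_suitable_signing_servers(
        {}, "project:releng:signing:cert:dep-signing", ["autograph_authenticode"]
    )
except SigningScriptError as e:
    print(e)
```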
Comment 5•2 years ago
Earlier I pushed to dev a revision that backed out catlee's work. I put that back and redeployed. Jenkins failed, so we need to understand what's going on there. Somehow the tasks do run, even though the CloudOps deployment fails. I reran and still hit the same error. Moreover, the worker-id is the same, which, based on this (TIL those are generated randomly), means it's hitting the same worker in GCP.
So I'm thinking of the following scenario:
- Regardless of me pushing new images to Docker on dev, Jenkins fails to deploy.
- GCP isn't able to deploy the new image, so tasks are talking to workers running an older version, which doesn't have @catlee's signingscript changes.
- Since signingscript complains about the missing credentials, I'm thinking the deployment fails for the same reason: some corresponding env var is missing in the container (see the sketch below).
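If that hypothesis is right, one cheap way to surface it would be a startup check in the container entrypoint that refuses to start when required environment variables are absent, so the failure shows up at deploy time rather than as a confusing signing error later. Illustrative only; the variable names below are made up, not the worker's real configuration:

```python
#!/usr/bin/env python3
"""Illustrative startup check; the variable names are hypothetical."""
import os
import sys

# Hypothetical environment variables the signing worker would need.
REQUIRED_ENV_VARS = (
    "TASKCLUSTER_CLIENT_ID",
    "TASKCLUSTER_ACCESS_TOKEN",
    "SIGNING_SERVER_CONFIG",
)

missing = [name for name in REQUIRED_ENV_VARS if not os.environ.get(name)]
if missing:
    print("refusing to start, missing env vars: " + ", ".join(missing), file=sys.stderr)
    sys.exit(1)
print("all required env vars present, starting worker")
```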
Assignee
Updated•1 year ago
Comment 6•1 year ago
Assignee
Comment 7•1 year ago
Pushed by catlee@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/e99752b85f6c Use GCP signing workers r=mtabara
Comment 9•1 year ago
bugherder
Comment 10•1 year ago
Pushed by mozilla@jorgk.com: https://hg.mozilla.org/comm-central/rev/919bb48457aa Port bug 1542819 - Use GCP signing workers. rs=bustage-fix
Assignee
Comment 11•1 year ago
Comment 12•1 year ago
I backed that C-C piece out again:
https://hg.mozilla.org/comm-central/rev/e18bd002ba01b01a5f7e939a30c50e7ca9b8cdab
Comment 13•1 year ago
Pushed by catlee@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/61f8731a42cb Use worker aliases for signing to unbreak TB. r=tomprince
Comment 14•1 year ago
Pushed by ccoroiu@mozilla.com: https://hg.mozilla.org/mozilla-central/rev/480000073f46 Use worker aliases for signing to unbreak TB. r=tomprince a=Aryx
Assignee
Comment 15•1 year ago
We've had to bump up the CPU/memory limits for these workers since they were getting OOM killed a lot.
Comment 16•1 year ago
bugherder