mono-repo balrogscript deployment to production breaks gecko-3-balrog GCP workers
Categories
(Release Engineering :: Release Automation, defect)
Tracking
(Not tracked)
People
(Reporter: apavel, Assigned: mtabara)
References
Details
(Whiteboard: [stockwell disable-recommended])
Attachments
(1 file)
Assignee | ||
Updated•6 years ago
|
Assignee | ||
Comment 1•6 years ago
|
||
We had a green balrog job running through the AM nightly at 11:20 UTC: https://tools.taskcluster.net/groups/efiPWf9LSMq9J19mjfp3Tw/tasks/Ao4C9XcPQUe0fPAhJrJoyA/runs/0
At 11:50 UTC I deployed balrogscript for the first time from its new home, the mono-repo. The change was https://github.com/mozilla-releng/scriptworker-scripts/pull/24, which I pushed to the production branch.
See:
- https://github.com/mozilla-releng/scriptworker-scripts/commits/production
- https://tools.taskcluster.net/groups/HbbX4LJOT0uBRuiGZXdbCQ
Something went wrong along the way; digging now.
Assignee | ||
Comment 2•6 years ago
|
||
Found the culprit, it's related to bug 1587068.
We adjusted for signingscript only, but grafted the changes to mono-repo. Then we switched over balrogscript, which still has its ed25519 key double-encoded, but uses https://github.com/mozilla-releng/scriptworker-scripts/blob/master/docker.d/init.sh#L105 in deployment.
The balrog jobs are actually working as expected, the worker is just killing the task afterwards with internal-error.
We can either:
a) patch up mono-repo for now as balrogworkers are the only ones switched to production
b) ask CloudOps to assist us directly on the secrets side.
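To make the double-encoding problem concrete, here is a minimal sketch. It assumes the encoding in question is base64 and that the key is a 32-byte ed25519 seed; the `decode_key` helper and the placeholder key are hypothetical and are not taken from init.sh or balrogscript.

```python
import base64

# Placeholder standing in for a 32-byte ed25519 seed (not a real key).
raw_key = bytes(range(32))

once = base64.b64encode(raw_key)   # what a single-encoded secret would hold
twice = base64.b64encode(once)     # what a double-encoded secret would hold


def decode_key(blob: bytes, expected_len: int = 32) -> bytes:
    """Strip up to two base64 layers until the raw key length is reached.

    Hypothetical helper: tolerates both single- and double-encoded secrets.
    """
    for _ in range(2):
        blob = base64.b64decode(blob, validate=True)
        if len(blob) == expected_len:
            return blob
    raise ValueError("key is not %d bytes after decoding" % expected_len)


# Decoding a double-encoded secret only once leaves base64 text, not key bytes.
still_encoded = base64.b64decode(twice)
```

Under these assumptions, a deployment step that decodes exactly one layer hands a double-encoded worker the 44-character base64 text in `still_encoded` instead of the 32 raw bytes, which matches the kind of mismatch described above.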
Assignee | ||
Comment 3•6 years ago
|
||
Assignee | ||
Comment 4•6 years ago
|
||
(In reply to Mihai Tabara [:mtabara]⌚️GMT from comment #2)
> Found the culprit, it's related to bug 1587068.
> We adjusted for signingscript only, but grafted the changes to mono-repo. Then we switched over balrogscript, which still has its ed25519 key double-encoded, but uses https://github.com/mozilla-releng/scriptworker-scripts/blob/master/docker.d/init.sh#L105 in deployment.
> The balrog jobs are actually working as expected, the worker is just killing the task afterwards with internal-error.
> We can either:
> a) patch up mono-repo for now as balrogworkers are the only ones switched to production

Just did this for now to unblock. Hit another issue with mono-repo deployment (potentially a side-effect of https://github.com/mozilla-releng/scriptworker-scripts/pull/26). Working on debugging this and prepping a fix. If this takes too long, I'll push to production which I know for sure it works ...
Assignee | ||
Comment 5•6 years ago
|
||
Will push in the morning. Evening nightlies will likely fail with Exception. Feel free to ignore, fix is coming in the morning and we'll rerun the jobs.
Comment hidden (Intermittent Failures Robot) |
Assignee | ||
Comment 7•6 years ago
|
||
Turns out we were pushing the wrong Docker tag to our Docker Hub registry, so Jenkins didn't pick it up to deploy further to GCP. We were including the project name (e.g. balrogscript) in the tag, which was incorrect. We pushed https://github.com/mozilla-releng/scriptworker-scripts/commit/c212afe7d381fedac6719b07ceebcc527c109e7d to fix this and re-deployed.
I've rerun all jobs from the most recent nightly (AM UTC on 9th October, https://tools.taskcluster.net/groups/S3tAxSvvTr2WdORCBDF0mw) and they are turning green again.
We're done here, sorry again for the 2 days of bustage.
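A guard against this class of mistake could look like the sketch below. The image names and tag layout are hypothetical; the actual tags expected by Jenkins are not given in this bug. The sketch only checks whether the tag portion of an image reference embeds the project name.

```python
from typing import Tuple


def split_image_ref(ref: str) -> Tuple[str, str]:
    """Split 'repo:tag' into (repo, tag); default to 'latest' if no tag."""
    repo, sep, tag = ref.rpartition(":")
    if not sep or "/" in tag:
        # No tag present; a colon followed by a slash is a registry port.
        return ref, "latest"
    return repo, tag


def tag_embeds_project(ref: str, project: str) -> bool:
    """Return True if the tag part of the reference contains the project name."""
    _, tag = split_image_ref(ref)
    return project in tag


# Hypothetical references, for illustration only.
bad_ref = "example/scriptworker:balrogscript-production"
good_ref = "example/balrogscript:production"
```

With these made-up references, `tag_embeds_project(bad_ref, "balrogscript")` is true and the same call on `good_ref` is false, so a check like this could run before the push to the registry.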
Assignee | ||
Updated•6 years ago
|
Updated•7 months ago
|