Closed Bug 1587078 Opened 6 years ago Closed 6 years ago

mono-repo balrogscript deployment to production breaks gecko-3-balrog GCP workers

Categories

(Release Engineering :: Release Automation, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: apavel, Assigned: mtabara)

References

Details

(Whiteboard: [stockwell disable-recommended])

Attachments

(1 file)

Assignee: nobody → mtabara
Component: Release Automation: Signing → Release Automation: Updates
QA Contact: aki → mtabara
Summary: Perma Balrog complete updates Signing exceptions → mono-repo balrogscript deployment to production breaks gecko-3-balrog GCP workers

We had a green balrog job running through the AM nightly at 11:20 UTC https://tools.taskcluster.net/groups/efiPWf9LSMq9J19mjfp3Tw/tasks/Ao4C9XcPQUe0fPAhJrJoyA/runs/0

At 11:50 UTC I deployed balrogscript for the first time from its new home, the mono-repo. The change was https://github.com/mozilla-releng/scriptworker-scripts/pull/24, which I pushed to the production branch.
See:

Something went wrong along the way; digging now.

See Also: → 1580054

Found the culprit, it's related to bug 1587068.

We adjusted for signingscript only, but grafted the changes to mono-repo. Then we switched over balrogscript, which still has its ed25519 key double-encoded, but uses https://github.com/mozilla-releng/scriptworker-scripts/blob/master/docker.d/init.sh#L105 in deployment.

The balrog jobs are actually working as expected, the worker is just killing the task afterwards with internal-error.

We can either:
a) patch up mono-repo for now as balrogworkers are the only ones switched to production
b) ask CloudOps to assist us directly on the secrets side.
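
To illustrate the mismatch described above, here is a minimal Python sketch. It assumes the deployment step applies exactly one base64 decode to the secret (an assumption; the linked init.sh line is not reproduced here), and it uses a made-up 32-byte value in place of a real ed25519 key. The helper names `decode_times` and `looks_base64` are hypothetical.

```python
import base64
import binascii


def decode_times(value: bytes, rounds: int) -> bytes:
    """Apply strict base64 decoding `rounds` times."""
    for _ in range(rounds):
        value = base64.b64decode(value, validate=True)
    return value


def looks_base64(value: bytes) -> bool:
    """Return True if `value` is itself valid base64 text.

    For key material this hints that the value is still encoded.
    """
    try:
        base64.b64decode(value, validate=True)
        return True
    except (binascii.Error, ValueError):
        return False


# Hypothetical stand-in for an ed25519 seed; NOT a real key.
raw_key = bytes(range(32))

single = base64.b64encode(raw_key)   # what a single-decode step expects
double = base64.b64encode(single)    # a double-encoded secret

# Assumption: the deployment decodes the secret exactly once.
after_one_decode = decode_times(double, 1)

if looks_base64(after_one_decode):
    print("still base64 after one decode: secret is double-encoded")
```

Under these assumptions, one decode of a double-encoded secret leaves base64 text rather than the raw key bytes, while two decodes recover the original value. A check like `looks_base64` could flag the mismatch at deployment time.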

(In reply to Mihai Tabara [:mtabara]⌚️GMT from comment #2)

> Found the culprit, it's related to bug 1587068.
>
> We adjusted for signingscript only, but grafted the changes to mono-repo. Then we switched over balrogscript, which still has its ed25519 key double-encoded, but uses https://github.com/mozilla-releng/scriptworker-scripts/blob/master/docker.d/init.sh#L105 in deployment.
>
> The balrog jobs are actually working as expected, the worker is just killing the task afterwards with internal-error.
>
> We can either:
> a) patch up mono-repo for now as balrogworkers are the only ones switched to production

Just did this for now to unblock. Hit another issue with the mono-repo deployment (potentially a side effect of https://github.com/mozilla-releng/scriptworker-scripts/pull/26). I'm debugging it and prepping a fix. If this takes too long, I'll push to production the version that I know for sure works ...

Will push in the morning. Evening nightlies will likely fail with Exception. Feel free to ignore those; the fix is coming in the morning and we'll rerun the jobs.

Turns out we were pushing the wrong Docker tag to our Docker Hub registry, so Jenkins didn't pick it up to deploy further to GCP. We were including the project name (e.g. balrogscript) in the tag, which was incorrect. We pushed https://github.com/mozilla-releng/scriptworker-scripts/commit/c212afe7d381fedac6719b07ceebcc527c109e7d to fix this and redeployed.
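
As a rough sketch of the tagging problem, the Python snippet below builds a tag from deployment information only and checks whether a tag accidentally carries the project name. The tag format (`<branch>-<short sha>`), the placeholder hash `abc1234`, and the helper names are all assumptions for illustration; the format Jenkins actually expects is not stated in this bug.

```python
def tag_for_deploy(branch: str, short_sha: str) -> str:
    """Build an image tag from deployment info only (hypothetical format)."""
    return f"{branch}-{short_sha}"


def tag_mentions_project(tag: str, project: str) -> bool:
    """Return True if the project name leaked into the tag."""
    return project.lower() in tag.lower()


# Placeholder values, not taken from the real pipeline.
good_tag = tag_for_deploy("production", "abc1234")
bad_tag = f"balrogscript-{good_tag}"

for tag in (good_tag, bad_tag):
    if tag_mentions_project(tag, "balrogscript"):
        print(f"{tag}: project name in tag, deploy tooling may not match it")
    else:
        print(f"{tag}: ok")
```

A guard like this could run before the image is pushed, so a mismatched tag fails the build instead of being silently ignored downstream.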

I've rerun all jobs from the most recent nightly (AM UTC on 9th October, in https://tools.taskcluster.net/groups/S3tAxSvvTr2WdORCBDF0mw) and they are turning green again.

We're done here, sorry again for the two days of bustage.

Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Component: Release Automation: Updates → Release Automation