Closed Bug 1643377 Opened 4 years ago Closed 4 years ago

generic-worker artifacts-upload panic during credentials rotation

Categories

(Infrastructure & Operations :: RelOps: Posix OS, task)

Tracking

(Not tracked)

RESOLVED INCOMPLETE

People

(Reporter: dhouse, Assigned: dhouse)

Details

On June 3rd between 16:00 and 16:30 UTC, the macOS workers stopped doing work.

The tasks the machines were running ended with claim-expired. The generic-worker process was not running when I checked them 5 hours later, after being alerted in #firefoxci that the macOS queue was high and growing.
After a reboot (and after fixing a different problem exposed by the timing of the reboot?), the workers resumed doing work.

The workers hit a certificate problem and panicked:

Jun 03 16:14:36 t-mojave-r7-431 worker Response Body:
Jun 03 16:14:36 t-mojave-r7-431 worker   "code": "AuthenticationFailed",
Jun 03 16:14:36 t-mojave-r7-431 worker   "message": "ext.certificate.signature is not valid, or wrong clientId provided\n\n---\n\n* method:     createArtifact\n* errorCode:  AuthenticationFailed\n* statusCode: 401\n* time:       2020-06-03T16:14:31.753Z",
[...]
 Jun 03 16:14:36 t-mojave-r7-431 worker goroutine 1 [running]:
Jun 03 16:14:36 t-mojave-r7-431 worker runtime/debug.Stack(0xc422a3ec90, 0x144bebc, 0x1593489)
Jun 03 16:14:36 t-mojave-r7-431 worker #011/home/travis/.gimme/versions/go1.10.3.src/src/runtime/debug/stack.go:24 +0xa7
Jun 03 16:14:36 t-mojave-r7-431 worker main.HandleCrash(0x14e1220, 0xc42025f290)
Jun 03 16:14:36 t-mojave-r7-431 worker #011/home/travis/gopath/src/github.com/taskcluster/generic-worker/main.go:673 +0x26
Jun 03 16:14:36 t-mojave-r7-431 worker main.RunWorker.func1(0xc422a3fda0)
Jun 03 16:14:36 t-mojave-r7-431 worker #011/home/travis/gopath/src/github.com/taskcluster/generic-worker/main.go:692 +0x52
Jun 03 16:14:36 t-mojave-r7-431 worker panic(0x14e1220, 0xc42025f290)
Jun 03 16:14:36 t-mojave-r7-431 worker #011/home/travis/.gimme/versions/go1.10.3.src/src/runtime/panic.go:502 +0x229
Jun 03 16:14:36 t-mojave-r7-431 worker main.(*TaskRun).Run.func1(0xc42000c138, 0xc420088a00)
Jun 03 16:14:36 t-mojave-r7-431 worker #011/home/travis/gopath/src/github.com/taskcluster/generic-worker/main.go:1219 +0xc5
[...]

From chat.m.o #firefox-ci, 14:54 Pacific, tom.prince:
I suspect this is fallout from rotating credentials earlier (cc: dustin)

Where do these workers get their credentials from? I wasn't aware that they were using temp creds.

(In reply to Dustin J. Mitchell [:dustin] (he/him) from comment #3)

Where do these workers get their credentials from? I wasn't aware that they were using temp creds.

They are using shared credentials per-pool. ClientID: project/releng/generic-worker/datacenter-gecko-t-osx (https://firefox-ci-tc.services.mozilla.com/auth/clients?search=datacenter)

The other generic-worker hardware workers hit the same problem, but fewer workers were active in those pools at the time of the rotation, so fewer were caught at that failure point within g-w. (I assume g-w created the temp creds for the artifact uploads, then the pool credentials were rotated, and the artifact-upload creds became invalid; see the sketch after the counts below.)

6 linux64 hardware workers
243 macs
9 windows

This matches all active workers at that time, 16:10 UTC:
6 linux64
246 macs
9 windows
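
For context on that failure mode, here is a minimal sketch of why rotating the pool's accessToken invalidates temporary credentials that were already minted from it. This is my own illustration, not the actual Taskcluster certificate format or generic-worker code: the point is just that the temp-cred certificate is HMAC-signed with the issuer's accessToken, and the auth service re-verifies that signature with whatever token it currently holds.

package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/base64"
	"fmt"
)

// sign stands in for the HMAC-SHA256 signature over the temporary-credential
// certificate, keyed by the issuing client's accessToken. The certificate
// layout and field names are simplified for illustration.
func sign(accessToken, certData string) string {
	mac := hmac.New(sha256.New, []byte(accessToken))
	mac.Write([]byte(certData))
	return base64.StdEncoding.EncodeToString(mac.Sum(nil))
}

func main() {
	certData := "issuer=project/releng/generic-worker/datacenter-gecko-t-osx;scopes=...;expiry=..."

	oldToken := "accessToken-before-rotation" // the token g-w signed with
	newToken := "accessToken-after-rotation"  // the token the auth service now holds

	sig := sign(oldToken, certData) // temp creds minted before the rotation

	// The auth service recomputes the signature with the current token, so the
	// stale temp creds fail createArtifact with
	// "ext.certificate.signature is not valid, or wrong clientId provided".
	fmt.Println("signature still valid?",
		hmac.Equal([]byte(sig), []byte(sign(newToken, certData))))
}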

Summary: Mac workers generic-worker dropped/missing → generic-worker artifacts-upload panic during credentials rotation

(In reply to Dave House [:dhouse] from comment #4)

This matches all active workers at that time, 16:10 UTC:
6 linux64
246 macs
9 windows

"active" running tasks

:markco Does the windows runner re-try/reboot on a 69 exit? If not, you might need to check these workers:
https://my.papertrailapp.com/groups/1958653/events?focus=1204872616029102109&q=%22Exiting%20worker%20with%20exit%20code%2069%22&selected=1204871293233692701

Jun 03 16:20:39 T-W1064-MS-208.mdc1.mozilla.com-1 generic-worker UTC Exiting worker with exit code 69#015
Jun 03 16:20:42 T-W1064-MS-216.mdc1.mozilla.com-1 generic-worker UTC Exiting worker with exit code 69#015
Jun 03 16:20:42 T-W1064-MS-034.mdc1.mozilla.com generic-worker UTC Exiting worker with exit code 69#015
Jun 03 16:20:42 T-W1064-MS-028.mdc1.mozilla.com generic-worker UTC Exiting worker with exit code 69#015
Jun 03 16:20:43 T-W1064-MS-079.mdc1.mozilla.com-1 generic-worker UTC Exiting worker with exit code 69#015
Jun 03 16:20:44 T-W1064-MS-163.mdc1.mozilla.com generic-worker UTC Exiting worker with exit code 69#015
Jun 03 16:21:01 T-W1064-MS-371.mdc2.mozilla.com generic-worker UTC Exiting worker with exit code 69#015
Jun 03 16:21:09 T-W1064-MS-330.mdc2.mozilla.com generic-worker UTC Exiting worker with exit code 69#015
Jun 03 16:21:11 T-W1064-MS-466.mdc2.mozilla.com generic-worker UTC Exiting worker with exit code 69#015

The Linux and macOS workers stalled on this panic caused by the invalid credentials. (We removed any retries on macOS/Linux because exit 69 usually meant a TC service was down or broken in the past, and retrying just led to the workers thrashing and spamming errors.)

Flags: needinfo?(mcornmesser)

Ah, I see, it was the task credentials that caused the panic. That's surprising (anything to do with a task should probably not panic, but be handled), but it at least makes sense.
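
As an illustration of that point, here is a minimal sketch (not generic-worker's actual upload code; the function name and URL below are made up for the example) of surfacing a 401 from createArtifact as an error that resolves the task, reserving panics for genuinely unexpected internal state:

package main

import (
	"errors"
	"fmt"
	"net/http"
)

var errAuthFailed = errors.New("createArtifact: 401 AuthenticationFailed")

// uploadArtifact is a stand-in for the queue createArtifact call; the URL is
// illustrative only.
func uploadArtifact(name string) error {
	resp, err := http.Post("https://queue.example.invalid/createArtifact/"+name, "application/json", nil)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode == http.StatusUnauthorized {
		return errAuthFailed // surface the 401 instead of panicking
	}
	return nil
}

func main() {
	if err := uploadArtifact("public/build/target.dmg"); err != nil {
		if errors.Is(err, errAuthFailed) {
			// Resolve the task (e.g. as an exception) and let the worker pick
			// up fresh credentials, rather than crashing the whole process.
			fmt.Println("resolving task: worker credentials no longer valid")
			return
		}
		fmt.Println("upload failed:", err)
	}
}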

(In reply to Dave House [:dhouse] from comment #6)

:markco Does the windows runner re-try/reboot on a 69 exit? If not, you might need to check these workers:

A 69 exit will trigger a cleanup to remove possibly-bad JSON files and then a reboot. You will see logs such as "Generic worker exit with code 69; Rebooting to recover" or "Generic worker exit with code %gw_exit_code% more than once; Attempting restore".

If it doesn't recover after multiple reboots, the node will attempt to restore itself to a pre-puppet configuration and re-bootstrap itself. There will be an accompanying log message of "Generic_worker has faild to start multiple times. Attempting restore.".
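
For the Linux/macOS bug, something along these lines is what I have in mind. A rough sketch in Go (the real Windows logic lives in the bootstrap scripts; the counter file path, threshold, and restore command below are assumptions for illustration, not the actual implementation):

package main

import (
	"log"
	"os"
	"os/exec"
	"strconv"
	"strings"
)

const (
	counterFile         = "/var/tmp/gw-exit-69-count" // illustrative location
	maxRecoveryAttempts = 3                           // assumed threshold
)

func readCount() int {
	b, err := os.ReadFile(counterFile)
	if err != nil {
		return 0
	}
	n, _ := strconv.Atoi(strings.TrimSpace(string(b)))
	return n
}

func runGenericWorker() int {
	cmd := exec.Command("generic-worker", "run", "--config", "generic-worker.config")
	err := cmd.Run()
	if err == nil {
		return 0
	}
	if exitErr, ok := err.(*exec.ExitError); ok {
		return exitErr.ExitCode()
	}
	return 1
}

func main() {
	code := runGenericWorker()
	if code != 69 {
		os.Remove(counterFile) // healthy exit: reset the failure counter
		log.Printf("generic-worker exited with code %d", code)
		return
	}

	count := readCount() + 1
	os.WriteFile(counterFile, []byte(strconv.Itoa(count)), 0o644)

	if count < maxRecoveryAttempts {
		// Mirrors "Generic worker exit with code 69; Rebooting to recover":
		// clean up possibly-bad state files, then reboot.
		log.Println("exit 69: cleaning up and rebooting to recover")
		exec.Command("shutdown", "-r", "now").Run()
		return
	}
	// Mirrors the "Attempting restore" path after repeated failures.
	log.Println("exit 69 repeatedly: attempting restore / re-bootstrap")
	exec.Command("restore-to-pre-puppet").Run() // hypothetical helper
}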

Flags: needinfo?(mcornmesser)

(In reply to Mark Cornmesser [:markco] from comment #8)

(In reply to Dave House [:dhouse] from comment #6)

:markco Does the windows runner re-try/reboot on a 69 exit? If not, you might need to check these workers:

A 69 exit will trigger a cleanup to remove possibly-bad JSON files and then a reboot. You will see logs such as "Generic worker exit with code 69; Rebooting to recover" or "Generic worker exit with code %gw_exit_code% more than once; Attempting restore".

If it doesn't recover after multiple reboots, the node will attempt to restore itself to a pre-puppet configuration and re-bootstrap itself. There will be an accompanying log message of "Generic_worker has faild to start multiple times. Attempting restore.".

That's great! I'll file a bug for us to pattern the linux+macos handling after that as well. (Dragos had set up the auto-quarantine and looked at similar self-recovery options; we might resume that work.)

We'll change the linux+macos hardware worker generic-worker exit code handling to be more like on the windows workers, through bug 1643694.

Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → INCOMPLETE