Closed Bug 1643377 Opened 4 years ago Closed 4 years ago

generic-worker artifacts-upload panic during credentials rotation

Categories

(Infrastructure & Operations :: RelOps: Posix OS, task)

Tracking

(Not tracked)

RESOLVED INCOMPLETE

People

(Reporter: dhouse, Assigned: dhouse)

Details

On June 3rd between 16:00 and 16:30 UTC, the macOS workers stopped doing work.

The tasks the machines were running ended with claim-expired. The generic-worker process was not running when I checked them 5 hours later, after being alerted in #firefoxci that the macOS queue was high and growing.
After a reboot (and after fixing a different problem exposed by the timing of the reboot?), the workers resumed doing work.

The workers hit a certificate problem and panicked:

Jun 03 16:14:36 t-mojave-r7-431 worker Response Body:
Jun 03 16:14:36 t-mojave-r7-431 worker   "code": "AuthenticationFailed",
Jun 03 16:14:36 t-mojave-r7-431 worker   "message": "ext.certificate.signature is not valid, or wrong clientId provided\n\n---\n\n* method:     createArtifact\n* errorCode:  AuthenticationFailed\n* statusCode: 401\n* time:       2020-06-03T16:14:31.753Z",
[...]
 Jun 03 16:14:36 t-mojave-r7-431 worker goroutine 1 [running]:
Jun 03 16:14:36 t-mojave-r7-431 worker runtime/debug.Stack(0xc422a3ec90, 0x144bebc, 0x1593489)
Jun 03 16:14:36 t-mojave-r7-431 worker #011/home/travis/.gimme/versions/go1.10.3.src/src/runtime/debug/stack.go:24 +0xa7
Jun 03 16:14:36 t-mojave-r7-431 worker main.HandleCrash(0x14e1220, 0xc42025f290)
Jun 03 16:14:36 t-mojave-r7-431 worker #011/home/travis/gopath/src/github.com/taskcluster/generic-worker/main.go:673 +0x26
Jun 03 16:14:36 t-mojave-r7-431 worker main.RunWorker.func1(0xc422a3fda0)
Jun 03 16:14:36 t-mojave-r7-431 worker #011/home/travis/gopath/src/github.com/taskcluster/generic-worker/main.go:692 +0x52
Jun 03 16:14:36 t-mojave-r7-431 worker panic(0x14e1220, 0xc42025f290)
Jun 03 16:14:36 t-mojave-r7-431 worker #011/home/travis/.gimme/versions/go1.10.3.src/src/runtime/panic.go:502 +0x229
Jun 03 16:14:36 t-mojave-r7-431 worker main.(*TaskRun).Run.func1(0xc42000c138, 0xc420088a00)
Jun 03 16:14:36 t-mojave-r7-431 worker #011/home/travis/gopath/src/github.com/taskcluster/generic-worker/main.go:1219 +0xc5
[...]

From chat.m.o #firefox-ci, 14:54 Pacific, tom.prince:
I suspect this is fallout from rotating credentials earlier (cc: dustin)

Where do these workers get their credentials from? I wasn't aware that they were using temp creds.

(In reply to Dustin J. Mitchell [:dustin] (he/him) from comment #3)

Where do these workers get their credentials from? I wasn't aware that they were using temp creds.

They are using shared credentials per-pool. ClientID: project/releng/generic-worker/datacenter-gecko-t-osx (https://firefox-ci-tc.services.mozilla.com/auth/clients?search=datacenter)

The other generic-worker hardware workers hit the same problem, but fewer workers were active in those pools at the time of the rotation, so fewer were caught at that failure point within g-w. (I assume g-w created the temp creds for the artifact uploads, then the pool credentials were rotated, and the artifact-upload creds became invalid; see the sketch after the counts below.)

6 linux64 hardware workers
243 macs
9 windows

This matches all active workers at that time, 16:10 UTC:
6 linux64
246 macs
9 windows
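
For context on that failure mode, here is a minimal sketch of why rotating the pool's accessToken invalidates temporary credentials that were already minted from it. This is my own illustration, not the actual Taskcluster certificate format or generic-worker code: the point is just that the temp-cred certificate is HMAC-signed with the issuer's accessToken, and the auth service re-verifies that signature with whatever token it currently holds.

package main

import (
	"crypto/hmac"
	"crypto/sha256"
	"encoding/base64"
	"fmt"
)

// sign stands in for the HMAC-SHA256 signature over the temporary-credential
// certificate, keyed by the issuing client's accessToken. The certificate
// layout and field names are simplified for illustration.
func sign(accessToken, certData string) string {
	mac := hmac.New(sha256.New, []byte(accessToken))
	mac.Write([]byte(certData))
	return base64.StdEncoding.EncodeToString(mac.Sum(nil))
}

func main() {
	certData := "issuer=project/releng/generic-worker/datacenter-gecko-t-osx;scopes=...;expiry=..."

	oldToken := "accessToken-before-rotation" // the token g-w signed with
	newToken := "accessToken-after-rotation"  // the token the auth service now holds

	sig := sign(oldToken, certData) // temp creds minted before the rotation

	// The auth service recomputes the signature with the current token, so the
	// stale temp creds fail createArtifact with
	// "ext.certificate.signature is not valid, or wrong clientId provided".
	fmt.Println("signature still valid?",
		hmac.Equal([]byte(sig), []byte(sign(newToken, certData))))
}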

Summary: Mac workers generic-worker dropped/missing → generic-worker artifacts-upload panic during credentials rotation

(In reply to Dave House [:dhouse] from comment #4)

This matches all active workers at that time, 16:10 UTC:
6 linux64
246 macs
9 windows

"active" running tasks

:markco Does the windows runner re-try/reboot on a 69 exit? If not, you might need to check these workers:
https://my.papertrailapp.com/groups/1958653/events?focus=1204872616029102109&q=%22Exiting%20worker%20with%20exit%20code%2069%22&selected=1204871293233692701

Jun 03 16:20:39 T-W1064-MS-208.mdc1.mozilla.com-1 generic-worker UTC Exiting worker with exit code 69#015
Jun 03 16:20:42 T-W1064-MS-216.mdc1.mozilla.com-1 generic-worker UTC Exiting worker with exit code 69#015
Jun 03 16:20:42 T-W1064-MS-034.mdc1.mozilla.com generic-worker UTC Exiting worker with exit code 69#015
Jun 03 16:20:42 T-W1064-MS-028.mdc1.mozilla.com generic-worker UTC Exiting worker with exit code 69#015
Jun 03 16:20:43 T-W1064-MS-079.mdc1.mozilla.com-1 generic-worker UTC Exiting worker with exit code 69#015
Jun 03 16:20:44 T-W1064-MS-163.mdc1.mozilla.com generic-worker UTC Exiting worker with exit code 69#015
Jun 03 16:21:01 T-W1064-MS-371.mdc2.mozilla.com generic-worker UTC Exiting worker with exit code 69#015
Jun 03 16:21:09 T-W1064-MS-330.mdc2.mozilla.com generic-worker UTC Exiting worker with exit code 69#015
Jun 03 16:21:11 T-W1064-MS-466.mdc2.mozilla.com generic-worker UTC Exiting worker with exit code 69#015

The Linux and macOS workers stalled on this panic caused by the invalid credentials. (We removed any retries on macOS/Linux because exit 69 usually meant a TC service was down or broken in the past, and retrying just led to the workers thrashing and spamming errors.)

Flags: needinfo?(mcornmesser)

Ah, I see, it was the task credentials that caused the panic. That's surprising (anything to do with a task should probably not panic, but be handled), but it at least makes sense.
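
As an illustration of that point, here is a minimal sketch (not generic-worker's actual upload code; the function name and URL below are made up for the example) of surfacing a 401 from createArtifact as an error that resolves the task, reserving panics for genuinely unexpected internal state:

package main

import (
	"errors"
	"fmt"
	"net/http"
)

var errAuthFailed = errors.New("createArtifact: 401 AuthenticationFailed")

// uploadArtifact is a stand-in for the queue createArtifact call; the URL is
// illustrative only.
func uploadArtifact(name string) error {
	resp, err := http.Post("https://queue.example.invalid/createArtifact/"+name, "application/json", nil)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode == http.StatusUnauthorized {
		return errAuthFailed // surface the 401 instead of panicking
	}
	return nil
}

func main() {
	if err := uploadArtifact("public/build/target.dmg"); err != nil {
		if errors.Is(err, errAuthFailed) {
			// Resolve the task (e.g. as an exception) and let the worker pick
			// up fresh credentials, rather than crashing the whole process.
			fmt.Println("resolving task: worker credentials no longer valid")
			return
		}
		fmt.Println("upload failed:", err)
	}
}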

(In reply to Dave House [:dhouse] from comment #6)

:markco Does the windows runner re-try/reboot on a 69 exit? If not, you might need to check these workers:

A 69 exit will trigger a cleanup to remove possibly-bad JSON files and then a reboot. You will see logs such as "Generic worker exit with code 69; Rebooting to recover" or "Generic worker exit with code %gw_exit_code% more than once; Attempting restore".

If it doesn't recover after multiple reboots, the node will attempt to restore itself to a pre-puppet configuration and re-bootstrap itself. There will be an accompanying log message of "Generic_worker has faild to start multiple times. Attempting restore.".
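
For the Linux/macOS bug, something along these lines is what I have in mind. A rough sketch in Go (the real Windows logic lives in the bootstrap scripts; the counter file path, threshold, and restore command below are assumptions for illustration, not the actual implementation):

package main

import (
	"log"
	"os"
	"os/exec"
	"strconv"
	"strings"
)

const (
	counterFile         = "/var/tmp/gw-exit-69-count" // illustrative location
	maxRecoveryAttempts = 3                           // assumed threshold
)

func readCount() int {
	b, err := os.ReadFile(counterFile)
	if err != nil {
		return 0
	}
	n, _ := strconv.Atoi(strings.TrimSpace(string(b)))
	return n
}

func runGenericWorker() int {
	cmd := exec.Command("generic-worker", "run", "--config", "generic-worker.config")
	err := cmd.Run()
	if err == nil {
		return 0
	}
	if exitErr, ok := err.(*exec.ExitError); ok {
		return exitErr.ExitCode()
	}
	return 1
}

func main() {
	code := runGenericWorker()
	if code != 69 {
		os.Remove(counterFile) // healthy exit: reset the failure counter
		log.Printf("generic-worker exited with code %d", code)
		return
	}

	count := readCount() + 1
	os.WriteFile(counterFile, []byte(strconv.Itoa(count)), 0o644)

	if count < maxRecoveryAttempts {
		// Mirrors "Generic worker exit with code 69; Rebooting to recover":
		// clean up possibly-bad state files, then reboot.
		log.Println("exit 69: cleaning up and rebooting to recover")
		exec.Command("shutdown", "-r", "now").Run()
		return
	}
	// Mirrors the "Attempting restore" path after repeated failures.
	log.Println("exit 69 repeatedly: attempting restore / re-bootstrap")
	exec.Command("restore-to-pre-puppet").Run() // hypothetical helper
}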

Flags: needinfo?(mcornmesser)

(In reply to Mark Cornmesser [:markco] from comment #8)

(In reply to Dave House [:dhouse] from comment #6)

:markco Does the windows runner re-try/reboot on a 69 exit? If not, you might need to check these workers:

A 69 exit will trigger a cleanup to remove possibly-bad JSON files and then a reboot. You will see logs such as "Generic worker exit with code 69; Rebooting to recover" or "Generic worker exit with code %gw_exit_code% more than once; Attempting restore".

If it doesn't recover after multiple reboots, the node will attempt to restore itself to a pre-puppet configuration and re-bootstrap itself. There will be an accompanying log message of "Generic_worker has faild to start multiple times. Attempting restore.".

That's great! I'll file a bug for us to pattern the linux+macos handling after that as well. (Dragos had set up the auto-quarantine and looked at similar self-recovery options; we might resume that work.)

We'll change the linux+macos hardware worker generic-worker exit code handling to be more like on the windows workers, through bug 1643694.

Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → INCOMPLETE