Add retry/cleanup for linux+macos generic-worker exit code 69
Categories
(Infrastructure & Operations :: RelOps: Posix OS, task)
Tracking
(Not tracked)
People
(Reporter: dhouse, Unassigned)
Attachments
(1 file: 100 bytes, text/plain)
bug 1643377: The latest generic-worker (g-w) panic was caused by credential rotation on the taskcluster (TC) services. g-w had created temporary credentials for the task it was running, and the rotation invalidated them during the task run. g-w panicked on the credentials being incorrect and exited with exit code 69. The windows workers recovered from this after a reboot, but the linux and macos workers stopped and did not reboot.
Some of the handling for windows was added in bug 1595261 (cleanup of config files).
This is a repeat of bug 1537886:
A. we looked at g-w giving different exit codes for service failures (answer: no), and
B. we looked at worker-runner replacing the quarantine-worker/handling (answer: no, it will not auto-quarantine).
Some of the history for this:
1. the worker execution script was rebooting on any exit of g-w, and when TC services were down that caused the workers to thrash through repeated reboots.
   a. This would "eat through" valid task runs in the queue because the results would be marked as failures (requiring re-runs of the tasks).
2. bug 1461913: a self-quarantine was set up, and workers self-quarantined and stopped doing work.
   a. 2+ times, the #2 self-quarantine ended with a majority of the macos and linux pools stopping work because of a brief TC service outage.
   b. a bug was created for each worker during service outages.
3. bug 1479692: the self-quarantine was adjusted to allow a no-problems startup to remove the quarantine.
   a. then we could reboot the workers for them to retry (instead of needing to remove the quarantine and close the bug manually).
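To make the adjusted self-quarantine behavior concrete, here is a minimal Python sketch of a quarantine that a clean startup removes. It is not the actual worker script: the marker file, the function names, and the "run"/"hold" results are all assumptions made for illustration.

```python
from pathlib import Path


def set_quarantine(marker: Path, reason: str) -> None:
    """Record that this worker quarantined itself (hypothetical marker file)."""
    marker.write_text(reason + "\n")


def is_quarantined(marker: Path) -> bool:
    """A worker counts as quarantined while the marker file exists."""
    return marker.exists()


def on_startup(marker: Path, startup_ok: bool) -> str:
    """Decide what to do at worker startup.

    A no-problems startup removes any existing quarantine, so a plain
    reboot is enough for the worker to retry. A failed startup sets
    (or keeps) the quarantine and the worker takes no tasks.
    """
    if startup_ok:
        marker.unlink(missing_ok=True)  # requires Python 3.8+
        return "run"
    if not is_quarantined(marker):
        set_quarantine(marker, "startup check failed")
    return "hold"
```

With this shape, a brief service outage leaves workers held only until their next successful startup, rather than until someone removes the quarantine by hand.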
The handling on windows recovered the workers during this last problem (cert rotation), and I think it may be a better pattern to follow for linux+macos than what we have currently (quarantine and wait for manual intervention).
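As a starting point for discussion, the following Python sketch shows one possible shape for retry/cleanup handling of exit code 69 in a worker wrapper. Only the exit code 69 comes from this bug. The action names, the backoff values, the attempt limit, and the `cleanup` callback are assumptions for illustration, not the existing linux/macos script or the windows implementation. The delay between attempts is there to avoid the repeated-reboot thrashing described in the history above.

```python
import subprocess
from typing import Callable, Sequence, Tuple

PANIC_EXIT_CODE = 69  # exit code reported in this bug for the g-w panic


def backoff_seconds(attempt: int, base: float = 30.0, cap: float = 900.0) -> float:
    """Exponential delay so repeated failures do not cause rapid reboot loops."""
    return min(cap, base * (2 ** attempt))


def decide(exit_code: int, attempt: int, max_attempts: int = 5) -> Tuple[str, float]:
    """Return (action, delay_seconds) for one worker exit (hypothetical policy)."""
    if exit_code != PANIC_EXIT_CODE:
        # Leave every other exit code to whatever handling already exists.
        return ("existing-handling", 0.0)
    if attempt >= max_attempts:
        # Too many consecutive panics: fall back to quarantine.
        return ("quarantine", 0.0)
    return ("cleanup-and-reboot", backoff_seconds(attempt))


def run_once(cmd: Sequence[str], attempt: int,
             cleanup: Callable[[], None]) -> Tuple[str, float]:
    """Run the worker command once, then clean up if the policy asks for it."""
    exit_code = subprocess.run(list(cmd)).returncode
    action, delay = decide(exit_code, attempt)
    if action == "cleanup-and-reboot":
        cleanup()  # e.g. remove stale config files, as on windows (assumed)
    return action, delay
```

The caller would sleep for the returned delay and then reboot or restart the worker. Quarantine stays available as a last resort after repeated failures instead of being the first response.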