Closed Bug 1643694 Opened 4 years ago Closed 3 years ago

Add retry/cleanup for linux+macos generic-worker exit code 69

Categories

(Infrastructure & Operations :: RelOps: Posix OS, task)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: dhouse, Unassigned)

References

Details

Attachments

(1 file)

Bug 1643377: The latest generic-worker (g-w) panic was caused by credential rotation on the Taskcluster services. g-w had created temporary credentials for the task it was running, and the rotation invalidated them during the task run. g-w panicked on the incorrect credentials and exited with exit code 69. The Windows workers recovered after a reboot, but the Linux and macOS workers stopped and did not reboot.

Some of the handling for Windows was added in bug 1595261 (cleanup of config files).

This is a repeat of bug 1537886, where we looked at:
A. whether g-w could give different exit codes for service failures (answer: no), and
B. whether worker-runner could replace the quarantine-worker handling (answer: no, it will not auto-quarantine).

Some of the history for this:

  1. The worker execution script rebooted on any exit of g-w; when TC services were down, this sent the workers into a thrashing cycle of repeated reboots.
    a. This would "eat through" valid task runs in the queue, because the results were marked as failures (requiring re-runs of the tasks).
  2. bug 1461913: a self-quarantine was set up, so workers quarantined themselves and stopped doing work.
    a. At least twice, this self-quarantine ended with a majority of the macOS and Linux pools stopping work because of a brief TC service outage.
    b. A bug was created for each worker during service outages.
  3. bug 1479692: the self-quarantine was adjusted so that a problem-free startup removes the quarantine.
    a. We could then reboot the workers to have them retry (instead of removing the quarantine and closing the bug manually).

The Windows handling recovered the workers during this latest problem (the credential rotation), and I think it may be a better pattern to follow for Linux and macOS than what we have currently (quarantine and wait for manual intervention).
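As a rough illustration of that idea, here is a minimal sketch of the decision a worker wrapper script could make after g-w exits. It is not the existing Windows or POSIX handling: the function name `decide`, the `Decision` type, the retry limit, and the backoff delays are all hypothetical, and only exit code 69 comes from this bug. The backoff and retry cap are assumptions added to avoid the reboot thrashing described in item 1 of the history.

```python
from dataclasses import dataclass

# Exit code reported in this bug for the generic-worker panic.
GW_PANIC_EXIT_CODE = 69


@dataclass
class Decision:
    action: str         # "continue", "cleanup_and_reboot", or "quarantine"
    delay_seconds: int  # how long to wait before acting


def decide(exit_code: int, consecutive_failures: int,
           max_retries: int = 5, base_delay: int = 60) -> Decision:
    """Pick the next step after generic-worker exits (hypothetical policy).

    consecutive_failures counts failed runs in a row before this one.
    """
    if exit_code == 0:
        # Assumption: a clean exit needs no special handling.
        return Decision("continue", 0)
    if exit_code != GW_PANIC_EXIT_CODE:
        # Unknown failures keep the current behavior: quarantine and wait.
        return Decision("quarantine", 0)
    if consecutive_failures >= max_retries:
        # Stop retrying so a long outage cannot cause endless reboots.
        return Decision("quarantine", 0)
    # Exponential backoff spaces out retries while services recover.
    delay = base_delay * (2 ** consecutive_failures)
    return Decision("cleanup_and_reboot", delay)


if __name__ == "__main__":
    for failures in range(7):
        print(failures, decide(GW_PANIC_EXIT_CODE, failures))
```

The retry cap means a short outage is handled automatically, while a prolonged one still falls back to quarantine and manual intervention.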

Depends on: 1537886
Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED