Closed Bug 1142630 Opened 9 years ago Closed 6 years ago

docker-worker: Report exception on any runs in progress when a worker crashes.

Categories

(Taskcluster :: Workers, defect)

x86
macOS
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: jlal, Unassigned)

Details

(Whiteboard: [docker-worker])

Workers do crash. When they do, we wait for the queue to do its reclaim magic. This works, but it is very slow (20 min!). Let's implement a logging system that lets us do this:


 -> When claiming a task, add a line to the log (with run id, etc.)
 -> When a task is finished, add a line to the log
 -> When we crash:
 
   On boot, check for the log. If the log is present, check it for any incomplete tasks. If incomplete tasks are found, verify that they are still running under the worker's id and the run id from the time of the crash. Report an exception for each such run.

Append-only logs can be written safely (appends are atomic within certain size limits). The lines can be JSON.

Example:

{ state: 'running', taskId: .., runId: ... }
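
A minimal sketch of the append-only log described above, assuming Node.js. The log path, the 'finished' state name, and the function names recordState and findIncompleteRuns are hypothetical; the bug does not specify them. Checking with the queue that a run is still claimed by this worker, and reporting the exception, are left out.

```javascript
const fs = require('fs');

// Hypothetical location; the bug does not name a log file.
const LOG_PATH = '/var/docker-worker/task-runs.log';

// Append one JSON line per state change: 'running' when a task is claimed,
// 'finished' (an assumed name) when it is resolved.
function recordState(state, taskId, runId, logPath = LOG_PATH) {
  fs.appendFileSync(logPath, JSON.stringify({ state, taskId, runId }) + '\n');
}

// On boot: replay the log and return the runs that were claimed but never finished.
function findIncompleteRuns(logPath = LOG_PATH) {
  if (!fs.existsSync(logPath)) return [];
  const open = new Map();
  for (const line of fs.readFileSync(logPath, 'utf8').split('\n')) {
    if (!line.trim()) continue;
    let entry;
    try {
      entry = JSON.parse(line);
    } catch (err) {
      continue; // a crash in the middle of a write can leave a truncated line
    }
    const key = `${entry.taskId}/${entry.runId}`;
    if (entry.state === 'running') open.set(key, entry);
    else open.delete(key);
  }
  return [...open.values()];
}
```

Each run returned by findIncompleteRuns is a candidate for an exception report once the worker has confirmed that the run is still assigned to it.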
I think doing an actual log is overkill.

Whenever a task is claimed or resolved just do:
fs.writeFileSync('/var/docker-worker/running-tasks.json', JSON.stringify([
  {
    taskId: '...',
    runId:  '...'
  }
]));

While fs.writeFileSync is running, nothing else can crash docker-worker. I'm not afraid of being killed
by I/O issues; that's extra rare.

Doing atomic writes right is hard. And this is a small file. If you want it to be atomic, use
a mv command after writing the file. But IMO it's acceptable to write a local file like this synchronously.
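
The "write, then mv" variant could be sketched as follows, assuming Node.js on a POSIX filesystem where a rename within the same directory replaces the target atomically. The function names and the ".tmp" suffix are hypothetical. The sketch does not call fsync, so it guards against a process crash in the middle of a write rather than against power loss.

```javascript
const fs = require('fs');

// Write the list of running tasks to a temporary file, then rename it into place
// so that readers see either the old file or the new one, never a partial write.
function writeRunningTasks(filePath, tasks) {
  const tmpPath = `${filePath}.tmp`; // hypothetical naming convention
  fs.writeFileSync(tmpPath, JSON.stringify(tasks));
  fs.renameSync(tmpPath, filePath);
}

// Read the list back on boot; a missing file means no tasks were running.
function readRunningTasks(filePath) {
  try {
    return JSON.parse(fs.readFileSync(filePath, 'utf8'));
  } catch (err) {
    if (err.code === 'ENOENT') return [];
    throw err;
  }
}
```

On boot, readRunningTasks('/var/docker-worker/running-tasks.json') would return whatever runs were in flight at the time of the crash.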
Component: TaskCluster → Docker-Worker
Product: Testing → Taskcluster
Whiteboard: [docker-worker]
Component: Docker-Worker → Worker
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → WONTFIX
Component: Worker → Workers