Closed Bug 1297151 Opened 8 years ago Closed 6 years ago

[docker-worker] Try to terminate tasks gracefully, consider caches when SIGKILL is used

Tracking

(Not tracked)

Status:

RESOLVED WONTFIX

People

(Reporter: gps, Unassigned)

References

Details

Gregory Szorc [:gps]

Reporter

Description

•

8 years ago

https://tools.taskcluster.net/task-inspector/#FnErLvirS92TdXwhQ5vy3Q/0 shows a Mercurial that got into a weird state. Mercurial thinks a lock has been obtained and times out because the thing with a handle on the lock hasn't released it.

What I think happened is that a task previously running on the worker was forcefully killed with SIGKILL and Mercurial never had the opportunity to clean up and release the lock. The next task on that worker that used the same cache inherited the stale lock and barfed.

There are a few problems here.

First, SIGKILL to stop a task is really aggressive (https://github.com/taskcluster/docker-worker/blob/9fb079cfda45b1b91d528dc71d680d4710d7956f/lib/task.js#L770). SIGKILL doesn't give processes the opportunity to clean up. They just forcefully terminate. So e.g. Mercurial doesn't even have the opportunity to release the lock, possibly leaving the repo in an inconsistent state (it could be in the middle of a transaction). There could be other processes that also don't have the opportunity to perform graceful shutdown. I think we should issue a SIGTERM, wait a handful of seconds, then issue SIGKILL to forcefully abort the container.

Second, if SIGKILL is issued on a container with caches, all bets are off as to the state of those caches. Mercurial could have a lock or transaction in progress. A cache file may be partially written (read: corrupted). I can make the argument that if SIGKILL is sent to a container, all caches used by that container should be wiped because there's no way of knowing whether those caches are in a good state. The counter argument is anything using a cache should detect corruption and recover from it. Ideally, yes, that is true. However, that's a bit of work. Some cache consumers may not want to deal with it.

So here are my requests:

1) Issue SIGTERM and wait before sending SIGKILL
2) Consider wiping caches automatically if a SIGKILL is issued
3) Give tasks the option of specifying "wipe on SIGKILL" behavior for caches

Gregory Szorc [:gps]

Reporter

Updated

•

8 years ago

Blocks: 1297153

Greg Arndt [:garndt]

Updated

•

8 years ago

Component: Docker-Worker → Worker

Summary: Try to terminate tasks gracefully, consider caches when SIGKILL is used → [docker-worker] Try to terminate tasks gracefully, consider caches when SIGKILL is used

Dustin J. Mitchell [:dustin] (he/him)

Comment 1

•

7 years ago

https://github.com/taskcluster/taskcluster-worker/blob/682acd01ab764225f4388adb43c2913b37a3d1d4/engines/native/system/process_posix.go#L101 implements (1) for taskcluster-worker's native engine, at least.

Jonas Finnemann Jensen (:jonasfj)

Comment 2

•

7 years ago

PR for docker-worker:
https://github.com/taskcluster/docker-worker/pull/292

Chris Cooper [:coop] (he/him)

Updated

•

6 years ago

Status: NEW → RESOLVED

Closed: 6 years ago

QA Contact: pmoore

Resolution: --- → WONTFIX

Nobody; OK to take it and work on it

Assignee

Updated

•

5 years ago

Component: Worker → Workers

You need to log in before you can comment on or make changes to this bug.

Bugzilla

Quick Search

[docker-worker] Try to terminate tasks gracefully, consider caches when SIGKILL is used

Categories

(Taskcluster :: Workers, defect)

Tracking

(Not tracked)

People

(Reporter: gps, Unassigned)

References

Details

Crash Data

Security

(public)

User Story

Description

Updated

Updated

Comment 1

Comment 2

Updated

Updated