Closed Bug 1297151 Opened 8 years ago Closed 6 years ago

[docker-worker] Try to terminate tasks gracefully, consider caches when SIGKILL is used

Categories

(Taskcluster :: Workers, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: gps, Unassigned)

References

Details

https://tools.taskcluster.net/task-inspector/#FnErLvirS92TdXwhQ5vy3Q/0 shows a Mercurial that got into a weird state. Mercurial thinks a lock has been obtained and times out because the thing with a handle on the lock hasn't released it.

What I think happened is that a task previously running on the worker was forcefully killed with SIGKILL and Mercurial never had the opportunity to clean up and release the lock. The next task on that worker that used the same cache inherited the stale lock and barfed.

There are a few problems here.

First, SIGKILL to stop a task is really aggressive (https://github.com/taskcluster/docker-worker/blob/9fb079cfda45b1b91d528dc71d680d4710d7956f/lib/task.js#L770). SIGKILL doesn't give processes the opportunity to clean up. They just forcefully terminate. So e.g. Mercurial doesn't even have the opportunity to release the lock, possibly leaving the repo in an inconsistent state (it could be in the middle of a transaction). There could be other processes that also don't have the opportunity to perform graceful shutdown. I think we should issue a SIGTERM, wait a handful of seconds, then issue SIGKILL to forcefully abort the container.

Second, if SIGKILL is issued on a container with caches, all bets are off as to the state of those caches. Mercurial could have a lock or transaction in progress. A cache file may be partially written (read: corrupted). I can make the argument that if SIGKILL is sent to a container, all caches used by that container should be wiped because there's no way of knowing whether those caches are in a good state. The counter argument is anything using a cache should detect corruption and recover from it. Ideally, yes, that is true. However, that's a bit of work. Some cache consumers may not want to deal with it.

So here are my requests:

1) Issue SIGTERM and wait before sending SIGKILL
2) Consider wiping caches automatically if a SIGKILL is issued
3) Give tasks the option of specifying "wipe on SIGKILL" behavior for caches
Blocks: 1297153
Component: Docker-Worker → Worker
Summary: Try to terminate tasks gracefully, consider caches when SIGKILL is used → [docker-worker] Try to terminate tasks gracefully, consider caches when SIGKILL is used
Status: NEW → RESOLVED
Closed: 6 years ago
QA Contact: pmoore
Resolution: --- → WONTFIX
Component: Worker → Workers
You need to log in before you can comment on or make changes to this bug.