Closed
Bug 1297151
Opened 8 years ago
Closed 7 years ago
[docker-worker] Try to terminate tasks gracefully, consider caches when SIGKILL is used
Categories
(Taskcluster :: Workers, defect)
Taskcluster
Workers
Tracking
(Not tracked)
RESOLVED
WONTFIX
People
(Reporter: gps, Unassigned)
References
Details
https://tools.taskcluster.net/task-inspector/#FnErLvirS92TdXwhQ5vy3Q/0 shows a Mercurial that got into a weird state. Mercurial thinks a lock has been obtained and times out because the thing with a handle on the lock hasn't released it.
What I think happened is that a task previously running on the worker was forcefully killed with SIGKILL and Mercurial never had the opportunity to clean up and release the lock. The next task on that worker that used the same cache inherited the stale lock and barfed.
There are a few problems here.
First, SIGKILL to stop a task is really aggressive (https://github.com/taskcluster/docker-worker/blob/9fb079cfda45b1b91d528dc71d680d4710d7956f/lib/task.js#L770). SIGKILL doesn't give processes the opportunity to clean up. They just forcefully terminate. So e.g. Mercurial doesn't even have the opportunity to release the lock, possibly leaving the repo in an inconsistent state (it could be in the middle of a transaction). There could be other processes that also don't have the opportunity to perform graceful shutdown. I think we should issue a SIGTERM, wait a handful of seconds, then issue SIGKILL to forcefully abort the container.
Second, if SIGKILL is issued on a container with caches, all bets are off as to the state of those caches. Mercurial could have a lock or transaction in progress. A cache file may be partially written (read: corrupted). I can make the argument that if SIGKILL is sent to a container, all caches used by that container should be wiped because there's no way of knowing whether those caches are in a good state. The counter argument is anything using a cache should detect corruption and recover from it. Ideally, yes, that is true. However, that's a bit of work. Some cache consumers may not want to deal with it.
So here are my requests:
1) Issue SIGTERM and wait before sending SIGKILL
2) Consider wiping caches automatically if a SIGKILL is issued
3) Give tasks the option of specifying "wipe on SIGKILL" behavior for caches
Updated•8 years ago
|
Component: Docker-Worker → Worker
Summary: Try to terminate tasks gracefully, consider caches when SIGKILL is used → [docker-worker] Try to terminate tasks gracefully, consider caches when SIGKILL is used
Comment 1•7 years ago
|
||
https://github.com/taskcluster/taskcluster-worker/blob/682acd01ab764225f4388adb43c2913b37a3d1d4/engines/native/system/process_posix.go#L101 implements (1) for taskcluster-worker's native engine, at least.
Comment 2•7 years ago
|
||
PR for docker-worker:
https://github.com/taskcluster/docker-worker/pull/292
Updated•7 years ago
|
Status: NEW → RESOLVED
Closed: 7 years ago
QA Contact: pmoore
Resolution: --- → WONTFIX
Assignee | ||
Updated•6 years ago
|
Component: Worker → Workers
You need to log in
before you can comment on or make changes to this bug.
Description
•