Closed Bug 1011821 Opened 9 years ago Closed 7 years ago

Implement cancel handling in mozilla-taskcluster

Categories

(Taskcluster :: General, defect)

x86
macOS
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: jlal, Unassigned)

References

Details

Attachments

(1 file)

Right now we can create and rerun tasks (the most important part, really), but the ability to cancel tasks is also important. It opens up the ability to cancel manually (something sheriffs do frequently) and automatically (for example, on pull requests when new commits are pushed).

Ideally we can cancel from the graph which means something like this:

  - add an endpoint to the graph to cancel a particular graph
  - add an endpoint to the queue to cancel a particular task
  - (somehow?) notify a worker that a particular task is canceled.

Notifying the worker is slightly tricky since it only knows about work when acquiring it... Exposing an http endpoint to the worker is the most explicit and simple way to do this but we must also consider cases where the queue cannot talk to the worker (in which case canceling is really a no-op).
We have explicitly designed for this; we just haven't done the implementation yet.
When a task is cancelled the queue immediately _resolves_ it as _failed_.

> Notifying the worker is slightly tricky since it only knows about work when acquiring it...
Not true. The worker reclaims work whenever `takenUntil` is about to expire.
Obviously, the queue should not allow you to reclaim a _resolved_ task (at the moment it might).
But it should return an error code that lets the worker know that it is no longer reclaiming
the latest run. And the worker should terminate its docker container in this event.
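
A minimal sketch of the behaviour described above, using a toy in-memory queue rather than the real queue API. Every name here (`ToyQueue`, `Task`, `ReclaimRejected`, `worker_reclaim_step`, the `stop_container` callback) is hypothetical and only illustrates the idea: cancelling resolves the task as failed, a later reclaim is rejected, and the worker reacts by stopping its container.

```python
from dataclasses import dataclass, field


class ReclaimRejected(Exception):
    """Raised when a worker tries to reclaim a task that is already resolved."""


@dataclass
class Task:
    state: str = "running"    # "running" or "failed" in this toy model
    taken_until: float = 0.0  # time at which the current claim expires


@dataclass
class ToyQueue:
    tasks: dict = field(default_factory=dict)

    def cancel(self, task_id: str) -> None:
        # Cancelling resolves the task as failed right away.
        self.tasks[task_id].state = "failed"

    def reclaim(self, task_id: str, now: float, extend_by: float) -> float:
        task = self.tasks[task_id]
        if task.state != "running":
            # A resolved task must not be reclaimable.
            raise ReclaimRejected(task_id)
        task.taken_until = now + extend_by
        return task.taken_until


def worker_reclaim_step(queue, task_id, now, extend_by, stop_container):
    """Try to reclaim; stop the container if the queue rejects the reclaim.

    Returns True if the task is still running, False if it was stopped.
    """
    try:
        queue.reclaim(task_id, now, extend_by)
        return True
    except ReclaimRejected:
        stop_container(task_id)
        return False
```

In this model the worker only notices the cancellation at its next reclaim, so the worst-case delay is bounded by how far ahead `takenUntil` was set.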

> Exposing an http endpoint to the worker is the most explicit and simple way to do this
As said above, we already have this. I still think the `takenUntil` timeout interval should be rather high; I believe it's configurable right now. For short tasks it makes sense to configure a timeout of "expected task runtime + 3 min" or something similar.

Anyway, if workers want to know about cancelled tasks sooner, they should listen to the
`queue/v1/task-failed` exchange and bind with the routing pattern:
  `*.*.<workerGroup>.<workerId>.<provisionerId>.<workerType>.#`

Since cancelled tasks are _resolved_ as "failed", a message about this will be posted. This message will be routed with the <workerGroup> and <workerId> from the most recent run
(as other runs are expected to be done).
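
As a rough illustration of the binding described above, the sketch below builds the routing pattern for one worker and checks routing keys against it using standard AMQP topic semantics (`*` matches exactly one word, `#` matches zero or more words). The helper names `binding_pattern` and `topic_matches`, and all example values, are hypothetical; a real worker would let the AMQP broker do the matching.

```python
def binding_pattern(worker_group, worker_id, provisioner_id, worker_type):
    """Build the binding pattern from the comment above for a single worker."""
    return f"*.*.{worker_group}.{worker_id}.{provisioner_id}.{worker_type}.#"


def topic_matches(pattern, routing_key):
    """Return True if routing_key matches an AMQP topic pattern."""
    p = pattern.split(".")
    k = routing_key.split(".") if routing_key else []

    def match(i, j):
        if i == len(p):
            return j == len(k)
        if p[i] == "#":
            # '#' may consume zero or more of the remaining words.
            return any(match(i + 1, jj) for jj in range(j, len(k) + 1))
        if j == len(k):
            return False
        if p[i] == "*" or p[i] == k[j]:
            return match(i + 1, j + 1)
        return False

    return match(0, 0)


# Hypothetical worker identity, for illustration only.
pattern = binding_pattern("us-west", "i-123", "aws", "builder")
```

Note that this simple builder assumes none of the identifiers contain a `.`, since dots separate words in a routing key.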

Regarding task-graphs, a cancel endpoint can just change the task-graph state to "blocked" and ask the queue to cancel the pending tasks.
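
A minimal sketch of that graph-level cancel, assuming a graph is represented as a plain dictionary with a `state` and a mapping of task ids to task states. The `cancel_graph` function, the dictionary layout, and the `cancel_task` callback (standing in for a call to the queue) are all hypothetical.

```python
def cancel_graph(graph, cancel_task):
    """Mark the graph as blocked and ask the queue to cancel its pending tasks.

    Returns the ids of the tasks that were handed to cancel_task.
    """
    graph["state"] = "blocked"
    cancelled = []
    for task_id, state in graph["tasks"].items():
        if state == "pending":
            cancel_task(task_id)  # stand-in for the queue's cancel call
            cancelled.append(task_id)
    return cancelled
```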
Blocks: 1080265
As previously mentioned, docker-worker needs to stop execution on task-failed messages and failed reclaim responses.
Filed bug 1080278 for the latter, as I'm unaware of docker-worker's behaviour on failed reclaim responses.

Arguably, reclaims should be retried with exponential back-off unless the queue replies 409, in which case it's a clear kill-now situation.
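
One possible shape for that retry policy, sketched with hypothetical names (`ReclaimError`, `reclaim_with_backoff`) and arbitrary defaults for the attempt count and base delay. The `reclaim` argument is any callable that raises `ReclaimError` carrying the HTTP status on failure; this is an assumption for illustration, not docker-worker's actual interface.

```python
import time


class ReclaimError(Exception):
    """A failed reclaim attempt, carrying the HTTP status returned by the queue."""

    def __init__(self, status):
        super().__init__(f"reclaim failed with HTTP {status}")
        self.status = status


def reclaim_with_backoff(reclaim, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Retry reclaim() with exponential back-off.

    Returns True once a reclaim succeeds, and False if the queue replies 409,
    meaning the task should be killed immediately. Re-raises the last error
    if every attempt fails for some other reason.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        try:
            reclaim()
            return True
        except ReclaimError as err:
            if err.status == 409:
                return False  # do not retry: kill the task now
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)  # 1x, 2x, 4x, ... the base delay
```

The total back-off time has to stay below the remaining `takenUntil` window, otherwise the claim expires while the worker is still waiting to retry.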
Depends on: 1080278
I wanted to note that this could arguably be moved outside of the main blocking bug here, given that treeherder will allow you to "cancel" taskcluster jobs from the perspective of a sheriff, but it will not actually do the "right" things with treeherder (the upside is that because we show all defined tasks [even tests], it's easier to "cancel" a whole bunch of stuff).

We should still obviously make this workflow happen but it is not absolutely critical.
Summary: Cancel workflow → Implement cancel handling in mozilla-taskcluster
This contains commits from another patch that I have marked you as reviewer on, so please take a look at this first (or maybe just this entire series if you like).
Attachment #8585999 - Flags: review?(garndt)
Comment on attachment 8585999 [details] [review]
https://github.com/taskcluster/mozilla-taskcluster/pull/14

LGTM (although there is a bug, 1149748, for fixing an issue in the cancel handler in docker-worker; PR up soon).
Attachment #8585999 - Flags: review?(garndt) → review+
This is done, isn't it?
Flags: needinfo?(jlal)
Jobs are now able to be canceled in TH

https://github.com/taskcluster/mozilla-taskcluster/commit/07dc838aed0525abcbfd69351cbfb1dbdd0fb0de
Status: NEW → RESOLVED
Closed: 7 years ago
Flags: needinfo?(jlal)
Resolution: --- → FIXED
Component: TaskCluster → General
Product: Testing → Taskcluster
Target Milestone: --- → mozilla41
Version: unspecified → Trunk
Resetting Version and Target Milestone that accidentally got changed...
Target Milestone: mozilla41 → ---
Version: Trunk → unspecified