Closed Bug 1298129 Opened 8 years ago Closed 8 years ago

[docker-worker] Add mechanism available to the task environment to signal that task should be retried

Categories

(Taskcluster :: Workers, defect)

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: garndt, Assigned: garndt)

References

Details

Attachments

(1 file)

53 bytes, text/x-github-pull-request
wcosta: review+
Expose a socket to the task environment that, when read, will cause the task to be resolved as an exception, with a reason that will cause the task to be retried.
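For illustration only: the proposal above never specifies a path or protocol, so the following is a purely hypothetical sketch of how a task might signal the worker under a socket/file mechanism; both the path and the written value are invented here.

  # Hypothetical: write a sentinel value to a well-known path that the worker
  # watches. Neither this path nor this value exists in any worker today.
  echo "retry" > /var/run/taskcluster/retry-signal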
Is this something where we are going to need to re-architect the solution for other worker implementations?  Is there something more OS-generic than a socket?  For example, does libvirt let us inject a socket?  Windows?

What if the procedure was `curl http://taskcluster/queue/v1/task/<taskId>/run/<runId>/setRetryFlag`?
> Expose a socket to the task environment, that when read will cause the task to be resolved as exception with a reason that will cause the task to be retried.

Which use case(s) is this to satisfy?

> What if the procedure was `curl http://taskcluster/queue/v1/task/<taskId>/run/<runId>/setRetryFlag`?

I think calling the existing endpoint would be enough:

http://taskcluster/queue/v1/task/<taskId>/runs/<runId>/exception

Obviously the user would need the appropriate scope(s).
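For illustration, a rough sketch of what calling that existing endpoint from inside a task could look like, going through the worker's taskcluster-proxy; the JSON body and the <reason> placeholder are assumptions here, since the queue only accepts its own fixed set of exception reasons and the caller needs the matching scope:

  # Sketch only: report the current run as an exception via taskcluster-proxy.
  curl -X POST \
    -H "Content-Type: application/json" \
    -d '{"reason": "<reason>"}' \
    http://taskcluster/queue/v1/task/<taskId>/runs/<runId>/exception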

======

Aside

Until now, only the worker could raise an exception, so we could differentiate between worker/taskcluster failures and task failures. If we let tasks resolve themselves as exception, that could blur the lines. This is why I'm curious what the use case is.
Adding Jonas...
Flags: needinfo?(jopsen)
(In reply to Dustin J. Mitchell [:dustin] from comment #1)
> Is this something where we are going to need to re-architect the solution
> for other worker implementations?  Is there something more OS-generic than a
> socket?  For example, does libvirt let us inject a socket?  Windows?
> 
> What if the procedure was `curl
> http://taskcluster/queue/v1/task/<taskId>/run/<runId>/setRetryFlag`?

Good question. I'm not sure the socket/file approach is the best, but it is what the group decided when we previously met, so I was just entering bugs related to that.  See https://bugzilla.mozilla.org/show_bug.cgi?id=1280570#c13

We definitely need to make this cross-platform compatible and available to all worker implementations.  I believe that caches and other things are going to be mounted with the qemu engines, so I don't see an immediate reason why they couldn't inject a file or socket.  Jonas will need to chime in on that, and Pete can explain the options we might have with generic-worker and the future Windows engine for tc-worker.
(In reply to Pete Moore [:pmoore][:pete] from comment #2)
> 
> Until now, only the worker could raise an exception, so we could
> differentiate between worker/taskcluster failures and task failures. If we
> let tasks resolve themselves as exception, that could blur the lines. This
> is why I'm curious what the use case is.

So bug 1280570 has some info.  What we're trying to solve is the case where something running inside a task fails, but knows that if the task were retried from the beginning it would likely succeed; we need some way to retry that task.  We have suggested just rerunning the command within the container or task environment, but that has proven to be less than ideal.  Currently in Buildbot this is signaled with certain exit codes from mozharness that Buildbot knows to treat differently, retrying the task in that case.  Instead of doing something magical with exit codes, or having the worker respin up task environments, we thought that if the task could indicate it should be retried, the worker could resolve the task appropriately and the queue would retry it, similar to how we retry on worker-shutdown.
Is it only certain failures that are typically retried? I'm thinking, if any failure is retried for a given task, it might be better as part of the task definition ("retryFailures": true) - but I'm guessing only certain failures are retried.

Given that we need to expose the feature to the task environment (rather than a blanket "retry on failure") I think calling an API endpoint makes the most sense over writing to a socket. We could add the new endpoint

  * http://taskcluster/queue/v1/task/<taskId>/runs/<runId>/<resolution>/<reason>

requiring

  * queue:resolve-task:<resolution>:<reason>:<taskId>/<runId>

in this case, resolution = "exception" and reason = "retry" (or similar).

The reason to move away from using /task/<taskId>/runs/<runId>/exception would be to have granular control over allowed resolution/reason when assigning scopes.
  * http://taskcluster/queue/v1/task/<taskId>/runs/<runId>/<resolution>/<reason>

==>

  * http://taskcluster/queue/v1/task/<taskId>/runs/<runId>/resolve/<resolution>/<reason>
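As a sketch of the proposal (this endpoint was never added to the queue, so everything below is hypothetical), a task wanting a retry might call:

  # Hypothetical endpoint from the proposal above; assumes a POST, with
  # resolution=exception and reason=retry. The caller would need the scope
  # queue:resolve-task:exception:retry:<taskId>/<runId>.
  curl -X POST \
    http://taskcluster/queue/v1/task/<taskId>/runs/<runId>/resolve/exception/retry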
I almost agree -- I just don't think the task should be resolving itself (since at that point it's still running).  Rather, I think it should set a flag in the queue that says "when this task resolves with failure, retry it".
(In reply to Pete Moore [:pmoore][:pete] from comment #6)
> Is it only certain failures that are typically retried? I'm thinking, if any
> failure is retried for a given task, it might be better as part of the task
> definition ("retryFailures": true) - but I'm guessing only certain failures
> are retried.
> 
> Given that we need to expose the feature to the task environment (rather
> than a blanket "retry on failure") I think calling an API endpoint makes the
> most sense over writing to a socket. We could add the new endpoint
> 
>   *
> http://taskcluster/queue/v1/task/<taskId>/runs/<runId>/<resolution>/<reason>
> 
> requiring
> 
>   * queue:resolve-task:<resolution>:<reason>:<taskId>/<runId>
> 
> in this case, resolution = "exception" and reason = "retry" (or similar).
> 
> The reason to move away from using /task/<taskId>/runs/<runId>/exception
> would be to have granular control over allowed resolution/reason when
> assigning scopes.

If a task resolves itself, then because of the races that could happen, I assume the worker will need some additional handling around trying to reclaim an already-resolved task, and artifacts will need to be created before the time window for creating artifacts on a resolved task expires.
I'm definitely +1 for the idea of a task just making a request to a queue endpoint to mark it to be retried once resolved.  Makes the worker implementations simpler and reduces the subtle errors that could be introduced in each of the worker implementations.
Only downside is generic worker will need to support taskcluster proxy!

Shouldn't be a biggy though, it already supports livelog.
(In reply to Greg Arndt [:garndt] from comment #9)

> If a task resolves itself, then because of the races that could happen, I
> assume the worker will need some additional handling around trying to
> reclaim an already-resolved task, and artifacts will need to be created
> before the time window for creating artifacts on a resolved task expires.

Let's say I have a task, that wants to resolve itself as exception/retry.

It would make the API call, and wait for a 200 response (otherwise it is a bad task, and the task author has implemented it incorrectly).

Once it is resolved (confirmed by the 200 response), future resolutions will not be allowed by the queue. Since the task should only complete _after_ reading the 200 response, the worker will not be able to successfully resolve it as something else, and there is no race condition.

The other possible race condition is from calling the endpoint in a race with either:

  * the task being cancelled
  * the task having a worker-shutdown issued
  * the task timing out (deadline exceeded)

In all three cases, the task would be resolved as exception/<something-other-than-retry> rather than failure/<something>. The current proposal says that the queue will only retry this task if the retry flag is set and the task is marked as failed. Therefore the behavior would be the same. 

So I think both are equivalent; however, with a task resolving itself no changes to the queue are required, so it is arguably a cleaner/simpler solution.

Regarding the upload timeout meaning artifacts might not get published, I think it is reasonable that if a task is going to be retried:

  1) we should not need to guarantee artifacts are published (potentially a waste of resources)
  2) if the task does not exit immediately after calling the retry endpoint, the task is at fault

The advantage of the approach of the task resolving itself (rather than setting a separate flag against the queue) is that it requires no changes to the queue.

Let me know what you think, I might have overlooked something! :-)
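A minimal sketch of the flow described above, assuming the existing exception endpoint is used through taskcluster-proxy; the <reason> placeholder and the exit codes are illustrative, not defined in this bug:

  # Ask the queue to resolve this run as an exception, and only exit once a
  # 200 confirms it, so the worker cannot race the resolution.
  status=$(curl -s -o /dev/null -w "%{http_code}" -X POST \
    -H "Content-Type: application/json" \
    -d '{"reason": "<reason>"}' \
    http://taskcluster/queue/v1/task/<taskId>/runs/<runId>/exception)
  if [ "$status" = "200" ]; then
    exit 0   # resolution confirmed; exit immediately, per point 2 above
  else
    echo "retry request failed with HTTP $status" >&2
    exit 1   # treat as a task-author bug, as described above
  fi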
> http://taskcluster/queue/v1/task/<taskId>/run/<runId>/setRetryFlag`?
I'm not a fan of this option.
I prefer: task -> worker -> queue.
Some workerType may wish to forbid this, some may wish to make the decision based on other declarative logic.
Retrying should be a worker decision, and the worker can allow the task to trigger it, if the workerType desires to allow the task to do so.

A socket or whatever would work fine. I see no reason the signaling method critically has to be the same across workerTypes, except for consistency.

curl -X POST http://taskcluster/retry-task

Would work fine for me. tc-worker is designed to facilitate a lot of small proxies like this.
For docker-worker it might be simpler with a socket. For signing-worker this might be triggered by
inability to reach the signing worker (as tasks are purely declarative).
Note: the signing-worker example is purely illustrative.
Flags: needinfo?(jopsen)
It seems like you're over-complicating the implementation for tasks (remember, we are taking this roundabout course so that we're not gecko-specific) in exchange for flexibility.  YAGNI!  Imagine some new incubator project wants to use this functionality, and we have to explain that there are three worker implementations, each of which works a little differently, so they'll have to figure out per job which worker implementation they're working on, then implement the client side of that functionality.  We can probably point them to a few hundred lines of Python to copy/paste out of gecko (and it'll be Mozharness, so maybe we suggest that they also bring Mozharness into their really nice Rust build process).

It just seems like the *very* long way around the barn for very ephemeral benefit.

Maybe allowing task definitions to specify a mapping from exit code to retry/no-retry would be better than this solution.  I guarantee that parsing exit codes is what every user is going to expect first, and that just about everyone is going to decide to implement their own retrying or just live without retries, rather than write some bespoke multi-platform socket-communication support.
After considering the possible options, it has been decided that the best first step is to adopt opt-in behavior for exit status handling.  Tasks can define a list of exit codes that should be retried or treated as success.  All other non-zero exit codes will be considered a hard failure that will not be retried.

The workers already have handling of exit statuses for success and failure; the work here is just to expand this to include statuses that the task specifies should signify a retry.

This also has the benefit of being well documented in the task payload schema for those who are building tasks.

Other CI systems have a concept of treating exit statuses in special ways; this adds that functionality to our TaskCluster workers.


Opt-in Payload Example:

  payload: {
    onExitStatus: {
      // 0 is not allowed to be overridden; anything else not specified here is a failure
      retry: [1, 25],
      success: [3]
    }
  }


Other options were considered, such as a local endpoint the task could call, or an exposed socket or file; however, we feel those add unnecessary complexity.  The harnesses we have now already have special handling for exit statuses, and relying on exit status is already a useful way to alter behavior based on the outcome of running a command.
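As an illustration of the opt-in behavior, a task command might look something like the following; run_tests, the log check, and the specific codes are invented for this example and simply line up with the onExitStatus values shown above:

  # Map a known-transient condition to an exit code listed under retry.
  if ! run_tests; then
    if grep -q "infrastructure error" test.log; then
      exit 25   # listed in onExitStatus.retry above, so the run is retried
    fi
    exit 2      # not listed under retry or success, so a hard failure
  fi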
Checking in on this bug - 2 weeks have gone by, and I don't know if there are next steps here in the bug, or if there is more to discuss or dependencies we are waiting on.
I apologize for not linking the docker-worker pull request to this bug; I should have done that (and I will now).

Right now the patch for the worker is in review.  Once workers have been deployed with the new code, task owners are free to update their tasks to take advantage of this behavior.  I will link to the schema within this bug once deployed.
Attached file dw PR 253
Attachment #8794290 - Flags: review?(wcosta)
Attachment #8794290 - Flags: review?(wcosta) → review+
https://github.com/taskcluster/docker-worker/commit/0fe3fd0c3be5efe7d13b5f9c1a8c57afc3f6b0e6
Assignee: nobody → garndt
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
AMIs have been created, but they will be deployed in the morning.
Here is a task running on a test worker type showing the retry behavior:
https://tools.taskcluster.net/task-inspector/#dI5iHW4kR2qQS-AoFMYjCg/0
That's awesome!  So explicit, many wow.
Component: Worker → Workers