Taskcluster tests do not retry on failure due to DMError, etc

RESOLVED FIXED in mozilla52

Status

defect
RESOLVED FIXED
3 years ago
2 years ago

People

(Reporter: gbrown, Assigned: gbrown)

Tracking

unspecified
mozilla52
Dependency tree / graph

Details

Attachments

(1 attachment)

Buildbot test jobs that fail with certain error messages - like Android tests that fail with "DMError" - are automatically retried: The failed job is reported as a "blue" on treeherder, and a new job is started. This feature appears to not be implemented in taskcluster: The failed job is reported as an "orange" on treeherder, and the job is not automatically retried.

The special error messages are listed at http://hg.mozilla.org/build/buildbotcustom/file/6f6a9824ca4d/status/errors.py#l9 and used in http://hg.mozilla.org/build/buildbotcustom/file/6f6a9824ca4d/process/factory.py#l4226.

https://treeherder.mozilla.org/#/jobs?repo=try&revision=bb1505031efc&selectedJob=22519061 (R35) is an example of a "DMError" that was not retried on taskcluster.

https://treeherder.mozilla.org/#/jobs?repo=try&revision=bb1505031efc&selectedJob=22519218 (gl6) is an example of a "DMerror" that was retried via buildbot.
Blocks: 1080265
I think these retries need to be detected in mozharness, and the retry signalled there.  Greg, I remember talking about special mozharness exit codes?
Flags: needinfo?(garndt)
There were talks about special mozharness exit codes, but no real decision on how taskcluster should respond to those.  The original discussion was around the orange vs red color of jobs on treeherder depending on exit code.

If a task could include some kind of task information such as exit status (rather than parsing logs), mozilla-taskcluster or whatever reporting jobs to treeherder could initiate a retrigger event if a certain condition ocurrs.  

In the world of taskcluster-treehder that *only* publishes job messages to Pulse, this might get a little confusing.

Another option is to have a sepcial exception status which causes the queue to automatically retry (similar to how our worker-shutdown or claim-expired events are handled).
Flags: needinfo?(garndt)
Blocks: 1157948
Blocks: 1171445
(In reply to Dustin J. Mitchell [:dustin] from comment #1)
> I think these retries need to be detected in mozharness, and the retry
> signalled there.  Greg, I remember talking about special mozharness exit
> codes?

mozharness has special exit codes to specify success/warning/failure/exception/retry:

https://hg.mozilla.org/mozilla-central/annotate/23dc78b7b57e9f91798ea44c242a04e112c37db0/testing/mozharness/mozharness/mozilla/buildbot.py#l41

For retry, if mozharness sees a log line matching the retry_regex:

https://hg.mozilla.org/mozilla-central/annotate/23dc78b7b57e9f91798ea44c242a04e112c37db0/testing/mozharness/mozharness/mozilla/testing/errors.py#l94

the mozharness script will set its exit code to TBPL_RETRY (4).

Most (possibly all) of the errors used by buildbot to identify retry conditions could be copied to the mozharness regex, so that mozharness scripts exit with TBPL_RETRY for those errors.

I tried that in https://treeherder.mozilla.org/#/jobs?repo=try&revision=d5fc414e473a and it seemed to work out. For instance, tc-M7 has https://public-artifacts.taskcluster.net/br1uPM4dQI20bRqVDr80GQ/0/public/logs/live_backing.log, which shows "setting return code to 4" and "returning nonzero exit status 4".

Does that help?
If bug 1282850 is landing imminently, this is more important, since we'll actually have to care about the TC-based fennec debug failures in bug 1157948.
See https://bugzilla.mozilla.org/show_bug.cgi?id=1157948#c713 .. 719 for additional discussion relevant to this bug.
:selena - Can we do something about this soon? Over 100 failures are reported each day for TC jobs that were handled automatically by the retry feature on buildbot.
Flags: needinfo?(sdeckelmann)
After some talks with Jonas, we did have the idea that perhaps the workers could listen for certain exit codes and retry the task based on that rather than trying to do anything special within the platform.

Jonas, am I recalling things correctly?
Flags: needinfo?(jopsen)
Because retrying on exit code is a bit magical, I would like to explore what it would look like to be explicit about this retry logic so that it's opt-in rather than every task having this same behavior.  What I'm trying to avoid is for when more projects outside of things using mozharness return exit codes, I don't want people confused as to what's going on because their command of choice returned an exit code that mozharness also returns to indicate retry.
Buildbot has the obscurely-named `decodeRC` for this purpose (http://docs.buildbot.net/latest/manual/cfg-buildsteps.html#using-shellcommands).  I can see this being something included in the task definition:

 exit-code:
   2: retry
   5: success
in bug 1157948 we saw 843 instances of a retry needed in the last week.  Please make this a priority to at least have an agreed upon plan in place this week with someone assigned to working on it.
I see 4 options:
A) When you have an intermittent test case, you wrap it in a retry-loop
B) Inside your task container you do: bash -c 'until <command>; do echo "try again"; done';
    - maybe with some max number of retries just for sanity
    - You can also write the log a separate file for each attempt if you want
      (then only have livelog printing what attempt number you are currently running, or tee all output)
C) We could build into docker-worker that, if feature activated, it'll run the task container again
   if a certain exit-code is used.
D) We can add a special artifact to tasks that needs to be retriggered, have a small service looking for
   these artifacts and retriggering a task if such special artifact is created.

Obviously, (A) is the ideal solution... (D) the least ideal solution.
My fear is that (D) easily becomes a big hammer used to move things forward instead of address the underlying issue. And that it becomes hard to reason about the state of a push/task-group because
tasks might be retriggered automatically.
With (D) we sort of start to loose the concept that a task is a unit of work.

---
If the motivation for not doing (A) or (B) is that we want tests that required multiple attempts to have an special color on treeherder. Then I suggest (A) or (B) is implemented, but with the addition that a special artifact for taskcluster-treeherder to use to report color is created.

Added benefit of doing this in the task with (A) or (B) is that it discourages the practice :)
Flags: needinfo?(jopsen)
My 2 cents for right now...

I believe with option D we not only work towards solving this issue but also solve potential future issues where people want to know more about a task after the task is resolved and perhaps make choices/actions based on that.  The artifact could contain this meta data that we do not record within the platform.

I do not want to make this a "special artifact" but rather an artifact that contains task metadata that the worker reports prior to resolving the task.  It could include some run time statistics, exit status, whatever we agree on that the worker could report.  Exit status just being one of the potential pieces of information.

That said....D doesn't seem like a perfect fit, but so far seems like it might be a better fit than the other options.  Some more exploration will be needed I think.

I'm not a huge fan of C.  Not only does this mean we still need a solution for the issue of reporting job status to treeherder, but also it moves some retry/scheduling logic within the worker.  It should just be responsible for running a task to completion and returning the result.  If that task then needs to be scheduled again it should be handled externally.

I'm also a fan of discouraging the practice of hiding intermittent issues by retrying it multiple times, but I think in this case teams have spent significant time and effort trying to reduce this as far as they could and this (retrying) has been the agreed solution.
We had a meeting yesterday with Greg, Jonas, Dustin and myself.  We discussed a couple different possibilities as mentioned earlier in this bug.  In the end, we decided that the design we'd like to implement is to have an outcome field called to indicate whether the failure of the task is retry-able in the same data structure that stores whether the task succeeded or failed.  The queue would automatically try retry-able failures up to a limit.

By default, all failures would be considered non-retryable.  We would expose a mechanism (socket/file) which a task could optionally indicate to the worker process that this task has failed with a retryable reason.  Extra information regarding why the task failed, or the colour that Treeherder should present the job with should be done using an artifact.  We decided that the design of that artifact was outside the scope of this bug, but that would be how we'd like to communicate that information.

Having this be an explicit, opt-in feature means that we're not changing the interface presented to the tasks.  The exact mechanism for triggering the retry-able flag was not decided, however, the idea of presenting a socket for the task to read from or write to was the favoured approach, both for robustness (e.g. out of disk could interfere with a file) and because nearly every environment contains a socket implementation.

During this meeting, we considered two options that we ultimately decided not to go with.  First was exposing the exit code from the payload's command.  This is not very general, and worse, is to some degree implicit.  We can't tell if an application exited with a specific exit code because it wanted to opt into our retry behaviour or because it happens to use the same numeric value to represent a different concept.

The second idea was to create some sort of artifact which would indicate information about retrying and a service to do re-triggering based on the information from this artifact.  We had two reasons for not opting for this.  First was that it would require each project to write their own retriggering service.  Second, and related, at the taskcluster-platform level, we really only need to know whether or not to retry.  We also felt that implementors of these hypothetical services would likely only really care about retry-able vs non-retry-able error distinctions.  Further classification of an error is something we feel is better suited to the presentation system, possibly in concert with an artifact generated by the task which gives an explanation for why a task failed.
:gbrown, does this work for you?
Flags: needinfo?(sdeckelmann) → needinfo?(gbrown)
(In reply to John Ford [:jhford] from comment #13)

Thanks for documenting that discussion and explaining the rationale. This sounds like it will work for me / address the pressing issues of bug 1157948, etc. It should be simple enough to modify mozharness to invoke the socket/file mechanism when a mozharness script exits with the RETRY exit code; I can take care of that part...I'll just need details, when they are available.

I am a little troubled by:

> Extra
> information regarding why the task failed, or the colour that Treeherder
> should present the job with should be done using an artifact.  We decided
> that the design of that artifact was outside the scope of this bug, but that
> would be how we'd like to communicate that information.

So, you want this bug to be about starting the new task, but not about reporting the status of the failed task to treeherder, is that right? Turning the original job blue on treeherder is an essential part of the problem; otherwise, we still have 100+ failures a day being starred. Shall I open another bug for that?
Flags: needinfo?(gbrown)
The new task run for a given task should have a "reasonCreated" of "rerun" (it might be called retry, I need to check), and if so, our service for reporting jobs to treeherder already takes care of marking the previous symbol on treeherder as blue.
Blocks: 1293309
Blocks: 1297624
Depends on: 1298059
Depends on: 1298129
Sorry to (possibly) complicate this further, but I have been reminded that we should have a way of limiting the number of times a job is retried. On buildbot, there are occasional instances where mozharness thinks it has identified a temporary infrastructure issue and requests a retry, but the same thing happens on the retry, and we keep retrying until a thoughtful sheriff cancels a job. I can't think of a way for the mozharness script to know that it has retried N times before, so it would be good if something in taskcluster could reject a retry request after N such requests for the same job. I'd suggest N=3.
Typically we have a hard default limit within the queue for retries (at least for the other retrying exceptions we have).  I think the default is 5, but perhaps we could make this lower for the "task-retry" exception.

Jonas, do we have thoughts on when we could have you or bstack look into implementing this additinal retry exception into the queue?
Flags: needinfo?(jopsen)
(In reply to Greg Arndt [:garndt] from comment #18)
> I think the default is 5,
> but perhaps we could make this lower for the "task-retry" exception.

5 seems reasonable too -- I think that would be fine.
Taskcluster Android xpcshell tests have now been hidden on all trees, because of the confusion caused by failures that would have normally (in buildbot) been handled by retries -- bug 1303634. It seems to me that the sheriffs could hide all the Android tests for the same reason.

If we are not close to a resolution here, we should consider going back to buildbot for Android tests.
(In reply to Geoff Brown [:gbrown] from comment #20)
> Taskcluster Android xpcshell tests have now been hidden on all trees,
> because of the confusion caused by failures that would have normally (in
> buildbot) been handled by retries -- bug 1303634. It seems to me that the
> sheriffs could hide all the Android tests for the same reason.
> 
> If we are not close to a resolution here, we should consider going back to
> buildbot for Android tests.

We have agreed on a path forward for this, but have not had much time to focus on a fix.  I'm going to reach out and try to reprioritize work to address these issues sooner rather than later.

Here is the bug where the latest decision was made:
https://bugzilla.mozilla.org/show_bug.cgi?id=1298129#c15
PR is in adding exception reason: intermittent-task, which will add an other run if task.retries isn't exhausted.
We do not distinguish between automation retries and intermittent retries when counting against task.retries. I guess we could, but ideally this should be a feature that is rarely used.

This also added reasonCreated: "task-retry" in response to "intermittent-task" reasonResolved.
See: https://github.com/taskcluster/taskcluster-queue/pull/118
Flags: needinfo?(jopsen)
I just saw bug 1298129 get resolved :)  I still see we depend on bug 1298059.  I assume when that is done, then we can edit the tasks definitions to allow retries?
I have not rolled out a new ami yet because it's late in the day and I will be signing off, but you can see it in action here:
https://tools.taskcluster.net/task-inspector/#SKyZynHoQ12fuCgmbt8rnA/

You can start updating tasks to make use of this now, the worst that happens is that it's not retried until the workers are using a new ami.

Of particular importance is the new onExitStatus payload property that looks like:

  "onExitStatus": {
    "retry": [
      36
    ]
  }

Basically you list the status codes that you want to be retried in the onExitStatus.retry list.
I won't be able to hack on this for a day or two, but can do so then if nobody else is faster -- I know it's high priority.

My rough idea is:

 - add a `retry-exit-status` property to task descriptions (task.py)
   - for docker-worker, translate it into the structure in comment 24
   - for other payload builders, raise an exception if the option is given (since it is not supported)
 - copy that property into job descriptions' schema (as most properties are already copied)
 - set this flag unconditionally in taskcluster/taskgraph/transforms/tests/all_tests.py
       (I'm assuming all mozharness scripts used by tests produce the same retry-me exit status?)
 - (optional) set this flag in for the job-description run-using function 'mozharness' to get similar behavior for builds,
   if it is not already set to True or False
My rough interpretation of dustin's rough idea in comment 25. 

Seems to work fine - comment 26 shows Android DMError cases causing retries. https://hg.mozilla.org/try/rev/36987accb00f03a65982876f952ef5059cf4afb8 is this exact patch, showing the expected payload for tests and builds.
Attachment #8798865 - Flags: review?(dustin)
Comment on attachment 8798865 [details] [diff] [review]
set onExitStatus.retry = TBPL_RETRY for mozharness tests and builds

Review of attachment 8798865 [details] [diff] [review]:
-----------------------------------------------------------------

This looks good.  The one bit it misses is passing the parameter through from job descriptions to test descriptions.  If I wanted to define a one-off job that had a special exit status, there's no place in the job description to put this value.

This is as easy as copying the schema over:
  https://dxr.mozilla.org/mozilla-central/source/taskcluster/taskgraph/transforms/job/__init__.py#44

the job transforms already keep all of the fields, so there's not even a need to copy the value!

With that in place, it might be good to make the code in mozharness.py only change the value if it's not already specified.
Attachment #8798865 - Flags: review?(dustin)
Comment on attachment 8798865 [details] [diff] [review]
set onExitStatus.retry = TBPL_RETRY for mozharness tests and builds

Review of attachment 8798865 [details] [diff] [review]:
-----------------------------------------------------------------

Geoff points out that this is implemented to parallel things like `task_desc.worker.relengapi-proxy`, and I think that's fine -- at least for now, and it can be moved later.
Attachment #8798865 - Flags: review+
Pushed by gbrown@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/9a7e664b5f68
Retrigger tc tasks when mozharness returns TBPL_RETRY; r=dustin
https://hg.mozilla.org/mozilla-central/rev/9a7e664b5f68
Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla52
Duplicate of this bug: 1293309
Duplicate of this bug: 1157948
Bug 1286075 (at least) is required to push to beta (50).
You need to log in before you can comment on or make changes to this bug.