1280570 - Taskcluster tests do not retry on failure due to DMError, etc

Assignee

Description

8 years ago

Buildbot test jobs that fail with certain error messages - like Android tests that fail with "DMError" - are automatically retried: The failed job is reported as a "blue" on treeherder, and a new job is started.

This feature appears to not be implemented in taskcluster: The failed job is reported as an "orange" on treeherder, and the job is not automatically retried.

The special error messages are listed at http://hg.mozilla.org/build/buildbotcustom/file/6f6a9824ca4d/status/errors.py#l9 and used in http://hg.mozilla.org/build/buildbotcustom/file/6f6a9824ca4d/process/factory.py#l4226.

https://treeherder.mozilla.org/#/jobs?repo=try&revision=bb1505031efc&selectedJob=22519061 (R35) is an example of a "DMError" that was not retried on taskcluster.

https://treeherder.mozilla.org/#/jobs?repo=try&revision=bb1505031efc&selectedJob=22519218 (gl6) is an example of a "DMError" that was retried via buildbot.

Geoff Brown [:gbrown]

Assignee

Updated

8 years ago

Blocks: 1080265

Dustin J. Mitchell [:dustin] (he/him)

Comment 1

8 years ago

I think these retries need to be detected in mozharness, and the retry signalled there. Greg, I remember talking about special mozharness exit codes?

Flags: needinfo?(garndt)

Greg Arndt [:garndt]

Comment 2

8 years ago

There were talks about special mozharness exit codes, but no real decision on how taskcluster should respond to those. The original discussion was around the orange vs red color of jobs on treeherder depending on exit code. If a task could include some kind of task information such as exit status (rather than parsing logs), mozilla-taskcluster or whatever is reporting jobs to treeherder could initiate a retrigger event if a certain condition occurs. In the world of taskcluster-treeherder that *only* publishes job messages to Pulse, this might get a little confusing. Another option is to have a special exception status which causes the queue to automatically retry (similar to how our worker-shutdown or claim-expired events are handled).

Flags: needinfo?(garndt)

Geoff Brown [:gbrown]

Assignee

Updated

8 years ago

Blocks: 1157948

Phil Ringnalda (:philor)

Updated

8 years ago

Blocks: 1171445

Geoff Brown [:gbrown]

Assignee

Comment 3

8 years ago

(In reply to Dustin J. Mitchell [:dustin] from comment #1)
> I think these retries need to be detected in mozharness, and the retry
> signalled there. Greg, I remember talking about special mozharness exit
> codes?

mozharness has special exit codes to specify success/warning/failure/exception/retry:
https://hg.mozilla.org/mozilla-central/annotate/23dc78b7b57e9f91798ea44c242a04e112c37db0/testing/mozharness/mozharness/mozilla/buildbot.py#l41

For retry, if mozharness sees a log line matching the retry_regex:
https://hg.mozilla.org/mozilla-central/annotate/23dc78b7b57e9f91798ea44c242a04e112c37db0/testing/mozharness/mozharness/mozilla/testing/errors.py#l94
the mozharness script will set its exit code to TBPL_RETRY (4).

Most (possibly all) of the errors used by buildbot to identify retry conditions could be copied to the mozharness regex, so that mozharness scripts exit with TBPL_RETRY for those errors. I tried that in https://treeherder.mozilla.org/#/jobs?repo=try&revision=d5fc414e473a and it seemed to work out. For instance, tc-M7 has https://public-artifacts.taskcluster.net/br1uPM4dQI20bRqVDr80GQ/0/public/logs/live_backing.log, which shows "setting return code to 4" and "returning nonzero exit status 4".

Does that help?
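For readers unfamiliar with the mechanism, here is a minimal sketch of the idea described in this comment: scan the log for a retry pattern and, on a match, override the exit code with TBPL_RETRY. The value 4 comes from the comment; the regex contents and the function name `final_exit_code` are assumptions for illustration only and are not the actual mozharness code.

```python
import re

# TBPL_RETRY = 4 is the value quoted in the comment above.
TBPL_RETRY = 4

# Hypothetical pattern for illustration; the real list lives in mozharness'
# testing/errors.py and is not reproduced here.
RETRY_REGEX = re.compile(r"DMError")


def final_exit_code(log_lines, current_code):
    """Return TBPL_RETRY if any log line matches RETRY_REGEX.

    Otherwise the exit code the script had already chosen is kept.
    """
    if any(RETRY_REGEX.search(line) for line in log_lines):
        return TBPL_RETRY
    return current_code
```

With this shape, adding another buildbot retry condition would only mean extending the pattern; the exit-code handling itself stays unchanged.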

Wes Kocher (:KWierso) (Not reading bugmail; email directly if needed)

Comment 4

8 years ago

If bug 1282850 is landing imminently, this is more important, since we'll actually have to care about the TC-based fennec debug failures in bug 1157948.

Geoff Brown [:gbrown]

Assignee

Comment 5

8 years ago

See https://bugzilla.mozilla.org/show_bug.cgi?id=1157948#c713 .. 719 for additional discussion relevant to this bug.

Geoff Brown [:gbrown]

Assignee

Comment 6

8 years ago

:selena - Can we do something about this soon? Over 100 failures are reported each day for TC jobs that were handled automatically by the retry feature on buildbot.

Flags: needinfo?(sdeckelmann)

Greg Arndt [:garndt]

Comment 7

8 years ago

After some talks with Jonas, we did have the idea that perhaps the workers could listen for certain exit codes and retry the task based on that rather than trying to do anything special within the platform. Jonas, am I recalling things correctly?

Flags: needinfo?(jopsen)

Greg Arndt [:garndt]

Comment 8

8 years ago

Because retrying on exit code is a bit magical, I would like to explore what it would look like to be explicit about this retry logic, so that it's opt-in rather than every task having the same behavior. As more projects that don't use mozharness start running tasks, I don't want people to be confused about what's going on because their command of choice happened to return an exit code that mozharness also returns to indicate retry.

Dustin J. Mitchell [:dustin] (he/him)

Comment 9

8 years ago

Buildbot has the obscurely-named `decodeRC` for this purpose (http://docs.buildbot.net/latest/manual/cfg-buildsteps.html#using-shellcommands). I can see this being something included in the task definition:

    exit-code:
      2: retry
      5: success
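As a rough illustration of what such a per-task mapping could mean, the sketch below resolves an exit code to an outcome using an explicit table taken from the task definition. The function name `outcome_for_exit_code` and the fallback rule (0 is success, anything else is failure) are assumptions for this example; neither Buildbot's `decodeRC` nor any Taskcluster API is being reproduced here.

```python
def outcome_for_exit_code(exit_code, exit_code_map=None):
    """Map a process exit code to an outcome string.

    exit_code_map is the opt-in, per-task table (for example
    {2: "retry", 5: "success"}). Codes not listed fall back to the
    usual convention: 0 is success, anything else is failure.
    """
    exit_code_map = exit_code_map or {}
    if exit_code in exit_code_map:
        return exit_code_map[exit_code]
    return "success" if exit_code == 0 else "failure"
```

Because the table is part of the task definition, a task that never mentions exit code 2 keeps the default interpretation of it.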

Joel Maher ( :jmaher ) (UTC -8)

Comment 10

8 years ago

in bug 1157948 we saw 843 instances of a retry needed in the last week. Please make this a priority to at least have an agreed upon plan in place this week with someone assigned to working on it.

Jonas Finnemann Jensen (:jonasfj)

Comment 11

8 years ago

I see 4 options:

A) When you have an intermittent test case, you wrap it in a retry-loop.

B) Inside your task container you do: bash -c 'until <command>; do echo "try again"; done';
- maybe with some max number of retries just for sanity
- You can also write the log to a separate file for each attempt if you want (then only have livelog printing what attempt number you are currently running, or tee all output)

C) We could build into docker-worker that, if the feature is activated, it'll run the task container again if a certain exit code is used.

D) We can add a special artifact to tasks that need to be retriggered, and have a small service looking for these artifacts and retriggering a task if such a special artifact is created.

Obviously, (A) is the ideal solution... (D) the least ideal solution. My fear is that (D) easily becomes a big hammer used to move things forward instead of addressing the underlying issue. And that it becomes hard to reason about the state of a push/task-group because tasks might be retriggered automatically. With (D) we sort of start to lose the concept that a task is a unit of work.

---

If the motivation for not doing (A) or (B) is that we want tests that required multiple attempts to have a special color on treeherder, then I suggest (A) or (B) is implemented, but with the addition that a special artifact is created for taskcluster-treeherder to use to report color. Added benefit of doing this in the task with (A) or (B) is that it discourages the practice :)
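Option (B) with the suggested cap on attempts could look roughly like the following. It is only a sketch of an in-task retry loop; the function name `run_with_retries` and the default of 3 attempts are assumptions, and per-attempt log files are left out for brevity.

```python
import subprocess


def run_with_retries(command, max_attempts=3):
    """Run command until it exits 0 or max_attempts is reached.

    Returns (exit_code, attempts_used). The exit code is 0 on success,
    otherwise the exit code of the last failed attempt.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        # Only the attempt number is printed here; the command's own
        # output goes wherever the caller has stdout/stderr pointed.
        print(f"attempt {attempt} of {max_attempts}")
        result = subprocess.run(command)
        if result.returncode == 0:
            return 0, attempt
    return result.returncode, max_attempts
```

Note that a loop like this hides the intermediate failures from anything outside the task, which is exactly the reporting concern raised in this comment.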

Flags: needinfo?(jopsen)

Greg Arndt [:garndt]

Comment 12

8 years ago

My 2 cents for right now...

I believe with option D we not only work towards solving this issue but also solve potential future issues where people want to know more about a task after the task is resolved and perhaps make choices/actions based on that. The artifact could contain this meta data that we do not record within the platform. I do not want to make this a "special artifact" but rather an artifact that contains task metadata that the worker reports prior to resolving the task. It could include some run time statistics, exit status, whatever we agree on that the worker could report. Exit status just being one of the potential pieces of information.

That said... D doesn't seem like a perfect fit, but so far seems like it might be a better fit than the other options. Some more exploration will be needed I think.

I'm not a huge fan of C. Not only does this mean we still need a solution for the issue of reporting job status to treeherder, but also it moves some retry/scheduling logic within the worker. It should just be responsible for running a task to completion and returning the result. If that task then needs to be scheduled again it should be handled externally.

I'm also a fan of discouraging the practice of hiding intermittent issues by retrying it multiple times, but I think in this case teams have spent significant time and effort trying to reduce this as far as they could and this (retrying) has been the agreed solution.

John Ford [:jhford] CET/CEST Berlin Time

Comment 13

8 years ago

We had a meeting yesterday with Greg, Jonas, Dustin and myself. We discussed a couple different possibilities as mentioned earlier in this bug.

In the end, we decided that the design we'd like to implement is to have an outcome field called to indicate whether the failure of the task is retry-able in the same data structure that stores whether the task succeeded or failed. The queue would automatically try retry-able failures up to a limit. By default, all failures would be considered non-retryable. We would expose a mechanism (socket/file) which a task could optionally indicate to the worker process that this task has failed with a retryable reason.

Extra information regarding why the task failed, or the colour that Treeherder should present the job with should be done using an artifact. We decided that the design of that artifact was outside the scope of this bug, but that would be how we'd like to communicate that information.

Having this be an explicit, opt-in feature means that we're not changing the interface presented to the tasks. The exact mechanism for triggering the retry-able flag was not decided, however, the idea of presenting a socket for the task to read from or write to was the favoured approach, both for robustness (e.g. out of disk could interfere with a file) and because nearly every environment contains a socket implementation.

During this meeting, we considered two options that we ultimately decided not to go with.

First was exposing the exit code from the payload's command. This is not very general, and worse, is to some degree implicit. We can't tell if an application exited with a specific exit code because it wanted to opt into our retry behaviour or because it happens to use the same numeric value to represent a different concept.

The second idea was to create some sort of artifact which would indicate information about retrying and a service to do re-triggering based on the information from this artifact. We had two reasons for not opting for this. First was that it would require each project to write their own retriggering service. Second, and related, at the taskcluster-platform level, we really only need to know whether or not to retry. We also felt that implementors of these hypothetical services would likely only really care about retry-able vs non-retry-able error distinctions. Further classification of an error is something we feel is better suited to the presentation system, possibly in concert with an artifact generated by the task which gives an explanation for why a task failed.

Joel Maher ( :jmaher ) (UTC -8)

Comment 14

8 years ago

:gbrown, does this work for you?

Flags: needinfo?(sdeckelmann) → needinfo?(gbrown)

Geoff Brown [:gbrown]

Assignee

Comment 15

8 years ago

(In reply to John Ford [:jhford] from comment #13)

Thanks for documenting that discussion and explaining the rationale. This sounds like it will work for me / address the pressing issues of bug 1157948, etc.

It should be simple enough to modify mozharness to invoke the socket/file mechanism when a mozharness script exits with the RETRY exit code; I can take care of that part...I'll just need details, when they are available.

I am a little troubled by:

> Extra
> information regarding why the task failed, or the colour that Treeherder
> should present the job with should be done using an artifact. We decided
> that the design of that artifact was outside the scope of this bug, but that
> would be how we'd like to communicate that information.

So, you want this bug to be about starting the new task, but not about reporting the status of the failed task to treeherder, is that right? Turning the original job blue on treeherder is an essential part of the problem; otherwise, we still have 100+ failures a day being starred. Shall I open another bug for that?

Flags: needinfo?(gbrown)

Greg Arndt [:garndt]

Comment 16

8 years ago

The new task run for a given task should have a "reasonCreated" of "rerun" (it might be called retry, I need to check), and if so, our service for reporting jobs to treeherder already takes care of marking the previous symbol on treeherder as blue.

Geoff Brown [:gbrown]

Assignee

Updated

8 years ago

Blocks: 1293309

Geoff Brown [:gbrown]

Assignee

Updated

8 years ago

Blocks: 1297624

Greg Arndt [:garndt]

Updated

8 years ago

Depends on: 1298059

Greg Arndt [:garndt]

Updated

8 years ago

Depends on: 1298129

Geoff Brown [:gbrown]

Assignee

Comment 17

8 years ago

Sorry to (possibly) complicate this further, but I have been reminded that we should have a way of limiting the number of times a job is retried. On buildbot, there are occasional instances where mozharness thinks it has identified a temporary infrastructure issue and requests a retry, but the same thing happens on the retry, and we keep retrying until a thoughtful sheriff cancels a job. I can't think of a way for the mozharness script to know that it has retried N times before, so it would be good if something in taskcluster could reject a retry request after N such requests for the same job. I'd suggest N=3.

Greg Arndt [:garndt]

Comment 18

8 years ago

Typically we have a hard default limit within the queue for retries (at least for the other retrying exceptions we have). I think the default is 5, but perhaps we could make this lower for the "task-retry" exception. Jonas, do we have thoughts on when we could have you or bstack look into implementing this additional retry exception in the queue?

Flags: needinfo?(jopsen)

Geoff Brown [:gbrown]

Assignee

Comment 19

8 years ago

(In reply to Greg Arndt [:garndt] from comment #18)
> I think the default is 5,
> but perhaps we could make this lower for the "task-retry" exception.

5 seems reasonable too -- I think that would be fine.

Geoff Brown [:gbrown]

Assignee

Comment 20

8 years ago

Taskcluster Android xpcshell tests have now been hidden on all trees, because of the confusion caused by failures that would have normally (in buildbot) been handled by retries -- bug 1303634. It seems to me that the sheriffs could hide all the Android tests for the same reason. If we are not close to a resolution here, we should consider going back to buildbot for Android tests.

Greg Arndt [:garndt]

Comment 21

8 years ago

(In reply to Geoff Brown [:gbrown] from comment #20)
> Taskcluster Android xpcshell tests have now been hidden on all trees,
> because of the confusion caused by failures that would have normally (in
> buildbot) been handled by retries -- bug 1303634. It seems to me that the
> sheriffs could hide all the Android tests for the same reason.
>
> If we are not close to a resolution here, we should consider going back to
> buildbot for Android tests.

We have agreed on a path forward for this, but have not had much time to focus on a fix. I'm going to reach out and try to reprioritize work to address these issues sooner rather than later.

Here is the bug where the latest decision was made: https://bugzilla.mozilla.org/show_bug.cgi?id=1298129#c15

Jonas Finnemann Jensen (:jonasfj)

Comment 22

8 years ago

A PR is in, adding the exception reason intermittent-task, which will add another run if task.retries isn't exhausted. We do not distinguish between automation retries and intermittent retries when counting against task.retries. I guess we could, but ideally this should be a feature that is rarely used.

This also added reasonCreated: "task-retry" in response to an "intermittent-task" reasonResolved.

See: https://github.com/taskcluster/taskcluster-queue/pull/118
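The behaviour described here can be modelled in a few lines. This is a simplified sketch and not the taskcluster-queue implementation: the `Task` class, the `resolve_run` helper and the "scheduled" value used for the first run in the usage below are assumptions; only `intermittent-task`, `task-retry`, `reasonCreated`, `reasonResolved` and `task.retries` come from the comment.

```python
from dataclasses import dataclass, field


@dataclass
class Task:
    retries: int  # mirrors task.retries: how many extra runs are allowed
    runs: list = field(default_factory=list)


def resolve_run(task, reason_resolved):
    """Resolve the latest run; add a new run if a retry is warranted.

    Returns True if a new run was appended. Every extra run counts
    against task.retries, whatever caused it, as the comment describes.
    """
    task.runs[-1]["reasonResolved"] = reason_resolved
    retries_used = len(task.runs) - 1
    if reason_resolved == "intermittent-task" and retries_used < task.retries:
        task.runs.append({"reasonCreated": "task-retry", "reasonResolved": None})
        return True
    return False
```

For example, a task created with `Task(retries=1, runs=[{"reasonCreated": "scheduled", "reasonResolved": None}])` gets one additional run on its first intermittent-task resolution and none on the second.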

Flags: needinfo?(jopsen)

Joel Maher ( :jmaher ) (UTC -8)

Comment 23

8 years ago

I just saw bug 1298129 get resolved :) I see we still depend on bug 1298059. I assume that when that is done, we can edit the task definitions to allow retries?

Greg Arndt [:garndt]

Comment 24

8 years ago

I have not rolled out a new ami yet because it's late in the day and I will be signing off, but you can see it in action here: https://tools.taskcluster.net/task-inspector/#SKyZynHoQ12fuCgmbt8rnA/

You can start updating tasks to make use of this now, the worst that happens is that it's not retried until the workers are using a new ami.

Of particular importance is the new onExitStatus payload property that looks like:

    "onExitStatus": {
      "retry": [ 36 ]
    }

Basically you list the status codes that you want to be retried in the onExitStatus.retry list.
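To make the semantics concrete, here is a small sketch of how a payload carrying this property might be checked against a command's exit status. The `should_retry` helper is hypothetical and only illustrates the "exit status is in the list" rule described above; it is not docker-worker code, and the payload shown contains nothing beyond the property quoted in this comment.

```python
# Payload fragment as quoted in the comment; a real task payload has
# other fields that are omitted here.
payload = {
    "onExitStatus": {
        "retry": [36],
    },
}


def should_retry(payload, exit_status):
    """Return True if exit_status is listed in onExitStatus.retry.

    A payload without the property opts out: nothing is retried.
    """
    retry_codes = payload.get("onExitStatus", {}).get("retry", [])
    return exit_status in retry_codes
```

Tasks that omit `onExitStatus` are unaffected, which keeps the feature opt-in.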

Dustin J. Mitchell [:dustin] (he/him)

Comment 25

8 years ago

I won't be able to hack on this for a day or two, but can do so then if nobody else is faster -- I know it's high priority. My rough idea is:

- add a `retry-exit-status` property to task descriptions (task.py)
  - for docker-worker, translate it into the structure in comment 24
  - for other payload builders, raise an exception if the option is given (since it is not supported)
- copy that property into job descriptions' schema (as most properties are already copied)
- set this flag unconditionally in taskcluster/taskgraph/transforms/tests/all_tests.py (I'm assuming all mozharness scripts used by tests produce the same retry-me exit status?)
- (optional) set this flag for the job-description run-using function 'mozharness' to get similar behavior for builds, if it is not already set to True or False
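The first bullet and its two sub-points can be sketched as below. The `retry-exit-status` key and the `onExitStatus` structure come from this thread; the function name `build_payload`, its arguments, and the choice of `ValueError` are assumptions, and the real taskgraph payload builders are structured differently.

```python
def build_payload(implementation, worker):
    """Translate a task description's worker section into a payload.

    Only the retry-related part is modelled. For docker-worker the
    retry-exit-status value becomes onExitStatus.retry; any other
    worker implementation rejects the option as unsupported.
    """
    payload = {}
    if "retry-exit-status" in worker:
        if implementation != "docker-worker":
            raise ValueError(
                f"retry-exit-status is not supported for {implementation}"
            )
        payload["onExitStatus"] = {"retry": [worker["retry-exit-status"]]}
    return payload
```

Raising for unsupported workers, instead of silently dropping the option, means a task author finds out at graph-generation time that no retry will happen.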

Geoff Brown [:gbrown]

Assignee

Comment 26

8 years ago

https://treeherder.mozilla.org/#/jobs?repo=try&revision=a32a314486456ac192b174174557ba5433a95268 -- looking good!

Assignee: nobody → gbrown

Geoff Brown [:gbrown]

Assignee

Comment 27

8 years ago

Attached patch: set onExitStatus.retry = TBPL_RETRY for mozharness tests and builds

My rough interpretation of dustin's rough idea in comment 25. Seems to work fine - comment 26 shows Android DMError cases causing retries. https://hg.mozilla.org/try/rev/36987accb00f03a65982876f952ef5059cf4afb8 is this exact patch, showing the expected payload for tests and builds.

Attachment #8798865 - Flags: review?(dustin)

Dustin J. Mitchell [:dustin] (he/him)

Comment 28

8 years ago

Comment on attachment 8798865: set onExitStatus.retry = TBPL_RETRY for mozharness tests and builds

Review of attachment 8798865:

This looks good. The one bit it misses is passing the parameter through from job descriptions to test descriptions. If I wanted to define a one-off job that had a special exit status, there's no place in the job description to put this value. This is as easy as copying the schema over:
https://dxr.mozilla.org/mozilla-central/source/taskcluster/taskgraph/transforms/job/__init__.py#44
The job transforms already keep all of the fields, so there's not even a need to copy the value!

With that in place, it might be good to make the code in mozharness.py only change the value if it's not already specified.
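The "only change the value if it's not already specified" suggestion amounts to a default that never overrides an explicit setting. A minimal sketch, assuming the worker section is a plain dict and using a hypothetical helper name:

```python
# TBPL_RETRY = 4 is the mozharness retry exit code cited earlier in this bug.
TBPL_RETRY = 4


def apply_mozharness_retry_default(worker):
    """Set retry-exit-status to TBPL_RETRY unless the job already chose one."""
    worker.setdefault("retry-exit-status", TBPL_RETRY)
    return worker
```

A one-off job that specifies its own `retry-exit-status` would therefore keep it.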

Attachment #8798865 - Flags: review?(dustin)

Dustin J. Mitchell [:dustin] (he/him)

Comment 29

8 years ago

Comment on attachment 8798865: set onExitStatus.retry = TBPL_RETRY for mozharness tests and builds

Review of attachment 8798865:

Geoff points out that this is implemented to parallel things like `task_desc.worker.relengapi-proxy`, and I think that's fine -- at least for now, and it can be moved later.

Attachment #8798865 - Flags: review+

Pulsebot

Comment 30

8 years ago

Pushed by gbrown@mozilla.com: https://hg.mozilla.org/integration/mozilla-inbound/rev/9a7e664b5f68 Retrigger tc tasks when mozharness returns TBPL_RETRY; r=dustin

Ryan VanderMeulen [:RyanVM]

Comment 31

8 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/9a7e664b5f68

Status: NEW → RESOLVED

Closed: 8 years ago

Resolution: --- → FIXED

Target Milestone: --- → mozilla52

Geoff Brown [:gbrown]

Assignee

Comment 34

8 years ago

https://hg.mozilla.org/releases/mozilla-aurora/rev/1047827ed4d37a7c91411d645dea6f1deaa984c0

Geoff Brown [:gbrown]

Assignee

Comment 35

8 years ago

Bug 1286075 (at least) is required to push to beta (50).

Sebastian Hengst [:aryx] (needinfo me if it's about an intermittent or backout)

Updated

8 years ago

Blocks: 1346574