Closed Bug 1196374 Opened 9 years ago Closed 9 years ago

Delay submitting info about a dependent task to treeherder until ready to be picked up

Categories

(Taskcluster :: General, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED INVALID

People

(Reporter: armenzg, Unassigned)

Details

In this push:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=b0af66e75fdd

I submitted a task graph with two Buildbot jobs through the Buildbot Bridge:
http://docs.taskcluster.net/tools/task-graph-inspector/#VdgOBj-pTNe5mH5W7FMcdA/_dgDktc4R0GytOXdydnIpQ

Those two jobs are:
* a build job
* a dependent test job

Ignore that the treeherder grouping of tasks is incorrect.

I would like us to *not* tell treeherder about the dependent job until we actually are ready to make it run.
Right now, the "Windows XP debug" task shows as pending when it should not show up at all.

This sounds like either we are using the treeherder client to report the tasks too soon, or something is wrong in the Buildbot Bridge.

Here's the task graph for inspection:
http://docs.taskcluster.net/tools/task-graph-inspector/#VdgOBj-pTNe5mH5W7FMcdA/_dgDktc4R0GytOXdydnIpQ
Also, how can I prevent a test job from rerunning a second time?
re: rerunning

If "reruns" is set in the graph for a task, the task will be rerun up to that many times until it completes successfully.

If "retries" is set, the task will be retried up to that many times, but only when a run is resolved as an exception because of an infrastructure issue. Examples of infrastructure issues are a worker getting killed or a claim on a task expiring.
i.e. you'll want retries. IMO reruns at task-graph level have always been an anti-pattern.


Re: TH integration... you might not want to add task.routes like:
   "tc-treeherder.try.a904ceceacd413be16a524f5c1cb7bcd15dcec5f",
   "tc-treeherder-stage.try.a904ceceacd413be16a524f5c1cb7bcd15dcec5f"
As things submitted through the BB bridge are likely to be picked up twice, from the BB and TC exchanges via mozilla-taskcluster.
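A minimal sketch of dropping those routes from a task definition before submitting it through the bridge. It assumes the task is a plain dict with a "routes" list; the helper name `strip_treeherder_routes` and the sample "index.example" route are hypothetical and not part of any Taskcluster client.

```python
def strip_treeherder_routes(task):
    """Return a copy of the task definition without tc-treeherder* routes.

    Assumption: `task` is a dict and task["routes"] is a list of strings.
    """
    prefixes = ("tc-treeherder.", "tc-treeherder-stage.")
    cleaned = dict(task)  # shallow copy, the input dict is left untouched
    cleaned["routes"] = [
        route for route in task.get("routes", [])
        if not route.startswith(prefixes)
    ]
    return cleaned


task = {
    "routes": [
        "tc-treeherder.try.a904ceceacd413be16a524f5c1cb7bcd15dcec5f",
        "tc-treeherder-stage.try.a904ceceacd413be16a524f5c1cb7bcd15dcec5f",
        "index.example",  # hypothetical unrelated route, kept as-is
    ]
}
cleaned_task = strip_treeherder_routes(task)
```

With the routes removed, only the Buildbot side should report the job, so it would not be picked up twice.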
(In reply to Jonas Finnemann Jensen (:jonasfj) from comment #3)
> ie. you'll want retries. IMO reruns at task-graph level have always been an
> anti-pattern.
> 
> 
I ended up landing "reruns" for a task:
https://github.com/armenzg/mozilla_ci_tools/blob/master/mozci/sources/buildbot_bridge.py#L37

With "retries", if a job finishes with an exit code other than 0, would that be considered an infrastructure issue or not?
The docs sound like tasks are retried for infrastructure issues; what qualifies as one?
If a job fails because it simply did not succeed, would it be retried automatically?

> Re: TH integration... you might not want to add task.routes like:
>    "tc-treeherder.try.a904ceceacd413be16a524f5c1cb7bcd15dcec5f",
>    "tc-treeherder-stage.try.a904ceceacd413be16a524f5c1cb7bcd15dcec5f"
> As things submitted through the BB bridge are likely to be picked up twice, from
> the BB and TC exchanges via mozilla-taskcluster.

Yes, this is what I ended up learning. I stripped all of it.
Infrastructure issues are reported as an exception by docker-worker, independent of the status code. If a task container runs and the worker does not detect an infrastructure issue outside of running that task, the worker will report the task as completed if the status code is 0, and as failed for anything else.
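The rule above can be sketched roughly as follows. This is only an illustration of the decision order, not docker-worker's actual code; the function name `resolve_run` and the `infra_issue_detected` flag are hypothetical.

```python
def resolve_run(exit_code, infra_issue_detected):
    """Map a finished run to a resolution, following the rule described above.

    Assumption: the worker already knows whether it hit an infrastructure
    problem outside of the task itself.
    """
    if infra_issue_detected:
        # Infrastructure problems win, whatever the exit code was.
        return "exception"
    return "completed" if exit_code == 0 else "failed"
```

So a non-zero exit code on its own is a plain failure, not an infrastructure issue.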
Hrmm, I wonder if the rerun is not needed at all, since the problem with the task I scheduled was that the BBB did not reclaim the tasks (probably invalid info in the task).
I assume that is the reason the tasks were re-run (since claim-expired):
https://tools.taskcluster.net/task-inspector/#_dgDktc4R0GytOXdydnIpQ/0

Thank you for talking it through with me! Gotta make all these newbie mistakes :)
This is invalid and we don't hit it anymore.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → INVALID
> With "retries", if a job finishes with an exit code other than 0, would that
> be considered an infrastructure issue or not?
> The docs sound like tasks are retried for infrastructure issues; what
> qualifies as one?

Infrastructure issues are runs that are reported as an exception with a reason of "worker-shutdown" or "claim-expired". "worker-shutdown" is for when AWS kills a spot instance, and "claim-expired" is for when a worker fails to reclaim a task before the "takenUntil" deadline. The latter can happen when a worker is killed by other means (a crash, an internal issue, or a hard kill from the AWS console).
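As a rough sketch of those two checks: the reason strings and "takenUntil" come from the description above, while the function names and the way the deadline is compared are assumptions for illustration, not the queue's real implementation.

```python
from datetime import datetime, timedelta, timezone

# Exception reasons treated as infrastructure issues in this comment.
INFRA_REASONS = {"worker-shutdown", "claim-expired"}


def is_infrastructure_issue(state, reason):
    """True only for runs resolved as an exception with an infra reason."""
    return state == "exception" and reason in INFRA_REASONS


def claim_has_expired(taken_until, now):
    """True once the takenUntil deadline has passed without a reclaim."""
    return now >= taken_until


now = datetime(2015, 8, 20, 12, 0, tzinfo=timezone.utc)  # arbitrary example time
taken_until = now - timedelta(minutes=1)  # worker never reclaimed in time
expired = claim_has_expired(taken_until, now)
```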


> If a job fails because it simply did not succeed, would it be retried
> automatically?

This is the difference between "reruns" at the task graph level and "retries" at the task level. When defining a task in a graph, the "reruns" parameter defines how many times a task will be rerun if a failure is encountered (that is, a run that was reported as failed, not as an exception). "retries" at the task level applies only to runs resolved as an exception.
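Put together, the distinction might look like the sketch below. It folds both counters into one hypothetical function, `next_action`, purely for illustration; in practice the two decisions are made at different levels (task graph vs. task), and the counter names are assumptions.

```python
def next_action(state, retries_left, reruns_left):
    """Decide what happens after a run resolves.

    state: "completed", "failed" or "exception".
    retries_left: remaining task-level retries (used for exceptions only).
    reruns_left: remaining graph-level reruns (used for failures only).
    """
    if state == "completed":
        return "done"
    if state == "exception":
        return "retry" if retries_left > 0 else "done"
    if state == "failed":
        return "rerun" if reruns_left > 0 else "done"
    raise ValueError("unknown state: %r" % (state,))
```

With reruns set to 0, a test job that simply fails is left alone, while a run lost to a killed worker can still be retried.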