Closed Bug 1274176 Opened 8 years ago Closed 8 years ago

Retrigger doesn't work on taskcluster tests

Categories

(Taskcluster :: Services, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: ochameau, Assigned: dustin)

Details

(Whiteboard: [mozilla-taskcluster])

Take this run:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=3a64330e28f9
I'm unable to spawn new dt7 runs.

Not via Treeherder retrigger: I get the success popup, but nothing happens.

Nor via the Taskcluster task inspector: I get "You do not have sufficient scopes" when trying to retrigger it.

This severely limits our ability to address intermittents for tests that now run only on Taskcluster.

Feel free to move this bug into a Taskcluster component.
 <KWierso|afk> emorley: bug 1274176 should probably get moved to a taskcluster component since retriggering directly from taskcluster is failing, too
Component: Treeherder → General
Product: Tree Management → Taskcluster
Version: --- → unspecified
This is caused by changes to the decision task process, which now uses the new taskcluster queue dependency system rather than the scheduler.  The mozilla-taskcluster component, which handles retrigger events, is configured to duplicate the graph as it exists in the scheduler and does not use this new dependency system.
Component: General → Platform and Services
Whiteboard: [mozilla-taskcluster]
Assignee: nobody → garndt
Duplicating a node in the new task dependency system requires retrieving the entire task graph (which can be many hundreds of tasks) and iterating over those tasks to determine the dependency tree, and this has to be done for each retrigger.
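
As a rough illustration of why this is expensive, here is a minimal Python sketch. It assumes the whole task graph has already been fetched into a dict mapping task ID to task definition, and that each definition lists its upstream tasks under `dependencies`. The function name `find_dependents` and the sample task IDs are hypothetical, and this is not the mozilla-taskcluster code.

```python
from collections import deque


def find_dependents(graph, root_id):
    """Return root_id plus every task that depends on it, directly or not.

    `graph` maps task ID -> task definition; each definition is assumed to
    carry a `dependencies` list of upstream task IDs.
    """
    found = [root_id]
    seen = {root_id}
    queue = deque([root_id])
    while queue:
        current = queue.popleft()
        # No reverse index is available, so every lookup scans the whole graph.
        for task_id, definition in graph.items():
            if task_id in seen:
                continue
            if current in definition.get("dependencies", []):
                seen.add(task_id)
                found.append(task_id)
                queue.append(task_id)
    return found


# Hypothetical graph: one build, two tests that depend on it, one unrelated task.
graph = {
    "build": {"dependencies": []},
    "test-1": {"dependencies": ["build"]},
    "test-2": {"dependencies": ["build"]},
    "lint": {"dependencies": []},
}
dependents = find_dependents(graph, "build")
```

Each task that is found triggers another full pass over the graph, so the cost grows with the number of dependents multiplied by the size of the graph.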

I have spoken to Jonas about this and it should be possible, and very useful, to add an API endpoint to taskcluster-queue that will allow a client to request the list of tasks that depend on the task we query for, including their task definitions.  From there it's just a matter of updating task IDs, timestamps, and resubmitting those tasks.
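
A hedged sketch of that last step, in Python, might look like the following. It assumes the tasks to duplicate are already in hand as a dict of old task ID to definition, that each definition carries `created`, `deadline`, `expires` and `dependencies` fields, and that timestamps use the `2016-05-19T12:00:00.000Z` form. The helpers `new_task_id` and `clone_tasks` are hypothetical names, the ID generator only approximates slugid-style IDs, and actually resubmitting the tasks to the queue is left out.

```python
import base64
import copy
import uuid
from datetime import datetime, timezone

TIME_FIELDS = ("created", "deadline", "expires")
TIME_FORMAT = "%Y-%m-%dT%H:%M:%S.%fZ"


def new_task_id():
    """Generate a 22-character URL-safe ID from a random UUID (slugid-style)."""
    return base64.urlsafe_b64encode(uuid.uuid4().bytes).decode("ascii").rstrip("=")


def parse_time(value):
    return datetime.strptime(value, TIME_FORMAT).replace(tzinfo=timezone.utc)


def format_time(value):
    # Emit millisecond precision, matching the assumed input format.
    return value.strftime("%Y-%m-%dT%H:%M:%S.") + f"{value.microsecond // 1000:03d}Z"


def clone_tasks(tasks, now):
    """Copy `tasks` under fresh IDs, with timestamps re-based on `now`.

    Returns (cloned, id_map), where id_map maps old task ID -> new task ID.
    """
    id_map = {old_id: new_task_id() for old_id in tasks}
    cloned = {}
    for old_id, definition in tasks.items():
        new_def = copy.deepcopy(definition)
        # Shift every timestamp by the same offset so durations are preserved.
        offset = now - parse_time(definition["created"])
        for field in TIME_FIELDS:
            new_def[field] = format_time(parse_time(definition[field]) + offset)
        # Point at the new copy of a dependency when it was cloned too;
        # dependencies outside the cloned set keep their original ID.
        new_def["dependencies"] = [
            id_map.get(dep, dep) for dep in definition.get("dependencies", [])
        ]
        cloned[id_map[old_id]] = new_def
    return cloned, id_map


# Hypothetical input: a build and a test that depends on it.
tasks = {
    "old-build": {
        "created": "2016-05-19T12:00:00.000Z",
        "deadline": "2016-05-20T12:00:00.000Z",
        "expires": "2017-05-19T12:00:00.000Z",
        "dependencies": ["decision"],
    },
    "old-test": {
        "created": "2016-05-19T12:00:00.000Z",
        "deadline": "2016-05-20T12:00:00.000Z",
        "expires": "2017-05-19T12:00:00.000Z",
        "dependencies": ["old-build"],
    },
}
cloned, id_map = clone_tasks(tasks, datetime(2016, 6, 1, tzinfo=timezone.utc))
```

Keeping dependencies on tasks outside the cloned set unchanged means a retriggered test still points at the original build, which matches the case where only the tests are retriggered.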

This is being worked on, but the queue side will take time to implement, test, and deploy.  After that, mozilla-taskcluster will need to be updated to handle this new scenario.

In the meantime, if not having the ability to retrigger becomes a huge burden, and we only care about the task we're retriggering and not things that depend on it (such as only retriggering tests), then we can put in some ability to only retrigger that particular task and not the dependents.
Can we back out the in-tree work that caused this regression until we have a solution?  This functionality is very important to developers and sheriffs.  If this were a Firefox feature that landed and caused a pretty serious regression, we would back it out; I see this done just about every week.

As for a short-term hack that assumes test-only jobs, that might solve our problems and would be better than what we have now.
I did not know this was going to cause issues with retriggering, and only made the link this morning.  Sorry about that!

From discussion this morning, the plan is, roughly:

 1. fix single-task retriggering right away (hopefully the majority of use-cases)
 2. fix retriggering entire subtrees (allowing, for example, retriggering builds) using a brute-force approach
 3. modify the queue to better support reverse dependencies for #2
 4. fix up mozilla-taskcluster to properly handle big-graph scheduler tasks (bug 1274716)
I was able to retrigger this failed test job on mozilla-inbound to make sure retriggering was working:
https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=3737024731e6baccf99ac1f01eb5805ac12e3944&selectedJob=28475269
Assignee: garndt → dustin
OK, I think we finally got that working.  I retriggered

  https://treeherder.allizom.org/#/jobs?repo=try&revision=e664a7f36669&selectedJob=22087439

which had 118 dependent jobs, and it created 119 new jobs.  It scans sequentially for dependencies, and since it has to look for dependencies for all 119 of those jobs, that takes a while -- 70 seconds in this case.

I'm hopeful that in the future, this work will be done by an action task, and based on the task-graph.json produced by the decision task rather than pulled out of the queue, but for now this will do the trick!
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Component: Platform and Services → Services