Closed Bug 1420482 Opened 7 years ago Closed 6 years ago

Make retriggering Windows builds from treeherder less painful

Categories

(Taskcluster :: General, enhancement)

enhancement
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: philor, Assigned: pmoore)

References

Details

It's possible to retrigger Windows builds without running into Chain of Trust (CoT) failures that leave a purple Bs job nobody knows how to star, but the steps are:

Click the "..." menu
Choose Custom action...
Wait for custom actions to load
Change from Backfill to Retrigger
Change "downstream: false" to "downstream: true"
Click the button

which is substantially more involved than the usual method, which works for every other sort of build:

Type R

so it would be nice if the code which receives the "R" could learn about the fact that it needs to do things differently when the thing being retriggered is a Windows build.
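One way the "R" handler could special-case Windows builds is sketched below. It is hypothetical throughout: the `Job` fields, the `"windows"` platform-prefix test, and the shape of the returned request are assumptions, not Treeherder's actual code. It only encodes the manual workaround above, the Retrigger action with `downstream: true`.

```python
from dataclasses import dataclass


@dataclass
class Job:
    # Hypothetical, minimal view of a Treeherder job; field names are illustrative.
    task_id: str
    platform: str
    kind: str  # e.g. "build" or "test"


def retrigger_request(job: Job) -> dict:
    """Hypothetical logic for what pressing "R" should request."""
    if job.kind == "build" and job.platform.startswith("windows"):
        # Same as the manual workaround: the Retrigger custom action,
        # with downstream switched from false to true.
        return {
            "mode": "action",
            "action": "retrigger",
            "task_id": job.task_id,
            "input": {"downstream": True},
        }
    # Every other job keeps the plain one-keystroke retrigger.
    return {"mode": "plain", "task_id": job.task_id}


# Illustrative platform name only.
win_build = Job(task_id="abc123", platform="windows2012-64", kind="build")
print(retrigger_request(win_build)["input"])  # {'downstream': True}
```

The point of the sketch is that the decision lives in one place, so sheriffs keep typing R and the handler picks the heavier path only where it is needed.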
Thanks for raising!

Ed, do you know what makes the Windows builds "special" here? Is this something that can/should be handled by treeherder? If Windows builds are behaving differently to other types of tasks in taskcluster, and we should work around it in taskcluster, let me know. I'm not sure what the difference is here between Windows builds and any other task.
Flags: needinfo?(emorley)
So the bug description is a little unclear. I'm presuming the real issue here is not about custom actions, but that the standard retrigger doesn't work?

I've not worked on the taskcluster retrigger/custom jobs/backfill/... features, but Brian might know more :-)

Perhaps this is a gap in featureset between what pulse_actions used to do (which has since been switched off), and what the Treeherder client side tc-actions features can do?
Flags: needinfo?(emorley) → needinfo?(bstack)
Retrigger actually was never a pulse_actions thing. It's been handled by mozilla-taskcluster and still is, I believe. We've held off on landing the changes to TH to make retrigger (which would include the "r" shortcut) an action task as well. I explain the thinking in bug 1419957, comment 4. We can either move ahead with the retrigger landing now and deal with the 3-day expiry, or wait. I'm happy either way. I expect it would take me a couple of days to get my treeherder branch back into shape!
Flags: needinfo?(bstack)
This bit philor again today.
Blocks: 1432364
Is this on the radar for the near future? The old-style retrigger is related to the last two docker-image-related CoT bustages.
Blocks: 1436712
This had two long-term blockers. One of the blockers for this was bug 1395356 (which just went to production this morning! hassan++).

The other blocker is that the current retrigger allows anyone to retrigger, since (iirc) it just fires off a Pulse message and mozilla-taskcluster handles the retriggering without checking scopes. The action task requires that anyone who retriggers a task hold all of the scopes for that task, since in that case their own client schedules the task with the queue directly.
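The scope requirement for the action-task path can be sketched as a simple check. This assumes Taskcluster-style matching (an exact match, or a held scope ending in `*` covering anything with that prefix); the function names and the scope strings in the usage lines are made up for illustration.

```python
from typing import Iterable


def scope_satisfies(have: str, need: str) -> bool:
    """Exact match, or a held scope ending in "*" that covers any
    required scope sharing its prefix."""
    if have == need:
        return True
    return have.endswith("*") and need.startswith(have[:-1])


def can_schedule_directly(user_scopes: Iterable[str], task_scopes: Iterable[str]) -> bool:
    """True if every scope the task carries is satisfied by some user scope,
    which is what a retrigger needs when the user's own client creates the
    task with the queue."""
    held = list(user_scopes)
    return all(any(scope_satisfies(h, n) for h in held) for n in task_scopes)


# Made-up scopes, for illustration only.
task = ["queue:create-task:example-provisioner/win-build", "secrets:get:project/example/win"]
print(can_schedule_directly(["queue:create-task:example-provisioner/*"], task))  # False
```

A user missing even one of the task's scopes (here, the secret) cannot retrigger through this path, which is the gap the old Pulse-based retrigger never exposed.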

We've talked before (bug 1415868) about having a hook that is an indirect way to retrigger. It would be protected by a scope called something like "moz-tree-level-3:can-retrigger" and then _it_ would execute the action task and do all of the task creation. That work was blocked on bug 1324807, but that appears to be fixed now as well!
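A sketch of how the indirect path changes the authorization question, under stated assumptions: `RETRIGGER_HOOK_SCOPE` uses the placeholder name floated above, `hook_scopes` stands for whatever scopes the hook itself would be granted, and `retrigger_path` is a hypothetical helper, not part of any existing service. The user only needs the single hook scope; the task's scopes must be held by the hook.

```python
from typing import Iterable, Optional

# Placeholder: the comment only proposes "something like" this name.
RETRIGGER_HOOK_SCOPE = "moz-tree-level-3:can-retrigger"


def scope_satisfies(have: str, need: str) -> bool:
    # Exact match, or a trailing "*" acting as a prefix wildcard.
    return have == need or (have.endswith("*") and need.startswith(have[:-1]))


def covers(held: Iterable[str], needed: Iterable[str]) -> bool:
    held = list(held)
    return all(any(scope_satisfies(h, n) for h in held) for n in needed)


def retrigger_path(user_scopes, task_scopes, hook_scopes) -> Optional[str]:
    """Return "direct", "hook", or None for a hypothetical retrigger request."""
    if covers(user_scopes, task_scopes):
        return "direct"  # the user's client can create the task itself
    if covers(user_scopes, [RETRIGGER_HOOK_SCOPE]) and covers(hook_scopes, task_scopes):
        return "hook"  # the hook runs the action task with its own scopes
    return None
```

With this split, giving sheriffs retrigger rights means granting one narrow scope rather than every scope a Windows build carries.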

I think at this point all this comes down to is making the hook and making sure that the way we implement it is secure. :jonasfj/:dustin and I have talked about this before; we can page this back in. I think this might actually slot into our current stability work nicely.
Depends on: 1395356, 1324807
Actions-via-hooks is a bit of a bigger project, but something we very much want to do. The parts of the hooks service it requires are already in place, thanks to @alexandrasp.
This is more work than I initially remembered. In particular it requires:

09:43:14 <dustin> * in-tree implementation of a "hook" type action
09:43:21 <dustin> * implementation in tools and treeherder
09:43:32 <dustin> * setup of hooks (using some kind of automated method)

So this is all possible work to do now, but will take a bit of time/effort for somebody.
Are we blocking on the hooks because sheriffs don't have the scopes to run the custom action retrigger?
If they do have the scopes, I'd lean towards killing old-style retriggers anyway. If they don't and we can't give them the scopes, then I agree we should block on the hooks, and we have to educate people about not retriggering every time this burns us.
Maybe we should disable the old-style retrigger button on level3 trees?
There are two old-style retrigger UIs: the button (though I'd hope nearly nobody uses it, since typing R is much faster and more convenient), and what you do after an infra failure blows up a couple thousand jobs per tree. For that, you:

Type U to show only unstarred failures
Click the push-pin icon at the upper right of each push to select all the (failed) jobs showing for that push, or the first 500 if there are more than 500
Click the dropdown arrow on the right of the pinboard's save menubutton
Choose the Retrigger all menuitem
Change the classification to infra and save
Repeat for the next 500

Doing that by selecting each job individually, click click wait click click think about whether to trigger downstream and maybe edit the textarea click, classify as infra, N, click click wait click click think click, would result in not being ready to reopen after an infra bustage until after the next infra bustage had started, a day or two later. Let's not disable that (and since we're not disabling the wholesale version, also not disable the retail version).
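The wholesale flow is essentially a loop over batches of at most 500 jobs. A rough sketch, where `retrigger` and `classify` are hypothetical callbacks standing in for the pinboard's Retrigger all and the infra classification, not real Treeherder functions:

```python
from typing import Callable, Iterator, List, Sequence

PINBOARD_LIMIT = 500  # the 500-job selection limit described above


def batches(job_ids: Sequence[str], size: int = PINBOARD_LIMIT) -> Iterator[List[str]]:
    """Yield consecutive slices of at most `size` job ids."""
    for start in range(0, len(job_ids), size):
        yield list(job_ids[start:start + size])


def retrigger_unstarred(
    job_ids: Sequence[str],
    retrigger: Callable[[List[str]], None],
    classify: Callable[[List[str], str], None],
) -> int:
    """Hypothetical wholesale loop: retrigger a batch, classify it as infra,
    repeat. Returns the number of batches processed."""
    count = 0
    for batch in batches(job_ids):
        retrigger(batch)
        classify(batch, "infra")
        count += 1
    return count
```

For 1200 unstarred failures this runs three batches of 500, 500 and 200, the same pin, retrigger, classify, repeat cycle done by hand, which is why a per-job custom-action flow does not scale to infra bustages.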
(In reply to Aki Sasaki [:aki] from comment #9)
> Are we blocking on the hooks because sheriffs don't have the scopes to run
> the custom action retrigger?
> If they do have the scopes, I'd lean towards killing old-style retriggers
> anyway. If they don't and we can't give them the scopes, then I agree we
> should block on the hooks, and we have to educate people about not
> retriggering every time this burns us.

All the sheriffs have level 3 access now, so that shouldn't be an issue.

(In reply to Aki Sasaki [:aki] from comment #10)
> Maybe we should disable the old-style retrigger button on level3 trees?

This seems doable in the time frame of our current stability initiative, and then the hook actions part can come later. Maybe the button could redirect to the custom actions page for now?

Note: I also don't think this is strictly blocked on the TC team. All of the relevant bits are in-tree and could be wired together by anyone.

That said, I've cc-ed Eli and Hassan who may be able to help with the Treeherder part once their current work on login stability is done.
See Also: → 1450012
Bug 1470622 is in place to do this.
Depends on: 1470622
I think this is complete now.
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED