Build jobs re-run for each Talos retrigger/repeat when the build wasn't initially scheduled

RESOLVED FIXED in Firefox 68

Status: task, RESOLVED FIXED (opened 5 months ago, closed 4 months ago)
People: Reporter: MattN, Assigned: dustin
Tracking: Trunk, mozilla68
Firefox Tracking Flags: firefox68 fixed
Attachments: 1 attachment

Talos comparisons require 6 of each job for PerfHerder, and the longstanding way of doing this was to request the Talos jobs you want and then retrigger them 5 more times (the button is now labelled "Repeat the selected job"). For some unrelated reason my mach try fuzzy command only scheduled test-linux64/opt-talos-other-e10s, so I used the "Add new jobs" UI to add the Talos jobs I was originally interested in. Then I followed the usual pattern of retriggering them. When I did this today[1], it seems to have run a new build job for each retriggered Talos job, which is a huge waste of resources. Once the build for the first Talos request on a platform has started or completed, I would have expected it to be re-used for retriggers of Talos on that platform.

[1] https://treeherder.mozilla.org/#/jobs?repo=try&tier=1&group_state=expanded&revision=892666265a5604cd9959c2a234490e97cad9a703

It looks like the retriggers were done individually (there are a lot of action tasks on that push!). That basically runs them in parallel, so they don't "know" about each other and end up independently scheduling dependent tasks.

I see that https://tools.taskcluster.net/groups/Aweo9v9dTg2N-drrs3EeOg/tasks/ESs5_LYfRuao6_u9HX4gIg/details (retriggering linux talos) ran after the linux build (https://tools.taskcluster.net/groups/Aweo9v9dTg2N-drrs3EeOg/tasks/QExa0HvBTri2GY_Yrat4yw/runs/0) was complete, and indeed that linux build was not duplicated.

But the mac builds were duplicated. The distinction seems to be that this mac build was not included in the original decision task, so every action task saw that there was no mac build and created one.
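To make the failure mode concrete, here is a minimal simulation of the behaviour described above. It is a sketch, not the taskgraph code: the task labels, the `DEPENDENCIES` table and `run_retrigger_action` are all hypothetical stand-ins, and the only assumption carried over from the thread is that each action works from the snapshot of known tasks it saw when it started.

```python
from itertools import count

# Illustrative labels; the real task labels on the push may differ.
MAC_BUILD = "build-macosx64/opt"
MAC_TALOS = "test-macosx64/opt-talos-other-e10s"
LINUX_BUILD = "build-linux64/opt"
LINUX_TALOS = "test-linux64/opt-talos-other-e10s"

# Hypothetical dependency table: each talos job needs its platform's build.
DEPENDENCIES = {MAC_TALOS: [MAC_BUILD], LINUX_TALOS: [LINUX_BUILD]}

_task_ids = count(1)


def run_retrigger_action(snapshot, test_label):
    """Simulate one retrigger action.

    `snapshot` is the label -> task id mapping the action saw when it
    started. Any dependency missing from it gets a brand-new task.
    """
    created = []
    for dep in DEPENDENCIES[test_label]:
        if dep not in snapshot:
            created.append((dep, next(_task_ids)))
    created.append((test_label, next(_task_ids)))
    return created


def count_builds(created_tasks):
    """Count how many of the created tasks are builds."""
    return sum(1 for label, _ in created_tasks if label.startswith("build-"))


# The decision task scheduled the linux build but no mac build.
decision_snapshot = {LINUX_BUILD: "decision-linux-build"}

# Five actions started independently all see the same snapshot.
mac_tasks = [
    task
    for _ in range(5)
    for task in run_retrigger_action(decision_snapshot, MAC_TALOS)
]
linux_tasks = [
    task
    for _ in range(5)
    for task in run_retrigger_action(decision_snapshot, LINUX_TALOS)
]

print("extra mac builds:", count_builds(mac_tasks))
print("extra linux builds:", count_builds(linux_tasks))
```

Because none of the five mac actions can see what the others created, each one adds its own build, while the linux actions all reuse the build from the decision task.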

In general, two things will lead to a better experience:

  • where possible, schedule what you need up-front in the decision task, rather than addressing it later with retriggers. mach try fuzzy has a --rebuild option that can do what you need in this case.
  • when you must use action tasks, use fewer action tasks with more configuration. The add-new-jobs action allows retriggering multiple jobs (just select them all first) and has a "times" parameter that can give the number of times you'd like to retrigger the selected jobs. That is useful for common cases where you didn't know up-front that the try job would need additional talos runs.
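The second suggestion can be sketched the same way. The function below is a hypothetical model of a single batched action, not the real add-new-jobs implementation: it assumes the action resolves missing dependencies once for the whole request and then creates each selected job `times` times. The labels and the dependency table are illustrative.

```python
def add_new_jobs(snapshot, test_labels, dependencies, times=1):
    """Model one batched action: dependencies once, tests `times` times.

    Returns the list of labels for which a new task would be created.
    """
    known = dict(snapshot)  # private copy of what already exists
    created = []
    for label in test_labels:
        for dep in dependencies.get(label, []):
            if dep not in known:
                known[dep] = "new"  # remember it for the other selected jobs
                created.append(dep)
        created.extend([label] * times)
    return created


# Illustrative labels; two talos jobs share one mac build.
MAC_BUILD = "build-macosx64/opt"
TALOS_OTHER = "test-macosx64/opt-talos-other-e10s"
TALOS_CHROME = "test-macosx64/opt-talos-chrome-e10s"
deps = {TALOS_OTHER: [MAC_BUILD], TALOS_CHROME: [MAC_BUILD]}

batched = add_new_jobs({}, [TALOS_OTHER, TALOS_CHROME], deps, times=6)
print("builds created:", batched.count(MAC_BUILD))
print("talos-other runs:", batched.count(TALOS_OTHER))
```

In this model a single request with times=6 yields one build and six runs of each selected job, which is the outcome PerfHerder comparisons need.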

There are bugs filed to improve the treeherder UX around the second point -- for example, batching multiple presses of the 'r' key into a single action. I can't find the bugs right now :/

Looking at the logs in the add-new-jobs action for OSX talos, https://tools.taskcluster.net/groups/Aweo9v9dTg2N-drrs3EeOg/tasks/I9gZpQAmR6qqjPcyJhHK3A/details, which ran quite a bit later than the linux action -- it seems to have failed to find label-to-taskid.json for all of the previous actions. So even if the second add-new-tasks action had run after the build had completed, it likely would not have "realized" this and would have still scheduled an extra build. I suspect that's because the actions seem to have only written label-to-taskid-0.json: https://tools.taskcluster.net/groups/Aweo9v9dTg2N-drrs3EeOg/tasks/ESs5_LYfRuao6_u9HX4gIg/runs/0/artifacts

That issue didn't cause the particular problem you're seeing here, but is something we should fix all the same.
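A rough illustration of the artifact mismatch, under stated assumptions: the in-memory `prior_action_artifacts` dictionary, the `public/` path prefix, the task ids and the `visible_labels` helper are all hypothetical, and only the two file names come from the logs discussed above.

```python
def visible_labels(decision_map, prior_action_artifacts,
                   name="public/label-to-taskid.json"):
    """Merge the label -> task id maps a new action would be able to see.

    `prior_action_artifacts` maps an action task id to a dict of
    artifact name -> parsed JSON content. Actions that did not publish
    `name` contribute nothing and are reported in `missing`.
    """
    merged = dict(decision_map)
    missing = []
    for task_id, artifacts in prior_action_artifacts.items():
        if name in artifacts:
            merged.update(artifacts[name])
        else:
            missing.append(task_id)
    return merged, missing


decision_map = {"build-linux64/opt": "decision-linux-build"}

# An earlier action that only wrote the suffixed artifact.
prior = {
    "action-1": {
        "public/label-to-taskid-0.json": {
            "build-macosx64/opt": "action-1-mac-build",
        },
    },
}

merged, missing = visible_labels(decision_map, prior)
print("mac build visible:", "build-macosx64/opt" in merged)
print("actions with no usable artifact:", missing)
```

Under these assumptions the mac build created by the earlier action stays invisible to the later one, so timing alone would not have prevented the duplicate.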

Assignee: nobody → dustin

(In reply to Dustin J. Mitchell [:dustin] pronoun: he from comment #1)

> It looks like the retriggers were done individually (there are a lot of action tasks on that push!). That basically runs them in parallel, so they don't "know" about each other and end up independently scheduling dependent tasks.

Hmm… it seems like the scheduling could be done in serial (build in parallel) so the 2nd request would have seen that a build was scheduled already. That's how I assumed it worked.

> I see that https://tools.taskcluster.net/groups/Aweo9v9dTg2N-drrs3EeOg/tasks/ESs5_LYfRuao6_u9HX4gIg/details (retriggering linux talos) ran after the linux build (https://tools.taskcluster.net/groups/Aweo9v9dTg2N-drrs3EeOg/tasks/QExa0HvBTri2GY_Yrat4yw/runs/0) was complete, and indeed that linux build was not duplicated.
>
> But the mac builds were duplicated. The distinction seems to be, this mac build was not included in the original decision task, so every action task saw that there was no mac build, and created one.

Right.

> In general, two things will lead to a better experience:
>
>   • where possible, schedule what you need up-front in the decision task, rather than addressing it later with retriggers. mach try fuzzy has a --rebuild option that can do what you need in this case.

That was my intention, and locally ./mach try fuzzy showed the correct output, but the json file it generated only had Linux. I realize now that the issue was that I forgot to hit ctrl-a before hitting <enter> in the curses UI. The curses UI also doesn't show the --rebuild or --no-artifact options, which is quite annoying and is why I forgot about them. That's why I liked the trychooser webpage much better: I could see the options in front of me.

>   • when you must use action tasks, use fewer action tasks with more configuration. The add-new-jobs action allows retriggering multiple jobs (just select them all first) and has a "times" parameter that can give the number of times you'd like to retrigger the selected jobs. That is useful for common cases where you didn't know up-front that the try job would need additional talos runs.

I see… I didn't think that the "Custom Push Action…" dialog would know anything about the jobs that were selected on the push.

> There are bugs filed to improve the treeherder UX around the second point -- for example, batching multiple presses of the 'r' key into a single action. I can't find the bugs right now :/
>
> Looking at the logs in the add-new-jobs action for OSX talos, https://tools.taskcluster.net/groups/Aweo9v9dTg2N-drrs3EeOg/tasks/I9gZpQAmR6qqjPcyJhHK3A/details, which ran quite a bit later than the linux action -- it seems to have failed to find label-to-taskid.json for all of the previous actions. So even if the second add-new-tasks action had run after the build had completed, it likely would not have "realized" this and would have still scheduled an extra build. I suspect that's because the actions seem to have only written label-to-taskid-0.json: https://tools.taskcluster.net/groups/Aweo9v9dTg2N-drrs3EeOg/tasks/ESs5_LYfRuao6_u9HX4gIg/runs/0/artifacts
>
> That issue didn't cause the particular problem you're seeing here, but is something we should fix all the same.

Well, I believe that some of my retriggers on other platforms were done after builds were completed, though maybe that was on the other push: https://treeherder.mozilla.org/#/jobs?repo=try&revision=abe22f60f322347af4bba49830c2440a31432387

This also adds an optimization for the case where there is only one result (which is common for actions where times defaults to 1).
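For readers who want a picture of what such a combining step might look like, here is a hedged sketch. It is not the patch itself: the in-memory `artifacts` dictionary and `combine_artifacts` are hypothetical, the artifact kinds are limited to the two named in this bug, and treating the single-result case as a plain rename is an assumption about what the optimization does.

```python
# Artifact kinds mentioned in this bug; the real patch may cover others.
ARTIFACT_KINDS = ("task-graph", "label-to-taskid")


def combine_artifacts(artifacts, suffixes):
    """Produce unsuffixed artifacts from per-result, suffixed ones.

    `artifacts` maps file names such as "label-to-taskid-0.json" to
    parsed JSON objects (dicts). Returns a new mapping that also has
    "label-to-taskid.json", "task-graph.json", and so on.
    """
    out = dict(artifacts)
    for kind in ARTIFACT_KINDS:
        names = [f"{kind}-{suffix}.json" for suffix in suffixes]
        if len(names) == 1:
            # Assumed fast path: one result, so rename instead of copying.
            out[f"{kind}.json"] = out.pop(names[0])
            continue
        merged = {}
        for name in names:
            # Simplification: on key collisions, later results win.
            merged.update(out[name])
        out[f"{kind}.json"] = merged
    return out


single = combine_artifacts(
    {
        "task-graph-0.json": {"taskA": {}},
        "label-to-taskid-0.json": {"build-macosx64/opt": "taskA"},
    },
    suffixes=[0],
)
print(sorted(single))
```

With the unsuffixed files always present, a later action that looks up label-to-taskid.json would find the tasks created by earlier actions.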

Pushed by dmitchell@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/092d677f7181
combine all taskgraph artifacts, not just task-graph; r=tomprince
Pushed by apavel@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/a027a998b8b7
fix lint spacing on a CLOSED TREE
Status: NEW → RESOLVED
Closed: 4 months ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla68