Closed Bug 1537558 Opened 6 years ago Closed 5 years ago

Extra builds triggered in response to adding test jobs

Categories

(Tree Management :: Treeherder, enhancement)

enhancement
Not set
normal

Tracking

(Not tracked)

RESOLVED WORKSFORME

People

(Reporter: jfkthame, Unassigned)

Details

It seems there's some inefficiency when using the treeherder interface to add new test jobs to a try run that require a build that wasn't part of the original push.

When the first such test job is requested, it's expected that a new build will be triggered in order to support it. That's fine, so far.

However, it appears that every time such a test job is added (or re-triggered), another new build is started, even though the push now has the build it needs.

Concrete example:

(1) Push a try run with only debug/macOS selected, and requesting a single test, let's say mochitest-bc-7.

(2) Once the job is finished, retrigger the bc7 job several times (let's say we're interested in the frequency of an intermittent failure).

Result: all fine so far, we get one build, a test run, and then repeats of the test run using the same build.

(3) Now use Add New Jobs in treeherder to request an Opt test, let's say mochitest-bc-2 (as that happens to be the chunk where the suspect test runs in opt builds)

Result: as expected, this triggers an Opt build, and then the requested Opt test runs.

(4) Retrigger the opt bc2 run several times.

Result: for each of the retriggers, a new Opt build is started, and the extra tests don't start until their corresponding individual builds are ready. :(

(5) Use Add New Jobs to request another flavor of Opt test, such as a reftest run.

Result: again, this triggers yet another new Opt build (though if several new tests are requested in the same action, only a single new build is started to service them).

In https://treeherder.mozilla.org/#/jobs?repo=try&revision=84167ae5ef5265b6f36e7cf5214e40c2a95dbcf2, this non-optimal behavior led to a lot of unnecessary duplicated builds. It also led to some confusion for me: at first, when I saw the extra builds happening, I thought I must have accidentally re-triggered the wrong thing, and cancelled a bunch of them. But carefully repeating the test re-triggers confirmed this extra-build behavior.

With an increasing number of tasks (usually variants like ccov, socket process, fission) running only on central by default, fast backfills and retriggers on integration branches are necessary (in addition to being cost-efficient).

Karl, can we get this on the road map, please? Initial theory is that only the first task graph gets searched for the necessary build.

Flags: needinfo?(kthiessen)

Cam, does this get fixed by bug 1565754?

Flags: needinfo?(cdawson)

Sorry, Sebastian, I meant to bring this up in the Treeherder team meeting. I'll check in with Cam to see if this needs any more work.

Flags: needinfo?(kthiessen)

My fix will always find the latest "* Decision Task" if there were retries on the task itself. But I think this would need to find the correct "rt" task to do the retrigger. I'd need to investigate further, though. This spans logic between Taskcluster and us. And it might require a bit of help from the Taskcluster side to help us figure out what the right "rt" decision task is for the job. I don't think treeherder could figure that out at this time.

Flags: needinfo?(cdawson)

Tom, could you give us a hand here? As Cam says, there's logic on both sides involved.

Flags: needinfo?(mozilla)

I'm not able to reproduce the issues described in the first comment. If the action task has completed, subsequent actions on the push should re-use the existing builds. (If the action task hasn't completed, there is no way for other actions to see what tasks it would create.)

The case for backfills is trickier. We only look for existing tasks in the artifacts attached to actions run on a given push, so backfills (or retriggers of backfilled jobs) won't see tasks (i.e. builds) triggered by other backfills. I suspect the solution is to have the backfill action trigger an action on each of the pushes being backfilled, and that action task would schedule the jobs for that push. That way, subsequent backfills will be able to find the jobs on those pushes.
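To illustrate the lookup described above, here is a minimal sketch, not the actual Taskcluster or Treeherder code. It assumes each decision or action task on a push publishes a label-to-task-ID mapping as an artifact; the function names, the mapping shape, and the task labels are all hypothetical and chosen only for illustration.

```python
from typing import Dict, Iterable, List, Optional

# Hypothetical stand-in for the artifact published by a decision task or a
# completed action task on a push: task label -> task ID.
LabelToTaskId = Dict[str, str]


def find_existing_task(label: str, push_artifacts: Iterable[LabelToTaskId]) -> Optional[str]:
    """Return the task ID most recently recorded for `label`, or None.

    `push_artifacts` is ordered oldest first and only contains artifacts from
    tasks that ran on this one push (assumption for this sketch).
    """
    found = None
    for mapping in push_artifacts:
        if label in mapping:
            found = mapping[label]
    return found


def tasks_to_create(test_label: str, build_label: str,
                    push_artifacts: Iterable[LabelToTaskId]) -> List[str]:
    """List the labels a new action must schedule for one requested test."""
    to_create = [test_label]
    # Schedule a build only if no artifact on this push already records one.
    if find_existing_task(build_label, list(push_artifacts)) is None:
        to_create.insert(0, build_label)
    return to_create
```

Under these assumptions, a second request for an opt test re-uses the opt build once the first action's artifact is visible on the push. A backfill that runs on a different push never sees that artifact, so it would schedule the build again, which matches the backfill behavior described above.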

The backfill case is Bug 1585757. I can't reproduce the problem in the add-new-jobs case.

Status: NEW → RESOLVED
Closed: 5 years ago
Flags: needinfo?(mozilla)
Resolution: --- → WORKSFORME
Component: Treeherder: Job Triggering & Cancellation → TreeHerder