Open Bug 1373013 Opened 7 years ago Updated 5 months ago

Make decision-task-generated tasks depend on a breakpoint task

Categories

(Firefox Build System :: Task Configuration, task)

Tracking

(Not tracked)

People

(Reporter: dustin, Unassigned)

References

(Blocks 1 open bug)

Details

In https://github.com/taskcluster/taskcluster-rfcs/issues/67 we determined that the existing strategy of making all tasks created by a decision task depend on that decision task is inadequate if that task is later re-run, since any tasks created by run 0 will be scheduled once run 1 completes successfully.

The idea there was to instead make all tasks depend (directly or indirectly) on a "dummy" task created by the decision task, and not executed until all tasks have been created.  A second run of the decision task would then create a new dummy task, so any tasks depending on run 0's dummy task would not be affected.

That dummy task could either run as a real task (`exit 0`) or could be claimed and resolved as complete by the decision task, depending on how fancy we want to get.
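
As a rough illustration of that rewiring, here is a minimal Python sketch. The function name `wire_breakpoint`, the dict shape of the task definitions, and the use of `uuid` in place of real slugid taskIds are all assumptions made for the example, not existing taskgraph code. How the dummy task itself gets resolved (a real `exit 0` task, or claimed and resolved by the decision task) is deliberately left out.

```python
import uuid


def wire_breakpoint(decision_task_id, tasks):
    """Re-point generated tasks from the decision task to a fresh dummy task.

    `tasks` maps taskId -> {"dependencies": [...], ...} (hypothetical shape).
    Returns (breakpoint_task_id, rewired_tasks).
    """
    # A new ID on every call, so each decision-task run gets its own dummy task.
    # Real taskIds are slugids; a uuid4 hex string is only a stand-in here.
    breakpoint_task_id = uuid.uuid4().hex
    rewired = {}
    for task_id, task in tasks.items():
        # Drop the direct dependency on the decision task.
        deps = [d for d in task.get("dependencies", []) if d != decision_task_id]
        # Tasks with no dependency inside this graph hang directly off the
        # dummy task; the rest reach it indirectly through their dependencies.
        if not any(d in tasks for d in deps):
            deps.append(breakpoint_task_id)
        rewired[task_id] = {**task, "dependencies": deps}
    return breakpoint_task_id, rewired


# Two simulated runs of the same decision task produce disjoint dummy tasks.
graph = {
    "build": {"dependencies": ["DECISION"]},
    "test": {"dependencies": ["build", "DECISION"]},
}
run0_breakpoint, run0_tasks = wire_breakpoint("DECISION", graph)
run1_breakpoint, run1_tasks = wire_breakpoint("DECISION", graph)
```

Because run 0's tasks only reference run 0's dummy task, a successful run 1 has no way to unblock them.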
Product: TaskCluster → Firefox Build System
I was about to file a new, performance-oriented bug about task execution, and I think this one already covers my request.

Essentially, my request is for tasks scheduled by the decision task to start running *before* the decision task finishes, i.e. they wouldn't have a strong dependency on the decision task.

The reason for this is end-to-end latency. Take an Autoland decision task at https://taskcluster-artifacts.net/MoAzeBl5QcyFe5ORGZM2Kw/0/public/logs/live_backing.log for example. Our first task is scheduled at:

[task 2018-05-16T17:45:21.780Z] Creating task with taskId VW5kAoqAS7qQy8bSU5lpIw for build-android-aarch64/opt

And the final log message is:

[taskcluster 2018-05-16 17:46:28.896Z] Successful task run with exit code: 0 completed in 315.179 seconds

That's ~67s between the first task being scheduled and the decision task ending. In other words, the first scheduled task could theoretically have started executing ~67s before it was actually unblocked.
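
For anyone who wants to check the arithmetic, the figure can be reproduced from the two timestamps quoted above. This is just a sketch using the Python standard library; the variable names are made up for the example.

```python
from datetime import datetime

# Timestamps copied from the two log lines quoted above. Note the two lines
# use slightly different formats ("T" separator vs. a space).
first_task_created = datetime.strptime(
    "2018-05-16T17:45:21.780Z", "%Y-%m-%dT%H:%M:%S.%fZ"
)
decision_task_done = datetime.strptime(
    "2018-05-16 17:46:28.896Z", "%Y-%m-%d %H:%M:%S.%fZ"
)

# Time during which the first created task existed but was still blocked.
delay_seconds = (decision_task_done - first_task_created).total_seconds()
print(f"{delay_seconds:.3f}s")  # prints "67.116s"
```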

We've justified hours of development time to optimize shorter delays/overheads in Firefox CI.
Blocks: fastci
We want the exact opposite for release graphs - we want them to be atomic, so any failure means the graph doesn't execute - so perhaps this behavior could be toggled differently for hg-push than for relpro action tasks.
Possibly crazy idea: could the first task submitted by taskgraph be a task that depends on the decision task and if the decision task fails, it cancels all other scheduled tasks?

Or perhaps we could mark certain tasks in the release graph as depending on the decision task. That way, we could start running some tasks as soon as possible but we wouldn't run "critical" tasks unless the decision task was successful. (I'm assuming some tasks in the release graph can be safely run without consequence.)
It's possible, but it's cleaner to just require the decision task be atomic for release graphs. In the release world, correctness is a higher priority than speed, although both are important. A minute or two at the beginning to make sure the graph is fully submitted correctly is acceptable. Marking certain tasks as safe, or otherwise allowing some tasks to run early, may end up burning build numbers or, in the worst case, even version numbers, when we could avoid all of that by requiring the graph submission to be atomic.

From https://bugzilla.mozilla.org/show_bug.cgi?id=1624887:

The problem: a decision task can auto-retry due to worker-shutdown or claim-expired. When it reruns, it will resolve completed. Any tasks that were scheduled in the previous, failed decision task runId, will then run, and Chain of Trust checks will fail.

It appears we'll never get atomic taskgroup submission, for Reasons. In bug 1624887 we set retries to 0 for hg-push, but we could still hit this issue for actions or cron, or if someone manually reruns the failed decision task.

Let's create the breakpoint task in this manner:

decision task runId n <- breakpoint task <- all other tasks

The breakpoint task's `task.payload.command` could be set to something like `python verify_latest_runid.py --taskId DECISION_TASKID --runId DECISION_RUNID`. That script should fail if the latest runId of the decision task is either a) non-success, or b) a different runId than the runId the breakpoint task was created with. Otherwise it will go green, and the taskgroup will proceed.
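
A minimal sketch of what `verify_latest_runid.py` might look like, to make the proposal concrete. It assumes the Queue's task status endpoint (`/api/queue/v1/task/<taskId>/status`) is readable without credentials from inside the task, and it hardcodes the Firefox CI root URL that appears elsewhere in this bug. The function names `check_runs`, `fetch_runs` and `main` are made up for the example. No such script exists in-tree as far as this bug states.

```python
import argparse
import json
import sys
import urllib.request

# Assumption: the Firefox CI deployment. A real task would more likely read
# the root URL from its environment than hardcode it.
ROOT_URL = "https://firefox-ci-tc.services.mozilla.com"


def check_runs(runs, expected_run_id):
    """Return None if the breakpoint may go green, else a failure reason."""
    if not runs:
        return "decision task has no runs"
    latest = max(runs, key=lambda run: run["runId"])
    # b) a newer run exists than the one that created this breakpoint task
    if latest["runId"] != expected_run_id:
        return f"latest runId is {latest['runId']}, expected {expected_run_id}"
    # a) the latest run did not succeed
    if latest["state"] != "completed":
        return f"run {latest['runId']} is {latest['state']}, not completed"
    return None


def fetch_runs(task_id):
    """Fetch the list of runs for a task from the Queue status endpoint."""
    url = f"{ROOT_URL}/api/queue/v1/task/{task_id}/status"
    with urllib.request.urlopen(url) as response:
        return json.load(response)["status"]["runs"]


def main(argv=None):
    parser = argparse.ArgumentParser()
    parser.add_argument("--taskId", required=True)
    parser.add_argument("--runId", required=True, type=int)
    args = parser.parse_args(argv)
    reason = check_runs(fetch_runs(args.taskId), args.runId)
    if reason:
        print(f"refusing to unblock the task group: {reason}", file=sys.stderr)
        return 1
    return 0
```

Calling `sys.exit(main())` under the usual `__main__` guard would turn this into the command-line script. The decision logic is kept in `check_runs`, separate from the network call, so it can be unit-tested against canned run lists.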

Depends on: 1624887
Summary: Make decision-task-generated tasks depend on a "dummy" task → Make decision-task-generated tasks depend on a breakpoint task
See Also: → 1764371
Severity: normal → S3

This has happened again, 2 times on the same merge:

See Also: → 1850487

(In reply to Iulian Moraru from comment #13)

> This has happened again, 2 times on the same merge:

The safe thing to do when the nightly cron task fails is to trigger a new one (via https://firefox-ci-tc.services.mozilla.com/hooks/project-releng/cron-task-mozilla-central%2Fnightly-desktop), rather than rerun the failed task.

Whiteboard: [stockwell disable-recommended]