1373013 - Make decision-task-generated tasks depend on a breakpoint task

Reporter

Description • 9 years ago

In https://github.com/taskcluster/taskcluster-rfcs/issues/67 we determined that the existing strategy of making all tasks created by a decision task depend on that decision task is inadequate if that task is later re-run, since any tasks created by run 0 will be scheduled once run 1 completes successfully. The idea there was to instead make all tasks depend (directly or indirectly) on a "dummy" task created by the decision task, and not executed until all tasks have been created. A second run of the decision task would then create a new dummy task, so any tasks depending on run 0's dummy task would not be affected. That dummy task could either run as a real task (`exit 0`) or could be claimed and resolved as complete by the decision task, depending on how fancy we want to get.
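A minimal sketch of the wiring described above, assuming task definitions are plain dicts carrying a `dependencies` list of taskIds. The helper name `attach_dummy_dependency`, the task labels, and the dummy taskId are hypothetical and not taken from taskgraph.

```python
def attach_dummy_dependency(tasks, dummy_task_id):
    """Make every task depend, directly or indirectly, on the dummy task.

    `tasks` maps a (hypothetical) taskId to a task definition dict. Tasks that
    already depend on another task in the same graph are left alone, because
    they reach the dummy task through that dependency.
    """
    wired = {}
    for task_id, task in tasks.items():
        deps = list(task.get("dependencies", []))
        if not any(dep in tasks for dep in deps):
            # Root of the graph: gate it on the dummy task for this run.
            deps.append(dummy_task_id)
        wired[task_id] = {**task, "dependencies": deps}
    return wired


# Illustrative graph: "test" depends on "build"; "build" has no dependencies.
graph = {
    "build": {"dependencies": []},
    "test": {"dependencies": ["build"]},
}
wired = attach_dummy_dependency(graph, "dummy-for-run-0")
print(wired["build"]["dependencies"])
print(wired["test"]["dependencies"])
```

Under this scheme a second run of the decision task would pass a fresh dummy taskId, so tasks wired to run 0's dummy would stay blocked unless that particular dummy is resolved.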

BMO Automation

Updated • 8 years ago

Product: TaskCluster → Firefox Build System

Gregory Szorc [:gps]

Comment 1 • 8 years ago

I was filing a new, performance-oriented bug report with regard to task execution and think this one encapsulates my request. Essentially, my request is for tasks scheduled by the decision task to start running *before* the decision task finishes, i.e. they wouldn't have a strong dependency on the decision task. The reason for this is end-to-end latency.

Take an Autoland decision task at https://taskcluster-artifacts.net/MoAzeBl5QcyFe5ORGZM2Kw/0/public/logs/live_backing.log for example. Our first task is scheduled at:

[task 2018-05-16T17:45:21.780Z] Creating task with taskId VW5kAoqAS7qQy8bSU5lpIw for build-android-aarch64/opt

And the final log message is:

[taskcluster 2018-05-16 17:46:28.896Z] Successful task run with exit code: 0 completed in 315.179 seconds

That's a delay of ~67s between when the first scheduled task could theoretically start executing and when it is unblocked from executing. We've justified hours of development time to optimize shorter delays/overheads in Firefox CI.
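The ~67s figure can be checked directly from the two log timestamps quoted above. A small sketch; the variable names are illustrative only.

```python
from datetime import datetime

# Timestamps copied from the two log lines quoted in this comment.
first_task_created = datetime.strptime(
    "2018-05-16T17:45:21.780Z", "%Y-%m-%dT%H:%M:%S.%fZ"
)
decision_task_done = datetime.strptime(
    "2018-05-16 17:46:28.896Z", "%Y-%m-%d %H:%M:%S.%fZ"
)

# Time the first created task spends blocked on the decision task.
blocked_seconds = (decision_task_done - first_task_created).total_seconds()
print(f"first task blocked for about {blocked_seconds:.0f}s")
```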

Blocks: fastci

Aki Sasaki (not active)

Comment 2 • 8 years ago

We want the exact opposite for release graphs - we want them to be atomic, so any failure means the graph doesn't execute - so perhaps this behavior could be toggled differently for hg-push than relpro action tasks.

Gregory Szorc [:gps]

Comment 3 • 8 years ago

Possibly crazy idea: could the first task submitted by taskgraph be a task that depends on the decision task and, if the decision task fails, cancels all other scheduled tasks? Or perhaps we could mark certain tasks in the release graph as depending on the decision task. That way, we could start running some tasks as soon as possible but we wouldn't run "critical" tasks unless the decision task was successful. (I'm assuming some tasks in the release graph can be safely run without consequence.)

Aki Sasaki (not active)

Comment 4 • 8 years ago

It's possible, but it's cleaner to just require the decision task be atomic for release graphs. In the release world, correctness is a higher priority than speed, although both are important. A minute or two at the beginning to make sure the graph is fully submitted correctly is acceptable. Marking certain tasks or allowing certain tasks to run may end up burning build numbers, or even version numbers in the worst case, when we could avoid all of that by requiring the graph submission to be atomic.

Aki Sasaki (not active)

Comment 5 • 6 years ago

From https://bugzilla.mozilla.org/show_bug.cgi?id=1624887:

The problem: a decision task can auto-retry due to worker-shutdown or claim-expired. When it reruns, it will resolve completed. Any tasks that were scheduled by the previous, failed run of the decision task will then run, and Chain of Trust checks will fail.

It appears we'll never get atomic taskgroup submission, for Reasons. In bug 1624887 we set retries to 0 for hg-push, but we could still hit this issue for actions or cron, or if someone manually reruns the failed decision task.

Let's create the breakpoint task in this manner:

decision task runId n <- breakpoint task <- all other tasks

The breakpoint task's `task.payload.command` could be set to something like `python verify_latest_runid.py --taskId DECISION_TASKID --runId DECISION_RUNID`. That script should fail if the latest runId of the decision task is either a) non-success, or b) a different runId than the runId the breakpoint task was created with. Otherwise it will go green, and the taskgroup will proceed.
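The check proposed for `verify_latest_runid.py` could look roughly like the sketch below. It assumes the decision task's runs are already available as a list of dicts with `runId` and `state` fields, mirroring the shape of a Taskcluster queue task status, and that a successful run has state `completed`. Fetching the status and parsing `--taskId` / `--runId` are left out, and the function name is hypothetical.

```python
def breakpoint_should_pass(runs, expected_run_id):
    """Return True only if the latest decision run is the expected one and succeeded.

    `runs` is assumed to be a list of dicts such as
    {"runId": 0, "state": "completed"}; this layout is an assumption of the sketch.
    """
    if not runs:
        return False
    latest = max(runs, key=lambda run: run["runId"])
    # Condition b) from the comment: a newer run supersedes the one we were created by.
    if latest["runId"] != expected_run_id:
        return False
    # Condition a) from the comment: the latest run must have succeeded.
    return latest["state"] == "completed"


# Run 0 hit an exception and run 1 completed: run 0's breakpoint must fail,
# while the breakpoint created by run 1 may go green.
rerun_history = [
    {"runId": 0, "state": "exception"},
    {"runId": 1, "state": "completed"},
]
print(breakpoint_should_pass(rerun_history, expected_run_id=0))
print(breakpoint_should_pass(rerun_history, expected_run_id=1))
```

A real script would presumably turn a `False` result into a non-zero exit status so that the breakpoint task fails and everything depending on it stays blocked.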

Depends on: 1624887

Summary: Make decision-task-generated tasks depend on a "dummy" task → Make decision-task-generated tasks depend on a breakpoint task

Atila Butkovits

Comment 8 • 6 years ago

Recent failure log: https://firefoxci.taskcluster-artifacts.net/Y-BJm5TkRAW6acG8qCFRkg/0/public/logs/chain_of_trust.log

Julien Cristau [:jcristau]

Updated • 4 years ago

Updated • 3 years ago

Severity: normal → S3

Iulian Moraru

Comment 13 • 2 years ago • Edited

This has happened again, 2 times on the same merge:

failure log 1
failure log 2
LATER EDIT: this occurred several more times, as can be seen here:
linux shippable
windows 2012 shippable
os x cross compiled shippable

Julien Cristau [:jcristau]

Updated • 2 years ago

Comment 16 • 2 years ago

(In reply to Iulian Moraru from comment #13)

> This has happened again, 2 times on the same merge:

The safe thing to do when the nightly cron task fails is to trigger a new one (via https://firefox-ci-tc.services.mozilla.com/hooks/project-releng/cron-task-mozilla-central%2Fnightly-desktop), rather than rerun the failed task.

Cosmin Sabou [:CosminS]

Updated • 2 years ago

Whiteboard: [stockwell disable-recommended]

Categories: (Firefox Build System :: Task Configuration, task)
Tracking: (Not tracked)
People: (Reporter: dustin, Unassigned)
References: (Blocks 1 open bug)
Security: (public)
