Make decision-task-generated tasks depend on a breakpoint task
Categories
(Firefox Build System :: Task Configuration, task)
Tracking
(Not tracked)
People
(Reporter: dustin, Unassigned)
References
(Blocks 1 open bug)
Details
In https://github.com/taskcluster/taskcluster-rfcs/issues/67 we determined that the existing strategy of making all tasks created by a decision task depend on that decision task is inadequate if that task is later re-run, since any tasks created by run 0 will be scheduled once run 1 completes successfully.
The idea there was to instead make all tasks depend (directly or indirectly) on a "dummy" task created by the decision task and not executed until all tasks have been created. A second run of the decision task would then create a new dummy task, so any tasks depending on run 0's dummy task would not be affected. That dummy task could either run as a real task (`exit 0`) or be claimed and resolved as complete by the decision task, depending on how fancy we want to get.
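A minimal sketch of the dummy-task wiring described above (the function name and dict shapes are illustrative, not the actual decision-task code):

```python
import uuid


def wire_to_dummy(task_defs):
    """Sketch: make every generated task depend on a per-run dummy task.

    A re-run of the decision task would call this again, producing a new
    dummy taskId, so tasks created by the previous run stay unscheduled.
    """
    dummy_id = str(uuid.uuid4())  # stand-in for a real Taskcluster slugid
    dummy_task = {
        "taskId": dummy_id,
        # Either a real no-op task, or claimed and resolved directly by
        # the decision task once all tasks have been created.
        "payload": {"command": ["/bin/true"]},
        "dependencies": [],
    }
    wired = [
        dict(t, dependencies=list(t.get("dependencies", [])) + [dummy_id])
        for t in task_defs
    ]
    return dummy_task, wired
```

Because each run mints a fresh dummy taskId, resolving run 1's dummy cannot unblock tasks that were wired to run 0's dummy.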
Updated•6 years ago
Comment 1•6 years ago
I was filing a new, performance-oriented bug report about task execution and think this one encapsulates my request. Essentially, my request is for tasks scheduled by the decision task to start running *before* the decision task finishes, i.e. they wouldn't have a strong dependency on the decision task. The reason for this is end-to-end latency.
Take the Autoland decision task at https://taskcluster-artifacts.net/MoAzeBl5QcyFe5ORGZM2Kw/0/public/logs/live_backing.log for example. Its first task is scheduled at:
[task 2018-05-16T17:45:21.780Z] Creating task with taskId VW5kAoqAS7qQy8bSU5lpIw for build-android-aarch64/opt
And its final log message is:
[taskcluster 2018-05-16 17:46:28.896Z] Successful task run with exit code: 0 completed in 315.179 seconds
That's a delay of ~67s between when the first scheduled task could theoretically start executing and when it is actually unblocked from executing. We've justified hours of development time to optimize shorter delays/overheads in Firefox CI.
Comment 2•6 years ago
We want the exact opposite for release graphs: we want them to be atomic, so any failure means the graph doesn't execute. Perhaps this behavior could be toggled differently for hg-push than for relpro action tasks.
Comment 3•6 years ago
Possibly crazy idea: could the first task submitted by taskgraph be a task that depends on the decision task and, if the decision task fails, cancels all other scheduled tasks? Or perhaps we could mark certain tasks in the release graph as depending on the decision task. That way, we could start running some tasks as soon as possible, but we wouldn't run "critical" tasks unless the decision task was successful. (I'm assuming some tasks in the release graph can safely run without consequence.)
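The second suggestion above could be sketched roughly like this (the `critical` flag and the dict shapes are hypothetical, not an existing taskgraph attribute):

```python
def wire_critical_dependencies(tasks, decision_task_id):
    """Sketch: only tasks flagged as "critical" gain a dependency on the
    decision task, so the remaining tasks can start running before the
    decision task finishes."""
    wired = []
    for t in tasks:
        deps = list(t.get("dependencies", []))
        if t.get("critical"):
            deps.append(decision_task_id)
        wired.append(dict(t, dependencies=deps))
    return wired
```

Non-critical tasks would then be schedulable immediately, while anything marked critical still waits for the decision task to resolve successfully.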
Comment 4•6 years ago
It's possible, but it's cleaner to just require that the decision task be atomic for release graphs. In the release world, correctness is a higher priority than speed, although both are important. A minute or two at the beginning to make sure the graph is fully and correctly submitted is acceptable. Marking certain tasks, or allowing certain tasks to run, may end up burning build numbers or even version numbers in the worst case, when we could avoid all of that by requiring graph submission to be atomic.
Comment 5•4 years ago
From https://bugzilla.mozilla.org/show_bug.cgi?id=1624887:
The problem: a decision task can auto-retry due to worker-shutdown or claim-expired. When it reruns, it will resolve completed. Any tasks that were scheduled in the previous, failed decision task runId will then run, and Chain of Trust checks will fail.
It appears we'll never get atomic taskgroup submission, for Reasons. In bug 1624887 we set retries to 0 for hg-push, but we could still hit this issue for actions or cron, or if someone manually reruns the failed decision task.
Let's create the breakpoint task in this manner:
decision task runId n <- breakpoint task <- all other tasks
The breakpoint task's task.payload.command could be set to something like `python verify_latest_runid.py --taskId DECISION_TASKID --runId DECISION_RUNID`. That script should fail if the latest runId of the decision task is either a) non-success, or b) a different runId than the runId the breakpoint task was created with. Otherwise it will go green, and the taskgroup will proceed.
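The core of such a verification script might look like this (a sketch, not an implemented verify_latest_runid.py; the `status` dict follows the shape of the Taskcluster queue's task-status response, which a real script would fetch over HTTP before calling the check):

```python
def check_latest_run(status, expected_run_id):
    """Return None if the breakpoint may go green, else an error string.

    Fails if the latest run of the decision task is a different runId
    than the one that created this breakpoint task, or if that run did
    not complete successfully. `status` is the "status" object from the
    Taskcluster queue's task-status endpoint.
    """
    latest = status["runs"][-1]
    if latest["runId"] != expected_run_id:
        return "latest runId %d != expected %d" % (
            latest["runId"], expected_run_id)
    if latest.get("state") != "completed":
        return "latest run state is %r, not completed" % latest.get("state")
    return None  # breakpoint goes green; the taskgroup proceeds
```

A rerun of the decision task appends a new run to `status["runs"]`, so breakpoint tasks created by the earlier run would see a mismatched runId and fail, keeping their downstream tasks unscheduled.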
Comment 8•4 years ago
Updated•2 years ago
Comment 13•8 months ago
This has happened again, 2 times on the same merge:
- failure log 1
- failure log 2
LATER EDIT: this has since occurred several more times, as can be seen here:
- linux shippable
- windows 2012 shippable
- os x cross compiled shippable
Comment 16•8 months ago
(In reply to Iulian Moraru from comment #13)
This has happened again, 2 times on the same merge:
The safe thing to do when the nightly cron task fails is to trigger a new one (via https://firefox-ci-tc.services.mozilla.com/hooks/project-releng/cron-task-mozilla-central%2Fnightly-desktop), rather than rerun the failed task.
Updated•8 months ago