Closed
Bug 1398277
Opened 7 years ago
Closed 6 years ago
Retriggering a decision task can cause duplicate jobs to be scheduled
Categories
(Taskcluster :: Services, defect, P5)
Taskcluster
Services
Tracking
(Not tracked)
RESOLVED
FIXED
mozilla63
People
(Reporter: garndt, Assigned: dustin)
References
Details
Attachments
(2 files, 2 obsolete files)
Currently when a task is retriggered, mozilla-taskcluster will respond to the action and duplicate that node along with all descendants.
Tasks scheduled by a gecko decision task are made to depend on the decision task.
In the situation of retriggering, this means that the decision task along with all the dependent tasks are recreated. Once the retriggered decision task runs, it will then schedule another duplicate set of jobs.
Assignee | ||
Comment 1•7 years ago
|
||
I think the fix to this is to make sure that the "regular" retrigger action doesn't apply to the decision task, and define another retrigger action specifically for the decision task that just duplicates it, without duplicating child tasks.
Or I guess since that's all implemented in Python, it could just detect action/decision tasks and not duplicate their children.
Mentor: dustin
Keywords: good-first-bug
Reporter | ||
Updated•7 years ago
|
Priority: -- → P5
Comment 2•7 years ago
|
||
Hi, I would like to work on this bug. Can you guide me to the relevant code?
Flags: needinfo?(dustin)
Assignee | ||
Comment 3•7 years ago
|
||
The retrigger action is implemented here:
https://dxr.mozilla.org/mozilla-central/source/taskcluster/taskgraph/actions/retrigger.py
I think the best solution would be to modify that code as suggested in the second paragraph in comment 1.
Flags: needinfo?(dustin)
Comment 4•7 years ago
|
||
Hey,
This is my first attempt at contributing to an open source project.
It would be highly appreciable if you could guide me as to how to get started.
Thanks.
Assignee | ||
Comment 5•7 years ago
|
||
Hi! I suspect ydidwania has started working on this. Let's see if they are still working before you jump in.
Flags: needinfo?(didwaniayashvardhan)
Comment 6•7 years ago
|
||
I think ydidwania is working on some other issue(1349261).
Please confirm
Thanks.
Comment 7•7 years ago
|
||
(In reply to yuktikirtani from comment #6)
> I think ydidwania is working on some other issue(1349261).
> Please confirm
> Thanks.
Hi yuktikirtani. Thanks for your interest in fixing this bug. Dustin is right, I am still working on this. You are however,free to try out other bugs you find interesting.
Flags: needinfo?(didwaniayashvardhan)
Comment 8•7 years ago
|
||
(In reply to ydidwania from comment #7)
> (In reply to yuktikirtani from comment #6)
> > I think ydidwania is working on some other issue(1349261).
> > Please confirm
> > Thanks.
>
> Hi yuktikirtani. Thanks for your interest in fixing this bug. Dustin is
> right, I am still working on this. You are however,free to try out other
> bugs you find interesting.
Hey, Okay no problem
Comment 9•7 years ago
|
||
I think modifying the code as below should work.
if task_id != decision_task_id:
if input.get('downstream'):
to_run = full_task_graph.graph.transitive_closure(set(to_run), reverse=True).nodes
to_run = to_run & set(label_to_taskid.keys())
with_downstream = ' (with downstream) '
Basically adding downstream tasks to `to_run` only when the task is not a decision task.
Are action tasks different from decision tasks? I didn't find any mention to action task while going through any of the docs.
Can you tell me why `to_run` is a set of labels rather than task ids.There can be many tasks with the same label.
Flags: needinfo?(dustin)
Assignee | ||
Comment 10•7 years ago
|
||
That sounds like a good plan!
Action tasks are sort of like decision tasks, in that they call queue.createTask -- but they serve to modify an already-existing push, rather than start a new one.
It is a set of labels because that represents what we want to retrigger. If you think of a taskId as a specific instance of a label, it's "do this <label> again" rather than repeating the specific instance (which has already failed and can't be further modified).
Flags: needinfo?(dustin)
Comment 11•7 years ago
|
||
Are action tasks covered in the condition
task_id != decision_task_id
Flags: needinfo?(dustin)
Assignee | ||
Comment 12•7 years ago
|
||
I'm not worried about that right now -- let's just worry about decision tasks :)
Flags: needinfo?(dustin)
Comment hidden (mozreview-request) |
Assignee | ||
Updated•7 years ago
|
Assignee: nobody → didwaniayashvardhan
Assignee | ||
Comment 14•7 years ago
|
||
mozreview-review |
Comment on attachment 8916790 [details]
Bug 1398277 - Fixed typo
https://reviewboard.mozilla.org/r/187846/#review193172
Looks good! I triggered a try push, and will try re-triggering that push's decision task.
Attachment #8916790 -
Flags: review?(dustin) → review+
Assignee | ||
Comment 15•7 years ago
|
||
mozreview-review |
Comment on attachment 8916790 [details]
Bug 1398277 - Fixed typo
https://reviewboard.mozilla.org/r/187846/#review193184
Hm, didn't work:
https://public-artifacts.taskcluster.net/NG1i-Na7S6yJTPpznFC93A/0/public/logs/live_backing.log
Attachment #8916790 -
Flags: review+
Comment 16•7 years ago
|
||
I found this in taskcluster/taskgraph/graph.py
def transitive_closure(self, nodes, reverse=False):
"""
Return the transitive closure of <nodes>: the graph containing all
specified nodes as well as any nodes reachable from them, and any
intervening edges.
If `reverse` is true, the "reachability" will be reversed and this
will return the set of nodes that can reach the specified nodes.
Example
-------
a ------> b ------> c
|
`-------> d
transitive_closure([b]).nodes == set([a, b])
transitive_closure([c]).nodes == set([c, b, a])
transitive_closure([c], reverse=True).nodes == set([c])
transitive_closure([b], reverse=True).nodes == set([b, c, d])
"""
1) I went through the implementation. I think it is reversed according to the explanation provided. The examples are correct.
There's this line in retrigger.py:
to_run = full_task_graph.graph.transitive_closure(set(to_run), reverse=True).nodes
which is expected to fetch the set of all downstream tasks from a given task.
2) Regarding the error, I think we need to add the keys as is done when we were duplicating the child tasks as well. Something like
to_run = to_run & set(label_to_taskid.keys())
Flags: needinfo?(dustin)
Assignee | ||
Comment 17•7 years ago
|
||
The error is coming from a *different* call to transitive_closure, from what I can tell.
Flags: needinfo?(dustin)
Comment 18•7 years ago
|
||
Yeah its because of this call -
[task 2017-10-10T15:11:49.623251Z] File "/builds/worker/checkouts/gecko/taskcluster/taskgraph/actions/util.py", line 44, in create_tasks
[task 2017-10-10T15:11:49.623313Z] target_graph = full_task_graph.graph.transitive_closure(to_run)
I got confused when I saw the examples in transitive_closure. I think they contradict the explanation of transitive closure of a graph written above.
Comment 19•7 years ago
|
||
There is some error in `to_run`, the set of labels that we are passing to `create_task() .
Is there a way to test the retrigger action locally?
Flags: needinfo?(dustin)
Assignee | ||
Comment 20•7 years ago
|
||
Yes -- if you look at the task details in
https://tools.taskcluster.net/groups/TuAbomxORdW52JbuFvKRlQ/tasks/NG1i-Na7S6yJTPpznFC93A/details
you will see a bunch of environment variables and a command line. You can set those variables on your command line, and then run the last bit of that command ("./mach --log-no-times taskgraph action-callback").
Flags: needinfo?(dustin)
Comment 21•7 years ago
|
||
The Gecko Decision Task does not get added to the task graph, hence the statement
assert nodes <= self.nodes
in https://dxr.mozilla.org/mozilla-central/source/taskcluster/taskgraph/graph.py fails.
This failure happens even with the current version of retrigger.py that is in use( I tried it locally), but from the call in line 59 (https://dxr.mozilla.org/mozilla-central/source/taskcluster/taskgraph/actions/retrigger.py#59) .
So my question is why isn't it getting added to the full_task_graph.json or does the problem exist only locally?
Flags: needinfo?(dustin)
Assignee | ||
Comment 22•7 years ago
|
||
That's because the decision task isn't in the task graph -- it *creates* the task graph. So `create_tasks` is the wrong tool for retriggering the decision task. It looks like you want `create_task_from_def`:
https://dxr.mozilla.org/mozilla-central/source/taskcluster/taskgraph/actions/util.py#55
you will need to replace the `created`, `deadline`, and `expires` attributes in the task definition with new timestamps using `{"relative-datestamp": ".."}`.
Flags: needinfo?(dustin)
Comment 23•7 years ago
|
||
How do I obtain the task definition of the decision task since it is not present in the task graph?
Will 'get_decision_parameters' in :
https://dxr.mozilla.org/mozilla-central/source/taskcluster/taskgraph/decision.py#131
be useful here?
Flags: needinfo?(dustin)
Assignee | ||
Comment 24•7 years ago
|
||
You can call https://docs.taskcluster.net/reference/platform/taskcluster-queue/references/api#task from the action task to get the task definition, passing it the decision task's taskid.
Flags: needinfo?(dustin)
Comment 25•7 years ago
|
||
This is where I am at currently when task_id == decision_task_id
new_task_definition = requests.get('https://queue.taskcluster.net/v1/task/%s'%(task_id)).json()
new_task_definition['created'] = dict({"relative-datestamp": "0 seconds"})
new_task_definition['deadline'] = dict({"relative-datestamp": "4 days"})
new_task_definition['expired'] = dict({"relative-datestamp": "14 days"})
new_task_id=slugid()
create_task_from_def(new_task_id, new_task_definition, parameters['level'])t
logger.info('Scheduled {}{}(time {}/{})'.format(label, with_downstream, i+1, times))
I am unable to test it because I don't have the rights( thats what scopes mean, right?) to create a task using the api.
A decision task shouldn't be retriggered multiple times hence I did not.
Did I miss anything?
Flags: needinfo?(dustin)
Assignee | ||
Comment 26•7 years ago
|
||
I think that looks good. Do you want to push the patch and we can try it out in try?
Flags: needinfo?(dustin)
Comment hidden (mozreview-request) |
Assignee | ||
Comment 29•7 years ago
|
||
mozreview-review |
Comment on attachment 8923583 [details]
Bug 1398277-Retrigger decision task from task definition;
https://reviewboard.mozilla.org/r/194714/#review200702
Sorry for the delay -- if you can fix this up I can push the buttons in try and see how it works.
::: taskcluster/taskgraph/actions/retrigger.py:80
(Diff revision 1)
>
> - logger.info('Scheduled {}{}(time {}/{})'.format(label, with_downstream, i+1, times))
> + logger.info('Scheduled {}{}(time {}/{})'.format(label, with_downstream, i+1, times))
> + else:
> + print ("hello")
> + task_id = 'QAs5KRluQqe6PKtx_DpQLQ'
> + print (task_id)
Can you remove this? Otherwise it won't work when I push the buttons on the try push :)
Attachment #8923583 -
Flags: review?(dustin)
Comment hidden (mozreview-request) |
Updated•7 years ago
|
Attachment #8923583 -
Attachment is obsolete: true
Comment 31•7 years ago
|
||
I made a commit with the lines removed and pushed only that extra commit for review. I thought it will get mapped to the existing patch (now totalling 3 commits) but instead that patch became obsolete. I have a feeling I messed up. If required I will push the entire revset again for review :)
Assignee | ||
Comment 32•7 years ago
|
||
That looks like a mozreview bug to me. I've triggered the try job, and will try re-triggering the decision task on it.
Assignee | ||
Comment 33•7 years ago
|
||
Flags: needinfo?(dustin)
Assignee | ||
Comment 34•7 years ago
|
||
[task 2017-11-03T16:08:48.528026Z] Creating task with taskId L7ATmV-WT--bbaDkqvwZdg for Gecko Decision Task
[task 2017-11-03T16:08:48.529131Z] Starting new HTTP connection (1): taskcluster
[task 2017-11-03T16:08:49.377623Z] "PUT /queue/v1/task/L7ATmV-WT--bbaDkqvwZdg HTTP/1.1" 400 3848
[task 2017-11-03T16:08:49.378432Z]
[task 2017-11-03T16:08:49.378470Z] Schema Validation Failed!
[task 2017-11-03T16:08:49.378504Z] Rejecting Schema: http://schemas.taskcluster.net/queue/v1/create-task-request.json#
[task 2017-11-03T16:08:49.378525Z] Errors:
[task 2017-11-03T16:08:49.378562Z] * data should NOT have additional properties
[task 2017-11-03T16:08:49.378574Z] ----
[task 2017-11-03T16:08:49.378586Z] method: createTask
[task 2017-11-03T16:08:49.378599Z] errorCode: InputValidationError
[task 2017-11-03T16:08:49.378609Z] statusCode: 400
[task 2017-11-03T16:08:49.378622Z] time: 2017-11-03T16:08:49.191Z
That error means that the payload to createTask had an extra top-level attribute that it should not. It doesn't say *what* that attribute was, unfortunately. But! I think I see it: it should be `expires` and not `expired`.
Comment hidden (mozreview-request) |
Comment 36•7 years ago
|
||
Fixed the typo. Sorry for letting that error pass by. Hope it works fine now :)
Assignee | ||
Comment 37•7 years ago
|
||
Hm, that didn't work either:
[task 2017-11-05T13:39:55.292866Z] In other words you are missing scopes from one of the options:
[task 2017-11-05T13:39:55.292879Z] * Option 0:
[task 2017-11-05T13:39:55.292911Z] - "queue:route:notify.email.dmitchell@mozilla.com.*", and
[task 2017-11-05T13:39:55.292935Z] - "queue:route:notify.email.dmitchell@mozilla.com.on-failed", and
[task 2017-11-05T13:39:55.292953Z] - "queue:route:notify.email.dmitchell@mozilla.com.on-exception"
If it's OK with you, I will take this bug and figure out how to finish it up. I suspect we may have started out in the wrong direction here and I need to rethink it. Which is not to diminish your work -- I appreciate it! Hopefully some of the projects we discussed in irc will be more tractable.
Comment 38•7 years ago
|
||
Yeah, its fine with me. Thanks for all your help on this. Learnt a lot.
I hope you fix this bug soon:)
I'll start on the bug you filed just yesterday.
Assignee | ||
Updated•7 years ago
|
Attachment #8916790 -
Flags: review?(dustin)
Assignee | ||
Updated•7 years ago
|
Assignee: didwaniayashvardhan → dustin
Mentor: dustin
Assignee | ||
Updated•7 years ago
|
Keywords: good-first-bug
Assignee | ||
Comment 39•6 years ago
|
||
Assignee | ||
Comment 40•6 years ago
|
||
I'm deeply suspicious that the fix in that try job will not have the scopes to run a decision task. At least right now, and possibly once we start assigning more fine-grained scopes. We'll see..
Assignee | ||
Comment 41•6 years ago
|
||
Assignee | ||
Comment 42•6 years ago
|
||
Am I cynical or just accurate? You decide..
[task 2018-07-06T18:16:27.986Z] You do not have sufficient scopes. You are missing the following scopes:
[task 2018-07-06T18:16:27.986Z]
[task 2018-07-06T18:16:27.986Z] ```
[task 2018-07-06T18:16:27.986Z] {
[task 2018-07-06T18:16:27.986Z] "AllOf": [
[task 2018-07-06T18:16:27.986Z] "assume:repo:hg.mozilla.org/try:branch:default",
[task 2018-07-06T18:16:27.986Z] "in-tree:hook-action:project-gecko/in-tree-action-1-*"
[task 2018-07-06T18:16:27.986Z] ]
[task 2018-07-06T18:16:27.986Z] }
[task 2018-07-06T18:16:27.986Z] ```
So I think that retriggering a decision task requires an actionPerm, which requires bug 1472792. It also means bug
1470622 can't go ahead yet, as sheriffs need to be able to retrigger decision tasks.
Assignee | ||
Comment 43•6 years ago
|
||
This still needs to wait until bug 1472792 is deployed in treeherder.
Assignee | ||
Comment 44•6 years ago
|
||
Assignee | ||
Comment 45•6 years ago
|
||
Attachment #8916790 -
Attachment is obsolete: true
Attachment #8994319 -
Flags: review?(bstack)
Comment 46•6 years ago
|
||
Comment on attachment 8994319 [details] [diff] [review]
bug1398277.patch
I wish all of these datestamps would have been relative from the get-go. Assuming you ran flake8 on this, r+!
Attachment #8994319 -
Flags: review?(bstack) → review+
Assignee | ||
Comment 47•6 years ago
|
||
Haha, you know me so well. Yes, flakes fixed. This can't land until bug 1472792 lands in treeherder -- that's been waiting for two weeks now, and will wait another three until I'm back from PTO.
Comment 48•6 years ago
|
||
Comment on attachment 8994315 [details]
Bug 1398277: add a special permission for retriggering decision tasks
Brian Stack [:bstack] has approved the revision.
Attachment #8994315 -
Flags: review+
Assignee | ||
Comment 49•6 years ago
|
||
https://hg.mozilla.org/integration/mozilla-inbound/rev/ab58645e9230620ca45de3ec03ee9e61eb4a7cbf
Bug 1398277: special-case retriggering of tasks not in the taskgraph; r=bstack
Comment 50•6 years ago
|
||
bugherder |
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla61
Updated•6 years ago
|
Target Milestone: mozilla61 → mozilla63
Comment 51•6 years ago
|
||
bugherder uplift |
Comment 52•6 years ago
|
||
bugherder uplift |
Updated•6 years ago
|
Component: Integration → Services
You need to log in
before you can comment on or make changes to this bug.
Description
•