Closed Bug 1398277 Opened 7 years ago Closed 6 years ago

Retriggering a decision task can cause duplicate jobs to be scheduled

Categories

(Taskcluster :: Services, defect, P5)

Product:

Component:

Type:

defect

Priority:

P5

Severity:

normal

Tracking

(Not tracked)

Status:

RESOLVED FIXED

Milestone:

mozilla63

People

(Reporter: garndt, Assigned: dustin)

References

Details

Attachments

(2 files, 2 obsolete files)

Bug 1398277 - Fixed typo 7 years ago Yashvardhan Didwania [:ydidwania] 59 bytes, text/x-review-board-request		Details
Bug 1398277-Retrigger decision task from task definition; 7 years ago Yashvardhan Didwania [:ydidwania] 59 bytes, text/x-review-board-request		Details
Bug 1398277: add a special permission for retriggering decision tasks 7 years ago Dustin J. Mitchell [:dustin] (he/him) 46 bytes, text/x-phabricator-request	bstack : review+	Details \| Review
bug1398277.patch 7 years ago Dustin J. Mitchell [:dustin] (he/him) 15.42 KB, patch	bstack : review+	Details \| Diff \| Splinter Review

Greg Arndt [:garndt]

Reporter

Description

•

7 years ago

Currently when a task is retriggered, mozilla-taskcluster will respond to the action and duplicate that node along with all descendants. Tasks scheduled by a gecko decision task are made to depend on the decision task. In the situation of retriggering, this means that the decision task along with all the dependent tasks are recreated. Once the retriggered decision task runs, it will then schedule another duplicate set of jobs.

Dustin J. Mitchell [:dustin] (he/him)

Assignee

Comment 1

•

7 years ago

I think the fix to this is to make sure that the "regular" retrigger action doesn't apply to the decision task, and define another retrigger action specifically for the decision task that just duplicates it, without duplicating child tasks. Or I guess since that's all implemented in Python, it could just detect action/decision tasks and not duplicate their children.

Mentor: dustin

Keywords: good-first-bug

Greg Arndt [:garndt]

Reporter

Updated

•

7 years ago

Priority: -- → P5

Yashvardhan Didwania [:ydidwania]

Comment 2

•

7 years ago

Hi, I would like to work on this bug. Can you guide me to the relevant code?

Flags: needinfo?(dustin)

Dustin J. Mitchell [:dustin] (he/him)

Assignee

Comment 3

•

7 years ago

The retrigger action is implemented here: https://dxr.mozilla.org/mozilla-central/source/taskcluster/taskgraph/actions/retrigger.py I think the best solution would be to modify that code as suggested in the second paragraph in comment 1.

Flags: needinfo?(dustin)

Comment 4

•

7 years ago

Hey, This is my first attempt at contributing to an open source project. It would be highly appreciable if you could guide me as to how to get started. Thanks.

Dustin J. Mitchell [:dustin] (he/him)

Assignee

Comment 5

•

7 years ago

Hi! I suspect ydidwania has started working on this. Let's see if they are still working before you jump in.

Flags: needinfo?(didwaniayashvardhan)

Comment 6

•

7 years ago

I think ydidwania is working on some other issue(1349261). Please confirm Thanks.

Yashvardhan Didwania [:ydidwania]

Comment 7

•

7 years ago

(In reply to yuktikirtani from comment #6) > I think ydidwania is working on some other issue(1349261). > Please confirm > Thanks. Hi yuktikirtani. Thanks for your interest in fixing this bug. Dustin is right, I am still working on this. You are however,free to try out other bugs you find interesting.

Flags: needinfo?(didwaniayashvardhan)

Comment 8

•

7 years ago

(In reply to ydidwania from comment #7) > (In reply to yuktikirtani from comment #6) > > I think ydidwania is working on some other issue(1349261). > > Please confirm > > Thanks. > > Hi yuktikirtani. Thanks for your interest in fixing this bug. Dustin is > right, I am still working on this. You are however,free to try out other > bugs you find interesting. Hey, Okay no problem

Yashvardhan Didwania [:ydidwania]

Comment 9

•

7 years ago

I think modifying the code as below should work. if task_id != decision_task_id: if input.get('downstream'): to_run = full_task_graph.graph.transitive_closure(set(to_run), reverse=True).nodes to_run = to_run & set(label_to_taskid.keys()) with_downstream = ' (with downstream) ' Basically adding downstream tasks to `to_run` only when the task is not a decision task. Are action tasks different from decision tasks? I didn't find any mention to action task while going through any of the docs. Can you tell me why `to_run` is a set of labels rather than task ids.There can be many tasks with the same label.

Flags: needinfo?(dustin)

Dustin J. Mitchell [:dustin] (he/him)

Assignee

Comment 10

•

7 years ago

That sounds like a good plan! Action tasks are sort of like decision tasks, in that they call queue.createTask -- but they serve to modify an already-existing push, rather than start a new one. It is a set of labels because that represents what we want to retrigger. If you think of a taskId as a specific instance of a label, it's "do this <label> again" rather than repeating the specific instance (which has already failed and can't be further modified).

Flags: needinfo?(dustin)

Yashvardhan Didwania [:ydidwania]

Comment 11

•

7 years ago

Are action tasks covered in the condition task_id != decision_task_id

Flags: needinfo?(dustin)

Dustin J. Mitchell [:dustin] (he/him)

Assignee

Comment 12

•

7 years ago

I'm not worried about that right now -- let's just worry about decision tasks :)

Flags: needinfo?(dustin)

Comment hidden (mozreview-request)

Dustin J. Mitchell [:dustin] (he/him)

Assignee

Updated

•

7 years ago

Assignee: nobody → didwaniayashvardhan

Dustin J. Mitchell [:dustin] (he/him)

Assignee

Comment 14

•

7 years ago

mozreview-review

Comment on attachment 8916790 [details] Bug 1398277 - Fixed typo https://reviewboard.mozilla.org/r/187846/#review193172 Looks good! I triggered a try push, and will try re-triggering that push's decision task.

Attachment #8916790 - Flags: review?(dustin) → review+

Dustin J. Mitchell [:dustin] (he/him)

Assignee

Comment 15

•

7 years ago

mozreview-review

Comment on attachment 8916790 [details] Bug 1398277 - Fixed typo https://reviewboard.mozilla.org/r/187846/#review193184 Hm, didn't work: https://public-artifacts.taskcluster.net/NG1i-Na7S6yJTPpznFC93A/0/public/logs/live_backing.log

Attachment #8916790 - Flags: review+

Yashvardhan Didwania [:ydidwania]

Comment 16

•

7 years ago

I found this in taskcluster/taskgraph/graph.py def transitive_closure(self, nodes, reverse=False): """ Return the transitive closure of <nodes>: the graph containing all specified nodes as well as any nodes reachable from them, and any intervening edges. If `reverse` is true, the "reachability" will be reversed and this will return the set of nodes that can reach the specified nodes. Example ------- a ------> b ------> c | `-------> d transitive_closure([b]).nodes == set([a, b]) transitive_closure([c]).nodes == set([c, b, a]) transitive_closure([c], reverse=True).nodes == set([c]) transitive_closure([b], reverse=True).nodes == set([b, c, d]) """ 1) I went through the implementation. I think it is reversed according to the explanation provided. The examples are correct. There's this line in retrigger.py: to_run = full_task_graph.graph.transitive_closure(set(to_run), reverse=True).nodes which is expected to fetch the set of all downstream tasks from a given task. 2) Regarding the error, I think we need to add the keys as is done when we were duplicating the child tasks as well. Something like to_run = to_run & set(label_to_taskid.keys())

Flags: needinfo?(dustin)

Dustin J. Mitchell [:dustin] (he/him)

Assignee

Comment 17

•

7 years ago

The error is coming from a *different* call to transitive_closure, from what I can tell.

Flags: needinfo?(dustin)

Yashvardhan Didwania [:ydidwania]

Comment 18

•

7 years ago

Yeah its because of this call - [task 2017-10-10T15:11:49.623251Z] File "/builds/worker/checkouts/gecko/taskcluster/taskgraph/actions/util.py", line 44, in create_tasks [task 2017-10-10T15:11:49.623313Z] target_graph = full_task_graph.graph.transitive_closure(to_run) I got confused when I saw the examples in transitive_closure. I think they contradict the explanation of transitive closure of a graph written above.

Yashvardhan Didwania [:ydidwania]

Comment 19

•

7 years ago

There is some error in `to_run`, the set of labels that we are passing to `create_task() . Is there a way to test the retrigger action locally?

Flags: needinfo?(dustin)

Dustin J. Mitchell [:dustin] (he/him)

Assignee

Comment 20

•

7 years ago

Yes -- if you look at the task details in https://tools.taskcluster.net/groups/TuAbomxORdW52JbuFvKRlQ/tasks/NG1i-Na7S6yJTPpznFC93A/details you will see a bunch of environment variables and a command line. You can set those variables on your command line, and then run the last bit of that command ("./mach --log-no-times taskgraph action-callback").

Flags: needinfo?(dustin)

Yashvardhan Didwania [:ydidwania]

Comment 21

•

7 years ago

The Gecko Decision Task does not get added to the task graph, hence the statement assert nodes <= self.nodes in https://dxr.mozilla.org/mozilla-central/source/taskcluster/taskgraph/graph.py fails. This failure happens even with the current version of retrigger.py that is in use( I tried it locally), but from the call in line 59 (https://dxr.mozilla.org/mozilla-central/source/taskcluster/taskgraph/actions/retrigger.py#59) . So my question is why isn't it getting added to the full_task_graph.json or does the problem exist only locally?

Flags: needinfo?(dustin)

Dustin J. Mitchell [:dustin] (he/him)

Assignee

Comment 22

•

7 years ago

That's because the decision task isn't in the task graph -- it *creates* the task graph. So `create_tasks` is the wrong tool for retriggering the decision task. It looks like you want `create_task_from_def`: https://dxr.mozilla.org/mozilla-central/source/taskcluster/taskgraph/actions/util.py#55 you will need to replace the `created`, `deadline`, and `expires` attributes in the task definition with new timestamps using `{"relative-datestamp": ".."}`.

Flags: needinfo?(dustin)

Yashvardhan Didwania [:ydidwania]

Comment 23

•

7 years ago

How do I obtain the task definition of the decision task since it is not present in the task graph? Will 'get_decision_parameters' in : https://dxr.mozilla.org/mozilla-central/source/taskcluster/taskgraph/decision.py#131 be useful here?

Flags: needinfo?(dustin)

Dustin J. Mitchell [:dustin] (he/him)

Assignee

Comment 24

•

7 years ago

You can call https://docs.taskcluster.net/reference/platform/taskcluster-queue/references/api#task from the action task to get the task definition, passing it the decision task's taskid.

Flags: needinfo?(dustin)

Yashvardhan Didwania [:ydidwania]

Comment 25

•

7 years ago

This is where I am at currently when task_id == decision_task_id new_task_definition = requests.get('https://queue.taskcluster.net/v1/task/%s'%(task_id)).json() new_task_definition['created'] = dict({"relative-datestamp": "0 seconds"}) new_task_definition['deadline'] = dict({"relative-datestamp": "4 days"}) new_task_definition['expired'] = dict({"relative-datestamp": "14 days"}) new_task_id=slugid() create_task_from_def(new_task_id, new_task_definition, parameters['level'])t logger.info('Scheduled {}{}(time {}/{})'.format(label, with_downstream, i+1, times)) I am unable to test it because I don't have the rights( thats what scopes mean, right?) to create a task using the api. A decision task shouldn't be retriggered multiple times hence I did not. Did I miss anything?

Flags: needinfo?(dustin)

Dustin J. Mitchell [:dustin] (he/him)

Assignee

Comment 26

•

7 years ago

I think that looks good. Do you want to push the patch and we can try it out in try?

Flags: needinfo?(dustin)

Comment hidden (mozreview-request)

Yashvardhan Didwania [:ydidwania]

Comment 28

•

7 years ago

Pushed. Should I add comments in the code?

Flags: needinfo?(dustin)

Dustin J. Mitchell [:dustin] (he/him)

Assignee

Comment 29

•

7 years ago

mozreview-review

Comment on attachment 8923583 [details] Bug 1398277-Retrigger decision task from task definition; https://reviewboard.mozilla.org/r/194714/#review200702 Sorry for the delay -- if you can fix this up I can push the buttons in try and see how it works. ::: taskcluster/taskgraph/actions/retrigger.py:80 (Diff revision 1) > > - logger.info('Scheduled {}{}(time {}/{})'.format(label, with_downstream, i+1, times)) > + logger.info('Scheduled {}{}(time {}/{})'.format(label, with_downstream, i+1, times)) > + else: > + print ("hello") > + task_id = 'QAs5KRluQqe6PKtx_DpQLQ' > + print (task_id) Can you remove this? Otherwise it won't work when I push the buttons on the try push :)

Attachment #8923583 - Flags: review?(dustin)

Comment hidden (mozreview-request)

Yashvardhan Didwania [:ydidwania]

Updated

•

7 years ago

Attachment #8923583 - Attachment is obsolete: true

Yashvardhan Didwania [:ydidwania]

Comment 31

•

7 years ago

I made a commit with the lines removed and pushed only that extra commit for review. I thought it will get mapped to the existing patch (now totalling 3 commits) but instead that patch became obsolete. I have a feeling I messed up. If required I will push the entire revset again for review :)

Dustin J. Mitchell [:dustin] (he/him)

Assignee

Comment 32

•

7 years ago

That looks like a mozreview bug to me. I've triggered the try job, and will try re-triggering the decision task on it.

Dustin J. Mitchell [:dustin] (he/him)

Assignee

Comment 33

•

7 years ago

https://treeherder.mozilla.org/#/jobs?repo=try&revision=55e59753469fb2a1466335bd2444bc81936764bb

Flags: needinfo?(dustin)

Dustin J. Mitchell [:dustin] (he/him)

Assignee

Comment 34

•

7 years ago

[task 2017-11-03T16:08:48.528026Z] Creating task with taskId L7ATmV-WT--bbaDkqvwZdg for Gecko Decision Task [task 2017-11-03T16:08:48.529131Z] Starting new HTTP connection (1): taskcluster [task 2017-11-03T16:08:49.377623Z] "PUT /queue/v1/task/L7ATmV-WT--bbaDkqvwZdg HTTP/1.1" 400 3848 [task 2017-11-03T16:08:49.378432Z] [task 2017-11-03T16:08:49.378470Z] Schema Validation Failed! [task 2017-11-03T16:08:49.378504Z] Rejecting Schema: http://schemas.taskcluster.net/queue/v1/create-task-request.json# [task 2017-11-03T16:08:49.378525Z] Errors: [task 2017-11-03T16:08:49.378562Z] * data should NOT have additional properties [task 2017-11-03T16:08:49.378574Z] ---- [task 2017-11-03T16:08:49.378586Z] method: createTask [task 2017-11-03T16:08:49.378599Z] errorCode: InputValidationError [task 2017-11-03T16:08:49.378609Z] statusCode: 400 [task 2017-11-03T16:08:49.378622Z] time: 2017-11-03T16:08:49.191Z That error means that the payload to createTask had an extra top-level attribute that it should not. It doesn't say *what* that attribute was, unfortunately. But! I think I see it: it should be `expires` and not `expired`.

Comment hidden (mozreview-request)

Yashvardhan Didwania [:ydidwania]

Comment 36

•

7 years ago

Fixed the typo. Sorry for letting that error pass by. Hope it works fine now :)

Dustin J. Mitchell [:dustin] (he/him)

Assignee

Comment 37

•

7 years ago

Hm, that didn't work either: [task 2017-11-05T13:39:55.292866Z] In other words you are missing scopes from one of the options: [task 2017-11-05T13:39:55.292879Z] * Option 0: [task 2017-11-05T13:39:55.292911Z] - "queue:route:notify.email.dmitchell@mozilla.com.*", and [task 2017-11-05T13:39:55.292935Z] - "queue:route:notify.email.dmitchell@mozilla.com.on-failed", and [task 2017-11-05T13:39:55.292953Z] - "queue:route:notify.email.dmitchell@mozilla.com.on-exception" If it's OK with you, I will take this bug and figure out how to finish it up. I suspect we may have started out in the wrong direction here and I need to rethink it. Which is not to diminish your work -- I appreciate it! Hopefully some of the projects we discussed in irc will be more tractable.

Yashvardhan Didwania [:ydidwania]

Comment 38

•

7 years ago

Yeah, its fine with me. Thanks for all your help on this. Learnt a lot. I hope you fix this bug soon:) I'll start on the bug you filed just yesterday.

Dustin J. Mitchell [:dustin] (he/him)

Assignee

Updated

•

7 years ago

Attachment #8916790 - Flags: review?(dustin)

Dustin J. Mitchell [:dustin] (he/him)

Assignee

Updated

•

7 years ago

Assignee: didwaniayashvardhan → dustin

Mentor: dustin

Dustin J. Mitchell [:dustin] (he/him)

Assignee

Updated

•

7 years ago

Keywords: good-first-bug

Dustin J. Mitchell [:dustin] (he/him)

Assignee

Comment 39

•

7 years ago

https://treeherder.mozilla.org/#/jobs?repo=try&revision=c3de1de0f5cbbdcf6cc04a8088181e5c5e8103e8

Dustin J. Mitchell [:dustin] (he/him)

Assignee

Comment 40

•

7 years ago

I'm deeply suspicious that the fix in that try job will not have the scopes to run a decision task. At least right now, and possibly once we start assigning more fine-grained scopes. We'll see..

Dustin J. Mitchell [:dustin] (he/him)

Assignee

Comment 41

•

7 years ago

https://treeherder.mozilla.org/#/jobs?repo=try&revision=13fa4a7c01c52bb316e283942557a54925f2d074

Dustin J. Mitchell [:dustin] (he/him)

Assignee

Comment 42

•

7 years ago

Am I cynical or just accurate? You decide.. [task 2018-07-06T18:16:27.986Z] You do not have sufficient scopes. You are missing the following scopes: [task 2018-07-06T18:16:27.986Z] [task 2018-07-06T18:16:27.986Z] ``` [task 2018-07-06T18:16:27.986Z] { [task 2018-07-06T18:16:27.986Z] "AllOf": [ [task 2018-07-06T18:16:27.986Z] "assume:repo:hg.mozilla.org/try:branch:default", [task 2018-07-06T18:16:27.986Z] "in-tree:hook-action:project-gecko/in-tree-action-1-*" [task 2018-07-06T18:16:27.986Z] ] [task 2018-07-06T18:16:27.986Z] } [task 2018-07-06T18:16:27.986Z] ``` So I think that retriggering a decision task requires an actionPerm, which requires bug 1472792. It also means bug 1470622 can't go ahead yet, as sheriffs need to be able to retrigger decision tasks.

Blocks: 1470622, 1415868

Depends on: 1472792

Dustin J. Mitchell [:dustin] (he/him)

Assignee

Comment 43

•

7 years ago

This still needs to wait until bug 1472792 is deployed in treeherder.

Dustin J. Mitchell [:dustin] (he/him)

Assignee

Comment 44

•

7 years ago

Attached file Bug 1398277: add a special permission for retriggering decision tasks — Details

Dustin J. Mitchell [:dustin] (he/him)

Assignee

Comment 45

•

7 years ago

Attached patch bug1398277.patch — Details — Splinter Review

Attachment #8916790 - Attachment is obsolete: true

Attachment #8994319 - Flags: review?(bstack)

Brian Stack [:bstack]

Comment 46

•

7 years ago

Comment on attachment 8994319 [details] [diff] [review] bug1398277.patch I wish all of these datestamps would have been relative from the get-go. Assuming you ran flake8 on this, r+!

Attachment #8994319 - Flags: review?(bstack) → review+

Dustin J. Mitchell [:dustin] (he/him)

Assignee

Comment 47

•

7 years ago

Haha, you know me so well. Yes, flakes fixed. This can't land until bug 1472792 lands in treeherder -- that's been waiting for two weeks now, and will wait another three until I'm back from PTO.

Brian Stack [:bstack]

Comment 48

•

6 years ago

Comment on attachment 8994315 [details] Bug 1398277: add a special permission for retriggering decision tasks Brian Stack [:bstack] has approved the revision.

Attachment #8994315 - Flags: review+

Dustin J. Mitchell [:dustin] (he/him)

Assignee

Comment 49

•

6 years ago

https://hg.mozilla.org/integration/mozilla-inbound/rev/ab58645e9230620ca45de3ec03ee9e61eb4a7cbf Bug 1398277: special-case retriggering of tasks not in the taskgraph; r=bstack

Cosmin Sabou [:CosminS]

Comment 50

•

6 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/ab58645e9230

Status: NEW → RESOLVED

Closed: 6 years ago

Resolution: --- → FIXED

Target Milestone: --- → mozilla61

Ryan VanderMeulen [:RyanVM]

Updated

•

6 years ago

Target Milestone: mozilla61 → mozilla63

Julien Cristau [:jcristau]

Comment 51

•

6 years ago

bugherder uplift

https://hg.mozilla.org/releases/mozilla-release/rev/de2d8106904f

Julien Cristau [:jcristau]

Comment 52

•

6 years ago

bugherder uplift

https://hg.mozilla.org/releases/mozilla-esr60/rev/1aabfd210ab7

Nobody; OK to take it and work on it

Updated

•

6 years ago

Component: Integration → Services

You need to log in before you can comment on or make changes to this bug.