Retriggering a decision task can cause duplicate jobs to be scheduled

RESOLVED FIXED in mozilla63

Status

defect
P5
normal
RESOLVED FIXED
2 years ago
4 months ago

People

(Reporter: garndt, Assigned: dustin)

Tracking

unspecified
mozilla63
Dependency tree / graph

Details

Attachments

(2 attachments, 2 obsolete attachments)

Reporter

Description

2 years ago
Currently when a task is retriggered, mozilla-taskcluster will respond to the action and duplicate that node along with all descendants.  

Tasks scheduled by a gecko decision task are made to depend on the decision task.

In the situation of retriggering, this means that the decision task along with all the dependent tasks are recreated.  Once the retriggered decision task runs, it will then schedule another duplicate set of jobs.
I think the fix to this is to make sure that the "regular" retrigger action doesn't apply to the decision task, and define another retrigger action specifically for the decision task that just duplicates it, without duplicating child tasks.

Or I guess since that's all implemented in Python, it could just detect action/decision tasks and not duplicate their children.
Mentor: dustin
Keywords: good-first-bug
Reporter

Updated

2 years ago
Priority: -- → P5
Hi,  I would like to work on this bug. Can you guide me to the relevant code?
Flags: needinfo?(dustin)
The retrigger action is implemented here:
  https://dxr.mozilla.org/mozilla-central/source/taskcluster/taskgraph/actions/retrigger.py

I think the best solution would be to modify that code as suggested in the second paragraph in comment 1.
Flags: needinfo?(dustin)

Comment 4

2 years ago
Hey,
This is my first attempt at contributing to an open source project.
It would be highly appreciable if you could guide me as to how to get started.
Thanks.
Hi! I suspect ydidwania has started working on this.  Let's see if they are still working before you jump in.
Flags: needinfo?(didwaniayashvardhan)

Comment 6

2 years ago
I think ydidwania is working on some other issue(1349261).
Please confirm
Thanks.
(In reply to yuktikirtani from comment #6)
> I think ydidwania is working on some other issue(1349261).
> Please confirm
> Thanks.

Hi yuktikirtani. Thanks for your interest in fixing this bug. Dustin is right, I am still working on this. You are however,free to try out other bugs you find interesting.
Flags: needinfo?(didwaniayashvardhan)

Comment 8

2 years ago
(In reply to ydidwania from comment #7)
> (In reply to yuktikirtani from comment #6)
> > I think ydidwania is working on some other issue(1349261).
> > Please confirm
> > Thanks.
> 
> Hi yuktikirtani. Thanks for your interest in fixing this bug. Dustin is
> right, I am still working on this. You are however,free to try out other
> bugs you find interesting.

Hey, Okay no problem
I think modifying the code as below should work.

if task_id != decision_task_id:
    if input.get('downstream'):
        to_run = full_task_graph.graph.transitive_closure(set(to_run), reverse=True).nodes
        to_run = to_run & set(label_to_taskid.keys())
        with_downstream = ' (with downstream) ' 

Basically adding downstream tasks to `to_run` only when the task is not a decision task.
Are action tasks different from decision tasks? I didn't find any mention to action task while going through any of the docs.

Can you tell me why `to_run` is a set of labels rather than task ids.There can be many tasks with the same label.
Flags: needinfo?(dustin)
That sounds like a good plan!

Action tasks are sort of like decision tasks, in that they call queue.createTask -- but they serve to modify an already-existing push, rather than start a new one.

It is a set of labels because that represents what we want to retrigger.  If you think of a taskId as a specific instance of a label, it's "do this <label> again" rather than repeating the specific instance (which has already failed and can't be further modified).
Flags: needinfo?(dustin)
Are action tasks covered in the condition
 
task_id != decision_task_id
Flags: needinfo?(dustin)
I'm not worried about that right now -- let's just worry about decision tasks :)
Flags: needinfo?(dustin)
Comment hidden (mozreview-request)
Assignee: nobody → didwaniayashvardhan
Assignee

Comment 14

2 years ago
mozreview-review
Comment on attachment 8916790 [details]
Bug 1398277 - Fixed typo

https://reviewboard.mozilla.org/r/187846/#review193172

Looks good!  I triggered a try push, and will try re-triggering that push's decision task.
Attachment #8916790 - Flags: review?(dustin) → review+
I found this in taskcluster/taskgraph/graph.py

def transitive_closure(self, nodes, reverse=False):
        """
        Return the transitive closure of <nodes>: the graph containing all
        specified nodes as well as any nodes reachable from them, and any
        intervening edges.

        If `reverse` is true, the "reachability" will be reversed and this
        will return the set of nodes that can reach the specified nodes.

        Example
        -------

        a ------> b ------> c
                  |
                  `-------> d

        transitive_closure([b]).nodes == set([a, b])
        transitive_closure([c]).nodes == set([c, b, a])
        transitive_closure([c], reverse=True).nodes == set([c])
        transitive_closure([b], reverse=True).nodes == set([b, c, d])
        """

1) I went through the implementation. I think  it is reversed according to the explanation provided. The examples are correct.

There's this line in retrigger.py:

            to_run = full_task_graph.graph.transitive_closure(set(to_run), reverse=True).nodes

which is expected to fetch the set of all downstream tasks from a given task.

2) Regarding the error, I think we need to add the keys as is done when we were duplicating the child tasks as well. Something like
    
            to_run = to_run & set(label_to_taskid.keys())
Flags: needinfo?(dustin)
The error is coming from a *different* call to transitive_closure, from what I can tell.
Flags: needinfo?(dustin)
Yeah its because of this call - 

[task 2017-10-10T15:11:49.623251Z] File "/builds/worker/checkouts/gecko/taskcluster/taskgraph/actions/util.py", line 44, in create_tasks
[task 2017-10-10T15:11:49.623313Z] target_graph = full_task_graph.graph.transitive_closure(to_run)

I got confused when I saw the examples in transitive_closure. I think they contradict the explanation of transitive closure of a graph written above.
There is some error in `to_run`,  the set of labels that we are passing to `create_task() .

Is there a way to test the retrigger action locally?
Flags: needinfo?(dustin)
Yes -- if you look at the task details in
  https://tools.taskcluster.net/groups/TuAbomxORdW52JbuFvKRlQ/tasks/NG1i-Na7S6yJTPpznFC93A/details
you will see a bunch of environment variables and a command line.  You can set those variables on your command line, and then run the last bit of that command ("./mach --log-no-times taskgraph action-callback").
Flags: needinfo?(dustin)
The Gecko Decision Task does not get added to the task graph, hence the statement
   assert nodes <= self.nodes 
in https://dxr.mozilla.org/mozilla-central/source/taskcluster/taskgraph/graph.py fails. 

This failure happens even with the current version of retrigger.py that is in use( I tried it locally), but from the call in line 59 (https://dxr.mozilla.org/mozilla-central/source/taskcluster/taskgraph/actions/retrigger.py#59) . 

So my question is why isn't it getting added to the full_task_graph.json or does the problem exist only locally?
Flags: needinfo?(dustin)
That's because the decision task isn't in the task graph -- it *creates* the task graph.  So `create_tasks` is the wrong tool for retriggering the decision task.  It looks like you want `create_task_from_def`:
  https://dxr.mozilla.org/mozilla-central/source/taskcluster/taskgraph/actions/util.py#55
you will need to replace the `created`, `deadline`, and `expires` attributes in the task definition with new timestamps using `{"relative-datestamp": ".."}`.
Flags: needinfo?(dustin)
How do I obtain the task definition of the decision task since it is not present in the task graph?
Will 'get_decision_parameters' in :
 https://dxr.mozilla.org/mozilla-central/source/taskcluster/taskgraph/decision.py#131
 
be useful here?
Flags: needinfo?(dustin)
You can call https://docs.taskcluster.net/reference/platform/taskcluster-queue/references/api#task from the action task to get the task definition, passing it the decision task's taskid.
Flags: needinfo?(dustin)
This is where I am at currently when task_id == decision_task_id

        new_task_definition = requests.get('https://queue.taskcluster.net/v1/task/%s'%(task_id)).json()
        new_task_definition['created'] = dict({"relative-datestamp": "0 seconds"})
        new_task_definition['deadline'] = dict({"relative-datestamp": "4 days"})
        new_task_definition['expired'] = dict({"relative-datestamp": "14 days"})
        new_task_id=slugid()
        create_task_from_def(new_task_id, new_task_definition, parameters['level'])t
        logger.info('Scheduled {}{}(time {}/{})'.format(label, with_downstream, i+1, times))

I am unable to test it because I  don't have the rights( thats what scopes mean, right?) to create a task using the api.
A decision task shouldn't be retriggered  multiple times hence I did not.
Did I miss anything?
Flags: needinfo?(dustin)
I think that looks good.  Do you want to push the patch and we can try it out in try?
Flags: needinfo?(dustin)
Comment hidden (mozreview-request)
Pushed. Should I add comments in the code?
Flags: needinfo?(dustin)
Assignee

Comment 29

2 years ago
mozreview-review
Comment on attachment 8923583 [details]
Bug 1398277-Retrigger decision task from task definition;

https://reviewboard.mozilla.org/r/194714/#review200702

Sorry for the delay -- if you can fix this up I can push the buttons in try and see how it works.

::: taskcluster/taskgraph/actions/retrigger.py:80
(Diff revision 1)
>  
> -        logger.info('Scheduled {}{}(time {}/{})'.format(label, with_downstream, i+1, times))
> +            logger.info('Scheduled {}{}(time {}/{})'.format(label, with_downstream, i+1, times))
> +    else:
> +        print ("hello")
> +        task_id = 'QAs5KRluQqe6PKtx_DpQLQ'
> +        print (task_id)

Can you remove this?  Otherwise it won't work when I push the buttons on the try push :)
Attachment #8923583 - Flags: review?(dustin)
Comment hidden (mozreview-request)
Attachment #8923583 - Attachment is obsolete: true
I made a commit with the lines removed and pushed only that extra commit for review. I thought it will get mapped to the existing patch (now totalling 3 commits) but instead that patch became obsolete. I have a feeling I messed up. If required I will push the entire revset again for review :)
That looks like a mozreview bug to me.  I've triggered the try job, and will try re-triggering the decision task on it.
[task 2017-11-03T16:08:48.528026Z] Creating task with taskId L7ATmV-WT--bbaDkqvwZdg for Gecko Decision Task
[task 2017-11-03T16:08:48.529131Z] Starting new HTTP connection (1): taskcluster
[task 2017-11-03T16:08:49.377623Z] "PUT /queue/v1/task/L7ATmV-WT--bbaDkqvwZdg HTTP/1.1" 400 3848
[task 2017-11-03T16:08:49.378432Z] 
[task 2017-11-03T16:08:49.378470Z] Schema Validation Failed!
[task 2017-11-03T16:08:49.378504Z] Rejecting Schema: http://schemas.taskcluster.net/queue/v1/create-task-request.json#
[task 2017-11-03T16:08:49.378525Z] Errors:
[task 2017-11-03T16:08:49.378562Z]   * data should NOT have additional properties
[task 2017-11-03T16:08:49.378574Z] ----
[task 2017-11-03T16:08:49.378586Z] method:     createTask
[task 2017-11-03T16:08:49.378599Z] errorCode:  InputValidationError
[task 2017-11-03T16:08:49.378609Z] statusCode: 400
[task 2017-11-03T16:08:49.378622Z] time:       2017-11-03T16:08:49.191Z

That error means that the payload to createTask had an extra top-level attribute that it should not.  It doesn't say *what* that attribute was, unfortunately.  But!  I think I see it: it should be `expires` and not `expired`.
Comment hidden (mozreview-request)
Fixed the typo. Sorry for letting that error pass by. Hope it works fine now  :)
Hm, that didn't work either:

[task 2017-11-05T13:39:55.292866Z] In other words you are missing scopes from one of the options:
[task 2017-11-05T13:39:55.292879Z]  * Option 0:
[task 2017-11-05T13:39:55.292911Z]     - "queue:route:notify.email.dmitchell@mozilla.com.*", and
[task 2017-11-05T13:39:55.292935Z]     - "queue:route:notify.email.dmitchell@mozilla.com.on-failed", and
[task 2017-11-05T13:39:55.292953Z]     - "queue:route:notify.email.dmitchell@mozilla.com.on-exception"

If it's OK with you, I will take this bug and figure out how to finish it up.  I suspect we may have started out in the wrong direction here and I need to rethink it.  Which is not to diminish your work -- I appreciate it!  Hopefully some of the projects we discussed in irc will be more tractable.
Yeah, its fine with me. Thanks for all your help on this. Learnt a lot. 
I hope you fix this bug soon:)
I'll start on the bug you filed just yesterday.
Attachment #8916790 - Flags: review?(dustin)
Assignee: didwaniayashvardhan → dustin
Mentor: dustin
I'm deeply suspicious that the fix in that try job will not have the scopes to run a decision task.  At least right now, and possibly once we start assigning more fine-grained scopes.  We'll see..
Am I cynical or just accurate?  You decide..

[task 2018-07-06T18:16:27.986Z] You do not have sufficient scopes. You are missing the following scopes:
[task 2018-07-06T18:16:27.986Z] 
[task 2018-07-06T18:16:27.986Z] ```
[task 2018-07-06T18:16:27.986Z] {
[task 2018-07-06T18:16:27.986Z]   "AllOf": [
[task 2018-07-06T18:16:27.986Z]     "assume:repo:hg.mozilla.org/try:branch:default",
[task 2018-07-06T18:16:27.986Z]     "in-tree:hook-action:project-gecko/in-tree-action-1-*"
[task 2018-07-06T18:16:27.986Z]   ]
[task 2018-07-06T18:16:27.986Z] }
[task 2018-07-06T18:16:27.986Z] ```

So I think that retriggering a decision task requires an actionPerm, which requires bug 1472792.  It also means bug 
1470622 can't go ahead yet, as sheriffs need to be able to retrigger decision tasks.
Blocks: 1470622, 1415868
Depends on: 1472792
This still needs to wait until bug 1472792 is deployed in treeherder.
Attachment #8916790 - Attachment is obsolete: true
Attachment #8994319 - Flags: review?(bstack)
Comment on attachment 8994319 [details] [diff] [review]
bug1398277.patch

I wish all of these datestamps would have been  relative from the get-go. Assuming you ran flake8 on this, r+!
Attachment #8994319 - Flags: review?(bstack) → review+
Haha, you know me so well.  Yes, flakes fixed.  This can't land until bug 1472792 lands in treeherder -- that's been waiting for two weeks now, and will wait another three until I'm back from PTO.
Comment on attachment 8994315 [details]
Bug 1398277: add a special permission for retriggering decision tasks

Brian Stack [:bstack] has approved the revision.
Attachment #8994315 - Flags: review+

Comment 50

10 months ago
bugherder
https://hg.mozilla.org/mozilla-central/rev/ab58645e9230
Status: NEW → RESOLVED
Closed: 10 months ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla61
Target Milestone: mozilla61 → mozilla63
Component: Integration → Services
You need to log in before you can comment on or make changes to this bug.