Status

defect
P2
normal
RESOLVED FIXED
3 years ago
Last year

People

(Reporter: rail, Assigned: rail)

Tracking

Firefox Tracking Flags

(Not tracked)

Details

()

Attachments

(5 attachments)

Assignee

Description

3 years ago
The scheduler API is deprecated after TC added support for task dependencies.

Instead of submitting a graph we should submit individual tasks with dependencies specified in task.dependencies and grouped using taskGroupId.

Tasks should be submitted in topological order (TC verifies that dependencies exist).

In case of failure we should resubmit only failed tasks.

Sorting and mass submission sound like good candidates for TC client.
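The proposed flow could be sketched roughly like this (a hypothetical illustration only — the task names and the `submit_in_order`/`slug_id` helpers are made up, with a plain callable standing in for `queue.createTask`):

```python
from graphlib import TopologicalSorter  # stdlib, Python 3.9+
import uuid

def slug_id():
    """Stand-in for taskcluster.slugId(): just a unique id for this sketch."""
    return uuid.uuid4().hex

def submit_in_order(tasks, create_task):
    """Submit tasks one by one in topological order, so every dependency
    already exists when the Queue validates a newly created task.
    `tasks` maps taskId -> task definition with a `dependencies` list."""
    graph = {tid: t.get("dependencies", []) for tid, t in tasks.items()}
    for tid in TopologicalSorter(graph).static_order():
        create_task(tid, tasks[tid])

# Example: build -> sign -> publish, grouped by a shared taskGroupId.
group_id = slug_id()
build, sign, publish = slug_id(), slug_id(), slug_id()
tasks = {
    publish: {"taskGroupId": group_id, "dependencies": [sign]},
    sign: {"taskGroupId": group_id, "dependencies": [build]},
    build: {"taskGroupId": group_id, "dependencies": []},
}
submitted = []
submit_in_order(tasks, lambda tid, tdef: submitted.append(tid))
assert submitted.index(build) < submitted.index(sign) < submitted.index(publish)
```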
See Also: → 1288573
Assignee

Updated

3 years ago
Depends on: 1316340
Assignee

Updated

2 years ago
Priority: -- → P2
See Also: → 1375802
It would be great if we could get this done, as this might be the last thing that requires us to keep the scheduler alive (but not well).

For anyone looking at this bug that fancies taking a stab, it looks like this is where the scheduler is called:
https://hg.mozilla.org/build/tools/file/9193e0759659/buildfarm/release/release-runner.py#l517
Mentor: pmoore
Whiteboard: [good first bug]
Erm, I've just read the dependent bugs and realise this might not be entirely trivial after all! :D
Mentor: pmoore
Whiteboard: [good first bug]
My understanding is that there are two ways to achieve this:
 1. move releasetasks in-tree (a long project)
 2. rewrite releasetasks to use the queue API instead of scheduler

The latter is a little complicated since the releasetasks graph has tasks with more dependencies than the queue API allows, so insertion of some dummy tasks will be required.  There's disagreement as to how difficult that might be :)
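A hypothetical sketch of the dummy-task workaround: when a task has more dependencies than the Queue API accepts (100 per task at the time), intermediate no-op tasks each absorb a slice of the dependency list, and the real task depends on the dummies instead. `split_dependencies` is illustrative, not the actual releasetasks implementation:

```python
def split_dependencies(deps, new_task_id, max_deps=100):
    """Return (new_deps, dummy_tasks). If `deps` exceeds `max_deps`,
    create dummy tasks that each depend on a slice of `deps`. A single
    level of dummies handles up to max_deps**2 dependencies."""
    if len(deps) <= max_deps:
        return list(deps), {}
    dummies = {}
    for i in range(0, len(deps), max_deps):
        did = new_task_id()
        dummies[did] = {
            "dependencies": deps[i:i + max_deps],
            "metadata": {"name": "dummy dependency collector"},
        }
    return list(dummies), dummies

# 250 dependencies collapse into 3 dummy tasks of <=100 deps each.
ids = [f"task-{n}" for n in range(250)]
counter = iter(range(1000))
new_deps, dummies = split_dependencies(ids, lambda: f"dummy-{next(counter)}")
assert len(new_deps) == 3 and len(dummies) == 3
assert sum(len(d["dependencies"]) for d in dummies.values()) == 250
```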
I hear Rail has an implementation of a solution.
Flags: needinfo?(rail)
(In reply to Dustin J. Mitchell [:dustin] from comment #4)
> My understanding is that there are two ways to achieve this:
>  1. move releasetasks in-tree (a long project)
>  2. rewrite releasetasks to use the queue API instead of scheduler
> 
> The latter is a little complicated since the releasetasks graph has tasks
> with more dependencies than the queue API allows, so insertion of some dummy
> tasks will be required.  There's disagreement as to how difficult that might
> be :)


Rail managed to implement a nice solution for 2) last week during the All-hands, one that uses the Queue API by reorganizing the current releasetasks graph into a large flattened list. There's a topological sort involved, as well as some other processing. Last I knew, on Friday he managed to submit a smaller subgraph to TC via our staging environment. I remember he mentioned some other complications he needed to tackle before declaring victory over this.

With the current tcmigration project focus going on, I can look at this and finalize it if Rail wants to hand off.
Assignee

Comment 7

2 years ago
I'll put together the 2 different solutions I have. Leaving the NI open.
Assignee

Comment 8

2 years ago
I ended up with 2 different approaches:

1) extend the existing tools to convert the generated graph into a list of tasks

  * releasetasks patch: https://github.com/mozilla-releng/releasetasks/compare/master...rail:queue_api?expand=1
  ** convert the existing graph into a list of tasks
  ** add dummy tasks in between tasks with more than 100 dependent tasks
  ** add a task (atomic task) to prevent ASAP scheduling in case the whole submission fails for some reason (mistakes, scopes, network, etc.)
  ** sort the tasks topologically

  * releaserunner patch: https://gist.github.com/rail/4074a19313ca3970f9a7c9123b5b74ee
  ** resolve the atomic task to start the execution
  ** graph2 generation should be tested

I tested 1) on jamun and it worked fine. It takes longer to submit tasks (no async!), but we don't have the max 3 partials limit anymore!
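The "atomic task" gating described in approach 1 might look something like this sketch (hypothetical — `RecordingQueue` is a fake stand-in for the TC Queue client, and `resolve` stands in for the claim-then-reportCompleted dance the releaserunner patch does):

```python
class RecordingQueue:
    """Fake Queue client that records calls, for illustration only."""
    def __init__(self):
        self.calls = []
    def createTask(self, task_id, task_def):
        self.calls.append(("create", task_id))
    def resolve(self, task_id):
        self.calls.append(("resolve", task_id))

def submit_with_gate(tasks, queue, gate_id):
    """Every real task gains a dependency on a gate task, so the Queue
    schedules nothing until the gate resolves -- even if the submission
    dies halfway through (mistakes, scopes, network, etc.)."""
    for tdef in tasks.values():
        tdef.setdefault("dependencies", []).append(gate_id)
    queue.createTask(gate_id, {"metadata": {"name": "atomic gate"}})
    for tid, tdef in tasks.items():
        queue.createTask(tid, tdef)
    queue.resolve(gate_id)  # submission succeeded: unblock the graph

queue = RecordingQueue()
submit_with_gate({"t1": {}, "t2": {"dependencies": ["t1"]}}, queue, "gate")
assert queue.calls[0] == ("create", "gate")
assert queue.calls[-1] == ("resolve", "gate")
```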


2) a py35 CLI tool that does the same using a saved copy of the releasetasks-generated graph
  * async submission (to be parallelized)
  * release runner should save the generated graph to a file and pass it to this tool
  * to be puppetized

I have not tested this version at all.

I will be busy with some other stuff until the end of August, but feel free to adapt the patches.
Flags: needinfo?(rail)
Assignee

Comment 10

2 years ago
Oh, forgot to point to the second tool: https://github.com/rail/graph2tasks
See Also: → 1385294
Note to self (context) for later:
* we're to flatten the existing graph and submit it in chunks via the TC Queue
* rail already implemented a good part of this; we still need more (unit) testing for that piece, and to deal with the submission
* currently, the only caveat is that it takes a long time to submit those tasks (about 20-30 min for a release graph), mostly because it's all done sequentially. We need to bring async muscle to this in a smart way
* once we flatten the graph, topologically sort it and all, we'll have a chunk of tasks with 0 dependencies, another chunk with 1 dep, etc. The chunks must be submitted sequentially, but the tasks within a chunk can be submitted asynchronously.
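The chunked-submission idea in those notes could be sketched like this (hypothetical — `submit_in_waves` and `create_task` are made-up names, the latter standing in for `queue.createTask`): waves go out sequentially, while tasks within a wave are submitted in parallel via a thread pool.

```python
from concurrent.futures import ThreadPoolExecutor
from graphlib import TopologicalSorter  # stdlib, Python 3.9+

def submit_in_waves(tasks, create_task, concurrency=8):
    """Wave 0 = tasks with no dependencies, wave 1 = tasks depending only
    on wave 0, and so on. Waves are sequential; within a wave, the
    submissions run concurrently."""
    ts = TopologicalSorter(
        {tid: t.get("dependencies", []) for tid, t in tasks.items()})
    ts.prepare()
    waves = []
    with ThreadPoolExecutor(concurrency) as pool:
        while ts.is_active():
            wave = list(ts.get_ready())
            # Block until the whole wave is submitted before moving on.
            list(pool.map(lambda tid: create_task(tid, tasks[tid]), wave))
            waves.append(wave)
            for tid in wave:
                ts.done(tid)
    return waves

tasks = {"a": {}, "b": {"dependencies": ["a"]}, "c": {"dependencies": ["a"]},
         "d": {"dependencies": ["b", "c"]}}
waves = submit_in_waves(tasks, lambda tid, tdef: None)
assert [sorted(w) for w in waves] == [["a"], ["b", "c"], ["d"]]
```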
Hoping to get my hands on this in the next two weeks, now that I'm off releaseduty cycle.
Assignee: nobody → mtabara
Assignee

Comment 13

2 years ago
Hijacking this because I'm back!
Assignee: mtabara → rail
Comment hidden (mozreview-request)
Comment hidden (mozreview-request)
Assignee

Comment 16

2 years ago
Posted file PR for releasetask
Attachment #8900706 - Flags: review?(mtabara)
Comment on attachment 8900706 [details] [review]
PR for releasetask

Left my comments in the PR [1]

[1]: https://github.com/mozilla-releng/releasetasks/pull/267#pullrequestreview-58577330
Attachment #8900706 - Flags: review?(mtabara) → review+

Comment 18

2 years ago
mozreview-review
Comment on attachment 8900704 [details]
Bug 1259627 - [tools] Use TC Queue API

https://reviewboard.mozilla.org/r/172146/#review177782

Looking good to me - just a few nits and a bunch of questions. 
Also, the most concerning thing for me - shouldn't we attempt a staging release for Fennec as well? See my comment below.

::: buildfarm/release/release-runner.py:414
(Diff revision 1)
>  
>          # XXX: Doesn't work with neither Fennec nor Thunderbird
>          platforms = branchConfig['release_platforms']
>  
>          try:
> -            graph_id = slugId()
> +            task_group_id = None

Can't we remove this line since the task group id comes from releasetasks? I doubt there's any warning downstream.

::: buildfarm/release/release-runner.py:523
(Diff revision 1)
>              import pprint
> -            log.debug(pprint.pformat(graph, indent=4, width=160))
> -            print(scheduler.createTaskGraph(graph_id, graph))
> +            for task_id, task_def in tasks.items():
> +                log.debug("%s ->\n%s", task_id,
> +                          pprint.pformat(task_def, indent=4, width=160))
> +            submit_parallelized(queue, tasks)
> +            resolve_task(queue, toplevel_task_id)

Beautiful!

::: buildfarm/release/releasetasks_graph_gen.py:134
(Diff revision 1)
> +        log.debug("%s ->\n%s", task_id,
> +                  pprint.pformat(task_def, indent=4, width=160))
> +
>      if not options.dry_run:
> -        print scheduler.createTaskGraph(graph_id, graph)
> +        submit_parallelized(queue, tasks)
> +        resolve_task(queue, toplevel_task_id)

Wonderful!

::: buildfarm/release/releasetasks_graph_gen.py:218
(Diff revision 1)
>      tc_config = {
>          "credentials": {
>              "clientId": get_config(release_runner_config, "taskcluster", "client_id", None),
>              "accessToken": get_config(release_runner_config, "taskcluster", "access_token", None),
> +        },
> +        "maxRetries": 12,

Question: Since this is global, which task falls into this?

::: lib/python/kickoff/__init__.py:370
(Diff revision 1)
>          kwargs["extra_balrog_submitter_params"] = extra_balrog_submitter_params
>  
> -    # don't import releasetasks until required within function impl to avoid global failures
> -    # during nosetests
> -    from releasetasks import make_task_graph
> -    return make_task_graph(**kwargs)
> +    # don't import releasetasks until required within function impl to avoid
> +    # global failures during nosetests
> +    from releasetasks import make_tasks
> +    return make_tasks(**kwargs)

Really nice approach! Easy to back out if we need to.

My only concern is that I'd like to see a Fennec staging release as well on jamun. For Fennec we have a decision task within the graph, and I'd feel safer if we tested that case before rolling this out to production.

::: lib/python/kickoff/tc.py:29
(Diff revision 1)
> +    def submit_task(t_id, t_def):
> +        log.info("Submitting %s", t_id)
> +        queue.createTask(t_id, t_def)
> +
> +    with futures.ThreadPoolExecutor(CONCURRENCY) as e:
> +        fs = {}

Nit: maybe use a more intuitive variable name here instead of `fs`, e.g. `submitted_future` or similar.
Attachment #8900704 - Flags: review?(mtabara) → review+

Comment 19

2 years ago
mozreview-review
Comment on attachment 8900705 [details]
Bug 1259627 - Update releaserunner dependencies

https://reviewboard.mozilla.org/r/172150/#review177834

::: modules/releaserunner/manifests/init.pp:60
(Diff revision 1)
>                  'simplejson==2.6.2',
>                  'singledispatch==3.4.0.3',
>                  'six==1.9.0',
>                  'sqlalchemy-migrate==0.7.2',
>                  'taskcluster==0.0.24',
> +                'toposort==1.0',

Any reason we're not using the latest 1.5 version of it ? https://pypi.python.org/pypi/toposort
Attachment #8900705 - Flags: review?(mtabara) → review+
Comment hidden (mozreview-request)

Comment 24

2 years ago
mozreview-review
Comment on attachment 8901870 [details]
Bug 1259627 - Keep printing out the task group ID

https://reviewboard.mozilla.org/r/173302/#review178662

Good catch, lgtm!
Attachment #8901870 - Flags: review+
I was chatting with :jlorenzo earlier tonight as he was doing a Fennec staging release to test things out, and we just realized we may actually be able to get rid of the "graph" scopes from releasetasks.

AFAIK, *before*, we'd define scopes in each task + at the graph level, as needed for the Scheduler API. But now the graph is just a formal step before flattening it into a list of tasks and sending them individually via the Queue API, so I think we no longer need them.

:jlorenzo was having trouble with some missing scopes, tweaked only the task-level scope, not the graph one, and it still worked. I think this proves the theory above.

To conclude, I think we can actually remove all scopes from [1].
Might be worth testing this with a staging release tomorrow morning.

Leaving myself a NI on this one:
[1]: https://github.com/mozilla-releng/releasetasks/blob/master/releasetasks/templates/mobile/release_graph.yml.tmpl#L21
Flags: needinfo?(mtabara)
See Also: → 1394511
Assignee

Comment 27

2 years ago
yup, we can kill those.
See Also: → 1394644
Filed bug 1394852 to track removal of those graph-level scopes mentioned in comment 26.
Flags: needinfo?(mtabara)
Assignee

Comment 29

2 years ago
Attachment #8902642 - Flags: review?
Attachment #8902642 - Flags: review? → review+
Assignee

Comment 30

2 years ago
Comment on attachment 8902642 [details] [review]
Funsize scheduler PR

to be deployed soon!
Attachment #8902642 - Flags: checked-in+
We hit some artifact expires errors like

Aug 30 13:05:48 signingworker-6.srv.releng.use1.mozilla.com python: signingworker taskcluster.exceptions.TaskclusterRestFailure: Artifact expires (2017-09-13T20:05:47.672Z) after the task expiration 2017-09-06T12:48:31.067Z (task.expires < expires) - this is not allowed, artifacts may not expire after the task they belong to expires

We *think* https://github.com/mozilla-releng/funsize/commit/96d4d90350a8a1ee378f1b0dddfcd45acd4bff53 should fix that symptom... we'll see during tonight's nightlies, triggering in <1hr. But while digging, I found another possible fix:

At https://github.com/mozilla-releng/signingworker/blob/master/signingworker/worker.py#L176 we set the artifact expires to 2weeks from now, but "now" is well after the graph was created.

At https://github.com/mozilla-releng/signingworker/blob/master/signingworker/worker.py#L71 we get the task definition. `task.get('expires')` should be the task's expiration datestring, if any. We could fall back to an arrow, e.g. `task.get('expires', arrow.now().replace(weeks=2).isoformat())` .

`sign()` has access to the task definition, so we could create `expires` at that point. It calls `download_and_sign_file()`, which calls `create_artifact()`, and it also calls `create_artifact()` directly. If we pass `expires` down as args or as a self.variable, we can then make sure we never try to create an artifact with an `expires` longer than `task.expires`.
Flags: needinfo?(rail)
Assignee

Comment 33

2 years ago
Yeah, I noticed that issue yesterday and https://github.com/mozilla-releng/funsize/pull/82 fixed it. I shouldn't have added that field in the first place. :)
Flags: needinfo?(rail)
Assignee

Comment 34

2 years ago
Everything looks good so far, no backouts!
Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
I've removed:

  "Scheduler": "http://references.taskcluster.net/scheduler/v1/api.json",
  "SchedulerEvents": "http://references.taskcluster.net/scheduler/v1/exchanges.json",

from http://references.taskcluster.net/manifest.json so that the scheduler APIs don't get generated in our clients.
The following scopes are still dotted around our clients/roles:

scheduler:*
scheduler:create-task-graph
scheduler:extend-task-graph
scheduler:extend-task-graph:*
scheduler:route:gaia-taskcluster
scheduler:route:taskcluster-github.*

I intend to remove these - shout out if you see any issue with me doing this.
Sorry, I see now this bug was about not continuing to use the scheduler APIs, rather than about sunsetting the scheduler.

Therefore I've moved the sunsetting parts into bug 1399437.

Sorry for the noise! :)
See Also: → 1399437