Bug 1259627 - Stop using TC scheduler API (Closed)
Opened 8 years ago, closed 6 years ago
Categories: Release Engineering :: Release Automation: Other, defect, P2
Tracking: Not tracked
Status: RESOLVED FIXED
People: Reporter: rail; Assigned: rail
Attachments (5 files)
1. 59 bytes, text/x-review-board-request (mtabara: review+, rail: checked-in+)
2. 59 bytes, text/x-review-board-request (mtabara: review+, rail: checked-in+)
3. 55 bytes, text/x-github-pull-request (mtabara: review+, rail: checked-in+)
4. 59 bytes, text/x-review-board-request (mtabara: review+, jlorenzo: checked-in+)
5. 49 bytes, text/x-github-pull-request (mtabara: review+, rail: checked-in+)
The scheduler API is deprecated now that TC supports task dependencies natively. Instead of submitting a graph, we should submit individual tasks with their dependencies specified in task.dependencies, grouped using taskGroupId. Tasks should be submitted in topological order, since TC verifies at creation time that a task's dependencies exist. In case of failure we should resubmit only the failed tasks. Sorting and mass submission sound like good candidates for the TC client.
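A minimal sketch of that flow using the taskcluster Python client and the toposort library (the client setup and the empty tasks dict here are placeholders; in the real flow releasetasks produces the task definitions):

import taskcluster
from toposort import toposort_flatten

# Placeholder client setup; real credentials come from release runner config.
queue = taskcluster.Queue({
    "credentials": {"clientId": "...", "accessToken": "..."},
})
task_group_id = taskcluster.slugId()

# taskId -> task definition; each definition is assumed to carry
# "taskGroupId": task_group_id and a "dependencies" list of taskIds.
tasks = {}  # filled in by releasetasks in the real flow

dep_map = {tid: set(t.get("dependencies", [])) for tid, t in tasks.items()}

# Submit in topological order: the queue verifies at creation time that
# a task's dependencies already exist.
for task_id in toposort_flatten(dep_map):
    if task_id in tasks:  # skip ids that only appear as dependencies
        queue.createTask(task_id, tasks[task_id])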
Updated•7 years ago (Assignee)
Priority: -- → P2

Comment 1•7 years ago
It would be great if we could get this done, as this might be the last thing that requires us to keep the scheduler alive (but not well). For anyone looking at this bug who fancies taking a stab, it looks like this is where the scheduler is called: https://hg.mozilla.org/build/tools/file/9193e0759659/buildfarm/release/release-runner.py#l517
Comment 2•7 years ago
And I believe this is where the graph is generated: https://hg.mozilla.org/build/tools/file/9193e0759659/buildfarm/release/releasetasks_graph_gen.py
Updated•7 years ago
Mentor: pmoore
Whiteboard: [good first bug]
Comment 3•7 years ago
Erm, I've just read the dependent bugs and realise this might not be entirely trivial after all! :D
Mentor: pmoore (removed)
Whiteboard: [good first bug] (removed)
Comment 4•7 years ago
My understanding is that there are two ways to achieve this:
1. move releasetasks in-tree (a long project)
2. rewrite releasetasks to use the queue API instead of the scheduler

The latter is a little complicated since the releasetasks graph has tasks with more dependencies than the queue API allows, so insertion of some dummy tasks will be required. There's disagreement as to how difficult that might be :)
Comment 6•7 years ago
(In reply to Dustin J. Mitchell [:dustin] from comment #4)
> My understanding is that there are two ways to achieve this:
> 1. move releasetasks in-tree (a long project)
> 2. rewrite releasetasks to use the queue API instead of the scheduler
>
> The latter is a little complicated since the releasetasks graph has tasks
> with more dependencies than the queue API allows, so insertion of some
> dummy tasks will be required. There's disagreement as to how difficult
> that might be :)

Rail managed to implement a nice solution for 2) last week during the All-hands, one that uses the Queue API by reorganizing the current releasetasks graph into one large flattened list. There's a topological sort involved, as well as some other processing. Last I knew, on Friday he managed to submit a smaller subgraph to TC via our staging environment. I remember he mentioned some other complications he needed to tackle before declaring victory over this. With the current tcmigration project focus going on, I can look at finalizing this if Rail wants to hand it off.
Comment 7•7 years ago (Assignee)
I'll put together the 2 different solutions I have. Leaving the NI.
Comment 8•7 years ago (Assignee)
I ended up with 2 different approaches:

1) Extend the existing tools to convert the generated graph into a list of tasks
* releasetasks patches: https://github.com/mozilla-releng/releasetasks/compare/master...rail:queue_api?expand=1
  ** convert the existing graph into a list of tasks
  ** add dummy tasks in between tasks with more than 100 dependent tasks (see the sketch below)
  ** add a task (the "atomic task") to prevent ASAP scheduling in case the whole submission fails for some reason (mistakes, scopes, network, etc.)
  ** sort the tasks topologically
* releaserunner patch: https://gist.github.com/rail/4074a19313ca3970f9a7c9123b5b74ee
  ** resolve the atomic task to start the execution
  ** graph2 generation should be tested

I tested 1) on jamun and it worked fine. It takes longer to submit tasks (no async!), but we don't have the max 3 partials limit anymore!

2) A py35 CLI tool to do the same using a saved copy of the releasetasks-generated graph
* async submission (to be parallelized)
* release runner should save the generated graph to a file and pass it to this tool
* to be puppetized

I have not tested this version at all. I will be busy with some other stuff until the end of August, but feel free to adapt the patches.
Flags: needinfo?(rail)
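A sketch of the dummy-task trick from comment 8, assuming the Queue API caps task.dependencies at 100 entries. make_dummy_task and every field value in it are illustrative placeholders; the real definitions live in the releasetasks patch linked above:

from datetime import datetime, timedelta
import taskcluster

MAX_DEPS = 100  # the assumed Queue API limit on task.dependencies

def make_dummy_task(dependencies, task_group_id):
    # A no-op task whose only job is to fan in a chunk of dependencies.
    now = datetime.utcnow()
    return {
        "taskGroupId": task_group_id,
        "dependencies": dependencies,
        "provisionerId": "null-provisioner",  # placeholder
        "workerType": "dummy",                # placeholder
        "created": now.isoformat() + "Z",
        "deadline": (now + timedelta(days=1)).isoformat() + "Z",
        "payload": {},
        "metadata": {
            "name": "dummy fan-in task",
            "description": "fans in more than MAX_DEPS dependencies",
            "owner": "release@example.com",   # placeholder
            "source": "https://example.com",  # placeholder
        },
    }

def limit_dependencies(task, all_tasks, task_group_id):
    # Repeatedly replace the dependency list with a layer of dummy tasks
    # until the task itself has at most MAX_DEPS dependencies.
    deps = task["dependencies"]
    while len(deps) > MAX_DEPS:
        new_deps = []
        for i in range(0, len(deps), MAX_DEPS):
            dummy_id = taskcluster.slugId()
            all_tasks[dummy_id] = make_dummy_task(deps[i:i + MAX_DEPS],
                                                  task_group_id)
            new_deps.append(dummy_id)
        deps = new_deps
    task["dependencies"] = deps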
Comment 9•7 years ago
I like it! We use a threadpool in-tree https://dxr.mozilla.org/mozilla-central/source/taskcluster/taskgraph/create.py?q=create.py&redirect_type=direct#48 so maybe that's an option here, too?
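A sketch of what a threadpool could look like here, also collecting failures so only the failed tasks need resubmission, per the bug description. CONCURRENCY is an assumption; the function name matches the helper that later shows up in the patch in comment 18, but the body here is only a sketch:

from concurrent import futures

CONCURRENCY = 50  # assumed; tune to taste

def submit_parallelized(queue, tasks):
    # Submit tasks (a dict of taskId -> definition) concurrently and
    # return the subset that failed, so only those get resubmitted.
    # Note: TC requires a task's dependencies to exist at creation time,
    # so in practice this should be fed one dependency level at a time
    # (see comment 11 below).
    failed = {}
    with futures.ThreadPoolExecutor(CONCURRENCY) as executor:
        submitted = {
            executor.submit(queue.createTask, task_id, task_def): task_id
            for task_id, task_def in tasks.items()
        }
        for future in futures.as_completed(submitted):
            task_id = submitted[future]
            try:
                future.result()
            except Exception:
                failed[task_id] = tasks[task_id]
    return failed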
Comment 10•7 years ago (Assignee)
Oh, I forgot to point to the second tool: https://github.com/rail/graph2tasks
Updated•6 years ago
Comment 11•6 years ago
Note to self (context) for later:
* we're to flatten the existing graph and submit it in chunks via the TC Queue
* rail already implemented a good part of this; we still need more (unit) testing for that piece, and to deal with the submission
* currently the only caveat is that it takes a long time to submit those tasks (about 20-30 min for a release graph), mostly because it all happens sequentially. We need to bring async muscle to this in a smart way
* once we flatten the graph, topologically sort it and all, we'll have a chunk of tasks with 0 dependencies, another chunk with 1 dep, etc. The chunks shall be submitted sequentially, but the tasks within a chunk can be submitted asynchronously (see the sketch below)
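A sketch of that chunked submission, reusing the hypothetical submit_parallelized from the sketch after comment 9 and the toposort library, which yields exactly the chunks described above (first the set of tasks with no dependencies, then the tasks depending only on those, and so on):

from toposort import toposort

def submit_by_level(queue, tasks):
    # tasks: dict of taskId -> task definition. Chunks go out
    # sequentially; tasks within a chunk are mutually independent,
    # so they can be submitted concurrently.
    dep_map = {
        task_id: set(task_def.get("dependencies", []))
        for task_id, task_def in tasks.items()
    }
    for level in toposort(dep_map):
        # toposort can also yield ids that only appear as dependencies
        # (e.g. an already-created atomic task); skip those.
        chunk = {tid: tasks[tid] for tid in level if tid in tasks}
        failed = submit_parallelized(queue, chunk)
        if failed:
            raise RuntimeError("failed to submit: %s" % sorted(failed))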
Comment 12•6 years ago
Hoping to get my hands on this in the next two weeks, now that I'm off releaseduty cycle.
Assignee: nobody → mtabara
Comment hidden (mozreview-request)
Comment hidden (mozreview-request)
Comment 16•6 years ago (Assignee)
Attachment #8900706 - Flags: review?(mtabara)
Comment 17•6 years ago
Comment on attachment 8900706 [details] [review]
PR for releasetask

Commented in the PR[1].

[1]: https://github.com/mozilla-releng/releasetasks/pull/267#pullrequestreview-58577330

Attachment #8900706 - Flags: review?(mtabara) → review+
Comment 18•6 years ago
mozreview-review

Comment on attachment 8900704 [details]
Bug 1259627 - [tools] Use TC Queue API

https://reviewboard.mozilla.org/r/172146/#review177782

Looking good to me - just a few nits and a bunch of questions. Also, the most concerning thing for me: shouldn't we attempt a staging release for Fennec as well? See my comment below.

::: buildfarm/release/release-runner.py:414 (Diff revision 1)
>     # XXX: Doesn't work with neither Fennec nor Thunderbird
>     platforms = branchConfig['release_platforms']
>
>     try:
> -       graph_id = slugId()
> +       task_group_id = None

Can't we remove this line, since the task group id comes from releasetasks? I doubt there's any warning downstream.

::: buildfarm/release/release-runner.py:523 (Diff revision 1)
>     import pprint
> -   log.debug(pprint.pformat(graph, indent=4, width=160))
> -   print(scheduler.createTaskGraph(graph_id, graph))
> +   for task_id, task_def in tasks.items():
> +       log.debug("%s ->\n%s", task_id,
> +                 pprint.pformat(task_def, indent=4, width=160))
> +   submit_parallelized(queue, tasks)
> +   resolve_task(queue, toplevel_task_id)

Beautiful!

::: buildfarm/release/releasetasks_graph_gen.py:134 (Diff revision 1)
> +   log.debug("%s ->\n%s", task_id,
> +             pprint.pformat(task_def, indent=4, width=160))
> +
>     if not options.dry_run:
> -       print scheduler.createTaskGraph(graph_id, graph)
> +       submit_parallelized(queue, tasks)
> +       resolve_task(queue, toplevel_task_id)

Wonderful!

::: buildfarm/release/releasetasks_graph_gen.py:218 (Diff revision 1)
>     tc_config = {
>         "credentials": {
>             "clientId": get_config(release_runner_config, "taskcluster", "client_id", None),
>             "accessToken": get_config(release_runner_config, "taskcluster", "access_token", None),
> +       },
> +       "maxRetries": 12,

Question: since this is global, which tasks does this apply to?

::: lib/python/kickoff/__init__.py:370 (Diff revision 1)
>         kwargs["extra_balrog_submitter_params"] = extra_balrog_submitter_params
>
> -   # don't import releasetasks until required within function impl to avoid global failures
> -   # during nosetests
> -   from releasetasks import make_task_graph
> -   return make_task_graph(**kwargs)
> +   # don't import releasetasks until required within function impl to avoid
> +   # global failures during nosetests
> +   from releasetasks import make_tasks
> +   return make_tasks(**kwargs)

Really nice approach! Easy to back out if we need to. My only concern is that I'd like to see a Fennec staging release on jamun as well. For Fennec we have a decision task within the graph, and I'd feel safer if we tested that case before rolling this out to production.

::: lib/python/kickoff/tc.py:29 (Diff revision 1)
> +   def submit_task(t_id, t_def):
> +       log.info("Submitting %s", t_id)
> +       queue.createTask(t_id, t_def)
> +
> +   with futures.ThreadPoolExecutor(CONCURRENCY) as e:
> +       fs = {}

Nit: maybe use a more intuitively readable variable name here instead of `fs`, e.g. "submitted_future" or alike.

Attachment #8900704 - Flags: review?(mtabara) → review+
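For context on the atomic-task dance the diff above references: resolve_task flips the atomic task to completed so the rest of the group becomes runnable. A sketch of one way this can work against the Queue client; the workerGroup/workerId strings are placeholders, and the actual implementation in lib/python/kickoff/tc.py may differ:

def resolve_task(queue, task_id):
    # The atomic task targets a worker type nothing ever claims, so every
    # other task in the group (which depends on it, directly or via dummy
    # tasks) stays blocked until we claim and complete it ourselves.
    status = queue.status(task_id)
    run_id = status["status"]["runs"][-1]["runId"]
    queue.claimTask(task_id, run_id, {
        "workerGroup": "release-runner",  # placeholder
        "workerId": "release-runner",     # placeholder
    })
    queue.reportCompleted(task_id, run_id)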
Comment 19•6 years ago
mozreview-review

Comment on attachment 8900705 [details]
Bug 1259627 - Update releaserunner dependencies

https://reviewboard.mozilla.org/r/172150/#review177834

::: modules/releaserunner/manifests/init.pp:60 (Diff revision 1)
>     'simplejson==2.6.2',
>     'singledispatch==3.4.0.3',
>     'six==1.9.0',
>     'sqlalchemy-migrate==0.7.2',
>     'taskcluster==0.0.24',
> +   'toposort==1.0',

Any reason we're not using the latest 1.5 version? https://pypi.python.org/pypi/toposort

Attachment #8900705 - Flags: review?(mtabara) → review+
Comment 20•6 years ago (Assignee)
Comment on attachment 8900705 [details]
Bug 1259627 - Update releaserunner dependencies

toposort 1.5, it is!

remote: https://hg.mozilla.org/build/puppet/rev/fdbd9474bc29a5d8ff6841d6554f1590e4670054
remote: https://hg.mozilla.org/build/puppet/rev/0da450a5412289b03425bb66be16a1c730dcaa94

Attachment #8900705 - Flags: checked-in+
Comment 21•6 years ago (Assignee)
Comment on attachment 8900706 [details] [review]
PR for releasetask

Merged: https://github.com/mozilla-releng/releasetasks/pull/267

Attachment #8900706 - Flags: checked-in+
Comment 22•6 years ago (Assignee)
Comment on attachment 8900704 [details]
Bug 1259627 - [tools] Use TC Queue API

https://hg.mozilla.org/build/tools/rev/d0fdabea1ee70000001069700a9cde5b3957db4c

Attachment #8900704 - Flags: checked-in+
Comment hidden (mozreview-request)
Comment 24•6 years ago
mozreview-review

Comment on attachment 8901870 [details]
Bug 1259627 - Keep printing out the task group ID

https://reviewboard.mozilla.org/r/173302/#review178662

Good catch, lgtm!

Attachment #8901870 - Flags: review+
Comment 25•6 years ago
Comment on attachment 8901870 [details]
Bug 1259627 - Keep printing out the task group ID

https://hg.mozilla.org/build/tools/rev/813629fc9d561badf37ac19e6399f1a1fbac7932

Attachment #8901870 - Flags: review?(rail) → checked-in+
Comment 26•6 years ago
I was chatting with :jlorenzo earlier tonight as he was doing a Fennec staging release to test things out, and we just realized we can maybe get rid of the "graph" scopes from releasetasks entirely. AFAIK, *before*, we'd define scopes in each task plus at the graph level, as needed by the Scheduler API. But since the graph is now just a formal step before flattening it into a list of tasks that are individually sent via the Queue API, I think we no longer need the graph-level scopes.

:jlorenzo was having trouble with some missing scopes, tweaked only the task-level scope (not the graph one), and it still worked. I think this proves the theory above.

To conclude, I think we can actually remove all the scopes from [1]. Might be worth testing this with a staging release tomorrow morning. Leaving myself an NI on this one.

[1]: https://github.com/mozilla-releng/releasetasks/blob/master/releasetasks/templates/mobile/release_graph.yml.tmpl#L21

Flags: needinfo?(mtabara)
Comment 27•6 years ago (Assignee)
Yup, we can kill those.
Comment 28•6 years ago
Filed bug 1394852 to track removal of those graph-level scopes mentioned in comment 26.
Updated•6 years ago
Flags: needinfo?(mtabara)
Comment 29•6 years ago (Assignee)
Attachment #8902642 - Flags: review?
Updated•6 years ago
Attachment #8902642 - Flags: review? → review+
Comment 30•6 years ago (Assignee)
Comment on attachment 8902642 [details] [review]
Funsize scheduler PR

To be deployed soon!

Attachment #8902642 - Flags: checked-in+
Comment 31•6 years ago (Assignee)
remote: https://hg.mozilla.org/build/puppet/rev/929209172607b1a4bd33c051041e3784976941ba
remote: https://hg.mozilla.org/build/puppet/rev/cfa41b0c02832183ccd2dc31167e88826517f553
Comment 32•6 years ago
We hit some artifact-expires errors like:

Aug 30 13:05:48 signingworker-6.srv.releng.use1.mozilla.com python: signingworker taskcluster.exceptions.TaskclusterRestFailure: Artifact expires (2017-09-13T20:05:47.672Z) after the task expiration 2017-09-06T12:48:31.067Z (task.expires < expires) - this is not allowed, artifacts may not expire after the task they belong to expires

We *think* https://github.com/mozilla-releng/funsize/commit/96d4d90350a8a1ee378f1b0dddfcd45acd4bff53 should fix that symptom... we'll see during tonight's nightlies, triggering in <1hr.

But when digging, I found another possible fix. At https://github.com/mozilla-releng/signingworker/blob/master/signingworker/worker.py#L176 we set the artifact expires to 2 weeks from now, but "now" is well after the graph was created. At https://github.com/mozilla-releng/signingworker/blob/master/signingworker/worker.py#L71 we get the task definition; `task.get('expires')` should be the task's expiration datestring, if any. We could fall back to an arrow, e.g. `task.get('expires', arrow.now().replace(weeks=2).isoformat())`.

`sign()` has access to the task definition, so we could compute `expires` at that point. It calls `download_and_sign_file()`, which calls `create_artifact()`, and it also calls `create_artifact()` directly. If we pass `expires` down as an arg or as a self.variable, we can make sure we never try to create an artifact with an `expires` later than `task.expires`.
Flags: needinfo?(rail)
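A sketch of that clamping idea (hypothetical helper name, not the actual fix, which lives in the funsize/signingworker patches linked above). Note arrow's 2017-era replace() shifts when given plural units; newer arrow spells this shift():

import arrow

def artifact_expires(task_def, default_weeks=2):
    # Default: two weeks from now, as signingworker does today.
    candidate = arrow.utcnow().replace(weeks=default_weeks)
    task_expires = task_def.get("expires")
    if task_expires:
        # Never let an artifact outlive its task: clamp to task.expires.
        candidate = min(candidate, arrow.get(task_expires))
    return candidate.isoformat()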
Comment 33•6 years ago (Assignee)
Yeah, I noticed that issue yesterday, and https://github.com/mozilla-releng/funsize/pull/82 fixed it. I shouldn't have added that field in the first place. :)

Flags: needinfo?(rail)
Comment 34•6 years ago (Assignee)
Everything looks good so far, no backouts!

Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Comment 35•6 years ago
I've removed:

"Scheduler": "http://references.taskcluster.net/scheduler/v1/api.json",
"SchedulerEvents": "http://references.taskcluster.net/scheduler/v1/exchanges.json",

from http://references.taskcluster.net/manifest.json so that the scheduler APIs don't get generated in our clients.
Comment 36•6 years ago
The following scopes are still dotted around our clients/roles:

scheduler:*
scheduler:create-task-graph
scheduler:extend-task-graph
scheduler:extend-task-graph:*
scheduler:route:gaia-taskcluster
scheduler:route:taskcluster-github.*

I intend to remove these - shout out if you see any issue with me doing this.
Comment 37•6 years ago
Commit pushed to master at https://github.com/taskcluster/taskcluster-client-go
https://github.com/taskcluster/taskcluster-client-go/commit/04c1b1bcc331fe51d6d0782ee38d529b5eb3e90c
Bug 1259627 - scheduler no longer exists
Comment 38•6 years ago
Sorry, I see now this bug was about not continuing to use the scheduler APIs, rather than about sunsetting the scheduler. Therefore I've moved the sunsetting parts into bug 1399437. Sorry for the noise! :)