Closed Bug 1420449 Opened 2 years ago Closed 2 years ago

Optimize task graph generation

Categories

(Firefox Build System :: Task Configuration, task)

task
Not set

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: catlee, Assigned: catlee)

References

(Blocks 1 open bug)

Details

Attachments

(2 files, 2 obsolete files)

Generating task graphs locally can be pretty slow. It takes about 20s on my machine to run `mach taskgraph target`. This makes the modify/test cycle pretty slow when working on taskgraph changes.

I did some digging, and found a few places we may be able to speed this up, especially for local development:

1) We spend a lot of time hashing directory contents for the toolchain tasks. Most users working locally probably don't need accurate hashes for toolchain tasks until they're ready to test out their changes. I've attached a simple patch that disables directory hashing based on an environment variable. This brings taskgraph generation time down to about 13s on my system.

2) We also spend a lot of time doing task schema validation. I wonder if this could be parallelized at all? It takes about 7s to generate and validate all the 'test' kind tasks.
I would rather have this be a command-line argument rather than an env var.  Maybe something like -F / --fast?

I like the idea of skipping validation.  I don't think it can be parallelized because of the GIL, but it could easily be skipped without an impact to correctness (the user would just need to be aware that they may not hear about schema errors in any modifications they have made).

As for skipping the hashing, I'd prefer that it disable index searching entirely for the relevant tasks, rather than return a static hash.  There will likely be a lot of cases where a --fast taskgraph is accidentally used to look up something like a docker image or toolchain build, and with a static hash it would be trivial to trojan developers' systems in that case.

Since this is a local-use-only option, I think it would be fine for --fast to set some global variable (maybe just `taskgraph.fast`?).  I don't think it should be a parameter, since it would never be used in automation.

/cc ahal since he is interested in fast local taskgraph generation
Yes, thanks for looking into this!

These improvements might make bug 1412121 unnecessary, I'll make this block so we can compare again after these changes land.
Blocks: 1412121
I played around a little bit with bypassing or parallelizing the schema validation and ran into a few issues:

1) skipping schema validation causes errors in the task generation. The schema validation seems to make modifications to the task definitions. I didn't dig into it, but perhaps it's applying defaults?

2) I couldn't figure out how to use multiprocessing or concurrent.futures from mach, I kept hitting errors like
PicklingError: Can't pickle <type 'function'>: attribute lookup __builtin__.function failed
Blocks: fastci
> 1) skipping schema validation causes errors in the task generation. The schema validation seems to make modifications to the task definitions. I didn't dig into it, but perhaps it's applying defaults?

This is definitely the case. Perhaps it would be possible to run voluptuous in a mode that just applies changes and doesn't make other modifications? I guess voluptuous might be expressive enough that the defaults might depend on checking other things, even if taskgraph doesn't use that flexibility.
It's definitely possible -- it returns the modified data, rather than modifying it in place.

I think we only use voluptuous lightly to apply defaults, and we also have lots of transforms that apply defaults.  I suspect that using transforms exclusively would be clearer for readers anyway.
Comment on attachment 8933366 [details]
Bug 1420449: Add mach taskgraph -F/--fast

https://reviewboard.mozilla.org/r/204294/#review210152

Thanks, this looks good! I have a couple comments.

Dustin, does using `graph_config` for this value make sense to you?

::: taskcluster/taskgraph/__init__.py:11
(Diff revision 1)
> +# Enable fast task generation for local debugging?
> +fast = False

I think this would be better off living in the `graph_config` instead of an attribute on the top level module:
https://searchfox.org/mozilla-central/source/taskcluster/taskgraph/generator.py#90

You'd have to modify `load_graph_config` to accept arbitrary kwargs that updates/overwrites whatever is in `config.yml`.

Then this value can be accessed later on via `config.graph_config['fast']`

::: taskcluster/taskgraph/transforms/job/toolchain.py:159
(Diff revision 1)
> +    if not taskgraph.fast:
> -    name = taskdesc['label'].replace('{}-'.format(config.kind), '', 1)
> +        name = taskdesc['label'].replace('{}-'.format(config.kind), '', 1)
> -    add_optimization(
> +        add_optimization(
> -        config, taskdesc,
> +            config, taskdesc,
> -        cache_type=CACHE_TYPE,
> +            cache_type=CACHE_TYPE,
> -        cache_name=name,
> +            cache_name=name,
> -        digest_data=get_digest_data(config, run, taskdesc),
> +            digest_data=get_digest_data(config, run, taskdesc),
> -    )
> +        )

I'm not at all familiar with toolchain tasks. But does skipping this `add_optimization` call make the whole function a no-op? Could we do an early return near the top instead?
Attachment #8933366 - Flags: review?(ahalberstadt)
Comment on attachment 8933366 [details]
Bug 1420449: Add mach taskgraph -F/--fast

https://reviewboard.mozilla.org/r/204294/#review210152

> I think this would be better off living in the `graph_config` instead of an attribute on the top level module:
> https://searchfox.org/mozilla-central/source/taskcluster/taskgraph/generator.py#90
> 
> You'd have to modify `load_graph_config` to accept arbitrary kwargs that updates/overwrites whatever is in `config.yml`.
> 
> Then this value can be accessed later on via `config.graph_config['fast']`

To clarify, the new way of setting this would be `tgg = TaskGraphGenerator(fast=True)`
Comment on attachment 8933366 [details]
Bug 1420449: Add mach taskgraph -F/--fast

https://reviewboard.mozilla.org/r/204294/#review210794

I don't feel too strongly about the config issue. I'll leave it open for now in case there's some more bikeshedding to be had. But if dustin prefers this approach to the `graph_config` one, feel free to close it and land.
Attachment #8933366 - Flags: review+
Comment on attachment 8933366 [details]
Bug 1420449: Add mach taskgraph -F/--fast

https://reviewboard.mozilla.org/r/204294/#review211152

::: taskcluster/taskgraph/__init__.py:11
(Diff revision 1)
> +# Enable fast task generation for local debugging?
> +fast = False

The graph config is for bigger, sweeping changes in the graph generation that differ across trust domains (so TB has its own taskcluste/ci/config.yml).  So I don't think that's a great place.

This particular piece of configuration is a runtime option that's specifically about the generation, and doesn't affect the resulting graph in "important" ways.

I suspect we'll want access to this value from all manner of weird places, such as in the schema validation function, too.  Rather than find a dynamic value that is available everywhere we want it, I think this global makes that easy, and also suggests that this is an internal feature.

My only request would be to expand this comment a little: what does it mean (how much detail can we drop in the interest of speed?), how is it set, how is it accessed?
Keywords: leave-open
Pushed by catlee@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/642f885eb109
Add mach taskgraph -F/--fast r=ahal
Attachment #8931719 - Attachment is obsolete: true
Attached patch noschema.diff (obsolete) — Splinter Review
This produces equivalent output for me with or without --fast as compared to a task graph generated without this patch.

I don't like the repetition of the treeherder environment defaults. Any ideas for how to make that better?
Attachment #8935631 - Flags: feedback?(mozilla)
Attachment #8935631 - Flags: feedback?(dustin)
Attachment #8935631 - Flags: feedback?(ahalberstadt)
Depends on: 1424138
Comment on attachment 8935631 [details] [diff] [review]
noschema.diff

Review of attachment 8935631 [details] [diff] [review]:
-----------------------------------------------------------------

I'm slightly unhappy at moving the defaults away from the schema (although I know I'm the one that suggested it). The schemas are the only documentation for the config formats that exist, so moving the defaults away makes them harder to reference. I wonder how hard it would be to extract the defaults from the schema, (probably once at module scope would be possible) and then apply them like is done here.

>  I don't like the repetition of the treeherder environment defaults. Any ideas for how to make that better?

Yep, get rid of that setting, since it isn't useful anymore: https://reviewboard.mozilla.org/r/206516/
Attachment #8935631 - Flags: feedback?(mozilla) → feedback-
(In reply to Tom Prince [:tomprince] from comment #15)
> Comment on attachment 8935631 [details] [diff] [review]
> noschema.diff
> 
> Review of attachment 8935631 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> I'm slightly unhappy at moving the defaults away from the schema (although I
> know I'm the one that suggested it). The schemas are the only documentation
> for the config formats that exist, so moving the defaults away makes them
> harder to reference. I wonder how hard it would be to extract the defaults
> from the schema, (probably once at module scope would be possible) and then
> apply them like is done here.

I think we'll end up re-implementing most of the schema validation code to do this so that we can correctly apply defaults to schemas that accept more than one form (e.g. schemas with Any()). Perhaps we can duplicate the defaults and verify that they match when doing schema verification?

> >  I don't like the repetition of the treeherder environment defaults. Any ideas for how to make that better?
> 
> Yep, get rid of that setting, since it isn't useful anymore:
> https://reviewboard.mozilla.org/r/206516/

\o/
Comment on attachment 8935631 [details] [diff] [review]
noschema.diff

Review of attachment 8935631 [details] [diff] [review]:
-----------------------------------------------------------------

I like this.  I think in the fullness of time we can move some of this to a model where defaults are applied at the time the parameter is consulted (so, use .get instead of []), but for the moment this is a good intermediate step.

::: taskcluster/taskgraph/transforms/source_test.py
@@ +74,1 @@
>          yield validate_schema(source_test_description_schema, job,

We should also change this so that it validates but then returns the original (which, IIRC, voluptuous does not modify).  Then if someone later adds a default, it will have no effect.

::: taskcluster/taskgraph/transforms/task.py
@@ +1091,5 @@
> +            worker.setdefault('max-run-time', 600)
> +        elif worker['implementation'] == 'beetmover-cdns':
> +            worker.setdefault('max-run-time', 600)
> +        elif worker['implementation'] == 'push-apk':
> +            worker.setdefault('commit', False)

could all of these move to the implementation functions?
Attachment #8935631 - Flags: feedback?(dustin) → feedback+
Comment on attachment 8935631 [details] [diff] [review]
noschema.diff

Review of attachment 8935631 [details] [diff] [review]:
-----------------------------------------------------------------

I spent a bit of time trying to recursively extract defaults out of a Schema. Top level defaults and recursively getting defaults out of dict values isn't too hard. But it gets quite complicated when you have to also consider Any/All directives. It's do-able, but seems more complicated than it's worth. Let me know and I can attach what I have so far.

::: taskcluster/taskgraph/util/schema.py
@@ +20,5 @@
>      Validate that object satisfies schema.  If not, generate a useful exception
>      beginning with msg_prefix.
>      """
> +    if taskgraph.fast:
> +        return obj

Should we also deepcopy this?
Attachment #8935631 - Flags: feedback?(ahalberstadt) → feedback+
(In reply to Dustin J. Mitchell [:dustin] from comment #17)
> Comment on attachment 8935631 [details] [diff] [review]
> noschema.diff
> 
> Review of attachment 8935631 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> I like this.  I think in the fullness of time we can move some of this to a
> model where defaults are applied at the time the parameter is consulted (so,
> use .get instead of []), but for the moment this is a good intermediate step.
> 
> ::: taskcluster/taskgraph/transforms/source_test.py
> @@ +74,1 @@
> >          yield validate_schema(source_test_description_schema, job,
> 
> We should also change this so that it validates but then returns the
> original (which, IIRC, voluptuous does not modify).  Then if someone later
> adds a default, it will have no effect.

I've changed it so that validate_schema returns None, forcing all callers to yield the original job object.

I don't think we need to worry about mutable defaults in this case. The set_defaults transforms will create new list objects per task.

> ::: taskcluster/taskgraph/transforms/task.py
> @@ +1091,5 @@
> > +            worker.setdefault('max-run-time', 600)
> > +        elif worker['implementation'] == 'beetmover-cdns':
> > +            worker.setdefault('max-run-time', 600)
> > +        elif worker['implementation'] == 'push-apk':
> > +            worker.setdefault('commit', False)
> 
> could all of these move to the implementation functions?

Not easily. The validate transform occurs before the build_task transform, which is where the worker specific payload builder function is called. Any ideas for how to refactor that?
How do we keep this from breaking in the future?
I'm happy with comment 19 -- no ideas for how to refactor that.  My comment was from curiosity.

As to keeping this from breaking -- we could potentially run with --fast in a sub-task, if that would help, perhaps even verifying that task-graph.json matches that from the decision task?
Assignee: nobody → catlee
Attachment #8935631 - Attachment is obsolete: true
Attachment #8938500 - Flags: review?(dustin)
Attachment #8938500 - Flags: review?(ahalberstadt)
Attachment #8938500 - Flags: review?(dustin) → review+
Comment on attachment 8938500 [details] [diff] [review]
Skip schema validation with --fast

Review of attachment 8938500 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks!
Attachment #8938500 - Flags: review?(ahalberstadt) → review+
Pushed by catlee@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/7a47accb11c5
Skip schema validation with --fast r=dustin,ahal
Depends on: 1427797
FWIW, this bounced because of bug 1421100. See https://hg.mozilla.org/integration/mozilla-inbound/rev/c1e20de19b6f

Also note that I landed bug 1419638 on autoland, that adds another use of validate_schema that this bug would need to patch.
(In reply to Mike Hommey [back on Jan 9] [:glandium] from comment #26)
> FWIW, this bounced because of bug 1421100. See
> https://hg.mozilla.org/integration/mozilla-inbound/rev/c1e20de19b6f
> 
> Also note that I landed bug 1419638 on autoland, that adds another use of
> validate_schema that this bug would need to patch.

... although I must say, I don't understand why you're not keeping the current behavior for validate_schema, avoiding all callers to change.
(In reply to Mike Hommey [back on Jan 9] [:glandium] from comment #27)
> (In reply to Mike Hommey [back on Jan 9] [:glandium] from comment #26)
> > FWIW, this bounced because of bug 1421100. See
> > https://hg.mozilla.org/integration/mozilla-inbound/rev/c1e20de19b6f
> > 
> > Also note that I landed bug 1419638 on autoland, that adds another use of
> > validate_schema that this bug would need to patch.

Thanks for the heads up! I'll incorporate that in my patch.

> ... although I must say, I don't understand why you're not keeping the
> current behavior for validate_schema, avoiding all callers to change.

To try and prevent people from relying on side-effects of the schema validation.
Flags: needinfo?(catlee)
(In reply to Chris AtLee [:catlee] from comment #28)
> To try and prevent people from relying on side-effects of the schema
> validation.

You don't need to change the function signature for that. You can still make it return the data it's passed in. That would avoid all those changes to callers.
Pushed by catlee@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/1c24c4cefe70
Skip schema validation with --fast r=dustin,ahal
Pushed by mozilla@hocat.ca:
https://hg.mozilla.org/comm-central/rev/a7badbfcf7b3
Include optional graph config setting, now that defaults aren't applied; rs=bustage-fix CLOSED TREE
I think this is good enough for now.
Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Removing leave-open keyword from resolved bugs, per :sylvestre.
Keywords: leave-open
Product: TaskCluster → Firefox Build System
Depends on: 1466098
You need to log in before you can comment on or make changes to this bug.