Closed Bug 1280231 Opened 3 years ago Closed 3 years ago

Refactor taskgraph "Kind" to be a base Task class

Categories

(Firefox Build System :: Task Configuration, task)

task
Not set

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: dustin, Assigned: dustin)

References

Details

Attachments

(3 files)

We are currently using Kind instances to represent kinds, with tasks represented by instances of a data-only class.  So we have Kind instance methods like `get_task_dependencies(..task..)`.  This would be much more sensible with a bunch of Task subclasses that have a `get_dependencies()` method.
Assignee: nobody → dustin
This enables kinds that generate tasks based on those output by another kind.
For example, the test kind might generate a set of test tasks for each build
task.

Review commit: https://reviewboard.mozilla.org/r/59832/diff/#index_header
See other reviews: https://reviewboard.mozilla.org/r/59832/
Comment on attachment 8763686 [details]
Bug 1280231: rename types.py to taskgraph.py;

https://reviewboard.mozilla.org/r/59834/#review56828

This solves my editors issue, so \o/ lets do it.
Attachment #8763686 - Flags: review?(bugspam.Callek) → review+
Comment on attachment 8763684 [details]
Bug 1280231: refactor task kinds to task classes;

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/59830/diff/1-2/
Comment on attachment 8763685 [details]
Bug 1280231: Load kinds in order by dependencies;

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/59832/diff/1-2/
Comment on attachment 8763686 [details]
Bug 1280231: rename types.py to taskgraph.py;

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/59834/diff/1-2/
Comment on attachment 8763684 [details]
Bug 1280231: refactor task kinds to task classes;

https://reviewboard.mozilla.org/r/59830/#review57674

I still think we can make better abstractions.
But this is a great improvement.

::: taskcluster/taskgraph/kind/base.py:14
(Diff revision 2)
>  
> -class Kind(object):
> +class Task(object):
>      """
> +    Representation of a task in a TaskGraph.  Each Task has, at creation:
> +
> +    - kind: the name of the task kind

Do we ever really need this?
And if so why not just use: `obj.__class__.__name__`?

::: taskcluster/taskgraph/kind/base.py:17
(Diff revision 2)
> +    Representation of a task in a TaskGraph.  Each Task has, at creation:
> +
> +    - kind: the name of the task kind
> +    - label; the label for this task
> +    - attributes: a dictionary of attributes for this task (used for filtering)
> +    - task: the task definition (JSON-able dictionary)

Given that you need taskIds for dependencies substited in I would argue that you can't just expose a dictionary for the task definition.

A single method task.definition() would suffice.

An alternative option that would give more control at a higher level would be to do:
task.payload(taskIds)
task.dependencies()
task.deadline()
task.expires()
task.extra()
task.name()
task.source()
task.description()
task.workerType()
task.provisionerId()
task.routes()
task.priority()
task.retries()
task.scopes()

With the invariant that task.payload(taskIds)
is called with a mapping taskIds from objects in 
task.dependencies() to taskId for said objects.

Doing something like this would give code operating on abstract tasks as much power as possible, while still hiding all the internals.
Notice how things like:

 * task.metadata.owner
 * task.taskGroupId
 * task.schedulerId
 * task.created

Don't have to be handled by the abstract task object anymore.
Those are things we easily handle at a higher level.


-----

I realize that the reasoning behind the current design might have to do with the ability to serialize all tasks to json files...
So that we can backfill later.. hmm... That arguably requires us to have a templating language to substitute in the dependencies (or so I suspect).

::: taskcluster/taskgraph/kind/base.py:52
(Diff revision 2)
>      @abc.abstractmethod
> -    def load_tasks(self, parameters):
> +    def load_tasks(cls, kind, path, config, parameters):
>          """
> -        Get the set of tasks of this kind.
> +        Load the tasks for a given kind.
> +
> +        The `kind` is the name of the kind; the configuration for that kind

If `kind` serves so that the same Task class can be used for two different kind implementations I think that's overkill.

We should aim to simplify this. If people really want two kinds using the same implementation, just inherent one from the other, or make a common base class.
(or even better: duplicate the code)

::: taskcluster/taskgraph/kind/base.py:70
(Diff revision 2)
> -    def get_task_dependencies(self, task, taskgraph):
> +    def get_dependencies(self, taskgraph):
>          """
> -        Get the set of task labels this task depends on, by querying the task graph.
> +        Get the set of task labels this task depends on, by querying the full
> +        task set, given as `taskgraph`.
>  
>          Returns a list of (task_label, dependency_name) pairs describing the

Why return task_label and not task objects?
I assume you're filtering on the list of objects that have attributes...

Also isn't dependency_name purely an internal thing to the task object.

I view this task object as an abstraction that can be:

 * filtered
 * depended on
 * optimized out

Without dealing with any internal details about this task.
Abstractions should hide internals.

::: taskcluster/taskgraph/kind/base.py:87
(Diff revision 2)
>          dependencies on this task will isntead depend on that taskId.  It is an
>          error to return no taskId for a task on which other tasks depend.
>  
>          The default never optimizes.
>          """
>          return False, None

Nice!

This correctly hides the logic of deciding whether or not a task can be optimized.

nit: None or "taskId" would work too.

Also, perhaps this method should be async?
(I'm not python expert, so not sure what that would entail)
Attachment #8763684 - Flags: review?(jopsen) → review+
Comment on attachment 8763685 [details]
Bug 1280231: Load kinds in order by dependencies;

https://reviewboard.mozilla.org/r/59832/#review57686

Looks good to me

::: taskcluster/taskgraph/kind/base.py:64
(Diff revision 2)
>          See `taskcluster/docs/parameters.rst` for details.
>  
> +        At the time this method is called, all kinds on which this kind depends
> +        (that is, specified in the `kind-dependencies` key in `self.config`
> +        have already loaded their tasks, and those tasks are available in
> +        `loaded_tasks`, a dictionary keyed by task label.

Don't they all have label as a property?
If so, why make it a dictionary... I just fear that duplicating data is a good way of getting inconsistencies.
Attachment #8763685 - Flags: review?(jopsen) → review+
(In reply to Jonas Finnemann Jensen (:jonasfj) from comment #9)
> Do we ever really need this?
> And if so why not just use: `obj.__class__.__name__`?

There can be multiple kinds using the same class.  Lots of dependencies will be looking for tasks from a specific other kind.

> Given that you need taskIds for dependencies substited in I would argue that
> you can't just expose a dictionary for the task definition.

You can -- there's a {task-reference: ..} syntax that handles it.

The suggestion to use lots of methods for different bits of the task definition has the advantage, if I understand you correctly, that a deep inheritance tree would allow task.created() to be implemented once, with task.scopes(), for example, overridden in subclasses.  In practice I think this just violates locality: now each task definition is generated by 15 different functions in different classes in different source files, even though all of the values are interrelated (e.g., scopes depends on caches, features, devices, etc., artifact expires depends on task expires, payload format depends on worker type, etc.).

As I work on more realistic kinds (bug 1281004) there's an opportunity to factor some of that stuff out into shared methods, but only in a very limited fashion.

> I realize that the reasoning behind the current design might have to do with
> the ability to serialize all tasks to json files...
> So that we can backfill later.. hmm... That arguably requires us to have a
> templating language to substitute in the dependencies (or so I suspect).

Yep :)

> If `kind` serves so that the same Task class can be used for two different
> kind implementations I think that's overkill.

I disagree.  As an example, I suspect we will have different classes for Mulet and desktop tests, but they can easily use the same class.  I guess I don't see the motivation to not allow tihs: it is simple enough, allows a useful kind of flexibility, and encourages separation of configuration from implementation.  Also, enforcing this would be weird: I'd need to check that a given class hasn't already been used to instantiate a kind, and if so, what, raise an exception suggesting creating an empty subclass just to work around the check?

> Why return task_label and not task objects?

Because the graph is based on named references, rather than object references.  Using object identity for dependencies gets messy quickly when updating graphs; for example, the target task set is an edge-free subset of the full task graph, which is then closed to make the target task graph.  If the task objects themselves embody those dependencies, then they must be duplicated at each stage, which would be quite a bit slower.

> Also isn't dependency_name purely an internal thing to the task object.

No, it's used to implement {task-reference: ..}

> I view this task object as an abstraction that can be:
> 
>  * filtered
>  * depended on
>  * optimized out
> 
> Without dealing with any internal details about this task.
> Abstractions should hide internals.

Dependencies are named graph edges of the form (label1, label2, depname).  So the dependencies are not an internal detail of the tasks, but are in fact a very important part of the task's interface, as they are what most of this system is manipulating.

> Also, perhaps this method should be async?
> (I'm not python expert, so not sure what that would entail)

The decision task doesn't run in an async environment (e.g., asyncio or twisted), so sadly that's not possible here.  I have some ideas on how to parallelize a whole bunch of index lookups, but none of them are pretty.  Greg may have some ideas.

(In reply to Jonas Finnemann Jensen (:jonasfj) from comment #10)
> Don't they all have label as a property?
> If so, why make it a dictionary... I just fear that duplicating data is a
> good way of getting inconsistencies.

Good point..
> The suggestion to use lots of methods for different bits of the task definition has the advantage...
The motivation wasn't inheritance, it was more that some parts could be handled generically.
I see the code in two parts:
 A) Abstract Task objects (implemented by kinds/task classes)
 B) Generic Task-Graph code: generating graphs, filtering, optimizing, creating tasks...
Code in (B) operates on (A) without knowing that there are different kinds, as this is abstracted away.
As all tasks have top-level properties in common, the abstract task object could expose those and the
generic task-graph code can reason about these properties.

Anyways, this might be a detail...

> So the dependencies are not an internal detail of the tasks, but are in fact a very
> important part of the task's interface...
hmm... I see how it becomes problematic with the idea that the decision task must create json artifacts. On the other hand I fear we easily end up with multiple layers of templating systems, as a task kind implementation would naturally insert parameters into the YAML files.

Oh, well, let's see where we end up.. My original plan was to try and build a kind implementation with a simple json based template and linking system that could support everything. I _just_ haven't figured out what that would look like yet :) hehe

Note: all of this is definitely an improvement over what we have.
working on it..
Comment on attachment 8763684 [details]
Bug 1280231: refactor task kinds to task classes;

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/59830/diff/2-3/
Comment on attachment 8763685 [details]
Bug 1280231: Load kinds in order by dependencies;

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/59832/diff/2-3/
Comment on attachment 8763686 [details]
Bug 1280231: rename types.py to taskgraph.py;

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/59834/diff/2-3/
Flags: needinfo?(dustin)
Product: TaskCluster → Firefox Build System
You need to log in before you can comment on or make changes to this bug.