Open Bug 1249091 Opened 4 years ago Updated 2 years ago

Teach moz.build Files() metadata to trigger Task Cluster tasks

Categories

(Firefox Build System :: General, defect)

defect
Not set

Tracking

(Not tracked)

People

(Reporter: gps, Unassigned)

References

Details

Attachments

(3 files)

As I wrote in bug 1245953 comment 21, I'm not keen on Task Cluster tasks declaring their own list of file patterns that should trigger tasks because there will be frequent references to paths distant from the task filesystem location. This means things can quickly get out of sync. It also means people have to dig around in multiple parts of the tree in order to update metadata.

The moz.build Files() context manager already provides file-based metadata. I think it would be beneficial if we started recording "impacted tasks" in this metadata.

e.g.

with Files('**/*.jsm'):
    IMPACTED_TASKS += ['eslint']
with File('**/*.rst'):
    IMPACTED_TASKS += ['sphinx']

When the Task Cluster graph/decision command runs, it will get the set of impacted files from version control. Then it will invoke the build system Python APIs to query moz.build files metadata for those files. It will then schedule tasks that moz.build metadata says is relevant.

FWIW, there is a JSON API on hg.mozilla.org that exposes moz.build files metadata (e.g. https://hg.mozilla.org/mozilla-central/json-mozbuildinfo/a18630f9ab42). In theory, the Buildbot scheduler could query this to obtain impacted tasks without having a source checkout (Task Cluster has the luxury of having a source checkout).
Bug 1245953 added support for limiting TaskCluster task scheduling to
when certain files change. This was implemented as a list of mozpath
patterns in TaskCluster's YAML files. A drawback of that approach is
YAML files in testing/taskcluster list paths from all over the tree.
This is prone to getting out of sync and can be a pain to update.

moz.build files already provide a mechanism for associating metadata
with files. And TaskCluster's scheduling code runs on a full source code
checkout and it knows which files have changed. Putting all that
together means it is possible for TaskCluster to query moz.build files
for the set of tasks that are relevant to a file.

This commit introduces the IMPACTED_TASKS Files metadata variable for
declaring which TaskCluster tasks are impacted by files [and should be
scheduled]. A subsequent commit will teach the TaskCluster code to query
this metadata.

Review commit: https://reviewboard.mozilla.org/r/35333/diff/#index_header
See other reviews: https://reviewboard.mozilla.org/r/35333/
Attachment #8720518 - Flags: review?(mh+mozilla)
We want to encourage tasks to only run when necessary. This commit
changes the default behavior of generic tasks to not run unless
their conditions are met. This is accomplished by the introduction of
the "always" condition flag.

We set this flag on old-style tasks to mimic existing behavior. All
new generic tasks will need to set this flag or define conditions
to trigger task scheduling.

While we're here, we also add an "explicitly-added" property to denote
that a task was explicitly requested to be run. This prevents such tasks
from getting filtered.

The filtering logic for tasks is kind of wonky and occurs in multiple
locations. The code should definitely be unified someday.

Review commit: https://reviewboard.mozilla.org/r/35335/diff/#index_header
See other reviews: https://reviewboard.mozilla.org/r/35335/
Attachment #8720519 - Flags: review?(garndt)
Querying moz.build files info from Python is pretty simple: just
instantiate a BuildReader and pass it a list of paths. We already have
the (possibly empty) set of changed files from version control. So
querying against the moz.build data is pretty easy!

As part of this, the eslint-gecko file patterns have been moved into
moz.build files. The moz.build syntax is a bit verbose because Files()
are limited to a single pattern. There is an open bug on supporting
multiple patterns per Files() instance that will make this cleaner.

Review commit: https://reviewboard.mozilla.org/r/35337/diff/#index_header
See other reviews: https://reviewboard.mozilla.org/r/35337/
Attachment #8720520 - Flags: review?(garndt)
I'm happy to use moz.build info -- that's great.  However, I think the arrow is pointing the wrong direction.  What I need is a way to query for the files that impact a particular test, and get some stable identifier for their status (a content hash or a cset id or whatever).  Also, just as we'll likely have lots of files that might impact a task, we'll have lots of tasks that might be impacted by changes to a file -- so a list in either direction is not going to be a great fit.  What about introducing some intermediate concept; for sake of argument let's use "fileset":

with File('**/*.rst'):
    FILESETS += ['docs-rst']

then in the task graph generation, I can refer to 'docs-rst' to get the identifier for all files in that fileset:

optimization:
  vcs-last-modified:
    - docs-rst
    - python-source  # assuming rst uses some auto* to pull docstrings
    - ...
> This commit introduces the IMPACTED_TASKS Files metadata variable for
> declaring which TaskCluster tasks are impacted by files [and should be
> scheduled]. A subsequent commit will teach the TaskCluster code to query
> this metadata.

We're growing a number of dependencies in moz.build that are defined by strings -- here, the 'eslint' task is identified by a string.  I see nothing in the patch that ensures that strings aren't typoed, and that task definitions, etc are up to date.  How hard would it be to ensure that the tasks referenced are actually defined in testing/taskcluster/tasks?
(In reply to Nick Alexander :nalexander from comment #5)
> We're growing a number of dependencies in moz.build that are defined by
> strings -- here, the 'eslint' task is identified by a string.  I see nothing
> in the patch that ensures that strings aren't typoed, and that task
> definitions, etc are up to date.  How hard would it be to ensure that the
> tasks referenced are actually defined in testing/taskcluster/tasks?

At run-time, it might add considerable overhead because TC tasks are defined as templatized YAML files and I've noticed that processing them seems to be a bit on the slow side. A shortcut might be for there to be a master list of task names somewhere in moz.build so the check is a simple set lookup.

Mossop already commented on IRC that we now have the file extensions used by eslint defined in 3 places. I was thinking about introducing a global read-only "data" dictionary to moz.build files where we could centrally define "constants" for things like this. We already have the CONFIG variable which is populated by output of configure. If was thinking about a DATA variable whose content comes from a Python file. We could easily make that data structure available to mach commands and other tools so we don't have to redundantly define these things.
(In reply to Dustin J. Mitchell [:dustin] from comment #4)
> I'm happy to use moz.build info -- that's great.  However, I think the arrow
> is pointing the wrong direction.  What I need is a way to query for the
> files that impact a particular test, and get some stable identifier for
> their status (a content hash or a cset id or whatever).

This is doable with the current patch. You can load all moz.build files, find the set of patterns that impact task X and then match every path in the source tree against those patterns. I don't think we have that Python API implemented currently, but it shouldn't be too bad. Performance could be problematic, as we'd potentially be testing 100,000+ paths.

> Also, just as we'll
> likely have lots of files that might impact a task, we'll have lots of tasks
> that might be impacted by changes to a file

Did you just saw the same thing in different ways? Or am I failing to parse?

> What about introducing some intermediate
> concept; for sake of argument let's use "fileset":
> 
> with File('**/*.rst'):
>     FILESETS += ['docs-rst']
> 
> then in the task graph generation, I can refer to 'docs-rst' to get the
> identifier for all files in that fileset:
> 
> optimization:
>   vcs-last-modified:
>     - docs-rst
>     - python-source  # assuming rst uses some auto* to pull docstrings
>     - ...

This is an interesting idea.

If File() could be named/referenced individually or as a group of related items, would that accomplish what you need? e.g.

  with Files('**/*.js', group='javascript-files'):
      pass
  with Files('**/*.jsm', group='javascript-files'):
      pass

  optimization:
    vcs-last-modified:
      - javascript-files
Comment on attachment 8720520 [details]
MozReview Request: Bug 1249091 - Incorporate moz.build Files metadata into TaskCluster scheduling; r?garndt

https://reviewboard.mozilla.org/r/35337/#review32009

::: testing/taskcluster/mach_commands.py:234
(Diff revision 1)
> -class Graph(object):
> +class Graph(MachCommandBase):

I admit, I don't know much about the mach system, so I'm not familiar with MachCommandBase.  Is there a good starting point for becoming familiar with this just so I can be more educated next time?

::: testing/taskcluster/mach_commands.py:439
(Diff revision 1)
> +            tags = task.get('tags', [])

Just to make sure I didn't overlook anything, nothing right now in these changes implements the 'tags' list, right?

::: testing/taskcluster/mach_commands.py:446
(Diff revision 1)
> +                for tag in tags:

Is the idea here that if I pass in "--job eslint-gecko" then I could have multiple tasks tagged with that and they would all get included in the graph rather than a mapping of one name to one task?
Attachment #8720520 - Flags: review?(garndt)
Comment on attachment 8720519 [details]
MozReview Request: Bug 1249091 - Make generic tasks disabled by default; r?garndt

https://reviewboard.mozilla.org/r/35335/#review32011
Attachment #8720519 - Flags: review?(garndt) → review+
(In reply to Gregory Szorc [:gps] from comment #6)
> Mossop already commented on IRC that we now have the file extensions used by
> eslint defined in 3 places. I was thinking about introducing a global
> read-only "data" dictionary to moz.build files where we could centrally
> define "constants" for things like this. We already have the CONFIG variable
> which is populated by output of configure. If was thinking about a DATA
> variable whose content comes from a Python file. We could easily make that
> data structure available to mach commands and other tools so we don't have
> to redundantly define these things.

I had thought of something similar a while ago and filed bug 1179251. My motivating use case was all the data that's currently in package-name.mk.

I was thinking maybe this would just wind up in Python as part of the configure rewrite, but that might be farther out than we'd like.
> > Also, just as we'll
> > likely have lots of files that might impact a task, we'll have lots of tasks
> > that might be impacted by changes to a file
> 
> Did you just saw the same thing in different ways? Or am I failing to parse?

You're failing to parse.  I'm saying it's a many-to-many relationship, and you will soon have hundreds of tasks listed for each fileset.

> This is an interesting idea.
> 
> If File() could be named/referenced individually or as a group of related
> items, would that accomplish what you need? e.g.
> 
>   with Files('**/*.js', group='javascript-files'):
>       pass
>   with Files('**/*.jsm', group='javascript-files'):
>       pass
> 
>   optimization:
>     vcs-last-modified:
>       - javascript-files

I feel like this is getting turned upside down: a substantial redesign that's not considering a lot of important use cases, but is landing in tree already.  I'm then going to have to reverse-engineer this, or design compatibility for it, in the project I'm working on with Brian and Jonas.  Maybe we should decide: are you (or the build module in general) going to own the in-tree configuration design, or should we continue working on it?  Or should I just write some code and get it landed as fast as I can so we have a stake in the ground, too?
Comment on attachment 8720518 [details]
MozReview Request: Bug 1249091 - Add IMPACTED_TASKS to moz.build Files metadata; r?glandium

https://reviewboard.mozilla.org/r/35333/#review32369

Code-wise, it's fine, but... cf. the ongoing discussion with Dustin.

::: python/mozbuild/mozbuild/frontend/context.py:753
(Diff revision 1)
> +                self[k] += sorted(v)

You want alphabetical_sorted from the sandbox, as that's what StringOrderingOnAppendList expects.
Attachment #8720518 - Flags: review?(mh+mozilla)
Product: Core → Firefox Build System
You need to log in before you can comment on or make changes to this bug.