Closed Bug 1467359 Opened 2 years ago Closed 2 years ago

Fetch task improvements

Categories

(Firefox Build System :: Task Configuration, task)


Tracking

(firefox-esr60 fixed, firefox63 fixed)

RESOLVED FIXED
mozilla63

People

(Reporter: gps, Assigned: gps)

References

Details

Attachments

(5 files, 5 obsolete files)

46 bytes, text/x-phabricator-request (tomprince: review+)
46 bytes, text/x-phabricator-request (tomprince: review+)
46 bytes, text/x-phabricator-request
46 bytes, text/x-phabricator-request
46 bytes, text/x-phabricator-request
Filing a bug to track the patches I'm about to submit for review.
Assignee: nobody → gps
Status: NEW → ASSIGNED
Previously, the fetch kind was defined by a job "using" flavor.

An upcoming commit will need to derive new tasks from fetch job
definitions. In order to do this, we require a transform. And
this transform would need access to the original data from the
job description. But by the time the "using" transform runs, a
lot of this data is thrown away. It would be possible to stuff
the meaningful metadata inside attributes on the created task.
But this would result in the fetch logic being fragmented across
multiple Python modules (a fetch-specific transform and a job/using-
specific transform). I think it is better to keep all that logic
in a single Python module.

This commit converts the "using: fetch-url" job transform to a
dedicated transform for the fetch kind.

Since we're now using a dedicated transform, we no longer need
to use the normal job schema for defining fetch jobs. So we
refactor the schema a little to make it simpler.

I verified the taskgraph output is nearly identical to before by
diffing the JSON output of `mach taskgraph full`. Aside from the
additions of ['worker']['implementation'] and ['worker']['os']
fields (which seem to be required by a schema somewhere), everything
was identical.
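The shape of such a kind-level transform can be sketched as follows. This is a minimal illustration of the pattern, not the real fetch transform: the field names and the attribute key are assumptions, and in the actual taskgraph code transforms are registered on a TransformSequence rather than called directly.

```python
def fetch_transform(config, jobs):
    """Sketch of a dedicated kind transform: it consumes fetch job
    definitions and yields task definitions.

    Field names here are illustrative, not the real fetch schema.
    """
    for job in jobs:
        yield {
            "label": "fetch-" + job["name"],
            "description": job.get("description", ""),
            # Because this runs as the kind's own transform, the original
            # fetch metadata is still in scope here, instead of having been
            # discarded by an earlier job/"using" transform.
            "attributes": {"fetch-artifact": job["artifact-name"]},
        }
```

Keeping the whole pipeline in one transform is what lets later commits derive new tasks from the same job definitions.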
Having to define Treeherder symbol names explicitly feels too redundant for my liking.

I believe we didn't do this before because we were defining things
in terms of "using: fetch-url." Now that we are using a custom
transform, we can have nice things.
Previously, we told `tar` or `unzip` to operate on an explicit file.
This worked when `tar` understood the compression format of the file.
And this worked in the majority of cases.

But `tar` does not support zstandard compression (at least not outside
extremely new versions, which aren't yet widely deployed). And not all
versions of `tar` support the `-a` argument.

This commit changes our invocations of `tar` and `unzip` so input
data is piped to them from Python. In the case of `tar`, we perform
decompression in Python, if possible. This allows us to support
zstandard, as well as `tar` binaries that lack the `-a` argument for
auto-detecting the compression format.
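A minimal sketch of this piping approach, assuming a Unix `tar` on PATH. The real fetch-content also handles zstandard via the third-party `zstandard` package; this stdlib-only sketch covers gzip, bzip2, and xz and passes anything else through raw:

```python
import bz2
import gzip
import lzma
import shutil
import subprocess


def decompressed_stream(fh, magic):
    """Wrap a raw file object in a stdlib decompressor chosen by magic bytes.

    (Sketch only: the real code also supports zstandard via the
    ``zstandard`` package.) Unrecognized formats return the raw stream.
    """
    if magic.startswith(b"\x1f\x8b"):
        return gzip.GzipFile(fileobj=fh)
    if magic.startswith(b"BZh"):
        return bz2.BZ2File(fh)
    if magic.startswith(b"\xfd7zXZ"):
        return lzma.LZMAFile(fh)
    return fh


def pipe_to_tar(archive_path, dest_dir):
    """Decompress in Python and pipe the plain tar stream to `tar` on
    stdin, so `tar` never needs to understand the compression format."""
    with open(archive_path, "rb") as fh:
        magic = fh.read(6)
        fh.seek(0)
        stream = decompressed_stream(fh, magic)
        proc = subprocess.Popen(
            ["tar", "-xf", "-", "-C", dest_dir], stdin=subprocess.PIPE
        )
        shutil.copyfileobj(stream, proc.stdin)
        proc.stdin.close()
        return proc.wait()
```

Because `tar` only ever sees an uncompressed stream on stdin, old `tar` binaries and exotic compression formats both stop being a problem.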
Comment on attachment 8984029 [details]
Bug 1467359 - Derive Treeherder symbol names for fetch jobs; r?tomprince

Tom Prince [:tomprince] has approved the revision.

https://phabricator.services.mozilla.com/D1575
Attachment #8984029 - Flags: review+
We add an is_archive() function to help identify a file as an archive.
We make extract_archive() error if it is passed a file that isn't an
archive. We now conditionally call extract_archive() depending on the
result of is_archive().

This will make future code easier to implement.
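The gating function can be as simple as the following sketch. The function name mirrors the commit message, but the suffix list and extension-based check are assumptions; the real implementation may also inspect file contents.

```python
def is_archive(path):
    """Guess whether ``path`` is an archive we know how to extract.

    Minimal sketch keyed on file extension (illustrative, not the real
    fetch-content logic).
    """
    suffixes = (".tar", ".tar.gz", ".tgz", ".tar.bz2", ".tar.xz",
                ".tar.zst", ".zip")
    return path.lower().endswith(suffixes)
```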
Currently, we download archives and store them as artifacts verbatim.

Sometimes it is useful to transform an archive. For example,
we may want to normalize on a specific compression format. Or we may want
to normalize the uid/gid stored in tar archives. Or we may want to
strip leading path components and/or replace them with a static value.
Or we may want to produce multiple archive variations - one optimized
for size and another for speed.

This commit introduces a mechanism for "rearchiving" archives that
allows us to do these things.

Given a source archive or directory and an output filename, the command
will deduce the output format automatically and produce an archive
matching the expected output.

Subsequent work will integrate this feature into CI.
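The format deduction step can be sketched as a hypothetical helper like the one below; the name and the set of recognized suffixes are assumptions for illustration, not the real command's table:

```python
def deduce_output_format(output_filename):
    """Deduce (container, compression) from the requested output filename.

    Hypothetical helper sketching the "deduce the output format
    automatically" step; suffix table is illustrative.
    """
    formats = {
        ".tar.zst": ("tar", "zstd"),
        ".tar.gz": ("tar", "gzip"),
        ".tar.xz": ("tar", "xz"),
        ".zip": ("zip", None),
    }
    for suffix, fmt in formats.items():
        if output_filename.endswith(suffix):
            return fmt
    raise ValueError("cannot deduce archive format from %s" % output_filename)
```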
We currently feed static-url fetched artifacts directly into dependent
tasks, where they are typically extracted. This works and is the
simplest implementation.

A deficiency with the existing model is that archives used by tasks
are whatever is provided by upstream. We currently have a mix of
bzip, gzip, and xz compressed tarballs. Compared to say zstandard,
all of these archive formats are relatively slow to decompress. And
in all cases but xz, their compression ratios aren't as good.

This commit introduces the ability to "repackage" fetches obtained
via static-url fetch types. Essentially, the job defines an array of
"repackage" operations to perform to the fetch. Each entry gets
converted to a new task which takes the original fetch artifact
and repackages it to a new archive format / artifact.

By default, fetched tarballs automatically derive a repackage task
that produces a zstandard compressed tarball.

It was tempting to perform the repackaging in the main fetch task
itself. This is certainly doable. However, if things were implemented
this way, we'd need to fetch the original artifact when repackaging
settings change. A goal of the fetch tasks is to minimize dependencies
on 3rd party services. It is advantageous for the 3rd party fetches
to run as infrequently as possible. This means we need to separate
3rd party fetches from repackaging tasks.

Because we derive job descriptions that are fed into the job
transform, we needed to update the "using: run-task" schema to
allow task references in the "command" attribute of definitions.
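The derivation step can be sketched roughly as follows. All keys and the default rule are illustrative assumptions about the scheme described above, not the real transform code:

```python
def derive_repackage_tasks(fetch_task):
    """Sketch: yield one derived task per "repackage" entry, with
    tarballs getting a default zstandard repackage. Keys are
    illustrative, not the real task schema.
    """
    entries = fetch_task.get("repackage")
    if entries is None and fetch_task["artifact"].endswith(
        (".tar.gz", ".tar.bz2", ".tar.xz")
    ):
        # Default: derive a zstandard tarball for every fetched tarball.
        entries = [{"format": "tar.zst"}]
    for entry in entries or []:
        root = fetch_task["artifact"].partition(".tar")[0]
        yield {
            "label": "%s-%s" % (fetch_task["label"], entry["format"]),
            # The repackage task depends on the fetch task, so changing
            # repackage settings never re-runs the 3rd party download.
            "dependencies": {"fetch": fetch_task["label"]},
            "artifact": "%s.%s" % (root, entry["format"]),
        }
```

The dependency edge is the important design point: the upstream fetch and the repackaging are separate tasks precisely so the 3rd party service is hit as rarely as possible.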
Life is too short to wait for bzip2 or lzma decompression when faster
alternatives are available.

Now that we automatically produce zstandard tarballs for all fetched
artifacts, it is trivial to change the dependencies so we use the
(faster) zstandard tarballs instead of the vanilla upstream tarballs.

The toolchain build scripts pull the fetch dependencies from an
environment variable and automatically recognize and support zstandard
tarballs. So we just need to point the dependencies at the zstandard
tarballs and everything should "just work."
So we can produce artifacts of Git and Subversion checkouts.
Various tasks operate on files that come from a checkout of a Git
or Subversion repository.

This commit introduces 2 new fetch-content commands for producing
archives for Git and Subversion checkouts. Essentially, you provide
the URL of a repository, a revision to operate on, and a destination
path and the commands write an archive of all files in the checkout
at that revision.
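The underlying VCS invocations might look like the following command builders. These are illustrative sketches only: the real fetch-content commands may clone first, use different flags, and then write the archive themselves (note that `git archive --remote` also requires server-side support).

```python
def git_archive_command(repo_url, revision, dest_path):
    """Illustrative `git archive --remote` invocation for exporting the
    files of one revision straight into an archive (assumption, not the
    real fetch-content implementation)."""
    return ["git", "archive", "--remote=" + repo_url,
            "--output=" + dest_path, revision]


def svn_export_command(repo_url, revision, dest_dir):
    """Illustrative `svn export` invocation for a checkout of one
    revision, which would then be archived separately."""
    return ["svn", "export", "--revision", str(revision), repo_url, dest_dir]
```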
We introduce a new fetch type that produces a tarball of a Subversion
repository export. The tasks simply call into a `fetch-content` command.

Rather than manually define the 3*6=18 tasks that would be required to
fetch all LLVM project repositories, we also introduce an
"llvm-release" fetch type that effectively calls into the
svn-export-archive code path to generate all required tasks per entry.
This keeps the YAML simple at the expense of some Python code.

We don't yet hook the tasks or their artifacts up to the toolchain
tasks. This will be done in a subsequent commit. I'm not yet 100%
convinced that these tasks are producing the correct artifacts. But
this definitely gets us a few steps closer to having toolchain tasks
not perform `svn` operations in CI. Tweaks can be done as follow-ups
easily enough.
Comment on attachment 8984028 [details]
Bug 1467359 - Implement fetch-url jobs using a transform; r?tomprince

Tom Prince [:tomprince] has approved the revision.

https://phabricator.services.mozilla.com/D1574
Attachment #8984028 - Flags: review+
Comment on attachment 8984030 [details]
Bug 1467359 - Refactor archive decompression; r?dustin

Tom Prince [:tomprince] has approved the revision.

https://phabricator.services.mozilla.com/D1576
Attachment #8984030 - Flags: review+
Comment on attachment 8984898 [details]
Bug 1467359 - Use zstandard tarballs for toolchain fetches; r?glandium

Mike Hommey [:glandium] has approved the revision.

https://phabricator.services.mozilla.com/D1607
Attachment #8984898 - Flags: review+
Attachment #8984028 - Attachment description: Bug 1467359 - Implement fetch-url jobs using a transform; r?dustin → Bug 1467359 - Implement fetch-url jobs using a transform; r?tomprince
Attachment #8984029 - Attachment description: Bug 1467359 - Derive Treeherder symbol names for fetch jobs; r?dustin → Bug 1467359 - Derive Treeherder symbol names for fetch jobs; r?tomprince
Attachment #8984895 - Attachment is obsolete: true
Attachment #8984030 - Attachment description: Bug 1467359 - Pipe decompressed data to tar; r?dustin → Bug 1467359 - Refactor archive decompression; r?dustin
Pushed by gszorc@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/d22981772dd9
Implement fetch-url jobs using a transform; r=tomprince
https://hg.mozilla.org/integration/mozilla-inbound/rev/3bb31648056b
Derive Treeherder symbol names for fetch jobs; r=tomprince
Blocks: 1479533
Backed out for linting opt failure on taskgraph/transforms/fetch.py

Push with failures: https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&filter-resultStatus=testfailed&filter-resultStatus=busted&filter-resultStatus=exception&filter-resultStatus=success&fromchange=3bb31648056bf9d7760c59420d26de25fbf2b268&tochange=f3adbcf8e716b4faf47710b745a7541ed0ee1ef7&filter-searchStr=Linting%20opt%20source-test-mozlint-py-flake8%20(f8)&selectedJob=190941495

Failure log: https://treeherder.mozilla.org/logviewer.html#?job_id=190941495&repo=mozilla-inbound&lineNumber=266

Backout link: https://hg.mozilla.org/integration/mozilla-inbound/rev/f3adbcf8e716b4faf47710b745a7541ed0ee1ef7


[task 2018-07-30T18:53:27.620Z] 
[task 2018-07-30T18:53:27.620Z] Error processing command. Ignoring because optional. (optional:packages.txt:comm/build/virtualenv_packages.txt)
[task 2018-07-30T18:54:20.104Z] TEST-UNEXPECTED-ERROR | /builds/worker/checkouts/gecko/taskcluster/taskgraph/transforms/fetch.py:12:1 | 'voluptuous.Extra' imported but unused (F401)
[taskcluster 2018-07-30 18:54:20.642Z] === Task Finished ===
[taskcluster 2018-07-30 18:54:20.642Z] Unsuccessful task run with exit code: 1 completed in 327.367 seconds
Flags: needinfo?(gps)
Pushed by gszorc@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/18e9f8b05874
Implement fetch-url jobs using a transform; r=tomprince
https://hg.mozilla.org/integration/mozilla-inbound/rev/6dce0e15337b
Derive Treeherder symbol names for fetch jobs; r=tomprince
Attachment #8984030 - Attachment is obsolete: true
Attachment #8984896 - Attachment is obsolete: true
Attachment #8984897 - Attachment is obsolete: true
Attachment #8984898 - Attachment is obsolete: true
https://hg.mozilla.org/mozilla-central/rev/18e9f8b05874
https://hg.mozilla.org/mozilla-central/rev/6dce0e15337b
Status: ASSIGNED → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla63
Pushed by mozilla@jorgk.com:
https://hg.mozilla.org/comm-central/rev/588cd38b040d
Port bug 1467359 - add Fetch-URL to config.yml. rs=bustage-fix DONTBUILD
Relanded with trivial fix to remove unused import.
Flags: needinfo?(gps)