Closed Bug 1467359 Opened 7 years ago Closed 7 years ago

Fetch task improvements

Categories

(Firefox Build System :: Task Configuration, task)

task
Not set
normal

Tracking

(firefox-esr60 fixed, firefox63 fixed)

RESOLVED FIXED
mozilla63
Tracking Status
firefox-esr60 --- fixed
firefox63 --- fixed

People

(Reporter: gps, Assigned: gps)

References

Details

Attachments

(5 files, 5 obsolete files)

46 bytes, text/x-phabricator-request
tomprince
: review+
Details | Review
46 bytes, text/x-phabricator-request
tomprince
: review+
Details | Review
46 bytes, text/x-phabricator-request
Details | Review
46 bytes, text/x-phabricator-request
Details | Review
46 bytes, text/x-phabricator-request
Details | Review
Filing a bug to track the patches I'm about to submit for review.
Assignee: nobody → gps
Status: NEW → ASSIGNED
Previously, the fetch kind was defined by a job "using" flavor. An upcoming commit will need to derive new tasks from fetch job definitions. In order to do this, we require a transform. And this transform would need access to the original data from the job description. But by the time the "using" transform runs, a lot of this data is thrown away. It would be possible to stuff the meaningful metadata inside attributes on the created task. But this would result in the fetch logic being fragmented across multiple Python modules (a fetch-specific transform and a job/using- specific transform). I think it is better to keep all that logic in a single Python module. This commit converts the "using: fetch-url" job transform to a dedicated transform for the fetch kind. Since we're now using a dedicated transform, we no longer need to use the normal job schema for defining fetch jobs. So we refactor the schema a little so it is simpler. I verified the taskgraph output is nearly identical to before by diffing the JSON output of `mach taskgraph full`. Aside from the additions of ['worker']['implementation'] and ['worker']['os'] fields (which seem to be required by a schema somewhere), everything was identical.
Having to define them explicitly feels too redundant for my liking. I believe we didn't do this before because we were defining things in terms of "using: fetch-url." Now that we are using a custom transform, we can have nice things.
Previously, we told `tar` or `unzip` to operate on an explicit file. This worked when `tar` understood the compression format of the file. And this worked in the majority of cases. But `tar` does not support zstandard compression (at least not outside extremely new versions, which aren't yet widely deployed). And not all versions of `tar` support the `-a` argument. This commit changes our invocation of `tar` and `unzip` so input data is piped to it from Python. In the case of `tar`, we perform decompression in Python, if possible. This allows us to support zstandard and `tar` binaries that don't support `-a` to auto-detect the compression format.
Comment on attachment 8984029 [details] Bug 1467359 - Derive Treeherder symbol names for fetch jobs; r?tomprince Tom Prince [:tomprince] has approved the revision. https://phabricator.services.mozilla.com/D1575
Attachment #8984029 - Flags: review+
We add an is_archive() function to help identify a file as an archive. We make extract_archive() error if it is passed a file that isn't an archive. We now conditionally call extract_archive() depending on the result of is_archive(). This will make future code easier to implement.
Currently, we download archives and store them as artifacts verbatim. Sometimes it is useful to change the behavior of an archive. For example, we may want to normalize on a specific compression format. Or we may want to normalize the uid/gid stored in tar archives. Or we may want to strip leading path components and/or replace them with a static value. Or we may want to produce multiple archive variations - one optimized for size and another for speed. This commit introduces a mechanism for "rearchiving" archives that allows us to do these things. Given a source archive or directory and an output filename, the command will deduce the output format automatically and produce an archive matching the expected output. Subsequent work will integrate this feature into CI.
We currently feed static-url fetched artifacts directly into dependent tasks, where they are typically extracted. This works and is the simplest implementation. A deficiency with the existing model is that archives used by tasks are whatever is provided by upstream. We currently have a mix of bzip, gzip, and xz compressed tarballs. Compared to say zstandard, all of these archive formats are relatively slow to decompress. And in all cases but xz, their compression ratios aren't as good. This commit introduces the ability to "repackage" fetches obtained via static-url fetch types. Essentially, the job defines an array of "repackage" operations to perform to the fetch. Each entry gets converted to a new task which takes the original fetch artifact and repackages it to a new archive format / artifact. By default, fetched tarballs automatically derive a repackage task that produces a zstandard compressed tarball. It was tempting to perform the repackging in the main fetch task itself. This is certainly doable. However, if things were implemented this way, we'd need to fetch the original artifact when repackaging settings change. A goal of the fetch tasks is to minimize dependencies on 3rd party services. It is advantageous for the 3rd party fetches to run as infrequently as possible. This means we need to separate 3rd party fetches from repackaging tasks. Because we derive job descriptions that are fed into the job transform, we needed to update the "using: run-task" schema to allow task references in the "command" attribute of definitions.
Life is too short to wait for bzip2 or lzma decompression when faster alternatives are available. Now that we automatically produce zstandard tarballs for all fetched artifacts, it is trivial to change the dependencies so we use the (faster) zstandard tarballs instead of the vanilla upstream tarballs. The toolchain build scripts pull the fetch dependencies from an environment variable and automatically recognize and support zstandard tarballs. So we just need to point the dependencies at the zstandard tarballs and everything should "just work."
So we can produce artifacts of Git and Subversion checkouts.
Various tasks operate on files that come from a checkout of a Git or Subversion repository. This commit introduces 2 new fetch-content commands for producing archives for Git and Subversion checkouts. Essentially, you provide the URL of a repository, a revision to operate on, and a destination path and the commands write an archive of all files in the checkout at that revision.
We introduce a new fetch type that produces a tarball of a Subversion repository export. The tasks simply call into a `fetch-content` command. Rather than manually define the 3*6=18 tasks that would be required to fetch all LLVM project repositories, we also introduce an "llvm-release" fetch type that effectively calls into the svn-export-archive code path to generate all required tasks per entry. This keeps the YAML simple at the expense of some Python code. We don't yet hook the tasks or their artifacts up to the toolchain tasks. This will be done in a subsequent commit. I'm not yet 100% convinced that these tasks are producing the correct artifacts. But this definitely gets us a few steps closer to having toolchain tasks not perform `svn` operations in CI. Tweaks can be done as follow-ups easily enough.
Comment on attachment 8984028 [details] Bug 1467359 - Implement fetch-url jobs using a transform; r?tomprince Tom Prince [:tomprince] has approved the revision. https://phabricator.services.mozilla.com/D1574
Attachment #8984028 - Flags: review+
Comment on attachment 8984030 [details] Bug 1467359 - Refactor archive decompression; r?dustin Tom Prince [:tomprince] has approved the revision. https://phabricator.services.mozilla.com/D1576
Attachment #8984030 - Flags: review+
Comment on attachment 8984898 [details] Bug 1467359 - Use zstandard tarballs for toolchain fetches; r?glandium Mike Hommey [:glandium] has approved the revision. https://phabricator.services.mozilla.com/D1607
Attachment #8984898 - Flags: review+
Attachment #8984028 - Attachment description: Bug 1467359 - Implement fetch-url jobs using a transform; r?dustin → Bug 1467359 - Implement fetch-url jobs using a transform; r?tomprince
Attachment #8984029 - Attachment description: Bug 1467359 - Derive Treeherder symbol names for fetch jobs; r?dustin → Bug 1467359 - Derive Treeherder symbol names for fetch jobs; r?tomprince
Attachment #8984895 - Attachment is obsolete: true
Attachment #8984030 - Attachment description: Bug 1467359 - Pipe decompressed data to tar; r?dustin → Bug 1467359 - Refactor archive decompression; r?dustin
Pushed by gszorc@mozilla.com: https://hg.mozilla.org/integration/mozilla-inbound/rev/d22981772dd9 Implement fetch-url jobs using a transform; r=tomprince https://hg.mozilla.org/integration/mozilla-inbound/rev/3bb31648056b Derive Treeherder symbol names for fetch jobs; r=tomprince
Blocks: 1479533
Backed out for linting opt failure on taskgraph/transforms/fetch.py Push with failures: https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&filter-resultStatus=testfailed&filter-resultStatus=busted&filter-resultStatus=exception&filter-resultStatus=success&fromchange=3bb31648056bf9d7760c59420d26de25fbf2b268&tochange=f3adbcf8e716b4faf47710b745a7541ed0ee1ef7&filter-searchStr=Linting%20opt%20source-test-mozlint-py-flake8%20(f8)&selectedJob=190941495 Failure log: https://treeherder.mozilla.org/logviewer.html#?job_id=190941495&repo=mozilla-inbound&lineNumber=266 Backout link: https://hg.mozilla.org/integration/mozilla-inbound/rev/f3adbcf8e716b4faf47710b745a7541ed0ee1ef7 [task 2018-07-30T18:53:27.620Z] [task 2018-07-30T18:53:27.620Z] Error processing command. Ignoring because optional. (optional:packages.txt:comm/build/virtualenv_packages.txt) [task 2018-07-30T18:54:20.104Z] TEST-UNEXPECTED-ERROR | /builds/worker/checkouts/gecko/taskcluster/taskgraph/transforms/fetch.py:12:1 | 'voluptuous.Extra' imported but unused (F401) [taskcluster 2018-07-30 18:54:20.642Z] === Task Finished === [taskcluster 2018-07-30 18:54:20.642Z] Unsuccessful task run with exit code: 1 completed in 327.367 seconds
Flags: needinfo?(gps)
Pushed by gszorc@mozilla.com: https://hg.mozilla.org/integration/mozilla-inbound/rev/18e9f8b05874 Implement fetch-url jobs using a transform; r=tomprince https://hg.mozilla.org/integration/mozilla-inbound/rev/6dce0e15337b Derive Treeherder symbol names for fetch jobs; r=tomprince
Attachment #8984030 - Attachment is obsolete: true
Attachment #8984896 - Attachment is obsolete: true
Attachment #8984897 - Attachment is obsolete: true
Attachment #8984898 - Attachment is obsolete: true
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla63
Pushed by mozilla@jorgk.com: https://hg.mozilla.org/comm-central/rev/588cd38b040d Port bug 1467359 - add Fetch-URL to config.yml. rs=bustage-fix DONTBUILD
Relanded with trivial fix to remove unused import.
Flags: needinfo?(gps)
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: