Parallelize `mach artifact toolchain` download/unpack?

(Reporter: ted, Unassigned)

I was looking at the code for `mach artifact toolchain` and noticed that it does all its downloads in series currently:
https://dxr.mozilla.org/mozilla-central/rev/c2248f85346939d3e0b01f57276c440ccb2d16a1/python/mozbuild/mozbuild/mach_commands.py#1422

I wonder if it would be faster in CI to do the download/unpack in parallel? Downloads from S3 should be incredibly fast; I would be surprised if we could actually max out the network bandwidth. Disk I/O might be the limiting factor, but even then I suspect downloading multiple artifacts in parallel would be faster.

If we used multiple processes via concurrent.futures then we could also parallelize unpacks, which ought to scale well. We're doing all our builds on beefy multicore instances, so we might as well utilize those cores.
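A minimal sketch of the idea, with stand-in `fetch`/`unpack` functions (the real code would call the artifact cache and tooltool's `unpack_file`):

```python
import time
from concurrent.futures import ThreadPoolExecutor

ARTIFACTS = ["clang.tar.xz", "gcc.tar.xz", "rustc.tar.xz", "sccache2.tar.xz"]

def fetch(name):
    # Stand-in for the S3 download: network-bound, so threads are enough.
    time.sleep(0.01)
    return name

def unpack(name):
    # Stand-in for extraction. The real work is CPU-bound (xz), so a
    # ProcessPoolExecutor would be the way to use multiple cores.
    return name[:-len(".tar.xz")]

def fetch_and_unpack(name):
    return unpack(fetch(name))

with ThreadPoolExecutor(max_workers=8) as pool:
    # map() preserves input order even though workers run concurrently.
    done = list(pool.map(fetch_and_unpack, ARTIFACTS))

print(done)  # ['clang', 'gcc', 'rustc', 'sccache2']
```

Composing fetch and unpack per artifact (rather than staging all downloads, then all unpacks) lets early downloads start extracting while later ones are still in flight.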

Looking at an arbitrary linux64-opt task from inbound:
https://tools.taskcluster.net/groups/e67fUCzEQj2-P0CRMi1JiA/tasks/HKkMXMOuRA6IqhIPZc0lMg/details

The log shows:
[task 2017-11-29T16:32:21.217Z] 16:32:21     INFO - Copy/paste: /usr/bin/python2.7 -u /builds/worker/workspace/build/src/mach artifact toolchain -v --retry 4 --artifact-manifest /builds/worker/workspace/build/src/toolchains.json --tooltool-manifest /builds/worker/workspace/build/src/browser/config/tooltool-manifests/linux32/releng.manifest --tooltool-url http://relengapi/tooltool/ --cache-dir /builds/worker/tooltool-cache public/build/clang.tar.xz@cs28pPSXT0CrYwwX6YZYNg public/build/gcc.tar.xz@YwDW6Ek_Su60IwZXuvJVbA public/build/rustc.tar.xz@Uxt1EzDYTqu_TCZ3VKVQ8A public/build/sccache2.tar.xz@eTLeS7b6RTKubnSIZ8dQXA
<...>
[task 2017-11-29T16:32:21.971Z] 16:32:21     INFO -   0:00.66 materialized mozbuild.action.tooltool.FileRecord(filename='gtk3.tar.xz', size=11189216, digest='18bc52b0599b1308b667e282abb45f47597bfc98a5140cfcab8da71dacf89dd76d0dee22a04ce26fe7ad1f04e2d6596991f9e5b01fd2aaaab5542965f596b0e6', algorithm='sha512', visibility=None)
[task 2017-11-29T16:32:21.971Z] 16:32:21     INFO -   0:00.66 loaded manifest from file '/builds/worker/workspace/build/src/browser/config/tooltool-manifests/linux32/releng.manifest'
[task 2017-11-29T16:32:23.193Z] 16:32:23     INFO -   0:01.88 Downloading gtk3.tar.xz
<...>
[task 2017-11-29T16:32:33.951Z] 16:32:33     INFO -   0:12.64 Downloaded artifact to /builds/worker/tooltool-cache/877fef0f714edddc-sccache2.tar.xz
[task 2017-11-29T16:32:33.984Z] 16:32:33     INFO -   0:12.68 hashed u'/builds/worker/tooltool-cache/877fef0f714edddc-sccache2.tar.xz' with sha256 to be de695b7c87b7489baea3078b59848b12926ab7322640586434e1dd01673cefa2
[task 2017-11-29T16:32:34.036Z] 16:32:34     INFO -   0:12.73 untarring "/builds/worker/workspace/build/src/gtk3.tar.xz"
[task 2017-11-29T16:33:08.537Z] 16:33:08     INFO -   0:47.23 untarring "/builds/worker/workspace/build/src/clang.tar.xz"
[task 2017-11-29T16:33:22.955Z] 16:33:22     INFO -   1:01.65 untarring "/builds/worker/workspace/build/src/gcc.tar.xz"
[task 2017-11-29T16:33:38.909Z] 16:33:38     INFO -   1:17.60 untarring "/builds/worker/workspace/build/src/rustc.tar.xz"
[task 2017-11-29T16:33:51.417Z] 16:33:51     INFO -   1:30.11 untarring "/builds/worker/workspace/build/src/sccache2.tar.xz"
[task 2017-11-29T16:33:52.358Z] 16:33:52     INFO - Return code: 0

It looks like it takes just under 13 seconds to download all the toolchain artifacts, and then another 1m17s to unpack them all. Even if we just ran the unpacks in parallel, the long pole there looks to be gtk3.tar.xz taking ~35s to unpack, so if we could get the unpack step down to just that long we'd be saving ~40s of setup time.
Artifact download and extract accounts for a lot of the setup time in build tasks. It is well worth our time to make things parallel.
Blocks: fastci
This seems to work in very limited testing on try.

https://treeherder.mozilla.org/#/jobs?repo=try&revision=d3685cabcb417357c724f942beec08e7926ab618

I don't like how I've disabled cleaning up the download manager. I did that to work around the artifact cache cancelling other downloads happening in parallel after the first one completed.
Comment on attachment 8950567 [details] [diff] [review]
WIP for parallelizing tool downloading and unpacking

Nick, I think you added in the original support for using the download manager in artifacts.py. Do you remember why you added the call to .cancel() after finishing the download?
Attachment #8950567 - Flags: feedback?(nalexander)
Comment on attachment 8950567 [details] [diff] [review]
WIP for parallelizing tool downloading and unpacking

Review of attachment 8950567 [details] [diff] [review]:
-----------------------------------------------------------------

::: python/mozbuild/mozbuild/action/tooltool.py
@@ +513,5 @@
>          return None
>  
>  
>  def clean_path(dirname):
>      """Remove a subtree if is exists. Helper for unpack_file()."""

nit: might as well fix s/is exists/it exists/.

This comment needs to be updated 'cuz it's "if it exists and is a directory" now.

@@ +527,5 @@
>      Xz support is handled by shelling out to 'tar'."""
>      if tarfile.is_tarfile(filename):
>          tar_file, zip_ext = os.path.splitext(filename)
>          base_file, tar_ext = os.path.splitext(tar_file)
> +        clean_path(os.path.basename(base_file))

This changes the behaviour, so you need to update the comment to be clear about the directories.  I think this is the wrong way to achieve your goal, too -- you should be explicit about where you want the files to come from and not make assumptions about the current working directory, which is what this change does (AFAICT).

::: python/mozbuild/mozbuild/artifacts.py
@@ +790,5 @@
>              # Use the file name from the url if it looks like a hash digest.
>              if len(fname) not in (32, 40, 56, 64, 96, 128):
>                  raise TypeError()
>              binascii.unhexlify(fname)
> +            basename = fname

Is this WIP?  Why not just use `fname`?

@@ +841,5 @@
>                  'Downloaded artifact to {path}')
>              return os.path.abspath(mozpath.join(self._cache_dir, fname))
>          finally:
>              # Cancel any background downloads in progress.
> +            # self._download_manager.cancel()

I'm quite confident I cargo-culted this code from mozregression.  It wouldn't have been https://github.com/mozilla/mozregression/blob/b3ceac24abe292ac01e750cc1dad87f801f7aec0/mozregression/bisector.py#L425 exactly, but it would have been following that approach: belt and braces, cancel everything on the way out the door.

::: python/mozbuild/mozbuild/mach_commands.py
@@ +1402,5 @@
>              name, task_id = f.rsplit('@', 1)
>              record = ArtifactRecord(task_id, name)
>              records[record.filename] = record
>  
> +        def _fetch_record(record):

This makes less sense now, returning `record` or `None`.  Raise an exception if the `record` is not valid.

@@ +1476,5 @@
> +
> +        with ThreadPoolExecutor(8) as pool:
> +            # Download artifacts first
> +            unpacking = []
> +            for record in pool.map(_fetch_record, records.values()):

Can we not compose fetch and unpack?  Is there an advantage to staging manually in the way that you have?

@@ +1478,5 @@
> +            # Download artifacts first
> +            unpacking = []
> +            for record in pool.map(_fetch_record, records.values()):
> +                if record is None:
> +                    return 1

Follow-on from above: I think this would be better with an exception, rather than relying on logging and muffling the error to evolve safely over time.
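For illustration, the exception-based flow suggested here might look roughly like this (a sketch with dict records standing in for the real `ArtifactRecord` objects):

```python
from concurrent.futures import ThreadPoolExecutor

class ArtifactError(Exception):
    """Raised when a record fails to download or validate."""

def fetch_record(record):
    # Hypothetical validity check standing in for the real download logic:
    # raise instead of returning None so failures can't be silently dropped.
    if not record.get("valid", True):
        raise ArtifactError("invalid record: %s" % record["name"])
    return record["name"]

records = [{"name": "clang.tar.xz"}, {"name": "gcc.tar.xz"}]

with ThreadPoolExecutor(8) as pool:
    try:
        # pool.map re-raises a worker's exception when results are consumed.
        downloaded = list(pool.map(fetch_record, records))
    except ArtifactError:
        downloaded = None  # the mach command would `return 1` here

print(downloaded)  # ['clang.tar.xz', 'gcc.tar.xz']
```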
Attachment #8950567 - Flags: feedback?(nalexander) → feedback+
Comment on attachment 8950567 [details] [diff] [review]
WIP for parallelizing tool downloading and unpacking

Review of attachment 8950567 [details] [diff] [review]:
-----------------------------------------------------------------

::: python/mozbuild/mozbuild/action/tooltool.py
@@ +527,5 @@
>      Xz support is handled by shelling out to 'tar'."""
>      if tarfile.is_tarfile(filename):
>          tar_file, zip_ext = os.path.splitext(filename)
>          base_file, tar_ext = os.path.splitext(tar_file)
> +        clean_path(os.path.basename(base_file))

This is to try and fix tooltool requiring the tarball to be present in the current directory in order to unpack it. See the changes in mach_commands.py where I remove the os.link / shutil.copy.

You're right though, I don't really grok why this is here; it's kind of assuming that the tarball includes a directory called `base_file`.

::: python/mozbuild/mozbuild/artifacts.py
@@ +790,5 @@
>              # Use the file name from the url if it looks like a hash digest.
>              if len(fname) not in (32, 40, 56, 64, 96, 128):
>                  raise TypeError()
>              binascii.unhexlify(fname)
> +            basename = fname

So I can use `basename` in the Downloading message below. `fname` has the hash prepended to it in some cases.
Assignee: nobody → catlee
Attachment #8954929 - Flags: review?(nalexander) → review?(core-build-config-reviews)
Attachment #8950567 - Attachment is obsolete: true
Comment on attachment 8954929 [details]
Bug 1421734: Download and unpack toolchain artifacts in parallel

https://reviewboard.mozilla.org/r/224100/#review230120

I'm basically fine with this.  nits about whitespace, a real consideration about `cancel`, and some style choices.  I will be away until Monday, March 5 so if you want re-review, flag Build instead of me.  Thanks for making this happen!

::: python/mozbuild/mozbuild/artifacts.py:773
(Diff revision 1)
>  
>  
>  class ArtifactCache(object):
>      '''Fetch Task Cluster artifact URLs and purge least recently used artifacts from disk.'''
>  
> -    def __init__(self, cache_dir, log=None, skip_cache=False):
> +    def __init__(self, cache_dir, log=None, skip_cache=False, cancel_when_done=True):

I think you should just get rid of `cancel` entirely.  What you have now has a flag on the cache abstraction, but each call to fetch would invoke `cancel`, which doesn't make sense.  If you really want to make cancellation a thing, we should make the cache a Python context manager so that it can clean up after itself.  I don't think we want to do that right now.

::: python/mozbuild/mozbuild/artifacts.py:860
(Diff revision 1)
>              return
>  
>          self._persist_limit.remove_all()
>  
>  
> +

extra new lines?

::: python/mozbuild/mozbuild/mach_commands.py:1471
(Diff revision 1)
>              # when possible.
>              try:
>                  os.link(record.filename, local)
>              except Exception:
>                  shutil.copy(record.filename, local)
> +

eh?

::: python/mozbuild/mozbuild/mach_commands.py:1491
(Diff revision 1)
>                  unpack_file(local, record.setup)
>                  os.unlink(local)
>  
> +            return record
> +
> +        with ThreadPoolExecutor(8) as pool:

Man we have a lot of choices in the tree for the number of threads to use.  One more won't hurt, I guess.

::: python/mozbuild/mozbuild/mach_commands.py:1492
(Diff revision 1)
>                  os.unlink(local)
>  
> +            return record
> +
> +        with ThreadPoolExecutor(8) as pool:
> +            for record in pool.map(_fetch_and_unpack_record, records.values()):

Why isn't this just `downloaded = pool.map(...)`, perhaps with a `list` thrown in?

::: python/mozbuild/mozbuild/mach_commands.py:1495
(Diff revision 1)
> +
> +        with ThreadPoolExecutor(8) as pool:
> +            for record in pool.map(_fetch_and_unpack_record, records.values()):
> +                downloaded.append(record)
> +
> +        cache._download_manager.cancel()

Maybe this should be `try:finally:`, if you care about it.  And are you certain `cancel` is idempotent (i.e., you can call it twice)?
Attachment #8954929 - Flags: review+
Attachment #8954929 - Flags: review?(core-build-config-reviews)
Product: Core → Firefox Build System
Comment on attachment 8954929 [details]
Bug 1421734: Download and unpack toolchain artifacts in parallel

https://reviewboard.mozilla.org/r/224100/#review230120

> I think you should just get rid of `cancel` entirely.  What you have now has a flag on the cache abstraction, but each call to fetch would invoke `cancel`, which doesn't make sense.  If you really want to make cancellation a thing, we should make the cache a Python context manager so that it can clean up after itself.  I don't think we want to do that right now.

I was trying to preserve the previous behavior of cancelling per fetch. There are some other uses of the artifact cache.

However, removing the cancel doesn't seem to have any undesired effects.

> Maybe this should be `try:finally:`, if you care about it.  And are you certain `cancel` is idempotent (i.e., you can call it twice)?

This would have been called only once. I'll remove it entirely in the upcoming patch.
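For reference, the context-manager shape floated in the review would look roughly like this (a sketch with a stand-in download manager, not the real `ArtifactCache`):

```python
class FakeDownloadManager(object):
    """Stand-in for the mozregression-derived background download manager."""
    def __init__(self):
        self.cancelled = False
    def cancel(self):
        self.cancelled = True

class ArtifactCache(object):
    def __init__(self):
        self._download_manager = FakeDownloadManager()

    def fetch(self, url):
        # No per-fetch cancel() here; parallel fetches don't kill each other.
        return url

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        # Clean up any background downloads exactly once, on the way out.
        self._download_manager.cancel()
        return False

with ArtifactCache() as cache:
    path = cache.fetch("public/build/clang.tar.xz")
print(cache._download_manager.cancelled)  # True
```

This keeps the belt-and-braces cleanup while removing the per-fetch `cancel()` that was tripping up concurrent downloads.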
Pushed by gszorc@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/5ceb1365ae75
Download and unpack toolchain artifacts in parallel; r=nalexander
For reasons I can't explain, the makensis executable couldn't be located with this patch. This should have been caught as a fatal configure error. But it wasn't. The build failed later during installer generation due to the missing binary.

I suspect an issue with extraction path, cwd, or something of that nature.
Flags: needinfo?(gps) → needinfo?(catlee)
I think the problem is that nsis.tar.xz unpacks artifacts into mingw32/share/nsis/. mingw32.tar.xz also unpacks into the mingw32 directory. Tooltool expects that archives contain a single top-level directory inside them that matches the archive's basename. It enforces this by removing the top-level directory if it exists prior to unpacking. In this case, nsis.tar.xz is being unpacked first, and then when mingw32 is being processed, tooltool deletes the mingw32 directory before extracting the mingw32 archive.

See https://taskcluster-artifacts.net/WoQ5RO8UTiS-sPp7AbPlUw/0/public/logs/live_backing.log with more debugging output.
Flags: needinfo?(catlee)
Ugh. I suppose we should fix one of those archives to use a different path.
I think we should also make it an error for an archive to not include the correct top-level directory.
(In reply to Chris AtLee [:catlee] from comment #16)
> I think we should also make it an error for an archive to not include the
> correct top-level directory.

+1

Failing fast here is better than pulling out your hair trying to debug issues like this.
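The check being proposed could be as simple as refusing to unpack any archive whose members don't all live under the expected top-level directory. A sketch using in-memory tarballs (the real check would hook into tooltool's unpack path):

```python
import io
import tarfile

def make_tar(paths):
    # Build a tiny in-memory tarball containing empty files at `paths`.
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tf:
        for p in paths:
            tf.addfile(tarfile.TarInfo(p), io.BytesIO(b""))
    return buf.getvalue()

def check_top_level(tar_bytes, expected):
    # Fail fast if any member escapes the expected top-level directory.
    with tarfile.open(fileobj=io.BytesIO(tar_bytes)) as tf:
        bad = [m.name for m in tf.getmembers()
               if m.name != expected and not m.name.startswith(expected + "/")]
    if bad:
        raise ValueError("paths outside %s/: %s" % (expected, bad))

check_top_level(make_tar(["mingw32/bin/gcc"]), "mingw32")  # OK

# nsis.tar.xz-style layout: members under mingw32/, not nsis/.
try:
    check_top_level(make_tar(["mingw32/share/nsis/stub"]), "nsis")
except ValueError as e:
    error = str(e)
```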
I did a try push here: https://treeherder.mozilla.org/#/jobs?repo=try&revision=bf0d17855be34a443551b2a6b8fc38284c15eba8

Looks like we have some other archives that need to be restructured:
nsis.tar.xz (for mingw builds)
git.zip (for sccache build on Windows)
eslint.tar.gz (for lint docker image, update-verify docker image(!?))
I don't have time to chase down malformed tooltool archives. Hopefully someone else can pick this up!
Assignee: catlee → nobody
This patch is pretty good and the improvement to times in CI can be significant. We should try to find an owner for this...
Comment on attachment 8969495 [details]
Bug 1421734 - Build nsis archive with nsis as root directory;

https://reviewboard.mozilla.org/r/238240/#review246714
Attachment #8969495 - Flags: review+
Attachment #8969495 - Flags: review?(core-build-config-reviews)
Comment on attachment 8969496 [details]
Bug 1421734 - Use MinGit archive with matching root directory;

https://reviewboard.mozilla.org/r/238242/#review246718
Attachment #8969496 - Flags: review+
Attachment #8969496 - Flags: review?(core-build-config-reviews)
Comment on attachment 8969497 [details]
Bug 1421734 - Download and unpack toolchain artifacts in parallel;

https://reviewboard.mozilla.org/r/238244/#review246728

This looks fine to me.

Sorry that this got lost in the shuffle -- most of my attention is on GitHub these days.
Attachment #8969497 - Flags: review?(nalexander) → review+
Depends on: 1495641
I propose that to unblock this someone should write a script to do an audit of all our tooltool and toolchain artifacts. The simplest way to do this is likely just to hack the existing `mach artifact toolchain` code, which already knows how to handle both of those. I would suggest changing this code to take a list of tooltool manifests and process each of them:
https://dxr.mozilla.org/mozilla-central/rev/dcba2a476ccfb77f859b3614d2c77ed3cc06a225/python/mozbuild/mozbuild/mach_commands.py#1410-1419

and then change the toolchain artifact handling code here to just fetch every toolchain artifact (`toolchains` contains all of them, just feed those all into the loop that calculates `task_id` and `artifact_name` and creates an `ArtifactRecord`):
https://dxr.mozilla.org/mozilla-central/rev/dcba2a476ccfb77f859b3614d2c77ed3cc06a225/python/mozbuild/mozbuild/mach_commands.py#1421-1479

You'd also need to make `records` into a list since lots of artifacts will have duplicate filenames.

Then for verification, instead of actually unpacking the artifacts here, just list their contents and ensure that every path in the archive starts with `record.basename`:
https://dxr.mozilla.org/mozilla-central/rev/dcba2a476ccfb77f859b3614d2c77ed3cc06a225/python/mozbuild/mozbuild/mach_commands.py#1547

You'd probably want to do all this work atop catlee's existing patches here, since fetching a zillion artifacts is going to take some time! Also it would likely be fastest to just do this on an interactive Taskcluster task to get fast access to all the artifacts. (If you use an interactive task make sure you set the feature `relengAPIProxy: true` so you can fetch internal tooltool artifacts!)
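As a rough shape for that audit, the verification step could list each archive's members without extracting and collect offenders (a sketch where hypothetical `(name, basename)` pairs and an `open_archive` callable stand in for the real `ArtifactRecord` list and artifact cache):

```python
import io
import tarfile

def list_offenders(records, open_archive):
    # records: iterable of (archive_name, expected_basename) pairs.
    # open_archive: callable returning a file object for an archive name.
    offenders = {}
    for name, basename in records:
        with tarfile.open(fileobj=open_archive(name)) as tf:
            bad = [n for n in tf.getnames()
                   if n != basename and not n.startswith(basename + "/")]
        if bad:
            offenders[name] = bad
    return offenders

def fake_archive(paths):
    # Throwaway in-memory tarball for the demo below.
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tf:
        for p in paths:
            tf.addfile(tarfile.TarInfo(p), io.BytesIO(b""))
    buf.seek(0)
    return buf

archives = {
    "clang.tar.xz": fake_archive(["clang/bin/clang"]),
    "nsis.tar.xz": fake_archive(["mingw32/share/nsis/stub"]),
}
records = [("clang.tar.xz", "clang"), ("nsis.tar.xz", "nsis")]

offenders = list_offenders(records, archives.__getitem__)
print(offenders)  # {'nsis.tar.xz': ['mingw32/share/nsis/stub']}
```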

I'm not 100% sure, but I think this is probably the full list of tooltool manifests in the tree:
luser@eye7:/build/mozilla-central$ fd -t f -p '/tooltool-manifests/'
browser/config/tooltool-manifests/win32/gn-build.manifest
browser/config/tooltool-manifests/macosx64/releng.manifest
browser/config/tooltool-manifests/win32/vs2015.manifest
browser/config/tooltool-manifests/macosx64/cross-releng.manifest~
browser/config/tooltool-manifests/macosx64/cross-clang.manifest~
browser/config/tooltool-manifests/win32/build-clang-cl.manifest
browser/config/tooltool-manifests/win64/vs2015.manifest
browser/config/tooltool-manifests/win32/l10n.manifest
browser/config/tooltool-manifests/macosx64/cross-clang.manifest
browser/config/tooltool-manifests/linux64/jsshell.manifest
browser/config/tooltool-manifests/win64/l10n.manifest
browser/config/tooltool-manifests/win32/releng.manifest
browser/config/tooltool-manifests/macosx64/cross-releng.manifest
browser/config/tooltool-manifests/win64/releng.manifest
browser/config/tooltool-manifests/win64/sccache-build.manifest
browser/config/tooltool-manifests/macosx64/clang.manifest
testing/config/tooltool-manifests/win32/nodejs.manifest
testing/config/tooltool-manifests/androidarm_7_0/mach-emulator.manifest
testing/config/tooltool-manifests/win32/releng.manifest
testing/config/tooltool-manifests/androidx86_7_0/mach-emulator.manifest
testing/config/tooltool-manifests/macosx64/hostutils.manifest
testing/config/tooltool-manifests/linux64/jimdb-arm-pie.manifest
testing/config/tooltool-manifests/macosx64/jimdb-arm-pie.manifest
testing/config/tooltool-manifests/linux64/hostutils.manifest
testing/config/tooltool-manifests/macosx64/jimdb-arm.manifest
testing/config/tooltool-manifests/macosx64/nodejs.manifest
testing/config/tooltool-manifests/linux64/nodejs.manifest
testing/config/tooltool-manifests/linux64/jimdb-arm.manifest
testing/config/tooltool-manifests/linux64/jimdb-x86-pie.manifest
testing/config/tooltool-manifests/linux64/releng.manifest
testing/config/tooltool-manifests/macosx64/jimdb-x86-pie.manifest
testing/config/tooltool-manifests/macosx64/releng.manifest
testing/config/tooltool-manifests/macosx64/jimdb-x86.manifest
testing/config/tooltool-manifests/linux64/jimdb-x86.manifest
testing/config/tooltool-manifests/androidarm_4_3/mach-emulator.manifest
testing/config/tooltool-manifests/androidarm_4_3/releng.manifest
testing/config/tooltool-manifests/androidx86/mach-emulator.manifest
testing/config/tooltool-manifests/androidx86/releng.manifest
testing/config/tooltool-manifests/linux32/jimdb-arm-pie.manifest
testing/config/tooltool-manifests/linux32/hostutils.manifest
testing/config/tooltool-manifests/linux32/jimdb-arm.manifest
testing/config/tooltool-manifests/linux32/jimdb-x86-pie.manifest
testing/config/tooltool-manifests/linux32/nodejs.manifest
testing/config/tooltool-manifests/linux32/jimdb-x86.manifest
testing/config/tooltool-manifests/linux32/releng.manifest
testing/config/tooltool-manifests/androidx86_7_0/releng.manifest
mobile/android/config/tooltool-manifests/android-x86/releng.manifest
mobile/android/config/tooltool-manifests/android/releng.manifest
testing/mozharness/configs/openh264/tooltool-manifests/linux.manifest
testing/mozharness/configs/openh264/tooltool-manifests/win.manifest
testing/mozharness/configs/openh264/tooltool-manifests/android.manifest
testing/mozharness/configs/openh264/tooltool-manifests/osx.manifest