Closed Bug 1392370 Opened 3 years ago Closed 2 years ago
Investigate and potentially use multi-threaded xz compression
Toolchain tasks can spend >7 minutes in xz compression. Modern versions of xz-utils support multi-threaded compression via -T. Archives produced with multi-threaded compression can still be read by legacy xz tools. Essentially, xz splits the input into blocks and compresses each independently. You lose some compression ratio; how much depends on the block size and the input to the compressor. Given that we spend minutes in xz, let's investigate parallel compression. If the compression ratio loss is minimal, we should probably use `xz -T 0` everywhere so we automatically scale to the number of available threads, or at least something like `xz -T 4` to get some compression speed-up.
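As a concrete sketch (file names hypothetical; xz >= 5.2 is assumed for -T support):

```shell
# Multi-threaded xz: -T 0 auto-detects the number of available cores.
# The output is still a valid .xz stream that single-threaded/legacy
# xz tools can decompress.
tar -cf - gcc | xz -T 0 > gcc.tar.xz

# The block size controls the parallelism/ratio trade-off: larger
# blocks compress better but limit how many threads can run at once.
tar -cf - gcc | xz -T 0 --block-size=64MiB > gcc.tar.xz
```

Plain single-threaded `xz -d` (or `tar -xJf`) reads the result either way.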
Why not use zstd?
I'm not opposed to zstd. We already use zstd (parallel compression, even) for Docker images. But zstd isn't as widely packaged yet, so the main downside is we'd have to install binaries in various places. Not the hardest thing to do. Also, tooltool may need to be taught about zstd for cases where we use "unpack": true.
Ah, gah, it would make life miserable for mach bootstrap. For instance, Ubuntu 16.04 (LTS and released a few months ago) comes with an old version of zstd, so even if it's installed, it can't read files produced by zstd 1.x.
While the xz-utils provided in Debian 7 is not modern enough, I added pxz to the toolchain-build docker image in bug 1427326. Comparing the time spent compressing an archive of GCC 6:

$ time tar -Jcf gcc.tar.xz gcc
real	6m17.828s
user	6m17.500s
sys	0m1.964s

$ time tar -cf - gcc | pxz --compress -T $(nproc) > gcc.tar.pxz
real	0m55.424s
user	9m50.416s
sys	0m1.824s

$ ls -l gcc.tar.xz gcc.tar.pxz
-rw-r--r-- 1 root root 160037852 Dec 29 08:52 gcc.tar.pxz
-rw-r--r-- 1 root root 156507952 Dec 29 08:50 gcc.tar.xz

Marginally larger, significantly faster.
Depends on: 1427326
FWIW:

# time tar -cf - gcc | zstd -o gcc.tar.zst
/*stdin*\            : 29.60%   (860518400 => 254714054 bytes, gcc.tar.zst)
real	0m7.734s
user	0m7.380s
sys	0m0.728s

Even with max compression and max time, we don't get near xz/pxz in terms of size:

# time tar -cf - gcc | zstd -19 -o gcc.tar.zst
/*stdin*\            : 21.21%   (860518400 => 182553689 bytes, gcc.tar.zst2)
real	4m1.778s
user	4m1.168s
sys	0m1.176s
# time tar -cf - gcc | zstd -19 -T$(nproc) -o gcc.tar.zst
/*stdin*\            : 21.55%   (860518400 => 185462606 bytes, gcc.tar.zst2)
real	0m26.783s
user	5m29.768s
sys	0m1.016s
zstd level 21 or 22 should get pretty close to xz. But probably still a bit larger. Those modes allocate a very large window and thus require a lot of memory to decompress. We actually had to stop using the higher levels for Mercurial bundles because 32-bit Python processes couldn't allocate enough memory. See bug 1344790.
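For illustration, a sketch of what using those levels looks like (file names hypothetical; assumes a zstd 1.x CLI): the highest levels must be unlocked with --ultra, and a memory-constrained consumer has to raise its decompression memory limit to accept the larger window.

```shell
# Levels 20-22 require --ultra and use a much larger match window,
# so the decompressor must allocate a comparably large buffer.
tar -cf - gcc | zstd --ultra -22 -o gcc.tar.zst

# A memory-constrained (e.g. 32-bit) consumer may need its memory
# limit raised, or it will refuse to decompress the archive:
zstd -d --memory=2048MB gcc.tar.zst -o gcc.tar
```

This is exactly the trap the Mercurial bundles hit: the producer can always afford the window, but every downstream decompressor has to as well.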
Comment on attachment 8943513 [details] Bug 1392370 - Enable xz parallel compression on Debian-based docker images. https://reviewboard.mozilla.org/r/213852/#review220054 I'll likely autoland this later once the trees are reopened. Don't want to land it now in case tons of stuff piles up behind it.
Attachment #8943513 - Flags: review?(gps) → review+
Pushed by firstname.lastname@example.org: https://hg.mozilla.org/integration/autoland/rev/78e74514176f Enable xz parallel compression on Debian-based docker images. r=gps