Closed Bug 1550108 Opened 6 months ago Closed Last month

Reduce time spent decompressing StartupCache entries

Categories

(Toolkit :: Startup and Profile System, enhancement, P3)

enhancement

Tracking

()

RESOLVED FIXED
mozilla71
Tracking Status
firefox68 - wontfix
firefox71 --- fixed

People

(Reporter: dthayer, Assigned: dthayer)

References

(Blocks 2 open bugs, Regressed 3 open bugs)

Details

(Keywords: main-thread-io, perf, Whiteboard: [fxperf:p1])

Attachments

(10 files, 2 obsolete files)

47 bytes, text/x-phabricator-request
Details | Review
47 bytes, text/x-phabricator-request
Details | Review
47 bytes, text/x-phabricator-request
Details | Review
47 bytes, text/x-phabricator-request
Details | Review
47 bytes, text/x-phabricator-request
Details | Review
47 bytes, text/x-phabricator-request
Details | Review
47 bytes, text/x-phabricator-request
Details | Review
47 bytes, text/x-phabricator-request
Details | Review
47 bytes, text/x-phabricator-request
Details | Review
47 bytes, text/x-phabricator-request
Details | Review

On reference hardware I see anywhere from 600ms to 3000ms spent in inflate under StartupCache::GetBuffer. I think we can do better with lz4. Currently we compress at compression level 1 with deflate, I imagine because compressing with deflate at max compression is too expensive. With lz4 I'm able to compress at its level 6 while taking less time, meaning the overall filesize is about the same, while taking an order of magnitude less time.

Priority: -- → P3

[fxperf:p1] because I'm currently working on it.

Assignee: nobody → dothayer
Status: NEW → ASSIGNED
Keywords: main-thread-io, perf
Whiteboard: [fxperf:p1]

I opted to go with what I perceived as the more expedient route
of leaving lz4 roughly where it is and just adding to that. The
biggest complication was xxhash, which is included elsewhere.
I'm not generally proficient with build-related things though so
my solution may be wrong and not just ugly.

Edits to all of the other locations that use nsIZipWriter to
follow. This only switches StartupCache. Regarding the substance
of the change, though, I opted for lz4 because it's the fastest
by a long shot on decompress, so we'll pay the smallest cost for
warm startups, and at compression level 6 it's able to match the
compression ratio of deflate at level 1, while still being
faster to compress. We could potentially opt for a higher
compression ratio alg, but I think the shutdown cost of doing so
may not be worth paying, considering we already opted to compress
with deflate at 1.

Depends on D30640

I split this out to hopefully make it easier to digest the more
substantive changes in the previous commit. The changes here
should just leave the existing behavior unchanged.

Depends on D30641

This will not behave exactly the same if we had previously written bad
data for the entry that would fail to decompress. I imagine this is rare
enough, and the consequences are not severe enough, that this should be
fine.

Depends on D30642

I thought I had already written out the patch to remove these, but
apparently not. Per discussion in the startup cache telemetry bug,
there should be no reason for doing this.

Depends on D30640

Attachment #9065458 - Attachment description: Bug 1550108 - Don't both reading omnijars from StartupCache r?kmag → Bug 1550108 - Don't read from app/gre caches in StartupCache r?kmag

This just splits out the InputBuffer and OutputBuffer helper classes
to make it cleaner for the StartupCache to include them.

Depends on D30643

I am not aware of anything that depends on StartupCache being a
zip file, and since I want to use lz4 compression because inflate
is showing up quite a lot in profiles, it's simplest to just use
a custom format. This loosely mimicks the ScriptPreloader code,
with a few diversions:

  • Obviously the contents of the cache are compressed. I used lz4
    for this as I hit the same file size as deflate at a compression
    level of 1, which is what the StartupCache was using previously,
    while decompressing an order of magnitude faster. Seemed like
    the most conservative change to make. I think it's worth
    investigating what the impact of slower algs with higher ratios
    would be, but for right now I settled on this. We'd probably
    want to look at zstd next.
  • I use streaming compression for this via lz4frame. This is not
    strictly necessary, but has the benefit of not requiring as
    much memory for large buffers, as well as giving us a built-in
    checksum, rather than relying on the much slower CRC that we
    were doing with the zip-based approach.
  • I coded the serialization of the headers inline, since I had to
    jump back to add the offset and compressed size, which would
    make the nice Code(...) method for the ScriptPreloader stuff
    rather more complex. Open to cleaner solutions, but moving it
    out just felt like extra hoops for the reader to jump through
    to understand without the benefit of being more concise.

Depends on D34651

We should be able to just let consumers borrow the buffer, as long
as we can guarantee that we'll keep it around. We shouldn't be
invalidating the cache with any significant frequency, so we
should be able to just hold onto old items indefinitely.

Depends on D34652

The first run loads more things into the StartupCache than are
used on the second and subsequent runs. This just ensures that if
the StartupCache diverges too far from its actual use that we will
rebuild it.

Depends on D34653

Attachment #9064033 - Attachment is obsolete: true
Attachment #9064034 - Attachment is obsolete: true

I updated the patch to just change the StartupCache's underlying format entirely, rather than extending nsIZipWriter to support lz4, as it's nonstandard. This means that D30640 is no longer strictly necessary, but I would like to keep it for reasons mentioned in the description of D34652.

This also facilitated eliminating some copies and making it easier to compact the cache if it's larger than it needs to be. Removing unused items from the cache means that every page that straddles the boundary between two cache items is fully utilized.

Talos seems rather happy with the changes overall. In my original approach, I compressed the cache as one big chunk and decompressed the whole thing on initialize. This had more dramatic results, but strangely it improved WebRender-enabled ts_paint while degrading WR-disabled ts_paint. I couldn't get this effect to go away, so I went with the per-item compression.

Does what it says on the tin. Once we have a central scheduling
system this should likely just consume that.

Depends on D34654

Kris, could you give an update on the status of the reviews of this? Does it look like it will be generally acceptable or do you foresee major changes required?

Flags: needinfo?(kmaglione+bmo)

Hey kmag, any chance you'll have a chance to look at this soon? Or would it be best to find a different reviewer?

Hey froydnj, would you be a good candidate to review these last few patches?

Flags: needinfo?(nfroyd)

(In reply to Mike Conley (:mconley) (:⚙️) from comment #16)

Hey froydnj, would you be a good candidate to review these last few patches?

Sure, I can look at these. Doug, throw these into my phabricator queue, please?

Flags: needinfo?(nfroyd)
Flags: needinfo?(kmaglione+bmo)
Flags: needinfo?(dothayer)

Reviewed (most of) the patches.

Flags: needinfo?(dothayer)
Attachment #9071478 - Attachment description: Bug 1550108 - Change StartupCache format from zip to custom r?kmag → Bug 1550108 - Change StartupCache format from zip to custom r?froydnj
Attachment #9071479 - Attachment description: Bug 1550108 - Eliminate large buffer copies from StartupCache r?kmag → Bug 1550108 - Eliminate large buffer copies from StartupCache r?froydnj
Attachment #9071480 - Attachment description: Bug 1550108 - Compact the StartupCache if it is bloated r?kmag → Bug 1550108 - Compact the StartupCache if it is bloated r?froydnj
Attachment #9073067 - Attachment description: Bug 1550108 - Prefetch StartupCache off main thread r?kmag → Bug 1550108 - Prefetch StartupCache off main thread r?froydnj

Quick update on where this is: I ran it through Talos to check that the numbers were still good, and somehow a ts_paint regression has crept in, so I'm trying to diagnose where this was introduced, as ts_paint was something like 5% green in the earlier versions of the patches.

I am almost certain that the above result is spurious at this point. ts_paint performs better locally, startup profiles better locally (less time spend in the StartupCache), and ts_paint on try profiles better all with these patches applied. I think stuff is just being shuffled around by this performing faster such that it happens to be a little worse on the hardware we're running this on.

Waiting for profiles on try from shippable builds, but assuming those come in looking the same my plan is just to button this up and take the nominal loss.

Adds an explicit error mapping function to Result, to simplify
using MOZ_TRY... macros.

Depends on D34651

Pushed by dothayer@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/0fd41e9dbd2a
Pull in secondary lz4 libraries r=glandium
https://hg.mozilla.org/integration/autoland/rev/395affa4c205
Don't read from app/gre caches in StartupCache r=kmag
https://hg.mozilla.org/integration/autoland/rev/7a009d42e7e7
Avoid decompressing entries just to check if they exist r=kmag
https://hg.mozilla.org/integration/autoland/rev/e63d0a1fec38
Split out Input/OutputBuffer into their own file r=froydnj
https://hg.mozilla.org/integration/autoland/rev/f950e30afd90
Add mapErr method to Result r=froydnj
https://hg.mozilla.org/integration/autoland/rev/60991713b1e2
Change StartupCache format from zip to custom r=froydnj
https://hg.mozilla.org/integration/autoland/rev/62416909cf67
Eliminate large buffer copies from StartupCache r=froydnj
https://hg.mozilla.org/integration/autoland/rev/0a1fa8d8bb3c
Compact the StartupCache if it is bloated r=froydnj
https://hg.mozilla.org/integration/autoland/rev/2560f0ab6ebf
Prefetch StartupCache off main thread r=froydnj
https://hg.mozilla.org/integration/autoland/rev/cbadfa2bbd7e
Reduce stack size on StartupCache threads r=froydnj
Pushed by dothayer@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/ea19d193ebbb
Pull in secondary lz4 libraries r=glandium
https://hg.mozilla.org/integration/autoland/rev/eb37df990734
Don't read from app/gre caches in StartupCache r=kmag
https://hg.mozilla.org/integration/autoland/rev/7915439e3c10
Avoid decompressing entries just to check if they exist r=kmag
https://hg.mozilla.org/integration/autoland/rev/fbba9fe43303
Split out Input/OutputBuffer into their own file r=froydnj
https://hg.mozilla.org/integration/autoland/rev/8cb90c97dddb
Add mapErr method to Result r=froydnj
https://hg.mozilla.org/integration/autoland/rev/992e2a6359f9
Change StartupCache format from zip to custom r=froydnj
https://hg.mozilla.org/integration/autoland/rev/849df6250b7e
Eliminate large buffer copies from StartupCache r=froydnj
https://hg.mozilla.org/integration/autoland/rev/c90c4ca6cd25
Compact the StartupCache if it is bloated r=froydnj
https://hg.mozilla.org/integration/autoland/rev/36a2b9138e11
Prefetch StartupCache off main thread r=froydnj
https://hg.mozilla.org/integration/autoland/rev/0c4837041392
Reduce stack size on StartupCache threads r=froydnj
Regressions: 1586744

Thanks a lot, it looks like it improved DevTools startup quite significantly!

== Change summary for alert #23353 (as of Sat, 05 Oct 2019 21:21:33 GMT) ==

Improvements:

43% damp console.openwithcache.open.DAMP linux64-shippable opt e10s stylo 506.01 -> 287.60
7% damp simple.webconsole.open.DAMP linux64-shippable opt e10s stylo 295.81 -> 275.02
7% damp complicated.webconsole.open.DAMP windows10-64-shippable-qr opt e10s stylo 322.21 -> 300.79
6% damp cold.webconsole.open.DAMP windows10-64-shippable opt e10s stylo 306.30 -> 287.18
6% damp cold.webconsole.open.DAMP windows10-64-shippable-qr opt e10s stylo 305.04 -> 286.02
...

For up to date results, see: https://treeherder.mozilla.org/perf.html#/alerts?id=23353

Duplicate of this bug: 1587288
Depends on: 1587567
No longer depends on: 1587567
Regressions: 1587567
Flags: needinfo?(dothayer)
Regressions: 1587112, 1587107
No longer regressions: 1587567

Update on this for any interested parties: it looks like we improved median time to first paint by something like 5-10% with this change. Here is the raw data - med_val is the median time to first paint, and the change landed on the 20191005 build:

----- before change -----
build: 20190917, median_first_paint: 2507
build: 20190918, median_first_paint: 2385
build: 20190919, median_first_paint: 2254
build: 20190920, median_first_paint: 2381
build: 20190921, median_first_paint: 2416
build: 20190922, median_first_paint: 2361
build: 20190923, median_first_paint: 2395
build: 20190924, median_first_paint: 2407
build: 20190925, median_first_paint: 2492
build: 20190926, median_first_paint: 2507
build: 20190927, median_first_paint: 2383
build: 20190928, median_first_paint: 1991
build: 20190929, median_first_paint: 2431
build: 20190930, median_first_paint: 2364
build: 20191001, median_first_paint: 2349
build: 20191002, median_first_paint: 2359
build: 20191003, median_first_paint: 2352
build: 20191004, median_first_paint: 2401
mean value: 2374

----- after change -----
build: 20191005, median_first_paint: 2246
build: 20191006, median_first_paint: 2223
build: 20191007, median_first_paint: 2182
mean value: 2217

6.61% improvement

The builds after 20191007 show even lower (better) numbers, but it's too early to tell if they are valid.

Just FTR, as it may be too late to be worth doing anything about it: I think this bug should have updated the CLOBBER file -- at least, I've had broken builds on a couple of machines recently, complaining about "No rule to make target '..../mfbt/lz4.c', needed by 'lz4.obj'", which I believe is a result of the file moves here (lz4.{c,h} moved into a subdirectory). Clobbering the objdir (where that out-of-date dependency is recorded) fixes this.

Blocks: clobber
Regressions: 1591172

(In reply to Jonathan Kew (:jfkthame) from comment #30)

Just FTR, as it may be too late to be worth doing anything about it: I think this bug should have updated the CLOBBER file -- at least, I've had broken builds on a couple of machines recently, complaining about "No rule to make target '..../mfbt/lz4.c', needed by 'lz4.obj'", which I believe is a result of the file moves here (lz4.{c,h} moved into a subdirectory). Clobbering the objdir (where that out-of-date dependency is recorded) fixes this.

I filed bug 1591172 about this, as the issue reappeared for me even after clobbering.

Have you by any chance relaxed the test to print out the warning result.isErr() in the following code a bit?

https://dxr.mozilla.org/mozilla-central/source/startupcache/StartupCache.cpp?q=StartupCache.cpp&redirect_type=direct#173

I am trying to figure out why I see many errors in locally executed |mach xpcshell-tests| of C-C TB (FULL DEBUG version) in the last month (it ran OK in August.).
https://bugzilla.mozilla.org/show_bug.cgi?id=1589779

While I am analyzing the many cases of indexedDB failures now observed in the local log, I noticed many warnings of the following form:
77:57.81 pid:29914 [29914, Main Thread] WARNING: 'result.isErr()', file /NEW-SSD/NREF-COMM-CENTRAL/mozilla/startupcache/StartupCache.cpp, line 173

I am not sure if the warning is related to the indexedDB error I observe or not.
But it could be that maybe xpcshell-tests harness may not populate the startup cache correctly or something.

I just wanted to be sure that the warning could be safely ignored and unrelated to the indexedDB failures I observe.
At the same time, it may be nice if we can reduce the warning to a single line instead of producing it many times (maybe very difficult to do.)

Anyway, just FYI.

TIA

Regressions: 1594051
Regressions: 1594056
Regressions: 1595596
You need to log in before you can comment on or make changes to this bug.