Closed Bug 1535146 Opened 5 years ago Closed 5 years ago

Hook up the WebRender shader cache on android

Categories

(Core :: Graphics: WebRender, enhancement, P2)

Tracking

RESOLVED FIXED
mozilla70
Tracking Status
firefox70 --- fixed

People

(Reporter: bholley, Assigned: jnicol)

References

Details

(Whiteboard: [wr-amvp][wr-q2])

Attachments

(2 files)

I've measured shader compilation janking WR rendering on the order of 300ms, which is something we should improve.

The current code for sharing compiled shaders across WebRender instances is a bit convoluted. I'm not sure exactly where the WR device boundaries are on GeckoView, but if SharedGL is used (bug 1532929), I'm pretty sure we should only ever need to compile each shader once for the lifetime of the GV parent process. Even so, we'll almost certainly want some combination of (a) serializing program binaries to persistent storage, and (b) eager linking/compilation on startup.

We already have infrastructure for serializing compiled shaders to disk, but it's currently only enabled on Windows [1]. The only explicitly platform-specific code I'm aware of is the profile directory selection, where there's already a non-Windows codepath that probably works but should be tested [2]. That said, there are other, subtler platform issues:

  • On Windows, we added special sauce in ANGLE [3] in order to link deserialized shaders in parallel on a background pool during startup [4]. That may not be available on any mobile drivers, but we should investigate.
  • On Mac, glGetProgramBinary always returns null, effectively preventing shader serialization. Hopefully we don't encounter this on mobile.

One significant issue at play is the difference in application models between desktop and mobile. Desktop Firefox renders the application UI with WR, so shader compilation is a startup performance issue on desktop. In contrast, GeckoView generally won't require the shaders until they're needed for content. This potentially gives us more time to work with, but also (combined with the generally shorter process lifetime on Android) increases the risk of janking actual web content. Our current strategy for deciding what goes into the cache is to serialize all shaders used within the first ten frames [5] (which generally means the browser UI plus the new tab page). That seems unlikely to give good results on GeckoView, so we'll need a different approach.

[1] https://searchfox.org/mozilla-central/rev/aae527894a97ee3bbe0c2cfce9c67c59e8b8fcb9/modules/libpref/init/all.js#938
[2] https://searchfox.org/mozilla-central/source/gfx/webrender_bindings/src/program_cache.rs#68
[3] bug 1494474
[4] https://searchfox.org/mozilla-central/rev/aae527894a97ee3bbe0c2cfce9c67c59e8b8fcb9/gfx/wr/webrender/src/device/gl.rs#2225
[5] https://searchfox.org/mozilla-central/rev/aae527894a97ee3bbe0c2cfce9c67c59e8b8fcb9/gfx/wr/webrender/src/device/gl.rs#2776
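
For context, the round trip that infrastructure is built around looks roughly like this. The trait below is a stand-in I've defined to keep the sketch self-contained; the real code goes through gleam's Gl trait and the program cache glue ([2] above), so treat the names as illustrative:

    use std::convert::TryInto;
    use std::fs;
    use std::io::{Read, Write};
    use std::path::Path;

    /// Stand-in for the small slice of GL we need (illustrative only).
    trait GlProgramBinary {
        /// Returns (binary, format); an empty binary means the driver declined.
        fn get_program_binary(&self, program: u32) -> (Vec<u8>, u32);
        fn program_binary(&self, program: u32, format: u32, binary: &[u8]);
    }

    /// Save a linked program's driver-specific binary, keyed by a digest of its source.
    fn save_program<G: GlProgramBinary>(gl: &G, program: u32, dir: &Path, digest: &str) -> std::io::Result<()> {
        let (binary, format) = gl.get_program_binary(program);
        if binary.is_empty() {
            // Some drivers (e.g. macOS GL) hand back nothing; there's nothing to cache.
            return Ok(());
        }
        let mut file = fs::File::create(dir.join(digest))?;
        file.write_all(&format.to_le_bytes())?;
        file.write_all(&binary)
    }

    /// Load a cached binary and hand it back to the driver via glProgramBinary.
    fn load_program<G: GlProgramBinary>(gl: &G, program: u32, dir: &Path, digest: &str) -> std::io::Result<()> {
        let mut bytes = Vec::new();
        fs::File::open(dir.join(digest))?.read_to_end(&mut bytes)?;
        let (format, binary) = bytes.split_at(4);
        gl.program_binary(program, u32::from_le_bytes(format.try_into().unwrap()), binary);
        Ok(())
    }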

Another option for implementing this on mobile is to build the shader binaries at build time. On Mali devices, we could integrate the offline shader compiler [1] and embed the shader binaries into the APK.

This would mean zero compilation overhead on the device.

[1] https://developer.arm.com/products/software-development-tools/graphics-development-tools/mali-offline-compiler

Is there any way we can run a post-install Activity? (firefox --precompile-shaders?) That'd be ideal: we could just bring the GL driver online and have it compile the binaries for us. That way we also wouldn't increase the binary wire size.

Compiling the shaders is generally a sub-second task, but the problem is that sub-second isn't fast enough at startup. It's plenty fast for install time.

Flags: needinfo?(snorp)

There's still the question of which shaders we'd precompile. The answer could be "all of them", but then we'd be reading all those binaries on startup and need to spend time linking them. We could consider multiple tiers - i.e. compiled + loaded versus compiled + loaded + linked.

Google is really cracking down on any code running without user notification. If your app is in the background, you need to show a persistent notification if it's running. So while we could maybe do this, we'd need a "Reticulating Splines..." type of notification shown to the user.

Flags: needinfo?(snorp)

When I say "compile shaders", I really mean "compile and link program binaries". I don't think explicitly caching shaders rather than programs is possible everywhere.

I think we want to cache all programs.

If the cache size gets too large, we could have a two-part cache: startup and runtime.

Ideally, we would support:

  • glGetProgramBinary/glProgramBinary
  • ANGLE_program_cache_control
  • EGL_ANDROID_blob_cache

(In reply to James Willcox (:snorp) (jwillcox@mozilla.com) (he/him) from comment #4)

Google is really cracking down on any code running without user notification. If your app is in the background, you need to show a persistent notification if it's running. So while we could maybe do this, we'd need a "Reticulating Splines..." type of notification shown to the user.

That's ok, this would be totally fine for a second or two post-install. Maybe it's worth trying to PoC this? This would be way better than doing things on first-run the way we do now. (though system updates could still invalidate our cache)

(In reply to Jeff Gilbert [:jgilbert] from comment #5)

When I say "compile shaders", I really mean "compile and link program binaries". I don't think explicitly caching shaders rather than programs is possible everywhere.

Yeah, I'm using shader and program interchangeably, I suppose I probably shouldn't.

I think we want to cache all programs.

If the cache size gets too large, we could have a two part cache: startup and runtime.

We already have runtime. Per comment 0, no program is ever compiled more than once for the lifetime of the process.

Ideally, we would support:

Interesting - how do the second and third relate to glProgramBinary? From a quick skim they appear to be about runtime sharing, which we basically already get through our infrastructure.

(In reply to James Willcox (:snorp) (jwillcox@mozilla.com) (he/him) from comment #4)

Google is really cracking down on any code running without user notification. If your app is in the background, you need to show a persistent notification if it's running. So while we could maybe do this, we'd need a "Reticulating Splines..." type of notification shown to the user.

That's ok, this would be totally fine for a second or two post-install. Maybe it's worth trying to PoC this? This would be way better than doing things on first-run the way we do now. (though system updates could still invalidate our cache)

That would certainly be nice to have, but I think we need to start by implementing this for the first-run case - since that's simpler, is an improvement over the status quo, and involves sorting out which programs to serialize/deserialize (which we'd need for the installer integration as well).

Here's a strawman proposal for how to pick the set:

We make a small test webpage that exercises the content codepaths that we believe to be common enough to warrant eager compilation / linking. We then load this page (without any Firefox browser chrome), log the shaders compiled, and encode the result in a whitelist. We can iterate on this testcase to ensure it includes the same shaders loaded on first visit to various sites on our topsite list. To ensure this whitelist stays up to date across future refactors of our shaders, we run an automation job (possibly via wrench) to verify that the whitelist matches what actually gets loaded.
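
To make that automation check concrete, a rough sketch of what such a job could assert follows. The shader names and the whitelist constant are purely illustrative, not the real set:

    use std::collections::HashSet;

    // Illustrative whitelist; the real one would be generated from the test page run.
    const STARTUP_WHITELIST: &[&str] = &["brush_solid", "brush_image", "cs_clip_rectangle"];

    /// Fail the job if the set of programs actually compiled while loading the
    /// test page has drifted from the checked-in whitelist.
    fn verify_whitelist(compiled: &HashSet<String>) -> Result<(), String> {
        let expected: HashSet<&str> = STARTUP_WHITELIST.iter().copied().collect();
        let missing: Vec<_> = expected.iter().filter(|name| !compiled.contains(**name)).collect();
        let extra: Vec<_> = compiled.iter().filter(|name| !expected.contains(name.as_str())).collect();
        if missing.is_empty() && extra.is_empty() {
            Ok(())
        } else {
            Err(format!("whitelist out of date: missing {:?}, extra {:?}", missing, extra))
        }
    }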

What I meant by "runtime" cache was more of:

  • Post-install-task: Cache binaries in "startup" disk cache
  • On startup, load startup cache ASAP. (hot path)
  • Stream in "deferred" (formerly "runtime") disk cache in the background
  • Cache everything forever
    • If not viable (combinatorial explosions?), LRU-expire 'old' cache entries during flushes to disk.
  • On cache miss, add to deferred cache.
    • Flush updates to the deferred cache back to disk periodically
  • Invalidate both caches on failed cache hit

I think pulling the "startup" cache from a corpus like you suggest is a good idea. I do however think the hottest path should be loading the home page/startup page(s), and that loading subsequent pages during navigation is a less-hot path.

The decisive question for all of this is how big the cache can get and, as it expands, how badly that affects loading the less-hot "deferred" cache from disk, especially in the context of all the other navigation overhead.
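
A rough sketch of that two-tier shape, keyed by source digest (all the names here are invented for illustration; nothing like this exists in the tree yet):

    use std::collections::HashMap;

    type Digest = String;
    type Binary = Vec<u8>;

    /// Two-tier in-memory view of the disk cache: the startup tier is loaded
    /// eagerly, the deferred tier is streamed in later and picks up new misses.
    struct ProgramCache {
        startup: HashMap<Digest, Binary>,
        deferred: HashMap<Digest, Binary>,
        dirty: bool, // whether the deferred tier needs flushing back to disk
    }

    impl ProgramCache {
        fn lookup(&self, digest: &str) -> Option<&Binary> {
            self.startup.get(digest).or_else(|| self.deferred.get(digest))
        }

        /// Called on a cache miss, after the program has been compiled from source.
        fn record_miss(&mut self, digest: Digest, binary: Binary) {
            self.deferred.insert(digest, binary);
            self.dirty = true; // flushed back to disk periodically
        }
    }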

(In reply to Bobby Holley (:bholley) from comment #6)

(In reply to Jeff Gilbert [:jgilbert] from comment #5)

Ideally, we would support:

Interesting - how do the second and third relate to glProgramBinary? From a quick skim they appear to be about runtime sharing, which we basically already get through our infrastructure.

The latter two address an underlying weakness of GetProgramBinary: it may not give you the bytecode for all variants of a program on all drivers. For instance, ANGLE uses different actual sub-binaries for PROVOKING_VERTEX_FIRST vs _LAST: one has a geometry shader and the other doesn't. For this example, ANGLE_program_cache_control and ANDROID_blob_cache would invoke a callback with a key that changes based on the provoking-vertex setting, so either or both underlying binaries could be cached. The same applies to any case of driver-side shader/program recompilation.
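
To illustrate the key/value model those extensions use: the driver hands the embedder a blob under an opaque key whenever it compiles something, and asks for the blob back by key later; the key changes whenever the driver wants a different variant. The real extension registers C ABI set/get callbacks (eglSetBlobCacheFuncsANDROID), and as noted later in this bug it may not even be exposed to applications; the sketch below is just an in-process stand-in showing the shape:

    use std::collections::HashMap;
    use std::sync::Mutex;

    // In-process stand-in for blob-cache callbacks: keyed by whatever opaque
    // key the driver provides, so e.g. both provoking-vertex variants of a
    // program can end up cached under different keys.
    static BLOB_CACHE: Mutex<Option<HashMap<Vec<u8>, Vec<u8>>>> = Mutex::new(None);

    fn set_blob(key: &[u8], value: &[u8]) {
        let mut cache = BLOB_CACHE.lock().unwrap();
        cache
            .get_or_insert_with(HashMap::new)
            .insert(key.to_vec(), value.to_vec());
    }

    fn get_blob(key: &[u8]) -> Option<Vec<u8>> {
        BLOB_CACHE.lock().unwrap().as_ref().and_then(|cache| cache.get(key).cloned())
    }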

Priority: -- → P2
Depends on: 1535745

(In reply to Jeff Gilbert [:jgilbert] from comment #9)

(In reply to Bobby Holley (:bholley) from comment #6)

(In reply to Jeff Gilbert [:jgilbert] from comment #5)

Ideally, we would support:

Interesting - how do the second and third relate to glProgramBinary? From a quick skim they appear to be about runtime sharing, which we basically already get through our infrastructure.

The latter two address an underlying weakness of GetProgramBinary: it may not give you the bytecode for all variants of a program on all drivers. For instance, ANGLE uses different actual sub-binaries for PROVOKING_VERTEX_FIRST vs _LAST: one has a geometry shader and the other doesn't.

This is what we solved with EGL_MOZ_create_context_provoking_vertex_dont_care, right?

For this example, ANGLE_program_cache_control and ANDROID_blob_cache would invoke a callback with a key that changes based on the provoking-vertex setting, so either or both underlying binaries could be cached. The same applies to any case of driver-side shader/program recompilation.

Are any of those likely to bite us in practice on mobile GPU drivers?

The downside of the blob cache model is that it puts the driver in charge, and the driver presumably requests the shaders on-demand. That would prevent us from doing the eager/parallel pre-linking we do on Windows, which was critical to getting startup perf down to an acceptable level. And supporting both models would add significant complexity to our story here, so it seems like something we'd want to avoid unless we really need it.

FWIW, Samsung's docs suggest skipping the blob cache and doing it our way: https://developer.samsung.com/game/opengl

We replaced create_context_provoking_vertex_dont_care with ANGLE_provoking_vertex, but as long as you only ever use it one way it should be fine.

There are a bunch of historical reasons for runtime shader recompilation; provoking vertex is just one example. It depends on the driver and is really opaque to us, but it's a cause of jank that a blob-cache-style approach can help with.

The blob cache model doesn't prevent us from doing early compilation: we just need to induce the driver into doing that compilation before we actually need it, by simply compiling and linking whatever shaders we want precompiled.

I do think that if we only do one thing, it should be GetProgramBinary, but I've been told by ANGLE and Google that they recommend using the other extensions as well, due to the concerns I've mentioned.

FWIW, the Samsung guide calls shader recompilation "shader patching"; it says it should have minimal overhead, but to still be aware of it.

Ok. Sounds like we should start by extending GetProgramBinary to GeckoView, and see how far that gets us. If we run into issues with shader-recomp, we can look into hooking up the blob cache.

Depends on: 1540576
Whiteboard: [wr-amvp][wr-q2]
Depends on: 1549927
Assignee: nobody → jnicol

Just getting up to speed on all of this now, here are my thoughts:

  • First paint (on 2nd or later launch) definitely seems snappier since the shader cache was enabled in bug 1540576, woohoo \o/

  • It seems like KHR_parallel_shader_compile basically isn't supported anywhere on GLES (https://opengles.gpuinfo.org/listreports.php?extension=GL_KHR_parallel_shader_compile - I know we don't always trust sites like these, but my devices don't claim to support it either). So if we want asynchronous and parallel shader compilation we'd need to do it ourselves, which could be messy.

  • I believe Sotaro is correct that EGL_ANDROID_blob_cache isn't exposed to applications; the caching should be done transparently by EGL. So if it's there, great; if not, our cache should do most of its work anyway.

So what's left to do here?

  • Loading the cache from disk sooner (bug 1549927)
  • Decide which set of shaders we want to be cached on disk
  • Look into compiling them in a background task on install/upgrade instead of on first launch

As an experiment I forced us to compile all possible shaders. (By setting RendererOptions.precache_flags = ShaderPrecacheFlags::FULL_COMPILE.) This took up 1.6MB on disk, which I don't think is unreasonable at all. Loading the cache from disk, however, seemed to take ~10ms if the disk was warm or ~50ms if cold - about 10x more than loading just the set of shaders used on mozilla.org, for comparison. 50ms seems too high, but perhaps acceptable with bug 1549927.
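
For reference, forcing the full compile looks roughly like this when constructing the renderer options (a sketch; I'm assuming ShaderPrecacheFlags is importable from the webrender crate root alongside RendererOptions):

    use webrender::{RendererOptions, ShaderPrecacheFlags};

    fn experiment_options() -> RendererOptions {
        RendererOptions {
            // Compile and cache every possible shader up front, instead of the
            // default lazy behaviour, to measure the worst-case disk cache size.
            precache_flags: ShaderPrecacheFlags::FULL_COMPILE,
            ..RendererOptions::default()
        }
    }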

I think the best approach, therefore, would be to save all compiled shaders to disk, not just those from the first 10 frames, but to prioritize which ones we load from disk into the cache and load the rest asynchronously. I think this is what Jeff was suggesting in comment 8.

(In reply to Jamie Nicol [:jnicol] from comment #15)

As an experiment I forced us to compile all possible shaders. (By setting RendererOptions.precache_flags = ShaderPrecacheFlags::FULL_COMPILE.) This took up 1.6MB on disk, which I don't think is unreasonable at all. Loading the cache from disk however seemed to take ~10ms if the disk was warm or ~50ms if cold.

Is that just for reading them in, or does it also include the time taken in the driver to link up the cached binary?

About 10x more than loading just the set of shaders used on mozilla.org, for comparison. 50ms seems too high, but perhaps acceptable with bug 1549927.

I think we should consider the common case to be "user clicks a link in another app and switches to the browser", where that optimization doesn't help. We can obviously add that optimization for the launch-to-homescreen case, we just shouldn't build our strategy around it.

I think the best approach, therefore, would be to save all compiled shaders to disk, not just those from the first 10 frames, but to prioritize which ones we load from disk into the cache and load the rest asynchronously. I think this is what Jeff was suggesting in comment 8.

How do you propose identifying which shaders should be prioritized?

(In reply to Bobby Holley (:bholley) from comment #16)

Is that just for reading them in, or does it also include the time taken in the driver to link up the cached binary?

That was just the time to read them; the time taken in the driver seemed negligible (microseconds).

I think we should consider the common case to be "user clicks a link in another app and switches to the browser", where that optimization doesn't help. We can obviously add that optimization for the launch-to-homescreen case, we just shouldn't build our strategy around it.

I agree. I hadn't realised that optimisation only applied when opening the homescreen; that's unfortunate. I don't think even the homescreen on fenix is rendered by webrender (though that is a guess). If so, the first page really could be anything even when the app is opened directly.

How do you propose identifying which shaders should be prioritized?

I liked your idea of a test case and using top sites, with some automation to ensure it stays correct.

(In reply to Jamie Nicol [:jnicol] from comment #17)

I agree. I hadn't realised that optimisation only applied when opening the homescreen; that's unfortunate. I don't think even the homescreen on fenix is rendered by webrender (though that is a guess). If so, the first page really could be anything even when the app is opened directly.

Right. Fenix homescreen is native code. None of the browser UI is rendered with Gecko.

Remember also that not all disks are created equal. There are some serious potato disks in some android phones. I expect disk access to be quite a bit worse on cheaper phones.

No longer blocks: wr-android-mvp

I've had a rough go at implementing this in a couple of different ways.

One of the issues is how we determine at run time which shaders we want to save/load to disk - i.e. not how we on Bugzilla decide in theory which ones we want, but how the cache code actually identifies those shaders. The cache sits at too low a level, I think: it only knows the source digests. I experimented with using a list of those (in a pref/gfxVar), which worked, but it might be a bit awkward to keep that list correct. I think shade.rs is the best place to identify the shaders properly, so we probably want to hard-code flags there saying whether each shader is "common" or not.

This is what I think I've settled on as my preferred approach:

  • Save all shaders to the disk cache; don't delete existing-but-unused ones after 10 frames.
  • Instead, save a whitelist of the ones to load on next startup.
    • We can get this from a hard-coded "common" flag in shade.rs, like I mentioned above.
    • Even just using the programs linked within the first 10 frames, like we currently do for desktop, works pretty well though, assuming there's a reasonable overlap in the set of shaders from page to page.
  • Load only the whitelisted shaders on startup.
  • In link_program(), if the shader is not in the cache, attempt to load it from disk.
    • The overhead of this on a miss is tiny.
    • This is too late to get the benefit of the driver asynchronously processing glProgramBinary, like when we call it in create_program() for shaders already in the cache. But it only takes a couple of milliseconds, rather than the couple of hundred needed to fully compile the shader. Doing the disk load in create_program() is unsuitable, because then we would attempt to load every shader from disk during Shaders::new() (because of the ShaderPrecacheFlags::ASYNC_COMPILE flag).

Actually, it's worth mentioning that Shaders::new() with ShaderPrecacheFlags::ASYNC_COMPILE is currently quite expensive, as it calls create_program() for every possible shader. For me it takes ~60ms even with no shaders in the cache (i.e. no glProgramBinary calls). Perhaps we want a new ShaderPrecacheFlags value which only precaches the shaders we have marked as "common".
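
A minimal sketch of the shape of that link_program() fallback; the types and function names are invented for the sketch, the real code lives in gl.rs and the program cache bindings:

    use std::collections::HashMap;
    use std::fs;
    use std::path::PathBuf;

    type Digest = String;

    struct ProgramBinaryCache {
        in_memory: HashMap<Digest, Vec<u8>>,
        disk_dir: PathBuf,
    }

    impl ProgramBinaryCache {
        /// Called from link_program(): prefer the binaries already loaded at
        /// startup, then fall back to the disk cache (cheap even on a miss),
        /// and only return None if we really have to compile from source.
        fn binary_for(&mut self, digest: &Digest) -> Option<&Vec<u8>> {
            if !self.in_memory.contains_key(digest) {
                if let Ok(bytes) = fs::read(self.disk_dir.join(digest)) {
                    self.in_memory.insert(digest.clone(), bytes);
                }
            }
            self.in_memory.get(digest)
        }
    }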

WebRender caches the program binaries of shaders used within the first
ten frames, so that on the next startup it can load them from disk
rather than having to recompile them.

Previously it would load all binaries found in the disk cache on
startup, and when saving to the cache it would delete any existing
binaries that weren't used.

This changes it so that unused binaries are not deleted. The disk
space this requires is insignificant, but as the cache grows, loading
all the shaders on startup can get expensive. To solve that, we write
a whitelist of the shaders used during startup, and only load those
during the next startup.
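
As a rough illustration of the whitelist persistence this describes (file name, format, and function names here are invented for the sketch, not what the patch actually uses):

    use std::fs;
    use std::io;
    use std::path::Path;

    /// Write out the digests of the programs used during this startup, one per
    /// line, so the next startup only eagerly loads those binaries.
    fn save_startup_whitelist(dir: &Path, used: &[String]) -> io::Result<()> {
        fs::write(dir.join("startup_whitelist"), used.join("\n"))
    }

    /// Read the whitelist back; an empty list just means everything gets loaded
    /// lazily (or recompiled) instead.
    fn load_startup_whitelist(dir: &Path) -> Vec<String> {
        fs::read_to_string(dir.join("startup_whitelist"))
            .map(|contents| contents.lines().map(str::to_owned).collect())
            .unwrap_or_default()
    }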

On startup some program binaries are loaded from disk into an
in-memory cache. When we call create_program() we check whether the
required program is present in this cache, and if so we call
glProgramBinary(). This is done early so that the driver can perform
any necessary work in the background.

There may, however, be binaries in the disk cache that have not yet
been loaded into memory, in order not to slow down startup. This
change makes it so that we attempt to load missing binaries from disk
during link_program(). The reason we do not do this in
create_program() is that it would result in loading all shaders from
disk during startup, which we want to avoid. Loading these shaders
may therefore take slightly longer than if they had been loaded at
startup, but it will still be much faster than recompiling them from
scratch, and startup will remain quick.

Depends on D33954

These patches make it so that we:

  • On startup, still only load the binaries used during the previous startup
  • But don't delete unused binaries from the disk cache
  • In link_program(), if the binary is not in the in-memory cache, attempt to load it from the disk cache.

So there is no hard-coded set of shaders we think we want for startup; it uses the same heuristic as before (first 10 frames). I think we should see how far this gets us. Shaders missing from that set won't be as big a problem as before, because we now load them from disk on demand instead of recompiling from scratch.

Pushed by jnicol@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/95b574e4e11f
Use a whitelist to decide which shaders to load from disk on startup. r=bholley
https://hg.mozilla.org/integration/autoland/rev/826f8589f165
Attempt to load non-startup shaders from disk cache when required. r=bholley

Can you file a followup bug on caching shaders encountered after the first 10 frames?

Flags: needinfo?(jnicol)
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla70
Flags: needinfo?(jnicol)
See Also: → 1567620
No longer depends on: 1549927