Closed Bug 1535146 Opened 5 years ago Closed 5 years ago

Hook up the WebRender shader cache on android

Categories

(Core :: Graphics: WebRender, enhancement, P2)

Tracking

RESOLVED FIXED
mozilla70
Tracking Status
firefox70 --- fixed

People

(Reporter: bholley, Assigned: jnicol)

References

Details

(Whiteboard: [wr-amvp][wr-q2])

Attachments

(2 files)

I've measured shader compilation janking WR rendering on the order of 300ms, which is something we should improve.

The current code for sharing compiled shaders across WebRender instances is a bit convoluted. I'm not sure exactly where the WR device boundaries are on GeckoView, but if SharedGL is used (bug 1532929), I'm pretty sure we should only ever need to compile each shader once for the lifetime of the GV parent process. Even so, we'll almost certainly want some combination of (a) serializing program binaries to persistent storage, and (b) eager linking/compilation on startup.

We already have infrastructure for serializing compiled shaders to disk, but it's currently only enabled on Windows [1]. The only explicitly platform-specific code I'm aware of is the profile directory selection, where there's already a non-Windows codepath that probably works but should be tested [2]. That said, there are other, subtler platform issues:

  • On Windows, we added special sauce in ANGLE [3] in order to link deserialized shaders in parallel on a background pool during startup [4]. That may not be available on any mobile drivers, but we should investigate.
  • On Mac, glGetProgramBinary always returns null, effectively preventing shader serialization. Hopefully we don't encounter this on mobile.

One significant issue at play is the difference in application models between desktop and mobile. Desktop Firefox renders the application UI with WR, so shader compilation is a startup performance issue on desktop. In contrast, GeckoView generally won't require the shaders until they're needed for content. This potentially gives us more time to work with, but also (combined with the generally shorter process lifetime on Android) increases the risk of janking actual web content. Our current strategy for deciding what goes into the cache is to serialize all shaders used within the first ten frames [5] (which generally means the browser UI plus the new tab page). That seems unlikely to give good results on GeckoView, so we'll need a different approach.

[1] https://searchfox.org/mozilla-central/rev/aae527894a97ee3bbe0c2cfce9c67c59e8b8fcb9/modules/libpref/init/all.js#938
[2] https://searchfox.org/mozilla-central/source/gfx/webrender_bindings/src/program_cache.rs#68
[3] bug 1494474
[4] https://searchfox.org/mozilla-central/rev/aae527894a97ee3bbe0c2cfce9c67c59e8b8fcb9/gfx/wr/webrender/src/device/gl.rs#2225
[5] https://searchfox.org/mozilla-central/rev/aae527894a97ee3bbe0c2cfce9c67c59e8b8fcb9/gfx/wr/webrender/src/device/gl.rs#2776
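
For context, the round trip that infrastructure is built around looks roughly like this. The trait below is a stand-in I've defined to keep the sketch self-contained; the real code goes through gleam's Gl trait and the program cache glue ([2] above), so treat the names as illustrative:

    use std::convert::TryInto;
    use std::fs;
    use std::io::{Read, Write};
    use std::path::Path;

    /// Stand-in for the small slice of GL we need (illustrative only).
    trait GlProgramBinary {
        /// Returns (binary, format); an empty binary means the driver declined.
        fn get_program_binary(&self, program: u32) -> (Vec<u8>, u32);
        fn program_binary(&self, program: u32, format: u32, binary: &[u8]);
    }

    /// Save a linked program's driver-specific binary, keyed by a digest of its source.
    fn save_program<G: GlProgramBinary>(gl: &G, program: u32, dir: &Path, digest: &str) -> std::io::Result<()> {
        let (binary, format) = gl.get_program_binary(program);
        if binary.is_empty() {
            // Some drivers (e.g. macOS GL) hand back nothing; there's nothing to cache.
            return Ok(());
        }
        let mut file = fs::File::create(dir.join(digest))?;
        file.write_all(&format.to_le_bytes())?;
        file.write_all(&binary)
    }

    /// Load a cached binary and hand it back to the driver via glProgramBinary.
    fn load_program<G: GlProgramBinary>(gl: &G, program: u32, dir: &Path, digest: &str) -> std::io::Result<()> {
        let mut bytes = Vec::new();
        fs::File::open(dir.join(digest))?.read_to_end(&mut bytes)?;
        let (format, binary) = bytes.split_at(4);
        gl.program_binary(program, u32::from_le_bytes(format.try_into().unwrap()), binary);
        Ok(())
    }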

Another option for implementing this on mobile is to build the shader binaries at build time. On Mali devices, we could integrate the offline shader compiler [1] and embed the shader binaries into the APK.

This would mean zero compilation overhead on the device.

[1] https://developer.arm.com/products/software-development-tools/graphics-development-tools/mali-offline-compiler

Is there any way we can run a post-install Activity? (firefox --precompile-shaders?) That'd be ideal: we could just bring the GL driver online and have it compile the binaries for us. That way we also wouldn't increase the binary wire size.

Compiling the shaders is generally a sub-second task, but the problem is that sub-second isn't fast enough at startup. It's plenty fast for install time.

Flags: needinfo?(snorp)

There's still the question of which shaders we'd precompile. The answer could be "all of them", but then we'd be reading all those binaries on startup and need to spend time linking them. We could consider multiple tiers - i.e. compiled + loaded versus compiled + loaded + linked.

Google is really cracking down on any code running without user notification. If your app is in the background, you need to show a persistent notification if it's running. So while we could maybe do this, we'd need a "Reticulating Splines..." type of notification shown to the user.

Flags: needinfo?(snorp)

When I say "compile shaders", I really mean "compile and link program binaries". I don't think explicitly caching shaders rather than programs is possible everywhere.

I think we want to cache all programs.

If the cache size gets too large, we could have a two-part cache: startup and runtime.

Ideally, we would support:

  • glGetProgramBinary/glProgramBinary
  • ANGLE_program_cache_control
  • EGL_ANDROID_blob_cache

(In reply to James Willcox (:snorp) (jwillcox@mozilla.com) (he/him) from comment #4)

Google is really cracking down on any code running without user notification. If your app is in the background, you need to show a persistent notification if it's running. So while we could maybe do this, we'd need a "Reticulating Splines..." type of notification shown to the user.

That's ok, this would be totally fine for a second or two post-install. Maybe it's worth trying to PoC this? This would be way better than doing things on first-run the way we do now. (though system updates could still invalidate our cache)

(In reply to Jeff Gilbert [:jgilbert] from comment #5)

When I say "compile shaders", I really mean "compile and link program binaries". I don't think explicitly caching shaders rather than programs is possible everywhere.

Yeah, I'm using shader and program interchangeably, I suppose I probably shouldn't.

I think we want to cache all programs.

If the cache size gets too large, we could have a two part cache: startup and runtime.

We already have runtime. Per comment 0, no program is ever compiled more than once for the lifetime of the process.

Ideally, we would support:

Interesting - how do the second and third relate to glProgramBinary? From a quick skim they appear to be about runtime sharing, which we basically already get through our infrastructure.

(In reply to James Willcox (:snorp) (jwillcox@mozilla.com) (he/him) from comment #4)

Google is really cracking down on any code running without user notification. If your app is in the background, you need to show a persistent notification if it's running. So while we could maybe do this, we'd need a "Reticulating Splines..." type of notification shown to the user.

That's ok, this would be totally fine for a second or two post-install. Maybe it's worth trying to PoC this? This would be way better than doing things on first-run the way we do now. (though system updates could still invalidate our cache)

That would certainly be nice to have, but I think we need to start by implementing this for the first-run case - since that's simpler, is an improvement over the status quo, and involves sorting out which programs to serialize/deserialize (which we'd need for the installer integration as well).

Here's a strawman proposal for how to pick the set:

We make a small test webpage that exercises the content codepaths that we believe to be common enough to warrant eager compilation / linking. We then load this page (without any Firefox browser chrome), log the shaders compiled, and encode the result in a whitelist. We can iterate on this testcase to ensure it includes the same shaders loaded on first visit to various sites on our topsite list. To ensure this whitelist stays up to date across future refactors of our shaders, we run an automation job (possibly via wrench) to verify that the whitelist matches what actually gets loaded.
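
To make that automation check concrete, a rough sketch of what such a job could assert follows. The shader names and the whitelist constant are purely illustrative, not the real set:

    use std::collections::HashSet;

    // Illustrative whitelist; the real one would be generated from the test page run.
    const STARTUP_WHITELIST: &[&str] = &["brush_solid", "brush_image", "cs_clip_rectangle"];

    /// Fail the job if the set of programs actually compiled while loading the
    /// test page has drifted from the checked-in whitelist.
    fn verify_whitelist(compiled: &HashSet<String>) -> Result<(), String> {
        let expected: HashSet<&str> = STARTUP_WHITELIST.iter().copied().collect();
        let missing: Vec<_> = expected.iter().filter(|name| !compiled.contains(**name)).collect();
        let extra: Vec<_> = compiled.iter().filter(|name| !expected.contains(name.as_str())).collect();
        if missing.is_empty() && extra.is_empty() {
            Ok(())
        } else {
            Err(format!("whitelist out of date: missing {:?}, extra {:?}", missing, extra))
        }
    }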

What I meant by "runtime" cache was more of:

  • Post-install-task: Cache binaries in "startup" disk cache
  • On startup, load startup cache ASAP. (hot path)
  • Stream in "deferred" (formerly "runtime") disk cache in the background
  • Cache everything forever
    • If not viable (combinatorial explosions?), LRU-expire 'old' cache entries during flushes to disk.
  • On cache miss, add to deferred cache.
    • Flush updates to the deferred cache back to disk periodically
  • Invalidate both caches on failed cache hit

I think pulling the "startup" cache from a corpus like you suggest is a good idea. I do however think the hottest path should be loading the home page/startup page(s), and that loading subsequent pages during navigation is a less-hot path.

The decisive question for all of this is how big the cache can get and, as it expands, how badly that affects loading the less-hot "deferred" cache from disk, especially in the context of all the other navigation overhead.
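
A rough sketch of that two-tier shape, keyed by source digest (all the names here are invented for illustration; nothing like this exists in the tree yet):

    use std::collections::HashMap;

    type Digest = String;
    type Binary = Vec<u8>;

    /// Two-tier in-memory view of the disk cache: the startup tier is loaded
    /// eagerly, the deferred tier is streamed in later and picks up new misses.
    struct ProgramCache {
        startup: HashMap<Digest, Binary>,
        deferred: HashMap<Digest, Binary>,
        dirty: bool, // whether the deferred tier needs flushing back to disk
    }

    impl ProgramCache {
        fn lookup(&self, digest: &str) -> Option<&Binary> {
            self.startup.get(digest).or_else(|| self.deferred.get(digest))
        }

        /// Called on a cache miss, after the program has been compiled from source.
        fn record_miss(&mut self, digest: Digest, binary: Binary) {
            self.deferred.insert(digest, binary);
            self.dirty = true; // flushed back to disk periodically
        }
    }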

(In reply to Bobby Holley (:bholley) from comment #6)

(In reply to Jeff Gilbert [:jgilbert] from comment #5)

Ideally, we would support:

Interesting - how do the second and third relate to glProgramBinary? From a quick skim they appear to be about runtime sharing, which we basically already get through our infrastructure.

The latter two address an underlying weakness of GetProgramBinary: it may not give you the bytecode for all variants of a program on all drivers. For instance, ANGLE uses different actual sub-binaries for PROVOKING_VERTEX_FIRST vs _LAST: one has a geometry shader and the other doesn't. For this example, ANGLE_program_cache_control and ANDROID_blob_cache would invoke a callback with a key that changes based on the provoking-vertex setting, so either or both underlying binaries could be cached. The same applies to any case of driver-side shader/program recompilation.
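
To illustrate the key/value model those extensions use: the driver hands the embedder a blob under an opaque key whenever it compiles something, and asks for the blob back by key later; the key changes whenever the driver wants a different variant. The real extension registers C ABI set/get callbacks (eglSetBlobCacheFuncsANDROID), and as noted later in this bug it may not even be exposed to applications; the sketch below is just an in-process stand-in showing the shape:

    use std::collections::HashMap;
    use std::sync::Mutex;

    // In-process stand-in for blob-cache callbacks: keyed by whatever opaque
    // key the driver provides, so e.g. both provoking-vertex variants of a
    // program can end up cached under different keys.
    static BLOB_CACHE: Mutex<Option<HashMap<Vec<u8>, Vec<u8>>>> = Mutex::new(None);

    fn set_blob(key: &[u8], value: &[u8]) {
        let mut cache = BLOB_CACHE.lock().unwrap();
        cache
            .get_or_insert_with(HashMap::new)
            .insert(key.to_vec(), value.to_vec());
    }

    fn get_blob(key: &[u8]) -> Option<Vec<u8>> {
        BLOB_CACHE.lock().unwrap().as_ref().and_then(|cache| cache.get(key).cloned())
    }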

Priority: -- → P2
Depends on: 1535745

(In reply to Jeff Gilbert [:jgilbert] from comment #9)

(In reply to Bobby Holley (:bholley) from comment #6)

(In reply to Jeff Gilbert [:jgilbert] from comment #5)

Ideally, we would support:

Interesting - how do the second and third relate to glProgramBinary? From a quick skim they appear to be about runtime sharing, which we basically already get through our infrastructure.

The latter two address an underlying weakness of GetProgramBinary: it may not give you the bytecode for all variants of a program on all drivers. For instance, ANGLE uses different actual sub-binaries for PROVOKING_VERTEX_FIRST vs _LAST: one has a geometry shader and the other doesn't.

This is what we solved with EGL_MOZ_create_context_provoking_vertex_dont_care, right?

For this example, ANGLE_program_cache_control and ANDROID_blob_cache would invoke a callback with a key that changes based on the provoking-vertex setting, so either or both underlying binaries could be cached. The same applies to any case of driver-side shader/program recompilation.

Are any of those likely to bite us in practice on mobile GPU drivers?

The downside of the blob cache model is that it puts the driver in charge, and the driver presumably requests the shaders on-demand. That would prevent us from doing the eager/parallel pre-linking we do on Windows, which was critical to getting startup perf down to an acceptable level. And supporting both models would add significant complexity to our story here, so it seems like something we'd want to avoid unless we really need it.

FWIW, Samsung's docs suggest skipping the blob cache and doing it our way: https://developer.samsung.com/game/opengl

We replaced create_context_provoking_vertex_dont_care with ANGLE_provoking_vertex, but as long as you only ever use it one way it should be fine.

There are a bunch of historical reasons for runtime shader recompilation; provoking vertex is just one example. It depends on the driver and is really opaque to us, but it's a cause of jank that a blob-cache-style approach can help with.

The blob cache model doesn't prevent us from doing early compilation: we just need to induce the driver into doing that compilation before we actually need it, by simply compiling and linking whatever shaders we want precompiled.

I do think that if we only do one thing, it should be GetProgramBinary, but I've been told by ANGLE and Google that they recommend using the other extensions as well, due to the concerns I've mentioned.

FWIW, the Samsung guide calls shader recompilation "shader patching"; it says it should have minimal overhead, but to still be aware of it.

Ok. Sounds like we should start by extending GetProgramBinary to GeckoView, and see how far that gets us. If we run into issues with shader-recomp, we can look into hooking up the blob cache.

Depends on: 1540576
Whiteboard: [wr-amvp][wr-q2]
Depends on: 1549927
Assignee: nobody → jnicol

Just getting up to speed on all of this now, here are my thoughts:

  • First paint (on 2nd or later launch) definitely seems snappier since the shader cache was enabled in bug 1540576, woohoo \o/

  • It seems like KHR_parallel_shader_compile basically isn't supported anywhere on GLES (https://opengles.gpuinfo.org/listreports.php?extension=GL_KHR_parallel_shader_compile - I know we don't always trust sites like these, but my devices don't claim to support it either). So if we want asynchronous and parallel shader compilation we'd need to do it ourselves, which could be messy.

  • I believe Sotaro is correct that EGL_ANDROID_blob_cache isn't exposed to applications; the caching should be done transparently by EGL. So if it's there, great; if not, our cache should do most of its work anyway.

So what's left to do here?

  • Loading the cache from disk sooner (bug 1549927)
  • Decide which set of shaders we want to be cached on disk
  • Look into compiling them in a background task on install/upgrade instead of on first launch

As an experiment I forced us to compile all possible shaders. (By setting RendererOptions.precache_flags = ShaderPrecacheFlags::FULL_COMPILE.) This took up 1.6MB on disk, which I don't think is unreasonable at all. Loading the cache from disk, however, seemed to take ~10ms if the disk was warm or ~50ms if cold - about 10x more than loading just the set of shaders used on mozilla.org, for comparison. 50ms seems too high, but perhaps acceptable with bug 1549927.
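
For reference, forcing the full compile looks roughly like this when constructing the renderer options (a sketch; I'm assuming ShaderPrecacheFlags is importable from the webrender crate root alongside RendererOptions):

    use webrender::{RendererOptions, ShaderPrecacheFlags};

    fn experiment_options() -> RendererOptions {
        RendererOptions {
            // Compile and cache every possible shader up front, instead of the
            // default lazy behaviour, to measure the worst-case disk cache size.
            precache_flags: ShaderPrecacheFlags::FULL_COMPILE,
            ..RendererOptions::default()
        }
    }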

I think the best approach, therefore, would be to save all compiled shaders to disk, not just those from the first 10 frames, but to prioritize which ones we load from disk into the cache and load the rest asynchronously. I think this is what Jeff was suggesting in comment 8.

(In reply to Jamie Nicol [:jnicol] from comment #15)

As an experiment I forced us to compile all possible shaders. (By setting RendererOptions.precache_flags = ShaderPrecacheFlags::FULL_COMPILE.) This took up 1.6MB on disk, which I don't think is unreasonable at all. Loading the cache from disk however seemed to take ~10ms if the disk was warm or ~50ms if cold.

Is that just for reading them in, or does it also include the time taken in the driver to link up the cached binary?

About 10x more than loading just the set of shaders used on mozilla.org, for comparison. 50ms seems too high, but perhaps acceptable with bug 1549927.

I think we should consider the common case to be "user clicks a link in another app and switches to the browser", where that optimization doesn't help. We can obviously add that optimization for the launch-to-homescreen case, we just shouldn't build our strategy around it.

I think the best approach, therefore, would be to save all compiled shaders to disk, not just those from the first 10 frames, but to prioritize which ones we load from disk into the cache and load the rest asynchronously. I think this is what Jeff was suggesting in comment 8.

How do you propose identifying which shaders should be prioritized?

(In reply to Bobby Holley (:bholley) from comment #16)

Is that just for reading them in, or does it also include the time taken in the driver to link up the cached binary?

That was just the time to read them; the time taken in the driver seemed negligible (microseconds).

I think we should consider the common case to be "user clicks a link in another app and switches to the browser", where that optimization doesn't help. We can obviously add that optimization for the launch-to-homescreen case, we just shouldn't build our strategy around it.

I agree. I hadn't realised that optimisation only applied when opening the homescreen; that's unfortunate. I don't think even the homescreen on fenix is rendered by webrender (though that is a guess). If so, the first page really could be anything even when the app is opened directly.

How do you propose identifying which shaders should be prioritized?

I liked your idea of a test case and using top sites, with some automation to ensure it stays correct.

(In reply to Jamie Nicol [:jnicol] from comment #17)

I agree. I hadn't realised that optimisation only applied when opening the homescreen; that's unfortunate. I don't think even the homescreen on fenix is rendered by webrender (though that is a guess). If so, the first page really could be anything even when the app is opened directly.

Right. Fenix homescreen is native code. None of the browser UI is rendered with Gecko.

Remember also that not all disks are created equal. There are some serious potato disks in some android phones. I expect disk access to be quite a bit worse on cheaper phones.

No longer blocks: wr-android-mvp

I've had a rough go at implementing this in a couple of different ways.

One of the issues is how we determine at run time which shaders we want to save/load to disk - i.e. not how we on Bugzilla decide in theory which ones we want, but how the cache code actually identifies those shaders. The cache sits at too low a level, I think: it only knows the source digests. I experimented with using a list of those (in a pref/gfxVar), which worked, but it might be a bit awkward to keep that list correct. I think shade.rs is the best place to identify the shaders properly, so we probably want to hard-code flags there saying whether each shader is "common" or not.

This is what I think I've settled on as my preferred approach:

  • Save all shaders to the disk cache; don't delete existing-but-unused ones after 10 frames.
  • Instead, save a whitelist of the ones to load on next startup.
    • We can get this from a hard-coded "common" flag in shade.rs, like I mentioned above.
    • Even just using the programs linked within the first 10 frames, like we currently do for desktop, works pretty well though, assuming there's a reasonable overlap in the set of shaders from page to page.
  • Load only the whitelisted shaders on startup.
  • In link_program(), if the shader is not in the cache, attempt to load it from disk.
    • The overhead of this on a miss is tiny.
    • This is too late to get the benefit of the driver asynchronously processing glProgramBinary, like when we call it in create_program() for shaders already in the cache. But it only takes a couple of milliseconds, rather than the couple of hundred needed to fully compile the shader. Doing the disk load in create_program() is unsuitable, because then we would attempt to load every shader from disk during Shaders::new() (because of the ShaderPrecacheFlags::ASYNC_COMPILE flag).

Actually, it's worth mentioning that Shaders::new() with ShaderPrecacheFlags::ASYNC_COMPILE is currently quite expensive, as it calls create_program() for every possible shader. For me it takes ~60ms even with no shaders in the cache (i.e. no glProgramBinary calls). Perhaps we want a new ShaderPrecacheFlags value which only precaches the shaders we have marked as "common".
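
A minimal sketch of the shape of that link_program() fallback; the types and function names are invented for the sketch, the real code lives in gl.rs and the program cache bindings:

    use std::collections::HashMap;
    use std::fs;
    use std::path::PathBuf;

    type Digest = String;

    struct ProgramBinaryCache {
        in_memory: HashMap<Digest, Vec<u8>>,
        disk_dir: PathBuf,
    }

    impl ProgramBinaryCache {
        /// Called from link_program(): prefer the binaries already loaded at
        /// startup, then fall back to the disk cache (cheap even on a miss),
        /// and only return None if we really have to compile from source.
        fn binary_for(&mut self, digest: &Digest) -> Option<&Vec<u8>> {
            if !self.in_memory.contains_key(digest) {
                if let Ok(bytes) = fs::read(self.disk_dir.join(digest)) {
                    self.in_memory.insert(digest.clone(), bytes);
                }
            }
            self.in_memory.get(digest)
        }
    }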

WebRender caches the program binaries of shaders used within the first
ten frames, so that on the next startup it can load them from disk
rather than having to recompile them.

Previously it would load all binaries found in the disk cache on
startup, and when saving to the cache it would delete any existing
binaries that weren't used.

This changes it so that unused binaries are not deleted. The disk
space this requires is insignificant, but as the cache grows, loading
all the shaders on startup can get expensive. To solve that, we write
a whitelist of the shaders used during startup, and only load those
during the next startup.
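
As a rough illustration of the whitelist persistence this describes (file name, format, and function names here are invented for the sketch, not what the patch actually uses):

    use std::fs;
    use std::io;
    use std::path::Path;

    /// Write out the digests of the programs used during this startup, one per
    /// line, so the next startup only eagerly loads those binaries.
    fn save_startup_whitelist(dir: &Path, used: &[String]) -> io::Result<()> {
        fs::write(dir.join("startup_whitelist"), used.join("\n"))
    }

    /// Read the whitelist back; an empty list just means everything gets loaded
    /// lazily (or recompiled) instead.
    fn load_startup_whitelist(dir: &Path) -> Vec<String> {
        fs::read_to_string(dir.join("startup_whitelist"))
            .map(|contents| contents.lines().map(str::to_owned).collect())
            .unwrap_or_default()
    }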

On startup some program binaries are loaded from disk into an
in-memory cache. When we call create_program() we check whether the
required program is present in this cache, and if so we call
glProgramBinary(). This is done early so that the driver can perform
any necessary work in the background.

There may, however, be binaries in the disk cache that have not yet
been loaded into memory, in order not to slow down startup. This
change makes it so that we attempt to load missing binaries from disk
during link_program(). The reason we do not do this in
create_program() is that it would result in loading all shaders from
disk during startup, which we want to avoid. Loading these shaders
may therefore take slightly longer than if they had been loaded at
startup, but it will still be much faster than recompiling them from
scratch, and startup will remain quick.

Depends on D33954

These patches make it so that we:

  • On startup, still only load the binaries used during the previous startup
  • But don't delete unused binaries from the disk cache
  • In link_program(), if the binary is not in the in-memory cache, attempt to load it from the disk cache.

So there is no hard-coded set of shaders we think we want for startup; it uses the same heuristic as before (first 10 frames). I think we should see how far this gets us. Shaders missing from that set won't be as big a problem as before, because we now load them from disk on demand instead of recompiling from scratch.

Pushed by jnicol@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/95b574e4e11f
Use a whitelist to decide which shaders to load from disk on startup. r=bholley
https://hg.mozilla.org/integration/autoland/rev/826f8589f165
Attempt to load non-startup shaders from disk cache when required. r=bholley

Can you file a followup bug on caching shaders encountered after the first 10 frames?

Flags: needinfo?(jnicol)
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla70
Flags: needinfo?(jnicol)
See Also: → 1567620
No longer depends on: 1549927