918941 - (webgl-shader-cache) cache results of shader compilation

Vladimir Vukicevic [:vlad] [:vladv] (needinfo me, slow to respond)

Reporter

Description

•

12 years ago

Compiling WebGL shaders is really expensive, especially on Win32; it is also slow on mobile. We should cache as much as we can. For example, given the same WebGL GLSL input and the same Firefox/ANGLE/driver versions, we should be able to load HLSL bytecode from a previous ANGLE->HLSL compiler run, or shader binary code on mobile.

Kelsey Gilbert [:jgilbert]

Comment 1

•

12 years ago

More generally, we should cache program binaries for all platforms. ANGLE supports a form of shader program binaries, so we can just treat it like any other type of shader binary.

Kelsey Gilbert [:jgilbert]

Updated

•

12 years ago

OS: Windows 8 → All

Hardware: x86_64 → All

Vladimir Vukicevic [:vlad] [:vladv] (needinfo me, slow to respond)

Reporter

Comment 2

•

12 years ago

Yeah that's true, this could be done at the GLContext layer.

Milan Sreckovic [:milan] (needinfo for best results)

Comment 3

•

12 years ago

Are there any preferences or other settings that could make the cached shaders "invalid", or at least require us to purge this cache?

Vladimir Vukicevic [:vlad] [:vladv] (needinfo me, slow to respond)

Reporter

Comment 4

•

12 years ago

We should probably collect that list explicitly... at first pass:

- gl strings: GL_VERSION/GL_VENDOR/GL_RENDERER/GL_SHADING_LANGUAGE_VERSION (of the actual GLContext, not the WebGL ones)
- some ANGLE version string if it's not part of the above
- webgl.prefer-native-gl
- webgl.shader_validator
- (windows) D3D compiler DLL version

I don't think we need to depend on the underlying D3D driver below ANGLE, since we'd be caching HLSL bytecodes there.
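As a rough illustration of the list above, a cache key could be derived by hashing the shader source together with every input that can invalidate a cached binary. This is a Python sketch, not Gecko code: the `shader_cache_key` helper, the dictionary field names, and all example values are hypothetical.

```python
import hashlib
import json


def shader_cache_key(glsl_source: str, environment: dict) -> str:
    """Return a hex digest that changes when the source or any listed input changes."""
    # Sort keys so the same environment always serializes to the same bytes.
    env_blob = json.dumps(environment, sort_keys=True, separators=(",", ":"))
    digest = hashlib.sha256()
    digest.update(env_blob.encode("utf-8"))
    digest.update(b"\0")  # separator between the environment and the source
    digest.update(glsl_source.encode("utf-8"))
    return digest.hexdigest()


# Hypothetical placeholder values; real ones would come from the actual GLContext.
example_env = {
    "GL_VERSION": "example-version",
    "GL_VENDOR": "example-vendor",
    "GL_RENDERER": "example-renderer",
    "GL_SHADING_LANGUAGE_VERSION": "example-glsl-version",
    "angle_version": "example-angle",
    "webgl.prefer-native-gl": False,
    "webgl.shader_validator": True,
    "d3d_compiler_dll_version": "example-dll",
}

key = shader_cache_key("void main() { gl_Position = vec4(0.0); }", example_env)
```

With a key like this, flipping a pref or updating a driver simply produces a different key, so stale entries are never looked up and can be purged lazily.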

Milan Sreckovic [:milan] (needinfo for best results)

Comment 5

•

12 years ago

Granted, the shaders I worked with in the past were likely more complicated, but there is no way that app could have survived at all without us caching the compiled shaders. It would be nice to somehow get these numbers for our scenario though, so that we at least know what we're talking about. Is there a way to just time the compile? The total of those times is the most a cache could save us, so it would be our speed limit.

Vladimir Vukicevic [:vlad] [:vladv] (needinfo me, slow to respond)

Reporter

Comment 6

•

12 years ago

Yeah, we can just time how long glCompileShader/glLinkProgram take in total. That will be -most- of the total time saved that would be replaced by an internal read-from-cache. Could do that by modifying the JS or by modifying Firefox itself. We really should add some telemetry for this too, now that I think about it. Also, the shadertoy shaders are extremely complicated, way more than what you'd generally see in the real world :/ That's what makes this so painful.

Marco Mucci [:MarcoM]

Updated

•

12 years ago

Blocks: gecko-games

Milan Sreckovic [:milan] (needinfo for best results)

Comment 7

•

12 years ago

It may be premature claiming that this blocks Gecko as a gaming platform; we do need numbers.

Kelsey Gilbert [:jgilbert]

Comment 8

•

12 years ago

(In reply to Milan Sreckovic [:milan] from comment #7)
> It may be premature claiming that this blocks Gecko as a gaming platform; we
> do need numbers.

I don't have the numbers in front of me right now, but we've seen numbers from games developers before, and it's definitely something that will help with start-up time. Off the top of my head, we had one example with ~110 shaders, a couple of which took many hundreds of milliseconds each to compile.

Milan Sreckovic [:milan] (needinfo for best results)

Comment 9

•

12 years ago

That's a good example, thanks. I didn't mean to deprioritize this bug, I was just trying to understand why we're saying that one shouldn't do games on Gecko until it is fixed. However, it may be more of a "related" than "blocking" relationship we're trying to describe. It also sounds like we either have the numbers or can get them once we want to measure if "we're done" with this bug...

Chad Austin

Comment 10

•

11 years ago

FWIW, our WebGL application spends several seconds compiling shaders. We would benefit greatly from a compiled shader cache. The application is behind closed beta right now, but if you'd like access for testing, I can get you an account.

Walter Litwinczyk [:walter]

Comment 11

•

11 years ago

I gathered some simple results. I ran four WebGL demos:

Unigine Crypt Demo: http://crypt-webgl.unigine.com/
Unity Angry Robots Demo, and the two Dead Trigger Demos: http://blogs.unity3d.com/2014/04/29/on-the-future-of-web-publishing-in-unity/

Conveniently nVIDIA on Linux does have a shader cache, so I also ran the demos a second time to see what the speed up might be.

Link Time is the time WebGLContext::LinkProgram() took, full function.
Compile Time is the time ShCompile() call in WebGLContext::CompileShader() took.

Ran on Ubuntu 14.04 x64 and Mac OSX 10.9 x64 on a MBP 11,3

==Result Summary==

Angry Robots Linux
------------------
Total Compile Time: 62.8214 ms across 229 calls
Total Link Time: 519.3323 ms across 114 calls
nVIDIA Cache Time:
Total Compile Time: 64.9124ms
Total Link Time: 17.1665ms

Unity Dead Trigger 2 - Helicopter Linux
---------------------------------------
Total Compile Time: 288.2592 ms across 705 calls
Total Link Time: 1697.2105 ms across 354 calls
nVIDIA cache time:
Total Compile Time: 309.8136ms
Total Link Time: 64.7068ms

Unigine Crypt Demo Linux
------------------------
Total Compile Time: 32.806 ms across 24 calls
Total Link Time: 141.6146 ms across 15 calls
nVIDIA cache time:
Total Compile Time: 33.4767 ms across 24 calls
Total Link Time: 2.4799 ms across 15 calls

Dead Trigger 2 - Village Demo Linux
-----------------------------------
Total Compile Time: 240.5811 ms across 662 calls
Total Link Time: 1274.736 ms across 333 calls
nVIDIA cache time:
Total Compile Time: 226.3913 ms across 660 calls
Total Link Time: 53.7249 ms across 332 calls

=========================================================

Angry Robots OSX
----------------
Compile time: 63.9002 ms across 229 calls
Link time: 52.0501 ms across 114 calls

Dead Trigger 2 - Helicopter OSX
-------------------------------
Compile time: 297.2739 ms across 707 calls
Link time: 227.4129 ms across 355 calls

Dead Trigger 2 - Village OSX
----------------------------
Total Compile Time: 60.6543 ms across 172 calls
Total Link Time: 45.2148 ms across 86 calls

Unigine Crypt Demo - OSX
------------------------
Total Compile Time: 32.5662 ms across 24 calls
Total Link Time: 13.585 ms across 15 calls

I've got a little more detailed files and can post the diff, but Bugzilla doesn't seem to support multi-upload.
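To make the Linux numbers in this comment easier to compare, the ratio between the first-run and second-run link times can be computed directly. This small Python sketch uses only the link totals reported here; the variable names are illustrative.

```python
# Total link times in ms on Linux, copied from the results in this comment:
# (first run, second run with the nVIDIA driver cache warm)
link_times_ms = {
    "Angry Robots": (519.3323, 17.1665),
    "Dead Trigger 2 - Helicopter": (1697.2105, 64.7068),
    "Unigine Crypt Demo": (141.6146, 2.4799),
    "Dead Trigger 2 - Village": (1274.736, 53.7249),
}

speedups = {}
for demo, (cold_ms, cached_ms) in link_times_ms.items():
    # Ratio of uncached to cached link time for this demo.
    speedups[demo] = cold_ms / cached_ms
    print(f"{demo}: {speedups[demo]:.1f}x faster link with a warm cache")
```

On these numbers the warm-cache link step is roughly 24x to 57x faster, while the ShCompile totals are essentially unchanged between runs.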

Chad Austin

Comment 12

•

11 years ago

Can you test on Windows too? Shader compilation is generally slower on Windows due to ANGLE.

Walter Litwinczyk [:walter]

Comment 13

•

11 years ago

Attached patch: Simple timing diff

Output time taken to compile/link shaders

Walter Litwinczyk [:walter]

Comment 14

•

11 years ago

Attached file: Adds results to total timings

Make a text file of the form:

======== WebGL Demo Title ========
[Insert firefox console output here]

e.g.

==== Angry Robots ====
ShCompile: 0.1983ms Shader: 24
Link Program: 0.2343ms
....

redirect it into the script

./add.py < results
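The attached add.py is not reproduced in this thread. As a rough idea of what such a summing script might look like, here is a minimal Python sketch. It assumes timing lines begin with `ShCompile: <n>ms` or `Link Program: <n>ms` and that demo titles are wrapped in `=` characters, as in the example above. The `summarize` function and its output layout are hypothetical, and the exact console line layout is a guess.

```python
import re
from collections import OrderedDict

# A title line looks like "==== Angry Robots ====".
TITLE_RE = re.compile(r"^=+\s*([^=]+?)\s*=+$")
# A timing line starts with a known label followed by a duration in ms.
TIMING_RE = re.compile(r"^(ShCompile|Link Program):\s*(\d+(?:\.\d+)?)\s*ms")


def summarize(lines):
    """Return {title: {label: [total_ms, call_count]}} for the given lines."""
    totals = OrderedDict()
    current = None
    for raw in lines:
        line = raw.strip()
        title = TITLE_RE.match(line)
        if title:
            current = totals.setdefault(title.group(1), {})
            continue
        timing = TIMING_RE.match(line)
        if timing and current is not None:
            entry = current.setdefault(timing.group(1), [0.0, 0])
            entry[0] += float(timing.group(2))
            entry[1] += 1
    return totals


# Sample input in the assumed format; the third timing line is made up.
sample = """==== Angry Robots ====
ShCompile: 0.1983ms Shader: 24
Link Program: 0.2343ms
ShCompile: 0.5ms Shader: 25
"""

result = summarize(sample.splitlines())
for demo, labels in result.items():
    print(demo)
    for label, (total_ms, calls) in labels.items():
        print(f"  Total {label} time: {total_ms:.4f} ms across {calls} calls")
```

To process a real results file, `summarize` can be given an open file object or `sys.stdin` instead of the sample string.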

Chad Austin

Comment 15

•

11 years ago

In case the value of implementing a shader cache is not clear to everyone, I will share our situation.

Our WebGL application uses about 140 shaders. Almost all of those are variants optimized for different numbers of lights, different numbers of skinned bones, etc. On Windows, compiling shaders costs us about 7 seconds total. In that time period, the browser is frozen. (We can spread the compilation work across frames, but that doesn't really solve the problem - it just moves it. It means we will have a slow frame rate for 15 seconds or whatever.)

Chrome implements a compiled shader cache ( https://code.google.com/p/chromium/issues/detail?id=88572 , https://code.google.com/p/chromium/issues/detail?id=249739 ) which reduces shader compile times to about 10%.

Other options: enable shader compilation in Web Workers or on a background thread in the browser.

Johannes Singler

Comment 16

•

11 years ago

Why is there so much interest in this bug, but only two votes (including mine)?

Vladimir Vukicevic [:vlad] [:vladv] (needinfo me, slow to respond)

Reporter

Comment 18

•

10 years ago

Doing this has been blocked on some of the quota manager/PBackground work -- the asm.js cache in dom/asmjscache is a model for how this can be implemented for webgl. However, that code is more complicated than it needs to be, and will be simplified greatly by the work that's waiting in bug 961049. Once that lands, then this can be implemented pretty straightforwardly.

Depends on: 961049, 961057

Vladimir Vukicevic [:vlad] [:vladv] (needinfo me, slow to respond)

Reporter

Updated

•

10 years ago

Depends on: 942542
No longer depends on: 961057

Jukka Jylänki

Comment 19

•

9 years ago

Attached image: platformergame_shader_compilation_stutter.png

Shader compilation has generally been a load time issue for most engines, though just now two Emscripten ported engines popped up that actually compile shaders on demand at runtime. It looks like Unreal Engine 4 does this as well.

Found a particularly good test case with the Unreal Engine 4 PlatformerGame demo, uploaded it to https://s3.amazonaws.com/mozilla-games/tmp/2016-04-23-PlatformerGame/PlatformerGame.html?cpuprofiler&playback

Attached a screenshot that illustrates the stuttering caused by shader compilation. Pauses in the range of 500-1000 msecs don't seem that uncommon in this demo. Looking at the spikes in geckoprofiler, they lead to glLinkProgram(). Here is one geckoprofile trace: https://cleopatra.io/#report=4f151b9643c09da0d85b4242635204ea9849403b&filter=%5B%7B%22type%22%3A%22RangeSampleFilter%22,%22start%22%3A80498,%22end%22%3A86634%7D,%7B%22type%22%3A%22FocusedCallstackPrefixSampleFilter%22,%22name%22%3A%22Browser_setImmediate_messageHandler()%20%40%20cb8fb3a6-fdd7-4b2c-a124-5248ada9de1d%3A8%22,%22focusedCallstack%22%3A%5B0,2541,3,4,2542%5D,%22appliesToJS%22%3Afalse%7D%5D&selection=2542,2543,2544,2545,2546

The time spent in these seems to be dominated by D3DCompiler_47.dll, as opposed to some ANGLE code, so it looks like the slow path is the actual D3D HLSL compilation, and not ANGLE shader translation/validation or some such.

Kelsey Gilbert [:jgilbert]

Updated

•

9 years ago

Blocks: 1268629

Whiteboard: [games:p2] → [games:p2] webgl-perf

Kelsey Gilbert [:jgilbert]

Updated

•

9 years ago

Alias: webgl-shader-cache

Jukka Jylänki

Comment 20

•

9 years ago

Improved the above test case page to more rigorously measure and highlight the shader compilation times. Visit https://s3.amazonaws.com/mozilla-games/tmp/2016-05-05-PlatformerGame-profiling/PlatformerGame-HTML5-Shipping.html?playback&cpuprofiler&webglprofiler&expandhelp&tracegl=50 for an automated run.

While the page is running, light blue spikes (the "Cold WebGL Calls" section) will appear on the profiling graph to indicate stuttering from shader compilation. After the run finishes, open the web page console, which will have logged events like

Trace: at t=43965.7, section "Cold GL" called via "_glLinkProgram" <- "__ZL11LinkProgramRK33FOpenGLLinkedProgramConfiguration" <- "__ZN17FOpenGLDynamicRHI25RHICreateBoundShaderStateEP21FRHIVertexDeclarationP16FRHIVertexShaderP14FRHIHullShaderP16FRHIDomainShaderP15FRHIPixelShaderP18FRHIGeometryShader" took 243.69 msecs!

To run the same page interactively, remove the "playback" option from the URL, i.e. visit https://s3.amazonaws.com/mozilla-games/tmp/2016-05-05-PlatformerGame-profiling/PlatformerGame-HTML5-Shipping.html?cpuprofiler&webglprofiler&expandhelp&tracegl=50

On Firefox Nightly on Windows with Core i7-5960X and a GTX 980 Ti with 365.19 NVidia drivers, 342.50 msecs was the longest observed duration that a shader compilation event took, and overall, there's about 60 such events that take more than 50 msecs. Looks like Chrome has a somewhat effective caching behavior, and there on the second run there was only one stutter event that took longer than 50 msecs.
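The console output described above can be tallied with a short script. The following Python sketch assumes each logged event contains `took <n> msecs`, as in the quoted trace line. The `slow_events` helper is hypothetical, and the sample lines are abbreviated stand-ins (the second one is made up), not real log output.

```python
import re

# Matches the duration at the end of a logged trace event.
TOOK_RE = re.compile(r"took\s+(\d+(?:\.\d+)?)\s+msecs")


def slow_events(log_lines, threshold_ms=50.0):
    """Return the durations (ms) of logged events longer than threshold_ms."""
    durations = []
    for line in log_lines:
        match = TOOK_RE.search(line)
        if match:
            duration = float(match.group(1))
            if duration > threshold_ms:
                durations.append(duration)
    return durations


sample_log = [
    # Abbreviated form of the trace line quoted in this comment.
    'Trace: at t=43965.7, section "Cold GL" called via "_glLinkProgram" took 243.69 msecs!',
    # Made-up line below the 50 ms threshold, for illustration only.
    'Trace: at t=1.0, section "Cold GL" called via "_glLinkProgram" took 12.5 msecs!',
]

slow = slow_events(sample_log)
print(f"{len(slow)} event(s) over 50 ms, longest {max(slow)} ms")
```

Running the same tally on a first and a second visit gives a simple before/after count for judging how well a shader cache is working.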

Anthony Vaughn [:anthony][:avaughn][San Francisco - Pacific]

Updated

•

9 years ago

Whiteboard: [games:p2] webgl-perf → [games:p?] webgl-perf

Anthony Vaughn [:anthony][:avaughn][San Francisco - Pacific]

Updated

•

9 years ago

Whiteboard: [games:p?] webgl-perf → [games:p1] webgl-perf

Desigan Chinniah [:cyberdees] [:dees] [London - GMT]

Updated

•

9 years ago

Whiteboard: [games:p1] webgl-perf → [games:p1] webgl-perf [platform-rel-Games]

Jenn Chaulk (:jchaulk)

Updated

•

9 years ago

platform-rel: --- → ?

Desigan Chinniah [:cyberdees] [:dees] [London - GMT]

Updated

•

9 years ago

platform-rel: ? → ---

Milan Sreckovic [:milan] (needinfo for best results)

Updated

•

9 years ago

Blocks: webgl-perf-parity

Kelsey Gilbert [:jgilbert]

Updated

•

6 years ago

Type: defect → enhancement

Priority: -- → P3

wingman.jr.addon

Comment 21

•

6 years ago

Hello! I would like to present a new use case that has popped up: the use of machine learning framework Tensorflow.js. As best as I can tell, this compiles many shaders (a shader per graph op?) when the model initially loads and does the first inference. This causes quite heavy page loads.
While this can be used directly on a web page (which perhaps represents the bulk of usage?), I use it in a Firefox plugin for inference on images.
Basically this causes a 10 second stall when the plugin is initially loaded. As a point of reference on size, that's for a model based on MobilenetV2 - not something considered to be huge. I'm tracking my corresponding issue here

Beyond the undue level of hype that machine learning has received in recent years, I do anticipate a significant growth in its usage - even on the web. Based on that, I would imagine that - whether Tensorflow.js retains its popularity or not - caching shaders will remain important for this use case in the future as the CPU is simply not cut out for the linear algebra needed. (I also see active work on a WebGPU backend, but that's a future story...) For the curious, Tensorflow.js has several demos online.

Thank you team for developing Firefox and I hope you find this new use case report helpful!

BMO Automation

Updated

•

3 years ago

Severity: normal → S3

Sungun Park

Comment 22

•

3 months ago

Hello,

I'm checking in to see if there have been any recent updates or progress regarding this issue. Our team has recently encountered significant slowness in the shader linking process within our product on the Firefox browser, and we suspect it may be related to this bug. The performance degradation is particularly pronounced on Mac.

This is a sample webpage that consistently reproduces this behavior:

Webpage: https://prideout.net/slow_compile/repro.html
Source Code: https://github.com/prideout/slow_compile

This sample measures the time taken for shader linking. In our tests on Mac, Firefox takes approximately 640ms, whereas Chrome completes the same operation in about 30ms.

Within this sample, the bottleneck appears to be this line (https://github.com/prideout/slow_compile/blob/893994d876482ff6b67ce17fea2d744bcc44d6ec/vshader.js#L15). We observed that reducing the array size in this specific line (e.g., down to 8 or 1) dramatically decreases the link time from ~640ms to around 6ms.

We hope this information is helpful.

Simple timing diff, 11 years ago, Walter Litwinczyk [:walter], 2.35 KB, patch
Adds results to total timings, 11 years ago, Walter Litwinczyk [:walter], 1.70 KB, text/x-python
platformergame_shader_compilation_stutter.png, 9 years ago, Jukka Jylänki, 3.62 MB, image/png