Closed Bug 863316 Opened 11 years ago Closed 7 years ago

ANGLE should call D3DCompile off the main thread

Categories

(Core :: Graphics: CanvasWebGL, defect)

defect
Not set
normal

Tracking

()

RESOLVED DUPLICATE of bug 1347866

People

(Reporter: vladan, Unassigned)

References

(Blocks 2 open bugs)

Details

(Whiteboard: [Snappy:P3][games:p1][diamond][platform-rel-Games])

I noticed the BananaBread WebGL demo will consistently hang my browser several times (for a second or two) as the game is being launched.

This is the relevant part of the hang stacks:

D3DCompile (in D3DCompiler_43.pdb)
egl::Display::compileShaderSource(...) (in libEGL.pdb)
gl::ProgramBinary::link(...) (in libGLESv2.pdb)
gl::Program::link() (in libGLESv2.pdb)
gl::Context::linkProgram(unsigned int) (in libGLESv2.pdb)
glLinkProgram (in libGLESv2.pdb)
mozilla::WebGLContext::LinkProgram(mozilla::WebGLProgram *) (in xul.pdb)
mozilla::dom::WebGLRenderingContextBinding::linkProgram (in xul.pdb)
mozilla::dom::WebGLRenderingContextBinding::genericMethod (in xul.pdb)

Profile: http://people.mozilla.com/~bgirard/cleopatra/#report=ea5e218a44706e0382511a732f6e8f2c92703e8a

BananaBread page: https://developer.mozilla.org/en/demos/detail/bananabread

The hangs are coming from D3DCompile, while ShCompile is executing very quickly. Benoit Jacob says D3DCompile could be moved off the main thread.
Thanks for the investigation. The profile indeed shows D3DCompile being an order of magnitude slower than ANGLE's compiler (ShCompile). D3DCompile can indeed be deferred as it only affects rendering and does not affect validation or other internal WebGL logic.

Filed ANGLE bug:
  https://code.google.com/p/angleproject/issues/detail?id=422
This would also solve the issues with D3DCompile taking infinitely long, freezing the browser: see bug 680188 and https://code.google.com/p/angleproject/issues/detail?id=198
Blocks: 680188
Caching program binaries would help considerably with this for non-first-run cases.
Whiteboard: [Snappy:P3] → [Snappy:P3][games:p?]
Whiteboard: [Snappy:P3][games:p?] → [Snappy:P3][games:p3]
Blocks: gecko-games
Benoit: This looks like a nontrivial change, but would you be willing to mentor a contributor through this?
Flags: needinfo?(bjacob)
This bug seems like one of those bugs that are not too hard to implement, but could possibly result in tricky bugs. For instance, we already know that ANGLE had thread-safety bugs (:khuey was running into that the other day for webgl-on-worker-threads). Are we going to run into ANGLE threadsafety bugs in this project too? Are we going to notice before this lands? These questions make me uncomfortable about making this a mentored bug.
Flags: needinfo?(bjacob)
I don't actually understand how you'd do this off the main thread -- it's a synchronous operation in WebGL.  You could substitute a dummy shader, but that would mean that any introspection of the shader would then fail.. or you'd need to speculatively do it.  Maybe ANGLE's parsing is enough for it to return info about uniforms/varyings/etc., and then you can sub in a working shader when the compile is done -- using a dummy shader that just renders black in the meantime.  But if we do that, the black would be distracting for some use cases, so we'd also have to introduce a way to either disable this or to figure out when compilation is done.

Note also that newer D3D compiler versions (46, 47) that we're now shipping are *much* faster than v43.

A better approach, IMO, would be to introduce a WEBGL_async_shader_compilation extension that has CompileAsync and LinkAsync that return promises.  That way apps can opt-in to this behaviour (and many will), and it would be easy to write a main-thread polyfill for when the extension isn't available.
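A minimal sketch of what such a main-thread polyfill could look like. The extension name comes from the proposal above; the `linkAsync` method name and its promise shape are assumptions for illustration only, not a shipped WebGL API. The fallback path uses only standard WebGL calls.

```javascript
// Hypothetical polyfill for the proposed (not shipped) async link API.
// `ext.linkAsync` is an assumed name; only the fallback uses real WebGL calls.
function linkAsync(gl, program) {
  const ext = gl.getExtension("WEBGL_async_shader_compilation");
  if (ext && typeof ext.linkAsync === "function") {
    // Assumed native path: the browser links off the main thread.
    return ext.linkAsync(program);
  }
  // Fallback: issue the link now, but query the status in a later task so
  // an implementation that links in the background gets a chance to finish.
  return new Promise((resolve, reject) => {
    gl.linkProgram(program);
    setTimeout(() => {
      if (gl.getProgramParameter(program, gl.LINK_STATUS)) {
        resolve(program);
      } else {
        reject(new Error(gl.getProgramInfoLog(program) || "link failed"));
      }
    }, 0);
  });
}
```

An app would call `linkAsync(gl, program).then(useProgram, reportError)` and behave the same whether or not the extension exists.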
(In reply to Vladimir Vukicevic [:vlad] [:vladv] from comment #6)
> I don't actually understand how you'd do this off the main thread -- it's a
> synchronous operation in WebGL.  You could substitute a dummy shader, but
> that would mean that any introspection of the shader would then fail.. or
> you'd need to speculatively do it.  Maybe ANGLE's parsing is enough for it
> to return info about uniforms/varyings/etc., and then you can sub in a
> working shader when the compile is done -- using a dummy shader that just
> renders black in the meantime.  But if we do that, the black would be
> distracting for some use cases, so we'd also have to introduce a way to
> either disable this or to figure out when compilation is done.
> 
> Note also that newer D3D compiler versions (46, 47) that we're now shipping
> are *much* faster than v43.
> 
> A better approach, IMO, would be to introduce a
> WEBGL_async_shader_compilation extension that has CompileAsync and LinkAsync
> that return promises.  That way apps can opt-in to this behaviour (and many
> will), and it would be easy to write a main-thread polyfill for when the
> extension isn't available.

We can do shader compilation asynchronously between calls to compile/link and their respective GetStatus function calls. Devs would be able to take advantage of this by changing from:
for (const cur of programList) {
  gl.linkProgram(cur);
  if (!gl.getProgramParameter(cur, gl.LINK_STATUS)) {
    // failure case
  }
}

To:
for (const cur of programList) {
  gl.linkProgram(cur);
}

for (const cur of programList) {
  if (!gl.getProgramParameter(cur, gl.LINK_STATUS)) {
    // failure case
  }
}


Having proper async compilation would be ideal, but we can at least (in theory) make shader compilation/linking fully parallel today, without an extension.
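The two-pass pattern can be wrapped in a small helper. This is only a sketch: `linkAll` is a hypothetical name, and whether the links actually overlap depends on the browser deferring the work until the status is queried.

```javascript
// Hypothetical helper: link every program first, then check statuses.
// Returns the programs that failed to link, with their info logs.
function linkAll(gl, programs) {
  // Pass 1: issue all links without querying anything that forces a wait.
  for (const program of programs) {
    gl.linkProgram(program);
  }
  // Pass 2: query each status; an implementation that defers linking
  // would only have to block here.
  const failures = [];
  for (const program of programs) {
    if (!gl.getProgramParameter(program, gl.LINK_STATUS)) {
      failures.push({ program, log: gl.getProgramInfoLog(program) });
    }
  }
  return failures;
}
```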
(In reply to Jeff Gilbert [:jgilbert] from comment #7)
> Having proper async compilation would be ideal, but we can at least (in
> theory) make shader compilation/linking fully parallel today, without an
> extension.

How do we make it parallel when there's a gigantic block of shared state?
(In reply to Dan Glastonbury :djg :kamidphish from comment #8)
> (In reply to Jeff Gilbert [:jgilbert] from comment #7)
> > Having proper async compilation would be ideal, but we can at least (in
> > theory) make shader compilation/linking fully parallel today, without an
> > extension.
> 
> How do we make it parallel when there's a gigantic block of shared state?

Which shared state? There's only a couple of states that we care about, so while their values are 'deferred', we just have to block when the user tries to access them, blocking until they are no longer 'deferred', but correctly resolved again.

Shader compilation doesn't touch much shared mutable state, I think. (at least, there's no reason it needs to. If it does inside ANGLE, we'll need to fix that)
It doesn't -- everything is encapsulated in the ShCompiler object.  However -- that will take care of translating GLSL to HLSL.  The actual compilation takes place inside libGLESv2; so I'm pretty sure that this will require some changes to ANGLE -- ideally a way to invoke the compilation and return a program binary without needing to create a D3D context.

Alternatively, if we're ok with creating a new ANGLE context per compilation thread, we can use glGetProgramBinaryOES/glSetProgramBinaryOES.  That would have the advantage of having the threaded compilation code work anywhere where binary shaders (or something similar) is supported.
(In reply to Vladimir Vukicevic [:vlad] [:vladv] from comment #10)
> It doesn't -- everything is encapsulated in the ShCompiler object.  However
> -- that will take care of translating GLSL to HLSL.  The actual compilation
> takes place inside libGLESv2; so I'm pretty sure that this will require some
> changes to ANGLE -- ideally a way to invoke the compilation and return a
> program binary without needing to create a D3D context.
> 
> Alternatively, if we're ok with creating a new ANGLE context per compilation
> thread, we can use glGetProgramBinaryOES/glSetProgramBinaryOES.  That would
> have the advantage of having the threaded compilation code work anywhere
> where binary shaders (or something similar) is supported.

Yep, leveraging program binaries here is the right way to go, IMO.
(In reply to Vladimir Vukicevic [:vlad] [:vladv] from comment #10)
> The actual compilation takes place inside libGLESv2; 

Jeff, this is what I meant by gigantic block of shared state.

> so I'm pretty sure that this will require some
> changes to ANGLE -- ideally a way to invoke the compilation and return a
> program binary without needing to create a D3D context.

I'm not up-to-date with DX11, but in DX9 land, D3DCompile takes HLSL and turns it into D3D bytecode; the D3D driver then takes the bytecode and compiles it into a GPU-specific instruction set.

Anyway, I guess I was thinking that if the slow part is driver compilation, then we might be trapped by having "one driver". Although, the driver is likely to be threaded and have compilation on another thread.

I guess I was questioning the "but we can at least (in theory) make shader compilation/linking fully parallel today" statement when the GL presents the model of a huge black box of state. Can we compile and link programs on separate contexts and share them with our main rendering context?

> Alternatively, if we're ok with creating a new ANGLE context per compilation
> thread, we can use glGetProgramBinaryOES/glSetProgramBinaryOES.  That would
> have the advantage of having the threaded compilation code work anywhere
> where binary shaders (or something similar) is supported.

Sounds like an interesting avenue to explore.
(In reply to Dan Glastonbury :djg :kamidphish from comment #12)
> (In reply to Vladimir Vukicevic [:vlad] [:vladv] from comment #10)
> > The actual compilation takes place inside libGLESv2; 
> 
> Jeff, this is what I meant by gigantic block of shared state.
Ah, alright. I don't think there's any state that we actually need for doing shader compilation and linking, other than the shader/program objects themselves. Maybe I'm forgetting something? Maybe enabling an extension for ANGLE can change shader transpilation?
> 
> > so I'm pretty sure that this will require some
> > changes to ANGLE -- ideally a way to invoke the compilation and return a
> > program binary without needing to create a D3D context.
> 
> I'm not up-to-date with DX11, but in DX9 land, D3DCompile takes HLSL and
> turns it into D3D bytecode; the D3D driver then takes the bytecode and
> compiles it into a GPU-specific instruction set.
> 
> Anyway, I guess I was thinking that if the slow part is driver compilation,
> then we might be trapped by having "one driver". Although, the driver is
> likely to be threaded and have compilation on another thread.
> 
> I guess I was questioning the "but we can at least (in theory) make shader
> compilation/linking fully parallel today" statement when the GL presents the
> model of a huge black box of state. Can we compile and link programs on
> separate contexts and share them with our main rendering context?
We should be able to transpile in parallel, and once we transpile, we should be able to compile/link that into a program binary in parallel. (These are guaranteed to be interoperable between contexts)
> 
> > Alternatively, if we're ok with creating a new ANGLE context per compilation
> > thread, we can use glGetProgramBinaryOES/glSetProgramBinaryOES.  That would
> > have the advantage of having the threaded compilation code work anywhere
> > where binary shaders (or something similar) is supported.
> 
> Sounds like an interesting avenue to explore.
Whiteboard: [Snappy:P3][games:p3] → [Snappy:P3][games:p3][diamond]
Whiteboard: [Snappy:P3][games:p3][diamond] → [Snappy:P3][games:p?][diamond]
Whiteboard: [Snappy:P3][games:p?][diamond] → [Snappy:P3][games:p3][diamond]
Whiteboard: [Snappy:P3][games:p3][diamond] → [Snappy:P3][games:p3][diamond][platform-rel-Games]
platform-rel: --- → ?
Bumping this up to track as a priority 1 item, as this seems to be the most common source of performance problems we see in both Unreal Engine and Unity titles on Windows, manifesting as slow loading times and runtime frame hiccups.
Whiteboard: [Snappy:P3][games:p3][diamond][platform-rel-Games] → [Snappy:P3][games:p1][diamond][platform-rel-Games]
platform-rel: ? → ---
I went to their issue tracker page. It seems they landed some patches to support async/multi-threaded D3D shader compilation a few months ago, so I think it is worth trying to rebase our ANGLE branch onto their trunk.
 
https://bugs.chromium.org/p/angleproject/issues/detail?id=422
I upgraded ANGLE to build 2950, which includes the async/multithreaded D3D compilation, and tested shader compile time with the EpicGames ZenGarden WebGL2 demo, comparing the latest Firefox Nightly build with and without the ANGLE upgrade against Firefox Release. I ran each of these 3 versions of Firefox 3 times and got the results below.

Firefox Nightly with ANGLE 2950
6.69sec
6.67sec
6.68sec

Firefox Nightly
8.1sec
8.08sec
8.1sec

Firefox Release
9.98sec
9.91sec
9.92sec

We can see that shader compile time is noticeably shorter after we updated ANGLE.
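To put numbers on that, here is a short sketch that averages the three runs per build and computes the relative reduction. The timings are copied from the results above; the variable names are illustrative. By this calculation the ANGLE 2950 build comes out roughly 17% faster than current Nightly and roughly 33% faster than Release.

```javascript
// Shader compile times in seconds, copied from the runs listed above.
const runs = {
  "Nightly with ANGLE 2950": [6.69, 6.67, 6.68],
  "Nightly": [8.1, 8.08, 8.1],
  "Release": [9.98, 9.91, 9.92],
};

// Arithmetic mean of a list of numbers.
const mean = (xs) => xs.reduce((sum, x) => sum + x, 0) / xs.length;

// Mean compile time per build.
const means = Object.fromEntries(
  Object.entries(runs).map(([build, times]) => [build, mean(times)])
);

// Percentage reduction in compile time going from `base` to `fast`.
const reduction = (base, fast) => ((base - fast) / base) * 100;

const vsNightly = reduction(means["Nightly"], means["Nightly with ANGLE 2950"]);
const vsRelease = reduction(means["Release"], means["Nightly with ANGLE 2950"]);

console.log(means);
console.log(`vs Nightly: ${vsNightly.toFixed(1)}% shorter`);
console.log(`vs Release: ${vsRelease.toFixed(1)}% shorter`);
```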

So I think we should upgrade ANGLE to a version newer than 2950, since it does improve our performance.
Marking this as a duplicate of bug 1347866 because ANGLE 2950 already includes this feature.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → DUPLICATE