Open
Bug 1133570
Opened 10 years ago
Updated 2 years ago
WebGL rendering on 10kCubes sample is 20x slower compared to native.
Categories
(Core :: Graphics: CanvasWebGL, defect, P3)
NEW
People
(Reporter: jujjyl, Unassigned)
Details
(Whiteboard: gfx-noted)
Attachments
(3 files)
I just got a new 8-core Intel Core i7 5960X + nVidia GeForce 980 GTX setup, and gave it a go with my "10kCubes" large batch count rendering stress test page. Here are some observations that can hopefully lead to optimizations for WebGL:
Native Windows GL3 executable:
Running with 100000 cubes, I'm getting 42fps, with ~24msecs/frame. nVidia NSight shows that the execution is fully CPU bound, the GPU being idle about 90% of the time.
A native profile shows that about 70% of execution is inside the nVidia GL driver. The remaining 30% is in animating and submitting draws.
10kCubes in current Firefox Nightly, with e10s disabled and try-d3d11 enabled: https://dl.dropboxusercontent.com/u/40949268/emcc/10kCubes_vsync/10kCubes.html (tap B once to switch the rendering back to rAF instead of setTimeout):
Running with 100000 cubes, I'm getting 2fps, with ~510msecs/frame. nVidia NSight shows that the execution is also fully CPU bound, the GPU being idle ~100% of the time. A native profile with AMD CodeXL gives very interesting info about the bottlenecks. I've attached a screenshot that shows the top samples. These give interesting numbers:
- Only about 20% of samples are inside the nVidia D3D driver (compared to 70% in the native execution).
- libGLESv2.dll dominates by taking up 45.62% of total samples.
- In the samples inside libGLESv2.dll, the entry point (deep samples) gl::Context::drawArrays() dominates with 10.39% of total time being spent by that tree.
- of which 29.51% is taken up by the function gl::Context::applyTextures(), even though the WebGL application does not use textures at all.
- and 12.86% is taken up by gl::Context::applyShaders(), even though the WebGL code never changes shaders during the hot rendering loop (it activates the shader once at the beginning of the frame and never changes it).
- gl::ProgramBinary::sortAttributesByLayout(), std::string creation and destruction and framebuffer completeness checking also come up high, which looks odd.
How do these call sites look to you guys? Anything that's not expected showing up there?
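As a rough cross-check of the numbers above, here is a minimal sketch that recomputes the slowdown from the reported frame times and converts the nested percentages into shares of total samples. It assumes that the 29.51% and 12.86% figures are fractions of the gl::Context::drawArrays() tree rather than of the whole profile; the variable names are illustrative only.

```python
# Frame times reported above, in milliseconds per frame.
native_ms = 24.0
webgl_ms = 510.0

# Ratio of WebGL frame time to native frame time.
slowdown = webgl_ms / native_ms

# Frames per second implied by each frame time.
native_fps = 1000.0 / native_ms
webgl_fps = 1000.0 / webgl_ms

# Assumption: the nested percentages are relative to the drawArrays() tree,
# which itself accounts for 10.39% of total samples.
draw_arrays_share = 0.1039
apply_textures_total = draw_arrays_share * 0.2951
apply_shaders_total = draw_arrays_share * 0.1286

print(f"slowdown: {slowdown:.2f}x")
print(f"native: {native_fps:.1f} fps, WebGL: {webgl_fps:.2f} fps")
print(f"applyTextures: {apply_textures_total:.2%} of total samples")
print(f"applyShaders:  {apply_shaders_total:.2%} of total samples")
```

Under that reading, the two functions together account for only a few percent of total samples, so they would explain just part of the time spent in libGLESv2.dll.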
Comment 1 • 10 years ago (Reporter)
Setting the pref webgl.angle.try-d3d11 to false improves performance from about 510 msecs/frame to about 450 msecs/frame, which is slightly better, but in the profile the call sites with the most samples appear to be the same.
Comment 2 • 10 years ago (Reporter)
Oops, the attachment was wrong and referred to the native version. Here is the Firefox version.
After pressing 'up' a lot I reached 100K cubes and I see 120ms/f in an optimised build of demo-hacks branch. This is using DX-OGL interop on nVidia GTX660.
Flags: needinfo?(jujjyl)
Comment 6 • 10 years ago (Reporter)
Sorry for the delay. The source code for the demo is unfortunately not available, but native builds can be found here, if they are of any use:
- Windows: https://dl.dropboxusercontent.com/u/40949268/code/10kCubes/10kCubes_2015_02_23_Win.zip
- OSX: https://dl.dropboxusercontent.com/u/40949268/code/10kCubes/10kCubes_2015_02_23_OSX.tar.gz
For native runs, passing the command line parameters "/objects 100000" gives a startup count of 100000 objects. For the html build, the command line parameters can be passed in the URL as GET parameters separated by "&", e.g. "?/objects&100000" will give 100k objects, like this:
https://dl.dropboxusercontent.com/u/40949268/emcc/10kCubes_vsync/10kCubes.html?/objects&100000
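To illustrate the URL convention described above, here is a minimal sketch that turns such a query string into an argv-style list. How the page itself parses the URL is not shown in this bug, so the helper `query_to_args` is hypothetical, and percent-decoding is deliberately left out.

```python
from typing import List
from urllib.parse import urlsplit


def query_to_args(url: str) -> List[str]:
    """Split the query string on '&' and return the pieces as arguments.

    Hypothetical helper: it mirrors the "?/objects&100000" convention only.
    """
    query = urlsplit(url).query
    # Drop empty pieces so that a trailing '&' does not add an empty argument.
    return [part for part in query.split("&") if part]


url = (
    "https://dl.dropboxusercontent.com/u/40949268/emcc/"
    "10kCubes_vsync/10kCubes.html?/objects&100000"
)
print(query_to_args(url))
```

The result corresponds to the native command line "/objects 100000", with one list element per parameter.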
Flags: needinfo?(jujjyl)
Comment 7 • 10 years ago (Reporter)
I did a profile with VTune, which shows similar data to AMD CodeAnalyst, see the attachment below.
Comment 8 • 10 years ago (Reporter)
This continues to make me want to figure out if we can make the DX/GL interop rock solid for at least some recent set of users and use native GL where possible... we should be able to optimize ANGLE as well though. Is your source in github somewhere? I'd be curious to see the layout of the vertex arrays going in to the draw call. D3D can only efficiently do interleaved vertex arrays (one struct per vertex basically), and I think that's what that prepareVertexData stuff is doing (interleaving).
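For readers unfamiliar with the term, here is a minimal sketch of what interleaving means: separate per-attribute arrays are packed into one struct per vertex. The attribute layout (three float32 position components plus four uint8 color components) is purely an assumption, since the demo's source is not available, and this is not ANGLE's actual code.

```python
import struct

# Assumed vertex layout: 3 x float32 position + 4 x uint8 color = 16 bytes.
VERTEX_FORMAT = "<3f4B"
STRIDE = struct.calcsize(VERTEX_FORMAT)


def interleave(positions, colors):
    """Pack separate position and color arrays into one struct per vertex."""
    if len(positions) != len(colors):
        raise ValueError("attribute arrays must have the same vertex count")
    out = bytearray()
    for (x, y, z), (r, g, b, a) in zip(positions, colors):
        out += struct.pack(VERTEX_FORMAT, x, y, z, r, g, b, a)
    return bytes(out)


# Two vertices stored as separate (non-interleaved) attribute arrays.
positions = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
colors = [(255, 0, 0, 255), (0, 255, 0, 255)]

buffer = interleave(positions, colors)
print(STRIDE, len(buffer))
# Read back the second vertex from its byte offset in the interleaved buffer.
print(struct.unpack_from(VERTEX_FORMAT, buffer, STRIDE))
```

If a conversion of this kind runs on the CPU for every draw call, its cost grows with the number of draws, which would matter for a stress test that issues one draw per cube.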
Comment 10 • 10 years ago
Testing on my machine, I see:
10kCubes_Win:
frametime: ~122-124ms
rendertime: ~41-42ms
swapbuffers: ~79-81ms
10kCubes_html:
frametime: ~126-127ms
rendertime: ~87-90ms
swapbuffers: 0ms
This is with the webgl 2 demo-hacks build using DX/GL interop.
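A small sketch of how the numbers above add up, using the midpoint of each reported range. The function names are illustrative, and where the unaccounted time goes is not stated in this bug.

```python
def midpoint(lo, hi):
    """Return the midpoint of a reported range, in milliseconds."""
    return (lo + hi) / 2.0


# Reported ranges in milliseconds: (low, high).
runs = {
    "10kCubes_Win": {"frame": (122, 124), "render": (41, 42), "swap": (79, 81)},
    "10kCubes_html": {"frame": (126, 127), "render": (87, 90), "swap": (0, 0)},
}


def unaccounted_ms(name):
    """Frame time minus render time minus swapbuffers time."""
    t = runs[name]
    frame = midpoint(*t["frame"])
    render = midpoint(*t["render"])
    swap = midpoint(*t["swap"])
    return frame - render - swap


for name in runs:
    print(f"{name}: {unaccounted_ms(name):.1f} ms not covered by render + swap")
```

With these midpoints, the native run is almost fully explained by render plus swapbuffers, while the html run has a sizeable remainder that the reported counters do not cover.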
Updated • 10 years ago
Whiteboard: gfx-noted
Updated • 7 years ago
Priority: -- → P3
Updated • 2 years ago
Severity: normal → S3