Open Bug 1133570 Opened 10 years ago Updated 2 years ago

WebGL rendering on 10kCubes sample is 20x slower compared to native.

Categories: Core :: Graphics: CanvasWebGL, defect, P3

Platform: x86, Windows 8.1

People: Reporter: jujjyl, Unassigned

Whiteboard: gfx-noted

Attachments: 3 files

I just got a new 8-core Intel Core i7 5960X + nVidia GeForce 980 GTX setup, and gave it a go with my "10kCubes" large batch count rendering stress test page. Here are some observations that can hopefully lead to optimizations for WebGL:

Native Windows GL3 executable: Running with 100000 cubes, I'm getting 42fps, with ~24msecs/frame. nVidia NSight shows that the execution is fully CPU bound, the GPU being idle about 90% of the time. A native profile shows that about 70% of execution is inside the nVidia GL driver. The remaining 30% is in animating and submitting draws.

10kCubes in current Firefox Nightly, with e10s disabled and try-d3d11 enabled: https://dl.dropboxusercontent.com/u/40949268/emcc/10kCubes_vsync/10kCubes.html (tap B once to switch the rendering back to rAF instead of setTimeout): Running with 100000 cubes, I'm getting 2fps, with ~510msecs/frame. nVidia NSight shows that the execution is also fully CPU bound, the GPU being idle ~100% of the time.

A native profile with AMD CodeXL gives very interesting info about the bottlenecks. I've attached a screenshot that shows the top samples. The numbers that stand out:
- Only about 20% of samples are inside the nVidia D3D driver (compared to 70% in the native execution).
- libGLESv2.dll dominates by taking up 45.62% of total samples.
- In the samples inside libGLESv2.dll, the entry point (deep samples) gl::Context::drawArrays() dominates with 10.39% of total time being spent by that tree.
  - of which 29.51% is taken up by the function gl::Context::applyTextures(), however the WebGL application does not use textures at all.
  - and 12.86% is taken up by gl::context::applyShaders(), but the WebGL code never changes shaders during the hot rendering loop. (It activates the shader once at the beginning of the frame, and never changes it.)
- gl::ProgramBinary::sortAttributesByLayout(), std::string creation and destruction, and framebuffer completeness checking also come up high, which looks odd.

How do these call sites look to you guys? Anything that's not expected showing up there?
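For reference, the figures quoted above can be cross-checked with a few lines of arithmetic. The following is only a sketch: the helper names are made up for illustration, and it assumes that the 29.51% and 12.86% figures are fractions of the gl::Context::drawArrays() tree rather than of total samples.

```cpp
#include <cassert>

// Hypothetical helpers, not part of the demo or of ANGLE.

// Convert a frame time in milliseconds to frames per second.
double frameMsToFps(double frameMs) {
    assert(frameMs > 0.0);
    return 1000.0 / frameMs;
}

// Ratio of the WebGL frame time to the native frame time.
double slowdownFactor(double webFrameMs, double nativeFrameMs) {
    assert(nativeFrameMs > 0.0);
    return webFrameMs / nativeFrameMs;
}

// Convert "child takes X% of parent, parent takes Y% of total"
// into the child's share of total samples, in percent.
// ASSUMPTION: the child percentage is relative to the parent tree.
double shareOfTotal(double parentPctOfTotal, double childPctOfParent) {
    return parentPctOfTotal * childPctOfParent / 100.0;
}
```

With the numbers from this comment, 24 msecs/frame is about 42 fps, 510 vs 24 msecs/frame is roughly a 21x slowdown, and under the stated assumption applyTextures() and applyShaders() would account for about 3.1% and 1.3% of total samples respectively.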
Flipping the pref webgl.angle.try-d3d11 to false improves performance from about 510msecs/frame to 450msecs/frame, so it is slightly better, but looking at the profile, the call sites with the most samples seem to be the same.
Oops, the attachment was wrong and referred to the native version. Here is the Firefox version.
Is there a magic short cut key to get to 100K cubes?
Also is the C++ source code to 10k cubes available?
After pressing 'up' a lot I reached 100K cubes and I see 120ms/f in an optimised build of demo-hacks branch. This is using DX-OGL interop on nVidia GTX660.
Flags: needinfo?(jujjyl)
Sorry for the delay. Unfortunately the source code for the demo is not available, but native builds can be found here, if they are of any use:
- Windows: https://dl.dropboxusercontent.com/u/40949268/code/10kCubes/10kCubes_2015_02_23_Win.zip
- OSX: https://dl.dropboxusercontent.com/u/40949268/code/10kCubes/10kCubes_2015_02_23_OSX.tar.gz

For native runs, passing the command line parameters "/objects 100000" gives a startup object count of 100000 objects. For the html build, the command line parameters can be passed in the URL as GET parameters, i.e. "?/objects&100000" will give 100k objects, like this: https://dl.dropboxusercontent.com/u/40949268/emcc/10kCubes_vsync/10kCubes.html?/objects&100000
Flags: needinfo?(jujjyl)
I did a profile with VTune, which shows similar data to AMD CodeAnalyst, see the attachment below.
This continues to make me want to figure out if we can make the DX/GL interop rock solid for at least some recent set of users and use native GL where possible... we should be able to optimize ANGLE as well though.

Is your source on GitHub somewhere? I'd be curious to see the layout of the vertex arrays going into the draw call. D3D can only efficiently do interleaved vertex arrays (one struct per vertex, basically), and I think that's what that prepareVertexData stuff is doing (interleaving).
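To illustrate what "one struct per vertex" means, here is a minimal sketch of interleaving two separately stored attribute streams on the CPU. The Vertex layout (three position floats, four color floats) and the interleave() helper are assumptions for illustration only; the demo's actual attributes are not known from this bug, and this is not ANGLE's prepareVertexData code.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical per-vertex layout: all attributes of one vertex sit together.
struct Vertex {
    float position[3];
    float color[4];
};

// Interleave two tightly packed attribute streams (xyzxyz... and rgbargba...)
// into a single array holding one struct per vertex.
std::vector<Vertex> interleave(const std::vector<float>& positions,
                               const std::vector<float>& colors) {
    assert(positions.size() % 3 == 0);
    assert(colors.size() % 4 == 0);
    const std::size_t count = positions.size() / 3;
    assert(colors.size() / 4 == count);  // both streams describe the same vertices

    std::vector<Vertex> out(count);
    for (std::size_t i = 0; i < count; ++i) {
        for (std::size_t c = 0; c < 3; ++c) out[i].position[c] = positions[3 * i + c];
        for (std::size_t c = 0; c < 4; ++c) out[i].color[c] = colors[4 * i + c];
    }
    return out;
}
```

If an application uploads its attributes already interleaved like this, a backend that needs this layout has no per-draw copy of this kind to do. Whether that applies to this demo depends on how its vertex arrays are actually laid out.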
Testing on my machine, I see:

10kCubes_Win:
  frametime: ~122-124ms
  rendertime: ~41-42ms
  swapbuffers: ~79-81ms

10kCubes_html:
  frametime: ~126-127ms
  rendertime: ~87-90ms
  swapbuffers: 0ms

This is with the webgl 2 demo-hacks build using DX/GL interop.
Severity: normal → S3