Closed Bug 1358977 Opened 8 years ago Closed 8 years ago

Profiling WebGL performance

Categories

(Core :: Graphics: CanvasWebGL, enhancement)

enhancement
Not set
normal

Tracking

()

RESOLVED FIXED

People

(Reporter: daoshengmu, Unassigned)

References

(Blocks 1 open bug)

Details

Attachments

(5 files, 1 obsolete file)

In this bug, I wanna discuss some strange performance issues when I diagnose from profiling tools, like Gecko Profiler (https://perf-html.io), Visual Studio Graphics Diagnostics, etc.
Platform: Win 10 GPU: HD Graphics 530 Demo: Zen Garden-wasm Version: 55.0a1 (2017-04-24) (64-bit) release build Capture from: Gecko Profiler In this report, I notice we get bad performance on Renderer11::genericDrawElements() before doing applyVertexBuffer. It costs us 4ms. Moreover, InputLayoutCache::applyVertexBuffers takes 2ms as well. ---- 8ms 53.3% mozilla::dom::WebGL2RenderingContextBinding::drawElements . . ---- 8ms 53.3% rx::Renderer11::genericDrawElements(rx::Context11 *,unsigned int,int,unsigned int,void const *,int,gl::IndexRange const &) ---- 4ms 26.7% rx::Renderer11::applyVertexBuffer(gl::State const &,unsigned int,int,int,int,rx::TranslatedIndexData *) ---- 3ms 20.0% rx::InputLayoutCache::applyVertexBuffers(gl::State const &,std::vector<rx::TranslatedAttribute,std::allocator<rx::TranslatedAttribute> > const &,std::vector<rx::TranslatedAttribute,std::allocator<rx::TranslatedAttribute> > const &,unsigned int,int,rx::TranslatedIndexData *,int) ---- 1ms 6.7% rx::InputLayoutCache::updateInputLayout(gl::State const &,unsigned int,std::array<int,16> const &,int)
(In reply to Daosheng Mu[:daoshengmu] from comment #1) > Platform: Win 10 > GPU: HD Graphics 530 > Demo: Zen Garden-wasm > Version: 55.0a1 (2017-04-24) (64-bit) release build > Capture from: Gecko Profiler > > > In this report, I notice we get bad performance on > Renderer11::genericDrawElements() before doing applyVertexBuffer. It costs > us 4ms. Moreover, InputLayoutCache::applyVertexBuffers takes 2ms as well. > > ---- > 8ms 53.3% mozilla::dom::WebGL2RenderingContextBinding::drawElements > . > . > ---- > 8ms 53.3% rx::Renderer11::genericDrawElements(rx::Context11 *,unsigned > int,int,unsigned int,void const *,int,gl::IndexRange const &) > ---- > 4ms 26.7% rx::Renderer11::applyVertexBuffer(gl::State const &,unsigned > int,int,int,int,rx::TranslatedIndexData *) > ---- > 3ms 20.0% rx::InputLayoutCache::applyVertexBuffers(gl::State const > &,std::vector<rx::TranslatedAttribute,std::allocator<rx:: > TranslatedAttribute> > const > &,std::vector<rx::TranslatedAttribute,std::allocator<rx:: > TranslatedAttribute> > const &,unsigned int,int,rx::TranslatedIndexData > *,int) > ---- > 1ms 6.7% rx::InputLayoutCache::updateInputLayout(gl::State const > &,unsigned int,std::array<int,16> const &,int) I get a similar result from Visual Studio Diagnostics. I think it is worth to have some investigation on rx::Renderer11::applyShaders and applyVertexBuffer(). Function Name Total CPU (%) Self CPU (%) Total CPU (ms) Self CPU (ms) Module - rx::Renderer11::genericDrawElements 26.22 % 0.00 % 129 0 libGLESv2.dll + rx::Renderer11::applyShaders 7.93 % 0.20 % 39 1 libGLESv2.dll + rx::Renderer11::applyVertexBuffer 6.50 % 0.00 % 32 0 libGLESv2.dll + rx::RendererD3D::applyTextures 4.47 % 0.00 % 22 0 libGLESv2.dll + rx::Renderer11::drawElementsImpl 4.27 % 0.20 % 21 1 libGLESv2.dll + rx::Renderer11::updateState 2.03 % 0.20 % 10 1 libGLESv2.dll + rx::Renderer11::applyIndexBuffer 0.61 % 0.00 % 3 0 libGLESv2.dll + rx::Renderer11::generateSwizzles 0.41 % 0.00 % 2 0 libGLESv2.dll
In rx::Renderer11::applyShaders, ANGLE mainly does these below. 1. Update its D3D11 shader InputLayout via current GL context state. 2. Try to obtain pre-compiled shader executables corresponding to shader codes we are using from its internal cache, if failed, compile the code right now. 3. Set and assemble these shaders to form a complete pipeline. I think the most time-consuming part is 2. if it encounter a "cache miss", it has to compile D3D shader code on-the-fly which we have already known D3DCompile is really time-consuming. Maybe we can dig the stack deeper, and try to calculate the "cache hit rate", perhaps there will be something interesting. By the way, I attempted to perform profiling on latest Chromium with VisualStudio as well, but I am unable to obtain a proper report because visualstudio repeatedly crash when generating report, possibly caused by OOM.
I have profiled chromium and got a valid report, it seems that chromium exhibits similar behavior in draw calls comparing to gecko, so I think it is because of ANGLE's design instead of our misuse. However, I found something interesting when running a specific WebGL2 benchmark demo. In this demo, firefox got about 30~40 fps with 5~7% CPU idle while chromium got a stable 70fps with 40~50% CPU idle. I am not sure whether the CPU performance difference is caused by bogus benchmark tool or not, but chromium does render the demo more smooth. I will try to compare the profile both firefox and chromium when they are rendering this demo to find out the difference.
(In reply to Michael Leu[:Lenzak](UTC+8) from comment #4) > I have profiled chromium and got a valid report, it seems that chromium > exhibits similar behavior in draw calls comparing to gecko, so I think it is > because of ANGLE's design instead of our misuse. > > However, I found something interesting when running a specific WebGL2 > benchmark demo. > > In this demo, firefox got about 30~40 fps with 5~7% CPU idle while chromium > got a stable 70fps with 40~50% CPU idle. > > I am not sure whether the CPU performance difference is caused by bogus > benchmark tool or not, but chromium does render the demo more smooth. > > I will try to compare the profile both firefox and chromium when they are > rendering this demo to find out the difference. Skip profiling Chrome for now. Just focus on identifying where we can do better in general.
Platform: Win 10 GPU: NV GeForce GTX 1080 Demo: Zen Garden-wasm Version: 55.0a1 (2017-04-24) (64-bit) release build Capture from: Gecko Profiler I notice when we do something that needs InvalidateBufferFetching(), we have to validate our buffer fetching again at WebGLContext::ValidateBufferFetching(). It takes a lot of time. RunningTime Self 7ms 58.3% 0ms mozilla::dom::WebGL2RenderingContextBinding::drawElements --- 6ms 50.0% 0ms mozilla::WebGLContext::DrawElements_check(char const *,unsigned int,int,unsigned int,__int64,int) --- 4ms 33.3% 2ms mozilla::WebGLContext::ValidateBufferFetching(char const *)
Platform: Win 10 GPU: NV GeForce GTX 1080 Demo: Zen Garden-wasm Version: 55.0a1 (2017-05-02) (64-bit) release build -disable-e10s Capture from: Visual Studio Diagnostics In Visual Studio Diagnostics, I have to turn off e10s for capturing the WebGL function calls. We realize fDrawElements takes most of time like I mentioned at Comment 2. It is from ANGLE lib, I think we could not do too much improvement there. Therefore, I would like to discuss some findings that are related to our implementation. We take much time on IndexedFetch and BufferFetching validation. They probably are redundant because we have ANGLE to help this already. Besides, ScopedResolveTexturesForDraw also takes a little time. An another function we need to investigate is WebGL2RenderingContextBinding::uniform4fv(). WebGLContext::ValidateUniformArraySetter takes 18 ms of 82 ms compare to gl::GLContext::fUniform4fv takes 52 ms. Moreover, Float32ArrayOrUnrestrictedFloatSequenceArgument::TrySetToFloat32Array takes 21 ms of 127 ms. We might be interested in how can we improve it. Function Name Total CPU (%) Self CPU (%) Total CPU (ms) Self CPU (ms) Module - firefox.exe (PID: 17316) 100.00 % 0.00 % 1450 0 firefox.exe --- Function Name Total CPU (%) Self CPU (%) Total CPU (ms) Self CPU (ms) Module - mozilla::dom::WebGL2RenderingContextBinding::drawElements 36.41 % 0.07 % 528 1 xul.dll Function Name Total CPU (%) Self CPU (%) Total CPU (ms) Self CPU (ms) Module + mozilla::gl::GLContext::fDrawElements 29.38 % 0.07 % 426 1 xul.dll Function Name Total CPU (%) Self CPU (%) Total CPU (ms) Self CPU (ms) Module + mozilla::WebGLContext::DrawElements_check 2.21 % 0.14 % 32 2 xul.dll Function Name Total CPU (%) Self CPU (%) Total CPU (ms) Self CPU (ms) Module + mozilla::WebGLBuffer::ValidateIndexedFetch 0.76 % 0.07 % 11 1 xul.dll Function Name Total CPU (%) Self CPU (%) Total CPU (ms) Self CPU (ms) Module + mozilla::WebGLContext::ValidateBufferFetching 0.76 % 0.14 % 11 2 xul.dll Function Name Total CPU (%) Self CPU (%) Total CPU (ms) Self CPU (ms) Module + mozilla::ScopedResolveTexturesForDraw::ScopedResolveTexturesForDraw 1.24 % 0.07 % 18 1 xul.dll --- Function Name Total CPU (%) Self CPU (%) Total CPU (ms) Self CPU (ms) Module + mozilla::dom::WebGL2RenderingContextBinding::uniform4fv 8.76 % 0.41 % 127 6 xul.dll Function Name Total CPU (%) Self CPU (%) Total CPU (ms) Self CPU (ms) Module + mozilla::WebGLContext::Uniform4fv 5.66 % 0.00 % 82 0 xul.dll Function Name Total CPU (%) Self CPU (%) Total CPU (ms) Self CPU (ms) Module + mozilla::dom::Float32ArrayOrUnrestrictedFloatSequenceArgument::TrySetToFloat32Array 1.45 % 0.14 % 21 2 xul.dll
Depends on: 1363433
Depends on: 1363436
Platform: Win 10 GPU: NV GeForce GTX 1080 Demo: Zen Garden-wasm Version: 55.0a1 (2017-05-02) (64-bit) release build -enable-e10s Capture from: Visual Studio 2017 Diagnostics
Platform: Win 10 GPU: NV GeForce GTX 1080 Demo: Zen Garden-wasm Version: 55.0a1 (2017-05-02) (64-bit) release build -enable-e10s Capture from: Visual Studio 2017 Diagnostics
(In reply to Daosheng Mu[:daoshengmu] from comment #9) > Created attachment 8868447 [details] > Enable Angle in Zen Garden-wasm > > Platform: Win 10 > GPU: NV GeForce GTX 1080 > Demo: Zen Garden-wasm > Version: 55.0a1 (2017-05-02) (64-bit) release build -enable-e10s > Capture from: Visual Studio 2017 Diagnostics At the case of enabling ANGLE, we find the bottleneck is at gl::DrawElements that takes 17.48% and WebGLContext::DrawElements_check takes less than 0.6%. So, ValidateIndexedFetch() and ValidateBufferFetching() didn't have so much impact. Probably, we should start thinking about why ANGLE takes us so much time... I am curious if it is because we put JS binding code and gl draw commands at the same thread, it makes us sending gl draw commands to GPU intermittently and turn back to serve JS codes. In this way, we would make driver and runtime busy in switching. Ideally, we should collect all gl commands and dispatch them simultaneously.
We tried disable VSync (by set frame_rate to 1000) and remove layer transaction to eliminate external interference to the rendering performance. We used this configuration to compare the performance between Gecko w/ and w/o ANGLE Our GFX card is Nvidia GTX 1070, the testcase is ZenGarden-wasm in emunittest: w/ ANGLE: about 63fps w/o ANGLE about 110fps It seems that there is very significant difference between w/ and w/o ANGLE.
Currently, we find the performance issue should come from ANGLE. In order to do the further investigation on ANGLE, we can use this patch to help us.
After applying attachment 8871629 [details] [diff] [review], we can get the profiling result among Zen Garden, In Limbo, and 10kCubes as below when running emunittests. --------MozProfiler::ShowResult------------ Zen Garden Fri May 26 14:58:00 2017 -ProgramD3D::applyUniformBuffers elapsed time: 44.980147 -Renderer11::applyIndexBuffer elapsed time: 48.917923 -Renderer11::applyShaders elapsed time: 48.458520 -Renderer11::applyTransformFeedbackBuffers elapsed time: 46.311202 -Renderer11::applyVertexBuffer elapsed time: 44.262802 -Renderer11::drawArraysImpl elapsed time: 0.228934 -Renderer11::drawElementsImpl elapsed time: 46.594001 -Renderer11::generateSwizzles elapsed time: 88.949233 -RendererD3D::applyTextures elapsed time: 90.872598 Renderer11::genericDrawArrays elapsed time: 0.415961 Renderer11::genericDrawElements elapsed time: 55.389535 --------MozProfiler::ShowResult------------ Fri May 26 15:03:50 2017 In Limbo -ProgramD3D::applyUniformBuffers elapsed time: 37.641925 -Renderer11::applyIndexBuffer elapsed time: 0.644062 -Renderer11::applyShaders elapsed time: 38.016607 -Renderer11::applyTransformFeedbackBuffers elapsed time: 35.982898 -Renderer11::applyVertexBuffer elapsed time: 35.555896 -Renderer11::drawArraysImpl elapsed time: 36.608179 -Renderer11::drawElementsImpl elapsed time: 0.535119 -Renderer11::generateSwizzles elapsed time: 70.268288 -RendererD3D::applyTextures elapsed time: 71.081953 Renderer11::genericDrawArrays elapsed time: 47.727757 Renderer11::genericDrawElements elapsed time: 1.398989 --------MozProfiler::ShowResult------------ Fri May 26 15:31:21 2017 10kCubes -ProgramD3D::applyUniformBuffers elapsed time: 130.304722 -Renderer11::applyShaders elapsed time: 141.348076 -Renderer11::applyTransformFeedbackBuffers elapsed time: 130.891795 -Renderer11::applyVertexBuffer elapsed time: 131.179610 -Renderer11::drawArraysImpl elapsed time: 132.876443 -Renderer11::generateSwizzles elapsed time: 275.337574 -RendererD3D::applyTextures elapsed time: 257.352789 Renderer11::genericDrawArrays elapsed time: 128.978519 We can figure out the most wasteful time part is generateSwizzles() and applyTextures(). The weird thing is that 10kCubes demo seems to not have complicated shaders, even it needn't have textures for this drawcall benchmark. Perhaps, we can try to improve the applyTextures() function.
I think we can add this shorcut, if (totalRegisterCountVS && totalRegisterCountPS && vertexUniformsDirty && pixelUniformsDirty) { break; } at gl::Error Renderer11::applyUniforms() for avoiding redundant check, and it can improve around 4.8% performance on Zen Garden (52676ms -> 50121ms total time).
Attachment #8874748 - Attachment description: CPU usage of main thread in content process w/ ANGLE → CPU usage of main thread in content process w/o ANGLE
Here is the CPU usage over time of main thread in Content process when running ZenGarden-WASM demo. When ANGLE is enabled, the CPU usage is clamped to about 13% which is the maximum usage (100% * 1/8 for Core i7) for a single CPU logical core. And this phenomenon is not present when we have ANGLE disabled. This indicates that ANGLE has overhead which occupies all CPU resource for one thread and become the bottleneck when executing webgl webapps. I think offloading ANGLE calls out of main thread will be helpful for this kind of situation.
Below is some profiling results in ANGLE functions: Platform: Win 10 GPU: NV GeForce GTX 1080 --------MozProfiler::ShowResult------------ Tue Jun 6 17:52:27 2017 10kCubes -ProgramD3D::applyUniformBuffers elapsed time: 139.404335 -ProgramD3D::updateSamplerMapping elapsed time: 132.711752 -Renderer11::applyPrimitiveType elapsed time: 146.973945 -Renderer11::applyShaders elapsed time: 3615.922279 -Renderer11::applyTransformFeedbackBuffers elapsed time: 1540.105728 -Renderer11::applyVertexBuffer elapsed time: 2857.612279 -Renderer11::drawArraysImpl elapsed time: 308.253003 -Renderer11::generateSwizzles elapsed time: 211.696698 -Renderer11::updateState elapsed time: 830.215602 -RendererD3D::applyTextures elapsed time: 568.923871 -State::isTransformFeedbackActiveUnpaused elapsed time: 405.400222 Renderer11::genericDrawArrays elapsed time: 24498.455484 --------MozProfiler::ShowResult------------ Tue Jun 6 17:55:08 2017 Zen Garden-wasm -ProgramD3D::applyUniformBuffers elapsed time: 38.940291 -ProgramD3D::updateSamplerMapping elapsed time: 46.271044 -Renderer11::applyIndexBuffer elapsed time: 467.431196 -Renderer11::applyPrimitiveType elapsed time: 42.619321 -Renderer11::applyShaders elapsed time: 2063.050164 -Renderer11::applyTransformFeedbackBuffers elapsed time: 641.468724 -Renderer11::applyVertexBuffer elapsed time: 2256.686961 -Renderer11::drawArraysImpl elapsed time: 1.349382 -Renderer11::drawElementsImpl elapsed time: 97.800884 -Renderer11::generateSwizzles elapsed time: 218.588933 -Renderer11::updateState elapsed time: 1105.074853 -RendererD3D::applyTextures elapsed time: 1926.555555 -State::isTransformFeedbackActiveUnpaused elapsed time: 91.870827 Renderer11::genericDrawArrays elapsed time: 101.667316 Renderer11::genericDrawElements elapsed time: 13446.546693 --------MozProfiler::ShowResult------------ Tue Jun 6 17:56:41 2017 In Limbo -ProgramD3D::applyUniformBuffers elapsed time: 38.805037 -ProgramD3D::updateSamplerMapping elapsed time: 42.678620 -Renderer11::applyIndexBuffer elapsed time: 13.434028 -Renderer11::applyPrimitiveType elapsed time: 43.160927 -Renderer11::applyShaders elapsed time: 1931.327180 -Renderer11::applyTransformFeedbackBuffers elapsed time: 639.339219 -Renderer11::applyVertexBuffer elapsed time: 1535.310072 -Renderer11::drawArraysImpl elapsed time: 101.323352 -Renderer11::drawElementsImpl elapsed time: 2.370359 -Renderer11::generateSwizzles elapsed time: 143.136457 -Renderer11::updateState elapsed time: 1083.531692 -RendererD3D::applyTextures elapsed time: 796.538530 -State::isTransformFeedbackActiveUnpaused elapsed time: 113.735229 Renderer11::genericDrawArrays elapsed time: 10531.219175 Renderer11::genericDrawElements elapsed time: 344.542353 Summary: We can realize applyVertexBuffer() and applyShaders() are the most wasteful functions in ANGLE. But, after some investigation, it seems that we do not have much place could do optimization unless we wanna enhance some C++ STL usages for ANGLE.
Attachment #8871629 - Attachment is obsolete: true
Well, I think I got an important finding. After watching our thread model from Process Explorer.exe, we can get the below result. When we are using ANGLE, we will put these draw commands into libGLESv2.dll, and libGLESv2.dll is running at the same thread that serves Firefox.exe. However, if we disable ANGLE, our draw commands are put into the xul.dll that will be dispatched into a separate thread from Firefox.exe. Due to ANGLE, libGLESv2.dll is quite performance wasteful, it seems to not a good idea to put both of Firefox.exe and libGLESv2.dll at the same thread. Therefore, we should start to think about how to let libGLESv2.dll be run at an individual thread. Platform: Win 10 GPU: NV GeForce GTX 1080 disable VSync ANGLE Function Name Total CPU (ms) Total CPU (%) Self CPU (ms) Self CPU (%) Module gl::DrawElements 1514 16.18 % 4 0.04 % libGLESv2.dll Diable-ANGLE Function Name Total CPU (ms) Total CPU (%) Self CPU (ms) Self CPU (%) Module WebGLContext::DrawElements 258 1.42 % 12 0.07 % xul.dll
(In reply to Daosheng Mu[:daoshengmu] from comment #20) > Well, I think I got an important finding. After watching our thread model > from Process Explorer.exe, we can get the below result. > > When we are using ANGLE, we will put these draw commands into libGLESv2.dll, > and libGLESv2.dll is running at the same thread that serves Firefox.exe. > However, if we disable ANGLE, our draw commands are put into the xul.dll > that will be dispatched into a separate thread from Firefox.exe. Due to > ANGLE, libGLESv2.dll is quite performance wasteful, it seems to not a good > idea to put both of Firefox.exe and libGLESv2.dll at the same thread. > > Therefore, we should start to think about how to let libGLESv2.dll be run at > an individual thread. > > Platform: Win 10 > GPU: NV GeForce GTX 1080 > disable VSync > > ANGLE > Function Name Total CPU (ms) Total CPU (%) Self CPU (ms) Self CPU (%) > Module > gl::DrawElements 1514 16.18 % 4 0.04 % > libGLESv2.dll > > Diable-ANGLE > Function Name Total CPU (ms) Total CPU (%) Self CPU (ms) Self CPU > (%) Module > WebGLContext::DrawElements 258 1.42 % 12 0.07 % > xul.dll Correct the comment here, even though ANGLE is disabled, the xul.dll that runs WebGLContext is executed in the main thread as well. So, for the use case of WebGL, both of the draw commands from OpenGL or ANGLE are run in the main thread.
We have done the profiling work in this period of time.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
The index validation for computing the index range is really wasteful, https://dxr.mozilla.org/mozilla-central/rev/d9a144b7b6d994fc9a497c53b13f51a2a654d85e/gfx/angle/src/libANGLE/validationES.cpp#3336. It tries to find the minimal and maximum index no. from the index buffer to confirm the index buffer is large enough for vertex buffer pulling elementArrayBuffer->getIndexRange(type, static_cast<size_t>(offset), count, state.isPrimitiveRestartEnabled(), indexRangeOut);
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: