Closed Bug 1011017 Opened 11 years ago Closed 7 years ago

Certain WebGL operations take a disproportionately long time compared to other platforms.

Categories

(Core :: Graphics: CanvasWebGL, defect)

Hardware: ARM
OS: Gonk (Firefox OS)
Priority: Not set
Severity: normal
Tracking

Status: RESOLVED FIXED

People

(Reporter: jujjyl, Unassigned)

Details

(Whiteboard: [games] webgl-perf)

Attachments

(2 files)

Running profiles on a WebGL game on FFOS, I'm seeing long times being spent in odd places in the WebGL backend:

- 39.3% of total time spent in SurfaceStream_TripleBuffer::SwapProducer (running in the browser on OSX, this takes less than 5%)
- 20.6% of total time spent in glDrawArrays (OSX browser gives less than 10%)
- 8.2% of total time spent in glEnable (OSX browser < 0.5%)
- 6.5% of total time spent in glViewport (OSX browser shows the item, but it is 0.0%)
- 3.9% of total time spent in glBufferSubData (OSX browser 0.1%)

Overall, 27.8% of the time is left for asm.js JavaScript execution. On the Flame device, the game runs at an average of 35 fps; on OSX, if I disable the frame limiter, it runs at over 220 fps.

glViewport is called a total of two times per frame in the game, which makes that figure look very high, and glEnable is called fewer than 10 times per frame. These profiler hotspots might show opportunities for optimizing GL performance on FFOS.

The application is private Mozilla partner code; contact me for details.
Whiteboard: [games]
(In reply to Jukka Jylänki from comment #0)
> Running profiles on a WebGL game on FFOS, I'm seeing long times being spent
> in odd places in the WebGL backend:
>
> - 39.3% of total time spent in SurfaceStream_TripleBuffer::SwapProducer
> (running in the browser on OSX, this takes less than 5%)
> - 20.6% of total time spent in glDrawArrays (OSX browser gives less than 10%)
> - 8.2% of total time spent in glEnable (OSX browser < 0.5%)
> - 6.5% of total time spent in glViewport (OSX browser shows the item, but it is 0.0%)
> - 3.9% of total time spent in glBufferSubData (OSX browser 0.1%)
>
> Overall, 27.8% of the time is left for asm.js JavaScript execution. On the
> Flame device, the game runs at an average of 35 fps; on OSX, if I disable
> the frame limiter, it runs at over 220 fps.
>
> glViewport is called a total of two times per frame in the game, which
> makes that figure look very high, and glEnable is called fewer than 10
> times per frame. These profiler hotspots might show opportunities for
> optimizing GL performance on FFOS.
>
> The application is private Mozilla partner code; contact me for details.

glViewport and glEnable should basically be free. Are these glFoo symbols the actual driver symbols? If so, there's really not a lot we can do, unless they're redundant calls we can omit.

Depending on your device, SwapProducer can be really slow. However, most of that time should be in glReadPixels, which I don't see on the list here. That means we're probably waiting on the compositor, which can be tricky to improve.

Running at 1/6th the speed of an OSX machine is actually pretty good, though we should always look for ways to improve.
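If they are redundant calls, a thin client-side cache is the usual way to drop them before they reach the driver. Below is a minimal sketch of that idea; the class and its layout are hypothetical, not Gecko's actual GLContext state tracking.

```cpp
#include <GLES2/gl2.h>
#include <unordered_map>

// Client-side cache for glEnable/glDisable capability bits. If the requested
// state matches what was last sent, skip the driver call entirely; on some
// mobile drivers even a state-preserving glEnable can be surprisingly costly.
class CapabilityCache {
public:
    void Enable(GLenum cap)  { Set(cap, true); }
    void Disable(GLenum cap) { Set(cap, false); }

private:
    void Set(GLenum cap, bool enabled) {
        auto it = mState.find(cap);
        if (it != mState.end() && it->second == enabled) {
            return; // redundant call, drop it before it reaches the driver
        }
        mState[cap] = enabled;
        if (enabled) {
            glEnable(cap);
        } else {
            glDisable(cap);
        }
    }

    std::unordered_map<GLenum, bool> mState;
};
```

With a cache like this in front of the driver, the seven per-frame GL_DEPTH_TEST/GL_BLEND toggles mentioned later in this bug would shrink to only the transitions that actually change state.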
Whiteboard: [games] → [games] webgl-perf
I did not have native symbols for the Flame device, so I need to build B2G for it myself. The profile only had JS-side call stacks.
(In reply to Jukka Jylänki from comment #0)
> - 8.2% of total time spent in glEnable (OSX browser < 0.5%)
> - 6.5% of total time spent in glViewport (OSX browser shows the item, but it is 0.0%)

That is odd. What chipset does the Flame have?
Oh, I now notice that while doing the profiling run on OSX, I had the layers.dump pref enabled for debugging. Disabling that lets the app run wild and smooth at 500 fps, with rAF times taking less than 2 ms on the OSX laptop. If I try to make the OMTC composition rate higher, I seem to hit some timing issue, and frame rates become erratic.

Reading https://blog.mozilla.org/blog/2014/02/23/new-developer-hardware-and-tools-show-firefox-os-ecosystem-momentum-2/ , the Flame has a Qualcomm MSM8210 Snapdragon. From http://en.wikipedia.org/wiki/Snapdragon_(system_on_chip) I see that it would be an Adreno 302 GPU, which the Huawei Ascend Y530, Sony Xperia E1, Sony Xperia E1 Dual and Motorola Moto E also have.

glEnable() is called seven times during a frame, to enable and disable GL_DEPTH_TEST and GL_BLEND during rendering. glViewport() is called exactly twice.
Tracing the code, I see that calling glEnable() takes anywhere between 1 and 8 ms, with glEnable(GL_DEPTH_TEST) being the slowest glEnable call to make. glViewport() calls take 2-11 ms. Occasionally there are very large spikes too, with glEnable(GL_DEPTH_TEST) spending more than 20 ms.
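Because GL calls are asynchronous, long wall-clock times like these usually mean the call blocked inside the driver (for example, on an internal flush), not that the operation itself is expensive. A minimal sketch of a timing wrapper for catching such stalls; the threshold and names are illustrative, not the game's actual instrumentation:

```cpp
#include <GLES2/gl2.h>
#include <stdio.h>
#include <time.h>

// Wall-clock time in milliseconds.
static double NowMs() {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec * 1000.0 + ts.tv_nsec / 1.0e6;
}

// Time a single glEnable call. A long time here means the call blocked in
// the driver (e.g. an internal pipeline flush), not that the GPU work for
// the enabled state is slow.
void TimedEnable(GLenum cap, const char* name) {
    double t0 = NowMs();
    glEnable(cap);
    double t1 = NowMs();
    if (t1 - t0 > 1.0) { // arbitrary 1 ms stall threshold
        printf("glEnable(%s) blocked for %.2f ms\n", name, t1 - t0);
    }
}
```

Timer queries would be needed to measure actual GPU time, but for spotting driver-side stalls like the ones above, CPU wall-clock time around the call is exactly the right measurement.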
I wonder if these are implemented as a pipeline reconfiguration, which would cause the driver to stall and drain all commands in the push buffer. That sounds very expensive for glViewport(), though.
If the game is not fully optimized for mobile GPUs, the glDrawArrays symptom may be entirely normal. You can check: http://gamedev.stackexchange.com/questions/46820/c-and-opengl-es-gldrawarrays-calls-are-time-consuming

Since a mobile GPU is totally different from a desktop GPU, comparing them may not provide any useful information. If the game runs well on Fennec but runs badly on Firefox OS on the same hardware, then something is wrong.
The game is extremely optimized for mobile GPUs, and the rendering is very lightweight (~70 WebGL draw calls and ~6000 vertices per frame). We've spent more than 5 months optimizing the asm.js compilation path with Emscripten and the GLES2 renderer for the game. It runs natively on Android and iOS, and on middle-to-low-end Android devices the game easily runs at 60fps.

I flashed my Nexus 10 with Cyanogenmod last night to enable the ARM Mali Graphics Debugger, and stepped through the application in that to trace the GL usage that the Nightly browser is doing. I find reading the trace alarming: the browser makes more GL calls per frame (~7000) than there are even vertices to draw (~6000) in a frame!

The most visible thing was bug 1013647, but other odd things popped up too:

- The debugger counts about 300 draw calls being made per frame, even though the WebGL application only does about 70. This is a very large overhead from the compositor itself!
- The compositor tiled rendering loop looks like it's been implemented somewhat naively, and it does a lot of redundant glUseProgram+glBind+glVertexAttribPointer+glUniform* calls, which could be reduced by a tighter compositing loop that batches and avoids redundant calls.
- There are a large number of redundant glEnable/DisableVertexAttribArray calls in the stream that the WebGL code does not do.
- Some of the shader programs used by gecko/B2G during compositing have highp precision arithmetic in the fragment shader, which is overkill for just a "glBlitFramebuffer" type of blit operation that only copies colors (see the shader sketch after this comment).
- KHR sync objects are created and destroyed at each frame, but apparently never waited on (or the call to wait was not caught by the ARM Mali Graphics Debugger).
- There is a glFinish call in the middle of the call stream, before eglSwapBuffers is called to finally present, which looks like a per-frame pipeline bubble.
- There are also glGetError and glCheckFramebufferStatus calls being made in the middle of each frame, even though the WebGL code never does those. These would be better avoided as well.
- Preferably there should be a fast path that avoids the whole compositor architecture, as that's what wastes about 60% of the CPU time during a frame. For an HTML page that directly renders to a fullscreen WebGL canvas, those GL operations should directly manipulate GL framebuffer 0 instead of routing all rendering through a separate FBO and blitting the results to screen tile by tile. This has 2x fullscreen pixels worth of added per-pixel impact per frame, which I suspect explains why running the game natively gives so much better GPU performance than running in gecko.

I would very much like to export a call stream from that tool as an attachment here, but the Mali Graphics Debugger tool does not have such an option! :( I definitely recommend using that tool, or the Qualcomm Adreno SDK on a Nexus 4/5, to audit the GL operations, since that gives a lot of insight into how gecko internally operates.

Burning the extra pixel fill with the compositor seems to be the biggest issue, and optimizing the compositor and removing the associated CPU-GPU syncs in that architecture would improve performance the most, based on the MGD traces I'm seeing.
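On the highp point above: a color-only blit needs no highp arithmetic in the fragment stage. A minimal sketch of what explicit-precision blit shaders could look like; these are illustrative GLSL strings, not Gecko's actual compositor shaders:

```cpp
// Hypothetical minimal blit shaders with explicit precision qualifiers.
// For an 8-bit-per-channel source, mediump is plenty for the color math;
// only very large surfaces push texture coordinates past mediump range.
static const char kBlitVertexShader[] =
    "attribute vec2 aPos;\n"
    "attribute vec2 aTexCoord;\n"
    "varying mediump vec2 vTexCoord;\n"
    "void main() {\n"
    "  vTexCoord = aTexCoord;\n"
    "  gl_Position = vec4(aPos, 0.0, 1.0);\n"
    "}\n";

static const char kBlitFragmentShader[] =
    "precision mediump float;\n"  // ES2 fragment shaders have no default
    "varying mediump vec2 vTexCoord;\n"
    "uniform sampler2D uTexture;\n"
    "void main() {\n"
    "  gl_FragColor = texture2D(uTexture, vTexCoord);\n"
    "}\n";
```

As a later comment notes, highp texture coordinates may still be needed on some surfaces, so assigning each variable an explicit precision is the safer change than a blanket mediump.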
It seems like glViewport can in fact cause queue flushes.
(In reply to Jukka Jylänki from comment #8)
> The game is extremely optimized for mobile GPUs, and the rendering is very
> lightweight (~70 WebGL draw calls and ~6000 vertices per frame). We've
> spent more than 5 months optimizing the asm.js compilation path with
> Emscripten and the GLES2 renderer for the game. It runs natively on Android
> and iOS, and on middle-to-low-end Android devices the game easily runs at
> 60fps.
>
> I flashed my Nexus 10 with Cyanogenmod last night to enable the ARM Mali
> Graphics Debugger, and stepped through the application in that to trace the
> GL usage that the Nightly browser is doing. I find reading the trace
> alarming: the browser makes more GL calls per frame (~7000) than there are
> even vertices to draw (~6000) in a frame!
>
> The most visible thing was bug 1013647, but other odd things popped up too:
> - The debugger counts about 300 draw calls being made per frame, even
> though the WebGL application only does about 70. This is a very large
> overhead from the compositor itself!
> - The compositor tiled rendering loop looks like it's been implemented
> somewhat naively, and it does a lot of redundant
> glUseProgram+glBind+glVertexAttribPointer+glUniform* calls, which could be
> reduced by a tighter compositing loop that batches and avoids redundant
> calls.

I think this is worth checking in another bug; I believe a fullscreen WebGL game does not have any tiles inside.

> - There are a large number of redundant glEnable/DisableVertexAttribArray
> calls in the stream that the WebGL code does not do.

All these things can be avoided by GL state caching. I heard that we did that for the compositor, didn't we?

> - Some of the shader programs used by gecko/B2G during compositing have
> highp precision arithmetic in the fragment shader, which is overkill for
> just a "glBlitFramebuffer" type of blit operation that only copies colors.

In fact, we may need highp texture coordinates, so let's try to assign each variable an explicit precision and see what differs.

> - KHR sync objects are created and destroyed at each frame, but
> apparently never waited on (or the call to wait was not caught by the ARM
> Mali Graphics Debugger).

Fence objects are tracked by other bugs.

> - There is a glFinish call in the middle of the call stream, before
> eglSwapBuffers is called to finally present, which looks like a per-frame
> pipeline bubble.
> - There are also glGetError and glCheckFramebufferStatus calls being made
> in the middle of each frame, even though the WebGL code never does those.
> These would be better avoided as well.

These two are really strange, and worth analyzing.

> - Preferably there should be a fast path that avoids the whole compositor
> architecture, as that's what wastes about 60% of the CPU time during a
> frame. For an HTML page that directly renders to a fullscreen WebGL canvas,
> those GL operations should directly manipulate GL framebuffer 0 instead of
> routing all rendering through a separate FBO and blitting the results to
> screen tile by tile. This has 2x fullscreen pixels worth of added per-pixel
> impact per frame, which I suspect explains why running the game natively
> gives so much better GPU performance than running in gecko.
>
> Burning the extra pixel fill with the compositor seems to be the biggest
> issue, and optimizing the compositor and removing the associated CPU-GPU
> syncs in that architecture would improve performance the most, based on the
> MGD traces I'm seeing.

In fact, it is impossible to really bypass the composition stage on both B2G and Android. On Android, a Java app must use a GLSurfaceView to render anything with GL, and a GLSurfaceView is in fact a layer in SurfaceFlinger (the Android compositor); since it does not need Android view-system painting, it is handed to composition directly. On B2G, the rendering result is handed to the compositor directly, too.

On both platforms, it may be faster and use less power if HWC is enabled for such a case, so let's check whether HWC works as expected. I will try to see whether HWC on the Flame works for the game once I get a device. If not, I think we may have a bug for that already. By the way, even if HWC works, the sync is still needed; otherwise artefacts may be observed.
The profiling data from miniperf@Flame looks like this image. The profile data is too large to upload.

It seems that CPU idle uses 20% of the time and audio playback uses 2%. There is a huge corrupted stack (maybe asm.js), and most WebGL calls are not very heavy.

As a result, I am starting to think it may be GPU-bound, or that the GL call pattern makes the CPU wait for GPU results regularly. By hacking the GLScreenBuffer a bit to avoid any CPU stall, I can get about 50 FPS in most stages I tried. I think we have a bug (bug 767484) to support eglSwapBuffers in WebGL, which can help with this kind of problem. Maybe we can wait for it to be completed, or do a similar try to see whether it helps.

Another thing I noticed is that the game seems to be using MRT (logged by egltrace). It calls glGetIntegerv(GL_DRAW_BUFFER[0,4], ...) each frame, then glDrawBuffers(4, ...). Maybe that is too heavy for the GPU...
(In reply to Chiajung Hung [:chiajung] from comment #11)
> Created attachment 8428599 [details]
> Screenshot from 2014-05-26 17:10:00.png
>
> The profiling data from miniperf@Flame looks like this image. The profile
> data is too large to upload.

Are there instructions for running miniperf on Flame someplace? Does the kernel that we ship support it directly?

> It seems that CPU idle uses 20% of the time and audio playback uses 2%.
> There is a huge corrupted stack (maybe asm.js), and most WebGL calls are
> not very heavy.
>
> As a result, I am starting to think it may be GPU-bound, or that the GL
> call pattern makes the CPU wait for GPU results regularly.

Yeah, that matches what I've seen. I would still like to check whether our WebGL implementation is causing us to wait more than necessary.

> By hacking the GLScreenBuffer a bit to avoid any CPU stall, I can get about
> 50 FPS in most stages I tried. I think we have a bug (bug 767484) to
> support eglSwapBuffers in WebGL, which can help with this kind of problem.
> Maybe we can wait for it to be completed, or do a similar try to see
> whether it helps.

Can you post a patch of your hack?

> Another thing I noticed is that the game seems to be using MRT (logged by
> egltrace). It calls glGetIntegerv(GL_DRAW_BUFFER[0,4], ...) each frame,
> then glDrawBuffers(4, ...). Maybe that is too heavy for the GPU...

The appearance of MRT is caused by our Clear() code and not by the game. See bug 1014209 for more info on that problem.
Flags: needinfo?(chung)
Attached patch: Test patch (Splinter Review)
The idea is to mimic what Android does: eglSwapBuffers (glFlush) on the client (content) side, and a fence wait on the compositor side.

And here are the instructions for miniperf from StevenLee. These are the steps to run miniperf on B2G:

* git clone https://github.com/comoyo/B2G.git
* ./config.sh flame
* run your programs and do something
* ./run-perf.sh record-sps
* get the profiling data and upload it to http://people.mozilla.org/~bgirard/cleopatra/#

You may need a patch around strzcmp16 to get it to build. If you need the patch, I can upload it here too.
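For reference, here is a minimal sketch of the fence handoff the patch aims at, using EGL_KHR_fence_sync: the content side flushes and inserts a fence after its frame, and the compositor waits on it before sampling the shared surface. This is illustrative only; real code would transport the sync (or a native fence fd) across the process boundary, which is elided here.

```cpp
#include <EGL/egl.h>
#include <EGL/eglext.h>
#include <GLES2/gl2.h>

// EGL_KHR_fence_sync entry points must be fetched at runtime.
static PFNEGLCREATESYNCKHRPROC     pCreateSync;
static PFNEGLCLIENTWAITSYNCKHRPROC pClientWaitSync;
static PFNEGLDESTROYSYNCKHRPROC    pDestroySync;

void InitFenceFuncs() {
    pCreateSync     = (PFNEGLCREATESYNCKHRPROC)eglGetProcAddress("eglCreateSyncKHR");
    pClientWaitSync = (PFNEGLCLIENTWAITSYNCKHRPROC)eglGetProcAddress("eglClientWaitSyncKHR");
    pDestroySync    = (PFNEGLDESTROYSYNCKHRPROC)eglGetProcAddress("eglDestroySyncKHR");
}

// Content side, end of frame: insert a fence into this context's command
// stream and flush so it actually reaches the GPU.
EGLSyncKHR ContentEndFrame(EGLDisplay dpy) {
    EGLSyncKHR sync = pCreateSync(dpy, EGL_SYNC_FENCE_KHR, nullptr);
    glFlush();
    return sync;
}

// Compositor side, before reading the producer's buffer: block until the
// producer's commands (up to the fence) have completed, instead of glFinish.
void CompositorWaitFrame(EGLDisplay dpy, EGLSyncKHR sync) {
    pClientWaitSync(dpy, sync, EGL_SYNC_FLUSH_COMMANDS_BIT_KHR, EGL_FOREVER_KHR);
    pDestroySync(dpy, sync);
}
```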
Flags: needinfo?(chung)
(In reply to Jeff Muizelaar [:jrmuizel] from comment #12)
> (In reply to Chiajung Hung [:chiajung] from comment #11)
> > Another thing I noticed is that the game seems to be using MRT (logged by
> > egltrace). It calls glGetIntegerv(GL_DRAW_BUFFER[0,4], ...) each frame,
> > then glDrawBuffers(4, ...). Maybe that is too heavy for the GPU...
>
> The appearance of MRT is caused by our Clear() code and not by the game.
> See bug 1014209 for more info on that problem.

Isn't it redundant? As the spec[1] says:

"6) Does this have any interactions with glClear?
RESOLVED: Yes. When multiple color buffers are enabled for writing, glClear clears all of the color buffers. Added language clarifying that glClear and glClearColor may affect multiple color buffers."

I will disable that code to see the difference.

[1] http://www.khronos.org/registry/gles/extensions/EXT/EXT_draw_buffers.txt
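In other words, per the spec text above, glClear already hits every color buffer enabled for writing, so no per-frame draw-buffer reshuffling is needed just to clear. A hypothetical illustration of the pattern seen in the trace versus the minimal one (assuming EXT_draw_buffers, with prototypes available; real code would load glDrawBuffersEXT via eglGetProcAddress):

```cpp
#define GL_GLEXT_PROTOTYPES // assumption: extension prototypes are declared
#include <GLES2/gl2.h>
#include <GLES2/gl2ext.h>

// Pattern seen in the trace: per-frame draw-buffer queries plus a reset of
// all four draw buffers before clearing.
void ClearWithDrawBufferShuffle() {
    GLint prev[4];
    for (int i = 0; i < 4; ++i) {
        glGetIntegerv(GL_DRAW_BUFFER0_EXT + i, &prev[i]); // per-frame query
    }
    const GLenum all[4] = { GL_COLOR_ATTACHMENT0, GL_COLOR_ATTACHMENT1_EXT,
                            GL_COLOR_ATTACHMENT2_EXT, GL_COLOR_ATTACHMENT3_EXT };
    glDrawBuffersEXT(4, all);
    glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);
}

// Sufficient per the spec: glClear already clears every color buffer that is
// enabled for writing, so the query/reset round trip above can be dropped.
void ClearSimple() {
    glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);
}
```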
By disabling those glDrawBuffers calls, I get about 5-10 more FPS, and a stable 60 fps at the score-settlement scene (about 45-55 before).
We already have the following approaches. Attachment 8428935 [details] [diff] [review] is similar to bug 1001417.

- Bug 1001417
- Bug 1014815
- Bug 767484
About attachment 8428935 [details] [diff] [review], I have a concern about the fence handling. It does not forward the fence from content to compositor; the compositor side just does its fencing independently. I am not sure it can always ensure correct fencing.
(In reply to Sotaro Ikeda [:sotaro] from comment #17)
> About attachment 8428935 [details] [diff] [review], I have a concern about
> the fence handling. It does not forward the fence from content to
> compositor; the compositor side just does its fencing independently. I am
> not sure it can always ensure correct fencing.

Right, I just want to know how fast it can be without those sync operations. And yes, I know there is a lot of work going on around the sync/eglSwapBuffers chain. This is just for testing; I think it should cause some weird rendering :p.

By the way, avoiding the glClear things alone does not make the game 60fps; it just goes from 40fps to 50fps at the title scene. I will try to generate new profiling data with the glClear hack.

In egltrace, I found some strange call flow like:

05-26 21:05:43.802  1363  1363 D libEGL  : glVertexAttribPointer(2, 2, GL_FLOAT, GL_FALSE, 16, (const GLvoid*) 0x00000008);
05-26 21:05:43.802  1363  1363 D libEGL  : glDrawArrays(GL_TRIANGLE_STRIP, 0, 148);
05-26 21:05:43.802  1363  1363 D libEGL  : glDrawArrays(GL_TRIANGLE_STRIP, 0, 88);
05-26 21:05:43.802  1363  1363 D libEGL  : glBindBuffer(GL_ARRAY_BUFFER, 223);

I think that is some logging problem, not a redundant draw call. But I noticed that we call glBindFramebuffer with the same fbo several times per frame, and I think our WebGL implementation makes it that way. The log looks like:

05-26 21:05:43.562  1363  1363 D libEGL  : glFlush();
05-26 21:05:43.562  1363  1363 I Gecko   : fence: 2.721094 ms
05-26 21:05:43.562  1363  1363 I Gecko   : Swap time: 3.291406 ms, swapP time: 3.290312 ms
05-26 21:05:43.562  1363  1363 D libEGL  : glBindFramebuffer(GL_FRAMEBUFFER_OES, 1);
05-26 21:05:43.562  1363  1363 D libEGL  : glFramebufferTexture2D(GL_FRAMEBUFFER_OES, GL_COLOR_ATTACHMENT0_OES, GL_TEXTURE_2D, 1, 0);
05-26 21:05:43.572  1363  1363 D libEGL  : glBindFramebuffer(GL_FRAMEBUFFER_OES, 1);
05-26 21:05:43.572  1363  1363 I Gecko   : ScreenBuffer attach time: 1.996198 ms
05-26 21:05:43.572  1363  1363 D libEGL  : glBindFramebuffer(GL_FRAMEBUFFER_OES, 1);
05-26 21:05:43.572  1363  1363 I Gecko   : publish time: 6.852238 ms
05-26 21:05:43.572  1363  1363 I Gecko   : swap C time: 0.010052 ms
05-26 21:05:43.622  1363  1363 D libEGL  : glBindFramebuffer(GL_FRAMEBUFFER_OES, 1);
05-26 21:05:43.622  1363  1363 D libEGL  : glBindFramebuffer(GL_FRAMEBUFFER_OES, 1);
05-26 21:05:43.622  1363  1363 D libEGL  : glBindFramebuffer(GL_FRAMEBUFFER_OES, 1);
05-26 21:05:43.622  1363  1363 D libEGL  : glClear(GL_COLOR_BUFFER_BIT | GL_DEPTH_BUFFER_BIT);
05-26 21:05:43.622  1363  1363 D libEGL  : glBindFramebuffer(GL_FRAMEBUFFER_OES, 3);

The glFlush is added by me to recognize the frame boundary. I will check whether such calls hurt performance.
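Those repeated glBindFramebuffer(GL_FRAMEBUFFER_OES, 1) calls are exactly the kind of redundancy a cached-binding check would eliminate, which matters if FBO switches flush the pipeline (see the next comments). A minimal sketch, not Gecko's actual GLContext bookkeeping:

```cpp
#include <GLES2/gl2.h>

// Track the last framebuffer binding and skip redundant glBindFramebuffer
// calls, since rebinding (even to the same FBO) can trip the driver's
// framebuffer-changed path.
class FramebufferBindingCache {
public:
    void Bind(GLuint fbo) {
        if (fbo == mBound) {
            return; // already bound, drop the call
        }
        glBindFramebuffer(GL_FRAMEBUFFER, fbo);
        mBound = fbo;
    }

    // Call after foreign code may have changed the binding behind our back.
    void Invalidate() { mBound = 0xFFFFFFFFu; }

private:
    GLuint mBound = 0; // the default framebuffer is bound at context creation
};
```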
Blocks: 1014011
No longer blocks: 1014011
Benoit, did our driver contact say whether switching FBOs causes a pipeline flush?
Flags: needinfo?(bjacob)
Yes, he did confirm that. And that's something we already knew from various GPU vendors' online docs. My Toronto colleagues were recently looking at framebuffer-binding-change-related flushes triggered by other GL calls that should in principle have been cheap, like glViewport (which by itself should be no different from glUniform4f). Ask :jrmuizel for more details.
Flags: needinfo?(bjacob)
I've uploaded the traces I did with the Nexus 10 and the Mali Graphics Debugger to http://clb.demon.fi/dump/mgd_dumps/ . Unfortunately one will need to install that tool from http://malideveloper.arm.com/develop-for-mali/tools/analysis-debug/mali-graphics-debugger/ to be able to load the traces; it's not possible to copy-paste the data out of the tool. The traces show in detail the issues I listed in comment 8.
Closing this one as resolved, as it referred to the Where's My Water? Firefox OS project, which was then worked on and optimized by the graphics team, and shipped successfully.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED