Closed Bug 905589 Opened 7 years ago Closed 7 years ago
use VBOs instead of client-side rendering
Display resolution is 1280x800 Freescale Quad-core CPU, 1.4GHz with Vivante GPU inside The FPS is low(under 30) in Homescreen sliding. First time we found that the bottleneck is caused by composition in B2G process. The root cause is glDrawArray takes around 10ms. After GPU vendor updating library, glDrawArray's execution time lowered down to ~1ms. Then the FPS increases from 15 to 25, the bottleneck is now in content process. The Homescreen layers are all drawing by software rendering. Peter Chang found that CPU resource is still sufficient. Viral Wang's experiment proofed CPU resource allocation is fine. More profiling and analyzing jobs have to do. This Bug is to track the status of performance issue of tablet project.
Add Andreas in cc list. -- Keven
Ok, lets attach some profiles here. I would like to see a profile of the compositor thread and a profile of the main thread. There is no way we are gated on the compositor here. Something else is wrong.
(In reply to Andreas Gal :gal from comment #2) > Ok, lets attach some profiles here. I would like to see a profile of the > compositor thread and a profile of the main thread. There is no way we are > gated on the compositor here. Something else is wrong. Can you please profile gl()->fEGLImageTargetTexture2D() at http://dxr.mozilla.org/mozilla-central/source/gfx/layers/opengl/TextureHostOGL.cpp#l1073
Ok, the team did a really clever hack here to prove that the Freescale driver is just totally broken. They disabled in an Android build the HWC and frame rate of the Android homescreen drops below 30fps. So HWC is hiding the fact that Freescale's driver is fantastically slow. It seems that issuing a draw call can block >10ms. Since Android is broken too, I don't think we have much of a chance of fixing this on our end. Its up to the vendor to ship a working GPU driver. We can continue with the HWC work, but even with that this would bite us elsewhere, so until we have better drivers, we are stuck with this particular chipset. Since we have alternatives, lets explore those and we can continue here once we get vendor help.
On a related note, it seems that the issue is around locking. With gralloc disabled framerate doubles to 25 or so (using shmem). When uploading shmem is faster than gralloc, its sadness all around.
I'm going to CC Sotaro on this; I don't think it's related to bug 905304/bug 898919, but just in case.
I have an idea here. The Vivante driver might have a slow path when using client-stored vertices. We should try using a VBO instead. I will see if my GL is good enough to hack this up so we can test it.
In particular, a really dumb driver might use internally glBufferSubData on a single VBO to handle vertex pointers, so the 2nd glDrawArrays would then use the same VBO, and as a result glBufferSubData would have to wait until the GL pipeline has drained before it can proceed.
Here goes nothing. This switches the BindAndDrawQuadWithTextureRect path over to using VBOs. It turns out BindAndDrawQuad already uses an VBO (mQuadVBO). As bonus I also switched the FPS counter drawing over. I have no idea whether this actually fixes the bug here, but its probably a good idea in general to avoid client-stored vertex data, so we might want to take this patch even if it doesn't help this broken driver. This patch was tested only on MacOSX.
Assignee: nobody → gal
Don't leak VBOs when drawing the FPS counter.
Attachment #791676 - Attachment is obsolete: true
Remove unintentional change to JS::Value.
Attachment #791679 - Attachment is obsolete: true
Delete texture on shutdown (typo).
Attachment #791680 - Attachment is obsolete: true
Before I found the glDrawArray was very slow from DrawQuad calls in Freescale paltform. http://mxr.mozilla.org/mozilla-central/source/gfx/layers/opengl/CompositorOGL.cpp#620 Using VBO didn't help the performance but it is good to have VBO patch landed as gal mentioned. Only reducing the texture size to one small 2x2 black texture for the background layer could improve the FPS. Therefore, our partner releases a gralloc fix for us. Later Jerry will provide the profiling data after applying gralloc partner fix.
Freescale fixed the slowness in glDrawArray. The previous version of the driver was slow. The current version is fast even without VBOs.
The tablet screen resolution is 1280x800. We set cpu governor to performance mode. The cpu freq is fixed at 1G. (adb shell "echo performance > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor") And we use the the latest gralloc library(fix glDrawArray performance problem) from partner. When we scroll homescreen between two pages(no repainting on content process), it spends about 30~40 ms for SwapBuffers(). http://mxr.mozilla.org/mozilla-central/source/gfx/layers/opengl/CompositorOGL.cpp#1291 We make an experiment that calls glFinish before SwapBuffers. Then the glFinish becomes the bottleneck. mGLContext->fFinish(); //30~40ms mGLContext->SwapBuffers(); //0~1ms
Nice work Jerry. Lets analyze this further. I will come upstairs to chat.
Jerry, can you please get the current values in BindAndDrawQuadWithTextureRect for: mGLContext->CanUploadNonPowerOfTwo() aTexture->GetSize().width aTexture->GetSize().height wrapMode (our assumption right now is that this is not GL_REPEAT)
We currently do not support tiled thebes for B2G. We only support them on Android and via texture upload. We can probably make that path work for B2G, but it will be some work. We also don't support tiled images layers at all. So if we are slow rendering a very large image layer here, that would suck.
(In reply to Andreas Gal :gal from comment #18) > Jerry, can you please get the current values in > BindAndDrawQuadWithTextureRect for: > > mGLContext->CanUploadNonPowerOfTwo() true > aTexture->GetSize().width > aTexture->GetSize().height There are many texute,such as 1280*800, 1260*800, 1185*800 in homescreen > wrapMode (our assumption right now is that this is not GL_REPEAT) LOCAL_GL_CLAMP_TO_EDGE
After some back and forth, we are at this point not 100% confident that the GPU has even enough memory bandwidth to composite a large texture to the framebuffer. The team will try to measure at what texture size things fall apart. We did try to disable CanUploadNonPowerOfTwo, and it didn't help. In addition rendering also was wrong. At the time being this is a dead end. We should feed this back to Freescale, and line up work for tiled layers. However, even that may not solve the problem if this is indeed memory bandwidth.
Actually, one more idea. Can you please remove the homescreen from gaia and just have it black or any color? Lets see whether that impacts frame rate.
Instead of chnage texture size and guess the limitation of this platform, why don't we just create an simple app and reveal this problem to platfrom vendor? Mimic the layout of launcher(2 full screen layers and one lauch bar which stay at the bottom of screen), keep compositor compose these three layers(with some hack). If we can prove that all the other platforms can survive(reach 60FPS) in this scenario, ask chip set vendor to have a deeper look into driver side, it might just a cache-line bug, or some trivial problems, and easy for them to fix. But we need to have some evidance in a simple app(scenario) which can tell them the story easily.
Morris, please update profile data after HWC enable.
We spend a lot of time, up and down, and eventually know the problem is in B2G process, instead of content process, and it's sort of GPU bound instead of CPU bound. I file another issue(Bug 907055) to construct a better/ more effecient way to identify GLCompositor GPU bound problem. With that feature, gfx guys in AE or System team can address this kind of problem while porting.
This is the performance profile sheet when using dummy texture for homescreen. We create a static dummy texture, and use this texture for all composition operation. https://docs.google.com/spreadsheet/ccc?key=0AhJcEt1NXQbqdGVDWWNFN09KRWpvbXJGSUh6QW5YWWc&usp=sharing
Attachment #791685 - Flags: feedback?(vlin) → feedback?(bas)
Attachment #791685 - Flags: feedback?(bas) → review?(bas)
Comment on attachment 791685 [details] [diff] [review] patch Review of attachment 791685 [details] [diff] [review]: ----------------------------------------------------------------- ::: gfx/gl/VBOArena.h @@ +21,5 @@ > + > + // Throw away all allocated VBOs. > + void Flush(GLContext *aGLContext); > +private: > + nsTArray<GLuint> mAllocatedVBOs; nit: in the whole 'which way do we go discussion', maybe these should be std::vector's? I have a mild preference for using STL where we can. I don't feel strongly though. ::: gfx/layers/opengl/CompositorOGL.cpp @@ +55,5 @@ > + size_t length = aElements; > + if (aMode == LOCAL_GL_TRIANGLES) > + length *= 3; > + else > + length *= 2; I believe there's an error here, this is all really confusing in our current implementation so I don't blame you :P. aElements is actually the amount of vertices you're drawing, not the amount of primitives, so 'length' in what you're currently doing as far as I understand it should just be aElements. In our case, the stride of both our arrays should always be 2 * sizeof(GLfloat) and thus the size 2 * aElements * sizeof(GLfloat). Currently I'd say from looking at it your code is reading random additional memory off the stack when in LOCAL_GL_TRIANGLES mode, this won't cause visual problems since GL will only use the valid portions when drawing. also a nit: I'd like to see some verticle whitespace in this function to improve readability.
Attachment #791685 - Flags: review?(bas) → review-
Other than the buffer overflow that I believe is occurring this patch looks fine to me btw.
Comment on attachment 793560 [details] [diff] [review] patch Nice catch with *2/*3. Thanks. I absolutely hate nsTArray, so +1 from me for using std.
Attachment #793560 - Flags: review?(bas)
Comment on attachment 793560 [details] [diff] [review] patch Review of attachment 793560 [details] [diff] [review]: ----------------------------------------------------------------- Strictly an improvement in my mind. ::: gfx/layers/opengl/CompositorOGL.cpp @@ +49,5 @@ > +static void > +DrawWithVertexBuffer2(GLContext *aGLContext, VBOArena &aVBOs, > + GLenum aMode, GLsizei aElements, > + GLint aAttr1, GLfloat *aArray1, > + GLint aAttr2, GLfloat *aArray2) One last nit: We should probably add a tiny comment that this function expects arrays with a stride of 2 * sizeof(GLfloat).
Attachment #793560 - Flags: review?(bas) → review+
Summary: Vivante GPU (Freescale) can't reach 60fps on homescreen → use VBOs instead of client-side rendering
Doesn't apply cleanly, please rebase. Also, please make sure that you have hg configured to generate patches with all the necessary commit information, including a commit message. https://developer.mozilla.org/en-US/docs/Mercurial_FAQ#How_can_I_generate_a_patch_for_somebody_else_to_check-in_for_me.3F
Whiteboard: needs vendor fix
Andreas, let me know if it helps for us to rebase this patch instead.
Carry over r+ from :bas for the rebase.
Carry over r+ from :bas for the rebase.
Backed out for Android and B2G build bustage. https://hg.mozilla.org/integration/b2g-inbound/rev/39599246f51b https://tbpl.mozilla.org/php/getParsedLog.php?id=26879354&tree=B2g-Inbound
std::vector::data is new in C++11, so looks like some compilers didn't support it before.
Carry over r+ from :bas for the rebase. Use &vector instead of vector.data() as not all compilers support std::vector::data() method.
(In reply to Andreas Gal :gal from comment #7) > I have an idea here. The Vivante driver might have a slow path when using > client-stored vertices. We should try using a VBO instead. I will see if my > GL is good enough to hack this up so we can test it. When using VBO, we also need to upload vertex and texture coordinate for each different quad. Does use VBO better than use vertex data directly?
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.