use VBOs instead of client-side rendering

RESOLVED FIXED

Status

Firefox OS
General
--
critical
RESOLVED FIXED
5 years ago
4 years ago

People

(Reporter: vilin, Assigned: gal)

Tracking

Firefox Tracking Flags

(Not tracked)

Details

Attachments

(1 attachment, 10 obsolete attachments)

(Reporter)

Description

5 years ago
Display resolution is 1280x800
Freescale Quad-core CPU, 1.4GHz with Vivante GPU inside
The FPS is low(under 30) in Homescreen sliding.

First time we found that the bottleneck is caused by composition in B2G process. The root cause is glDrawArray takes around 10ms. After GPU vendor updating library, glDrawArray's execution time lowered down to ~1ms. Then the FPS increases from 15 to 25, the bottleneck is now in content process.

The Homescreen layers are all drawing by software rendering.
Peter Chang found that CPU resource is still sufficient.
Viral Wang's experiment proofed CPU resource allocation is fine.

More profiling and analyzing jobs have to do.
This Bug is to track the status of performance issue of tablet project.

Updated

5 years ago
Blocks: 903304

Comment 1

5 years ago
Add Andreas in cc list.

--
Keven
Blocks: 905004
(Assignee)

Comment 2

5 years ago
Ok, lets attach some profiles here. I would like to see a profile of the compositor thread and a profile of the main thread. There is no way we are gated on the compositor here. Something else is wrong.
(In reply to Andreas Gal :gal from comment #2)
> Ok, lets attach some profiles here. I would like to see a profile of the
> compositor thread and a profile of the main thread. There is no way we are
> gated on the compositor here. Something else is wrong.

Can you please profile gl()->fEGLImageTargetTexture2D() 

at http://dxr.mozilla.org/mozilla-central/source/gfx/layers/opengl/TextureHostOGL.cpp#l1073
(Assignee)

Comment 4

5 years ago
Ok, the team did a really clever hack here to prove that the Freescale driver is just totally broken. They disabled in an Android build the HWC and frame rate of the Android homescreen drops below 30fps. So HWC is hiding the fact that Freescale's driver is fantastically slow. It seems that issuing a draw call can block >10ms. Since Android is broken too, I don't think we have much of a chance of fixing this on our end. Its up to the vendor to ship a working GPU driver. We can continue with the HWC work, but even with that this would bite us elsewhere, so until we have better drivers, we are stuck with this particular chipset. Since we have alternatives, lets explore those and we can continue here once we get vendor help.
(Assignee)

Comment 5

5 years ago
On a related note, it seems that the issue is around locking. With gralloc disabled framerate doubles to 25 or so (using shmem). When uploading shmem is faster than gralloc, its sadness all around.
No longer blocks: 903304, 905004
Summary: Performance: Graphic performance issues on tablet project → Vivante GPU (Freescale) can't reach 60fps on homescreen
Whiteboard: needs vendor fix
I'm going to CC Sotaro on this; I don't think it's related to bug 905304/bug 898919, but just in case.
(Assignee)

Comment 7

5 years ago
I have an idea here. The Vivante driver might have a slow path when using client-stored vertices. We should try using a VBO instead. I will see if my GL is good enough to hack this up so we can test it.
(Assignee)

Comment 8

5 years ago
In particular, a really dumb driver might use internally glBufferSubData on a single VBO to handle vertex pointers, so the 2nd glDrawArrays would then use the same VBO, and as a result glBufferSubData would have to wait until the GL pipeline has drained before it can proceed.
(Assignee)

Comment 9

5 years ago
Created attachment 791675 [details] [diff] [review]
Use VBOs when drawing in CompositorOGL.cpp

Here goes nothing. This switches the BindAndDrawQuadWithTextureRect path over to using VBOs. It turns out BindAndDrawQuad already uses an VBO (mQuadVBO). As bonus I also switched the FPS counter drawing over. I have no idea whether this actually fixes the bug here, but its probably a good idea in general to avoid client-stored vertex data, so we might want to take this patch even if it doesn't help this broken driver. This patch was tested only on MacOSX.
Assignee: nobody → gal
(Assignee)

Comment 10

5 years ago
Created attachment 791676 [details] [diff] [review]
Added missing VBOArena.[h|cpp]
Attachment #791675 - Attachment is obsolete: true
(Assignee)

Comment 11

5 years ago
Created attachment 791679 [details] [diff] [review]
patch

Don't leak VBOs when drawing the FPS counter.
Attachment #791676 - Attachment is obsolete: true
(Assignee)

Comment 12

5 years ago
Created attachment 791680 [details] [diff] [review]
patch

Remove unintentional change to JS::Value.
Attachment #791679 - Attachment is obsolete: true
(Assignee)

Comment 13

5 years ago
Created attachment 791685 [details] [diff] [review]
patch

Delete texture on shutdown (typo).
Attachment #791680 - Attachment is obsolete: true
(Assignee)

Updated

5 years ago
Attachment #791685 - Flags: feedback?(vlin)
Before I found the glDrawArray was very slow from DrawQuad calls in Freescale paltform.
http://mxr.mozilla.org/mozilla-central/source/gfx/layers/opengl/CompositorOGL.cpp#620

Using VBO didn't help the performance but it is good to have VBO patch landed as gal mentioned.

Only reducing the texture size to one small 2x2 black texture for the background layer could improve the FPS. Therefore, our partner releases a gralloc fix for us.

Later Jerry will provide the profiling data after applying gralloc partner fix.
(Assignee)

Comment 15

5 years ago
Freescale fixed the slowness in glDrawArray. The previous version of the driver was slow. The current version is fast even without VBOs.
The tablet screen resolution is 1280x800.
We set cpu governor to performance mode. The cpu freq is fixed at 1G.
(adb shell "echo performance > /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor")
	
And we use the the latest gralloc library(fix glDrawArray performance problem) from partner.

When we scroll homescreen between two pages(no repainting on content process), it spends about 30~40 ms for SwapBuffers().
http://mxr.mozilla.org/mozilla-central/source/gfx/layers/opengl/CompositorOGL.cpp#1291

We make an experiment that calls glFinish before SwapBuffers.
Then the glFinish becomes the bottleneck.

mGLContext->fFinish();	//30~40ms
mGLContext->SwapBuffers();	//0~1ms
(Assignee)

Comment 17

5 years ago
Nice work Jerry. Lets analyze this further. I will come upstairs to chat.
(Assignee)

Comment 18

5 years ago
Jerry, can you please get the current values in BindAndDrawQuadWithTextureRect for:

mGLContext->CanUploadNonPowerOfTwo()
aTexture->GetSize().width
aTexture->GetSize().height
wrapMode (our assumption right now is that this is not GL_REPEAT)
(Assignee)

Comment 19

5 years ago
We currently do not support tiled thebes for B2G. We only support them on Android and via texture upload. We can probably make that path work for B2G, but it will be some work.

We also don't support tiled images layers at all. So if we are slow rendering a very large image layer here, that would suck.
(In reply to Andreas Gal :gal from comment #18)
> Jerry, can you please get the current values in
> BindAndDrawQuadWithTextureRect for:
> 
> mGLContext->CanUploadNonPowerOfTwo()
true
> aTexture->GetSize().width
> aTexture->GetSize().height
There are many texute,such as 1280*800, 1260*800, 1185*800 in homescreen

> wrapMode (our assumption right now is that this is not GL_REPEAT)
LOCAL_GL_CLAMP_TO_EDGE
(Assignee)

Comment 21

5 years ago
After some back and forth, we are at this point not 100% confident that the GPU has even enough memory bandwidth to composite a large texture to the framebuffer. The team will try to measure at what texture size things fall apart. We did try to disable CanUploadNonPowerOfTwo, and it didn't help. In addition rendering also was wrong. At the time being this is a dead end. We should feed this back to Freescale, and line up work for tiled layers. However, even that may not solve the problem if this is indeed memory bandwidth.
(Assignee)

Comment 22

5 years ago
Actually, one more idea. Can you please remove the homescreen from gaia and just have it black or any color? Lets see whether that impacts frame rate.

Comment 23

5 years ago
Instead of chnage texture size and guess the limitation of this platform, why don't we just create an simple app and reveal this problem to platfrom vendor?
Mimic the layout of launcher(2 full screen layers and one lauch bar which stay at the bottom of screen), keep compositor compose these three layers(with some hack). If we can prove that all the other platforms can survive(reach 60FPS) in this scenario, ask chip set vendor to have a deeper look into driver side, it might just a cache-line bug, or some trivial problems, and easy for them to fix. But we need to have some evidance in a simple app(scenario) which can tell them the story easily.

Comment 24

5 years ago
Morris, please update profile data after HWC enable.
Flags: needinfo?(mtseng)

Comment 25

5 years ago
We spend a lot of time, up and down, and eventually know the problem is in B2G process, instead of content process, and it's sort of GPU bound instead of CPU bound.

I file another issue(Bug 907055) to construct a better/ more effecient way to identify GLCompositor GPU bound problem. With that feature, gfx guys in AE or System team can address this kind of problem while porting.
This is the performance profile sheet when using dummy texture for homescreen.
We create a static dummy texture, and use this texture for all composition operation.

https://docs.google.com/spreadsheet/ccc?key=0AhJcEt1NXQbqdGVDWWNFN09KRWpvbXJGSUh6QW5YWWc&usp=sharing
(Assignee)

Updated

5 years ago
Attachment #791685 - Flags: feedback?(vlin) → feedback?(bas)
(Assignee)

Updated

5 years ago
Attachment #791685 - Flags: feedback?(bas) → review?(bas)
Comment on attachment 791685 [details] [diff] [review]
patch

Review of attachment 791685 [details] [diff] [review]:
-----------------------------------------------------------------

::: gfx/gl/VBOArena.h
@@ +21,5 @@
> +
> +    // Throw away all allocated VBOs.
> +    void Flush(GLContext *aGLContext);
> +private:
> +    nsTArray<GLuint> mAllocatedVBOs;

nit: in the whole 'which way do we go discussion', maybe these should be std::vector's? I have a mild preference for using STL where we can. I don't feel strongly though.

::: gfx/layers/opengl/CompositorOGL.cpp
@@ +55,5 @@
> +  size_t length = aElements;
> +  if (aMode == LOCAL_GL_TRIANGLES)
> +    length *= 3;
> +  else
> +    length *= 2;

I believe there's an error here, this is all really confusing in our current implementation so I don't blame you :P. aElements is actually the amount of vertices you're drawing, not the amount of primitives, so 'length' in what you're currently doing as far as I understand it should just be aElements. In our case, the stride of both our arrays should always be 2 * sizeof(GLfloat) and thus the size 2 * aElements * sizeof(GLfloat). Currently I'd say from looking at it your code is reading random additional memory off the stack when in LOCAL_GL_TRIANGLES mode, this won't cause visual problems since GL will only use the valid portions when drawing.

also a nit: I'd like to see some verticle whitespace in this function to improve readability.
Attachment #791685 - Flags: review?(bas) → review-
Other than the buffer overflow that I believe is occurring this patch looks fine to me btw.
(Assignee)

Comment 29

5 years ago
Created attachment 793559 [details] [diff] [review]
patch
Attachment #791685 - Attachment is obsolete: true
(Assignee)

Comment 30

5 years ago
Created attachment 793560 [details] [diff] [review]
patch
Attachment #793559 - Attachment is obsolete: true
(Assignee)

Comment 31

5 years ago
Comment on attachment 793560 [details] [diff] [review]
patch

Nice catch with *2/*3. Thanks. I absolutely hate nsTArray, so +1 from me for using std.
Attachment #793560 - Flags: review?(bas)
Comment on attachment 793560 [details] [diff] [review]
patch

Review of attachment 793560 [details] [diff] [review]:
-----------------------------------------------------------------

Strictly an improvement in my mind.

::: gfx/layers/opengl/CompositorOGL.cpp
@@ +49,5 @@
> +static void
> +DrawWithVertexBuffer2(GLContext *aGLContext, VBOArena &aVBOs,
> +                      GLenum aMode, GLsizei aElements,
> +                      GLint aAttr1, GLfloat *aArray1,
> +                      GLint aAttr2, GLfloat *aArray2)

One last nit: We should probably add a tiny comment that this function expects arrays with a stride of 2 * sizeof(GLfloat).
Attachment #793560 - Flags: review?(bas) → review+
(Assignee)

Comment 33

5 years ago
Created attachment 793669 [details] [diff] [review]
patch ready for check-in
Attachment #793560 - Attachment is obsolete: true
(Assignee)

Updated

5 years ago
Attachment #793560 - Attachment is obsolete: false
(Assignee)

Updated

5 years ago
Flags: needinfo?(mtseng)
Keywords: checkin-needed
Summary: Vivante GPU (Freescale) can't reach 60fps on homescreen → use VBOs instead of client-side rendering
Attachment #793560 - Attachment is obsolete: true
Doesn't apply cleanly, please rebase. Also, please make sure that you have hg configured to generate patches with all the necessary commit information, including a commit message.
https://developer.mozilla.org/en-US/docs/Mercurial_FAQ#How_can_I_generate_a_patch_for_somebody_else_to_check-in_for_me.3F
Keywords: checkin-needed
Whiteboard: needs vendor fix
Andreas, let me know if it helps for us to rebase this patch instead.
Flags: needinfo?(gal)
Created attachment 794071 [details] [diff] [review]
use VBO instead of client side rendering. r=bas (rebased)

Carry over r+ from :bas for the rebase.
Attachment #793669 - Attachment is obsolete: true
Attachment #794071 - Flags: review+
Flags: needinfo?(gal)
Keywords: checkin-needed
Keywords: checkin-needed
Created attachment 794109 [details] [diff] [review]
use VBO instead of client side rendering. r=bas (rebased)

Carry over r+ from :bas for the rebase.
Attachment #794071 - Attachment is obsolete: true
Attachment #794109 - Flags: review+
Keywords: checkin-needed
std::vector::data is new in C++11, so looks like some compilers didn't support it before.
Created attachment 794185 [details] [diff] [review]
use VBO instead of client side rendering. r=bas (rebased, B2G, Android fix)

Carry over r+ from :bas for the rebase. Use &vector[0] instead of vector.data() as not all compilers support std::vector::data() method.
Attachment #794109 - Attachment is obsolete: true
Attachment #794185 - Flags: review+
Keywords: checkin-needed
(In reply to Andreas Gal :gal from comment #7)
> I have an idea here. The Vivante driver might have a slow path when using
> client-stored vertices. We should try using a VBO instead. I will see if my
> GL is good enough to hack this up so we can test it.

When using VBO, we also need to upload vertex and texture coordinate for each different quad.
Does use VBO better than use vertex data directly?
Flags: needinfo?(gal)
https://hg.mozilla.org/mozilla-central/rev/d128d47bfe47
Status: NEW → RESOLVED
Last Resolved: 5 years ago
Resolution: --- → FIXED
(Assignee)

Updated

4 years ago
Flags: needinfo?(gal)
You need to log in before you can comment on or make changes to this bug.