Bad flickering with accelerated canvas on Mali Bifrost devices on https://www.nperf.com
Categories
(Core :: Graphics: Canvas2D, defect, P1)
People
(Reporter: jnicol, Assigned: jnicol)
Reported as a comment in bug 1892601, but it's a separate issue. Here is a video: https://bugzilla.mozilla.org/attachment.cgi?id=9397766
This appears to affect all Mali Bifrost devices (eg Mali-G51, Mali-G72, Mali-G76), but not Valhall (G77 and later). Mozregression found a large range that contains bug 1801824 (enabling accelerated canvas). Disabling accelerated canvas avoids the issue.
Comment 1•1 year ago
Running in a debug build with GLContext debugging enabled, I see a lot of these errors:
void mozilla::gl::GLContext::raw_fGenTextures(GLsizei, GLuint *): Generated unexpected GL_OUT_OF_MEMORY error.
Always in glGenTextures. Then eventually we hit this assertion because MakeCurrent fails because the context has been lost.
Comment 2•1 year ago
(In reply to Jamie Nicol [:jnicol] from comment #1)
Running in a debug build with GLContext debugging enabled, I see a lot of these errors:
void mozilla::gl::GLContext::raw_fGenTextures(GLsizei, GLuint *): Generated unexpected GL_OUT_OF_MEMORY error.
Always in glGenTextures. Then eventually we hit this assertion because MakeCurrent fails because the context has been lost.
The glGenTextures call is the one used when attaching a SurfaceTexture to the renderer's GL context. The reason we see the error here may just be that it is the only GL call on the renderer thread that goes through the GLContext class, which in debug mode calls glFinish before calling glGetError. If we add a glFinish to check_gl_errors() in webrender's Renderer then we see out of memory errors occur there instead.
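To illustrate why the error might be attributed to glGenTextures even if an earlier command caused it, here is a toy model in Rust. It is not real GL and not Gecko or WebRender code: FakeDriver, checked_call and the command names are all hypothetical, and it simply assumes that a failure only becomes visible once something forces the queued work to complete and then queries the error state.

```rust
// Toy model only (no real GL calls): commands are queued and executed
// lazily, so a failure in an earlier command is observed by whichever
// later call first forces a sync and then reads the error state.

#[derive(Debug, Clone, Copy, PartialEq)]
enum GlError {
    NoError,
    OutOfMemory,
}

struct FakeDriver {
    // (hypothetical command name, whether it fails when executed)
    queue: Vec<(&'static str, bool)>,
    pending_error: GlError,
}

impl FakeDriver {
    fn new() -> Self {
        FakeDriver { queue: Vec::new(), pending_error: GlError::NoError }
    }

    // Queue a command without executing it.
    fn submit(&mut self, name: &'static str, fails: bool) {
        self.queue.push((name, fails));
    }

    // Stand-in for glFinish: execute everything queued so far.
    fn finish(&mut self) {
        for (_, fails) in self.queue.drain(..) {
            if fails && self.pending_error == GlError::NoError {
                self.pending_error = GlError::OutOfMemory;
            }
        }
    }

    // Stand-in for glGetError: return the recorded error and clear it.
    fn get_error(&mut self) -> GlError {
        std::mem::replace(&mut self.pending_error, GlError::NoError)
    }
}

// A wrapper that checks errors the way the comment describes for debug
// builds: finish first, then query the error.
fn checked_call(driver: &mut FakeDriver, name: &'static str) -> (&'static str, GlError) {
    driver.submit(name, false);
    driver.finish();
    (name, driver.get_error())
}

// Two unchecked commands (the first one fails), then one checked call.
// The checked call succeeds itself but is the one that reports the error.
fn blamed_call() -> (&'static str, GlError) {
    let mut driver = FakeDriver::new();
    driver.submit("draw_with_external_image", true);
    driver.submit("draw_other", false);
    checked_call(&mut driver, "gen_textures")
}
```

In this model the out-of-memory error is reported against "gen_textures" purely because it is the only checked call, which matches the observation that adding a glFinish to check_gl_errors() moves the reported error there.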
I can reproduce this when using SurfaceTexture shared surface backend (the default), or enabling AHardwareBuffer. But it doesn't occur when disabling those (which uses SharedSurface_Basic). With a bit of hacking I got SharedSurface_EGLImage to work again, and it also does not reproduce in that case.
Sotaro, do you have any ideas what could be happening here?
Comment 3•1 year ago
I captured this page in the Android Studio memory profiler, and it shows a very large number of GeckoSurface and GeckoSurfaceTexture objects. Most of these appear to be waiting to be garbage collected, but that could be the cause of the OOM. Presumably something similar happens when using AHardwareBuffers too.
I also found a bug in GLScreenBuffer that is causing us to churn through SharedSurfaces more frequently than we should. Fixing that, thereby reducing the SurfaceTexture churn, does appear to make this bug harder to reproduce, but it still does reproduce eventually.
It's not clear to me why this only affects certain Mali devices, however.
Comment 4•1 year ago
(In reply to Jamie Nicol [:jnicol] from comment #2)
Sotaro, do you have any ideas what could be happening here?
Sorry, I am not sure why the problem would happen only with specific GPUs. It is weird that the problem also happens with AHardwareBuffer.
Comment 5•1 year ago
I'm still unable to figure out the root cause of this bug. It definitely appears to occur after some Android Surfaces have been destroyed. With bug 1894929 this happens immediately, but after fixing that, it still occurs after pressing the "start test" button. If we avoid calling eglDestroySurface here and releasing the SurfaceTexture then we cannot reproduce. The flickering looks like certain render tasks just aren't being done: sometimes picture cache tiles are affected, resulting in flickering black 1024x512 tiles, and sometimes a random texture gets rendered instead of the canvas (eg a texture cache texture). This leads me to believe it might be a texture management bug in the driver. Perhaps the driver is getting confused when attempting to reuse a handle that was used by a SurfaceTexture. Not sure.
I've also discovered it only reproduces on the page nperf.com due to backdrop filter. If we disable backdrop filter then we cannot reproduce.
In any case, I think I've spent enough time trying to figure out the root cause, and instead we should look for solutions. One solution I've found is that setting external_images_require_copy to true here avoids the flickering. The effect of this is that all external images are first copied into a render task prior to being used elsewhere in the render task graph. The extra copy wouldn't be ideal as this affects a large number of devices.
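As a rough picture of what that flag changes, here is a minimal Rust sketch. It is not WebRender's actual code: Source, Task and build_tasks are hypothetical names, and only external_images_require_copy comes from the comment above. It assumes the flag simply inserts one copy task in front of any task that would otherwise sample the external image directly.

```rust
// Hypothetical model of a tiny render task list; not WebRender's types.

#[derive(Debug, Clone, PartialEq)]
enum Source {
    // Sample the external image (eg a SurfaceTexture) directly.
    External(u32),
    // Sample the output of an earlier task, by index into the task list.
    RenderTask(usize),
}

#[derive(Debug, Clone, PartialEq)]
enum Task {
    // Copy the external image into a render task target.
    CopyExternal { image: u32 },
    // Any later use of the image elsewhere in the graph.
    Composite { input: Source },
}

// Build the task list for one external image. With require_copy set,
// the image is copied once and everything else reads the copy.
fn build_tasks(image: u32, require_copy: bool) -> Vec<Task> {
    let mut tasks = Vec::new();
    let input = if require_copy {
        tasks.push(Task::CopyExternal { image });
        Source::RenderTask(tasks.len() - 1)
    } else {
        Source::External(image)
    };
    tasks.push(Task::Composite { input });
    tasks
}
```

With require_copy set, the external image is touched by exactly one task and every other task reads the intermediate copy. The cost is one extra task (and one extra copy) per external image, which is the overhead the comment is worried about.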
The other thing I noticed is that while both SharedSurface_SurfaceTexture and SharedSurface_AndroidHardwareBuffer run into this issue, using SharedSurface_EGLImage instead appears to work (with some hacks to the code to make it work again). Given that accelerated canvas now runs in the GPU process, we don't actually need to send our surfaces cross-process. So is there any reason we shouldn't go back to using EGLImage? Sotaro, do you have any thoughts on that?
Comment 6•1 year ago
Assigning to Jamie for tracking purposes.
Comment 7•1 year ago
Jamie, could we just block accelerated canvas on these devices?
Comment 8•1 year ago
It’s a lot of devices, including the A51. That would very negatively affect performance for a lot of users, not to mention our Speedometer score.
The patches in bug 1898238 work around this issue (and don’t appear to affect the sp3 score). They just need to be reviewed.
Comment 9•1 year ago
Ah, cool. That sounds like a much nicer solution.
Comment 10•11 months ago
This has been fixed by bug 1898238.