[Layers][unagi] B2G18 leaking EGLImages attached to GL_TEXTURE_EXTERNAL textures

RESOLVED WONTFIX

Status

Firefox OS
General
RESOLVED WONTFIX
5 years ago
5 years ago

People

(Reporter: mikeh, Assigned: mikeh)

Tracking

({unagi})

unspecified
1.0.1 Madrid (19apr)
ARM
Gonk (Firefox OS)
unagi
Dependency tree / graph

Firefox Tracking Flags

(blocking-b2g:tef+)

Details

(Whiteboard: [MemShrink:P1][status: needs patch, progress happening][madrid])

Attachments

(1 attachment, 15 obsolete attachments)

5.73 KB, patch
jgilbert
: review-
Details | Diff | Splinter Review
Investigation of bug 846903 shows the following dark matter:

Unreported: ~1,521 blocks in stack trace record 1 of 538
 ~6,225,453 bytes (~6,225,453 requested / ~0 slop)
 12.10% of the heap (12.10% cumulative);  17.81% of unreported (17.81% cumulative)
 Allocated at
   malloc /home/mikeh/dev/mozilla/m-c/b2g18/memory/build/replace_malloc.c:152
   os_malloc_ext (0x42a92134 libgsl.so+0x4134) (no addr2line)
   qeglDrvAPI_eglCreateImageKHR (0x42dd0d10 libEGL_adreno200.so+0xbd10) (no addr2line)
   eglCreateImageKHR (0x42dca6f4 libEGL_adreno200.so+0x56f4) (no addr2line)
   eglCreateImageKHR /home/mikeh/dev/mozilla/btg019/frameworks/base/opengl/libs/EGL/eglApi.cpp:1276
   mozilla::gl::GLLibraryEGL::fCreateImage(void*, void*, unsigned int, void*, int const*) /home/mikeh/dev/mozilla/m-c/b2g18/gfx/gl/GLLibraryEGL.h:346
   mozilla::gl::GLContextEGL::BindExternalBuffer(unsigned int, void*) /home/mikeh/dev/mozilla/m-c/b2g18/gfx/gl/GLContextProviderEGL.cpp:439
   mozilla::layers::ShadowImageLayerOGL::RenderLayer(int, nsIntPoint const&) /home/mikeh/dev/mozilla/m-c/b2g18/gfx/layers/opengl/ImageLayerOGL.cpp:969
   ContainerRender<mozilla::layers::ShadowContainerLayerOGL> /home/mikeh/dev/mozilla/m-c/b2g18/gfx/layers/opengl/ContainerLayerOGL.cpp:264
   ContainerRender<mozilla::layers::ShadowContainerLayerOGL> /home/mikeh/dev/mozilla/m-c/b2g18/gfx/layers/opengl/ContainerLayerOGL.cpp:264
   ContainerRender<mozilla::layers::ShadowRefLayerOGL> /home/mikeh/dev/mozilla/m-c/b2g18/gfx/layers/opengl/ContainerLayerOGL.cpp:264
   ContainerRender<mozilla::layers::ShadowContainerLayerOGL> /home/mikeh/dev/mozilla/m-c/b2g18/gfx/layers/opengl/ContainerLayerOGL.cpp:264
   ContainerRender<mozilla::layers::ShadowContainerLayerOGL> /home/mikeh/dev/mozilla/m-c/b2g18/gfx/layers/opengl/ContainerLayerOGL.cpp:264
   ContainerRender<mozilla::layers::ShadowContainerLayerOGL> /home/mikeh/dev/mozilla/m-c/b2g18/gfx/layers/opengl/ContainerLayerOGL.cpp:264
   ContainerRender<mozilla::layers::ShadowContainerLayerOGL> /home/mikeh/dev/mozilla/m-c/b2g18/gfx/layers/opengl/ContainerLayerOGL.cpp:264
   mozilla::layers::LayerManagerOGL::Render() /home/mikeh/dev/mozilla/m-c/b2g18/gfx/layers/opengl/LayerManagerOGL.cpp:1031
   mozilla::layers::LayerManagerOGL::EndTransaction(void (*)(mozilla::layers::ThebesLayer*, gfxContext*, nsIntRegion const&, nsIntRegion const&, void*), void*, mozilla::layers::LayerManager::EndTransactionFlags) /home/mikeh/dev/mozilla/m-c/b2g18/gfx/layers/opengl/LayerManagerOGL.cpp:700   mozilla::layers::LayerManagerOGL::EndEmptyTransaction(mozilla::layers::LayerManager::EndTransactionFlags) /home/mikeh/dev/mozilla/m-c/b2g18/gfx/layers/opengl/LayerManagerOGL.cpp:642
   ~AutoResolveRefLayers /home/mikeh/dev/mozilla/m-c/b2g18/gfx/layers/ipc/CompositorParent.cpp:458
   RunnableMethod<mozilla::layers::ImageContainerChild, void (mozilla::layers::ImageContainerChild::*)(), Tuple0>::Run() /home/mikeh/dev/mozilla/m-c/b2g18/ipc/chromium/src/base/task.h:308
   MessageLoop::RunTask(Task*) /home/mikeh/dev/mozilla/m-c/b2g18/ipc/chromium/src/base/message_loop.cc:335
   MessageLoop::DeferOrRunPendingTask(MessageLoop::PendingTask const&) /home/mikeh/dev/mozilla/m-c/b2g18/ipc/chromium/src/base/message_loop.cc:345
   MessageLoop::DoWork() /home/mikeh/dev/mozilla/m-c/b2g18/ipc/chromium/src/base/message_loop.cc:442
   base::MessagePumpDefault::Run(base::MessagePump::Delegate*) /home/mikeh/dev/mozilla/m-c/b2g18/ipc/chromium/src/base/message_pump_default.cc:24

Spoke to bjacob and he suggested that the mismatch in types, LOCAL_GL_TEXTURE_EXTERNAL vs LOCAL_GL_TEXTURE_2D, passed to fBindTexture() in BindExternalBuffer() and UnbindExternalBuffer(), respectively, might be the cause.  The mismatch would cause the call in the latter to fail and not free the EGLImage created in BindExternalBuffer().
Right. What's happening here is UnbindExternalBuffer is supposed to undo the effect of BindExternalBuffer, but is actually a no-operation because it tries to bind to the TEXTURE_2D target a texture that had been bound to the TEXTURE_EXTERNAL target.

So Bind/UnbindExternalBuffer functions are _inherently_ broken --- it's not a bug in the caller.

To fix this, we have to find a way to actually un-bind a gralloc buffer from a TEXTURE_EXTERNAL texture. We could delete and recreate GL texture objects everytime, but that would presumably be slow. It seems more reasonable to have a singleton dummy EGLImage here and bind it in UnbindExternalBuffer as a means of detaching any EGLImage that was previously attached there.

I'm also CC'ing jgilbert here as ISTR he had a suggestion on this topic recently but I don't remember what he said or where...
Created attachment 735975 [details] [diff] [review]
fix texture-type mismatch causing EGLImages to not get freed

Benoit, something like this?

(With this change, I see RSS for the b2g parent process fluctuate between ~91 and 95MiB; it will take more time to see if there's a steady leak.)
Assignee: nobody → mhabicher
Status: NEW → ASSIGNED
Attachment #735975 - Flags: feedback?(bjacob)
blocking-b2g: --- → tef?
as discussed with jgilbert on IRC--- creating a EGLImage wrapping a null buffer is maybe a little bit too wonky; instead you could create a real legitimate android::GraphicBuffer of size 1x1 and get its native buffer.
Comment on attachment 735975 [details] [diff] [review]
fix texture-type mismatch causing EGLImages to not get freed

No, this creates a new EGLImage everytime and leaves it attached to a texture, like BindExternalBuffer does, so I think this will leak in exactly the same way, right?

I had in mind a patch where the EGLImage would be a singleton (a static local var here) and would be attached to _every_ texture you call UnbindExternalBuffer on. That _should_ be OK and at least would not leak.
Attachment #735975 - Flags: feedback?(bjacob) → feedback-
Created attachment 735982 [details] [diff] [review]
fix texture-type mismatch causing EGLImages to not get freed

This works, and the leak is gone!
Attachment #735975 - Attachment is obsolete: true
Attachment #735982 - Flags: review?(bjacob)
Created attachment 735987 [details] [diff] [review]
patch: Bind to a lazy singleton 'null' EGLImage to Unbind.

I've had this floating around for a while, but was waiting for the refactor to land before perturbing things.
Attachment #735987 - Flags: review?(bjacob)
Attachment #735987 - Flags: feedback?(mhabicher)
Created attachment 735988 [details] [diff] [review]
fix texture-type mismatch causing EGLImages to not get freed, v3

Post the right patch this time.

I'm going to let the test run for a while to gather more data.  Will r? this patch if things look good.
Attachment #735982 - Attachment is obsolete: true
Attachment #735982 - Flags: review?(bjacob)
jgilbert, it looks like the leak might still be there.  I'll try your patch next (later tonight).
Comment on attachment 735982 [details] [diff] [review]
fix texture-type mismatch causing EGLImages to not get freed

Review of attachment 735982 [details] [diff] [review]:
-----------------------------------------------------------------

This patch seems to be two patches stuck together.

Also:
fEGLImageTargetTexture2D(LOCAL_GL_TEXTURE_EXTERNAL, nullptr)
`nullptr` isn't a valid EGLImage, so this emits an error by spec. (INVALID_VALUE? I forget which)

sEGLLibrary.fCreateImage(EGL_DISPLAY(),
		EGL_NO_CONTEXT,
		EGL_NATIVE_BUFFER_ANDROID,
		nullptr, attrs);
The spec for this also requires that `buffer` is a valid ANativeWindowBuffer, or generates EGL_BAD_PARAMETER.
Attachment #735982 - Flags: feedback-
(In reply to Jeff Gilbert [:jgilbert] from comment #6)
> Created attachment 735987 [details] [diff] [review]
> patch: Bind to a lazy singleton 'null' EGLImage to Unbind.
> 
> I've had this floating around for a while, but was waiting for the refactor
> to land before perturbing things.

This patch doesn't apply:

/home/mikeh/dev/mozilla/m-c/b2g18/gfx/gl/GLContextProviderEGL.cpp: In member function 'virtual bool mozilla::gl::GLContextEGL::UnbindExternalBuffer(GLuint)':
/home/mikeh/dev/mozilla/m-c/b2g18/gfx/gl/GLContextProviderEGL.cpp:462: error: 'ScopedBindTexture' was not declared in this scope
/home/mikeh/dev/mozilla/m-c/b2g18/gfx/gl/GLContextProviderEGL.cpp:462: error: expected ';' before 'autoTex'
/home/mikeh/dev/mozilla/m-c/b2g18/gfx/gl/GLContextProviderEGL.cpp:476: error: expected primary-expression before '->' token
/home/mikeh/dev/mozilla/m-c/b2g18/gfx/gl/GLContextProviderEGL.cpp:492: error: 'LOCAL_GL_TEXTURE_BINDING_EXTERNAL' was not declared in this scope
And to confirm: the leak still exists with attachment 735988 [details] [diff] [review].  jgilbert, if you can unbitrot attachment 735987 [details] [diff] [review], I can test it.
So it looks like 'ScopedBindTexture' needs to be 'TextureImage::ScopedBindTexture' and 'GLuint tempTex' needs to be 'GLenum tempTex'; with those changes, I get the following build errors:

/home/mikeh/dev/mozilla/m-c/b2g18/gfx/gl/GLContextProviderEGL.cpp: In member function 'virtual bool mozilla::gl::GLContextEGL::UnbindExternalBuffer(GLuint)':
/home/mikeh/dev/mozilla/m-c/b2g18/gfx/gl/GLContextProviderEGL.cpp:466: error: no matching function for call to 'mozilla::gl::TextureImage::ScopedBindTexture::ScopedBindTexture(mozilla::gl::GLContextEGL* const, GLenum&)'
/home/mikeh/dev/mozilla/m-c/b2g18/gfx/gl/GLContext.h:231: note: candidates are: mozilla::gl::TextureImage::ScopedBindTexture::ScopedBindTexture(mozilla::gl::TextureImage*, GLenum)
/home/mikeh/dev/mozilla/m-c/b2g18/gfx/gl/GLContext.h:229: note:                 mozilla::gl::TextureImage::ScopedBindTexture::ScopedBindTexture(const mozilla::gl::TextureImage::ScopedBindTexture&)
/home/mikeh/dev/mozilla/m-c/b2g18/gfx/gl/GLContextProviderEGL.cpp:480: error: expected primary-expression before '->' token
Flags: needinfo?(jgilbert)
I don't understand how this:

(In reply to Mike Habicher [:mikeh] from comment #5)
> Created attachment 735982 [details] [diff] [review]
> fix texture-type mismatch causing EGLImages to not get freed
> 
> This works, and the leak is gone!

reconciles with this:

(In reply to Mike Habicher [:mikeh] from comment #11)
> And to confirm: the leak still exists with attachment 735988 [details] [diff] [review]

Both 735982 and 735988 seem to use the approach of attaching a nullptr EGLImage to replace the existing attachment, so does this not work after all?

Assuming it doesn't: let's wait for Jeff to update his patch.
Attachment #735987 - Flags: review?(bjacob)
Attachment #735987 - Flags: feedback?(mhabicher)
As I discussed with bjacob in IRC, it doesn't.  On my particular test run, RSS initially stayed stable, but then started climbing again.  DMD showed continuing leakage through fCreateImage().
We want mozilla::gl::ScopedBindTexture. I won't have access to a b2g building machine for another hour, but that should be the issue.
Flags: needinfo?(jgilbert)
Created attachment 736532 [details] [diff] [review]
patch prereq: Finish adding enum defines for the extension we use
Attachment #736532 - Flags: review?(bjacob)
Created attachment 736533 [details] [diff] [review]
patch: bind to a lazy singleton 'null' EGLImage in order to 'unbind'.

Requires the prereq to give us a define enum we need.
Attachment #735987 - Attachment is obsolete: true
Attachment #736533 - Flags: review?(bjacob)
Attachment #736533 - Flags: feedback?(mhabicher)
tef+ as this blocks a blocker
blocking-b2g: tef? → tef+
Comment on attachment 735988 [details] [diff] [review]
fix texture-type mismatch causing EGLImages to not get freed, v3

Obsoleting my naïve solution.
Attachment #735988 - Attachment is obsolete: true
Comment on attachment 736532 [details] [diff] [review]
patch prereq: Finish adding enum defines for the extension we use

This patch doesn't cleanly apply to the b2g18 tree.
Attachment #736532 - Flags: feedback-
Comment on attachment 736533 [details] [diff] [review]
patch: bind to a lazy singleton 'null' EGLImage in order to 'unbind'.

I changed 'mozilla::gl::ScopedBindTexture autoTex(this, tempTex);' to 'TextureImage::ScopedBindTexture autoTex(this, tempTex);' but still get the following build errors:

/home/mikeh/dev/mozilla/m-c/b2g18/gfx/gl/GLContextProviderEGL.cpp: In member function 'virtual bool mozilla::gl::GLContextEGL::UnbindExternalBuffer(GLuint)':
/home/mikeh/dev/mozilla/m-c/b2g18/gfx/gl/GLContextProviderEGL.cpp:462: error: no matching function for call to 'mozilla::gl::TextureImage::ScopedBindTexture::ScopedBindTexture(mozilla::gl::GLContextEGL* const, GLuint&)'
/home/mikeh/dev/mozilla/m-c/b2g18/gfx/gl/GLContext.h:231: note: candidates are: mozilla::gl::TextureImage::ScopedBindTexture::ScopedBindTexture(mozilla::gl::TextureImage*, GLenum)
/home/mikeh/dev/mozilla/m-c/b2g18/gfx/gl/GLContext.h:229: note:                 mozilla::gl::TextureImage::ScopedBindTexture::ScopedBindTexture(const mozilla::gl::TextureImage::ScopedBindTexture&)
/home/mikeh/dev/mozilla/m-c/b2g18/gfx/gl/GLContextProviderEGL.cpp:476: error: 'GetEGLContext' was not declared in this scope

It looks like the constructor is being called with the wrong type of parameters, and a grep for 'GetEGLContext' doesn't find any matches in the b2g18 tree aside from the applied patch.
Attachment #736533 - Flags: feedback?(mhabicher) → feedback-
(In reply to Mike Habicher [:mikeh] from comment #20)
> Comment on attachment 736532 [details] [diff] [review]
> patch prereq: Finish adding enum defines for the extension we use
> 
> This patch doesn't cleanly apply to the b2g18 tree.

Oooh, you want it against b2g18. Yeah, this is against central.

(In reply to Mike Habicher [:mikeh] from comment #21)
> Comment on attachment 736533 [details] [diff] [review]
> patch: bind to a lazy singleton 'null' EGLImage in order to 'unbind'.
> 
> I changed 'mozilla::gl::ScopedBindTexture autoTex(this, tempTex);' to
> 'TextureImage::ScopedBindTexture autoTex(this, tempTex);' but still get the
> following build errors:

That's because TextureImage::ScopedBindTexture is completely different, and not what we want. Evidently b2g18 doesn't have gl::ScopedBindTexture, which *is* what we want.

I'll pull 18 and update the patch.
That said, bjacob should still review the patches for landing on central.
Created attachment 736997 [details] [diff] [review]
[b2g18] patch prereq
Attachment #736997 - Flags: review?(bjacob)
Created attachment 736998 [details] [diff] [review]
[b2g18] patch
Attachment #736998 - Flags: review?(bjacob)
Attachment #736998 - Flags: feedback?(mhabicher)
Attachment #736532 - Flags: review?(bjacob) → review+
Comment on attachment 736533 [details] [diff] [review]
patch: bind to a lazy singleton 'null' EGLImage in order to 'unbind'.

Review of attachment 736533 [details] [diff] [review]:
-----------------------------------------------------------------

::: gfx/gl/GLContextProviderEGL.cpp
@@ +489,5 @@
> +
> +                mNullEGLImage = sEGLLibrary.fCreateImage(sEGLLibrary.Display(),
> +                                                         GetEGLContext(),
> +                                                         LOCAL_EGL_GL_TEXTURE_2D,
> +                                                         (EGLClientBuffer)tempTex,

How do you know that casting a GL texture object to a EGLClientBuffer works? Is that guaranteed by EGL?

@@ +495,5 @@
> +                fTexSubImage2D(LOCAL_GL_TEXTURE_2D, 0,
> +                               0, 0,
> +                               2, 2,
> +                               LOCAL_GL_RGBA, LOCAL_GL_UNSIGNED_BYTE,
> +                               blankData);

Why do a texImage2D + texSubImage2D instead of a texImage2D?

I guess what I don't understand is why you don't write blankData to a android::GraphicBuffer, construct a EGLImage from it (we have a function for doing this here) and tie it to a GL texture with fEGLImageTargetTexture2D. In this way you don't need to assume that casting to EGLClientBuffer works, you don't need to create a new GL texture object and you don't need to texSubImage2D.
Attachment #736533 - Flags: review?(bjacob) → review-
Attachment #736997 - Flags: review?(bjacob) → review+
Comment on attachment 736998 [details] [diff] [review]
[b2g18] patch

Review of attachment 736998 [details] [diff] [review]:
-----------------------------------------------------------------

OK after IRC conversation with jgilbert.
Attachment #736998 - Flags: review?(bjacob) → review+
Comment on attachment 736533 [details] [diff] [review]
patch: bind to a lazy singleton 'null' EGLImage in order to 'unbind'.

Review of attachment 736533 [details] [diff] [review]:
-----------------------------------------------------------------

OK after IRC conversation.
Attachment #736533 - Flags: review- → review+
Comment on attachment 736998 [details] [diff] [review]
[b2g18] patch

Unfortunately this didn't fix the leak.  The b2g parent RSS still rises to ~155MiB, and causes all other processes to get killed.  The DMD traces still show:

- mozilla::gl::GLContextEGL::BindExternalBuffer()
- mozilla::gl::GLLibraryEGL::fCreateImage()

I'll attach the complete DMD report next.
Attachment #736998 - Flags: feedback?(mhabicher) → feedback-
Created attachment 737134 [details]
[DMD] log from run with attachments 736997 and 736998
That is very worrying. Can you verify that GLContextEGL::UnbindExternalBuffer is called as many times as GLContextEGL::BindExternalBuffer is?

If it is, then that means that we don't know yet how to avoid this leak on Qualcomm drivers. We would then have to ask for help from Qualcomm.

Updated

5 years ago
Depends on: 842518

Updated

5 years ago
Blocks: 842518
No longer depends on: 842518
FYI, when I apply the above [b2g18] patches, I get a crash trying to move icons on the b2g homescreen.  I haven't dug into this yet, but the logcat shows:

E(  445:0x1cc) <qeglDrvAPI_eglCreateImageKHR:4047>: EGL_BAD_PARAMETER
F(  445:0x1cc) Assertion failure: mNullEGLImage, at /home/mikeh/dev/mozilla/m-c/b2g18/gfx/gl/GLContextProviderEGL.cpp:496
F(  445:0x1cc) Fatal signal 11 (SIGSEGV) at 0x00000000 (code=1)
Created attachment 737439 [details] [diff] [review]
[b2g18] Bind/UnbindTexture call counting

bjacob, good idea--when I apply this patch, and do "adb logcat | grep mikeh" I *never* see any UNBIND lines, and the mBindUnbindCount value climbs monotonically.

So the textures are never getting unbound.
Attachment #737439 - Flags: feedback?(bjacob)
Let's look into this in the office today.
Some more information: UnbindExternalBuffer() is only ever called from GetLockSurface(), inside a test for mGraphicBuffer != nullptr.  I added a printf_stderr() to the beginning of this function, and it never appears in my log.

Moreover, GetLockSurface() is called from BeginUpdate() and DirectUpdate().  I've instrumented both functions and neither of them is ever called while the camera viewfinder is running.
bjacob, the call to BindExternalBuffer() is in ShadowImageLayerOGL::RenderLayer(), line 975.  (This is another call on line 437, but I never see it getting called.)

Grepping shows there are _no_ calls to UnbindExternalBuffer() outside of GetLockBuffer().
Here's what I know:
- in GLContextProviderEGL.cpp, UnbindExternalBuffer() is _only_ called from GetLockSurface()
- GetLockSurface() is only called from two places, both in GLContextProviderEGL.cpp:
    - BeginUpdate() (2 places)
    - DirectUpdate()

I have instrumented both BeginUpdate() and DirectUpdate(), and neither is ever called while the camera viewfinder is active.

BeginUpdate() is called from the following places in GLContext.cpp:
- BasicTextureImage::BeginUpdate()
- TiledTextureImage::BeginUpdate()
- TiledTextureImage::EndUpdate()

...from BasicBufferOGL::BeginPaint() in ThebesLayerOGL.cpp, and possibly from TextureImage::Resize() in GLContext.h.

Except for Resize(), I have instrumented these functions and none of them are ever called.  (There is no need to instrument Resize(), since it doesn't contain any conditional paths to BeginUpdate().)

Resize() is called from:
- in GLContextProviderEGL.cpp:
    - TextureImageEGL::TextureImageEGL(), but only if gUseBackingSurface is true
    - BindTexture()
    - GetTextureID()
- in GLContext.cpp:
    - TiledTextureImage::TiledTextureImage()
- in ThebesLayerOGL.cpp:
    - BasicBufferOGL::BeginPaint()

Only BindTexture() is ever called, but never (apparently) for our viewfinder frames (though for other elements when updating the overall viewfinder).
With "b mozilla::gl::GLContextEGL::BindExternalBuffer", here is the backtrace:

#0  mozilla::gl::GLContextEGL::BindExternalBuffer (this=0x48a98800, texture=135, buffer=0x496f7084) at /home/mikeh/dev/mozilla/m-c/b2g18/gfx/gl/GLContextProviderEGL.cpp:437
#1  0x41a74824 in mozilla::layers::ShadowImageLayerOGL::RenderLayer (this=0x49cd8000, aPreviousFrameBuffer=<value optimized out>, aOffset=...)
    at /home/mikeh/dev/mozilla/m-c/b2g18/gfx/layers/opengl/ImageLayerOGL.cpp:976
#2  0x41a71d82 in ContainerRender<mozilla::layers::ShadowContainerLayerOGL> (this=0x44fb7800, aPreviousFrameBuffer=0, aOffset=<value optimized out>)
    at /home/mikeh/dev/mozilla/m-c/b2g18/gfx/layers/opengl/ContainerLayerOGL.cpp:264
#3  mozilla::layers::ShadowContainerLayerOGL::RenderLayer (this=0x44fb7800, aPreviousFrameBuffer=0, aOffset=<value optimized out>) at /home/mikeh/dev/mozilla/m-c/b2g18/gfx/layers/opengl/ContainerLayerOGL.cpp:450
#4  0x41a71d82 in ContainerRender<mozilla::layers::ShadowContainerLayerOGL> (this=0x4987fc00, aPreviousFrameBuffer=0, aOffset=<value optimized out>)
    at /home/mikeh/dev/mozilla/m-c/b2g18/gfx/layers/opengl/ContainerLayerOGL.cpp:264
#5  mozilla::layers::ShadowContainerLayerOGL::RenderLayer (this=0x4987fc00, aPreviousFrameBuffer=0, aOffset=<value optimized out>) at /home/mikeh/dev/mozilla/m-c/b2g18/gfx/layers/opengl/ContainerLayerOGL.cpp:450
#6  0x41a71962 in ContainerRender<mozilla::layers::ShadowRefLayerOGL> (this=0x4987d000, aPreviousFrameBuffer=0, aOffset=<value optimized out>)
    at /home/mikeh/dev/mozilla/m-c/b2g18/gfx/layers/opengl/ContainerLayerOGL.cpp:264
#7  mozilla::layers::ShadowRefLayerOGL::RenderLayer (this=0x4987d000, aPreviousFrameBuffer=0, aOffset=<value optimized out>) at /home/mikeh/dev/mozilla/m-c/b2g18/gfx/layers/opengl/ContainerLayerOGL.cpp:491
#8  0x41a71d82 in ContainerRender<mozilla::layers::ShadowContainerLayerOGL> (this=0x4987c800, aPreviousFrameBuffer=0, aOffset=<value optimized out>)
    at /home/mikeh/dev/mozilla/m-c/b2g18/gfx/layers/opengl/ContainerLayerOGL.cpp:264
#9  mozilla::layers::ShadowContainerLayerOGL::RenderLayer (this=0x4987c800, aPreviousFrameBuffer=0, aOffset=<value optimized out>) at /home/mikeh/dev/mozilla/m-c/b2g18/gfx/layers/opengl/ContainerLayerOGL.cpp:450
#10 0x41a71d82 in ContainerRender<mozilla::layers::ShadowContainerLayerOGL> (this=0x4987f000, aPreviousFrameBuffer=0, aOffset=<value optimized out>)
    at /home/mikeh/dev/mozilla/m-c/b2g18/gfx/layers/opengl/ContainerLayerOGL.cpp:264
#11 mozilla::layers::ShadowContainerLayerOGL::RenderLayer (this=0x4987f000, aPreviousFrameBuffer=0, aOffset=<value optimized out>) at /home/mikeh/dev/mozilla/m-c/b2g18/gfx/layers/opengl/ContainerLayerOGL.cpp:450
#12 0x41a71d82 in ContainerRender<mozilla::layers::ShadowContainerLayerOGL> (this=0x495df800, aPreviousFrameBuffer=0, aOffset=<value optimized out>)
    at /home/mikeh/dev/mozilla/m-c/b2g18/gfx/layers/opengl/ContainerLayerOGL.cpp:264
#13 mozilla::layers::ShadowContainerLayerOGL::RenderLayer (this=0x495df800, aPreviousFrameBuffer=0, aOffset=<value optimized out>) at /home/mikeh/dev/mozilla/m-c/b2g18/gfx/layers/opengl/ContainerLayerOGL.cpp:450
#14 0x41a71d82 in ContainerRender<mozilla::layers::ShadowContainerLayerOGL> (this=0x495df000, aPreviousFrameBuffer=0, aOffset=<value optimized out>)
    at /home/mikeh/dev/mozilla/m-c/b2g18/gfx/layers/opengl/ContainerLayerOGL.cpp:264
#15 mozilla::layers::ShadowContainerLayerOGL::RenderLayer (this=0x495df000, aPreviousFrameBuffer=0, aOffset=<value optimized out>) at /home/mikeh/dev/mozilla/m-c/b2g18/gfx/layers/opengl/ContainerLayerOGL.cpp:450
#16 0x41a77d1a in mozilla::layers::LayerManagerOGL::Render (this=0x48af2ef0) at /home/mikeh/dev/mozilla/m-c/b2g18/gfx/layers/opengl/LayerManagerOGL.cpp:1028
#17 0x41a78390 in mozilla::layers::LayerManagerOGL::EndTransaction (this=0x48af2ef0, aCallback=0, aCallbackData=0x0, aFlags=<value optimized out>)
    at /home/mikeh/dev/mozilla/m-c/b2g18/gfx/layers/opengl/LayerManagerOGL.cpp:700
#18 0x41a75854 in mozilla::layers::LayerManagerOGL::EndEmptyTransaction (this=0x48a98800, aFlags=1114578632) at /home/mikeh/dev/mozilla/m-c/b2g18/gfx/layers/opengl/LayerManagerOGL.cpp:641
#19 0x41a8a602 in mozilla::layers::CompositorParent::Composite (this=<value optimized out>) at /home/mikeh/dev/mozilla/m-c/b2g18/gfx/layers/ipc/CompositorParent.cpp:555
#20 0x417dd5b0 in DispatchToMethod<mozilla::dom::ContentParent, void (mozilla::dom::ContentParent::*)()> (this=<value optimized out>) at /home/mikeh/dev/mozilla/m-c/b2g18/ipc/chromium/src/base/tuple.h:383
#21 RunnableMethod<mozilla::dom::ContentParent, void (mozilla::dom::ContentParent::*)(), Tuple0>::Run (this=<value optimized out>) at /home/mikeh/dev/mozilla/m-c/b2g18/ipc/chromium/src/base/task.h:307
#22 0x41a04cee in MessageLoop::RunTask (this=0x44affdd0, task=0x4925d120) at /home/mikeh/dev/mozilla/m-c/b2g18/ipc/chromium/src/base/message_loop.cc:334
#23 0x41a05518 in MessageLoop::DeferOrRunPendingTask (this=0x48a98800, pending_task=<value optimized out>) at /home/mikeh/dev/mozilla/m-c/b2g18/ipc/chromium/src/base/message_loop.cc:342
#24 0x41a0626e in MessageLoop::DoWork (this=0x44affdd0) at /home/mikeh/dev/mozilla/m-c/b2g18/ipc/chromium/src/base/message_loop.cc:442
#25 0x41a065ea in base::MessagePumpDefault::Run (this=0x448e6120, delegate=0x44affdd0) at /home/mikeh/dev/mozilla/m-c/b2g18/ipc/chromium/src/base/message_pump_default.cc:23
#26 0x41a052a2 in MessageLoop::RunInternal (this=0x44affdd0) at /home/mikeh/dev/mozilla/m-c/b2g18/ipc/chromium/src/base/message_loop.cc:216
#27 0x41a05302 in MessageLoop::RunHandler (this=0x44affdd0) at /home/mikeh/dev/mozilla/m-c/b2g18/ipc/chromium/src/base/message_loop.cc:209
#28 MessageLoop::Run (this=0x44affdd0) at /home/mikeh/dev/mozilla/m-c/b2g18/ipc/chromium/src/base/message_loop.cc:183
#29 0x41a0f26c in base::Thread::ThreadMain (this=0x448e5460) at /home/mikeh/dev/mozilla/m-c/b2g18/ipc/chromium/src/base/thread.cc:156
#30 0x41a1cc3e in ThreadFunc (closure=0x48a98800) at /home/mikeh/dev/mozilla/m-c/b2g18/ipc/chromium/src/base/platform_thread_posix.cc:39
#31 0x4005de18 in __thread_entry (func=0x41a1cc35 <ThreadFunc>, arg=0x448e5460, tls=<value optimized out>) at bionic/libc/bionic/pthread.c:217
#32 0x4005d96c in pthread_create (thread_out=<value optimized out>, attr=0xbee91ef8, start_routine=0x41a1cc35 <ThreadFunc>, arg=0x448e5460) at bionic/libc/bionic/pthread.c:357
#33 0x00000000 in ?? ()
(Note that in comment 38, the line numbers include numerous printf()s.)
Hi mikeh, is the STR for this bug opening the camera and finding that UnbindExternalBuffer wasn't called?
So far there are two known reasons for this leak:
 - UnbindExternalBuffer isn't called
 - even if it were called, in its current form it's a no-operation as explained in the first few comments at the top.
Kan-Ru, STR is opening the camera and doing nothing for a few hours.  Eventually the b2g parent process RSS will expand until everything is killed, and even the homescreen can't be loaded.  Even if that doesn't happen, the only way to recover the leaked memory is to restart b2g.

Benoit, any idea where the Unbind should be happening?
(In reply to Mike Habicher [:mikeh] from comment #42)
> Kan-Ru, STR is opening the camera and doing nothing for a few hours. 
> Eventually the b2g parent process RSS will expand until everything is
> killed, and even the homescreen can't be loaded.  Even if that doesn't
> happen, the only way to recover the leaked memory is to restart b2g.
> 
> Benoit, any idea where the Unbind should be happening?

Yes; well, I'm not authoritative as I hadn't see that code before, but the idea is that you really want to do the Bind right before you draw, and the Unbind right after. Here we are doing the Bind right before we draw, and so it really seems that we should just be unbinding right after:

    gl()->fActiveTexture(LOCAL_GL_TEXTURE0);
    gl()->BindExternalBuffer(data->mTexture.GetTextureID(), ioImage->GetNativeBuffer());

    ShaderProgramOGL *program = mOGLManager->GetProgram(RGBAExternalLayerProgramType, GetMaskLayer());

    gl()->ApplyFilterToBoundTexture(mFilter);

    program->Activate();
    program->SetLayerQuadRect(nsIntRect(0, 0, 
                                        ioImage->GetSize().width, 
                                        ioImage->GetSize().height));
    program->SetLayerTransform(GetEffectiveTransform());
    program->SetLayerOpacity(GetEffectiveOpacity());
    program->SetRenderOffset(aOffset);
    program->SetTextureUnit(0);
    program->LoadMask(GetMaskLayer());

    mOGLManager->BindAndDrawQuadWithTextureRect(program,
                                                GetVisibleRegion().GetBounds(),
                                                nsIntSize(ioImage->GetSize().width,
                                                          ioImage->GetSize().height));

^ should unbind here, right after the BindAndDrawQuadWithTextureRect.
Note: that was in ImageLayerOGL::RenderLayer, in ImageLayerOGL.cpp, as in frame #1 in your stack.
(In reply to Benoit Jacob [:bjacob] from comment #44)
> Note: that was in ImageLayerOGL::RenderLayer, in ImageLayerOGL.cpp, as in
> frame #1 in your stack.

Is it ShadowImageLayerOGL::RenderLayer()?
(In reply to Sotaro Ikeda [:sotaro] from comment #45)
> (In reply to Benoit Jacob [:bjacob] from comment #44)
> > Note: that was in ImageLayerOGL::RenderLayer, in ImageLayerOGL.cpp, as in
> > frame #1 in your stack.
> 
> Is it ShadowImageLayerOGL::RenderLayer()?

IIRC, ImageLayerOGL::RenderLayer() is not used in gonk.
Hm. You know what, I was confused because I had done a BRANCH=master ./config.sh to get master Gaia, and didn't realize that it updated gecko.

Looking in b2g18 now... : so we have the Bind call here,

http://hg.mozilla.org/releases/mozilla-b2g18/file/5eb76dd7c2a4/gfx/layers/opengl/ImageLayerOGL.cpp#l969

... and the code there is very complicated so I can't claim to understand it in first reading. But the idea remains, that the pattern should be: bind, draw, unbind all in the same function. Here, we bind, draw, and don't unbind. So the right place to add the unbind call is inside of this ShadowImageLayerOGL::RenderLayer function, after any draw call that may require the gralloc buffer binding. 

It's difficult to get this right as this function has error cases with early return statements. A RAII helper to bind/unbind a gralloc buffer would help here...

anyway, not worrying about the error cases for now, the right place to put the Unbind call seems to be at the end of ShadowImageLayerOGL::RenderLayer. Indeed, much of this function consists in a series of if...else if...else... with each a draw call that may require the gralloc buffer binding; the last one is there:

http://hg.mozilla.org/releases/mozilla-b2g18/file/5eb76dd7c2a4/gfx/layers/opengl/ImageLayerOGL.cpp#l1102

and that's just before the end of the function, so we have no choice but to do the unbinding at the very end of the function.

Make sure to do the unbinding in the same case as you did the binding before. I mean, the binding is in a 

   956 #ifdef MOZ_WIDGET_GONK
   957     if (img
   958         && (img->type() == SharedImage::TSurfaceDescriptor)
   959         && (img->get_SurfaceDescriptor().type() == SurfaceDescriptor::TSurfaceDescriptorGralloc)) {

I think the safest is to set a bool boundGrallocBuffer there, and at the end of ShadowImageLayerOGL::RenderLayer, do

 if (boundGrallocBuffer) {
    UnbindExternalBuffer(...);
  }

Even better, again, would be a gralloc helper. 

Thank god we rewrote all this junk in the layers refactoring... once bug 860441 is fixed it'll be awesome.
> Even better, again, would be a gralloc helper. 

Er, I mean a RAII helper.
With the patch in attachment 736998 [details] [diff] [review], I see the MOZ_ASSERT(mNullEGLImage) fail:

E(  593:0x260) <qeglDrvAPI_eglCreateImageKHR:4047>: EGL_BAD_PARAMETER
F(  593:0x260) Assertion failure: mNullEGLImage, at /home/mikeh/dev/mozilla/m-c/b2g18/gfx/gl/GLContextProviderEGL.cpp:498
F(  593:0x260) Fatal signal 11 (SIGSEGV) at 0x00000000 (code=1)

This is with the new call to UnbindExternalTexture() in ShadowImageLayerOGL::RenderLayer().
Flags: needinfo?(bjacob)
The spec allows `attrib_list` to be null, but it's possible that the implementation doesn't like this. Also, I didn't actually check if the EGL implementation has KHR_gl_texture_2D_image. If it doesn't, we'll need to do the gralloc approach.

If we have the source for the driver, that would make it easy to see what's up.
Flags: needinfo?(bjacob)
(In reply to Jeff Gilbert [:jgilbert] from comment #50)
>
> The spec allows `attrib_list` to be null, but it's possible that the
> implementation doesn't like this. Also, I didn't actually check if the EGL
> implementation has KHR_gl_texture_2D_image. If it doesn't, we'll need to do
> the gralloc approach.

I can easily test a non-null attrib_list--what value(s) should I stuff in there?

How can I check if KHR_gl_texture_2D_image is implemented?

> If we have the source for the driver, that would make it easy to see what's
> up.

Unfortunately we don't--we get the Adreno driver as a blob.
Flags: needinfo?(jgilbert)
(In reply to Mike Habicher [:mikeh] from comment #51)
> (In reply to Jeff Gilbert [:jgilbert] from comment #50)
> >
> > The spec allows `attrib_list` to be null, but it's possible that the
> > implementation doesn't like this. Also, I didn't actually check if the EGL
> > implementation has KHR_gl_texture_2D_image. If it doesn't, we'll need to do
> > the gralloc approach.
> 
> I can easily test a non-null attrib_list--what value(s) should I stuff in
> there?
const GLint attrib_list[] = { LOCAL_EGL_NONE };
> 
> How can I check if KHR_gl_texture_2D_image is implemented?
`sEGLLibrary.IsExtensionSupported(KHR_gl_texture_2D_image)`
> 
> > If we have the source for the driver, that would make it easy to see what's
> > up.
> 
> Unfortunately we don't--we get the Adreno driver as a blob.
Flags: needinfo?(jgilbert)
jgilbert:

         if (!mNullEGLImage) {
+            if (sEGLLibrary.HasKHRImageTexture2D()) {
+                printf_stderr("mikeh: KHR_gl_texture_2D_image extension is supported\n");
+            } else {
+                printf_stderr("mikeh: KHR_gl_texture_2D_image extension is NOT supported\n");
+            }
             GLuint tempTex = 0;
             fGenTextures(1, &tempTex);
 
             GLuint prevTex = 0;
             GetUIntegerv(LOCAL_GL_TEXTURE_BINDING_2D, &prevTex);
             fBindTexture(LOCAL_GL_TEXTURE_2D, tempTex);

I(  897:0x390) mikeh: KHR_gl_texture_2D_image extension is NOT supported
Flags: needinfo?(jgilbert)
Well that's unfortunate. We'll probably have to do it with gralloc, then. Can you dump out the EGL extensions it *does* have?
Flags: needinfo?(jgilbert)
jgilbert:

         if (!mNullEGLImage) {
+            for (int i = 0; i < sEGLLibrary.Extensions_Max; i++) {
+                if (sEGLLibrary.IsExtensionSupported(static_cast<GLLibraryEGL::EGLExtensions>(i))) {
+                    printf_stderr("mikeh: extension %d is supported\n", i);
+                } else {
+                    printf_stderr("mikeh: extension %d is NOT supported\n", i);
+                }
+            }
             GLuint tempTex = 0;
             fGenTextures(1, &tempTex);

I( 1214:0x4cd) mikeh: extension 0 is supported
I( 1214:0x4cd) mikeh: extension 1 is NOT supported
I( 1214:0x4cd) mikeh: extension 2 is NOT supported
I( 1214:0x4cd) mikeh: extension 3 is NOT supported
I( 1214:0x4cd) mikeh: extension 4 is NOT supported
I( 1214:0x4cd) mikeh: extension 5 is NOT supported
I( 1214:0x4cd) mikeh: extension 6 is supported
I( 1214:0x4cd) mikeh: extension 7 is supported

Ref. https://mxr.mozilla.org/mozilla-central/source/gfx/gl/GLLibraryEGL.h#112
(In reply to Kan-Ru Chen [:kanru] from comment #53)
> We are not using UnbindExternalBuffer but we do release the texture here
> https://hg.mozilla.org/releases/mozilla-b2g18/file/5eb76dd7c2a4/gfx/layers/
> opengl/ImageLayerOGL.cpp#l1041

It would be interesting then, to step through GLTexture::Release to understand why the gralloc buffer is still being leaked. A side note though, the path in there that could actually release it calls glDeleteTextures; we should really try not to create and delete GL textures at every frame if we are to achieve good performance (though at this point we haven't established yet that we can avoid doing so).

(In reply to Mike Habicher [:mikeh] from comment #54)
> I(  897:0x390) mikeh: KHR_gl_texture_2D_image extension is NOT supported

So let's go back to the other approach, of using a dummy android::GraphicBuffer, as discussed in comment 3 and onwards.

Fortunately I needed something similar in bug 860441 so there is already code for that there, only it's for mozilla-central.

https://bug860441.bugzilla.mozilla.org/attachment.cgi?id=737891

+    EGLImage GetNullEGLImage() MOZ_OVERRIDE
+    {
+        if (!mNullEGLImage) {
+            android::GraphicBuffer *buffer
+              = new android::GraphicBuffer(
+                  1, 1,
+                  PIXEL_FORMAT_RGB_565,
+                  GRALLOC_USAGE_SW_READ_NEVER | GRALLOC_USAGE_SW_WRITE_NEVER);
+            EGLint attrs[] = {
+                LOCAL_EGL_NONE, LOCAL_EGL_NONE
+            };
+            mNullEGLImage = sEGLLibrary.fCreateImage(EGL_DISPLAY(),
+                                                     EGL_NO_CONTEXT,
+                                                     LOCAL_EGL_NATIVE_BUFFER_ANDROID,
+                                                     buffer->getNativeBuffer(), attrs);
+        }
+        return mNullEGLImage;
+    }

Updated

5 years ago
Whiteboard: [MemShrink] → [MemShrink][status: needs patch, progress happening]
Comment on attachment 737439 [details] [diff] [review]
[b2g18] Bind/UnbindTexture call counting

Yup, that was the right way to do the counting.
Attachment #737439 - Flags: feedback?(bjacob) → feedback+
Created attachment 737978 [details] [diff] [review]
fix UnbindExternalBuffer

Does this work for you? I haven't tried building so there may be trivial compile errors. Does this fix the leak?

Note, the _whole_ BindExternalBuffer/UnbindExternalBuffer API is inherently bad because it forces BindExternalBuffer to create and destroy a new EGLImage everytime... hopefully that's not too slow.
Attachment #737978 - Flags: feedback?(mhabicher)
Created attachment 737980 [details] [diff] [review]
fix UnbindExternalBuffer

Fixed a couple of typos.
Attachment #737978 - Attachment is obsolete: true
Attachment #737978 - Flags: feedback?(mhabicher)
Attachment #737980 - Attachment is patch: true
Attachment #737980 - Flags: feedback?(mhabicher)
Note: this patch only takes care of fixing UnbindExternalBuffer (hopefully); it does not add the missing call to UnbindExternalBuffer. Have you guys figured that part yet?
Comment on attachment 737980 [details] [diff] [review]
fix UnbindExternalBuffer

With the missing symbols defined, as discussed in IRC, this builds and runs without crashing, but the viewfinder display is corrupted: it shows a repeating pattern of fine-pitch pink and black lines.
Attachment #737980 - Flags: feedback?(mhabicher) → feedback-
Created attachment 737992 [details] [diff] [review]
[b2g18] Current patch

This is my currently-applied patch.
Attachment #737992 - Flags: feedback?(bjacob)
The name of UnbindExternalBuffer is wrong because it is used to unbind a GL_TEXTURE_2D gralloc buffer and was never used to unbind a GL_TEXTURE_EXTERNAL gralloc buffer. The user of BindExternalBuffer & UnbindExternalBuffer are different so it's very confusing.
(In reply to Kan-Ru Chen [:kanru] from comment #64)
>
> The name of UnbindExternalBuffer is wrong because it is used to unbind a
> GL_TEXTURE_2D gralloc buffer and was never used to unbind a
> GL_TEXTURE_EXTERNAL gralloc buffer. The user of BindExternalBuffer &
> UnbindExternalBuffer are different so it's very confusing.

100% agree!

Any idea how to unbind the texture bound in BindExternalBuffer() so that we can free the EGLImage?
(In reply to Mike Habicher [:mikeh] from comment #65)
> Any idea how to unbind the texture bound in BindExternalBuffer() so that we
> can free the EGLImage?

The right way to unbind the gralloc buffer bound in BindExternalBuffer() is (hopefully) as is done by the UnbindExternalBuffer function as modified by my patch, but reading Kan-Ru's comment, I now understand that that is not what existing callers of UnbindExternalBuffer expected!

So we should take this code into a new, separate function, and leave UnbindExternalBuffer unchanged.
Summary: [Layers] Texture type mismatch prevents freeing EGLImages → [Layers] B2G18 leaking gralloc buffers attached to GL_TEXTURE_EXTERNAL textures
Summary: [Layers] B2G18 leaking gralloc buffers attached to GL_TEXTURE_EXTERNAL textures → [Layers] B2G18 leaking EGLImages attached to GL_TEXTURE_EXTERNAL textures
I can certainly code up that change; but that doesn't address the (new) pink screen problem.
My hope is that the new pink screen problem was caused by messing with UnbindExternalBuffer, which as Kan-Ru explains is used elsewhere, and so I hope that making a new patch that leaves UnbindExternalBuffer untouched will avoid the problem.
I hope so too, but in my testing, I _never_ saw UnbindExternalBuffer() getting called, even once, even outside of the context of testing the camera app.
Oh, that's a good point.

To investigate this kind of rendering issues:
 - 1. ensure that your build is a DEBUG one
 - 2. add this to b2g.sh before the exec b2g line:
         export MOZ_GL_DEBUG_ABORT_ON_ERROR=1
 - 3. check in 'adb logcat' or in /proc/kmsg if there are any messages about 'genlock'. That's the system-wide lock mechanism controlling gralloc buffers.
Created attachment 738109 [details] [diff] [review]
[b2g18] Current patch - rolled back UnbindExternalBuffer()

Here's the latest change.  With this patch, the camera viewfinder stays solid green.
Attachment #737992 - Attachment is obsolete: true
Attachment #737992 - Flags: feedback?(bjacob)
(The patch in comment 71 goes on top of the other [b2g18] patchen.)
(In reply to Benoit Jacob [:bjacob] from comment #70)
> 
> To investigate this kind of rendering issues:
>  - 1. ensure that your build is a DEBUG one
>  - 2. add this to b2g.sh before the exec b2g line:
>          export MOZ_GL_DEBUG_ABORT_ON_ERROR=1
>  - 3. check in 'adb logcat' or in /proc/kmsg if there are any messages about
> 'genlock'. That's the system-wide lock mechanism controlling gralloc buffers.

With |export MOZ_GL_DEBUG_ABORT_ON_ERROR=1| set in b2g.sh, the camera viewfinder works properly; with it removed, the viewfinder goes solid green again.
Flags: needinfo?(bjacob)
Hm, this could explain the trouble:

@@ -1035,17 +1035,17 @@
     program->SetRenderOffset(aOffset);
     program->SetTextureUnit(0);
     program->LoadMask(GetMaskLayer());
 
     mOGLManager->BindAndDrawQuadWithTextureRect(program,
                                                 mPictureRect,
                                                 nsIntSize(mSize.width, mSize.height));
 
-    gl()->UnbindExternalBuffer(mExternalBufferTexture.GetTextureID());
+    gl()->AlsoUnbindExternalBuffer(mExternalBufferTexture.GetTextureID());


As I tried to explain in comment 47, we've got to make sure that we Unbind if and only if we had Bound. Since the Bind call is inside a nontrivial if() block, we need to be careful there.

I got a b2g18 build now, testing a patch.
Flags: needinfo?(bjacob)
(In reply to Mike Habicher [:mikeh] from comment #73)
> (In reply to Benoit Jacob [:bjacob] from comment #70)
> > 
> > To investigate this kind of rendering issues:
> >  - 1. ensure that your build is a DEBUG one
> >  - 2. add this to b2g.sh before the exec b2g line:
> >          export MOZ_GL_DEBUG_ABORT_ON_ERROR=1
> >  - 3. check in 'adb logcat' or in /proc/kmsg if there are any messages about
> > 'genlock'. That's the system-wide lock mechanism controlling gralloc buffers.
> 
> With |export MOZ_GL_DEBUG_ABORT_ON_ERROR=1| set in b2g.sh, the camera
> viewfinder works properly; with it removed, the viewfinder goes solid green
> again.

That's interesting. The most likely way that MOZ_GL_DEBUG_ABORT_ON_ERROR could make this difference, is that it inserts glFinish() calls after every GL call, effectively turning GL into a synchronous API. So what you observe here is what you'd get if the green rendering bug were caused by a us failing to properly synchronize (and the glFinish calls would add that missing synchronization).
(In reply to Benoit Jacob [:bjacob] from comment #74)
> 
> As I tried to explain in comment 47, we've got to make sure that we Unbind
> if and only if we had Bound. Since the Bind call is inside a nontrivial if()
> block, we need to be careful there.

My bad--the jetlag is hitting pretty hard today.  I can exercise your patch too, if you like.
Created attachment 738147 [details] [diff] [review]
Combined diff, Unbind only if bound, works here

Mike, can you try this? I built b2g18 with it and everything seems to go smoothly, in particular the camera preview looks good, but I am not sure what the STR are for the visual glitch you mentioned.

Also, regarding the leak, normally I'd use tools/get_about_memory.py and check the gralloc metric, but it doesn't seem to be there in b2g18.
Attachment #738147 - Flags: feedback?(mhabicher)
Comment on attachment 738147 [details] [diff] [review]
Combined diff, Unbind only if bound, works here

I just tried this patch (making sure MOZ_GL_DEBUG_ABORT_ON_ERROR=1 was not set) and the camera viewfinder is working properly.

I will let it run for a while and examine the DMD output.

(You can also see in gross terms if the EGLImage is getting freed by running |adb shell b2g-ps| and looking at the RSS numbers for the b2g process--it will keep increasing if there is a leak.)
Attachment #738147 - Flags: feedback?(mhabicher) → feedback+
Mike says it's still leaking, and also confirms that the Bind/Unbind calls are balancing out.

So at this point we have a very strong indication that there is an EGLImage leak in the EGL libraries installed on Unagi phones.

Let's get the patch reviewed and landed as it makes things at least sane on our end, and then let's ask for help from Qualcomm.
Created attachment 738225 [details] [diff] [review]
Fix any known possibility that the leak could be caused by Gecko

In summary, this patch
 - implements a ActuallyUnbindExternalBuffer function that really does what it says
 - uses it to unbind buffers

So with that, I don't see another way that the leak could be caused by Gecko. The leaks seems to still be here but it is now apparently a driver bug.
Attachment #736998 - Attachment is obsolete: true
Attachment #737134 - Attachment is obsolete: true
Attachment #737439 - Attachment is obsolete: true
Attachment #737980 - Attachment is obsolete: true
Attachment #738109 - Attachment is obsolete: true
Attachment #738147 - Attachment is obsolete: true
Attachment #736532 - Attachment is obsolete: true
Attachment #736533 - Attachment is obsolete: true
Attachment #736997 - Attachment is obsolete: true
Attachment #738225 - Flags: review?(jgilbert)
Michael, we have some strong indication that there may be a leak of EGLImages on Unagi.

The leak is described in comment 0. With the patch here, it still exists, and our code now looks like this:

   BindExternalBuffer(texture);
   drawSomeStuff();
   ActuallyUnbindExternalBuffer(texture);

where BindExternalBuffer and ActuallyUnbindExternalBuffer are like this:

    void BindExternalBuffer(GLuint texture, void* buffer)
    {
        EGLint attrs[] = {
            LOCAL_EGL_IMAGE_PRESERVED, LOCAL_EGL_TRUE,
            LOCAL_EGL_NONE, LOCAL_EGL_NONE
        };
        /*** THIS IS WHERE THE LEAKED EGLIMAGE IS CREATED !!! ***/
        EGLImage image = sEGLLibrary.fCreateImage(EGL_DISPLAY(),
                                                  EGL_NO_CONTEXT,
                                                  EGL_NATIVE_BUFFER_ANDROID,
                                                  buffer, attrs);
        fBindTexture(LOCAL_GL_TEXTURE_EXTERNAL, texture);
        fEGLImageTargetTexture2D(LOCAL_GL_TEXTURE_EXTERNAL, image);
        sEGLLibrary.fDestroyImage(EGL_DISPLAY(), image);
    }

    void ActuallyUnbindExternalBuffer(GLuint texture)
    {
        fBindTexture(LOCAL_GL_TEXTURE_EXTERNAL, texture);
        fEGLImageTargetTexture2D(LOCAL_GL_TEXTURE_EXTERNAL, GetNullEGLImage());
    }

    EGLImage GetNullEGLImage() MOZ_OVERRIDE
    {
        if (!mNullGraphicBuffer.get()) {
            mNullGraphicBuffer
              = new android::GraphicBuffer(
                  1, 1,
                  PIXEL_FORMAT_RGB_565,
                  GRALLOC_USAGE_SW_READ_NEVER | GRALLOC_USAGE_SW_WRITE_NEVER);
            EGLint attrs[] = {
                LOCAL_EGL_NONE, LOCAL_EGL_NONE
            };
            mNullEGLImage = sEGLLibrary.fCreateImage(
                EGL_DISPLAY(),
                EGL_NO_CONTEXT,
                LOCAL_EGL_NATIVE_BUFFER_ANDROID,
                mNullGraphicBuffer->getNativeBuffer(), attrs);
        }
        return mNullEGLImage;
    }
Flags: needinfo?(mvines)
Side note: we have a mitigation path that consists in having ImageLayerOGL no longer use the crummy BindExternalBuffer API and instead maintain its own mEGLImage. In this way, even if the leak still exists, it would now be only one mEGLImage leaked per image layer object, instead of leaking over and over again at every rendered frame (see how BindExternalBuffer creates and destroys EGLImages everytime).
Michael: some more comments on this. See in comment 81 how BindExternalBuffer creates an EGLImage, passes it to EGLImageTargetTexture2D, and destroys it. This relies on the EGL implementation to get reference-counting right. So it may be a good approach to look for reference-counting issues around there in the EGL implementation.
Whiteboard: [MemShrink][status: needs patch, progress happening] → [MemShrink:P1][status: needs patch, progress happening]
Please retest on either a vendor *unmodified* Buri or Inari.  I have no idea what random libraries you may have on Unagi.
Flags: needinfo?(mvines)
The only B2G development devices I have here in Toronto are Unagi's and Otoro's.

I'm going to implement the mitigation described in comment 82, then.

Updated

5 years ago
Whiteboard: [MemShrink:P1][status: needs patch, progress happening] → [MemShrink:P1][status: needs patch, progress happening][madrid]
(In reply to Michael Vines [:m1] [:evilmachines] from comment #84)
>
> Please retest on either a vendor *unmodified* Buri or Inari.  I have no idea
> what random libraries you may have on Unagi.

This is an interesting result: running an unmodified Buri idling on the camera viewfinder, I can see the b2g parent process' RSS start at ~60MiB.  Over time it rises to 62-64MiB at approximately the same rate I would expect for the leak we're trying to address above.  But _then_ RSS drops back down to ~60MiB and restarts it climb.  It's done this twice now over the past hour, as if something is waiting until it has collected enough "garbage" and is then disposing of it all at once.

Most importantly, this is running code supplied entirely by the vendor, without the above patch.

The test needs to run longer to be conclusive, but I wanted to share the preliminary results.  I will provide an update later.
That sort of makes sense, as (see comment 83) given correct refcounting on the part of the EGL implementation, the current b2g18 code should actually not leak. (Earlier I thought that we had an actual bug on our part causing the leak, but I was wrong).
The test has now been running for >5 hours, and the b2g parent RSS has stayed within ~66MiB running the unmodified Buri build.

Does this mean Unagi has an old Adreno driver?
Flags: needinfo?(mvines)
I hope so :)

But I actually have no idea what blob bits are in a Moz Unagi build, which is why I was interested in the behaviour on a vendor build that has a well known origin and is fully supported by the team here.
Flags: needinfo?(mvines)
Any chance we can update the Unagi Adreno driver and retest?

If not, I think we should tag this bug as Unagi-specific and mark it as RESOLVED/WONTFIX.

Thoughts?
Whiteboard: [MemShrink:P1][status: needs patch, progress happening][madrid] → [MemShrink:P1][status: maybe invalid][madrid]
Status: ASSIGNED → RESOLVED
Last Resolved: 5 years ago
Keywords: unagi
Resolution: --- → WONTFIX
Summary: [Layers] B2G18 leaking EGLImages attached to GL_TEXTURE_EXTERNAL textures → [Layers][unagi] B2G18 leaking EGLImages attached to GL_TEXTURE_EXTERNAL textures
Whiteboard: [MemShrink:P1][status: maybe invalid][madrid] → [MemShrink:P1][status: needs patch, progress happening][madrid]

Comment 91

5 years ago
This should be tested on hamachi or leo, which have newer adreno drivers.
(In reply to Michael Wu [:mwu] from comment #91)
>
> This should be tested on hamachi or leo, which have newer adreno drivers.

This was tested on Buri/Hamachi and the problem did not exist.
Comment on attachment 738225 [details] [diff] [review]
Fix any known possibility that the leak could be caused by Gecko

Review of attachment 738225 [details] [diff] [review]:
-----------------------------------------------------------------

::: gfx/gl/GLContext.h
@@ +778,4 @@
>                                     gfxPattern::GraphicsFilter aFilter);
>  
>      virtual bool BindExternalBuffer(GLuint texture, void* buffer) { return false; }
> +    // UnbindExternalBuffer is not actually doing the converse of BindExternalBuffer

We're going to need a better explanation that this.

::: gfx/gl/GLContextProviderEGL.cpp
@@ +464,4 @@
>  #endif
>      }
>  
> +    bool ActuallyUnbindExternalBuffer(GLuint texture)

I'm going to need a good explanation why we have a 'UnbindExternalBuffer' that says "I'm broken" and then additionally a function for 'ActuallyUnbindExternalBuffer'.

@@ +702,5 @@
> +    EGLImage GetNullEGLImage() MOZ_OVERRIDE
> +    {
> +        if (!mNullGraphicBuffer.get()) {
> +            mNullGraphicBuffer
> +              = new android::GraphicBuffer(

No use pulling `new andr[...]` to the next line.

@@ +704,5 @@
> +        if (!mNullGraphicBuffer.get()) {
> +            mNullGraphicBuffer
> +              = new android::GraphicBuffer(
> +                  1, 1,
> +                  PIXEL_FORMAT_RGB_565,

I would prefer this be 2x2 and RGBA. (simplest combo ever)
I'd also love if this was filled with pink, for an indication if we end up rendering from this accidentally.

@@ +706,5 @@
> +              = new android::GraphicBuffer(
> +                  1, 1,
> +                  PIXEL_FORMAT_RGB_565,
> +                  GRALLOC_USAGE_SW_READ_NEVER | GRALLOC_USAGE_SW_WRITE_NEVER);
> +            EGLint attrs[] = {

Leave a newline here between the indented previous function call and this different bit of code.

@@ +784,5 @@
>          return surface;
>      }
> +
> +#ifdef MOZ_WIDGET_GONK
> +    EGLImage mNullEGLImage;

Where is this destroyed?

::: gfx/gl/GLDefs.h
@@ +46,4 @@
>  
>  // OES_EGL_image (GLES)
>  typedef void* GLeglImage;
> +typedef void* EGLImage;

This should technically go under the heading for EGL_KHR_image or EGL_KHR_image_base (iirc).

::: gfx/layers/opengl/ImageLayerOGL.cpp
@@ +936,5 @@
>      return;
>    }
>    mOGLManager->MakeCurrent();
> +
> +#ifdef MOZ_WIDGET_GONK

Ifdefs are infecting the code again. :(
Attachment #738225 - Flags: review?(jgilbert) → review-
(In reply to Mike Habicher [:mikeh] from comment #90)
> If not, I think we should tag this bug as Unagi-specific and mark it as
> RESOLVED/WONTFIX.
> 
> Thoughts?

OK as far as I'm concerned.

Updated

5 years ago
Target Milestone: --- → Madrid (19apr)
You need to log in before you can comment on or make changes to this bug.