Closed Bug 862324 Opened 11 years ago Closed 11 years ago

Use after free of PGrallocBufferParent when the child process is killed

Categories

(Core :: Graphics, defect)

ARM
Gonk (Firefox OS)
defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla23

People

(Reporter: cyu, Assigned: bjacob)

References

Details

Attachments

(2 files, 5 obsolete files)

Using trunk. STR as in bug 860079.

Here are the stack traces:
> Breakpoint 1, ~PGrallocBufferParent (this=0x44b99e58, __in_chrg=<value optimized out>) at ipc/ipdl/PGrallocBufferParent.cpp:49
> 49	PGrallocBufferParent::~PGrallocBufferParent()
> #0  ~PGrallocBufferParent (this=0x44b99e58, __in_chrg=<value optimized out>) at ipc/ipdl/PGrallocBufferParent.cpp:49
> #1  0x41dfee74 in ~GrallocBufferActor (this=0x44b99e40, __in_chrg=<value optimized out>) at gfx/layers/ipc/ShadowLayerUtilsGralloc.cpp:193
> #2  0x41dfee9a in ~GrallocBufferActor (this=0x44b99e58, __in_chrg=<value optimized out>) at gfx/layers/ipc/ShadowLayerUtilsGralloc.cpp:193
> #3  0x40e01f06 in mozilla::net::NeckoParent::DeallocPCookieService (this=<value optimized out>, cs=0x401ba044) at netwerk/ipc/NeckoParent.cpp:424
> #4  0x419e95ac in mozilla::layers::PLayersParent::DeallocSubtree (this=0x404486a0) at ipc/ipdl/PLayersParent.cpp:819
> #5  0x419defb0 in mozilla::layers::PCompositorParent::DeallocSubtree (this=0x47c61c50) at ipc/ipdl/PCompositorParent.cpp:827
> #6  0x419df1da in mozilla::layers::PCompositorParent::OnChannelError (this=0x47c61c50) at ipc/ipdl/PCompositorParent.cpp:664
> #7  0x41971e4c in mozilla::ipc::AsyncChannel::NotifyMaybeChannelError (this=0x47c61c58) at ipc/glue/AsyncChannel.cpp:570
> #8  0x41973124 in mozilla::ipc::AsyncChannel::OnNotifyMaybeChannelError (this=0x47c61c58) at ipc/glue/AsyncChannel.cpp:535
> #9  0x4194732c in DispatchToMethod<mozilla::dom::ContentParent, void (mozilla::dom::ContentParent::*)()> (this=<value optimized out>) at ipc/chromium/src/base/tuple.h:383
> #10 RunnableMethod<mozilla::dom::ContentParent, void (mozilla::dom::ContentParent::*)(), Tuple0>::Run (this=<value optimized out>) at ipc/chromium/src/base/task.h:307
> #11 0x41d6345e in MessageLoop::RunTask (this=0x478ffdd0, task=0x46c5f060) at ipc/chromium/src/base/message_loop.cc:334
> #12 0x41d63c88 in MessageLoop::DeferOrRunPendingTask (this=0x44b99e58, pending_task=<value optimized out>) at ipc/chromium/src/base/message_loop.cc:342
> #13 0x41d649de in MessageLoop::DoWork (this=0x478ffdd0) at ipc/chromium/src/base/message_loop.cc:442
> #14 0x41d64d5a in base::MessagePumpDefault::Run (this=0x46864160, delegate=0x478ffdd0) at ipc/chromium/src/base/message_pump_default.cc:23
> #15 0x41d63a12 in MessageLoop::RunInternal (this=0x478ffdd0) at ipc/chromium/src/base/message_loop.cc:216
> #16 0x41d63a72 in MessageLoop::RunHandler (this=0x478ffdd0) at ipc/chromium/src/base/message_loop.cc:209
> #17 MessageLoop::Run (this=0x478ffdd0) at ipc/chromium/src/base/message_loop.cc:183
> #18 0x41d6d9a4 in base::Thread::ThreadMain (this=0x4682d3a0) at ipc/chromium/src/base/thread.cc:156
> #19 0x41d7b0d6 in ThreadFunc (closure=0x44b99e58) at ipc/chromium/src/base/platform_thread_posix.cc:39
> #20 0x400dce18 in __thread_entry (func=0x41d7b0cd <ThreadFunc>, arg=0x4682d3a0, tls=<value optimized out>) at bionic/libc/bionic/pthread.c:217
> #21 0x400dc96c in pthread_create (thread_out=<value optimized out>, attr=0xbeae1240, start_routine=0x41d7b0cd <ThreadFunc>, arg=0x4682d3a0) at bionic/libc/bionic/pthread.c:357
> #22 0x00000000 in ?? ()
> 
> Program received signal SIGSEGV, Segmentation fault.
> 0x4278585c in mozalloc_abort (msg=<value optimized out>) at memory/mozalloc/mozalloc_abort.cpp:30
> 30	    MOZ_CRASH();
> (gdb) bt
> #0  0x4278585c in mozalloc_abort (msg=<value optimized out>) at memory/mozalloc/mozalloc_abort.cpp:30
> #1  0x41d26e36 in Abort (aSeverity=<value optimized out>, aStr=<value optimized out>, aExpr=<value optimized out>, aFile=<value optimized out>, aLine=34) at xpcom/base/nsDebugImpl.cpp:430
> #2  NS_DebugBreak (aSeverity=<value optimized out>, aStr=<value optimized out>, aExpr=<value optimized out>, aFile=<value optimized out>, aLine=34) at xpcom/base/nsDebugImpl.cpp:387
> #3  0x41aa188c in mozilla::layers::PGrallocBuffer::Transition (from=<value optimized out>, trigger=..., next=<value optimized out>) at ipc/ipdl/PGrallocBuffer.cpp:34
> #4  0x419df80e in mozilla::layers::PGrallocBufferParent::Send__delete__ (actor=0x44b99e58) at ipc/ipdl/PGrallocBufferParent.cpp:82
> #5  0x41dff5c6 in mozilla::layers::ISurfaceAllocator::PlatformDestroySharedSurface (aSurface=0x44b99980) at gfx/layers/ipc/ShadowLayerUtilsGralloc.cpp:274
> #6  0x41df8690 in mozilla::layers::ISurfaceAllocator::DestroySharedSurface (this=0x404486c4, aSurface=0x44b99980) at gfx/layers/ipc/ISurfaceAllocator.cpp:110
> #7  0x41dfbc90 in ~TextureHost (this=0x459f9b80, __in_chrg=<value optimized out>) at gfx/layers/composite/TextureHost.cpp:58
> #8  0x41dfb8fc in ~GrallocTextureHostOGL (this=0x459f9b80, __in_chrg=<value optimized out>) at gfx/layers/opengl/TextureHostOGL.cpp:653
> #9  0x41dfb912 in ~GrallocTextureHostOGL (this=0xa5, __in_chrg=<value optimized out>) at gfx/layers/opengl/TextureHostOGL.cpp:653
> #10 0x41dedf92 in mozilla::RefCounted<mozilla::layers::TextureSource>::Release (this=0x452320e8, t=0x0) at ../../dist/include/mozilla/RefPtr.h:71
> #11 mozilla::RefPtr<mozilla::layers::TextureHost>::unref (this=0x452320e8, t=0x0) at ../../dist/include/mozilla/RefPtr.h:166
> #12 mozilla::RefPtr<mozilla::layers::TextureHost>::assign (this=0x452320e8, t=0x0) at ../../dist/include/mozilla/RefPtr.h:152
> #13 0x41dedfec in mozilla::RefPtr<mozilla::layers::TextureHost>::operator= (this=0x452320e8, t=0x478ff138) at ../../dist/include/mozilla/RefPtr.h:127
> #14 0x41dee136 in mozilla::layers::ContentHostDoubleBuffered::DestroyTextures (this=0x45232080) at gfx/layers/composite/ContentHost.cpp:303
> #15 0x41dee7ba in ~ContentHostDoubleBuffered (this=0xa5, __in_chrg=<value optimized out>) at gfx/layers/composite/ContentHost.cpp:274
> #16 0x41dee7ea in ~ContentHostDoubleBuffered (this=0xa5, __in_chrg=<value optimized out>) at gfx/layers/composite/ContentHost.cpp:276
> #17 0x41dd80dc in mozilla::RefCounted<mozilla::layers::CompositableHost>::Release (this=<value optimized out>) at ../../dist/include/mozilla/RefPtr.h:71
> #18 0x41ddca24 in mozilla::RefPtr<mozilla::layers::ContentHost>::unref (this=0x47d57c00, __in_chrg=<value optimized out>) at ../../dist/include/mozilla/RefPtr.h:166
> #19 ~RefPtr (this=0x47d57c00, __in_chrg=<value optimized out>) at ../../dist/include/mozilla/RefPtr.h:116
> #20 ~ThebesLayerComposite (this=0x47d57c00, __in_chrg=<value optimized out>) at gfx/layers/composite/ThebesLayerComposite.cpp:42
> #21 0x41ddca5a in ~ThebesLayerComposite (this=0xa5, __in_chrg=<value optimized out>) at gfx/layers/composite/ThebesLayerComposite.cpp:42
> #22 0x41dd96ae in mozilla::layers::Layer::Release (this=<value optimized out>, aChild=0x47d57c90) at gfx/layers/Layers.h:588
> #23 mozilla::layers::ContainerLayerComposite::RemoveChild (this=<value optimized out>, aChild=0x47d57c90) at gfx/layers/composite/ContainerLayerComposite.cpp:246
> #24 0x41dd975a in ~ContainerLayerComposite (this=0x45183c00, __in_chrg=<value optimized out>) at gfx/layers/composite/ContainerLayerComposite.cpp:176
> #25 0x41dd97ce in ~ContainerLayerComposite (this=0xa5, __in_chrg=<value optimized out>) at gfx/layers/composite/ContainerLayerComposite.cpp:178
> #26 0x40edf34e in mozilla::layers::Layer::Release (this=0x45183c90) at ../../dist/include/Layers.h:588
> #27 0x41df5326 in ~nsRefPtr (this=0x404486a0, __in_chrg=<value optimized out>) at ../../dist/include/nsAutoPtr.h:880
> #28 ~ShadowLayersParent (this=0x404486a0, __in_chrg=<value optimized out>) at gfx/layers/ipc/ShadowLayersParent.cpp:150
> #29 0x41df535e in ~ShadowLayersParent (this=0xa5, __in_chrg=<value optimized out>) at gfx/layers/ipc/ShadowLayersParent.cpp:150
> #30 0x41deb5e2 in mozilla::layers::CrossProcessCompositorParent::DeallocPLayers (this=<value optimized out>, aLayers=0x404486a0) at gfx/layers/ipc/CompositorParent.cpp:1407
> #31 0x419defd6 in mozilla::layers::PCompositorParent::DeallocSubtree (this=0x47c61c50) at ipc/ipdl/PCompositorParent.cpp:831
> #32 0x419df1da in mozilla::layers::PCompositorParent::OnChannelError (this=0x47c61c50) at ipc/ipdl/PCompositorParent.cpp:664
> #33 0x41971e4c in mozilla::ipc::AsyncChannel::NotifyMaybeChannelError (this=0x47c61c58) at ipc/glue/AsyncChannel.cpp:570
> #34 0x41973124 in mozilla::ipc::AsyncChannel::OnNotifyMaybeChannelError (this=0x47c61c58) at ipc/glue/AsyncChannel.cpp:535
> #35 0x4194732c in DispatchToMethod<mozilla::dom::ContentParent, void (mozilla::dom::ContentParent::*)()> (this=<value optimized out>) at ipc/chromium/src/base/tuple.h:383
> #36 RunnableMethod<mozilla::dom::ContentParent, void (mozilla::dom::ContentParent::*)(), Tuple0>::Run (this=<value optimized out>) at ipc/chromium/src/base/task.h:307
> #37 0x41d6345e in MessageLoop::RunTask (this=0x478ffdd0, task=0x46c5f060) at ipc/chromium/src/base/message_loop.cc:334
> #38 0x41d63c88 in MessageLoop::DeferOrRunPendingTask (this=0xa5, pending_task=<value optimized out>) at ipc/chromium/src/base/message_loop.cc:342
> #39 0x41d649de in MessageLoop::DoWork (this=0x478ffdd0) at ipc/chromium/src/base/message_loop.cc:442
> #40 0x41d64d5a in base::MessagePumpDefault::Run (this=0x46864160, delegate=0x478ffdd0) at ipc/chromium/src/base/message_pump_default.cc:23
> #41 0x41d63a12 in MessageLoop::RunInternal (this=0x478ffdd0) at ipc/chromium/src/base/message_loop.cc:216
> #42 0x41d63a72 in MessageLoop::RunHandler (this=0x478ffdd0) at ipc/chromium/src/base/message_loop.cc:209
> #43 MessageLoop::Run (this=0x478ffdd0) at ipc/chromium/src/base/message_loop.cc:183
> #44 0x41d6d9a4 in base::Thread::ThreadMain (this=0x4682d3a0) at ipc/chromium/src/base/thread.cc:156
> #45 0x41d7b0d6 in ThreadFunc (closure=0xa5) at ipc/chromium/src/base/platform_thread_posix.cc:39
> #46 0x400dce18 in __thread_entry (func=0x41d7b0cd <ThreadFunc>, arg=0x4682d3a0, tls=<value optimized out>) at bionic/libc/bionic/pthread.c:217
> #47 0x400dc96c in pthread_create (thread_out=<value optimized out>, attr=0xbeae1240, start_routine=0x41d7b0cd <ThreadFunc>, arg=0x4682d3a0) at bionic/libc/bionic/pthread.c:357
> #48 0x00000000 in ?? ()
Depends on: 860892
Which version of FirefoxOS do you use? From the crash log, the source code seems after gfx refactoring's one. They should be m-c/m-i.
Flags: needinfo?(cyu)
(In reply to Sotaro Ikeda [:sotaro] from comment #1)
> Which version of FirefoxOS do you use? From the crash log, the source code
> seems after gfx refactoring's one. They should be m-c/m-i.

Oh, it is written as turnk.
It seems like the crash is regression from gfx layer refactoring.
Flags: needinfo?(cyu)
Benoit, is this the same as the other ones we're tracking?
Flags: needinfo?(bjacob)
Yes, this is exactly the crash we've been having.

This bug is well-filed, the GDB session in comment 0 shows exactly what's happening, so let's keep this bug as our canonical one about this issue.
Flags: needinfo?(bjacob)
Assignee: nobody → bjacob
bug 776940 had a very similar problem
Here's what I know, based mostly on earlier work on this bug with Jeff M.

If you look at the two stacks in comment 0, the last stack frame in common is in PCompositorParent::DeallocSubtree().

Here's this function:

void
PCompositorParent::DeallocSubtree()
{
    {
        // Recursively deleting PLayers kids
        InfallibleTArray<PLayersParent*>& kids = mManagedPLayersParent;
        for (uint32_t i = 0; (i) < ((kids).Length()); (++(i))) {
            (kids[i])->DeallocSubtree();  // <-- first stack, where we destroy
        }                                 //     the PGrallocBufferParent

        for (uint32_t i = 0; (i) < ((kids).Length()); (++(i))) {
            DeallocPLayers(kids[i]);      // <-- second stack, where we destroy
        }                                 //     the TextureHost, calling Send__delete__
        (mManagedPLayersParent).Clear();  //     on the deleted PGrallocBufferParent
    }
}

From there I am not sure how to interprete this bug, as I am not sure what design principles our IPC code relies on.

One possible interpretation is that this is almost right, we just are doing this in the wrong order: if the TextureHost were destroyed _before_ the PGrallocBufferParent, we would be in good shape --- according to that interpretation. We've been working with that interpretation in past attempts to fix this bug, but evidently we haven't succeeded.

Another interpretation might be that IPDL structures such as PGrallocBufferParent are NOT to be manually held by non-IPDL-generated code. Here, the TextureHost holds the PGrallocBufferParent indirectly: the TextureHost really holds a SurfaceDescriptor (TextureHost::mBuffer), which wraps a PGrallocBufferParent in this case. In this interpretation, the bug is that non-IPDL-generated code such as TextureHost should not hold on to IPDL stuff such as PGrallocBufferParent (here though the SurfaceDescriptor).

Josh, could you comment on that? Hope that made any sense...
Flags: needinfo?(josh)
At all times, code using IPDL actors needs to be ensuring that the actors are actually valid. This is the reason ActorDestroy exists; it should be the responsibility of a GrallocBufferParent actor to notify any code that could make use of it that it is no longer a valid actor. It looks like the ShadowLayerForwarder::PlatformDestroySharedSurface code is in the wrong here; the way IPDL deallocates subtrees is correct.
Flags: needinfo?(josh)
> In this interpretation, the bug
> is that non-IPDL-generated code such as TextureHost should not hold on to
> IPDL stuff such as PGrallocBufferParent (here though the SurfaceDescriptor).

I don't see how this could be the case -- you have to be able to hold on to IPDL "stuff" outside of IPDL objects, because that's your entry point to communication with the other end...
I am a bit confused by ISurfaceAllocator::PlatformDestroySharedSurface. The child case makes sense (Send__delete__). In the parent case, however, I don't get why we are sending a delete there.
Bug 860442 added that code, I don't think we had this before the refactoring. There should be a symmetry between the two PlatformDestroySharedSurface implementations (ISurfaceAllocator and ShadowLayerForwarder), which is no longer the case, so I would say bug 860442 needs a different fix, and backing out that code should fix this problem here.
The patch in bug 860442 only makes a difference in the child case; here we are in the parent case. So I don't suppose that it would make a difference here, though I could be missing something.

We've just looked into this with :jrmuizel, :sotaro and :nical, and agreed on the following:

The core problem is that TextureHost has a SurfaceDescriptor member (mBuffer). SurfaceDescriptor holds a raw pointer to a gralloc buffer actor, allowing the actor to disappear under its feet. The crash here then occurs when the TextureHost destructor then tries to call Send__delete__ on the already-disappeared actor.

The fix is to have TextureHost no longer have a SurfaceDescriptor member. Instead, concrete TextureHost child classes should hold concrete surfaces and keep them alive; there should be no need to keep the SurfaceDescriptor. Note that GrallocTextureHostOGL already has a android::sp<android::GraphicBuffer> member; the differences is that its base class, TextureHost, should no longer have a SurfaceDescriptor member.

:nical is already working on a sizable change to TextureHost (among other classes) that will in particular accomplish that. But that is only scheduled for Gecko 24.

So we want to go for a short-term fix here, fixing only the B2G path rather than spending much more time fixing TextureHost in general --- as :nical's upcoming changes will overwrite ours anyway.

The short-term fix is going to be to ensure that the mBuffer field of TextureHost is not used for anything else than passing the gralloc buffer to GrallocTextureHostOGL.
Hey, I have a quick hacky patch that /seems/ to work in that the b2g main process no longer crashes on http://nightly-gupshup.herokuapp.com, but I can't test that page yet because of this content process crash still happening:

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 1491.1543]
0x40d39b12 in nsDOMCameraManager::AddRef (this=0x45c9dc10) at /hack/mozilla-central/dom/camera/DOMCameraManager.cpp:31
31      NS_IMPL_CYCLE_COLLECTING_ADDREF(nsDOMCameraManager)
(gdb) bt
#0  0x40d39b12 in nsDOMCameraManager::AddRef (this=0x45c9dc10) at /hack/mozilla-central/dom/camera/DOMCameraManager.cpp:31
#1  0x40bac0ec in nsRefPtr (this=0x464ffcfc, aSmartPtr=<value optimized out>) at ../../dist/include/nsAutoPtr.h:901
#2  0x40bad288 in mozilla::MediaManager::GetBackend (this=0x461a3850, aWindowId=3) at /hack/mozilla-central/dom/media/MediaManager.cpp:1153
#3  0x40baf606 in mozilla::GetUserMediaRunnable::Run (this=0x46395720) at /hack/mozilla-central/dom/media/MediaManager.cpp:532
#4  0x414cdcc0 in nsThread::ProcessNextEvent (this=0x45b27a20, mayWait=<value optimized out>, result=<value optimized out>) at /hack/mozilla-central/xpcom/threads/nsThread.cpp:627
#5  0x41491554 in NS_ProcessNextEvent (thread=0xb6, mayWait=true) at /hack/b2g/B2G/objdir-gecko/xpcom/build/nsThreadUtils.cpp:238
#6  0x414ce3e4 in nsThread::ThreadFunc (arg=<value optimized out>) at /hack/mozilla-central/xpcom/threads/nsThread.cpp:265
#7  0x42df728c in _pt_root (arg=<value optimized out>) at /hack/mozilla-central/nsprpub/pr/src/pthreads/ptthread.c:191
#8  0x40047e18 in __thread_entry (func=0x42df71f9 <_pt_root>, arg=0x463955e0, tls=<value optimized out>) at bionic/libc/bionic/pthread.c:217
#9  0x4004796c in pthread_create (thread_out=<value optimized out>, attr=0xbe99bb94, start_routine=0x42df71f9 <_pt_root>, arg=0x463955e0) at bionic/libc/bionic/pthread.c:357                                                                                           
#10 0x00000000 in ?? ()                                                                                                                                                                                        
Anyone got a clue about it?
This seems to be running nicely here:
 - No compositor crash anymore on http://nightly-gupshup.herokuapp.com, even though the browser process keeps crashing as in the previous comment. This shows that the compositor now survives the content process crash.
 - Also tested on some newspaper websites .
Attachment #741525 - Flags: review?(nical.bugzilla)
Comment on attachment 741525 [details] [diff] [review]
hacky fix for GrallocTextureHostOGL to avoid storing the SurfaceDescriptor, until Nical makes everything nice

Review of attachment 741525 [details] [diff] [review]:
-----------------------------------------------------------------

Since this review is very time-sensitive and it's very late in Europe, let's add Nick as an alternate reviewer; one review is enough.
Attachment #741525 - Flags: review?(ncameron)
Comment on attachment 741525 [details] [diff] [review]
hacky fix for GrallocTextureHostOGL to avoid storing the SurfaceDescriptor, until Nical makes everything nice

Review of attachment 741525 [details] [diff] [review]:
-----------------------------------------------------------------

r=me if it works :-p

::: gfx/layers/opengl/TextureHostOGL.cpp
@@ +752,5 @@
> +SurfaceDescriptor*
> +GrallocTextureHostOGL::GetBuffer() const
> +{
> +  static SurfaceDescriptor dummyDescriptor;
> +  return &dummyDescriptor;

Woah! This works? This kind of scares me. I'll trust you if you say it works, but presumably that is only if this never gets called? In which case please add at least a warning and maybe MOZ_NOT_REACHED

@@ +760,5 @@
> +GrallocTextureHostOGL::SetBuffer(SurfaceDescriptor* aBuffer, ISurfaceAllocator* aAllocator)
> +{
> +  mDeAllocator = aAllocator;
> +
> +  android::sp<android::GraphicBuffer> buffer = GrallocBufferActor::GetFrom(*aBuffer);

buffer is not used - if this is just to keep aBuffer alive, please add a comment, otherwise delete it.
Attachment #741525 - Flags: review?(ncameron) → review+
Comment on attachment 741525 [details] [diff] [review]
hacky fix for GrallocTextureHostOGL to avoid storing the SurfaceDescriptor, until Nical makes everything nice

Actually, this only worked on my testing on newspaper sites doing ThebesLayers... I then tested the Camera app which has an ImageHost, and it crashes. The problem is, I was assuming that the old way of double buffering with TextureHost was not used anymore (it is not used anymore by ContentHost) but it is still used by ImageLayer. With the old way to double buffering, the TextureHost actually needs to keep track of the SurfaceDescriptor...

New patch coming up with a different approach...
Attachment #741525 - Flags: review?(nical.bugzilla)
(In reply to Benoit Jacob [:bjacob] from comment #17)
> New patch coming up with a different approach...

The new approach has problems that we have no solutions to.
(In reply to Benoit Jacob [:bjacob] from comment #13)
> Hey, I have a quick hacky patch that /seems/ to work in that the b2g main
> process no longer crashes on http://nightly-gupshup.herokuapp.com, but I
> can't test that page yet because of this content process crash still
> happening:
> 
> Program received signal SIGSEGV, Segmentation fault.
> [Switching to Thread 1491.1543]
> 0x40d39b12 in nsDOMCameraManager::AddRef (this=0x45c9dc10) at
> /hack/mozilla-central/dom/camera/DOMCameraManager.cpp:31
> 31      NS_IMPL_CYCLE_COLLECTING_ADDREF(nsDOMCameraManager)
> (gdb) bt
> #0  0x40d39b12 in nsDOMCameraManager::AddRef (this=0x45c9dc10) at
> /hack/mozilla-central/dom/camera/DOMCameraManager.cpp:31
> #1  0x40bac0ec in nsRefPtr (this=0x464ffcfc, aSmartPtr=<value optimized
> out>) at ../../dist/include/nsAutoPtr.h:901
> #2  0x40bad288 in mozilla::MediaManager::GetBackend (this=0x461a3850,
> aWindowId=3) at /hack/mozilla-central/dom/media/MediaManager.cpp:1153
> #3  0x40baf606 in mozilla::GetUserMediaRunnable::Run (this=0x46395720) at
> /hack/mozilla-central/dom/media/MediaManager.cpp:532
> #4  0x414cdcc0 in nsThread::ProcessNextEvent (this=0x45b27a20,
> mayWait=<value optimized out>, result=<value optimized out>) at
> /hack/mozilla-central/xpcom/threads/nsThread.cpp:627
> #5  0x41491554 in NS_ProcessNextEvent (thread=0xb6, mayWait=true) at
> /hack/b2g/B2G/objdir-gecko/xpcom/build/nsThreadUtils.cpp:238
> #6  0x414ce3e4 in nsThread::ThreadFunc (arg=<value optimized out>) at
> /hack/mozilla-central/xpcom/threads/nsThread.cpp:265
> #7  0x42df728c in _pt_root (arg=<value optimized out>) at
> /hack/mozilla-central/nsprpub/pr/src/pthreads/ptthread.c:191
> #8  0x40047e18 in __thread_entry (func=0x42df71f9 <_pt_root>,
> arg=0x463955e0, tls=<value optimized out>) at
> bionic/libc/bionic/pthread.c:217
> #9  0x4004796c in pthread_create (thread_out=<value optimized out>,
> attr=0xbe99bb94, start_routine=0x42df71f9 <_pt_root>, arg=0x463955e0) at
> bionic/libc/bionic/pthread.c:357                                            
> 
> #10 0x00000000 in ?? ()                                                     
> 
> Anyone got a clue about it?

If you take patch in Bug 825110 to enable WebRTC on B2G, I uploaded a new patch to avoid this cycle collector crash.
OK, here is approach #3: implement GrallocBufferActor::ActorDestroy to inform the TextureHost when it's going away. This was pointed out by Bas and :bent as a reasonable short-term fix for this kind of bug. It's not great, because it relies on the possibly nontrivial (?) assumption that the TextureHost is the only thing outside of IPDL-generated code referencing the PGrallocBufferParent.

Anyway, it works here. Contrary to the 2 approaches tried above, it removes the present crash from all the places where I've been able to reproduce it consistently, including the Camera app and WebRTC pages.
Attachment #741985 - Flags: review?(ncameron)
Attachment #741985 - Flags: feedback?(bent.mozilla)
Comment on attachment 741985 [details] [diff] [review]
Let the PGrallocBufferParent inform the TextureHost when it's going away

Review of attachment 741985 [details] [diff] [review]:
-----------------------------------------------------------------

Um, this is awful. But it looks like it should work, just as long as we get rid of it as soon as possible (in fact, please file a bug to do that now).

::: gfx/layers/composite/TextureHost.h
@@ +253,5 @@
>      mDeAllocator = aAllocator;
>    }
>  
> +  // used only for hacky fix in gecko 23 for bug 862324
> +  void ForgetBuffer();

Can we add this to GrallocTextureHostOGL instead? Then cast in the gralloc actor? I mean if it is not the right type there, we are in all kinds of trouble anyway, right?

@@ +297,5 @@
> +  SurfaceDescriptor* mBuffer; // FIXME [bjacob] it's terrible to have a SurfaceDescriptor here,
> +                              // because SurfaceDescriptor's may have raw pointers to IPDL actors,
> +                              // which can go away under our feet at any time. This is the cause
> +                              // of bug 862324 among others. Our current understanding is that
> +                              // this will be gone in Gecko 24.

Do we have a bug number for Nical's Texture host/client work? Could you add that to all the comments in this file so we know when they should be unnecessary?

::: gfx/layers/ipc/ShadowLayerUtilsGralloc.h
@@ +84,5 @@
>    static android::sp<GraphicBuffer>
>    GetFrom(const SurfaceDescriptorGralloc& aDescriptor);
>  
> +  // used only for hacky fix in gecko 23 for bug 862324
> +  void ActorDestroy(ActorDestroyReason why);

Does this override something? If so please make it virtual and MOZ_OVERRRIDE
Attachment #741985 - Flags: review?(ncameron) → review+
(In reply to Benoit Jacob [:bjacob] from comment #20)
> Created attachment 741985 [details] [diff] [review]
> ...
> because it relies on the possibly nontrivial (?) assumption that the

Is there a way to put in some kinds of guards (asserts?) for this assumption?
(In reply to Nick Cameron [:nrc] from comment #21)
> Um, this is awful. But it looks like it should work, just as long as we get
> rid of it as soon as possible (in fact, please file a bug to do that now).

This would be bug 858914
(In reply to Nicolas Silva [:nical] from comment #23)
> (In reply to Nick Cameron [:nrc] from comment #21)
> > Um, this is awful. But it looks like it should work, just as long as we get
> > rid of it as soon as possible (in fact, please file a bug to do that now).
> 
> This would be bug 858914

Can we have a specific bug, blocked by bug 858914, for removing this stuff?
Yup, will file the bug.

A quick update. It turns out that the crash is not completely fixed yet, for the following reason. The patch here assumes that the only way that TextureHost::mBuffer can be set by the outside, is through SetBuffer(). That is true as far as the _pointer_ mBuffer is concerned. But we still have a crash were something around ImageHost is changing *mBuffer under our feet --- same mBuffer pointer, no SetBuffer call, but *mBuffer changes. Debugging at the moment.
Got it by breaking on SurfaceDescriptor::operator=. So the other place where the TextureHost *mBuffer can be changed is in SwapTextures.

New patch coming with review comments from :nrc.
Depends on: 865908
By the way, filed bug 865908.
Pushed to try:
https://tbpl.mozilla.org/?tree=Try&rev=870ad028ce8c

Feel free to land if it's green.
Regarding this review comment which I didn't apply:

(In reply to Nick Cameron [:nrc] from comment #21)
> ::: gfx/layers/composite/TextureHost.h
> @@ +253,5 @@
> >      mDeAllocator = aAllocator;
> >    }
> >  
> > +  // used only for hacky fix in gecko 23 for bug 862324
> > +  void ForgetBuffer();
> 
> Can we add this to GrallocTextureHostOGL instead? Then cast in the gralloc
> actor? I mean if it is not the right type there, we are in all kinds of
> trouble anyway, right?

Unfortunately, that turned out to be complicated. It meant adding #include "TextureHostOGL.h" to ShadowLayersUtilsGralloc.h which created complicated compilation errors; I gave up after trying and failing a bit.
Orange on Try, with just 'Killed'. Could be out-of-memory; could speculate that's a leak. Maybe caused by ForgetBuffer(). Will look tomorrow.
Try push destroying all what we can in ForgetBuffer()... would help is the TextureHost's lifetime extended significantly past the GrallocBufferActor...
https://tbpl.mozilla.org/?tree=Try&rev=f3bd163f3ef0
Blocks: 864598
Another try push. I dont remember what really was the rationale for skipping the Send__delete__ in ForgetBuffer. Re-adding it. ForgetBuffer may be a misnomer btw.
https://tbpl.mozilla.org/?tree=Try&rev=915dc0ef65d0
Attached patch Baseline patch. Leaking. (obsolete) — Splinter Review
Known to be leaking.
The last patch (NOT hold a strong reference) seems to fix the leak in local testing. Time to push again to try:

https://tbpl.mozilla.org/?tree=Try&rev=31d73e4ff5c3
Attachment #742540 - Attachment description: Make GrallocBufferActor NOT hold a strong reference to TextureHost. Testing at the moment. → Make GrallocBufferActor NOT hold a strong reference to TextureHost. Seems to fix the leak locally.
(In reply to Benoit Jacob [:bjacob] from comment #37)
> The last patch (NOT hold a strong reference) seems to fix the leak in local
> testing. Time to push again to try:
> 
> https://tbpl.mozilla.org/?tree=Try&rev=31d73e4ff5c3

We're additionally going to make sure we clear this reference after the TextureHost is gone.
Hah, the patch adding the Send__delete__ to ForgetBuffer is actually bad, as it creates a recursion which ends in an assertion failure:

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 644.667]
0x426049e0 in mozalloc_abort (msg=<value optimized out>) at /hack/mozilla-central/memory/mozalloc/mozalloc_abort.cpp:30
30          MOZ_CRASH();
(gdb) bt
#0  0x426049e0 in mozalloc_abort (msg=<value optimized out>) at /hack/mozilla-central/memory/mozalloc/mozalloc_abort.cpp:30
#1  0x41c77172 in Abort (aSeverity=<value optimized out>, aStr=<value optimized out>, aExpr=<value optimized out>, aFile=<value optimized out>, aLine=392) at /hack/mozilla-central/xpcom/base/nsDebugImpl.cpp:430
#2  NS_DebugBreak (aSeverity=<value optimized out>, aStr=<value optimized out>, aExpr=<value optimized out>, aFile=<value optimized out>, aLine=392) at /hack/mozilla-central/xpcom/base/nsDebugImpl.cpp:387
#3  0x4199e8ac in mozilla::layers::PGrallocBufferParent::Write (this=<value optimized out>, __v=<value optimized out>, __msg=0x48c86480, __nullable=true) at /hack/b2g/B2G/objdir-gecko/ipc/ipdl/PGrallocBufferParent.cpp:392
#4  0x4199ebb4 in mozilla::layers::PGrallocBufferParent::Send__delete__ (actor=0x49bb9b1c) at /hack/b2g/B2G/objdir-gecko/ipc/ipdl/PGrallocBufferParent.cpp:73
#5  0x41d4d9d6 in mozilla::layers::ISurfaceAllocator::PlatformDestroySharedSurface (aSurface=0x48fd48c0) at /hack/mozilla-central/gfx/layers/ipc/ShadowLayerUtilsGralloc.cpp:295
#6  0x41d45fd8 in mozilla::layers::ISurfaceAllocator::DestroySharedSurface (this=0x4873b5c8, aSurface=0x48fd48c0) at /hack/mozilla-central/gfx/layers/ipc/ISurfaceAllocator.cpp:110
#7  0x41d496e4 in mozilla::layers::GrallocTextureHostOGL::ForgetBuffer (this=0x4938ce20) at /hack/mozilla-central/gfx/layers/opengl/TextureHostOGL.h:617
#8  0x41d4cf2e in mozilla::layers::GrallocBufferActor::ActorDestroy (this=<value optimized out>) at /hack/mozilla-central/gfx/layers/ipc/ShadowLayerUtilsGralloc.cpp:230
#9  0x41954b0a in mozilla::plugins::PPluginBackgroundDestroyerParent::DestroySubtree (this=0x49bb9b1c, why=mozilla::ipc::IProtocolManager<mozilla::ipc::RPCChannel::RPCListener>::Deletion)
    at /hack/b2g/B2G/objdir-gecko/ipc/ipdl/PPluginBackgroundDestroyerParent.cpp:332
#10 0x4199ec5a in mozilla::layers::PGrallocBufferParent::Send__delete__ (actor=0x49bb9b1c) at /hack/b2g/B2G/objdir-gecko/ipc/ipdl/PGrallocBufferParent.cpp:88
#11 0x41d4d9d6 in mozilla::layers::ISurfaceAllocator::PlatformDestroySharedSurface (aSurface=0x48fd48c0) at /hack/mozilla-central/gfx/layers/ipc/ShadowLayerUtilsGralloc.cpp:295
#12 0x41d45fd8 in mozilla::layers::ISurfaceAllocator::DestroySharedSurface (this=0x4873b5c8, aSurface=0x48fd48c0) at /hack/mozilla-central/gfx/layers/ipc/ISurfaceAllocator.cpp:110
#13 0x41d496e4 in mozilla::layers::GrallocTextureHostOGL::ForgetBuffer (this=0x4938ce20) at /hack/mozilla-central/gfx/layers/opengl/TextureHostOGL.h:617
#14 0x41d4cf2e in mozilla::layers::GrallocBufferActor::ActorDestroy (this=<value optimized out>) at /hack/mozilla-central/gfx/layers/ipc/ShadowLayerUtilsGralloc.cpp:230
#15 0x41954b0a in mozilla::plugins::PPluginBackgroundDestroyerParent::DestroySubtree (this=0x49bb975c, why=mozilla::ipc::IProtocolManager<mozilla::ipc::RPCChannel::RPCListener>::Deletion)
    at /hack/b2g/B2G/objdir-gecko/ipc/ipdl/PPluginBackgroundDestroyerParent.cpp:332


So now I remember --- that's why the original patch didn't have the Send__delete__. Will clarify the comment.
Comment on attachment 742539 [details] [diff] [review]
Add Send__delete__ call. Still leaking.

Obsoleting this patch, see previous comment.
Attachment #742539 - Attachment is obsolete: true
Alright. So the do-not-hold-a-strong-reference alone (without the bad Send__delete__ patch) seems to work:
 - no kill in local emulator reftest run,
 - no issue in local testing of Camera and Browser on unagi.

--> pushed again to try, hoping for green based on local emulator reftest run:
https://tbpl.mozilla.org/?tree=Try&rev=35832507011d

Can land if it's green!
Attachment #742084 - Attachment is obsolete: true
Attachment #742538 - Attachment is obsolete: true
Attachment #742540 - Attachment is obsolete: true
Attachment #742568 - Flags: review+
(In reply to Benoit Jacob [:bjacob] from comment #41)
> Created attachment 742568 [details] [diff] [review]
> Can land if it's green!

This has an orange R9. With a crash in:

libxul.so!IPC::ParamTraits<mozilla::layers::MagicGrallocBufferHandle>::Write(IPC::Message*, mozilla::layers::MagicGrallocBufferHandle const&

This looks suspicious.
Thanks for retriggering; the second run is green. I agree this looks a bit too intimidating to just land now. Let's retrigger more times, try cycles are cheap on Friday nights, and then let's compare to the average frequency of random oranges on B2G emulator reftests.
It's all green so far, including all 11 retriggers of R9.

Also, this orange was actually bug 818103: TBPL suggested it, and looking at the stacks there, the suggestion is correct --- I checked a few stacks and they all involve that MagicGrallocBufferHandle stuff.

That bug 818103 has been starred over 40 times today.

So I think that our one orange R9 out of 12 runs is this existing intermittent.

Landing now.
Retriggers of R2 and R7 have hit bug 818103 too; still no reason to believe it's a regression given the huge volume of bug 818103.
Attachment #741985 - Flags: feedback?(bent.mozilla)
Inbound is closed for bustage :-( and I don't want to have to watch the tree after pushing to central. Will push tomorrow.
https://hg.mozilla.org/mozilla-central/rev/6543154c7487
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Depends on: 873486
Should we undo this now?
Flags: needinfo?(bjacob)
Adding Nicolas to the questions as well (Benoit is not back for another week), because bug 858914 is close to landing and it affects this area significantly.  It may be worth waiting for that instead of undoing this change on its own?
Flags: needinfo?(nical.bugzilla)
Indeed, :nical is likely the right person: if any current/recent change makes the present cset no longer needed, it would be his current work on refactoring texturehost.
Flags: needinfo?(bjacob)
Patches in Bug 858914 haven't landed yet so I guess we need to wait before undoing that.
We will also need to wait for followups of 858914 because this one only fixes textures with ImageLayers.
Flags: needinfo?(nical.bugzilla)
Might the crash seen in bug 905152 be related to these changes?
Depends on: 879681
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: