Closed Bug 1060703 Opened 10 years ago Closed 10 years ago

Shutdown hang in CompositorParent::Shutdown (caused by event loop mismatch?)

Categories

(Core :: Graphics: Layers, defect)

32 Branch
defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla35

People

(Reporter: khuey, Assigned: nical)

Details

(Keywords: hang)

Attachments

(3 files, 2 obsolete files)

My b2g emulator hung during shutdown today in CompositorParent::Shutdown.  I wasn't able to trace the entire thing before gdb segfaulted, but I think I know what's going on.

CompositorParent::Shutdown[0] attempts to join the compositor thread.  It does this by waiting for CompositorThreadHolder::DestroyCompositorThread to run and set sFinishedCompositorShutDown to true.  DestroyCompositorThread runs from ~CompositorThreadHolder which runs on the main thread.  When CompositorThreadHolder's last reference is released it uses some crazy macro[1] to run its dtor on the main thread.  But this posts to the chromium event loop, while CompositorParent::Shutdown is blocked waiting on an XPCOM event.  If nobody else is kind enough to send us an XPCOM event to wake us up then we wait forever because the message we're waiting on is in a queue we're not looking at.

[0] http://hg.mozilla.org/mozilla-central/annotate/eed9fe35a00d/gfx/layers/ipc/CompositorParent.cpp#l192
[1] http://hg.mozilla.org/mozilla-central/annotate/eed9fe35a00d/gfx/layers/ipc/ThreadSafeRefcountingWithMainThreadDestruction.h#l44
(In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #0)
> CompositorParent::Shutdown[0] attempts to join the compositor thread.  It
> does this by waiting for CompositorThreadHolder::DestroyCompositorThread to
> run and set sFinishedCompositorShutDown to true.  DestroyCompositorThread
> runs from ~CompositorThreadHolder which runs on the main thread.  When
> CompositorThreadHolder's last reference is released it uses some crazy
> macro[1] to run its dtor on the main thread.  But this posts to the chromium

But the chromium message loop is main thread's.
http://hg.mozilla.org/mozilla-central/annotate/eed9fe35a00d/gfx/layers/ipc/ThreadSafeRefcountingWithMainThreadDestruction.h#l17

Is it a release build and we can't catch the assert?

> event loop, while CompositorParent::Shutdown is blocked waiting on an XPCOM
> event.  If nobody else is kind enough to send us an XPCOM event to wake us
> up then we wait forever because the message we're waiting on is in a queue
> we're not looking at.
The problem is not that we're posting to a MessageLoop for another thread.  It's that we're posting to a MessageLoop, but blocked on the condvar at http://hg.mozilla.org/mozilla-central/annotate/5fd6428ae179/xpcom/threads/nsEventQueue.cpp#l67, which is only signaled at http://hg.mozilla.org/mozilla-central/annotate/5fd6428ae179/xpcom/threads/nsEventQueue.cpp#l122 when someone posts an XPCOM event loop.  MessageLoop::PostTask does not signal that condvar and does not wake the main thread up.
Assignee: nobody → nical.bugzilla
Comment on attachment 8482776 [details] [diff] [review]
Use XPCOM's event loop instead of Chromium's in NS_INLINE_DECL_THREADSAFE_REFCOUNTING_WITH_MAIN_THREAD_DESTRUCTION

Review of attachment 8482776 [details] [diff] [review]:
-----------------------------------------------------------------

The approach is wrong, as it reverts an intentional change that we had to make to shut down properly with e10s. See the conversation that we had around bug 774388 comment 67. Also see bug 1033358.

More specifically: this shutdown code here could run late enough during shutdown, that nsThreadManager is no longer available. This patch uses NS_DispatchToMainThread, which calls NS_GetMainThread which relies on nsThreadManager. Boom.
Attachment #8482776 - Flags: review?(bjacob) → review-
So piecing this together with Kyle's above comment, the right approach would be to replace the NS_ProcessNextEvent in 

  while (!sFinishedCompositorShutDown) {
    NS_ProcessNextEvent(nullptr, true);
  }

in CompositorParent::Shutdown, by equivalent chromium-threads code.
(In reply to Benoit Jacob [:bjacob] from comment #5)
> More specifically: this shutdown code here could run late enough during
> shutdown, that nsThreadManager is no longer available. This patch uses
> NS_DispatchToMainThread, which calls NS_GetMainThread which relies on
> nsThreadManager. Boom.

NS_ProcessNextEvent uses the thread manager to get the current thread since we passed in nullptr, so this code is already dependent on the thread manager still existing.  And http://mxr.mozilla.org/mozilla-central/source/xpcom/build/nsXPComInit.cpp#840 is clearly before thread manager shutdown.
(In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #7)
> (In reply to Benoit Jacob [:bjacob] from comment #5)
> > More specifically: this shutdown code here could run late enough during
> > shutdown, that nsThreadManager is no longer available. This patch uses
> > NS_DispatchToMainThread, which calls NS_GetMainThread which relies on
> > nsThreadManager. Boom.
> 
> NS_ProcessNextEvent uses the thread manager to get the current thread since
> we passed in nullptr, so this code is already dependent on the thread
> manager still existing.

Good catch; that's bad, and needs to be fixed, then.

> And
> http://mxr.mozilla.org/mozilla-central/source/xpcom/build/nsXPComInit.
> cpp#840 is clearly before thread manager shutdown.

That is only the _start_ of the compositor shutdown sequence. The destruction of the compositor actors and thread can occur much later, depending for instance on IPC.
(In reply to Benoit Jacob [:bjacob] from comment #8)
> (In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #7)
> > NS_ProcessNextEvent uses the thread manager to get the current thread since
> > we passed in nullptr, so this code is already dependent on the thread
> > manager still existing.
> 
> Good catch; that's bad, and needs to be fixed, then.

To be clear, I mean: we should not use NS_ProcessNextEvent here. Similar to the conclusion already arrived to in comment 6.
Sorry if I'm rehashing old decisions here, but why don't you just block the shutdown process until compositor shutdown is complete?
(In reply to Kyle Huey [:khuey] (khuey@mozilla.com) from comment #10)
> Sorry if I'm rehashing old decisions here, but why don't you just block the
> shutdown process until compositor shutdown is complete?

Actually... we now do, right? with CompositorParent::Shutdown() blocking on sFinishedCompositorShutDown becoming true. So all of what I wrote above was referring to an earlier state of affairs when we were not blocking XPCOM shutdown on compositor shutdown, and that is no longer the case, so the issues that I was worrying about do no longer exist.

Correct? will redo the review after lunch...
Comment on attachment 8482776 [details] [diff] [review]
Use XPCOM's event loop instead of Chromium's in NS_INLINE_DECL_THREADSAFE_REFCOUNTING_WITH_MAIN_THREAD_DESTRUCTION

So my above concern is cleared --- sorry, there were so many iterations on bug 774388, before we arrived to the current solution of blocking XPCOM shutdown on compositor shutdown, and I had lost track of that.

This patch is still r- for a different reason now. It changes DestroyToBeCalledOnMainThread from being a static method to being a non-static method, and changes NewRunnableFunction into NS_NewRunnableMethod. The problem is that NS_NewRunnableMethod results in an Addref() call on the closure object, here:
http://dxr.mozilla.org/mozilla-central/source/xpcom/glue/nsThreadUtils.h#280

That's at least going to confuse refcount logging: this extra Addref() will be logged, but won't be matched with any Release(), so I suppose that this will cause leaks to be reported by XPCOM_MEM_LEAK_LOG.
Yeah, XPCOM should still be alive here. I'm a little surprised that this is causing problems, because dispatching a chromium event is *supposed* to wake up the XPCOM event loop via http://mxr.mozilla.org/mozilla-central/source/ipc/glue/MessagePump.cpp#71

In any case, it shouldn't be a problem for this to use XPCOM events entirely.
Right, again, the "we can't use XPCOM threads/events here because XPCOM is already shut down" discussion here was based on an intermediate state of affairs on bug 774388. I agree that now that we block XPCOM shutdown on compositor shutdown, that class of issues is gone.
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #13)
> Yeah, XPCOM should still be alive here. I'm a little surprised that this is
> causing problems, because dispatching a chromium event is *supposed* to wake
> up the XPCOM event loop via
> http://mxr.mozilla.org/mozilla-central/source/ipc/glue/MessagePump.cpp#71
> 
> In any case, it shouldn't be a problem for this to use XPCOM events entirely.

Do we use that MessagePump?  There are a number of different implementations.  The one that's android/b2g specific appears to poke the appshell event handling mechanism rather than the XPCOM event queue.
I don't know, but it should be an invariant that posting to the chromium messageloop causes NS_ProcessNextEvent to wake up under any circumstance.
Attachment #8482776 - Attachment is obsolete: true
Attachment #8483551 - Flags: review?(bjacob)
Comment on attachment 8483551 [details] [diff] [review]
v2 (better taking refcount into account)

Review of attachment 8483551 [details] [diff] [review]:
-----------------------------------------------------------------

::: gfx/layers/ipc/ThreadSafeRefcountingWithMainThreadDestruction.h
@@ +53,4 @@
>      MOZ_ASSERT(NS_IsMainThread());                                            \
> +    MOZ_ASSERT(mRefCnt == 1);                                                 \
> +    /* to balance the implicit AddRef in NS_NewRunnableMethod */              \
> +    Release();                                                                \

So this Release() call calls the Release() method defined just below, which decrements the refcount, making it be 0 again, and now we are on the main thread, so we take the "delete this;" path.

Good? No! Because the Runnable object here is still holding a reference to "this". That pointer is now dangling. When the runnable object is destroyed (just after this callback), it will try to call Release() again on it... which is now a use-after-free case!

Why can't you just use a RunnableFunction instead of a RunnableMethod, like the current code does?
Attachment #8483551 - Flags: review?(bjacob) → review-
(In reply to Benoit Jacob [:bjacob] from comment #18)
> Why can't you just use a RunnableFunction instead of a RunnableMethod, like
> the current code does?

Where is XPCOM's equivalent of Chromium's NewRunnableFunction defined? I looked for it but I only found NS_NewRunnableMethod and NS_NewNonOwningRunnableMethod.


> Good? No! Because the Runnable object here is still holding a reference to
> "this". That pointer is now dangling. When the runnable object is destroyed
> (just after this callback), it will try to call Release() again on it...
> which is now a use-after-free case!

Ok, makes makes more sense. Earlier you said that the AddRef would not be matched and I misunderstood what you meant back then.
My previous patch was explicitly deleting the object, so when the matching Release call would happen, the object would already be deleted (so it would be matched but would cause a use-after-free).
(In reply to Nicolas Silva [:nical] from comment #19)
> (In reply to Benoit Jacob [:bjacob] from comment #18)
> > Why can't you just use a RunnableFunction instead of a RunnableMethod, like
> > the current code does?
> 
> Where is XPCOM's equivalent of Chromium's NewRunnableFunction defined? I
> looked for it but I only found NS_NewRunnableMethod and
> NS_NewNonOwningRunnableMethod.

You might have to subclass nsRunnable directly (which is what NS_NewRunnableMethod does internally).
Attached patch Fix v3Splinter Review
New version with refcounting fixed.
try push: https://treeherder.mozilla.org/ui/#/jobs?repo=try&revision=1488b2918407
Attachment #8483551 - Attachment is obsolete: true
Attachment #8486560 - Flags: review?(bjacob)
Alternative patch based on comment 6. I don't fully understand the potential side effects of doing this so Benoit it's your call :)
try push: https://treeherder.mozilla.org/ui/#/jobs?repo=try&revision=46570c235c91
Attachment #8486564 - Flags: review?(bjacob)
Attachment #8486560 - Flags: review?(bjacob) → review+
Comment on attachment 8486564 [details] [diff] [review]
Alternative patch: spin the chromium event loop instead of xpcom's

Review of attachment 8486564 [details] [diff] [review]:
-----------------------------------------------------------------

Can you explain more what the benefit of this patch is?
Attachment #8486564 - Flags: review?(bjacob)
Comment on attachment 8486564 [details] [diff] [review]
Alternative patch: spin the chromium event loop instead of xpcom's

Review of attachment 8486564 [details] [diff] [review]:
-----------------------------------------------------------------

Oh, nevermind. So I don't have a strong preference between these two approaches. If you want to go this route, I'm the wrong person to review this patch, :bsmedberg or :bent might feel more comfortable telling whether Run() is really what you want here (you wouldn't want to accidentally spin a nested event loop here...!)
(In reply to Benoit Jacob [:bjacob] from comment #24)
> Comment on attachment 8486564 [details] [diff] [review]
> Alternative patch: spin the chromium event loop instead of xpcom's
> 
> Review of attachment 8486564 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> Oh, nevermind. So I don't have a strong preference between these two
> approaches. If you want to go this route, I'm the wrong person to review
> this patch, :bsmedberg or :bent might feel more comfortable telling whether
> Run() is really what you want here (you wouldn't want to accidentally spin a
> nested event loop here...!)

I don't feel very comfortable with this approach, so let's stay with the other one.
https://hg.mozilla.org/mozilla-central/rev/5fdee53a4624
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla35
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: