Closed Bug 826300 Opened 7 years ago Closed 7 years ago

The compositor blocking on java during resume results in a deadlock

Categories

(Firefox for Android :: Toolbar, defect)

All
Android
defect
Not set

Tracking

()

RESOLVED FIXED
Firefox 21

People

(Reporter: kats, Assigned: kats)

References

(Blocks 1 open bug)

Details

Attachments

(1 file, 1 obsolete file)

Splitting https://bugzilla.mozilla.org/show_bug.cgi?id=785597#c32 into a separate bug as it is a cause of a lot of deadlock bugs. What I've identified is this:

1) Fennec starts up
2) The GL surface is created
3) A SIZE_CHANGED event is dispatched from some java thread (often the UI thread but not necessarily) to the gecko thread
4) The user opens the awesomescreen making the surface go away
5) The SIZE_CHANGED event handler in nsWindow.cpp calls ScheduleResumeComposition (this happens on the gecko thread and so may happen some time after step 4)
6) The ScheduleResumeComposition call goes into Compositor::ScheduleResumeOnCompositoThread, which dispatches a resume onto the compositor thread and then blocks
7) The compositor thread processes the resume and goes into GLContextEGL::RenewSurface(), which calls AndroidBridge::ProvideEGLSurface. This call ends up calling GLController.waitForValidSurface which blocks the compositor thread (since the surface is gone as of step 4)

At this point the compositor thread is blocked on waiting for a surface, and the gecko thread is blocked on the compositor thread to process the resume, and any subsequent calls to geckoEventSync will cause an ANR and badness. This could be a key input event or the composition pause event (note that one gets dispatched in step 4), or a memory pressure event. Bug 706500 and bug 807090 are both examples of these failures. I was able to reproduce this issue consistently in bug 785597 (around comment 29 and on) where the deadlock ended up preventing injection of events and manifesting as a seemingly-unrelated error.

Note that prior to bug 797942 / cset ef02503c4387, ScheduleResumeComposition was never called from the gecko thread; it was always called directly from the java UI thread so step 5 above could never happen and this deadlock scenario would never happen.

My general solution to fixing this is to allow AndroidBridge::ProvideEGLSurface to return null if no surface is available. At first I did this unconditionally and that caused "We need a context on Android" startup crashes. My current approach is to make it conditional, so that we allow ProvideEGLSurface to block on startup when we're creating the LayerManagerOGL but not during compositor resume. Instead, if a surface is not available during resume, it just leaves the compositor in a paused state. This fix "makes sense" to me because if there is no valid surface when the compositor is resumed it is almost certainly because the resume is being driven from the gecko thread (as in the steps outlined above) and the surface has disappeared between the time the resume was issued by java and the time it was processed by the compositor. In this case it makes sense to put the compositor back into a paused state, and it will get resumed again once the surface is back.

Try push pending at https://tbpl.mozilla.org/?tree=Try&rev=0c3e80f9d2f4
Comment on attachment 697477 [details] [diff] [review]
Allow GL surface renewal to fail gracefully rather than block indefinitely

Ok, the try push is looking good. Requesting review.
Attachment #697477 - Flags: review?(snorp)
Attachment #697477 - Flags: review?(bgirard)
gbrown pointed out that my try push above still has most of the robocop tests disabled. Pushed another one to run those: https://tbpl.mozilla.org/?tree=Try&rev=3d634943cb9e
The patch seems to cause s and tpn talos failures. I'm trying to reproduce locally so I can debug it. woo whack-a-bug!
Comment on attachment 697477 [details] [diff] [review]
Allow GL surface renewal to fail gracefully rather than block indefinitely

Review of attachment 697477 [details] [diff] [review]:
-----------------------------------------------------------------

r+ with the following NIT addresses

::: gfx/layers/ipc/CompositorParent.cpp
@@ +349,2 @@
>  #ifdef MOZ_WIDGET_ANDROID
> +  if (static_cast<LayerManagerOGL*>(mLayerManager.get())->gl()->RenewSurface()) {

NIT: This is a bit unclear. I think this style is much simpler.

bool surfaceRewned = static_cast<LayerManagerOGL*>(mLayerManager.get())->gl()->RenewSurface();
if (!surfaceRewned) {
  // We can't get a surface. This could be because the activity changed between the time the resume was schedule and now.
  lock.NotifyAll();
  return;
}

and keep mPaused = false & Composite() outside the #ifdef

With this change we run of risk of not honoring a resume and appearing as we're frozen if there's no subsequent resume. This may be a good place to log a warning since it's rare, quite severe and annoying to debug?
Attachment #697477 - Flags: review?(bgirard) → review+
When I run the tsvg and tp4m-nochrome talos tests locally I see fennec crash because the pageloader extension used in the tests creates its own top-level window and so once again we have two LayerManagerOGL instances being created and two compositors being registered, which is not allowed.
I confirmed that the pageloader extension problem is only cropping up because of the GetLayerManager call in the "failure patch" from bug 785597. Without that patch the tests pass fine: https://tbpl.mozilla.org/?tree=Try&rev=1a765b6367f1

That means there's nothing particularly blocking this patch from landing, and I am investigating a better fix to the issue of creating two LayerManagerOGL instances in bug 785597.
After reading brad's patch in bug 826607 I realized that the nsWindow::sCompositorPaused flag was still getting updated even if the resume failed, so I've updated my patch to fix that. It also addresses BenWa's previous review comments.

The changes I made will bitrot Brad's patch on bug 826607 but is not actually incompatible with that so it shouldn't block that from landing.
Attachment #697477 - Attachment is obsolete: true
Attachment #697477 - Flags: review?(snorp)
Attachment #698751 - Flags: review?(snorp)
Attachment #698751 - Flags: review?(bgirard)
Blocks: 827844
Comment on attachment 698751 [details] [diff] [review]
Allow GL surface renewal to fail gracefully rather than block indefinitely (v2)

Review of attachment 698751 [details] [diff] [review]:
-----------------------------------------------------------------

nice
Attachment #698751 - Flags: review?(snorp) → review+
Comment on attachment 698751 [details] [diff] [review]
Allow GL surface renewal to fail gracefully rather than block indefinitely (v2)

Review of attachment 698751 [details] [diff] [review]:
-----------------------------------------------------------------

::: gfx/layers/ipc/CompositorParent.cpp
@@ +349,5 @@
>  #ifdef MOZ_WIDGET_ANDROID
> +  if (!static_cast<LayerManagerOGL*>(mLayerManager.get())->gl()->RenewSurface()) {
> +    // We can't get a surface. This could be because the activity changed between
> +    // the time resume was scheduled and now.
> +    __android_log_print(ANDROID_LOG_INFO, "CompositorParent", "Unable to renew compositor surface; remaining in paused state");

Should this be a higher logging level such as error? I'd imagine that unless proven otherwise this is a serious error that can leave the app frozen.
Attachment #698751 - Flags: review?(bgirard) → review+
(In reply to Benoit Girard (:BenWa) from comment #10)
> Should this be a higher logging level such as error? I'd imagine that unless
> proven otherwise this is a serious error that can leave the app frozen.

This condition does happen reasonably regularly in testing and in the wild, if you open the awesomescreen right on fennec start and you manage to do it at the right time in the middle of the surface initialization. I don't think it warrants an error message now that we're handling this case more gracefully (i.e. it shouldn't leave the app frozen).
https://hg.mozilla.org/mozilla-central/rev/44a1fb250a30
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → Firefox 21
You need to log in before you can comment on or make changes to this bug.