Closed Bug 752368 Opened 8 years ago Closed 8 years ago

Random black screens when opening links from other applications

Categories

(Firefox for Android :: General, defect, major)

15 Branch
ARM
Android
defect
Not set
major

Tracking

()

RESOLVED FIXED
Firefox 15
Tracking Status
firefox14 --- fixed
blocking-fennec1.0 --- +

People

(Reporter: mcote, Assigned: vlad)

References

Details

Crash Data

Attachments

(2 files, 5 obsolete files)

At least once a day, on my Motorola Atrix, when I open a link from another Android app (often Google Reader), Nightly will show a black content screen. The chrome is still there, and it looks like it is loading the page (the loading symbol spins for a while, then stops, as usual when loading a page), but the content part of the screen stays completely black. I've seen this for some time but haven't been able to figure out what causes it exactly. I've never seen this problem on my Asus Eee Transformer tablet, but I use it a lot less than my phone.

I'll attach a logcat the next time it happens.
Crop of these bugs coming up; other similar ones bug 752153, bug 751732, bug 751589.

These need to be sorted out.

Mark, can you try this:

Take a debug Android build and run it (via Android ADB) with MOZ_GL_DEBUG=1 and MOZ_GL_DEBUG_ABORT_ON_ERROR=1, e.g.

  adb shell am start -a android.activity.MAIN -n org.mozilla.fennec_<user>/.App --es env0 MOZ_GL_DEBUG=1 --es env1 MOZ_GL_DEBUG_ABORT_ON_ERROR=1
blocking-fennec1.0: --- → ?
Severity: normal → major
I can reproduce this on today's Aurora nightly (20120507) on my Atrix.  It took about 5-6 link opens from another app, but eventually the link opened up with the content area black and never came back.  App is still responsive; I can get a tab list, etc., but any content areas are all black.
Okay I'm running the debug build with those options. I assume you want a logcat the next time this happens?
Hm I've opened about 15-20 links so far, and no black screen. It looks like fennec might have crashed once, although it came back very quickly (is that level of crash recovery a new addition? it was pretty slick!).

I'll keep trying.
Attached file logcat after black-screen crash (obsolete) —
Oh right, the crash must be happening because of the MOZ_GL_DEBUG_ABORT_ON_ERROR setting (I saw the content screen briefly go black before it crashed). So disregard my last comment. :)

I saw the crash again. Attached is a logcat.
Vlad - Can you renom for beta-blocking if this is more than just the Atrix?
Assignee: nobody → vladimir
blocking-fennec1.0: ? → +
Attached file unzipped log
Attachment #621683 - Attachment is obsolete: true
Debug build log:

05-07 16:28:22.756 20250 20263 I GeckoApp: Got message: DOMContentLoaded
05-07 16:28:23.166 20250 20250 I GeckoTabs: Removed a tab with id: 3

** hitting back button

05-07 16:28:23.166  1645  2255 I ActivityManager: moveTaskToBack: 63
05-07 16:28:23.176 20250 20250 D PhoneWindow: couldn't save which view has focus because the focused view org.mozilla.gecko.gfx.LayerView@40547248 has no id.
05-07 16:28:23.176 20250 20250 I GeckoApp: application paused

** bunch of Surface is invalid errors

05-07 16:28:23.576 20250 20296 E Surface : surface (identity=612) is invalid, err=-19 (No such device)
...
05-07 16:28:23.666 20250 20250 I GeckoApp: stop
...
05-07 16:28:24.076 20250 20296 E Surface : surface (identity=612) is invalid, err=-19 (No such device)

* I click on link in twitter; intent gets thrown over to Gecko

05-07 16:28:24.386  1645  2223 I ActivityManager: Starting: Intent { act=android.intent.action.VIEW dat=http://t.co/ogZwxyNi flg=0x10000000 cmp=android/com.android.internal.app.ResolverActivity } from pid 19947
05-07 16:28:24.456 20250 20296 E Surface : surface (identity=612) is invalid, err=-19 (No such device)
05-07 16:28:24.656  1645  1995 I ActivityManager: Displayed android/com.android.internal.app.ResolverActivity: +263ms
05-07 16:28:25.316  1645  1675 I ActivityManager: Starting: Intent { act=android.intent.action.VIEW dat=http://t.co/ogZwxyNi flg=0x13800000 cmp=org.mozilla.fennec_vladimir/.App } from pid 1645
05-07 16:28:25.336 20250 20250 W GeckoApp: zerdatime 42188476 - onNewIntent
05-07 16:28:25.336 20250 20250 I GeckoApp: onNewIntent: http://t.co/ogZwxyNi

* gecko gets it

05-07 16:28:25.346 20250 20250 I GeckoApp: restart

* and restarts/starts/resumes

05-07 16:28:25.346 20250 20250 W GeckoApp: zerdatime 42188480 - onStart
05-07 16:28:25.346 20250 20250 I GeckoApp: start
05-07 16:28:25.346 20250 20250 I GeckoApp: resume
05-07 16:28:25.346 20250 20250 I GeckoApp: application resumed
05-07 16:28:25.376 20250 20263 I GeckoTabs: Got message: Tab:Added
05-07 16:28:25.376 20250 20263 I GeckoTabs: Received message from Gecko: 42188517 - Tab:Added
05-07 16:28:25.376 20250 20263 I GeckoTabs: Added a tab with id: 4
05-07 16:28:25.396 20250 20263 I GeckoApp: Got message: Content:LocationChange
05-07 16:28:25.396 20250 20263 I GeckoApp: URI - about:blank
...

no Surface errors or GL errors (with MOZ_GL_DEBUG_ABORT_ON_ERROR) after this point, but the content area just remains black.
The above sequence appears in the attached logs as well (the attached log has multiple runs of Gecko; there are two instances of getting into surface is invalid.
Now with more logging!  Here's the relevant sequence:

05-07 19:38:29.818  1644  2265 I ActivityManager: moveTaskToBack: 20
05-07 19:38:29.828 12540 12540 I GeckoTabs: Removed a tab with id: 3
05-07 19:38:29.828 12540 12540 I GeckoApp: application paused
05-07 19:38:29.868 12540 12540 I GeckoToolbar: zerdatime 1919232 - Throbber stop
05-07 19:38:29.868 12540 12540 I GeckoDoorHangerPopup: Showing all doorhangers for tab: 1
05-07 19:38:29.868 12540 12572 I Gecko   : Composite (not paused)
05-07 19:38:29.868 12540 12540 I GeckoDoorHangerPopup: Showing all doorhangers for tab: 1
05-07 19:38:29.968 12540 12572 I Gecko   : Composite (not paused)
05-07 19:38:29.978 12540 12540 W IInputConnectionWrapper: showStatusIcon on inactive InputConnection
05-07 19:38:29.978 12540 12572 I Gecko   : Composite (not paused)
05-07 19:38:29.988 12540 12572 I Gecko   : Composite (not paused)
05-07 19:38:30.078 12540 12572 I Gecko   : Composite (not paused)
05-07 19:38:30.406 12540 12540 I GeckoApp: stop

** this is where things go wrong.

05-07 19:38:30.586 12540 12572 I Gecko   : Composite (not paused)
05-07 19:38:30.586 12540 12572 E Surface : surface (identity=119) is invalid, err=-19 (No such device)
05-07 19:38:30.586 12540 12572 I Gecko   : PauseComposition
05-07 19:38:30.586 12540 12572 E Surface : surface (identity=119) is invalid, err=-19 (No such device)
05-07 19:38:30.606 12540 12572 I Gecko   : Composite (paused; skipped)


Basically, we're getting an onStop, but (likely) we still have either an outstanding Composite in the queue or one just happened to have gotten issued at the time.  We try to process it before we get to the PauseComposition, and things blow up.  We never recover from this.

This doesn't seem to be Atrix specific; just for whatever reason it happens to be easier to hit it on that device.
Basically, as soon as we get an onStop, we must not try to composite.  This will be tricky, as they're on different threads and the Composite could already be queued as we're processing the onStop.  This is a bit of a problem.
Btw, just wanted to mention that I'm more than happy to try out fixes. Doesn't take me too long to repro this problem normally.
Attached patch prototype fix (obsolete) — Splinter Review
Ok, this seems to fix things for me.  It's only a prototype, since I'm kind of hijacking SchedulePauseComposition, but it might just be ok (minus the debug printfs).

The basic idea is that pausing composition needs to be synchronous, and needs to be done whenever we get a pause or a stop.  I can't reproduce either of the black screen issues (either from new apps or via awesomebar) with this patch applied.  (Also, now that I look at it, I should probably just use MutexAutoLock too.)
Attachment #621812 - Flags: feedback?(bgirard)
Attachment #621812 - Flags: feedback?(ajuma)
Comment on attachment 621812 [details] [diff] [review]
prototype fix

I like this approach -- it's too bad that we have to add more synchronization, but it seems unavoidable here.
Attachment #621812 - Flags: feedback?(ajuma) → feedback+
Attached patch cleaned up fix (obsolete) — Splinter Review
Ok, cleaned up version of the prototype; using mozilla::Monitor this time, because derp.  Debugging crap removed.  If we're ok with having SchedulePauseOnCompositorThread block on android until it's complete, then this should be good to go.
Attachment #621812 - Attachment is obsolete: true
Attachment #621812 - Flags: feedback?(bgirard)
Attachment #621826 - Flags: review?(bgirard)
Attachment #621826 - Flags: review?(ajuma)
Comment on attachment 621826 [details] [diff] [review]
cleaned up fix

Review of attachment 621826 [details] [diff] [review]:
-----------------------------------------------------------------

Looking at the cleaned up patch reminded me of one more thing we'll need to handle: nsWindow's sCompositorPaused needs to updated (e.g. by sending a COMPOSITOR_PAUSE event to Gecko) before the compositor actually pauses, or else Gecko might try to send a layer tree update after the compositor is paused. This could be a problem (since, effectively, we would need the Java thread to block first on Gecko, and then on the compositor), but might just work.

Alternatively, we need the compositor to be able to handle getting a layer tree update while paused. The problem with getting a layer tree update while paused is that we can't do texture uploads. BenWa's work on making layer tree updates asynch helps here, in that it would allow us to receive the layer tree update while paused but postpone the texture upload till later, but we don't yet have asynch updates for all layer types. Thoughts on this, BenWa?
Any reason to not just set sCompositorPaused to true in nsWindow::SchedulePauseCompositor (either before or after the call to sCompositorParent->SchedulePauseOnCompositorThread())?  We know it'll be paused after that returns because the pause will happen synchronously, right?
(In reply to Vladimir Vukicevic (:vlad) from comment #18)
> Any reason to not just set sCompositorPaused to true in
> nsWindow::SchedulePauseCompositor (either before or after the call to
> sCompositorParent->SchedulePauseOnCompositorThread())?  We know it'll be
> paused after that returns because the pause will happen synchronously, right?

The problem is that SchedulePauseCompositor is getting called on the Java thread. So what could happen is that Gecko checks sCompositorPaused, then begins to paint, then SchedulePauseCompositor sets sCompositorPaused to true and pauses the compositor, then Gecko begins a layer tree update. Even if we could split the paint and the layer tree update and have Gecko check sCompositorPaused right before the layer tree update (rather than right before the paint), we could have that Gecko checks the flag, then SchedulePauseCompositor sets the flag and pauses the compositor, and then Gecko does a layer tree update.

So we need to keep sCompositorPaused locked between the time Gecko checks it and the time it actually sends the layer tree update.
Keywords: qawanted
Attached patch fix, v2 (obsolete) — Splinter Review
Different, simpler approach.  I realized that the surfaceDestroyed callback is still valid, and that's where we need to pause composition.  The two changes that are needed are to make pausing composition synchronous on the compositor side (as in the previous patch), and new to this patch, to dispatch the pause request to the gecko thread synchronously.  This ensures that we won't be in the middle of painting or anything on the gecko thread, since we'll effectively be synchronizing all three threads.  Works fine and fixes the problem on my Atrix, works fine on Galaxy Nexus.
Attachment #621826 - Attachment is obsolete: true
Attachment #621826 - Flags: review?(bgirard)
Attachment #621826 - Flags: review?(ajuma)
Attachment #622041 - Flags: review?(bgirard)
Attachment #622041 - Flags: review?(ajuma)
Comment on attachment 622041 [details] [diff] [review]
fix, v2

Review of attachment 622041 [details] [diff] [review]:
-----------------------------------------------------------------

::: gfx/layers/ipc/CompositorParent.cpp
@@ +137,5 @@
>    CompositorLoop()->PostTask(FROM_HERE, renderTask);
>  }
>  
> +#ifdef MOZ_WIDGET_ANDROID
> +static mozilla::Monitor sPauseCompositionMonitor("PauseCompositionMonitor");

This will trigger leak detection since static desconstructor run after leak checks. :(

@@ +184,4 @@
>  void
>  CompositorParent::SchedulePauseOnCompositorThread()
>  {
> +#ifdef MOZ_WIDGET_ANDROID

This function here is only intended to be used by Android as I recall. I don't like having it behave differently in case it gets called by another platform. We should either make it clear that this function is android only or make it behave the same (not be both async/sync).

@@ +193,5 @@
>    CompositorLoop()->PostTask(FROM_HERE, pauseTask);
> +
> +#ifdef MOZ_WIDGET_ANDROID
> +  // XXX we should use a timeout here
> +  lock.Wait();

Let's add a comment to make it clear what we're waiting for here such as:
// Wait for the compositor thread to notify us that it has paused.

::: mobile/android/base/gfx/GeckoLayerClient.java
@@ +457,3 @@
>      public void compositionPauseRequested() {
>          // We need to coordinate with Gecko when pausing composition, to ensure
>          // that Gecko never executes a draw event while the compositor is paused.

We should get this comment updated to be more specific. Specifically we should mention why we're sending a sync event to gecko (to make sure we're not longer drawing), and that this event will in turn sync with the compositor to pause it.
Attachment #622041 - Flags: review?(bgirard) → review-
Attached patch fix, v3 (obsolete) — Splinter Review
Updated with review comments.  Moved the Monitor to a class member.
Attachment #622041 - Attachment is obsolete: true
Attachment #622041 - Flags: review?(ajuma)
Attachment #622061 - Flags: review?(bgirard)
Just a heads-up in case you haven't seen it already: there is a somewhat-related issue that Chris Lord is looking at, see https://bugzilla.mozilla.org/show_bug.cgi?id=751690#c12
Comment on attachment 622061 [details] [diff] [review]
fix, v3

Review of attachment 622061 [details] [diff] [review]:
-----------------------------------------------------------------

::: mobile/android/base/gfx/GeckoLayerClient.java
@@ +462,5 @@
> +        // surfaceDestroyed notification), and to make sure that any in-flight
> +        // Gecko draw events have been processed.  When this returns, composition is
> +        // definitely paused -- it'll synchronize with the Gecko event loop, which
> +        // in turn will synchronize with the compositor thread.
> +        GeckoAppShell.sendEventToGeckoSync(GeckoEvent.createCompositorPauseEvent());

When Gecko receives this event, it sends sCompositorChild->SendPause(), which synchronously pauses the compositor. So once we use sendEventToGeckoSync here, it seems we have all we need. 

Do we still need the changes to SchedulePauseOnCompositorThread? (Is this actually getting called at all with this version of the patch?)
is sCompositorChild->SendPause() synchronous?  If so, then we don't, but it seems like we should keep the bits in SchedulePauseOnCompositorThread in case someone needs to call that directly from Java like we used to so that they have the same semantics.
(In reply to Vladimir Vukicevic (:vlad) from comment #25)
> is sCompositorChild->SendPause() synchronous? 

Yes.

> If so, then we don't, but it
> seems like we should keep the bits in SchedulePauseOnCompositorThread in
> case someone needs to call that directly from Java like we used to so that
> they have the same semantics.

That sounds like a good idea.
Attachment #622061 - Flags: review?(bgirard)
Attachment #622061 - Flags: review?(ajuma)
Attachment #622061 - Flags: review+
Attachment #622061 - Flags: review?(ajuma) → review+
Pushed to inbound as 3502d9840d98
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Backed out in https://hg.mozilla.org/integration/mozilla-inbound/rev/b5304fd23df9 - sent robocheck, robocheck2 and robopan into permanent failure (a state only barely distinguishable from their normal state, but still distinguishable).
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
I see the following egl errors in the log for the test runs, probably related to the failure in some way:

05-09 20:50:34.216 E/libEGL  ( 1020): eglSetSwapRectangleANDROID:1861 error 3008 (EGL_BAD_DISPLAY)

05-09 20:50:34.476 E/SurfaceFlinger( 1020): eglSwapBuffers: EGL error 0x3008 (EGL_BAD_DISPLAY)
(In reply to Kartikaya Gupta (:kats) from comment #29)
> I see the following egl errors in the log for the test runs, probably
> related to the failure in some way:
> 
> 05-09 20:50:34.216 E/libEGL  ( 1020): eglSetSwapRectangleANDROID:1861 error
> 3008 (EGL_BAD_DISPLAY)
> 
> 05-09 20:50:34.476 E/SurfaceFlinger( 1020): eglSwapBuffers: EGL error 0x3008
> (EGL_BAD_DISPLAY)

If these errors aren't directly related, I wonder if the "error" text in the log is throwing off the log parser.
Shouldn't be related -- they're coming from what looks like android startup stuff?  I don't actually see fennec being launched in those logs at all.
I see:

05-09 20:39:17.205 W/System.err( 1451): java.io.FileNotFoundException: /mnt/sdcard/tests/fennec-15.0a1.en-US.android-arm.apk (No such file or directory)

05-09 20:40:38.765 W/System.err( 1451): java.io.FileNotFoundException: /mnt/sdcard/tests/robocop.apk (No such file or directory)

after the respective pushes and install attempts... but there are so many errors and other stuff in that log that look to be just from the test that I can't really understand what's a valid error or not; but I don't see anything that looks like fennec even got installed, much less launched.
(In reply to Vladimir Vukicevic (:vlad) from comment #32)
> 05-09 20:39:17.205 W/System.err( 1451): java.io.FileNotFoundException:
> /mnt/sdcard/tests/fennec-15.0a1.en-US.android-arm.apk (No such file or
> directory)
> 
> 05-09 20:40:38.765 W/System.err( 1451): java.io.FileNotFoundException:
> /mnt/sdcard/tests/robocop.apk (No such file or directory)

These errors are normal. At some point before the file is transfered the host requests a hash of the file or something which results in this exception. It's annoying as hell but I believe benign.

(In reply to Vladimir Vukicevic (:vlad) from comment #31)
> Shouldn't be related -- they're coming from what looks like android startup
> stuff?  I don't actually see fennec being launched in those logs at all.

Usually I never actually see fennec being launched in the logs. The logs generated are pretty useless if you try to just interpret them directly. The best way is to compare a working-test log to a non-working-test log and see what's different.
Working through bug 749426, which is what I think is causing the test failures here.
Status: REOPENED → NEW
Testing this patch along with bug 749426 caused the same robo* test failures on the try server, but I was not able to reproduce the failures locally on either a galaxy nexus or on a local tegra board.  I'm a bit stumped as to how to proceed; one thing that could be useful is the logcat output from the tinderbox tegra device while the test is running.
(In reply to Vladimir Vukicevic (:vlad) from comment #35)
> one thing that could be useful is the logcat output from the
> tinderbox tegra device while the test is running.

I have encountered other situations in which this would have been useful as well. It would be nice if the test harness could be modified to provide this always.
(In reply to Vladimir Vukicevic (:vlad) from comment #35)
> I'm a bit stumped as to
> how to proceed; one thing that could be useful is the logcat output from the
> tinderbox tegra device while the test is running.

Joel, any suggestions? I thought we had a solution for this.
adb/tcp is the most unreliable form of communication and that is the only way we can harvest the entire set of logcat data.  In addition, there would be no way to store this.  The volume is so high on the tegras that it would flood the logfile we produce.  

I am open to suggestions, but the few attempts we have tried have yielded nothing useful (i.e. it collects the wrong stuff, or not enough).
one thing that could cut down on the volume is grep'ing the logcat for the process id of fennec
we find that we miss important things like crash dumps in logcat (which are different than when we crash fennec and produce a crash dump).  

I do know there are ways with logcat to filter the output, we might be able filter based on the process and minimize our pipe of information.

The other factor is that this would be collected after the fact, not interleaved with test results.
(In reply to Joel Maher (:jmaher) from comment #40)
> we find that we miss important things like crash dumps in logcat (which are
> different than when we crash fennec and produce a crash dump).  
Is that filtering on "Gecko" instead of the pid?
Can we run something like alogcat on the device and have it collect the full log, and then upload it somewhere using a reliable channel?
I like that idea.  If we could essentially do a log flush before we start the test, record the log, then pull it up, that would work out great.  At that point in our test harness script we could do any parsing and filtering necessary.
Just wanted to mention that, this morning, I got the black content on my Asus Eee Transformer as well (latest nightly).
I filed bug 754873 for capturing logcat output.

I tried submitting this patch with nothing but the sendEventToGecko -> sendEventToGeckoSync change (which technically should be all that's required); same problem.  I can't reproduce it locally, though I'm about to try something different here shortly.
This patch fixes topcrasher bug 743938.
Blocks: 743938
Duplicate of this bug: 743938
Oops, meant to post that in the other bug.
No longer blocks: 743938
I can finally reproduce the deadlock!  I needed to run the exact same test run, using SUT and Talos to do it; doing anything as a mochitest etc. didn't work.  What's happening is this:

The Android thread posted a native compositing paused event and is waiting for gecko to process it:

java.util.concurrent.CountDownLatch	await	CountDownLatch.java:207
org.mozilla.gecko.GeckoAppShell	geckoEventSync	GeckoAppShell.java:618
org.mozilla.gecko.GeckoAppShell	sendEventToGeckoSync	GeckoAppShell.java:538
org.mozilla.gecko.gfx.GeckoLayerClient	compositionPauseRequested	GeckoLayerClient.java:475
org.mozilla.gecko.gfx.LayerView	surfaceDestroyed	LayerView.java:269
android.view.SurfaceView	reportSurfaceDestroyed	SurfaceView.java:568

The Gecko thread, on the other hand, is here:

org.mozilla.gecko.gfx.LayerView	requestRender	LayerView.java:182
org.mozilla.gecko.gfx.LayerController	setCheckerboardColor LayerController.java:347
org.mozilla.gecko.GeckoApp	handleMessage	GeckoApp.java:861
org.mozilla.gecko.GeckoAppShell	handleGeckoMessage	GeckoAppShell.java:1865
org.mozilla.gecko.GeckoAppShell	nativeRun	GeckoAppShell.java:-2
org.mozilla.gecko.GeckoAppShell	nativeRun	GeckoAppShell.java:-2
org.mozilla.gecko.GeckoAppShell	runGecko	GeckoAppShell.java:505
org.mozilla.gecko.GeckoThread	run	GeckoThread.java:114

So, Gecko is tryng to call LayerView.requestRender() while we're already in the middle of LayerView.surfaceDestroyed().  Both those methods are synchronized.  Thus, we deadlock.
odd why we get that only in talos vs mochitest version of this.  Could it be something in the profile (pref or extension), or just random chance.
Yeah, I'm confused by that as well; I don't see anything in prefs.js in the profile, at least.
Crash Signature: [@ WSEGL_GetDrawableParameters]
Attached patch fix, v4Splinter Review
Updated fix that fixes the deadlock.  (Also gives the Gecko thread a name, for easier finding in ddms.)

Removing synchronized from requestRepaint seems safe, since it just flows into posting an event to Gecko.. which is thread safe by design anyway.
Attachment #624782 - Flags: review?(ajuma)
Comment on attachment 624782 [details] [diff] [review]
fix, v4

Looks good to me. I think kats should take a look at the Java bits too.
Attachment #624782 - Flags: review?(bugmail.mozilla)
Attachment #624782 - Flags: review?(ajuma)
Attachment #624782 - Flags: review+
Comment on attachment 624782 [details] [diff] [review]
fix, v4

Review of attachment 624782 [details] [diff] [review]:
-----------------------------------------------------------------

Looks like most of the synchronization in LayerView is just an artifact from long ago, we should probably revisit that at some point. But yeah, this change looks fine.

::: mobile/android/base/gfx/LayerView.java
@@ +200,1 @@
>      public long getRenderTime() {

This function is never used; might as well take it out.
Attachment #624782 - Flags: review?(bugmail.mozilla) → review+
On inbound, with getRenderTime and bits nuked.  Fingers crossed that it sticks!

  https://tbpl.mozilla.org/?tree=Mozilla-Inbound&rev=4a6bdd1308c3
https://hg.mozilla.org/mozilla-central/rev/4a6bdd1308c3
Status: NEW → RESOLVED
Closed: 8 years ago8 years ago
Flags: in-testsuite-
Resolution: --- → FIXED
Target Milestone: --- → Firefox 15
Comment on attachment 624782 [details] [diff] [review]
fix, v4

[Approval Request Comment]
User impact if declined: crashes, freezes, all sorts of badness
Testing completed (on m-c, etc.): local testing on variety of devices; on m-c
Risk to taking this patch (and alternatives if risky): crashes, freezes, all sorts of badness
Attachment #624782 - Flags: approval-mozilla-aurora?
Er, ignore the risk to taking this patch line above, meant to delete that! :p
Attachment #624782 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Comment on attachment 622061 [details] [diff] [review]
fix, v3

># HG changeset patch
># Parent 448f554f6acbfe632da02e23526344f3defebe69
>
>diff --git a/gfx/layers/ipc/CompositorParent.cpp b/gfx/layers/ipc/CompositorParent.cpp
>--- a/gfx/layers/ipc/CompositorParent.cpp
>+++ b/gfx/layers/ipc/CompositorParent.cpp
>@@ -1,4 +1,4 @@
>-/* -*- Mode: C++; tab-width: 2; indent-tabs-mode: nil; c-basic-offset: 4 -*- */
>+/* -*- Mode: C++; tab-width: 2; indent-tabs-mode: nil; c-basic-offset: 2 -*- */
> /* vim: set sw=2 ts=2 et tw=80 : */
> /* ***** BEGIN LICENSE BLOCK *****
>  * Version: MPL 1.1/GPL 2.0/LGPL 2.1
>@@ -66,6 +66,7 @@ CompositorParent::CompositorParent(nsIWi
>   , mLayersUpdated(false)
>   , mCompositorLoop(aMsgLoop)
>   , mThreadID(aThreadID)
>+  , mPauseCompositionMonitor("PauseCompositionMonitor")
> {
>   MOZ_COUNT_CTOR(CompositorParent);
> }
>@@ -141,6 +142,9 @@ CompositorParent::PauseComposition()
> {
>   NS_ABORT_IF_FALSE(CompositorThreadID() == PlatformThread::CurrentId(),
>                     "PauseComposition() can only be called on the compositor thread");
>+
>+  mozilla::MonitorAutoLock lock(mPauseCompositionMonitor);
>+
>   if (!mPaused) {
>     mPaused = true;
> 
>@@ -148,6 +152,9 @@ CompositorParent::PauseComposition()
>     static_cast<LayerManagerOGL*>(mLayerManager.get())->gl()->ReleaseSurface();
> #endif
>   }
>+
>+  // if anyone's waiting to make sure that composition really got paused, tell them
>+  lock.NotifyAll();
> }
> 
> void
>@@ -169,12 +176,21 @@ CompositorParent::ResumeCompositionAndRe
>   ResumeComposition();
> }
> 
>+/*
>+ * This will execute a pause synchronously, waiting to make sure that the compositor
>+ * really is paused.
>+ */
> void
> CompositorParent::SchedulePauseOnCompositorThread()
> {
>+  mozilla::MonitorAutoLock lock(mPauseCompositionMonitor);
>+
>   CancelableTask *pauseTask = NewRunnableMethod(this,
>                                                 &CompositorParent::PauseComposition);
>   CompositorLoop()->PostTask(FROM_HERE, pauseTask);
>+
>+  // Wait until the pause has actually been processed by the compositor thread
>+  lock.Wait();
> }
> 
> void
>diff --git a/gfx/layers/ipc/CompositorParent.h b/gfx/layers/ipc/CompositorParent.h
>--- a/gfx/layers/ipc/CompositorParent.h
>+++ b/gfx/layers/ipc/CompositorParent.h
>@@ -1,4 +1,4 @@
>-/* -*- Mode: C++; tab-width: 8; indent-tabs-mode: nil; c-basic-offset: 4 -*- */
>+/* -*- Mode: C++; tab-width: 8; indent-tabs-mode: nil; c-basic-offset: 2 -*- */
> /* vim: set sw=4 ts=8 et tw=80 : */
> /* ***** BEGIN LICENSE BLOCK *****
>  * Version: MPL 1.1/GPL 2.0/LGPL 2.1
>@@ -52,6 +52,7 @@
> #include "mozilla/layers/PCompositorParent.h"
> #include "mozilla/layers/PLayersParent.h"
> #include "base/thread.h"
>+#include "mozilla/Monitor.h"
> #include "ShadowLayersManager.h"
> 
> class nsIWidget;
>@@ -165,6 +166,8 @@ private:
>   MessageLoop* mCompositorLoop;
>   PlatformThreadId mThreadID;
> 
>+  mozilla::Monitor mPauseCompositionMonitor;
>+
>   DISALLOW_EVIL_CONSTRUCTORS(CompositorParent);
> };
> 
>diff --git a/mobile/android/base/gfx/GeckoLayerClient.java b/mobile/android/base/gfx/GeckoLayerClient.java
>--- a/mobile/android/base/gfx/GeckoLayerClient.java
>+++ b/mobile/android/base/gfx/GeckoLayerClient.java
>@@ -457,7 +457,13 @@ public class GeckoLayerClient implements
>     public void compositionPauseRequested() {
>         // We need to coordinate with Gecko when pausing composition, to ensure
>         // that Gecko never executes a draw event while the compositor is paused.
>-        GeckoAppShell.sendEventToGecko(GeckoEvent.createCompositorPauseEvent());
>+        // This is sent synchronously to make sure that we don't attempt to use
>+        // any outstanding Surfaces after we call this (such as from a
>+        // surfaceDestroyed notification), and to make sure that any in-flight
>+        // Gecko draw events have been processed.  When this returns, composition is
>+        // definitely paused -- it'll synchronize with the Gecko event loop, which
>+        // in turn will synchronize with the compositor thread.
>+        GeckoAppShell.sendEventToGeckoSync(GeckoEvent.createCompositorPauseEvent());
>     }
> 
>     /** Implementation of LayerView.Listener */
Attachment #622061 - Attachment is obsolete: true
No longer depends on: 749426
Depends on: 749426
This should be fixed in Nightly, right? Just checking 'cause I experienced the problem again in today's Nightly...
Mark : Yes, it should be in the nightly.  Do you have steps to reproduce?
Duplicate of this bug: 744286
Duplicate of this bug: 737128
Duplicate of this bug: 749580
Duplicate of this bug: 749578
Naoki: Unfortunately it just happened once as per the normal steps--opened a link from an other application. Had to stop the process and relaunch to get the content screen back.
If you see it again, can you grab the logcat output?  It should let us determine if the root cause is the same as what was fixed here, or something different.
Duplicate of this bug: 753138
Duplicate of this bug: 756652
Duplicate of this bug: 740837
Still haven't seen it since. Looks like it is much, much reduced in frequency.
Assignee: vladimir → nobody
Sorry about that, it was by mistake.
Assignee: nobody → vladimir
You need to log in before you can comment on or make changes to this bug.