752368 - Random black screens when opening links from other applications

Reporter

Description

•

13 years ago

At least once a day, on my Motorola Atrix, when I open a link from another Android app (often Google Reader), Nightly will show a black content screen. The chrome is still there, and it looks like it is loading the page (the loading symbol spins for a while, then stops, as usual when loading a page), but the content part of the screen stays completely black. I've seen this for some time but haven't been able to figure out what causes it exactly. I've never seen this problem on my Asus Eee Transformer tablet, but I use it a lot less than my phone. I'll attach a logcat the next time it happens.

Aaron Train [:aaronmt]

Comment 1

•

13 years ago

Crop of these bugs coming up; other similar ones bug 752153, bug 751732, bug 751589. These need to be sorted out. Mark, can you try this: Take a debug Android build and run it (via Android ADB) with MOZ_GL_DEBUG=1 and MOZ_GL_DEBUG_ABORT_ON_ERROR=1, e.g. adb shell am start -a android.activity.MAIN -n org.mozilla.fennec_<user>/.App --es env0 MOZ_GL_DEBUG=1 --es env1 MOZ_GL_DEBUG_ABORT_ON_ERROR=1

blocking-fennec1.0: --- → ?

Aaron Train [:aaronmt]

Updated

•

13 years ago

Severity: normal → major

Vladimir Vukicevic [:vlad] [:vladv] (needinfo me, slow to respond)

Assignee

Comment 2

•

13 years ago

I can reproduce this on today's Aurora nightly (20120507) on my Atrix. It took about 5-6 link opens from another app, but eventually the link opened up with the content area black and never came back. App is still responsive; I can get a tab list, etc., but any content areas are all black.

Mark Côté [:mcote]

Reporter

Comment 3

•

13 years ago

Okay I'm running the debug build with those options. I assume you want a logcat the next time this happens?

Mark Côté [:mcote]

Reporter

Comment 4

•

13 years ago

Hm I've opened about 15-20 links so far, and no black screen. It looks like fennec might have crashed once, although it came back very quickly (is that level of crash recovery a new addition? it was pretty slick!). I'll keep trying.

Mark Côté [:mcote]

Reporter

Comment 5

•

13 years ago

Attached file logcat after black-screen crash (obsolete) — Details

Oh right, the crash must be happening because of the MOZ_GL_DEBUG_ABORT_ON_ERROR setting (I saw the content screen briefly go black before it crashed). So disregard my last comment. :) I saw the crash again. Attached is a logcat.

Mark Finkle (:mfinkle) (use needinfo?)

Comment 6

•

13 years ago

Vlad - Can you renom for beta-blocking if this is more than just the Atrix?

Assignee: nobody → vladimir

blocking-fennec1.0: ? → +

Mark Finkle (:mfinkle) (use needinfo?)

Updated

•

13 years ago

Keywords: qawanted

Brad Lassey [:blassey] (use needinfo?)

Comment 7

•

13 years ago

Attached file unzipped log — Details

Attachment #621683 - Attachment is obsolete: true

Vladimir Vukicevic [:vlad] [:vladv] (needinfo me, slow to respond)

Assignee

Comment 8

•

13 years ago

Debug build log: 05-07 16:28:22.756 20250 20263 I GeckoApp: Got message: DOMContentLoaded 05-07 16:28:23.166 20250 20250 I GeckoTabs: Removed a tab with id: 3 ** hitting back button 05-07 16:28:23.166 1645 2255 I ActivityManager: moveTaskToBack: 63 05-07 16:28:23.176 20250 20250 D PhoneWindow: couldn't save which view has focus because the focused view org.mozilla.gecko.gfx.LayerView@40547248 has no id. 05-07 16:28:23.176 20250 20250 I GeckoApp: application paused ** bunch of Surface is invalid errors 05-07 16:28:23.576 20250 20296 E Surface : surface (identity=612) is invalid, err=-19 (No such device) ... 05-07 16:28:23.666 20250 20250 I GeckoApp: stop ... 05-07 16:28:24.076 20250 20296 E Surface : surface (identity=612) is invalid, err=-19 (No such device) * I click on link in twitter; intent gets thrown over to Gecko 05-07 16:28:24.386 1645 2223 I ActivityManager: Starting: Intent { act=android.intent.action.VIEW dat=http://t.co/ogZwxyNi flg=0x10000000 cmp=android/com.android.internal.app.ResolverActivity } from pid 19947 05-07 16:28:24.456 20250 20296 E Surface : surface (identity=612) is invalid, err=-19 (No such device) 05-07 16:28:24.656 1645 1995 I ActivityManager: Displayed android/com.android.internal.app.ResolverActivity: +263ms 05-07 16:28:25.316 1645 1675 I ActivityManager: Starting: Intent { act=android.intent.action.VIEW dat=http://t.co/ogZwxyNi flg=0x13800000 cmp=org.mozilla.fennec_vladimir/.App } from pid 1645 05-07 16:28:25.336 20250 20250 W GeckoApp: zerdatime 42188476 - onNewIntent 05-07 16:28:25.336 20250 20250 I GeckoApp: onNewIntent: http://t.co/ogZwxyNi * gecko gets it 05-07 16:28:25.346 20250 20250 I GeckoApp: restart * and restarts/starts/resumes 05-07 16:28:25.346 20250 20250 W GeckoApp: zerdatime 42188480 - onStart 05-07 16:28:25.346 20250 20250 I GeckoApp: start 05-07 16:28:25.346 20250 20250 I GeckoApp: resume 05-07 16:28:25.346 20250 20250 I GeckoApp: application resumed 05-07 16:28:25.376 20250 20263 I GeckoTabs: Got message: Tab:Added 05-07 16:28:25.376 20250 20263 I GeckoTabs: Received message from Gecko: 42188517 - Tab:Added 05-07 16:28:25.376 20250 20263 I GeckoTabs: Added a tab with id: 4 05-07 16:28:25.396 20250 20263 I GeckoApp: Got message: Content:LocationChange 05-07 16:28:25.396 20250 20263 I GeckoApp: URI - about:blank ... no Surface errors or GL errors (with MOZ_GL_DEBUG_ABORT_ON_ERROR) after this point, but the content area just remains black.

Vladimir Vukicevic [:vlad] [:vladv] (needinfo me, slow to respond)

Assignee

Comment 9

•

13 years ago

The above sequence appears in the attached logs as well (the attached log has multiple runs of Gecko; there are two instances of getting into surface is invalid.

Vladimir Vukicevic [:vlad] [:vladv] (needinfo me, slow to respond)

Assignee

Comment 10

•

13 years ago

Now with more logging! Here's the relevant sequence: 05-07 19:38:29.818 1644 2265 I ActivityManager: moveTaskToBack: 20 05-07 19:38:29.828 12540 12540 I GeckoTabs: Removed a tab with id: 3 05-07 19:38:29.828 12540 12540 I GeckoApp: application paused 05-07 19:38:29.868 12540 12540 I GeckoToolbar: zerdatime 1919232 - Throbber stop 05-07 19:38:29.868 12540 12540 I GeckoDoorHangerPopup: Showing all doorhangers for tab: 1 05-07 19:38:29.868 12540 12572 I Gecko : Composite (not paused) 05-07 19:38:29.868 12540 12540 I GeckoDoorHangerPopup: Showing all doorhangers for tab: 1 05-07 19:38:29.968 12540 12572 I Gecko : Composite (not paused) 05-07 19:38:29.978 12540 12540 W IInputConnectionWrapper: showStatusIcon on inactive InputConnection 05-07 19:38:29.978 12540 12572 I Gecko : Composite (not paused) 05-07 19:38:29.988 12540 12572 I Gecko : Composite (not paused) 05-07 19:38:30.078 12540 12572 I Gecko : Composite (not paused) 05-07 19:38:30.406 12540 12540 I GeckoApp: stop ** this is where things go wrong. 05-07 19:38:30.586 12540 12572 I Gecko : Composite (not paused) 05-07 19:38:30.586 12540 12572 E Surface : surface (identity=119) is invalid, err=-19 (No such device) 05-07 19:38:30.586 12540 12572 I Gecko : PauseComposition 05-07 19:38:30.586 12540 12572 E Surface : surface (identity=119) is invalid, err=-19 (No such device) 05-07 19:38:30.606 12540 12572 I Gecko : Composite (paused; skipped) Basically, we're getting an onStop, but (likely) we still have either an outstanding Composite in the queue or one just happened to have gotten issued at the time. We try to process it before we get to the PauseComposition, and things blow up. We never recover from this. This doesn't seem to be Atrix specific; just for whatever reason it happens to be easier to hit it on that device.

Vladimir Vukicevic [:vlad] [:vladv] (needinfo me, slow to respond)

Assignee

Comment 12

•

13 years ago

Basically, as soon as we get an onStop, we must not try to composite. This will be tricky, as they're on different threads and the Composite could already be queued as we're processing the onStop. This is a bit of a problem.

Mark Côté [:mcote]

Reporter

Comment 13

•

13 years ago

Btw, just wanted to mention that I'm more than happy to try out fixes. Doesn't take me too long to repro this problem normally.

Vladimir Vukicevic [:vlad] [:vladv] (needinfo me, slow to respond)

Assignee

Comment 14

•

13 years ago

Attached patch prototype fix (obsolete) — Details — Splinter Review

Ok, this seems to fix things for me. It's only a prototype, since I'm kind of hijacking SchedulePauseComposition, but it might just be ok (minus the debug printfs). The basic idea is that pausing composition needs to be synchronous, and needs to be done whenever we get a pause or a stop. I can't reproduce either of the black screen issues (either from new apps or via awesomebar) with this patch applied. (Also, now that I look at it, I should probably just use MutexAutoLock too.)

Attachment #621812 - Flags: feedback?(bgirard)

Attachment #621812 - Flags: feedback?(ajuma)

Ali Juma [:ajuma]

Comment 15

•

13 years ago

Comment on attachment 621812 [details] [diff] [review] prototype fix I like this approach -- it's too bad that we have to add more synchronization, but it seems unavoidable here.

Attachment #621812 - Flags: feedback?(ajuma) → feedback+

Vladimir Vukicevic [:vlad] [:vladv] (needinfo me, slow to respond)

Assignee

Comment 16

•

13 years ago

Attached patch cleaned up fix (obsolete) — Details — Splinter Review

Ok, cleaned up version of the prototype; using mozilla::Monitor this time, because derp. Debugging crap removed. If we're ok with having SchedulePauseOnCompositorThread block on android until it's complete, then this should be good to go.

Attachment #621812 - Attachment is obsolete: true

Attachment #621812 - Flags: feedback?(bgirard)

Attachment #621826 - Flags: review?(bgirard)

Attachment #621826 - Flags: review?(ajuma)

Ali Juma [:ajuma]

Comment 17

•

13 years ago

Comment on attachment 621826 [details] [diff] [review] cleaned up fix Review of attachment 621826 [details] [diff] [review]: ----------------------------------------------------------------- Looking at the cleaned up patch reminded me of one more thing we'll need to handle: nsWindow's sCompositorPaused needs to updated (e.g. by sending a COMPOSITOR_PAUSE event to Gecko) before the compositor actually pauses, or else Gecko might try to send a layer tree update after the compositor is paused. This could be a problem (since, effectively, we would need the Java thread to block first on Gecko, and then on the compositor), but might just work. Alternatively, we need the compositor to be able to handle getting a layer tree update while paused. The problem with getting a layer tree update while paused is that we can't do texture uploads. BenWa's work on making layer tree updates asynch helps here, in that it would allow us to receive the layer tree update while paused but postpone the texture upload till later, but we don't yet have asynch updates for all layer types. Thoughts on this, BenWa?

Vladimir Vukicevic [:vlad] [:vladv] (needinfo me, slow to respond)

Assignee

Comment 18

•

13 years ago

Any reason to not just set sCompositorPaused to true in nsWindow::SchedulePauseCompositor (either before or after the call to sCompositorParent->SchedulePauseOnCompositorThread())? We know it'll be paused after that returns because the pause will happen synchronously, right?

Ali Juma [:ajuma]

Comment 19

•

13 years ago

(In reply to Vladimir Vukicevic (:vlad) from comment #18) > Any reason to not just set sCompositorPaused to true in > nsWindow::SchedulePauseCompositor (either before or after the call to > sCompositorParent->SchedulePauseOnCompositorThread())? We know it'll be > paused after that returns because the pause will happen synchronously, right? The problem is that SchedulePauseCompositor is getting called on the Java thread. So what could happen is that Gecko checks sCompositorPaused, then begins to paint, then SchedulePauseCompositor sets sCompositorPaused to true and pauses the compositor, then Gecko begins a layer tree update. Even if we could split the paint and the layer tree update and have Gecko check sCompositorPaused right before the layer tree update (rather than right before the paint), we could have that Gecko checks the flag, then SchedulePauseCompositor sets the flag and pauses the compositor, and then Gecko does a layer tree update. So we need to keep sCompositorPaused locked between the time Gecko checks it and the time it actually sends the layer tree update.

Adrian Tamas (:AdrianT)

Updated

•

13 years ago

Keywords: qawanted

Vladimir Vukicevic [:vlad] [:vladv] (needinfo me, slow to respond)

Assignee

Comment 20

•

13 years ago

Attached patch fix, v2 (obsolete) — Details — Splinter Review

Different, simpler approach. I realized that the surfaceDestroyed callback is still valid, and that's where we need to pause composition. The two changes that are needed are to make pausing composition synchronous on the compositor side (as in the previous patch), and new to this patch, to dispatch the pause request to the gecko thread synchronously. This ensures that we won't be in the middle of painting or anything on the gecko thread, since we'll effectively be synchronizing all three threads. Works fine and fixes the problem on my Atrix, works fine on Galaxy Nexus.

Attachment #621826 - Attachment is obsolete: true

Attachment #621826 - Flags: review?(bgirard)

Attachment #621826 - Flags: review?(ajuma)

Attachment #622041 - Flags: review?(bgirard)

Attachment #622041 - Flags: review?(ajuma)

Benoit Girard (:BenWa)

Comment 21

•

13 years ago

Comment on attachment 622041 [details] [diff] [review] fix, v2 Review of attachment 622041 [details] [diff] [review]: ----------------------------------------------------------------- ::: gfx/layers/ipc/CompositorParent.cpp @@ +137,5 @@ > CompositorLoop()->PostTask(FROM_HERE, renderTask); > } > > +#ifdef MOZ_WIDGET_ANDROID > +static mozilla::Monitor sPauseCompositionMonitor("PauseCompositionMonitor"); This will trigger leak detection since static desconstructor run after leak checks. :( @@ +184,4 @@ > void > CompositorParent::SchedulePauseOnCompositorThread() > { > +#ifdef MOZ_WIDGET_ANDROID This function here is only intended to be used by Android as I recall. I don't like having it behave differently in case it gets called by another platform. We should either make it clear that this function is android only or make it behave the same (not be both async/sync). @@ +193,5 @@ > CompositorLoop()->PostTask(FROM_HERE, pauseTask); > + > +#ifdef MOZ_WIDGET_ANDROID > + // XXX we should use a timeout here > + lock.Wait(); Let's add a comment to make it clear what we're waiting for here such as: // Wait for the compositor thread to notify us that it has paused. ::: mobile/android/base/gfx/GeckoLayerClient.java @@ +457,3 @@ > public void compositionPauseRequested() { > // We need to coordinate with Gecko when pausing composition, to ensure > // that Gecko never executes a draw event while the compositor is paused. We should get this comment updated to be more specific. Specifically we should mention why we're sending a sync event to gecko (to make sure we're not longer drawing), and that this event will in turn sync with the compositor to pause it.

Attachment #622041 - Flags: review?(bgirard) → review-

Vladimir Vukicevic [:vlad] [:vladv] (needinfo me, slow to respond)

Assignee

Comment 22

•

13 years ago

Attached patch fix, v3 (obsolete) — Details — Splinter Review

Updated with review comments. Moved the Monitor to a class member.

Attachment #622041 - Attachment is obsolete: true

Attachment #622041 - Flags: review?(ajuma)

Attachment #622061 - Flags: review?(bgirard)

Kartikaya Gupta (email:kats@mozilla.staktrace.com)

Comment 23

•

13 years ago

Just a heads-up in case you haven't seen it already: there is a somewhat-related issue that Chris Lord is looking at, see https://bugzilla.mozilla.org/show_bug.cgi?id=751690#c12

Ali Juma [:ajuma]

Comment 24

•

13 years ago

Comment on attachment 622061 [details] [diff] [review] fix, v3 Review of attachment 622061 [details] [diff] [review]: ----------------------------------------------------------------- ::: mobile/android/base/gfx/GeckoLayerClient.java @@ +462,5 @@ > + // surfaceDestroyed notification), and to make sure that any in-flight > + // Gecko draw events have been processed. When this returns, composition is > + // definitely paused -- it'll synchronize with the Gecko event loop, which > + // in turn will synchronize with the compositor thread. > + GeckoAppShell.sendEventToGeckoSync(GeckoEvent.createCompositorPauseEvent()); When Gecko receives this event, it sends sCompositorChild->SendPause(), which synchronously pauses the compositor. So once we use sendEventToGeckoSync here, it seems we have all we need. Do we still need the changes to SchedulePauseOnCompositorThread? (Is this actually getting called at all with this version of the patch?)

Vladimir Vukicevic [:vlad] [:vladv] (needinfo me, slow to respond)

Assignee

Comment 25

•

13 years ago

is sCompositorChild->SendPause() synchronous? If so, then we don't, but it seems like we should keep the bits in SchedulePauseOnCompositorThread in case someone needs to call that directly from Java like we used to so that they have the same semantics.

Ali Juma [:ajuma]

Comment 26

•

13 years ago

(In reply to Vladimir Vukicevic (:vlad) from comment #25) > is sCompositorChild->SendPause() synchronous? Yes. > If so, then we don't, but it > seems like we should keep the bits in SchedulePauseOnCompositorThread in > case someone needs to call that directly from Java like we used to so that > they have the same semantics. That sounds like a good idea.

Benoit Girard (:BenWa)

Updated

•

13 years ago

Attachment #622061 - Flags: review?(bgirard)

Attachment #622061 - Flags: review?(ajuma)

Attachment #622061 - Flags: review+

Ali Juma [:ajuma]

Updated

•

13 years ago

Attachment #622061 - Flags: review?(ajuma) → review+

Vladimir Vukicevic [:vlad] [:vladv] (needinfo me, slow to respond)

Assignee

Comment 27

•

13 years ago

Pushed to inbound as 3502d9840d98

Status: NEW → RESOLVED

Closed: 13 years ago

Resolution: --- → FIXED

Phil Ringnalda (:philor)

Comment 28

•

13 years ago

Backed out in https://hg.mozilla.org/integration/mozilla-inbound/rev/b5304fd23df9 - sent robocheck, robocheck2 and robopan into permanent failure (a state only barely distinguishable from their normal state, but still distinguishable).

Status: RESOLVED → REOPENED

Resolution: FIXED → ---

Kartikaya Gupta (email:kats@mozilla.staktrace.com)

Comment 29

•

13 years ago

I see the following egl errors in the log for the test runs, probably related to the failure in some way: 05-09 20:50:34.216 E/libEGL ( 1020): eglSetSwapRectangleANDROID:1861 error 3008 (EGL_BAD_DISPLAY) 05-09 20:50:34.476 E/SurfaceFlinger( 1020): eglSwapBuffers: EGL error 0x3008 (EGL_BAD_DISPLAY)

Mark Finkle (:mfinkle) (use needinfo?)

Comment 30

•

13 years ago

(In reply to Kartikaya Gupta (:kats) from comment #29) > I see the following egl errors in the log for the test runs, probably > related to the failure in some way: > > 05-09 20:50:34.216 E/libEGL ( 1020): eglSetSwapRectangleANDROID:1861 error > 3008 (EGL_BAD_DISPLAY) > > 05-09 20:50:34.476 E/SurfaceFlinger( 1020): eglSwapBuffers: EGL error 0x3008 > (EGL_BAD_DISPLAY) If these errors aren't directly related, I wonder if the "error" text in the log is throwing off the log parser.

Vladimir Vukicevic [:vlad] [:vladv] (needinfo me, slow to respond)

Assignee

Comment 31

•

13 years ago

Shouldn't be related -- they're coming from what looks like android startup stuff? I don't actually see fennec being launched in those logs at all.

Vladimir Vukicevic [:vlad] [:vladv] (needinfo me, slow to respond)

Assignee

Comment 32

•

13 years ago

I see: 05-09 20:39:17.205 W/System.err( 1451): java.io.FileNotFoundException: /mnt/sdcard/tests/fennec-15.0a1.en-US.android-arm.apk (No such file or directory) 05-09 20:40:38.765 W/System.err( 1451): java.io.FileNotFoundException: /mnt/sdcard/tests/robocop.apk (No such file or directory) after the respective pushes and install attempts... but there are so many errors and other stuff in that log that look to be just from the test that I can't really understand what's a valid error or not; but I don't see anything that looks like fennec even got installed, much less launched.

Kartikaya Gupta (email:kats@mozilla.staktrace.com)

Comment 33

•

13 years ago

(In reply to Vladimir Vukicevic (:vlad) from comment #32) > 05-09 20:39:17.205 W/System.err( 1451): java.io.FileNotFoundException: > /mnt/sdcard/tests/fennec-15.0a1.en-US.android-arm.apk (No such file or > directory) > > 05-09 20:40:38.765 W/System.err( 1451): java.io.FileNotFoundException: > /mnt/sdcard/tests/robocop.apk (No such file or directory) These errors are normal. At some point before the file is transfered the host requests a hash of the file or something which results in this exception. It's annoying as hell but I believe benign. (In reply to Vladimir Vukicevic (:vlad) from comment #31) > Shouldn't be related -- they're coming from what looks like android startup > stuff? I don't actually see fennec being launched in those logs at all. Usually I never actually see fennec being launched in the logs. The logs generated are pretty useless if you try to just interpret them directly. The best way is to compare a working-test log to a non-working-test log and see what's different.

Vladimir Vukicevic [:vlad] [:vladv] (needinfo me, slow to respond)

Assignee

Comment 34

•

13 years ago

Working through bug 749426, which is what I think is causing the test failures here.

Status: REOPENED → NEW

Vladimir Vukicevic [:vlad] [:vladv] (needinfo me, slow to respond)

Assignee

Updated

•

13 years ago

Depends on: 749426

Vladimir Vukicevic [:vlad] [:vladv] (needinfo me, slow to respond)

Assignee

Comment 35

•

13 years ago

Testing this patch along with bug 749426 caused the same robo* test failures on the try server, but I was not able to reproduce the failures locally on either a galaxy nexus or on a local tegra board. I'm a bit stumped as to how to proceed; one thing that could be useful is the logcat output from the tinderbox tegra device while the test is running.

Kartikaya Gupta (email:kats@mozilla.staktrace.com)

Comment 36

•

13 years ago

(In reply to Vladimir Vukicevic (:vlad) from comment #35) > one thing that could be useful is the logcat output from the > tinderbox tegra device while the test is running. I have encountered other situations in which this would have been useful as well. It would be nice if the test harness could be modified to provide this always.

Brad Lassey [:blassey] (use needinfo?)

Comment 37

•

13 years ago

(In reply to Vladimir Vukicevic (:vlad) from comment #35) > I'm a bit stumped as to > how to proceed; one thing that could be useful is the logcat output from the > tinderbox tegra device while the test is running. Joel, any suggestions? I thought we had a solution for this.

Joel Maher ( :jmaher ) (UTC -8)

Comment 38

•

13 years ago

adb/tcp is the most unreliable form of communication and that is the only way we can harvest the entire set of logcat data. In addition, there would be no way to store this. The volume is so high on the tegras that it would flood the logfile we produce. I am open to suggestions, but the few attempts we have tried have yielded nothing useful (i.e. it collects the wrong stuff, or not enough).

Brad Lassey [:blassey] (use needinfo?)

Comment 39

•

13 years ago

one thing that could cut down on the volume is grep'ing the logcat for the process id of fennec

Joel Maher ( :jmaher ) (UTC -8)

Comment 40

•

13 years ago

we find that we miss important things like crash dumps in logcat (which are different than when we crash fennec and produce a crash dump). I do know there are ways with logcat to filter the output, we might be able filter based on the process and minimize our pipe of information. The other factor is that this would be collected after the fact, not interleaved with test results.

Brad Lassey [:blassey] (use needinfo?)

Comment 41

•

13 years ago

(In reply to Joel Maher (:jmaher) from comment #40) > we find that we miss important things like crash dumps in logcat (which are > different than when we crash fennec and produce a crash dump). Is that filtering on "Gecko" instead of the pid?

Kartikaya Gupta (email:kats@mozilla.staktrace.com)

Comment 42

•

13 years ago

Can we run something like alogcat on the device and have it collect the full log, and then upload it somewhere using a reliable channel?

Joel Maher ( :jmaher ) (UTC -8)

Comment 43

•

13 years ago

I like that idea. If we could essentially do a log flush before we start the test, record the log, then pull it up, that would work out great. At that point in our test harness script we could do any parsing and filtering necessary.

Mark Côté [:mcote]

Reporter

Comment 44

•

13 years ago

Just wanted to mention that, this morning, I got the black content on my Asus Eee Transformer as well (latest nightly).

Vladimir Vukicevic [:vlad] [:vladv] (needinfo me, slow to respond)

Assignee

Comment 45

•

13 years ago

I filed bug 754873 for capturing logcat output. I tried submitting this patch with nothing but the sendEventToGecko -> sendEventToGeckoSync change (which technically should be all that's required); same problem. I can't reproduce it locally, though I'm about to try something different here shortly.

Vladimir Vukicevic [:vlad] [:vladv] (needinfo me, slow to respond)

Assignee

Comment 46

•

13 years ago

Try builds with this patch are at http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/vladimir@pobox.com-0e539f1d4ce4/try-android/ for testing.

Benoit Jacob [:bjacob] (mostly away)

Comment 47

•

13 years ago

This patch fixes topcrasher bug 743938.

Blocks: 743938

Benoit Jacob [:bjacob] (mostly away)

Comment 49

•

13 years ago

Here's Vlad's tryserver build with the patch, to try: http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/vladimir@pobox.com-0e539f1d4ce4/try-android/

Benoit Jacob [:bjacob] (mostly away)

Comment 50

•

13 years ago

Oops, meant to post that in the other bug.

Scoobidiver (away)

Updated

•

13 years ago

No longer blocks: 743938

Vladimir Vukicevic [:vlad] [:vladv] (needinfo me, slow to respond)

Assignee

Comment 51

•

13 years ago

I can finally reproduce the deadlock! I needed to run the exact same test run, using SUT and Talos to do it; doing anything as a mochitest etc. didn't work. What's happening is this: The Android thread posted a native compositing paused event and is waiting for gecko to process it: java.util.concurrent.CountDownLatch await CountDownLatch.java:207 org.mozilla.gecko.GeckoAppShell geckoEventSync GeckoAppShell.java:618 org.mozilla.gecko.GeckoAppShell sendEventToGeckoSync GeckoAppShell.java:538 org.mozilla.gecko.gfx.GeckoLayerClient compositionPauseRequested GeckoLayerClient.java:475 org.mozilla.gecko.gfx.LayerView surfaceDestroyed LayerView.java:269 android.view.SurfaceView reportSurfaceDestroyed SurfaceView.java:568 The Gecko thread, on the other hand, is here: org.mozilla.gecko.gfx.LayerView requestRender LayerView.java:182 org.mozilla.gecko.gfx.LayerController setCheckerboardColor LayerController.java:347 org.mozilla.gecko.GeckoApp handleMessage GeckoApp.java:861 org.mozilla.gecko.GeckoAppShell handleGeckoMessage GeckoAppShell.java:1865 org.mozilla.gecko.GeckoAppShell nativeRun GeckoAppShell.java:-2 org.mozilla.gecko.GeckoAppShell nativeRun GeckoAppShell.java:-2 org.mozilla.gecko.GeckoAppShell runGecko GeckoAppShell.java:505 org.mozilla.gecko.GeckoThread run GeckoThread.java:114 So, Gecko is tryng to call LayerView.requestRender() while we're already in the middle of LayerView.surfaceDestroyed(). Both those methods are synchronized. Thus, we deadlock.

Joel Maher ( :jmaher ) (UTC -8)

Comment 52

•

13 years ago

odd why we get that only in talos vs mochitest version of this. Could it be something in the profile (pref or extension), or just random chance.

Vladimir Vukicevic [:vlad] [:vladv] (needinfo me, slow to respond)

Assignee

Comment 53

•

13 years ago

Yeah, I'm confused by that as well; I don't see anything in prefs.js in the profile, at least.

Naoki Hirata :nhirata (please use needinfo instead of cc)

Updated

•

13 years ago

Crash Signature: [@ WSEGL_GetDrawableParameters]

Vladimir Vukicevic [:vlad] [:vladv] (needinfo me, slow to respond)

Assignee

Comment 54

•

13 years ago

Attached patch fix, v4 — Details — Splinter Review

Updated fix that fixes the deadlock. (Also gives the Gecko thread a name, for easier finding in ddms.) Removing synchronized from requestRepaint seems safe, since it just flows into posting an event to Gecko.. which is thread safe by design anyway.

Attachment #624782 - Flags: review?(ajuma)

Ali Juma [:ajuma]

Comment 55

•

13 years ago

Comment on attachment 624782 [details] [diff] [review] fix, v4 Looks good to me. I think kats should take a look at the Java bits too.

Attachment #624782 - Flags: review?(bugmail.mozilla)

Attachment #624782 - Flags: review?(ajuma)

Attachment #624782 - Flags: review+

Kartikaya Gupta (email:kats@mozilla.staktrace.com)

Comment 56

•

13 years ago

Comment on attachment 624782 [details] [diff] [review] fix, v4 Review of attachment 624782 [details] [diff] [review]: ----------------------------------------------------------------- Looks like most of the synchronization in LayerView is just an artifact from long ago, we should probably revisit that at some point. But yeah, this change looks fine. ::: mobile/android/base/gfx/LayerView.java @@ +200,1 @@ > public long getRenderTime() { This function is never used; might as well take it out.

Attachment #624782 - Flags: review?(bugmail.mozilla) → review+

Vladimir Vukicevic [:vlad] [:vladv] (needinfo me, slow to respond)

Assignee

Comment 57

•

13 years ago

On inbound, with getRenderTime and bits nuked. Fingers crossed that it sticks! https://tbpl.mozilla.org/?tree=Mozilla-Inbound&rev=4a6bdd1308c3

(PTO June 19-July 7) Ryan VanderMeulen

Comment 58

•

13 years ago

https://hg.mozilla.org/mozilla-central/rev/4a6bdd1308c3

Status: NEW → RESOLVED

Closed: 13 years ago → 13 years ago

Flags: in-testsuite-

Resolution: --- → FIXED

Target Milestone: --- → Firefox 15

Scoobidiver (away)

Updated

•

13 years ago

status-firefox14: --- → affected

Vladimir Vukicevic [:vlad] [:vladv] (needinfo me, slow to respond)

Assignee

Comment 59

•

13 years ago

Comment on attachment 624782 [details] [diff] [review] fix, v4 [Approval Request Comment] User impact if declined: crashes, freezes, all sorts of badness Testing completed (on m-c, etc.): local testing on variety of devices; on m-c Risk to taking this patch (and alternatives if risky): crashes, freezes, all sorts of badness

Attachment #624782 - Flags: approval-mozilla-aurora?

Vladimir Vukicevic [:vlad] [:vladv] (needinfo me, slow to respond)

Assignee

Comment 60

•

13 years ago

Er, ignore the risk to taking this patch line above, meant to delete that! :p

Joe Drew (not getting mail)

Updated

•

13 years ago

Attachment #624782 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+

Joe Drew (not getting mail)

Comment 62

•

13 years ago

Comment on attachment 622061 [details] [diff] [review] fix, v3 ># HG changeset patch ># Parent 448f554f6acbfe632da02e23526344f3defebe69 > >diff --git a/gfx/layers/ipc/CompositorParent.cpp b/gfx/layers/ipc/CompositorParent.cpp >--- a/gfx/layers/ipc/CompositorParent.cpp >+++ b/gfx/layers/ipc/CompositorParent.cpp >@@ -1,4 +1,4 @@ >-/* -*- Mode: C++; tab-width: 2; indent-tabs-mode: nil; c-basic-offset: 4 -*- */ >+/* -*- Mode: C++; tab-width: 2; indent-tabs-mode: nil; c-basic-offset: 2 -*- */ > /* vim: set sw=2 ts=2 et tw=80 : */ > /* ***** BEGIN LICENSE BLOCK ***** > * Version: MPL 1.1/GPL 2.0/LGPL 2.1 >@@ -66,6 +66,7 @@ CompositorParent::CompositorParent(nsIWi > , mLayersUpdated(false) > , mCompositorLoop(aMsgLoop) > , mThreadID(aThreadID) >+ , mPauseCompositionMonitor("PauseCompositionMonitor") > { > MOZ_COUNT_CTOR(CompositorParent); > } >@@ -141,6 +142,9 @@ CompositorParent::PauseComposition() > { > NS_ABORT_IF_FALSE(CompositorThreadID() == PlatformThread::CurrentId(), > "PauseComposition() can only be called on the compositor thread"); >+ >+ mozilla::MonitorAutoLock lock(mPauseCompositionMonitor); >+ > if (!mPaused) { > mPaused = true; > >@@ -148,6 +152,9 @@ CompositorParent::PauseComposition() > static_cast<LayerManagerOGL*>(mLayerManager.get())->gl()->ReleaseSurface(); > #endif > } >+ >+ // if anyone's waiting to make sure that composition really got paused, tell them >+ lock.NotifyAll(); > } > > void >@@ -169,12 +176,21 @@ CompositorParent::ResumeCompositionAndRe > ResumeComposition(); > } > >+/* >+ * This will execute a pause synchronously, waiting to make sure that the compositor >+ * really is paused. >+ */ > void > CompositorParent::SchedulePauseOnCompositorThread() > { >+ mozilla::MonitorAutoLock lock(mPauseCompositionMonitor); >+ > CancelableTask *pauseTask = NewRunnableMethod(this, > &CompositorParent::PauseComposition); > CompositorLoop()->PostTask(FROM_HERE, pauseTask); >+ >+ // Wait until the pause has actually been processed by the compositor thread >+ lock.Wait(); > } > > void >diff --git a/gfx/layers/ipc/CompositorParent.h b/gfx/layers/ipc/CompositorParent.h >--- a/gfx/layers/ipc/CompositorParent.h >+++ b/gfx/layers/ipc/CompositorParent.h >@@ -1,4 +1,4 @@ >-/* -*- Mode: C++; tab-width: 8; indent-tabs-mode: nil; c-basic-offset: 4 -*- */ >+/* -*- Mode: C++; tab-width: 8; indent-tabs-mode: nil; c-basic-offset: 2 -*- */ > /* vim: set sw=4 ts=8 et tw=80 : */ > /* ***** BEGIN LICENSE BLOCK ***** > * Version: MPL 1.1/GPL 2.0/LGPL 2.1 >@@ -52,6 +52,7 @@ > #include "mozilla/layers/PCompositorParent.h" > #include "mozilla/layers/PLayersParent.h" > #include "base/thread.h" >+#include "mozilla/Monitor.h" > #include "ShadowLayersManager.h" > > class nsIWidget; >@@ -165,6 +166,8 @@ private: > MessageLoop* mCompositorLoop; > PlatformThreadId mThreadID; > >+ mozilla::Monitor mPauseCompositionMonitor; >+ > DISALLOW_EVIL_CONSTRUCTORS(CompositorParent); > }; > >diff --git a/mobile/android/base/gfx/GeckoLayerClient.java b/mobile/android/base/gfx/GeckoLayerClient.java >--- a/mobile/android/base/gfx/GeckoLayerClient.java >+++ b/mobile/android/base/gfx/GeckoLayerClient.java >@@ -457,7 +457,13 @@ public class GeckoLayerClient implements > public void compositionPauseRequested() { > // We need to coordinate with Gecko when pausing composition, to ensure > // that Gecko never executes a draw event while the compositor is paused. >- GeckoAppShell.sendEventToGecko(GeckoEvent.createCompositorPauseEvent()); >+ // This is sent synchronously to make sure that we don't attempt to use >+ // any outstanding Surfaces after we call this (such as from a >+ // surfaceDestroyed notification), and to make sure that any in-flight >+ // Gecko draw events have been processed. When this returns, composition is >+ // definitely paused -- it'll synchronize with the Gecko event loop, which >+ // in turn will synchronize with the compositor thread. >+ GeckoAppShell.sendEventToGeckoSync(GeckoEvent.createCompositorPauseEvent()); > } > > /** Implementation of LayerView.Listener */

Attachment #622061 - Attachment is obsolete: true

Vladimir Vukicevic [:vlad] [:vladv] (needinfo me, slow to respond)

Assignee

Comment 63

•

13 years ago

https://hg.mozilla.org/releases/mozilla-aurora/rev/ac685c4554a5

Scoobidiver (away)

Updated

•

13 years ago

status-firefox14: affected → fixed

Gordon P. Hemsley [:GPHemsley]

Updated

•

13 years ago

No longer depends on: 749426

Gordon P. Hemsley [:GPHemsley]

Updated

•

13 years ago

Depends on: 749426

Mark Côté [:mcote]

Reporter

Comment 64

•

13 years ago

This should be fixed in Nightly, right? Just checking 'cause I experienced the problem again in today's Nightly...

Naoki Hirata :nhirata (please use needinfo instead of cc)

Comment 65

•

13 years ago

Mark : Yes, it should be in the nightly. Do you have steps to reproduce?

Mark Côté [:mcote]

Reporter

Comment 70

•

13 years ago

Naoki: Unfortunately it just happened once as per the normal steps--opened a link from an other application. Had to stop the process and relaunch to get the content screen back.

Vladimir Vukicevic [:vlad] [:vladv] (needinfo me, slow to respond)

Assignee

Comment 72

•

13 years ago

If you see it again, can you grab the logcat output? It should let us determine if the root cause is the same as what was fixed here, or something different.

Mark Côté [:mcote]

Reporter

Comment 76

•

13 years ago

Still haven't seen it since. Looks like it is much, much reduced in frequency.

Andreea Pod

Updated

•

13 years ago

Assignee: vladimir → nobody

Andreea Pod

Comment 77

•

13 years ago

Sorry about that, it was by mistake.

Assignee: nobody → vladimir

Nobody; OK to take it and work on it

Updated

•

5 years ago

Product: Firefox for Android → Firefox for Android Graveyard

logcat after black-screen crash 13 years ago Mark Côté [:mcote] 126.73 KB, application/gzip		Details
unzipped log 13 years ago Brad Lassey [:blassey] (use needinfo?) 1.23 MB, text/plain		Details
prototype fix 13 years ago Vladimir Vukicevic [:vlad] [:vladv] (needinfo me, slow to respond) 4.45 KB, patch	ajuma : feedback+	Details \| Diff \| Splinter Review
cleaned up fix 13 years ago Vladimir Vukicevic [:vlad] [:vladv] (needinfo me, slow to respond) 2.60 KB, patch		Details \| Diff \| Splinter Review
fix, v2 13 years ago Vladimir Vukicevic [:vlad] [:vladv] (needinfo me, slow to respond) 2.46 KB, patch	BenWa : review-	Details \| Diff \| Splinter Review
fix, v3 13 years ago Vladimir Vukicevic [:vlad] [:vladv] (needinfo me, slow to respond) 3.96 KB, patch	BenWa : review+ ajuma : review+	Details \| Diff \| Splinter Review
fix, v4 13 years ago Vladimir Vukicevic [:vlad] [:vladv] (needinfo me, slow to respond) 5.70 KB, patch	ajuma : review+ kats : review+ joe : approval-mozilla-aurora+	Details \| Diff \| Splinter Review