Closed Bug 735230 Opened 8 years ago Closed 8 years ago

Maple: Texture upload while Fennec is in the background is causing black screens

Categories

(Firefox for Android :: General, defect)

ARM
Android
defect
Not set

Tracking

()

VERIFIED FIXED
Firefox 14
Tracking Status
firefox14 --- verified
blocking-fennec1.0 --- beta+

People

(Reporter: csuciu, Assigned: ajuma)

References

Details

(Whiteboard: maple)

Attachments

(7 files, 3 obsolete files)

Attached file Log file
Maple nightly 13.0a1 2012-03-13 (20120313040242)
Samsung Galaxy SII (Android 2.3.4)

Steps:
1. Start maple with a clean profile
2. Load Google.com
3. Enter some text in the search field and tap on google "Search" button
4. After showing the results tap again on the search field 

Expected:
Google page is displayed with the search field in focus 

Actual:
A blank page is displayed
Attached image Screenshot
Whiteboard: maple
I've seen this as a result of the scroll coordinates not getting clamped properly. Usually if you try to pan the gray area where the page is supposed to be it comes back.
Depends on: 732016
I've also hit this condition using my formsninput.html test page (http://people.mozilla.com/~nhirata/html_tp/formsninput.html)  I believe it's the same issue based on comment 2.
Attached image screenshot 2
I guess that I am able to reproduce the same issue by following these reproducing steps:

1. Go to cnn.com
2. Tap on any news title
3. Wait until the new opened page is fully loaded
4. Tap on device back button
5. Tap on device home button and wait at least 15-20s until the page opened at step 2 is loaded in the background
6. Tap on Fennec icon to bring it back on the screen

Expected result:
cnn.com is displayed correctly

Actual result:
A black page is displayed instead of expected content (see screenshot 1)

--
Maple (2012-03-13)
Device: Samsung Nexus S
OS: Android 2.3.6
For me the steps in comment 4 don't reproduce a problem; the page draws as expected. However I'm testing on a Galaxy Nexus running ICS, and the behaviour may be different on older versions of Android. Specifically, while Fennec is in the background I see the following dumped to logcat as it draws (I added the VIEWPORT logging lines):

03-14 12:50:42.745 I/Gecko   (23450): Compositor: Layers update took 5 ms (blocking gecko).
03-14 12:50:42.752 D/GeckoLayerClient(23450): VIEWPORT endDrawing
03-14 12:50:42.791 D/GeckoLayerClient(23450): VIEWPORT beginDrawing {"width":720,"height":1038,"pageWidth":720,"pageHeight":1949.0625,"zoom":0.703125,"x":0,"y":0}
03-14 12:50:42.799 I/Gecko   (23450): VIEWPORT painting document in pres shell 0x5c090400
03-14 12:50:42.924 E/libEGL  (23450): eglMakeCurrent:685 error 3009 (EGL_BAD_MATCH)
03-14 12:50:42.924 I/Gecko   (23450): ###!!! ASSERTION: Failed to make GL context current!: 'succeeded', file /Users/kats/zspace/kiwifox/gfx/gl/GLContextProviderEGL.cpp, line 1191

:ajuma, :pcwalton - do you know what the assertion is and if this might be an indication of something wrong?

Note also this is probably a different bug than described in comments 0..3.
(In reply to Kartikaya Gupta (:kats) from comment #5)
> 03-14 12:50:42.924 E/libEGL  (23450): eglMakeCurrent:685 error 3009
> (EGL_BAD_MATCH)
> 03-14 12:50:42.924 I/Gecko   (23450): ###!!! ASSERTION: Failed to make GL
> context current!: 'succeeded', file
> /Users/kats/zspace/kiwifox/gfx/gl/GLContextProviderEGL.cpp, line 1191
> 
> :ajuma, :pcwalton - do you know what the assertion is and if this might be
> an indication of something wrong?

Yes, we're failing to make the GL context current, which means that any subsequent GL calls we make will have undefined results. It seems that ICS is more lenient in this situation, so we're not seeing any visual artifacts there, but seeing a black screen after this error is not at all unexpected.

The previous log line notes that that error is EGL_BAD_MATCH, which makes sense since we're calling eglMakeCurrent with EGL_NO_SURFACE but with a context other than EGL_NO_CONTEXT. This was added in Bug 728041 (and I should have caught this when reviewing that).

The real problem both here and in Bug 728041 is that we're uploading textures when we're in the background, and we shouldn't be doing this. Probably the right fix is for the compositor to reject layer tree updates when go into the background, and then invalidate its entire layer tree when we resume.

> Note also this is probably a different bug than described in comments 0..3.

I think so too.
Blocks: 728041
This should block beta.
Assignee: nobody → ajuma
blocking-fennec1.0: --- → ?
blocking-fennec1.0: ? → beta+
We should split this into two bugs, since comment 4 onwards is a different issue than comments 0..3.
(In reply to Kartikaya Gupta (:kats) from comment #8)
> We should split this into two bugs, since comment 4 onwards is a different
> issue than comments 0..3.

Since the issue described in comments 0..3 sounds very much like a duplicate of Bug 732016, how about we use this bug for the issue discussed in comment 4 onwards?
Summary: Maple: Blank page on google.com → Maple: Texture upload while Fennec is in the background is causing black screens
I've consistently hit this bug now with the following STR:
1. place device in landscape mode.
2. go to http://people.mozilla.com/~nhirata/html_tp/datentimescript.html
3. click in the field
4. switch to portrait mode
To summarize, the root problem here is that although we cannot make a GL context current when we don't have an Android surface (i.e. during the time between the Android SurfaceDestroyed and SurfaceCreated events), we sometimes do receive a layer tree update during this time. For example, this can happen when the Gecko thread was in the middle of a paint when the SurfaceDestroyed event was received; when this paint finishes, Gecko will send a layer tree update to the compositor.

The compositor cannot correctly process a layer tree update in this situation, since it cannot upload textures. The failed texture uploads are causing the black screens we're seeing.

Here's a suggested approach for handling this: During the time that we have no Android surface, the compositor shouldn't attempt to process layer tree updates. Instead, in this situation, ShadowLayersParent::RecvUpdate should simply return immediately. Of course, this means that the compositor's layer tree will be out-of-sync with content. To handle this, we can invalidate the content layer tree when we receive a SurfaceCreated event, and this will cause content to produce a new content tree and send this to the compositor. Until the compositor receives this, it can still composite its existing layer tree (that is, when Fennec resumes, the compositor can still render to the screen immediately).
Blocks: 736666
Not sure if this is the cause of my phone getting really hot this afternoon; checked log cat and it was flooding with: E/libEGL  ( 9441): eglMakeCurrent:685 error 3009 (EGL_BAD_MATCH) messages, and top revealed: 9441  1  96% S    36 669232K 187320K  bg app_83   org.mozilla.fennec -- 96% ouch
(In reply to Aaron Train [:aaronmt] from comment #13)
> Not sure if this is the cause of my phone getting really hot this afternoon;
> checked log cat and it was flooding with: E/libEGL  ( 9441):
> eglMakeCurrent:685 error 3009 (EGL_BAD_MATCH) messages, and top revealed:
> 9441  1  96% S    36 669232K 187320K  bg app_83   org.mozilla.fennec -- 96%
> ouch

The EGL_BAD_MATCH messages are indeed caused by this. Whether this on its own is the cause of all that CPU usage is less clear; the more general issue is that Gecko continuing to run in the background as if it were in the foreground, and attempting to upload textures is one of the things it happens to do. We also have Bug 718364 for the background CPU usage issue.
Still reproducible on the latest Nightly build on cnn.com

--
Firefox (2012-03-20)
Device: Samsung Nexus S
OS: Android 2.3.6
Version: Firefox 13 → Trunk
(In reply to Ali Juma [:ajuma] from comment #12) 
> Here's a suggested approach for handling this: During the time that we have
> no Android surface, the compositor shouldn't attempt to process layer tree
> updates. Instead, in this situation, ShadowLayersParent::RecvUpdate should
> simply return immediately. Of course, this means that the compositor's layer
> tree will be out-of-sync with content. To handle this, we can invalidate the
> content layer tree when we receive a SurfaceCreated event, and this will
> cause content to produce a new content tree and send this to the compositor.
> Until the compositor receives this, it can still composite its existing
> layer tree (that is, when Fennec resumes, the compositor can still render to
> the screen immediately).

Perhaps a cleaner approach than ignoring layer tree updates is to prevent them from happening in the first place. So, instead of having Java directly handle pausing/resuming the compositor, Java should instead send an event to Gecko. Gecko can then set/unset a flag that prevents drawing, and pause/resume the compositor thread. This approach ensures that no layer tree updates will arrive while the compositor is paused.
(In reply to Ali Juma [:ajuma] from comment #16)
> Perhaps a cleaner approach than ignoring layer tree updates is to prevent
> them from happening in the first place. So, instead of having Java directly
> handle pausing/resuming the compositor, Java should instead send an event to
> Gecko. Gecko can then set/unset a flag that prevents drawing, and
> pause/resume the compositor thread. This approach ensures that no layer tree
> updates will arrive while the compositor is paused.

This general approach works, with one caveat: Waiting for Gecko to resume the compositor thread takes too long when Fennec returns from the background, causing a brief black flash. To avoid this black flash, we still need Java to resume the compositor (and inform Gecko that the compositor is no longer paused, so drawing should no longer be prevented).

However, pausing the compositor does need to happen from Gecko, to ensure that Gecko doesn't draw after the compositor is paused.

This leads to another potential complication: on surfaceChanged, we currently do a pause followed by a resume. Right now, the pause doesn't really accomplish anything, since we immediately resume. However, if pauses have to go through Gecko but resumes do not, we have a good chance of the resume getting scheduled before the pause. Since the pause doesn't accomplish anything right now anyway, we should just remove it (and only pause on surfaceDestroyed). We do still need to schedule a resume, since this triggers the compositor doing a makeCurrent for the changed surface.
We'll use the Pause message when pausing the compositor from Gecko. We don't need the Resume message right now (as discussed above, we'll still resume from Java), but it makes sense to add it to the protocol anyway.
Attachment #609884 - Flags: review?(bgirard)
This adds COMPOSITOR_PAUSE and COMPOSITOR_RESUME events, and ensures that Gecko doesn't draw between these events. To make up for any discarded draws, Gecko has to schedule a draw when it receives a Resume event.

Since handling these events promptly is critical (to avoid GL errors in the case of pausing, and to begin drawing again in the case of resume), we give priority to these events in nsAppShell::PostEvent.

This patch also does some cleanup, removing SetCompositorParent from AndroidBridge (where it doesn't really belong).
Attachment #609891 - Flags: review?(bugmail.mozilla)
Attachment #609891 - Attachment is patch: true
This makes us follow this approach discussed in Comment 17.
Attachment #609892 - Flags: review?(bugmail.mozilla)
This prevents us from getting the error EGL_BAD_MATCH when calling eglMakeCurrent.
Attachment #609893 - Flags: review?(gwright)
Comment on attachment 609891 [details] [diff] [review]
Part 2: Add compositor pause/resume events to Gecko.

Review of attachment 609891 [details] [diff] [review]:
-----------------------------------------------------------------

I feel like this patch combined with part 3 ends up being more un-symmetric than it needs to be, which makes it a little harder to follow. Would it be possible to add the following changes:

- Call nsWindow::ScheduleResumeComposition() as the first line of the COMPOSITOR_RESUME event handler in nsWindow
- Remove the call to GeckoAppShell.scheduleResumeComposition() in GeckoLayerClient.compositionResumeRequested
- Get rid of the GeckoAppShell.schedule{Pause,Resume}Composition JNI functions since they will no longer be needed

I think the only effective difference this would make is that when composition is resumed, the compositor will be asked to resume on an event rather than via a JNI call, so there is a bit of added latency. The event is already inserted at the head of the queue though, so I'm thinking the added latency will be negligible and the code will be cleaner. Does that make sense?
(In reply to Kartikaya Gupta (:kats) from comment #22)
> Would it be
> possible to add the following changes:
> 
> - Call nsWindow::ScheduleResumeComposition() as the first line of the
> COMPOSITOR_RESUME event handler in nsWindow
> - Remove the call to GeckoAppShell.scheduleResumeComposition() in
> GeckoLayerClient.compositionResumeRequested
> - Get rid of the GeckoAppShell.schedule{Pause,Resume}Composition JNI
> functions since they will no longer be needed
> 
> I think the only effective difference this would make is that when
> composition is resumed, the compositor will be asked to resume on an event
> rather than via a JNI call, so there is a bit of added latency. The event is
> already inserted at the head of the queue though, so I'm thinking the added
> latency will be negligible and the code will be cleaner. Does that make
> sense?

This approach is certainly cleaner and preferable. However, it still adds enough of a delay that it causes us to briefly display a black screen on resume. This happens even if we additionally modify the compositor's resume event so that, rather than merely scheduling a composition event, it goes ahead and does a composition right there and then.

I added some timing output to see how much latency is added by going through Gecko, and it turns out to be quite a bit. The time between nsAppShell::PostEvent inserting the resume event at the head of the queue, and nsWindow::OnGlobalAndroidEvent processing the event is typically 400-500 ms! (I tested on a Nexus S and an HTC Desire.)
(In reply to Ali Juma [:ajuma] from comment #23) 
> I added some timing output to see how much latency is added by going through
> Gecko, and it turns out to be quite a bit. The time between
> nsAppShell::PostEvent inserting the resume event at the head of the queue,
> and nsWindow::OnGlobalAndroidEvent processing the event is typically 400-500
> ms! (I tested on a Nexus S and an HTC Desire.)

Interesting, for pause events, the time is much more reasonable, typically 6 ms or less.
Comment on attachment 609891 [details] [diff] [review]
Part 2: Add compositor pause/resume events to Gecko.

Review of attachment 609891 [details] [diff] [review]:
-----------------------------------------------------------------

::: widget/android/nsWindow.cpp
@@ +987,5 @@
> +        case AndroidGeckoEvent::COMPOSITOR_PAUSE:
> +            if (sCompositorChild) {
> +                sCompositorChild->SendPause();
> +            }
> +            sCompositorPaused = true;

Add some comments here, basically what you said in comments 16 and 17 of this bug.

@@ +992,5 @@
> +            break;
> +
> +        case AndroidGeckoEvent::COMPOSITOR_RESUME:
> +            sCompositorPaused = false;
> +            win->RedrawAll();

Ditto.
Attachment #609891 - Flags: review?(bugmail.mozilla) → review+
Comment on attachment 609892 [details] [diff] [review]
Part 3: Use the new compositor pause/resume events.

Review of attachment 609892 [details] [diff] [review]:
-----------------------------------------------------------------

Again, just add some comments here explaining what's going on. Also maybe drop in a link to comment 23 so that we know why the code is arranged the way it is.
Attachment #609892 - Flags: review?(bugmail.mozilla) → review+
Comment on attachment 609884 [details] [diff] [review]
Part 1: Add Pause and Resume messages to PCompositor.

Looks good. I would like to see a bit more documentation on the IPDL interface, we should also later come back and document all the existing calls.
Attachment #609884 - Flags: review?(bgirard) → review+
Comment on attachment 609893 [details] [diff] [review]
Part 4: Don't call MakeCurrent with a context and without a surface.

Review of attachment 609893 [details] [diff] [review]:
-----------------------------------------------------------------

lgtm.
Attachment #609893 - Flags: review?(gwright) → review+
Patch updated to address review comments. Carrying forward r=bgirard.
Attachment #609884 - Attachment is obsolete: true
Attachment #610263 - Flags: review+
Patch updated to address review comments. Carrying forward r=kats.
Attachment #609891 - Attachment is obsolete: true
Comment on attachment 610264 [details] [diff] [review]
Part 2: Add compositor pause/resume events to Gecko.

Meant to carry forward r=kats.
Attachment #610264 - Attachment is patch: true
Attachment #610264 - Flags: review+
Patch updated to address review comments. Carrying forward r=kats.
Attachment #609892 - Attachment is obsolete: true
Attachment #610268 - Flags: review+
Verified/fixed on:
Nightly Fennec 14.0a1 (2012-04-03)
Device: HTC Desire
OS: Android 2.2.2
Status: RESOLVED → VERIFIED
Blocks: 730145
You need to log in before you can comment on or make changes to this bug.