Closed Bug 1647797 Opened 5 years ago Closed 5 years ago

Empty screenshot when attaching GeckoView to already loaded session

Categories

(Core :: Graphics, defect, P1)

Unspecified
Android
defect

Tracking

()

RESOLVED FIXED
81 Branch
Tracking Status
firefox-esr68 --- unaffected
firefox-esr78 --- unaffected
firefox79 --- wontfix
firefox80 --- wontfix
firefox81 --- fixed

People

(Reporter: jonalmeida, Assigned: jnicol)

References

(Regression)

Details

(Keywords: regression, Whiteboard: [geckoview:m81][fenix:p1])

Attachments

(3 files)

In bug 1641064, we decided to fail fast when we cannot create a thumbnail. This approach seems to work better for our Fenix needs. However, a side-effect from it seems to be that it provides us with a blank bitmap on failure. It's not reasonably possible to figure out if the bitmap is blank or has content in it unfortunately.

Would it be possible to have the return value be a GeckoResult<Bitmap?> instead?

Whiteboard: [geckoview:m80]

This approach seems to work better for our Fenix needs. However, a side-effect from it seems to be that it provides us with a blank bitmap on failure

Hey Jon, as far as I am aware we don't intentionally send an empty Bitmap. When we fail fast in Bug 1641064 we reject the GeckoResult, so no empty bitmap is sent there.

For sanity sake, I re-verified that we are indeed getting blank bitmaps back from GV. Christian corrected me in realizing that waiting for the firstContentFullPaint was added to only stop the exceptions (but we were still seeing those come through) and not for getting a useful screenshot.

We're seeing this happen when we return to the browser and we take a screenshot if the page is done loading.

(In reply to Jonathan Almeida [:jonalmeida] from comment #2)

We're seeing this happen when we return to the browser and we take a screenshot if the page is done loading.

I see, that makes sense. We can definitely look at this case and figure out if there's something we can do better here.

Summary: GeckoView.capturePixels fails fast with a blank bitmap when a screenshot cannot be retrieved → Empty screenshot when attaching GeckoView to already loaded session

A good first step here would be to see if we can write a junit test for this.

Whiteboard: [geckoview:m80] → [geckoview:m80][fenix:p1]
Severity: -- → S3
Priority: -- → P1
Priority: P1 → P2
Whiteboard: [geckoview:m80][fenix:p1] → [geckoview:m81][fenix:p1]

This is a super high priority for Fenix FYI.

Severity: S3 → S2
Component: General → Graphics
Product: GeckoView → Core

@Jamie: Could you take a look at this.

Blocks: wr-81
Flags: needinfo?(jnicol)

Sure. Jonathan, do you have a simple steps to reproduce? I'm afraid I'm not sure how to translate "attaching GeckoView to already loaded session" in to user interactions! Also, are you seeing this on specific devices, or all devices? (ie is it webrender or non-webrender?) Thanks!

Flags: needinfo?(jnicol) → needinfo?(jonalmeida942)

Hey Jamie, sorry about the lack of context.

Steps to reproduce

  1. Open Fenix or Reference Browser and load a site with static content (so that its easier to visually see the differences) like https://mozilla.github.io/geckoview/.
  2. Open the tabs tray and observe a thumbnail image.
  3. Switch to a different screen by going to the three-dot kebab menu and clicking Settings.
  4. Return back to page by pressing back.
  5. Open the tabs tray and observe a thumbnail image.

Expected results

  • At 5, you should see the same screenshot as the step 3.

Actual results

  • At 5, you see a blank white thumbnail.

This used to be easier to reproduce on Fenix but there have been lots of changes there so it's a hit of miss for me, we are still getting lots of reports on the issue though, for example this one; Reference Browser is consistent.

As mentioned in comment 2, it seems like we're getting blank white images from Gecko on requesting a thumbnail when returning to a browser view where the page being loaded from the browser cache in GeckoView, which is roughly what the "attaching GeckoView to already loaded session" would translate to.

Flags: needinfo?(jonalmeida942) → needinfo?(jnicol)

Thanks for the detailed instructions Jon! I can indeed reproduce consistently in the Reference Browser, much more than in Fenix. Will try to get to the bottom of this tomorrow.

Flags: needinfo?(jnicol)

Okay, I think I've figured out what's going on here.

When geckoview is paused, we call ClearCachedResources(). This makes its way to LayerManagerComposite::ClearCachedResources() or WebRenderBridgeParent::RecvClearCachedResources(), which will throw away the data they need to render the page (eg the PaintedLayer content tiles, the webrender display list, etc).

When we then resume we immediately force a composite, so that the uninitialized SurfaceTexture is not visible (resulting in a black flash). But since we have no content to composite, all we can do is just clear the screen to the chosen clear colour. It's not ideal, but the logic is that a flash of the clear colour is preferable to a flash of black.

I gather that Fenix/RB previously waited for the first contentful paint before requesting the screenshot, hence why it captured the correct data. By now requesting the screenshot immediately it is requesting the screenshot of the first frame, which is just the clear colour.

I guess we have a couple of ways of fixing this:

  • Fenix can wait until the first contentful paint before capturing the screenshot. Make the GeckoView API explicit that attempting to capture a thumbnail before then will return a blank colour.
  • Make LayerManagerComposite/WebRenderBridgeParent wait until a contentful composite before capturing the screenshot.

Is there some context for why was Fenix changed to capture the screenshot immediately instead of waiting for the first contentful paint?

As an aside, you'll notice that pages such as about:support do always get thumbnailed correctly. I think this is because we don't ClearCachedResources() for parent-process pages, or something.

I think there's a discussion to be had about whether it's wise to clear our cached resources immediately upon pausing. This "clear colour frame" causes a slight flickering effect which is noticeable (though much less noticable than when it was a black flash!). We probably don't want that to happen when the user opens the settings menu then presses back. But we certainly will want to clear them eventually after pausing, so the screenshot mechanism needs to be resilient to this one way or the other.

(In reply to Jamie Nicol [:jnicol] from comment #11)

As an aside, you'll notice that pages such as about:support do always get thumbnailed correctly. I think this is because we don't ClearCachedResources() for parent-process pages, or something.

I think there's a discussion to be had about whether it's wise to clear our cached resources immediately upon pausing. This "clear colour frame" causes a slight flickering effect which is noticeable (though much less noticable than when it was a black flash!). We probably don't want that to happen when the user opens the settings menu then presses back. But we certainly will want to clear them eventually after pausing, so the screenshot mechanism needs to be resilient to this one way or the other.

Yes, I think we should only clear those resources after a timeout or memory pressure. Short pauses of the compositor are very common in situations like the one you mentioned. You also don't want to throw everything away when switching between two (or more) tabs frequently, etc.

(In reply to Jamie Nicol [:jnicol] from comment #10)
Thanks for digging into this!

I gather that Fenix/RB previously waited for the first contentful paint before requesting the screenshot, hence why it captured the correct data. By now requesting the screenshot immediately it is requesting the screenshot of the first frame, which is just the clear colour.

I guess we have a couple of ways of fixing this:

  • Fenix can wait until the first contentful paint before capturing the screenshot. Make the GeckoView API explicit that attempting to capture a thumbnail before then will return a blank colour.

This is pretty annoying for the app to track, and we already have some pretty annoying caveats with this method.

  • Make LayerManagerComposite/WebRenderBridgeParent wait until a contentful composite before capturing the screenshot.

This would be my preferred solution. It could timeout if there isn't a reasonable paint within some threshold.

Is there some context for why was Fenix changed to capture the screenshot immediately instead of waiting for the first contentful paint?

I think it's still doing it then too, but not sure. We did have some problems where FCP wasn't fired consistently.

(In reply to Jamie Nicol [:jnicol] from comment #10)

Is there some context for why was Fenix changed to capture the screenshot immediately instead of waiting for the first contentful paint?

This was done for a short period of time until we understood that this change was still needed.

In Fenix and Reference Browser (that both get this feature from Android Components), we indeed wait for the firstContentfulPaint before requesting a screenshot, see here.

On a Matrix conversation, Randall from FxR did some investigation here and found that the flag is only set per session and not per site load, which would explain why we still see this issue.

Make LayerManagerComposite/WebRenderBridgeParent wait until a contentful composite before capturing the screenshot.

Maybe this recommendation is something that would need to be altered depending on my above comment?

Flags: needinfo?(jnicol)

On a Matrix conversation, Randall from FxR did some investigation here and found that the flag is only set per session and not per site load, which would explain why we still see this issue.

Yeah, I was looking in to this further today and reached a similar conclusion. I think on the Gecko end the first contentful paint event does get fired for each new site (as it's tracked by a property on the PresContext), but android-components never unsets its copy of the flag. But yeah, it certainly doesn't get unset after we lose our painted data when pausing, so we hit this bug.

Maybe this recommendation is something that would need to be altered depending on my above comment?

I do think it logically makes more sense for android-components to completely decide when to thumbnail. But for that to work GV needs to be able to tell it when we no longer have content painted. This would presumably require a new/changed API so might therefore be a more difficult change to make. I'd also worry about changing the semantics of things like the firstContentfulPaint flag which might be used for other things too, and modifying PresContext code which I really don't know.

OTOH the compositor isn't really aware of how suitable the thing it's compositing is for thumbnailing, so I think it's a bit of a hack to do that option. But it's probably the path of least resistance. My handwavey plan is to add a flag to LayerManagerComposite/WebRenderBridgeParent which tracks whether we are contentful/thumbnailable. When we send a Layer transaction or webrender display list, we can set a flag which sets that true. When we ClearCachedResources() we set it false. When recieve a screenshot request, we immediately perform the screenshot if the flag is true, otherwise we wait.

Flags: needinfo?(jnicol)

After attempting to implement the discussed workarounds in graphics code, I've decided not to continue that approach. I wasn't happy with the code at all, and I think the other option of communicating to GeckoView when the contentful paint status is reset is much cleaner. It involves an API change (I've renamed GeckoSession.ContentDelegate.onFirstContentfulPaint() to GeckoSession.ContentDelegate.onContentfulPaintStatusChanged(boolean paintStatus), and some changes to android-component's engine-gecko-* components. Thanks to their abstractions the Thumbnail component is magically fixed without any changes required!

Patch up for this shortly, and I can submit a PR for the android-components changes too

Android-components listens to the GeckoView callback onFirstContentfulPaint to
track whether a contentful paint has occured, in order to decide when to
thumbnail a tab. Currently this gets fired once per tab.

However, when the GeckoSession is paused, we clear cached resources in the
compositor. This means that when the session is resumed, the compositor does not
have the necessary information to render the page (such as painted content
buffers, or the webrender display list). Because android-components attempts to
capture a new thumbnail immediately upon resuming, it ends up capturing a blank
thumbnail.

To fix this, change the onFirstContentfulPaint() callback to
onContentfulPaintStatusChanged(bool paintStatus), so that android-components can
be informed when the contenful paint is no longer visible. It can then wait
until the subsequent contentful paint occurs before capturing the thumbnail.

Assignee: nobody → jnicol
Status: NEW → ASSIGNED

Add test paintStatusReset() to ContentDelegateTest, which asserts that
onPaintStatusReset() is called after GeckoSession.setActive(false).

Add test capturePixelsSessionDeactivatedActivated() to ScreenshotTest, which
asserts that capturePixels() is successful if called when the session is
deactivated then reactivated, after waiting for the onPaintStatusReset() and
onFirstContentfulPaint() callbacks.

Depends on D87341

Pushed by jnicol@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/c0a323ca7d8b Add GeckoView API to listen for when contentful paint status has been reset. r=geckoview-reviewers,agi,esawin
Pushed by jnicol@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/f2663e59234a Add geckoview-junit tests for onPaintStatusReset. r=agi,geckoview-reviewers
Status: ASSIGNED → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → 81 Branch
Has Regression Range: --- → yes
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: