Closed Bug 1766127 Opened 3 years ago Closed 3 years ago

Page loads failing intermittently with black screen

Categories

(GeckoView :: General, defect, P1)

Firefox 101
Unspecified
All
defect

Tracking

(firefox-esr91 unaffected, firefox99 unaffected, firefox100+ disabled, firefox101+ fixed)

RESOLVED FIXED
101 Branch
Tracking Status
firefox-esr91 --- unaffected
firefox99 --- unaffected
firefox100 + disabled
firefox101 + fixed

People

(Reporter: csadilek, Assigned: jnicol)

References

(Regression)

Details

(Keywords: regression)

Attachments

(3 files)

This has been detected by Raptor tests and was also reported by Fenix users.

STR that work reliably for me in Fenix Nightly (101.0a1 #201587625):

  • Open Home Screen
  • Tap on wikipedia shortuct

Expected:
Page loads.

Actual:
Black (sometimes white) screen.

Needs to be repeated few times to trigger the problem. I've also encountered this just following a link.

This started happening at 2022-04-05 based on bug 1692701. It looks like a regression of 1753700, but still needs to be confirmed.

Fenix ticket: https://github.com/mozilla-mobile/fenix/issues/24782

Agi, can you take a look? Timing-wise the tests started failing right after we landed the patch for bug 1753700 to adjust process priorities, but I haven't confirmed manually. Happy to help. Luckily this is reproducible, see above.

Flags: needinfo?(agi)

There are no content process crashes reported and locally I don't see anything obvious in my logcat. Raptor tests have some "process died" messages though: https://firefoxci.taskcluster-artifacts.net/IHq16g62T7S_pz-iMsRmUg/0/public/test_info/logcat-ZY322HN8G4.log

I doubt Bug 1753700 can cause something like that, and the effects of this bug are exactly the same as Bug 1762025. And indeed, I can only reproduce this bug with the GPU process turned on on my Samsung S10e. I can look more on Monday but my guess is that we lose the GPU process at some point and we're not able to restart it for some reason.

FYI S1 is for browser breaking bugs, and I don't believe this one is one of them.

Severity: S1 → S3
Flags: needinfo?(agi) → needinfo?(jnicol)
Priority: P1 → --

[Tracking Requested - why for this release]: Didn't mean to remove tracking flags.

Can also reproduce, though it takes quite a lot of closing and opening tabs. And also haven't reproduced with the GPU process disabled, so it probably only occurs with it enabled.

But it doesn't look like the GPU process is crashing at all, nor that we have a unfixable issue rendering like in bug 1762025. Webrender seems to believe that there's nothing to render for some reason, so chooses not to. Rotating the device or clicking the url bar and back again seems to force it to start rendering again.

Flags: needinfo?(jnicol)

Interestingly, I cannot reproduce the problem on Beta/100, only Nightly/101, and the GPU process was enabled in 100 already. It seems like 100 is not affected.

I'm trying to bisect.

OK so good news is I can reproduce on GVE fairly easily:

  • close all tabs
  • open new tab (this is to work around a bug in GVE)
  • launch adb shell am start -W -n org.mozilla.geckoview_example/.GeckoViewActivity -a android.intent.action.VIEW -d "https://www.wikipedia.com" to load wikipedia in the new tab

trying this a few times I get a black screen.

Unfortunately at the moment I only have an Android 12 device available, where I can test 4/19-4/25 only due to Bug 1762424. If somebody has a non-android-12 device handy to bisect that would be great! Otherwise I'll do that tonight.

I have a pretty good idea what is going on here. Webrender is bailing on rendering here because there is no active document. This is because we didn't create the frame here because has_pixels() returned false. This is because the scene's device rect is zero-sized. The device rect gets set here in SetDisplayList() for the root WebRenderBridgeParent (ie the browser chrome, rather than the tab). It gets the size from AndroidCompositorWidget::GetClientSize(), which is returning 0x0 intermittently. This explains why interacting with the page (eg scrolling) doesn't cause us to render again, but changing the device's orientation does (as that will cause the window to resize and a new display list to be sent for the chrome, and the size will be valid by this point).

We've seen something similar before in bug 1749745.

Our widget size handling on android is a little weird. We mostly ignore the size set from the main thread and instead try to calculate the size from the Surface (which is set from the UI thread). This means it is zero initially, then when the compositor gets Resumed we can calculate the size from the Surface.

The reason we get a zero-size is because SyncResumeAndResize() sometimes occurs before the compositor has been initialized. Or more importantly, before the UI thread has been notified that the compositor has been initialized. This means we are unable to resume the compositor here. The compositor will instead be resumed after it is finished initializing here. This leaves us in a race between the first paint occuring (which will set the webrender display list) and the compositor being resumed (which will set the CompositorWidget size). Since these occur cross-process, from different threads, and using different IPDL protocols, we have no guarantee about their ordering.

I don't think this explains why this only affects 101 and not 100. Nor why it's GPU process specific. I guess it's just timing sensitive and the GPU process can affect the timings quite significantly.

I think we can fix this by allowing the CompositorWidget's size to be updated from the main thread, as well as from the UI thread like it currently is. If we add an IPDL call to PCompositorWidget to set the size, this will be guaranteed to occur before the PWebRenderBridge SetDisplayList IPDL call (as both protocols have the same manager)

Unfortunately I cannot reproduce in GVE even with these STR, so I cannot bisect either.

This reproduces very reliably if I change this DispatchToUiThread in to a DelayedDispatch with a 1000ms wait. Which seems to confirm my theory. And my proposed fix avoids the bug even with the delay.

On Android we have long since calculated the compositor widget size
from the Surface rather than using the main thread widget size. This
is to avoid a trip via the main thread in response to a
ResumeAndResize event on the UI thread.
AndroidCompositorWidget::mClientSize therefore gets calculated when
the compositor is resumed.

However, it is possible in some circumstances for the compositor to
receive a display list prior to it being resumed. In this bug's case,
SyncResumeAndResize is getting called before the UI thread has been
notified that the compositor has been initialized. It therefore cannot
resume the compositor, and we instead resume the compositor on the
subsequent NotifyCompositorCreated call. This starts a race between
the main thread paint and NotifyCompositorCreated being scheduled on
the UI thread then resuming the compositor via
PUiCompositorController. If we receive
WebRenderBridgeParent::RecvSetDisplayList prior to
UiCompositorControllerParent::RecvResumeAndResize, then
AndroidCompositorWidget::mClientSize will be zero. This results in us
setting zero-sized webrender scene rect, leading to webrender exiting
early during rendering, resulting in a black screen.

To fix this, allow the main thread to set the AndroidCompositorWidget
size, in addition to the UI thread being able to set it. We do so by
adding a NotifyClientSizeChanged method to
PlatformCompositorWidgetDelegate, called from
nsWindow::OnSizeChanged. The cross-process implementation of this uses
an IPDL call on PCompositorWidget, which shares a top-level protocol
with PWebRenderBridge, meaning calls are guaranteed to occur in
order. This means a resize event on the main thread is guaranteed to
set the CompositorWidget size before the display list from the
subsequent paint is received.

Assignee: nobody → jnicol
Status: NEW → ASSIGNED

Do we need to uplift this fix to Beta 100? It will merge to the Release channel next week, so we have very little time to test a fix in Beta.

Priority: -- → P1

I bisected this and the bisection does point to Bug 1331109 as the root cause.

My test was:

  • close tab (GVE has 0 tabs now)
  • open tab

until the tab becomes black (instead of white) indicating that nothing is painting on the surface.

13:12.17 INFO: No more integration revisions, bisection finished.
13:12.17 INFO: Last good revision: c526b3d6a76d508e67fa61483669605b4ed89f4b
13:12.17 INFO: First bad revision: 1e862c8fcaceb95c6577bc32be1169821af8faff
13:12.17 INFO: Pushlog:
https://hg.mozilla.org/integration/autoland/pushloghtml?fromchange=c526b3d6a76d508e67fa61483669605b4ed89f4b&tochange=1e862c8fcaceb95c6577bc32be1169821af8faff
Regressed by: 1331109
Pushed by jnicol@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/18dedaa743e2 Allow AndroidCompositorWidget::mClientSize to be updated from main thread. r=agi,gfx-reviewers,lsalzman

(In reply to Chris Peterson [:cpeterson] from comment #12)

Do we need to uplift this fix to Beta 100? It will merge to the Release channel next week, so we have very little time to test a fix in Beta.

Yes, we should uplift this. I'll request as soon as it's landed

The nominated patch does not graft cleanly to release

widget/android/AndroidCompositorWidget.h
widget/android/CompositorWidgetParent.h
widget/android/InProcessAndroidCompositorWidget.h

Flags: needinfo?(jnicol)

Given that we found Bug 1766127 so late in the beta cycle, let's disable the
GPU process for one more version, the GPU process will ride the trains on 101
instead.

Comment on attachment 9273704 [details]
Bug 1766127 - Disable GPU process in beta.

Beta/Release Uplift Approval Request

  • User impact if declined: Potentially undiscovered bugs that cause pages to go blank intermittently
  • Is this code covered by automated tests?: No
  • Has the fix been verified in Nightly?: No
  • Needs manual test from QE?: No
  • If yes, steps to reproduce:
  • List of other uplifts needed: None
  • Risk to taking this patch: Medium
  • Why is the change risky/not risky? (and alternatives if risky): This code path has been the default for years and is the default for Android 12 devices, so we don't expect regressions.
  • String changes made/needed:
  • Is Android affected?: Yes
Attachment #9273704 - Flags: approval-mozilla-beta?

Comment on attachment 9273704 [details]
Bug 1766127 - Disable GPU process in beta.

Approved for 100.0rc1

Attachment #9273704 - Flags: approval-mozilla-beta? → approval-mozilla-release+

On Android we have long since calculated the compositor widget size
from the Surface rather than using the main thread widget size. This
is to avoid a trip via the main thread in response to a
ResumeAndResize event on the UI thread.
AndroidCompositorWidget::mClientSize therefore gets calculated when
the compositor is resumed.

However, it is possible in some circumstances for the compositor to
receive a display list prior to it being resumed. In this bug's case,
SyncResumeAndResize is getting called before the UI thread has been
notified that the compositor has been initialized. It therefore cannot
resume the compositor, and we instead resume the compositor on the
subsequent NotifyCompositorCreated call. This starts a race between
the main thread paint and NotifyCompositorCreated being scheduled on
the UI thread then resuming the compositor via
PUiCompositorController. If we receive
WebRenderBridgeParent::RecvSetDisplayList prior to
UiCompositorControllerParent::RecvResumeAndResize, then
AndroidCompositorWidget::mClientSize will be zero. This results in us
setting zero-sized webrender scene rect, leading to webrender exiting
early during rendering, resulting in a black screen.

To fix this, allow the main thread to set the AndroidCompositorWidget
size, in addition to the UI thread being able to set it. We do so by
adding a NotifyClientSizeChanged method to
PlatformCompositorWidgetDelegate, called from
nsWindow::OnSizeChanged. The cross-process implementation of this uses
an IPDL call on PCompositorWidget, which shares a top-level protocol
with PWebRenderBridge, meaning calls are guaranteed to occur in
order. This means a resize event on the main thread is guaranteed to
set the CompositorWidget size before the display list from the
subsequent paint is received.

Comment on attachment 9273794 [details]
Bug 1766127 - Allow AndroidCompositorWidget::mClientSize to be updated from main thread.

Beta/Release Uplift Approval Request

  • User impact if declined: Potential black screens when opening a new tab (though less likely now we have disabled the GPU process)
  • Is this code covered by automated tests?: Yes
  • Has the fix been verified in Nightly?: Yes
  • Needs manual test from QE?: No
  • If yes, steps to reproduce:
  • List of other uplifts needed: None
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): Just ensures the compositor widget size is correctly set from the main thread. If the main-thread widget size was ever wrong we'd encounter other issues, so this should be safe
  • String changes made/needed:
  • Is Android affected?: Yes
Flags: needinfo?(jnicol)
Attachment #9273794 - Flags: approval-mozilla-release?

D144687 is the patch rebased on release. I think we should take it even though the GPU process has now been disabled, as the bug could still occur anyway (it's just a timing issue)

Flags: needinfo?(dsmith)
Status: ASSIGNED → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → 101 Branch
Has Regression Range: --- → yes

Comment on attachment 9273794 [details]
Bug 1766127 - Allow AndroidCompositorWidget::mClientSize to be updated from main thread.

Approved for 100.0rc2

Flags: needinfo?(dsmith)
Attachment #9273794 - Flags: approval-mozilla-release? → approval-mozilla-release+
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: