Closed Bug 1768197 (Opened 3 years ago, Closed 3 years ago)

Crash in [@ gfxPlatform::FallbackFromAcceleration]

Categories

(Core :: Graphics: WebRender, defect)

Unspecified
Android
defect

Tracking


RESOLVED FIXED
102 Branch
Tracking Status
firefox-esr91 --- unaffected
firefox100 --- unaffected
firefox101 + wontfix
firefox102 + fixed

People

(Reporter: amejia, Assigned: aosmond)

References

(Blocks 1 open bug)

Details

(Keywords: crash)

Crash Data

Attachments

(1 file)

Crash report: https://crash-stats.mozilla.org/report/index/564de69b-5ad6-40e7-8a5b-9179f0220506

MOZ_CRASH Reason: MOZ_CRASH(Fallback configurations exhausted)

Top 10 frames of crashing thread:

0 libxul.so gfxPlatform::FallbackFromAcceleration gfx/thebes/gfxPlatform.cpp:3420
1 libxul.so mozilla::gfx::GPUProcessManager::DisableWebRender gfx/ipc/GPUProcessManager.cpp:578
2 libxul.so mozilla::gfx::GPUProcessManager::NotifyWebRenderError gfx/ipc/GPUProcessManager.cpp:597
3 libxul.so mozilla::layers::CompositorManagerChild::RecvNotifyWebRenderError gfx/layers/ipc/CompositorManagerChild.cpp:257
4 libxul.so mozilla::layers::PCompositorManagerChild::OnMessageReceived ipc/ipdl/PCompositorManagerChild.cpp:567
5 libxul.so mozilla::ipc::MessageChannel::MessageTask::Run ipc/glue/MessageChannel.cpp:1535
6 libxul.so NS_ProcessNextEvent xpcom/threads/nsThreadUtils.cpp:465
7 libxul.so mozilla::ipc::MessagePump::Run ipc/glue/MessagePump.cpp:85
8 libxul.so MessageLoop::Run ipc/chromium/src/base/message_loop.cc:355
9 libxul.so nsBaseAppShell::Run widget/nsBaseAppShell.cpp:137

This is showing up in the Fenix Beta topcrash list. Can you please take a look, Andrew?

Sorry, this is probably more of a jnicol question.

Component: Graphics: WebGPU → Graphics: WebRender
Flags: needinfo?(aosmond) → needinfo?(jnicol)

Just a note: Jamie is on PTO until 25 May.

That'll be the middle of RC week. Is there someone else who can look into this in the meantime? Though now that I think of it, I wonder whether this will go away with the GPU process being disabled by bug 1768674. We'll be shipping beta.4 tomorrow with that change, so maybe we can revisit on Monday once we've seen some incoming crash data for that release.

Flags: needinfo?(bhood)

I'll check my team, but Jamie may have taken the bulk of the mobile expertise with him.

Flags: needinfo?(bhood)

I will look.

So we only trigger this crash if we hit a WebRenderError::NEW_SURFACE failure after already trying full HW WebRender with EGL and then partial HW WebRender with SWGL drawing and GL compositing. This can happen when we fail to create EGL surfaces on Android with both the EGL and EGL+SWGL backends.
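
To make that chain concrete, here is a minimal standalone sketch of the idea. The AccelConfig enum, MozCrash helper, and simplified FallbackFromAcceleration are assumptions for illustration only; the real code in gfxPlatform.cpp consults gfxConfig/pref state rather than a single enum.

#include <cstdio>
#include <cstdlib>

// Simplified, hypothetical acceleration configurations for the Android path
// described above; the real code tracks these via feature bits.
enum class AccelConfig {
  HardwareWebRender,    // full HW WebRender with EGL
  SoftwareWebRenderGL,  // SWGL drawing, GL compositing (still needs an EGL surface)
  Exhausted,
};

// Hypothetical stand-in for gfxPlatform::FallbackFromAcceleration: step down
// one configuration each time a NEW_SURFACE error is reported.
AccelConfig FallbackFromAcceleration(AccelConfig aCurrent) {
  switch (aCurrent) {
    case AccelConfig::HardwareWebRender:
      return AccelConfig::SoftwareWebRenderGL;
    default:
      return AccelConfig::Exhausted;
  }
}

// Stand-in for MOZ_CRASH: abort the process with a reason string.
[[noreturn]] void MozCrash(const char* aReason) {
  std::fprintf(stderr, "MOZ_CRASH(%s)\n", aReason);
  std::abort();
}

int main() {
  AccelConfig config = AccelConfig::HardwareWebRender;
  // Simulate failing to create an EGL surface with both the EGL and EGL+SWGL
  // backends, i.e. two NEW_SURFACE errors in a row.
  for (int failure = 0; failure < 2; ++failure) {
    config = FallbackFromAcceleration(config);
    if (config == AccelConfig::Exhausted) {
      MozCrash("Fallback configurations exhausted");
    }
    std::printf("fell back to configuration %d\n", static_cast<int>(config));
  }
  return 0;
}

Once SWGL drawing with GL compositing also fails to get a surface there is nothing left to try, which is the MOZ_CRASH(Fallback configurations exhausted) in the report above.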

I haven't found any evidence that this crash happens much without the GPU process (GPUProcessStatus is always Running in the crash annotations), so now that we've disabled it I expect the volume to go down, though I could be wrong.

Assuming it is tied to the GPU process:

  1. Why does it happen in the first place?
  2. If it is intrinsic to the platform/device, should we try to fall back again by disabling the GPU process?

Bug 1762424 introduced the explicit crash.

See Also: → 1762424

Thinking on it further, the uptime (the original report came after 44 minutes, for example) suggests that a user can encounter the NEW_SURFACE error during regular use. Crashing the parent process is problematic in that case -- the user may be in a generally stable state but unable to continue with the current GL context. We should consider either tearing down the GPU process and restarting it, or tearing down the compositor sessions and treating it like a device reset.
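
As a rough sketch of that direction (hypothetical names and thresholds; GpuProcessPolicy, OnNewSurfaceError, kMaxRestarts, and the 5-minute stability window are assumptions, not the patch that eventually landed), a policy could restart the GPU process for a NEW_SURFACE error while the process looks stable, and only fall back to a different acceleration configuration once restarting has stopped helping:

#include <chrono>
#include <cstdio>

using Clock = std::chrono::steady_clock;

// Hypothetical policy object; the real logic would live in GPUProcessManager
// and track unstable launches rather than a plain counter.
class GpuProcessPolicy {
 public:
  // Called when the compositor reports WebRenderError::NEW_SURFACE.
  void OnNewSurfaceError() {
    const auto uptime = Clock::now() - mLastLaunch;
    const bool stable =
        uptime > std::chrono::minutes(5) && mRecentRestarts < kMaxRestarts;
    if (stable) {
      // The process has been healthy for a while: treat this like a device
      // reset, tear down the compositor sessions and relaunch the process.
      ++mRecentRestarts;
      RestartGpuProcess();
    } else {
      // Restarting has not helped (or the error came right after launch):
      // fall back to the next acceleration configuration instead of crashing
      // the parent process.
      FallbackFromAcceleration();
    }
  }

 private:
  static constexpr int kMaxRestarts = 3;  // assumed threshold, not from the bug

  void RestartGpuProcess() {
    std::puts("restarting GPU process and recreating compositor sessions");
    mLastLaunch = Clock::now();
  }

  void FallbackFromAcceleration() {
    std::puts("disabling the current configuration and falling back");
  }

  Clock::time_point mLastLaunch = Clock::now();
  int mRecentRestarts = 0;
};

int main() {
  GpuProcessPolicy policy;
  // A NEW_SURFACE error immediately after launch takes the fallback branch;
  // the 44-minute uptime case from the report would take the restart branch.
  policy.OnNewSurfaceError();
  return 0;
}

The design choice here is simply to distinguish a transient surface-creation failure on an otherwise healthy process (restart and carry on) from a persistent one (fall back), rather than crashing the parent in both cases.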

Still seeing crashes with 101.0.0-beta.4 :(

Assignee: nobody → aosmond
Status: NEW → ASSIGNED
Attachment #9276753 - Attachment description: Bug 1768197 - Restart the GPU process when stable for WebRenderError::NEW_SURFACE errors. → Bug 1768197 - Handle WebRenderError::NEW_SURFACE errors more gracefully.
Attachment #9276753 - Attachment description: Bug 1768197 - Handle WebRenderError::NEW_SURFACE errors more gracefully. → Bug 1768197 - Handle WebRenderError errors more gracefully.
Pushed by jmuizelaar@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/9120c2f0a77f Handle WebRenderError errors more gracefully. r=jrmuizel
Status: ASSIGNED → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → 102 Branch
Flags: needinfo?(jnicol)

The patch landed in nightly and beta is affected.
:aosmond, is this bug important enough to require an uplift?
If not please set status_beta to wontfix.

For more information, please visit auto_nag documentation.

Flags: needinfo?(aosmond)

It's not clear to me that the patch has had any significant effect on the crash volume.

Jamie will be returning from PTO soon (in a little more than 12 hours), and I will have him focus on this immediately.

Thanks for picking this up in my absence, Andrew. I think the landed patch makes a lot of sense, as an OOM could, at least in theory, cause this to occur.

I agree it doesn't appear to have had a significant effect on the crash volume. Part of this is because I landed bug 1768925 just before I left, to prevent users from running into bug 1767128. The effect of this is that whenever the GPU process crashes on Android 12 (and it is still enabled on nightly) we run into this issue. If we break down the crash stats by Android version, we can see that the crash rate spikes on Android 12 (SDK level 31) following this, as expected.

Also, if we look at the breakdown by Android version, we can see that SDK level 28 (Android 9) is significantly higher than the others. I'm guessing that we therefore have a bona fide issue on Android 9 causing us to run into this bug. If we ignore the Android 12 and Android 9 crashes, then I think the crash rate is acceptably low (and Andrew's patch hopefully makes it even lower).

In bug 1767128 we will solve the Android 12 issue. We can reopen this bug or file a new one to investigate the Android 9 issue. In the interim I think we still want to force this crash, as the browser will be in an unusable state otherwise and it helps us detect potential issues like the suspected Android 9 one.

Do you think this patch is worth taking on 101 by itself?

Flags: needinfo?(jnicol)

I'm inclined to say no. Android 9 accounts for 78% of the crashes on Beta, and I suspect this patch will not help those. The GPU process / WR fallback logic is complex, and I don't think the risk of changing it is worth a slight reduction in the remaining 22%.

Flags: needinfo?(jnicol)

OK, let's move the follow-up work to a new bug for better tracking.

This is currently the #3 Fenix 101 topcrash since release. I think we need to spend more time investigating here :(

Flags: needinfo?(jnicol)

I am investigating.

Flags: needinfo?(jnicol)
Blocks: 1772839
Regressions: 1797068
See Also: → 1824083
Flags: needinfo?(aosmond)
Regressions: 1997848
