Closed Bug 1768197 (Opened 3 years ago, Closed 3 years ago)

Crash in [@ gfxPlatform::FallbackFromAcceleration]

Categories

(Core :: Graphics: WebRender, defect)

Unspecified
Android
defect

Tracking


RESOLVED FIXED
102 Branch
Tracking Status
firefox-esr91 --- unaffected
firefox100 --- unaffected
firefox101 + wontfix
firefox102 + fixed

People

(Reporter: amejia, Assigned: aosmond)

References

(Blocks 1 open bug)

Details

(Keywords: crash)

Crash Data

Attachments

(1 file)

Crash report: https://crash-stats.mozilla.org/report/index/564de69b-5ad6-40e7-8a5b-9179f0220506

MOZ_CRASH Reason: MOZ_CRASH(Fallback configurations exhausted)

Top 10 frames of crashing thread:

0 libxul.so gfxPlatform::FallbackFromAcceleration gfx/thebes/gfxPlatform.cpp:3420
1 libxul.so mozilla::gfx::GPUProcessManager::DisableWebRender gfx/ipc/GPUProcessManager.cpp:578
2 libxul.so mozilla::gfx::GPUProcessManager::NotifyWebRenderError gfx/ipc/GPUProcessManager.cpp:597
3 libxul.so mozilla::layers::CompositorManagerChild::RecvNotifyWebRenderError gfx/layers/ipc/CompositorManagerChild.cpp:257
4 libxul.so mozilla::layers::PCompositorManagerChild::OnMessageReceived ipc/ipdl/PCompositorManagerChild.cpp:567
5 libxul.so mozilla::ipc::MessageChannel::MessageTask::Run ipc/glue/MessageChannel.cpp:1535
6 libxul.so NS_ProcessNextEvent xpcom/threads/nsThreadUtils.cpp:465
7 libxul.so mozilla::ipc::MessagePump::Run ipc/glue/MessagePump.cpp:85
8 libxul.so MessageLoop::Run ipc/chromium/src/base/message_loop.cc:355
9 libxul.so nsBaseAppShell::Run widget/nsBaseAppShell.cpp:137

This is showing up in the Fenix Beta topcrash list. Can you please take a look, Andrew?

Sorry, this is probably more of a jnicol question.

Component: Graphics: WebGPU → Graphics: WebRender
Flags: needinfo?(aosmond) → needinfo?(jnicol)

Just a note: Jamie is on PTO until 25 May.

That'll be the middle of RC week. Is there someone else who can look into this in the meantime? Though now that I think of it, I wonder whether this will go away with the GPU process being disabled by bug 1768674. We'll be shipping beta.4 tomorrow with that change, so maybe we can revisit on Monday once we've seen some incoming crash data for that release.

Flags: needinfo?(bhood)

I'll check my team, but Jamie may have taken the bulk of the mobile expertise with him.

Flags: needinfo?(bhood)

I will look.

So we only trigger this crash if we hit a WebRenderError::NEW_SURFACE failure after already trying full HW WebRender with EGL and then partial HW WebRender with SWGL drawing and GL compositing. This can happen when we fail to create EGL surfaces on Android with both the EGL and EGL+SWGL backends.
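
To make that chain concrete, here is a minimal standalone sketch of the idea. The AccelConfig enum, MozCrash helper, and simplified FallbackFromAcceleration are assumptions for illustration only; the real code in gfxPlatform.cpp consults gfxConfig/pref state rather than a single enum.

#include <cstdio>
#include <cstdlib>

// Simplified, hypothetical acceleration configurations for the Android path
// described above; the real code tracks these via feature bits.
enum class AccelConfig {
  HardwareWebRender,    // full HW WebRender with EGL
  SoftwareWebRenderGL,  // SWGL drawing, GL compositing (still needs an EGL surface)
  Exhausted,
};

// Hypothetical stand-in for gfxPlatform::FallbackFromAcceleration: step down
// one configuration each time a NEW_SURFACE error is reported.
AccelConfig FallbackFromAcceleration(AccelConfig aCurrent) {
  switch (aCurrent) {
    case AccelConfig::HardwareWebRender:
      return AccelConfig::SoftwareWebRenderGL;
    default:
      return AccelConfig::Exhausted;
  }
}

// Stand-in for MOZ_CRASH: abort the process with a reason string.
[[noreturn]] void MozCrash(const char* aReason) {
  std::fprintf(stderr, "MOZ_CRASH(%s)\n", aReason);
  std::abort();
}

int main() {
  AccelConfig config = AccelConfig::HardwareWebRender;
  // Simulate failing to create an EGL surface with both the EGL and EGL+SWGL
  // backends, i.e. two NEW_SURFACE errors in a row.
  for (int failure = 0; failure < 2; ++failure) {
    config = FallbackFromAcceleration(config);
    if (config == AccelConfig::Exhausted) {
      MozCrash("Fallback configurations exhausted");
    }
    std::printf("fell back to configuration %d\n", static_cast<int>(config));
  }
  return 0;
}

Once SWGL drawing with GL compositing also fails to get a surface there is nothing left to try, which is the MOZ_CRASH(Fallback configurations exhausted) in the report above.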

I haven't found any evidence that this crash happens much without the GPU process (GPUProcessStatus is always Running in the crash annotations), so now that we've disabled it I expect the volume to go down, though I could be wrong.

Assuming it is tied to the GPU process:

  1. Why does it happen in the first place?
  2. If it is intrinsic to the platform/device, should we try to fall back again by disabling the GPU process?

Bug 1762424 introduced the explicit crash.

See Also: → 1762424

Thinking on it further, the uptime (the original report came after 44 minutes, for example) suggests that a user can encounter the NEW_SURFACE error during regular use. Crashing the parent process is problematic in that case -- the user may be in a generally stable state but unable to continue with the current GL context. We should consider either tearing down the GPU process and restarting it, or tearing down the compositor sessions and treating it like a device reset.
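
As a rough sketch of that direction (hypothetical names and thresholds; GpuProcessPolicy, OnNewSurfaceError, kMaxRestarts, and the 5-minute stability window are assumptions, not the patch that eventually landed), a policy could restart the GPU process for a NEW_SURFACE error while the process looks stable, and only fall back to a different acceleration configuration once restarting has stopped helping:

#include <chrono>
#include <cstdio>

using Clock = std::chrono::steady_clock;

// Hypothetical policy object; the real logic would live in GPUProcessManager
// and track unstable launches rather than a plain counter.
class GpuProcessPolicy {
 public:
  // Called when the compositor reports WebRenderError::NEW_SURFACE.
  void OnNewSurfaceError() {
    const auto uptime = Clock::now() - mLastLaunch;
    const bool stable =
        uptime > std::chrono::minutes(5) && mRecentRestarts < kMaxRestarts;
    if (stable) {
      // The process has been healthy for a while: treat this like a device
      // reset, tear down the compositor sessions and relaunch the process.
      ++mRecentRestarts;
      RestartGpuProcess();
    } else {
      // Restarting has not helped (or the error came right after launch):
      // fall back to the next acceleration configuration instead of crashing
      // the parent process.
      FallbackFromAcceleration();
    }
  }

 private:
  static constexpr int kMaxRestarts = 3;  // assumed threshold, not from the bug

  void RestartGpuProcess() {
    std::puts("restarting GPU process and recreating compositor sessions");
    mLastLaunch = Clock::now();
  }

  void FallbackFromAcceleration() {
    std::puts("disabling the current configuration and falling back");
  }

  Clock::time_point mLastLaunch = Clock::now();
  int mRecentRestarts = 0;
};

int main() {
  GpuProcessPolicy policy;
  // A NEW_SURFACE error immediately after launch takes the fallback branch;
  // the 44-minute uptime case from the report would take the restart branch.
  policy.OnNewSurfaceError();
  return 0;
}

The design choice here is simply to distinguish a transient surface-creation failure on an otherwise healthy process (restart and carry on) from a persistent one (fall back), rather than crashing the parent in both cases.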

Still seeing crashes with 101.0.0-beta.4 :(

Assignee: nobody → aosmond
Status: NEW → ASSIGNED
Attachment #9276753 - Attachment description: Bug 1768197 - Restart the GPU process when stable for WebRenderError::NEW_SURFACE errors. → Bug 1768197 - Handle WebRenderError::NEW_SURFACE errors more gracefully.
Attachment #9276753 - Attachment description: Bug 1768197 - Handle WebRenderError::NEW_SURFACE errors more gracefully. → Bug 1768197 - Handle WebRenderError errors more gracefully.
Pushed by jmuizelaar@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/9120c2f0a77f Handle WebRenderError errors more gracefully. r=jrmuizel
Status: ASSIGNED → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → 102 Branch
Flags: needinfo?(jnicol)

The patch landed in nightly and beta is affected.
:aosmond, is this bug important enough to require an uplift?
If not please set status_beta to wontfix.

For more information, please visit auto_nag documentation.

Flags: needinfo?(aosmond)

It's not clear to me that the patch has had any significant effect on the crash volume.

Jamie will be returning from PTO soon (in a little more than 12 hours), and I will have him focus on this immediately.

Thanks for picking this up in my absence, Andrew. I think the landed patch makes a lot of sense, as an OOM could, at least in theory, cause this to occur.

I agree it doesn't appear to have had a significant effect on the crash volume. Part of this is because I landed bug 1768925 just before I left, to prevent users from running into bug 1767128. The effect of this is that whenever the GPU process crashes on Android 12 (and it is still enabled on nightly) we run into this issue. If we break down the crash stats by Android version, we can see that the crash rate spikes on Android 12 (SDK level 31) following this, as expected.

Also, if we look at the breakdown by Android version, we can see that SDK level 28 (Android 9) is significantly higher than the others. I'm guessing that we therefore have a bona fide issue on Android 9 causing us to run into this bug. If we ignore the Android 12 and Android 9 crashes, then I think the crash rate is acceptably low (and Andrew's patch hopefully makes it even lower).

In bug 1767128 we will solve the Android 12 issue. We can reopen this bug or file a new one to investigate the Android 9 issue. In the interim I think we still want to force this crash, as the browser will be in an unusable state otherwise and it helps us detect potential issues like the suspected Android 9 one.

Do you think this patch is worth taking on 101 by itself?

Flags: needinfo?(jnicol)

I'm inclined to say no. Android 9 accounts for 78% of the crashes on Beta, and I suspect this patch will not help those. The GPU process / WR fallback logic is complex, and I don't think the risk of changing it is worth a slight reduction in the remaining 22%.

Flags: needinfo?(jnicol)

OK, let's move the follow-up work to a new bug for better tracking.

This is currently the #3 Fenix 101 topcrash since release. I think we need to spend more time investigating here :(

Flags: needinfo?(jnicol)

I am investigating.

Flags: needinfo?(jnicol)
Blocks: 1772839
Regressions: 1797068
See Also: → 1824083
Flags: needinfo?(aosmond)
Regressions: 1997848
