1747116 - Crash in [@ mozilla::wr::RenderCompositorOGLSWGL::Resume]

Ryan VanderMeulen [:RyanVM]

Reporter

Description

•

3 years ago

Crash report: https://crash-stats.mozilla.org/report/index/93a1e7e9-6e1b-4a22-ad88-2c2930211221

Looks like this has been around since Fenix 93, but it's currently up to #10 overall in the topcrash list for 95.2.0.

Reason: SIGSEGV / SEGV_MAPERR

Top 10 frames of crashing thread:

0 None @0x9f1699be 
1 libxul.so mozilla::wr::RenderCompositorOGLSWGL::Resume gfx/webrender_bindings/RenderCompositorOGLSWGL.cpp:236
2 libxul.so mozilla::wr::WebRenderAPI::Resume gfx/webrender_bindings/WebRenderAPI.cpp:623
3 libxul.so mozilla::wr::RenderThread::RunEvent gfx/webrender_bindings/RenderThread.cpp:428
4 libxul.so mozilla::detail::RunnableMethodImpl<mozilla::wr::RenderThread*, void  xpcom/threads/nsThreadUtils.h:1200
5 libxul.so nsThread::ProcessNextEvent xpcom/threads/nsThread.cpp:1169
6 libxul.so mozilla::ipc::MessagePumpForNonMainThreads::Run ipc/glue/MessagePump.cpp:330
7 libxul.so MessageLoop::Run ipc/chromium/src/base/message_loop.cc:306
8 libxul.so nsThread::ThreadFunc xpcom/threads/nsThread.cpp:391
9 libnss3.so _pt_root nsprpub/pr/src/pthreads/ptthread.c:201

Ryan VanderMeulen [:RyanVM]

Reporter

Updated

•

3 years ago

Updated

•

3 years ago

Comment 1

•

3 years ago

Assuming bug 1734637 and bug 1740738 are the same issue (and they sure appear to be), this is actually closer to the #3 overall topcrash for Fenix at the moment.

Ryan VanderMeulen [:RyanVM]

Reporter

Updated

•

3 years ago

Blocks: gfx-triage

Severity: S2 → --

Andrew Osmond [:aosmond] (he/him)

Comment 2

•

3 years ago

This signature would have morphed due to bug 1741156. jnicol, do you think this is still a problem?

Flags: needinfo?(jnicol)

Comment 3

•

3 years ago

Looks like it is still a problem, adding the new signature.

Crash Signature: [@ mozilla::wr::RenderCompositorOGLSWGL::Resume] → [@ mozilla::wr::RenderCompositorOGLSWGL::Resume] [@ mozilla::widget::AndroidCompositorWidget::GetClientSize]

Andrew Osmond [:aosmond] (he/him)

Comment 4

•

3 years ago

This seems to be the very first crash report with this signature:

https://crash-stats.mozilla.org/report/index/42d28149-6381-4fb9-957c-5b0320210806#tab-details

Andrew Osmond [:aosmond] (he/him)

Comment 5

•

3 years ago

I don't know how to figure out the Fenix nightly release schedule, but the pushlog from the start of 92 to the first crash report is:

https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=5dee15cf3f28&tochange=d66e0aca4012

We did enable fallback to SW-WR on Android in 92. I wonder if that is related to bug 1721190. That wouldn't have caused it directly but it might be the first time users were falling back in a build.

Andrew Osmond [:aosmond] (he/him)

Comment 6

•

3 years ago

It looks like all of the reports have Fallback WR to SW-WR in the graphics critical log which paints a bulls eye on bug 1721190.

Regressed by: 1721190

BMO Automation

Updated

•

3 years ago

Has Regression Range: --- → yes

Andrew Osmond [:aosmond] (he/him)

Comment 7

•

3 years ago

https://searchfox.org/mozilla-central/rev/36e240b0ddb7227290a6614117ae6720c010357f/widget/android/AndroidCompositorWidget.cpp#86

It is just a nullptr crash so I guess we failed to get a valid native window here.

Andrew Osmond [:aosmond] (he/him)

Comment 8

•

3 years ago

•

Edited

I wonder if this could be a race condition where:

https://searchfox.org/mozilla-central/rev/36e240b0ddb7227290a6614117ae6720c010357f/widget/android/CompositorWidgetParent.cpp#41

gets called too late and so we haven't set AndroidCompositorWidget::mSurface yet. Perhaps CompositorWidgetParent::OnCompositorSurfaceChanged consistently gets called on startup, but when we fallback, it doesn't get called again?

We would have destroyed and recreated CompositorWidgetParent objects as part of the fallback.

Andrew Osmond [:aosmond] (he/him)

Comment 9

•

3 years ago

It looks like we just store the surfaces in an array indexed by the widget ID:

https://searchfox.org/mozilla-central/rev/36e240b0ddb7227290a6614117ae6720c010357f/mobile/android/geckoview/src/main/java/org/mozilla/gecko/process/GeckoServiceGpuProcess.java#53

We get the widget ID here:

https://searchfox.org/mozilla-central/rev/36e240b0ddb7227290a6614117ae6720c010357f/widget/android/nsWindow.cpp#2587

and we create the binding between the surface and the widget ID here:

https://searchfox.org/mozilla-central/rev/36e240b0ddb7227290a6614117ae6720c010357f/widget/android/nsWindow.cpp#2197

Andrew Osmond [:aosmond] (he/him)

Comment 10

•

3 years ago

The above is actually when we had a GPU process and all of these crashes are without. This is when mSurface gets set:

https://searchfox.org/mozilla-central/rev/36e240b0ddb7227290a6614117ae6720c010357f/widget/android/InProcessAndroidCompositorWidget.cpp#47

Andrew Osmond [:aosmond] (he/him)

Comment 11

•

3 years ago

That needs to be triggered by:

https://searchfox.org/mozilla-central/rev/36e240b0ddb7227290a6614117ae6720c010357f/widget/android/AndroidCompositorWidget.cpp#77

There is a sync method which triggers that:

https://searchfox.org/mozilla-central/rev/36e240b0ddb7227290a6614117ae6720c010357f/gfx/layers/ipc/PUiCompositorController.ipdl#31

The path that calls that on Android is LayerViewSupport::SyncResumeCompositor which appears to be called from Java:

https://searchfox.org/mozilla-central/rev/36e240b0ddb7227290a6614117ae6720c010357f/widget/android/nsWindow.cpp#1144

Andrew Osmond [:aosmond] (he/him)

Comment 12

•

3 years ago

Or:

https://searchfox.org/mozilla-central/rev/36e240b0ddb7227290a6614117ae6720c010357f/widget/android/nsWindow.cpp#1165

Andrew Osmond [:aosmond] (he/him)

Comment 13

•

3 years ago

Attached file Bug 1747116 - Initialize InProcessAndroidCompositorWidget::mSurface greedily. (obsolete) — Details

This patch is a speculative fix to the crashes observed in bug 1747116.
The crash reports suggest mSurface is null when we attempt to get the
native window size during compositing. This always appears to happen
after we have fallen back from WR to SW-WR on Android. Given that
mSurface is only populated through an event generated by the upper
layers of Android, which may not be aware we recreated the underlying
compositor objects, this patch attempts to greedily initialize mSurface
in case it already exists.

Phabricator Automation

Updated

•

3 years ago

Assignee: nobody → aosmond

Status: NEW → ASSIGNED

Andrew Osmond [:aosmond] (he/him)

Comment 14

•

3 years ago

I've attached a speculative fix. I have no certainty that it will fix it. If it does, more work will need to be done for the GPU process in this scenario.

BugBot [:suhaib / :marco/ :calixte]

Comment 15

•

3 years ago

Set release status flags based on info from the regressing bug 1721190

status-firefox95: --- → affected

status-firefox96: --- → affected

status-firefox97: --- → affected

status-firefox-esr91: --- → unaffected

Jamie Nicol [:jnicol]

Assignee

Comment 16

•

3 years ago

I don't think this theory explains why the crash can happen on release. Prior to my recent changes the compositor simply reads the Surface from the widget - there is no OnCompositorSurfaceChanged function so it can't be that it hasn't been called.

And since those changes, the compositor should be initialized in a paused state so shouldn't be attempting to render until it is resumed here, just after we call OnCompositorSurfaceChanged.

My hunch is that the problem is here. If the android front end invalidates the surface and attempts to pause the compositor (eg the app is minimised) during fallback (meaning the UICompositorController is null), then we don't set mCompositorPaused to true. In which case we'll resume the compositor post initialization even though the surface is invalid. I think we want to move the mCompositorPaused = true line out of the if statement so we set it unconditionally.

Flags: needinfo?(jnicol)

BugBot [:suhaib / :marco/ :calixte]

Updated

•

3 years ago

Keywords: regression

Chris Peterson [:cpeterson]

Comment 17

•

3 years ago

[@ mozilla::detail::MutexImpl::mutexLock | mozilla::ipc::PortLink::SendMessage] crashes like bp-43ffc7e5-c8b5-4277-bfc6-3facb0220103 are also AndroidCompositorWidget::GetClientSize() crashes.

Crash Signature: [@ mozilla::wr::RenderCompositorOGLSWGL::Resume] [@ mozilla::widget::AndroidCompositorWidget::GetClientSize] → [@ mozilla::wr::RenderCompositorOGLSWGL::Resume] [@ mozilla::widget::AndroidCompositorWidget::GetClientSize] [@ mozilla::detail::MutexImpl::mutexLock | mozilla::ipc::PortLink::SendMessage ]

Jamie Nicol [:jnicol]

Assignee

Comment 18

•

3 years ago

Hmm, I think the volume is too high for my theory in comment 16 to be correct. It seems unlikely that many people are minimizing the app at the same time as a fallback occuring.

Looking at the cash stats there are a lot of OOM warnings in the "graphics critical error" field. So probably it's just that the surface creation fails and this is the first place we don't handle the null pointer correctly. I'll try to see if I can reproduce

Dianna Smith [:diannaS]

Updated

•

3 years ago

status-firefox95: affected → wontfix

tracking-firefox96: --- → +

tracking-firefox97: --- → +

Jamie Nicol [:jnicol]

Assignee

Comment 19

•

3 years ago

Okay so I think the OOM is a slight red herring, except for the fact that since bug 1731748 landed OOMs easily cause us to reinitialize the compositor (as a GL_OUT_OF_MEMORY error gets reported as a renderer error, which causes fallback.)

I believe the root of the problem is here - note the typo OZ_WIDGET instead of MOZ_WIDGET. This means we don't call DestroyEGLSurface() when RenderCompositorOGLSWGL is destroyed. This only affects RenderCompositorOGLSWGL, not RenderCompositorEGL, meaning it only happens after the 2nd fallback.

The effect of this is that we get stuck in a loop attempting to reinitialize the compositor:

    BufferQueueProducer: [SurfaceView - org.mozilla.geckoview_example/org.mozilla.geckoview_example.GeckoViewActivity#0] connect: already connected (cur=1 req=1)
    libEGL  : eglCreateWindowSurface: native_window_api_connect (win=0x77cdd06010) failed (0xffffffea) (already connected to another API?)
    libEGL  : eglCreateWindowSurfaceTmpl:729 error 3003 (EGL_BAD_ALLOC)
    Gecko   : [GFX1-]: Failed to create EGLSurface
    CompositorBridgeParent: Unable to renew compositor surface; remaining in paused state
    Gecko   : [GFX1-]: Failed to connect WebRenderBridgeChild.

We cannot create an EGL surface because the window is already in use because we didn't destroy the previous surface.

This doesn't cause a crash, but does cause the app to freeze indefinitely. I can then indeed trigger the crash by minimizing the app as speculated in comment 16. And I don't think it's implausible as I stated in comment 18, because I reckon plenty of users would minimize the app if it's frozen, and the race condition isn't very hard to trigger since we're stuck in a loop trying to reinitialize.

In order to fix this and guard against future problems, I propose we:

Fix the typo so that we destroy the old egl surface when the compositor is destroyed
Unconditionally mark the compositor as paused when the java surface is destroyed
Make AndroidCompositorWidget::OnResumeComposition() fallible, like WebRenderBridgeParent::Resume() is.
- We can query the size once here rather than for every AndroidCompositorWidget::GetClientSize() call. And guard against a nullptr. If we have a null pointer then return false, and CompositorBridgeParent::ResumeComposition() will handle that error by not resuming.

Jim Mathies [:jimm]

Updated

•

3 years ago

Severity: -- → S2

Jamie Nicol [:jnicol]

Assignee

Comment 20

•

3 years ago

Attached file Bug 1747116 - Ensure that RenderCompositorOGLSWGL destroys its EGL surface on Android. r?#gfx-reviewers — Details

Due to a typo in a preprocessor condition, the RenderCompositorOGLSWGL
destructor was not destroying its EGL surface on Android. This meant
that subsequent RenderCompositorOGLSWGL instances were unable to
initialize, as the underlying window surface was still in use. After
attempting to reinitialize the compositor in response to an error, this
resulted in an endless loop of trying and failing to initialize.

After fixing this typo, RenderCompositorOGLSWGL correctly cleans up
after itself and subsequent instances are able to initialize.

Jamie Nicol [:jnicol]

Assignee

Comment 21

•

3 years ago

Attached file Bug 1747116 - Guard against null native window in AndroidCompositorWidget r?#gfx-reviewers — Details

If AndroidCompositorWidget's surface reference points to a surface
that has been destroyed, we can end up with a null ANativeWindow
pointer. This can result in crashes when using it to query the window
size.

This patch makes it so that we use the native window to query the size
only when the surface has changed rather than for every call to
GetClientSize(). This allows us to guard against a null pointer in a
single place. If we have a null pointer then return false from
OnCompositorSurfaceChanged(). CompositorBridgeParent::ResumeComposition()
will handle that by not resuming the compositor, like it already does
if WebRenderBridgeParent::Resume() fails.

Additonally, when we receive a pause event from GeckoView ensure that
we always set the mCompositorPaused flag to true, even if the
UiCompositorController is null. This avoids a possible cause of the
situation described above - if we receive a pause event (eg the app is
minimised) during compositor reinitialization (while the
UiCompositorController is temporarily null) we would not set that flag
to true, and would therefore resume compositing in to an invalid
surface.

Depends on D135117

Pulsebot

Comment 22

•

3 years ago

Pushed by jnicol@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/46add7687005 Ensure that RenderCompositorOGLSWGL destroys its EGL surface on Android. r=gfx-reviewers,nical

Jamie Nicol [:jnicol]

Assignee

Comment 23

•

3 years ago

Comment on attachment 9257637 [details]
Bug 1747116 - Ensure that RenderCompositorOGLSWGL destroys its EGL surface on Android. r?#gfx-reviewers

Beta/Release Uplift Approval Request

User impact if declined: Hangs and crashes for users on Android
Is this code covered by automated tests?: No
Has the fix been verified in Nightly?: No
Needs manual test from QE?: No
If yes, steps to reproduce:
List of other uplifts needed: None
Risk to taking this patch: Low
Why is the change risky/not risky? (and alternatives if risky): Ensures we clean up resources correctly. Hardware-accelerated webrender already does this correctly, this fixes a typo in the corresponding software webrender path to make it do the same
String changes made/needed:

Attachment #9257637 - Flags: approval-mozilla-beta?

Jamie Nicol [:jnicol]

Assignee

Comment 24

•

3 years ago

I think we should uplift D135117 ASAP. That believe that should fix the root cause of most of the crashes.

D135118 is still blocked on geckoview review, when that lands we can decide whether to take it along too.

Flags: needinfo?(dsmith)

Phabricator Automation

Updated

•

3 years ago

Attachment #9256760 - Attachment is obsolete: true

Alexandru Michis [:malexandru]

Comment 25

•

3 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/46add7687005

Status: ASSIGNED → RESOLVED

Closed: 3 years ago

status-firefox97: affected → fixed

Resolution: --- → FIXED

Target Milestone: --- → 97 Branch

Dianna Smith [:diannaS]

Comment 26

•

3 years ago

Comment on attachment 9257637 [details]
Bug 1747116 - Ensure that RenderCompositorOGLSWGL destroys its EGL surface on Android. r?#gfx-reviewers

Approved for 96.0rc2

Flags: needinfo?(dsmith)

Attachment #9257637 - Flags: approval-mozilla-beta? → approval-mozilla-release+

Dianna Smith [:diannaS]

Comment 27

•

3 years ago

bugherder uplift

https://hg.mozilla.org/releases/mozilla-release/rev/eb0638f7a0a9

status-firefox96: affected → fixed

Jim Mathies [:jimm]

Updated

•

3 years ago

No longer blocks: gfx-triage

Pulsebot

Comment 28

•

3 years ago

Pushed by jnicol@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/3adb5c71ca3b Guard against null native window in AndroidCompositorWidget r=gfx-reviewers,geckoview-reviewers,aosmond,m_kato

Marian-Vasile Laza

Comment 29

•

3 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/3adb5c71ca3b

Andrew Osmond [:aosmond] (he/him)

Updated

•

3 years ago

Assignee: aosmond → jnicol

Ryan VanderMeulen [:RyanVM]

Reporter

Comment 30

•

3 years ago

From what I can see from the incoming crash reports, the first patch doesn't seem to have moved the needle much if at all, but the second patch is looking good so far on Nightly. Setting the status for 96 back to affected accordingly.

status-firefox96: fixed → affected

Jamie Nicol [:jnicol]

Assignee

Comment 31

•

3 years ago

I guess that makes sense, though is disappointing. The first patch attempted to prevent us getting in to a bad state via the only way I knew how to reproduce it, but users in the wild must be reproducing it another way. The second patch prevents the crash whilst in that state.

So we should probably uplift the second patch too if we have a dot release planned? We should also take bug 1749745 too, though. As I believe D135117 in some cases will just replace the crash with the app effectively freezing instead (nothing being rendered)

Flags: needinfo?(ryanvm)

Jamie Nicol [:jnicol]

Assignee

Updated

•

3 years ago

Regressions: 1749745

Ryan VanderMeulen [:RyanVM]

Reporter

Comment 32

•

3 years ago

We'll probably have a dot release some time next week, so yeah, we might as well try to fix this if we can ship the fix for bug 1749745 alongside it. It would be helpful if we can get bug 1749745 into a beta build soon too if at all possible.

Flags: needinfo?(ryanvm)

Jamie Nicol [:jnicol]

Assignee

Comment 33

•

3 years ago

Comment on attachment 9257638 [details]
Bug 1747116 - Guard against null native window in AndroidCompositorWidget r?#gfx-reviewers

Beta/Release Uplift Approval Request

User impact if declined: Crashes for Android users
Is this code covered by automated tests?: No
Has the fix been verified in Nightly?: Yes
Needs manual test from QE?: No
If yes, steps to reproduce:
List of other uplifts needed: Bug 1749745
Risk to taking this patch: Low
Why is the change risky/not risky? (and alternatives if risky): Guards against null pointer dereference.
String changes made/needed:

Attachment #9257638 - Flags: approval-mozilla-release?

Dianna Smith [:diannaS]

Comment 34

•

3 years ago

Comment on attachment 9257638 [details]
Bug 1747116 - Guard against null native window in AndroidCompositorWidget r?#gfx-reviewers

Approved for 96.0.2

Attachment #9257638 - Flags: approval-mozilla-release? → approval-mozilla-release+

Dianna Smith [:diannaS]

Comment 35

•

3 years ago

bugherder uplift

https://hg.mozilla.org/releases/mozilla-release/rev/093a7bd577ca

status-firefox96: affected → fixed

BugBot [:suhaib / :marco/ :calixte]

Updated

•

3 years ago

Crash Signature: [@ mozilla::wr::RenderCompositorOGLSWGL::Resume] [@ mozilla::widget::AndroidCompositorWidget::GetClientSize] [@ mozilla::detail::MutexImpl::mutexLock | mozilla::ipc::PortLink::SendMessage ] → [@ mozilla::wr::RenderCompositorOGLSWGL::Resume] [@ mozilla::widget::AndroidCompositorWidget::GetClientSize] [@ mozilla::detail::MutexImpl::mutexLock | mozilla::ipc::PortLink::SendMessage ] [@ mozilla::detail::MutexImpl::mutexTryLock | nsTimerImpl::nsTi…

Ryan VanderMeulen [:RyanVM]

Reporter

Comment 39

•

3 years ago

No signs of this crash in 96.2.0 \m/

Crash Signature: [@ mozilla::wr::RenderCompositorOGLSWGL::Resume] [@ mozilla::widget::AndroidCompositorWidget::GetClientSize] [@ mozilla::detail::MutexImpl::mutexLock | mozilla::ipc::PortLink::SendMessage ] [@ mozilla::detail::MutexImpl::mutexTryLock | → [@ mozilla::wr::RenderCompositorOGLSWGL::Resume] [@ mozilla::widget::AndroidCompositorWidget::GetClientSize] [@ mozilla::detail::MutexImpl::mutexLock | mozilla::ipc::PortLink::SendMessage ] [@ mozilla::detail::MutexImpl::mutexTryLock |

BugBot [:suhaib / :marco/ :calixte]

Updated

•

3 years ago

Crash Signature: nsTimerImpl::nsTimerImpl] → nsTimerImpl::nsTimerImpl] [@ libandroid.so@0x15208] [@ libandroid.so@0x8b7a]

BugBot [:suhaib / :marco/ :calixte]

Updated

•

3 years ago

Crash Signature: nsTimerImpl::nsTimerImpl] [@ libandroid.so@0x15208] [@ libandroid.so@0x8b7a] → nsTimerImpl::nsTimerImpl] [@ libandroid.so@0x15208] [@ libandroid.so@0x8b7a] [@ libandroid.so@0x155b9]

Bug 1747116 - Initialize InProcessAndroidCompositorWidget::mSurface greedily. 3 years ago Andrew Osmond [:aosmond] (he/him) 48 bytes, text/x-phabricator-request		Details \| Review
Bug 1747116 - Ensure that RenderCompositorOGLSWGL destroys its EGL surface on Android. r?#gfx-reviewers 3 years ago Jamie Nicol [:jnicol] 48 bytes, text/x-phabricator-request	diannaS : approval-mozilla-release+	Details \| Review
Bug 1747116 - Guard against null native window in AndroidCompositorWidget r?#gfx-reviewers 3 years ago Jamie Nicol [:jnicol] 48 bytes, text/x-phabricator-request	diannaS : approval-mozilla-release+	Details \| Review