Closed Bug 1747116 Opened 5 months ago Closed 5 months ago

Crash in [@ mozilla::wr::RenderCompositorOGLSWGL::Resume]

Categories

(Core :: Graphics: WebRender, defect)

Unspecified
Android
defect

Tracking

()

RESOLVED FIXED
97 Branch
Tracking Status
firefox-esr91 --- unaffected
firefox95 --- wontfix
firefox96 + fixed
firefox97 + fixed

People

(Reporter: RyanVM, Assigned: jnicol)

References

(Regression)

Details

(Keywords: crash, regression)

Crash Data

Attachments

(2 files, 1 obsolete file)

Crash report: https://crash-stats.mozilla.org/report/index/93a1e7e9-6e1b-4a22-ad88-2c2930211221

Looks like this has been around since Fenix 93, but it's currently up to #10 overall in the topcrash list for 95.2.0.

Reason: SIGSEGV / SEGV_MAPERR

Top 10 frames of crashing thread:

0 None @0x9f1699be 
1 libxul.so mozilla::wr::RenderCompositorOGLSWGL::Resume gfx/webrender_bindings/RenderCompositorOGLSWGL.cpp:236
2 libxul.so mozilla::wr::WebRenderAPI::Resume gfx/webrender_bindings/WebRenderAPI.cpp:623
3 libxul.so mozilla::wr::RenderThread::RunEvent gfx/webrender_bindings/RenderThread.cpp:428
4 libxul.so mozilla::detail::RunnableMethodImpl<mozilla::wr::RenderThread*, void  xpcom/threads/nsThreadUtils.h:1200
5 libxul.so nsThread::ProcessNextEvent xpcom/threads/nsThread.cpp:1169
6 libxul.so mozilla::ipc::MessagePumpForNonMainThreads::Run ipc/glue/MessagePump.cpp:330
7 libxul.so MessageLoop::Run ipc/chromium/src/base/message_loop.cc:306
8 libxul.so nsThread::ThreadFunc xpcom/threads/nsThread.cpp:391
9 libnss3.so _pt_root nsprpub/pr/src/pthreads/ptthread.c:201
See Also: → 1740738
See Also: → 1734637

Assuming bug 1734637 and bug 1740738 are the same issue (and they sure appear to be), this is actually closer to the #3 overall topcrash for Fenix at the moment.

Blocks: gfx-triage
Severity: S2 → --

This signature would have morphed due to bug 1741156. jnicol, do you think this is still a problem?

Flags: needinfo?(jnicol)
See Also: → 1741156

Looks like it is still a problem, adding the new signature.

Crash Signature: [@ mozilla::wr::RenderCompositorOGLSWGL::Resume] → [@ mozilla::wr::RenderCompositorOGLSWGL::Resume] [@ mozilla::widget::AndroidCompositorWidget::GetClientSize]

I don't know how to figure out the Fenix nightly release schedule, but the pushlog from the start of 92 to the first crash report is:

https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=5dee15cf3f28&tochange=d66e0aca4012

We did enable fallback to SW-WR on Android in 92. I wonder if that is related to bug 1721190. That wouldn't have caused it directly but it might be the first time users were falling back in a build.

It looks like all of the reports have Fallback WR to SW-WR in the graphics critical log which paints a bulls eye on bug 1721190.

Regressed by: 1721190
Has Regression Range: --- → yes

I wonder if this could be a race condition where:

https://searchfox.org/mozilla-central/rev/36e240b0ddb7227290a6614117ae6720c010357f/widget/android/CompositorWidgetParent.cpp#41

gets called too late and so we haven't set AndroidCompositorWidget::mSurface yet. Perhaps CompositorWidgetParent::OnCompositorSurfaceChanged consistently gets called on startup, but when we fallback, it doesn't get called again?

We would have destroyed and recreated CompositorWidgetParent objects as part of the fallback.

The above is actually when we had a GPU process and all of these crashes are without. This is when mSurface gets set:

https://searchfox.org/mozilla-central/rev/36e240b0ddb7227290a6614117ae6720c010357f/widget/android/InProcessAndroidCompositorWidget.cpp#47

This patch is a speculative fix to the crashes observed in bug 1747116.
The crash reports suggest mSurface is null when we attempt to get the
native window size during compositing. This always appears to happen
after we have fallen back from WR to SW-WR on Android. Given that
mSurface is only populated through an event generated by the upper
layers of Android, which may not be aware we recreated the underlying
compositor objects, this patch attempts to greedily initialize mSurface
in case it already exists.

Assignee: nobody → aosmond
Status: NEW → ASSIGNED

I've attached a speculative fix. I have no certainty that it will fix it. If it does, more work will need to be done for the GPU process in this scenario.

Set release status flags based on info from the regressing bug 1721190

I don't think this theory explains why the crash can happen on release. Prior to my recent changes the compositor simply reads the Surface from the widget - there is no OnCompositorSurfaceChanged function so it can't be that it hasn't been called.

And since those changes, the compositor should be initialized in a paused state so shouldn't be attempting to render until it is resumed here, just after we call OnCompositorSurfaceChanged.

My hunch is that the problem is here. If the android front end invalidates the surface and attempts to pause the compositor (eg the app is minimised) during fallback (meaning the UICompositorController is null), then we don't set mCompositorPaused to true. In which case we'll resume the compositor post initialization even though the surface is invalid. I think we want to move the mCompositorPaused = true line out of the if statement so we set it unconditionally.

Flags: needinfo?(jnicol)

[@ mozilla::detail::MutexImpl::mutexLock | mozilla::ipc::PortLink::SendMessage] crashes like bp-43ffc7e5-c8b5-4277-bfc6-3facb0220103 are also AndroidCompositorWidget::GetClientSize() crashes.

Crash Signature: [@ mozilla::wr::RenderCompositorOGLSWGL::Resume] [@ mozilla::widget::AndroidCompositorWidget::GetClientSize] → [@ mozilla::wr::RenderCompositorOGLSWGL::Resume] [@ mozilla::widget::AndroidCompositorWidget::GetClientSize] [@ mozilla::detail::MutexImpl::mutexLock | mozilla::ipc::PortLink::SendMessage ]

Hmm, I think the volume is too high for my theory in comment 16 to be correct. It seems unlikely that many people are minimizing the app at the same time as a fallback occuring.

Looking at the cash stats there are a lot of OOM warnings in the "graphics critical error" field. So probably it's just that the surface creation fails and this is the first place we don't handle the null pointer correctly. I'll try to see if I can reproduce

Okay so I think the OOM is a slight red herring, except for the fact that since bug 1731748 landed OOMs easily cause us to reinitialize the compositor (as a GL_OUT_OF_MEMORY error gets reported as a renderer error, which causes fallback.)

I believe the root of the problem is here - note the typo OZ_WIDGET instead of MOZ_WIDGET. This means we don't call DestroyEGLSurface() when RenderCompositorOGLSWGL is destroyed. This only affects RenderCompositorOGLSWGL, not RenderCompositorEGL, meaning it only happens after the 2nd fallback.

The effect of this is that we get stuck in a loop attempting to reinitialize the compositor:

    BufferQueueProducer: [SurfaceView - org.mozilla.geckoview_example/org.mozilla.geckoview_example.GeckoViewActivity#0] connect: already connected (cur=1 req=1)
    libEGL  : eglCreateWindowSurface: native_window_api_connect (win=0x77cdd06010) failed (0xffffffea) (already connected to another API?)
    libEGL  : eglCreateWindowSurfaceTmpl:729 error 3003 (EGL_BAD_ALLOC)
    Gecko   : [GFX1-]: Failed to create EGLSurface
    CompositorBridgeParent: Unable to renew compositor surface; remaining in paused state
    Gecko   : [GFX1-]: Failed to connect WebRenderBridgeChild.

We cannot create an EGL surface because the window is already in use because we didn't destroy the previous surface.

This doesn't cause a crash, but does cause the app to freeze indefinitely. I can then indeed trigger the crash by minimizing the app as speculated in comment 16. And I don't think it's implausible as I stated in comment 18, because I reckon plenty of users would minimize the app if it's frozen, and the race condition isn't very hard to trigger since we're stuck in a loop trying to reinitialize.

In order to fix this and guard against future problems, I propose we:

  • Fix the typo so that we destroy the old egl surface when the compositor is destroyed
  • Unconditionally mark the compositor as paused when the java surface is destroyed
  • Make AndroidCompositorWidget::OnResumeComposition() fallible, like WebRenderBridgeParent::Resume() is.
    • We can query the size once here rather than for every AndroidCompositorWidget::GetClientSize() call. And guard against a nullptr. If we have a null pointer then return false, and CompositorBridgeParent::ResumeComposition() will handle that error by not resuming.
Severity: -- → S2

Due to a typo in a preprocessor condition, the RenderCompositorOGLSWGL
destructor was not destroying its EGL surface on Android. This meant
that subsequent RenderCompositorOGLSWGL instances were unable to
initialize, as the underlying window surface was still in use. After
attempting to reinitialize the compositor in response to an error, this
resulted in an endless loop of trying and failing to initialize.

After fixing this typo, RenderCompositorOGLSWGL correctly cleans up
after itself and subsequent instances are able to initialize.

If AndroidCompositorWidget's surface reference points to a surface
that has been destroyed, we can end up with a null ANativeWindow
pointer. This can result in crashes when using it to query the window
size.

This patch makes it so that we use the native window to query the size
only when the surface has changed rather than for every call to
GetClientSize(). This allows us to guard against a null pointer in a
single place. If we have a null pointer then return false from
OnCompositorSurfaceChanged(). CompositorBridgeParent::ResumeComposition()
will handle that by not resuming the compositor, like it already does
if WebRenderBridgeParent::Resume() fails.

Additonally, when we receive a pause event from GeckoView ensure that
we always set the mCompositorPaused flag to true, even if the
UiCompositorController is null. This avoids a possible cause of the
situation described above - if we receive a pause event (eg the app is
minimised) during compositor reinitialization (while the
UiCompositorController is temporarily null) we would not set that flag
to true, and would therefore resume compositing in to an invalid
surface.

Depends on D135117

Pushed by jnicol@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/46add7687005
Ensure that RenderCompositorOGLSWGL destroys its EGL surface on Android. r=gfx-reviewers,nical

Comment on attachment 9257637 [details]
Bug 1747116 - Ensure that RenderCompositorOGLSWGL destroys its EGL surface on Android. r?#gfx-reviewers

Beta/Release Uplift Approval Request

  • User impact if declined: Hangs and crashes for users on Android
  • Is this code covered by automated tests?: No
  • Has the fix been verified in Nightly?: No
  • Needs manual test from QE?: No
  • If yes, steps to reproduce:
  • List of other uplifts needed: None
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): Ensures we clean up resources correctly. Hardware-accelerated webrender already does this correctly, this fixes a typo in the corresponding software webrender path to make it do the same
  • String changes made/needed:
Attachment #9257637 - Flags: approval-mozilla-beta?

I think we should uplift D135117 ASAP. That believe that should fix the root cause of most of the crashes.

D135118 is still blocked on geckoview review, when that lands we can decide whether to take it along too.

Flags: needinfo?(dsmith)
Attachment #9256760 - Attachment is obsolete: true
Status: ASSIGNED → RESOLVED
Closed: 5 months ago
Resolution: --- → FIXED
Target Milestone: --- → 97 Branch

Comment on attachment 9257637 [details]
Bug 1747116 - Ensure that RenderCompositorOGLSWGL destroys its EGL surface on Android. r?#gfx-reviewers

Approved for 96.0rc2

Flags: needinfo?(dsmith)
Attachment #9257637 - Flags: approval-mozilla-beta? → approval-mozilla-release+
No longer blocks: gfx-triage
Pushed by jnicol@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/3adb5c71ca3b
Guard against null native window in AndroidCompositorWidget r=gfx-reviewers,geckoview-reviewers,aosmond,m_kato
Assignee: aosmond → jnicol

From what I can see from the incoming crash reports, the first patch doesn't seem to have moved the needle much if at all, but the second patch is looking good so far on Nightly. Setting the status for 96 back to affected accordingly.

I guess that makes sense, though is disappointing. The first patch attempted to prevent us getting in to a bad state via the only way I knew how to reproduce it, but users in the wild must be reproducing it another way. The second patch prevents the crash whilst in that state.

So we should probably uplift the second patch too if we have a dot release planned? We should also take bug 1749745 too, though. As I believe D135117 in some cases will just replace the crash with the app effectively freezing instead (nothing being rendered)

Flags: needinfo?(ryanvm)
Regressions: 1749745

We'll probably have a dot release some time next week, so yeah, we might as well try to fix this if we can ship the fix for bug 1749745 alongside it. It would be helpful if we can get bug 1749745 into a beta build soon too if at all possible.

Flags: needinfo?(ryanvm)

Comment on attachment 9257638 [details]
Bug 1747116 - Guard against null native window in AndroidCompositorWidget r?#gfx-reviewers

Beta/Release Uplift Approval Request

  • User impact if declined: Crashes for Android users
  • Is this code covered by automated tests?: No
  • Has the fix been verified in Nightly?: Yes
  • Needs manual test from QE?: No
  • If yes, steps to reproduce:
  • List of other uplifts needed: Bug 1749745
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): Guards against null pointer dereference.
  • String changes made/needed:
Attachment #9257638 - Flags: approval-mozilla-release?

Comment on attachment 9257638 [details]
Bug 1747116 - Guard against null native window in AndroidCompositorWidget r?#gfx-reviewers

Approved for 96.0.2

Attachment #9257638 - Flags: approval-mozilla-release? → approval-mozilla-release+
Duplicate of this bug: 1734030
Crash Signature: [@ mozilla::wr::RenderCompositorOGLSWGL::Resume] [@ mozilla::widget::AndroidCompositorWidget::GetClientSize] [@ mozilla::detail::MutexImpl::mutexLock | mozilla::ipc::PortLink::SendMessage ] → [@ mozilla::wr::RenderCompositorOGLSWGL::Resume] [@ mozilla::widget::AndroidCompositorWidget::GetClientSize] [@ mozilla::detail::MutexImpl::mutexLock | mozilla::ipc::PortLink::SendMessage ] [@ mozilla::detail::MutexImpl::mutexTryLock | nsTimerImpl::nsTi…
Duplicate of this bug: 1740738
Duplicate of this bug: 1734637

No signs of this crash in 96.2.0 \m/

Crash Signature: [@ mozilla::wr::RenderCompositorOGLSWGL::Resume] [@ mozilla::widget::AndroidCompositorWidget::GetClientSize] [@ mozilla::detail::MutexImpl::mutexLock | mozilla::ipc::PortLink::SendMessage ] [@ mozilla::detail::MutexImpl::mutexTryLock | → [@ mozilla::wr::RenderCompositorOGLSWGL::Resume] [@ mozilla::widget::AndroidCompositorWidget::GetClientSize] [@ mozilla::detail::MutexImpl::mutexLock | mozilla::ipc::PortLink::SendMessage ] [@ mozilla::detail::MutexImpl::mutexTryLock |
Crash Signature: nsTimerImpl::nsTimerImpl] → nsTimerImpl::nsTimerImpl] [@ libandroid.so@0x15208] [@ libandroid.so@0x8b7a]
Duplicate of this bug: 1726792
Crash Signature: nsTimerImpl::nsTimerImpl] [@ libandroid.so@0x15208] [@ libandroid.so@0x8b7a] → nsTimerImpl::nsTimerImpl] [@ libandroid.so@0x15208] [@ libandroid.so@0x8b7a] [@ libandroid.so@0x155b9]
You need to log in before you can comment on or make changes to this bug.