Crash in [@ mozilla::wr::RenderCompositorOGLSWGL::Resume]
Categories
(Core :: Graphics: WebRender, defect)
Tracking
()
People
(Reporter: RyanVM, Assigned: jnicol)
References
(Regression)
Details
(Keywords: crash, regression)
Crash Data
Attachments
(2 files, 1 obsolete file)
48 bytes,
text/x-phabricator-request
|
diannaS
:
approval-mozilla-release+
|
Details | Review |
48 bytes,
text/x-phabricator-request
|
diannaS
:
approval-mozilla-release+
|
Details | Review |
Crash report: https://crash-stats.mozilla.org/report/index/93a1e7e9-6e1b-4a22-ad88-2c2930211221
Looks like this has been around since Fenix 93, but it's currently up to #10 overall in the topcrash list for 95.2.0.
Reason: SIGSEGV / SEGV_MAPERR
Top 10 frames of crashing thread:
0 None @0x9f1699be
1 libxul.so mozilla::wr::RenderCompositorOGLSWGL::Resume gfx/webrender_bindings/RenderCompositorOGLSWGL.cpp:236
2 libxul.so mozilla::wr::WebRenderAPI::Resume gfx/webrender_bindings/WebRenderAPI.cpp:623
3 libxul.so mozilla::wr::RenderThread::RunEvent gfx/webrender_bindings/RenderThread.cpp:428
4 libxul.so mozilla::detail::RunnableMethodImpl<mozilla::wr::RenderThread*, void xpcom/threads/nsThreadUtils.h:1200
5 libxul.so nsThread::ProcessNextEvent xpcom/threads/nsThread.cpp:1169
6 libxul.so mozilla::ipc::MessagePumpForNonMainThreads::Run ipc/glue/MessagePump.cpp:330
7 libxul.so MessageLoop::Run ipc/chromium/src/base/message_loop.cc:306
8 libxul.so nsThread::ThreadFunc xpcom/threads/nsThread.cpp:391
9 libnss3.so _pt_root nsprpub/pr/src/pthreads/ptthread.c:201
Reporter | ||
Comment 1•3 years ago
|
||
Assuming bug 1734637 and bug 1740738 are the same issue (and they sure appear to be), this is actually closer to the #3 overall topcrash for Fenix at the moment.
Reporter | ||
Updated•3 years ago
|
Comment 2•3 years ago
|
||
This signature would have morphed due to bug 1741156. jnicol, do you think this is still a problem?
Comment 3•3 years ago
|
||
Looks like it is still a problem, adding the new signature.
Comment 4•3 years ago
|
||
This seems to be the very first crash report with this signature:
https://crash-stats.mozilla.org/report/index/42d28149-6381-4fb9-957c-5b0320210806#tab-details
Comment 5•3 years ago
|
||
I don't know how to figure out the Fenix nightly release schedule, but the pushlog from the start of 92 to the first crash report is:
https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=5dee15cf3f28&tochange=d66e0aca4012
We did enable fallback to SW-WR on Android in 92. I wonder if that is related to bug 1721190. That wouldn't have caused it directly but it might be the first time users were falling back in a build.
Comment 6•3 years ago
|
||
It looks like all of the reports have Fallback WR to SW-WR
in the graphics critical log which paints a bulls eye on bug 1721190.
Updated•3 years ago
|
Comment 7•3 years ago
|
||
It is just a nullptr crash so I guess we failed to get a valid native window here.
Comment 8•3 years ago
•
|
||
I wonder if this could be a race condition where:
gets called too late and so we haven't set AndroidCompositorWidget::mSurface
yet. Perhaps CompositorWidgetParent::OnCompositorSurfaceChanged
consistently gets called on startup, but when we fallback, it doesn't get called again?
We would have destroyed and recreated CompositorWidgetParent
objects as part of the fallback.
Comment 9•3 years ago
|
||
It looks like we just store the surfaces in an array indexed by the widget ID:
We get the widget ID here:
and we create the binding between the surface and the widget ID here:
Comment 10•3 years ago
|
||
The above is actually when we had a GPU process and all of these crashes are without. This is when mSurface
gets set:
Comment 11•3 years ago
|
||
That needs to be triggered by:
There is a sync method which triggers that:
The path that calls that on Android is LayerViewSupport::SyncResumeCompositor
which appears to be called from Java:
Comment 12•3 years ago
|
||
Comment 13•3 years ago
|
||
This patch is a speculative fix to the crashes observed in bug 1747116.
The crash reports suggest mSurface is null when we attempt to get the
native window size during compositing. This always appears to happen
after we have fallen back from WR to SW-WR on Android. Given that
mSurface is only populated through an event generated by the upper
layers of Android, which may not be aware we recreated the underlying
compositor objects, this patch attempts to greedily initialize mSurface
in case it already exists.
Updated•3 years ago
|
Comment 14•3 years ago
|
||
I've attached a speculative fix. I have no certainty that it will fix it. If it does, more work will need to be done for the GPU process in this scenario.
Comment 15•3 years ago
|
||
Set release status flags based on info from the regressing bug 1721190
Assignee | ||
Comment 16•3 years ago
|
||
I don't think this theory explains why the crash can happen on release. Prior to my recent changes the compositor simply reads the Surface from the widget - there is no OnCompositorSurfaceChanged
function so it can't be that it hasn't been called.
And since those changes, the compositor should be initialized in a paused state so shouldn't be attempting to render until it is resumed here, just after we call OnCompositorSurfaceChanged
.
My hunch is that the problem is here. If the android front end invalidates the surface and attempts to pause the compositor (eg the app is minimised) during fallback (meaning the UICompositorController is null), then we don't set mCompositorPaused
to true. In which case we'll resume the compositor post initialization even though the surface is invalid. I think we want to move the mCompositorPaused = true
line out of the if statement so we set it unconditionally.
Updated•3 years ago
|
Comment 17•3 years ago
|
||
[@ mozilla::detail::MutexImpl::mutexLock | mozilla::ipc::PortLink::SendMessage]
crashes like bp-43ffc7e5-c8b5-4277-bfc6-3facb0220103 are also AndroidCompositorWidget::GetClientSize() crashes.
Assignee | ||
Comment 18•3 years ago
|
||
Hmm, I think the volume is too high for my theory in comment 16 to be correct. It seems unlikely that many people are minimizing the app at the same time as a fallback occuring.
Looking at the cash stats there are a lot of OOM warnings in the "graphics critical error" field. So probably it's just that the surface creation fails and this is the first place we don't handle the null pointer correctly. I'll try to see if I can reproduce
Updated•3 years ago
|
Assignee | ||
Comment 19•3 years ago
|
||
Okay so I think the OOM is a slight red herring, except for the fact that since bug 1731748 landed OOMs easily cause us to reinitialize the compositor (as a GL_OUT_OF_MEMORY error gets reported as a renderer error, which causes fallback.)
I believe the root of the problem is here - note the typo OZ_WIDGET
instead of MOZ_WIDGET
. This means we don't call DestroyEGLSurface()
when RenderCompositorOGLSWGL
is destroyed. This only affects RenderCompositorOGLSWGL
, not RenderCompositorEGL
, meaning it only happens after the 2nd fallback.
The effect of this is that we get stuck in a loop attempting to reinitialize the compositor:
BufferQueueProducer: [SurfaceView - org.mozilla.geckoview_example/org.mozilla.geckoview_example.GeckoViewActivity#0] connect: already connected (cur=1 req=1)
libEGL : eglCreateWindowSurface: native_window_api_connect (win=0x77cdd06010) failed (0xffffffea) (already connected to another API?)
libEGL : eglCreateWindowSurfaceTmpl:729 error 3003 (EGL_BAD_ALLOC)
Gecko : [GFX1-]: Failed to create EGLSurface
CompositorBridgeParent: Unable to renew compositor surface; remaining in paused state
Gecko : [GFX1-]: Failed to connect WebRenderBridgeChild.
We cannot create an EGL surface because the window is already in use because we didn't destroy the previous surface.
This doesn't cause a crash, but does cause the app to freeze indefinitely. I can then indeed trigger the crash by minimizing the app as speculated in comment 16. And I don't think it's implausible as I stated in comment 18, because I reckon plenty of users would minimize the app if it's frozen, and the race condition isn't very hard to trigger since we're stuck in a loop trying to reinitialize.
In order to fix this and guard against future problems, I propose we:
- Fix the typo so that we destroy the old egl surface when the compositor is destroyed
- Unconditionally mark the compositor as paused when the java surface is destroyed
- Make
AndroidCompositorWidget::OnResumeComposition()
fallible, likeWebRenderBridgeParent::Resume()
is.- We can query the size once here rather than for every
AndroidCompositorWidget::GetClientSize()
call. And guard against a nullptr. If we have a null pointer then return false, andCompositorBridgeParent::ResumeComposition()
will handle that error by not resuming.
- We can query the size once here rather than for every
Updated•3 years ago
|
Assignee | ||
Comment 20•3 years ago
|
||
Due to a typo in a preprocessor condition, the RenderCompositorOGLSWGL
destructor was not destroying its EGL surface on Android. This meant
that subsequent RenderCompositorOGLSWGL instances were unable to
initialize, as the underlying window surface was still in use. After
attempting to reinitialize the compositor in response to an error, this
resulted in an endless loop of trying and failing to initialize.
After fixing this typo, RenderCompositorOGLSWGL correctly cleans up
after itself and subsequent instances are able to initialize.
Assignee | ||
Comment 21•3 years ago
|
||
If AndroidCompositorWidget's surface reference points to a surface
that has been destroyed, we can end up with a null ANativeWindow
pointer. This can result in crashes when using it to query the window
size.
This patch makes it so that we use the native window to query the size
only when the surface has changed rather than for every call to
GetClientSize(). This allows us to guard against a null pointer in a
single place. If we have a null pointer then return false from
OnCompositorSurfaceChanged(). CompositorBridgeParent::ResumeComposition()
will handle that by not resuming the compositor, like it already does
if WebRenderBridgeParent::Resume() fails.
Additonally, when we receive a pause event from GeckoView ensure that
we always set the mCompositorPaused flag to true, even if the
UiCompositorController is null. This avoids a possible cause of the
situation described above - if we receive a pause event (eg the app is
minimised) during compositor reinitialization (while the
UiCompositorController is temporarily null) we would not set that flag
to true, and would therefore resume compositing in to an invalid
surface.
Depends on D135117
Comment 22•3 years ago
|
||
Assignee | ||
Comment 23•3 years ago
|
||
Comment on attachment 9257637 [details]
Bug 1747116 - Ensure that RenderCompositorOGLSWGL destroys its EGL surface on Android. r?#gfx-reviewers
Beta/Release Uplift Approval Request
- User impact if declined: Hangs and crashes for users on Android
- Is this code covered by automated tests?: No
- Has the fix been verified in Nightly?: No
- Needs manual test from QE?: No
- If yes, steps to reproduce:
- List of other uplifts needed: None
- Risk to taking this patch: Low
- Why is the change risky/not risky? (and alternatives if risky): Ensures we clean up resources correctly. Hardware-accelerated webrender already does this correctly, this fixes a typo in the corresponding software webrender path to make it do the same
- String changes made/needed:
Assignee | ||
Comment 24•3 years ago
|
||
I think we should uplift D135117 ASAP. That believe that should fix the root cause of most of the crashes.
D135118 is still blocked on geckoview review, when that lands we can decide whether to take it along too.
Updated•3 years ago
|
Comment 25•3 years ago
|
||
bugherder |
Comment 26•3 years ago
|
||
Comment on attachment 9257637 [details]
Bug 1747116 - Ensure that RenderCompositorOGLSWGL destroys its EGL surface on Android. r?#gfx-reviewers
Approved for 96.0rc2
Comment 27•3 years ago
|
||
bugherder uplift |
Updated•3 years ago
|
Comment 28•3 years ago
|
||
Comment 29•3 years ago
|
||
bugherder |
Updated•3 years ago
|
Reporter | ||
Comment 30•3 years ago
|
||
From what I can see from the incoming crash reports, the first patch doesn't seem to have moved the needle much if at all, but the second patch is looking good so far on Nightly. Setting the status for 96 back to affected accordingly.
Assignee | ||
Comment 31•3 years ago
|
||
I guess that makes sense, though is disappointing. The first patch attempted to prevent us getting in to a bad state via the only way I knew how to reproduce it, but users in the wild must be reproducing it another way. The second patch prevents the crash whilst in that state.
So we should probably uplift the second patch too if we have a dot release planned? We should also take bug 1749745 too, though. As I believe D135117 in some cases will just replace the crash with the app effectively freezing instead (nothing being rendered)
Reporter | ||
Comment 32•3 years ago
|
||
We'll probably have a dot release some time next week, so yeah, we might as well try to fix this if we can ship the fix for bug 1749745 alongside it. It would be helpful if we can get bug 1749745 into a beta build soon too if at all possible.
Assignee | ||
Comment 33•3 years ago
|
||
Comment on attachment 9257638 [details]
Bug 1747116 - Guard against null native window in AndroidCompositorWidget r?#gfx-reviewers
Beta/Release Uplift Approval Request
- User impact if declined: Crashes for Android users
- Is this code covered by automated tests?: No
- Has the fix been verified in Nightly?: Yes
- Needs manual test from QE?: No
- If yes, steps to reproduce:
- List of other uplifts needed: Bug 1749745
- Risk to taking this patch: Low
- Why is the change risky/not risky? (and alternatives if risky): Guards against null pointer dereference.
- String changes made/needed:
Comment 34•3 years ago
|
||
Comment on attachment 9257638 [details]
Bug 1747116 - Guard against null native window in AndroidCompositorWidget r?#gfx-reviewers
Approved for 96.0.2
Comment 35•3 years ago
|
||
bugherder uplift |
Updated•3 years ago
|
Reporter | ||
Comment 39•3 years ago
|
||
No signs of this crash in 96.2.0 \m/
Updated•3 years ago
|
Updated•3 years ago
|
Description
•