Crash in [@ mozilla::layers::NativeLayerCA::NotifySurfaceReady]
Categories
(Core :: Graphics: WebRender, defect)
Tracking
()
People
(Reporter: sefeng, Assigned: bradwerth)
References
(Blocks 1 open bug)
Details
(Keywords: crash, topcrash)
Crash Data
Attachments
(3 files, 2 obsolete files)
48 bytes,
text/x-phabricator-request
|
dmeehan
:
approval-mozilla-release-
|
Details | Review |
48 bytes,
text/x-phabricator-request
|
dmeehan
:
approval-mozilla-release+
|
Details | Review |
48 bytes,
text/x-phabricator-request
|
dmeehan
:
approval-mozilla-release+
|
Details | Review |
This bug is for crash report bp-a4f55d2b-fe20-471e-b4ed-a5e5c0200813.
Top 5 frames of crashing thread:
0 XUL mozilla::layers::NativeLayerCA::NotifySurfaceReady gfx/layers/NativeLayerCA.mm:642
1 XUL mozilla::layers::CompositorOGL::EndRenderingToNativeLayer gfx/layers/opengl/CompositorOGL.cpp:952
2 XUL mozilla::layers::CompositorOGL::NormalDrawingDone gfx/layers/opengl/CompositorOGL.cpp:891
3 XUL mozilla::layers::LayerManagerComposite::Render const gfx/layers/composite/LayerManagerComposite.cpp:1178
4 XUL mozilla::layers::LayerManagerComposite::Render gfx/layers/composite/LayerManagerComposite.cpp:1221
A low frequent crash, only see it on MacOS. Probably dereferencing a null pointer somewhere.
Updated•4 years ago
|
Comment 1•4 years ago
|
||
This is spiking today in build 20201016094031, tracking as we are close to merge day and we want to make sure it doesn't spike in 83 beta.
Comment 2•4 years ago
|
||
Comment 3•4 years ago
|
||
Potential cause : Bug 1664804 (graphics and macos).
Matt, could your patch in bug 1664804 be the cause for this crash?
Comment 4•4 years ago
|
||
Nicolas, can you help us with the investigation on this graphics crash (for timezones reasons) ? Thanks
Comment 5•4 years ago
|
||
Today's spike should be fixed by the backout of bug 1642308.
Updated•4 years ago
|
Comment 6•4 years ago
|
||
Presumably the pre-existing issue from before 1642308 is still around.
Comment 7•3 years ago
|
||
Crash volume increased around the end of June 2021, though there was no new Firefox release around that time.
Updated•3 years ago
|
Comment 8•2 years ago
|
||
The bug is linked to a topcrash signature, which matches the following criterion:
- Top 5 desktop browser crashes on Mac on release
:bhood, could you consider increasing the severity of this top-crash bug?
For more information, please visit auto_nag documentation.
Comment 9•2 years ago
|
||
The crash volume remains notable in Fx105, but dropped precipitously in Fx106, so the underlying issue may have already been addressed in that version. As such, I think S3 is appropriate until/unless crash volume spikes again.
Comment 10•2 years ago
•
|
||
It definitely dropped in FF 106, but that may be because 106 is in beta (and has fewer users).
Here's a better crash stack, made with the new and improved stack walker:
bp-5a5c5aa8-9a0f-40ef-abda-590eb0221004
Crashing Thread (16), Name: Renderer
Frame Module Signature Source Trust
0 XUL mozilla::layers::NativeLayerCA::NotifySurfaceReady() gfx/layers/NativeLayerCA.mm:1288 context
1 XUL mozilla::wr::RenderCompositorNativeOGL::Unbind() gfx/webrender_bindings/RenderCompositorNative.cpp:557 cfi
2 XUL webrender::renderer::Renderer::draw_frame gfx/wr/webrender/src/renderer/mod.rs:4376 cfi
3 XUL webrender::renderer::Renderer::render_impl gfx/wr/webrender/src/renderer/mod.rs:1476 cfi
4 XUL webrender::renderer::Renderer::render gfx/wr/webrender/src/renderer/mod.rs:1198 cfi
5 XUL wr_renderer_render gfx/webrender_bindings/src/bindings.rs:620 cfi
6 XUL mozilla::wr::RenderThread::UpdateAndRender(mozilla::wr::WrWindowId, mozilla::layers::BaseTransactionId<mozilla::VsyncIdType> const&, mozilla::TimeStamp const&, bool, mozilla::Maybe<mozilla::gfx::IntSizeTyped<mozilla::gfx::UnknownUnits> > const&, mozilla::Maybe<mozilla::wr::ImageFormat> const&, mozilla::Maybe<mozilla::Range<unsigned char> > const&, bool*) gfx/webrender_bindings/RenderThread.cpp:565 cfi
7 XUL mozilla::wr::RenderThread::HandleFrameOneDoc(mozilla::wr::WrWindowId, bool) gfx/webrender_bindings/RenderThread.cpp:411 cfi
8 XUL mozilla::detail::RunnableMethodImpl<mozilla::wr::RenderThread*, void (mozilla::wr::RenderThread::*)(mozilla::wr::WrWindowId, bool), true, (mozilla::RunnableKind)0, mozilla::wr::WrWindowId, bool>::Run() xpcom/threads/nsThreadUtils.h:1200 cfi
9 XUL nsThread::ProcessNextEvent(bool, bool*) xpcom/threads/nsThread.cpp:1199 cfi
10 XUL mozilla::ipc::MessagePumpForNonMainThreads::Run(base::MessagePump::Delegate*) ipc/glue/MessagePump.cpp:330 cfi
11 XUL MessageLoop::Run() ipc/chromium/src/base/message_loop.cc:356 cfi
12 XUL nsThread::ThreadFunc(void*) xpcom/threads/nsThread.cpp:384 cfi
13 libnss3.dylib _pt_root nsprpub/pr/src/pthreads/ptthread.c:201 cfi
14 libsystem_pthread.dylib _pthread_start cfi
15 libsystem_pthread.dylib thread_start
Updated•2 years ago
|
Comment 11•2 years ago
|
||
Glenn, could you please evaluate the severity on this?
Comment 12•2 years ago
|
||
This isn't a simple NULL-pointer access anymore, we're hitting this assertion. The volume is also significant, accounting for the throttle rate of the various channels we're looking at 3 or 4 thousand users per week seeing this crash.
Updated•2 years ago
|
Assignee | ||
Comment 13•2 years ago
|
||
I'll see if I can do something about this.
Updated•2 years ago
|
Comment 14•2 years ago
|
||
The volume of crashes increased significantly since we started the 111 nightly cycle
Comment 15•2 years ago
|
||
:bradwerth have there been any updates on the investigation since Comment 13?
Assignee | ||
Comment 16•2 years ago
|
||
The assert is designed to ensure that we aren't presenting invalid content with a layer update. Since we don't have widespread reports that macOS users are seeing garbage pixels, the problem is most likely that the logic in the assert is overzealous. In that case, the simplest thing would be to remove it entirely. Before I do that, I'm going to spend some time trying to figure out what the assert is trying to maintain, and see if I can change the logic to make it both correct and useful.
Assignee | ||
Comment 17•2 years ago
|
||
(In reply to Brad Werth [:bradwerth] from comment #16)
The assert is designed to ensure that we aren't presenting invalid content with a layer update. Since we don't have widespread reports that macOS users are seeing garbage pixels, the problem is most likely that the logic in the assert is overzealous.
Well, since it's a release assert, the bad pixels will never reach the screen, because this assert is crashing the browser proactively. So the solution is still about better understanding the logic, but the fix will be pushing the assert earlier in the pipeline so we know which type of partial update is causing this condition.
Assignee | ||
Comment 18•2 years ago
|
||
The problem is that when we reuse surfaces, we don't maximally set the invalid region. Our old surfaces have invalid regions that overlap the display rect of the partial update, but not of the current front surface, so those regions are copying garbage. Patch inbound.
Assignee | ||
Comment 19•2 years ago
|
||
This patch clarifies the purpose of the maximal rect defined in
NativeLayerCA::NextSurface and ensures it is applied to reused surfaces
and not just newly-created ones.
Additionally, it reorganizes the region boolean logic in
NativeLayerCA::HandlePartialUpdate to avoid adding in the update region
just to subtract it away again. It adds an assert that the in progress
surface exists before the partial update.
Assignee | ||
Comment 20•2 years ago
|
||
I'm less sure this is the issue. Re-thinking the patch...
Updated•2 years ago
|
Assignee | ||
Comment 21•2 years ago
|
||
The best I can think of right now is to push the assert up earlier, to detect if WebRender is sending us an insufficiently-spanning update area during a partial update.
Comment 22•2 years ago
|
||
Comment 23•2 years ago
|
||
bugherder |
Comment 24•2 years ago
|
||
Can this be uplifted 111 branch? Firefox dev edition is nearly unusable on Mac. :(
Updated•2 years ago
|
Comment 25•2 years ago
|
||
Brad, could you request uplift to beta? Thanks!
Assignee | ||
Comment 26•2 years ago
|
||
I don't think this should be uplifted until we fix Bug 1817691, which is receiving the crashes this used to capture (with more useful info).
Updated•2 years ago
|
Assignee | ||
Comment 27•2 years ago
|
||
The new crash signature is Nightly-only, and we'll see more crash signatures associated with this Bug until Bug 1817691 is fixed. Not sure if the right management strategy is to keep this Bug open, or resolve it as a duplicate. For now, I'll try to get a quick fix for the new Bug and we'll be able to close and possibly uplift both.
Updated•2 years ago
|
Comment 28•2 years ago
|
||
The patch landed in nightly and beta is affected.
:bradwerth, is this bug important enough to require an uplift?
- If yes, please nominate the patch for beta approval.Also, don't forget to request an uplift for the patches in the regression caused by this fix.
- If no, please set
status-firefox111
towontfix
.
For more information, please visit auto_nag documentation.
Comment 29•2 years ago
|
||
:bradwerth now Bug 1817691 has been fixed, could you look at an uplift request on this and mention the patch in Bug 1817691?
Is there risk in Bug 1817691 that needs some additional time in Nightly, or would it be safe to uplift both soon?
Assignee | ||
Comment 30•2 years ago
|
||
(In reply to Donal Meehan [:dmeehan] from comment #29)
:bradwerth now Bug 1817691 has been fixed, could you look at an uplift request on this and mention the patch in Bug 1817691?
Is there risk in Bug 1817691 that needs some additional time in Nightly, or would it be safe to uplift both soon?
It seems likely that Bug 1817691 will fix the problem. It certainly adds no new risk, so I will request uplift of both patches now.
Assignee | ||
Comment 31•2 years ago
|
||
Comment on attachment 9316233 [details]
Bug 1658986: Make NativeLayerCA::HandlePartialUpdate sanity-check the update area.
Beta/Release Uplift Approval Request
- User impact if declined: macOS users will continue to occasionally experience crashes while scrolling under some unknown conditions
- Is this code covered by automated tests?: No
- Has the fix been verified in Nightly?: No
- Needs manual test from QE?: No
- If yes, steps to reproduce:
- List of other uplifts needed: Bug 1817691
- Risk to taking this patch: Low
- Why is the change risky/not risky? (and alternatives if risky): This patch moves some asserts around, and the patch in Bug 1817691 probably solves the problem in a way that is very low-risk. There is no meaningful risk added by these patches.
- String changes made/needed:
- Is Android affected?: No
Comment 32•2 years ago
|
||
Comment on attachment 9316233 [details]
Bug 1658986: Make NativeLayerCA::HandlePartialUpdate sanity-check the update area.
Approved for 111.0b6
Comment 33•2 years ago
|
||
bugherder uplift |
Comment 34•2 years ago
|
||
:bradwerth could you take a look at the crash reports here?
Though the number has dropped there are still crash reports
Assignee | ||
Comment 35•2 years ago
|
||
I'll keep looking.
Assignee | ||
Updated•2 years ago
|
Assignee | ||
Comment 36•2 years ago
|
||
We're headed for a crash, so this change makes gfxCriticalError have more
information that will help us understand the crash. Unlike some other
error-checking code in this class, it is not NIGHTLY_BUILD because the
crash signatures in this Bug are reported in Beta and beyond.
Assignee | ||
Comment 37•2 years ago
|
||
(In reply to Donal Meehan [:dmeehan] from comment #34)
:bradwerth could you take a look at the crash reports here?
Though the number has dropped there are still crash reports
Donal, why doesn't the crash summary at the top have a table for Nightly crashes? I think that would make it clearer that the crash level in Nightly has gone to zero, which is what we hoped the landing of Bug 1817691 would accomplish.
Assignee | ||
Comment 38•2 years ago
|
||
(In reply to Brad Werth [:bradwerth] from comment #37)
I think that would make it clearer that the crash level in Nightly has gone to zero
Ahh, the level is not all the way to zero, just much lower than the peak we saw at beginning of February.
Comment 39•2 years ago
|
||
The crashes referenced in comment 34 were in beta, the patches were uplifted to beta (sorry should have been explicit in my comment)
Comment 40•2 years ago
|
||
Comment 41•2 years ago
|
||
bugherder |
Assignee | ||
Comment 42•2 years ago
|
||
Looking at crash reports, we're still crashing in cases where the intersection of the display rect and the invalid region is a one-pixel strip. That's because our logic only cleans up the invalid region if the whole thing is a one-pixel strip. We only care if the visible part is a one-pixel strip, so the logic for the cleanup can be relaxed a bit.
Assignee | ||
Comment 43•2 years ago
|
||
I'm also not sure if the clip rect is indicating that the invalid pixels are clipped out, and therefore we don't need to care about them. I'll see if I can figure out that logic and incorporate it into the assert.
Assignee | ||
Comment 44•2 years ago
|
||
I think it may be time to make this a Nightly-only crash because we just don't have a great understanding of why this assert is being hit, and the crash volume has risen so much. The worst case if we relax this for beta and release is that those users might see invalid pixels, which they might report as a new Bug, but it wouldn't manifest as a crash. With this crash volume, I think that's a preferable outcome.
Assignee | ||
Comment 45•2 years ago
|
||
This makes three changes to the invalid region checks:
- It cleans up invalid one-pixel strips that intersect the display rect,
even if the rest of the invalid region is bigger than a one-pixel strip. - It checks the invalid region against the clip rect (if provided) and
does the same one-pixel strip cleanup as above. - It makes both of these checks, as well as the final crash happen only
on Nightly.
This will hopefully give us more confidence that we are only showing valid
pixels, while enhancing any Nightly crashes with enough information to
determine what might be going wrong. It may lead to beta and release users
seeing invalid pixels!
Comment 46•2 years ago
|
||
Got this crash on 111.0b6 while a video was playing on reddit and then I pinch zoomed in. (Those steps don't reproduce the crash though)
Comment 47•2 years ago
|
||
Setting 111 and 112 back to affected since the bug has a leave-open keyword
Clearing the uplift approval on the attachment to remove this from the list of missed uplifts.
:bradwerth given the volume and if you're close to a fix, we might consider a 111 RC respin to include this. The timing for a respin would be tomorrow.
I see the patch is pending review, will need to land, make it's way to nightly, etc. - but what are your thoughts on risk of this for release uplift to 111?
Updated•2 years ago
|
Assignee | ||
Comment 48•2 years ago
|
||
This is mostly just preliminary work to make it possible to recalculate
the clip position and bounds from elsewhere in NativeLayerCA. Along the
way, it eliminates some redundant temporary variables:
- effectiveClip is just a holder of both the position and bounds of the
clip, which are applied separatley by the caller, so the new function
returns them separately. - clipToLayerOffset is just the unscaled negated value of the clip
position, and the only place it is used can tolerate doing the math at the
point of use.
It also cleans up some semantic issues in ApplyChanges -- it no longer
sets the mWrappingCALayer position unless the clip is in effect, which was
harmless but confusing.
Updated•2 years ago
|
Comment 49•2 years ago
|
||
Adding a comment as a trace for a response to Comment 47 outside of Bugzilla.
This is still on the radar for a release uplift request - it will require the patches from parts 2, 3, and 4.
Given the timing, this will not make it for 111.0 and instead be considered in a 111 dot release.
Assignee | ||
Comment 50•2 years ago
|
||
I was just able to reproduce the bug (with patches applied) on Reddit home page while pinch-zooming in and then scrolling around. The error log was:
The front surface invalid region [0,0,90,512] must not intersect the display rect (x=89, y=0, w=935, h=512). And clip rect is (x=1034, y=300, w=1962, h=1802).
To me, that looks like the intersection of the front surface invalid region and the display rect is a one-pixel wide vertical strip from (89, 0) to (90, 512). The patches should have cleaned that up. I'll take them out of review while I try to convince myself we'll fix this case.
Assignee | ||
Comment 51•2 years ago
|
||
I misspoke, my failure run was without the patches applied. That's mostly good news because this is a case that the patches should fix. I'm rebuilding with patches applied and I'll try to replicate with a local build which additionally logs out whenever it is cleaning up one of these one-pixel strips, which will give me confidence that we are catching and fixing this case.
Assignee | ||
Comment 52•2 years ago
|
||
With patches applied, I was able to reproduce again and get the logging messaging indicating that the one-pixel strip was resolved successfully. I think the patches are ready.
Assignee | ||
Comment 53•2 years ago
|
||
(In reply to Brad Werth [:bradwerth] from comment #52)
With patches applied, I was able to reproduce again and get the logging messaging indicating that the one-pixel strip was resolved successfully. I think the patches are ready.
Also, in my reproduction run, where the "problem" was noted and ignored, I didn't subsequently see any obviously invalid pixels. This gives me confidence that moving the checks to nightly-only will not result in bad pixels being shown in beta and release.
Updated•2 years ago
|
Updated•2 years ago
|
Assignee | ||
Comment 54•2 years ago
|
||
These checks were designed to identify if we have a logic error in the
display rects or update rects we get from our callers to
HandlePartialUpdate. We've gathered enough evidence that there is such an
error and we've opened Bug 1818540 to address it. There's no longer a good
reason to keep these checks around, as they just crash the browser for
users rather than risk showing them invalid pixels (which we're fairly
sure are valid pixels).
Comment 55•2 years ago
|
||
Updated•2 years ago
|
Comment 56•2 years ago
|
||
bugherder |
Comment 57•2 years ago
|
||
:bradwerth there are no crashes in nightly since this landed. No crashes so far in 112 beta, but 112 just went to beta and is not at 100% rollout.
- Does this still need to be in leave-open? Any further investigation ongoing?
- If you plan on resolving it, we would like to review an uplift request on this for release.
-- If you think it's low risk, then we can monitor it in 112 beta before uplifting to release and including it in the planned 111 dot release
Assignee | ||
Comment 58•2 years ago
|
||
This is fixed now. I'll nominate for uplift.
Updated•2 years ago
|
Assignee | ||
Comment 59•2 years ago
|
||
Comment on attachment 9322439 [details]
Bug 1658986 Part 3: Remove NativelayerCA checks of invalid regions.
Beta/Release Uplift Approval Request
- User impact if declined: macOS users will experience crashes while pinch-zooming under exceptional circumstances.
- Is this code covered by automated tests?: No
- Has the fix been verified in Nightly?: Yes
- Needs manual test from QE?: No
- If yes, steps to reproduce:
- List of other uplifts needed: None
- Risk to taking this patch: Low
- Why is the change risky/not risky? (and alternatives if risky): This removes an intentional crash we put in to flush out potential invalid pixels. We now believe those pixels are valid, so we no longer want this check, and we certainly don't need to crash.
- String changes made/needed:
- Is Android affected?: No
Assignee | ||
Updated•2 years ago
|
Assignee | ||
Comment 60•2 years ago
|
||
If this can't be merged to release with all parts applied, I will create a separate rebased upliftable patch for release.
Updated•2 years ago
|
Comment 61•2 years ago
•
|
||
Comment on attachment 9316233 [details]
Bug 1658986: Make NativeLayerCA::HandlePartialUpdate sanity-check the update area.
Already uplifted for 111
See Comment 33
Comment 62•2 years ago
|
||
Comment on attachment 9320624 [details]
Bug 1658986 Part 2: Add region stringification to gfxCriticalError before crashing.
Approved for 111.0.1
Comment 63•2 years ago
|
||
Comment on attachment 9322439 [details]
Bug 1658986 Part 3: Remove NativelayerCA checks of invalid regions.
Approved for 111.0.1
Comment 64•2 years ago
|
||
bugherder uplift |
Description
•