Closed Bug 1658986 Opened 4 years ago Closed 2 years ago

Crash in [@ mozilla::layers::NativeLayerCA::NotifySurfaceReady]

Categories

(Core :: Graphics: WebRender, defect)

Unspecified
macOS
defect

Tracking

()

RESOLVED FIXED
112 Branch
Tracking Status
firefox-esr91 --- wontfix
firefox-esr102 --- wontfix
firefox83 --- wontfix
firefox94 --- wontfix
firefox95 --- wontfix
firefox96 --- wontfix
firefox109 --- wontfix
firefox110 --- wontfix
firefox111 + fixed
firefox112 --- fixed

People

(Reporter: sefeng, Assigned: bradwerth)

References

(Blocks 1 open bug)

Details

(Keywords: crash, topcrash)

Crash Data

Attachments

(3 files, 2 obsolete files)

This bug is for crash report bp-a4f55d2b-fe20-471e-b4ed-a5e5c0200813.

Top 5 frames of crashing thread:

0 XUL mozilla::layers::NativeLayerCA::NotifySurfaceReady gfx/layers/NativeLayerCA.mm:642
1 XUL mozilla::layers::CompositorOGL::EndRenderingToNativeLayer gfx/layers/opengl/CompositorOGL.cpp:952
2 XUL mozilla::layers::CompositorOGL::NormalDrawingDone gfx/layers/opengl/CompositorOGL.cpp:891
3 XUL mozilla::layers::LayerManagerComposite::Render const gfx/layers/composite/LayerManagerComposite.cpp:1178
4 XUL mozilla::layers::LayerManagerComposite::Render gfx/layers/composite/LayerManagerComposite.cpp:1221

A low frequent crash, only see it on MacOS. Probably dereferencing a null pointer somewhere.

Severity: -- → S3

This is spiking today in build 20201016094031, tracking as we are close to merge day and we want to make sure it doesn't spike in 83 beta.

Potential cause : Bug 1664804 (graphics and macos).

Matt, could your patch in bug 1664804 be the cause for this crash?

Flags: needinfo?(matt.woodrow)
Severity: S3 → S1
Keywords: topcrash

Nicolas, can you help us with the investigation on this graphics crash (for timezones reasons) ? Thanks

Flags: needinfo?(nical.bugzilla)

Today's spike should be fixed by the backout of bug 1642308.

Severity: S1 → S3
Flags: needinfo?(nical.bugzilla)
Flags: needinfo?(matt.woodrow)
Keywords: topcrash
Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED

Presumably the pre-existing issue from before 1642308 is still around.

Status: RESOLVED → REOPENED
Resolution: FIXED → ---

Crash volume increased around the end of June 2021, though there was no new Firefox release around that time.

The bug is linked to a topcrash signature, which matches the following criterion:

  • Top 5 desktop browser crashes on Mac on release

:bhood, could you consider increasing the severity of this top-crash bug?

For more information, please visit auto_nag documentation.

Flags: needinfo?(bhood)
Keywords: topcrash

The crash volume remains notable in Fx105, but dropped precipitously in Fx106, so the underlying issue may have already been addressed in that version. As such, I think S3 is appropriate until/unless crash volume spikes again.

Flags: needinfo?(bhood)

It definitely dropped in FF 106, but that may be because 106 is in beta (and has fewer users).

Here's a better crash stack, made with the new and improved stack walker:

bp-5a5c5aa8-9a0f-40ef-abda-590eb0221004

Crashing Thread (16), Name: Renderer
Frame  Module  Signature  Source  Trust
0  XUL  mozilla::layers::NativeLayerCA::NotifySurfaceReady()  gfx/layers/NativeLayerCA.mm:1288  context
1  XUL  mozilla::wr::RenderCompositorNativeOGL::Unbind()  gfx/webrender_bindings/RenderCompositorNative.cpp:557  cfi
2  XUL  webrender::renderer::Renderer::draw_frame  gfx/wr/webrender/src/renderer/mod.rs:4376  cfi
3  XUL  webrender::renderer::Renderer::render_impl  gfx/wr/webrender/src/renderer/mod.rs:1476  cfi
4  XUL  webrender::renderer::Renderer::render  gfx/wr/webrender/src/renderer/mod.rs:1198  cfi
5  XUL  wr_renderer_render  gfx/webrender_bindings/src/bindings.rs:620  cfi
6  XUL  mozilla::wr::RenderThread::UpdateAndRender(mozilla::wr::WrWindowId, mozilla::layers::BaseTransactionId<mozilla::VsyncIdType> const&, mozilla::TimeStamp const&, bool, mozilla::Maybe<mozilla::gfx::IntSizeTyped<mozilla::gfx::UnknownUnits> > const&, mozilla::Maybe<mozilla::wr::ImageFormat> const&, mozilla::Maybe<mozilla::Range<unsigned char> > const&, bool*)  gfx/webrender_bindings/RenderThread.cpp:565  cfi
7  XUL  mozilla::wr::RenderThread::HandleFrameOneDoc(mozilla::wr::WrWindowId, bool)  gfx/webrender_bindings/RenderThread.cpp:411  cfi
8  XUL  mozilla::detail::RunnableMethodImpl<mozilla::wr::RenderThread*, void (mozilla::wr::RenderThread::*)(mozilla::wr::WrWindowId, bool), true, (mozilla::RunnableKind)0, mozilla::wr::WrWindowId, bool>::Run()  xpcom/threads/nsThreadUtils.h:1200  cfi
9  XUL  nsThread::ProcessNextEvent(bool, bool*)  xpcom/threads/nsThread.cpp:1199  cfi
10  XUL  mozilla::ipc::MessagePumpForNonMainThreads::Run(base::MessagePump::Delegate*)  ipc/glue/MessagePump.cpp:330  cfi
11  XUL  MessageLoop::Run()  ipc/chromium/src/base/message_loop.cc:356  cfi
12  XUL  nsThread::ThreadFunc(void*)  xpcom/threads/nsThread.cpp:384  cfi
13  libnss3.dylib  _pt_root  nsprpub/pr/src/pthreads/ptthread.c:201  cfi
14  libsystem_pthread.dylib  _pthread_start   cfi
15  libsystem_pthread.dylib  thread_start

https://crash-stats.mozilla.org/search/?signature=~NativeLayerCA%3A%3ANotifySurfaceReady&platform=Mac%20OS%20X&date=%3E%3D2022-09-20T01%3A45%3A00.000Z&date=%3C2022-10-04T01%3A45%3A00.000Z&_facets=signature&_facets=proto_signature&_facets=version&_sort=-date&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#facet-version

Component: Graphics: Layers → Graphics: WebRender

Glenn, could you please evaluate the severity on this?

This isn't a simple NULL-pointer access anymore, we're hitting this assertion. The volume is also significant, accounting for the throttle rate of the various channels we're looking at 3 or 4 thousand users per week seeing this crash.

Severity: S3 → S2
Blocks: wr-mac

I'll see if I can do something about this.

Assignee: nobody → bwerth

The volume of crashes increased significantly since we started the 111 nightly cycle

:bradwerth have there been any updates on the investigation since Comment 13?

Flags: needinfo?(bwerth)

The assert is designed to ensure that we aren't presenting invalid content with a layer update. Since we don't have widespread reports that macOS users are seeing garbage pixels, the problem is most likely that the logic in the assert is overzealous. In that case, the simplest thing would be to remove it entirely. Before I do that, I'm going to spend some time trying to figure out what the assert is trying to maintain, and see if I can change the logic to make it both correct and useful.

(In reply to Brad Werth [:bradwerth] from comment #16)

The assert is designed to ensure that we aren't presenting invalid content with a layer update. Since we don't have widespread reports that macOS users are seeing garbage pixels, the problem is most likely that the logic in the assert is overzealous.

Well, since it's a release assert, the bad pixels will never reach the screen, because this assert is crashing the browser proactively. So the solution is still about better understanding the logic, but the fix will be pushing the assert earlier in the pipeline so we know which type of partial update is causing this condition.

The problem is that when we reuse surfaces, we don't maximally set the invalid region. Our old surfaces have invalid regions that overlap the display rect of the partial update, but not of the current front surface, so those regions are copying garbage. Patch inbound.

Flags: needinfo?(bwerth)

This patch clarifies the purpose of the maximal rect defined in
NativeLayerCA::NextSurface and ensures it is applied to reused surfaces
and not just newly-created ones.

Additionally, it reorganizes the region boolean logic in
NativeLayerCA::HandlePartialUpdate to avoid adding in the update region
just to subtract it away again. It adds an assert that the in progress
surface exists before the partial update.

I'm less sure this is the issue. Re-thinking the patch...

Attachment #9316233 - Attachment description: Bug 1658986: Ensure reused surfaces are marked maximially invalid. → Bug 1658986: Make NativeLayerCA::HandlePartialUpdate sanity-check the update area.

The best I can think of right now is to push the assert up earlier, to detect if WebRender is sending us an insufficiently-spanning update area during a partial update.

Pushed by bwerth@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/1b0aa79390a0 Make NativeLayerCA::HandlePartialUpdate sanity-check the update area. r=mstange
Status: REOPENED → RESOLVED
Closed: 4 years ago2 years ago
Resolution: --- → FIXED
Target Milestone: --- → 112 Branch

Can this be uplifted 111 branch? Firefox dev edition is nearly unusable on Mac. :(

Flags: needinfo?(bwerth)

Brad, could you request uplift to beta? Thanks!

I don't think this should be uplifted until we fix Bug 1817691, which is receiving the crashes this used to capture (with more useful info).

Flags: needinfo?(bwerth)

The new crash signature is Nightly-only, and we'll see more crash signatures associated with this Bug until Bug 1817691 is fixed. Not sure if the right management strategy is to keep this Bug open, or resolve it as a duplicate. For now, I'll try to get a quick fix for the new Bug and we'll be able to close and possibly uplift both.

The patch landed in nightly and beta is affected.
:bradwerth, is this bug important enough to require an uplift?

  • If yes, please nominate the patch for beta approval.Also, don't forget to request an uplift for the patches in the regression caused by this fix.
  • If no, please set status-firefox111 to wontfix.

For more information, please visit auto_nag documentation.

Flags: needinfo?(bwerth)

:bradwerth now Bug 1817691 has been fixed, could you look at an uplift request on this and mention the patch in Bug 1817691?
Is there risk in Bug 1817691 that needs some additional time in Nightly, or would it be safe to uplift both soon?

(In reply to Donal Meehan [:dmeehan] from comment #29)

:bradwerth now Bug 1817691 has been fixed, could you look at an uplift request on this and mention the patch in Bug 1817691?
Is there risk in Bug 1817691 that needs some additional time in Nightly, or would it be safe to uplift both soon?

It seems likely that Bug 1817691 will fix the problem. It certainly adds no new risk, so I will request uplift of both patches now.

Flags: needinfo?(bwerth)

Comment on attachment 9316233 [details]
Bug 1658986: Make NativeLayerCA::HandlePartialUpdate sanity-check the update area.

Beta/Release Uplift Approval Request

  • User impact if declined: macOS users will continue to occasionally experience crashes while scrolling under some unknown conditions
  • Is this code covered by automated tests?: No
  • Has the fix been verified in Nightly?: No
  • Needs manual test from QE?: No
  • If yes, steps to reproduce:
  • List of other uplifts needed: Bug 1817691
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): This patch moves some asserts around, and the patch in Bug 1817691 probably solves the problem in a way that is very low-risk. There is no meaningful risk added by these patches.
  • String changes made/needed:
  • Is Android affected?: No
Attachment #9316233 - Flags: approval-mozilla-beta?

Comment on attachment 9316233 [details]
Bug 1658986: Make NativeLayerCA::HandlePartialUpdate sanity-check the update area.

Approved for 111.0b6

Attachment #9316233 - Flags: approval-mozilla-beta? → approval-mozilla-beta+

:bradwerth could you take a look at the crash reports here?
Though the number has dropped there are still crash reports

Flags: needinfo?(bwerth)

I'll keep looking.

Status: RESOLVED → REOPENED
Flags: needinfo?(bwerth)
Resolution: FIXED → ---

We're headed for a crash, so this change makes gfxCriticalError have more
information that will help us understand the crash. Unlike some other
error-checking code in this class, it is not NIGHTLY_BUILD because the
crash signatures in this Bug are reported in Beta and beyond.

(In reply to Donal Meehan [:dmeehan] from comment #34)

:bradwerth could you take a look at the crash reports here?
Though the number has dropped there are still crash reports

Donal, why doesn't the crash summary at the top have a table for Nightly crashes? I think that would make it clearer that the crash level in Nightly has gone to zero, which is what we hoped the landing of Bug 1817691 would accomplish.

Flags: needinfo?(dmeehan)

(In reply to Brad Werth [:bradwerth] from comment #37)

I think that would make it clearer that the crash level in Nightly has gone to zero

Ahh, the level is not all the way to zero, just much lower than the peak we saw at beginning of February.

The crashes referenced in comment 34 were in beta, the patches were uplifted to beta (sorry should have been explicit in my comment)

Flags: needinfo?(dmeehan)
Pushed by bwerth@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/120ea59f799f Part 2: Add region stringification to gfxCriticalError before crashing. r=mstange

Looking at crash reports, we're still crashing in cases where the intersection of the display rect and the invalid region is a one-pixel strip. That's because our logic only cleans up the invalid region if the whole thing is a one-pixel strip. We only care if the visible part is a one-pixel strip, so the logic for the cleanup can be relaxed a bit.

I'm also not sure if the clip rect is indicating that the invalid pixels are clipped out, and therefore we don't need to care about them. I'll see if I can figure out that logic and incorporate it into the assert.

I think it may be time to make this a Nightly-only crash because we just don't have a great understanding of why this assert is being hit, and the crash volume has risen so much. The worst case if we relax this for beta and release is that those users might see invalid pixels, which they might report as a new Bug, but it wouldn't manifest as a crash. With this crash volume, I think that's a preferable outcome.

This makes three changes to the invalid region checks:

  1. It cleans up invalid one-pixel strips that intersect the display rect,
    even if the rest of the invalid region is bigger than a one-pixel strip.
  2. It checks the invalid region against the clip rect (if provided) and
    does the same one-pixel strip cleanup as above.
  3. It makes both of these checks, as well as the final crash happen only
    on Nightly.

This will hopefully give us more confidence that we are only showing valid
pixels, while enhancing any Nightly crashes with enough information to
determine what might be going wrong. It may lead to beta and release users
seeing invalid pixels!

Got this crash on 111.0b6 while a video was playing on reddit and then I pinch zoomed in. (Those steps don't reproduce the crash though)

Setting 111 and 112 back to affected since the bug has a leave-open keyword
Clearing the uplift approval on the attachment to remove this from the list of missed uplifts.
:bradwerth given the volume and if you're close to a fix, we might consider a 111 RC respin to include this. The timing for a respin would be tomorrow.
I see the patch is pending review, will need to land, make it's way to nightly, etc. - but what are your thoughts on risk of this for release uplift to 111?

Flags: needinfo?(bwerth)
Attachment #9316233 - Flags: approval-mozilla-beta+

This is mostly just preliminary work to make it possible to recalculate
the clip position and bounds from elsewhere in NativeLayerCA. Along the
way, it eliminates some redundant temporary variables:

  1. effectiveClip is just a holder of both the position and bounds of the
    clip, which are applied separatley by the caller, so the new function
    returns them separately.
  2. clipToLayerOffset is just the unscaled negated value of the clip
    position, and the only place it is used can tolerate doing the math at the
    point of use.

It also cleans up some semantic issues in ApplyChanges -- it no longer
sets the mWrappingCALayer position unless the clip is in effect, which was
harmless but confusing.

Attachment #9321741 - Attachment description: Bug 1658986 Part 3: Make NativeLayerCA checks of invalid regions more complex, and nightly-only. → Bug 1658986 Part 4: Make NativeLayerCA checks of invalid regions consider the clip, and change to nightly-only.

Adding a comment as a trace for a response to Comment 47 outside of Bugzilla.
This is still on the radar for a release uplift request - it will require the patches from parts 2, 3, and 4.
Given the timing, this will not make it for 111.0 and instead be considered in a 111 dot release.

I was just able to reproduce the bug (with patches applied) on Reddit home page while pinch-zooming in and then scrolling around. The error log was:

The front surface invalid region [0,0,90,512] must not intersect the display rect (x=89, y=0, w=935, h=512). And clip rect is (x=1034, y=300, w=1962, h=1802).

To me, that looks like the intersection of the front surface invalid region and the display rect is a one-pixel wide vertical strip from (89, 0) to (90, 512). The patches should have cleaned that up. I'll take them out of review while I try to convince myself we'll fix this case.

Flags: needinfo?(bwerth)

I misspoke, my failure run was without the patches applied. That's mostly good news because this is a case that the patches should fix. I'm rebuilding with patches applied and I'll try to replicate with a local build which additionally logs out whenever it is cleaning up one of these one-pixel strips, which will give me confidence that we are catching and fixing this case.

With patches applied, I was able to reproduce again and get the logging messaging indicating that the one-pixel strip was resolved successfully. I think the patches are ready.

(In reply to Brad Werth [:bradwerth] from comment #52)

With patches applied, I was able to reproduce again and get the logging messaging indicating that the one-pixel strip was resolved successfully. I think the patches are ready.

Also, in my reproduction run, where the "problem" was noted and ignored, I didn't subsequently see any obviously invalid pixels. This gives me confidence that moving the checks to nightly-only will not result in bad pixels being shown in beta and release.

Attachment #9321965 - Attachment is obsolete: true
Attachment #9321741 - Attachment is obsolete: true

These checks were designed to identify if we have a logic error in the
display rects or update rects we get from our callers to
HandlePartialUpdate. We've gathered enough evidence that there is such an
error and we've opened Bug 1818540 to address it. There's no longer a good
reason to keep these checks around, as they just crash the browser for
users rather than risk showing them invalid pixels (which we're fairly
sure are valid pixels).

Pushed by bwerth@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/1177d15b54df Part 3: Remove NativelayerCA checks of invalid regions. r=mstange
Target Milestone: 112 Branch → ---

:bradwerth there are no crashes in nightly since this landed. No crashes so far in 112 beta, but 112 just went to beta and is not at 100% rollout.

  • Does this still need to be in leave-open? Any further investigation ongoing?
  • If you plan on resolving it, we would like to review an uplift request on this for release.
    -- If you think it's low risk, then we can monitor it in 112 beta before uplifting to release and including it in the planned 111 dot release
Flags: needinfo?(bwerth)

This is fixed now. I'll nominate for uplift.

Status: REOPENED → RESOLVED
Closed: 2 years ago2 years ago
Flags: needinfo?(bwerth)
Keywords: leave-open
Resolution: --- → FIXED
Target Milestone: --- → 112 Branch

Comment on attachment 9322439 [details]
Bug 1658986 Part 3: Remove NativelayerCA checks of invalid regions.

Beta/Release Uplift Approval Request

  • User impact if declined: macOS users will experience crashes while pinch-zooming under exceptional circumstances.
  • Is this code covered by automated tests?: No
  • Has the fix been verified in Nightly?: Yes
  • Needs manual test from QE?: No
  • If yes, steps to reproduce:
  • List of other uplifts needed: None
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): This removes an intentional crash we put in to flush out potential invalid pixels. We now believe those pixels are valid, so we no longer want this check, and we certainly don't need to crash.
  • String changes made/needed:
  • Is Android affected?: No
Attachment #9322439 - Flags: approval-mozilla-release?
Attachment #9316233 - Flags: approval-mozilla-release?
Attachment #9320624 - Flags: approval-mozilla-release?

If this can't be merged to release with all parts applied, I will create a separate rebased upliftable patch for release.

Comment on attachment 9316233 [details]
Bug 1658986: Make NativeLayerCA::HandlePartialUpdate sanity-check the update area.

Already uplifted for 111
See Comment 33

Attachment #9316233 - Flags: approval-mozilla-release? → approval-mozilla-release-

Comment on attachment 9320624 [details]
Bug 1658986 Part 2: Add region stringification to gfxCriticalError before crashing.

Approved for 111.0.1

Attachment #9320624 - Flags: approval-mozilla-release? → approval-mozilla-release+

Comment on attachment 9322439 [details]
Bug 1658986 Part 3: Remove NativelayerCA checks of invalid regions.

Approved for 111.0.1

Attachment #9322439 - Flags: approval-mozilla-release? → approval-mozilla-release+
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: