Closed Bug 1817336 Opened 1 year ago Closed 1 year ago

Crash in [@ nouveau_fence_trigger_work] on poison values

Categories

(Core :: Graphics, defect, P1)

Desktop
Linux
defect

Tracking

()

RESOLVED FIXED
112 Branch
Tracking Status
firefox-esr102 --- unaffected
firefox-esr115 - ---
firefox110 + wontfix
firefox111 --- fixed
firefox112 --- fixed

People

(Reporter: mccr8, Assigned: lsalzman)

References

(Blocks 1 open bug)

Details

(Keywords: crash, csectype-uaf, sec-high, Whiteboard: [adv-main111+r])

Crash Data

Attachments

(1 file)

Crash report: https://crash-stats.mozilla.org/report/index/65b272d8-e521-4366-9cce-b71a20230215

Reason: SIGSEGV / SI_KERNEL

Top 10 frames of crashing thread:

0  libgallium_dri.so  nouveau_fence_trigger_work  src/gallium/drivers/nouveau/nouveau_fence.c:53
1  libgallium_dri.so  nouveau_fence_update  src/gallium/drivers/nouveau/nouveau_fence.c:128
2  libgallium_dri.so  nouveau_fence_wait  src/gallium/drivers/nouveau/nouveau_fence.c:219
3  libgallium_dri.so  st_finish  src/mesa/state_tracker/st_cb_flush.c:76
4  libgallium_dri.so  st_glFinish  src/mesa/state_tracker/st_cb_flush.c:115
5  libgallium_dri.so  _mesa_get_fallback_texture  src/mesa/main/texobj.c:1070
6  libgallium_dri.so  _mesa_update_texture_state  src/mesa/main/texstate.c:910
7  libgallium_dri.so  _mesa_update_state_locked  src/mesa/main/state.c:476
8  libgallium_dri.so  _mesa_update_state  src/mesa/main/state.c:504
9  libgallium_dri.so  _mesa_ClearBufferfv  src/mesa/main/clear.c:663

The above crash looks like it is in canvas WebGL.

In this one, there's some CanvasRenderingContext2D stuff on the main thread:
bp-a6af659d-5054-4d66-9c3e-203f20230215

Lots of these crashes are on the jemalloc poison value. These stacks look pretty deep inside the graphics drivers so I'm not sure if we can do anything.

Summary: Crash in [@ nouveau_fence_trigger_work] → Crash in [@ nouveau_fence_trigger_work] on poison values

These crashes are happening with relatively recent versions of Mesa, in particular 22.2.5 is still supported so we might file a bug on their tracker. I did it in the past for similar issues and they were always responsive.

I have filed a confidential issue on Mesa's bug tracker: https://gitlab.freedesktop.org/mesa/mesa/-/issues/8309

Since this is affecting both Ubuntu and Debian users I don't think it's an issue with either distribution's packages, it's more likely to be an upstream issue.

I'll mark this as sec-high for now, but maybe it should be sec-vector instead.

Keywords: sec-high
Blocks: gfx-triage
Severity: -- → S3
Priority: -- → P2

The severity field for this bug is set to S3. However, the bug is flagged with the sec-high keyword.
:bhood, could you consider increasing the severity of this security bug?

For more information, please visit auto_nag documentation.

Flags: needinfo?(bhood)
Flags: needinfo?(bhood)
OS: Unspecified → Linux
Hardware: Unspecified → Desktop

Upstream thinks this is a thread safety issue. We can consider blocking nouveau < 22.3

It looks like this is happening because of accelerated canvas in the parent causing us to do multi-threaded GL on nouveau which doesn't work. Lee can we disable accelerated canvas in the parent for now?

Flags: needinfo?(lsalzman)
Severity: S3 → S2
Priority: P2 → P1
Assignee: nobody → lsalzman
Status: NEW → ASSIGNED

In discussions with Jeff, I believe the fallout from this is limited to Linux + Nouveau configurations, so a fraction of a fraction of the release population.

I don't know how exploitable this is, but I would guess the most common problem will simply be instability and crashiness.

Given all that, sec-high might be a bit much.

Flags: needinfo?(lsalzman)

Comment on attachment 9319572 [details]
Bug 1817336 - Only create accelerated Canvas2D contexts in content process. r?jrmuizel

Security Approval Request

  • How easily could an exploit be constructed based on the patch?: Unknown, as this will mostly just lead to races inside the Mesa driver. This only affects the Linux + Nouveau population, and the most common symptom will be frequent crashes when using browser features that rely on any form of Canvas2D-based screenshotting or otherwise utilize Canvas2D in the parent process. However, given the crash frequency compared to the size of this population (only 3.5% of Linux users), the crash frequency is a bit worrying.
  • Do comments in the patch, the check-in comment, or tests included in the patch paint a bulls-eye on the security problem?: Unknown
  • Which older supported branches are affected by this flaw?: 110
  • If not all supported branches, which bug introduced the flaw?: Bug 110
  • Do you have backports for the affected branches?: Yes
  • If not, how different, hard to create, and risky will they be?:
  • How likely is this patch to cause regressions; how much testing does it need?: This ensures accelerated Canvas2D can't be used in anything other than the content process, which was the case prior to 110 anyway. We generally should never be creating accelerated Canvas2D contexts in any other process than a content process anyway, since these are not typically long-running use-cases that benefit from heavy acceleration.
  • Is Android affected?: No
Attachment #9319572 - Flags: sec-approval?

A bit more detail - normally WebGL remoting from the content process to the parent/GPU process will ensure that such WebGL commands are executed on either the Renderer or CanvasRenderer thread in the parent/GPU process depending on whether the THREADSAFE_GL feature is present (which it is present on all platforms save Linux/Nouveau).

However, if something in the parent process tries to do its own Canvas2D usage, and potentially creating an accelerated WebGL context, it will bypass the remoting normally used from the content process and directly execute those commands in a thread other than the Renderer or CanvasRenderer threads. WebRender will then be running on the Renderer thread in the same process, although on a separate GL context. However, Nouveau is not safe to use in this way (hence no THREADSAFE_GL feature), and will result in the crashes we are seeing.

Blocks: 1818625

Andrew, can we get a quick sec-approval on this? I am not terribly worried about the sec implications, more that this is causing a worryingly high crash rate among a very small subset of our release population. I would like to get this into nightly and tested ASAP, in case this needs a dot-release ride-along or something like that.

Flags: needinfo?(continuation)

[Tracking Requested - why for this release]:
High crash rate among a small subset (Linux + Mesa/Nouveau) of our release population.

Comment on attachment 9319572 [details]
Bug 1817336 - Only create accelerated Canvas2D contexts in content process. r?jrmuizel

Approved to land. This seems obtuse enough I'm not that concerned about ensuring an uplift and it seems we may want to bake it for a bit anyway.

Attachment #9319572 - Flags: sec-approval? → sec-approval+
Flags: needinfo?(continuation)
Crash Signature: [@ nouveau_fence_trigger_work] → [@ nouveau_fence_trigger_work] [@ libdrm_nouveau.so.2@0x4e06] [@ librdrm_nouveau.so.2@0x5312] [@ nouveau_pushbuf_data.localalias]

Comment on attachment 9319572 [details]
Bug 1817336 - Only create accelerated Canvas2D contexts in content process. r?jrmuizel

Beta/Release Uplift Approval Request

  • User impact if declined: High crash rate for users of Linux + Mesa/Nouveau.
  • Is this code covered by automated tests?: No
  • Has the fix been verified in Nightly?: No
  • Needs manual test from QE?: No
  • If yes, steps to reproduce:
  • List of other uplifts needed: None
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): This disables acceleration in a scenario that it was never expected to be used in the first place. So in the worst case scenario that this didn't actually fix the bug, it doesn't break anything because it only returns us to the status quo prior to 110. On the flip side, we still want to ensure Canvas2D acceleration is never used outside of content processes anyway.

Best case scenario, this will just magically fix all these crashes for our users without incident. This bug is hard to test for locally, so without pushing it into the wild it is really hard to guess if this is working, since the incidence in nightly is fairly low, while jetting up significantly in beta and release. While it might not justify a dot-release on its own, a ride-along with a dot-release I would strongly advocate for.

  • String changes made/needed:
  • Is Android affected?: No
Attachment #9319572 - Flags: approval-mozilla-release?
Attachment #9319572 - Flags: approval-mozilla-beta?
Crash Signature: [@ nouveau_fence_trigger_work] [@ libdrm_nouveau.so.2@0x4e06] [@ librdrm_nouveau.so.2@0x5312] [@ nouveau_pushbuf_data.localalias] → [@ nouveau_fence_trigger_work] [@ libdrm_nouveau.so.2@0x4e06] [@ librdrm_nouveau.so.2@0x5312] [@ nouveau_pushbuf_data.localalias] [@ arena_run_reg_dalloc | arena_t::DallocSmall | arena_dalloc | BaseAllocator::free | Allocator<T>::free | free | nouveau_f…
Crash Signature: nouveau_fence_trigger_work ] → nouveau_fence_trigger_work ] [@ nouveau_mm_free | nouveau_fence_trigger_work ]
Crash Signature: nouveau_fence_trigger_work ] [@ nouveau_mm_free | nouveau_fence_trigger_work ] → nouveau_fence_trigger_work ] [@ nouveau_mm_free | nouveau_fence_trigger_work ] [@ list_del ]

Bug 1777849 seems like we ran into a kinda similar WebGL thread-safety issue from an entirely different source - libX11 is also not thread-safe in older versions, which caused it to crash... Worth nothing, even though it has a completely different cause, because the end result is that we're running multiple WebGL threads in a scenario where that is not good.

See Also: → 1777849
Group: gfx-core-security → core-security-release
Status: ASSIGNED → RESOLVED
Closed: 1 year ago
Resolution: --- → FIXED
Target Milestone: --- → 112 Branch

It seems like in coordination with bug 1777849, this can aggravate the libX11 race condition as well by originating more GLX calls from the main thread, which can induce the symptoms in bug 1777849 independent of Nouveau. Rather, that causes the Xlib crash as opposed to what we see here with Nouveau. The fix here works for both cases regardless.

Comment on attachment 9319572 [details]
Bug 1817336 - Only create accelerated Canvas2D contexts in content process. r?jrmuizel

Approved for 111.0b7

Attachment #9319572 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
QA Whiteboard: [post-critsmash-triage]

Comment on attachment 9319572 [details]
Bug 1817336 - Only create accelerated Canvas2D contexts in content process. r?jrmuizel

Rejecting release uplift approval, we are in RC week for Fx111.
There is no Fx110 dot release planned.

Attachment #9319572 - Flags: approval-mozilla-release? → approval-mozilla-release-
Whiteboard: [adv-main111+r]
Duplicate of this bug: 1823563

Copying crash signatures from duplicate bugs.

Crash Signature: nouveau_fence_trigger_work ] [@ nouveau_mm_free | nouveau_fence_trigger_work ] [@ list_del ] → nouveau_fence_trigger_work ] [@ nouveau_mm_free | nouveau_fence_trigger_work ] [@ list_del ] [@ arena_run_reg_dalloc | arena_t::DallocSmall | arena_dalloc | BaseAllocator::free | Allocator<T>::free | PageFree]
No longer blocks: gfx-triage
Crash Signature: nouveau_fence_trigger_work ] [@ nouveau_mm_free | nouveau_fence_trigger_work ] [@ list_del ] [@ arena_run_reg_dalloc | arena_t::DallocSmall | arena_dalloc | BaseAllocator::free | Allocator<T>::free | PageFree] → nouveau_fence_trigger_work ] [@ nouveau_mm_free | nouveau_fence_trigger_work ] [@ list_del ] [@ arena_run_reg_dalloc | arena_t::DallocSmall | arena_dalloc | BaseAllocator::free | Allocator<T>::free | PageFree]
Group: core-security-release
Flags: needinfo?(ryanvm)
Flags: needinfo?(lsalzman)

This fix landed in 112. It's already in 115 ESR?

Flags: needinfo?(lsalzman)

Sounds like we should get a new bug filed for any lingering crashes with this signature.

Flags: needinfo?(ryanvm)
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: