Closed Bug 1850285 Opened 10 months ago Closed 8 months ago

Crash exiting fullscreen (F11) with gfx.canvas.accelerated set on Nvidia + Wayland

Categories

(Core :: Graphics, defect)

Firefox 116
Unspecified
Linux
defect

RESOLVED MOVED

People

(Reporter: jplebreton, Unassigned)

References

(Blocks 2 open bugs)

Details

(Keywords: crash)

Attachments

(2 files)

User Agent: Mozilla/5.0 (X11; Linux x86_64; rv:109.0) Gecko/20100101 Firefox/116.0

Steps to reproduce:

  1. Run Firefox on a GNOME Wayland session with an Nvidia GPU.
  2. Ensure gfx.canvas.accelerated is set to true.
  3. Open a page that uses Canvas, eg https://webglsamples.org/aquarium/aquarium.html
  4. Press F11 to make the browser full screen, then press it again to exit full screen. Observe: Firefox doesn't crash.
  5. Wait a while; leave the Canvas tab open and use the browser for other stuff. (Sorry this is underspecified; I don't know what exactly causes it to start happening.)
  6. With any tab (not just the Canvas tab) open, press F11 once to fullscreen, then press it again.

Actual results:

The browser crashes and the standard Firefox crash report dialog comes up.

Expected results:

The browser shouldn't crash when exiting fullscreen.

A few findings from this issue I filed earlier to try to narrow down the causes and related phenomena: https://github.com/Robbendebiene/Gesturefy/issues/679

  1. This also happens with extensions that use a Canvas, eg Gesturefy's use of it for drawing the user's gesture trail on screen above the page in Firefox. Turning off the trail made the crash stop happening.
  2. This only happens in Wayland; when I start Firefox in an X11 session, the crash doesn't happen.
  3. This doesn't happen when gfx.canvas.accelerated is set to false.

Hence the specific combination of terms in the bug summary.

Given that Nvidia's Wayland support is still maturing, I wouldn't be surprised if this is a bug on their end, but it seemed logical to file this bug first to determine that it wasn't a Firefox issue - as a game developer I have plenty of other applications on my system that create fullscreen (all the various kinds - borderless, exclusive, etc) windows, but only Firefox exhibits this crash.

Happy to provide additional information as needed.

One last point I forgot to note above: I'm running the Nvidia binary drivers, version 535.98.

The Bugbug bot thinks this bug should belong to the 'Core::Graphics' component, and is moving the bug to that component. Please correct in case you think the bot is wrong.

Component: Untriaged → Graphics
Product: Firefox → Core

Thank you for the bug report. Could you please go to about:support, click "copy text to clipboard" and attach the contents to this bug?

Could you please also post a link to a crash report from about:crashes?

cc Lee for accelerated canvas crash

Severity: -- → S3
Component: Graphics → Graphics: Canvas2D
Flags: needinfo?(lsalzman)

Can you please submit a crash report from about:crashes? Without having a crash stack, it's pretty difficult to diagnose what's going on here.

Flags: needinfo?(lsalzman) → needinfo?(jplebreton)

as requested, contents of about:support from a minimal profile that reproduces the crash

Flags: needinfo?(jplebreton)

(In reply to jplebreton from comment #6)

And here's a crash report link:

https://crash-stats.mozilla.org/report/index/fc09b7fc-8296-481b-85f4-6c0450230829

Martin, it looks like something is getting double-freed inside a wayland resize callback?

Component: Graphics: Canvas2D → Widget: Gtk
Flags: needinfo?(stransky)
OS: Unspecified → Linux
Blocks: wayland

To me this clearly looks like a bug in the Nvidia driver, a race condition.

If I read the stack right, the driver crashes because it handles updated linux-dmabuf feedback tranches in the resize callback. The new tranche (without a scanout tranche) is expected to be sent when exiting fullscreen on GNOME, so that's not a surprise.

The fact that it's only observed so far with gfx.canvas.accelerated enabled is likely due to timing differences, as reallocating/resizing buffers on the GPU takes time.

Just a guess, but the driver calls the function create_surface_context, indicating that it destroyed the old surface context, which may have contained a reference to the format list to be destroyed in wlEglDestroyFormatSet, given that tranche data is surface-related.

Crash Signature: [@ wlEglDestroyFormatSet ]
Keywords: crash

The bug has a crash signature, thus the bug will be considered confirmed.

Status: UNCONFIRMED → NEW
Ever confirmed: true

May be a dupe of Bug 1844543

Flags: needinfo?(stransky)
Duplicate of this bug: 1850096

User in bug 1850285 provided crash reports matching this signature. STR are watching video in fullscreen then exiting fullscreen

The bug is linked to a topcrash signature, which matches the following criterion:

  • Top 10 desktop browser crashes on nightly

:stransky, could you consider increasing the severity of this top-crash bug?

For more information, please visit BugBot documentation.

Flags: needinfo?(stransky)
Keywords: topcrash

Bug 1844543 has landed, so it would be great if the crash rate is lower now.

Flags: needinfo?(stransky)

Bug 1850696 may be another cause of it: if we're running out of file descriptors on WebGL, anything can happen.
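For anyone who wants to check whether file descriptor exhaustion is plausible on their system, here is a minimal sketch that compares a process's open descriptor count against its soft limit. It assumes a Linux /proc filesystem; `fd_usage` is a hypothetical helper name, not anything from Firefox or the driver, and you need permission to read the target process's /proc entries.

```python
import os


def fd_usage(pid):
    """Return (open_fds, soft_limit) for a process, read from Linux /proc.

    soft_limit is None when the limit is reported as "unlimited" or when
    the "Max open files" row cannot be found.
    """
    # Each entry in /proc/<pid>/fd is one open file descriptor.
    open_fds = len(os.listdir(f"/proc/{pid}/fd"))

    soft_limit = None
    with open(f"/proc/{pid}/limits") as limits:
        for line in limits:
            # Row layout: "Max open files  <soft>  <hard>  files"
            if line.startswith("Max open files"):
                soft = line.split()[3]
                soft_limit = None if soft == "unlimited" else int(soft)
                break
    return open_fds, soft_limit


if __name__ == "__main__":
    # Demonstrated on this script's own process; pass the PID of a
    # Firefox process instead to inspect the browser.
    used, limit = fd_usage(os.getpid())
    print(f"open fds: {used}, soft limit: {limit}")
```

A count that sits close to the soft limit shortly before a crash would point towards Bug 1850696 rather than the driver.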

See Also: → 1850696, 1844543

As mentioned in Comment 8, I'm pretty sure this is simply an Nvidia driver bug, specifically in their EGL Wayland implementation. There is nothing we can do about it apart from pinging Nvidia devs (or disabling HW WebRender).

Arthur, could you maybe have a look here?

Flags: needinfo?(ahuillet)

I created a thread on the Nvidia Linux driver support forums that refers back to this bug report: https://forums.developer.nvidia.com/t/crash-exiting-fullscreen-in-firefox-under-wayland/266666

Hopefully this makes the driver team there aware of this and/or someone there will be able to confirm or refute the suggestion here that it's a driver issue.

Robert, I'm not qualified to help, but Erik, who I believe you've interacted with before, should be.

Flags: needinfo?(ahuillet) → needinfo?(ekurzinger)

Earlier I investigated what is almost certainly the same crash in https://bugzilla.mozilla.org/show_bug.cgi?id=1840360. Unfortunately I was unable to reproduce it and couldn't think of a plausible explanation by inspection.

I still haven't managed to get a local repro. Strangely, on my system it looks like mutter doesn't want to enable direct scan-out for Firefox even when it's full-screen, and hence it never sends any dmabuf feedback events. Other full-screen EGL applications do get scanned out, so I'm not sure what's going on there.

However, while I was poking around I noticed that there actually is a pretty bad bug in egl-wayland's dmabuf feedback handling, and I think it could potentially explain what folks have been seeing. The problem is that, after a window is resized (by wl_egl_window_resize), we immediately try to dispatch the queue that the feedback events are received on, but the compositor won't send those events until after the next wl_surface_commit when we attach a buffer with the new size (meaning after the next eglSwapBuffers call).

So when you make a window full-screen, egl-wayland doesn't actually receive a new feedback. When the compositor does send one later, it will just sit in the wl_event_queue. That is until you un-full-screen the window. Then it gets processed. Thus, we're effectively always one-feedback-behind where we should be.
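To make the "one feedback behind" ordering concrete, here is a toy model in Python. It is only a sketch of the sequence described above, not egl-wayland's actual code: `ToyCompositor`, `ToyWindow`, and `simulate` are hypothetical names, and a plain deque stands in for the wl_event_queue.

```python
from collections import deque


class ToyCompositor:
    """Queues a feedback event only when a buffer of a new size is committed."""

    def __init__(self, initial_size):
        self.event_queue = deque()  # stands in for the wl_event_queue
        self.last_committed_size = initial_size

    def commit(self, size):
        # Stand-in for the wl_surface_commit that follows eglSwapBuffers.
        if size != self.last_committed_size:
            self.event_queue.append(size)  # feedback for the new size
            self.last_committed_size = size


class ToyWindow:
    def __init__(self, compositor, size):
        self.compositor = compositor
        self.size = size
        self.applied_feedback = size  # feedback the client has processed

    def resize(self, size):
        # Stand-in for wl_egl_window_resize: the queue is dispatched right
        # away, before any buffer with the new size has been committed.
        self.size = size
        while self.compositor.event_queue:
            self.applied_feedback = self.compositor.event_queue.popleft()

    def swap_buffers(self):
        self.compositor.commit(self.size)


def simulate():
    comp = ToyCompositor("windowed")
    win = ToyWindow(comp, "windowed")
    log = []
    for new_size in ("fullscreen", "windowed"):
        win.resize(new_size)
        win.swap_buffers()
        log.append((win.size, win.applied_feedback, len(comp.event_queue)))
    return log


if __name__ == "__main__":
    for size, applied, pending in simulate():
        print(f"window={size} applied_feedback={applied} pending={pending}")
```

In this model the window ends each step with the previous state's feedback applied and one event still pending, which is the lag described above.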

Therefore, my theory as to what ends up causing the double-free is a surface getting destroyed while there's still that one pending batch of feedback events in the queue from the most recent full-screening / un-full-screening. Then when some other surface eventually dispatches those, it causes a crash.

One possible fix would be to use the per-surface queue for dmabuf feedback events instead of the per-display queue which we currently do. The per-surface queue gets dispatched every frame, so that should ensure they always get processed promptly.
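The difference between the two queue choices can be sketched with another toy model. Again this is an illustration under simplifying assumptions, not the driver's code: `Surface` and `stale_events_after_destroy` are hypothetical names, an event is modelled as a reference to its target surface, and the shared queue is assumed to be dispatched only on resize while a per-surface queue is dispatched on every frame.

```python
from collections import deque


class Surface:
    def __init__(self, display_queue, own_queue=False):
        # With own_queue=True the surface gets a private queue; otherwise
        # its feedback events land on the shared per-display queue.
        self.own_queue = own_queue
        self.queue = deque() if own_queue else display_queue
        self.alive = True

    def receive_feedback(self):
        # The queued event refers back to the surface it was sent for.
        self.queue.append(self)

    def _dispatch(self):
        stale = 0
        while self.queue:
            target = self.queue.popleft()
            if not target.alive:
                stale += 1  # in C this would touch already-freed state
        return stale

    def present_frame(self):
        # Only a per-surface queue is dispatched on every frame.
        return self._dispatch() if self.own_queue else 0

    def resize(self):
        # The queue in use is dispatched on resize in both variants.
        return self._dispatch()

    def destroy(self):
        self.alive = False


def stale_events_after_destroy(per_surface_queues):
    display_queue = deque()
    a = Surface(display_queue, own_queue=per_surface_queues)
    b = Surface(display_queue, own_queue=per_surface_queues)
    a.receive_feedback()  # feedback arrives after a's last commit
    a.present_frame()     # a draws one more frame
    a.destroy()           # a goes away
    return b.resize()     # later, b dispatches whatever queue it uses


if __name__ == "__main__":
    print("shared display queue:", stale_events_after_destroy(False))
    print("per-surface queues:  ", stale_events_after_destroy(True))
```

With the shared queue, the surviving surface ends up dispatching an event that belongs to a destroyed surface; with per-surface queues the event is consumed while its surface is still alive.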

Alas, I can't really confirm the above. If anyone who is able to reproduce the crash would be willing to build egl-wayland (https://github.com/NVIDIA/egl-wayland) with the attached patch applied I would be grateful.

Flags: needinfo?(ekurzinger)

I'd also like to update this bug: I tested Erik's patch on top of the master branch of egl-wayland on my system, and after at least a day of repeatedly trying to reproduce the crash, I never hit it. So I would say the patch fixed this.

I wanted to try out Erik's patch on my system. I'm on an Arch-based distro, so the first step was to replace the official repo's egl-wayland package (last updated 2023-06-06) with the AUR (locally built) egl-wayland-git package. Without changing anything about the code from the latest git main branch (i.e. not having yet applied Erik's patch), this change appears to have fixed the crash on my system. So it's possible that the two commits made to egl-wayland's git repository since its 1.1.12 release was tagged fixed the issue? As it's locally built, there are many other variables that could also explain the change, but I'm offering the info here in case it's useful.

Based on the topcrash criteria, the crash signature linked to this bug is not a topcrash signature anymore.

For more information, please visit BugBot documentation.

Keywords: topcrash
Component: Widget: Gtk → Graphics
Blocks: wr-nv-linux

It may no longer be a topcrasher, but I've been hitting the wlEglDestroyFormatSet crash signature daily since switching to Wayland.

No fullscreen mode is involved in my case, however.

Yeah, it crashes often here too, even in full-screen Google Meet.

tekstr1der / Alexandre: Could you try the patch mentioned in Comment 20 / 21 / 22? (given that the commits landed already you can simply clone https://github.com/NVIDIA/egl-wayland/commits/master)

There doesn't appear to be a release with them yet unfortunately.

(In reply to Robert Mader [:rmader] from comment #26)

Sure. I built egl-wayland 1.1.12+r3+g3f9889c, and will see how things go. Thanks!

Also might be worth trying to repro with the just-released 545 beta drivers: https://www.nvidia.com/download/driverResults.aspx/212964/en-us/

I posted a new release of egl-wayland today, 1.1.13, which should contain the fix.

Thanks Erik!

Status: NEW → RESOLVED
Closed: 8 months ago
Resolution: --- → MOVED
Duplicate of this bug: 1862255

Copying crash signatures from duplicate bugs.

Crash Signature: [@ wlEglDestroyFormatSet ] → [@ wlEglDestroyFormatSet ] [@ huge_dalloc | libnvidia-egl-wayland.so.1@0x485f]
See Also: → 1849237