1737834 - Wayland/Nvidia: Crash in [@ NvGlEglGetFunctions] (Fixed by Nvidia driver 510)

Sebastian Hengst [:aryx] (needinfo me if it's about an intermittent or backout)

Reporter

Description

•

4 years ago

16 Linux crashes from 9+ devices. The oldest reported crash is from 2021-09-16 and affects even some older builds, but Nightly 94.0a1 and 95.0a1 account for 16 crashes from 9+ devices. Is this a driver update issue?

Maybe Fission-related (DOMFissionEnabled=1).

Crash report: https://crash-stats.mozilla.org/report/index/6fb2714d-5337-4211-b6f7-5f1020211025

Reason: SIGSEGV / SEGV_MAPERR

Top 10 frames of crashing thread:

0 libnvidia-eglcore.so.470.74 NvGlEglGetFunctions 
1 libnvidia-eglcore.so.470.74 <.text ELF section in libnvidia-eglcore.so.470.74> 
2 libnvidia-eglcore.so.470.74 <.text ELF section in libnvidia-eglcore.so.470.74> 
3 libnvidia-eglcore.so.470.74 <.text ELF section in libnvidia-eglcore.so.470.74> 
4 libnvidia-eglcore.so.470.74 <.text ELF section in libnvidia-eglcore.so.470.74> 
5 libnvidia-eglcore.so.470.74 <.text ELF section in libnvidia-eglcore.so.470.74> 
6 libxul.so <gleam::gl::GlFns as gleam::gl::Gl>::tex_sub_image_2d_pbo third_party/rust/gleam/src/gl_fns.rs:750
7 libxul.so webrender::device::gl::TextureUploader::update_impl gfx/wr/webrender/src/device/gl.rs:4678
8 libxul.so webrender::device::gl::TextureUploader::flush_buffer gfx/wr/webrender/src/device/gl.rs:4631
9 libxul.so webrender::device::gl::TextureUploader::flush gfx/wr/webrender/src/device/gl.rs:4640

Darkspirit

Comment 1

•

4 years ago

https://bugreports.qt.io/browse/QTBUG-97127 mentions NvGlEglGetFunctions on Wayland.

Blocks: wr-nv-linux

OS: Unspecified → Linux

Hardware: Unspecified → x86_64

Updated

•

3 years ago

Summary: Crash in [@ NvGlEglGetFunctions] → Wayland/Nvidia: Crash in [@ NvGlEglGetFunctions]

Darkspirit

Updated

•

3 years ago

Depends on: 1743833

matt

Comment 2

•

3 years ago

I'm not sure if this helps at all, but I'm having this issue specifically on the sign-in page for Google. The page itself loads (and seems to render) fine, but then a few seconds later the tab crashes. It happens every time, but that's the only page I've found where this behavior occurs. Like the bug report says, I'm on Nvidia and Wayland. Please let me know if you'd like more info or if I can help at all!

Robert Mader [:rmader]

Comment 3

•

3 years ago

@matt: thanks, having a user around with an affected device is super helpful. Could you share which driver version you are on? The best would be if you could share the content of your about:support ("Copy text to clipboard" -> paste here into comment -> a popup offering to create an attachment should open -> yes).

Flags: needinfo?(matt)

Robert Mader [:rmader]

Updated

•

3 years ago

Blocks: wayland

matt

Comment 4

•

3 years ago

Attached file about:support information — Details

(In reply to Robert Mader [:rmader] from comment #3)

> @matt: thanks, having a user around with an affected device is super helpful. Could you share which driver version you are on? The best would be if you could share the content of your `about:support` ("Copy text to clipboard" -> paste here into comment -> a popup offering to create an attachment should open -> yes).

Sure thing! Here's the info you requested. I think it's worth noting that this issue no longer seems to be occurring for me after logging into a GNOME X session and then back into a GNOME Wayland session.

Robert Mader [:rmader]

Comment 5

•

3 years ago

Thanks! So this is on 495.46.0.0, the original report is for 470.74. I assume it is most likely a driver issue (or libglvnd etc.).

Erik, needinfoing you here so you know about the issue - do you maybe have internal reports about this already? Thanks!

Flags: needinfo?(ekurzinger)

Robert Mader [:rmader]

Updated

•

3 years ago

Flags: needinfo?(matt)

Erik Kurzinger

Comment 6

•

3 years ago

I'm inclined to think NvGlEglGetFunctions is just appearing in the stack trace as an artifact. That function is only ever called during EGL initialization. The real problem is probably related to the glTexSubImage2D call.

@matt, if you have a reliable repro, would you mind trying with the environment variable OGL_SkipTextureHostCopies=1? Looking at a couple of internally-tracked bugs I suspect that might have an effect. If it seems to fix the issue it would help narrow things down. I'm not able to reproduce the crash on my system, unfortunately.
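For anyone else trying this, one way to start Firefox with that variable set is sketched below in Python. Only the variable name `OGL_SkipTextureHostCopies` comes from this comment; the helper names and the `firefox` binary name are assumptions for illustration.

```python
import os
import subprocess


def env_with_skip_texture_host_copies(base_env=None):
    """Return a copy of the environment with OGL_SkipTextureHostCopies=1 set."""
    env = dict(os.environ if base_env is None else base_env)
    env["OGL_SkipTextureHostCopies"] = "1"
    return env


def launch_firefox(binary="firefox"):
    """Start Firefox (the binary name is an assumption) with the variable set."""
    return subprocess.Popen([binary], env=env_with_skip_texture_host_copies())
```

Calling `launch_firefox()` starts the browser. From a shell, prefixing the command with `OGL_SkipTextureHostCopies=1` has the same effect.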

Flags: needinfo?(ekurzinger)

matt

Comment 8

•

3 years ago

(In reply to Erik Kurzinger from comment #6)

> @matt, if you have a reliable repro, would you mind trying with the environment variable OGL_SkipTextureHostCopies=1? Looking at a couple of internally-tracked bugs I suspect that might have an effect. If it seems to fix the issue it would help narrow things down. I'm not able to reproduce the crash on my system, unfortunately.

I'll definitely give it a shot when I notice a specific site crashing between restarts. Unfortunately it's incredibly inconsistent to reproduce. I had issues with Google sign-in and then I had the same thing occur with hCaptcha embeds. Now both are working fine, so it may just be a matter of waiting for this to pop up for me again (it seems to persist on the same site across Firefox restarts but not desktop sessions). Will keep you updated!

Robert Mader [:rmader]

Comment 9

•

3 years ago

Is there a chance that this only happens after suspend/resume, given that the Nvidia driver is a bit special about that?

matt

Comment 10

•

3 years ago

I believe I have suspending disabled at the moment because it breaks everything :P Manually suspending it doesn't seem to make the issue occur either, at least on the usual sites.

matt

Comment 11

•

3 years ago

I believe I was able to reproduce the crash while running with the OGL_SkipTextureHostCopies=1 environment variable. Here's the log: https://crash-stats.mozilla.org/report/index/5ca54b3a-28a6-46df-9659-0566c0211231

Emilio Cobos Álvarez [:emilio]

Comment 13

•

3 years ago

It seems the reporter of bug 1750448 can consistently repro this. From the reports linked there it seems we're just trying to initialize a WebGL context and crash inside the NVidia driver. Jeff, do you know if we have contacts at NVidia that could take a look?

Flags: needinfo?(jmuizelaar)

Emilio Cobos Álvarez [:emilio]

Comment 14

•

3 years ago

Is there any chance you could attach your about:support information here? Does your graphics card match the one from comment 4?

Flags: needinfo?(assaf_hershko)

assaf_hershko

Comment 15

•

3 years ago

Attached file AboutSupport — Details

Sure, see my about:support data below.

assaf_hershko

Comment 16

•

3 years ago

With regard to my graphics card: it is an RTX 3060 Ti. It may be worth noting that the exact same thing happens on my other machine, which has a GTX 970.

Flags: needinfo?(assaf_hershko)

assaf_hershko

Comment 17

•

3 years ago

And adding what I believe is an important clue - as per my original bug report, I am using fractional scaling (at 125%).
If I disable fractional scaling (gsettings reset org.gnome.mutter experimental-features) and restart - the sites load fine.
In other words - this (or at least, my specific issue) seems to be caused by something about how Wayland (and the Nvidia drivers?) handle fractional scaling.
I hope this helps! :)
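A small Python sketch of checking whether the setting is active before resetting it. The `gsettings` schema and key are the ones from the command in this comment; the feature name `scale-monitor-framebuffer` is, as far as I know, what GNOME uses for fractional scaling on Wayland, and the helper names are hypothetical.

```python
import subprocess

SCHEMA = "org.gnome.mutter"
KEY = "experimental-features"


def fractional_scaling_enabled(gsettings_output):
    """True if the printed value of the key lists the fractional scaling feature."""
    return "'scale-monitor-framebuffer'" in gsettings_output


def read_experimental_features():
    """Run `gsettings get` and return its raw text output."""
    result = subprocess.run(
        ["gsettings", "get", SCHEMA, KEY],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def reset_experimental_features():
    """Same effect as the reset command in this comment."""
    subprocess.run(["gsettings", "reset", SCHEMA, KEY], check=True)
```

On an affected machine, `fractional_scaling_enabled(read_experimental_features())` would tell you whether the reset changes anything at all.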

matt

Comment 18

•

3 years ago

(In reply to assaf_hershko from comment #17)

> And adding what I believe is an important clue - as per my original bug report, I am using fractional scaling (at 125%).
> If I disable fractional scaling (gsettings reset org.gnome.mutter experimental-features) and restart - the sites load fine.
> In other words - this (or at least, my specific issue) seems to be caused by something about how Wayland (and the Nvidia drivers?) handle fractional scaling.
> I hope this helps! :)

I don't believe I was using fractional scaling when I encountered these issues, so it may be a bit deeper than that.

assaf_hershko

Comment 19

•

3 years ago

Yes, agree. I turned fractional scaling back on and the sites still loaded fine for a bit... but are crashing again now.
The good news (I guess) is that I can easily reproduce this on both machines. Would be happy to provide you with whatever info is needed, and/or access to the machines themselves if that helps.

Darkspirit

Comment 20

•

3 years ago

Please test if this has been fixed with Nvidia driver 510.

bug 1743833 comment 9, comment 4 and comment 15 have Gnome Wayland with driver 495 and 2 displays.

(Alynx Zhou from bug 1743833 comment 17)

I am trying to downgrade mutter to 41.1 to see if things changed.

(Alynx Zhou from bug 1743833 comment #18)

It works fine with mutter 41.1 for hours, I'll open an issue for mutter for help.

(Alynx Zhou from bug 1743833 comment #19)

https://gitlab.gnome.org/GNOME/mutter/-/issues/2045
Issue for mutter.

(Erik Kurzinger from https://gitlab.gnome.org/GNOME/mutter/-/issues/2045#note_1331543)

Oh snap, I think I've figured it out. Firstly, the Xwayland vidmem leak was a red herring. I thought that was what was happening because if you watch nvidia-smi while repeatedly running and killing an X11 application, Xwayland's memory usage will steadily increase, but only up to a certain point (about 900MB on my system). This is just because our EGL driver won't always immediately free vidmem allocations for textures and whatnot when the GL object is destroyed, but they will get freed eventually.

However, we are indeed leaking vidmem. Specifically the GBM buffers mutter allocates for the hardware cursor image. It's not mutter's fault, though, it correctly calls gbm_bo_destroy when it's done with the buffers. The problem is that it uses gbm_bo_write to write the cursor data into the buffer. Our implementation of that function will do an implicit gbm_bo_map, but won't unmap afterwards. So even though mutter destroys the GBM buffer, the backing dma-buf won't get freed because of this leftover mapping.

If I change our gbm_bo_write implementation to unmap the buffer when it's done, I can confirm that the dma-buf does get freed as expected, so I'm fairly certain this will resolve the issue.

So yeah, it was a driver bug, thanks again to all who provided helpful information here.
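To make the mechanism described in that note concrete, here is a toy Python model of it. This is not Nvidia's driver code or the real GBM API: `ToyBuffer`, `write_leaky`, and `write_fixed` are hypothetical names, and the only thing modelled is that the backing memory is released once the buffer has been destroyed and no mapping remains.

```python
class ToyBuffer:
    """Toy stand-in for a GBM buffer object and its backing dma-buf."""

    def __init__(self):
        self.handle_alive = True  # cleared by destroy()
        self.mappings = 0         # number of outstanding CPU mappings
        self.data = b""

    def map(self):
        self.mappings += 1

    def unmap(self):
        self.mappings -= 1

    def destroy(self):
        self.handle_alive = False

    @property
    def backing_freed(self):
        # The backing memory can only go away once the handle is destroyed
        # and every mapping has been released.
        return not self.handle_alive and self.mappings == 0


def write_leaky(buf, data):
    """Models the bug: an implicit map with no matching unmap."""
    buf.map()
    buf.data = bytes(data)


def write_fixed(buf, data):
    """Models the fix: unmap once the write is done."""
    buf.map()
    try:
        buf.data = bytes(data)
    finally:
        buf.unmap()
```

With `write_leaky`, calling `destroy()` afterwards leaves `backing_freed` false, which corresponds to the cursor buffers piling up in video memory. With `write_fixed`, the same sequence frees the backing memory.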

(Erik Kurzinger from https://gitlab.gnome.org/GNOME/mutter/-/issues/2045#note_1335845)

This will be fixed in the upcoming 510 driver. I believe the beta release is targeted for January 5th. Until then, the work-around would be to have mutter use a software cursor, either directly with !2150 (merged) or indirectly by forcing it to use EGLStreams with !2132 (merged).

assaf_hershko

Comment 21

•

3 years ago

I installed 510 and so far so good. Hurrah!
Any idea when it will be out of beta?

Darkspirit

Updated

•

3 years ago

Status: NEW → RESOLVED

Closed: 3 years ago

status-firefox94: affected → ---

status-firefox95: affected → ---

Flags: needinfo?(jmuizelaar)

Resolution: --- → MOVED

See Also: → https://gitlab.gnome.org/GNOME/mutter/-/issues/2045

Summary: Wayland/Nvidia: Crash in [@ NvGlEglGetFunctions] → Wayland/Nvidia: Crash in [@ NvGlEglGetFunctions] (Fixed by Nvidia driver 510)

about:support information (matt, 3 years ago, 36.37 KB, text/plain)
AboutSupport (assaf_hershko, 3 years ago, 35.48 KB, text/plain)