Wayland/Nvidia: Crash in [@ NvGlEglGetFunctions] (Fixed by Nvidia driver 510)
Categories
(Core :: Graphics: WebRender, defect)
Tracking
()
People
(Reporter: aryx, Unassigned)
References
(Blocks 2 open bugs)
Details
(Keywords: crash)
Crash Data
Attachments
(2 files)
16 Linux crashes from 9+ devices. The oldest reported crash is from 2021-09-16 and even some older builds are affected, but all 16 crashes from the 9+ devices are on Nightly 94.0a1 and 95.0a1. Could this be a driver update issue?
Possibly Fission-related (DOMFissionEnabled=1).
Crash report: https://crash-stats.mozilla.org/report/index/6fb2714d-5337-4211-b6f7-5f1020211025
Reason: SIGSEGV / SEGV_MAPERR
Top 10 frames of crashing thread:
0 libnvidia-eglcore.so.470.74 NvGlEglGetFunctions
1 libnvidia-eglcore.so.470.74 <.text ELF section in libnvidia-eglcore.so.470.74>
2 libnvidia-eglcore.so.470.74 <.text ELF section in libnvidia-eglcore.so.470.74>
3 libnvidia-eglcore.so.470.74 <.text ELF section in libnvidia-eglcore.so.470.74>
4 libnvidia-eglcore.so.470.74 <.text ELF section in libnvidia-eglcore.so.470.74>
5 libnvidia-eglcore.so.470.74 <.text ELF section in libnvidia-eglcore.so.470.74>
6 libxul.so <gleam::gl::GlFns as gleam::gl::Gl>::tex_sub_image_2d_pbo third_party/rust/gleam/src/gl_fns.rs:750
7 libxul.so webrender::device::gl::TextureUploader::update_impl gfx/wr/webrender/src/device/gl.rs:4678
8 libxul.so webrender::device::gl::TextureUploader::flush_buffer gfx/wr/webrender/src/device/gl.rs:4631
9 libxul.so webrender::device::gl::TextureUploader::flush gfx/wr/webrender/src/device/gl.rs:4640
Comment 1•4 years ago
https://bugreports.qt.io/browse/QTBUG-97127 mentions NvGlEglGetFunctions on Wayland.
Updated•3 years ago
I'm not sure if this helps at all, but I'm hitting this issue specifically on the Google sign-in page. The page itself loads (and seems to render) fine, but a few seconds later the tab crashes. It happens every time, but that's the only page I've found where this behavior occurs. Like the bug report says, I'm on Nvidia and Wayland. Please let me know if you'd like more info or if I can help at all!
Comment 3•3 years ago
@matt: thanks, having a user around with an affected device is super helpful. Could you share which driver version you are on? The best would be if you could share the content of your about:support ("Copy text to clipboard" -> paste here into comment -> a popup offering to create an attachment should open -> yes).
Comment 5•3 years ago
Thanks! So this is on 495.46.0.0, the original report is for 470.74. I assume it is most likely a driver issue (or libglvnd etc.).
Eric, needinfoing you here so you know about the issue - do you maybe have internal reports about this already? Thanks!
Comment 6•3 years ago
I'm inclined to think NvGlEglGetFunctions is just appearing in the stack trace as an artifact. That function is only ever called during EGL initialization. The real problem is probably related to the glTexSubImage2D call.
@matt, if you have a reliable repro, would you mind trying with the environment variable OGL_SkipTextureHostCopies=1? Looking at a couple of internally-tracked bugs I suspect that might have an effect. If it seems to fix the issue it would help narrow things down. I'm not able to reproduce the crash on my system, unfortunately.
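A minimal sketch of one way to run the test suggested above, setting the variable for a single Firefox process rather than for the whole session. The `firefox` binary name and both helper names are assumptions for illustration; only `OGL_SkipTextureHostCopies=1` comes from this comment.

```python
import os
import subprocess


def build_env(base_env=None):
    """Return a copy of the environment with the driver variable set.

    The caller's mapping is copied, never modified in place.
    """
    env = dict(os.environ if base_env is None else base_env)
    env["OGL_SkipTextureHostCopies"] = "1"
    return env


def launch_firefox(binary="firefox", extra_args=()):
    """Start Firefox with the modified environment.

    The binary name is an assumption; pass the full path if Firefox
    is installed somewhere that is not on PATH.
    """
    return subprocess.Popen([binary, *extra_args], env=build_env())
```

Calling `launch_firefox()` starts the browser with the variable applied to that process and its children only, so other applications in the same desktop session are unaffected.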
(In reply to Erik Kurzinger from comment #6)
> @matt, if you have a reliable repro, would you mind trying with the environment variable OGL_SkipTextureHostCopies=1? Looking at a couple of internally-tracked bugs I suspect that might have an effect. If it seems to fix the issue it would help narrow things down. I'm not able to reproduce the crash on my system, unfortunately.
I'll definitely give it a shot when I notice a specific site crashing between restarts. Unfortunately it's incredibly inconsistent to reproduce. I had issues with Google sign-in, and then the same thing occurred with hCaptcha embeds. Now both are working fine, so it may just be a matter of waiting for this to pop up for me again (it seems to persist on the same site across Firefox restarts but not across desktop sessions). Will keep you updated!
Comment 9•3 years ago
Is there a chance that this only happens after suspend/resume, given that the Nvidia driver is a bit special about that?
Comment 10•3 years ago
I believe I have suspending disabled at the moment because it breaks everything :P Manually suspending it doesn't seem to make the issue occur either, at least on the usual sites.
Comment 11•3 years ago
I believe I was able to reproduce the crash while running with the OGL_SkipTextureHostCopies=1 environment variable. Here's the log: https://crash-stats.mozilla.org/report/index/5ca54b3a-28a6-46df-9659-0566c0211231
Comment 13•3 years ago
It seems the reporter of bug 1750448 can consistently repro this. From the reports linked there it seems we're just trying to initialize a WebGL context and crash inside the NVidia driver. Jeff, do you know if we have contacts at NVidia that could take a look?
Comment 14•3 years ago
Is there any chance you could attach your about:support information here? Does your graphics card match the one from comment 4?
Comment 15•3 years ago
Comment 16•3 years ago
With regard to my graphics card: it is an RTX 3060 Ti. It may also be worth noting that the exact same thing happens on my other machine, which has a GTX 970.
Comment 17•3 years ago
And adding what I believe is an important clue - as per my original bug report, I am using fractional scaling (at 125%).
If I disable fractional scaling (gsettings reset org.gnome.mutter experimental-features) and restart - the sites load fine.
In other words - this (or at least, my specific issue) seems to be caused by something about how Wayland (and the Nvidia drivers?) handle fractional scaling.
I hope this helps! :)
Comment 18•3 years ago
(In reply to assaf_hershko from comment #17)
> And adding what I believe is an important clue - as per my original bug report, I am using fractional scaling (at 125%).
> If I disable fractional scaling (gsettings reset org.gnome.mutter experimental-features) and restart - the sites load fine.
> In other words - this (or at least, my specific issue) seems to be caused by something about how Wayland (and the Nvidia drivers?) handle fractional scaling.
> I hope this helps! :)
I don't believe I was using fractional scaling when I encountered these issues, so it may be a bit deeper than that.
Comment 19•3 years ago
Yes, agree. I turned fractional scaling back on and the sites still loaded fine for a bit... but are crashing again now.
The good news (I guess) is that I can easily reproduce this on both machines. Would be happy to provide you with whatever info is needed, and/or access to the machines themselves if that helps.
Comment 20•3 years ago
Please test if this has been fixed with Nvidia driver 510.
bug 1743833 comment 9, comment 4 and comment 15 have Gnome Wayland with driver 495 and 2 displays.
(Alynx Zhou from bug 1743833 comment 17)
I am trying to downgrade mutter to 41.1 to see if things changed.
(Alynx Zhou from bug 1743833 comment #18)
It works fine with mutter 41.1 for hours, I'll open an issue for mutter for help.
(Alynx Zhou from bug 1743833 comment #19)
https://gitlab.gnome.org/GNOME/mutter/-/issues/2045
Issue for mutter.
(Erik Kurzinger from https://gitlab.gnome.org/GNOME/mutter/-/issues/2045#note_1331543)
Oh snap, I think I've figured it out. Firstly, the Xwayland vidmem leak was a red herring. I thought that was what was happening because if you watch nvidia-smi while repeatedly running and killing an X11 application, Xwayland's memory usage will steadily increase, but only up to a certain point (about 900MB on my system). This is just because our EGL driver won't always immediately free vidmem allocations for textures and whatnot when the GL object is destroyed, but they will get freed eventually.
However, we are indeed leaking vidmem. Specifically the GBM buffers mutter allocates for the hardware cursor image. It's not mutter's fault, though, it correctly calls gbm_bo_destroy when it's done with the buffers. The problem is that it uses gbm_bo_write to write the cursor data into the buffer. Our implementation of that function will do an implicit gbm_bo_map, but won't unmap afterwards. So even though mutter destroys the GBM buffer, the backing dma-buf won't get freed because of this leftover mapping.
If I change our gbm_bo_write implementation to unmap the buffer when it's done, I can confirm that the dma-buf does get freed as expected, so I'm fairly certain this will resolve the issue.
So yeah, it was a driver bug, thanks again to all who provided helpful information here.
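The leak described above can be illustrated with a toy reference-counting model. This is only a sketch of the mechanism, assuming the dma-buf is released once nothing references it; the classes and method names below are hypothetical stand-ins and are not the real libgbm API or the driver's code.

```python
class DmaBuf:
    """Toy stand-in for the backing dma-buf: freed once nothing references it."""

    def __init__(self):
        self.refs = 0

    @property
    def freed(self):
        return self.refs == 0


class BufferObject:
    """Toy stand-in for a GBM buffer object (hypothetical, not libgbm)."""

    def __init__(self, unmap_after_write):
        self.dmabuf = DmaBuf()
        self.dmabuf.refs += 1  # reference held by the buffer object itself
        self.unmap_after_write = unmap_after_write

    def map(self):
        self.dmabuf.refs += 1  # a live mapping keeps the dma-buf alive

    def unmap(self):
        self.dmabuf.refs -= 1

    def write(self, data):
        self.map()  # implicit map, as in the described gbm_bo_write behaviour
        # ... the cursor data would be copied into the mapping here ...
        if self.unmap_after_write:
            self.unmap()  # the described fix: drop the mapping when done

    def destroy(self):
        self.dmabuf.refs -= 1  # the compositor releases its buffer
        return self.dmabuf


def cursor_update(unmap_after_write):
    """Simulate one cursor image update; return True if the dma-buf was freed."""
    bo = BufferObject(unmap_after_write)
    bo.write(b"cursor pixels")
    return bo.destroy().freed
```

In this model `cursor_update(False)` leaves one reference behind on every update, which matches the steadily growing vidmem usage reported, while `cursor_update(True)` releases the buffer as soon as it is destroyed.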
(Erik Kurzinger from https://gitlab.gnome.org/GNOME/mutter/-/issues/2045#note_1335845)
This will be fixed in the upcoming 510 driver. I believe the beta release is targeted for January 5th. Until then, the work-around would be to have mutter use a software cursor, either directly with !2150 (merged) or indirectly by forcing it to use EGLStreams with !2132 (merged).
Comment 21•3 years ago
I installed 510 and so far so good. Hurrah!
Any idea when it will be out of beta?
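For anyone triaging similar reports, a small sketch of checking whether a reported driver version is at or above the 510 series mentioned in this bug. The function names are hypothetical, and the parsing assumes the first `digits.digits` group in the string is the driver version, which holds for the forms seen in this bug ("495.46.0.0", "libnvidia-eglcore.so.470.74") but may not hold for arbitrary input.

```python
import re

FIXED_MAJOR = 510  # driver series reported in this bug as containing the fix


def driver_major(version_text):
    """Return the major driver version found in the text, or None.

    Accepts plain version strings as well as library file names
    that end in the driver version.
    """
    match = re.search(r"(\d+)\.\d+", version_text)
    return int(match.group(1)) if match else None


def has_fix(version_text):
    """True if the major version is at least FIXED_MAJOR."""
    major = driver_major(version_text)
    return major is not None and major >= FIXED_MAJOR
```

Where the version string comes from (about:support, a crash report module list, a library file name) is left to the caller.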