Closed Bug 1759315 Opened 2 years ago Closed 1 year ago

[Sway][NVIDIA] glxtest crashes with SIGSEGV on wayland

Categories

(Core :: Widget: Gtk, defect, P3)

RESOLVED MOVED

People

(Reporter: mehl.ger, Assigned: scorpion-26)

References

(Blocks 3 open bugs)

Attachments

(6 files)

User Agent: Mozilla/5.0 (X11; Linux x86_64; rv:98.0) Gecko/20100101 Firefox/98.0

Steps to reproduce:

  • System: Arch Linux, Nvidia proprietary drivers 510.54, sway 1.7
  • Fresh profile on Firefox 98.0 (also reproducible on firefox-nightly)
  • Start firefox with MOZ_ENABLE_WAYLAND=1

Actual results:

The child process spawned to perform glxtest segfaults. The only output I get is:

[GFX1-]: No GPUs detected via PCI
[GFX1-]: glxtest: process failed (received signal 11)

After building it locally and stepping through with gdb, I was able to obtain the following backtrace:

Thread 2.1 "firefox" received signal SIGSEGV, Segmentation fault.
0x00007ffff58a992a in wl_display_read_events () from /usr/lib/libwayland-client.so.0
(gdb) bt
#0  0x00007ffff58a992a in wl_display_read_events () at /usr/lib/libwayland-client.so.0
#1  0x00007ffff58aa1f4 in wl_display_dispatch_queue () at /usr/lib/libwayland-client.so.0
#2  0x00007ffff58aa4c0 in wl_display_roundtrip_queue () at /usr/lib/libwayland-client.so.0
#3  0x00007fffeffe9f73 in get_wayland_screen_info(wl_display*) (dpy=0x7ffff78a4190) at mozilla-unified/toolkit/xre/glxtest.cpp:1234
#4  wayland_egltest() () at mozilla-unified/toolkit/xre/glxtest.cpp:1255
#5  childgltest() () at mozilla-unified/toolkit/xre/glxtest.cpp:1276
#6  0x00007fffeffea697 in fire_glxtest_process() () at mozilla-unified/toolkit/xre/glxtest.cpp:1321
#7  0x00007fffeffdeef7 in XREMain::XRE_mainInit(bool*) (this=<optimized out>, this@entry=0x7fffffffd280, aExitFlag=aExitFlag@entry=0x7fffffffd207)
    at mozilla-unified/toolkit/xre/nsAppRunner.cpp:4026
#8  0x00007fffeffe5f6f in XREMain::XRE_main(int, char**, mozilla::BootstrapConfig const&) (this=this@entry=0x7fffffffd280, argc=argc@entry=4, argv=argv@entry=0x7fffffffe548, aConfig=...)
    at mozilla-unified/toolkit/xre/nsAppRunner.cpp:5898
#9  0x00007fffeffe639f in XRE_main(int, char**, mozilla::BootstrapConfig const&) (argc=6, argv=0x0, aConfig=...) at mozilla-unified/toolkit/xre/nsAppRunner.cpp:5983
#10 0x000055555557d244 in do_main(int, char**, char**) (argc=6, argv=0x7fffffffe548, envp=0x7fffffffe570) at mozilla-unified/browser/app/nsBrowserApp.cpp:225
#11 main(int, char**, char**) (argc=<optimized out>, argv=<optimized out>, envp=0x7fffffffe570) at mozilla-unified/browser/app/nsBrowserApp.cpp:395

Expected results:

The glxtest process shouldn't segfault.

The Bugbug bot thinks this bug should belong to the 'Core::Widget: Gtk' component, and is moving the bug to that component. Please correct in case you think the bot is wrong.

Component: Untriaged → Widget: Gtk
Product: Firefox → Core

Please run with the WAYLAND_DEBUG=1 environment variable set and attach the log here. Also, can you test with the mutter compositor?
https://fedoraproject.org/wiki/How_to_debug_Firefox_problems?rd=Bug_info_Firefox#Testing_different_Wayland_compositor
Thanks.

Flags: needinfo?(mehl.ger)
Priority: -- → P3
Summary: glxtest crashes with SIGSEGV on wayland → [NVIDIA] glxtest crashes with SIGSEGV on wayland
Flags: needinfo?(mehl.ger)

I wasn't able to reproduce this with mutter.

glxtest runs in a separate process precisely to contain driver crashes, so that part works as expected. But I have no idea why we see the wl_display_read_events() crash here.

Blocks: wayland-sway

My laptop has both an NVIDIA GPU and an Intel integrated GPU. I'm seeing this problem when running Firefox in sway with the NVIDIA GPU but not with the Intel GPU, so this could be NVIDIA related.

User Agent: Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101 Firefox/102.0

When I single-stepped into the libwayland-client.so.0 code, the segfault happened here, as part of wl_display_roundtrip [1]:

if (opcode >= proxy->object.interface->event_count) {

On that line, proxy->object.interface points to unmapped memory.

What's interesting is that if I swap the order of these two blocks of code in Firefox [2], [3], glxtest stops crashing.

glxtest also no longer crashes if I comment out these two lines [4], without swapping the code blocks.

When I pause at the same spot in libwayland-client.so.0, proxy->object.interface now points to things like zwp_linux_dmabuf_v1, living in /usr/local/lib/libnvidia-egl-wayland.so.1.1.10. The libnvidia-egl-wayland library is not mapped at the time of the earlier crashes, which suggests that the get_egl_status function unloaded that library but left dangling pointers behind, hence the crash when handling events later.

Does that reveal any clues? I uploaded a patch file that delays terminating the EGLDisplay. But because this is my first time working on Firefox, I'm not sure whether that's the right direction or whether it merely papers over some deeper bug.
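To make the suspected failure mode concrete, here is a minimal sketch in plain C. It is only a model: the structs and functions below are hypothetical stand-ins loosely shaped like libwayland's wl_interface and wl_proxy, and the `loaded` flag exists only so the model can report the problem instead of crashing. The real code has no such check and simply dereferences the stale pointer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins, loosely modelled on libwayland's wl_interface
 * and wl_proxy. None of these names exist in libwayland or Firefox. */
struct iface {
    const char *name;
    int event_count;
};

/* Models a dlopen()ed library that owns the interface table. */
struct module {
    struct iface table;
    bool loaded;
};

struct proxy {
    const struct iface *interface; /* points into the owning module */
    const struct module *owner;
};

enum dispatch_result { DISPATCH_OK, DISPATCH_BAD_OPCODE, DISPATCH_DANGLING };

static void module_load(struct module *m, const char *name, int event_count) {
    assert(m != NULL);
    m->table.name = name;
    m->table.event_count = event_count;
    m->loaded = true;
}

/* Models the library being unloaded while proxies still reference it. */
static void module_unload(struct module *m) {
    assert(m != NULL);
    m->loaded = false;
}

static enum dispatch_result dispatch_event(const struct proxy *p, int opcode) {
    assert(p != NULL && p->owner != NULL);
    /* The real dispatch path cannot detect this case; it would read
     * unmapped memory on the event_count comparison below. */
    if (!p->owner->loaded)
        return DISPATCH_DANGLING;
    if (opcode >= p->interface->event_count)
        return DISPATCH_BAD_OPCODE;
    return DISPATCH_OK;
}
```

In this model, dispatching an event before module_unload() succeeds, while dispatching the same event afterwards hits the dangling case, which matches the observation that reordering the code so events are handled before the library goes away avoids the crash.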

Attached patch glxtest.patch

I am experiencing a crash with NVIDIA that I believe is the same as or similar to this issue.

Fedora 36 fresh install. AMD iGPU and Nvidia dGPU. Nvidia proprietary driver 515.65.01. mutter 42.4-1. libwayland-client-1.20.0-4.fc36
Firefox 104.0.1 with fresh profile.
It only happens when launching on the discrete NVIDIA card:

Device: 0
  Name:        Advanced Micro Devices, Inc. [AMD®/ATI] Cezanne
  Default:     yes
  Environment: DRI_PRIME=pci-0000_05_00_0

Device: 1
  Name:        NVIDIA Corporation GA104M [GeForce RTX 3080 Mobile / Max-Q 8GB/16GB]
  Default:     no
  Environment: __GLX_VENDOR_LIBRARY_NAME=nvidia __NV_PRIME_RENDER_OFFLOAD=1 __VK_LAYER_NV_optimus=NVIDIA_only

Browser output

[GFX1-]: No GPUs detected via PCI
[GFX1-]: glxtest: process failed (received signal 11)

Trace from journal:

#0  0x00007efc821251e0 wl_display_read_events (libwayland-client.so.0 + 0x91e0)
#1  0x00007efc82126061 wl_display_dispatch_queue (libwayland-client.so.0 + 0xa061)
#2  0x00007efc8212705f wl_display_roundtrip_queue (libwayland-client.so.0 + 0xb05f)
#3  0x00007efc79fb2240 childgltest (libxul.so + 0x2997240)
#4  0x00007efc79fae6bd _Z20fire_glxtest_processv.cold (libxul.so + 0x29936bd)
#5  0x00007efc7b2cde9b _ZN7XREMain12XRE_mainInitEPb (libxul.so + 0x3cb2e9b)
#6  0x00007efc7b2cd97e _ZN7XREMain8XRE_mainEiPPcRKN7mozilla15BootstrapConfigE (libxul.so + 0x3cb297e)
#7  0x00007efc7b2cd639 _Z8XRE_mainiPPcRKN7mozilla15BootstrapConfigE (libxul.so + 0x3cb2639)
#8  0x0000562ad1e137e2 _ZL7do_mainiPPcS0_ (firefox + 0x547e2)
#9  0x0000562ad1e059d8 main (firefox + 0x469d8)
#10 0x00007efc840b9550 __libc_start_call_main (libc.so.6 + 0x29550)
#11 0x00007efc840b9609 __libc_start_main@@GLIBC_2.34 (libc.so.6 + 0x29609)
#12 0x0000562ad1e13245 _start (firefox + 0x54245)

Robert, any idea here?
Thanks.

Flags: needinfo?(robert.mader)
Flags: needinfo?(stransky)

Aleksey: could you make sure you don't have issues launching any other native Wayland GL app with the dedicated GPU and report back? Like a GTK4 app or weston-simple-egl? To me this crash looks like a driver/setup issue, unlikely to be FF specific.

Flags: needinfo?(robert.mader) → needinfo?(aleksey)

P.S.: regarding the patch in comment 9, I do think we can definitely do better in glxtest regarding library loading/unloading. I've been planning to look into that for some time, using common helper functions. This would however only avoid the crash and not fix the general issue of the test failing IIUC.

Attached file eglinfo output
nvidia-smi shows weston-simple-egl process running on nvidia, no issues.
```
__GLX_VENDOR_LIBRARY_NAME=nvidia __NV_PRIME_RENDER_OFFLOAD=1 __VK_LAYER_NV_optimus=NVIDIA_only weston-simple-egl 
has EGL_EXT_buffer_age and EGL_KHR_swap_buffers_with_damage
300 frames in 5 seconds: 60.000000 fps
298 frames in 5 seconds: 59.599998 fps
71 frames in 5 seconds: 14.200000 fps
^Csimple-egl exiting
```
Of the several apps I tried, only celluloid errors out and closes, with the following:
```
[drm:nv_drm_gem_export_nvkms_memory_ioctl [nvidia_drm]] *ERROR* [nvidia-drm] [GPU ID 0x00000100] Failed to lookup NVKMS gem object for export: 0x00000001
```

abrt shows the following, but I think it is guesswork rather than something gleaned from the coredump:

> Likely crash reason: Jump to an invalid address

It also links to a report that might be relevant:
https://retrace.fedoraproject.org/faf/reports/270283/
Flags: needinfo?(aleksey)

I remember that when I tried to get Firefox working on Wayland with NVIDIA PRIME offloading a couple of years ago, there was a problem where the environment variables (as shown in the previous comment) were not propagated to glxtest.
Could that be the issue?

(In reply to aleksey from comment #16)

> I remember that when I tried to get Firefox working on Wayland with NVIDIA PRIME offloading a couple of years ago, there was a problem where the environment variables (as shown in the previous comment) were not propagated to glxtest.
> Could that be the issue?

Thanks for testing! And yes, that could indeed be the problem here.

Flags: needinfo?(stransky)
See Also: → 1739611

The patch from tengyifei88 above was the only way I could avoid this issue, which otherwise forced software WebRender and disabled VA-API for me. However, it seems to need a rebase for FF 107.

Looks like the feature (running Wayland on a secondary NVIDIA GPU) is supported by Sway only:
https://gitlab.gnome.org/GNOME/mutter/-/issues/831

I still have this issue, and I wasn't able to build FF yet.

One additional point I want to make.

I am using Nobara, Nvidia, X11, FF 104.0.1.

When starting FF I get:

[GFX1-]: glxtest: process failed (received signal 11)
and WebGL is not working.

If I start FF from a root shell ('sudo su', then 'firefox'), there is no such error and WebGL works.

I also tried firefox-x11, with the same result.

By calling wl_display_roundtrip early, the crash that happens at that call is resolved.
Note that the early roundtrip is only performed when the environment variable MOZ_GLX_TEST_EARLY_WL_ROUNDTRIP=1 is set.
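For readers who have not opened the attachment, the opt-in behaviour described above could look roughly like the sketch below. This is not the attached patch. Only the variable name MOZ_GLX_TEST_EARLY_WL_ROUNDTRIP and the function wl_display_roundtrip come from this bug; the helper names, the function-pointer hook, and the rule that only the exact value "1" enables the path are assumptions made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Assumption: only the exact value "1" turns the workaround on. */
static bool flag_is_set(const char *value) {
    return value != NULL && strcmp(value, "1") == 0;
}

/* Hypothetical hook standing in for wl_display_roundtrip(display),
 * which returns a negative value on failure. */
typedef int (*roundtrip_fn)(void *display);

/* Returns 0 if the early roundtrip was skipped, 1 if it ran,
 * and -1 if it ran but the roundtrip reported an error. */
static int maybe_early_roundtrip(void *display, roundtrip_fn roundtrip,
                                 const char *flag_value) {
    assert(roundtrip != NULL);
    if (!flag_is_set(flag_value))
        return 0; /* default path: behaviour is unchanged */
    return roundtrip(display) < 0 ? -1 : 1;
}

/* Stub used in place of a real display: counts how often it is called. */
static int counting_roundtrip(void *display) {
    ++*(int *)display;
    return 0;
}
```

A caller would pass getenv("MOZ_GLX_TEST_EARLY_WL_ROUNDTRIP") as flag_value and a thin wrapper around wl_display_roundtrip as the hook. Keeping the check behind an environment variable means setups that never hit the crash see no behaviour change.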

Assignee: nobody → schweers.ti
Status: UNCONFIRMED → ASSIGNED
Ever confirmed: true

(In reply to [:scorpion-26] from comment #22)

> Created attachment 9308033 [details]
> Bug 1759315 - Add environment var to wait for wl events early
>
> By calling wl_display_roundtrip early, the crash that happens at that call is resolved.
> Note that this only happens when the environment variable MOZ_GLX_TEST_EARLY_WL_ROUNDTRIP=1 is set

I can confirm that this patch fixes the reported issue. Thank you very much for looking into this.

(In reply to mehl.ger from comment #23)

> (In reply to [:scorpion-26] from comment #22)
>
> > Created attachment 9308033 [details]
> > Bug 1759315 - Add environment var to wait for wl events early
> >
> > By calling wl_display_roundtrip early, the crash that happens at that call is resolved.
> > Note that this only happens when the environment variable MOZ_GLX_TEST_EARLY_WL_ROUNDTRIP=1 is set
>
> I can confirm that this patch fixes the reported issue. Thank you very much for looking into this.

Indeed, this fixes it for me too.

Summary: [NVIDIA] glxtest crashes with SIGSEGV on wayland → [Swayl][NVIDIA] glxtest crashes with SIGSEGV on wayland
Summary: [Swayl][NVIDIA] glxtest crashes with SIGSEGV on wayland → [Sway][NVIDIA] glxtest crashes with SIGSEGV on wayland
Summary: [Sway][NVIDIA] glxtest crashes with SIGSEGV on wayland → [NVIDIA] glxtest crashes with SIGSEGV on wayland
Summary: [NVIDIA] glxtest crashes with SIGSEGV on wayland → [Sway][NVIDIA] glxtest crashes with SIGSEGV on wayland

This PR fixes the issue upstream: https://github.com/NVIDIA/egl-wayland/pull/74
Seems like once it gets merged, the workaround provided in comment #22 is not needed anymore.

On Arch Linux, the patch is already included in egl-wayland 2:1.1.11-3:
https://bugs.archlinux.org/task/77260

I'm currently running Firefox 109.0 with the patched egl-wayland package without any issues.

Good, let's close it as MOVED then.

Status: ASSIGNED → RESOLVED
Closed: 1 year ago
Resolution: --- → MOVED