Closed Bug 1715245 Opened 2 years ago Closed 2 years ago

Regression: WebRender support on Nvidia Binary

Categories

(Core :: Graphics: WebRender, defect)

Firefox 91
defect

Tracking

RESOLVED FIXED
91 Branch
Tracking Status
firefox-esr78 --- unaffected
firefox89 --- unaffected
firefox90 --- fixed
firefox91 --- fixed

People

(Reporter: Vash63, Assigned: rmader)

References

(Blocks 1 open bug, Regression)

Details

(Keywords: regression)

Attachments

(4 files)

User Agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Firefox/91.0

Steps to reproduce:

Updated Nightly. Noticed I was in legacy GL (my default profile has layers.acceleration.force-enabled on) and getting graphical issues.

Actual results:

Checked about:support and confirmed that WebRender was disabled due to glxtest process failure in failure log.

Expected results:

Firefox should have launched with WebRender

Additional info that didn't fit in the template:

I ran mozregression and it gave me: https://hg.mozilla.org/integration/autoland/pushloghtml?fromchange=42d5dd452cd79ed9aef39ba6a429d839f0344887&tochange=cbbccab13f240baabf0fe2678b4f4b683fd21e5f

Between the two this is probably due to: https://bugzilla.mozilla.org/show_bug.cgi?id=1714897

Also, these are the exact same symptoms I had about a month ago, which were fixed in either https://bugzilla.mozilla.org/show_bug.cgi?id=1706762 or https://bugzilla.mozilla.org/show_bug.cgi?id=1706452

The Bugbug bot thinks this bug should belong to the 'Core::Graphics: WebRender' component, and is moving the bug to that component. Please revert this change in case you think the bot is wrong.

Component: Untriaged → Graphics: WebRender
Product: Firefox → Core

Thanks for the bug report and regression range Vash63. Could you please attach your about:support information to this bug?

Glenn, any ideas what could have caused this?

Blocks: wr-nv-linux
Flags: needinfo?(gwatson)
Flags: needinfo?(Vash63)
Regressed by: 1714897
Has Regression Range: --- → yes
Severity: -- → S3
Attached file about:support

About:support from the latest nightly attached. It notes "glxtest: process failed (received signal 11)"; glxtest is the same program that was updated in https://phabricator.services.mozilla.com/D116957

Flags: needinfo?(Vash63)
Attached file about:support text

FWIW here's my about:support, on linux with nvidia driver version 460.73.01, NVIDIA Corporation GP107GL [Quadro P400] [10de:1cb3]

There are two things that patch does: it passes the current X display to eglInitialize, which is what fixed the glxtest code on amdgpu, and it moves the installation of the X error handler slightly earlier.

I wouldn't expect either of those to crash the nvidia proprietary driver, but it sounds like one of them does :|

Robert, have you got any ideas what we should do here?

Flags: needinfo?(gwatson) → needinfo?(robert.mader)

aosmond, stransky, do either of you have a machine with an nvidia binary driver set up that you could test this on?

Flags: needinfo?(stransky)
Flags: needinfo?(aosmond)

I have an NVIDIA card available so I can set it up for testing. Which patch do you mean?

Thanks Martin. It's already landed in m-c (https://phabricator.services.mozilla.com/D116957), so I guess just testing to see if glxtest is crashing on your machine (I suspect it will only reproduce in an X11 session, but it might be worth testing on all X/Wayland combinations).

Looks like it's crashing in XCloseDisplay.

(gdb) next
852	  XCloseDisplay(dpy);
(gdb) p glxtest_buf
$9 = 0x7fffffffc1d0 "PCI_VENDOR_ID\n0x10de\nPCI_DEVICE_ID\n0x1cb3\nVENDOR\nNVIDIA Corporation\nRENDERER\nQuadro P400/PCIe/SSE2\nVERSION\n4.6.0 NVIDIA 460.73.01\nTFP\nTRUE\nMESA_ACCELERATED\nTRUE\nSCREEN_INFO\n3840x2160:1;\n"
(gdb) next

Thread 2.1 "firefox" received signal SIGSEGV, Segmentation fault.
0x00007fffe960f580 in ?? ()
(gdb) bt
#0  0x00007fffe960f580 in  ()
#1  0x00007ffff6282ba2 in XCloseDisplay (dpy=0x7ffff78ca000) at ../../src/ClDisplay.c:65
#2  0x00007ffff1f19d6d in x11_egltest(int) (pci_count=<optimized out>) at /home/jcristau/src/hg.mozilla.org/mozilla-unified/toolkit/xre/glxtest.cpp:852
#3  childgltest() () at /home/jcristau/src/hg.mozilla.org/mozilla-unified/toolkit/xre/glxtest.cpp:1222
#4  0x00007ffff1f1a407 in fire_glxtest_process() () at /home/jcristau/src/hg.mozilla.org/mozilla-unified/toolkit/xre/glxtest.cpp:1261
#5  0x00007ffff1f0e752 in XREMain::XRE_mainInit(bool*) (this=<optimized out>, this@entry=0x7fffffffccb0, aExitFlag=aExitFlag@entry=0x7fffffffcc37)
    at /home/jcristau/src/hg.mozilla.org/mozilla-unified/toolkit/xre/nsAppRunner.cpp:3614
#6  0x00007ffff1f156db in XREMain::XRE_main(int, char**, mozilla::BootstrapConfig const&)
    (this=this@entry=0x7fffffffccb0, argc=argc@entry=4, argv=argv@entry=0x7fffffffdf58, aConfig=...)
    at /home/jcristau/src/hg.mozilla.org/mozilla-unified/toolkit/xre/nsAppRunner.cpp:5411
#7  0x00007ffff1f15ada in XRE_main(int, char**, mozilla::BootstrapConfig const&) (argc=-142277752, argv=0x3, aConfig=...)
    at /home/jcristau/src/hg.mozilla.org/mozilla-unified/toolkit/xre/nsAppRunner.cpp:5496
#8  0x000055555557bdc4 in do_main(int, char**, char**) (argc=-142277752, argv=0x7fffffffdf58, envp=<optimized out>)
    at /home/jcristau/src/hg.mozilla.org/mozilla-unified/browser/app/nsBrowserApp.cpp:224
#9  main(int, char**, char**) (argc=4, argv=<optimized out>, envp=<optimized out>) at /home/jcristau/src/hg.mozilla.org/mozilla-unified/browser/app/nsBrowserApp.cpp:351

XCloseDisplay is calling the NV-GLX close_display callback:

(gdb) up
#1  0x00007ffff6282ba2 in XCloseDisplay (dpy=0x7ffff78ca000) at ../../src/ClDisplay.c:65
65	../../src/ClDisplay.c: No such file or directory.
(gdb) p *ext
$11 = {next = 0x7ffff784f480, codes = {extension = 1, major_opcode = 155, first_event = 0, first_error = 0}, create_GC = 0x0, copy_GC = 0x0, flush_GC = 0x0, free_GC = 0x0, 
  create_Font = 0x0, free_Font = 0x0, close_display = 0x7fffe960f580, error = 0x0, error_string = 0x0, name = 0x7ffff781b178 "NV-GLX", error_values = 0x0, 
  before_flush = 0x0, next_flush = 0x0}

but it looks like nv-glx has been unloaded before then?

Set release status flags based on info from the regressing bug 1714897

(In reply to Julien Cristau [:jcristau] from comment #11)

Looks like it's crashing in XCloseDisplay.
...
but it looks like nv-glx has been unloaded before then?

Ah, this kinda makes sense. Looking at get_glx_status()[1], it stands out that libgl is opened before and closed after the X11 Display. Now I wonder if this is a bug in the nvidia driver and how best to work around it. We could probably shuffle things around a bit; however, it looks like on Wayland it's required to open the display connection first. Another option could be to just not close the display connection, since the process will exit directly afterwards anyway.

1: https://searchfox.org/mozilla-central/source/toolkit/xre/glxtest.cpp#652-835

Flags: needinfo?(robert.mader)
Flags: needinfo?(aosmond)

FTR, I don't think NV-GLX has any business being there as that's a pure EGL code path :/

Commenting out the call to XCloseDisplay gives me WebRender back (on the second run).

Closing the X11 connection is buggy on nv prop. drivers. Leave
it open, the process will exit anyways.

Assignee: nobody → robert.mader

(In reply to Julien Cristau [:jcristau] from comment #15)

Commenting out the call to XCloseDisplay gives me WebRender back (on the second run).

Thanks, was about to ask you. Yeah, if glxtest fails it usually takes two restarts to clean up the blocklist entries or so. So well, I guess let's go with this ugly workaround then. It works well here as well, both on X and Wayland.

(In reply to Robert Mader [:rmader] from comment #14)

FTR, I don't think NV-GLX has any business being there as that's a pure EGL code path :/

FWIW here's where the NV-GLX close_display hook seems to get added:

Thread 2.1 "firefox" hit Breakpoint 2, XESetCloseDisplay (dpy=0x7ffff78ca000, extension=1, proc=0x7fffe960f580) at ../../src/InitExt.c:93
93	in ../../src/InitExt.c
(gdb) bt
#0  XESetCloseDisplay (dpy=0x7ffff78ca000, extension=1, proc=0x7fffe960f580) at ../../src/InitExt.c:93
#1  0x00007fffe960f7fa in  () at /usr/lib/x86_64-linux-gnu/libnvidia-glsi.so.460.73.01
#2  0x00007fffe96020b8 in  () at /usr/lib/x86_64-linux-gnu/libnvidia-glsi.so.460.73.01
#3  0x00007fffe9602245 in  () at /usr/lib/x86_64-linux-gnu/libnvidia-glsi.so.460.73.01
#4  0x00007fffe9888c73 in  () at /usr/lib/x86_64-linux-gnu/libEGL_nvidia.so.0
#5  0x00007fffe9888da1 in  () at /usr/lib/x86_64-linux-gnu/libEGL_nvidia.so.0
#6  0x00007fffe98a10a0 in  () at /usr/lib/x86_64-linux-gnu/libEGL_nvidia.so.0
#7  0x00007fffe9c5698d in  () at /usr/lib/x86_64-linux-gnu/libEGL.so.1
#8  0x00007ffff1f1dea7 in get_egl_status(void*, bool, bool) (native_dpy=native_dpy@entry=0x7ffff78ca000, gles_test=false, require_driver=false)
    at /home/jcristau/src/hg.mozilla.org/mozilla-unified/toolkit/xre/glxtest.cpp:589
#9  0x00007ffff1f19cd5 in x11_egltest(int) (pci_count=1) at /home/jcristau/src/hg.mozilla.org/mozilla-unified/toolkit/xre/glxtest.cpp:846
#10 childgltest() () at /home/jcristau/src/hg.mozilla.org/mozilla-unified/toolkit/xre/glxtest.cpp:1225
#11 0x00007ffff1f1a417 in fire_glxtest_process() () at /home/jcristau/src/hg.mozilla.org/mozilla-unified/toolkit/xre/glxtest.cpp:1264
#12 0x00007ffff1f0e752 in XREMain::XRE_mainInit(bool*) (this=<optimized out>, this@entry=0x7fffffffccb0, aExitFlag=aExitFlag@entry=0x7fffffffcc37)
    at /home/jcristau/src/hg.mozilla.org/mozilla-unified/toolkit/xre/nsAppRunner.cpp:3614
#13 0x00007ffff1f156db in XREMain::XRE_main(int, char**, mozilla::BootstrapConfig const&)
    (this=this@entry=0x7fffffffccb0, argc=argc@entry=4, argv=argv@entry=0x7fffffffdf58, aConfig=...)
    at /home/jcristau/src/hg.mozilla.org/mozilla-unified/toolkit/xre/nsAppRunner.cpp:5411
#14 0x00007ffff1f15ada in XRE_main(int, char**, mozilla::BootstrapConfig const&) (argc=1, argv=0x7fffe960f580, aConfig=...)
    at /home/jcristau/src/hg.mozilla.org/mozilla-unified/toolkit/xre/nsAppRunner.cpp:5496
#15 0x000055555557bdc4 in do_main(int, char**, char**) (argc=1, argv=0x7fffffffdf58, envp=<optimized out>)
    at /home/jcristau/src/hg.mozilla.org/mozilla-unified/browser/app/nsBrowserApp.cpp:224
#16 main(int, char**, char**) (argc=4, argv=<optimized out>, envp=<optimized out>) at /home/jcristau/src/hg.mozilla.org/mozilla-unified/browser/app/nsBrowserApp.cpp:351

I haven't tracked down where libnvidia-glsi.so is unloaded yet.

(In reply to Julien Cristau [:jcristau] from comment #18)

I haven't tracked down where libnvidia-glsi.so is unloaded yet.

It's from dlclose(libegl); I guess eglTerminate doesn't clean up the display hooks properly?

I'm inclined to write this off as another pain to suffer for being one of the first big applications to go full EGL on X11. The good news is that our work has already motivated fixes in mesa and Xwayland, so that GTK4 was able to make the transition more easily [1]. Another one is obs-studio, which recently landed similar work (but didn't enable it by default yet) [2].

So, well, even after all these years things won't be painless. But we're at least not alone in this struggle :)

1: https://gitlab.gnome.org/GNOME/gtk/-/merge_requests/3540
2: https://github.com/obsproject/obs-studio/pull/2478 / https://github.com/obsproject/obs-studio/pull/2484

Status: UNCONFIRMED → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Target Milestone: --- → 91 Branch

If there is a suspicion of an NVIDIA driver bug, I'd like to take a look. But I do not fully understand from the discussion here what the bug is thought to be.
Is the "glxtest" tool available standalone so I can observe the problem locally?
Thank you

(In reply to Arthur Huillet from comment #23)

If there is a suspicion of an NVIDIA driver bug, I'd like to take a look. But I do not fully understand from the discussion here what the bug is thought to be.
Is the "glxtest" tool available standalone so I can observe the problem locally?
Thank you

Thanks, that would be great. I don't think glxtest can be run standalone - Julien, do you know more maybe?
You should be able to observe the issue in a build from yesterday, using mozregression - I can also provide you with one, if that helps.

Flags: needinfo?(stransky) → needinfo?(jcristau)

P.S.: The relevant lines are those at https://searchfox.org/mozilla-central/source/toolkit/xre/glxtest.cpp#837-855.
What happens is:

Thanks. I'd need to observe the crash locally to determine what, if anything, the NVIDIA driver is doing wrong. There have been bugs found and fixed recently related to EGL resource lifecycle management.
Can you please help me observe it locally? I've never run mozregression, or really anything Mozilla/Firefox that wasn't a distro package in the past.

I wonder if the more proper fix might be to avoid calling dlclose(libegl) at least while the display is open. On the assumption that after we've called eglGetDisplay(native_dpy) it's not safe to unload libEGL (and its dependencies) while the display is alive?

Flags: needinfo?(jcristau) → needinfo?(robert.mader)

There are uplift requests in flight for this and related bugs, updating status flags.

(In reply to Julien Cristau [:jcristau] from comment #27)

I wonder if the more proper fix might be to avoid calling dlclose(libegl) at least while the display is open. On the assumption that after we've called eglGetDisplay(native_dpy) it's not safe to unload libEGL (and its dependencies) while the display is alive?

Well, we call eglTerminate(dpy) on the EGLDisplay we created via eglGetDisplay(). Unless we forgot to release everything properly (looking at eglMakeCurrent right now), it should be safe to close the Display, no?

Flags: needinfo?(robert.mader)

Robert: Xlib doesn't really expect that, as far as I can tell. If any of the libs pulled in by libEGL registers an extension (with XextAddDisplay), then there doesn't seem to be a reasonable way for it to clean up. In this case nvidia's eglGetDisplay adds the NV-GLX extension, and libXext itself adds Generic Event Extension, both of which register close_display callbacks.

Arthur: here's what glxtest is doing, as far as I can tell (reduced to cause the crash; link with -ldl -lX11):

#include <dlfcn.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <X11/Xlib.h>
#include <EGL/egl.h>

#define LIBEGL_FILENAME "libEGL.so.1"

static bool get_egl_status(EGLNativeDisplayType native_dpy) {
  void* libegl = dlopen(LIBEGL_FILENAME, RTLD_LAZY);
  if (!libegl) {
    return false;
  }
  PFNEGLGETPROCADDRESSPROC eglGetProcAddress =
      (PFNEGLGETPROCADDRESSPROC)(dlsym(libegl, "eglGetProcAddress"));
  if (!eglGetProcAddress) {
    dlclose(libegl);
    return false;
  }
  PFNEGLGETDISPLAYPROC eglGetDisplay =
      (PFNEGLGETDISPLAYPROC)(eglGetProcAddress("eglGetDisplay"));
  PFNEGLTERMINATEPROC eglTerminate =
      (PFNEGLTERMINATEPROC)(eglGetProcAddress("eglTerminate"));
  if (!eglGetDisplay || !eglTerminate) {
    dlclose(libegl);
    return false;
  }
  EGLDisplay dpy = eglGetDisplay(native_dpy);
  if (!dpy) {
    dlclose(libegl);
    return false;
  }
  eglTerminate(dpy);
  dlclose(libegl);
  return true;
}

int main() {
  Display* dpy = XOpenDisplay(NULL);
  if (!dpy) {
    return 1;
  }
  if (!get_egl_status(dpy)) {
    return 1;
  }
  XCloseDisplay(dpy);
  return 0;
}

(In reply to Arthur Huillet from comment #26)

Can you please help me observe it locally? I've never run mozregression, or really anything Mozilla/Firefox that wasn't a distro package in the past.

$ pip3 install --user mozregression

Bad build from comment 5: $ mozregression --launch 20210608091819 -a about:support

  • Compositing: WebRender (Software)
  • "WebGL creation failed"
  • Failure Log
    (#0) Error: No GPUs detected via PCI
    (#1) Error: glxtest: process failed (exited with status 1)

Good build from comment 6: $ mozregression --launch 2021-06-01 -a about:support

  • Compositing: WebRender
  • no WebGL error
  • no failures

(In reply to Julien Cristau [:jcristau] from comment #30)

Robert: Xlib doesn't really expect that, as far as I can tell. If any of the libs pulled in by libEGL registers an extension (with XextAddDisplay), then there doesn't seem to be a reasonable way for it to clean up. In this case nvidia's eglGetDisplay adds the NV-GLX extension, and libXext itself adds Generic Event Extension, both of which register close_display callbacks.

Well, I don't really see an advantage in trading dlclose(libegl) against XCloseDisplay(dpy). Skipping either of them is somewhat dirty but also doesn't really matter, as the process will exit within milliseconds anyway. Do you have a preference?

In any case, AFAICS NV-GLX should simply not get loaded when initializing EGL. Looks to me like some unfortunate entanglement within the driver that should probably better be avoided.

Thank you Julien for sharing the standalone reproducer. I filed NVIDIA bug 200740810 for this to be investigated. Sadly I am not personally qualified in EGL to say who's wrong here and why.

Comment on attachment 9226143 [details]
Bug 1715245 - Leave X11 connection open, r=aosmond

approved for 90.0b7 to avoid a regression

Attachment #9226143 - Flags: approval-mozilla-beta+

Confirmed NVIDIA driver bug, we're looking into it. A workaround we can suggest is to XCloseDisplay before the dlclose.

Aside from the NVIDIA driver itself, the same bug is also in libXext.so for the generic event extension:
https://gitlab.freedesktop.org/xorg/lib/libxext/-/issues/3

Even after the NVIDIA driver is fixed, the test program in comment #30 would still crash, because when it unloads libEGL.so, that would also unload libXext.so.

The cleanest workaround I can think of would be to call XCloseDisplay first, before you unload libEGL.so. The callbacks are all per-display, so closing the display first ensures that they get cleared out without having to leak anything.
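
The ordering argument can be modeled without X11 or EGL. What follows is a minimal sketch in plain C, not Xlib's or the driver's actual code: ModelLib, ModelHook, ModelDisplay and the model_* functions are hypothetical stand-ins for a dlopen()ed library, a per-display close_display hook, and an X11 Display. It only illustrates why the hook has to be cleared while the library that owns it is still loaded.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for a dlopen()ed library such as libEGL. */
typedef struct {
  bool loaded;
} ModelLib;

/* Hypothetical stand-in for an extension record with a close hook. */
typedef struct {
  ModelLib *owner; /* library whose code the hook points into */
  bool registered;
} ModelHook;

/* Hypothetical stand-in for an X11 Display holding a per-display hook. */
typedef struct {
  ModelHook hook;
  bool open;
} ModelDisplay;

/* Models eglGetDisplay() on the affected driver: registers a close hook. */
static void model_get_display(ModelDisplay *dpy, ModelLib *lib) {
  assert(dpy->open && lib->loaded);
  dpy->hook.owner = lib;
  dpy->hook.registered = true;
}

/* Models dlclose(): the code goes away, the hook stays registered. */
static void model_unload(ModelLib *lib) { lib->loaded = false; }

/* Models XCloseDisplay(): runs the hook once, then clears it.
 * Returns false if the hook pointed into an unloaded library, which in
 * the real program would be a jump into unmapped memory. */
static bool model_close_display(ModelDisplay *dpy) {
  bool ok = true;
  if (dpy->hook.registered) {
    ok = dpy->hook.owner->loaded;
    dpy->hook.registered = false;
    dpy->hook.owner = NULL;
  }
  dpy->open = false;
  return ok;
}

/* Order used by the reproducer: unload the library, then close. */
static bool unload_then_close(void) {
  ModelLib lib = {true};
  ModelDisplay dpy = {{NULL, false}, true};
  model_get_display(&dpy, &lib);
  model_unload(&lib);
  return model_close_display(&dpy);
}

/* Suggested order: close the display, then unload the library. */
static bool close_then_unload(void) {
  ModelLib lib = {true};
  ModelDisplay dpy = {{NULL, false}, true};
  model_get_display(&dpy, &lib);
  bool ok = model_close_display(&dpy);
  model_unload(&lib);
  return ok;
}
```

In this model, unload_then_close() returns false (the hook runs after its library is gone) and close_then_unload() returns true. Applied to the reproducer in comment #30, the second order would mean moving XCloseDisplay(dpy) ahead of dlclose(libegl).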

Regressions: 1717843