Closed Bug 1845765 Opened 1 year ago Closed 1 year ago

Work around frequent driver crash in Wayland pool

Categories

(Firefox Build System :: Task Configuration, task)

task

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: ahal, Unassigned)

References

(Blocks 1 open bug)

Details

We're hitting a frequent crash related to llvmpipe in the Wayland pool:
https://bugzilla.mozilla.org/show_bug.cgi?id=1835691

Note that this is not due to using Wayland, but just a side effect of the pool's environment. If we can't fix the crash, we'll need to work around this in the image somehow.

Martin, do you think using a different version of llvmpipe could help us get unstuck here?

Flags: needinfo?(stransky)

By "environment", do you mean a VM, something running on GCP, or something else?

Yes, an Ubuntu 22.04 VirtualBox VM running on GCP instances.

Martin, also when you reproduced locally, was it using VirtualBox?

(I'm just a tester, not a programmer.)
Mesa d3d12 is for the "Windows Subsystem for Linux" (Linux on top of Windows). It's irrelevant for regular Linux desktops, but it has caused regressions such as bug 1804030 (llvmpipe was unaffected; probably introduced in 22.3.0), which was fixed in Mesa 22.3.1.
Ubuntu 22.04 jammy has Mesa 22.2.5 and should be unaffected by that specific bug.

Does llvmpipe still crash if you set these prefs?
webgl.use-canvas-render-thread=false
webgl.threadsafe-gl.force-disabled=true

  • This pref and the one above ensure that OpenGL is not used on multiple threads. Bug 1777849 comment 21 affected X11, but IIUC Mesa itself is not fully thread safe either. Firefox blocks THREADSAFE_GL by default only for Nouveau. IIRC there was a comment that Radeon is not fully thread safe either, and I assume llvmpipe has the same problem. I don't understand why the security risks(?) of using OpenGL on multiple threads in the same process are accepted in exchange for better performance under load.

widget.dmabuf-webgl.enabled=false

  • (IIUC: copies WebGL frames (WebGL -> system memory -> WebRender GL) instead of sharing their GPU buffers)
  • Edit: This pref change isn't needed because LIBGL_ALWAYS_SOFTWARE=1 mozregression --launch 2023-07-24 -a about:support --pref gfx.webrender.all:true contains DMABUF: Failed to configure + FEATURE_FAILURE_NO_DRM_DEVICE

media.hardware-video-decoding.enabled=false

  • Edit: This pref change isn't needed because LIBGL_ALWAYS_SOFTWARE=1 mozregression --launch 2023-07-24 -a about:support --pref gfx.webrender.all:true contains HARDWARE_VIDEO_DECODING: Force disabled by gfxInfo + FEATURE_FAILURE_VIDEO_DECODING_TEST_FAILED

webgl.out-of-process.async-present.force-sync=true (was needed to avoid bug 1831548)
gfx.canvas.accelerated=false
gfx.webrender.panic-on-gl-error=true
gfx.webrender.multithreading=false
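For anyone who wants to try the same set of prefs locally, here is a minimal sketch that renders them as `user.js` lines. It assumes the prefs are applied through a profile's `user.js` file, which the comment above does not state; the names `WORKAROUND_PREFS`, `format_pref_value` and `to_user_js` are made up for illustration. The pref names and values are copied from the list above, including the two that were later marked as not needed.

```python
import json

# Prefs exactly as listed in the comment above (assumption: applied via user.js).
WORKAROUND_PREFS = {
    "webgl.use-canvas-render-thread": False,
    "webgl.threadsafe-gl.force-disabled": True,
    "widget.dmabuf-webgl.enabled": False,  # later marked as not needed
    "media.hardware-video-decoding.enabled": False,  # later marked as not needed
    "webgl.out-of-process.async-present.force-sync": True,
    "gfx.canvas.accelerated": False,
    "gfx.webrender.panic-on-gl-error": True,
    "gfx.webrender.multithreading": False,
}


def format_pref_value(value):
    """Render a Python value as a user.js literal (hypothetical helper)."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "true" if value else "false"
    if isinstance(value, int):
        return str(value)
    return json.dumps(str(value))  # quoted string literal


def to_user_js(prefs):
    """Return one user_pref(...) line per pref, in insertion order."""
    lines = [
        f'user_pref("{name}", {format_pref_value(value)});'
        for name, value in prefs.items()
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    print(to_user_js(WORKAROUND_PREFS), end="")
```

The output can be pasted into the `user.js` of a throwaway profile; removing the file restores the defaults.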

I was able to reproduce the crash following the steps from https://bugzilla.mozilla.org/show_bug.cgi?id=1835691#c16

Then after setting the above prefs, I can no longer reproduce. I also kicked off a try run with these prefs set to be extra sure:
https://treeherder.mozilla.org/jobs?repo=try&revision=c63634902fb2e430e61629e6df5fc5729d00db52

Andy, is comment 8 something you think you could set up in the image?

Flags: needinfo?(aerickson)

Ideally the patch would be reviewed and merged, and have a deb built as part of a PPA. There is a PPA (https://launchpad.net/~kisak/+archive/ubuntu/kisak-mesa), but it seems to be an LTS-oriented one... not sure when they'd pick up this change.

We don't usually do the merging, building, and packaging ourselves, but we can if it's a requirement.

What's the priority on this? Can we wait to see if we can get it upstreamed or do we need it ASAP (and need to do everything ourselves)?

Flags: needinfo?(aerickson)

Yeah I can imagine patching our own graphics drivers would be a pain.

Maybe we can get tests going that don't exercise the graphics stack heavily in the meantime (though those tests are probably also less valuable).

So I guess the questions are:

  1. How badly do we need this testing / how long are we willing to wait for the upstream patch?
  2. Are there any easier workarounds (for example, alternatives to llvmpipe we can use)?

Let me try a bit harder to disable my way to victory here (and test out a few other suites) before we do anything drastic like that.

This is mochitest-plain; it isn't mochitest-media, reftest, etc. Maybe xpcshell, cppunittest, or gtest would be good starting points?

(In reply to Andrew Halberstadt [:ahal] from comment #6)

I was able to reproduce the crash following the steps from https://bugzilla.mozilla.org/show_bug.cgi?id=1835691#c16

Then after setting the above prefs, I can no longer reproduce. I also kicked off a try run with these prefs set to be extra sure:
https://treeherder.mozilla.org/jobs?repo=try&revision=c63634902fb2e430e61629e6df5fc5729d00db52

Did the pref change work?
If yes, you could add the following here to use the same code path as the Mesa Nvidia driver and the proprietary Nvidia driver:

    APPEND_TO_DRIVER_BLOCKLIST_EXT(
        OperatingSystem::Linux, ScreenSizeStatus::All, BatteryStatus::All,
        WindowProtocol::All, DriverVendor::SoftwareMesaAll, DeviceFamily::All,
        nsIGfxInfo::FEATURE_THREADSAFE_GL, nsIGfxInfo::FEATURE_BLOCKED_DEVICE,
        DRIVER_COMPARISON_IGNORED, V(0, 0, 0, 0),
        "FEATURE_FAILURE_BUG_1845765",
        "https://gitlab.freedesktop.org/mesa/mesa/-/issues/9074");

Thanks! To clarify, are you saying we'd fall back to software rendering, or to proprietary drivers? If the former, it would likely defeat a lot of the purpose of the tests (though it would still be better than nothing, I guess). This would essentially be the -swr variant. If the latter, that sounds like a good path forward. Maybe we could install drivers from:
https://launchpad.net/~graphics-drivers/+archive/ubuntu/ppa ?

That said, it also looks like Martin's patch has a merge request and there is activity:
https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/24414

Though I'm not sure how quickly the fix will be available via PPA. Maybe https://launchpad.net/~oibaf/+archive/ubuntu/graphics-drivers would be suitable? It doesn't look very official.

IIUC:
Firefox's rendering engine (WebRender) has two code paths:

  • "Software WebRender" as a fallback for old systems (WebRender then uses an OpenGL context provided by its built-in SWGL software OpenGL library, which doesn't use a GPU)
  • (Hardware) WebRender
    • The open source Linux graphics driver, Mesa, contains the iris driver for Intel, radeonsi for Radeon, and nouveau for Nvidia. (Irrelevant & just to give an overview: There also exists a proprietary Nvidia driver outside of Mesa that can be installed manually to replace nouveau.)
    • But the CI doesn't test with a real GPU and a hardware OpenGL driver, so Mesa's built-in LLVMpipe software OpenGL library is used. It's like Firefox's built-in SWGL, but without the special WebRender OpenGL extensions that serve as fast paths.
      • With LLVMpipe software OpenGL you also can't test Dmabuf VAAPI hardware video decoding; you would need a real Intel or AMD GPU for that.
    • If WebGL is used, it runs parallel to WebRender and its frames/images are exported and then imported into WebRender.
      • By default, WebGL uses OpenGL in the same process as WebRender, but in a different thread called CanvasRendererThread (a possible security risk, but allegedly good for performance).
        It can be turned off with webgl.use-canvas-render-thread=false; WebGL then still runs in a different thread in the same process, just under another name (still risky).
        With webgl.threadsafe-gl.force-disabled=true, WebGL runs in the same thread as WebRender (safe, but possibly worse performance). That's the default for users with an Nvidia GPU and either the Mesa Nouveau driver or the proprietary Nvidia driver. From my understanding, llvmpipe should be added to the same blocklist (comment 13), as it's the same problem in a different Mesa driver.
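The three WebGL threading configurations described above can be summarized as a small decision function. This is only a toy model of the comment's (IIUC-level) description, not Firefox code: the function name and the returned labels are invented for illustration, and giving `webgl.threadsafe-gl.force-disabled` precedence when both prefs are set is an assumption, since the comment does not say how the two interact.

```python
def webgl_gl_thread(use_canvas_render_thread=True,
                    threadsafe_gl_force_disabled=False):
    """Return where WebGL's OpenGL calls run, per the description above.

    Parameters mirror the prefs webgl.use-canvas-render-thread and
    webgl.threadsafe-gl.force-disabled; defaults model the default case
    described in the comment.
    """
    # Assumption: force-disabling thread-safe GL wins over the other pref.
    if threadsafe_gl_force_disabled:
        return "same thread as WebRender"
    if use_canvas_render_thread:
        return "CanvasRendererThread"
    # Still a separate thread in the same process, just not that name.
    return "separate thread (other name)"


if __name__ == "__main__":
    print(webgl_gl_thread())
    print(webgl_gl_thread(use_canvas_render_thread=False))
    print(webgl_gl_thread(threadsafe_gl_force_disabled=True))
```

Only the last case keeps OpenGL on a single thread, which is why the blocklist entry in comment 13 targets FEATURE_THREADSAFE_GL rather than the canvas render thread pref.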

aerickson, the fix for this is now released in Mesa 23.1.6. It looks like the PPA you linked in comment 10 should have the fix already. Could you rebuild the images to use the newer Mesa? This is our biggest remaining blocker for Wayland testing.

Flags: needinfo?(aerickson)
Depends on: 1852844

Verified that the fix was in 23.1.6 and the latest version in the PPA is 23.1.7.

I'll make a new image.
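The check described above (does the packaged Mesa include the release that carries the fix?) can be sketched as a tuple comparison. This is a minimal sketch under stated assumptions: it only handles plain dotted versions such as "23.1.6", not full Debian package version strings with suffixes, and the helper names are hypothetical. The version numbers used are the ones mentioned in this bug.

```python
def parse_version(text):
    """Turn a plain dotted version like '23.1.6' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


# First Mesa release containing the llvmpipe fix, per this bug.
FIX_VERSION = parse_version("23.1.6")


def has_fix(installed):
    """True if the installed Mesa version is at least the fixed release."""
    return parse_version(installed) >= FIX_VERSION


if __name__ == "__main__":
    for candidate in ("22.2.5", "23.1.6"):
        status = "has the fix" if has_fix(candidate) else "needs an update"
        print(candidate, status)
```

Comparing tuples of integers rather than raw strings avoids lexicographic surprises when a component reaches two digits.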

Flags: needinfo?(aerickson)

The image with the updated mesa debs from the PPA is live.

Thanks Andrew! And thanks Martin for the upstream fix!

Status: NEW → RESOLVED
Closed: 1 year ago
Resolution: --- → FIXED