Work around frequent driver crash in Wayland pool
Categories
(Firefox Build System :: Task Configuration, task)
Tracking
(Not tracked)
People
(Reporter: ahal, Unassigned)
References
(Blocks 1 open bug)
Details
We're hitting a frequent crash related to llvmpipe in the Wayland pool:
https://bugzilla.mozilla.org/show_bug.cgi?id=1835691
Note that this is not due to using Wayland, but just a side effect of the pool's environment. If we can't fix the crash, we'll need to work around this in the image somehow.
Reporter
Comment 1•1 year ago
Martin, do you think using a different version of llvmpipe could help us get unstuck here?
Comment 2•1 year ago
By environment, do you mean a VM running on GCP, or something else?
Reporter
Comment 3•1 year ago
Yes, Ubuntu 22.04 VirtualBox VMs running on GCP instances.
Reporter
Comment 4•1 year ago
Martin, also when you reproduced locally, was it using VirtualBox?
Comment 5•1 year ago
(I'm just a tester, not a programmer.)
Mesa d3d12 is for the "Windows Subsystem for Linux" (Linux on top of Windows), so it's irrelevant for regular Linux desktops. It has, however, caused regressions like bug 1804030 (llvmpipe was unaffected; probably introduced with 22.3.0), which was fixed in Mesa 22.3.1.
Ubuntu 22.04 jammy has Mesa 22.2.5 and should be unaffected by that specific bug.
Does llvmpipe still crash if you set these prefs?
webgl.use-canvas-render-thread=false
webgl.threadsafe-gl.force-disabled=true
- This pref and the one above ensure OpenGL is not used on multiple threads. Bug 1777849 comment 21 affected X11, but IIUC Mesa itself isn't fully thread-safe either. Firefox blocks THREADSAFE_GL by default only for Nouveau. IIRC there was a comment that Radeon isn't fully thread-safe either. I assume llvmpipe has the same problem. I don't understand why security risks(?) are accepted in exchange for better performance under load by using OpenGL on multiple threads in the same process.
widget.dmabuf-webgl.enabled=false
- (IIUC: copies WebGL frames (WebGL -> system memory -> WebRender GL) instead of sharing their GPU buffers)
- Edit: This pref change isn't needed because the about:support page opened by
LIBGL_ALWAYS_SOFTWARE=1 mozregression --launch 2023-07-24 -a about:support --pref gfx.webrender.all:true
contains DMABUF: Failed to configure + FEATURE_FAILURE_NO_DRM_DEVICE
media.hardware-video-decoding.enabled=false
- Edit: This pref change isn't needed because the about:support page opened by
LIBGL_ALWAYS_SOFTWARE=1 mozregression --launch 2023-07-24 -a about:support --pref gfx.webrender.all:true
contains HARDWARE_VIDEO_DECODING: Force disabled by gfxInfo + FEATURE_FAILURE_VIDEO_DECODING_TEST_FAILED
webgl.out-of-process.async-present.force-sync=true (was needed to avoid bug 1831548)
gfx.canvas.accelerated=false
gfx.webrender.panic-on-gl-error=true
gfx.webrender.multithreading=false
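To try the prefs above in one go, something like the following sketch could build the mozregression command line. It is only an illustration: the `name:value` form is the one used in the commands earlier in this comment, the helper name `pref_args` and the `PREFS` dict are made up here, and it assumes `--pref` accepts several pairs after a single flag (if not, the flag would need to be repeated per pref). The two prefs marked "isn't needed" above are left out.

```python
# Hypothetical helper, not part of mozregression itself.
PREFS = {
    "webgl.use-canvas-render-thread": False,
    "webgl.threadsafe-gl.force-disabled": True,
    "webgl.out-of-process.async-present.force-sync": True,
    "gfx.canvas.accelerated": False,
    "gfx.webrender.panic-on-gl-error": True,
    "gfx.webrender.multithreading": False,
}


def pref_args(prefs):
    """Return ['--pref', 'name:value', ...], or [] when there are no prefs."""
    if not prefs:
        return []
    pairs = []
    for name, value in prefs.items():
        # Booleans are written lowercase, like gfx.webrender.all:true above.
        text = str(value).lower() if isinstance(value, bool) else str(value)
        pairs.append(f"{name}:{text}")
    # Assumption: one --pref flag may be followed by several name:value pairs.
    return ["--pref"] + pairs


if __name__ == "__main__":
    cmd = ["mozregression", "--launch", "2023-07-24", "-a", "about:support"]
    # LIBGL_ALWAYS_SOFTWARE=1 still has to be set in the environment.
    print(" ".join(cmd + pref_args(PREFS)))
```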
Reporter
Comment 6•1 year ago
I was able to reproduce the crash following the steps from https://bugzilla.mozilla.org/show_bug.cgi?id=1835691#c16
Then after setting the above prefs, I can no longer reproduce. I also kicked off a try run with these prefs set to be extra sure:
https://treeherder.mozilla.org/jobs?repo=try&revision=c63634902fb2e430e61629e6df5fc5729d00db52
Comment 7•1 year ago
Comment 8•1 year ago
We can take the upstream patch and patch mesa locally in automation:
https://gitlab.freedesktop.org/mesa/mesa/uploads/4c7a10e1b45d1b83c2b0e8a5f9a22bc5/mesa-lp_scene-thread.patch
Reporter
Comment 9•1 year ago
Andy, is comment 8 something you think you could set up in the image?
Comment 10•1 year ago
Ideally the patch would be reviewed, merged, and have a deb built as part of a PPA. There is a PPA (https://launchpad.net/~kisak/+archive/ubuntu/kisak-mesa), but it seems to be an LTS-oriented one, so I'm not sure when they'd pick up this change.
We don't usually do the merging, building, and packaging ourselves, but we can if it's a requirement.
What's the priority on this? Can we wait to see if we can get it upstreamed or do we need it ASAP (and need to do everything ourselves)?
Reporter
Comment 11•1 year ago
Yeah I can imagine patching our own graphics drivers would be a pain.
Maybe we can get tests going that don't exercise the graphics stack heavily in the meantime (though those tests are probably also less valuable).
So I guess the questions are:
- How badly do we need this testing / how long are we willing to wait for the upstream patch?
- Are there any easier workarounds (e.g. alternatives to llvmpipe we can use)?
Let me try a bit harder to disable my way to victory here (and test out a few other suites) before we do anything drastic like that.
Comment 12•1 year ago
This is mochitest-plain; it isn't mochitest-media, reftest, etc. Maybe xpcshell, cppunittest, or gtest would be good starting points?
Comment 13•1 year ago
(In reply to Andrew Halberstadt [:ahal] from comment #6)
> I was able to reproduce the crash following the steps from https://bugzilla.mozilla.org/show_bug.cgi?id=1835691#c16
> Then after setting the above prefs, I can no longer reproduce. I also kicked off a try run with these prefs set to be extra sure:
> https://treeherder.mozilla.org/jobs?repo=try&revision=c63634902fb2e430e61629e6df5fc5729d00db52
Did the pref change work?
If yes, you could add the following here to use the same code path as the Mesa Nvidia driver and the proprietary Nvidia driver:
APPEND_TO_DRIVER_BLOCKLIST_EXT(
OperatingSystem::Linux, ScreenSizeStatus::All, BatteryStatus::All,
WindowProtocol::All, DriverVendor::SoftwareMesaAll, DeviceFamily::All,
nsIGfxInfo::FEATURE_THREADSAFE_GL, nsIGfxInfo::FEATURE_BLOCKED_DEVICE,
DRIVER_COMPARISON_IGNORED, V(0, 0, 0, 0),
"FEATURE_FAILURE_BUG_1845765",
"https://gitlab.freedesktop.org/mesa/mesa/-/issues/9074");
Reporter
Comment 14•1 year ago
Thanks! To clarify, are you saying we'd fall back to software rendering? Or proprietary drivers? If the former, it would likely defeat a lot of the purpose of the tests (though it would still be better than nothing, I guess). This would essentially be the -swr variant. If the latter, that sounds like a good path forward. Maybe we could install drivers from:
https://launchpad.net/~graphics-drivers/+archive/ubuntu/ppa ?
That said, it also looks like Martin's patch has a merge request and there is activity:
https://gitlab.freedesktop.org/mesa/mesa/-/merge_requests/24414
Though I'm not sure how quickly the fix will be available via PPA. Maybe https://launchpad.net/~oibaf/+archive/ubuntu/graphics-drivers would be suitable? It doesn't look very official.
Comment 15•1 year ago
IIUC:
Firefox's rendering engine (WebRender) has two code paths:
- "Software WebRender" as a fallback for old systems (WebRender then uses an OpenGL context provided by its built-in SWGL software OpenGL library, which doesn't use a GPU)
- (Hardware) WebRender
- The open source Linux graphics driver, Mesa, contains the iris driver for Intel, amdgpu for Radeon, nouveau for Nvidia. (Irrelevant & just to give an overview: There also exists a proprietary Nvidia driver outside of Mesa that can be installed manually to replace nouveau.)
- But the CI doesn't test with a real GPU and a hardware OpenGL driver, so Mesa's built-in LLVMpipe software OpenGL library is used. It's like Firefox's built-in SWGL, but doesn't have special WebRender OpenGL extensions as fast paths.
- With LLVMpipe software OpenGL you also can't test DMABUF VAAPI hardware video decoding; you would need a real Intel or AMD GPU for that.
- If WebGL is used, it runs parallel to WebRender and its frames/images are exported and then imported into WebRender.
- By default, WebGL uses OpenGL in the same process as WebRender, but in a different thread called CanvasRendererThread (a possible security risk, but allegedly good for performance).
It can be turned off with webgl.use-canvas-render-thread=false; WebGL then still runs in a different thread in the same process, just under another name (still risky).
With webgl.threadsafe-gl.force-disabled=true, WebGL runs in the same thread as WebRender (safe, but possibly worse performance). That's the default for users with an Nvidia GPU and either the Mesa Nouveau driver or the proprietary Nvidia driver. From my understanding, llvmpipe should be added to the same blocklist (comment 13), as it's the same problem in a different Mesa driver.
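The thread placement described in this comment can be summarised as a small decision function. This is only a sketch of the explanation above (which is itself prefixed with IIUC), not Firefox code: the function name and the returned labels are made up, and letting `webgl.threadsafe-gl.force-disabled` take precedence when both prefs are set is an assumption.

```python
def webgl_gl_thread(use_canvas_render_thread=True,
                    threadsafe_gl_force_disabled=False):
    """Return a label for the thread WebGL's OpenGL calls would run on.

    Arguments mirror webgl.use-canvas-render-thread and
    webgl.threadsafe-gl.force-disabled; defaults follow the comment above.
    """
    if threadsafe_gl_force_disabled:
        # Assumed precedence: forcing thread-safe GL off wins over the other pref.
        return "WebRender thread"
    if use_canvas_render_thread:
        return "CanvasRendererThread"
    # Still a separate thread in the same process, just under another name.
    return "other dedicated thread"


if __name__ == "__main__":
    print(webgl_gl_thread())
    print(webgl_gl_thread(use_canvas_render_thread=False))
    print(webgl_gl_thread(threadsafe_gl_force_disabled=True))
```

Only the last case keeps all OpenGL use on a single thread, which is the configuration the blocklist entry in comment 13 would select for llvmpipe.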
Reporter
Comment 16•1 year ago
aerickson, the fix for this is now released in Mesa 23.1.6. It looks like the PPA you linked in comment 10 should have the fix already. Could you rebuild the images to use the newer Mesa? This is our biggest remaining blocker for Wayland testing.
Comment 17•1 year ago
Verified that the fix was in 23.1.6 and the latest version in the PPA is 23.1.7.
I'll make a new image.
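For image builds, a check along these lines could confirm that the installed Mesa is at least the release carrying the fix. It is a rough sketch: the function names are hypothetical, it only handles plain dotted versions such as 23.1.6 (real Debian package versions carry suffixes that would need stripping first), and it does not account for distro backports of the patch to older releases.

```python
def parse_version(text):
    """Split a plain dotted version like '23.1.6' into a tuple of ints."""
    return tuple(int(part) for part in text.split("."))


# First Mesa release containing the llvmpipe fix, per comment 16.
FIXED_IN = parse_version("23.1.6")


def has_llvmpipe_fix(mesa_version):
    """True if mesa_version is the fixed release or newer."""
    return parse_version(mesa_version) >= FIXED_IN


if __name__ == "__main__":
    # 22.2.5 is the stock Ubuntu 22.04 version mentioned in comment 5.
    for version in ("22.2.5", "23.1.6"):
        print(version, has_llvmpipe_fix(version))
```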
Comment 18•1 year ago
The image with the updated mesa debs from the PPA is live.
Reporter
Comment 19•1 year ago
Thanks Andrew! And thanks Martin for the upstream fix!