MOZ_X11_EGL/Nvidia: Broken fonts and images after suspend/resume EGL
Categories
(Core :: Graphics, defect)
Tracking
()
People
(Reporter: mar.kolya, Assigned: rmader)
References
(Blocks 2 open bugs, Regression)
Details
(Keywords: correctness, regression)
Attachments
(5 files, 1 obsolete file)
User Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:94.0) Gecko/20100101 Firefox/94.0
Steps to reproduce:
Put my laptop through a suspend/resume cycle.
Actual results:
Fonts became garbled, icons not visible - see screenshot. Even new tabs display garbled fonts.
This seems to be specifically caused by the recent Nightly switch to EGL. I've tried turning EGL on in earlier versions of Nightly and saw the same results.
I use nvidia proprietary drivers.
Expected results:
Suspend/resume should not break pages.
Comment 1•3 years ago (Reporter)
To add: I remember there was a different bug when image became broken after suspend/resume on nvidia drivers - but that got resolved at some point. And that problem had different visual effects - like whole window with some random colors and that usually fixed itself without restarting firefox. Also new tabs were not affected.
This is the new problem - now it's mainly fonts that get affected and it looks like only restart helps.
Comment 2•3 years ago
The Bugbug bot thinks this bug should belong to the 'Core::Layout: Text and Fonts' component, and is moving the bug to that component. Please revert this change in case you think the bot is wrong.
Comment 3•3 years ago
It looks like not just fonts but also the icons in the tabs and toolbar are broken, which suggests to me a more general Graphics issue rather than anything specifically text-related.
Comment 4•3 years ago
Thanks for the report!
Comment 5•3 years ago
Ubuntu 21.04, Gnome X11, GTX 1060, Nvidia driver 470
Comment hidden (obsolete)
Comment 7•3 years ago
Comment 8•3 years ago
if (glx.HasVideoMemoryPurge()) {
  attribs.insert(attribs.end(),
                 {
                     LOCAL_GLX_GENERATE_RESET_ON_VIDEO_MEMORY_PURGE_NV,
bug 1680759 added it for EGL:
https://searchfox.org/mozilla-central/rev/45e308665eb9fc52fd21e2d4b3b959f3cec13ef1/gfx/gl/GLContextProviderEGL.cpp#683
if (flags & CreateContextFlags::PREFER_ROBUSTNESS) {
  std::vector<EGLint> base_robustness_attribs = required_attribs;
  if (egl->IsExtensionSupported(
          EGLExtension::NV_robustness_video_memory_purge)) {
    base_robustness_attribs.push_back(
        LOCAL_EGL_GENERATE_RESET_ON_VIDEO_MEMORY_PURGE_NV);
https://www.khronos.org/registry/OpenGL/extensions/NV/NV_robustness_video_memory_purge.txt
GL_NV_robustness_video_memory_purge
GLX_NV_robustness_video_memory_purge
EGL_NV_robustness_video_memory_purge
Nvidia X Server Settings > X Screen 0 > Graphics Information:
- GLX_NV_robustness_video_memory_purge is listed on GLX, Server GLX, Client GLX tabs.
- GL_NV_robustness_video_memory_purge is listed on OpenGL tab. It's listed on about:support.
- But EGL_NV_robustness_video_memory_purge is missing on the EGL tab. It's missing on about:support.
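The mismatch in the list above comes down to whether a given name appears in the space-separated extension string that the driver reports for each API. A minimal sketch of such a check, assuming the string has already been obtained (for example, copied from the settings tool); the helper name HasExtension is hypothetical and is not Firefox's IsExtensionSupported implementation:

```cpp
#include <sstream>
#include <string>

// Hypothetical helper: returns true if `name` appears as a whole token in a
// space-separated extension list. Matching whole tokens avoids false positives
// where one extension name is a prefix of another.
bool HasExtension(const std::string& extensionList, const std::string& name) {
  std::istringstream stream(extensionList);
  std::string token;
  while (stream >> token) {
    if (token == name) {
      return true;
    }
  }
  return false;
}
```

With this helper, a GLX string that lists GLX_NV_robustness_video_memory_purge does not satisfy a query for the EGL_ variant, which matches the situation reported here.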
Comment 9•3 years ago
Comment 10•3 years ago (Assignee)
(In reply to Darkspirit from comment #8)
- But EGL_NV_robustness_video_memory_purge is missing on the EGL tab. It's missing on about:support.
Jan: hm, that is odd - it would be a weird oversight by the NV driver, given that's their more or less private extension. Can you briefly confirm that it's also not listed in e.g. eglinfo (as opposed to glxinfo)?
Erik, can I ask you for some insight here? Is it expected that EGL_NV_robustness_video_memory_purge is not exposed?
Comment 11•3 years ago
Comment 12•3 years ago (Assignee)
Something very similar can apparently be observed when running Gnome-Shell/Wayland: https://gitlab.gnome.org/GNOME/mutter/-/issues/1942
Comment 13•3 years ago
To my surprise, it would seem that the reason EGL_NV_robustness_video_memory_purge is not exposed is because it really isn't supported by our EGL driver. I imagine it was included in the extension spec because we intended to implement it at some point, but nobody ever got around to actually doing so. Presumably there was never a demand for it until now.
It looks like it would be pretty easy to get it working, considering that we already have all the infrastructure in place for the GLX version. I think we'd just need to have eglCreateContext process the EGL_GENERATE_RESET_ON_VIDEO_MEMORY_PURGE_NV attribute. Unfortunately it's too late to get it into the upcoming 495 beta driver release, but possibly the following release.
Comment 14•3 years ago (Assignee)
Thanks Erik! I suppose it will be needed as long as NVreg_PreserveVideoMemoryAllocations is not enabled by default?
Jan: can you check if you see the same on native Wayland? I'd like to revert https://phabricator.services.mozilla.com/D117434 so we have HW-WR on native Wayland, but if this bug also happens there, we need to postpone that I suppose.
Comment 15•3 years ago (Assignee)
Erik: if I understand correctly, the long term plan to fix issues like this is the NVreg_PreserveVideoMemoryAllocations=1 road, however that's still work in progress[1][2]. This potentially explains why EGL_GENERATE_RESET_ON_VIDEO_MEMORY_PURGE_NV was never implemented.
If you can wire up EGL_GENERATE_RESET_ON_VIDEO_MEMORY_PURGE_NV, do you think it could also get backported to the 470 LTS driver series? Or would it only be possible for >=495?
I'm asking because this bug is likely the main blocker for shipping EGL on X11, which in turn helps with a bunch of other bugs (e.g. bug 1716049) - so not having to wait for all systems to have either NVreg_PreserveVideoMemoryAllocations properly set up or having been upgraded to the 495 series would be a great help :)
1: https://download.nvidia.com/XFree86/Linux-x86_64/470.63.01/README/powermanagement.html#PreserveAllVide719f0
2: https://rpmfusion.org/Howto/NVIDIA#Suspend
Comment 16•3 years ago
(In reply to Nikolay Martynov from comment #1)
To add: I remember there was a different bug when image became broken after suspend/resume on nvidia drivers - but that got resolved at some point. And that problem had different visual effects - like whole window with some random colors and that usually fixed itself without restarting firefox. Also new tabs were not affected.
This is the new problem - now it's mainly fonts that get affected and it looks like only restart helps.
Actually, a workaround for this is to enable the GPU process and manually kill it after suspend/resume.
Comment 17•3 years ago
Long term plan is vidmem preservation yes. The extension was never implemented for EGL because there were no potential users of it back then, and I focused on GLX with the intent of delivering something fast. Then forgot about EGL :(
I filed NVIDIA bug 200778113 to track implementation for EGL. We cannot commit to a timeframe yet, and so can't really commit to where it would be backported, but conceptually it is the sort of thing that should be easy to backport.
Comment 18•3 years ago
(In reply to Robert Mader [:rmader] from comment #14)
Jan: can you check if you see the same on native Wayland?
I have not managed to get it running yet. I always got a black screen when gdm3 should show up.
(In reply to Arthur Huillet from comment #17)
Long term plan is vidmem preservation yes.
Debian and Ubuntu seem to include your systemd services in their Nvidia driver packages, but they do not set NVreg_PreserveVideoMemoryAllocations=1. Please tell them to do so or try to enable it by default somehow because it works.
https://download.nvidia.com/XFree86/Linux-x86_64/455.28/README/powermanagement.html#SystemdConfigur74e29
- PopOS already sets NVreg_PreserveVideoMemoryAllocations=1 in /lib/modprobe.d/nvidia-graphics-drivers.conf: https://github.com/pop-os/nvidia-graphics-drivers/commit/de515f69a4816b6e11633faba2d11b1f2a738c55
- https://packages.ubuntu.com/hirsute-updates/amd64/nvidia-kernel-common-470/filelist
  Ubuntu does not set "options nvidia NVreg_PreserveVideoMemoryAllocations=1" in /lib/modprobe.d/nvidia-graphics-drivers.conf.
  I can confirm that manually setting it fixes this bug and removing it reintroduces the bug. :)
- https://packages.debian.org/bullseye/nvidia-driver
  nvidia driver > nvidia-kernel-dkms > nvidia-kernel-support > nvidia-kernel-common
  https://packages.debian.org/bullseye/amd64/nvidia-kernel-common/filelist
  /etc/modprobe.d/nvidia-kernel-common.conf does not contain "options nvidia NVreg_PreserveVideoMemoryAllocations=1".
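The per-distribution check above can be expressed as a small scan over the text of a modprobe configuration file. This is only a sketch: PreservesVideoMemory is a hypothetical helper, it assumes the file contents have already been read into a string, and it looks only for an uncommented "options nvidia" line that sets the option to 1.

```cpp
#include <sstream>
#include <string>

// Hypothetical helper: scans the text of a modprobe configuration file and
// reports whether an uncommented "options nvidia" line sets
// NVreg_PreserveVideoMemoryAllocations=1.
bool PreservesVideoMemory(const std::string& confText) {
  std::istringstream lines(confText);
  std::string line;
  while (std::getline(lines, line)) {
    std::istringstream words(line);
    std::string keyword, module, option;
    if (!(words >> keyword >> module)) {
      continue;  // Blank line or fewer than two words.
    }
    if (keyword != "options" || module != "nvidia") {
      continue;  // Comments ("# ...") and other directives end up here.
    }
    while (words >> option) {
      if (option == "NVreg_PreserveVideoMemoryAllocations=1") {
        return true;
      }
    }
  }
  return false;
}
```

Such a scan only says what a file requests; it does not tell whether the running kernel module was actually loaded with that option.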
Comment 19•3 years ago
I've tried wiring up EGL_NV_robustness_video_memory_purge in the driver, and can confirm that the GL_PURGED_CONTEXT_RESET_NV notification is getting propagated to Firefox after suspend / resume, but unfortunately this doesn't appear to fix the corruption.
With both GLX and EGL, after resuming Firefox outputs this message
[GFX1-]: GFX: RenderThread detected a device reset in PostUpdate
However with GLX, but not with EGL, it also prints this
Unflushed glGetGraphicsResetStatus: 0x92bb
So it seems like something different is happening between the two platforms. Not sure what the problem might be.
Comment 20•3 years ago (Assignee)
(In reply to Erik Kurzinger from comment #19)
...
Thanks for looking into it! There are some differences in our GL context handling between GLX and EGL (we use a global context, not one for every window, bug 1684194), so I'd not be totally surprised if something is not wired up correctly. I suppose it would be easiest to confirm the feature on a simple reproducer demo - or can you give us access to such a driver build?
Alternatively we could make a build with bug 1684194 reverted - the robustness paths should behave like on GLX then.
Comment 21•3 years ago
Reverting https://hg.mozilla.org/mozilla-central/rev/52299c7cbec4 fixes the corruption. So yeah, that seems to be the key difference.
Comment 22•3 years ago
(In reply to Arthur Huillet from comment #17)
(In reply to Erik Kurzinger from comment #21)
Are there reasons that would speak against asking Linux distributions to properly package the Nvidia driver by setting NVreg_PreserveVideoMemoryAllocations=1 like PopOS has already done? (comment 18)
Comment 23•3 years ago (Assignee)
(In reply to Erik Kurzinger from comment #21)
Reverting https://hg.mozilla.org/mozilla-central/rev/52299c7cbec4 fixes the corruption. So yeah, that seems to be the key difference.
Great, I guess that means your implementation works and the fact that it doesn't work with the new EGL paths is likely a bug on our side that we can fix once we have the driver build available? Or are you looking into what's wrong with bug 1684194 yourself?
(In reply to Darkspirit from comment #22)
Are there reasons that would speak against asking Linux distributions to properly package the Nvidia driver by setting NVreg_PreserveVideoMemoryAllocations=1 like PopOS has already done? (comment 18)
I don't think this is a feasible short term solution - IIUC the feature is not declared fully stable yet.
Comment 24•3 years ago
Alright, so EGL_NV_robustness_video_memory_purge will be in the first official 495.xx driver release (as mentioned above, not the upcoming beta) and since it ended up being pretty trivial to enable, we decided it was fine to back-port to the 470 branch, so it will also be present in the next 470.xx release, whenever that is.
Anyway, I think the reason https://hg.mozilla.org/mozilla-central/rev/52299c7cbec4 prevents proper recovery is because it switched to using the "singleton" GL context, which doesn't get re-created after a reset. For example, if I add a call to ClearSingletonGL from RenderThread after destroying the RenderCompositorEGL, things appear to work properly.
diff -r fc5e583b2dd7 gfx/webrender_bindings/RenderThread.cpp
--- a/gfx/webrender_bindings/RenderThread.cpp Wed Sep 29 04:02:02 2021 -0400
+++ b/gfx/webrender_bindings/RenderThread.cpp Wed Sep 29 16:30:47 2021 -0400
@@ -253,6 +253,7 @@
   mRenderers.erase(aWindowId);
   if (mRenderers.empty()) {
+    ClearSingletonGL();
     mHandlingDeviceReset = false;
     mHandlingWebRenderError = false;
   }
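The effect of the one-line change above can be illustrated with a small self-contained model. This is a sketch only: Context, GetSingleton, ClearSingleton and RendererRegistry are hypothetical stand-ins, not the actual Firefox classes. The point is that a cached shared context has to be dropped after a reset so that the next renderer creates a fresh one instead of reusing the lost one.

```cpp
#include <memory>
#include <set>

// Hypothetical stand-in for a GL context that can be invalidated by a reset.
struct Context {
  bool lost = false;
};

// Cached shared ("singleton") context; created lazily on first use.
static std::shared_ptr<Context> gSingleton;

std::shared_ptr<Context> GetSingleton() {
  if (!gSingleton) {
    gSingleton = std::make_shared<Context>();
  }
  return gSingleton;
}

void ClearSingleton() { gSingleton.reset(); }

// Hypothetical registry mirroring the shape of the patch: once the last
// renderer is gone, drop the shared context so a new one is created later.
struct RendererRegistry {
  std::set<int> renderers;
  bool handlingDeviceReset = false;

  void Add(int windowId) { renderers.insert(windowId); }

  void Remove(int windowId) {
    renderers.erase(windowId);
    if (renderers.empty()) {
      ClearSingleton();
      handlingDeviceReset = false;
    }
  }
};
```

Without the ClearSingleton call in Remove, GetSingleton would keep handing out the context that was marked lost, which is the behavior the comment attributes to the singleton path.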
Comment 25•3 years ago (Assignee)
Thanks Erik! We'll try your suggestion / look into it once the driver is available. Strongly looking forward to the EGL/DmaBuf future :)
Comment 26•3 years ago
Set release status flags based on info from the regressing bug 1695933
Comment 27•3 years ago
The severity field is not set for this bug.
:jimm, could you have a look please?
For more information, please visit auto_nag documentation.
Comment 29•3 years ago
I can confirm that this seems to be fixed when running on the latest NVIDIA 495.29.05 Beta. I was able to suspend and resume my PC without the fonts getting corrupted on 94 Beta
Comment 30•3 years ago (Assignee)
(In reply to Tony Stipanic from comment #29)
I can confirm that this seems to be fixed when running on the latest NVIDIA 495.29.05 Beta. I was able to suspend and resume my PC without the fonts getting corrupted on 94 Beta
We backed out the EGL rollout for Nvidia in 94 because even with the driver in place we might be missing something small, see comment 24. So you'd need to test nightly or enable gfx.x11-egl.force-enabled. However, now that we can test the driver we can hopefully ship EGL for 95 (with a check for that extension).
Comment 31•3 years ago
I have checked again if I missed that, but I can confirm that I have tested it with gfx.x11-egl.force-enabled set to true.
However, I took the chance to also look if Nightly 95.0a1 has this bug and there I couldn't recreate the bug anymore either.
Comment 34•3 years ago (Assignee)
EGL_NV_robustness_video_memory_purge is now available in the 470.82 and 495.44 driver releases. Thus 470.82 can be our new baseline for activating EGL by default. We still need to land a patch as outlined in comment 24 and some testing though.
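A baseline like this amounts to a version comparison. A minimal sketch, assuming driver versions are dot-separated decimal strings such as 470.57.02; ParseVersion and MeetsBaseline are hypothetical helpers and do not reflect how Firefox actually gates the feature:

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: splits "470.57.02" into {470, 57, 2}.
// Assumes every component is a non-empty, non-negative decimal number.
std::vector<int> ParseVersion(const std::string& version) {
  std::vector<int> parts;
  std::istringstream stream(version);
  std::string component;
  while (std::getline(stream, component, '.')) {
    parts.push_back(std::stoi(component));
  }
  return parts;
}

// Returns true if `version` is at least `baseline`, comparing component by
// component and treating missing components as zero.
bool MeetsBaseline(const std::string& version, const std::string& baseline) {
  const std::vector<int> v = ParseVersion(version);
  const std::vector<int> b = ParseVersion(baseline);
  const std::size_t n = v.size() > b.size() ? v.size() : b.size();
  for (std::size_t i = 0; i < n; ++i) {
    const int vi = i < v.size() ? v[i] : 0;
    const int bi = i < b.size() ? b[i] : 0;
    if (vi != bi) {
      return vi > bi;
    }
  }
  return true;
}
```

A plain numeric comparison cannot tell driver branches apart (any 495.x build compares as newer than 470.82), so it is at best a coarse filter; checking for the extension itself, as mentioned in comment 30, avoids that problem.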
Comment 35•3 years ago
Gnome X11, Ubuntu 21.10, GTX 1060, 495 stable
Without the patch from comment 24:
Resume from suspend causes fallback to SW WR (LOCAL_EGL_BAD_ALLOC):
GFX: RenderThread detected a device reset in PostUpdate
Failed to create EGLSurface!: 0x3003
Failed to create EGLSurface
Fallback WR to SW-WR
At one time, Firefox was transparent and there were many "Error in eglSetDamageRegion: 0x3001" (EGL_NOT_INITIALIZED) in the terminal.
With the patch from comment 24:
RenderThread detected a device reset in PostUpdate
only once in terminal and no fallback.
Meta.add_clutter_debug_flags(0, Clutter.DrawDebugFlag.PAINT_DAMAGE_REGION, 0)
shows that partial present lags behind some visual changes/frames.
Some tiles in the vertical middle are not updated correctly. It can be best seen when hovering lines on about:config.
gfx.webrender.allow-partial-present-buffer-age=false doesn't help, only gfx.webrender.max-partial-present-rects=0 helps.
Most often the window seemed fine after resume.
Sometimes, the window can still be black after resume. Hovering the tab bar seemed to fix it, the content area seemed frozen (cursor didn't change). Sometimes it fixed itself after being black.
Comment 36•3 years ago (Assignee)
Andrew, as you implemented robustness for EGL in bug 1680759, can I ask you for some help here? EGL_NV_robustness_video_memory_purge apparently needs some extra invalidation after bug 1684194. Comment 24 makes a suggestion, however Darkspirit found some issues regarding partial damage after resume, even with that change applied. Apparently we need to force a full repaint directly after that extension triggered a reset - any ideas how to do that or should that maybe already be the case?
Sotaro, NIing you as well, assuming you might also have some knowledge about this area.
Comment 37•3 years ago
I appear to be affected by this issue, as of installing Firefox 95b2 (I appear to have skipped b1). Previously, on 94 and earlier, I had experienced repaint issues when resuming from suspend, but now all graphical elements are corrupted - including the browser chrome itself. Only restarting the browser fixes the issue.
https://gareth.halfacree.co.uk/pubimages/firefox95b2-corruption.png
I had previously mentioned the same issue in 95 Nightly while chasing down a different bug, but can now confirm it on the beta channel too.
I note reference to Nvidia driver versions 470.82 and 495.44 above: I am currently on 470.57.02. Should the issue be resolved if I'm on the newer driver?
System: Firefox 95b2, Ubuntu 20.04.3 64-bit, Ryzen 2700X, Nvidia RTX 2080 470.57.02.
Comment 38•3 years ago (Assignee)
Should the issue be resolved if I'm on the newer driver?
No, not until this issue has been closed. However, with the 470.82 and 495.44 driver releases we now have everything we need to actually do that. We'll then also bump the minimal driver version required to enable EGL in nightly (and likely also beta/release) by default. Until then you might want to disable EGL in nightly by setting gfx.x11-egl.force-disabled in about:config.
Comment 39•3 years ago
With 495.44 driver and gnome 41, I am experiencing a similar corruption for the entire gnome session, including lock screen. This is with NVreg_PreserveVideoMemoryAllocations=1. I am posting this here in case it is a related data point. If this is unrelated, please feel free to ignore/delete.
Comment 41•3 years ago
(In reply to Robert Mader [:rmader] (back on ~23. Nov) from comment #36)
Andrew, as you implemented robustness for EGL in bug 1680759, can I ask you for some help here?
EGL_NV_robustness_video_memory_purge apparently needs some extra invalidation after bug 1684194. Comment 24 makes a suggestion, however Darkspirit found some issues regarding partial damage after resume, even with that change applied. Apparently we need to force a full repaint directly after that extension triggered a reset - any ideas how to do that or should that maybe already be the case?
Sorry for slow response. Can we use mCompositor->RequestFullRender() for full rendering?
RenderCompositorEGL::RequestFullRender() does not handle it yet.
And in the Windows case, a device reset triggers re-creation of all WebRenders/WebRenderBridgeParents/WebRenderBridgeChilds, see GPUProcessManager::OnRemoteProcessDeviceReset().
Comment 42•3 years ago (Assignee)
With bug 1740675 landed, things should now work as expected in latest nightly. Can anyone with an affected setup confirm?
Comment 43•3 years ago (Assignee)
Confirmed fixed by bug 1740675, however there's a related issue around partial present after resume. We can investigate that in a follow up bug though.
Comment 44•3 years ago
@rmader: I cannot confirm the fix here. I've installed Nightly 20211125, and am still using Nvidia 470.57.02 drivers: to my understanding, that should result in EGL being disabled and resuming from suspend working as in 94 and prior.
However, I'm seeing the same behaviour: resuming from suspend results in a completely broken window. Should 20211125 definitely be working at stock settings on 470.57.02 drivers?
Comment 45•3 years ago
https://bugzilla.mozilla.org/show_bug.cgi?id=1742862 was only merged 14 hours ago, could it be that it is not yet in 20211125 nightly you tested with? The summary is here:
https://hg.mozilla.org/mozilla-central/rev/e5f33ef244ff
Comment 46•3 years ago (Assignee)
(In reply to gareth from comment #44)
@rmader: I cannot confirm the fix here. I've installed Nightly 20211125, and am still using Nvidia 470.57.02 drivers: to my understanding, that should result in EGL being disabled and resuming from suspend working as in 94 and prior.
Could you briefly retest and confirm that by now nightly no longer enables EGL on that driver version?
Comment 47•3 years ago
Now on Nightly 20211127, and it appears to be fixed (or, rather, worked-around): suspending and resuming with the faulty 470.57.02 Nvidia driver bundle no longer corrupts the Firefox window, and everything restores perfectly well - all at default settings, fresh profile, no changes.
At some point I'll get around to upgrading to the latest Nvidia driver bundle!
Thanks for the fix!