Closed Bug 1731172 Opened 3 years ago Closed 3 years ago

MOZ_X11_EGL/Nvidia: Broken fonts and images after suspend/resume EGL

Categories

(Core :: Graphics, defect)

Firefox 94
x86_64
Linux
defect

Tracking

()

RESOLVED FIXED
Tracking Status
firefox-esr78 --- unaffected
firefox-esr91 --- disabled
firefox92 --- disabled
firefox93 --- disabled
firefox94 --- disabled
firefox95 --- disabled
firefox96 --- verified

People

(Reporter: mar.kolya, Assigned: rmader)

References

(Blocks 2 open bugs, Regression)

Details

(Keywords: correctness, regression)

Attachments

(5 files, 1 obsolete file)

Attached image shot.png

User Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:94.0) Gecko/20100101 Firefox/94.0

Steps to reproduce:

Put my laptop through a suspend/resume cycle.

Actual results:

Fonts became garbled, icons not visible - see screenshot. Even new tabs display garbled fonts.

This seems to be specifically caused by the recent Nightly switch to EGL. I tried turning EGL on in earlier versions of Nightly and saw the same results.

I use nvidia proprietary drivers.

Expected results:

Suspend/resume should not break pages.

To add: I remember there was a different bug where the image became broken after suspend/resume on nvidia drivers, but that got resolved at some point. That problem had different visual effects, like the whole window showing random colors, and it usually fixed itself without restarting firefox. Also, new tabs were not affected.
This is a new problem: now it's mainly fonts that get affected, and it looks like only a restart helps.

The Bugbug bot thinks this bug should belong to the 'Core::Layout: Text and Fonts' component, and is moving the bug to that component. Please revert this change in case you think the bot is wrong.

Component: Untriaged → Layout: Text and Fonts
Product: Firefox → Core

It looks like not just fonts but also the icons in the tabs and toolbar are broken, which suggests to me a more general Graphics issue rather than anything specifically text-related.

Component: Layout: Text and Fonts → Graphics

Thanks for the report!

OS: Unspecified → Linux
Regressed by: 1695933
Hardware: Unspecified → x86_64
See Also: → 1682876, 1500520, 1484782
Summary: Broken fonts and images after suspend/resume EGL → MOZ_X11_EGL/Nvidia: Broken fonts and images after suspend/resume EGL
Has Regression Range: --- → yes

Ubuntu 21.04, Gnome X11, GTX 1060, Nvidia driver 470

Attached video 2021-09-17 14-26-25.mp4
Attachment #9241782 - Attachment is obsolete: true

https://searchfox.org/mozilla-central/rev/45e308665eb9fc52fd21e2d4b3b959f3cec13ef1/gfx/gl/GLContextProviderGLX.cpp#375

  if (glx.HasVideoMemoryPurge()) {
    attribs.insert(attribs.end(),
                   {
                       LOCAL_GLX_GENERATE_RESET_ON_VIDEO_MEMORY_PURGE_NV,

bug 1680759 added it for EGL:
https://searchfox.org/mozilla-central/rev/45e308665eb9fc52fd21e2d4b3b959f3cec13ef1/gfx/gl/GLContextProviderEGL.cpp#683

  if (flags & CreateContextFlags::PREFER_ROBUSTNESS) {
    std::vector<EGLint> base_robustness_attribs = required_attribs;
    if (egl->IsExtensionSupported(
            EGLExtension::NV_robustness_video_memory_purge)) {
      base_robustness_attribs.push_back(
          LOCAL_EGL_GENERATE_RESET_ON_VIDEO_MEMORY_PURGE_NV);

https://www.khronos.org/registry/OpenGL/extensions/NV/NV_robustness_video_memory_purge.txt

GL_NV_robustness_video_memory_purge
GLX_NV_robustness_video_memory_purge
EGL_NV_robustness_video_memory_purge

Nvidia X Server Settings > X Screen 0 > Graphics Information:

  • GLX_NV_robustness_video_memory_purge is listed on GLX, Server GLX, Client GLX tabs.
  • GL_NV_robustness_video_memory_purge is listed on OpenGL tab. It's listed on about:support.
  • But EGL_NV_robustness_video_memory_purge is missing on the EGL tab. It's missing on about:support.
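For anyone who wants to check this outside of about:support: a minimal sketch of the extension test, assuming you already have the space-separated extension list (for example the string returned by eglQueryString(display, EGL_EXTENSIONS)). HasExtension is a hypothetical helper written for this comment, not Gecko code. It compares whole tokens, so the GLX_ and GL_ variants of the name do not count as a match.

```cpp
#include <cassert>
#include <sstream>
#include <string>

// Hypothetical helper (not Gecko code): returns true if `name` appears as a
// whole token in a space-separated extension list.
bool HasExtension(const std::string& extensionList, const std::string& name) {
  std::istringstream tokens(extensionList);
  std::string token;
  while (tokens >> token) {
    // Exact comparison, so a longer name sharing this prefix does not match.
    if (token == name) {
      return true;
    }
  }
  return false;
}
```

A plain substring search would also accept names that merely contain the extension name, which is why the sketch tokenizes first.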
Depends on: 1680759

(In reply to Darkspirit from comment #8)

  • But EGL_NV_robustness_video_memory_purge is missing on the EGL tab. It's missing on about:support.

Jan: hm, that is odd - it would be a weird oversight by the NV driver, given that's more or less their private extension. Can you quickly confirm that it's also not listed in e.g. eglinfo (as opposed to glxinfo)?

Erik, can I ask you for some insight here? Is it expected that EGL_NV_robustness_video_memory_purge is not exposed?

Flags: needinfo?(ekurzinger)
Status: UNCONFIRMED → NEW
Ever confirmed: true

Something very similar can apparently be observed when running Gnome-Shell/Wayland: https://gitlab.gnome.org/GNOME/mutter/-/issues/1942

To my surprise, it would seem that the reason EGL_NV_robustness_video_memory_purge is not exposed is because it really isn't supported by our EGL driver. I imagine it was included in the extension spec because we intended to implement it at some point, but nobody ever got around to actually doing so. Presumably there was never a demand for it until now.

It looks like it would be pretty easy to get it working, considering that we already have all the infrastructure in place for the GLX version. I think we'd just need to have eglCreateContext process the EGL_GENERATE_RESET_ON_VIDEO_MEMORY_PURGE_NV attribute. Unfortunately it's too late to get it into the upcoming 495 beta driver release, but possibly the following release.

Flags: needinfo?(ekurzinger)

Thanks Erik! I suppose it will be needed as long as NVreg_PreserveVideoMemoryAllocations is not enabled by default?

Jan: can you check if you see the same on native Wayland? I'd like to revert https://phabricator.services.mozilla.com/D117434 so we have HW-WR on native Wayland, but if this bug also happens there, we need to postpone that I suppose.

Flags: needinfo?(jan)

Erik: if I understand correctly, the long term plan to fix issues like this is the NVreg_PreserveVideoMemoryAllocations=1 road, however that's still work in progress[1][2]. This potentially explains why EGL_GENERATE_RESET_ON_VIDEO_MEMORY_PURGE_NV was never implemented.

If you can wire up EGL_GENERATE_RESET_ON_VIDEO_MEMORY_PURGE_NV, do you think it could also get backported to the 470 LTS driver series? Or would it only be possible for >=495?
I'm asking because this bug is likely the main blocker for shipping EGL on X11, which again helps with a bunch of other bugs (e.g. bug 1716049) - so not having to wait for all systems to have either NVreg_PreserveVideoMemoryAllocations properly set up or having been upgraded to the 495 series would be a great help :)

1: https://download.nvidia.com/XFree86/Linux-x86_64/470.63.01/README/powermanagement.html#PreserveAllVide719f0
2: https://rpmfusion.org/Howto/NVIDIA#Suspend

Flags: needinfo?(ekurzinger)

(In reply to Nikolay Martynov from comment #1)

To add: I remember there was a different bug where the image became broken after suspend/resume on nvidia drivers, but that got resolved at some point. That problem had different visual effects, like the whole window showing random colors, and it usually fixed itself without restarting firefox. Also, new tabs were not affected.
This is a new problem: now it's mainly fonts that get affected, and it looks like only a restart helps.

Actually, a workaround for this is to enable the GPU process and manually kill it after suspend/resume.

See Also: → 1732002

Long term plan is vidmem preservation yes. The extension was never implemented for EGL because there were no potential users of it back then, and I focused on GLX with the intent of delivering something fast. Then forgot about EGL :(
I filed NVIDIA bug 200778113 to track implementation for EGL. We cannot commit to a timeframe yet, and so can't really commit to where it would be backported, but conceptually it is the sort of thing that should be easy to backport.

(In reply to Robert Mader [:rmader] from comment #14)

Jan: can you check if you see the same on native Wayland?

I have not managed to get it running yet. I always got a black screen when gdm3 should show up.

(In reply to Arthur Huillet from comment #17)

Long term plan is vidmem preservation yes.

Debian and Ubuntu seem to include your systemd services in their Nvidia driver packages, but they do not set NVreg_PreserveVideoMemoryAllocations=1. Please tell them to do so or try to enable it by default somehow because it works.
https://download.nvidia.com/XFree86/Linux-x86_64/455.28/README/powermanagement.html#SystemdConfigur74e29

Flags: needinfo?(jan)

I've tried wiring up EGL_NV_robustness_video_memory_purge in the driver, and can confirm that the GL_PURGED_CONTEXT_RESET_NV notification is getting propagated to Firefox after suspend / resume, but unfortunately this doesn't appear to fix the corruption.

With both GLX and EGL, after resuming Firefox outputs this message
[GFX1-]: GFX: RenderThread detected a device reset in PostUpdate

However with GLX, but not with EGL, it also prints this
Unflushed glGetGraphicsResetStatus: 0x92bb

So it seems like something different is happening between the two platforms. Not sure what the problem might be.
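For readers decoding the log line above: as far as I know, 0x92bb is the value of GL_PURGED_CONTEXT_RESET_NV in the Khronos headers, i.e. the purge notification mentioned at the top of this comment. Below is a small sketch that maps reset status values to names. The numeric values are the ones I believe the GL headers define. DescribeResetStatus and the k-prefixed constants are hypothetical names, not anything in Gecko or the driver.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Values believed to match the Khronos GL headers (assumption; please verify
// against glext.h before relying on them).
constexpr std::uint32_t kNoError = 0x0000;               // GL_NO_ERROR
constexpr std::uint32_t kGuiltyContextReset = 0x8253;    // GL_GUILTY_CONTEXT_RESET
constexpr std::uint32_t kInnocentContextReset = 0x8254;  // GL_INNOCENT_CONTEXT_RESET
constexpr std::uint32_t kUnknownContextReset = 0x8255;   // GL_UNKNOWN_CONTEXT_RESET
constexpr std::uint32_t kPurgedContextResetNV = 0x92BB;  // GL_PURGED_CONTEXT_RESET_NV

// Hypothetical helper: turns a glGetGraphicsResetStatus() result into a
// readable name for logging.
std::string DescribeResetStatus(std::uint32_t status) {
  switch (status) {
    case kNoError:
      return "GL_NO_ERROR";
    case kGuiltyContextReset:
      return "GL_GUILTY_CONTEXT_RESET";
    case kInnocentContextReset:
      return "GL_INNOCENT_CONTEXT_RESET";
    case kUnknownContextReset:
      return "GL_UNKNOWN_CONTEXT_RESET";
    case kPurgedContextResetNV:
      return "GL_PURGED_CONTEXT_RESET_NV";
    default:
      return "unknown status";
  }
}
```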

Flags: needinfo?(ekurzinger)

(In reply to Erik Kurzinger from comment #19)

...

Thanks for looking into it! There are some differences in our GL context handling between GLX and EGL (we use a global context, not one for every window, bug 1684194), so I'd not be totally surprised if something is not wired up correctly. I suppose it would be easiest to confirm the feature on a simple reproducer demo - or can you give us access to such a driver build?

Alternatively we could make a build with bug 1684194 reverted - the robustness paths should behave like on GLX then.

Flags: needinfo?(ekurzinger)

Reverting https://hg.mozilla.org/mozilla-central/rev/52299c7cbec4 fixes the corruption. So yeah, that seems to be the key difference.

Flags: needinfo?(ekurzinger)

(In reply to Arthur Huillet from comment #17)
(In reply to Erik Kurzinger from comment #21)
Are there reasons that would speak against asking Linux distributions to properly package the Nvidia driver by setting NVreg_PreserveVideoMemoryAllocations=1 like PopOS has already done? (comment 18)

(In reply to Erik Kurzinger from comment #21)

Reverting https://hg.mozilla.org/mozilla-central/rev/52299c7cbec4 fixes the corruption. So yeah, that seems to be the key difference.

Great, I guess that means your implementation works and the fact that it doesn't work with the new EGL paths is likely a bug on our side that we can fix once we have the driver build available? Or are you looking into what's wrong with bug 1684194 yourself?

(In reply to Darkspirit from comment #22)

Are there reasons that would speak against asking Linux distributions to properly package the Nvidia driver by setting NVreg_PreserveVideoMemoryAllocations=1 like PopOS has already done? (comment 18)

I don't think this is a feasible short-term solution - IIUC the feature is not declared fully stable yet.

Alright, so EGL_NV_robustness_video_memory_purge will be in the first official 495.xx driver release (as mentioned above, not the upcoming beta) and since it ended up being pretty trivial to enable, we decided it was fine to back-port to the 470 branch, so it will also be present in the next 470.xx release, whenever that is.

Anyway, I think the reason https://hg.mozilla.org/mozilla-central/rev/52299c7cbec4 prevents proper recovery is because it switched to using the "singleton" GL context, which doesn't get re-created after a reset. For example, if I add a call to ClearSingletonGL from RenderThread after destroying the RenderCompositorEGL, things appear to work properly.

diff -r fc5e583b2dd7 gfx/webrender_bindings/RenderThread.cpp
--- a/gfx/webrender_bindings/RenderThread.cpp   Wed Sep 29 04:02:02 2021 -0400
+++ b/gfx/webrender_bindings/RenderThread.cpp   Wed Sep 29 16:30:47 2021 -0400
@@ -253,6 +253,7 @@
   mRenderers.erase(aWindowId);
 
   if (mRenderers.empty()) {
+    ClearSingletonGL();
     mHandlingDeviceReset = false;
     mHandlingWebRenderError = false;
   }
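To illustrate the idea behind the patch above, here is a toy model of a cached "singleton" context, assuming the behaviour described in this comment: once the shared context is lost, it keeps being handed out until the cache is explicitly cleared. FakeContext and SingletonContextCache are hypothetical stand-ins written for illustration and do not mirror the real Gecko classes.

```cpp
#include <cassert>
#include <memory>

// Hypothetical stand-in for a GL context. `generation` only exists so the
// example can tell two contexts apart.
struct FakeContext {
  int generation;
  bool lost;
};

// Toy model of a process-wide context cache.
class SingletonContextCache {
 public:
  // Returns the shared context, creating it on first use.
  std::shared_ptr<FakeContext> Get() {
    if (!mContext) {
      mContext = std::make_shared<FakeContext>(FakeContext{++mCreated, false});
    }
    return mContext;
  }

  // Drops the cached context so that the next Get() creates a fresh one.
  void Clear() { mContext.reset(); }

 private:
  std::shared_ptr<FakeContext> mContext;
  int mCreated = 0;
};
```

In this model, destroying and re-creating the users of the context is not enough on its own: every new user still receives the lost context from Get() until Clear() is called, which is roughly what the one-line change above appears to address.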

Thanks Erik! We'll try your suggestion / look into it once the driver is available. Strongly looking forward to the EGL/DmaBuf future :)

Set release status flags based on info from the regressing bug 1695933

The severity field is not set for this bug.
:jimm, could you have a look please?

For more information, please visit auto_nag documentation.

Flags: needinfo?(jmathies)
Severity: -- → S3
Flags: needinfo?(jmathies)

I can confirm that this seems to be fixed when running on the latest NVIDIA 495.29.05 Beta. I was able to suspend and resume my PC without the fonts getting corrupted on 94 Beta

(In reply to Tony Stipanic from comment #29)

I can confirm that this seems to be fixed when running on the latest NVIDIA 495.29.05 Beta. I was able to suspend and resume my PC without the fonts getting corrupted on 94 Beta

We backed out the EGL rollout for Nvidia in 94 because even with the driver in place we might be missing something small, see comment 24. So you'd need to test nightly or enable gfx.x11-egl.force-enabled. However, now that we can test the driver we can hopefully ship EGL for 95 (with a check for that extension).

I have checked again if I missed that, but I can confirm that I have tested it with gfx.x11-egl.force-enabled set to true.
However, I took the chance to also look if Nightly 95.0a1 has this bug and there I couldn't recreate the bug anymore either.

Blocks: 1737428

EGL_NV_robustness_video_memory_purge is now available in the 470.82 and 495.44 driver releases. Thus 470.82 can be our new baseline for activating EGL by default. We still need to land a patch as outlined in comment 24 and some testing though.
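A rough sketch of what a "470.82 or newer" baseline check could look like, assuming driver versions are plain dotted numbers such as "470.57.02". ParseVersion and MeetsBaseline are hypothetical helpers for illustration, not the actual Gecko driver check.

```cpp
#include <algorithm>
#include <cassert>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: splits a dotted version such as "470.57.02" into
// {470, 57, 2}. Assumes every component is numeric; std::stoi throws
// otherwise.
std::vector<int> ParseVersion(const std::string& version) {
  std::vector<int> parts;
  std::istringstream stream(version);
  std::string part;
  while (std::getline(stream, part, '.')) {
    parts.push_back(part.empty() ? 0 : std::stoi(part));
  }
  return parts;
}

// Returns true if `version` is at least `baseline`, comparing component by
// component and treating missing components as 0.
bool MeetsBaseline(const std::string& version, const std::string& baseline) {
  std::vector<int> a = ParseVersion(version);
  std::vector<int> b = ParseVersion(baseline);
  const std::size_t length = std::max(a.size(), b.size());
  a.resize(length, 0);
  b.resize(length, 0);
  return a >= b;  // std::vector compares lexicographically
}
```

Note that a pure version comparison would also accept a 495 beta build, even though the thread above says the extension only arrives with the first official 495.xx release. Checking for the extension itself, as also mentioned in this thread, is probably the more robust signal.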

Gnome X11, Ubuntu 21.10, GTX 1060, 495 stable

Without the patch from comment 24:
Resume from suspend causes fallback to SW WR (LOCAL_EGL_BAD_ALLOC):

GFX: RenderThread detected a device reset in PostUpdate
Failed to create EGLSurface!: 0x3003
Failed to create EGLSurface
Fallback WR to SW-WR

At one time, Firefox was transparent and there were many "Error in eglSetDamageRegion: 0x3001" (EGL_NOT_INITIALIZED) in the terminal.


With the patch from comment 24:
RenderThread detected a device reset in PostUpdate only once in terminal and no fallback.

Meta.add_clutter_debug_flags(0, Clutter.DrawDebugFlag.PAINT_DAMAGE_REGION, 0) shows that partial present lags behind some visual changes/frames.
Some tiles in the vertical middle are not updated correctly. It can be best seen when hovering lines on about:config.
gfx.webrender.allow-partial-present-buffer-age=false doesn't help, only gfx.webrender.max-partial-present-rects=0 helps.

Most often the window seemed fine after resume.
Sometimes, the window can still be black after resume. Hovering the tab bar seemed to fix it, the content area seemed frozen (cursor didn't change). Sometimes it fixed itself after being black.

See Also: → 1737078

Andrew, as you implemented robustness for EGL in bug 1680759, can I ask you for some help here? EGL_NV_robustness_video_memory_purge apparently needs some extra invalidation after bug 1684194. Comment 24 makes a suggestion, however Darkspirit found some issues regarding partial damage after resume, even with that change applied. Apparently we need to force a full repaint directly after that extension triggered a reset - any ideas how to do that or should that maybe already be the case?

Sotaro, NIing you as well, assuming you might also have some knowledge about this area.

Flags: needinfo?(sotaro.ikeda.g)
Flags: needinfo?(aosmond)

I appear to be affected by this issue, as of installing Firefox 95b2 (I appear to have skipped b1). Previously, on 94 and earlier, I had experienced repaint issues when resuming from suspend, but now all graphical elements are corrupted - including the browser chrome itself. Only restarting the browser fixes the issue.

https://gareth.halfacree.co.uk/pubimages/firefox95b2-corruption.png

I had previously mentioned the same issue in 95 Nightly while chasing down a different bug, but can now confirm it on the beta channel too.

I note reference to Nvidia driver versions 470.82 and 495.44 above: I am currently on 470.57.02. Should the issue be resolved if I'm on the newer driver?

System: Firefox 95b2, Ubuntu 20.04.3 64-bit, Ryzen 2700X, Nvidia RTX 2080 470.57.02.

Should the issue be resolved if I'm on the newer driver?

No, not until this issue has been closed. However, with the 470.82 and 495.44 driver releases we now have everything we need to actually do that. We'll then also bump the minimal driver version required to enable EGL in nightly (and likely also beta/release) by default. Until then you might want to disable EGL in nightly by setting gfx.x11-egl.force-disabled in about:config.

With 495.44 driver and gnome 41, I am experiencing a similar corruption for the entire gnome session, including lock screen. This is with NVreg_PreserveVideoMemoryAllocations=1. I am posting this here in case it is a related data point. If this is unrelated, please feel free to ignore/delete.

(In reply to Robert Mader [:rmader] (back on ~23. Nov) from comment #36)

Andrew, as you implemented robustness for EGL in bug 1680759, can I ask you for some help here? EGL_NV_robustness_video_memory_purge apparently needs some extra invalidation after bug 1684194. Comment 24 makes a suggestion, however Darkspirit found some issues regarding partial damage after resume, even with that change applied. Apparently we need to force a full repaint directly after that extension triggered a reset - any ideas how to do that or should that maybe already be the case?

Sorry for slow response. Can we use mCompositor->RequestFullRender() for full rendering?

RenderCompositorEGL::RequestFullRender() does not handle it yet.

In the Windows case, a device reset triggers re-creation of all WebRenders/WebRenderBridgeParents/WebRenderBridgeChilds:
GPUProcessManager::OnRemoteProcessDeviceReset()
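To make the "full repaint after reset" idea concrete, here is a toy sketch, assuming the compositor asks a tracker for the region to repaint each frame. Rect and DamageTracker are hypothetical types invented for this comment. This is not how RenderCompositorEGL is implemented, only the shape of the behaviour being asked for.

```cpp
#include <cassert>

// Hypothetical rectangle type in window coordinates.
struct Rect {
  int x;
  int y;
  int width;
  int height;
};

// Toy damage tracker: normally passes the requested partial damage through,
// but forces exactly one full-window repaint after a device reset.
class DamageTracker {
 public:
  DamageTracker(int width, int height) : mWidth(width), mHeight(height) {}

  // Called when a device reset (e.g. a video memory purge) was detected.
  void OnDeviceReset() { mForceFull = true; }

  // Returns the region to repaint for the next frame.
  Rect NextFrameDamage(const Rect& requested) {
    if (mForceFull) {
      mForceFull = false;  // only the first frame after the reset is full
      return Rect{0, 0, mWidth, mHeight};
    }
    return requested;
  }

 private:
  int mWidth;
  int mHeight;
  bool mForceFull = false;
};
```

With buffer-age based partial present, a single full frame may not be sufficient because older back buffers can still hold purged content, so a real fix would likely have to invalidate more than one frame.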

Flags: needinfo?(sotaro.ikeda.g)
Depends on: 1740675
Depends on: 1742862

With bug 1740675 landed, things should now work as expected in latest nightly. Can anyone with an affected setup confirm?

Confirmed fixed by bug 1740675, however there's a related issue around partial present after resume. We can investigate that in a follow up bug though.

Status: NEW → RESOLVED
Closed: 3 years ago
Flags: needinfo?(aosmond)
Resolution: --- → FIXED
See Also: → 1743051

@rmader: I cannot confirm the fix here. I've installed Nightly 20211125, and am still using Nvidia 470.57.02 drivers: to my understanding, that should result in EGL being disabled and resuming from suspend working as in 94 and prior.

However, I'm seeing the same behaviour: resuming from suspend results in a completely broken window. Should 20211125 definitely be working at stock settings on 470.57.02 drivers?

https://bugzilla.mozilla.org/show_bug.cgi?id=1742862 was only merged 14 hours ago, could it be that it is not yet in 20211125 nightly you tested with? The summary is here:
https://hg.mozilla.org/mozilla-central/rev/e5f33ef244ff

Assignee: nobody → robert.mader

(In reply to gareth from comment #44)

@rmader: I cannot confirm the fix here. I've installed Nightly 20211125, and am still using Nvidia 470.57.02 drivers: to my understanding, that should result in EGL being disabled and resuming from suspend working as in 94 and prior.

Could you quickly retest and confirm? By now, Nightly should no longer enable EGL on that driver version.

Now on Nightly 20211127, and it appears to be fixed (or, rather, worked-around): suspending and resuming with the faulty 470.57.02 Nvidia driver bundle no longer corrupts the Firefox window, and everything restores perfectly well - all at default settings, fresh profile, no changes.

At some point I'll get around to upgrading to the latest Nvidia driver bundle!

Thanks for the fix!
