Closed Bug 1780972 Opened 2 years ago Closed 2 years ago

HW-WR/XFCE/Nvidia: WebRender occasionally goes crazy after suspend&resume

Categories

(Core :: Graphics: WebRender, defect)

Firefox 102
x86_64
Linux
defect

RESOLVED DUPLICATE of bug 1781167
Tracking Status
firefox-esr102 --- fixed
firefox105 --- wontfix
firefox106 --- wontfix
firefox107 --- wontfix
firefox108 --- fixed
firefox109 --- fixed

People

(Reporter: aros, Unassigned)

References

(Blocks 1 open bug)

Details

(Keywords: regression)

Attachments

(5 files, 1 obsolete file)

Attached video record.mkv

Steps to reproduce:

Firefox 102 has a weird regression. Sometimes the browser starts misbehaving:

  • You cannot click anything on the current page
  • You cannot scroll any page
  • You cannot open/close/switch between tabs
  • The browser keeps running, but whatever was last drawn on the screen never gets updated
  • You can at least resize the browser window but that's it

Only restarting the browser fixes the issue.

Attached file support.json (obsolete) —

about:support

Attached file support.json

about:support

Attachment #9286793 - Attachment is obsolete: true

The Bugbug bot thinks this bug should belong to the 'Core::Graphics: WebRender' component, and is moving the bug to that component. Please correct in case you think the bot is wrong.

Component: Untriaged → Graphics: WebRender
Product: Firefox → Core

This happens randomly and I've not found a surefire way to trigger the issue.

Have you heard of anything like this before Robert? Or aware of any changes in 102 which may have caused it?

Blocks: wr-nv-linux
Severity: -- → S3
Flags: needinfo?(robert.mader)
Attached image Firefox 103 UI frozen

This is reproducible in Firefox 103.0.1 as well.

Do you have multiple Firefox windows open when this bug occurs or only one?
Is this bug caused by suspend&resume or by plugging in another monitor?
Can this bug be prevented by setting gfx.webrender.software to true on about:config and restarting Firefox? (software rendering)

Do you have multiple Firefox windows open when this bug occurs or only one?

I have two Firefox applications (profiles) running on two different virtual screens. Each Firefox application only has tabs in them, i.e. they run in single windows.

Is this bug caused by suspend&resume or by plugging in another monitor?

As far as I've noticed, it only happens after resuming. I have only one monitor.

Can this bug be prevented by setting gfx.webrender.software to true on about:config and restarting Firefox? (software rendering)

I've set it to true, let's see what the next suspend/resume cycle will bring.
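For anyone who prefers to script this test instead of using about:config, here is a minimal sketch. It assumes the profile directory is known and relies on Firefox reading `user.js` from that directory at startup. The helper names `format_pref` and `append_pref` are hypothetical, not part of any Firefox tooling.

```python
from pathlib import Path


def format_pref(name, value):
    """Render one user.js line, e.g. user_pref("some.pref", true);"""
    if isinstance(value, bool):
        rendered = "true" if value else "false"
    elif isinstance(value, int):
        rendered = str(value)
    else:
        # Strings are quoted; backslashes and double quotes are escaped.
        escaped = str(value).replace("\\", "\\\\").replace('"', '\\"')
        rendered = f'"{escaped}"'
    return f'user_pref("{name}", {rendered});'


def append_pref(profile_dir, name, value):
    """Append a pref line to user.js in profile_dir (the path is an assumption)."""
    user_js = Path(profile_dir) / "user.js"
    with user_js.open("a", encoding="utf-8") as fh:
        fh.write(format_pref(name, value) + "\n")
    return user_js
```

Calling `append_pref(profile_dir, "gfx.webrender.software", True)` and restarting Firefox should have the same effect as flipping the pref by hand. The pref is reapplied on every start until the line is removed from `user.js`.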

Flags: needinfo?(aros)
See Also: → 1777664

(In reply to Jamie Nicol [:jnicol] from comment #5)

Have you heard of anything like this before Robert? Or aware of any changes in 102 which may have caused it?

No, not off hand.

(In reply to Artem S. Tashkinov from comment #8)

Is this bug caused by suspend&resume or by plugging in another monitor?

As far as I've noticed, it only happens after resuming. I have only one monitor.

That would be a hint that something regarding the Nvidia specific robustness extension broke. Not totally unexpected - we also disable partial updates on Nvidia because they break after resume.

@Artem: are you using a system compositor? Or is this "plain" X11?

Flags: needinfo?(robert.mader)

Considering I seem to be the only person who's affected/who cares, let's just close it.

Restarting the browser is no big deal anyways.

Status: UNCONFIRMED → RESOLVED
Closed: 2 years ago
Resolution: --- → INVALID

(In reply to Artem S. Tashkinov from comment #10)

Considering I seem to be the only person who's affected/who cares, let's just close it.

Restarting the browser is no big deal anyways.

Does this bug still occur?

Hello. I and several other users are also seeing this bug. I found this report from this thread:
https://forums.linuxmint.com/viewtopic.php?f=47&t=376770

I also haven't found any way to reproduce the bug, unfortunately. And it doesn't crash FF, so I don't have a crash report either. I'll put some system info below, and let me know if there is a way I can collect more info for you. In the thread mentioned above, people were speculating it happened more often when there was very little free memory. That sounds plausible, as I also run a relatively low-memory system, but I don't have any proof.

>  uname -a
Linux nodename 5.15.0-46-generic #49~20.04.1-Ubuntu SMP Thu Aug 4 19:15:44 UTC 2022 x86_64 x86_64 x86_64 GNU/Linux

> lsb_release -a
No LSB modules are available.
Distributor ID:	Ubuntu
Description:	Ubuntu 20.04.4 LTS
Release:	20.04
Codename:	focal

> firefox -v
Mozilla Firefox 104.0

(In reply to Darkspirit from comment #11)

Does this bug still occur?

Haven't seen it in a while. Maybe the latest NVIDIA drivers or Firefox release have fixed it.

Flags: needinfo?(aros)

(In reply to Artem S. Tashkinov from comment #13)
Thanks for testing!

(In reply to leif.poorman from comment #12)

Hello. I and several other users are also seeing this bug. I found this report from this thread:
https://forums.linuxmint.com/viewtopic.php?f=47&t=376770

Thanks for this link!
Please test bug 1781167 comment 12. Ideally file an own bug, open about:support, click on "Copy text to clipboard" and paste it into the "Add an attachment" field. It's easier to close a bug as duplicate than having comments about possibly different bugs.

Resolution: INVALID → WORKSFORME

I now have a Firefox instance (again after resume) in this semi-broken state.

Is there any way to collect any information from it?

I can sort of work with it, but to see anything I have to minimize/restore (or resize) the window to make it repaint itself.

What's funny is that minimize/restore sometimes restores it to a state where the top half of the client window is completely white (I remember there was a bug about NVIDIA which manifested in exactly that way). Doing minimize/restore twice makes it render everything.

Attached file about-support.json
Status: RESOLVED → REOPENED
Ever confirmed: true
OS: Unspecified → Linux
Hardware: Unspecified → x86_64
Resolution: WORKSFORME → ---

Using gcore I've generated the core dumps of all Firefox processes, I wonder if they are of any use - I can send them to a Firefox developer, no problem.

I'm using the official Firefox release from your FTP server under Fedora 36.

The

"[GFX1-]: GFX: RenderThread detected a device reset in PostUpdate"

lines from comment 17 definitely look suspicious. I suppose this shouldn't happen in the first place, but then there's the question how well our robustness/recovery code works.

Also, I see you are using XFCE - what are your compositor settings there? Like, is the compositor always on, off, on-demand (e.g. off in fullscreen)? If you force it to be always on, does that make the device resets go away?
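To check whether a saved about:support dump contains the device reset message quoted above, a rough sketch follows. It treats the dump as plain text and makes no assumption about the JSON layout. The function names are hypothetical.

```python
DEVICE_RESET_MARKER = "detected a device reset"


def count_device_resets(text, marker=DEVICE_RESET_MARKER):
    """Count occurrences of the marker anywhere in the dump text."""
    return text.count(marker)


def lines_with_device_reset(text, marker=DEVICE_RESET_MARKER):
    """Return the stripped lines that mention the marker, in order."""
    return [line.strip() for line in text.splitlines() if marker in line]
```

Comparing the count from a dump taken right after resume with one taken after a freeze would help confirm whether the freezes only follow a device reset.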

Compositing is disabled.

If you force it to be always on, does that make the device resets go away?

Haven't tested it yet but I prefer not to have it enabled at all because it results in a heavier GPU use and I just don't like it.

(In reply to Artem S. Tashkinov from comment #22)

If you force it to be always on, does that make the device resets go away?

Haven't tested it yet but I prefer not to have it enabled at all because it results in a heavier GPU use and I just don't like it.

The issue is: the proprietary Nvidia driver might not like that / their devs might not care any more for non-composited desktops :/ It could well be that what you're seeing is simply a driver issue we can't do much about - apart from blocklisting GPU rendering on such setups.

The Linux driver has very problematic power management - while in Windows the desktop (dwm) and Firefox barely register in terms of GPU wattage/use, in Linux even scrolling a web page in Firefox results in a ~26W power use (vs ~11W at idle) which is simply insane.

And enabling compositing in XFCE makes the GPU jump to this 26W power use and stay in this mode for far too long for my liking.

Would still be great if you could try out if that makes the device resets stop - and if you could also confirm that the issues indeed only happen after one occurred. If that's the case we might be able to convince NV driver devs to check deeper on it.

  • The bug is extremely difficult to reproduce as it happens sporadically.
  • It can be easily "fixed" by restarting the browser.
  • Fewer than a dozen people have even noticed it.
  • There are no driver related log messages anywhere when the bug occurs.

Given all of that, there's 0% likelihood NVIDIA will even create an internal ticket for it.

I've no idea what to do with this bug. It's as good as dead.

Don't know if it is related, but I can reliably reproduce this on GNOME Wayland.
With hardware acceleration enabled, just open a new window, and it won't render and I can't click on anything until I open the GNOME overview or unmaximize the window.

I have this problem too, as also documented in the Linux Mint forum thread above as well as:
https://bugzilla.mozilla.org/show_bug.cgi?id=1781167

and while it's not related to Nvidia, it IS related to whatever changes occurred in v. 102. That seems to be the common theme for all of us.
Here's another thread from the Solus distro on exactly this issue:
https://discuss.getsol.us/d/8515-firefox-102-hangs-from-time-to-time/
One of the posters also cites using an nvidia gpu, but so far, the symptoms described, which match exactly, do not seem to be limited to Nvidia at all.

Nonetheless, since the summary here states Nvidia and I'm using an ancient Intel iGPU with software-everything on MATE and no h/w acceleration, I'll post my updates to Bug 1781167

Summary: [LINUX, NVIDIA, Regression] WebRender occasionally goes crazy → [LINUX, Regression] WebRender occasionally goes crazy

(In reply to VJ from comment #28)

I have this problem too, as also documented in the Linux Mint forum thread above as well as:
https://bugzilla.mozilla.org/show_bug.cgi?id=1781167

and while it's not related to Nvidia, it IS related to whatever changes occurred in v. 102. That seems to be the common theme for all of us.
Here's another thread from the Solus distro on exactly this issue:
https://discuss.getsol.us/d/8515-firefox-102-hangs-from-time-to-time/
One of the posters also cites using an nvidia gpu, but so far, the symptoms described, which match exactly, do not seem to be limited to Nvidia at all.

Nonetheless, since the summary here states Nvidia and I'm using an ancient Intel iGPU with software-everything on MATE and no h/w acceleration, I'll post my updates to Bug 1781167

Looks like the issue is far more widespread than I initially thought and affects people using GPUs from other vendors. Adjusting the title accordingly. I guess one of these bug reports can be closed as a dupe.

A very long discussion with exactly the same symptoms is here: https://forums.linuxmint.com/viewtopic.php?t=376770

From Linux Mint forums:

Maybe I'll look mid-May

Added 2:
Ahh - the mozregression tool does the 'bisection' automatically. Set release 'good' to 101, 'bad' to 102, starts with build from 5/16. Cool!

Added 3:
Build from 5/16 froze. Tool moved to 5/9.

Added 4:
Build from 5/9 froze. Tool moved to 5/6,

Added 5:
Build from 5/6 froze. Tool moved to 5/4.

Added 6:
Mon. night. Still running 5/4 build, no freezes since Sun. afternoon.

Update 7:
Tues. evening. Build from 5/4 still running. It's been a long time since I was able to run FF for 2+ days without a freeze.

May, 5/6th build is probably where the regression took place.

See Also: → 1781167
Status: REOPENED → NEW
Summary: [LINUX, Regression] WebRender occasionally goes crazy → HW-WR/XFCE/Nvidia: WebRender occasionally goes crazy after suspend&resume

I've been running mozregression tests since Sunday (see comment 30 above). The May 4 nightly ran for 3+ days without issue. Yesterday I began testing the May nightly, and it froze twice within a few hours. From my testing, it appears that this 'issue,' whatever it is, was introduced between the 5/4 and 5/5 nightlies.

The mozregression tool now has me testing some sort of interim builds that I am not familiar with, but I will continue to test whatever builds it offers me, and will report results back here as need be.

(In reply to RandyS from comment #31)

I've been running mozregression tests since Sunday (see comment 30 above). The May 4 nightly ran for 3+ days without issue. Yesterday I began testing the May nightly, and it froze twice within a few hours. From my testing, it appears that this 'issue,' whatever it is, was introduced between the 5/4 and 5/5 nightlies.

The mozregression tool now has me testing some sort of interim builds that I am not familiar with, but I will continue to test whatever builds it offers me, and will report results back here as need be.

It was the May 5th nightly that froze twice yesterday - no edit button?
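The date narrowing described in these comments is a plain bisection between a known-good and a known-bad build. The sketch below is not mozregression's actual code. It only illustrates the idea, and it assumes that every build after the first bad one is also bad. The `is_bad` callback stands in for "run this nightly for a few days and see whether it freezes".

```python
from datetime import date, timedelta


def bisect_first_bad(good, bad, is_bad):
    """Return the earliest date in (good, bad] for which is_bad(date) is True.

    Assumes `good` is known good, `bad` is known bad, and that once a
    build is bad, all later builds are bad too.
    """
    while (bad - good).days > 1:
        mid = good + timedelta(days=(bad - good).days // 2)
        if is_bad(mid):
            bad = mid   # regression is at mid or earlier
        else:
            good = mid  # regression is after mid
    return bad


# Hypothetical window; the predicate encodes the result reported above.
first_bad = bisect_first_bad(
    date(2022, 5, 1), date(2022, 5, 16), lambda d: d >= date(2022, 5, 5)
)
```

With a sporadic freeze like this one, the expensive part is the predicate: a "good" verdict needs a long soak time, which is why each good build above was run for about three days.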

Mozregression testing update:

I believe I have finally emerged from the rabbit hole that is 'autoland' build testing.

When I marked the last build tested 'good' this afternoon, the tool gave me this info:

2022-09-21T16:25:35.997000: INFO : Narrowed integration regression window from [daae2d11, 805110b5] (3 builds) to [ad30f002, 805110b5] (2 builds) (~1 steps left)
2022-09-21T16:25:36.080000: DEBUG : Starting merge handling...
2022-09-21T16:25:36.084000: DEBUG : Using url: https://hg.mozilla.org/integration/autoland/json-pushes?changeset=805110b540517d2531951ea874bc9d4670eddfaf&full=1
2022-09-21T16:25:36.094000: DEBUG : redo: attempt 1/3
2022-09-21T16:25:36.096000: DEBUG : redo: retry: calling _default_get with args: ('https://hg.mozilla.org/integration/autoland/json-pushes?changeset=805110b540517d2531951ea874bc9d4670eddfaf&full=1',), kwargs: {}, attempt #1
2022-09-21T16:25:36.125000: DEBUG : urllib3.connectionpool: Resetting dropped connection: hg.mozilla.org
2022-09-21T16:25:38.882000: DEBUG : urllib3.connectionpool: https://hg.mozilla.org:443 "GET /integration/autoland/json-pushes?changeset=805110b540517d2531951ea874bc9d4670eddfaf&full=1 HTTP/1.1" 200 None
2022-09-21T16:25:38.895000: DEBUG : Found commit message:
Bug 1765399 - Don't create a new SoftwareVsyncSource instance when layout.frame_rate is changed to a different value. r=smaug

Differential Revision: https://phabricator.services.mozilla.com/D144378

2022-09-21T16:25:38.898000: DEBUG : Did not find a branch, checking all integration branches
2022-09-21T16:25:38.924000: INFO : The bisection is done.
2022-09-21T16:25:38.938000: INFO : Stopped
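For anyone collecting several such logs, the narrowed window can be pulled out mechanically. This is a small sketch based only on the line format visible in the log above. The regular expression and the function name are assumptions, not part of mozregression.

```python
import re

# Matches lines such as:
# "... Narrowed integration regression window from [a, b] (3 builds) to [c, d] (2 builds) ..."
WINDOW_RE = re.compile(
    r"Narrowed \w+ regression window from "
    r"\[(\w+), (\w+)\] \((\d+) builds\) to "
    r"\[(\w+), (\w+)\] \((\d+) builds\)"
)


def parse_narrowed_window(line):
    """Return the old and new changeset windows from a log line, or None."""
    match = WINDOW_RE.search(line)
    if match is None:
        return None
    old_lo, old_hi, old_n, new_lo, new_hi, new_n = match.groups()
    return {
        "old": (old_lo, old_hi),
        "old_builds": int(old_n),
        "new": (new_lo, new_hi),
        "new_builds": int(new_n),
    }
```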

Here is a summary of all tests I ran in the past 2 1/2 weeks:

Mozregression testing begun Sunday, Sept 4,
with release 101 = 'good,' release 102 = 'bad'

All 'good' tests were allowed to run ~3 days before being marked as good.
All failed tests occurred within a few hours of beginning testing.

Tested build from May 16 - failed.
Tested build from May 9 - failed.
Tested build from May 6 - failed.

Tested build from May 4 - ran for 3+ days without failure. Labeled as 'good.'

Tested build from May 5 - failed.

Tested 'mozilla central build: 228073cf...', build_date: 2022-05-05 11:35:40.967000 - failed

Tested 'autoland' build: 2022-05-04 23:11:43.174000 - marked 'good'

Testing 'autoland' build b869511e: 2022-05-05 01:42:53.865000 - marked good

9/14 12:00
Testing 'autoland' build daae2d11: 2022-05-07 13:10:54.295000 - marked good

9/18 3:20p
Testing 'autoland' build 000ea190: 2022-05-07 13:00:14.643000 application_buildid: 20220505040320 - marked bad

9/18 6:30p
Testing 'autoland' build 805110b5: 2022-05-07 13:08:39.010000 application_buildid: 20220505034937 - marked bad

9/18 7:15p
Testing 'autoland' build ad30f002: 2022-05-05 05:03:51.410000 application_buildid: 20220505034020 - marked good

9/21 4:30p
Marked 'autoland' build ad30f002 good.

Let me know if you have any questions, or any further tests you would like run.

Bug 1765399 - Don't create a new SoftwareVsyncSource instance when layout.frame_rate is changed to a different value. r=smaug

:mstange, since you are the author of the regressor, bug 1765399, could you take a look?

For more information, please visit auto_nag documentation.

Flags: needinfo?(mstange.moz)

I confirm this bug on LXDE. Frequency of occurrence: every day. Version of Firefox: 106.0b8 (64-bit).
I tried turning GPU acceleration off and on, turning Fission windows off and on, lowering security.sandbox.content.level from 4 to 1, etc. None of it resolved the bug.

Attached video Bug 1780972 on LXDE

Such freezes have become a lot more frequent with Firefox 105.0.2 for some reason.

I've had four today while normally it happens maybe once a week.

Freezes began to occur much less often with these settings:

gfx.webrender.force-legacy-layers = true
gfx.x11-egl.force-enabled = true

Freezes occurred several times a day with old settings. Now freezes happen once every few days.

I can confirm the same issue is present on my system (Fedora 37 / NVIDIA / KDE).

As it's related to the refresh driver, can you run Firefox from a terminal with

MOZ_LOG="nsRefreshDriver:5"

and attach last ~200 lines when you see the freeze? I wonder if the refresh driver is blocked.
Thanks.
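One way to capture that output without scrolling back through a terminal is sketched below. It assumes the binary is named `firefox` and is on `PATH`, and that the log goes to stderr, which is the default when `MOZ_LOG_FILE` is not set. The function names are hypothetical.

```python
import os
import subprocess
from collections import deque


def refresh_driver_env(base_env=None):
    """Copy the environment and enable refresh driver logging."""
    env = dict(os.environ if base_env is None else base_env)
    env["MOZ_LOG"] = "nsRefreshDriver:5"
    return env


def tail_lines(lines, count=200):
    """Keep only the last `count` items of an iterable of lines."""
    return list(deque(lines, maxlen=count))


def run_and_capture_tail(binary="firefox", count=200):
    """Run the browser until it exits and return the last `count` stderr lines.

    Assumes MOZ_LOG output is written to stderr.
    """
    proc = subprocess.Popen(
        [binary],
        env=refresh_driver_env(),
        stdout=subprocess.DEVNULL,
        stderr=subprocess.PIPE,
        text=True,
        errors="replace",
    )
    with proc.stderr:
        tail = tail_lines(proc.stderr, count)
    proc.wait()
    return tail
```

`run_and_capture_tail()` only returns once the browser process ends, so after a freeze the browser has to be closed or killed before the collected lines can be saved and attached.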

The patch in bug 1781167 should fix this. Thank you everyone for your patience.

Status: NEW → RESOLVED
Closed: 2 years ago
Duplicate of bug: 1781167
Flags: needinfo?(mstange.moz)
No longer regressed by: 1765399
Resolution: --- → DUPLICATE