Closed Bug 1781167 Opened 7 months ago Closed 3 months ago

Firefox window randomly freezes

Categories

(Core :: Widget, defect, P3)

Firefox 102
Unspecified
Linux
defect

Tracking

RESOLVED FIXED
109 Branch
Tracking Status
firefox-esr102 108+ fixed
firefox107 - wontfix
firefox108 + fixed
firefox109 + fixed

People

(Reporter: nuromi, Assigned: mstange)

References

(Regression)

Details

(Keywords: regression)

Attachments

(10 files)

Attached file about:support

User Agent: Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101 Firefox/102.0

Steps to reproduce:

Surf the web.

Actual results:

The Firefox window randomly freezes. Only the window freezes; Firefox itself keeps working.
If I interact with the window (scroll, click a link, open a new tab, etc.) and then minimize and reopen the window, the changes are reflected in the window (although it is still frozen). This continues until I close and reopen Firefox.

Expected results:

Firefox window does not freeze.

OS: Debian 11 with Xfce 4.16
Firefox 102.0.1 from Mozilla binaries

Flags: needinfo?(nuromi)

The Bugbug bot thinks this bug should belong to the 'Core::Widget: Gtk' component, and is moving the bug to that component. Please correct in case you think the bot is wrong.

Component: Untriaged → Widget: Gtk
Product: Firefox → Core

I could not reproduce the issue on Ubuntu 20.04 using build 102.0.1 (20220705093820).
Can you please provide the web that is freezing? Does the problem still happen if you start Firefox in Safe Mode? (Safe Mode disables add-ons, extensions and themes, hardware acceleration and some JavaScript stuff in order to exclude some possible reasons for problems.) See https://support.mozilla.org/en-US/kb/troubleshoot-firefox-issues-using-safe-mode
And does this also happen with a new and empty profile? See https://support.mozilla.org/en-US/kb/troubleshoot-and-diagnose-firefox-problems#w_6-create-a-new-firefox-profile .

(In reply to Andre Klapper from comment #1)

Please see https://support.mozilla.org/kb/firefox-hangs-or-not-responding and report back

I've been trying the solutions there but no luck yet

(In reply to Monica Chiorean from comment #3)

Can you please provide the web that is freezing?

It can happen in any webpage and in any moment.

Does the problem still happen if you start Firefox in Safe Mode? (Safe Mode disables add-ons, extensions and themes, hardware acceleration and some JavaScript stuff in order to exclude some possible reasons for problems.) See https://support.mozilla.org/en-US/kb/troubleshoot-firefox-issues-using-safe-mode
And does this also happen with a new and empty profile? See https://support.mozilla.org/en-US/kb/troubleshoot-and-diagnose-firefox-problems#w_6-create-a-new-firefox-profile .

I'm going to try to test that, but since the freezes happen very randomly (anywhere from once a week to three times in a day), I don't know how long it will take me.

I followed this guide https://udn.realityripple.com/docs/Mozilla/How_to_report_a_hung_Firefox and did a couple of crash reports in case they help:
https://crash-stats.mozilla.org/report/index/2c9fa42b-5fdb-4a07-ab1d-0746b0220719
https://crash-stats.mozilla.org/report/index/7c90c5e0-b19c-44d9-974a-9a3d20220729

Flags: needinfo?(nuromi)

By the way, this started to happen from firefox 102.

I found this reddit post from someone with the same problem as me
https://www.reddit.com/r/firefox/comments/weqzwm/firefox_suddenly_freezes_on_certain_sites_linux/

It still happens in a newly created profile (with a new .mozilla folder) with default settings and no add-ons.
However, it took a week and a half to happen again.

I have recorded a video of the bug.

Attached video Video of the bug

It still happens with hardware acceleration disabled.

Hello? Is there someone here?
I don't know what more to do, so I accept suggestions.

I and a number of others running Linux Mint have run into the same issue as described in this topic: https://forums.linuxmint.com/viewtopic.php?f=47&t=376770

A change introduced in FF102 is at the root of our issue. I have a Timeshift snapshot of my system with FF101.0.1 and a copy of my profile with FF101.1 so I can revert to FF101.0.1 and work without freezes. With FF102 and newer, using a fresh profile, running in safe mode, having hardware acceleration on or off, etc. makes no difference. It still freezes. It sort of gives the impression of a possible race condition or a stuck memory situation?

I do want to thank nuromi for mentioning minimizing and then maximizing again gets FF to change because that has been helpful to me to see other tabs I have open before I have to close FF to clear the problem. I usually just switch to another application rather than minimizing apps.

In my case, I normally have multiple tabs open when the freeze occurs, although the number of tabs and the length of time I have had FF open does not seem to correlate to when the problem happens. In FF102 (and 103) when this problem happened if I clicked to change tabs, the title above the tab changed and the tab with focus changed, but the page (below the tab) did not change. Thus I would have a tab from one page in focus with the page from the prior tab still on screen. If I clicked the x to close a tab, the tab might have closed, although usually it did not. If the tab did disappear, the page did not repaint so I had a gap (space) where the tab would have been.

In FF104, if I click a different tab, the tab with focus does not change; only the title at the very top of the page changes to indicate I clicked a different tab. However, if I minimize and then maximize in FF104 after closing a tab, the tab disappears. When these freezes happen, I cannot click the + to open a new tab.

I have tried tracking memory usage (of just firefox.bin) at the time of the freeze, but there does not seem to be a correlation. Sometimes I can go all day before it happens. Other times I am lucky if I make it an hour or two. I am currently running FF104.0.1 and it is still happening.

If possible, please try to find a regression range, you will get a pushlog url at the end:
$ pip3 install -U mozregression
$ ~/.local/bin/mozregression --good 100 --bad 103

General ideas:

  • Test https://nightly.mozilla.org.
  • Nvidia
  • For Intel users who have manually force-enabled hardware rendering:
    Remove deprecated Intel DDX driver, use default modesetting driver. bug 1710400 comment 20:
    sudo apt remove xserver-xorg-video-intel
  • To prevent glxtest crash and fallback to software rendering, remove deprecated libva-vdpau-driver: bug 1787182 comment 2
  • Try disabling GLX vsync: Open about:config, set layout.frame_rate=60, restart Firefox.
  • XFCE and KDE users should try disabling their compositor (restart Firefox afterwards) and should also check if the same problem occurs with Gnome.
  • Try enforcing software rendering: Open about:config, set gfx.webrender.software=true, restart Firefox.
Flags: needinfo?(nuromi)
Priority: -- → P3

Also, please run Firefox from a terminal with MOZ_LOG="Widget:5" and look at what happens during the freeze: does Firefox print debug output (receiving events) during the freeze? Does it get keyboard/mouse events?
Thanks.
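To make "does Firefox still receive events during the freeze?" easier to answer, the log can be captured to a file and summarized per second. The following is only a minimal sketch. It assumes that the `timestamp` option of MOZ_LOG prefixes each line with `YYYY-MM-DD HH:MM:SS`, and that the log goes to stderr when no log file is configured. The helper name `lines_per_second` and the file name `widget.log` are hypothetical.

```shell
# Hypothetical helper: count MOZ_LOG lines per second.
# Assumption: each log line starts with "YYYY-MM-DD HH:MM:SS" (the format
# the "timestamp" logging option is assumed to emit). Other lines are ignored.
lines_per_second() {
    grep -E '^[0-9]{4}-[0-9]{2}-[0-9]{2} [0-9]{2}:[0-9]{2}:[0-9]{2}' \
        | cut -c1-19 \
        | sort \
        | uniq -c \
        | awk '{ print $2, $3, $1 }'   # date, time, number of lines
}

# Possible usage (not executed here):
#   MOZ_LOG="timestamp,Widget:5" firefox 2> widget.log
#   lines_per_second < widget.log | tail
```

If the counts keep growing while the window is frozen, the widget code is still receiving events and only the painting is stuck; if they drop to zero, event delivery itself has stopped.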

(In reply to Martin Stránský [:stransky] (ni? me) from comment #13)

Also, please run Firefox from a terminal with MOZ_LOG="Widget:5" and look at what happens during the freeze: does Firefox print debug output (receiving events) during the freeze? Does it get keyboard/mouse events?
Thanks.

I tried this today. I ran
firefox --MOZ_LOG="Widget:5"
from the terminal and watched the terminal while I had the browser up. When the freeze happened, keystrokes did register in the terminal as I attempted to change tabs or do anything else.

Yesterday when a freeze happened, I was able to save a bookmark of a page (using the minimize/maximize the window trick), but parts of the window were missing. I blindly saved what came up. I did the same today and snapped a photo. I will attach it so you can see what I mean about parts missing.

This is what came up when I attempted to save a bookmark of the page (a page I had not yet had a chance to read). I just blindly hit enter and the bookmark did save.

(In reply to Darkspirit from comment #12)

If possible, please try to find a regression range, you will get a pushlog url at the end:
$ pip3 install -U mozregression
$ ~/.local/bin/mozregression --good 100 --bad 103

If I know 101.0.1 was good and I first started having problems in 102, would I still want to use --good 100 --bad 103? I have no idea what is causing the problem, so all I can do is work as I normally would and see if it happens.

General ideas:

  • Test https://nightly.mozilla.org.
  • Try disabling GLX vsync: Open about:config, set layout.frame_rate=60, restart Firefox.
  • Try enforcing software rendering: Open about:config, set gfx.webrender.software=true, restart Firefox.

My laptop only has onboard Intel graphics (Sandy Bridge era Celeron processor) and I have always used modesetting. I think these might be the only options which apply to my situation. I will give the last one a try and see if the freezes stop (because I'm not sure if GLX vsync applies in my situation?). I have always had the options in Settings > General > Performance unchecked. However, that must be a different setting, because when I checked gfx.webrender.software it was set to false.

Hi.
In my case, freezes are very infrequent, about once a week, so any test I do will take a long time to confirm the result.
So, thanks to Susan and the other Linux Mint users for helping with troubleshooting.

Flags: needinfo?(nuromi)

There have only been one or two days where I was able to make it through without a freeze. It's my impression (which may not be accurate) those with Nvidia seem to be running into the freezes a bit less often than those of us with Intel and AMD. However, it's also possible web activity levels might be a factor. No real way for me to be able to judge that.

Both 'Try disabling GLX vsync:' and 'Try enforcing software rendering' resulted in a freeze. (I tried the options separately as did someone on the forum with AMD graphics.) I will move on to figuring out pip and mozregression to get them installed.

Attached file My system information

I will leave my system information here in case it is useful.

Because I do not know what is triggering the problem, I decided that if I could run two days without a freeze, I would consider that build "good" and move on to the next version in the regression. Normally, I close Firefox each night and either shut down my computer or suspend it. To keep the test running, I just disconnected from the Internet last night and did not suspend. I only had one blank tab up in Firefox.

I downloaded the first nightly build which came up. Adjusted my settings as I normally have them, imported my bookmarks (HTML file from Firefox 104) and started working yesterday afternoon. This morning it froze. I thought the first build would be an approximation of the Firefox 101.0.1 version I had been successfully using. This is what I was testing.
https://archive.mozilla.org/pub/firefox/nightly/2022/04/2022-04-04-23-18-05-mozilla-central/firefox-101.0a1.en-US.linux-x86_64.tar.bz2

Is there something about a nightly build that might be different from a final version I would normally get?

I will try again but this time I will start with --good 99 instead of 100. Please let me know if I should be trying something else.

I'm also one who's affected, and have posted on that linux mint forum. I also reference other people encountering this on the Solus Linux distro forum. I'm using MATE, ancient Intel G41 chipset igpu, firefox config defaults, no 3d compositing, no h/w acceleration. Went through the whole troubleshooting mode with add-ons disabled to no avail. Started exactly with v. 102.0 official release

Current about:support Graphics:

The large raw text attachment above turned out to be a jumbled mess.
Here's a snippet, hopefully of the more relevant parts:

Features
Compositing     WebRender (Software)

WebGL 1 Driver Renderer Intel Open Source Technology Center -- Mesa DRI Intel(R) G41 (ELK)
WebGL 1 Driver Version  2.1 Mesa 21.2.6

WebGL 2 Driver WSI Info -
WebGL 2 Driver Renderer WebGL creation failed:
* tryNativeGL (FEATURE_FAILURE_EGL_NO_CONFIG)
* Exhausted GL driver options. (FEATURE_FAILURE_WEBGL_EXHAUSTED_DRIVERS)
WebGL 2 Driver Version  -
WebGL 2 Driver Extensions       -

HW_COMPOSITING
available by default
disabled by user: Disabled by layers.acceleration.disabled=true
OPENGL_COMPOSITING
unavailable by default: Hardware compositing is disabled

WEBRENDER
available by default
disabled by env: Not qualified
unavailable-no-hw-compositing by runtime: Hardware compositing is disabled

WEBRENDER_QUALIFIED
available by default
blocklisted by env: No qualified hardware

WEBRENDER_COMPOSITOR
disabled by default: Disabled by default
blocklisted by env: Blocklisted by gfxInfo
blocked by runtime: Cannot be enabled in release or beta

WEBRENDER_PARTIAL
available by default

WEBRENDER_SHADER_CACHE
disabled by default: Disabled by default
unavailable by runtime: WebRender disabled

WEBRENDER_OPTIMIZED_SHADERS
available by default
unavailable by runtime: WebRender disabled

WEBRENDER_ANGLE
available by default
unavailable by env: OS not supported

WEBRENDER_SOFTWARE
available by default

WEBGPU
disabled by default: Disabled by default
blocked by runtime: WebGPU cannot be enabled in release or beta

X11_EGL
available by default

DMABUF
available by default

HARDWARE_VIDEO_DECODING
available by default
unavailable by runtime: Force disabled by gfxInfo

DMABUF_SURFACE_EXPORT
blocked by default: Blocklisted by gfxInfo

BACKDROP_FILTER
available by default
Failure Log
(#0) Error      glxtest: VA-API test failed: no supported VAAPI profile found.

I have not tried firefox --MOZ_LOG="Widget:5" yet, but I can vouch for duplicating Susan's experience: when the freeze occurs, keyboard and mouse input still responds, albeit at a delayed snail's pace and with a partially updating GUI.

I did in the past try this logging:

export NSPR_LOG_MODULES=all:5
export NSPR_LOG_FILE=~/firefox/firefox.log

full details reported here: https://forums.linuxmint.com/viewtopic.php?p=2207661#p2207661

but the gist is that the logs themselves didn't seem to indicate anything out of the ordinary (comparing non-freezing vs. freezing), EXCEPT for the fact that whenever the freeze occurs, the last 3 or so child processes created, as evident from the "firefox.log.child-XXX" files, are always killed.
e.g.

firefox.log.child-591
firefox.log.child-591.moz_log
firefox.log.child-590
firefox.log.child-590.moz_log
firefox.log.child-589
firefox.log.child-589.moz_log

firefox.log.child-585
firefox.log.child-588
firefox.log.child-588.moz_log
firefox.log.child-587
firefox.log.child-587.moz_log

but checking:

ps -ef | grep 'childID 591'
ps -ef | grep 'childID 590'
ps -ef | grep 'childID 589'

only shows that "childID 585" and earlier are running when this freeze happens.
Child IDs 591, 590, and 589, despite being the most recently created child processes, seem to have been killed or terminated.

I don't know if my interpretation of this debugging facility is correct, but this is the consistent behavior I observe on my end
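The manual comparison above (log files per child versus `ps -ef | grep 'childID N'`) can be scripted so it is quick to repeat at the moment of a freeze. This is a rough sketch only: it assumes the log files are named `firefox.log.child-N` as in the listing above, and that running content processes carry a `childID N` argument visible in `ps -ef`. The function names are hypothetical.

```shell
# Print the child IDs that have a log file in directory $1
# (files named firefox.log.child-N; the .moz_log variants are skipped).
logged_child_ids() {
    ls "$1" | sed -n 's/^firefox\.log\.child-\([0-9][0-9]*\)$/\1/p' | sort -n
}

# Read `ps -ef` output on stdin and print the child IDs of running processes.
running_child_ids() {
    sed -n 's/.*childID \([0-9][0-9]*\).*/\1/p' | sort -n | uniq
}

# Print child IDs that were logged in directory $1 but are no longer
# present in the `ps -ef` output supplied on stdin.
missing_child_ids() {
    running=$(running_child_ids)
    for id in $(logged_child_ids "$1"); do
        printf '%s\n' "$running" | grep -qx "$id" || printf '%s\n' "$id"
    done
}

# Possible usage during a freeze (not executed here):
#   ps -ef | missing_child_ids ~/firefox
```

Note that a missing child is not necessarily a symptom: Firefox also shuts down content processes during normal operation, so the list is only a starting point for comparison between frozen and non-frozen sessions.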

Another bug with the same problem https://bugzilla.mozilla.org/show_bug.cgi?id=1780972

(In reply to nuromi from comment #24)

Another bug with the same problem https://bugzilla.mozilla.org/show_bug.cgi?id=1780972

I linked comment 12 in bug 1780972 comment 14.
Difference: In comment 0 you have software rendering on XFCE, but bug 1780972 uses hardware rendering on XFCE.

OS: Unspecified → Linux
See Also: → 1780972
Summary: Firefox window randomly freeze → SW-WR/XFCE/Intel: Firefox window randomly freeze

(In reply to VJ from comment #21)

...no 3d compositing...

Could you check if turning on WM compositing works around the issue, given that it's one of the similarities with bug 1780972? That would be great :)

Attached file system-info-Susan.txt

Here is my system info. I am running Cinnamon desktop with the Effects turned off, but running with its default compositing enabled. I do have the Firefox setting for "Use hardware acceleration when available." unchecked, but I've always had it that way.

I am back on Firefox 101.0.1 temporarily to get some work done without having to worry about losing work due to freezes, but expect to resume trouble-shooting by the weekend.

(In reply to Robert Mader [:rmader] from comment #26)

(In reply to VJ from comment #21)

...no 3d compositing...

Could you check if turning on WM compositing works around the issue, given that it's one of the similarities with bug 1780972? That would be great :)

Sorry I misspoke about this. So it turns out I was running a compositor, just with all the effects disabled.
As an aside, does running a compositor always imply using OpenGL/3d portions of the gpu? (or can one use 2d/X11 primitives?)
Anyways, in MATE I have these choices for Window Manager + Compositor combos:

Marco
Marco + Compositing
Marco + Compton
Metacity
Metacity + Compositing
Metacity + Compton
Compiz

NO compositing choices are: Marco and Metacity
wm-detect will tell me if I'm running a compositor or not.
Now I remember I had changed from the default "Marco" to "Marco + Compositing" some time ago, but forgot about that.
Now I realize the only difference between the two is a slight shadow around the window edges.

So I can confirm that with the "Marco + Compositing" selection, the freeze does still occur. Should I switch back to Marco without compositing?

I think it's directly related to your hardware (G41), as all the reports here use it.
I wonder if there's any driver/mesa bug which we hit with latest Firefox version.

(In reply to VJ from comment #28)

So I can confirm that with the "Marco + Compositing" selection, the freeze does still occur. Should I switch back to Marco without compositing?

Compositing means you use transparent windows (usually used for decorations & shadows).
May it be Bug 1756903 ?
Do you see any difference if you run Firefox as:

MOZ_GTK_TITLEBAR_DECORATION=system firefox

or

MOZ_GTK_TITLEBAR_DECORATION=client firefox

or

MOZ_GTK_TITLEBAR_DECORATION=none firefox

?

(In reply to Martin Stránský [:stransky] (ni? me) from comment #29)

I think it's directly related to your hardware (G41), as all the reports here use it.
I wonder if there's any driver/mesa bug which we hit with latest Firefox version.

Mine is Intel HD Graphics 2000 (SNB GT1) and not (G41), but in the Mint forum thread it did seem the issue was more likely to happen if one's computer was in the 10+ year old range, regardless of graphics (Intel, AMD, Nvidia).

I've been running mozregression tests since Sunday (see comment 30 on bug 1780972). The May 4 nightly ran for 3+ days without issue. Yesterday I began testing the May 5 nightly, and it froze twice within a few hours. From my testing, it appears that this 'issue,' whatever it is, was introduced between the 5/4 and 5/5 nightlies.

The mozregression tool now has me testing some sort of interim builds that I am not familiar with, but I will continue to test whatever builds it offers me, and will report results back here as need be.

As Susan said above, I don't think this problem has been confined to any HW combo. I am on an AMD processor with Radeon graphics.

(In reply to Martin Stránský [:stransky] (ni? me) from comment #30)

(In reply to VJ from comment #28)

So I can confirm that with the "Marco + Compositing" selection, the freeze does still occur. Should I switch back to Marco without compositing?

Compositing means you use transparent windows (usually used for decorations & shadows).
ok makes sense, given the meaning of the word

May it be Bug 1756903 ?
Do you see any difference if you run Firefox as:

MOZ_GTK_TITLEBAR_DECORATION=system firefox

or

MOZ_GTK_TITLEBAR_DECORATION=client firefox

or

MOZ_GTK_TITLEBAR_DECORATION=none firefox

I don't see any difference between those 3 and with the variable unset

BTW I don't know if it's just timing and incidental usage pattern on this machine, but I recall I hit freezing more frequently in v.102.x, then seemed like it progressively decreased in v.103.x and with the update to v.104.0.x I've only hit it once (a little before my first post here) if I recall.
I did notice one big beneficial change recently is the much faster loading of saved sessions.

If there is a way to get a stack trace of the current tab (or all tabs or core dump) I can try that too the next time I encounter it

Also when running it from the command line, I get:

[GFX1-]: glxtest: VA-API test failed: no supported VAAPI profile found.
ATTENTION: default value of option mesa_glthread overridden by environment.
...
[GFX1-]: Managed to allocate after flush.
ATTENTION: default value of option mesa_glthread overridden by environment.
[GFX1-]: Managed to allocate after flush.

// on spotify:

Sandbox: attempt to open unexpected file /sys/devices/system/cpu/cpu0/cache/index2/size
Sandbox: attempt to open unexpected file /sys/devices/system/cpu/cpu0/cache/index3/size
Sandbox: attempt to open unexpected file /sys/devices/system/cpu/present
Sandbox: attempt to open unexpected file /sys/devices/system/cpu
Sandbox: unexpected multiple open of file /proc/cpuinfo

// on youtube:

[2022-09-09T17:18:43Z ERROR mp4parse] Found 2 nul bytes in "\0\0"
[2022-09-09T17:18:43Z ERROR mp4parse] Found 2 nul bytes in "\0\0"
[2022-09-09T17:18:43Z ERROR mp4parse] Found 2 nul bytes in "\0\0"
 ....

// don't know when/where:

[Parent 51792, Main Thread] WARNING: g_object_ref: assertion 'G_IS_OBJECT (object)' failed: 'glib warning', file /builds/worker/checkouts/gecko/toolkit/xre/nsSigHandlers.cpp:167

(firefox:51792): GLib-GObject-CRITICAL **: 10:26:51.227: g_object_ref: assertion 'G_IS_OBJECT (object)' failed

(/usr/lib/firefox/firefox-bin:57989): dconf-WARNING **: 10:26:51.397: Unable to open /var/lib/flatpak/exports/share/dconf/profile/user: Permission denied

Regarding not hitting the freeze lately on my end since v.104 (just once, before the update to 104.0.2): the rest of the Mint MATE system has also been continuously updated, so there are potentially other confounding variables (libraries, drivers, and minor kernel updates within 5.4.x) if the cause since the official Firefox v.102 release depends on external system factors. That seems plausible, as the issue apparently affects mostly quite old systems.

I spoke too soon. I finally hit it again on v.104.0.2 and only after a day of use.

In case it matters, just wanted to note everything I came across up to the point of the freeze:

[Child 9213, MediaDecoderStateMachine #5] WARNING: Decoder=7fe78d73ec00 state=DECODING_METADATA Decode metadata failed, shutting down decoder: file /builds/worker/checkouts/gecko/dom/media/MediaDecoderStateMachine.cpp:370
[Child 9213, MediaDecoderStateMachine #5] WARNING: Decoder=7fe78d73ec00 Decode error: NS_ERROR_DOM_MEDIA_METADATA_ERR (0x806e0006) - static MP4Metadata::ResultAndByteBuffer mozilla::MP4Metadata::Metadata(mozilla::ByteStream *): Cannot parse metadata: file /builds/worker/checkouts/gecko/dom/media/MediaDecoderStateMachineBase.cpp:151
[Parent 2126, Main Thread] WARNING: g_object_ref: assertion 'G_IS_OBJECT (object)' failed: 'glib warning', file /builds/worker/checkouts/gecko/toolkit/xre/nsSigHandlers.cpp:167

lots of:

ATTENTION: default value of option mesa_glthread overridden by environment.
ATTENTION: default value of option mesa_glthread overridden by environment.
ATTENTION: default value of option mesa_glthread overridden by environment.
ATTENTION: default value of option mesa_glthread overridden by environment.

There was also a lot of memory pressure, enough to unload tabs, while I loaded up many YouTube tabs and switched around to others. Then I closed all of those.
I also ran "Minimize memory usage" in about:memory.
Then I started again, more aggressively: I opened a few tabs in the background, then switched between them.

During the freeze incident, I observed:

main firefox parent process used 562 MB resident memory, 12.2 GB virtual memory

  • 8 "Isolated Web Co" child processes
    each process using: ~102 MB to 141 MB resident memory, 2.3GB to 2.5GB virtual memory
    status of all children are sleeping, with occasional wakeup for short runs

Then upon pkill firefox:

Exiting due to channel error.
Exiting due to channel error.
Exiting due to channel error.
 ....

I am back to testing and just figured out why I was having problems with what should have been good builds on my first go-round. I use Thunderbird and when I get forum notifications I launch from the link in the email. I did not notice doing that had opened my installed Firefox 104 version and I had been doing my work in it instead of the MozRegression build. Now that mystery is solved, hopefully I can eventually produce a pushlog url.

I encountered another freeze on v.104.0.2 again after two days. The amount of time doesn't really seem to matter; it appears more dependent on usage. I've left it running for several days without hitting it on this machine, seemingly when using it more lightly or occasionally.

I'm curious about the firefox virtual memory consumption I cited above. I encountered this large mem scenario again this latest time. I attempted to generate a core dump with gcore which seemed to have failures but still left a 13GB core file from the main parent firefox process.
In both cases where I've looked at this during the freeze, Firefox's virtual memory exceeded my total of 4GB physical memory + 7.6GB of swap.
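For anyone who wants to record the same numbers during a freeze, here is a small sketch that reads them from `/proc`. It assumes the standard Linux `VmSize` and `VmRSS` fields in `/proc/<pid>/status` and `MemTotal` and `SwapTotal` in `/proc/meminfo`. The function names are hypothetical, and the file paths are parameters so the functions can also be run on saved copies.

```shell
# Print the value (in kB) of field $2 from a /proc-style "Key:  value kB" file $1.
proc_kb() {
    awk -v key="$2:" '$1 == key { print $2 }' "$1"
}

# Summarize a process's virtual and resident size next to RAM + swap.
# $1 = path to a /proc/<pid>/status file, $2 = path to a /proc/meminfo file.
vm_report() {
    vsz=$(proc_kb "$1" VmSize)
    rss=$(proc_kb "$1" VmRSS)
    total=$(( $(proc_kb "$2" MemTotal) + $(proc_kb "$2" SwapTotal) ))
    echo "VmSize=${vsz}kB VmRSS=${rss}kB RAM+swap=${total}kB"
}

# Possible usage (not executed here), with <pid> being the parent Firefox process:
#   vm_report /proc/<pid>/status /proc/meminfo
```

Keep in mind that VmSize counts reserved address space, not memory actually in use, so a value above RAM + swap is not by itself evidence of a leak. The resident size (VmRSS) is usually the more telling figure.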

Mozregression testing update:

I believe I have finally emerged from the rabbit hole that is 'autoland' build testing.

When I marked the last build tested 'good' this afternoon, the tool gave me this info:

2022-09-21T16:25:35.997000: INFO : Narrowed integration regression window from [daae2d11, 805110b5] (3 builds) to [ad30f002, 805110b5] (2 builds) (~1 steps left)
2022-09-21T16:25:36.080000: DEBUG : Starting merge handling...
2022-09-21T16:25:36.084000: DEBUG : Using url: https://hg.mozilla.org/integration/autoland/json-pushes?changeset=805110b540517d2531951ea874bc9d4670eddfaf&full=1
2022-09-21T16:25:36.094000: DEBUG : redo: attempt 1/3
2022-09-21T16:25:36.096000: DEBUG : redo: retry: calling _default_get with args: ('https://hg.mozilla.org/integration/autoland/json-pushes?changeset=805110b540517d2531951ea874bc9d4670eddfaf&full=1',), kwargs: {}, attempt #1
2022-09-21T16:25:36.125000: DEBUG : urllib3.connectionpool: Resetting dropped connection: hg.mozilla.org
2022-09-21T16:25:38.882000: DEBUG : urllib3.connectionpool: https://hg.mozilla.org:443 "GET /integration/autoland/json-pushes?changeset=805110b540517d2531951ea874bc9d4670eddfaf&full=1 HTTP/1.1" 200 None
2022-09-21T16:25:38.895000: DEBUG : Found commit message:
Bug 1765399 - Don't create a new SoftwareVsyncSource instance when layout.frame_rate is changed to a different value. r=smaug

Differential Revision: https://phabricator.services.mozilla.com/D144378

2022-09-21T16:25:38.898000: DEBUG : Did not find a branch, checking all integration branches
2022-09-21T16:25:38.924000: INFO : The bisection is done.
2022-09-21T16:25:38.938000: INFO : Stopped

Here is a summary of all tests I ran in the past 2 1/2 weeks:

Mozregression testing begun Sunday, Sept 4,
with release 101 = 'good,' release 102 = 'bad'

All 'good' tests were allowed to run ~3 days before being marked as good.
All failed tests occurred within a few hours of beginning testing.

Tested build from May 16 - failed.
Tested build from May 9 - failed.
Tested build from May 6 - failed.

Tested build from May 4 - ran for 3+ days without failure. Labeled as 'good.'

Tested build from May 5 - failed.

Tested 'mozilla central build: 228073cf...', build_date: 2022-05-05 11:35:40.967000 - failed

Tested 'autoland' build: 2022-05-04 23:11:43.174000 - marked 'good'

Testing 'autoland' build b869511e: 2022-05-05 01:42:53.865000 - marked good

9/14 12:00
Testing 'autoland' build daae2d11: 2022-05-07 13:10:54.295000 - marked good

9/18 3:20p
Testing 'autoland' build 000ea190: 2022-05-07 13:00:14.643000 application_buildid: 20220505040320 - marked bad

9/18 6:30p
Testing 'autoland' build 805110b5: 2022-05-07 13:08:39.010000 application_buildid: 20220505034937 - marked bad

9/18 7:15p
Testing 'autoland' build ad30f002: 2022-05-05 05:03:51.410000 application_buildid: 20220505034020 - marked good

9/21 4:30p
Marked 'autoland' build ad30f002 good.

Let me know if you have any questions, or any further tests you would like run.

I am now bisecting taskclusters on 2022-05-05. Wish I could say I see a pattern that leads to the freeze, but it still seems random.

These are the nightly builds the command line tool had me test for $ ~/.local/bin/mozregression --good 100 --bad 103

2022-04-04 - firefox 101.0a1 - good (ran 2 days with no problems)
2022-06-27 - firefox-104.0a1 - bad ( ~1.5 days before it froze)
2022-05-16 - firefox 102.0a1 - bad ( ~20 minutes before it froze)
2022-04-25 - firefox 101.0a1 - good (ran 2 days with no problems)
2022-05-06 - firefox 102.0a1 - bad ( ~21 hrs before it froze)
2022-05-01 - firefox 101.0a1 - good (ran 2 days with no problems)
2022-05-04 - firefox 102.0a1 - good (ran 2 days with no problems)
2022-05-05 - firefox 102.0a1 - bad ( ~26 hrs before it froze)
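The list above can be reduced to a regression window mechanically. The sketch below is illustrative only: it takes "date verdict" pairs on stdin and prints the latest build marked good and the earliest build marked bad. The function name `regression_window` is hypothetical, and the approach assumes every verdict is reliable, which is not guaranteed for an intermittent freeze (a build marked good after two days could still be affected).

```shell
# Read lines of the form "YYYY-MM-DD good|bad" on stdin and print the
# latest good date and the earliest bad date. ISO dates compare correctly
# as plain strings, so no date arithmetic is needed.
regression_window() {
    awk '
        BEGIN { good = ""; bad = "" }
        $2 == "good" && $1 > good                { good = $1 }
        $2 == "bad"  && (bad == "" || $1 < bad)  { bad = $1 }
        END {
            print "last good:", good
            print "first bad:", bad
        }
    '
}

# Possible usage with the results listed above:
#   printf "%s\n" "2022-05-04 good" "2022-05-05 bad" "2022-05-06 bad" | regression_window
```

With the eight results from this comment as input, the window comes out as 2022-05-04 (good) to 2022-05-05 (bad), which matches the range the autoland bisection narrowed down further.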

Regressed by: 1765399

:mstange, since you are the author of the regressor, bug 1765399, could you take a look? Also, could you set the severity field?

For more information, please visit auto_nag documentation.

Flags: needinfo?(mstange.moz)

If it helps any, on my system I have in:
/etc/X11/xorg.conf.d/20-intel.conf

Section "Device"
	Identifier  "Intel Graphics"
	Driver      "intel"
	Option      "AccelMethod" "SNA"
	Option      "TearFree"    "true"
EndSection

Is there something more I can do to help? Maybe traces I could run or logs that might be helpful?

We have people from other distros joining the Linux Mint forum to indicate they too are having the freeze problems.

Comment 44 might be a good hint - the deprecated Intel DDX driver, especially with TearFree, has been proven extremely buggy in the past. So much that we had to disable hardware acceleration for it altogether, see bug 1710400 . It would be interesting if this is the common denominator here.

Can anyone affected check if xrandr --listproviders contains name:Intel? If that's the case, can you check if switching to glamor solves the issue? E.g. apt remove xserver-xorg-video-intel
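The check requested above can be done by eye, but a small helper makes it easy to paste a yes/no answer. This is a minimal sketch that assumes `xrandr --listproviders` prints one `Provider N: ... name:<driver>` line per provider. The function names are hypothetical.

```shell
# Read `xrandr --listproviders` output on stdin and print each provider name.
provider_names() {
    sed -n 's/^Provider [0-9][0-9]*:.*name:\(.*\)$/\1/p'
}

# Succeed (exit status 0) if any provider is named exactly "Intel",
# which is the name the Intel DDX driver is expected to report.
uses_intel_ddx() {
    provider_names | grep -qx 'Intel'
}

# Possible usage (not executed here):
#   xrandr --listproviders | provider_names
#   xrandr --listproviders | uses_intel_ddx && echo "Intel DDX in use"
```

A provider named `modesetting` means the generic modesetting driver is active, so the Intel DDX theory would not apply to that system.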

I have a 2nd-gen Intel which uses modesetting.
xrandr --listproviders
Providers: number : 1
Provider 0: id: 0x49 cap: 0x9, Source Output, Sink Offload crtcs: 2 outputs: 8 associated providers: 0 name:modesetting

Mine, (Intel G41 mobo chipset gpu, pre-UHD series.... like pre-"legacy")

xrandr --listproviders
Providers: number : 1
Provider 0: id: 0x46 cap: 0x9, Source Output, Sink Offload crtcs: 3 outputs: 4 associated providers: 0 name:Intel

I am a little hesitant to remove the default driver/package for a few reasons.
First, I still have no way of reliably reproducing it, especially as more recently, it can go for a while.. over a week easily without hitting it. Or I can hit it quickly, a couple times in a day.

Would it even support glamor? This is a really old motherboard GPU, predating the generation of Intel CPUs with integrated GPUs, and I don't trust it to fully implement all the OpenGL functions. And if glamor is not available and I use the modesetting driver without the Intel DDX, would I just be using pure software rendering without the basic hardware 2D acceleration?

Another member had an issue with AMD, and another forum member had an issue with a modern Coffee Lake UHD GPU, which should be using glamor, judging from a test install I did before on another machine.
With regard to bug 1710400, I never had this issue until exactly the official v.102 release, while bug 1710400 stated it was from v88.

However, if you insist, I can still try on removing xserver-xorg-video-intel

Both RandyS in this topic and a newcomer to the Mint forum who just posted today are using AMD graphics. This issue transcends the intel driver (which most of us having the problem are NOT using).

After being notified by a website that I was running an out-of-date browser (I had been using 101.0.1 since I finished testing), I decided to give 106.0.1 a try. The problem is still there. I was able to use the trick of minimize-FF/maximize-FF to get the browser screen to change and salvage some of my work and save a bookmark before I gave up and closed Firefox. I should not have to lose work just to make Firefox functional.

Is there something I can be logging that might be helpful in determining what is happening?

After redoing the same piece of work for the third time tonight because the browser froze on me in the middle of my first two attempts, I gave up on FF106.0.1 after six days. I am back on FF101.0.1.

I reviewed the pushlog changes/bug that is causing this problem and noticed there is code to handle an issue with Wayland that was not originally expected to be a problem. I'm using X Server and wonder if maybe Wayland was not the only code affected and it hits all Linux-based distros?

There's got to be some type of corner case that I and others are hitting. I've been doing my best to track memory and CPU usage, and there is no consistency between those values and when a freeze happens. It also does not relate to how long the browser has been open. So frustrating. Will try to spend some more time checking the pushlog code now that I'm back on a browser version where I can manage my time better because I'm not continually having to redo work.

I am another one who is affected by this bug. I can confirm that:

  • the freezes appear randomly
  • after a freeze happens FF is operational, but the results of scrolling are visible after minimizing and maximizing
  • after a freeze happens there is no information in the console (when FF is launched from the console), but the mechanism of displaying info from FF in the console is still working

After the update from FF ESR 91.x to FF ESR 102.x I noticed:

  • frequent freezes (every 2-3 days)
  • frequent crashes (every 1-2 days)
  • one crash after a freeze (today, when I opened a new window so that I could test if the new window is also frozen - I managed to click on File and New Window in the menu bar)

I use a 15-years-old machine with Mageia 5 linux (32bit, 4.4.114-server-1.mga5) and nvidia 384.111 driver.

I would like that someone from Mozilla confirm that they're looking into this issue, and we are not being just completely ignored...

So with more reports coming in and comment #50, maybe the commonality is Xorg?
Anyone on modern Ubuntu default and Fedora encountering this on Wayland?

Duplicate of this bug: 1794563

Happens also on X11 + KDE (Plasma 5.25.x).
Firefox 105 and 106.
Bug 1794563.

(In reply to nuromi from comment #52)

I would like that someone from Mozilla confirm that they're looking into this issue, and we are not being just completely ignored...

I believe the Firefox team hasn't even figured out what's causing it to freeze. Therefore, we will have to wait for a fix for a very long time, if it is fixed at all.

This one is really hard to grasp so far :/

It was already mentioned in comment 30, but can somebody else (apart from comment 33) confirm that running with MOZ_GTK_TITLEBAR_DECORATION=system or MOZ_GTK_TITLEBAR_DECORATION=none doesn't help with the freezes?

(In reply to randylow from comment #56)

I believe the Firefox team hasn't even figured out what's causing it to freeze. Therefore, we will have to wait for a fix for a very long time, if it is fixed at all.

Two of us have identified the code change Firefox made that is causing the problem to happen.
https://hg.mozilla.org/integration/autoland/pushloghtml?fromchange=ad30f0024f7f5677c8d0ab804d16916629cf9e97&tochange=805110b540517d2531951ea874bc9d4670eddfaf

I would like to think there is something I could be logging that would help pin point the problem in that code because it is widespread across multiple desktop environments and across multiple Linux-based distros. I am trying to find documentation myself with regards to logging, but I am starting with almost zero knowledge on the topic so my progress is very slow. :(

(In reply to Robert Mader [:rmader] from comment #57)

This one is really hard to grasp so far :/

It was already mentioned in comment 30, but can somebody else (apart from comment 33) confirm that running with MOZ_GTK_TITLEBAR_DECORATION=system or MOZ_GTK_TITLEBAR_DECORATION=none doesn't help with the freezes?

I am running the Cinnamon 5.4 desktop which uses muffin(Mutter) for windows management. I have no options to change it or to change compositing.

When I go to about:config and search for MOZ_GTK_TITLEBAR_DECORATION, it indicates my current setting is Boolean. Are you wanting me to change that to String and then add those values? Or is the fact I am using Boolean a helpful clue? Or is there some place I should be looking for the default value?

(In reply to Susan from comment #59)

When I go to about:config and search for MOZ_GTK_TITLEBAR_DECORATION, it indicates my current setting is Boolean. Are you wanting me to change that to String and then add those values? Or is the fact I am using Boolean a helpful clue? Or is there some place I should be looking for the default value?

Susan, I don't think that value is in about:config. If you type it in the search, it looks to me like it is offering to ADD the value, not modify an existing one.

As far as the setting goes, I'm willing to try about anything. But I am with you, in that I don't know how or where to set that value. Maybe it can be added to the command line parameters? A little guidance here would help us out.

(In reply to RandyS from comment #60)

(In reply to Susan from comment #59)

When I go to about:config and search for MOZ_GTK_TITLEBAR_DECORATION, it indicates my current setting is Boolean. Are you wanting me to change that to String and then add those values? Or is the fact I am using Boolean a helpful clue? Or is there some place I should be looking for the default value?

Susan, I don't think that value is in about:config. If you type it in the search, it looks to me like it is offering to ADD the value, not modify an existing one.

As far as the setting goes, I'm willing to try about anything. But I am with you, in that I don't know how or where to set that value. Maybe it can be added to the command line parameters? A little guidance here would help us out.

As far as I understand, it's an environment variable. You can set it on the same line preceding the firefox command, or you can set it globally with export MOZ_GTK_TITLEBAR_DECORATION=client etc.,
then run firefox from that shell.

Sorry, never mind about MOZ_GTK_TITLEBAR_DECORATION - as pointed out in comment 58 and before bug 1765399 is apparently to blame.

Also, just to clarify my answer in comment 33 to Martin Stránský's question about whether I "see a difference" between the different MOZ_GTK_TITLEBAR_DECORATION values: I meant that I did not see a difference in the window decoration / GUI / titlebar between any of the values. I was not referring to the freezing.

In terms of the bug 1765399 change in v102, is there anything regarding Vsync / vsyncsource specifically that we can test, either in Firefox (such as an env variable or preference) or in the system?
Currently in firefox preferences, there are 5 settings with "vsync"
And system wise, for example, is there something we can try in xrandr after we encounter the freeze to check or maybe work around it?

Also, maybe others who are running Wayland and haven't encountered the bug can switch to X11 and see if they do encounter a freeze or not?

(In reply to VJ from comment #63)

Also, maybe others who are running Wayland and haven't encountered the bug can switch to X11 and see if they do encounter a freeze or not?

I've been reviewing both bugs before going back through the code again and I see someone in the sister Bug 1780972 Comment 27 mentions they can reproduce it in GNOME Wayland which would shoot down that theory (which was just a guess on my part).

I think this may be a more fundamental issue about what is created and how long it lives.

(In reply to VJ from comment #23)

but the gist is that the logs themselves didn't seem to indicate anything out of the ordinary (comparing non-freezing vs freezing) EXCEPT for the fact that whenever the freeze occurs, the last 3 or so child processes created, as evident from the "firefox.log.child-XXX" files created, are always killed.
...
childID 591, 590, and 589 despite being the most recently created child processes, seems like they have been killed or terminated

but this is the consistent behavior I observe on my end

That is what your earlier testing results I just quoted seem to indicate. Something may be being erroneously terminated.

Plus I found Bug 1789119 which sounds like the same issue. I'll plan to check that data for clues as well.

(In reply to Susan from comment #64)
Thanks a lot for pointing out the other comment that I missed about Wayland.

Also it's great to see someone else also see the same behavior about the most recent child processes being killed (or crashing?)
What I'd like to know is how to find out what those other processes are. The process names given are somewhat nondescript, i.e. "Isolated Web Co", "Web Content", "WebExtensions", "Privileged Cont".

I noticed that not only the minimize/maximize trick leads to show the proper content of the window.

When more than one tab was open, I could click on the inactive tab and at this moment the title bar of FF changed appropriately but the window content did not change (however, it was not every time - I noticed a normal behaviour twice per approximately 100 tries). But when you click on the tab once again after a few seconds, the chance of appearance increases up to 30-40 %. Sometimes three or four clicks were needed to bring the tab into view.

Another observation - when I used a down or up arrow once and did the min/max trick once - the content either didn't move or moved slightly, and (surprisingly) when I did the min/max trick once again the content moved slightly once again.

It looks like there is a problem with handling mouse and keyboard events when the freeze occurs.
Hypothesis #1: when the freeze occurs the event queue reports to be empty when it is not empty.
Hypothesis #2: when the freeze occurs the event queue returns incorrectly its first element (returns its first element from the time before last operation of removing its first element was done).

Duplicate of this bug: 1799795

As it's related to refresh driver, can you run on terminal with

MOZ_LOG="nsRefreshDriver:5"

and attach last ~200 lines when you see the freeze? I wonder if the refresh driver is blocked.
Thanks.

I ran

firefox --MOZ_LOG="nsRefreshDriver:5"

This command spews out A LOT of data so to try and make sure I would be getting the results specifically for when the issue happened, this is what I did.

When I clicked and no action happened. I then clicked a couple of other tabs to verify it was "frozen" and then just immediately closed Firefox. I did not try the minimize maximize trick. I am running Firefox 107.0.

I then highlighted the bottom line in my terminal and kept scrolling upward until it seemed like things were changing. I ended up grabbing 538 lines. I saved those so if you need to go back more than the 200 lines I included in the file I attached, let me know.

Thanks. From the log it's not related to the refresh driver - transactions seem to be processed correctly and there isn't any block.

Please run with MOZ_LOG="Widget:5" and if you notice the freeze, check if new lines to the log are added - i.e. if mouse/button clicks are logged.

Please make sure you're running Mozilla binaries and when you notice the freeze, try to get backtrace from the Firefox:
https://fedoraproject.org/wiki/Debugging_guidelines_for_Mozilla_products#Getting_Mozilla_crash_report_from_running_application

(In reply to Susan from comment #69)

Created attachment 9304241 [details]
Susan-last 200 lines of MOZ_LOG="nsRefreshDriver:5"

I ran

firefox --MOZ_LOG="nsRefreshDriver:5"

This command spews out A LOT of data so to try and make sure I would be getting the results specifically for when the issue happened, this is what I did.

Susan, you might be able to use the 'MOZ_LOG_FILE' parameter to save output to a file, for example: MOZ_LOG_FILE=/tmp/log.txt

That idea comes from this page, in the 'Linux' section - you can use the export cmds too, instead of command line arguments. I think...

https://firefox-source-docs.mozilla.org/networking/http/logging.html

(In reply to Martin Stránský [:stransky] (ni? me) from comment #70)

Please run with MOZ_LOG="Widget:5" and if you notice the freeze, check if new lines to the log are added - i.e. if mouse/button clicks are logged.

Unless something has changed in the code since I tried this earlier (FF103? I can't recall for sure), then, yes, the mouse/button clicks are logged as I listed in comment #14.

Please make sure you're running Mozilla binaries and when you notice the freeze, try to get backtrace from the Firefox:
https://fedoraproject.org/wiki/Debugging_guidelines_for_Mozilla_products#Getting_Mozilla_crash_report_from_running_application

It's my understanding Linux Mint just repackages the Mozilla binaries so they work with our package management system, so I believe I am running Mozilla binaries. Will follow up to try and get a backtrace on the next freeze.

(In reply to Martin Stránský [:stransky] (ni? me) from comment #70)

Please make sure you're running Mozilla binaries and when you notice the freeze, try to get backtrace from the Firefox:
https://fedoraproject.org/wiki/Debugging_guidelines_for_Mozilla_products#Getting_Mozilla_crash_report_from_running_application

I was getting error messages in the terminal when I tried to kill the process, but after several different attempts (which all produced error messages?) I finally got the pop-up saying Mozilla had crashed. I told it not to restart to make sure all was stopped. I then picked up the following info when I restarted.

This was the info where I clicked the Submit button
Report ID - 30d48564-1013-484a-5db0-c44b928b8679 11/20/22, 11:42 AM

Then this came up with a View button
Report ID - bp-d51f84c9-a97d-48fa-be98-c063d0221120 11/20/22, 11:45 AM

I have not yet reviewed any of it.

(In reply to Martin Stránský [:stransky] (ni? me) from comment #70)

Thanks. From the log it's not related to refresh driver - transactions seems to be processed correctly and there isn't any block.

Please run with MOZ_LOG="Widget:5" and if you notice the freeze, check if new lines to the log are added - i.e. if mouse/button clicks are logged.

Please make sure you're running Mozilla binaries and when you notice the freeze, try to get backtrace from the Firefox:
https://fedoraproject.org/wiki/Debugging_guidelines_for_Mozilla_products#Getting_Mozilla_crash_report_from_running_application

I attached a file with ~200 lines from the terminal. The terminal and Firefox were on different monitors, so I could easily recognize the moment of the freeze (it is indicated in the attached file).

Some additional information (from the last few days):

  • Once (so far), a freeze happened a few seconds after firefox was launched, so freezes are not connected with any specific html content.
  • There are also frequent crashes of firefox, they started the same time the freezes started (after upgrade from FF ESR 91 to 102). Is it possible that the crashes are connected with the freezes? If you think it may help, I can attach messages printed in the terminal at the moment of crashes.

I found a problem in VsyncDispatcher::UpdateVsyncStatus. It is called from multiple threads but it doesn't protect the order of calls to VsyncSource::AddVsyncDispatcher and VsyncSource::RemoveVsyncDispatcher. After the state is updated but before the calls, one thread could get interrupted while another thread runs, so the state is correct but the calls end up in the wrong order. If they get out of order then in practice it will not fix itself because almost all Vsync updates stop. I had this occur with normal timing after using it for a while with my usual browser activities. Then I added a delay before the Remove call and I can duplicate the problem just by shaking the mouse pointer and pushing the scroll wheel up and down for a couple seconds. I didn't see any logging built into these functions so I added my own printfs.
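
For readers following along, the pattern described in this comment can be sketched roughly as follows. This is a toy model and not the Gecko code: ToySource, ToyDispatcher, Decide, Apply and ReplayUnluckySchedule are hypothetical names, and UpdateVsyncStatus is split into two steps here only so that the unlucky thread interleaving can be replayed deterministically on a single thread.

```cpp
#include <cassert>
#include <mutex>
#include <set>

// Hypothetical stand-in for VsyncSource: an idempotent set of dispatcher ids.
class ToySource {
 public:
  void Add(int id) {
    std::lock_guard<std::mutex> lock(mMutex);
    mDispatchers.insert(id);  // inserting an id twice has no extra effect
  }
  void Remove(int id) {
    std::lock_guard<std::mutex> lock(mMutex);
    mDispatchers.erase(id);
  }
  bool Contains(int id) {
    std::lock_guard<std::mutex> lock(mMutex);
    return mDispatchers.count(id) != 0;
  }

 private:
  std::mutex mMutex;
  std::set<int> mDispatchers;
};

enum class Action { None, Add, Remove };

// Hypothetical stand-in for VsyncDispatcher.
class ToyDispatcher {
 public:
  ToyDispatcher(int id, ToySource& source) : mId(id), mSource(source) {}

  // Step 1: update the recorded state inside the dispatcher lock and decide
  // which call to make on the source.
  Action Decide(bool wantVsync) {
    std::lock_guard<std::mutex> lock(mMutex);
    if (wantVsync == mIsObservingVsync) return Action::None;
    mIsObservingVsync = wantVsync;
    return wantVsync ? Action::Add : Action::Remove;
  }

  // Step 2: call into the source *outside* the dispatcher lock.
  void Apply(Action action) {
    if (action == Action::Add) mSource.Add(mId);
    if (action == Action::Remove) mSource.Remove(mId);
  }

  bool ThinksItIsObserving() {
    std::lock_guard<std::mutex> lock(mMutex);
    return mIsObservingVsync;
  }

 private:
  int mId;
  ToySource& mSource;
  std::mutex mMutex;
  bool mIsObservingVsync = false;
};

// Replays the unlucky schedule. Returns true if the dispatcher believes it is
// observing vsync while the source no longer has it registered.
bool ReplayUnluckySchedule() {
  ToySource source;
  ToyDispatcher dispatcher(1, source);
  dispatcher.Apply(dispatcher.Decide(true));  // initial registration

  Action a = dispatcher.Decide(false);  // "thread A" wants to stop listening
  Action b = dispatcher.Decide(true);   // "thread B" wants to start again
  dispatcher.Apply(b);                  // B's Add reaches the source first
  dispatcher.Apply(a);                  // A's Remove arrives last

  return dispatcher.ThinksItIsObserving() && !source.Contains(1);
}
```

In this model the recorded state ends up correct (observing) while the source has dropped the dispatcher, which matches the mismatch described in the comment.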

This is phenomenal work, thank you!!

(In reply to Jeff DeFouw from comment #76)

If they get out of order then in practice it will not fix itself because almost all Vsync updates stop.

Hmm, I don't understand this part - shouldn't the AddVsyncDispatcher call cause vsync to be re-enabled?

Flags: needinfo?(mstange.moz)
Status: UNCONFIRMED → NEW
Ever confirmed: true

(In reply to Markus Stange [:mstange] from comment #77)

(In reply to Jeff DeFouw from comment #76)

If they get out of order then in practice it will not fix itself because almost all Vsync updates stop.
Hmm, I don't understand this part - shouldn't the AddVsyncDispatcher call cause vsync to be re-enabled?

Since the calls were done out of order outside of the recorded state (mState) in the Dispatcher, the last call was RemoveVsyncDispatcher even though mObservers is not empty and the Dispatcher's mIsObservingVsync is true. AddVsyncDispatcher will not be called again as long as mObservers is not empty, and that appears to be the case for as long as Firefox keeps running normally. Based on how all the Observer Add/Removes also stop in my logs I assume the observers (at least one of them) are waiting for more Vsync events and they will keep themselves in the Observers until that happens. If mObservers becomes empty then an extra call to RemoveVsyncDispatcher will be made to clear mIsObservingVsync and the next AddVsyncObserver will call AddVsyncDispatcher and that will enable Vsync again. In my logs this actually happens while Firefox is closing.
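
A rough single-threaded model of why the state does not repair itself, under the assumption (taken from the explanation above) that the dispatcher only calls into the source when its own flag changes. All names here (ToySource, ToyDispatcher, Update, SimulateStuckState) are hypothetical and the real classes are more involved.

```cpp
#include <cassert>
#include <set>

// Hypothetical stand-in for VsyncSource.
struct ToySource {
  std::set<int> dispatchers;
  void Add(int id) { dispatchers.insert(id); }
  void Remove(int id) { dispatchers.erase(id); }
  bool VsyncEnabled() const { return !dispatchers.empty(); }
};

// Hypothetical stand-in for VsyncDispatcher.
struct ToyDispatcher {
  ToyDispatcher(ToySource& aSource, int aId) : source(aSource), id(aId) {}

  void AddObserver(int observer) {
    observers.insert(observer);
    Update();
  }
  void RemoveObserver(int observer) {
    observers.erase(observer);
    Update();
  }

  // Calls into the source only when the desired state differs from the flag.
  void Update() {
    bool want = !observers.empty();
    if (want == isObservingVsync) return;  // no transition, no call
    isObservingVsync = want;
    if (want) {
      source.Add(id);
    } else {
      source.Remove(id);
    }
  }

  ToySource& source;
  int id;
  std::set<int> observers;
  bool isObservingVsync = false;
};

struct StuckResult {
  bool stuckWhileObserversRemain;
  bool recoveredAfterEmpty;
};

StuckResult SimulateStuckState() {
  ToySource source;
  ToyDispatcher dispatcher(source, 1);
  dispatcher.AddObserver(10);

  // Simulate the aftermath of the race: the source has dropped the
  // dispatcher, but the dispatcher's flag still says "observing".
  source.dispatchers.erase(1);

  dispatcher.AddObserver(11);  // flag already true, so Add is not called
  bool stuck = !source.VsyncEnabled();

  dispatcher.RemoveObserver(10);
  dispatcher.RemoveObserver(11);  // observers now empty: Remove, flag cleared
  dispatcher.AddObserver(12);     // real transition: Add is called again
  bool recovered = source.VsyncEnabled();

  return {stuck, recovered};
}
```

As long as at least one observer stays registered, no further Add reaches the source in this model; recovery only happens once the observer set drains completely and a new observer arrives.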

From widget code perspective there isn't anything wrong with the provided logs/backtraces - WebRender rendering doesn't look blocked and we're getting all events from system.

(In reply to Jeff DeFouw from comment #76)

Created attachment 9304353 [details]
Log of Vsync activity showing out-of-order calls with extra delay

I found a problem in VsyncDispatcher::UpdateVsyncStatus. It is called from multiple threads but it doesn't protect the order of calls to VsyncSource::AddVsyncDispatcher and VsyncSource::RemoveVsyncDispatcher. After the state is updated but before the calls, one thread could get interrupted while another thread runs, so the state is correct but the calls end up in the wrong order. If they get out of order then in practice it will not fix itself because almost all Vsync updates stop.

Thank you for your work. I understand your explanation from the standpoint of how everything would freeze. But are those of us who are minimizing and then maximizing the browser to get something to happen on screen just getting lucky that we are picking a thread which has its calls close enough that they are not out of order, and thus something happens?

I usually have many tabs open when the freeze occurs. In order to minimize lost work, I have found if I click to scroll down the page and then minimize and then maximize Firefox, usually the page has scrolled down (as per my click) so I can see the next part of the page. I repeat that process if it is a long page. I have, at times, also been able to successfully bookmark the page so I can return to it. And change tabs to read what I had not yet seen. So sometimes we can get something to happen, but it is usually just one event at a time per min/max of the Firefox window. (It doesn't always work, but that may be a factor of at what point things got out of order?)

Then again, maybe those actions which complete relate to the point you made here and the min/max is a Vsync event?
(In reply to Jeff DeFouw from comment #78)

Based on how all the Observer Add/Removes also stop in my logs I assume the observers (at least one of them) are waiting for more Vsync events and they will keep themselves in the Observers until that happens. If mObservers becomes empty then an extra call to RemoveVsyncDispatcher will be made to clear mIsObservingVsync and the next AddVsyncObserver will call AddVsyncDispatcher and that will enable Vsync again.

(In reply to Susan from comment #80)

Thank you for your work. I understand your explanation from the standpoint of how everything would freeze. But are those of us who are minimizing and then maximizing the browser to get something to happen on screen just getting lucky that we are picking a thread which has its calls close enough that they are not out of order, and thus something happens?

There are many actions you can take that will make a call to WebRenderBridgeParent::ScheduleForcedGenerateFrame and the sequence that follows can decide to force some updating to happen without a Vsync notification and outside of the normal Vsync calls.

(In reply to Jeff DeFouw from comment #76)

Created attachment 9304353 [details]
Log of Vsync activity showing out-of-order calls with extra delay

I found a problem in VsyncDispatcher::UpdateVsyncStatus. It is called from multiple threads but it doesn't protect the order of calls to VsyncSource::AddVsyncDispatcher and VsyncSource::RemoveVsyncDispatcher.

Jeff,

I, too, very much appreciate your efforts.

When you speak of 'multiple threads,' does this relate to the various 'dom.ipc.processCount.*' settings in about:config? I've wondered before, if there is some sort of non-thread-safe issue, if setting those values that are defaulted to 4 and 8 back to 1, and trying to make FF run 'single-threaded,' might make some sort of difference. The behavior you saw just made me all the more curious.

Just for kicks, I changed all those settings to '1' this morning on a newer FF version that has frozen for me before (106.0.5), and am trying it to see what happens. But I honestly don't know if those settings are related to those 'multiple threads' you speak of - though I think they may be... Since you're somewhat familiar with the code, I thought you might know the answer.

I'll get back to this discussion on the results of my 'single-thread' testing. At this point, I figure, what the heck? I'm trying to find something that will keep the browser running...

(In reply to RandyS from comment #82)

I'll get back to this discussion on the results of my 'single-thread' testing. At this point, I figure, what the heck? I'm trying to find something that will keep the browser running...

Well, that didn't take as long as I had hoped. I was using it all morning, but it just froze. So much for that idea.

Duplicate of this bug: 1789119

(In reply to Jeff DeFouw from comment #78)

(In reply to Markus Stange [:mstange] from comment #77)

(In reply to Jeff DeFouw from comment #76)

If they get out of order then in practice it will not fix itself because almost all Vsync updates stop.
Hmm, I don't understand this part - shouldn't the AddVsyncDispatcher call cause vsync to be re-enabled?

Since the calls were done out of order outside of the recorded state (mState) in the Dispatcher, the last call was RemoveVsyncDispatcher

Ah of course, I got it now, after looking at your log more closely. The VsyncDispatcher wants to send out "Add myself, Remove myself, Add myself", but the VsyncSource ends up seeing "Add, Add, Remove", so the second Add doesn't end up having any effect, and the dispatcher remains removed. Furthermore, the VsyncDispatcher's mIsObservingVsync state is now wrong - the dispatcher thinks it's still registered at the source, but it's not.

I think there are two options to fix this: We could put the AddVsyncDispatcher/RemoveVsyncDispatcher call inside of the lock, or we can make the VsyncSource keep a "reference count" per dispatcher, so that Add,Add,Remove still ends up with a count of 1 and keeps the dispatcher registered.
I'm going to try the locking solution first. But I'll need to change some VsyncSource implementations so that they no longer call NotifyVsync from inside EnableVsync.

Assignee: nobody → mstange.moz
Status: NEW → ASSIGNED

With a random-duration sleep() in the right place I was able to reproduce this relatively easily on macOS. So this is really a cross-platform issue which just depends on (un)lucky thread scheduling.
Here's a profile captured of the freeze, on macOS, with a few extra markers: https://share.firefox.dev/3TT93Sk

(In reply to Markus Stange [:mstange] from comment #85)

I'm going to try the locking solution first. But I'll need to change some VsyncSource implementations so that they no longer call NotifyVsync from inside EnableVsync.

Hmm, WaylandVsyncSource can definitely call NotifyVsync inside EnableVsync. I'm a bit scared to touch it in a patch that I want to uplift to release, so I think I'll go with the "refcount" solution instead.

This fixes a bug which caused Firefox windows to become frozen after some time.

Full credit goes to Susan and RandyS for bisecting the regressor of this bug, and
to Jeff DeFouw for debugging the issue and finding the cause.

The bug here is a "state race" between the VsyncDispatcher state and
the VsyncSource state. Both are protected by locks, and the code that
runs in those locks respectively can see different orders of invocations.

VsyncDispatcher::UpdateVsyncStatus does this thing where it updates its state inside
a lock, gathers some information, and then calls methods on VsyncSource outside the lock.
Since it calls those methods outside the lock, these calls can end up being executed
in a different order than the state changes were observed inside the lock.

Here's the bad scenario in detail, with the same VsyncDispatcher being used from
two different threads, turning a Remove,Add into an Add,Remove:

Thread A                                       Thread B

VsyncDispatcher::UpdateVsyncStatus
 |
 |----> Enter VsyncDispatcher lock
 |    |                                         VsyncDispatcher::UpdateVsyncStatus
 |    |   state->mIsObservingVsync = false       |
 |    |   (We want to stop listening)            |
 |    |                                          |
 |<---- Exit VsyncDispatcher lock                |
 |                                               |----> Enter VsyncDispatcher lock
 |                                               |    |
 |                                               |    |   state->mIsObservingVsync = true
 |                                               |    |   (We want to start listening)
 |                                               |    |
 |                                               |<----  Exit VsyncDispatcher lock
 |                                               |
 |                                               |----> Enter VsyncSource::AddVsyncDispatcher
 |                                               |    |
 |                                               |    |----> Enter VsyncSource lock
 |                                               |    |    |
 |                                               |    |    |  state->mDispatchers.Contains(aVsyncDispatcher)
 |----> VsyncSource::RemoveVsyncDispatcher       |    |    |  VsyncDispatcher already present in list, not doing anything
 |    |                                          |    |    |
 |    |                                          |    |<---- Exit VsyncSource lock
 |    |                                          |    |
 |    |                                          |<---- Exit VsyncSource::AddVsyncDispatcher
 |    |----> Enter VsyncSource lock
 |    |    |
 |    |    |  Removing aVsyncDispatcher from state->mDispatchers
 |    |    |
 |    |<---- Exit VsyncSource lock
 |    |
 |<---- Exit VsyncSource::RemoveVsyncDispatcher

Now the VsyncDispatcher thinks it is still observing vsync, but it is
no longer registered with the VsyncSource.

This patch makes it so that two calls to AddVsyncDispatcher followed by one call
to RemoveVsyncDispatcher result in the VsyncDispatcher still being registered.
AddVsyncDispatcher is no longer idempotent.
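
As an illustration only, the "stacking" behavior described above could be modelled with a per-dispatcher count like this. RefCountedSource and its methods are hypothetical names for this sketch and the actual patch may be structured differently; ignoring an unmatched Remove is a choice made here, not something stated in the patch description.

```cpp
#include <cassert>
#include <map>
#include <mutex>

// Hypothetical stand-in for a VsyncSource that counts registrations.
class RefCountedSource {
 public:
  void Add(int id) {
    std::lock_guard<std::mutex> lock(mMutex);
    ++mCounts[id];  // each Add stacks on top of earlier ones
  }

  void Remove(int id) {
    std::lock_guard<std::mutex> lock(mMutex);
    auto it = mCounts.find(id);
    if (it == mCounts.end()) return;  // unmatched Remove: ignored in this sketch
    if (--it->second == 0) mCounts.erase(it);  // last Remove unregisters
  }

  bool IsRegistered(int id) {
    std::lock_guard<std::mutex> lock(mMutex);
    return mCounts.count(id) != 0;
  }

 private:
  std::mutex mMutex;
  std::map<int, int> mCounts;  // dispatcher id -> number of outstanding Adds
};

// The sequence the source saw in the bad scenario: Add, Add, Remove.
bool SurvivesAddAddRemove() {
  RefCountedSource source;
  source.Add(1);
  source.Add(1);
  source.Remove(1);
  return source.IsRegistered(1);
}

// Balanced calls still unregister the dispatcher.
bool UnregistersAfterBalancedCalls() {
  RefCountedSource source;
  source.Add(1);
  source.Remove(1);
  return !source.IsRegistered(1);
}
```

With a plain set, Add,Add,Remove would leave the dispatcher unregistered; with the count it stays registered until a matching second Remove arrives.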

(In reply to Markus Stange [:mstange] from comment #85)

(In reply to Jeff DeFouw from comment #78)
I think there are two options to fix this: We could put the AddVsyncDispatcher/RemoveVsyncDispatcher call inside of the lock, or we can make the VsyncSource keep a "reference count" per dispatcher, so that Add,Add,Remove still ends up with a count of 1 and keeps the dispatcher registered.
I'm going to try the locking solution first. But I'll need to change some VsyncSource implementations so that they no longer call NotifyVsync from inside EnableVsync.

I noticed there's a RecursiveMutex in xpcom while looking around. Protecting the entire UpdateVsyncStatus call with its own RecursiveMutex and RecursiveMutexAutoLock was the first idea I had as a safe and somewhat simple solution but I'm completely new to the code. It did fix my test case.
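
A minimal sketch of that locking idea, using std::recursive_mutex in place of the xpcom RecursiveMutex. The names ToySource, LockedDispatcher, onEnabled and RunLockedScenario are hypothetical. The callback models the case quoted above where a source may call NotifyVsync from inside its enable path; that re-entry is presumably why a plain (non-recursive) lock held across the whole update would not be enough.

```cpp
#include <cassert>
#include <functional>
#include <mutex>
#include <set>

// Hypothetical stand-in for VsyncSource.
class ToySource {
 public:
  // Hypothetical hook, invoked synchronously when the first dispatcher
  // registers, to mimic a source that notifies from inside its enable path.
  std::function<void()> onEnabled;

  void Add(int id) {
    bool wasEmpty = mDispatchers.empty();
    mDispatchers.insert(id);
    if (wasEmpty && onEnabled) onEnabled();
  }
  void Remove(int id) { mDispatchers.erase(id); }
  bool Contains(int id) const { return mDispatchers.count(id) != 0; }

 private:
  std::set<int> mDispatchers;
};

// Hypothetical dispatcher that holds one re-entrant lock across both the
// state update and the call into the source.
class LockedDispatcher {
 public:
  LockedDispatcher(int id, ToySource& source) : mId(id), mSource(source) {}

  void UpdateVsyncStatus(bool wantVsync) {
    std::lock_guard<std::recursive_mutex> lock(mMutex);
    if (wantVsync == mIsObservingVsync) return;
    mIsObservingVsync = wantVsync;
    // Still inside the lock, so no other thread can reorder these calls
    // relative to the state change above.
    if (wantVsync) {
      mSource.Add(mId);
    } else {
      mSource.Remove(mId);
    }
  }

  // May be reached re-entrantly while UpdateVsyncStatus holds mMutex.
  void NotifyVsync() {
    std::lock_guard<std::recursive_mutex> lock(mMutex);
    ++mNotifications;
  }

  bool IsObserving() {
    std::lock_guard<std::recursive_mutex> lock(mMutex);
    return mIsObservingVsync;
  }
  int Notifications() {
    std::lock_guard<std::recursive_mutex> lock(mMutex);
    return mNotifications;
  }

 private:
  int mId;
  ToySource& mSource;
  std::recursive_mutex mMutex;
  bool mIsObservingVsync = false;
  int mNotifications = 0;
};

bool RunLockedScenario() {
  ToySource source;
  LockedDispatcher dispatcher(1, source);
  source.onEnabled = [&dispatcher] { dispatcher.NotifyVsync(); };

  dispatcher.UpdateVsyncStatus(true);   // re-enters NotifyVsync under the lock
  dispatcher.UpdateVsyncStatus(false);
  dispatcher.UpdateVsyncStatus(true);   // source was empty again: re-enters once more

  return dispatcher.IsObserving() == source.Contains(1) &&
         dispatcher.Notifications() == 2;
}
```

The trade-off is that the source is now called with the dispatcher lock held, so any callback path has to tolerate re-entry.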

(In reply to Markus Stange [:mstange] from comment #86)

With a random-duration sleep() in the right place I was able to reproduce this relatively easily on macOS. So this is really a cross-platform issue which just depends on (un)lucky thread scheduling.

The Reddit thread mentioned in ( Bug 1789119 ) (which was merged into this bug as a duplicate) has several people running Windows saying they ran into the same issue so I'm not surprised by your results. Plus, it didn't seem like there was any specific OS distinction in the code which is one reason I found it puzzling it seemed only Linux-based distros were really finding it to be a problem.

And even more odd that only some of us were experiencing it and not everyone running Linux-based distros. We theorized in the Linux Mint thread that maybe it had something to do with the age of the computer, but it seems to me if that were the case then I would have been able to hit the issue more consistently. It truly seems erratic as to when it happens. It's more likely to happen in my case if Firefox has been up and running for ~24 hours, but I've had some happen within 20 minutes of restarting. I think it was ~6 hours after restarting this past Sunday. So very random.

Thanks for your work patching this.

(In reply to Jeff DeFouw from comment #89)

I noticed there's a RecursiveMutex in xpcom while looking around. Protecting the entire UpdateVsyncStatus call with its own RecursiveMutex and RecursiveMutexAutoLock was the first idea I had as a safe and somewhat simple solution

Good point, this would have been an option, too. I usually try to avoid re-entrant locks but I don't really remember why.

Patch seems to be green on try: https://treeherder.mozilla.org/jobs?repo=try&revision=895937c9f623389cb16c1634dd6a48d171149bbf

Pushed by mstange@themasta.com:
https://hg.mozilla.org/integration/autoland/rev/052f343e49ad
Allow stacking calls to Add/RemoveVsyncDispatcher so that we survive the sequence Add,Add,Remove. r=jrmuizel

[Tracking Requested - why for this release]: This regression was introduced in 102. It happens rarely, but if it happens, it freezes the entire browser. This bug was initially thought to only affect a few Linux configurations, but it turns out to affect all platforms.

Severity: -- → S2
Component: Widget: Gtk → Widget
Summary: SW-WR/XFCE/Intel: Firefox window randomly freeze → Firefox window randomly freezes
Duplicate of this bug: 1780972
Status: ASSIGNED → RESOLVED
Closed: 3 months ago
Resolution: --- → FIXED
Target Milestone: --- → 109 Branch

The patch landed in nightly and beta is affected.
:mstange, is this bug important enough to require an uplift?

  • If yes, please nominate the patch for beta approval.
  • If no, please set status-firefox108 to wontfix.

For more information, please visit auto_nag documentation.

Flags: needinfo?(mstange.moz)

Not tracking for 107, but setting 107 to fix-optional.
After some bake time in nightly/beta, this could be considered as a dot release ride-along if nominated for release uplift.

The fix is in Firefox Nightly now. To everyone who was seeing this bug somewhat regularly, can you test Nightly and see if the bug is indeed fixed?

I will request uplift now but we may want to wait for some confirmation that the fix worked before uplifting.

Flags: needinfo?(mstange.moz)

Comment on attachment 9304780 [details]
Bug 1781167 - Allow stacking calls to Add/RemoveVsyncDispatcher so that we survive the sequence Add,Add,Remove. r=jrmuizel

Beta/Release Uplift Approval Request

  • User impact if declined: Frozen browser after extended usage (e.g. 1 day) for some users
  • Is this code covered by automated tests?: Yes
  • Has the fix been verified in Nightly?: No
  • Needs manual test from QE?: No
  • If yes, steps to reproduce: (hard and time-intensive to reproduce)
  • List of other uplifts needed: None
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): Tightly-scoped fix.
  • String changes made/needed:
  • Is Android affected?: Yes

ESR Uplift Approval Request

  • If this is not a sec:{high,crit} bug, please state case for ESR consideration: Data loss: if this issue occurs, users have to restart the browser
  • User impact if declined: Frozen browser after extended usage (e.g. 1 day) for some users
  • Fix Landed on Version: 109
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): Tightly-scoped fix.
Attachment #9304780 - Flags: approval-mozilla-release?
Attachment #9304780 - Flags: approval-mozilla-esr102?
Attachment #9304780 - Flags: approval-mozilla-beta?
See Also: → 1800452
Duplicate of this bug: 1790206
Duplicate of this bug: 1800903
Duplicate of this bug: 1801988

There's a report that the patch didn't fix the bug: bug 1802229

Duplicate of this bug: 1800452

Comment on attachment 9304780 [details]
Bug 1781167 - Allow stacking calls to Add/RemoveVsyncDispatcher so that we survive the sequence Add,Add,Remove. r=jrmuizel

Rejecting release uplift per Comment 103, while the investigation continues.
The comments in Bug 1802229 should be investigated to decide whether this should still be considered for 108.

Attachment #9304780 - Flags: approval-mozilla-release? → approval-mozilla-release-

Comment on attachment 9304780 [details]
Bug 1781167 - Allow stacking calls to Add/RemoveVsyncDispatcher so that we survive the sequence Add,Add,Remove. r=jrmuizel

Approved for 108.0b8
This seems to have corrected the issue for some users in Bug 1802229

Attachment #9304780 - Flags: approval-mozilla-beta? → approval-mozilla-beta+

Comment on attachment 9304780 [details]
Bug 1781167 - Allow stacking calls to Add/RemoveVsyncDispatcher so that we survive the sequence Add,Add,Remove. r=jrmuizel

Approved for 102.6esr.

Attachment #9304780 - Flags: approval-mozilla-esr102? → approval-mozilla-esr102+
See Also: → 1803982
Duplicate of this bug: 1803982
No longer duplicate of this bug: 1803982
No longer duplicate of this bug: 1800903
See Also: → 1800487