Open Bug 1981051 Opened 2 months ago Updated 1 month ago

Firefox hangs on MacOS for seconds, getting worse over time

Categories

(Core :: Performance: Responsiveness, defect)

Firefox 141
defect

Tracking

()

People

(Reporter: jamesvd, Unassigned, NeedInfo)

References

(Depends on 1 open bug)

Details

Attachments

(2 files, 1 obsolete file)

User Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:141.0) Gecko/20100101 Firefox/141.0

Steps to reproduce:

Pretty much anything. Normal web browsing from a fresh browser. As the day goes on, pauses quickly become more pronounced and disruptive.

Actual results:

Doing anything on any page pauses for seconds, locking up all actions, including typing, scrolling, etc. I get a beach ball, then it resumes and catches up on typing. It gets worse the longer the browser is open, until it completely freezes. The last time that happened, I was even unable to force quit Firefox from the activity panel.

This happens frequently. Right now, it's happened about 5 times during the typing on this page.

Profile: https://share.firefox.dev/3IXIvQd

Could you capture another profile, with these changes to the profiler settings?

  • Add ,IPC I/O,IPDL to the "Add custom threads by name:" textbox
  • Under Features, also enable "Native Stacks" and "IPC Messages"

Thanks!

Flags: needinfo?(jamesvd)

Requested profiles are attached. This certainly wasn't the browser at its worst, but it takes a while for the problem to build up, and I've had to restart my browser for company-required updates today.

I might have found something that relates to this. It seems like when I switched to the default Browser Privacy setting (from Strict) and disabled the DNS over HTTPS settings, things got better, but I didn't leave it running for long before I re-enabled them to try to capture this profile.

Let me know if you need more info.

Flags: needinfo?(jamesvd)

Uh oh, things are looking pretty rough in that profile!

I'd say the profile indicates that there are a lot of blob URLs around, and that Firefox isn't dealing with them very well. Can you go to about:memory, click Measure, and see which page or add-on has created all those blob URLs?

Flags: needinfo?(jamesvd)

In the profile, launching a new content process took one full second, with the time spent in IPCBlobUtils::Serialize, PContentParent::SendInitBlobURLs, and IPC::ParamTraits<mozilla::dom::BlobURLRegistrationData>::Read. But even more importantly, it looks like whenever any content process is shutting down, all IPC to the parent process is blocked for 4 seconds! Those four seconds are spent in Node::DestroyAllPortsWithPeer on the IPC I/O thread. Nika, who would be the right person to look into this?

Flags: needinfo?(nika)
Component: Performance: General → Performance: Responsiveness

(In reply to Markus Stange [:mstange] from comment #4)

> In the profile, launching a new content process took one full second, with the time spent in IPCBlobUtils::Serialize, PContentParent::SendInitBlobURLs, and IPC::ParamTraits<mozilla::dom::BlobURLRegistrationData>::Read.

That's quite unfortunate. IIRC BlobURL stuff is something which has been a huge pain in the past, and I remember us having some discussions about wanting to massively reduce the amount of stuff we need to send down for it, but I don't know how feasible that is right now. Some quick searching found bug 1619943, which looks like an old project to improve that situation which doesn't appear to have moved.

> But even more importantly, it looks like whenever any content process is shutting down, all IPC to the parent process is blocked for 4 seconds! Those four seconds are spent in Node::DestroyAllPortsWithPeer on the IPC I/O thread. Nika, who would be the right person to look into this?

That's not great! The Node::DestroyAllPortsWithPeer code is spending a lot of time specifically within TaskQueue::Dispatch waiting for the mQueueMonitor. When we lose connection with a process, we need to notify any actors which are still connected to that process that it is now gone (so that they can clean up), and to do that we dispatch a runnable to the nsISerialEventTarget corresponding to each actor.

It seems that in this particular case, we have a very large number of outstanding actors in the parent process, each of which is bound to a TaskQueue, and we're dispatching OnNotifyMaybeChannelError events to each of them to notify them that they are dying (https://searchfox.org/mozilla-central/rev/00d2cc8ebe323e0cde5619004c588d5e08ad1f46/ipc/glue/MessageChannel.cpp#2078-2082). The monitor on these TaskQueues appears to unfortunately be quite contended, meaning that we're spending a lot of time waiting for the mutex to be passed back and forth between threads, leading to the IPC I/O thread being blocked for an extended period of time.

My vague guess is that the bulk of these actors are all bound to the same TaskQueue. This would help explain the high contention we're seeing, as over the 4 seconds, we would be both trying to acquire the mutex on the task queue to run the events, as well as trying to acquire it on the IO thread in order to queue events. Given that in this profile you're already seeing a very large number of Blobs, I'm guessing these may all be PRemoteLazyInputStream actors (which share a single TaskQueue in the parent process).

It might be feasible to reduce the contention a bit here by changing how TaskQueue is implemented to allow it to acquire the Monitor less while running. I'm not 100% sure what that would look like, but I expect we could perhaps move multiple entries from the TaskQueue's queue into the Runner at a time, to allow multiple tasks to be retired before re-acquiring the lock. I'll leave a NI? for myself to look into this more.
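To make the batching idea concrete, here is a minimal Python model of it. This is not Gecko code: `ModelTaskQueue`, its methods, and the batch size are all hypothetical names chosen for illustration. It only counts how often the runner side takes the lock, which is the quantity the proposal aims to reduce.

```python
import threading
from collections import deque


class ModelTaskQueue:
    """Toy lock-protected task queue (hypothetical, not Gecko's TaskQueue)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._tasks = deque()
        self.runner_lock_acquisitions = 0  # how often the runner took the lock

    def dispatch(self, task):
        # Producer side: every dispatch takes the lock once to enqueue.
        with self._lock:
            self._tasks.append(task)

    def _take(self, max_tasks):
        # Runner side: one lock acquisition removes up to max_tasks entries.
        with self._lock:
            self.runner_lock_acquisitions += 1
            count = min(max_tasks, len(self._tasks))
            return [self._tasks.popleft() for _ in range(count)]

    def run_one_at_a_time(self):
        # Baseline: re-acquire the lock before every single task.
        while True:
            batch = self._take(1)
            if not batch:
                return
            batch[0]()

    def run_in_batches(self, batch_size):
        # Batched variant: retire several tasks per lock acquisition.
        while True:
            batch = self._take(batch_size)
            if not batch:
                return
            for task in batch:
                task()  # tasks run with the lock released


def count_runner_acquisitions(num_tasks, batch_size=None):
    """Queue num_tasks trivial tasks, drain them, and report runner lock takes."""
    queue = ModelTaskQueue()
    done = []
    for i in range(num_tasks):
        queue.dispatch(lambda i=i: done.append(i))
    if batch_size is None:
        queue.run_one_at_a_time()
    else:
        queue.run_in_batches(batch_size)
    assert done == list(range(num_tasks))  # FIFO order is preserved either way
    return queue.runner_lock_acquisitions
```

In this single-threaded model, draining 1000 tasks costs 1001 runner-side acquisitions one at a time and 17 with batches of 64 (the final acquisition in each case is the one that finds the queue empty). It does not model contention itself, only the number of opportunities for the runner and the dispatching thread to collide.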

Changing this on the IPC side to somehow detect that all of the notifications are going to the same event target seems more difficult, as there are a lot of abstraction layers between Node::DestroyAllPortsWithPeer and the actual dispatch, so our best bet is probably making the dispatch cheaper. (As a side note, reducing the number of Blob URLs which need to be broadcast could also improve the situation here.)

Attached file memory-report.json.gz (obsolete) —

I attached a memory report, but this isn't from the same session as before. Nothing has changed settings-wise since the last one, I just restarted my computer and there are different tabs.

Flags: needinfo?(jamesvd)

I'll add that some of the artifacts I'm seeing are:

  1. Typing pauses when typing anything, including this comment. ~1-3s each time.
  2. When scrolling pages, the new part of the page coming up from the bottom is blank (white) until a few seconds later, even if the scrolling doesn't hang.
  3. When switching tabs of an already loaded page, I see a blank page with a loading indicator (gray radiating lines in a circle) for a few seconds before the content appears.
  4. Often when switching tabs I'll get a spinning "beach ball" for a mouse cursor while I wait for the tab to load.
  5. YouTube videos freeze, but the sound continues playing. This can last 5-10s. No audio stuttering.

All of the symptoms seem to get worse as the day goes on. If the browser is acting very slow, when I go to close it, the window will close, but the process will remain non-responsive in Activity Monitor for about a minute before it finally closes.

In case it matters, I'm on a 2025 Mac M4 Pro w/ 24 GB RAM. No other software on my machine is pausing or even slowed down during these hiccups.

Attached file memory-report.json.gz

Replaced the previous memory report after realizing that anonymizing the report removed the extension names/urls.

Attachment #9505581 - Attachment is obsolete: true

When you captured the memory report, was Firefox in the state where it was performing slowly?

Yes, though not as bad as it sometimes gets. I'll keep this window open for the rest of the day and tomorrow and capture a new report when things get really slow.

It's "SquareX Enterprise - Spreedly" that created the blob URLs.

curl -Ls 'https://bugzilla.mozilla.org/attachment.cgi?id=9505587' | gunzip | grep -c 'memory-blob-urls/owner(moz-extension:.*15237d20-dab3-4143-b9e7-1bc847749b7d' counts 202011 blob URLs (that's over 200 000) for the add-on with the ID 15237d20-dab3-4143-b9e7-1bc847749b7d, which is "SquareX Enterprise - Spreedly".
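For anyone who wants the same count broken down by owner rather than for one hard-coded ID, here is a rough Python equivalent of the grep above. It assumes, based only on the pattern in that command, that each blob URL shows up in the decompressed report as text containing `memory-blob-urls/owner(<owner>)` and that the owner string contains no closing parenthesis. The function names are made up for illustration.

```python
import gzip
import re
from collections import Counter

# Assumed shape, inferred from the grep pattern above:
#   memory-blob-urls/owner(<owner>)
OWNER_RE = re.compile(r"memory-blob-urls/owner\(([^)]*)\)")


def count_blob_urls_by_owner(lines):
    """Count blob URL entries per owner across an iterable of text lines."""
    counts = Counter()
    for line in lines:
        for owner in OWNER_RE.findall(line):
            counts[owner] += 1
    return counts


def count_blob_urls_in_report(path):
    """Same count, reading a gzipped memory report from disk."""
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as report:
        return count_blob_urls_by_owner(report)
```

Calling `count_blob_urls_in_report("memory-report.json.gz").most_common(10)` on a downloaded report would list the ten owners with the most blob URLs. Unlike `grep -c`, this counts every match rather than every matching line, so the two can differ if a line contains more than one entry.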

I've filed bug 1981596 to handle the developer outreach aspect of this issue.

Status: UNCONFIRMED → NEW
Ever confirmed: true

And I've filed bug 1981600 about the fact that the memory report has over 8GB of heap-unclassified.

Ok, thanks! I suspected it might be that since it's something I'm not familiar with. It's installed by my company and I can't disable it.

Is there anything I can do to help or follow up with?

Here's a new profile with it being very slow and locking up every few seconds: https://share.firefox.dev/4m5Azer

See Also: → 1981596

I don't think this will fix the issue, because we'll still be creating a vast number of blob URIs, but with bug 1983309 we might be able to improve, at least a bit, the lock contention which appears in the profile.

Flags: needinfo?(nika)
See Also: → 1983309

The severity field is not set for this bug.
:mstange, could you have a look please?

For more information, please visit BugBot documentation.

Flags: needinfo?(mstange.moz)