Slow page loading/rendering after upgrading to MacOS 11.0.1 Big Sur with blocked Socket thread, and "close" system calls taking one second each
Categories
(Core :: Networking, defect)
Tracking
()
Performance Impact | high |
People
(Reporter: numbernine.mozilla, Unassigned)
Details
(Keywords: perf:pageload)
Attachments
(3 files)
User Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.16; rv:83.0) Gecko/20100101 Firefox/83.0
Steps to reproduce:
- Start with a new, clean Firefox profile.
- Open a new tab.
- Enter search terms into the URL bar, e.g. "this is a test search" and press Enter
- Results page begins to load, albeit very slowly (e.g. some graphical elements take a long time to appear)
- Page appears fully rendered a small eternity later. :-)
Actual results:
Expected page appears, albeit slower than it would normally and slower when compared to other browsers on the same computer (e.g. Safari, Google Chrome).
Expected results:
Page loads as snappily as it usually does.
Hello!
As requested by Reddit user /u/mstange on this this Reddit thread ( https://old.reddit.com/r/firefox/comments/jud0ei/firefox_82_seems_slow_after_upgrading_to_macos_11/ ), I've submitted this issue as a new bug report.
This issue may or may not be related to Bug 1678116 - I commented on that Reddit thread assuming that it might be, but I won't guarantee that's the case.
As requested by mstange, you can also examine some about:networking logs here: https://drive.google.com/file/d/1CyYF4W5liA5YFsRhrAu8fgLT0r7gOV1W/view?usp=sharing
Also: results of running the Profiler while I perform the steps listed in the initial bug report can be found here:
https://share.firefox.dev/36PMVBP
Comment 3•4 years ago
|
||
Needinfoing Markus so he sees this and can pick a better component (I guessed Layout based on the janky part of the profile but that's probably incorrect - though perhaps less incorrect than General :).
Comment 4•4 years ago
|
||
Thanks for the bug report, Matt!
This is a networking issue. The socket thread frequently blocks for a full second when calling close
: https://share.firefox.dev/35OEx6f
Unfortunately we don't see the full call stacks for the blocking calls, because we don't have stacks on Release builds on macOS (bug 1451902).
Thanks for responding Markus!
If I read Bug 1451902 correctly, might I be able to provide better networking debug information if I try a Nightly release instead? (Assuming the issue persists in Nightly, anyway.)
Comment 6•4 years ago
|
||
That's right - if this reproduces in Nightly, a new profile (with Socket thread, as before) would indeed be helpful.
Alright, here we go!
Profiler results: https://share.firefox.dev/2IVDuJ1
Logs: https://drive.google.com/file/d/1smq_avNtKB-vt4BZExsDkz-4W-BVEj5j/view?usp=sharing
Hopefully these will be more illuminating. :)
Comment 9•4 years ago
|
||
Thanks!
This is the blocking stack:
close
ssl_DefClose
nsNSSSocketInfo::CloseSocketAndDestroy()
mozilla::net::nsSocketTransport::ReleaseFD_Locked(PRFileDesc*)
mozilla::net::nsSocketTransport::OnSocketDetached(PRFileDesc*)
mozilla::net::nsSocketTransportService::DetachSocket(mozilla::net::nsSocketTransportService::SocketContext*, mozilla::net::nsSocketTransportService::SocketContext*)
mozilla::net::nsSocketTransportService::Run()
At this point we need somebody from the networking team to take a closer look.
Comment 10•4 years ago
|
||
I suppose another useful piece of information would be kernel stacks.
Matt, can you run the following dtrace scripts on the command line, as described below?
sudo dtrace -n 'profile-1001 /pid == 43459 && tid == 8917219/ { @[stack(), ustack()] = count(); }' > on-cpu-stacks.txt
sudo dtrace -n 'sched:::off-cpu /pid == 43459 && tid == 8917219/ { @[stack(), ustack()] = count(); }' > off-cpu-stacks.txt
First, find out the process id and thread id of the Socket thread in the Firefox Nightly parent process. You can do this by grabbing a profile as before, and then hovering the first Socket process row in the thread list at the top. For example, for your profile from comment 7 (https://share.firefox.dev/2IVDuJ1), the Socket thread has pid 1065 and tid 21228.
Then, replace 43459 with the right pid and 8917219 with the right tid in the command line pieces that I wrote above.
Then, for each of the two dtrace commands:
- Execute the command in Terminal. If it asks for your password, enter it and press enter.
- Perform the slow page load actions in Firefox.
- Back in Terminal, press Ctrl+C.
So we'll capture two different page loads in two different text files. Then, attach both files here in the bug.
An alternative to this would be to use Instruments, which comes with Xcode, which is a very large download. Instruments also has the ability to sample kernel callstacks.
Updated•4 years ago
|
Reporter | ||
Comment 11•4 years ago
|
||
Reporter | ||
Comment 12•4 years ago
|
||
Reporter | ||
Comment 13•4 years ago
|
||
Maruks: Done! I hope I did that right. :)
Comment 14•4 years ago
|
||
Well done!
Here are some kernel stacks for soclose
from the off-cpu log. So I think these are the stacks where the CPU starts sleeping.
kernel`thread_block_reason+0xaf
kernel`lck_mtx_sleep_deadline+0x73
kernel`0xffffff800f05bb60+0xdd
kernel`cfil_sock_close_wait+0x276
kernel`soclose_locked+0xda
kernel`soclose+0x9c
kernel`0xffffff800f017290+0x13e
kernel`0xffffff800f018c80+0x480
kernel`unix_syscall64+0x27b
kernel`hndl_unix_scall64+0x16
kernel`thread_block_reason+0xaf
kernel`return_to_iret+0x110
kernel`cfil_sock_detach+0x28a
kernel`soclose_locked+0xea
kernel`soclose+0x9c
kernel`0xffffff800f017290+0x13e
kernel`0xffffff800f018c80+0x480
kernel`unix_syscall64+0x27b
kernel`hndl_unix_scall64+0x16
kernel`thread_block_reason+0xaf
kernel`kernel_trap+0x311
kernel`0xffffff800ea61a09+0x26
kernel`hw_lock_unlock+0x50
kernel`waitq_wakeup64_all_locked+0xb9
kernel`waitq_wakeup64_all+0x88
kernel`selwakeup+0x35
kernel`sowakeup+0x10a
kernel`sorflush+0x53
kernel`sofreelastref+0x275
kernel`soclose_locked+0x622
kernel`soclose+0x9c
kernel`0xffffff800f017290+0x13e
kernel`0xffffff800f018c80+0x480
kernel`unix_syscall64+0x27b
kernel`hndl_unix_scall64+0x16
The function names cfil_sock_close_wait
and lck_mtx_sleep_deadline
sound suspicious. Maybe we can find some information on why we're hitting this path by reading the Darwin source code.
Comment 15•4 years ago
|
||
cfil_sock_close_wait
is implemented in this file: https://github.com/apple/darwin-xnu/blob/master/bsd/net/content_filter.c
It talks about using the "Network Extension Control Policy" database to find socket filters. Here's some documentation about NetworkExtension: https://developer.apple.com/documentation/networkextension?language=objc
Matt, do you use any VPN services, or firewalls, or similar?
Reporter | ||
Comment 16•4 years ago
|
||
Oh! I don't have any VPNs installed, but I am running Little Snitch as a firewall. I installed the latest release (5.0.2) at the same time as I switched to Big Sur, as the older 4.x versions of Little Snitch don't work with Big Sur.
I've tried turning Little Snitch off ("Stop Network Filter" in the LS preferences) but it doesn't seem to change any behaviour in Firefox though. With that said, maybe they're affecting each other in some other way?
Comment hidden (obsolete) |
Reporter | ||
Comment 18•4 years ago
|
||
Thought I'd provide an update after spending some more time debugging this today!
I may have spoken too quickly in my previous comment - it seems that in Little Snitch 5.0.2 (i.e. the most recent release), simply turning off the Network Filter in Little Snitch's preferences does not disable the MacOS Network Filter used by Little Snitch itself (see attached screen shot). Armed with this knowledge, I decided to try uninstalling Little Snitch altogether and thus removing the associated network filter - and Firefox began working perfectly!
Taking it one step further, it seems that it is still possible to install the older Little Snitch 4.6 version in Big Sur, but it requires "blessing" the kernel extension required (see https://www.obdev.at/support/littlesnitch/245913651253917 ). I decided to try this approach instead, and Little Snitch 4.6 also works perfectly with Firefox in Big Sur!
At this point, I think it's safe to say that my experience with Firefox and Little Snitch 5.0.2 in Big Sur is either a bug in Little Snitch's (re)implementation using the MacOS Network Extension framework or a bug in that framework itself. I'll try to contact Little Snitch support and see if this is a known issue or if they can provide any suggestions, referring to this bug that provided a lot of interesting debugging information so far. :-)
Thank you Markus for your time going down the rabbit hole of this bug with me! Hopefully this discussion helps some other unfortunate person in the future.
Reporter | ||
Comment 19•4 years ago
|
||
Comment 20•4 years ago
|
||
Thanks for the update! I'm glad we were able to track this down to the source.
I'm interested in whether Firefox is doing anything special to trigger this timeout in Little Snitch's socket close handler, and whether it can be worked around from our side. It would be great to have the Little Snitch developers add their thoughts on this bug.
Reporter | ||
Comment 21•4 years ago
|
||
Agreed! I submitted a bug report to them just a couple hours ago, and I provided a link to this bug report as well as my observations about which LS versions exacerbate the behaviour (i.e. it's only LS 5.0.2 that has issues). Hopefully one of their devs can respond here, but if they only respond to me directly then I'll pass the response along here!
Comment 22•4 years ago
|
||
Not sure what we can do here.
I'd like to put this in our triage pool for a while and wait for Little Snitch's response.
Reporter | ||
Comment 23•4 years ago
|
||
Quick update: I received a response to my support request from Christian at Objective Development, and they had the following to say:
Hi Matthew,
you are an absolute genius! Thanks for shedding some light on this issue!
We have had similar reports for Chrome, but we cannot reproduce any of this here. It somehow seemed to be related to Pi-Hole remapping several requests to 0.0.0.0. Users could fix it by letting their Pi-Hole return NXDOMAIN for blocked names.
Having first-hand info from the browser end is great! I have reported this directly to Apple, hoping for a fix. As far as I can tell, we should have no way to delay socket close, unless a connection alert is on screen.
So: As far as I can tell, there is no direct way for us to fix it. We must wait for a fix from Apple.
Regards, Christian.
As it stands, it sounds like they believe this is a bug on Apple's side (and I'm inclined to agree). Hopefully someone from Apple will help investigate further!
Comment 24•4 years ago
|
||
Just wanted to chime in and say that I also had a heck of a time pinpointing this and it also led me to LittleSnitch 5. I have also been in touch with Christian and he pretty much confirmed the same information. Interestingly enough I had been noticing a severe impact on Chrome but less severe on Firefox. Safari seemed unaffected interestingly enough.
For me it was NextDNS (similar to Pi-Hole) returning 0.0.0.0 and :: for blocked domains, though I don't believe it has the option of returning NXDOMAIN as a workaround so I ended up uninstalling LS. Good to know there's still a way to get 4.x to work on Big Sur.
From our email thread:
I can reproduce a 1s delay with
time curl -v https://0.0.0.0/
In our currently shipping product, we get a call to decide (alllow/deny/ask/look at data) and decide that we want to see the data for Deep Packet Inspection. This results in another immediate call "data complete". We don't change our mind and say "show me the data". 1s later we receive the "socket closed" notification.
Just to see whether we can mitigate the problem with other responses, I have made a version which responds with "allow" to address 0.0.0.0 in any case. We don't get another call with "data complete" in this case, but the close is still one second delayed. So we have no way to work around the problem. Sorry.
I do hope Apple responds to this at some point.
Updated•4 years ago
|
Updated•3 years ago
|
Description
•