Closed Bug 1678446 Opened 4 years ago Closed 4 years ago

Slow page loading/rendering after upgrading to MacOS 11.0.1 Big Sur with blocked Socket thread, and "close" system calls taking one second each

Categories

(Core :: Networking, defect)

Firefox 83
defect

Tracking

()

RESOLVED INVALID
Performance Impact high

People

(Reporter: numbernine.mozilla, Unassigned)

Details

(Keywords: perf:pageload)

Attachments

(3 files)

User Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.16; rv:83.0) Gecko/20100101 Firefox/83.0

Steps to reproduce:

  • Start with a new, clean Firefox profile.
  • Open a new tab.
  • Enter search terms into the URL bar, e.g. "this is a test search" and press Enter
  • Results page begins to load, albeit very slowly (e.g. some graphical elements take a long time to appear)
  • Page appears fully rendered a small eternity later. :-)

Actual results:

Expected page appears, albeit slower than it would normally and slower when compared to other browsers on the same computer (e.g. Safari, Google Chrome).

Expected results:

Page loads as snappily as it usually does.

Hello!

As requested by Reddit user /u/mstange on this this Reddit thread ( https://old.reddit.com/r/firefox/comments/jud0ei/firefox_82_seems_slow_after_upgrading_to_macos_11/ ), I've submitted this issue as a new bug report.

This issue may or may not be related to Bug 1678116 - I commented on that Reddit thread assuming that it might be, but I won't guarantee that's the case.

As requested by mstange, you can also examine some about:networking logs here: https://drive.google.com/file/d/1CyYF4W5liA5YFsRhrAu8fgLT0r7gOV1W/view?usp=sharing

Also: results of running the Profiler while I perform the steps listed in the initial bug report can be found here:
https://share.firefox.dev/36PMVBP

Needinfoing Markus so he sees this and can pick a better component (I guessed Layout based on the janky part of the profile but that's probably incorrect - though perhaps less incorrect than General :).

Component: General → Layout
Flags: needinfo?(mstange.moz)

Thanks for the bug report, Matt!

This is a networking issue. The socket thread frequently blocks for a full second when calling close: https://share.firefox.dev/35OEx6f

Unfortunately we don't see the full call stacks for the blocking calls, because we don't have stacks on Release builds on macOS (bug 1451902).

Status: UNCONFIRMED → NEW
Component: Layout → Networking
Ever confirmed: true
Flags: needinfo?(mstange.moz)
Summary: Slow page loading/rendering after upgrading to MacOS 11.0.1 Big Sur → Slow page loading/rendering after upgrading to MacOS 11.0.1 Big Sur with blocked Socket thread, and "close" system calls taking one second each
Whiteboard: [qf]

Thanks for responding Markus!

If I read Bug 1451902 correctly, might I be able to provide better networking debug information if I try a Nightly release instead? (Assuming the issue persists in Nightly, anyway.)

That's right - if this reproduces in Nightly, a new profile (with Socket thread, as before) would indeed be helpful.

Alright, here we go!

Profiler results: https://share.firefox.dev/2IVDuJ1
Logs: https://drive.google.com/file/d/1smq_avNtKB-vt4BZExsDkz-4W-BVEj5j/view?usp=sharing

Hopefully these will be more illuminating. :)

I should've mentioned: those were taken with 85.0a1.

Thanks!

This is the blocking stack:

close
ssl_DefClose
nsNSSSocketInfo::CloseSocketAndDestroy()
mozilla::net::nsSocketTransport::ReleaseFD_Locked(PRFileDesc*)
mozilla::net::nsSocketTransport::OnSocketDetached(PRFileDesc*)
mozilla::net::nsSocketTransportService::DetachSocket(mozilla::net::nsSocketTransportService::SocketContext*, mozilla::net::nsSocketTransportService::SocketContext*)
mozilla::net::nsSocketTransportService::Run()

At this point we need somebody from the networking team to take a closer look.

I suppose another useful piece of information would be kernel stacks.

Matt, can you run the following dtrace scripts on the command line, as described below?

sudo dtrace -n 'profile-1001 /pid == 43459 && tid == 8917219/ { @[stack(), ustack()] = count(); }' > on-cpu-stacks.txt
sudo dtrace -n 'sched:::off-cpu /pid == 43459 && tid == 8917219/ { @[stack(), ustack()] = count(); }' > off-cpu-stacks.txt

First, find out the process id and thread id of the Socket thread in the Firefox Nightly parent process. You can do this by grabbing a profile as before, and then hovering the first Socket process row in the thread list at the top. For example, for your profile from comment 7 (https://share.firefox.dev/2IVDuJ1), the Socket thread has pid 1065 and tid 21228.
Then, replace 43459 with the right pid and 8917219 with the right tid in the command line pieces that I wrote above.

Then, for each of the two dtrace commands:

  1. Execute the command in Terminal. If it asks for your password, enter it and press enter.
  2. Perform the slow page load actions in Firefox.
  3. Back in Terminal, press Ctrl+C.

So we'll capture two different page loads in two different text files. Then, attach both files here in the bug.

An alternative to this would be to use Instruments, which comes with Xcode, which is a very large download. Instruments also has the ability to sample kernel callstacks.

Whiteboard: [qf] → [qf:p1:pageload]
Attached file off-cpu-stacks.txt
Attached file on-cpu-stacks.txt

Maruks: Done! I hope I did that right. :)

Well done!

Here are some kernel stacks for soclose from the off-cpu log. So I think these are the stacks where the CPU starts sleeping.

              kernel`thread_block_reason+0xaf
              kernel`lck_mtx_sleep_deadline+0x73
              kernel`0xffffff800f05bb60+0xdd
              kernel`cfil_sock_close_wait+0x276
              kernel`soclose_locked+0xda
              kernel`soclose+0x9c
              kernel`0xffffff800f017290+0x13e
              kernel`0xffffff800f018c80+0x480
              kernel`unix_syscall64+0x27b
              kernel`hndl_unix_scall64+0x16
              kernel`thread_block_reason+0xaf
              kernel`return_to_iret+0x110
              kernel`cfil_sock_detach+0x28a
              kernel`soclose_locked+0xea
              kernel`soclose+0x9c
              kernel`0xffffff800f017290+0x13e
              kernel`0xffffff800f018c80+0x480
              kernel`unix_syscall64+0x27b
              kernel`hndl_unix_scall64+0x16
              kernel`thread_block_reason+0xaf
              kernel`kernel_trap+0x311
              kernel`0xffffff800ea61a09+0x26
              kernel`hw_lock_unlock+0x50
              kernel`waitq_wakeup64_all_locked+0xb9
              kernel`waitq_wakeup64_all+0x88
              kernel`selwakeup+0x35
              kernel`sowakeup+0x10a
              kernel`sorflush+0x53
              kernel`sofreelastref+0x275
              kernel`soclose_locked+0x622
              kernel`soclose+0x9c
              kernel`0xffffff800f017290+0x13e
              kernel`0xffffff800f018c80+0x480
              kernel`unix_syscall64+0x27b
              kernel`hndl_unix_scall64+0x16

The function names cfil_sock_close_wait and lck_mtx_sleep_deadline sound suspicious. Maybe we can find some information on why we're hitting this path by reading the Darwin source code.

cfil_sock_close_wait is implemented in this file: https://github.com/apple/darwin-xnu/blob/master/bsd/net/content_filter.c

It talks about using the "Network Extension Control Policy" database to find socket filters. Here's some documentation about NetworkExtension: https://developer.apple.com/documentation/networkextension?language=objc

Matt, do you use any VPN services, or firewalls, or similar?

Oh! I don't have any VPNs installed, but I am running Little Snitch as a firewall. I installed the latest release (5.0.2) at the same time as I switched to Big Sur, as the older 4.x versions of Little Snitch don't work with Big Sur.

I've tried turning Little Snitch off ("Stop Network Filter" in the LS preferences) but it doesn't seem to change any behaviour in Firefox though. With that said, maybe they're affecting each other in some other way?

Thought I'd provide an update after spending some more time debugging this today!

I may have spoken too quickly in my previous comment - it seems that in Little Snitch 5.0.2 (i.e. the most recent release), simply turning off the Network Filter in Little Snitch's preferences does not disable the MacOS Network Filter used by Little Snitch itself (see attached screen shot). Armed with this knowledge, I decided to try uninstalling Little Snitch altogether and thus removing the associated network filter - and Firefox began working perfectly!

Taking it one step further, it seems that it is still possible to install the older Little Snitch 4.6 version in Big Sur, but it requires "blessing" the kernel extension required (see https://www.obdev.at/support/littlesnitch/245913651253917 ). I decided to try this approach instead, and Little Snitch 4.6 also works perfectly with Firefox in Big Sur!

At this point, I think it's safe to say that my experience with Firefox and Little Snitch 5.0.2 in Big Sur is either a bug in Little Snitch's (re)implementation using the MacOS Network Extension framework or a bug in that framework itself. I'll try to contact Little Snitch support and see if this is a known issue or if they can provide any suggestions, referring to this bug that provided a lot of interesting debugging information so far. :-)

Thank you Markus for your time going down the rabbit hole of this bug with me! Hopefully this discussion helps some other unfortunate person in the future.

Thanks for the update! I'm glad we were able to track this down to the source.

I'm interested in whether Firefox is doing anything special to trigger this timeout in Little Snitch's socket close handler, and whether it can be worked around from our side. It would be great to have the Little Snitch developers add their thoughts on this bug.

Agreed! I submitted a bug report to them just a couple hours ago, and I provided a link to this bug report as well as my observations about which LS versions exacerbate the behaviour (i.e. it's only LS 5.0.2 that has issues). Hopefully one of their devs can respond here, but if they only respond to me directly then I'll pass the response along here!

Not sure what we can do here.
I'd like to put this in our triage pool for a while and wait for Little Snitch's response.

Quick update: I received a response to my support request from Christian at Objective Development, and they had the following to say:

Hi Matthew,
you are an absolute genius! Thanks for shedding some light on this issue!

We have had similar reports for Chrome, but we cannot reproduce any of this here. It somehow seemed to be related to Pi-Hole remapping several requests to 0.0.0.0. Users could fix it by letting their Pi-Hole return NXDOMAIN for blocked names.

Having first-hand info from the browser end is great! I have reported this directly to Apple, hoping for a fix. As far as I can tell, we should have no way to delay socket close, unless a connection alert is on screen.

So: As far as I can tell, there is no direct way for us to fix it. We must wait for a fix from Apple.

Regards, Christian.

As it stands, it sounds like they believe this is a bug on Apple's side (and I'm inclined to agree). Hopefully someone from Apple will help investigate further!

Just wanted to chime in and say that I also had a heck of a time pinpointing this and it also led me to LittleSnitch 5. I have also been in touch with Christian and he pretty much confirmed the same information. Interestingly enough I had been noticing a severe impact on Chrome but less severe on Firefox. Safari seemed unaffected interestingly enough.

For me it was NextDNS (similar to Pi-Hole) returning 0.0.0.0 and :: for blocked domains, though I don't believe it has the option of returning NXDOMAIN as a workaround so I ended up uninstalling LS. Good to know there's still a way to get 4.x to work on Big Sur.

From our email thread:

I can reproduce a 1s delay with

    time curl -v https://0.0.0.0/

In our currently shipping product, we get a call to decide (alllow/deny/ask/look at data) and decide that we want to see the data for Deep Packet Inspection. This results in another immediate call "data complete". We don't change our mind and say "show me the data". 1s later we receive the "socket closed" notification.

Just to see whether we can mitigate the problem with other responses, I have made a version which responds with "allow" to address 0.0.0.0 in any case. We don't get another call with "data complete" in this case, but the close is still one second delayed. So we have no way to work around the problem. Sorry.

I do hope Apple responds to this at some point.

Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → INVALID
Performance Impact: --- → P1
Keywords: perf:pageload
Whiteboard: [qf:p1:pageload]
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: