Closed Bug 1718719 Opened 3 years ago Closed 3 years ago

High CPU usage of parent process when network isn't good

Categories

(Core :: Networking, defect, P2)

Firefox 91
defect

Tracking

()

RESOLVED FIXED
93 Branch
Tracking Status
firefox93 --- fixed

People

(Reporter: lilydjwg, Assigned: dragana)

References

Details

(Whiteboard: [necko-triaged])

Attachments

(1 file)

User Agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Firefox/91.0

Steps to reproduce:

  • use a pac to proxy some requests via socks5 (remote dns)
  • open a couple of tabs, use some extensions that sends background requests (e.g. Google Keep, GMail, Mastodon, the Inoreader addon)
  • wait until at least one request hangs because the proxy connects but receives no traffic and keeps waiting

Actual results:

The parent process starts to consume more than 100% CPU for some time. I notice because the fan starts spinning wildly.

When it happens, the UI is hardly responsive. I can only click on the profiler's icon and wait until firefox calms down.

Expected results:

No high CPU usage. If the request hangs, firefox should just wait until it times out. Firefox should remain responsive.

This also happens when a new tab request hangs but that's less likely for me. This has being happening for at least several days.

I don't know if the proxy plays part in this. If the request isn't going via the proxy, I won't know it hangs. I'm still looking for other clues on how it reproduces.

https://share.firefox.dev/3yc7Jj9

The Bugbug bot thinks this bug should belong to the 'Core::Networking' component, and is moving the bug to that component. Please revert this change in case you think the bot is wrong.

Component: Untriaged → Networking
Product: Firefox → Core

Based on several profiles lilydjwg shared with me, it seems that the socket thread dispatches lots of net::nsHttpActivityDistributor::ObserveActivityWithArgs to the main thread, and then the main thread spent lots of time in handling them as well as going through the normal event loop over and over again spending time on checking native events as well.

Severity: -- → S3
Priority: -- → P2
Whiteboard: [necko-triaged]

It seems that a proxy is not necessary for it to happen. I've been notified by my fan even when there is no proxy connection. lsof shows only a TCP connection to 52.88.186.163 (state is ESTABLISHED) which I believe belongs to Sync or Telemetry.

I can hardly get any more details on this. Firefox isn't responsive when it's happening and I don't get source line information from libxul.so debug symbols. This has affected me and several other people in two Telegram groups I'm in.

This doesn't only happen when idle. It happens a lot when I'm using Firefox. E.g. at exactly this moment while I'm typing. Please help!

(English is not my native language, please excuse typing errors.

A proxy is definitely not necessary --- a slow network can reproduce it several times a day. And it seems that it is more likely to happen when Firefox is in background. I'm using awesomewm, and most of the time when I switch to another tag and switch back, I would discover that Firefox freezes.

I'm not familiar with Firefox or networking stuffs, so... sorry that I don't know what other information I should provide.

This happens in clean profile and safe mode as well. My steps to reproduce:

  • prepare a good proxy and a bad proxy that connects but doesn't send back anything
  • start the good proxy
  • start firefox, configure its proxy to the good proxy
  • open some tab, wait some time
  • stop the good proxy and replace it with the bad proxy. Requests now keep waiting for data
  • firefox starts freezing and eating CPU. If not, open a new tab to send out some requests

The above proxies are socks5 proxies and used to control network. This problem doesn't happen when either the following is true:

  • firefox can never connect to the network
  • all packets are dropped (so using iptables to reproduce won't work)
  • ServerHello is received
See Also: → 1718951

similar issue here Bug 1719046

(In reply to Xidorn Quan [:xidorn] UTC+11 from comment #2)

Based on several profiles lilydjwg shared with me, it seems that the socket thread dispatches lots of net::nsHttpActivityDistributor::ObserveActivityWithArgs to the main thread, and then the main thread spent lots of time in handling them as well as going through the normal event loop over and over again spending time on checking native events as well.

net::nsHttpActivityDistributor::ObserveActivityWithArgs this should be only called when the network panel in devtool is shown. Could you check if you can still reproduce this without opening devtool?

Also, we recently fixed bug 1720079 that could cause high cpu loading. Could you try to use the latest nightly to reproduce this again? Thanks.

Flags: needinfo?(lilydjwg)

I've made a new profiler with latest nightly + an empty profile. The issue is still there. I didn't open devtools.
https://share.firefox.dev/2UTLbVJ

Flags: needinfo?(lilydjwg)

(In reply to lilydjwg from comment #8)

I've made a new profiler with latest nightly + an empty profile. The issue is still there. I didn't open devtools.
https://share.firefox.dev/2UTLbVJ

Well, it turns out that every transaction send HttpActivity to something else. Not sure if this is the problem of this high cpu usage.
I've created a build that disable sending HttpActivity. Could you give this build a try?
Thanks.

p.s., the build is created from this try push

Flags: needinfo?(lilydjwg)

Still not good: https://share.firefox.dev/3hHhLDq
It uses more CPU. 300%+ for parent and 200%+ for a Web Content for some seconds at start. On my second try the content process calms down later: https://share.firefox.dev/36L6tre

Flags: needinfo?(lilydjwg)

Could you try to get a http log? Please start logging after you start the bad proxy, otherwise the log file would be too big. Thanks.

Flags: needinfo?(lilydjwg)

It is still very large (785M). I've compressed it with xz and uploaded here: https://transfer.sh/1hfEmXO/fxnet-log.txt-main.1791000.moz_log.xz

Flags: needinfo?(lilydjwg)

Hi Reporter,

I'd like to ask for more information.

  1. Does this issue happen on previous firefox versions (e.g., ESR 78)? If not, could you try to use mozregression to find out which patch causes this issue?
  2. Does this issue only happen with socks proxy? If yes, could you try to create a new http log with adding SOCKS:5 to MOZ_LOG env?

Thanks.

Flags: needinfo?(lilydjwg)
  1. Yes. It's weird that I can reproduce it but I didn't have this issue in the past.
  2. Probably. I didn't succeeded to reproduce it with direct connections and iptables.

New log: https://transfer.sh/1vP8C6n/fxnet-log2.txt-main.2146335.moz_log.xz

Flags: needinfo?(lilydjwg)

Same here. I am in China,.

When Firefox start random freezing, generally I am opened ad Google Search page, I am using a socks5 proxy to access Google services.

This bug happen since Firefox 89, is rather annoying.

For those who can reproduce this bug, please help to use mozregression to find out the problem. This would be the fastest way. Thanks.

There is an easy way to reproduce it

  1. Install qv2ray and new a qv2ray config(just type something, it should be wrong).
  2. Open some website by Firefox such as twitch, youtube or others.
  3. Config Firefox to let it connect with socks.
  4. And you can got it.

strace shows that, Firefox is polling the socks5 connection fd. It returns POLLOUT and then Firefox tries to recvfrom the fd. This keeps repeating.

Following the code, I find that setting security.tls.enable_0rtt_data to false works around this issue. I believe this commit is relevant.

(In reply to lilydjwg from comment #20)

strace shows that, Firefox is polling the socks5 connection fd. It returns POLLOUT and then Firefox tries to recvfrom the fd. This keeps repeating.

Following the code, I find that setting security.tls.enable_0rtt_data to false works around this issue. I believe this commit is relevant.

Thanks for this information. Based on the commit you provided, I've created this test build. Could you give this a test?
Another question: did you find this commit by mozregression?
Could you also use this build to get another http log? Please add pipnss:5 to the MOZ_LOG env. Thanks.

Flags: needinfo?(lilydjwg)

Tested. It still happens. The log is here: https://transfer.sh/1wN7ZNk/fxnet-log3.txt-main.2371822.moz_log.xz

And no, mozregression doesn't work. I tested multiple versions. I chrooted to earlier system backups. I downgraded a system inside my virtual machine to half a year ago. I downgraded the proxy software (v2ray). It reproduced every time.

I read the log and searched the code, and found that commit which was recent and touched the code. I suspect that it might be that Google changed their server settings (this issue happens when connecting to Google's servers; I don't remember if there were other cases).

Flags: needinfo?(lilydjwg)

I confirm that it happens with Google, but not for Baidu, Wikipedia and this bugzilla. And it doesn't happen when Firefox has never successfully connected to Google since process start.

I kind of reproduced this issue without using a proxy, but not easy anyway.

  • Disable HTTP/3
  • Visit www.google.com once
  • Block ServerHello packets somehow. I wrote a NFQUEUE program to do that
  • Wait until the TCP connection to Google is closed (about:networking#sockets no longer shows it)
  • Refresh the google tab

The difference is that the Socket Thread uses 100% but other threads are calm. In the proxy case, Socket Thread 100%, BHMgr Monitor ~50% and the whole process ~250% (for recent nightlies. Earlier it was about ~140%).

Could you try this build and see if there is any improvement?
Note that this is just a mitigation for this problem. I added a check: if a connection waiting for 0RTT response for more than 5s, the connection will be restarted without 0RTT.
Please also get a http log with this build.
Thanks.

Flags: needinfo?(lilydjwg)

200% CPU usage, 100% for Socket Thread and the other 100% for main thread. It doesn't change after 5s.
The log is larger because I wait longer: https://transfer.sh/1IccUpx/fxnet-log4.txt-main.163995.moz_log.xz

Flags: needinfo?(lilydjwg)

Hello developers and guys, I always use Firefox 78 ESR. @Kershaw Chang
I see people reporting bugs ( this one and 1719046, 1703232 ). Allow me to report my similar problem.
Recently (about 1 or 2 months) I encounter Firefox lagging/UI not responding/high CPU (for 2 minutes) problem (just like the video in 1719046) .
I tried disabling some addons. It definitely helped but not completely solved.

Many of them said it usually happend when using proxy or in bad network connection. I feel the same.
People also said such problem is since Firefox 87 or 89, which is quite consistent with my problem time. So I think there's probability that my problem is related to this bug.

(In reply to Kershaw Chang [:kershaw OOO until Aug 9th] from comment #15)

Hi, I've created a test build based on this patch.
Could you check if things are improved?

Thanks.

Flags: needinfo?(lilydjwg)

It still happens. 200% CPU usage, 100% for Socket Thread and the other 100% for main thread. Compared to nightly 2021-08-09, 50-60% less CPU usage as the BHMgr Monitor thread is quiet.

The proxy fd is still being polled POLLIN|POLLOUT|POLLPRI and returns POLLOUT, then a recvfrom appears and gets EAGAIN.

New log file: https://transfer.sh/1tIdmIC/fxnet-log4.txt-main.1974308.moz_log.xz

Flags: needinfo?(lilydjwg)

Thanks for testing this.
However, it seems that 0RTT is not used at all according to the log. Could you try to enable 0RTT and test again?
Also, could you explain how you tested this briefly? Did you block the server hello packet from www.google.com permanently?

Flags: needinfo?(lilydjwg)

Weird. Every time I test, I start with an empty profile, and the 0RTT pref is enabled by default.
I'm using the steps in comment 5, by switching proxies. The "block server hello" method (in comment 24) takes more time because I need to wait the connection to be dropped. I've uploaded the program I use here: https://github.com/lilydjwg/drop-serverhello

Flags: needinfo?(lilydjwg)

Patch for bug 1382886 landed. Can you try to reproduce the issue? If you can reproduce the issue please make a http log.

Thanks!

Flags: needinfo?(lilydjwg)

Nothing seems to have changed, log https://transfer.sh/29U/fxnet-log5.txt-main.3179748.moz_log.xz
nightly version 2021-08-19 (zh-CN variant).

Flags: needinfo?(lilydjwg)

Thank you for the log. I think I know what is causing this:

  • you are using proxy, therefore Firefox(necko) does not try to use 0RTT, but the nss library still enables 0RTT
  • my patch from bug 1382886 mitigates the problem in case that necko knows that 0RTT is used, butt when proxy is use necko does not check that.

I will extend the fix from bug 1382886 to proxies as well.

Assignee: nobody → dd.mozilla

Currently we do not use the early-data when a proxy is used, but we do not disable it in the nss layer.

Pushed by ddamjanovic@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/028b3aadad28 Disable the TLS early-data if proxy is use r=necko-reviewers,keeler,kershaw
Flags: needinfo?(dd.mozilla)
Pushed by ddamjanovic@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/c33b8afdbd6f Disable the TLS early-data if proxy is use r=necko-reviewers,keeler,kershaw
Status: UNCONFIRMED → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → 93 Branch
Regressions: 1729298
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: