1718719 - High CPU usage of parent process when network isn't good

lilydjwg

Reporter

Description

•

3 years ago

User Agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Firefox/91.0

Steps to reproduce:

use a pac to proxy some requests via socks5 (remote dns)
open a couple of tabs, use some extensions that sends background requests (e.g. Google Keep, GMail, Mastodon, the Inoreader addon)
wait until at least one request hangs because the proxy connects but receives no traffic and keeps waiting

Actual results:

The parent process starts to consume more than 100% CPU for some time. I notice because the fan starts spinning wildly.

When it happens, the UI is hardly responsive. I can only click on the profiler's icon and wait until firefox calms down.

Expected results:

No high CPU usage. If the request hangs, firefox should just wait until it times out. Firefox should remain responsive.

This also happens when a new tab request hangs but that's less likely for me. This has being happening for at least several days.

I don't know if the proxy plays part in this. If the request isn't going via the proxy, I won't know it hangs. I'm still looking for other clues on how it reproduces.

https://share.firefox.dev/3yc7Jj9

BugBot [:suhaib / :marco/ :calixte]

Comment 1

•

3 years ago

The Bugbug bot thinks this bug should belong to the 'Core::Networking' component, and is moving the bug to that component. Please revert this change in case you think the bot is wrong.

Component: Untriaged → Networking

Product: Firefox → Core

Xidorn Quan [:xidorn] UTC+10

Comment 2

•

3 years ago

Based on several profiles lilydjwg shared with me, it seems that the socket thread dispatches lots of net::nsHttpActivityDistributor::ObserveActivityWithArgs to the main thread, and then the main thread spent lots of time in handling them as well as going through the normal event loop over and over again spending time on checking native events as well.

Dragana Damjanovic [:dragana]

Assignee

Updated

•

3 years ago

Severity: -- → S3

Priority: -- → P2

Whiteboard: [necko-triaged]

lilydjwg

Reporter

Comment 3

•

3 years ago

It seems that a proxy is not necessary for it to happen. I've been notified by my fan even when there is no proxy connection. lsof shows only a TCP connection to 52.88.186.163 (state is ESTABLISHED) which I believe belongs to Sync or Telemetry.

I can hardly get any more details on this. Firefox isn't responsive when it's happening and I don't get source line information from libxul.so debug symbols. This has affected me and several other people in two Telegram groups I'm in.

This doesn't only happen when idle. It happens a lot when I'm using Firefox. E.g. at exactly this moment while I'm typing. Please help!

khr_91

Comment 4

•

3 years ago

(English is not my native language, please excuse typing errors.

A proxy is definitely not necessary --- a slow network can reproduce it several times a day. And it seems that it is more likely to happen when Firefox is in background. I'm using awesomewm, and most of the time when I switch to another tag and switch back, I would discover that Firefox freezes.

I'm not familiar with Firefox or networking stuffs, so... sorry that I don't know what other information I should provide.

lilydjwg

Reporter

Comment 5

•

3 years ago

This happens in clean profile and safe mode as well. My steps to reproduce:

prepare a good proxy and a bad proxy that connects but doesn't send back anything
start the good proxy
start firefox, configure its proxy to the good proxy
open some tab, wait some time
stop the good proxy and replace it with the bad proxy. Requests now keep waiting for data
firefox starts freezing and eating CPU. If not, open a new tab to send out some requests

The above proxies are socks5 proxies and used to control network. This problem doesn't happen when either the following is true:

firefox can never connect to the network
all packets are dropped (so using iptables to reproduce won't work)
ServerHello is received

Kershaw Chang [:kershaw]

Updated

•

3 years ago

Comment 6

•

3 years ago

similar issue here Bug 1719046

Kershaw Chang [:kershaw]

Comment 7

•

3 years ago

(In reply to Xidorn Quan [:xidorn] UTC+11 from comment #2)

Based on several profiles lilydjwg shared with me, it seems that the socket thread dispatches lots of net::nsHttpActivityDistributor::ObserveActivityWithArgs to the main thread, and then the main thread spent lots of time in handling them as well as going through the normal event loop over and over again spending time on checking native events as well.

net::nsHttpActivityDistributor::ObserveActivityWithArgs this should be only called when the network panel in devtool is shown. Could you check if you can still reproduce this without opening devtool?

Also, we recently fixed bug 1720079 that could cause high cpu loading. Could you try to use the latest nightly to reproduce this again? Thanks.

Flags: needinfo?(lilydjwg)

lilydjwg

Reporter

Comment 8

•

3 years ago

I've made a new profiler with latest nightly + an empty profile. The issue is still there. I didn't open devtools.
https://share.firefox.dev/2UTLbVJ

Flags: needinfo?(lilydjwg)

Kershaw Chang [:kershaw]

Comment 9

•

3 years ago

•

Edited

(In reply to lilydjwg from comment #8)

I've made a new profiler with latest nightly + an empty profile. The issue is still there. I didn't open devtools.
https://share.firefox.dev/2UTLbVJ

Well, it turns out that every transaction send HttpActivity to something else. Not sure if this is the problem of this high cpu usage.
I've created a build that disable sending HttpActivity. Could you give this build a try?
Thanks.

p.s., the build is created from this try push

Flags: needinfo?(lilydjwg)

lilydjwg

Reporter

Comment 10

•

3 years ago

Still not good: https://share.firefox.dev/3hHhLDq
It uses more CPU. 300%+ for parent and 200%+ for a Web Content for some seconds at start. On my second try the content process calms down later: https://share.firefox.dev/36L6tre

Flags: needinfo?(lilydjwg)

Kershaw Chang [:kershaw]

Comment 13

•

3 years ago

Could you try to get a http log? Please start logging after you start the bad proxy, otherwise the log file would be too big. Thanks.

Flags: needinfo?(lilydjwg)

lilydjwg

Reporter

Comment 14

•

3 years ago

It is still very large (785M). I've compressed it with xz and uploaded here: https://transfer.sh/1hfEmXO/fxnet-log.txt-main.1791000.moz_log.xz

Flags: needinfo?(lilydjwg)

Kershaw Chang [:kershaw]

Comment 15

•

3 years ago

Hi Reporter,

I'd like to ask for more information.

Does this issue happen on previous firefox versions (e.g., ESR 78)? If not, could you try to use mozregression to find out which patch causes this issue?
Does this issue only happen with socks proxy? If yes, could you try to create a new http log with adding SOCKS:5 to MOZ_LOG env?

Thanks.

Flags: needinfo?(lilydjwg)

lilydjwg

Reporter

Comment 16

•

3 years ago

Yes. It's weird that I can reproduce it but I didn't have this issue in the past.
Probably. I didn't succeeded to reproduce it with direct connections and iptables.

New log: https://transfer.sh/1vP8C6n/fxnet-log2.txt-main.2146335.moz_log.xz

Flags: needinfo?(lilydjwg)

muzuiget

Comment 17

•

3 years ago

Same here. I am in China,.

When Firefox start random freezing, generally I am opened ad Google Search page, I am using a socks5 proxy to access Google services.

This bug happen since Firefox 89, is rather annoying.

Kershaw Chang [:kershaw]

Comment 18

•

3 years ago

For those who can reproduce this bug, please help to use mozregression to find out the problem. This would be the fastest way. Thanks.

zxz1054855541

Comment 19

•

3 years ago

There is an easy way to reproduce it

Install qv2ray and new a qv2ray config(just type something, it should be wrong).
Open some website by Firefox such as twitch, youtube or others.
Config Firefox to let it connect with socks.
And you can got it.

lilydjwg

Reporter

Comment 20

•

3 years ago

strace shows that, Firefox is polling the socks5 connection fd. It returns POLLOUT and then Firefox tries to recvfrom the fd. This keeps repeating.

Following the code, I find that setting security.tls.enable_0rtt_data to false works around this issue. I believe this commit is relevant.

Kershaw Chang [:kershaw]

Comment 21

•

3 years ago

(In reply to lilydjwg from comment #20)

strace shows that, Firefox is polling the socks5 connection fd. It returns POLLOUT and then Firefox tries to recvfrom the fd. This keeps repeating.

Following the code, I find that setting security.tls.enable_0rtt_data to false works around this issue. I believe this commit is relevant.

Thanks for this information. Based on the commit you provided, I've created this test build. Could you give this a test?
Another question: did you find this commit by mozregression?
Could you also use this build to get another http log? Please add pipnss:5 to the MOZ_LOG env. Thanks.

Flags: needinfo?(lilydjwg)

lilydjwg

Reporter

Comment 22

•

3 years ago

Tested. It still happens. The log is here: https://transfer.sh/1wN7ZNk/fxnet-log3.txt-main.2371822.moz_log.xz

And no, mozregression doesn't work. I tested multiple versions. I chrooted to earlier system backups. I downgraded a system inside my virtual machine to half a year ago. I downgraded the proxy software (v2ray). It reproduced every time.

I read the log and searched the code, and found that commit which was recent and touched the code. I suspect that it might be that Google changed their server settings (this issue happens when connecting to Google's servers; I don't remember if there were other cases).

Flags: needinfo?(lilydjwg)

lilydjwg

Reporter

Comment 23

•

3 years ago

I confirm that it happens with Google, but not for Baidu, Wikipedia and this bugzilla. And it doesn't happen when Firefox has never successfully connected to Google since process start.

lilydjwg

Reporter

Comment 24

•

3 years ago

I kind of reproduced this issue without using a proxy, but not easy anyway.

Disable HTTP/3
Visit www.google.com once
Block ServerHello packets somehow. I wrote a NFQUEUE program to do that
Wait until the TCP connection to Google is closed (about:networking#sockets no longer shows it)
Refresh the google tab

The difference is that the Socket Thread uses 100% but other threads are calm. In the proxy case, Socket Thread 100%, BHMgr Monitor ~50% and the whole process ~250% (for recent nightlies. Earlier it was about ~140%).

Kershaw Chang [:kershaw]

Comment 25

•

3 years ago

Could you try this build and see if there is any improvement?
Note that this is just a mitigation for this problem. I added a check: if a connection waiting for 0RTT response for more than 5s, the connection will be restarted without 0RTT.
Please also get a http log with this build.
Thanks.

Flags: needinfo?(lilydjwg)

lilydjwg

Reporter

Comment 26

•

3 years ago

200% CPU usage, 100% for Socket Thread and the other 100% for main thread. It doesn't change after 5s.
The log is larger because I wait longer: https://transfer.sh/1IccUpx/fxnet-log4.txt-main.163995.moz_log.xz

Flags: needinfo?(lilydjwg)

garywill

Comment 28

•

3 years ago

Hello developers and guys, I always use Firefox 78 ESR. @Kershaw Chang
I see people reporting bugs ( this one and 1719046, 1703232 ). Allow me to report my similar problem.
Recently (about 1 or 2 months) I encounter Firefox lagging/UI not responding/high CPU (for 2 minutes) problem (just like the video in 1719046) .
I tried disabling some addons. It definitely helped but not completely solved.

Many of them said it usually happend when using proxy or in bad network connection. I feel the same.
People also said such problem is since Firefox 87 or 89, which is quite consistent with my problem time. So I think there's probability that my problem is related to this bug.

(In reply to Kershaw Chang [:kershaw OOO until Aug 9th] from comment #15)

Kershaw Chang [:kershaw]

Comment 29

•

3 years ago

Hi, I've created a test build based on this patch.
Could you check if things are improved?

Thanks.

Flags: needinfo?(lilydjwg)

lilydjwg

Reporter

Comment 30

•

3 years ago

It still happens. 200% CPU usage, 100% for Socket Thread and the other 100% for main thread. Compared to nightly 2021-08-09, 50-60% less CPU usage as the BHMgr Monitor thread is quiet.

The proxy fd is still being polled POLLIN|POLLOUT|POLLPRI and returns POLLOUT, then a recvfrom appears and gets EAGAIN.

New log file: https://transfer.sh/1tIdmIC/fxnet-log4.txt-main.1974308.moz_log.xz

Flags: needinfo?(lilydjwg)

Kershaw Chang [:kershaw]

Comment 31

•

3 years ago

Thanks for testing this.
However, it seems that 0RTT is not used at all according to the log. Could you try to enable 0RTT and test again?
Also, could you explain how you tested this briefly? Did you block the server hello packet from www.google.com permanently?

Flags: needinfo?(lilydjwg)

lilydjwg

Reporter

Comment 32

•

3 years ago

Weird. Every time I test, I start with an empty profile, and the 0RTT pref is enabled by default.
I'm using the steps in comment 5, by switching proxies. The "block server hello" method (in comment 24) takes more time because I need to wait the connection to be dropped. I've uploaded the program I use here: https://github.com/lilydjwg/drop-serverhello

Flags: needinfo?(lilydjwg)

Dragana Damjanovic [:dragana]

Assignee

Comment 33

•

3 years ago

Patch for bug 1382886 landed. Can you try to reproduce the issue? If you can reproduce the issue please make a http log.

Thanks!

Flags: needinfo?(lilydjwg)

lilydjwg

Reporter

Comment 34

•

3 years ago

Nothing seems to have changed, log https://transfer.sh/29U/fxnet-log5.txt-main.3179748.moz_log.xz
nightly version 2021-08-19 (zh-CN variant).

Flags: needinfo?(lilydjwg)

Dragana Damjanovic [:dragana]

Assignee

Comment 35

•

3 years ago

Thank you for the log. I think I know what is causing this:

you are using proxy, therefore Firefox(necko) does not try to use 0RTT, but the nss library still enables 0RTT
my patch from bug 1382886 mitigates the problem in case that necko knows that 0RTT is used, butt when proxy is use necko does not check that.

I will extend the fix from bug 1382886 to proxies as well.

Assignee: nobody → dd.mozilla

Dragana Damjanovic [:dragana]

Assignee

Comment 36

•

3 years ago

Attached file Disable the TLS early-data if proxy is use — Details

Currently we do not use the early-data when a proxy is used, but we do not disable it in the nss layer.

Pulsebot

Comment 37

•

3 years ago

Pushed by ddamjanovic@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/028b3aadad28 Disable the TLS early-data if proxy is use r=necko-reviewers,keeler,kershaw

Sandor Molnar[:smolnar]

Comment 38

•

3 years ago

Backed out for causing bustages in build-fuzzing.

Backout link: https://hg.mozilla.org/integration/autoland/rev/4bf4379056766929dece43246db61770a4ac90df

Push with failures

Failure log

Flags: needinfo?(dd.mozilla)

Dragana Damjanovic [:dragana]

Assignee

Comment 39

•

3 years ago

https://treeherder.mozilla.org/jobs?repo=try&tier=1%2C2%2C3&revision=3103ab72264a75736a3588923a78c4b416aa03b0

Flags: needinfo?(dd.mozilla)

Pulsebot

Comment 40

•

3 years ago

Pushed by ddamjanovic@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/c33b8afdbd6f Disable the TLS early-data if proxy is use r=necko-reviewers,keeler,kershaw

Cristian Tuns

Comment 41

•

3 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/c33b8afdbd6f

Status: UNCONFIRMED → RESOLVED

Closed: 3 years ago

status-firefox93: --- → fixed

Resolution: --- → FIXED

Target Milestone: --- → 93 Branch

Andra Esanu (needinfo me)

Updated

•

3 years ago

Regressions: 1729298