Some periodic job in Mozilla Firefox keeps one CPU core 100% occupied for long periods of time
Categories
(Core :: Networking, defect, P2)
Tracking
()
Performance Impact | low |
People
(Reporter: aros, Assigned: dragana)
References
Details
(Keywords: perf:responsiveness, Whiteboard: [necko-triaged])
Attachments
(2 files)
User Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:69.0) Gecko/20100101 Firefox/69.0
Steps to reproduce:
I first thought it was due to my add-ons and now I'm running Firefox in -safe-mode and the issue still persists.
On average each two minutes Firefox starts using one CPU core up to two minutes 30 seconds. That's driving me insane because I don't know what's going on.
Maybe Firefox is optimizing its SQLite databases but I'm not sure about that since inotifywait doesn't show any .db files accesses during these CPU hogging time periods.
I've just written down several such periods:
06:11:05 - 06:12:40
01:11:05 - 01:12:40
01:13:05 - 01:15:20
01:15:25 - 01:17:34
01:18:35 - 01:19:15
01:19:45 - 01:20:05
01:21:10 - 01:21:40
01:21:55 - 01:22:15
I'm now filing this bug report and Firefox again is doing something. There's nothing in about:performance.
The process which is abusing one CPU core for so long is the root process:
423 pts/3 Ss 0:00 _ bash
464 pts/3 Sl+ 24:20 _ firefox -safe-mode
567 pts/3 Sl+ 2:29 _ /opt/firefox/firefox-bin -contentproc -childID 1 -isForBrowser -prefsLen 1 -prefMapSize 208448 -safeMode -parentBuildID 20190827005903 -greomni /op
606 pts/3 Sl+ 0:02 _ /opt/firefox/firefox-bin -contentproc -childID 2 -isForBrowser -prefsLen 137 -prefMapSize 208448 -safeMode -parentBuildID 20190827005903 -greomni /
643 pts/3 Sl+ 0:00 _ /opt/firefox/firefox-bin -contentproc -childID 3 -isForBrowser -prefsLen 6112 -prefMapSize 208448 -safeMode -parentBuildID 20190827005903 -greomni
707 pts/3 Sl+ 0:05 _ /opt/firefox/firefox-bin -contentproc -childID 4 -isForBrowser -prefsLen 6780 -prefMapSize 208448 -safeMode -parentBuildID 20190827005903 -greomni
21958 pts/3 Sl+ 0:00 _ /opt/firefox/firefox-bin -contentproc -childID 5 -isForBrowser -prefsLen 7496 -prefMapSize 208448 -safeMode -parentBuildID 20190827005903 -greomni
I.e. pid 464.
I tried stracing it but the log file is so huge (1,5GB) I'm unable to understand what's going on. At the time of stracing I wasn't doing anything in Firefox - it was minimized.
If Mozilla developers are interested in solving this bug I can email you privately my entire profile because this bug is really nasty, annoying and really bad.
I'm using Firefox 69 from Mozilla ftp:
https://ftp.mozilla.org/pub/firefox/releases/69.0/linux-x86_64/en-US/firefox-69.0.tar.bz2
Reporter | ||
Updated•5 years ago
|
Reporter | ||
Comment 1•5 years ago
|
||
This bug doesn't prevent Firefox from shutting down fast, so whatever that it's doing can be interrupted successfully and fast.
Reporter | ||
Comment 2•5 years ago
|
||
The most weird thing about this bug is that sometimes it triggers, sometimes not. i.e. I'm now running Firefox and everything is OK.
Comment 3•5 years ago
|
||
Hello,
I tried to reproduce the cpu usage while having 3 instances of Firefox opened at the same time: release 69.0, beta 70.0b6 and Nightly 71.0a1 (2019-09-16) and didn't see any abnormal activity. There were spikes up to 60% sometimes (depends on the website, downloads, etc.).
There might be something wrong with your profile. Can you please create a performance profile when you encounter the issue again? We need that to check out whats going wrong, here is the link with the addon: link : https://profiler.firefox.com/ .
You can try this step as well : firefox menu > options > privacy & security panel, check the setting to "Prevent accessibility services from accessing your browser" in the permissions section and restart once the changes were performed.
Reporter | ||
Comment 4•5 years ago
|
||
I will reopen this bug report once I am able to collect the required information.
Reporter | ||
Comment 5•5 years ago
|
||
Reproduced on another PC with the same profile:
perfht.ml / 2NrH4LO
Reporter | ||
Comment 6•5 years ago
|
||
The original PC dump:
perfht.ml / 2mmScNU
Comment 7•5 years ago
|
||
I have added a component to this issue in order to get the dev team involved. If it's no the correct one feel free to change it.
Comment 8•5 years ago
|
||
Thanks, Artem. Unfortunately, the Firefox profiler is currently not very well-suited for investigating background thread CPU usage.
Since you're on Linux, could you try to use the linux command line tool perf
to get a profile, and then upload it through profiler.firefox.com? See https://perf-html.io/docs/#/./guide-perf-profiling for instructions.
Also, can you do this on Nightly, if possible? Otherwise the perf
profile will not contain useful symbols.
Reporter | ||
Comment 9•5 years ago
|
||
Will do that once I'm able to repro the bug.
On same days it never occurs, on some days it does.
Reporter | ||
Updated•5 years ago
|
Reporter | ||
Comment 10•5 years ago
|
||
I have to reopen this bug report because the bug persists in Firefox 75/76 beta even without any add-ons installed.
It manifests itself when I'm either enabling or disabling VPN (not for work, for accessing the Internet via a third party).
It looks like Firefox opens a connection to some server(s) (DNS/DOT, some updates, etc. - I've no idea) and when my Internet routing changes drastically this connection breaks and something starts pegging the CPU like crazy.
Reporter | ||
Updated•5 years ago
|
Reporter | ||
Comment 11•5 years ago
|
||
A similar bug 344977 was opened 11 years ago but it was never formally fixed.
There are quite a lot of reports on the Internet as well: https://google.com/search?q=firefox+high+cpu+use+vpn
Speaking of the Internet routing.
When I'm accessing the Internet from different Internet locations, I get different IP addresses from DNS for the same servers and these IP addresses might be accessible only from the locations they were given to me, i.e.
location1 -> DNS -> server1.ipaddress1.
location2 -> server1.ipaddress1 now becomes inaccessible because it's only designated for location1.
Reporter | ||
Comment 12•5 years ago
|
||
What's weird is that I have
network.dnsCacheExpiration=5
network.dnsCacheExpirationGracePeriod=5
which in theory should have solved this issue for me but it doesn't.
It's still possible I'm hitting a completely different error than what I'm describing here.
Comment 13•5 years ago
|
||
This profile shows us using a LOT of CPU in SocketThread doing SSL handshakes - a short handshake burst is normal, but not 20% continuously for 8+ seconds. Everything else seems basically idle.
https://perfht.ml/2XUvUn5
Were these taken with it capturing all threads (there a lot, so probably you did). If not, then in the extension, setting the threadlist field to "," (a bare comma) will capture all (nsThread) threads (it doesn't capture OS threads or raw pthreads used by some subsystems).
Updated•5 years ago
|
Updated•5 years ago
|
Updated•5 years ago
|
Comment 14•5 years ago
|
||
I see the majority of the time in that profile being on the socket receive during the TLS setup, so this looks like slow network in this trace. That's corroborated by the 15 seconds of waiting at PR_Poll
for incoming data.
It's possible that something is working together wrong here to cause such long waits, but I am unsure what combination that might be.
I regularly use Firefox through IPSec tunnels as well as proxied over SSH, and have to say I haven't encountered this behavior myself. But to be fair, it's been a long time since I used anything other than SSH tunneling on Linux, and this is sufficiently low-level that I suspect it's a Linux + That VPN Setup combination.
VPN details are sensitive I know, but it'd be good to know what you're able to share on this, and maybe whether your connection is going through an interception proxy that might be also slowing things down in some way.
Comment 15•4 years ago
|
||
Waiting on a response to comment 13 from reporter.
Reporter | ||
Comment 16•4 years ago
|
||
(In reply to Randell Jesup [:jesup] (needinfo me) from comment #13)
This profile shows us using a LOT of CPU in SocketThread doing SSL handshakes - a short handshake burst is normal, but not 20% continuously for 8+ seconds. Everything else seems basically idle.
https://perfht.ml/2XUvUn5Were these taken with it capturing all threads (there a lot, so probably you did). If not, then in the extension, setting the threadlist field to "," (a bare comma) will capture all (nsThread) threads (it doesn't capture OS threads or raw pthreads used by some subsystems).
Disabling my VPN connection makes Firefox eat CPU indefinitely either until I close it entirely or reestablish the connection.
Must be related to network.trr.mode=2 set locally but I haven't tested.
Reporter | ||
Comment 17•4 years ago
|
||
This is reproducible in Firefox 80 as well.
Comment 18•4 years ago
|
||
In the profiler extension, can you capture a new profile using the "custom" settings with the "Bypass selections above and record all registered threads" checkbox checked? (this is what we were asking of you in comment 13)
Reporter | ||
Comment 19•4 years ago
|
||
Again, steps to reproduce:
- Start Firefox with a new profile
- Set network.trr.mode to 2
- Connect to a VPN from another region altogether
- Disconnect from the VPN
Result: see a single CPU core pegged at 100% indefinitely until you reconnect.
As soon as I started profiling the issue sort of disappeared - instead of a single core being pegged, a dozen other showed some CPU usage.
Reporter | ||
Comment 20•4 years ago
|
||
Comment 21•4 years ago
|
||
And you're sure it's Firefox that is using the CPU? As far as I can tell, that profile just shows Firefox waiting for user input - it's not doing anything.
In any case, this sounds like a networking issue.
Reporter | ||
Comment 22•4 years ago
|
||
(In reply to Dana Keeler (she/her) (use needinfo) (:keeler for reviews) from comment #21)
And you're sure it's Firefox that is using the CPU? As far as I can tell, that profile just shows Firefox waiting for user input - it's not doing anything.
In any case, this sounds like a networking issue.
top
, htop
and ps
all say the load is caused by firefox, to be more precise its parent process.
Actually with Firefox 80.0.1 the bug now manifests itself quite rarely.
Comment 23•4 years ago
|
||
I am interested in the networking status after disconnecting the VPN.
Could you try to get the http log when you see this issue happens again?
Thanks.
Reporter | ||
Comment 24•4 years ago
|
||
(In reply to Kershaw Chang [:kershaw, OOO until 19.10] from comment #23)
I am interested in the networking status after disconnecting the VPN.
Could you try to get the http log when you see this issue happens again?
1.5GB of logs for five seconds: https://drive.google.com/file/d/1nfEYCTkLnEdlsJJX-aMOqFQ3xcCXhd4V/view?usp=sharing
Comment 25•4 years ago
|
||
Comment 26•4 years ago
|
||
It seems we get stuck in this loop until the timer fires:
2020-10-12 11:49:09.528043 UTC - [Parent 231769: Socket Thread]: I/nsHttp Http2Session::ReadTimeoutTick 0x7f7ea39c5000 delta since last read 67s
2020-10-12 11:49:09.528044 UTC - [Parent 231769: Socket Thread]: I/nsHttp Http2Session::ReadTimeoutTick 0x7f7ea39c5000 handle outstanding ping
2020-10-12 11:49:09.528045 UTC - [Parent 231769: Socket Thread]: I/nsHttp Http2Session::ReadTimeoutTick 0x7f7ea39c5000 Ping Timer Exhaustion
2020-10-12 11:49:09.528047 UTC - [Parent 231769: Socket Thread]: I/nsHttp Http2Session::Close 0x7f7ea39c5000 804B000E
and we finally close the connection.
The cause of why we get stuck there is less obvious to me. I think the bug is related to mWaitingFor0RTTResponse being true.
I've marked the lines we exercise here and here
Dragana, can you tell what's wrong here?
Based on comment 16 it would seem we're not properly cleaning up the sockets on failure + VPN network change?
Assignee | ||
Comment 27•4 years ago
|
||
I am wondering why this is happening. We are calling DriveHandshake which should call the socket and pick up an error, but the socket is not called or it returns NS_OK (mSocketOutCondition==NS_OK). Also the fact that OnSocketWritable is triggered immediately means that socket is closed already so NSS never reads from the socket.
Artem, can you post an log that capture the time you disconnect a VPN? The looping above does not need to happen, I want to see if we detect that VPN is disconnected.
Reporter | ||
Comment 28•4 years ago
|
||
(In reply to Dragana Damjanovic [:dragana] from comment #27)
I am wondering why this is happening. We are calling DriveHandshake which should call the socket and pick up an error, but the socket is not called or it returns NS_OK (mSocketOutCondition==NS_OK). Also the fact that OnSocketWritable is triggered immediately means that socket is closed already so NSS never reads from the socket.
Artem, can you post an log that capture the time you disconnect a VPN? The looping above does not need to happen, I want to see if we detect that VPN is disconnected.
I'm not sure it's been logged but here it is:
https://drive.google.com/file/d/1AM2XLbHGqwAi00gqo2u-tIlDnnM1HvKd/view?usp=sharing
Assignee | ||
Updated•4 years ago
|
Reporter | ||
Comment 30•3 years ago
|
||
It was fixed either in Firefox 88 or earlier. I haven't encountered the bug for at least a month already.
Updated•3 years ago
|
Description
•