1581130 - Some periodic job in Mozilla Firefox keeps one CPU core 100% occupied for long periods of time

Artem S. Tashkinov

Reporter

Description

•

5 years ago

Attached file strace.7z — Details

User Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:69.0) Gecko/20100101 Firefox/69.0

Steps to reproduce:

I first thought it was due to my add-ons and now I'm running Firefox in -safe-mode and the issue still persists.

On average each two minutes Firefox starts using one CPU core up to two minutes 30 seconds. That's driving me insane because I don't know what's going on.

Maybe Firefox is optimizing its SQLite databases but I'm not sure about that since inotifywait doesn't show any .db files accesses during these CPU hogging time periods.

I've just written down several such periods:

06:11:05 - 06:12:40
01:11:05 - 01:12:40
01:13:05 - 01:15:20
01:15:25 - 01:17:34
01:18:35 - 01:19:15
01:19:45 - 01:20:05
01:21:10 - 01:21:40
01:21:55 - 01:22:15

I'm now filing this bug report and Firefox again is doing something. There's nothing in about:performance.

The process which is abusing one CPU core for so long is the root process:

423 pts/3 Ss 0:00 _ bash
464 pts/3 Sl+ 24:20 _ firefox -safe-mode
567 pts/3 Sl+ 2:29 _ /opt/firefox/firefox-bin -contentproc -childID 1 -isForBrowser -prefsLen 1 -prefMapSize 208448 -safeMode -parentBuildID 20190827005903 -greomni /op
606 pts/3 Sl+ 0:02 _ /opt/firefox/firefox-bin -contentproc -childID 2 -isForBrowser -prefsLen 137 -prefMapSize 208448 -safeMode -parentBuildID 20190827005903 -greomni /
643 pts/3 Sl+ 0:00 _ /opt/firefox/firefox-bin -contentproc -childID 3 -isForBrowser -prefsLen 6112 -prefMapSize 208448 -safeMode -parentBuildID 20190827005903 -greomni
707 pts/3 Sl+ 0:05 _ /opt/firefox/firefox-bin -contentproc -childID 4 -isForBrowser -prefsLen 6780 -prefMapSize 208448 -safeMode -parentBuildID 20190827005903 -greomni
21958 pts/3 Sl+ 0:00 _ /opt/firefox/firefox-bin -contentproc -childID 5 -isForBrowser -prefsLen 7496 -prefMapSize 208448 -safeMode -parentBuildID 20190827005903 -greomni

I.e. pid 464.

I tried stracing it but the log file is so huge (1,5GB) I'm unable to understand what's going on. At the time of stracing I wasn't doing anything in Firefox - it was minimized.

If Mozilla developers are interested in solving this bug I can email you privately my entire profile because this bug is really nasty, annoying and really bad.

I'm using Firefox 69 from Mozilla ftp:

https://ftp.mozilla.org/pub/firefox/releases/69.0/linux-x86_64/en-US/firefox-69.0.tar.bz2

Artem S. Tashkinov

Reporter

Updated

•

5 years ago

Severity: normal → critical

OS: Unspecified → Linux

Hardware: Unspecified → x86_64

Artem S. Tashkinov

Reporter

Comment 1

•

5 years ago

This bug doesn't prevent Firefox from shutting down fast, so whatever that it's doing can be interrupted successfully and fast.

Artem S. Tashkinov

Reporter

Comment 2

•

5 years ago

The most weird thing about this bug is that sometimes it triggers, sometimes not. i.e. I'm now running Firefox and everything is OK.

Andrei Purice

Comment 3

•

5 years ago

Hello,
I tried to reproduce the cpu usage while having 3 instances of Firefox opened at the same time: release 69.0, beta 70.0b6 and Nightly 71.0a1 (2019-09-16) and didn't see any abnormal activity. There were spikes up to 60% sometimes (depends on the website, downloads, etc.).

There might be something wrong with your profile. Can you please create a performance profile when you encounter the issue again? We need that to check out whats going wrong, here is the link with the addon: link : https://profiler.firefox.com/ .

You can try this step as well : firefox menu > options > privacy & security panel, check the setting to "Prevent accessibility services from accessing your browser" in the permissions section and restart once the changes were performed.

Flags: needinfo?(aros)

Artem S. Tashkinov

Reporter

Comment 4

•

5 years ago

I will reopen this bug report once I am able to collect the required information.

Status: UNCONFIRMED → RESOLVED

Closed: 5 years ago

Resolution: --- → WORKSFORME

Artem S. Tashkinov

Reporter

Comment 5

•

5 years ago

Reproduced on another PC with the same profile:

perfht.ml / 2NrH4LO

Status: RESOLVED → UNCONFIRMED

Flags: needinfo?(aros)

Resolution: WORKSFORME → ---

Artem S. Tashkinov

Reporter

Comment 6

•

5 years ago

The original PC dump:

perfht.ml / 2mmScNU

Andrei Purice

Comment 7

•

5 years ago

I have added a component to this issue in order to get the dev team involved. If it's no the correct one feel free to change it.

Status: UNCONFIRMED → NEW

Component: Untriaged → Performance

Ever confirmed: true

Product: Firefox → Core

Markus Stange [:mstange]

Comment 8

•

5 years ago

Thanks, Artem. Unfortunately, the Firefox profiler is currently not very well-suited for investigating background thread CPU usage.

Since you're on Linux, could you try to use the linux command line tool perf to get a profile, and then upload it through profiler.firefox.com? See https://perf-html.io/docs/#/./guide-perf-profiling for instructions.
Also, can you do this on Nightly, if possible? Otherwise the perf profile will not contain useful symbols.

Flags: needinfo?(aros)

Artem S. Tashkinov

Reporter

Comment 9

•

5 years ago

Will do that once I'm able to repro the bug.

On same days it never occurs, on some days it does.

Status: NEW → RESOLVED

Closed: 5 years ago → 5 years ago

Resolution: --- → INACTIVE

Artem S. Tashkinov

Reporter

Updated

•

5 years ago

Flags: needinfo?(aros)

Artem S. Tashkinov

Reporter

Comment 10

•

5 years ago

I have to reopen this bug report because the bug persists in Firefox 75/76 beta even without any add-ons installed.

It manifests itself when I'm either enabling or disabling VPN (not for work, for accessing the Internet via a third party).

It looks like Firefox opens a connection to some server(s) (DNS/DOT, some updates, etc. - I've no idea) and when my Internet routing changes drastically this connection breaks and something starts pegging the CPU like crazy.

Artem S. Tashkinov

Reporter

Updated

•

5 years ago

Status: RESOLVED → REOPENED

Resolution: INACTIVE → ---

Version: 69 Branch → 75 Branch

Artem S. Tashkinov

Reporter

Comment 11

•

5 years ago

A similar bug 344977 was opened 11 years ago but it was never formally fixed.

There are quite a lot of reports on the Internet as well: https://google.com/search?q=firefox+high+cpu+use+vpn

Speaking of the Internet routing.

When I'm accessing the Internet from different Internet locations, I get different IP addresses from DNS for the same servers and these IP addresses might be accessible only from the locations they were given to me, i.e.

location1 -> DNS -> server1.ipaddress1.
location2 -> server1.ipaddress1 now becomes inaccessible because it's only designated for location1.

Artem S. Tashkinov

Reporter

Comment 12

•

5 years ago

What's weird is that I have

network.dnsCacheExpiration=5
network.dnsCacheExpirationGracePeriod=5

which in theory should have solved this issue for me but it doesn't.

It's still possible I'm hitting a completely different error than what I'm describing here.

Randell Jesup [:jesup] (needinfo me)

Comment 13

•

5 years ago

This profile shows us using a LOT of CPU in SocketThread doing SSL handshakes - a short handshake burst is normal, but not 20% continuously for 8+ seconds. Everything else seems basically idle.
https://perfht.ml/2XUvUn5

Were these taken with it capturing all threads (there a lot, so probably you did). If not, then in the extension, setting the threadlist field to "," (a bare comma) will capture all (nsThread) threads (it doesn't capture OS threads or raw pthreads used by some subsystems).

Component: Performance → Networking

Flags: needinfo?(jjones)

Flags: needinfo?(aros)

Whiteboard: [qf]

Honza Bambas (:mayhemer)

Updated

•

5 years ago

Component: Networking → Security: PSM

Randell Jesup [:jesup] (needinfo me)

Updated

•

5 years ago

Whiteboard: [qf] → [qf:responsiveness:p3]

Dana Keeler (she/her) (use needinfo) [:keeler]

Updated

•

5 years ago

Priority: -- → P3

Whiteboard: [qf:responsiveness:p3] → [qf:responsiveness:p3][psm-waiting]

J.C. Jones [:jcj] (he/they)

Comment 14

•

5 years ago

I see the majority of the time in that profile being on the socket receive during the TLS setup, so this looks like slow network in this trace. That's corroborated by the 15 seconds of waiting at PR_Poll for incoming data.

It's possible that something is working together wrong here to cause such long waits, but I am unsure what combination that might be.

I regularly use Firefox through IPSec tunnels as well as proxied over SSH, and have to say I haven't encountered this behavior myself. But to be fair, it's been a long time since I used anything other than SSH tunneling on Linux, and this is sufficiently low-level that I suspect it's a Linux + That VPN Setup combination.

VPN details are sensitive I know, but it'd be good to know what you're able to share on this, and maybe whether your connection is going through an interception proxy that might be also slowing things down in some way.

Flags: needinfo?(jjones)

Dana Keeler (she/her) (use needinfo) [:keeler]

Comment 15

•

4 years ago

Waiting on a response to comment 13 from reporter.

Status: REOPENED → RESOLVED

Closed: 5 years ago → 4 years ago

Resolution: --- → INCOMPLETE

Artem S. Tashkinov

Reporter

Comment 16

•

4 years ago

(In reply to Randell Jesup [:jesup] (needinfo me) from comment #13)

This profile shows us using a LOT of CPU in SocketThread doing SSL handshakes - a short handshake burst is normal, but not 20% continuously for 8+ seconds. Everything else seems basically idle.
https://perfht.ml/2XUvUn5

Were these taken with it capturing all threads (there a lot, so probably you did). If not, then in the extension, setting the threadlist field to "," (a bare comma) will capture all (nsThread) threads (it doesn't capture OS threads or raw pthreads used by some subsystems).

Disabling my VPN connection makes Firefox eat CPU indefinitely either until I close it entirely or reestablish the connection.

Must be related to network.trr.mode=2 set locally but I haven't tested.

Flags: needinfo?(aros)

Artem S. Tashkinov

Reporter

Comment 17

•

4 years ago

This is reproducible in Firefox 80 as well.

Dana Keeler (she/her) (use needinfo) [:keeler]

Comment 18

•

4 years ago

In the profiler extension, can you capture a new profile using the "custom" settings with the "Bypass selections above and record all registered threads" checkbox checked? (this is what we were asking of you in comment 13)

Flags: needinfo?(aros)

Artem S. Tashkinov

Reporter

Comment 19

•

4 years ago

https://profiler.firefox.com/from-addon/calltree/?globalTrackOrder=6-0-1-2-3-4-5&hiddenGlobalTracks=1-2-3-4-5&hiddenLocalTracksByPid=708154-3-4-6-7-8-9-10-11-12-13-14-15-16-18-19-20-21-22-23-30-31-33-34-35-37-38-39-41-42-44-45-46-47-48-50-51-52-53-54-55-56-60-61-62-64-65-66~710600-1-3-4-5-6-7-8-9-10-12-13-14-15-16-17-18-19-20~708284-1-3-4-5-6-7-8-9-10-12-13-14-15-16-17-18-19-20-21~708496-1-3-4-5-6-7-8-9-10-12-13-14-15-16-17-18-19-20-21-28~708376-1-3-4-5-6-7-8-9-10-12-13-14-15-16-17-18-19-20-21-28-29~708233-1-3-4-5-6-7-8-9-10-12-13-14-15-16-17-18-19-20-21-28-29-30-31-32&localTrackOrderByPid=708154-0-1-2-3-4-5-6-7-8-9-10-11-12-13-14-15-16-17-18-19-20-21-22-23-24-25-26-27-28-29-30-31-32-33-34-35-36-37-38-39-40-41-42-43-44-45-46-47-48-49-50-51-52-53-54-55-56-57-58-59-60-61-62-63-64-65-66~710600-0-1-2-3-4-5-6-7-8-9-10-11-12-13-14-15-16-17-18-19-20~708284-0-1-2-3-4-5-6-7-8-9-10-11-12-13-14-15-16-17-18-19-20-21-22-23-24-25-26-27~708496-0-1-2-3-4-5-6-7-8-9-10-11-12-13-14-15-16-17-18-19-20-21-22-23-24-25-26-27-28~708376-0-1-2-3-4-5-6-7-8-9-10-11-12-13-14-15-16-17-18-19-20-21-22-23-24-25-26-27-28-29~708233-0-1-2-3-4-5-6-7-8-9-10-11-12-13-14-15-16-17-18-19-20-21-22-23-24-25-26-27-28-29-30-31-32~&thread=0&v=5

Again, steps to reproduce:

Start Firefox with a new profile
Set network.trr.mode to 2
Connect to a VPN from another region altogether
Disconnect from the VPN

Result: see a single CPU core pegged at 100% indefinitely until you reconnect.

As soon as I started profiling the issue sort of disappeared - instead of a single core being pegged, a dozen other showed some CPU usage.

Flags: needinfo?(aros)

Artem S. Tashkinov

Reporter

Comment 20

•

4 years ago

https://share.firefox.dev/301FzbS

Dana Keeler (she/her) (use needinfo) [:keeler]

Comment 21

•

4 years ago

And you're sure it's Firefox that is using the CPU? As far as I can tell, that profile just shows Firefox waiting for user input - it's not doing anything.
In any case, this sounds like a networking issue.

Severity: critical → --

Status: RESOLVED → REOPENED

Component: Security: PSM → Networking

Flags: needinfo?(aros)

Priority: P3 → --

Resolution: INCOMPLETE → ---

Whiteboard: [qf:responsiveness:p3][psm-waiting] → [qf:responsiveness:p3]

Artem S. Tashkinov

Reporter

Comment 22

•

4 years ago

(In reply to Dana Keeler (she/her) (use needinfo) (:keeler for reviews) from comment #21)

And you're sure it's Firefox that is using the CPU? As far as I can tell, that profile just shows Firefox waiting for user input - it's not doing anything.
In any case, this sounds like a networking issue.

top, htop and ps all say the load is caused by firefox, to be more precise its parent process.

Actually with Firefox 80.0.1 the bug now manifests itself quite rarely.

Flags: needinfo?(aros)

Kershaw Chang [:kershaw]

Comment 23

•

4 years ago

I am interested in the networking status after disconnecting the VPN.
Could you try to get the http log when you see this issue happens again?

Thanks.

Flags: needinfo?(aros)

Artem S. Tashkinov

Reporter

Comment 24

•

4 years ago

(In reply to Kershaw Chang [:kershaw, OOO until 19.10] from comment #23)

I am interested in the networking status after disconnecting the VPN.
Could you try to get the http log when you see this issue happens again?

1.5GB of logs for five seconds: https://drive.google.com/file/d/1nfEYCTkLnEdlsJJX-aMOqFQ3xcCXhd4V/view?usp=sharing

Flags: needinfo?(aros)

Valentin Gosu [:valentin] (he/him)

Comment 25

•

4 years ago

Attached file Relevant parts of the log — Details

Valentin Gosu [:valentin] (he/him)

Comment 26

•

4 years ago

It seems we get stuck in this loop until the timer fires:

2020-10-12 11:49:09.528043 UTC - [Parent 231769: Socket Thread]: I/nsHttp Http2Session::ReadTimeoutTick 0x7f7ea39c5000 delta since last read 67s
2020-10-12 11:49:09.528044 UTC - [Parent 231769: Socket Thread]: I/nsHttp Http2Session::ReadTimeoutTick 0x7f7ea39c5000 handle outstanding ping
2020-10-12 11:49:09.528045 UTC - [Parent 231769: Socket Thread]: I/nsHttp Http2Session::ReadTimeoutTick 0x7f7ea39c5000 Ping Timer Exhaustion
2020-10-12 11:49:09.528047 UTC - [Parent 231769: Socket Thread]: I/nsHttp Http2Session::Close 0x7f7ea39c5000 804B000E

and we finally close the connection.

The cause of why we get stuck there is less obvious to me. I think the bug is related to mWaitingFor0RTTResponse being true.
I've marked the lines we exercise here and here

Dragana, can you tell what's wrong here?
Based on comment 16 it would seem we're not properly cleaning up the sockets on failure + VPN network change?

Flags: needinfo?(dd.mozilla)

Dragana Damjanovic [:dragana]

Assignee

Comment 27

•

4 years ago

I am wondering why this is happening. We are calling DriveHandshake which should call the socket and pick up an error, but the socket is not called or it returns NS_OK (mSocketOutCondition==NS_OK). Also the fact that OnSocketWritable is triggered immediately means that socket is closed already so NSS never reads from the socket.

Artem, can you post an log that capture the time you disconnect a VPN? The looping above does not need to happen, I want to see if we detect that VPN is disconnected.

Flags: needinfo?(dd.mozilla) → needinfo?(aros)

Artem S. Tashkinov

Reporter

Comment 28

•

4 years ago

(In reply to Dragana Damjanovic [:dragana] from comment #27)

I am wondering why this is happening. We are calling DriveHandshake which should call the socket and pick up an error, but the socket is not called or it returns NS_OK (mSocketOutCondition==NS_OK). Also the fact that OnSocketWritable is triggered immediately means that socket is closed already so NSS never reads from the socket.

Artem, can you post an log that capture the time you disconnect a VPN? The looping above does not need to happen, I want to see if we detect that VPN is disconnected.

I'm not sure it's been logged but here it is:

https://drive.google.com/file/d/1AM2XLbHGqwAi00gqo2u-tIlDnnM1HvKd/view?usp=sharing

Flags: needinfo?(aros)

Dragana Damjanovic [:dragana]

Assignee

Updated

•

4 years ago

Assignee: nobody → dd.mozilla

Severity: -- → S3

Status: REOPENED → ASSIGNED

Priority: -- → P2

Whiteboard: [qf:responsiveness:p3] → [qf:responsiveness:p3][necko-triaged]

Dragana Damjanovic [:dragana]

Assignee

Comment 29

•

3 years ago

Can you still reproduce this?

Flags: needinfo?(aros)

Artem S. Tashkinov

Reporter

Comment 30

•

3 years ago

It was fixed either in Firefox 88 or earlier. I haven't encountered the bug for at least a month already.

Status: ASSIGNED → RESOLVED

Closed: 4 years ago → 3 years ago

Flags: needinfo?(aros)

Resolution: --- → WORKSFORME

Dave Hunt [:davehunt] [he/him] ⌚BST

Updated

•

3 years ago

Performance Impact: --- → P3

Keywords: perf:responsiveness

Whiteboard: [qf:responsiveness:p3][necko-triaged] → [necko-triaged]

Daniel Bodea [:danibodea]

Updated

•

2 years ago

strace.7z 5 years ago Artem S. Tashkinov 237.25 KB, application/x-7z-compressed		Details
Relevant parts of the log 4 years ago Valentin Gosu [:valentin] (he/him) 17.41 KB, text/plain		Details