Closed Bug 1581130 Opened 5 years ago Closed 3 years ago

Some periodic job in Mozilla Firefox keeps one CPU core 100% occupied for long periods of time

Categories

(Core :: Networking, defect, P2)

75 Branch
x86_64
Linux
defect

Tracking

()

RESOLVED WORKSFORME
Performance Impact low

People

(Reporter: aros, Assigned: dragana)

References

Details

(Keywords: perf:responsiveness, Whiteboard: [necko-triaged])

Attachments

(2 files)

Attached file strace.7z

User Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:69.0) Gecko/20100101 Firefox/69.0

Steps to reproduce:

I first thought it was due to my add-ons and now I'm running Firefox in -safe-mode and the issue still persists.

On average each two minutes Firefox starts using one CPU core up to two minutes 30 seconds. That's driving me insane because I don't know what's going on.

Maybe Firefox is optimizing its SQLite databases but I'm not sure about that since inotifywait doesn't show any .db files accesses during these CPU hogging time periods.

I've just written down several such periods:

06:11:05 - 06:12:40
01:11:05 - 01:12:40
01:13:05 - 01:15:20
01:15:25 - 01:17:34
01:18:35 - 01:19:15
01:19:45 - 01:20:05
01:21:10 - 01:21:40
01:21:55 - 01:22:15

I'm now filing this bug report and Firefox again is doing something. There's nothing in about:performance.

The process which is abusing one CPU core for so long is the root process:

423 pts/3 Ss 0:00 _ bash
464 pts/3 Sl+ 24:20 _ firefox -safe-mode
567 pts/3 Sl+ 2:29 _ /opt/firefox/firefox-bin -contentproc -childID 1 -isForBrowser -prefsLen 1 -prefMapSize 208448 -safeMode -parentBuildID 20190827005903 -greomni /op
606 pts/3 Sl+ 0:02 _ /opt/firefox/firefox-bin -contentproc -childID 2 -isForBrowser -prefsLen 137 -prefMapSize 208448 -safeMode -parentBuildID 20190827005903 -greomni /
643 pts/3 Sl+ 0:00 _ /opt/firefox/firefox-bin -contentproc -childID 3 -isForBrowser -prefsLen 6112 -prefMapSize 208448 -safeMode -parentBuildID 20190827005903 -greomni
707 pts/3 Sl+ 0:05 _ /opt/firefox/firefox-bin -contentproc -childID 4 -isForBrowser -prefsLen 6780 -prefMapSize 208448 -safeMode -parentBuildID 20190827005903 -greomni
21958 pts/3 Sl+ 0:00 _ /opt/firefox/firefox-bin -contentproc -childID 5 -isForBrowser -prefsLen 7496 -prefMapSize 208448 -safeMode -parentBuildID 20190827005903 -greomni

I.e. pid 464.

I tried stracing it but the log file is so huge (1,5GB) I'm unable to understand what's going on. At the time of stracing I wasn't doing anything in Firefox - it was minimized.

If Mozilla developers are interested in solving this bug I can email you privately my entire profile because this bug is really nasty, annoying and really bad.

I'm using Firefox 69 from Mozilla ftp:

https://ftp.mozilla.org/pub/firefox/releases/69.0/linux-x86_64/en-US/firefox-69.0.tar.bz2

Severity: normal → critical
OS: Unspecified → Linux
Hardware: Unspecified → x86_64

This bug doesn't prevent Firefox from shutting down fast, so whatever that it's doing can be interrupted successfully and fast.

The most weird thing about this bug is that sometimes it triggers, sometimes not. i.e. I'm now running Firefox and everything is OK.

Hello,
I tried to reproduce the cpu usage while having 3 instances of Firefox opened at the same time: release 69.0, beta 70.0b6 and Nightly 71.0a1 (2019-09-16) and didn't see any abnormal activity. There were spikes up to 60% sometimes (depends on the website, downloads, etc.).

There might be something wrong with your profile. Can you please create a performance profile when you encounter the issue again? We need that to check out whats going wrong, here is the link with the addon: link : https://profiler.firefox.com/ .

You can try this step as well : firefox menu > options > privacy & security panel, check the setting to "Prevent accessibility services from accessing your browser" in the permissions section and restart once the changes were performed.

Flags: needinfo?(aros)

I will reopen this bug report once I am able to collect the required information.

Status: UNCONFIRMED → RESOLVED
Closed: 5 years ago
Resolution: --- → WORKSFORME

Reproduced on another PC with the same profile:

perfht.ml / 2NrH4LO

Status: RESOLVED → UNCONFIRMED
Flags: needinfo?(aros)
Resolution: WORKSFORME → ---

The original PC dump:

perfht.ml / 2mmScNU

I have added a component to this issue in order to get the dev team involved. If it's no the correct one feel free to change it.

Status: UNCONFIRMED → NEW
Component: Untriaged → Performance
Ever confirmed: true
Product: Firefox → Core

Thanks, Artem. Unfortunately, the Firefox profiler is currently not very well-suited for investigating background thread CPU usage.

Since you're on Linux, could you try to use the linux command line tool perf to get a profile, and then upload it through profiler.firefox.com? See https://perf-html.io/docs/#/./guide-perf-profiling for instructions.
Also, can you do this on Nightly, if possible? Otherwise the perf profile will not contain useful symbols.

Flags: needinfo?(aros)

Will do that once I'm able to repro the bug.

On same days it never occurs, on some days it does.

Status: NEW → RESOLVED
Closed: 5 years ago5 years ago
Resolution: --- → INACTIVE
Flags: needinfo?(aros)

I have to reopen this bug report because the bug persists in Firefox 75/76 beta even without any add-ons installed.

It manifests itself when I'm either enabling or disabling VPN (not for work, for accessing the Internet via a third party).

It looks like Firefox opens a connection to some server(s) (DNS/DOT, some updates, etc. - I've no idea) and when my Internet routing changes drastically this connection breaks and something starts pegging the CPU like crazy.

Status: RESOLVED → REOPENED
Resolution: INACTIVE → ---
Version: 69 Branch → 75 Branch

A similar bug 344977 was opened 11 years ago but it was never formally fixed.

There are quite a lot of reports on the Internet as well: https://google.com/search?q=firefox+high+cpu+use+vpn

Speaking of the Internet routing.

When I'm accessing the Internet from different Internet locations, I get different IP addresses from DNS for the same servers and these IP addresses might be accessible only from the locations they were given to me, i.e.

location1 -> DNS -> server1.ipaddress1.
location2 -> server1.ipaddress1 now becomes inaccessible because it's only designated for location1.

What's weird is that I have

network.dnsCacheExpiration=5
network.dnsCacheExpirationGracePeriod=5

which in theory should have solved this issue for me but it doesn't.

It's still possible I'm hitting a completely different error than what I'm describing here.

This profile shows us using a LOT of CPU in SocketThread doing SSL handshakes - a short handshake burst is normal, but not 20% continuously for 8+ seconds. Everything else seems basically idle.
https://perfht.ml/2XUvUn5

Were these taken with it capturing all threads (there a lot, so probably you did). If not, then in the extension, setting the threadlist field to "," (a bare comma) will capture all (nsThread) threads (it doesn't capture OS threads or raw pthreads used by some subsystems).

Component: Performance → Networking
Flags: needinfo?(jjones)
Flags: needinfo?(aros)
Whiteboard: [qf]
Component: Networking → Security: PSM
Whiteboard: [qf] → [qf:responsiveness:p3]
Priority: -- → P3
Whiteboard: [qf:responsiveness:p3] → [qf:responsiveness:p3][psm-waiting]

I see the majority of the time in that profile being on the socket receive during the TLS setup, so this looks like slow network in this trace. That's corroborated by the 15 seconds of waiting at PR_Poll for incoming data.

It's possible that something is working together wrong here to cause such long waits, but I am unsure what combination that might be.

I regularly use Firefox through IPSec tunnels as well as proxied over SSH, and have to say I haven't encountered this behavior myself. But to be fair, it's been a long time since I used anything other than SSH tunneling on Linux, and this is sufficiently low-level that I suspect it's a Linux + That VPN Setup combination.

VPN details are sensitive I know, but it'd be good to know what you're able to share on this, and maybe whether your connection is going through an interception proxy that might be also slowing things down in some way.

Flags: needinfo?(jjones)

Waiting on a response to comment 13 from reporter.

Status: REOPENED → RESOLVED
Closed: 5 years ago4 years ago
Resolution: --- → INCOMPLETE

(In reply to Randell Jesup [:jesup] (needinfo me) from comment #13)

This profile shows us using a LOT of CPU in SocketThread doing SSL handshakes - a short handshake burst is normal, but not 20% continuously for 8+ seconds. Everything else seems basically idle.
https://perfht.ml/2XUvUn5

Were these taken with it capturing all threads (there a lot, so probably you did). If not, then in the extension, setting the threadlist field to "," (a bare comma) will capture all (nsThread) threads (it doesn't capture OS threads or raw pthreads used by some subsystems).

Disabling my VPN connection makes Firefox eat CPU indefinitely either until I close it entirely or reestablish the connection.

Must be related to network.trr.mode=2 set locally but I haven't tested.

Flags: needinfo?(aros)

This is reproducible in Firefox 80 as well.

In the profiler extension, can you capture a new profile using the "custom" settings with the "Bypass selections above and record all registered threads" checkbox checked? (this is what we were asking of you in comment 13)

Flags: needinfo?(aros)

https://profiler.firefox.com/from-addon/calltree/?globalTrackOrder=6-0-1-2-3-4-5&hiddenGlobalTracks=1-2-3-4-5&hiddenLocalTracksByPid=708154-3-4-6-7-8-9-10-11-12-13-14-15-16-18-19-20-21-22-23-30-31-33-34-35-37-38-39-41-42-44-45-46-47-48-50-51-52-53-54-55-56-60-61-62-64-65-66~710600-1-3-4-5-6-7-8-9-10-12-13-14-15-16-17-18-19-20~708284-1-3-4-5-6-7-8-9-10-12-13-14-15-16-17-18-19-20-21~708496-1-3-4-5-6-7-8-9-10-12-13-14-15-16-17-18-19-20-21-28~708376-1-3-4-5-6-7-8-9-10-12-13-14-15-16-17-18-19-20-21-28-29~708233-1-3-4-5-6-7-8-9-10-12-13-14-15-16-17-18-19-20-21-28-29-30-31-32&localTrackOrderByPid=708154-0-1-2-3-4-5-6-7-8-9-10-11-12-13-14-15-16-17-18-19-20-21-22-23-24-25-26-27-28-29-30-31-32-33-34-35-36-37-38-39-40-41-42-43-44-45-46-47-48-49-50-51-52-53-54-55-56-57-58-59-60-61-62-63-64-65-66~710600-0-1-2-3-4-5-6-7-8-9-10-11-12-13-14-15-16-17-18-19-20~708284-0-1-2-3-4-5-6-7-8-9-10-11-12-13-14-15-16-17-18-19-20-21-22-23-24-25-26-27~708496-0-1-2-3-4-5-6-7-8-9-10-11-12-13-14-15-16-17-18-19-20-21-22-23-24-25-26-27-28~708376-0-1-2-3-4-5-6-7-8-9-10-11-12-13-14-15-16-17-18-19-20-21-22-23-24-25-26-27-28-29~708233-0-1-2-3-4-5-6-7-8-9-10-11-12-13-14-15-16-17-18-19-20-21-22-23-24-25-26-27-28-29-30-31-32~&thread=0&v=5

Again, steps to reproduce:

  1. Start Firefox with a new profile
  2. Set network.trr.mode to 2
  3. Connect to a VPN from another region altogether
  4. Disconnect from the VPN

Result: see a single CPU core pegged at 100% indefinitely until you reconnect.

As soon as I started profiling the issue sort of disappeared - instead of a single core being pegged, a dozen other showed some CPU usage.

Flags: needinfo?(aros)

And you're sure it's Firefox that is using the CPU? As far as I can tell, that profile just shows Firefox waiting for user input - it's not doing anything.
In any case, this sounds like a networking issue.

Severity: critical → --
Status: RESOLVED → REOPENED
Component: Security: PSM → Networking
Flags: needinfo?(aros)
Priority: P3 → --
Resolution: INCOMPLETE → ---
Whiteboard: [qf:responsiveness:p3][psm-waiting] → [qf:responsiveness:p3]

(In reply to Dana Keeler (she/her) (use needinfo) (:keeler for reviews) from comment #21)

And you're sure it's Firefox that is using the CPU? As far as I can tell, that profile just shows Firefox waiting for user input - it's not doing anything.
In any case, this sounds like a networking issue.

top, htop and ps all say the load is caused by firefox, to be more precise its parent process.

Actually with Firefox 80.0.1 the bug now manifests itself quite rarely.

Flags: needinfo?(aros)

I am interested in the networking status after disconnecting the VPN.
Could you try to get the http log when you see this issue happens again?

Thanks.

Flags: needinfo?(aros)

(In reply to Kershaw Chang [:kershaw, OOO until 19.10] from comment #23)

I am interested in the networking status after disconnecting the VPN.
Could you try to get the http log when you see this issue happens again?

1.5GB of logs for five seconds: https://drive.google.com/file/d/1nfEYCTkLnEdlsJJX-aMOqFQ3xcCXhd4V/view?usp=sharing

Flags: needinfo?(aros)

It seems we get stuck in this loop until the timer fires:

2020-10-12 11:49:09.528043 UTC - [Parent 231769: Socket Thread]: I/nsHttp Http2Session::ReadTimeoutTick 0x7f7ea39c5000 delta since last read 67s
2020-10-12 11:49:09.528044 UTC - [Parent 231769: Socket Thread]: I/nsHttp Http2Session::ReadTimeoutTick 0x7f7ea39c5000 handle outstanding ping
2020-10-12 11:49:09.528045 UTC - [Parent 231769: Socket Thread]: I/nsHttp Http2Session::ReadTimeoutTick 0x7f7ea39c5000 Ping Timer Exhaustion
2020-10-12 11:49:09.528047 UTC - [Parent 231769: Socket Thread]: I/nsHttp Http2Session::Close 0x7f7ea39c5000 804B000E

and we finally close the connection.

The cause of why we get stuck there is less obvious to me. I think the bug is related to mWaitingFor0RTTResponse being true.
I've marked the lines we exercise here and here

Dragana, can you tell what's wrong here?
Based on comment 16 it would seem we're not properly cleaning up the sockets on failure + VPN network change?

Flags: needinfo?(dd.mozilla)

I am wondering why this is happening. We are calling DriveHandshake which should call the socket and pick up an error, but the socket is not called or it returns NS_OK (mSocketOutCondition==NS_OK). Also the fact that OnSocketWritable is triggered immediately means that socket is closed already so NSS never reads from the socket.

Artem, can you post an log that capture the time you disconnect a VPN? The looping above does not need to happen, I want to see if we detect that VPN is disconnected.

Flags: needinfo?(dd.mozilla) → needinfo?(aros)

(In reply to Dragana Damjanovic [:dragana] from comment #27)

I am wondering why this is happening. We are calling DriveHandshake which should call the socket and pick up an error, but the socket is not called or it returns NS_OK (mSocketOutCondition==NS_OK). Also the fact that OnSocketWritable is triggered immediately means that socket is closed already so NSS never reads from the socket.

Artem, can you post an log that capture the time you disconnect a VPN? The looping above does not need to happen, I want to see if we detect that VPN is disconnected.

I'm not sure it's been logged but here it is:

https://drive.google.com/file/d/1AM2XLbHGqwAi00gqo2u-tIlDnnM1HvKd/view?usp=sharing

Flags: needinfo?(aros)
Assignee: nobody → dd.mozilla
Severity: -- → S3
Status: REOPENED → ASSIGNED
Priority: -- → P2
Whiteboard: [qf:responsiveness:p3] → [qf:responsiveness:p3][necko-triaged]

Can you still reproduce this?

Flags: needinfo?(aros)

It was fixed either in Firefox 88 or earlier. I haven't encountered the bug for at least a month already.

Status: ASSIGNED → RESOLVED
Closed: 4 years ago3 years ago
Flags: needinfo?(aros)
Resolution: --- → WORKSFORME
Performance Impact: --- → P3
Whiteboard: [qf:responsiveness:p3][necko-triaged] → [necko-triaged]
See Also: → 1810022
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: