Content process taking 100% of a core, running IdleTimeGCTaskRunnable continuously
Categories
(Core :: JavaScript: GC, defect, P2)
Tracking
()
People
(Reporter: julienw, Assigned: sfink)
References
(Blocks 1 open bug, Regression)
Details
(Keywords: perf, power, regression)
Attachments
(1 file, 1 obsolete file)
48 bytes,
text/x-phabricator-request
|
pascalc
:
approval-mozilla-beta+
RyanVM
:
approval-mozilla-esr78-
RyanVM
:
approval-mozilla-esr91+
|
Details | Review |
Today I got a problem with a process in Firefox, where it was taking 130% of my CPU.
Then thanks to about:processes I closed all tabs from this process... but it was still taking 130% of my CPU!!
Here is a profile: https://share.firefox.dev/2VMgYZz
Some weird things:
- A lot of idle categories, despite that it's obviously not idle.
- It doesn't seem to do anything useful!
- Lots of Runnable but I don't know if this is meaningful.
So I don't know what happens really.
It looks more like an issue with idle runnables running amok! I have a faint memory of seeing something like this before. It would be a good profile to bring to Joy of Profiling™! 😉
I doubt this is a profiler issue per se. (Keeping it here until we find it a better home.)
It's possible to have lots of samples in the idle category, while still using a lot of CPU.
The "Idle" category is used around code that is supposed to wait until some event happens (typically a mutex or semaphore state change, or I/O operation ending), in this case waiting for the next task on the event queue.
Go to the Stack Chart, and zoom in on a small time range, and you'll see individual samples: There are system "poll" functions (considered idle, but they can still use a bit of CPU) being sampled most of the time. And in-between these there are lots of IdleRunnable
and PIdleScheduler::Msg_IdleTime
tasks, so the CPU is actively running them.
These tasks are more visible in the Marker Table, I count roughly 50 per millisecond! I guess they are very short, that's why samples don't always capture them. But their quantity would explain why we spend so much CPU there.
Comment 2•3 years ago
|
||
It is odd that XPCJSContext::AfterProcessTask seems to be dispatching new tasks all the time.
I think this is coming from https://searchfox.org/mozilla-central/rev/538569c63ffb17d590c51df66f9fff790dd7431e/xpcom/base/CycleCollectedJSContext.cpp#469,495,512
jonco, pbone, does that ring any bells?
Comment 3•3 years ago
|
||
This shouldn't continually dispatch new tasks. We do try and do an idle time nursery collection in a few situations but it shouldn't run repeatedly like this.
It's a shame we can't see the runnable executing in the profile, to confirm it's the idle GC time runnable and to see what it's doing.
Julien, is this reproduceable at all?
Reporter | ||
Comment 4•3 years ago
|
||
I'm not sure how to reproduce this reliably, but I'll take care...
One guess I have is that chat.mozilla.org with ublock enabled takes a lot of memory (for some reason) and even stops working after a while. On this new computer I forgot to disable ublock for this, and I think that's what happened. But now I disabled it, so I'm not so sure this will reproduce again...
Comment 5•3 years ago
|
||
I'm not seeing anything here that others haven't seen. It does point to XPCJSContext::AfterProcessTask
creating a task and running again and again, but I can't say why.
Based on XPCJSContext::AfterProcessTask
being the apparent cause, I'm moving this bug to the JS Engine component.
Updated•3 years ago
|
Comment 7•3 years ago
|
||
Based on the profile, this is coming from
https://searchfox.org/mozilla-central/rev/538569c63ffb17d590c51df66f9fff790dd7431e/xpcom/base/CycleCollectedJSContext.cpp#469,495,512
It is IdleRunnable which is there all the time in the profile
https://searchfox.org/mozilla-central/rev/69e4ba3cbce75872d3e88ab0422fbb0198d7000d/xpcom/base/CycleCollectedJSContext.cpp#497
That instance if IdleRunnable doesn't change the name of the runnable. Other cases where IdleRunnable is used have different name.
Comment 8•3 years ago
|
||
(In reply to Olli Pettay [:smaug] from comment #7)
That instance if IdleRunnable doesn't change the name of the runnable. Other cases where IdleRunnable is used have different name.
That was harder to find in the profile than it needs to be, I filed bug 1727137 to require all IdleRunnable instances to have a name.
Updated•3 years ago
|
Comment 9•3 years ago
|
||
this is really JS, as far as I see. Regression from bug 1376891 basically.
Comment 10•3 years ago
|
||
does https://searchfox.org/mozilla-central/rev/5a362eb7d054740dc9d7c82c79a2efbc5f3e4776/js/src/jsapi.cpp#1292,1294 return true all the time?
Updated•3 years ago
|
Comment 12•3 years ago
|
||
The initial report here was on Linux, but the report in bug 1727975 was on Windows, so this is not Linux-specific.
Updated•3 years ago
|
Assignee | ||
Comment 13•3 years ago
|
||
(In reply to Olli Pettay [:smaug] from comment #10)
does https://searchfox.org/mozilla-central/rev/5a362eb7d054740dc9d7c82c79a2efbc5f3e4776/js/src/jsapi.cpp#1292,1294 return true all the time?
From my reading of the code, it shouldn't. But that profile is also not showing any minor GC markers. Which suggests to me that IsIdleGCTaskNeeded()
is constantly returning true, then when the task runs, it says "oh wait, never mind, we don't need to minor GC anymore." IsIdleGCTaskNeeded()
returns true if shouldCollect()
returns true. RunIdleTimeGCTask
runs exactly the same shouldCollect()
check.
Let's say that shouldCollect()
keeps returning true. We're not getting a minor GC marker, which would mean (1) suppressGC
is set, (2) the nursery is disabled, or (3) the nursery is empty. That last seems the most likely. So then the question is why shouldCollect()
would return true with an empty nursery.
The only way that can happen is if minorGCRequested()
is true. That can be caused by js::Nursery::registerMallocedBuffer
, StoreBuffer::setAboutToOverflow
, and js::jit::AllocateBigIntNoGC
. Given that we're assuming the nursery stays empty, those are all a bit weird, but I'll ignore delving into that for now and look at what clears the minor GC request flag. It happens unconditionally in GCRuntime::collectNursery
, even if the nursery was empty, but NOT if suppressGC
was set.
So I'll change my hypothesis: we have somehow managed to get GC suppression stuck on.
The profile shows 11 stacks in GCRuntime::collectNursery
, which is compatible with this hypothesis. Then again, GC suppression is supposed to be done with RAII, and there's nothing on the stack that would be suppressing. The only exception I see is an uncalled function (intended for use from gdb).
I do see that AutoSuppressGC
is used in StartOffThreadParseTask
. But that looks to be called on the main thread, so should only suppress GC long enough to queue up the task.
I'm stumped. One thing that might paper over the symptom is if shouldCollect
checked the suppressGC
flag, even if minorGCRequested()
is true. It would be nice to have a way to verify whether it fixes it.
Assignee | ||
Updated•3 years ago
|
Assignee | ||
Comment 14•3 years ago
|
||
Comment 15•3 years ago
|
||
since my bug (Bug 1727975) mark as duplicate of this bug. so i will post my assumption here too.
from my latest profiler https://share.firefox.dev/3yAykWW
and i think problem may from this code https://searchfox.org/mozilla-central/source/ipc/glue/IdleSchedulerParent.cpp#398-414
idleRequester->remove();
will be called only when idleRequester->mRequestingGC
or idleRequester->IsWaitingForIdle()
be true.
but if idleRequester->mRequestingGC
and idleRequester->IsWaitingForIdle()
be true at the same time idleRequester->remove();
will never called. and it may be permanent live in sIdleAndGCRequests
LinkedList
Comment 17•3 years ago
|
||
But the idea is that only once you aren't requesting GC, nor idle, then the request is removed from the list. And that wouldn't still explain why
IdleTimeGCTaskRunnable is being dispatched all the time.
https://searchfox.org/mozilla-central/rev/49b6e60550243b4b4d71d6ab35a3ff2b9a9f7c69/xpcom/base/CycleCollectedJSContext.cpp#513 should return false at some point.
Comment 18•3 years ago
|
||
i am not familiar with firefox architecture so that info may incorrect.
and i have reproduce step (at least it can reproduce on my machine)
0. you need computer that has more than 8GB+ of free ram (actually in GC process at peak it may 10GB+ from this example)
- open https://ishare.in.th/tmp/bug-1724980/empty.html
- open https://ishare.in.th/tmp/bug-1724980/bug.html in tab that has same PID and then click start
- click "increment"
- click "clearA"
- click "clearB"
- click "increment" and wait for "clearD" button appear (after this step memory usage may around 4-5GB)
- wait until cpu usage 100% (in my machine it not more than 30 second)
- close bug.html
- wait for GC done (memory usage reduce to 2xxMB) and now IPC I/O Child will constantly use CPU
Comment 19•3 years ago
|
||
(In reply to mix5003 from comment #15)
since my bug (Bug 1727975) mark as duplicate of this bug. so i will post my assumption here too.
from my latest profiler https://share.firefox.dev/3yAykWW
and i think problem may from this code https://searchfox.org/mozilla-central/source/ipc/glue/IdleSchedulerParent.cpp#398-414
idleRequester->remove();
will be called only whenidleRequester->mRequestingGC
oridleRequester->IsWaitingForIdle()
be true.
but ifidleRequester->mRequestingGC
andidleRequester->IsWaitingForIdle()
be true at the same timeidleRequester->remove();
will never called. and it may be permanent live insIdleAndGCRequests
LinkedList
It's supposed to work like that. The list is for any task that wants to GC and/or run an idle task. If either of those things is true then the task must be in the list. If we are changing one of those things, eg we give a task some idle time, then we can only remove it from the list if it does not want the other (GC).
Also I don't see how this is related to the idle task spinning.
Comment 20•3 years ago
|
||
I managed to reproduce this following mix5003's instructions from comment 18. Thank you mix5003.
Comment 21•3 years ago
|
||
sfink: For some reason the nursery gets disabled but also has it's flag set to say it wants collection. I'm not sure how it gets to this state but I think it's due to the GC hitting the 4GB limit. I'm sending you a patch.
Comment 22•3 years ago
|
||
Assignee | ||
Comment 23•3 years ago
|
||
(In reply to Paul Bone [:pbone] from comment #21)
sfink: For some reason the nursery gets disabled but also has it's flag set to say it wants collection. I'm not sure how it gets to this state but I think it's due to the GC hitting the 4GB limit. I'm sending you a patch.
Excellent, thank you!
I started trying to go through those steps, but I got distracted when I first checked my current idling browser -- it was pegged at about 25% CPU. In attempting to look at it in the profiler, I encountered bug 1698748 (Runnable - DummyEvent
happening 1000x/sec).
Updated•3 years ago
|
Updated•3 years ago
|
Comment 24•3 years ago
|
||
Pushed by pbone@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/4bd978bb1379 We don't need to collect the nursery if it's empty or disabled r=sfink
Comment 25•3 years ago
|
||
bugherder |
Comment 26•3 years ago
|
||
Do we need this on branches? This seems like a rather bad case when this happens. Or was this a recent regression after all?
Updated•3 years ago
|
Assignee | ||
Comment 27•3 years ago
|
||
Comment on attachment 9239346 [details]
Bug 1724980 - We don't need to collect the nursery if it's empty or disabled r=sfink
ESR Uplift Approval Request
- If this is not a sec:{high,crit} bug, please state case for ESR consideration: This is a fairly serious problem where a page could allocate enough in the GC heap to disable the nursery, resulting in the browser remaining functional but constantly consuming CPU (and eating battery).
- User impact if declined: some unknown number of users would get their browser stuck spinning CPU
- Fix Landed on Version: 94
- Risk to taking this patch: Low
- Why is the change risky/not risky? (and alternatives if risky): The change itself is tiny: don't tell the task scheduler that we want to do a nursery collection if, in fact, the collection won't happen.
- String or UUID changes made by this patch: none
Beta/Release Uplift Approval Request
- User impact if declined: CPU spinning in some situations (see above)
- Is this code covered by automated tests?: No
- Has the fix been verified in Nightly?: Yes
- Needs manual test from QE?: No
- If yes, steps to reproduce: (comment 18 has STR)
- List of other uplifts needed: None
- Risk to taking this patch: Low
- Why is the change risky/not risky? (and alternatives if risky): see above
- String changes made/needed: none
Assignee | ||
Comment 28•3 years ago
|
||
(In reply to Olli Pettay [:smaug] from comment #26)
Do we need this on branches? This seems like a rather bad case when this happens. Or was this a recent regression after all?
As far as I can tell, the same thing would happen in past branches. In esr78, IsIdleGCTaskNeeded
directly calls shouldCollect
which does not check isEnabled
. I don't know how often this might be happening, but you're right, it seems worth backporting.
Comment 29•3 years ago
|
||
Comment on attachment 9239346 [details]
Bug 1724980 - We don't need to collect the nursery if it's empty or disabled r=sfink
ESR78 is nearly EOL (the upcoming 78.15 release is the final planned one), so I don't think we need this there.
Comment 30•3 years ago
|
||
Comment on attachment 9239346 [details]
Bug 1724980 - We don't need to collect the nursery if it's empty or disabled r=sfink
Approved for 93 beta 3, thanks.
Comment 31•3 years ago
|
||
bugherder uplift |
Comment 32•3 years ago
|
||
Comment on attachment 9239346 [details]
Bug 1724980 - We don't need to collect the nursery if it's empty or disabled r=sfink
Approved for 91.2esr.
Comment 33•3 years ago
|
||
bugherder uplift |
Updated•3 years ago
|
Updated•3 years ago
|
Comment 34•3 years ago
|
||
I tried to reproduce the initial issue on my machine on Windows 10 x64 and Ubuntu 20.04 x64 and also on macOS Big Sur 11.5.2 but I was unable to do so using builds before the fix (Firefox 92 and Nightly from 2021-09-01) - so I can't verify that this is fixed.
Mix5003, will you be so kind and verify that this is fixed using Firefox 91.2 esr, Firefox 93 beta 3, and the latest Nightly 94.0a1?
Firefox 91.2 esr - http://archive.mozilla.org/pub/firefox/candidates/91.0.2-candidates/build1/
Firefox 93 beta 3 - http://archive.mozilla.org/pub/firefox/candidates/93.0b3-candidates/build1/
Nightly 94.0a1 - http://archive.mozilla.org/pub/firefox/nightly/latest-mozilla-central/
Thanks!
Comment 35•3 years ago
|
||
i think Firefox 93 beta 3 and Nightly 94.0a1 already fixed. i can not reproduce it now (firefox 93.0b3 20210909185859 and 94.0a1 20210910033856)
but for firefox 91.0.2 (build id 20210823123856). i think it still not fixed. i not sure that is same cause or not because i think profiler tool is not provide.
one thing that i should mention here is for test Firefox 91.2 esr and Firefox 93 beta3 i go to "about:preferences" and unchecked "Use recommended performance settings" then set "Content process limit" to 2. so i can reproduce it easily. ex open https://ishare.in.th/tmp/bug-1724980/empty.html for 3 tab before use reproduce step in Comment18
Comment 36•3 years ago
|
||
(In reply to mix5003 from comment #35)
i think Firefox 93 beta 3 and Nightly 94.0a1 already fixed. i can not reproduce it now (firefox 93.0b3 20210909185859 and 94.0a1 20210910033856)
Thank you!
but for firefox 91.0.2 (build id 20210823123856). i think it still not fixed. i not sure that is same cause or not because i think profiler tool is not provide.
This is expected since the fix is not present on this build.
one thing that i should mention here is for test Firefox 91.2 esr and Firefox 93 beta3 i go to "about:preferences" and unchecked "Use recommended performance settings" then set "Content process limit" to 2. so i can reproduce it easily. ex open https://ishare.in.th/tmp/bug-1724980/empty.html for 3 tab before use reproduce step in Comment18
Steve, could you please take a look, is this expected?
Comment 37•3 years ago
|
||
ah, sorry for my bad english. it as expected. i just methon about more reliable way to reproduce it in version that this bug is affected. (for example in Firefox 91.2 esr)
Assignee | ||
Comment 38•3 years ago
•
|
||
(In reply to Simona Badau from comment #36)
(In reply to mix5003 from comment #35)
one thing that i should mention here is for test Firefox 91.2 esr and Firefox 93 beta3 i go to "about:preferences" and unchecked "Use recommended performance settings" then set "Content process limit" to 2. so i can reproduce it easily. ex open https://ishare.in.th/tmp/bug-1724980/empty.html for 3 tab before use reproduce step in Comment18
Steve, could you please take a look, is this expected?
Yes, this is expected -- it's a little tricky to reproduce without doing this, since you need both pages to be in the same process, as the reporter mentioned.
mix5003, thank you for the excellent reproduction case.
(In reply to mix5003 from comment #35)
i think Firefox 93 beta 3 and Nightly 94.0a1 already fixed. i can not reproduce it now (firefox 93.0b3 20210909185859 and 94.0a1 20210910033856)
but for firefox 91.0.2 (build id 20210823123856). i think it still not fixed. i not sure that is same cause or not because i think profiler tool is not provide.
I think that was the wrong link. The fix landed in 91.2.0, so 91.0.2 shouldn't be fixed. (It doesn't help that the numbers are permutations of each other...) I don't think there's a release candidate for 91.2 yet, so you'd probably have to test with (a continuous integration build)[https://treeherder.mozilla.org/jobs?repo=mozilla-esr91&revision=25182ce7ad0c&searchStr=build%2Cship], something like this depending on your OS.
I verified it is fixed in 91.2.0. My slightly modified steps:
- go to about:preferences, uncheck "Use recommended performance settings, set "Content process limit" to 2.
- open https://ishare.in.th/tmp/bug-1724980/empty.html
- open https://ishare.in.th/tmp/bug-1724980/bug.html in 3 tabs (I used Ctrl-L, Alt-Enter to clone the first one twice)
- go to about:processes to identify which process contains both empty.html and bug.html. Double-click on bug.html to identify which tab that is.
- close the other bug.html tabs
- click "start"
- click "increment"
- click "clearA"
- click "clearB"
- click "increment" and wait for "clearD" button appear (after this step memory usage may around 4-5GB)
- wait until cpu usage 100% (in my machine it not more than 30 second)
- close bug.html
- watch about:processes. It will show the memory usage drop from many GB to MBs (mine is 16MB) and CPU usage will drop off rapidly after that (in 91.2.0) or hover at around 96% (in 91.0.2)
Comment 39•3 years ago
|
||
(In reply to Steve Fink [:sfink] [:s:] from comment #38)
(In reply to Simona Badau from comment #36)
(In reply to mix5003 from comment #35)
one thing that i should mention here is for test Firefox 91.2 esr and Firefox 93 beta3 i go to "about:preferences" and unchecked "Use recommended performance settings" then set "Content process limit" to 2. so i can reproduce it easily. ex open https://ishare.in.th/tmp/bug-1724980/empty.html for 3 tab before use reproduce step in Comment18
Steve, could you please take a look, is this expected?
Yes, this is expected -- it's a little tricky to reproduce without doing this, since you need both pages to be in the same process, as the reporter mentioned.
The other way to achieve this is to enable fission (set fission.autostart
to true
in about:config
). But there's also a new fission feature coming that means this could be "usually" rather than "always". So check the PIDs by hovering over the tabs and reading the tooltip.
The bug is still independent of fission. it's more that these STR are more reliable with it.
Comment 40•3 years ago
|
||
Based on Comment 35 and Comment 38, setting the tracking flags for firefox 91 esr, firefox 93 and firefox 94 to verified.
Comment 41•3 years ago
|
||
[Tracking Requested - why for this release]:
I heard there's a 92.0.1 planned. Can we uplift this?
Comment 42•3 years ago
|
||
I heard there's a 92.0.1 planned. Can we uplift this?
Comment 43•3 years ago
|
||
This issue has been around since Fx58. What's driving the need to get it into Fx92 1-2 weeks earlier than it would otherwise by shipping in Fx93?
Comment 44•3 years ago
|
||
(In reply to Ryan VanderMeulen [:RyanVM] from comment #43)
This issue has been around since Fx58. What's driving the need to get it into Fx92 1-2 weeks earlier than it would otherwise by shipping in Fx93?
I wasn't sure if this was a good idea / what kinds of things get uplifted to release. I guess 1-2 weeks is fine for something that's not a crash/security problem.
Updated•3 years ago
|
Description
•