Closed Bug 1724980 Opened 4 months ago Closed 3 months ago

Content process taking 100% of a core, running IdleTimeGCTaskRunnable continuously

Categories

(Core :: JavaScript: GC, defect, P2)

defect

Tracking

()

VERIFIED FIXED
94 Branch
Tracking Status
firefox-esr78 --- wontfix
firefox-esr91 --- verified
firefox92 --- wontfix
firefox93 --- verified
firefox94 --- verified

People

(Reporter: julienw, Assigned: sfink)

References

(Blocks 1 open bug, Regression)

Details

(Keywords: perf, power, regression, Whiteboard: [qf])

Attachments

(1 file, 1 obsolete file)

Today I got a problem with a process in Firefox, where it was taking 130% of my CPU.

Then thanks to about:processes I closed all tabs from this process... but it was still taking 130% of my CPU!!

Here is a profile: https://share.firefox.dev/2VMgYZz

Some weird things:

  1. A lot of idle categories, despite that it's obviously not idle.
  2. It doesn't seem to do anything useful!
  3. Lots of Runnable but I don't know if this is meaningful.

So I don't know what happens really.

It looks more like an issue with idle runnables running amok! I have a faint memory of seeing something like this before. It would be a good profile to bring to Joy of Profiling™! 😉

I doubt this is a profiler issue per se. (Keeping it here until we find it a better home.)

It's possible to have lots of samples in the idle category, while still using a lot of CPU.
The "Idle" category is used around code that is supposed to wait until some event happens (typically a mutex or semaphore state change, or I/O operation ending), in this case waiting for the next task on the event queue.
Go to the Stack Chart, and zoom in on a small time range, and you'll see individual samples: There are system "poll" functions (considered idle, but they can still use a bit of CPU) being sampled most of the time. And in-between these there are lots of IdleRunnable and PIdleScheduler::Msg_IdleTime tasks, so the CPU is actively running them.
These tasks are more visible in the Marker Table, I count roughly 50 per millisecond! I guess they are very short, that's why samples don't always capture them. But their quantity would explain why we spend so much CPU there.

It is odd that XPCJSContext::AfterProcessTask seems to be dispatching new tasks all the time.
I think this is coming from https://searchfox.org/mozilla-central/rev/538569c63ffb17d590c51df66f9fff790dd7431e/xpcom/base/CycleCollectedJSContext.cpp#469,495,512

jonco, pbone, does that ring any bells?

Flags: needinfo?(pbone)
Flags: needinfo?(jcoppeard)

This shouldn't continually dispatch new tasks. We do try and do an idle time nursery collection in a few situations but it shouldn't run repeatedly like this.

It's a shame we can't see the runnable executing in the profile, to confirm it's the idle GC time runnable and to see what it's doing.

Julien, is this reproduceable at all?

Flags: needinfo?(jcoppeard) → needinfo?(felash)

I'm not sure how to reproduce this reliably, but I'll take care...

One guess I have is that chat.mozilla.org with ublock enabled takes a lot of memory (for some reason) and even stops working after a while. On this new computer I forgot to disable ublock for this, and I think that's what happened. But now I disabled it, so I'm not so sure this will reproduce again...

Flags: needinfo?(felash)

I'm not seeing anything here that others haven't seen. It does point to XPCJSContext::AfterProcessTask creating a task and running again and again, but I can't say why.

Flags: needinfo?(pbone)

Based on XPCJSContext::AfterProcessTask being the apparent cause, I'm moving this bug to the JS Engine component.

Component: Gecko Profiler → JavaScript Engine
Component: JavaScript Engine → Performance

Based on the profile, this is coming from
https://searchfox.org/mozilla-central/rev/538569c63ffb17d590c51df66f9fff790dd7431e/xpcom/base/CycleCollectedJSContext.cpp#469,495,512
It is IdleRunnable which is there all the time in the profile
https://searchfox.org/mozilla-central/rev/69e4ba3cbce75872d3e88ab0422fbb0198d7000d/xpcom/base/CycleCollectedJSContext.cpp#497
That instance if IdleRunnable doesn't change the name of the runnable. Other cases where IdleRunnable is used have different name.

Component: Performance → JavaScript Engine

(In reply to Olli Pettay [:smaug] from comment #7)

That instance if IdleRunnable doesn't change the name of the runnable. Other cases where IdleRunnable is used have different name.

That was harder to find in the profile than it needs to be, I filed bug 1727137 to require all IdleRunnable instances to have a name.

See Also: → 1727137
Component: JavaScript Engine → XPConnect

this is really JS, as far as I see. Regression from bug 1376891 basically.

Component: XPConnect → JavaScript: GC
Severity: -- → S3
Priority: -- → P2
Duplicate of this bug: 1727975

The initial report here was on Linux, but the report in bug 1727975 was on Windows, so this is not Linux-specific.

Keywords: perf, regression
Regressed by: 1376891
Summary: Idle category despite CPU being used at 130% by this process → Content process taking 100% of a core, running IdleTimeGCTaskRunnable continuously
Whiteboard: [qf]

(In reply to Olli Pettay [:smaug] from comment #10)

does https://searchfox.org/mozilla-central/rev/5a362eb7d054740dc9d7c82c79a2efbc5f3e4776/js/src/jsapi.cpp#1292,1294 return true all the time?

From my reading of the code, it shouldn't. But that profile is also not showing any minor GC markers. Which suggests to me that IsIdleGCTaskNeeded() is constantly returning true, then when the task runs, it says "oh wait, never mind, we don't need to minor GC anymore." IsIdleGCTaskNeeded() returns true if shouldCollect() returns true. RunIdleTimeGCTask runs exactly the same shouldCollect() check.

Let's say that shouldCollect() keeps returning true. We're not getting a minor GC marker, which would mean (1) suppressGC is set, (2) the nursery is disabled, or (3) the nursery is empty. That last seems the most likely. So then the question is why shouldCollect() would return true with an empty nursery.

The only way that can happen is if minorGCRequested() is true. That can be caused by js::Nursery::registerMallocedBuffer, StoreBuffer::setAboutToOverflow, and js::jit::AllocateBigIntNoGC. Given that we're assuming the nursery stays empty, those are all a bit weird, but I'll ignore delving into that for now and look at what clears the minor GC request flag. It happens unconditionally in GCRuntime::collectNursery, even if the nursery was empty, but NOT if suppressGC was set.

So I'll change my hypothesis: we have somehow managed to get GC suppression stuck on.

The profile shows 11 stacks in GCRuntime::collectNursery, which is compatible with this hypothesis. Then again, GC suppression is supposed to be done with RAII, and there's nothing on the stack that would be suppressing. The only exception I see is an uncalled function (intended for use from gdb).

I do see that AutoSuppressGC is used in StartOffThreadParseTask. But that looks to be called on the main thread, so should only suppress GC long enough to queue up the task.

I'm stumped. One thing that might paper over the symptom is if shouldCollect checked the suppressGC flag, even if minorGCRequested() is true. It would be nice to have a way to verify whether it fixes it.

Assignee: nobody → sphink
Status: NEW → ASSIGNED

since my bug (Bug 1727975) mark as duplicate of this bug. so i will post my assumption here too.
from my latest profiler https://share.firefox.dev/3yAykWW
and i think problem may from this code https://searchfox.org/mozilla-central/source/ipc/glue/IdleSchedulerParent.cpp#398-414
idleRequester->remove(); will be called only when idleRequester->mRequestingGC or idleRequester->IsWaitingForIdle() be true.
but if idleRequester->mRequestingGC and idleRequester->IsWaitingForIdle() be true at the same time idleRequester->remove(); will never called. and it may be permanent live in sIdleAndGCRequests LinkedList

pbone, see the comment above.

Flags: needinfo?(pbone)

But the idea is that only once you aren't requesting GC, nor idle, then the request is removed from the list. And that wouldn't still explain why
IdleTimeGCTaskRunnable is being dispatched all the time.
https://searchfox.org/mozilla-central/rev/49b6e60550243b4b4d71d6ab35a3ff2b9a9f7c69/xpcom/base/CycleCollectedJSContext.cpp#513 should return false at some point.

i am not familiar with firefox architecture so that info may incorrect.

and i have reproduce step (at least it can reproduce on my machine)
0. you need computer that has more than 8GB+ of free ram (actually in GC process at peak it may 10GB+ from this example)

  1. open https://ishare.in.th/tmp/bug-1724980/empty.html
  2. open https://ishare.in.th/tmp/bug-1724980/bug.html in tab that has same PID and then click start
  3. click "increment"
  4. click "clearA"
  5. click "clearB"
  6. click "increment" and wait for "clearD" button appear (after this step memory usage may around 4-5GB)
  7. wait until cpu usage 100% (in my machine it not more than 30 second)
  8. close bug.html
  9. wait for GC done (memory usage reduce to 2xxMB) and now IPC I/O Child will constantly use CPU

(In reply to mix5003 from comment #15)

since my bug (Bug 1727975) mark as duplicate of this bug. so i will post my assumption here too.
from my latest profiler https://share.firefox.dev/3yAykWW
and i think problem may from this code https://searchfox.org/mozilla-central/source/ipc/glue/IdleSchedulerParent.cpp#398-414
idleRequester->remove(); will be called only when idleRequester->mRequestingGC or idleRequester->IsWaitingForIdle() be true.
but if idleRequester->mRequestingGC and idleRequester->IsWaitingForIdle() be true at the same time idleRequester->remove(); will never called. and it may be permanent live in sIdleAndGCRequests LinkedList

It's supposed to work like that. The list is for any task that wants to GC and/or run an idle task. If either of those things is true then the task must be in the list. If we are changing one of those things, eg we give a task some idle time, then we can only remove it from the list if it does not want the other (GC).

Also I don't see how this is related to the idle task spinning.

Flags: needinfo?(pbone)

I managed to reproduce this following mix5003's instructions from comment 18. Thank you mix5003.

sfink: For some reason the nursery gets disabled but also has it's flag set to say it wants collection. I'm not sure how it gets to this state but I think it's due to the GC hitting the 4GB limit. I'm sending you a patch.

Flags: needinfo?(sphink)

(In reply to Paul Bone [:pbone] from comment #21)

sfink: For some reason the nursery gets disabled but also has it's flag set to say it wants collection. I'm not sure how it gets to this state but I think it's due to the GC hitting the 4GB limit. I'm sending you a patch.

Excellent, thank you!

I started trying to go through those steps, but I got distracted when I first checked my current idling browser -- it was pegged at about 25% CPU. In attempting to look at it in the profiler, I encountered bug 1698748 (Runnable - DummyEvent happening 1000x/sec).

Flags: needinfo?(sphink)
See Also: → 1698748
Attachment #9238816 - Attachment is obsolete: true
Pushed by pbone@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/4bd978bb1379
We don't need to collect the nursery if it's empty or disabled r=sfink
Status: ASSIGNED → RESOLVED
Closed: 3 months ago
Resolution: --- → FIXED
Target Milestone: --- → 94 Branch

Do we need this on branches? This seems like a rather bad case when this happens. Or was this a recent regression after all?

Flags: needinfo?(sphink)

Comment on attachment 9239346 [details]
Bug 1724980 - We don't need to collect the nursery if it's empty or disabled r=sfink

ESR Uplift Approval Request

  • If this is not a sec:{high,crit} bug, please state case for ESR consideration: This is a fairly serious problem where a page could allocate enough in the GC heap to disable the nursery, resulting in the browser remaining functional but constantly consuming CPU (and eating battery).
  • User impact if declined: some unknown number of users would get their browser stuck spinning CPU
  • Fix Landed on Version: 94
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): The change itself is tiny: don't tell the task scheduler that we want to do a nursery collection if, in fact, the collection won't happen.
  • String or UUID changes made by this patch: none

Beta/Release Uplift Approval Request

  • User impact if declined: CPU spinning in some situations (see above)
  • Is this code covered by automated tests?: No
  • Has the fix been verified in Nightly?: Yes
  • Needs manual test from QE?: No
  • If yes, steps to reproduce: (comment 18 has STR)
  • List of other uplifts needed: None
  • Risk to taking this patch: Low
  • Why is the change risky/not risky? (and alternatives if risky): see above
  • String changes made/needed: none
Flags: needinfo?(sphink)
Attachment #9239346 - Flags: approval-mozilla-esr91?
Attachment #9239346 - Flags: approval-mozilla-esr78?
Attachment #9239346 - Flags: approval-mozilla-beta?

(In reply to Olli Pettay [:smaug] from comment #26)

Do we need this on branches? This seems like a rather bad case when this happens. Or was this a recent regression after all?

As far as I can tell, the same thing would happen in past branches. In esr78, IsIdleGCTaskNeeded directly calls shouldCollect which does not check isEnabled. I don't know how often this might be happening, but you're right, it seems worth backporting.

Comment on attachment 9239346 [details]
Bug 1724980 - We don't need to collect the nursery if it's empty or disabled r=sfink

ESR78 is nearly EOL (the upcoming 78.15 release is the final planned one), so I don't think we need this there.

Attachment #9239346 - Flags: approval-mozilla-esr78? → approval-mozilla-esr78-

Comment on attachment 9239346 [details]
Bug 1724980 - We don't need to collect the nursery if it's empty or disabled r=sfink

Approved for 93 beta 3, thanks.

Attachment #9239346 - Flags: approval-mozilla-beta? → approval-mozilla-beta+

Comment on attachment 9239346 [details]
Bug 1724980 - We don't need to collect the nursery if it's empty or disabled r=sfink

Approved for 91.2esr.

Attachment #9239346 - Flags: approval-mozilla-esr91? → approval-mozilla-esr91+
Flags: qe-verify+
QA Whiteboard: [qa-triaged]

I tried to reproduce the initial issue on my machine on Windows 10 x64 and Ubuntu 20.04 x64 and also on macOS Big Sur 11.5.2 but I was unable to do so using builds before the fix (Firefox 92 and Nightly from 2021-09-01) - so I can't verify that this is fixed.

Mix5003, will you be so kind and verify that this is fixed using Firefox 91.2 esr, Firefox 93 beta 3, and the latest Nightly 94.0a1?

Firefox 91.2 esr - http://archive.mozilla.org/pub/firefox/candidates/91.0.2-candidates/build1/
Firefox 93 beta 3 - http://archive.mozilla.org/pub/firefox/candidates/93.0b3-candidates/build1/
Nightly 94.0a1 - http://archive.mozilla.org/pub/firefox/nightly/latest-mozilla-central/

Thanks!

Flags: needinfo?(mix5003)

i think Firefox 93 beta 3 and Nightly 94.0a1 already fixed. i can not reproduce it now (firefox 93.0b3 20210909185859 and 94.0a1 20210910033856)

but for firefox 91.0.2 (build id 20210823123856). i think it still not fixed. i not sure that is same cause or not because i think profiler tool is not provide.

one thing that i should mention here is for test Firefox 91.2 esr and Firefox 93 beta3 i go to "about:preferences" and unchecked "Use recommended performance settings" then set "Content process limit" to 2. so i can reproduce it easily. ex open https://ishare.in.th/tmp/bug-1724980/empty.html for 3 tab before use reproduce step in Comment18

Flags: needinfo?(mix5003)

(In reply to mix5003 from comment #35)

i think Firefox 93 beta 3 and Nightly 94.0a1 already fixed. i can not reproduce it now (firefox 93.0b3 20210909185859 and 94.0a1 20210910033856)

Thank you!

but for firefox 91.0.2 (build id 20210823123856). i think it still not fixed. i not sure that is same cause or not because i think profiler tool is not provide.

This is expected since the fix is not present on this build.

one thing that i should mention here is for test Firefox 91.2 esr and Firefox 93 beta3 i go to "about:preferences" and unchecked "Use recommended performance settings" then set "Content process limit" to 2. so i can reproduce it easily. ex open https://ishare.in.th/tmp/bug-1724980/empty.html for 3 tab before use reproduce step in Comment18

Steve, could you please take a look, is this expected?

Flags: needinfo?(sphink)

ah, sorry for my bad english. it as expected. i just methon about more reliable way to reproduce it in version that this bug is affected. (for example in Firefox 91.2 esr)

(In reply to Simona Badau from comment #36)

(In reply to mix5003 from comment #35)

one thing that i should mention here is for test Firefox 91.2 esr and Firefox 93 beta3 i go to "about:preferences" and unchecked "Use recommended performance settings" then set "Content process limit" to 2. so i can reproduce it easily. ex open https://ishare.in.th/tmp/bug-1724980/empty.html for 3 tab before use reproduce step in Comment18

Steve, could you please take a look, is this expected?

Yes, this is expected -- it's a little tricky to reproduce without doing this, since you need both pages to be in the same process, as the reporter mentioned.

mix5003, thank you for the excellent reproduction case.

(In reply to mix5003 from comment #35)

i think Firefox 93 beta 3 and Nightly 94.0a1 already fixed. i can not reproduce it now (firefox 93.0b3 20210909185859 and 94.0a1 20210910033856)

but for firefox 91.0.2 (build id 20210823123856). i think it still not fixed. i not sure that is same cause or not because i think profiler tool is not provide.

I think that was the wrong link. The fix landed in 91.2.0, so 91.0.2 shouldn't be fixed. (It doesn't help that the numbers are permutations of each other...) I don't think there's a release candidate for 91.2 yet, so you'd probably have to test with (a continuous integration build)[https://treeherder.mozilla.org/jobs?repo=mozilla-esr91&revision=25182ce7ad0c&searchStr=build%2Cship], something like this depending on your OS.

I verified it is fixed in 91.2.0. My slightly modified steps:

  • go to about:preferences, uncheck "Use recommended performance settings, set "Content process limit" to 2.
  • open https://ishare.in.th/tmp/bug-1724980/empty.html
  • open https://ishare.in.th/tmp/bug-1724980/bug.html in 3 tabs (I used Ctrl-L, Alt-Enter to clone the first one twice)
  • go to about:processes to identify which process contains both empty.html and bug.html. Double-click on bug.html to identify which tab that is.
  • close the other bug.html tabs
  • click "start"
  • click "increment"
  • click "clearA"
  • click "clearB"
  • click "increment" and wait for "clearD" button appear (after this step memory usage may around 4-5GB)
  • wait until cpu usage 100% (in my machine it not more than 30 second)
  • close bug.html
  • watch about:processes. It will show the memory usage drop from many GB to MBs (mine is 16MB) and CPU usage will drop off rapidly after that (in 91.2.0) or hover at around 96% (in 91.0.2)
Status: RESOLVED → VERIFIED
Flags: needinfo?(sphink)

(In reply to Steve Fink [:sfink] [:s:] from comment #38)

(In reply to Simona Badau from comment #36)

(In reply to mix5003 from comment #35)

one thing that i should mention here is for test Firefox 91.2 esr and Firefox 93 beta3 i go to "about:preferences" and unchecked "Use recommended performance settings" then set "Content process limit" to 2. so i can reproduce it easily. ex open https://ishare.in.th/tmp/bug-1724980/empty.html for 3 tab before use reproduce step in Comment18

Steve, could you please take a look, is this expected?

Yes, this is expected -- it's a little tricky to reproduce without doing this, since you need both pages to be in the same process, as the reporter mentioned.

The other way to achieve this is to enable fission (set fission.autostart to true in about:config). But there's also a new fission feature coming that means this could be "usually" rather than "always". So check the PIDs by hovering over the tabs and reading the tooltip.

The bug is still independent of fission. it's more that these STR are more reliable with it.

Based on Comment 35 and Comment 38, setting the tracking flags for firefox 91 esr, firefox 93 and firefox 94 to verified.

QA Whiteboard: [qa-triaged]
Flags: qe-verify+

[Tracking Requested - why for this release]:

I heard there's a 92.0.1 planned. Can we uplift this?

I heard there's a 92.0.1 planned. Can we uplift this?

Flags: needinfo?(ryanvm)

This issue has been around since Fx58. What's driving the need to get it into Fx92 1-2 weeks earlier than it would otherwise by shipping in Fx93?

Flags: needinfo?(ryanvm) → needinfo?(pbone)

(In reply to Ryan VanderMeulen [:RyanVM] from comment #43)

This issue has been around since Fx58. What's driving the need to get it into Fx92 1-2 weeks earlier than it would otherwise by shipping in Fx93?

I wasn't sure if this was a good idea / what kinds of things get uplifted to release. I guess 1-2 weeks is fine for something that's not a crash/security problem.

Flags: needinfo?(pbone)
You need to log in before you can comment on or make changes to this bug.