Closed Bug 624549 Opened 13 years ago Closed 13 years ago

Overly aggressive calls to nsJSContext::MaybeCC() cause pauses in Flight of the Navigator

Categories

(Core :: DOM: Core & HTML, defect)

x86
macOS
defect
Not set
normal

Tracking

()

RESOLVED FIXED

People

(Reporter: pcwalton, Assigned: smaug)

References

()

Details

Attachments

(6 files, 1 obsolete file)

There are bad GC pauses in the Flight of the Navigator demo that seem to be caused by cycle collection happening too often:

	0.0%	51.3%	XUL	                                   nsUITimerCallback::Notify(nsITimer*)
	0.0%	51.3%	XUL	                                    nsObserverService::NotifyObservers(nsISupports*, char const*, unsigned short const*)
	0.0%	51.3%	XUL	                                     nsObserverList::NotifyObservers(nsISupports*, char const*, unsigned short const*)
	0.0%	51.3%	XUL	                                      nsUserActivityObserver::Observe(nsISupports*, char const*, unsigned short const*)
	0.0%	51.3%	XUL	                                       nsJSContext::MaybeCC(int)
	0.0%	51.3%	XUL	                                        nsJSContext::CC(nsICycleCollectorListener*)
	0.0%	51.2%	XUL	                                         nsXPConnect::GarbageCollect()
	0.0%	51.2%	XUL	                                          nsXPConnect::Collect()
	0.0%	51.2%	XUL	                                           js_GC(JSContext*, JSCompartment*, JSGCInvocationKind)
Maybe nsUserActivityObserver::Observe() should avoid calling MaybeCC() if animations are going on? We could check the paint count and if it's continually rising we can avoid calling MaybeCC().
Summary: Overly aggressive calls to nsJSContext::MaybeCC() causes pauses in Flight of the Navigator → Overly aggressive calls to nsJSContext::MaybeCC() cause pauses in Flight of the Navigator
Assignee: nobody → pwalton
This bug (and pcwalton's frame rate tool + Shark) are hawt :-|.

/be
Attached patch Proposed patch. (obsolete) — Splinter Review
Proposed patch attached.

This simplifies our CC logic to just two timers:
* The "opportunistic" timer, which fires once every 5 seconds after any "user interaction event", which is defined as a UI event or a repaint.
* The "guaranteed" timer, which fires every 5 minutes.

Each time any of these timers fires, a cycle collection is performed. This tries to ensure that cycle collections happen soon after any UI event, but don't block the events. At the same time, it makes sure that cycle collections *will* happen eventually, no matter what, to guard against a persistent animation starving the CC.
Attachment #502709 - Flags: feedback?(gal)
Feedback from smaug would be appreciated too: I'm sure I missed some edge cases in there :)
Comment on attachment 502709 [details] [diff] [review]
Proposed patch.

This looks awesome. Change GC to CC in the constants. Sometimes you indent by 4. Also 

+  return mGuaranteedTimer->InitWithCallback(static_cast<nsITimerCallback*(this),
                                             NS_GUARANTEED_GC_DELAY,
+                                            nsITimer::TYPE_ONE_SHOT);

looks a lot nicer than ( overhand.
Attachment #502709 - Flags: feedback?(gal) → feedback+
Attachment #502709 - Flags: feedback?(Olli.Pettay)
Have you posted the patch to tryserver?
It is, or at least was, *very* easy to regress at least tp4 when changing
CC scheduling.

And Fennec may not want a "guaranteed timer", since IIRC the aim has been
to remove all unnecessary timers which run when the browser is otherwise idle.
And the patch makes us call CC even in cases there are no new suspected
objects.

If user moves mouse all the time, CC is called after 5 mins?
That is a long time. *Lots* of garbage can be created during that time.
I mean hundreds of megabytes.
And the 5 mins is also the time how often CC is called when user isn't
active?

The user-interaction-active notification in nsDisplayListBuilder::EnterPresShell
is strange. Painting has nothing to do with user interaction.
And I'd guess that notifying during painting could show up in the profiles.
If something like that is needed, perhaps nsDisplayListBuilder could
have some timestamp when the last paint happened and CC scheduling could
use that?

How does https://bugzilla.mozilla.org/show_bug.cgi?id=517665#c16
behave with the patch? (Remember to not interact with the browser at all
while running that test.)
Btw, what is the URL for the testcase for this bug?
(In reply to comment #7)
> How does https://bugzilla.mozilla.org/show_bug.cgi?id=517665#c16
> behave with the patch?
Actually, the patch shouldn't affect to this after all.
Attached file a testcase for CC
This creates lots of cycle collectable objects.
(In reply to comment #10)
> Created attachment 502773 [details]
With the patch it is easy to OOM.
(In reply to comment #10)
> Created attachment 502773 [details]
> a testcase for CC
> 
> This creates lots of cycle collectable objects.

Without the patch, running that test for a few seconds and then closing the tab causes my I'm-pretty-sure-CC-related periodic pauses to Get Big.  Do we not do a CC on tab close or bfcache expiry?

Even after like 5 minutes, they haven't gone away and I'm still at 1.6GB of memory.
(In reply to comment #7)
> And the patch makes us call CC even in cases there are no new suspected
> objects.
> 
> If user moves mouse all the time, CC is called after 5 mins?
> That is a long time. *Lots* of garbage can be created during that time.
> I mean hundreds of megabytes.
> And the 5 mins is also the time how often CC is called when user isn't
> active?

I'm dropping it down to 1 minute after a conversation with Andreas. Do you think that's better?

> And Fennec may not want a "guaranteed timer", since IIRC the aim has been
> to remove all unnecessary timers which run when the browser is otherwise idle.

How about keying it to user interaction events (i.e. every so often, a cycle collection gets automatically run)?
That is, after N user interaction events, a cycle collection is always run.

Also, IMHO regressing tp4 is something we should consider taking. It seems to me that this is a big win for users: essentially, the status quo right now is that all benefits of compartmental GC are lost due to aggressive cycle collection. I haven't found a single page that benefits in a significant way from compartmental GC without this patch, because the CCs constantly trigger full GC.
(In reply to comment #15)
> That is, after N user interaction events, a cycle collection is always run. 
That is the current behavior, although currently there are other parameters too,
like the number of new suspected objects, and never run CC more often than
every 10 seconds.

> Also, IMHO regressing tp4 is something we should consider taking.
I agree. One reason for the complexity of CC scheduling parameters is
the do-not-regress-any-perf-test.

> It seems to
> me that this is a big win for users: essentially, the status quo right now is
> that all benefits of compartmental GC are lost due to aggressive cycle
> collection.
Is the time really spent in cycle collection or in garbage collection?

> I haven't found a single page that benefits in a significant way
> from compartmental GC without this patch, because the CCs constantly trigger
> full GC.
The problem with the patch is that it really doesn't handle cases when lots
of cycle collectable garbage (not JS objects) is created.

Have you debugged what causes CC to run? I mean are there lots of
new suspected objects or what?
Could we just increase NS_MIN_SUSPECT_CHANGES and/or NS_MAX_SUSPECT_CHANGES
significantly?
(In reply to comment #13)
> Without the patch, running that test for a few seconds and then closing the tab
> causes my I'm-pretty-sure-CC-related periodic pauses to Get Big.  Do we not do
> a CC on tab close or bfcache expiry?
No. Currently we try to do CC when user is inactive and/or there are some new 
suspected objects. But if the pauses stay really long, that sounds like
something is keeping the document alive. The testcase creates a *huge* document.
(In reply to comment #12)
> http://videos.mozilla.org/serv/mozhacks/flight-of-the-navigator/
Unfortunately this doesn't work on my Linux machine (because of WebGL).
I'll test this all on Mac.
Patrick, or anyone, have you tried flight-of-the-navigator with the
patch for Bug 565217?
We saw MaybeCC invoked full-JS-GC more often than every 10 seconds during FotN with Patrick's frame-rate monitor, which automatically sharks the culprit(s). So something is already buggy there (on file?).

Doing CC on a fixed timer seems wrong in any scenario. It'll chew battery. It will jank your demo at the worst possible moment, Murphy sez.

/be
(In reply to comment #16)
> Have you debugged what causes CC to run? I mean are there lots of
> new suspected objects or what?

Lots of user-interaction-inactive I believe; I'll add some debugging in and report back.

The reason I replaced the logic with a timer is that user-interaction-active was triggering calls to MaybeCC(), which struck me as odd - it meant that moving the mouse over a page rapidly would cause cycle collection, unless I'm misunderstanding the code.

> Could we just increase NS_MIN_SUSPECT_CHANGES and/or NS_MAX_SUSPECT_CHANGES
> significantly?

I do think that we want to add some kind of animation heuristic. Interrupting an animation to do CC hurts our perception of performance, even when no user interaction is occurring. This is the main problem in FOTN.
(In reply to comment #20)
> Doing CC on a fixed timer seems wrong in any scenario. It'll chew battery. It
> will jank your demo at the worst possible moment, Murphy sez.

Currently CC doesn't run on a fixed timer. When user is active or just became
inactive, there is a timer which fires the active/inactive notifications.
Otherwise CC is triggered by thread observer. Well, page loads can also trigger
CC.
(In reply to comment #21)
> Lots of user-interaction-inactive I believe; I'll add some debugging in and
> report back.
Ah, that is something we should tweak. Maybe the time from user-interaction-inactive to cycle collection should be quite long.


> The reason I replaced the logic with a timer is that user-interaction-active
> was triggering calls to MaybeCC(), which struck me as odd - it meant that
> moving the mouse over a page rapidly would cause cycle collection, unless I'm
> misunderstanding the code.
CC shouldn't run more often than every 10 seconds, and it should run
only if there are some new suspected objects.
It is very possible that the value of those constants aren't what we want.

> I do think that we want to add some kind of animation heuristic. Interrupting
> an animation to do CC hurts our perception of performance, even when no user
> interaction is occurring. This is the main problem in FOTN.
But we must CC every now and then if lots of cycle collectable garbage is
created.
See also https://bugzilla.mozilla.org/show_bug.cgi?id=565217#c7
"I was hoping to run the CC a lot more often with this patch, since we can stop it at any time."
Keep in mind that I have a patch up for review that makes the cycle collector interruptible, so scheduling it becomes much less of a problem.
(In reply to comment #25)
> Keep in mind that I have a patch up for review that makes the cycle collector
> interruptible, so scheduling it becomes much less of a problem.

Well... in an ideal world, yes. Right now we don't normally stay idle long enough to CC fully.
Probably not only on Mac OS X.

We need to do something low-risk but effective for this bug in Firefox 4. I think we can, based on progress so far.

/be
blocking2.0: --- → ?
OS: Mac OS X → All
Hardware: x86 → All
Interruptible CC is not the answer. It has enough unknowns to be "not Firefox 4" and as bent says, it didn't even help other hard cases because the CC was starved and had to run fully. It's a fine thing to experiment with for a later release.

We have mistuning of existing tunables, and as Patrick observes complete lack of animation tuning as well. This can be more safely and locally fixed, I bet.

/be
So, without this patch, moving the mouse repeatedly over FOTN seems to make the CCs go away. Could we just make paints count as user interaction as a quick fix?
> I do think that we want to add some kind of animation heuristic.

Perhaps a more general heuristic could be used? Animations aren't the only thing that is noticeable by users if GC pauses occur. Sound, realtime JS calculations (not necessarily rendered as animations), are other potential examples.

It should be feasible to check how much the main thread was idle, for example (that is, how large the interval is between events running there). An animation would interrupt the main thread frequently, as would other timing-sensitive things.

In other words, a GC pause will introduce a delay in the main thread, we can check based on recent activity how disruptive that delay would be.
(In reply to comment #30)
> It should be feasible to check how much the main thread was idle, for example
> (that is, how large the interval is between events running there). An animation
> would interrupt the main thread frequently, as would other timing-sensitive
> things.

Our main thread is almost never idle right now, so this approach is not really going to help at the moment. In the future this might work though.
I was debugging FOTN a bit, and the reason it triggers CC quite often is
that it actually creates plenty of suspected objects, and some objects
are really garbage which needs to be collected occasionally.

I think we could change nsJSContext::CC so that it doesn't always call GC.
(It is GC which is slow in FOTN, not CC.) Some cycles are collectable even
if the GC didn't run just before CC - in FOTN case this approach seems to
work well.
I also increased NS_MIN_SUSPECT_CHANGES and NS_MAX_SUSPECT_CHANGES.

I need to now figure out how to build compartmental GC, and then run some
more tests.
(In reply to comment #31)
> (In reply to comment #30)
> > It should be feasible to check how much the main thread was idle, for example
> > (that is, how large the interval is between events running there). An animation
> > would interrupt the main thread frequently, as would other timing-sensitive
> > things.
> 
> Our main thread is almost never idle right now, so this approach is not really
> going to help at the moment. In the future this might work though.

Is that true if the user is not generating interaction events? All the threads (including the main one) get almost entirely silent in that case, when I tested (on Fennec).

There remain a few rare events, but it was less than 1 in a few seconds, whereas if an animation is running we get dozens of events each second - so the two cases are very easy to distinguish.
Compartment-wise GC is on.

The good work Andreas et al. did to separate CC and GC will pay off here if it is indeed the case that there's little to cycle-collect. We need to avoid the global mark phase. We could even do a full GC before starting an intensive JS app or animiation to seed the gray bits.

/be
Patrick, want to try this?

This makes us call cc usually without gc, but sometimes, if gc hasn't run, with gc.

I'll investigate if we need some special handling for animation.

Btw, on my machine CC takes about 7ms, GC several hundreds ms in FOTN.
Attachment #502903 - Flags: feedback?(pwalton)
Attachment #502709 - Flags: feedback?(Olli.Pettay) → feedback-
(In reply to comment #35)
> Created attachment 502903 [details] [diff] [review]
Uploaded to tryserver.
Attachment #502903 - Flags: feedback?(pwalton) → feedback+
With this patch, all GCs due to cycle collection are gone in Flight of the Navigator. Awesome!
Oh, really. I still get some GCs, but apparently they are coming from jseng.
Er, right, you said "due to cycle collection". Yes, that is what I see.
I'm now trying to figure out if there are some problematic cases for the 
approach.
(In reply to comment #37)
> With this patch, all GCs due to cycle collection are gone in Flight of the
> Navigator. Awesome!

Very, very, nice.  Good job, guys.  :)
There probably should be a forced GC at some point;
maybe when user is inactive and there isn't any "animation".
We should force GC when we see task manager start up, or the user visits about:memory.  (half-kidding)
I'll run Massif tomorrow to see how the patch behaves 
with the testcase from http://blog.mozilla.com/jseward/2011/01/07/space-profiling-the-browser/.
It is possible that memory usage goes up way too much, but then it is time
to tweak CC/GC scheduling parameters some more.
Attachment #503128 - Attachment description: + addition GC in DOMWindowUtils → + additional GC in DOMWindowUtils
There weren't any significant changes to perf test results in tryserver.

I'm using the patch atm, and so far things seem to be ok.
I have lots of tab open (over 200 or so), and when closing them, memory is released the way it should be and I don't see bad CC hiccups.
blocking2.0: ? → ---
OS: All → Mac OS X
Hardware: All → x86
Comment on attachment 503128 [details] [diff] [review]
+ additional GC in DOMWindowUtils

Andreas, what do you think about this?
Attachment #503128 - Flags: review?(gal)
Comment on attachment 503128 [details] [diff] [review]
+ additional GC in DOMWindowUtils

Looks good. I would like to tune this further after 4. Where does the 5000 come from? Gut feeling or did we measure that? Patrick, can you try this patch and see what the lag looks like?
Attachment #503128 - Flags: review?(gal) → review+
(In reply to comment #47)
> Where does the 5000 come
> from? Gut feeling or did we measure that?
Sort of both. I did test several common web apps at some point and tried
to come up with constants which worked reasonably well then.
But it was long ago, and both browser UI and web apps have changed.

attachment 503128 [details] [diff] [review] is otherwise the same as previous patch, but I wanted to keep
DOMWindowUtils::GarbageCollection behavior the same as before.
And this all is something which really needs some real world testing.
There are just so many different kinds of web apps, and different ways to
use a browser.
It is possible that some tweaking is needed.
Attachment #502709 - Attachment is obsolete: true
Attachment #503128 - Flags: approval2.0?
Just ran FOTN before and after. Before this patch, we have 38 GC pauses. With this patch applied, we have 18 GC pauses. That's a reduction in GC pause time of 54%. Moreover, with this patch, the pauses are concentrated in the part when the ship flies into the city, where more garbage is being created (I believe), which is what we would expect.
I landed the patch, but it seemed to trigger an unrelated null pointer crash
in nsPresContext. Backed out.
I'll file a bug for the prescontext bug.
(In reply to comment #51)
> it seemed to trigger an unrelated null pointer crash
> in nsPresContext.
Hmm, not quite sure about the crash yet.
Depends on: 625311
I'm also puzzled by bug 561007 - it failed a couple of times last spring, then nothing until October 12th when it failed three times (two of which you starred), then nothing until while this was in and it failed another five or so times.
And, incomprehensible though it is, js1_5/extensions/regress-452168.js timed out only on OS X 10.5 debug jsreftests, both times.
Blocks: 625871
We've been using this patch a lot to build or ff4 demo, and it's a work of art perf wise.  Not sure where what people think about getting it in for 4.  Any update on issues with backout?
(In reply to comment #56)
> We've been using this patch a lot to build or ff4 demo, and it's a work of art
> perf wise.  Not sure where what people think about getting it in for 4.  Any
> update on issues with backout?

I believe the consensus was to just try landing it again?
I've tried to run it again today with some other relevant patches.
There is still one test failure though.
What's the failure?
TEST-UNEXPECTED-FAIL | chrome://mochitests/content/chrome/toolkit/components/places/tests/chrome/test_381357.xul | Test timed out.

Haven't managed to reproduce that locally, but it happens quite
reliably on tryserver.
should this be approved given the test failure?
Approval or not, it can't land until the regression is fixed. smaug? patrick?
Working on it.
Just posted a bit less regression risky patch to tryserver
(with Bug 614347)
Attached patch v4Splinter Review
With this the failure doesn't happen so often.
Locally I still haven't managed to reproduce the problem even with the original
patch.

Note, the problem has happened earlier, even without the patch, but something
triggers it to happen a lot more often when CC/GC handling is changed.

This patch is anyway less regression risky, since it doesn't change page load
handling.

Need to still investigate this some more... Have to figure out what in the test is problematic.
(though, I'm not 100% sure which test causes the problem)
The test is timing out, and it uses places (livemarks). And the bug pre-exists, just is easier to repro with the patch that we want to land.

I suspect the native code involved is to blame, and unless the patch breaks some non-test user-facing scenario, we benefit more by taking this bug's patch and doing something to get the test out of our way. We should find and fix the bug, where ever it may be, but we need to ship Firefox 4 and we would like to reduce these CC pauses.

Cc'ing some big guns to see whether they agree.

/be
disabling the test seems ok, if we are sure which test is acting up.
Sayrer, it is your test ;) and yes, disabling that one particular test does
seem to make tests green.
But I'm still trying to understand why it fails occasionally.
Just an fyi, but this bug is blocking one of the demos slated  for the fx 4 release.
So far I haven't figured out what causes the problem.
I can't reproduce the problem locally.
Something somewhere down deep under reloadLivemarkFolder() fails.

I guess we could disable that test for now, take the patch, and I could
continue to debug why test_381357.xul fails occasionally (rarer without the patch). Debugging using just tryserver and code inspection is slow.
Changing CC/GC scheduling is rather scary and certainly needs time for
real world testing before the release, and we should be ready to back the patch
out it if causes problems.
Attachment #503128 - Flags: approval2.0?
Comment on attachment 506534 [details] [diff] [review]
v4

Andreas, what you think of this. This one doesn't change the
CC handling during page loads, so it should be less-risky.

I would need to disable that one test for now.
(like in http://hg.mozilla.org/try/rev/5620932f1e10)
Attachment #506534 - Flags: review?(gal)
sayrer already said ok to disable test. Followup bug filed and cited please. I can double-authorize if you need it.

/be
Comment on attachment 506534 [details] [diff] [review]
v4

>diff --git a/dom/base/nsDOMWindowUtils.cpp b/dom/base/nsDOMWindowUtils.cpp
>--- a/dom/base/nsDOMWindowUtils.cpp
>+++ b/dom/base/nsDOMWindowUtils.cpp
>@@ -650,16 +650,19 @@ nsDOMWindowUtils::GarbageCollect(nsICycl
> {
>   // Always permit this in debug builds.
> #ifndef DEBUG
>   if (!IsUniversalXPConnectCapable()) {
>     return NS_ERROR_DOM_SECURITY_ERR;
>   }
> #endif
> 
>+  if (nsContentUtils::XPConnect()) {
>+    nsContentUtils::XPConnect()->GarbageCollect();
>+  }

Why are we adding a mandatory GC back here? That doubles the latency of this call?
>   // nsCycleCollector_collect() no longer forces a JS garbage collection,
>   // so we have to do it ourselves here.
>-  if (nsContentUtils::XPConnect()) {
>+  if (nsContentUtils::XPConnect() &&
>+      (aForceGC ||
>+       (!GetGCRunsSinceLastCC() &&

Why do we care whether the GC has run since the last CC?
(In reply to comment #73)
> Comment on attachment 506534 [details] [diff] [review]
> v4
> 
> >diff --git a/dom/base/nsDOMWindowUtils.cpp b/dom/base/nsDOMWindowUtils.cpp
> >--- a/dom/base/nsDOMWindowUtils.cpp
> >+++ b/dom/base/nsDOMWindowUtils.cpp
> >@@ -650,16 +650,19 @@ nsDOMWindowUtils::GarbageCollect(nsICycl
> > {
> >   // Always permit this in debug builds.
> > #ifndef DEBUG
> >   if (!IsUniversalXPConnectCapable()) {
> >     return NS_ERROR_DOM_SECURITY_ERR;
> >   }
> > #endif
> > 
> >+  if (nsContentUtils::XPConnect()) {
> >+    nsContentUtils::XPConnect()->GarbageCollect();
> >+  }
> 
> Why are we adding a mandatory GC back here? That doubles the latency of this
> call?
So that all our tests using nsDOMWindowUtils::GarbageCollect work the
way they used to.
Actually, I can now just call ::CC with force gc parameter.

> Why do we care whether the GC has run since the last CC?
We used to call GC, while it was still part of CC.
Attached patch v6Splinter Review
Attachment #507845 - Flags: review?(gal)
Attachment #506534 - Flags: review?(gal)
Depends on: 561007
Comment on attachment 507845 [details] [diff] [review]
v6

Thanks for spending so much time on this smaug. Awesome that we got a fix for this for FF4.
Attachment #507845 - Flags: review?(gal) → review+
Attachment #507845 - Flags: approval2.0?
Assignee: pwalton → Olli.Pettay
Attachment #507845 - Flags: approval2.0? → approval2.0+
http://hg.mozilla.org/mozilla-central/rev/3ea2b5a7c9c8
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
For what it's worth, this caused 10% regression in the Dromaeo (CSS) test on Mac 10.6.  Not sure what that translates to in terms of the "real" dromaeo test.
Thats bad :( We should find out why.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Yeah, changes to perf results are very likely when tweaking CC or GC scheduling :(
Should the patch be backed out?
This bug is fixed, until it *is* backed out.  A separate bug for the regression seems better than using comments 80+ for it here; if the conclusion is to back it out, then we can reopen, but IMO it's hard to imagine wanting 10% on one dromaeo subtest more than the responsiveness wins here.
Status: REOPENED → RESOLVED
Closed: 13 years ago13 years ago
Resolution: --- → FIXED
Agree with shaver -- dromaeo has enough issues and it's the tail that ought not wag the dog here.

/be
Do file and investigate as time allows, for sure. Tails can tell dogs to jump when doors are slamming, so we want to know if it's a heavy front door or a light screen door (or slight motion of air) before too long.

/be
Yeah lets leave the patch in I concur but we need a bug to investigate this. I can work on that bug with patrick. smaug want to file it?
Depends on: 630156
Depends on: 614347
No longer depends on: 614347
Depends on: 614347
Depends on: 630877
That patch was for Bug 624867, I assume.
Depends on: 630947
Depends on: 631494
Component: DOM → DOM: Core & HTML
You need to log in before you can comment on or make changes to this bug.