Closed Bug 565217 Opened 10 years ago Closed 7 years ago

make cycle collector interruptible

Categories

(Core :: XPCOM, defect, P2)

defect

Tracking

()

RESOLVED WONTFIX
Tracking Status
blocking2.0 --- -

People

(Reporter: gal, Assigned: mccr8)

References

(Blocks 1 open bug)

Details

(Keywords: perf, Whiteboard: [Snappy:P2])

Attachments

(1 file, 11 obsolete files)

42.43 KB, patch
Details | Diff | Splinter Review
We have discussed this a couple times before. The cycle collector should be interruptible. Currently it is not because its intertwined with the JS garbage collector. We will break that dependency and allow the CC to be aborted at any time (i.e. when a mouse/keyboard/network event arrives). This will hide the latency, and at the same time allow us to CC more often.
Depends on: 549806
We should really do this for 1.9.3, but we can probably do it past the first beta.
Priority: -- → P2
Target Milestone: --- → mozilla1.9.3
Keywords: meta, perf
Depends on: 565229
Blocks: 490122
The GC has to leave behind gray bits for the CC. bug 565229 will make sure the bitmap survives the GC phase.
I fixed bug 565229. Its available on tracemonkey and will be merged into m-c shortly.
Hardware: x86 → All
After talking to a few people, here is my design for a partially concurrent, interruptible cycle collector:

- Add the ability to observe when the main thread is about to block on the native event queue:
  1) Add 2 new methods to nsIThreadObserver, beforeWait() and afterWait().
  2) Change nsBaseAppShell to loop with DoProcessNativeEvent(mayWait = FALSE) until we are out of events, then call beforeWait(), call DoProcessNativeEvent(mayWait = TRUE), and then afterWait().
- Move the cycle collector into its own thread. It blocks on a condvar until its triggered, and notifies a condvar when its done. Hook up the current CC trigger point to do just this. This will get the CC out of the main thread's cache on multi core machines and is a win in itself.
  3) Write up beforeWait() and afterWait() to call into the cycle collector controller. beforeWait() starts a CC on the other thread if some time has passed. The CC runs concurrently to us blocking on the native event queue. When afterWait() is called, 2 cases are possible:
  * We are still building the CC model. Increment an atomic counter in afterWait, which the CC thread samples. It will abort what it does and return to its waiting stage.
  * We are done building the CC model. The CC continues, finds cycles, and builds up a bunch of nsIRunnables that contain a list of dead objects to be Unlink()ed. Send the list to the main thread for disposal when done and resume the waiting stage.

I think I can do all of the above tasks, but most of this stuff is parallelizable and I am new to the gecko code base, so if I can get help with some of these tasks, that would be great.
Question.  Given a webpage with a 10ms interval timer on it, won't the main thread wake up every 10ms?  And will that mean that only CC runs that take less than 10ms will be able to complete while that web page is loaded?
How do you plan to deal with all cycle-collected objects being main-thread-only?

Did you consider using the concurrent CC algorithm in the Bacon & Ranjan paper?

Also, do you happen to have any performance benchmarks for cycle collection pause performance?  I'd sort of like to figure out at some point if my work on bug 549793 actually helps.
dbaron, the cycle collector thread merely is a vessel to get the cycle collector out of the main thread's cache. It only runs concurrently to the main thread _waiting_. We can also let the actual cycle detection run concurrently since it only walks the internal model. Disposing of xpcom objects happens on the main thread.

I would like to get this into FF4, so I am trying to do the minimum surgery for that. The B&R concurrent algorithm is after FF4 I think.

And yeah, bug 549793 can certainly not hurt, in particular because I was hoping to run the CC a lot more often with this patch, since we can stop it at any time. That way individual CCs will hopefully be faster, too.
#5: Yes, the CC can be starved, forcing us eventually to eventually do a blocking CC. Better than doing a blocking one every time :)
Attached patch patch (obsolete) — Splinter Review
Add beforeIdle/afterIdle to nsBaseAppShell. This doesn't work on macosx because macosx doesn't run our event loop. Need mac help.
ccing folks who might know mac appshell.
> macosx doesn't run our event loop

This isn't true ... at least not literally.

(Note that I haven't yet studied the patch, and may understand the issues better after I've done that.)
Basically what I need is for nsBaseAppShell to be told when the main thread blocks for native events, and the moment it comes back (beginIdle, endIdle). I couldn't find a way to make that happen.
I'll look into that tomorrow.
Cool, thanks a lot.
Attached patch Testing patch that works on OS X (obsolete) — Splinter Review
With your (Andreas') patch from comment #9, BeforeIdle() and
AfterIdle() are called on OS X while pages are being loaded or
reloaded, but not (apparently) at other times.  I assume this is what
you meant by saying that it "doesn't work" on OS X.

Turns out the reason is that ProcessNextNativeEvent() isn't the only
place native events are processed.  Many (possibly most) of them get
processed during the monolithic call to [NSApp run] in
nsAppShell::Run().

This isn't easy to work around.

[NSApp run] is called once when the app starts up, and stays running
until it terminates.  But it basically just calls [NSApplication
nextEventMatchingMask:untilDate:inMode:dequeue:] in a tight loop.  So
at first I thought I could replace the call to [NSApp run] with
something like what I do in nsAppShell::ProcessNextNativeEvent().  But
this led to strange problems (for example not being able to quit the
app), from which I conclude that the call to [NSApp run] performs
special, undocumented magic -- which makes it difficult or impossible
to replace.

The only other solution I can think of is to hook calls to
[NSApplication sendEvent:].  Every native event goes through this
handler/method.  But it's tricky to find out where it's been called
from (nsAppShell::Run() or nsAppShell::ProcessNextNativeEvent()), so I
thought it best to have platform-specific calls to BeforeIdle() and
AfterIdle() entirely replace cross-platform calls (i.e. those made
from nsBaseAppShell::SpinNativeEventLoop()).  This is what my current
patch does.

A (partially or wholly) platform-specific solution to this problem is
required on OS X.  But allowing other platforms the option to support
BeforeIdle() and AfterIdle() "natively" may give finer-grained control
over when they're called -- which may turn out to be useful.

I changed BeforeIdle() and AfterIdle() to be static methods --
otherwise I'd have needed to create a global pointer to the current
appshell.  I suspect this may actually be more desireable.  But if not
my patch will need to be changed.

Please try my patch out, and let me know what you think.
Awesome. Thanks a lot. Giving a try now.
Steven, this doesn't seem to work right yet.

BeforeIdle 119
AfterIdle
BeforeIdle 120
cc: Starting nsCycleCollector::Collect(1)

Breakpoint 1, nsCycleCollector_beginCollection () at ../../../xpcom/base/nsCycleCollector.cpp:3236
3236	    return sCollector ? sCollector->BeginCollection() : PR_FALSE;
(gdb) bt
#0  nsCycleCollector_beginCollection () at ../../../xpcom/base/nsCycleCollector.cpp:3236
#1  0x0000000100cf16ce in XPCCycleCollectGCCallback (cx=0x11a773300, status=JSGC_MARK_END) at ../../../../../js/src/xpconnect/src/nsXPConnect.cpp:361
#2  0x0000000104188e2b in GC (cx=0x11a773300) at ../../../js/src/jsgc.cpp:3337
#3  0x0000000104189181 in GCUntilDone (cx=0x11a773300, gckind=GC_NORMAL) at ../../../js/src/jsgc.cpp:3705
#4  0x00000001041892cb in js_GC (cx=0x11a773300, gckind=GC_NORMAL) at ../../../js/src/jsgc.cpp:3759
#5  0x000000010411d5ef in JS_GC (cx=0x11a773300) at ../../../js/src/jsapi.cpp:2410
#6  0x0000000100cf0cb1 in nsXPConnect::Collect (this=0x106417d00) at ../../../../../js/src/xpconnect/src/nsXPConnect.cpp:449
#7  0x0000000101399f06 in nsCycleCollector::Collect (this=0x1050a9600, aTryCollections=1) at ../../../xpcom/base/nsCycleCollector.cpp:2527
#8  0x000000010139a071 in nsCycleCollector_collect () at ../../../xpcom/base/nsCycleCollector.cpp:3224
#9  0x0000000100891def in nsJSContext::CC () at ../../../dom/base/nsJSEnvironment.cpp:3582
#10 0x0000000100891e4c in nsJSContext::IntervalCC () at ../../../dom/base/nsJSEnvironment.cpp:3670
#11 0x0000000100892214 in nsUserActivityObserver::Observe (this=0x11a772e20, aSubject=0x0, aTopic=0x1017b5cee "user-interaction-inactive", aData=0x0) at ../../../dom/base/nsJSEnvironment.cpp:270
#12 0x000000010132ce7e in nsObserverList::NotifyObservers (this=0x10529d708, aSubject=0x0, aTopic=0x1017b5cee "user-interaction-inactive", someData=0x0) at ../../../xpcom/ds/nsObserverList.cpp:130
#13 0x000000010132df16 in nsObserverService::NotifyObservers (this=0x104b1d6b0, aSubject=0x0, aTopic=0x1017b5cee "user-interaction-inactive", someData=0x0) at ../../../xpcom/ds/nsObserverService.cpp:182
#14 0x00000001006a3854 in nsUITimerCallback::Notify (this=0x1187842f0, aTimer=0x104bc5640) at ../../../../content/events/src/nsEventStateManager.cpp:282
#15 0x000000010138a5e2 in nsTimerImpl::Fire (this=0x104bc5640) at ../../../xpcom/threads/nsTimerImpl.cpp:430
#16 0x000000010138a7e6 in nsTimerEvent::Run (this=0x11af2e0b0) at ../../../xpcom/threads/nsTimerImpl.cpp:519
#17 0x0000000101383846 in nsThread::ProcessNextEvent (this=0x104b19620, mayWait=0, result=0x7fff5fbfcf04) at ../../../xpcom/threads/nsThread.cpp:547
#18 0x0000000101311d6d in NS_ProcessPendingEvents_P (thread=0x104b19620, timeout=20) at nsThreadUtils.cpp:200
#19 0x000000010119f4a8 in nsBaseAppShell::NativeEventCallback (this=0x118719100) at ../../../../widget/src/xpwidgets/nsBaseAppShell.cpp:127
#20 0x0000000101153cd2 in nsAppShell::ProcessGeckoEvents (aInfo=0x118719100) at ../../../../widget/src/cocoa/nsAppShell.mm:398
#21 0x00007fff87a84f21 in __CFRunLoopDoSources0 ()
#22 0x00007fff87a83119 in __CFRunLoopRun ()
#23 0x00007fff87a828df in CFRunLoopRunSpecific ()
#24 0x00007fff85cc2ada in RunCurrentEventLoopInMode ()
#25 0x00007fff85cc28df in ReceiveNextEventCommon ()
#26 0x00007fff85cc2798 in BlockUntilNextEventMatchingListInMode ()
#27 0x00007fff86bb6a4a in _DPSNextEvent ()
#28 0x00007fff86bb6399 in -[NSApplication nextEventMatchingMask:untilDate:inMode:dequeue:] ()
#29 0x00007fff86b7c06f in -[NSApplication run] ()
#30 0x0000000101153571 in nsAppShell::Run (this=0x118719100) at ../../../../widget/src/cocoa/nsAppShell.mm:763
#31 0x0000000100ede930 in nsAppStartup::Run (this=0x118830c80) at ../../../../../toolkit/components/startup/src/nsAppStartup.cpp:192
#32 0x000000010002824d in XRE_main (argc=1, argv=0x7fff5fbfeaa0, aAppData=0x104b16200) at ../../../toolkit/xre/nsAppRunner.cpp:3628
#33 0x00000001000011f9 in main (argc=1, argv=0x7fff5fbfeaa0) at ../../../browser/app/nsBrowserApp.cpp:158
(gdb) 

You can see here that cycle collection is hit while we are supposed to be in idle. The AfterIdle notification happens after the cycle collection is done. I think timer events go to the app shell, so we have to coordinate the code you wrote with the app shell code somehow.
A couple people ask today about the distribution of collection times. This is a mostly idle browser:

This is the part of CC that will run while the browse idles:

cc: Starting nsCycleCollector::Collect(1)
cc: mRuntimes[*]->BeginCycleCollection() took 0ms
cc: SelectPurple() took 2ms

This is the part that will run on a background thread, assuming we manage to get the above done while we idle (2ms, doable!).

cc: MarkRoots() took 19ms
cc: ScanRoots() took 4ms
cc: RootWhite() took 0ms [not sure this can run concurrently, investigating]

This is the part that will happen back on the main thread:

cc: CollectWhite() took 0ms

The last part of Collect() is the JS GC running to completion. With my patch to disentangle the CC from JS GC this won't happen here at all, and we have separate bugs to make this time much less (individual compartments get GCed).

cc: Collect() took 112ms

I also checked the idle times. At least on MacOSX we get very decent idle times, even with content like cnn or gmail open (seconds of idle as long the mouse isn't moved).

Last but not least, here is a more realistic CC (a couple windows open):

cc: Starting nsCycleCollector::Collect(1)
cc: mRuntimes[*]->BeginCycleCollection() took 1ms
cc: SelectPurple() took 6ms

Note again that the actual Bacon & Rajan algorithm takes a long time, not building the CC model. Also, the GC part of the model building won't happen here any more.

cc: MarkRoots() took 43ms
cc: ScanRoots() took 7ms
cc: RootWhite() took 0ms
cc: CollectWhite() took 0ms
(In reply to comment #17)

> so we have to coordinate the code you wrote with the app shell code
> somehow

I take this to mean that we're only really "idle" when neither native
events nor Gecko events are being processed.  (Your stack shows that
timer "events" are Gecko events.)

New patch coming up.
Depends on: 573060
Attached patch OS X testing patch rev2 (obsolete) — Splinter Review
Here's a revision of my patch that suspends the operation of its
native idle processing whenever Gecko processes an event (and
therefore "coordinates" its platform-specific idle handling with your
original patch's cross-platform idle handling).  It should guarantee
that the app is never considered idle while either a Gecko event or a
native event is being processed.

But I'm afraid you're not going to like it.  It appears that the Gecko
event queue wakes up every small fraction of a second to process
events (presumably timer events).  So the app is never idle for long.

(I've left in the option to choose entirely platform-specific idle
handling.  I still think it may turn out to be useful.)
Attachment #452111 - Attachment is obsolete: true
Do you have a ballpark number handy? On my laptop with a few windows open I can get past the critical region of the cycle collector in less than 20ms or so. After that its ok if gecko wakes up. The cycle collector will collect to completion in the background and we don't have to throw away the heap model (if the wakeup comes earlier, we have to stop the collector and re-try later).
> Do you have a ballpark number handy?

No, I don't.  I just watched the printf statements flow by in the console log :-)
> It appears that the Gecko event queue wakes up every small fraction
> of a second to process events (presumably timer events).  So the app
> is never idle for long.

It occurs to me that some (most?, almost all?) of these events may be
dummy events created in nsBaseAppShell::OnProcessNextEvent().  If so,
they deserve special handling -- the app should still be considered
idle while the Gecko event loop is processing one of these events (if
it's the only event on the stack).

I'll look into this.
Hmm. Why do we have those events?  Are those what keeps mac builds at near-constant ~10% cpu usage?  (Separate bug, I know.)
> Are those what keeps mac builds at near-constant ~10% cpu usage?

No.  They're used on all platforms.
(By the way (and this is off-topic), I got some ideas for possibly speeding up performance of nsAppShell::ProcessNextNativeEvent() while working on this bug.  I'm currently working like a dog on a new version of the JEP.  But at some point in the next few weeks, when (hopefully) I have more time, I can start looking into them.  When I do, I'll need your help, Boris.)
Sounds good.  I've seen that come up a lot in profiles....
(Following up comment #23)

No, I was wrong -- they really are almost all nsTimerEvent events.

I haven't been able to figure out what kind, exactly.
Might be good to file a followup bug on that?  And looking up the relevant nsTimerImpl mCallbackType types would be a good start....
Yeah, no mysteries -- we hate 'em. Please file, we can dup if needed.

/be
(In reply to comment #28)
> No, I was wrong -- they really are almost all nsTimerEvent events.
> 
> I haven't been able to figure out what kind, exactly.
Note that images can spawn a bunch of them.  See bug 502694.
I've done all I can here -- I really need to get back to work on the JEP.
Thanks for the help Steven! Testing your patch now.
Depends on: 573175
Attached patch Timer/event logging patch (obsolete) — Splinter Review
> Thanks for the help Steven!

You're most welcome!

> And looking up the relevant nsTimerImpl mCallbackType types would be
> a good start....

Given this hint, I found it isn't too difficult to log when every
C/C++ timer on the trunk fires.

So here's one last contribution.  It also logs the class-name of all
Gecko events as they run.  Since it uses rtti, you need to add the
following to your mozconfig:

ac_add_options --enable-cpp-rtti

With it, I was able to discover that almost all of the troublesome
timer events' target is nsGlobalWindow::TimerCallback().  So it/they
are probably installed from JavaScript code.

And so's there's still a lot of work left to be done.

Sigh.

We really need to make timers easier to debug.  At a minimum, there
should be a debug flag that makes all timers (C/C++ and JavaScript
ones) log identifying information every time they fire.
I'm a real glutton for punishment -- I've figured out what happens
during the calls to nsGlobalWindow::TimerCallback().  And here's the
new version of my patch that I used to do it.

Turns out most (really almost all) of these calls are to the function
nextStep() in the "button-periodic-redraw" binding of
toolkit/content/widgets/button.xml.
Attachment #452399 - Attachment is obsolete: true
> Turns out most (really almost all) of these calls are to the
> function nextStep() in the "button-periodic-redraw" binding of
> toolkit/content/widgets/button.xml.

Interestingly, these calls only happen on OS X.  They were added (on
the trunk) by the patch for bug 206636
(http://hg.mozilla.org/mozilla-central/rev/d1555870e786).

The calls are on 100ms timers.  But (as best I can tell) each "default
button" gets its own timer.
Where did you make these measurements, what windows did you have open? This timer is what drives the pulse animation for blue default buttons.
I used my patch from comment #35 in all my tests.

I didn't actually make any time measurements -- I noticed that the nextStep() function with a "delay" of '100'.

And by the way, I'm not at all sure that these timers are a problem.  I just CCed you so that you could answer questions about them (more from others than myself, presumably).
I had just one window open (to the Minefield default home page).  The calls to nextStep() continued even when the page was blank.
> I noticed that the nextStep() function with a "delay" of '100'

I noticed that the nextStep() function is passed to setInterval() with a "delay" of '100'.
> But (as best I can tell) each "default button" gets its own timer.

Is this correct, Markus?
That's correct, yes.

Our future plans for these kinds of animations are in bug 546837 comment 4.
There's one non-anonymous buttond[default] in the Firefox UI itself somewhere. Not sure whether there might be anonymous ones too.  But that shouldn't get you "event queue wakes up every small fraction of a second" unless "small fraction" is 0.1.
Boris, do you have any idea what the behavior would/should be if you had (say) 10 default buttons visible at once?  Would/should this decrease the length of the longest possible idle?

(I'm sure Andreas will do the measurements.  I just wonder if there are any established ideas about how this *should* work.)
setTimeout and setInterval should cost at most one pending XPCOM timer. That's how they used to work, and that's how any minimal implementation would work. Are we really getting multiple nsTimerImpl instances, or just lots of closely spaced timer firings?

/be
A bit of both, really.

Each default button has a separate timer that fires once every 100ms.  The question (or at least my question) is about the spacing of the timer firings as the number of these timers goes up.
Seems to me that the only sane way to deal with this is to have a single timer for these buttons and when that timer fires we update every button that needs updating etc. Even if we're able to synchronize these timers such that they'd all fire after one and other it'd be a waste of timer thread activity to do that over just having one IMO.
> Would/should this decrease the length of the longest possible idle?

With our current code it likely would; basically the firing times would be randomly distributed.  Whether it should is a separate question.
Target Milestone: mozilla1.9.3 → mozilla2.0
We should get a separate bug filed on that button thing.  Can someone with a working computer (i.e. not me) please do that?
I'm not sure what the button problem is. That it uses timers at all?

About the OS X implementation:
The CFRunLoopObserver documentation says that you can be notified when the native event loop enters/exits the idle state by observing kCFRunLoopBefore/AfterWaiting. That sounds like a good place to call Before/AfterIdle().
> That it uses timers at all?

Yes, and especially one timer per button.
(In reply to comment #50)

> About the OS X implementation:
> The CFRunLoopObserver documentation says that you can be notified
> when the native event loop enters/exits the idle state by observing
> kCFRunLoopBefore/AfterWaiting. That sounds like a good place to call
> Before/AfterIdle().

You mean as an alternative to hooking [NSApplication sendEvent:]?
Yes, that might work.

But we'll still need to also call AfterIdle() when we start processing
Gecko events.  In other words, we're only truly idle when we're
processing neither native events nor Gecko events, and
kCFRunLoopBefore/AfterWaiting is only about native events.
If we use a single timer for all animated widgets, that would limit the wakeups to a reasonable interval.

But it seems to me that for people with lots of tabs open (the people who get really hurt by cycle collection currently), there are likely to be various timers, both internal and JS, firing often enough that we probably won't see a useful interval of idle time ... unless we aggressively throttle things for background tabs. Which we should do anyway. Is there work happening on that?

Seems like we need experiments with 50+ tabs open at least, including offenders like GMail, Zimbra, and other common apps.
> Is there work happening on that?

There is for throttling the refresh driver, but not for js timeouts...
(In reply to comment #52)
> But we'll still need to also call AfterIdle() when we start processing
> Gecko events.

I don't think so. We can't process Gecko events while the native event loop is waiting for a new native event. In order to get back into the Gecko event loop we need to leave the native waiting state first, and that's when AfterIdle() would fire.
Depends on: 586216
(In reply to comment #51)
> > That it uses timers at all?
> 
> Yes, and especially one timer per button.

I filed bug 586216.
(In reply to comment #55)

> In order to get back into the Gecko event loop we need to leave the
> native waiting state first, and that's when AfterIdle() would fire.

No.  AfterIdle() should be called when we're no longer idle, which is
just when we start processing a Gecko event.  It doesn't matter
whether or not we continue in a "native idle state".

My current patch calls AfterIdle() more than once between calls to
BeforeIdle().  I'll try to fix that in a future patch ... but it
shouldn't be hard to write AfterIdle() in such a way that this doesn't
matter.
(In reply to comment #57)
> (In reply to comment #55)
> 
> > In order to get back into the Gecko event loop we need to leave the
> > native waiting state first, and that's when AfterIdle() would fire.
> 
> No.  AfterIdle() should be called when we're no longer idle, which is
> just when we start processing a Gecko event.  It doesn't matter
> whether or not we continue in a "native idle state".

I don't understand how this contradicts my suggestion. I'll write up a patch of what I have in mind so that we can be sure we're talking about the same thing.
This patch dumps a message to the console whenever we enter/exit the native run loop waiting state and whenever we process a native event or a Gecko event. It also reduces the animation timer of the invisible progressbar in the main browser window to once per second.

If we called nsBaseAppShell::BeforeIdle() and AfterIdle() at exactly those times where I now print "entering idle state" and "woke up!", we'd satisfy Andreas' requirements, wouldn't we?
> If we called nsBaseAppShell::BeforeIdle() and AfterIdle() at exactly
> those times where I now print "entering idle state" and "woke up!",
> we'd satisfy Andreas' requirements, wouldn't we?

I don't think so.  I think we'd only ever "wake up" when processing a
native event -- not when processing a Gecko event.

But I haven't yet tried your patch -- I need to do that before I'm
sure.
(In reply to comment #60)
> > If we called nsBaseAppShell::BeforeIdle() and AfterIdle() at exactly
> > those times where I now print "entering idle state" and "woke up!",
> > we'd satisfy Andreas' requirements, wouldn't we?
> 
> I don't think so.  I think we'd only ever "wake up" when processing a
> native event -- not when processing a Gecko event.

We do wake up when processing a Gecko event. Gecko events are processed from our CFRunLoopSource which we schedule in ScheduleNativeEventCallback(). In order to notify run loop sources the run loop has to leave the waiting state first.

See also
http://developer.apple.com/mac/library/documentation/CoreFoundation/Reference/CFRunLoopObserverRef/Reference/reference.html#//apple_ref/doc/c_ref/CFRunLoopActivity
(In reply to comment #61)

> We do wake up when processing a Gecko event. Gecko events are
> processed from our CFRunLoopSource which we schedule in
> ScheduleNativeEventCallback().

I've now built your patch, played with it (in gdb), and re-read the
relevant code (in nsAppShell.mm and nsBaseAppShell.cpp).  I was wrong
-- your patch does always "wake up" whenever a Gecko event is
processed.  And in fact things are (with your patch) exactly the
opposite of what I first feared -- it doesn't wake up for native
events (at least not directly).  But I now think this doesn't matter:
As best I can tell, what Andreas wants to do requires only that Gecko
be idle (whether or not native events are being processed should make
no difference).  (I'll talk more about this in my next comment.)

So, though your patch is incomplete, I now think future patches should
be based upon it -- or upon a combination of Andreas' original patch
and your patch, borrowing only mHandleIdleNatively from my patch
(since, on OS X, we'll want your patch to entirely override what
Andreas' patch does on other platforms).

In a bit I'll post a revision to show what I mean.
Here's an updated (and somewhat simplified) version of my debugging
patch from comment #35, incorporating most of Markus' debugging patch.

(Following up comment #62)

> And in fact things are (with your patch) exactly the opposite of
> what I first feared -- it doesn't wake up for native events (at
> least not directly).

I was wrong about this, too.  With my debug patch I can see (from the
NSLog() timestamps being the same) that Markus'
RunLoopObserverCallback() does "wake up" at the same time pretty much
every mouse or keyboard event is processed.  mCFRunLoop (returned by
+[NSRunLoop currentRunLoop]) is "the NSRunLoop object for the current
thread" (i.e. the main thread).  So it makes sense that it would also
process native events (in addition to the Gecko events that we make it
process by adding mCFRunLoopSource to it).

By the way, using Markus' strategy we only call BeforeIdle() and
AfterIdle() on the main thread.  I assume this is correct.  Is it?
Attachment #452430 - Attachment is obsolete: true
Attachment #465408 - Attachment description: Timer/event logging patch rev1 (incorporate Markus' ideas) → Timer/event logging patch rev2 (incorporate Markus' ideas)
(In reply to comment #63)
> I can see (from the
> NSLog() timestamps being the same) that Markus'
> RunLoopObserverCallback() does "wake up" at the same time pretty much
> every mouse or keyboard event is processed.

As far as I can tell, being in the run loop waiting state is defined by the top three items in the call stack being mach_msg_trap | mach_msg | __CFRunLoopRun. If that's correct, everything else that happens on the main thread by definition doesn't happen during the run loop waiting state.

Your patch looks good to me.
(In reply to comment #63)
> By the way, using Markus' strategy we only call BeforeIdle() and
> AfterIdle() on the main thread.  I assume this is correct.  Is it?

Yes, it's exactly what we're looking for!

Thanks for these patches guys. My work is moving along nicely.
Attached patch Rollup patch, v1 (obsolete) — Splinter Review
This patch incorporates Andreas' original patch and Steven/Markus' patch, then makes CC interruptable and hooks it all together. Thanks for the help guys!
Assignee: nobody → bent.mozilla
Attachment #451734 - Attachment is obsolete: true
Attachment #464983 - Attachment is obsolete: true
Attachment #465435 - Attachment is obsolete: true
Status: NEW → ASSIGNED
Attachment #467482 - Flags: review?(peterv)
I should note that we don't make it through a full CC very often before the event loop wakes back up (though we do get complete runs every now and then). We need to get the dependent bugs fixed to give us longer.
IsCollectionCancelled() looks unnecessarily slow. You could use a simple flag that you set from the main thread. We have a similar mechanism in the JS engine (the operation callback flag, js_TriggerOperationCallback()).

Otherwise this looks really awesome. ++bent
Please spell Canceled like an 'murkin, so it doesn't look like the stress goes on the last syllable.

/be
(In reply to comment #70)
> Please spell Canceled like an 'murkin, so it doesn't look like the stress goes
> on the last syllable.

It is so in the patch, Andreas just betrayed his own prejudices ;)
Keywords: meta
Summary: make cycle collector interruptible [meta] → make cycle collector interruptible
Blocking on this as this is a large part of what we need to do to cut down on gc pauses for fx4.
blocking2.0: --- → betaN+
Blocks: 595574
Attached patch Rollup patch, v1.1 (obsolete) — Splinter Review
Merged to t-m tip.
Attachment #467482 - Attachment is obsolete: true
Attachment #488604 - Flags: review?(peterv)
Attachment #467482 - Flags: review?(peterv)
Ben, have you pushed this to tryserver? Does the patch affect to
performance test (dromaeo, tp4) results?
Changing the time when CC runs usually affects to the tests.
Attached patch Patch, v2 (obsolete) — Splinter Review
Ok, this patch is way better. The old one leaked intermittently on try server due to the fact that we were removing things from the purple buffer during incremental CC but then maybe not allowing CC to finish if the main thread woke up again. Fixed that by moving the purple buffer modifications to the end of the traverse process. Also this uses mozilla::TimeStamp as the old interval code could have overflowed and made life miserable.

Peter, please ignore the |fallible_t| changes for the moment as I'm having a hard time getting them to compile on windows. They're not strictly necessary for this patch, but we realized that the move to infallible malloc renders the cycle collector's OOM handling code moot so we need to fix that somehow.

Smaug, I haven't done talos tests yet, any pointers on how to compare times here?
Attachment #488604 - Attachment is obsolete: true
Attachment #492421 - Flags: review?(peterv)
Attachment #488604 - Flags: review?(peterv)
At this point, I don't think we can block on this, we need to ship.
blocking2.0: betaN+ → -
(In reply to comment #75)
> Smaug, I haven't done talos tests yet, any pointers on how to compare times
> here?
Well, just compare the numbers of tp4 at least.
We have a tested patch ready to land, waiting for review for too long :( I will ask for approval when the review is done.
bent, can we bounce the review to someone else than peter?
> I will ask for approval when the review is done.

Right.  This is definitely wanted and we should do it; we just won't block on it.
Comment on attachment 492421 [details] [diff] [review]
Patch, v2

dbaron, maybe you'd like to look over this one? ;)
Attachment #492421 - Flags: review?(dbaron)
I started reviewing this.
We really desperately need this. Thanks peter.
Comment on attachment 492421 [details] [diff] [review]
Patch, v2

>diff --git a/dom/base/nsJSEnvironment.cpp b/dom/base/nsJSEnvironment.cpp

>+void RunLoopObserverCallback(CFRunLoopObserverRef observer,
>+                             CFRunLoopActivity activity,
>+                             void *info)
>+{
>+  if (activity == kCFRunLoopBeforeWaiting) {
>+    nsAppShell::BeforeGeckoIdle();
>+  } else {

Maybe assert that activity is kCFRunLoopAfterWaiting here?

>+    nsAppShell::AfterGeckoIdle();

Why not call nsBaseAppShell::Before/AfterGeckoIdle directly here?

>@@ -306,16 +323,31 @@ nsAppShell::Init()

>+  CFRunLoopObserverContext observerContext;
>+  bzero(&observerContext, sizeof(observerContext));
>+  // observerContext.version = 0;
>+  observerContext.info = this;

We never seem to use info, just pass NULL as the context instead?

>+
>+  mCFRunLoopObserver =
>+    ::CFRunLoopObserverCreate(kCFAllocatorDefault,
>+                              kCFRunLoopBeforeWaiting | kCFRunLoopAfterWaiting,
>+                              true, 0, RunLoopObserverCallback,
>+                              &observerContext);
>+  NS_ENSURE_STATE(mCFRunLoopObserver);
>+
>+  ::CFRunLoopAddObserver(mCFRunLoop, mCFRunLoopObserver, kCFRunLoopCommonModes);

Why not use kCFRunLoopDefaultMode?

>+nsBaseAppShell::AfterIdle()
>+{
>+  if (gIdleCCRunning) {
>+    // This will block until cycle collection finishes.
>+    if (nsCycleCollector_stopCollecting()) {
>+      // Reset the idle time if we actually collected something as we don't want
>+      // to run again immediately.

But we do if we didn't collect anything? Not sure I understand why we treat those conditions differently. Also, if we collected something then running it again immediately might collect more?

>diff --git a/xpcom/base/nsCycleCollector.cpp b/xpcom/base/nsCycleCollector.cpp

>@@ -2624,30 +2716,42 @@ nsCycleCollector::BeginCollection(PRBool

>+    if (IsCollectionCanceled()) {
>+        mScanInProgress = PR_FALSE;
>+        FinishTraverse();

Would it make sense to have some sort of stack object to reset mScanInProgress (and maybe even calls FinishTraverse)?

>@@ -2856,16 +2966,22 @@ nsCycleCollector::ExplainLiveExpectedGar

>+    if (IsCollectionCanceled()) {
>+        printf("nsCycleCollector: not explaining expected garbage since\n"
>+               "  cycle collection was interrupted\n");
>+        return;
>+    }

Doesn't that mean that we never do this for non-blocking collections?

>@@ -2887,16 +3003,18 @@ nsCycleCollector::ExplainLiveExpectedGar
>         // traversal deficiencies.
>         mPurpleBuf.NoteAll(builder);
> 
>         MarkRoots(builder);
>         ScanRoots();
> 
>         mScanInProgress = PR_FALSE;
> 
>+        FinishTraverse();

Shouldn't you remove the loop at the end of ExplainLiveExpectedGarbage that does:

        for (PRUint32 i = 0; i <= nsIProgrammingLanguage::MAX; ++i) {
            if (mRuntimes[i])
                mRuntimes[i]->FinishTraverse();
        }
?

>+                // We can't hold the lock while running cycle collection.

Why exactly?

>+    PRUint32 CollectAndBlock(nsICycleCollectorListener *aListener)
>     {
>         NS_ASSERTION(NS_IsMainThread(), "Wrong thread!");
>+
>+        nsAutoTPtrArray<PtrInfo, 4000> whiteNodes;
>+
>+        NS_ASSERTION(!mListener, "Should have cleared this already!");
>+        mListener = aListener;
>+
>+#ifdef DEBUG
>+        mMode = Blocking;
>+#endif
>+
>+        {
>+            MutexAutoLock autoLock(mLock);
>+
>+            if (!mRunning)
>+                return 0;
>+
>+            {
>+                MutexAutoUnlock autoUnlock(mLock);
>+                if (!mCollector->PrepareForCollection(&whiteNodes))
>+                    return 0;
>+            }
>+
>+            mCollectionInProgress = PR_TRUE;
>+
>+            mRequest.Notify();

>+    PRUint32 StartCollecting(nsICycleCollectorListener *aListener)
>     {
>         NS_ASSERTION(NS_IsMainThread(), "Wrong thread!");
> 

>+        if (!mNonBlockingWhiteNodes)
>+            mNonBlockingWhiteNodes = new nsTPtrArray<PtrInfo>(4000);
> 
>         NS_ASSERTION(!mListener, "Should have cleared this already!");
>         mListener = aListener;
> 
>+#ifdef DEBUG
>+        mMode = NonBlocking;
>+#endif
>+
>+        MutexAutoLock autoLock(mLock);
>+
>+        if (!mRunning)
>+            return 0;
>+
>+        {
>+            MutexAutoUnlock autoUnlock(mLock);
>+            if (!mCollector->PrepareForCollection(mNonBlockingWhiteNodes))
>+                return 0;
>+        }
>+
>+        mCollectionInProgress = PR_TRUE;
>         mRequest.Notify();

>+
>+        return 0;

CollectAndBlock and StartCollecting look very similar, could we merge them and just pass in the mode? This |return 0| would just be an early return if mode == NonBlocking.

>+    PRUint32 StopCollecting()

>+#ifdef DEBUG
>+            if (mCollectionInProgress)
>+                NS_ASSERTION(mMode == NonBlocking, "Wrong mode!");
>+#endif

NS_ASSERTION(!mCollectionInProgress || mMode == NonBlocking, ...);

>+            mCollector->CancelCollection();

This part I find confusing. mCanceled doesn't mean canceled, that's only true if we were still running. Otherwise we've just finished building the graph, no? See also my comment about ExplainLiveExpectedGarbage.

Still need to figure out how/when this makes us GC.
(In reply to comment #84)
> Maybe assert that activity is kCFRunLoopAfterWaiting here?
...
> Why not call nsBaseAppShell::Before/AfterGeckoIdle directly here?
...
> We never seem to use info, just pass NULL as the context instead?
...
> Why not use kCFRunLoopDefaultMode?

Steven, would you mind commenting on those questions here? I didn't modify your code.

> But we do if we didn't collect anything? Not sure I understand why we treat
> those conditions differently. Also, if we collected something then running it
> again immediately might collect more?

My thought was that if we succeed in collecting something we should wait a while before trying again so as to keep the idle CPU usage down. I'm not sure about these heuristics, smaug would probably be better at figuring out when we should try again. I'm just providing the knobs so others can tweak them :)

> Would it make sense to have some sort of stack object to reset mScanInProgress
> (and maybe even calls FinishTraverse)?

I'm not in love with the idea, but I could do that if you want.

> Doesn't that mean that we never do this for non-blocking collections?

Yeah, that's a bug. I need to have two state bits I guess.

> Shouldn't you remove the loop at the end of ExplainLiveExpectedGarbage that

Probably, yes :)

> Why exactly?

Because then when we try to grab the lock in StopCollecting we will block until cycle collection finishes without setting our cancel flag.

> CollectAndBlock and StartCollecting look very similar, could we merge them and
> just pass in the mode? This |return 0| would just be an early return if mode ==
> NonBlocking.

Sure.

> NS_ASSERTION(!mCollectionInProgress || mMode == NonBlocking, ...);

Jonas likes those too...

> This part I find confusing. mCanceled doesn't mean canceled, that's only true
> if we were still running. Otherwise we've just finished building the graph, no?

Yeah, the name may be confusing you. It just means "please stop ASAP". Most of the time that will mean "cancel", but sometimes it will mean nothing.
>> Maybe assert that activity is kCFRunLoopAfterWaiting here?
> ...
>> Why not call nsBaseAppShell::Before/AfterGeckoIdle directly here?
> ...
>> We never seem to use info, just pass NULL as the context instead?
> ...
>> Why not use kCFRunLoopDefaultMode?
>
> Steven, would you mind commenting on those questions here? I didn't
> modify your code.

I'll try to get to these in the next few days.
(In reply to comment #86)
> I'll try to get to these in the next few days.

Thanks!
>>+void RunLoopObserverCallback(CFRunLoopObserverRef observer,
>>+                             CFRunLoopActivity activity,
>>+                             void *info)
>>+{
>>+  if (activity == kCFRunLoopBeforeWaiting) {
>>+    nsAppShell::BeforeGeckoIdle();
>>+  } else {
>
> Maybe assert that activity is kCFRunLoopAfterWaiting here?

Maybe it's best to change the else clause to

  } else if (kCFRunLoopAfterWaiting) {
    nsAppShell::AfterGeckoIdle();
  }

>>+    nsAppShell::AfterGeckoIdle();
>
> Why not call nsBaseAppShell::Before/AfterGeckoIdle directly here?

Fine with me.  Then we could get rid of
nsAppShell::Before/AfterGeckoIdle.

>>+  CFRunLoopObserverContext observerContext;
>>+  bzero(&observerContext, sizeof(observerContext));
>>+  // observerContext.version = 0;
>>+  observerContext.info = this;
>
> We never seem to use info, just pass NULL as the context instead?

Also fine with me.

>>+
>>+  mCFRunLoopObserver =
>>+    ::CFRunLoopObserverCreate(kCFAllocatorDefault,
>>+                              kCFRunLoopBeforeWaiting |
>>+                                kCFRunLoopAfterWaiting,
>>+                              true, 0, RunLoopObserverCallback,
>>+                              &observerContext);
>>+  NS_ENSURE_STATE(mCFRunLoopObserver);
>>+
>>+  ::CFRunLoopAddObserver(mCFRunLoop, mCFRunLoopObserver,
>>+                         kCFRunLoopCommonModes);
>
> Why not use kCFRunLoopDefaultMode?

That's probably a good idea -- we're probably not properly idle unless
the base/default run loop is idle.  But this change would require
additional testing.
> we're probably not properly idle unless the base/default run loop is
> idle

We're probably not properly idle unless the main thread's run loop is
idle in the default mode.
Comment on attachment 492421 [details] [diff] [review]
Patch, v2

bent says my review is no longer needed here now that peterv is reviewing this
Attachment #492421 - Flags: review?(dbaron)
Should we take this to Fx5?
CC pauses aren't usually very long though, it is GC.
@Olli: pauses are long enough to be a constant drain on my willpower not to smash something...

The best way to get Firefox to do long pauses for testing purposes  is as follows:

- Go to eBay
- Open 25 products in background tabs.
- Look through them. Close the tabs.
- Go to step one.

Repeat this for up to 10 cycles and Firefox will start exhibiting the freezes. 

The easiest way to verify that freezes are present is IMO to press Alt, Down, to drop down the menu, and then just hold the Left arrow. Every now and then a certain menu just stops scrolling.

In my current v4.0, with 6 tabs open (+10 more in Tab Panorama), the pauses exceed 1 second (!). That is not so short as to be unnoticeable - in fact, it's almost unusably long. The PC isn't a slow one - a quad-core with 8GB RAM.

Now that people will start using Tab Panorama to maintain more open tabs than ever before, this will probably become even more severe than in v3.
Roman, are you sure the pauses are caused by CC and not GC?
And note, this bug won't make CC pauses go away, just make them interruptible
(at least in common cases).

I do think we should fix this sooner than later, but we need to understand
how the patch behaves in different cases.
Oh, apologies - no I'm not sure about that. It's just that this is the only bug that bug 490122 currently depends on, so it seemed that there was some certainty about this.
Roman, if you have exact steps to reproduce the problem you see,
could you file a new bug. CC me.
Roman, one way to check if the pauses you're seeing are because of gc and/or cc is to set javascript.options.mem.log to true in about:config and watch the messages in the Error Console, there you'll then see how much time we spend doing gc and cc. Please include that information in your new bug.
Thanks all, enabled logging and will file a separate report if the exact STR become clearer. Really doubt it though, since the problem is clearly of a random nature. I'll try to get a good profiling capture of the pauses though - can't believe I hadn't thought of Visual Studio's built-in profiler before.
Could we get the patch updated and get some tryserver builds for testing.
Andrew, can you take a stab at this? This would help a ton with pause times.
Assignee: bent.mozilla → continuation
Sure, I'll work on that this week.
Target Milestone: mozilla2.0 → ---
(In reply to Andreas Gal :gal from comment #99)
> This would help a ton with pause times.

Well... Yes, if we idled long enough. Last we checked we were almost never idle, maybe that has changed by now?
That really depends on the workload. Time to measure.
Examples of pauses due to CC
CC timestamp: 1320052537368000, collected: 6 (6 waiting for GC), suspected: 309, duration: 3188 ms.
CC timestamp: 1320052524790000, collected: 332 (332 waiting for GC), suspected: 842, duration: 3157 ms.
CC timestamp: 1320052500180000, collected: 0 (6 waiting for GC), suspected: 3700, duration: 3203 ms.
CC timestamp: 1320052470821000, collected: 6 (6 waiting for GC), suspected: 3639, duration: 3281 ms.
CC timestamp: 1320052462493000, collected: 329 (329 waiting for GC), suspected: 3897, duration: 3234 ms.
CC timestamp: 1320052438087000, collected: 6 (6 waiting for GC), suspected: 125, duration: 3156 ms.
Phoenix, something is going terribly wrong with your cycle collector, but that isn't directly related to this bug.  Please file a new bug and CC me on it.  It would also be good if you included information in your new bug report about what sites are causing this problem, what addons you have, what version of Firefox you are running, and so forth.  Thanks!
I already have it, today/tomorrow I'll finish collect new information for update on it and add you in copy
(In reply to Phoenix from comment #103)
> Examples of pauses due to CC

To clarify, we're interested in the time that the event loop is idle (while the app, or at least the main thread, has nothing to do).
Blocks: 698919
Blocks: 646941
Ben, Andrew, please take a look - https://bugzilla.mozilla.org/show_bug.cgi?id=608954#c45
At least nsCycleCollector.cpp compiles.  I haven't tried a full build yet.
Attachment #465408 - Attachment is obsolete: true
Attachment #492421 - Attachment is obsolete: true
Attachment #492421 - Flags: review?(peterv)
Got the patch building, was able to browse around for a few minutes.  I pushed to try to see what it looks like.  Windows probably doesn't build, as I think I forgot to add an argument somewhere.  https://tbpl.mozilla.org/?tree=Try&rev=eec1e90ed3b6
Looks pretty good.  It looks like the leaks are all nsTArrays, probably from sloppy porting of the white node array on my part.  Should be easy to fix.
Andrew, can you get numbers on the impact on responsiveness? and how often do we manage to CC without interruption?
I put up a Linux debug build on try: https://tbpl.mozilla.org/?tree=Try&rev=1fb6386cf639
I somehow managed to turn an nsAutoPtr into a regular pointer when rebasing.  Hopefully this fixes that problem.

The Windows build problem was not actually in widget stuff like I suspected.  Apparently MSVC doesn't like the fallible allocation the patch adds:
  Block *b = new (fallible_t()) Block();

results in these errors:

e:/builds/moz2_slave/try-w32-dbg/build/xpcom/base/nsCycleCollector.cpp(441) : error C2066: cast to function type is illegal

e:/builds/moz2_slave/try-w32-dbg/build/xpcom/base/nsCycleCollector.cpp(441) : error C2440: 'initializing' : cannot convert from 'mozilla::fallible_t (__cdecl **)(void)' to 'EdgePool::Block *'

        Types pointed to are unrelated; conversion requires reinterpret_cast, C-style cast or function-style cast

e:/builds/moz2_slave/try-w32-dbg/build/xpcom/base/nsCycleCollector.cpp(441) : error C2144: syntax error : 'EdgePool::Block' should be preceded by ';'

I'll try to find out the right way to do this in MSVC.
Whiteboard: [Snappy]
Andrew can you assign a snappy priority to this?
Assigning it P2 because I think it will be really great when we're consistently under, say, 100ms, but until then we should focus on getting there first.
Whiteboard: [Snappy] → [Snappy:P2]
No longer blocks: 698919
Blocks: 698919
Incremental CC seems more useful at this point.
(which is bug 850065)
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → WONTFIX
See Also: → 1306728
You need to log in before you can comment on or make changes to this bug.