This is a tracking bug for performance work on the cycle collector.
Created attachment 261877 [details] Zipped-up profile as of today This is a profile of a build with the patches to bug 373693 applied. To get this, I loaded a large bonsai query in one window, and browsed in another one. Every so often I moused over the bonsai window to trigger traversal of all that stuff. I started and stopped jprof programmatically around the three calls to nsCycleCollector_collect in nsJSEnvironment; if we have other entry points into it (other than shutdown), I'd love to know. The profile largely shows hashtable stuff, actually. I should try the patch in bug 377606; didn't realize it was there.
Actually, looks like I can't apply that on top of the patch to bug 373693. I'll just wait for more stuff to land.
changing Browser.sessionhistory.max_total_viewers in about:config from -1 to 3 produced a noticeable performance increase. perhaps an at least temporary "workaround" would be to submit a patch that changes it to (at least) 3 upon install and maybe if its possible changes it if ff is already installed. Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9a4pre) Gecko/20070417 Minefield/3.0a4pre - Build ID: 2007041704
Created attachment 262393 [details] Updated profile This was generated the same way that the profile in comment 1 was generated, but in a build with the patches for bug 372110, bug 373693, bug 374872, bug 377606 applied (basically, all the patches checked in as of right now as far as I can tell). Every so often I hovered the bonsai page to make sure it was purple (from the mousemove event dispatch). There are 455673 hits under nsCycleCollector::Collect, broken down mostly as as: 191121 nsCycleCollector::MarkRoots() 164912 nsCycleCollector::ScanRoots() 43962 nsCycleCollector::CollectWhite() 31329 nsXPConnect::BeginCycleCollection() 9144 PL_DHashTableFinish 7560 nsXPConnect::FinishCycleCollection() 6565 memset nsXPConnect::BeginCycleCollection is spent under JS_GC, memset (allocating the hashtable, I guess), and PL_DHashTableFinish. nsCycleCollector::CollectWhite just calls PL_DHashTableEnumerate, which pretty much just calls FindWhiteCallback. About 40% of the time is spent in FindWhiteCallback itself, another 30% under nsCycleCollector::Forget, 5% QI on nsTextNodes, 5% UnmarkPurple on generic elements and text nodes, 10% in QI on various HTML elements. nsCycleCollector::ScanRoots basically calls GraphWalker::Walk, which more or less calls nsCycleCollectionXPCOMRuntime::Traverse (there is a detour that gets there through nsXPConnect::Traverse calling GraphWalker::DescribeNode calling scanWalker::VisitNode calling GraphWalker::Walk). The time under there is basically split between: canonicalize() -- 24% of ScanRoots PL_DHashTableOperate -- 27% of ScanRoots, split up by caller more or less as: 50% -- GraphWalker::NoteXPCOMChild 25% -- GraphWalker::Walk 10% -- nsDocument::GetReference 10% -- nsBindingManager::GetBinding 3% -- GraphWalker::NoteScriptChild 2% -- nsContentUtils::TraverseListenerManager ToParticipant -- 10% of ScanRoots. nsDeque push and pop -- 8% of ScanRoots (from Walk and NoteXPCOMChild). ScanBlackWalker::ShouldVisitNode -- 2.5% of ScanRoots In (not under) GraphWalker::NoteXPCOMChild -- 2.5% of ScanRoots In (not under) nsGenericElement::cycleCollection::Traverse -- 4% of ScanRoots Various other non-hotspotty stuff. nsCycleCollector::MarkRoots calls GraphWalker::Walk, which mostly calls nsCycleCollectionXPCOMRuntime::Traverse, etc. From the bottom up, we have: canonicalize() -- 20% of MarkRoots PL_DHashTableOperate -- 37% of MarkRoots, split up by caller more or less as: 66% -- GraphWalker::NoteXPCOMChild 16% -- GraphWalker::Walk 6% -- nsDocument::GetReference 4% -- nsBindingManager::GetBinding 2% -- GraphWalker::NoteScriptChild 2% -- nsContentUtils::TraverseListenerManager ToParticipant -- 9% of MarkRoots. nsDeque push and pop -- 6% of MarkRoots (from Walk and NoteXPCOMChild). In (not under) GraphWalker::NoteXPCOMChild -- 2.5% of MarkRoots In (not under) nsGenericElement::cycleCollection::Traverse -- 3% of ScanRoots Various other non-hotspotty stuff, I think. Things that jump out at me: canonicalize() and ToParticipant() together make up 24% of the total cycle collection time. I wonder whether there's some way to at least cache the results between ScanRoots and MarkRoots. PL_DHashTableOperate makes up another 25% of the total. Some of these tables we could try to optimize a bit; for example, could we use bits on nodes to indicate whether they have a binding or a preserved wrapper so that we don't have to hit the hashtable for all the ones that do not? Basic problem really is that we're walking a _lot_ of stuff here. :( The pause I was profiling is about 3 seconds long on my machine, without jprof running. Note: I got some of the data above using -i and -e options to jprof; I'm not sure it's possible to get some of those numbers directly out of the profile that I'm attaching.
Filed bug 378389 on the preserved wrapper table and bug 378390 on the binding manager table.
(In reply to comment #4) 5% QI on > nsTextNodes, 10% in QI on > various HTML elements. Could we improve QIing, changing the order of NS_INTERFACE_MAP_ENTRYs.
Yes, we could. That could hurt performance elsewhere, of course....
Yes, improving QI is 1 of the 3 things we're going to try next: 1) Don't traverse some root objects (probably nsGlobalWindow and nsDocument) that hang on to a large number of other objects, if we can know for sure that they are alive (black). Sicking will file a bug and make a patch. 2) Improve QI by moving cycle collection entries. I will file a bug and make a patch. 3) Record the traversal, so that we can play it back instead of traversing multiple times. Dbaron filed bug 378514 for this.
Can we mark this fixed, or is there more work to do?
I'm going to mark this fixed as it hasn't been touched in a few years. Later CC perf tracking bugs: bug 549533 and bug 698919.