This is likely the root cause of bug 700645 and bug 701443.
We need a simple, synthetic testcase to prove that this is the cause. Then we presumably need to figure out how to modify the CC to avoid failing so hard here.
Now that I think about it, I should be able to analyze the CC logs from you and from dvander to see what the deal is with the DOM nodes in there.
The basic problem is that for "normal" DOM nodes, the CC sees that a node is part of a window that is currently being displayed, and then doesn't traverse the entire tree. In at least jlebar's IRCcloud case, the page is generating huge numbers of nodes that are not in the currently displayed window, but are somehow still live. The question is: can we somehow test that these are live?
The first step is to figure out why these are alive, and then see if we can link the DOM nodes to what is holding them alive.
One thing I need to look at is whether these nodes are truly free floating, or whether they are part of a separate DOM tree. I have a crazy idea for handling the latter case: treat the entire DOM tree as a single node in the CC graph, and write-barrier anything (like setUserData) that creates edges out of the graph.
Created attachment 574861 [details]
In the synthetic benchmark, the failure mode involves a disconnected subtree with a million nodes. The relevant code is:
var thing3 = doc.createElement("p");
The key is that after the function returns, thing3 can be GCd. thing3 is an xpconnect wrapper (a slim wrapper here, though a regular wrapper works too) and when it's GCd it Releases the native object which causes it to get suspected (or, with a regular xpconnect wrapper, some xpconnect thing which holds the native object gets suspected).
Then when the CC runs, the suspected native for thing3 ends up traversing the entire disconnected subtree (the million nodes). Since the nodes aren't in the document, the INTERRUPTED_TRAVERSE stuff doesn't save us. If you load this in a clean profile you'll see this quite clearly. After you click your way through a bunch of slow script dialogs, your CC time will be extremely low (around 10ms on my machine). If you click on the 3rd button, the next CC will be quite long (around 1s on my machine).
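For illustration, the shape of the failure can be sketched with plain JS objects standing in for DOM nodes (a toy model, not real CC or XPConnect code; all names here are made up): once any one node in a disconnected subtree is suspected, a traversal that follows both child edges and the strong parent edge reaches every node in the subtree.

```javascript
// Toy model of the CC graph walk: each "node" has a strong parent
// pointer and a child list, like nsINode after the strong-parent change.
function makeTree(breadth, depth) {
  const root = { parent: null, children: [] };
  let frontier = [root], count = 1;
  for (let d = 0; d < depth; d++) {
    const next = [];
    for (const n of frontier) {
      for (let i = 0; i < breadth; i++) {
        const c = { parent: n, children: [] };
        n.children.push(c);
        next.push(c);
        count++;
      }
    }
    frontier = next;
  }
  return { root, count };
}

// Suspecting a single node pulls in the whole disconnected subtree:
// the walk follows the strong parent edge up and child edges down.
function traverseFrom(node) {
  const seen = new Set();
  const stack = [node];
  while (stack.length) {
    const n = stack.pop();
    if (seen.has(n)) continue;
    seen.add(n);
    if (n.parent) stack.push(n.parent); // strong parent pointer
    stack.push(...n.children);
  }
  return seen.size;
}

const { root, count } = makeTree(10, 4); // 11,111 nodes
const leaf = root.children[0].children[0].children[0].children[0];
console.log(traverseFrom(leaf) === count); // true: one leaf drags in everything
```

With breadth 10 and depth 4 that's 11,111 nodes visited for a single suspected leaf; scale the same shape to a million nodes and you get the pause described above.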
I think in some sense this is a regression from the strong parent pointer stuff. The architectural issue was there before, but now that parent pointers are strong we end up traversing the entire disconnected subtree as opposed to some subsubtree.
One thing we could maybe do is not traverse a node if its JS object was marked black by the GC.
That doesn't help us very much, though (at least in the worst case). If the one native whose JS object is black is on the other side of the subtree we still end up traversing almost all of it.
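To illustrate that worst case with the same kind of toy model (plain objects, with an illustrative `black` flag standing in for "JS wrapper marked black by the GC"): pruning only the black-wrapped node itself barely helps when that node is at the far end of the structure.

```javascript
// Toy model: skip a node whose JS wrapper was marked black by the GC,
// but prune nothing else. The `black` flag is illustrative.
function traverseSkippingBlack(start) {
  const seen = new Set();
  const stack = [start];
  while (stack.length) {
    const n = stack.pop();
    if (seen.has(n) || n.black) continue; // only the black node is skipped
    seen.add(n);
    if (n.parent) stack.push(n.parent);
    stack.push(...n.children);
  }
  return seen.size;
}

// A chain of N nodes: the lone black wrapper at one end, the suspected
// node at the other. We still visit N - 1 of the N nodes.
const N = 1000;
const nodes = [{ parent: null, children: [], black: true }];
for (let i = 1; i < N; i++) {
  const n = { parent: nodes[i - 1], children: [], black: false };
  nodes[i - 1].children.push(n);
  nodes.push(n);
}
console.log(traverseSkippingBlack(nodes[N - 1])); // 999
```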
If you load the synthetic benchmark in Firefox 8 (the last version without strong parent pointers) you'll find that the CC pause is negligible (as expected) and that the testcase loads much, much, much faster. I'm not sure whether the latter is connected to the former or not.
Nominating for tracking on the basis that we have a synthetic benchmark that shows massive performance regressions.
Apparently the loading pause only shows up in my local build (not even on nightlies ...) but the CC slowness is definitely there.
Created attachment 574869 [details]
The previous benchmark with one addition, a case that causes the high CC times in weak-parent-pointer builds.
So strong parentNode did add a new way to cause huge CC graphs, but there were other ways to create
such graphs even with weak parentNode.
Sure, but it looks like this is being hit by web content now, while it wasn't before.
If we decide to disable strong parentNode in 9 (and 10), that decision must happen *very* soon,
during this week or so.
Backing out should be doable, but it will touch lots of code and would take some time
to get all the backout patches ready.
And, if we back out, I think we should back out from trunk too until cycle collector can handle
the situation better.
But, I'm still hoping we could find out some, perhaps hacky, fix for cycle collector handling.
Before we panic too much, we should try to figure out how much this actually hits web content. We've seen a few weird cases that maybe cause problems, but only Asa has been able to consistently recreate a problem. (I haven't confirmed that dvanderson's case is actually this. I'll try to put together a script today that can find these problems.)
We should look at telemetry, and compare what 8 looks like to 9 and 10. I think the best measure to look at is the number of ref counted nodes in the CC graph. That will avoid problems due to differing hardware, though not websites. Is there an appreciable spike?
Secondly, precisely what are we trying to fix? Cases where the top level node is being held live by JS? Where any node is being held live by JS? Where any node is being held live for any reason?
I have some ideas for the first two.
An unoptimized black node hack (which doesn't leak with Kyle's test, and the thing3 test's
CC time drops in a debug build from 750ms to 15ms).
The algorithm is silly (it misses marking nodes, which leads to a lot of iterating),
and that's why it could cause slowness in certain cases.
Pushed to try to see if it leaks :)
But anyway, I hope the patch shows that perhaps there are simple ways to fix at least the
most common problems related not-in-document subtrees.
Yeah, this patch is going to be something like n log n in the worst case. n being the number of nodes, and log n being the depth of the tree. Or something like that.
Can we somehow move this check into the cycle collector itself? We could associate with owner documents if we've ever found a member document that has a nongray wrapper that is keeping it alive. Then, while we are traversing up the parents, we add the parents to the graph. This will keep them from being traversed twice, at the cost of bloating up the graph a bit before we find the wrapped node.
Are these nodes with marked wrappers going to be traversed as XPConnectRoots? Those are added to the graph first, so potentially if we store nongrayness in the owner document then we don't have to actually traverse up the parent chain: by the time we get to nodes in the "meat" of the DOM, we'll already have examined if any of the members of the DOM are being held alive in this way.
I should also make sure this will actually help with the IRCcloud graph I got from jlebar. I looked at the paths that were holding things live, and they followed some weird path through JS back out to a DOM, but the path finding can be a little wonky.
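As a rough sketch of the "walk up the parents once and remember the answer" idea, using plain objects again (the `heldAlive` cache and `black` flag are illustrative, not real Gecko fields): each suspected node walks up to the root at most once, and memoizing along the path keeps shared ancestors from being re-walked.

```javascript
// Toy sketch: decide whether a node's subtree is held alive by a black
// (GC-marked) wrapper on its root, caching the answer on every ancestor
// visited so repeated suspicions don't re-walk shared parent chains.
function subtreeHeldAlive(node) {
  const path = [];
  let n = node;
  while (n && n.heldAlive === undefined) {
    path.push(n);
    n = n.parent;
  }
  // Either we hit a cached answer, or we reached the actual root.
  const result = n ? n.heldAlive : path[path.length - 1].black === true;
  for (const p of path) p.heldAlive = result; // memoize along the path
  return result;
}

// Tiny example: a root with a black wrapper and two children. The
// second query stops at the cached answer on the root.
const root = { parent: null, black: true };
const a = { parent: root, black: false };
const b = { parent: root, black: false };
console.log(subtreeHeldAlive(a), subtreeHeldAlive(b)); // true true
```

This is the source of the "n log n, or something like that" cost mentioned above: every node pays at most one walk to the root, and cached ancestors short-circuit later walks.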
(In reply to Andrew McCreight [:mccr8] from comment #16)
> Can we somehow move this check into the cycle collector itself? We could
> associate with owner documents if we've ever found a member document that
> has a nongray wrapper that is keeping it alive.
I'm not sure what this means. "member document"?
> Then, while we are
> traversing up the parents, we add the parents to the graph. This will keep
> them from being traversed twice, at the cost of bloating up the graph a bit
> before we find the wrapped node.
Adding anything to the graph and using ::Traverse is slow compared to the
non-virtual inline calls that can be used when dealing with nodes.
> Are these nodes with marked wrappers going to be traversed as
> XPConnectRoots? Those are added to the graph first, so potentially if we
> store nongrayness in the owner document
But we can't store non-grayness in the owner document. We're dealing with
unconnected subtrees here. ownerDocument doesn't own the nodes in that case.
> then we don't have to actually
> traverse up the parent chain: by the time we get to nodes in the "meat" of
> the DOM, we'll already have examined if any of the members of the DOM are
> being held alive in this way.
This would work if ownerDocument would keep disconnected subtrees alive, but it doesn't.
> I should also make sure this will actually help with the IRCcloud graph I
> got from jlebar. I looked at the paths that were holding things live, and
> they followed some weird path through JS back out to a DOM, but the path
> finding can be a little wonky.
To fix the regression-like part of this for 9/10 it should be enough to figure out whether the root of the subtree is
black, since in 8 it was necessary to keep the root explicitly alive to keep the subtree alive.
This could be optimized so that if we're dealing with a tree which has a document as its root,
we mark that document as black, and then it is enough to check node->GetCurrentDoc()->IsBlack().
But for trees where some other node type is the root, we would need to mark all the ancestors,
and clear the markers after the CC. That would be O(2n) (where n = number of nodes in the tree).
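A toy sketch of that mark-then-clear scheme (names are illustrative, not real Gecko APIs); each node in the marked chain is touched once to mark and once to clear, which is where the O(2n) comes from.

```javascript
// Toy version of the "mark ancestors, clear after CC" scheme.
// Pass 1 marks every ancestor of a known-live node; pass 2 clears.
function markAncestorsBlack(node) {
  const marked = [];
  for (let n = node; n && !n.ccMarked; n = n.parent) {
    n.ccMarked = true;   // "known alive for this CC"
    marked.push(n);
  }
  return marked;         // remember what to clear afterwards
}

function clearMarkers(marked) {
  for (const n of marked) n.ccMarked = false;
}

// With the chain marked, a per-node liveness check is O(1) whenever
// the parent is already marked, and O(depth) at worst.
const root = { parent: null };
const mid = { parent: root };
const leaf = { parent: mid };
const marked = markAncestorsBlack(leaf); // marks leaf, mid, root
console.log(marked.length);              // 3
clearMarkers(marked);
console.log(root.ccMarked);              // false
```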
Btw, the hacky patch passed tests on try without leaks.
Okay, most of my comments were just me being confused about the document...
Created attachment 575341 [details]
Counter Smaug's patch
Here's a test example that is a minor variant of Kyle's that causes slow CC times even with Smaug's patch. The basic trick is here:
var nu_root = document.createElement("span");
root = document.createElement("a");
Instead of storing the tippy-topmost node (nu_root) in a JS variable, we store one of its children. Thus when we suspect some DOM node deep in the tree, we don't find the node with the marked wrapper.
And as I said before, in the IRCcloud case, with jlebar's log, the DOM isn't being held alive by a marked wrapper, so this optimization may not work there either. The topmost element in the DOM, which appears to be a <div>, has a preserved wrapper, but the wrapper is not marked. I didn't check every node on the way up. I also made the assumption that a random element I picked must be part of the huge glob.
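In toy-model terms (plain objects again, illustrative `black` flag), the trick is that the black wrapper is now on an interior node, so a check that only consults the topmost ancestor concludes the tree is garbage.

```javascript
// Toy counter-case: only the interior node ("root" in the testcase,
// a child of nu_root) has a black wrapper, so a root-only check misses it.
function rootIsBlack(node) {
  let n = node;
  while (n.parent) n = n.parent;
  return n.black === true; // consults only the topmost ancestor
}

const nuRoot = { parent: null, black: false }; // not stored anywhere
const inner = { parent: nuRoot, black: true }; // stored in a JS variable
const leaf = { parent: inner, black: false };
console.log(rootIsBlack(leaf)); // false: the optimization doesn't fire
```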
I looked at telemetry and didn't see a particularly huge difference in ref counted CC graph size between 8 and 9. Looking at CC times, they seem about the same too.
(In reply to Andrew McCreight [:mccr8] from comment #19)
> Created attachment 575341 [details]
> Counter Smaug's patch
> Here's a test example that is a minor variant of Kyle's that causes slow CC
> times even with Smaug's patch. The basic trick is here:
> var nu_root = document.createElement("span");
> root = document.createElement("a");
> Instead of storing the tippy-topmostnode (nu_root) in a JS variable, we
> store one of its children.
Sure, but that doesn't even work in 8 - I mean, in 8 we'd delete nu_root.
So I'm not too worried about that case, since I expect web sites to actually store the
root somewhere because of the Gecko bug in <= 8.
> And as I said before, in the IRCcloud case, with jlebar's log, the DOM isn't
> being held alive by a marked wrapper, so this optimization may not work
> there either.
> The topmost element in the DOM, which appears to be a <div>,
> has a preserved wrapper, but the wrapper is not marked. I didn't check
> every node on the way up. I also made the assumption that a random element
> I picked must be part of the huge glob.
But something must be keeping the subtree alive. In 8, if you don't explicitly keep the root of some subtree
alive, the root will be deleted, and all its children disconnected from it. This can lead to several smaller, disconnected subtrees.
smaug: I've been running the try build you linked in the other threads, with both the AMCtv website and IRCcloud going. heap-unclassified has held fairly steady, below 30%. Maybe it is increasing slowly. My CC pause times are a little weird. About 7/8 are reasonable-ish, around 200 to 300ms. About 1/8 are longer, around 750ms to 1000ms. The long CC was getting worse and worse, but dropped down to the low end again. There's no reason I can see that it has gotten better, and there's nothing in particular that I've noticed about why the longer one is longer. Sometimes they are right after a GC, sometimes there's another CC in between.
I mean the patch in bug 702609.
I'd be interested to know how the patch affects those cases where heap-unclassified is high.
Are we going to take any countermeasures here for 9? The clock is almost out ...
If the problem is the strong parent node, it is quite strange that the first bugs were filed 3.5 months
after the patch landed on trunk.
Also, some reports say that there is a regression from 9->10
(which would mean some totally different problem)
But in any case I can prepare the strong-parentNode-backout patch for 9.
I wonder if bc's spider could let us know which top sites seem to trigger these GC/CC problems.
I know there have been lots of additions of logging capability for memory, GC and CC, but I haven't kept up on the details. Can someone point me in the right direction to find out more about how to use them?
I can get the time in seconds at which a GC or CC occurs since the browser started from GC(T+time) or CC(T+time), and the duration of CCs from the "duration: [0-9]+ ms.$" pattern in the CC output line.
What are the other interesting fields from the output lines or should I just collect them all in a csv for further analysis?
Would you want the results for starting the browser, loading the page, waiting for X seconds, then stopping the browser or do you want the results for a continuous run where the browser starts then successively loads pages with a delay of X seconds from the top sites while recording the GC/CC output? What should be the wait period to let the browser sit on a page?
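For what it's worth, parsing those lines into CSV could look roughly like this; the exact line format below is an assumption pieced together from the fields mentioned above, so the regex will likely need adjusting against the real console output.

```javascript
// Hypothetical parser for GC/CC console lines assumed to look like:
//   "CC(T+42.5) collected: 100, duration: 410 ms."
// The format is guessed from the fields described above.
function parseGcLine(line) {
  const m = line.match(/^(GC|CC)\(T\+([\d.]+)\).*duration:\s*(\d+)\s*ms\.$/);
  if (!m) return null;
  return { type: m[1], startSec: parseFloat(m[2]), durationMs: parseInt(m[3], 10) };
}

// Collect every matching line into a CSV for later spreadsheet analysis.
function toCsv(lines) {
  const rows = lines.map(parseGcLine).filter(Boolean);
  return ["type,startSec,durationMs",
          ...rows.map(r => `${r.type},${r.startSec},${r.durationMs}`)].join("\n");
}

console.log(toCsv(["CC(T+42.5) collected: 100, duration: 410 ms."]));
// type,startSec,durationMs
// CC,42.5,410
```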
I'd like to know about pages which, say, 15s after page load cause >500ms CC times.
(exact!) steps to reproduce would be great.
Btw, a patch landed on Friday which changes CC handling in some cases.
If you see any regressions, please file new bugs and make them block Bug 712743.
If you see lower CC times, report them in bug 712743.
Created attachment 584706 [details]
Individual Page Loads Top 999 (bzip2 Open Office ODF Spreadsheet)
Created attachment 584707 [details]
Continuous Page Loads Top 999 (bzip2 Open Office ODF Spreadsheet)
I captured the GC and CC times on two separate Linux VMs using today's Nightly build for the top 999 sites in two cases:
1) Individual Loading - Start the browser, Load the Page and wait 15 seconds, Shutdown the browser.
2) Continuous Loading - Start the browser, Load each Page in succession waiting 15 seconds after each page, Shutdown the browser.
As you can see there are a few spikes for individual sites, but none over 500 ms (one CC up to 410 ms), and perhaps a secular growth trend in the continuous loading case, but I believe 999 sites is not a sufficient sample. If this presentation is good enough for you, I can produce results for a larger set of pages.
Would it have been better to accumulate the total GC+CC for each site rather than reporting the results of each individual GC/CC ?
Created attachment 585388 [details]
top 10,000 sites gc times (csv.bz2)
The test run where the browser is started, the page loaded, and the browser shut down after 15 seconds has completed. I've attached a bzip2d csv file of the data since it is smaller. It is trivial to import into a spreadsheet and to create graphs.
The top 10 times were:
Site                      Type  Duration (ms)
http://jsonline.com/ GC 1396.5
http://haaretz.co.il GC 1258.2
http://crictime.com/ GC 646.6
http://haaretz.com/ GC 303
http://ilemi.me/ GC 289
http://mako.co.il/ CC 288
http://archdaily.com/ CC 233
http://hindustantimes.com/ GC 232.9
http://redmondpie.com/ CC 224
http://deadspin.com/ CC 192
Created attachment 585391 [details]
top 6900 gc times single session (csv.bz2)
After some hangs and restarts the test run where each of the top 10,000 sites is loaded during a single browser session is still running and has completed 6900 sites. I've attached the bzip2d csv file of the data for your analysis. I'll let the run complete but consider this data sufficient and won't attach the full set unless someone requests it.
Note the secular increases in the gc times and their subsequent drop offs. The drop offs correspond to my manually closing the number of pop up windows that had been opened by the web sites. The gc times increase as the number of open windows increases and drops back to "normal" levels after the extra windows are closed.
Created attachment 585913 [details]
partial test with current nightly (csv)
Created attachment 585914 [details]
partial test with try (csv)
Created attachment 586055 [details]
partial test with current nightly (csv)
Created attachment 586056 [details]
partial test with try (csv)
            Nightly   Try
min GC            0     0
average GC       21    20
max GC          580   223
stddev GC        23    14
min CC            0     0
average CC        8    17
max CC          581   328
stddev CC        15    22
(all times in ms)
Hey Bob - sending this tracked, yet unassigned, bug over to you since I think you've taken next action.
Ouch! That's hot! (http://en.wikipedia.org/wiki/Hot_Potato_%28game%29) Actually Smaug is doing the work. I'm just helping out with some data collection.
Created attachment 586970 [details]
completed test nightly csv.bz2
Created attachment 586972 [details]
completed test try csv bz2
            Nightly   Try
min GC            0     0
average GC       20    22
max GC          796  8872 *
stddev GC        22    64
min CC            0     0
average CC        7    19
max CC          581   328
stddev CC        11    24
(all times in ms)
[*] not reproducible.
Not really a MemShrink issue, as there's no leak here. The lack of about:memory coverage for disconnected DOMs is being tracked in bug 704623.
Looks fixed to me, on a recent Nightly. CC times never go above, say, 50ms with the two example files. They still cause 5 second CCs when you close the tab, and the CC frees a million nodes, but that's a separate issue.