Bug 702813 - Pages with many DOM objects which don't live in the document can cause multi-second CC pauses
Status: RESOLVED FIXED
Whiteboard: [Snappy]
Product: Core
Classification: Components
Component: DOM
Version: unspecified
Platform: All All
Importance: -- normal, with 2 votes
Target Milestone: ---
Assigned To: Olli Pettay [:smaug]
Depends on: 702609 704623 708572
Blocks: 698919 700645 701443
Reported: 2011-11-15 15:49 PST by Justin Lebar (not reading bugmail)
Modified: 2012-05-20 12:35 PDT
CC: 31 users


Attachments
- Synthetic benchmark (1.11 KB, text/html) - 2011-11-16 03:22 PST, Kyle Huey [:khuey] (khuey@mozilla.com)
- Synthetic benchmark (1.40 KB, text/html) - 2011-11-16 04:19 PST, Kyle Huey [:khuey] (khuey@mozilla.com)
- Counter Smaug's patch (1.49 KB, text/html) - 2011-11-17 17:15 PST, Andrew McCreight [:mccr8]
- Individual Page Loads Top 999 (bzip2 Open Office ODF Spreadsheet) (146.47 KB, application/x-bzip2) - 2011-12-28 21:26 PST, Bob Clary [:bc:]
- Continuous Page Loads Top 999 (bzip2 Open Office ODF Spreadsheet) (264.17 KB, application/x-bzip2) - 2011-12-28 21:27 PST, Bob Clary [:bc:]
- top 10,000 sites gc times (csv.bz2) (255.88 KB, application/x-bzip2) - 2012-01-03 05:54 PST, Bob Clary [:bc:]
- top 6900 gc times single session (csv.bz2) (345.34 KB, application/x-bzip2) - 2012-01-03 06:00 PST, Bob Clary [:bc:]
- partial test with current nightly (csv) (39.39 KB, text/plain) - 2012-01-04 15:09 PST, Bob Clary [:bc:]
- partial test with try (csv) (49.84 KB, text/plain) - 2012-01-04 15:11 PST, Bob Clary [:bc:]
- partial test with current nightly (csv) (191.70 KB, text/plain) - 2012-01-05 06:24 PST, Bob Clary [:bc:]
- partial test with try (csv) (153.69 KB, text/plain) - 2012-01-05 06:24 PST, Bob Clary [:bc:]
- completed test nightly csv.bz2 (198.12 KB, application/x-bzip2) - 2012-01-09 05:51 PST, Bob Clary [:bc:]
- completed test try csv bz2 (171.67 KB, application/x-bzip2) - 2012-01-09 05:52 PST, Bob Clary [:bc:]

Description Justin Lebar (not reading bugmail) 2011-11-15 15:49:20 PST
This is likely the root cause of bug 700645 and bug 701443.

We need a simple, synthetic testcase to prove that this is the cause.  Then we presumably need to figure out how to modify the CC to avoid failing so hard here.
Comment 1 Andrew McCreight [:mccr8] 2011-11-15 15:57:35 PST
Now that I think about it, I should be able to analyze the CC logs from you and from dvander to see what the deal is with the DOM nodes in there.

The basic problem is that for a "normal" DOM node, the CC will see that it is part of a window that is currently being displayed, and then not traverse through the entire thing.  In at least jlebar's IRCcloud case, the page is generating huge amounts of nodes that are not in the currently displayed window, but are somehow still live.  The question is, can we somehow test whether these are live?

The first step is to figure out why these are alive, and then see if we can link the DOM nodes to what is holding them alive.
Comment 2 Andrew McCreight [:mccr8] 2011-11-15 17:38:10 PST
One thing I need to look at is whether these nodes are truly free-floating, or whether they are part of a separate DOM tree.  I have a crazy idea for handling the latter case: treat the entire DOM tree as a single node in the CC graph, and write-barrier anything (like setting user data) that creates edges out of the graph.
Comment 3 Kyle Huey [:khuey] (khuey@mozilla.com) 2011-11-16 03:22:43 PST
Created attachment 574861 [details]
Synthetic benchmark
Comment 4 Kyle Huey [:khuey] (khuey@mozilla.com) 2011-11-16 03:35:02 PST
In the synthetic benchmark, the failure mode involves a disconnected subtree with a million nodes.  The relevant code is:

function touch3()
{
  var thing3 = doc.createElement("p");
  last.appendChild(thing3);
}

The key is that after the function returns, thing3 can be GCd.  thing3 is an xpconnect wrapper (a slim wrapper here, though a regular wrapper works too) and when it's GCd it Releases the native object which causes it to get suspected (or, with a regular xpconnect wrapper, some xpconnect thing which holds the native object gets suspected).

Then when the CC runs, the suspected native for thing3 ends up traversing the entire disconnected subtree (the million nodes).  Since the nodes aren't in the document, the INTERRUPTED_TRAVERSE stuff doesn't save us.  If you load this in a clean profile you'll see this quite clearly.  After you click your way through a bunch of slow script dialogs, your CC time will be extremely low (around 10ms on my machine).  If you click on the 3rd button, the next CC will be quite long (around 1s on my machine).

I think in some sense this is a regression from the strong parent pointer stuff.  The architectural issue was there before, but now that parent pointers are strong we end up traversing the entire disconnected subtree as opposed to some subsubtree.
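For illustration, the traversal blow-up can be modeled outside Gecko with plain JS objects standing in for nodes. This is a sketch of the mechanism, not the real CC code: `makeChain` and `traverseCount` are made-up names, and the "strong parent pointer" is just a `parent` field.

```javascript
// Hypothetical model: with strong parent pointers, traversing from any one
// suspected node reaches the entire disconnected subtree, so CC cost scales
// with subtree size rather than with the number of suspected nodes.
function makeChain(n) {
  // Build a parent -> child chain of n nodes, like repeated appendChild calls.
  const root = { parent: null, children: [] };
  let last = root;
  for (let i = 1; i < n; i++) {
    const node = { parent: last, children: [] };
    last.children.push(node);
    last = node;
  }
  return { root, last };
}

// Count every node reachable from `start` via both child and parent edges,
// mimicking a CC traverse that follows strong parent pointers.
function traverseCount(start) {
  const seen = new Set();
  const stack = [start];
  while (stack.length > 0) {
    const node = stack.pop();
    if (seen.has(node)) continue;
    seen.add(node);
    if (node.parent !== null) stack.push(node.parent);
    for (const child of node.children) stack.push(child);
  }
  return seen.size;
}

const { last } = makeChain(100000);
// Suspecting just the leaf (like the GC'd thing3 wrapper) still visits all nodes.
console.log(traverseCount(last)); // 100000
```

With weak parent pointers only the suspected node's descendants would be reached, which is why the old behavior touched "some subsubtree" instead of the whole tree.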
Comment 5 Peter Van der Beken [:peterv] 2011-11-16 03:38:56 PST
One thing we could maybe do is not traverse a node if its JS object was marked black by the GC.
Comment 6 Kyle Huey [:khuey] (khuey@mozilla.com) 2011-11-16 03:42:03 PST
That doesn't help us very much, though (at least in the worst case).  If the one native whose JS object is black is on the other side of the subtree we still end up traversing almost all of it.
Comment 7 Kyle Huey [:khuey] (khuey@mozilla.com) 2011-11-16 03:53:41 PST
If you load the synthetic benchmark in Firefox 8 (the last version without strong parent pointers) you'll find that the CC pause is negligible (as expected) and that the testcase loads much, much, much faster.  I'm not sure whether the latter is connected to the former or not.
Comment 8 Kyle Huey [:khuey] (khuey@mozilla.com) 2011-11-16 03:55:11 PST
Nominating for tracking on the basis that we have a synthetic benchmark that shows massive performance regressions.
Comment 9 Kyle Huey [:khuey] (khuey@mozilla.com) 2011-11-16 04:08:02 PST
Apparently the loading pause only shows up in my local build (not even on nightlies ...) but the CC slowness is definitely there.
Comment 10 Kyle Huey [:khuey] (khuey@mozilla.com) 2011-11-16 04:19:53 PST
Created attachment 574869 [details]
Synthetic benchmark

The previous benchmark with one addition, a case that causes the high CC times in weak-parent-pointer builds.
Comment 11 Olli Pettay [:smaug] 2011-11-16 04:23:02 PST
So strong parentNode did add a new way to cause huge CC graphs, but there were other ways to create
such graphs even with weak parentNode.
Comment 12 Kyle Huey [:khuey] (khuey@mozilla.com) 2011-11-16 04:25:53 PST
Sure, but it looks like this is being hit by web content now, while it wasn't before.
Comment 13 Olli Pettay [:smaug] 2011-11-16 04:43:51 PST
If we decide to disable strong parentNode in 9 (and 10), that decision must happen *very* soon,
during this week or so.
Backing out should be doable, but it will touch lots of code and would take some time
to get all the backout patches ready.

And, if we back out, I think we should back out from trunk too until cycle collector can handle
the situation better.

But, I'm still hoping we could find out some, perhaps hacky, fix for cycle collector handling.
Comment 14 Andrew McCreight [:mccr8] 2011-11-16 08:49:59 PST
Before we panic too much, we should try to figure out how much this actually hits web content.  We've seen a few weird cases that maybe cause problems, but only Asa has been able to consistently recreate a problem.  (I haven't confirmed that dvanderson's case is actually this.  I'll try to put together a script today that can find these problems.)

We should look at telemetry, and compare what 8 looks like to 9 and 10.  I think the best measure to look at is the number of ref counted nodes in the CC graph.  That will avoid problems due to differing hardware, though not websites.  Is there an appreciable spike?

Secondly, precisely what are we trying to fix?  Cases where the top level node is being held live by JS?  Where any node is being held live by JS?  Where any node is being held live for any reason?

I have some ideas for the first two.
Comment 15 Olli Pettay [:smaug] 2011-11-16 14:58:57 PST
An unoptimized black node hack (which doesn't leak with Kyle's test, and the thing3 test's
CC time drops in a debug build from 750ms to 15ms).
The algorithm is silly (it misses marking nodes, which leads to iterating a lot),
and that's why it could cause slowness in certain cases.
Pushed to try to see if it leaks :)
https://hg.mozilla.org/try/rev/6c55fa768be8
https://tbpl.mozilla.org/?tree=Try&rev=6c55fa768be8

But anyway, I hope the patch shows that perhaps there are simple ways to fix at least the
most common problems related not-in-document subtrees.
Comment 16 Andrew McCreight [:mccr8] 2011-11-16 17:42:44 PST
Yeah, this patch is going to be something like n log n in the worst case.  n being the number of nodes, and log n being the depth of the tree.  Or something like that.

Can we somehow move this check into the cycle collector itself?  We could associate with owner documents if we've ever found a member document that has a nongray wrapper that is keeping it alive.  Then, while we are traversing up the parents, we add the parents to the graph.  This will keep them from being traversed twice, at the cost of bloating up the graph a bit before we find the wrapped node.

Are these nodes with marked wrappers going to be traversed as XPConnectRoots?  Those are added to the graph first, so potentially if we store nongrayness in the owner document then we don't have to actually traverse up the parent chain: by the time we get to nodes in the "meat" of the DOM, we'll already have examined if any of the members of the DOM are being held alive in this way.

I should also make sure this will actually help with the IRCcloud graph I got from jlebar.  I looked at the paths that were holding things live, and they followed some weird path through JS back out to a DOM, but the path finding can be a little wonky.
Comment 17 Olli Pettay [:smaug] 2011-11-17 02:12:26 PST
(In reply to Andrew McCreight [:mccr8] from comment #16)
 
> Can we somehow move this check into the cycle collector itself?  We could
> associate with owner documents if we've ever found a member document that
> has a nongray wrapper that is keeping it alive.
I'm not sure what this means. "member document"?

>  Then, while we are
> traversing up the parents, we add the parents to the graph.  This will keep
> them from being traversed twice, at the cost of bloating up the graph a bit
> before we find the wrapped node.
Adding anything to the graph and using ::Traverse is slow compared to the
non-virtual inline calls that can be used when dealing with nodes.
 
> Are these nodes with marked wrappers going to be traversed as
> XPConnectRoots?  Those are added to the graph first, so potentially if we
> store nongrayness in the owner document
But we can't store non-grayness in the owner document. We're dealing with
unconnected subtrees here. ownerDocument doesn't own the nodes in that case.

> then we don't have to actually
> traverse up the parent chain: by the time we get to nodes in the "meat" of
> the DOM, we'll already have examined if any of the members of the DOM are
> being held alive in this way.
This would work if ownerDocument would keep disconnected subtrees alive, but it doesn't.

> I should also make sure this will actually help with the IRCcloud graph I
> got from jlebar.  I looked at the paths that were holding things live, and
> they followed some weird path through JS back out to a DOM, but the path
> finding can be a little wonky.
To fix the regression-like part of this for 9/10 it should be enough to figure out if the root of the subtree is
black, since in 8 it was necessary to keep the root explicitly alive to keep the subtree alive.

This could be optimized so that if we're dealing with a tree which has a document as its root,
we mark that document as black, and then it is enough to check node->GetCurrentDoc()->IsBlack().

But in the case of trees where some other node type is the root, we would need to mark all the ancestors
and clear the markers after CC. That would be O(2n) (where n = the number of nodes in the tree).
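A sketch of the check described above, modeled over plain JS objects rather than Gecko's C++ node classes (`makeChain`, `subtreeRootIsBlack`, and the `black` field are illustrative names, not real Gecko APIs):

```javascript
// Hypothetical model: if the root of a disconnected subtree is black (kept
// alive by a marked JS wrapper), every node under it can be skipped by CC.
function makeChain(n) {
  const root = { parent: null, children: [], black: false };
  let last = root;
  for (let i = 1; i < n; i++) {
    const node = { parent: last, children: [], black: false };
    last.children.push(node);
    last = node;
  }
  return { root, last };
}

// Walk up the strong parent pointers to the subtree root and check blackness.
// Doing this per suspected node is the naive version; marking all ancestors
// once and clearing the marks after CC is the O(2n) variant described above.
function subtreeRootIsBlack(node) {
  let cur = node;
  while (cur.parent !== null) cur = cur.parent;
  return cur.black;
}

const { root, last } = makeChain(1000);
root.black = true;                     // the root's wrapper was marked by the GC
console.log(subtreeRootIsBlack(last)); // true -> CC can skip this node
```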

Btw, the hacky patch passed tests on try without leaks.
Comment 18 Andrew McCreight [:mccr8] 2011-11-17 12:01:23 PST
Okay, most of my comments were just me being confused about the document...
Comment 19 Andrew McCreight [:mccr8] 2011-11-17 17:15:25 PST
Created attachment 575341 [details]
Counter Smaug's patch

Here's a test example that is a minor variant of Kyle's that causes slow CC times even with Smaug's patch.  The basic trick is here:

  var nu_root = document.createElement("span");
  root = document.createElement("a");
  nu_root.appendChild(root);

Instead of storing the tippy-topmost node (nu_root) in a JS variable, we store one of its children.  Thus when we suspect some DOM node deep in the tree, we don't find the node with the marked wrapper.

And as I said before, in the IRCcloud case, with jlebar's log, the DOM isn't being held alive by a marked wrapper, so this optimization may not work there either.  The topmost element in the DOM, which appears to be a <div>, has a preserved wrapper, but the wrapper is not marked.  I didn't check every node on the way up.  I also made the assumption that a random element I picked must be part of the huge glob.
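For illustration, the counter-scenario can be modeled with plain objects (not real DOM/CC internals; `subtreeRootIsBlack` and the `black` field are made-up stand-ins for the root-blackness check):

```javascript
// Hypothetical model of the counter-example: JS holds a *child* (root) rather
// than the actual tree root (nu_root), so walking up to the subtree root
// finds no black wrapper and the optimization misses.
function subtreeRootIsBlack(node) {
  let cur = node;
  while (cur.parent !== null) cur = cur.parent;
  return cur.black === true;
}

const nu_root = { parent: null, children: [], black: false }; // not held by JS
const root = { parent: nu_root, children: [], black: true };  // held by JS, marked
nu_root.children.push(root);

const leaf = { parent: root, children: [], black: false };
root.children.push(leaf);

// The true root (nu_root) is not black, so a suspected deep node is not
// skipped and the CC falls back to traversing the whole subtree.
console.log(subtreeRootIsBlack(leaf)); // false
```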
Comment 20 Andrew McCreight [:mccr8] 2011-11-17 17:22:59 PST
I looked at telemetry and didn't see a particularly huge difference in ref counted CC graph size between 8 and 9.  Looking at CC times, they seem about the same too.
Comment 21 Olli Pettay [:smaug] 2011-11-18 02:50:44 PST
(In reply to Andrew McCreight [:mccr8] from comment #19)
> Created attachment 575341 [details]
> Counter Smaug's patch
> 
> Here's a test example that is a minor variant of Kyle's that causes slow CC
> times even with Smaug's patch.  The basic trick is here:
> 
>   var nu_root = document.createElement("span");
>   root = document.createElement("a");
>   nu_root.appendChild(root);
> 
> Instead of storing the tippy-topmostnode (nu_root) in a JS variable, we
> store one of its children.
Sure, but that doesn't even work in 8 - I mean, in 8 we'd delete nu_root.
So I'm not too worried about that case, since I expect web sites to actually store the
root somewhere because of the Gecko bug in <= 8.


> And as I said before, in the IRCcloud case, with jlebar's log, the DOM isn't
> being held alive by a marked wrapper, so this optimization may not work
> there either.
>  The topmost element in the DOM, which appears to be a <div>,
> has a preserved wrapper, but the wrapper is not marked.  I didn't check
> every node on the way up.  I also made the assumption that a random element
> I picked must be part of the huge glob.
But something must be keeping the subtree alive. In 8, if you don't explicitly keep the root of some subtree
alive, the root will be deleted, and all its children disconnected from it. This can lead to several
disconnected subtrees.
Comment 22 Andrew McCreight [:mccr8] 2011-11-28 11:48:08 PST
smaug: I've been running the try build you linked in the other threads, with both the AMCtv website and IRCcloud going.  heap-unclassified has held fairly steady, below 30%.  Maybe it is increasing slowly.  My CC pause times are a little weird.  About 7/8 are reasonable-ish, around 200 to 300ms.  About 1/8 are longer, around 750ms to 1000ms.  The long CC was getting worse and worse, but dropped down to the low end again.  There's no reason I can see that it has gotten better, and there's nothing in particular that I've noticed about why the longer one is longer.  Sometimes they are right after a GC, sometimes there's another CC in between.
Comment 23 Andrew McCreight [:mccr8] 2011-11-28 12:15:32 PST
I mean the patch in bug 702609.
Comment 24 Olli Pettay [:smaug] 2011-11-28 12:33:24 PST
I'd be interested to know how the patch affects those cases where heap-unclassified is high.
Comment 25 Kyle Huey [:khuey] (khuey@mozilla.com) 2011-12-07 13:11:17 PST
Are we going to take any countermeasures here for 9?  The clock is almost out ...
Comment 26 Olli Pettay [:smaug] 2011-12-07 14:57:51 PST
If the problem is strong parent node, it is quite strange that the first bugs were filed 3.5 months
after the patch landed on trunk.
Also, some reports say that there is a regression from 9->10
(which would mean some totally different problem)

But in any case I can prepare the strong-parentNode-backout patch for 9.
Comment 27 chris hofmann 2011-12-21 14:59:31 PST
I wonder if bc's spider could let us know which top sites seem to trigger these GC problems.
Comment 28 Bob Clary [:bc:] 2011-12-21 16:42:24 PST
I know there have been lots of additions of logging capability for memory, GC and CC, but I haven't kept up on the details. Can someone point me in the right direction to find out more about how to use them?
Comment 29 Bob Clary [:bc:] 2011-12-25 00:25:25 PST
javascript.options.mem.log appears to be at least part of what I need. 

I can get the time in seconds at which a GC or CC occurs since the browser started from GC(T+time) or CC(T+time), and the duration of CCs from the duration: [0-9]+ ms.$ in the CC output line.

What are the other interesting fields from the output lines or should I just collect them all in a csv for further analysis?

Would you want the results for starting the browser, loading the page, waiting for X seconds, then stopping the browser or do you want the results for a continuous run where the browser starts then successively loads pages with a delay of X seconds from the top sites while recording the GC/CC output? What should be the wait period to let the browser sit on a page?
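The fragments quoted above can be wired into a small log scraper. A sketch, with the caveat that only the "GC(T+time)"/"CC(T+time)" and "duration: N ms" pieces come from the comment; the full line shape (and the example line below) is an assumption:

```javascript
// Pull the collector type, elapsed time, and pause duration out of one
// javascript.options.mem.log console line, or return null if it isn't one.
function parseMemLogLine(line) {
  const kind = line.match(/\b(GC|CC)\(T\+([\d.]+)\)/);
  const dur = line.match(/duration: (\d+) ms\.?$/);
  if (kind === null || dur === null) return null;
  return {
    type: kind[1],                  // "GC" or "CC"
    elapsed: parseFloat(kind[2]),   // seconds since browser start
    duration: parseInt(dur[1], 10), // pause in milliseconds
  };
}

// Made-up example line for illustration:
const rec = parseMemLogLine("CC(T+42.7) collected: 123 (123 waiting for GC), duration: 83 ms.");
console.log(rec); // { type: "CC", elapsed: 42.7, duration: 83 }
```

Records like these can then be dumped one per row into the csv files attached below.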
Comment 30 Olli Pettay [:smaug] 2011-12-25 07:13:57 PST
I'd like to know about pages which, say after 15s of page load cause >500ms CC times.
(exact!) steps to reproduce would be great.
Comment 31 Olli Pettay [:smaug] 2011-12-25 07:16:04 PST
Btw, a patch landed on Friday which changes CC handling in some cases.
If you see any regressions, please file new bugs and make them block Bug 712743.
If you see lower CC times, report to Bug 712743.
Comment 32 Bob Clary [:bc:] 2011-12-28 21:26:02 PST
Created attachment 584706 [details]
Individual Page Loads Top 999 (bzip2 Open Office ODF Spreadsheet)
Comment 33 Bob Clary [:bc:] 2011-12-28 21:27:11 PST
Created attachment 584707 [details]
Continuous Page Loads Top 999 (bzip2 Open Office ODF Spreadsheet)
Comment 34 Bob Clary [:bc:] 2011-12-28 21:33:16 PST
I captured the GC and CC times on two separate Linux VMs using today's Nightly build for the top 999 sites in two cases:

1) Individual Loading - Start the browser, Load the Page and wait 15 seconds, Shutdown the browser.
2) Continuous Loading - Start the browser, Load each Page in succession waiting 15 seconds after each page, Shutdown the browser.

As you can see there are a few spikes for individual sites but none over 500 ms (one CC up to 410 ms), and perhaps a secular growth trend for the continuous loading case, but I believe 999 sites is not sufficient. If this presentation is good enough for you, I can produce results for a larger set of pages.

Would it have been better to accumulate the total GC+CC for each site rather than reporting the results of each individual GC/CC ?
Comment 35 Bob Clary [:bc:] 2012-01-03 05:54:34 PST
Created attachment 585388 [details]
top 10,000 sites gc times (csv.bz2)

The test run where the browser is started, the page loaded, and the browser shut down after 15 seconds has completed. I've attached a bzip2d csv file of the data since it is smaller. It is trivial to import into a spreadsheet and to create graphs.

The top 10 times were:

Site                       gctype gcduration
http://jsonline.com/       GC     1396.5
http://haaretz.co.il       GC     1258.2
http://crictime.com/       GC      646.6
http://haaretz.com/        GC      303
http://ilemi.me/           GC      289
http://mako.co.il/         CC      288
http://archdaily.com/      CC      233
http://hindustantimes.com/ GC      232.9
http://redmondpie.com/     CC      224
http://deadspin.com/       CC      192
Comment 36 Bob Clary [:bc:] 2012-01-03 06:00:19 PST
Created attachment 585391 [details]
top 6900 gc times single session (csv.bz2)

After some hangs and restarts the test run where each of the top 10,000 sites is loaded during a single browser session is still running and has completed 6900 sites. I've attached the bzip2d csv file of the data for your analysis. I'll let the run complete but consider this data sufficient and won't attach the full set unless someone requests it.

Note the secular increases in the GC times and their subsequent drop-offs. The drop-offs correspond to my manually closing the pop-up windows that had been opened by the web sites. The GC times increase as the number of open windows increases and drop back to "normal" levels after the extra windows are closed.
Comment 37 Bob Clary [:bc:] 2012-01-04 15:09:42 PST
Created attachment 585913 [details]
partial test with current nightly (csv)
Comment 38 Bob Clary [:bc:] 2012-01-04 15:11:00 PST
Created attachment 585914 [details]
partial test with try (csv)

with http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/opettay@mozilla.com-9687bb33e6ea/
Comment 39 Bob Clary [:bc:] 2012-01-05 06:24:08 PST
Created attachment 586055 [details]
partial test with current nightly (csv)
Comment 40 Bob Clary [:bc:] 2012-01-05 06:24:43 PST
Created attachment 586056 [details]
partial test with try (csv)
Comment 41 Bob Clary [:bc:] 2012-01-05 06:32:58 PST
           nightly  try
min     GC     0      0
average GC    21     20
max     GC   580    223
stddev  GC    23     14
min     CC     0      0
average CC     8     17
max     CC   581    328
stddev  CC    15     22
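Summary rows like the ones above can be computed from the raw csv durations with something like the following (the population standard deviation is assumed here; it's not stated which variant was used for these tables):

```javascript
// Compute min / average / max / stddev over a list of pause durations (ms).
function summarize(durations) {
  const n = durations.length;
  const min = Math.min(...durations);
  const max = Math.max(...durations);
  const average = durations.reduce((a, b) => a + b, 0) / n;
  // Population variance: mean of squared deviations from the average.
  const variance = durations.reduce((a, b) => a + (b - average) ** 2, 0) / n;
  return { min, average, max, stddev: Math.sqrt(variance) };
}

console.log(summarize([0, 10, 20, 30, 580]));
```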
Comment 42 Alex Keybl [:akeybl] 2012-01-05 18:36:35 PST
Hey Bob - sending this tracked, yet unassigned, bug over to you since I think you have the next action.
Comment 43 Bob Clary [:bc:] 2012-01-06 04:05:23 PST
Ouch! That's hot! (http://en.wikipedia.org/wiki/Hot_Potato_%28game%29) Actually Smaug is doing the work. I'm just helping out with some data collection.
Comment 44 Bob Clary [:bc:] 2012-01-09 05:51:46 PST
Created attachment 586970 [details]
completed test nightly csv.bz2
Comment 45 Bob Clary [:bc:] 2012-01-09 05:52:54 PST
Created attachment 586972 [details]
completed test try csv bz2
Comment 46 Bob Clary [:bc:] 2012-01-09 05:58:22 PST
           nightly  try
min     GC     0      0
average GC    20     22
max     GC   796   8872 *
stddev  GC    22     64
min     CC     0      0
average CC     7     19
max     CC   581    328
stddev  CC    11     24

[*] not reproducible.
Comment 47 Andrew McCreight [:mccr8] 2012-01-17 20:39:23 PST
Not really a MemShrink issue, as there's no leak here.  The lack of about:memory coverage for disconnected DOMs is being tracked in bug 704623.
Comment 48 Andrew McCreight [:mccr8] 2012-02-16 12:16:45 PST
Looks fixed to me, on a recent Nightly.  CC times never go above, say, 50ms with the two example files.  They still cause 5 second CCs when you close the tab, and the CC frees a million nodes, but that's a separate issue.
