Closed Bug 702813 Opened 14 years ago Closed 14 years ago

Pages with many DOM objects which don't live in the document can cause multi-second CC pauses

Status:

RESOLVED FIXED

Tracking Flags:

Tracking

Status

firefox9

---

firefox10

---

People

(Reporter: justin.lebar+bug, Assigned: smaug)

Details

(Whiteboard: [Snappy])

Attachments

(8 files, 5 obsolete files)

Synthetic benchmark, 14 years ago, Kyle Huey (Exited; not receiving bugmail, old account, do not use), 1.11 KB, text/html
Synthetic benchmark, 14 years ago, Kyle Huey (Exited; not receiving bugmail, old account, do not use), 1.40 KB, text/html
Counter Smaug's patch, 14 years ago, Andrew McCreight [:mccr8], 1.49 KB, text/html
Individual Page Loads Top 999 (bzip2 Open Office ODF Spreadsheet), 14 years ago, Bob Clary [:bc] (inactive), 146.47 KB, application/x-bzip2
Continuous Page Loads Top 999 (bzip2 Open Office ODF Spreadsheet), 14 years ago, Bob Clary [:bc] (inactive), 264.17 KB, application/x-bzip2
top 10,000 sites gc times (csv.bz2), 14 years ago, Bob Clary [:bc] (inactive), 255.88 KB, application/x-bzip2
top 6900 gc times single session (csv.bz2), 14 years ago, Bob Clary [:bc] (inactive), 345.34 KB, application/x-bzip2
partial test with current nightly (csv), 14 years ago, Bob Clary [:bc] (inactive), 39.39 KB, text/plain
partial test with try (csv), 14 years ago, Bob Clary [:bc] (inactive), 49.84 KB, text/plain
partial test with current nightly (csv), 14 years ago, Bob Clary [:bc] (inactive), 191.70 KB, text/plain
partial test with try (csv), 14 years ago, Bob Clary [:bc] (inactive), 153.69 KB, text/plain
completed test nightly csv.bz2, 14 years ago, Bob Clary [:bc] (inactive), 198.12 KB, application/x-bzip2
completed test try csv bz2, 14 years ago, Bob Clary [:bc] (inactive), 171.67 KB, application/x-bzip2

Justin Lebar (not reading bugmail)

Reporter

Description

•

14 years ago

This is likely the root cause of bug 700645 and bug 701443. We need a simple, synthetic testcase to prove that this is the cause. Then we presumably need to figure out how to modify the CC to avoid failing so hard here.

Andrew McCreight [:mccr8]

Comment 1

•

14 years ago

Now that I think about it, I should be able to analyze the CC logs from you and from dvander to see what the deal is with the DOM nodes in there. The basic problem is that for "normal" DOM nodes, the CC will see that it is a part of a window that is currently being displayed, and then not traverse through the entire thing. In at least jlebar's IRCcloud case, the page is generating huge amounts of nodes that are not in the currently displayed window, but are somehow still live. The question is, can we somehow test that these are live? The first step is to figure out why these are alive, and then see if we can link the DOM nodes to what is holding them alive.

Andrew McCreight [:mccr8]

Comment 2

•

14 years ago

One thing I need to look at is whether these nodes are truly free floating, or whether they are part of a separate DOM tree. I have a crazy idea for handling the latter case: treat the entire DOM tree as a single node in the CC graph, and add a write barrier to anything, like set user data, that creates edges out of the graph.

Kyle Huey (Exited; not receiving bugmail, old account, do not use)

Comment 3

•

14 years ago

Attached file: Synthetic benchmark (obsolete)

Kyle Huey (Exited; not receiving bugmail, old account, do not use)

Comment 4

•

14 years ago

In the synthetic benchmark, the failure mode involves a disconnected subtree with a million nodes. The relevant code is:

function touch3() {
  var thing3 = doc.createElement("p");
  last.appendChild(thing3);
}

The key is that after the function returns, thing3 can be GCd. thing3 is an xpconnect wrapper (a slim wrapper here, though a regular wrapper works too), and when it's GCd it Releases the native object, which causes it to get suspected (or, with a regular xpconnect wrapper, some xpconnect thing which holds the native object gets suspected). Then when the CC runs, the suspected native for thing3 ends up traversing the entire disconnected subtree (the million nodes). Since the nodes aren't in the document, the INTERRUPTED_TRAVERSE stuff doesn't save us.

If you load this in a clean profile you'll see this quite clearly. After you click your way through a bunch of slow script dialogs, your CC time will be extremely low (around 10ms on my machine). If you click on the 3rd button, the next CC will be quite long (around 1s on my machine).

I think in some sense this is a regression from the strong parent pointer stuff. The architectural issue was there before, but now that parent pointers are strong we end up traversing the entire disconnected subtree as opposed to some subsubtree.
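To make the traversal difference concrete, here is a minimal sketch in plain JavaScript. It is a toy model, not Gecko code: nodes are plain objects, the subtree is assumed to be a simple chain, and makeNode, buildChain and countTraversed are made-up names. It only counts how many nodes are reachable from a suspected leaf when the parent edge is or is not followed.

```javascript
// Toy model, not Gecko code: each "node" is a plain object.
function makeNode(parent) {
  const node = { parent: parent, children: [] };
  if (parent) parent.children.push(node);
  return node;
}

// Build a disconnected chain, loosely like repeated last.appendChild(...).
// (The chain shape is an assumption; the real benchmark may differ.)
function buildChain(length) {
  const root = makeNode(null);
  let last = root;
  for (let i = 1; i < length; i++) last = makeNode(last);
  return { root, last };
}

// Count nodes reachable from a suspected node. Child edges are always
// followed; the parent edge only when it is an owning (strong) reference.
function countTraversed(suspected, strongParent) {
  const seen = new Set([suspected]);
  const stack = [suspected];
  while (stack.length > 0) {
    const node = stack.pop();
    const edges = node.children.slice();
    if (strongParent && node.parent) edges.push(node.parent);
    for (const next of edges) {
      if (!seen.has(next)) {
        seen.add(next);
        stack.push(next);
      }
    }
  }
  return seen.size;
}

const { last } = buildChain(100000);
const thing3 = makeNode(last); // like touch3(): a new leaf under `last`
console.log(countTraversed(thing3, false)); // weak parent: 1 (just the leaf)
console.log(countTraversed(thing3, true));  // strong parent: 100001 (everything)
```

In this model the weak-parent case stops at the leaf's own sub-subtree, while the strong-parent case walks up to the root and then over the whole disconnected tree.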

Peter Van der Beken [:peterv]

Comment 5

•

14 years ago

One thing we could maybe do is not traverse a node if its JS object was marked black by the GC.

Kyle Huey (Exited; not receiving bugmail, old account, do not use)

Comment 6

•

14 years ago

That doesn't help us very much, though (at least in the worst case). If the one native whose JS object is black is on the other side of the subtree we still end up traversing almost all of it.
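A small sketch of this worst case, again as a toy model in plain JavaScript rather than real CC code. It assumes a chain-shaped subtree, follows edges in both directions (as with strong parent pointers), and simply refuses to enter nodes flagged black; makeChain, countTraversedSkippingBlack and the black field are illustrative names.

```javascript
// Toy model (not Gecko code). Nodes are plain objects in a chain, root first.
function makeChain(length) {
  const nodes = [];
  for (let i = 0; i < length; i++) {
    const node = { parent: i > 0 ? nodes[i - 1] : null, children: [], black: false };
    if (node.parent) node.parent.children.push(node);
    nodes.push(node);
  }
  return nodes;
}

// Follow parent and child edges, but never enter a node marked black.
function countTraversedSkippingBlack(suspected) {
  const seen = new Set([suspected]);
  const stack = [suspected];
  while (stack.length > 0) {
    const node = stack.pop();
    const edges = node.parent ? [node.parent, ...node.children] : node.children;
    for (const next of edges) {
      if (next.black || seen.has(next)) continue;
      seen.add(next);
      stack.push(next);
    }
  }
  return seen.size;
}

const chain = makeChain(1000);
chain[0].black = true; // the only node whose JS object is black
const traversed = countTraversedSkippingBlack(chain[999]);
console.log(traversed); // 999: everything except the black node
```

With the black node at the far end from the suspected node, skipping it saves exactly one node out of a thousand.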

Kyle Huey (Exited; not receiving bugmail, old account, do not use)

Comment 7

•

14 years ago

If you load the synthetic benchmark in Firefox 8 (the last version without strong parent pointers) you'll find that the CC pause is negligible (as expected) and that the testcase loads much, much, much faster. I'm not sure whether the latter is connected to the former or not.

Kyle Huey (Exited; not receiving bugmail, old account, do not use)

Comment 8

•

14 years ago

Nominating for tracking on the basis that we have a synthetic benchmark that shows massive performance regressions.

tracking-firefox10: --- → ?

tracking-firefox9: --- → ?

Kyle Huey (Exited; not receiving bugmail, old account, do not use)

Comment 9

•

14 years ago

Apparently the loading pause only shows up in my local build (not even on nightlies ...) but the CC slowness is definitely there.

Kyle Huey (Exited; not receiving bugmail, old account, do not use)

Comment 10

•

14 years ago

Attached file: Synthetic benchmark

The previous benchmark with one addition, a case that causes the high CC times in weak-parent-pointer builds.

Attachment #574861 - Attachment is obsolete: true

Olli Pettay [:smaug][bugs@pettay.fi]

Assignee

Comment 11

•

14 years ago

So strong parentNode did add a new way to cause huge CC graphs, but there were other ways to create such graphs even with weak parentNode.

Kyle Huey (Exited; not receiving bugmail, old account, do not use)

Comment 12

•

14 years ago

Sure, but it looks like this is being hit by web content now, while it wasn't before.

Olli Pettay [:smaug][bugs@pettay.fi]

Assignee

Comment 13

•

14 years ago

If we decide to disable strong parentNode in 9 (and 10), that decision must happen *very* soon, within this week or so. Backing out should be doable, but it will touch lots of code and would take some time to get all the backout patches ready. And, if we back out, I think we should back out from trunk too until the cycle collector can handle the situation better. But I'm still hoping we can find some, perhaps hacky, fix for the cycle collector's handling.

Andrew McCreight [:mccr8]

Comment 14

•

14 years ago

Before we panic too much, we should try to figure out how much this actually hits web content. We've seen a few weird cases that maybe cause problems, but only Asa has been able to consistently recreate a problem. (I haven't confirmed that dvanderson's case is actually this. I'll try to put together a script today that can find these problems.) We should look at telemetry, and compare what 8 looks like to 9 and 10. I think the best measure to look at is the number of ref counted nodes in the CC graph. That will avoid problems due to differing hardware, though not websites. Is there an appreciable spike? Secondly, precisely what are we trying to fix? Cases where the top level node is being held live by JS? Where any node is being held live by JS? Where any node is being held live for any reason? I have some ideas for the first two.

Olli Pettay [:smaug][bugs@pettay.fi]

Assignee

Comment 15

•

14 years ago

An unoptimized black node hack (which doesn't leak with Kyle's test, and the thing3 test's CC time drops in a debug build from 750 to 15ms). The algorithm is silly (it misses marking nodes, which leads to iterating a lot), and that's why it could cause slowness in certain cases. Pushed to try to see if it leaks :)

https://hg.mozilla.org/try/rev/6c55fa768be8
https://tbpl.mozilla.org/?tree=Try&rev=6c55fa768be8

But anyway, I hope the patch shows that perhaps there are simple ways to fix at least the most common problems related to not-in-document subtrees.

Andrew McCreight [:mccr8]

Comment 16

•

14 years ago

Yeah, this patch is going to be something like n log n in the worst case. n being the number of nodes, and log n being the depth of the tree. Or something like that. Can we somehow move this check into the cycle collector itself? We could associate with owner documents if we've ever found a member document that has a nongray wrapper that is keeping it alive. Then, while we are traversing up the parents, we add the parents to the graph. This will keep them from being traversed twice, at the cost of bloating up the graph a bit before we find the wrapped node. Are these nodes with marked wrappers going to be traversed as XPConnectRoots? Those are added to the graph first, so potentially if we store nongrayness in the owner document then we don't have to actually traverse up the parent chain: by the time we get to nodes in the "meat" of the DOM, we'll already have examined if any of the members of the DOM are being held alive in this way. I should also make sure this will actually help with the IRCcloud graph I got from jlebar. I looked at the paths that were holding things live, and they followed some weird path through JS back out to a DOM, but the path finding can be a little wonky.
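A rough sketch of where the "n times depth" estimate comes from. It assumes the check walks from each node up to its topmost ancestor and inspects only that node's wrapper; that is a guess at the shape of the approach, not the patch itself. buildTree, rootIsBlack and blackWrapper are illustrative names in a toy model.

```javascript
// Toy model (not Gecko code): a complete binary tree of plain objects.
// Nodes are collected in `all`, with the root first.
function buildTree(depth, parent = null, all = []) {
  const node = { parent, blackWrapper: false };
  all.push(node);
  if (depth > 0) {
    buildTree(depth - 1, node, all);
    buildTree(depth - 1, node, all);
  }
  return all;
}

// Hypothetical check: walk to the topmost ancestor and ask whether its
// wrapper was marked black. Returns the answer plus the parent hops taken.
function rootIsBlack(node) {
  let steps = 0;
  let top = node;
  while (top.parent) {
    top = top.parent;
    steps++;
  }
  return { black: top.blackWrapper, steps };
}

const nodes = buildTree(10);  // 2^11 - 1 = 2047 nodes
nodes[0].blackWrapper = true; // the root is held alive by JS

let totalSteps = 0;
let skipped = 0;
for (const node of nodes) {
  const result = rootIsBlack(node);
  totalSteps += result.steps;
  if (result.black) skipped++;
}
// Every node can be skipped, but the walks cost roughly n * depth hops.
console.log(nodes.length, skipped, totalSteps);
```

Each node repeats the walk its ancestors already did, which is the repeated work that caching or marking would remove.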

Olli Pettay [:smaug][bugs@pettay.fi]

Assignee

Comment 17

•

14 years ago

(In reply to Andrew McCreight [:mccr8] from comment #16)
> Can we somehow move this check into the cycle collector itself? We could
> associate with owner documents if we've ever found a member document that
> has a nongray wrapper that is keeping it alive.

I'm not sure what this means. "member document"?

> Then, while we are
> traversing up the parents, we add the parents to the graph. This will keep
> them from being traversed twice, at the cost of bloating up the graph a bit
> before we find the wrapped node.

Adding anything to the graph and using ::Traverse is slow compared to the non-virtual inline calls that can be used when dealing with nodes.

> Are these nodes with marked wrappers going to be traversed as
> XPConnectRoots? Those are added to the graph first, so potentially if we
> store nongrayness in the owner document

But we can't store non-grayness in the owner document. We're dealing with unconnected subtrees here. ownerDocument doesn't own the nodes in that case.

> then we don't have to actually
> traverse up the parent chain: by the time we get to nodes in the "meat" of
> the DOM, we'll already have examined if any of the members of the DOM are
> being held alive in this way.

This would work if ownerDocument kept disconnected subtrees alive, but it doesn't.

> I should also make sure this will actually help with the IRCcloud graph I
> got from jlebar. I looked at the paths that were holding things live, and
> they followed some weird path through JS back out to a DOM, but the path
> finding can be a little wonky.

To fix the regression-like part of this all for 9/10 it should be enough to figure out if the root of the subtree is black, since in 8 it is necessary to keep the root explicitly alive to keep the subtree alive.

This could be optimized so that if we're dealing with a tree which has a document as root, we mark that document as black and then it is enough to check node->GetCurrentDoc()->IsBlack(). But in cases of those trees where some other node type is root, we would need to mark all the ancestors, and clear the markers after CC. That would be O(2n) (where n = number of nodes in the tree).

Btw, the hacky patch passed tests on try without leaks.
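The "mark all the ancestors, and clear the markers after CC" idea could look roughly like the sketch below. This is a toy model in plain JavaScript with made-up names (marked, rootIsBlackCached, blackWrapper), and it keeps the markers in a side table rather than on the nodes, which is purely an assumption for illustration.

```javascript
// Toy model (not Gecko code): a chain of plain objects, root first.
function makeChain(length) {
  const nodes = [];
  for (let i = 0; i < length; i++) {
    nodes.push({ parent: i > 0 ? nodes[i - 1] : null, blackWrapper: false });
  }
  return nodes;
}

const marked = new Map(); // node -> cached "root is black" answer
let steps = 0;            // parent hops, to show the cost

// Walk up until a marked node or the topmost node is reached, then mark
// every node on the path so later queries stop immediately.
function rootIsBlackCached(node) {
  const path = [];
  let cur = node;
  while (!marked.has(cur) && cur.parent) {
    path.push(cur);
    cur = cur.parent;
    steps++;
  }
  const answer = marked.has(cur) ? marked.get(cur) : cur.blackWrapper;
  marked.set(cur, answer);
  for (const visited of path) marked.set(visited, answer);
  return answer;
}

const chain = makeChain(10000);
chain[0].blackWrapper = true;

// Query deepest-first, the worst order for an uncached walk.
let allBlack = true;
for (let i = chain.length - 1; i >= 0; i--) {
  allBlack = rootIsBlackCached(chain[i]) && allBlack;
}
console.log(allBlack, steps); // true 9999

marked.clear(); // drop the markers once the CC is done
```

The first query pays for one walk to the root; every later query hits a marker right away, so the total stays linear in the number of nodes, plus one more pass to clear the markers.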

Andrew McCreight [:mccr8]

Comment 18

•

14 years ago

Okay, most of my comments were just me being confused about the document...

Andrew McCreight [:mccr8]

Updated

•

14 years ago

Blocks: 698919

Andrew McCreight [:mccr8]

Comment 19

•

14 years ago

Attached file: Counter Smaug's patch

Here's a test example that is a minor variant of Kyle's that causes slow CC times even with Smaug's patch. The basic trick is here:

var nu_root = document.createElement("span");
root = document.createElement("a");
nu_root.appendChild(root);

Instead of storing the tippy-topmost node (nu_root) in a JS variable, we store one of its children. Thus when we suspect some DOM node deep in the tree, we don't find the node with the marked wrapper.

And as I said before, in the IRCcloud case, with jlebar's log, the DOM isn't being held alive by a marked wrapper, so this optimization may not work there either. The topmost element in the DOM, which appears to be a <div>, has a preserved wrapper, but the wrapper is not marked. I didn't check every node on the way up. I also made the assumption that a random element I picked must be part of the huge glob.
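Why this variant slips past a root-only check can be shown with a toy model. The sketch assumes the check looks only at the topmost ancestor's wrapper; topmostIsBlack, anyAncestorIsBlack and blackWrapper are illustrative names, and the every-ancestor variant is included only as a contrast, not as something proposed in this bug.

```javascript
// Toy model (not Gecko code): plain objects stand in for DOM nodes.
function makeNode(name, parent = null) {
  return { name, parent, blackWrapper: false };
}

// Assumed shape of the check: only the topmost ancestor's wrapper counts.
function topmostIsBlack(node) {
  let top = node;
  while (top.parent) top = top.parent;
  return top.blackWrapper;
}

// Contrast: look at every wrapper on the way up, including the node's own.
function anyAncestorIsBlack(node) {
  for (let cur = node; cur; cur = cur.parent) {
    if (cur.blackWrapper) return true;
  }
  return false;
}

const nuRoot = makeNode("span");  // topmost node, not held by JS
const root = makeNode("a", nuRoot);
root.blackWrapper = true;         // stored in a JS variable, so marked

// Hang a deep chain of nodes under `root`.
let deep = root;
for (let i = 0; i < 1000; i++) deep = makeNode("p", deep);

console.log(topmostIsBlack(deep));     // false: nu_root's wrapper is not marked
console.log(anyAncestorIsBlack(deep)); // true: `root` is found on the way up
```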

Andrew McCreight [:mccr8]

Comment 20

•

14 years ago

I looked at telemetry and didn't see a particularly huge difference in ref counted CC graph size between 8 and 9. Looking at CC times, they seem about the same too.

Olli Pettay [:smaug][bugs@pettay.fi]

Assignee

Comment 21

•

14 years ago

(In reply to Andrew McCreight [:mccr8] from comment #19)
> Created attachment 575341 [details]
> Counter Smaug's patch
>
> Here's a test example that is a minor variant of Kyle's that causes slow CC
> times even with Smaug's patch. The basic trick is here:
>
> var nu_root = document.createElement("span");
> root = document.createElement("a");
> nu_root.appendChild(root);
>
> Instead of storing the tippy-topmost node (nu_root) in a JS variable, we
> store one of its children.

Sure, but that doesn't even work in 8. I mean, in 8 we'd delete nu_root. So I'm not too worried about that case, since I expect web sites to actually store the root somewhere because of the Gecko bug in <= 8.

> And as I said before, in the IRCcloud case, with jlebar's log, the DOM isn't
> being held alive by a marked wrapper, so this optimization may not work
> there either.
> The topmost element in the DOM, which appears to be a <div>,
> has a preserved wrapper, but the wrapper is not marked. I didn't check
> every node on the way up. I also made the assumption that a random element
> I picked must be part of the huge glob.

But something must be keeping the subtree alive. In 8, if you don't explicitly keep the root of some subtree alive, the root will be deleted, and all its children disconnected from it. This can lead to several disconnected subtrees.

Andrew McCreight [:mccr8]

Updated

•

14 years ago

Depends on: 702609

Andrew McCreight [:mccr8]

Comment 22

•

14 years ago

smaug: I've been running the try build you linked in the other threads, with both the AMCtv website and IRCcloud going. heap-unclassified has held fairly steady, below 30%. Maybe it is increasing slowly. My CC pause times are a little weird. About 7/8 are reasonable-ish, around 200 to 300ms. About 1/8 are longer, around 750ms to 1000ms. The long CC was getting worse and worse, but dropped down to the low end again. There's no reason I can see that it has gotten better, and there's nothing in particular that I've noticed about why the longer one is longer. Sometimes they are right after a GC, sometimes there's another CC in between.

Andrew McCreight [:mccr8]

Comment 23

•

14 years ago

I mean the patch in bug 702609.

Olli Pettay [:smaug][bugs@pettay.fi]

Assignee

Comment 24

•

14 years ago

I'd be interested to know how the patch affects those cases where heap-unclassified is high.

Andrew McCreight [:mccr8]

Updated

•

14 years ago

Depends on: 704623

Alex Keybl [:akeybl]

Updated

•

14 years ago

tracking-firefox10: ? → +

tracking-firefox9: ? → +

Kyle Huey (Exited; not receiving bugmail, old account, do not use)

Comment 25

•

14 years ago

Are we going to take any countermeasures here for 9? The clock is almost out ...

Olli Pettay [:smaug][bugs@pettay.fi]

Assignee

Comment 26

•

14 years ago

If the problem is strong parent node, it is quite strange that the first bugs were filed 3.5 months after the patch landed on trunk. Also, some reports say that there is a regression from 9->10 (which would mean some totally different problem). But in any case I can prepare the strong-parentNode-backout patch for 9.

Olli Pettay [:smaug][bugs@pettay.fi]

Assignee

Updated

•

14 years ago

Depends on: 708572

Lawrence Mandel [:lmandel] (use needinfo)

Updated

•

14 years ago

Whiteboard: [MemShrink:P1] → [MemShrink:P1][Snappy]

chris hofmann

Comment 27

•

14 years ago

I wonder if bc's spider could let us know which top sites seem to trigger these gc problems.