Closed Bug 720512 Opened 13 years ago Closed 12 years ago

huge number of orphan DOM nodes when Gmail is open for a long time

Categories

(Core :: XPCOM, defect)

x86_64
macOS
defect
Not set
normal

Tracking

()

RESOLVED WORKSFORME

People

(Reporter: mccr8, Assigned: mccr8)

References

Details

(Whiteboard: [Snappy:P1])

Attachments

(1 file, 2 obsolete files)

I've noticed that if I leave Gmail up for a couple of days, then CC times start getting bad, around 650ms consistently. GC times remain reasonable. Memory usage is not awesome, at around 400mb, but not totally abysmal. Heap-unclassified is a little high at 37%. Presumably Gmail is leaking something that makes the cycle collector sad. I'm going to analyze the CC logs to try to figure out what is going on, and try to figure out if we can do better. This is certainly a Snappy issue. May be a MemShrink thing, too, so I'll mark it as such for now.
Andrew says reloading the tab drops the CC times, which backs up it being gmail holding onto stuff.
Assignee: nobody → continuation
Whiteboard: [MemShrink], [Snappy] → [Snappy]
(In reply to Randell Jesup [:jesup] from comment #1) > Andrew says reloading the tab drops the CC times, which backs up it being > gmail holding onto stuff. Sure, but it could be that we're leaking stuff within content somehow.
I had this problem again, post CC improvements. I did some analysis of the logs. Here are the top 8 types of nodes: 60471 nsGenericElement (xhtml) 20314 nsGenericDOMDataNode 14247 JS Object 4865 nsEventListenerManager 4259 nsDOMCSSAttributeDeclaration 2813 XPCWrappedNative 1556 nsGenericElement (XUL) 1183 JS Script Huge number of DOM nodes. It looks like the bulk of these (around 71000 of them) are orphan DOM nodes, eg DOMs that aren't currently in a document. There are almost 300 separate orphan DOM trees. I'm going to try to merge each of these together into a single node to see if I can make any more sense out of the graph that way, as it will get rid of 71k nodes from the graph.
I'm going to try running this build for a few days with the new DOM bindings turned off and we'll see if that matters.
Same problem with them turned off. about 71575 nodes in orphaned gmail DOMs.
Whiteboard: [Snappy] → [Snappy:P1]
I had a similar problem just now, after my browser had been running for a number of days with Gmail open. I'd been getting some long GC times, often around a second. CC times maybe were a little bad, but not too bad. When I closed Gmail (and some other tabs so maybe it was something else), I started getting 2.6 second CCs. I had about a half dozen of these, until finally it freed the 400k or so nodes and it went back to normal. I forgot to grab a log though. CC and GC times dropped back to fast again (100ms GC and 52ms CC).
I have Cww's CC log about this issue, I think. Something somewhere is keeping an element alive.
Andrew, if you can reproduce this, could you check whether you've used find. Although bug 669845 is fixed, I think find might keep some elements alive too long. Also, could you perhaps check what object is the root which keeps the stuff alive, and when reloading, what ends up releasing it. (52ms CC time when not loading pages isn't really normal. Median CC time is somewhere between 10 and 20)
I have 50-60ms CC times and 200-220ms GC with this from about:tabs 13.0a1 (2012-03-03), running since last Saturday night on a private pull & build 781 tabs across 21 groups in 21 windows 29 tabs have been loaded about:memory 1,203.41 MB ── resident 1,141.36 MB ── explicit 2,074.08 MB ── vsize about 200MB system-principal; 50MB about:blank (unloaded tabs); 43MB plus.google.com, 23MB gmail; 283MB window-objects; 231MB (20%) heap-unclassified
Randell, could you file a separate issue for that please. This bug is about a specific issue with long CCs while just having a single tab open. Thanks. (In reply to Olli Pettay [:smaug] from comment #8) > Andrew, if you can reproduce this, could you check whether you've used find. > Although bug 669845 is fixed, I think find might keep some elements alive > too long. It is hard to reproduce. You have to keep Gmail open for 4 or 5 days. It is better than before. The CC times are not really that bad, it is mostly the GC that is having problems. But given that this problem feels like it is a slow leak of live data, that's what you'd expect. Woah, I just had a 2 second GC. Awesome. GC(T+257365.3) TotalTime: 1966.6ms, Type: global, MMU(20ms): 0%, MMU(50ms): 0%, Reason: SET_NEW_DOCUMENT, NonIncrementalReason: incremental permanently disabled, +chunks: 0, -chunks: 0 mark: 135.8, mark-roots: 8.4, mark-other: 19.0, sweep: 756.3, sweep-obj: 27.9, sweep-string: 5.2, sweep-script: 0.9, sweep-shape: 8.1, discard-code: 14.8, discard-analysis: 669.5, xpconnect: 5.0, deallocate: 0.2 I've also seen a few like this: GC(T+257507.4) TotalTime: 1890.8ms, Type: global, MMU(20ms): 0%, MMU(50ms): 0%, Reason: CC_WAITING, NonIncrementalReason: incremental permanently disabled, +chunks: 0, -chunks: 0 mark: 140.9, mark-roots: 14.2, mark-other: 19.0, sweep: 332.2, sweep-obj: 50.3, sweep-string: 4.7, sweep-script: 0.9, sweep-shape: 7.1, discard-code: 10.5, discard-analysis: 243.5, xpconnect: 4.8, deallocate: 0.2 Notice that there's about a second of GC time unaccounted for in the breakdown! > Also, could you perhaps check what object is the root which keeps the stuff > alive, > and when reloading, what ends up releasing it. Sure, I'll try to take a log next time it gets bad. Though maybe I should use bad GC times as my benchmark.
Last time I looked at it, the problem was Gmail had a huge number of disconnected DOM nodes.
I'm also getting the occasional weirdly long CC. CC(T+258720.2) duration: 531ms, suspected: 5678, visited: 9648 RCed and 5530 GCed, collected: 1988 RCed and 61 GCed (2049 waiting for GC) ForgetSkippable 30 times before CC, min: 0 ms, max: 19 ms, avg: 1 ms, total: 44 ms, removed: 11794 CC(T+258737.9) duration: 72ms, suspected: 1471, visited: 9428 RCed and 5512 GCed, collected: 2168 RCed and 68 GCed (2236 waiting for GC) ForgetSkippable 7 times before CC, min: 0 ms, max: 49 ms, avg: 10 ms, total: 75 ms, removed: 2393 These CCs both have about the same size CC graph and collected about the same number of objects, but the first one took much longer. It has more suspected nodes, but that doesn't seem like it should be enough. I guess I should turn on detailed timer information.
Nice, I reloaded Gmail just now, and I'm now getting crazy CC times: CC(T+259244.4) duration: 3385ms, suspected: 208, visited: 226245 RCed and 134109 GCed, collected: 42 RCed and 0 GCed (203 waiting for GC) ForgetSkippable 2 times before CC, min: 0 ms, max: 0 ms, avg: 0 ms, total: 1 ms, removed: 209 I guess something is holding all of the dead Gmail gunk alive. I have a CC log for it, though logs this huge are hard to analyze. Doing a find in a different page (on Wikipedia) didn't help anything.
Summary: long CC times on gmail after being open for a long time → huge number of orphan DOM nodes when Gmail is open for a long time
Attached patch guess fix (obsolete) — Splinter Review
https://tbpl.mozilla.org/?tree=Try&rev=387d15cf5cdd Make DOMEventTarget helper to not keep window alive.
Looks like applying the patch to m-c needs some --fuzz
I browsed for maybe 5 hours using this patch, then reloaded Gmail and I got this: CC(T+17874.7) duration: 126ms, suspected: 14978, visited: 23966 RCed and 2803 GCed, collected: 20695 RCed and 284 GCed (20979 waiting for GC) ForgetSkippable 3 times before CC, min: 6 ms, max: 17 ms, avg: 10 ms, total: 32 ms, removed: 6882 CC(T+17880.9) duration: 19ms, suspected: 458, visited: 3235 RCed and 2524 GCed, collected: 319 RCed and 3 GCed (322 waiting for GC) ForgetSkippable 6 times before CC, min: 0 ms, max: 12 ms, avg: 2 ms, total: 17 ms, removed: 974 CC(T+17887.1) duration: 11ms, suspected: 104, visited: 2950 RCed and 2521 GCed, collected: 52 RCed and 2 GCed (54 waiting for GC) ForgetSkippable 6 times before CC, min: 0 ms, max: 11 ms, avg: 2 ms, total: 15 ms, removed: 809 In other words, this appears to have fixed the problem. When I reload Gmail, I get a bunch of nodes in the graph, but they are collected right away. And it doesn't seem quite as bad. I'll leave my Gmail tab up for another day or so and see if that continues to hold.
By "doesn't seem quite as bad", I mean it doesn't seem like there are quite as many orphan nodes.
Sounds good. The patch is pretty mechanical but need some reviews...
Perhaps I spoke too soon. I'm getting half-second CCs now. CC(T+46568.4) duration: 519ms, suspected: 90, visited: 22754 RCed and 133975 GCed, collected: 57 RCed and 2 GCed (59 waiting for GC) ForgetSkippable 2 times before CC, min: 0 ms, max: 24 ms, avg: 12 ms, total: 25 ms, removed: 256
Comment on attachment 603681 [details] [diff] [review] v2 >- mScriptContext = aScriptContext; > if (aOwnerWindow) { >- mOwner = aOwnerWindow->IsOuterWindow() ? >- aOwnerWindow->GetCurrentInnerWindow() : aOwnerWindow; >+ BindToOwner(aOwnerWindow->IsOuterWindow() ? >+ aOwnerWindow->GetCurrentInnerWindow() : aOwnerWindow); > } else { >- mOwner = nsnull; >+ BindToOwner(aOwnerWindow); > } BindToOwner(aOwnerWindow); makes us use the right version of BindToOwner(nsnull). > nsXMLHttpRequest::GetLoadGroup() const > { >- if (mState & XML_HTTP_REQUEST_BACKGROUND) { >+ if (mState & XML_HTTP_REQUEST_BACKGROUND) { Apparently I added few extra spaces here. I'll fix. The idea with the patch is the nsDOMEventTargetHelper objects, like XHR, doesn't own the window or relevant scriptcontext anymore. Instead the nsDOMEventTargetHelper::mOwner <-> nsGlobalWindow connection is a raw pointer and cleared when either one of the objects is deleted. mHasHadOwner is there to prevent event handling on objects which don't have mOwner anymore (shouldn't change any behavior, since before the patch owner shouldn't be the current inner window anymore, if eventtargethelper is the only one which owns it, and in that case CheckInnerWindowCorrectness would fail.).
Attachment #603681 - Flags: review?(jst)
Comment on attachment 603681 [details] [diff] [review] v2 Ah, better to investigate the CC issues then some more.
Attachment #603681 - Flags: review?(jst)
Though, I think we should take the patch anyway. It is just useless to have strong pointers in nsDOMEventTargetHelper. I move the patch to another bug.
Attachment #603681 - Attachment is obsolete: true
Well, maybe it did fix some problem. I'm seeing orphan DOMs from Google Reader, and some pages I had closed many hours before, but Gmail doesn't seem to show up in any important number (this is with all tabs closed).
One of the pages that was leaking appears to be held alive by setDesktopBackground.xul. I'll file a new bug blocking bug 728407.
Okay, so, there are still a ton of Gmail orphan nodes in the WantAllTraces dump I made, so I guess this didn't actually fix it? Well, maybe it didn't fix the leak, but it may have fixed the problem where it takes a long time for things to be freed after you close the tab. One of the orphan DOMs is being kept alive by two elements in the DOM that have a missing edge: 0x15fec8580 [nsGenericElement (xhtml) div id=':1cu' class='anM' https://mail.google.com/mail/?ui=2&view=bsp&ver=ohhl4rw8mbn4] 0x12c3add00 [nsGenericElement (xhtml) span class='hb' https://mail.google.com/mail/?ui=2&view=bsp&ver=ohhl4rw8mbn4]
(In reply to Andrew McCreight [:mccr8] from comment #26) > Okay, so, there are still a ton of Gmail orphan nodes in the WantAllTraces > dump I made, so I guess this didn't actually fix it? Well, maybe it didn't > fix the leak, but it may have fixed the problem where it takes a long time > for things to be freed after you close the tab. Well, I was hoping that it could fix the latter problem, not the first one. So, bug 734057 could be for the latter problem, and this one could be used for the original problem. > > One of the orphan DOMs is being kept alive by two elements in the DOM that > have a missing edge: > > 0x15fec8580 [nsGenericElement (xhtml) div id=':1cu' class='anM' > https://mail.google.com/mail/?ui=2&view=bsp&ver=ohhl4rw8mbn4] > 0x12c3add00 [nsGenericElement (xhtml) span class='hb' > https://mail.google.com/mail/?ui=2&view=bsp&ver=ohhl4rw8mbn4] This looks like the Cww's log I have, and bz had similar problem with nytimes. Could you try to debug what releases those roots when you close the tab.
(In reply to Olli Pettay [:smaug] from comment #27) > Could you try to debug what releases those roots when you close the tab. Sure. This is a problem that comes up over and over again so I'm tempted to come up with a more general solution.
Here's another weird thing with the patched build. I was logged out by gmail, so it booted me back to the login screen. I logged in again, and I started getting huge CCs: CC(T+112902.0) duration: 1822ms, suspected: 53, visited: 136442 RCed and 207090 GCed, collected: 52 RCed and 2 GCed (54 waiting for GC) ForgetSkippable 2 times before CC, min: 0 ms, max: 30 ms, avg: 15 ms, total: 31 ms, removed: 92 Those objects won't go away, even when I closed the window. They've hung around like this for a few minutes. When I did a dump, one of the orphan DOM nodes was being held alive by a path that starts out like this: 0x14ef23c00 [nsDOMEventTargetHelper ] --[mOnReadystatechangeListener]-> 0x1565f8fa0 [nsDOMEventListenerWrapper] --[mListener]-> 0x15d6d3010 [nsXPCWrappedJS (nsIDOMEventListener)] --[mJSObj]-> 0x14698e8c0 [JS Object (Function - $Xb)] --[]-> 0x14dbf52e0 [JS Object (Function - $Xb)] --[]-> 0x155eea340 [JS Object (Object)] The full path is kind of interesting, I'll attach it. It may shed light on why we're leaking everything in Gmail.
Attached file path2orphan
Jason, is it possible that necko keeps stuff alive forever, or very long time? nsDOMEventTargetHelper is XMLHttpRequest. The patch for bug 734057 could be made more effective if the hashtable was enumerated not only in dtor but also in nsGlobalWindow::CleanUp. Could you Andrew try that? Ah, and could you make nsDOMEventTargetHelper::Disconnect() virtual. Then in XHR add Disconnect(). It should call nsDOMEventTargetHelper::Disconnect() and make mListener of mOnReadystatechangeListener null.
If you try that, make sure to clear also mOnFooListeners in nsXHREventTarget
Sure, I can look into that after I finish up this other security bug I've been avoiding for 3 months.
> Jason, is it possible that necko keeps stuff alive forever Generally it shouldn't, of course, but I'm not an expert in XHR, so jonas might know better than me (looks like you may have figured this out on IRC with khuey already?)
I have no ideas here. Are we sure the problem isn't simply Google holding JS-references to these XHR objects? I.e. do we know if the XHR still has a JS-wrapper, and if it does, what's keeping it alive. We've certainly seen bugs in gmail before resulting in vast number of objects being held alive.
(In reply to Jonas Sicking (:sicking) from comment #35) > I have no ideas here. > > Are we sure the problem isn't simply Google holding JS-references to these > XHR objects? I.e. do we know if the XHR still has a JS-wrapper, and if it > does, what's keeping it alive. > > We've certainly seen bugs in gmail before resulting in vast number of > objects being held alive. There are two unknown native edges. Look at the trace mccr8 posted.
(In reply to Jonas Sicking (:sicking) from comment #35) > I have no ideas here. > > Are we sure the problem isn't simply Google holding JS-references to these > XHR objects? In that case XHR wouldn't be in the CC log at all.
The XPCWrappedNative is keeping the XHR alive. Doesn't that mean that the JS-wrapper is alive?
Sure, but XPCWrappedNative wouldn't be black here. Anyhow, there are unknown edges. Those are the interesting ones. bug 734057 should help here.
I haven't noticed this in a while.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → WORKSFORME
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: