Closed Bug 497808 Opened 15 years ago Closed 11 years ago

GMail window leak

Categories

(Core :: DOM: Core & HTML, defect)

x86
Windows 7
defect
Not set
normal

Tracking

()

RESOLVED WORKSFORME

People

(Reporter: roc, Assigned: peterv)

References

Details

(Keywords: memory-leak, Whiteboard: [softblocker])

Attachments

(2 files, 6 obsolete files)

I don't know if this is really our problem or GMail's. But for a while now on trunk, I've been seeing sluggish performance that gradually gets worse over time --- very frequent short pauses that get longer over time. Restart fixes it. Sometimes I've attached a debugger and collected some stack traces; sometimes we're in GMail, with JSON message data on the JS stack, and often we're in the cycle collector.

Just now performance was getting bad again so I tried closing the GMail window. I got the following console output:

JavaScript error: https://mail.google.com/mail/?ui=2&view=js&name=js&ver=aPCJgb7NS-E.en.L&am=b1EopeS3cCHoAb0i0bA2xsRfYkZ2Tw&random=1244409062175, line 22: Fb is not defined
--DOMWINDOW == 231 (0x8780500) [serial = 711] [outer = 0x2529a580] [url = about:blank]
--DOMWINDOW == 230 (0x36fcb9b0) [serial = 7656] [outer = 0x3cade7f0] [url = about:blank]
--DOMWINDOW == 229 (0x23ca4dc0) [serial = 706] [outer = 0xf38eca0] [url = https://mail.google.com/mail/?ui=2&view=bsp&ver=1qygpcgurkovy]
--DOMWINDOW == 228 (0x225d2ad0) [serial = 707] [outer = 0xf38d620] [url = https://mail.google.com/mail/?ui=2&view=bsp&ver=1qygpcgurkovy]
--DOMWINDOW == 227 (0x3cd75860) [serial = 7653] [outer = 0x30036c10] [url = about:blank]
--DOMWINDOW == 226 (0x30036c40) [serial = 7651] [outer = 0x0] [url = https://mail.google.com/mail/?ui=2&ik=17e10128ca&at=xn3j310qfv6pa5kvv7boxakigulx5l&view=up&act=sm&jsid=9ym2eb-79ye1s&cmid=71&rt=h&search=all&zx=s5pe6o-3tocvq]
--DOMWINDOW == 225 (0x367dd970) [serial = 7635] [outer = 0x3e594b50] [url = about:blank]
--DOMWINDOW == 224 (0x3e594b80) [serial = 7634] [outer = 0x0] [url = about:blank]
--DOMWINDOW == 223 (0x3f379390) [serial = 7614] [outer = 0x38eca1f0] [url = about:blank]
--DOMWINDOW == 222 (0x38eca220) [serial = 7612] [outer = 0x0] [url = https://mail.google.com/mail/?ui=2&ik=17e10128ca&at=xn3j310qfv6pa5kvv7boxakigulx5l&view=up&act=sm&jsid=9ym2eb-79ye1s&cmid=70&rt=h&search=inbox&zx=33bdgt-idotca]

... etc etc etc ...

--DOMWINDOW == 97 (0x23dd7230) [serial = 785] [outer = 0x2c8ba870] [url = about:blank]
--DOMWINDOW == 96 (0x2c8ba8a0) [serial = 783] [outer = 0x0] [url = https://mail.google.com/mail/?ui=2&ik=17e10128ca&at=xn3j310qfv6pa5kvv7boxakigulx5l&view=up&act=sm&jsid=9ym2eb-79ye1s&cmid=2&rt=h&cat=WHATWG&search=cat&zx=t4svdl-7bzy3y]
--DOMWINDOW == 95 (0xf3372e0) [serial = 755] [outer = 0xf4569d0] [url = about:blank]
--DOMWINDOW == 94 (0xf456a00) [serial = 753] [outer = 0x0] [url = https://mail.google.com/mail/?ui=2&ik=17e10128ca&at=xn3j310qfv6pa5kvv7boxakigulx5l&view=up&act=sm&jsid=9ym2eb-79ye1s&cmid=1&rt=h&cat=WHATWG&search=cat&zx=oectlf-6bmnp9]
--DOCSHELL 0xcbbab00 == 30
--DOMWINDOW == 93 (0xf38ecd0) [serial = 703] [outer = 0x0] [url = https://mail.google.com/mail/?ui=2&view=bsp&ver=1qygpcgurkovy]
--DOMWINDOW == 92 (0xf38d650) [serial = 704] [outer = 0x0] [url = https://mail.google.com/mail/?ui=2&view=bsp&ver=1qygpcgurkovy]
--DOMWINDOW == 91 (0x2529a5b0) [serial = 710] [outer = 0x0] [url = about:blank]
--DOMWINDOW == 90 (0x3cade820) [serial = 7655] [outer = 0x0] [url = about:blank]
++DOMWINDOW == 91 (0x334842e0) [serial = 7659] [outer = 0x3f64f690]
--DOMWINDOW == 90 (0xf5ac670) [serial = 708] [outer = 0x249ad1c0] [url = https://mail.google.com/mail/?ui=2&view=bsp&ver=1qygpcgurkovy]
--DOMWINDOW == 89 (0x2ad64250) [serial = 702] [outer = 0x0] [url = https://mail.google.com/mail/?ui=2&view=js&name=js&ver=aPCJgb7NS-E.en.L&am=b1EopeS3cCHoAb0i0bA2xsRfYkZ2Tw&random=1244409062175]
--DOMWINDOW == 88 (0xcbbb2c0) [serial = 7] [outer = 0x0] [url = https://mail.google.com/mail/?ui=2&zx=2s9zdrkhzj2w&shva=1]
--DOMWINDOW == 87 (0x3e72ee10) [serial = 7657] [outer = 0x0] [url = https://mail.google.com/mail/?ui=2&ik=17e10128ca&view=tl&start=0&num=120&rt=h&cat=Bugzilla&search=cat]
--DOMWINDOW == 86 (0x2f96d4e0) [serial = 7641] [outer = 0x3f64f690] [url = http://tinderbox.mozilla.org/showbuilds.cgi?tree=Firefox&panel=1]
--DOMWINDOW == 85 (0x226513e0) [serial = 709] [outer = 0x0] [url = https://mail.google.com/mail/?ui=2&view=js&name=js&ver=aPCJgb7NS-E.en.L&am=b1EopeS3cCHoAb0i0bA2xsRfYkZ2Tw&random=1244409062175]
--DOMWINDOW == 84 (0x2ed16a60) [serial = 7658] [outer = 0x0] [url = about:blank]
--DOMWINDOW == 83 (0x832df20) [serial = 701] [outer = 0x0] [url = https://mail.google.com/mail/?ui=2&zx=2s9zdrkhzj2w&shva=1]

So that's about 150-odd DOM windows hanging around associated with GMail ... frequent cycle collection traversal of those windows (perhaps triggered by GMail's polling logic?) might explain the pauses and other symptoms I've been seeing.

This problem is really, really bad. It's been bothering me for a while now, it's hard to pin down, and I haven't had enough information to really file a bug until now. But it's a disastrous user experience.
So next time I notice this happening, what should I do to help figure it out?

I don't understand nsGlobalWindow very well, but it looks like we have a whole lot of outer windows whose URIs correspond to GMail actions, and each one of those has an inner window whose URI is about:blank, but no other inners. That seems weird.
Do you know what the steps-to-reproduce for getting the window count to increase are?

If you want to generate a lot of maybe-useful data, you might try doing a DEBUG_CC build, but moving this line in nsGlobalWindow:
  nsCycleCollector_DEBUG_shouldBeFreed(static_cast<nsIScriptGlobalObject*>(this));
from nsGlobalWindow::CleanUp and FreeInnerObjects and SetDocShell to the constructor.

I think that should give you the explanations of why windows are being kept around (along with lots of other noise).
Right now the only steps to reproduce I have are "use GMail for a few days" :-(
I wonder if we could mozmill our way to STR here...
Er... Don't we (or didn't we?!) have code to prevent CC'ing over windows and associated documents that we know are alive? I would think that would prevent this sort of thing unless the documents in question really are orphaned.
(In reply to comment #5)
> code to prevent CC'ing over windows and associated documents that we
> know are alive

Note that we disable that optimization for DEBUG_CC builds.
That should only apply if the window is still being presented in a document viewer, etc.
(In reply to comment #2)
> Do you know what the steps-to-reproduce for getting the window count to
> increase are?
> 
> If you want to generate a lot of maybe-useful data, you might try doing a
> DEBUG_CC build, but moving this line in nsGlobalWindow:
>  
> nsCycleCollector_DEBUG_shouldBeFreed(static_cast<nsIScriptGlobalObject*>(this));
> from nsGlobalWindow::CleanUp and FreeInnerObjects and SetDocShell to the
> constructor.

To the nsGlobalWindow constructor? Are you sure? That sounds weird
(In reply to comment #8)
> To the nsGlobalWindow constructor? Are you sure? That sounds weird

I think so, yes. What it will do is assume that the window should have gone away as soon as it is constructed. So you will get data on who holds it alive every time we cycle collect, starting from when it is created. That way you get the data even if the window is being held alive in a document viewer (which could also be causing this memory growth).
I've got a 4GB log file which contains this when I closed my Gmail window:

--DOMWINDOW == 68 (0x3004c930) [serial = 500] [outer = 0x30be6990] [url = about:blank]
--DOMWINDOW == 67 (0x30be69c0) [serial = 499] [outer = 0x0] [url = https://mail.google.com/mail/?ui=2&ik=17e10128ca&view=tl&star
t=0&num=120&rt=h&search=inbox]
--DOMWINDOW == 66 (0x31f06960) [serial = 497] [outer = 0x0] [url = about:blank]
--DOMWINDOW == 65 (0x8083540) [serial = 19] [outer = 0xc7e8ce0] [url = https://mail.google.com/mail/?nocheckbrowser#inbox]
--DOMWINDOW == 64 (0x9e5e040) [serial = 30] [outer = 0x9e59120] [url = about:blank]
--DOMWINDOW == 63 (0x80d92e0) [serial = 24] [outer = 0x80b7a90] [url = https://mail.google.com/mail/?ui=2&view=bsp&ver=1qygpcgur
kovy]
--DOMWINDOW == 62 (0x80e6990) [serial = 25] [outer = 0x80bbe10] [url = https://mail.google.com/mail/?ui=2&view=bsp&ver=1qygpcgur
kovy]
--DOMWINDOW == 61 (0x33250290) [serial = 498] [outer = 0x0] [url = about:blank]
--DOMWINDOW == 60 (0xc9e8880) [serial = 485] [outer = 0x2590c0d0] [url = about:blank]
--DOMWINDOW == 59 (0x2590c100) [serial = 484] [outer = 0x0] [url = about:blank]
--DOMWINDOW == 58 (0x300c2eb0) [serial = 471] [outer = 0x230371e0] [url = about:blank]
--DOMWINDOW == 57 (0x23037210) [serial = 470] [outer = 0x0] [url = about:blank]
--DOMWINDOW == 56 (0x25340760) [serial = 451] [outer = 0x21d66880] [url = about:blank]
--DOMWINDOW == 55 (0x21d668b0) [serial = 450] [outer = 0x0] [url = about:blank]
--DOMWINDOW == 54 (0x21a578c0) [serial = 421] [outer = 0x9fde470] [url = about:blank]
--DOMWINDOW == 53 (0x9fde4a0) [serial = 420] [outer = 0x0] [url = https://mail.google.com/mail/?ui=2&ik=17e10128ca&view=tl&start
=0&num=120&rt=h&cat=Bugzilla&search=cat]
--DOMWINDOW == 52 (0x31f3dc00) [serial = 330] [outer = 0x27f92cd0] [url = about:blank]
--DOMWINDOW == 51 (0x27f92d00) [serial = 329] [outer = 0x0] [url = https://mail.google.com/mail/?ui=2&ik=17e10128ca&view=cv&th=1
2188cb7e2b04f12&cw=700&rt=h&q=alan+chang+telstra&search=query]
--DOMWINDOW == 50 (0x33226650) [serial = 309] [outer = 0x24888010] [url = about:blank]
--DOMWINDOW == 49 (0x24888040) [serial = 307] [outer = 0x0] [url = https://mail.google.com/mail/?ui=2&ik=17e10128ca&at=xn3j310qf
v6pa5kvv7boxakigulx5l&view=up&act=sm&jsid=kfy400-84ugq4&cmid=5&rt=h&search=inbox&zx=lpr9ss-yallte]
--DOMWINDOW == 48 (0x258331b0) [serial = 218] [outer = 0x235b0930] [url = about:blank]
--DOMWINDOW == 47 (0x235b0960) [serial = 217] [outer = 0x0] [url = https://mail.google.com/mail/?ui=2&ik=17e10128ca&view=tl&star
t=0&num=120&rt=h&cat=WHATWG&search=cat]
--DOMWINDOW == 46 (0x2306d500) [serial = 209] [outer = 0x23af91a0] [url = about:blank]
--DOMWINDOW == 45 (0x23af91d0) [serial = 207] [outer = 0x0] [url = https://mail.google.com/mail/?ui=2&ik=17e10128ca&at=xn3j310qf
v6pa5kvv7boxakigulx5l&view=up&act=sm&jsid=kfy400-84ugq4&cmid=4&rt=h&search=inbox&zx=esmeof-6ga91z]
--DOMWINDOW == 44 (0x224a9c80) [serial = 198] [outer = 0x22655730] [url = about:blank]
--DOMWINDOW == 43 (0x22655760) [serial = 196] [outer = 0x0] [url = https://mail.google.com/mail/?ui=2&ik=17e10128ca&at=xn3j310qf
v6pa5kvv7boxakigulx5l&view=up&act=sm&jsid=kfy400-84ugq4&cmid=3&rt=h&search=inbox&zx=du7bil-jld2ok]
--DOMWINDOW == 42 (0x1060ce80) [serial = 176] [outer = 0x21d18b30] [url = about:blank]
--DOMWINDOW == 41 (0x21d18b60) [serial = 174] [outer = 0x0] [url = https://mail.google.com/mail/?ui=2&ik=17e10128ca&at=xn3j310qf
v6pa5kvv7boxakigulx5l&view=up&act=sm&jsid=kfy400-84ugq4&cmid=2&rt=h&search=inbox&zx=inh54-agp8le]
--DOCSHELL 0xc7e8560 == 18
--DOMWINDOW == 40 (0x80b35a0) [serial = 20] [outer = 0x0] [url = https://mail.google.com/mail/?ui=2&view=js&name=js&ver=J-4BWlKS
S54.en.L&am=b1EopdTncAEFBfnukfAWRlMTj7bl1A]
--DOMWINDOW == 39 (0x80b7ac0) [serial = 21] [outer = 0x0] [url = https://mail.google.com/mail/?ui=2&view=bsp&ver=1qygpcgurkovy]
--DOMWINDOW == 38 (0x80bbe40) [serial = 22] [outer = 0x0] [url = https://mail.google.com/mail/?ui=2&view=bsp&ver=1qygpcgurkovy]
--DOMWINDOW == 37 (0x7d24fe0) [serial = 28] [outer = 0x0] [url = https://mail.google.com/mail/?ui=2&view=js&name=js&ver=J-4BWlKS
S54.en.L&am=b1EopdTncAEFBfnukfAWRlMTj7bl1A]
--DOMWINDOW == 36 (0xc7e8d10) [serial = 7] [outer = 0x0] [url = https://mail.google.com/mail/?nocheckbrowser#inbox]
--DOMWINDOW == 35 (0x9e59150) [serial = 29] [outer = 0x0] [url = about:blank]

What should I do with it? :-)
(In reply to comment #9)
> (In reply to comment #8)
> > To the nsGlobalWindow constructor? Are you sure? That sounds weird
> 
> I think so, yes. What it will do is assume that the window should have gone
> away as soon as it is constructed. So you will get data on who holds it alive
> every time we cycle collect, starting from when it is created. That way you get
> the data even if the window is being held alive in a document viewer (which
> could also be causing this memory growth).

So what should I see in the output?

I have lots of occurrences of things like
nsCycleCollector: XPCWrappedNative (XULElement) 0x10297160 was not collected due to 1
  external references (2 total - 1 known)

But I can't find any occurrences of things like
nsCycleCollector: nsGlobalWindow ...
So can someone tell me what to do with this DEBUG_CC data?
Are there any lines in the log about nsGlobalWindow?
There are reports which mention nsGlobalWindow like this:

nsCycleCollector: XPCWrappedNative (XULElement) 0x24f3d130 was not collected due to 
1
  external references (2 total - 1 known)
  An object expected to be garbage could be reached from it by the path:
    XPCWrappedNative (XULElement) 0x24f3d130
        via <unknown edge>
    JS Object (XULElement) (global=72d9200) 0x82eb360
        via _contentWindow
    XPCNativeWrapper (Window) (global=781ae80) 0x7c35e20
        via focus
    JS Object (Function - XPCNativeWrapper function wrapper) (global=781ae80) 0x7139
e00
        via __proto__
    JS Object (Function) (global=6ead4c0) 0x7bed508
        via __parent__
    JS Object (ChromeWindow) (global=6ead4c0) 0x6ead4c0
        via xpc_GetJSPrivate(obj)
    XPCWrappedNative (ChromeWindow) 0x23277370
        via <unknown edge>
    nsGlobalWindow 0x21815ff0
        via mFocusedNode
    nsGenericElement 0x9ebbb80
        via mWrapper
    XPCWrappedNative (XULElement) 0x23c1a590
        via <unknown edge>
    JS Object (XULElement) (global=6ead4c0) 0x8862b40
        via __parent__
    JS Object (XULElement) (global=6ead4c0) 0x88629e0
        via __parent__
    JS Object (XULElement) (global=6ead4c0) 0x88658e0
        via __parent__
    JS Object (XULElement) (global=6ead4c0) 0x8865480
        via xpc_GetJSPrivate(obj)
    XPCWrappedNative (XULElement) 0x22135da0
        via <unknown edge>
    nsGenericElement 0x8929180
        via [via hash] mListenerManager
    nsEventListenerManager 0xa08c7d0
        via mListeners[i] mListener
    nsJSEventListener 0x233f1180
        via mContext
    nsJSContext 0x2284b590
  The 1 known references to it were from:
    JS Object (XULElement) (global=72d9200) 0x82eb360 via xpc_GetJSPrivate(obj)

Is that relevant?
Attached patch the debugging patch (obsolete) — Splinter Review
This is the patch I used to generate the debug output.

I've uploaded the gzipped log file here:
http://people.mozilla.org/~roc/gmail-foo-bar.gz
It compressed really well ... 4GB to about 70MB.

If you go to the end of the file and search backwards for mail.google.com you'll find the lines that show several DOM windows being released when I closed the tab.
I think I'm still seeing this bug. I'm seeing it in my current session; I've got about 700 DOMWindows with only a few tabs open. Closing the GMail tab does not release the DOMWindows. The browser hangs every few seconds doing cycle collection (I can tell since I have a debugger attached). Anything I can do to help here?
OK, I'll try that.

After I shut down that browser session, it turned out that the leaked windows were not GMail windows (I guess that's good). They're mostly Firebug windows:
--DOMWINDOW == 147 (0x32c9a960) [serial = 990] [outer = 0x0] [url = chrome://browser/content/browser.xul]
--DOMWINDOW == 146 (0x2e747a00) [serial = 973] [outer = 0x2b84fe70] [url = chrome://firebug/content/panel.html]
--DOMWINDOW == 145 (0x221b47a0) [serial = 972] [outer = 0x288a01a0] [url = chrome://firebug/content/panel.html]
--DOMWINDOW == 144 (0x2b84fea0) [serial = 968] [outer = 0x0] [url = chrome://firebug/content/panel.html]
--DOMWINDOW == 143 (0x288a01d0) [serial = 967] [outer = 0x0] [url = chrome://firebug/content/panel.html]

So maybe this bug is fixed...
Leak monitor runs on trunk, but it pops up multiple windows almost every time I close a chrome window. I'm not sure if these are real issues or not.
I seem to be in a state where closing a leakmonitor window causes a new leak report :-(. I'm going to have to turn this off.
bug 563826 might help here.
Roc, bug 563826 landed ~5 days ago, any idea yet as to whether it made things better here or not?
I had a build with this patch running for a few days but before I could close the GMail window to check, my machine crashed :-(
OK, I just shut down and it looks like the leak is still here.
Attached patch debug patch (obsolete) — Splinter Review
Could we get this debugging patch on trunk? I'd like to stop having to maintain it in my queue, and it could be useful for other people and other problems. I've made it so that you can dump the graph whenever DEBUG or DEBUG_CC is defined.
Attachment #444484 - Flags: review?(peterv)
(In reply to comment #25)
> I've made it so that you can dump the graph whenever DEBUG or DEBUG_CC is
> defined.

Actually, the patch seems to enable dumping the graph for all builds?
Yes, actually it does, but the cost for non-debug builds when you're not collecting a graph is minimal, I think.
I've been debugging GMail under DEBUG_CC, mostly concentrating on messages of this type:

> nsCycleCollector: nsGenericElement 0x11924bb70 was not collected due to 2
>  external references (5 total - 3 known)

I'm sure my technique is unsophisticated, but here's what I've found.

I made a little function called PrintElement that supplies some default arguments to nsGenericElement::List. It prints out the element name, attributes, and children.

Then, I set a breakpoint on that cycle collector message. Once I catch one, I pull out the pi->mPointer value and cast it to whatever it's supposed to be in gdb.

So far, it's looking like accessibility code is probably the culprit in the elements I've examined so far.

(gdb) call $5.PrintElement()
    <div tabindex="-1" role="menu" class="b7 J-M" style="-moz-user-select: none; display: none;" aria-haspopup="true">
    </div>

(gdb) call $2.PrintElement()
    <input type="text" class="dI dG" label="Search, add, or invite" title="Search, add, or invite" ignoreesc="true" autocomplete="off" id=":vd" aria-haspopup="true">
    </input>

(gdb) call $8.PrintElement()
	    <div class="nH qp" role="navigation">
	      <div id="gbar">
	        <nobr>
	        ...
looking over the /accessible/ directory, I see basically none of it is cycle collected. We should probably write a Dehydra script that prints out nsCOMPtr and other sorts of members that we would expect to be included in a traverse and unlink implementation, but aren't. That might be better than tracking these down in web pages.
Alright, a11y is not my problem, but it does look fishy that it's not CCed.
(In reply to comment #27)
> Yes, actually it does, but the cost for non-debug builds when you're not
> collecting a graph is minimal, I think.

Peter? Can we get this reviewed and landed please?
Comment on attachment 444484 [details] [diff] [review]
debug patch

We should still always output graphs if DEBUG_CC is set. It looks like this patch disables that.

>diff --git a/dom/base/nsDOMWindowUtils.cpp b/dom/base/nsDOMWindowUtils.cpp

>+nsDOMWindowUtils::GarbageCollect(PRBool aDrawGraph)

>-  nsJSContext::CC();
>-  nsJSContext::CC();
>+  nsJSContext::CC(aDrawGraph);
>+  nsJSContext::CC(aDrawGraph);

Just drop one of these calls, callers can call twice if needed.

>diff --git a/dom/base/nsJSEnvironment.h b/dom/base/nsJSEnvironment.h

>+  static void CC(PRBool aDrawGraph = PR_FALSE);

I'd rather not use a default argument.

>diff --git a/dom/interfaces/base/nsIDOMWindowUtils.idl b/dom/interfaces/base/nsIDOMWindowUtils.idl

>@@ -254,17 +254,17 @@ interface nsIDOMWindowUtils : nsISupport
> 
>   /**
>    * Force a garbage collection. This will run the cycle-collector twice to
>    * make sure all garbage is collected.
>    *
>    * Will throw a DOM security error if called without UniversalXPConnect
>    * privileges in non-debug builds. Available to all callers in debug builds.
>    */
>-  void garbageCollect();
>+  void garbageCollect(in boolean aDrawGraph);

Add a comment describing the argument.

>diff --git a/xpcom/base/nsCycleCollector.cpp b/xpcom/base/nsCycleCollector.cpp

> #ifdef DEBUG_CC
>+#define DEBUG

This seems weird, I think you should use a new define (DEBUG_DESCRIBE_EDGE_NAME?) that's defined if either DEBUG_CC or DEBUG are defined.

> GCGraphBuilder::GCGraphBuilder(GCGraph &aGraph,
>-                               nsCycleCollectionLanguageRuntime **aRuntimes)
>+                               nsCycleCollectionLanguageRuntime **aRuntimes,
>+                               PRBool aDrawGraph)
>     : mNodeBuilder(aGraph.mNodes),
>       mEdgeBuilder(aGraph.mEdges),
>       mRuntimes(aRuntimes)
> {
>     if (!PL_DHashTableInit(&mPtrToNodeMap, &PtrNodeOps, nsnull,
>                            sizeof(PtrToNodeEntry), 32768))
>         mPtrToNodeMap.ops = nsnull;
>-#ifdef DEBUG_CC
>-    // Do we need to set these all the time?
>-    mFlags |= nsCycleCollectionTraversalCallback::WANT_DEBUG_INFO |
>-              nsCycleCollectionTraversalCallback::WANT_ALL_TRACES;
>-#endif
>+    if (aDrawGraph) {
>+        mFlags |= nsCycleCollectionTraversalCallback::WANT_DEBUG_INFO |
>+                  nsCycleCollectionTraversalCallback::WANT_ALL_TRACES;

Those flags still need to be always set if DEBUG_CC is defined.

> GCGraphBuilder::NoteXPCOMChild(nsISupports *child) 
> {
>-#ifdef DEBUG_CC
>     nsCString edgeName(mNextEdgeName);
>     mNextEdgeName.Truncate();
>-#endif

Shouldn't this be:

    nsCString edgeName;
    if (WantDebugInfo()) {
        edgeName.Assign(mNextEdgeName);
        mNextEdgeName.Truncate();
    }

(here and below).

> nsCycleCollector::MaybeDrawGraphs()

>+        // We can't just use _popen here because graphviz-for-windows
>+        // doesn't set up its stdin stream properly, sigh.

Just drop this comment.

Minusing because we should still always have the logging in DEBUG_CC builds.
Attachment #444484 - Flags: review?(peterv) → review-
Attached patch debugging patch v2 (obsolete) — Splinter Review
All comments addressed
Attachment #462562 - Flags: review?(peterv)
Comment on attachment 462562 [details] [diff] [review]
debugging patch v2

>diff --git a/xpcom/base/nsCycleCollector.cpp b/xpcom/base/nsCycleCollector.cpp
>--- a/xpcom/base/nsCycleCollector.cpp
>+++ b/xpcom/base/nsCycleCollector.cpp

> NS_IMETHODIMP_(void)
> GCGraphBuilder::DescribeNode(CCNodeType type, nsrefcnt refCount,
>                              size_t objSz, const char *objName)
> {
> #ifdef DEBUG_CC
>     mCurrPi->mBytes = objSz;
>-    mCurrPi->mName = PL_strdup(objName);
>+#endif
>+#ifdef DEBUG_CC_GRAPHS
>+    if (WantDebugInfo()) {
>+        mCurrPi->mName = PL_strdup(objName);
>+    }
> #endif

I'm ok with doing this, but the graphs will not be very useful without the name I think. I think I'd be equally ok with dropping DEBUG_CC_GRAPHS altogether.

> GCGraphBuilder::NoteScriptChild(PRUint32 langID, void *child) 
> {
>-#ifdef DEBUG_CC
>     nsCString edgeName(mNextEdgeName);
>     mNextEdgeName.Truncate();

    nsCString edgeName;
    if (WantDebugInfo()) {
        edgeName.Assign(mNextEdgeName);
        mNextEdgeName.Truncate();
    }

>-#endif

>@@ -2429,16 +2422,22 @@ nsCycleCollector::Collect(PRUint32 aTryC

>+    PRBool drawGraphs = mParams.mDrawGraphs;
>+    mParams.mDrawGraphs |= aDrawGraph;

>+#ifdef DEBUG_CC
>+    mParams.mDrawGraphs = PR_TRUE;
>+#endif

After looking at the file before and after I have to say I was wrong. We don't always draw in DEBUG_CC build, so drop these three lines again. Sorry.
Attachment #462562 - Flags: review?(peterv) → review+
Whiteboard: [needs landing]
(In reply to comment #34)
> I think I'd be equally ok with dropping DEBUG_CC_GRAPHS altogether.

If we drop DEBUG_CC_GRAPHS altogether then release builds will have to copy edge-name strings around. That sounds like a perf hit we don't want. Instead I will wrap DEBUG_CC_GRAPHS around all the graph drawing stuff so that release builds can never draw graphs, since the graphs would be useless with all unknown edge labels
Attached patch debugging patch v3 (obsolete) — Splinter Review
Attachment #383846 - Attachment is obsolete: true
Attachment #462562 - Attachment is obsolete: true
http://hg.mozilla.org/mozilla-central/rev/2e2b2ae083e5
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Assignee: nobody → roc
Whiteboard: [needs landing]
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
This was landed but was backed out. There were a few problems:
1) build bustage where in some configurations dlldeps.cpp calls nsCycleCollector_collect(), and needed to pass a value for aDrawGraphs
2) we were only calling the destructor for mEdgeNames #ifdef DEBUG_CC, so we appeared to leak a lot (although the leak wasn't real, since the array was always empty)
3) a number of tests call DOMWindowUtils.garbageCollect, and failed because they weren't passing a parameter.
Attached patch debugging patch v4 (obsolete) — Splinter Review
1) fixed call in dlldeps.cpp
2) made parameter to DOMWindowUtils.garbageCollect optional
3) made mEdgeNames only defined #ifdef DEBUG_CC_GRAPHS, made us iterate through and call Destroy on all nodes #ifdef DEBUG_CC_GRAPHS. This fixes the leak.
Attachment #465040 - Attachment is obsolete: true
Attachment #465141 - Flags: review?(peterv)
blocking2.0: --- → ?
Whiteboard: [needs review]
(In reply to comment #35)
> If we drop DEBUG_CC_GRAPHS altogether then release builds will have to copy
> edge-name strings around. That sounds like a perf hit we don't want.

So your previous patch already did that, my comment was about object names. Given that we wrap everything in |WantDebugInfo()|, I don't see why we care about the performance hit. Graphs without descriptions of nodes and edges sound pretty useless, no?
You're right. But there's also a cost for conditional branches (including where we destroy empty nsTArrays etc), and a space cost in PtrInfo (which is called out as being something that needs to be kept small) --- we're adding two pointers to it.

I really don't want to hurt cycle collector performance. Wrapping everything in DEBUG_CC_GRAPHS means I don't have to worry about it, otherwise I think we're going to have to measure it, and I don't want to do that right now.
Oh, the other performance cost is that in DEBUG_CC_GRAPHS we have to iterate through all PtrInfos to destroy them all. I bet that's measurable. I suppose we could optimize that away though.

I suppose it would be pretty useful to be able to dump graphs from release builds. Maybe we could even expose the graph in an extension or something.
OK, here's the version which enables labelled graphs in release builds.

With this patch, you can get a graph in any build by opening the Error Console and evaluating
window.QueryInterface(Components.interfaces.nsIInterfaceRequestor).getInterface(Components.interfaces.nsIDOMWindowUtils).garbageCollect(true)
Attachment #465533 - Flags: review?(peterv)
It would be cool if we could have an nsICycleCollectorGraphSink interface that we could pass in as a parameter to garbageCollect(), then extensions could capture the graph and dump it or do other things with it.
Comment on attachment 465533 [details] [diff] [review]
debugging patch, with graphs always enabled

> GCGraphBuilder::NoteScriptChild(PRUint32 langID, void *child) 
> {
>-#ifdef DEBUG_CC
>     nsCString edgeName(mNextEdgeName);
>     mNextEdgeName.Truncate();
>-#endif

    nsCString edgeName;
    if (WantDebugInfo()) {
        edgeName.Assign(mNextEdgeName);
        mNextEdgeName.Truncate();
    }

I'm ok with this. Adding two pointers to PtrInfo is unfortunate, but if we do want graphs in all builds then we need to bite the bullet. The only other option I see is moving everything behind |#ifdef DEBUG|.
Attachment #465533 - Flags: review?(peterv) → review+
Comment on attachment 465141 [details] [diff] [review]
debugging patch v4

This is wrong in any case, as it doesn't put all of the edgename code behind DEBUG_CC_GRAPHS.
Attachment #465141 - Flags: review?(peterv) → review-
We could avoid adding to PtrInfo if we dump some of the information as we're building the graph.  It would require a little more postprocessing, though.
(For example, one could dump all the nodes (with names), edges (with names), and reference counts while building the graph, and then the internal-refs and colors at the end.)
I landed the patch, but it caused leaks on all debug boxes so was backed out again.

I'm really too busy and frustrated to work on this more right now. But we really need the GMail bug fixed. This needs to land ASAP so we can get the data to do that. I hope someone who understands the cycle collector will be able to do this.
Assignee: roc → nobody
Whiteboard: [needs review]
Over to peterv.
Status: REOPENED → ASSIGNED
blocking2.0: ? → final+
Assignee: nobody → peterv
I've put a better logging patch in bug 466157. It already had a reviewed patch for logging, but I've incorporated some of the ideas from this bug (added an interface to listen to the building of the graph, implemented a logging component in all builds without too much overhead).
Depends on: 466157
Keywords: mlk
Whiteboard: softblocker
Whiteboard: softblocker → [softblocker]
Whiteboard: [softblocker]
Whiteboard: [softblocker]
Alex, are you still seeing this? We found one leak related to FAYT and fixed it. I'm not sure I'm seeing this anymore. (I'm not sure I'm NOT seeing it, either.)
** PRODUCT DRIVERS PLEASE NOTE **

This bug is one of 7 automatically changed from blocking2.0:final+ to blocking2.0:.x during the endgame of Firefox 4 for the following reasons:

 - it was marked as a soft blocking issue without a requirement for beta coverage
blocking2.0: final+ → .x+
(In reply to comment #54)
> Alex, are you still seeing this? We found one leak related to FAYT and fixed
> it. I'm not sure I'm seeing this anymore. (I'm not sure I'm NOT seeing it,
> either.)

As best I can tell (from having the browser open all day with GMail open as an app tab), this bug is alive and well.
Blocks: mlk-fx5+
I agree there is some kind of GMail-related memory issue. Right now I have a session where with nothing but GMail open, I have a 1GB working set.
about:memory:

win32/privatebytes930,041,856
win32/workingset950,538,240
js/gc-heap112,197,632
js/string-data5,179,774
js/mjit-code15,819,988
storage/sqlite/pagecache52,659,808
storage/sqlite/other1,674,216
gfx/d2d/surfacecache778,092
gfx/d2d/surfacevram17,437,844
images/chrome/used/raw0
images/chrome/used/uncompressed588,916
images/chrome/unused/raw0
images/chrome/unused/uncompressed0
images/content/used/raw230,721
images/content/used/uncompressed199,740
images/content/unused/raw6,082
images/content/unused/uncompressed9,648
layout/all5,427,542
layout/bidi652
gfx/surface/image876,292
gfx/surface/win320
content/canvas/2d_pixel_bytes230,400
shmem/allocated1,765,376
shmem/mapped1,765,376
      
CC dump:
http://people.mozilla.org/~roc/gmail-cc-edges.log.bz2

Peterv, are the tools you use to analyze that log available anywhere? I'm not sure what to do with it.

I grabbed a full crash dump as well.
Depends on: 644070
I've been running gmail in a new profile in a single tab for a few days, and I'm trying to see if I can use the OS X leaks tool to determine what's happening here.  The first real leak that I've found in this effort is bug 644070.
Good on you!

I'm writing a tool to analyze Windows crashdumps and figure out as much as possible about how the memory is being used.
Also, whatever is the cause of the major leak here is not reported in about:memory.  Currently, Firefox is using 1.05GB memory on my system, and about:memory shows ~150MB committed memory.  The number 150MB has been stable for me for the past few days.
Depends on: 644073
I think I've exhausted all of the things that the leaks tool could help me find here.  I ended up excluding quite a bit of stuff because of masked pointers, and such:

leaks -exclude GetPropertyTreeChild -exclude js_NewRegExp -exclude InitExnPrivate -exclude 'CSSParserImpl::ParseStyleAttribute(nsAString_internal const&, nsIURI*, nsIURI*, nsIPrincipal*, nsICSSStyleRule**)' -exclude 'nsAttrValue::ParseAtomArray(nsAString_internal const&)' -exclude define_JavaPackage -exclude 'nsAttrValue::EnsureEmptyMiscContainer()' -exclude 'nsRuleNode::ConvertChildrenToHash()' -exclude 'nsRuleNode::Sweep()' -exclude 'nsXULDocument::AddBroadcastListenerFor(nsIDOMElement*, nsIDOMElement*, nsAString_internal const&)' -exclude 'nsRuleNode::Transition(nsIStyleRule*, unsigned char, unsigned char)' -exclude ProcessCharSet -exclude 'js::PropertyTree::insertChild(JSContext*, js::Shape*, js::Shape*)' -exclude 'nsXBLProtoImplMethod::AppendBodyText(nsAString_internal const&)' -exclude 'nsCSSExpandedDataBlock::Compress(nsCSSCompressedDataBlock**, nsCSSCompressedDataBlock**)' -exclude 'nsHtml5Portability::newStringFromBuffer(unsigned short*, int, int)' -exclude 'gfxMacFont::CreatePlatformShaper()' -exclude 'PICLinker::init(JSContext*)' -exclude 'js::PropertyTree::removeChild(js::Shape*)' -exclude 'NewArguments(JSContext*, JSObject*, unsigned int, JSObject&)' -exclude 'nsAlertsService::Init()' -exclude 'nsXULPrototypeElement::SetAttrAt(unsigned int, nsAString_internal const&, nsIURI*)' -exclude 'nsStyledElement::ParseStyleAttribute(nsAString_internal const&, nsAttrValue&, int)' -exclude 'CSSStyleRuleImpl::DeclarationChanged(mozilla::css::Declaration*, int)' -exclude 'nsIDOMCSS2Properties_Set(JSContext*, JSObject*, long, int, nsCSSProperty, unsigned long long*)' -exclude 'TryToStartImageLoad(nsCSSValue const&, nsIDocument*, nsCSSProperty)' -exclude 'nsIDOMCSSStyleDeclaration_SetCssText(JSContext*, JSObject*, long, int, unsigned long long*)' -exclude 'nsAttrValue::ToString(nsAString_internal&) const' -exclude 'nsAttrValue::ToString(nsAString_internal&) const' -exclude 'js_NewFlatClosure(JSContext*, JSFunction*, JSOp, unsigned long)' -exclude 'nsXBLProtoImplMethod::SetLineNumber(unsigned int)' -exclude 'nsXBLProtoImplMethod::AddParameter(nsAString_internal const&)' -exclude TSMCurrentKeyboardInputSourceRefCreate -exclude 'nsCheapStringSet::Put(nsAString_internal const&)'  18569 > mlk

Process 18569: 17707 leaks for 1042896 total leaked bytes.


We're dealing with a cycle here for sure.  That's my conclusion for the day.
(In reply to comment #58)
> Peterv, are the tools you use to analyze that log available anywhere?

Yup, in bug 466157. We've got someone to work on better tools, so we should soon have documentation (and hopefully even better tools).

What I've done so far is first reverse the edges using attachment 468816 [details]:

./reverse-edges.pl gmail-cc-edges.log > out-gmail-cc-edges.log

and then look for interesting stuff using attachment 468818 [details]:

./find-roots.pl out-gmail-cc-edges.log google

which gives:

0x2cb77708 nsDocument (xhtml) https://mail.google.com/mail/?ui=2&view=bsp&ver=ohhl
0x10ece780 nsDocument (xhtml) https://mail.google.com/mail/?ui=2&view=btop&ver=1s4
0x34a78880 nsDocument (xhtml) https://mail.google.com/mail/?ui=2&view=bsp&ver=ohhl
...

and

./find-roots.pl out-gmail-cc-edges.log 0x2cb77708

which gives:

0x2cb77708 [nsDocument (xhtml) https://mail.google.com/mail/?ui=2&view=bsp&ver=ohhl]

    0x192ecdc8 [JS Object (Range) (global=305d2ed8)]
        --[xpc_GetJSPrivate(obj)]-> 0x33fadea0 [XPCWrappedNative (Range)]
        --[]-> 0x27ec6f28 [nsRange]
        --[mRoot]-> 0x2cb77708 [nsDocument (xhtml) https://mail.google.com/mail/?ui=2&view=bsp&ver=ohhl]

        known edges:
            0x0f0189d8 JS Object (Object) (global=fdb58c0) --[]--> 0x192ecdc8

so looks like more problems with ranges.

(One can also do |./find-roots.pl out-gmail-cc-edges.log all google| which will look at all those nsDocuments, but that'll take forever).

We are missing some information though, the problem is that for black JS objects there's no good way to know if they are a root or just marked by a root. So for object 0x0f0189d8 the tool thinks it's a root, which might be true or it might just have a root as an ancestor in the graph. I'm trying to change the tool so it'll at least keep going backwards if there are edges into the black JS object.
Attached file Log for 0x2cb77708
This is a log for 0x2cb77708 taking into account all incoming edges into JS objects. It is a bit overwhelming though. Some interesting stuff, a number of live timeouts keeping that object alive. Also:

    0x11c73c08 [JS Object (XMLHttpRequest) (global=fdb58c0)]
        --[]-> 0x0fdb58c0 [JS Object (Window) (global=fdb58c0)]
...
        --[mRoot]-> 0x2cb77708 [nsDocument (xhtml) https://mail.google.com/mail/?ui=2&view=bsp&ver=ohhl]

        known edges:

JS objects with no known incoming edges are either on the stack or kept alive by traditional JS rooting.
Attachment #465141 - Attachment is obsolete: true
Attachment #465533 - Attachment is obsolete: true
The range's start and end nodes are the body element of the 0x2cb77708 document, so we can't just cut that edge somehow.
The window 0x0fdb58c0 is https://mail.google.com/mail/?ui=2&view=js&name=main, so I think that definitely should be alive. The question is, why is it keeping that Range alive?

All paths to the Range go through the object 19D45C60. Too bad the CC-log doesn't label the edges coming out of JS objects... (can we fix that?) I'll see what I can figure out.
The edge labels are
        --[]-> 0x19d45c60 [JS Object (Object) (global=fdb58c0)]
        --["Ca"]-> 0x0f018a20 [JS Object (Object) (global=fdb58c0)]
        --["xa"]-> 0x0f0189d8 [JS Object (Object) (global=fdb58c0)]
        --["aa"]-> 0x192ecdc8 [JS Object (Range) (global=305d2ed8)]

Nasty obfuscated names :-(.

I wonder whether, say, Chrome keeps the same set of documents alive. Right now I have a Chrome session with nothing but GMail, and the Chrome dev tools report 23 extant HTMLDocuments, and 11 Range objects. I wonder if in fact GMail just leaks, but each leaked document costs us more for some reason.
I can reproduce this in GMail with a google chat window open and the cursor blinking inside of it.
blocking2.0: .x+ → ---
OS: Mac OS X → Windows 7
Did you mean to remove the blocking flag? :-)
Google people agree that it looks like a GMail bug. I'm trying to help them narrow it down.
We narrowed it down to a bug in GMail; when an email editing widget is dismissed, some stuff isn't unlinked from a global object that should be. It's up to them to take it from here.

I'd still like to figure out why we leak so much more memory than Chrome though in the same situation.
As to the additional memory usage, there are probably other bugs, like Bug 645633
Maybe in chrome, keeping a range alive doesn't keep the whole document alive?  (Fixing bug 335998 will make this sort of problem even more common in Gecko.)
No, I confirmed that Chrome was leaking documents. The documents that are leaked are small too.
roc, should we close this or leave it open until Google's fix is seen and checked?
No longer blocks: mlk-fx5+
roc, are you still seeing this problem?
Flags: needinfo?(roc)
Not currently.
Status: ASSIGNED → RESOLVED
Closed: 14 years ago11 years ago
Flags: needinfo?(roc)
Resolution: --- → WORKSFORME
Component: DOM → DOM: Core & HTML
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: