Closed Bug 764742 Opened 12 years ago Closed 12 years ago

Thunderbird shows reference count leaks on shutdown

Categories

(Thunderbird :: Mail Window Front End, defect)

13 Branch
defect
Not set
normal

Tracking

(thunderbird13+ fixed, thunderbird14+ fixed, thunderbird15+ fixed)

RESOLVED FIXED
Thunderbird 16.0
Tracking Status
thunderbird13 + fixed
thunderbird14 + fixed
thunderbird15 + fixed

People

(Reporter: standard8, Assigned: standard8)

References

Details

(Keywords: memory-leak, regression)

Attachments

(4 files)

Attached file Bloat Log
Sometime between Thunderbird 10 and 14, we started showing a whole lot of refcount leaks on shutdown. We missed this because we haven't been running reference count checks on our bloat builds, and we've not got UI based coverage for refcount leaks. I'll file a follow-up bug for that.

We're leaking global windows, so it is possible that we've got observer services holding onto windows that we're not tidying up on shutdown.

An example log is attached, this was generated via "make mailbloat" on a trunk build.
(In reply to Mark Banner (:standard8) from comment #0)
> Sometime between Thunderbird 10 and 14

I've got this down to between THUNDERBIRD_12_0_RELEASE and THUNDERBIRD_13_0_RELEASE.
Dumb question time: how does one go about rooting these leaks out?
In Firefox what you'd do if you have no idea what is causing the problem is to generate a cycle collector log on shutdown (the easiest way to do this right now is to set gAlwaysLogCCGraphs to true in xpcom/base/nsCycleCollector.cpp), then use my find_roots script (which you can get from here https://github.com/amccreight/heapgraph/tree/master/cc : you also need the  	parse_cc_graph.py file).  That will tell you what is holding alive one of the nsGlobalWindows.  If it is a refcounted object, then you need to do refcount logging for that particular object.  Not an easy process.
There's also this old document about refcounting:

http://www-archive.mozilla.org/performance/leak-tutorial.html

(In reply to Mark Banner (:standard8) from comment #1)
> I've got this down to between THUNDERBIRD_12_0_RELEASE and
> THUNDERBIRD_13_0_RELEASE.

and now down to between THUNDERBIRD_12_0_RELEASE and THUNDERBIRD_13_0b1_RELEASE.
(In reply to Andrew McCreight [:mccr8] from comment #3)
> In Firefox what you'd do if you have no idea what is causing the problem is
> to generate a cycle collector log on shutdown (the easiest way to do this
> right now is to set gAlwaysLogCCGraphs to true in
> xpcom/base/nsCycleCollector.cpp),

I have a build with DEBUG_CC defined, and started it with XPCOM_CC_DRAW_GRAPHS=1 in the environment. That's equivalent, right?


> then use my find_roots script (which you
> can get from here https://github.com/amccreight/heapgraph/tree/master/cc :
> you also need the  	parse_cc_graph.py file).  That will tell you what is
> holding alive one of the nsGlobalWindows.

I then cd /var/tmp/ and run python2.6 find_roots.py gc-edges-4.32000.log nsGlobalWindow

This prints thousands of "Error: Unknown result entry: <some line from the gc-edges-4.32000.log file>" lines in my terminal, and then:
Done loading graph. Reversing graph. Done.

No matching class names found.


What could I be doing wrong? :)
(In reply to Florian Quèze from comment #5)
> I have a build with DEBUG_CC defined, and started it with
> XPCOM_CC_DRAW_GRAPHS=1 in the environment. That's equivalent, right?
Yup.

> What could I be doing wrong? :)
Sounds like a problem with my script.  If you could post a compressed log here or email it to me I can figure out what is happening.
Attached file Gzipped cc log
Attachment #633146 - Attachment mime type: text/plain → application/x-gzip
Oh, sorry, I didn't read your earlier post closely enough.  There should be two files that it creates, gc-edges-XYZ and cc-edges-XYZ.  You want to run the script on the cc-edges file.  I forget that we generate both nowadays.  I should have been more specific.
Works much better with the correct input file!
I really need to add some kind of header so the parsing scripts can see that they are loading the wrong file, instead of just dumping out mountains of useless error messages.

So, here's an example entry.  The first line shows the object that the cycle collector has decided is alive. The rest of it shows how that object is keeping alive the nsGlobalWindow, but for our purposes here we probably don't care.

0x242b6a18 [nsXPCWrappedJS (nsIInterfaceRequestor)]
    --[mJSObj]-> 0x2340ca00 [JS Object (Object)]
    --[type_proto]-> 0x2bb9790 [JS Object (Object)]
    --[parent]-> 0x2bb1040 [JS Object (ChromeWindow)]
    --[xpc_GetJSPrivate(obj)]-> 0x27c7e10 [XPCWrappedNative (ChromeWindow)]
    --[mIdentity]-> 0x27c688c [nsGlobalWindow]

    Root 0x242b6a18 is a ref counted object with 1 unknown edge(s).
    known edges:
       0x242b6a18 [nsXPCWrappedJS (nsIInterfaceRequestor)] --[self]-> 0x242b6a18

Looking over these, it appears that there are three different objects that are keeping the nsGlobalWindow alive:
  0x245a9cf8 [nsXPCWrappedJS (nsIMsgDBViewCommandUpdater)]
  0x242b6a18 [nsXPCWrappedJS (nsIInterfaceRequestor)]
  0x242c1bc8 [nsXPCWrappedJS (nsIMsgHeaderSink)]

These are all wrapped JS, which means that JS is being used to implement some interface exposed to C++, like the nsIMsgDBViewCommandUpdater.  They each have 1 unknown edge, which means there's something keeping each of these alive.

My guess would be is that these are some kind of service that is being registered somewhere by being stored into an array or whatever, and that the array isn't being cleared.  If some or all of these types aren't widely used then the easiest thing from here would be to just find where they are used in the code.  Because these are wrapped JS, which is a widely used class, ref counting may not be practical.
nsIInterfaceRequestor doesn't help me much, it's used in so many places.

I started investigating in the direction of nsIMsgDBViewCommandUpdater that seems easier.
It's implemented at http://mxr.mozilla.org/comm-central/source/mail/base/content/folderDisplay.js#1229
The only place that really seems to use it, but keeps a strong reference is: http://mxr.mozilla.org/comm-central/source/mailnews/base/src/nsMsgDBView.cpp#2401

So I'm guessing something keeps some references to a nsMsgDBView instance. And there's some funny business with some gDBView global variables in the UI code. I'm trying to understand what's happening there exactly.


nsIMsgHeaderSink is implemented in http://mxr.mozilla.org/comm-central/source/mail/base/content/msgHdrViewOverlay.js#374 and stored at http://mxr.mozilla.org/comm-central/source/mailnews/base/src/nsMsgWindow.cpp#173
So I guess something keeps a reference to the nsMsgWindow instance. I'll investigate that later.

Thanks for the help Andrew!
No problem!  You can try doing ref count logging on nsMsgDBView or something, although it is kind of a pain, so you may want to investigate other approaches first. ;)
I can look at it from the nsIMsgDBView angle.
Changing at http://mxr.mozilla.org/comm-central/source/mail/base/content/folderDisplay.js#911 the line
    if (this._active) {
to
    if (gDBView == this.view.dbView || this._active) {

removes from the output of Andrew's script all the mentions of nsIMsgDBViewCommandUpdater.

We'll need to debug why the this._active value isn't correctly updated (or just stop using it).
If at http://mxr.mozilla.org/comm-central/source/mail/base/content/msgMail3PaneWindow.js#622 I add " msgWindow.msgHeaderSink = null;", the only thing that remains in the leak log is: http://pastebin.mozilla.org/1662416

Note: setting that global msgWindow variable to null doesn't help.

I don't really know how I can debug the nsXPCWrappedJS (nsIInterfaceRequestor) case that remains.
Blocks: 738778
(In reply to Florian Quèze from comment #15)
> If at
> http://mxr.mozilla.org/comm-central/source/mail/base/content/
> msgMail3PaneWindow.js#622 I add " msgWindow.msgHeaderSink = null;", the only
> thing that remains in the leak log is: http://pastebin.mozilla.org/1662416
> 
> Note: setting that global msgWindow variable to null doesn't help.
> 
> I don't really know how I can debug the nsXPCWrappedJS
> (nsIInterfaceRequestor) case that remains.

could it be the notification callbacks for the nsIMsgWindow? i.e., does doing msgWindow.notificationCallbacks = null; help?
As discussed on irc, there's a multitude of leaks but the most severe ones are for the standalone message window, which for some users will get opened and closed a lot.

This patch aims to fix mainly the standalone message ones and incorporates fixes from David, Florian and myself.

The three pane and compose windows still leak - we'll need a follow-up to track those down.
Attachment #633268 - Flags: review?(dbienvenu)
Comment on attachment 633268 [details] [diff] [review]
Fix standalone message window leaks

I guess we can mutually r= this.
Attachment #633268 - Flags: review?(dbienvenu) → review+
Comment on attachment 633268 [details] [diff] [review]
Fix standalone message window leaks

[Triage Comment]
We want this for 13.0.1 due to the memory leak issues in the standalone window. a=me.
Attachment #633268 - Flags: approval-comm-release+
Attachment #633268 - Flags: approval-comm-beta+
Attachment #633268 - Flags: approval-comm-aurora+
> Fix standalone message window leaks

Thanks, guys!
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: