Closed Bug 1342745 Opened 7 years ago Closed 7 years ago

Leaks detected by xpcshell tests failing on the atom leak assertion (IMAP)

Categories

(MailNews Core :: Backend, defect)

defect
Not set
critical

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: jorgk-bmo, Unassigned)

References

(Depends on 1 open bug)

Details

Attachments

(2 files, 2 obsolete files)

+++ This bug was initially created as a clone of Bug #1337865 +++

PROCESS-CRASH | mailnews/imap/test/unit/test_imapFilterActions.js
PROCESS-CRASH | mailnews/imap/test/unit/test_imapFilterActionsPostplugin.js
PROCESS-CRASH | mailnews/db/gloda/test/unit/test_index_junk_imap_online.js
PROCESS-CRASH | mailnews/db/gloda/test/unit/test_index_junk_imap_offline.js

Let's look at more leaks in this new bug.
Moving onto the next fire ;-)

A little bisection in test_imapFilterActionsPostplugin.js shows that two subtest cause the test to leak. When removing subtest MoveToFolder() and CopyToFolder() the leak is gone. Both of these when run as only test cause the leak.

Equally, when removing the MoveToFolder*() and CopyToFolder*() tests from test_imapFilterActions.js the test stops leaking. So by the looks of it, IMAP move or copy to folder by a filter causes the leak.
Wherever you look, you find beautiful code, like here:
First dereference pointer, then check for != nullptr. Nice.
Continuing leak search:

XPCOM_MEM_LOG_CLASSES=nsMsgDBFolder XPCOM_MEM_REFCNT_LOG=$HOME/refcounts.log mozilla/mach xpcshell-test mailnews/imap/test/unit/test_imapFilterActions.js

python mozilla/tools/rb/find_leakers.py $HOME/refcounts.log
000000000C2B2400 (12) @ <nsMsgDBFolder>
000000000A5C8C00 (13) @ <nsMsgDBFolder>

Repeat:
XPCOM_MEM_LOG_CLASSES=nsMsgDBFolder XPCOM_MEM_LOG_OBJECTS=12 XPCOM_MEM_REFCNT_LOG=$HOME/refcounts.log mozilla/mach xpcshell-test mailnews/imap/test/unit/test_imapFilterActions.js

python mozilla/tools/rb/find_leakers.py $HOME/refcounts.log
000000000C0C3000 (1) @ <nsMsgDBFolder>

perl -w mozilla/tools/rb/make-tree.pl --ignore-balanced --object 000000000C0C3000 < $HOME/refcounts.log > $HOME/xx.log

Sadly that died with "Out of memory during request for 22804 bytes, total sbrk() is 268267520 bytes!" after writing a file of 128 MB for further analysis. Aleth apparently found leaks using this tool, but I don't see how looking through an incomplete 128 MB file will reveal anything.
Running the analysis tools on a reduced test_imapFilterActions.js again, this time for nsImapMailFolder I get:
python mozilla/tools/rb/find_leakers.py $HOME/refcounts2.log
000000000C0B2400 (2) @ <nsImapMailFolder>

Logging constructor and destructor, I see:
"=== nsImapMailFolder::nsImapMailFolder 000000000C2C2C00"
"=== nsImapMailFolder::nsImapMailFolder 000000000C380800" <== not destroyed
"=== nsImapMailFolder::nsImapMailFolder 000000000D9D1400"
"=== nsImapMailFolder::nsImapMailFolder 000000000D9D1800"
"=== nsImapMailFolder::~nsImapMailFolder 000000000D9D1800"
"=== nsImapMailFolder::nsImapMailFolder 000000000D9D1800"
"=== nsImapMailFolder::~nsImapMailFolder 000000000D9D1800"
"=== nsImapMailFolder::~nsImapMailFolder 000000000D9D1400"
"=== nsImapMailFolder::~nsImapMailFolder 000000000C2C2C00"

So the second one allocated doesn't get destroyed.

So further debugging shows that the second call is triggered from IMAPpump.js:96:
  imapAccount.incomingServer = IMAPPump.incomingServer;
and has this stack:
nsImapMailFolder::nsImapMailFolder() Line 212	C++
nsImapMailFolderConstructor(nsISupports * aOuter, const nsID & aIID, void * * aResult) Line 625	C++
RDFServiceImpl::GetResource(const nsACString_internal & aURI, nsIRDFResource * * aResource) Line 905	C++
nsImapMailFolder::AddSubfolderWithPath(nsAString_internal & name, ...) Line 371	C++
nsImapMailFolder::CreateClientSubfolderInfo(const nsACString_internal & folderName, ..) Line 941	C++
nsImapMailFolder::GetSubFolders(nsISimpleEnumerator * * aResult) Line 577	C++
nsMsgAccount::SetIncomingServer(nsIMsgIncomingServer * aIncomingServer) Line 157	C++

nsImapMailFolder.cpp around line 577 reads:
  GetFolderWithFlags(nsMsgFolderFlags::Inbox, getter_AddRefs(inboxFolder));
  if (!inboxFolder)
  {
    // create an inbox if we don't have one.
    CreateClientSubfolderInfo(NS_LITERAL_CSTRING("INBOX"), kOnlineHierarchySeparatorUnknown, 0, true); <=== 577
so it is the forcefully created inbox that leaks.

Not sure where it leaks, but I can't be far off now. Or maybe if doesn't leak and someone is just holding a reference to it.
Test failures due to this were last seen Mon Mar 6, 2017, 11:16:18
https://treeherder.mozilla.org/#/jobs?repo=comm-central&revision=8be1ae77d2644e401fa6480a05ff234e70d0ccdf
The next push which covered these tests in debug mode of Tue Mar 7, 2017, 20:55:54, didn't show the failures.

However, when run locally, databases still leak.

I think what has happened is that some change in JS improved the GC. Let's see:
Last bad:   M-C: 966464a68a2cb3ca1125808e34abb5c1d3
First good: M-C: 3d341b9ba5353b6b8ab45b6ca03dcb1b2d
https://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=966464a68a2cb3ca1125808e34abb5c1d3&tochange=3d341b9ba5353b6b8ab45b6ca03dcb1b2d

I can't see it. Tooru-san, are you aware of any changes that could have magically fixed this?
Flags: needinfo?(arai.unmht)
Whiteboard: [Thunderbird-testfailure: X all debug]
nope
but I guess it's not about JS GC, but maybe CC.

then, if it's about atom, bug 1344848 might be related.
Flags: needinfo?(arai.unmht)
Thanks, yes, bug 1344848.
After two months I tried to return to this bug. I've done a try run
https://treeherder.mozilla.org/#/jobs?repo=try-comm-central&revision=17aea81ba5aeacb5691746cf58e4b4536f49d5c3
with bug 1344848 (https://hg.mozilla.org/mozilla-central/rev/7818d19d769d) backed out.

Sadly the build fails in the on pre-compiling the startup cache:
Executing /builds/slave/tb-try-c-cen-l64-d-00000000000/build/objdir-tb/dist/bin/xpcshell -g /builds/slave/tb-try-c-cen-l64-d-00000000000/build/objdir-tb/dist/bin/ -a /builds/slave/tb-try-c-cen-l64-d-00000000000/build/objdir-tb/dist/bin/ -f /builds/slave/tb-try-c-cen-l64-d-00000000000/build/mozilla/toolkit/mozapps/installer/precompile_cache.js -e precompile_startupcache("resource://gre/");

We already have leaking atoms at that stage, pretty much like in bug 1334771 comment #1.

https://archive.mozilla.org/pub/thunderbird/try-builds/mozilla@jorgk.com-17aea81ba5aeacb5691746cf58e4b4536f49d5c3/try-comm-central-linux64-debug/try-comm-central-linux64-debug-bm78-try1-build307.txt.gz
names the culprit:

[20559] ###!!! ASSERTION: 1 dynamic atom(s) with non-zero refcount: [System Principal]: 'nonZeroRefcountAtomsCount == 0', file /builds/slave/tb-try-c-cen-l64-d-00000000000/build/mozilla/xpcom/ds/nsAtomTable.cpp, line 422

So "[System Principal]" is atomised but that atom is never destroyed. That looks like something caused by M-C.

Sorry to trouble you again, Boris, can you see where or who is leaking that atom? SYSTEM_PRINCIPAL_SPEC or "[System Principal]" don't seem to be widely used, certainly not in C-C.

Without fixing this I'm a little stuck here since I can't get a debug build and run tests in debug mode.
Flags: needinfo?(bzbarsky)
Looks like that string gets atomized in SystemPrincipal::Create by calling BasePrincipal::FinishInit which does:

  mOriginNoSuffix = NS_Atomize(aOriginNoSuffix);

So sounds like you're leaking the SystemPrincipal singleton.

Note that you should be able to get a debug build locally just fine; it's packaging that fails with this assertion failure.
Flags: needinfo?(bzbarsky)
(In reply to Boris Zbarsky [:bz] (still a bit busy) (if a patch has no decent message, automatic r-) from comment #10)
> Note that you should be able to get a debug build locally just fine; it's
> packaging that fails with this assertion failure.
Sure, I didn't express it properly. I wanted to do a debug try server build to run all xpcshell tests we have. Local debug builds work fine as you pointed out.
(In reply to Boris Zbarsky [:bz] (still a bit busy) from comment #10)
> So sounds like *you're* leaking the SystemPrincipal singleton.
I think this implies that C-C code is leaking this. Is this for certain?

For the record, the class to look for in leak debugging is nsJSPrincipals, so:
XPCOM_MEM_LEAK_LOG=1 XPCOM_MEM_LOG_CLASSES=nsJSPrincipals  <cmd that produces the leak>
and that gives (see attachment).
> I think this implies that C-C code is leaking this.

No, just that it's leaking in Thunderbird.  Maybe it's C-C code.  Maybe it's m-c code that is being used in some particular way in Tbird but not Firefox.  Maybe it's m-c code that screws up because it expects something to be present that's not present in tbird.  "You" in this case meant "that build you compiled", not "the Thunderbird project".
Anyway, that leak log shows 4 references outstanding, and a bunch of JS/xpconnect stuff leaked (backstagepass, nsXPCWrappedJS, XPCWrappedNative, etc).  The JS/xpconnect leaks are why the principal is leaking, since all that stuff would entrain it.
Hmm, so how to proceed? David Barron once pointed me to: http://www-archive.mozilla.org/performance/leak-tutorial.html

The attached log wasn't from the "pre-compiling the startup cache" (comment #9) but some other test.

Here's the report from the "pre-compiling the startup cache".

== BloatView: ALL (cumulative) LEAK STATISTICS, default process 2684

     |<----------------Class--------------->|<-----Bytes------>|<----Objects---->|
     |                                      | Per-Inst   Leaked|   Total      Rem|
   0 |TOTAL                                 |       45    10832|  252916      200|
   8 |BackstagePass                         |      104     1144|     916       11|
  20 |CondVar                               |       64      128|      19        2|
  41 |IdlePeriod                            |       24       24|       6        1|
  62 |Mutex                                 |       80      480|     111        6|
 105 |SharedScriptableHelperForJSIID        |       24       24|       1        1|
 128 |XPCNativeInterface                    |       56      560|    2389       10|
 129 |XPCNativeMember                       |       16      160|   53078       10|
 130 |XPCNativeSet                          |       32      288|    1485        9|
 131 |XPCWrappedNative                      |       96     2400|    9869       25|
 132 |XPCWrappedNativeProto                 |       40      480|    3786       12|
 134 |XPCWrappedNativeTearOff               |       32      864|   13729       27|
 152 |nsBayesianFilter                      |      248      248|       1        1|
 191 |nsIdleService                         |      104      104|       1        1|
 192 |nsIdleServiceDaily                    |      112      112|       1        1|
 193 |nsIdleServiceWin                      |      104      104|       1        1|
 198 |nsJSIID                               |       40       40|     994        1|
 199 |nsJSPrincipals                        |       24       24|       4        1| <===
 220 |nsMsgTag                              |       88      440|      25        5|
 221 |nsMsgTagService                       |       40       40|       1        1|
 240 |nsPrefBranch                          |      120      240|      20        2|
 258 |nsStopwatch                           |       80      160|       2        2|
 259 |nsStringBuffer                        |        8      192|   57935       24|
 267 |nsTArray_base                         |        8      176|   54740       22|
 268 |nsThread                              |      448      448|       5        1|
 271 |nsTimer                               |       32      160|      19        5|
 272 |nsTimerImpl                           |      232     1160|      19        5|
 293 |nsXPCWrappedJS                        |      120      120|     117        1|
 294 |nsXPCWrappedJSClass                   |       72       72|      18        1|
 299 |xptiInterfaceInfo                     |       40      440|     407       11|

nsTraceRefcnt::DumpStatistics: 300 entries

Serial Numbers of Leaked Objects:
1 @0000000000850780 (1 references)

So only one reference, but the same JS/xpconnect stuff is being leaked.
Right, so you want to see why you're leaking BackstagePass, XPCWrappedNative, etc.  A good place to start might be the 1 leaking nsXPCWrappedJS: that sounds like C++ is holding on to a JS-implemented XPCOM object, which is then entraining the rest of the stuff.
Hmm, I'm not very good at that voodoo stuff. Here's what I've done following the instructions:

XPCOM_MEM_REFCNT_LOG=$HOME/ref.log XPCOM_MEM_LOG_CLASSES=nsXPCWrappedJS <cmd>
python mozilla/tools/rb/find_leakers.py $HOME/ref.log
  That gave object no. 99. So:
XPCOM_MEM_LOG_OBJECTS=99 XPCOM_MEM_REFCNT_LOG=$HOME/ref.log  XPCOM_MEM_LOG_CLSES=nsXPCWrappedJS <cmd>
  That then gave: 000000000C8C5380 (1) @ <nsXPCWrappedJS>
perl make-tree.pl --ignore-balanced --object 000000000C8C5380 < ref.log > tree.log
For the record, make-tree.pl miserably dies if the input has CRLF line endings, so they need to be converted to LF first.

Result attached.

I think the instructions at http://www-archive.mozilla.org/performance/leak-tutorial.html are out of date, since XPCOM_MEM_COMPTR_LOG doesn't appear to work any more. Previously I was looking at https://developer.mozilla.org/en-US/docs/Mozilla/Performance/Refcount_tracing_and_balancing. Oh, that says that XPCOM_MEM_COMPTR_LOG doesn't work on all platforms, I wonder which ones they are :-(
grep comm-central tree.log | grep -v mozilla (gives nothing)
shows that all the code listed in the log is from comm-central/mozilla so M-C, so it may look like I'm leaking it personally with my build, but we (as the project) aren't leaking it ;-) - Happy to stand corrected.

BTW, when I last looked at this area two months ago, there was no [System Principal] atom leak. Looking at where this all started, bug 1334874 comment #9, back then there were plenty of leaked C-C objects, but none of that *XPC* stuff.

Sadly we don't have a good grasp on leaks in general, and as one can see, it took me two months to return to this :-( (fighting all sorts of fires in the meantime).
Boris, could you please take a look at the log file. Or should I ask the "memshrink" people (Andrew McCreight, Eric Rahm) for help?
Flags: needinfo?(bzbarsky)
> since XPCOM_MEM_COMPTR_LOG doesn't appear to work any more.

It certainly worked for me a few days ago, albeit on Mac...  I would assume it works on Linux too.  I can't speak to Windows.  But I expect it should work there too.

So looking at that balance tree, there are basically two things going on:

1)  The object involved got passed to nsIdleService::AddIdleObserver and then never removed.
2)  The object involved got passed to nsObserverService::AddObserver and then never removed,
    but looks like those references then went away via observer service shutdown
    dropping its refs.

OK, so that addIdleObserver call.  It's coming under nsXPCComponents_Utils::Import which is Components.utils.import.  In fact, the import() calls are nested several times.  It's possible that this addIdleObserver call is coming from m-c code.  It's also possible that it's coming from c-c code; there are various calls to addIdleObserver there.  There are even calls to both addIdleObserver and addObserver with the same object; for example see mailnews/db/gloda/modules/indexer.js, the "enabled" property setter.

It might be worth trying to comment out all the addIdleObserver calls and seeing whether that makes the leaks go away for you.  If it does, then uncomment them one by one until you find the one that causes the problem.  Or you could enable half, etc; then 4-5 tests should let you pin down the relevant call, given that I see only 17 calls to addIdleObserver in c-c and m-c together (6 in c-c, 11 in m-c).

Or you could just start with the indexer.js one, since at first glance it's more likely than the others: it's in a script that gets imported via Cu.import and it adds the object as both an idle observer and as an observer...

It's entirely possible that the right answer is "shutdown should just clear out the idle service, like we already do for the observer service".  It means leaks will get covered up, of course...
Flags: needinfo?(bzbarsky)
Depends on: 1364342
Thank you so much, Boris.

(In reply to Boris Zbarsky [:bz] (still a bit busy) from comment #20)
> Or you could just start with the indexer.js one, ...
I might start with mailnews/db/gloda/modules/indexer.js since Gloda has already been the cause of leaks during the "pre-compile" stage, see bug 1334771, when the assert on leaking atoms was first introduced.

I commented out
this._idleService.addIdleObserver(this, this._indexIdleThresholdSecs);
and
this._idleService.removeIdleObserver(this, this._indexIdleThresholdSecs);
and the leak is gone!

For the record (for myself) from bug 1334771 comment #12, this is the command:
$ obj-x86_64-pc-mingw32/dist/bin/xpcshell.exe -g obj-x86_64-pc-mingw32/dist/bin/ -a obj-x86_64-pc-mingw32/dist/bin/ -f mozilla/toolkit/mozapps/installer/precompile_cache.js -e 'precompile_startupcache("resource://gre/");'

I'll take this to bug 1364342.

Skipping to the last paragraph of your comment:
> It's entirely possible that the right answer is "shutdown should just clear
> out the idle service, like we already do for the observer service". 
> It means leaks will get covered up, of course...
That's the XPCOM shutdown?

I think we can just fix this is Gloda, no?
OK, I finally got a try run to complete.
https://treeherder.mozilla.org/#/jobs?repo=try-comm-central&revision=9d972b670bd4fb9c41de4be0879220d48981fbaa

As expected, still these four test failures:
mailnews/imap/test/unit/test_imapFilterActions.js
mailnews/imap/test/unit/test_imapFilterActionsPostplugin.js
mailnews/db/gloda/test/unit/test_index_junk_imap_offline.js
mailnews/imap/test/unit/test_imapFilterActionsPostplugin.js

Also four additional tests fail:
chat/protocols/irc/test/test_ircCommands.js
chat/protocols/irc/test/test_setMode.js
dom/presentation/tests/xpcshell/test_multicast_dns_device_provider.js
mailnews/imap/test/unit/test_dod.js

Looking at the log:
https://archive.mozilla.org/pub/thunderbird/try-builds/mozilla@jorgk.com-9d972b670bd4fb9c41de4be0879220d48981fbaa/try-comm-central-linux64-debug/try-comm-central_ubuntu64_vm-debug_test-xpcshell-1-bm53-tests1-linux64-build9.txt.gz
these four additional tests only leak one atom:
###!!! ASSERTION: 1 dynamic atom(s) with non-zero refcount: [System Principal]

Of these, dom/presentation/tests/xpcshell/test_multicast_dns_device_provider.js is purely M-C, so there is definitely a problem in M-C code.
> That's the XPCOM shutdown?

Yes. 

> I think we can just fix this is Gloda, no?

That would be preferable.

Looking at your try run, I'm not sure what problem you're trying to solve.  It looks like you're backing out the patch for bug 1344848?  Why?  Obviously any run that leaks things in general can leak atoms too, and per that bug various xpcshell tests do in fact have leaks.
(In reply to Boris Zbarsky [:bz] (still a bit busy) from comment #23)
> Looking at your try run, I'm not sure what problem you're trying to solve. 
> It looks like you're backing out the patch for bug 1344848?  Why?
Well, when bug 1276669 first introduced the atom leak check, we had 14 crashing tests since we were leaking nsMsgDBFolder objects. The DTOR of the last folder object is meant to destroy the atoms we allocate.

We found various leaks and fixed 10 tests (in various bugs). Four test still leak folders and hence atoms. However, they don't crash any more since the assert was taken out in bug 1344848.

Since I didn't have time for this for two month, I now simply wanted to do a debug try run to see what the current leak situation is. The summary is in comment #22. The four original problems are still there, plus we have four more tests leaking a single atom for the system principal. And, sadly, the debug build dies in the packaging stage due to that same atom leaking in the pre-compile. But that's really a side issue. Only because you work in DOM security, I asked you for help with that side issue.

I will now focus on the four tests again, since we should plug the leaking folders and hence databases as they contribute to bug 1342858.

You can safely un-CC yourself, I'll NI you again should I have another question. Thanks as always for your help and also the comment made in bug 1364342.
Comment on attachment 8841360 [details] [diff] [review]
1342745-misc-fixes.patch - misc. fixes

Moved to bug 1386220.
Attachment #8841360 - Attachment is obsolete: true
Comment on attachment 8842019 [details] [diff] [review]
1342745-addref-release-nsImapMailFolder.patch - removing addref/release

Moved to bug 1386220.
Attachment #8842019 - Attachment is obsolete: true
nsIAtom got removed from C-C, so there is nothing left to fix here.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → WONTFIX
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: