Closed Bug 1334874 Opened 7 years ago Closed 7 years ago

Fix fallout (failing tests) from bug 1276669 and bug 1334558 in mailnews (fixed 8 of 14). Remaining fallout in bug 1337865 (6 of 14)

Categories

(MailNews Core :: Database, defect)

defect
Not set
normal

Tracking

(thunderbird_esr5253+ fixed, thunderbird53 fixed, thunderbird54 fixed)

RESOLVED FIXED
Thunderbird 54.0
Tracking Status
thunderbird_esr52 53+ fixed
thunderbird53 --- fixed
thunderbird54 --- fixed

People

(Reporter: jorgk-bmo, Assigned: aleth)

References

Details

(Whiteboard: [Thunderbird-testfailure: X all debug])

Attachments

(7 files, 9 obsolete files)

1.16 KB, patch
mkmelin
: review+
Details | Diff | Splinter Review
1.75 KB, patch
mkmelin
: review+
Details | Diff | Splinter Review
1.47 KB, patch
jorgk-bmo
: review+
Details | Diff | Splinter Review
990 bytes, patch
Details | Diff | Splinter Review
2.52 KB, patch
aleth
: review+
aleth
: feedback+
Details | Diff | Splinter Review
4.31 KB, text/plain
Details
7.98 KB, patch
aceman
: review+
Details | Diff | Splinter Review
+++ This bug was initially created as a clone of Bug #1334834 +++

Traditionally TB hasn't been very good at releasing resources, as reported in bug 1223574 about leaked JS runtimes.

With bug 1276669 landed, the system now also complains about leaked dynamic atoms, and to add insult to injury, we are using more dynamic atoms since bug 1334558 (where we switched 21 strings from static atoms to dynamic atoms). Those dynamic atoms should be released when the last nsMsgDBFolder instance is destroyed, but that never happens as traditionally those objects are left lying around. Switching from atoms to strings in bug 1334834 will solve the leaked-atoms-complaint, but not its root cause, the leaked nsMsgDBFolder objects.

But there are many other leakages that can be inspected in bug 1334771 comment #0. Atoms leaked include "stringBundleBindings" and "key_previousMsg" which are not atoms TB defines. Read the disconcerting bug 1334771 comment #20.

We will see whether this situation improves when M-C fix bug 1303637, since the FF profile manager is also leaking and might or might not suffer from the same problems we suffer from in TB.

So far the history.

This bug here is just to collect the big bucket of problems which have now come to bite us finally. We might decide to solve this in parts or spin off other bugs and make this one here a meta-bug.

Read bug 1334834 comment #1 to bug 1334834 comment #7 for some initial investigation.

There has always been the somewhat disconcerting message:
WARNING: some msg dbs left open: '!m_dbCache.Length()', file ... /mailnews/db/msgdb/src/nsMsgDatabase.cpp, line 85
which may or may not have anything to do with it.
Summarising some of the findings from bug 1334834 and bug 1334771:

Aleth has been looking at mailnews/local/test/unit/test_folderLoaded.js which apparently leaks 1 of the 5 nsMsgDBFolder objects it creates. Apparently commenting out line #71:
  gTargetFolder.msgDatabase = null;
makes the test pass but instead leaks 4 of 5 objects. Note that that assignment triggers nsMsgDBFolder::SetMsgDatabase().
14 failing xpcshell tests in total:

PROCESS-CRASH | mailnews/imap/test/unit/test_imapFilterActions.js
PROCESS-CRASH | mailnews/imap/test/unit/test_imapFilterActionsPostplugin.js
PROCESS-CRASH | mailnews/imap/test/unit/test_imapFilterActions.js
PROCESS-CRASH | mailnews/news/test/unit/test_server.js
PROCESS-CRASH | mailnews/base/test/unit/test_nsMsgDBView.js
PROCESS-CRASH | mailnews/base/test/unit/test_junkingWhenDisabled.js
PROCESS-CRASH | mailnews/compose/test/unit/test_autoReply.js 
PROCESS-CRASH | mailnews/db/gloda/test/unit/test_index_junk_imap_online.js
PROCESS-CRASH | mailnews/db/gloda/test/unit/test_index_junk_imap_offline.js
PROCESS-CRASH | mailnews/imap/test/unit/test_imapFilterActionsPostplugin.js
PROCESS-CRASH | mailnews/local/test/unit/test_folderLoaded.js
PROCESS-CRASH | mailnews/local/test/unit/test_pop3FilterActions.js 
PROCESS-CRASH | mailnews/local/test/unit/test_pop3MoveFilter.js
PROCESS-CRASH | mailnews/news/test/unit/test_internalUris.js

Most likely there are a few common causes to this so we should fix the shortest test first ;-)
I've re-done the testing that Aleth has done.

Was he wrote elsewhere, running test_folderLoaded.js creates 5 nsMsgDBFolder objects and releases 4, so the test fails since the atoms don't get released that should be released in the destructor of nsMsgDBFolder on destroying the last instance.

Commenting out line 71 |gTargetFolder.msgDatabase = null;| will destroy all 5 instances, release the atoms and make the test pass.

So maybe we should take a look at what nsMsgDBFolder::SetMsgDatabase(nullptr) does ;-)
Attached patch nsMsgDBFolder-debug.patch (obsolete) — Splinter Review
Kent, can you help me here. Observed behaviour is that nsMsgDBFolder objects leak and therefore atoms leak since these are released when the last instance of nsMsgDBFolder gets destroyed.

I appears that a nsMsgDBFolder object has a mDatabase member and that nsMsgDatabase has an m_folder member. Is this 1:1? Since that COM-pointer stuff all works with ref-counts, the object pointed to will only be destroyed if its reference count goes down to zero.

Now this classA points to classB and classB points to classA scenario is the perfect way to get those references muddled up.

Take this concrete case: There is nsMsgDBFolder::SetMsgDatabase() that can set a new database or release the database (by passing null) of a folder. So the database previously pointed to gets its reference count decreased. But what happens to the folder it pointed to? And what does it mean to assign a new database to a folder?

In the attached patch I try to ensure the 1:1 relationship, but the test fails just the same.

I apologise for my ignorance, maybe I misunderstood it all.
Flags: needinfo?(rkent)
Let's revisit this issue after you do the changes suggested in my r- of bug 1334558.
Flags: needinfo?(rkent)
I don't think I will do any more changes in bug 1334558 and you should just put your r+ onto that. In fact, as I said over there, the way it's done guarantees that we *must* fix the leaked folders now which we have ignored for years.

Kindly answer my questions:

In the nsMsgDBFolder <--> mDatabase relationship, is the relationship meant to me 1:1?
What does it mean to assign a new database to a folder?
What happens to the folder the previous database pointed to?
What does it mean to clear the database of a folder?
If we're not very careful, if a database keeps pointing to a folder, then that folder can never die.
And if one never dies, the atoms never get released.

But there's good news. If I start my test environment, I get 109 folder objects allocated, and as long as I close TB right away and don't do any manual operations, all 109 folder objects get destroyed and the atoms released. In this case we also do *not* leak any JS runtimes.

So I believe cleaning up our act on those folder objects is the first step in the clean-up. So please help.
Flags: needinfo?(rkent)
And let's attribute the merit to Aleth who said in bug 1334834 comment #5:
===
My *guess* is that as nsMsgDatabase contains a folder attribute, it retains a reference to the folder, and while the folder may forget about the database, the database retains the reference to the folder and that's the problem.
===
On the issue of leaked JS runtimes, Aleth posted this to bug 1303637 comment #6:
===
There are previous shutdown leak fixes even in c-c if you are looking for potentially useful examples, e.g. bug 765074, bug 764742, bug 955563, bug 955236.
===
As I said in comment #6, the JS runtime leak doesn't always occur, and I haven't worked out what operation triggers it. But obviously we have a lot of expertise in fixing those leaks. With the leaking atoms we also have enough pressure to do something about it ;-(

I still think we should tackle the leaked nsMsgDBFolder objects first, to exclude them as the source of the problem. We can always fix the leaked runtimes in bug 1223574.
Summary: Fix fallout (failing tests) from bug 1276669 and bug 1334558 in mailnews → Fix fallout (failing tests) from bug 1276669 and bug 1334558 in mailnews: Leaked nsMsgDBFolder objects
It seems it's not just the folder object that is the leak, that's just one of the symptoms. Using test_folderLoaded.js as an example, with the "offending" line commented out (test pass) it leaks

 0:00.82 PROCESS_OUTPUT: Thread-1 (pid:62192) "== BloatView: ALL (cumulative) LEAK STATISTICS, default process 62192"
 0:00.82 PROCESS_OUTPUT: Thread-1 (pid:62192) "     |<----------------Class--------------->|<-----Bytes------>|<----Objects---->|"
 0:00.82 PROCESS_OUTPUT: Thread-1 (pid:62192) "     |                                      | Per-Inst   Leaked|   Total      Rem|"
 0:00.82 PROCESS_OUTPUT: Thread-1 (pid:62192) "   0 |TOTAL                                 |       52      208|   23304        3|"
 0:00.82 PROCESS_OUTPUT: Thread-1 (pid:62192) " 171 |nsLocalFile                           |      192      192|    1299        1|"
 0:00.82 PROCESS_OUTPUT: Thread-1 (pid:62192) " 231 |nsStringBuffer                        |        8        8|    8892        1|"
 0:00.82 PROCESS_OUTPUT: Thread-1 (pid:62192) " 238 |nsTArray_base                         |        8        8|    3206        1|"
 
but with the database=null left in, we see

 0:00.91 PROCESS_OUTPUT: Thread-1 (pid:62128) "== BloatView: ALL (cumulative) LEAK STATISTICS, default process 62128"
 0:00.91 PROCESS_OUTPUT: Thread-1 (pid:62128) "     |<----------------Class--------------->|<-----Bytes------>|<----Objects---->|"
 0:00.91 PROCESS_OUTPUT: Thread-1 (pid:62128) "     |                                      | Per-Inst   Leaked|   Total      Rem|"
 0:00.91 PROCESS_OUTPUT: Thread-1 (pid:62128) "   0 |TOTAL                                 |       52     3360|   23374       66|"
 0:00.91 PROCESS_OUTPUT: Thread-1 (pid:62128) "  33 |GenericFactory                        |       32       64|      94        2|"
 0:00.91 PROCESS_OUTPUT: Thread-1 (pid:62128) "  46 |MsgDBReporter                         |       32       32|      16        1|"
 0:00.91 PROCESS_OUTPUT: Thread-1 (pid:62128) "  60 |RDFServiceImpl                        |      320      320|       1        1|"
 0:00.91 PROCESS_OUTPUT: Thread-1 (pid:62128) " 102 |morkEnv                               |      136      136|      17        1|"
 0:00.91 PROCESS_OUTPUT: Thread-1 (pid:62128) " 103 |morkFactory                           |      232      232|       1        1|"
 0:00.91 PROCESS_OUTPUT: Thread-1 (pid:62128) " 105 |morkObject                            |       72      144|     156        2|"
 0:00.91 PROCESS_OUTPUT: Thread-1 (pid:62128) " 131 |nsCollationMacUC                      |       56       56|       1        1|"
 0:00.91 PROCESS_OUTPUT: Thread-1 (pid:62128) " 171 |nsLocalFile                           |      192      192|    1303        1|"
 0:00.91 PROCESS_OUTPUT: Thread-1 (pid:62128) " 185 |nsMsgDBFolder                         |      488      488|      11        1|"
 0:00.91 PROCESS_OUTPUT: Thread-1 (pid:62128) " 187 |nsMsgDatabase                         |      392      392|      16        1|"
 0:00.91 PROCESS_OUTPUT: Thread-1 (pid:62128) " 194 |nsMsgKeySet                           |       24      288|     132       12|"
 0:00.91 PROCESS_OUTPUT: Thread-1 (pid:62128) " 195 |nsMsgLocalMailFolder                  |      576      576|      11        1|"
 0:00.91 PROCESS_OUTPUT: Thread-1 (pid:62128) " 198 |nsMsgRetentionSettings                |       48       48|       3        1|"
 0:00.91 PROCESS_OUTPUT: Thread-1 (pid:62128) " 216 |nsRDFResource                         |       48       48|      14        1|"
 0:00.91 PROCESS_OUTPUT: Thread-1 (pid:62128) " 231 |nsStringBuffer                        |        8      208|    8956       26|"
 0:00.91 PROCESS_OUTPUT: Thread-1 (pid:62128) " 238 |nsTArray_base                         |        8       96|    3203       12|"
 0:00.91 PROCESS_OUTPUT: Thread-1 (pid:62128) " 246 |nsWeakReference                       |       40       40|      33        1|"

so it seems it might be leaking something which itself holds a whole lot of references. Maybe someone more experienced with mailnews/ can spot the likely candidate.
Hmm, or it's just leaking e.g. a folder or database, but that reference means the whole incomingServer structure isn't released... :-/
I'd say, as you said in comment #10, that it's all connected to the folder/database. The mork stuff hangs off the database, for sure. I'm not sure where you see the incomingServer though.
(In reply to Jorg K (GMT+1) from comment #11)
> I'd say, as you said in comment #10, that it's all connected to the
> folder/database. The mork stuff hangs off the database, for sure. I'm not
> sure where you see the incomingServer though.

You're right, no nsMsgIncomingServer, but I thought the nsRDFResource might be its server resource, aka root folder.
If you are hunting why folders/dbs are still referenced, we noticed something like that in bug 1135310. There is a cache of folders/dbs kept open. When we evict a database from the cache, the folder's db is set to null. But we observed the DB is not really destroyed after that.
Only if you open a new window suddenly the DBs objects are destroyed.
Hmm, since you added this comment:
https://hg.mozilla.org/comm-central/rev/ec7886c51421#l1.53
* The attribute can also be set to another database or to null to force the
* folder to reopen the same database when it is needed again.
you might know more about it then I do. See questions to Kent in comment #6:

In the nsMsgDBFolder <--> mDatabase relationship, is the relationship meant to me 1:1?
What does it mean to assign a new database to a folder?
What happens to the folder the previous database pointed to?
What does it mean to clear the database of a folder?
If we're not very careful, if a database keeps pointing to a folder, then that folder can never die.
And if one never dies, the atoms never get released.

So looking at this briefly, there seems to be a cache manager that will close open databases. Maybe so there aren't too many open file handles?

Seems they you "close" the databases be setting the reference in the folder that references them to null:
https://hg.mozilla.org/comm-central/rev/ec7886c51421#l2.96
db.folder.msgDatabase = null;

Why don't you then also null out the reference to the folder by going:
db.folder = null?

In this bug we're looking at the problem that nsMsgDBFolder objects never die since someone is still holding a reference to them.
OK, now also nulling the folder reference in the database. Makes no difference, which is no surprise since I've already changed nsMsgDBFolder::SetMsgDatabase() to do that anyway.

To answer my own question:
> Why don't you then also null out the reference to the folder by going:
> db.folder = null?

That attribute used to be readonly ;-)

The hunt for references to nsMsgDBFolder objects continues.
Attachment #8831571 - Attachment is obsolete: true
OK I spent an hour or so in the debugger and looking at the test, making sure that I understand how things are supposed to work. So let me answer your questions as I understand them:

In the nsMsgDBFolder <--> mDatabase relationship, is the relationship meant to me 1:1?

Yes.

What does it mean to assign a new database to a folder?

Databases are not supposed to be open (meaning referenced in nsMsgDBFolder) for every folder that exists, as a database takes up significant memory. Typically folders are long-lived, but databases are not. If the database goes away, then when it is referenced again, a new database object is created and assigned to the folder if it is referenced again.

What happens to the folder the previous database pointed to?

Databases go away, but folders live. Normally folders are only destroyed at shutdown by a shutdown listener (they are owned by the server, and the servers are owned by the account manager). This shutdown listener nulls the folder database reference, which breaks the folder<-->database reference cycle and is supposed to make the database go away. Your problem is that the database is not going away, and therefore the folder cannot go away.

What does it mean to clear the database of a folder?

Remove the reference to the database in the folder, which is supposed to make the database go away, "closing" the database and freeing its (substantial) memory. (unless someone else is holding onto a reference of the database, which causes havoc).

If we're not very careful, if a database keeps pointing to a folder, then that folder can never die.
And if one never dies, the atoms never get released.

I think that you are looking at it backwards. The design here is that the only remaining reference to the database is supposed to be in the folder, so when the folder nulls the database, the database goes away and then (in the shutdown listener) the folder also gets destroyed. The problem is that something is holding onto a database reference.

But let's put this all in perspective. We have had leaking databases at shutdown for a long time. This is not good, because then the database does not get reliably closed. But shouldn't we be deciding our priorities here? Leaking atoms is a really, really unimportant problem in the grand scale of things. It is only the $*%*(# shutdown assert which is the problem. This is why I try to be cautious with asserts, you are forcing your priorities on others when you add them.
Flags: needinfo?(rkent)
Thanks for you comment, much appreciated.

> This is not good, because then the database does not get reliably closed.
So I think we should be looking into this. The atoms business has just given us some added incentive. Obviously the database issue needed some push to get addressed because no one ever cared enough to address it. The tree being red now due to this is a very visible reminder.
(In reply to Kent James (:rkent) from comment #16)
> I think that you are looking at it backwards. The design here is that the
> only remaining reference to the database is supposed to be in the folder, so
> when the folder nulls the database, the database goes away and then (in the
> shutdown listener) the folder also gets destroyed. The problem is that
> something is holding onto a database reference.
Backwards or forwards, well, at least someone is looking at the problem ;-)

If I understand it correctly, there is some funny caching mechanism (checkCachedDBs in msgDBCacheManager.js) that keeps track of open databases and closes some if there are too many open ones.

If the folder abandons its database by nulling the reference, who tells the cache to release its reference to the database as well so the database can really go away? If you only have a few folders, like in this test, the "close some open databases" will never kick in.

Also, if the relationship is 1:1 the database should also null its reference to the folder. Well, as my WIP patch shows, that has no effect, but I still think it would be better for good house-keeping.
(In reply to Jorg K (GMT+1) from comment #18)
> (In reply to Kent James (:rkent) from comment #16)
> > I think that you are looking at it backwards. The design here is that the
> > only remaining reference to the database is supposed to be in the folder, so
> > when the folder nulls the database, the database goes away and then (in the
> > shutdown listener) the folder also gets destroyed. The problem is that
> > something is holding onto a database reference.
> Backwards or forwards, well, at least someone is looking at the problem ;-)
> 
> If I understand it correctly, there is some funny caching mechanism
> (checkCachedDBs in msgDBCacheManager.js) that keeps track of open databases
> and closes some if there are too many open ones.
> 
> If the folder abandons its database by nulling the reference, who tells the
> cache to release its reference to the database as well so the database can
> really go away? If you only have a few folders, like in this test, the
> "close some open databases" will never kick in.

I am sorry that I cannot offer any detailed insight, but the following bug suggests that the cache is not handled properly even on the folder removal/purging. The old cache for a removed/purged folder content lives on (until later refresh caused probably by internal sanity check?): 

Bug 1153720 - Stale cached No. of messages(?): delete a folder, empty the trash bin, create a folder with the deleted folder's name, the the new folder is displayed as having messages (# of messages in the deleted folder).

So there may be an angle to look at the issue from that issue, too.
Removing/purging a folder certainly closes the reference to a folder (and database presumably) but it seems that cache is not operated upon properly then. 


> Also, if the relationship is 1:1 the database should also null its reference
> to the folder. Well, as my WIP patch shows, that has no effect, but I still
> think it would be better for good house-keeping.

Agreed.

It hurts that no one seems to be able to answer/confirm this validity of 1:1 mapping 
in a couple of day's time...
OK, this is the burden of legacy code, I suspect.


One other thing: I see the following lines (similar lines at each shutdown of TB process during testing by |make mozmill|)
Leaked URLs:
  chrome://editor/content/editorMailOverlay.xul
  chrome://gloda/content/thunderbirdOverlay.xul
  chrome://messenger-smime/content/msgReadSMIMEOverlay.xul
  chrome://messenger/content/about-support/overlay.xul
  chrome://messenger/content/chat/chat-messenger-overlay.xul
  chrome://messenger/content/search/searchOverlay.xul
  chrome://messenger-smime/content/msgHdrViewSMIMEOverlay.xul
  ... total 788 URLs ...
  chrome://messenger/content/addressbook/addrbookWidgets.xml#map-list
  chrome://global/content/bindings/popup.xml#popup
  chrome://global/content/bindings/popup.xml#popup
  chrome://global/content/bindings/menu.xml#menu-iconic
  chrome://messenger/content/tabmail.xml#tabmail-alltabs-popup
  chrome://global/skin/icons/loading.png
  chrome://global/skin/tree/twisty-clsd.png

The above list is printed at the end of my |make mozmill| run. YMMV since I changed the order of executed tests somewhat in the past.

Given that those atoms (or strings once converted) seem to be used for storing the lexical element of .xml files and such for parsing purposes, unless these still-open files are closed (and related atoms are released based on in which file it is used, etc. [we may even need refcounting if the same "string" is used in  many files, which seems to be the case.]).
Or maybe I am barking up the wrong tree.

At least, when I check what the atom (or strings) are left lying around by searching through the |make mozill| log, I found many of them are used in the XUL files via DXR search, but there *are* strings that I cannot find from DXR search alone, and seem to be generated at run-time or something like that.

BTW, where on earth is the function that is invoked at the time of closing TB?
I mean that function *ought* to call many cleanup functions, correct?
Or is it like the subsystems/modules whatever install such functions as callback that is invoked 
at the time of shutting down (but then how is the ordering of shutdown is assured?). Inquiring mind wants to know.
I never figured out this exactly when I was disturbed at the error messages spewed out in the log of |make mozmill| using DEBUG-build of C-C TB.

At least, the above list of "Leaked URLs" gets printed at the time of object/class shutdown, it seems:
https://dxr.mozilla.org/mozilla-central/search?q=Leaked+URLs&redirect=false
But I am not entirely sure how all the classes are destructed. Is there a single function to do that?
(And that function is supposed to additionally call all the [dynamically] loaded .js files and xpcom functions, etc.?)

TIA
I do support Jorg's effort here. If we do not release folders and databases properly it is a bit problem. The whole DB cache management it only half working if we just mark databases for deallocation, but they are never really destroyed, freed from memory. We do have reports of too high memory use, too many file descriptors open, which can be caused by this problem.
Comment on attachment 8831832 [details] [diff] [review]
WIP: nsMsgDBFolder-debug.patch (v2)

Review of attachment 8831832 [details] [diff] [review]:
-----------------------------------------------------------------

::: mailnews/base/util/msgDBCacheManager.js
@@ +131,5 @@
>        if (db.lastUseTime < closeThreshold)
>        {
>          log.debug("closing expired msgDatabase for folder: " + db.folder.name);
>          db.folder.msgDatabase = null;
> +        db.folder = null;

Are these needed if you set the folder to null inside the nsMsgDBFolder::SetMsgDatabase() ?

::: mailnews/base/util/nsMsgDBFolder.cpp
@@ +947,5 @@
>          m_newMsgs.Clear();
>          m_newMsgs.AppendElements(newMessageKeys, numNewKeys);
>        }
> +      if (newMessageKeys)
> +        NS_Free(newMessageKeys);

Doesn't NS_Free handle nullptr gracefully?

::: mailnews/db/msgdb/src/nsMsgDatabase.cpp
@@ +1507,5 @@
>  }
>  
> +NS_IMETHODIMP nsMsgDatabase::SetFolder(nsIMsgFolder *aFolder)
> +{
> +  m_folder = aFolder;

What about NS_IF_RELEASE(m_folder) first?
(In reply to :aceman from comment #21)
Look, the patch was just a first attempt to debug things, take it as some random ramblings. But the prints were first class ;-)

> Are these needed if you set the folder to null inside the
> nsMsgDBFolder::SetMsgDatabase() ?
Not needed, I said so in comment #15.

> Doesn't NS_Free handle nullptr gracefully?
When I looked, NS_Free==free, but there are lots of occurrences of exactly the same code. So this hunk is also not needed. 

> What about NS_IF_RELEASE(m_folder) first?
Hmm, as I understand refcounting, assigning a new value to a pointer should decrease the refcount in the object previously referenced.

Let's forget my patch. The problem is most likely much more subtle and hidden. Maybe Aleth's patch is on the right way.
Comment on attachment 8832133 [details] [diff] [review]
WIP fix for leak detected by test_folderLoaded.js

Review of attachment 8832133 [details] [diff] [review]:
-----------------------------------------------------------------

::: mailnews/local/src/nsLocalMailFolder.cpp
@@ +431,5 @@
>      // return of NS_ERROR_NOT_INITIALIZED means running parsing URL
> +    // We don't need the return value, and assigning it to mDatabase which
> +    // is already set internally leaks.
> +    nsCOMPtr<nsIMsgDatabase> db;
> +    rv = GetDatabaseWithReparse(this, aWindow, getter_AddRefs(db));

Ingenious! How did you find that?
(In reply to Jorg K (GMT+1) from comment #25)
> Ingenious! How did you find that?

Using the tools from comment 23 ;) You get refcount trees that you can then investigate, diff between runs, etc.
(In reply to aleth [:aleth] from comment #26)
> Using the tools from comment 23 ;) You get refcount trees that you can then
> investigate, diff between runs, etc.
Great that you had time and energy to investigate. You're moving on to bug 1223574 next ;-) Or does this patch here fix that one as well?

I, on the other hand, spend much of my day triaging old and new bugs to the point where a new BMO member asked for my resignation ;-) - Bug 1334946 comment #2 - after I closed his bug as duplicate.
Attachment #8831832 - Attachment is obsolete: true
(In reply to Jorg K (GMT+1) from comment #27)
> (In reply to aleth [:aleth] from comment #26)
> > Using the tools from comment 23 ;) You get refcount trees that you can then
> > investigate, diff between runs, etc.
> Great that you had time and energy to investigate. You're moving on to bug
> 1223574 next ;-) Or does this patch here fix that one as well?

I doubt that patch fixes even this bug. It fixes that one test, I haven't investigated further.
(In reply to aleth [:aleth] from comment #28)
> I doubt that patch fixes even this bug. It fixes that one test, I haven't
> investigated further.
Sadly so, I ran
mailnews/base/test/unit/test_nsMsgDBView.js
mailnews/local/test/unit/test_pop3FilterActions.js 
mailnews/local/test/unit/test_pop3MoveFilter.js
mailnews/news/test/unit/test_internalUris.js
locally and they still fail. You'll see it in your try results any minute now. But anyway, if you already mastered the tools from comment #23, the rest should be easy ;-)
13 tests still fail :-(
But did it improve any number of tests that no longer report the leak?
Yes, one. Without the patch 14 fail, with the patch 13 fail.
(In reply to :aceman from comment #31)
> But did it improve any number of tests that no longer report the leak?

The other tests don't use that code path. It seems likely that e.g. the various failing filter tests have a common origin though, so I doubt all 14 tests are independent leaks.
Maybe you want to educate us how to uses these tools ... or do it all yourself. I wanted to retrace your steps for this test, which seems to be the easiest one. So I typed:

$ XPCOM_MEM_LOG_CLASSES=nsMsgDatabase  XPCOM_MEM_REFCNT_LOG=refcounts.log mozilla/mach xpcshell-test mailnews/local/test/unit/test_folderLoaded.js

I move the refcounts.log for easy access and then typed:

$ python mozilla/tools/rb/find_leakers.py refcounts.log
0000000000A1FDA0 (16) @ <nsMsgDatabase>

Next:

$ perl -w mozilla/tools/rb/make-tree.pl --ignore-balanced --object 0000000000A1FDA0 < refcounts.log xx.log

And the output of that is just illegible.

I also tried test_nsMsgDBView.js and here the "find_leakers.py" already gave four leaked nsMsgDatabase objects.

make-tree.pl crashes here with "out of memory".
I got rid of the atom ASSERT to avoid crashing before the refcount logs can be written.

Together with XPCOM_MEM_LOG_CLASSES=nsMsgDatabase, I used XPCOM_MEM_LOG_OBJECTS to pick the specific object to follow (you get the number from the leak log), that cuts down on the amount of output that the scripts have to parse. I then ran make-tree.pl as you did. I looked at the tree part of the output and I guess if you know the code well or step through with a debugger, you can see how it maps onto what the test is doing, but for me it was a bit much to wade through.

So I commented out the line in the test that got rid of the leak, ran all the steps again outputting to different files, and then looked at a diff of the two trees. That worked very nicely as it revealed the sections due to the extra line. The mdn page has an example which is pretty close to what happened in that test (a suspicious increase by 1 in the number of references in UpdateFolder compared to the GetDatabaseWithReparse which it calls). So I looked at that bit of code and there was the leak.

I don't think there's any guaranteed way to do this, the same method may not help in other cases.
(In reply to aleth [:aleth] from comment #35)
> I got rid of the atom ASSERT to avoid crashing before the refcount logs can
> be written.
So you're saying that my log was incomplete?

> Together with XPCOM_MEM_LOG_CLASSES=nsMsgDatabase, I used
> XPCOM_MEM_LOG_OBJECTS to pick the specific object to follow (you get the
> number from the leak log), ...
I don't follow that. The leak log comes from find_leakers.py which processes the log you created before. So you're saying you reran the test a second time, this time with XPCOM_MEM_LOG_OBJECTS=xx, where xx was obtained from the previous run? So in my example above xx==16?
(In reply to Jorg K (GMT+1) from comment #36)
> (In reply to aleth [:aleth] from comment #35)
> > I got rid of the atom ASSERT to avoid crashing before the refcount logs can
> > be written.
> So you're saying that my log was incomplete?

No, I'm just telling you what I did, as you asked.
 
> > Together with XPCOM_MEM_LOG_CLASSES=nsMsgDatabase, I used
> > XPCOM_MEM_LOG_OBJECTS to pick the specific object to follow (you get the
> > number from the leak log), ...
> I don't follow that. The leak log comes from find_leakers.py which processes
> the log you created before. So you're saying you reran the test a second
> time, this time with XPCOM_MEM_LOG_OBJECTS=xx, where xx was obtained from
> the previous run? So in my example above xx==16?

Right. I didn't find find_leakers.py useful, setting XPCOM_MEM_LEAK_LOG provides that info already.
Aleth, do you intend to continue here? Maybe we should split the tests in order to avoid double work. test_server.js and test_internalUris.js are smaller tests which we should look at next.
(In reply to Jorg K (GMT+1) from comment #38)
> Aleth, do you intend to continue here? Maybe we should split the tests in
> order to avoid double work. test_server.js and test_internalUris.js are
> smaller tests which we should look at next.

I started looking at test_autoreply the other evening.
Comment on attachment 8832586 [details] [diff] [review]
WIP fix for leak detected by test_autoReply.js

Another great find!

Do you understand this code? Why doesn't that just go:
nsCOMPtr<nsMsgTemplateReplyHelper> helper;

I suppose it's just some empty "frame" whose members get filled with stuff the object doesn't own, like:
  helper->mHdrToReplyTo = aMsgHdr;
  helper->mMsgWindow = aMsgWindow;
  helper->mIdentity = identity;

You added NS_RELEASE() to some error paths, but were does it get released in the non-error processing?
And why is that NS_ADDREF(helper); needed in the first place?

Note: Bug 1276669 requested Aurora uplift, so we'll be busted there too soon :-(
(In reply to Jorg K (GMT+1) from comment #41)
> Do you understand this code? Why doesn't that just go:
> nsCOMPtr<nsMsgTemplateReplyHelper> helper;
Oops, make that RefPtr.

I tried
  RefPtr<nsMsgTemplateReplyHelper> helper = new nsMsgTemplateReplyHelper;
  if (!helper)
    return NS_ERROR_OUT_OF_MEMORY;
(without the NS_ADDREF(helper)), but that gives a crash. I haven't checked where. Maybe this doesn't work out:

  nsCOMPtr<nsISupports> listenerSupports;
  helper->QueryInterface(NS_GET_IID(nsISupports), getter_AddRefs(listenerSupports));
Comment on attachment 8832586 [details] [diff] [review]
WIP fix for leak detected by test_autoReply.js

Review of attachment 8832586 [details] [diff] [review]:
-----------------------------------------------------------------

::: mailnews/compose/src/nsMsgComposeService.cpp
@@ +918,5 @@
>    aMsgHdr->GetStringProperty("replyTo", getter_Copies(replyTo));
>    if (replyTo.IsEmpty())
>      aMsgHdr->GetAuthor(getter_Copies(replyTo));
> +  if (replyTo.IsEmpty()) {
> +    NS_RELEASE(helper);

just make |helper| a nsCOMPtr at [1].  It will release automatically if there is no reference added to it before this point.

[1] https://dxr.mozilla.org/comm-central/rev/756fa9e390ab08f995382c9a9dfc6bf4b72421ca/mailnews/compose/src/nsMsgComposeService.cpp#907
Attachment #8832586 - Flags: review-
(In reply to Jorg K (GMT+1) from comment #42)
> (In reply to Jorg K (GMT+1) from comment #41)
> > Do you understand this code? Why doesn't that just go:
> > nsCOMPtr<nsMsgTemplateReplyHelper> helper;
> Oops, make that RefPtr.
> 
> I tried
>   RefPtr<nsMsgTemplateReplyHelper> helper = new nsMsgTemplateReplyHelper;
>   if (!helper)
>     return NS_ERROR_OUT_OF_MEMORY;
> (without the NS_ADDREF(helper)), but that gives a crash. 

Then the problem is elsewhere, probably some misplaced NS_RELEASE or something.  

You don't need to check for non-nullness.  the |new| operator is infallible.

> I haven't checked
> where. Maybe this doesn't work out:
> 
>   nsCOMPtr<nsISupports> listenerSupports;
>   helper->QueryInterface(NS_GET_IID(nsISupports),
> getter_AddRefs(listenerSupports));

So, first of all - NEVER USE NS_ADDREF and NS_RELEASE.  Always use nsCOMPtr<nsIInterface> or RefPtr<ConcreteClassType>.  Then you *never* need to manually call addref/release.

I think nsCOMPtr is obvious to use.  Non obvious may be some rules like when a function returns an object as an out param.  It's done this way:

foo(nsIFoo **aOut)
{
  nsCOMPtr<nsIFoo> result = new Object();
  result.forget(aOut);
}

and then it's called like:

  nsCOMPtr<nsIFoo> object;
  foo(getter_AddRefs(object));

here |object| has one reference and is filled with the new Object instance.


You can read https://developer.mozilla.org/en-US/docs/Mozilla/Tech/XPCOM/Using_nsCOMPtr for more details.
Thanks for the comments.

(In reply to Honza Bambas (:mayhemer) from comment #43)
> just make |helper| a nsCOMPtr at [1].  It will release automatically if
> there is no reference added to it before this point.
I tried nsCOMPtr but that doesn't work since it's not a nsIInterface class. I get a compile error.
The class is defined in the same file:
https://dxr.mozilla.org/comm-central/rev/756fa9e390ab08f995382c9a9dfc6bf4b72421ca/mailnews/compose/src/nsMsgComposeService.cpp#655

So I tried the RefPtr. That works, so the function actually returns, but it crashes later. Not easy to find since I'm running this as Xpcshell test from the command line and can't attach the MS Visual C++ debugger. I think we need to understand what's going on and why that 'helper' was done in such a strange way.

Thanks for looking at the patch. Aleth had marked it as WIP, so it wasn't ready for review yet. I think he's spending a vast amount of time finding the leaks, that needs to be lauded, and as we can see, the fix is not obvious, or, IOW, applying the obvious fix doesn't work for some non-obvious reasons :-(
(In reply to Jorg K (GMT+1) from comment #45)
> Thanks for the comments.
> 
> (In reply to Honza Bambas (:mayhemer) from comment #43)
> > just make |helper| a nsCOMPtr at [1].  It will release automatically if
> > there is no reference added to it before this point.
> I tried nsCOMPtr but that doesn't work since it's not a nsIInterface class.
> I get a compile error.
> The class is defined in the same file:
> https://dxr.mozilla.org/comm-central/rev/
> 756fa9e390ab08f995382c9a9dfc6bf4b72421ca/mailnews/compose/src/
> nsMsgComposeService.cpp#655
> 
> So I tried the RefPtr. That works, so the function actually returns, but it
> crashes later. Not easy to find since I'm running this as Xpcshell test from
> the command line and can't attach the MS Visual C++ debugger. 

You can:

./mach xpcshell-test path/to/test --interactive

it will then go to the xpcshell prompt instead of immediately running the test, let you attach the debugger first.  then you type _execute_test(); in the prompt and the test starts.

> I think we
> need to understand what's going on and why that 'helper' was done in such a
> strange way.
> 
> Thanks for looking at the patch. Aleth had marked it as WIP, so it wasn't
> ready for review yet. I think he's spending a vast amount of time finding
> the leaks, that needs to be lauded, and as we can see, the fix is not
> obvious, or, IOW, applying the obvious fix doesn't work for some non-obvious
> reasons :-(

yep, feel free to add any patches here.  I can look at them, but - sorry - w/o knowing the whole picture it would be impossible to help you ;)
(In reply to Honza Bambas (:mayhemer) from comment #46)
> ./mach xpcshell-test path/to/test --interactive
Thanks, that worked, sadly it crashes in the destructor of nsMsgProtocol (nsMsgProtocol::~nsMsgProtocol). So something is quite wrong here.

Also I noticed that the original code
https://dxr.mozilla.org/comm-central/rev/756fa9e390ab08f995382c9a9dfc6bf4b72421ca/mailnews/compose/src/nsMsgComposeService.cpp#907
always leaked that object from
   nsMsgTemplateReplyHelper *helper = new nsMsgTemplateReplyHelper;
(In reply to Jorg K (GMT+1) from comment #47)
> Also I noticed that the original code
> https://dxr.mozilla.org/comm-central/rev/
> 756fa9e390ab08f995382c9a9dfc6bf4b72421ca/mailnews/compose/src/
> nsMsgComposeService.cpp#907
> always leaked that object from
>    nsMsgTemplateReplyHelper *helper = new nsMsgTemplateReplyHelper;

I'm not sure that's true, otherwise there would be leaks from the previous test functions in test_autoReply. (Which is not to say that rewriting this using reference-counted helpers wouldn't make things a lot cleaner!)
Fixes all the pop3 failures. This one is a bad leak from a non-refcounted class - the whole parser leaks every time it's used. It's surprising this passed review once, but there are also TODOs for error handling in this code, so...
All=two? So 10 to go? Maybe we want to land those patches and then look for a proper fix for the auto-reply one, unless it's obvious, which it doesn't appear to be.
What's left are these 8:
PROCESS-CRASH | mailnews/imap/test/unit/test_imapFilterActions.js
PROCESS-CRASH | mailnews/imap/test/unit/test_imapFilterActionsPostplugin.js
PROCESS-CRASH | mailnews/base/test/unit/test_nsMsgDBView.js
PROCESS-CRASH | mailnews/base/test/unit/test_junkingWhenDisabled.js
PROCESS-CRASH | mailnews/db/gloda/test/unit/test_index_junk_imap_online.js
PROCESS-CRASH | mailnews/db/gloda/test/unit/test_index_junk_imap_offline.js
PROCESS-CRASH | mailnews/news/test/unit/test_server.js
PROCESS-CRASH | mailnews/news/test/unit/test_internalUris.js

iirc you were looking at test_server.js and test_nsMsgDBView.js. With luck the three junk failures are also related.

I don't think we should land any of these yet, it's not like that would make anything green anyway. They need another look before going up for review.
(In reply to aleth [:aleth] from comment #51)
> iirc you were looking at test_server.js and test_nsMsgDBView.js.
Hmm, I tried looking into test_nsMsgDBView.js but with no result since the log files were too big and I hadn't developed the right technique. Also, I'm tied up in all sorts of other things, so I'd love to leave all the glory to you ;-) The little time I had was spent on improving the auto-reply thing, also with little success, other than getting Honza interested.

> With luck the three junk failures are also related.
With luck all the IMAP failures are related ;-)
Fixes test_nsMsgDBView.js and test_junkingWhenDisabled.js, and the same situation in TB outside of the tests. Generally the architecture problem is that TB objects hold a lot of cross-references, so here a harmless-looking JS fake selection object refers to a treeview, which is a DBView, which holds references to a database among other things, which holds reference to a folder etc. So a 'small' leak can become quite large.
Assignee: nobody → aleth
Status: NEW → ASSIGNED
https://bugzilla.mozilla.org/page.cgi?id=fields.html#assigned_to
This is the person in charge of resolving the bug ;-)
\o/ - Thank you.
Attachment #8832586 - Attachment is obsolete: true
Leak detected by test_folderLoaded.js.
Attachment #8832133 - Attachment is obsolete: true
Great! I had the |RefPtr<nsMsgTemplateReplyHelper> helper = new nsMsgTemplateReplyHelper;| but I didn't spot the release which made it crash later.
Comment on attachment 8833816 [details] [diff] [review]
Set LastUseTime in GetDatabaseWithReparse just like in GetDatabaseWOReparse

What happened to the other hunk?
nsCOMPtr<nsIMsgDatabase> db;
rv = GetDatabaseWithReparse(this, aWindow, getter_AddRefs(db));
I thought calling this with getter_AddRefs(mDatabase) caused a problem? No?
(In reply to Jorg K (GMT+1) from comment #59)
> What happened to the other hunk?

Different issue, different patch.
(In reply to aleth [:aleth] from comment #60)
> Different issue, different patch.
... that's coming up?
(In reply to Jorg K (GMT+1) from comment #61)
> (In reply to aleth [:aleth] from comment #60)
> > Different issue, different patch.
> ... that's coming up?

Comment 56.
Sorry, got it now. Why don't you combine everything into one patch? That way I could easily include this in my try pushes for bug 791645 and have less failures and a better chance to see real failures caused by my patch rather than seeing 14 failures, 8 of which are already fixed (if I haven't lost count).
Comment on attachment 8833809 [details] [diff] [review]
Avoid treeselection.view cycle leaks

The lesson here seems to be that there is a reference cycle with the view. Ideally we would hope that GC would solve this problem, but apparently it is not. What is the object in the view causing the problem? Ideally we would figure out the correct ownership relationship, and make back references to parents be weak references (or interduce the cycle collector). At the very least, if the requirement is that a particular object (say the view) is nulled in GC to prevent leaks, that should be clearly noted in the documentation somewhere.

I would be cautious about simply making tests pass without fixing the underlying issues. After, our goal is to fix issues, not tests.
(In reply to Kent James (:rkent) from comment #64)
> The lesson here seems to be that there is a reference cycle with the view.
> Ideally we would hope that GC would solve this problem, but apparently it is
> not. What is the object in the view causing the problem? 

The cycle is obvious:
  gFakeSelection.view = gTreeView;
  gTreeView.selection = gFakeSelection;
What's not so obvious is indeed why GC doesn't catch it. The answer, looking at the leak, seems to be that the cycle is going through XPConnect wrappers which JS can't always reason about.

> make back references to
> parents be weak references (or interduce the cycle collector). 

I don't see a way of making that work here. JSTreeSelection implements nsITreeSelection, so there is limited scope to change the design. I don't know of a good way of implementing weak references in JS outside of WeakSet/Map, and those aren't enough as the code calls methods on the view.

> At the very
> least, if the requirement is that a particular object (say the view) is
> nulled in GC to prevent leaks, that should be clearly noted in the
> documentation somewhere.

Sure, I can add a comment.

> I would be cautious about simply making tests pass without fixing the
> underlying issues. After, our goal is to fix issues, not tests.

If you see a better solution here, that would be good.
(In reply to Jorg K (GMT+1) from comment #63)
> Sorry, got it now. Why don't you combine everything into one patch? 

It's good practice to have independent fixes in separate commits, both in case there is some regression or unexpected side effect, and to make the changeset less confusing to anyone who looks at it in the future via hg annotate etc.
Attachment #8833809 - Flags: review?(rkent)
Attachment #8833814 - Flags: review?(mkmelin+mozilla)
Attachment #8833815 - Flags: review?(mkmelin+mozilla)
Comment on attachment 8833816 [details] [diff] [review]
Set LastUseTime in GetDatabaseWithReparse just like in GetDatabaseWOReparse

Review of attachment 8833816 [details] [diff] [review]:
-----------------------------------------------------------------

I don't know if this is an actual bug. Hg blame says aceman added this to GetDatabaseWOReparse, I am wondering if it was an oversight or intentional that it wasn't added to GetDatabaseWithReparse as well.
Attachment #8833816 - Flags: feedback?(acelists)
Comment on attachment 8833816 [details] [diff] [review]
Set LastUseTime in GetDatabaseWithReparse just like in GetDatabaseWOReparse

Could you tell me which issue this fixes? Or did I miss it in a previous comment?
Comment on attachment 8833470 [details] [diff] [review]
Fix leak detected by test_pop3FilterActions.js

I looked at making MaildirStoreParser an object only accessed via RefPtr. That doesn't really help here as nothing outside it retains a reference to it.

Drive-by comment: If this is part of the maildir code there was some discussion about making the default in the near future, that code looks like it needs some more work or at least another detailed review done on it before that happens (note e.g. the missing error handling mentioned in the TODOs).
Attachment #8833470 - Attachment description: WIP Fix leak detected by test_pop3FilterActions.js → Fix leak detected by test_pop3FilterActions.js
Attachment #8833470 - Flags: review?(mkmelin+mozilla)
(In reply to Jorg K (GMT+1) from comment #68)
> Comment on attachment 8833816 [details] [diff] [review]
> Set LastUseTime in GetDatabaseWithReparse just like in GetDatabaseWOReparse
> 
> Could you tell me which issue this fixes? Or did I miss it in a previous
> comment?

Yes, the preceding comment. It doesn't fix any leak.
Unless I lost track, there are six to go, right?
PROCESS-CRASH | mailnews/imap/test/unit/test_imapFilterActions.js
PROCESS-CRASH | mailnews/imap/test/unit/test_imapFilterActionsPostplugin.js
PROCESS-CRASH | mailnews/db/gloda/test/unit/test_index_junk_imap_online.js
PROCESS-CRASH | mailnews/db/gloda/test/unit/test_index_junk_imap_offline.js
PROCESS-CRASH | mailnews/news/test/unit/test_server.js
PROCESS-CRASH | mailnews/news/test/unit/test_internalUris.js
Attachment #8833470 - Flags: review?(mkmelin+mozilla) → review+
Comment on attachment 8833814 [details] [diff] [review]
Fix nsMsgTemplateReplyHelper leak as detected by test_autoReply.js

Review of attachment 8833814 [details] [diff] [review]:
-----------------------------------------------------------------

Great! r=mkmelin
Attachment #8833814 - Flags: review?(mkmelin+mozilla) → review+
Comment on attachment 8833815 [details] [diff] [review]
Fix database leak in nsMsgLocalMailFolder::UpdateFolder

Review of attachment 8833815 [details] [diff] [review]:
-----------------------------------------------------------------

::: mailnews/local/src/nsLocalMailFolder.cpp
@@ +429,5 @@
>      // return of NS_ERROR_NOT_INITIALIZED means running parsing URL
> +    // We don't need the return value, and assigning it to mDatabase which
> +    // is already set internally leaks.
> +    nsCOMPtr<nsIMsgDatabase> returnedDb;
> +    rv = GetDatabaseWithReparse(this, aWindow, getter_AddRefs(returnedDb));

I don't understand why it would leak, as mDatabase is an nsCOMPtr.
Attachment #8833815 - Flags: review?(mkmelin+mozilla) → review?(rkent)
Specifically answering "I don't know of a good way of implementing weak references in JS outside of WeakSet/Map" this is an example of using a weak reference. With this change, test_nsMsgDBView.js no longer leaks.

I have not looked at this enough to know if this is the right overall answer or not. But I am not very comfortable with relying on explicit nulls to stop leaks rather than a memory ownership design that works naturally.
Attachment #8834116 - Flags: feedback?(aleth)
May I say something? I once worked on fixing up the M-C dictionary code. I wrote a few Mochi tests (not Mozmill). One of those tests worked find but when FF exited, it was leaking. The fix was to set some JS variable to null and that was it. I complained to Ehsan at the time questioning the JS GC, but I was just ignored. While I still don't understand what happened there, I understand that sometimes the JS engine is not very good at recognising that a "thing" has lost its handle.

I'm with Aleth on this one. There is no harm in resetting JS variables to null and if that fixes long-standing problems of databases not being closed(!!), then I'm all for it. No offence intended: But this is far superior than switching atoms to strings and losing the valuable information we gain from those atom leaks.
(In reply to Magnus Melin from comment #73)
> > -    rv = GetDatabaseWithReparse(this, aWindow, getter_AddRefs(mDatabase));
> I don't understand why it would leak, as mDatabase is an nsCOMPtr.
Of course that leaks. The object is holding a refcount and one of those references comes from mDatabase. You're increasing the refcount but mDatabase doesn't suddenly point at the object twice.
It wouldn't leak if you set mDatabase to nullptr before the call, but you can't do that since then the function doesn't work since it returns that value. An alternative is to call release on the thing, but Aleth' solution is far better.
Comment on attachment 8833815 [details] [diff] [review]
Fix database leak in nsMsgLocalMailFolder::UpdateFolder

Review of attachment 8833815 [details] [diff] [review]:
-----------------------------------------------------------------

::: mailnews/local/src/nsLocalMailFolder.cpp
@@ +429,5 @@
>      // return of NS_ERROR_NOT_INITIALIZED means running parsing URL
> +    // We don't need the return value, and assigning it to mDatabase which
> +    // is already set internally leaks.
> +    nsCOMPtr<nsIMsgDatabase> returnedDb;
> +    rv = GetDatabaseWithReparse(this, aWindow, getter_AddRefs(returnedDb));

which already holds a pointer to the object. It can't hold two!

Maybe you could improve the comment:
We don't need the return value. Do not assign to mDatabase since this already holds a reference (and can't hold a second one).
Attachment #8833815 - Flags: review?(rkent) → review+
(In reply to Jorg K (GMT+1) from comment #75)
> I'm with Aleth on this one. There is no harm in resetting JS variables to
> null and if that fixes long-standing problems of databases not being
> closed(!!), then I'm all for it.

But if somebody forgets to reset those 2 variables in some future code, you have a leak again (yeah, MAYBE some test may catch it).

If Kent's patch fixes this internally to treeview implementation, so that consumers can't get it wrong, that is the way to go.
Comment on attachment 8834116 [details] [diff] [review]
demo using Cu.getWeakReference

Review of attachment 8834116 [details] [diff] [review]:
-----------------------------------------------------------------

Awesome, thanks! I didn't know Cu.getWeakReference. I doubt there are cases where a jsTreeSelection should in fact be the only thing keeping a TreeView alive. 

In fact, depending on how this code is used, I wonder if jsTreeSelection.tree should also be a weak reference? (The "real" nsITreeSelection does hold a strong reference, however.)
Attachment #8834116 - Flags: feedback?(aleth) → feedback+
Attachment #8833809 - Attachment is obsolete: true
Attachment #8833809 - Flags: review?(rkent)
Ignore me, I'm tired. I didn't even see that Kent attached a patch :-( - Sorry.
https://hg.mozilla.org/comm-central/rev/24488ee6afd1dc2c1e3f88aba010fedb42687c0c
Bug 1334874 - Fix database leak in nsMsgLocalMailFolder::UpdateFolder. r=jorgk

https://hg.mozilla.org/comm-central/rev/65e6288d5b01073c7e2a0e9c83055c1457eccdc9
Bug 1334874 - Fix leak in MaildirStoreParser. r=mkmelin

https://hg.mozilla.org/comm-central/rev/6483fa45f5b2330a94edbb24ff7235170bce7830
Bug 1334874 - Fix nsMsgTemplateReplyHelper leak as detected by test_autoReply.js. r=mkmelin a=aleth
Additional comments on JS/XPCOM references:

I've fought this issue a lot in JsAccount. Although the Cu.getWeakReference() usually works, I've also found times when it does not, and I was forced to use nsISupportsWeakReference items instead, which seem to work more reliably (but requires more cooperation from the C++ side). I don't really understand what is going on there. I think that the issues in normal code are simpler than the issues in JsAccount, so I would start with Cu.getWeakReference() unless you can prove that fails.

Of course, weak references really only work well as back-references from a child to a parent, where the parent clearly owns the child, and the child's life ends with the parent's life. If you are not certain of that, then weak references start to introduce all sorts of unexpected behavior, and you are better off to stick with documenting expected reference cycles, then nulling to break cycles.

Is it safe to assume that a selection is owned by the view, will not continue to exist after the view is destroyed?

The current JSAccount memory design has become quite complex, mainly because there are many different types of references that refer to the same object, and getting the right one is not trivial. It really makes the whole JsAccount concept unworkable unless you really, really understand all of the ramifications of different objects.
Comment on attachment 8834116 [details] [diff] [review]
demo using Cu.getWeakReference

Review of attachment 8834116 [details] [diff] [review]:
-----------------------------------------------------------------

r+, but see the question in comment 79.

> Is it safe to assume that a selection is owned by the view, will not continue to exist after the view is destroyed?

That seems a reasonable assumption to me (certainly the converse should be wrong - the view should not be owned by the selection), but I don't know the frontend code well enough to be absolutely certain.

> The current JSAccount memory design has become quite complex, mainly because there are many different types of references that refer to > the same object, and getting the right one is not trivial. It really makes the whole JsAccount concept unworkable unless you really, 
> really understand all of the ramifications of different objects.

That sounds painful. You could consider message-passing instead of references, if that helps. But I suppose it's hard when jsAccount has to slot into the existing API. Maybe a Proxy to automate which "ramification" you return? Not that proxies aren't complicated too...
Attachment #8834116 - Flags: review+
(In reply to Kent James (:rkent) from comment #82)
> Although the
> Cu.getWeakReference() usually works, I've also found times when it does not,
> and I was forced to use nsISupportsWeakReference items instead, which seem
> to work more reliably (but requires more cooperation from the C++ side).

Looks like Cu.getWeakReference produces a nsISupportsWeakReference wrapper: http://searchfox.org/mozilla-central/source/js/xpconnect/src/XPCJSWeakReference.cpp#49
We're a little OT here, but "Maybe a Proxy to automate which "ramification" you return? Not that proxies aren't complicated too..." is part of what I did. Maybe the discussion is useful here?

C++ callers are supposed to only see the C++ "delegator" object. There is also a JS "delegate" object that can override or implement XPCOM methods (or extend with new methods), with the C++ object being the base class. The C++ object owns the JS delegate, the JS delegate has a weak reference (nsISupportsWeakReference.GetWeakReference()) to the C++ object for parent class calls.

The problem comes if someone gets a reference to only the JS class, so nothing keeps the C++ class alive. The solution is that JS callers actually use a Proxy that simply adds a hard reference to the C++ class to the JS class. There is no resulting reference loop, but it is quite complex (and critical to make sure that C++ callers do not get the JS reference).

The equivalent problem here would be if someone got a reference to the selection, but nothing kept the view alive. Unfortunately when C++ classes get destroyed, the memory gets replaced with marker values like 5a5a5a5a, so simple null checks in C++ do not find unusable references.
Comment on attachment 8833470 [details] [diff] [review]
Fix leak detected by test_pop3FilterActions.js

I should have looked at the context of this:
void MaildirStoreParser::TimerCallback(nsITimer *aTimer, void *aClosure) {
  MaildirStoreParser *parser = (MaildirStoreParser *) aClosure;
...
    // Parsing complete.
    delete parser; <=== added line.
    return;
  }
  nsCOMPtr<nsISupports> aSupport;
  parser->m_directoryEnumerator->GetNext(getter_AddRefs(aSupport));

In one branch you're deleting an object you do not own and in another branch you don't.

I'd say this is isn't right.
Attachment #8833470 - Flags: review-
Comment on attachment 8833814 [details] [diff] [review]
Fix nsMsgTemplateReplyHelper leak as detected by test_autoReply.js

Context:

NS_IMETHODIMP nsMsgTemplateReplyHelper::OnStopRunningUrl(nsIURI *aUrl, nsresult aExitCode)
...
-  Release(); <== removed line.

NS_IMETHODIMP nsMsgComposeService::ReplyWithTemplate(...
-  nsMsgTemplateReplyHelper *helper = new nsMsgTemplateReplyHelper;
-  NS_ADDREF(helper);
+  RefPtr<nsMsgTemplateReplyHelper> helper = new nsMsgTemplateReplyHelper;

I don't think we have understood what really happens to that helper, at least I haven't. By using the RefPtr the object will now be destroyed when 'helper' goes out of scope, so at the end of the function. So you've shortened the lifespan of the object. Let's hope it's not needed after the function returns ;-)

But since the other patch clearly causes the crash (see your try run), I'll leave this one in ;-)
(In reply to Jorg K (GMT+1) from comment #87)
> I'd say this is isn't right.
Typo: I'd say this isn't right.
Backout:
https://hg.mozilla.org/comm-central/rev/6ba5dceb464c7853076ad781e10b34c61539bcce

Well, I thought I had backed it out, but this was never recorded in the pushlog. There seems to be some Mercurial problem.
Apparently there was a delay, so the system will catch up. Apparently it works like this:

Master receives changesets via ssh and passes them on to various other places, amongst them the servers running https://hg.mozilla.org. In the meantime, there are changes that no one can see.
Still crashing :-(
Comment on attachment 8833814 [details] [diff] [review]
Fix nsMsgTemplateReplyHelper leak as detected by test_autoReply.js

I understand this now. This is a terrible horrible atrocious hack.

Surely in the days when this was written, RefPtr (or nsRefPtr was they were called, IIRC) existed. Why didn't the original author use them?

Well, they wanted the helper object to live from the function where it was created right to the end of nsMsgTemplateReplyHelper::OnStopRunningUrl().

So what did they do? They created a new object without refcounting, then added one reference knowing that the object would live until that reference was removed. And removing the reference destroys the object, note that there isn't any 'delete' for it anywhere. And that happens miles away in another function. Sadly they forgot that there are error paths where they forgot to destroy the object. You destroy the object now in the function that created it, so any access after the function has returned will crash.

So as I see it, the only way to fix the problem at hand is to land the r- patch with a lot of BIG comments and restructure the whole thing later in another bug. Sad :-(

I know it's very unfair to put an f- onto it now ;-(

BTW, the maildir problem is the same type of problem. You should 'delete' the object in the function that created it, not elsewhere.
Attachment #8833814 - Flags: feedback-
Comment on attachment 8832586 [details] [diff] [review]
WIP fix for leak detected by test_autoReply.js

IMHO, this is the only way short-term to fix this. Of course it needs HUGE comments to explain the hack.
Attachment #8832586 - Attachment is obsolete: false
Attachment #8832586 - Flags: review+
Comment on attachment 8833470 [details] [diff] [review]
Fix leak detected by test_pop3FilterActions.js

I looked at this some more.

Basically, that timer callback gets called with 'this', so the object it's running on. I don't think you should delete the object while running one of it's methods ;-)

Sadly right now after staring at it for a while I don't have a good idea how to restructure it. It looks like the same problem you have with the addref/release in the other patch. You want to arrange for object deletion miles from where you created it. All I can think of is using the same ugly hack.

Maybe let's ask Honza. Honza, can you please look at
https://dxr.mozilla.org/comm-central/rev/6483fa45f5b2330a94edbb24ff7235170bce7830/mailnews/local/src/nsMsgMaildirStore.cpp#1298
This is a small self-contained source file and you don't need to know anything about any of what's happening there.

How would you suggest for this object to be deleted?

Unrelated but important for us: If you look at the constructor (a few lines above), it holds a reference to folder and a database, and those references never die since the object never dies.

We have a similar problem in the patch you gave r-. There the situation is that an object is created, they addref so it survives for a while, and when they want to dispose of it, they release. That actually works. Using RefPtr myObj = new ... doesn't work, since the object dies when myObj goes out of scope. But we need this object elsewhere, so they addrefing gets us there.
Flags: needinfo?(honzab.moz)
(Rather busy these days, please don't expect answer soon)
One has to be a bit careful with confirmation bias, once one knows there is a crash somewhere one is likely to see issues everywhere which one didn't think of before, real or not. It might help to test a hypothesis locally or on try. That's just the opposite of a too-quick r+ because something makes something green ;)

There's also the possibility that fixing one problem reveals another...

(In reply to Jorg K (GMT+1) from comment #87)
> Comment on attachment 8833470 [details] [diff] [review]
> Fix leak detected by test_pop3FilterActions.js
> 
> In one branch you're deleting an object you do not own and in another branch
> you don't.

It's a static timer callback. The object only gets deleted after the timer has fired for the last time.

I agree about this code being hacky, see comment 69 and earlier.

(In reply to Jorg K (GMT+1) from comment #93)
> Comment on attachment 8833814 [details] [diff] [review]
> Fix nsMsgTemplateReplyHelper leak as detected by test_autoReply.js
> 
> I understand this now. This is a terrible horrible atrocious hack.
> 
> Surely in the days when this was written, RefPtr (or nsRefPtr was they were
> called, IIRC) existed. Why didn't the original author use them?
> 
> Well, they wanted the helper object to live from the function where it was
> created right to the end of nsMsgTemplateReplyHelper::OnStopRunningUrl().
> 
> So what did they do? They created a new object without refcounting, then
> added one reference knowing that the object would live until that reference
> was removed. And removing the reference destroys the object, note that there
> isn't any 'delete' for it anywhere. And that happens miles away in another
> function. Sadly they forgot that there are error paths where they forgot to
> destroy the object.

I think that's likely correct.

>  You destroy the object now in the function that created
> it, so any access after the function has returned will crash.

I may be wrong, but I don't think so. The helper gets passed as a url listener, and down the line gets passed to nsIMsgMailNewsUrl :RegisterListener. That should hold a reference. Of course it's possible there are additional bugs.
(In reply to aleth [:aleth] from comment #98)
> One has to be a bit careful with confirmation bias, once one knows there is
> a crash somewhere one is likely to see issues everywhere which one didn't
> think of before, real or not. It might help to test a hypothesis locally or
> on try. That's just the opposite of a too-quick r+ because something makes
> something green ;)
My apologies. I don't know what else so say. I'm really sorry for backing out some potentially correct fixes.

While it might help to "test a hypothesis", I think the better approach is to *understand* what the various little sprockets in the gearbox do and then document this in the code.

I still dislike the maildir fix, since it's really not obvious that the 'delete' happens on the last firing of the timer. Can't the nsMsgMaildirStore not hold a |RefPtr<MaildirStoreParser> mParser;|, so the parser gets destroyed when the nsMsgMaildirStore goes away?

The nsMsgTemplateReplyHelper is equally twisted: Why would the original authors have used the addref/release hack (missing the release on some code paths) when a simple RefPtr would have saved the day? Passing the 'helper' as an argument should not increase its refcount. I still think it dies right in that function now, but I may be hallucinating.

Of the three patches landed, I backed out two leaving one:
https://hg.mozilla.org/comm-central/rev/24488ee6afd1dc2c1e3f88aba010fedb42687c0c
Is this the source of the crash? Then I'd really like to understand why.
This is what I meant by:

Can't the nsMsgMaildirStore not hold a |RefPtr<MaildirStoreParser> mParser;|?

Surely a much more expensive solution ;-)
Flags: needinfo?(honzab.moz)
Attachment #8834532 - Flags: feedback?(aleth)
I think this is the standard answer to this sort of problem. Make someone else own the object so it dies with them ;-)
Attachment #8834537 - Flags: feedback?(aleth)
(In reply to Jorg K (GMT+1) from comment #99)
> (In reply to aleth [:aleth] from comment #98)
> > One has to be a bit careful with confirmation bias, once one knows there is
> > a crash somewhere one is likely to see issues everywhere which one didn't
> > think of before, real or not. It might help to test a hypothesis locally or
> > on try. That's just the opposite of a too-quick r+ because something makes
> > something green ;)
> My apologies. I don't know what else so say. I'm really sorry for backing
> out some potentially correct fixes.

I don't mind the backout at all given the test failures.

> While it might help to "test a hypothesis", I think the better approach is
> to *understand* what the various little sprockets in the gearbox do and then
> document this in the code.

Well, I expect the reviewer to complain if something is not clear!
 
> I still dislike the maildir fix, since it's really not obvious that the
> 'delete' happens on the last firing of the timer.

If you like, I can add a comment. It seemed obvious to me because the code follows inside of if(!hasMore) and a few lines after   parser->m_timer->Cancel().
 
> Passing the 'helper'
> as an argument should not increase its refcount. 

Adding it to an observer array will, however, if I am not mistaken. I may be wrong on this.

> I still think it dies right
> in that function now, but I may be hallucinating.

Why rely on guesses? That's what I meant by "testing hypotheses", it wasn't meant as a criticism but as a suggestion.

I have no idea why that mozmill test is crashing. There wasn't any m-c landing in between so it does appear to have *something* to do with one of these commits. Nobody has looked at the actual failing test yet though, it always seems to be the same one at least.
Comment on attachment 8834532 [details] [diff] [review]
1334874-parser.patch - alternative maildir parser leak fix

Review of attachment 8834532 [details] [diff] [review]:
-----------------------------------------------------------------

I considered that, but it's not a good idea as the code seems to be structured so you can have multiple parsers at the same time (and even in the test, the parsing is not fast.) If there were only ever one parser object, you wouldn't need to make it a separate object at all, it could just be methods of the parent class. Of course if you or someone else knows this code well enough to guarantee there's only ever one or zero parsers, it would be OK.
Attachment #8834532 - Flags: feedback?(aleth) → feedback-
(In reply to aleth [:aleth] from comment #102)
> I don't mind the backout at all given the test failures.
I'll back out the last one after the next M-C merge. At least, we didn't waste any resources in vain.

> Well, I expect the reviewer to complain if something is not clear!
Even reviewers make mistakes, sometimes they get carried away be something that looks simple. Try Kent if you want a 500% thorough review.

> If you like, I can add a comment. It seemed obvious to me because the code
> follows inside of if(!hasMore) and a few lines after  
> parser->m_timer->Cancel().
You need to read half a page of code up to see this. I still think any hack needs to be clearly marked, and deleting 'this' in its own (static) method is really bad.

> > Passing the 'helper'
> > as an argument should not increase its refcount. 
> Adding it to an observer array will, however, if I am not mistaken. I may be
> wrong on this.
I don't see where that happens. Anyway, IMHO when nsMsgComposeService::ReplyWithTemplate() returns, the helper is dead. But you need to carry it over to OnStopRunningUrl().

> Why rely on guesses? That's what I meant by "testing hypotheses", it wasn't
> meant as a criticism but as a suggestion.
Actually, I don't want to rely on *guesses*, I'd really like to understand what's going on.

> I have no idea why that mozmill test is crashing. There wasn't any m-c
> landing in between so it does appear to have *something* to do with one of
> these commits. Nobody has looked at the actual failing test yet though, it
> always seems to be the same one at least.
Which one? test-attachment-reminder.js? I ran that locally in a debug build and it didn't crash.
Comment on attachment 8834532 [details] [diff] [review]
1334874-parser.patch - alternative maildir parser leak fix

Review of attachment 8834532 [details] [diff] [review]:
-----------------------------------------------------------------

::: mailnews/local/src/nsMsgMaildirStore.cpp
@@ +41,5 @@
>  
>  nsMsgMaildirStore::nsMsgMaildirStore()
>  {
>    MailDirLog = PR_NewLogModule("MailDirStore");
> +  mParser = nullptr;

no need to init RefPtr or nsCOMPtr, obviously

@@ +1164,5 @@
> +
> +private:
> +  virtual ~MaildirStoreParser();
> +  nsAutoRefCnt mRefCnt;
> +  NS_DECL_OWNINGTHREAD

use https://dxr.mozilla.org/mozilla-central/rev/af8a2573d0f1e9cc6f2ba0ab67d7a702a197f177/xpcom/base/nsISupportsImpl.h#500

also, take care if this class is used on more than one thread or not.
(In reply to Jorg K (GMT+1) from comment #104)
> > If you like, I can add a comment. It seemed obvious to me because the code
> > follows inside of if(!hasMore) and a few lines after  
> > parser->m_timer->Cancel().
> You need to read half a page of code up to see this. I still think any hack
> needs to be clearly marked, and deleting 'this' in its own (static) method
> is really bad.

I don't really understand what you are getting at. If you use refpointers, you'd have a NS_RELEASE there. See for example http://searchfox.org/mozilla-central/source/accessible/generic/DocAccessible.cpp#605. Then the Release() method contains a "delete this".
https://hg.mozilla.org/comm-central/rev/f9c681da2d3121e7dded4e41044cc895ada64722
Backed out changeset 24488ee6afd1 (bug 1334874) to separate the mozmill issue. rs=bustage-fix
(In reply to aleth [:aleth] from comment #67)
> I don't know if this is an actual bug. Hg blame says aceman added this to
> GetDatabaseWOReparse, I am wondering if it was an oversight or intentional
> that it wasn't added to GetDatabaseWithReparse as well.

Which change do you mean by me? I can only see https://hg.mozilla.org/comm-central/rev/83fd926bf1f8 adding that SetLastUseTime().
(In reply to :aceman from comment #108)
> (In reply to aleth [:aleth] from comment #67)
> > I don't know if this is an actual bug. Hg blame says aceman added this to
> > GetDatabaseWOReparse, I am wondering if it was an oversight or intentional
> > that it wasn't added to GetDatabaseWithReparse as well.
> 
> Which change do you mean by me? I can only see
> https://hg.mozilla.org/comm-central/rev/83fd926bf1f8 adding that
> SetLastUseTime().

exactly, you added it here https://hg.mozilla.org/comm-central/rev/83fd926bf1f8#l8.12 to GetDatabaseWOReparse. I just noticed in passing you didn't add it to GetDatabaseWithReparse at the same time. Hopefully that was intentional.
I still do not see where you see that change was by me. That link to patch shows Bienvenu as the author, which can be correct. I did not implement the database cache. I've just fixed some logic of the cache in the JS cache manager part.
(In reply to Jorg K (GMT+1) from comment #104)
> > > Passing the 'helper'
> > > as an argument should not increase its refcount. 
> > Adding it to an observer array will, however, if I am not mistaken. I may be
> > wrong on this.
> I don't see where that happens.

I told you in comment 98.

> > Why rely on guesses? That's what I meant by "testing hypotheses", it wasn't
> > meant as a criticism but as a suggestion.
> Actually, I don't want to rely on *guesses*, I'd really like to understand
> what's going on.

In order to do that, one can run the test and look at the refcount log for a nsMsgTemplateReplyHelper. The result is in the attachment.
(In reply to :aceman from comment #110)
> I still do not see where you see that change was by me. That link to patch
> shows Bienvenu as the author, which can be correct. I did not implement the
> database cache. I've just fixed some logic of the cache in the JS cache
> manager part.

Oh, sorry. I misunderstood some bug comment then. If you have no idea either whether that code should be there or not then let's not worry about it for now...
(In reply to aleth [:aleth] from comment #111)
> In order to do that, one can run the test and look at the refcount log for a
> nsMsgTemplateReplyHelper. The result is in the attachment.

NB I don't pretend to understand every line in that log! But it does show what's going on.
Attachment #8834634 - Attachment mime type: text/x-log → text/plain
With all three patches backed out, it still crashes. That doesn't help.

It started crashing when you pushed them in
https://treeherder.mozilla.org/#/jobs?repo=comm-central&revision=6483fa45f5b2330a94edbb24ff7235170bce7830
which is built on the same M-C changeset 12c02bf624c48903b155428f7c8a419ba7 as the previous non-crashing push.

We would have to be very unlucky if one of the patches does cause a crash, and while backing them all out, another crash has crept in from M-C. Maybe we should have driven the patch landing to the extreme and landed them all separately.

Maybe to find out for sure, we need three try builds, all with M-C at 12c02bf624c48903b155428f7c8a419ba7 and then the three patches applied separately. Apparently you can do try builds with fixed M-C versions as Kent did here:
https://hg.mozilla.org/try-comm-central/rev/9fc62ed41dbd1a4bf002a5ecbe27573aea5311de

My "feeling" says that 
+    nsCOMPtr<nsIMsgDatabase> returnedDb;
+    rv = GetDatabaseWithReparse(this, aWindow, getter_AddRefs(returnedDb));
should really not cause a crash.

The 'delete' in the static function is probably OK, although extremely hacky and should have a big comment.

I'm still not convinced about the 'helper' issue. I don't understand any of the log, I'm sorry. There are numbers from -2 to 2, and should that be a reference count, I don't understand that it can be negative.

OnStopRunningUrl() isn't in the log, so maybe the object dies before we get there, but that can't be since OnStopRunningUrl() is a method of that object.

Anyway, I've done enough damage and disqualified myself enough so that I'll just won't interfere for a while I have other issues burning hot, like the proxy bug.
(In reply to Jorg K (GMT+1) from comment #114)
> With all three patches backed out, it still crashes. That doesn't help.
> 
> It started crashing when you pushed them in
> https://treeherder.mozilla.org/#/jobs?repo=comm-
> central&revision=6483fa45f5b2330a94edbb24ff7235170bce7830
> which is built on the same M-C changeset 12c02bf624c48903b155428f7c8a419ba7
> as the previous non-crashing push.

Yes, it's very puzzling. I think we should file a separate bug for the test-attachment-reminder.js crash. After all the try runs and backouts, my guess is that is unrelated to the patches in this bug, but I have no idea what suddenly started to make them happen. I just submitted a clobber, let's see if that makes it go green again. Usually it doesn't ;)

> I'm still not convinced about the 'helper' issue. I don't understand any of
> the log, I'm sorry. There are numbers from -2 to 2, and should that be a
> reference count, I don't understand that it can be negative.

When you have a moment, you can look at https://developer.mozilla.org/en-US/docs/Mozilla/Performance/Refcount_tracing_and_balancing, it will still be complicated to understand but all the information is there. In particular "Each call site has a balance factor, which is positive if more AddRef()s than Release()es have happened at the site, zero if the number of AddRef()s and Release()es are equal, and negative if more Release()es than AddRef()s have happened at the site."
 
> OnStopRunningUrl() isn't in the log, so maybe the object dies before we get
> there, but that can't be since OnStopRunningUrl() is a method of that object.

It's not in the *refcount* log because callsites with 0 are not included unless some addrefs/releases actually happened there (or in a subtree). However, SetUrlState (which calls it) is in the log.
(In reply to aleth [:aleth] from comment #115)
> Yes, it's very puzzling.
Does anyone understand the ...
INFO -  Timeout: bridge.execFunction("701e4e8f-ed80-11e6-ad82-002590c0b2cd", bridge.registry["{cd508ce6-4c69-4c73-94c4-a7a487438442}"]["runTestDirectory"], ["C:\\slave\\test\\build\\tests\\mozmill\\composition"])
WARNING -  TEST-UNEXPECTED-FAIL | Disconnect Error: Application unexpectedly closed
INFO -  Timeout: bridge.set("cb7b48b0-ed80-11e6-ac11-002590c0b2cd", Components.utils.import('resource://mozmill/modules/mozmill.js'))

Aceman, can you please take a look at the Mozmill crash that's holding us up here.

> When you have a moment, ...
Sadly, I don't. I can send you a PM about my workload if you're interested.
Flags: needinfo?(acelists)
Experiment:
hg push -f -r tip cc-try
https://treeherder.mozilla.org/#/jobs?repo=try-comm-central&revision=c1540a60fbb5fc65bb8415642ef7e00c8205cd4a
So current C-C tip with M-C at 12c02bf624c48903b155428f7c8a419ba7

Damn, I didn't want all those platforms, so I'll cancel some.
OK, current tip on C-C with M-C at 12c02bf624c48903b155428f7c8a419ba7 still crashes in Mozmill.

Does anyone know how to set the C-C version in a try run? I tried:
$ hg up -r 1eae202f7efa
then apply the changes to client.py-args and then
$ hg push -f -r 1eae202f7efa cc-try
and that said: "no changes found".
Flags: needinfo?(aleth)
Let's continue the Mozmill crash business in bug 1337713. I got a revelation just now looking at the Aurora tree. There those crashes also started, last good, Feb 6 in the morning, first bad, Feb 7 in the morning. On C-C the last good was on Feb 6 around 7 PM (GMT?) and the first bad was some hours later that day. So maybe something changed on the Mozilla servers?
Flags: needinfo?(aleth)
Flags: needinfo?(acelists)
(In reply to Jorg K (GMT+1) from comment #118)
> Does anyone know how to set the C-C version in a try run? I tried:
> $ hg up -r 1eae202f7efa
> then apply the changes to client.py-args and then
> $ hg push -f -r 1eae202f7efa cc-try
> and that said: "no changes found".

Well you're pushing without the change. Probably 

 hg push -f -r . cc-try

(or use whatever version the change is).
Aleth, I disabled the failing test in bug 1337713. We can't afford to be held up by this.

Once again, I'm really sorry for backing out all your patches :-( - My sincere apologies.

So please land your patches again and please take the time to add some comments to the maildir 'delete'. Maybe something like:
// Deleting the parser object in a static method after the timer was cancelled.
// XXX Hacky, but better than leaking it and all the things it holds.
Attachment #8832586 - Flags: review+
Attachment #8833470 - Flags: review-
Attachment #8832586 - Attachment is obsolete: true
Attachment #8833814 - Flags: feedback-
Attachment #8834532 - Attachment is obsolete: true
Attachment #8834537 - Attachment is obsolete: true
Attachment #8834537 - Flags: feedback?(aleth)
https://hg.mozilla.org/comm-central/rev/27e021adf35e5b68d36a1ff9d2122070e8a6bebe
Bug 1334874 - Use weak view reference to avoid JSTreeSlection.view cycle leaks. r=aleth

https://hg.mozilla.org/comm-central/rev/495a38b840265114c336b51877d0e3477dd524e8
Bug 1334874 - Fix leak in MaildirStoreParser. r=mkmelin

https://hg.mozilla.org/comm-central/rev/752aface26cafc254abf8871b6bc479cded5fcca
Bug 1334874 - Fix nsMsgTemplateReplyHelper leak as detected by test_autoReply.js. r=mkmelin

https://hg.mozilla.org/comm-central/rev/dae4f5ac532c2f224bd2f900c6795731a898c596
Bug 1334874 - Fix database leak in nsMsgLocalMailFolder::UpdateFolder. r=jorgk
This bug is too long already - the rest can be done in followup bugs.
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → Thunderbird 54.0
Depends on: 1337865
Comment on attachment 8833470 [details] [diff] [review]
Fix leak detected by test_pop3FilterActions.js

Let's fix these leaks on the branches since bug 1276669 is already on Aurora and going to Beta any minute now.

Uplift approval for all four patches in this bug as landed in comment #122.
Attachment #8833470 - Flags: approval-comm-beta+
Attachment #8833470 - Flags: approval-comm-aurora+
I have reservations about uplift to beta.  Recently I'm seeing high memory usage with 2017-02-11 nightly Thunderbird - 1.3GB on Sunday after 1 day of use, and 800MB today after 1 day of use.  1.3GB is roughly double what I see as a max in normal usage after a few days - not 1 day. I haven't yet tried to reproduce or determine cause.  So I don't know what I'm seeing is legit or related to this bug. But it is concerning.  

Do we have people using nightly that keep Thunderbird running, and are watching their memory usage?
And how critical is this patch?
I don't understand the concern. This but is reducing memory usage by not leaking objects.
Besides, uplift is not optional since bug 1276669 has landed on Aurora and will land on Beta in the next few days. Bug 1276669 is causing crashes in debug builds, initially 14 Xpcshell tests crashed, with the four patches here we're down to 6 which are covered in bug 1337865.
Aurora (TB 53):
https://hg.mozilla.org/releases/comm-aurora/rev/2fca3dd08c71ba78cc05e72c69c75e7f02d0affa
https://hg.mozilla.org/releases/comm-aurora/rev/8ef7bdc53e224004a78392ee36941ce8fd9f63a7
https://hg.mozilla.org/releases/comm-aurora/rev/f2ab70ad3589a28e809865153504c74bb2a50a10
https://hg.mozilla.org/releases/comm-aurora/rev/3955ea7da434085bc2d071ed78734f087cc28e8e
Summary: Fix fallout (failing tests) from bug 1276669 and bug 1334558 in mailnews: Leaked nsMsgDBFolder objects → Fix fallout (failing tests) from bug 1276669 and bug 1334558 in mailnews (fixed 8 of 14). Remaining fallout in bug 1337865 (6 of 14)
(In reply to Jorg K (GMT+1) from comment #126)
> I don't understand the concern. This but is reducing memory usage by not
> leaking objects.
> Besides, uplift is not optional since bug 1276669 has landed on Aurora and
> will land on Beta in the next few days. Bug 1276669 is causing crashes in
> debug builds, initially 14 Xpcshell tests crashed, with the four patches
> here we're down to 6 which are covered in bug 1337865.

I am by nature a skeptic. But I understand more clearly with your comment that this is required because of the core bug. Let's uplift and be vigilant :)
Depends on: 1340517
https://hg.mozilla.org/comm-central/rev/aa0e3b9c8e18982724f0754fe25ca5ee86a217c4
which is backout of
https://hg.mozilla.org/comm-central/rev/27e021adf35e5b68d36a1ff9d2122070e8a6bebe
for causing bug 1340517 and most likely bug 1340857.

Sorry about the excessive NI on three bugs now.
Flags: needinfo?(rkent)
Flags: needinfo?(aleth)
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Attachment #8833816 - Flags: feedback?(acelists)
Whiteboard: [Thunderbird-testfailure: X all debug]
Depends on: 1342745
Comment on attachment 8833809 [details] [diff] [review]
Avoid treeselection.view cycle leaks

Since Kent's patch had unwanted side effects, I'll un-obsolete this one for further discussion.
Attachment #8833809 - Attachment is obsolete: false
Comment on attachment 8833809 [details] [diff] [review]
Avoid treeselection.view cycle leaks

Could we please reconsider this since Kent's patch caused two regressions.
Attachment #8833809 - Flags: review?(rkent)
Attachment #8833809 - Flags: review?(acelists)
Just tested, Aleth patch does *not* regress bug 1340517.

STR for that bug: Set option to view messages in a separate window. Open a message, click delete button. Window content should update with next message. With Kent's patch, it doesn't update, with Aleth' patch it does.
OK, I've added the comments that Kent requested.

The design might not be ideal but we're not in a position to turn this upside down while in bustage fix mode.

Note that test test_junkingWhenDisabled.js already had the gFakeSelection code, this is now transplanted to test_nsMsgDBView.js, so both tests are aligned.

I think we should accept this now, especially since Kent said on IRC:
That [his] patch was meant more as a demo of getting a weak JS reference. Regressions are caused (generally) when the weak reference fails because nobody is holding onto the parent, and it gets GC'd.

So since we observed two regressions, the weak reference approach doesn't appear to fly here.
Attachment #8833809 - Attachment is obsolete: true
Attachment #8833809 - Flags: review?(rkent)
Attachment #8833809 - Flags: review?(acelists)
Flags: needinfo?(rkent)
Flags: needinfo?(aleth)
Attachment #8841373 - Flags: review?(mkmelin+mozilla)
Attachment #8841373 - Flags: review?(acelists)
Comment on attachment 8841373 [details] [diff] [review]
1334874-leak-treeselection.patch - Aleth's original with added comment

Review of attachment 8841373 [details] [diff] [review]:
-----------------------------------------------------------------

Even if we have a proper ownership of objects in tree form, these gRightMouseButtonSavedSelection and gFakeSelection objects seem to take additional references to our object 'from the side' so I don't know if it is possible to make this in some elegant way.
But it should never hurt to clear references that we have set. Just that the GC clears many references/objects for us automatically when they get out of scope.

I would be fine with this if there are no objections.

::: mail/base/content/msgMail3PaneWindow.js
@@ +736,5 @@
> +
> +  if (gRightMouseButtonSavedSelection &&
> +      gRightMouseButtonSavedSelection.view) {
> +    // Avoid possible cycle leaks.
> +    gRightMouseButtonSavedSelection.view = null;

Why not also gRightMouseButtonSavedSelection = null as the other file does?

::: mailnews/base/test/unit/test_nsMsgDBView.js
@@ +283,5 @@
>    gDBView.curCustomColumn = "authorFirstLetterCol";
>  
>    gTreeView = gDBView.QueryInterface(Components.interfaces.nsITreeView);
> +  gFakeSelection.view = gTreeView;
> +  gTreeView.selection = gFakeSelection;

The other test has this in reversed order:
  gTreeView.selection = gFakeSelection;
  gFakeSelection.view = gTreeView;

Does it matter?
But I wonder what this is supposed to do, it looks like a cycle to me.
Attachment #8841373 - Flags: feedback+
Comment on attachment 8841373 [details] [diff] [review]
1334874-leak-treeselection.patch - Aleth's original with added comment

Review of attachment 8841373 [details] [diff] [review]:
-----------------------------------------------------------------

::: mail/base/content/msgMail3PaneWindow.js
@@ +733,5 @@
>    try {
>      mailInstrumentationManager.uninit();
>    } catch (ex) {logException(ex);}
> +
> +  if (gRightMouseButtonSavedSelection &&

And I would think this should be higher in this function, before we start tearing down the UI.
(In reply to :aceman from comment #136)
> Why not also gRightMouseButtonSavedSelection = null as the other file does?
Looks like Aleth went for the minimal fix, but I can add it.

> The other test has this in reversed order:
>   gTreeView.selection = gFakeSelection;
>   gFakeSelection.view = gTreeView;
> 
> Does it matter?
> But I wonder what this is supposed to do, it looks like a cycle to me.
I don't think it matters, but I can reverse the order to match the other test.

(In reply to :aceman from comment #137)
> And I would think this should be higher in this function, before we start
> tearing down the UI.
I can move this up.

Thanks for getting involved!
Here come the changes requested by Aceman.
Attachment #8841373 - Attachment is obsolete: true
Attachment #8841373 - Flags: review?(mkmelin+mozilla)
Attachment #8841373 - Flags: review?(acelists)
Comment on attachment 8841444 [details] [diff] [review]
1334874-leak-treeselection.patch - Aleth's original with added comment and minor tweaks

Review of attachment 8841444 [details] [diff] [review]:
-----------------------------------------------------------------

::: mail/base/content/msgMail3PaneWindow.js
@@ +707,5 @@
>    Services.prefs.removeObserver("mail.pane_config.dynamic", MailPrefObserver);
>    Services.prefs.removeObserver("mail.showCondensedAddresses", MailPrefObserver);
>  
> +  if (gRightMouseButtonSavedSelection &&
> +      gRightMouseButtonSavedSelection.view) {

Aceman, would you like to lose the second line here?
Attachment #8841444 - Flags: review?(acelists)
(In reply to Jorg K (GMT+1) from comment #140)
> > +  if (gRightMouseButtonSavedSelection &&
> > +      gRightMouseButtonSavedSelection.view) {
> 
> Aceman, would you like to lose the second line here?

Not check for .view? Yes, just null it all whenever 
gRightMouseButtonSavedSelection is not null.
As you wish ;-)
Attachment #8841444 - Attachment is obsolete: true
Attachment #8841444 - Flags: review?(acelists)
Attachment #8841663 - Flags: review?(acelists)
Comment on attachment 8841663 [details] [diff] [review]
1334874-leak-treeselection.patch - Aleth's original with added comment and minor tweaks - take 2

Review of attachment 8841663 [details] [diff] [review]:
-----------------------------------------------------------------

Let's try it :)
Attachment #8841663 - Flags: review?(acelists) → review+
https://hg.mozilla.org/comm-central/rev/a4027b8e95e6c8a850e22e2e142e974fd910c8af

Additional landings from comment #122:
https://hg.mozilla.org/comm-central/rev/495a38b840265114c336b51877d0e3477dd524e8
Bug 1334874 - Fix leak in MaildirStoreParser. r=mkmelin
https://hg.mozilla.org/comm-central/rev/752aface26cafc254abf8871b6bc479cded5fcca
Bug 1334874 - Fix nsMsgTemplateReplyHelper leak as detected by test_autoReply.js. r=mkmelin
https://hg.mozilla.org/comm-central/rev/dae4f5ac532c2f224bd2f900c6795731a898c596
Bug 1334874 - Fix database leak in nsMsgLocalMailFolder::UpdateFolder. r=jorgk

Those three are in Aurora (TB 53) and Beta (TB 52).
Status: REOPENED → RESOLVED
Closed: 7 years ago7 years ago
Resolution: --- → FIXED
Comment on attachment 8841663 [details] [diff] [review]
1334874-leak-treeselection.patch - Aleth's original with added comment and minor tweaks - take 2

Let's uplift this to fix the leaks on the branches.
Attachment #8841663 - Flags: approval-comm-beta?
Attachment #8841663 - Flags: approval-comm-aurora?
Attachment #8841663 - Flags: approval-comm-aurora? → approval-comm-aurora+
Attachment #8841663 - Flags: approval-comm-esr52?
Attachment #8841663 - Flags: approval-comm-beta?
Attachment #8841663 - Flags: approval-comm-esr52? → approval-comm-esr52+
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: