Last Comment Bug 723248 - add support for closing inactive databases (folders)
: add support for closing inactive databases (folders)
Status: RESOLVED FIXED
: footprint, perf
Product: MailNews Core
Classification: Components
Component: Database (show other bugs)
: Trunk
: All All
: -- normal (vote)
: Thunderbird 15.0
Assigned To: David :Bienvenu
:
Mentors:
Depends on: 750080 787557 793455 794401 876548 1135310
Blocks: 347837 495911 722183 231606 526568 671613 770559
  Show dependency treegraph
 
Reported: 2012-02-01 12:23 PST by David :Bienvenu
Modified: 2015-02-23 11:35 PST (History)
9 users (show)
See Also:
Crash Signature:
(edit)
QA Whiteboard:
Iteration: ---
Points: ---


Attachments
implement last use time for msg db's (16.80 KB, patch)
2012-02-01 12:23 PST, David :Bienvenu
neil: review+
Details | Diff | Splinter Review
cumulative wip (27.31 KB, patch)
2012-02-05 18:35 PST, David :Bienvenu
no flags Details | Diff | Splinter Review
nsIArray approach that crashes in cycle collector (18.09 KB, patch)
2012-02-21 12:06 PST, David :Bienvenu
no flags Details | Diff | Splinter Review
fix for crash in cycle collector (18.49 KB, patch)
2012-02-21 15:17 PST, David :Bienvenu
mozilla: review+
standard8: superreview+
Details | Diff | Splinter Review

Description David :Bienvenu 2012-02-01 12:23:46 PST
Created attachment 593559 [details] [diff] [review]
implement last use time for msg db's

Because nsIMsgFolder caches open db's, it's possible to end up with db's open that aren't being used. We try to avoid this by nsIMsgFolder.msgDatabase = null; but we also leave special folders open, and sometimes db's get left open inadvertently, and some extensions don't know that they should be doing this. Keeping track of the last use time for a db and all the objects that come from the db is daunting, but it occurred to me that we can do well enough by tracking Commit and Open and a few other methods, and that the perfect is definitely the enemy of the good in this case. The attached patch instruments the db's last use time, and allows some other client code to close inactive db's. I'm still experimenting with the client code to close inactive db's, but I want to get the backend code started.
Comment 1 David :Bienvenu 2012-02-01 12:28:00 PST
wsmwk, when finished, this should help with some memory bloat issues, as well as file handle limit issues.
Comment 2 Andrew Sutherland [:asuth] 2012-02-01 13:18:50 PST
An alternative to using the timestamp might be to use a 'generation' identifier that lives in a global.  While the overhead of the PR_Now is likely only to be noticable in GetMsgHdrForKey, the use of the generation could aid in debugging since it could provide more useful info.

For example, we could reserve generation id's ending in 1 for autosync.  When autosync is doing something on the event loop, it boosts the generation id and then that is associated with the autosync.  When we go to expire the database, we can see that autosync is complicit in the database being left open by logging the generation identifier.  Autosync would boost the generation when it is done to something that does not blame it.

Updating the generation otherwise could be handled by the 'cron' process that handles retirement.  It could update the generation to the current timestamp times 10 or something like that.

This mechanism could potentially also be used to point blame at extensions, although causing the tagging to happen reliably is non-trivial without some help from the extension (or a framework it uses) or platform.
Comment 3 David :Bienvenu 2012-02-01 17:15:44 PST
Thx, a generation identifier is an interesting idea. I hadn't intended to run the cron process frequently - I'm doing it every minute right now.  I'm really trying to solve the problem of gradual bloat, not the problem of an extension deciding to open the db for every folder in a tight loop - those extensions are beyond hope :-) I wasn't focused on blame, either, since I know most of the culprits, and there are definitely advantages to leaving db's open for a brief amount of time, under the theory that if the user does something that involves a particular folder, they're more likely to do something with that folder again in the not too distant future. And I'm trying to lessen the burden on client and extension code, and having them cooperate with the generation mechanism could increase it.

My first idea for closing db's was to have two thresholds - a db idle time pref in seconds, and a max number of db's open. Every time through the cron process, we close db's that have been idle more than the limit, and then if more db's are still open than the max, we close the ones that have been idle the longest (with perhaps a minimum idle time).

I'm not convinced that the PR_Now call in GetMsgHdrForKey will be noticeable, but I probably don't need it either - setting the last use time when client code accesses nsIMsgFolder.msgDatabase occurred to me later and handles the cases I was worrying about fairly well. I could use your idea of a global but keep the timestamp and make GetMsgHdrForKey set the global value in the db.

I was also planning on hooking up the code that figures out which db's to close to the memory pressure notifications.
Comment 4 Andrew Sutherland [:asuth] 2012-02-01 17:28:38 PST
Yeah, noticeable in this case was "anywhere on the radar"; I don't expect it to add up to anything meaningful.  If anything, I'm more concerned about the overhead of responding to well-meaning bug reporters who see Thunderbird being slow, run strace, and then report that gettimeofday() is probably the problem because it gets called a zillion times.  (Although apparently on x86_64 on linux there is a magic VDSO fast-path that can be used that just reads stuff out of allowed kernel space rather than touching glibc and issuing a syscall.)
Comment 5 neil@parkwaycc.co.uk 2012-02-03 16:02:57 PST
Comment on attachment 593559 [details] [diff] [review]
implement last use time for msg db's

>+  /// an enumerator to iterate over the open dbs.
>+  readonly attribute nsISimpleEnumerator openDBs;
This doesn't seem to be return the open dbs in a useful way. Slightly more useful would be an nsIArray, since you could at least count them, and you get to avoid the ugly function-that-returns-an-xpcom-array syntax, which would otherwise have been my preference.
Comment 6 David :Bienvenu 2012-02-05 18:35:56 PST
Created attachment 594613 [details] [diff] [review]
cumulative wip

I think I will change to an nsIArray, but this is a snapshot of what I have so far.
Comment 7 David :Bienvenu 2012-02-21 07:35:33 PST
I switched to an nsIArray but that made the cycle collector really unhappy and crashy. I used strong refs when calling AppendElement, and it doesn't seem like the object that made it crash had been deleted. I'll do a clobber build, but it's surprising.
Comment 8 David :Bienvenu 2012-02-21 12:06:39 PST
Created attachment 599290 [details] [diff] [review]
nsIArray approach that crashes in cycle collector

not sure what I'm doing wrong, but this crashes in the cycle collector after a while.

I added the deletes of the db arrays in the js module just to get the crash to occur faster, which it seems to.
Comment 9 David :Bienvenu 2012-02-21 13:32:33 PST
Ugh, it seems as if we return an array of xpcom objects to js, we need to have the participants implement nsXPCOMCycleCollectionParticipant?

this code in nsCycleCollector:

    nsXPCOMCycleCollectionParticipant *cp;
    ToParticipant(child, &cp);

seems to cause problems when the QI fails, because

static inline void
ToParticipant(nsISupports *s, nsXPCOMCycleCollectionParticipant **cp)
{
    // We use QI to move from an nsISupports to an
    // nsXPCOMCycleCollectionParticipant, which is a per-class singleton helper
    // object that implements traversal and unlinking logic for the nsISupports
    // in question.
    CallQueryInterface(s, cp);

doesn't null out *cp, so if CallQI fails, we're left with garbage in cp.

I must be missing something, because that would seem like a rather large problem. Is our QI supposed to null out the out param before it returns an error? nsMsgDatabase implements it by hand, and perhaps the helper functions do it for you...I'll poke around a bit.
Comment 10 David :Bienvenu 2012-02-21 13:49:42 PST
Andrew, I've noticed that the cycle collector will crash if the QI method of an object returns an error but doesn't null the out pointer. I can fix the offending code on my end to null the out pointer on error, but the cycle collector might want to protect against this as well. I'm cc'ing you since you seem to have been active in nsCycleCollector.cpp recently. I'm also wondering if it's possible that some of the other crash bugs in nsCycleCollector might be related to this issue.
Comment 11 Andrew McCreight [:mccr8] 2012-02-21 14:11:37 PST
Yes, that does sound really bad.  Basically, the QIs in the CC should probably check for an error and treat that as though a null was return?  I'm not very familiar with QIs, but I do recall that is the sort of behavior that's supposed to happen.  Thanks for letting me know, I'll file a new bug about it.
Comment 12 David :Bienvenu 2012-02-21 14:34:43 PST
(In reply to Andrew McCreight [:mccr8] from comment #11)
> Yes, that does sound really bad.  Basically, the QIs in the CC should
> probably check for an error and treat that as though a null was return? 
Andrew M, thx for looking at this. Yes, static inline void
ToParticipant(nsISupports *s, nsXPCOMCycleCollectionParticipant **cp)

should check the rv of CallQueryInterface, and set cp to null if it's failure. Or just initialize *cp to nsnull before calling QI. 

I did a quick look through the mailnews and mozilla code, and the mailnews code is all OK (once I fixed nsMsgDatabase.cpp) and almost all the mozilla code looks fine, though there are a few suspicious things:

nsXPCWrappedJS::QueryInterface return an error without nulling the out param if !isValid().

AudioSession::QueryInterface() fails to null the out param on error.

same for ReadbackManagerD3D10::QueryInterface and COMMessageFilter::QueryInterface

nsXTFElementWrapper::QueryInterface is a little suspicious

(I don't know if any of these ever cause any actual problems, and the mailnews class that was broken is ancient code).
Comment 13 David :Bienvenu 2012-02-21 15:17:22 PST
Created attachment 599389 [details] [diff] [review]
fix for crash in cycle collector

carrying forward Neil's r+, requesting sr for db changes, and r+ for mail changes.

I've put the code that does the actual aging in a Thunderbird js file loaded by the TB 3 pane code. I imagine SeaMonkey will want to be able to age open db's as well, and perhaps it would just like to use the same code, but I'd like to shake out any issues in TB-land.
Comment 14 Andrew McCreight [:mccr8] 2012-02-21 16:22:20 PST
(In reply to David :Bienvenu from comment #12)
> (I don't know if any of these ever cause any actual problems, and the
> mailnews class that was broken is ancient code).

I talked to jst, and he said that code that returns an error should null out what it returns.  But maybe we should just check anyways...
Comment 15 David :Bienvenu 2012-02-21 16:37:27 PST
(In reply to Andrew McCreight [:mccr8] from comment #14)

> I talked to jst, and he said that code that returns an error should null out
> what it returns.  But maybe we should just check anyways...

It definitely should, and almost everyone does. But the cycle collector might want to be a bit defensive about this since it may have to handle arbitrary xpcom objects, including 3rd party binary components.

I don't know enough about the cycle collector to know when we'll get into trouble with bad QI implementations. I know my case was with an nsIArray, and it's probably unlikely that things like audiosessions end up in nsArrays exposed to js.
Comment 16 Mark Banner (:standard8) 2012-02-22 00:21:28 PST
(In reply to David :Bienvenu from comment #15)
> (In reply to Andrew McCreight [:mccr8] from comment #14)
> 
> > I talked to jst, and he said that code that returns an error should null out
> > what it returns.  But maybe we should just check anyways...
> 
> It definitely should, and almost everyone does.

My understanding of xpcom rules is that if you are returning/throwing an error code, you don't to touch the return/out parameter and it is up to the caller to check the error value. Exceptions to the rules can happen, but it has to be documented on the interface, and it isn't for QueryInterface...
Comment 17 David :Bienvenu 2012-02-22 07:43:21 PST
(In reply to Mark Banner (:standard8) from comment #16)
> My understanding of xpcom rules is that if you are returning/throwing an
> error code, you don't to touch the return/out parameter and it is up to the
> caller to check the error value. Exceptions to the rules can happen, but it
> has to be documented on the interface, and it isn't for QueryInterface...

I think there's a pre-existing exception to that rule for QI, and the example code about how to write a good QI impl nulls the out param, and calls out the fact that it's doing so. And, all the NS_IMPL macros null the out param before doing anything else.
Comment 18 Benjamin Smedberg [:bsmedberg] 2012-03-01 10:47:51 PST
There has been disagreement about what to do with outparams in failure cases. The current common practice is:

1) you may change an outparam to null and fail
2) you must not set an interface outparam to nonnull and then fail
3) some interfaces document that outparams are always nulled, and I believe that xpconnect will always null them
Comment 19 Mark Banner (:standard8) 2012-04-23 06:35:20 PDT
Comment on attachment 593559 [details] [diff] [review]
implement last use time for msg db's

Assuming this patch is obsolete.
Comment 20 Mark Banner (:standard8) 2012-04-23 06:57:12 PDT
Comment on attachment 599389 [details] [diff] [review]
fix for crash in cycle collector

Review of attachment 599389 [details] [diff] [review]:
-----------------------------------------------------------------

sr=me with the comments addressed. Not sure if we might want to land this after the merge to allow a bit more testing on it.

::: mail/base/modules/msgDBCacheManager.js
@@ +1,2 @@
> +/* ***** BEGIN LICENSE BLOCK *****
> + *   Version: MPL 2.0

The MPL 2 license block should just be three lines:

http://www.mozilla.org/MPL/headers/

::: mailnews/base/prefs/content/amUtils.js
@@ +75,2 @@
>          var directoryAlreadyUsed =
> +          messengerBundle.getFormattedString(

Is this a different bug or something? (it was bitrotted, and it doesn't seem related to this bug).
Comment 21 Wayne Mery (:wsmwk, NI for questions) 2012-04-23 08:25:03 PDT
I like the idea of landing after the merge. We can always decide push it into v14 at a later date.
Comment 22 David :Bienvenu 2012-04-23 08:29:10 PDT
(In reply to Wayne Mery (:wsmwk) from comment #21)
> I like the idea of landing after the merge. We can always decide push it
> into v14 at a later date.

yes, that's fine with me.
Comment 23 David :Bienvenu 2012-04-26 14:19:26 PDT
fixed on trunk - http://hg.mozilla.org/comm-central/rev/83fd926bf1f8
Comment 24 :aceman 2012-04-26 14:32:15 PDT
What is considered an idle DB here? Will this close all the relevant .msf files after the timeout?
Comment 25 David :Bienvenu 2012-04-26 14:33:13 PDT
(In reply to :aceman from comment #24)
> What is considered an idle DB here? Will this close all the relevant .msf
> files after the timeout?

in the patch, you'll see two new prefs that you can use to control this:

+// How long should we leave idle db's open, in seconds.
+pref("mail.db.idle_limit", 3000);
+// How many db's should we leave open? LRU db's will be closed first
+pref("mail.db.max_open", 30);
Comment 26 Wayne Mery (:wsmwk, NI for questions) 2012-04-28 08:16:03 PDT
user with 231 open db at http://forums.mozillazine.org/viewtopic.php?f=39&t=2459825&p=11943887#p11943887
Comment 27 Jonathan Kamens 2012-08-31 11:42:36 PDT
This has an impact which I suspect was not anticipated and which breaks at least some code (including my add-on, Send Later 3, which has 25,000 users).

See:

https://groups.google.com/d/topic/mozilla.dev.apps.thunderbird/_WnlEWM5Y0o/discussion
Comment 28 Wayne Mery (:wsmwk, NI for questions) 2012-08-31 11:51:20 PDT
(In reply to Jonathan Kamens from comment #27)
> This has an impact which I suspect was not anticipated and which breaks at
> least some code (including my add-on, Send Later 3, which has 25,000 users).
> 
> See:
> 
> https://groups.google.com/d/topic/mozilla.dev.apps.thunderbird/_WnlEWM5Y0o/
> discussion

"I think the logic in nsMsgLocalMailFolder::UpdateFolder is
probably wrong, and a FolderLoaded event should be generated
when a folder is opened, not just when it is updated"

file a bug?
Comment 29 Mark Banner (:standard8) 2012-09-03 00:39:49 PDT
Yes, please file a new bug for new issues. Commenting on old bugs will likely mean your comment will be lost over time.
Comment 30 :aceman 2012-09-03 00:41:34 PDT
I think he filed bug 787557 for it.
Comment 31 Wayne Mery (:wsmwk, NI for questions) 2015-02-21 00:20:33 PST
(In reply to Andrew McCreight [:mccr8] from comment #11)
> Yes, that does sound really bad.  Basically, the QIs in the CC should
> probably check for an error and treat that as though a null was return?  I'm
> not very familiar with QIs, but I do recall that is the sort of behavior
> that's supposed to happen.  Thanks for letting me know, I'll file a new bug
> about it.

Was such a bug filed?  Or was it decided no bug was needed?  (comment 14, etc)
Comment 32 Andrew McCreight [:mccr8] 2015-02-23 11:35:37 PST
(In reply to Wayne Mery (:wsmwk) from comment #31)
> Was such a bug filed?  Or was it decided no bug was needed?  (comment 14,
> etc)

Oh, no, I'd totally forgotten about that.  Thanks for the reminder!  I filed bug 1135773 for making the CC more robust, and bug 1135772 for fixing some of the broken QIs that were pointed out.

Note You need to log in before you can comment on or make changes to this bug.