Closed Bug 723248 Opened 12 years ago Closed 12 years ago

add support for closing inactive databases (folders)

Categories

(MailNews Core :: Database, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED
Thunderbird 15.0

People

(Reporter: Bienvenu, Assigned: Bienvenu)

References

(Blocks 2 open bugs)

Details

(Keywords: memory-footprint, perf)

Attachments

(3 files, 1 obsolete file)

Because nsIMsgFolder caches open db's, it's possible to end up with db's open that aren't being used. We try to avoid this by nsIMsgFolder.msgDatabase = null; but we also leave special folders open, and sometimes db's get left open inadvertently, and some extensions don't know that they should be doing this. Keeping track of the last use time for a db and all the objects that come from the db is daunting, but it occurred to me that we can do well enough by tracking Commit and Open and a few other methods, and that the perfect is definitely the enemy of the good in this case. The attached patch instruments the db's last use time, and allows some other client code to close inactive db's. I'm still experimenting with the client code to close inactive db's, but I want to get the backend code started.
Attachment #593559 - Flags: superreview?(mbanner)
Attachment #593559 - Flags: review?(neil)
wsmwk, when finished, this should help with some memory bloat issues, as well as file handle limit issues.
Status: NEW → ASSIGNED
An alternative to using the timestamp might be to use a 'generation' identifier that lives in a global.  While the overhead of the PR_Now is likely only to be noticable in GetMsgHdrForKey, the use of the generation could aid in debugging since it could provide more useful info.

For example, we could reserve generation id's ending in 1 for autosync.  When autosync is doing something on the event loop, it boosts the generation id and then that is associated with the autosync.  When we go to expire the database, we can see that autosync is complicit in the database being left open by logging the generation identifier.  Autosync would boost the generation when it is done to something that does not blame it.

Updating the generation otherwise could be handled by the 'cron' process that handles retirement.  It could update the generation to the current timestamp times 10 or something like that.

This mechanism could potentially also be used to point blame at extensions, although causing the tagging to happen reliably is non-trivial without some help from the extension (or a framework it uses) or platform.
Thx, a generation identifier is an interesting idea. I hadn't intended to run the cron process frequently - I'm doing it every minute right now.  I'm really trying to solve the problem of gradual bloat, not the problem of an extension deciding to open the db for every folder in a tight loop - those extensions are beyond hope :-) I wasn't focused on blame, either, since I know most of the culprits, and there are definitely advantages to leaving db's open for a brief amount of time, under the theory that if the user does something that involves a particular folder, they're more likely to do something with that folder again in the not too distant future. And I'm trying to lessen the burden on client and extension code, and having them cooperate with the generation mechanism could increase it.

My first idea for closing db's was to have two thresholds - a db idle time pref in seconds, and a max number of db's open. Every time through the cron process, we close db's that have been idle more than the limit, and then if more db's are still open than the max, we close the ones that have been idle the longest (with perhaps a minimum idle time).

I'm not convinced that the PR_Now call in GetMsgHdrForKey will be noticeable, but I probably don't need it either - setting the last use time when client code accesses nsIMsgFolder.msgDatabase occurred to me later and handles the cases I was worrying about fairly well. I could use your idea of a global but keep the timestamp and make GetMsgHdrForKey set the global value in the db.

I was also planning on hooking up the code that figures out which db's to close to the memory pressure notifications.
Yeah, noticeable in this case was "anywhere on the radar"; I don't expect it to add up to anything meaningful.  If anything, I'm more concerned about the overhead of responding to well-meaning bug reporters who see Thunderbird being slow, run strace, and then report that gettimeofday() is probably the problem because it gets called a zillion times.  (Although apparently on x86_64 on linux there is a magic VDSO fast-path that can be used that just reads stuff out of allowed kernel space rather than touching glibc and issuing a syscall.)
Comment on attachment 593559 [details] [diff] [review]
implement last use time for msg db's

>+  /// an enumerator to iterate over the open dbs.
>+  readonly attribute nsISimpleEnumerator openDBs;
This doesn't seem to be return the open dbs in a useful way. Slightly more useful would be an nsIArray, since you could at least count them, and you get to avoid the ugly function-that-returns-an-xpcom-array syntax, which would otherwise have been my preference.
Attachment #593559 - Flags: review?(neil) → review+
Attached patch cumulative wip — — Splinter Review
I think I will change to an nsIArray, but this is a snapshot of what I have so far.
I switched to an nsIArray but that made the cycle collector really unhappy and crashy. I used strong refs when calling AppendElement, and it doesn't seem like the object that made it crash had been deleted. I'll do a clobber build, but it's surprising.
not sure what I'm doing wrong, but this crashes in the cycle collector after a while.

I added the deletes of the db arrays in the js module just to get the crash to occur faster, which it seems to.
Ugh, it seems as if we return an array of xpcom objects to js, we need to have the participants implement nsXPCOMCycleCollectionParticipant?

this code in nsCycleCollector:

    nsXPCOMCycleCollectionParticipant *cp;
    ToParticipant(child, &cp);

seems to cause problems when the QI fails, because

static inline void
ToParticipant(nsISupports *s, nsXPCOMCycleCollectionParticipant **cp)
{
    // We use QI to move from an nsISupports to an
    // nsXPCOMCycleCollectionParticipant, which is a per-class singleton helper
    // object that implements traversal and unlinking logic for the nsISupports
    // in question.
    CallQueryInterface(s, cp);

doesn't null out *cp, so if CallQI fails, we're left with garbage in cp.

I must be missing something, because that would seem like a rather large problem. Is our QI supposed to null out the out param before it returns an error? nsMsgDatabase implements it by hand, and perhaps the helper functions do it for you...I'll poke around a bit.
Andrew, I've noticed that the cycle collector will crash if the QI method of an object returns an error but doesn't null the out pointer. I can fix the offending code on my end to null the out pointer on error, but the cycle collector might want to protect against this as well. I'm cc'ing you since you seem to have been active in nsCycleCollector.cpp recently. I'm also wondering if it's possible that some of the other crash bugs in nsCycleCollector might be related to this issue.
Yes, that does sound really bad.  Basically, the QIs in the CC should probably check for an error and treat that as though a null was return?  I'm not very familiar with QIs, but I do recall that is the sort of behavior that's supposed to happen.  Thanks for letting me know, I'll file a new bug about it.
(In reply to Andrew McCreight [:mccr8] from comment #11)
> Yes, that does sound really bad.  Basically, the QIs in the CC should
> probably check for an error and treat that as though a null was return? 
Andrew M, thx for looking at this. Yes, static inline void
ToParticipant(nsISupports *s, nsXPCOMCycleCollectionParticipant **cp)

should check the rv of CallQueryInterface, and set cp to null if it's failure. Or just initialize *cp to nsnull before calling QI. 

I did a quick look through the mailnews and mozilla code, and the mailnews code is all OK (once I fixed nsMsgDatabase.cpp) and almost all the mozilla code looks fine, though there are a few suspicious things:

nsXPCWrappedJS::QueryInterface return an error without nulling the out param if !isValid().

AudioSession::QueryInterface() fails to null the out param on error.

same for ReadbackManagerD3D10::QueryInterface and COMMessageFilter::QueryInterface

nsXTFElementWrapper::QueryInterface is a little suspicious

(I don't know if any of these ever cause any actual problems, and the mailnews class that was broken is ancient code).
carrying forward Neil's r+, requesting sr for db changes, and r+ for mail changes.

I've put the code that does the actual aging in a Thunderbird js file loaded by the TB 3 pane code. I imagine SeaMonkey will want to be able to age open db's as well, and perhaps it would just like to use the same code, but I'd like to shake out any issues in TB-land.
Attachment #599290 - Attachment is obsolete: true
Attachment #599389 - Flags: superreview?(mbanner)
Attachment #599389 - Flags: review+
(In reply to David :Bienvenu from comment #12)
> (I don't know if any of these ever cause any actual problems, and the
> mailnews class that was broken is ancient code).

I talked to jst, and he said that code that returns an error should null out what it returns.  But maybe we should just check anyways...
(In reply to Andrew McCreight [:mccr8] from comment #14)

> I talked to jst, and he said that code that returns an error should null out
> what it returns.  But maybe we should just check anyways...

It definitely should, and almost everyone does. But the cycle collector might want to be a bit defensive about this since it may have to handle arbitrary xpcom objects, including 3rd party binary components.

I don't know enough about the cycle collector to know when we'll get into trouble with bad QI implementations. I know my case was with an nsIArray, and it's probably unlikely that things like audiosessions end up in nsArrays exposed to js.
(In reply to David :Bienvenu from comment #15)
> (In reply to Andrew McCreight [:mccr8] from comment #14)
> 
> > I talked to jst, and he said that code that returns an error should null out
> > what it returns.  But maybe we should just check anyways...
> 
> It definitely should, and almost everyone does.

My understanding of xpcom rules is that if you are returning/throwing an error code, you don't to touch the return/out parameter and it is up to the caller to check the error value. Exceptions to the rules can happen, but it has to be documented on the interface, and it isn't for QueryInterface...
(In reply to Mark Banner (:standard8) from comment #16)
> My understanding of xpcom rules is that if you are returning/throwing an
> error code, you don't to touch the return/out parameter and it is up to the
> caller to check the error value. Exceptions to the rules can happen, but it
> has to be documented on the interface, and it isn't for QueryInterface...

I think there's a pre-existing exception to that rule for QI, and the example code about how to write a good QI impl nulls the out param, and calls out the fact that it's doing so. And, all the NS_IMPL macros null the out param before doing anything else.
There has been disagreement about what to do with outparams in failure cases. The current common practice is:

1) you may change an outparam to null and fail
2) you must not set an interface outparam to nonnull and then fail
3) some interfaces document that outparams are always nulled, and I believe that xpconnect will always null them
Blocks: 347837
Keywords: footprint, perf
Summary: add support for closing inactive databases → add support for closing inactive databases (folders)
Blocks: 722183
Comment on attachment 593559 [details] [diff] [review]
implement last use time for msg db's

Assuming this patch is obsolete.
Attachment #593559 - Flags: superreview?(mbanner)
Comment on attachment 599389 [details] [diff] [review]
fix for crash in cycle collector

Review of attachment 599389 [details] [diff] [review]:
-----------------------------------------------------------------

sr=me with the comments addressed. Not sure if we might want to land this after the merge to allow a bit more testing on it.

::: mail/base/modules/msgDBCacheManager.js
@@ +1,2 @@
> +/* ***** BEGIN LICENSE BLOCK *****
> + *   Version: MPL 2.0

The MPL 2 license block should just be three lines:

http://www.mozilla.org/MPL/headers/

::: mailnews/base/prefs/content/amUtils.js
@@ +75,2 @@
>          var directoryAlreadyUsed =
> +          messengerBundle.getFormattedString(

Is this a different bug or something? (it was bitrotted, and it doesn't seem related to this bug).
Attachment #599389 - Flags: superreview?(mbanner) → superreview+
I like the idea of landing after the merge. We can always decide push it into v14 at a later date.
(In reply to Wayne Mery (:wsmwk) from comment #21)
> I like the idea of landing after the merge. We can always decide push it
> into v14 at a later date.

yes, that's fine with me.
OS: Windows 7 → All
Hardware: x86_64 → All
fixed on trunk - http://hg.mozilla.org/comm-central/rev/83fd926bf1f8
Status: ASSIGNED → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Target Milestone: --- → Thunderbird 15.0
What is considered an idle DB here? Will this close all the relevant .msf files after the timeout?
Blocks: 231606
(In reply to :aceman from comment #24)
> What is considered an idle DB here? Will this close all the relevant .msf
> files after the timeout?

in the patch, you'll see two new prefs that you can use to control this:

+// How long should we leave idle db's open, in seconds.
+pref("mail.db.idle_limit", 3000);
+// How many db's should we leave open? LRU db's will be closed first
+pref("mail.db.max_open", 30);
Depends on: 750080
Blocks: 770559
Blocks: 526568
Blocks: 671613
No longer blocks: TB2SM
Blocks: 495911
This has an impact which I suspect was not anticipated and which breaks at least some code (including my add-on, Send Later 3, which has 25,000 users).

See:

https://groups.google.com/d/topic/mozilla.dev.apps.thunderbird/_WnlEWM5Y0o/discussion
(In reply to Jonathan Kamens from comment #27)
> This has an impact which I suspect was not anticipated and which breaks at
> least some code (including my add-on, Send Later 3, which has 25,000 users).
> 
> See:
> 
> https://groups.google.com/d/topic/mozilla.dev.apps.thunderbird/_WnlEWM5Y0o/
> discussion

"I think the logic in nsMsgLocalMailFolder::UpdateFolder is
probably wrong, and a FolderLoaded event should be generated
when a folder is opened, not just when it is updated"

file a bug?
Yes, please file a new bug for new issues. Commenting on old bugs will likely mean your comment will be lost over time.
I think he filed bug 787557 for it.
Depends on: 787557
Depends on: 793455
Depends on: 794401
Depends on: 876548
Depends on: 1135310
(In reply to Andrew McCreight [:mccr8] from comment #11)
> Yes, that does sound really bad.  Basically, the QIs in the CC should
> probably check for an error and treat that as though a null was return?  I'm
> not very familiar with QIs, but I do recall that is the sort of behavior
> that's supposed to happen.  Thanks for letting me know, I'll file a new bug
> about it.

Was such a bug filed?  Or was it decided no bug was needed?  (comment 14, etc)
Flags: needinfo?(continuation)
(In reply to Wayne Mery (:wsmwk) from comment #31)
> Was such a bug filed?  Or was it decided no bug was needed?  (comment 14,
> etc)

Oh, no, I'd totally forgotten about that.  Thanks for the reminder!  I filed bug 1135773 for making the CC more robust, and bug 1135772 for fixing some of the broken QIs that were pointed out.
Flags: needinfo?(continuation)
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: