Reindexing should save message metadata (junk-related, tags, etc.)

RESOLVED FIXED in Thunderbird 3.0b1

Status

MailNews Core
Database
RESOLVED FIXED
10 years ago
9 years ago

People

(Reporter: rkent, Assigned: rkent)

Tracking

Trunk
Thunderbird 3.0b1
Dependency tree / graph

Firefox Tracking Flags

(Not tracked)

Details

Attachments

(1 attachment, 5 obsolete attachments)

(Assignee)

Description

10 years ago
Currently, reindexing the message database loses any information that is not available in the message store. The main data affected are junk related. Change the reindexing process so that it saves any message metadata that would otherwise be lost.
Bug 392704 is for loss of tag by rebuild index when IMAP server doesn't support flag/keyword. Will this bug covers such IMAP case?
(Assignee)

Comment 2

10 years ago
(In reply to comment #1)
> Bug 392704 is for loss of tag by rebuild index when IMAP server doesn't support
> flag/keyword. Will this bug covers such IMAP case?
> 
I'll see if the same code can do both. If so, I'll try to fix that issue as well.
(Assignee)

Comment 3

10 years ago
Created attachment 334551 [details] [diff] [review]
Local folder reindexing support

This patch saves database values, but currently only supports reindexing of the local mail folder. I'm saving keywords in anticipation of eventually supporting saving of them during reindexing of IMAP mail folders that don't support custom flags. I'll do a followup patch to add IMAP support aft erhtis patch has landed.
Attachment #334551 - Flags: review?(bugzilla)

Comment 4

10 years ago
sorry, wasn't cc'ed on this.

Comment 5

10 years ago
Comment on attachment 334551 [details] [diff] [review]
Local folder reindexing support

I haven't looked at this real closely, but it would be really nice if the folder didn't have to get so involved in this process.  It also seems like you could just keep two db's open, and let the db's keep all the information, without copying it and maintaining it.
(Assignee)

Comment 6

10 years ago
(In reply to comment #5)
> (From update of attachment 334551 [details] [diff] [review])
> I haven't looked at this real closely, but it would be really nice if the
> folder didn't have to get so involved in this process.  It also seems like you
> could just keep two db's open, and let the db's keep all the information,
> without copying it and maintaining it.
>

I considered that, but I was concerned with the memory usage of keeping two databases open. What happens if you reindex a really large folder? jcranmer tells me mork keeps everything in memory. But I could be persuaded if you don't think this is a problem.

I didn't cc: you because I thought that you got Database component email anyway.

Comment 7

10 years ago
having two db's open doesn't bother me even a little. We do the same thing when you compact a folder.
(Assignee)

Comment 8

10 years ago
Then let me revamp this to keep two databases open.
(Assignee)

Updated

10 years ago
Attachment #334551 - Flags: review?(bugzilla)
(Assignee)

Comment 9

10 years ago
I tried to investigate an approach involving keeping two databases open, but that sure seems to me to be a lot more complicated and expensive than simply copying the values that I want into a temporary memory structure. Partly there's a large learning curve involved in understanding how to manage mork for the creation, renaming, and then destruction of a second database file.

In the case of compacting, an entire row of the database file must be copied, and that is easier to do with a second database. But in this case, I don't think we would want to copy an entire row. There is too much risk of carrying over the bad values that we are reindexing to rid ourselves of. So we need somewhere a list of values to carry over (tags and junk-related initially). Once we have that list, it is easy to save the values in a memory structure.

So from my perspective the use of a memory structure is easier for me to understand, uses less memory than keeping open a second database, plus is less risky. What is the argument in favor of using a second database?

But then we have the question of who owns the memory structure. I don't see what's wrong with having the folder own the object as is proposed here, as the folder is a stable general object that I will be able to use in both IMAP and local cases. We can't use the database to hold it, as that is being recreated. The closest existing example is the transferInfo object, which is held as a temporary in the local folder.

So can you help me to understand why keeping two databases open is a better approach, to motivate me to undergo the learning curve I'll need to implement that?

Comment 10

10 years ago
This approach isn't general or extensible - if core code or extensions add properties to message headers, they will still be lost. If we copy data between message headers, we don't need to know what the data is because mork is schema-less and we can just copy the cells/row. We *may* chose to blacklist certain well-known properties and not copy them, but I'd be hard-pressed to name them...
(Assignee)

Comment 11

10 years ago
Created attachment 339207 [details] [diff] [review]
Supports reindex of local folder using 2nd database

This is a work in progress, but now I have a question.

In this patch, I create a backup database file before manual reindexing, then it is deleted after reparsing. However, I seem to lose my database every few months on my system for reasons unknown. Presumably it detects an irrecoverable error during opening of the database, then reparses the file. This patch does not help there, as it only works during manual reindexing.

So here's my proposal: periodically (for example at shutdown) I could create a backup copy of the databases. If reparsing occurs, the code would use any existing backup database file. During manual reindexing, we would create the backup database file immediately before reparsing as in this patch.

Comments?
Attachment #334551 - Attachment is obsolete: true

Comment 12

10 years ago
I'm not sure you know why you lose your database every few months; my guess is that Thunderbird is deleting it for some reason, but I doubt that reason is the irrecoverable error; it's more likely that uid validity rolled on the server, or we decided something was internally inconsistent (e.g., bad threading). In either case, we're perfectly able to pull the per-msg data out that we care about. 

I think we need to extend your patch to the cases where we decide to blow away the database, not just when the user decides to explicitly blow away the database. I'd do that way before doing the backup db approach.
(Assignee)

Comment 13

10 years ago
I'm a little confused by your remarks:

"I think we need to extend your patch to the cases where we decide to blow away
the database, not just when the user decides to explicitly blow away the
database. I'd do that way before doing the backup db approach."

The backup db approach *is* the intended way to deal with the cases where we decide to blow away the database. In those cases, there was some problem accessing the regular database. I'm not sure that is the best time to try to duplicate the failed database in preparation for reparsing. So we need some sort of valid copy of the database available, which is why I propose keeping a backup copy around.

The alternative as I understand it would be to try to recover from the failed opening of the message database, and make a copy at that point in time. That seems to me to be unlikely to work.

Comment 14

10 years ago
We decide to blow away the db for things like the imap server saying uid validity has changed. That doesn't make the .msf file invalid, does it? All the local tags and junk scores, etc, are still correct, accessible, and associated with a message id, right? Or, if we have a thread with messages parented incorrectly, the junk scores, local tags, etc, are still valid.
(Assignee)

Comment 15

10 years ago
OK, let me take this patch and try to catch all of the cases where the local mailbox is reparsed. For the initiation of the second database at reparse, it's simply a matter of changing calls to ForceDBClosed to closeAndBackupFolderDB. I'll deal with the issues of IMAP in a separate patch (trying to keep them small if possible).

I was never quite sure if I needed to attach the backup database to a separate file (like I did in the patch) or if there is some way to simply keep a second database open in memory. Do I need the second file? The compact folder example seemed to use one.

Comment 16

10 years ago
Thx, Kent. I'm not sure what you mean about attaching the backup database to a file...there needs to be a .msf file on disk, the backing store.  There most likely does not need to be a corresponding folder/berkeley mailbox. Compact has one because it copies the messages themselves to a temp file, and then copies the temp file back over the original berkeley mailbox.  But I'm not sure how happy nsMailDatabase will be if there's no corresponding mailbox, because when flags change, it will try to update the mailbox. We shouldn't be changing flags in this situation, of course...
(Assignee)

Comment 17

10 years ago
(In reply to comment #16)
> I'm not sure what you mean about attaching the backup database to a
> file...there needs to be a .msf file on disk, the backing store.
>
If opening the database reads the entire mork database into memory, then at least in theory I could open one database, then open a second attached to the same file, and then use one to store the reparsed values, and the second store the backed up values. That's not what I did in the patch - I copied the msf file for the backup - and it seems to work. So if you're not saying "of course you don't need two files, one will do" then I'll leave it the way that it is.
(Assignee)

Comment 18

10 years ago
This will affect tags in some cases, plus any extra metadata that are added by extensions, so I am changing the description.
Summary: Reindexing should save junk-related metadata → Reindexing should save message metadata (junk-related, tags, etc.)
(Assignee)

Comment 19

10 years ago
Created attachment 341527 [details] [diff] [review]
Supports reindex, rename on Local and IMAP folders

This is a work in progress that now supports manual reindexing of Local and IMAP folders, as well as renaming of IMAP. I'm still trying to figure out the best places to create and remove the backup database, but overall it seems to work just fine.
Attachment #339207 - Attachment is obsolete: true
(Assignee)

Comment 20

10 years ago
Created attachment 342392 [details] [diff] [review]
Ready for review

This patch has been tested with 1) IMAP rename, 2) IMAP reindex, 3) IMAP UID change, 4) Mailbox reindex. There is no news support.

I think we should add the same approach to moves and copies in another bug, that is to copy the full message database record instead of just selected fields as is done now. That way an extension could use the local DB to add message metadata that would reliably stay with the message.

The key questions to answer are 1) when do we make the backup, 2) when do we open the backup, and 3) when do we close the backup. I tried to do the backup and open when the database was opened with getdatabase, but could not make that work. What I am doing seems to be done more deeply into the folder operations than I would prefer, but it seems to work.

We also really need a unique message identifier that could be used to sync messages. Using the message id, as is done here, is not optimal when there are copies of messages.
Attachment #341527 - Attachment is obsolete: true
Attachment #342392 - Flags: superreview?(bienvenu)
Attachment #342392 - Flags: review?(bienvenu)

Comment 21

10 years ago
Comment on attachment 342392 [details] [diff] [review]
Ready for review

argh, I lost a bunch of comments on this patch, so forgive me if I'm a bit terse.

Thx for the patch!

You need to rev uuids when you change interfaces:

+++ b/mailnews/base/public/nsIMsgFolder.idl
@@ -139,6 +139,15 @@ interface nsIMsgFolder : nsISupports {
   void setFilterList(in nsIMsgFilterList filterList);
 
   void ForceDBClosed ();
+  /**
+   * Close and backup a folder database prior to reparsing
+   *
+   * @param  newName  New name of the corresponding message folder.

and here:

diff --git a/mailnews/local/public/nsIMsgParseMailMsgState.idl b/mailnews/local/public/nsIMsgParseMailMsgState.idl
--- a/mailnews/local/public/nsIMsgParseMailMsgState.idl
+++ b/mailnews/local/public/nsIMsgParseMailMsgState.idl
@@ -53,6 +53,13 @@ interface nsIMsgParseMailMsgState : nsIS

Our style is to declare vars where they're used, and not before. E.g.,

+  nsresult rv;
+  ForceDBClosed();
+
+  // We only support backup for mail at the moment
+  if ( !(mFlags & nsMsgFolderFlags::Mail))
+    return NS_OK;
+
+  nsCOMPtr<nsILocalFile> folderPath;
+  rv = GetFilePath(getter_AddRefs(folderPath));

rv can be declared on the last line, instead of the first. It cuts down on the overall line count, and improves readability. There are a many instances of this in the patch that can be cleaned up.

don't need assignment to rv here:

+  rv = backupDBFile->Remove(PR_FALSE);
+  return rv;

+nsMsgDBFolder::GetBackupMsgDatabase(nsIMsgDatabase** aMsgDatabase)
+{
+  NS_ENSURE_ARG_POINTER(aMsgDatabase);
+  nsresult rv;
+  rv = OpenBackupMsgDatabase();
+  NS_ENSURE_SUCCESS(rv, rv);
+  if (!mBackupDatabase)
+    return NS_ERROR_FAILURE;
+
+  NS_ADDREF(*aMsgDatabase = mBackupDatabase);
   return NS_OK;
 }

What happens if mBackupDatabase is already set here? Should we return the cached copy? Does the folder really need to cache the mBackupDatabase? Could it rely on the caller to clean up the db it retrieves, like we do in other places in the code?

+  nsCOMPtr<nsILocalFile> path;
+  path = do_QueryInterface(pathIFile);
+  if (!path)
+    return NS_ERROR_NO_INTERFACE;

can be 
nsCOMPtr<nsILocalFile> path = do_QueryInterface(pathIFile, &rv);
NS_ENSURE_SUCCESS(rv, rv);
(Assignee)

Comment 22

10 years ago
Created attachment 343429 [details] [diff] [review]
Fixed Bienvenu's issues
Attachment #342392 - Attachment is obsolete: true
Attachment #342392 - Flags: superreview?(bienvenu)
Attachment #342392 - Flags: review?(bienvenu)
(Assignee)

Updated

10 years ago
Attachment #343429 - Flags: superreview?(bienvenu)
Attachment #343429 - Flags: review?(bienvenu)
(Assignee)

Comment 23

10 years ago
(In reply to comment #21)
> (From update of attachment 342392 [details] [diff] [review])
> 
> What happens if mBackupDatabase is already set here? Should we return the
> cached copy? Does the folder really need to cache the mBackupDatabase? Could it
> rely on the caller to clean up the db it retrieves, like we do in other places
> in the code?
> 

I now use the cached value, but it is done in OpenBackupMsgDatabase.

I considered the issue of not caching the backup database in the folder, even started to change it until I ran into problems. There are at least two issues. A minor one is that IMAP sets up a parser for each message, that needs the databases, so in at least the IMAP case I would need to cache it in the folder. But the bigger issue is deleting. There are various possible ways that the reparsing gets setup. I want to create the backup, if possible, at the moment when the issue is detected or a request made. Creating the backup needs to proceed as reliably as possible, which means that any open nsIMsgDatabase objects need to be closed so that file operations can proceed. That is hard to do if they are not centralized in a well-known location, like the folder object. Also, with the various async operations involved, it's hard for me to be 100% confident that I successfully close and delete the backup after it is needed - particularly since reindexing is likely to be done under conditions where there are problems. I think in the case of the regular databases you solve the problem by caching the database objects in a structure of some sort.

So this patch leaves the backup database cached in nsMsgDBFolder, which seems to me to be a fairly safe way to do this. But I am open to other suggestions.

Comment 24

10 years ago
trying this now

Comment 25

10 years ago
this looks pretty good, and seems to work fine (though I haven't tried it with a server that doesn't support user-defined keywords)

space after //, and after if -

+  //If that doesn't exist, then we have to create this directory
+  if(!pathIsDirectory)
+  {
+    PRBool pathExists;
+    path->Exists(&pathExists);
+    //If for some reason there's a file with the directory separator
+    //then we are going to fail.

technically, our rule is that if either the if or else clauses require braces, then both should have braces.  It's not my favorite rule, but I should at least mention it :-)
+  if (!newName.IsEmpty())
+    rv = backupDBDummyFolder->AppendNative(newName);
+  else // if newName is null, use the folder name
+  {

since you're not returning rvbackup, you can say

if (NS_FAILED(OpenBackupMsgDatabase())
{

twice, actually.

+        // A backup message database might have been created earlier, for example
+        // if the user requested a reindex. We'll use the earlier one if we can,
+        // otherwise we'll try to backup at this point.
+        nsresult rvbackup = OpenBackupMsgDatabase();
+        if (NS_FAILED(rvbackup))
+        {
+          CloseAndBackupFolderDB(EmptyCString());
+          rvbackup = OpenBackupMsgDatabase();
+          if (NS_FAILED(rvbackup))
+            mBackupDatabase = nsnull;
+        }
(Assignee)

Comment 26

10 years ago
Created attachment 345248 [details] [diff] [review]
fixed a few more nits

Added those few more requested changes, updated to latest trunk as well.
Attachment #343429 - Attachment is obsolete: true
Attachment #345248 - Flags: superreview?(bienvenu)
Attachment #345248 - Flags: review?(bienvenu)
Attachment #343429 - Flags: superreview?(bienvenu)
Attachment #343429 - Flags: review?(bienvenu)

Comment 27

10 years ago
Comment on attachment 345248 [details] [diff] [review]
fixed a few more nits

thx a lot, Kent
Attachment #345248 - Flags: superreview?(bienvenu)
Attachment #345248 - Flags: superreview+
Attachment #345248 - Flags: review?(bienvenu)
Attachment #345248 - Flags: review+
(Assignee)

Updated

10 years ago
Keywords: checkin-needed
Checked in: http://hg.mozilla.org/comm-central/rev/4fd3dba65a6e
Keywords: checkin-needed
Target Milestone: --- → Thunderbird 3.0b1
(Assignee)

Updated

9 years ago
Status: ASSIGNED → RESOLVED
Last Resolved: 9 years ago
Resolution: --- → FIXED

Updated

9 years ago
Blocks: 347263
You need to log in before you can comment on or make changes to this bug.