[SoC] Integration of Thunderbird with Vista Desktop Search

RESOLVED INCOMPLETE

Status

defect
13 years ago
11 years ago

People

(Reporter: chofmann, Assigned: dommy.fdo)

Attachments

(2 attachments, 3 obsolete attachments)

Tracking bug for SoC project status
OS: Windows XP → Windows Vista
Depends on: 369283
Whether it's coincidence or causation, when I was trying to close out the tracker bugs from 2006, I noticed that the more successful projects were also the ones where the person doing the project actually knew about things like their tracking bug.

Damitha, meet your bug :)
Damitha and I were just talking about this over IM and email, and his first homework assignment was to create a Bugzilla account and assign this bug to himself after I elevated his Bugzilla privileges. :)

Damitha, meet Phil who is a huge help & valuable resource :)
Assignee: mscott → dommy.fdo
Posting Damitha's weekly status report:

"Managed to checkout the trunk TB source and do a build. Also reassigned the ownership of the SoC tracking bug (bug 377249 <https://bugzilla.mozilla.org/show_bug.cgi?id=377249>) to myself.

And about the Vista search integration: what we have to do is extend the capability of Vista desktop search. Though there are several ways to make Vista search aware of custom data sources, we would be better off using IFilters <http://msdn2.microsoft.com/en-us/library/ms691105.aspx>, which are more powerful and flexible. So we need to create a COM (MS) component that exposes several functions to the search tool (defined by the IFilter interface). Once our component is properly registered, Vista search will load that module and use it to get data from our mails. So we are the ones who have to strip text from messages and feed it to the Vista indexing tool.

We can either get data directly from the files or use a stream from messages in TB (from what I've seen on mozilla.dev.apps.thunderbird, it's possible to get a stream from messages?). We will discuss such issues in the newsgroup pretty soon.

Also, from our discussion, I'm thinking of making this a part of TB, since that will make future development of the module easier (let it get built into a separate *.dll and register it if installed on Vista)."
Damitha, I stumbled on this while researching IFilters tonight:
http://www.ifilter.org/links_email.htm

I wonder why this company went the route of a protocol handler instead of an IFilter for e-mail. Do you know what the trade-offs are between an IFilter and a protocol handler?

With IFilters, I wonder how Vista search will know to associate our Berkeley mail folders with our IFilter implementation. Mailboxes don't have file extensions associated with them, so we can't register to take over files with a given extension. David, did you have to solve this problem for Spotlight? Or was it a non-issue because Spotlight integration requires us to write out an XML file for each msg, which gets indexed by Spotlight?
Right, I forgot about that. It's a non-issue for Spotlight because we're writing out the .mozeml files for each message. The lack of file extension for mailboxes is what led me to think we'd need to do the separate msg per file approach for Vista as well, but I was looking at the IFilter approach at the time. I don't know about the protocol handler approach.

Making our mailboxes have a file extension is an option. Perhaps we could make it truly optional: we would still display folders that didn't have the file extension, but we would create new folders with the extension and strip it off when creating the display name for the folder.
It's really great to see the number of people out here to help me with this project (help which I will be needing a lot ;)). So let's get started with my first comment in Bugzilla!!!

Summary on extending Windows Vista search:

Vista search can be extended to index both new file types and data stores such as databases. 

To add support for new file types, we need to create a new Property Handler, i.e. use either IPropertyStore (if the file format is not very complicated) or IFilter (otherwise). So IFilter is used to extract text from a complex file format. A Property Handler can only be associated with a specific file type (e.g. .pdf, .html).

On the other hand, if we have a data store such as a database (or the mbox mail format that TB uses), we need to create a Protocol Handler (using ISearchProtocol and IUrlAccessor). If our custom data store has files embedded inside, we also need an appropriate filter to extract the text.

This might answer Scott's question about the use of the Protocol Handler, and yes! I made a bit of a mess of understanding IFilters :( .
Also, that particular add-in is for Windows Desktop Search 2.5, and what we are dealing with is version 3.x, which is not backward compatible.
Approaches to getting Thunderbird mail indexed

I. Migrate to a maildir-like mailbox format with the ".eml" extension, which is used by Windows Mail and supported by Vista search (I was able to search for that "welcome to windows mail" mail even though it was opened with TB, since .eml is registered with TB). But this idea is ruled out because of the substantial amount of work (not to mention that I would have to give up on my GSoC project ;) ).

II. Using the same approach as in Spotlight where we create an xml file for each mail that is received.

Advantages:
1. Ability to use a substantial amount of code from Spotlight work.
2. Creation of the property handler to be used with the xml file would be simple.
3. There will be no file locking issues. (Once the .xml file is written, TB would have no interest in the file, except maybe to delete it if the original message is deleted.)

Disadvantages:
1. Controlling the indexing options will be hard. (Might have to create a folder structure that maps the multiple mail folders and set the folders to be indexed in Vista search as appropriate.)
2. Repetition of data
3. Wrath of the developer community. (Saw how angry they were when the one-per-file restriction was found in Spotlight integration ;) )

III. Create a protocol handler
Since Windows Vista search supports indexing custom data stores, we can work on TB's mail folders as they are.

Advantages:
1. No repetition of data
2. Efficient
3. Higher controllability
4. Still use spotlight code to strip out text from messages.

Disadvantages:
1. If we work on the mail folders directly, there will be file locking issues. On the other hand, if we rely on TB to stream out mail, then TB has to be running for indexing to occur (which is not desirable).

Microsoft Outlook seems to use a similar method (with a single file for all folders).
We may be able to use the mail folders directly and overcome the locking issue by making Windows search back off when TB is open (who wants to search TB mail via Vista search while TB itself is open? Debatable).


Feel free to add to this.
It's been a bit since my last comment, so here's the update...

If a protocol handler is used, Windows doesn't care what the actual content source is (a database, a file, local, remote...) as long as we can handle that protocol, get data from that source and give it to the search. Instead of a file name, each unit of data (a mail message in our case) is identified using a URL, so we need to map between the mails and the URLs. (A new URL is supposed to follow the format <protocol>:// [{user SID}/] <localhost>/<path>/[<ItemID>], so we will come up with something similar to mozilla.mail://username/inbox/msg0001.)
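The mapping between mails and URLs could look roughly like the sketch below. It is only an illustration: the `mozilla.mail://` scheme and the user/folder/item layout are taken from the example URL above, while `MailUrl`, `buildMailUrl` and `parseMailUrl` are hypothetical names. The optional user SID segment and any URL escaping are ignored.

```cpp
#include <string>

// Hypothetical value type: the pieces of one message URL.
struct MailUrl {
    std::string user;    // account or profile name
    std::string folder;  // mail folder, e.g. "inbox" (may contain '/')
    std::string itemId;  // per-message identifier, e.g. "msg0001"
};

// Assumed scheme prefix, copied from the example URL in the comment above.
static const std::string kMailScheme = "mozilla.mail://";

// Compose "mozilla.mail://<user>/<folder>/<itemId>".
std::string buildMailUrl(const MailUrl& u) {
    return kMailScheme + u.user + "/" + u.folder + "/" + u.itemId;
}

// Split a URL back into its parts. Returns false if the URL does not use
// the assumed scheme or has fewer than three path segments. The item id
// is taken to be the last segment, the user the first one.
bool parseMailUrl(const std::string& url, MailUrl& out) {
    if (url.compare(0, kMailScheme.size(), kMailScheme) != 0) return false;
    const std::string rest = url.substr(kMailScheme.size());
    const std::string::size_type firstSlash = rest.find('/');
    const std::string::size_type lastSlash = rest.rfind('/');
    if (firstSlash == std::string::npos || firstSlash == lastSlash) return false;
    out.user = rest.substr(0, firstSlash);
    out.folder = rest.substr(firstSlash + 1, lastSlash - firstSlash - 1);
    out.itemId = rest.substr(lastSlash + 1);
    return !out.user.empty() && !out.folder.empty() && !out.itemId.empty();
}
```

A crawler-facing component would call `parseMailUrl` on each URL it is handed and use the folder and item id to look the message up.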

The Protocol Handler deals with retrieving items from a proprietary content source. These in turn have to be filtered before being given to the indexer. Since a unit of data in our case is similar to a .eml file, we can use the default .eml filter which comes with Windows.

This article has an intro to protocol handlers: http://msdn2.microsoft.com/en-us/library/ms965732.aspx
I've been working on a sample protocol handler that can index multiple data structures that reside in a single file (xml). It consists of two major parts: the overall handler, and the accessor that reads individual URLs (mails in our case). The idea is that the handler creates one accessor for each URL it receives from the crawler. These should be able to run on as many as 32 threads, so the common objects they use should be thread safe.

This threading issue has made me go through the theory on threading models supported by COM (MS) ;-)

One luxury that we have with regard to thread safety is that the protocol handler’s own threads won’t modify the mail files but simply read them (but we need to think about how to co-exist with TB and not stand in TB’s way).
Also, another problem to tackle (almost forgot to publish :) ): how to deal with IMAP accounts.

Since we need our protocol handler to be independent of Thunderbird (Thunderbird doesn’t have to be running for indexing to happen), we will be working on the mail folders directly. Unlike with POP/local accounts (where the message bodies reside on the local machine), we will not be able to get hold of IMAP messages easily.

The possible approaches:
I.    Code the PH to retrieve IMAP mails from the server (worst case ;) )

II.   Similar to the Spotlight approach, we can write tiny files to the disk and teach the PH to read them.

III.  Ask the user to configure folders for offline use, which will give us good old Berkeley folders. (As done in Google Desktop.)
I prefer III - might as well let the user take full advantage of the extra disk space we're using. And instead of asking, I would just do it - tell the user that we will configure all their folders for offline use. I think we might also want to turn on autosync for each IMAP server as well, so that we'll download the message bodies automatically, instead of having the user do it explicitly.
Almost done with a standalone app that walks through the mail folder (and hopefully I will be able to upload the code today itself ;-) ).
At initialization, this will run through the mail folder, building up a database of the messages (identified using the Message-ID: header) with the offset to the start of each message and its length. Using this information, the crawler threads (in the protocol handler) can extract a mail in a snap and parse the extracted mail within their own threads.
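A rough sketch of that indexing pass is below. It is not the attached parser: `MessageLocation`, `MessageIndex` and `buildIndex` are hypothetical names, the folder is passed in as an in-memory string rather than read from disk, and it assumes every line starting with "From " begins a new message and that the header is spelled exactly "Message-ID:".

```cpp
#include <cstddef>
#include <map>
#include <string>

// Where one message lives inside an mbox-style folder file.
struct MessageLocation {
    std::size_t offset;  // byte offset of the message's "From " separator line
    std::size_t length;  // bytes up to the next separator (or end of data)
};

typedef std::map<std::string, MessageLocation> MessageIndex;

// Strip leading and trailing spaces, tabs and carriage returns.
static std::string trimBlanks(const std::string& s) {
    const std::string::size_type first = s.find_first_not_of(" \t\r");
    if (first == std::string::npos) return std::string();
    const std::string::size_type last = s.find_last_not_of(" \t\r");
    return s.substr(first, last - first + 1);
}

// Record one finished message; messages without a Message-ID are skipped.
static void recordMessage(MessageIndex& index, const std::string& id,
                          std::size_t start, std::size_t end) {
    if (id.empty()) return;
    MessageLocation loc;
    loc.offset = start;
    loc.length = end - start;
    index[id] = loc;
}

// Walk the folder line by line and index each message by Message-ID.
MessageIndex buildIndex(const std::string& mbox) {
    MessageIndex index;
    std::string currentId;
    std::size_t msgStart = 0;
    bool inMessage = false;
    bool inHeaders = false;
    std::size_t pos = 0;
    while (pos < mbox.size()) {
        std::size_t eol = mbox.find('\n', pos);
        if (eol == std::string::npos) eol = mbox.size();
        const std::string line = mbox.substr(pos, eol - pos);
        if (line.compare(0, 5, "From ") == 0) {
            // A separator line closes the previous message and opens a new one.
            if (inMessage) recordMessage(index, currentId, msgStart, pos);
            msgStart = pos;
            currentId.clear();
            inMessage = true;
            inHeaders = true;
        } else if (inHeaders) {
            if (trimBlanks(line).empty()) {
                inHeaders = false;  // blank line ends the header block
            } else if (line.compare(0, 11, "Message-ID:") == 0) {
                currentId = trimBlanks(line.substr(11));
            }
        }
        pos = eol + 1;
    }
    if (inMessage) recordMessage(index, currentId, msgStart, mbox.size());
    return index;
}
```

With such an index, extracting a message is a single `substr(offset, length)` (or a seek and read on the real file).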

Limitations:

This has no local folder management, so any mail folders have to be specified explicitly.

Might not comply 100% with RFC 2822.

The mail extraction portion (supposed to be accessed concurrently) is not thread safe yet, but it should be possible to make it so by using a critical section in the code.


The DB can easily be written to a file to avoid rebuilding the database repeatedly. But there will be issues when the user compacts the mailbox. Can we load this from a file and do a validity check by loading a couple of mails at random (or maybe the last few) and checking the message boundaries and Message-ID for integrity?
Status: NEW → ASSIGNED
Posted file Parser for the Berkeley folders (obsolete)
added the mail parser for the mail folders.

This is to be used in the protocol handler. 
The parser will build up an index of the mails (message id, offset to the message, length of the message). Also extraction of mails is supported (when a message id is provided).
This week I've been working on the following modules and plan to upload the code by Monday (Jul-23)


Local folders management
------------------------

I.e. detecting the location of the mail storage and iterating through the mail folders.
However, as a start, the profile location will be passed in, and the parser will then walk through the mail folders.

Writing the summary into a file
-------------------------------

To avoid having to parse the mail folder each time Vista search looks at TB, the built summary can be written to a file.
But this may get invalidated for several reasons (when the user compacts the mail folders, etc.). So we can validate the summary by checking that the message boundary and the message ID are still preserved.
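One way to express that validity check, as a sketch only: for a recorded (offset, length, Message-ID) entry, confirm that a "From " separator still sits at the recorded offset and that the same Message-ID is still in that message's headers. `entryStillValid` is a hypothetical name, the folder is an in-memory string, and the header is assumed to be written as "Message-ID: " followed by the id.

```cpp
#include <cstddef>
#include <string>

// Returns true if the summary entry still matches the folder contents.
// After a compaction the offsets shift, so either the boundary check or
// the Message-ID check is expected to fail for stale entries.
bool entryStillValid(const std::string& folder, std::size_t offset,
                     std::size_t length, const std::string& messageId) {
    // The recorded range must lie inside the folder data.
    if (offset > folder.size() || length > folder.size() - offset) return false;

    // The message boundary: "From " at the start of a line.
    if (folder.compare(offset, 5, "From ") != 0) return false;
    if (offset != 0 && folder[offset - 1] != '\n') return false;

    // The header block ends at the first empty line.
    const std::string message = folder.substr(offset, length);
    const std::string::size_type headerEnd = message.find("\n\n");
    const std::string headers = message.substr(0, headerEnd) + "\n";

    // The recorded Message-ID must still be one of the header lines.
    return headers.find("\nMessage-ID: " + messageId + "\n") != std::string::npos;
}
```

Checking only a few entries (say, the last few or a random handful) keeps this much cheaper than a full reparse.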
Detecting the local storage location has been a bit of a mess. From what I've seen, the user is free to choose the location, so it doesn't have to be C:\Users\<Windows user name>\AppData\Roaming\Thunderbird\Profiles\<Profile name>\ (in Windows Vista).

Possible solutions

I. Find some fixed location where profile data is kept and read the mail storage location out of that.

II. Have the profile manager (TB) write the location(s) into the registry.

The second option avoids having to parse a configuration file.

This link has a bit of info on the profile location:
http://kb.mozillazine.org/Profile_folder_-_Thunderbird
This can:
1. Parse a mail folder and build up an index
2. Iterate through mail folders in a given search location
3. Keep a mail summary (of its own, with CRC32 integrity checking)
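For reference, a CRC32 integrity check on a summary file might be sketched as follows. The checksum is the standard CRC-32 (reflected polynomial 0xEDB88320); the on-disk layout (eight hex digits, a newline, then the payload) and the names `crc32Of`, `sealSummary` and `verifySummary` are assumptions made for illustration, not the format used by the attached parser.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdio>
#include <string>

// Bitwise CRC-32 (IEEE 802.3), slow but short.
std::uint32_t crc32Of(const std::string& data) {
    std::uint32_t crc = 0xFFFFFFFFu;
    for (std::size_t i = 0; i < data.size(); ++i) {
        crc ^= static_cast<unsigned char>(data[i]);
        for (int bit = 0; bit < 8; ++bit) {
            if (crc & 1u) crc = (crc >> 1) ^ 0xEDB88320u;
            else crc >>= 1;
        }
    }
    return crc ^ 0xFFFFFFFFu;
}

// Format a checksum as eight lowercase hex digits.
static std::string toHex8(std::uint32_t value) {
    char buf[9];
    std::snprintf(buf, sizeof(buf), "%08x", static_cast<unsigned int>(value));
    return std::string(buf);
}

// Assumed layout: "<8 hex digits>\n<payload>".
std::string sealSummary(const std::string& payload) {
    return toHex8(crc32Of(payload)) + "\n" + payload;
}

// Returns false if the checksum line is missing or does not match.
bool verifySummary(const std::string& sealed, std::string& payloadOut) {
    if (sealed.size() < 9 || sealed[8] != '\n') return false;
    const std::string body = sealed.substr(9);
    if (toHex8(crc32Of(body)) != sealed.substr(0, 8)) return false;
    payloadOut = body;
    return true;
}
```

A failed `verifySummary` would simply trigger a full reparse of the folder.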
Attachment #272640 - Attachment is obsolete: true
I posted the earlier attachment without a file extension, so I am posting the same file again with the extension.
Attachment #277138 - Attachment is obsolete: true
Posted patch Mail Parser
Subtle changes and bug fixes to the previously submitted code.
Attachment #277142 - Attachment is obsolete: true
thanks for posting this Damitha! What's the best way for me to compile and test it? Cheers.
What's been done so far
-----------------------

To recap what has to be done: we need to create a protocol handler for Windows Desktop Search (WDS) that will enable it to go through the TB mails (where WDS will identify each mail using a URL). Then we can tell WDS to use the default shipped filter for .eml files with the mail retrieved using the PH (saving us from having to strip down messages and extract the useful searchable text within a mail).

The parser will build up an index of mails when a mail storage location is passed to it. Each mail folder is loaded in the following sequence.

- Try to locate and read the summary file (the summary file is checked for integrity using CRC32).
- If that succeeds, the loaded index is validated against the real mail folder to see whether the summary recorded by the last parse remains valid (i.e. the mail folder has not been compacted).
- If the summary is valid, look for newly arrived mail, parse it, load it into the DB and add it to the summary file.
- If any of the above steps fail, completely reparse the mail folder.

- Also supports extracting a mail (as a whole, to be given to Vista search). The implementation is thread safe, using critical sections.

The mail index is built in an initialization step and doesn't need to be thread safe.
But the mail extraction part needs to be, since the protocol handler will be bombarded with URLs by multiple threads.
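The split between single-threaded index building and locked extraction can be sketched like this. Note the substitution: the code uses a portable `std::mutex` in place of a Win32 critical section, and holds the folder in memory instead of seeking in a shared file handle (which is what the lock would really protect). `MailExtractor` and its members are hypothetical names.

```cpp
#include <cstddef>
#include <map>
#include <mutex>
#include <string>
#include <utility>

class MailExtractor {
public:
    explicit MailExtractor(const std::string& folderData) : data_(folderData) {}

    // Called only during single-threaded initialization, so no lock is taken.
    void addEntry(const std::string& messageId, std::size_t offset,
                  std::size_t length) {
        index_[messageId] = std::make_pair(offset, length);
    }

    // May be called from several crawler threads at once.
    bool extract(const std::string& messageId, std::string& out) const {
        std::lock_guard<std::mutex> guard(lock_);  // released on every return
        const std::map<std::string, Range>::const_iterator it =
            index_.find(messageId);
        if (it == index_.end()) return false;
        const std::size_t offset = it->second.first;
        const std::size_t length = it->second.second;
        // Reject entries that point outside the folder data.
        if (offset > data_.size() || length > data_.size() - offset) return false;
        out = data_.substr(offset, length);
        return true;
    }

private:
    typedef std::pair<std::size_t, std::size_t> Range;  // (offset, length)
    std::string data_;
    std::map<std::string, Range> index_;
    mutable std::mutex lock_;
};
```

Each accessor thread then parses its own copy of the extracted mail outside the lock, so the critical section stays short.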

What has to be done
-------------------

The storage location still has to be detected. As Scott and I discussed, we will be looking into profiles.ini (more here: http://kb.mozillazine.org/Profiles.ini_file), which will be in a fixed location, and using that we can locate the storage location. (I plan to handle the default profile first.)
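As a sketch of the profiles.ini lookup: the function below scans the INI text for `[ProfileN]` sections and returns the one marked `Default=1`, falling back to the first profile listed. `ProfileEntry` and `findDefaultProfile` are hypothetical names, the text is passed in as a string rather than read from disk, and resolving a relative `Path` against the profiles.ini directory is left out.

```cpp
#include <sstream>
#include <string>

struct ProfileEntry {
    std::string name;
    std::string path;
    bool isRelative;  // Path is relative to the directory holding profiles.ini
    bool isDefault;
    ProfileEntry() : isRelative(false), isDefault(false) {}
};

// Returns false if the text contains no usable [ProfileN] section.
bool findDefaultProfile(const std::string& iniText, ProfileEntry& result) {
    std::istringstream in(iniText);
    std::string line;
    ProfileEntry current, firstSeen;
    bool inProfile = false, haveFirst = false, haveDefault = false;

    // Runs whenever a section ends: remember the first profile and any default.
    auto finishSection = [&]() {
        if (!inProfile || current.path.empty()) return;
        if (!haveFirst) { firstSeen = current; haveFirst = true; }
        if (current.isDefault && !haveDefault) { result = current; haveDefault = true; }
    };

    while (std::getline(in, line)) {
        if (!line.empty() && line[line.size() - 1] == '\r') line.erase(line.size() - 1);
        if (!line.empty() && line[0] == '[') {
            finishSection();
            inProfile = (line.compare(0, 8, "[Profile") == 0);
            current = ProfileEntry();
            continue;
        }
        const std::string::size_type eq = line.find('=');
        if (!inProfile || eq == std::string::npos) continue;
        const std::string key = line.substr(0, eq);
        const std::string value = line.substr(eq + 1);
        if (key == "Name") current.name = value;
        else if (key == "Path") current.path = value;
        else if (key == "IsRelative") current.isRelative = (value == "1");
        else if (key == "Default") current.isDefault = (value == "1");
    }
    finishSection();

    // No profile is flagged as default: fall back to the first one listed.
    if (!haveDefault && haveFirst) { result = firstSeen; haveDefault = true; }
    return haveDefault;
}
```

The mail storage location itself would still have to be read from inside the profile directory this returns.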

Detection of deleted mails when parsing. Deleted mails are going to stay around until the mail folders are compacted, so we need some means to detect them.

Put the code together and create the COM component (since it's easier to code and test that way, the code has been developed using a standalone application).

The WDS support files and directory have to be added to the ignored file list of TB's parser.


Loads profiles from profiles.ini file.

iter_Profile profile;
// new returns a pointer, and the manager is used with -> below
ProfileManager *profileManager = new ProfileManager();
BOOL ret = profileManager->getDefaultProfile(profile);


This can be used to get the default mail folder.

Also has functions to retrieve other mail folders.
The parser now finds the mail folder locations on its own. Tested and worked fine with the default locations (wow).

Also, deleted mails are now ignored.

Compiling and testing:
As we are yet to build the COM module with this code, we cannot plug it straight into WDS, but we can test the code using a standalone app. Also, we need to add the summary files to the ignored file list of TB before testing it with the local mail folder.

i.e.
- the .wds_support/ directory
- .wsf files (WDS Summary File)

Any other suggestions for the directory name and file extension??
Attachment #277422 - Attachment is patch: true
Attachment #278211 - Attachment is patch: true
Changed the summary file location from "[mail storage location]/.wds_support/" to "[mail storage location]/.mozmsgs/" so that TB's ignored list of files need not be changed. But there could be a potential problem if a Vista user copies his mail profile onto a Mac or vice versa...

David: Is there any issue with Spotlight if a bunch of .wsf files hang around inside the /.mozmsgs directory when a user migrates from Vista to Mac?

Also, there won't be any issues the other way around (Mac->Vista), because the parser only looks for the particular .wsf file related to an actual mail folder (e.g. /inbox corresponds to /.mozmsgs/inbox.wsf).
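The folder-to-summary mapping in that example can be written down as a small helper. This is a sketch with a hypothetical name, `summaryPathFor`; it only handles the path string and assumes the summary sits in a `.mozmsgs` directory next to the mail folder file.

```cpp
#include <string>

// "<storage>/inbox" -> "<storage>/.mozmsgs/inbox.wsf"
std::string summaryPathFor(const std::string& folderPath) {
    // Accept both '/' and '\' as separators, since profiles live on Windows.
    const std::string::size_type slash = folderPath.find_last_of("/\\");
    std::string dir;
    std::string name = folderPath;
    if (slash != std::string::npos) {
        dir = folderPath.substr(0, slash + 1);  // keeps the trailing separator
        name = folderPath.substr(slash + 1);
    }
    return dir + ".mozmsgs/" + name + ".wsf";
}
```

Because the lookup goes from a mail folder to exactly one .wsf name, any other files in `.mozmsgs` are never opened by the parser.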
Damitha, I'm not sure what Spotlight will do with the .wsf files. It might just try to index them, if it notices they're text files, or more likely it will ignore them because there's no mdimporter, but it won't associate them with Thunderbird.
Flags: blocking-thunderbird3?
Should this be duped to bug 430614, the GSoC 2008 bug?
Blocks: 369283
No longer depends on: 369283
Nope.
Status: ASSIGNED → RESOLVED
Closed: 11 years ago
Flags: blocking-thunderbird3?
Resolution: --- → INCOMPLETE