Open Bug 1572000 Opened 5 years ago Updated 29 days ago

[meta] database backed global message index

Categories

(MailNews Core :: Database, task, P1)

Tracking

(Not tracked)

ASSIGNED

People

(Reporter: mkmelin, Assigned: mkmelin)

References

(Depends on 4 open bugs, Blocks 4 open bugs)

Details

(Keywords: meta)

Attachments

(1 file)

Currently folder indexing uses Mork, and the message index is per folder. Mork removal has been wanted for a long time (bug 453975), but when removing it from the folder index we should also take the opportunity to do the design properly and move to a global message index.

This would enable a conversation view of messages (which currently requires gloda, and gloda is not meant for that).

It would make issues like bug 43278 go away.

I believe for Gmail, we're downloading the same messages multiple times, because we don't know they're already in All Mail (duh!).

We need to figure out how storage of the actual message data should be handled: put it in the database, or keep it on the file system, or a combination where normalized/decoded content would be put into the database for quick searching and indexing and the raw data would be kept only for backup.

Primarily I think we should target IndexedDB for the database, since that is the web thing to do.

Blocks: 453975
Priority: -- → P3

Any reading material on the IndexedDB?

https://developer.mozilla.org/en-US/docs/Web/API/IndexedDB_API is one. If you want the spec, see http://w3c.github.io/IndexedDB/

In a nutshell, if you need a database and want to be using web technologies, this is what you need to use. Amongst the more important features, it can be used from Web Workers, so for instance you can have background workers fetching your mail from the server and putting it into the database, with no need to steal processing time from the UI thread. For all other solutions (like sqlite) you'd have to jump through multiple hoops to get that working in any kind of hacky way (if at all).

Do you plan to do a prototype to check the performance of the new solution? Mork is old and ugly, but it can handle huge folders in a reasonable time.

Yes we would have to ensure the performance is acceptable.
A complete one-to-one comparison to current state will be more difficult since current state isn't dealing with global, but if you keep it one folder only for comparisons, that should work.

Depends on: 11050

This is at the same time a good idea and a totally wrong idea.

Indeed, we do need to be able to index all messages.
And we do need to be using a Database.

But - the entire separation between "the messages", "an index for the messages" and "a backing database" with Thunderbird marrying them together or using one for the other - is myopic.

Messages should simply go in a message database. That's it. Enough said, full stop.

No more saying, "Oh, we need search ability X" or "indexing ability Y" - we just need a DBMS, a software system which has all that stuff. You put your messages in, and you're done (up to having effective access to its contents and facilities). I'm not saying existing document-oriented DBMSes have all the features we need - maybe they're missing some - but we definitely need lots and lots of what they already have.

I've been thinking about this quite a lot (in the context of folders and of pluggable mailstore), and here are my thoughts so far:

The very first step should be to document all the things that are stored in the messagedb.
There are a lot of protocol-specific and implementation-specific things stored in there.
For example:

  • mbox mailstores stash offsets to each message within the mbox file.
  • Maildir mailstores stash filenames.
  • IMAP folders stash the server-side name of the folder.
  • etc...

So I think the first chunk of work is to audit exactly what's going into the DB and to document it. I'm making a start on this now, noting down uses as I encounter them.

This is an issue when doing things like moving folders: e.g. what needs to be done to a message database when you move a folder from a local account to an IMAP account, say? Are there paths in the db that need to be patched up? What needs to be done with child folders? Can we even copy and reuse the db or does it need to be rebuilt from scratch? At the moment there's a whole heap of ad-hoc and slightly voodoo rules scattered around, and I think there are a lot of tricky little bugs stemming from this.

Beyond that...

Ideally it'd be good if we could tighten up the message-centric DB API (currently nsIMsgDatabase). It currently exposes a lot of general-purpose DB access. So it can be hard to track what stuff the various different systems are stashing in there.

I'm currently leaning toward the idea that the database should be owned by the PluggableMailstore. That's the bit which deals with the filesystem, and is already pretty co-dependent with the message database.
If there is a tighter messagedb API to be had, then it'd be nice if the boundary was at PluggableMailstore. This means we could be DB-agnostic, in the same way the messages themselves are already mailstore-agnostic.
And, as Eyal says, we gain the option to implement mailstores which use a unified database for both messages and their metadata.

@BenCampell : I believe you are (and some of our code is) conflating a message database with a message store. You write that "the message-centric DB API" we currently use is nsIMsgDatabase. But that API (ignoring its mix of higher-level and lower-level features) uses a specific schema with specific fields or flags one can set, and allows for very few of the features of a full-fledged DB: the ability to accept and execute/apply phrases in languages for data definition, data control, data manipulation and data querying, with these languages being reasonably expressive. That does not mean the API should be textual query/command-based, but nsIMsgDatabase certainly doesn't cover it. We obviously have some implicit or explicit APIs for accessing our message "database":

  • Virtual search folders - materialized views. [2]
  • The folder tree in the 3-pane window - a kind of a nested materialized view
  • Message filters - stored procedures. [3]
  • Triggers for message filters - DB triggers [4]
  • A Bulk-insert mechanism - routinely downloading messages from POP3 servers (or message headers from IMAP servers) and adding them into the DB. For IMAP it's perhaps more a bulk-upsert mechanism.

... except that these mechanisms are not conceived in terms of a database.
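
To make the analogy concrete, here is a rough sketch in Python using the standard-library sqlite3 module. Every table, view, trigger and column name in it is hypothetical and not taken from Thunderbird's code, and SQLite is only used because it is easy to run. It maps a saved search onto a view (a plain one, since SQLite does not materialize views), a message filter onto a trigger, and a header download onto a bulk upsert.

```python
import sqlite3

# Hypothetical schema, for illustration only; not Thunderbird's actual schema.
db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE messages (
    uri     TEXT PRIMARY KEY,
    folder  TEXT NOT NULL,
    sender  TEXT NOT NULL,
    subject TEXT NOT NULL,
    flagged INTEGER NOT NULL DEFAULT 0
);

-- A saved search ("virtual folder") expressed as a view. SQLite views are
-- not materialized, so the query is re-evaluated on every read.
CREATE VIEW flagged_everywhere AS
    SELECT uri, folder, subject FROM messages WHERE flagged = 1;

-- A message filter expressed as a trigger: flag anything from one sender.
CREATE TRIGGER flag_boss AFTER INSERT ON messages
WHEN NEW.sender = 'boss@example.com'
BEGIN
    UPDATE messages SET flagged = 1 WHERE uri = NEW.uri;
END;
""")


def bulk_upsert(rows):
    """Insert new headers, or replace rows whose uri is already known."""
    with db:  # one transaction for the whole batch
        db.executemany(
            "INSERT OR REPLACE INTO messages (uri, folder, sender, subject) "
            "VALUES (?, ?, ?, ?)",
            rows,
        )


bulk_upsert([
    ("imap://a/INBOX#1", "INBOX", "boss@example.com", "Budget"),
    ("imap://a/INBOX#2", "INBOX", "friend@example.com", "Lunch?"),
])
# Seeing message 2 again with a changed subject replaces the stored row.
bulk_upsert([("imap://a/INBOX#2", "INBOX", "friend@example.com", "Lunch today?")])

flagged = db.execute("SELECT uri FROM flagged_everywhere").fetchall()
total = db.execute("SELECT COUNT(*) FROM messages").fetchone()[0]
```

After the second batch the table still holds two rows, and only the message matched by the trigger shows up in the view.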

For this reason I don't think it's important to be DBMS agnostic [1] - the agnosticism we need is in support for different schemes of message storage (maildir, mork/mdb, IMAP). I may also need to qualify what I wrote earlier about document-oriented DBMSes: it appears most of them expect you to have contents in some format that they like, such as JSON or XML etc - while we have the complex MIME structure, with nested multiparts and alternative-parts. So it may end up being the case that - without implementing a complete DBMS - it could be better to use a relational DB which supports binary blobs, with MIME parts being such blobs; or maybe not, I'm not sure.

As for IndexedDB - I'm giving it the once-over, and it's not clear to me that it's sufficiently expressive to reduce enough of what TB is doing to operations through that interface. (But maybe it is and I just need to read through that document).

[1] - DB = the tables, the columns, their types, the constraints, and the data in the tables. DBMS = the software system for creating, managing, altering and querying DBs.
[2] - https://en.wikipedia.org/wiki/Materialized_view
[3] - https://en.wikipedia.org/wiki/Stored_procedure
[4] - https://en.wikipedia.org/wiki/Database_trigger

Guys, please move index, cache and other "not config" files OUT OF the profile folder.
Please move them to something like %localappdata% etc.

Leaving only config files in the profile folder would allow synchronizing the whole folder in realtime using Dropbox, Google Drive etc.
What I mean is - we want to put the profile folder under Dropbox or Google Drive in order to have a realtime backup.

Cache, db and index files only pollute the profile folder and make Thunderbird totally unusable in terms of realtime profile protection.

Should this "Global Database" also contain the newsgroup articles? Or should these articles be stored in a separate database?

Just my 1¢

Blocks: 43278

(In reply to Magnus Melin [:mkmelin] from comment #0)

It would make issues like bug 43278 go away.

OK, that's the answer. Never mind.

See Also: → 1717113
Severity: normal → S3

I've been working on this. I'll soon upload a WIP patch. While it works, it's not really ready for too much feedback yet and many things can and will change.

In a nutshell, the plan is to initialize the message index data from the Mork (.msf) files of the folders, and store this in a database of all messages (all message metadata). After that, keep the new database up to date with what gets written to the msf files. This dual-write is intended to be a temporary measure, with the new database fully replacing Mork as a second step. That second step requires a lot of work, so we can have the dual system in place soon to at least use for conversations until we're ready to fully switch over to the new database.
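
A rough sketch of that two-phase flow, in Python purely for illustration (the WIP patch itself is a .jsm). `MorkFolderDB`, `GlobalIndex`, the listener hook and the `folder#key` URI shape are all made-up stand-ins, not the real interfaces: the global index is first seeded from what a folder database already holds, then mirrors every later write.

```python
class MorkFolderDB:
    """Stand-in for a per-folder .msf database (hypothetical, in-memory)."""

    def __init__(self, folder_uri, headers=None):
        self.folder_uri = folder_uri
        self.headers = dict(headers or {})  # message key -> metadata dict
        self.listeners = []

    def write(self, key, metadata):
        # The folder database stays the primary store during the transition.
        self.headers[key] = metadata
        for listener in self.listeners:
            listener(self.folder_uri, key, metadata)


class GlobalIndex:
    """Stand-in for the new database holding metadata for all messages."""

    def __init__(self):
        self.messages = {}  # message uri -> metadata dict

    @staticmethod
    def uri_for(folder_uri, key):
        # Assumed URI shape, for the sketch only.
        return f"{folder_uri}#{key}"

    def import_folder(self, folder_db):
        # Step 1: seed the index from what the folder database already has.
        for key, metadata in folder_db.headers.items():
            self.on_write(folder_db.folder_uri, key, metadata)
        # Step 2: mirror every later write (the temporary dual-write).
        folder_db.listeners.append(self.on_write)

    def on_write(self, folder_uri, key, metadata):
        uri = self.uri_for(folder_uri, key)
        self.messages[uri] = dict(metadata, folder=folder_uri)


inbox = MorkFolderDB("mailbox://me/Inbox", {1: {"subject": "Hello"}})
index = GlobalIndex()
index.import_folder(inbox)
inbox.write(2, {"subject": "Re: Hello"})  # lands in both stores
```

Once the second step is done, the listener hook disappears and writes go to the global index directly.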

This enables us to display conversations spanning multiple folders, which is usually the case if you reply to something.

Assignee: nobody → mkmelin+mozilla
Status: NEW → ASSIGNED
Priority: P3 → P1

Is there a design spec/document for the new database? That would help with providing feedback without everyone having to read the patch/code - and also provide documentation for future maintainers.

This is storing the message meta in IndexedDB. As an object store, what is stored is following what the object looks like. Basically, the data is what we have at the moment. We store the message uri, and use that as key. Other additional things that need to be stored are folder and root (= which account the folder is in) since e.g. if you have correspondence between accounts those shouldn't be in the same conversation when you view it. Additionally, it does seem like we need to assign and store a threadId.
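
As a sketch of that record shape, here is an in-memory store in Python. This is not the code from the patch: the field names, the URI format and the `(root, thread_id)` secondary index are assumptions made for illustration. Records are keyed by message uri, and a conversation lookup is scoped to one account so that correspondence between two of your own accounts is not merged.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class MessageMeta:
    uri: str        # primary key of the record
    root: str       # account the folder belongs to
    folder: str
    thread_id: int
    subject: str


class InMemoryStore:
    """Hypothetical in-memory store, keyed by message uri."""

    def __init__(self):
        self._by_uri = {}
        # Secondary index: (root, thread_id) -> list of uris
        self._by_thread = defaultdict(list)

    def put(self, msg):
        old = self._by_uri.get(msg.uri)
        if old is not None:
            # Drop the stale index entry before re-adding the record.
            self._by_thread[(old.root, old.thread_id)].remove(old.uri)
        self._by_uri[msg.uri] = msg
        self._by_thread[(msg.root, msg.thread_id)].append(msg.uri)

    def get(self, uri):
        return self._by_uri.get(uri)

    def conversation(self, root, thread_id):
        uris = self._by_thread.get((root, thread_id), [])
        return [self._by_uri[u] for u in uris]


store = InMemoryStore()
store.put(MessageMeta("imap://work/INBOX#7", "work", "INBOX", 1, "Plan"))
store.put(MessageMeta("imap://work/Sent#3", "work", "Sent", 1, "Re: Plan"))
store.put(MessageMeta("imap://home/INBOX#9", "home", "INBOX", 1, "Plan"))

# The conversation spans two folders but stays within one account.
folders = [m.folder for m in store.conversation("work", 1)]
```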

The patch has an in-memory db implementation as well, as it was more convenient to experiment with that. Switching to the IndexedDB implementation happens by changing the store at the bottom of MessageIndex.jsm

Note that all the UI changes in the patch are only to ease debugging and try things out.

Blocks: 1686504
Depends on: 1635340

(In reply to Magnus Melin [:mkmelin] from comment #14)

This is storing the message meta in IndexedDB. As an object store, what is stored is following what the object looks like. Basically, the data is what we have at the moment.

From what I can tell, the current data set may be too limiting and doesn't fix some of the overall issues with the existing message system.

For example, one of the current issues for add-ons is that if you want to access the headers for the messages, then you end up having to stream the whole message which, of course, is expensive when you need to do it for every message in a folder. Gloda gives you access to this in some form via its defineAttribute mechanism. That's probably not exactly what we want here, but having quick access to the headers (or maybe a defined selection) would help.

If we're not going to want to store all or some of the headers in here (which we might not want to anyway), then we either need to come up with an alternative mechanism which avoids having to stream messages every time, or link the new database to Gloda, and properly support Gloda at an add-on level.

Another thing I can't tell if you're planning on doing or not, is that I'm pretty sure Gloda has some form of "conversation detection" which fixes up some of the broken conversations that come in. Presumably that would be useful to port across to this database as well.

Gloda's message snippet may also be useful in the context of having a multiple-line message tree - as then you wouldn't need to stream each message to get the snippet (maybe the tree is already using Gloda for that though?.. but what about non-indexed folders/messages?).

I think this is what I'm currently lacking - the context around a clear description of how Gloda & the new database interact, the responsibilities of each, and how add-ons might be able to benefit from the new system (where the current Gloda system has limitations). I realise ideals might develop over time, but I think it would be better if these are being thought about up front, rather than trying to bolt things on after the main implementation is done.

Thanks for the comments! I've been focused on getting the data we already have into usable form for conversational view so far, so mainly global threading. It might indeed be good to store more data about each message: potentially all of the headers in parsed format, and also a text and html representation of the message content in parsed ready-to-display format (for display and search). Of course, this part will require an initial going back and re-parsing the world to obtain that data, so super expensive. At least for the random headers, probably they should have their own store linked to the message since we don't need them usually.
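
One way the "own store linked to the message" idea could look, sketched in Python with sqlite3 and the standard email parser. The table and function names are hypothetical and nothing here is decided: the main record stays small, and the rarely needed headers live in a side table that is only read on demand.

```python
import sqlite3
from email import message_from_string

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE messages (
    id      INTEGER PRIMARY KEY,
    uri     TEXT UNIQUE NOT NULL,
    subject TEXT
);
-- Side store for arbitrary headers, linked to the main record.
CREATE TABLE headers (
    message_id INTEGER NOT NULL REFERENCES messages(id),
    name       TEXT NOT NULL,   -- stored lower-cased
    value      TEXT NOT NULL
);
CREATE INDEX headers_by_message ON headers (message_id, name);
""")


def index_message(uri, raw):
    """Parse the message once and store every header in the side table."""
    msg = message_from_string(raw)
    with db:
        cur = db.execute(
            "INSERT INTO messages (uri, subject) VALUES (?, ?)",
            (uri, msg["Subject"]),
        )
        db.executemany(
            "INSERT INTO headers (message_id, name, value) VALUES (?, ?, ?)",
            [(cur.lastrowid, name.lower(), value) for name, value in msg.items()],
        )


def header(uri, name):
    """Look up a header without streaming or re-parsing the message."""
    rows = db.execute(
        "SELECT h.value FROM headers h JOIN messages m ON m.id = h.message_id "
        "WHERE m.uri = ? AND h.name = ?",
        (uri, name.lower()),
    ).fetchall()
    return [r[0] for r in rows]


raw = "Subject: Hi\nList-Id: <dev.example.org>\nX-Spam-Score: 0.1\n\nBody text\n"
index_message("mailbox://me/Inbox#1", raw)
list_id = header("mailbox://me/Inbox#1", "List-ID")
```

An add-on wanting `List-Id` for every message in a folder would then hit the side table instead of streaming each message.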

Going forwards it could be possible to also add a store for actually storing the raw messages. That's not something I'm looking at atm.

Re Gloda, the gloda-id is in the .msf so that property would be available, and from that one would be able to link into Gloda when that's available, I assume? I guess there is hope to make Gloda obsolete, though that is a large undertaking in itself, a bit depending on how much functionality would have to be 1:1.

Depends on: 1798241
Depends on: 1709521

Just a quick pie-in-the-sky thought I just had:

Currently, folders use a nsIMsgDatabase object, which is implemented as its own, separate-per-folder database.
With a global message database, each folder is effectively presenting a view upon that global DB - just the messages which are in that folder.
We've already got an nsIMsgDBView interface, which serves a pretty similar purpose.
Would it be reasonable to ultimately aim at unifying the nsIMsgDatabase and nsIMsgDBView interfaces?
At the moment they're miles apart, but conceptually they feel like they're really doing the same kind of thing: presenting a subset of messages from a larger pool.
I don't know much about the GUI side, but I'd bet there's some hoop-jumping currently going on to deal with the differences between nsIMsgDatabase and nsIMsgDBView and some major simplification to be had there somewhere...
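
To illustrate the "both are just subsets of a larger pool" idea, a tiny sketch in Python. `GlobalMessageDB` and `MessageView` are invented names with no relation to the actual nsIMsgDatabase or nsIMsgDBView interfaces: a folder and a search are the same kind of object, differing only in the predicate they apply to the global pool.

```python
class GlobalMessageDB:
    """Hypothetical global pool of message rows."""

    def __init__(self):
        self.rows = []  # each row: dict with "uri", "folder", "date"

    def add(self, uri, folder, date):
        self.rows.append({"uri": uri, "folder": folder, "date": date})

    def query(self, predicate, sort_key=None):
        hits = [r for r in self.rows if predicate(r)]
        return sorted(hits, key=sort_key) if sort_key else hits


class MessageView:
    """A subset of the global pool; a folder and a search look the same."""

    def __init__(self, db, predicate, sort_key=None):
        self.db = db
        self.predicate = predicate
        self.sort_key = sort_key

    def __iter__(self):
        # Evaluated lazily, so the view always reflects the current pool.
        return iter(self.db.query(self.predicate, self.sort_key))


def folder_view(db, folder):
    return MessageView(db, lambda r: r["folder"] == folder, lambda r: r["date"])


db = GlobalMessageDB()
db.add("m1", "Inbox", 2)
db.add("m2", "Archive", 1)
db.add("m3", "Inbox", 1)

inbox = [r["uri"] for r in folder_view(db, "Inbox")]
everything_old = [r["uri"] for r in MessageView(db, lambda r: r["date"] < 2)]
```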

Depends on: 1801574
Depends on: 1802828
Depends on: 1806770
See Also: → 521633

Fixing bug 521633, bug 1727169 and bug 1796145 would require fixing or replacing Gloda. Does this project really plan to replace Gloda?

Yes, we're moving to SQLite, which will allow us to trigger very fast and accurate queries without uncertainties

See Also: → 1727169

(In reply to Alessandro Castellani [:aleca] from comment #19)

Yes, we're moving to SQLite, which will allow us to trigger very fast and accurate queries without uncertainties

The current SQLite implementation of a global search index (GLODA) is at best spotty and at worst does not even know where the mails it indexes are. After 10 years of this article (https://support.mozilla.org/kb/rebuilding-global-database), I seriously hope that this implementation of a global database to store critical data, which uses exactly the same underlying technology, is a lot more reliable. I have no intention of upgrading to a new global database for my 20+ years of email unless I am sure it will not just lose the plot as GLODA does.

Will this implementation bring to completion the decade long wait for the completion of maildir like storage?
Will this enhance or impede the pluggable store? or is that something that has been abandoned?

This bug appears to be morphing from a new database format to replace MSF files into a global message location for the application. Without any express storage systems definitions or decisions about store data formats.

We're gonna release plans and technical overviews of everything we're doing in the upcoming weeks on our tb-planning mailing list.

(In reply to Matt from comment #20)

Will this implementation bring to completion the decade long wait for the completion of maildir like storage?

Bug 1719121 was the biggest blocker for maildir. Code all over the place applied its own mbox rules, spreading out waaaaay beyond the mbox pluggable store. But now all the mbox rules are (more or less) contained in that one place.
Still chasing down possible regressions for that work (there was some pretty major plumbing involved), but after that the maildir bugs will be much easier to fix.

Will this enhance or impede the pluggable store? or is that something that has been abandoned?

Current plan is to still use pluggable store to stash local copies of full messages.
It definitely needs some refactoring (Bug 1714472).
I could imagine a nsIMsgPluggableStore which stored the messages in the same global DB as the metadata. But I've not thought too much about this.

This bug appears to be morphing from a new database format to replace MSF files into a global message location for the application. Without any express storage systems definitions or decisions about store data formats.

The core aim is to merge all the per-folder databases (msf files) into a single global DB. Same data, but now we can view across folders without having to jump through hoops. It also better matches server-side data models - i.e. where a single message can appear in multiple folders (see Gmail, Yahoo, various IMAP extensions, NNTP, JMAP, Exchange etc...).
So more a reorganisation of data than completely new data, and the path forward is mostly just code plumbing rather than design.
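
For the "single message in multiple folders" model, a minimal sketch in Python with sqlite3. The schema is hypothetical, and recognising a message by its Message-ID header is an assumption made only for this example. Metadata is stored once, and folder membership becomes a separate many-to-many table.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.executescript("""
CREATE TABLE messages (
    id         INTEGER PRIMARY KEY,
    message_id TEXT UNIQUE NOT NULL,   -- Message-ID header (assumed identity)
    subject    TEXT
);
-- Folder membership: one message row may appear in many folders.
CREATE TABLE folder_entries (
    folder  TEXT NOT NULL,
    message INTEGER NOT NULL REFERENCES messages(id),
    PRIMARY KEY (folder, message)
);
""")


def file_message(folder, message_id, subject):
    """Record that a message is present in a folder, storing metadata once."""
    with db:
        db.execute(
            "INSERT OR IGNORE INTO messages (message_id, subject) VALUES (?, ?)",
            (message_id, subject),
        )
        (row_id,) = db.execute(
            "SELECT id FROM messages WHERE message_id = ?", (message_id,)
        ).fetchone()
        db.execute(
            "INSERT OR IGNORE INTO folder_entries (folder, message) VALUES (?, ?)",
            (folder, row_id),
        )


file_message("INBOX", "<1@example.com>", "Hi")
file_message("[Gmail]/All Mail", "<1@example.com>", "Hi")

stored = db.execute("SELECT COUNT(*) FROM messages").fetchone()[0]
folders = sorted(
    row[0]
    for row in db.execute(
        "SELECT e.folder FROM folder_entries e "
        "JOIN messages m ON m.id = e.message "
        "WHERE m.message_id = ?",
        ("<1@example.com>",),
    )
)
```

The same message seen in INBOX and in All Mail ends up as one metadata row with two folder entries.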

After that gloda becomes a definite "why is this in a parallel database?" outlier, and it seems that the natural thing would be to merge the functionality of gloda into the new global db. But as far as I'm aware there are no concrete plans for that yet. And as Alex says, the plan is to do the planning in public.
