Open Bug 475857 Opened 16 years ago Updated 2 years ago

gloda could specialize for gmail IMAP

Categories

(MailNews Core :: Database, defect)

Tracking

(Not tracked)

People

(Reporter: asuth, Unassigned)

References

(Blocks 2 open bugs)

Details

(Keywords: perf, Whiteboard: [no l10n impact])

Gloda currently indexes Gmail IMAP as if it were any other IMAP account. Given Gmail's wacky IMAP mapping and ridiculous popularity, it behooves us to try to make this work better than it does. Right now, you can easily end up with multiple apparent copies of a single message that you care about.

The following was proposed after discussion with clarkbw and davida:

* Only store each message once. Treat the most specific location as the canonical location of the message (specific label folder over Inbox over All Mail). This is important because opening the .msf file for All Mail is expensive time-wise and holding it open is memory intensive. Inbox is potentially less expensive, but still expensive.
* Store all known locations of the message to allow for fall-back as labels are removed.
* Specialize gloda message retrieval by header to compensate for the non-canonical storage realities. This translates to providing a separate index that packs the folder id and message key into a single large numeric value and using that for exact lookup.
* Specialize gloda message query by folder to compensate for the non-canonical storage realities. (Similar issue to the above.) Our options here are to alter things so that all folder lookups happen this way (implying all indexing happens that way), or to just bite the bullet and alter the storage mapping rather than having this indexing stuff happening. The latter may simply be the best option given the relative dominance of Gmail and the potential ugliness that would otherwise result.
* Do not index the spam folder. Google is really good about spam, and people frequently do get a lot of Gmail spam.

I am targeting this to beta 3 because this honestly could be a lot of work and might be hard to get to happen for beta 3 given all the other stuff that is going on. I will seriously try and make this happen for beta 2, though.
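As a rough illustration of the "pack the folder id and message key into a single large numeric value" idea and of the "most specific location wins" rule, here is a minimal Python sketch. It is not gloda code: the function names, the 32-bit split and the folder names used for ranking are assumptions made for the example (IMAP UIDs are 32-bit values, which is why 32 bits are reserved for the message key).

```python
from typing import Iterable, Tuple

KEY_BITS = 32  # assumption: message keys (IMAP UIDs) fit in 32 bits


def pack_location(folder_id: int, message_key: int) -> int:
    """Pack (folder_id, message_key) into one integer usable as an exact-lookup key."""
    if folder_id < 0:
        raise ValueError("folder id must be non-negative")
    if not 0 <= message_key < (1 << KEY_BITS):
        raise ValueError("message key out of range")
    return (folder_id << KEY_BITS) | message_key


def unpack_location(packed: int) -> Tuple[int, int]:
    """Inverse of pack_location: returns (folder_id, message_key)."""
    return packed >> KEY_BITS, packed & ((1 << KEY_BITS) - 1)


def canonical_location(folder_names: Iterable[str]) -> str:
    """Pick the most specific folder: label folder over Inbox over All Mail.

    The folder names compared here are hypothetical examples.
    """
    def rank(name: str) -> int:
        if name == "[Gmail]/All Mail":
            return 2
        if name.upper() == "INBOX":
            return 1
        return 0  # any other folder is treated as a specific label folder

    return min(folder_names, key=rank)
```

With this layout a lookup by (folder, key) becomes a single integer comparison on an indexed column, and the remaining known locations can be kept alongside as the fall-back list.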
Flags: blocking-thunderbird3?
Flags: blocking-thunderbird3? → blocking-thunderbird3+
Whiteboard: [b3ux]
Whiteboard: [b3ux] → [b3ux][m4]
Whiteboard: [b3ux][m4] → [b3ux][not gonna happen]
Target Milestone: Thunderbird 3.0b3 → Thunderbird 3.0rc1
This sounds great, but will it work with Google Apps mail accounts too? For example, my danbri@danbri.org mail is Google Apps, but to access it over IMAP in Thunderbird I set up a general IMAP account rather than the special-cased Gmail account, since that seems to hardcode (at least in the UI forms) the assumption that you're xxxxx@gmail.com.
Whiteboard: [b3ux][not gonna happen] → [b3ux][not gonna happen][no l10n impact]
Dropping TB3 blocking status. General correctness of gloda trumps specializing for a wacky IMAP implementation. Since only faceted search exposes gloda results, it may be desirable to just do a post-pass coalescing of messages with the same message-id, at least in the results list.
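The "post-pass coalescing of messages with the same message-id" could look something like the following Python sketch. It assumes each result is a plain dict with hypothetical "message_id" and "folder" keys; gloda's real result objects are not modelled here.

```python
def coalesce_by_message_id(results):
    """Collapse results that share a Message-ID into one entry.

    The first result seen for a Message-ID is kept as the representative,
    and every folder it was found in is remembered.
    """
    merged = {}  # message_id -> {"message": first hit, "folders": [...]}
    for msg in results:
        entry = merged.setdefault(
            msg["message_id"], {"message": msg, "folders": []}
        )
        entry["folders"].append(msg["folder"])
    # dicts preserve insertion order, so result order follows first appearance
    return list(merged.values())
```

Keeping the folder list on the coalesced entry means the UI can still show which labels a message carries even though it is listed once.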
Assignee: bugmail → nobody
Status: ASSIGNED → NEW
Flags: blocking-thunderbird3+
Summary: gloda needs to specialize for gmail IMAP → gloda could specialize for gmail IMAP
Whiteboard: [b3ux][not gonna happen][no l10n impact] → [no l10n impact]
Target Milestone: Thunderbird 3.0rc1 → ---
Lack of gloda specialization for Gmail also makes the conversation view difficult to work with, since the conversation shows duped messages (one for each tag the message has + one more for all mail). It's not really a huge issue, but it does mean that Gmail users probably won't use the conversation view very much.
Hmm, that's a good point. We remove duplicates from the search results, but not from the thread view. It's probably not that simple to do there.
if the dupes have the same message id (and I'm pretty sure they do), it would be pretty easy.
(In reply to comment #6)
> if the dupes have the same message id (and I'm pretty sure they do), it would be pretty easy.

For a copy of a mail in another folder (i.e. a folder that is a Gmail label; "[Gmail]/Trash" and "[Gmail]/Spam" are excluded), Gmail IMAP returns exactly the same data, including the Message-ID header, as for the mail in "[Gmail]/All Mail". So "same mail implies same Message-ID" is true. However, the converse (same Message-ID implies same mail) is not true.

When I uploaded identical or nearly identical mail data to different folders (Gmail labels), Gmail did not create new mail data in "[Gmail]/All Mail" in many cases, so Gmail apparently runs its own duplicate check. This happened for mail data with the same Message-ID, Subject:, From: and To:, but with some different headers and slightly different body text. For some crafted mails, though (same Message-ID, but a different body and some different headers), Gmail did create a new mail for the uploaded data. So "same Message-ID" does not always mean "exactly the same mail", even on Gmail IMAP, because "same Message-ID == same mail" is not guaranteed by the e-mail system.

Note that my test used crafted mails in order to guess Gmail's duplicate-check rule/algorithm. I think "same Message-ID, same From:, same To:, same Subject:, same Reply-To:, plus some other headers if required" can be used in practice as the condition for "same mail data on Gmail IMAP".

Another solution for Gmail IMAP:
(i) Invoke the indexer for "[Gmail]/All Mail" only (and for [Gmail]/Trash and [Gmail]/Spam, if required).
(ii) Don't invoke the indexer for Gmail IMAP folders that are Gmail labels. Use the index data of the mail in "[Gmail]/All Mail" that has the same Message-ID && same From: && same To: && same Subject: && same Reply-To:, plus some other headers, the same body length and the same body data, if required.
If performance is a concern, an option like strict_duplicate_mail_check (false = Message-ID only, true = all of the above) may be required.

"Thread display" for Gmail IMAP may be improved by:
- Constructing the thread structure from the mails in "[Gmail]/All Mail".
- Removing from that structure any mail that doesn't exist in the folder (= Gmail label) being displayed.
Because Gmail IMAP can be said to hold real mail data only in "[Gmail]/All Mail" (and "[Gmail]/Trash", "[Gmail]/Spam"), a new MailDB column for "pointer to the same mail in [Gmail]/All Mail" may be a simple, practical enhancement that eases the implementation of some additional Gmail IMAP support features.
Blocks: tb-gmailWIP
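The strict_duplicate_mail_check option suggested above could be pictured roughly as follows. This is only a sketch in Python: the dict keys ("message_id", "from", "to", "subject", "reply_to", "body") are hypothetical stand-ins for whatever header/body accessors the real code would use.

```python
# Headers compared in strict mode, following the list given in the comment above.
STRICT_HEADERS = ("from", "to", "subject", "reply_to")


def is_same_mail(a, b, strict_duplicate_mail_check=False):
    """Decide whether two message records describe the same mail.

    strict_duplicate_mail_check=False: compare Message-ID only (cheap).
    strict_duplicate_mail_check=True: also compare selected headers and the body.
    """
    if a["message_id"] != b["message_id"]:
        return False
    if not strict_duplicate_mail_check:
        return True
    same_headers = all(a.get(h) == b.get(h) for h in STRICT_HEADERS)
    # Comparing bodies also covers the "same body length" condition.
    return same_headers and a.get("body") == b.get("body")
```

The cheap mode matches what the search results already do; the strict mode is only needed for the crafted-mail case where Gmail kept two different mails with one Message-ID.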
Gmail IMAP currently assigns UIDs in a folder (Gmail label) starting from 1 and increments by one for each new mail, as ordinary IMAP does. If Gmail IMAP used the same UID as in [Gmail]/All Mail for the copies in other IMAP folders (Gmail labels), an IMAP mail client could tell from the UID which mails are the same mail carrying multiple Gmail labels.
I just stumbled upon http://code.google.com/apis/gmail/imap/
It turns out the following can be retrieved via IMAP, which would be useful to such an endeavor:
- Access to the Gmail unique message ID: X-GM-MSGID
- Access to the Gmail thread ID: X-GM-THRID
- Access to Gmail labels: X-GM-LABELS
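For a feel of how a client might pick these values out of a server response, here is a small Python sketch that extracts X-GM-MSGID and X-GM-THRID from a single untagged FETCH line. The helper name is made up, and the regex is deliberately naive: it assumes the attribute names do not also occur inside a quoted string or literal on the same line. X-GM-LABELS is left out because label lists can contain quoted strings and need a proper tokenizer.

```python
import re

# Both attributes are returned as unsigned decimal numbers.
_GM_ID = re.compile(r"X-GM-(MSGID|THRID) (\d+)")


def parse_gm_ids(fetch_line):
    """Return e.g. {"X-GM-MSGID": 123, "X-GM-THRID": 456} for one FETCH response line."""
    return {
        "X-GM-" + name: int(value)
        for name, value in _GM_ID.findall(fetch_line)
    }
```

Two copies of a message in different label folders would then be recognised as the same mail when their X-GM-MSGID values are equal.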
hmm, yeah, those could be very useful...
Depends on: 721316
Blocks: 800003
No longer depends on: 721316
Because we know the "original mail" is held in one of [Gmail]/All Mail, [Gmail]/Trash or [Gmail]/Spam, Gloda can also do something here. But this is exactly the same situation as "many duplicates in many local folders, created by filtering or by the user copying mails". With POP3, UIDL is usually supported, so something like POP3 server + UIDL + Message-ID: + message size can be used to identify a mail. So we could equally request the following: gloda could specialize for locally copied mails in local mail folders. Would that be an appropriate request? If auto-sync is done only for [Gmail]/All Mail, there is no problem in Gloda. I believe the culprit is never Gloda. The culprit is "whoever generated duplicated local copies of mails in the offline-store files of multiple mboxes".

However, because duplicates can be detected via X-GM-MSGID on Gmail IMAP, "holding X-GM-MSGID in the Gloda DB" and "avoiding duplicated indexing jobs by checking X-GM-MSGID" is not such a bad idea, and is not so expensive. If X-GM-LABELS is always kept in sync with the server, "where the duplicated copies are held" can also easily be known. This becomes possible if the enhancement of bug 1124924 is done with the change "fetch Flags => fetch Flags X-GM-MSGID X-GM-THRID X-GM-LABELS if Gmail IMAP".

X-GM-MSGID, X-GM-THRID and X-GM-LABELS are currently held in StringProperty of msgDBHdr. Because "fetch X-GM-MSGID X-GM-THRID X-GM-LABELS" is currently requested only by the "header fetch of new mail", the X-GM-LABELS in StringProperty is currently useless. Because Gmail IMAP started to support CONDSTORE last year, sync of X-GM-LABELS is possible if "uid fetch 1:* Flags X-GM-MSGID X-GM-THRID X-GM-LABELS (CHANGEDSINCE modseq)" is used. So tables (DB) like the following are easily obtained.
MSGID_Table[ MSGID ]["X-GM-LABELS"][ GMLABEL ] = {}
GMLABEL_Table[ GMLABEL ]["X-GM-MSGID"][ MSGID ] = {}
"GMLABEL -> Mbox name in Tb" is easy because Gmail's mapping rule is simple. Because XLIST is supported by both Gmail IMAP and Tb, "GMLABEL -> localized special mbox name" is also easy.

"MSGID -> UID" is currently slightly harder, because it needs a "scan of the StringProperty of every msgDBHdr in a folder". But it's by no means impossible. Once "MSGID -> UID in [Gmail]/All Mail" is available, the data of any mail can be obtained by "select [Gmail]/All Mail + uid nn fetch body[]". As known from bug 854161, the conversion "a msgFolder + UID => the msgFolder that holds the mail data in its offline-store, with UID = messageKey in that folder" is already possible via nsImapMailFolder::GetOfflineMsgFolder, although some improvements are needed.

For Gmail IMAP, I feel the following is better. Because the Tb user asks Gloda to index the mail data of Gmail IMAP, Gloda always watches [Gmail]/All Mail. When a new mail is detected, Gloda downloads the entire mail data to the offline-store of [Gmail]/All Mail, indexes it, and then notifies Biff of the new mail. Because the user doesn't know about the new mail until indexing has completed, the user won't interfere with any job of Gloda, auto-sync, filters etc. :-) Because Gloda has already fetched the entire mail data, message filters can do body filtering, and auto-sync can sleep well :-)
FYI.

(1) The current "msgFolder + msgDBHdr of the msgFolder" is equivalent to the following table from the perspective of FolderURI/MboxName, UID, X-GM-MSGID and X-GM-LABELS. X-GM-MSGID, X-GM-THRID and X-GM-LABELS are currently saved in StringProperty of msgDBHdr.
> GM_MSG_TBL[ [Gmail]/All Mail ] = {}
> GM_MSG_TBL[ [Gmail]/All Mail ]["FolderURI"] = imap://.../[Gmail]/All Mail
>                               ["MboxName"] = [Gmail]/Sent Mail
>                               ["MSG"] = {}
>                               ["MSG"][ 2222 ]["messageKey"] = 2222 (==UID)
>                               ["MSG"][ 2222 ]["X-GM-MSGID"] = 2222222222
>                               ["MSG"][ 2222 ]["X-GM-THRID"] = 2222222222
>                               ["MSG"][ 2222 ]["X-GM-LABELS"] = "a b" "X (\" Y \\) Z" "\\Important" "\\Sent" ABC DEF
>                               ["MSG"][ 2222 ]["LABELS"] = {}
>                               ["MSG"][ 2222 ]["LABELS"][ \Important ] = true
>                               ["MSG"][ 2222 ]["LABELS"][ \Sent ] = true
>                               ["MSG"][ 2222 ]["LABELS"][ X (" Y \) Z ] = true
>                               ["MSG"][ 2222 ]["LABELS"][ other GM-LABELs ] = true

(2) This table can easily be converted to a table like the following. The relation between GM-LABEL and MboxName is simple: GM-LABEL == MboxName except for some special names, and the rule for "some special GM-LABELs" and "some special top level MboxNames" is predefined. Please see bug 1129870.
> GM_MSGID_TBL[2222222222]["MSGID"] = 2222222222
>                         ["Owner"]["Mbox"] = [Gmail]/All Mail
>                         ["Owner"]["Label"] = \AllMail
>                         ["Owner"]["UID"] = 2222
>                         ["LABELS"][ \AllMail ] = true
>                         ["LABELS"][ \Sent ] = true
>                         ["LABELS"][ other GMLABELs ] = true

(3) From this GM_MSGID_TBL, a table like the following can easily be constructed. This is a new version of the first GM_MSG_TBL. This is a kind of indexing.
> GM_LABEL_TBL[ \Sent ]["GM-LABEL"] = \Sent
>                      ["Mbox"] = [Gmail]/Sent Mail
>                      ["MSGID->UID"][3333333333] = 3333 (["Owner"]["Mbox"] = [Gmail]/Trash)
>                                    [2222222222] = 2222 (["Owner"]["Mbox"] = [Gmail]/All Mail)
>                      ["UID->MSGID"][3333] = 3333333333 (["Owner"]["Mbox"] = [Gmail]/Trash)
>                                    [2222] = 2222222222 (["Owner"]["Mbox"] = [Gmail]/All Mail)
> GM_Mbox_TBL[ [Gmail]/Sent Mail ]["GM-LABEL"] = \Sent
>                                 ["Mbox"] = [Gmail]/Sent Mail
>                                 ["MSGID->UID"][3333333333] = 3333 (["Owner"]["Mbox"] = [Gmail]/Trash)
>                                               [2222222222] = 2222 (["Owner"]["Mbox"] = [Gmail]/All Mail)
>                                 ["UID->MSGID"][3333] = 3333333333 (["Owner"]["Mbox"] = [Gmail]/Trash)
>                                               [2222] = 2222222222 (["Owner"]["Mbox"] = [Gmail]/All Mail)

(4) Gloda can do a job like the above because Gloda knows about all existing mails, all new mails and all deletions of mails, as long as the following is guaranteed on a Gmail IMAP account.
(a) Auto-sync to the offline-store file is done for [Gmail]/All Mail, [Gmail]/Trash and [Gmail]/Spam only. This can be achieved by forcing the following:
    New mail check is for Inbox and these 3 folders only.
    Offline-Use is on for these 3 folders only.
(b) X-GM-LABELS of any mail in [Gmail]/All Mail, [Gmail]/Trash and [Gmail]/Spam is always kept in sync. This enhancement has already been requested.
(c) Any change of X-GM-LABELS content is always notified to Gloda.

(5) If a table like the above GM_MSGID_TBL is maintained, "using MSGID only, while accessing [Gmail]/All Mail, [Gmail]/Trash and [Gmail]/Spam only" is possible, because "getting mail data from another folder's offline-store file" is already supported by nsImapMailFolder::GetOfflineMsgFolder.

This kind of feature is not Gloda's job, but I think Gloda can be enhanced to support "better Gmail IMAP access" without big changes. An SQLite DB for the above GM_MSGID_TBL can be created pretty easily in Gloda. If an SQLite DB is used, GM_MSGID_TBL-like data can be saved as BLOB data in the SQLite DB without worrying about "file loss even after safe file writing".
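To make steps (2) and (3) concrete, here is a minimal Python sketch that derives a GM_MSGID_TBL-like mapping from per-folder header records and then inverts it into a GM_LABEL_TBL-like index. The input shape and function names are invented for the example. One deliberate deviation from the tables above: the "UID->MSGID" map is keyed by (owner mbox, UID) rather than by UID alone, because UIDs from different owner folders can collide.

```python
def build_msgid_table(owner_folders):
    """owner_folders: {mbox_name: {uid: {"X-GM-MSGID": int, "LABELS": iterable of str}}}

    Only the folders that really own mail data are expected here
    ([Gmail]/All Mail, [Gmail]/Trash, [Gmail]/Spam).
    """
    table = {}
    for mbox, messages in owner_folders.items():
        for uid, hdr in messages.items():
            table[hdr["X-GM-MSGID"]] = {
                "Owner": {"Mbox": mbox, "UID": uid},
                "LABELS": set(hdr["LABELS"]),
            }
    return table


def build_label_table(msgid_table):
    """Invert the MSGID table into a per-label index."""
    label_table = {}
    for msgid, entry in msgid_table.items():
        owner = entry["Owner"]
        for label in entry["LABELS"]:
            slot = label_table.setdefault(
                label, {"MSGID->UID": {}, "UID->MSGID": {}}
            )
            slot["MSGID->UID"][msgid] = owner["UID"]
            # keyed by (mbox, uid) to avoid cross-folder UID collisions
            slot["UID->MSGID"][(owner["Mbox"], owner["UID"])] = msgid
    return label_table
```

Both tables would have to be updated whenever a label change is reported, which is why condition (c) above matters.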
If the mappings "GM-MSGID in Gmail -> GlodaKey = folderURI + messageKey in a folder + GM-MSGID if Gmail" and "GM-LABEL -> Gloda-Label = name of a collection of mails held in Tb" are made, Gloda can become a "Global Message Database Manager in Thunderbird". Mail viewing could then be done with just "SELECT GlodaKey From GlodaDB Where Gloda-Label=Inbox" + "passing the data to gDBView" :-)
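A label-based lookup of this kind can be sketched with Python's built-in sqlite3 module. The schema below is purely hypothetical (it is not Gloda's schema), and it uses two normalized tables rather than the BLOB storage mentioned earlier. Note that SQLite INTEGER is a signed 64-bit type, so an X-GM-MSGID above 2^63-1 would need to be stored as text instead.

```python
import sqlite3


def open_db():
    """Create an in-memory database with a hypothetical two-table layout."""
    db = sqlite3.connect(":memory:")
    db.executescript(
        """
        CREATE TABLE gm_message (
            msgid      INTEGER PRIMARY KEY,  -- X-GM-MSGID
            owner_mbox TEXT NOT NULL,        -- folder that really holds the mail
            owner_uid  INTEGER NOT NULL      -- UID (message key) in that folder
        );
        CREATE TABLE gm_label (
            msgid INTEGER NOT NULL REFERENCES gm_message(msgid),
            label TEXT NOT NULL,
            PRIMARY KEY (msgid, label)
        );
        CREATE INDEX gm_label_by_label ON gm_label(label);
        """
    )
    return db


def add_message(db, msgid, owner_mbox, owner_uid, labels):
    """Insert or refresh one message together with its current label set."""
    db.execute(
        "INSERT OR REPLACE INTO gm_message VALUES (?, ?, ?)",
        (msgid, owner_mbox, owner_uid),
    )
    db.execute("DELETE FROM gm_label WHERE msgid = ?", (msgid,))
    db.executemany(
        "INSERT OR IGNORE INTO gm_label VALUES (?, ?)",
        [(msgid, label) for label in labels],
    )


def messages_with_label(db, label):
    """Return (msgid, owner_mbox, owner_uid) rows for every mail carrying the label."""
    cur = db.execute(
        "SELECT m.msgid, m.owner_mbox, m.owner_uid "
        "FROM gm_message m JOIN gm_label l ON l.msgid = m.msgid "
        "WHERE l.label = ? ORDER BY m.msgid",
        (label,),
    )
    return cur.fetchall()
```

A folder view for a label would then be one query, with each row pointing at the owner folder and UID from which the message data can actually be read.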
Severity: normal → S3