Open Bug 646225 Opened 13 years ago Updated 2 months ago

Fake headers (bug 574441), when run through getMessagesCollectionForHeaders, return totally unrelated messages

Categories

(MailNews Core :: Database, defect)

x86_64
Linux
defect

Tracking

(blocking-thunderbird5.0 -)

REOPENED
Thunderbird 5.0b1
Tracking Status
blocking-thunderbird5.0 --- -

People

(Reporter: protz, Unassigned)

References

(Blocks 1 open bug, )

Details

(Whiteboard: [gs][status partially fixed:comment 53][see comment 24, 49][datalossy])

Attachments

(5 files, 2 obsolete files)

So bug 574441 introduced fake headers (the ones with "highwater mark" (per :bienvenu's words)) that represent sent messages, waiting to be repalced with "real headers" once we get them from the server.

When running getMessagesCollectionForHeaders on one of these headers, a totally unrelated message is returned. Symptoms: Thunderbird Conversations displays a totally unrelated conversation, followed by your message.

This sounds to me like a int overflow somewhere, and :asuth mentioned on IRC that the strtoul call from 636133 might be wrong, and may ought to be strtoull instead.
Let me amend that by saying that it would be wrong if the message keys were 64-bit, but it seems like they are definitely 32-bit.  (I though the constant used in the test was 2^32, but it turned out to be a bit less than that...

A potentially more possible thing would be that getMessageCollectionForHeaders is being passed a dead header and the attempt to get a messagekey from it returns something gibberishy which returns the wrong thing.  The query it constructs is based on folder and message key, which is pretty hard to go off the rails otherwise.
blocking-thunderbird5.0: --- → ?
I don't think we can block 3.3 on this as it doesn't seem to be a user-visible error in Thunderbird by itself. We would, of course, take a fix for this.
blocking-thunderbird5.0: ? → -
Flags: wanted-thunderbird+
I'm getting these errors with Gloda dump enabled... investigation ongoing...

2011-04-19 10:59:11	gloda.datamodel	INFO	Message with message key does not match expected header! (4D74DC9F.5000304@gmail.com expected, got 4DAD4ED6.5050104@gmail.com)
2011-04-19 10:59:11	gloda.datamodel	INFO	Message with message key does not match expected header! (4D74DC9F.5000304@gmail.com expected, got 4DAD4ED6.5050104@gmail.com)
2011-04-19 10:59:11	gloda.datamodel	INFO	Message with message key does not match expected header! (4D839451.3040108@gmail.com expected, got 4DAD4ED6.5050104@gmail.com)
Some notes for when asuth comes back from vacations:
- the query that gloda uses when I issue my call to getMessagesCollectionForHeaders is

SELECT * FROM messages INNER JOIN messagesText ON messages.id =
messagesText.rowid WHERE (id IN (SELECT id FROM messages WHERE (folderID IN
(55)) AND (messageKey IN (4294967168))) AND +deleted = 0 AND +folderID IS NOT
NULL AND +messageKey IS NOT NULL)

- there's only one message selected, so I took a look at the inner query

SELECT * FROM messages WHERE (folderID IN (55)) AND (messageKey IN (4294967168))

- this actually returned three messages.

I think that's an error. Gloda seems to pick up the msgKeyChanged notification properly, so I don't really know why this is happening...
Blocks: 646226
More thoughts:
- the part that actually performs the key update is updateMessageKeys from datastore.js
- the query looks like "UPDATE messages SET messageKey = ?1 WHERE id = ?2");" with
  * ?1 being the correct new message key
  * ?2 being 0, which leads me to think that there is something wrong happening there, which might incidentally be responsible for bug 646226, as gloda message IDs under 32 are not supposed to be valid,
- the 0 is obtained from getGlodaState in index_msg.js where we go into the else-branch and return aMsgHdr.getUint32Property(GLODA_MESSAGE_ID_PROPERTY)

I'll put this investigation on hold until asuth can confirm that I'm not going off the rails and wasting my time :-).
Short answer, you are way on the rails; good job!

Yes, 0 seems like a very bad and illegal gloda id.  It will be a bit before I could directly look into it, but the most likely hypothetical (and unresearched) reasons would be:

- The attribute is not on the message header when we go to look, either because we are using the wrong header, it is failing to be propagated, or by the time we look the header has been obsoleted.  Since the indexer goes out of its way to convert the message header to a [folder, message key] tuple to avoid that problem, it's unlikely that the last thing is sinking us, but this avoidance may be our problem.  Specifically, if we say "hey, I noticed that header#FAKE is new, let's index it" and then convert that to a note to ourself to index [some folder, FAKE], but don't update that tuple/the event-driven indexing job when messageKeyChanged happens, that could cause complications.

- Similar deal but involving the pending commit tracker and the message key.  Perhaps we are not updating the pending commit tracker when the key changes, or we are using the post key where we mean to use the pre key.

- Gloda or someone else is failing to commit changes to the .msf file prior to closing it, causing data loss.  This seems like it should not happen given that the indexer goes out of its way to explicitly commit things, etc.


Based on what you are describing, it sounds like we are managing to perform the event-driven indexing for fake sent messages (assuming that 4294967168 is a legitimate speculative UID), but are screwing things up in the PendingCommitTracker in such a way that we leave the nsIMsgDBHdr thinking it is properly indexed, but the message key data completely out of sync.  Because we detect indexed-ness based on having a gloda id on an nsIMsgDBHdr but look it up by message key, it definitely sounds like the inconsistency is screwing things up fatally and the only thing that can fix it is a compaction pass.

One might ask why we still use the [folder id, message key] to get the collection for a message at this point if we could just rely on the gloda-id, especially since our failure in this case results in a major and obvious fast failure... and the sorta-answer is that it's desirable for the failure to be blatant, like we're seeing here.  I'm not strictly wedded to that call, but we probably want to error console it if we detect a mismatch if we change that logic...
protz, are you waiting on additional feedback beyond comment 6 from me?  lemme know.
My theory is that bug 646556 caused invalid info to be stored in _keyChangedBatchInfo, and then, when another message would re-use the same fake message key later, it would pick up the old, outdated info from _keyChangedBatchInfo, and gloda would go all confused. But that's just a theory :-).
Comment #5 was actually me discovering the root cause for bug 646226 ; comment #8 is wrong, the issue is still happening after landing the patch from bug 646226 ; the issue is still present after recreating global-messages-db.sqlite so there has to be something that results in entries with highwater message keys hanging around in the messages table. Since we're reusing these message keys, next time a message reuses an old key, getMessagesCollectionForHeaders finds the old entries in the messages table.
My new color-coded debug output tells me that I do have a surprisingly high number of messages with the pseudoHdr Uint32 property still on them. I'd say a dozen or so. This means once in a while, a message I send doesn't get rid of that property.

However, dumping the corresponding msgHdr's messageKey tells me that this is a regular message key, not a highwater one. Any idea how that could possibly happen?
Here's a procedure that worked for me. STR:
- I'm in the unified inbox that also searches sent folders.
- Send myself a message A (I'm running a fake smtp server, so nothing ends up being received).
- Reply to the message that just appeared (message B).
- Display the thread (I'm assuming you're running Thunderbird Conversations).
- Start a new thread, by emailing yourself something (message C).
- Reply to that new thread (message D).
- Somehow message C flashed out of the view, leaving only message D.
- Click on message D.
- After a while, the collections for the two threads will be merged into one, and clicking on thread D will also return the messages from thread A.

Andrew, I think having msgKeyChanged in index_msg.js invalidate all cache entries related to the old message key might be a good start. What do you think?
I blew up my database a few hours ago, and after a few minutes of testing (this includes restarting Thunderbird frequently) and sending out a lot of messages, I already have three messages in the gloda "messages" table, sharing the same messageKey (4294967168).
(In reply to comment #11)
> Andrew, I think having msgKeyChanged in index_msg.js invalidate all cache
> entries related to the old message key might be a good start. What do you
> think?

Based on your characterization, I believe the problem from Thunderbird's conversation's perspective is:

P1) Its getMessageCollectionForHeaders is returning messages from another conversation.

This is happening because:

P2a) Gloda.getMessageCollectionForHeaders builds queries based on message keys and the query is never updated even when message keys change.  This is sorta okay because:

P2b) GlodaDatastore.updateMessageKeys never generates itemsModified notifications when it updates the in-memory representations, but unfortunately...

P2c) The fake message header keys can be reused.  So a long-lived getMessageCollectionForHeaders query that is issued on a fake header will keep the results with its query, but whenever any new fake-headers are indexed, they may *also* match the query, and so get smooshed in.


I believe the phenomenon "Somehow message C flashed out of the view, leaving only message D." is due to the fact that replying to a message is a dirtying event that will cause the message to be reindexed (because of the replied-to flag).  As a result, gloda will reindex the message and generate an itemsModified event which will cause all unfrozen to re-evaluate their queries... since the query is on the (folderID, fake message key) but the message now has value (folderID, real message key), it gets removed from the query set.

The good news is that it sounds like, apart from Gloda.getMessageCollectionForHeaders producing unreliable collections, it sounds like everything else is actually working, which is not a bad thing.  Sadly, we don't actually have the unit tests to prove it.


I believe I've broadly characterized the solution set in IRC, but I'm going to revisit our possible options here so there's a canonical answer:

S1) Stop using (folder ID, message key) constraints for getMessageCollectionForHeader(s) and exclusively use gloda-id's.  The problem is that un-indexed messages have no gloda-id (which is why we use folder/msgkey).  We could speculatively issue gloda-id's (SQLite does not issue the id's, we do, and we do so from the main thread), but because people can get at nsIMsgDBHdr's before the spam classifier is run, one potential danger is that we may issue id's to messages that will never get indexed.  Since we will still get msgsClassified notifications for these headers, however, it's possible we could deal with this at the cost of some extra property checking on messages we would otherwise ignore.

note: it would be critical that we use the glodaDirty flag (or something similar) to mark the gloda-id as speculative.  We would need to audit the current uses of PendingCommitTracker.getGlodaState and potentially change some of them to either be aware of the speculative id or pretend it does not exist.


S2) Use gloda-id queries when we have gloda-id's, use (folder, message key) queries when we don't.  Permute (folder, message key) queries in-memory on msgKeyChanged events and I fear also deal with move-notifications.  We could optionally permute/promote the (folder, message key) constraints to gloda-id constraints once a message gets indexed, but that wouldn't save us from the message key tracking stuff.


S3) Declare that getMessageCollectionForHeader(s) does not work on messages that are not yet indexed.  Make it the problem of Thunderbird Conversations/friends to deal with these complicating shenanigans themselves.

This would be more tenable if we were still planning to develop a UI that only exposed gloda-indexed messages; right now it seems like it would just be a jerky cop-out.  The counterpoint is that it's not clear anything in-tree is going to get bit by this right now, as the multi-message/thread summaries may potentially end up re-computing and re-querying when the message keys change, and show-conversation uses queries based on the conversation-id...
Oh, and I left out a hacky workaround, which would be to make sure we don't reuse fake-message-header values in a single session.  I fear that might eventually result in sadness for the IMAP layer, however, if it ended up causing integer overflow/wraparound or any of the class of situations where we start colliding with real UIDs (or apparent UID space, even if we don't collide) or going into gibberish number territory.
I could ameliorate this by paying attention to the UID returned on the append operation if the server supports UIDPLUS. In this case, I wouldn't have to use a fake UID but could use the real one. Since UIDPLUS support is more common than not, that could free us up to choose a relatively less efficient solution for the fake key issue. Sorry for not suggesting this earlier, but I didn't know this was going to be such a painful issue.

It would also be possible to be much less aggressive about re-using fake msg keys. In the current situation, does Thunderbird need to get shut down before re-using fake msg keys won't cause a problem?
(In reply to comment #15)
> It would also be possible to be much less aggressive about re-using fake msg
> keys. In the current situation, does Thunderbird need to get shut down
> before re-using fake msg keys won't cause a problem?

Right now it's a question of the lifetime of gloda collections built from getMessageCollectionForHeader(s), which is really a question of the lifetime of things being displayed in the message summary pane, either the built-in stuff of Thunderbird Conversations.
Attached patch use appenduid result (obsolete) — Splinter Review
protz, do you want to give this patch a try?

asuth, did your last comment mean that once a user selects a different conversation, then we can re-use fake keys w/o this bug manifesting?
Thanks for all the recap and the extensive explanations. It's late so I might not understand everything properly, but how does that account for the fact that I've got entries in the messages table that have a highwater mark and are never updated with the real message key?
David, I'll give this a try tomorrow, right now, I'm afraid I need to sleep :-)
(In reply to comment #18)
...
> but how does that account for the
> fact that I've got entries in the messages table that have a highwater mark
> and are never updated with the real message key?

This suggests that msgsClassified is never being generated for some fake-made-real headers after their msgKeyChanged notifications are received.  gloda's msgKeyChanged handler just accumulates the changes in an in-memory structure; it requires a msgsClassified event to be generated for the messages subsequently for the changes to affect the in-memory and on-disk gloda representations.
(In reply to comment #17)
> asuth, did your last comment mean that once a user selects a different
> conversation, then we can re-use fake keys w/o this bug manifesting?

Yes, although I cringe at the thought of the presentation layer hinting to the IMAP layer about such things.
(In reply to comment #21)
> (In reply to comment #17)
> > asuth, did your last comment mean that once a user selects a different
> > conversation, then we can re-use fake keys w/o this bug manifesting?
> 
> Yes, although I cringe at the thought of the presentation layer hinting to
> the IMAP layer about such things.

Ha, that ain't happening. I was just trying to understand. So if I didn't reuse fake keys until I had to, the failure case would be the user staying on the same conversation and sending roughly 128 or 256 messages instead of 2.

Though I suppose extensions could do all sorts of funky things like hold onto message collections.
(In reply to comment #20)

> This suggests that msgsClassified is never being generated for some
> fake-made-real headers after their msgKeyChanged notifications are received.
> gloda's msgKeyChanged handler just accumulates the changes in an in-memory
> structure; it requires a msgsClassified event to be generated for the
> messages subsequently for the changes to affect the in-memory and on-disk
> gloda representations.

This would mean that there's some way for newly downloaded imap headers not to cause a msgsClassified event, which would mean that gloda would never index them. If this happened, we'd see gloda not knowing about a lot of messages, I would think. I don't think there's anything special about the fake-made-real headers at header download time, other than the fact that we've sent a NotifyMsgKeyChanged notification.
(In reply to comment #23)
> This would mean that there's some way for newly downloaded imap headers not
> to cause a msgsClassified event, which would mean that gloda would never
> index them. If this happened, we'd see gloda not knowing about a lot of
> messages, I would think. I don't think there's anything special about the
> fake-made-real headers at header download time, other than the fact that
> we've sent a NotifyMsgKeyChanged notification.

Yes, that was (probably) wrong.  I have thought through some stuff that I did not think about in comment 13 because the STR did not mention the messages table issue.

Long hypothesis short: msgKeyChanged is happening *during* the gloda indexing process for the fake header message.

Long solution short: msgKeyChanged needs to see if we are actively indexing the message in question and, if so, adjust internal state.  It would also be nice if we could update the entries for the current job for messages not yet indexed or queued jobs.  I believe the indexing logic will operate correctly if we fail to do so in those other cases, we may just end up indexing some messages twice...

We likely do not want to use the job killing mechanism because we would need to complicate the job-killing mechanism a lot since we would not want to forget about indexing any other messages in the current job.  It's not required for correctness, just timeliness.  (The folder compaction logic does this, and so is slightly over-aggressive; it coincidentally was originally the only kind of event that could change message keys.)

An alternate means of addressing this would be a variant of solution S1 from comment 13 where we label all messages with their speculative gloda-id's when we mark them for enqueueing.
Attachment #532041 - Flags: feedback?(jonathan.protzenko)
Comment on attachment 532041 [details] [diff] [review]
use appenduid result

Outside of the conversations context, this fix means we usually won't have to download headers for messages appended to the drafts or sent folders.
Attachment #532041 - Flags: review?(neil)
Attachment #532041 - Flags: review?(neil) → review?(neil)
thx for fixing the requestee, Ludo.
Target Milestone: --- → Thunderbird 3.3a4
Sorry, stupid bugzilla tweaks bug.
Target Milestone: Thunderbird 3.3a4 → ---
Attached file Backtrace
I've had segfaults with the patch on, in the SetMsgAppendUid method, which I believe was affected by the patch. This segfault is 100% reproducible. It is a move of a message via drag-and-drop accross different accounts.
Comment on attachment 532041 [details] [diff] [review]
use appenduid result

feedback- because of the segfault
Attachment #532041 - Flags: feedback?(jonathan.protzenko) → feedback-
Attached patch fix for crashSplinter Review
good catch, this should fix the crash.
Assignee: jonathan.protzenko → dbienvenu
Attachment #532041 - Attachment is obsolete: true
Attachment #532041 - Flags: review?(neil)
Attachment #534431 - Flags: review?(neil)
Attachment #534431 - Flags: feedback?(jonathan.protzenko)
Comment on attachment 534431 [details] [diff] [review]
fix for crash

>-  nsCOMPtr<nsImapMailCopyState> m_copyState;
Bah, how many more of these bogus references are there :-(
Attachment #534431 - Flags: review?(neil) → review+
pushed to trunk http://hg.mozilla.org/comm-central/rev/5d098294efd2 in the hopes of getting it for TB 6.
Status: ASSIGNED → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Target Milestone: --- → Thunderbird 3.4
David, shall we try to land this on Miramar?
Attachment #534431 - Flags: feedback?(jonathan.protzenko)
Attachment #534431 - Flags: feedback+
Attachment #534431 - Flags: approval-thunderbird3.3?
I'd like it on Miramar, but I'll let Standard8 decide. Is my understanding correct that this contributes to a more or less permanent gloda issue of bad conversations?
I'm reopening this bug because bienvenu's fix mitigates the underlying gloda problem I described on the bug when UIDPLUS is available, but it doesn't fix gloda's problem which is what is characterized in the subject.  This bug should be re-subjected and the gloda problem cloned off to another bug with the analysis thread if we want to mark this one fixed.

From a triage perspective, if miramar has any of our recent msgKeyChanged escapades (I think it does), it probably really wants the migitation fix...
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
er, yes, sorry about that - kept meaning to comment to that effect. And miramar does have the msgKeyChanged fun in it.

I could simply file a new bug that describes the appenduid fix, which has other nice benefits, and just mark it fixed, and leave this one open.
Comment on attachment 534431 [details] [diff] [review]
fix for crash

Ok, given the comments, I think I'd rather get this fixed now and ready for testing before the first beta rather than after.
Attachment #534431 - Flags: approval-thunderbird3.3? → approval-thunderbird3.3+
http://hg.mozilla.org/releases/comm-miramar/rev/205f41f6883a
Target Milestone: Thunderbird 3.4 → Thunderbird 3.3a4
Blocks: 241197
Hi,

http://getsatisfaction.com/mozilla_messaging/topics/thunderbird_conversations_not_working_correctly_on_osx?utm_content=topic_link&utm_medium=email&utm_source=reply_notification seems to imply that this problem has gotten worse. I unfortunately don't have time to work on this :(.

Thanks,

jonathan
Message A is the last message in a valid thread.
Message B is an entirely new message (no 'reply', 'edit as new', or anything else) followed by its subsequent thread.

Other attachments show these message sources including unrelated ID's.
I don't know if this is particularly helpful at this point, but I've attached some screenshots showing this erroneous threading.  I've also included the source (including message ID's) from the two messages that were associated but shouldn't have been.
Hi Sean,

Thanks for adding some information to the bug. The problem is well-known and there are potential fixes outlined in comment #24; what we need is just someone with enough time to dive into the code and fix this :).

Should something happen, you'll see some action in this bug!

Cheers,

jonathan
Yeah - I'd read the thread and figured these wouldn't be much help.  I've been thinking about bugging my manager to give us 1-2 hrs per week of Google-style 'undirected' development time.  If I can convince him of it, I may start by digging into this bug because it's been driving me NUTZZ!

BTW - thanks for creating the Conversations plugin.  IMHO that alone makes TB the #1 desktop email client. It needs to be officially integrated, supported, & maintained by Mozilla.
Sean, which IMAP server is it that you're connecting to? If you don't have that information at hand, you can find out by running from a terminal (assuming linux here):

For IMAP+SSL:
protzenk@sauternes:~ $ openssl s_client -connect imap.googlemail.com:993 -crlf

For regular IMAP:
protzenk@sauternes:~ $ nc imap.free.fr 143

The first few lines may have information about which kind of server they're running.
Unfortunately I'm not going to be able to look at this issue at this point.  I've just got too much other stuff going on.  Sorry about that.

As for the IMAP server I'm connecting to?  I've got two email addresses I'm using.  One is connecting to a DavMail instance running on localhost (DavMail works as a proxy to Exchange servers).  The other is hosted at hostmonster.com.  I ran that command and got a lot of certificate & SSL info, but I don't see anything identifying the server.  Why are you looking for this info?
I just want to confirm that this bug only happens for servers which do not support the UIDPLUS command, but I think we're pretty sure this is already the case.
Maybe this is the line your looking for from Hostmonster:
OK [CAPABILITY IMAP4rev1 LITERAL+ SASL-IR LOGIN-REFERRALS ID ENABLE AUTH=PLAIN AUTH=LOGIN] Dovecot ready.

I don't know if I'm seeing it at the Hostmonster server as I don't end up generating nearly as many threads on that account.  The one where I know I see this issue is on the DavMail server.  I'm running v3.9.8-1921.

Maybe this helps?
Not sure if this adds information or not, but I feel its better to err on the safe side.

In Thunderbird 17.0.2 I noticed the following behavior: Sent emails are initially placed in the wrong thread both in the inbox view (threaded), and in Conversation View. After a restart of Thunderbird, the email will be placed correctly in the inbox list, but is still in the wrong spot in Conversation View. Instead of restarting, I can also open the Sent folder. There is then a short delay before the newly sent email pops up, and if I then switch back to the unified inbox/sent folder the email will be correctly sorted in the inbox view, but not in the Conversation view.

Tl;dr: The email goes to the wrong spot in both Thunderbird's inbox view and in Thunderbird Conversations' view, however the inbox view's sorting can be fixed by fiddling around, whereas Conversations' sorting can not.
protz, is comment 49 sufficient for your question?

I'm thinking we should close this fixed, and open new bug(s) for anything still not working
Flags: needinfo?(jonathan.protzenko)
Hello Wayne,

comment 49 confirms that this issue is only partly fixed, and that whatever remains to be fixed is accurately described by comment 24.

I don't have time to work on this right now unfortunately, but anyone willing to fix this should read the discussion up to comment 24 which contains some guidance as to how to fix this.

Thanks,

jonathan
Flags: needinfo?(jonathan.protzenko)
Is anyone currently actively working on improving threading in Thunderbird?
Whiteboard: [gs] → [gs][status comment 53]
Anyone can help to fix this issue? It is badly needed for me.
aceman, 
It would be great to kill this off.
Is comment 24 something you can handle?
Flags: needinfo?(acelists)
Whiteboard: [gs][status comment 53] → [gs][status partially fixed:comment 53]
IMAP and gloda in the same bug? No, that is not for me right now ;)
Flags: needinfo?(acelists)
Version: unspecified → Trunk
Assignee: mozilla → nobody
See Also: → 574441
Whiteboard: [gs][status partially fixed:comment 53] → [gs][status partially fixed:comment 53][see comment 24, 49][datalossy]
Severity: normal → S3
Attachment #9386663 - Attachment is obsolete: true
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: