Open Bug 663270 Opened 13 years ago Updated 2 years ago

Improve the threading behavior to include hints from the message subject and the message body

Categories

(Thunderbird :: Mail Window Front End, defect)

x86_64
Linux
defect

Tracking

(Not tracked)

People

(Reporter: ehsan.akhgari, Unassigned)

Details

The current threading behavior only uses the message-id based hints.  This breaks when bad mail gateway software messes up with the respective headers.  In such cases, we can thread based on the subject line.  For example, if a subject begins with "Re: " we can try to assign it to an existing thread with the same subject line.

In order to determine where in the thread to inject the message, we can look at the quoted text in the mail body and try to match it to the body of an existing message as a simple heuristic.
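
A minimal sketch of the quoted-text heuristic described above, assuming message bodies are available as plain text. The names `base_subject`, `quoted_lines` and `guess_parent` are made up for illustration and are not Thunderbird code.

```python
import re

# Matches one or more leading "Re:" prefixes, case-insensitively.
_RE_PREFIX = re.compile(r"^\s*(?:re\s*:\s*)+", re.IGNORECASE)


def base_subject(subject):
    """Strip leading "Re:" prefixes so replies compare equal to the original."""
    return _RE_PREFIX.sub("", subject).strip()


def _squash(text):
    """Collapse all whitespace runs so re-wrapped quotes still match."""
    return " ".join(text.split())


def quoted_lines(body):
    """Return the non-empty lines quoted with '>' markers, markers removed."""
    lines = []
    for line in body.splitlines():
        stripped = line.lstrip()
        if stripped.startswith(">"):
            text = _squash(stripped.lstrip("> "))
            if text:
                lines.append(text)
    return lines


def guess_parent(reply_body, thread):
    """Pick the message whose body contains the most quoted lines.

    `thread` is a list of (message_key, body) pairs for the candidate thread.
    Returns None when no quoted line matches any body.
    """
    quotes = quoted_lines(reply_body)
    best_key, best_score = None, 0
    for key, body in thread:
        flat = _squash(body)
        score = sum(1 for q in quotes if q in flat)
        if score > best_score:
            best_key, best_score = key, score
    return best_key
```

The candidate thread would first be chosen by comparing `base_subject()` values, and `guess_parent()` would then only decide where in that thread the reply belongs.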
(In reply to comment #0)
> The current threading behavior only uses the message-id based hints.  This
> breaks when bad mail gateway software messes up with the respective headers.
> In such cases, we can thread based on the subject line.  For example, if a
> subject begins with "Re: " we can try to assign it to an existing thread
> with the same subject line.

We have prefs to control the various threading options - see https://wiki.mozilla.org/Thunderbird:Help_Documentation:Hidden_Preferences - that doc is a little out of date because we've switched the defaults around. Ludo might know of a better doc that tells you exactly what to do to get the kind of threading you want. But we don't do the quoted text in the mail body, since we don't always have the message body when we do threading.
(In reply to comment #1)
> (In reply to comment #0)
> > The current threading behavior only uses the message-id based hints.  This
> > breaks when bad mail gateway software messes up with the respective headers.
> > In such cases, we can thread based on the subject line.  For example, if a
> > subject begins with "Re: " we can try to assign it to an existing thread
> > with the same subject line.
> 
> We have prefs to control the various threading options - see
> https://wiki.mozilla.org/Thunderbird:Help_Documentation:Hidden_Preferences -
> that doc is a little out of date because we've switched the defaults around.
> Ludo might know of a better doc that tells you exactly what to do to get the
> kind of threading you want.

Hmm, the mail.strict_threading kind of implies that we already do subject-based threading, which contradicts my experience.

> But we don't do the quoted text in the mail body,
> since we don't always have the message body when we do threading.

Can we do that if we do have the body available?
(In reply to comment #2)

> 
> Hmm, the mail.strict_threading kind of implies that we already do
> subject-based threading, which contradicts my experience.

We should thread by subject if the reply starts with re:, by default.

> 
> > But we don't do the quoted text in the mail body,
> > since we don't always have the message body when we do threading.
> 
> Can we do that if we do have the body available?

Not efficiently, no. If subject and reference threading are both broken, it would be kinda slow to grovel through the message bodies of all of your messages looking for messages that contain the quoted content. We could use gloda full-text queries, probably, but it would have to be after the normal threading. This might be something worth trying in an extension.
(In reply to comment #3)
> (In reply to comment #2)
> 
> > 
> > Hmm, the mail.strict_threading kind of implies that we already do
> > subject-based threading, which contradicts my experience.
> 
> We should thread by subject if the reply starts with re:, by default.

Will that take precedence over the In-Reply-To header (or whatever it's called, can't remember the exact name)?
References/In-Reply-To is matched first, and then I think we try to match the subject if we can't find a match yet.
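
A rough sketch of that order, not the actual implementation: headers first, then a subject fallback that only runs when strict threading is off. The function name, the dict-based lookups and the `strict_threading` parameter are assumptions for illustration, loosely modeled on the mail.strict_threading pref.

```python
def find_thread(msg, by_message_id, by_subject, strict_threading=True):
    """Return a thread id for `msg`, or None if it should start a new thread.

    msg: dict with "references" (list of message-ids, oldest first),
         "in_reply_to" (str or None) and "subject" (str).
    by_message_id: dict mapping a known message-id to its thread id.
    by_subject: dict mapping a lowercased subject without "Re:" to a thread id.
    """
    # 1. Header-based matching always comes first, nearest ancestor first.
    candidates = list(reversed(msg.get("references", [])))
    if msg.get("in_reply_to"):
        candidates.insert(0, msg["in_reply_to"])
    for message_id in candidates:
        if message_id in by_message_id:
            return by_message_id[message_id]

    # 2. Subject fallback: skipped entirely under strict threading, and
    #    otherwise only attempted when the subject starts with "Re:".
    if strict_threading:
        return None
    subject = msg.get("subject", "").strip()
    if subject.lower().startswith("re:"):
        return by_subject.get(subject[3:].strip().lower())
    return None
```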
Hmm, tell me that I'm crazy but this <http://mxr.mozilla.org/comm-central/source/mailnews/mailnews.js#134> tells me that if the pref is true (which is the default behavior), no subject threading is performed at all!

So, should we consider changing the default value of the pref to false?
Oh, yeah, some folks (hi, Ludo) convinced me that we should try strict threading. I'd rather thread by subject without re: myself, so I'd be happy to flip the default value. But I'll give those with the opposing view a chance to comment.
I'm reading through all of this and I definitely think TB5 changed something about how threads are collected together.  In every case where it breaks, the first message that started the thread is not combined with the replies.  

This made me think that the control in mail.correct_threading (due to it controlling message ID involvement) was the key, but it is unclear to me when these indexes get updated, so my experiments might be off.
So here's some thoughts.
- Strict threading is on by default, which means we don't use the subject to thread. Making this off by default is *wrong*. It frequently happens that someone sends you a message called "hey" or "hello". Threading by subject will thread all these messages together. Suggestion: only thread by subject if the messages are close enough in time and/or some recipients are shared.
- Threading by message-id confuses people when someone intends to start a new thread by hitting reply all because they're lazy, not realizing that this will keep the References: header intact. Suggestion: consider the message to be in a new thread if the subject changes drastically from that in the message pointed to by the In-Reply-To message.
- Threading without re (thread_without_re) is just a bad idea imho.

Indeed, using body contents (which I think gmail does somehow) would be ideal, but we don't have the infrastructure to do that (and building it is, in my mind, likely to require significant engineering effort to get "the rightest way"). Applying the two rules above (which gmail does, incidentally) would be already a massive improvement.
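
To make the two rules concrete, here is a hedged sketch. The one-week window, the 0.5 similarity threshold, the message dict layout and both function names are assumptions picked for illustration, and the "and/or" in the first rule is read here as "or".

```python
from difflib import SequenceMatcher

MAX_GAP_SECONDS = 7 * 24 * 3600  # assumed "close enough in time" window
MIN_SUBJECT_SIMILARITY = 0.5     # assumed cutoff for a "drastic" change


def _base(subject):
    """Lowercase the subject and drop any leading "Re:" prefixes."""
    s = subject.strip().lower()
    while s.startswith("re:"):
        s = s[3:].strip()
    return s


def may_thread_by_subject(a, b):
    """Rule 1: same base subject, and close in time or sharing a participant.

    Each message is a dict with "subject" (str), "date" (epoch seconds)
    and "participants" (set of addresses).
    """
    if _base(a["subject"]) != _base(b["subject"]):
        return False
    close = abs(a["date"] - b["date"]) <= MAX_GAP_SECONDS
    shared = bool(set(a["participants"]) & set(b["participants"]))
    return close or shared


def starts_new_thread(reply_subject, parent_subject):
    """Rule 2: a reply whose subject differs drastically starts a new thread."""
    ratio = SequenceMatcher(None, _base(reply_subject), _base(parent_subject)).ratio()
    return ratio < MIN_SUBJECT_SIMILARITY
```

With these, two unrelated "hey" messages a year apart between different people stay separate, while a reply-all with a rewritten subject would be split off from the thread its References: header points at.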

Please note that gloda has its own concept of threading; we should definitely keep that in sync with the main threading logic (namely, it only uses references: and in-reply-to: and is not sensitive to subject-threading).
To elaborate on the triggering use-case, the problem is that Mailman 2.1 (still, as of 2.1.10, which is what Mozilla is using) deletes the message-id header and replaces it with its own when gatewaying from the mailing list to the news server.  It does not stash it in another field or update/generate a synthetic References field.

http://bazaar.launchpad.net/~mailman-coders/mailman/2.1/view/head:/Mailman/Queue/NewsRunner.py#L128

The generated message-id has some information in it, namely an epoch timestamp.  It is generated like so:
    global _serial
    msgid = '<mailman.%d.%d.%d.%s@%s>' % (
        _serial, time.time(), os.getpid(),
        mlist.internal_name(), mlist.host_name)

Because the serial number precedes the timestamp (which will usually be some small number of seconds after the 'Received' header's timestamp at the mail-server) and is unpredictable, this means that any database range query would need to be on an intentionally created index using the timestamp.  Which is to say that gloda can't create an efficient query on the already-existing message-id index.
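
As an illustration of what such a timestamp index could look like, here is a sketch that parses the generated message-id format shown above and does a range lookup over a list sorted by timestamp. The helper names and the in-memory list are assumptions; this is not how gloda stores anything.

```python
import bisect
import re

# Mirrors '<mailman.%d.%d.%d.%s@%s>': serial, timestamp, pid, list name, host.
_MAILMAN_ID = re.compile(r"^<mailman\.(\d+)\.(\d+)\.(\d+)\.(.+)@([^@>]+)>$")


def parse_mailman_message_id(msgid):
    """Split a Mailman-generated message-id into its fields, or return None."""
    m = _MAILMAN_ID.match(msgid.strip())
    if m is None:
        return None
    serial, timestamp, pid, list_name, host = m.groups()
    return {
        "serial": int(serial),
        "timestamp": int(timestamp),
        "pid": int(pid),
        "list": list_name,
        "host": host,
    }


def ids_near(sorted_entries, timestamp, window=300):
    """Return message-ids whose timestamp lies within `window` seconds.

    `sorted_entries` is a list of (timestamp, msgid) tuples sorted ascending,
    standing in for a purpose-built index on the embedded timestamp.
    """
    lo = bisect.bisect_left(sorted_entries, (timestamp - window, ""))
    hi = bisect.bisect_right(sorted_entries, (timestamp + window, "\uffff"))
    return [msgid for _, msgid in sorted_entries[lo:hi]]
```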

The message-id's are *not* munged when messages posted to the newsgroup are gatewayed to the mailing list.  Also, no reverse transformation of the message-id clobbering is performed, which would have saved us this pain.  This has the notable side-effect that when someone replies to a message and the reply goes to both the newsgroup and the mailing list, the mailing list ends up with two copies with the same (original) message-id, and the newsgroup ends up with two copies where one is the original message-id and one is the munged message-id.

Aside: If thunderbird were to de-dupe, we would probably want to take the one directly from the mailing list because it is the one that will have DKIM information which provides the strongest authentication.


The mozilla bugs that relate to our gateways are: Bug 651527, Bug 711375.
The launchpad bug with activity relating to mailman being a jerk is: https://bugs.launchpad.net/mailman/+bug/266263
And one with a patch but no activity:
https://bugs.launchpad.net/mailman/+bug/496233


For analysis convenience:

Bug 451995 is where we changed (Jan 12, 2009):
http://hg.mozilla.org/comm-central/rev/20af028c35be
  mail.thread_without_re: true => false

Bug 449821 is where we changed (Feb 1, 2009):
http://hg.mozilla.org/comm-central/rev/fb8823d88515
  mail.strict_threading: false => true
  mail.correct_threading: false => true

Thread without Re was-and-is super-dangerous because it would thread automatically generated e-mails together and this would frequently result in pathological performance when deleting e-mails because re-threading would need to occur.  The canonical example was justdave deleting 10,000 messages with the same subject all at once.  This can be mitigated with my (abandoned because the complexity was no longer justified) fix to topologically sort messages by thread when deleting on Bug 452221.
(In reply to Jonathan Protzenko [:protz] from comment #10)
> - Threading by message-id confuses people when someone intends to start a
> new thread by hitting reply all because they're lazy, not realizing that this
> will keep the References: header intact. Suggestion: consider the message to
> be in a new thread if the subject changes drastically from that in the
> message pointed to by the In-Reply-To message.

Is there an issue specifically for this case that I can track and vote for? It seems pretty doable since message-id and subject are  already used, and a simpler case than this issue describes.

On a related note, does anyone have thoughts on threading Facebook messages? Annoyingly they don't use the standard headers and change the subject as well (e.g. "Dave commented on", "Bob commented on"). I think the needed info is in "X-Facebook-Notify" though. My idea would be to thread everything on the pid or equivalent. Probably its own issue, but I thought I would ask here first.
This would be an important bug to fix. Less technically inclined email users in particular often reuse old messages to create new ones on a completely different topic (falsely preserving the message-ID). This leads to ridiculous threads that include messages with 2+ completely different subject headers. A basic subject matching process like the one in Gmail would go a long way toward making Thunderbird better.
FYI, I created a separate issue 822101 for Facebook threading since it's really header-based, just different headers. https://bugzilla.mozilla.org/show_bug.cgi?id=822101
I believe GMail actually just drops the
(In reply to David Rees from comment #12)
> (In reply to Jonathan Protzenko [:protz] from comment #10)
> > - Threading by message-id confuses people when someone intends to start a
> > new thread by hitting reply all because they're lazy, not realizing that this
> > will keep the References: header intact. Suggestion: consider the message to
> > be in a new thread if the subject changes drastically from that in the
> > message pointed to by the In-Reply-To message.
> 
> Is there an issue specifically for this case that I can track and vote for?
> It seems pretty doable since message-id and subject are  already used, and a
> simpler case than this issue describes.
> 

I created bug 893261 for this specific case of deleting "In-Reply-To" when a reply's subject is changed. It's really on the email creation side (and an easier change as well).
Severity: normal → S3