Closed Bug 461479 Opened 16 years ago Closed 16 years ago

Integrate generalized Bayes traits into mailnews classifications

Categories

(MailNews Core :: Filters, enhancement)

enhancement
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED
Thunderbird 3.0b2

People

(Reporter: rkent, Assigned: rkent)

References

Details

Attachments

(2 files, 8 obsolete files)

The generalized bayes code in bug 453885 does not interact with the normal processing of email in mailnews. Define a service to manage traits added by extensions and core code, and convert the classifyMessages calls used in normal email processing to also process any defined traits.
Depends on: 453885
This patch provides a central pref-based storage system for message traits, and integrates trait analysis into the standard spam processing of IMAP and Local folders. I believe there is enough backend-support here that an extension would be able to implement a soft tags features, where a tag is set automatically based on the results of bayesian analysis. This patch requires the most recent patch from bug 453885 to work.

It is also clear, and often requested, that the filter plugin needs to reduce its dependence on the existing bayesian junk filter plugin. The processing is highly coupled to the particular implementation of junk management, including things such as whitelisting by address or domain, skipping certain folders, etc. All of that really should be integrated somehow into a central, extendable filter framework. I didn't do that here of course. Instead, trait analysis is done to all incoming messages if a particular trait is set enabled. The analysis simply sets a percent match in the database. The idea is that users would set a listener on the database to detect this, and take any appropriate action (such as setting a tag value).

To follow is a sample (and useless) extension to visualize that this trait analysis actually is doing something.
This extension is a modified version of "junquilla". "Junquilla" does extended display of information related to junk processing - including showing the percent bayes result, as well as different icons depending on how the junk score was determined.

This adaptation enables traits analysis of the "good" message feature, and displays the percent results in a column. This is the opposite of "junk" so if the percent match is 70% to "junk" it should be 30% to "good". This is of course useless - but does exercise the entire trait-calculation framework.
Attached patch Ready for review (obsolete) — Splinter Review
Let's review this now.

I added a couple of message flags, which may seem excessive but they were the easiest approach. They could be removed if needed in the future. The MSG_FLAG_CLASSIFY_TRAITS would be fairly easy to replace, MSG_FLAG_CLASSIFY_JUNK would probably need a new database attribute. The issue is that we need to still do the bayesian processing when junk processing is not done, and the listeners need to know that. I tried to fix the writing of volatile flags in nsMsgHdr.cpp using the macro-defined flag, but look at that because it could have unforeseen consequences if someone was relying on the old behaviour.

I'd like to get this patch in by beta 1 so that we could do an extension that would demonstrate soft tags.
Attachment #344765 - Attachment is obsolete: true
Attachment #345584 - Flags: superreview?(bienvenu)
Attachment #345584 - Flags: review?(bugzilla)
Attached patch With the new files this time (obsolete) — Splinter Review
Attachment #345584 - Attachment is obsolete: true
Attachment #345589 - Flags: superreview?(bienvenu)
Attachment #345589 - Flags: review?(bugzilla)
Attachment #345584 - Flags: superreview?(bienvenu)
Attachment #345584 - Flags: review?(bugzilla)
I'd really much rather use hdr properties than flags, if possible, especially for transient state like this.
OK I like the idea of properties on nsMsgHdr. I implemented it though as a type of flag, a "processFlag". If we ever decide to implement other process managing steps - such as the notification flag I have sometimes proposed - this would be a good location for that as well.
Attachment #345589 - Attachment is obsolete: true
Attachment #345670 - Flags: superreview?(bienvenu)
Attachment #345670 - Flags: review?(bugzilla)
Attachment #345589 - Flags: superreview?(bienvenu)
Attachment #345589 - Flags: review?(bugzilla)
Status: NEW → ASSIGNED
Updated patch, in the perhaps vain hope of getting this reviewed and in before b1.
Attachment #345670 - Attachment is obsolete: true
Attachment #348276 - Flags: superreview?(bienvenu)
Attachment #348276 - Flags: review?(bugzilla)
Attachment #345670 - Flags: superreview?(bienvenu)
Attachment #345670 - Flags: review?(bugzilla)
Comment on attachment 348276 [details] [diff] [review]
unbitrotted version of previous patch

I've had a quick look through this...in general, it looks fine - some initial comments:

+    char* strScore = PR_smprintf("%u", aPercents[i]);
+    mDatabase->SetStringPropertyByHdr(msgHdr, traitId.get(), strScore);
+    PR_smprintf_free(strScore);

better to use a local var big enough to hold the string and pr_snprintf instead of allocating and freeing this memory. (I think  this is happening in several places, including existing code that you've just moved).

I don't quite understand what processFlags are exactly. Are they simply transient flags, non-persistent flags? Or are they specifically meant to be used for incoming mail processing? Or could they be used for after the fact processing? I just want to make sure we use the right var name to avoid confusion...maybe processingFlags would be better.


is nsIMsgTraitService.lastIndex only used by the test code? Traits are persistent, right? So if I add 5 traits, and remove 3 of them, is lastIndex set to 5?
(In reply to comment #8)
> (From update of attachment 348276 [details] [diff] [review])
> 
> I don't quite understand what processFlags are exactly. Are they simply
> transient flags, non-persistent flags?

Yes, on a per-message basis.
> Or are they specifically meant to be
> used for incoming mail processing? Or could they be used for after the fact
> processing?

I don't think they need to be restricted to incoming mail processing, but all of my use cases clearly involve that. BIFF management comes to mind, for example. As transient flags though they need to be used for events with a short lifetime.

> 
> is nsIMsgTraitService.lastIndex only used by the test code? Traits are
> persistent, right? So if I add 5 traits, and remove 3 of them, is lastIndex set
> to 5?

I struggled with this decision, but at the moment I do not reuse the index. That is, if you remove 3, those numbers are lost forever, and lastIndex continues to be 5. Otherwise, you have to deal with all sorts of complicated issues of knowing if stored data for trait 4 is really the current trait 4, or leftover data from the deleted trait 4. This is not just used in test code, but is used when you register a new trait to assign an index. So this method would not work for an application that for some reason created and destroyed lots of traits. But I think a trait is like a tag, and we should only expect a very limited number of them.
Comment on attachment 348276 [details] [diff] [review]
unbitrotted version of previous patch

I've started looking at this, and below are a few comments, but I think are probably mostly the major ones.

It doesn't compile first time round on my mac, but did when I clobbered mailnews/. It could be we have a deps problem somewhere.

Could you also attach a diff -w next time as I think that will make the review easier. Thanks.

>+    
>+    /*
>+     * operations on the processing flag, used to manage message processing.

nit: Operations

>+     * These flags are not written to nor read from the persisent database

nit: persistent

>+     */
>+    readonly attribute unsigned long processFlags;
>+    void orProcessFlags(in unsigned long mask);
>+    void andProcessFlags(in unsigned long mask);

Can you use the doxygen group format to comment these ("@{ ... }@" see nsIAbCard.idl for examples

>+ /*
>+  * This interface provides management of traits that are used to categorize
>+  * messages. A trait is some characteristic of a message, such as being "junk"
>+  * or "personal", that may be discoverable by analysis of the message.
>+  *

nit: /**

>+  /**
>+   * set the trait name, which is the localized display name for the trait.
>+   *
>+   * @param id     the trait universal identifier
>+   * @param name   localized display name for the trait.
>+   *               This is implemented as nsIPrefLocalizedString
>+   */
>+  void setName(in ACString id, in AString name);

I don't think this is going to do what I think you are expecting. Certainly in its current form nsIPrefLocalizedString has no effect, and mailnews.traits.name.* can't be localized. Additionally if someone installs a different language pack (or adds support for a different language in an extension), then you're still going to get the original name.

>+  /**
>+   * set the anti trait, which indicates messages that have been marked as
>+   * NOT matching a particular trait.
>+   *
>+   * @param id      the trait universal identifier
>+   * @param antiId  trait id for messages marked as not matching the trait
>+   */
>+  void setAntiId(in ACString id, in ACString antiId);

Why not just have a trait pair with good and bad traits defined together?
(In reply to comment #10)

I'll do an updated patch, but let me address some of the comments.

> >+  /**
> >+   * set the trait name, which is the localized display name for the trait.
> >+   *
> >+   * @param id     the trait universal identifier
> >+   * @param name   localized display name for the trait.
> >+   *               This is implemented as nsIPrefLocalizedString
> >+   */
> >+  void setName(in ACString id, in AString name);
> 
> I don't think this is going to do what I think you are expecting. Certainly in
> its current form nsIPrefLocalizedString has no effect, and
> mailnews.traits.name.* can't be localized. Additionally if someone installs a
> different language pack (or adds support for a different language in an
> extension), then you're still going to get the original name.

I never tested with nsIPrefLocalizedString, but from the description it sounded like it was supposed to help with localization. I'll take your word for it that it does not work for my case. Unless there are objections, let me just use anormal string, and let the extension deal with any required localization.

> 
> >+  /**
> >+   * set the anti trait, which indicates messages that have been marked as
> >+   * NOT matching a particular trait.
> >+   *
> >+   * @param id      the trait universal identifier
> >+   * @param antiId  trait id for messages marked as not matching the trait
> >+   */
> >+  void setAntiId(in ACString id, in ACString antiId);
> 
> Why not just have a trait pair with good and bad traits defined together?

The issue here is that the anti-traits are always hard to get info on. For example, we could automatically train a pro trait when someone tags a message, but then how do we train the anti trait? The best method currently is to display messages in an Uncertain category, and ask the user to train those. But what I am hoping, though have not tested, is that for low-incidence pro traits, we could use as the anti-trait something like the summary of all messages, which would not need explicit training by the user. That summary anti-trait could be shared between pro-traits. So for that reason, I did not want to have pro/anti trait pairs, but allow them to be separately specified.
(In reply to comment #11)
> (In reply to comment #10)
> > >+  /**
> > >+   * set the anti trait, which indicates messages that have been marked as
> > >+   * NOT matching a particular trait.
> > >+   *
> > >+   * @param id      the trait universal identifier
> > >+   * @param antiId  trait id for messages marked as not matching the trait
> > >+   */
> > >+  void setAntiId(in ACString id, in ACString antiId);
> > 
> > Why not just have a trait pair with good and bad traits defined together?
> 
> The issue here is that the anti-traits are always hard to get info on. For
> example, we could automatically train a pro trait when someone tags a message,
> but then how do we train the anti trait? The best method currently is to
> display messages in an Uncertain category, and ask the user to train those. But
> what I am hoping, though have not tested, is that for low-incidence pro traits,
> we could use as the anti-trait something like the summary of all messages,
> which would not need explicit training by the user. That summary anti-trait
> could be shared between pro-traits. So for that reason, I did not want to have
> pro/anti trait pairs, but allow them to be separately specified.

Ok, that's fine, thanks for the explanation.
In updating and re-reviewing this patch, I may have a problem. Based on comment
#5 I implemented the processing flags as a property on an nsMsgDBHdr object.
But those objects are not passed around, they are created as needed from the
message key, or taken from the message header cache. From the cache, a stored
property is OK, but if the cache size is exceeded and the hdr is recreated from
the mork database, then there is no way for the process flags to get restored.
The similar New flag gets restored from the structure stored in the database -
which is also replicated to the folder when the database is destroyed - but I
have no parallel mechanism for the process flags.

Bienvenu, is this indeed a problem? If so, do you have any suggestions?

In bug 441932, I proposed some changes to push volatile flag storage directly
to the folder from the header - but that patch did not attract much enthusiasm,
so was never implemented.

Given these persistence issues, I think it would be easier if the new flags
proposed in this bug simply be made normal database attributes, either as
additional fields on the existing message flag, or as a new processing flag
added as a database attribute. We don't have the problem here of needing to
store a sorted value, like we do with the new flag, so there is no real reason
not to treat them like other message flags. The choice would be whether to add
a new attribute, or use existing free bits in the message flag.
Based on IRC discussion with Bienvenu, in this patch I persist the process flags using nsMsgKeyset objects in the folder. I also did a few minor additional tweaks to the code, and implemented suggestions from reviewers.
Attachment #348276 - Attachment is obsolete: true
Attachment #348735 - Flags: superreview?(bienvenu)
Attachment #348735 - Flags: review?(bugzilla)
Attachment #348276 - Flags: superreview?(bienvenu)
Attachment #348276 - Flags: review?(bugzilla)
Comment on attachment 348735 [details] [diff] [review]
Store process flags as nsMsgKeyset in the folder

these should follow the mVarName naming convention 
+  // parallel arrays storing keys that have a processing flag set
+  nsTArray<PRUint32> processFlagBits;      // the processing flag bit
+  nsTArray<nsMsgKeySetU*> processFlagKeys; // list of keys that match

I don't see anyone using the new nsIMsgHdr methods...it would be nice not to need them, if possible.

These all need to do an NS_ENSURE_ARG_POINTER:

+NS_IMETHODIMP nsMsgDBFolder::GetProcessFlags(nsMsgKey aKey, PRUint32 *aFlags)
+{
+  *aFlags = 0;


delete null is fine, so you don't need the null checks here:

+  if (loKeySet)
+    delete loKeySet;
+  if (hiKeySet)
+    delete hiKeySet;

or here:

+    if (processFlagKeys[i])
+      delete processFlagKeys[i];

Can we call things processing instead of process? Process is overloaded in meaning.

Either we don't need the temp var here, or we should check for failure:

+    nsMsgKeySetU* keyset = nsMsgKeySetU::Create();
+    processFlagKeys.AppendElement(keyset);

ugh, the unsigned msg key set is unfortunate - someday nsMsgKeySet should just be fixed.
Comment on attachment 348735 [details] [diff] [review]
Store process flags as nsMsgKeyset in the folder

>   // this allows a folder to have a special identity. E.g., you might want to
>   // associate an identity with a particular newsgroup, or for IMAP shared folders in
>   // the other users namespace, you might want to create a delegated identity
>   readonly attribute nsIMsgIdentity customIdentity;
>+
>+  /**
>+   * @{
>+   * Read the processing flags, used to manage message processing.
>+   *
>+   * @param msgKey   message key
>+   * @return         processing flags
>+   */
>+  unsigned long getProcessFlags(in nsMsgKey msgKey);
>+
>+  /**
>+   * OR a mask into the processing flags
>+   *
>+   * @param msgKey   message key
>+   * @param mask     mask to OR into the flags
>+   */
>+
>+  void orProcessFlags(in nsMsgKey msgKey, in unsigned long mask);
>+
>+  /**
>+   * AND a mask into the processing flags
>+   *
>+   * @param msgKey   message key
>+   * @param mask     mask to AND into the flags
>+   */
>+  void andProcessFlags(in nsMsgKey msgKey, in unsigned long mask);
>+  /** @} */

normally the group syntax means you don't need to document the individual functions. Just provide generic documentation at the top.

>   if (!mDatabase)
>   {
>     rv = GetDatabase(nsnull);   // XXX is nsnull a reasonable arg here?
>     NS_ENSURE_SUCCESS(rv, rv);
>   }
> 
>+  // check if trait processing needed
>+
>+  nsCOMPtr<nsIMsgTraitService> traitService(
>+      do_GetService("@mozilla.org/msg-trait-service;1", &rv));
>+  NS_ENSURE_SUCCESS(rv, rv);
>+  PRUint32 count, *proIndices, *antiIndices;

nit: blank line after the NS_ENSURE_SUCCESS please

>+/* static */ nsMsgKeySetU* nsMsgKeySetU::Create()
>+{
>+  nsMsgKeySetU* set = new nsMsgKeySetU;
>+  if (set)
>+  {
>+    set->loKeySet = nsMsgKeySet::Create();
>+    set->hiKeySet = nsMsgKeySet::Create();
>+  }
>+  if (!(set && set->loKeySet && set->hiKeySet))
>+    set = nsnull;

That doesn't look write, surely you should be deleting set?

>+int nsMsgKeySetU::Add(PRUint32 aKey)
>+{
>+  PRInt32 intKey = static_cast<PRInt32>(aKey);
>+  if (intKey >= 0)
>+    return loKeySet->Add(intKey);
>+  else
>+    return hiKeySet->Add(intKey & kLowerBits);

You can drop the else here as you're returning early (plus two more cases below this).

>+// traits 3 - 100 are reserved for use my mailnews@mozilla.org
>+// the first externally defined trait will have index 101
>+pref("mailnews.traits.lastIndex", 100);
>+

Is mailnews@mozilla.org your extension, or internal to mailnews (the "my" is confusing)? If its internal to mailnews, I would bump the number to 500 or even 1000. Hopefully we'll never reach that, but that's what someone said about search/filters...
Comment on attachment 348736 [details] [diff] [review]
diff -w version of the previous patch

>+  processFlagBits.AppendElement(MSG_PROCESSFLAG_CLASSIFY_JUNK);
>+  processFlagBits.AppendElement(MSG_PROCESSFLAG_CLASSIFY_TRAITS);
>+  for (PRUint32 i = 0; i < processFlagBits.Length(); i++)
>+  {
>+    nsMsgKeySetU* keyset = nsMsgKeySetU::Create();
>+    processFlagKeys.AppendElement(keyset);
>+  }
...
>   static const NS_MSG_BASE_STATIC_MEMBER_(nsStaticAtom) folder_atoms[];
>+  // parallel arrays storing keys that have a processing flag set
>+  nsTArray<PRUint32> processFlagBits;      // the processing flag bit
>+  nsTArray<nsMsgKeySetU*> processFlagKeys; // list of keys that match
> };

So from what I can tell, processFlagBits never gets modified, apart from to add the JUNK and TRAITS at the beginning. There's also no interfaces to modify it.

processFlagKeys seems to be an array which matches the size of processFlagBits.

Therefore can't we use an array with fixed size, possibly dropping nsTArray altogether, and using a struct to combine the two bits of data?
Fixed all requests from reviewers.

The file format in traits.dat and the accompanying data structure were not designed for large gaps in trait id. So I changed it here.
Attachment #348735 - Attachment is obsolete: true
Attachment #348736 - Attachment is obsolete: true
Attachment #349295 - Flags: superreview?(bienvenu)
Attachment #349295 - Flags: review?(bugzilla)
Attachment #348735 - Flags: superreview?(bienvenu)
Attachment #348735 - Flags: review?(bugzilla)
Attachment #349295 - Flags: review?(bugzilla) → review+
Attachment #349295 - Flags: superreview?(bienvenu) → superreview+
Comment on attachment 349295 [details] [diff] [review]
More fixes, changes traits.dat format

thx, Kent, sr=me, with one nit: can you combine the two ifs into one, to reduce the unneeded nesting, e.g.,

if (NS_SUCCEEDED(rv) && whiteListDirArray.Count())



+      if (whiteListDirArray.Count() != 0)
+      {
+        if (NS_SUCCEEDED(rv))
+        {
+          nsIAbCard* cardForAddress = nsnull;
+          // don't want to abort the rest of the scoring.
+          if (!authorEmailAddress.IsEmpty())
+          {
+            for (PRInt32 index = 0; index < whiteListDirArray.Count() && !cardForAddress; index++)
+              whiteListDirArray[index]->CardForEmailAddress(authorEmailAddress,
+                                                            &cardForAddress);
+          }
+          if (cardForAddress)
+          {
+            NS_RELEASE(cardForAddress);
+            // mark this msg as non-junk, because we whitelisted it.
+            nsCAutoString msgJunkScore;
+            msgJunkScore.AppendInt(nsIJunkMailPlugin::IS_HAM_SCORE);
+            mDatabase->SetStringProperty(msgKey, "junkscore", msgJunkScore.get());
+            mDatabase->SetStringProperty(msgKey, "junkscoreorigin", "whitelist");
+            break; // skip this msg since it's in the white list
+          }
+        }
+      }
Attached patch Patch to checkinSplinter Review
Carrying over reviews, bienvenu's last nit fixed and ready to check in.
Attachment #349295 - Attachment is obsolete: true
Attachment #349791 - Flags: superreview+
Attachment #349791 - Flags: review+
Keywords: checkin-needed
Status: ASSIGNED → RESOLVED
Closed: 16 years ago
Resolution: --- → FIXED
Target Milestone: Thunderbird 3.0b1 → Thunderbird 3.0b2
Flags: in-testsuite+
Comment on attachment 349791 [details] [diff] [review]
Patch to checkin

>+// This class is a kludge to allow nsMsgKeySet to be used with PRUint32 keys
So which existing caller expects negative keys?
(In reply to comment #23)
> (From update of attachment 349791 [details] [diff] [review])
> >+// This class is a kludge to allow nsMsgKeySet to be used with PRUint32 keys
> So which existing caller expects negative keys?

Sorry, that was over a year ago and I don't recall all of the details. But it is possible that I simply did not want to modify all of the existing users due to the increased review load and regression risk, so instead I did a kludge for this patch which has a very restricted usage.
I now see that nsMsgKeySet only works with keys from 1 to INT_MAX (0 is special cased in certain places, so I'm not sure how valid it is to use 0).
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: