Closed Bug 453885 Opened 16 years ago Closed 16 years ago

Generalize bayes code to support multiple traits

Categories

(MailNews Core :: Filters, defect)

Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED
Thunderbird 3.0b1

People

(Reporter: rkent, Assigned: rkent)

References

(Blocks 1 open bug)

Details

Attachments

(1 file, 4 obsolete files)

The bayesian filter code is unnecessarily tied to junk mail definitions, making it difficult to use the code for other classification purposes. Generalize the code to classify a generic "trait" of an email, with junk being one type of trait.
Hi! I'd like to help with this. I've been planning to generalize the Bayesian class to support multiple states for a long time.

Which direction do you want to take with this? Do you want to base it on any papers?

Stas (CC'ing) can also help here, AFAIR he did a bit of work around it ;)
(In reply to comment #1)
> Hi! I'd like to help with this. I've been planning to generalize the Bayesian
> class to support multiple states for a long time.
> 
> Which direction do you want to take with this? Do you want to base it on any papers?
> 
This bug is intended to be a baby step (though the patch is 100K). That is, I have taken the existing Bayesian code and simply generalized it to be able to work with multiple traits - with junk being one trait. It still supports the existing interfaces, but adds new ones that can work with multiple traits. The patch itself is entirely backend, providing no new user features.

Subsequent patches will need to add support for that in the UI and message processing code, probably through hooks available to extensions rather than in the main interface. That is where the interesting work will come in - and it is where I would suggest you and others consider helping, as I am really done with the backend generalization.

What I am imagining myself is providing support for smart tags, as that would be fairly straightforward. That is, you would define a tag, say "personal", and train a corpus of messages with the trait, and anti-messages without the trait. You could then have a virtual search folder that displays messages that match the tag. You'd also need UI to change the limits of acceptance, as well as some sort of display of Uncertain messages to encourage training of matching and unmatching messages.

But I think a reasonable goal for TB 3 is to add enough hooks to the code so that these features could be added through extensions. Imagining possible extensions, and adding the appropriate hooks, would be very helpful.

Also, the existing Bayesian code is not multi-state, but binary. You need to train a trait (junk) and contrast it with a trained anti-trait (good). You'll continue to need to do the same in the patch I'm submitting as part of this bug.
That's great! I've wanted to make nsIBayesian multistate for years, but lacked the skills/time to work on this.

Do you plan to make it generic enough to be used outside of Tb? Say, Firefox Places?

I asked about any base for your approach because over the last year I gathered some documentation on improvements various people have proposed to the original algorithm, such as:
http://www.garyrobinson.net/2004/04/improved_chi.html
http://www.orocos.org/bfl
http://www.cs.uga.edu/~kangli/src/sigmetr2006.pdf

I was wondering if you're using any base for your multi-state approach.

Also, btw, Stas is working on an extension (as part of his thesis) that uses another approach to recognizing where a textual item belongs - the HAC algorithm (http://nlp.stanford.edu/IR-book/html/htmledition/hierarchical-agglomerative-clustering-1.html).

Stas's approach may be efficient in different cases than Bayes. AFAIU it builds a multidimensional space with each dimension being a distance between words (Stas, please correct me here). As a result it can analyze similarity between textual objects like websites or emails and group them together.
(In reply to comment #3)
> 
> Do you plan to make it generic enough to be used outside of Tb? Say, Firefox
> Places?
>
No plans for this.

> I asked about any base for your approach ...
> I was wondering if you're using any base for your multi-state approach.
>
Are you asking about the statistical algorithms? Or does "base" here have some technical meaning that I am unaware of? Currently what I am doing is identical to the existing Bayes algorithm. That is, junk/good is a pro/anti trait pair, where people mark some messages as having the trait ("pro" or junk in this case), or not having the trait ("anti" or good). By generalizing, I simply mean you could define other pro/anti pairs, and the code will efficiently process those additional pairs as well as the original junk/good pair.

I can tell you the statistical problem that concerns me, though. It is generally easy to get the user to mark the "pro" trait messages (junk, for example, or "personal"). It is much harder to get them to mark the "anti" trait (good, for example). Currently my approach to this is through the UI: I run an "Uncertain" virtual folder that the user is encouraged to drive to zero members by marking messages as pro (junk) or anti (good). I'm not sure how that will work with other trait pairs. If I mark a few messages as "Personal", am I then supposed to mark everything else as "Not Personal"? I wish there were some way to use the properties of an "average" email as the anti trait, so that the user would only need to mark the pro trait messages. I'll probably experiment with that some using the existing algorithm, but it would be good to have something where that was planned in as part of the algorithm development.
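To make the pro/anti pairing concrete, here is a minimal sketch of the per-token ratio idea the junk classifier is built around, generalized from junk/good to an arbitrary pro/anti pair. The function name and signature are illustrative only - the real nsBayesianFilter code also smooths rare tokens and combines the per-token values in the spirit of the Robinson chi-square approach linked above, rather than using a raw ratio like this directly:

// Illustrative only: per-token "pro-ness" for one pro/anti trait pair.
// PRUint32 comes from NSPR's prtypes.h.
double tokenProProbability(PRUint32 proCount,        // occurrences in pro-trained messages
                           PRUint32 antiCount,       // occurrences in anti-trained messages
                           PRUint32 numProMessages,  // messages trained with the pro trait
                           PRUint32 numAntiMessages) // messages trained with the anti trait
{
  double proFreq  = numProMessages  ? (double)proCount  / numProMessages  : 0.0;
  double antiFreq = numAntiMessages ? (double)antiCount / numAntiMessages : 0.0;
  double denom = proFreq + antiFreq;
  // 0.5 means "no evidence either way" when the token was never seen.
  return denom > 0.0 ? proFreq / denom : 0.5;
}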
Attached patch WIP, though mostly done (obsolete) — Splinter Review
Here's the current patch. I've still got some changes to make before review, but it shows what I am doing, if you're interested. It works on the current TB trunk.
The Naive Bayesian classifier is still tightly coupled in the code.
Please make sure that it remains possible to replace the classification algorithm itself in the (near) future.

I propose examining state-of-the-art algorithms such as a Support Vector Machine (SVM). SVMs suffer less from overfitting and still provide much better classification accuracy.

One of the winners of TREC 2007 [1] was a "Relaxed Online SVM implementation" [2] which has low computational costs ("relaxed") and can be incrementally trained ("online").

Other SVM implementations also add multi-class capability (e.g. liblinear [3]).

[1]: http://trec.nist.gov/pubs/trec16/papers/SPAM.OVERVIEW.pdf
[2]: http://www.eecs.tufts.edu/~dsculley/papers/emailAndWebSpamSIGIR.pdf
[3]: http://www.csie.ntu.edu.tw/~cjlin/liblinear/
Attached patch Cleaned up for review (obsolete) — Splinter Review
This patch provides the backend support in the bayesian code for analyzing multiple traits, using the same token store. It does not add any interface to the rest of the code to use the additional traits, but is a direct replacement for the existing code. The next patch will hook the multiple traits into the rest of the code, with enough capability to implement multiple traits through an extension.

In addition to the included unit tests, I did some performance tests on junk analysis. I tested with my existing training file, analyzing junk on a folder of 1945 messages totaling 37 MB, using a debug build. The averages over three runs are:

Before patch: 59.3 seconds to execute, bayes added 12.8 MB of memory
After patch: 60.3 seconds to execute, bayes added 13.8 MB of memory.

So the additional overhead has a barely measurable performance cost (about one second, or under 2%, in this test). The memory cost, though, is much less than the savings from the previous patch for bug 453881, so there is a net saving with the two patches combined.
Attachment #343080 - Attachment is obsolete: true
Attachment #343360 - Flags: superreview?(bienvenu)
Attachment #343360 - Flags: review?(bugzilla)
Overall, this looks good - this will enable some really cool features.

Indentation goofy here:

  *   Inform a listener of a message's classification aClassification, which
  * can be one of 3 values:  UNCLASSIFIED, GOOD, or JUNK.
  *   Pass the indicator aJunkPercent, where 0 is not junk,
  * 100 is junk, and intermediate values represent varying uncertainty.
  */


+    NS_ASSERTION(aProTraits.Length() == aAntiTraits.Length(),
+      "Each Pro trait needs a matching Anti trait");

what happens if they don't? Should we make this a fatal error and return? If it's not fatal, a comment would help.

nProMessages and nAntiMessages are a little bit confusing as var names - I guess n is for num?

Can't you use InsertElementsAt here, instead of looping?:

+  nsAutoTArray<PRUint32, kTraitAutoCapacity> proTraits;
+  nsAutoTArray<PRUint32, kTraitAutoCapacity> antiTraits;
+  for (PRUint32 index = 0; index < aTraitCount; index++)
+  {
+    proTraits.AppendElement(aProTraits[index]);
+    antiTraits.AppendElement(aAntiTraits[index]);
+  }
+

and if you can, you can do that in a few other places as well.
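A minimal sketch of the one-call form, assuming the nsTArray overload of AppendElements that copies from a raw pointer plus a count:

nsAutoTArray<PRUint32, kTraitAutoCapacity> proTraits;
nsAutoTArray<PRUint32, kTraitAutoCapacity> antiTraits;
// Copy the incoming C arrays in one call instead of element by element.
proTraits.AppendElements(aProTraits, aTraitCount);
antiTraits.AppendElements(aAntiTraits, aTraitCount);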

don't need the blank line before the comment:
+  if (shrink)
+  {
+
+    // We'll clear the tokens, and read them back in from the file.
+    // Yes this is slower than in place, but this is a rare event.

don't *need* the braces here, but I think it's OK if you want them...
+      for (var i = 0; i < gTest.traitIds.length; i++)
+      {
+        proArray.push(gTest.traitIds[i]);
+      }

Pardon an ignorant question, but is there an easy way to avoid this? Maybe make index -1 mean end of list? I only ask because having a dummy 0th element seems a little fragile.

+    // dummy 0th element. Index 0 means "end of list" so we need to
+    // start from 1
+    AnalysisPerToken analysisPT(0, 0.0, 0.0);
+    mAnalysisStore.AppendElement(analysisPT);
+    mNextAnalysisIndex = 1;
(In reply to comment #8)
> 
> Pardon an ignorant question, but is there an easy way to avoid this? Maybe make
> index -1 mean end of list? I only ask because having a dummy 0th element seems
> a little fragile.
> 
> +    // dummy 0th element. Index 0 means "end of list" so we need to
> +    // start from 1
> +    AnalysisPerToken analysisPT(0, 0.0, 0.0);
> +    mAnalysisStore.AppendElement(analysisPT);
> +    mNextAnalysisIndex = 1;

One thing that drives this is that the hash code (which I don't control) initializes new memory to zero. I rely on that zero to mean "empty list".

But let me think about it a little.
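For anyone following the linked-list layout: each hash-table token holds an index into a flat side array of per-trait records, the records chain to each other by index, and index 0 serves as both the dummy slot and the end-of-list marker - which is why zero-filled memory from the hash code can be read directly as "empty list". A rough sketch (the names follow the patch, but the struct layout and traversal helper here are illustrative, not the actual code):

struct AnalysisPerToken {
  PRUint32 mTraitIndex;   // which pro/anti pair this record belongs to
  double   mDistance;
  double   mProbability;
  PRUint32 mNextLink;     // index of the next record in mAnalysisStore; 0 = end of list
};

// mAnalysisStore[0] is a dummy, so a zero-initialized Token::mAnalysisLink
// already means "no records for this token yet".
void forEachAnalysis(const nsTArray<AnalysisPerToken>& aStore, PRUint32 aFirstLink)
{
  for (PRUint32 link = aFirstLink; link != 0; link = aStore[link].mNextLink)
  {
    // ... use aStore[link] here ...
  }
}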
Attached patch Fixed David's issues (obsolete) — Splinter Review
Fixed all of David's issues except the dummy elements. One possible approach to that would be to have the stored index use 0 to mean empty, and offset all calls to the store by 1. I tried that, but it results in lots of cases where you have to really think carefully about whether you need to offset an index or not. Overall, I prefer the dummy 0th element. The other approach would be to try to initialize items after they are added to memory by the hash code. I didn't pursue this, because it takes a lot of brain preparation before I can muck with the hash code again, and when I first designed this I decided it was difficult to do.

But I am still open to one of those approaches if you feel strongly.
Attachment #343360 - Attachment is obsolete: true
Attachment #344156 - Flags: superreview?(bienvenu)
Attachment #344156 - Flags: review?(bugzilla)
Attachment #343360 - Flags: superreview?(bienvenu)
Attachment #343360 - Flags: review?(bugzilla)
Comment on attachment 344156 [details] [diff] [review]
Fixed David's issues

>-[scriptable, uuid(caf3d467-d57c-11d6-96a9-00039310a47a)]
>+[scriptable, uuid(AF247D07-72F0-482d-9EAB-5A786407AA4C)]
>+interface nsIMsgTraitClassificationListener : nsISupports
>+{
>+/**
>+ *  Inform a listener of a message's match to traits. The list
...

This comment needs two spaces inserted before each line.

>+  void onMessageTraitsClassified(in string aMsgURI,

I was going to comment that these could be made into ACString. However, given that the rest of the code still uses string for the message URIs, I think leave them as they are in this patch, but consider changing them in a future bug.

>+     * Given a message URI, evaluate its relative match to a list of
>+     * traits according to the current training set.
>+     *
>+     * @param aMsgURI          URI of the message to be evaluated
>+     * @param aMsgWindow       current message window (may be null)
>+     * @param aJunkListener    junk-oriented callback listener (may be null)
>+     * @param aTraitListener   trait-oriented callback listener (may be null)

Just a thought: would it be better to put these three parameters at the end of the argument list and make them [optional]?

>diff --git a/mailnews/extensions/bayesian-spam-filter/src/nsBayesianFilter.cpp b/mailnews/extensions/bayesian-spam-filter/src/nsBayesianFilter.cpp

> // token for a particular message
>+// mcount, mAnalysisLink is initialized to zero by the hash code

nit: mcount -> mCount, is -> are

> struct Token : public BaseToken {
>     PRUint32 mCount;
>-    double mProbability;        // TODO:  cache probabilities
>-    double mDistance;
>+    PRUint32 mAnalysisLink; // index in mAnalysisStore of the AnalysisPerToken
>+                            // object for the first trait for this token
> };
> 
> // token stored in a training file for a group of messages
>+// mTraitLink is initialized to 0 by the hash code
> struct CorpusToken : public BaseToken
> {
>-    PRUint32 mJunkCount;
>-    PRUint32 mGoodCount;
>+    PRUint32 mTraitLink;    // index in mTraitStore of the TraitPerToken
>+                            // object for the first trait for this token
> };
>+
>+// set the value of a TraitPerToken object
>+TraitPerToken::TraitPerToken(PRUint32 aTraitId, PRUint32 aCount)
>+  :  mId(aTraitId), mCount(aCount), mNextLink(0)
>+{
>+}
>+

>@@ -1021,39 +1062,78 @@ nsBayesianFilter::~nsBayesianFilter()
>     Shutdown();
> }
> 
> // this object is used for one call to classifyMessage or classifyMessages().
> // So if we're classifying multiple messages, this object will be used for each message.
> // It's going to hold a reference to itself, basically, to stay in memory.
> class MessageClassifier : public TokenAnalyzer {
> public:
>-    MessageClassifier(nsBayesianFilter* aFilter, nsIJunkMailClassificationListener* aListener,
>-      nsIMsgWindow *aMsgWindow,
>-      PRUint32 aNumMessagesToClassify, const char **aMessageURIs)
>-        :   mFilter(aFilter), mSupports(aFilter), mListener(aListener), mMsgWindow(aMsgWindow)
>+    // full classifier with arbitrary traits
>+    MessageClassifier(nsBayesianFilter* aFilter,
>+                      nsIJunkMailClassificationListener* aJunkListener,
>+                      nsIMsgTraitClassificationListener* aTraitListener,
>+                      nsTArray<PRUint32>& aProTraits,
>+                      nsTArray<PRUint32>& aAntiTraits,
>+                      nsIMsgWindow *aMsgWindow,
>+                      PRUint32 aNumMessagesToClassify,
>+                      const char **aMessageURIs)
>+    :   mFilter(aFilter),
>+        mSupports(aFilter),
>+        mJunkListener(aJunkListener),
>+        mTraitListener(aTraitListener),
>+        mProTraits(aProTraits),
>+        mAntiTraits(aAntiTraits),
>+        mMsgWindow(aMsgWindow)

Please put these initialisers in the same order as you have defined them in the class definition. This avoids compiler warnings.

>     {
>       mCurMessageToClassify = 0;
>       mNumMessagesToClassify = aNumMessagesToClassify;
>       mMessageURIs = (char **) nsMemory::Alloc(sizeof(char *) * aNumMessagesToClassify);
>       for (PRUint32 i = 0; i < aNumMessagesToClassify; i++)
>         mMessageURIs[i] = PL_strdup(aMessageURIs[i]);
>+
>+    }
>+
>+    // junk-only classifier
>+    MessageClassifier(nsBayesianFilter* aFilter,
>+                      nsIJunkMailClassificationListener* aJunkListener,
>+                      nsIMsgWindow *aMsgWindow,
>+                      PRUint32 aNumMessagesToClassify,
>+                      const char **aMessageURIs)
>+    :   mFilter(aFilter),
>+        mSupports(aFilter),
>+        mJunkListener(aJunkListener),
>+        mTraitListener(nsnull),
>+        mMsgWindow(aMsgWindow)
>+    {
>+      mCurMessageToClassify = 0;
>+      mNumMessagesToClassify = aNumMessagesToClassify;
>+      mMessageURIs = (char **) nsMemory::Alloc(sizeof(char *) * aNumMessagesToClassify);
>+      for (PRUint32 i = 0; i < aNumMessagesToClassify; i++)
>+        mMessageURIs[i] = PL_strdup(aMessageURIs[i]);
>+      mProTraits.AppendElement(kJunkTrait);
>+      mAntiTraits.AppendElement(kGoodTrait);
> 
>     }
> 
>     virtual ~MessageClassifier()
>     {
>        if (mMessageURIs)
>        {
>          NS_FREE_XPCOM_ALLOCATED_POINTER_ARRAY(mNumMessagesToClassify, mMessageURIs);
>        }
>     }
>     virtual void analyzeTokens(Tokenizer& tokenizer)
>     {
>-        mFilter->classifyMessage(tokenizer, mTokenSource.get(), mListener);
>+        mFilter->classifyMessage(tokenizer,
>+                                 mTokenSource.get(),
>+                                 mProTraits,
>+                                 mAntiTraits,
>+                                 mJunkListener,
>+                                 mTraitListener);
>         tokenizer.clearTokens();
>         classifyNextMessage();
>     }
> 
>     virtual void classifyNextMessage()
>     {
> 
>       if (++mCurMessageToClassify < mNumMessagesToClassify && mMessageURIs[mCurMessageToClassify]) {
>@@ -1065,17 +1145,20 @@ public:
>         mTokenListener = nsnull; // this breaks the circular ref that keeps this object alive
>                                  // so we will be destroyed as a result.
>       }
>     }
> 
> private:
>     nsBayesianFilter* mFilter;
>     nsCOMPtr<nsISupports> mSupports;
>-    nsCOMPtr<nsIJunkMailClassificationListener> mListener;
>+    nsTArray<PRUint32> mProTraits;
>+    nsTArray<PRUint32> mAntiTraits;
>+    nsCOMPtr<nsIJunkMailClassificationListener> mJunkListener;
>+    nsCOMPtr<nsIMsgTraitClassificationListener> mTraitListener;
>     nsCOMPtr<nsIMsgWindow> mMsgWindow;
>     PRInt32 mNumMessagesToClassify;
>     PRInt32 mCurMessageToClassify; // 0-based index
>     char **mMessageURIs;
> };
> 
> nsresult nsBayesianFilter::tokenizeMessage(const char* aMessageURI, nsIMsgWindow *aMsgWindow, TokenAnalyzer* aAnalyzer)
> {


>-    for (i = 0; i < count; ++i)
>+    // pro message counts per trait index
>+    nsAutoTArray<PRUint32, kTraitAutoCapacity> numProMessages;
>+    // anti message counts per trait index
>+    nsAutoTArray<PRUint32, kTraitAutoCapacity> numAntiMessages;
>+
>+    for (PRUint32 traitIndex = 0; traitIndex < traitCount; traitIndex++)
>+    {
>+      numProMessages.AppendElement(
>+        mCorpus.getMessageCount(aProTraits[traitIndex]));
>+      numAntiMessages.AppendElement(
>+        mCorpus.getMessageCount(aAntiTraits[traitIndex]));
>+    }

From what I can tell, the num*Messages arrays are only added to here. So why not initialise their capacity to traitCount?

>+    for (PRUint32 traitIndex = 0; traitIndex < traitCount; traitIndex++)
>+    {
>+      nsAutoTArray<TraitAnalysis, 1024> traitAnalyses;
>+      // copy valid tokens into an array to sort
>+      for (PRUint32 tokenIndex = 0; tokenIndex < tokenCount; tokenIndex++)

Use tokenCount rather than 1024?

>+        if (listener && !mCorpus.getMessageCount(kGoodTrait) )
>+          isJunk = PR_TRUE;
>+        else if (listener && !mCorpus.getMessageCount(kJunkTrait) )
>+          isJunk = PR_FALSE;

nit: no space in between )) please.

>+  nsAutoTArray<PRUint32, kTraitAutoCapacity> proTraits;
>+  nsAutoTArray<PRUint32, kTraitAutoCapacity> antiTraits;
>+  proTraits.AppendElements(aProTraits, aTraitCount);
>+  antiTraits.AppendElements(aAntiTraits, aTraitCount);

Use aTraitCount for the capacity again?

>+  nsAutoTArray<PRUint32, kTraitAutoCapacity> oldTraits;
>+  nsAutoTArray<PRUint32, kTraitAutoCapacity> newTraits;
>+  oldTraits.AppendElements(aOldTraits, aOldCount);
>+  newTraits.AppendElements(aNewTraits, aNewCount);

ditto but with aOldCount/aNewCount.

(apologies if I'm missing some reason they can't be done like this).

>+
>+  MessageObserver* analyzer = new MessageObserver(this, oldTraits,
>+    newTraits, aJunkListener, aTraitListener);
>+  if (!analyzer) return NS_ERROR_OUT_OF_MEMORY;

Put this on two lines with a blank line after, like you did with the others.

>-    if (oldClassification != newClassification)
>+    for (PRUint32 index = 0; index < oldClassifications.Length(); index++)

Storing the Length in an intermediate variable would save the cost of calling .Length() each time round the loop (ditto below as well).
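i.e. something along the lines of (the variable name is just a suggestion):

PRUint32 numOldClassifications = oldClassifications.Length();
for (PRUint32 index = 0; index < numOldClassifications; index++)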

>+  *aResult = (  (mCorpus.getMessageCount(kGoodTrait)
>+               + mCorpus.getMessageCount(kJunkTrait))
>+              && mCorpus.countTokens());

Please put operators on the end of the line.

> CorpusStore::CorpusStore() :
>-  TokenHash(sizeof(CorpusToken)), mGoodMessageCount(0), mJunkMessageCount(0)
>+  TokenHash(sizeof CorpusToken),
>+  mNextTraitIndex(1) // skip 0 since index=0 will mean end of linked list

sizeof should keep the brackets (it fails to build on my system).

>   for (PRUint32 i = 0; i < tokenCount; ++i)
>   {
>     CorpusToken* token = static_cast<CorpusToken*>(tokens.nextToken());
>-    {
>-      PRUint32 count = aIsJunk ? token->mJunkCount : token->mGoodCount;
>-      if (count > 1 || (!shrink && count == 1))
>-        newTokenCount++;
>-    }
>+    PRUint32 count = getTraitCount(token, aTraitId);
>+    // Shrinking the token database is accomplished by dividing all token counts by 2.
>+    // If shrinking, we'll ignore counts < 2, otherwise only ignore counts of < 1
>+    if ( (shrink && count > 1) || (!shrink && count) )

nit: Please drop the spaces between (( and ))

>+  PRUint32 numberOfTraits = mMessageCounts.Length();
>+  PRUint32 firstTraitToWrite = 3;  // Traits 1 and 2 are in training.dat
>+  PRUint32 numberOfTraitsToWrite = numberOfTraits >= firstTraitToWrite ?
>+                                     numberOfTraits - firstTraitToWrite + 1: 0;
>+  PRBool error;
>+  while (1) // break on error or done
>+  {
>+    if (error = (fwrite(kTraitCookie, sizeof(kTraitCookie), 1, stream) != 1))
>+      break;

I'm getting lots of compiler warnings on these:

/Users/moztest/comm/main/src/mailnews/extensions/bayesian-spam-filter/src/nsBayesianFilter.cpp:1997: warning: comparison between signed and unsigned integer expressions

>+function getSpec(aFileName)
>+{
>+  var file = do_get_file(
>+    "../mailnews/extensions/bayesian-spam-filter/test/resources/" + aFileName);
>+  var uri =  Cc["@mozilla.org/network/io-service;1"]
>+               .getService(Ci.nsIIOService)
>+               .newFileURI(file)
>+               .QueryInterface(Ci.nsIURL);
>+  uri.query = "type=application/x-message-display";
>+  return uri.spec;
>+}

Would it be worth putting this in head_bayes.js as you have two copies of it in this patch?

>+var tests =
>+[
>+  // train two different combinations of messages
>+  {command: kTrain,
>+   fileName: "ham1.eml",
>+   traitIds: [3,6]
>+  }
>+ ,{command: kTrain,
>+   fileName: "spam1.eml",
>+   traitIds: [4]
>+  }
>+ ,{command: kTrain,
>+   fileName: "spam4.eml",
>+   traitIds: [5]
>+  }

Please put the commas on the end of the previous lines.
Attachment #344156 - Flags: review?(bugzilla) → review-
Attached patch Fix Mark's issues (obsolete) — Splinter Review
I fixed all of Mark's complaints and requests, with the exception of:
(In reply to comment #11)
> 
> From what I can tell, the num*Messages arrays are only added to here. So why
> not initialise their capacity to traitCount?
> 

> Use tokenCount rather than 1024?
> 
 
> Use aTraitCount for the capacity again?
> 
 
> ditto but with aOldCount/aNewCount.
> 
> (apologies if I'm missing some reason they can't be done like this).

I tried to do that at one point, but got compiler errors. As I understand it, auto arrays must have their size defined at compile time, so I cannot use variables to define their size. I could switch to non-auto arrays, but the fear there is that it would cause heap fragmentation.

> 
> I'm getting lots of compiler warnings on these:
> 
> /Users/moztest/comm/main/src/mailnews/extensions/bayesian-spam-filter/src/nsBayesianFilter.cpp:1997:
> warning: comparison between signed and unsigned integer expressions

This line is the only place where I get this warning. If you see others, please tell me specifically. I believe the expression triggering the warning is:

  (countTokens() > aMaximumTokenCount))

Both of those originate outside of this code, so I would need a cast of some sort to get rid of the warning. I didn't do that, but I could if you like. The expression is safe since these are both counts.
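For reference, the cast would look something like this - assuming aMaximumTokenCount is the signed operand here; this is illustrative only and is not applied in the patch:

// Both values are non-negative counts, so the cast is safe.
if (countTokens() > static_cast<PRUint32>(aMaximumTokenCount))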
Attachment #344156 - Attachment is obsolete: true
Attachment #344416 - Flags: superreview?(bienvenu)
Attachment #344416 - Flags: review?(bugzilla)
Attachment #344156 - Flags: superreview?(bienvenu)
>I tried to do that at one point, but got compiler errors. As I understand it,
>auto arrays must have their size defined at compile time, so I cannot use
>variables to define their size. I could switch to non-auto arrays, but the fear
>there is that it would cause heap fragmentation.

Right, you should use SetCapacity on auto arrays in the case where you're appending items one at a time in a loop. But other than that, this looks OK, and I'll mark my sr+...
Attachment #344416 - Flags: superreview?(bienvenu) → superreview+
(In reply to comment #13)
 
> Right, you should use SetCapacity on auto arrays in the case where you're
> appending items one at a time in a loop. But other than that, this looks OK,
> and I'll mark my sr+...

I thought that in all cases I defined the desired capacity in the existing patch. Is there still something that I need to change?
I was just pointing out that, instead of using kTraitAutoCapacity, you could use SetCapacity.  The number of traits is probably going to be small, so it's better to have everything on the stack in the normal case, as you've done. But you could check if the count is larger than kTraitAutoCapacity, and call SetCapacity in that case. I'm ok either way - I leave it up to you and Mark.

+    nsAutoTArray<PRUint32, kTraitAutoCapacity> numProMessages;
+    // anti message counts per trait index
+    nsAutoTArray<PRUint32, kTraitAutoCapacity> numAntiMessages;
+
+    for (PRUint32 traitIndex = 0; traitIndex < traitCount; traitIndex++)
+    {
+      numProMessages.AppendElement(
+        mCorpus.getMessageCount(aProTraits[traitIndex]));
+      numAntiMessages.AppendElement(
+        mCorpus.getMessageCount(aAntiTraits[traitIndex]));
+    }
+
Blocks: 461479
Attachment #344416 - Flags: review?(bugzilla) → review+
Comment on attachment 344416 [details] [diff] [review]
Fix Mark's issues

nit: Comment still not quite right:

+[scriptable, uuid(AF247D07-72F0-482d-9EAB-5A786407AA4C)]
+interface nsIMsgTraitClassificationListener : nsISupports
+{
+/**
+ * Inform a listener of a message's match to traits. The list
+ * of traits being matched is in aTraits. Corresponding
+ * indicator of match (percent) is in aPercents.
+ *

should be:

+[scriptable, uuid(AF247D07-72F0-482d-9EAB-5A786407AA4C)]
+interface nsIMsgTraitClassificationListener : nsISupports
+{
+  /**
+   * Inform a listener of a message's match to traits. The list
+   * of traits being matched is in aTraits. Corresponding
+   * indicator of match (percent) is in aPercents.
+   *
etc

I also think David's comments along the lines of

if (aProTraits.Length() > kTraitAutoCapacity)
  numProMessages.SetCapacity(aProTraits.Length());

could be a good saving if we go over kTraitAutoCapacity.

I don't mind if you do that in this patch or a follow-up one.
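Spelled out, the combined pattern would be roughly as follows (names taken from the patch; this is a sketch, not the exact final code):

nsAutoTArray<PRUint32, kTraitAutoCapacity> numProMessages;
nsAutoTArray<PRUint32, kTraitAutoCapacity> numAntiMessages;
// Stay on the stack in the common small case, but reserve the heap
// allocation once up front if the trait count exceeds the auto capacity.
if (traitCount > kTraitAutoCapacity)
{
  numProMessages.SetCapacity(traitCount);
  numAntiMessages.SetCapacity(traitCount);
}
for (PRUint32 traitIndex = 0; traitIndex < traitCount; traitIndex++)
{
  numProMessages.AppendElement(mCorpus.getMessageCount(aProTraits[traitIndex]));
  numAntiMessages.AppendElement(mCorpus.getMessageCount(aAntiTraits[traitIndex]));
}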
Carrying over reviews; patch to check in. Added the SetCapacity as requested, plus fixed the last nit.
Attachment #344416 - Attachment is obsolete: true
Attachment #345139 - Flags: superreview+
Attachment #345139 - Flags: review+
Keywords: checkin-needed
Comment on attachment 345139 [details] [diff] [review]
patch to check in
[Checkin: Comment 18]

http://hg.mozilla.org/comm-central/rev/a1b646111579
Attachment #345139 - Attachment description: patch to check in → patch to check in [Checkin: Comment 18]
Status: ASSIGNED → RESOLVED
Closed: 16 years ago
Flags: in-testsuite+
Keywords: checkin-needed
Resolution: --- → FIXED
It's great to see work on generalising the bayes code to support more traits, and not be limited to classifying junk mail. However, I have only seen mention of classifying emails - it would seem sensible to make sure that any new code is also able to classify items from RSS feeds and other news sources as well.

I've been using feedzero.com and filteredrss.com since they appeared earlier this year and have found them to be very powerful in rating and filtering RSS feeds. It would be a shame to miss this opportunity to include this sort of functionality in Thunderbird 3, as this would, I think, be a killer feature.
(In reply to comment #20)
RSS support has not been on the radar screen in the past, but I think that comes from the origins of these features in the junk mail world. But one powerful application of message processing is dealing with the broader problem of attention management - and in that area I can see that RSS feeds could even be the most important items to manage.

Right now, the underlying code has some limitations in its ability to manage message processing. For example, the bayesian processing is done after the main filtering, so any information that comes from the bayesian processing cannot be made available to filters. I've been thinking about ways to generalize the whole process, so that it could be more easily manipulated by power users and extensions. Thanks to your comment, I will think hard about including RSS feeds into the scheme at that time.