Closed Bug 366491 Opened 18 years ago Closed 16 years ago

Maintain correct junkscore and junkscoreorigin, preserve junkstatus

Categories

(MailNews Core :: Filters, enhancement)

enhancement
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: rkent, Assigned: rkent)

References

(Blocks 1 open bug)

Details

Attachments

(2 files, 3 obsolete files)

User-Agent:       Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.8.1.1) Gecko/20061204 Firefox/2.0.0.1
Build Identifier: Thunderbird 2.0 beta 2007-01-08

In the core junk mail code, the handling of junkstatus, junkorigin, and junkscore is inconsistent. Specifically:

1) junkscore is used throughout as a proxy for junkstatus. That is, really only two values are allowed, 0 and 100. Junkscore is calculated by the Bayesian filter, but not communicated outside of the plugin.

2) junkstatusorigin does not really reflect the origin of the status. "plugin" is set when a filter is used. Sometimes junkstatus is fixed by address book whitelists without any update to junkstatusorigin.

3) junkstatus is the primary variable communicated about email, but is not actually maintained after it is communicated, other than being used to set an approximate junkscore of 0 or 100.

These issues have little effect on the current UI, but prevent extensions from adding additional information about junk mail handling.

Reproducible: Always

Steps to Reproduce:
1. Allow bayesian filter to analyze an email with junkscore between 0 and 100
2. Examine junkscore values with, for example, and extension
Actual Results:  
junkscore will be 0 or 100

Expected Results:  
junkscore should be actual value determined by the plugin, for example 20.

junkstatusorigin should maintain values of "plugin" when junk status is set by the bayes filter plugin, "filter" when set by a user-defined filter, "user" when set by a user directly, "whitelist" when set from the address book, and "flags" when set from IMAP server flags.

Replace junkscore with junkstatus as the primary variable used to determine email junk status. Maintain junkstatus in the database, and from the IMAP flags.

Change OnMessageClassified to communicate junkscore, and maintain its value in the database.

Proposed MOZILLA_1_8_BRANCH patch to follow.
Attached patch MOZILLA_1_8_BRANCH fix, rev 1 (obsolete) — Splinter Review
It's probably way late to consider a large patch like this for TB 2.0, but what the heck here it is so let's discuss it. This patch only works on MOZILLA_1_8_BRANCH. Trunk has deviated too much, I will need to do a separate patch for that. I would like feedback about whether this is a direction that you would like to see for the code, and if so at what point it should be applied. Then if desired I can do the patch for the trunk. If too late for TB 2.0 I understand.

The main variable that determines junk status is changed from junkscore to junkstatus. Backward compatibility is maintained by the IsJunkFromStatus function, which gives correct results for previously categorized email. That function also provides a consistent, centralized point to determine whether a previously scored message is junk, which is currently done many different ways. 

Once you start using this code there are some subtle issues if you revert back to older code, for example if junkscore is 95 then old code may consider that marked as not junk. That is because old code really only supports 0 or 100 as a value for junkscore, anything else is treated inconsistently.

The code should appear to do nothing with the standard UI. To use it, you will need to use an extension, for example: https://bugzilla.mozilla.org/attachment.cgi?id=245170

After this code is implemented, you will really wish that you could filter on junkscore, but that will need to be for the future. I typically keep junkscore displayed for all of my emails, and train emails that are close to the limits.

Really the main point of this is to allow an extension that will give more extensive information about junk mails for those who want more tight control. Personally, I find the standard interface does not give enough feedback, so often the junk mail processing appears to be doing nothing. I change the junk mail icon to a three-state icon, display junkscore and junkscoreorigin, and then I have a very responsive junk mail management system.

Unfortunately information is lost if emails are copied between servers. If you copy an IMAP message between servers, junkstatus is maintained, but junkscore reverts to the old behavior of only allowing 0 or 100, and the junkstatusorgin is set to "flags".
Attachment #251004 - Flags: review?(bienvenu)
Thx for working on this, Kent. I think in general it could be very handy.

yes, we couldn't take this for the 2.0 branch since it changes methods in interfaces, among other things...

I'm not crazy about all these calls to get the junk mail service in order to convert the score to status, especially in code that's called during display (i.e., a lot!). Getting a service is hash lookup, but it's not free...

I don't have time to really analyze this patch right now - but looking at it, I get the feeling that there must be a way to get what you want w/o changing nearly so much code, but that's just a gut feeling...
Status: UNCONFIRMED → NEW
Ever confirmed: true
1. You could change much less code if the existing "junkscore" variable was accepted for what it is, namely a three-state flag with values of null, 0, or 100 for the three possible values of junkstatus. Then you could introduce a new variable, say "realJunkscore" which includes the correct 0-100 value. I am not advocating this, just saying it is possible.

2. The correct maintenance of junkscoreorigin is a much lighter change, and perhaps should be made a separate bug. That is, use "user", "plugin", "filter", "whitelist" or "flags" depending on who set the junkscore.

3. the isJunkFromStatus function, which requires the frequent calls to the junkmail service, was the last thing I changed, because I thought I would get flak for my previous inline implementation. It would be easy enough to replace it with separate JavaScript and CPP functions and abandon the XPCOM interface.

Yes, I think introducing a new variable would be a lot less disruptive - I'm trying to think of a better name than realJunkscore - filterJunkscore, maybe, to indicate that it's the value the plugin set, as opposed to what TB is using (or maybe pluginJunkscore).
I've reviewed this again, trying to see if I can make it a lighter weight patch. Here's what I intend to do.

1. I won't rename junkscore, not will I change the places where it is tested inconsistently. So junkscore will retain its current definition, that is =0 for ham, and =100 for spam. Any valid tests with that definition (for example, either ==0 or <50 as tests for ham) will be left alone. I also won't add a separate function to provide that test.

2. I will add a new variable bayesJunkscore that will retain the junk indicator returned from the bayes filter. (It isn't really a probability, just an indicator that has been scaled from 0 to 100 to resemble a probability).

3. I will change the onMessageClassified function call in nsIJunkMailClassificationListener interface to add a new variable, that is it will go from

void onMessageClassified(in string aMsgURI, in nsMsgJunkStatus aClassification);

to

void onMessageClassified(in string aMsgURI,
  in nsMsgJunkStatus aClassification, 
  in PRUint32 aBayesJunkscore);

4. I will attempt to preserve more useful values of junkstatusorigin. That is, I will add new values "whitelist" when set from the addressbook, "filter" when set from a filter, and "flags" when set from IMAP flags. Note this will result in loss of values in some IMAP cases, because we cannot tell why a message on an IMAP server was marked as junk.

Comments are welcome.

How do I assign this bug to myself? Is that a bugzilla right that I don't have?
Here you go. (I don't think it requires special privileges, its the Reassign bug to _____ field down on each bug page.)
Assignee: nobody → kent
Note to self: This will impact STEEL if nsIJunkMailClassificationListener changes.  Nothing serious, just will need to update the junk setter and getter for steelIMessage.
Magnus, now that you have assigned the bug to me, I see the fields at the bottom. But I don't see them on other bugs, just this one. So I think there is some right involved.
Status: NEW → ASSIGNED
Blocks: 209890
(In reply to comment #3)
>3. the isJunkFromStatus function, which requires the frequent calls to the
>junkmail service, was the last thing I changed, because I thought I would get
>flak for my previous inline implementation. It would be easy enough to replace
>it with separate JavaScript and CPP functions and abandon the XPCOM interface.
You don't need this at all. Simply continue to set the old junkscore value.
(New code, e.g. sort by score, can of course use the new bayesjunkscore value.)
I'm not sure that I understand your comment, as "Simply continue to use the old junkscore value" does not seem to relate to the issue of providing a central place for the evaluation of isSpam from the junkscore value. Nevertheless, I think we are in agreement. That is, originally I was proposing to rename the existing junkscore variable to junkstatus, and add a new junkscore variable that had the bayes score. Although that has appeal to a coding purist, as it more accurately reflects the usage of the variables, it causes issues of compatibility if people tried the new version, and then switched back to the old. Plus, it forced me to change a lot of code just to rename things. Now, I am leaving junkscore alone, just adding the new bayesjunkstatus variable. I think that is a better solution for compatibility. I still hate leaving a binary (junkscore) with 100 for true and 0 for false.

I'm still trying to figure out the priorities within this project between code purity, backwards compatibility, centralized management of issues, etc. The isJunkFromStatus function was an attempt at centralizing the test for spam in a single place, which is a Good Thing in my book. David mentioned some possible performance issues with that, and I also got the sense that there was a strong desire to reduce the total number of lines of code that was changed. So I'm not doing that in my current code incarnation. But I can easily be pushed this way or that by winds of opinion, so feel free to move me.

On the issue of centralizing, I've asked in m.d.a.thunderbird about the existence of a central place to store common javascript code. The same issue exists for isJunkFromStatus. Is there a single place where I could put a utility function like that, that would be available to all of the SM and TB code?
Progress update: I'm got an initial version of this patch done, and am using it as my main email client. But I want to do a change. Currently, sometimes when I see that the bayesjunkscore is in a middle range, I will set the email as junk or not junk. But my code is resetting bayesjunkscore to 0 or 100 at that point. That's not necessary, and I lose useful information about scores. So I'm going to change it so that bayesjunkscore, if it is already set, is not changed if a user manually sets the junk status. I'm not sure what to do if the bayesjunkscore is null. Leaving it null sounds consistent, but I'm afraid it might cause problems for people who try to sort email by bayesjunkscore. I could fix that in the tree sort field I guess, but that might also surprise people. Still, I think that's what I'll do so that bayesjunkscore ONLY contains a value actually returned from the bayes filter.
I'm implementing an additional change here. The junkscoreorigin variable (which is virtually unused at the moment) has a value of "user" when manually set, but "plugin" in all other current circumstances, even when the junkscore is set by whitelisting or filters. I've added values of "whitelist" or "filter". But if someone tried to analyze messages, they would not know if "plugin" was really due to the plugin, or was just an old value, possibly a whitelist or filter, before my changes.

So I am proposing to completely eliminate the term "plugin" and instead use "bayes" when the junkscore is set by the bayes filter. That way, future users of junkscoreorigin can tell the difference between the ambiguous legacy term "plugin" and the more precise term "bayes".
Kent, I think that's fine - I don't believe anyone uses the junkscoreorigin (except possibly for extensions)
Attached patch Trunk using bayesjunkscore (obsolete) — Splinter Review
This trunk patch implements the bayesjunkscore variable, and gives meaningful values to junkscoreorigin. One last change was implemented here: I use "imapflag" as junkscoreorigin when set from IMAP flags, so that the first character is unique for easy checking. This patch was tested on both Thunderbird and SeaMonkey trunk builds. An attachment will follow with an extension that shows these values.
Attachment #251004 - Attachment is obsolete: true
Attachment #299485 - Flags: review?(bienvenu)
Attachment #251004 - Flags: review?(bienvenu)
Use this extension on TB or SM trunk to show the variables bayesjunkscore and junkscore origin.
Blocks: 414179
How can I get my patch reviewed?
-    db.setStringProperty(msgHdr.messageKey, "junkscoreorigin", "plugin");
+    db.setStringProperty(msgHdr.messageKey, "junkscoreorigin", "bayes");

The point of using "plugin" is that we don't know what the classifier is in the notification - the user could have installed an extension that installs a new classifier by overriding the built-in classifier, and the notification callback doesn't get told what classifier actually was. Are you getting actual value by changing it to "bayes" or did it just seem like the right thing to do?

Similarly, we don't know if it's a bayesian score - the plugin mechanism is supposed to be somewhat generic so we can have different kinds of spam filters.
Is there any practical mechanism by which the bayes plugin could be overridden in an extension? If there is, then perhaps we should take the next step, and add some sort of additional string return to OnMessageClassified that informs users what sort of plugin is in use - then set that to "bayes" with the standard plugin.

But if there is no practical way for an extension to override this (and I don't see any practical way) then I would argue that we design to the current practice, which is that the plugin is a bayes filter, and "bayes" is more meaningful then "plugin". As I argued in comment #12, "plugin" was not used correctly in the past, so a renaming to something would make it easier in the future to truly understand how a junkscore was assigned. If you want to maintain the possibility that the bayes plugin could be overridden, then I would still argue that we replace "plugin" with some other term.
yes, all an extension has to do is install an xpcom component that registers the same contract id, and that will override the bayesian plugin. I know of an extension that does that, and uses one of the other spam libraries.
So which do you prefer:

1) Leave "plugin" as junkscoreorigin, and live with the ambiguity in prior versions (I'll still add "whitelist", "filter", and "imapflag" values),

2) Add a text parameter returned from OnMessageClassified to return a name for the plugin to use in junkscoreorigin. Change standard plugin to return "bayes".

3) Rename "plugin" to something else generic in the future, say "analyzer".

4) ???

If we support alternate plugins, perhaps I should rename the new database entry (and associated variables) to something more generic instead of bayesjunkscore. Maybe "junkpercent"?
thx, Kent, I think 1) is fine - junk is not historically interesting for most users.

Yes, renaming the db entry and var names to something like junkscore or junkpercent would be good.
Blocks: 424539
I've now changed the variable name to junkpercent (that is, it does not imply that the plugin is a bayes algorithm.) Also I only set that if the plugin really calculated something, so for example if the junk status is set with the whitelist,  junkpercent is not set. I also kept the value "plugin" for junkscoreorigin - though it is used in less cases then before since there are new values like "whitelist", "imapflag", and "filter".

I'm hoping this can be reviewed and checked in before the pending Thunderbird alpha. Sorry for the delay, I was waiting for bug 336615 to land which had considerable interference with this bug.
Attachment #299485 - Attachment is obsolete: true
Attachment #299486 - Attachment is obsolete: true
Attachment #312214 - Flags: review?(bienvenu)
Attachment #299485 - Flags: review?(bienvenu)
Flags: wanted-thunderbird3.0a1?
Kent, this looks good, but one question - if aJunkPercent is always a UINT, why not use nsIMsgHdr.setUint32Property? That way, you don't need that code to convert back and forth between a uint and  string, and you can let the msg db code do it. Or you could add a method nsIMsgDatabase to setUint32Property that's similar to setStringProperty. Other than that, it looks good.
The junkpercent code parallels the type decisions made for junkscore, which also is a uint with value 0, 100, or null. I don't have a strong feeling about it though (other than my laziness to avoid changing something that works). I usually try to follow the existing style and decisions in the code whenever possible. Would you like for me to change this? 
Then again, I see that junkscore has introduced its own conversion issues, like this recent code at http://mxr.mozilla.org/seamonkey/source/mailnews/base/src/nsMsgDBView.cpp#1237:

  nsCString junkScoreStr;
  msgHdr->GetStringProperty("junkscore", getter_Copies(junkScoreStr));
  if (!junkScoreStr.IsEmpty()) {
    properties->AppendElement(junkScoreStr.ToInteger((PRInt32*)&rv) == nsIJunkMailPlugin::IS_SPAM_SCORE ?
                              kJunkMsgAtom : kNotJunkMsgAtom);
    NS_ASSERTION(NS_SUCCEEDED(rv), "Converting junkScore to integer failed.");
  }

I'm not familiar with the type issues in the database. Are attributes always stored as strings, and converted to integers when read? Or is 32 (as an integer) stored differently then "32" as a string?. To put it another way, if I change the existing code that accesses junkscore to use uint instead of string reads, will that clobber the existing data? If not, then maybe I should convert the calls to junkscore to also use uint32 access methods, and follow that parallel with junkpercent.
attributes are always stored as strings in the actual database, so yes, you should be OK.

I see what you mean about the way junk score is treated - I'm going to leave the decision up to you whether to leave your patch the way it is or to store them as uint32 properties...
I tried a little to implement this with uint32 properties, but started to encounter issues in parts of the code I have not dealt with before. For example, there's SetAttributesOnPendingHdr which is called for junkscore, and seems to use only string properties. There may be easy answers to these issues, but I'm just getting in deeper to avoid a few lines of string conversion code - plus I lose the distinct comprehension advantage for future programmers of the parallels between junkscore and junkpercent. So let's just leave junkpercent as a string. If so, can you accept that patch as is? And can you also sr or do I need another?
Comment on attachment 312214 [details] [diff] [review]
New variable is now junkpercent

sure, thx, Kent.
Attachment #312214 - Flags: superreview+
Attachment #312214 - Flags: review?(bienvenu)
Attachment #312214 - Flags: review+
Keywords: checkin-needed
mail/base/content/mailCommands.js 1.43
mailnews/base/resources/content/mailCommands.js 1.110
mailnews/base/search/public/nsIMsgFilterPlugin.idl 1.16
mailnews/base/src/nsMsgDBView.cpp 1.307
mailnews/base/util/nsMsgDBFolder.cpp 1.333
mailnews/extensions/bayesian-spam-filter/src/nsBayesianFilter.cpp 1.73
mailnews/imap/src/nsImapMailFolder.cpp 1.804
mailnews/local/src/nsLocalMailFolder.cpp 1.576
mailnews/local/src/nsParseMailbox.cpp 1.301
Status: ASSIGNED → RESOLVED
Closed: 16 years ago
Keywords: checkin-needed
Resolution: --- → FIXED
Flags: wanted-thunderbird3.0a1?
Product: Core → MailNews Core
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: