Open Bug 209890 Opened 21 years ago Updated 2 years ago

[RFE] Allow user to view and sort by junk certainty score

Categories

(MailNews Core :: Filters, enhancement)

Tracking

(Not tracked)

People

(Reporter: krellan, Unassigned)

References

Details

(Keywords: uiwanted)

Attachments

(2 files)

User-Agent:       Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.4) Gecko/20030529
Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.4) Gecko/20030529

The user should be allowed to view the junk score, perhaps as a percentage
chance that a given email is junk.
It should appear as a sortable column in the mail pane, so that the user can
sort by junk score.

This would be useful for identifying messages that are right on the edge of the
junk filter, as well as evaluating how well the filter is doing at identifying
which messages are junk.

The user could go to the Junk folder and look at the messages with a low junk
percentage score; perhaps they are really legitimate emails that have been
wrongly classified as junk!  Being able to sort out the "maybe" junk emails from
the "yes" junk emails, without having to read through the entire folder of junk
emails, would be a great boon.  It would ease the worry of having an important
email message accidentally junked.

Also, displaying the junk score as a percentage would be a prerequisite for
another improvement in the junk filtering system: allowing the user to set the
percentage threshold at which an email is considered junk.  This would let the
user tune the aggressiveness of the filter!  If too many or too few messages are
being considered as junk, this would let the user solve that problem.


Reproducible: Always

Steps to Reproduce:
1. Click the "Click to select columns to display" icon.
2. Be disappointed that there is no "Junk Score" in the list of available columns.

Actual Results:  
See above.

Expected Results:  
See above.
*** Bug 208267 has been marked as a duplicate of this bug. ***
see also bug 202557.

i like the phrase "certainty" from dup bug 208267,
adding it to summary.
summary was "[RFE] Allow user to view and sort by junk score"
summary now "[RFE] Allow user to view and sort by junk certainty score"

this enhancement would need a back-end implementation first.
don't know the best bugzilla component to put it in;
"Mail Window Front End" doesn't seem correct.
moving to 'Filters' like dup bug 208267 
until a better option surfaces (like a 'Junk' category).
Component: Mail Window Front End → Filters
Summary: [RFE] Allow user to view and sort by junk score → [RFE] Allow user to view and sort by junk certainty score
I propose a UI solution that would be easy for the user to use and to understand:

Have a slider control in the UI.  It would run from 0% to 100%, with two
levers.  One lever marks the boundary between "spam" and "unsure", and the other
lever marks the boundary between "unsure" and "ham".  The user would be free to
adjust both levers, and the UI would make sure that the levers would never overlap.

This is a common UI metaphor, used in other programs that need the user to
define a range.
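
A minimal sketch of the two-lever idea, assuming the levers are stored as two percentages. The names clampLevers, classifyByLevers, hamUpper, spamLower and minGap are hypothetical and are not part of any existing Thunderbird API:

```javascript
// Hypothetical helpers; nothing here is an existing Thunderbird API.

// Keep both levers inside 0..100 and make sure the "ham" lever never
// reaches the "spam" lever. minGap is an assumed minimum width for the
// "unsure" band between them.
function clampLevers(hamUpper, spamLower, minGap = 1) {
  const clamp = (v, lo, hi) => Math.min(hi, Math.max(lo, v));
  const ham = clamp(hamUpper, 0, 100 - minGap);
  const spam = clamp(spamLower, ham + minGap, 100);
  return { hamUpper: ham, spamLower: spam };
}

// Map a junk percentage onto the three bands defined by the levers.
function classifyByLevers(junkPercent, levers) {
  if (junkPercent >= levers.spamLower) return "spam";
  if (junkPercent <= levers.hamUpper) return "ham";
  return "unsure";
}

const exampleLevers = clampLevers(10, 90);
console.log(classifyByLevers(50, exampleLevers));
```

In this sketch the ham lever wins when the two collide: the spam lever is pushed up to stay at least minGap above it. A real slider could just as well refuse the drag instead.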
Product: MailNews → Core
*** Bug 197480 has been marked as a duplicate of this bug. ***
This TB extension stupidly "fixes" this bug using the custom column handler implemented in Bug 348504, but unfortunately the main application does not properly maintain and use information about junkscore and junkstatus. So if this extension is applied to the stock build, you will just see 0 or 100 as the junkscore.

I am also working on patches to the main source to properly maintain the junk score and status as calculated by the junk filter. Unfortunately there are LOTS of places where the code uses checks of junkscore="100" or junkscore="0" to determine junk status (which is really a different concept entirely).
I started investigating this bug, hoping it would be a fairly easy way to get my feet wet in Thunderbird hacking. But now I seem to be immersed in a flood of issues, mostly related to the past confusion between junk mail "status" and "score". I have this now partially working on my personal build. However I find myself making design decisions, and wondering if there is some history I need to know about, or if the change is considered too massive or controversial to be accepted. So I would appreciate comments from potential code reviewers or others about my direction.

Generally, my task is to ensure that throughout the code base, a proper distinction is maintained between junk "status" and "score".

I define "score" as a numerical rating of the junk potential of an email, typically by some automatic method. The score is 0 for almost certain notjunk mail, and >=100 for almost certain junk mail. I would rather not force 100 as the maximum, as for example SpamAssassin has no natural limit to the score, and it would be convenient to simply scale the SA score by x10 to get the TB junk "score" in some future plugin or extension.

The "status" is a classification by some method, which could be manual or automatic, of whether a message is junk or notjunk.

The two attributes need not be consistent. For example, a low junk score combined with a "junk" status might be used by someone as a filter to determine when to train a junk mail analyzer with a particular email.
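
To make the distinction concrete, here is a rough sketch that keeps the two attributes as independent fields. All function and field names are illustrative assumptions, and the lowScore threshold of 50 is an arbitrary value picked for the example:

```javascript
// Hypothetical sketch: function and field names are illustrative only.

// Scale a SpamAssassin-style score by 10, as suggested above. The result
// is floored at 0 but deliberately not capped at 100.
function scoreFromSpamAssassin(saScore) {
  return Math.max(0, Math.round(saScore * 10));
}

// Status is stored independently of score; one does not imply the other.
function makeJunkInfo(junkScore, junkStatus = "unclassified") {
  return { junkScore, junkStatus };
}

// A message marked "junk" that still has a low score is a candidate for
// training the analyzer. The lowScore threshold is an assumption.
function needsTraining(info, lowScore = 50) {
  return info.junkStatus === "junk" && info.junkScore < lowScore;
}

const manuallyJunked = makeJunkInfo(scoreFromSpamAssassin(1.5), "junk");
console.log(needsTraining(manuallyJunked));
```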

To implement this, I need to do the following:

1) modify onMessageClassified in nsIJunkMailClassificationListener to include an additional parameter, the junk score. Right now the value is calculated by the Bayesian filter, converted to "junk" or "good", thrown away, then recreated as either "0" or "100" which makes no sense to me.

2) add a new database field "junkstatus" that is stored along with "junkscore" and "junkscoreorigin". For junkstatus, I am proposing a text field with values of "unclassified", "junk" or "good".

3) Correct all code locations where junkscore is used as a proxy for junkstatus, to properly use the new junkstatus. This is the big task, which requires patches to approximately 20 different files. It also creates some backward compatibility issues that need to be handled carefully.
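
The three steps above could look roughly like this from script. This is only a sketch of the proposal: the three-argument onMessageClassified is the proposed signature, not the current one, and messageDb is a plain Map standing in for the real message database:

```javascript
// Sketch of the proposed behaviour, not existing code.
const UNCLASSIFIED = 0;
const GOOD = 1;
const JUNK = 2;
const JUNK_STATUS_TEXT = ["unclassified", "good", "junk"];

// Stand-in for the message database (assumption for this example).
const messageDb = new Map();

const listener = {
  // Step 1: the listener receives the score as well as the classification.
  onMessageClassified(aMsgURI, aClassification, aJunkScore) {
    // Step 2: store junkstatus alongside junkscore instead of collapsing
    // the score to "0" or "100".
    messageDb.set(aMsgURI, {
      junkscore: String(aJunkScore),
      junkstatus: JUNK_STATUS_TEXT[aClassification],
    });
  },
};

// Step 3: callers ask about status directly rather than comparing
// junkscore against "100".
function isJunk(aMsgURI) {
  const entry = messageDb.get(aMsgURI);
  return entry !== undefined && entry.junkstatus === "junk";
}

listener.onMessageClassified("mailbox://example/1", JUNK, 62);
console.log(isJunk("mailbox://example/1"));
```

This sketch ignores messages written by older versions, which carry a junkscore but no junkstatus and would need separate handling.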

sample modified code snippet from nsIMsgFilterPlugin.idl:

interface nsIJunkMailClassificationListener : nsISupports
{
    void onMessageClassified(in string aMsgURI, in nsMsgJunkStatus aClassification, in PRUint32 aJunkScore);
};

[scriptable, uuid(caf3d467-d57c-11d6-96a9-00039310a47a)]
interface nsIJunkMailPlugin : nsIMsgFilterPlugin
{
    /**
     * Message classifications.
     * 
     * in earlier versions of mailnews, no junkstatus was stored for messages. Instead, the junkstatus
     * was simply converted into a junkscore of "0" for not junk, and "100" for junk, and no value for unclassified. 
     * To preserve some compatibility between old and new versions, use the following algorithm to determine
     * the junk status of a mix of old and new messages:
     *   If (junkstatus=="junk") then message is junk
     *   else if (junkstatus=="good") then message is not junk
     *   else if (junkscore > 90) then message is junk
     *   else message is not junk
     *
     *   Example implementation of this in javascript is:
     *
     *   var isJunk = (junkStatus == "junk") ? true : ( (junkStatus == "good") ? false : (junkScore>90) );
     */
    const nsMsgJunkStatus UNCLASSIFIED = 0;
    const nsMsgJunkStatus GOOD = 1;
    const nsMsgJunkStatus JUNK = 2;
    
    /**
    * Message classification in db indexed by nsMsgJunkStatus
    *   used to set junkstatus value in database
    *
    */
    const char *JUNKSTATUSTEXT[] = {
        "unclassified", "good", "junk" };
Assignee: sspitzer → nobody
QA Contact: esther → filters
Depends on: 366491
Depends on: 414179
Dmose suggested that I add a few people to the CC list, and discuss whether the TB UI should support the bayes junk percent column.

Once bug 366491 and bug 414179 land, then the requested features in this current bug can be accomplished with an extension. The issue is, should support be added to the main message tree to have a column to display the junk percent?

I'd appreciate some definitive comments from potential reviewers about whether this is desired in the user interface, or should be left to an extension.
OS: Windows 2000 → All
Hardware: PC → All
Can you add a screenshot of what it looks like?
As requested, here's a sample of a view with junk percent showing. This particular view is from my "uncertain" folder, where emails are shown that have a middle range for junk percent. I haven't really thought about the UI graphic issues, so this is not intended to be a definitive proposal for the graphics. Probably the junk symbol with a percent sign would be better than "J%" which doesn't translate.

The issue here is not the approval of my graphic concept, but whether the junk percent should be supported at all in the core display.
I'm not sure it's something that needs to be in the core product, not sure it shouldn't either. Primarily I suppose it'd be used for sorting. It would be kind of cool if clicking the (current) Junk column sorted by junk score instead of only by the status.
(In reply to comment #10)
Visibility of junk percent in some form or another is an important part of a more effective spam management strategy. The current cutoff point is set very high, to prevent false positives. To effectively lower it, users need a number of things:

1) Some confidence that the system is effective at classifying mail. Being able to see the junk percent helps that.
2) An uncertain view that collects the occasional false positive, but doesn't require looking at ALL of the spam to find the false positives.
3) UI to adjust the junk percent limits for the certain and uncertain categories.
4) Increased training of spam and ham, subject to performance limits.
5) Ability to diagnose WHY a particular email was miscategorized.

I've been adding capability to the core to allow this, but not exposing it in the core UI. I'm happy to do this either in the core or extensions. The question about junk percent is part of item 1). (All of this is in response to your comment that this is mostly about sorting. I think it is mostly about confidence.)
Product: Core → MailNews Core
To me, simply a number seems like it adds to informational clutter in a way that outweighs the amount of help it provides.  Maybe a bar-graph like column (like the iTunes store uses for Popularity) would work better, as it would allow the users' brains to process it without as much work needed to interpret.
If you disagree, and think this UI should be considered as-is, feel free to set a ui-review? flag for clarkbw...
My current intention is to add junkpercent display in an extension rather than as a core feature - probably because the response has been lukewarm among module peers so far. There is still some work required in the backend to make it work reliably also, so that junkpercent survives message database reindexing.

Having run with junkpercent displayed for a few months now, I would say that the primary use is in the "Uncertain" virtual folder, where I show emails with 10 < junkpercent < 90. That folder has two goals. First, it allows me to exclude junk email more aggressively from my active email display, but catch false positives reliably in the Uncertain folder. Second, it forces me to train both spam and ham that is uncertain. The Uncertain folder is sorted by junkpercent, so that ham is at the top, and spam at the bottom. For that folder, I am happy with the numerical display.
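
For illustration, the selection and ordering of such an Uncertain view could be sketched as below. The message objects and the junkPercent field are assumptions made for the example; a real virtual folder would be defined through search terms rather than an in-memory array:

```javascript
// Hypothetical sketch of the "Uncertain" view: keep messages with
// low < junkPercent < high, sorted ascending so likely ham comes first.
function uncertainView(messages, low = 10, high = 90) {
  return messages
    .filter((m) => m.junkPercent > low && m.junkPercent < high)
    .sort((a, b) => a.junkPercent - b.junkPercent);
}

const inbox = [
  { subject: "Invoice", junkPercent: 15 },
  { subject: "Win a prize", junkPercent: 99 },
  { subject: "Newsletter", junkPercent: 60 },
  { subject: "Lunch?", junkPercent: 2 },
];
console.log(uncertainView(inbox).map((m) => m.subject));
```

Because filter returns a new array, sorting the view leaves the original list untouched.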

So in my extension, I will probably include methods to exclude Junkpercent column display from most folders.

Sounds entirely reasonable.  Having an "Uncertain" folder is a really interesting idea, because it could mean that the user would stop ever having to scan the "Junk" folder for misclassified stuff, and could just scan the (hopefully much smaller) "Uncertain" folder instead.  This seems like it would be worth doing some experimentation around.  
(In reply to comment #16)
> Sounds entirely reasonable.  Having an "Uncertain" folder is a really
> interesting idea, because it could mean that the user would stop ever having to
> scan the "Junk" folder for misclassified stuff, and could just scan the
> (hopefully much smaller) "Uncertain" folder instead.
> 

You reminded me of another reason why I prefer the display of numbers to the bar graph, at least when you are thinking about junk as you do in the Uncertain folder.

There is a preference that sets the cutoff value of junk percent at which the system will declare something to be spam and move it to the spam folder. Typically that value should be set very conservatively, so that it is very unlikely for important or interesting emails to be automatically categorized as junk. Users need to learn the appropriate value for that cutoff, both to help them set it and to give them the confidence to lower it sometimes, by watching how the junk filter is working in marginal cases such as those that show up in the Uncertain folder. The junkpercent number is initially not meaningful to users (it is not really a statistical probability that a message is junk, for example), but it develops some meaning as they see emails appear in the Uncertain folder, or sort by junkpercent to see a few examples of high junkpercent emails that turned out to be ham. They really need to see the junkpercent in those cases, as they will need to set the junkpercent cutoff values for both the Uncertain folder and the automatic categorization as spam.
Just to note that bug 446306 and the folder pane designs ( https://wiki.mozilla.org/Thunderbird:Folder_Pane ) include a similar "Possible Junk" folder.  It'd be great to get comments on that.
Severity: normal → S3