Use bayesian filters for other features than spam (junk mail)

Status: NEW
Assignee: Unassigned
Product: MailNews Core
Component: Filters
Severity: enhancement
Opened: 15 years ago
Last updated: 3 years ago
Reporter: William Tanksley
Blocks: 1 bug
Attachments: 1 attachment, 2 obsolete attachments

Description

15 years ago
User-Agent:       Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)
Build Identifier: 

Bayesian learning is great for spam. Thanks! Now I want to use it for everything.

What if every message I manually moved into a folder was added to the Bayesian
corpus for that folder, and messages could be automatically filtered into
folders based on Bayesian matches?
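
For illustration, here is a minimal Python sketch of the idea: one token corpus per folder, trained on manual moves, with a naive Bayes score used to pick a folder for new mail. The class and method names are hypothetical and the whitespace tokenizer is a deliberate simplification; this is not how the Mozilla junk filter is implemented.

```python
import math
from collections import Counter, defaultdict


class FolderClassifier:
    """Toy multinomial naive Bayes with one token corpus per folder."""

    def __init__(self):
        self.token_counts = defaultdict(Counter)  # folder -> token -> count
        self.message_counts = Counter()           # folder -> messages trained

    def train(self, folder, text):
        """Call when the user manually moves a message into `folder`."""
        self.token_counts[folder].update(text.lower().split())
        self.message_counts[folder] += 1

    def scores(self, text):
        """Return {folder: log-probability score} for a new message."""
        vocab = set()
        for counts in self.token_counts.values():
            vocab.update(counts)
        total_messages = sum(self.message_counts.values())
        result = {}
        for folder, counts in self.token_counts.items():
            # Prior: fraction of trained messages that went to this folder.
            score = math.log(self.message_counts[folder] / total_messages)
            folder_total = sum(counts.values())
            for token in text.lower().split():
                # Add-one smoothing so an unseen token does not zero out a folder.
                score += math.log((counts[token] + 1) / (folder_total + len(vocab)))
            result[folder] = score
        return result

    def best_folder(self, text):
        """Return the highest-scoring folder, or None if nothing was trained."""
        scores = self.scores(text)
        return max(scores, key=scores.get) if scores else None
```

Every call to `train` corresponds to one manual move, so the corpus for a folder grows only from messages the user actually filed there.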

This would be far easier to work with than the rule-based systems all other
email clients have... and in some cases, would be more useful (some mailing
lists seem to actively TRY to make it impossible to filter based on headers).

It would be even nicer if you could do more than just move from folder to
folder, but that's a good start -- and it's a great user interface.

IMO. :-)

Reproducible: Didn't try

Steps to Reproduce:

Comment 1

15 years ago
Yes, this would be great.  POPFile at SourceForge tries to do this as a local
POP3 server.  Then you'd only have to make one filter per folder and let the
Bayesian stuff do the rest.

http://sourceforge.net/projects/popfile/

Apparently their code works on Windows, but is probably quite portable.

Comment 2

15 years ago
I also would like the Bayesian filters for news. Seems like a great way to
filter trolls.

Comment 3

15 years ago
Ifile did this, back in 1996 (or maybe earlier). See:
http://www.ai.mit.edu/~jrennie/ifile

Comment 4

15 years ago
*** Bug 180160 has been marked as a duplicate of this bug. ***

Comment 5

15 years ago
*** Bug 192927 has been marked as a duplicate of this bug. ***

Comment 6

15 years ago
Might I propose to add the keywords Junk Mail Controls somewhere? That might
ease the duplication of this rfe :)

Comment 7

15 years ago
Some, imo, key reasons why the spam classifier works nicely:

1. The Bayesian classifier. But that's just because...

2. Near zero false positive rate.
   "False positives are bugs." [1]

3. Acceptably fast learning rate.

4. Few false negatives.

5. "False negatives are about performance, not bugs." [1]

Presumably, many clear cut 'readme' / 'ignoreme' binary classifiers will retain
these desirable traits.

Other binary classifiers might be problematic. For example, a 'readme' /
'readmetoo' binary classifier like girls/boys means that point 5 no longer
applies, so suddenly false negatives become "bugs" just like false positives,
and so the number of "bugs" is going to naturally jump, say, an order of
magnitude higher. Other problems arise for other types of binary classifier.
These other binary classifiers might still be as successful for some as the spam
detection classifier, but they probably won't be in the general case.
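
To make the counting argument concrete, here is a small Python sketch. It assumes a hypothetical helper, `count_bugs`, that treats an error as harmless only when the message was predicted into a designated "safe" class (for junk mail, the inbox). With no safe class, every misclassification counts as a bug.

```python
def count_bugs(pairs, harmless_prediction=None):
    """Count misclassifications the user would experience as bugs.

    pairs: iterable of (true_label, predicted_label).
    harmless_prediction: a predicted label whose errors only cost
        convenience (e.g. "inbox"); None means every error is a bug.
    """
    bugs = 0
    for truth, predicted in pairs:
        if predicted != truth and predicted != harmless_prediction:
            bugs += 1
    return bugs


# Junk filtering: a missed junk message lands in the inbox (harmless),
# while a good message filed as junk is a bug.
results = [("inbox", "inbox"), ("inbox", "junk"),
           ("junk", "inbox"), ("junk", "junk")]
junk_style_bugs = count_bugs(results, harmless_prediction="inbox")
symmetric_bugs = count_bugs(results)  # 'readme' / 'readmetoo' style
```

The same four outcomes yield one bug in the junk-style setup and two in the symmetric one.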

Things are generally worse for ternary classifiers and beyond.

Classifiers can of course be chained. That is, one could plausibly have a 'check
these out first' / 'check later' classifier run over one's non-spam messages to
ensure that more urgent emails get dealt with first. Much more interestingly,
one could have such a classifier run over one's nntp messages and have them
marked as readworthy (or not). Note that I have no idea if these particular
binary classifiers would work out in practice, they're just ideas.

Going further, assuming the Bayesian classifier could be applied outside
of mail/news, one could conceivably mark web sites according to various
classifications, or at least as 'rulez'/'sucks', and build up one's own
web content corpus that could be used to augment the browsing
experience by doing things like, say, guiding link prefetch to quit prefetching
sucky pages.

[1]
Paraphrasing Paul Graham.


--
ralph

Comment 8

15 years ago
> Other binary classifiers might be problematic. ... number of "bugs" is going
> to naturally jump

What am I thinking? One simply makes something like a boy/girl classifier
be a 3 way classifier, with XXYs left in the inbox.

I think the rest of what I said stands.

--
ralph

Comment 9

15 years ago
*** Bug 195709 has been marked as a duplicate of this bug. ***

Comment 10

15 years ago
*** Bug 196044 has been marked as a duplicate of this bug. ***

Comment 11

15 years ago
*** Bug 197779 has been marked as a duplicate of this bug. ***

Comment 12

15 years ago
Please modify the bug summary to contain the word junk. This will prevent dupes.
It would have at least prevented the dupe I entered (I used the word junk as
that is the word used frequently in mozilla)

I tried setting it to 'Bayesian filters for more than spam (junk)' but only the
owner can do that.

Updated

15 years ago
Summary: Bayesian filters for more than spam → Bayesian filters for more than spam (junk)
(Reporter)

Updated

15 years ago
Summary: Bayesian filters for more than spam (junk) → Bayesian filters for more than spam (junk mail)

Comment 13

15 years ago
*** Bug 199022 has been marked as a duplicate of this bug. ***

Comment 14

mass re-assign.
Assignee: naving → sspitzer

Comment 15

14 years ago
*** Bug 207609 has been marked as a duplicate of this bug. ***

Comment 16

14 years ago
Altering summary - I thought it was about false spam hits.
Summary: Bayesian filters for more than spam (junk mail) → Use bayesian filters for other features than spam (junk mail)

Comment 17

14 years ago
Created attachment 130265 [details] [diff] [review]
preliminary version of my implementation of this feature

Attached is my first ever Mozilla patch :).  (Actually, it's for Thunderbird,
don't think it would work on Mozilla Mail).

This more or less implements this feature.  Whenever you move a message to
another folder, it will train the filter.  When you receive new messages or
choose Run Junk Mail Controls, it will try to classify the messages into the
appropriate folders - if it is confident it will move them, if it is not so
confident it will just give a suggestion in the "junkbar" area.  The confidence
threshold is kind of low at the moment (50%, though the percentages are
probably not quite accurate), so it will probably move more than it should - I
plan to eventually make this a variable the user can set.  Be warned that this
is far from a final release - there are plenty of issues...
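
The move-or-suggest behaviour described above could be sketched roughly as follows in Python. This is an illustration under assumptions, not code from the patch: the function names are hypothetical, the scores are assumed to be per-folder log scores, and the normalization shown here is just one way of arriving at a percentage.

```python
import math


def confidences(log_scores):
    """Turn per-folder log scores into probabilities that sum to 1."""
    top = max(log_scores.values())
    # Subtract the maximum before exponentiating for numerical stability.
    exp = {folder: math.exp(score - top) for folder, score in log_scores.items()}
    total = sum(exp.values())
    return {folder: value / total for folder, value in exp.items()}


def decide(log_scores, threshold=0.5):
    """Return ('move', folder) when confident, otherwise ('suggest', folder)."""
    probs = confidences(log_scores)
    folder = max(probs, key=probs.get)
    action = "move" if probs[folder] >= threshold else "suggest"
    return action, folder
```

Raising `threshold` trades automatic moves for suggestions, which is the knob a user-settable preference would expose.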

First and most importantly, IMAP does not yet work properly; Bienvenu told me
he would work on a patch to help make IMAP more feasible for this feature.
Along similar lines, I haven't tried copying across accounts, so I'm not sure
how well that'll work.  For now, this actually replaces the junk mail controls.
It would be possible to do it separately, but to me it makes a lot more sense
this way (since you can use it to filter out junk anyway).  To get it to work,
you must enable adaptive junk mail filters and disable whitelists (there's
another thing I'll have to implement later).  You should set up a junk folder
as usual and you might have to allow it to move messages; I'm not sure if it
would work properly without that at the moment.  When trying to train a new
message that just came in, it sometimes doesn't train the first time you move
it - just keep moving it until you see the "observing message" message in the
terminal.

There are still a bunch of improvements to be made that I haven't yet gotten
around to implementing (notice all the ARITODO comments).  It's possible that
there are also some memory leaks, and it will print a bunch of debugging
messages onto your terminal.  I wouldn't suggest renaming/deleting any folders
you're using this with.  If you test this, please give me feedback.  Be nice
:).

Enjoy!
Disclaimer: if there is anything I forgot to mention or if I messed up the diff
or if this completely fails to work, I'm blaming it on the fact that it's after
4 am.

Comment 18

14 years ago
Hey, that sounds like a great patch.  (Forgive me for not testing it -- my email
doesn't like to be experimented with :)).

Two small questions:

1) How does this interact with the user's existing rules for moving (e.g.
mailing list) mail into subfolders?

2) Does it make a special exception for the learning of mail dropped into (for
example) the Trash folder?  Some other folders' contents are just unclassifiable
too, which I wouldn't want stuff automatically filed in.  For example, I keep
'ANSWER-ME', 'TO-DO', 'RECIPES-I-HAVE-TRIED' and 'FUNNY-CLASSICS' folders.

Comment 19

14 years ago
Given that this patch is for Thunderbird, it would be nicer if you packaged it as
an extension instead. It's a great feature, but it seems specialized enough that
only a few advanced users might need it, and it's better to avoid bloat for the
regular user.

Comment 20

14 years ago
It should interact with existing filters the same way the current spam filter 
does - I'm pretty sure existing filters should get precedence and my filter 
will only try to classify a message that isn't touched by the existing ones, 
but I haven't really tried this so I can't confirm that. 
 
It ignores all moves to the trash - clearly you don't want to train on those.  
I'm thinking about eventually making a preferences dialog that would let you 
choose which folders you want to allow it to move messages to (so instead of 
the current system where you just choose to allow it to move messages or not, 
you can do it on a per-folder basis).  I'm not sure I really see the harm in 
it training on some of those other folders, as long as it never actually tries 
to move a message there against the user's will.  Who knows, though, maybe it 
could actually learn some of those?  I could imagine it picking up a pattern 
in some of the messages you consider high priority, though obviously this 
would be a lot harder than a more subject-oriented category. 
 
As for the comment about making it an extension, I would definitely like to, 
but for now I am focusing on getting it working.  I think it shouldn't be too 
hard to move most of my code into my own file once everything works - if 
anyone wants to try to convert it now that'd be great though.  On the other 
hand, if kept as-is, it's actually not quite as much bloat as you'd expect, 
since it actually removes a lot of code from the current spam filter's 
implementation: 
ari@ari:~/mozilla/mozilla$ grep -r ^[+] mydiffs -c 
954 
ari@ari:~/mozilla/mozilla$ grep -r ^[-] mydiffs -c 
466 
About 250 of the added lines are used to implement my own listener class in 
the msgCopyService to find out when the copy is complete - it's got all these 
empty functions that don't do anything.  It's possible that once Bienvenu adds 
the notification that will help make IMAP work I can cut down on some of 
these. 

Comment 21

14 years ago
Damn - just discovered that I did botch the patch a little: line 309 should read:
@@ -42,6 +42,9 @@ (it is +42,10 in the file), and line 319 should read:
@@ -385,6 +388,257 @@ (it was +389,257).
Guess that's what happens when you try to clean up an extra newline AFTER
doing the diff...sorry.
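
For anyone fixing hunk headers by hand: in a unified diff, the header `@@ -a,b +c,d @@` declares that the hunk covers `b` lines of the old file (context plus removed lines) and `d` lines of the new file (context plus added lines). The Python sketch below, with a hypothetical `check_hunk` helper, recounts those numbers from the hunk body so that a mismatch like the one above can be spotted.

```python
import re

HUNK_RE = re.compile(r"^@@ -(\d+)(?:,(\d+))? \+(\d+)(?:,(\d+))? @@")


def check_hunk(header, body_lines):
    """Return (declared, actual) pairs of (old_count, new_count) for one hunk."""
    m = HUNK_RE.match(header)
    if m is None:
        raise ValueError("not a unified-diff hunk header: %r" % header)
    # A missing count in the header means 1.
    declared = (int(m.group(2) or 1), int(m.group(4) or 1))
    # Context (' ') and removed ('-') lines belong to the old file;
    # context and added ('+') lines belong to the new file.
    old = sum(1 for line in body_lines if line[:1] in (" ", "-"))
    new = sum(1 for line in body_lines if line[:1] in (" ", "+"))
    return declared, (old, new)
```

A hunk with six context lines and three added lines should therefore be declared as `-N,6 +M,9`.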

While I'm posting, if anyone wants to try to test this out, here are all the
steps for getting this working in Linux without worrying about losing any e-mail
(I just tried it on another machine so I'm pretty sure these should work).  It's
actually not too hard to do, even if you've never built Mozilla before:

1) Follow the CVS build instructions from
http://www.mozilla.org/projects/thunderbird/build.html
2) Before running the build_all line, get my patch, put it inside your mozilla
directory, and make those two changes I just mentioned above.  Then type "patch
-p0 <mypatch" (replacing mypatch with whatever you saved my patch as).
3) Do the build_all as in the instruction page - this will take a while.
4) (optional, but highly recommended) If you already have a ~/.thunderbird
directory, move it to ~/.thunderbird-backup - this way we don't mess with any
e-mails you already have saved there.
5) Go into the newly created dist/bin directory and run thunderbird.  Create a
new POP account (you are using POP, right?), and make sure you don't turn off
the default option of "leave messages on server" and don't turn on "delete from
server when I move messages" so that you don't have to worry about losing
anything (both options in Tools->Account Settings->Server Settings).
6) Go to Tools->Junk Mail Controls and click the adaptive filters tab, and
enable it.  Go back to the settings tab and disable the white lists and
(optionally) enable the other 3 check boxes.
7) Make some new folders in your account, get some e-mail and move it around.
8) If something with my filters gets really out of whack, you can start over by
deleting the ~/.thunderbird directory and repeating steps 5-7.  If you were
using Thunderbird before, and want to go back to using your normal version, make
sure you move your ~/.thunderbird-backup directory back to ~/.thunderbird before
switching back.

Comment 22

14 years ago
Created attachment 130393 [details] [diff] [review]
2nd release of my patch

I've attached a much improved version of my patch, sorry to all those who
downloaded the first version (it was awful).  This is still far from bug-free,
and still won't work with IMAP (hopefully David will implement bug 216612, which
should make IMAP more feasible), but it is a lot more usable now.

Improvements in this one are a ton of bug fixes, a much improved (and working)
interface, and some classification improvements.

There are a couple known bugs besides the IMAP issue: sometimes messages do not
get moved after being classified by the "run junk mail controls on this folder"
tool.  Also, sometimes moving from one message to another does not seem to
update.  Folder renaming/deleting will probably not be handled particularly
gracefully.  The preferences interface is still unchanged; apply the same
settings I suggested before.

I have a precompiled binary made with Red Hat Linux at
http://www.stanford.edu/~ari05/

If you download this, please send me an e-mail telling me what you think.

Remember to never delete messages from server and to backup your old
~/.thunderbird directory so that you don't risk losing anything.

Updated

14 years ago
Attachment #130265 - Attachment is obsolete: true

Comment 23

14 years ago
Status update - work has not stopped on this, but I was on dial-up for a couple
weeks and now I am back at school but with classes coming up, so I won't be
updating as frequently.
I fixed a couple bugs, added in a panel in the preferences dialog to set
per-folder thresholds (the panel is a bit rough, my apologies), and added in a
Porter stemmer which I've found tends to improve classification performance. 
I'm thinking about playing some more with the tokenizer to get it to recognize a
few special features (keep URLs and e-mail addresses intact - or for URL's
simplify them to just domains...throw out html comments that are used to break
spammy words in half...etc.) but I don't plan to do too much more with
tokenizing than that (improving the tokenizer is worthy of an entirely separate
project).
IMAP still stands where it did before - waiting for progress on bug 216612.  I
think I'll hold off on posting a new version of this until IMAP works.
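
A rough Python sketch of the two preprocessing ideas mentioned above (dropping HTML comments that split spammy words, and reducing URLs to their domains). The regular expressions and the `preprocess` name are assumptions made for illustration; the real tokenizer lives in the C++ Bayesian filter code and is not shown here.

```python
import re
from urllib.parse import urlsplit

HTML_COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)
URL_RE = re.compile(r"https?://[^\s<>\"']+")


def preprocess(text):
    """Remove HTML comments, then replace each URL with its host name."""
    # Removing the comment rejoins a word that was split in half by it.
    text = HTML_COMMENT_RE.sub("", text)
    return URL_RE.sub(lambda m: urlsplit(m.group(0)).hostname or "", text)
```

For example, `preprocess("vi<!-- x -->agra")` returns `"viagra"`, so the token a spammer tried to hide is visible to the classifier again.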

Comment 24

14 years ago
I have a question: how are we ever going to know which messages were moved and
where they were moved? Is there some indicator that messages were moved so that
we can check if the move was correct and train the system?

Comment 25

14 years ago
Created attachment 132050 [details] [diff] [review]
new version, includes imap support

Woohoo!  David came through on bug 216612 so I fixed up all the IMAP stuff and
got it working!
I would say that at this point, the patch is totally useable - in fact I'm
using this for all my mail now.  No guarantees, of course.

Remaining issues, in no particular order:
1. Junk Mail Controls panel - there are a bunch of legacy options, and my added
tab is not especially user friendly.  In particular, pressing cancel will not
forget the changes for my tab.  If someone good with XUL wants to help out here
that'd be great.
2. There will be problems using this with multiple accounts, at least I assume
there will be - haven't extensively tested it.  I'm thinking about adding some
options to make this work better - any ideas?
3. Visual feedback - as Andrea pointed out above, this will be important. 
Right now I try to print a status bar message in some of my js code but it
often doesn't stick around long enough to see it and I don't do it everywhere. 
Some sort of highlighting in the folder tree would be great, again, I'd welcome
help from anyone good with this sort of stuff.
4. Tokenization still not great.  I added the Porter stemmer (converts words to
their root form) to the tokenizer  which should help training happen more
quickly, as well as modifying it to not split words at @ symbols or .'s (though
it will still ignore .'s and @'s at the beginning and end of the word - this
way e-mail addresses stay intact).  A major boost would be to strip html
comments as in bug 213614, but this seems pretty tricky (any volunteers?).
5. I'd like to add in a "disable notifications" option so messages marked, for
example, as spam do not trigger the normal notification and distract you from
work.
6. Random edge cases - better handling of renamed/deleted folders, how to
handle situations where a message is marked or not marked as something but the
training data doesn't have this information saved (if, for example, the program
crashed or an IMAP message was moved on the server or something like that).
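
The tokenizer rule in item 4 (do not split at '@' or '.', but ignore them at the start and end of a word) can be sketched in Python as below. This is only an illustration with a hypothetical `tokenize` function: the set of separator characters is an assumption, and stemming is left out.

```python
import re

# Split on anything that is not a word character, '@', '.' or '-'.
SPLIT_RE = re.compile(r"[^\w@.\-]+")


def tokenize(text):
    """Split text into lowercase tokens, keeping e-mail addresses intact."""
    tokens = []
    for raw in SPLIT_RE.split(text.lower()):
        token = raw.strip(".@")  # drop '.' and '@' only at the ends
        if token:
            tokens.append(token)
    return tokens
```

With this rule, a sentence-ending period after an address is stripped while the dots inside the address survive.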

I'd really appreciate it if anyone could help me with any of the above,
particularly the UI-related issues.  If you have any feedback please e-mail me
or post it here.
Attachment #130393 - Attachment is obsolete: true

Comment 26

14 years ago
Sorry to post again, just wanted to elaborate on two points above.

First, with respect to the visual feedback, there is definitely a fair amount
already in place - the bar where the normal spam classifier says "Thunderbird
thinks this message is junk" now says either "This message appears to be ______
(confidence ___)" (with a button for "move it there" or if it's already there to
"confirm classification") or "You labeled this message as _______" (with a
button for "undo label") or "This message is unlabeled" (with a button for
"guess" and one for "label as current folder").  The other visual feedback
already provided is that if incoming mail is moved, whatever folder it was moved
into will of course display the usual unread messages indication.  The only
problem with the feedback is if you, for example, say "run junk controls on
current folder" or press the "guess" button - some messages will get moved
around according to your settings but it's hard to tell where they were moved at
the moment, so it would be nice to have some sort of highlighting indicating
which folder the messages were moved to or something like that.

The other thing I wanted to elaborate on is the Porter stemmer...if you're
curious, it came from http://www.tartarus.org/~martin/PorterStemmer/.

Comment 27

14 years ago
Ari, the visual feedback is an important issue. 

One possibility is to replace the junk status icon on the messages window with
folder icons (perhaps with different colors / shapes) and have the same icons
next to the folder name in the folders tree. This would take a long time for
you, so a first shot at it would be to just display the target folder name in
the junk status column.

I also suggest remapping the "run junk controls" to the junk mail filter only,
if you can. Then create a new item called "run adaptive filters", that runs all
of the filters that are not junk. 

Anyway I am looking forward to your stuff. I have been waiting a long time for
someone to fix bug 183929 (the only reason I miss Eudora: I like to read mail and
THEN move it to folders) and your stuff seems to be a good enough substitute.

Comment 28

14 years ago
Hi Ari,

This is great!

I think that automatically setting up a message filter when you drag messages is
a good idea.  But why not build upon (i.e. incorporate) the existing filter
functionality somehow?  Surely this can only improve the accuracy of your
filtering.  If a user has an explicit filter saying all messages from X to Y go
to folder Z, then shouldn't you be making use of the information?  Also, this
would allow users to partially "correct" errors in your Bayesian filtering by
introducing new filters.  Just an idea.  A second little idea would be to make
an "undo move" command reverse the effects on your filters (everyone drags to
the wrong folder sometimes).  Similar to this, when you drag M from X to Y,
neither of which are the Inbox, this says "new messages in the Inbox with this
signature go into Y AND ALSO don't go into X" ... not sure if that functionality
is already there.  Okay, there you go, just some thoughts for you.
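
The "undo move" and "moved from X to Y" suggestions amount to being able to take training back. A minimal Python sketch, assuming a hypothetical `FolderCorpus` that remembers which folder each message was last trained as (nothing here reflects the actual patch):

```python
from collections import Counter, defaultdict


class FolderCorpus:
    """Per-folder token counts that can be trained and untrained."""

    def __init__(self):
        self.tokens = defaultdict(Counter)  # folder -> token counts
        self.labels = {}                    # message id -> folder trained as

    def record_move(self, msg_id, tokens, destination):
        """Train on a move; first forget any earlier label for this message."""
        self.forget(msg_id, tokens)
        self.tokens[destination].update(tokens)
        self.labels[msg_id] = destination

    def forget(self, msg_id, tokens):
        """Undo the training for a message (e.g. after 'undo move')."""
        folder = self.labels.pop(msg_id, None)
        if folder is not None:
            self.tokens[folder].subtract(tokens)
            # Unary + drops tokens whose count fell to zero or below.
            self.tokens[folder] = +self.tokens[folder]
```

Moving a message from X to Y then removes its evidence from X and adds it to Y, which in a multi-class setup is one way to express "goes into Y and not into X".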

I presume you're already following the POPFile project . . .

Cheers,
Chris

Comment 29

14 years ago
Hey, if anyone here wants to test this out without having to deal with compiling
stuff, I just put a Linux build (compiled in Debian unstable) up at
http://ari.stanford.edu/mythunderbird1.tar.bz2
Let me know how (if?) it works for you.

Comment 30

14 years ago
Ari,

I'd really like to see your patch distributed as an extension, given I'm running
Thunderbird under Windows, and don't have the tools to rebuild it.

I appreciate that you're possibly still testing things out, but I also think
you'd see a much greater interest in your work if you got it on the texturizer
page as an extension.

Cheers,
Matt

Comment 31

14 years ago
Does anybody have this working under Windows?  I'd love to give it a try if it
works as advertised.  Is development still ongoing?

Comment 32

14 years ago
*** Bug 225965 has been marked as a duplicate of this bug. ***

Comment 33

14 years ago
There's a bounty to be collected for this feature, see
http://www.markshuttleworth.com/bounty.html, the second bounty in the 'mozilla
work' chapter (at least if it'll work for thunderbird). 

Comment 34

yeah, can we package this up as a Thunderbird extension?  I too would love to
try this.

Comment 35

14 years ago
Alright, I've been kind of MIA for the past month or so (quick plug: been
working on this: http://www.stanford.edu/~mcslee/ultimate/) but now I'm hoping
to get back on track working on this.  I did see the thing about the bounty and
contacted the person but have not yet heard back from him - getting that would
definitely be a nice incentive for me to push forward a bit more.

Either way, though, making this into an extension is a high priority for me (in
fact, I'd say it's pretty much all there is left to do), but sadly it's also a
nontrivial task.  While I designed the patch to reuse a lot of the existing spam
filter's infrastructure, unfortunately I needed to modify that infrastructure to
get it to work in a more generalized way (for example, rather than simply
passing around an enum specifying JUNK/NOTJUNK, I need to actually pass around
folder uri's as strings).  The plan now is to get my modified interfaces working
with the normal junk filter so that it is easier to swap back and forth between
the two.  So basically, I will need to get some code integrated into the main
Thunderbird code, but I will do my best to minimize the changes I need to make
to that so that it is easier for the developers to merge my changes.  In
addition to the more generalized interfaces, I will need to get the hook for a
notification of a message being copied added as well.

Anyway, that's my status; I'm hoping to have something working by January, but
no promises.

Comment 36

Here's another take on this concept, which could turn out to be a real killer
feature: 

Bayesian prioritizing of the inbox.  

Monitor my Inbox and see how I respond to mail from (1) specific senders, (2)
specific threads, (3) mail that's in response to something I sent vs mail that
arrives out of the blue, (4) mail that contains specific keywords (maybe let the
user manually assign the keywords) and use that to prioritize the mailbox so
that more important stuff floats to the top.

Note that simply starting with (1) would be a giant step forward.  There's some
people to whom I usually respond right away.  Other people's mail tends to just
get read.  So float the people to whom I usually respond promptly to the top of
my inbox.  
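
Starting with (1) alone could look something like the Python sketch below: track how often I reply to each sender and sort the inbox by a smoothed reply rate. The `SenderPriority` class, the add-one smoothing, and the (sender, subject) tuples are all assumptions for illustration.

```python
from collections import Counter


class SenderPriority:
    """Rank messages by how often the user has replied to their sender."""

    def __init__(self):
        self.received = Counter()  # sender -> messages seen
        self.replied = Counter()   # sender -> messages replied to

    def observe(self, sender, replied):
        """Record one message from `sender` and whether it got a reply."""
        self.received[sender] += 1
        if replied:
            self.replied[sender] += 1

    def reply_rate(self, sender):
        """Smoothed estimate, (replies + 1) / (received + 2); 0.5 if unknown."""
        return (self.replied[sender] + 1) / (self.received[sender] + 2)

    def sort_inbox(self, messages):
        """Sort (sender, subject) pairs, most reply-worthy sender first."""
        return sorted(messages, key=lambda m: self.reply_rate(m[0]), reverse=True)
```

The smoothing keeps a sender with a single reply from outranking one with a long history, and puts unknown senders in the middle rather than at the bottom.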

This would be an outstanding way to differentiate Thunderbird.

Bart

Comment 37

14 years ago
Interesting idea that I've thought a little about but haven't really
experimented with.  The way you describe it - prioritized by sender - sounds
more like a glorified way of searching/sorting your box; you could almost
accomplish something similar just using the existing search or filtering
functionality.

However, you posted this idea under this Bayesian filter bug, and I think you
may be on to something with that connection.  I'm not sure how good a job it
would do, but my classifier could easily be modified to use the categories "high
priority"/"low priority" rather than the user's folders.  The only difference
between how it currently works and how you propose is that rather than learning
by observing message moves, it would learn by observing the user's behavior -
did they reply to the message or not (maybe also extend it to "did they save it
or delete it" as well).  Also, ideally the result of classification would be to
mark a message as "important" rather than as belonging to a certain category,
possibly with a sliding scale (color-coded highlighting?) depending on how
important it appears to be.

Of course, this learning approach would only work under the assumption that
there is a pattern in the content of the emails that you tend to reply to, which
may or may not be true for each user.  I'm fairly positive it's nowhere near as
easy as identifying spam, so don't get your hopes up too high that this would be
effective.  Maybe I'll give it a shot once the "categorizer" is fully working.

Comment 38

Ari,

Thanks for the feedback.  The potential complexity of the feature was what made
me think that maybe it's best to first strip it down to something as simple as
possible and 'market' it as such.  

That's what got me thinking maybe it's easiest to focus at first on the sender.
 In a very simple form, the learning agent could distinguish between mail from a
sender whose mail generally gets (a) deleted without a response, (b) filed
without a response, (c) responded to.  And sort my Inbox accordingly.  

While this is definitely a much less precise art than junk filtering (a tough
enough task as it is), the good news is that, unlike for junk mail filtering,
there's no huge price to pay for false positives.  So the worst that can happen
is that something ends up lower down in your inbox than it might (and of course
any of the current views of the Inbox suffer from that as well).  

But let me get out of your way since, frankly, I have no idea what I'm talking
about :)

Comment 39

14 years ago
Ari contacted me in connection with the bounty I've put up for this
functionality. I think he's well on track to claiming the bounty. I wanted to
put some more flesh on the usage scenario I had in mind in order to be able to
claim the bounty, in case anyone else wants to comment. Here's an extract from
an email I sent Ari:

Let's use the following usage "story". I don't expect mail to be automatically
filed to folders when it is received, as with the current Junk Mail bayesian
system. Instead, I envisaged the Bayesian tool being used to assist with the
quick selection of the folder to be filed WHEN the message is filed. So
basically this would look like an advanced "File Message" dialog box. When the
message comes in and has been read by the user, the user presses a hotkey to
bring up the "Advanced Message Filing" dialog. We now have to select the correct
folder in which to file this message. If you read the bounty descriptions, I
want to be able to do two things. First, by simply starting to type the name of
the folder that I want, a drop-down listbox of possible folders (with their full
folder paths) would be populated. This is a little like the current system for
email addresses in message composition windows, when you start typing, it shows
a list of suggested email recipients based on what you have typed. Second, that
list would be bolstered by the output from the bayesian system. For example,
before even typing anything, it might start out with the best-guess folder name
already selected based on the bayesian logic. The list might include several
Bayesian guesses until I actually start typing something else.

So here would be the process.

  1. Open Thunderbird, read message in Inbox as normal.
  2. Press hotkey for Advanced Message Filing
  3. Best-guess folder is already preselected. Hitting enter will file message
there immediately.
  4. If I start typing the letters "me" the listbox is immediately showing:
          account/Inbox/mail/people/m/megan elliott
          account/Inbox/mail/people/m/meenal gallal
          account/Inbox/mail/people/m/mendhip muran
          account/Inbox/mail/companies/m/medical magic
          account/Inbox/mail/companies/m/mexican mojo
  5. I can quickly select a folder and hit enter to have it filed there.
  6. Filing a message automatically trains the system a little more, so next
     time it will start off suggesting a better folder.
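
As a sketch of steps 3 to 5, the list could be built by filtering folders on the typed prefix and ordering the matches by the classifier's score. The Python below is illustrative only: `suggest_folders` and the shape of `scores` (folder path to a number, higher meaning a better guess) are assumptions.

```python
def suggest_folders(folders, scores, typed="", limit=5):
    """Rank folder paths for the filing dialog.

    folders: full folder paths.
    scores: path -> classifier score for the current message.
    typed: what the user has typed so far (matched against the folder name).
    """
    prefix = typed.lower()
    matches = [f for f in folders
               if f.rsplit("/", 1)[-1].lower().startswith(prefix)]
    # Best guess first; ties are broken alphabetically for a stable list.
    matches.sort(key=lambda f: (-scores.get(f, 0.0), f))
    return matches[:limit]
```

With nothing typed, every folder matches, so the first entry is simply the best guess and pressing enter files the message there.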

So the extension I'm looking for is not a pre-classification system or filter,
so much as intelligent support for the manual message filing process.

Good luck to Ari and I hope he'll claim the bounty soon!

Comment 40

14 years ago
This is great. I love the progress Ari has made and I think Mark's bounty will
push this along quite nicely. I'm trying to help strip out some annoying HTML
tags in bug 213614 as Ari mentioned in comment 25. Any chance someone could point
me to the file that does the tokenizing for the Bayesian filter?
I code C/C++ fairly well and have a passing understanding of Bayesian filters from
my MIT AI class. Let me know if there's anything else I can do to help.
(I'm not trying to edge in on the bounty ;), I just want this feature pushed out
the door.). 

Also from what Mark describes it sounds like you could put a dropdown dialog box 
in the bar Ari describes in Comment 26. Just change ["This message appears to be 
_____(confidence ___)" (with a button for "move it there" )] to ["Message should
be " (drop down box defaulted to what filter thinks it should be) (Move button)]
Just implement type ahead find in the dropdown box and have the "hot key" move 
to the drop down. This of course is just a suggestion. 

Since it doesn't move the message anyway I would just remove ["This message is
unlabeled" (with a button for "guess" and one for "label as current folder")]
and use the above message regardless of confidence. 

Also I would like to see the option to have messages automatically filed based
on confidence remain, just have it turned off by default. Ideally you could
adjust the confidence per folder.

I think that's about it. I've tested Ari's build somewhat and find it works
well. I would like to see everything already in a folder classified as that
folder (you have to do it manually now). But thanks again for great work. 
Again let me know if there's anything I can do,
         Miller

Comment 41

14 years ago
*** Bug 232061 has been marked as a duplicate of this bug. ***

Comment 42

14 years ago
I just thought I would mention, even though perhaps off-topic, that it would be
cool to use the same code to filter bookmarks (I hate having to file nested
bookmarks) based on page contents, title, etc.  Maybe too ambitious, but it
could be neat!

Comment 43

13 years ago
*** Bug 252989 has been marked as a duplicate of this bug. ***

Comment 44

13 years ago
*** Bug 260430 has been marked as a duplicate of this bug. ***

Comment 45

13 years ago
I find it depressing that this bug remains outstanding more than a year after a
patch was submitted. Is there a problem with applying it? What's the future of
this bug?

Updated

13 years ago
Attachment #132050 - Flags: superreview?(dmose)
Attachment #132050 - Flags: review?(bienvenu)
Product: MailNews → Core

Comment 46

13 years ago
(In reply to comment #45)
> I find it depressing that this bug remains outstanding more than a year after a
> patch was submitted. Is there a problem with applying it? What's the future of
> this bug?

I also find it depressing, but this has happened to me before..

I guess this time it just disappeared among all the other things to fix?
Or was it a design decision to not allow general bayesian filters? I don't know.
I've been watching and I'm still interested in this.

Where are you guys making these decisions? :) Aren't you interested in this?

Now that labels seem to be a fixed enumeration, this is even more doable.

Comment 47

12 years ago
*** Bug 297108 has been marked as a duplicate of this bug. ***
Comment on attachment 132050 [details] [diff] [review]
new version, includes imap support

I'm sure this patch has bitrotted since it was posted. Last I heard, Gandalf
was looking into generalizing the bayesian code for use in toolkit/.
Attachment #132050 - Flags: superreview?(dmose)
Attachment #132050 - Flags: review?(bienvenu)
Comment on attachment 132050 [details] [diff] [review]
new version, includes imap support

Unfortunately, this patch  has probably bitrotted fairly significantly since it
was contributed.  Last I heard, Gandalf was looking into generalizing the
bayesian code for use in toolkit/, so I'm going to reassign to him for
recommendation about what to do with it.

Updated

12 years ago
Assignee: sspitzer → gandalf

Comment 51

12 years ago
*** Bug 324276 has been marked as a duplicate of this bug. ***

Comment 52

12 years ago
I just requested a very similar enhancement in bug 336112 ("Advanced Automatic Email Filtering") but there are some important differences that might be of interest here.

1) I recommended CRM114 because of its better classification when dealing with more than two choices.  The question is no longer "is this spam" but rather "where does this belong".  The "hyperspace" classifier sounds most appropriate.

2) In my case, the problem was some users simply getting too much legitimate email to handle.  They needed completely automatic, unattended sorting.  Having a title bar ask "move this to somebox?" would be of no help because they'd still have to open every message first before it could be "automatically" sorted.

3) I recommended being able to select "automatically filter to this folder" every time a new folder was created.  If selected, then it would happen automatically, without any user intervention, every time mail arrived.  If it was wrong, the user would just move the offending message to the correct place and the system would learn from the mistake.  As long as users were even semi-consistent about correcting mistakes (and not just deleting the message), the system should learn very quickly (CRM114 learns faster than traditional Bayes).
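
To make the "where does this belong" question and the learn-from-corrections loop concrete, here is a rough Python sketch. It uses plain multinomial naive Bayes as a stand-in, not CRM114 or its "hyperspace" classifier, and the class name FolderClassifier, its methods, and the whitespace tokenizer are all hypothetical.

```python
import math
from collections import Counter, defaultdict


class FolderClassifier:
    """Toy multinomial naive Bayes over folders (a stand-in, not CRM114)."""

    def __init__(self):
        self.token_counts = defaultdict(Counter)  # folder -> token -> count
        self.message_counts = Counter()           # folder -> messages trained

    @staticmethod
    def tokenize(text):
        return text.lower().split()

    def train(self, folder, text):
        """Call this when the user moves a message into `folder`,
        including when they correct a wrong automatic move."""
        self.message_counts[folder] += 1
        self.token_counts[folder].update(self.tokenize(text))

    def classify(self, text):
        """Return the most likely folder, or None if nothing was trained yet."""
        vocab = {t for counts in self.token_counts.values() for t in counts}
        total = sum(self.message_counts.values())
        best, best_score = None, -math.inf
        for folder, n_msgs in self.message_counts.items():
            counts = self.token_counts[folder]
            n_tokens = sum(counts.values())
            score = math.log(n_msgs / total)  # log prior for this folder
            for token in self.tokenize(text):
                # add-one smoothing so an unseen token does not zero the score
                score += math.log((counts[token] + 1) / (n_tokens + len(vocab)))
            if score > best_score:
                best, best_score = folder, score
        return best
```

In this sketch, unattended sorting would call classify() on arrival, and the only user action that feeds the classifier is dragging a misfiled message to the right folder, which calls train() for that folder.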

Comment 53

11 years ago
*** Bug 339442 has been marked as a duplicate of this bug. ***
Assignee: gandalf → nobody
QA Contact: laurel → filters
I'm not dead!

Still plan to work on this, but if someone has more time, feel free to take this bug.
(Assignee)

Updated

9 years ago
Product: Core → MailNews Core

Comment 55

9 years ago
I'm looking at implementing a stripped-down version of this in the TB3 timeframe. Let me take the bug for now, if only to get anyone else working on this to talk to me.

The main implementation will focus on automatic tagging rather than moves. I'll generalize the existing bayesian classifier to be able to classify N generic features, where the first feature will be the existing junk classification. At least by TB3 I should be able to get enough hooks into the bayesian code so that an extension could easily implement soft tags based on the bayes results. Perhaps we could get some UI hooked up by TB3 as well - though time is tight for that.
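
One way to picture "N generic features, where the first is junk" is a separate pair of token tables per feature, so that a message can match several at once. The following is a rough Python sketch under that assumption, not the actual MailNews code; TraitClassifier, soft_tags, the string feature names and the 0.9 cutoff are all made up for illustration.

```python
import math
from collections import Counter, defaultdict

JUNK = "junk"  # hypothetical name; the first feature is the existing junk check


class TraitClassifier:
    """Toy per-feature Bayesian scorer: each feature keeps its own
    'has it' / 'lacks it' token counts and is scored independently."""

    def __init__(self):
        # feature -> {True: Counter(), False: Counter()}
        self.counts = defaultdict(lambda: {True: Counter(), False: Counter()})

    def train(self, trait, text, has_trait):
        self.counts[trait][has_trait].update(text.lower().split())

    def score(self, trait, text):
        """Probability-like score in (0, 1) that the message has the feature."""
        pro, anti = self.counts[trait][True], self.counts[trait][False]
        vocab = len(set(pro) | set(anti)) or 1
        n_pro, n_anti = sum(pro.values()), sum(anti.values())
        log_odds = 0.0
        for token in text.lower().split():
            p = (pro[token] + 1) / (n_pro + vocab)    # add-one smoothing
            q = (anti[token] + 1) / (n_anti + vocab)
            log_odds += math.log(p / q)
        log_odds = max(-50.0, min(50.0, log_odds))    # keep exp() in range
        return 1 / (1 + math.exp(-log_odds))

    def soft_tags(self, text, cutoff=0.9):
        """Features whose score clears the assumed cutoff; junk is just one of them."""
        return sorted(t for t in self.counts if self.score(t, text) >= cutoff)
```

An extension could then map each returned name to a tag, which is roughly the "soft tags" idea; how the real hooks expose scores is not something this sketch tries to reproduce.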
Assignee: nobody → kent

Updated

9 years ago
Depends on: 453885
Duplicate of this bug: 336112

Comment 57

9 years ago
I'm removing myself as the assignee for this bug, because the work is progressing, but is directed at an extension rather than the core. I will interpret this bug as requesting functionality in the core - and I am not committed to doing that.

But let me give an update to interested readers. I've done a number of patches that allow access to the bayesian filter for features that are different than junk, but I use the term "traits" rather than "features". There is one critical patch that remains, for bug 471071, which is needed to get access to db listeners from js. Once that is implemented (and I hope to see it for TB 3.0 beta 2) then it will be possible for extensions to access the bayesian filter results for traits, and implement functionality that uses it.

I will be doing this for "soft tags" in an extension, TaQuilla, which will be tracked at http://mesquilla.com. Other extensions could use the functionality as well. For TB 3 beta 2, the trait classification still follows the same rules as junk filters, which means that it does not work by default on certain special folders (sent, draft for example) nor on things like newsgroups or rss feeds. But that is a limitation of the normal message processing, not the bayesian filter. I hope to add additional control to this in the TB 3 time frame, but in any case you can run the bayes filter manually on RSS and news in an extension if you want to.
Assignee: kent → nobody

Comment 58

9 years ago
TaQuilla is now released to experimental status at https://addons.mozilla.org/en-US/thunderbird/addon/10905. This allows the bayes filters to automatically tag emails, which matches the "other features than spam" request of the summary.

With the backend work now largely complete, I am tempted to close this bug and ask that new bugs be filed for any specific features that use the bayesian filters. Right now, it is difficult to control how to apply the bayes filters, but I will do specific followup bugs (such as bug 471833) to deal with specific issues.

The original request by the reporter:

"What if every message I manually moved into a folder was added to the Baysian
corpus for that folder, and messages could be automatically filtered into
folders based on Bayesian matches?"

is a fairly specific request - and certainly doable. It's really the folder-paradigm equivalent of the soft tagging in TaQuilla, I just prefer the tag/virtual folder paradigm myself.

Still, this bug as currently defined is pretty general. If someone would like to keep it open, then I would appreciate a comment on what, precisely, is being requested in this bug beyond the current backend work.

So plan A is to close this bug. Plan B is to leave it open, but change the summary to specifically mention automatic moves.

Comments or further suggestions?

Updated

7 years ago
Blocks: 562050

Comment 59

6 years ago
re comment 58: 

Well, that original request was 9 years ago, so other things have happened around it.

I'm not sure I even want folder-based sorting any more, but rather attribute-based views into the mail collection: Date/Family/websites/subscriptions/invoices/chatter etc.

IF messages could be given tags for selecting in views, THEN the bayesian filters would be a grand way to slap tags on incoming messages.