Open Bug 179999 Opened 22 years ago Updated 2 years ago

should we ship with a pre-populated training.dat file?

Categories

(MailNews Core :: Filters, enhancement)

Tracking

(Not tracked)

People

(Reporter: sspitzer, Unassigned)

References

(Blocks 1 open bug)

Details

(Whiteboard: +1 negative vote(s) (dwx))

should we ship with a pre-populated training.dat file?

this has come up in private threads (several times).  I'll summarize (or 
perhaps dmose can) all the pros and cons and issues raised so far, to help 
make the decision.
Not to jump the gun, but I strongly feel that if this is at all possible, we
should do it. Normal users are not going to get that they have to train it on
both good and junk messages unless we really hold their hands through the
process. It might be impossible to have a default training set that's not
biased; I don't know. We could add a command to clear the training set and allow
the user to start over, if that's really an issue. Also, what do other products
that use bayesian filtering do out of the box?
I disagree .. what some may see as spam .. others may be "subscribing" to ... 
I know I get a lot of junk mail from legitimate sources ... 

I vote "NO" on pre-populated training.dat files.  Let the user train their own
system.
I strongly recommend WONTFIX

mozilla.org could not, and should not, determine what is acceptable to
anyone and what is not. mozilla.org should stop at providing facilities
for content filtering and making sure that they work properly.
Similarly mozilla.org should not label anyone as spam sender, pop-up
advertiser, porn publisher, etc. etc.
Whiteboard: +1 negative vote(s) (dwx)
I agree with dwx.  However, I also think that if someone were to switch on
Bayesian spam filtering, they would expect it to do something.

There is extra work involved.  You have to train it.  Most users will need to
have this explained to them.  In any case, maybe we should try to improve or
accelerate the learning curve of the Bayesian filter.  So when they do turn it
on, it catches on quickly.  I don't keep huge spam/not-spam archives, and most
people don't either.  I recommend a dialog/alert that might come up if Bayesian
filters have never been turned on, to educate the user about them and the
necessary training.  Or something.

Just some more ideas.  :-)
As a user who receives mountains of junk email a day, I vote for a *small*
training.dat file to come with the install.  I think that it would not be too
hard to come up with a file that would at least stop the question "I turned on
junk mail filtering.  Why is NOTHING getting marked as junk?"  A small file
would quickly get overridden by the user's actual training.  This doesn't
permanently label people as spammers (after a very little training my filter
marked Bugzilla messages as junk); it just provides a starting point.

Another argument for the "little bit of something" approach:  Get users in the
habit of un-marking the non-junk in addition to marking the junk. If they get
too comfortable that anything marked as junk is necessarily so, they will miss
their receipts from purchases.

At the end of the day the definition of "junk mail" is highly personal, but I
don't think that many people would mourn the passing of "Make millions sending
junk mail" and "watch these teen s**ts do *** to ***".  And if there is someone
out there who wants to receive these messages, he can just mark a few as non-junk.

Regardless of this choice, user education will be required.
I would just like to vote a strong NO to this suggestion of including a
pre-populated training.dat file.  Some people's spam might be another person's
ham.  The worst thing that a spam filter can do is create false-positives, and
this is exactly what will happen if a pre-populated training.dat file is provided.

However, if a training.dat file were created using only header information, and
not keywords from the body of the email, then it may be possible.  But I still
think it is better to FORCE every user to create their own training.dat from
scratch.
See bug 188232 for a possible reason for a very simple "pre-populated
training.dat".  It's quite possible for the training data to somehow get
confused so that NO messages are marked as junk, despite training on 100s of
junk messages.

There's something in the training algorithms that keeps them from working
properly when you have no "good" tokens in the training.dat file, and hundreds
of "bad" tokens (but not enough similarity in the bad tokens that anything ever
gets marked as junk).  So, either the algorithms need to check for that boundary
condition, or the boundary condition needs to be avoided (per the really
annoying workaround I came up with involving editing the binary training.dat file).
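
The boundary condition described above can be illustrated with a minimal sketch of a Graham-style per-token score. This is not taken from the Mozilla source, and it does not claim to reproduce the failure in bug 188232. The function name, the clamping range and the "return None when untrained" convention are all assumptions made for illustration. It only shows why a scorer of this general shape needs an explicit check when no good messages have been trained.

```python
def token_spam_probability(good_count, bad_count, n_good_msgs, n_bad_msgs):
    """Graham-style per-token spam probability (illustrative sketch only).

    good_count / bad_count: how often the token appeared in good / junk mail.
    n_good_msgs / n_bad_msgs: how many good / junk messages were trained.

    Returns None when the score is undefined, so the caller can report
    "not enough training" instead of silently never marking anything.
    """
    # Boundary condition: with zero good (or zero junk) messages the
    # per-corpus ratios below would divide by zero.
    if n_good_msgs == 0 or n_bad_msgs == 0:
        return None

    # Good occurrences are weighted double to bias against false positives.
    good_ratio = min(1.0, 2.0 * good_count / n_good_msgs)
    bad_ratio = min(1.0, bad_count / n_bad_msgs)

    # A token that was never seen in either corpus carries no evidence.
    if good_ratio + bad_ratio == 0:
        return None

    p = bad_ratio / (good_ratio + bad_ratio)
    # Clamp so that a single token can never be fully decisive.
    return max(0.01, min(0.99, p))


if __name__ == "__main__":
    # Hundreds of junk messages trained, but no good ones: undefined.
    print(token_spam_probability(0, 50, 0, 300))
    # The same token once a few good messages have been trained.
    print(token_spam_probability(0, 50, 10, 300))
```

In this sketch a very small shipped training.dat would simply guarantee that both message counts start above zero, which is one way of avoiding the boundary condition rather than checking for it.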
*** Bug 191566 has been marked as a duplicate of this bug. ***
How hard would it be to add a "Junk Filter status" menu option, which would
display how many junk and non-junk messages have been used for training?
How about Mozilla having a web page where users can select the types of mail
they consider to be junk, with each selection adding data to the training.dat
file? Or perhaps the option of downloading one of several differently set up
training.dat files.
As was said earlier - the junk mail system has caused an awful lot of headaches
for the average user. If someone like Yahoo can filter spam pretty well I do not
see what is wrong with Mozilla giving the option of having a prepopulated
training.dat file. This is the route to the greatest success for this feature in
my opinion - otherwise it is just going to be too hard for some people.
The idea for a web site where users can select types of common junk mail and
have a populated training.dat produced for them is very good!  I would love to
see that regardless of whether or not mozilla ships with one.  As to the issue
at hand, I think a small pre-populated training.dat would be a very good idea. 
It would provide immediate results for someone who was new to the concept of
junk filtering.  As to the grey areas of what is spam, we could easily avoid
them.  Get a few dozen users (not all developers!  get some people who are
mom/dad/dontknowjack users too) to submit all their email to a pile, then
calculate a training.dat from it.  Then go through the resulting file and only
keep entries that have over a certain threshold of occurrences and are >95% or
<5% spam.  This will put things like "penis" and "teen" on the list as spam
indicators while avoiding grey area 'words' like "order" and "http://". 
Obviously the values would be tweaked, but I think it would be very easy to
produce a training.dat that had over an 80% true-positive and true-negative rate
and less than a 5% false-positive rate for more than 95% of users.
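
The pruning step proposed above (keep only tokens seen often enough and almost always in one class) could look roughly like the following sketch. The function name, the in-memory dictionary format and the default thresholds are hypothetical, and the token counts are invented for illustration. The real training.dat is a binary file whose format is not modeled here.

```python
def prune_tokens(counts, min_occurrences=20, spam_hi=0.95, spam_lo=0.05):
    """Keep only frequent, strongly one-sided tokens.

    counts maps token -> (spam_count, ham_count), pooled from all
    contributed mailboxes. Thresholds are placeholders to be tweaked.
    """
    kept = {}
    for token, (spam, ham) in counts.items():
        total = spam + ham
        # Drop rare tokens: too little evidence either way.
        if total < min_occurrences:
            continue
        spam_fraction = spam / total
        # Keep clear junk indicators (>95%) and clear good indicators (<5%);
        # everything in between is a grey area and is left to the user.
        if spam_fraction > spam_hi or spam_fraction < spam_lo:
            kept[token] = (spam, ham)
    return kept


if __name__ == "__main__":
    # Invented counts, for illustration only.
    pooled = {
        "teen": (98, 2),        # almost always junk: kept
        "meeting": (1, 99),     # almost always good: kept
        "order": (40, 60),      # grey area: dropped
        "http://": (500, 450),  # grey area: dropped
        "rare": (3, 0),         # too few occurrences: dropped
    }
    print(sorted(prune_tokens(pooled)))
```

Whether a file pruned this way would reach the accuracy figures quoted above is untested here; the sketch only shows the selection rule.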
I'd like to re-iterate my opinion of NO to having a pre-populated Junk file. 
This is a bad idea.  There are many other ways around this, and I think having a
pre-populated training.dat file should be the last possible option.
David, what would you suggest?
on another note, the false positives argument isn't really valid.  with no
training.dat, and during the initial stages of training (the first few dozen
markings), the false positive rate is going to be high anyways.  by shipping
with a prepopulated list we would decrease the initial false positive rate for
an extreme majority of users (I'd say 99% or better) and for the very few others
who actually WANT to know how to enlarge their penis or who really do have
friends in nigeria with a billion dollars to launder it will only take a few
corrections for the spam filter to almost completely forget its original training.

PS:  I am willing to create the training.dat I speak of if anyone is interested
in trying it, to prove that it is easily correctable but also more accurate than
no training.dat at all.
Product: MailNews → Core
OS: Windows 2000 → All
Hardware: PC → All
Was consensus ever reached on this? One thing to note is that there are already multiple community-approved spam corpuses (corpii?): both the SpamAssassin and TREC Spam corpuses are widely used in academia and the real world. Their efficacy as training data sets can be debated, but I don't think the spamminess of their contents is controversial. If a handful of users would actually find the unsolicited Viagra ads therein useful, I'd say sacrificing usability for them is worth the added value provided to the (much larger) general user population.

Having said that, the idea quoted above regarding divvying up the training set into categories (medical, sexual etc) and including only those is a great idea. This would also allow us to have foreign language training subsets etc.
Incorporating large spam/ham corpora in a default training set would introduce training inertia, requiring more training actions to correct one kind of classification error. Also, as thunderbird is dealing with private content with a close feedback loop, it's in a great position to leverage the distinct personality of the user's ham to the fullest; that would be weakened by a heavy generic training set.
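
The "training inertia" point can be made concrete with a deliberately simplified model in which a token's score is just the fraction of its occurrences seen in junk mail. This is an assumption for illustration, not the filter's actual scoring, and the function name and threshold are hypothetical. Under that model, the number of user corrections needed to flip a token grows linearly with the size of the shipped seed.

```python
def corrections_needed(seed_spam, seed_ham=0, threshold=0.5):
    """Count 'not junk' markings needed before a token stops looking spammy.

    Simplified model: score = spam / (spam + ham). Each user correction
    adds one ham occurrence of the token.
    """
    spam, ham = seed_spam, seed_ham
    corrections = 0
    # With no occurrences at all there is nothing to override.
    while spam + ham > 0 and spam / (spam + ham) >= threshold:
        ham += 1
        corrections += 1
    return corrections


if __name__ == "__main__":
    for seed in (5, 50, 5000):
        print(seed, corrections_needed(seed))
```

A seed of 5 junk occurrences is overridden after 6 corrections, while a seed of 5000 needs 5001, which is the trade-off between a minimal seed file and a large generic corpus.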

A minimal training set that's very quickly overridden by user training might help discoverability (there's probably some potential for annoyance there too), but that's already better than when this was filed, afair. Junk filtering is on by default; the Junk button is on the toolbar by default, and touching it gives a short introduction dialog about training the first time.
(In reply to comment #16)
> Incorporating large spam/ham corpora in a default training set would introduce
> training inertia, requiring more training actions to correct one kind of
> classification error. Also, as thunderbird is dealing with private content with
> a close feedback loop, it's in a great position to leverage the distinct
> personality of the user's ham to the fullest; that would be weakened by a heavy
> generic training set.
> 
> A minimal training set that's very quickly overridden by user training might
> help discoverability (there's probably some potential for annoyance there too),
> but that's already better than when this was filed, afair. Junk filtering is on
> by default; the Junk button is on the toolbar by default, and touching it gives
> a short introduction dialog about training the first time.

I'm not sure about trunk, but the junk mail feature is not really working in the 2.0 nightly builds.
> I'm not sure about trunk, but the junk mail feature is not really working in

...so ask about it on forums and/or file a bug. Don't comment on a random bug report fully quoting a completely unrelated comment. Thanks.
Assigning bugs that I'm not actively working on back to nobody; use
SearchForThis as a search term if you want to delete all related bugmail at
once.
Assignee: dmose → nobody
Severity: normal → enhancement
QA Contact: laurel → filters
Product: Core → MailNews Core
Blocks: 230269
See Also: → 250470
No longer blocks: 230269
Severity: normal → S3