Open
Bug 179999
Opened 22 years ago
Updated 2 years ago
should we ship with a pre-populated training.dat file?
Categories
(MailNews Core :: Filters, enhancement)
Tracking
(Not tracked)
NEW
People
(Reporter: sspitzer, Unassigned)
References
(Blocks 1 open bug)
Details
(Whiteboard: +1 negative vote(s) (dwx))
Should we ship with a pre-populated training.dat file? This has come up in private threads (several times). I'll summarize (or perhaps dmose can) all the pros, cons, and issues raised so far, to help make the decision.
Comment 1•22 years ago
Not to jump the gun, but I strongly feel that if this is at all possible, we should do it. Normal users are not going to get that they have to train it on both good and junk messages unless we really hold their hands through the process. It might be impossible to have a default training set that's not biased; I don't know. We could add a command to clear the training set and allow the user to start over, if that's really an issue. Also, what do other products that use bayesian filtering do out of the box?
Comment 2•22 years ago
I disagree .. what some may see as spam .. others may be "subscribing" to ... I know I get a lot of junk mail from legitimate sources ... I vote "NO" on pre-populated training.dat files. Let the user train their own system.
Comment 3•22 years ago
I strongly recommend WONTFIX. mozilla.org could not, and should not, determine what is acceptable to anyone and what is not. mozilla.org should stop at providing facilities for content filtering and making sure that they work properly. Similarly, mozilla.org should not label anyone as a spam sender, pop-up advertiser, porn publisher, etc.
Whiteboard: +1 negative vote(s) (dwx)
I agree with dwx. However, I also think that if someone were to switch on Bayesian spam filtering, they would expect it to do something. There is extra work involved. You have to train it. Most users will need to have this explained to them. In any case, maybe we should try to improve or accelerate the learning curve of the Bayesian filter. So when they do turn it on, it catches on quickly. I don't keep huge spam/not-spam archives, and most people don't either. I recommend a dialog/alert that might come up if Bayesian filters have never been turned on, to educate the user about them and the necessary training. Or something. Just some more ideas. :-)
Comment 5•22 years ago
As a user who receives mountains of junk email a day, I vote for a *small* training.dat file to come with the install. I think that it would not be too hard to come up with a file that would at least stop the question "I turned on junk mail filtering. Why is NOTHING getting marked as junk?" A small file would quickly get overridden by the user's actual training. This doesn't permanently label people as spammers (after a very little training my filter marked Bugzilla messages as junk); it just provides a starting point. Another argument for the "little bit of something" approach: get users in the habit of un-marking the non-junk in addition to marking the junk. If they get too comfortable that anything marked as junk is necessarily so, they will miss their receipts from purchases. At the end of the day the definition of "junk mail" is highly personal, but I don't think that many people would mourn the passing of "Make millions sending junk mail" and "watch these teen s**ts do *** to ***". And if there is someone out there who wants to receive these messages, he can just mark a few as non-junk. Regardless of this choice, user education will be required.
Comment 6•22 years ago
I would just like to vote a strong NO to this suggestion of including a pre-populated training.dat file. One person's spam might be another person's ham. The worst thing that a spam filter can do is create false positives, and this is exactly what will happen if a pre-populated training.dat file is provided. However, if a training.dat file were created using only header information, and not keywords from the body of the email, then it may be possible. But I still think it is better to FORCE every user to create their own training.dat from scratch.
Comment 7•22 years ago
See bug 188232 for a possible reason for a very simple "pre-populated training.dat". It's quite possible for the training data to somehow get confused so that NO messages are marked as junk, despite training on 100s of junk messages. There's something in the training algorithms that keeps them from working properly when you have no "good" tokens in the training.dat file, and hundreds of "bad" tokens (but not enough similarity in the bad tokens that anything ever gets marked as junk). So, either the algorithms need to check for that boundary condition, or the boundary condition needs to be avoided (per the really annoying workaround I came up with involving editing the binary training.dat file).
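A minimal sketch of the kind of boundary check suggested above, written in Python for illustration only. It is not the Mozilla implementation: the `TrainingStats` type, the `classification_ready` function, and the `min_per_class` parameter are all hypothetical names, and the real training.dat layout is not modelled here.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class TrainingStats:
    """Hypothetical summary of a training set (not the real training.dat format)."""
    good_messages: int
    junk_messages: int


def classification_ready(stats: TrainingStats,
                         min_per_class: int = 1) -> Tuple[bool, str]:
    """Report whether both classes have enough training to attempt classification."""
    if stats.good_messages < min_per_class:
        return False, "no good (non-junk) training yet"
    if stats.junk_messages < min_per_class:
        return False, "no junk training yet"
    return True, "ok"


# Hundreds of junk messages but no good ones: the case described above.
ready, reason = classification_ready(TrainingStats(good_messages=0, junk_messages=300))
```

A filter could skip scoring (or prompt the user to mark some good mail) whenever such a check fails, rather than silently marking nothing as junk.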
*** Bug 191566 has been marked as a duplicate of this bug. ***
Comment 9•21 years ago
How hard would it be to add a "Junk Filter status" menu option, which would display how many junk and non-junk messages have been used for training?
Comment 10•21 years ago
How about Mozilla having a web page where users can select the types of mail they consider to be junk, with each selection somehow adding data to the training.dat file? Or perhaps the option of downloading one of several differently set up training.dat files. As was said earlier, the junk mail system has caused an awful lot of headaches for the average user. If someone like Yahoo can filter spam pretty well, I do not see what is wrong with Mozilla giving the option of having a prepopulated training.dat file. This is the route to the greatest success for this feature in my opinion; otherwise it is just going to be too hard for some people.
Comment 11•21 years ago
The idea for a web site where users can select types of common junk mail and have a populated training.dat produced for them is very good! I would love to see that regardless of whether or not mozilla ships with one. As to the issue at hand, I think a small pre-populated training.dat would be a very good idea. It would provide immediate results for someone who was new to the concept of junk filtering. As to the grey areas of what is spam, we could easily avoid them. Get a few dozen users (not all developers! get some people who are mom/dad/dontknowjack users too) to submit all their email to a pile, then calculate a training.dat from it. Then go through the resulting file and only keep entries that have over a certain threshold of occurrences and are >95% or <5% spam. This will put things like "penis" and "teen" on the list as spam indicators while avoiding grey area 'words' like "order" and "http://". Obviously the values would be tweaked, but I think it would be very easy to produce a training.dat that had over an 80% true-positive and true-negative rate and less than a 5% false-positive rate for more than 95% of users.
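The pruning step described above could be sketched roughly as follows. This is an illustration under stated assumptions, not the real training.dat format or any Mozilla code: the token table is a plain dictionary, `prune_tokens` and its parameters are hypothetical names, the sample counts are invented, and the junk ratio is taken from raw token counts rather than normalised by the number of junk and good messages.

```python
from typing import Dict, Tuple

# Hypothetical in-memory form of the pooled training data:
# token -> (junk_count, good_count). Not the real training.dat layout.
TokenCounts = Dict[str, Tuple[int, int]]


def prune_tokens(counts: TokenCounts,
                 min_occurrences: int = 20,
                 junk_cutoff: float = 0.95,
                 good_cutoff: float = 0.05) -> TokenCounts:
    """Keep only frequent tokens that are almost always junk or almost always good."""
    kept: TokenCounts = {}
    for token, (junk, good) in counts.items():
        total = junk + good
        if total < min_occurrences:
            continue  # too rare to trust either way
        junk_ratio = junk / total
        if junk_ratio > junk_cutoff or junk_ratio < good_cutoff:
            kept[token] = (junk, good)
    return kept


# Invented sample counts, for illustration only.
pooled = {
    "viagra": (480, 2),      # almost always junk: kept
    "meeting": (3, 310),     # almost always good: kept
    "order": (150, 140),     # grey area: dropped
    "http://": (900, 700),   # grey area: dropped
    "xyzzy": (5, 0),         # always junk but too rare: dropped
}
pruned = prune_tokens(pooled)
```

The three thresholds are the values that "would be tweaked"; raising `min_occurrences` or tightening the cutoffs yields a smaller, more conservative starter file.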
Comment 12•21 years ago
I'd like to re-iterate my opinion of NO to having a pre-populated Junk file. This is a bad idea. There are many other ways around this, and I think having a pre-populated training.dat file should be the last possible option.
Comment 13•21 years ago
David, what would you suggest?
Comment 14•21 years ago
On another note, the false positives argument isn't really valid. With no training.dat, and during the initial stages of training (the first few dozen markings), the false positive rate is going to be high anyway. By shipping with a prepopulated list we would decrease the initial false positive rate for an extreme majority of users (I'd say 99% or better), and for the very few others who actually WANT to know how to enlarge their penis, or who really do have friends in Nigeria with a billion dollars to launder, it will only take a few corrections for the spam filter to almost completely forget its original training. PS: I am willing to create the training.dat I speak of if anyone is interested in trying it, to prove that it is easily correctable but also more accurate than no training.dat at all.
Updated•20 years ago
Product: MailNews → Core
Updated•19 years ago
OS: Windows 2000 → All
Hardware: PC → All
Comment 15•17 years ago
Was consensus ever reached on this? One thing to note is that there are already multiple community approved spam corpuses (corpii?): both the SpamAssassin and TREC Spam corpuses are widely used in academia and the real world. Their efficacy as training data sets can be debated, but I don't think the spamminess of their contents is controversial. If a handful of users would find the unrequested Viagra ads therein useful, I'd say sacrificing usability for them is worth the added value provided to the (much larger) general user population. Having said that, the idea quoted above regarding divvying up the training set into categories (medical, sexual etc.) and including only those is a great idea. This would also allow us to have foreign language training subsets etc.
Comment 16•17 years ago
Incorporating large spam/ham corpora in a default training set would introduce training inertia, requiring more training actions to correct one kind of classification error. Also, as thunderbird is dealing with private content with a close feedback loop, it's in a great position to leverage the distinct personality of the user's ham to the fullest; that would be weakened by a heavy generic training set. A minimal training set that's very quickly overridden by user training might help discoverability (there's probably some potential for annoyance there too), but that's already better than when this was filed, afair. Junk filtering is on by default; the Junk button is on the toolbar by default, and touching it gives a short introduction dialog about training the first time.
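The "training inertia" point can be illustrated with a deliberately simplified model, assuming a single token whose junk score is just its junk count divided by its total count, and assuming each "not junk" correction adds one good occurrence. Real Bayesian filters combine many tokens per message, so the numbers are only indicative; `corrections_needed` is a hypothetical helper, not part of any Mozilla code.

```python
def corrections_needed(shipped_junk: int,
                       shipped_good: int = 0,
                       threshold: float = 0.5) -> int:
    """Count 'not junk' markings needed before a token's junk ratio drops below threshold.

    Simplified model: one token, ratio = junk / (junk + good),
    and each user correction adds exactly one good occurrence.
    """
    good = shipped_good
    corrections = 0
    while shipped_junk > 0 and shipped_junk / (shipped_junk + good) >= threshold:
        good += 1
        corrections += 1
    return corrections


small_set = corrections_needed(5)    # token seen 5 times as junk in a minimal shipped set
large_set = corrections_needed(500)  # token seen 500 times as junk in a large corpus
```

Under these assumptions the effort to overturn a shipped judgement grows linearly with the shipped counts, which is why a minimal starter set is overridden quickly while a large corpus is not.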
Comment 17•17 years ago
(In reply to comment #16)
> Incorporating large spam/ham corpora in a default training set would introduce
> training inertia, requiring more training actions to correct one kind of
> classification error. Also, as thunderbird is dealing with private content with
> a close feedback loop, it's in a great position to leverage the distinct
> personality of the user's ham to the fullest; that would be weakened by a heavy
> generic training set.
>
> A minimal training set that's very quickly overridden by user training might
> help discoverability (there's probably some potential for annoyance there too),
> but that's already better than when this was filed, afair. Junk filtering is on
> by default; the Junk button is on the toolbar by default, and touching it gives
> a short introduction dialog about training the first time.
I'm not sure about trunk, but the junk mail feature is not really working in the 2.0 nightly builds.
Comment 18•17 years ago
> I'm not sure about trunk, but the junk mail feature is not really working in
...so ask about it on forums and/or file a bug. Don't comment on a random bug report fully quoting a completely unrelated comment. Thanks.
Comment 19•17 years ago
Assigning bugs that I'm not actively working on back to nobody; use SearchForThis as a search term if you want to delete all related bugmail at once.
Assignee: dmose → nobody
Updated•17 years ago
Severity: normal → enhancement
QA Contact: laurel → filters
Updated•17 years ago
Updated•16 years ago
Product: Core → MailNews Core
Updated•2 years ago
Severity: normal → S3