179999 - should we ship with a pre-populated training.dat file?

(not reading, please use seth@sspitzer.org instead)

Reporter

Description

•

23 years ago

Should we ship with a pre-populated training.dat file? This has come up in private threads (several times). I'll summarize (or perhaps dmose can) all the pros and cons and issues raised so far, to help make the decision.

(not reading, please use seth@sspitzer.org instead)

Reporter

Updated

•

23 years ago

Blocks: 11035

David :Bienvenu

Comment 1

•

23 years ago

Not to jump the gun, but I strongly feel that if this is at all possible, we should do it. Normal users are not going to get that they have to train it on both good and junk messages unless we really hold their hands through the process. It might be impossible to have a default training set that's not biased; I don't know. We could add a command to clear the training set and allow the user to start over, if that's really an issue. Also, what do other products that use Bayesian filtering do out of the box?

Andrew Perry

Comment 2

•

23 years ago

I disagree .. what some may see as spam .. others may be "subscribing" to ... I know I get a lot of junk mail from legitimate sources ... I vote "NO" on pre-populated training.dat files. Let the user train their own system.

Daniel Wang

Comment 3

•

23 years ago

I strongly recommend WONTFIX. mozilla.org could not, and should not, determine what is acceptable to anyone and what is not. mozilla.org should stop at providing facilities for content filtering and making sure that they work properly. Similarly, mozilla.org should not label anyone as a spam sender, pop-up advertiser, porn publisher, etc.

Whiteboard: +1 negative vote(s) (dwx)

NorthMan

Comment 4

•

23 years ago

I agree with dwx. However, I also think that if someone were to switch on Bayesian spam filtering, they would expect it to do something. There is extra work involved. You have to train it. Most users will need to have this explained to them. In any case, maybe we should try to improve or accelerate the learning curve of the Bayesian filter. So when they do turn it on, it catches on quickly. I don't keep huge spam/not-spam archives, and most people don't either. I recommend a dialog/alert that might come up if Bayesian filters have never been turned on, to educate the user about them and the necessary training. Or something. Just some more ideas. :-)

Eric Dimick Eastman

Comment 5

•

23 years ago

As a user who receives mountains of junk email a day, I vote for a *small* training.dat file to come with the install. I think that it would not be too hard to come up with a file that would at least stop the question "I turned on junk mail filtering. Why is NOTHING getting marked as junk?" A small file would quickly get overridden by the user's actual training. This doesn't permanently label people as spammers (after a very little training my filter marked Bugzilla messages as junk); it just provides a starting point. Another argument for the "little bit of something" approach: get users in the habit of un-marking the non-junk in addition to marking the junk. If they get too comfortable that anything marked as junk is necessarily so, they will miss their receipts from purchases. At the end of the day the definition of "junk mail" is highly personal, but I don't think that many people would mourn the passing of "Make millions sending junk mail" and "watch these teen s**ts do *** to ***". And if there is someone out there who wants to receive these messages, he can just mark a few as non-junk. Regardless of this choice, user education will be required.

David Grant

Comment 6

•

23 years ago

I would just like to vote a strong NO to this suggestion of including a pre-populated training.dat file. Some people's spam might be another person's ham. The worst thing that a spam filter can do is create false positives, and this is exactly what will happen if a pre-populated training.dat file is provided. However, if a training.dat file were created using only header information, and not keywords from the body of the email, then it may be possible. But I still think it is better to FORCE every user to create their own training.dat from scratch.

mozilla.gv6r

Comment 7

•

23 years ago

See bug 188232 for a possible reason for a very simple "pre-populated training.dat". It's quite possible for the training data to somehow get confused so that NO messages are marked as junk, despite training on 100s of junk messages. There's something in the training algorithms that keeps them from working properly when you have no "good" tokens in the training.dat file, and hundreds of "bad" tokens (but not enough similarity in the bad tokens that anything ever gets marked as junk). So, either the algorithms need to check for that boundary condition, or the boundary condition needs to be avoided (per the really annoying workaround I came up with involving editing the binary training.dat file).
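A rough illustration of the "check for that boundary condition" option, for readers who want something concrete. This is only a sketch under stated assumptions: the function name `classify`, its parameters, and the `cutoff` value are hypothetical and are not taken from the Mozilla source, and the actual cause discussed in bug 188232 may be different. The idea is simply to refuse to give a verdict until both good and junk messages have been trained.

```python
from typing import Callable, Optional


def classify(
    message: str,
    n_good: int,
    n_junk: int,
    score: Callable[[str], float],
    cutoff: float = 0.9,  # hypothetical junk threshold, not Mozilla's value
) -> Optional[bool]:
    """Return True for junk, False for not junk, None if undecidable.

    n_good and n_junk are the numbers of messages the user has trained
    as good and as junk. `score` is any function that maps a message to
    a junk probability in [0, 1]; it stands in for the real filter.
    """
    # Boundary condition: with only one class trained, the probabilities
    # are not meaningful, so report "unknown" instead of guessing.
    if n_good == 0 or n_junk == 0:
        return None
    return score(message) >= cutoff
```

A caller could treat `None` as a cue to prompt the user for more training instead of silently marking nothing.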

Sander

Comment 8

•

23 years ago

*** Bug 191566 has been marked as a duplicate of this bug. ***

Ray Charbonneau

Comment 9

•

23 years ago

How hard would it be to add a "Junk Filter status" menu option, which would display how many junk and non-junk messages have been used for training?

farcusnz

Comment 10

•

23 years ago

How about Mozilla having a web page where users can select the types of mail they consider to be junk, with those selections somehow adding data to the training.dat file? Or perhaps the option of downloading one of several differently set up training.dat files. As was said earlier, the junk mail system has caused an awful lot of headaches for the average user. If someone like Yahoo can filter spam pretty well, I do not see what is wrong with Mozilla giving the option of a prepopulated training.dat file. This is the route to the greatest success for this feature in my opinion; otherwise it is just going to be too hard for some people.

Clarence Risher

Comment 11

•

22 years ago

The idea for a web site where users can select types of common junk mail and have a populated training.dat produced for them is very good! I would love to see that regardless of whether or not Mozilla ships with one. As to the issue at hand, I think a small pre-populated training.dat would be a very good idea. It would provide immediate results for someone who was new to the concept of junk filtering. As to the grey areas of what is spam, we could easily avoid them. Get a few dozen users (not all developers! get some people who are mom/dad/dontknowjack users too) to submit all their email to a pile, then calculate a training.dat from it. Then go through the resulting file and only keep entries that have over a certain threshold of occurrences and are >95% or <5% spam. This will put things like "penis" and "teen" on the list as spam indicators while avoiding grey area 'words' like "order" and "http://". Obviously the values would be tweaked, but I think it would be very easy to produce a training.dat that had over an 80% true-positive and true-negative rate and less than a 5% false-positive rate for more than 95% of users.
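The pruning step described in this comment could be sketched roughly as follows. This is an illustration only: the in-memory table mapping each token to a `(ham_count, junk_count)` pair, the name `prune_tokens`, and the default `min_occurrences=20` are assumptions made for the example, and the real training.dat is a binary file that this sketch does not read or write.

```python
from typing import Dict, Tuple

TokenCounts = Dict[str, Tuple[int, int]]  # token -> (ham_count, junk_count)


def prune_tokens(
    counts: TokenCounts,
    min_occurrences: int = 20,  # hypothetical occurrence threshold
    low: float = 0.05,          # keep tokens that are < 5% junk
    high: float = 0.95,         # keep tokens that are > 95% junk
) -> TokenCounts:
    """Keep only frequent tokens that are strongly ham or strongly junk."""
    kept: TokenCounts = {}
    for token, (ham, junk) in counts.items():
        total = ham + junk
        if total < min_occurrences:
            continue  # too rare to trust
        junk_ratio = junk / total
        if junk_ratio > high or junk_ratio < low:
            kept[token] = (ham, junk)
    return kept


# Made-up counts, purely for illustration.
sample = {
    "viagra": (0, 40),    # always junk: kept
    "order": (30, 35),    # grey area: dropped
    "bugzilla": (50, 1),  # almost always ham: kept
    "rare": (0, 3),       # below the occurrence threshold: dropped
}
pruned = prune_tokens(sample)
```

Using raw occurrence counts for the ratio is the simplest reading of the proposal; a version that normalizes by the number of ham and junk messages in the pile would behave differently when the pile is lopsided.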

David Grant

Comment 12

•

22 years ago

I'd like to re-iterate my opinion of NO to having a pre-populated Junk file. This is a bad idea. There are many other ways around this, and I think having a pre-populated training.dat file should be the last possible option.

NorthMan

Comment 13

•

22 years ago

David, what would you suggest?

Clarence Risher

Comment 14

•

22 years ago

On another note, the false positives argument isn't really valid. With no training.dat, and during the initial stages of training (the first few dozen markings), the false positive rate is going to be high anyway. By shipping with a prepopulated list we would decrease the initial false positive rate for an extreme majority of users (I'd say 99% or better), and for the very few others who actually WANT to know how to enlarge their penis, or who really do have friends in Nigeria with a billion dollars to launder, it will only take a few corrections for the spam filter to almost completely forget its original training. PS: I am willing to create the training.dat I speak of if anyone is interested in trying it, to prove that it is easily correctable but also more accurate than no training.dat at all.

Myk Melez [:myk] [@mykmelez]

Updated

•

21 years ago

Product: MailNews → Core

Adam D. Moss

Updated

•

21 years ago

OS: Windows 2000 → All

Hardware: PC → All

Aarjav Trivedi

Comment 15

•

19 years ago

Was consensus ever reached on this? One thing to note is that there are already multiple community approved spam corpuses (corpii?). Both the SpamAssassin and TREC Spam corpuses are widely used in academia and the real world. Their efficacy as training data sets can be debated, but I don't think the spamminess of their contents is controversial. If a handful of users would find the unsolicited Viagra ads therein useful, I'd say sacrificing some usability for them is worth the added value provided to the (much larger) general user population. Having said that, the idea mentioned above of divvying up the training set into categories (medical, sexual etc.) and including only those is a great idea. This would also allow us to have foreign language training subsets etc.

Tuukka Tolvanen (sp3000)

Comment 16

•

19 years ago

Incorporating large spam/ham corpora in a default training set would introduce training inertia, requiring more training actions to correct one kind of classification error. Also, as Thunderbird is dealing with private content with a close feedback loop, it's in a great position to leverage the distinct personality of the user's ham to the fullest; that would be weakened by a heavy generic training set. A minimal training set that's very quickly overridden by user training might help discoverability (there's probably some potential for annoyance there too), but that's already better than when this was filed, afair. Junk filtering is on by default; the Junk button is on the toolbar by default, and touching it gives a short introduction dialog about training the first time.
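To make the "training inertia" point concrete, here is a toy model, not the filter's actual arithmetic. It assumes a token's junk probability is just `junk / (junk + ham)` and counts how many "not junk" corrections it takes to pull a seeded token below a threshold. The function name `corrections_needed` and the 0.5 threshold are hypothetical.

```python
def corrections_needed(
    seed_junk: int,
    seed_ham: int = 0,
    threshold: float = 0.5,
) -> int:
    """Count ham markings needed until junk / (junk + ham) < threshold.

    seed_junk and seed_ham are the counts a shipped training set would
    contribute for one token before the user has trained anything.
    """
    junk, ham = seed_junk, seed_ham
    steps = 0
    # Each loop iteration models the user marking one message
    # containing the token as "not junk".
    while junk + ham > 0 and junk / (junk + ham) >= threshold:
        ham += 1
        steps += 1
    return steps


small_seed = corrections_needed(2)    # a token seen twice in a tiny seed set
large_seed = corrections_needed(200)  # the same token from a big corpus
```

In this model the number of corrections grows linearly with the seeded count, which is the inertia being described: a small seed is overridden after a handful of markings, while a large corpus takes hundreds.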

Worcester12345

Comment 17

•

19 years ago

(In reply to comment #16)
> Incorporating large spam/ham corpora in a default training set would introduce training inertia, requiring more training actions to correct one kind of classification error. Also, as thunderbird is dealing with private content with a close feedback loop, it's in a great position to leverage the distinct personality of the user's ham to the fullest; that would be weakened by a heavy generic training set.
>
> A minimal training set that's very quickly overridden by user training might help discoverability (there's probably some potential for annoyance there too), but that's already better than when this was filed, afair. Junk filtering is on by default; the Junk button is on the toolbar by default, and touching it gives a short introduction dialog about training the first time.

I'm not sure about trunk, but the junk mail feature is not really working in the 2.0 nightly builds.

Tuukka Tolvanen (sp3000)

Comment 18

•

19 years ago

> I'm not sure about trunk, but the junk mail feature is not really working in

...so ask about it on forums and/or file a bug. Don't comment on a random bug report fully quoting a completely unrelated comment. Thanks.

Dan Mosedale (:dmosedale, :dmose)

Comment 19

•

19 years ago

Assigning bugs that I'm not actively working on back to nobody; use SearchForThis as a search term if you want to delete all related bugmail at once.

Assignee: dmose → nobody

Wayne Mery (:wsmwk)

Updated

•

18 years ago

Severity: normal → enhancement

QA Contact: laurel → filters

Przemyslaw Bialik

Updated

•

18 years ago

URL: http://www.entrian.com/sbwiki/Trainin...

Nobody; OK to take it and work on it

Assignee

Updated

•

17 years ago

Product: Core → MailNews Core

Wayne Mery (:wsmwk)

Updated

•

11 years ago

Blocks: 230269

Wayne Mery (:wsmwk)

Updated

•

11 years ago

No longer blocks: 230269

BMO Automation

Updated

•

3 years ago

Severity: normal → S3