Open Bug 234411 Opened 21 years ago Updated 6 years ago

Make Bayesian filter (junk mail) work both on 'bytes' and 'characters' (DBCS/CJK) Japanese/Chinese/Korean

Categories

(MailNews Core :: Filters, defect)

Type: defect
Priority: Not set
Severity: critical

Tracking

(Not tracked)

People

(Reporter: jshin1987, Unassigned)

References


Details

(Keywords: intl)

It seems like Mozilla's Bayesian junk-mail filter works on 'bytes' rather than on 'characters'. For English messages in ASCII-compatible encodings, 'bytes' and 'characters' coincide. However, non-English text can be represented in multiple encodings, and filtering based only on the 'octet contents' of email messages wouldn't be as effective as considering the 'character contents' as well. See:

http://www5.justnet.ne.jp/~level/mozilla/party4/JunkMailFilter.ppt
http://www5e.biglobe.ne.jp/~level0/mozilla/spam/

The pages are in Japanese (I don't read Japanese, although I can make something out of the Kanji). Kat, can you help us with the PPT presentation above?
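To make the bytes-vs-characters point concrete, here is a small Python illustration (my own example, not Mozilla code): the same Japanese word produces an entirely different byte sequence under each MIME charset, so tokens learned from raw octets in one encoding can never match the same word arriving in another encoding.

word = "迷惑メール"   # "junk mail"
for charset in ("utf-8", "iso-2022-jp", "shift_jis"):
    # each call prints a completely different byte sequence for the
    # same five characters, so byte-level tokens cannot match across charsets
    print(charset, word.encode(charset))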
Take a look at the URL link in this bug. It looks to me like we do support UTF-8, but that the code makes the bogus assertion that anything non-ASCII is going to be UTF-8 encoded. Could that be the real nature of the problem you're seeing?
s/assertion/assumption/
jshin, as dmose points out, we use an I18N semantic scanner to scan the UTF-8 text in search of reasonable semantic units. Does that code not look right to you? Of course, before we do that, we are tokenizing the string based on white space, line returns, tabs, etc.
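For reference, a rough sketch (mine, in Python; the function names are not the actual Mozilla identifiers) of the two-stage pipeline described here: split on white space first, then hand anything non-ASCII, already converted to Unicode, to a character-aware scanner.

import re

def scan_semantic_units(text):
    # placeholder for the I18N semantic scanner; the character-class
    # rules it applies to Japanese are summarized in a later comment
    return [text]

def tokenize(body):
    tokens = []
    # stage 1: tokenize on white space, tabs and line returns
    for word in re.split(r"[ \t\r\n]+", body):
        if not word:
            continue
        if word.isascii():
            tokens.append(word.lower())
        else:
            # stage 2: scan the non-ASCII text for semantic units
            tokens.extend(scan_semantic_units(word))
    return tokens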
Thanks, dmose and mscott, for your comments. It seems to be doing the right thing, provided that the MIME charset is identified correctly (following the usual set of rules applied to the mail-view window, except for the manual user override) and the conversion to 'Unicode' (including base64/QP decoding before that) is done accordingly. I should have looked into the code instead of relying on my partial/incomplete (and likely wrong) interpretation of the write-up (in Japanese) I mentioned in comment #0. I also have to relay the observation of Korean Mozilla users. According to them, it works rather well, but I (and they) thought it might work even better if 'characters' were used as well (if they're not used currently, which turned out to be a wrong guess). Based on the new information, I guess I have to resolve this as invalid. Before that, let me add an assertion (IsUTF8) and see if it ever fires.
I was at the Mozilla Japan conference where Emura-san presented this intro to Mozilla's Junk Mail Filter circa 1.4 (the first document cited by Jungshik). It is true that Mozilla trains on the internal UTF-8 data, but the way it trains is somewhat crude as far as Japanese is concerned. It depends on the character identification method used elsewhere in the Mozilla code, which is based on the Unicode character block distinctions among the characters used in Japanese (slide 22):

1. The longest sequence of Hiragana until it meets a non-Hiragana boundary, i.e. punctuation, Katakana, Kanji, space, Roman letters, etc.
2. The longest sequence of Katakana until it meets a non-Katakana boundary.
3. The longest sequence of symbol characters until it meets a non-symbol boundary.
4. Any single Chinese (Kanji) character.

These are a reasonable first pass at parsing Japanese but pretty crude. Using these rules, Mozilla arrives at the parsing shown in slide 23, which is not very good. According to Emura-san, a better result would be something like slide 25. You can argue whether that result is the best it could be, but there is no question that it is much better than what we have now. For this revision, Emura-san suggests the following rules (slide 24):

1. A Hiragana sequence must contain at least 2 characters.
2. A Katakana sequence must contain at least 2 characters.
3. A symbol-character sequence must contain at least 2 characters.
4. A Kanji sequence must contain at least 1 Chinese (Kanji) character plus one more arbitrary character from any class.
5. Or a Kanji sequence with more than 2 Chinese characters, i.e. 3 or more.

While Emura-san does not offer hard evidence for why these rules are preferable to the current ones, they will probably do much better. These suggestions were made last April; I wonder whether a bug was ever filed to reflect these ideas. How are the Korean character classes distinguished, and are the current rules sufficient? I would think that we need to apply language-specific rules for further improvement.
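To make the slide 22 rules concrete, here is a toy Python re-implementation as I read them (my sketch, not the actual Mozilla tokenizer; the symbol-run rule and the suggested minimum-length revisions are omitted):

import re

HIRAGANA = "\u3041-\u309f"
KATAKANA = "\u30a0-\u30ff"             # includes the prolonged sound mark
KANJI    = "\u3400-\u4dbf\u4e00-\u9fff"

# rule 1: longest Hiragana run; rule 2: longest Katakana run;
# rule 4: every Kanji is a token of its own (rule 3, symbol runs, omitted)
RUNS = re.compile("([%s]+)|([%s]+)|([%s])" % (HIRAGANA, KATAKANA, KANJI))

def jp_tokens(text):
    return [m.group(0) for m in RUNS.finditer(text)]

print(jp_tokens("迷惑メールを受信しました"))
# -> ['迷', '惑', 'メール', 'を', '受', '信', 'しました']

Under Emura-san's suggested revision (as I read it), a single-character Hiragana token like 'を' would be dropped, and each Kanji would be kept together with at least one following character rather than emitted on its own.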
Product: MailNews → Core
Sorry for the spam. Making bugzilla reflect reality, as I'm not working on these bugs. Filter on FOOBARCHEESE to remove these in bulk.
Assignee: sspitzer → nobody
Filter on "Nobody_NScomTLD_20080620"
QA Contact: laurel → filters
Product: Core → MailNews Core
Is there anything from bug 472764 that can help with this bug? Is bug 191387 a duplicate?
Severity: normal → major
Bug 277354 implemented a Japanese-specific tokenizer for the bayes filter code on 2005-01-17. I think it would make sense to close all Japanese-specific bugs filed prior to that date as WFM or DUP.

> is there anything from Bug 472764 that can help with this bug?

It would be interesting to try out that tokenizer scheme, as spam tokenizers in general have been moving toward similar n-gram tokenizers recently, rather than trying to define a specific list of punctuation as we currently do.
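For illustration only (this is not the scheme from bug 472764, just the generic n-gram idea), a character-bigram tokenizer in Python:

def cjk_bigrams(text):
    # naive: keep non-ASCII characters and emit overlapping pairs;
    # a real tokenizer would respect script boundaries and word breaks
    chars = [c for c in text if not c.isascii()]
    return [a + b for a, b in zip(chars, chars[1:])]

print(cjk_bigrams("迷惑メール"))
# -> ['迷惑', '惑メ', 'メー', 'ール']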
Bug 191387 (and this one) were the only bugs I had found regarding *any* non-English languages, and the dupe is done.
The Gloda tokenizer is an SQLite3 module, so we cannot use the same module for Thunderbird's bayesian filter. But I will consider fixing this for the 3.1 release. I'll take this bug.
Assignee: nobody → m_kato
Assignee: m_kato → nobody

Considering the size of our CJK user population, this is a critical flaw.

Severity: major → critical
Summary: Make Bayesian filter (junk mail) work both on 'bytes' and 'characters' → Make Bayesian filter (junk mail) work both on 'bytes' and 'characters' (DBCS/CJK) Japanese/Chinese/Korean