Open
Bug 234411
Opened 21 years ago
Updated 6 years ago
Make Bayesian filter (junk mail) work both on 'bytes' and 'characters' (DBCS/CJK) Japanese/Chinese/Korean
Categories
(MailNews Core :: Filters, defect)
Tracking
(Not tracked)
NEW
People
(Reporter: jshin1987, Unassigned)
References
Details
(Keywords: intl)
It seems that Mozilla's Bayesian junk-mail filter works on 'bytes' rather than on
'characters'. For English messages in ASCII-compatible encodings, 'bytes' and
'characters' are effectively the same. However, non-English text can be represented
in multiple encodings, so filtering based only on the 'octet contents' of email
messages would not be as effective as also considering the 'character contents'.
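To make the bytes-vs-characters point concrete, here is a minimal Python sketch (hypothetical, not Mozilla code; the sample word is just a well-known spam marker) showing that one Japanese string yields a different byte sequence under each common mail encoding:

    # Hypothetical illustration: the same Japanese word encodes to four
    # different byte sequences, so byte-level tokens cannot match across
    # encodings, while the decoded characters are always the same five.
    word = "未承諾広告"  # "unsolicited advertisement", a common spam marker
    for charset in ("iso-2022-jp", "shift_jis", "euc-jp", "utf-8"):
        print(charset, word.encode(charset).hex(" "))

A byte-based filter would learn four unrelated tokens for what is a single 'word' to the reader.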
See
http://www5.justnet.ne.jp/~level/mozilla/party4/JunkMailFilter.ppt
http://www5e.biglobe.ne.jp/~level0/mozilla/spam/
The pages are in Japanese (I don't read Japanese, although I can make something
out of the Kanji).
Kat, can you help us with the PPT presentation above?
Comment 1•21 years ago
Take a look at the URL link in this bug. It looks to me like we do support
UTF-8, but that the code makes the bogus assertion that anything non-ASCII is
going to be UTF-8 encoded. Could that be the real nature of the problem you're
seeing?
Comment 2•21 years ago
s/assertion/assumption/
Comment 3•21 years ago
jshin, as dmose points out, we use an I18N semantic scanner to scan the UTF-8
text in search of reasonable semantic units. Does that code not look right to you?
Of course, before we do that, we tokenize the string on white space, line
returns, tabs, etc.
Reporter
Comment 4•21 years ago
Thanks, dmose and mscott, for your comments. It seems to be doing the right
thing, provided that the MIME charset is identified correctly (following the usual
set of rules applied to the mail-view window, except for the manual user override)
and the conversion to Unicode (including base64/QP decoding before that) is
done accordingly.
I should have looked into the code instead of relying on my
__partial/incomplete (and likely wrong)__ interpretation of the write-up
(in Japanese) I mentioned in comment #0. I also have to relay the observations of
Korean Mozilla users. According to them, it works rather well, but I (and they)
thought it might work even better if 'characters' were used as well (if
they were not already used, which turned out to be a wrong guess).
Based on the new information, I guess I have to resolve this as invalid. Before
that, let me add an assertion (IsUTF8) and see if it ever fires.
Comment 5•21 years ago
I was at the Mozilla Japan Conference where Emura-san presented this
intro to Mozilla's Junk Mail Filter circa 1.4; it is the first
document cited by Jungshik.
It is true that Mozilla trains on the internal UTF-8 data, but
the way it trains is somewhat crude as far as Japanese
is concerned. It depends on the character identification method
used elsewhere in the Mozilla code, which is based on the Unicode
Character Block distinctions among the characters used in Japanese.
(Slide 22)
1. The longest sequence of Hiragana, until it meets a non-Hiragana boundary,
i.e. punctuation, Katakana, Kanji, space, Roman letters, etc.
2. The longest sequence of Katakana, until it meets a non-Katakana boundary.
3. The longest sequence of symbol characters, until it meets a non-symbol
boundary.
4. Any single Chinese (Kanji) character.
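A rough Python sketch of what rules 1-4 amount to (the Unicode block ranges are my assumption about what the block-based scanner uses; the actual Mozilla code may draw the boundaries differently):

    import re

    # Rules 1-4 as one regex alternation (assumed Unicode block ranges).
    TOKEN_V1 = re.compile(
        r"[\u3041-\u309F]+"    # rule 1: longest Hiragana run
        r"|[\u30A0-\u30FF]+"   # rule 2: longest Katakana run
        r"|[\u3000-\u303F]+"   # rule 3: longest symbol/punctuation run
        r"|[\u4E00-\u9FFF]"    # rule 4: any single Kanji
    )

    def tokenize_v1(text):
        return TOKEN_V1.findall(text)

    # tokenize_v1("迷惑メールを防ぐ") -> ['迷', '惑', 'メール', 'を', '防', 'ぐ']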
The rules above are a reasonable first attempt at parsing Japanese, but they
are pretty crude. Using them, Mozilla arrives at the parsing shown in Slide 23,
which is not very good.
A better result would be something like Slide 25 according to Emura-san.
You can argue whether this result is the best it could be, but there
is no question that it is much better than what we have now.
For this revision, Emura-san suggests the following rules:
1. A Hiragana sequence must contain at least 2 characters.
2. A Katakana sequence must contain at least 2 characters.
3. A symbol-character sequence must contain at least 2 characters.
4. For Chinese characters, a sequence may consist of one Chinese (Kanji)
character followed by one more arbitrary character from any class.
5. Otherwise, a Chinese-character sequence must contain more than 2 Chinese
characters, i.e. 3 or more.
These rules are suggested in Slide 24.
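One possible reading of the revised rules in Python (rules 4 and 5 are ambiguous as stated, so this is an interpretation rather than a transcription; overlaps between the two Kanji rules are left unresolved):

    import re

    def tokenize_v2(text):
        tokens = []
        # Rules 1-3: runs of at least two Hiragana / Katakana / symbols.
        for run in (r"[\u3041-\u309F]{2,}",
                    r"[\u30A0-\u30FF]{2,}",
                    r"[\u3000-\u303F]{2,}"):
            tokens += re.findall(run, text)
        # Rule 4: one Kanji plus the single following character, any class.
        tokens += re.findall(r"[\u4E00-\u9FFF].", text)
        # Rule 5: runs of three or more consecutive Kanji.
        tokens += re.findall(r"[\u4E00-\u9FFF]{3,}", text)
        return tokens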
While Emura-san does not offer hard evidence that these rules are
preferable to the current ones, they will probably do much better.
These suggestions were made last April; I wonder whether a bug has already
been filed to reflect these ideas.
How are the Korean character classes distinguished, and are the current
rules sufficient? I would think we need to apply language-specific
rules for further improvement.
Updated•21 years ago
Product: MailNews → Core
Comment 7•18 years ago
Sorry for the spam; making Bugzilla reflect reality, as I'm not working on these bugs. Filter on FOOBARCHEESE to remove these in bulk.
Assignee: sspitzer → nobody
Comment 9•17 years ago
.
Assignee
Updated•17 years ago
Product: Core → MailNews Core
Comment 10•16 years ago
Is there anything from Bug 472764 that can help with this bug?
Is bug 191387 a duplicate?
Severity: normal → major
Comment 11•16 years ago
Bug 277354 implemented a Japanese-specific tokenizer for the Bayes filter code on 2005-01-17. I think it would make sense to close all Japanese-specific bugs filed before that date as WFM or DUP.
> is there anything from Bug 472764 that can help with this bug?
It would be interesting to try out that tokenizer scheme; spam tokenizers in general have recently been moving toward similar n-gram tokenizers, rather than trying to define a specific list of punctuation as we currently do.
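For reference, a generic character-bigram tokenizer of the kind alluded to above takes only a few lines (a sketch, not the code from Bug 472764):

    def char_ngrams(text, n=2):
        # Overlapping character n-grams; bigrams are a common choice for
        # CJK text, where words are not separated by spaces.
        return [text[i:i + n] for i in range(len(text) - n + 1)]

    # char_ngrams("未承諾広告") -> ['未承', '承諾', '諾広', '広告']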
Comment 12•16 years ago
Bug 191387 (and this one) were the only bugs I found regarding *any* non-English languages, and the dupe is done.
Comment 13•16 years ago
The Gloda tokenizer is an SQLite3 module, so we cannot use the same module for Thunderbird's Bayesian filter.
But I will consider fixing this for the 3.1 release. I'll take this bug.
Assignee: nobody → m_kato
Updated•9 years ago
Assignee: m_kato → nobody
Comment 15•6 years ago
Considering the size of our CJK user population, this is a critical flaw.
Severity: major → critical
Summary: Make Bayesian filter (junk mail) work both on 'bytes' and 'characters' → Make Bayesian filter (junk mail) work both on 'bytes' and 'characters' (DBCS/CJK) Japanese/Chinese/Korean