Open
Bug 528113
Opened 16 years ago
Updated 3 years ago
Spam filter should penalize script-mixing within single words
Categories
(MailNews Core :: Filters, enhancement)
Tracking
(Not tracked)
NEW
People
(Reporter: usenet, Unassigned)
References
Details
As referred to in bug 528109, words like "VIAGRA" can be obfuscated in many ways using a variety of homoglyphs, effectively defeating the spam detection algorithm.
Although, as suggested in that bug, homograph normalization and accent stripping on words before comparing them with, or adding them to, the filter dictionary may well ameliorate this problem, spammers' ingenuity is likely to outpace efforts to devise effective countermeasures to this kind of obfuscation.
However, what a great many of these obfuscation techniques have in common is the mixing of characters from multiple scripts within a single word.
To counter this, the spam filter should consider adding a small penalty for any word that mixes characters from multiple scripts that are not normally mixed in any known writing system.
For example, mixing Cyrillic (e.g. Russian) and Latin characters in a word would receive a penalty, but mixing Katakana, Hiragana and Kanji within a word would not, because that mixture is part of the Japanese writing system.
This is not as complex as it sounds; the character ranges for each script are documented in the Unicode tables, and the reasonable combinations of writing systems and their mixtures are easily enumerated.
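The idea above could be sketched roughly as follows. This is only an illustration, not a proposed patch: the range table is a small, approximate subset of the Unicode data (block ranges standing in for true script assignments), and the names `SCRIPT_RANGES`, `ALLOWED_MIXTURES`, `mixing_penalty` and the penalty value 0.1 are all hypothetical.

```python
# Approximate, partial table of code point ranges per script.
# Assumption: a real filter would load the full Unicode Scripts.txt data.
SCRIPT_RANGES = [
    (0x0041, 0x005A, "Latin"),
    (0x0061, 0x007A, "Latin"),
    (0x00C0, 0x024F, "Latin"),
    (0x0370, 0x03FF, "Greek"),
    (0x0400, 0x04FF, "Cyrillic"),
    (0x3040, 0x309F, "Hiragana"),
    (0x30A0, 0x30FF, "Katakana"),
    (0x4E00, 0x9FFF, "Han"),
]

# Script mixtures that occur in real writing systems (hypothetical list;
# only the Japanese case from this report is included).
ALLOWED_MIXTURES = [frozenset({"Hiragana", "Katakana", "Han"})]


def script_of(ch):
    """Return the script name for a character, or None if not in the table."""
    cp = ord(ch)
    for lo, hi, name in SCRIPT_RANGES:
        if lo <= cp <= hi:
            return name
    return None


def scripts_in(word):
    """Collect the set of known scripts used by the letters of a word."""
    found = set()
    for ch in word:
        if not ch.isalpha():
            continue  # ignore digits, punctuation and symbols
        name = script_of(ch)
        if name is not None:
            found.add(name)
    return found


def mixing_penalty(word, penalty=0.1):
    """Return a small penalty if the word mixes scripts in an unusual way."""
    scripts = scripts_in(word)
    if len(scripts) <= 1:
        return 0.0
    if any(scripts <= allowed for allowed in ALLOWED_MIXTURES):
        return 0.0
    return penalty
```

With this sketch, `mixing_penalty("VIAGRA")` gives no penalty, while the same word with the Latin "I" replaced by the Cyrillic "\u0406" is penalized. A word combining Kanji, Hiragana and Katakana is not penalized, because that set is listed in `ALLOWED_MIXTURES`.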
Updated•16 years ago
Component: General → Filters
Product: Thunderbird → MailNews Core
QA Contact: general → filters
Version: unspecified → 1.9.1 Branch
Updated•15 years ago
Severity: normal → enhancement
Updated•3 years ago
Severity: normal → S3