Open Bug 528113 Opened 16 years ago Updated 3 years ago

Spam filter should penalize script-mixing within single words

Categories

(MailNews Core :: Filters, enhancement)

1.9.1 Branch
enhancement

Tracking

(Not tracked)

People

(Reporter: usenet, Unassigned)

References

Details

As referred to in bug 528109, words like "VIAGRA" can be obfuscated in many ways using a variety of homoglyphs, effectively defeating the spam detection algorithm. As suggested in that bug, homograph normalization and accent stripping on words before comparing them with, or adding them to, the filter dictionary may well ameliorate this problem, but spammers' ingenuity is likely to outpace efforts to build effective countermeasures to this kind of obfuscation. However, what a great many of these obfuscation techniques have in common is the mixing of characters from multiple scripts within a single word. To counter this, the spam filter should consider adding a small penalty for any word that mixes characters from multiple scripts which are not normally mixed together in any known writing system. For example, mixing Cyrillic (e.g. Russian) and Latin characters in a word would receive a penalty; but mixing Katakana, Hiragana and Kanji within a word would not, because that mixture is part of the Japanese writing system. This is not as complex as it sounds; the character ranges for each script are documented in the Unicode tables, and the legitimate combinations of scripts are easily enumerated.
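To illustrate the proposal, here is a minimal Python sketch of a per-word script-mixing check. It is not taken from the MailNews code. The names `script_of`, `scripts_in`, `mixing_penalty`, `KNOWN_SCRIPTS` and `ALLOWED_MIXES`, and the penalty value, are all hypothetical. Python's standard library does not expose the Unicode Script property, so the sketch approximates it from the first word of each character's Unicode name; a real implementation would more likely use the script ranges from the Unicode tables directly.

```python
import unicodedata

# Assumption: script labels recognised by this sketch, taken from the first
# word of the Unicode character name (e.g. "CYRILLIC CAPITAL LETTER A").
KNOWN_SCRIPTS = {
    "LATIN", "CYRILLIC", "GREEK", "HEBREW", "ARABIC",
    "HIRAGANA", "KATAKANA", "CJK", "HANGUL",
}

# Assumption: an illustrative, incomplete allow-list of script sets that
# legitimately co-occur within one word.
ALLOWED_MIXES = [
    frozenset({"HIRAGANA", "KATAKANA", "CJK"}),  # Japanese writing system
]


def script_of(ch):
    """Return an approximate script label for ch, or None if unknown."""
    if not ch.isalpha():
        return None  # digits, punctuation and symbols are script-neutral here
    name = unicodedata.name(ch, "")
    first_word = name.split(" ", 1)[0]
    return first_word if first_word in KNOWN_SCRIPTS else None


def scripts_in(word):
    """Return the set of recognised scripts used by the letters of word."""
    return {s for s in map(script_of, word) if s is not None}


def mixing_penalty(word, penalty=1.0):
    """Return penalty if word mixes scripts outside every allowed set."""
    scripts = scripts_in(word)
    if len(scripts) <= 1:
        return 0.0
    if any(scripts <= allowed for allowed in ALLOWED_MIXES):
        return 0.0
    return penalty


if __name__ == "__main__":
    # "VI\u0410GRA" replaces the Latin "A" with CYRILLIC CAPITAL LETTER A.
    for w in ["VIAGRA", "VI\u0410GRA", "\u65e5\u672c\u8a9e\u306e\u30ab\u30bf\u30ab\u30ca"]:
        print(ascii(w), sorted(scripts_in(w)), mixing_penalty(w))
```

In a Bayesian filter the result would presumably feed in as one extra token or feature rather than as a hard rejection, in keeping with the "small penalty" suggested above.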
Component: General → Filters
Product: Thunderbird → MailNews Core
QA Contact: general → filters
Version: unspecified → 1.9.1 Branch
Severity: normal → enhancement
See Also: → 528109
Severity: normal → S3