Closed Bug 305309 Opened 19 years ago Closed 17 years ago

junkmail filters should be more charset aware

Categories

(MailNews Core :: Filters, enhancement)

enhancement
Not set
normal

Tracking

(Not tracked)

RESOLVED DUPLICATE of bug 234411

People

(Reporter: usenet, Unassigned)

Details

Attachments

(2 files, 6 obsolete files)

User-Agent:       Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.10) Gecko/20050716 Firefox/1.0.6
Build Identifier: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.10) Gecko/20050716 Firefox/1.0.6

I receive lots of Korean spam, yet I can't read Korean. No matter how much I
mark Korean texts as spam, they still get through the filters. To fix this, the
spam filters should be Unicode script range aware.



Reproducible: Always

Steps to Reproduce:
1. Receive lots of Korean spam.
2. Mark it as spam in the mail tool.

Actual Results:  
It still gets through

Expected Results:  
The mail filter should learn that I don't like text with a preponderance of
characters in the Korean script Unicode code point range.



Whilst Bug 199478 has been marked WONTFIX, this appears to be far more
reasonable to fix without bloat, as only a small table is needed to hold the
major Unicode code point ranges.
Suggested algorithm:

Bin the Unicode characters of _rendered_ text (not page source) into Unicode
script ranges. The dominant script is the script-range bin with the most
non-space characters in it. Then generate a pseudo-token for the Bayesian
learning algorithm, a pseudo-header for the E-mail, or both, to identify the
dominant script.
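
The suggested algorithm could be sketched roughly as follows. This is an
illustration only, not the attached program: the range table is a small
hand-picked subset of Unicode blocks, and the names SCRIPT_RANGES,
dominant_script and script_pseudo_token, as well as the token format, are
made up for the example.

```python
from collections import Counter

# Illustrative subset of Unicode blocks; a real table would cover far more.
SCRIPT_RANGES = [
    (0x0041, 0x005A, "Latin"),     # ASCII capital letters
    (0x0061, 0x007A, "Latin"),     # ASCII small letters
    (0x0370, 0x03FF, "Greek"),     # Greek and Coptic block
    (0x0400, 0x04FF, "Cyrillic"),
    (0x1100, 0x11FF, "Hangul"),    # Hangul Jamo
    (0x3040, 0x309F, "Hiragana"),
    (0x30A0, 0x30FF, "Katakana"),
    (0x4E00, 0x9FFF, "Han"),       # CJK Unified Ideographs
    (0xAC00, 0xD7AF, "Hangul"),    # Hangul Syllables
]


def script_of(ch):
    """Return the script name for one character, or None if it is not binned."""
    cp = ord(ch)
    for lo, hi, name in SCRIPT_RANGES:
        if lo <= cp <= hi:
            return name
    return None  # spaces, digits, punctuation, unknown ranges


def dominant_script(text):
    """Return the script with the most binned characters, or None."""
    counts = Counter(s for s in map(script_of, text) if s is not None)
    if not counts:
        return None
    return counts.most_common(1)[0][0]


def script_pseudo_token(text):
    """Build a hypothetical pseudo-token for a Bayesian tokenizer."""
    script = dominant_script(text)
    return "x-dominant-script:" + (script or "unknown").lower()
```

The pseudo-token is just one more token as far as the Bayesian code is
concerned, so it would be trained on like any word in the message.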

Whilst this cannot be used to distinguish between E-mails in different
languages, it can at least distinguish between E-mails in different scripts.

The Bayesian spam filter can then use this to classify as spam any E-mail
that is substantially in a script that the user considers to be spam.
You need to tell us what product and what version you use ;) (since you directly
filed it in Core).
Product: Thunderbird 1.0.6 x86 version for Linux
This Python program shows how script ranges can be detected. Note that it makes
no attempt to detect _languages_, only _scripts_, and it completely ignores
punctuation, symbols, etc, unless they are specific to a given script.

Since this is only an heuristic, and Unicode is intended to be a stable
encoding, the table entries given should survive any new additions to Unicode.
Let's try that again, with a MIME type of text/plain.
This is a cleaned-up version of the earlier program, with a pre-compiled list,
with consecutive ranges merged.
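
Merging consecutive ranges might look something like the sketch below. It is
not taken from the attachment; merge_ranges and the (lo, hi, tag) tuple
layout are assumptions made for illustration.

```python
def merge_ranges(ranges):
    """Merge adjacent or overlapping (lo, hi, tag) ranges that share a tag."""
    merged = []
    for lo, hi, tag in sorted(ranges):
        # Extend the previous range only if the tag matches and there is no gap.
        if merged and merged[-1][2] == tag and lo <= merged[-1][1] + 1:
            prev_lo, prev_hi, _ = merged[-1]
            merged[-1] = (prev_lo, max(prev_hi, hi), tag)
        else:
            merged.append((lo, hi, tag))
    return merged


if __name__ == "__main__":
    table = [
        (0x0400, 0x04FF, "Cyrillic"),   # Cyrillic
        (0x0500, 0x052F, "Cyrillic"),   # Cyrillic Supplement
        (0x0370, 0x03FF, "Greek"),
    ]
    for lo, hi, tag in merge_ranges(table):
        print("%04X-%04X %s" % (lo, hi, tag))
```

Fewer table entries means fewer comparisons per lookup, which is presumably
the point of pre-compiling the list.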
Attachment #193284 - Attachment is obsolete: true
Attachment #193285 - Attachment is obsolete: true
This is a C version of the script classifier, with a binary chop search and
proper handling of sub-ranges of fullwidth and halfwidth forms.
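
For readers without the attachment, here is a rough Python sketch of a binary
chop over a sorted range table, including a few sub-ranges of the Halfwidth
and Fullwidth Forms block. The table contents and the names RANGES, STARTS
and classify are assumptions for the example, and the halfwidth Hangul range
is approximate (it contains unassigned gaps).

```python
from bisect import bisect_right

# Sorted, non-overlapping (lo, hi, script) ranges; illustrative subset only.
RANGES = [
    (0x0041, 0x005A, "Latin"),
    (0x0061, 0x007A, "Latin"),
    (0x0370, 0x03FF, "Greek"),
    (0x0400, 0x052F, "Cyrillic"),
    (0x1100, 0x11FF, "Hangul"),
    (0x3040, 0x309F, "Hiragana"),
    (0x30A0, 0x30FF, "Katakana"),
    (0x4E00, 0x9FFF, "Han"),
    (0xAC00, 0xD7AF, "Hangul"),
    (0xFF21, 0xFF3A, "Latin"),      # fullwidth capital letters
    (0xFF41, 0xFF5A, "Latin"),      # fullwidth small letters
    (0xFF66, 0xFF9F, "Katakana"),   # halfwidth katakana
    (0xFFA0, 0xFFDC, "Hangul"),     # halfwidth Hangul (approximate)
]
STARTS = [lo for lo, _, _ in RANGES]


def classify(cp):
    """Binary-search the table for the range containing code point cp."""
    i = bisect_right(STARTS, cp) - 1   # last range starting at or before cp
    if i >= 0 and cp <= RANGES[i][1]:
        return RANGES[i][2]
    return None                        # falls in a gap between ranges
```

Splitting the fullwidth/halfwidth block into sub-ranges lets a halfwidth
katakana character count towards the same bin as ordinary katakana.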
This has been rewritten to pack the lookup tables more effectively, with 2
range tables, one for shorts and the other for longs. In addition, the string
lookups are removed (the Bayesian code only needs unique tags and does not care
about their meanings), and the binning table has been moved to the stack.

The string lookup table is preserved in the source, but #ifdef'd out.
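
The idea of two packed tables with numeric tags can be imitated in Python
with the array module, as sketched below. This is only an analogy to the C
code: the table contents, the tag numbers and the names pack, SHORT, LONG and
tag_of are assumptions, and the sketch assumes no range straddles U+FFFF.

```python
from array import array
from bisect import bisect_right

# (lo, hi, tag) with small integer tags instead of script names.
RANGES = [
    (0x0041, 0x005A, 1),     # Latin capitals
    (0x0400, 0x04FF, 2),     # Cyrillic
    (0xAC00, 0xD7AF, 3),     # Hangul Syllables
    (0x10330, 0x1034F, 4),   # Gothic
    (0x20000, 0x2A6DF, 5),   # CJK Unified Ideographs Extension B
]


def pack(ranges, typecode):
    """Pack ranges into parallel arrays: starts, ends, one-byte tags."""
    return (array(typecode, [lo for lo, _, _ in ranges]),
            array(typecode, [hi for _, hi, _ in ranges]),
            array("B", [tag for _, _, tag in ranges]))


# Code points up to U+FFFF fit an unsigned short; the rest need a wider type.
SHORT = pack([r for r in RANGES if r[1] <= 0xFFFF], "H")
LONG = pack([r for r in RANGES if r[1] > 0xFFFF], "L")


def tag_of(cp):
    """Return the numeric tag for code point cp, or 0 if it is not binned."""
    los, his, tags = SHORT if cp <= 0xFFFF else LONG
    i = bisect_right(los, cp) - 1
    return tags[i] if i >= 0 and cp <= his[i] else 0
```

Since most entries live below U+FFFF, keeping them in the narrower table is
where the space saving would come from.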
Attachment #193298 - Attachment is obsolete: true
Attached file: Cleanup of the above code. (obsolete)
This is the same code as above, with various infelicities cleaned up.
Attachment #193341 - Attachment is obsolete: true
Now with finer discrimination in some code ranges.
Attachment #193342 - Attachment is obsolete: true
This version gives a 10-15% speed gain over the previous one, at the cost of
a 3% increase in x86 code size (excluding libraries and symbols).

Another tweak to the above for speed. It's not easy to get more speed out
without spending more space, unless we make a fast path for ASCII (which
favours one script over others) or use a last-matched range heuristic (which
stops things from being cleanly re-entrant). However, since the current speed
is around 11 Mchars/sec on a 2GHz P4, it's probably fast enough.
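
To make the last-matched range trade-off concrete, here is a hypothetical
Python sketch, not the attached code. The class name CachedClassifier, the
hits counter and the table are invented for the example. The remembered range
is state that survives between calls, which is what gets in the way of clean
re-entrancy if it is shared; the sketch keeps it in a caller-owned object.

```python
from bisect import bisect_right

# Sorted, non-overlapping (lo, hi, script) ranges; illustrative subset only.
RANGES = [
    (0x0041, 0x005A, "Latin"),
    (0x0061, 0x007A, "Latin"),
    (0x0400, 0x04FF, "Cyrillic"),
    (0x4E00, 0x9FFF, "Han"),
    (0xAC00, 0xD7AF, "Hangul"),
]
STARTS = [lo for lo, _, _ in RANGES]


class CachedClassifier:
    """Range lookup that first retries the range matched by the previous call."""

    def __init__(self):
        self._last = None   # last matched (lo, hi, script), if any
        self.hits = 0       # number of lookups answered from the cache

    def classify(self, cp):
        last = self._last
        if last is not None and last[0] <= cp <= last[1]:
            self.hits += 1
            return last[2]
        i = bisect_right(STARTS, cp) - 1
        if i >= 0 and cp <= RANGES[i][1]:
            self._last = RANGES[i]   # remember for the next call
            return RANGES[i][2]
        return None
```

Text in a single script tends to hit the same range repeatedly, so most
lookups would skip the search entirely.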
Attachment #193347 - Attachment is obsolete: true
Oops, that should have been a 10% increase in code size. Footprint is now 1.2k.
xref bug 234411 -- dupe?
you might want to comment in bug 234411
Status: UNCONFIRMED → RESOLVED
Closed: 17 years ago
Resolution: --- → DUPLICATE
Product: Core → MailNews Core