junkmail filters should be more charset aware

RESOLVED DUPLICATE of bug 234411

Status

MailNews Core
Filters
--
enhancement
12 years ago
9 years ago

People

(Reporter: Neil Harris, Unassigned)

Details

Attachments

(2 attachments, 6 obsolete attachments)

(Reporter)

Description

12 years ago
User-Agent:       Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.10) Gecko/20050716 Firefox/1.0.6
Build Identifier: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.10) Gecko/20050716 Firefox/1.0.6

I receive lots of Korean spam, yet I can't read Korean. No matter how much I
mark Korean texts as spam, they still get through the filters. To fix this, the
spam filters should be Unicode script range aware.



Reproducible: Always

Steps to Reproduce:
1. Receive lots of Korean spam.
2. Mark it as spam in the mail tool.

Actual Results:  
The Korean spam still gets through the filters.

Expected Results:  
The mail filter should learn that I don't like text with a preponderance of
characters in the Korean script Unicode code point range.



Whilst Bug 199478 has been marked WONTFIX, this appears to be far more
reasonable to fix without bloat, as only a small table is needed to hold the
major Unicode code point ranges.
(Reporter)

Comment 1

12 years ago
Suggested algorithm:

Bin the Unicode characters of the _rendered_ text (not the page source) by
Unicode script range. The dominant script is the script-range bin containing
the most non-space characters. Then generate a pseudo-token for the Bayesian
learning algorithm, a pseudo-header for the E-mail, or both, identifying the
dominant script.

Whilst this cannot be used to distinguish between E-mails in different
languages, it can at least distinguish between E-mails in different scripts.

The Bayesian spam filter can then use this to classify as spam those E-mails
that are substantially in scripts that the user considers to be spam.
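
The steps above can be sketched roughly as follows. This is a minimal
illustration only, not the code from any attachment on this bug: the range
table is deliberately tiny and approximate, and the names SCRIPT_RANGES,
dominant_script and script_pseudo_token, as well as the "script:" token
prefix, are assumptions made up for the example.

```python
from collections import Counter

# Hypothetical, deliberately incomplete table of (first, last, script) ranges.
# Boundaries are approximate; a real table would cover far more of Unicode.
SCRIPT_RANGES = [
    (0x0041, 0x005A, "Latin"),
    (0x0061, 0x007A, "Latin"),
    (0x0370, 0x03FF, "Greek"),
    (0x0400, 0x04FF, "Cyrillic"),
    (0x1100, 0x11FF, "Hangul"),
    (0x3040, 0x309F, "Hiragana"),
    (0x30A0, 0x30FF, "Katakana"),
    (0x4E00, 0x9FFF, "Han"),
    (0xAC00, 0xD7A3, "Hangul"),
]


def script_of(ch):
    """Return the script name for one character, or None if unclassified."""
    cp = ord(ch)
    for first, last, name in SCRIPT_RANGES:
        if first <= cp <= last:
            return name
    return None


def dominant_script(text):
    """Bin the non-space characters by script and return the fullest bin."""
    counts = Counter()
    for ch in text:
        if ch.isspace():
            continue
        name = script_of(ch)
        if name is not None:  # digits, punctuation, etc. are ignored
            counts[name] += 1
    if not counts:
        return None
    return counts.most_common(1)[0][0]


def script_pseudo_token(text):
    """Build a pseudo-token naming the dominant script (hypothetical format)."""
    name = dominant_script(text)
    return "script:" + name if name else None
```

The pseudo-token would then be fed to the Bayesian filter alongside the
ordinary word tokens of the message, so that marking a few messages as spam
also trains the filter on their dominant script.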

Comment 2

12 years ago
You need to tell us what product and what version you use ;) (since you directly
filed it in Core).
(Reporter)

Comment 3

12 years ago
Product: Thunderbird 1.0.6 x86 version for Linux
(Reporter)

Comment 4

12 years ago
Created attachment 193284 [details]
Python program as example of how script ranges can be detected

This Python program shows how script ranges can be detected. Note that it makes
no attempt to detect _languages_, only _scripts_, and it completely ignores
punctuation, symbols, etc., unless they are specific to a given script.

Since this is only a heuristic, and Unicode is intended to be a stable
encoding, the table entries given should survive any new additions to Unicode.
(Reporter)

Comment 5

12 years ago
Created attachment 193285 [details]
Python program as example of how script ranges can be detected

This Python program shows how script ranges can be detected. Note that it makes
no attempt to detect _languages_, only _scripts_, and it completely ignores
punctuation, symbols, etc., unless they are specific to a given script.

Since this is only a heuristic, and Unicode is intended to be a stable
encoding, the table entries given should survive any new additions to Unicode.

Let's try that again, with a MIME type of text/plain.
(Reporter)

Comment 6

12 years ago
Created attachment 193287 [details]
Cleaned-up version of earlier Python algorithm example

This is a cleaned-up version of the earlier program. It uses a pre-compiled
list in which consecutive ranges have been merged.
Attachment #193284 - Attachment is obsolete: true
Attachment #193285 - Attachment is obsolete: true
(Reporter)

Comment 7

12 years ago
Created attachment 193298 [details]
Script classifier rewritten in C, with enhanced classification table

This is a C version of the script classifier, with a binary chop search and
proper handling of sub-ranges of fullwidth and halfwidth forms.
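
For readers without the attachment to hand, the binary chop lookup can be
sketched in Python as below. This is not the attached C code: the RANGES
table, its approximate boundaries, and the classify name are assumptions for
illustration. The fullwidth and halfwidth entries show how sub-ranges of that
block can be mapped back to their underlying scripts.

```python
from bisect import bisect_right

# Hypothetical table, sorted by first code point, with no overlapping ranges.
# Boundaries are approximate and the table is far from complete.
RANGES = [
    (0x0041, 0x005A, "Latin"),
    (0x0061, 0x007A, "Latin"),
    (0x0370, 0x03FF, "Greek"),
    (0x0400, 0x04FF, "Cyrillic"),
    (0x1100, 0x11FF, "Hangul"),
    (0x3040, 0x309F, "Hiragana"),
    (0x30A0, 0x30FF, "Katakana"),
    (0x4E00, 0x9FFF, "Han"),
    (0xAC00, 0xD7A3, "Hangul"),
    (0xFF21, 0xFF3A, "Latin"),     # fullwidth A-Z
    (0xFF41, 0xFF5A, "Latin"),     # fullwidth a-z
    (0xFF66, 0xFF9F, "Katakana"),  # halfwidth katakana
    (0xFFA0, 0xFFDC, "Hangul"),    # halfwidth hangul
]

# Parallel list of range starts, used as the binary search key.
STARTS = [first for first, _, _ in RANGES]


def classify(cp):
    """Return the script for code point cp, or None if it is in no range."""
    # Index of the last range whose start is <= cp.
    i = bisect_right(STARTS, cp) - 1
    if i < 0:
        return None
    first, last, name = RANGES[i]
    return name if cp <= last else None
```

Each lookup costs O(log n) comparisons in the number of ranges, which is why
the table has to stay sorted and non-overlapping.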
(Reporter)

Comment 8

12 years ago
Created attachment 193341 [details] [diff] [review]
Script classifier rewritten in C, with code/data memory footprint reduced to 1.1k

This has been rewritten to pack the lookup tables more effectively, with 2
range tables, one for shorts and the other for longs. In addition, the string
lookups are removed (Bayesian code only needs unique tags, does not care about
meanings), and the binning table has been moved to the stack.

The string lookup table is preserved in the source, but #ifdef'd out.
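
The packing idea can be illustrated in Python with the array module. This is
a sketch under assumptions, not a transcription of the attachment: it assumes
the "shorts" table holds ranges that fit in 16 bits and the "longs" table
holds the ranges above U+FFFF, and the table contents, the integer tag values
and the script_tag name are all made up for the example.

```python
from array import array
from bisect import bisect_right

# Ranges within the Basic Multilingual Plane fit in unsigned shorts ('H').
BMP_FIRST = array("H", [0x0041, 0x0061, 0x0400, 0xAC00])
BMP_LAST = array("H", [0x005A, 0x007A, 0x04FF, 0xD7A3])
BMP_TAG = array("B", [1, 1, 2, 3])  # opaque tags: 1 Latin, 2 Cyrillic, 3 Hangul

# Ranges above U+FFFF need a wider integer type ('L', unsigned long).
SUP_FIRST = array("L", [0x10330, 0x20000])
SUP_LAST = array("L", [0x1034F, 0x2A6DF])
SUP_TAG = array("B", [4, 5])        # opaque tags: 4 Gothic, 5 Han extension


def _lookup(first, last, tag, cp):
    """Binary search one packed table; 0 means 'no script'."""
    i = bisect_right(first, cp) - 1
    if i >= 0 and cp <= last[i]:
        return tag[i]
    return 0


def script_tag(cp):
    """Return a small integer tag for cp, choosing the table by magnitude."""
    if cp <= 0xFFFF:
        return _lookup(BMP_FIRST, BMP_LAST, BMP_TAG, cp)
    return _lookup(SUP_FIRST, SUP_LAST, SUP_TAG, cp)
```

Because the tags are only compared for equality, no name strings are needed
at run time, which matches the point above that the Bayesian code only needs
unique tags.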
Attachment #193298 - Attachment is obsolete: true
(Reporter)

Comment 9

12 years ago
Created attachment 193342 [details]
Cleanup of the above code.

This is the same code as above, with various infelicities cleaned up.
Attachment #193341 - Attachment is obsolete: true
(Reporter)

Comment 10

12 years ago
Created attachment 193347 [details]
Update of the above, with better discrimination in 8859-1 and presentation forms ranges

Now with finer discrimination in some code ranges.
Attachment #193342 - Attachment is obsolete: true
(Reporter)

Comment 11

12 years ago
Created attachment 193386 [details]
Slightly faster version of the above

This version gives a 10% - 15% speed gain over the previous one, at the cost of
a 3% increase in x86 code size (excluding libraries and symbols).

Another tweak to the above for speed. It's not easy to get more speed out
without spending more space, unless we make a fast path for ASCII (which
favours one script over others), or use a last-matched range heuristic (which
stops things from being cleanly re-entrant). However, since the current speed
is around 11 Mchars/sec on a 2GHz P4, it's probably fast enough.
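
To make the last-matched range trade-off concrete, here is a rough Python
sketch. It is not taken from the attachment; the CachedClassifier class, its
cache_hits counter and the small RANGES table are hypothetical. The cache is
kept per instance here, which is one possible way to avoid the shared state
that makes a global last-match variable non-re-entrant.

```python
from bisect import bisect_right

# Hypothetical sorted, non-overlapping table with approximate boundaries.
RANGES = [
    (0x0041, 0x005A, "Latin"),
    (0x0061, 0x007A, "Latin"),
    (0x0400, 0x04FF, "Cyrillic"),
    (0x4E00, 0x9FFF, "Han"),
    (0xAC00, 0xD7A3, "Hangul"),
]
STARTS = [first for first, _, _ in RANGES]


class CachedClassifier:
    """Binary search with a one-entry cache of the last range that matched."""

    def __init__(self):
        self._last = None    # last matching (first, last, name) tuple, if any
        self.cache_hits = 0  # counter kept only to make the effect visible

    def classify(self, cp):
        last = self._last
        # Text tends to stay in one script, so try the previous range first.
        if last is not None and last[0] <= cp <= last[1]:
            self.cache_hits += 1
            return last[2]
        # Cache miss: fall back to the ordinary binary search.
        i = bisect_right(STARTS, cp) - 1
        if i >= 0 and cp <= RANGES[i][1]:
            self._last = RANGES[i]
            return RANGES[i][2]
        return None
```

With one classifier object per message or per thread, runs of characters in
the same range skip the search entirely, at the cost of carrying a little
state between calls.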
Attachment #193347 - Attachment is obsolete: true
(Reporter)

Comment 12

12 years ago
Oops, that should have been a 10% increase in code size. Footprint is now 1.2k.

Comment 13

12 years ago
xref bug 234411 -- dupe?

Comment 14

11 years ago
you might want to comment in bug 234411
Status: UNCONFIRMED → RESOLVED
Last Resolved: 11 years ago
Resolution: --- → DUPLICATE
Duplicate of bug: 234411
(Assignee)

Updated

9 years ago
Product: Core → MailNews Core