Closed Bug 305309 Opened 19 years ago Closed 17 years ago

junkmail filters should be more charset aware

Tracking

(Not tracked)

Status:

RESOLVED DUPLICATE of bug 234411

People

(Reporter: usenet, Unassigned)

Details

Attachments

(2 files, 6 obsolete files)

Python program as example of how script ranges can be detected; 19 years ago; Neil Harris; 2.80 KB, text/x-python
Python program as example of how script ranges can be detected; 19 years ago; Neil Harris; 2.80 KB, text/plain
Cleaned-up version of earlier Python algorithm example; 19 years ago; Neil Harris; 2.73 KB, text/plain
Script classifier rewritten in C, with enhanced classification table; 19 years ago; Neil Harris; 6.94 KB, text/plain
Script classifier rewritten in C, with code/data memory footprint reduced to 1.1k; 19 years ago; Neil Harris; 9.29 KB, patch
Cleanup of the above code; 19 years ago; Neil Harris; 9.41 KB, text/plain
Update of the above, with better discrimination in 8859-1 and presentation forms ranges; 19 years ago; Neil Harris; 9.83 KB, text/plain
Slightly faster version of the above; 19 years ago; Neil Harris; 10.71 KB, text/plain

Neil Harris

Reporter

Description

•

19 years ago

User-Agent:       Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.10) Gecko/20050716 Firefox/1.0.6
Build Identifier: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.10) Gecko/20050716 Firefox/1.0.6

I receive lots of Korean spam, yet I can't read Korean. No matter how much I
mark Korean texts as spam, they still get through the filters. To fix this, the
spam filters should be Unicode script range aware.



Reproducible: Always

Steps to Reproduce:
1. Receive lots of Korean spam.
2. Mark it as spam in the mail tool.

Actual Results:  
The Korean spam still gets through the filters.

Expected Results:  
The mail filter should learn that I don't like text with a preponderance of
characters in the Korean script Unicode code point range.



Whilst Bug 199478 has been marked WONTFIX, this appears to be far more
reasonable to fix without bloat, as only a small table is needed to hold the
major Unicode code point ranges.

Neil Harris

Reporter

Comment 1

•

19 years ago

Suggested algorithm:

Bin the Unicode characters of the _rendered_ text (not the page source) into
Unicode script ranges. The dominant script is the script-range bin that holds
the most non-space characters. Then generate a pseudo-token for the Bayesian
learning algorithm, a pseudo-header for the E-mail, or both, identifying the
dominant script.

Whilst this cannot be used to distinguish between E-mails in different
languages, it can at least distinguish between E-mails in different scripts.

This can then be used by the Bayesian spam filter to categorize E-mails that are
substantially in scripts that the user considers to be spam, as spam.
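
The algorithm suggested above can be sketched roughly as follows. This is a minimal illustration, not the code from the attachments: the range table is deliberately tiny (a real table would cover many more scripts), and the names `SCRIPT_RANGES`, `dominant_script`, and `script_pseudo_token`, as well as the `script:` token format, are assumptions made for the example.

```python
from collections import Counter

# Illustrative, incomplete table of (first, last, script) code point ranges.
# Boundaries follow Unicode block/letter ranges; the script names are
# arbitrary labels chosen for this sketch.
SCRIPT_RANGES = [
    (0x0041, 0x005A, "latin"),     # Basic Latin uppercase letters
    (0x0061, 0x007A, "latin"),     # Basic Latin lowercase letters
    (0x0400, 0x04FF, "cyrillic"),  # Cyrillic block
    (0x3040, 0x309F, "hiragana"),  # Hiragana block
    (0x4E00, 0x9FFF, "han"),       # CJK Unified Ideographs
    (0xAC00, 0xD7A3, "hangul"),    # Hangul Syllables
]


def script_of(ch):
    """Return the script label for one character, or None if unclassified."""
    cp = ord(ch)
    for first, last, name in SCRIPT_RANGES:
        if first <= cp <= last:
            return name
    return None  # spaces, digits, punctuation, and unlisted scripts


def dominant_script(text):
    """Bin characters by script and return the label of the fullest bin."""
    counts = Counter()
    for ch in text:
        name = script_of(ch)
        if name is not None:
            counts[name] += 1
    if not counts:
        return None
    return counts.most_common(1)[0][0]


def script_pseudo_token(text):
    """Build a hypothetical pseudo-token a Bayesian filter could learn from."""
    return "script:" + (dominant_script(text) or "unknown")
```

A filter could add the returned token to the token stream of each message, so that marking Korean messages as spam also trains on `script:hangul`.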

Frank Wein [:mcsmurf]

Comment 2

•

19 years ago

You need to tell us what product and what version you use ;) (since you directly
filed it in Core).

Neil Harris

Reporter

Comment 3

•

19 years ago

Product: Thunderbird 1.0.6 x86 version for Linux

Neil Harris

Reporter

Comment 4

•

19 years ago

Attached file: Python program as example of how script ranges can be detected (obsolete)

This Python program shows how script ranges can be detected. Note that it makes
no attempt to detect _languages_, only _scripts_, and it completely ignores
punctuation, symbols, etc., unless they are specific to a given script.

Since this is only a heuristic, and Unicode is intended to be a stable
encoding, the table entries given should survive any new additions to Unicode.

Neil Harris

Reporter

Comment 5

•

19 years ago

Attached file: Python program as example of how script ranges can be detected (obsolete)

Let's try that again, with a MIME type of text/plain.

Neil Harris

Reporter

Comment 6

•

19 years ago

Attached file: Cleaned-up version of earlier Python algorithm example

This is a cleaned-up version of the earlier program, with a precompiled range
list in which consecutive ranges have been merged.

Attachment #193284 - Attachment is obsolete: true

Attachment #193285 - Attachment is obsolete: true

Neil Harris

Reporter

Comment 7

•

19 years ago

Attached file: Script classifier rewritten in C, with enhanced classification table (obsolete)

This is a C version of the script classifier, with a binary chop search and
proper handling of sub-ranges of fullwidth and halfwidth forms.
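
For readers unfamiliar with the technique, a binary chop over a sorted range table might look like the sketch below. It is written in Python rather than C and is not the attached code; the numeric tags, the table contents, and the name `classify` are assumptions for illustration. The fullwidth and halfwidth entries show how sub-ranges of one Unicode block can map to different scripts (the halfwidth Hangul range is approximate and includes a few unassigned code points).

```python
from bisect import bisect_right

# Sorted, non-overlapping (first, last, tag) ranges. Tags are arbitrary small
# integers for this sketch: 1 = Latin, 2 = Cyrillic, 3 = Hangul, 4 = Katakana.
RANGES = [
    (0x0041, 0x005A, 1),  # Basic Latin uppercase letters
    (0x0061, 0x007A, 1),  # Basic Latin lowercase letters
    (0x0400, 0x04FF, 2),  # Cyrillic block
    (0xAC00, 0xD7A3, 3),  # Hangul Syllables
    (0xFF21, 0xFF3A, 1),  # Fullwidth Latin capital letters
    (0xFF66, 0xFF9F, 4),  # Halfwidth Katakana
    (0xFFA0, 0xFFDC, 3),  # Halfwidth Hangul (approximate)
]
STARTS = [first for first, _, _ in RANGES]


def classify(cp):
    """Return the tag of the range containing code point cp, or 0 if none."""
    # Find the last range whose start is <= cp, then check its upper bound.
    i = bisect_right(STARTS, cp) - 1
    if i >= 0 and cp <= RANGES[i][1]:
        return RANGES[i][2]
    return 0
```

Each lookup costs O(log n) comparisons in the number of ranges, which is why the table can grow without a proportional slowdown.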

Neil Harris

Reporter

Comment 8

•

19 years ago

Attached patch: Script classifier rewritten in C, with code/data memory footprint reduced to 1.1k (obsolete)

This has been rewritten to pack the lookup tables more effectively, with 2
range tables, one for shorts and the other for longs. In addition, the string
lookups are removed (the Bayesian code only needs unique tags and does not care
about their meanings), and the binning table has been moved to the stack.

The string lookup table is preserved in the source, but #ifdef'd out.
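
One plausible reading of the two-table packing, sketched in Python with the standard `array` module, is shown below. This is an assumption about the layout, not the attached C code: range boundaries that fit in 16 bits go into arrays of unsigned shorts, and only the supplementary-plane boundaries pay for a wider integer type. The table contents, tag values, and function names are illustrative.

```python
from array import array
from bisect import bisect_right

# Ranges inside the Basic Multilingual Plane: boundaries fit in 16 bits.
BMP_FIRST = array("H", [0x0041, 0x0061, 0x0400, 0xAC00])
BMP_LAST = array("H", [0x005A, 0x007A, 0x04FF, 0xD7A3])
BMP_TAG = array("B", [1, 1, 2, 3])  # one byte per tag; strings are not needed

# Ranges above U+FFFF: boundaries need a wider integer type.
SUP_FIRST = array("L", [0x10330, 0x20000])  # Gothic, CJK Extension B
SUP_LAST = array("L", [0x1034A, 0x2A6DF])
SUP_TAG = array("B", [4, 5])


def _lookup(first, last, tag, cp):
    """Binary search one packed table; return the tag or 0 if unmatched."""
    i = bisect_right(first, cp) - 1
    if i >= 0 and cp <= last[i]:
        return tag[i]
    return 0


def classify(cp):
    """Pick the narrow or wide table by code point size, then search it."""
    if cp <= 0xFFFF:
        return _lookup(BMP_FIRST, BMP_LAST, BMP_TAG, cp)
    return _lookup(SUP_FIRST, SUP_LAST, SUP_TAG, cp)
```

Since most ranges of interest lie in the BMP, most of the table can use the narrower element type.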

Attachment #193298 - Attachment is obsolete: true

Neil Harris

Reporter

Comment 9

•

19 years ago

Attached file: Cleanup of the above code. (obsolete)

This is the same code as above, with various infelicities cleaned up.

Attachment #193341 - Attachment is obsolete: true

Neil Harris

Reporter

Comment 10

•

19 years ago

Attached file: Update of the above, with better discrimination in 8859-1 and presentation forms ranges (obsolete)

Now with finer discrimination in some code ranges.

Attachment #193342 - Attachment is obsolete: true

Neil Harris

Reporter

Comment 11

•

19 years ago

Attached file: Slightly faster version of the above

This version gives a 10% - 15% speed gain over the previous one, at the cost of
a 3% increase in x86 code size (excluding libraries and symbols).

This is another tweak to the above for speed. It's not easy to get more speed
without spending more space, unless we add a fast path for ASCII (which
favours one script over the others) or use a last-matched-range heuristic (which
stops the code from being cleanly re-entrant). However, since the current speed
is around 11 Mchars/sec on a 2GHz P4, it's probably fast enough.
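
To illustrate the trade-off, here is a hedged Python sketch of a last-matched-range heuristic; the class name `CachingClassifier` and the table format are assumptions, and this is not the attached code. The cached range is mutable state that every lookup may read and overwrite. In a C version that kept this cache in a static variable, concurrent callers would share it, which is presumably the re-entrancy concern; here the state is at least confined to one object.

```python
from bisect import bisect_right


class CachingClassifier:
    """Range classifier that remembers the last range it matched."""

    def __init__(self, ranges):
        # ranges: iterable of non-overlapping (first, last, tag) tuples.
        self._ranges = sorted(ranges)
        self._starts = [first for first, _, _ in self._ranges]
        self._last = None  # mutable state: the most recently matched range

    def classify(self, cp):
        """Return the tag for code point cp, or 0 if no range contains it."""
        last = self._last
        # Fast path: runs of text usually stay within one script, so the
        # previous range is likely to match again.
        if last is not None and last[0] <= cp <= last[1]:
            return last[2]
        # Slow path: ordinary binary chop, then remember the match.
        i = bisect_right(self._starts, cp) - 1
        if i >= 0 and cp <= self._ranges[i][1]:
            self._last = self._ranges[i]
            return self._ranges[i][2]
        return 0
```

Unlike an ASCII fast path, this shortcut speeds up whichever script the text is currently in, at the cost of carrying state between calls.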

Attachment #193347 - Attachment is obsolete: true

Neil Harris

Reporter

Comment 12

•

19 years ago

Oops, that should have been a 10% increase in code size. Footprint is now 1.2k.

Mike Cowperthwaite

Comment 13

•

19 years ago

xref bug 234411 -- dupe?

Wayne Mery (:wsmwk)

Comment 14

•

17 years ago

you might want to comment in bug 234411

Status: UNCONFIRMED → RESOLVED

Closed: 17 years ago

Resolution: --- → DUPLICATE

Nobody; OK to take it and work on it

Assignee

Updated

•

16 years ago

Product: Core → MailNews Core


Categories

(MailNews Core :: Filters, enhancement)
