Closed
Bug 305309
Opened 19 years ago
Closed 17 years ago
junkmail filters should be more charset aware
Categories
(MailNews Core :: Filters, enhancement)
Tracking
(Not tracked)
RESOLVED
DUPLICATE
of bug 234411
People
(Reporter: usenet, Unassigned)
Details
Attachments
(2 files, 6 obsolete files)
User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.10) Gecko/20050716 Firefox/1.0.6
Build Identifier: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.10) Gecko/20050716 Firefox/1.0.6

I receive lots of Korean spam, yet I can't read Korean. No matter how much Korean text I mark as spam, it still gets through the filters. To fix this, the spam filters should be aware of Unicode script ranges.

Reproducible: Always

Steps to Reproduce:
1. Receive lots of Korean spam.
2. Mark it as spam in the mail tool.

Actual Results:
It still gets through.

Expected Results:
The mail filter should learn that I don't like text with a preponderance of characters in the Korean script Unicode code point range.

Whilst Bug 199478 has been marked WONTFIX, this appears to be far more reasonable to fix without bloat, as only a small table is needed to hold the major Unicode code point ranges.
Comment 1 • 19 years ago (Reporter)
Suggested algorithm: Bin the Unicode characters of the _rendered_ text (not the page source) into Unicode script ranges. The dominant script is the script-range bin with the most non-space characters in it. Then generate a pseudo-token for the Bayesian learning algorithm, a pseudo-header for the e-mail that identifies the dominant script, or both. Whilst this cannot distinguish between e-mails in different languages, it can at least distinguish between e-mails in different scripts. The Bayesian spam filter can then use it to categorize as spam any e-mails that are substantially in scripts the user considers to be spam.
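To make the suggestion concrete, here is a minimal Python sketch of the binning step. It is not the attached program: the range table is a small, deliberately incomplete illustration, and the names `SCRIPT_RANGES`, `dominant_script`, `script_token` and the `x-dominant-script:` token prefix are all made up for this example.

```python
from collections import Counter

# Illustrative, incomplete table of (first, last, script) code point ranges.
# A real table would cover far more of Unicode.
SCRIPT_RANGES = [
    (0x0041, 0x005A, "Latin"),
    (0x0061, 0x007A, "Latin"),
    (0x00C0, 0x00D6, "Latin"),
    (0x00D8, 0x00F6, "Latin"),
    (0x00F8, 0x024F, "Latin"),
    (0x0370, 0x03FF, "Greek"),
    (0x0400, 0x04FF, "Cyrillic"),
    (0x3040, 0x309F, "Hiragana"),
    (0x30A0, 0x30FF, "Katakana"),
    (0x4E00, 0x9FFF, "Han"),
    (0xAC00, 0xD7A3, "Hangul"),
]


def script_of(ch):
    """Return the script name for one character, or None if it is unbinned."""
    cp = ord(ch)
    for first, last, name in SCRIPT_RANGES:
        if first <= cp <= last:
            return name
    return None


def dominant_script(text):
    """Return the script with the most characters in text, or None."""
    counts = Counter()
    for ch in text:
        if ch.isspace():
            continue
        name = script_of(ch)
        if name is not None:  # digits, punctuation, symbols are ignored
            counts[name] += 1
    if not counts:
        return None
    return counts.most_common(1)[0][0]


def script_token(text):
    """Build a hypothetical pseudo-token for a Bayesian tokenizer."""
    return "x-dominant-script:" + (dominant_script(text) or "unknown")
```

The token would be fed to the filter alongside the ordinary word tokens, so the classifier itself needs no changes; it simply learns that the token correlates with junk for this user.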
Comment 2 • 19 years ago
You need to tell us what product and what version you use ;) (since you directly filed it in Core).
Comment 3 • 19 years ago (Reporter)
Product: Thunderbird 1.0.6 x86 version for Linux
Comment 4 • 19 years ago (Reporter)
This Python program shows how script ranges can be detected. Note that it makes no attempt to detect _languages_, only _scripts_, and it completely ignores punctuation, symbols, etc., unless they are specific to a given script. Since this is only a heuristic, and Unicode is intended to be a stable encoding, the table entries given should survive any new additions to Unicode.
Comment 5 • 19 years ago (Reporter)
Let's try that again, with a MIME type of text/plain. This is the same Python program as in the previous comment, with the same caveats: it detects only _scripts_, not _languages_, and it ignores punctuation, symbols, etc., unless they are specific to a given script.
Comment 6 • 19 years ago (Reporter)
This is a cleaned-up version of the earlier program, with a pre-compiled list, with consecutive ranges merged.
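For readers without the attachment, one plausible reading of "consecutive ranges merged" is that adjacent ranges carrying the same script are folded into one entry. The sketch below shows that idea only; `merge_ranges` is a hypothetical helper, not a function from the attached program, and it assumes the input ranges do not overlap.

```python
def merge_ranges(ranges):
    """Merge adjacent (first, last, script) ranges that share a script.

    Assumes the input ranges do not overlap. Returns a new sorted list.
    """
    merged = []
    for first, last, name in sorted(ranges):
        # Fold this range into the previous one when it starts exactly
        # where the previous one ended and names the same script.
        if merged and merged[-1][2] == name and merged[-1][1] + 1 == first:
            merged[-1] = (merged[-1][0], last, name)
        else:
            merged.append((first, last, name))
    return merged
```

Doing this once ahead of time and shipping the resulting list keeps the table small and avoids any start-up cost.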
Attachment #193284 - Attachment is obsolete: true
Attachment #193285 - Attachment is obsolete: true
Comment 7 • 19 years ago (Reporter)
This is a C version of the script classifier, with a binary chop search and proper handling of sub-ranges of fullwidth and halfwidth forms.
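The binary chop itself is easy to show without the C source. The Python sketch below (kept in Python for consistency with the earlier attachments) searches a sorted, non-overlapping range table with the standard `bisect` module. The table is illustrative only, and the fullwidth and halfwidth sub-range boundaries are approximate rather than copied from the attachment.

```python
from bisect import bisect_right

# Sorted, non-overlapping (first, last, script) ranges. Illustrative only.
RANGES = [
    (0x0041, 0x005A, "Latin"),
    (0x0061, 0x007A, "Latin"),
    (0x0400, 0x04FF, "Cyrillic"),
    (0xAC00, 0xD7A3, "Hangul"),
    # Sub-ranges of the Halfwidth and Fullwidth Forms block (approximate).
    (0xFF21, 0xFF3A, "Latin"),     # fullwidth A-Z
    (0xFF41, 0xFF5A, "Latin"),     # fullwidth a-z
    (0xFF66, 0xFF9F, "Katakana"),  # halfwidth Katakana
    (0xFFA0, 0xFFDC, "Hangul"),    # halfwidth Hangul
]
STARTS = [first for first, _last, _name in RANGES]


def classify(cp):
    """Return the script for code point cp, or None if no range holds it."""
    # Find the last range whose start is <= cp, then check its end.
    i = bisect_right(STARTS, cp) - 1
    if i >= 0 and cp <= RANGES[i][1]:
        return RANGES[i][2]
    return None
```

Each lookup costs O(log n) comparisons in the number of ranges, and treating the fullwidth and halfwidth block as several sub-ranges lets those characters count towards their real script rather than a single catch-all bin.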
Comment 8 • 19 years ago (Reporter)
This has been rewritten to pack the lookup tables more effectively, with two range tables, one for shorts and the other for longs. In addition, the string lookups are removed (the Bayesian code only needs unique tags and does not care about their meanings), and the binning table has been moved to the stack. The string lookup table is preserved in the source, but #ifdef'd out.
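As a rough illustration of the packing idea, and not a transcription of the C code, the Python sketch below keeps 16-bit ranges and wider ranges in separate typed arrays and returns small integer tags instead of names. The table contents, the tag numbering and the names `tag_of` and `bin_counts` are assumptions made for this example.

```python
from array import array
from bisect import bisect_right

# Ranges that fit in 16 bits go in the "short" table. Tag 0 means "unbinned".
SHORT_STARTS = array("H", [0x0041, 0x0061, 0x0400, 0xAC00])
SHORT_ENDS = array("H", [0x005A, 0x007A, 0x04FF, 0xD7A3])
SHORT_TAGS = array("B", [1, 1, 2, 3])

# Ranges above U+FFFF need wider entries, so they get their own table.
LONG_STARTS = array("L", [0x10330, 0x20000])
LONG_ENDS = array("L", [0x1034A, 0x2A6DF])
LONG_TAGS = array("B", [4, 5])

NUM_TAGS = 6  # tags 0..5 in this illustrative table


def tag_of(cp):
    """Return the integer tag for code point cp, or 0 if it is unbinned."""
    if cp <= 0xFFFF:
        starts, ends, tags = SHORT_STARTS, SHORT_ENDS, SHORT_TAGS
    else:
        starts, ends, tags = LONG_STARTS, LONG_ENDS, LONG_TAGS
    i = bisect_right(starts, cp) - 1
    if i >= 0 and cp <= ends[i]:
        return tags[i]
    return 0


def bin_counts(text):
    """Count characters per tag, using a table local to this call."""
    counts = [0] * NUM_TAGS  # local, so concurrent callers cannot interfere
    for ch in text:
        counts[tag_of(ord(ch))] += 1
    return counts
```

Splitting the table this way means most entries cost two bytes per bound instead of four, and keeping the counts local to the call plays the same role as moving the binning table to the stack.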
Attachment #193298 - Attachment is obsolete: true
Comment 9 • 19 years ago (Reporter)
This is the same code as above, with various infelicities cleaned up.
Attachment #193341 - Attachment is obsolete: true
Comment 10 • 19 years ago (Reporter)
Now with finer discrimination in some code ranges.
Attachment #193342 - Attachment is obsolete: true
Comment 11 • 19 years ago (Reporter)
This version gives a 10% to 15% speed gain over the previous one, at the cost of a 3% increase in x86 code size (excluding libraries and symbols). It is another tweak to the above for speed. It's not easy to get more speed out without spending more space, unless we make a fast path for ASCII (which favours one script over the others) or use a last-matched-range heuristic (which stops things from being cleanly re-entrant). However, since the current speed is around 11 Mchars/sec on a 2GHz P4, it's probably fast enough.
Attachment #193347 - Attachment is obsolete: true
Comment 12 • 19 years ago (Reporter)
Oops, that should have been a 10% increase in code size. Footprint is now 1.2k.
Comment 13 • 19 years ago
xref bug 234411 -- dupe?
Comment 14 • 17 years ago
you might want to comment in bug 234411
Status: UNCONFIRMED → RESOLVED
Closed: 17 years ago
Resolution: --- → DUPLICATE
Updated • 16 years ago (Assignee)
Product: Core → MailNews Core