User-Agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.10) Gecko/20050716 Firefox/1.0.6
Build Identifier: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.10) Gecko/20050716 Firefox/1.0.6

I receive lots of Korean spam, yet I can't read Korean. No matter how often I mark Korean text as spam, it still gets through the filters. To fix this, the spam filters should be aware of Unicode script ranges.

Reproducible: Always

Steps to Reproduce:
1. Receive lots of Korean spam.
2. Mark it as spam in the mail tool.

Actual Results:
It still gets through.

Expected Results:
The mail filter should learn that I don't want text with a preponderance of characters in the Korean-script Unicode code point range.

Whilst bug 199478 has been marked WONTFIX, this request appears far more reasonable to fix without bloat, as only a small table is needed to hold the major Unicode code point ranges.
Suggested algorithm: bin the Unicode characters of the _rendered_ text (not the page source) into Unicode script ranges. The dominant script is the script-range bin containing the most non-space characters. Then generate a pseudo-token for the Bayesian learning algorithm, a pseudo-header on the E-mail identifying the dominant script, or both. Whilst this cannot distinguish between E-mails in different languages, it can at least distinguish between E-mails in different scripts. The Bayesian spam filter can then use this to categorize E-mails written substantially in scripts the user considers spam, as spam.
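The suggested algorithm can be sketched as below. The range table here is a hypothetical, heavily abbreviated subset for illustration only; the real table would cover all the major Unicode blocks, as in the attached examples.

```python
from collections import Counter

# Hypothetical, abbreviated script-range table (inclusive code points).
# A real table would cover all major Unicode blocks.
SCRIPT_RANGES = [
    (0x0041, 0x024F, "latin"),
    (0x0400, 0x04FF, "cyrillic"),
    (0x0600, 0x06FF, "arabic"),
    (0x1100, 0x11FF, "hangul"),   # Hangul Jamo
    (0x3040, 0x30FF, "kana"),
    (0x4E00, 0x9FFF, "han"),
    (0xAC00, 0xD7AF, "hangul"),   # Hangul Syllables
]

def classify(ch):
    """Return the script-range name for a character, or None for
    punctuation, digits, symbols, and anything not in the table."""
    cp = ord(ch)
    for lo, hi, script in SCRIPT_RANGES:
        if lo <= cp <= hi:
            return script
    return None

def dominant_script(text):
    """Bin non-space characters by script range; return the biggest bin."""
    bins = Counter()
    for ch in text:
        if ch.isspace():
            continue
        script = classify(ch)
        if script:
            bins[script] += 1
    return bins.most_common(1)[0][0] if bins else None

def pseudo_token(text):
    """Pseudo-token to feed the Bayesian classifier alongside real tokens."""
    script = dominant_script(text)
    return "script:%s" % script if script else None
```

For example, `pseudo_token` applied to the rendered body of a Korean spam message would yield `"script:hangul"`, which the Bayesian filter would then weight like any other token.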
You need to tell us what product and what version you use ;) (since you directly filed it in Core).
Product: Thunderbird 1.0.6 x86 version for Linux
Created attachment 193284 [details] Python program as example of how script ranges can be detected This Python program shows how script ranges can be detected. Note that it makes no attempt to detect _languages_, only _scripts_, and it completely ignores punctuation, symbols, etc, unless they are specific to a given script. Since this is only an heuristic, and Unicode is intended to be a stable encoding, the table entries given should survive any new additions to Unicode.
Created attachment 193285 [details] Python program as example of how script ranges can be detected This Python program shows how script ranges can be detected. Note that it makes no attempt to detect _languages_, only _scripts_, and it completely ignores punctuation, symbols, etc, unless they are specific to a given script. Since this is only an heuristic, and Unicode is intended to be a stable encoding, the table entries given should survive any new additions to Unicode. Let's try that again, with a MIME type of text/plain.
Created attachment 193287 [details] Cleaned-up version of earlier Python algorithm example This is a cleaned-up version of the earlier program, with a pre-compiled list, with consecutive ranges merged.
Created attachment 193298 [details] Script classifier rewritten in C, with enhanced classification table This is a C version of the script classifier, with a binary chop search and proper handling of sub-ranges of fullwidth and halfwidth forms.
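The binary chop over a sorted range table mentioned above can be sketched as follows (in Python rather than the attachment's C, with hypothetical table entries): keep the ranges sorted by start point, bisect to the last range starting at or below the code point, then check its end point.

```python
import bisect

# Sorted, non-overlapping (start, end, tag) ranges; a tiny hypothetical
# subset of the full table in the attachment.
RANGES = [
    (0x0041, 0x024F, 1),   # Latin
    (0x0400, 0x04FF, 2),   # Cyrillic
    (0x3040, 0x30FF, 3),   # Kana
    (0xAC00, 0xD7AF, 4),   # Hangul Syllables
]
STARTS = [r[0] for r in RANGES]

def lookup(cp):
    """Binary chop: find the last range starting at or below cp,
    then confirm cp is within that range's end point."""
    i = bisect.bisect_right(STARTS, cp) - 1
    if i >= 0:
        lo, hi, tag = RANGES[i]
        if cp <= hi:
            return tag
    return 0  # not in any scripted range
```

The sub-ranges for fullwidth and halfwidth forms would appear as additional entries in the same sorted table.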
Created attachment 193341 [details] [diff] [review] Script classifier rewritten in C, with code/data memory footprint reduced to 1.1k This has been rewritten to pack the lookup tables more effectively, with two range tables, one for shorts and the other for longs. In addition, the string lookups have been removed (the Bayesian code only needs unique tags and does not care about meanings), and the binning table has been moved to the stack. The string lookup table is preserved in the source, but #ifdef'd out.
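The two-table packing idea might look roughly like this (a Python sketch of the data layout rather than the attachment's packed C arrays; the table entries are hypothetical): ranges whose bounds fit in 16 bits go in a compact "shorts" table, supplementary-plane ranges in a "longs" table, and the tags are bare integers since the Bayesian code only needs unique tokens.

```python
# Hypothetical split of the range table. In the C version the shorts
# table can store each bound in 16 bits, roughly halving its size.
SHORT_RANGES = [            # bounds fit in 16 bits (Basic Multilingual Plane)
    (0x0041, 0x024F, 1),    # Latin
    (0xAC00, 0xD7AF, 2),    # Hangul Syllables
]
LONG_RANGES = [             # bounds need more than 16 bits
    (0x20000, 0x2A6DF, 3),  # CJK Unified Ideographs Extension B
]

def lookup(cp):
    """Pick the table by code point width, then scan its ranges.
    Returns 0 for code points outside every scripted range."""
    table = SHORT_RANGES if cp <= 0xFFFF else LONG_RANGES
    for lo, hi, tag in table:
        if lo <= cp <= hi:
            return tag
    return 0
```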
Created attachment 193342 [details] Cleanup of the above code. This is the same code as above, with various infelicities cleaned up.
Created attachment 193347 [details] Update of the above, with better discrimination in 8859-1 and presentation forms ranges Now with finer discrimination in some code ranges.
Created attachment 193386 [details] Slightly faster version of the above This version gives a 10%-15% speed gain over the previous one, at the cost of a 3% increase in x86 code size (excluding libraries and symbols). This is another tweak to the above for speed; it's not easy to get more speed without spending more space, unless we add a fast path for ASCII (which favours one script over others) or use a last-matched-range heuristic (which stops things from being cleanly re-entrant). However, since the current speed is around 11 Mchars/sec on a 2 GHz P4, it's probably fast enough.
Oops, that should have been a 10% increase in code size. Footprint is now 1.2k.
xref bug 234411 -- dupe?
you might want to comment in bug 234411