Closed Bug 277354 Opened 20 years ago Closed 20 years ago

Japanese tokenizer for Bayesian spam filter

Component: Thunderbird :: General (defect, normal)
Status: RESOLVED FIXED
Target Milestone: Thunderbird 1.1
Reporter: norinoue; Assigned: mscott
Attachments: 2 files (3 obsolete)

I introduce a Japanese-specific tokenizer into the Bayesian spam filter.
It should improve the performance of classifying Japanese mail, I think.

...and I have an idea:
a language-specific prefix for each word is needed,
because the same word in different languages is not the same word.
The above patch adds "japanese_token" before each Japanese kanji word to avoid
conflicts with identical Chinese kanji words.
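As a rough sketch of the prefixing idea (the helper name and the ":" delimiter here are illustrative assumptions, not the patch's actual code):

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: prefix a Japanese word so it occupies a separate
// slot in the token table from an identical Chinese kanji word.
std::string makeJapaneseToken(const std::string& word) {
    return "japanese_token:" + word;
}
```

With a prefix like this, training on a kanji word seen in Japanese mail never shifts the statistics of the same character sequence seen in Chinese mail.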
Japanese tokenizer patch for bayesian-spam-filter.
thanks for looking into this.

Can you build yourself a test scenario to measure the before and after
performance on spam detection for Japanese messages?

i.e. make a training set, then a set of messages to test the filters against.
Start with an empty training.dat file. Use a build without the change, train it,
then run it on the test messages and calculate the percentage caught and the
percentage of false positives (msgs incorrectly identified as spam). Repeat the
exercise on a build with the patch....
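The two metrics in this procedure can be computed as below (a minimal sketch with made-up names, not Mozilla code):

```cpp
#include <cassert>

// Evaluation counters for one filter run over a hand-labeled test set.
struct FilterStats {
    int spamTotal;    // known spam messages in the test set
    int spamCaught;   // spam correctly flagged as junk
    int hamTotal;     // known good messages in the test set
    int hamFlagged;   // good messages wrongly flagged (false positives)
};

// Percentage of spam caught by the filter.
double catchRate(const FilterStats& s) {
    return 100.0 * s.spamCaught / s.spamTotal;
}

// Percentage of good mail incorrectly identified as spam.
double falsePositiveRate(const FilterStats& s) {
    return 100.0 * s.hamFlagged / s.hamTotal;
}
```

Run this once against a build without the patch and once against a build with it, and compare the two pairs of numbers.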
Thank you for your advice.

I measured the performance as you suggested.

First, I collected a set of mail I have recently received and classified it
into spam and ham by hand, then trained the junk mail filter with this test set.
Then I ran the filter on the same set. The set consists entirely of Japanese mails.

The results are as follows.

Before patch:
                total   success   ratio
ham   news        582       582   100.0%
      private      49        49   100.0%
      total       631       631   100.0%

spam  news         53        52    98.1%
      others       93        30    32.3%
      total       146        82    56.2%

Number of ham tokens: 74835
Number of spam tokens: 9374


After patch:
                total   success   ratio
ham   news        582       581    99.8%
      private      49        49   100.0%
      total       631       630    99.8%

spam  news         53        53   100.0%
      others       93        80    86.0%
      total       146       133    91.1%

Number of ham tokens: 95020
Number of spam tokens: 10149

'news' in ham is mail magazines I subscribe to, and 'private' is personal mail
to me. 'news' in spam is advertisement mail I receive in exchange for a free
web-site account. 'others' is so-called spam.

The ratio of caught spam increased after the patch. The one mail misclassified
as spam after the patch is a questionnaire from the free web-site host;
strictly speaking, it is not addressed to me.

The file size of training.dat after the patch is 1.5 times as large as before,
and processing takes a bit longer.
I just ran this patch through my junk mail regression tests (which consist of
only ascii junk and non-junk messages) and unfortunately this patch regresses
the performance of the filter on ascii messages.

Before the patch:
195 false positives out of 1904 test messages (89%)

After the patch:
244 false positives out of 1904 test messages  (87%)

A false negative is a piece of spam that was not caught by the filter. It's
strange that it would generate different numbers.
Status: UNCONFIRMED → ASSIGNED
Ever confirmed: true
> A false negative is a piece of spam that was not caught by the filter. It's

Do you mean false positive?

Since the process for Japanese messages is supposed to run after the process
for ascii messages, it is strange.
But it may be caused following reason.

The patch only picks up hiragana word and katakana word from japanese chunk.
So a Latin word joined to Japanese characters (to bypass the spam filter?)
and classified as part of a Japanese chunk is neglected.

I will modify this point and test.


And I found a bug:
I18N sematic unit scanner may pick up numeral tokens.
(In reply to comment #5)
Sorry, I made several mistakes.

> But it may be caused following reason. 

-> it may be caused by the following reason.

> The patch only picks up hiragana word and katakana word from japanese chunk.

-> picks up kanji(ideograph) word and katakana(mostly phonetic for foreign) word

> I18N sematic unit scanner may pick up numeral tokens.

-> I18N semantic unit scanner
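For reference, the character classes discussed above roughly correspond to these Unicode block ranges (an illustrative approximation; the actual scanner in the patch may differ):

```cpp
#include <cassert>

// Approximate Unicode block ranges for Japanese script classes.
bool isHiragana(char32_t c) { return c >= 0x3040 && c <= 0x309F; }
bool isKatakana(char32_t c) { return c >= 0x30A0 && c <= 0x30FF; }
bool isKanji(char32_t c)    { return c >= 0x4E00 && c <= 0x9FFF; } // CJK Unified Ideographs
```

Kanji and katakana carry most of the content words in Japanese text while hiragana is largely grammatical, which may be why the patch tokenizes those two classes.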
I understand now.

I was confused because of the large number of false positives; false positives
are quite rare in my mailbox. The performance on non-junk messages is no lower
than 99% in my tests.
Attachment #170516 - Attachment is obsolete: true
I tested the new patch with Japanese messages consisting of 1108 non-junks and
96 junks. The non-junks include private mail, mailing lists and ad-mails from
online shops; the junks are so-called spam. I trained a junk filter with them
and ran it recursively.

Before patch: 0 false positives (100%) and 41 false negatives (57%).
After patch: 0 false positives (100%) and 3 false negatives (97%).

I also tested with ascii messages consisting of 22 non-junks and 1000 junks.
This produced no false positives and no false negatives both before and after
the patch.
thanks for the continued work. I haven't had a chance to test this new patch
yet, but I did see one thing in the patch:

         if (isASCII(word))
             tokenize_ascii_word(word);
+        if (isJapanese(word))
+            tokenize_japanese_words(word);

should that be:

    if (isAscii(word))
        tokenize_ascii_word(word);
    ELSE
        if (isJapanese(word))
            tokenize_japanese_words(word);

if it's ascii we should never call tokenize japanese word, right?
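The point of the else can be shown with a small standalone sketch (illustrative names, assuming a byte-level ASCII test like the one the filter uses):

```cpp
#include <cassert>
#include <string>

enum class Script { Ascii, Japanese, Other };

// A word is "ascii" only if every byte is 7-bit.
bool isAsciiWord(const std::string& w) {
    for (unsigned char c : w)
        if (c > 0x7F) return false;
    return true;
}

// Dispatch mirroring the suggested if/else: an ascii word must never
// reach the Japanese tokenizer, even if the chunk looks Japanese.
Script classifyWord(const std::string& w, bool looksJapanese) {
    if (isAsciiWord(w))
        return Script::Ascii;
    else if (looksJapanese)
        return Script::Japanese;
    return Script::Other;
}
```

Without the else, an ascii word inside a Japanese chunk would be tokenized twice, once by each path.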
(In reply to comment #10)

> ELSE
>   if (isJapanese(word))

That's right...
I intended that but forgot.
I think recording timestamps would be a nice addition to the filter system.

Training a filter over a long period inflates training.dat. Recording
first-studied, last-studied and last-referred times with each token is needed
for garbage collection.
This extension adds a 'Show training.dat' command to the Tools menu.

It is useful for tuning the tokenizer, and it is also useful for end users in
terms of unveiling the black box of the junk filter.
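A minimal sketch of the garbage-collection idea (hypothetical fields and names; training.dat does not currently store timestamps):

```cpp
#include <cassert>
#include <ctime>
#include <map>
#include <string>

// Hypothetical per-token record extended with the three proposed timestamps.
struct TokenRecord {
    unsigned hamCount = 0;
    unsigned spamCount = 0;
    std::time_t firstStudied = 0;
    std::time_t lastStudied = 0;
    std::time_t lastReferred = 0;
};

// Drop tokens not referred to within maxAge seconds, shrinking the table.
void expireStaleTokens(std::map<std::string, TokenRecord>& table,
                       std::time_t now, std::time_t maxAge) {
    for (auto it = table.begin(); it != table.end();) {
        if (now - it->second.lastReferred > maxAge)
            it = table.erase(it);
        else
            ++it;
    }
}
```

Pruning on lastReferred keeps tokens that still influence classification while discarding ones the filter has stopped seeing.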
Adds the 'else' and optimizes for the critical path.
Attachment #171142 - Attachment is obsolete: true
ok great, this latest patch now passes my regression tests for ascii mail.

Thanks for the extra work.

I'll go ahead and finish reviewing this and will then drive this into the tree
for you.

On a related note, it seems like some of these JA detection methods would be
better off in the i18n module instead of in the junk mail code. Could you file
a new bug under the i18n component pointing back to the routines in this bug?
Make sure jshin is cc'ed on it too in case he has some ideas on where he'd
like those methods to live.

Thanks again for your work on this. This is great news for our Japanese users!
One other "just for the record" observation: This patch shouldn't impact the
performance of the filter for ascii mail because we only execute any of this new
code if our isWordAscii test fails...
Blocks: 278483
I am pleased that this patch will be included in the tree.

I have opened new Bug 278483, but I have no idea whether it is filed correctly.
Could you correct it if necessary?
I note that this patch considerably increases the performance of classifying
Japanese mails, but it is not the final form. Tokenizing methods for Japanese
filtering are still an active research topic, so I want to keep the code in a
place that is easy to modify for a while.
I changed some of the string class operations from the previous patch, in
addition to some whitespace changes.
Attachment #171265 - Attachment is obsolete: true
I just checked this into the trunk. Thanks again for the work. 
Status: ASSIGNED → RESOLVED
Closed: 20 years ago
Resolution: --- → FIXED
Target Milestone: --- → Thunderbird1.1