Allow specification of tokenizing delimiters in bayes filter

RESOLVED FIXED in Thunderbird 3.0b3

Status

MailNews Core
Filters
--
enhancement
RESOLVED FIXED
9 years ago
9 years ago

People

(Reporter: rkent, Assigned: rkent)

Tracking

Trunk
Thunderbird 3.0b3
Bug Flags:
in-testsuite +

Firefox Tracking Flags

(Not tracked)

Details

Attachments

(1 attachment, 1 obsolete attachment)

(Assignee)

Description

9 years ago
The bayes filter code currently has a hard wired list of tokenizing characters:

static const char* kBayesianFilterTokenDelimiters = " \t\n\r\f.";

I would like to allow this to be set by a preference. This bug will be about the general support for this. The general support could be useful in specific locales or for testing, but the main target is to allow setting of specific delimiter lists for specific headers, which will be covered in a follow-on bug. I would like to be able, for example, to parse out the spam assassin rule hits in a specific header, without having to apply that delimiter list to the entire bayes calculation. In headers, "." is probably best not to be a delimiter, so that we can pick up full URLs and IP addresses, while "," is a useful delimiter as it often separates information fields from upstream providers.

I think the default support, in this bug, will be for one delimiter string for headers, and a separate delimiter for body tokenization.
(Assignee)

Comment 1

9 years ago
Created attachment 365292 [details] [diff] [review]
Add custom tokenization preferences

This patch adds some control of bayes tokenization using preferences.

The original motivation of this was to make sure that headers with complex information, such as the x-spam-status header added by SpamAssassin, could be used to best advantage. Unfortunately it is very difficult to test for that. I finally convinced myself that adding subtokenization of the x-spam-status header reduces false negatives by about 10% for my email, but the tests are subtle and subject to interpretation.

But I really think the main motivation for this is going to come when I enable bayes processing for RSS and news, as well as using bayes for non-junk traits. The existing tokenization is focused on spam in email, but as the bayes gets used for other things, we need a mechanism to test variations of tokenization. This should help to provide that. There may also be applications for languages that do not use the same punctuation as English (though the restriction to CStrings may limit that).
Attachment #365292 - Flags: superreview?(bienvenu)
Attachment #365292 - Flags: review?(bugzilla)
(Assignee)

Updated

9 years ago
OS: Windows XP → All
Hardware: x86 → All
Whiteboard: [needs r standard8, sr bienvenu]
Comment on attachment 365292 [details] [diff] [review]
Add custom tokenization preferences

>+  nsresult rv;
>+  nsCOMPtr<nsIPrefService> prefs = do_GetService(NS_PREFSERVICE_CONTRACTID, &rv);
>+  if (NS_FAILED(rv))
>+  {
>+    NS_ERROR("failed accessing preferences service");
>+    return;
>+  }

You can use NS_ENSURE_SUCCESS(rv, ) here, which will warn if it fails.

>+  nsCOMPtr<nsIPrefBranch> prefBranch;
>+  rv = prefs->GetBranch("mailnews.bayesian_spam_filter.", getter_AddRefs(prefBranch));
>+  if (NS_FAILED(rv)) // no branch defined, just use defaults
>+    return;

Why not just initialise mBodyDelimiters and mHeaderDelimiters to kBayesianFilterTokenDelimiters at construction time, then if there's no prefs change, you can still use mBodyDelimiters etc without having to check for null and fall-back etc?

>+  // get customized maximum token length
>+  if (prefBranch)
>+    rv = prefBranch->GetIntPref("maxlengthfortoken", &mMaxLengthForToken);

I don't think you need to check for (prefBranch) here - the NS_FAILED(rv) should have picked it up earlier, and besides, you've done two calls to it without the check.

>-      while ((word = NS_strtok(kBayesianFilterTokenDelimiters, &next)) != NULL)
>+      const char* delimiters =  !aDelimiters ?
>+          kBayesianFilterTokenDelimiters : aDelimiters;

nit: no need for double space before ! (if this section remains).

>-        if (word[0] == '\0') continue;
>+        if (strlen(word) < kMinLengthForToken) continue;

nit: I'd prefer the continue on the next line please.

>+      for (PRUint32 i = 0; i < mDisabledHeaders.Length(); i++)
>+        if (headerName.Equals(mDisabledHeaders[i]))
>+        {
>+          headerProcessed = PR_TRUE;
>+          break;
>+        }

nit: please enclose the if statement in brackets as well.

r=me with those comments fixed.
Attachment #365292 - Flags: superreview?(bienvenu)
Attachment #365292 - Flags: review?(bugzilla)
Attachment #365292 - Flags: review+
Whiteboard: [needs r standard8, sr bienvenu] → [needs sr bienvenu]
Attachment #365292 - Flags: review+ → review-
Comment on attachment 365292 [details] [diff] [review]
Add custom tokenization preferences

Opps, I knew I submitted  that too early, I've just run the test and I'm getting a failure:

TEST-UNEXPECTED-FAIL | ../../../../mozilla/_tests/xpcshell/test_bayes/unit/test_customTokenization.js | test failed, see log
../../../../mozilla/_tests/xpcshell/test_bayes/unit/test_customTokenization.js.log:
>>>>>>>
*** Registering components in: nsMailModule
WARNING: NS_ENSURE_SUCCESS(rv, rv) failed with result 0x80040111: file /Users/moztest/comm/main/src/mozilla/js/src/xpconnect/loader/mozJSComponentLoader.cpp, line 1544
JS Component Loader: ERROR (null):0
                     uncaught exception: [Exception... "Component returned failure code: 0x80040111 (NS_ERROR_NOT_AVAILABLE) [nsIXPCComponents_Utils.import]"  nsresult: "0x80040111 (NS_ERROR_NOT_AVAILABLE)"  location: "JS frame :: file:///Users/moztest/comm/main/tb/mozilla/dist/bin/components/nsActivity.js :: <TOP_LEVEL> :: line 40"  data: no]
WARNING: NS_ENSURE_SUCCESS(rv, rv) failed with result 0x80040111: file /Users/moztest/comm/main/src/mozilla/js/src/xpconnect/loader/mozJSComponentLoader.cpp, line 1544
JS Component Loader: ERROR (null):0
                     uncaught exception: [Exception... "Component returned failure code: 0x80040111 (NS_ERROR_NOT_AVAILABLE) [nsIXPCComponents_Utils.import]"  nsresult: "0x80040111 (NS_ERROR_NOT_AVAILABLE)"  location: "JS frame :: file:///Users/moztest/comm/main/tb/mozilla/dist/bin/components/nsActivityManager.js :: <TOP_LEVEL> :: line 41"  data: no]
*** registering nsLDAPProtocolHandler.js: [ LDAP Protocol Handler, LDAP Protocol Handler ]
*** test pending
Directory request for: MailD that we (mailDirService.js) are not handling, leaving it to another handler.
Directory request for: MFCaF that we (mailDirService.js) are not handling, leaving it to another handler.
Directory request for: DefRt that we (mailDirService.js) are not handling, leaving it to another handler.
*** test pending
WARNING: NS_ENSURE_SUCCESS(rv, rv) failed with result 0x80550006: file /Users/moztest/comm/main/src/mailnews/base/util/nsMsgDBFolder.cpp, line 4647
WARNING: NS_ENSURE_SUCCESS(rv, rv) failed with result 0x80550006: file /Users/moztest/comm/main/src/mailnews/local/src/nsMailboxService.cpp, line 673
*** test finished
*** running event loop
WARNING: NS_ENSURE_SUCCESS(rv, rv) failed with result 0x80550006: file /Users/moztest/comm/main/src/mailnews/base/util/nsMsgDBFolder.cpp, line 4647
WARNING: NS_ENSURE_SUCCESS(rv, rv) failed with result 0x80550006: file /Users/moztest/comm/main/src/mailnews/local/src/nsMailboxService.cpp, line 673
WARNING: NS_ENSURE_SUCCESS(rv, rv) failed with result 0x80550006: file /Users/moztest/comm/main/src/mailnews/base/util/nsMsgDBFolder.cpp, line 4647
WARNING: NS_ENSURE_SUCCESS(rv, rv) failed with result 0x80550006: file /Users/moztest/comm/main/src/mailnews/local/src/nsMailboxService.cpp, line 673
WARNING: NS_ENSURE_SUCCESS(rv, rv) failed with result 0x80550006: file /Users/moztest/comm/main/src/mailnews/base/util/nsMsgDBFolder.cpp, line 4647
WARNING: NS_ENSURE_SUCCESS(rv, rv) failed with result 0x80550006: file /Users/moztest/comm/main/src/mailnews/local/src/nsMailboxService.cpp, line 673
*** exiting
*** TEST-UNEXPECTED-FAIL | ../../../../mozilla/_tests/xpcshell/test_bayes/unit/test_customTokenization.js | false == true
JS frame :: /Users/moztest/comm/main/src/mozilla/testing/xpcshell/head.js :: do_throw :: line 101
JS frame :: /Users/moztest/comm/main/src/mozilla/testing/xpcshell/head.js :: do_check_eq :: line 120
JS frame :: /Users/moztest/comm/main/src/mozilla/testing/xpcshell/head.js :: do_check_true :: line 126
JS frame :: ../../../../mozilla/_tests/xpcshell/test_bayes/unit/test_customTokenization.js :: anonymous :: line 144
*** FAIL ***
WARNING: nsExceptionService ignoring thread destruction after shutdown: file /Users/moztest/comm/main/src/mozilla/xpcom/base/nsExceptionService.cpp, line 194
WARNING: NS_ENSURE_SUCCESS(rv, rv) failed with result 0x8000FFFF: file /Users/moztest/comm/main/src/mailnews/extensions/bayesian-spam-filter/src/nsBayesianFilter.cpp, line 2275
recorder: started(6), aborted(6), completed(0), different header(0), trees trashed(0), slot promoted(0), unstable loop variable(0), breaks(0), returns(0), unstableInnerCalls(0)
monitor: triggered(0), exits(0), type mismatch(0), global mismatch(0)
nsStringStats
 => mAllocCount:           3588
 => mReallocCount:          140
 => mFreeCount:            3587  --  LEAKED 1 !!!
 => mShareCount:            283
 => mAdoptCount:            116
 => mAdoptFreeCount:        116

<<<<<<<
(Assignee)

Comment 4

9 years ago
I ran this again, and I pass the unit tests on my system. Without my dumps, it's hard to see where this is failing in your run. Would you mind running the unit test again, but uncomment the printouts here:

  onMessageTraitDetails: function(aMsgURI, aProTrait, {}, aTokenString,
                                  aTokenPercents, aRunningPercents)
  {
    //print("Details for " + aMsgURI);
    //for (var i = 0; i < aTokenString.length; i++)
    //  print(" Token " + aTokenString[i]);

    // we should have these tokens
    for each (var value in gTest.tokens)
    {
      //dump("\nDo we have '" + value + "'? ");
      do_check_true(aTokenString.indexOf(value) >= 0);
    }
Ok, here is the extra dump output:

Details for file:///Users/moztest/comm/main/src/mailnews/extensions/bayesian-spam-filter/test/resources/tokenTest.eml?type=application/x-message-display
 Token mime-version:1.0
 Token out!
 Token date:30/4/08 08:12
 Token important
 Token subject:eat
 Token http://www
 Token url
 Token example
 Token this
 Token from:mom <mother@example.com>
 Token subject:vegetables
 Token subject:live
 Token check
 Token org
 Token subject:long
 Token sentence
 Token subject:your
 Token to:careful reader <reader@example.org>

Do we have 'important'? 
Do we have 'subject:eat'? 
Do we have 'date:4/30/2008 12:12 am'? *** exiting
*** TEST-UNEXPECTED-FAIL | ../../../../mozilla/_tests/xpcshell/test_bayes/unit/test_customTokenization.js | false == true
JS frame :: /Users/moztest/comm/main/src/mozilla/testing/xpcshell/head.js :: do_throw :: line 101
JS frame :: /Users/moztest/comm/main/src/mozilla/testing/xpcshell/head.js :: do_check_eq :: line 120
JS frame :: /Users/moztest/comm/main/src/mozilla/testing/xpcshell/head.js :: do_check_true :: line 126
JS frame :: ../../../../mozilla/_tests/xpcshell/test_bayes/unit/test_customTokenization.js :: anonymous :: line 144
*** FAIL ***
(Assignee)

Comment 6

9 years ago
Thanks. Then it is just a date format problem, which is unimportant to this bug. I'll just replace the date check with something else.

Is it OK if I leave the dumps in the final version of the test? They are helpful when things go wrong, and they only show up in the logs.
(In reply to comment #6)
> Is it OK if I leave the dumps in the final version of the test? They are
> helpful when things go wrong, and they only show up in the logs.

I think as long as they aren't excessive and help point at actual problems if a test fails, leaving dumps in test cases is fine.
(Assignee)

Comment 8

9 years ago
Created attachment 366989 [details] [diff] [review]
fixed failed Date: header in test, other Standard8 issues

I replaced the data header, which varied with localization, with a message-id header in the tests. Fixed Standard8's issues, with these additional comments:

"Why not just initialise mBodyDelimiters and mHeaderDelimiters to
kBayesianFilterTokenDelimiters at construction time, then if there's no prefs
change, you can still use mBodyDelimiters etc without having to check for null
and fall-back etc?"

I ended up doing this (and added mMaxLengthForToken as well), but it did not help eliminate any null tests. In getting preferences, the preference service blanks the result if it does not exist. In the use in addTokenForHeader, I had to use a null as the default value for the delimiters to make the many calls using default parameters to work. But setting the member variables at constructor time eliminated a possible error if the constructor exited early.
Attachment #365292 - Attachment is obsolete: true
Attachment #366989 - Flags: superreview?(bienvenu)
Attachment #366989 - Flags: review?(bugzilla)
Comment on attachment 366989 [details] [diff] [review]
fixed failed Date: header in test, other Standard8 issues

That's better, thanks. r=Standard8
Attachment #366989 - Flags: review?(bugzilla) → review+

Updated

9 years ago
Attachment #366989 - Flags: superreview?(bienvenu) → superreview+

Updated

9 years ago
Whiteboard: [needs sr bienvenu]
(Assignee)

Updated

9 years ago
Keywords: checkin-needed
Whiteboard: [needs checkin]
Checked in: http://hg.mozilla.org/comm-central/rev/af0dd9254aea
Status: ASSIGNED → RESOLVED
Last Resolved: 9 years ago
Flags: in-testsuite+
Keywords: checkin-needed
Resolution: --- → FIXED
Whiteboard: [needs checkin]
You need to log in before you can comment on or make changes to this bug.