Closed Bug 554033: opened 14 years ago, closed 13 years ago

bump the gloda fts3 tokenizer minimum token length from 2 to 3

Categories: MailNews Core :: Database, defect
Severity: normal
Tracking: (Not tracked)
Status: RESOLVED FIXED
Target Milestone: Thunderbird 10.0
People: (Reporter: asuth, Assigned: protz)
Details: (Keywords: perf, Whiteboard: [gloda key][tokenizer key])
Attachments: (1 file, 2 obsolete files)

Our tokenizer does not currently have any stopword support, so we index extremely common words like "the", "in", "of", etc.  Yes, that's right: we also currently emit 2-letter tokens for non-CJK text.

Things are complicated by the fact that the tokenizer will be presented with multiple languages but has no idea what language it is looking at.  The user's locale could provide some insight into the probability of certain words being stopwords.

Mitigation could fall into two cases:
1) Full stopword consumption.  The term does not get emitted at all.
2) Escalation to bi-gram.  We don't emit the term on its own, but we do emit the term together with what follows it (see the sketch after this list).  For example, "mit" is a German stopword but is also the commonly used acronym for a college (let's ignore for now that it would usually be upper-cased in that context).  Assuming "mit" gets the bi-gram escalation flag and we see the phrase "mit campus", we would literally emit "mit campus" as our token.  (We would also emit "campus" separately.)
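To make the escalation behavior concrete, here is a minimal JS sketch; the stopword table, its escalation flags, and the emit callback are all illustrative names, not actual gloda code:

```js
// Hypothetical stopword table: plain stopwords are consumed, escalating
// ones are folded into a bi-gram with the following word.
const STOPWORDS = new Map([
  ["the", { escalate: false }], // case 1: consumed outright
  ["mit", { escalate: true }],  // case 2: German stopword, also the acronym
]);

function tokenize(words, emit) {
  for (let i = 0; i < words.length; i++) {
    const word = words[i].toLowerCase();
    const stop = STOPWORDS.get(word);
    if (!stop) {
      emit(word); // ordinary term: emit as usual
    } else if (stop.escalate && i + 1 < words.length) {
      // don't emit "mit" alone, but do emit "mit campus" as one token
      emit(word + " " + words[i + 1].toLowerCase());
    }
    // non-escalating stopwords are swallowed entirely
  }
}

// tokenize(["mit", "campus"], t => console.log(t))
// logs "mit campus", then "campus" (emitted separately on its own turn)
```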

The other major complication is that introducing stop-words complicates our query building somewhat.  As in bug 549594, if the tokenizer is eating tokens, the boolean logic we use may end up (vacuously) false because of a gobbled token.  This demands that we expose the tokenizer to XPCOM in some manner.

(Note: bi-gram escalation would likely also require some explicit support in the XPCOM exposure.  For example, we would want to know that "mit" is an escalating stop-word so that we could include permutations of it and the other terms.  As per the 'MIT campus' example above, we might want 'campus MIT' to also get results, which would require that parameterization; see the sketch below.)
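A rough sketch of what the query-building side might look like once the tokenizer is exposed.  tokenizeTerm() stands in for whatever XPCOM surface we end up with, and the permutation handling for escalating stop-words ('mit campus' OR 'campus mit') would layer on top of this:

```js
// Build an FTS boolean query, dropping terms the tokenizer would eat so
// they can't force the whole AND chain to be vacuously false.
function buildFulltextQuery(searchTerms) {
  const clauses = [];
  for (const term of searchTerms) {
    // ask the (exposed) tokenizer what it would actually emit for this term
    const tokens = tokenizeTerm(term); // [] if the tokenizer eats the term
    if (tokens.length === 0) {
      continue; // drop it instead of ANDing against an always-empty match
    }
    clauses.push(tokens.join(" "));
  }
  return clauses.join(" AND ");
}
```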

Any improvement to this problem is better than none, so even if we can't address the multilingual case out of the gate, removing ridiculously common English stop-words as well as eliminating 2-character non-CJK tokens (to help out other languages too) is probably a great way to go.
Keywords: perf
Taking this. I'm going to:
- not index words that are two characters long, unless they are CJK (see the sketch after this list),
- make sure that when searching for a sentence and breaking it into words, we don't intersect with the empty results from a two-character word,
- write a test for that.
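The length gate amounts to something like this, hedged as a JS sketch of the logic only (the actual change lives in the C tokenizer, and isCJK() is an assumed helper):

```js
const MIN_TOKEN_LENGTH = 3;

function shouldEmit(token) {
  // CJK text is deliberately segmented into very short tokens, so the
  // minimum length only applies to non-CJK tokens
  if (isCJK(token.charCodeAt(0))) {
    return true;
  }
  return token.length >= MIN_TOKEN_LENGTH;
}
```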

This is the cheap solution until we properly expose the tokenizer through XPCOM.
Assignee: nobody → jonathan.protzenko
Status: NEW → ASSIGNED
Attached patch Patch v1 (obsolete) — Splinter Review
First patch. This is very minimal, but I had to find the right places to patch. There are no tests yet, and I'm not sure I'm doing the right thing. Andrew, I'm going to wait until you're back from vacation; then we can see whether this is going in the right direction, and how we could possibly test it.

This gave a 3% size improvement on global-messages-db.sqlite in my production profile.
Attachment #553284 - Flags: feedback?(bugmail)
Attached patch Patch v2 (obsolete) — Splinter Review
Forgot to qrefresh, as usual.
Attachment #553284 - Attachment is obsolete: true
Attachment #553285 - Flags: feedback?(bugmail)
Attachment #553284 - Flags: feedback?(bugmail)
Comment on attachment 553285 [details] [diff] [review]
Patch v2

This definitely looks like the right direction.

Test-wise, it looks like we currently exercise the stemmer intentionally in:
- test_intl.js: this does the tokenizer fun, but is not actually testing the non-CJK stuff so much.

We also sort of exercise it for scoring purposes in:
- test_query_core.js: through its scoring tests, though these are very limited
- test_msg_search.js: mainly ranking.

I would suggest we create a new file, test_fts3_tokenizer.js, along the lines of what we do in test_intl/test_msg_search (so cramming stuff into messages), that tests that:
- two-letter tokens like "xx" get nuked
- three-letter tokens like "foo" are still present

We will likely need explicit tests that:
- make sure msg_searcher is eliminating the "xx" search terms from the query it generates
- make sure "xx" was never emitted as a token, either by manually creating a SQL statement (it can be very simple, just a COUNT against the fulltext table; see the sketch below) or by monkey-patching the msg_search implementation so that it actually does generate the query for "xx".  I think I would like us to run a manual SQL statement at least once, but it's fine to use monkey-patching if we end up running a list of things through.
Attachment #553285 - Flags: feedback?(bugmail) → feedback+
Attached patch Patch v3Splinter Review
Test added. It's a pretty exhaustive test...
Attachment #553285 - Attachment is obsolete: true
Attachment #555518 - Flags: review?(bugmail)
Comment on attachment 555518 [details] [diff] [review]
Patch v3

Yep, these look like the tests I asked for!  Awesome and thanks!

When this bug gets marked fixed, unless you're planning to actually implement stop-words, we'll want to:
- Clone off a copy of this bug with the current name and my thoughts on stopwords intact.
- Rename this bug so that it conveys we're upping the minimum number of characters to 3 rather than doing anything with stop-words.
Attachment #555518 - Flags: review?(bugmail) → review+
Summary: gloda fts3 tokenizer would greatly benefit from stopword support → bump the gloda fts3 tokenizer minimum token length from 2 to 3
http://hg.mozilla.org/comm-central/rev/d7ed2d9fa348
Status: ASSIGNED → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Target Milestone: --- → Thunderbird 10.0