Closed Bug 554033: opened 14 years ago, closed 13 years ago

bump the gloda fts3 tokenizer minimum token length from 2 to 3

Categories: MailNews Core :: Database, defect
Severity: normal
Tracking: (Not tracked)
Status: RESOLVED FIXED
Target Milestone: Thunderbird 10.0
People: (Reporter: asuth, Assigned: protz)
Details: (Keywords: perf, Whiteboard: [gloda key][tokenizer key])
Attachments: (1 file, 2 obsolete files)

Our tokenizer does not currently have any stopword support, so we index extremely common words like "the", "in", "of", etc.  Yes, that's right: we also currently emit 2-letter tokens for non-CJK text.

Things are complicated by the fact that the tokenizer will be presented with multiple languages but has no idea what language it is looking at.  The user's locale could provide some insight into the probability of certain words being stopwords.

Mitigation could fall into two cases:
1) Full stopword consumption.  The term does not get emitted at all.
2) Escalation to bi-gram.  We don't emit the term on its own, but we do emit the term together with what follows it (see the sketch after this list).  For example, "mit" is a German stopword but is also the commonly used acronym for a college (let's ignore for now that it would usually be upper-cased in that context).  Assuming "mit" gets the bi-gram escalation flag and we see the phrase "mit campus", we would literally emit "mit campus" as our token.  (We would also emit "campus" separately.)
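To make the escalation behavior concrete, here is a minimal JS sketch; the stopword table, its escalation flags, and the emit callback are all illustrative names, not actual gloda code:

```js
// Hypothetical stopword table: plain stopwords are consumed, escalating
// ones are folded into a bi-gram with the following word.
const STOPWORDS = new Map([
  ["the", { escalate: false }], // case 1: consumed outright
  ["mit", { escalate: true }],  // case 2: German stopword, also the acronym
]);

function tokenize(words, emit) {
  for (let i = 0; i < words.length; i++) {
    const word = words[i].toLowerCase();
    const stop = STOPWORDS.get(word);
    if (!stop) {
      emit(word); // ordinary term: emit as usual
    } else if (stop.escalate && i + 1 < words.length) {
      // don't emit "mit" alone, but do emit "mit campus" as one token
      emit(word + " " + words[i + 1].toLowerCase());
    }
    // non-escalating stopwords are swallowed entirely
  }
}

// tokenize(["mit", "campus"], t => console.log(t))
// logs "mit campus", then "campus" (emitted separately on its own turn)
```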

The other major complication is that introducing stop-words complicates our query building somewhat.  As in bug 549594, if the tokenizer is eating tokens, the boolean logic we use may end up (vacuously) false because of a gobbled token.  This demands that we expose the tokenizer to XPCOM in some manner.

(Note: bi-gram escalation would likely also require some explicit support in the XPCOM exposure.  For example, we would want to know that "mit" is an escalating stop-word so that we could include permutations of it and the other terms.  As per the 'MIT campus' example above, we might want 'campus MIT' to also get results, which would require that parameterization; see the sketch below.)
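A rough sketch of what the query-building side might look like once the tokenizer is exposed.  tokenizeTerm() stands in for whatever XPCOM surface we end up with, and the permutation handling for escalating stop-words ('mit campus' OR 'campus mit') would layer on top of this:

```js
// Build an FTS boolean query, dropping terms the tokenizer would eat so
// they can't force the whole AND chain to be vacuously false.
function buildFulltextQuery(searchTerms) {
  const clauses = [];
  for (const term of searchTerms) {
    // ask the (exposed) tokenizer what it would actually emit for this term
    const tokens = tokenizeTerm(term); // [] if the tokenizer eats the term
    if (tokens.length === 0) {
      continue; // drop it instead of ANDing against an always-empty match
    }
    clauses.push(tokens.join(" "));
  }
  return clauses.join(" AND ");
}
```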

Any improvement to this problem is better than none, so even if we can't address the multilingual case out of the gate, removing ridiculously common English stop-words as well as eliminating 2-character non-CJK tokens (to help out other languages too) is probably a great way to go.
Keywords: perf
Taking this. I'm going to:
- not index words that are two characters long, unless they are CJK (see the sketch after this list),
- make sure that when searching for a sentence and breaking it into words, we don't intersect with the empty results from a two-character word,
- write a test for that.
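The length gate amounts to something like this, hedged as a JS sketch of the logic only (the actual change lives in the C tokenizer, and isCJK() is an assumed helper):

```js
const MIN_TOKEN_LENGTH = 3;

function shouldEmit(token) {
  // CJK text is deliberately segmented into very short tokens, so the
  // minimum length only applies to non-CJK tokens
  if (isCJK(token.charCodeAt(0))) {
    return true;
  }
  return token.length >= MIN_TOKEN_LENGTH;
}
```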

This is the cheap solution until we properly expose the tokenizer through XPCOM.
Assignee: nobody → jonathan.protzenko
Status: NEW → ASSIGNED
Attached patch Patch v1 (obsolete) — Splinter Review
First patch. This is very minimal, but I had to find the right places to patch. There are no tests yet, and I'm not sure I'm doing the right thing. Andrew, I'm going to wait until you're back from vacation; then we can see whether this is going in the right direction, and how we could possibly test it.

This gave a 3% size improvement on global-messages-db.sqlite in my production profile.
Attachment #553284 - Flags: feedback?(bugmail)
Attached patch Patch v2 (obsolete) — Splinter Review
Forgot to qrefresh, as usual.
Attachment #553284 - Attachment is obsolete: true
Attachment #553285 - Flags: feedback?(bugmail)
Attachment #553284 - Flags: feedback?(bugmail)
Comment on attachment 553285 [details] [diff] [review]
Patch v2

This definitely looks like the right direction.

Test-wise, it looks like we currently exercise the stemmer intentionally in:
- test_intl.js: this does the tokenizer fun, but is not actually testing the non-CJK stuff so much.

We also sort of exercise it for scoring purposes in:
- test_query_core.js: through its scoring tests, though these are very limited
- test_msg_search.js: mainly ranking.

I would suggest we create a new file, test_fts3_tokenizer.js, along the lines of what we do in test_intl/test_msg_search (so cramming stuff into messages), that tests that:
- two-letter tokens like "xx" get nuked
- three-letter tokens like "foo" are still present

We will likely need explicit tests that:
- make sure msg_searcher is eliminating the "xx" search terms from the query it generates
- make sure "xx" was never emitted as a token, either by manually creating a SQL statement (it can be very simple, just a COUNT against the fulltext table; see the sketch below) or by monkey-patching the msg_search implementation so that it actually does generate the query for "xx".  I think I would like us to run a manual SQL statement at least once, but it's fine to use monkey-patching if we end up running a list of things through.
Attachment #553285 - Flags: feedback?(bugmail) → feedback+
Attached patch Patch v3Splinter Review
Test added. It's a pretty exhaustive test...
Attachment #553285 - Attachment is obsolete: true
Attachment #555518 - Flags: review?(bugmail)
Comment on attachment 555518 [details] [diff] [review]
Patch v3

Yep, these look like the tests I asked for!  Awesome and thanks!

When this bug gets marked fixed, unless you're planning to actually implement stop-words, we'll want to:
- Clone off a copy of this bug with the current name and my thoughts on stopwords intact.
- Rename this bug so that it conveys we're upping the minimum number of characters to 3 rather than doing anything with stop-words.
Attachment #555518 - Flags: review?(bugmail) → review+
Summary: gloda fts3 tokenizer would greatly benefit from stopword support → bump the gloda fts3 tokenizer minimum token length from 2 to 3
http://hg.mozilla.org/comm-central/rev/d7ed2d9fa348
Status: ASSIGNED → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Target Milestone: --- → Thunderbird 10.0