Open Bug 681754 Opened 14 years ago Updated 2 years ago

gloda fts3 tokenizer would greatly benefit from stopword support

Categories

(MailNews Core :: Database, defect)


Tracking

(Not tracked)

People

(Reporter: protz, Unassigned)

References

(Depends on 1 open bug, Blocks 1 open bug)

Details

(Keywords: perf, Whiteboard: [gloda key][tokenizer key])

(Spin-off of bug 554033, where we ended up implementing the cheap solution, namely bumping the minimum token length from two to three (unless the token is a CJK bigram).)

Our tokenizer does not currently have any stopword support, so we index extremely common words like "the", "in", "of", etc. Yes, that's right, we also currently emit 2-letter tokens for non-CJK stuff too.

Things are complicated by the fact that the tokenizer will be presented with multiple languages but has no idea what language it is looking at. The user's locale could provide some insight into the probability of certain things being stopwords.

Mitigation could fall into two cases (see the first sketch below):

1) Full stopword consumption. The token does not get emitted at all.

2) Escalation to bi-gram. We don't emit the term on its own, but we do emit the term together with what follows it. For example, "mit" is a German stopword but is also the commonly used acronym for a college (let's ignore that it would be upper-cased in most cases in that context for now). Assuming "mit" gets the bi-gram escalation flag and we see the phrase "mit campus", we would literally emit "mit campus" as our token. (We would also emit "campus" separately.)

The other major complication is that the introduction of stop-words complicates our query building somewhat. As in bug 549594, if the tokenizer is eating tokens, it may cause the boolean logic we are using to end up (vacuously) false because of a gobbled token. This demands that we expose the tokenizer in some manner to XPCOM. (Note: bi-gram escalation would likely also require some explicit support in the XPCOM exposure. For example, we would want to know that "mit" is an escalating stop-word so that we could include permutations of it and the other terms. As per the 'MIT campus' example above, we might want 'campus MIT' to also get results, which would require that parameterization; see the second sketch below.)

Any improvements to this problem are better than no improvements, so even if we can't address the multilingual case out of the gate, removing ridiculously common English stop-words as well as eliminating 2-character non-CJK tokens (to help out other languages too) is probably a great way to go.
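To make the two mitigation cases concrete, here is a minimal sketch of the emission logic. This is not the actual tokenizer code; the stopword tables and the Classify/EmitTokens helpers are hypothetical, and real tables would presumably be informed by the user's locale as discussed above.

#include <set>
#include <string>
#include <vector>

enum class StopwordAction { None, Consume, EscalateBigram };

// Hypothetical stopword tables, for illustration only.
static const std::set<std::string> kConsumed = {"the", "in", "of", "and"};
static const std::set<std::string> kEscalated = {"mit"};

static StopwordAction Classify(const std::string& aToken) {
  if (kConsumed.count(aToken)) return StopwordAction::Consume;
  if (kEscalated.count(aToken)) return StopwordAction::EscalateBigram;
  return StopwordAction::None;
}

// Takes pre-split, lower-cased words and returns the tokens to index.
std::vector<std::string> EmitTokens(const std::vector<std::string>& aWords) {
  std::vector<std::string> tokens;
  for (size_t i = 0; i < aWords.size(); ++i) {
    switch (Classify(aWords[i])) {
      case StopwordAction::Consume:
        break;  // Case 1: the stopword is swallowed entirely.
      case StopwordAction::EscalateBigram:
        // Case 2: emit the stopword fused with its successor, so
        // {"mit", "campus"} contributes the single token "mit campus".
        if (i + 1 < aWords.size())
          tokens.push_back(aWords[i] + " " + aWords[i + 1]);
        break;
      case StopwordAction::None:
        tokens.push_back(aWords[i]);
        break;
    }
  }
  return tokens;
}

With this, EmitTokens({"mit", "campus"}) yields {"mit campus", "campus"}: the successor still gets emitted on its own on the next iteration, as described above.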
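And on the query-building side, here is a companion sketch of the permutation parameterization mentioned above, reusing the hypothetical Classify helper from the previous sketch. It assumes escalated bigrams are stored as single space-separated tokens and that the resulting string is handed to an fts3 MATCH clause; none of this is existing gloda code.

#include <string>

// Builds the piece of an fts3 MATCH expression covering two adjacent
// query terms. If either term carries the escalation flag, both
// orderings are ORed in, so a query for "campus mit" can still find
// messages that were indexed under the token "mit campus".
std::string BuildBigramMatch(const std::string& aFirst,
                             const std::string& aSecond) {
  const bool escalates =
      Classify(aFirst) == StopwordAction::EscalateBigram ||
      Classify(aSecond) == StopwordAction::EscalateBigram;
  if (!escalates) {
    return aFirst + " " + aSecond;  // plain implicit AND of both terms
  }
  return "\"" + aFirst + " " + aSecond + "\" OR \"" +
         aSecond + " " + aFirst + "\"";
}

The point is that the query builder needs Classify's answer to know when to generate the permutations, which is exactly why the tokenizer (including its escalation flags) would have to be exposed to XPCOM in some manner.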
Depends on: 549594
protz, is there anyone other than you who might easily jump on this and bug 549594? This could help older laptops be more performant (including mine).
Well, if you could get an intern to work on this, I could most definitely provide guidance. Asuth could fix it, but I'm pretty sure he's got better things to do. Apart from that, no, I can't see anyone else who's proficient with the gloda code. Squib could do it for sure, he's very good, but I'm not sure that's his area of interest... (Hope that answers your question.)
Any crude estimate of what this might buy us? A 15% improvement in indexing speed, or reduced index space?
This would definitely buy us some index space, a few percent maybe.
Blocks: 1023000
Severity: normal → S3