Open Bug 1425333 Opened 7 years ago Updated 7 months ago

Research, and implement if necessary, a multilocale tokenizer for SQLite

Categories

(Core :: SQLite and Embedded Database Bindings, enhancement, P3)

Tracking Status
firefox59 --- affected

People

(Reporter: rnewman, Unassigned)

References

(Blocks 1 open bug)

Details

Ten years ago, Myk filed Bug 414102. A few years later, we still want to use FTS: see Bug 1340487 (Places), Bug 808872 (Android) and Bug 1173164 (iOS). (myk and mak, please fill in more history if you have it!)

We'd like to use SQLite's FTS3, FTS4, or FTS5 for searching text in bookmarks, history, and other storage. This is true for all platforms.

However, we are not confident that SQLite's built-in tokenizers are sufficient for our purposes, even in 2017. There are four default tokenizers:

- simple, which is simple: it splits on ASCII punctuation and does relatively blind ASCII downcasing of tokens. This is inadequate for languages that don't separate words with spaces or punctuation, like Japanese, and for languages with complex case transforms (Turkish?).
- porter, which adds English Porter stemming.
- icu, which uses ICU for word boundary detection.
- unicode61, which is 'simple' but with better downcasing and punctuation support. It optionally strips diacritics. I don't know if it's possible to match diacritics with a higher score.

I'm far from an expert, but I believe that none of these really meet our needs, particularly for text that mixes languages.

We'd like to capture requirements and consult the literature to define a tokenizer that meets our needs. If one exists, we should document the correct way to use it, and move the three bugs above forward. If one does not, we should consider writing one in Rust.

References:
* https://www.sqlite.org/fts3.html#tokenizer
* https://www.sqlite.org/fts5.html
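
To make the limits of 'simple' concrete, here is a rough Python emulation of its documented behavior (token characters are ASCII alphanumerics plus every code point >= 128, and only ASCII letters are lowercased). This is an illustrative sketch, not SQLite's code; the function name simple_tokenize is made up for this example.

```python
def simple_tokenize(text):
    """Approximate SQLite's 'simple' tokenizer (illustration only)."""
    tokens, current = [], []
    for ch in text:
        if ord(ch) >= 128 or ch.isalnum():
            # Only ASCII is case-folded; non-ASCII passes through unchanged.
            current.append(ch.lower() if ord(ch) < 128 else ch)
        elif current:
            # Any other ASCII character ends the current token.
            tokens.append("".join(current))
            current = []
    if current:
        tokens.append("".join(current))
    return tokens


print(simple_tokenize("Hello, World!"))           # ASCII text splits as expected
print(simple_tokenize("\u6771\u4eac\u90fd\u306b\u884c\u304f"))  # Japanese: one big token
print(simple_tokenize("\u00c9COLE \u00e9cole"))   # non-ASCII capital is not folded
```

Under these assumptions the Japanese sentence comes back as a single token, so no word inside it can be matched on its own, and "ÉCOLE" becomes "École", which does not equal "école".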
Blocks: 1340487
Blocks: 1173164, 808872
N.B., FTS5 can't use the ICU tokenizer.
The Thunderbird tokenizer is also available at https://dxr.mozilla.org/comm-central/source/mailnews/extensions/fts3/src and can be summarized as follows, although https://dxr.mozilla.org/comm-central/rev/f14a2331480c63fe38dedf30b649d32e5c791733/mailnews/extensions/fts3/src/fts3_porter.c#939 is perhaps the best summary.

- A hacked-up version of the porter stemmer that is Unicode-aware (instead of ASCII only), performs case and accent-folding, and emits bi-gram tokens over CJK characters.

I believe FTS5 now allows overlapping tokens to be emitted, allowing for cleverness like postgres' tokenizers' ability to do things like emit all of [full domain, each domain segment, full URL, each URL path segment, etc.] given a single URL.
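
For readers who haven't looked at the Thunderbird code, the case/accent folding and CJK bi-gram behavior described above can be sketched in a few lines of Python. This is a minimal sketch and not a port of fts3_porter.c: it skips Porter stemming entirely, the is_cjk range check is a deliberately rough assumption (CJK Unified Ideographs, Hiragana and Katakana only), and all function names are hypothetical.

```python
import unicodedata


def is_cjk(ch):
    # Rough assumption: CJK Unified Ideographs plus Hiragana/Katakana blocks.
    cp = ord(ch)
    return 0x4E00 <= cp <= 0x9FFF or 0x3040 <= cp <= 0x30FF


def fold(ch):
    # Lowercase, decompose, then drop combining marks (accent folding).
    decomposed = unicodedata.normalize("NFD", ch.lower())
    return "".join(c for c in decomposed if not unicodedata.combining(c))


def tokenize(text):
    tokens, word, run = [], [], []

    def flush_word():
        if word:
            tokens.append("".join(word))
            word.clear()

    def flush_run():
        if len(run) == 1:
            tokens.append(run[0])  # a lone CJK character is its own token
        else:
            # Overlapping bi-grams over the CJK run (nothing if run is empty).
            tokens.extend(a + b for a, b in zip(run, run[1:]))
        run.clear()

    for ch in text:
        if is_cjk(ch):
            flush_word()
            run.append(ch)
        elif ch.isalnum():
            flush_run()
            word.append(fold(ch))
        else:
            flush_word()
            flush_run()
    flush_word()
    flush_run()
    return tokens


print(tokenize("Caf\u00e9 \u6771\u4eac\u90fd"))
```

With this sketch, "Café 東京都" yields the tokens cafe, 東京 and 京都, so a query for either 東京 or 京都 can match without any dictionary-based word segmentation.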
Thanks for the pointer, asuth! I saw that in other bugs, but I didn't know if it was still relevant. Reminds me of https://github.com/jonasfj/trilite too…
Priority: -- → P3
Severity: normal → S3
Product: Toolkit → Core