Closed Bug 414102 Opened 17 years ago Closed 6 years ago

make SQLite full text search module (fts3) support Unicode

Categories

(Core :: SQLite and Embedded Database Bindings, defect)

defect
Not set
normal

Tracking

()

RESOLVED INACTIVE

People

(Reporter: myk, Unassigned)

References

Details

Attachments

(3 obsolete files)

The SQLite full text search module (fts3) uses an ASCII tokenizer described thusly: The module currently uses the following generic tokenization mechanism. A token is a contiguous sequence of alphanumeric ASCII characters (A-Z, a-z and 0-9). All non-ASCII characters are ignored. Each token is converted to lowercase before it is stored in the index, so all full-text searches are case-insensitive. The module does not perform stemming of any sort. - <http://www.sqlite.org/cvstrac/wiki?p=FullTextIndex> Although Firefox 3 doesn't use the module, we plan to enable it in bug 413589 so that extensions can make use of it, and it would be more useful to extensions serving Firefox's large international userbase if the module supported Unicode text. The SQLite mailing list's current recommended approach for this is to use the icu module <http://www.mail-archive.com/sqlite-users%40sqlite.org/msg30978.html>, although we've balked at that in the past because of that module's size, preferring instead of to support Unicode in the case (upper, lower) functions and LIKE operator via custom functions that redefine the SQLite-native ones <http://lxr.mozilla.org/mozilla/source/storage/src/mozStorageUnicodeFunctions.cpp>.
Version: unspecified → Trunk
Note that the previous patch (#321766) is very simplistic, in that it tokenizes all characters above US-ASCII in a per-character basis. Still, it's better than not able to recognize any Unicode text at all. Alternately, this patch tokenizes those non-ascii characters as bi-grams, which is suitable for Chinese Ideographic phrases, but it'd probably need a more elaborate character categorizer to restrict its scope to CJK only. All things considered, #321766 probably fits the idea of a "simple" tokenizer better than this patch.
Attached patch patch v1 (obsolete) — Splinter Review
bi-gram indexing if CJK character. The indexing of Sqlite is used by Gloda search of Thunderbird3. I believe that the best way is to use libicu, but this library will be fat.
Attachment #321766 - Attachment is obsolete: true
Attachment #321769 - Attachment is obsolete: true
Attachment #397611 - Flags: review?
Attachment #397611 - Flags: review? → review?(sdwilsh)
Comment on attachment 397611 [details] [diff] [review] patch v1 We do not take changes to sqlite3.c in our tree. We only take upstream copies and commit them.
Attachment #397611 - Flags: review?(sdwilsh) → review-
it seems there is a waiting tokenizer patch for sqlite that hasn't been implemented. but, wouldn't bi-gram indexing preclude finding, say, 海 in 上海市? in any event, in addition to the CJK limitation, sqlite fts has other problems like only one negation term that cannot be the sole term, issues with special chars and escaping. i've instead implemented a regex search. of course, on a relative comparison it can't compete with indexing speed-wise, but is acceptable on an absolute basis. so it's a tradeoff between complete and correct functionality slower, or partial incomplete functionality faster. it also seems fts is a bit abandoned in sqlite.. perhaps a custom regex based utility can be createFunctioned in Fx and exposed to js.
(In reply to comment #5) > it also seems fts is a bit abandoned in sqlite.. eh, far from it
(In reply to comment #5) > it seems there is a waiting tokenizer patch for sqlite that hasn't been > implemented. but, wouldn't bi-gram indexing preclude finding, say, 海 in 上海市? The Thunderbird-specificish tokenizer landed in bug 472764 in comm-central/mailnews/ resolves this issue by explicitly promoting a single-character CJK query to a wildcarded query (海*) in the Thunderbird front-end. So we will match 海 in 上海市 but not 海 in 上海 because we never generate a bi-gram that begins with the search character. > in any event, in addition to the CJK limitation, sqlite fts has other problems > like only one negation term that cannot be the sole term, issues with special You should be able to construct a SQLite query that accomplishes the sole negation term. (SELECT * FROM blah WHERE id NOT IN (select id from blahFulltext WHERE searchColumn MATCH negatedWord). > chars and escaping. i've instead implemented a regex search. of course, on a I'm not sure what you mean about special characters unless you just mean that SQLite's FTS currently does not ship with any language-aware tokenizers other than English. > relative comparison it can't compete with indexing speed-wise, but is > acceptable on an absolute basis. so it's a tradeoff between complete and ... > perhaps a custom regex based utility can be createFunctioned in Fx and exposed > to js. Could you elaborate on another (new) bug or in the newsgroups on what you have implemented? There certainly is a subset of the Thunderbird userbase that loves using regular expressions for searching and which would likely be great fodder for an extension. It would be interesting to hear what other options are possible beyond Thunderbird's existing avenues for implementing this.
(In reply to comment #6) > (In reply to comment #5) > > it also seems fts is a bit abandoned in sqlite.. > eh, far from it well, you (obviously) are more involved so i will take your comment as a good sign it's not. but that patch hanging out so long, and a posted comment from mr fts about 'fts burnout' and such lead to my statement.
Comment on attachment 397611 [details] [diff] [review] patch v1 Marking patch obsolete because that development effort was moved to bug 472764 where it successfully concluded in the landing of a patch.
Attachment #397611 - Attachment is obsolete: true
(In reply to comment #7) > (In reply to comment #5) > > it seems there is a waiting tokenizer patch for sqlite that hasn't been > > implemented. but, wouldn't bi-gram indexing preclude finding, say, 海 in 上海市? > > The Thunderbird-specificish tokenizer landed in bug 472764 in > comm-central/mailnews/ resolves this issue by explicitly promoting a > single-character CJK query to a wildcarded query (海*) in the Thunderbird > front-end. So we will match 海 in 上海市 but not 海 in 上海 because we never generate > a bi-gram that begins with the search character. > my regex implementation finds 海 in both 上海 and 海上. the concept of word boundaries does not exist in CJK, so while non CJK strings need quotes, they are not necessary in a CJK 'sentence'. therefore, a CJK search should not require either quotes or *. > > in any event, in addition to the CJK limitation, sqlite fts has other problems > > like only one negation term that cannot be the sole term, issues with special > > You should be able to construct a SQLite query that accomplishes the sole > negation term. (SELECT * FROM blah WHERE id NOT IN (select id from > blahFulltext WHERE searchColumn MATCH negatedWord). true, and the same sort of thing could be used for multiple negations. but it's not really using fts indexing then. one could return fts results without using fts. > > > chars and escaping. i've instead implemented a regex search. of course, on a > > I'm not sure what you mean about special characters unless you just mean that > SQLite's FTS currently does not ship with any language-aware tokenizers other > than English. > well, i want to be able to find "A Quote" (with the "s), so i would enter \"A Quote\" to do so. it's the only special char that needs escaping, by the user. not possible in fts (afaict). also, fts treats non ascii as a single space so the term 'test-case' will match 'test---case' and 'test*case' etc. not great. > > relative comparison it can't compete with indexing speed-wise, but is > > acceptable on an absolute basis. so it's a tradeoff between complete and > ... > > perhaps a custom regex based utility can be createFunctioned in Fx and exposed > > to js. > > Could you elaborate on another (new) bug or in the newsgroups on what you have > implemented? There certainly is a subset of the Thunderbird userbase that > loves using regular expressions for searching and which would likely be great > fodder for an extension. It would be interesting to hear what other options > are possible beyond Thunderbird's existing avenues for implementing this. sure, i'll post something somewhere once i release the code (this is for Snowl). i intend to offer an option of direct regex user input so they can go wild if that's how they want to search.
(In reply to comment #10) > well, i want to be able to find "A Quote" (with the "s), so i would enter \"A > Quote\" to do so. it's the only special char that needs escaping, by the user. > not possible in fts (afaict). also, fts treats non ascii as a single space so > the term 'test-case' will match 'test---case' and 'test*case' etc. not great. Both of the examples you mention are results of decisions made by the tokenizer. The tokenizer can be made to emit any characters as tokens; the output doesn't even need to be well-formed UTF-8. The main ramification of addressing your examples in the tokenizer would be impact on NEAR/adjacency logic plus potential non-trivial index bloat. It would likely be better to post-process the results with a filter that runs at a higher layer than SQLite so you can benefit from the tremendous reduction in result set size from using the inverted index but still have your very specific non-textual constraints.
i'll have to think about the post processing idea, makes sense where small numbers are returned initially, so it could be a wash. but, in fact, if i just use regex and don't bother with any fts indexing, i get natural (user entry perspective) results with the exception of escaping ". imo, it's more important not to force unnecessary wildcards or return incorrect matches or return matches not explicitly intended, even if they're edge cases. perhaps you can port the CJK tokenizer to mozilla-central? lack of CKJ was half the importance of ditching fts. thanks for the input.
Per policy at https://wiki.mozilla.org/Bug_Triage/Projects/Bug_Handling/Bug_Husbandry#Inactive_Bugs. If this bug is not an enhancement request or a bug not present in a supported release of Firefox, then it may be reopened.
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → INACTIVE
Product: Toolkit → Core
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: