Closed
Bug 479783
Opened 16 years ago
Closed 16 years ago
Gloda full text search indexer (fts3) must use a tokenizer which support non-space-separated languages
Categories
(MailNews Core :: Database, defect)
Tracking
(Not tracked)
RESOLVED
DUPLICATE
of bug 472764
People
(Reporter: bugzilla, Unassigned)
References
Details
(Keywords: intl)
As I said in bug 479214, we cannot search for Japanese (or other multibyte) strings in the message body (or attachmentName) with Thunderbar.
There are two reasons:
1. multibyte strings are not imported into the database correctly (mojibake)
2. the full text search indexer is not suitable for non-space-separated languages
Bug 479214 will solve the first problem; this bug covers the second one.
# this bug depends on bug 479214
Even once the message body (and attachmentName) is imported correctly, gloda currently builds its index with the 'porter' tokenizer:
http://mxr.mozilla.org/comm-central/source/mailnews/db/gloda/modules/datastore.js#905
// - Create the fulltext table if applicable
if ("fulltextColumns" in table) {
  let createFulltextSQL = "CREATE VIRTUAL TABLE " + aTableName + "Text" +
    " USING fts3(tokenize porter, " + table.fulltextColumns.join(", ") +
    ")";
  this._log.info("Create fulltext: " + createFulltextSQL);
  aDBConnection.executeSimpleSQL(createFulltextSQL);
}
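To make the point of change concrete, here is a minimal sketch of the same string building with the tokenizer name pulled out as a parameter. The function name buildCreateFulltextSQL, the table name "messages", the column names, and the tokenizer name "bigram" are all hypothetical and used only for illustration; none of them come from datastore.js. A custom tokenizer would also have to be registered with SQLite before such a statement could succeed.

```javascript
// Hypothetical helper: builds the same CREATE VIRTUAL TABLE statement as the
// snippet above, but with the tokenizer name as a parameter (default: porter).
function buildCreateFulltextSQL(tableName, fulltextColumns, tokenizer = "porter") {
  return "CREATE VIRTUAL TABLE " + tableName + "Text" +
    " USING fts3(tokenize " + tokenizer + ", " + fulltextColumns.join(", ") +
    ")";
}

// Current behaviour (illustrative table and column names).
console.log(buildCreateFulltextSQL("messages", ["subject", "body"]));

// What the statement might look like with a custom tokenizer named "bigram"
// (an assumed name; no such tokenizer ships with SQLite).
console.log(buildCreateFulltextSQL("messages", ["subject", "body"], "bigram"));
```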
But the 'porter' tokenizer works only for English (and some other space-separated languages).
For example, the Japanese sentence
"現在の Thunderbird は全文検索エンジンが使い物にならない。"
will be split into only three tokens by porter:
"現在の", "Thunderbird" and "は全文検索エンジンが使い物にならない。"
We cannot do meaningful full text search with the 'porter' tokenizer in Japanese, Korean, Chinese, etc.
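The three-token split described above can be reproduced roughly with the sketch below. It is only an approximation of a delimiter-based tokenizer, not the real porter implementation: it assumes ASCII letters, digits, and every non-ASCII character count as token characters, and it leaves out stemming entirely. The function name delimiterTokenize is made up for this example.

```javascript
// Rough approximation of how a delimiter-based tokenizer sees the text:
// ASCII letters/digits and all non-ASCII characters are token characters,
// everything else (spaces, ASCII punctuation) is a delimiter.
// Stemming is deliberately omitted; this is NOT the real porter tokenizer.
function delimiterTokenize(text) {
  const runs = text.match(/[A-Za-z0-9\u{80}-\u{10FFFF}]+/gu) || [];
  return runs.map((run) => run.toLowerCase());
}

const porterSample = "現在の Thunderbird は全文検索エンジンが使い物にならない。";
console.log(delimiterTokenize(porterSample));
```

Because the last token is one long run, a query for a word inside it, such as "検索", has no token of its own to match.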
So we must use a tokenizer that supports non-space-separated languages like Japanese.
The simplest answer is, of course, a custom tokenizer that implements N-gram (bigram etc.) or N-M gram tokenization.
# morphological analysis would be better but cannot be implemented for all Tb locales
# If each locale could select one, the way spell-check dictionaries are selected, that would of course be better.
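As an illustration of the bigram idea, here is a minimal sketch in plain JavaScript. It is not an fts3 tokenizer module (those are registered with SQLite through its own tokenizer interface), and the function name bigramTokenize and its splitting rules are assumptions made for this example: ASCII words are kept whole and lowercased, while runs of non-ASCII characters are cut into overlapping two-character tokens.

```javascript
// Hypothetical bigram tokenizer sketch; not part of gloda or SQLite.
function bigramTokenize(text) {
  const tokens = [];
  // A run is either an ASCII word, or a stretch of non-ASCII characters
  // that contains no whitespace and no punctuation.
  const runs = text.match(/[A-Za-z0-9]+|[^\x00-\x7F\s\p{P}]+/gu) || [];
  for (const run of runs) {
    if (/^[A-Za-z0-9]+$/.test(run)) {
      tokens.push(run.toLowerCase()); // keep ASCII words whole
      continue;
    }
    const chars = Array.from(run); // split by code point, not UTF-16 unit
    if (chars.length === 1) {
      tokens.push(run); // a lone character becomes its own token
      continue;
    }
    for (let i = 0; i + 1 < chars.length; i++) {
      tokens.push(chars[i] + chars[i + 1]); // overlapping pairs
    }
  }
  return tokens;
}

console.log(bigramTokenize("現在の Thunderbird は全文検索"));
```

With this scheme a two-character query such as "検索" is itself one of the indexed tokens, so it can match. Longer queries would need to be tokenized the same way and matched as a phrase.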
An ICU library, like the one used in PHP, may help, but I am not sure:
http://phpadvent.org/2008/full-text-searching-with-sqlite-by-scott-macvicar
Updated•16 years ago
Status: NEW → RESOLVED
Closed: 16 years ago
Resolution: --- → DUPLICATE