Closed Bug 479783 Opened 16 years ago Closed 16 years ago

Gloda full text search indexer (fts3) must use a tokenizer which supports non-space-separated languages

Categories

(MailNews Core :: Database, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED DUPLICATE of bug 472764

People

(Reporter: bugzilla, Unassigned)

References

Details

(Keywords: intl)

As I said in bug 479214, we cannot search for Japanese (or other multibyte) strings in the message body (or attachmentName) with Thunderbar. There are two reasons:

1. Multibyte strings are not imported into the database correctly (mojibake).
2. The full text search indexer is not suitable for non-space-separated languages.

Bug 479214 will solve the first problem, and this bug is for the second one.
# this bug depends on bug 479214

Even when the message body (and attachmentName) are imported correctly, gloda currently builds its index with the 'porter' tokenizer:
http://mxr.mozilla.org/comm-central/source/mailnews/db/gloda/modules/datastore.js#905

    // - Create the fulltext table if applicable
    if ("fulltextColumns" in table) {
      let createFulltextSQL = "CREATE VIRTUAL TABLE " + aTableName + "Text" +
        " USING fts3(tokenize porter, " + table.fulltextColumns.join(", ") +
        ")";
      this._log.info("Create fulltext: " + createFulltextSQL);
      aDBConnection.executeSimpleSQL(createFulltextSQL);
    }

But the 'porter' tokenizer works only for English (and some other space-separated languages). For example, the Japanese sentence
"現在の Thunderbird は全文検索エンジンが使い物にならない。"
(roughly: "The full text search engine in the current Thunderbird is unusable.")
is split into only three tokens by porter:
"現在の", "Thunderbird" and "は全文検索エンジンが使い物にならない。"

We cannot do a meaningful full text search with the 'porter' tokenizer in Japanese, Korean, Chinese, etc.

So we must use a tokenizer that supports non-space-separated languages like Japanese. The simplest answer is, of course, a custom tokenizer that implements N-gram (bigram etc.) or N-Mgram indexing.
# morphological analysis would be better, but it cannot be implemented for all Tb locales
# if each locale could select one, as with spell-check dictionaries, that would be better, of course

An ICU library, like the one used from PHP, may help, but I am not sure:
http://phpadvent.org/2008/full-text-searching-with-sqlite-by-scott-macvicar
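
To make the bigram idea concrete, here is a minimal sketch in JavaScript. It is only an illustration of the technique, not gloda code and not an fts3 tokenizer (a real one would have to be registered with SQLite as a custom tokenizer module). The names `tokenize` and `bigrams` are hypothetical, and the choice of scripts treated as "CJK" is an assumption made for the example.

```javascript
// Characters treated as CJK for this sketch (an assumption, not a complete list).
// U+30FC is the katakana prolonged sound mark, which Script=Katakana does not cover.
const CJK = /[\p{Script=Han}\p{Script=Hiragana}\p{Script=Katakana}\p{Script=Hangul}\u30FC]/u;
// Any other letter or digit is treated as part of a space-separated word.
const WORD = /[\p{L}\p{N}]/u;

// Turn a run of CJK characters into overlapping two-character tokens.
// A run of length one is emitted as a single-character token.
function bigrams(run) {
  if (run.length === 1) return [run[0]];
  const out = [];
  for (let i = 0; i + 1 < run.length; i++) {
    out.push(run[i] + run[i + 1]);
  }
  return out;
}

// Split text into tokens: CJK runs become bigrams, other word runs
// become one lowercased token each, everything else is a delimiter.
function tokenize(text) {
  const tokens = [];
  let run = [];
  let runIsCjk = false;

  const flush = () => {
    if (run.length === 0) return;
    if (runIsCjk) tokens.push(...bigrams(run));
    else tokens.push(run.join("").toLowerCase());
    run = [];
  };

  // for...of iterates by code point, so surrogate pairs stay intact.
  for (const ch of text) {
    const isCjk = CJK.test(ch);
    if (!isCjk && !WORD.test(ch)) {
      flush(); // punctuation or whitespace ends the current run
      continue;
    }
    if (run.length > 0 && isCjk !== runIsCjk) flush(); // script boundary
    runIsCjk = isCjk;
    run.push(ch);
  }
  flush();
  return tokens;
}

console.log(tokenize("現在の Thunderbird は全文検索"));
// [ '現在', '在の', 'thunderbird', 'は全', '全文', '文検', '検索' ]
```

If the search query is tokenized the same way, "全文検索" becomes "全文", "文検", "検索", and all three appear in the indexed tokens above, so the phrase can be matched even though the original text has no spaces around it.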
Status: NEW → RESOLVED
Closed: 16 years ago
Resolution: --- → DUPLICATE