Closed Bug 472764 Opened 16 years ago Closed 15 years ago

gloda full-text search always uses SQLite's porter stemmer without regard to effective locale, etc.

Categories

(MailNews Core :: Database, defect)
Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)
Status: RESOLVED FIXED
Target Milestone: Thunderbird 3.0rc1

People

(Reporter: asuth, Assigned: m_kato)

References

Details

(4 keywords, Whiteboard: [no l10n impact][gloda key][needs unit tests][see links in comment #57-#58][needs review nits addressed before landing w/ other gloda fixes])

Attachments

(2 files, 10 obsolete files)

The SQLite FTS3 engine supports custom tokenizers.  We currently always use the Porter stemmer.  This is only appropriate for messages written in English.

It's a tricky situation because each full-text table only gets one tokenizer, and there need not be any correlation between the user's locale and the languages they receive messages in.  Using a Unicode tokenizer might get us pretty far, but stemming is also very useful for English.  The good news is that SQLite at least makes tokenizers easily pluggable.
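
For reference, here is roughly what plugging in a custom tokenizer looks like against the raw SQLite C API.  This is an untested sketch; my_tokenizer_module and the table are hypothetical, and in our tree the equivalent calls would go through mozIStorage.  The documented trick is that the module pointer itself is packed into a blob and handed to the fts3_tokenizer() SQL function:

#include <sqlite3.h>

/* Hypothetical module implementing the FTS3 tokenizer interface
 * (sqlite3_tokenizer_module is declared in ext/fts3/fts3_tokenizer.h). */
extern const sqlite3_tokenizer_module my_tokenizer_module;

static int register_and_use(sqlite3 *db) {
  const sqlite3_tokenizer_module *pModule = &my_tokenizer_module;
  sqlite3_stmt *stmt;
  int rc = sqlite3_prepare_v2(db, "SELECT fts3_tokenizer(?, ?)", -1,
                              &stmt, NULL);
  if (rc != SQLITE_OK) return rc;
  sqlite3_bind_text(stmt, 1, "mytokenizer", -1, SQLITE_STATIC);
  /* FTS3 expects the module POINTER, packed into a blob. */
  sqlite3_bind_blob(stmt, 2, &pModule, sizeof(pModule), SQLITE_STATIC);
  sqlite3_step(stmt);
  rc = sqlite3_finalize(stmt);
  if (rc != SQLITE_OK) return rc;
  /* The registered name can now be used in a tokenize clause. */
  return sqlite3_exec(db,
      "CREATE VIRTUAL TABLE msgs USING fts3(subject, body, "
      "tokenize mytokenizer)",
      NULL, NULL, NULL);
}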

Marking this blocking pref-ing gloda on, at least until we figure out what our game-plan is.  (Once we have a plan, it may no longer be blocking.)
Assignee: nobody → bugmail
Status: NEW → ASSIGNED
Flags: blocking-thunderbird3?
Target Milestone: --- → Thunderbird 3.0b2
Flags: blocking-thunderbird3? → blocking-thunderbird3+
Good news, everyone!  Apparently we already have a fairly clever tokenization function, added for the spam filter: nsISemanticUnitScanner!  Examples of its results are in bug 176528 comment 13 and the subsequent comment.  The logic seems consistent with my frantic Wikipedia-and-friends research.  The usage is also compatible with the FTS3 tokenizer API.

So, that gets us what we want, except that we would also really like to use the built-in porter stemmer for English.  As such, I propose we implement 2 tokenizers:

1) nsISemanticUnitScanner-based tokenizer.  This is the default.

2) English-biased tokenizer, for use when the user's locale is en_*.  It sounds wacky, but the idea is that we conceptually apply the porter stemmer to every word emitted by the nsISemanticUnitScanner tokenizer.

My first instinct was to say "apply the porter stemmer only if we think the message contains only english."  Unfortunately, this can result in making some queries intuitively impossible to formulate.  For example, let's say I have an indexed string containing the word "dancing" and another word with an accent, causing the porter stemmer to not be used.  Then the term will be indexed as "dancing".  But when I go to formulate a query for "dancing", because my query doesn't have anything to convince the tokenizer that it's not English, it will stem my query term to "danc".  Bad news.

So, the English-biased tokenizer will use the porter stemmer against anything that the porter stemmer would stem.  (The porter stemmer only stems [a-zA-Z], punting to copying otherwise.)  This has the potential to result in false positives where a non-English word that lacks accents will be stemmed to collide with other words (possibly of a different language), resulting in user confusion.  It seems like there has to be an awesome and amusing example of this, but none spring to mind.

Note: I use the words "conceptually" and "would" because there may be efficiency dancing going on, not to mention odd implementation choices required.  Namely, the tokenizer actually gets its strings as UTF-8, but nsISemanticUnitScanner consumes PRUnichars.  For efficiency reasons we might want to avoid crossing the conversion bridge until we have to.  At the same time, the porter stemmer implementation has some private state we probably don't want to muck with.  A 'minimal' implementation might always run the porter stemmer, but double-check the input stream and re-process any text span consumed by the porter stemmer that had 0x80 or higher in it.  The book-keeping would be fairly reasonable, and would allow for on-demand instantiation of the semantic scanner.  (A sketch of the dispatch rule follows.)
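
In code form, the dispatch rule amounts to something like this untested sketch (porter_stem and scanner_next are hypothetical stand-ins for the real stemmer entry point and the semantic-scanner-backed cursor):

#include <ctype.h>
#include <string.h>

/* Hypothetical: stems an ASCII-only token into out, returns its length. */
int porter_stem(const char *tok, int n, char *out, int outCap);
/* Hypothetical: yields the next semantic unit; nonzero means done. */
int scanner_next(void *cursor, const char **tok, int *n);

/* True when every byte of the token is an ASCII letter, i.e. the only
 * input the porter stemmer would actually transform. */
static int is_ascii_alpha(const char *tok, int n) {
  for (int i = 0; i < n; i++) {
    unsigned char c = (unsigned char)tok[i];
    if (c >= 0x80 || !isalpha(c))
      return 0;
  }
  return 1;
}

/* English-biased xNext: stem pure-ASCII-letter tokens, copy the rest. */
static int biased_next(void *cursor, char *out, int outCap, int *pn) {
  const char *tok;
  int n;
  int rc = scanner_next(cursor, &tok, &n);
  if (rc != 0) return rc;
  if (is_ascii_alpha(tok, n)) {
    *pn = porter_stem(tok, n, out, outCap);
  } else {
    if (n > outCap) n = outCap;
    memcpy(out, tok, n);
    *pn = n;
  }
  return 0;
}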

dmose, if you would comment on the sanity of this plan, it would be appreciated.  Assuming it is deemed sane, we can remove the blocking status of this bug.  Whether the semantic scanner implementation targets beta 2 or beta 3 I suppose would depend on whether we want to try and put the implementation in mozilla-central or not.  There is no technical reason for the code to live in mozilla-central; the tokenizer registration can all be accomplished without any changes to the storage component.  I am inclined to put the code in comm-central, the most noble reasons being that it involves arbitrary heuristics and the test code makes the most sense to exist in comm-central.  (No sense slowing down mozilla-central for a thunderbird feature, etc.)
Wow, great find!  I'd completely forgotten about that code.  This seems sane to me, but I'd like to hear thoughts from other folks whom I've added to the CC list (most especially smontagu, as he's likely to have the lion's share of expertise).  I don't have a terribly strong opinion on where this code would want to live; perhaps sdwilsh and myk have thoughts there.
Whiteboard: [has l10n impact]
Whiteboard: [has l10n impact] → [has l10n impact][needs patch]
What is the ETA for this bug? 

Tomorrow at 23:59 PST is our string freeze, and it would be great if this blocking bug could be fixed by then, as mentioned in https://wiki.mozilla.org/Thunderbird/Driver_Meetings/2009-02-03
Updating the status whiteboard, as this doesn't actually affect strings.  I'm going to add the relnote keyword for now, because if this lands I have the feeling we'll want to make a note about the behaviour affecting locales.
Keywords: relnote
Whiteboard: [has l10n impact][needs patch] → [no l10n impact][needs patch]
nm for now, bumping off the b2 list and aiming for b3
Keywords: relnote
Target Milestone: Thunderbird 3.0b2 → Thunderbird 3.0b3
Keywords: intl
in a sentence:
We cannot use nsISemanticUnitScanner (at least as is), and we should use some other library or implement (or port) a tokenizer (or the indexer as a whole) for SQLite.

and details:
nsISemanticUnitScanner is not so bad for the spam mail filter, but for full-text search we cannot use it (at least for Japanese and some other non-space-separated languages).
# actually nsISemanticUnitScanner is not good enough for such languages either
# I hope you already know this, which is why you wrote "-based"

As a tokenizer for full-text search, nsISemanticUnitScanner has at least the following problems:
# examples are taken from bug 176528 comment 14
 1. full-width symbols are not split well
    ex. [”(AIM)。] should be split as [”] [(] [AIM] [)] [。]
 2. some tokens in (at least) Japanese hiragana/katakana and Korean (both phonograms) are not split well.
    ex. [のすべてを] should be split as [の] [すべて] [を]
    ex. [여러분이] should be split as [여러분] [이]
        [5시까지] should be split as [5] [시] [까지]
    # there must be other languages that have the same problem
 3. a simple uni-gram may not be fast enough
    in general, bi-grams, tri-grams or m.n-grams are used for fast search
    (e.g. 東京都 would be indexed as the overlapping pairs 東京 and 京都)

note: Korean is not always written with spaces; it is actually usually written without them, and in that case almost all tokens will be garbage.

The third problem is a performance issue (which will be serious for GB-scale data), but the first and second problems cause (quite many) search terms to be dropped and are really critical.  With the current nsISemanticUnitScanner, full-text search will only be useful for space-separated, non-multi-byte languages.
# even among space-separated non-multi-byte languages, some probably have problems

So, we should use a much better tokenizer for full-text search.
# of course it would be better to have one for the spam filter too, not only full-text search
To solve the performance issue too, I think we should reuse some production-level library.

Some well-known production-level full-text search engines with fine implementations, for example:
http://hyperestraier.sourceforge.net/
http://lucene.apache.org/
http://qwik.jp/senna/
http://projects.netlab.jp/rast/

But as far as I have read their documents, I cannot find how to use them with SQLite.
I don't know yet whether we can reuse them as a tokenizer for SQLite's fts3 or not.
I'm also not sure whether we can solve all the problems just by using a custom tokenizer for fts3, or whether we would have to customize the fts3 indexer itself.
# some are used with mysql, postgresql etc. or standalone, but not with sqlite...
# this doesn't mean we cannot reuse them, just that I haven't found a howto document for sqlite yet

I cannot say which algorithm/library we should use, but I hope this helps you.
The FTS3 tokenizer implementation is fairly pluggable (see the interface definition at http://www.sqlite.org/cvstrac/fileview?f=sqlite/ext/fts3/fts3_tokenizer.h&v=1.2).  If using something like an overlapping bi-gram tokenizer, we might have to help it out a little bit on the queries to get the desired results, but nothing that we should need to modify the core for.
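
For readers without the header handy, the interface boils down to the following (paraphrased from fts3_tokenizer.h; implementations subclass the two opaque structs by appending their own fields):

typedef struct sqlite3_tokenizer sqlite3_tokenizer;
typedef struct sqlite3_tokenizer_cursor sqlite3_tokenizer_cursor;

struct sqlite3_tokenizer_module {
  int iVersion;
  /* Create/destroy one tokenizer instance (per table). */
  int (*xCreate)(int argc, const char *const *argv,
                 sqlite3_tokenizer **ppTokenizer);
  int (*xDestroy)(sqlite3_tokenizer *pTokenizer);
  /* Open a cursor over one document or query string... */
  int (*xOpen)(sqlite3_tokenizer *pTokenizer,
               const char *pInput, int nBytes,
               sqlite3_tokenizer_cursor **ppCursor);
  int (*xClose)(sqlite3_tokenizer_cursor *pCursor);
  /* ...and pull tokens from it one at a time. */
  int (*xNext)(sqlite3_tokenizer_cursor *pCursor,
               const char **ppToken, int *pnBytes,
               int *piStartOffset, int *piEndOffset, int *piPosition);
};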

We could certainly use something like the bi-gram tokenizer patch on bug 414102, but augmented to only do bi-gram tokenization when in the CJK character spaces.  To deal with characters that should not be part of a bi-gram, we could simply transform single-CJK-character queries into wildcarded prefix queries (ex: "の*" for the query "の").  It sounds like this might achieve a more correct result.
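
To make the overlapping-bi-gram idea concrete, the CJK branch of such a tokenizer's xNext would emit pairs along these lines (untested sketch; utf8_next and is_cjk are hypothetical helpers, and the ASCII/porter branch is elided):

/* Hypothetical: decode one code point at p, storing its byte length. */
unsigned utf8_next(const char *p, int *pLen);
/* Hypothetical: true for code points we treat as CJK. */
int is_cjk(unsigned codepoint);

/* Emit overlapping bi-grams for a CJK run: "ABCD" yields AB, BC, CD. */
static void emit_cjk_bigrams(const char *text, int nBytes,
                             void (*emit)(const char *tok, int n, int off)) {
  int i = 0;
  while (i < nBytes) {
    int len1;
    unsigned c1 = utf8_next(text + i, &len1);
    if (is_cjk(c1) && i + len1 < nBytes) {
      int len2;
      unsigned c2 = utf8_next(text + i + len1, &len2);
      if (is_cjk(c2))
        emit(text + i, len1 + len2, i);  /* the overlapping pair */
    }
    i += len1;  /* advance by ONE character so pairs overlap */
  }
}

So 東京都 is indexed as the pairs 東京 and 京都, and a search for 京都 also hits 東京都; that is the usual recall-over-precision trade-off of bi-gram indexing.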

What are your thoughts on this strategy?  I would be very happy to have a better solution, but from a resource perspective, I will not be able to implement a better solution in the beta 3 timeframe, so I need a fallback plan if someone else doesn't step up with a better implementation.  You seem very informed and interested in this problem; perhaps you would like to take this on? :)

In terms of other full-text engines, for the current time-frame, I suspect it's only feasible to port tokenizers from other implementations.  Unfortunately, we also need to keep licensing issues in mind.  Both hyperestraier and senna are LGPL, which I think means we cannot use them unless the authors/copyright holders of the code we port consent to have the code re-licensed under the MPL as well.

Thanks again for your input!  As you may surmise, (human) languages are not my forte.
(I am not actively working this yet, dropping assignment so as to encourage other people with an interest to consider working it...)
Assignee: bugmail → nobody
Status: ASSIGNED → NEW
Re LGPL dual-licensing, Mozilla has convinced other projects to do this before, so if porting one of those tokenizers seems like the best option, we could certainly ask.
-> asuth - since it's not ASSIGNED, we hope people will figure out they can help, but better not to be an unowned blocker.
Assignee: nobody → bugmail
Keywords: helpwanted
A simple/poor bi-gram tokenizer patch based on attachment 321769 [details] [diff] [review] from bug 414102,
just to check whether a bi-gram tokenizer with fts3 will be useful or not.
As Andrew says, IMHO we should support a simple N-gram first to make gloda work correctly, even if it doesn't have high performance.  Then we should work on a faster tokenizer (and on other parts, not only the tokenizer).
# i.e. much faster than the existing search without gloda

In Thunderbird, the target data size will be (currently) ~XX GB, not XX TB, and we need not support a parallel-machine indexer or complex optimizations just to get a slightly faster indexer/searcher or a slightly smaller database.

I tested a simple bi-gram tokenizer based on the patch from bug 414102.
The bi-gram patch itself was too poor (it doesn't support Hiragana/Katakana/Korean etc.!), so I just commented out the if statement for punctuation to work around that.
# still too poor, just enough to check whether a simple bi-gram is acceptable or not

With that, I can search Japanese words (including single characters).
As for the speed, when the search yields only a few items, it takes a few seconds in the exptoolbar to show results in the new tab.
 - about 1-2 seconds for "浅井", "井上", "東京", "開発", "wordcamp" etc.
On the other hand, when the search yields hundreds of mails, I have to wait a long time.
 - about 8 seconds for "bug"
 - about 18 seconds for "mozilla" - too slow :(
# with 1.8 GB of mail data on an Intel iMac Core 2 Duo 3.06GHz

Not really fast, but this speed is similar to the current search with the porter tokenizer, which doesn't support multibyte chars.
# And this is maybe also a problem of the exptoolbar front-end, which should get
# search results incrementally but maybe currently doesn't (I guess).

When we support incremental search against message bodies within the thunderbar, like the awesomebar of Fx, this may become a problem and we should consider a faster implementation, but we should accept this one at first.
# we have too many things around gloda before tb3...

note: current problems with the patch from bug 414102:
 - should be based on the porter tokenizer, not on the simple one
   # at least for Japanese and other languages which often use English
   # words within their text
 - should detect and skip invalid bytes as UTF-8
 - should define correct delimiters for multibyte chars
   # currently, all hiragana/katakana/Korean etc. are treated as delimiters!
 - should support search words that include delimiters, like "utf-8",
   which includes "-" and cannot be searched from the exptoolbar now
   # this may be a bug in the front-end, not the tokenizer, but not sure yet

Even when we use a simple N-gram, there are many complex problems, and we may eventually have to import some robust/fast implementation.  But that is not clear yet, and we should start with a simple N-gram...


 - p.s.
I'm really interested in this problem, since I never want to have to say "Thunderbird 3 is great but not for Japanese..." to Japanese users.
# I'm a technical marketing/evangelist staff member at Mozilla Japan
I have read some documents/papers/code about full-text search, but I'm still not good enough at this to write a fine custom tokenizer in this time frame.  I'll do my best for this bug, but I'm not accepting it so far, sorry.
note: to test the bi-gram tokenizer based on the simple tokenizer, you need to edit
Shredder.app/Contents/MacOS/modules/gloda/datastore.js line 909 (or thereabouts):
       let createFulltextSQL = "CREATE VIRTUAL TABLE " + aTableName + "Text" +
-        " USING fts3(tokenize porter, " +
+        " USING fts3(tokenize simple, " +
         [(coldef[0] + " " + coldef[1]) for each

if you need a binary of the sqlite dynamic library, you can use (replace yours with) one from:
http://people.mozilla.com/~dynamis/thunderbird/sqlite3/bigramtokenizer/1st/
# only tested on Mac (libsqlite3.dylib)
Whiteboard: [no l10n impact][needs patch] → [no l10n impact][needs patch][b3ux]
Thank you for the testing.  I am not too surprised that exptoolbar is being slow, especially in cases with lots of results.  exptoolbar doesn't bother to use a "limit" on its query and gloda returns the results all at once (a current gloda limitation).  This turns out extra poorly, as you get to see.  I would not be surprised if Thunderbird even becomes non-responsive for a few seconds as a massive garbage collection triggers.

I hope to have up a revised patch on bug 474701 soon that exposes the new, simpler gloda search interface with much better performance.
Whiteboard: [no l10n impact][needs patch][b3ux] → [no l10n impact][needs patch][b3ux][m5]
Whiteboard: [no l10n impact][needs patch][b3ux][m5] → [b3ux][m5][no l10n impact][needs patch]
Whiteboard: [b3ux][m5][no l10n impact][needs patch] → [b3ux][m6][no l10n impact][needs patch]
Not sure what to do about this bug and beta 3.  Although it does not require changes to the user interface (and so in theory could slip to rc1), we cannot have gloda-search replace quick-search for all intents and purposes if it majorly regresses the ability to search CJK.

Only enabling gloda search for english or non-CJK locales for beta 3 is problematic because it seems confusing to have beta 3 have major differences in apparent feature sets for different locales.  On the other hand, I don't know what else to do short of fairly serious slippage; I expect this to take at least a week of solid work, and I already have other things with higher priorities at serious risk.
Assignee: bugmail → dmose
Target Milestone: Thunderbird 3.0b3 → Thunderbird 3.0b4
Would be nice to have this land soon enough that we can get testing from people using non-Latin charsets.
Whiteboard: [b3ux][m6][no l10n impact][needs patch] → [no l10n impact][needs patch]
Blocks: 510775
Blocks: 507632
After discussion with drivers, we decided that while it would be exceedingly unfortunate to not have good support for locales in Gloda in 3.0, we'd still rather get good GloDa support into the hands of some people rather than no people at all, so we wouldn't block on this bug if it were the last bug standing.
Assignee: dmose → nobody
Flags: blocking-thunderbird3.1+
Flags: blocking-thunderbird3-
Flags: blocking-thunderbird3+
Target Milestone: Thunderbird 3.0b4 → ---
(In reply to comment #18)
> After discussion with drivers, we decided that while it would be exceedingly
> unfortunate to not have good support for locales in Gloda in 3.0, we'd still
> rather get good GloDa support into the hands of some people rather than no
> people at all, so we wouldn't block on this bug if it were the last bug
> standing.

I'm guessing this should probably be release-noted for TB3 - searching (and trying to make use of faceted search / gloda) in UTF-8 languages is a very natural thing to do for users of those languages (e.g. zh_CN, zh_TW and many more).

cc'ing Support & Marketing gurus...
Keywords: relnote
In which case, if the search widget combining gloda search and quick search of asuth's tryserver build is carried forward, then for certain localized builds do we want the default search to be quick search and not gloda search?

OT but I'm not sure there's a bug/checkin to blame: I'm not sure I am keen on having the two searches combined for my workflow (extra clicks, etc), although it definitely simplifies the UI which is good for most users
(In reply to comment #20)
> In which case, if the search widget combining gloda search and quick search of
> asuth's tryserver build is carried forward, then for certain localized builds
> do we want the default search to be quick search and not gloda search?

Not even necessary to just be localized builds - I had been trying to search Chinese characters in emails even in the en_US localization.
Attached patch WIP v0.9 (obsolete) — Splinter Review
Adds an FTS3 tokenizer module to search CJK text, because we cannot modify the sqlite3 code itself (see bug 414102 comment #4).
Attachment #366308 - Attachment is obsolete: true
Woohoo!  Thank you, Makoto.  I need to add some unit tests for the gloda msg_search.js file, and then I think we can also use that as a basis for testing the indexer.  If you can come up with some simple test examples for strings we should be able to put in the message and then queries that should (and should not) match those messages, that would also be very helpful.
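
For instance, something in this spirit would already be useful (illustrative only, not a vetted test set; the expectations assume the tokenizer is applied to query strings as well as message bodies):

/* Illustrative (message text, query, expected-match) triples. */
struct token_case { const char *text; const char *query; int matches; };
static const struct token_case cases[] = {
  { "東京都に住む",  "東京",   1 },  /* bi-gram inside a CJK run        */
  { "東京都に住む",  "京都",   1 },  /* overlap: a known false positive */
  { "東京都に住む",  "大阪",   0 },  /* absent bi-gram must not match   */
  { "のすべてを",    "すべて", 1 },  /* multi-bigram phrase             */
  { "I was dancing", "dance",  1 },  /* porter stems both to "danc"     */
};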

I need to quickly try and finish up some beta-4 related things for the next few days, but I should be able to get to this after that and I think we definitely want to get this into 3.0 rc1.  Re-nominating for 3.0 blocker since your work has now made this feasible.
Flags: blocking-thunderbird3- → blocking-thunderbird3?
Target Milestone: --- → Thunderbird 3.0rc1
(In reply to comment #23)
> msg_search.js file, and then I think we can also use that as a basis for
> testing the indexer.  If you can come up with some simple test examples for

Er, I mean tokenizer.
Also, I will post to the try server after I test Mac and Linux.

This patch has an unnecessary ifdef XP_WIN, since I have only tested on Windows so far.
Attached patch WIP v0.91 (obsolete) — Splinter Review
Attachment #400491 - Attachment is obsolete: true
Also, it is necessary to update the schema version.
Whiteboard: [no l10n impact][needs patch] → [no l10n impact][gloda key][has patch][needs review asuth][needs unit tests]
Attached patch patch v1 (obsolete) — Splinter Review
Adds a test.
After testing on CentOS 5, I will request review...
Assignee: nobody → m_kato
Attachment #400757 - Attachment is obsolete: true
Getting some resolution of this issue for TB3 is a blocker.
Flags: blocking-thunderbird3? → blocking-thunderbird3+
Attachment #401868 - Flags: review?(bienvenu)
Comment on attachment 401868 [details] [diff] [review]
patch v1

Thx for doing this!

Andrew is going to need to look at this, though he's on vacation at the moment.

I wonder if the sqlite code should be part of the sqlite lib, though I understand it might be hard for us to get that through mozilla-central at the moment.

The license for +++ b/mailnews/extensions/fts3/Makefile.in

looks to be old - it's 1998.

This might be a good thing to generate try server builds for, if we have testers available to try it out.
Attachment #401868 - Flags: superreview?(bienvenu)
Attachment #401868 - Flags: review?(bugmail)
Attachment #401868 - Flags: review?(bienvenu)
Submitted a try build, tagged 'bug472764-att401868'.
(In reply to comment #30)
> I wonder if the sqlite code should be part of the sqlite lib, though I
> understand it might be hard for us to get that through mozilla-central at the
> moment.
It won't likely be hard to get an updated SQLite into mozilla-central.  I'll be sure to be quick on reviews!
(In reply to comment #32)

> It won't likely be hard to get an updated SQLite into mozilla-central.  I'll be
> sure to be quick on reviews!

Cool, would it make sense to add the sqlite parts of this patch to sqlite.c? I'm not sure what all else that would entail w.r.t. exporting symbols, etc.
sdwilsh: all the way back into 1.9.1?  

Given our past experience in the delays in getting simple changes back from trunk to 1.9.2 to 1.9.1, I think for this bug it'd be good to go with a tb-specific patch first, given our very short timelines, and a parallel m-c patch.
speaking from almost complete ignorance, it looks like this would be just adding the contents of a file to sqlite.c (which is already the concatenation of a bunch of files). But a parallel m-c trunk patch and then fixing this when we move to the trunk probably is a lot easier.
(In reply to comment #33)
> Cool, would it make sense to add the sqlite parts of this patch to sqlite.c?
> I'm not sure what all else that would entail w.r.t. exporting symbols, etc.
We don't take changes to the SQLite source.  We only take files from SQLite releases.

(In reply to comment #34)
> sdwilsh: all the way back into 1.9.1?
We'd have to ship it in a 1.9.2 beta first, but that's coming up soon.  I'm not sure I'll have cycles to do the SQLite upgrade, but I'll be quick with reviews.
Attached patch patch v2 (obsolete) — Splinter Review
fix build issue.

If I can use http://hg.mozilla.org/try-comm-central/ as try server of comm-central tree, I will push it.
Attachment #401868 - Attachment is obsolete: true
Attachment #401868 - Flags: superreview?(bienvenu)
Attachment #401868 - Flags: review?(bugmail)
Makoto-san: The page https://wiki.mozilla.org/Thunderbird/Infrastructure/TryServer describes how to get a certificate to use with the try server.

In the meantime, I've spun up some try server builds with your second patch, tagged 472764v2.

They should show up on http://tinderbox.mozilla.org/showbuilds.cgi?tree=ThunderbirdTry
os x build failed for mysterious reasons, so trying again.
Attached patch patch v3 (obsolete) — Splinter Review
patch v2 has no xpt file for the tokenizer, so I added it to package-static.
Attachment #403413 - Attachment is obsolete: true
(In reply to comment #39)
> Makoto-san: The page
> https://wiki.mozilla.org/Thunderbird/Infrastructure/TryServer describes how to
> get a certificate to use with the try server.

Thanks, David.  I filed a request in Bugzilla.
spun up new try server builds, labeled 'cjkv3' or something like that =)
v3 patch failed on windows:

http://tinderbox.mozilla.org/showlog.cgi?log=ThunderbirdTry/1254324631.1254328669.30206.gz&fulltext=1

======= making ./nsldif32v60.dll
link -MANIFEST:NO -OUT:"nsldif32v60.dll" -DEBUG -OPT:REF -MAP -nologo -DLL -SUBSYSTEM:WINDOWS  -SUBSYSTEM:CONSOLE    -out:"nsldif32v60.dll" ./line64.obj  -DEF:e:/buildbot/sendchange-win32-hg/mozilla/directory/c-sdk/ldap/libraries/msdos/winsock/nsldif32.def  
e:/buildbot/sendchange-win32-hg/mozilla/directory/c-sdk/ldap/libraries/msdos/winsock/nsldif32.def(39) : warning LNK4017: DESCRIPTION statement not supported for the target platform; ignored
   Creating library nsldif32v60.lib and object nsldif32v60.exp
LINK : fatal error LNK1104: cannot open file 'nsldif32v60.lib'
make[6]: Leaving directory `/e/buildbot/sendchange-win32-hg/mozilla/objdir/directory/c-sdk/ldap/libraries/libldif'
make[5]: Leaving directory `/e/buildbot/sendchange-win32-hg/mozilla/objdir/directory/c-sdk/ldap/libraries'
make[4]: Leaving directory `/e/buildbot/sendchange-win32-hg/mozilla/objdir/directory/c-sdk/ldap'
make[3]: Leaving directory `/e/buildbot/sendchange-win32-hg/mozilla/objdir/directory/c-sdk'
make[6]: *** [nsldif32v60.dll] Error 80
make[6]: *** Waiting for unfinished jobs....
make[5]: *** [export] Error 2
make[4]: *** [export] Error 2
make[3]: *** [export] Error 2
make[2]: Leaving directory `/e/buildbot/sendchange-win32-hg/mozilla/objdir'
make[1]: Leaving directory `/e/buildbot/sendchange-win32-hg/mozilla/objdir'
make[2]: *** [tier_app] Error 2
make[1]: *** [default] Error 2
make: *** [build] Error 2
program finished with exit code 2
That's an error linking LDAP; I wonder if there's some tree bustage coincidental with that try build that's causing it...
just for fun, tried again.
The try server is very brittle, it seems.  The latest run worked on Windows & Linux:

windows: http://s3.mozillamessaging.com/build/try-server/2009-09-30_12:10-dascher@mozillamessaging.com-cjkv3try2/dascher@mozillamessaging.com-cjkv3try2-mail-try-win32.zip

and linux: http://s3.mozillamessaging.com/build/try-server/2009-09-30_12:10-dascher@mozillamessaging.com-cjkv3try2/dascher@mozillamessaging.com-cjkv3try2-mail-try-linux.tar.bz2

and the older successful mac build (also w/ 3rd patch): http://s3.mozillamessaging.com/build/try-server/2009-09-30_08:30-dascher@mozillamessaging.com-cjkv3/dascher@mozillamessaging.com-cjkv3-mail-try-mac.dmg

It'd be great if people could report whether these builds work a) with CJK corpora, and b) with non-CJK corpora.
Keywords: qawanted
Whiteboard: [no l10n impact][gloda key][has patch][needs review asuth][needs unit tests] → [no l10n impact][gloda key][has patch][needs review asuth][needs unit tests][see links in comment #50]
(In reply to comment #50)
> The try server is very brittle it seems. The latest run worked on windows &
> linux:
> 
> windows:
> http://s3.mozillamessaging.com/build/try-server/2009-09-30_12:10-dascher@mozillamessaging.com-cjkv3try2/dascher@mozillamessaging.com-cjkv3try2-mail-try-win32.zip
> 
> It'd be great if people could report whether these builds work a) with CJK
> corpora, and b) with non-CJK corpora)

Just had some time (not much) to test CJK searching on this en_US WinXP build.  I had a test message with the "十" character (meaning "ten" in Chinese); gloda is on, the message is onDisk and has a gloda ID, but when I search all messages at the top, it still says "no messages match your search".

:(

ID of this tryserver build on WinXP:
Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.1.4pre) Gecko/20090930 Shredder/3.0pre
(In reply to comment #51)

> build. I had a test message with the "十" character (means "ten" in Chinese),
> gloda is on, the message is onDisk and has a gloda ID, but when I search all
> messages at the top, it still says "no messages match your search".

This implementation is bi-gram-based, so at least 2 characters are necessary in the search string.  Could you try a search string of at least 2 characters?

If we also allowed single characters, the database size would become too big.
(In reply to comment #50)

> 
> It'd be great if people could report whether these builds work a) with CJK
> corpora, and b) with non-CJK corpora)

I've sent a few emails to the contacts I made back in May, asking them to try the builds and give us some feedback.
Attached patch patch v4 (obsolete) — Splinter Review
fixes comment #51 (I tested on Japanese mail only)
Attachment #403735 - Attachment is obsolete: true
Attached patch patch v4.1 (obsolete) — Splinter Review
typo
Attachment #403996 - Attachment is obsolete: true
Set up try server builds w/ patch 4.1
Attachment #404212 - Flags: superreview?(bienvenu)
Attachment #404212 - Flags: review?(bugmail)
Whiteboard: [no l10n impact][gloda key][has patch][needs review asuth][needs unit tests][see links in comment #50] → [no l10n impact][gloda key][has patch][needs review asuth][needs unit tests][see links in comment #57-#58]
Those builds seem to work correctly for ASCII Latin usage.
Attachment #405241 - Flags: superreview?(bienvenu)
Attachment #405241 - Flags: review?(bugmail)
Here's a patch that goes on top of the v5 patch, it:

- Adds a lot of comments.  In order to verify the correctness of the code I needed to understand it, and I am nowhere near clever enough to keep all of the balls in the air in just my head... and my memory is certainly not good enough to remember it for next time if I did somehow manage it (caffeine?).

- Treats non-ASCII things in the generic script page like ASCII rather than punctuation/whitespace.  Renamed BIGRAM_ASCII to BIGRAM_ALPHA (and used that) accordingly.

- No longer emits single-character tokens (CJK or ALPHA), generally speaking (see next point).  The emission of single-character CJK tokens seems like it was probably an unintended bug.  Given that we don't specialize for anything in the general punctuation page except whitespace, this seems advisable.

- Loses the special logic in msg_search.js that tries to generate bi-grams.  The tokenizer is used on the queries too, so there is no need to do this, and we're likely to get burned if we try.  The logic to add the wildcard in the case of a single CJK character actually did not work; it was depending on the prior bug where we generate only a single CJK character.  Namely, if '&' is a CJK character, then the wildcarded string would end up as "&*".  This does not invoke the FTS3 wildcard logic at all; it has to be an unquoted term, and the tokenizer has to eat a single token and leave '*' as the next character.  It just happened that "&*" tokenized down to "&", and if the source message had just a plain "&" somewhere, it would have put it in the database like that.  So to make the wildcard stuff work, I changed the JS code to emit an unquoted character with the wildcard and special-cased the tokenizer to detect when it is in such a situation and emit the single CJK character in that case.  (A sketch of that special case follows this list.)  If someone were to have an e-mail message where the body or the subject was a single CJK character followed by an asterisk and absolutely no whitespace, we would get tricked in that case too, but I don't believe it to be likely.

- Added some more unit test cases.  These just test that the tokenizer works when multiple bigrams are involved.
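
Sketch of the single-character wildcard special case described above (untested, and reusing the hypothetical utf8_next/is_cjk helpers from the earlier bi-gram sketch):

/* When the cursor's whole input is "<one CJK char>*" -- the form the
 * JS now emits for single-character queries -- emit the lone character
 * as a token.  FTS3 then sees the trailing '*' after the term and
 * performs a prefix match against the stored bi-grams.  Returns 1 if
 * the special case applied. */
static int maybe_emit_wildcard_single(const char *text, int nBytes,
                                      void (*emit)(const char *t, int n,
                                                   int off)) {
  int len;
  unsigned c = utf8_next(text, &len);
  if (is_cjk(c) && len + 1 == nBytes && text[len] == '*') {
    emit(text, len, 0);
    return 1;
  }
  return 0;
}
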
Patch v5 is r=asuth with the changes from the previous patch.  I have folded my changes into v5 to produce v6.

Requesting review from Makoto to make sure my changes do not break or alter the intent of the patch.  And keeping the bienvenu sr request around.

I will spin try server builds as "cjk-v6" but will not be awake to see when they complete.  If someone posts the links here when they complete (or fail to complete), I won't complain.

Per discussion, the plan is currently _not to land the patch_ once the reviews are finished.  We want to try and time things with some other gloda patches I am hoping to land soon so that we avoid blowing away people's gloda databases multiple times in the span of a few days.
Attachment #405241 - Attachment is obsolete: true
Attachment #406850 - Flags: superreview?(bienvenu)
Attachment #406850 - Flags: review?(m_kato)
Attachment #405241 - Flags: superreview?(bienvenu)
Attachment #405241 - Flags: review?(bugmail)
Comment on attachment 406850 [details] [diff] [review]
v6 patch; v5 patch with asuth mods

(r=asuth :)
Attachment #406850 - Flags: review+
uh, and I probably should have mentioned that if, before using one of these builds, you were using some of the other CJK builds, then you will need to blow away your global-messages-db.sqlite file in your profile directory.  (My changes do not bump the schema, so this will not be done automatically unless you are coming from a nightly/beta4 build.)
Attachment #406850 - Flags: review?(m_kato) → review+
Keywords: jp-critical
Channy, can you do some testing with the builds Wayne Mery created for Korean/Hangul?
Comment on attachment 406850 [details] [diff] [review]
v6 patch; v5 patch with asuth mods

no need for temp var code here, since it's only used once:

+        let code = term.charCodeAt(0);
+        // Our tokenizer treats anything at/above 0x2000 as CJK for now.
+        if (code >= 0x2000)
+          magicWild = true;

so this can be simply:

let magicWild = term.length == 1 && term.charCodeAt(0) >= 0x2000;

in fact, magicWild is only used once, so the whole thing can be in-lined.

move decl of rv to where it's first used here:

+nsFts3Tokenizer::RegisterTokenizer(mozIStorageConnection *connection)
+{
+    nsresult rv;
+    nsCOMPtr <mozIStorageStatement> selectStatement;
+
+    rv = connection->CreateStatement(NS_LITERAL_CSTRING(

you're ignoring the first rv here: 

+    rv = selectStatement->BindStringParameter(0, NS_LITERAL_STRING("mozporter"));
+    rv = selectStatement->BindBlobParameter(1, (PRUint8*)&module, sizeof(module));
+    NS_ENSURE_SUCCESS(rv, rv);

either add an NS_ENSURE_SUCCESS(rv, rv); between the two statements, or don't assign to rv, whichever is more appropriate.

just return the value of selectStatement directly here, instead of assigning to rv:

+    PRBool hasMore;
+    rv = selectStatement->ExecuteStep(&hasMore);
+
+    return rv;

sr=me, with those things fixed.
Attachment #406850 - Flags: superreview?(bienvenu) → superreview+
Whiteboard: [no l10n impact][gloda key][has patch][needs review asuth][needs unit tests][see links in comment #57-#58] → [no l10n impact][gloda key][needs unit tests][see links in comment #57-#58][needs review nits addressed before landing w/ other gloda fixes]
The plan is that I am now going to:

1) Make bienvenu's requested changes.
2) Remove the schema bumping.
3) Smoke-test that things work (still using the old porter stemmer tokenizer) for a gloda database created by a pre-CJK-aware build.
4) Land it.
5) Post about the landing as well as instructions on how to blow away your database to get the new CJK-aware-tokenizer.  This may also get some people out of the woodwork to further improve the tokenizer for non-English, non-CJK languages.  (Simple case/accent-folding seems reasonable and likely low risk assuming they are able to use libintl.)
1-4 completed:

I did this as 1 push of 3 commits:

pushed Makoto's v5 patch:
http://hg.mozilla.org/comm-central/rev/4aab2fb7d9e1

pushed my modifications to the patch mainly adding comments:
http://hg.mozilla.org/comm-central/rev/1ef3fd69325a

pushed the review modifications and the removal of the schema bumping:
http://hg.mozilla.org/comm-central/rev/810cd547ef75

I'm off to post about this.

Marking resolved fixed since the fix is on the trunk.

Super thanks once again to Makoto Kato for the patch!
Status: NEW → RESOLVED
Closed: 15 years ago
Resolution: --- → FIXED
I posted to m.d.a.t and m.d.l10n about this:
http://groups.google.ca/group/mozilla.dev.apps.thunderbird/browse_thread/thread/bf4fe06f0e0b20aa#

Once the nightlies are available we should probably reply to the message with an update.  Not sure if we should make a blog post about it or not after the nightlies are spun.  My message leaves it up to the reader to figure out how to locate their profile directory; that may be setting the bar too high.
Keywords: relnote