Closed Bug 472764 Opened 16 years ago Closed 15 years ago

gloda full-text search always uses SQLite's porter stemmer without regard to effective locale, etc.

Categories

(MailNews Core :: Database, defect)
Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)
Status: RESOLVED FIXED
Target Milestone: Thunderbird 3.0rc1

People

(Reporter: asuth, Assigned: m_kato)

References

Details

(4 keywords, Whiteboard: [no l10n impact][gloda key][needs unit tests][see links in comment #57-#58][needs review nits addressed before landing w/ other gloda fixes])

Attachments

(2 files, 10 obsolete files)

The SQLite FTS3 engine supports custom tokenizers.  We currently always use the Porter stemmer.  This is only appropriate for messages written in English.

It's a tricky situation because each full-text table only gets one tokenizer, and there need not be any correlation between the user's locale and the languages they receive messages in.  Using a Unicode tokenizer might get us pretty far, but stemming is also very useful for English.  The good news is that SQLite at least makes tokenizers easily pluggable.
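
For reference, here is roughly what plugging in a custom tokenizer looks like against the raw SQLite C API.  This is an untested sketch; my_tokenizer_module and the table are hypothetical, and in our tree the equivalent calls would go through mozIStorage.  The documented trick is that the module pointer itself is packed into a blob and handed to the fts3_tokenizer() SQL function:

#include <sqlite3.h>

/* Hypothetical module implementing the FTS3 tokenizer interface
 * (sqlite3_tokenizer_module is declared in ext/fts3/fts3_tokenizer.h). */
extern const sqlite3_tokenizer_module my_tokenizer_module;

static int register_and_use(sqlite3 *db) {
  const sqlite3_tokenizer_module *pModule = &my_tokenizer_module;
  sqlite3_stmt *stmt;
  int rc = sqlite3_prepare_v2(db, "SELECT fts3_tokenizer(?, ?)", -1,
                              &stmt, NULL);
  if (rc != SQLITE_OK) return rc;
  sqlite3_bind_text(stmt, 1, "mytokenizer", -1, SQLITE_STATIC);
  /* FTS3 expects the module POINTER, packed into a blob. */
  sqlite3_bind_blob(stmt, 2, &pModule, sizeof(pModule), SQLITE_STATIC);
  sqlite3_step(stmt);
  rc = sqlite3_finalize(stmt);
  if (rc != SQLITE_OK) return rc;
  /* The registered name can now be used in a tokenize clause. */
  return sqlite3_exec(db,
      "CREATE VIRTUAL TABLE msgs USING fts3(subject, body, "
      "tokenize mytokenizer)",
      NULL, NULL, NULL);
}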

Marking this blocking pref-ing gloda on, at least until we figure out what our game-plan is.  (Once we have a plan, it may no longer be blocking.)
Assignee: nobody → bugmail
Status: NEW → ASSIGNED
Flags: blocking-thunderbird3?
Target Milestone: --- → Thunderbird 3.0b2
Flags: blocking-thunderbird3? → blocking-thunderbird3+
Good news, everyone!  Apparently we already have a fairly clever tokenization function, added for the spam filter: nsISemanticUnitScanner!  Examples of its results are in bug 176528 comment 13 and the subsequent comment.  The logic seems consistent with my frantic Wikipedia-and-friends research.  The usage is also compatible with the FTS3 tokenizer API.

So, that gets us what we want, except that we would also really like to use the built-in porter stemmer for English.  As such, I propose we implement 2 tokenizers:

1) nsISemanticUnitScanner-based tokenizer.  This is the default.

2) English-biased tokenizer, for use when the user's locale is en_*.  It sounds wacky, but the idea is that we conceptually apply the porter stemmer to every word emitted by the nsISemanticUnitScanner tokenizer.

My first instinct was to say "apply the porter stemmer only if we think the message contains only english."  Unfortunately, this can result in making some queries intuitively impossible to formulate.  For example, let's say I have an indexed string containing the word "dancing" and another word with an accent, causing the porter stemmer to not be used.  Then the term will be indexed as "dancing".  But when I go to formulate a query for "dancing", because my query doesn't have anything to convince the tokenizer that it's not English, it will stem my query term to "danc".  Bad news.

So, the English-biased tokenizer will use the porter stemmer against anything that the porter stemmer would stem.  (The porter stemmer only stems [a-zA-Z], punting to copying otherwise.)  This has the potential to result in false positives where a non-English word that lacks accents will be stemmed to collide with other words (possibly of a different language), resulting in user confusion.  It seems like there has to be an awesome and amusing example of this, but none spring to mind.

Note: I use the words "conceptually" and "would" because there may be efficiency dancing going on, not to mention odd implementation choices required.  Namely, the tokenizer actually gets its strings as UTF-8, but nsISemanticUnitScanner consumes PRUnichars.  For efficiency reasons we might want to avoid crossing the conversion bridge until we have to.  At the same time, the porter stemmer implementation has some private state we probably don't want to muck with.  A 'minimal' implementation might always run the porter stemmer, but double-check the input stream and re-process any text span consumed by the porter stemmer that had 0x80 or higher in it.  The book-keeping would be fairly reasonable, and would allow for on-demand instantiation of the semantic scanner.  (A sketch of the dispatch rule follows.)
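
In code form, the dispatch rule amounts to something like this untested sketch (porter_stem and scanner_next are hypothetical stand-ins for the real stemmer entry point and the semantic-scanner-backed cursor):

#include <ctype.h>
#include <string.h>

/* Hypothetical: stems an ASCII-only token into out, returns its length. */
int porter_stem(const char *tok, int n, char *out, int outCap);
/* Hypothetical: yields the next semantic unit; nonzero means done. */
int scanner_next(void *cursor, const char **tok, int *n);

/* True when every byte of the token is an ASCII letter, i.e. the only
 * input the porter stemmer would actually transform. */
static int is_ascii_alpha(const char *tok, int n) {
  for (int i = 0; i < n; i++) {
    unsigned char c = (unsigned char)tok[i];
    if (c >= 0x80 || !isalpha(c))
      return 0;
  }
  return 1;
}

/* English-biased xNext: stem pure-ASCII-letter tokens, copy the rest. */
static int biased_next(void *cursor, char *out, int outCap, int *pn) {
  const char *tok;
  int n;
  int rc = scanner_next(cursor, &tok, &n);
  if (rc != 0) return rc;
  if (is_ascii_alpha(tok, n)) {
    *pn = porter_stem(tok, n, out, outCap);
  } else {
    if (n > outCap) n = outCap;
    memcpy(out, tok, n);
    *pn = n;
  }
  return 0;
}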

dmose, if you would comment on the sanity of this plan, it would be appreciated.  Assuming it is deemed sane, we can remove the blocking status of this bug.  Whether the semantic scanner implementation targets beta 2 or beta 3 I suppose would depend on whether we want to try and put the implementation in mozilla-central or not.  There is no technical reason for the code to live in mozilla-central; the tokenizer registration can all be accomplished without any changes to the storage component.  I am inclined to put the code in comm-central, the most noble reasons being that it involves arbitrary heuristics and the test code makes the most sense to exist in comm-central.  (No sense slowing down mozilla-central for a thunderbird feature, etc.)
Wow, great find!  I'd completely forgotten about that code.  This seems sane to me, but I'd like to hear thoughts from other folks whom I've added to the CC list (most especially smontagu, as he's likely to have the lion's share of expertise).  I don't have a terribly strong opinion on where this code would want to live; perhaps sdwilsh and myk have thoughts there.
Whiteboard: [has l10n impact]
Whiteboard: [has l10n impact] → [has l10n impact][needs patch]
What is the ETA for this bug? 

Tomorrow at 23:59 PST is our string freeze, and it would be great if this blocking bug could be fixed by then, as mentioned in https://wiki.mozilla.org/Thunderbird/Driver_Meetings/2009-02-03
Updating the status whiteboard, as this doesn't actually affect strings.  I'm going to add the relnote keyword for now, because if this lands I have the feeling we'll want to make a note about the behaviour affecting locales.
Keywords: relnote
Whiteboard: [has l10n impact][needs patch] → [no l10n impact][needs patch]
nm for now, bumping off the b2 list and aiming for b3
Keywords: relnote
Target Milestone: Thunderbird 3.0b2 → Thunderbird 3.0b3
Keywords: intl
in a sentence:
We cannot use nsISemanticUnitScanner (at least as is), and we should use some other library or implement (or port) a tokenizer (or the indexer as a whole) for SQLite.

and details:
nsISemanticUnitScanner is not so bad for the spam mail filter, but for full-text search we cannot use it (at least for Japanese and some other non-space-separated languages).
# actually nsISemanticUnitScanner is not good enough for such languages either
# I hope you already know this, which is why you wrote "-based"

As a tokenizer for full-text search, nsISemanticUnitScanner has at least the following problems:
# examples are taken from bug 176528 comment 14
 1. full-width symbols are not split well
    ex. [”(AIM)。] should be split as [”] [(] [AIM] [)] [。]
 2. some tokens in (at least) Japanese hiragana/katakana and Korean (both phonograms) are not split well.
    ex. [のすべてを] should be split as [の] [すべて] [を]
    ex. [여러분이] should be split as [여러분] [이]
        [5시까지] should be split as [5] [시] [까지]
    # there must be other languages that have the same problem
 3. a simple uni-gram may not be fast enough
    in general, bi-grams, tri-grams or m.n-grams are used for fast search
    (e.g. 東京都 would be indexed as the overlapping pairs 東京 and 京都)

note: Korean is not always written with spaces; it is actually usually written without them, and in that case almost all tokens will be garbage.

The third problem is a performance issue (which will be serious for GB-scale data), but the first and second problems cause (quite many) search terms to be dropped and are really critical.  With the current nsISemanticUnitScanner, full-text search will only be useful for space-separated, non-multi-byte languages.
# even among space-separated non-multi-byte languages, some probably have problems

So, we should use a much better tokenizer for full-text search.
# of course it would be better to have one for the spam filter too, not only full-text search
To solve the performance issue too, I think we should reuse some production-level library.

Some well-known production-level full-text search engines with fine implementations, for example:
http://hyperestraier.sourceforge.net/
http://lucene.apache.org/
http://qwik.jp/senna/
http://projects.netlab.jp/rast/

But as far as I have read their documents, I cannot find how to use them with SQLite.
I don't know yet whether we can reuse them as a tokenizer for SQLite's fts3 or not.
I'm also not sure whether we can solve all the problems just by using a custom tokenizer for fts3, or whether we would have to customize the fts3 indexer itself.
# some are used with mysql, postgresql etc. or standalone, but not with sqlite...
# this doesn't mean we cannot reuse them, just that I haven't found a howto document for sqlite yet

I cannot say which algorithm/library we should use, but I hope this helps you.
The FTS3 tokenizer implementation is fairly pluggable (see the interface definition at http://www.sqlite.org/cvstrac/fileview?f=sqlite/ext/fts3/fts3_tokenizer.h&v=1.2).  If using something like an overlapping bi-gram tokenizer, we might have to help it out a little bit on the queries to get the desired results, but nothing that we should need to modify the core for.
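
For readers without the header handy, the interface boils down to the following (paraphrased from fts3_tokenizer.h; implementations subclass the two opaque structs by appending their own fields):

typedef struct sqlite3_tokenizer sqlite3_tokenizer;
typedef struct sqlite3_tokenizer_cursor sqlite3_tokenizer_cursor;

struct sqlite3_tokenizer_module {
  int iVersion;
  /* Create/destroy one tokenizer instance (per table). */
  int (*xCreate)(int argc, const char *const *argv,
                 sqlite3_tokenizer **ppTokenizer);
  int (*xDestroy)(sqlite3_tokenizer *pTokenizer);
  /* Open a cursor over one document or query string... */
  int (*xOpen)(sqlite3_tokenizer *pTokenizer,
               const char *pInput, int nBytes,
               sqlite3_tokenizer_cursor **ppCursor);
  int (*xClose)(sqlite3_tokenizer_cursor *pCursor);
  /* ...and pull tokens from it one at a time. */
  int (*xNext)(sqlite3_tokenizer_cursor *pCursor,
               const char **ppToken, int *pnBytes,
               int *piStartOffset, int *piEndOffset, int *piPosition);
};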

We could certainly use something like the bi-gram tokenizer patch on bug 414102, but augmented to only do bi-gram tokenization when in the CJK character spaces.  To deal with characters that should not be part of a bi-gram, we could simply transform single-CJK-character queries into wildcarded prefix queries (ex: "の*" for the query "の").  It sounds like this might achieve a more correct result.
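
To make the overlapping-bi-gram idea concrete, the CJK branch of such a tokenizer's xNext would emit pairs along these lines (untested sketch; utf8_next and is_cjk are hypothetical helpers, and the ASCII/porter branch is elided):

/* Hypothetical: decode one code point at p, storing its byte length. */
unsigned utf8_next(const char *p, int *pLen);
/* Hypothetical: true for code points we treat as CJK. */
int is_cjk(unsigned codepoint);

/* Emit overlapping bi-grams for a CJK run: "ABCD" yields AB, BC, CD. */
static void emit_cjk_bigrams(const char *text, int nBytes,
                             void (*emit)(const char *tok, int n, int off)) {
  int i = 0;
  while (i < nBytes) {
    int len1;
    unsigned c1 = utf8_next(text + i, &len1);
    if (is_cjk(c1) && i + len1 < nBytes) {
      int len2;
      unsigned c2 = utf8_next(text + i + len1, &len2);
      if (is_cjk(c2))
        emit(text + i, len1 + len2, i);  /* the overlapping pair */
    }
    i += len1;  /* advance by ONE character so pairs overlap */
  }
}

So 東京都 is indexed as the pairs 東京 and 京都, and a search for 京都 also hits 東京都; that is the usual recall-over-precision trade-off of bi-gram indexing.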

What are your thoughts on this strategy?  I would be very happy to have a better solution, but from a resource perspective, I will not be able to implement a better solution in the beta 3 timeframe, so I need a fallback plan if someone else doesn't step up with a better implementation.  You seem very informed and interested in this problem; perhaps you would like to take this on? :)

In terms of other full-text engines, for the current time-frame, I suspect it's only feasible to port tokenizers from other implementations.  Unfortunately, we also need to keep licensing issues in mind.  Both hyperestraier and senna are LGPL, which I think means we cannot use them unless the authors/copyright holders of the code we port consent to have the code re-licensed under the MPL as well.

Thanks again for your input!  As you may surmise, (human) languages are not my forte.
(I am not actively working this yet, dropping assignment so as to encourage other people with an interest to consider working it...)
Assignee: bugmail → nobody
Status: ASSIGNED → NEW
Re LGPL dual-licensing, Mozilla has convinced other projects to do this before, so if porting one of those tokenizers seems like the best option, we could certainly ask.
-> asuth - since it's not ASSIGNED, we hope people will figure out they can help, but better not to be an unowned blocker.
Assignee: nobody → bugmail
Keywords: helpwanted
A simple/poor bi-gram tokenizer patch based on attachment 321769 [details] [diff] [review] from bug 414102,
just to check whether a bi-gram tokenizer with fts3 will be useful or not.
As Andrew says, IMHO we should support a simple N-gram first to make gloda work correctly, even if it doesn't have high performance.  Then we should work on a faster tokenizer (and on other parts, not only the tokenizer).
# i.e. much faster than the existing search without gloda

In Thunderbird, the target data size will be (currently) ~XX GB, not XX TB, and we need not support a parallel-machine indexer or complex optimizations just to get a slightly faster indexer/searcher or a slightly smaller database.

I tested a simple bi-gram tokenizer based on the patch from bug 414102.
The bi-gram patch itself was too poor (it doesn't support Hiragana/Katakana/Korean etc.!), so I just commented out the if statement for punctuation to work around that.
# still too poor, just enough to check whether a simple bi-gram is acceptable or not

With that, I can search Japanese words (including single characters).
As for the speed, when the search yields only a few items, it takes a few seconds in the exptoolbar to show results in the new tab.
 - about 1-2 seconds for "浅井", "井上", "東京", "開発", "wordcamp" etc.
On the other hand, when the search yields hundreds of mails, I have to wait a long time.
 - about 8 seconds for "bug"
 - about 18 seconds for "mozilla" - too slow :(
# with 1.8 GB of mail data on an Intel iMac Core 2 Duo 3.06GHz

Not really fast, but this speed is similar to the current search with the porter tokenizer, which doesn't support multibyte chars.
# And this is maybe also a problem of the exptoolbar front-end, which should get
# search results incrementally but maybe currently doesn't (I guess).

When we support incremental search against message bodies within the thunderbar, like the awesomebar of Fx, this may become a problem and we should consider a faster implementation, but we should accept this one at first.
# we have too many things around gloda before tb3...

note: current problems with the patch from bug 414102:
 - should be based on the porter tokenizer, not on the simple one
   # at least for Japanese and other languages which often use English
   # words within their text
 - should detect and skip invalid bytes as UTF-8
 - should define correct delimiters for multibyte chars
   # currently, all hiragana/katakana/Korean etc. are treated as delimiters!
 - should support search words that include delimiters, like "utf-8",
   which includes "-" and cannot be searched from the exptoolbar now
   # this may be a bug in the front-end, not the tokenizer, but not sure yet

Even when we use a simple N-gram, there are many complex problems, and we may eventually have to import some robust/fast implementation.  But that is not clear yet, and we should start with a simple N-gram...


 - p.s.
I'm really interested in this problem, since I never want to have to say "Thunderbird 3 is great but not for Japanese..." to Japanese users.
# I'm a technical marketing/evangelist staff member at Mozilla Japan
I have read some documents/papers/code about full-text search, but I'm still not good enough at this to write a fine custom tokenizer in this time frame.  I'll do my best for this bug, but I'm not accepting it so far, sorry.
note: to test the bi-gram tokenizer based on the simple tokenizer, you need to edit
Shredder.app/Contents/MacOS/modules/gloda/datastore.js line 909 (or thereabouts):
       let createFulltextSQL = "CREATE VIRTUAL TABLE " + aTableName + "Text" +
-        " USING fts3(tokenize porter, " +
+        " USING fts3(tokenize simple, " +
         [(coldef[0] + " " + coldef[1]) for each

if you need a binary of the sqlite dynamic library, you can use (replace yours with) one from:
http://people.mozilla.com/~dynamis/thunderbird/sqlite3/bigramtokenizer/1st/
# only tested on Mac (libsqlite3.dylib)
Whiteboard: [no l10n impact][needs patch] → [no l10n impact][needs patch][b3ux]
Thank you for the testing.  I am not too surprised that exptoolbar is being slow, especially in cases with lots of results.  exptoolbar doesn't bother to use a "limit" on its query and gloda returns the results all at once (a current gloda limitation).  This turns out extra poorly, as you get to see.  I would not be surprised if Thunderbird even becomes non-responsive for a few seconds as a massive garbage collection triggers.

I hope to have up a revised patch on bug 474701 soon that exposes the new, simpler gloda search interface with much better performance.
Whiteboard: [no l10n impact][needs patch][b3ux] → [no l10n impact][needs patch][b3ux][m5]
Whiteboard: [no l10n impact][needs patch][b3ux][m5] → [b3ux][m5][no l10n impact][needs patch]
Whiteboard: [b3ux][m5][no l10n impact][needs patch] → [b3ux][m6][no l10n impact][needs patch]
Not sure what to do about this bug and beta 3.  Although it does not require changes to the user interface (and so in theory could slip to rc1), we cannot have gloda-search replace quick-search for all intents and purposes if it majorly regresses the ability to search CJK.

Only enabling gloda search for english or non-CJK locales for beta 3 is problematic because it seems confusing to have beta 3 have major differences in apparent feature sets for different locales.  On the other hand, I don't know what else to do short of fairly serious slippage; I expect this to take at least a week of solid work, and I already have other things with higher priorities at serious risk.
Assignee: bugmail → dmose
Target Milestone: Thunderbird 3.0b3 → Thunderbird 3.0b4
Would be nice to have this land soon enough that we can get testing from people using non-Latin charsets.
Whiteboard: [b3ux][m6][no l10n impact][needs patch] → [no l10n impact][needs patch]
Blocks: 510775
Blocks: 507632
After discussion with drivers, we decided that while it would be exceedingly unfortunate to not have good support for locales in Gloda in 3.0, we'd still rather get good GloDa support into the hands of some people rather than no people at all, so we wouldn't block on this bug if it were the last bug standing.
Assignee: dmose → nobody
Flags: blocking-thunderbird3.1+
Flags: blocking-thunderbird3-
Flags: blocking-thunderbird3+
Target Milestone: Thunderbird 3.0b4 → ---
(In reply to comment #18)
> After discussion with drivers, we decided that while it would be exceedingly
> unfortunate to not have good support for locales in Gloda in 3.0, we'd still
> rather get good GloDa support into the hands of some people rather than no
> people at all, so we wouldn't block on this bug if it were the last bug
> standing.

I'm guessing this should probably be release-noted for TB3 - searching (and trying to make use of faceted search / gloda) in UTF-8 languages is a very natural thing to do for users of those languages (e.g. zh_CN, zh_TW and many more).

cc'ing Support & Marketing gurus...
Keywords: relnote
In which case, if the search widget combining gloda search and quick search of asuth's tryserver build is carried forward, then for certain localized builds do we want the default search to be quick search and not gloda search?

OT but I'm not sure there's a bug/checkin to blame: I'm not sure I am keen on having the two searches combined for my workflow (extra clicks, etc), although it definitely simplifies the UI which is good for most users
(In reply to comment #20)
> In which case, if the search widget combining gloda search and quick search of
> asuth's tryserver build is carried forward, then for certain localized builds
> do we want the default search to be quick search and not gloda search?

Not even necessary to just be localized builds - I had been trying to search Chinese characters in emails even in the en_US localization.
Attached patch WIP v0.9 (obsolete) — Splinter Review
Adds an FTS3 tokenizer module to search CJK text, because we cannot modify the sqlite3 code itself (see bug 414102 comment #4).
Attachment #366308 - Attachment is obsolete: true
Woohoo!  Thank you, Makoto.  I need to add some unit tests for the gloda msg_search.js file, and then I think we can also use that as a basis for testing the indexer.  If you can come up with some simple test examples for strings we should be able to put in the message and then queries that should (and should not) match those messages, that would also be very helpful.
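
For instance, something in this spirit would already be useful (illustrative only, not a vetted test set; the expectations assume the tokenizer is applied to query strings as well as message bodies):

/* Illustrative (message text, query, expected-match) triples. */
struct token_case { const char *text; const char *query; int matches; };
static const struct token_case cases[] = {
  { "東京都に住む",  "東京",   1 },  /* bi-gram inside a CJK run        */
  { "東京都に住む",  "京都",   1 },  /* overlap: a known false positive */
  { "東京都に住む",  "大阪",   0 },  /* absent bi-gram must not match   */
  { "のすべてを",    "すべて", 1 },  /* multi-bigram phrase             */
  { "I was dancing", "dance",  1 },  /* porter stems both to "danc"     */
};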

I need to quickly try and finish up some beta-4 related things for the next few days, but I should be able to get to this after that and I think we definitely want to get this into 3.0 rc1.  Re-nominating for 3.0 blocker since your work has now made this feasible.
Flags: blocking-thunderbird3- → blocking-thunderbird3?
Target Milestone: --- → Thunderbird 3.0rc1
(In reply to comment #23)
> msg_search.js file, and then I think we can also use that as a basis for
> testing the indexer.  If you can come up with some simple test examples for

Er, I mean tokenizer.
Also, I will post to the try server after I test Mac and Linux.

This patch has an unnecessary ifdef XP_WIN, since I have only tested on Windows so far.
Attached patch WIP v0.91 (obsolete) — Splinter Review
Attachment #400491 - Attachment is obsolete: true
Also, it is necessary to update the schema version.
Whiteboard: [no l10n impact][needs patch] → [no l10n impact][gloda key][has patch][needs review asuth][needs unit tests]
Attached patch patch v1 (obsolete) — Splinter Review
Adds a test.
After testing on CentOS 5, I will request review...
Assignee: nobody → m_kato
Attachment #400757 - Attachment is obsolete: true
Getting some resolution of this issue for TB3 is a blocker.
Flags: blocking-thunderbird3? → blocking-thunderbird3+
Attachment #401868 - Flags: review?(bienvenu)
Comment on attachment 401868 [details] [diff] [review]
patch v1

Thx for doing this!

Andrew is going to need to look at this, though he's on vacation at the moment.

I wonder if the sqlite code should be part of the sqlite lib, though I understand it might be hard for us to get that through mozilla-central at the moment.

The license for +++ b/mailnews/extensions/fts3/Makefile.in

looks to be old - it's 1998.

This might be a good thing to generate try server builds for, if we have testers available to try it out.
Attachment #401868 - Flags: superreview?(bienvenu)
Attachment #401868 - Flags: review?(bugmail)
Attachment #401868 - Flags: review?(bienvenu)
Submitted a try build, tagged 'bug472764-att401868'.
(In reply to comment #30)
> I wonder if the sqlite code should be part of the sqlite lib, though I
> understand it might be hard for us to get that through mozilla-central at the
> moment.
It won't likely be hard to get an updated SQLite into mozilla-central.  I'll be sure to be quick on reviews!
(In reply to comment #32)

> It won't likely be hard to get an updated SQLite into mozilla-central.  I'll be
> sure to be quick on reviews!

Cool, would it make sense to add the sqlite parts of this patch to sqlite.c? I'm not sure what all else that would entail w.r.t. exporting symbols, etc.
sdwilsh: all the way back into 1.9.1?  

Given our past experience in the delays in getting simple changes back from trunk to 1.9.2 to 1.9.1, I think for this bug it'd be good to go with a tb-specific patch first, given our very short timelines, and a parallel m-c patch.
speaking from almost complete ignorance, it looks like this would be just adding the contents of a file to sqlite.c (which is already the concatenation of a bunch of files). But a parallel m-c trunk patch and then fixing this when we move to the trunk probably is a lot easier.
(In reply to comment #33)
> Cool, would it make sense to add the sqlite parts of this patch to sqlite.c?
> I'm not sure what all else that would entail w.r.t. exporting symbols, etc.
We don't take changes to the SQLite source.  We only take files from SQLite releases.

(In reply to comment #34)
> sdwilsh: all the way back into 1.9.1?
We'd have to ship it in a 1.9.2 beta first, but that's coming up soon.  I'm not sure I'll have cycles to do the SQLite upgrade, but I'll be quick with reviews.
Attached patch patch v2 (obsolete) — Splinter Review
fix build issue.

If I can use http://hg.mozilla.org/try-comm-central/ as try server of comm-central tree, I will push it.
Attachment #401868 - Attachment is obsolete: true
Attachment #401868 - Flags: superreview?(bienvenu)
Attachment #401868 - Flags: review?(bugmail)
Makoto-san: The page https://wiki.mozilla.org/Thunderbird/Infrastructure/TryServer describes how to get a certificate to use with the try server.

In the meantime, I've spun up some try server builds with your second patch, tagged 472764v2.

They should show up on http://tinderbox.mozilla.org/showbuilds.cgi?tree=ThunderbirdTry
os x build failed for mysterious reasons, so trying again.
Attached patch patch v3 (obsolete) — Splinter Review
patch v2 has no xpt file for the tokenizer, so I added it to package-static.
Attachment #403413 - Attachment is obsolete: true
(In reply to comment #39)
> Makoto-san: The page
> https://wiki.mozilla.org/Thunderbird/Infrastructure/TryServer describes how to
> get a certificate to use with the try server.

Thanks, David.  I filed a request in Bugzilla.
spun up new try server builds, labeled 'cjkv3' or something like that =)
v3 patch failed on windows:

http://tinderbox.mozilla.org/showlog.cgi?log=ThunderbirdTry/1254324631.1254328669.30206.gz&fulltext=1

======= making ./nsldif32v60.dll
link -MANIFEST:NO -OUT:"nsldif32v60.dll" -DEBUG -OPT:REF -MAP -nologo -DLL -SUBSYSTEM:WINDOWS  -SUBSYSTEM:CONSOLE    -out:"nsldif32v60.dll" ./line64.obj  -DEF:e:/buildbot/sendchange-win32-hg/mozilla/directory/c-sdk/ldap/libraries/msdos/winsock/nsldif32.def  
e:/buildbot/sendchange-win32-hg/mozilla/directory/c-sdk/ldap/libraries/msdos/winsock/nsldif32.def(39) : warning LNK4017: DESCRIPTION statement not supported for the target platform; ignored
   Creating library nsldif32v60.lib and object nsldif32v60.exp
LINK : fatal error LNK1104: cannot open file 'nsldif32v60.lib'
make[6]: Leaving directory `/e/buildbot/sendchange-win32-hg/mozilla/objdir/directory/c-sdk/ldap/libraries/libldif'
make[5]: Leaving directory `/e/buildbot/sendchange-win32-hg/mozilla/objdir/directory/c-sdk/ldap/libraries'
make[4]: Leaving directory `/e/buildbot/sendchange-win32-hg/mozilla/objdir/directory/c-sdk/ldap'
make[3]: Leaving directory `/e/buildbot/sendchange-win32-hg/mozilla/objdir/directory/c-sdk'
make[6]: *** [nsldif32v60.dll] Error 80
make[6]: *** Waiting for unfinished jobs....
make[5]: *** [export] Error 2
make[4]: *** [export] Error 2
make[3]: *** [export] Error 2
make[2]: Leaving directory `/e/buildbot/sendchange-win32-hg/mozilla/objdir'
make[1]: Leaving directory `/e/buildbot/sendchange-win32-hg/mozilla/objdir'
make[2]: *** [tier_app] Error 2
make[1]: *** [default] Error 2
make: *** [build] Error 2
program finished with exit code 2
That's an error linking LDAP; I wonder if there's some tree bustage coincidental with that try build that's causing it...
just for fun, tried again.
The try server is very brittle, it seems.  The latest run worked on Windows & Linux:

windows: http://s3.mozillamessaging.com/build/try-server/2009-09-30_12:10-dascher@mozillamessaging.com-cjkv3try2/dascher@mozillamessaging.com-cjkv3try2-mail-try-win32.zip

and linux: http://s3.mozillamessaging.com/build/try-server/2009-09-30_12:10-dascher@mozillamessaging.com-cjkv3try2/dascher@mozillamessaging.com-cjkv3try2-mail-try-linux.tar.bz2

and the older successful mac build (also w/ 3rd patch): http://s3.mozillamessaging.com/build/try-server/2009-09-30_08:30-dascher@mozillamessaging.com-cjkv3/dascher@mozillamessaging.com-cjkv3-mail-try-mac.dmg

It'd be great if people could report whether these builds work a) with CJK corpora, and b) with non-CJK corpora.
Keywords: qawanted
Whiteboard: [no l10n impact][gloda key][has patch][needs review asuth][needs unit tests] → [no l10n impact][gloda key][has patch][needs review asuth][needs unit tests][see links in comment #50]
(In reply to comment #50)
> The try server is very brittle it seems. The latest run worked on windows &
> linux:
> 
> windows:
> http://s3.mozillamessaging.com/build/try-server/2009-09-30_12:10-dascher@mozillamessaging.com-cjkv3try2/dascher@mozillamessaging.com-cjkv3try2-mail-try-win32.zip
> 
> It'd be great if people could report whether these builds work a) with CJK
> corpora, and b) with non-CJK corpora)

Just had some time (not much) to test CJK searching on this en_US WinXP build.  I had a test message with the "十" character (meaning "ten" in Chinese); gloda is on, the message is onDisk and has a gloda ID, but when I search all messages at the top, it still says "no messages match your search".

:(

ID of this tryserver build on WinXP:
Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.1.4pre) Gecko/20090930 Shredder/3.0pre
(In reply to comment #51)

> build. I had a test message with the "十" character (means "ten" in Chinese),
> gloda is on, the message is onDisk and has a gloda ID, but when I search all
> messages at the top, it still says "no messages match your search".

This implementation is bi-gram-based, so at least 2 characters are necessary in the search string.  Could you try a search string of at least 2 characters?

If we also allowed single characters, the database size would become too big.
(In reply to comment #50)

> 
> It'd be great if people could report whether these builds work a) with CJK
> corpora, and b) with non-CJK corpora)

I've sent a few emails to the contacts I made back in May, asking them to try the builds and give us some feedback.
Attached patch patch v4 (obsolete) — Splinter Review
fixes comment #51 (I tested on Japanese mail only)
Attachment #403735 - Attachment is obsolete: true
Attached patch patch v4.1 (obsolete) — Splinter Review
typo
Attachment #403996 - Attachment is obsolete: true
Set up try server builds w/ patch 4.1
Attachment #404212 - Flags: superreview?(bienvenu)
Attachment #404212 - Flags: review?(bugmail)
Whiteboard: [no l10n impact][gloda key][has patch][needs review asuth][needs unit tests][see links in comment #50] → [no l10n impact][gloda key][has patch][needs review asuth][needs unit tests][see links in comment #57-#58]
Those builds seem to work correctly for ASCII Latin usage.
Attachment #405241 - Flags: superreview?(bienvenu)
Attachment #405241 - Flags: review?(bugmail)
Here's a patch that goes on top of the v5 patch, it:

- Adds a lot of comments.  In order to verify the correctness of the code I needed to understand it, and I am nowhere near clever enough to keep all of the balls in the air in just my head... and my memory is certainly not good enough to remember it for next time if I did somehow manage it (caffeine?).

- Treats non-ASCII things in the generic script page like ASCII rather than punctuation/whitespace.  Renamed BIGRAM_ASCII to BIGRAM_ALPHA (and used that) accordingly.

- No longer emits single-character tokens (CJK or ALPHA), generally speaking (see next point).  The emission of single-character CJK tokens seems like it was probably an unintended bug.  Given that we don't specialize for anything in the general punctuation page except whitespace, this seems advisable.

- Loses the special logic in msg_search.js that tries to generate bi-grams.  The tokenizer is used on the queries too, so there is no need to do this, and we're likely to get burned if we try.  The logic to add the wildcard in the case of a single CJK character actually did not work; it was depending on the prior bug where we generate only a single CJK character.  Namely, if '&' is a CJK character, then the wildcarded string would end up as "&*".  This does not invoke the FTS3 wildcard logic at all; it has to be an unquoted term, and the tokenizer has to eat a single token and leave '*' as the next character.  It just happened that "&*" tokenized down to "&", and if the source message had just a plain "&" somewhere, it would have put it in the database like that.  So to make the wildcard stuff work, I changed the JS code to emit an unquoted character with the wildcard and special-cased the tokenizer to detect when it is in such a situation and emit the single CJK character in that case.  (A sketch of that special case follows this list.)  If someone were to have an e-mail message where the body or the subject was a single CJK character followed by an asterisk and absolutely no whitespace, we would get tricked in that case too, but I don't believe it to be likely.

- Added some more unit test cases.  These just test that the tokenizer works when multiple bigrams are involved.
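
Sketch of the single-character wildcard special case described above (untested, and reusing the hypothetical utf8_next/is_cjk helpers from the earlier bi-gram sketch):

/* When the cursor's whole input is "<one CJK char>*" -- the form the
 * JS now emits for single-character queries -- emit the lone character
 * as a token.  FTS3 then sees the trailing '*' after the term and
 * performs a prefix match against the stored bi-grams.  Returns 1 if
 * the special case applied. */
static int maybe_emit_wildcard_single(const char *text, int nBytes,
                                      void (*emit)(const char *t, int n,
                                                   int off)) {
  int len;
  unsigned c = utf8_next(text, &len);
  if (is_cjk(c) && len + 1 == nBytes && text[len] == '*') {
    emit(text, len, 0);
    return 1;
  }
  return 0;
}
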
Patch v5 is r=asuth with the changes from the previous patch.  I have folded my changes into v5 to produce v6.

Requesting review from Makoto to make sure my changes do not break or alter the intent of the patch.  And keeping the bienvenu sr request around.

I will spin try server builds as "cjk-v6" but will not be awake to see when they complete.  If someone posts the links here when they complete (or fail to complete), I won't complain.

Per discussion, the plan is currently _not to land the patch_ once the reviews are finished.  We want to try and time things with some other gloda patches I am hoping to land soon so that we avoid blowing away people's gloda databases multiple times in the span of a few days.
Attachment #405241 - Attachment is obsolete: true
Attachment #406850 - Flags: superreview?(bienvenu)
Attachment #406850 - Flags: review?(m_kato)
Attachment #405241 - Flags: superreview?(bienvenu)
Attachment #405241 - Flags: review?(bugmail)
Comment on attachment 406850 [details] [diff] [review]
v6 patch; v5 patch with asuth mods

(r=asuth :)
Attachment #406850 - Flags: review+
uh, and I probably should have mentioned that if, before using one of these builds, you were using some of the other CJK builds, then you will need to blow away your global-messages-db.sqlite file in your profile directory.  (My changes do not bump the schema, so this will not be done automatically unless you are coming from a nightly/beta4 build.)
Attachment #406850 - Flags: review?(m_kato) → review+
Keywords: jp-critical
Channy, can you do some testing with the builds Wayne Mery created for Korean/Hangul?
Comment on attachment 406850 [details] [diff] [review]
v6 patch; v5 patch with asuth mods

no need for temp var code here, since it's only used once:

+        let code = term.charCodeAt(0);
+        // Our tokenizer treats anything at/above 0x2000 as CJK for now.
+        if (code >= 0x2000)
+          magicWild = true;

so this can be simply:

let magicWild = term.length == 1 && term.charCodeAt(0) >= 0x2000;

in fact, magicWild is only used once, so the whole thing can be in-lined.

move decl of rv to where it's first used here:

+nsFts3Tokenizer::RegisterTokenizer(mozIStorageConnection *connection)
+{
+    nsresult rv;
+    nsCOMPtr <mozIStorageStatement> selectStatement;
+
+    rv = connection->CreateStatement(NS_LITERAL_CSTRING(

you're ignoring the first rv here: 

+    rv = selectStatement->BindStringParameter(0, NS_LITERAL_STRING("mozporter"));
+    rv = selectStatement->BindBlobParameter(1, (PRUint8*)&module, sizeof(module));
+    NS_ENSURE_SUCCESS(rv, rv);

either add an NS_ENSURE_SUCCESS(rv, rv); between the two statements, or don't assign to rv, whichever is more appropriate.

just return the value of selectStatement directly here, instead of assigning to rv:

+    PRBool hasMore;
+    rv = selectStatement->ExecuteStep(&hasMore);
+
+    return rv;

sr=me, with those things fixed.
Attachment #406850 - Flags: superreview?(bienvenu) → superreview+
Whiteboard: [no l10n impact][gloda key][has patch][needs review asuth][needs unit tests][see links in comment #57-#58] → [no l10n impact][gloda key][needs unit tests][see links in comment #57-#58][needs review nits addressed before landing w/ other gloda fixes]
The plan is that I am now going to:

1) Make bienvenu's requested changes.
2) Remove the schema bumping.
3) Smoke-test that things work (still using the old porter stemmer tokenizer) for a gloda database created by a pre-CJK-aware build.
4) Land it.
5) Post about the landing as well as instructions on how to blow away your database to get the new CJK-aware-tokenizer.  This may also get some people out of the woodwork to further improve the tokenizer for non-English, non-CJK languages.  (Simple case/accent-folding seems reasonable and likely low risk assuming they are able to use libintl.)
1-4 completed:

I did this as 1 push of 3 commits:

pushed Makoto's v5 patch:
http://hg.mozilla.org/comm-central/rev/4aab2fb7d9e1

pushed my modifications to the patch mainly adding comments:
http://hg.mozilla.org/comm-central/rev/1ef3fd69325a

pushed the review modifications and the removal of the schema bumping:
http://hg.mozilla.org/comm-central/rev/810cd547ef75

I'm off to post about this.

Marking resolved fixed since the fix is on the trunk.

Super thanks once again to Makoto Kato for the patch!
Status: NEW → RESOLVED
Closed: 15 years ago
Resolution: --- → FIXED
I posted to m.d.a.t and m.d.l10n about this:
http://groups.google.ca/group/mozilla.dev.apps.thunderbird/browse_thread/thread/bf4fe06f0e0b20aa#

Once the nightlies are available we should probably reply to the message with an update.  Not sure if we should make a blog post about it or not after the nightlies are spun.  My message leaves it up to the reader to figure out how to locate their profile directory; that may be setting the bar too high.
Keywords: relnote