Closed Bug 525537 Opened 15 years ago Closed 14 years ago

Gloda search tokenizer should perform case-folding and accent-folding to be case-insensitive and accent-insensitive, also handle half-width katakana

Categories

(Thunderbird :: Search, defect)

defect
Not set
normal

Tracking

(blocking-thunderbird3.1 beta2+, thunderbird3.1 beta2-fixed)

RESOLVED FIXED
Thunderbird 3.1b2
Tracking Status
blocking-thunderbird3.1 --- beta2+
thunderbird3.1 --- beta2-fixed

People

(Reporter: shopik, Assigned: m_kato)

References

Details

(Whiteboard: [gloda key])

Attachments

(3 files, 5 obsolete files)

Doing the same search using Cyrillic letters but with different case, you get different results, while doing the same thing with English-only letters, gloda produces the same result.
I can't test this right now using trunk, because new tabs are broken, but this definitely exists on 3.0.
Yes, the only special processing we do for non-CJK characters is case-folding of ASCII characters (which the SQLite tokenizer already knew how to do).  If someone can take a crack at augmenting the tokenizer right away using libintl we might be able to fix this for 3.0, otherwise not.
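For reference, a minimal sketch of what locale-driven folding through the standard wide-character API looks like (illustrative only, not a proposed patch; behaviour depends on the active locale, and it assumes wint_t can hold the code point):

#include <stdint.h>
#include <wctype.h>

/* Fold one code point using the process's current LC_CTYPE locale.
 * Which non-ASCII characters actually get lowered varies per locale,
 * which is the main limitation of this approach. */
static uint32_t locale_fold(uint32_t c)
{
  if (c <= 0x7F)
    return (c >= 'A' && c <= 'Z') ? c + 0x20 : c;  /* ASCII fast path */
  return (uint32_t)towlower((wint_t)c);            /* locale-dependent */
}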
If that does not make it into the first 3.0 release, this should be noted in the Release Notes or in some other way, so users don't expect it to work correctly with languages other than English.
Keywords: relnote
What is not working? Case-insensitive search in locales other than en, or case-insensitive search with non-ASCII letters?
Searching is case-insensitive in Dutch, so it would seem that the problem is case-insensitive search with non-ASCII letters. The title of this bug should be changed to reflect this.

Tested with Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en-US; rv:1.9.1.5) Gecko/20091121 Thunderbird/3.0.
Summary: Gloda search is case sensitive for languages other than English → Gloda search is case sensitive for non-ASCII letters
OS: Windows XP → All
Hardware: x86 → All
Whiteboard: [gloda key]
Could anyone attach a sample of mail text for testing, along with the expected result?

I am considering updating the fts3 module to treat full-width numbers and alphabetic characters as half-width and to handle case folding, so I need test cases for this, too.
Attached file sample of mail
Searching by the keyword "новое*" does not work.
Attached patch WIP v0.1 (obsolete) — Splinter Review
Work in progress.  This has only been tested on CentOS 5.4.
Assignee: nobody → m_kato
Attached patch patch v1 (obsolete) — Splinter Review
Attachment #423952 - Attachment is obsolete: true
Attachment #424942 - Flags: review?
Attachment #424942 - Flags: review? → review?(bugmail)
This fix includes the following (a small illustrative sketch follows the list):
- case-insensitive support for non-ASCII and full-width characters
- normalization of full-width alphabetic, numeric, and symbol characters (indexed as ASCII)
- normalization of half-width Katakana (indexed as full-width Katakana)
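To make those bullets concrete, here is a tiny illustrative sketch; the actual patch uses generated mapping tables rather than a hand-written switch, and the function name is made up:

#include <stdint.h>

/* Illustrative examples only -- the real fix is table-driven. */
static uint32_t normalize_char(uint32_t c)
{
  switch (c) {
    case 0x0410: return 0x0430;  /* Cyrillic 'А' case-folded to 'а'          */
    case 0xFF21: return 0x0061;  /* full-width 'Ａ' indexed as ASCII 'a'      */
    case 0xFF76: return 0x30AB;  /* half-width 'ｶ' indexed as full-width 'カ' */
    default:     return c;       /* everything else unchanged                */
  }
}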
Thank you for the patch!  Your continued efforts on creating/improving our tokenizer are awesome and greatly appreciated.

It looks like the case insensitive support for non-ASCII is hanging on the locale system.  Can you elaborate on the benefits/limitations of this?  There are basically no comments about the use of it in the patch except for the one before the scary "#define _GNU_SOURCE".  Based on my quick research, it would appear your code is trying to use the user's default locale which suggests that what we are able to lower-case will vary based on the user and their active locale.  If that is the case, it's obviously better than nothing, but definitely needs to be documented both in the code and elsewhere as it would likely limit the test cases we use and needs to be documented in a place the user can find it too.

I would be interested in alternatives you might have considered/investigated or if there are inherent limitations that require us to always use a locale.  For example, if there are characters for which lowercasing has different results in different locales, that would be very interesting/useful to know.  Is there something that prevents us from assembling a static lower-case mapping table by just sucking up all the locale definitions from a suitably licensed set of locale definitions?  etc.

For half-width Katakana and voiced sound mark, if you could either provide brief blurbs that explain more about their encodings and why they would be used, or provide a link to a page that does so (even just on the unicode website or wikipedia if you think they do a good enough job), it would be very useful.  As the tokenizer grows in size, I think it's very important for maintainability (and my ability to review the code :) for someone reading the code to be able to understand what is happening and why.  Once people become afraid of the code we're in for trouble because then no one wants to touch it.

Also, is there a reason there is no explicit test case for voiced sound mark?
(In reply to comment #11)
> Thank you for the patch!  Your continued efforts on creating/improving our
> tokenizer are awesome and greatly appreciated.
> 
> It looks like the case insensitive support for non-ASCII is hanging on the
> locale system.  Can you elaborate on the benefits/limitations of this?  There
> are basically no comments about the use of it in the patch except for the one
> before the scary "#define _GNU_SOURCE".  Based on my quick research, it would
> appear your code is trying to use the user's default locale which suggests that
> what we are able to lower-case will vary based on the user and their active
> locale.  If that is the case, it's obviously better than nothing, but
> definitely needs to be documented both in the code and elsewhere as it would
> likely limit the test cases we use and needs to be documented in a place the
> user can find it too.
> 
> I would be interested in alternatives you might have considered/investigated or
> if there are inherent limitations that require us to always use a locale.  For
> example, if there are characters for which lowercasing has different results in
> different locales, that would be very interesting/useful to know.  Is there
> something that prevents us from assembling a static lower-case mapping table by
> just sucking up all the locale definitions from a suitably licensed set of
> locale definitions?  etc.

Although I tested on many locales, such as English, German, Japanese, and Chinese, on Windows, Mac OS X, and Linux (CentOS 5), it works fine except in the C locale (LANG=C).

Also, I can create a table for non-ASCII characters (below U+2000) to support case insensitivity.  Do you think it would be better to create it?  The table may become big.

> For half-width Katakana and voiced sound mark, if you could either provide
> brief blurbs that explain more about their encodings and why they would be
> used, or provide a link to a page that does so (even just on the unicode
> website or wikipedia if you think they do a good enough job), it would be very
> useful.  As the tokenizer grows in size, I think it's very important to
> maintenance (and my ability to review the code :) for someone reading the code
> to be able to understand what is happening and why.  Once people become afraid
> of the code we're in for trouble because then no one wants to touch it.

The voiced sound mark exists only in Japanese.  It is like an accent: it combines with the previous character.  For example, "サ" (Katakana) with "゛" (voiced sound mark) becomes "ザ".  Although the full-width character mapping has precomposed characters like "ザ", there are no precomposed characters in the half-width Katakana character mapping.

I should ignore it to optimize the index for Katakana.

> Also, is there a reason there is no explicit test case for voiced sound mark?

It is included in the half-width tests.  The last test case is the half-width form of "サンダーバード", which is the Japanese Katakana for "Thunderbird".  "ダ", "バ", and "ド" carry a voiced sound mark.
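As a rough illustration of how a half-width base character plus a trailing voiced sound mark collapses into one precomposed full-width character (hypothetical function, not the patch's actual code; only a few mappings shown):

#include <stdint.h>

/* Combine a half-width Katakana base with a following half-width voiced
 * sound mark (U+FF9E) into the precomposed full-width character. */
static uint32_t
fold_halfwidth(uint32_t base, uint32_t next, int *consumed_next)
{
  *consumed_next = 0;
  if (next == 0xFF9E) {                                /* voiced sound mark */
    switch (base) {
      case 0xFF80: *consumed_next = 1; return 0x30C0;  /* ﾀ + ﾞ -> ダ */
      case 0xFF8A: *consumed_next = 1; return 0x30D0;  /* ﾊ + ﾞ -> バ */
      case 0xFF84: *consumed_next = 1; return 0x30C9;  /* ﾄ + ﾞ -> ド */
    }
  }
  if (base == 0xFF7B)                                  /* ｻ without a mark */
    return 0x30B5;                                     /* サ */
  return base;
}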
(In reply to comment #12)
> Although I tests on many locales such as English, Germany, Japanese, Chinese
> and etc on Windows, Mac OS X and Linux (CentOS 5), It works fine except to C
> locale (LANG=C).
> 
> Also, I can create the table for non-ascii character (less than U+2000) to
> support case insensitive.  Do you think the better to create it?  Table may
> become big.

I just looked into the locale stuff more and it looks like you picked exactly the right way to do it.  On my system there's basically a single i18n "tolower" map under LC_CTYPE that everything else uses.  Apart from POSIX, very few locales make minor changes to it (tr_TR and km_KH are what I see).  This sounds ideal and fine for testing.

Thank you for the explanations, they are very helpful!

I should be able to get to this by the end of the weekend at the latest.
Status: NEW → ASSIGNED
btw, I looked into how our JS tolower works.  It appears they have a compact table representation based on the java.lang.Character implementation.  The code is in jsstr.h and jsstr.cpp.
(In reply to comment #14)
> btw, I looked into how our JS tolower works.  It appears they have a compact
> table representation based on the java.lang.Character implementation.  The code
> is in jsstr.h and jsstr.cpp.

Java's table is based on a character type table.

To lowercase characters, I may only need the ranges U+0080 - U+05FF and U+1E00 - U+1FFF.
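As an illustration of that range-limited idea, the lookup could be restricted to those two blocks; the table contents and names below are hypothetical, with only a couple of entries filled in:

#include <stdint.h>

/* Sketch of a range-limited lowercase table.  A real table would be
 * generated from the Unicode Character Database; 0 means "no mapping". */
static const uint16_t kLowerTable[0x0600 - 0x0080] = {
  [0x00C0 - 0x0080] = 0x00E0,  /* 'À' -> 'à' */
  [0x0410 - 0x0080] = 0x0430,  /* Cyrillic 'А' -> 'а' */
};

static uint32_t gloda_tolower(uint32_t c)
{
  if (c >= 0x0080 && c < 0x0600) {
    uint16_t lower = kLowerTable[c - 0x0080];
    return lower ? lower : c;
  }
  /* A second table would cover U+1E00 - U+1FFF (Latin Extended Additional
   * and Greek Extended); everything else passes through unchanged. */
  return c;
}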
After consultation on when it's appropriate to blow away the gloda database again, it's been suggested we should not do that until we also have accent folding in the tokenizer.

Since the locale subsystem does not support accent folding, I think we will need to turn to a non-locale table/mapping after all.

It looks like most of the code that tries to do things like these consumes the Unicode Character Database to do so:
http://www.unicode.org/Public/5.1.0/ucd/UCD.html

It looks like mozilla's implementation of such a thing can be found in mozilla/intl/unicharutil.  It also looks less than comprehensive, not particularly documented, and rather out of date.  Unfortunately I don't have time at the moment to investigate it much further.

If you have any thoughts on accent folding and how to accomplish it, I would be very grateful.  Right now I think that will block landing/enabling the changes in this patch.
So I didn't completely give up.  There's an interesting comment on a similar problem at:
https://issues.apache.org/jira/browse/LUCENE-1343?focusedCommentId=12622432&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#action_12622432

nsIUnicodeNormalizer provides NormalizeUnicodeNFKD, and nsIUGenDetailCategory can tell us whether a character is a spacing/combining mark (enum value kUGDC_Mc).  The cost of calling the former might be okay; the cost of the latter could add up.  If we can short-cut the XPCOM interface or cache the results, that could be useful.  nsICaseConversion can do case conversion too.
The fts3 tokenizer shouldn't use XPCOM because it is called from SQLite.  I believe that using the table from jsstr.h is better than nsICaseConversion.
Comment on attachment 424942 [details] [diff] [review]
patch v1

Cancelling review.  I am considering the jsstr.h approach.
Attachment #424942 - Flags: review?(bugmail)
Attached patch patch v2 (obsolete) — Splinter Review
Work in progress...  removes the locale code and uses the table from mozilla/intl.
Attachment #424942 - Attachment is obsolete: true
Attached patch patch v2.1 (obsolete) — Splinter Review
Attachment #426954 - Attachment is obsolete: true
Attachment #427268 - Flags: review?(bugmail)
Comment on attachment 427268 [details] [diff] [review]
patch v2.1

No longer depends on the locale system.  I use the conversion table from mozilla/intl.
Makoto, what are your thoughts on accent folding?  As I noted in comment 16 we cannot land any changes to the tokenizer until we also tackle accent folding.

As I mentioned in comment 17, it sounds like if we perform NFKD, kill the spacing/combining marks, and then lowercase, we might have a good solution for accent folding.  Your patch seems to port the case-conversion logic from mozilla/intl because of an XPCOM concern, but it does not seem to indicate a plan to exclusively use table-driven conversion for accent-folding; nor, if the goal is to avoid XPCOM calls while doing the NFKD step, does it avoid requiring XPCOM calls.

Also, if you could elaborate on why you think we shouldn't make XPCOM calls from inside the tokenizer, that would be useful.  While I agree it would be preferable to avoid the XPCOM calls, there is no real danger as long as we know the XPCOM calls are not going to try and spin an event loop or otherwise trigger execution of code we do not expect.

Thank you for your continued work on the patch.  I'm sorry if I did not sufficiently express before that accent folding was my main concern; that it mooted use of locales was merely a side-effect.
(In reply to comment #23)
> Makoto, what are your thoughts on accent folding?  As I noted in comment 16 we
> cannot land any changes to the tokenizer until we also tackle accent folding.

Although I don't know much about accents in Latin-script languages, how do people search for a word that has an accent or no accent?  But as far as I can tell from Bug 506064, it is better to ignore accents for indexing.

Andrew, what do you think?

 
> As I mentioned in comment 17, it sounds like if we perform NFKD and kill the
> spacing/combining marks and then lowercase we might have a good solution for
> accent folding.  Your patch seems to port the case-conversion logic from
> mozilla/intl over an XPCOM concern but does not seem to indicate a plan to
> exclusively use table-driven conversion for accent-folding, nor if the goal is
> to avoid XPCOM calls and do the NFKD thing does it avoid requiring XPCOM calls.
> 
> Also, if you could elaborate on why you think we shouldn't make XPCOM calls
> from inside the tokenizer, that would be useful.  While I agree it would be
> preferable to avoid the XPCOM calls, there is no real danger as long as we know
> the XPCOM calls are not going to try and spin an event loop or otherwise
> trigger execution of code we do not expect.

The fts3 module is called from SQLite, not Gecko, so I think it should not depend on XPCOM, for safety.
(In reply to comment #24)
> Although I don't know accent much for Latin language, how do they search word
> that has accent or non-accent?  But, as long as I check Bug 506064, it is
> better to be ignore accent for indexing.
> 
> Andrew, how do you think?

I think bug 506064 and I are saying the same thing.  We want the tokenizer to ignore the accent marks; when tokenizing "parís" as per that bug's example, we want the tokenizer to emit "paris".  (And with case-folding also happening, "París" should become "paris".)

My understanding of the NFKD mechanism is that it decomposes the accented "í" into an "i" with a combining mark for the accent and does this for all accented characters.  By performing this transform and then stripping the combining mark, we get the desired result.  There are likely character-centric ways to accomplish this using the unicode database as well.
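A minimal sketch of that pipeline, with hypothetical helper names standing in for whichever normalizer and category/case lookups end up being used (these are placeholders, not actual Mozilla or SQLite APIs):

#include <stddef.h>
#include <stdint.h>

/* Hypothetical helpers (placeholders). */
size_t normalize_nfkd(uint32_t c, uint32_t *out, size_t out_cap);
int is_combining_mark(uint32_t c);
uint32_t fold_lower(uint32_t c);

/* NFKD-decompose each character, drop combining marks, case-fold the rest.
 * With this, U+00ED ('í') decomposes to U+0069 U+0301, the U+0301 is
 * dropped, and the token ends up containing a plain 'i'. */
static size_t fold_token(const uint32_t *in, size_t in_len, uint32_t *out)
{
  size_t out_len = 0;
  for (size_t i = 0; i < in_len; i++) {
    uint32_t decomposed[8];
    size_t n = normalize_nfkd(in[i], decomposed, 8);
    for (size_t j = 0; j < n; j++) {
      if (is_combining_mark(decomposed[j]))
        continue;
      out[out_len++] = fold_lower(decomposed[j]);
    }
  }
  return out_len;
}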

If you do not feel comfortable or do not have the time to try and implement this, that is fine, but we can't land changes to the tokenizer that would affect its output until we have a patch that also handles accents.  (Because of the need to blow away the gloda database.)
 
> fts3 module is called from sqlite not gecko, so I think that it should not
> depend on XPCOM for safety.

The mozStorage component's support for custom functions already results in SQLite calling into XPCOM code.  I am familiar with both the mozStorage code and SQLite code and it is fine as long as we 1) only affect reference counting on XPCOM instances with threadsafe addref/release and 2) do not spin or call anything that might spin an event loop.
Now I am testing a new patch with a new normalization table (for removing accents, case insensitivity for non-ASCII, full-width ASCII, and half-width katakana).
The normalization function of the new patch is generated from nfkc.txt and nfkc_cf.txt in libicu...

The new fix will be included in the Bug 506064 fix.
(In reply to comment #26)
> Now I am testing new patch with new normalization table (for removing accent,
> case insensitive for non-ASCII, full-width ASCII and half-width katakana).
> Normalization function of new patch generates by nfkc.txt and nfkc_cf.txt into
> libicu...

This is great news!  But please put the patch on this bug since this bug has all of the relevant technical discussion thus far and it's not even clear if that bug is about SQLite FTS3-driven search or Thunderbird's search subsystem.
Summary: Gloda search is case sensitive for non-ASCII letters → Gloda search tokenizer should perform case-folding and accent-folding to be case-insensitive and accent-insensitive, also handle half-width katakana
Attachment #427268 - Flags: review?(bugmail) → review-
Attached patch WIP v3 (obsolete) — Splinter Review
Attachment #427268 - Attachment is obsolete: true
Awesome!

I'll wait for you to request review (and turn back on your unit tests :), but here's one request:

In copy_stemmer if you could either change variables "i" and "j" to something more explanatory or add comments about what they are used for, that would be helpful.  (I realize they were already named that way, but you are increasing the size and complexity of the function so it's not just simple byte-copying anymore.  Also I would have complained at the original author if I could :).
blocking-thunderbird3.1: --- → ?
I don't think we'd block 3.1 for this if it were the last bug standing, but it would be really really great to get it.

This is a big enough change that I think if we're going to take it for 3.1, it needs to land by the feature freeze date so that it gets sufficient baking time before we ship.  Feature freeze is currently scheduled to be two weeks from today.
blocking-thunderbird3.1: ? → -
Flags: wanted-thunderbird+
I understand that this patch fixes a set of real, significant issues, but I'd like to get a better feel for the quality and quantity of pain being felt by our users.

Roland, sipaq: have we gotten much feedback about the problems this bug fixes in 3.0?  If so, can you summarize what the relative quantity and tone of that feedback has been like?
No feedback regarding this issue has reached me.
In this case I don't think we can judge the pain by the feedback we are getting - which doesn't help you much I'm afraid, but it's a datapoint. Indeed I think you'd be hard pressed, via a serious search, to find people posting about such issues.  Perhaps polling folks in moz-japan and the like on issues like this might be beneficial.

(posted before but I tend to forget this myself - participation in bugzilla and our various fora + gsfn is not reflective of our user population. IOW, we don't hear from users about their international issues as much as we need to.)  I think this is also reflected in the general health of the bugs tagged intl. We get some wins via what gets fixed in core, but otherwise our intl bugs tend to linger.
Search was one of the major pieces of reworked functionality introduced in TB3. MoCo advertises this at http://www.mozillamessaging.com/en-US/thunderbird/, so people downloading the program think they can actually use the new functionality, but they can't if their language differs from English. So people will download TB to try the new functionality, find out this new super-uber search isn't doing its thing, and of course feel cheated. Such international issues shouldn't be treated as second class when releasing a major version.
This is good input, thanks.  I have a bunch of thoughts, but before I spend time enumerating them in detail, I suspect it would be most productive if m_kato could give an idea of how much work there is left to do, and what kind of time he's likely to have for this in the near future....
Attachment #432107 - Flags: review?(bugmail)
This looks great!  Thank you!

I probably will not be able to do a full review pass where I try out the scripts and such until tomorrow, but I expect this to land with only minor changes.  (I want to paste in some of m_kato's comments from the thread, mainly.)

I may defer the landing by a few days in order to coordinate making a simultaneous gloda schema change.  If it looks like that is going to take longer than that, we may just have to bump the schema once for this patch and once later for gloda changes.
I'm marking this as blocking and myself as the responsible party to make sure this is fully on the drivers/my radar and gets into beta 2.
Assignee: m_kato → bugmail
blocking-thunderbird3.1: - → beta2+
Whiteboard: [gloda key] → [gloda key][asuth needs to review]
Comment on attachment 432107 [details] [diff] [review]
patch v4 http://hg.mozilla.org/comm-central/rev/d67ca271fb2d

r=asuth with the changes I am attaching in a patch for your review...
Attachment #432107 - Flags: review?(bugmail) → review+
This does the following:

- Makes generate_table support "0000>" syntax where we 'nuke' the source character.  For our purposes, nuking turns it into a space character to force it to delimit.  This may be the wrong thing to do since I believe one of the examples is a soft-hyphen which should probably just disappear into the ether, but it is definitely better than leaving the soft-hyphen in as part of the token.

- Makes generate_table support "0000..0000>0000" syntax where multiple things map to the same thing.  This is largely used for the nuking purpose above.

- Makes generate_table do the full transitive closure on the contents of the array.  Just doing array[array[i]] was not sufficient since there can be multiple chained decompositions followed by a case conversion.  This eliminates the need for the hacky manual attempt to perform some last-ditch case conversion.  (A sketch of the idea is at the end of this comment.)

- Adds two new tests of the normalization logic.  The new 'yesterday' test passes with my changes but fails without them (needs the transitive closure).  The 'awesome' test is a less extensive variant of the test, but already passed without my changes.

- Uses a regenerated Normalize.c with the modified table generation script

- Avoids the potential for buffer overrun because normalization can result in a longer utf-8 encoding for the normalized character than the original character.  (example cited in comments)

- Changes the copy_stemmer logic to operate on whole utf-8 encoded unicode characters.

- Changes the copy_stemmer logic to no longer treat tokens with numbers in them differently.

- Adds various comments.

The patch *does not* rev the gloda DB schema right now.  I'll add that in at the last minute when I push the patches; there may be other schema changes at the same time...
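A rough sketch of what the transitive-closure pass amounts to (illustrative C rather than the actual table generation script; it assumes identity entries for unmapped characters and no cycles in the table):

#include <stddef.h>
#include <stdint.h>

/* Follow each mapping to its fixed point so that a chained decomposition
 * followed by a case conversion (e.g. a precomposed character mapping to
 * an uppercase base letter, which in turn maps to lowercase) collapses
 * into a single direct entry. */
static void close_mapping(uint32_t *map, size_t len)
{
  for (size_t i = 0; i < len; i++) {
    uint32_t target = map[i];
    while (target < len && map[target] != target)
      target = map[target];
    map[i] = target;
  }
}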
Attachment #433005 - Flags: review?(m_kato)
Er, just to be clear, you need to apply m_kato's v4 patch then apply mine on top of that.
Assignee: bugmail → m_kato
Comment on attachment 433005 [details] [diff] [review]
review changes patch v1 on top of m_kato's v4 patch http://hg.mozilla.org/comm-central/rev/5d97bb8bf0cc

Great!
Attachment #433005 - Flags: review?(m_kato) → review+
pushed to comm-central, including a db schema bump:
http://hg.mozilla.org/comm-central/rev/d67ca271fb2d
http://hg.mozilla.org/comm-central/rev/5d97bb8bf0cc

victory Ǡccȇnts for everybody!
Status: ASSIGNED → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Attachment #432107 - Attachment description: patch v4 → patch v4 http://hg.mozilla.org/comm-central/rev/d67ca271fb2d
Attachment #433005 - Attachment description: review changes patch v1 on top of m_kato's v4 patch → review changes patch v1 on top of m_kato's v4 patch http://hg.mozilla.org/comm-central/rev/5d97bb8bf0cc
Whiteboard: [gloda key][asuth needs to review] → [gloda key]
Target Milestone: --- → Thunderbird 3.1b2
Depends on: 579870
Keywords: relnote