Closed Bug 145975 Opened 22 years ago Closed 14 years ago

Implement nsCaseInsensitiveUTF8StringComparator()

Categories

(Core :: Internationalization, defect)

Type: defect
Priority: Not set
Severity: normal

Tracking

Status: RESOLVED FIXED
blocking2.0: -

People

(Reporter: bzbarsky, Assigned: justin.lebar+bug)

References

Details

(Keywords: intl)

Attachments

(2 files, 15 obsolete files)

15.75 KB, patch
45.82 KB, patch (smontagu: review+)
At the moment, URI specs are UTF8Strings.  This includes the hostname (now that
IDN support is in).  Hostname comparisons should be case-insensitive.  The
obvious way to do such a comparison is via:

Compare(host1, host2, some_comparator);

but unfortunately there is no suitable comparator.  So the only way to properly
compare two hostnames at the moment is:

NS_ConvertUTF8toUCS2 host1UCS(host1);
NS_ConvertUTF8toUCS2 host2UCS(host2);
Compare(host1UCS, host2UCS, nsCaseInsensitiveStringComparator());

This is pretty obviously sub-optimal.  It would be much better to have a
comparator that can do case-insensitive comparison of UTF8 strings directly.
Better to re-assign to the string owner.
jag?
Assignee: yokoyama → jaggernaut
Code issue, QA to yokoyama for now.
Keywords: intl
QA Contact: ruixu → yokoyama
How soon will hostnames actually be in UTF-8?
they are right now, assuming of course that you visit a site that uses them ;)
So DNS supports UTF-8 domain names?
By that I don't mean our implementation of it, but rather the infrastructure.
no, DNS only supports 7-bit, case-insensitive ASCII text.  that probably won't
change anytime soon.  the current solutions in place include converting UTF-8
text to a case-insensitive ASCII version that does work with DNS.  in
other words, there are special ASCII domain names that should be converted to
unicode before being shown to the user.  these we handle via an external
implementation of nsIIDNService.  there are plans to implement a version of this
service within mozilla (which i believe are underway).

without the plugin, UTF-8 domain names will be URL-escaped before being sent to
the DNS server.  there is also a bug on not doing this, but instead sending
UTF-8 text to the DNS server, the thought being that DNS servers might
eventually support UTF-8 domain names directly.

my point (in my previous comment) was that our APIs support UTF-8 domain names,
and if a webpage includes a UTF-8 domain name, that UTF-8 text will enter mozilla
and be passed around.  so, even if we lack any means of communicating with the
UTF-8 host, we should still be able to handle UTF-8 hostnames internally...
after all, UTF-8 domain name support is just a matter of installing a drop-in
XPCOM component.
The question I should've asked: by when do you need this?
i don't know of any examples where this is needed, but maybe bz was thinking of
a few... bz?
This could be needed in cookie code, in autocomplete code, and possibly in the 
security manager.  I suspect autocomplete is done in JS; ccing morse and mstoltz 
for their takes on the issue.

I filed this as a "we should have this working at some point" kind of thing; I 
don't actually visit any sites that have non-ASCII domain names, so I have no 
examples of anything being broken by whatever method we use for comparisons at 
the moment.
Just FYI, listed below are cases where the 'simple' case-conversion crosses
UTF-8 length boundaries. The list was posted to the Unicode list by Markus Scherer.


U+0130 simple-lowercases to U+0069
U+0130 is LATIN CAPITAL LETTER I WITH DOT ABOVE

U+0131 simple-uppercases to U+0049
U+0131 is LATIN SMALL LETTER DOTLESS I

U+017f simple-uppercases to U+0053
U+017f is LATIN SMALL LETTER LONG S

U+1fbe simple-uppercases to U+0399
U+1fbe is GREEK PROSGEGRAMMENI

U+2126 simple-lowercases to U+03c9
U+2126 is OHM SIGN

U+212a simple-lowercases to U+006b
U+212a is KELVIN SIGN

U+212b simple-lowercases to U+00e5
U+212b is ANGSTROM SIGN
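The length boundaries in the list above are easy to check from the UTF-8 encoding ranges alone. A minimal sketch (the helper name is mine, not from any patch here):

```cpp
#include <cassert>
#include <cstdint>

// Number of bytes needed to encode a codepoint in UTF-8.
// (Hypothetical helper for illustration only.)
int Utf8Length(uint32_t cp) {
  if (cp < 0x80)    return 1;
  if (cp < 0x800)   return 2;
  if (cp < 0x10000) return 3;
  return 4;
}
```

For example, U+0130 is two bytes in UTF-8 but its simple lowercase U+0069 is one byte, and U+2126 is three bytes while its lowercase U+03C9 is two.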

Depends on: 231782
QA Contact: tetsuroy → i18n
Blocks: 570975
In case this is useful, ICU (from IBM?) has a C API that looks like it can do this. It has a compatible license as far as I can tell (looks BSD-ish).

http://icu-project.org/apiref/icu4c/ucol_8h.html

'ucol_strcollIter'
Attached patch POC (obsolete) — Splinter Review
This patch implements the necessary intl/ primitives to create a nsCaseInsensitiveUTF8Comparator.  I haven't banged on it too hard, but it does pass the testcase in UnicharSelfTest.cpp.

This is built on my nsICaseConverter deCOM patch in Bug 575043.
(In reply to comment #13)
> In case this is useful, ICU (from IBM?) has a C API that looks like it can do
> this. It has a compatible license as far as I can tell (looks BSD-ish).
We may want to include ICU in the future too in order to use FTS3 (or 4) in SQLite in a Unicode-aware manner.
Attached patch XPCOM String API changes (obsolete) — Splinter Review
Completely untested, but it compiles.  Want to play with it sdwilsh?
This is built on top of my patch for Bug 578714, so you'll need that patch too.  That's built on top of my patch for bug 575043 :-P

Maybe I should just give you an easier diff in the morning.
We're going to want this for Bug 570975 if nothing else.
Assignee: jag-mozilla → me
Status: NEW → ASSIGNED
blocking2.0: --- → ?
I'll play with it if bug 570975 gets blocking+, but otherwise I won't be able to for a while.
Comment on attachment 458940 [details] [diff] [review]
UTF-8 intl primitives

I've only skimmed this, but I'm not overjoyed about adding all these extra tables. Would it be so bad for performance to convert a single character, or rather pair of characters, to UTF-16 and use the existing tables, when you need the lowercase compare? Since you do this:

>+// This function is mildly complicated because we don't want to pay the cost
>+// of comprehending every UTF-8 character.  Instead we treat the arrays as byte
>+// streams until we find a mismatch.  Once we find a mismatch, we backtrack
>+// if necessary until we've isolated a UTF-8 character from each array,
>+// then lowercase and compare.

(which is a plan so cunning that you could put a tail on it and call it a weasel), I wouldn't have thought that the performance cost would be so great.
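The backtracking step the quoted comment describes — stepping backwards over UTF-8 continuation bytes (10xxxxxx) until hitting the lead byte of the character — might be sketched as follows (the helper name is mine, not the patch's; the aStart guard is needed so malformed input can't walk off the front of the buffer):

```cpp
#include <cassert>

// Given a pointer into a UTF-8 buffer, back up to the lead byte of the
// sequence containing *p.  Continuation bytes have the bit pattern
// 10xxxxxx, so we step backwards while the top two bits are 10.
const char* BackUpToLeadByte(const char* aStart, const char* p) {
  while (p > aStart && (static_cast<unsigned char>(*p) & 0xC0) == 0x80) {
    --p;
  }
  return p;
}
```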
I haven't benchmarked it, but if we're going to convert to UTF-16 internally what's the point of implementing this at all?
FWIW, I do intend to benchmark it, I just haven't had time.
What's the size cost of the tables?  Time > space for the vast majority of this stuff, though cache effects have turned some of the "look up rather than compute" logic on its head for game developers recently, at least.
3 * sizeof(PRUint32) * ~ 350

negligible IMO
(In reply to comment #22)
> I haven't benchmarked it, but if we're going to convert to UTF-16 internally
> what's the point of implementing this at all?

The point is that in practice we will very rarely need to make the conversion, especially if you special-case < 0x80
OTOH, if 3 * sizeof(PRUint32) * ~ 350 is negligible, we can also fix bug 210501, since the extra size for expanding the existing tables from UCS-2 to UCS-4 is half that ;-)
4200 bytes is negligible, yes.  Did someone really claim that 2100 bytes was too much to add for non-BMP case folding?  I'm afraid to look at the bug now.
I'll try to get some solid numbers relatively soon.
(In reply to comment #27)
> OTOH, if 3 * sizeof(PRUint32) * ~ 350 is negligible, we can also fix bug
> 210501, since the extra size for expanding the existing tables from UCS-2 to
> UCS-4 is half that ;-)

I'm breaking most of the same interfaces we'd need to break to do that, so we could probably pick up Bug 210501 at some point.
Kyle, are you going to have time to work on this now that you're back in school?  If not, I might try and pick this one up.
Probably not.  This mostly just needs testing (perf and correctness).  If you have the desire and time to push it forward please do.
Depends on: 587550
The UTF-8 intl primitives patch won't apply on tip of trunk; it references intl/unicharutil/public/nsCaseConversion.h, which doesn't exist.

Am I missing a patch in my queue?
(In reply to comment #33)
> The UTF-8 intl primitives patch won't apply on tip of trunk; it references
> intl/unicharutil/public/nsCaseConversion.h, which doesn't exist.
> 
> Am I missing a patch in my queue?

Nope.  I wrote this before I finished the deCOM bug.  You should be able to make the added CaseInsensitiveCompare not a member of nsCaseConversion, have it just live in nsUnicharUtils instead, and be all set.
Ah, I see.  I'll try and fix it up then.  Thanks.
Attached patch Folded, rebased patch (obsolete) — Splinter Review
Patch based on tip of trunk (and my test changes in bug 587550).  I folded Khuey's two patches into one since it was hard for me to keep track of which changes belonged where, but I can split them up again for review if this matters to anyone.
Attachment #458940 - Attachment is obsolete: true
Attachment #458941 - Attachment is obsolete: true
> From nsCompressedUTF8Map::Map()
>
> PRUint32 aChar;
> PRUnichar res;
> if (PR_INT16_MIN < (aChar - res) && (aChar - res) < PR_INT16_MAX)
>   mCache[aChar & CASE_MAP_CACHE_MASK] =
>     (((aChar << 8) & 0xFFFF0000) | (aChar - res));

Why are the inequalities in the |if| strict?  It looks like what you mean is

  if (!(0xffff0000 & (aChar - res)))

but maybe I'm misunderstanding.
(In reply to comment #37)
> > From nsCompressedUTF8Map::Map()
> >
> > PRUint32 aChar;
> > PRUnichar res;
> > if (PR_INT16_MIN < (aChar - res) && (aChar - res) < PR_INT16_MAX)
> >   mCache[aChar & CASE_MAP_CACHE_MASK] =
> >     (((aChar << 8) & 0xFFFF0000) | (aChar - res));
> 
> Why are the inequalities in the |if| strict?  It looks like what you mean is
> 
>   if (!(0xffff0000 & (aChar - res)))
> 
> but maybe I'm misunderstanding.

I believe you're correct.

Another bug to watch out for is that IIRC the patch I wrote will backtrack to before the beginning of the string when presented with appropriately malformed input.
gencasetable.pl is generating 64-bit negative numbers on my 64-bit machine.  How do I get perl to output just 32 bits?
Assignee: me → justin.lebar+bug
Attached patch Patch v2 (obsolete) — Splinter Review
I managed to get perl to do my bidding.
Attachment #466796 - Attachment is obsolete: true
Attached patch Patch v2.1 (obsolete) — Splinter Review
Whoops; forgot to qref.  This one should compile.
Attachment #466920 - Attachment is obsolete: true
(In reply to comment #38)
> Another bug to watch out for is that IIRC the patch I wrote will backtrack to
> before the beginning of the string when presented with appropriately malformed
> input.

What's the contract for these functions when given malformed input?  Are they supposed to be robust?
I wonder how we should test this.  The code in UnicharSelfTest.cpp seems pretty ad-hoc.  Perhaps an automated approach would be better, but that might also be overkill.

Any thoughts?
(In reply to comment #42)
> (In reply to comment #38)
> > Another bug to watch out for is that IIRC the patch I wrote will backtrack to
> > before the beginning of the string when presented with appropriately malformed
> > input.
> 
> What's the contract for these functions when given malformed input?  Are they
> supposed to be robust?

They need to be able to deal with untrusted input (not crash/overflow) but if the input isn't valid UTF-8 I don't think the answers need to make sense.
Attached patch Patch v3 (obsolete) — Splinter Review
I fixed a number of subtle bugs in the case-insensitive search just by running the crappy testcase in UnicharSelfTest.  There are a lot of special cases; it's going to take a pretty decent test to convince me that it's right.
Attachment #467042 - Attachment is obsolete: true
Jonas and I spoke yesterday about testing this.  We were thinking I could test the case-insensitive comparison by iterating over many random strings, as follows:  Pick two byte strings.  If they're both valid UTF-8, convert them to UTF-16 and check that the case-insensitive comparison gives the same result in both encodings.  If either is invalid UTF-8, compare them anyway and make sure we don't crash.

We could run under Valgrind to detect more errors.  And we could be more clever than totally random strings -- we could perhaps test all combinations of bytes with zero, one, two, three, and four high bits set.

I just spent a few hours messing with the tests and build system, and I'm now less hopeful about this testing strategy.  I can't access IsUTF8 from here -- it's an internal string API -- which makes it difficult to test the algorithm for correctness.  We might be able to find memory corruption bugs by testing with random strings, but I'm not sure it's worth it.

Maybe the test we have is sufficient.  If so, patch v3 should be ready for review.
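For what it's worth, a standalone validity check along the lines of IsUTF8 isn't hard to write outside the string internals. A minimal sketch (my code, not Mozilla's IsUTF8) that rejects bad lead/continuation bytes, overlong encodings, surrogates, and out-of-range codepoints:

```cpp
#include <cassert>
#include <cstddef>

// Minimal UTF-8 validity check, for illustration only.
bool IsValidUtf8(const unsigned char* s, size_t len) {
  for (size_t i = 0; i < len; ) {
    unsigned char b = s[i];
    size_t n;          // number of continuation bytes
    unsigned long cp;  // accumulated codepoint
    if (b < 0x80)      { i++; continue; }       // ASCII fast path
    else if (b < 0xC2) { return false; }        // stray continuation or overlong lead
    else if (b < 0xE0) { n = 1; cp = b & 0x1F; }
    else if (b < 0xF0) { n = 2; cp = b & 0x0F; }
    else if (b < 0xF5) { n = 3; cp = b & 0x07; }
    else               { return false; }        // > U+10FFFF lead
    if (i + n >= len) return false;             // truncated sequence
    for (size_t j = 1; j <= n; j++) {
      if ((s[i + j] & 0xC0) != 0x80) return false;
      cp = (cp << 6) | (s[i + j] & 0x3F);
    }
    if (n == 2 && cp < 0x800)   return false;   // overlong 3-byte form
    if (n == 3 && cp < 0x10000) return false;   // overlong 4-byte form
    if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // surrogate
    if (cp > 0x10FFFF) return false;
    i += n + 1;
  }
  return true;
}
```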
This might help with test coverage a bit:

http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt

Also, do Perl or Python have test suites that cover case-insensitive comparison of UTF-8 (or Unicode in general), perhaps for their regular-expression capabilities?
Simon has a good set of sample valid text in many languages.  Testing error handling will be more interesting.

FWIW, we don't have any real testing of the existing code in this area (other than what the testsuite naturally exercises of course, which is a lot of it).
Attached patch Patch v3.1 (obsolete) — Splinter Review
This includes a pretty dumb fuzzer.  Didn't find any problems.
Attachment #467201 - Attachment is obsolete: true
blocking2.0: ? → -
Attached patch Patch v3.2 (obsolete) — Splinter Review
Benchmarks in a moment.
Attachment #467610 - Attachment is obsolete: true
Attached patch Ugly benchmark code (obsolete) — Splinter Review
This benchmark calls the case-insensitive UTF-8 and UTF-16 comparators with three workloads:

 * two identical ASCII strings,
 * one all lower-case and one all upper-case ASCII string, and
 * one mixed-case unicode string and one all upper-case unicode string.  (At
   least, I think one string is mixed and one is uppercase; I didn't
   double-check.) 

Since both functions do a bitwise comparison when the strings are identical, I'd expect comparing two identical Unicode strings to have similar performance characteristics to comparing two identical ASCII strings.

I timed three operations for each pair of strings:
 * converting the strings from UTF8 to UTF16,
 * comparing the two strings as UTF16, and
 * comparing the two strings as UTF8.
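The timing itself can be collected with a bare wall-clock loop; a generic sketch of that kind of harness (mine, not the attached benchmark code):

```cpp
#include <chrono>
#include <functional>

// Time a callable over `iters` iterations, returning elapsed microseconds.
// A sketch of a micro-benchmark harness; the attachment is the real thing.
long long TimeMicros(const std::function<void()>& fn, int iters) {
  auto start = std::chrono::steady_clock::now();
  for (int i = 0; i < iters; i++) fn();
  auto end = std::chrono::steady_clock::now();
  return std::chrono::duration_cast<std::chrono::microseconds>(end - start)
      .count();
}
```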

Here are the results, using patch v3.2:

Identical strings:
  UTF16 convert=904us, compare=141us, convert+compare=1045us
  UTF8 compare=217us
All upper vs. all lower ASCII:
  UTF16 convert=814us, compare=991us, convert+compare=1805us
  UTF8 compare=1327us
Mixed case vs. all upper unicode
  UTF16 convert=6137us, compare=3277us, convert+compare=9414us
  UTF8 compare=12984us

The UTF-8 comparison is a clear win in the all ASCII case.  The UTF-8 comparison itself is a bit slower than the UTF-16 comparison, presumably because the UTF-8 loop is more sizable (even though, in the identical strings case, both the UTF-8 and UTF-16 loops do a simple bitwise comparison), but still much faster than convert+compare.

If we throw in enough unicode characters, however, it becomes faster to convert to UTF-16 and compare in that domain.  This is pretty sensible, since the unicode case is really no worse for the UTF-16 code than the ASCII case, while the UTF-8 comparator needs to backtrack and whatnot.

I think how we proceed from here depends on what we're optimizing for.  Do we think most of our strings will be identical?  Then maybe the backtracking loop khuey wrote is pretty good.  If we think that most of our strings will be different, however, then we can make the opposite assumption and not backtrack.

My take on this is that the most important user of this function is the awesomebar, which pumps most of the places database through this comparator.  I imagine most people will type lower-case text into their location bar, and most page titles are in title case.  So perhaps we should optimize for the case where most, but not all, letters are identical.
> I think how we proceed from here depends on what we're optimizing for.

I was going to suggest instrumenting this function to print out which case it fell into and then pushing to try, but that won't exercise the awesomebar, of course.

Maybe instrument it and browse for a bit?
(In reply to comment #51)
> I think how we proceed from here depends on what we're optimizing for.  Do we
> think most of our strings will be identical?  Then maybe the backtracking loop
> khuey wrote is pretty good.  If we think that most of our strings will be
> different, however, then we can make the opposite assumption and not backtrack.
> 
> My take on this is that the most important user of this function is the
> aweseomebar, which pumps most of the places database through this comparator. 
> I imagigne most people will type lower-case text into their location bar, and
> most page titles are in title case.  So perhaps we should optimize for the case
> where most, but not all, letters are identical.
I think you are mistaken in your assumptions about the location bar.  It will search through every single bookmark you have (url and title), every history entry (url, title, and tags) after splitting things up into tokens depending on your search preferences.  It will stop after it finds 12 matches (this is also pref controllable).  I think the common case here is that the letters are not identical.
Ah, indeed.  Then the common case is overwhelmingly that the strings don't match at all, yes?
By identical I assume he's referring to case.  Two strings that are substantively different will fall out of the comparator almost immediately, so this will be a massive win against the current version that pays the overhead of two string conversions before rejecting them.

We really should be comparing against Simon's suggestion from comment 21.  We all agree that the current method is bad.
(In reply to comment #55)
> By identical I assume he's referring to case.  Two strings that are
> substantively different will fall out of the comparator almost immediately, so
> this will be a massive win against the current version that pays the overhead
> of two string conversions before rejecting them.
oh, case is a crap-shoot then.
Attached patch Patch v3.3 (obsolete) — Splinter Review
I manually inlined the ASCII case conversion, and now the UTF8 code is faster than the UTF16 code on the different-case ASCII test.

Identical strings:
  UTF16 convert=890us, compare=134us, convert+compare=1024us
  UTF8 compare=201us
All upper vs. all lower ASCII:
  UTF16 convert=819us, compare=998us, convert+compare=1817us
  UTF8 compare=705us
Mixed case vs. all upper unicode:
  UTF16 convert=6211us, compare=3060us, convert+compare=9271us
  UTF8 compare=12047us

I'm going to collect some profiling data from people who have non-ASCII characters in their history.
Attachment #468508 - Attachment is obsolete: true
I looked through the data from my machine, and I'm now not sure this is worth fretting over much further.  The vast majority of comparisons will be rejected on the first character, so small optimizations within the loop aren't going to make a big difference.

A much bigger candidate for optimization, I think, is lower-casing what's in the URL bar *once* and then using a half-case-insensitive comparator which assumes that one of the operands is already lower-case.
I don't think that doing that gives the same behavior as doing case-insensitive comparisons, with non-ASCII Unicode. Specifically, and this may be the only such example, "I" and "İ" should probably not test equal, but lowercasing the former gives "i" which is in fact case-insensitively equal to "İ".

And you can't just lowercase both and compare exactly, since if you do that you will end up with "I" not testing equal to "ı", which it should be.
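The asymmetry described above can be reproduced with the simple case mappings alone. A sketch with the relevant mappings hard-coded for illustration (the real tables live in intl/; these values come from the list in comment 8):

```cpp
#include <cstdint>

// Simple case mappings for the dotted/dotless I family, hard-coded
// for illustration only.
uint32_t SimpleLower(uint32_t cp) {
  if (cp == 0x0049) return 0x0069;  // I -> i
  if (cp == 0x0130) return 0x0069;  // İ -> i (simple mapping)
  return cp;                        // ı (U+0131) has no lowercase mapping
}

uint32_t SimpleUpper(uint32_t cp) {
  if (cp == 0x0069) return 0x0049;  // i -> I
  if (cp == 0x0131) return 0x0049;  // ı -> I
  return cp;
}
```

Lowercasing both sides makes "I" and "İ" compare equal (which arguably they shouldn't) while leaving "I" and "ı" unequal (which arguably they should be); uppercasing both sides has exactly the opposite behavior, so neither single direction gives true case-insensitivity.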
(In reply to comment 59)
The UTF-8 comparison code in this patch may be broken, then.  I'm not sure about the UTF-16 code.

I'll have a look.
(In reply to comment #59)
> I don't think that doing that gives the same behavior as doing case-insensitive
> comparisons, with non-ASCII Unicode. Specifically, and this may be the only
> such example, "I" and "İ" should probably not test equal, but lowercasing the
> former gives "i" which is in fact case-insensitively equal to "İ".
> 
> And you can't just lowercase both and compare exactly, since if you do that you
> will end up with "I" not testing equal to "ı", which it should be.

All of our current code does this AIUI.
> All of our current code does this AIUI.

Then our current code is buggy, no?  Looks like it was doing that even before the patch for bug 575043, though.... and in fact may have been broken all along.  Unless I'm missing something?  Simon?
Yes, we have always had that bug. Even if we fix it for text that we know is in Turkish (bug 231162), I'm not sure if there is a good solution for contexts like the awesome bar :(
Simon, would you be willing to take this patch as-is (with a very simple fuzzer and a single test comparing two valid strings), or do you want more extensive testing of valid text?
Are those the only alternatives? ;-)

I would be happy with three tests rather than one: one of ASCII-only strings, one of trans-ASCII Unicode strings, and one where the two strings are different lengths in UTF-8.  (There are a bunch of characters in the range U+2C60-2C7F that have case mappings in the range U+02xx.)
Attached patch Patch v3.4 (obsolete) — Splinter Review
Attachment #468545 - Attachment is obsolete: true
Attachment #470381 - Flags: review?
Attachment #470381 - Flags: review? → review?(smontagu)
The patch above has a Unicode test (with different-length strings) and a few ASCII sanity checks in addition to the naive fuzzer.

I also realized when I was going over the patch that I'd only inlined one of the case conversions.  Below are results from the same benchmark as comment 57, but with both case conversions inlined.  It's substantially faster, especially in the all upper vs. all lower ASCII test.

Identical strings:
  UTF16 convert=975us, compare=135us, convert+compare=1110us
  UTF8 compare=179us
All upper vs. all lower ASCII:
  UTF16 convert=898us, compare=1118us, convert+compare=2016us
  UTF8 compare=536us
Mixed case vs. all upper unicode:
  UTF16 convert=6098us, compare=3052us, convert+compare=9150us
  UTF8 compare=11980us
It looks like I need another function for bug 570975: CaseInsensitiveStartsWith.  I'll post a new patch with this function soon.
Attached patch Patch v3.5 (obsolete) — Splinter Review
This has what I need for bug 570975.
Attachment #470381 - Attachment is obsolete: true
Attachment #470854 - Flags: review?(smontagu)
Attachment #470381 - Flags: review?(smontagu)
Benchmarking is revealing that CaseInsensitiveStartsWith is slower than I'd like.  Simon, would you mind reviewing all but that function (i.e. patch v3.4)?  I'll post a separate patch once I figure out what I need for bug 570975 to be fast.
Attachment #470381 - Flags: review?(smontagu)
Attachment #470854 - Attachment is obsolete: true
Attachment #470854 - Flags: review?(smontagu)
Attachment #470381 - Attachment is obsolete: false
I just re-implemented the case-insensitive comparator in terms of a character-by-character comparator (i.e., what was suggested in comment 21), and it's much slower than Khuey's approach in the identical and different-case ASCII cases and just as slow in the Unicode case.

That said, I still want the character-by-character comparator I wrote, because it makes the awesomebar really fast.  I'll post what I have soon.
Attached patch Patch v4 (obsolete) — Splinter Review
This patch adds CaseInsensitiveUTF8CharEquals().

Sorry for the churn, Simon!  I'm happy to provide interdiffs if you've already started looking at this.
Attachment #470381 - Attachment is obsolete: true
Attachment #471347 - Flags: review?
Attachment #470381 - Flags: review?(smontagu)
Attachment #471347 - Flags: review? → review?(smontagu)
Good news and bad news:

The good news is that I've managed to speed this up and get rid of a substantial amount of code.  The speedup comes from noticing that we were missing a call to IS_NOCASE_CHAR in the UTF8 code, so we were spending a lot more time in the case map than we needed to be.

The size decrease comes from the following observation: the code previously avoided computing UTF-8 codepoints (instead, it worked with the concatenation of the representative bytes), but I realized that computing the real codepoint would come at the cost of just a few bitwise ANDs.  This change lets us reuse pretty much all of the existing UTF-16 infrastructure, and so decreases the size of the patch significantly.
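The codepoint computation in question really is just a few masks and shifts; for a three-byte sequence (1110xxxx 10xxxxxx 10xxxxxx), something like this sketch (mine, not the patch's code, and assuming valid input):

```cpp
#include <cstdint>

// Decode a three-byte UTF-8 sequence into its codepoint with a few
// bitwise ANDs and shifts.  Assumes the input is a valid sequence.
uint32_t DecodeUtf8ThreeByte(const unsigned char* s) {
  return (uint32_t(s[0] & 0x0F) << 12) |
         (uint32_t(s[1] & 0x3F) << 6) |
          uint32_t(s[2] & 0x3F);
}
```

Note that EF BF BF, the last three-byte sequence, decodes to U+FFFF, so three-byte results always fit in 16 bits.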

The bad news is that I basically rewrote the patch *again*.  This canceling of reviews is getting kind of silly, so I'm not going to request review until I've let this new one bake for a while.

I'll put up the patch shortly; I just need to clean it up a bit.
Attachment #471347 - Flags: review?(smontagu)
Interesting, maybe there are some code alignment effects going on here?  I'm finding that the UTF16 identical compare time varies from 270us to 140us depending on changes I make to other, unrelated functions.
Oh, no; this is probably the scheduler peeking its head in here.  Changing the code has nothing to do with it.  I'll see if I can fix the benchmark.
Attached patch Patch v5 (obsolete) — Splinter Review
I think I'm done with this, hopefully for real this time.  Benchmarks in a moment.
Attachment #471347 - Attachment is obsolete: true
Attachment #472874 - Flags: review?(smontagu)
Executive summary: The UTF-8 comparison functions are substantially faster than converting to UTF-16 and comparing.  Speedups are on the order of 2-3x.

You'll need patch v4 from bug 570975 to run the benchmark yourself.  (I did say it was ugly.  :)

You'll also probably want to enable -O3 if you're on Linux.  I make no guarantees that the new code will be fast if you're using -Os (currently the default on Linux).  See bug 590181.

These results are from my Linux x64 box with -O3.

These test CaseInsensitiveUTF8Comparator:

Identical strings:
  UTF16 convert=282ms, compare=72ms, convert+compare=354ms
  UTF8 compare=140ms
All upper vs. all lower ASCII:
  UTF16 convert=279ms, compare=157ms, convert+compare=436ms
  UTF8 compare=140ms
Mixed case vs. all upper unicode:
  UTF16 convert=324ms, compare=99ms, convert+compare=423ms
  UTF8 compare=207ms

These test CaseInsensitiveCharsEqual() by calling MatchAutoCompleteFunction::findOnBoundary and MatchAutoCompleteFunction::findAnywhere.  The times reported for UTF16 include the cost of converting to UTF16.

** findOnBoundary benchmark:
UTF8 ASCII:      153ms.
UTF16 ASCII:     323ms.
UTF8 unicode:    192ms.
UTF16 unicode:   364ms.
** findAnywhere benchmark:
UTF8 ASCII:      149ms.
UTF16 ASCII:     303ms.
UTF8 unicode:    193ms.
UTF16 unicode:   309ms.
Attachment #468528 - Attachment is obsolete: true
So, as I noted on the train, this probably should be using case folding rather than lowercasing for the case-insensitive matching. (On a related note, I assume we're normalizing the text somewhere before performing the match?)

See the sections of Unicode linked from
  http://unicode.org/reports/tr21/
Case folding vs. lowercasing can in some cases give different results.
Perhaps this should be a follow-up bug, since the code as-is replicates the current UTF-16 functionality.

And yes, I think we're assuming the text has been normalized.
bug 570975 has been marked blocking betaN+, but if anything will keep that bug from landing, it's this patch, I think.
blocking2.0: - → ?
The front-end could really benefit from this, as much as a 2x speedup on Awesomebar results and responsiveness.
Comment on attachment 472874 [details] [diff] [review]
Patch v5

>diff --git a/intl/unicharutil/tools/gencasetable.pl b/intl/unicharutil/tools/gencasetable.pl

>+# Map x -> x, except for upper-case letters, which we map to lower-case
>+# letters.
>+for($idx=0; $idx < 128; $idx++)
>+{
>+  print OUT "    ";
>+  if (65 <= $idx && $idx <= 90) {
>+    print OUT ($idx + 0x20);
>+  } else {
>+    print OUT $idx;
>+  }
>+  print OUT ",\n";
>+}

Nit: please use hex, not decimal. For extra credit, format the table in rows of 16 codepoints each.

>diff --git a/intl/unicharutil/util/nsUnicharUtils.cpp b/intl/unicharutil/util/nsUnicharUtils.cpp

>+static NS_ALWAYS_INLINE PRUint32
>+GetLowerUTF8Codepoint(const char* aStr, const char* aEnd, const char **aNext)

>+  if (UTF8traits::is3byte(str[0]) && NS_LIKELY(aStr + 2 < aEnd)) {
>+    // It's a three-byte sequence, so it looks like
>+    //  1110XXXX 10XXXXXX 10XXXXXX.
>+    // This may or may not fit into a PRUint16, so we need to accumulate into a
>+    // PRUint32.

Why "may or may not"? The last three-byte UTF-8 sequence is EF BF BF, which is U+FFFF.
(In reply to comment #82)
> Why "may or may not"? The last three byte UTF-8 sequence is EF BF BF, which is
> U+FFFF.

Oh, that's absolutely right.  There are 16 Xs in the comment.  :)
Attached patch Patch v5.1Splinter Review
Addresses comment 82.
Attachment #472874 - Attachment is obsolete: true
Attachment #484102 - Flags: review?
Attachment #472874 - Flags: review?(smontagu)
Attachment #484102 - Flags: review? → review?(smontagu)
Attachment #484102 - Flags: review?(smontagu) → review+
http://hg.mozilla.org/mozilla-central/rev/57bff252b161
Status: ASSIGNED → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
blocking2.0: ? → -