Closed
Bug 145975
Opened 23 years ago
Closed 14 years ago
Implement nsCaseInsensitiveUTF8StringComparator()
Categories
(Core :: Internationalization, defect)
Core
Internationalization
Tracking
()
RESOLVED
FIXED
Tracking | Status | |
---|---|---|
blocking2.0 | --- | - |
People
(Reporter: bzbarsky, Assigned: justin.lebar+bug)
References
Details
(Keywords: intl)
Attachments
(2 files, 15 obsolete files)
15.75 KB,
patch
|
Details | Diff | Splinter Review | |
45.82 KB,
patch
|
smontagu
:
review+
|
Details | Diff | Splinter Review |
At the moment, URI specs are UTF8Strings. This includes the hostname (now that
IDN support is in). Hostname comparisons should be case-insensitive. The
obvious way to do such a comparison is via:
Compare(host1, host2, some_comparator);
but unfortunately there is no suitable comparator. So the only way to properly
compare two hostnames at the moment is:
NS_ConvertUTF8toUCS2 host1UCS(host1);
NS_ConvertUTF8toUCS2 host2UCS(host2);
Compare(host1UCS, host2UCS, nsCaseInsensitiveStringComparator());
This is pretty obviously sub-optimal. It would be much better to have a
comparator that can do case-insensitive comparison of UTF8 strings directly.
Code issue, QA to yokoyama for now.
Keywords: intl
QA Contact: ruixu → yokoyama
Comment 3•23 years ago
|
||
How soon will hostnames actually be in utf8?
Comment 4•23 years ago
|
||
they are right now, assuming of course that you visit a site that uses them ;)
Comment 5•23 years ago
|
||
So dns supports utf8 domain names?
Comment 6•23 years ago
|
||
And with that I don't mean our implementation of it, but rather the infrastructure.
Comment 7•23 years ago
|
||
no, DNS only supports 7-bit case insensitive ASCII text. that probably won't
change anytime soon. the current solutions in place include converting UTF-8
text to a case insensitive ASCII version that does work with DNS. in
otherwords, there are special ASCII domainnames that should be converted to
unicode before being shown the user. these we handle via an external
implementation of nsIIDNService. there are plans to implement a version of this
service within mozilla (which i believe are underway).
without the plugin, UTF-8 domainnames will be URL escaped before being sent to
the DNS server. there is also a bug on not doing this, but instead sending
UTF-8 text to the DNS server. the thought being that DNS servers might
eventually support UTF-8 domainnames directly.
my point (in my previous comment) was that our APIs support UTF-8 domainnames,
and if a webpage includes a UTF-8 domainname, that UTF-8 text will enter mozilla
and be passed around. so, even if we lack any means of communicating with the
UTF-8 host, we should still be able to handle UTF-8 hostnames internally...
afterall, UTF-8 domainname support is just a matter of installing a drop-in
XPCOM component.
Comment 8•23 years ago
|
||
The question I should've asked: by when do you need this?
Comment 9•23 years ago
|
||
i don't know of any examples where this is needed, but maybe bz was thinking of
a few... bz?
Reporter | ||
Comment 10•23 years ago
|
||
This could be needed in cookie code, in autocomplete code, and possibly in the
security manager. I suspect autocomplete is done in JS; ccing morse and mstoltz
for their takes on the issue.
I filed this as a "we should have this working at some point" kind of thing; I
don't actually visit any sites that have non-ASCII domain names, so I have no
examples of anything being broken by whatever method we use for comparisons at
the moment.
Comment 11•21 years ago
|
||
Just FYI, listed below are cases where the 'simple' case-conversion crosses
UTF-8 length boundaries. The list was posted to the Unicode list by Markus Scherer.
U+0130 simple-lowercases to U+0069
U+0130 is LATIN CAPITAL LETTER I WITH DOT ABOVE
U+0131 simple-uppercases to U+0049
U+0131 is LATIN SMALL LETTER DOTLESS I
U+017f simple-uppercases to U+0053
U+017f is LATIN SMALL LETTER LONG S
U+1fbe simple-uppercases to U+0399
U+1fbe is GREEK PROSGEGRAMMENI
U+2126 simple-lowercases to U+03c9
U+2126 is OHM SIGN
U+212a simple-lowercases to U+006b
U+212a is KELVIN SIGN
U+212b simple-lowercases to U+00e5
U+212b is ANGSTROM SIGN
Updated•15 years ago
|
QA Contact: tetsuroy → i18n
Comment 13•14 years ago
|
||
In case this is useful, ICU (from IBM?) has a C API that looks like it can do this. It has a compatible license as far as I can tell (looks BSD-ish).
http://icu-project.org/apiref/icu4c/ucol_8h.html
'ucol_strcollIter'
This patch implements the necessary intl/ primitives to create a nsCaseInsensitiveUTF8Comparator. I haven't banged on it too hard, but it does pass the testcase in UnicharSelfTest.cpp.
This is built on my nsICaseConverter deCOM patch in Bug 575043.
Comment 15•14 years ago
|
||
(In reply to comment #13)
> In case this is useful, ICU (from IBM?) has a C API that looks like it can do
> this. It has a compatible license as far as I can tell (looks BSD-ish).
We may want to include ICU in the future too in order to use FTS3 (or 4) in SQLite in a Unicode-aware manner.
Attachment #458260 -
Attachment is obsolete: true
Completely untested, but it compiles. Want to play with it sdwilsh?
This is built on top of my patch for Bug 578714, so you'll need that patch too. That's built on top of my patch for bug 575043 :-P
Maybe I should just give you an easier diff in the morning.
We're going to want this for Bug 570975 if nothing else.
Assignee: jag-mozilla → me
Status: NEW → ASSIGNED
blocking2.0: --- → ?
Comment 20•14 years ago
|
||
I'll play with it if bug 570975 gets blocking+, but otherwise I won't be able to for a while.
Comment 21•14 years ago
|
||
Comment on attachment 458940 [details] [diff] [review]
UTF-8 intl primitives
I've only skimmed this, but I'm not overjoyed about adding all these extra tables. Would it be so bad for performance to convert a single character, or rather pair of characters, to UTF-16 and use the existing tables, when you need the lowercase compare? Since you do this:
>+// This function is mildly complicated because we don't want to pay the cost
>+// of comprehending every UTF-8 character. Instead we treat the arrays as byte
>+// streams until we find a mismatch. Once we find a mismatch, we backtrack
>+// if necessary until we've isolated a UTF-8 character from each array,
>+// then lowercase and compare.
(which is a plan so cunning that you could put a tail on it and call it a weasel), I wouldn't have thought that the performance cost would be so great.
I haven't benchmarked it, but if we're going to convert to UTF-16 internally what's the point of implementing this at all?
FWIW, I do intend to benchmark it, I just haven't had time.
Comment 24•14 years ago
|
||
What's the size cost of the tables? Time > space for the vast majority of this stuff, though cache effects have turned some of the "look up rather than compute" logic on its heads for game developers recently, at least.
3 * sizeof(PRUint32) * ~ 350
negligible IMO
Comment 26•14 years ago
|
||
(In reply to comment #22)
> I haven't benchmarked it, but if we're going to convert to UTF-16 internally
> what's the point of implementing this at all?
The point is that in practice we will very rarely need to make the conversion, especially if you special-case < 0x80
Comment 27•14 years ago
|
||
OTOH, if 3 * sizeof(PRUint32) * ~ 350 is negligible, we can also fix bug 210501, since the extra size for expanding the existing tables from UCS-2 to UCS-4 is half that ;-)
Comment 28•14 years ago
|
||
4200 bytes is negligible, yes. Did someone really claim that 2100 bytes was too much to add for non-BMP case folding? I'm afraid to look at the bug now.
I'll try to get some solid numbers relatively soon.
(In reply to comment #27)
> OTOH, if 3 * sizeof(PRUint32) * ~ 350 is negligible, we can also fix bug
> 210501, since the extra size for expanding the existing tables from UCS-2 to
> UCS-4 is half that ;-)
I'm breaking most of the same interfaces we'd need to do to do that, so we probably could pick up Bug 210501 at some point.
Assignee | ||
Comment 31•14 years ago
|
||
Kyle, are you going to have time to work on this now that you're back in school? If not, I might try and pick this one up.
Probably not. This mostly just needs testing (perf and correctness). If you have the desire and time to push it forward please do.
Assignee | ||
Comment 33•14 years ago
|
||
The UTF-8 intl primitives patch won't apply on tip of trunk; it references intl/unicharutil/public/nsCaseConversion.h, which doesn't exist.
Am I missing a patch in my queue?
(In reply to comment #33)
> The UTF-8 intl primitives patch won't apply on tip of trunk; it references
> intl/unicharutil/public/nsCaseConversion.h, which doesn't exist.
>
> Am I missing a patch in my queue?
Nope. I wrote this before I finished the deCOM bug. You should be able to make the added CaseInsensitiveCompare not a member of nsCaseConversion and just live in nsUnicharUtils instead and be set.
Assignee | ||
Comment 35•14 years ago
|
||
Ah, I see. I'll try and fix it up then. Thanks.
Assignee | ||
Comment 36•14 years ago
|
||
Patch based on tip of trunk (and my test changes in bug 587550). I folded Khuey's two patches into one since it was hard for me to keep track of which changes belonged where, but I can split them up again for review if this matters to anyone.
Attachment #458940 -
Attachment is obsolete: true
Attachment #458941 -
Attachment is obsolete: true
Assignee | ||
Comment 37•14 years ago
|
||
> From nsCompressedUTF8Map::Map()
>
> PRUint32 aChar;
> PRUnichar res;
> if (PR_INT16_MIN < (aChar - res) && (aChar - res) < PR_INT16_MAX)
> mCache[aChar & CASE_MAP_CACHE_MASK] =
> (((aChar << 8) & 0xFFFF0000) | (aChar - res));
Why are the inequalities in the |if| strict? It looks like what you mean is
if (!(0xffff0000 & (aChar - res))
but maybe I'm misunderstanding.
(In reply to comment #37)
> > From nsCompressedUTF8Map::Map()
> >
> > PRUint32 aChar;
> > PRUnichar res;
> > if (PR_INT16_MIN < (aChar - res) && (aChar - res) < PR_INT16_MAX)
> > mCache[aChar & CASE_MAP_CACHE_MASK] =
> > (((aChar << 8) & 0xFFFF0000) | (aChar - res));
>
> Why are the inequalities in the |if| strict? It looks like what you mean is
>
> if (!(0xffff0000 & (aChar - res))
>
> but maybe I'm misunderstanding.
I believe you're correct.
Another bug to watch out for is that IIRC the patch I wrote will backtrack to before the beginning of the string when presented with appropriately malformed input.
Assignee | ||
Comment 39•14 years ago
|
||
gencasetable.pl is generating 64-bit negative numbers on my 64-bit machine. How do I get perl to output just 32 bits?
Assignee | ||
Updated•14 years ago
|
Assignee: me → justin.lebar+bug
Assignee | ||
Comment 40•14 years ago
|
||
I managed to get perl to do my bidding.
Attachment #466796 -
Attachment is obsolete: true
Assignee | ||
Comment 41•14 years ago
|
||
Whoops; forgot to qref. This one should compile.
Attachment #466920 -
Attachment is obsolete: true
Assignee | ||
Comment 42•14 years ago
|
||
(In reply to comment #38)
> Another bug to watch out for is that IIRC the patch I wrote will backtrack to
> before the beginning of the string when presented with appropriately malformed
> input.
What's the contract for these functions when given malformed input? Are they supposed to be robust?
Assignee | ||
Comment 43•14 years ago
|
||
I wonder how we should test this. The code in UnicharSelfTest.cpp seems pretty ad-hoc. Perhaps an automated approach would be better, but that might also be overkill.
Any thoughts?
(In reply to comment #42)
> (In reply to comment #38)
> > Another bug to watch out for is that IIRC the patch I wrote will backtrack to
> > before the beginning of the string when presented with appropriately malformed
> > input.
>
> What's the contract for these functions when given malformed input? Are they
> supposed to be robust?
They need to be able to deal with untrusted input (not crash/overflow) but if the input isn't valid UTF-8 I don't think the answers need to make sense.
Assignee | ||
Comment 45•14 years ago
|
||
I fixed a number of subtle bugs in the case-insensitive search just by running the crappy testcase in UnicodeSelfTest. There are a lot of special cases; it's going to take a pretty decent test to convince me that it's right.
Attachment #467042 -
Attachment is obsolete: true
Assignee | ||
Comment 46•14 years ago
|
||
Jonas and I spoke yesterday about testing this. We were thinking I could test the case-insensitive comparison by iterating over many random strings, as follows: Pick two byte strings. If they're both valid UTF8, then convert to UTF16 and check that the case-insensitive comparison between them is the same. If either is invalid UTF8, compare them anyway and make sure that we don't crash.
We could run under Valgrind to detect more errors. And we could be more clever than totally random strings -- we could perhaps test all combinations of bytes with zero, one, two, three, and four high bits set.
I just spent a few hours messing with the tests and build system, and I'm now less hopeful about this testing strategy. I can't access IsUTF8 from here -- it's an internal string API -- this makes it difficult to test the algorithm for correctness. We might be able to find memory corruption bugs by testing with random strings, but not sure it's worth it.
Maybe the test we have is sufficient. If so, patch v3 should be ready for review.
Comment 47•14 years ago
|
||
This might help with test coverage a bit:
http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt
Also, do Perl or Python have test suites that cover case-insensitive comparison of UTF-8 (or Unicode in general), perhaps for their regular-expression capabilities?
Simon has a good set of sample valid text in many languages. Testing error handling will be more interesting.
FWIW, we don't have any real testing of the existing code in this area (other than what the testsuite naturally exercises of course, which is a lot of it).
Assignee | ||
Comment 49•14 years ago
|
||
This includes a pretty dumb fuzzer. Didn't find any problems.
Attachment #467201 -
Attachment is obsolete: true
Updated•14 years ago
|
blocking2.0: ? → -
Assignee | ||
Comment 50•14 years ago
|
||
Benchmarks in a moment.
Attachment #467610 -
Attachment is obsolete: true
Assignee | ||
Comment 51•14 years ago
|
||
This benchmark calls the case-insensitive UTF-8 and UTF-16 comparators with three workloads:
* two identical ASCII strings,
* one all lower-case and one all upper-case ASCII string, and
* one mixed-case unicode string and one all upper-case unicode string. (At
least, I think one string is mixed and one is uppercase; I didn't
double-check.)
Since both functions do a bitwise comparison when the strings are identical, I'd expect comparing two identical Unicode strings to have similar performance characteristics to comparing two identical ASCII strings.
I timed three operations for each pair of strings:
* converting the strings from UTF8 to UTF16,
* comparing the two strings as UTF16, and
* comparing the two strings as UTF8.
Here are the results, using patch v3.2:
Identical strings:
UTF16 convert=904us, compare=141us, convert+compare=1045us
UTF8 compare=217us
All upper vs. all lower ASCII:
UTF16 convert=814us, compare=991us, convert+compare=1805us
UTF8 compare=1327us
Mixed case vs. all upper unicode
UTF16 convert=6137us, compare=3277us, convert+compare=9414us
UTF8 compare=12984us
The UTF-8 comparison is a clear win in the all ASCII case. The UTF-8 comparison itself is a bit slower than the UTF-16 comparison, presumably because the UTF-8 loop is more sizable (even though, in the identical strings case, both the UTF-8 and UTF-16 loops do a simple bitwise comparison), but still much faster than convert+compare.
If we throw in enough unicode characters, however, it becomes faster to convert to UTF-16 and compare in that domain. This is pretty sensible, since the unicode case is really no worse for the UTF-16 code than the ASCII case, while the UTF-8 comparator needs to backtrack and whatnot.
I think how we proceed from here depends on what we're optimizing for. Do we think most of our strings will be identical? Then maybe the backtracking loop khuey wrote is pretty good. If we think that most of our strings will be different, however, then we can make the opposite assumption and not backtrack.
My take on this is that the most important user of this function is the aweseomebar, which pumps most of the places database through this comparator. I imagigne most people will type lower-case text into their location bar, and most page titles are in title case. So perhaps we should optimize for the case where most, but not all, letters are identical.
Reporter | ||
Comment 52•14 years ago
|
||
> I think how we proceed from here depends on what we're optimizing for.
I was going to suggest instrumenting this function to print out which case it fell into and then pushing to try, but that won't exercise the awesomebar, of course.
Maybe instrument it and browse for a bit?
Comment 53•14 years ago
|
||
(In reply to comment #51)
> I think how we proceed from here depends on what we're optimizing for. Do we
> think most of our strings will be identical? Then maybe the backtracking loop
> khuey wrote is pretty good. If we think that most of our strings will be
> different, however, then we can make the opposite assumption and not backtrack.
>
> My take on this is that the most important user of this function is the
> aweseomebar, which pumps most of the places database through this comparator.
> I imagigne most people will type lower-case text into their location bar, and
> most page titles are in title case. So perhaps we should optimize for the case
> where most, but not all, letters are identical.
I think you are mistaken in assumptions about location bar. It will search through every single bookmark you have (url and title), every history entry (url, title, and tags) after splitting things up into tokens depending on your search preferences. It will stop after it finds 12 matches (this is also pref controllable). I think the common case here is that the letters are not identical.
Assignee | ||
Comment 54•14 years ago
|
||
Ah, indeed. Then the common case is overwhelmingly that the strings don't match at all, yes?
By identical I assume he's referring to case. Two strings that are substantively different will fall out of the comparator almost immediately, so this will be a massive win against the current version that pays the overhead of two string conversions before rejecting them.
We really should be comparing against Simon's suggestion from comment 21. We all agree that the current method is bad.
Comment 56•14 years ago
|
||
(In reply to comment #55)
> By identical I assume he's referring to case. Two strings that are
> substantively different will fall out of the comparator almost immediately, so
> this will be a massive win against the current version that pays the overhead
> of two string conversions before rejecting them.
oh, case is a crap-shoot then.
Assignee | ||
Comment 57•14 years ago
|
||
I manually inlined the ASCII case conversion, and now the UTF8 code is faster than the UTF16 code on the different-case ASCII test.
Identical strings:
UTF16 convert=890us, compare=134us, convert+compare=1024us
UTF8 compare=201us
All upper vs. all lower ASCII:
UTF16 convert=819us, compare=998us, convert+compare=1817us
UTF8 compare=705us
Mixed case vs. all upper unicode:
UTF16 convert=6211us, compare=3060us, convert+compare=9271us
UTF8 compare=12047us
I'm going to collect some profiling data from people who have non-ASCII characters in their history.
Attachment #468508 -
Attachment is obsolete: true
Assignee | ||
Comment 58•14 years ago
|
||
I looked through the data from my machine, and I'm now not sure this is worth fretting over much further. The vast majority of comparisons will be rejected on the first character, so small optimizations within the loop aren't going to make a big difference.
A much bigger candidate for optimization, I think, is lower-casing what's in the URL bar *once* and then using a half-case-insensitive comparator which assumes that one of the operands is already lower-case.
Reporter | ||
Comment 59•14 years ago
|
||
I don't think that doing that gives the same behavior as doing case-insensitive comparisons, with non-ASCII Unicode. Specifically, and this may be the only such example, "I" and "İ" should probably not test equal, but lowercasing the former gives "i" which is in fact case-insensitively equal to "İ".
And you can't just lowercase both and compare exactly, since if you do that you will end up with "I" not testing equal to "ı", which it should be.
Assignee | ||
Comment 60•14 years ago
|
||
(In reply to comment 59)
The UTF8 conversion code in this patch may be broken, then. I'm not sure about the UTF16 code.
I'll have a look.
(In reply to comment #59)
> I don't think that doing that gives the same behavior as doing case-insensitive
> comparisons, with non-ASCII Unicode. Specifically, and this may be the only
> such example, "I" and "İ" should probably not test equal, but lowercasing the
> former gives "i" which is in fact case-insensitively equal to "İ".
>
> And you can't just lowercase both and compare exactly, since if you do that you
> will end up with "I" not testing equal to "ı", which it should be.
All of our current code does this AIUI.
Reporter | ||
Comment 62•14 years ago
|
||
> All of our current code does this AIUI.
Then our current code is buggy, no? Looks like it was doing that even before the patch for bug 575043, though.... and in fact may have been broken all along. Unless I'm missing something? Simon?
Comment 63•14 years ago
|
||
Yes, we have always had that bug. Even if we fix it for text that we know is in Turkish (bug 231162), I'm not sure if there is a good solution for contexts like the awesome bar :(
Assignee | ||
Comment 64•14 years ago
|
||
Simon, would you be willing to take this patch as-is (with a very simple fuzzer and a single test comparing two valid strings), or do you want more extensive testing of valid text?
Comment 65•14 years ago
|
||
Are those the only alternatives? ;-)
I would be happy with three tests rather than one: one of ASCII-only strings, one of trans-ASCII Unicode strings, and one where the two strings are different lengths in UTF-8 (There are a bunch of characters in the range U+2C60-2C7F that have case mappings in the range U+02xx)
Assignee | ||
Comment 66•14 years ago
|
||
Attachment #468545 -
Attachment is obsolete: true
Attachment #470381 -
Flags: review?
Assignee | ||
Updated•14 years ago
|
Attachment #470381 -
Flags: review? → review?(smontagu)
Assignee | ||
Comment 67•14 years ago
|
||
The patch above has a unicode test (with different-length strings) and a few ASCII sanity checks in addition the naive fuzzer.
I also realized when I was going over the patch that I'd only inlined one of the case conversions. Below are results from the same benchmark as comment 57, but with both case conversions inlined. It's substantially faster, especially in the all upper vs. all lower ASCII test.
Identical strings:
UTF16 convert=975us, compare=135us, convert+compare=1110us
UTF8 compare=179us
All upper vs. all lower ASCII:
UTF16 convert=898us, compare=1118us, convert+compare=2016us
UTF8 compare=536us
Mixed case vs. all upper unicode:
UTF16 convert=6098us, compare=3052us, convert+compare=9150us
UTF8 compare=11980us
Assignee | ||
Comment 68•14 years ago
|
||
It looks like I need another function for bug 570975: CaseInsensitiveStartsWith. I'll post a new patch with this function soon.
Assignee | ||
Comment 69•14 years ago
|
||
This has what I need for bug 570975.
Attachment #470381 -
Attachment is obsolete: true
Attachment #470854 -
Flags: review?(smontagu)
Attachment #470381 -
Flags: review?(smontagu)
Assignee | ||
Comment 70•14 years ago
|
||
Benchmarking is revealing that CaseInsensitiveStartsWith is slower than I'd like. Simon, would you mind reviewing all but that function (i.e. patch v3.4)? I'll post a separate patch once I figure out what I need for bug 570975 to be fast.
Assignee | ||
Updated•14 years ago
|
Attachment #470381 -
Flags: review?(smontagu)
Assignee | ||
Updated•14 years ago
|
Attachment #470854 -
Attachment is obsolete: true
Attachment #470854 -
Flags: review?(smontagu)
Assignee | ||
Updated•14 years ago
|
Attachment #470381 -
Attachment is obsolete: false
Assignee | ||
Comment 71•14 years ago
|
||
I just re-implemented the case-insensitive comparator in terms of a character-by-character comparator (i.e., what was suggested in comment 21), and it's much slower than Khuey's approach in the identical and different-case ASCII cases and just as slow in the Unicode case.
That said, I still want the character-by-character comparator I wrote, because it makes the awesomebar really fast. I'll post what I have soon.
Assignee | ||
Comment 72•14 years ago
|
||
This patch adds CaseInsensitiveUTF8CharEquals().
Sorry for the churn, Simon! I'm happy to provide interdiffs if you've already started looking at this.
Attachment #470381 -
Attachment is obsolete: true
Attachment #471347 -
Flags: review?
Attachment #470381 -
Flags: review?(smontagu)
Assignee | ||
Updated•14 years ago
|
Attachment #471347 -
Flags: review? → review?(smontagu)
Assignee | ||
Comment 73•14 years ago
|
||
Good news and bad news:
The good news is that I've managed to speed this up and get rid of a substantial amount of code. The speedup comes from noticing that we were missing a call to IS_NOCASE_CHAR in the UTF8 code, so we were spending a lot more time in the case map than we needed to be.
The size decrease comes from the following observation: The code previously avoided computing UTF8 codepoints (instead, it worked with the concatenation of the representative bytes), but I realized that computing the real codepoint would come at the cost of just a few bitwise ANDs. This change lets us reuse pretty much all of the existing UTF16 infrastructure, so decreases the size of the patch significantly.
The bad news is that I basically rewrote the patch *again*. This canceling of reviews is getting kind of silly, so I'm not going to request review until I've let this new one bake for a while.
I'll put up the patch in a bit; I just need to clean it up a bit.
Assignee | ||
Updated•14 years ago
|
Attachment #471347 -
Flags: review?(smontagu)
Assignee | ||
Comment 74•14 years ago
|
||
Interesting, maybe there are some code alignment effects going on here? I'm finding that the UTF16 identical compare time varies from 270us to 140us depending on changes I make to other, unrelated functions.
Assignee | ||
Comment 75•14 years ago
|
||
Oh, no; this is probably the scheduler peeking its head in here. Changing the code has nothing to do with it. I'll see if I can fix the benchmark.
Assignee | ||
Comment 76•14 years ago
|
||
I think I'm done with this, hopefully for real this time. Benchmarks in a moment.
Attachment #471347 -
Attachment is obsolete: true
Attachment #472874 -
Flags: review?(smontagu)
Assignee | ||
Comment 77•14 years ago
|
||
Executive summary: The UTF-8 comparison functions are substantially faster than converting to UTF-16 and comparing. Speedups are on the order of 2-3x.
You'll need patch v4 from bug 570975 to run the benchmark yourself. (I did say it was ugly. :)
You'll also probably want to enable -O3 if you're on Linux. I make no guarantees that the new code will be fast if you're using -Os (currently the default on Linux). See bug 590181.
These results are from my Linux x64 box with -O3.
These test CaseInsensitiveUTF8Comparator:
Identical strings:
UTF16 convert=282ms, compare=72ms, convert+compare=354ms
UTF8 compare=140ms
All upper vs. all lower ASCII:
UTF16 convert=279ms, compare=157ms, convert+compare=436ms
UTF8 compare=140ms
Mixed case vs. all upper unicode:
UTF16 convert=324ms, compare=99ms, convert+compare=423ms
UTF8 compare=207ms
These test CaseInsensitiveCharsEqual() by calling MatchAutoCompleteFunction::findOnBoundary and MatchAutoCompleteFunction::findAnywhere. The times reported for UTF16 include the cost of converting to UTF16.
** findOnBoundary benchmark:
UTF8 ASCII: 153ms.
UTF16 ASCII: 323ms.
UTF8 unicode: 192ms.
UTF16 unicode: 364ms.
** findAnywhere benchmark:
UTF8 ASCII: 149ms.
UTF16 ASCII: 303ms.
UTF8 unicode: 193ms.
UTF16 unicode: 309ms.
Attachment #468528 -
Attachment is obsolete: true
Comment 78•14 years ago
|
||
So, as I noted on the train, this probably should be using case folding rather than lowercasing for the case-insensitive matching. (On a related note, I assume we're normalizing the text somewhere before performing the match?)
See the sections of Unicode linked from
http://unicode.org/reports/tr21/
Case folding vs. lowercasing can in some cases give different results.
Assignee | ||
Comment 79•14 years ago
|
||
Perhaps this should be a follow-up bug, since the code as-is replicates the current UTF-16 functionality.
And yes, I think we're assuming the text has been normalized.
Assignee | ||
Comment 80•14 years ago
|
||
bug 570975 has been marked blocking betaN+, but if anything will keep that bug from landing, it's this patch, I think.
blocking2.0: - → ?
Comment 81•14 years ago
|
||
The front-end could really benefit from this, as much as a 2x speedup on Awesomebar results and responsiveness.
Comment 82•14 years ago
|
||
Comment on attachment 472874 [details] [diff] [review]
Patch v5
>diff --git a/intl/unicharutil/tools/gencasetable.pl b/intl/unicharutil/tools/gencasetable.pl
>+# Map x -> x, except for upper-case letters, which we map to lower-case
>+# letters.
>+for($idx=0; $idx < 128; $idx++)
>+{
>+ print OUT " ";
>+ if (65 <= $idx && $idx <= 90) {
>+ print OUT ($idx + 0x20);
>+ } else {
>+ print OUT $idx;
>+ }
>+ print OUT ",\n";
>+}
Nit: please use hex, not decimal. For extra credit, format the table in rows of 16 codepoints each.
>diff --git a/intl/unicharutil/util/nsUnicharUtils.cpp b/intl/unicharutil/util/nsUnicharUtils.cpp
>+static NS_ALWAYS_INLINE PRUint32
>+GetLowerUTF8Codepoint(const char* aStr, const char* aEnd, const char **aNext)
>+ if (UTF8traits::is3byte(str[0]) && NS_LIKELY(aStr + 2 < aEnd)) {
>+ // It's a three-byte sequence, so it looks like
>+ // 1110XXXX 10XXXXXX 10XXXXXX.
>+ // This may or may not fit into a PRUint16, so we need to accumulate into a
>+ // PRUint32.
Why "may or may not"? The last three byte UTF-8 sequence is EF BF BF, which is U+FFFF.
Assignee | ||
Comment 83•14 years ago
|
||
(In reply to comment #82)
> Why "may or may not"? The last three byte UTF-8 sequence is EF BF BF, which is
> U+FFFF.
Oh, that's absolutely right. There are 16 Xs in the comment. :)
Assignee | ||
Comment 84•14 years ago
|
||
Addresses comment 82.
Attachment #472874 -
Attachment is obsolete: true
Attachment #484102 -
Flags: review?
Attachment #472874 -
Flags: review?(smontagu)
Assignee | ||
Updated•14 years ago
|
Attachment #484102 -
Flags: review? → review?(smontagu)
Updated•14 years ago
|
Attachment #484102 -
Flags: review?(smontagu) → review+
Assignee | ||
Comment 85•14 years ago
|
||
Status: ASSIGNED → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Updated•14 years ago
|
blocking2.0: ? → -
You need to log in
before you can comment on or make changes to this bug.
Description
•