Closed Bug 145975 Opened 23 years ago Closed 14 years ago

Implement nsCaseInsensitiveUTF8StringComparator()

Tracking

Status:

RESOLVED FIXED

Tracking Flags:

Tracking

Status

blocking2.0

---

People

(Reporter: bzbarsky, Assigned: justin.lebar+bug)

References

Details

(Keywords: intl)

Attachments

(2 files, 15 obsolete files)

POC 14 years ago Kyle Huey (Exited; not receiving bugmail, old account, do not use) 43.42 KB, patch
UTF-8 intl primitives 14 years ago Kyle Huey (Exited; not receiving bugmail, old account, do not use) 43.38 KB, patch
XPCOM String API changes 14 years ago Kyle Huey (Exited; not receiving bugmail, old account, do not use) 12.86 KB, patch
Folded, rebased patch 14 years ago Justin Lebar (not reading bugmail) 67.18 KB, patch
Patch v2 14 years ago Justin Lebar (not reading bugmail) 74.18 KB, patch
Patch v2.1 14 years ago Justin Lebar (not reading bugmail) 74.29 KB, patch
Patch v3 14 years ago Justin Lebar (not reading bugmail) 74.68 KB, patch
Patch v3.1 14 years ago Justin Lebar (not reading bugmail) 76.97 KB, patch
Patch v3.2 14 years ago Justin Lebar (not reading bugmail) 78.40 KB, patch
Ugly benchmark code 14 years ago Justin Lebar (not reading bugmail) 5.17 KB, patch
Patch v3.3 14 years ago Justin Lebar (not reading bugmail) 78.74 KB, patch
Patch v3.4 14 years ago Justin Lebar (not reading bugmail) 78.86 KB, patch
Patch v3.5 14 years ago Justin Lebar (not reading bugmail) 79.64 KB, patch
Patch v4 14 years ago Justin Lebar (not reading bugmail) 82.98 KB, patch
Patch v5 14 years ago Justin Lebar (not reading bugmail) 46.13 KB, patch
Ugly benchmark code v2 14 years ago Justin Lebar (not reading bugmail) 15.75 KB, patch
Patch v5.1 14 years ago Justin Lebar (not reading bugmail) 45.82 KB, patch, smontagu: review+

Boris Zbarsky [:bzbarsky]

Reporter

Description

•

23 years ago

At the moment, URI specs are UTF8Strings. This includes the hostname (now that IDN support is in). Hostname comparisons should be case-insensitive. The obvious way to do such a comparison is via:

  Compare(host1, host2, some_comparator);

but unfortunately there is no suitable comparator. So the only way to properly compare two hostnames at the moment is:

  NS_ConvertUTF8toUCS2 host1UCS(host1);
  NS_ConvertUTF8toUCS2 host2UCS(host2);
  Compare(host1UCS, host2UCS, nsCaseInsensitiveStringComparator());

This is pretty obviously sub-optimal. It would be much better to have a comparator that can do case-insensitive comparison of UTF8 strings directly.

Roy Yokoyama

Comment 1

•

23 years ago

better re-assign to string owner. jag?

Assignee: yokoyama → jaggernaut

Rui Xu

Comment 2

•

23 years ago

Code issue, QA to yokoyama for now.

Keywords: intl

QA Contact: ruixu → yokoyama

jag (Peter Annema)

Comment 3

•

23 years ago

How soon will hostnames actually be in utf8?

Darin Fisher

Comment 4

•

23 years ago

they are right now, assuming of course that you visit a site that uses them ;)

jag (Peter Annema)

Comment 5

•

23 years ago

So dns supports utf8 domain names?

jag (Peter Annema)

Comment 6

•

23 years ago

And with that I don't mean our implementation of it, but rather the infrastructure.

Darin Fisher

Comment 7

•

23 years ago

no, DNS only supports 7-bit case insensitive ASCII text. that probably won't change anytime soon. the current solutions in place include converting UTF-8 text to a case insensitive ASCII version that does work with DNS. in other words, there are special ASCII domain names that should be converted to unicode before being shown to the user. these we handle via an external implementation of nsIIDNService. there are plans to implement a version of this service within mozilla (which i believe are underway).

without the plugin, UTF-8 domain names will be URL escaped before being sent to the DNS server. there is also a bug on not doing this, but instead sending UTF-8 text to the DNS server. the thought being that DNS servers might eventually support UTF-8 domain names directly.

my point (in my previous comment) was that our APIs support UTF-8 domain names, and if a webpage includes a UTF-8 domain name, that UTF-8 text will enter mozilla and be passed around. so, even if we lack any means of communicating with the UTF-8 host, we should still be able to handle UTF-8 hostnames internally... after all, UTF-8 domain name support is just a matter of installing a drop-in XPCOM component.

jag (Peter Annema)

Comment 8

•

23 years ago

The question I should've asked: by when do you need this?

Darin Fisher

Comment 9

•

23 years ago

i don't know of any examples where this is needed, but maybe bz was thinking of a few... bz?

Boris Zbarsky [:bzbarsky]

Reporter

Comment 10

•

23 years ago

This could be needed in cookie code, in autocomplete code, and possibly in the security manager. I suspect autocomplete is done in JS; ccing morse and mstoltz for their takes on the issue. I filed this as a "we should have this working at some point" kind of thing; I don't actually visit any sites that have non-ASCII domain names, so I have no examples of anything being broken by whatever method we use for comparisons at the moment.

Jungshik Shin

Comment 11

•

21 years ago

Just FYI, listed below are cases where the 'simple' case-conversion crosses UTF-8 length boundaries. The list was posted to the Unicode list by Markus Scherer.

U+0130 simple-lowercases to U+0069
  U+0130 is LATIN CAPITAL LETTER I WITH DOT ABOVE
U+0131 simple-uppercases to U+0049
  U+0131 is LATIN SMALL LETTER DOTLESS I
U+017f simple-uppercases to U+0053
  U+017f is LATIN SMALL LETTER LONG S
U+1fbe simple-uppercases to U+0399
  U+1fbe is GREEK PROSGEGRAMMENI
U+2126 simple-lowercases to U+03c9
  U+2126 is OHM SIGN
U+212a simple-lowercases to U+006b
  U+212a is KELVIN SIGN
U+212b simple-lowercases to U+00e5
  U+212b is ANGSTROM SIGN
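To make the length-boundary point concrete, here is a minimal sketch that computes the UTF-8 encoded length of each code point in the list above and counts how many mappings change it. The names Utf8Length, CaseMapping, kBoundaryCrossers and CountLengthChanges are hypothetical and are not taken from any patch on this bug.

```cpp
#include <cassert>
#include <cstddef>

// Number of bytes UTF-8 needs to encode one Unicode code point.
inline std::size_t Utf8Length(char32_t cp) {
  assert(cp <= 0x10FFFF);  // anything larger is not a Unicode code point
  if (cp < 0x80) return 1;
  if (cp < 0x800) return 2;
  if (cp < 0x10000) return 3;
  return 4;
}

// One simple case mapping from the list in the comment above.
struct CaseMapping {
  char32_t from;
  char32_t to;
};

// The seven mappings quoted from the Unicode list post.
const CaseMapping kBoundaryCrossers[] = {
    {0x0130, 0x0069},  // LATIN CAPITAL LETTER I WITH DOT ABOVE -> i
    {0x0131, 0x0049},  // LATIN SMALL LETTER DOTLESS I -> I
    {0x017F, 0x0053},  // LATIN SMALL LETTER LONG S -> S
    {0x1FBE, 0x0399},  // GREEK PROSGEGRAMMENI -> GREEK CAPITAL LETTER IOTA
    {0x2126, 0x03C9},  // OHM SIGN -> GREEK SMALL LETTER OMEGA
    {0x212A, 0x006B},  // KELVIN SIGN -> k
    {0x212B, 0x00E5},  // ANGSTROM SIGN -> LATIN SMALL LETTER A WITH RING ABOVE
};

// Returns how many of the listed mappings change the encoded byte length.
inline std::size_t CountLengthChanges() {
  std::size_t count = 0;
  for (const CaseMapping& m : kBoundaryCrossers) {
    if (Utf8Length(m.from) != Utf8Length(m.to)) ++count;
  }
  return count;
}
```

The practical consequence is that a UTF-8 case-insensitive comparator cannot assume that two strings which compare equal have the same byte length, so it cannot walk both strings with a single shared index.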

Boris Zbarsky [:bzbarsky]

Reporter

Updated

•

19 years ago

Depends on: 231782

Phil Ringnalda (:philor)

Updated

•

15 years ago

QA Contact: tetsuroy → i18n

Shawn Wilsher :sdwilsh

Updated

•

14 years ago

Blocks: 570975

Josh Aas

Comment 13

•

14 years ago

In case this is useful, ICU (from IBM?) has a C API that looks like it can do this. It has a compatible license as far as I can tell (looks BSD-ish). http://icu-project.org/apiref/icu4c/ucol_8h.html 'ucol_strcollIter'

Kyle Huey (Exited; not receiving bugmail, old account, do not use)

Comment 14

•

14 years ago

Attached patch POC (obsolete) — Details — Splinter Review

This patch implements the necessary intl/ primitives to create an nsCaseInsensitiveUTF8Comparator. I haven't banged on it too hard, but it does pass the testcase in UnicharSelfTest.cpp. This is built on my nsICaseConverter deCOM patch in Bug 575043.

Shawn Wilsher :sdwilsh

Comment 15

•

14 years ago

(In reply to comment #13)
> In case this is useful, ICU (from IBM?) has a C API that looks like it can do
> this. It has a compatible license as far as I can tell (looks BSD-ish).

We may want to include ICU in the future too in order to use FTS3 (or 4) in SQLite in a Unicode-aware manner.

Kyle Huey (Exited; not receiving bugmail, old account, do not use)

Comment 16

•

14 years ago

Attached patch UTF-8 intl primitives (obsolete) — Details — Splinter Review

Attachment #458260 - Attachment is obsolete: true

Kyle Huey (Exited; not receiving bugmail, old account, do not use)

Comment 17

•

14 years ago

Attached patch XPCOM String API changes (obsolete) — Details — Splinter Review

Completely untested, but it compiles. Want to play with it sdwilsh?

Kyle Huey (Exited; not receiving bugmail, old account, do not use)

Comment 18

•

14 years ago

This is built on top of my patch for Bug 578714, so you'll need that patch too. That's built on top of my patch for bug 575043 :-P Maybe I should just give you an easier diff in the morning.

Kyle Huey (Exited; not receiving bugmail, old account, do not use)

Comment 19

•

14 years ago

We're going to want this for Bug 570975 if nothing else.

Assignee: jag-mozilla → me

Status: NEW → ASSIGNED

blocking2.0: --- → ?

Shawn Wilsher :sdwilsh

Comment 20

•

14 years ago

I'll play with it if bug 570975 gets blocking+, but otherwise I won't be able to for a while.

Simon Montagu :smontagu

Comment 21

•

14 years ago

Comment on attachment 458940
UTF-8 intl primitives

I've only skimmed this, but I'm not overjoyed about adding all these extra tables. Would it be so bad for performance to convert a single character, or rather pair of characters, to UTF-16 and use the existing tables, when you need the lowercase compare? Since you do this:

>+// This function is mildly complicated because we don't want to pay the cost
>+// of comprehending every UTF-8 character. Instead we treat the arrays as byte
>+// streams until we find a mismatch. Once we find a mismatch, we backtrack
>+// if necessary until we've isolated a UTF-8 character from each array,
>+// then lowercase and compare.

(which is a plan so cunning that you could put a tail on it and call it a weasel), I wouldn't have thought that the performance cost would be so great.
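The strategy described in the quoted patch comment can be sketched roughly as follows. This is not the code from the attachment: it assumes both inputs are valid UTF-8, uses std::string rather than the XPCOM string classes, and ToLowerSketch is a hypothetical stand-in that covers only ASCII, Latin-1 and KELVIN SIGN in place of real Unicode case tables.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical stand-in for a real Unicode lowercase lookup.
inline char32_t ToLowerSketch(char32_t c) {
  if (c >= 'A' && c <= 'Z') return c + 0x20;
  if (c >= 0xC0 && c <= 0xDE && c != 0xD7) return c + 0x20;  // Latin-1 capitals
  if (c == 0x212A) return 'k';                               // KELVIN SIGN
  return c;
}

// True for UTF-8 continuation bytes (10xxxxxx).
inline bool IsContinuation(unsigned char b) { return (b & 0xC0) == 0x80; }

// Decodes the character whose lead byte is at s[i] and advances i past it.
// Assumes valid UTF-8.
inline char32_t DecodeAt(const std::string& s, std::size_t& i) {
  unsigned char lead = s[i++];
  assert(!IsContinuation(lead));
  if (lead < 0x80) return lead;
  int extra = lead >= 0xF0 ? 3 : lead >= 0xE0 ? 2 : 1;
  char32_t cp = lead & (0x3F >> extra);  // payload bits of the lead byte
  while (extra-- > 0) {
    assert(i < s.size() && IsContinuation(s[i]));
    cp = (cp << 6) | (static_cast<unsigned char>(s[i++]) & 0x3F);
  }
  return cp;
}

// Returns <0, 0 or >0, ordering by lowercased code point.
inline int CaseInsensitiveCompareUtf8(const std::string& a,
                                      const std::string& b) {
  std::size_t i = 0, j = 0;
  while (i < a.size() && j < b.size()) {
    if (a[i] == b[j]) {  // fast path: identical bytes, no decoding needed
      ++i;
      ++j;
      continue;
    }
    // Mismatch: back up to the lead byte of the character on each side.
    while (i > 0 && IsContinuation(a[i])) --i;
    while (j > 0 && IsContinuation(b[j])) --j;
    char32_t ca = ToLowerSketch(DecodeAt(a, i));
    char32_t cb = ToLowerSketch(DecodeAt(b, j));
    if (ca != cb) return ca < cb ? -1 : 1;
    // Equal after lowercasing: i and j may have advanced by different amounts.
  }
  if (i < a.size()) return 1;
  if (j < b.size()) return -1;
  return 0;
}
```

Keeping separate indices for the two strings matters because characters that compare equal after lowercasing can have different encoded lengths, as with KELVIN SIGN (three bytes) and "k" (one byte).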

Kyle Huey (Exited; not receiving bugmail, old account, do not use)

Comment 22

•

14 years ago

I haven't benchmarked it, but if we're going to convert to UTF-16 internally what's the point of implementing this at all?

Kyle Huey (Exited; not receiving bugmail, old account, do not use)

Comment 23

•

14 years ago

FWIW, I do intend to benchmark it, I just haven't had time.

Mike Shaver (:shaver emeritus)

Comment 24

•

14 years ago

What's the size cost of the tables? Time > space for the vast majority of this stuff, though cache effects have turned some of the "look up rather than compute" logic on its head for game developers recently, at least.

Kyle Huey (Exited; not receiving bugmail, old account, do not use)

Comment 25

•

14 years ago

3 * sizeof(PRUint32) * ~ 350

negligible IMO

Simon Montagu :smontagu

Comment 26

•

14 years ago

(In reply to comment #22)
> I haven't benchmarked it, but if we're going to convert to UTF-16 internally
> what's the point of implementing this at all?

The point is that in practice we will very rarely need to make the conversion, especially if you special-case < 0x80
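One way to read the "special-case < 0x80" suggestion is sketched below: when two bytes differ and both are ASCII, the mismatch can be settled with arithmetic alone, and only mismatches involving a non-ASCII byte fall through to decoding and a table lookup. The names AsciiToLower, FastPath and CompareMismatchedBytes are hypothetical and not from any patch on this bug.

```cpp
#include <cassert>

// Lowercases one ASCII byte with arithmetic only; no table lookup.
inline unsigned char AsciiToLower(unsigned char b) {
  assert(b < 0x80);
  return (b >= 'A' && b <= 'Z') ? static_cast<unsigned char>(b + 0x20) : b;
}

// Outcome of looking at a pair of mismatched bytes.
enum class FastPath { Equal, Different, NeedsDecode };

// Resolves a byte mismatch without decoding when both bytes are ASCII.
// If either byte is >= 0x80 the caller has to decode full characters,
// because a non-ASCII character may still lowercase to an ASCII one.
inline FastPath CompareMismatchedBytes(unsigned char a, unsigned char b) {
  if (a >= 0x80 || b >= 0x80) return FastPath::NeedsDecode;
  return AsciiToLower(a) == AsciiToLower(b) ? FastPath::Equal
                                            : FastPath::Different;
}
```

For mostly-ASCII inputs such as typical hostnames, nearly every mismatch would be settled on this path, which is the basis for the claim that the slower conversion is rarely needed.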

Simon Montagu :smontagu

Comment 27

•

14 years ago

OTOH, if 3 * sizeof(PRUint32) * ~ 350 is negligible, we can also fix bug 210501, since the extra size for expanding the existing tables from UCS-2 to UCS-4 is half that ;-)

Mike Shaver (:shaver emeritus)

Comment 28

•

14 years ago

4200 bytes is negligible, yes. Did someone really claim that 2100 bytes was too much to add for non-BMP case folding? I'm afraid to look at the bug now.
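The two figures in this exchange follow from simple arithmetic, sketched below. It assumes exactly 3 tables of 350 entries (the thread says "~ 350"), uses std::uint32_t as a stand-in for PRUint32, and assumes the existing tables have the same shape with 16-bit entries, which is how the "half that" remark reads. All names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Bytes used by `tables` lookup tables of `entries` entries,
// each entry `entryBytes` wide.
inline std::size_t TableBytes(std::size_t tables, std::size_t entries,
                              std::size_t entryBytes) {
  assert(entryBytes > 0);
  return tables * entries * entryBytes;
}

const std::size_t kTables = 3;     // from "3 * sizeof(PRUint32) * ~ 350"
const std::size_t kEntries = 350;  // approximate count quoted in the thread

// Cost of the new 32-bit tables.
inline std::size_t NewTableBytes() {
  return TableBytes(kTables, kEntries, sizeof(std::uint32_t));
}

// Extra cost of widening same-sized tables from 16-bit to 32-bit entries.
inline std::size_t WideningBytes() {
  return TableBytes(kTables, kEntries,
                    sizeof(std::uint32_t) - sizeof(std::uint16_t));
}
```

With these assumptions NewTableBytes() gives 3 * 350 * 4 = 4200 and WideningBytes() gives 3 * 350 * 2 = 2100, matching the numbers quoted above.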

Kyle Huey (Exited; not receiving bugmail, old account, do not use)

Comment 29

•

14 years ago

I'll try to get some solid numbers relatively soon.

Kyle Huey (Exited; not receiving bugmail, old account, do not use)

Comment 30

•

14 years ago

(In reply to comment #27)
> OTOH, if 3 * sizeof(PRUint32) * ~ 350 is negligible, we can also fix bug
> 210501, since the extra size for expanding the existing tables from UCS-2 to
> UCS-4 is half that ;-)

I'm breaking most of the same interfaces we'd need to break to do that, so we probably could pick up Bug 210501 at some point.