Closed Bug 102624 Opened 24 years ago Closed 24 years ago

nsUnicodeToEscapedU does not handle surrogate pair correctly

Categories

(Core :: Internationalization, defect)

x86
Windows NT
defect
Not set
normal

Tracking

()

VERIFIED INVALID

People

(Reporter: ftang, Assigned: ftang)

Details

(Keywords: intl)

0xd800 0xdc00 should be output as \u10000 instead of \ud800\udc00
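For reference, the link between the code point U+10000 and the code-unit pair 0xd800 0xdc00 is the standard UTF-16 surrogate arithmetic. Below is a minimal Python sketch of that arithmetic; the function names are hypothetical, chosen for illustration, and are not taken from nsUnicodeToEscapedU.

```python
def to_surrogate_pair(code_point):
    """Split a supplementary code point (U+10000..U+10FFFF) into
    its UTF-16 high and low surrogate code units."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary-plane code point")
    offset = code_point - 0x10000           # 20-bit value
    high = 0xD800 + (offset >> 10)          # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)         # bottom 10 bits
    return high, low


def from_surrogate_pair(high, low):
    """Combine a valid high/low surrogate pair back into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


if __name__ == "__main__":
    high, low = to_surrogate_pair(0x10000)
    print("%04x %04x" % (high, low))
    print("U+%X" % from_surrogate_pair(0xD800, 0xDC00))
```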
Status: NEW → ASSIGNED
some answer from newsgroup

From: David Hopwood <david.hopwood@zetnet.co.uk>
Mon 8:54 PM
Subject: Re: surrogate at java's property file
To: Yung-Fong Tang <ftang@netscape.com>, unicode@unicode.org

-----BEGIN PGP SIGNED MESSAGE-----

Yung-Fong Tang wrote:
> Any one know how does Java handle Surrogate pair property file ?
>
> Java's property file use the \u encoding for non ASCII characters,
> therefore U+00a5 is \u00A5. I wonder anyone know how does it handle
> Surrogate Pair?
>
> Does U+10000 (0xd800 0xdc00) encoded as "\u10000" or "\ud800\udc00" ? (I
> think it should be \u10000) or they cannot handle them at all ?

"\ud800\udc00". Java 'char's are really UTF-16 code units (that's what the converters implement; any documentation that says UCS-2 is out of date). It's up to applications to avoid splitting surrogates.

From: "Addison Phillips [wM]" <aphillips@webmethods.com>
Mon 6:23 PM
Subject: RE: surrogate at java's property file
To: "Yung-Fong Tang" <ftang@netscape.com>, <unicode@unicode.org>

Java doesn't define any characters beyond Unicode 2.1.8 at the moment. It's stuck in a time-warp. JDK 1.4 will update to Unicode 3.0... neither of these versions have defined characters in the supplemental planes.

In Java, a java.lang.Character object is closely tied to the definition of an "int", the 16-bit numeric type. Many classes and objects make no distinction (or worse, conflate a character with an int---many methods are defined to take and return ints for "Characters"). As a result, the Java character model appears to be tied to UCS-2 (and I don't mean UTF-16). A surrogate character *is* recognized to be a surrogate, but a high-low pair is not recognized as representing a character, nor can you retrieve the character properties of the matched pair.

So to property files. The java.lang.Character sequence U+D800 U+DC00 is represented by the sequence "\ud800\udc00". This sequence does NOT represent U+10000. It represents TWO Characters, which happen to be surrogates that form a valid pair.

I should point out that Java is slightly clever. For example, the UTF-8 converter knows that U+D800 U+DC00 represents the scalar value U+10000 and encodes it as a valid four byte sequence: f0-90-80-80 (and vice versa, of course).

However, it is unclear how Unicode 3.1 support is going to make it into JDK 1.4++. The APIs are going to have to change to support the supplemental planes and the ripple effects on various APIs seems like an interesting problem. Perhaps they'll redefine an int to be a 32-bit value and switch Java to UTF-32 (yeah, sure.....)

Best Regards,

Addison

Addison P. Phillips
Globalization Architect / Manager, Globalization Engineering
webMethods, Inc.
432 Lakeside Drive, Sunnyvale, CA
+1 408.962.5487 (phone)
+1 408.210.3569 (mobile)
-------------------------------------------------
Internationalization is an architecture. It is not a feature.

-----Original Message-----
From: unicode-bounce@unicode.org [mailto:unicode-bounce@unicode.org]On Behalf Of Yung-Fong Tang
Sent: Monday, October 01, 2001 5:10 PM
To: unicode@unicode.org
Subject: surrogate at java's property file

Any one know how does Java handle Surrogate pair property file ?

Java's property file use the \u encoding for non ASCII characters, therefore U+00a5 is \u00A5. I wonder anyone know how does it handle Surrogate Pair?

Does U+10000 (0xd800 0xdc00) encoded as "\u10000" or "\ud800\udc00" ? (I think it should be \u10000) or they cannot handle them at all ?

From: "Addison Phillips [wM]" <aphillips@webmethods.com>
Mon 6:33 PM
Subject: RE: surrogate at java's property file
To: "Addison Phillips [wM]" <aphillips@webmethods.com>, "Yung-Fong Tang" <ftang@netscape.com>, <unicode@unicode.org>

But then, it's my day to be an idiot...

Of course an int can store more than 16 bits. It's char that's defined at 0..65535 in Java. int's will work fine in the APIs. It's the chars that are a problem.

Must be the heat. ;-)

Addison

Addison P. Phillips
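The behavior described in the replies above (one \uXXXX escape per UTF-16 code unit, and a four-byte UTF-8 form for the same character) can be sketched as follows. This is a Python illustration only, not Java's or Mozilla's implementation; the function name escape_per_code_unit is hypothetical, and the lowercase hex digits are an arbitrary choice made to match the strings quoted in the thread.

```python
def escape_per_code_unit(text):
    """Escape non-ASCII characters as \\uXXXX, emitting one escape
    per UTF-16 code unit (so a supplementary character yields two)."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 0x80:
            out.append(ch)                       # ASCII passes through
        elif cp < 0x10000:
            out.append("\\u%04x" % cp)           # single BMP code unit
        else:
            offset = cp - 0x10000
            high = 0xD800 + (offset >> 10)       # high surrogate
            low = 0xDC00 + (offset & 0x3FF)      # low surrogate
            out.append("\\u%04x\\u%04x" % (high, low))
    return "".join(out)


if __name__ == "__main__":
    print(escape_per_code_unit("\u00a5"))        # yen sign, one escape
    print(escape_per_code_unit("\U00010000"))    # U+10000, two escapes
    # UTF-8 encodes the scalar value itself, not the two surrogates.
    print("-".join("%02x" % b for b in "\U00010000".encode("utf-8")))
```

Under this scheme the output for U+10000 is the pair form, never a five-digit "\u10000", since a reader expecting four hex digits after \u would parse that as U+1000 followed by a literal "0".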
It looks like we are compatible with Java now. Marking this bug as invalid.
Status: ASSIGNED → RESOLVED
Closed: 24 years ago
Resolution: --- → INVALID
Verified as invalid according to ftang's comment.
Status: RESOLVED → VERIFIED
Keywords: intl