Closed Bug 469463 Opened 16 years ago Closed 16 years ago

Firefox 3 does not properly display surrogate characters

Product/Component: Core :: Internationalization (defect)
Platform: x86, All
Priority: Not set, Severity: normal
Status: RESOLVED INVALID
Reporter: sander.peschier, Assigned: smontagu
Attachments: 1 file

User-Agent:       Mozilla/5.0 (Windows; U; Windows NT 5.1; nl; rv:1.9.0.4) Gecko/2008102920 Firefox/3.0.4
Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.1; nl; rv:1.9.0.4) Gecko/2008102920 Firefox/3.0.4

I am working with Chinese characters that have a Unicode value > #ffff. On a website I have programmed myself, I am using two ways to display these characters:
1. using an HTML entity, for instance: 𨳌
2. using a server-side function that converts the character into an HTML-encoded string, for instance: Server.HTMLEncode("
It seems that something went wrong when saving the bug report. This should be added:
====
Server.HTMLEncode("<some Chinese character >#ffff>"). The result of this function is a surrogate pair: &#55395;&#56524;

Both approaches should work (and the first one does), but with the second one I get .

Both approaches used to work fine with Firefox 2.
====

To reproduce the bug, use this code:
====
<html>
<body>
Both methods should produce the same character, but they do not:<br>
Method 1: &#167116;<br>
Method 2: &#55395;&#56524;<br>
</body>
</html>
====
Assignee: nobody → smontagu
Component: General → Internationalization
Product: Firefox → Core
QA Contact: general → i18n
Attached file reporter's testcase
Confirmed using: Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.0.4) Gecko/2008102920 Firefox/3.0.4
and
Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.2a1pre) Gecko/20081206 Minefield/3.2a1pre
OS: Windows XP → All
This bug is INVALID: &#55395; and &#56524; are legal UTF-16 code points, but not legal Unicode scalar values, and so are illegal values for NCRs. It's true that they worked in Firefox 2, but that was a bug which has since been fixed (bug 316394).
Blocks: 316394
Status: UNCONFIRMED → RESOLVED
Closed: 16 years ago
Resolution: --- → INVALID
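
The distinction drawn in the resolution can be checked mechanically. Below is a minimal sketch in Python, assuming the Unicode definition of a scalar value (any code point from 0 to 0x10FFFF outside the surrogate range 0xD800 to 0xDFFF). The function name `is_scalar_value` is illustrative only and is not taken from Gecko.

```python
def is_scalar_value(code_point: int) -> bool:
    """Return True if code_point is a Unicode scalar value."""
    in_unicode_range = 0 <= code_point <= 0x10FFFF
    is_surrogate = 0xD800 <= code_point <= 0xDFFF
    return in_unicode_range and not is_surrogate


# The three numbers used in the reporter's testcase.
for value in (167116, 55395, 56524):
    print(value, hex(value), is_scalar_value(value))
```

Only 167116 passes the check; 55395 and 56524 both fall inside the surrogate range.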
I don't think you should take &#55395; and &#56524; as two separate characters. The first is a so-called High-Surrogate Code Unit.

From the Unicode website:
====
High-Surrogate Code Unit. A 16-bit code unit in the range D800 to DBFF, used in UTF-16 as the leading code unit of a surrogate pair. Also known as a leading surrogate. (See definition D72 in Section 3.8, Surrogates.)
====

The 2 code points form a pair so that Unicode scalar values greater than 0xFFFF can be used.
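
To see the pairing concretely, here is a small Python sketch that asks the standard UTF-16 codec for the code units of the character from the testcase. The helper name `utf16_code_units` is made up for this illustration. It only shows the encoding form and says nothing about what is valid inside an HTML reference.

```python
import struct
from typing import List


def utf16_code_units(char: str) -> List[int]:
    """Return the 16-bit code units UTF-16 uses for a single character."""
    raw = char.encode("utf-16-be")  # big-endian, no byte order mark
    return list(struct.unpack(f">{len(raw) // 2}H", raw))


# 167116 is the value used in "Method 1" of the testcase.
units = utf16_code_units(chr(167116))
print(units, [hex(u) for u in units])
```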


55395: D863

(In reply to comment #3)
> This bug is INVALID: &#55395; and &#56524; are legal UTF-16 code points, but
> not legal Unicode scalar values, and so are illegal values for NCRs. It's true
> that they worked in Firefox 2, but that was a bug which has since been fixed
> (bug 316394).
(In reply to comment #4)

> High-Surrogate Code Unit. A 16-bit code unit in the range D800 to DBFF, used in
> UTF-16 as the leading code unit of a surrogate pair. Also known as a leading
> surrogate. (See definition D72 in Section 3.8, Surrogates.)

Yes, but an NCR must represent an abstract character in the document character set without being limited to a specific encoding, and surrogate code units are explicitly defined as not abstract characters, and are "used only in the context of the UTF-16 character encoding form".

See also the section on supplementary characters in http://www.w3.org/International/questions/qa-escapes:

|Supplementary characters are those Unicode characters that have code points 
|higher than the characters in the Basic Multilingual Plane (BMP). In UTF-16 a 
|supplementary character is encoded using two 16-bit surrogate code points from 
|the BMP. Because of this, some people think that supplementary characters need 
|to be represented using two escapes, but this is incorrect - you must use the 
|single, scalar value for that character. For example, use &#x233B4; rather 
|than &#xD84C;&#xDFB4;

and the output of the W3C validator for attachment 352870: http://validator.w3.org/check?uri=https%3A//bugzilla.mozilla.org/attachment.cgi%3Fid%3D352870, which flags &#55395; and &#56524; as "reference to non-SGML character".
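
For pages that already contain such escape pairs, the fix described in the W3C note can be sketched in Python. This is an illustration under stated assumptions: the helper name `merge_surrogate_ncrs` is made up, it only handles numeric references (decimal or hex) that appear as an adjacent high/low pair, and it always emits the merged reference in decimal.

```python
import re

# One numeric character reference, decimal (&#55395;) or hex (&#xD863;).
NCR = re.compile(r"&#([xX][0-9A-Fa-f]+|[0-9]+);")


def _value(digits: str) -> int:
    """Parse the digits of a numeric character reference."""
    return int(digits[1:], 16) if digits[0] in "xX" else int(digits)


def merge_surrogate_ncrs(html: str) -> str:
    """Replace each high/low surrogate NCR pair with one scalar-value NCR."""
    out = []
    pos = 0
    while True:
        first = NCR.search(html, pos)
        if first is None:
            break
        # Only a reference that starts exactly where the first one ends counts.
        second = NCR.match(html, first.end())
        high = _value(first.group(1))
        low = _value(second.group(1)) if second else -1
        if 0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF:
            scalar = 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
            out.append(html[pos:first.start()])
            out.append(f"&#{scalar};")
            pos = second.end()
        else:
            # Not a pair: copy the first reference through unchanged.
            out.append(html[pos:first.end()])
            pos = first.end()
    out.append(html[pos:])
    return "".join(out)


print(merge_surrogate_ncrs("Method 2: &#55395;&#56524;<br>"))
```

Unpaired surrogate references are left untouched, so they would still be flagged by a validator.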
Aha! Ok, I see now why this isn't really a bug of Firefox.

The real problem is the ASP function Server.HTMLEncode(). It splits the original character into two entities (&#55395; and &#56524;) instead of leaving it as one (&#167116;).
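
For comparison, an encoder that avoids this problem iterates over code points instead of UTF-16 code units. A rough Python sketch follows; `html_encode` is a hypothetical stand-in, not the ASP `Server.HTMLEncode` implementation, and it assumes that every non-ASCII character should become a decimal numeric reference.

```python
def html_encode(text: str) -> str:
    """Escape markup characters and emit one NCR per non-ASCII code point."""
    named = {"&": "&amp;", "<": "&lt;", ">": "&gt;", '"': "&quot;"}
    parts = []
    for char in text:  # Python 3 strings iterate by code point, not code unit
        if char in named:
            parts.append(named[char])
        elif ord(char) < 128:
            parts.append(char)
        else:
            parts.append(f"&#{ord(char)};")
    return "".join(parts)


# chr(167116) is the supplementary character from the testcase.
print(html_encode("a < " + chr(167116)))
```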

Thank you all for helping me.

Sander