Closed Bug 469463 Opened 16 years ago Closed 16 years ago

Firefox 3 does not properly display surrogate characters

Status:

RESOLVED INVALID

People

(Reporter: sander.peschier, Assigned: smontagu)

Attachments

(1 file)

reporter's testcase, 16 years ago, Nochum Sossonko [:Natch], 161 bytes, text/html

sander.peschier

Reporter

Description

•

16 years ago

User-Agent:       Mozilla/5.0 (Windows; U; Windows NT 5.1; nl; rv:1.9.0.4) Gecko/2008102920 Firefox/3.0.4
Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.1; nl; rv:1.9.0.4) Gecko/2008102920 Firefox/3.0.4

I am working with Chinese characters that have a Unicode value > #ffff. On a website I have programmed myself, I am using two ways to display these characters:
1. using an HTML entity, for instance: &#167116;
2. using a server-side function that converts the character into an HTML-encoded string, for instance: Server.HTMLEncode("

sander.peschier

Reporter

Comment 1

•

16 years ago

It seems that something has gone wrong with saving the bug report. This should be added:
====
Server.HTMLEncode("<some chinese character >#ffff>"). The result of this function is a surrogate pair: &#55395;&#56524;

Both approaches should work (and the first one does), but with the second one I get .

Both approaches used to work fine with Firefox 2. 
====

To reproduce the bug, use this code:
====
<html>
<body>
Both methods should produce the same character, but they do not:<br>
Method 1: &#167116;<br>
Method 2: &#55395;&#56524;<br>
</body>
</html>
====
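For reference, the two methods in the testcase are related by the standard UTF-16 surrogate arithmetic. The following is a minimal sketch, not part of the original report; the function name `to_surrogate_pair` is made up for illustration. It shows that code point 167116 (U+28CCC) splits into exactly the two numbers used in Method 2.

```python
from typing import Tuple


def to_surrogate_pair(code_point: int) -> Tuple[int, int]:
    """Split a supplementary code point (U+10000..U+10FFFF) into its
    UTF-16 high and low surrogate code units."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    offset = code_point - 0x10000          # 20-bit value
    high = 0xD800 + (offset >> 10)         # top 10 bits
    low = 0xDC00 + (offset & 0x3FF)        # bottom 10 bits
    return high, low


print(to_surrogate_pair(167116))  # (55395, 56524)
```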

Matthias Versen [:Matti]

Updated

•

16 years ago

Assignee: nobody → smontagu

Component: General → Internationalization

Product: Firefox → Core

QA Contact: general → i18n

Nochum Sossonko [:Natch]

Comment 2

•

16 years ago

Attached file: reporter's testcase

Confirmed using: Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.0.4) Gecko/2008102920 Firefox/3.0.4
and
Mozilla/5.0 (Windows; U; Windows NT 6.0; en-US; rv:1.9.2a1pre) Gecko/20081206 Minefield/3.2a1pre

Nochum Sossonko [:Natch]

Updated

•

16 years ago

OS: Windows XP → All

Simon Montagu :smontagu

Assignee

Comment 3

•

16 years ago

This bug is INVALID: &#55395; and &#56524; are legal UTF-16 code points, but not legal Unicode scalar values, and so are illegal values for NCRs. It's true that they worked in Firefox 2, but that was a bug which has since been fixed (bug 316394).
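The distinction drawn above can be checked mechanically. As a rough sketch (the helper name `is_scalar_value` is hypothetical, and this is not how Gecko implements the check): a Unicode scalar value is any code point from 0 to 0x10FFFF except the surrogate range 0xD800 to 0xDFFF.

```python
def is_scalar_value(n: int) -> bool:
    """True if n is a Unicode scalar value, i.e. a code point
    outside the surrogate range D800..DFFF."""
    return 0 <= n <= 0x10FFFF and not (0xD800 <= n <= 0xDFFF)


for value in (167116, 55395, 56524):
    print(value, hex(value), is_scalar_value(value))
# 167116 0x28ccc True
# 55395 0xd863 False
# 56524 0xdccc False
```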

Blocks: 316394

Simon Montagu :smontagu

Assignee

Updated

•

16 years ago

Status: UNCONFIRMED → RESOLVED

Closed: 16 years ago

Resolution: --- → INVALID

sander.peschier

Reporter

Comment 4

•

16 years ago

I don't think you should take &#55395; and &#56524; as two separate characters. The first is a so-called High-Surrogate Code Unit.

From the Unicode website:
====
High-Surrogate Code Unit. A 16-bit code unit in the range D800 to DBFF, used in UTF-16 as the leading code unit of a surrogate pair. Also known as a leading surrogate. (See definition D72 in Section 3.8, Surrogates.)
====

The two code points form a pair so that Unicode scalar values greater than 0xFFFF can be used.


55395: D863

(In reply to comment #3)
> This bug is INVALID: &#55395; and &#56524; are legal UTF-16 code points, but
> not legal Unicode scalar values, and so are illegal values for NCRs. It's true
> that they worked in Firefox 2, but that was a bug which has since been fixed
> (bug 316394).

Simon Montagu :smontagu

Assignee

Comment 5

•

16 years ago

(In reply to comment #4)

> High-Surrogate Code Unit. A 16-bit code unit in the range D800 to DBFF, used in
> UTF-16 as the leading code unit of a surrogate pair. Also known as a leading
> surrogate. (See definition D72 in Section 3.8, Surrogates.)

Yes, but an NCR must represent an abstract character in the document character set without being limited to a specific encoding, and surrogate code units are explicitly defined as not abstract characters, and are "used only in the context of the UTF-16 character encoding form".
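To illustrate why surrogates belong to one encoding form rather than to the character itself, here is a small sketch using Python's built-in codecs (the helper `code_units` is a made-up name). The same character U+28CCC yields surrogate code units only when serialized as UTF-16.

```python
def code_units(text: str, encoding: str, width: int) -> list:
    """Encode text and return its code units as integers.
    width is the size of one code unit in bytes."""
    data = text.encode(encoding)
    return [int.from_bytes(data[i:i + width], "big")
            for i in range(0, len(data), width)]


ch = chr(0x28CCC)  # the character from the testcase, scalar value 167116

print(code_units(ch, "utf-16-be", 2))  # [55395, 56524]  surrogate pair
print(code_units(ch, "utf-32-be", 4))  # [167116]        the scalar value itself
print(len(code_units(ch, "utf-8", 1)))  # 4 (bytes in UTF-8, no surrogates involved)
```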

See also the section on supplementary characters in http://www.w3.org/International/questions/qa-escapes:

|Supplementary characters are those Unicode characters that have code points 
|higher than the characters in the Basic Multilingual Plane (BMP). In UTF-16 a 
|supplementary character is encoded using two 16-bit surrogate code points from 
|the BMP. Because of this, some people think that supplementary characters need 
|to be represented using two escapes, but this is incorrect - you must use the 
|single, scalar value for that character. For example, use &#x233B4; rather 
|than &#xD84C;&#xDFB4;

and the output of the W3C validator for attachment 352870: http://validator.w3.org/check?uri=https%3A//bugzilla.mozilla.org/attachment.cgi%3Fid%3D352870, which flags &#55395; and &#56524; as "reference to non-SGML character".
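For markup that already contains the two-escape form, one possible repair is to rewrite each adjacent high/low surrogate pair of NCRs as a single NCR. The following is only a sketch under assumptions: the function name `merge_surrogate_ncrs` is invented, it handles only numeric references written exactly as `&#N;` or `&#xH;`, and it leaves unpaired surrogate references untouched.

```python
import re

# One numeric character reference, decimal (&#167116;) or hex (&#x28CCC;).
NCR = re.compile(r"&#([xX][0-9A-Fa-f]+|[0-9]+);")


def _value(ref: str) -> int:
    """Numeric value of the text between '&#' and ';'."""
    return int(ref[1:], 16) if ref[0] in "xX" else int(ref)


def merge_surrogate_ncrs(markup: str) -> str:
    """Replace each high+low surrogate NCR pair with one scalar-value NCR."""
    out = []
    pos = 0
    while True:
        first = NCR.search(markup, pos)
        if first is None:
            break
        out.append(markup[pos:first.start()])
        second = NCR.match(markup, first.end())  # must follow immediately
        hi = _value(first.group(1))
        lo = _value(second.group(1)) if second else -1
        if 0xD800 <= hi <= 0xDBFF and 0xDC00 <= lo <= 0xDFFF:
            cp = 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00)
            is_hex = first.group(1)[0] in "xX"
            out.append(f"&#x{cp:X};" if is_hex else f"&#{cp};")
            pos = second.end()
        else:
            out.append(first.group(0))
            pos = first.end()
    out.append(markup[pos:])
    return "".join(out)


print(merge_surrogate_ncrs("Method 2: &#55395;&#56524;<br>"))  # Method 2: &#167116;<br>
print(merge_surrogate_ncrs("&#xD84C;&#xDFB4;"))                # &#x233B4;
```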

sander.peschier

Reporter

Comment 6

•

16 years ago

Aha! OK, I see now why this isn't really a bug in Firefox.

The real problem is the ASP function Server.HTMLEncode(). It splits the original character into two entities (&#55395; and &#56524;) instead of leaving it as one (&#167116;).

Thank you all for helping me.

Sander
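The difference between the two outputs can be reproduced with a short sketch. This is not the actual implementation of Server.HTMLEncode; it assumes the split arises from escaping each UTF-16 code unit separately. Both function names are hypothetical.

```python
import html


def encode_per_scalar(text: str) -> str:
    """Escape non-ASCII characters with one NCR per code point.
    Python strings iterate by code point, so a supplementary
    character yields a single reference."""
    return "".join(
        html.escape(ch) if ord(ch) < 128 else f"&#{ord(ch)};"
        for ch in text
    )


def encode_per_utf16_unit(text: str) -> str:
    """Escape each UTF-16 code unit separately (assumed cause of the
    split output). Supplementary characters become two invalid NCRs."""
    data = text.encode("utf-16-be")
    out = []
    for i in range(0, len(data), 2):
        unit = int.from_bytes(data[i:i + 2], "big")
        out.append(html.escape(chr(unit)) if unit < 128 else f"&#{unit};")
    return "".join(out)


ch = chr(167116)
print(encode_per_scalar(ch))       # &#167116;
print(encode_per_utf16_unit(ch))   # &#55395;&#56524;
```

An encoder that walks a UTF-16 string unit by unit would need to detect a high surrogate and combine it with the following low surrogate before emitting the reference.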

Categories

(Core :: Internationalization, defect)
