Closed Bug 207923 Opened 21 years ago Closed 21 years ago

[RFE] Uint32toUTF16 (Uint32toString?) to use with fromCharCode()

Categories

(Core :: JavaScript Engine, enhancement)

enhancement
Not set
normal

Tracking

()

VERIFIED INVALID

People

(Reporter: jshin1987, Assigned: waldemar)

Details

(Keywords: intl)

Attachments

(1 file)

a spin-off from bug 162431

The fromCharCode() method of the String class doesn't understand non-BMP
characters. For non-BMP characters, it always returns characters as if the input
character code were bitwise ANDed with 0xffff (i.e. only the lowest 16 bits are
interpreted).

JavaScript is supposed to use UTF-16 (NOT UCS-2), but it may still be using
UCS-2 in some places.
Attached file a test case
This is to demonstrate that String.fromCharCode(0x10400) gives the same result
as String.fromCharCode(0x0400). That is, only the low 16 bits are used.
On the other hand, String.fromCharCode(0xd801, 0xdc00) (a pair of surrogate
code points for U+10400) works. I guess this works only by coincidence, because
Mozilla internally uses UTF-16.
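
The behavior described above can be reproduced with a few lines. This is not the attached test case, only a minimal sketch of it; the variable names are made up for illustration.

```javascript
// Sketch of the reported behavior (not the attached test case).
// The variable names here are hypothetical.
const truncated = String.fromCharCode(0x10400);   // only the low 16 bits are used
const bmp = String.fromCharCode(0x0400);          // U+0400
const pair = String.fromCharCode(0xd801, 0xdc00); // surrogate pair for U+10400

console.log(truncated === bmp); // true
console.log(pair.length);       // 2 (two 16-bit code units)
```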

It has to be checked what the ECMAScript standard says about fromCharCode(),
i.e. whether it's supposed to accept USVs for non-BMP characters or accept
surrogate pairs.
Adding Brendan, the father of JS, to CC.
This bug might be invalid because ECMA-262 has the following and ToUint16 is
defined to return the input value modulo 2^16. I'm not sure, but this doesn't
seem to be the best way to define fromCharCode.  

---------------
15.5.3.2 String.fromCharCode ( [ char0 [ , char1 [ , … ] ] ] )

Returns a string value containing as many characters as the number of
arguments. Each argument specifies one character of the resulting string, with
the first argument specifying the first character, and so on, from left to
right. An argument is converted to a character by applying the operation
ToUint16 (9.7) and regarding the resulting 16-bit integer as the code point
value of a character. If no arguments are supplied, the result is the empty
string.

The length property of the fromCharCode function is 1.
--------------
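
To make the effect of ToUint16 concrete, here is a rough sketch of the conversion as described above (truncate to an integer, then take the value modulo 2^16). The function name toUint16 is hypothetical and this is not the engine's code.

```javascript
// Rough sketch of ToUint16 as described in the quoted text.
// toUint16 is a hypothetical helper name, not an engine or standard API.
function toUint16(value) {
  const n = Number(value);
  if (!Number.isFinite(n)) return 0;         // NaN and +/-Infinity map to 0
  const int = Math.trunc(n);                 // drop any fractional part
  return ((int % 65536) + 65536) % 65536;    // non-negative result modulo 2^16
}

console.log(toUint16(0x10400).toString(16)); // "400"
console.log(toUint16(-1));                   // 65535
```

Under this definition, String.fromCharCode(0x10400) and String.fromCharCode(0x0400) necessarily produce the same string.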

This is not a valid bug against the engine.  The ECMA TC39 working group is
looking into full Unicode 17 plane support, I hear.  Cc'ing waldemar, but I
expect this bug should be an RFE for now.

/be
Severity: normal → enhancement
Summary: fromCharaCode doesn't understand non-BMP characters → fromCharCode doesn't understand non-BMP characters
Brendan, you came here while I was reading through ECMA 262 and writing this
:-). I agree with you.


ECMA section 6 has the following about 'code point', 'character' and 'Unicode
character'. According to this and the definition of fromCharCode(), this bug is
invalid.  

Although not convenient, one has to write a simple function (if not already
available) to convert a USV corresponding to a non-BMP character to a pair of
surrogate code points and call that before invoking fromCharCode().
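
A minimal sketch of such a helper, assuming the input is an integer in the range 0 to 0x10ffff. The name uint32ToUtf16 is borrowed loosely from the summary line and is hypothetical; nothing like it is defined by the standard discussed here.

```javascript
// Sketch only: convert a Unicode scalar value to a UTF-16 string using
// fromCharCode(). uint32ToUtf16 is a hypothetical name, not an existing API.
function uint32ToUtf16(codePoint) {
  if (!Number.isInteger(codePoint) || codePoint < 0 || codePoint > 0x10ffff) {
    throw new RangeError("not a valid Unicode code point: " + codePoint);
  }
  if (codePoint <= 0xffff) {
    // BMP values fit in one 16-bit unit. Values in the surrogate range
    // 0xd800-0xdfff are passed through unchanged in this sketch.
    return String.fromCharCode(codePoint);
  }
  const offset = codePoint - 0x10000;     // 20-bit offset into the supplementary planes
  const high = 0xd800 + (offset >> 10);   // top 10 bits: high (lead) surrogate
  const low = 0xdc00 + (offset & 0x3ff);  // bottom 10 bits: low (trail) surrogate
  return String.fromCharCode(high, low);
}

// U+10400 becomes the pair 0xd801, 0xdc00 mentioned earlier in this bug.
console.log(uint32ToUtf16(0x10400) === String.fromCharCode(0xd801, 0xdc00)); // true
```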

I'm changing the summary line to request such a function as an RFE. I realize
that this can't be done by Mozilla alone (although offering additional functions
does not violate the standard) and has to be coordinated through the standards
body.


--------------
Throughout the rest of this document, the phrase “code point” and the word
“character” will be used to refer to a 16-bit unsigned value used to represent
a single 16-bit unit of UTF-16 text. The phrase “Unicode character” will be
used to refer to the abstract linguistic or typographical unit represented by a
single Unicode scalar value (which may be longer than 16 bits and thus may be
represented by more than one code point).
----------------------
Summary: fromCharCode doesn't understand non-BMP characters → [RFE] Uint32toUTF16 (Uint32toString?) to use with fromCharCode()
Since this is a standards issue, let me reassign this to Waldemar -
Assignee: rogerl → waldemar
SpiderMonkey works as specified by ECMAScript Edition 3.  For Edition 4 we've
already changed the standard to allow supplementary character codes as input to
fromCharCode, which will no longer treat the integers modulo 2^16.  For
supplementary characters you'll get a pair of surrogates in the string.
Status: NEW → RESOLVED
Closed: 21 years ago
Resolution: --- → INVALID
Marking Verified.

jshin@mailaps.org: thank you for raising this question -
Status: RESOLVED → VERIFIED