Closed
Bug 320500
Opened 19 years ago
Closed 10 years ago
Add \u{xxxxxx} string literals for non-BMP Unicode characters
Categories
(Core :: JavaScript Engine, defect)
Tracking
RESOLVED
FIXED
mozilla40
| | Tracking | Status |
|---|---|---|
| firefox40 | --- | fixed |
People
(Reporter: daumling, Assigned: arai)
References
Details
(Keywords: dev-doc-complete, intl)
Attachments
(1 file)
6.23 KB, patch | Waldo: review+
The Chinese government requires support for Unicode characters above 0xFFFF. SpiderMonkey should at least support the definition of large Unicode character constants. Adobe ExtendScript has the \Uxxxxxxxx (capital U, 8 hex digits) notation, which generates a surrogate pair. We should add this capability to the SpiderMonkey parser as well.
Is extended Unicode character support planned for JS 2.0?
Comment 1•19 years ago
(In reply to comment #0)
> Adobe ExtendScript has the \Uxxxxxxxx (capital U, 8 hex digits)
> notation that generates a surrogate pair. We should add this capability to the
> SpiderMonkey parser as well.
Does it support only \Uxxxxxxxx and not, say, \U{xxxxxxxx}? The latter is much easier to read and allows writing \U{xxxxx} with five hex digits, which should cover, AFAIK, all currently defined characters.
Reporter
Comment 2•19 years ago
Let me modify the proposal according to your idea:
Let's allow for {xxx} as a generic hex escape sequence. Make \u equivalent to \x:
\u{12345} == \x{12345}
Comment 3•19 years ago
I don't think we can move unilaterally. I'm being too lazy to dig up related ECMA activities in this area. Something must have been going on...
Keywords: intl
OS: Windows XP → All
Hardware: PC → All
Summary: REQUEST: Add \Uxxxxxxxx string literals for 32-bit glyphs → Add \Uxxxxxxxx string literals for non-BMP Unicode characters
Comment 4•19 years ago
(In reply to comment #3)
> I don't think we can move unilaterally. I'm being too lazy to dig up related
> ECMA activities in this area. Something must have been going on...
Yes, ECMA TG1 is meeting (see my blog in a little bit for an update).
We could use some i18n advice. There are obvious problems with ECMA-262 Edition 3 (e.g., RegExp character classes are generally ASCII-only). Jungshik, do you know of lists of problems, or bugs on file, that we can collate?
Michael, can you take assignment of this bug?
/be
Comment 5•12 years ago
The Ecma TC 39 meeting in May 2012 decided to use \u{xxxxxx}, with up to six hex digits and values up to 10FFFF.
http://norbertlindenberg.com/2012/05/ecmascript-supplementary-characters/index.html#Escapes
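For illustration, here is a short sketch of what the agreed syntax means in practice. It assumes an engine that already implements ES2015 code point escapes; the variable names are made up for this example.

```javascript
// Assumes an ES2015-capable engine; all names here are illustrative only.
const braced = "\u{1F600}";         // code point escape for U+1F600
const surrogates = "\uD83D\uDE00";  // the same character written as a surrogate pair

const sameString = braced === surrogates;  // both forms spell the same two code units
const codeUnits = braced.length;           // length is counted in UTF-16 code units
const codePoint = braced.codePointAt(0);   // reads back the full code point
```

The braced form only changes how the literal is written; the resulting string still stores a surrogate pair for any value above 0xFFFF.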
Updated•12 years ago
Summary: Add \Uxxxxxxxx string literals for non-BMP Unicode characters → Add \u{xxxxxx} string literals for non-BMP Unicode characters
Updated•11 years ago
Assignee: general → nobody
Assignee
Comment 6•10 years ago
There seems to be no restriction on the length of HexDigits [1] (though there is a restriction on its MV [2]), so I added a test with a very long run of leading "0"s. Is that correct, or did I overlook something?
assertEq(eval(`"\\u{${"0".repeat(Math.pow(2, 28) - 20) + "1234"}}"`), String.fromCodePoint(0x1234));
Green on try run: https://treeherder.mozilla.org/#/jobs?repo=try&revision=f389f9debf15
[1] http://people.mozilla.org/~jorendorff/es6-draft.html#sec-literals-string-literals
[2] http://people.mozilla.org/~jorendorff/es6-draft.html#sec-string-literals-static-semantics-early-errors
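The rule described in this comment (any number of hex digits, but a bounded MV) can be sketched as follows. This is only an illustration in JavaScript, not the SpiderMonkey code from the patch; `readBracedCodePoint` is a hypothetical helper that receives the text starting at the opening brace.

```javascript
// Hypothetical helper, not SpiderMonkey code: reads "{HexDigits}" from the
// start of `text` and returns the code point, or null if it is malformed.
function readBracedCodePoint(text) {
  if (text[0] !== "{") return null;
  let code = 0;
  let sawDigit = false;
  for (let i = 1; i < text.length; i++) {
    const ch = text[i];
    if (ch === "}") return sawDigit ? code : null;  // "{}" is not allowed
    if (!/^[0-9a-fA-F]$/.test(ch)) return null;     // spaces etc. are rejected
    code = code * 16 + parseInt(ch, 16);
    // Checking the bound on every digit keeps `code` small, so a long run of
    // digits can never overflow and wrap back to a valid value.
    if (code > 0x10FFFF) return null;
    sawDigit = true;
  }
  return null;  // input ended before the closing brace
}
```

Because leading zeros leave the accumulated value at 0, the digit count is unbounded while the value stays capped at 0x10FFFF.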
Assignee: nobody → arai.unmht
Attachment #8591431 - Flags: review?(sstangl)
Comment 7•10 years ago
Comment on attachment 8591431
Add \u{xxxxxx} string literals.
Forwarding review to Waldo, who likely has an opinion on the matter.
Attachment #8591431 - Flags: review?(sstangl) → review?(jwalden+bmo)
Comment 8•10 years ago
Comment on attachment 8591431
Add \u{xxxxxx} string literals.
Review of attachment 8591431:
-----------------------------------------------------------------
::: js/src/frontend/TokenStream.cpp
@@ +1681,5 @@
>
> bool
> +TokenStream::getBracedUnicode(uint32_t* cp)
> +{
> + skipChars(1);
Use consumeKnownChar('{'); instead.
@@ +1687,5 @@
> + bool first = true;
> + int32_t c;
> + uint32_t code = 0;
> + while (true) {
> + c = getCharIgnoreEOL();
I'd kind of prefer explicit treatment of |c == EOF| meaning return false directly, rather than through the JS7_ISHEX further down.
@@ +1750,5 @@
> + if (!getBracedUnicode(&code)) {
> + reportError(JSMSG_MALFORMED_ESCAPE, "Unicode");
> + return false;
> + }
> +
Add MOZ_ASSERT(code <= 0x10FFFF) here.
@@ +1751,5 @@
> + reportError(JSMSG_MALFORMED_ESCAPE, "Unicode");
> + return false;
> + }
> +
> + if (code >= 0x10000) {
Consistent with UTF16Encoding(cp) in the spec, I'd prefer if these two arms were reversed -- single code unit first, two code units second.
@@ +1754,5 @@
> +
> + if (code >= 0x10000) {
> + if (!tokenbuf.append((code - 0x10000) / 1024 + 0xD800))
> + return false;
> + c = (code - 0x10000) % 1024 + 0xDC00;
Mild preference for parenthesizing the % expression, though I think most readers would probably parse it as it executes.
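The encoding step discussed in these review comments can be sketched in JavaScript as below. This is an illustration, not the C++ from the patch: `utf16CodeUnits` is a hypothetical name, the range check stands in for the requested MOZ_ASSERT, and `Math.floor` replaces C++ integer division. The arms follow the order suggested in the review (single code unit first).

```javascript
// Hypothetical helper mirroring the spec's UTF16Encoding(cp); not patch code.
function utf16CodeUnits(cp) {
  // Stand-in for MOZ_ASSERT(code <= 0x10FFFF).
  if (!Number.isInteger(cp) || cp < 0 || cp > 0x10FFFF) {
    throw new RangeError("code point out of range");
  }
  // Single code unit first: BMP values are stored as-is.
  if (cp < 0x10000) {
    return [cp];
  }
  // Two code units second: split the 20-bit offset into lead and trail halves.
  const offset = cp - 0x10000;
  const lead = Math.floor(offset / 1024) + 0xD800;
  const trail = (offset % 1024) + 0xDC00;
  return [lead, trail];
}
```

For example, `String.fromCharCode(...utf16CodeUnits(0x1F600))` builds the same string as `String.fromCodePoint(0x1F600)`.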
::: js/src/tests/ecma_6/String/unicode-braced.js
@@ +39,5 @@
> +assertEq("\u{00}", String.fromCodePoint(0x0));
> +assertEq("\u{00000000000000000}", String.fromCodePoint(0x0));
> +assertEq("\u{00000000000001000}", String.fromCodePoint(0x1000));
> +
> +assertEq(eval(`"\\u{${"0".repeat(Math.pow(2, 28) - 20) + "1234"}}"`), String.fromCodePoint(0x1234));
512MB allocation here seems a bit much. :-) Math.pow(2, 24) should be more than adequate. (Actually I'm a little surprised you could [in theory] allocate a string that large. I thought our length limits were lower than that.)
@@ +52,5 @@
> +assertThrowsInstanceOf(() => eval(`"\\u{"`), SyntaxError);
> +assertThrowsInstanceOf(() => eval(`"\\u{110000}"`), SyntaxError);
> +assertThrowsInstanceOf(() => eval(`"\\u{00110000}"`), SyntaxError);
> +assertThrowsInstanceOf(() => eval(`"\\u{100000000000000000000000000000}"`), SyntaxError);
> +assertThrowsInstanceOf(() => eval(`"\\u{FFFFFFFFFFFFFFFFFFFFFFFFFFFFFF}"`), SyntaxError);
Add some tests with spaces before, after, and intermixed in the HexDigits.
Also a test with 100000001 or somesuch, to verify the absence of overflow wrapping around back to "\u0001", would be nice.
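The extra cases requested here could look roughly like the following. This is a sketch only, written as plain JavaScript rather than with the test suite's assertThrowsInstanceOf; `throwsSyntaxError` is a hypothetical helper, and the exact inputs are assumptions chosen to match the review comments.

```javascript
// Hypothetical helper: true if evaluating `source` raises a SyntaxError.
function throwsSyntaxError(source) {
  try {
    eval(source);
  } catch (e) {
    return e instanceof SyntaxError;
  }
  return false;
}

// Each entry is the source text of a string literal with a malformed escape.
const cases = [
  '"\\u{ 1234}"',      // space before the digits
  '"\\u{1234 }"',      // space after the digits
  '"\\u{12 34}"',      // space between the digits
  '"\\u{100000001}"',  // must not wrap around to "\u0001"
];
const results = cases.map(throwsSyntaxError);
```

Every entry in `results` should be true on an engine that implements the escape as specified.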
Attachment #8591431 - Flags: review?(jwalden+bmo) → review+
Comment 10•10 years ago
Status: NEW → RESOLVED
Closed: 10 years ago
status-firefox40: --- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla40
Assignee
Comment 11•10 years ago
Updated•10 years ago
Keywords: dev-doc-needed
Comment 13•10 years ago
See comment 11 plus:
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Lexical_grammar#Unicode_code_point_escapes
Keywords: dev-doc-needed → dev-doc-complete