Closed Bug 320500 Opened 19 years ago Closed 10 years ago

Add \u{xxxxxx} string literals for non-BMP Unicode characters

Categories

(Core :: JavaScript Engine, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla40
Tracking Status
firefox40 --- fixed

People

(Reporter: daumling, Assigned: arai)

References

Details

(Keywords: dev-doc-complete, intl)

Attachments

(1 file)

The Chinese government requires support for Unicode characters above 0xFFFF. SpiderMonkey should at least support defining such large Unicode character constants. Adobe ExtendScript has the \Uxxxxxxxx (capital U, 8 hex digits) notation, which generates a surrogate pair. We should add this capability to the SpiderMonkey parser as well. Is extended Unicode character support planned for JS 2.0?
(In reply to comment #0)
> Adobe ExtendScript has the \Uxxxxxxxx (capital U, 8 hex digits)
> notation that generates a surrogate pair. We should add this capability to the
> SpiderMonkey parser as well.
Does it support only \Uxxxxxxxx and not, say, \U{xxxxxxxx}? The latter is much easier to read and allows writing \U{xxxxx} with five hex digits, which AFAIK should cover all currently defined characters.
Let me modify the proposal according to your idea: Let's allow for {xxx} as a generic hex escape sequence. Make \u equivalent to \x: \u{12345} == \x{12345}
I don't think we can move unilaterally. I'm being too lazy to dig up related ECMA activities in this area. Something must have been going on...
Keywords: intl
OS: Windows XP → All
Hardware: PC → All
Summary: REQUEST: Add \Uxxxxxxxx string literals for 32-bit glyphs → Add \Uxxxxxxxx string literals for non-BMP Unicode characters
(In reply to comment #3)
> I don't think we can move unilaterally. I'm being too lazy to dig up related
> ECMA activities in this area. Something must have been going on...
Yes, ECMA TG1 is meeting (see my blog in a little bit for an update). We could use some i18n advice. There are obvious problems with ECMA-262 Edition 3 (e.g., RegExp character classes are generally ASCII-only). Jungshik, do you know of lists of problems, or bugs on file, that we can collate? Michael, can you take assignment of this bug?
/be
The Ecma TC 39 meeting in May 2012 decided to use \u{xxxxxx}, with up to six hex digits and values up to 10FFFF. http://norbertlindenberg.com/2012/05/ecmascript-supplementary-characters/index.html#Escapes
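As a small illustration of the syntax described above (a sketch that assumes an engine with ES6 support, not anything taken from the patch), a braced escape for a non-BMP code point should yield the same string as the corresponding surrogate pair written with two \uXXXX escapes. The sample code points are arbitrary choices.

```javascript
// Each pair of literals should denote the same string in an ES6 engine.
const samples = [
  ["\u{41}", "A"],                 // BMP character: one UTF-16 code unit
  ["\u{1F600}", "\uD83D\uDE00"],   // non-BMP: equals its surrogate pair
  ["\u{10FFFF}", "\uDBFF\uDFFF"],  // largest value the syntax allows
];

for (const [braced, expected] of samples) {
  // Logs whether the two spellings match, and the length in code units.
  console.log(braced === expected, braced.length);
}
```

Note that the non-BMP entries have length 2, because string length counts UTF-16 code units rather than code points.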
Summary: Add \Uxxxxxxxx string literals for non-BMP Unicode characters → Add \u{xxxxxx} string literals for non-BMP Unicode characters
Assignee: general → nobody
Blocks: 1135377
There seems to be no restriction on the length of the HexDigits [1] (though there is a restriction on its MV [2]), so I added a test with a very long run of leading zeros. Is that correct, or did I overlook something?
assertEq(eval(`"\\u{${"0".repeat(Math.pow(2, 28) - 20) + "1234"}}"`), String.fromCodePoint(0x1234));
Green on try run: https://treeherder.mozilla.org/#/jobs?repo=try&revision=f389f9debf15
[1] http://people.mozilla.org/~jorendorff/es6-draft.html#sec-literals-string-literals
[2] http://people.mozilla.org/~jorendorff/es6-draft.html#sec-string-literals-static-semantics-early-errors
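The distinction between digit count and value can be sketched at a much smaller scale. The helper names below (evalBracedEscape, isSyntaxError) are hypothetical and exist only for this illustration; the padding of 1000 zeros is an arbitrary choice to keep the example cheap.

```javascript
// Hypothetical helper: evaluates a string literal containing \u{<hexDigits>}.
function evalBracedEscape(hexDigits) {
  return eval('"\\u{' + hexDigits + '}"');
}

// Hypothetical helper: reports whether the literal is rejected at parse time.
function isSyntaxError(hexDigits) {
  try {
    evalBracedEscape(hexDigits);
    return false;
  } catch (e) {
    return e instanceof SyntaxError;
  }
}

// Leading zeros do not change the value of the escape.
const padded = evalBracedEscape("0".repeat(1000) + "1234");
console.log(padded === String.fromCodePoint(0x1234));

// The value is limited to 0x10FFFF, however many digits spell it.
console.log(isSyntaxError("10FFFF"));   // in range, so not an error
console.log(isSyntaxError("00110000")); // out of range despite leading zeros
```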
Assignee: nobody → arai.unmht
Attachment #8591431 - Flags: review?(sstangl)
Comment on attachment 8591431 [details] [diff] [review]
Add \u{xxxxxx} string literals.
Forwarding review to Waldo, who likely has an opinion on the matter.
Attachment #8591431 - Flags: review?(sstangl) → review?(jwalden+bmo)
Comment on attachment 8591431 [details] [diff] [review]
Add \u{xxxxxx} string literals.

Review of attachment 8591431 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/frontend/TokenStream.cpp
@@ +1681,5 @@
>
> bool
> +TokenStream::getBracedUnicode(uint32_t* cp)
> +{
> +    skipChars(1);

Use consumeKnownChar('{'); instead.

@@ +1687,5 @@
> +    bool first = true;
> +    int32_t c;
> +    uint32_t code = 0;
> +    while (true) {
> +        c = getCharIgnoreEOL();

I'd kind of prefer explicit treatment of |c == EOF| meaning return false directly, rather than through the JS7_ISHEX further down.

@@ +1750,5 @@
> +    if (!getBracedUnicode(&code)) {
> +        reportError(JSMSG_MALFORMED_ESCAPE, "Unicode");
> +        return false;
> +    }
> +

Add MOZ_ASSERT(code <= 0x10FFFF) here.

@@ +1751,5 @@
> +        reportError(JSMSG_MALFORMED_ESCAPE, "Unicode");
> +        return false;
> +    }
> +
> +    if (code >= 0x10000) {

Consistent with UTF16Encoding(cp) in the spec, I'd prefer if these two arms were reversed -- single code unit first, two code units second.

@@ +1754,5 @@
> +
> +    if (code >= 0x10000) {
> +        if (!tokenbuf.append((code - 0x10000) / 1024 + 0xD800))
> +            return false;
> +        c = (code - 0x10000) % 1024 + 0xDC00;

Mild preference for bracing the %, though I think most readers would probably parse it as it executes.

::: js/src/tests/ecma_6/String/unicode-braced.js
@@ +39,5 @@
> +assertEq("\u{00}", String.fromCodePoint(0x0));
> +assertEq("\u{00000000000000000}", String.fromCodePoint(0x0));
> +assertEq("\u{00000000000001000}", String.fromCodePoint(0x1000));
> +
> +assertEq(eval(`"\\u{${"0".repeat(Math.pow(2, 28) - 20) + "1234"}}"`), String.fromCodePoint(0x1234));

512MB allocation here seems a bit much. :-) Math.pow(2, 24) should be more than adequate. (Actually I'm a little surprised you could [in theory] allocate a string that large. I thought our length limits were lower than that.)

@@ +52,5 @@
> +assertThrowsInstanceOf(() => eval(`"\\u{"`), SyntaxError);
> +assertThrowsInstanceOf(() => eval(`"\\u{110000}"`), SyntaxError);
> +assertThrowsInstanceOf(() => eval(`"\\u{00110000}"`), SyntaxError);
> +assertThrowsInstanceOf(() => eval(`"\\u{100000000000000000000000000000}"`), SyntaxError);
> +assertThrowsInstanceOf(() => eval(`"\\u{FFFFFFFFFFFFFFFFFFFFFFFFFFFFFF}"`), SyntaxError);

Add some tests with spaces before, after, and intermixed in the HexDigits. Also a test with 100000001 or some such, to verify the absence of overflow wrapping around back to "\u0001", would be nice.
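The review points above can be sketched in JavaScript. This is only an illustration under stated assumptions, not the SpiderMonkey patch: parseBracedUnicode and utf16Encode are hypothetical names, the scanner takes a plain string and an index instead of a token stream, and the range check inside the loop stands in for the overflow concern (JavaScript numbers would not wrap the way a uint32_t can).

```javascript
// Converts a code point to UTF-16 code units, single code unit case first.
function utf16Encode(cp) {
  if (cp < 0x10000) return [cp];
  const offset = cp - 0x10000;
  return [Math.floor(offset / 1024) + 0xD800, (offset % 1024) + 0xDC00];
}

// Hypothetical scanner: `src` is the source text, `pos` is the index of "{".
// Returns { code, end } on success, or null for a malformed escape.
function parseBracedUnicode(src, pos) {
  if (src[pos] !== "{") return null;
  let i = pos + 1;
  let code = 0;
  let sawDigit = false;
  while (true) {
    if (i >= src.length) return null;            // explicit end-of-input check
    const ch = src[i++];
    if (ch === "}") {
      return sawDigit ? { code, end: i } : null; // "{}" is malformed
    }
    if (!/^[0-9a-fA-F]$/.test(ch)) return null;  // also rejects spaces
    code = code * 16 + parseInt(ch, 16);
    if (code > 0x10FFFF) return null;            // reject before it could wrap
    sawDigit = true;
  }
}

const parsed = parseBracedUnicode("{1F600}", 0);
console.log(parsed.code.toString(16), utf16Encode(parsed.code));
console.log(parseBracedUnicode("{100000001}", 0)); // rejected, not read as 1
```

Checking the range after every digit means arbitrarily long runs of leading zeros stay valid, while an oversized value is rejected as soon as it exceeds 0x10FFFF.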
Attachment #8591431 - Flags: review?(jwalden+bmo) → review+
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla40