Closed Bug 320500 Opened 20 years ago Closed 10 years ago

Add \u{xxxxxx} string literals for non-BMP Unicode characters

Status:

RESOLVED FIXED

Milestone:

mozilla40

Tracking Flags:

Tracking

Status

firefox40

---

fixed

People

(Reporter: daumling, Assigned: arai)

References

Details

(Keywords: dev-doc-complete, intl)

Attachments

(1 file)

Add \u{xxxxxx} string literals. (10 years ago, Tooru Fujisawa [:arai], 6.23 KB, patch; Waldo: review+)

Michael Daumling

Reporter

Description

20 years ago

Chinese government requires support for Unicode characters > 0xFFFF. SpiderMonkey should at least support the definition of large Unicode character constants. Adobe ExtendScript has the \Uxxxxxxxx (capital U, 8 hex digits) notation that generates a surrogate pair. We should add this capability to the SpiderMonkey parser as well. Is extended Unicode character support planned for JS 2.0?

Igor Bukanov

Comment 1

20 years ago

(In reply to comment #0)
> Adobe ExtendScript has the \Uxxxxxxxx (capital U, 8 hex digits)
> notation that generates a surrogate pair. We should add this capability to the
> SpiderMonkey parser as well.

Does it support only \Uxxxxxxxx and not, say, \U{xxxxxxxx}? The latter is much easier to read and allows writing \U{xxxxx} with 5 x's, which AFAIK should cover all currently defined characters.

Michael Daumling

Reporter

Comment 2

20 years ago

Let me modify the proposal according to your idea: Let's allow for {xxx} as a generic hex escape sequence. Make \u equivalent to \x: \u{12345} == \x{12345}

Jungshik Shin

Comment 3

20 years ago

I don't think we can move unilaterally. I'm being too lazy to dig up related ECMA activities in this area. Something must have been going on...

Keywords: intl

OS: Windows XP → All

Hardware: PC → All

Summary: REQUEST: Add \Uxxxxxxxx string literals for 32-bit glyphs → Add \Uxxxxxxxx string literals for non-BMP Unicode characters

Brendan Eich [:brendan]

Comment 4

20 years ago

(In reply to comment #3)
> I don't think we can move unilaterally. I'm being too lazy to dig up related
> ECMA activities in this area. Something must have been going on...

Yes, ECMA TG1 is meeting (see my blog in a little bit for an update). We could use some i18n advice. There are obvious problems with ECMA-262 Edition 3 (e.g., RegExp character classes are generally ASCII-only).

Jungshik, do you know of lists of problems, or bugs on file, that we can collate?

Michael, can you take assignment of this bug?

/be

Norbert Lindenberg

Comment 5

13 years ago

The Ecma TC 39 meeting in May 2012 decided to use \u{xxxxxx}, with up to six hex digits and values up to 10FFFF. http://norbertlindenberg.com/2012/05/ecmascript-supplementary-characters/index.html#Escapes
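As a rough illustration of the syntax described in this comment, the sketch below compares a braced escape with the equivalent surrogate-pair spelling. The variable names and the `isValidEscape` helper are made up for this example and are not part of any patch on this bug.

```javascript
// A code point escape and the same character spelled as two UTF-16 code units.
const viaBraces = "\u{1F600}";
const viaSurrogates = "\uD83D\uDE00";

const sameString = viaBraces === viaSurrogates;
const codeUnits = viaBraces.length;          // length counts UTF-16 code units
const codePoint = viaBraces.codePointAt(0);  // reads the whole code point

// Hypothetical helper: a value above 0x10FFFF is an early SyntaxError,
// so the literal has to be built at run time and probed with eval.
function isValidEscape(hexDigits) {
  try {
    eval('"\\u{' + hexDigits + '}"');
    return true;
  } catch (e) {
    if (e instanceof SyntaxError) return false;
    throw e;
  }
}
```

In an engine that supports the escape, `isValidEscape("10FFFF")` should accept the literal and `isValidEscape("110000")` should reject it.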

Norbert Lindenberg

Updated

12 years ago

Summary: Add \Uxxxxxxxx string literals for non-BMP Unicode characters → Add \u{xxxxxx} string literals for non-BMP Unicode characters

Nobody; OK to take it and work on it

Updated

11 years ago

Assignee: general → nobody

Tooru Fujisawa [:arai]

Assignee

Updated

10 years ago

Blocks: 1135377

Tooru Fujisawa [:arai]

Assignee

Comment 6

10 years ago

Attached patch: Add \u{xxxxxx} string literals.

There seems to be no restriction on the length of the HexDigits [1] (though there is a restriction on its MV [2]), so I added a test for a very long run of leading "0"s. Is that correct, or did I overlook something?

assertEq(eval(`"\\u{${"0".repeat(Math.pow(2, 28) - 20) + "1234"}}"`), String.fromCodePoint(0x1234));

Green on try run: https://treeherder.mozilla.org/#/jobs?repo=try&revision=f389f9debf15

[1] http://people.mozilla.org/~jorendorff/es6-draft.html#sec-literals-string-literals
[2] http://people.mozilla.org/~jorendorff/es6-draft.html#sec-string-literals-static-semantics-early-errors
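To make the distinction between digit count and value concrete, here is a minimal sketch of a digit accumulator that accepts any number of leading zeros but rejects a value above 0x10FFFF. `parseBracedHexDigits` is a hypothetical name for this illustration; it is not the code in the attached patch.

```javascript
// Hypothetical helper: parse the text found between "\u{" and "}".
// Returns the code point, or null when the escape is malformed.
function parseBracedHexDigits(digits) {
  if (digits.length === 0) return null;      // "\u{}" has no digits
  let code = 0;
  for (const ch of digits) {
    const value = parseInt(ch, 16);
    if (Number.isNaN(value)) return null;    // not a hex digit (spaces included)
    code = code * 16 + value;
    // Checking on every step keeps the running value small, so a long
    // digit string can never wrap around to a small, valid-looking value.
    if (code > 0x10FFFF) return null;
  }
  return code;
}
```

Because leading zeros leave the running value at 0, the length of the digit string does not matter; only the value is bounded.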

Assignee: nobody → arai.unmht

Attachment #8591431 - Flags: review?(sstangl)

Sean Stangl [:sstangl]

Comment 7

10 years ago

Comment on attachment 8591431
Add \u{xxxxxx} string literals.

Forwarding review to Waldo, who likely has an opinion on the matter.

Attachment #8591431 - Flags: review?(sstangl) → review?(jwalden+bmo)

Jeff Walden [:Waldo]

Comment 8

10 years ago

Comment on attachment 8591431
Add \u{xxxxxx} string literals.

Review of attachment 8591431:
-----------------------------------------------------------------

::: js/src/frontend/TokenStream.cpp
@@ +1681,5 @@
>
> bool
> +TokenStream::getBracedUnicode(uint32_t* cp)
> +{
> +    skipChars(1);

Use consumeKnownChar('{'); instead.

@@ +1687,5 @@
> +    bool first = true;
> +    int32_t c;
> +    uint32_t code = 0;
> +    while (true) {
> +        c = getCharIgnoreEOL();

I'd kind of prefer explicit treatment of |c == EOF| meaning return false directly, rather than through the JS7_ISHEX further down.

@@ +1750,5 @@
> +    if (!getBracedUnicode(&code)) {
> +        reportError(JSMSG_MALFORMED_ESCAPE, "Unicode");
> +        return false;
> +    }
> +

Add MOZ_ASSERT(code <= 0x10FFFF) here.

@@ +1751,5 @@
> +        reportError(JSMSG_MALFORMED_ESCAPE, "Unicode");
> +        return false;
> +    }
> +
> +    if (code >= 0x10000) {

Consistent with UTF16Encoding(cp) in the spec, I'd prefer if these two arms were reversed -- single code unit first, two code units second.

@@ +1754,5 @@
> +
> +    if (code >= 0x10000) {
> +        if (!tokenbuf.append((code - 0x10000) / 1024 + 0xD800))
> +            return false;
> +        c = (code - 0x10000) % 1024 + 0xDC00;

Mild preference for bracing the %, tho I think most readers would probably parse it as it executes.

::: js/src/tests/ecma_6/String/unicode-braced.js
@@ +39,5 @@
> +assertEq("\u{00}", String.fromCodePoint(0x0));
> +assertEq("\u{00000000000000000}", String.fromCodePoint(0x0));
> +assertEq("\u{00000000000001000}", String.fromCodePoint(0x1000));
> +
> +assertEq(eval(`"\\u{${"0".repeat(Math.pow(2, 28) - 20) + "1234"}}"`), String.fromCodePoint(0x1234));

512MB allocation here seems a bit much. :-) Math.pow(2, 24) should be more than adequate. (Actually I'm a little surprised you could [in theory] allocate a string that large. I thought our length limits were lower than that.)

@@ +52,5 @@
> +assertThrowsInstanceOf(() => eval(`"\\u{"`), SyntaxError);
> +assertThrowsInstanceOf(() => eval(`"\\u{110000}"`), SyntaxError);
> +assertThrowsInstanceOf(() => eval(`"\\u{00110000}"`), SyntaxError);
> +assertThrowsInstanceOf(() => eval(`"\\u{100000000000000000000000000000}"`), SyntaxError);
> +assertThrowsInstanceOf(() => eval(`"\\u{FFFFFFFFFFFFFFFFFFFFFFFFFFFFFF}"`), SyntaxError);

Add some tests with spaces before, after, and intermixed in the HexDigits. Also a test with 100000001 or somesuch, to verify the absence of overflow wrapping around back to "\u0001", would be nice.
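For readers following the surrogate arithmetic quoted in the review, here is a small JavaScript sketch of the same computation, written with the single-code-unit arm first as the review suggests. `utf16Encode` is an illustrative name, not a function from the patch or from the engine.

```javascript
// Hypothetical helper mirroring the arithmetic in the quoted C++ hunk.
// Returns an array of one or two UTF-16 code units for a code point.
function utf16Encode(codePoint) {
  if (!Number.isInteger(codePoint) || codePoint < 0 || codePoint > 0x10FFFF) {
    throw new RangeError("code point out of range");
  }
  // Single code unit first: BMP code points are stored as-is.
  if (codePoint < 0x10000) {
    return [codePoint];
  }
  // Two code units second: split the 20-bit offset into two 10-bit halves.
  const offset = codePoint - 0x10000;
  const lead = Math.floor(offset / 1024) + 0xD800;
  const trail = (offset % 1024) + 0xDC00;
  return [lead, trail];
}
```

Passing the result to `String.fromCharCode(...utf16Encode(cp))` should produce the same string as `String.fromCodePoint(cp)` for any valid code point.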

Attachment #8591431 - Flags: review?(jwalden+bmo) → review+

Pulsebot

Comment 9

10 years ago

https://hg.mozilla.org/integration/mozilla-inbound/rev/d31dfe0f365a

Carsten Book [:Tomcat]

Comment 10

10 years ago

https://hg.mozilla.org/mozilla-central/rev/d31dfe0f365a

Status: NEW → RESOLVED

Closed: 10 years ago

status-firefox40: --- → fixed

Resolution: --- → FIXED

Target Milestone: --- → mozilla40

Tooru Fujisawa [:arai]

Assignee

Comment 11

10 years ago

Updated the following documentation:
https://developer.mozilla.org/en-US/Firefox/Releases/40
https://developer.mozilla.org/en-US/docs/Web/JavaScript/New_in_JavaScript/ECMAScript_6_support_in_Mozilla
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/String

Jean-Yves Perrier [:teoli]

Updated

10 years ago

Keywords: dev-doc-needed

Florian Scholz (Open Web Docs)

Comment 13

10 years ago

See comment 11 plus: https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Lexical_grammar#Unicode_code_point_escapes

Keywords: dev-doc-needed → dev-doc-complete

Categories

(Core :: JavaScript Engine, defect)
