1197230 - Correctly tokenize valid JS names that use non-BMP code points

Reporter

Description

•

10 years ago

User Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:38.0) Gecko/20100101 Firefox/38.0 Build ID: 20150511103818 Steps to reproduce: open the javascript console and paste: var

Comment hidden (obsolete)

brett hartshorn

Reporter

Comment 7

•

10 years ago

http://jsfiddle.net/xb8hbphk/1/

Flags: needinfo?(goatman.py)

Loic

Updated

•

10 years ago

Component: Untriaged → DOM: Core & HTML

Product: Firefox → Core

Gingerbread Man

Comment 8

•

10 years ago

Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:43.0) Gecko/20100101 Firefox/43.0 20150822030206 Mathematical bold capital A (U+1D400) Mathematical bold capital B (U+1D401) Mathematical bold capital C (U+1D402) Works in Opera 31. https://mothereff.in/js-variables says it's a valid identifier according to ECMAScript 6.

Blocks: es6

URL: http://jsfiddle.net/xb8hbphk/1/

Status: UNCONFIRMED → NEW

Component: DOM: Core & HTML → JavaScript Engine

Ever confirmed: true

Keywords: testcase

Summary: unicode mathematical alphanumerics syntax error → SyntaxError: illegal character when using Unicode mathematical alphanumeric symbols as variable name

Whiteboard: [parity-chrome][parity-opera]

Jeff Walden [:Waldo]

Comment 9

•

10 years ago

The basic problem is that we tokenize JS by 16-bit code unit, not by Unicode code point. And if the character in the identifier is U+1D400 or whatever, we then interpret it as two 16-bit code units corresponding to the surrogate pair in use -- not to the right Unicode codepoint. It is possibly not difficult to fix this, by adjusting TokenStream::getTokenInternal. Once that's done, we'll require further changes to handle non-BMP codepoints. Our current js::unicode::IsIdentifier{Start,Part} methods take a char16_t, obviously insufficient to handle non-BMP. Simply changing the signatures to accommodate larger numbers will hit the problem that vm/make_unicode.py limits itself to considering only BMP codepoints -- partly because necessity, but also because this permitted a more compact Unicode representation form to be used. So we'd need a different compaction technique to handle the full Unicode range. Which is to say, there's a stream of issues involved in fixing this. Some easy, some less easy. Fixing make_unicode.py is probably the hardest bit of this. Ideally we'd just ask our embedded copy of ICU instead, but (at least) until bug 864843 is fixed, we can't rely on having ICU on hand to ask. :-\

Summary: SyntaxError: illegal character when using Unicode mathematical alphanumeric symbols as variable name → Correctly tokenize valid JS names that use non-BMP code points

Version: 38 Branch → Trunk

Jeff Walden [:Waldo]

Comment 10

•

10 years ago

Amend comment 9 to say that bug 1215247 is what's preventing us just asking our embedded ICU directly -- bug 864843 got repurposed underneath us. :-(

Morgan Phillips [:mrrrgn]

Updated

•

10 years ago

Blocks: 917436

Alexandre Folle de Menezes

Comment 12

•

9 years ago

Firefox needs this to pass the "Syntax/Unicode point escapes/identifiers" test on http://kangax.github.io/compat-table/es6/

Tooru Fujisawa [:arai]

Assignee

Comment 13

•

9 years ago

Attached patch Handle non-BMP identifier. (obsolete) — Details — Splinter Review

summary: * add macro that enumerates ID_START and ID_CONTINUE (not ID_CONTINUE_ONLY) codePoints into UnicodeNonBMP.h, with the same way as other macros in UnicodeNonBMP.h * use those macros in IsIdentifierStart, IsIdentifierPart, and IsUnicodeIDStart * in TokenStream, check for lead surrogate, and then it is, peek trail surrogate, and do the same thing as BMP code and details: [make_unicode.py] added the following 3 sets (actually dict), and store codePoints that is space, or id_start, or id_continue: non_bmp_id_start_set non_bmp_id_cont_set non_bmp_space_set when writing to file, call make_non_bmp_group_macro, that performns almost same thing as make_non_bmp_convert_macro, for non_bmp_id_start_set and non_bmp_id_cont_set. make_non_bmp_group_macro finds contiguous range of code points in the set, and generates macro that calls passed macro for each range. here, non_bmp_space_set is now empty, so I put assertion to check it's empty, and skipped remaining parts. also, changed make_non_bmp_convert_macro to calculate lead/trail code units in 1st loop, and merges range only if lead surrogate is same. previously we were asserting lead surrogate is same for the range, but identifiers spans 2 lead surrogate code units, so applied same changes here. [Unicode.h] Modified IsIdentifierStart, IsIdentifierPart, and IsUnicodeIDStart to check non-BMP ranges, by macro added to UnicodeNonBMP.h. I used |MOZ_LIKELY(codePoint <= UTF16Max)|, since it should be rare that non-BMP appears. [TokenStream.cpp] * for raw non-BMP code points that is in surrogate pair * TokenStream::putIdentInTokenbuf * TokenStream::getTokenInternal * frontend::IsIdentifier while checking identifier, check if the current code unit is lead surrogate, and if so, peek trail surrogate, and then check for identifier start/part, and do the same thing as non-BMP If the condition is negative (like |if (unicode::IsUnicodeIDStart(...))|), the code is put before the non-BMP code If the condition is positive (like |if (!unicode::IsIdentifierPart(...))|), the code is put after the non-BMP code * for escaped non-BMP code points * TokenStream::putIdentInTokenbuf check if the escaped code point is non-BMP (unicode::IsSupplementary), and encode the code point to surrogate pair, and put them into buffer to avoid performance impact for existing code, I've put the code to support non-BMP separately from existing code, with MOZ_UNLIKELY. (it could become simpler by returning code point from some methods, but it may affect non-BMP code performance) in frontend::IsIdentifier, I've added totally separated the code supports non-BMP. frontend::IsIdentifier calls IsIdentifierMaybeNonBMP only if the string contains char16_t. IsIdentifierMaybeNonBMP first calls pre-existing IsIdentifier, and if it fails, check for identifier including non-BMP code points.

Assignee: nobody → arai.unmht

Status: NEW → ASSIGNED

Attachment #8809621 - Flags: review?(shu)

Shu-yu Guo [:shu]

Comment 14

•

9 years ago

Comment on attachment 8809621 [details] [diff] [review] Handle non-BMP identifier. Review of attachment 8809621 [details] [diff] [review]: ----------------------------------------------------------------- Thanks for explaining stuff on IRC, the patch was a lot less scary afterwards. Patch looks good. I see the following pattern often: checking IsLeadSurrogate, then checking if the next character IsTrailSurrogate, then ungetting the trail. Try to refactor this if possible. Some nice to haves for later: - A header for the name and type of each field in UnicodeNonBMP.h - Comments for make_unicode.py ::: js/src/frontend/TokenStream.cpp @@ +118,5 @@ > return true; > } > > +static uint32_t > +GetSingleCodePoint(const char16_t*& p, const char16_t*& end) Looks like |end| is not modified in the function. I'd prefer the signature to be GetSingleCodePoint(const char16_t** p, const char16_t* end) so when you call it the callsite GetSingleCodePoint(&p, end) makes which one gets mutated obvious. @@ +1247,5 @@ > goto identifier; > } > > + if (MOZ_UNLIKELY(unicode::IsLeadSurrogate(c))) { > + int32_t maybeTrail = getCharIgnoreEOL(); There's no system where the EOF value can look like a trail surrogate, right? I imagine EOF is 0xFFFFFFFF everywhere. ::: js/src/vm/Unicode.h @@ +152,5 @@ > +#define CHECK_RANGE(FROM, TO, LEAD, TRAIL_FROM, TRAIL_TO) \ > + if (codePoint >= FROM && codePoint <= TO) \ > + return true; > +FOR_EACH_NON_BMP_ID_START(CHECK_RANGE) > +#undef CHECK_RANGE Have you benchmarked performance? I wonder if generating so many ifs affects the inlining. @@ +186,5 @@ > +#define CHECK_RANGE(FROM, TO, LEAD, TRAIL_FROM, TRAIL_TO) \ > + if (codePoint >= FROM && codePoint <= TO) \ > + return true; > +FOR_EACH_NON_BMP_ID_CONT(CHECK_RANGE) > +#undef CHECK_RANGE Ditto on inlining being affected. @@ +206,5 @@ > +#define CHECK_RANGE(FROM, TO, LEAD, TRAIL_FROM, TRAIL_TO) \ > + if (codePoint >= FROM && codePoint <= TO) \ > + return true; > +FOR_EACH_NON_BMP_ID_START(CHECK_RANGE) > +#undef CHECK_RANGE Ditto on inlining being affected. ::: js/src/vm/make_unicode.py @@ +151,5 @@ > converted = convert_map[code] > diff = converted - code > > + if entry and code == entry['code'] + entry['length'] and \ > + diff == entry['diff'] and lead == entry['lead']: Comments about that this loop is trying to find continuous ranges would be helpful. No need to do it for this patch, though.

Attachment #8809621 - Flags: review?(shu) → review+

Shu-yu Guo [:shu]

Comment 15

•

9 years ago

(In reply to Shu-yu Guo [:shu] from comment #14) > Comment on attachment 8809621 [details] [diff] [review] > Some nice to haves for later: > > - A header for the name and type of each field in UnicodeNonBMP.h By "header" I mean a comment at the top, not a header file.

Tooru Fujisawa [:arai]

Assignee

Comment 16

•

9 years ago

Attached patch Handle non-BMP identifier. — Details — Splinter Review

Thank you for reviewing :) about multiline condition in Python, I added "(" and ")" around the conditions and removed "\", since it's suggested in PEP 8. (In reply to Shu-yu Guo [:shu] from comment #14) > I see the following pattern often: checking IsLeadSurrogate, then checking > if the next character IsTrailSurrogate, then ungetting the trail. Try to > refactor this if possible. Moved IsTrailSurrogate+ungetCharIgnoreEOL+UTF16Decode part to TokenStream::matchTrailForLeadSurrogate. > ::: js/src/frontend/TokenStream.cpp > @@ +1247,5 @@ > > goto identifier; > > } > > > > + if (MOZ_UNLIKELY(unicode::IsLeadSurrogate(c))) { > > + int32_t maybeTrail = getCharIgnoreEOL(); > > There's no system where the EOF value can look like a trail surrogate, > right? I imagine EOF is 0xFFFFFFFF everywhere. Yeah, so if we get EOF, it won't match and will unget. > ::: js/src/vm/Unicode.h > @@ +152,5 @@ > > +#define CHECK_RANGE(FROM, TO, LEAD, TRAIL_FROM, TRAIL_TO) \ > > + if (codePoint >= FROM && codePoint <= TO) \ > > + return true; > > +FOR_EACH_NON_BMP_ID_START(CHECK_RANGE) > > +#undef CHECK_RANGE > > Have you benchmarked performance? I wonder if generating so many ifs affects > the inlining. I moved the non-BMP part to dedicated non-inline functions defined in Unicode.cpp, IsIdentifierStartNonBMP and IsIdentifierPartNonBMP, thorse are directly created from |non_bmp_id_start_set| and |non_bmp_id_cont_set|. Also, considering the fact that we won't use the data elsewhere, as Unicode.h provides sufficient functions, removed the macro definition from UnicodeNonBMP.h Now IsIdentifierStart and IsUnicodeIDStart call IsIdentifierStartNonBMP for non-BMP case, and IsIdentifierPart calls IsIdentifierPartNonBMP for non-BMP case, all with MOZ_UNLIKELY. Here's the performance comparison with parsemark. I don't see any notable regression. http://bnjbvr.github.io/simplegraph/#title=ParseMark%20result%20before%2Fafter%20bug%201197230&ytitle=Score%20%5Bms%5D&categories=gravity_combined.js%2Ctetris_combined.js%2Cga.js%2Ctwitter_combined.js%2Cjquery.min.js%2Czimbra_combined.js%2Cjsgb_combined.js%2C280slides_combined.js%2Cv8_combined.js%2Cdojo.js%2Cprocessing_twitch.js%2Cgmail_combined.js%2Ceffectgames_galaxy.js%2Cball_pool_combined.js%2Cyui-min.js%2CMochiKit.js%2Cpipio_combined.js%2Csunspider_combined.js&values=before%3A%209.37%201.54%201.05%207.73%202.16%2050.46%203.43%201.30%208.02%203.33%200.98%2041.79%206.00%206.29%200.64%206.41%2033.68%204.94%0Aafter%3A%209.43%201.52%201.05%207.74%202.19%2050.88%203.45%201.31%208.14%203.26%200.99%2041.98%206.08%206.22%200.66%206.44%2033.40%204.88 Can you review those changed part?

Attachment #8809621 - Attachment is obsolete: true

Attachment #8812411 - Flags: review?(shu)

Shu-yu Guo [:shu]

Comment 17

•

9 years ago

Comment on attachment 8812411 [details] [diff] [review] Handle non-BMP identifier. Review of attachment 8812411 [details] [diff] [review]: ----------------------------------------------------------------- Looks good! Thank you for the patch. ::: js/src/vm/Unicode.cpp @@ +1748,5 @@ > 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, > }; > > +bool > +js::unicode::IsIdentifierStartNonBMP(uint32_t codePoint) Could you add a comment saying this function and the one below it are auto-generated?

Attachment #8812411 - Flags: review?(shu) → review+

Tooru Fujisawa [:arai]

Assignee

Comment 18

•

9 years ago

https://hg.mozilla.org/integration/mozilla-inbound/rev/f4cac94feb53415a51d9966dd60b1ccdecc43c36 Bug 1197230 - Handle non-BMP identifier. r=shu

Carsten Book [:Tomcat]

Comment 19

•

9 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/f4cac94feb53

Status: ASSIGNED → RESOLVED

Closed: 9 years ago

status-firefox53: --- → fixed

Resolution: --- → FIXED

Target Milestone: --- → mozilla53

Handle non-BMP identifier. 9 years ago Tooru Fujisawa [:arai] 52.39 KB, patch	shu : review+	Details \| Diff \| Splinter Review
Handle non-BMP identifier. 9 years ago Tooru Fujisawa [:arai] 60.10 KB, patch	shu : review+	Details \| Diff \| Splinter Review