Closed Bug 1197230 Opened 9 years ago Closed 8 years ago

Correctly tokenize valid JS names that use non-BMP code points

Categories

(Core :: JavaScript Engine, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla53
Tracking Status
firefox53 --- fixed

People

(Reporter: goatman.py, Assigned: arai)

References

(Blocks 1 open bug, )

Details

(Keywords: testcase, Whiteboard: [parity-chrome][parity-opera])

Attachments

(1 file, 1 obsolete file)

User Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:38.0) Gecko/20100101 Firefox/38.0
Build ID: 20150511103818

Steps to reproduce:

open the javascript console and paste:
var 
http://jsfiddle.net/xb8hbphk/1/
Flags: needinfo?(goatman.py)
Component: Untriaged → DOM: Core & HTML
Product: Firefox → Core
Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:43.0) Gecko/20100101 Firefox/43.0
20150822030206

Mathematical bold capital A (U+1D400)
Mathematical bold capital B (U+1D401)
Mathematical bold capital C (U+1D402)

Works in Opera 31.

https://mothereff.in/js-variables
says it's a valid identifier according to ECMAScript 6.
Blocks: es6
Status: UNCONFIRMED → NEW
Component: DOM: Core & HTML → JavaScript Engine
Ever confirmed: true
Keywords: testcase
Summary: unicode mathematical alphanumerics syntax error → SyntaxError: illegal character when using Unicode mathematical alphanumeric symbols as variable name
Whiteboard: [parity-chrome][parity-opera]
The basic problem is that we tokenize JS by 16-bit code unit, not by Unicode code point.  And if the character in the identifier is U+1D400 or whatever, we then interpret it as two 16-bit code units corresponding to the surrogate pair in use -- not to the right Unicode codepoint.  It is possibly not difficult to fix this, by adjusting TokenStream::getTokenInternal.

Once that's done, we'll require further changes to handle non-BMP codepoints.  Our current js::unicode::IsIdentifier{Start,Part} methods take a char16_t, obviously insufficient to handle non-BMP.  Simply changing the signatures to accommodate larger numbers will hit the problem that vm/make_unicode.py limits itself to considering only BMP codepoints -- partly because necessity, but also because this permitted a more compact Unicode representation form to be used.  So we'd need a different compaction technique to handle the full Unicode range.

Which is to say, there's a stream of issues involved in fixing this.  Some easy, some less easy.  Fixing make_unicode.py is probably the hardest bit of this.  Ideally we'd just ask our embedded copy of ICU instead, but (at least) until bug 864843 is fixed, we can't rely on having ICU on hand to ask.  :-\
Summary: SyntaxError: illegal character when using Unicode mathematical alphanumeric symbols as variable name → Correctly tokenize valid JS names that use non-BMP code points
Version: 38 Branch → Trunk
Amend comment 9 to say that bug 1215247 is what's preventing us just asking our embedded ICU directly -- bug 864843 got repurposed underneath us.  :-(
Blocks: 917436
Firefox needs this to pass the "Syntax/Unicode point escapes/identifiers" test on http://kangax.github.io/compat-table/es6/
Attached patch Handle non-BMP identifier. (obsolete) — Splinter Review
summary:
  * add macro that enumerates ID_START and ID_CONTINUE (not ID_CONTINUE_ONLY) codePoints into UnicodeNonBMP.h,
    with the same way as other macros in UnicodeNonBMP.h
  * use those macros in IsIdentifierStart, IsIdentifierPart, and IsUnicodeIDStart
  * in TokenStream, check for lead surrogate, and then it is, peek trail surrogate,
    and do the same thing as BMP code

and details:

[make_unicode.py]

added the following 3 sets (actually dict), and store codePoints that is space, or id_start, or id_continue:
  non_bmp_id_start_set
  non_bmp_id_cont_set
  non_bmp_space_set

when writing to file, call make_non_bmp_group_macro, that performns almost same thing as make_non_bmp_convert_macro, for non_bmp_id_start_set and non_bmp_id_cont_set.

make_non_bmp_group_macro finds contiguous range of code points in the set, and generates macro that calls passed macro for each range.

here, non_bmp_space_set is now empty, so I put assertion to check it's empty, and skipped remaining parts.

also, changed make_non_bmp_convert_macro to calculate lead/trail code units in 1st loop, and merges range only if lead surrogate is same.
previously we were asserting lead surrogate is same for the range, but identifiers spans 2 lead surrogate code units, so applied same changes here.

[Unicode.h]

Modified IsIdentifierStart, IsIdentifierPart, and IsUnicodeIDStart to check non-BMP ranges, by macro added to UnicodeNonBMP.h.
I used |MOZ_LIKELY(codePoint <= UTF16Max)|, since it should be rare that non-BMP appears.

[TokenStream.cpp]

* for raw non-BMP code points that is in surrogate pair
    * TokenStream::putIdentInTokenbuf
    * TokenStream::getTokenInternal
    * frontend::IsIdentifier

  while checking identifier, check if the current code unit is lead surrogate,
  and if so, peek trail surrogate, and then check for identifier start/part,
  and do the same thing as non-BMP

  If the condition is negative (like |if (unicode::IsUnicodeIDStart(...))|),
  the code is put before the non-BMP code

  If the condition is positive (like |if (!unicode::IsIdentifierPart(...))|),
  the code is put after the non-BMP code

* for escaped non-BMP code points
    * TokenStream::putIdentInTokenbuf

  check if the escaped code point is non-BMP (unicode::IsSupplementary),
  and encode the code point to surrogate pair, and put them into buffer

to avoid performance impact for existing code, I've put the code to support non-BMP separately from existing code, with MOZ_UNLIKELY.
(it could become simpler by returning code point from some methods, but it may affect non-BMP code performance)

in frontend::IsIdentifier, I've added totally separated the code supports non-BMP.
frontend::IsIdentifier calls IsIdentifierMaybeNonBMP only if the string contains char16_t.
IsIdentifierMaybeNonBMP first calls pre-existing IsIdentifier, and if it fails,
check for identifier including non-BMP code points.
Assignee: nobody → arai.unmht
Status: NEW → ASSIGNED
Attachment #8809621 - Flags: review?(shu)
Comment on attachment 8809621 [details] [diff] [review]
Handle non-BMP identifier.

Review of attachment 8809621 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks for explaining stuff on IRC, the patch was a lot less scary afterwards. Patch looks good.

I see the following pattern often: checking IsLeadSurrogate, then checking if the next character IsTrailSurrogate, then ungetting the trail. Try to refactor this if possible.

Some nice to haves for later:

- A header for the name and type of each field in UnicodeNonBMP.h
- Comments for make_unicode.py

::: js/src/frontend/TokenStream.cpp
@@ +118,5 @@
>      return true;
>  }
>  
> +static uint32_t
> +GetSingleCodePoint(const char16_t*& p, const char16_t*& end)

Looks like |end| is not modified in the function. I'd prefer the signature to be

GetSingleCodePoint(const char16_t** p, const char16_t* end)

so when you call it the callsite GetSingleCodePoint(&p, end) makes which one gets mutated obvious.

@@ +1247,5 @@
>              goto identifier;
>          }
>  
> +        if (MOZ_UNLIKELY(unicode::IsLeadSurrogate(c))) {
> +            int32_t maybeTrail = getCharIgnoreEOL();

There's no system where the EOF value can look like a trail surrogate, right? I imagine EOF is 0xFFFFFFFF everywhere.

::: js/src/vm/Unicode.h
@@ +152,5 @@
> +#define CHECK_RANGE(FROM, TO, LEAD, TRAIL_FROM, TRAIL_TO) \
> +    if (codePoint >= FROM && codePoint <= TO) \
> +        return true;
> +FOR_EACH_NON_BMP_ID_START(CHECK_RANGE)
> +#undef CHECK_RANGE

Have you benchmarked performance? I wonder if generating so many ifs affects the inlining.

@@ +186,5 @@
> +#define CHECK_RANGE(FROM, TO, LEAD, TRAIL_FROM, TRAIL_TO) \
> +    if (codePoint >= FROM && codePoint <= TO) \
> +        return true;
> +FOR_EACH_NON_BMP_ID_CONT(CHECK_RANGE)
> +#undef CHECK_RANGE

Ditto on inlining being affected.

@@ +206,5 @@
> +#define CHECK_RANGE(FROM, TO, LEAD, TRAIL_FROM, TRAIL_TO) \
> +    if (codePoint >= FROM && codePoint <= TO) \
> +        return true;
> +FOR_EACH_NON_BMP_ID_START(CHECK_RANGE)
> +#undef CHECK_RANGE

Ditto on inlining being affected.

::: js/src/vm/make_unicode.py
@@ +151,5 @@
>          converted = convert_map[code]
>          diff = converted - code
>  
> +        if entry and code == entry['code'] + entry['length'] and \
> +           diff == entry['diff'] and lead == entry['lead']:

Comments about that this loop is trying to find continuous ranges would be helpful. No need to do it for this patch, though.
Attachment #8809621 - Flags: review?(shu) → review+
(In reply to Shu-yu Guo [:shu] from comment #14)
> Comment on attachment 8809621 [details] [diff] [review]

> Some nice to haves for later:
> 
> - A header for the name and type of each field in UnicodeNonBMP.h

By "header" I mean a comment at the top, not a header file.
Thank you for reviewing :)

about multiline condition in Python, I added "(" and ")" around the conditions and removed "\", since it's suggested in PEP 8.

(In reply to Shu-yu Guo [:shu] from comment #14)
> I see the following pattern often: checking IsLeadSurrogate, then checking
> if the next character IsTrailSurrogate, then ungetting the trail. Try to
> refactor this if possible.

Moved IsTrailSurrogate+ungetCharIgnoreEOL+UTF16Decode part to TokenStream::matchTrailForLeadSurrogate.


> ::: js/src/frontend/TokenStream.cpp
> @@ +1247,5 @@
> >              goto identifier;
> >          }
> >  
> > +        if (MOZ_UNLIKELY(unicode::IsLeadSurrogate(c))) {
> > +            int32_t maybeTrail = getCharIgnoreEOL();
> 
> There's no system where the EOF value can look like a trail surrogate,
> right? I imagine EOF is 0xFFFFFFFF everywhere.

Yeah, so if we get EOF, it won't match and will unget.


> ::: js/src/vm/Unicode.h
> @@ +152,5 @@
> > +#define CHECK_RANGE(FROM, TO, LEAD, TRAIL_FROM, TRAIL_TO) \
> > +    if (codePoint >= FROM && codePoint <= TO) \
> > +        return true;
> > +FOR_EACH_NON_BMP_ID_START(CHECK_RANGE)
> > +#undef CHECK_RANGE
> 
> Have you benchmarked performance? I wonder if generating so many ifs affects
> the inlining.

I moved the non-BMP part to dedicated non-inline functions defined in Unicode.cpp, IsIdentifierStartNonBMP and IsIdentifierPartNonBMP, thorse are directly created from |non_bmp_id_start_set| and |non_bmp_id_cont_set|.
Also, considering the fact that we won't use the data elsewhere, as Unicode.h provides sufficient functions, removed the macro definition from UnicodeNonBMP.h

Now IsIdentifierStart and IsUnicodeIDStart call IsIdentifierStartNonBMP for non-BMP case, and IsIdentifierPart calls IsIdentifierPartNonBMP for non-BMP case, all with MOZ_UNLIKELY.


Here's the performance comparison with parsemark.
I don't see any notable regression.

http://bnjbvr.github.io/simplegraph/#title=ParseMark%20result%20before%2Fafter%20bug%201197230&ytitle=Score%20%5Bms%5D&categories=gravity_combined.js%2Ctetris_combined.js%2Cga.js%2Ctwitter_combined.js%2Cjquery.min.js%2Czimbra_combined.js%2Cjsgb_combined.js%2C280slides_combined.js%2Cv8_combined.js%2Cdojo.js%2Cprocessing_twitch.js%2Cgmail_combined.js%2Ceffectgames_galaxy.js%2Cball_pool_combined.js%2Cyui-min.js%2CMochiKit.js%2Cpipio_combined.js%2Csunspider_combined.js&values=before%3A%209.37%201.54%201.05%207.73%202.16%2050.46%203.43%201.30%208.02%203.33%200.98%2041.79%206.00%206.29%200.64%206.41%2033.68%204.94%0Aafter%3A%209.43%201.52%201.05%207.74%202.19%2050.88%203.45%201.31%208.14%203.26%200.99%2041.98%206.08%206.22%200.66%206.44%2033.40%204.88


Can you review those changed part?
Attachment #8809621 - Attachment is obsolete: true
Attachment #8812411 - Flags: review?(shu)
Comment on attachment 8812411 [details] [diff] [review]
Handle non-BMP identifier.

Review of attachment 8812411 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good! Thank you for the patch.

::: js/src/vm/Unicode.cpp
@@ +1748,5 @@
>        0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,   0,
>  };
>  
> +bool
> +js::unicode::IsIdentifierStartNonBMP(uint32_t codePoint)

Could you add a comment saying this function and the one below it are auto-generated?
Attachment #8812411 - Flags: review?(shu) → review+
https://hg.mozilla.org/mozilla-central/rev/f4cac94feb53
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla53
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: