Last Comment Bug 917436 - Update identifier syntax to match ES6
: Update identifier syntax to match ES6
Status: RESOLVED FIXED
: dev-doc-complete
Product: Core
Classification: Components
Component: JavaScript Engine (show other bugs)
: unspecified
: All All
-- normal with 3 votes (vote)
: mozilla52
Assigned To: André Bargull
:
: Jason Orendorff [:jorendorff]
Mentors:
: 1208842 (view as bug list)
Depends on: 1197230 1230490 1314037
Blocks: es6
  Show dependency treegraph
 
Reported: 2013-09-17 11:31 PDT by Anne (:annevk)
Modified: 2017-01-11 05:03 PST (History)
15 users (show)
See Also:
Crash Signature:
(edit)
QA Whiteboard:
Iteration: ---
Points: ---
Has Regression Range: ---
Has STR: ---
fixed


Attachments
Hacks to disallow \u2e2f in identifiers (12.38 KB, patch)
2013-12-22 05:09 PST, Till Schneidereit [till]
no flags Details | Diff | Splinter Review
bug917436-part1.patch (35.22 KB, patch)
2016-11-04 08:29 PDT, André Bargull
arai.unmht: feedback+
Details | Diff | Splinter Review
bug917436-part2.patch (60.36 KB, patch)
2016-11-04 12:25 PDT, André Bargull
arai.unmht: feedback+
Details | Diff | Splinter Review
bug917436-part1.patch (859.33 KB, patch)
2016-11-04 15:50 PDT, André Bargull
arai.unmht: review+
Details | Diff | Splinter Review
bug917436-part2.patch (58.41 KB, patch)
2016-11-04 16:13 PDT, André Bargull
arai.unmht: review+
Details | Diff | Splinter Review
bug917436-part1.patch (859.79 KB, patch)
2016-11-04 16:49 PDT, André Bargull
andrebargull: review+
Details | Diff | Splinter Review

Description User image Anne (:annevk) 2013-09-17 11:31:50 PDT
TC39 discussed http://esdiscuss.org/topic/backwards-compatibility-and-u-2e2f-in-identifier-s and concluded that we should not deviate from Unicode because we don't want to separate lists from Unicode.

That means \u2e2f and ⸯ need to throw when used as identifier.
Comment 1 User image Mathias Bynens 2013-09-18 01:31:59 PDT
Tests: http://mathias.html5.org/tests/javascript/identifiers/
Comment 2 User image Till Schneidereit [till] 2013-12-22 05:09:46 PST
Created attachment 8351004 [details] [diff] [review]
Hacks to disallow \u2e2f in identifiers

This isn't that easy to implement.

We're checking for IdentifierStart by first checking the ASCII-range special cases (i.e. $ and _) and then by whether we have a unicode letter. In the check for IdentifierPart, we do the latter, plus checking for the flag IDENTIFIER_PART, which is set by the vm/make_unicode.py script.

Fixing the latter is doable by changing make_unicode.py to set IDENTIFIER_PART for all letters, too, and explicitly excluding \u2e2f. Hacky, but at least the ugly is in generator script, not in the engine itself. The in-engine change is just to not check for letters in IsIdentifierPart.

However, now we are in the situation were the identifier is ended, and then we check for IsIdentifierStart next. Which returns true, because \u2e2f is a letter. The parser is not prepared for this situation being possible, and does stupid things until a failing assert puts it out if its misery.

The core issue is that, as far as Unicode is concerned, we shouldn't check for letters, or $ or _ at all, but for whether the char has the ID_Start property. Which is defined in DerivedCoreProperties.txt. Which we don't even open. The only simple way around that is to explicitly check for \u2e2f in IsIdentifierStart and return false for it. Urgh.

The attached patch does these things. It doesn't fully handle a literal ⸯ, but I don't really want to look into that until someone tells me that, yes, this is messy, but so is reality.
Comment 3 User image Mathias Bynens 2013-12-22 05:14:30 PST
(In reply to Till Schneidereit [:till] from comment #2)
> The core issue is that, as far as Unicode is concerned, we shouldn't check
> for letters, or $ or _ at all, but for whether the char has the ID_Start
> property. Which is defined in DerivedCoreProperties.txt. Which we don't even
> open.

There is another way. You could just stick to using Unicode category data, by following <http://www.unicode.org/reports/tr31/#Default_Identifier_Syntax>:

ID_Start: Characters having the Unicode General_Category of uppercase letters (Lu), lowercase letters (Ll), titlecase letters (Lt), modifier letters (Lm), other letters (Lo), letter numbers (Nl), minus Pattern_Syntax and Pattern_White_Space code points, plus stability extensions.

ID_Continue: All of the above, plus characters having the Unicode General_Category of nonspacing marks (Mn), spacing combining marks (Mc), decimal number (Nd), connector punctuations (Pc), plus stability extensions, minus Pattern_Syntax and Pattern_White_Space code points.

Here’s an example script that gets all the Unicode code points for `IdentifierStart` and `IdentifierPart` as per ES6 using that technique: https://gist.github.com/mathiasbynens/6334847#file-javascript-identifier-regex-js-L65-L102
Comment 4 User image Mathias Bynens 2013-12-22 05:16:32 PST
Looks like it would be fairly easy to tweak that script into a replacement for `vm/make_unicode.py`.
Comment 5 User image Till Schneidereit [till] 2013-12-22 05:59:37 PST
(In reply to Mathias Bynens from comment #4)
> Looks like it would be fairly easy to tweak that script into a replacement
> for `vm/make_unicode.py`.

Patches very much welcome. :) Sadly, we don't have the infrastructure to run JS scripts during the SpiderMonkey builds, though, so this would have to be ported to Python.
Comment 6 User image Mathias Bynens 2015-03-04 08:13:00 PST
Some more info & examples: https://mathiasbynens.be/notes/javascript-identifiers-es6

Should I file a separate bug for updating the identifier grammar to ES6?
Comment 7 User image Till Schneidereit [till] 2015-03-04 08:18:09 PST
I think we can just re=purpose this bug. It seems likely that we'd do all of it in one go.

It seems unlikely, however, that I'll be able to get to this anytime soon, so unassigning myself.
Comment 8 User image Mathias Bynens 2015-03-09 14:45:55 PDT
Tests (based on ES6 and Unicode 5.1.0, i.e. the minimum required Unicode version as per the spec): https://mathias.html5.org/tests/javascript/identifiers/
Comment 9 User image Tom Schuster [:evilpie] 2015-09-27 08:25:33 PDT
*** Bug 1208842 has been marked as a duplicate of this bug. ***
Comment 10 User image Morgan Phillips [:mrrrgn] 2015-12-04 02:01:42 PST
I've changed the script to derive the Id_Start and Id_Continue properties from their constituent categories, however, there's another hitch. UnicodeData.txt from all versions above 6.0.0 fail with:

Downloading...
Generating...
Best: 2048+8224 bins at shift 5; 12320 bytes
Size of original table: 65536 bytes
Traceback (most recent call last):
  File "make_unicode.py", line 377, in <module>
    open('../tests/ecma_5/String/string-space-trim.js', 'w'))
  File "make_unicode.py", line 282, in generate_unicode_stuff
    dump(index1, 'index1', data_file)
  File "make_unicode.py", line 268, in dump
    assert entry < 256
AssertionError

Versions <= 6.0.0 still work.
Comment 11 User image Jeff Walden [:Waldo] (remove +bmo to email) 2015-12-04 12:09:53 PST
I think bug 1197230 is a dup of this -- would mark if I didn't have a quick errand before heading to the ofice.

Note also the last two comments may be of some interest.  The rumor mill leads me to suspect the blocker mentioned there *might* be favorably resolved in the next few months.  But I'm nowhere near certain enough of that to bet on it, and we might still be stuck fixing the script here.

As to why make_unicode.py might be failing, I had *thought* it was run when we updated to ICU 56.  Maybe I'm wrong about that.  evilpie wrote this stuff originally, he might be able to help with it.  (Or maybe not, if he's paged it out of memory as well as I have.  :-) )
Comment 12 User image Morgan Phillips [:mrrrgn] 2015-12-04 12:39:37 PST
(In reply to Jeff Walden [:Waldo] (remove +bmo to email) from comment #11)
> I think bug 1197230 is a dup of this -- would mark if I didn't have a quick
> errand before heading to the ofice.
> 
> Note also the last two comments may be of some interest.  The rumor mill
> leads me to suspect the blocker mentioned there *might* be favorably
> resolved in the next few months.  But I'm nowhere near certain enough of
> that to bet on it, and we might still be stuck fixing the script here.
> 
> As to why make_unicode.py might be failing, I had *thought* it was run when
> we updated to ICU 56.  Maybe I'm wrong about that.  evilpie wrote this stuff
> originally, he might be able to help with it.  (Or maybe not, if he's paged
> it out of memory as well as I have.  :-) )

You are my hero. I'm going to reassess my work on this then. It's a deep dark rabbit hole, and while I definitely appreciate the compaction being done in make_unicode.py it's very painful to maintain for a newcomer.
Comment 13 User image Morgan Phillips [:mrrrgn] 2015-12-04 12:40:09 PST

*** This bug has been marked as a duplicate of bug 1197230 ***
Comment 14 User image Morgan Phillips [:mrrrgn] 2015-12-04 12:43:45 PST
Actually, not sure this is a true duplicate. Reopening and adding the tokenizer bug as a blocker.
Comment 15 User image Morgan Phillips [:mrrrgn] 2016-01-05 14:48:30 PST
Going to release this one in case someone else decides to grab it. Otherwise, I'll circle back later.
Comment 16 User image André Bargull 2016-11-04 08:29:03 PDT
Created attachment 8807589 [details] [diff] [review]
bug917436-part1.patch

Before I try to start working on this bug, it probably make sense to clean up make_unicode.py a little bit.

Files regenerated with:
  cd js/src/vm
  ./make_unicode.py --version=6.2.0 --casefold-version=8.0.0 --no-keep
Comment 17 User image Tooru Fujisawa [:arai] 2016-11-04 11:54:51 PDT
what's the intent of adding --directory and --no-keep ?
are we going to remove copy of CaseFolding.txt and UnicodeData.txt from tree?
Comment 18 User image André Bargull 2016-11-04 12:08:02 PDT
(In reply to Tooru Fujisawa [:arai] from comment #17)
> what's the intent of adding --directory and --no-keep ?
> are we going to remove copy of CaseFolding.txt and UnicodeData.txt from tree?

I've just added those to make it simpler for me to test and update the files. For example I needed quite some time to realize we're currently using two different Unicode version (6.2 and 8.0), and with those parameters I could store the downloaded files to /tmp/unicode<version> without needing to download the different Unicode files over and over again. And with --no-keep I can update the generated Unicode 6.2 files without also downgrading vm/UnicodeData.txt from Unicode 8 to Unicode 6.2. 

IOW we can probably remove them again as soon as we use a single Unicode version.
Comment 19 User image André Bargull 2016-11-04 12:25:49 PDT
Created attachment 8807691 [details] [diff] [review]
bug917436-part2.patch

And part 2 with the actual changes to use the Unicode properties ID_Start and ID_Continue.
Comment 20 User image Tooru Fujisawa [:arai] 2016-11-04 12:27:55 PDT
(In reply to André Bargull from comment #18)
> And with --no-keep
> I can update the generated Unicode 6.2 files without also downgrading
> vm/UnicodeData.txt from Unicode 8 to Unicode 6.2. 

vm/UnicodeData.txt is Unicode 6.2
so you won't downgrade it if you download 6.2 there.
Comment 21 User image Tooru Fujisawa [:arai] 2016-11-04 13:02:35 PDT
Comment on attachment 8807589 [details] [diff] [review]
bug917436-part1.patch

Review of attachment 8807589 [details] [diff] [review]:
-----------------------------------------------------------------

make_unicode.py can be simplified more,
options that needs are:
  * download from network, or use local file
  * which versions to download (or maybe the version of local file)

otherwise, this looks nice :)

::: js/src/vm/make_unicode.py
@@ +29,3 @@
>  
> +# ECMAScript 2016
> +# §11.2 White Space

there's some places that uses "$" as section mark, so, if you use "§" instead it would be nice to replace all of them in this file, to search-ability.

@@ +180,5 @@
>      same_upper_dummy = (0, 0, 0)
>      same_upper_table = [same_upper_dummy]
>      same_upper_cache = {same_upper_dummy: 0}
> +    same_upper_index = [0] * (MAX_BMP + 1)
> +    

please remove white spaces here.

@@ +224,5 @@
>              flags |= FLAG_SPACE
>              test_space_table.append(code)
>          if category in ['Lu', 'Ll', 'Lt', 'Lm', 'Lo', 'Nl']: # $ 7.6 (UnicodeLetter)
>              flags |= FLAG_LETTER
> +        if category in ['Mn', 'Mc', 'Nd', 'Pc'] or code in compatibility_identifier_part: # $ 7.6 (IdentifierPart)

while you're touching this line, can you move the comment to independent line before if-statement?
same for "# $ 7.6 (UnicodeLetter)"

@@ +736,5 @@
> +
> +    def to_download_url(version):
> +        baseurl = 'http://unicode.org/Public'
> +        if version is 'UNIDATA':
> +            return '%s/%s' % (baseurl, version)

I thought "...".format() is preferred over % operator these days, but I'm not sure.
at least it will be deprecated in the future.

maybe we should ask someone else about coding rule.

@@ +767,5 @@
> +            if info['download'] is not None:
> +                print('\t%s download directory: %s' % (info['name'], info['download']))
> +                if not os.path.isdir(info['download']):
> +                    raise RuntimeError('not a directory: %s' % info['download'])
> +        if info['local'] is not None:

can this be just "else:" ?

@@ +876,5 @@
> +    parser.add_argument('--casefold-directory',
> +                        help='Saves downloaded Unicode casefolding files in the given directory')
> +
> +    parser.add_argument('--no-keep', default=False, action='store_true',
> +                        help='Don\'t save downloaded files in the current directory')

so, --directory, --casefold-directory, and --no-keep options won't be necessary.

@@ +881,5 @@
> +
> +    parser.add_argument('local_directory', nargs="?",
> +                        help='Local Unicode directory, if omitted downloads files from unicode.org')
> +    parser.add_argument('casefold_local_directory', nargs="?",
> +                        help='Local Unicode casefold directory, if omitted downloads files from unicode.org')

and also having different directory for them won't be necessary.
maybe we just drop directory parameters and just use current directory.
Comment 22 User image André Bargull 2016-11-04 13:16:49 PDT
If forgot to mention that part 2 depends on bug 1314037.
Comment 23 User image Tooru Fujisawa [:arai] 2016-11-04 15:32:30 PDT
Comment on attachment 8807691 [details] [diff] [review]
bug917436-part2.patch

Review of attachment 8807691 [details] [diff] [review]:
-----------------------------------------------------------------

looks good, except the downloading part in make_unicode.py that would be changed along with part 1.

::: js/src/vm/Unicode.h
@@ +55,5 @@
>   *      return True
>   *
>   */
>  
> +enum class CharFlag : uint8_t {

as discussed in IRC, please change this to namespace,
and remove cast below.
Comment 24 User image André Bargull 2016-11-04 15:50:29 PDT
Created attachment 8807768 [details] [diff] [review]
bug917436-part1.patch

Updated part 1 per review comments.
Comment 25 User image André Bargull 2016-11-04 16:13:34 PDT
Created attachment 8807775 [details] [diff] [review]
bug917436-part2.patch

Updated part 2 per review comments.
Comment 26 User image Tooru Fujisawa [:arai] 2016-11-04 16:15:58 PDT
Comment on attachment 8807768 [details] [diff] [review]
bug917436-part1.patch

Review of attachment 8807768 [details] [diff] [review]:
-----------------------------------------------------------------

thank you :D

r+ with minor fixes (sorry I should've caught them before)

::: js/src/vm/make_unicode.py
@@ +762,5 @@
> +            tfile.write(data)
> +            tfile.flush()
> +            tfile.seek(0)
> +        else:
> +            tfile = os.path.join(os.getcwd(), fname)

reusing `tfile` with different types won't be nice.
also, `os.path.join(os.getcwd(), fname)` is used in both branch.
so, it would be nice to create this before if-branch, with different name.
maybe tfile_path or something

@@ +768,5 @@
> +                raise RuntimeError('File not found: %s' % tfile)
> +            tfile = io.open(tfile, 'rb');
> +        return tfile
> +
> +    def version_from_file(info, f, fname):

you could remove `info` parameter

@@ +823,5 @@
> +
> +    parser = argparse.ArgumentParser(description='Update Unicode data.')
> +
> +    parser.add_argument('--version',
> +                        help='Unicode version number, use "UNIDATA" to select to latest published version')

since now this is a switch to download or use local file,
it would be nice to mention that the version number is for downloading, and also specifying this argument will download the file instead of using local file.
Comment 27 User image André Bargull 2016-11-04 16:49:40 PDT
Created attachment 8807784 [details] [diff] [review]
bug917436-part1.patch

Updated part 1 to fix remaining nits. Carrying r+ from arai.
Comment 29 User image Pulsebot 2016-11-07 18:41:45 PST
Pushed by ryanvm@gmail.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/816fef70c11d
Part 1: Record Unicode version in files generated from make_unicode.py. r=arai
https://hg.mozilla.org/integration/mozilla-inbound/rev/483dc75949fa
Part 2: Use IDStart and IDContinue Unicode properties for identifiers. r=arai
Comment 31 User image Alexandre Folle de Menezes 2016-11-10 10:53:40 PST
Shouldn't nightly now be passing the "Syntax/Unicode point escapes/identifiers" test on http://kangax.github.io/compat-table/es6/ , since this issue claims identifier syntax now matches ES6 spec?
Comment 32 User image Tooru Fujisawa [:arai] 2016-11-10 11:39:41 PST
(In reply to Alexandre Folle de Menezes from comment #31)
> Shouldn't nightly now be passing the "Syntax/Unicode point
> escapes/identifiers" test on http://kangax.github.io/compat-table/es6/ ,
> since this issue claims identifier syntax now matches ES6 spec?

the testcase depends on bug 1197230 to handle non-BMP code points.

> var \u{102C0} = { \u{102C0} : 2 };
> return \u{102C0}['\ud800\udec0'] === 2;
Comment 33 User image Florian Scholz [:fscholz] (MDN) 2017-01-11 05:03:32 PST
Added to https://developer.mozilla.org/en-US/Firefox/Releases/52#JavaScript

I don't think this needs to be in any reference doc (seems like a detail to me). Please let me know, if you think this should be mentioned somewhere on MDN.

Note You need to log in before you can comment on or make changes to this bug.