Closed Bug 917436 Opened 11 years ago Closed 8 years ago

Update identifier syntax to match ES6

Categories

(Core :: JavaScript Engine, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla52
Tracking Status
firefox52 --- fixed

People

(Reporter: annevk, Assigned: anba)

References

(Blocks 1 open bug)

Details

(Keywords: dev-doc-complete)

Attachments

(2 files, 4 obsolete files)

TC39 discussed http://esdiscuss.org/topic/backwards-compatibility-and-u-2e2f-in-identifier-s and concluded that we should not deviate from Unicode because we don't want to separate lists from Unicode.

That means \u2e2f and ⸯ need to throw when used as identifier.
Blocks: es6
This isn't that easy to implement.

We're checking for IdentifierStart by first checking the ASCII-range special cases (i.e. $ and _) and then by whether we have a unicode letter. In the check for IdentifierPart, we do the latter, plus checking for the flag IDENTIFIER_PART, which is set by the vm/make_unicode.py script.

Fixing the latter is doable by changing make_unicode.py to set IDENTIFIER_PART for all letters, too, and explicitly excluding \u2e2f. Hacky, but at least the ugly is in generator script, not in the engine itself. The in-engine change is just to not check for letters in IsIdentifierPart.

However, now we are in the situation were the identifier is ended, and then we check for IsIdentifierStart next. Which returns true, because \u2e2f is a letter. The parser is not prepared for this situation being possible, and does stupid things until a failing assert puts it out if its misery.

The core issue is that, as far as Unicode is concerned, we shouldn't check for letters, or $ or _ at all, but for whether the char has the ID_Start property. Which is defined in DerivedCoreProperties.txt. Which we don't even open. The only simple way around that is to explicitly check for \u2e2f in IsIdentifierStart and return false for it. Urgh.

The attached patch does these things. It doesn't fully handle a literal ⸯ, but I don't really want to look into that until someone tells me that, yes, this is messy, but so is reality.
Assignee: general → till
Status: NEW → ASSIGNED
(In reply to Till Schneidereit [:till] from comment #2)
> The core issue is that, as far as Unicode is concerned, we shouldn't check
> for letters, or $ or _ at all, but for whether the char has the ID_Start
> property. Which is defined in DerivedCoreProperties.txt. Which we don't even
> open.

There is another way. You could just stick to using Unicode category data, by following <http://www.unicode.org/reports/tr31/#Default_Identifier_Syntax>:

ID_Start: Characters having the Unicode General_Category of uppercase letters (Lu), lowercase letters (Ll), titlecase letters (Lt), modifier letters (Lm), other letters (Lo), letter numbers (Nl), minus Pattern_Syntax and Pattern_White_Space code points, plus stability extensions.

ID_Continue: All of the above, plus characters having the Unicode General_Category of nonspacing marks (Mn), spacing combining marks (Mc), decimal number (Nd), connector punctuations (Pc), plus stability extensions, minus Pattern_Syntax and Pattern_White_Space code points.

Here’s an example script that gets all the Unicode code points for `IdentifierStart` and `IdentifierPart` as per ES6 using that technique: https://gist.github.com/mathiasbynens/6334847#file-javascript-identifier-regex-js-L65-L102
Looks like it would be fairly easy to tweak that script into a replacement for `vm/make_unicode.py`.
(In reply to Mathias Bynens from comment #4)
> Looks like it would be fairly easy to tweak that script into a replacement
> for `vm/make_unicode.py`.

Patches very much welcome. :) Sadly, we don't have the infrastructure to run JS scripts during the SpiderMonkey builds, though, so this would have to be ported to Python.
Some more info & examples: https://mathiasbynens.be/notes/javascript-identifiers-es6

Should I file a separate bug for updating the identifier grammar to ES6?
I think we can just re=purpose this bug. It seems likely that we'd do all of it in one go.

It seems unlikely, however, that I'll be able to get to this anytime soon, so unassigning myself.
Assignee: till → nobody
Status: ASSIGNED → NEW
OS: Mac OS X → All
Hardware: x86 → All
Summary: Update identifier syntax to disallow \u2e2f → Update identifier syntax to match ES6
Tests (based on ES6 and Unicode 5.1.0, i.e. the minimum required Unicode version as per the spec): https://mathias.html5.org/tests/javascript/identifiers/
Assignee: nobody → winter2718
I've changed the script to derive the Id_Start and Id_Continue properties from their constituent categories, however, there's another hitch. UnicodeData.txt from all versions above 6.0.0 fail with:

Downloading...
Generating...
Best: 2048+8224 bins at shift 5; 12320 bytes
Size of original table: 65536 bytes
Traceback (most recent call last):
  File "make_unicode.py", line 377, in <module>
    open('../tests/ecma_5/String/string-space-trim.js', 'w'))
  File "make_unicode.py", line 282, in generate_unicode_stuff
    dump(index1, 'index1', data_file)
  File "make_unicode.py", line 268, in dump
    assert entry < 256
AssertionError

Versions <= 6.0.0 still work.
Depends on: 1230490
I think bug 1197230 is a dup of this -- would mark if I didn't have a quick errand before heading to the ofice.

Note also the last two comments may be of some interest.  The rumor mill leads me to suspect the blocker mentioned there *might* be favorably resolved in the next few months.  But I'm nowhere near certain enough of that to bet on it, and we might still be stuck fixing the script here.

As to why make_unicode.py might be failing, I had *thought* it was run when we updated to ICU 56.  Maybe I'm wrong about that.  evilpie wrote this stuff originally, he might be able to help with it.  (Or maybe not, if he's paged it out of memory as well as I have.  :-) )
(In reply to Jeff Walden [:Waldo] (remove +bmo to email) from comment #11)
> I think bug 1197230 is a dup of this -- would mark if I didn't have a quick
> errand before heading to the ofice.
> 
> Note also the last two comments may be of some interest.  The rumor mill
> leads me to suspect the blocker mentioned there *might* be favorably
> resolved in the next few months.  But I'm nowhere near certain enough of
> that to bet on it, and we might still be stuck fixing the script here.
> 
> As to why make_unicode.py might be failing, I had *thought* it was run when
> we updated to ICU 56.  Maybe I'm wrong about that.  evilpie wrote this stuff
> originally, he might be able to help with it.  (Or maybe not, if he's paged
> it out of memory as well as I have.  :-) )

You are my hero. I'm going to reassess my work on this then. It's a deep dark rabbit hole, and while I definitely appreciate the compaction being done in make_unicode.py it's very painful to maintain for a newcomer.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → DUPLICATE
Actually, not sure this is a true duplicate. Reopening and adding the tokenizer bug as a blocker.
Status: RESOLVED → REOPENED
Resolution: DUPLICATE → ---
Depends on: 1197230
Going to release this one in case someone else decides to grab it. Otherwise, I'll circle back later.
Assignee: winter2718 → nobody
See Also: → 1282724
Attached patch bug917436-part1.patch (obsolete) — Splinter Review
Before I try to start working on this bug, it probably make sense to clean up make_unicode.py a little bit.

Files regenerated with:
  cd js/src/vm
  ./make_unicode.py --version=6.2.0 --casefold-version=8.0.0 --no-keep
Assignee: nobody → andrebargull
Attachment #8351004 - Attachment is obsolete: true
Status: REOPENED → ASSIGNED
Attachment #8807589 - Flags: review?(arai.unmht)
what's the intent of adding --directory and --no-keep ?
are we going to remove copy of CaseFolding.txt and UnicodeData.txt from tree?
Flags: needinfo?(andrebargull)
(In reply to Tooru Fujisawa [:arai] from comment #17)
> what's the intent of adding --directory and --no-keep ?
> are we going to remove copy of CaseFolding.txt and UnicodeData.txt from tree?

I've just added those to make it simpler for me to test and update the files. For example I needed quite some time to realize we're currently using two different Unicode version (6.2 and 8.0), and with those parameters I could store the downloaded files to /tmp/unicode<version> without needing to download the different Unicode files over and over again. And with --no-keep I can update the generated Unicode 6.2 files without also downgrading vm/UnicodeData.txt from Unicode 8 to Unicode 6.2. 

IOW we can probably remove them again as soon as we use a single Unicode version.
Flags: needinfo?(andrebargull)
Attached patch bug917436-part2.patch (obsolete) — Splinter Review
And part 2 with the actual changes to use the Unicode properties ID_Start and ID_Continue.
Attachment #8807691 - Flags: review?(arai.unmht)
(In reply to André Bargull from comment #18)
> And with --no-keep
> I can update the generated Unicode 6.2 files without also downgrading
> vm/UnicodeData.txt from Unicode 8 to Unicode 6.2. 

vm/UnicodeData.txt is Unicode 6.2
so you won't downgrade it if you download 6.2 there.
Comment on attachment 8807589 [details] [diff] [review]
bug917436-part1.patch

Review of attachment 8807589 [details] [diff] [review]:
-----------------------------------------------------------------

make_unicode.py can be simplified more,
options that needs are:
  * download from network, or use local file
  * which versions to download (or maybe the version of local file)

otherwise, this looks nice :)

::: js/src/vm/make_unicode.py
@@ +29,3 @@
>  
> +# ECMAScript 2016
> +# §11.2 White Space

there's some places that uses "$" as section mark, so, if you use "§" instead it would be nice to replace all of them in this file, to search-ability.

@@ +180,5 @@
>      same_upper_dummy = (0, 0, 0)
>      same_upper_table = [same_upper_dummy]
>      same_upper_cache = {same_upper_dummy: 0}
> +    same_upper_index = [0] * (MAX_BMP + 1)
> +    

please remove white spaces here.

@@ +224,5 @@
>              flags |= FLAG_SPACE
>              test_space_table.append(code)
>          if category in ['Lu', 'Ll', 'Lt', 'Lm', 'Lo', 'Nl']: # $ 7.6 (UnicodeLetter)
>              flags |= FLAG_LETTER
> +        if category in ['Mn', 'Mc', 'Nd', 'Pc'] or code in compatibility_identifier_part: # $ 7.6 (IdentifierPart)

while you're touching this line, can you move the comment to independent line before if-statement?
same for "# $ 7.6 (UnicodeLetter)"

@@ +736,5 @@
> +
> +    def to_download_url(version):
> +        baseurl = 'http://unicode.org/Public'
> +        if version is 'UNIDATA':
> +            return '%s/%s' % (baseurl, version)

I thought "...".format() is preferred over % operator these days, but I'm not sure.
at least it will be deprecated in the future.

maybe we should ask someone else about coding rule.

@@ +767,5 @@
> +            if info['download'] is not None:
> +                print('\t%s download directory: %s' % (info['name'], info['download']))
> +                if not os.path.isdir(info['download']):
> +                    raise RuntimeError('not a directory: %s' % info['download'])
> +        if info['local'] is not None:

can this be just "else:" ?

@@ +876,5 @@
> +    parser.add_argument('--casefold-directory',
> +                        help='Saves downloaded Unicode casefolding files in the given directory')
> +
> +    parser.add_argument('--no-keep', default=False, action='store_true',
> +                        help='Don\'t save downloaded files in the current directory')

so, --directory, --casefold-directory, and --no-keep options won't be necessary.

@@ +881,5 @@
> +
> +    parser.add_argument('local_directory', nargs="?",
> +                        help='Local Unicode directory, if omitted downloads files from unicode.org')
> +    parser.add_argument('casefold_local_directory', nargs="?",
> +                        help='Local Unicode casefold directory, if omitted downloads files from unicode.org')

and also having different directory for them won't be necessary.
maybe we just drop directory parameters and just use current directory.
Attachment #8807589 - Flags: review?(arai.unmht) → feedback+
If forgot to mention that part 2 depends on bug 1314037.
Comment on attachment 8807691 [details] [diff] [review]
bug917436-part2.patch

Review of attachment 8807691 [details] [diff] [review]:
-----------------------------------------------------------------

looks good, except the downloading part in make_unicode.py that would be changed along with part 1.

::: js/src/vm/Unicode.h
@@ +55,5 @@
>   *      return True
>   *
>   */
>  
> +enum class CharFlag : uint8_t {

as discussed in IRC, please change this to namespace,
and remove cast below.
Attachment #8807691 - Flags: review?(arai.unmht) → feedback+
Attached patch bug917436-part1.patch (obsolete) — Splinter Review
Updated part 1 per review comments.
Attachment #8807589 - Attachment is obsolete: true
Attachment #8807768 - Flags: review?(arai.unmht)
Updated part 2 per review comments.
Attachment #8807691 - Attachment is obsolete: true
Attachment #8807775 - Flags: review?(arai.unmht)
Comment on attachment 8807768 [details] [diff] [review]
bug917436-part1.patch

Review of attachment 8807768 [details] [diff] [review]:
-----------------------------------------------------------------

thank you :D

r+ with minor fixes (sorry I should've caught them before)

::: js/src/vm/make_unicode.py
@@ +762,5 @@
> +            tfile.write(data)
> +            tfile.flush()
> +            tfile.seek(0)
> +        else:
> +            tfile = os.path.join(os.getcwd(), fname)

reusing `tfile` with different types won't be nice.
also, `os.path.join(os.getcwd(), fname)` is used in both branch.
so, it would be nice to create this before if-branch, with different name.
maybe tfile_path or something

@@ +768,5 @@
> +                raise RuntimeError('File not found: %s' % tfile)
> +            tfile = io.open(tfile, 'rb');
> +        return tfile
> +
> +    def version_from_file(info, f, fname):

you could remove `info` parameter

@@ +823,5 @@
> +
> +    parser = argparse.ArgumentParser(description='Update Unicode data.')
> +
> +    parser.add_argument('--version',
> +                        help='Unicode version number, use "UNIDATA" to select to latest published version')

since now this is a switch to download or use local file,
it would be nice to mention that the version number is for downloading, and also specifying this argument will download the file instead of using local file.
Attachment #8807768 - Flags: review?(arai.unmht) → review+
Attachment #8807775 - Flags: review?(arai.unmht) → review+
Depends on: 1314037
Updated part 1 to fix remaining nits. Carrying r+ from arai.
Attachment #8807768 - Attachment is obsolete: true
Attachment #8807784 - Flags: review+
Keywords: checkin-needed
Pushed by ryanvm@gmail.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/816fef70c11d
Part 1: Record Unicode version in files generated from make_unicode.py. r=arai
https://hg.mozilla.org/integration/mozilla-inbound/rev/483dc75949fa
Part 2: Use IDStart and IDContinue Unicode properties for identifiers. r=arai
Keywords: checkin-needed
https://hg.mozilla.org/mozilla-central/rev/816fef70c11d
https://hg.mozilla.org/mozilla-central/rev/483dc75949fa
Status: ASSIGNED → RESOLVED
Closed: 9 years ago8 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla52
Shouldn't nightly now be passing the "Syntax/Unicode point escapes/identifiers" test on http://kangax.github.io/compat-table/es6/ , since this issue claims identifier syntax now matches ES6 spec?
(In reply to Alexandre Folle de Menezes from comment #31)
> Shouldn't nightly now be passing the "Syntax/Unicode point
> escapes/identifiers" test on http://kangax.github.io/compat-table/es6/ ,
> since this issue claims identifier syntax now matches ES6 spec?

the testcase depends on bug 1197230 to handle non-BMP code points.

> var \u{102C0} = { \u{102C0} : 2 };
> return \u{102C0}['\ud800\udec0'] === 2;
Added to https://developer.mozilla.org/en-US/Firefox/Releases/52#JavaScript

I don't think this needs to be in any reference doc (seems like a detail to me). Please let me know, if you think this should be mentioned somewhere on MDN.
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: