Closed
Bug 917436
Opened 11 years ago
Closed 8 years ago
Update identifier syntax to match ES6
Categories
(Core :: JavaScript Engine, defect)
Core
JavaScript Engine
Tracking
()
RESOLVED
FIXED
mozilla52
Tracking | Status | |
---|---|---|
firefox52 | --- | fixed |
People
(Reporter: annevk, Assigned: anba)
References
(Blocks 1 open bug)
Details
(Keywords: dev-doc-complete)
Attachments
(2 files, 4 obsolete files)
58.41 KB,
patch
|
arai
:
review+
|
Details | Diff | Splinter Review |
859.79 KB,
patch
|
anba
:
review+
|
Details | Diff | Splinter Review |
TC39 discussed http://esdiscuss.org/topic/backwards-compatibility-and-u-2e2f-in-identifier-s and concluded that we should not deviate from Unicode because we don't want to separate lists from Unicode.
That means \u2e2f and ⸯ need to throw when used as identifier.
Comment 1•11 years ago
|
||
Comment 2•11 years ago
|
||
This isn't that easy to implement.
We're checking for IdentifierStart by first checking the ASCII-range special cases (i.e. $ and _) and then by whether we have a unicode letter. In the check for IdentifierPart, we do the latter, plus checking for the flag IDENTIFIER_PART, which is set by the vm/make_unicode.py script.
Fixing the latter is doable by changing make_unicode.py to set IDENTIFIER_PART for all letters, too, and explicitly excluding \u2e2f. Hacky, but at least the ugly is in generator script, not in the engine itself. The in-engine change is just to not check for letters in IsIdentifierPart.
However, now we are in the situation were the identifier is ended, and then we check for IsIdentifierStart next. Which returns true, because \u2e2f is a letter. The parser is not prepared for this situation being possible, and does stupid things until a failing assert puts it out if its misery.
The core issue is that, as far as Unicode is concerned, we shouldn't check for letters, or $ or _ at all, but for whether the char has the ID_Start property. Which is defined in DerivedCoreProperties.txt. Which we don't even open. The only simple way around that is to explicitly check for \u2e2f in IsIdentifierStart and return false for it. Urgh.
The attached patch does these things. It doesn't fully handle a literal ⸯ, but I don't really want to look into that until someone tells me that, yes, this is messy, but so is reality.
Updated•11 years ago
|
Assignee: general → till
Status: NEW → ASSIGNED
Comment 3•11 years ago
|
||
(In reply to Till Schneidereit [:till] from comment #2)
> The core issue is that, as far as Unicode is concerned, we shouldn't check
> for letters, or $ or _ at all, but for whether the char has the ID_Start
> property. Which is defined in DerivedCoreProperties.txt. Which we don't even
> open.
There is another way. You could just stick to using Unicode category data, by following <http://www.unicode.org/reports/tr31/#Default_Identifier_Syntax>:
ID_Start: Characters having the Unicode General_Category of uppercase letters (Lu), lowercase letters (Ll), titlecase letters (Lt), modifier letters (Lm), other letters (Lo), letter numbers (Nl), minus Pattern_Syntax and Pattern_White_Space code points, plus stability extensions.
ID_Continue: All of the above, plus characters having the Unicode General_Category of nonspacing marks (Mn), spacing combining marks (Mc), decimal number (Nd), connector punctuations (Pc), plus stability extensions, minus Pattern_Syntax and Pattern_White_Space code points.
Here’s an example script that gets all the Unicode code points for `IdentifierStart` and `IdentifierPart` as per ES6 using that technique: https://gist.github.com/mathiasbynens/6334847#file-javascript-identifier-regex-js-L65-L102
Comment 4•11 years ago
|
||
Looks like it would be fairly easy to tweak that script into a replacement for `vm/make_unicode.py`.
Comment 5•11 years ago
|
||
(In reply to Mathias Bynens from comment #4)
> Looks like it would be fairly easy to tweak that script into a replacement
> for `vm/make_unicode.py`.
Patches very much welcome. :) Sadly, we don't have the infrastructure to run JS scripts during the SpiderMonkey builds, though, so this would have to be ported to Python.
Comment 6•10 years ago
|
||
Some more info & examples: https://mathiasbynens.be/notes/javascript-identifiers-es6
Should I file a separate bug for updating the identifier grammar to ES6?
Comment 7•10 years ago
|
||
I think we can just re=purpose this bug. It seems likely that we'd do all of it in one go.
It seems unlikely, however, that I'll be able to get to this anytime soon, so unassigning myself.
Assignee: till → nobody
Status: ASSIGNED → NEW
OS: Mac OS X → All
Hardware: x86 → All
Summary: Update identifier syntax to disallow \u2e2f → Update identifier syntax to match ES6
Comment 8•10 years ago
|
||
Tests (based on ES6 and Unicode 5.1.0, i.e. the minimum required Unicode version as per the spec): https://mathias.html5.org/tests/javascript/identifiers/
Updated•9 years ago
|
Assignee: nobody → winter2718
Comment 10•9 years ago
|
||
I've changed the script to derive the Id_Start and Id_Continue properties from their constituent categories, however, there's another hitch. UnicodeData.txt from all versions above 6.0.0 fail with:
Downloading...
Generating...
Best: 2048+8224 bins at shift 5; 12320 bytes
Size of original table: 65536 bytes
Traceback (most recent call last):
File "make_unicode.py", line 377, in <module>
open('../tests/ecma_5/String/string-space-trim.js', 'w'))
File "make_unicode.py", line 282, in generate_unicode_stuff
dump(index1, 'index1', data_file)
File "make_unicode.py", line 268, in dump
assert entry < 256
AssertionError
Versions <= 6.0.0 still work.
Comment 11•9 years ago
|
||
I think bug 1197230 is a dup of this -- would mark if I didn't have a quick errand before heading to the ofice.
Note also the last two comments may be of some interest. The rumor mill leads me to suspect the blocker mentioned there *might* be favorably resolved in the next few months. But I'm nowhere near certain enough of that to bet on it, and we might still be stuck fixing the script here.
As to why make_unicode.py might be failing, I had *thought* it was run when we updated to ICU 56. Maybe I'm wrong about that. evilpie wrote this stuff originally, he might be able to help with it. (Or maybe not, if he's paged it out of memory as well as I have. :-) )
Comment 12•9 years ago
|
||
(In reply to Jeff Walden [:Waldo] (remove +bmo to email) from comment #11)
> I think bug 1197230 is a dup of this -- would mark if I didn't have a quick
> errand before heading to the ofice.
>
> Note also the last two comments may be of some interest. The rumor mill
> leads me to suspect the blocker mentioned there *might* be favorably
> resolved in the next few months. But I'm nowhere near certain enough of
> that to bet on it, and we might still be stuck fixing the script here.
>
> As to why make_unicode.py might be failing, I had *thought* it was run when
> we updated to ICU 56. Maybe I'm wrong about that. evilpie wrote this stuff
> originally, he might be able to help with it. (Or maybe not, if he's paged
> it out of memory as well as I have. :-) )
You are my hero. I'm going to reassess my work on this then. It's a deep dark rabbit hole, and while I definitely appreciate the compaction being done in make_unicode.py it's very painful to maintain for a newcomer.
Updated•9 years ago
|
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → DUPLICATE
Comment 14•9 years ago
|
||
Actually, not sure this is a true duplicate. Reopening and adding the tokenizer bug as a blocker.
Status: RESOLVED → REOPENED
Resolution: DUPLICATE → ---
Comment 15•9 years ago
|
||
Going to release this one in case someone else decides to grab it. Otherwise, I'll circle back later.
Assignee: winter2718 → nobody
Updated•9 years ago
|
Keywords: dev-doc-needed
Assignee | ||
Comment 16•8 years ago
|
||
Before I try to start working on this bug, it probably make sense to clean up make_unicode.py a little bit.
Files regenerated with:
cd js/src/vm
./make_unicode.py --version=6.2.0 --casefold-version=8.0.0 --no-keep
Assignee: nobody → andrebargull
Attachment #8351004 -
Attachment is obsolete: true
Status: REOPENED → ASSIGNED
Attachment #8807589 -
Flags: review?(arai.unmht)
Comment 17•8 years ago
|
||
what's the intent of adding --directory and --no-keep ?
are we going to remove copy of CaseFolding.txt and UnicodeData.txt from tree?
Flags: needinfo?(andrebargull)
Assignee | ||
Comment 18•8 years ago
|
||
(In reply to Tooru Fujisawa [:arai] from comment #17)
> what's the intent of adding --directory and --no-keep ?
> are we going to remove copy of CaseFolding.txt and UnicodeData.txt from tree?
I've just added those to make it simpler for me to test and update the files. For example I needed quite some time to realize we're currently using two different Unicode version (6.2 and 8.0), and with those parameters I could store the downloaded files to /tmp/unicode<version> without needing to download the different Unicode files over and over again. And with --no-keep I can update the generated Unicode 6.2 files without also downgrading vm/UnicodeData.txt from Unicode 8 to Unicode 6.2.
IOW we can probably remove them again as soon as we use a single Unicode version.
Flags: needinfo?(andrebargull)
Assignee | ||
Comment 19•8 years ago
|
||
And part 2 with the actual changes to use the Unicode properties ID_Start and ID_Continue.
Attachment #8807691 -
Flags: review?(arai.unmht)
Comment 20•8 years ago
|
||
(In reply to André Bargull from comment #18)
> And with --no-keep
> I can update the generated Unicode 6.2 files without also downgrading
> vm/UnicodeData.txt from Unicode 8 to Unicode 6.2.
vm/UnicodeData.txt is Unicode 6.2
so you won't downgrade it if you download 6.2 there.
Comment 21•8 years ago
|
||
Comment on attachment 8807589 [details] [diff] [review]
bug917436-part1.patch
Review of attachment 8807589 [details] [diff] [review]:
-----------------------------------------------------------------
make_unicode.py can be simplified more,
options that needs are:
* download from network, or use local file
* which versions to download (or maybe the version of local file)
otherwise, this looks nice :)
::: js/src/vm/make_unicode.py
@@ +29,3 @@
>
> +# ECMAScript 2016
> +# §11.2 White Space
there's some places that uses "$" as section mark, so, if you use "§" instead it would be nice to replace all of them in this file, to search-ability.
@@ +180,5 @@
> same_upper_dummy = (0, 0, 0)
> same_upper_table = [same_upper_dummy]
> same_upper_cache = {same_upper_dummy: 0}
> + same_upper_index = [0] * (MAX_BMP + 1)
> +
please remove white spaces here.
@@ +224,5 @@
> flags |= FLAG_SPACE
> test_space_table.append(code)
> if category in ['Lu', 'Ll', 'Lt', 'Lm', 'Lo', 'Nl']: # $ 7.6 (UnicodeLetter)
> flags |= FLAG_LETTER
> + if category in ['Mn', 'Mc', 'Nd', 'Pc'] or code in compatibility_identifier_part: # $ 7.6 (IdentifierPart)
while you're touching this line, can you move the comment to independent line before if-statement?
same for "# $ 7.6 (UnicodeLetter)"
@@ +736,5 @@
> +
> + def to_download_url(version):
> + baseurl = 'http://unicode.org/Public'
> + if version is 'UNIDATA':
> + return '%s/%s' % (baseurl, version)
I thought "...".format() is preferred over % operator these days, but I'm not sure.
at least it will be deprecated in the future.
maybe we should ask someone else about coding rule.
@@ +767,5 @@
> + if info['download'] is not None:
> + print('\t%s download directory: %s' % (info['name'], info['download']))
> + if not os.path.isdir(info['download']):
> + raise RuntimeError('not a directory: %s' % info['download'])
> + if info['local'] is not None:
can this be just "else:" ?
@@ +876,5 @@
> + parser.add_argument('--casefold-directory',
> + help='Saves downloaded Unicode casefolding files in the given directory')
> +
> + parser.add_argument('--no-keep', default=False, action='store_true',
> + help='Don\'t save downloaded files in the current directory')
so, --directory, --casefold-directory, and --no-keep options won't be necessary.
@@ +881,5 @@
> +
> + parser.add_argument('local_directory', nargs="?",
> + help='Local Unicode directory, if omitted downloads files from unicode.org')
> + parser.add_argument('casefold_local_directory', nargs="?",
> + help='Local Unicode casefold directory, if omitted downloads files from unicode.org')
and also having different directory for them won't be necessary.
maybe we just drop directory parameters and just use current directory.
Attachment #8807589 -
Flags: review?(arai.unmht) → feedback+
Assignee | ||
Comment 22•8 years ago
|
||
If forgot to mention that part 2 depends on bug 1314037.
Comment 23•8 years ago
|
||
Comment on attachment 8807691 [details] [diff] [review]
bug917436-part2.patch
Review of attachment 8807691 [details] [diff] [review]:
-----------------------------------------------------------------
looks good, except the downloading part in make_unicode.py that would be changed along with part 1.
::: js/src/vm/Unicode.h
@@ +55,5 @@
> * return True
> *
> */
>
> +enum class CharFlag : uint8_t {
as discussed in IRC, please change this to namespace,
and remove cast below.
Attachment #8807691 -
Flags: review?(arai.unmht) → feedback+
Assignee | ||
Comment 24•8 years ago
|
||
Updated part 1 per review comments.
Attachment #8807589 -
Attachment is obsolete: true
Attachment #8807768 -
Flags: review?(arai.unmht)
Assignee | ||
Comment 25•8 years ago
|
||
Updated part 2 per review comments.
Attachment #8807691 -
Attachment is obsolete: true
Attachment #8807775 -
Flags: review?(arai.unmht)
Comment 26•8 years ago
|
||
Comment on attachment 8807768 [details] [diff] [review]
bug917436-part1.patch
Review of attachment 8807768 [details] [diff] [review]:
-----------------------------------------------------------------
thank you :D
r+ with minor fixes (sorry I should've caught them before)
::: js/src/vm/make_unicode.py
@@ +762,5 @@
> + tfile.write(data)
> + tfile.flush()
> + tfile.seek(0)
> + else:
> + tfile = os.path.join(os.getcwd(), fname)
reusing `tfile` with different types won't be nice.
also, `os.path.join(os.getcwd(), fname)` is used in both branch.
so, it would be nice to create this before if-branch, with different name.
maybe tfile_path or something
@@ +768,5 @@
> + raise RuntimeError('File not found: %s' % tfile)
> + tfile = io.open(tfile, 'rb');
> + return tfile
> +
> + def version_from_file(info, f, fname):
you could remove `info` parameter
@@ +823,5 @@
> +
> + parser = argparse.ArgumentParser(description='Update Unicode data.')
> +
> + parser.add_argument('--version',
> + help='Unicode version number, use "UNIDATA" to select to latest published version')
since now this is a switch to download or use local file,
it would be nice to mention that the version number is for downloading, and also specifying this argument will download the file instead of using local file.
Attachment #8807768 -
Flags: review?(arai.unmht) → review+
Updated•8 years ago
|
Attachment #8807775 -
Flags: review?(arai.unmht) → review+
Assignee | ||
Comment 27•8 years ago
|
||
Updated part 1 to fix remaining nits. Carrying r+ from arai.
Attachment #8807768 -
Attachment is obsolete: true
Attachment #8807784 -
Flags: review+
Assignee | ||
Comment 28•8 years ago
|
||
Assignee | ||
Updated•8 years ago
|
Keywords: checkin-needed
Comment 29•8 years ago
|
||
Pushed by ryanvm@gmail.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/816fef70c11d
Part 1: Record Unicode version in files generated from make_unicode.py. r=arai
https://hg.mozilla.org/integration/mozilla-inbound/rev/483dc75949fa
Part 2: Use IDStart and IDContinue Unicode properties for identifiers. r=arai
Keywords: checkin-needed
Comment 30•8 years ago
|
||
bugherder |
https://hg.mozilla.org/mozilla-central/rev/816fef70c11d
https://hg.mozilla.org/mozilla-central/rev/483dc75949fa
Status: ASSIGNED → RESOLVED
Closed: 9 years ago → 8 years ago
status-firefox52:
--- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla52
Comment 31•8 years ago
|
||
Shouldn't nightly now be passing the "Syntax/Unicode point escapes/identifiers" test on http://kangax.github.io/compat-table/es6/ , since this issue claims identifier syntax now matches ES6 spec?
Comment 32•8 years ago
|
||
(In reply to Alexandre Folle de Menezes from comment #31)
> Shouldn't nightly now be passing the "Syntax/Unicode point
> escapes/identifiers" test on http://kangax.github.io/compat-table/es6/ ,
> since this issue claims identifier syntax now matches ES6 spec?
the testcase depends on bug 1197230 to handle non-BMP code points.
> var \u{102C0} = { \u{102C0} : 2 };
> return \u{102C0}['\ud800\udec0'] === 2;
Comment 33•8 years ago
|
||
Added to https://developer.mozilla.org/en-US/Firefox/Releases/52#JavaScript
I don't think this needs to be in any reference doc (seems like a detail to me). Please let me know, if you think this should be mentioned somewhere on MDN.
Keywords: dev-doc-needed → dev-doc-complete
You need to log in
before you can comment on or make changes to this bug.
Description
•