<a class="header-button" href="https://bugzilla.mozilla.org/home" title="Go to home page"> Bugzilla

Comment 1

•

11 years ago

Tests: http://mathias.html5.org/tests/javascript/identifiers/

Comment 2

•

11 years ago

Attached patch Hacks to disallow \u2e2f in identifiers (obsolete) — Details — Splinter Review

This isn't that easy to implement. We're checking for IdentifierStart by first checking the ASCII-range special cases (i.e. $ and _) and then by whether we have a unicode letter. In the check for IdentifierPart, we do the latter, plus checking for the flag IDENTIFIER_PART, which is set by the vm/make_unicode.py script. Fixing the latter is doable by changing make_unicode.py to set IDENTIFIER_PART for all letters, too, and explicitly excluding \u2e2f. Hacky, but at least the ugly is in generator script, not in the engine itself. The in-engine change is just to not check for letters in IsIdentifierPart. However, now we are in the situation were the identifier is ended, and then we check for IsIdentifierStart next. Which returns true, because \u2e2f is a letter. The parser is not prepared for this situation being possible, and does stupid things until a failing assert puts it out if its misery. The core issue is that, as far as Unicode is concerned, we shouldn't check for letters, or $ or _ at all, but for whether the char has the ID_Start property. Which is defined in DerivedCoreProperties.txt. Which we don't even open. The only simple way around that is to explicitly check for \u2e2f in IsIdentifierStart and return false for it. Urgh. The attached patch does these things. It doesn't fully handle a literal ⸯ, but I don't really want to look into that until someone tells me that, yes, this is messy, but so is reality.

Updated

•

11 years ago

Assignee: general → till

Status: NEW → ASSIGNED

Comment 3

•

11 years ago

(In reply to Till Schneidereit [:till] from comment #2) > The core issue is that, as far as Unicode is concerned, we shouldn't check > for letters, or $ or _ at all, but for whether the char has the ID_Start > property. Which is defined in DerivedCoreProperties.txt. Which we don't even > open. There is another way. You could just stick to using Unicode category data, by following <http://www.unicode.org/reports/tr31/#Default_Identifier_Syntax>: ID_Start: Characters having the Unicode General_Category of uppercase letters (Lu), lowercase letters (Ll), titlecase letters (Lt), modifier letters (Lm), other letters (Lo), letter numbers (Nl), minus Pattern_Syntax and Pattern_White_Space code points, plus stability extensions. ID_Continue: All of the above, plus characters having the Unicode General_Category of nonspacing marks (Mn), spacing combining marks (Mc), decimal number (Nd), connector punctuations (Pc), plus stability extensions, minus Pattern_Syntax and Pattern_White_Space code points. Here’s an example script that gets all the Unicode code points for `IdentifierStart` and `IdentifierPart` as per ES6 using that technique: https://gist.github.com/mathiasbynens/6334847#file-javascript-identifier-regex-js-L65-L102

Comment 4

•

11 years ago

Looks like it would be fairly easy to tweak that script into a replacement for `vm/make_unicode.py`.

Comment 5

•

11 years ago

(In reply to Mathias Bynens from comment #4) > Looks like it would be fairly easy to tweak that script into a replacement > for `vm/make_unicode.py`. Patches very much welcome. :) Sadly, we don't have the infrastructure to run JS scripts during the SpiderMonkey builds, though, so this would have to be ported to Python.

Comment 6

•

10 years ago

Some more info & examples: https://mathiasbynens.be/notes/javascript-identifiers-es6 Should I file a separate bug for updating the identifier grammar to ES6?

Comment 7

•

10 years ago

I think we can just re=purpose this bug. It seems likely that we'd do all of it in one go. It seems unlikely, however, that I'll be able to get to this anytime soon, so unassigning myself.

Assignee: till → nobody

Status: ASSIGNED → NEW

OS: Mac OS X → All

Hardware: x86 → All

Summary: Update identifier syntax to disallow \u2e2f → Update identifier syntax to match ES6

Comment 8

•

10 years ago

Tests (based on ES6 and Unicode 5.1.0, i.e. the minimum required Unicode version as per the spec): https://mathias.html5.org/tests/javascript/identifiers/

Updated

•

9 years ago

Assignee: nobody → winter2718

Comment 10

•

9 years ago

I've changed the script to derive the Id_Start and Id_Continue properties from their constituent categories, however, there's another hitch. UnicodeData.txt from all versions above 6.0.0 fail with: Downloading... Generating... Best: 2048+8224 bins at shift 5; 12320 bytes Size of original table: 65536 bytes Traceback (most recent call last): File "make_unicode.py", line 377, in <module> open('../tests/ecma_5/String/string-space-trim.js', 'w')) File "make_unicode.py", line 282, in generate_unicode_stuff dump(index1, 'index1', data_file) File "make_unicode.py", line 268, in dump assert entry < 256 AssertionError Versions <= 6.0.0 still work.

Updated

•

9 years ago

Depends on: 1230490

Jeff Walden [:Waldo]

Comment 11

•

9 years ago

I think bug 1197230 is a dup of this -- would mark if I didn't have a quick errand before heading to the ofice. Note also the last two comments may be of some interest. The rumor mill leads me to suspect the blocker mentioned there *might* be favorably resolved in the next few months. But I'm nowhere near certain enough of that to bet on it, and we might still be stuck fixing the script here. As to why make_unicode.py might be failing, I had *thought* it was run when we updated to ICU 56. Maybe I'm wrong about that. evilpie wrote this stuff originally, he might be able to help with it. (Or maybe not, if he's paged it out of memory as well as I have. :-) )

Comment 12

•

9 years ago

(In reply to Jeff Walden [:Waldo] (remove +bmo to email) from comment #11) > I think bug 1197230 is a dup of this -- would mark if I didn't have a quick > errand before heading to the ofice. > > Note also the last two comments may be of some interest. The rumor mill > leads me to suspect the blocker mentioned there *might* be favorably > resolved in the next few months. But I'm nowhere near certain enough of > that to bet on it, and we might still be stuck fixing the script here. > > As to why make_unicode.py might be failing, I had *thought* it was run when > we updated to ICU 56. Maybe I'm wrong about that. evilpie wrote this stuff > originally, he might be able to help with it. (Or maybe not, if he's paged > it out of memory as well as I have. :-) ) You are my hero. I'm going to reassess my work on this then. It's a deep dark rabbit hole, and while I definitely appreciate the compaction being done in make_unicode.py it's very painful to maintain for a newcomer.

Updated

•

9 years ago

Status: NEW → RESOLVED

Closed: 9 years ago

Resolution: --- → DUPLICATE

Comment 14

•

9 years ago

Actually, not sure this is a true duplicate. Reopening and adding the tokenizer bug as a blocker.

Status: RESOLVED → REOPENED

Resolution: DUPLICATE → ---

Updated

•

9 years ago

Depends on: 1197230

Comment 15

•

9 years ago

Going to release this one in case someone else decides to grab it. Otherwise, I'll circle back later.

Assignee: winter2718 → nobody

Updated

•

9 years ago

Updated

•

9 years ago

Keywords: dev-doc-needed

Assignee

Comment 16

•

8 years ago

Attached patch bug917436-part1.patch (obsolete) — Details — Splinter Review

Before I try to start working on this bug, it probably make sense to clean up make_unicode.py a little bit. Files regenerated with: cd js/src/vm ./make_unicode.py --version=6.2.0 --casefold-version=8.0.0 --no-keep

Assignee: nobody → andrebargull

Attachment #8351004 - Attachment is obsolete: true

Status: REOPENED → ASSIGNED

Attachment #8807589 - Flags: review?(arai.unmht)

Comment 17

•

8 years ago

what's the intent of adding --directory and --no-keep ? are we going to remove copy of CaseFolding.txt and UnicodeData.txt from tree?

Flags: needinfo?(andrebargull)

Assignee

Comment 18

•

8 years ago

(In reply to Tooru Fujisawa [:arai] from comment #17) > what's the intent of adding --directory and --no-keep ? > are we going to remove copy of CaseFolding.txt and UnicodeData.txt from tree? I've just added those to make it simpler for me to test and update the files. For example I needed quite some time to realize we're currently using two different Unicode version (6.2 and 8.0), and with those parameters I could store the downloaded files to /tmp/unicode<version> without needing to download the different Unicode files over and over again. And with --no-keep I can update the generated Unicode 6.2 files without also downgrading vm/UnicodeData.txt from Unicode 8 to Unicode 6.2. IOW we can probably remove them again as soon as we use a single Unicode version.

Flags: needinfo?(andrebargull)

Assignee

Comment 19

•

8 years ago

Attached patch bug917436-part2.patch (obsolete) — Details — Splinter Review

And part 2 with the actual changes to use the Unicode properties ID_Start and ID_Continue.

Attachment #8807691 - Flags: review?(arai.unmht)

Comment 20

•

8 years ago

(In reply to André Bargull from comment #18) > And with --no-keep > I can update the generated Unicode 6.2 files without also downgrading > vm/UnicodeData.txt from Unicode 8 to Unicode 6.2. vm/UnicodeData.txt is Unicode 6.2 so you won't downgrade it if you download 6.2 there.

Comment 21

•

8 years ago

Comment on attachment 8807589 [details] [diff] [review] bug917436-part1.patch Review of attachment 8807589 [details] [diff] [review]: ----------------------------------------------------------------- make_unicode.py can be simplified more, options that needs are: * download from network, or use local file * which versions to download (or maybe the version of local file) otherwise, this looks nice :) ::: js/src/vm/make_unicode.py @@ +29,3 @@ > > +# ECMAScript 2016 > +# §11.2 White Space there's some places that uses "$" as section mark, so, if you use "§" instead it would be nice to replace all of them in this file, to search-ability. @@ +180,5 @@ > same_upper_dummy = (0, 0, 0) > same_upper_table = [same_upper_dummy] > same_upper_cache = {same_upper_dummy: 0} > + same_upper_index = [0] * (MAX_BMP + 1) > + please remove white spaces here. @@ +224,5 @@ > flags |= FLAG_SPACE > test_space_table.append(code) > if category in ['Lu', 'Ll', 'Lt', 'Lm', 'Lo', 'Nl']: # $ 7.6 (UnicodeLetter) > flags |= FLAG_LETTER > + if category in ['Mn', 'Mc', 'Nd', 'Pc'] or code in compatibility_identifier_part: # $ 7.6 (IdentifierPart) while you're touching this line, can you move the comment to independent line before if-statement? same for "# $ 7.6 (UnicodeLetter)" @@ +736,5 @@ > + > + def to_download_url(version): > + baseurl = 'http://unicode.org/Public' > + if version is 'UNIDATA': > + return '%s/%s' % (baseurl, version) I thought "...".format() is preferred over % operator these days, but I'm not sure. at least it will be deprecated in the future. maybe we should ask someone else about coding rule. @@ +767,5 @@ > + if info['download'] is not None: > + print('\t%s download directory: %s' % (info['name'], info['download'])) > + if not os.path.isdir(info['download']): > + raise RuntimeError('not a directory: %s' % info['download']) > + if info['local'] is not None: can this be just "else:" ? @@ +876,5 @@ > + parser.add_argument('--casefold-directory', > + help='Saves downloaded Unicode casefolding files in the given directory') > + > + parser.add_argument('--no-keep', default=False, action='store_true', > + help='Don\'t save downloaded files in the current directory') so, --directory, --casefold-directory, and --no-keep options won't be necessary. @@ +881,5 @@ > + > + parser.add_argument('local_directory', nargs="?", > + help='Local Unicode directory, if omitted downloads files from unicode.org') > + parser.add_argument('casefold_local_directory', nargs="?", > + help='Local Unicode casefold directory, if omitted downloads files from unicode.org') and also having different directory for them won't be necessary. maybe we just drop directory parameters and just use current directory.

Attachment #8807589 - Flags: review?(arai.unmht) → feedback+

Assignee

Comment 22

•

8 years ago

If forgot to mention that part 2 depends on bug 1314037.

Comment 23

•

8 years ago

Comment on attachment 8807691 [details] [diff] [review] bug917436-part2.patch Review of attachment 8807691 [details] [diff] [review]: ----------------------------------------------------------------- looks good, except the downloading part in make_unicode.py that would be changed along with part 1. ::: js/src/vm/Unicode.h @@ +55,5 @@ > * return True > * > */ > > +enum class CharFlag : uint8_t { as discussed in IRC, please change this to namespace, and remove cast below.

Attachment #8807691 - Flags: review?(arai.unmht) → feedback+

Assignee

Comment 24

•

8 years ago

Attached patch bug917436-part1.patch (obsolete) — Details — Splinter Review

Updated part 1 per review comments.

Attachment #8807589 - Attachment is obsolete: true

Attachment #8807768 - Flags: review?(arai.unmht)

Assignee

Comment 25

•

8 years ago

Attached patch bug917436-part2.patch — Details — Splinter Review

Updated part 2 per review comments.

Attachment #8807691 - Attachment is obsolete: true

Attachment #8807775 - Flags: review?(arai.unmht)

Comment 26

•

8 years ago

Comment on attachment 8807768 [details] [diff] [review] bug917436-part1.patch Review of attachment 8807768 [details] [diff] [review]: ----------------------------------------------------------------- thank you :D r+ with minor fixes (sorry I should've caught them before) ::: js/src/vm/make_unicode.py @@ +762,5 @@ > + tfile.write(data) > + tfile.flush() > + tfile.seek(0) > + else: > + tfile = os.path.join(os.getcwd(), fname) reusing `tfile` with different types won't be nice. also, `os.path.join(os.getcwd(), fname)` is used in both branch. so, it would be nice to create this before if-branch, with different name. maybe tfile_path or something @@ +768,5 @@ > + raise RuntimeError('File not found: %s' % tfile) > + tfile = io.open(tfile, 'rb'); > + return tfile > + > + def version_from_file(info, f, fname): you could remove `info` parameter @@ +823,5 @@ > + > + parser = argparse.ArgumentParser(description='Update Unicode data.') > + > + parser.add_argument('--version', > + help='Unicode version number, use "UNIDATA" to select to latest published version') since now this is a switch to download or use local file, it would be nice to mention that the version number is for downloading, and also specifying this argument will download the file instead of using local file.

Attachment #8807768 - Flags: review?(arai.unmht) → review+

Updated

•

8 years ago

Attachment #8807775 - Flags: review?(arai.unmht) → review+

Assignee

Updated

•

8 years ago

Depends on: 1314037

Assignee

Comment 27

•

8 years ago

Attached patch bug917436-part1.patch — Details — Splinter Review

Updated part 1 to fix remaining nits. Carrying r+ from arai.

Attachment #8807768 - Attachment is obsolete: true

Attachment #8807784 - Flags: review+

Try: https://treeherder.mozilla.org/#/jobs?repo=try&revision=c835c94eb47cf872bef84882b9ee5dd0a4d384dd

Assignee

Comment 28

•

8 years ago

Wes Kocher (:KWierso) (Not reading bugmail; email directly if needed)

Assignee

Updated

•

8 years ago

Keywords: checkin-needed

Pulsebot

Comment 29

•

8 years ago

Pushed by ryanvm@gmail.com: https://hg.mozilla.org/integration/mozilla-inbound/rev/816fef70c11d Part 1: Record Unicode version in files generated from make_unicode.py. r=arai https://hg.mozilla.org/integration/mozilla-inbound/rev/483dc75949fa Part 2: Use IDStart and IDContinue Unicode properties for identifiers. r=arai

Keywords: checkin-needed

Comment 30

•

8 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/816fef70c11d https://hg.mozilla.org/mozilla-central/rev/483dc75949fa

Status: ASSIGNED → RESOLVED

Closed: 9 years ago → 8 years ago

status-firefox52: --- → fixed

Resolution: --- → FIXED

Target Milestone: --- → mozilla52

Alexandre Folle de Menezes

Comment 31

•

8 years ago

Shouldn't nightly now be passing the "Syntax/Unicode point escapes/identifiers" test on http://kangax.github.io/compat-table/es6/ , since this issue claims identifier syntax now matches ES6 spec?