The default bug view has changed. See this FAQ.

Support /u flag and the relevant semantics on regular expressions

RESOLVED FIXED in Firefox 46

Status

()

Core
JavaScript Engine
RESOLVED FIXED
2 years ago
10 months ago

People

(Reporter: till, Assigned: arai)

Tracking

(Blocks: 1 bug, {dev-doc-complete})

Trunk
mozilla46
dev-doc-complete
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(firefox46 fixed)

Details

(Whiteboard: [DocArea=JS])

Attachments

(11 attachments, 15 obsolete attachments)

24.83 KB, patch
arai
: review+
Details | Diff | Splinter Review
10.76 KB, patch
arai
: review+
Details | Diff | Splinter Review
35.75 KB, patch
till
: review+
Details | Diff | Splinter Review
5.29 KB, patch
till
: review+
Details | Diff | Splinter Review
11.34 KB, patch
till
: review+
Details | Diff | Splinter Review
213.12 KB, patch
till
: review+
Details | Diff | Splinter Review
16.34 KB, patch
till
: review+
Details | Diff | Splinter Review
18.14 KB, patch
till
: review+
Details | Diff | Splinter Review
4.80 KB, patch
till
: review+
Details | Diff | Splinter Review
10.94 KB, patch
till
: review+
Details | Diff | Splinter Review
47.33 KB, patch
till
: review+
Details | Diff | Splinter Review
(Reporter)

Description

2 years ago
ES6 introduces Unicode support in regular expressions. See e.g. step 10 of https://people.mozilla.org/~jorendorff/es6-draft.html#sec-get-regexp.prototype.flags
need to fix bug 320500 before writing testcase for this.
it's too hard to write without the feature :/
Depends on: 320500
Now testing patches: https://treeherder.mozilla.org/#/jobs?repo=try&revision=2954e0872abb

Before attaching patches, I'd like to post some idea related to the patch series.
To keep using char16_t in regexp engine, I made following rules, to convert unicode pattern into simple BMP pattern.

(A) Single non-BMP character
Convert it into a sequence of surrogate pair, and wrap them by group.

e.g.
  \u{1F438}  =>  (?:\uD83D\uDC38)


(B) Single non-BMP character in CharacterClass
Same as (A).

e.g.
  [\u{1F438}]  =>  (?:\uD83D\uDC38)


(C) BMP character and non-BMP character in CharacterClass
Convert it into two Alternatives, each for BMP character and non-BMP character, and perform (B).

e.g.
  [A\u{1F438}]  =>  (?:[A]|\uD83D\uDC38)


(D) multiple non-BMP characters in CharacterClass
Convert it into multiple Alternatives, each for character, and perform (B).

e.g.
  [\u{1F438}\u{1F43A}]  =>  (?:\uD83D\uDC38|\uD83D\uDC3A)


(E) non-BMP character range
Say the pattern is [\u{L1}\u{T1}-\u{L2}\u{T2}], perform forllowing translation
(some more optimization is applied in the patch though)

  if L1 == L2:
    if T1 == T2:
      return '\u{L1}\u{T1}'
    else
      return '\u{L1}[\u{T1}-\u{T2}]'
  elif L1 + 1 == L2:
    return '(?:\u{L1}[\u{T1}-\uDFFF]|\u{L2}[\uDC00-\u{T2}]'
  elif L1 + 2 == L2:
    return '(?:\u{L1}[\u{T1}-\uDFFF]|\u{L1+1}[\uDC00-\uDFFF]|\u{L2}[\uDC00-\u{T2}]'
  else
    return '(?:\u{L1}[\u{T1}-\uDFFF]|[\u{L1+1}-\u{L2-1}][\uDC00-\uDFFF]|\u{L2}[\uDC00-\u{T2}]'

e.g.
  [\u{1F428}-\u{1F438}]  =>  (?:\uD83D[\uDC28-\uDC38])


(F) Single standalone lead surrogate character (U+D800 - U+DBFF)
Append negative lookahead with trail surrogate characters range, and wrap by group.

e.g.
  \uD800 => (?:\uD800(?![\uDC00-\uDFFF]))


(G) Single standalone trail surrogate character (U+DC00 - U+DFFF)
Prepend negative lookbehind with lead surrogate characters range, and wrap by group.

e.g.
  \uDC00 => (?:(?<![\uD800-\uDBFF])\uDC00)

Of course, lookbehind is not supported, so I added pseudo-assertion that matches to it, for internal use only (no syntax for it).


(H) BMP characters, standalone surrogates, and non-BMP characters in CharacterClass
Convert it into 4 Alternatives, each for BMP range, lead surrogate range, trail surrogate range, and non-BMP range, and perform above.

e.g.
  [A\uDC00\uD800\u{1F428}-\u{1F438}] => (?:[A]|\uD800(?![\uDC00-\uDFFF])|(?<![\uD800-\uDBFF])\uDC00|\uD83D[\uDC28-\uDC38])


(I) "."
Convert it into corresponding CharacterClass, and perform (H).


(J) \S, \W, and \D
Convert them into corresponding CharacterClass, and perform (H).


(I) non-BMP character in ignoreCase pattern
Convert it into CharacterClass (if needed), and add folded (or reverse-folded) character.

e.g.
  \u{10401} => [\u{10401}\u{10429}]  -- folded
  \u{10429} => [\u{10401}\u{10429}]  -- reverse

and perform (D) later.
oops, there are two (I)s, second one is (K).

Then, prepared 9 patches, for each aspect of unicode pattern support.

Short Summary:
  Part 1:   support /u and add unicode flag accessor
            (added this first, to test others)
  Part 2,3: support \u{XXXXXX} and \uXXXX\uXXXX (surrogate pair) pattern -- (A)-(H)
  Part 4,5: support ".", \S, \W, and \D -- (I), (J)
  Part 6,7: support ignoreCase -- (K)
  Part 8:   disallow remaining extended patterns
  Part 9:   Make String.prototype.{match,replace} to use unicode flag
            (I'm going to fix the implementation again and make them ES6 compatible in bug 887016 though)

I'll post if the try run passes.
(Reporter)

Comment 4

2 years ago
Jandem, I think you're best qualified to comment on arai's proposal in comment 2.
Flags: needinfo?(jdemooij)
Created attachment 8600880 [details] [diff] [review]
Part 1: Implement RegExp unicode flag.

Posting some parts not related to the conversion in comment #2.

Part 1

  * add RegExp.prototype.unicode accessor
  * and /u flag (but does nothing, at this point)
  * support unicode flag in RegExp.prototype.flags
Assignee: nobody → arai.unmht
Attachment #8600880 - Flags: review?(till)
Created attachment 8600881 [details] [diff] [review]
Part 9: Use RegExp.prototype.unicode in String.prototype.{match,replace}.

and Part 9

  * support RegExp.prototype.unicode in String.prototype.{match,replace}
    * get .unicode property
    * call AdvanceStringIndex
Attachment #8600881 - Flags: review?(till)
As suggested in IRC, changing needinfo.
jorendorff, can I get some feedback for the conversion rule in comment #2 ?
Flags: needinfo?(jdemooij) → needinfo?(jorendorff)
(Reporter)

Comment 8

2 years ago
Comment on attachment 8600881 [details] [diff] [review]
Part 9: Use RegExp.prototype.unicode in String.prototype.{match,replace}.

Review of attachment 8600881 [details] [diff] [review]:
-----------------------------------------------------------------

Very nice.

::: js/src/jsstr.cpp
@@ +2269,5 @@
>  
> +/* ES6 final draft 21.2.5.2.3. */
> +static size_t
> +AdvanceStringIndex(HandleLinearString input, size_t length, size_t index, bool unicode)
> +{

Nit: add a comment "Steps 1-3 (implicit)."

@@ +2274,5 @@
> +    /* Step 4. */
> +    if (!unicode)
> +        return index + 1;
> +
> +    /* If S is latin1, there is no surrogate pair. */

nit: I think it'd be even clearer to just combine this with the `!unicode` test above: `if (!unicode || input->hasLatin1Chars)`. In a way, the `hasLatin1Chars` tests the same thing.
Attachment #8600881 - Flags: review?(till) → review+
(Reporter)

Comment 9

2 years ago
Comment on attachment 8600880 [details] [diff] [review]
Part 1: Implement RegExp unicode flag.

Review of attachment 8600880 [details] [diff] [review]:
-----------------------------------------------------------------

Excellent. r=me, but please wait for bhackett's response to the needinfo I'll request in a minute before landing.

::: js/src/builtin/RegExp.cpp
@@ +551,5 @@
> +MOZ_ALWAYS_INLINE bool
> +regexp_unicode_impl(JSContext* cx, CallArgs args)
> +{
> +    MOZ_ASSERT(IsRegExpObject(args.thisv()));
> +    Rooted<RegExpObject*> reObj(cx, &args.thisv().toObject().as<RegExpObject>());

It doesn't hurt much, but you don't actually need to root this: there's nothing going on in the rest of the function that could trigger a GC. Hence, you could just inline the flag access:
args.rval().setBoolean(args.thisv().toObject().as<RegExpObject>().unicode);

Up to you though.

::: js/src/vm/RegExpObject.h
@@ +362,1 @@
>      static const unsigned PRIVATE_SLOT = 7;

I don't understand why you don't also need to bump PRIVATE_SLOT. Maybe you do? I don't understand why PRIVATE_SLOT is 7 instead of 6 to begin with, though, so I'm not sure. Will needinfo bhackett for an explanation.
Attachment #8600880 - Flags: review?(till) → review+
(Reporter)

Comment 10

2 years ago
Brian, in bug 1066828 you introduced PRIVATE_SLOT in RegExpObject.h. Can you explain why it isn't the slot after the last reserved slot, i.e., 6? And, more importantly, should it be bumped along with RESERVED_SLOTS?
Flags: needinfo?(bhackett1024)
Created attachment 8604092 [details] [diff] [review]
Part 9-followup: Use RegExp unicode flag in String.prototype.split.

Thank you for reviewing :)

I noticed that unicode flag is also used in String.prototype.split.
There actually RegExp.prototype.flags is used, but now we're getting RegExpObject's property directly, so no need to use flags, so just using unicode as well (and it will be fixed again in bug 887016).

Also, the summary for Part 9 was a bit misleading, since we're not calling RegExp.prototype.unicode accessor. so I'll change it to "RegExp unicode flag", like this patch (I'll merge them).

Green on try run: https://treeherder.mozilla.org/#/jobs?repo=try&revision=f6e1e07c7c70
Attachment #8604092 - Flags: review?(till)
(In reply to Till Schneidereit (on vacation until May 26) from comment #10)
> Brian, in bug 1066828 you introduced PRIVATE_SLOT in RegExpObject.h. Can you
> explain why it isn't the slot after the last reserved slot, i.e., 6? And,
> more importantly, should it be bumped along with RESERVED_SLOTS?

RegExpObjects store their RegExpShared via their private pointer, i.e. JSObject::getPrivate.  Conceptually the private pointer has separate storage from the object's reserved slots, but in practice we use the last 64 bit word inline with the object to store this.  This is the memory which would otherwise be used by the object's last fixed slot, so an object with a private pointer has one less fixed slot than it otherwise would.  This detail can be ignored in most of the engine, but bug 1066828 changed things so that the JIT needs to access the RegExpShared, which requires it to know the exact offset in a RegExpObject to find it.  PRIVATE_SLOT indicates the fixed slot address to use to access this private pointer.
Flags: needinfo?(bhackett1024)
so, RegExpObject has 8 slots (AllocKind::OBJECT8) both before and after this patch, and `PRIVATE_SLOT = 7` is the last slot which is not used in both case, and we don't need to bump it (to 11?), am I correct?
(In reply to Tooru Fujisawa [:arai] from comment #13)
> so, RegExpObject has 8 slots (AllocKind::OBJECT8) both before and after this
> patch, and `PRIVATE_SLOT = 7` is the last slot which is not used in both
> case, and we don't need to bump it (to 11?), am I correct?

Yes.
(Reporter)

Comment 15

2 years ago
Comment on attachment 8604092 [details] [diff] [review]
Part 9-followup: Use RegExp unicode flag in String.prototype.split.

Review of attachment 8604092 [details] [diff] [review]:
-----------------------------------------------------------------

Great catch!
Attachment #8604092 - Flags: review?(till) → review+
A question related to surrogate pair and index, in following code:

1.
  var r = /\uDC38/ug; r.lastIndex = 1; r.exec("\uD83D\uDC38")
2.
  var r = /./ug; r.lastIndex = 1; r.exec("\uD83D\uDC38X")

lastIndex points the trail surrogate in the input string.

both es6draft and my WIP patch return null for 1st case, and ["X"] for 2nd case.
but I'm not sure where this behavior is specified in the ES6 spec.

I think this is somehow handled in ES6 21.2.2.2 [1] step 2.2.
> Let listIndex be the index into Input of the character that was obtained from element index of str.

but what could be the value of listIndex here if index points the trail surrogate in surrogate pair?  how does the Pattern skip "\uDC38"? and how does it reach "X" in 2nd case?

:anba, do you have any idea?

[1] http://www.ecma-international.org/ecma-262/6.0/#sec-pattern
Flags: needinfo?(andrebargull)

Comment 17

a year ago
`var r = /\uDC38/ug; r.lastIndex = 1; r.exec("\uD83D\uDC38")` should probably return `["\uDC38"]` instead of `null`. Unfortunately we didn't cover this case when unpaired surrogates were discussed the last time (https://mail.mozilla.org/pipermail/es-discuss/2015-January/041281.html). 21.2.2.2 definitely needs to be fixed to handle this. The email notification service for bugs.ecmascript.org is currently defunct (bug 1211522), therefore I suggest to use https://github.com/tc39/ecma262/issues for filing this issue.
Flags: needinfo?(andrebargull)
Thank you :D
Filed as https://github.com/tc39/ecma262/issues/128

Comment 19

a year ago
this actually is covered by the spec. language in 21.2.2.2.

In step 2.1 str is "\uD83D\uDC38" and Input becomes the single element List whose sole element is U+1f438.

In step 2.2 index is 1. Element 1 of str contributed to the creation of element 0 of Input. (in other words Input[0] was obtained from str[0] and str[1]). So, listIndex will be 0 since Input[0] was obtained from string[1].

When matching is actually performed, no match is found because the single element of Input (U+1f438) does not match the pattern character U+0dc38.

The spec. language probably needs to be clearer about what it means by "obtained".

There is an actual bug in 21.2.2.2.  Step 2.2 does not say what should be done if  index >= the length of str. In that case, lastIndex should be set to the same value as InputLength.

Comment 20

a year ago
(In reply to allen from comment #19)

>      In that case, lastIndex should be set
> to the same value as InputLength.

oops, should be

In that case, listIndex should be set to the same value as InputLength.
Thanks :)

looks like we should move index back to lead surrogate for unicode pattern if index points trails surrogate and it has corresponding lead surrogate, maybe in ExecuteRegExpImpl or somewhere else.
With our implementation, returning the actual match index as match object "index" property is simpler.
So, following code prints 0, that does not match to ES6.

  var r = /\uD83D\uDC38/ug;
  r.lastIndex = 1;
  var str = "\uD83D\uDC38";
  var result = r.exec(str);
  print(result.index);

current WIP patches are attached to bug 887016 comment #47 (that is based on this bug).
anba, can i get feedback on unicode RegExp pattern conversion rule in comment #2?
Flags: needinfo?(jorendorff) → needinfo?(andrebargull)

Comment 24

a year ago
I'm not quite sure conversion works, but let's start with some missing cases.

- Character range from BMP to non-BMP, e.g. [\0-\u{10ffff}]
- BMP characters in ignoreCase pattern
  - Differences between Unicode case folding and ECMAScript Canonicalize mapping through String.prototype.toUpperCase
    - /\u00df/i.test("\u1e9e") == false, but /\u00df/iu.test("\u1e9e") == true
    - /s/i.test("\u017f") == false, but /s/iu.test("\u017f") == true

And now the annoying cases which may break the conversion approach:

- How should /[\\udbff][\\udfff]/u get handled?
    - /[\udbff][\udfff]/.test("\udbff\udfff") == true, but /[\udbff][\udfff]/u.test("\udbff\udfff") == false

- How should /foo(.+)bar\1/u get handled? (https://mail.mozilla.org/pipermail/es-discuss/2015-January/041281.html)
  - /foo(.+)bar\1/.test("foo\uD834bar\uD834\uDC00") == true, but /foo(.+)bar\1/u.test("foo\uD834bar\uD834\uDC00") == false


And not related to the conversion approach: The RegExp parser needs to be adjusted to follow the stricter Unicode-RegExp rules, e.g. /\a/u needs throw a SyntaxError.
Flags: needinfo?(andrebargull)
Thank you for pointing out those missing cases :)

(In reply to André Bargull from comment #24)
> - Character range from BMP to non-BMP, e.g. [\0-\u{10ffff}]

This is converted into following at first:
  (?:[\0-\uD7FF\uE000-\uFFFF]|[\uD800-\uDBFF]|[\uDC00-\uDFFF]|[\u{10000}-\u{10ffff}])

so, BMP except surrogate pair, lead surrogate, trail surrogate, and non-BMP classes for each.
and then, conversion rules are applied to latter 3 classes.

Two more missing cases are that CharacterClass with multiple lead/trail surrogate characters.
Those are converted almost same way as rules (F) and (G):

  [\uD800-\uDBFF] => (?:[\uD800-\uDBFF](?![\uDC00-\uDFFF]))

  [\uDC00-\uDFFF] => (?:(?<![\uD800-\uDBFF])[\uDC00-\uDFFF])

These cases are actually handled in WIP patches, just forgot to note in the comment.


> - BMP characters in ignoreCase pattern
>   - Differences between Unicode case folding and ECMAScript Canonicalize
> mapping through String.prototype.toUpperCase
>     - /\u00df/i.test("\u1e9e") == false, but /\u00df/iu.test("\u1e9e") ==
> true
>     - /s/i.test("\u017f") == false, but /s/iu.test("\u017f") == true

It's handled inside regexp engine after pattern conversion.  Character with ignoreCase is expanded to case independent characters in following function:
  https://dxr.mozilla.org/mozilla-central/rev/099f695d31326c39595264c34988a0f4b7cbc698/js/src/irregexp/RegExpEngine.cpp#191

I'll add another |choices| for unicode pattern, so, all case foldings for given character, instead of using ToLowerCase/ToUpperCase.


> And now the annoying cases which may break the conversion approach:
> 
> - How should /[\\udbff][\\udfff]/u get handled?
>     - /[\udbff][\udfff]/.test("\udbff\udfff") == true, but
> /[\udbff][\udfff]/u.test("\udbff\udfff") == false

As noted above, [\uDBFF] is converted into (?:[\uDBFF](?![\uDC00-\uDFFF])), so it does not match the 1st character, because it has trail surrogate after it.
in same way, [\uDFFF] is converted into (?:(?<![\uD800-\uDBFF])[\uDFFF]), so it does not match 2nd character, because it has lead surrogate before it.


> - How should /foo(.+)bar\1/u get handled?
> (https://mail.mozilla.org/pipermail/es-discuss/2015-January/041281.html)
>   - /foo(.+)bar\1/.test("foo\uD834bar\uD834\uDC00") == true, but
> /foo(.+)bar\1/u.test("foo\uD834bar\uD834\uDC00") == false

Oh, this case is really not handled (WIP patch returns true!) .

This may need dedicated code in RegExp engine.  maybe, like following:
  * if the match string's last character is lead surrogate, next character should not be trail surrogate
    this rule should make /foo(.+)bar\1/u.test("foo\uD834bar\uD834\uDC00") return false,
    because 2nd \uD834 has trail surrogate after it

I think the case "the match string's 1st character is trail surrogate" won't matter, as the pattern before the back reference doesn't match the surrogate pair partially.


> And not related to the conversion approach: The RegExp parser needs to be
> adjusted to follow the stricter Unicode-RegExp rules, e.g. /\a/u needs throw
> a SyntaxError.

Yeah, I'll apply stricter rule for RegExp with unicode flag :)
oh, the back reference's case could be handled more easily, by just adding (?![\uDC00-\uDFFF]) after all back reference.

Comment 27

a year ago
The back reference issue needs to be handled in the engine, for example there's also
  /^(.+)\1$/.test("\uDC00foobar\uD834\uDC00foobar\uD834") == true, but /^(.+)\1$/u.test("\uDC00foobar\uD834\uDC00foobar\uD834") == false

(Re: /[\udbff][\udfff]/ -> I confused myself when going through my Unicode RegExp tests. Your proposed conversion handles that pattern correctly.)
Thanks again :)

(In reply to André Bargull from comment #27)
> The back reference issue needs to be handled in the engine, for example
> there's also
>   /^(.+)\1$/.test("\uDC00foobar\uD834\uDC00foobar\uD834") == true, but
> /^(.+)\1$/u.test("\uDC00foobar\uD834\uDC00foobar\uD834") == false

(.+) is converted into following pattern:

  ((?:[\u0000-\u0009\u000B-\u000C\u000E-\u2027\u202A-\uD7FF\uE000-\uFFFF]|[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?<![\uD800-\uDBFF])[\uDC00-\uDFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF])+)

so, it does not match to 1st "\uDC00foobar\uD834".
(In reply to Tooru Fujisawa [:arai] from comment #26)
> oh, the back reference's case could be handled more easily, by just adding
> (?![\uDC00-\uDFFF]) after all back reference.

actually, this was wrong.
/foo(.+)bar\1/u.exec("foo\uDC00bar\uDC00\uDC00") should be true.

So, as you wrote, it needs to be handled in the engine :)
the (?![\uDC00-\uDFFF]) should be applied only if the last character is lead surrogate.
Here's possible solution in engine.

Just like pseudo-assertion for (?<![\uD800-\uDBFF]), we could add yet another pseudo-assertion that checks "next character is not a trail surrogate that has corresponding lead surrogate".  and place the assertion after the back reference.
Created attachment 8692322 [details]
WIP patches

here's WIP patches

In addition to comment #3
  Part 10: Decrement index when it points trail surrogate that has corresponding lead surrogate.
    addresses comment #16-#22
    match starts from lead surrogate, and match object's "index" property also points lead surrogate

  Part 11: Support back reference with unicode flag.
    Added NOT_IN_SURROGATE_PAIR assertion, and place it after back reference in unicode pattern
Created attachment 8693178 [details] [diff] [review]
Part 2: Parse RegExp unicode character in non-CharacterClass.

now I guess it would be better to post each patch with more details, to discuss about the possibility and overlooked cases.

Actually, the pattern conversion is not done in "string pattern to string pattern" way.  Instead of that, RegExp parser generates a RegExpTree that looks like the input pattern string is converted with those rules.  So, most of those conversion is done inside RegExpParser.


[Part 2]

corresponds to rules (A), (F), and (G)

  * adds NOT_AFTER_LEAD_SURROGATE assertion, that checks "previous character is not lead surrogate",
    that means "current character cannot be a part of surrogate pair"

  * adds some constants and functions in Unicode.h, to handle surrogate pair and non-BMP code point

  * supports unicode code points in non-CharacterClass
    * \u{XXXXXX}
        \u{1f438}  =>  (?:\uD83D\uDC38)

    * \uXXXX\uXXXX with surrogate pair
        \uD83D\uDC38  =>  (?:\uD83D\uDC38)

    * raw unicode surrogate pair, in |new RegExp(...)| or eval
        "\uD83D\uDC38"  =>  (?:\uD83D\uDC38)

    * standalone lead surrogate
        \uD83D  =>  (?:\uD83D(?![\uDC00-\uDFFF]))

    * standalone trail surrogate
        \uDC38  =>  (?:(?<![\uD800-\uDBFF])\uDC38)
      where (?<![\uD800-\uDBFF]) is NOT_AFTER_LEAD_SURROGATE assertion
Created attachment 8693179 [details] [diff] [review]
Part 3: Parse RegExp unicode character in CharacterClass.

[Part 3] (similar to Part 2)

corresponds to rules (B)-(H).

  * supports unicode code points in in CharacterClass
    * \u{XXXXXX}
        [\u{1F428}-\u{1F438}]  =>  (?:\uD83D[\uDC28-\uDC38])
        [\u{1F328}-\u{1F438}]  =>  (?:\uD83C[\uDF28-\uDFFF]|\uD83D[\uDC00-\uDC38])
        [\u{1E028}-\u{1F438}]  =>  (?:\uD838[\uDC28-\uDFFF]|[\uD839-\uD83C][\uDC00-\uDFFF]|\uD83D[\uDC00-\uDC38])

    * \uXXXX\uXXXX with surrogate pair
        [\uD83D\uDC28-\uD83D\uDC38]  =>  (?:\uD83D[\uDC28-\uDC38])

    * raw unicode surrogate pair, in |new RegExp(...)| or eval
        "[\uD83D\uDC28-\uD83D\uDC38]"  =>  (?:\uD83D[\uDC28-\uDC38])

    * standalone lead surrogate
        [\uD83D-\uD83F]  =>  (?:[\uD83D-\uD83F](?![\uDC00-\uDFFF]))

    * standalone trail surrogate
        [\uDC28-\uDC38]  =>  (?:(?<![\uD800-\uDBFF])[\uDC28-\uDC38])

  * splits CharacterClass with a mix of non-surrogate, surrogate pair, non-BMP
    [A-C\uD83D-\uD83F\uDC28-\uDC38\u{1F428}-\u{1F438}]
      => (?:[A-C]|[\uD83D-\uD83F]|[\uDC28-\uDC38]|[\u{1F428}-\u{1F438}])
    and apply above rules
      => (?:[A-C]|[\uD83D-\uD83F][\uDC28-\uDC38]|(?<![\uD800-\uDBFF])[\uDC28-\uDC38]|\uD83D[\uDC28-\uDC38])

  * disallows range with CharacterClassEscape
      /[\w-\W]/u  =>  SyntaxError

  * now RegExpParser::ParseClassCharacterEscape is fallible, as pattern starts with "\u" can throw SyntaxError if it does not match to \u{XXXXX} nor \uXXXX, in unicode pattern
    (previously that returned "u" character)

  * adds some functions to handle unicode ranges (add, sub, negate)
    * unicode ranges is separated into 4 ranges
      * BMP, non-surrogate
      * lead surrogate
      * trail surrogate
      * non-BMP
    * CharacterClass with mixed range is separated into above 4 ranges
      and conversion rules are applied for each range
Created attachment 8693180 [details] [diff] [review]
Part 4: Support everything Atom in RegExp with unicode flag.

[Part 4]

corresponds to rule (I).

  * supports "."
    * everything except \x0a, \x0d, \u2028 and \u2029

    * to match to full surrogate pair characters,
      and not to match to lead or trail surrogate pair character partially

      .  =>  (?:[\u0000-\u0009\u000B-\u000C\u000E-\u2027\u202A-\uD7FF\uE000-\uFFFF]|[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?<![\uD800-\uDBFF])[\uDC00-\uDFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF])
Created attachment 8693181 [details] [diff] [review]
Part 5: Support CharacterClassEscape in RegExp with unicode flag.

[Part 5] (similar to Part 4)

corresponds to rule (J).

  * supports \S, \W, and \D
    * to match to full surrogate pair characters,
      and not to match to lead or trail surrogate pair character partially

      \S => (?:[\u0000-\u0008\u000E-\u001F\u0021-\u009F\u00A1-\u167F\u1681-\u180D\u180F-\u1FFF\u200B-\u2027\u202A-\u202E\u2030-\u205E\u2060-\u2FFF\u3001-\uD7FF\uE000-\uFEFE\uFF00-\uFFFF]|[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?<![\uD800-\uDBFF])[\uDC00-\uDFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF])
Created attachment 8693182 [details] [diff] [review]
Part 6: Support ignoreCase for BMP in RegExp with unicode flag.

[Part 6]

  * imports CaseFolding-6.2.0.txt (is it correct version?) to m-c tree
    * adds script to convert it into Unicode.h and tests

  * adds BC_CHECK_NOT_BACK_REF_NO_CASE_UNICODE opcode for interpreter
    * almost same as BC_CHECK_NOT_BACK_REF_NO_CASE, but checks the back reference
      with case folding insensitive

  * supports ignoreCase for BMP
    * it uses case folding/reverse case folding instead of lowercase/uppercase

    * this is done inside engine, not pattern conversion

  * case folding information is stored in FoldingInfo class in Unicode.h
    that contains following, for certain character C:
      * folding = toCasefold(C) - C
      * reverse1 = R1 - C
          that satisfies |C == toCasefold(R1)|
      * reverse2 = R2 - C
          that satisfies |C == toCasefold(R2)|
      * reverse3 = R3 - C
          that satisfies |C == toCasefold(R3)|
    the number of reverse mappings (Rn) are up to 3, for now
Created attachment 8693183 [details] [diff] [review]
Part 7: Support ignoreCase for non-BMP in RegExp with unicode flag.

[Part 7] (similar to Part 6)

corresponds to rule (K).

  * supports ignoreCase for non-BMP
    * converts non-BMP case foldable characters into CharacterClass

        /\u{10401}/iu => /[\u{10401}\u{10429}]/iu

    * cases folding information is stored as macro that has following informations

        * code point range from and to
        * lead surrogate (common)
        * trail surrogate from and to
        * difference to get folded character

      there both code point and surrogate pair are stored as immediate value
      to make consumer of the macro simple

      e.g.

        (0x10400, 0x10427, 0xD801, 0xDC00, 0xDC27, 0x28)

        * code point range is [0x10400-0x10427]
        * case folded code point can be obtained by adding 0x28 to code point

        * with surrogate pair, the range is 0xD801[0xDC00-0xDC27]
        * in same way, case folded code point can be obtained by adding 0x28 to trail surrogate

    * currently only 2 ranges for Deseret (U+10400 - U+1044F) are defined
Created attachment 8693184 [details] [diff] [review]
Part 8: Disallow extended pattern in RegExp with unicode flag.

[Part 8]

  * disallows remaining extended pattern, and throw SyntaxError for them
    * identity escape with non-SyntaxCharacter
        /\a/u  =>  SyntaxError
        /\c/u  =>  SyntaxError
        /\xHH/u  =>  SyntaxError

    * LegacyOctalEscapeSequence, back reference without matching group
        /\1/u, /\01/u  =>  SyntaxError

    * QuantifiableAssertion
        /(?:a){1}/u  =>  SyntaxError
        /(?!a){1}/u  =>  SyntaxError

    * standalone brace
        /{/u  =>  SyntaxError
        /}/u  =>  SyntaxError
Created attachment 8693185 [details] [diff] [review]
Part 10: Decrement index when it points trail surrogate that has corresponding lead surrogate.

[Part 10]

corresponds to comment #16-#22

  * adjusts the index that match starts to lead surrogate when it points trail surrogate that has corresponds lead surrogate

      var r = /\uD83D\uDC38/ug;
      r.lastIndex = 1;
      var str = "\uD83D\uDC38";
      var result = r.exec(str);    // match starts from \uD83D, not \uDC38
      print(result.index);

  * match object's "index" property also points lead surrogate in that case
    this does not match to ES6 spec, but the spec will be changed

      var r = /\uD83D\uDC38/ug;
      r.lastIndex = 1;
      var str = "\uD83D\uDC38";
      var result = r.exec(str);
      print(result.index);         // prints 0, the index for lead surrogate

  * this is not done in JIT, as for now lastIndex should always be 0 in JIT.
    after supporting /g and /y flags in JIT, this code is also added there (bug 1207922, as a part of bug 887016)
Created attachment 8693186 [details] [diff] [review]
Part 11: Support back reference with unicode flag.

[Part 11]

  * adds NOT_IN_SURROGATE_PAIR assertion that checks "the current character is not a trail surrogate that has corresponds lead surrogate"

  * supports back reference
    * places NOT_IN_SURROGATE_PAIR assertion after all back references in unicode pattern
      to avoid matching surrogate pair partially with back reference

      (\uD83D)\1  =>  (?:\uD83D(?![\uDC00-\uDFFF]))\1<NOT_IN_SURROGATE_PAIR>

      so this pattern does not match to "\uD83D\uD83D\uDC38"

    * currently this assertion is placed regardless of referred group
Created attachment 8693187 [details] [diff] [review]
Part 1: Implement RegExp unicode flag. r=till

just rebased
Attachment #8600880 - Attachment is obsolete: true
Attachment #8693187 - Flags: review+
Created attachment 8693191 [details] [diff] [review]
Part 9: Use RegExp unicode flag in String.prototype.{match,replace,split}. r=till

rebased and merged (this part will be fixed again in bug 887016)

so, that's all for now.

if you find any overlooked cases or wrong conversion, please tell me :)
Attachment #8600881 - Attachment is obsolete: true
Attachment #8604092 - Attachment is obsolete: true
Attachment #8692322 - Attachment is obsolete: true
Attachment #8693191 - Flags: review+
Created attachment 8693539 [details] [diff] [review]
Part 11: Support back reference with unicode flag.

Forgot to add following test case.
assertEq(/^(.+)\1$/u.exec("\uDC00foobar\uD834\uDC00foobar\uD834"), null);
Attachment #8693186 - Attachment is obsolete: true
:anba, do you know any other complicated case in unicode pattern?

also, I'd appreciate if you could give some feedback on above each patch's rules
Flags: needinfo?(andrebargull)

Comment 45

a year ago
I think I'm convinced the proposed rules work out to implement Unicode RegExps. Great work! :-D

I've also run my Unicode RegExp tests [1] and found a couple of issues.

[1] The tests from https://github.com/anba/es6draft/tree/master/src/test/scripts/suite/regexp, slightly changed to run in SpiderMonkey.
Test library files: https://gist.github.com/anba/d6bf0db9951d0cc8cb02
Actual test files https://gist.github.com/anba/b85b5d0ca4e91dc1ebbc
To run a single test file: ./build-debug/dist/bin/js -f /tmp/regexp/index.js /tmp/regexp/ignorecase_unicode.js


/[^A]/iu.test("A")
/[^a]/iu.test("A")
/[]/u.test("A")
Expected: false
Actual: true

/\W/iu.test("S")
Expected: true
Actual: false
Note: https://bugs.ecmascript.org/show_bug.cgi?id=3145

/\u017f/iu.test("S")
Expected: true
Actual: false
Note: 21.2.2.8.2 Runtime Semantics: Canonicalize ( ch ), Note 4

/\//u
Expected: No Error
Actual: Throws SyntaxError
Note: 21.2.1, IdentityEscape[U] :: [+U] /

/[\0]/u
Expected: No Error
Actual: Throws SyntaxError
Note: 21.2.1, ClassEscape[U] :: DecimalEscape and 21.2.2.19 ClassEscape :: DecimalEscape, step 2. 

/[\c_]/u
Expected: Throws SyntaxError
Actual: No SyntaxError
Note: Non-standard extended control letters 0-9 and _ should not be applied for Unicode RegExps.

/]/u
Expected: Throws SyntaxError
Actual: No SyntaxError
Note: 21.2.1 PatternCharacter :: SourceCharacter but not SyntaxCharacter


js> /(\udbff)\1?/u.test("\udbff")
Assertion failure: last_added_ == ADD_ATOM, at /home/andre/hg/mozilla-central/js/src/irregexp/RegExpParser.cpp:185

#0  0x0000000000ed377a in js::irregexp::RegExpBuilder::AddQuantifierToAtom (this=0x7ffff69dc058, min=0, max=1, 
    quantifier_type=js::irregexp::RegExpQuantifier::GREEDY) at /home/andre/hg/mozilla-central/js/src/irregexp/RegExpParser.cpp:185
#1  0x0000000000edcf36 in js::irregexp::RegExpParser<unsigned char>::ParseDisjunction (this=0x7fffffffbe70)
    at /home/andre/hg/mozilla-central/js/src/irregexp/RegExpParser.cpp:1760
#2  0x0000000000edb9ca in js::irregexp::RegExpParser<unsigned char>::ParsePattern (this=0x7fffffffbe70)
    at /home/andre/hg/mozilla-central/js/src/irregexp/RegExpParser.cpp:1242
#3  0x0000000000ed731c in ParsePatternSyntax<unsigned char> (ts=..., alloc=..., chars=0x7ffff7e7aee8 "(\\udbff)\\1?", length=11, unicode=true)
    at /home/andre/hg/mozilla-central/js/src/irregexp/RegExpParser.cpp:1826
#4  0x0000000000ed5248 in js::irregexp::ParsePatternSyntax (ts=..., alloc=..., str=0x7ffff7e7aee0, unicode=true)
    at /home/andre/hg/mozilla-central/js/src/irregexp/RegExpParser.cpp:1835
...
Flags: needinfo?(andrebargull)
Thank you for thorough testing! :D

Looks like the case sensitivity bugs are issues are both in parser (range calculation) and engine (test whether the character has case or not, so applying case folding or not)

SyntaxError things are just forgotten parts.  the assertion failure is that I need to wrap the back reference and pseudo-assertion with a group.

Will fix them shortly :)
Created attachment 8698066 [details] [diff] [review]
Part 2: Parse RegExp unicode character in non-CharacterClass.

Finally passed :anba's all tests :D

[Part 2]
(almost no change from previous patch series)

corresponds to rules (A), (F), and (G)

  * adds NOT_AFTER_LEAD_SURROGATE assertion, that checks "previous character is not lead surrogate",
    that means "current character cannot be a part of surrogate pair"

  * adds some constants and functions in Unicode.h, to handle surrogate pair and non-BMP code point

  * supports unicode code points in non-CharacterClass
    * \u{XXXXXX}
        \u{1f438}  =>  (?:\uD83D\uDC38)

    * \uXXXX\uXXXX with surrogate pair
        \uD83D\uDC38  =>  (?:\uD83D\uDC38)

    * raw unicode surrogate pair, in |new RegExp(...)| or eval
        "\uD83D\uDC38"  =>  (?:\uD83D\uDC38)

    * standalone lead surrogate
        \uD83D  =>  (?:\uD83D(?![\uDC00-\uDFFF]))

    * standalone trail surrogate
        \uDC38  =>  (?:(?<![\uD800-\uDBFF])\uDC38)
      where (?<![\uD800-\uDBFF]) is NOT_AFTER_LEAD_SURROGATE assertion
Attachment #8693178 - Attachment is obsolete: true
Attachment #8698066 - Flags: review?(till)
Created attachment 8698067 [details] [diff] [review]
Part 3: Parse RegExp unicode character in CharacterClass.

[Part 3] (similar to Part 2)
(almost no change from previous patch series)

corresponds to rules (B)-(H).

  * supports unicode code points in in CharacterClass
    * \u{XXXXXX}
        [\u{1F428}-\u{1F438}]  =>  (?:\uD83D[\uDC28-\uDC38])
        [\u{1F328}-\u{1F438}]  =>  (?:\uD83C[\uDF28-\uDFFF]|\uD83D[\uDC00-\uDC38])
        [\u{1E028}-\u{1F438}]  =>  (?:\uD838[\uDC28-\uDFFF]|[\uD839-\uD83C][\uDC00-\uDFFF]|\uD83D[\uDC00-\uDC38])

    * \uXXXX\uXXXX with surrogate pair
        [\uD83D\uDC28-\uD83D\uDC38]  =>  (?:\uD83D[\uDC28-\uDC38])

    * raw unicode surrogate pair, in |new RegExp(...)| or eval
        "[\uD83D\uDC28-\uD83D\uDC38]"  =>  (?:\uD83D[\uDC28-\uDC38])

    * standalone lead surrogate
        [\uD83D-\uD83F]  =>  (?:[\uD83D-\uD83F](?![\uDC00-\uDFFF]))

    * standalone trail surrogate
        [\uDC28-\uDC38]  =>  (?:(?<![\uD800-\uDBFF])[\uDC28-\uDC38])

  * splits CharacterClass with a mix of non-surrogate, surrogate pair, non-BMP
    [A-C\uD83D-\uD83F\uDC28-\uDC38\u{1F428}-\u{1F438}]
      => (?:[A-C]|[\uD83D-\uD83F]|[\uDC28-\uDC38]|[\u{1F428}-\u{1F438}])
    and apply above rules
      => (?:[A-C]|[\uD83D-\uD83F][\uDC28-\uDC38]|(?<![\uD800-\uDBFF])[\uDC28-\uDC38]|\uD83D[\uDC28-\uDC38])

  * disallows range with CharacterClassEscape
      /[\w-\W]/u  =>  SyntaxError

  * now RegExpParser::ParseClassCharacterEscape is fallible, as pattern starts with "\u" can throw SyntaxError if it does not match to \u{XXXXX} nor \uXXXX, in unicode pattern
    (previously that returned "u" character)

  * adds some functions to handle unicode ranges (add, sub, negate)
    * unicode ranges is separated into 4 ranges
      * BMP, non-surrogate
      * lead surrogate
      * trail surrogate
      * non-BMP
    * CharacterClass with mixed range is separated into above 4 ranges
      and conversion rules are applied for each range
Attachment #8693179 - Attachment is obsolete: true
Attachment #8698067 - Flags: review?(till)
Created attachment 8698068 [details] [diff] [review]
Part 4: Support everything Atom in RegExp with unicode flag.

[Part 4]
(almost no change from previous patch series)

corresponds to rule (I).

  * supports "."
    * everything except \x0a, \x0d, \u2028 and \u2029

    * to match to full surrogate pair characters,
      and not to match to lead or trail surrogate pair character partially

      .  =>  (?:[\u0000-\u0009\u000B-\u000C\u000E-\u2027\u202A-\uD7FF\uE000-\uFFFF]|[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?<![\uD800-\uDBFF])[\uDC00-\uDFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF])
Attachment #8693180 - Attachment is obsolete: true
Attachment #8698068 - Flags: review?(till)
Created attachment 8698069 [details] [diff] [review]
Part 5: Support CharacterClassEscape in RegExp with unicode flag.

[Part 5] (similar to Part 4)
(almost no change from previous patch series)

corresponds to rule (J).

  * supports \S, \W, and \D
    * to match to full surrogate pair characters,
      and not to match to lead or trail surrogate pair character partially

      \S => (?:[\u0000-\u0008\u000E-\u001F\u0021-\u009F\u00A1-\u167F\u1681-\u180D\u180F-\u1FFF\u200B-\u2027\u202A-\u202E\u2030-\u205E\u2060-\u2FFF\u3001-\uD7FF\uE000-\uFEFE\uFF00-\uFFFF]|[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?<![\uD800-\uDBFF])[\uDC00-\uDFFF]|[\uD800-\uDBFF][\uDC00-\uDFFF])
Attachment #8693181 - Attachment is obsolete: true
Attachment #8698069 - Flags: review?(till)
Created attachment 8698070 [details] [diff] [review]
Part 6: Support ignoreCase for BMP in RegExp with unicode flag.

[Part 6]
(many change from previous patch.)

  * imports CaseFolding-6.2.0.txt (is it correct version?) to m-c tree
    * adds script to convert it into Unicode.h and tests

  * adds BC_CHECK_NOT_BACK_REF_NO_CASE_UNICODE opcode for interpreter
    * almost same as BC_CHECK_NOT_BACK_REF_NO_CASE, but checks the back reference
      with case folding insensitive

  * supports ignoreCase for BMP
    * it uses case folding/reverse case folding instead of lowercase/uppercase

    * this is done inside engine, not pattern conversion

  * case folding information is stored in FoldingInfo class in Unicode.h
    that contains following, for certain character C:
      * folding = toCasefold(C) - C
      * reverse1 = R1 - C
          that satisfies |C == toCasefold(R1)|
                           or
                         |toCasefold(C) == toCasefold(R1)|
      * reverse2 = R2 - C
          same as R1
      * reverse3 = R3 - C
          same as R1
    the number of reverse mappings (Rn) are up to 3, for now

  * added 2 ranges for \w and \W
      's' and 'k' (and their variants) are matched in both case

  * added some cases that non-ascii characters are mapped to ascii (latin1 ?)
    added unicode parameter to FilterASCII and FilterSuccessor to avoid
    filtering them
Attachment #8693182 - Attachment is obsolete: true
Attachment #8698070 - Flags: review?(till)
Created attachment 8698071 [details] [diff] [review]
Part 7: Support ignoreCase for non-BMP in RegExp with unicode flag.

[Part 7] (similar to Part 6)
(almost no change from previous patch series)

corresponds to rule (K).

  * supports ignoreCase for non-BMP
    * converts non-BMP case foldable characters into CharacterClass

        /\u{10401}/iu => /[\u{10401}\u{10429}]/iu

    * cases folding information is stored as macro in following format

        * code point range from and to
        * lead surrogate (common)
        * trail surrogate from and to
        * difference to get folded character

      there both code point and surrogate pair are stored as immediate value
      to make consumer of the macro simple

      e.g.

        (0x10400, 0x10427, 0xD801, 0xDC00, 0xDC27, 0x28)

        * code point range is [0x10400-0x10427]
        * case folded code point can be obtained by adding 0x28 to code point

        * with surrogate pair, the range is 0xD801[0xDC00-0xDC27]
        * in same way, case folded code point can be obtained by adding 0x28 to trail surrogate

    * currently only 2 ranges for Deseret (U+10400 - U+1044F) are defined
Attachment #8693183 - Attachment is obsolete: true
Attachment #8698071 - Flags: review?(till)
Created attachment 8698072 [details] [diff] [review]
Part 8: Disallow extended pattern in RegExp with unicode flag.

[Part 8]
(Fixed some rules from previous patch)

  * disallows remaining extended pattern, and throw SyntaxError for them
    * identity escape with non-SyntaxCharacter
        /\a/u  =>  SyntaxError
        /\c/u  =>  SyntaxError
        /\xHH/u  =>  SyntaxError

    * LegacyOctalEscapeSequence, back reference without matching group
        /\1/u, /\01/u  =>  SyntaxError

    * QuantifiableAssertion
        /(?:a){1}/u  =>  SyntaxError
        /(?!a){1}/u  =>  SyntaxError

    * standalone brace
        /{/u  =>  SyntaxError
        /}/u  =>  SyntaxError
Attachment #8698072 - Flags: review?(till)
Created attachment 8698073 [details] [diff] [review]
Part 10: Decrement index when it points trail surrogate that has corresponding lead surrogate.

[Part 10]
(almost no change from previous patch series)

corresponds to comment #16-#22

  * adjusts the index that match starts to lead surrogate when it points trail surrogate that has corresponds lead surrogate

      var r = /\uD83D\uDC38/ug;
      r.lastIndex = 1;
      var str = "\uD83D\uDC38";
      var result = r.exec(str);    // match starts from \uD83D, not \uDC38
      print(result.index);

  * match object's "index" property also points lead surrogate in that case
    this does not match to ES6 spec, but the spec will be changed

      var r = /\uD83D\uDC38/ug;
      r.lastIndex = 1;
      var str = "\uD83D\uDC38";
      var result = r.exec(str);
      print(result.index);         // prints 0, the index for lead surrogate

  * this is not done in JIT, as for now lastIndex should always be 0 in JIT.
    after supporting /g and /y flags in JIT, this code is also added there (bug 1207922, as a part of bug 887016)
Attachment #8693184 - Attachment is obsolete: true
Attachment #8693185 - Attachment is obsolete: true
Attachment #8698073 - Flags: review?(till)
Created attachment 8698075 [details] [diff] [review]
Part 11: Support back reference with unicode flag.

[Part 11]

(Added group around backreference and assertion, from previous patch)

  * adds NOT_IN_SURROGATE_PAIR assertion that checks "the current character is not a trail surrogate that has corresponds lead surrogate"

  * supports back reference
    * places NOT_IN_SURROGATE_PAIR assertion after all back references in unicode pattern
      to avoid matching surrogate pair partially with back reference

      (\uD83D)\1  =>  (?:\uD83D(?![\uDC00-\uDFFF]))(?:\1<NOT_IN_SURROGATE_PAIR>)

      so this pattern does not match to "\uD83D\uD83D\uDC38"

    * currently this assertion is placed regardless of referred group


forgot to note first,  changes in Xdr.h are in just temporary format to make it easier to rebase, I'll fix them later.
Attachment #8693539 - Attachment is obsolete: true
Attachment #8698075 - Flags: review?(till)
(Reporter)

Comment 56

a year ago
Comment on attachment 8698068 [details] [diff] [review]
Part 4: Support everything Atom in RegExp with unicode flag.

Review of attachment 8698068 [details] [diff] [review]:
-----------------------------------------------------------------

r=me. I'm not really reviewing these patches in any particular order, so please accept my apologies for taking on the easy ones first :)

::: js/src/irregexp/RegExpParser.cpp
@@ +1212,5 @@
> +                                       unicode::TrailSurrogateMax));
> +
> +    builder->NewAlternative();
> +
> +    builder->AddAssertion(alloc->newInfallible<RegExpAssertion>(RegExpAssertion::NOT_AFTER_LEAD_SURROGATE));

Nit: more than 99 columns. This is annoying to wrap - I guess the best way is to wrap before RegExpAssertion::NOT_AFTER_LEAD_SURROGATE and indent that by 8 spaces.
Attachment #8698068 - Flags: review?(till) → review+
(Reporter)

Comment 57

a year ago
Comment on attachment 8698069 [details] [diff] [review]
Part 5: Support CharacterClassEscape in RegExp with unicode flag.

Review of attachment 8698069 [details] [diff] [review]:
-----------------------------------------------------------------

r=me. I didn't check the correctness of the code point values too closely: I'll take your and anba's word for it.
Attachment #8698069 - Flags: review?(till) → review+
(Reporter)

Comment 58

a year ago
Comment on attachment 8698066 [details] [diff] [review]
Part 2: Parse RegExp unicode character in non-CharacterClass.

Review of attachment 8698066 [details] [diff] [review]:
-----------------------------------------------------------------

r=me with feedback addressed.

Overall nit: please be consistent about the parameter naming. Existing parameters are underscore_named, so it'd be best to align with that. Applies to all following patches as well.

Also, please add f=anba to all patches before landing.

::: js/src/irregexp/RegExpEngine.cpp
@@ +2844,5 @@
>      assembler->Bind(&ok);
>      on_success->Emit(compiler, &new_trace);
>  }
>  
> +// Make sure next character cannot be a part of surrogate pair.

nit: s/part of/part of a/

::: js/src/irregexp/RegExpParser.cpp
@@ +392,5 @@
> +
> +static inline RegExpTree*
> +RangeAtom(LifoAlloc* alloc, char16_t from, char16_t to)
> +{
> +    CharacterRangeVector* ranges = alloc->newInfallible<CharacterRangeVector>(*alloc);

Nit: prevailing style in this file clearly favors less whitespace, so please remove the empty lines in this and the next few functions.

@@ +402,5 @@
> +
> +static inline RegExpTree*
> +NegativeLookahead(LifoAlloc* alloc, char16_t from, char16_t to)
> +{
> +    RegExpBuilder* builder = alloc->newInfallible<RegExpBuilder>(alloc);

I'm probably missing something, but shouldn't it be possible to pass in the caller's builder here and in the similar functions below and operate on it directly instead of creating new builders and turning their output into atoms? Seems like that'd be slightly more efficient. It also probably just doesn't matter, so only do this if it's not too much effort. (If it's doable at all, and I am not, as is likely, missing something.)

If it does work, I guess this function for example would become
AddNegativeLookahead(LifoAlloc* alloc, RegExpBuilder* builder, char16_t from, char_16_t to)

Or it would just become a method on RegExpBuilder.

::: js/src/js.msg
@@ +453,5 @@
>  MSG_DEF(JSMSG_NUMBERS_OUT_OF_ORDER,    0, JSEXN_SYNTAXERR, "numbers out of order in {} quantifier.")
>  MSG_DEF(JSMSG_TOO_MANY_PARENS,         0, JSEXN_INTERNALERR, "too many parentheses in regular expression")
>  MSG_DEF(JSMSG_UNMATCHED_RIGHT_PAREN,   0, JSEXN_SYNTAXERR, "unmatched ) in regular expression")
>  MSG_DEF(JSMSG_UNTERM_CLASS,            0, JSEXN_SYNTAXERR, "unterminated character class")
> +MSG_DEF(JSMSG_UNICODE_OVERFLOW,        0, JSEXN_SYNTAXERR, "unicode codepoint should not be greater than 0x10FFFF in regular expression")

Please move these messages up to the other "invalid foo" messages above.

::: js/src/vm/Unicode.h
@@ +242,5 @@
> +const size_t NonBMPMin = 0x10000;
> +const size_t NonBMPMax = 0x10FFFF;
> +
> +inline bool
> +IsLeadSurrogate(size_t value) {

Opening brace on new line here and for the functions below.

::: js/src/vm/Xdr.h
@@ +28,5 @@
>   * this wiki page:
>   *
>   *  https://developer.mozilla.org/en-US/docs/SpiderMonkey/Internals/Bytecode
>   */
> +static const uint32_t XDR_BYTECODE_VERSION_SUBTRAHEND = 325 + 1;

Please just increment these values instead of turning them into expressions. Here and in all other patches. (Ah, I just see your comment for part 11.)
(Reporter)

Comment 59

a year ago
Comment on attachment 8698067 [details] [diff] [review]
Part 3: Parse RegExp unicode character in CharacterClass.

Review of attachment 8698067 [details] [diff] [review]:
-----------------------------------------------------------------

Overall, this looks great. Canceling review because I'd like to see the results of the suggested changes in RegExpParser.cpp.

::: js/src/irregexp/RegExpParser.cpp
@@ +613,5 @@
> +                       CharacterRangeVector* ranges,
> +                       CharacterRangeVector* leadRanges,
> +                       CharacterRangeVector* trailRanges,
> +                       WideCharRangeVector* wideRanges,
> +                       char16_t char_class,

Or at the very least use camelCasing consistently :)

@@ +679,5 @@
> +// Subtract ranges2 from ranges1, and store the result into resultRanges.
> +// ranges1 will also be modified while calculating.
> +template <typename RangeType>
> +static inline Vector<RangeType, 1, LifoAllocPolicy<Infallible> >*
> +SubRanges(Vector<RangeType, 1, LifoAllocPolicy<Infallible> >* ranges1,

It took me a long time to understand this function (assuming I understand it now). I think a better signature might be something like

// Negate a vector of ranges by subtracting its ranges from a range
// encompassing the full range of possible values.
template <typename RangeType>
static inline void
NegateRanges(Vector<RangeType, 1, LifoAllocPolicy<Infallible> >** ranges, RangeType* fullRange)

The function would then allocate a tmpRange internally and modify `ranges` in-place.

@@ +730,5 @@
> +{
> +    CharacterRangeVector* tmpRanges = alloc->newInfallible<CharacterRangeVector>(*alloc);
> +    CharacterRangeVector* resultRanges = alloc->newInfallible<CharacterRangeVector>(*alloc);
> +    tmpRanges->append(LeadSurrogateRange());
> +    *leadRanges = SubRanges(tmpRanges, *leadRanges, resultRanges);

If the suggestion above works out, this would become

NegateRanges(leadRanges, LeadSurrogateRange());

In fact, it might make sense to remove NegateUnicodeRanges and call NegateRanges 3x in UnicodeRangesAtom (and perhaps rename NegateRanges to NegateUnicodeRanges).

@@ +751,5 @@
> +                  CharacterRangeVector* trailRanges,
> +                  WideCharRangeVector* wideRanges,
> +                  bool is_negated)
> +{
> +    if (is_negated) {

Nit: no brace required. (Though that'd change if the suggestions above work out, of course.)

@@ +888,5 @@
>                  // If we reach the end we break out of the loop and let the
>                  // following code report an error.
>                  break;
>              } else if (current() == ']') {
> +                if (unicode_)

Multiline body, so needs braces here and for the `else`.

::: js/src/irregexp/RegExpParser.h
@@ +184,5 @@
>      // Parses a {...,...} quantifier and stores the range in the given
>      // out parameters.
>      bool ParseIntervalQuantifier(int* min_out, int* max_out);
>  
> +    // Parses and stores a single escaped character.  The character

"Tries to parse the input as a single escaped character. If successful it stores the result in the output parameter and returns true. Otherwise it throws an error and returns false.
The character must not [..]"
Attachment #8698067 - Flags: review?(till)
(Reporter)

Comment 60

a year ago
Comment on attachment 8698070 [details] [diff] [review]
Part 6: Support ignoreCase for BMP in RegExp with unicode flag.

Review of attachment 8698070 [details] [diff] [review]:
-----------------------------------------------------------------

r=me with nits addressed, very nice!

Regarding the CaseFolding.txt version: if the standard doesn't prescribe a specific version, I'd say go with the latest one. (I'm on a train right now and can't check. If I remember to do so later, I'll update this comment.)

::: js/src/irregexp/RegExpEngine.cpp
@@ +253,5 @@
> +FOR_EACH_NON_ASCII_TO_ASCII_FOLDING(CHECK_RANGE)
> +#undef CHECK_RANGE
> +    }
> +
> +    /* TODO(dcarney): this could be a lot more efficient. */ \

This comment probably applies to the newly added checks, too, so please keep it at the top of the function.

::: js/src/vm/Unicode.h
@@ +259,5 @@
> +inline char16_t
> +FoldCase(char16_t ch)
> +{
> +    const FoldingInfo& info = CaseFoldInfo(ch);
> +

Nit: remove the blank line here and in the functions below.
Attachment #8698070 - Flags: review?(till) → review+
(Reporter)

Comment 61

a year ago
Comment on attachment 8698071 [details] [diff] [review]
Part 7: Support ignoreCase for non-BMP in RegExp with unicode flag.

Review of attachment 8698071 [details] [diff] [review]:
-----------------------------------------------------------------

Yep, nice.

::: js/src/irregexp/RegExpParser.cpp
@@ +761,5 @@
>      *wideRanges = SubRanges(tmpWideRanges, *wideRanges, resultWideRanges);
>  }
>  
> +static bool
> +WideCharRangesContains(WideCharRangeVector* wideRanges, widechar c)

nit: s/Contains/Contain/, as there are multiple ranges.
Attachment #8698071 - Flags: review?(till) → review+
(Reporter)

Comment 62

a year ago
Comment on attachment 8698072 [details] [diff] [review]
Part 8: Disallow extended pattern in RegExp with unicode flag.

Review of attachment 8698072 [details] [diff] [review]:
-----------------------------------------------------------------

r=me

::: js/src/irregexp/RegExpParser.cpp
@@ +489,5 @@
>        case 'c': {
>          widechar controlLetter = Next();
>          widechar letter = controlLetter & ~('A' ^ 'a');
>          // For compatibility with JSC, inside a character class
>          // we also accept digits and underscore as control characters.

Add ", but only in non-unicode mode" to the comment to reflect reality.

@@ +523,1 @@
>          // For compatibility, we interpret a decimal escape that isn't

"[..] outside of unicode mode, we interpret [..]"

@@ +539,2 @@
>          // If \x is not followed by a two-digit hexadecimal, treat it
>          // as an identity escape.

"[..] identity escape in non-unicode mode."

::: js/src/js.msg
@@ +460,5 @@
>  MSG_DEF(JSMSG_RANGE_WITH_CLASS_ESCAPE, 0, JSEXN_SYNTAXERR, "character class escape cannot be used in class range in regular expression")
> +MSG_DEF(JSMSG_RAW_BRACE_IN_REGEP,      0, JSEXN_SYNTAXERR, "raw brace is not allowed in regular expression with unicode flag")
> +MSG_DEF(JSMSG_RAW_BRACKET_IN_REGEP,    0, JSEXN_SYNTAXERR, "raw bracket is not allowed in regular expression with unicode flag")
> +MSG_DEF(JSMSG_BACK_REF_OUT_OF_RANGE,   0, JSEXN_SYNTAXERR, "back reference out of range in regular expression")
> +MSG_DEF(JSMSG_INVALID_DECIMAL_ESCAPE, 0, JSEXN_SYNTAXERR, "invalid decimal escape in regular expression")

nit: move this message up to the other "invalid" messages.
Attachment #8698072 - Flags: review?(till) → review+
(Reporter)

Comment 63

a year ago
Comment on attachment 8698073 [details] [diff] [review]
Part 10: Decrement index when it points trail surrogate that has corresponding lead surrogate.

Review of attachment 8698073 [details] [diff] [review]:
-----------------------------------------------------------------

r=me with feedback addressed.

::: js/src/builtin/RegExp.cpp
@@ +859,5 @@
>      }
>  
> +    /* Steps 12-13. */
> +    bool fullUnicode = reobj->unicode();
> +    if (fullUnicode) {

nit: just make this `if (reobj->unicode())` and remove the var.

@@ +877,5 @@
> +         *   var result = r.exec(str); // pattern match starts from index 0
> +         *   print(result.index);      // prints 0
> +         *
> +         * The value of |result.index| doesn't match to ES6, but the spec will
> +         * be changed.  See https://github.com/tc39/ecma262/issues/128

While this is a very nice comment, I think it'd be entirely sufficient to say something like

"Note: this doesn't match the current spec text and result in different values for `result.index` under certain conditions. However, the spec will change to match our implementation's behavior. See https://github.com/tc39/ecma262/issues/128."
Attachment #8698073 - Flags: review?(till) → review+
(Reporter)

Comment 64

a year ago
Comment on attachment 8698075 [details] [diff] [review]
Part 11: Support back reference with unicode flag.

Review of attachment 8698075 [details] [diff] [review]:
-----------------------------------------------------------------

r=m with feedback addressed - assuming my suggested comment for UnicodeBackReferenceAtom is ok. If not, I'd like to understand what's really going on before giving an r+.

::: js/src/irregexp/RegExpEngine.cpp
@@ +2995,5 @@
>      assembler->Bind(&ok);
>      on_success->Emit(compiler, &new_trace);
>  }
>  
> +// Make sure next character is not a trail surrogate that has corresponding

s/Make sure next/Assert that the next/
and
s/has corresponding/has a corresponding/

@@ +3010,5 @@
> +    // character register.
> +    Trace new_trace(*trace);
> +    new_trace.InvalidateCurrentCharacter();
> +
> +    if (new_trace.cp_offset() == 0) {

Nit: no brace required.

@@ +3019,5 @@
> +    assembler->LoadCurrentCharacter(new_trace.cp_offset(), new_trace.backtrack(), false);
> +    assembler->CheckCharacterNotInRange(unicode::TrailSurrogateMin, unicode::TrailSurrogateMax,
> +                                        &ok);
> +
> +    // Next check if previous character is a lead surrogate

Nit: missing "." at the end.

::: js/src/irregexp/RegExpParser.cpp
@@ +1378,5 @@
>      return UnicodeRangesAtom(alloc, ranges, leadRanges, trailRanges, wideRanges, false, false);
>  }
>  
> +static inline RegExpTree*
> +UnicodeBackReferenceAtom(LifoAlloc* alloc, RegExpTree* atom)

Same comment on perhaps turning this into a method on RegExpBuilder instead of allocating a new builder as in part 2 applies.

@@ +1382,5 @@
> +UnicodeBackReferenceAtom(LifoAlloc* alloc, RegExpTree* atom)
> +{
> +    // Back reference may contain standalone lead surrogate at the last
> +    // character, and it should not match the lead surrogate that has
> +    // corresponding trail surrogate.

If I understand all this correctly, perhaps this comment would work better:
"If a back reference has a standalone lead surrogate as its last character, then that lead surrogate shouldn't match lead surrogates that are paired with a corresponding trail surrogate."

If that comment doesn't match, please reword things in an applicable way, or ping me and we can discuss what this should say.
Attachment #8698075 - Flags: review?(till) → review+
(Reporter)

Comment 65

a year ago
Comment on attachment 8698066 [details] [diff] [review]
Part 2: Parse RegExp unicode character in non-CharacterClass.

Review of attachment 8698066 [details] [diff] [review]:
-----------------------------------------------------------------

Forgot to toggle the flag.
Attachment #8698066 - Flags: review?(till) → review+
Thank you for reviewing :D

about RegExpBuilder, iiuc, single builder can create only single group, so it won't be possible to reuse caller's builder, as they should create group.
for example, \uD800{2} should be interpreted as (?:\uD800(?![\uDC00-\uDFFF])){2}, not \uD800(?![\uDC00-\uDFFF]){2}.

NegativeLookahead should be re-written without builder, as it actually does not require accumulating:
  static inline RegExpTree*
  NegativeLookahead(LifoAlloc* alloc, char16_t from, char16_t to)
  {
      return alloc->newInfallible<RegExpLookahead>(RangeAtom(alloc, from, to), false, 0, 0);
  }

SurrogatePairAtom could also be re-written without builder, but using builder seems to be easier and consistent with others.

for other cases (LeadSurrogateAtom, TrailSurrogateAtom, UnicodeEverythingAtom, CaseFoldingSurrogatePairAtom, and UnicodeBackReferenceAtom), it should be easier to create a dedicated builder for each, as they accumulate multiple atoms/assertions.

will post updated patch for part 3 after some more testing.
Created attachment 8699850 [details] [diff] [review]
Part 3: Parse RegExp unicode character in CharacterClass.

Renamed SubRanges to NegateUnicodeRanges, with ranges as out parameter.
Attachment #8698067 - Attachment is obsolete: true
Attachment #8699850 - Flags: review?(till)
(Reporter)

Comment 68

a year ago
Comment on attachment 8699850 [details] [diff] [review]
Part 3: Parse RegExp unicode character in CharacterClass.

Review of attachment 8699850 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks, this looks great.

Also thanks for the explanations in comment 66. They make a lot of sense, and I think your decisions regarding those points are the right ones.

::: js/src/irregexp/RegExpParser.cpp
@@ +674,5 @@
> +// encompassing the full range of possible values.
> +template <typename RangeType>
> +static inline void
> +NegateUnicodeRanges(LifoAlloc* alloc, Vector<RangeType, 1, LifoAllocPolicy<Infallible> >** ranges,
> +             RangeType full_range)

Nit: wrong indentation.

@@ +681,5 @@
> +    RangeVector* tmp_ranges = alloc->newInfallible<RangeVector>(*alloc);
> +    tmp_ranges->append(full_range);
> +    RangeVector* result_ranges = alloc->newInfallible<RangeVector>(*alloc);
> +
> +    // Perform following calculation:

"the following"

@@ +683,5 @@
> +    RangeVector* result_ranges = alloc->newInfallible<RangeVector>(*alloc);
> +
> +    // Perform following calculation:
> +    //   result_ranges = tmp_ranges - ranges
> +    // with following steps:

"the following"

@@ +725,5 @@
> +        tmp_ranges = result_ranges;
> +        result_ranges = tmp;
> +    }
> +
> +    // After the loop, result is pointed by tmp_ranges, instead of

s/pointed by/pointed at by/
Attachment #8699850 - Flags: review?(till) → review+
https://hg.mozilla.org/integration/mozilla-inbound/rev/9295eeb878f5fc4570025cb4cf38d4be1e364e8d
Bug 1135377 - Part 1: Implement RegExp unicode flag. r=till, f=anba

https://hg.mozilla.org/integration/mozilla-inbound/rev/5a944e733ada08f08dd095531c9c715f64683a0d
Bug 1135377 - Part 2: Parse RegExp unicode character in non-CharacterClass. r=till, f=anba

https://hg.mozilla.org/integration/mozilla-inbound/rev/4e05611fe3dd8f91320ed1d123bfc2032d11eabe
Bug 1135377 - Part 3: Parse RegExp unicode character in CharacterClass. r=till, f=anba

https://hg.mozilla.org/integration/mozilla-inbound/rev/803f23393bc4864f6cc342cf4f5469bb521387b1
Bug 1135377 - Part 4: Support everything Atom in RegExp with unicode flag. r=till, f=anba

https://hg.mozilla.org/integration/mozilla-inbound/rev/872d04109a5ce42e20ab9466bf80809e90b157d1
Bug 1135377 - Part 5: Support CharacterClassEscape in RegExp with unicode flag. r=till, f=anba

https://hg.mozilla.org/integration/mozilla-inbound/rev/f0e8ecf3ee9942fa55f945143b60f97d049457d3
Bug 1135377 - Part 6: Support ignoreCase for BMP in RegExp with unicode flag. r=till, f=anba

https://hg.mozilla.org/integration/mozilla-inbound/rev/93113598bb3b40b26f8c141b11522ffbcaf29ceb
Bug 1135377 - Part 7: Support ignoreCase for non-BMP in RegExp with unicode flag. r=till, f=anba

https://hg.mozilla.org/integration/mozilla-inbound/rev/3bd3d3ed5fe4ffc440f6e9ae2d2161481034daae
Bug 1135377 - Part 8: Disallow extended pattern in RegExp with unicode flag. r=till, f=anba

https://hg.mozilla.org/integration/mozilla-inbound/rev/5b0ced0284a9e13609fad337abc442a290ee30de
Bug 1135377 - Part 9: Use RegExp unicode flag in String.prototype.{match,replace,split}. r=till, f=anba

https://hg.mozilla.org/integration/mozilla-inbound/rev/69c495efe7991438ce7aaeabd9367bb66d0ceccb
Bug 1135377 - Part 10: Decrement index when it points trail surrogate that has corresponding lead surrogate. r=till, f=anba

https://hg.mozilla.org/integration/mozilla-inbound/rev/26aa408f9d11ccfe04218e6a96c4f5f6dc2cf9e7
Bug 1135377 - Part 11: Support back reference with unicode flag. r=till, f=anba
Blocks: 1227906

Comment 70

a year ago
bugherder
https://hg.mozilla.org/mozilla-central/rev/9295eeb878f5
https://hg.mozilla.org/mozilla-central/rev/5a944e733ada
https://hg.mozilla.org/mozilla-central/rev/4e05611fe3dd
https://hg.mozilla.org/mozilla-central/rev/803f23393bc4
https://hg.mozilla.org/mozilla-central/rev/872d04109a5c
https://hg.mozilla.org/mozilla-central/rev/f0e8ecf3ee99
https://hg.mozilla.org/mozilla-central/rev/93113598bb3b
https://hg.mozilla.org/mozilla-central/rev/3bd3d3ed5fe4
https://hg.mozilla.org/mozilla-central/rev/5b0ced0284a9
https://hg.mozilla.org/mozilla-central/rev/69c495efe799
https://hg.mozilla.org/mozilla-central/rev/26aa408f9d11
Status: NEW → RESOLVED
Last Resolved: a year ago
status-firefox46: --- → fixed
Resolution: --- → FIXED
Target Milestone: --- → mozilla46
Updated following documentation:
  https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/RegExp
  https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/RegExp/unicode
  https://developer.mozilla.org/en-US/Firefox/Releases/46
  https://developer.mozilla.org/en-US/docs/Web/JavaScript/New_in_JavaScript/ECMAScript_6_support_in_Mozilla
Thank you!
Keywords: dev-doc-needed → dev-doc-complete
Depends on: 1279467
You need to log in before you can comment on or make changes to this bug.