YARR allows what seems like a bogus character-class range

RESOLVED FIXED

Status

()

Core
JavaScript Engine
RESOLVED FIXED
7 years ago
7 years ago

People

(Reporter: Jesse Ruderman, Assigned: cdleary)

Tracking

(Blocks: 1 bug, {testcase})

Trunk
x86
Mac OS X
testcase
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(blocking2.0 -, status2.0 wanted)

Details

(Whiteboard: fixed-in-tracemonkey)

Attachments

(1 attachment, 1 obsolete attachment)

(Reporter)

Description

7 years ago
/[C-\s]/;

YARR: accepts it
OLD:  "invalid range in character class"
(Reporter)

Comment 1

7 years ago
Same with /[f-\d]/
blocking2.0: --- → ?

Updated

7 years ago
Assignee: general → cdleary
blocking2.0: ? → betaN+
TL;DR: YARR wrong. Will fix.

YARR treats the dash as being a member of the character class (brackets) when it precedes a built-in character class (like \s).

js> /[C-\s]/.exec(' ') 
[" "]
js> /[C-\s]/.exec('-') 
["-"]

This string is accepted by the grammar by the following derivation, so it's not a syntax error due to the grammar:

...
-> Atom
-> CharacterClass
-> "[" ClassRanges "]"                    (remaining: "C-\s")
-> NonEmptyClassRanges
-> ClassAtom "-" ClassAtom ClassRanges
-> ClassAtom "-" ClassAtom [empty]
-> ClassAtomNoDash "-" ClassAtomNoDash
-> SourceCharacter "-" "\" ClassEscape
-> "C" "-" "\" CharacterClassEscape
-> "C" "-" "\" "s"                        (accepted)

However, 15.10 tacks additional restrictions onto the grammar; namely, ECMAScriptv5 15.10.2.15 production NonemptyClassRanges :: ClassAtom - ClassAtom ClassRanges calls the abstract operation CharacterRange(A, B) in step 4, and CharacterRange step 1 says that if B does not contain exactly one character then a SyntaxError must be thrown, and all the builtin character classes contain more than one character.
Created attachment 482286 [details] [diff] [review]
Fix character class range with builtin character class on RHS.

Produces the same error message for us as the rest of the character class badnesses, but I added a new error code for upstreamy goodness.
Attachment #482286 - Flags: review?(jwalden+bmo)
Comment on attachment 482286 [details] [diff] [review]
Fix character class range with builtin character class on RHS.

[jwalden@find-waldo-now src]$ dbg/js
js> /[C-\s]/.exec("-")
invalid range in character class
js> /[\s-C]/.exec("-")
["-"]

Test needs to make sure to address the reversed cases, too.
Attachment #482286 - Flags: review?(jwalden+bmo) → review-
The restrictions on ClassAtomNoDash are also not enforced because the character class parser is a very simple DFA. Any non-descending single-character range is considered valid. Check out this awesome pattern:

/[--/]/.exec('.') 
["."]
(In reply to comment #5)
> Check out this awesome pattern:

Sorry, that pattern is actually valid! The restrictions on ClassAtomNoDash's SourceCharacter (but not one of '\' or ']' or '-') are still suspect, just have to come up with a pattern that triggers it.
Eh, I think it's okay. Expanding the productions under NonEmptyClassRanges, you get the following regular patterns, with CA short for ClassAtom, CAND short for ClassAtomNoDash, and ... indicating a permitted top-level recursion:

CA
CA CA
CA CAND* CA
CA CAND - CA ...
CA - CA ...
CA CAND - CA ...

The semantic errors for the |CAND - CA| and |CA - CA| forms are the same, so we should be able to figure out all the appropriate places for a dash with mode bits, as we do now. I'll just fix the error and shut up now. :-)
Created attachment 485837 [details] [diff] [review]
Fix character class range with builtin character class on any side.

No need to cache/flush the character class like we do the unescaped characters, since a range with a builtin character class in it is always a syntax error.
Attachment #482286 - Attachment is obsolete: true
Attachment #485837 - Flags: review?(jwalden+bmo)
Comment on attachment 485837 [details] [diff] [review]
Fix character class range with builtin character class on any side.

Hmm, I think /\s\w-\W/ should be failing as well.
Attachment #485837 - Flags: review?(jwalden+bmo)
Comment on attachment 485837 [details] [diff] [review]
Fix character class range with builtin character class on any side.

That's because I didn't put it in a character class. Duh. (One of those days.)
Attachment #485837 - Flags: review?(jwalden+bmo)
Status: NEW → ASSIGNED
Comment on attachment 485837 [details] [diff] [review]
Fix character class range with builtin character class on any side.

>diff --git a/js/src/jit-test/tests/basic/bug576837-regexp.js b/js/src/jit-test/tests/basic/bug576837-regexp.js

>+function isRegExpSyntaxError(pattern) {
>+    try {
>+        var re = new RegExp(pattern);
>+    } catch (e) {
>+        if (e instanceof SyntaxError)
>+            return true;
>+    }
>+    return false;
>+}
>+
>+function testRangeSyntax(end1, end2, shouldFail) {
>+    var makePattern = function(e1, e2) {
>+        var pattern = '[' + e1 + '-' + e2 + ']';
>+        print(uneval(pattern));
>+        return pattern;
>+    };
>+    assertEq(isRegExpSyntaxError(makePattern(end1, end2)), shouldFail);
>+    assertEq(isRegExpSyntaxError(makePattern(end2, end1)), shouldFail);
>+}
>+
>+testRangeSyntax('C', '\\s', true);
>+testRangeSyntax('C', '\\d', true);
>+testRangeSyntax('C', '\\W', true);
>+testRangeSyntax('C', '\\W', true);
>+testRangeSyntax('C', '', false);
>+testRangeSyntax('C', 'C', false);
>+testRangeSyntax('\\s', '\\s', true);
>+testRangeSyntax('\\W', '\\s', true);

Plain true/false don't do it for me for readability, suggest checkRangeValid/checkRangeInvalid methods forwarding to one common method.

Add in some tests with \b and \B since those aren't treated as character class escapes.


>diff --git a/js/src/yarr/yarr/RegexParser.h b/js/src/yarr/yarr/RegexParser.h

>     class CharacterClassParserDelegate {
>     public:
>         CharacterClassParserDelegate(Delegate& delegate, ErrorCode& err)
>             : m_delegate(delegate)
>             , m_err(err)
>             , m_state(empty)
>+            , sawCharacterClass(false)
>         {
>         }

Good style appears to suggest m_sawCharacterClass, to make upstreaming easier.


>             case cachedCharacter:
>+                if (m_character == '-' && sawCharacterClass) {
>+                    m_err = CharacterClassRangeSingleChar;
>+                    m_state = empty;
>+                    break;
>+                } else {
>+                    sawCharacterClass = false;
>+                }
>+
>                 if (ch == '-')
>                     m_state = cachedCharacterHyphen;
>                 else {
>                     m_delegate.atomCharacterClassAtom(m_character);
>                     m_character = ch;
>                 }
>                 break;

As far as good style goes, it appears that this code parenthesizes equality expressions that aren't the whole of conditions -- so |((m_character == '-') && m_sawCharacterClass)|.

I think this would be more readable if you didn't put the sawCharacterClass in the else-block.  It makes more sense (even if one doesn't subscribe to else-after-return being evil) to say that *if* the just-passed character class starts a range, *then* it's an error and you exit.  But the sawCharacterClass assignment applies to both arms of the if below, so it seems it makes more sense to something like this:

>+                if (m_character == '-' && sawCharacterClass) {
>+                    m_err = CharacterClassRangeSingleChar;
>+                    m_state = empty;
>+                    break;
>+                }
>+
>+                sawCharacterClass = false;
>                 if (ch == '-')
>                     m_state = cachedCharacterHyphen;
>                 else {
>                     m_delegate.atomCharacterClassAtom(m_character);
>                     m_character = ch;
>                 }
>                 break;


>         void atomBuiltInCharacterClass(BuiltInCharacterClassID classID, bool invert)
>         {
>+            if (m_state == cachedCharacterHyphen ||
>+                (sawCharacterClass && m_state == cachedCharacter && m_character == '-')) {
>+                // If the RHS of a range does not contain exacly one character then a SyntaxError
>+                // must be thrown.
>+                // Assumes none of the built in character classes contain a single character.
>+                m_err = CharacterClassRangeSingleChar;
>+                m_state = empty;
>+                return;
>+            }
>             flush();
>             m_delegate.atomCharacterClassBuiltIn(classID, invert);
>+            sawCharacterClass = true;
>         }

(Add parentheses per the earlier comment.)

I had trouble understanding both parts of your condition here -- suggest adding something noting that the first part handles things like [a-\d] and the second part handles things like [\w-\s].  For that matter, the most obvious place for any of these checks, in |case cachedCharacter|, is only hit (I think) in the case of [\w-a] or [\w-\b] -- may as well add a comment to that effect there, too.
Attachment #485837 - Flags: review?(jwalden+bmo) → review+
For what it's worth, I found the intricate handoffs between parsing escapes, parsing non-escapes, and the state of CharacterClassDelegate to be way too confusing.  But I am loath to suggest significant adjustment as it seems likely to complicate the maintenance burden upstream updates would cause in concert with such changes.
http://hg.mozilla.org/tracemonkey/rev/8ae5fce0f19b
Whiteboard: fixed-in-tracemonkey
Backout: http://hg.mozilla.org/tracemonkey/rev/f4444a398ec1

Somehow I missed bug 351463: with a hyphen between two character classes we are intentionally not compliant to the spec.
Whiteboard: fixed-in-tracemonkey

Comment 15

7 years ago
http://hg.mozilla.org/mozilla-central/rev/8ae5fce0f19b
Status: ASSIGNED → RESOLVED
Last Resolved: 7 years ago
Resolution: --- → FIXED
Reopening because it was backed out -- I still have to follow up by committing with the two-character-class range removed as a syntax error.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
It looks like this bug was almost fixed. What's left to do on it?

Updated

7 years ago
blocking2.0: betaN+ → -
status2.0: --- → wanted
I made us conform to our old, 1.9.2 behavior, so the m_sawCharacterClass was unnecessary after all.

http://hg.mozilla.org/tracemonkey/rev/08f2504e5f44
Whiteboard: fixed-in-tracemonkey
http://hg.mozilla.org/mozilla-central/rev/08f2504e5f44
Status: REOPENED → RESOLVED
Last Resolved: 7 years ago7 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.