Closed Bug 576837 Opened 14 years ago Closed 14 years ago

YARR allows what seems like a bogus character-class range

Categories

(Core :: JavaScript Engine, defect)

x86
macOS
defect
Not set
normal

Tracking

()

RESOLVED FIXED
Tracking Status
blocking2.0 --- -
status2.0 --- wanted

People

(Reporter: jruderman, Assigned: cdleary)

References

Details

(Keywords: testcase, Whiteboard: fixed-in-tracemonkey)

Attachments

(1 file, 1 obsolete file)

/[C-\s]/;

YARR: accepts it
OLD:  "invalid range in character class"
Same with /[f-\d]/
blocking2.0: --- → ?
Assignee: general → cdleary
blocking2.0: ? → betaN+
TL;DR: YARR wrong. Will fix.

YARR treats the dash as being a member of the character class (brackets) when it precedes a built-in character class (like \s).

js> /[C-\s]/.exec(' ') 
[" "]
js> /[C-\s]/.exec('-') 
["-"]

This string is accepted by the grammar by the following derivation, so it's not a syntax error due to the grammar:

...
-> Atom
-> CharacterClass
-> "[" ClassRanges "]"                    (remaining: "C-\s")
-> NonEmptyClassRanges
-> ClassAtom "-" ClassAtom ClassRanges
-> ClassAtom "-" ClassAtom [empty]
-> ClassAtomNoDash "-" ClassAtomNoDash
-> SourceCharacter "-" "\" ClassEscape
-> "C" "-" "\" CharacterClassEscape
-> "C" "-" "\" "s"                        (accepted)

However, 15.10 tacks additional restrictions onto the grammar; namely, ECMAScriptv5 15.10.2.15 production NonemptyClassRanges :: ClassAtom - ClassAtom ClassRanges calls the abstract operation CharacterRange(A, B) in step 4, and CharacterRange step 1 says that if B does not contain exactly one character then a SyntaxError must be thrown, and all the builtin character classes contain more than one character.
Produces the same error message for us as the rest of the character class badnesses, but I added a new error code for upstreamy goodness.
Attachment #482286 - Flags: review?(jwalden+bmo)
Comment on attachment 482286 [details] [diff] [review]
Fix character class range with builtin character class on RHS.

[jwalden@find-waldo-now src]$ dbg/js
js> /[C-\s]/.exec("-")
invalid range in character class
js> /[\s-C]/.exec("-")
["-"]

Test needs to make sure to address the reversed cases, too.
Attachment #482286 - Flags: review?(jwalden+bmo) → review-
The restrictions on ClassAtomNoDash are also not enforced because the character class parser is a very simple DFA. Any non-descending single-character range is considered valid. Check out this awesome pattern:

/[--/]/.exec('.') 
["."]
(In reply to comment #5)
> Check out this awesome pattern:

Sorry, that pattern is actually valid! The restrictions on ClassAtomNoDash's SourceCharacter (but not one of '\' or ']' or '-') are still suspect, just have to come up with a pattern that triggers it.
Eh, I think it's okay. Expanding the productions under NonEmptyClassRanges, you get the following regular patterns, with CA short for ClassAtom, CAND short for ClassAtomNoDash, and ... indicating a permitted top-level recursion:

CA
CA CA
CA CAND* CA
CA CAND - CA ...
CA - CA ...
CA CAND - CA ...

The semantic errors for the |CAND - CA| and |CA - CA| forms are the same, so we should be able to figure out all the appropriate places for a dash with mode bits, as we do now. I'll just fix the error and shut up now. :-)
No need to cache/flush the character class like we do the unescaped characters, since a range with a builtin character class in it is always a syntax error.
Attachment #482286 - Attachment is obsolete: true
Attachment #485837 - Flags: review?(jwalden+bmo)
Comment on attachment 485837 [details] [diff] [review]
Fix character class range with builtin character class on any side.

Hmm, I think /\s\w-\W/ should be failing as well.
Attachment #485837 - Flags: review?(jwalden+bmo)
Comment on attachment 485837 [details] [diff] [review]
Fix character class range with builtin character class on any side.

That's because I didn't put it in a character class. Duh. (One of those days.)
Attachment #485837 - Flags: review?(jwalden+bmo)
Status: NEW → ASSIGNED
Comment on attachment 485837 [details] [diff] [review]
Fix character class range with builtin character class on any side.

>diff --git a/js/src/jit-test/tests/basic/bug576837-regexp.js b/js/src/jit-test/tests/basic/bug576837-regexp.js

>+function isRegExpSyntaxError(pattern) {
>+    try {
>+        var re = new RegExp(pattern);
>+    } catch (e) {
>+        if (e instanceof SyntaxError)
>+            return true;
>+    }
>+    return false;
>+}
>+
>+function testRangeSyntax(end1, end2, shouldFail) {
>+    var makePattern = function(e1, e2) {
>+        var pattern = '[' + e1 + '-' + e2 + ']';
>+        print(uneval(pattern));
>+        return pattern;
>+    };
>+    assertEq(isRegExpSyntaxError(makePattern(end1, end2)), shouldFail);
>+    assertEq(isRegExpSyntaxError(makePattern(end2, end1)), shouldFail);
>+}
>+
>+testRangeSyntax('C', '\\s', true);
>+testRangeSyntax('C', '\\d', true);
>+testRangeSyntax('C', '\\W', true);
>+testRangeSyntax('C', '\\W', true);
>+testRangeSyntax('C', '', false);
>+testRangeSyntax('C', 'C', false);
>+testRangeSyntax('\\s', '\\s', true);
>+testRangeSyntax('\\W', '\\s', true);

Plain true/false don't do it for me for readability, suggest checkRangeValid/checkRangeInvalid methods forwarding to one common method.

Add in some tests with \b and \B since those aren't treated as character class escapes.


>diff --git a/js/src/yarr/yarr/RegexParser.h b/js/src/yarr/yarr/RegexParser.h

>     class CharacterClassParserDelegate {
>     public:
>         CharacterClassParserDelegate(Delegate& delegate, ErrorCode& err)
>             : m_delegate(delegate)
>             , m_err(err)
>             , m_state(empty)
>+            , sawCharacterClass(false)
>         {
>         }

Good style appears to suggest m_sawCharacterClass, to make upstreaming easier.


>             case cachedCharacter:
>+                if (m_character == '-' && sawCharacterClass) {
>+                    m_err = CharacterClassRangeSingleChar;
>+                    m_state = empty;
>+                    break;
>+                } else {
>+                    sawCharacterClass = false;
>+                }
>+
>                 if (ch == '-')
>                     m_state = cachedCharacterHyphen;
>                 else {
>                     m_delegate.atomCharacterClassAtom(m_character);
>                     m_character = ch;
>                 }
>                 break;

As far as good style goes, it appears that this code parenthesizes equality expressions that aren't the whole of conditions -- so |((m_character == '-') && m_sawCharacterClass)|.

I think this would be more readable if you didn't put the sawCharacterClass in the else-block.  It makes more sense (even if one doesn't subscribe to else-after-return being evil) to say that *if* the just-passed character class starts a range, *then* it's an error and you exit.  But the sawCharacterClass assignment applies to both arms of the if below, so it seems it makes more sense to something like this:

>+                if (m_character == '-' && sawCharacterClass) {
>+                    m_err = CharacterClassRangeSingleChar;
>+                    m_state = empty;
>+                    break;
>+                }
>+
>+                sawCharacterClass = false;
>                 if (ch == '-')
>                     m_state = cachedCharacterHyphen;
>                 else {
>                     m_delegate.atomCharacterClassAtom(m_character);
>                     m_character = ch;
>                 }
>                 break;


>         void atomBuiltInCharacterClass(BuiltInCharacterClassID classID, bool invert)
>         {
>+            if (m_state == cachedCharacterHyphen ||
>+                (sawCharacterClass && m_state == cachedCharacter && m_character == '-')) {
>+                // If the RHS of a range does not contain exacly one character then a SyntaxError
>+                // must be thrown.
>+                // Assumes none of the built in character classes contain a single character.
>+                m_err = CharacterClassRangeSingleChar;
>+                m_state = empty;
>+                return;
>+            }
>             flush();
>             m_delegate.atomCharacterClassBuiltIn(classID, invert);
>+            sawCharacterClass = true;
>         }

(Add parentheses per the earlier comment.)

I had trouble understanding both parts of your condition here -- suggest adding something noting that the first part handles things like [a-\d] and the second part handles things like [\w-\s].  For that matter, the most obvious place for any of these checks, in |case cachedCharacter|, is only hit (I think) in the case of [\w-a] or [\w-\b] -- may as well add a comment to that effect there, too.
Attachment #485837 - Flags: review?(jwalden+bmo) → review+
For what it's worth, I found the intricate handoffs between parsing escapes, parsing non-escapes, and the state of CharacterClassDelegate to be way too confusing.  But I am loath to suggest significant adjustment as it seems likely to complicate the maintenance burden upstream updates would cause in concert with such changes.
Backout: http://hg.mozilla.org/tracemonkey/rev/f4444a398ec1

Somehow I missed bug 351463: with a hyphen between two character classes we are intentionally not compliant to the spec.
Whiteboard: fixed-in-tracemonkey
http://hg.mozilla.org/mozilla-central/rev/8ae5fce0f19b
Status: ASSIGNED → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Reopening because it was backed out -- I still have to follow up by committing with the two-character-class range removed as a syntax error.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
It looks like this bug was almost fixed. What's left to do on it?
blocking2.0: betaN+ → -
status2.0: --- → wanted
I made us conform to our old, 1.9.2 behavior, so the m_sawCharacterClass was unnecessary after all.

http://hg.mozilla.org/tracemonkey/rev/08f2504e5f44
Whiteboard: fixed-in-tracemonkey
http://hg.mozilla.org/mozilla-central/rev/08f2504e5f44
Status: REOPENED → RESOLVED
Closed: 14 years ago14 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: