Closed Bug 576837 Opened 14 years ago Closed 14 years ago

YARR allows what seems like a bogus character-class range

Tracking

()

Status:

RESOLVED FIXED

Tracking Flags:

Tracking

Status

blocking2.0

---

status2.0

---

wanted

People

(Reporter: jruderman, Assigned: cdleary)

References

Details

(Keywords: testcase, Whiteboard: fixed-in-tracemonkey)

Attachments

(1 file, 1 obsolete file)

Fix character class range with builtin character class on RHS. 14 years ago Chris Leary [:cdleary] (not checking bugmail) 6.30 KB, patch	Waldo : review-	Details \| Diff \| Splinter Review
Fix character class range with builtin character class on any side. 14 years ago Chris Leary [:cdleary] (not checking bugmail) 12.32 KB, patch	Waldo : review+	Details \| Diff \| Splinter Review

Jesse Ruderman

Reporter

Description

•

14 years ago

/[C-\s]/;

YARR: accepts it
OLD:  "invalid range in character class"

Jesse Ruderman

Reporter

Comment 1

•

14 years ago

Same with /[f-\d]/

Boris Zbarsky [:bzbarsky]

Updated

•

14 years ago

blocking2.0: --- → ?

Robert Sayre

Updated

•

14 years ago

Assignee: general → cdleary

blocking2.0: ? → betaN+

Chris Leary [:cdleary] (not checking bugmail)

Assignee

Comment 2

•

14 years ago

TL;DR: YARR wrong. Will fix.

YARR treats the dash as being a member of the character class (brackets) when it precedes a built-in character class (like \s).

js> /[C-\s]/.exec(' ') 
[" "]
js> /[C-\s]/.exec('-') 
["-"]

This string is accepted by the grammar by the following derivation, so it's not a syntax error due to the grammar:

...
-> Atom
-> CharacterClass
-> "[" ClassRanges "]"                    (remaining: "C-\s")
-> NonEmptyClassRanges
-> ClassAtom "-" ClassAtom ClassRanges
-> ClassAtom "-" ClassAtom [empty]
-> ClassAtomNoDash "-" ClassAtomNoDash
-> SourceCharacter "-" "\" ClassEscape
-> "C" "-" "\" CharacterClassEscape
-> "C" "-" "\" "s"                        (accepted)

However, 15.10 tacks additional restrictions onto the grammar; namely, ECMAScriptv5 15.10.2.15 production NonemptyClassRanges :: ClassAtom - ClassAtom ClassRanges calls the abstract operation CharacterRange(A, B) in step 4, and CharacterRange step 1 says that if B does not contain exactly one character then a SyntaxError must be thrown, and all the builtin character classes contain more than one character.

Chris Leary [:cdleary] (not checking bugmail)

Assignee

Comment 3

•

14 years ago

Attached patch Fix character class range with builtin character class on RHS. (obsolete) — Details — Splinter Review

Produces the same error message for us as the rest of the character class badnesses, but I added a new error code for upstreamy goodness.

Attachment #482286 - Flags: review?(jwalden+bmo)

Jeff Walden [:Waldo]

Comment 4

•

14 years ago

Comment on attachment 482286 [details] [diff] [review]
Fix character class range with builtin character class on RHS.

[jwalden@find-waldo-now src]$ dbg/js
js> /[C-\s]/.exec("-")
invalid range in character class
js> /[\s-C]/.exec("-")
["-"]

Test needs to make sure to address the reversed cases, too.

Attachment #482286 - Flags: review?(jwalden+bmo) → review-

Chris Leary [:cdleary] (not checking bugmail)

Assignee

Comment 5

•

14 years ago

The restrictions on ClassAtomNoDash are also not enforced because the character class parser is a very simple DFA. Any non-descending single-character range is considered valid. Check out this awesome pattern:

/[--/]/.exec('.') 
["."]

Chris Leary [:cdleary] (not checking bugmail)

Assignee

Comment 6

•

14 years ago

(In reply to comment #5)
> Check out this awesome pattern:

Sorry, that pattern is actually valid! The restrictions on ClassAtomNoDash's SourceCharacter (but not one of '\' or ']' or '-') are still suspect, just have to come up with a pattern that triggers it.

Chris Leary [:cdleary] (not checking bugmail)

Assignee

Comment 7

•

14 years ago

Eh, I think it's okay. Expanding the productions under NonEmptyClassRanges, you get the following regular patterns, with CA short for ClassAtom, CAND short for ClassAtomNoDash, and ... indicating a permitted top-level recursion:

CA
CA CA
CA CAND* CA
CA CAND - CA ...
CA - CA ...
CA CAND - CA ...

The semantic errors for the |CAND - CA| and |CA - CA| forms are the same, so we should be able to figure out all the appropriate places for a dash with mode bits, as we do now. I'll just fix the error and shut up now. :-)

Chris Leary [:cdleary] (not checking bugmail)

Assignee

Comment 8

•

14 years ago

Attached patch Fix character class range with builtin character class on any side. — Details — Splinter Review

No need to cache/flush the character class like we do the unescaped characters, since a range with a builtin character class in it is always a syntax error.

Attachment #482286 - Attachment is obsolete: true

Attachment #485837 - Flags: review?(jwalden+bmo)

Chris Leary [:cdleary] (not checking bugmail)

Assignee

Comment 9

•

14 years ago

Comment on attachment 485837 [details] [diff] [review]
Fix character class range with builtin character class on any side.

Hmm, I think /\s\w-\W/ should be failing as well.

Attachment #485837 - Flags: review?(jwalden+bmo)

Chris Leary [:cdleary] (not checking bugmail)

Assignee

Comment 10

•

14 years ago

Comment on attachment 485837 [details] [diff] [review]
Fix character class range with builtin character class on any side.

That's because I didn't put it in a character class. Duh. (One of those days.)

Attachment #485837 - Flags: review?(jwalden+bmo)

Chris Leary [:cdleary] (not checking bugmail)

Assignee

Updated

•

14 years ago

Status: NEW → ASSIGNED

Jeff Walden [:Waldo]

Comment 11

•

14 years ago

Comment on attachment 485837 [details] [diff] [review]
Fix character class range with builtin character class on any side.

>diff --git a/js/src/jit-test/tests/basic/bug576837-regexp.js b/js/src/jit-test/tests/basic/bug576837-regexp.js

>+function isRegExpSyntaxError(pattern) {
>+    try {
>+        var re = new RegExp(pattern);
>+    } catch (e) {
>+        if (e instanceof SyntaxError)
>+            return true;
>+    }
>+    return false;
>+}
>+
>+function testRangeSyntax(end1, end2, shouldFail) {
>+    var makePattern = function(e1, e2) {
>+        var pattern = '[' + e1 + '-' + e2 + ']';
>+        print(uneval(pattern));
>+        return pattern;
>+    };
>+    assertEq(isRegExpSyntaxError(makePattern(end1, end2)), shouldFail);
>+    assertEq(isRegExpSyntaxError(makePattern(end2, end1)), shouldFail);
>+}
>+
>+testRangeSyntax('C', '\\s', true);
>+testRangeSyntax('C', '\\d', true);
>+testRangeSyntax('C', '\\W', true);
>+testRangeSyntax('C', '\\W', true);
>+testRangeSyntax('C', '', false);
>+testRangeSyntax('C', 'C', false);
>+testRangeSyntax('\\s', '\\s', true);
>+testRangeSyntax('\\W', '\\s', true);

Plain true/false don't do it for me for readability, suggest checkRangeValid/checkRangeInvalid methods forwarding to one common method.

Add in some tests with \b and \B since those aren't treated as character class escapes.


>diff --git a/js/src/yarr/yarr/RegexParser.h b/js/src/yarr/yarr/RegexParser.h

>     class CharacterClassParserDelegate {
>     public:
>         CharacterClassParserDelegate(Delegate& delegate, ErrorCode& err)
>             : m_delegate(delegate)
>             , m_err(err)
>             , m_state(empty)
>+            , sawCharacterClass(false)
>         {
>         }

Good style appears to suggest m_sawCharacterClass, to make upstreaming easier.


>             case cachedCharacter:
>+                if (m_character == '-' && sawCharacterClass) {
>+                    m_err = CharacterClassRangeSingleChar;
>+                    m_state = empty;
>+                    break;
>+                } else {
>+                    sawCharacterClass = false;
>+                }
>+
>                 if (ch == '-')
>                     m_state = cachedCharacterHyphen;
>                 else {
>                     m_delegate.atomCharacterClassAtom(m_character);
>                     m_character = ch;
>                 }
>                 break;

As far as good style goes, it appears that this code parenthesizes equality expressions that aren't the whole of conditions -- so |((m_character == '-') && m_sawCharacterClass)|.

I think this would be more readable if you didn't put the sawCharacterClass in the else-block.  It makes more sense (even if one doesn't subscribe to else-after-return being evil) to say that *if* the just-passed character class starts a range, *then* it's an error and you exit.  But the sawCharacterClass assignment applies to both arms of the if below, so it seems it makes more sense to something like this:

>+                if (m_character == '-' && sawCharacterClass) {
>+                    m_err = CharacterClassRangeSingleChar;
>+                    m_state = empty;
>+                    break;
>+                }
>+
>+                sawCharacterClass = false;
>                 if (ch == '-')
>                     m_state = cachedCharacterHyphen;
>                 else {
>                     m_delegate.atomCharacterClassAtom(m_character);
>                     m_character = ch;
>                 }
>                 break;


>         void atomBuiltInCharacterClass(BuiltInCharacterClassID classID, bool invert)
>         {
>+            if (m_state == cachedCharacterHyphen ||
>+                (sawCharacterClass && m_state == cachedCharacter && m_character == '-')) {
>+                // If the RHS of a range does not contain exacly one character then a SyntaxError
>+                // must be thrown.
>+                // Assumes none of the built in character classes contain a single character.
>+                m_err = CharacterClassRangeSingleChar;
>+                m_state = empty;
>+                return;
>+            }
>             flush();
>             m_delegate.atomCharacterClassBuiltIn(classID, invert);
>+            sawCharacterClass = true;
>         }

(Add parentheses per the earlier comment.)

I had trouble understanding both parts of your condition here -- suggest adding something noting that the first part handles things like [a-\d] and the second part handles things like [\w-\s].  For that matter, the most obvious place for any of these checks, in |case cachedCharacter|, is only hit (I think) in the case of [\w-a] or [\w-\b] -- may as well add a comment to that effect there, too.

Attachment #485837 - Flags: review?(jwalden+bmo) → review+

Jeff Walden [:Waldo]

Comment 12

•

14 years ago

For what it's worth, I found the intricate handoffs between parsing escapes, parsing non-escapes, and the state of CharacterClassDelegate to be way too confusing.  But I am loath to suggest significant adjustment as it seems likely to complicate the maintenance burden upstream updates would cause in concert with such changes.

Chris Leary [:cdleary] (not checking bugmail)

Assignee

Comment 13

•

14 years ago

http://hg.mozilla.org/tracemonkey/rev/8ae5fce0f19b

Whiteboard: fixed-in-tracemonkey

Chris Leary [:cdleary] (not checking bugmail)

Assignee

Comment 14

•

14 years ago

Backout: http://hg.mozilla.org/tracemonkey/rev/f4444a398ec1

Somehow I missed bug 351463: with a hyphen between two character classes we are intentionally not compliant to the spec.

Whiteboard: fixed-in-tracemonkey

Robert Sayre

Comment 15

•

14 years ago

http://hg.mozilla.org/mozilla-central/rev/8ae5fce0f19b

Status: ASSIGNED → RESOLVED

Closed: 14 years ago

Resolution: --- → FIXED

Chris Leary [:cdleary] (not checking bugmail)

Assignee

Comment 16

•

14 years ago

Reopening because it was backed out -- I still have to follow up by committing with the two-character-class range removed as a syntax error.

Status: RESOLVED → REOPENED

Resolution: FIXED → ---

David Mandelin [:dmandelin]

Comment 17

•

14 years ago

It looks like this bug was almost fixed. What's left to do on it?

Robert Sayre

Updated

•

14 years ago

blocking2.0: betaN+ → -

status2.0: --- → wanted

Chris Leary [:cdleary] (not checking bugmail)

Assignee

Comment 18

•

14 years ago

I made us conform to our old, 1.9.2 behavior, so the m_sawCharacterClass was unnecessary after all.

http://hg.mozilla.org/tracemonkey/rev/08f2504e5f44

Whiteboard: fixed-in-tracemonkey

Chris Leary [:cdleary] (not checking bugmail)

Assignee

Comment 19

•

14 years ago

http://hg.mozilla.org/mozilla-central/rev/08f2504e5f44

Status: REOPENED → RESOLVED

Closed: 14 years ago → 14 years ago

Resolution: --- → FIXED

You need to log in before you can comment on or make changes to this bug.