Closed
Bug 576837
Opened 15 years ago
Closed 15 years ago
YARR allows what seems like a bogus character-class range
Categories
(Core :: JavaScript Engine, defect)
Tracking
()
RESOLVED
FIXED
People
(Reporter: jruderman, Assigned: cdleary)
References
Details
(Keywords: testcase, Whiteboard: fixed-in-tracemonkey)
Attachments
(1 file, 1 obsolete file)
12.32 KB,
patch
|
Waldo
:
review+
|
Details | Diff | Splinter Review |
/[C-\s]/;
YARR: accepts it
OLD: "invalid range in character class"
Reporter | ||
Comment 1•15 years ago
|
||
Same with /[f-\d]/
![]() |
||
Updated•15 years ago
|
blocking2.0: --- → ?
Updated•15 years ago
|
Assignee: general → cdleary
blocking2.0: ? → betaN+
Assignee | ||
Comment 2•15 years ago
|
||
TL;DR: YARR wrong. Will fix.
YARR treats the dash as being a member of the character class (brackets) when it precedes a built-in character class (like \s).
js> /[C-\s]/.exec(' ')
[" "]
js> /[C-\s]/.exec('-')
["-"]
This string is accepted by the grammar by the following derivation, so it's not a syntax error due to the grammar:
...
-> Atom
-> CharacterClass
-> "[" ClassRanges "]" (remaining: "C-\s")
-> NonEmptyClassRanges
-> ClassAtom "-" ClassAtom ClassRanges
-> ClassAtom "-" ClassAtom [empty]
-> ClassAtomNoDash "-" ClassAtomNoDash
-> SourceCharacter "-" "\" ClassEscape
-> "C" "-" "\" CharacterClassEscape
-> "C" "-" "\" "s" (accepted)
However, 15.10 tacks additional restrictions onto the grammar; namely, ECMAScriptv5 15.10.2.15 production NonemptyClassRanges :: ClassAtom - ClassAtom ClassRanges calls the abstract operation CharacterRange(A, B) in step 4, and CharacterRange step 1 says that if B does not contain exactly one character then a SyntaxError must be thrown, and all the builtin character classes contain more than one character.
Assignee | ||
Comment 3•15 years ago
|
||
Produces the same error message for us as the rest of the character class badnesses, but I added a new error code for upstreamy goodness.
Attachment #482286 -
Flags: review?(jwalden+bmo)
Comment 4•15 years ago
|
||
Comment on attachment 482286 [details] [diff] [review]
Fix character class range with builtin character class on RHS.
[jwalden@find-waldo-now src]$ dbg/js
js> /[C-\s]/.exec("-")
invalid range in character class
js> /[\s-C]/.exec("-")
["-"]
Test needs to make sure to address the reversed cases, too.
Attachment #482286 -
Flags: review?(jwalden+bmo) → review-
Assignee | ||
Comment 5•15 years ago
|
||
The restrictions on ClassAtomNoDash are also not enforced because the character class parser is a very simple DFA. Any non-descending single-character range is considered valid. Check out this awesome pattern:
/[--/]/.exec('.')
["."]
Assignee | ||
Comment 6•15 years ago
|
||
(In reply to comment #5)
> Check out this awesome pattern:
Sorry, that pattern is actually valid! The restrictions on ClassAtomNoDash's SourceCharacter (but not one of '\' or ']' or '-') are still suspect, just have to come up with a pattern that triggers it.
Assignee | ||
Comment 7•15 years ago
|
||
Eh, I think it's okay. Expanding the productions under NonEmptyClassRanges, you get the following regular patterns, with CA short for ClassAtom, CAND short for ClassAtomNoDash, and ... indicating a permitted top-level recursion:
CA
CA CA
CA CAND* CA
CA CAND - CA ...
CA - CA ...
CA CAND - CA ...
The semantic errors for the |CAND - CA| and |CA - CA| forms are the same, so we should be able to figure out all the appropriate places for a dash with mode bits, as we do now. I'll just fix the error and shut up now. :-)
Assignee | ||
Comment 8•15 years ago
|
||
No need to cache/flush the character class like we do the unescaped characters, since a range with a builtin character class in it is always a syntax error.
Attachment #482286 -
Attachment is obsolete: true
Attachment #485837 -
Flags: review?(jwalden+bmo)
Assignee | ||
Comment 9•15 years ago
|
||
Comment on attachment 485837 [details] [diff] [review]
Fix character class range with builtin character class on any side.
Hmm, I think /\s\w-\W/ should be failing as well.
Attachment #485837 -
Flags: review?(jwalden+bmo)
Assignee | ||
Comment 10•15 years ago
|
||
Comment on attachment 485837 [details] [diff] [review]
Fix character class range with builtin character class on any side.
That's because I didn't put it in a character class. Duh. (One of those days.)
Attachment #485837 -
Flags: review?(jwalden+bmo)
Assignee | ||
Updated•15 years ago
|
Status: NEW → ASSIGNED
Comment 11•15 years ago
|
||
Comment on attachment 485837 [details] [diff] [review]
Fix character class range with builtin character class on any side.
>diff --git a/js/src/jit-test/tests/basic/bug576837-regexp.js b/js/src/jit-test/tests/basic/bug576837-regexp.js
>+function isRegExpSyntaxError(pattern) {
>+ try {
>+ var re = new RegExp(pattern);
>+ } catch (e) {
>+ if (e instanceof SyntaxError)
>+ return true;
>+ }
>+ return false;
>+}
>+
>+function testRangeSyntax(end1, end2, shouldFail) {
>+ var makePattern = function(e1, e2) {
>+ var pattern = '[' + e1 + '-' + e2 + ']';
>+ print(uneval(pattern));
>+ return pattern;
>+ };
>+ assertEq(isRegExpSyntaxError(makePattern(end1, end2)), shouldFail);
>+ assertEq(isRegExpSyntaxError(makePattern(end2, end1)), shouldFail);
>+}
>+
>+testRangeSyntax('C', '\\s', true);
>+testRangeSyntax('C', '\\d', true);
>+testRangeSyntax('C', '\\W', true);
>+testRangeSyntax('C', '\\W', true);
>+testRangeSyntax('C', '', false);
>+testRangeSyntax('C', 'C', false);
>+testRangeSyntax('\\s', '\\s', true);
>+testRangeSyntax('\\W', '\\s', true);
Plain true/false don't do it for me for readability, suggest checkRangeValid/checkRangeInvalid methods forwarding to one common method.
Add in some tests with \b and \B since those aren't treated as character class escapes.
>diff --git a/js/src/yarr/yarr/RegexParser.h b/js/src/yarr/yarr/RegexParser.h
> class CharacterClassParserDelegate {
> public:
> CharacterClassParserDelegate(Delegate& delegate, ErrorCode& err)
> : m_delegate(delegate)
> , m_err(err)
> , m_state(empty)
>+ , sawCharacterClass(false)
> {
> }
Good style appears to suggest m_sawCharacterClass, to make upstreaming easier.
> case cachedCharacter:
>+ if (m_character == '-' && sawCharacterClass) {
>+ m_err = CharacterClassRangeSingleChar;
>+ m_state = empty;
>+ break;
>+ } else {
>+ sawCharacterClass = false;
>+ }
>+
> if (ch == '-')
> m_state = cachedCharacterHyphen;
> else {
> m_delegate.atomCharacterClassAtom(m_character);
> m_character = ch;
> }
> break;
As far as good style goes, it appears that this code parenthesizes equality expressions that aren't the whole of conditions -- so |((m_character == '-') && m_sawCharacterClass)|.
I think this would be more readable if you didn't put the sawCharacterClass in the else-block. It makes more sense (even if one doesn't subscribe to else-after-return being evil) to say that *if* the just-passed character class starts a range, *then* it's an error and you exit. But the sawCharacterClass assignment applies to both arms of the if below, so it seems it makes more sense to something like this:
>+ if (m_character == '-' && sawCharacterClass) {
>+ m_err = CharacterClassRangeSingleChar;
>+ m_state = empty;
>+ break;
>+ }
>+
>+ sawCharacterClass = false;
> if (ch == '-')
> m_state = cachedCharacterHyphen;
> else {
> m_delegate.atomCharacterClassAtom(m_character);
> m_character = ch;
> }
> break;
> void atomBuiltInCharacterClass(BuiltInCharacterClassID classID, bool invert)
> {
>+ if (m_state == cachedCharacterHyphen ||
>+ (sawCharacterClass && m_state == cachedCharacter && m_character == '-')) {
>+ // If the RHS of a range does not contain exacly one character then a SyntaxError
>+ // must be thrown.
>+ // Assumes none of the built in character classes contain a single character.
>+ m_err = CharacterClassRangeSingleChar;
>+ m_state = empty;
>+ return;
>+ }
> flush();
> m_delegate.atomCharacterClassBuiltIn(classID, invert);
>+ sawCharacterClass = true;
> }
(Add parentheses per the earlier comment.)
I had trouble understanding both parts of your condition here -- suggest adding something noting that the first part handles things like [a-\d] and the second part handles things like [\w-\s]. For that matter, the most obvious place for any of these checks, in |case cachedCharacter|, is only hit (I think) in the case of [\w-a] or [\w-\b] -- may as well add a comment to that effect there, too.
Attachment #485837 -
Flags: review?(jwalden+bmo) → review+
Comment 12•15 years ago
|
||
For what it's worth, I found the intricate handoffs between parsing escapes, parsing non-escapes, and the state of CharacterClassDelegate to be way too confusing. But I am loath to suggest significant adjustment as it seems likely to complicate the maintenance burden upstream updates would cause in concert with such changes.
Assignee | ||
Comment 13•15 years ago
|
||
Whiteboard: fixed-in-tracemonkey
Assignee | ||
Comment 14•15 years ago
|
||
Backout: http://hg.mozilla.org/tracemonkey/rev/f4444a398ec1
Somehow I missed bug 351463: with a hyphen between two character classes we are intentionally not compliant to the spec.
Whiteboard: fixed-in-tracemonkey
Comment 15•15 years ago
|
||
Status: ASSIGNED → RESOLVED
Closed: 15 years ago
Resolution: --- → FIXED
Assignee | ||
Comment 16•15 years ago
|
||
Reopening because it was backed out -- I still have to follow up by committing with the two-character-class range removed as a syntax error.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Comment 17•15 years ago
|
||
It looks like this bug was almost fixed. What's left to do on it?
Updated•15 years ago
|
Assignee | ||
Comment 18•15 years ago
|
||
I made us conform to our old, 1.9.2 behavior, so the m_sawCharacterClass was unnecessary after all.
http://hg.mozilla.org/tracemonkey/rev/08f2504e5f44
Whiteboard: fixed-in-tracemonkey
Assignee | ||
Comment 19•15 years ago
|
||
Status: REOPENED → RESOLVED
Closed: 15 years ago → 15 years ago
Resolution: --- → FIXED
You need to log in
before you can comment on or make changes to this bug.
Description
•