Closed
Bug 576837
Opened 14 years ago
Closed 14 years ago
YARR allows what seems like a bogus character-class range
Categories
(Core :: JavaScript Engine, defect)
Tracking
()
RESOLVED
FIXED
People
(Reporter: jruderman, Assigned: cdleary)
References
Details
(Keywords: testcase, Whiteboard: fixed-in-tracemonkey)
Attachments
(1 file, 1 obsolete file)
12.32 KB,
patch
|
Waldo
:
review+
|
Details | Diff | Splinter Review |
/[C-\s]/; YARR: accepts it OLD: "invalid range in character class"
Reporter | ||
Comment 1•14 years ago
|
||
Same with /[f-\d]/
Updated•14 years ago
|
blocking2.0: --- → ?
Updated•14 years ago
|
Assignee: general → cdleary
blocking2.0: ? → betaN+
Assignee | ||
Comment 2•14 years ago
|
||
TL;DR: YARR wrong. Will fix. YARR treats the dash as being a member of the character class (brackets) when it precedes a built-in character class (like \s). js> /[C-\s]/.exec(' ') [" "] js> /[C-\s]/.exec('-') ["-"] This string is accepted by the grammar by the following derivation, so it's not a syntax error due to the grammar: ... -> Atom -> CharacterClass -> "[" ClassRanges "]" (remaining: "C-\s") -> NonEmptyClassRanges -> ClassAtom "-" ClassAtom ClassRanges -> ClassAtom "-" ClassAtom [empty] -> ClassAtomNoDash "-" ClassAtomNoDash -> SourceCharacter "-" "\" ClassEscape -> "C" "-" "\" CharacterClassEscape -> "C" "-" "\" "s" (accepted) However, 15.10 tacks additional restrictions onto the grammar; namely, ECMAScriptv5 15.10.2.15 production NonemptyClassRanges :: ClassAtom - ClassAtom ClassRanges calls the abstract operation CharacterRange(A, B) in step 4, and CharacterRange step 1 says that if B does not contain exactly one character then a SyntaxError must be thrown, and all the builtin character classes contain more than one character.
Assignee | ||
Comment 3•14 years ago
|
||
Produces the same error message for us as the rest of the character class badnesses, but I added a new error code for upstreamy goodness.
Attachment #482286 -
Flags: review?(jwalden+bmo)
Comment 4•14 years ago
|
||
Comment on attachment 482286 [details] [diff] [review] Fix character class range with builtin character class on RHS. [jwalden@find-waldo-now src]$ dbg/js js> /[C-\s]/.exec("-") invalid range in character class js> /[\s-C]/.exec("-") ["-"] Test needs to make sure to address the reversed cases, too.
Attachment #482286 -
Flags: review?(jwalden+bmo) → review-
Assignee | ||
Comment 5•14 years ago
|
||
The restrictions on ClassAtomNoDash are also not enforced because the character class parser is a very simple DFA. Any non-descending single-character range is considered valid. Check out this awesome pattern: /[--/]/.exec('.') ["."]
Assignee | ||
Comment 6•14 years ago
|
||
(In reply to comment #5) > Check out this awesome pattern: Sorry, that pattern is actually valid! The restrictions on ClassAtomNoDash's SourceCharacter (but not one of '\' or ']' or '-') are still suspect, just have to come up with a pattern that triggers it.
Assignee | ||
Comment 7•14 years ago
|
||
Eh, I think it's okay. Expanding the productions under NonEmptyClassRanges, you get the following regular patterns, with CA short for ClassAtom, CAND short for ClassAtomNoDash, and ... indicating a permitted top-level recursion: CA CA CA CA CAND* CA CA CAND - CA ... CA - CA ... CA CAND - CA ... The semantic errors for the |CAND - CA| and |CA - CA| forms are the same, so we should be able to figure out all the appropriate places for a dash with mode bits, as we do now. I'll just fix the error and shut up now. :-)
Assignee | ||
Comment 8•14 years ago
|
||
No need to cache/flush the character class like we do the unescaped characters, since a range with a builtin character class in it is always a syntax error.
Attachment #482286 -
Attachment is obsolete: true
Attachment #485837 -
Flags: review?(jwalden+bmo)
Assignee | ||
Comment 9•14 years ago
|
||
Comment on attachment 485837 [details] [diff] [review] Fix character class range with builtin character class on any side. Hmm, I think /\s\w-\W/ should be failing as well.
Attachment #485837 -
Flags: review?(jwalden+bmo)
Assignee | ||
Comment 10•14 years ago
|
||
Comment on attachment 485837 [details] [diff] [review] Fix character class range with builtin character class on any side. That's because I didn't put it in a character class. Duh. (One of those days.)
Attachment #485837 -
Flags: review?(jwalden+bmo)
Assignee | ||
Updated•14 years ago
|
Status: NEW → ASSIGNED
Comment 11•14 years ago
|
||
Comment on attachment 485837 [details] [diff] [review] Fix character class range with builtin character class on any side. >diff --git a/js/src/jit-test/tests/basic/bug576837-regexp.js b/js/src/jit-test/tests/basic/bug576837-regexp.js >+function isRegExpSyntaxError(pattern) { >+ try { >+ var re = new RegExp(pattern); >+ } catch (e) { >+ if (e instanceof SyntaxError) >+ return true; >+ } >+ return false; >+} >+ >+function testRangeSyntax(end1, end2, shouldFail) { >+ var makePattern = function(e1, e2) { >+ var pattern = '[' + e1 + '-' + e2 + ']'; >+ print(uneval(pattern)); >+ return pattern; >+ }; >+ assertEq(isRegExpSyntaxError(makePattern(end1, end2)), shouldFail); >+ assertEq(isRegExpSyntaxError(makePattern(end2, end1)), shouldFail); >+} >+ >+testRangeSyntax('C', '\\s', true); >+testRangeSyntax('C', '\\d', true); >+testRangeSyntax('C', '\\W', true); >+testRangeSyntax('C', '\\W', true); >+testRangeSyntax('C', '', false); >+testRangeSyntax('C', 'C', false); >+testRangeSyntax('\\s', '\\s', true); >+testRangeSyntax('\\W', '\\s', true); Plain true/false don't do it for me for readability, suggest checkRangeValid/checkRangeInvalid methods forwarding to one common method. Add in some tests with \b and \B since those aren't treated as character class escapes. >diff --git a/js/src/yarr/yarr/RegexParser.h b/js/src/yarr/yarr/RegexParser.h > class CharacterClassParserDelegate { > public: > CharacterClassParserDelegate(Delegate& delegate, ErrorCode& err) > : m_delegate(delegate) > , m_err(err) > , m_state(empty) >+ , sawCharacterClass(false) > { > } Good style appears to suggest m_sawCharacterClass, to make upstreaming easier. > case cachedCharacter: >+ if (m_character == '-' && sawCharacterClass) { >+ m_err = CharacterClassRangeSingleChar; >+ m_state = empty; >+ break; >+ } else { >+ sawCharacterClass = false; >+ } >+ > if (ch == '-') > m_state = cachedCharacterHyphen; > else { > m_delegate.atomCharacterClassAtom(m_character); > m_character = ch; > } > break; As far as good style goes, it appears that this code parenthesizes equality expressions that aren't the whole of conditions -- so |((m_character == '-') && m_sawCharacterClass)|. I think this would be more readable if you didn't put the sawCharacterClass in the else-block. It makes more sense (even if one doesn't subscribe to else-after-return being evil) to say that *if* the just-passed character class starts a range, *then* it's an error and you exit. But the sawCharacterClass assignment applies to both arms of the if below, so it seems it makes more sense to something like this: >+ if (m_character == '-' && sawCharacterClass) { >+ m_err = CharacterClassRangeSingleChar; >+ m_state = empty; >+ break; >+ } >+ >+ sawCharacterClass = false; > if (ch == '-') > m_state = cachedCharacterHyphen; > else { > m_delegate.atomCharacterClassAtom(m_character); > m_character = ch; > } > break; > void atomBuiltInCharacterClass(BuiltInCharacterClassID classID, bool invert) > { >+ if (m_state == cachedCharacterHyphen || >+ (sawCharacterClass && m_state == cachedCharacter && m_character == '-')) { >+ // If the RHS of a range does not contain exacly one character then a SyntaxError >+ // must be thrown. >+ // Assumes none of the built in character classes contain a single character. >+ m_err = CharacterClassRangeSingleChar; >+ m_state = empty; >+ return; >+ } > flush(); > m_delegate.atomCharacterClassBuiltIn(classID, invert); >+ sawCharacterClass = true; > } (Add parentheses per the earlier comment.) I had trouble understanding both parts of your condition here -- suggest adding something noting that the first part handles things like [a-\d] and the second part handles things like [\w-\s]. For that matter, the most obvious place for any of these checks, in |case cachedCharacter|, is only hit (I think) in the case of [\w-a] or [\w-\b] -- may as well add a comment to that effect there, too.
Attachment #485837 -
Flags: review?(jwalden+bmo) → review+
Comment 12•14 years ago
|
||
For what it's worth, I found the intricate handoffs between parsing escapes, parsing non-escapes, and the state of CharacterClassDelegate to be way too confusing. But I am loath to suggest significant adjustment as it seems likely to complicate the maintenance burden upstream updates would cause in concert with such changes.
Assignee | ||
Comment 13•14 years ago
|
||
http://hg.mozilla.org/tracemonkey/rev/8ae5fce0f19b
Whiteboard: fixed-in-tracemonkey
Assignee | ||
Comment 14•14 years ago
|
||
Backout: http://hg.mozilla.org/tracemonkey/rev/f4444a398ec1 Somehow I missed bug 351463: with a hyphen between two character classes we are intentionally not compliant to the spec.
Whiteboard: fixed-in-tracemonkey
Comment 15•14 years ago
|
||
http://hg.mozilla.org/mozilla-central/rev/8ae5fce0f19b
Status: ASSIGNED → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Assignee | ||
Comment 16•14 years ago
|
||
Reopening because it was backed out -- I still have to follow up by committing with the two-character-class range removed as a syntax error.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Comment 17•14 years ago
|
||
It looks like this bug was almost fixed. What's left to do on it?
Updated•14 years ago
|
Assignee | ||
Comment 18•14 years ago
|
||
I made us conform to our old, 1.9.2 behavior, so the m_sawCharacterClass was unnecessary after all. http://hg.mozilla.org/tracemonkey/rev/08f2504e5f44
Whiteboard: fixed-in-tracemonkey
Assignee | ||
Comment 19•14 years ago
|
||
http://hg.mozilla.org/mozilla-central/rev/08f2504e5f44
Status: REOPENED → RESOLVED
Closed: 14 years ago → 14 years ago
Resolution: --- → FIXED
You need to log in
before you can comment on or make changes to this bug.
Description
•