Closed Bug 613820 Opened 14 years ago Closed 13 years ago

Regular expression corner-case back-reference issue, forwards ref in quantified parens

Categories

(Core :: JavaScript Engine, defect)

x86
macOS
defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla8

People

(Reporter: barraclough, Assigned: cdleary)

References

()

Details

(Whiteboard: fixed-in-tracemonkey)

Attachments

(1 file)

User-Agent:       Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10_7; en-us) AppleWebKit/534.2 (KHTML, like Gecko) Version/5.0 Safari/534.2
Build Identifier: Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.7; en-US; rv:1.9.2.12) Gecko/20101026 Firefox/3.6.12

Small ES5 compatibility issue (I think - regular expressions are kinda complicated!).

The problem is with back-references in regular expressions, where the back-reference is actually a forwards-reference (references a sub-pattern after itself in the expression), and where the back-reference and the sub-expression being referred to exist within a quantified (looping) set of parentheses.

The test case given contains two sets of parentheses, where the outer parentheses are quantified to be matched twice.  On their first iteration they clearly match "a", the question is whether the second iteration matches "a" or "aa".

In Firefox, in subsequent iterations of the loop, the back-reference will match against the value of the sub-expression from the previous iteration, but following the spec I believe that the back-reference should be matching the empty string.

Behaviour is specified in ECMA-262 section 15.10.2.5.  In the description of the RepeatMatcher, step 2 describes closure 'd' used by the RepeatMatcher to match repeated iterations of a quantified term.  In steps 7, 8c and 9 of the RepeatMatcher calls are made to closure 'd' to match repeated iterations of the term.  In all cases a newly constructed state, 'xr', is passed to the closure.  In state 'xr' all sub-pattern captures have been reset to undefined (per step 4 of RepeatMatcher).  As such, an attempt to match the back-reference in a subsequent iteration of the quantified term should be doing so with respect to state 'xr', in which the sub-pattern is undefined, and as such the back-reference should match the empty string (per NOTE on 15.10.2.9).

Phew! - hope my parsing of the spec is all sound here. :-)

So, given all this, I believe both iterations of the quantified expression above should be matching "a", and the whole pattern match should be "aa".



Reproducible: Always

Steps to Reproduce:
javascript:alert(/(\2(a)){2}/.exec("aaa"))

Actual Results:  
["aaa", "aa", "a"]


Expected Results:  
["aa", "a", "a"]
Hrrrm, this bug may not be an issue - might be fixed with the new regex engine.  Looks from other bugs like firefox might not just be adopting YARR JIT, but YARR interpreter too.  If so, this should not be a bug with YARR interpreter enabled.
Thanks for the report, Gavin! Will check it out -- we're not using YARR-interp, just YARR-JIT and PCRE. We had to make some modifications to PCRE in order to set nested capture groups to become undefined in subsequent iterations, so the difference may stem from that. (For example, our reftests state that in /(?:(f)(o)(o)|(b)(a)(r))*/.exec('foobar') capture groups 1-3 wind up as undefined, which I don't believe is part of the spec.)
Assignee: general → cdleary
Status: UNCONFIRMED → ASSIGNED
Ever confirmed: true
Hi Chris,

I could be wrong here, but I think that the behaviour you describe (the 'foobar' example) is correct in Firefox, and that this behaviour *is* also required by the spec.  See section 15.10.2.5, NOTE3 which contains the following example:

    javascript:alert(/(z)((a+)?(b+)?(c))*/.exec("zaacbbbcac") == "zaacbbbcac,z,ac,a,,c")

Failing to reset capture groups between iterations of the outer parentheses would result in the b's being captured too ("zaacbbbcac,z,ac,a,bbb,c"), which the spec explicitly states would be wrong.  I believe this behaviour rises from steps 3 & 4 of the description of the RepeatMatcher.

Here's a slightly expanded version of my test case, which I think might be a little more helpful:

    javascript:alert(/(?:^(a)|\1(a)|(ab)){2}/.exec("aab"))

The outer set of (non-capturing) parentheses are quantified to iterate twice, and on the first iteration the first alternative in the parentheses match ("^(a)").  In the second iteration the first alternative fails to match (due to the begin-of-line assertion), and we go to the second alternative ("\1(a)").  On Firefox this alternative also fails to match, and instead we match the third alternative ("(ab)").  However looking at the results, there is no obvious reason for the second alternative to fail to match.  It contains a back-reference to subpattern 1, which is the 'a' matched in the previous iteration.  But by this point the subpattern captures should already have been reset (per step 4 of the RepeatMatcher), and the "\1" should match "".

From looking at the output of the regex, Firefox is correctly resetting subpattern 1 to undefined.  Failing to match due to subpattern 1, but showing subpattern 1 as undefined in the output array seems a little self-inconsistent!

I wonder if it's just resetting the subpatterns a little too late? – perhaps the subpatterns are reset only when the second iteration has completed matching and wants to write its results to the output array, rather than being reset prior to running the second iteration of the match?

Hope this helps!
G.
All good with the current regexp engine state.
Attachment #537314 - Flags: review?(dmandelin)
Attachment #537314 - Flags: review?(dmandelin) → review+
This didn't make it into mozilla-central. I'll land it now.
http://hg.mozilla.org/mozilla-central/rev/612b0115095f
Status: ASSIGNED → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla8
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: