Closed Bug 369778 Opened 17 years ago Closed 10 years ago

Javascript regular expression captures broken with alternation in some cases.

Categories

(Core :: JavaScript Engine, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla34

People

(Reporter: sthussey, Assigned: ae.anderson0)

References

(Depends on 1 open bug)

Details

(Keywords: addon-compat, dev-doc-complete, Whiteboard: [good first bug][lang=c++][DocArea=JS])

Attachments

(5 files, 2 obsolete files)

User-Agent:       Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.1) Gecko/20061204 Firefox/2.0.0.1
Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.1) Gecko/20061204 Firefox/2.0.0.1

Javascript regular expressions capturing multiple matches using alternation set some of the result's capture fields to undefined in some cases. The cause is based on the order of the alternates in the text.

Reproducible: Always

Steps to Reproduce:
Example code:
re = /(?:(ABC)|(123)){2}/;

var t = re.exec("ABC123");
var u = re.exec("123ABC");
Actual Results:  
t = ['ABC123',undefined,'123']
u = ['123ABC','ABC','123']

Expected Results:  
Captures should be identical in both cases with a possible change in order in the results array.
Assignee: nobody → general
Component: General → JavaScript Engine
Product: Firefox → Core
QA Contact: general → general
Version: unspecified → 1.8 Branch
Looks like a bug -- Brian, can you take a look in due course?

/be
Assignee: general → crowder
Here are the perl results, for my information:

$ perl -e '$tst = "123ABC"; $tst =~ /(?:(ABC)|(123)){2}/; print $& . "\n"; print $1 . "\n"; print $2 . "\n"'
123ABC
ABC
123

$ perl -e '$tst = "ABC123"; $tst =~ /(?:(ABC)|(123)){2}/; print $& . "\n"; print $1 . "\n"; print $2 . "\n"'
ABC123
ABC
123
Status: UNCONFIRMED → ASSIGNED
Ever confirmed: true
The bug here seems to be in the way that REOP_LPAREN works on the second attempt to match in the repeat set, it zeros out the capture results for parenIndex 0 because it is about to attempt to match (ABC) again, which will fail.  When that fails, the values are left in this "empty" state, even though we have already successfully matched group 0.  We are only getting lucky in the second case ("123ABC") because the first attempt to match in paren-group 0 fails, and is then overwritten by the successful match from the other group.  More to come as I understand it.
Attached patch don't overwrite last paren match (obsolete) — Splinter Review
REOP_LPAREN's bad behavior of overwriting the old capture with the assumption that whatever happens on this match is more important than whatever happened on the last match is corrected by this patch.  The captures array should (I think) only be overwritten in the case of a full backtrack, not (as in this case) for a speculative match occurring as the result of a REPEAT operation.  We WILL overwrite this capture if the second match -succeeds- (I'm not sure I can conceive of an ALT/REPEAT situation such as this one where that is possible AND yields different results, but it is perhaps likely -- if someone can concoct such a case, I'll try to match perl's behavior, if this is not that).

This should get the full js test suite test action before it even thinks about landing.
Attachment #254624 - Flags: review?(mrbkap)
Comment on attachment 254624 [details] [diff] [review]
don't overwrite last paren match

How does this handle nested parentheses? I'm not sure, just by looking at the code, that startcp won't get stepped on by something else. Or does everything that we care about assign the right thing into startcp?
Actually that's a good point.  I might need to add something equivalent to startcp to the RECapture structure, and stick that there.  Let look into it.
Attached patch should handle nesting (obsolete) — Splinter Review
Thanks for the obvious catch, mrbkap.  This should be ok; here's my reasoning:
If you're in an LPAREN op, cap->startcp will always be set for the current parenIndex (and every matching paren group has its own index).  The data is only ever used if the RPAREN op executes (ie., at the end of a successful match of everything inside the paren-group), and the index/length members from which dependent strings are eventually build are only ever modified on successful matches.  This should both fix this bug AND survive nested matching paren-groups.
Attachment #254624 - Attachment is obsolete: true
Attachment #254953 - Flags: review?(mrbkap)
Attachment #254624 - Flags: review?(mrbkap)
I fixed a few ugly comments with this patch, too.
Comment on attachment 254953 [details] [diff] [review]
should handle nesting

Seems reasonable.
Attachment #254953 - Flags: review?(mrbkap) → review+
Comment on attachment 254953 [details] [diff] [review]
should handle nesting

Damn.  This actually causes some trouble with tests in the JS suite:

These are the ones that fail with this patch applied:
ecma_3/RegExp/15.10.2-1.js
ecma_3/RegExp/perlstress-001.js
ecma_3/RegExp/perlstress-002.js
ecma_3/RegExp/regress-123437.js
ecma_3/RegExp/regress-165353.js
ecma_3/RegExp/regress-202564.js
ecma_3/RegExp/regress-209919.js
ecma_3/RegExp/regress-24712.js
ecma_3/RegExp/regress-31316.js
ecma_3/RegExp/regress-87231.js

Back to the drawing board.
Attachment #254953 - Attachment is obsolete: true
Some of the tests themselves may be wrong.  I'm trying to figure out what perl does:

Testcase ecma_3/RegExp/15.10.2-1.js failed Bug Number (none)
STATUS: RegExp conformance test
Failure messages were:
FAILED!: [reported from test()] Section 8 of test -
FAILED!: [reported from test()] regexp = /(a*)*/
FAILED!: [reported from test()] string = 'b'
FAILED!: [reported from test()] ERROR !!! regexp failed to give expected match array:
FAILED!: [reported from test()] Expect: ["", undefined]
FAILED!: [reported from test()] Actual: ["", ""]

Perl returns $& and $1 as "", and "" respectively, and even with -w does not consider either to be an uninitialized value.


FAILED!: [reported from test()] Section 12 of test -
FAILED!: [reported from test()] regexp = /(.*?)a(?!(a+)b\2c)\2(.*)/
FAILED!: [reported from test()] string = 'baaabaac'
FAILED!: [reported from test()] ERROR !!! regexp failed to give expected match array:
FAILED!: [reported from test()] Expect: ["baaabaac", "ba", undefined, "abaac"]
FAILED!: [reported from test()] Actual: ["baaabaac", "ba", "a", "baac"]

perl -we '"baaabaac" =~ (/(.*?)a(?!(a+)b\2c)\2(.*)/); print "[\"$&\", \"$1\", \"$2\", \"$3\"]\n"'
["baaabaac", "ba", "a", "baac"]


These testcases, at least, seem wrong to me.  I will come back with more after I've figured out why I am failing some of the others that -aren't- wrong.
  /(?:(ABC)|(123)){2}/.exec("ABC123").toSource()
should return
  ["ABC123", (void 0), "123"]
and
  /(?:(ABC)|(123)){2}/.exec("123ABC").toSource()
should return
  ["123ABC", "ABC", (void 0)]
because each repetition of a quantifier resets the inner captures to undefined (step 4 in the definition of RepeatMatcher).

  /(a*)*/.exec("b").toSource()
should return
  ["", (void 0)]
because a quantifier with minimum 0 cannot match an empty string (unlike Perl, which matches it once).

  /(.*?)a(?!(a+)b\2c)\2(.*)/.exec().toSource()
should return
  ["baaabaac", "ba", (void 0), "abaac"]
because captures inside (?!) aren't visible outside (unlike (?=)).

The testcases in ecma_3/RegExp/15.10.2-1.js are examples from the spec, where they are explained.
This patch should make the behavior ECMA conformant. I would recommend to check it in after 1.9, so that any real world problems with it can be detected. If there are serious problems or ECMA decides to revise the standard, we can go for Perl compatible behavior later. Compare bug 351349.

Patch passes the testsuite.
Assignee: crowder → x00000000
Status: ASSIGNED → NEW
Flags: wanted1.9.2?
Flags: in-testsuite?
These bugs are all part of a search I made for js bugs that are getting lost in transit:

http://tinyurl.com/jsDeadEndBugs

They all have a review+'ed, non-obsoleted patch and are not marked fixed-in-tracemonkey or checkin-needed but have not seen any activity in 300 days. Some of these got lost simply because the assignee/patch provider never requested a checkin, or just because they were forgotten about.
This still an issue with YARR?
Flags: wanted1.9.2?
This works properly in YARR. Taking assignment so I remember to check in the test case post-FF4.
Assignee: x00000000 → cdleary
Status: NEW → ASSIGNED
Mass-reassigning cdleary's bugs to default. He won't work on any of them, anymore. I guess, at least.

@cdleary: shout if you take issue with this.
Assignee: cdleary → general
Status: ASSIGNED → NEW
This bug is open just because of the testcase.
This seems like a good first bug, with some IRC #jseng guidance.

/be
Whiteboard: [good first bug]
bhackett: Will this difference between SpiderMonkey's and V8's regex capturing still be relevant after your irregexp port?

Here is a simple test case from @kangax [1]:

  'x'.replace(/x(.)?/g, function(m, group) { alert("'group:" + group + "'"); })

displays 'group:' in Firefox, but 'group:undefined' in Chrome/WebKit.

[1] https://twitter.com/kangax/status/461538545388511232
Flags: needinfo?(bhackett1024)
OS: Windows XP → All
Hardware: x86 → All
Whiteboard: [good first bug] → [good first bug][lang=c++]
Version: 1.8 Branch → Trunk
(In reply to Chris Peterson (:cpeterson) from comment #20)
> bhackett: Will this difference between SpiderMonkey's and V8's regex
> capturing still be relevant after your irregexp port?
> 
> Here is a simple test case from @kangax [1]:
> 
>   'x'.replace(/x(.)?/g, function(m, group) { alert("'group:" + group + "'");
> })
> 
> displays 'group:' in Firefox, but 'group:undefined' in Chrome/WebKit.
> 
> [1] https://twitter.com/kangax/status/461538545388511232

No, this behavior looks related to a different part of our regexp handling code.  In particular, RegExpStatics::makeMatch makes an empty string when the last match had an undefined capture, when Chrome/Safari look like they are making an undefined value.
Flags: needinfo?(bhackett1024)
There's a patch for this in bug 392378. It's from before fire was invented, so I very much doubt anything can be salvaged, but it does nicely show what needs to be done here.
Is there any reason why I could not work on this as a first bug? Utterly unfamiliar with the insides of SpiderMonkey but I have some relevant experience and I'd like to start contributing.
(In reply to Andy Anderson from comment #25)
> Is there any reason why I could not work on this as a first bug?

No reason at all, welcome and thanks for digging in!

> Utterly unfamiliar with the insides of SpiderMonkey but I have some 
> relevant experience and I'd like to start contributing.

Nice. The first thing you'll want to do is to set up a build environment that allows you to compile the browser[1] and the SpiderMonkey shell[2].

Once you have that, you can start reading some source code. I wouldn't recommend trying to understand the big picture before doing anything, as that's a pretty daunting task. Instead, you should try making localized changes based on theories on what the effects should be. For this particular bug, you should look at the code bhackett mentioned in comment 21. RegExpStatics::makeMatch can be found in `js/src/vm/RegExpStatics.h`. (All SpiderMonkey-related code can be found under `js`, and *almost* all of it under `js/src`).

The change itself is pretty trivial in this case, so the main focus will have to be on testing it. We have two test suites, one under `js/src/tests`, the other under `js/src/jit-test`. You'll want to run both of them before and after your change and see if the number of failing tests changes. It *should* be none before any changes, but there are some tests that can time out depending on your machine and the exact build configuration, and then there are some that, sadly, can fail depending on the timezone you're in. Those you can just ignore.

You should add an additional test based on the code in comment 20. Essentially, you'd assign the `group` argument from the callback function to a variable in the outer code and then use `assertEq` to compare it with an expected value. I guess it makes sense to add the test to the `tests/ecma_3/String` directory, as that was the language version where String#replace was added, apparently.

If you have any questions, join us in the #jsapi channel on irc.mozilla.org, or ask here. And once you have a patch, add it to this bug and ask me for review.


thanks!


[1]: https://developer.mozilla.org/en-US/docs/Mozilla/Developer_guide/Build_Instructions
[2]: https://developer.mozilla.org/en-US/docs/Mozilla/Projects/SpiderMonkey/Build_Documentation
Attached patch test369778.patchSplinter Review
Test case for 369778
Attached patch fix369778.patchSplinter Review
Fixes 369778
Attached patch fix220367.patchSplinter Review
Changes test for 220367 to reflect new engine behavior
I was unclear on whether disparate changes such as these should be rolled into one, so I went and made separate ones for the different changes.

Passes on my machine, let me know if there's an issue or if I missed something administrative.

And thanks for the help on irc!
Hmm, I'd be in favour of leaving the legacy RegExp statics behaviour as-is and only change the argument value for String.prototype.replace.
Comment on attachment 8454642 [details] [diff] [review]
test369778.patch

Jeff, please review all these and/or set a reviewer.
Attachment #8454642 - Flags: review?(jwalden+bmo)
What's the review status of this bug? I can't really proceed with more contributions until I've gotten a first patch through a complete review cycle.
Assignee: general → nobody
Comment on attachment 8454642 [details] [diff] [review]
test369778.patch

Review of attachment 8454642 [details] [diff] [review]:
-----------------------------------------------------------------

Sorry for the delay, I was out basically all last week, and I've been digging out of various things earlier this week.  This test looks fine to me.

Is the contention that this patch is the test, and one or more of the other patches here is the actual fix?  Guess I'll look at those and try to figure things out myself now.  In the future it'd help if that were made clearer when attaching patches.  :-)  Generally (I count security bugs as the only exception, where we might want to land *only* a patch but not the test, initially) we include the test in the patch that fixes a bug, so a build without patch, then applying the patch causes the new test to fail, then building that and testing will make it work.  Makes the connection and nature of the fix that much clearer.
Attachment #8454642 - Flags: review?(jwalden+bmo) → review+
I uploaded three attachments. 

test369778.patch tests for the new behavior; it is an entirely new test.

fix369778.patch introduces the new behavior.

fix220367.patch modifies an existing test, to reflect the new engine behavior. I received guidance from Till on IRC that this was what I needed to do.
Sorry to double post, but - 

the contention raised earlier is that because this modifies long-standing engine behavior, perhaps the new behavior should be more limited in scope than what I have submitted. I do not have strong feelings on this, but am inclined to not complicate the behavior further by adding exceptions, unless these exceptions also exist in other engines.
Depends on: 1043683
Comment on attachment 8454643 [details] [diff] [review]
fix369778.patch

Review of attachment 8454643 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/vm/RegExpStatics.h
@@ +249,5 @@
>      bool checkWhich  = checkValidIndex % 2;
>      size_t checkPair = checkValidIndex / 2;
>  
> +    if (matches.empty() || checkPair >= matches.pairCount() || matches[pairNum].isUndefined())
> +    {

SpiderMonkey style would have brace at end of line.  Also, this could use a comment explaining what it's doing, with an example.  So something like:

    // 
    if (matches.empty() || checkPair >= matches.pairCount() || matches[pairNum].isUndefined()) {

with brace at end, all on one line because it fits in 99ch, appropriately adjusted wrt bug 1043683.  (I'll do the adjusting for you, since I'm the one making the change underneath yours.  :-) )

@@ +259,1 @@
>      {

Same thing here.

...except, when you consider the patch in bug 1043683, this condition is *exactly* identical to |matches[pairNum].isUndefined()|.  So this whole block should go away.

...in which case, is there *ever* a situation where we should be returning "" here, other than if the matched characters were the empty string?  Seems like no to me.  So this block should disappear.
Attachment #8454643 - Flags: review+
Comment on attachment 8454644 [details] [diff] [review]
fix220367.patch

Review of attachment 8454644 [details] [diff] [review]:
-----------------------------------------------------------------

I'm not too worried about behavior of RegExp.* versus behavior via the array returned by exec().  We should be consistent, up until someone complains and it's demonstrated that we should have contradictory behavior in the two cases.  So I think we take these patches exactly as-is.

After you review the patch in the blocking bug, I'll combine these together and land the whole patch series.  Thanks for working on this!  And sorry for the delay in getting to looking at these patches.  :-(
Attachment #8454644 - Flags: review+
Running tests with this patch showed that the following tests (jstests only, haven't tried jit-tests, haven't tried doing a full test run on tryserver) have now started passing:

    ecma_3/String/15.5.4.11.js
    ecma_3/String/regress-392378.js

I did a little archaeology on one of them, and it basically seems to date back forever.  It appears both of these tests are marked as failing, and have been, since they were landed -- in advance of a fix -- in bug 392378.  So we need to stop expecting failure, with the patchwork here.
Looks like we have at least one case of code breaking due to expecting unmatched captures to expand to the empty string.  I think this extra patch should fix it, pushed an M5-only test run to verify, will post that patch for review if it does:

https://tbpl.mozilla.org/?tree=Try&rev=185bacfb7631
Assignee: nobody → ae.anderson0
Status: NEW → ASSIGNED
Dolske, we're changing things in this bug so if you access the matched text for a capturing group, when that capturing group didn't actually get consulted because quantifiers prevented its exercise, you get |undefined| instead of the empty string.

Simple fix for these two places: just put quantifiers inside parentheses so the group will match the no-character sequence, rather than not be matched at all.

(:)? thus turns into (:?), with the overall pattern matching exactly the same things.

(.*[^&])? we turn into ([^&]*).  This is not strictly identical in that . doesn't match \n:

js> /(.*[^&])?/.exec("\n\n\n\n\n")[1].length
1
js> /([^&]*)/.exec("\n\n\n\n\n")[1].length
5

If we're really and truly paranoid, we could get exactly identical matching with more groups, ((?:[^\n&]*[^&])?) or so.  But it doesn't seem necessary.  People aren't going to put repeated newlines just before accesskey specifiers in button label text.
Attachment #8462408 - Flags: review?(dolske)
Assignee: ae.anderson0 → jwalden+bmo
Assignee: jwalden+bmo → ae.anderson0
Comment on attachment 8462408 [details] [diff] [review]
Fix CommonDialog.jsm RegExp usage expecting an unmatched capture group to match as "" and not |undefined|

Review of attachment 8462408 [details] [diff] [review]:
-----------------------------------------------------------------

No need for paranoia here. These are always going to be strings from either us or an addon, and if someone is doing that they are bad and should feel bad.

Also this (existing) code is gross, looks like it dates back the original ancient commonDialog code. *shudder*
Attachment #8462408 - Flags: review?(dolske) → review+
https://hg.mozilla.org/integration/mozilla-inbound/rev/22cfcffb5a8b
https://hg.mozilla.org/integration/mozilla-inbound/rev/eb01c53fd879

Unfortunately I landed the patches out of order -- the CommonDialog.jsm change should have landed before the actual behavior change -- but too late to do anything about it now.  :-(  Hopefully no one will get so supremely unlucky about bisect points as to hit it, and be testing specifically M5 when doing said bisection.

We'll probably want to keep our eyes peeled for any code expecting unmatched captures to expand to the empty string instead of to |undefined|.  Probably won't be many, given every other engine seems to produce |undefined| in this bug's scenarios, but it's a distinct possibility we'll see some in Gecko-specific code paths or extensions.

As far as docs go: I'm not sure what the best way to document this is.  Really it probably should be described in a JS guide that discusses RegExps most generally, rather than on the specific JS reference pages that happen to be affected by it.  (Those being for RegExp.prototype.exec, String.prototype.match, RegExp.prototype.test, String.prototype.replace, and probably a few others as well.  It's really a general RegExp issue, not something particular to one method or another's behavior.)
https://hg.mozilla.org/mozilla-central/rev/22cfcffb5a8b
https://hg.mozilla.org/mozilla-central/rev/eb01c53fd879
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla34
The tests we had to change to land this are adequate testing for this bug.
Flags: in-testsuite? → in-testsuite+
Depends on: 1048000
Whiteboard: [good first bug][lang=c++] → [good first bug][lang=c++][DocArea=JS]
Depends on: 1054910
You need to log in before you can comment on or make changes to this bug.