Bug 369778 - Javascript regular expression captures broken with alternation in some cases.
 Summary: Javascript regular expression captures broken with alternation in some cases.
 Status: RESOLVED FIXED [good first bug][lang=c++][DocArea=JS] addon-compat, dev-doc-complete Core Components JavaScript Engine (show other bugs) Trunk All All -- normal (vote) mozilla34 Andy Anderson Jason Orendorff [:jorendorff] (view as bug list) 1054910 661974 1043683 1048000 Show dependency tree / graph

Reported: 2007-02-08 13:15 PST by Scott Hussey
Modified: 2015-01-05 05:23 PST (History)
23 users (show)
jwalden+bmo: in‑testsuite+
Crash Signature:
(edit)
QA Whiteboard:
Iteration: ---
Points: ---
Has Regression Range: ---
Has STR: ---

Attachments
don't overwrite last paren match (2.32 KB, patch)
2007-02-09 22:24 PST, Brian Crowder
no flags Details | Diff | Splinter Review
should handle nesting (4.11 KB, patch)
2007-02-13 08:28 PST, Brian Crowder
mrbkap: review+
Details | Diff | Splinter Review
patch for ECMA conformant behavior (1.02 KB, patch)
2008-04-16 14:53 PDT, x0
no flags Details | Diff | Splinter Review
test369778.patch (1.09 KB, patch)
2014-07-11 13:09 PDT, Andy Anderson
jwalden+bmo: review+
Details | Diff | Splinter Review
fix369778.patch (1.34 KB, patch)
2014-07-11 13:10 PDT, Andy Anderson
jwalden+bmo: review+
Details | Diff | Splinter Review
fix220367.patch (839 bytes, patch)
2014-07-11 13:10 PDT, Andy Anderson
jwalden+bmo: review+
Details | Diff | Splinter Review
Fix CommonDialog.jsm RegExp usage expecting an unmatched capture group to match as "" and not |undefined| (1.14 KB, patch)
2014-07-25 01:18 PDT, Jeff Walden [:Waldo] (remove +bmo to email)
dolske: review+
Details | Diff | Splinter Review

 Scott Hussey 2007-02-08 13:15:22 PST User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.1) Gecko/20061204 Firefox/2.0.0.1 Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.1) Gecko/20061204 Firefox/2.0.0.1 Javascript regular expressions capturing multiple matches using alternation set some of the result's capture fields to undefined in some cases. The cause is based on the order of the alternates in the text. Reproducible: Always Steps to Reproduce: Example code: re = /(?:(ABC)|(123)){2}/; var t = re.exec("ABC123"); var u = re.exec("123ABC"); Actual Results: t = ['ABC123',undefined,'123'] u = ['123ABC','ABC','123'] Expected Results: Captures should be identical in both cases with a possible change in order in the results array. Brendan Eich [:brendan] 2007-02-08 15:42:06 PST Looks like a bug -- Brian, can you take a look in due course? /be Brian Crowder 2007-02-09 11:29:52 PST Here are the perl results, for my information: $perl -e '$tst = "123ABC"; $tst =~ /(?:(ABC)|(123)){2}/; print$& . "\n"; print $1 . "\n"; print$2 . "\n"' 123ABC ABC 123 $perl -e '$tst = "ABC123"; $tst =~ /(?:(ABC)|(123)){2}/; print$& . "\n"; print $1 . "\n"; print$2 . "\n"' ABC123 ABC 123  Brian Crowder 2007-02-09 14:40:12 PST The bug here seems to be in the way that REOP_LPAREN works on the second attempt to match in the repeat set, it zeros out the capture results for parenIndex 0 because it is about to attempt to match (ABC) again, which will fail. When that fails, the values are left in this "empty" state, even though we have already successfully matched group 0. We are only getting lucky in the second case ("123ABC") because the first attempt to match in paren-group 0 fails, and is then overwritten by the successful match from the other group. More to come as I understand it. Brian Crowder 2007-02-09 22:24:11 PST Created attachment 254624 [details] [diff] [review] don't overwrite last paren match REOP_LPAREN's bad behavior of overwriting the old capture with the assumption that whatever happens on this match is more important than whatever happened on the last match is corrected by this patch. The captures array should (I think) only be overwritten in the case of a full backtrack, not (as in this case) for a speculative match occurring as the result of a REPEAT operation. We WILL overwrite this capture if the second match -succeeds- (I'm not sure I can conceive of an ALT/REPEAT situation such as this one where that is possible AND yields different results, but it is perhaps likely -- if someone can concoct such a case, I'll try to match perl's behavior, if this is not that). This should get the full js test suite test action before it even thinks about landing. Blake Kaplan (:mrbkap) 2007-02-12 15:37:11 PST Comment on attachment 254624 [details] [diff] [review] don't overwrite last paren match How does this handle nested parentheses? I'm not sure, just by looking at the code, that startcp won't get stepped on by something else. Or does everything that we care about assign the right thing into startcp? Brian Crowder 2007-02-12 15:42:18 PST Actually that's a good point. I might need to add something equivalent to startcp to the RECapture structure, and stick that there. Let look into it. Brian Crowder 2007-02-13 08:28:45 PST Created attachment 254953 [details] [diff] [review] should handle nesting Thanks for the obvious catch, mrbkap. This should be ok; here's my reasoning: If you're in an LPAREN op, cap->startcp will always be set for the current parenIndex (and every matching paren group has its own index). The data is only ever used if the RPAREN op executes (ie., at the end of a successful match of everything inside the paren-group), and the index/length members from which dependent strings are eventually build are only ever modified on successful matches. This should both fix this bug AND survive nested matching paren-groups. Brian Crowder 2007-02-13 08:30:39 PST I fixed a few ugly comments with this patch, too. Blake Kaplan (:mrbkap) 2007-02-20 21:08:34 PST Comment on attachment 254953 [details] [diff] [review] should handle nesting Seems reasonable. Brian Crowder 2007-02-21 14:23:58 PST Comment on attachment 254953 [details] [diff] [review] should handle nesting Damn. This actually causes some trouble with tests in the JS suite: These are the ones that fail with this patch applied: ecma_3/RegExp/15.10.2-1.js ecma_3/RegExp/perlstress-001.js ecma_3/RegExp/perlstress-002.js ecma_3/RegExp/regress-123437.js ecma_3/RegExp/regress-165353.js ecma_3/RegExp/regress-202564.js ecma_3/RegExp/regress-209919.js ecma_3/RegExp/regress-24712.js ecma_3/RegExp/regress-31316.js ecma_3/RegExp/regress-87231.js Back to the drawing board. Brian Crowder 2007-02-21 14:49:35 PST Some of the tests themselves may be wrong. I'm trying to figure out what perl does: Testcase ecma_3/RegExp/15.10.2-1.js failed Bug Number (none) STATUS: RegExp conformance test Failure messages were: FAILED!: [reported from test()] Section 8 of test - FAILED!: [reported from test()] regexp = /(a*)*/ FAILED!: [reported from test()] string = 'b' FAILED!: [reported from test()] ERROR !!! regexp failed to give expected match array: FAILED!: [reported from test()] Expect: ["", undefined] FAILED!: [reported from test()] Actual: ["", ""] Perl returns $& and$1 as "", and "" respectively, and even with -w does not consider either to be an uninitialized value. FAILED!: [reported from test()] Section 12 of test - FAILED!: [reported from test()] regexp = /(.*?)a(?!(a+)b\2c)\2(.*)/ FAILED!: [reported from test()] string = 'baaabaac' FAILED!: [reported from test()] ERROR !!! regexp failed to give expected match array: FAILED!: [reported from test()] Expect: ["baaabaac", "ba", undefined, "abaac"] FAILED!: [reported from test()] Actual: ["baaabaac", "ba", "a", "baac"] perl -we '"baaabaac" =~ (/(.*?)a(?!(a+)b\2c)\2(.*)/); print "[\"$&\", \"$1\", \"$2\", \"$3\"]\n"' ["baaabaac", "ba", "a", "baac"] These testcases, at least, seem wrong to me. I will come back with more after I've figured out why I am failing some of the others that -aren't- wrong. x0 2008-04-08 12:10:16 PDT  /(?:(ABC)|(123)){2}/.exec("ABC123").toSource() should return ["ABC123", (void 0), "123"] and /(?:(ABC)|(123)){2}/.exec("123ABC").toSource() should return ["123ABC", "ABC", (void 0)] because each repetition of a quantifier resets the inner captures to undefined (step 4 in the definition of RepeatMatcher). /(a*)*/.exec("b").toSource() should return ["", (void 0)] because a quantifier with minimum 0 cannot match an empty string (unlike Perl, which matches it once). /(.*?)a(?!(a+)b\2c)\2(.*)/.exec().toSource() should return ["baaabaac", "ba", (void 0), "abaac"] because captures inside (?!) aren't visible outside (unlike (?=)). The testcases in ecma_3/RegExp/15.10.2-1.js are examples from the spec, where they are explained.  x0 2008-04-16 14:53:10 PDT Created attachment 316103 [details] [diff] [review] patch for ECMA conformant behavior This patch should make the behavior ECMA conformant. I would recommend to check it in after 1.9, so that any real world problems with it can be detected. If there are serious problems or ECMA decides to revise the standard, we can go for Perl compatible behavior later. Compare bug 351349. Patch passes the testsuite. Nochum Sossonko [:Natch] 2010-10-06 08:51:23 PDT These bugs are all part of a search I made for js bugs that are getting lost in transit: http://tinyurl.com/jsDeadEndBugs They all have a review+'ed, non-obsoleted patch and are not marked fixed-in-tracemonkey or checkin-needed but have not seen any activity in 300 days. Some of these got lost simply because the assignee/patch provider never requested a checkin, or just because they were forgotten about. Nochum Sossonko [:Natch] 2011-02-18 09:28:21 PST This still an issue with YARR? Chris Leary [:cdleary] (not checking bugmail) 2011-02-18 15:26:23 PST This works properly in YARR. Taking assignment so I remember to check in the test case post-FF4. Till Schneidereit [till] 2013-03-27 04:41:40 PDT Mass-reassigning cdleary's bugs to default. He won't work on any of them, anymore. I guess, at least. @cdleary: shout if you take issue with this. Guilherme Lima 2013-11-01 20:25:33 PDT This bug is open just because of the testcase. Brendan Eich [:brendan] 2014-04-30 13:32:04 PDT This seems like a good first bug, with some IRC #jseng guidance. /be Chris Peterson [:cpeterson] 2014-04-30 15:55:30 PDT bhackett: Will this difference between SpiderMonkey's and V8's regex capturing still be relevant after your irregexp port? Here is a simple test case from @kangax [1]: 'x'.replace(/x(.)?/g, function(m, group) { alert("'group:" + group + "'"); }) displays 'group:' in Firefox, but 'group:undefined' in Chrome/WebKit. [1] https://twitter.com/kangax/status/461538545388511232 Brian Hackett (:bhackett) 2014-04-30 16:19:10 PDT (In reply to Chris Peterson (:cpeterson) from comment #20) > bhackett: Will this difference between SpiderMonkey's and V8's regex > capturing still be relevant after your irregexp port? > > Here is a simple test case from @kangax [1]: > > 'x'.replace(/x(.)?/g, function(m, group) { alert("'group:" + group + "'"); > }) > > displays 'group:' in Firefox, but 'group:undefined' in Chrome/WebKit. > > [1] https://twitter.com/kangax/status/461538545388511232 No, this behavior looks related to a different part of our regexp handling code. In particular, RegExpStatics::makeMatch makes an empty string when the last match had an undefined capture, when Chrome/Safari look like they are making an undefined value. Till Schneidereit [till] 2014-05-25 14:37:17 PDT *** Bug 1015762 has been marked as a duplicate of this bug. *** Till Schneidereit [till] 2014-05-25 14:40:45 PDT *** Bug 392378 has been marked as a duplicate of this bug. *** Till Schneidereit [till] 2014-05-25 14:41:02 PDT There's a patch for this in bug 392378. It's from before fire was invented, so I very much doubt anything can be salvaged, but it does nicely show what needs to be done here. Andy Anderson 2014-06-22 21:27:55 PDT Is there any reason why I could not work on this as a first bug? Utterly unfamiliar with the insides of SpiderMonkey but I have some relevant experience and I'd like to start contributing. Till Schneidereit [till] 2014-06-23 02:15:19 PDT (In reply to Andy Anderson from comment #25) > Is there any reason why I could not work on this as a first bug? No reason at all, welcome and thanks for digging in! > Utterly unfamiliar with the insides of SpiderMonkey but I have some > relevant experience and I'd like to start contributing. Nice. The first thing you'll want to do is to set up a build environment that allows you to compile the browser[1] and the SpiderMonkey shell[2]. Once you have that, you can start reading some source code. I wouldn't recommend trying to understand the big picture before doing anything, as that's a pretty daunting task. Instead, you should try making localized changes based on theories on what the effects should be. For this particular bug, you should look at the code bhackett mentioned in comment 21. RegExpStatics::makeMatch can be found in js/src/vm/RegExpStatics.h. (All SpiderMonkey-related code can be found under js, and *almost* all of it under js/src). The change itself is pretty trivial in this case, so the main focus will have to be on testing it. We have two test suites, one under js/src/tests, the other under js/src/jit-test. You'll want to run both of them before and after your change and see if the number of failing tests changes. It *should* be none before any changes, but there are some tests that can time out depending on your machine and the exact build configuration, and then there are some that, sadly, can fail depending on the timezone you're in. Those you can just ignore. You should add an additional test based on the code in comment 20. Essentially, you'd assign the group argument from the callback function to a variable in the outer code and then use assertEq to compare it with an expected value. I guess it makes sense to add the test to the tests/ecma_3/String directory, as that was the language version where String#replace was added, apparently. If you have any questions, join us in the #jsapi channel on irc.mozilla.org, or ask here. And once you have a patch, add it to this bug and ask me for review. thanks! [1]: https://developer.mozilla.org/en-US/docs/Mozilla/Developer_guide/Build_Instructions [2]: https://developer.mozilla.org/en-US/docs/Mozilla/Projects/SpiderMonkey/Build_Documentation Till Schneidereit [till] 2014-07-08 06:12:50 PDT *** Bug 1034950 has been marked as a duplicate of this bug. *** Andy Anderson 2014-07-11 13:09:02 PDT Created attachment 8454642 [details] [diff] [review] test369778.patch Test case for 369778 Andy Anderson 2014-07-11 13:10:23 PDT Created attachment 8454643 [details] [diff] [review] fix369778.patch Fixes 369778 Andy Anderson 2014-07-11 13:10:59 PDT Created attachment 8454644 [details] [diff] [review] fix220367.patch Changes test for 220367 to reflect new engine behavior Andy Anderson 2014-07-11 13:13:45 PDT I was unclear on whether disparate changes such as these should be rolled into one, so I went and made separate ones for the different changes. Passes on my machine, let me know if there's an issue or if I missed something administrative. And thanks for the help on irc! André Bargull 2014-07-11 13:29:40 PDT Hmm, I'd be in favour of leaving the legacy RegExp statics behaviour as-is and only change the argument value for String.prototype.replace. Jason Orendorff [:jorendorff] 2014-07-14 13:17:03 PDT Comment on attachment 8454642 [details] [diff] [review] test369778.patch Jeff, please review all these and/or set a reviewer. Andy Anderson 2014-07-23 10:53:01 PDT What's the review status of this bug? I can't really proceed with more contributions until I've gotten a first patch through a complete review cycle. Jeff Walden [:Waldo] (remove +bmo to email) 2014-07-24 14:25:17 PDT Comment on attachment 8454642 [details] [diff] [review] test369778.patch Review of attachment 8454642 [details] [diff] [review]: ----------------------------------------------------------------- Sorry for the delay, I was out basically all last week, and I've been digging out of various things earlier this week. This test looks fine to me. Is the contention that this patch is the test, and one or more of the other patches here is the actual fix? Guess I'll look at those and try to figure things out myself now. In the future it'd help if that were made clearer when attaching patches. :-) Generally (I count security bugs as the only exception, where we might want to land *only* a patch but not the test, initially) we include the test in the patch that fixes a bug, so a build without patch, then applying the patch causes the new test to fail, then building that and testing will make it work. Makes the connection and nature of the fix that much clearer. Andy Anderson 2014-07-24 15:02:16 PDT I uploaded three attachments. test369778.patch tests for the new behavior; it is an entirely new test. fix369778.patch introduces the new behavior. fix220367.patch modifies an existing test, to reflect the new engine behavior. I received guidance from Till on IRC that this was what I needed to do. Andy Anderson 2014-07-24 15:04:56 PDT Sorry to double post, but - the contention raised earlier is that because this modifies long-standing engine behavior, perhaps the new behavior should be more limited in scope than what I have submitted. I do not have strong feelings on this, but am inclined to not complicate the behavior further by adding exceptions, unless these exceptions also exist in other engines. Jeff Walden [:Waldo] (remove +bmo to email) 2014-07-24 17:06:28 PDT Comment on attachment 8454643 [details] [diff] [review] fix369778.patch Review of attachment 8454643 [details] [diff] [review]: ----------------------------------------------------------------- ::: js/src/vm/RegExpStatics.h @@ +249,5 @@ > bool checkWhich = checkValidIndex % 2; > size_t checkPair = checkValidIndex / 2; > > + if (matches.empty() || checkPair >= matches.pairCount() || matches[pairNum].isUndefined()) > + { SpiderMonkey style would have brace at end of line. Also, this could use a comment explaining what it's doing, with an example. So something like: // if (matches.empty() || checkPair >= matches.pairCount() || matches[pairNum].isUndefined()) { with brace at end, all on one line because it fits in 99ch, appropriately adjusted wrt bug 1043683. (I'll do the adjusting for you, since I'm the one making the change underneath yours. :-) ) @@ +259,1 @@ > { Same thing here. ...except, when you consider the patch in bug 1043683, this condition is *exactly* identical to |matches[pairNum].isUndefined()|. So this whole block should go away. ...in which case, is there *ever* a situation where we should be returning "" here, other than if the matched characters were the empty string? Seems like no to me. So this block should disappear. Jeff Walden [:Waldo] (remove +bmo to email) 2014-07-24 17:08:49 PDT Comment on attachment 8454644 [details] [diff] [review] fix220367.patch Review of attachment 8454644 [details] [diff] [review]: ----------------------------------------------------------------- I'm not too worried about behavior of RegExp.* versus behavior via the array returned by exec(). We should be consistent, up until someone complains and it's demonstrated that we should have contradictory behavior in the two cases. So I think we take these patches exactly as-is. After you review the patch in the blocking bug, I'll combine these together and land the whole patch series. Thanks for working on this! And sorry for the delay in getting to looking at these patches. :-( Jeff Walden [:Waldo] (remove +bmo to email) 2014-07-24 17:40:41 PDT Running tests with this patch showed that the following tests (jstests only, haven't tried jit-tests, haven't tried doing a full test run on tryserver) have now started passing: ecma_3/String/15.5.4.11.js ecma_3/String/regress-392378.js I did a little archaeology on one of them, and it basically seems to date back forever. It appears both of these tests are marked as failing, and have been, since they were landed -- in advance of a fix -- in bug 392378. So we need to stop expecting failure, with the patchwork here. Jeff Walden [:Waldo] (remove +bmo to email) 2014-07-24 18:22:06 PDT https://tbpl.mozilla.org/?tree=Try&rev=b3ea47188bba Jeff Walden [:Waldo] (remove +bmo to email) 2014-07-24 20:11:16 PDT Looks like we have at least one case of code breaking due to expecting unmatched captures to expand to the empty string. I think this extra patch should fix it, pushed an M5-only test run to verify, will post that patch for review if it does: https://tbpl.mozilla.org/?tree=Try&rev=185bacfb7631 Jeff Walden [:Waldo] (remove +bmo to email) 2014-07-25 01:18:43 PDT Created attachment 8462408 [details] [diff] [review] Fix CommonDialog.jsm RegExp usage expecting an unmatched capture group to match as "" and not |undefined| Dolske, we're changing things in this bug so if you access the matched text for a capturing group, when that capturing group didn't actually get consulted because quantifiers prevented its exercise, you get |undefined| instead of the empty string. Simple fix for these two places: just put quantifiers inside parentheses so the group will match the no-character sequence, rather than not be matched at all. (:)? thus turns into (:?), with the overall pattern matching exactly the same things. (.*[^&])? we turn into ([^&]*). This is not strictly identical in that . doesn't match \n: js> /(.*[^&])?/.exec("\n\n\n\n\n")[1].length 1 js> /([^&]*)/.exec("\n\n\n\n\n")[1].length 5 If we're really and truly paranoid, we could get exactly identical matching with more groups, ((?:[^\n&]*[^&])?) or so. But it doesn't seem necessary. People aren't going to put repeated newlines just before accesskey specifiers in button label text. Justin Dolske [:Dolske] 2014-07-25 15:11:57 PDT Comment on attachment 8462408 [details] [diff] [review] Fix CommonDialog.jsm RegExp usage expecting an unmatched capture group to match as "" and not |undefined| Review of attachment 8462408 [details] [diff] [review]: ----------------------------------------------------------------- No need for paranoia here. These are always going to be strings from either us or an addon, and if someone is doing that they are bad and should feel bad. Also this (existing) code is gross, looks like it dates back the original ancient commonDialog code. *shudder* Jeff Walden [:Waldo] (remove +bmo to email) 2014-07-25 16:41:52 PDT https://hg.mozilla.org/integration/mozilla-inbound/rev/22cfcffb5a8b https://hg.mozilla.org/integration/mozilla-inbound/rev/eb01c53fd879 Unfortunately I landed the patches out of order -- the CommonDialog.jsm change should have landed before the actual behavior change -- but too late to do anything about it now. :-( Hopefully no one will get so supremely unlucky about bisect points as to hit it, and be testing specifically M5 when doing said bisection. We'll probably want to keep our eyes peeled for any code expecting unmatched captures to expand to the empty string instead of to |undefined|. Probably won't be many, given every other engine seems to produce |undefined| in this bug's scenarios, but it's a distinct possibility we'll see some in Gecko-specific code paths or extensions. As far as docs go: I'm not sure what the best way to document this is. Really it probably should be described in a JS guide that discusses RegExps most generally, rather than on the specific JS reference pages that happen to be affected by it. (Those being for RegExp.prototype.exec, String.prototype.match, RegExp.prototype.test, String.prototype.replace, and probably a few others as well. It's really a general RegExp issue, not something particular to one method or another's behavior.) Carsten Book [:Tomcat] 2014-07-28 06:38:14 PDT https://hg.mozilla.org/mozilla-central/rev/22cfcffb5a8b https://hg.mozilla.org/mozilla-central/rev/eb01c53fd879 Jeff Walden [:Waldo] (remove +bmo to email) 2014-08-04 19:33:52 PDT The tests we had to change to land this are adequate testing for this bug. Tooru Fujisawa [:arai] 2015-01-01 22:07:50 PST Updated following pages: https://developer.mozilla.org/en-US/Firefox/Releases/34/Site_Compatibility https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/RegExp (Gecko specific note on the bottom of the page) Following page is already updated: https://developer.mozilla.org/en-US/Firefox/Releases/34 Florian Scholz [:fscholz] (MDN) 2015-01-05 03:12:05 PST Thanks, :arai!

 Note You need to log in before you can comment on or make changes to this bug.