# Javascript regular expression captures broken with alternation in some cases.

RESOLVED FIXED in mozilla34

## Status

### ()

Core
JavaScript Engine
RESOLVED FIXED
11 years ago
3 years ago

## Tracking

### (Depends on: 1 bug, {addon-compat, dev-doc-complete})

Trunk
mozilla34
Points:
---
Dependency tree / graph
Bug Flags:
 in-testsuite +

## Attachments

### (5 attachments, 2 obsolete attachments)

 1.02 KB, patch Details | Diff | Splinter Review 1.09 KB, patch Waldo : review+ Details | Diff | Splinter Review 1.34 KB, patch Waldo : review+ Details | Diff | Splinter Review 839 bytes, patch Waldo : review+ Details | Diff | Splinter Review 1.14 KB, patch Dolske : review+ Details | Diff | Splinter Review
(Reporter)

### Description

11 years ago
User-Agent:       Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.1) Gecko/20061204 Firefox/2.0.0.1
Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.1) Gecko/20061204 Firefox/2.0.0.1

Javascript regular expressions capturing multiple matches using alternation set some of the result's capture fields to undefined in some cases. The cause is based on the order of the alternates in the text.

Reproducible: Always

Steps to Reproduce:
Example code:
re = /(?:(ABC)|(123)){2}/;

var t = re.exec("ABC123");
var u = re.exec("123ABC");
Actual Results:
t = ['ABC123',undefined,'123']
u = ['123ABC','ABC','123']

Expected Results:
Captures should be identical in both cases with a possible change in order in the results array.

### Updated

11 years ago
Assignee: nobody → general
Component: General → JavaScript Engine
Product: Firefox → Core
QA Contact: general → general
Version: unspecified → 1.8 Branch

### Comment 1

11 years ago
Looks like a bug -- Brian, can you take a look in due course?

/be
Assignee: general → crowder

### Comment 2

11 years ago
Here are the perl results, for my information:

$perl -e '$tst = "123ABC"; $tst =~ /(?:(ABC)|(123)){2}/; print$& . "\n"; print $1 . "\n"; print$2 . "\n"'
123ABC
ABC
123

$perl -e '$tst = "ABC123"; $tst =~ /(?:(ABC)|(123)){2}/; print$& . "\n"; print $1 . "\n"; print$2 . "\n"'
ABC123
ABC
123


### Updated

11 years ago
Status: UNCONFIRMED → ASSIGNED
Ever confirmed: true

### Comment 3

11 years ago
The bug here seems to be in the way that REOP_LPAREN works on the second attempt to match in the repeat set, it zeros out the capture results for parenIndex 0 because it is about to attempt to match (ABC) again, which will fail.  When that fails, the values are left in this "empty" state, even though we have already successfully matched group 0.  We are only getting lucky in the second case ("123ABC") because the first attempt to match in paren-group 0 fails, and is then overwritten by the successful match from the other group.  More to come as I understand it.

### Comment 4

11 years ago
Created attachment 254624 [details] [diff] [review]
don't overwrite last paren match

REOP_LPAREN's bad behavior of overwriting the old capture with the assumption that whatever happens on this match is more important than whatever happened on the last match is corrected by this patch.  The captures array should (I think) only be overwritten in the case of a full backtrack, not (as in this case) for a speculative match occurring as the result of a REPEAT operation.  We WILL overwrite this capture if the second match -succeeds- (I'm not sure I can conceive of an ALT/REPEAT situation such as this one where that is possible AND yields different results, but it is perhaps likely -- if someone can concoct such a case, I'll try to match perl's behavior, if this is not that).

This should get the full js test suite test action before it even thinks about landing.
Attachment #254624 - Flags: review?(mrbkap)

### Comment 5

11 years ago
Comment on attachment 254624 [details] [diff] [review]
don't overwrite last paren match

How does this handle nested parentheses? I'm not sure, just by looking at the code, that startcp won't get stepped on by something else. Or does everything that we care about assign the right thing into startcp?

### Comment 6

11 years ago
Actually that's a good point.  I might need to add something equivalent to startcp to the RECapture structure, and stick that there.  Let look into it.

### Comment 7

11 years ago
Created attachment 254953 [details] [diff] [review]
should handle nesting

Thanks for the obvious catch, mrbkap.  This should be ok; here's my reasoning:
If you're in an LPAREN op, cap->startcp will always be set for the current parenIndex (and every matching paren group has its own index).  The data is only ever used if the RPAREN op executes (ie., at the end of a successful match of everything inside the paren-group), and the index/length members from which dependent strings are eventually build are only ever modified on successful matches.  This should both fix this bug AND survive nested matching paren-groups.
Attachment #254624 - Attachment is obsolete: true
Attachment #254953 - Flags: review?(mrbkap)
Attachment #254624 - Flags: review?(mrbkap)

### Comment 8

11 years ago
I fixed a few ugly comments with this patch, too.

### Comment 9

11 years ago
Comment on attachment 254953 [details] [diff] [review]
should handle nesting

Seems reasonable.
Attachment #254953 - Flags: review?(mrbkap) → review+

### Comment 10

11 years ago
Comment on attachment 254953 [details] [diff] [review]
should handle nesting

Damn.  This actually causes some trouble with tests in the JS suite:

These are the ones that fail with this patch applied:
ecma_3/RegExp/15.10.2-1.js
ecma_3/RegExp/perlstress-001.js
ecma_3/RegExp/perlstress-002.js
ecma_3/RegExp/regress-123437.js
ecma_3/RegExp/regress-165353.js
ecma_3/RegExp/regress-202564.js
ecma_3/RegExp/regress-209919.js
ecma_3/RegExp/regress-24712.js
ecma_3/RegExp/regress-31316.js
ecma_3/RegExp/regress-87231.js

Back to the drawing board.
Attachment #254953 - Attachment is obsolete: true

### Comment 11

11 years ago
Some of the tests themselves may be wrong.  I'm trying to figure out what perl does:

Testcase ecma_3/RegExp/15.10.2-1.js failed Bug Number (none)
STATUS: RegExp conformance test
Failure messages were:
FAILED!: [reported from test()] Section 8 of test -
FAILED!: [reported from test()] regexp = /(a*)*/
FAILED!: [reported from test()] string = 'b'
FAILED!: [reported from test()] ERROR !!! regexp failed to give expected match array:
FAILED!: [reported from test()] Expect: ["", undefined]
FAILED!: [reported from test()] Actual: ["", ""]

Perl returns $& and$1 as "", and "" respectively, and even with -w does not consider either to be an uninitialized value.

FAILED!: [reported from test()] Section 12 of test -
FAILED!: [reported from test()] regexp = /(.*?)a(?!(a+)b\2c)\2(.*)/
FAILED!: [reported from test()] string = 'baaabaac'
FAILED!: [reported from test()] ERROR !!! regexp failed to give expected match array:
FAILED!: [reported from test()] Expect: ["baaabaac", "ba", undefined, "abaac"]
FAILED!: [reported from test()] Actual: ["baaabaac", "ba", "a", "baac"]

perl -we '"baaabaac" =~ (/(.*?)a(?!(a+)b\2c)\2(.*)/); print "[\"$&\", \"$1\", \"$2\", \"$3\"]\n"'
["baaabaac", "ba", "a", "baac"]

These testcases, at least, seem wrong to me.  I will come back with more after I've figured out why I am failing some of the others that -aren't- wrong.

### Comment 12

9 years ago
  /(?:(ABC)|(123)){2}/.exec("ABC123").toSource()
should return
["ABC123", (void 0), "123"]
and
/(?:(ABC)|(123)){2}/.exec("123ABC").toSource()
should return
["123ABC", "ABC", (void 0)]
because each repetition of a quantifier resets the inner captures to undefined (step 4 in the definition of RepeatMatcher).

/(a*)*/.exec("b").toSource()
should return
["", (void 0)]
because a quantifier with minimum 0 cannot match an empty string (unlike Perl, which matches it once).

/(.*?)a(?!(a+)b\2c)\2(.*)/.exec().toSource()
should return
["baaabaac", "ba", (void 0), "abaac"]
because captures inside (?!) aren't visible outside (unlike (?=)).

The testcases in ecma_3/RegExp/15.10.2-1.js are examples from the spec, where they are explained.


### Comment 13

9 years ago
Created attachment 316103 [details] [diff] [review]
patch for ECMA conformant behavior

This patch should make the behavior ECMA conformant. I would recommend to check it in after 1.9, so that any real world problems with it can be detected. If there are serious problems or ECMA decides to revise the standard, we can go for Perl compatible behavior later. Compare bug 351349.

Patch passes the testsuite.

### Updated

9 years ago
Assignee: crowder → x00000000
Status: ASSIGNED → NEW

### Updated

8 years ago
Flags: wanted1.9.2?
Flags: in-testsuite?

### Comment 14

7 years ago
These bugs are all part of a search I made for js bugs that are getting lost in transit:

They all have a review+'ed, non-obsoleted patch and are not marked fixed-in-tracemonkey or checkin-needed but have not seen any activity in 300 days. Some of these got lost simply because the assignee/patch provider never requested a checkin, or just because they were forgotten about.

### Comment 15

7 years ago
This still an issue with YARR?
Flags: wanted1.9.2?

### Comment 16

7 years ago
This works properly in YARR. Taking assignment so I remember to check in the test case post-FF4.
Assignee: x00000000 → cdleary
Status: NEW → ASSIGNED

### Updated

6 years ago
Depends on: 661974

### Comment 17

4 years ago
Mass-reassigning cdleary's bugs to default. He won't work on any of them, anymore. I guess, at least.

@cdleary: shout if you take issue with this.
Assignee: cdleary → general
Status: ASSIGNED → NEW

### Comment 18

4 years ago
This bug is open just because of the testcase.

### Comment 19

3 years ago
This seems like a good first bug, with some IRC #jseng guidance.

/be

### Updated

3 years ago
Whiteboard: [good first bug]

### Comment 20

3 years ago
bhackett: Will this difference between SpiderMonkey's and V8's regex capturing still be relevant after your irregexp port?

Here is a simple test case from @kangax [1]:

'x'.replace(/x(.)?/g, function(m, group) { alert("'group:" + group + "'"); })

displays 'group:' in Firefox, but 'group:undefined' in Chrome/WebKit.

[1] https://twitter.com/kangax/status/461538545388511232
Flags: needinfo?(bhackett1024)
OS: Windows XP → All
Hardware: x86 → All
Whiteboard: [good first bug] → [good first bug][lang=c++]
Version: 1.8 Branch → Trunk

### Comment 21

3 years ago
(In reply to Chris Peterson (:cpeterson) from comment #20)
> bhackett: Will this difference between SpiderMonkey's and V8's regex
> capturing still be relevant after your irregexp port?
>
> Here is a simple test case from @kangax [1]:
>
>   'x'.replace(/x(.)?/g, function(m, group) { alert("'group:" + group + "'");
> })
>
> displays 'group:' in Firefox, but 'group:undefined' in Chrome/WebKit.
>

No, this behavior looks related to a different part of our regexp handling code.  In particular, RegExpStatics::makeMatch makes an empty string when the last match had an undefined capture, when Chrome/Safari look like they are making an undefined value.
Flags: needinfo?(bhackett1024)

### Updated

3 years ago
Duplicate of this bug: 1015762

### Updated

3 years ago
Duplicate of this bug: 392378

### Comment 24

3 years ago
There's a patch for this in bug 392378. It's from before fire was invented, so I very much doubt anything can be salvaged, but it does nicely show what needs to be done here.
(Assignee)

### Comment 25

3 years ago
Is there any reason why I could not work on this as a first bug? Utterly unfamiliar with the insides of SpiderMonkey but I have some relevant experience and I'd like to start contributing.

### Comment 26

3 years ago
(In reply to Andy Anderson from comment #25)
> Is there any reason why I could not work on this as a first bug?

No reason at all, welcome and thanks for digging in!

> Utterly unfamiliar with the insides of SpiderMonkey but I have some
> relevant experience and I'd like to start contributing.

Nice. The first thing you'll want to do is to set up a build environment that allows you to compile the browser[1] and the SpiderMonkey shell[2].

Once you have that, you can start reading some source code. I wouldn't recommend trying to understand the big picture before doing anything, as that's a pretty daunting task. Instead, you should try making localized changes based on theories on what the effects should be. For this particular bug, you should look at the code bhackett mentioned in comment 21. RegExpStatics::makeMatch can be found in js/src/vm/RegExpStatics.h. (All SpiderMonkey-related code can be found under js, and *almost* all of it under js/src).

The change itself is pretty trivial in this case, so the main focus will have to be on testing it. We have two test suites, one under js/src/tests, the other under js/src/jit-test. You'll want to run both of them before and after your change and see if the number of failing tests changes. It *should* be none before any changes, but there are some tests that can time out depending on your machine and the exact build configuration, and then there are some that, sadly, can fail depending on the timezone you're in. Those you can just ignore.

You should add an additional test based on the code in comment 20. Essentially, you'd assign the group argument from the callback function to a variable in the outer code and then use assertEq to compare it with an expected value. I guess it makes sense to add the test to the tests/ecma_3/String directory, as that was the language version where String#replace was added, apparently.

If you have any questions, join us in the #jsapi channel on irc.mozilla.org, or ask here. And once you have a patch, add it to this bug and ask me for review.

thanks!

[1]: https://developer.mozilla.org/en-US/docs/Mozilla/Developer_guide/Build_Instructions
[2]: https://developer.mozilla.org/en-US/docs/Mozilla/Projects/SpiderMonkey/Build_Documentation

### Updated

3 years ago
Duplicate of this bug: 1034950
(Assignee)

### Comment 28

3 years ago
Created attachment 8454642 [details] [diff] [review]
test369778.patch

Test case for 369778
(Assignee)

### Comment 29

3 years ago
Created attachment 8454643 [details] [diff] [review]
fix369778.patch

Fixes 369778
(Assignee)

### Comment 30

3 years ago
Created attachment 8454644 [details] [diff] [review]
fix220367.patch

Changes test for 220367 to reflect new engine behavior
(Assignee)

### Comment 31

3 years ago
I was unclear on whether disparate changes such as these should be rolled into one, so I went and made separate ones for the different changes.

Passes on my machine, let me know if there's an issue or if I missed something administrative.

And thanks for the help on irc!

### Comment 32

3 years ago
Hmm, I'd be in favour of leaving the legacy RegExp statics behaviour as-is and only change the argument value for String.prototype.replace.

### Comment 33

3 years ago
Comment on attachment 8454642 [details] [diff] [review]
test369778.patch

Jeff, please review all these and/or set a reviewer.
Attachment #8454642 - Flags: review?(jwalden+bmo)
(Assignee)

### Comment 34

3 years ago
What's the review status of this bug? I can't really proceed with more contributions until I've gotten a first patch through a complete review cycle.

### Updated

3 years ago
Assignee: general → nobody

### Comment 35

3 years ago
Comment on attachment 8454642 [details] [diff] [review]
test369778.patch

Review of attachment 8454642 [details] [diff] [review]:
-----------------------------------------------------------------

Sorry for the delay, I was out basically all last week, and I've been digging out of various things earlier this week.  This test looks fine to me.

Is the contention that this patch is the test, and one or more of the other patches here is the actual fix?  Guess I'll look at those and try to figure things out myself now.  In the future it'd help if that were made clearer when attaching patches.  :-)  Generally (I count security bugs as the only exception, where we might want to land *only* a patch but not the test, initially) we include the test in the patch that fixes a bug, so a build without patch, then applying the patch causes the new test to fail, then building that and testing will make it work.  Makes the connection and nature of the fix that much clearer.
Attachment #8454642 - Flags: review?(jwalden+bmo) → review+
(Assignee)

### Comment 36

3 years ago
I uploaded three attachments.

test369778.patch tests for the new behavior; it is an entirely new test.

fix369778.patch introduces the new behavior.

fix220367.patch modifies an existing test, to reflect the new engine behavior. I received guidance from Till on IRC that this was what I needed to do.
(Assignee)

### Comment 37

3 years ago
Sorry to double post, but -

the contention raised earlier is that because this modifies long-standing engine behavior, perhaps the new behavior should be more limited in scope than what I have submitted. I do not have strong feelings on this, but am inclined to not complicate the behavior further by adding exceptions, unless these exceptions also exist in other engines.

### Updated

3 years ago
Depends on: 1043683

### Comment 38

3 years ago
Comment on attachment 8454643 [details] [diff] [review]
fix369778.patch

Review of attachment 8454643 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/vm/RegExpStatics.h
@@ +249,5 @@
>      bool checkWhich  = checkValidIndex % 2;
>      size_t checkPair = checkValidIndex / 2;
>
> +    if (matches.empty() || checkPair >= matches.pairCount() || matches[pairNum].isUndefined())
> +    {

SpiderMonkey style would have brace at end of line.  Also, this could use a comment explaining what it's doing, with an example.  So something like:

//
if (matches.empty() || checkPair >= matches.pairCount() || matches[pairNum].isUndefined()) {

with brace at end, all on one line because it fits in 99ch, appropriately adjusted wrt bug 1043683.  (I'll do the adjusting for you, since I'm the one making the change underneath yours.  :-) )

@@ +259,1 @@
>      {

Same thing here.

...except, when you consider the patch in bug 1043683, this condition is *exactly* identical to |matches[pairNum].isUndefined()|.  So this whole block should go away.

...in which case, is there *ever* a situation where we should be returning "" here, other than if the matched characters were the empty string?  Seems like no to me.  So this block should disappear.
Attachment #8454643 - Flags: review+

### Comment 39

3 years ago
Comment on attachment 8454644 [details] [diff] [review]
fix220367.patch

Review of attachment 8454644 [details] [diff] [review]:
-----------------------------------------------------------------

I'm not too worried about behavior of RegExp.* versus behavior via the array returned by exec().  We should be consistent, up until someone complains and it's demonstrated that we should have contradictory behavior in the two cases.  So I think we take these patches exactly as-is.

After you review the patch in the blocking bug, I'll combine these together and land the whole patch series.  Thanks for working on this!  And sorry for the delay in getting to looking at these patches.  :-(
Attachment #8454644 - Flags: review+

### Comment 40

3 years ago
Running tests with this patch showed that the following tests (jstests only, haven't tried jit-tests, haven't tried doing a full test run on tryserver) have now started passing:

ecma_3/String/15.5.4.11.js
ecma_3/String/regress-392378.js

I did a little archaeology on one of them, and it basically seems to date back forever.  It appears both of these tests are marked as failing, and have been, since they were landed -- in advance of a fix -- in bug 392378.  So we need to stop expecting failure, with the patchwork here.

### Comment 41

3 years ago
https://tbpl.mozilla.org/?tree=Try&rev=b3ea47188bba

### Comment 42

3 years ago
Looks like we have at least one case of code breaking due to expecting unmatched captures to expand to the empty string.  I think this extra patch should fix it, pushed an M5-only test run to verify, will post that patch for review if it does:

https://tbpl.mozilla.org/?tree=Try&rev=185bacfb7631
Assignee: nobody → ae.anderson0
Status: NEW → ASSIGNED

### Comment 43

3 years ago
Created attachment 8462408 [details] [diff] [review]
Fix CommonDialog.jsm RegExp usage expecting an unmatched capture group to match as "" and not |undefined|

Dolske, we're changing things in this bug so if you access the matched text for a capturing group, when that capturing group didn't actually get consulted because quantifiers prevented its exercise, you get |undefined| instead of the empty string.

Simple fix for these two places: just put quantifiers inside parentheses so the group will match the no-character sequence, rather than not be matched at all.

(:)? thus turns into (:?), with the overall pattern matching exactly the same things.

(.*[^&])? we turn into ([^&]*).  This is not strictly identical in that . doesn't match \n:

js> /(.*[^&])?/.exec("\n\n\n\n\n")[1].length
1
js> /([^&]*)/.exec("\n\n\n\n\n")[1].length
5

If we're really and truly paranoid, we could get exactly identical matching with more groups, ((?:[^\n&]*[^&])?) or so.  But it doesn't seem necessary.  People aren't going to put repeated newlines just before accesskey specifiers in button label text.
Attachment #8462408 - Flags: review?(dolske)

### Updated

3 years ago
Assignee: ae.anderson0 → jwalden+bmo

### Updated

3 years ago
Assignee: jwalden+bmo → ae.anderson0

### Comment 44

3 years ago
Comment on attachment 8462408 [details] [diff] [review]
Fix CommonDialog.jsm RegExp usage expecting an unmatched capture group to match as "" and not |undefined|

Review of attachment 8462408 [details] [diff] [review]:
-----------------------------------------------------------------

No need for paranoia here. These are always going to be strings from either us or an addon, and if someone is doing that they are bad and should feel bad.

Also this (existing) code is gross, looks like it dates back the original ancient commonDialog code. *shudder*
Attachment #8462408 - Flags: review?(dolske) → review+

### Comment 45

3 years ago
https://hg.mozilla.org/integration/mozilla-inbound/rev/22cfcffb5a8b
https://hg.mozilla.org/integration/mozilla-inbound/rev/eb01c53fd879

Unfortunately I landed the patches out of order -- the CommonDialog.jsm change should have landed before the actual behavior change -- but too late to do anything about it now.  :-(  Hopefully no one will get so supremely unlucky about bisect points as to hit it, and be testing specifically M5 when doing said bisection.

We'll probably want to keep our eyes peeled for any code expecting unmatched captures to expand to the empty string instead of to |undefined|.  Probably won't be many, given every other engine seems to produce |undefined| in this bug's scenarios, but it's a distinct possibility we'll see some in Gecko-specific code paths or extensions.

As far as docs go: I'm not sure what the best way to document this is.  Really it probably should be described in a JS guide that discusses RegExps most generally, rather than on the specific JS reference pages that happen to be affected by it.  (Those being for RegExp.prototype.exec, String.prototype.match, RegExp.prototype.test, String.prototype.replace, and probably a few others as well.  It's really a general RegExp issue, not something particular to one method or another's behavior.)

### Comment 46

3 years ago
https://hg.mozilla.org/mozilla-central/rev/22cfcffb5a8b
https://hg.mozilla.org/mozilla-central/rev/eb01c53fd879
Status: ASSIGNED → RESOLVED
Last Resolved: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla34

### Comment 47

3 years ago
The tests we had to change to land this are adequate testing for this bug.
Flags: in-testsuite? → in-testsuite+

### Updated

3 years ago
Depends on: 1048000

### Updated

3 years ago
Whiteboard: [good first bug][lang=c++] → [good first bug][lang=c++][DocArea=JS]

### Updated

3 years ago
Depends on: 1054910

### Comment 48

3 years ago
Updated following pages:
https://developer.mozilla.org/en-US/Firefox/Releases/34/Site_Compatibility
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/RegExp
(Gecko specific note on the bottom of the page)

https://developer.mozilla.org/en-US/Firefox/Releases/34
Thanks, :arai!