Closed Bug 501739 Opened 15 years ago Closed 11 years ago

String match and replace methods do not update global regexp lastIndex per ES3&5

Categories

(Core :: JavaScript Engine, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla27

People

(Reporter: brendan, Assigned: yaron.tausky)

References

Details

(Keywords: dev-doc-complete, site-compat, Whiteboard: [good first bug][lang=c++][mentor=Waldo][DocArea=JS])

Attachments

(3 files, 3 obsolete files)

Rob, more regexp lastIndex stuff.  Interested?
As mentioned: I should have invited Waldo, who has bug 98409, which is related.  Please shuffle as desired.
Assignee: general → jwalden+bmo
Target Milestone: mozilla1.9.2a1 → mozilla2.0b2
Is this bug valid?  ES5's replace verbiage clearly says to update lastIndex; its match algorithm clearly says to update lastIndex as well.  Was this bug filed in anticipation of a spec change that never happened?
Anyone want to verify my understanding as noted in comment 3?
We should update lastIndex of the regexp object per 15.10.6.2 step 11a. The caller (replace) indicates it wants lastIndex updated through the global attribute of the regexp object it passes to exec.

It seems unspecified what order the lambda replaceValue is invoked WRT the update of searchValue.lastIndex in String.prototype.match -- it simply says, "If replaceValue is a function, then for each matched substring, call the function with the following m + 3 arguments."

This ordering is probably vague in the spec because it permits a lambda replaceValue with either a non-regexp flat match (no lastIndex to speak of) or a regexp object. I guess we should lock down the lastIndex ordering WRT the lambda invocation?
https://bugs.webkit.org/show_bug.cgi?id=26890 is a sibling report to this one that has precise details on what's purportedly wrong, and what should actually happen -- most importantly with simple testcases and expected results.

I don't have time to verify the details myself, but I could help someone interested in spec-reading and bug-fixing to do that and write the patch, I think.
Assignee: jwalden+bmo → general
Whiteboard: [good first bug][lang=c++][mentor=Waldo]
This bug bit me today, so I think it's only fair to try and attempt it
Assignee: general → jon
Target Milestone: mozilla2.0b2 → ---
Attached file Possible test case
This code has different results in:

Firefox and Opera:

r.match() = a
r.exec() = a,a
r2.exec() = a,a
r.match() = b
r.exec() = b,b
r2.exec() = null

Chrome and Safari:

r.match() = a
r.exec() = a,a
r2.exec() = a,a
r.match() = b
r.exec() = null
r2.exec() = null
Attached patch patch v1 (obsolete) — Splinter Review
And here's a patch that makes Firefox returns the same results as Chrome/Safari above. Please note that this is a WAG, and maybe it's a Webkit bug, but `reobj->zeroLastIndex();` doesn't look correct compared to ES 15.10.6.2 Step 11.

And here's a tryserver run with this patch to see if I blew anything up: https://tbpl.mozilla.org/?tree=Try&rev=e637722d74ef
Comment on attachment 616021 [details] [diff] [review]
patch v1

Well, I blew up a single test which isn't so bad to fix. Does this patch look kosher though, or am I reading the spec wrong here?
Attachment #616021 - Flags: feedback?(jwalden+bmo)
I'm not sure if it's kosher yet.  It doesn't appear to correspond to step 11, certainly!  Unfortunately I don't understand how [[Match]] works, and what it returns, well enough yet to say for sure this patch is fully the right change.  I'll look at it more shortly.

Note that in SpiderMonkey code the proper style for single-line ifs doesn't brace, as long as the condition the if is testing fit all on one line (and the same is true for blocks associated with any else-if or else clauses, and their associated conditions, if any).  So you should remove the braces here, eventually (don't bother until I can verify this is a semantically correct and complete patch).
Comment on attachment 616021 [details] [diff] [review]
patch v1

Review of attachment 616021 [details] [diff] [review]:
-----------------------------------------------------------------

Sorry for the delay on this, I'm kind of buried under other work right now, and regex semantics are not my strong suit in the language.  :-(

::: js/src/builtin/RegExp.cpp
@@ +636,1 @@
>      }

We don't parenthesize single-line if statements (where both the condition being tested, and the body of the if, are single lines, that is).

I have no idea where that zeroing is coming from, or why it's there, but it's pretty clearly not supposed to be in the spec that way.  Weird.

...or no.  The comments are just confusing, because the step marked as 11 is not *just* step 11 -- it's also step 9a.  (Yes, really.)  Step 9 loops from the starting index up to the final index, testing for a match at each location.  The bottom-out case, when nothing matches, is step 9a.  We handle step 9a for the case where the starting location's out-of-bounds before the loop.  But we handle the case where we *did* enter the loop at the location you're changing here.  That's what the isNull() test is checking for -- setting the lastIndex to 0 if we didn't match at all.

Here's a testcase, with your patch applied:

[jwalden@wheres-wally src]$ dbg/js
js> var r = /a/g
js> r.lastIndex = 2
2
js> r.exec("bcdefgh");
null
js> r.lastIndex
2

Because there's no match, the lastIndex should be 0 at the end.  Without your patch, it's properly 0:

[jwalden@wheres-wally src]$ make -s -C dbg -j8
RegExp.cpp
dbg/js
[jwalden@wheres-wally src]$ dbg/js
js> var r = /a/g
js> r.lastIndex = 2
2
js> r.exec("bcdefgh");
null
js> r.lastIndex
0

I still haven't looked quite closely enough, or with enough confidence, to say that something's definitely spec-incorrect here.  But it's definitely the case at the very least that the "step 11" code needs to be commented to note that it also handles a portion of step 9a as well.
Attachment #616021 - Flags: feedback?(jwalden+bmo)
Assignee: jon → general
Attached patch fix_string_match_replace v1 (obsolete) — Splinter Review
I think I found the cause of the problem, but it's not entirely straight-forward. This patch only handles the case where the parameter JSObject* points to a RegExpObject, but that is not always the case. I'm not sure how to deal with proxies here.
Assignee: general → yaron.tausky
Attachment #616021 - Attachment is obsolete: true
Attachment #796311 - Flags: review?(jwalden+bmo)
(In reply to Yaron Tausky from comment #13)
> This patch only handles the case where the parameter
> JSObject* points to a RegExpObject, but that is not always the case. I'm not
> sure how to deal with proxies here.

Looking at ES5 String.prototype.match <http://ecma-international.org/ecma-262/5.1/#sec-15.5.4.10> it seems the regexp argument must always either be a genuine built-in RegExp, or else it will be coerced to a built-in RegExp on line 4. Thus it seems to me the 'rx' variable in the spec algo can never be a proxy.

Looking at ES6 String.prototype.match <http://people.mozilla.org/~jorendorff/es6-draft.html#sec-15.5.3.10> it seems the algorithm has changed to simply invoke the "match" method on its argument (which would trigger a proxy's "invoke" trap).
I managed to get it to work with those proxies, but it does look a bit hairy. Is this the right way to handle them?
I added the tests from WebKit's bugzilla and adopted their interpretation of replace()'s behaviour, but there's a small caveat: since we replace a substring immediately after matching, in order to pass the last test, I zeroed lastIndex in str_replace_regexp before the matching begins. Under normal circumstances it gives the same result, but I'm not sure if that would be acceptable if something were to fail later on.
Attachment #796311 - Attachment is obsolete: true
Attachment #796311 - Flags: review?(jwalden+bmo)
Attachment #798164 - Flags: review?(jwalden+bmo)
Comment on attachment 798164 [details] [diff] [review]
501739_fix_string_match_replace.patch v2

Review of attachment 798164 [details] [diff] [review]:
-----------------------------------------------------------------

I still need to go through this more, and attempt to page anything of RegExp code back into memory, but I should probably at least post the bits I'm sure about at this point.

You've probably noticed that our RegExp matching code/interfaces are messy here -- sorry you have to go through this byzantine implementation here.  :-\

::: js/src/builtin/RegExp.cpp
@@ +681,5 @@
> +    Value vp[2];
> +    vp[0].setUndefined();
> +    vp[1].setObject(*re);
> +    CallArgs args = CallArgsFromVp(0, vp);
> +    return CallNonGenericMethod<IsRegExp, RegExpZeroLastIndex_impl>(cx, args);

Bleh.  This isn't the right way to handle wrappers and proxies and stuff.  :-)  CallArgs should only ever be instantiated from the argc/vp passed to a JSNative, I think, and nowhere else.  (Well, except CallArgsFromSp, but that's basically totally internal.)

That aside, the spec algorithms -- or at least String.prototype.match, which is what I've looked at so far -- seem to say to [[Set]] the lastIndex property, class-agnostically.  So I don't think there's any need for IsRegExp and this kinda-dodgy call-non-generic stuff here at all -- we should just be JSObject::setProperty-ing the right value that way.  This also helps us with having correct behavior for such stupidities as

[jwalden@find-waldo-now src]$ gcc-dbg/js
js> var r = /foo/g;
js> Object.defineProperty(r, "lastIndex", { writable: false , value: 17 })
/foo/g
js> r.lastIndex
17
js> "foo".match(r)
["foo"]
js> r.lastIndex
0

where, technically, the match call should throw a Ty...um.  Looks like ES5 calls the [[Put]] method with too few arguments there, so who knows what should happen.

::: js/src/jsstr.cpp
@@ +1939,5 @@
> +    if (g.regExp().global()) {
> +        bool result = DoMatchGlobal(cx, args, res, linearStr, g.regExp());
> +        if (result && g.hasObj())
> +            RegExpZeroLastIndex(cx, g.obj());
> +        return result;

Given DoMatchGlobal is called exactly once, this zeroing should be folded into it, certainly.

@@ +2680,5 @@
>      if (rdata.repstr && rdata.repstr->length() == 0) {
>          JS_ASSERT(!rdata.lambda && !rdata.elembase && !rdata.dollar);
> +        bool result = str_replace_regexp_remove(cx, args, rdata.str, re);
> +        if (result && re.global() && rdata.g.hasObj())
> +            RegExpZeroLastIndex(cx, rdata.g.obj());

Same comment about folding zeroing into str_replace_regexp_remove.

@@ +2695,2 @@
>          if (!DoMatchForReplaceGlobal(cx, res, linearStr, re, rdata))
>              return false;

And again for DoMatchForReplaceGlobal.
The regular-expression goo between the ES5 methods and Yarr underneath is byzantine.  It took me a long time to understand it (plus what the changes in the proposed patch did), and I only really did that by doing the usual spec-step-comment thing I've been doing for a long time in our code.

The end result is I have a different patch, inspired by the one posted here, but basically entirely different, or something.  This also has a bunch of comments on weirdnesses we have here -- like only needing to update .lastIndex one time, and the weird way the RegExp statics get updated (which *doesn't* accord with the calls to RegExp.prototype.exec in the spec algorithm).  I'm sure it'll be a bear to review, but it's regular expression code, it goes with the terrain.  :-\
Attachment #804861 - Flags: review?(jorendorff)
Comment on attachment 798164 [details] [diff] [review]
501739_fix_string_match_replace.patch v2

Review of attachment 798164 [details] [diff] [review]:
-----------------------------------------------------------------

Okay, I think the "".match portion of this is covered by attachment 804861 [details] [diff] [review].

The "".replace parts of this are a lot messier, because of ES5's replace algorithm being a bunch of handwaviness.  I *think* the set of changes you've made, are sufficient for "".replace *post-conditions* to be correct.  (Although I think this is substantially less than clear due to the structuring of how all the fast-paths and implementation special-cases are handled.  I think we could make this quite a bit more readable if we spent some time distinguishing all the different cases up front, rather than spreading such decisions throughout half a dozen functions spread across hundreds of lines of code.  Probably too much work to do here.)

But it's not clear to me whether .lastIndex changes should be interspersed with replace-function callbacks or not.  The thread referenced in comment 0 makes different notes about what should happen, and what did happen in engines at the time (one of which is dead now).  ES6 is handwavy still, but *might* if I'm reading it right, advocate for setting .lastIndex to zero before calling the replacer, then never touching .lastIndex (for reads or writes) for the rest of "".replace's execution.  So I think we want to have .lastIndex zeroing occur at the start of DoMatchForReplaceGlobal, and at the start of str_replace_regexp_remove.  But really, these two zeroings at the start can be unified into a single zeroing, just before the /* Optimize removal. */ comment in str_replace_regexp.

Any chance you could whip up a patch to zero .lastIndex here, in the way the other patch posted here does, with tests that check that both the optimize-removal and don't-optimize-removal cases correctly change?  (You'll need to have a test with a replacer that assertEq's that .lastIndex is 0 through multiple callbacks, and that .lastIndex can be mutated in those callbacks and will have that value persist to the next callback.  Also probably worth having a test where .lastIndex is non-zero at the start, and separately non-writable at the start, to ensure those cases work as well.)
Attachment #798164 - Flags: review?(jwalden+bmo)
How does it look now?
Attachment #798164 - Attachment is obsolete: true
Attachment #806618 - Flags: review?(jwalden+bmo)
Comment on attachment 806618 [details] [diff] [review]
501739_string_replace_refactor.patch

Review of attachment 806618 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good modulo nitpicks.  I'll make the changes locally and push it (along with the other patch, which I notice you've gone the extra mile to write this patch atop :-) ) when the other patch here gets reviewed.  Thanks!

::: js/src/jsstr.cpp
@@ +1783,5 @@
>          return cx->compartment()->regExps.get(cx, patstr, opt, &re_);
>      }
>  
> +    bool zeroLastIndex(JSContext *cx) {
> +        if (regExpIsObject()) {

Slightly nitpicky, but generally it's good to minimize indentation when possible.  Here, structuring like so effects this:

  bool zeroLastIndex(...) {
    if (!regExpIsObject())
      return true;

    // ....
    RootedValue ...
    ....
  }

@@ +2708,5 @@
>  
>      RegExpStatics *res = cx->global()->getRegExpStatics();
>      RegExpShared &re = rdata.g.regExp();
>  
> +    /* The spec doesn't describe this function very clearly, so we go ahead

SpiderMonkey /**/ comment style for multi-line comments is like so:

/*
 * Line 1, wrapping to
 * line 2, ending on line
 * 3.
 */

Or we could just switch this to '//'-style comments.  I like /**/ for overview and documentation comments, myself (I think they're more visible and prominent due to the extra whitespace around them).  As this isn't such a comment, I'm going to switch to '//' comments for this.

@@ +2710,5 @@
>      RegExpShared &re = rdata.g.regExp();
>  
> +    /* The spec doesn't describe this function very clearly, so we go ahead
> +     * and assume that when the input to String.prototype.replace is a global
> +     * RegExp, the replacement step takes place only after the matching is

Rather than "the replacement step", let's make this "calling the replacer function (assuming one was provided)".

::: js/src/tests/ecma_5/String/replace-throws-nonwritable-lastIndex-global.js
@@ +28,5 @@
> +}
> +catch (e)
> +{
> +  assertEq(e instanceof TypeError, true,
> +           "should have thrown a TypeError, instead got: " + e);

Wouldn't hurt to add assertEq(p1.lastIndex, 0) here.

@@ +44,5 @@
> +}
> +catch (e)
> +{
> +  assertEq(e instanceof TypeError, true,
> +           "should have thrown a TypeError, instead got: " + e);

We should also have an assertEq(p2.lastIndex, 3) here, to verify nothing got changed.

@@ +60,5 @@
> +}
> +catch (e)
> +{
> +  assertEq(e instanceof TypeError, true,
> +           "should have thrown a TypeError, instead got: " + e);

assertEq(p3.lastIndex, 0)

@@ +76,5 @@
> +}
> +catch (e)
> +{
> +  assertEq(e instanceof TypeError, true,
> +           "should have thrown a TypeError, instead got: " + e);

assertEq(p4.lastIndex, 3)
Attachment #806618 - Flags: review?(jwalden+bmo) → review+
Comment on attachment 804861 [details] [diff] [review]
Totally refactor "".match with spec-step comments and .lastIndex updating

Review of attachment 804861 [details] [diff] [review]:
-----------------------------------------------------------------

r=me with comments addressed.

I'm not crazy about this lastIndex hack, but sure, whatever.

::: js/src/jsstr.cpp
@@ +1833,2 @@
>  {
> +    // Steps 8a, 8f(i) (partially), 8f(iii)(2)(a).

I think it's more direct and less confusing to write "Step 8a." here, and mention the other two steps in the comment.

The comment would be more direct if it were framed as "here's why we don't have to update .lastIndex any more after this, even though the spec says to do it" rather than "this single .lastIndex assignment does all the other assignments too".

But any comment is a vast improvement over none.

@@ +1834,5 @@
> +    // Steps 8a, 8f(i) (partially), 8f(iii)(2)(a).
> +    //
> +    // Note: String.prototype.match's behavior is unobservable after step 8a.
> +    // The inputs to the calls to RegExp.prototype.exec are a string and a
> +    // RegExp object, and the method is pure *except* with regard to the

Extra-pedantic semantic nit: "pure" isn't the right word here. It has no observable side effects; but a pure function also always returns the same value, for the same arguments.

(Remember when shaver marked `Math.random()` and `new Object()` as pure functions in Tracemonkey? Good times.)

@@ +1868,2 @@
>          if (!JS_CHECK_OPERATION_LIMIT(cx))
>              return false;

So, this can happen, which would make the leftover .lastIndex observable. Admittedly under pretty strange conditions.

@@ +1874,2 @@
>          if (status == RegExpRunStatus_Error)
>              return false;

Another error path.

@@ +1880,5 @@
>  
> +        lastSuccessfulStart = searchIndex;
> +
> +        // Steps 8f(iii)(1-3).
> +        searchIndex = match.isEmpty() ? searchIndex + 1 : nextSearchIndex;

I don't think this is right. Suppose re has a lookahead assertion. Then the match could be empty, but occur at an index > searchIndex, right?

Needs a test, if so.

@@ +1887,5 @@
>          JSLinearString *str = js_NewDependentString(cx, input, match.start, match.length());
>          if (!str)
>              return false;
>          if (!elements.append(StringValue(str)))
>              return false;

And OOM as well. Hmm.

@@ +1898,5 @@
>      }
>  
> +    // The last *successful* match updates the RegExpStatics. (Interestingly,
> +    // this implies that String.prototype.match's semantics aren't those
> +    // implied by the RegExp.prototype.exec calls in the ES5 algorithm.)

*queasy look*

@@ +1939,4 @@
>      return true;
>  }
>  
> +/* ES5 15.5.4.10. */

Why // comments up there and /**/ comments here? I don't mind, but // would be a little better.

::: js/src/tests/ecma_5/String/match-throws-nonwritable-lastIndex-global.js
@@ +45,5 @@
> +catch (e)
> +{
> +  assertEq(e instanceof TypeError, true,
> +           "should have thrown a TypeError, instead got: " + e);
> +}

Could add a third test for when no match is found. It should still throw in that case, right?
Attachment #804861 - Flags: review?(jorendorff) → review+
https://hg.mozilla.org/integration/mozilla-inbound/rev/13b2aafc8d02
https://hg.mozilla.org/integration/mozilla-inbound/rev/a179552e0d48
https://hg.mozilla.org/integration/mozilla-inbound/rev/f688d540beb1

Thanks again for the patchwork and willingness to respond to review commentary!
Flags: in-testsuite+
Target Milestone: --- → mozilla27
Depends on: 925949
Whiteboard: [good first bug][lang=c++][mentor=Waldo] → [good first bug][lang=c++][mentor=Waldo][DocArea=JS]
Depends on: 1168257
You need to log in before you can comment on or make changes to this bug.