Closed Bug 98409 (regexpliteralflaw) Opened 23 years ago Closed 15 years ago

literal global regular expression (regexp) instance remembers lastIndex

Categories

(Core :: JavaScript Engine, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla1.9.3a1

People

(Reporter: fstmaurice, Assigned: brendan)

References

Details

(Whiteboard: fixed-in-tracemonkey)

Attachments

(2 files, 4 obsolete files)

When testing a regular expression such as in an email validation, the
RegExp.test() method returns the right value only half the time.

Reproducible: Always
Steps to Reproduce:

- Load the following attachement
- Click several times on the RegExp.test() button to validate the email

Actual Results:  The email is found to be valid only half the time

Expected Results:  The test method should always return true.

Other comments: This testcase works as expected in Internet Explorer, but not in
NS4.
Attached file Test Case of the bug
For a workaround, just take out the comment on line 8 of the test case.
seeing this on current linux trunk as well
Status: UNCONFIRMED → NEW
Ever confirmed: true
OS: Windows 2000 → All
Hardware: PC → All
You can see this behavior with a simpler example: 

                     var RE = /a/g;
                     var STR = 'aaa';


Then do RE.test(STR) in a loop. I will attach this test below -
Attached file Reduced testcase
The reduced testcase runs the test in two loops of five turns each.

It shows that each time RE.test(STR) is executed, RE.lastIndex increments,
and the next RE.test(STR) is done from this position in STR. Once we have 
reached the end of STR, RE.lastIndex == STR.length and there is no match
from this position. At this point, RE.lastIndex gets set back to 0.
Thus when we test again, we get a match again, at the beginning of STR.
From the ECMA-262 Final Spec (December 1999):

15.10.6.3 RegExp.prototype.test(string) 
Equivalent to the expression RegExp.prototype.exec(string) != null. 


15.10.6.2 RegExp.prototype.exec(string) 
Performs a regular expression match of string against the regular expression
and returns an Array object containing the results of the match, or null
if the string did not match 

The string ToString(string) is searched for an occurrence of the regular 
expression pattern as follows: 

1.  Let S be the value of ToString(string). 
2.  Let length be the length of S. 
3.  Let lastIndex be the value of the lastIndex property. 
4.  Let i be the value of ToInteger(lastIndex). 
5.  If the global property is false, let i = 0. 
6.  If I < 0 or I > length then set lastIndex to 0 and return null. 
7.  Call [[Match]], giving it the arguments S and i. If [[Match]] returned             
    failure, go to step 8; otherwise let r be its State result & go to step 10. 
8.  Let i = i+1. 
9.  Go to step 6. 
10. Let e be r's endIndex value. 
11. If the global property is true, set lastIndex to e. 
12. Let n be the length of r's captures array.
    (This is the same value as 15.10.2.1's NCapturingParens.) 
13. Return a new array with the following properties: 
  • The index property is set to the position of the matched substring
    within the complete string S. 
  • The input property is set to S. 
  • The length property is set to n + 1. 
  • The 0 property is set to the matched substring
    (i.e. the portion of S between offset i inclusive and offset e exclusive). 
  • For each integer i such that I > 0 and I •n, set the property named                     
    ToString(i) to the i th element of r's captures array.
  
Here is the output of the reduced testcase:


TESTING:   RE = /a/g
           STR = 'aaa'
          (two loops of five turns)

RE.test(STR) = true    RE.lastIndex = 1    RegExp.lastMatch = a    
RE.test(STR) = true    RE.lastIndex = 2    RegExp.lastMatch = a    
RE.test(STR) = true    RE.lastIndex = 3    RegExp.lastMatch = a    
RE.test(STR) = false   RE.lastIndex = 0    RegExp.lastMatch = a    
RE.test(STR) = true    RE.lastIndex = 1    RegExp.lastMatch = a    

RE.test(STR) = true    RE.lastIndex = 2    RegExp.lastMatch = a    
RE.test(STR) = true    RE.lastIndex = 3    RegExp.lastMatch = a    
RE.test(STR) = false   RE.lastIndex = 0    RegExp.lastMatch = a    
RE.test(STR) = true    RE.lastIndex = 1    RegExp.lastMatch = a    
RE.test(STR) = true    RE.lastIndex = 2    RegExp.lastMatch = a    
FINDINGS

I get the same results above in both NN4.7 and Mozilla.
In IE4.7 I get


TESTING:   RE = /a/g
           STR = 'aaa'
          (two loops of five turns)

RE.test(STR) = true    RE.lastIndex = undefined    RegExp.lastMatch = undefined    
RE.test(STR) = true    RE.lastIndex = undefined    RegExp.lastMatch = undefined    
RE.test(STR) = true    RE.lastIndex = undefined    RegExp.lastMatch = undefined    
RE.test(STR) = true    RE.lastIndex = undefined    RegExp.lastMatch = undefined    
RE.test(STR) = true    RE.lastIndex = undefined    RegExp.lastMatch = undefined    

RE.test(STR) = true    RE.lastIndex = undefined    RegExp.lastMatch = undefined    
RE.test(STR) = true    RE.lastIndex = undefined    RegExp.lastMatch = undefined    
RE.test(STR) = true    RE.lastIndex = undefined    RegExp.lastMatch = undefined    
RE.test(STR) = true    RE.lastIndex = undefined    RegExp.lastMatch = undefined    
RE.test(STR) = true    RE.lastIndex = undefined    RegExp.lastMatch = undefined    
I think that NN4.x, Mozilla, and N6 are behaving according to spec here.
In particular, note ECMA steps 6 and 11:


6.  If i < 0 or i > length then set lastIndex to 0 and return null. 

11. If the global property is true, set lastIndex to e.


So if the global property is true (as in this bug report),
lastIndex gets incremented. Once we have lastIndex > length, the
method is supposed to return null by step 6. When this happens,
then RE.test() == (RE.prototype.exec(STR) != null) == false,
and lastIndex gets set back to 0.
Have to mark this one invalid. Thank you for this report, however -
please continue using Bugzilla. We depend on contributors like you
to catch the things we miss!

Will ask rogerl to verify this one -
Status: NEW → RESOLVED
Closed: 23 years ago
Resolution: --- → INVALID
rogerl agrees - marking Verified. 

Note that RE.test(STR)== false can actually be utilized as information -
it tells you when you've gone beyond the length of the string STR.
Status: RESOLVED → VERIFIED
*** Bug 112376 has been marked as a duplicate of this bug. ***
*** Bug 185415 has been marked as a duplicate of this bug. ***
I apologize. Invalid bug.

6.  If I < 0 or I > length then set lastIndex to 0 and return null. 


So \g matches the last index. Should really be "g", though, which is why I
didn't get the behavior in IE. With the RegExp constructor, you don't escape flags.
*** Bug 225408 has been marked as a duplicate of this bug. ***
Better summary for duping.

/be
Summary: RegExp.test() method does not always return the same result → literal global regular expression (regexp) instance remembers lastIndex
*** Bug 252363 has been marked as a duplicate of this bug. ***
*** Bug 252363 has been marked as a duplicate of this bug. ***
Note also that a literal regexp causes one object to be created at runtime when
the script or function containing the literal regexp is evaluated, even if the
literal is in a loop body.  If you call new RegExp, you'll get one object per
new expression evaluated, which in a loop body will be as many as the loop iterates.

/be
*** Bug 291175 has been marked as a duplicate of this bug. ***
*** Bug 296610 has been marked as a duplicate of this bug. ***
*** Bug 330221 has been marked as a duplicate of this bug. ***
*** Bug 330334 has been marked as a duplicate of this bug. ***
*** Bug 286259 has been marked as a duplicate of this bug. ***
*** Bug 237111 has been marked as a duplicate of this bug. ***
*** Bug 334795 has been marked as a duplicate of this bug. ***
*** Bug 339687 has been marked as a duplicate of this bug. ***
*** Bug 347748 has been marked as a duplicate of this bug. ***
See http://wiki.ecmascript.org/doku.php?id=proposals:bug_fixes for the good news that this bug will no longer be INVALID in ES4. Should we reopen and fix ahead of the spec for JS1.8 in Firefox 3, to get real-world testing? I think so, and to avoid forgetting this issue, I'm making this bug blog the js1.8 meta-bug.

/be
Blocks: js1.8
"[REGEXP.CONSTRUCTION] new RegExp(”pattern”) and /pattern/ mean different things: the latter is not created anew every time it is evaluated. This is usually surprising and is probably a historical accident? [Brendan says it was a mistake that optimized the wrong good.] In the context of E4X it will be more surprising still, since evaluating <tag>content</tag> yields a new XML object every time."

So we want /pattern/ to instantiate (or behave as though it is) a new RegExp object on every iteration?  For global regexp literals this means setting lastIndex = 0 before every match?  What other differences should it yield?
Oh, and doesn't changing this behavior potentially break the web in a noticeable way?
The main problem is that lastIndex doesn't reset, but a full fix would require things like removing custom properties as well.

As for breakage, it depends on the use. IE recreates the regexp, so any application depending on this behavior would break there. When stored in a variable, the current behavior is also contrary to basic JS expectations (no static variables). 

Direct use, e.g. while(res=/re/g.exec(str)){}, is trickier. It's still static and breaks IE, but the staticness is less obvious here. There could be non-cross-browser scripts depending on it. So the public web would be safe, but some isolated sections might not.
(In reply to comment #37)
> Oh, and doesn't changing this behavior potentially break the web in a
> noticeable way?

Not likely. The third most duped bug when I looked (see my blog, first post on
js1, js2 and in between) was about this. People have latent bugs due to it. The
sorts of workarounds you would add (resetting lastIndex too much) become
harmless with the incompatible "bug fix".

As Magnus points out, IE doesn't follow ES3 spec here. That's a strong clue that we can change without breaking the web, but user-agent control flow forks could bite us.

We are planning to make the "bug fix" for JS2/ES4: /hi/ evaluates each time to
a fresh object. Late to think about trying it out in 1.9/fx3 but we could take
that chance. Or do it soon after.

/be
Blocks: es5
Alias: regexpliteralflaw
WebKit also does this the same as IE. So Gecko and Opera is the only ones doing it "the right way" right now. So I think it's safe to switch this behavior since normally users tends to make things cross browser these days and not Firefox only.
Assignee: rogerl → general
QA Contact: pschwartau → general
IE, grrr. For years I had ass-u-med they followed ES3. That alone does not raise the temperature on this bug, but in conjunction with WebKit and (separately, also raising temp) ES3.1, I think it's time to morph.

The patch should be small. If it looks like trouble, 1.9.2. If not, let's please consider it with a careful risk analysis.

Magic-8-ball says "No doubt!" to "Waldo?".

/be
Assignee: general → jwalden+bmo
Status: VERIFIED → REOPENED
Flags: wanted1.9.1?
Resolution: INVALID → ---
Status: REOPENED → NEW
No longer blocks: js1.8
Hi --- this bug is blocking for ECMAScript 5.  Any progress here?
No, haven't even looked at it in months.  Maybe soon.
I've started working on the patch, also doing a little thinking on what the parse tree and bytecode should look like.  Nothing testable yet, but this should be fairly straightforward to finish up.
Shouldn't require new bytecode or AST or anything, just adjustment to the meaning of JSOP_REGEXP and some nice reduction in code that optimized for the old lexical singleton way. Igor knows this code, please have him review your version of this, if this is close enough to the mark. Merry Christmas!

/be
Attached patch shorter draft patch (obsolete) — Splinter Review
Still "gifting" (not regifting ;-) this patch. Soliciting Igor's review since he has been involved in this code.

I removed more old, overlong, and now-bogus comments. Not only can't we hoist the cloning without a ton of analysis, we shouldn't try. Also, js_CloneRegExpObject calls js_NewObject, etc., which bootstrap parent for us, no need for that loop up the scope chain.

Unless there's some code (more likely stale comments) I'm forgetting that knows about script->nfixed for global code counting regexps in addition to gvars, or some similar code that knows about function reserved slots counting upvars and regexps lazily cloned as singleton-per-lexeme objects, this should do it.

/be
Attachment #419184 - Attachment is obsolete: true
Attachment #419210 - Flags: review?(igor)
(In reply to comment #62)
> I removed more old, overlong, and now-bogus comments. Not only can't we hoist
> the cloning without a ton of analysis, we shouldn't try.

The cloning avoidance makes sense currently because the script stores not only the regexp itself but also the object that holds it. If that object would be eliminated and script be taught to store the regexp directly, then there would be no point in doing JSOP_REGEXP->JSOP_OBJECT.
Attachment #419210 - Flags: review?(igor) → review+
Attached patch passes tests (obsolete) — Splinter Review
There were a couple more places that knew about global scripts' nfixed members counting regexps as well as gvars.

/be
Attachment #419210 - Attachment is obsolete: true
Attachment #419446 - Flags: review?(igor)
Attachment #419446 - Flags: review?(igor) → review+
(In reply to comment #63)
> (In reply to comment #62)
> > I removed more old, overlong, and now-bogus comments. Not only can't we hoist
> > the cloning without a ton of analysis, we shouldn't try.
> 
> The cloning avoidance makes sense currently because the script stores not only
> the regexp itself but also the object that holds it. If that object would be
> eliminated and script be taught to store the regexp directly, then there would
> be no point in doing JSOP_REGEXP->JSOP_OBJECT.

Are you describing the pre-ES5 semantics here, or something about the patch that could be improved?

With ES5 there is a useful optimization still: compile-n-go scripts expressing regexp literals in statements executed at most once need not clone at all -- they can use the compiler-created "exemplar". The patch selects JSOP_OBJECT over JSOP_REGEXP for this case.

Otherwise ES5 semantics require JSOP_REGEXP's cloning behavior.

/be
Flags: in-testsuite?
(In reply to comment #65)
> Are you describing the pre-ES5 semantics here, or something about the patch
> that could be improved?

That is for another bug.

> 
> With ES5 there is a useful optimization still: compile-n-go scripts expressing
> regexp literals in statements executed at most once need not clone at all --
> they can use the compiler-created "exemplar". The patch selects JSOP_OBJECT
> over JSOP_REGEXP for this case.

JSOP_REGEXP->JSOP_OBJECT currently makes sense because, to store JSRegExp, JSScript also needs to store JSObject holding it. If the latter can be avoided and JSScript would hold JSRegExp directly, the would be no need to do the optimization.

> Otherwise ES5 semantics require JSOP_REGEXP's cloning behavior.
> 
> /be
(In reply to comment #66)
> JSOP_REGEXP->JSOP_OBJECT currently makes sense because, to store JSRegExp,
> JSScript also needs to store JSObject holding it. If the latter can be avoided
> and JSScript would hold JSRegExp directly, the would be no need to do the
> optimization.

You need two data structures, one the compiler creates (expensive to duplicate) for the regular expression, the other a lightweight to hold lastIndex and any other mutable state (ad-hoc properties) -- the first is JSRegExp, the second is JSObject. You need one of the first, one or more of the latter.

Ah, but are you suggesting fusing the two as we did with JSFunction/JSObject for similar reasons, so the first has a "free" object (JSRegExp is-a JSObject)? That would only save a GC thing allocation at compile time, though. You'd still want two data structures and bytecodes, AFAICT.

/be
(In reply to comment #67)
> You need two data structures, one the compiler creates (expensive to duplicate)
> for the regular expression, the other a lightweight to hold lastIndex and any
> other mutable state (ad-hoc properties) -- the first is JSRegExp, the second is
> JSObject. You need one of the first, one or more of the latter.

Currently the compiler creates both the object and JSRegExp even if the object is not used as the regexp would always be cloned. This extra object can be eliminated if the compiler creates and stores in JSScript just an instance of JSRegExp without this extra object. Then JSOP_REGEXP needs just to create a new object and set its private data to the stored JSRegExp. This way the object is created only when JSOP_REGEXP is executed avoiding any need in doing JSOP_REGEXP->JSOP_OBJECT optimization.
Gotcha -- seems like a good followup bug.

/be
This bug's patch needs a test, and needs to be committed. Waldo, do the es5 tests from MSFT cover this change? Do we have those integrated somehow, or will we? Or if not, where would you add the test to js/src/tests?

/be
The MSFT es5 tests don't cover it -- or at least, assuming such tests would and should fit under 7.8.5, they don't exist.  We don't have them integrated yet; last I looked the harness was buggy -- since fixed, but I haven't checked on it recently (and, at the moment, the only harness interface only runs on Windows).  I assume we will integrate them sometime, but we haven't, they're not as comprehensive as our own tests in my estimation, and I'm wary of punting on testing until some unspecified time when we do integrate them anyway.

ecma_5/RegExp/ is as good a place as any for a new test, I guess.
http://hg.mozilla.org/tracemonkey/rev/3862a7e48e79

/be
Whiteboard: fixed-in-tracemonkey
Backed out changeset 3862a7e48e79 due to tinderbox failures on js1_5/GC/regress-324278.js.

http://hg.mozilla.org/tracemonkey/rev/ee9b5d13cbaf

The error is JIT-only and looks like this:

$ ../d-objdir/js -j -f shell.js -f js1_5/shell.js -f js1_5/GC/shell.js js1_5/GC/regress-324278.js
begin test: js1_5/GC/regress-324278.js
BUGNUMBER: 324278
STATUS: GC without recursion
STATUS: BUILT
js1_5/GC/regress-324278.js:71: TypeError: chainTop is not a function
Whiteboard: fixed-in-tracemonkey
(In reply to comment #73)
> Backed out changeset 3862a7e48e79 due to tinderbox failures on
> js1_5/GC/regress-324278.js.
> 
> http://hg.mozilla.org/tracemonkey/rev/ee9b5d13cbaf
> 
> The error is JIT-only and looks like this:

Not JIT-only, from my gdb'ing the `jsDriver.pl -t` command line with and without -j.

> $ ../d-objdir/js -j -f shell.js -f js1_5/shell.js -f js1_5/GC/shell.js
> js1_5/GC/regress-324278.js
> begin test: js1_5/GC/regress-324278.js
> BUGNUMBER: 324278
> STATUS: GC without recursion
> STATUS: BUILT
> js1_5/GC/regress-324278.js:71: TypeError: chainTop is not a function

This is precisely because of the ES5-formalized change to semantics. The test has:

function build(N) {
  // Explore the fact that regexp literals are shared between
  // function invocations. Thus we build the following chain:
  // chainTop: function->regexp->function->regexp....->null
  // to check how GC would deal with this chain.

  var chainTop = null;
  for (var i = 0; i != N; ++i) {
    var f = Function('some_arg'+i, ' return /test/;');
    var re = f();
    re.previous = chainTop;
    chainTop = f;
  }
  return chainTop;
}

and later, check invokes chainTop:

function check(chainTop, N) {
  for (var i = 0; i != N; ++i) {
    var re = chainTop();
...

The second invocation of the synthesized (by Function) function evaluates /test/ into a fresh RegExp object, with no shared singleton-per-lexeme so no .previous property.

Igor, thoughts on how to fix the test?

/be
The comment means "Exploit", not "Explore". I'd like to re-land the patch for this bug with the test disabled. Comments?

/be
Assignee: jwalden+bmo → brendan
Attachment #419446 - Attachment is obsolete: true
Status: NEW → ASSIGNED
Attachment #420320 - Flags: review?(igor)
Test change looks good to me.
http://hg.mozilla.org/tracemonkey/rev/7599965304e6

I landed again -- Igor, please feel free to ping me with any test changes you want and I'll make them.

/be
Whiteboard: fixed-in-tracemonkey
Seems related to http://code.google.com/p/google-caja/issues/detail?id=528
Once this is fixed, would it then be safe to whitelist RegExp.prototype.test and RegExp.prototype.exec in Caja and draft SES[1]?

[1] http://code.google.com/p/es-lab/source/browse/trunk/src/ses/initSES.js#106
(In reply to comment #79)
> Seems related to http://code.google.com/p/google-caja/issues/detail?id=528

No, this bug (bug 98409) is not related to the issue mentioned there, specifically as described at:

http://code.google.com/p/google-caja/wiki/RegexpsLeakMatchGlobally

This bug is about a regexp literal evaluating to the same singleton per lexeme no matter how often evaluated, per ES3.

The issue described by the caja doc is about RegExp.input being the default arg to exec and test. I don't see a bugzilla bug on file. There is only a starting attempt at document the "RegExp static properties" here:

http://wiki.whatwg.org/wiki/Web_ECMAScript#RegExp

Before filing a bugzilla bug it would help to have a full testsuite and results for all the major browsers. In any case, we should take this elsewhere.

/be
(In reply to comment #76)
> Created an attachment (id=420320) [details]
> fix the test to track the semantic change

The main purpose of that test from the bug 324278 is to show that the Deutsch-Schorr-Waite algorithm only works if it covers *all* possible navigation paths through the object graph. In particular, the test shows how to expose a bug in the DSW implementation via constructing a deep GC thing chain with links like JSFunction->JSScript->regexp_object_accessible_from_scripts->JSFunction. Since the DSW implementation was not aware that JSScript can point to an arbitrary object graph, that lead to stack overflow during the GC. 

Changes in this bug removes the possibility of having JSFunction->JSScript->regexp_object_accessible_from_scripts. So, if this bug were fixed prior that bug, the would be no way to show the problems in the DSW code using regexps in function literals.

To restore the original test coverage the test may use something like:

var cursor = null;
for (var i = 0; i != N; ++i) {
    var iter = Function('eval("yield /re/")')();
    var re = iter.next();
    re.prev = cursor;
    cursor = iter;
    iter = null;
}


This builds a chain like JSGenerator->JSScript->JSRegExp->JSGenerator.
http://hg.mozilla.org/mozilla-central/rev/7599965304e6
Status: ASSIGNED → RESOLVED
Closed: 23 years ago15 years ago
Resolution: --- → FIXED
Depends on: 540985
Depends on: 540306
Target Milestone: --- → mozilla1.9.3a1
Hello all!

I've found this bug while searching for problem in WikEd MediaWiki JS editor extension (http://en.wikipedia.org/wiki/User:Cacycle/wikEd), read all the comments and did not fully understand - which behaviour you think is the correct one?
1) /a/g.match('aaa') must remember position?
or
2) /a/g.match('aaa') must not remember position?

Now, Firefox 4 behaves like (2), so the following widely-used in WikEd code:
 while(/.../g.match(string) != null)
 {
 }
gives us an infinite loop.

My question is - is this Firefox 4 bug? Or do you think /.../g must create new Regexp object on each iteration?
(2) is correct behavior per ECMAScript 5.  (1) was specified by ECMAScript 3, but that's old and busted now, ES5 is the new hotness, and everyone implements (2) now (except maybe Opera, but if they haven't changed as we did I'd be very surprised).
Ok, thanks for the information.
It's really "except Google Chrome"...
Just checked, Opera does the same, and Chrome does ES3 behaviour...
(In reply to comment #90)
> Ok, thanks for the information.
> It's really "except Google Chrome"...
> Just checked, Opera does the same, and Chrome does ES3 behaviour...

What version of Chrome? I'm not seeing this bug on Chrome 12.0.742.9 dev.
Oops, that was some mystics. Now I also don't see this bug (on the same version). Maybe it was 11.x...
Hello, why this bug still happens in firefox 9? I tought that with the new especification, this bug should be fixed.
Chrome also have this bug...
Attachment #420320 - Attachment is obsolete: true
Attachment #420320 - Flags: review?(igor)
Hi. In Firefox 21, on Ubuntu 64bits still happens.

I has been tested with:

var regExp = /ba/ig;
for (var i = 0; i < 10; ++i) {
    console.log(regExp.test('a'));
}

// Result:

// true
// false
// true
// false
// true
// false
// true
// false
// true
// false
Yes, that's the spec-required behavior.  This bug was about using a regexp literal, which you are not.  As in, it was about this testcase:

  for (var i = 0; i < 10; ++i) {
    console.log(/ba/ig.test('a'));
  }

which conceptually creates a new RegExp object on each iteration.  You, on the other hand, are using the same RegExp object over and over.
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: