Closed
Bug 225094
Opened 21 years ago
Closed 19 years ago
RegExp.exec does not return null when there are no more matches
Categories
(Core :: JavaScript Engine, defect)
Tracking
()
RESOLVED
INVALID
People
(Reporter: crisp, Unassigned)
Details
Attachments
(1 file, 1 obsolete file)
3.19 KB,
text/html
|
Details |
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.5) Gecko/20031007 Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.5) Gecko/20031007 There is a problem when performing a regular expression in a loop that has only optional components; when no more matches are found the exec() does not return null, but an empty string or an array of empty strings; when you explicitly check for the null value in the loop the browser will hang in an eternal loop. Reproducible: Always Steps to Reproduce: simple testcase that will cause an eternal loop: <script type="text/javascript"> var s = 'abcdefghijklmnopqrstuvwxyz0123456789'; var re = /[a-z]?/g var match = re.exec(s); while (match != null) { document.write(match[0]+'<br />'); match = re.exec(s); } </script> Actual Results: browser hangup because of the eternal loop Expected Results: when there are no more matches, the exec() method should return the null value n/a
Comment 1•21 years ago
|
||
> when no more matches are found That never happens, since a match is _always_ found for that regexp.... Consider the algorithm in the Ecma spec (http://www.ecma-international.org/publications/files/ecma-st/Ecma-262.pdf): RegExp.prototype.exec(string): 1. Let S be the value of ToString(string). 2. Let length be the length of S. 3. Let lastIndex be the value of the lastIndex property. 4. Let i be the value of ToInteger(lastIndex). 5. If the global property is false, let i =0. 6. If i <0 or i > length then set lastIndex to 0 and return null. 7. Call [[Match]], giving it the arguments S and i. If [[Match]] returned failure, go to step 8; otherwise let r be its State result and go to step 10. 8. Let i = i+1. 9. Go to step 6. 10. Let e be r's endIndex value. 11. If the global property is true,setlastIndex to e. 12. Let n be the length of r's captures array. (This is the same value as 15.10.2.1's NCapturingParens.) 13. Return a new array with the following properties: - The index property is set to the position of the matched substring within the complete string S. - The input property is set to S. - The length property is set to n + 1. - The 0 property is set to the matched substring (i.e. the portion of S between offset i inclusive and offset e exclusive). - For each integer i such that i >0 and i <= n, set the property named ToString(i) to the ith element of r's captures array. Now in our case we go through till we have i == length. Then we call [[Match]] in step 7. The patterm in question happily matches a zero-length string, so [[Match]] returns success. Then we go to step 10. endIndex == length, since the regexp matched but took up no chars, so lastIndex gets set to "length" and we return an array. Next time through the loop, we end up with exactly the same situation. So I believe our behavior is in fact correct -- a regexp that matches any string (such as the one provided) cannot ever return null from an exec() call.
Even though it is according specs, i'd say returning nothing as a match (an empty string or array) can logically be concidered 'no match' and for the sake of preventing eternal loops just returning null might be the wiser thing to do - as Internet Explorer and Opera are also doing in this particular case...
Comment 3•21 years ago
|
||
> Even though it is according specs, i'd say returning nothing as a match (an
> empty string or array) can logically be concidered 'no match'
Well, no. It's a match -- that much is clear. For example, does the regexp
/^ *$/ match the empty string? Yes, it does. And if I do /^ *$/.test("") I
expect it to return true. It does in Mozilla. It also does in Opera.
/^ *$/.exec("") returns a non-null result in both Mozilla and Opera.
Now here is the kicker. In Opera, yout testcase is actually an infinite loop
(looking at Opera 7 Linux here). It just happens that in Opera you can interact
with the UI even while this infinite loop is running (though it does chew up a
nice amount of CPU as it runs). You can trivially test this by adding a
document.write of a nonempty string or an alert inside the loop and watching
what happens.
I have no way to test IE, so I can't verify your claim there.
Note that for String.prototype.match() the spec does explicitly specify a different behavior. Thus I think it is intentional. Both behaviors differ from Perl, where a zero-length match forces the next match to have non-zero length (or to fail). js> "bar".match( /.??/g ).slice( 1 ).join( ":" ); :: js> "bar".split( /.??/g ).join( ":" ); b:a:r js> "bar".replace( /.??/g, "<$&>" ); <>b<>a<>r<> ~ $ perl -we 'print +(join ":", "bar" =~ /.??/g), "\n"' :b::a::r: ~ $ perl -we 'print +(join ":", split /.??/, "bar"), "\n"' ~ $ perl -we '$_= "bar"; s/.??/<$&>/g; print "$_\n"' <><b><><a><><r><> A related thing is /(.*)*/ (see bug 209919, comment 9 ff): js> "a".match( /(a|b*)*/ )[1]; a js> "a".match( /(b*)*/ )[1] === undefined; // Bug in Mozilla < 1.6a ! true ~ $ perl -we 'print +("a" =~ /(a|b*)*/)[0], "\n"' ~ $ perl -we 'print ! defined +("a" =~ /(b*)*/)[0], "\n"' Perl's behavior may be more intuitive, but ECMA has choosen to use more straight-forward rules.
Comment 5•21 years ago
|
||
Comment 6•21 years ago
|
||
The testcase sets re = /[a-z]?/g s = 'abc123' If you run it in IE6 and repeatedly call re.exec(s), you get this output: N = 0 match = undef re.lastIndex = 0 N = 1 match = ["a"] re.lastIndex = 1 N = 2 match = ["b"] re.lastIndex = 2 N = 3 match = ["c"] re.lastIndex = 3 N = 4 match = [""] re.lastIndex = 4 N = 5 match = [""] re.lastIndex = 5 N = 6 match = [""] re.lastIndex = 6 N = 7 match = [""] re.lastIndex = 7 N = 8 match = null re.lastIndex = 0 This shows that in IE, once re.exec(s) has consumed every character of |s|, if called again it will consume $, and move beyond the end of the string. Once beyond the string, it (as would Mozilla) returns a null match. Mozilla on the other hand is following the spec, as Boris demonstrated. I will try to get in touch with Waldemar and ask him to comment on this. As one of the authors of the spec, his insight will be invaluable here -
IE handles this as specified for a similar String.prototype.match(). Opera 7.11 is ECMA-compliant. Konqueror 3.1.4 has odd results (behaves as Perl does, but fails to stop at $): N = 0 match = undef re.lastIndex = 0 N = 1 match = ["a"] re.lastIndex = 1 N = 2 match = ["b"] re.lastIndex = 2 N = 3 match = ["c"] re.lastIndex = 3 N = 4 match = [""] re.lastIndex = 3 N = 5 match = [""] re.lastIndex = 4 N = 6 match = [""] re.lastIndex = 5 N = 7 match = [""] re.lastIndex = 6 N = 8 match = [""] re.lastIndex = 6 N = 9 match = [""] re.lastIndex = 6 ... ~ $ perl -e '$_= "abc123"; for $i (1..16) { $m= /[a-z]?/g; print "$i <$&> ${\(pos)}\n"; $m or last }' 1 <a> 1 2 <b> 2 3 <c> 3 4 <> 3 5 <> 4 6 <> 5 7 <> 6 8 <> When using match() with /g , Mozilla resets lastIndex (not changed by the regexp rewrite). Is this intentional or should I file a bug? According to ECMA it should be 7. js> var re= /[a-z]?/g, a= "abc123".match( re ); js> a.toSource() ["a", "b", "c", "", "", "", ""] js> re.lastIndex 0
I was wrong, the behavior of match() is correct. When lastIndex is 7, exec() is invoked one more time and sets it to 0.
Comment 9•21 years ago
|
||
Attachment #135096 -
Attachment is obsolete: true
Comment 10•21 years ago
|
||
As before, re = /[a-z]?/g s = 'abc123' The newer testcase makes it easier to see the difference between Mozilla and IE on re.exec(s). The crux of the matter: ------------------------------------- Moz ------------------------------------- N = 2 match = ["b"] lastIndex = 2 leftContext = a rightContext = c123 N = 3 match = ["c"] lastIndex = 3 leftContext = ab rightContext = 123 N = 4 match = [""] lastIndex = 3 leftContext = abc rightContext = 123 N = 5 match = [""] lastIndex = 3 leftContext = abc rightContext = 123 etc. etc. ------------------------------------- IE ------------------------------------- N = 2 match = ["b"] lastIndex = 2 leftContext = a rightContext = c123 N = 3 match = ["c"] lastIndex = 3 leftContext = ab rightContext = 123 N = 4 match = [""] lastIndex = 4 leftContext = abc rightContext = 123 N = 5 match = [""] lastIndex = 5 leftContext = abc1 rightContext = 23 The two browsers (and others, I expect) agree on the first four matches. So when N = 4, both browsers match the 0-length string between 'c' and '1'. When N = 5, IE decides to consume the '1'. The next match according to IE is therefore the 0-length string between '1' and '2'. Mozilla follows the algorithm Boris outlined above. Since the N=4 match is successful, we fall into Step 10 and 11. Since the match is of zero length, |lastIndex| remains the same. The '1' is never consumed. The next match will be the same 0-length string between 'c' and '1'.
Comment 11•21 years ago
|
||
cc'ing Waldemar -
Comment 12•21 years ago
|
||
Perhaps re.exec() could return the null result set from the end of the string only once? Such as by setting lastIndex=-1 when it hits the end of the test string, and returning null thereafter.
Comment 13•19 years ago
|
||
Sorry, we follow ECMA, other browsers try to; everyone should stick to the spec. /be
QA Contact: pschwartau → general
Comment 14•19 years ago
|
||
marking invalid since this is not a bug in the implementation.
Status: UNCONFIRMED → RESOLVED
Closed: 19 years ago
Resolution: --- → INVALID
You need to log in
before you can comment on or make changes to this bug.
Description
•