Closed Bug 225094 Opened 18 years ago Closed 16 years ago

RegExp.exec does not return null when there are no more matches

Categories

(Core :: JavaScript Engine, defect)

x86
Windows 2000
defect
Not set
normal

Tracking

()

RESOLVED INVALID

People

(Reporter: crisp, Unassigned)

Details

Attachments

(1 file, 1 obsolete file)

User-Agent:       Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.5) Gecko/20031007
Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.5) Gecko/20031007

There is a problem when performing a regular expression in a loop that has only
optional components; when no more matches are found the exec() does not return
null, but an empty string or an array of empty strings; when you explicitly
check for the null value in the loop the browser will hang in an eternal loop.

Reproducible: Always

Steps to Reproduce:
simple testcase that will cause an eternal loop:

<script type="text/javascript">

var s = 'abcdefghijklmnopqrstuvwxyz0123456789';
var re = /[a-z]?/g

var match = re.exec(s);

while (match != null) {

  document.write(match[0]+'<br />');

  match = re.exec(s);

}
</script>

Actual Results:  
browser hangup because of the eternal loop

Expected Results:  
when there are no more matches, the exec() method should return the null value

n/a
> when no more matches are found

That never happens, since a match is _always_ found for that regexp.... 
Consider the algorithm in the Ecma spec
(http://www.ecma-international.org/publications/files/ecma-st/Ecma-262.pdf):

RegExp.prototype.exec(string):

1. Let S be the value of ToString(string). 
2. Let length be the length of S. 
3. Let lastIndex be the value of the lastIndex property. 
4. Let i be the value of ToInteger(lastIndex). 
5. If the global property is false, let i =0. 
6. If i <0 or i > length then set lastIndex to 0 and return null. 
7. Call [[Match]], giving it the arguments S and i. If [[Match]] returned
   failure, go to step 8; otherwise let r be its State result and go to step 10.
8. Let i = i+1. 
9. Go to step 6. 
10. Let e be r's endIndex value.
11. If the global property is true,setlastIndex to e. 
12. Let n be the length of r's captures array. (This is the same value as
    15.10.2.1's NCapturingParens.) 
13. Return a new array with the following properties: 
      - The index property is set to the position of the matched substring
        within the complete string S.
      - The input property is set to S.
      - The length property is set to n + 1. 
      - The 0 property is set to the matched substring (i.e. the portion of S
         between offset i inclusive and offset e exclusive).
      - For each integer i such that i >0 and i <= n, set the property named
        ToString(i) to the ith element of r's captures array.

Now in our case we go through till we have i == length.  Then we call [[Match]]
in step 7.  The patterm in question happily matches a zero-length string, so
[[Match]] returns success.  Then we go to step 10. endIndex == length, since the
regexp matched but took up no chars, so lastIndex gets set to "length" and we
return an array.  Next time through the loop, we end up with exactly the same
situation.

So I believe our behavior is in fact correct -- a regexp that matches any string
(such as the one provided) cannot ever return null from an exec() call.
Even though it is according specs, i'd say returning nothing as a match (an 
empty string or array) can logically be concidered 'no match' and for the sake 
of preventing eternal loops just returning null might be the wiser thing to do -
 as Internet Explorer and Opera are also doing in this particular case...
> Even though it is according specs, i'd say returning nothing as a match (an 
> empty string or array) can logically be concidered 'no match'

Well, no.  It's a match -- that much is clear.  For example, does the regexp
/^ *$/ match the empty string?  Yes, it does.  And if I do /^ *$/.test("") I
expect it to return true.  It does in Mozilla.  It also does in Opera.

/^ *$/.exec("") returns a non-null result in both Mozilla and Opera.

Now here is the kicker.  In Opera, yout testcase is actually an infinite loop
(looking at Opera 7 Linux here).  It just happens that in Opera you can interact
with the UI even while this infinite loop is running (though it does chew up a
nice amount of CPU as it runs). You can trivially test this by adding a
document.write of a nonempty string or an alert inside the loop and watching
what happens.

I have no way to test IE, so I can't verify your claim there.
Note that for String.prototype.match() the spec does explicitly specify a
different behavior. Thus I think it is intentional. Both behaviors differ from
Perl, where a zero-length match forces the next match to have non-zero length
(or to fail).

  js> "bar".match( /.??/g ).slice( 1 ).join( ":" );
  ::
  js> "bar".split( /.??/g ).join( ":" );
  b:a:r
  js> "bar".replace( /.??/g, "<$&>" );
  <>b<>a<>r<>

  ~ $ perl -we 'print +(join ":", "bar" =~ /.??/g), "\n"'
  :b::a::r:
  ~ $ perl -we 'print +(join ":", split /.??/, "bar"), "\n"'
  
  ~ $ perl -we '$_= "bar"; s/.??/<$&>/g; print "$_\n"'
  <><b><><a><><r><>

A related thing is /(.*)*/ (see bug 209919, comment 9 ff):

  js> "a".match( /(a|b*)*/ )[1];
  a
  js> "a".match( /(b*)*/ )[1] === undefined;  // Bug in Mozilla < 1.6a !
  true

  ~ $ perl -we 'print +("a" =~ /(a|b*)*/)[0], "\n"'
  
  ~ $ perl -we 'print ! defined +("a" =~ /(b*)*/)[0], "\n"'
  

Perl's behavior may be more intuitive, but ECMA has choosen to use more
straight-forward rules.
The testcase sets

                     re = /[a-z]?/g
                      s = 'abc123'


If you run it in IE6 and repeatedly call re.exec(s), you get this output:

N = 0		match = undef		re.lastIndex = 0
N = 1		match = ["a"]		re.lastIndex = 1
N = 2		match = ["b"]		re.lastIndex = 2
N = 3		match = ["c"]		re.lastIndex = 3
N = 4		match = [""]		re.lastIndex = 4
N = 5		match = [""]		re.lastIndex = 5
N = 6		match = [""]		re.lastIndex = 6
N = 7		match = [""]		re.lastIndex = 7
N = 8		match = null		re.lastIndex = 0


This shows that in IE, once re.exec(s) has consumed every character of |s|,
if called again it will consume $, and move beyond the end of the string.
Once beyond the string, it (as would Mozilla) returns a null match.

Mozilla on the other hand is following the spec, as Boris demonstrated.
I will try to get in touch with Waldemar and ask him to comment on this.
As one of the authors of the spec, his insight will be invaluable here -
IE handles this as specified for a similar String.prototype.match().
Opera 7.11 is ECMA-compliant.

Konqueror 3.1.4 has odd results (behaves as Perl does, but fails to stop at $):

  N = 0		match = undef		re.lastIndex = 0
  N = 1		match = ["a"]		re.lastIndex = 1
  N = 2		match = ["b"]		re.lastIndex = 2
  N = 3		match = ["c"]		re.lastIndex = 3
  N = 4		match = [""]		re.lastIndex = 3
  N = 5		match = [""]		re.lastIndex = 4
  N = 6		match = [""]		re.lastIndex = 5
  N = 7		match = [""]		re.lastIndex = 6
  N = 8		match = [""]		re.lastIndex = 6
  N = 9		match = [""]		re.lastIndex = 6
  ...

  ~ $ perl -e '$_= "abc123"; for $i (1..16) { $m= /[a-z]?/g; print "$i <$&>
${\(pos)}\n"; $m or last }'
  1 <a> 1
  2 <b> 2
  3 <c> 3
  4 <> 3
  5 <> 4
  6 <> 5
  7 <> 6
  8 <>

When using match() with /g , Mozilla resets lastIndex (not changed by the regexp
rewrite). Is this intentional or should I file a bug? According to ECMA it
should be 7.

  js> var re= /[a-z]?/g, a= "abc123".match( re );
  js> a.toSource()
  ["a", "b", "c", "", "", "", ""]
  js> re.lastIndex
  0
I was wrong, the behavior of match() is correct. When lastIndex is 7, exec() is
invoked one more time and sets it to 0.
As before,
                            re = /[a-z]?/g
                             s = 'abc123'

The newer testcase makes it easier to see the difference between
Mozilla and IE on re.exec(s). The crux of the matter:


------------------------------------- Moz -------------------------------------
N = 2	match = ["b"]	lastIndex = 2	 leftContext = a     rightContext = c123
N = 3	match = ["c"]	lastIndex = 3    leftContext = ab    rightContext = 123
N = 4	match = [""]	lastIndex = 3    leftContext = abc   rightContext = 123
N = 5	match = [""]	lastIndex = 3    leftContext = abc   rightContext = 123
                                     etc.
                                     etc.  


------------------------------------- IE -------------------------------------
N = 2	match = ["b"]	lastIndex = 2	 leftContext = a     rightContext = c123
N = 3	match = ["c"]	lastIndex = 3    leftContext = ab    rightContext = 123
N = 4	match = [""]	lastIndex = 4    leftContext = abc   rightContext = 123
N = 5	match = [""]	lastIndex = 5    leftContext = abc1  rightContext = 23



The two browsers (and others, I expect) agree on the first four matches.
So when N = 4, both browsers match the 0-length string between 'c' and '1'.

When N = 5, IE decides to consume the '1'. The next match according to IE
is therefore the 0-length string  between '1' and '2'. 

Mozilla follows the algorithm Boris outlined above. Since the N=4 match is 
successful, we fall into Step 10 and 11. Since the match is of zero length,
|lastIndex| remains the same. The '1' is never consumed. The next match
will be the same 0-length string  between  'c' and '1'.
cc'ing Waldemar -
Perhaps re.exec() could return the null result set from the end of the string
only once? Such as by setting lastIndex=-1 when it hits the end of the test
string, and returning null thereafter.
Sorry, we follow ECMA, other browsers try to; everyone should stick to the spec.

/be
QA Contact: pschwartau → general
marking invalid since this is not a bug in the implementation.
Status: UNCONFIRMED → RESOLVED
Closed: 16 years ago
Resolution: --- → INVALID
You need to log in before you can comment on or make changes to this bug.