Closed Bug 273477 Opened 20 years ago Closed 20 years ago

JavaScript String.split produces incorrect output if the regular expression can match the empty string

Categories

(Core :: JavaScript Engine, defect)

x86
Linux
defect
Not set
normal

Tracking

()

RESOLVED INVALID

People

(Reporter: iketo2, Unassigned)

Details

User-Agent:       Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.5) Gecko/20041107 Firefox/1.0
Build Identifier: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.7.5) Gecko/20041107 Firefox/1.0

The JavaScript method String.split accepts a regular expression as its
separator.  If the regular expression can match an empty string, the array
returned by String.split is not what one would expect.  For example,
'aaaabbbccbdddbbb'.split(/b+/) gives ['aaaa','cc','ddd',''], as expected, but
the corresponding 'aaaabbbccbdddbbb'.split(/b*/) gives
['a','a','a','a','c','c','d','d','d'], which is unexpected because the final
'' element is missing.  (In fact I'd expect an initial '' element as well,
since the first match of /b*/ is at position 0, but that's subject to
discussion.)
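
The comparison above can be reproduced with a short script.  This is only a
minimal sketch, not part of the original report; the helper name show is
hypothetical, and the output for /b*/ is deliberately left unstated because it
depends on the engine under test.

```javascript
// Input string from the report.
const reportInput = 'aaaabbbccbdddbbb';

// Hypothetical helper: format a split result so empty strings stay visible.
function show(parts) {
  return '[' + parts.map((p) => JSON.stringify(p)).join(',') + ']';
}

// Separator that can never match the empty string.
const plusResult = reportInput.split(/b+/);
// Separator that can match the empty string.
const starResult = reportInput.split(/b*/);

console.log('/b+/ ->', show(plusResult)); // ["aaaa","cc","ddd",""]
console.log('/b*/ ->', show(starResult)); // compare with the reported output

// Does each result end with an empty field?
console.log('/b+/ trailing empty:', plusResult[plusResult.length - 1] === '');
console.log('/b*/ trailing empty:', starResult[starResult.length - 1] === '');
```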

A related problem occurs when the regular expression contains capturing
parentheses.  While 'aaaabbbccbdddbbb'.split(/(b+)/) does the right thing and
gives you ['aaaa','bbb','cc','b','ddd','bbb',''],
'aaaabbbccbdddbbb'.split(/(b*)/) behaves strangely and gives you
['a','','a','','a','','a','bbb','c','','c','b','d','','d','','d','bbb'].  Note
that the last 'bbb' is not followed by a '', i.e., the separator is there, but
the last field is lost.
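
To see how captured separators are spliced into the result, the sketch below
labels every element by its position.  It assumes exactly one capturing group
in the separator, so fields sit at even indices and captures at odd indices;
the helper labelParts is hypothetical and not something from the report.

```javascript
const subject = 'aaaabbbccbdddbbb';

// Hypothetical helper: with ONE capturing group in the separator, split()
// alternates field, capture, field, capture, ... in the returned array.
function labelParts(parts) {
  return parts.map((text, i) => ({
    kind: i % 2 === 0 ? 'field' : 'separator',
    text,
  }));
}

const withPlus = subject.split(/(b+)/);
console.log(withPlus); // ['aaaa','bbb','cc','b','ddd','bbb','']
console.log(labelParts(withPlus));

// Same check for the separator that can match the empty string.  An even
// length here would mean the field after the last separator is missing.
const withStar = subject.split(/(b*)/);
console.log(withStar.length, labelParts(withStar));
```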

At the very least, one would expect the behaviour at the beginning of the
string to be the same as the behaviour at the end.  If we ask for
'bbbaaaabbbccbdddbbb'.split(/b*/), we get
['','a','a','a','a','c','c','d','d','d'], i.e., we get an empty field at the
beginning, but lose the one at the end.
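
One way to compare the two ends directly is to record whether the result
starts and ends with an empty field.  This is a sketch with a hypothetical
helper, edgeFields; no output is claimed for the trailing edge with /b*/,
since that is exactly what is in question here.

```javascript
// Hypothetical helper: report whether a split result has an empty field
// at the start and at the end.
function edgeFields(str, separator) {
  const parts = str.split(separator);
  return {
    leadingEmpty: parts[0] === '',
    trailingEmpty: parts[parts.length - 1] === '',
  };
}

// The string from the report both starts and ends with a run of b's.
console.log(edgeFields('bbbaaaabbbccbdddbbb', /b*/));
console.log(edgeFields('bbbaaaabbbccbdddbbb', /b+/));
```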

Reproducible: Always
Steps to Reproduce:
1.
2.
3.

Actual Results:  
Each of the examples above loses an empty string at the end.

Expected Results:  
See above.

> in fact I'd expect an initial '' element as well

There shouldn't be one, per Section 15.5.4.14 of ECMA-262, which says:

  In this case, separator does not match the empty substring at the beginning or
  end of the input string.

For the rest, the difference between + and * wrt the end of the string seems
like a bug indeed...

Status: UNCONFIRMED → NEW
Ever confirmed: true

From ECMA-262 Edition 3 15.5.4.14:

The value of separator may be an empty string, an empty regular expression, or a
regular expression that can match an empty string. In this case, separator does
not match the empty substring at the beginning or end of the input string, nor
does it match the empty substring at the end of the previous separator match.
(For example, if separator is the empty string, the string is split up into
individual characters; the length of the result array equals the length of the
string, and each substring contains one character.) If separator is a regular
expression, only the first match at a given position of the this string is
considered, even if backtracking could yield a non-empty-substring match at that
position. (For example, "ab".split(/a*?/) evaluates to the array ["a","b"],
while "ab".split(/a*/) evaluates to the array ["","b"].)

/be
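
The two examples in the ECMA-262 text quoted above can be checked directly.
A minimal sketch, assuming an engine that follows the quoted rules; the
variable names are illustrative only.

```javascript
// Examples taken from the quoted ECMA-262 text.
const lazySplit = 'ab'.split(/a*?/);  // lazy: first match at position 0 is empty
const greedySplit = 'ab'.split(/a*/); // greedy: first match at position 0 is 'a'

console.log(lazySplit);   // [ 'a', 'b' ]
console.log(greedySplit); // [ '', 'b' ]

// Empty-string separator: one element per character.
const chars = 'abc'.split('');
console.log(chars.length === 'abc'.length); // true
```
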
Status: NEW → RESOLVED
Closed: 20 years ago
Resolution: --- → INVALID