Closed Bug 359651 Opened 18 years ago Closed 10 years ago

Non-greedy regular expressions can capture an extra character under certain circumstances

Categories

(Core :: JavaScript Engine, defect)

defect
Not set
trivial

Tracking

RESOLVED INVALID

People

(Reporter: kliu, Unassigned)

Details

(Keywords: regression)

Attachments

(1 file)

User-Agent:       Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1) Gecko/20061010 Firefox/2.0
Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1) Gecko/20061010 Firefox/2.0

Under certain unusual circumstances, a non-greedy regexp is not entirely non-greedy.  I have not tested every case where this happens, but in the cases I have tested, the problem appears when all of the following hold: 1) the pattern contains a capture (parentheses), 2) the capture should match 0 bytes when non-greedy and >0 bytes when greedy, and 3) the capture parentheses are themselves followed by a '?'.  When all three hold, the capture matches 1 byte when non-greedy (instead of 0 bytes).

Reproducible: Always

Steps to Reproduce:
1. Run the following JavaScript:
var x = "123";
var regexp = /^(.*?)?(\d+)$/;
alert(x.replace(regexp, "$1-$2"));

Actual Results:  
1-23

Expected Results:  
-123


Perl: -123
MSIE/6: -123
Gecko/1.5 (FB/0.7): -123
Gecko/1.6 (FF/0.8): 1-23

So the problem cropped up somewhere in the transition from Gecko/1.5 to Gecko/1.6.  Everything I tried after Gecko/1.6 (incl. trunk) returns the incorrect "1-23" string.

Other cases:
x = "123"; regexp = /^(.*?)(\d+)$/; -> WORKS (no '?' after capture)
x = "x123"; regexp = /^(.*?)?(\d+)$/; -> WORKS (>0 bytes in capture)
x = "x123"; regexp = /^(x.*?)?(\d+)$/; -> WORKS (>0 bytes in capture)
x = "x123"; regexp = /^x(.*?)?(\d+)$/; -> BROKEN
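
The cases above can be checked in one go with a small script. This is only a sketch: the `captures` helper and the `cases` table are names made up for illustration, `console.log` stands in for `alert` so that it runs outside a browser, and the last column simply restates how each case is labelled in this report.

```javascript
// Hypothetical helper: returns [$1, $2] for a pattern applied to an input.
// An unset capture is shown as "", which is what replace() would substitute.
function captures(regexp, input) {
  const m = regexp.exec(input);
  if (m === null) return null;
  return [m[1] === undefined ? "" : m[1], m[2] === undefined ? "" : m[2]];
}

// The original test case followed by the four "other cases".
const cases = [
  { input: "123",  regexp: /^(.*?)?(\d+)$/,  reported: "BROKEN" },
  { input: "123",  regexp: /^(.*?)(\d+)$/,   reported: "WORKS" },
  { input: "x123", regexp: /^(.*?)?(\d+)$/,  reported: "WORKS" },
  { input: "x123", regexp: /^(x.*?)?(\d+)$/, reported: "WORKS" },
  { input: "x123", regexp: /^x(.*?)?(\d+)$/, reported: "BROKEN" },
];

for (const c of cases) {
  const [g1, g2] = captures(c.regexp, c.input);
  console.log(`${c.input} ${c.regexp} -> $1="${g1}" $2="${g2}" (${c.reported})`);
}
```

In the two cases marked BROKEN, $1 comes back as "1" where I expected an empty string.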

I stumbled upon this bug by accident.  A '?' after a (.*?), while perfectly legal, is redundant.  I was changing some regexps around, and I had left in a '?' after changing one of my captures from a pattern that matched at least 1 byte to one that matched at least 0 bytes, specifically (.*?).  I then spent some time wondering why my code had suddenly stopped working correctly.  Because you shouldn't need a '?' after a capture that can already match 0 bytes, I think that this is a minor problem.  Nevertheless, the behavior exhibited by Gecko/1.6 and above is incorrect and inconsistent with that of Perl, and it should be corrected.  Other people could end up doing what I did: neglecting to remove the '?' when a change to a capture made it redundant.
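
As a sketch of the workaround implied here, dropping the redundant '?' after the capture gives the result I expected. The helper name `splitPrefix` and the two constant names are made up for illustration.

```javascript
// Hypothetical helper: formats the two captures the same way as the test case.
function splitPrefix(input, regexp) {
  return input.replace(regexp, "$1-$2");
}

// Redundant '?' after a capture that can already match 0 bytes.
const withRedundantQuantifier = /^(.*?)?(\d+)$/;
// The same pattern with the redundant '?' removed.
const withoutRedundantQuantifier = /^(.*?)(\d+)$/;

console.log(splitPrefix("123", withRedundantQuantifier));    // "1-23", as reported here
console.log(splitPrefix("123", withoutRedundantQuantifier)); // "-123"
```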
Attached file bug demonstration
Brian, another one crying out for help from you.

/be
Blocks: 443590
Severity: minor → trivial
Status: UNCONFIRMED → NEW
Ever confirmed: true
Keywords: regression
OS: Windows XP → All
Hardware: PC → All
Given that all JS engines agree on the result here, this isn't a bug. If anything, it's a specification bug, but even then it most certainly can't be changed anymore, because client code relies on this behavior by now.
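
A minimal sketch of the rule behind this result: in ECMAScript pattern semantics, when a quantifier that permits zero repetitions is applied to a group, an iteration in which the group matched the empty string is rejected. The lazy quantifier inside the group therefore has to consume one character before the match can continue. The two patterns below are reduced examples chosen for illustration, not taken from the report.

```javascript
// A lazy group on its own is free to match the empty string.
const bare = /(a*?)/.exec("aaa");

// With '?' applied to the group, an empty iteration of the group is rejected,
// so the inner lazy quantifier takes one character instead.
const optional = /(a*?)?/.exec("aaa");

console.log(JSON.stringify(bare[1]));     // ""
console.log(JSON.stringify(optional[1])); // "a"
```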
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → INVALID