Closed Bug 587366 Opened 14 years ago Closed 14 years ago

Regexp failure for ")".replace(")","*$&*");

Categories

(Core :: JavaScript Engine, defect)

x86
Windows 7
defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla2.0
Tracking Status
blocking2.0 --- beta5+

People

(Reporter: alice0775, Assigned: cdleary)

References

Details

(Keywords: regression, Whiteboard: fixed-in-tracemonkey)

Attachments

(3 files, 2 obsolete files)

User-Agent:       Mozilla/5.0 (Windows NT 6.1; WOW64; rv:2.0b4pre) Gecko/20100814 Minefield/4.0b4pre ID:20100814040443
Build Identifier: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:2.0b4pre) Gecko/20100814 Minefield/4.0b4pre ID:20100814040443

See forum, http://forums.mozillazine.org/viewtopic.php?p=9754685#p9754685

When execute the following code, returns an error. 

")".replace(")","*$&*");

Actual Results:

Error: unmatched ) in regular expression
Source File: javascript:%20")".replace(")","*$&*")
Line: 1

Expected Results:
*)*

Regression pushlog:
http://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=0ee09dea0911&tochange=fc6783c960ca
Regression from the YARR! landing?
Looks like a combination of our turning flat strings into flat regexps to search for (per ES3 and ES5, a non-RegExp first arg to replace means convert that arg to string and find it in the |this| string) and YARR/JSPCRE being pickier about unbalanced parens.

/be
Assignee: general → cdleary
blocking2.0: --- → beta5+
Patching now. As Brendan says, turning flat string to flat regexp and displaying a compilation error is incorrect behavior here.

Interesting side note: jsc's result for this '*$&*'. They have a special single-char pattern replacement path within JSC::stringProtoFuncReplace that does no substitution of back references. (Sure, there can be no parenthetical backreference substitution in this case, but not having the leftContext, rightContext, and match substitutions is in error per ECMAv5:15.5.4.11.) I'll file a bug against webkit before closing this.

Another aside, the behavior for numbered capture group backreference substitution in this scenario is implementation defined according to that section! Why don't we just specify something reasonable there, like literal character substitution?
Status: NEW → ASSIGNED
Blocks: 587608
Out of curiosity I ran some stats for single-char flat regexp replace on tracemonkey benchmarks:

SunSpider single character replace (in cycles):
       Mean:              781.455
     StdDev:              613.818
      Count:                   11
      Total:                 8596
  ms @ 3GHz:           0.00286533

v8 single character replace (in cycles):
       Mean:               229.77
     StdDev:              167.209
      Count:                 4155
      Total:               954695
  ms @ 3GHz:             0.318232
Disregard those measurements, was measuring text length instead of pattern length... some of the variable names in that area of code are easily confused.
Here are the corrected stats for single char flat patterns on tracemonkey. SunSpider only, v8 had no such patterns:

single character flat patterns (in cycles):
       Mean:              1304.24
     StdDev:              49330.3
      Count:                 2512
      Total:          3.27624e+06
  ms @ 3GHz:              1.09208
So it turns out that our String.prototype.replace has been a little more non-standard than is correct. If there is a dollar in the replaceValue or a lambda, we turn a non-regexp value of searchValue into a regular expression and perform a regular expression match. This has the side-effect of updating the RegExp statics, which is not to spec! Unfortunately, it meshes nicely with our optional third parameter, flags ( https://developer.mozilla.org/en/JavaScript/Reference/Global_Objects/String/replace ). This allows users to specify things like 'g', whereas the normal String.prototype.replace can only perform a single replace.

To rectify the situation we'll rip out the spec-violating behavior, make the non-flags case as fast and minimal as possible, and put the flags case on a slow path that updates the RegExp statics.
Attached patch Clean up RegExpGuard results. (obsolete) — Splinter Review
Refactors the RegExpGuard results into result classes instead of accessing RegExpGuard members directly.

Revised |String.prototype.replace| functionality soon to follow.
Attachment #466791 - Flags: review?(lw)
Lambda case is still outstanding. Going back to tweak a few things in the base patch after talking to lw.
Attachment #466791 - Attachment is obsolete: true
Attachment #466879 - Flags: review?(lw)
Attachment #466791 - Flags: review?(lw)
There's one fixme left in there for lambda-based substitution that I'll make the next patch on the stack.
Attachment #466868 - Attachment is obsolete: true
Attachment #466881 - Flags: review?(lw)
Comment on attachment 466879 [details] [diff] [review]
Part 1: Clean up RegExpGuard results.

Looks good, a few tiny comments:

>+class FlatMatch
>+{
>+    bool            fail;
...
>+  public:
>+    operator bool() const { return !fail; }

As discussed, tryFlatMatch can just return a FlatMatch*, so no need for 'fail' or operator bool().

> /*
>  * RegExpGuard factors logic out of String regexp operations. After each
>  * operation completes, RegExpGuard data members become available, according to
>  * the comments below.
>  *
>  * Notes on parameters to RegExpGuard member functions:
>  *  - 'optarg' indicates in which argument position RegExp flags will be found,
>  *    if present. This is a Mozilla extension and not part of any ECMA spec.
>  *  - 'flat' indicates that the given pattern string will not be interpreted as
>  *    a regular expression, hence regexp meta-characters are ignored.
>  */

It looks like this comment can be gutted.

>         if (optarg < argc ||
>-            (!flat &&
>-             (patlen > sMaxFlatPatLen || RegExp::hasMetaChars(pat, patlen)))) {
>-            return false;
>+            (checkMetaChars && (fm.patlen > MAX_FLAT_PAT_LEN
>+             || RegExp::hasMetaChars(fm.pat, fm.patlen)))) {

SM style puts the || at the end of the line and probably align RegExp::hasMetaChars with fm.patlen.

>+    if (!rdata.dollar && !rdata.lambda) {
>+        FlatMatch fm = rdata.g.tryFlatMatch(rdata.str, 2, argc, false);
>+        if (fm)
>+            return BuildFlatReplacement(cx, rdata.str, rdata.repstr, fm, vp);

Can merge decl and test.
Attachment #466879 - Flags: review?(lw) → review+
(In reply to comment #14)
> >         if (optarg < argc ||
> >+            (checkMetaChars && (fm.patlen > MAX_FLAT_PAT_LEN
> >+             || RegExp::hasMetaChars(fm.pat, fm.patlen)))) {
> 
> SM style puts the || at the end of the line and probably align
> RegExp::hasMetaChars with fm.patlen.

Yes to || at end (&& too), but rather than run on in a right-heavy way past the nested &&, break after && in the parenthetical and indent accordingly.

Even better, since the consequent looks to have been (still be?) return false;, write two if-then-return-false statements, one for the optarg < argc condition, the second after it for the longer condition.

> >+    if (!rdata.dollar && !rdata.lambda) {
> >+        FlatMatch fm = rdata.g.tryFlatMatch(rdata.str, 2, argc, false);
> >+        if (fm)
> >+            return BuildFlatReplacement(cx, rdata.str, rdata.repstr, fm, vp);
> 
> Can merge decl and test.

Doing that all over lately, and Waldo is too. While gal's Java-tainted reptile brain had issues with it, I declare it pure C++ win (something Algol had, IIRC, lost in the C lineage early on).

/be
Workaround:

String.prototype._replace_ = String.prototype.replace;
String.prototype.replace = function replace() {
  var args = Array.slice(arguments);
  if (typeof args[0] == "string") {
    args[0] = RegExp(args[0]._replace_(/[{[(\\^.$|?*+/)\]}]/g, "\\$&"));
  }
  return this._replace_.apply(this, args);
};
Comment on attachment 466881 [details] [diff] [review]
Part 2: Fix non-regexp replace.

r+ with some nits:

>+    JSCharBuffer newStringChars(cx); /* newstring in the spec is the replaceValue with subs. */

Perhaps:

  JSCharBuffer newReplaceChars(cx);  /* = dollarSub(replaceValue) */

?

>+    /* 
>+     * Move the rest of the replacement string char-by-char, interpreting dollars as we encounter
>+     * them.
>+     */

4 lines of comment are so much more than 1; perhaps s/of the replacement string// so that it fits on one line?

>+        /*
>+         * Note: we know it's a dollar in the first iteration, but this isn't
>+         * that hot of a path to care to optimize it at this point.
>+         */

True, but probably doesn't need a comment.  We won't judge you for not caring ;-)

>+    JSString *newString = js_NewStringFromCharBuffer(cx, newStringChars);
>+    if (!newString)
>+        return false;

Perhaps newReplace?

>+        newStringChars.append(*it);

It looks like you need to OOM-check this and all the other append() calls in this function.

>+    FlatMatch fm = rdata.g.tryFlatMatch(rdata.str, 2, argc, false);
>+    if (!fm)
>+        return str_replace_regexp(cx, argc, vp, rdata);

In the spirit of the nice comment before this, could you JS_ASSERT(argc > 2) in the then-path?  (This assert depends on there not being an error/OOM path out of tryFlatMatch, but I didn't see one...)

>+          default: /* The dollar we saw was not special (no matter what its mother told it). */

:)

>+     * Note: we could optimize the text.length == pattern.length case if we wanted,
>+     * even in the presence of dolla' bills (y'all).

:(
Attachment #466881 - Flags: review?(lw) → review+
Comment on attachment 466967 [details] [diff] [review]
Part 3: Fix non-regexp searchValue with lambda replaceValue.

Thanks for cutting this up into 3 readable patches!

>+static inline bool
>+str_replace_flat_lambda(JSContext *cx, uintN argc, Value *vp, ReplaceData &rdata,
>+                        const FlatMatch &fm)
>+{
>+    JS_ASSERT(fm.match() >= 0);
>+    /* lambda(matchStr, matchStart, textstr) */
>+    LeaveTrace(cx);

Could you push this comment down to right before the definition of lambdaArgc?

>+    Value *sp = args.argv();
>+    sp++->setString(matchStr);
>+    sp++->setInt32(fm.match());
>+    sp++->setString(rdata.str);

With base+offset addressing, I think it would be faster to write:

  Value *argv = args.argv();
  argv[0].setString(matchStr);
  argv[1].setInt32(fm.match());
  argv[2].setString(rdata.str);

Not that it matters thaaaat much, as we are about to call Invoke (OTOH, Invoke is getting faster by the day!).

>diff --git a/js/src/tests/ecma_5/String/15.5.4.11-01.js b/js/src/tests/ecma_5/String/15.5.4.11-01.js
>new file mode 100644
>--- /dev/null
>+++ b/js/src/tests/ecma_5/String/15.5.4.11-01.js

Great!
Attachment #466967 - Flags: review?(lw) → review+
Still has a problem when third argument flag is specified like a following expression:

")".replace(")","*$&*","g");

Workaround:
Make a first argument into regexp with flag, e.g. Bug 587608.
http://hg.mozilla.org/mozilla-central/rev/77f1425d2731
Status: ASSIGNED → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
(In reply to comment #21)
> http://hg.mozilla.org/mozilla-central/rev/77f1425d2731

So what?
The problem in comment #20 is not yet fixed.
Should I file a new bug?
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
cdleary will file a followup to deal with the "g" case.
Status: REOPENED → RESOLVED
Closed: 14 years ago14 years ago
Resolution: --- → FIXED
Filed as bug 592027 (YARR: regexp syntax error on unbalanced parens).
Chris, bug 592027 is not about the optional third flags ("g") parameter that our String.prototype.replace implementation used to take. But that seems to be the bone of contention in comment 20 and comment 22, and sayrer names it explicitly in comment 23.

IMHO we should remove that old extension. It already burned us on SunSpider.

/be
(In reply to comment #25)
> IMHO we should remove that old extension. It already burned us on SunSpider.

Oh cool! I've thought about this a little bit before but got stumped on how to make it a smooth transition -- is there a way to provide user feedback that they provided us one-too-many arguments?

Removing the functionality of the third (flags) argument could create very subtle bugs in existing code -- i.e. regexps silently becoming case-sensitive or non-multiline.
Grep for warnedAboutTwoArgumentEval.

/be
Filed bug 592143 (retire three-argument RegExp extensions).
Workaround:

String.prototype.replace = (function() {
  var func = String.prototype.replace;
  return function replace() {
    if (typeof arguments[0] == "string") {
     arguments[0] = RegExp(arguments[0].replace(/[{[(\\^|$.?*+/)\]}]/g, "\\$&"), arguments[2]);
    }
    return func.apply(this, arguments);
  };
})();
You need to log in before you can comment on or make changes to this bug.