Closed Bug 330569 Opened 19 years ago Closed 18 years ago

s.replace(/<!--(.*|\n)*-->/, "") hangs Firefox (exponentially explosive regexp)

Categories

(Core :: JavaScript Engine, defect)

defect
Not set
critical

Tracking

()

VERIFIED FIXED

People

(Reporter: antonglv, Assigned: crowderbt)

References

Details

(Keywords: hang, testcase)

Attachments

(4 files, 7 obsolete files)

User-Agent:       Mozilla/5.0 (Windows; U; Windows NT 5.1; ru-RU; rv:1.8.0.1) Gecko/20060111 Firefox/1.5.0.1
Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.1; ru-RU; rv:1.8.0.1) Gecko/20060111 Firefox/1.5.0.1

to reproduce use this code:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
<html>
  <head>
  <meta http-equiv="content-type" content="text/html; charset=windows-1250">
  <meta name="generator" content="PSPad editor, www.pspad.com">
  <title></title>
  </head>
  <body>
<!-- hello -->
<script language="JavaScript">
    var s = document. body. innerHTML;
    var d = s. replace (/<!--(.*|\n)*-->/, "");
    alert (d);
</script>
  </body>
</html>

Reproducible: Always

Steps to Reproduce:
1. Open code in the Firefox.
2.
3.
Version: Trunk → 1.8 Branch
This is an exponentially explosive regexp, given SpiderMonkey's (and ECMA's) naive non-deterministic finite automaton.  It should be caught by some static or fast dynamic checks, which mrbkap was going to add.  I suspect this is a dup of a bug on his list.

/be
Whiteboard: DUPEME
Keywords: hang, testcase
Summary: s.replace(/<!--(.*|\n)*-->/, ""); method hangs up Firefox (and also IE6, but no Opera 8.5, 9) → s.replace(/<!--(.*|\n)*-->/, "") hangs Firefox (exponentially explosive regexp)
Cant we confirm this bug?

===
BTW, following can be used to strip comments from html

var s = document. body. innerHTML;
var d = s. replace (/<!--([^-]|-[^-])*-->/g, "");
alert (d);
Basically the same as bug 307456.

Reduced testcase:

function test( re, n ) {
  for ( var i= 0; i <= n; ++i ) {
    var t= Date.now();
    re.test( Array( i+1 ).join() );
    print( i, Date.now() - t );
  }
}

test( /(?:,*)*x/, 22 );

In bug 307456 it is

test( /(?:,|,)*x/, 22 );

Even worse:

test( /(?:,|,|,|,|,)*x/, 10 );
midair!

Related bugs:
Bug 310405, Bug 265353, Bug 307456. These should all be consolidated. mrbkap? Care to lump these all into one place?
Nominating: this hang makes it difficult to test the RegExp implementation in Gecko.
Flags: blocking1.9?
Status: UNCONFIRMED → NEW
Ever confirmed: true
The perl implementors have attempted to solve a majority of problematic expressions like this by caching the results of previously visited "nodes".  A node is described by the program-counter as it executes the regexp, plus the position in input, plus some additional state (ie., the external count for {x,y} style expressions).  If all of these things match and have already resulted in a failure, they consider the current path a failure without re-traversing it.  This might not be too hard to implement for us, I'm checking it out.
Also need to track info about backreferences in the state table.
Blocks: 351449
Bug 310405 is definitely -not- related to this, btw.
Unless bug 310405 is going to be re-described to be "more things in spidermonkey need to check the branch-callback"
Status: NEW → ASSIGNED
Brian, can you take this one?

/be
Assignee: general → crowder
Status: ASSIGNED → NEW
Status: NEW → ASSIGNED
mrbkap:  Would you mind describing the specific heuristics you had in mind for this?  I come back to this exponential regexp problem from time to time only to be discouraged by how limited our regexp code seems to be; perhaps your perspective is more optimistic.
OS: Windows XP → All
Hardware: PC → All
So the idea here is, rather than secretly compensate for bad regular expressions, we give the user the ability to turn on or off an extension of the regex engine that causes it to give up and throw an exception for an expression which is behaving with worse than O(n^3) complexity.  Actually, we offer a bit more room than that, since we count actual backtracks, not total states generated.  This makes backtracking itself a tiny tiny bit more expensive, but shouldn't dramatically impact the overall performance of the regex engine (as Perl's solution does).  This could also prove to be a useful tool for implementors of JS regex.

A nice enhancement to this (though I'm not sure how it would be undertaken) would be to allow the exception to be caught and still somehow resume the regex operation.  I'm not sure if that's possible or not.  In any case, it is highly likely that a badly-behaved expression like this could be re-architected to perform in closer-to-linear-time.

An arbitrary fudge factor of a few hundred thousand backtrack states is also permitted.  Perhaps this is better done another way, also?  An enhancement might be to make this fudge factor tunable by embedders.
Attachment #252372 - Flags: review?(mrbkap)
Comment on attachment 252372 [details] [diff] [review]
exeption on exponential regex (> O(n^3))

This is basically what I had in mind. Perhaps reverse the conditions in the if statement to avoid the function call?
Attachment #252372 - Flags: review?(mrbkap) → review+
Attached patch Removing a function call (obsolete) — Splinter Review
Should be better:  saves a function-call on every backtrack.  Thanks to Waldo for suggestion.
Attachment #252372 - Attachment is obsolete: true
Attachment #252376 - Flags: review?(mrbkap)
heh, mrbkap:  Waldo beat you to it by like four seconds.  :)  Thanks for the feedback.  Any thoughts on either the tuning or the "resuming" feature?
Comment on attachment 252376 [details] [diff] [review]
Removing a function call

>Index: jsregexp.c
>                 return NULL;
>             }
>+            gData->backTrackCount++;
>+            if (gData->explosiveRELimit &&

Nit: I think surrounding style is to explicitly compare ints with 0. Also, want to add some newlines around this to allow it to breath some?

>+    gData.explosiveRELimit = 0;
>+    if (JS_GetOptions(cx) & JSOPTION_THROW_ON_EXPLOSIVE_RE) {
>+        gData.explosiveRELimit = length * length * length; /* O(n^3) */
>+        if (gData.explosiveRELimit < 400000) /* Fudge for short strings */
>+            gData.explosiveRELimit = 400000;
>+    }

You could move the gData.explosiveRELimit into an else clause if you think it's worth it.

I don't think having a 'resume' option is worth the work it'd be. I think it would be possible (just store the gData somewhere useful and properly restore all of the required things), but let's see if it's really needed.
Attachment #252376 - Flags: review?(mrbkap) → review+
Attached patch v3, nits picked (obsolete) — Splinter Review
With mrbkap's nits addressed.  I sort of liked leaving out the != 0 since it looks more like testing to see if a feature is enabled, but I'll go with house style here.  I also liked the "limit = 0" setting early, since it looks like init code that must be run (ie., the limit MUST be set to something), but again I'm not attached.  I'll land this on the trunk this evening, if no one else has feedback, and then start looking at Firefox preferences for actually enabling it?

What about ditching the "magic number" of 400000.  Would I be better off with a per-context setting in the API?  JS_SetRegexMinLimit(cx, 400000) or something instead?
Attachment #252376 - Attachment is obsolete: true
Attachment #252383 - Flags: review+
(In reply to comment #20)
I'm not married to my suggestions, so don't take them as demands. One thing to note is that if the then clause of an if statement is braced, the else clause should be too, even if it's unnecessary.

> What about ditching the "magic number" of 400000.  Would I be better off with a
> per-context setting in the API?  JS_SetRegexMinLimit(cx, 400000) or something
> instead?

I hope Brendan will chime in on this question.
Whiteboard: DUPEME
Comment on attachment 252383 [details] [diff] [review]
v3, nits picked

Brendan:  mind taking a look at this and addressing some of the comment questions?  Thanks
Attachment #252383 - Flags: review+ → review?(brendan)
Comment on attachment 252383 [details] [diff] [review]
v3, nits picked

>+    {"strict",              JSOPTION_STRICT},
>+    {"werror",              JSOPTION_WERROR},
>+    {"atline",              JSOPTION_ATLINE},
>+    {"xml",                 JSOPTION_XML},
>+    {"explosive_re_throws", JSOPTION_THROW_ON_EXPLOSIVE_RE },

Would want a shorter option name, and to match the string to the #define name. How about JSOPTION_RELIMIT?

>+MSG_DEF(JSMSG_EXPLOSIVE_REGEX,        219, 0, JSEXN_INTERNALERR, "regular expression is exponential")

We don't know it's "exponential", just higher order than n^3 by backtrack count. Plus users don't know what exponential means. How about "regular expression too complex"?  Hey, we already have that one: JSMSG_REGEXP_TOO_COMPLEX.

>+    size_t backTrackCount;          /* how many times we've backtracked */
>+    size_t explosiveRELimit;        /* max backtrack states to generate */

Names should be siblins: backTraceLimit for backTrackCount (/me wishes rogerl has lowercased the T, but no point in changing now).

>+            if (gData->explosiveRELimit != 0 &&

I agree with your original inclination to test as a boolean, so do that if you still prefer.

>@@ -3216,16 +3229,17 @@ InitMatch(JSContext *cx, REGlobalData *g
>     JS_ARENA_ALLOCATE_CAST(gData->backTrackStack, REBackTrackData *,
>                            &gData->pool,
>                            INITIAL_BACKTRACK);
>     if (!gData->backTrackStack)
>         goto bad;
> 
>     gData->backTrackSP = gData->backTrackStack;
>     gData->cursz = 0;
>+    gData->backTrackCount = 0;

Note backTrackCount zero-init'ed in InitMatch (a single-use static subroutine, but let that pass).

>@@ -3287,16 +3301,24 @@ js_ExecuteRegExp(JSContext *cx, JSRegExp
>     gData.cpbegin = cp;
>     gData.cpend = cp + length;
>     cp += start;
>     gData.start = start;
>     gData.skipped = 0;
> 
>     JS_InitArenaPool(&gData.pool, "RegExpPool", 8096, 4);
>     x = InitMatch(cx, &gData, re);
>+
>+    if (JS_GetOptions(cx) & JSOPTION_THROW_ON_EXPLOSIVE_RE) {
>+        gData.explosiveRELimit = length * length * length; /* O(n^3) */
>+        if (gData.explosiveRELimit < 400000) /* Fudge for short strings */
>+            gData.explosiveRELimit = 400000;
>+    } else
>+        gData.explosiveRELimit = 0;
>+

First, I liked your original zero-init before the if. Second, per above comment, this belongs in InitMatch, so that gData->backTrackLimit can be initialized right after gData->backTrackCount. There's no point in splitting across InitMatch and js_ExecuteRegExp arbitrarily this way. Possibly there's no point in InitMatch (but perhaps it saves some goto out mess to have a single failure return value -- anyway leave that for a later cleanup).

/be
> >+MSG_DEF(JSMSG_EXPLOSIVE_REGEX,        219, 0, JSEXN_INTERNALERR, "regular expression is exponential")
> 
> We don't know it's "exponential", just higher order than n^3 by backtrack
> count. Plus users don't know what exponential means. How about "regular
> expression too complex"?  Hey, we already have that one:
> JSMSG_REGEXP_TOO_COMPLEX.

I noticed this existing message when I first wrote the code, but I was worried about people writing very simple exponential expressions and boggling at the "too complex" message...  I was hoping to provide something more informative (though I agree the word "exponential" is liable to be daunting).  Can you think of a good compromise there, or is it better just to use the existing message regardless?
Either "too complex" is enough, or it's confusing, but anything longer would be more confusing.  "likely to be too slow" comes to mind, but really ;-).

Question: are we sure other browsers don't do better here? ECMA does not bind us so much as real-world inter-operation requirements. I'd like to see results for all the bad examples listed here and in related bugs. Please post 'em here for IE, Opera, and Safari at least.

/be
The first attachment here definitely hangs IE 6 (my IE is still unusable), I will try with IE7 shortly.  Safari fairly quickly displays a somewhat mysterious dialog box which contains URL, the source of the <SCRIPT> element and an "OK" button, and seems to suspend execution, halting it entirely after "OK" is pressed.  Opera next.
Second case also hangs IE6, but resolves very quickly in Safari, I'm not totally clear on why.

First and second case both hang IE7.

First case yields a "stop scripts on this page" dialog from Opera, and the second case totally hangs it.
(In reply to comment #26)
> Safari fairly quickly displays a somewhat
> mysterious dialog box which contains URL, the source of the <SCRIPT> element
> and an "OK" button, and seems to suspend execution, halting it entirely after
> "OK" is pressed.

There's nothing mysterious about the dialog.  It's an alert() showing exactly what it's supposed to show.  Read the script again :)

Safari passes.
Then Safari's results are incorrect.  Using the non-explosive equivalent provided in comment #3 here (whose expression should, imo, produce identical results), the results I get show the script with the actual regexp itself REMOVED (which is also what Safari shows, when using the non-explosive regexp).  However, using the attachment from comment 0 in Safari, the result I get still shows the <!--...--> part of the regular expression from the script text.  My guess is that Safari is just bailing early without producing correct results (probably also bailing early in the "quick brown fox" case, too).
Safari is correct to leave the <!--...--> part in the script in the attachment in comment 0, since the regular expression is not global and only matches the HTML comment.  But if I take out the HTML comment, I see what appears to be "bailing early", like you describe.

What is special about \n?  If I replace that with "z" or "\t", Safari works correctly.
Comment on attachment 252383 [details] [diff] [review]
v3, nits picked

New patch coming?

/be
Attachment #252383 - Flags: review?(brendan) → review-
Attached patch from brendan's feedback (obsolete) — Splinter Review
> There's no point in splitting across InitMatch and js_ExecuteRegExp
> arbitrarily this way.

It's not arbitrary, I need to know the length of the particular input string for this match execution, so that I can use it in my big-O math.  Also, in theory, this option setting could change in between executions of the same regular expression, which may be a good thing (run regex, catch exception, change option, try again?)?

Everything else I agree with and have changed.

How do you feel in general about the heuristic itself, Brendan?  Is O(n^3) backtracks adequate?  It seems to be more than enough in my tests....  Also, how about my minimum "fudge"-factor for short strings (400000 backtracks)?
Attachment #252383 - Attachment is obsolete: true
Attachment #255370 - Flags: review?(brendan)
Oh, actually...  ignore my last remark, I will add an argument to InitMatch.  Patch shortly.
Attached patch one more try (obsolete) — Splinter Review
This moves some init code into InitMatch(), as suggested
Attachment #255370 - Attachment is obsolete: true
Attachment #255372 - Flags: review?(brendan)
Attachment #255370 - Flags: review?(brendan)
(In reply to comment #32)
> How do you feel in general about the heuristic itself, Brendan?  Is O(n^3)
> backtracks adequate?  It seems to be more than enough in my tests....

Who knows if it's enough? We would have to test the web, approximated by shipping in a Firefox product release.

Is Safari just doing its own version of the slow script dialog, or is it short-circuiting to the correct answer?

So long as we are not losing compared to other browsers, this patch *may* be ok. But it's really impossible to say without more exhaustive testing against real-world input.

/be
Comment on attachment 255372 [details] [diff] [review]
one more try


>-    {"strict",          JSOPTION_STRICT},
>-    {"werror",          JSOPTION_WERROR},
>-    {"atline",          JSOPTION_ATLINE},
>-    {"xml",             JSOPTION_XML},
>-    {0,                 0}
>+    {"strict",              JSOPTION_STRICT},
>+    {"werror",              JSOPTION_WERROR},
>+    {"atline",              JSOPTION_ATLINE},
>+    {"xml",                 JSOPTION_XML},
>+    {"relimit",             JSOPTION_RELIMIT },
>+    {0,                     0}

Restore old indentation to minimize patch.

>+#define JSOPTION_RELIMIT \
>+                                JS_BIT(9)       /* Throw exception on any
>+                                                   regular expression which
>+                                                   backtracks more than n^3
>+                                                   times, where n is length
>+                                                   of the input string */

No need for backslash continuation.

>+    size_t backTrackCount;          /* how many times we've backtracked */
>+    size_t backTrackLimit;          /* max backtrack states to generate */

Not max, limit (see below) -- max is maximum legal value, limit is fencepost.

>+
>+            /* Potentially detect explosive regex here. */
>+            gData->backTrackCount++;
>+            if (gData->backTrackLimit &&
>+                gData->backTrackCount > gData->backTrackLimit) {

You want >= here, consider basis case (1).

>+        if (gData->backTrackLimit < 400000) /* Fudge for short strings */
>+            gData->backTrackLimit = 400000;

Hoist magic number into a #define at file scope just above the enclosing function.

Minusing to get off my review radar (trying to trim backlog), but this is close and I'm in favor of trying it out for 1.9.

/be
Attachment #255372 - Flags: review?(brendan) → review-
Safari uses PCRE and I talked to some of the webkit guys the other night on IRC; it's not modified.  PCRE bails at (iirc) 10000000 backtracks, hardcoded, and without respect to the size of the input string.
As a note:  this code DOESN'T impose a hard-limit ala PCRE, because the branch callback largely serves that purpose.  Adding one would be trivial, of course.
Attachment #255372 - Attachment is obsolete: true
Attachment #255400 - Flags: review?(brendan)
Comment on attachment 255400 [details] [diff] [review]
patch cleanups recommended by brendan

>+#define MIN_BACKTRACK_LIMIT 400000

Blank line here, per prevailing style.

> static REMatchState *
>-InitMatch(JSContext *cx, REGlobalData *gData, JSRegExp *re)
>+InitMatch(JSContext *cx, REGlobalData *gData, JSRegExp *re, size_t length)

r=me with that, thanks.

/be
Attachment #255400 - Flags: review?(brendan) → review+
Attached patch final patchSplinter Review
I will land this later tonight when I have time to babysit it.
Attachment #255400 - Attachment is obsolete: true
Attachment #255401 - Flags: review+
So if we want this on for 1.9, we need a patch in the DOM to turn on JSOPTION_RELIMIT. And to be extra nice, reflect that as a boolean property of the options object defined in window, which reflects the javascript.options.strict and javascript.options.werror boolean prefs -- so I guess there ought to be a pref for relimit too.

This should all re-use existing code in dom/src/base/nsJSEnvironment.cpp IIRC.

/be
I bet sheppy didn't know about the pref-defaulted, page-settable options described in comment 41...

/be
js.c            3.132
jsapi.h         3.139 
jsregexp.c      3.133

Leaving the bug open until I have done the DOM patch.
Attached patch bugzilla as backup (obsolete) — Splinter Review
This is a first stab at implementing the dom side of enabling this (and providing a settable option).  Not tested, do not review.
This patch allows the relimit setting to be tweaked from about:config and defaults it to false.
Attachment #256678 - Attachment is obsolete: true
Attachment #256689 - Flags: superreview?
Attachment #256689 - Flags: review?(mrbkap)
Attachment #256689 - Flags: superreview? → superreview?(jst)
Still might be worth considering some sort of very large cap on total backtracks, as well, since a long input string at O(n^3) could still take a very long time.
We don't want too many knobs on the code. Won't the operation-counter stuff igor added help run the branch callback soon enough, for a long input string and regexp not growing faster than O(n^3)?

/be
Yeah, I even made that comment myself, earlier.  It just seems like the script callback takes a -very- long time to prompt me, and we could beat it with a hard-limit.  I agree about not kludging this code up too much, though.
(In reply to comment #48)
> Yeah, I even made that comment myself, earlier.

Ah, so you did. Therefore...

> It just seems like the script
> callback takes a -very- long time to prompt me, and we could beat it with a
> hard-limit.  I agree about not kludging this code up too much, though.

Something is wrong with the operation counter stuff and how it iteracts with the branch callback's slow-script watchdog. The latter takes some number of calls (256 IIRC) to snapshot wall time. Then it needs to be called some larger 2^n times to actually compare current time to the snapshot. Can you dig into this? I hope it's easy to adjust things so we get timely slow-script dialogs for regexp operation count limiting. Thanks,

/be
Comment on attachment 256689 [details] [diff] [review]
dom patch, and default setting

>Index: dom/src/base/nsJSEnvironment.cpp
>+    if (((optbit & (optbit - 1)) == 0 && optbit <= JSOPTION_WERROR) ||
>+        optbit == JSOPTION_RELIMIT) {

I find myself wondering if the crazy (optbit & (optbit - 1)) stuff is really more clear than just |optbit == JSOPTION_STRICT || optbit == JSOPTION_WERROR || optbit == JSOPTION_RELIMIT|.
Attachment #256689 - Flags: review?(mrbkap) → review+
I wanted to chime in and ask -- why am I cc'd on this bug?  Is there something that needs documenting here?  I don't see anything obviously in need of writing up.
(In reply to comment #42)
> I bet sheppy didn't know about the pref-defaulted, page-settable options
> described in comment 41...
> 
> /be

sheppy:  This was for you.
I saw that comment, but I still have no idea what if anything needs documenting here.
jst:  ping?
Comment on attachment 256689 [details] [diff] [review]
dom patch, and default setting

sr=jst, sorry for the delay :(
Attachment #256689 - Flags: superreview?(jst) → superreview+
nsJSEnvironment.cpp: 1.319
all.js: 3.668
Status: ASSIGNED → RESOLVED
Closed: 18 years ago
Resolution: --- → FIXED
/cvsroot/mozilla/js/tests/js1_5/extensions/regress-330569.js,v  <--  regress-330569.js
initial revision: 1.1
Flags: in-testsuite+
verified fixed linux, windows, mac* shell 20070406
Status: RESOLVED → VERIFIED
afaict, this was fixed on trunk, not 1.8.
Version: 1.8 Branch → Trunk
Depends on: 696723
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: