Regular expression hangs

RESOLVED INVALID

Status

()

Core
JavaScript Engine
--
critical
RESOLVED INVALID
11 years ago
11 years ago

People

(Reporter: Georg Maaß, Assigned: Brian Crowder)

Tracking

({hang})

Firefox Tracking Flags

(Not tracked)

Details

(Reporter)

Description

11 years ago
User-Agent:       Mozilla/5.0 (Macintosh; U; PPC Mac OS X Mach-O; en-US; rv:1.7.13) Gecko/20060414
Build Identifier: 

Regular expression hangs and never returns.

Reproducible: Always

Steps to Reproduce:
Test case always reproducable:

var a = 'Ida.tools = {\n\n// possible additional compressions:\n// [] => []\n// {} => {}\n\n\n// showText(Ida.tools.compressJS(new Ida.File(Ida.directory.idaContentDir + "scripting\\core\\idaattribute.js").getContent()))\n// alert(new Ida.File(Ida.directory.idaContentDir + "scripting\\core\\idaattribute.js").getContent().replace(/\/\/.+?\n/g,"");)\n/**\n* compressJS\n*        compresses a string by removing comments, whitespace etc. You can\n*        control the compressed result by some flags of the parameters.\n*\n*    @param    {string}    aString\n*        the string to compress\n*    @param    {boolean}    forWeb\n*        if this flag is true, all stuff between \n';

var r = /([^\\])(\/\*)(.|\s)*?(\*\/)/g;

var res = a.match(r);

Actual Results:  

var res = a.match(r);

never returns.

Expected Results:  
should return the found matches (none).

The Problem occures in SpiderMonkey 1.6 (built on Mac) as well as in XULRunner 1.9a5 on Windows. Probably it is everywhere.

Updated

11 years ago
Keywords: hang

Comment 1

11 years ago
I have tried the test code with a Firefox trunk nightly (Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9a5pre) Gecko/20070517 Minefield/3.0a5pre) and it pops up a dialog "unresponsive script" to warn that the script is taking a long time, allowing the user to choose whether to continue running the script or whether to stop it. I have choosen to continue a few times with the dialog reoccuring, then stopped the script which Firefox did.

When the same test code is run with IE 6 or Opera 9 it hangs them completely.

Rhino also hangs on the test code.

Browser test case is here:
<http://home.arcor.de/martin.honnen/mozillaBugs/ecmascript/commentPattern2.html>
using script file
<http://home.arcor.de/martin.honnen/mozillaBugs/ecmascript/commentPattern2.js>
(Reporter)

Comment 2

11 years ago
(In reply to comment #1)
> I have tried the test code with a Firefox trunk nightly (Mozilla/5.0 (Windows;
> U; Windows NT 5.1; en-US; rv:1.9a5pre) Gecko/20070517 Minefield/3.0a5pre) and
> it pops up a dialog "unresponsive script" to warn that the script is taking a
> long time, allowing the user to choose whether to continue running the script
> or whether to stop it.

Yes, this happens also with XULRunner, but the spidermonkey shell probaly does not support such a warning.

Changing the regular expression a little bit prevents the hanging.


r1 = /([^\\])(\/\*)(.|\s)*?(\*\/)/g;
                   ======

r2 = /([^\\])(\/\*)(.|\n|\r|\u2028|\u2029)*?(\*\/)/g;
                   =======================

The only difference is in the part:

(.|\s)

and

(.|\n|\r|\u2028|\u2029)

The first causes trouble, the second does not. The intension is "match any character including any kind of line break". \s matches all line breaks and additionaly some white space characters also matched by .

\n|\r|\u2028|\u2029 are exactly those characters not matched by .

Probably the white space matched by . and by \s causes the endless loop inside the regular expression, 
(Assignee)

Comment 3

11 years ago
This is a real bug, though, not a bug in the regular expression.  We're in an infinite loop inside the expression code, not exploring a large (exponential) search-space.  Thanks for the reports and follow-up research.

On a trunk debug build, we're hitting a new assertion that I've added recently which may or may not be a good assertion; more study is needed.  I'll take this bug.
Assignee: general → crowder
Status: UNCONFIRMED → NEW
Ever confirmed: true
(Assignee)

Updated

11 years ago
Status: NEW → ASSIGNED
(Assignee)

Comment 4

11 years ago
Sorry about that, this IS indeed an exponential regular expression.  The input string is long enough that obtaining a result (capped at O(n^3) time) still seems to take forever.
Status: ASSIGNED → RESOLVED
Last Resolved: 11 years ago
Resolution: --- → INVALID
(Reporter)

Comment 5

11 years ago
The input string is not very long, less than 1000 characters.

The patterns r1 and r2 should be identical, but do not behave identical. This causes the trouble. Optimizations are duty of the machine not of the author. This is a situation, which is not obvious to the author that this causes such a different behaviour
(Assignee)

Comment 6

11 years ago
(In reply to comment #5)
> Optimizations are duty of the machine not of the author.  This is a situation,
> which is not obvious to the author that this causes such a different behaviour

Regular expressions are a programming language in many respects.  If programmers could make this assertion, that "optimizations are the duty of the machine, not of the author", then our lives would indeed be much easier.  The simple fact is that this is not the case.  If you are programming (and indeed, when writing regular expressions, you are -- even if it does not seem so), then the algorithmic complexity of your program is your responsibility, not the "machine's".  No compiler can rewrite your code to work as you /meant/ it to work, instead of as you /wrote/ it to work.  Patterns r1 and r2 are NOT identical.  The overlap of potentially matching characters in the alternation expression causes your program to run with exponential complexity in r1, while there is no such overlap in r2.

Comment 7

11 years ago
Isn't the entire point of *regular* expressions to be able to express *regular* languages (that is, languages that can be matched in O(n) in string length) concisely? :/  Someone programming in an imperative language is expected to pick fast algorithms, but someone writing a regular expression isn't writing an algorithm at all.
(Assignee)

Comment 8

11 years ago
Unfortunately, the expression language specified by ECMA is an NFA, not a DFA, so it is not only possible, but easy, to create non-linear expressions.  An optimization we might consider long-term would be to compile some expressions (where possible, based on what features they use) as DFAs and fallback to NFAs when necessary.  It would be a huge performance boost and would probably fix issues like this, but it is a large project.
(Reporter)

Comment 9

11 years ago
A regular expression is just a pattern; nothing else.

This two patterns are semantically identical in JavaScript regular expressions:

(.|\s)

and

(.|\n|\r|\u2028|\u2029)

But at least the performance behaviour of them is very different, and this is not obvious for an author. Maybe a developer of the regular expression engine sees that there might be a performance difference, but to estimate this, you need detailed knowledge about internals of the regular expression engine implementation.

In very many cases performance tuning is race condition tuning. This causes tuning to be done during the race.

Modern cpu do race condition performance tuning using out of order execution and many tricks. This is not done by the compiler, it is done at runtime.

In the regular expression above the compiler can see that both regular expressions are semantically identical. I don't know whether the second one is allways faster or only under certain circumstances. But the developer of the regular expression engine might estimate this and might be able to let the engine optimize the pattern to the estimated fastest one.

May be that such an optimization is too expensive at runtime. In this case an additional tool for regulare expression optimizing would be nice, to check expressions during code development for optimized suggestions.

Comment 10

11 years ago
Would it help to optimize disjunctions of single characters, such as

(.|\n|\r|\u2028|\u2029),

into to a character class?
(Reporter)

Comment 11

11 years ago
a character class is at least more user friendly, easier to read. That is the reason, why the original expression was written with the partial expression (.|\s)

But as not developer of the engine, I can not estimate whether a new character class results in better performance or not. It results in an easier to read expression.

If a new character class is introduced than users should be warned not to use them im web pages, because this might result in trouble with other and older browsers. But in XULRunner applications extentions are welcome, because there exist no compatibility issues. I was using the regular expression in a XULRunner application, so everything is allowed there, because no IE or Opera ever sees that expression.
(Assignee)

Comment 12

11 years ago
The reason this works the way it does currently, or so I believe (I have not tested it with a profiler) is that the \s case is specially optimized, and in the general case of regular expressions, this yields a nice performance boost for people matching on space.  It is significantly faster than matching over a character class.

Generalizing a heuristic that knows how to merge expressions across an alternation (the "|" operator) would be relatively difficult, so Jesse's suggestion doesn't seem very doable (at least not immediately).

Georg:  A regular expression is most definitely -not- just a pattern.  Regular expressions in ECMA-262 are extremely powerful and are capable of more than just linear pattern-matching.  I don't know what "race condition performance tuning" is, perhaps you can provide a citation?  In any case, this is not anything of the sort.  The regular expression compiler cannot programmatically optimize an exponential expression into a linear one with the exception of a very few special cases.  This might be one of those cases, but it doesn't seem worth coding for, since the workaround/fix is a very easy one.
(Reporter)

Comment 13

11 years ago
If the relative performance of two semantically identical patterns depends on the string then this is a race condition. This depends on the implementation details, whether this can happen or not.

Some tuning can be done at compile time. Other must be done earlier at design time. And other must be done at runtime, because it depends on data, which is a race condition.
(Assignee)

Comment 14

11 years ago
If you can cite a reference for how this "race condition" theory applies to regular expressions, I'd be glad to take a look.  My understanding of the term "race condition" seems to be entirely different than yours.
You need to log in before you can comment on or make changes to this bug.