Closed Bug 322836 Opened 19 years ago Closed 19 years ago

zero-width lookahead assertions fairly busted in JS regexps

Categories

(Core :: JavaScript Engine, defect, P3)

defect

Tracking

()

VERIFIED FIXED
mozilla1.9alpha1

People

(Reporter: brendan, Assigned: mrbkap)

Details

(Whiteboard: [patch])

Attachments

(1 file, 1 obsolete file)

js> r = /(?=[^\\])\$(..)/g
/(?=[^\\])\$(..)/g
js> a = "asdf$(12)ewr\$(34)rewui$(56)fds".match(r)
$(1,$(3,$(5
js> a.length
3
js> a[0]
$(1
js> a[0].length
3
js> b = r("asdf$(12)ewr\$(34)rewui$(56)fds")
$(1,(1

This bug is the followup to bug 256798 comment 9.  At least the following are broken:

1.  Negative assertions just don't work, due to a bug in my fix to bug 254296. They didn't work in all cases before that attempt.

2.  As shown above, the width of the lookahead seems to be subtracted from the match, so the width is not zero.

Paging mrbkap, emergency!  This is bad, I'm surprised no one noticed.  We lack test coverage, and obviously users don't exercise these features of Edition 3.

/be
Oh, and 3.  Positive assertions don't work either, as shown by the a[1] match.

/be
What am I missing here? The example given in bug 256798 comment 8 is INVALID: there are no occurances of 'bb' in the given string, which is what Bob's trying to match.

The example given here also seems to be doing the right thing (I haven't tested negative assertions yet, mind):

2. Look at your regexp again: the only capturing parentheses are right after the $ matching /../, so we find the $ and capture the next two characters, the parenthesis and the first number.
3. A 0-width assertion at the beginning of a regular expression is effectively a no-op, since it matches against the character that the operation after the assertion is going to match against.

I'm morphing this bug into fixing the REOP_IS_SIMPLE tests that have a tendancy to be bogus.
Priority: -- → P3
Target Milestone: --- → mozilla1.9alpha
Hey, I smelled something bogus and started going off, don't shoot me! ;-)

This perl testcase is broken:

$ perl -e 'print ("AB" =~ );'
1

From the js shell:

js> /(?!AB+D)AB/("AB")
null

As you say, it's all about how the simple case of the negatively-asserted sub-expression is handled.  My bad for not fixing it fully last time.  Thanks for taking this, I will strive to report more regexp bugs for you ;-).

/be
Status: NEW → ASSIGNED
Whiteboard: [patch]
Attached patch Proposed fix (obsolete) — Splinter Review
The old code was using the non-updated pc to test against, this makes sure that we're actually testing the the pc after the simple one against the pc for the REOP_ASSERT(NOT)TEST. I might be able to get away without using a temporary pc, but that sort of scares me, so I decided against it.
Attachment #208025 - Flags: review?(brendan)
Comment on attachment 208025 [details] [diff] [review]
Proposed fix

>@@ -2650,36 +2650,35 @@ SimpleMatch(REGlobalData *gData, REMatch

The comment needs to be updated to describe the change to |update|, which I'd rather you rename |updatecp| or some such.

>             case REOP_ASSERT:
>                 nextpc = pc + GET_OFFSET(pc);  /* start of term after ASSERT */
>                 pc += ARG_LEN;                 /* start of ASSERT child */
>+                temppc = pc;
>                 op = (REOp) *pc++;
>                 if (REOP_IS_SIMPLE(op) &&
>-                    !SimpleMatch(gData, x, op, &pc, JS_FALSE)) {
>+                    !SimpleMatch(gData, x, op, &temppc, JS_FALSE) &&
>+                    temppc == nextpc) {

I don't think you want this (temppc == nextpc) clause, see bug 256798.

>             case REOP_ASSERT_NOT:
>                 nextpc = pc + GET_OFFSET(pc);
>                 pc += ARG_LEN;
>+                temppc = pc;
>                 op = (REOp) *pc++;
>                 if (REOP_IS_SIMPLE(op) /* Note - fail to fail! */ &&
>-                    SimpleMatch(gData, x, op, &pc, JS_FALSE) &&
>-                    pc == nextpc) {
>+                    SimpleMatch(gData, x, op, &temppc, JS_FALSE) &&
>+                    temppc == nextpc) {
>                     result = NULL;
>                     break;
>                 }

Here, we are handling not only a simple op, but only a simple op, not several ops in a row.  So if SimpleMatch succeeds, but if there were other ops after it before the next op after the (?!...), we would not necessarily be able to fail here -- we would want to fall into the general (backtracking) case.

Not sure what I saw earlier under the debugger.  But it still looks to me as though ASSERTNOTTEST is the last bytecode before the next op, and after any single simple op, so I don't see how (temppc == nextpc) can be right.

/be
Attachment #208025 - Flags: review?(brendan) → review-
This should fix everything, no more comparisons to nextpc, instead, just see if we'd be about to process the REOP_ASSERTNOTTEST.
Attachment #208025 - Attachment is obsolete: true
Attachment #208054 - Flags: review?(brendan)
Comment on attachment 208054 [details] [diff] [review]
Fixed proposed fix

> /*
>  * Apply the current op against the given input to see if it's going to match
>- * or fail. Return false if we don't get a match, true if we do and update the
>- * state of the input and pc if the update flag is true.
>+ * or fail. Return false if we don't get a match, true if we do. If updatecp is
>+ * true, then update the current state's cp. startpc is always updated to the
>+ * next op's position.

Avoid passive voice in last sentence: "Always update startpc to point at the next op."

>+    jsbytecode *nextpc, *temppc;

How about naming it testpc, to avoid back-to-back p's?

You might JS_ASSERT(testpc + 1 == nextpc) in the block that returns early.

r=me, thanks for taking this one.

/be
Attachment #208054 - Flags: review?(brendan) → review+
Fix checked into trunk.
Status: ASSIGNED → RESOLVED
Closed: 19 years ago
Resolution: --- → FIXED
Is there a perl regexp test suite that we can crib from?
I'll add /(?!AB+D)AB/("AB") to js1_5/Regress/regress-254296.js but I'm not sure what else this bug covers.

/cvsroot/mozilla/js/tests/js1_5/Regress/regress-254296.js,v  <--  regress-254296.js
new revision: 1.2; previous revision: 1.1

I could "crib" from the Perl regular expression tests in t/op/re_tests. Looking at the Artistic License, that may be permissable. Mitchell, do you know under what conditions I could include tests derived from the Perl tests? The file consists of lines like "ab{1,}bc	abbbbc	y	$&	abbbbc" which would be modified to be javascript function calls into a test function.
Flags: testcase+
(In reply to comment #10)
> Mitchell, do you know under
> what conditions I could include tests derived from the Perl tests?

Mitchell's not doing license analysis, trying Gerv.

/be
What exactly do you want to do? Copy munged Perl testsuite regexps into an MPLed file in our testsuite?

Would you be able to just make a new file with the Perl testsuite-based tests in?

Even better, could you write a JavaScript script which downloaded, munged and eval()ed the Perl suite? ;-)

Seriously, the easiest thing all round is probably a new file. That way, there are no MPL issues. I'd rather not have yet another licence in the tree, but if it's only testsuite code, I guess it's OK.

Gerv
(In reply to comment #12)
> What exactly do you want to do? Copy munged Perl testsuite regexps into an
> MPLed file in our testsuite?
> 
> Would you be able to just make a new file with the Perl testsuite-based tests
> in?
> 

The idea was to create a call to a test function from the original regexp test text. For example,

"ab{1,}bc      abbbbc  y       $&      abbbbc" would be used to create something like:

test(/ab{1,}bc/, "abbbbc", "y", "RegExp['$&']", "abbbbc");

This would go into a new file such as mozilla/js/tests/ecma_3/RegExp/perl-compat.js

> Even better, could you write a JavaScript script which downloaded, munged and
> eval()ed the Perl suite? ;-)
> 

That is a possibility, and would make it easier to keep the tests in sync. 

> Seriously, the easiest thing all round is probably a new file. That way, there
> are no MPL issues. I'd rather not have yet another licence in the tree, but if
> it's only testsuite code, I guess it's OK.
> 

I was thinking that the Artistic license gave us some leeway.

<quote>
3. You may otherwise modify your copy of this Package in any way, provided
that you insert a prominent notice in each changed file stating how and
when you changed that file, and provided that you do at least ONE of the
following:

    a) place your modifications in the Public Domain or otherwise make them
    Freely Available, such as by posting said modifications to Usenet or
    an equivalent medium, or placing the modifications on a major archive
    site such as uunet.uu.net, or by allowing the Copyright Holder to include
    your modifications in the Standard Version of the Package.

    b) use the modified Package only within your corporation or organization.

    c) rename any non-standard executables so the names do not conflict
    with standard executables, which must also be provided, and provide
    a separate manual page for each non-standard executable that clearly
    documents how it differs from the Standard Version.

    d) make other distribution arrangements with the Copyright Holder.
</quote>

Unlike the source code files (.c, .h) this text file t/op/re_tests has no copyright or licensing boilerplate.

I guess to be safest, I should download the Perl source, use t/op/re_tests as input to a program that creates the javascript tests then run that.
If you don't mind doing that (using the Perl file as input), that would be easiest from a licensing point of view. And a good way to keep the tests in sync.

Gerv
verified fixed trunk 20060328 win/mac/linux
Status: RESOLVED → VERIFIED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: