Closed Bug 416933 Opened 16 years ago Closed 16 years ago

Invalid range error for in Þ-ß case-insensitive regular expression

Categories

(Core :: JavaScript Engine, defect, P1)

x86
Windows XP
defect

Tracking

()

VERIFIED FIXED
mozilla1.9beta5

People

(Reporter: cacyclewp, Assigned: crowderbt)

References

Details

(Keywords: regression)

Attachments

(6 files, 10 obsolete files)

2.12 KB, patch
Details | Diff | Splinter Review
3.40 KB, text/plain
Details
9.02 KB, patch
Details | Diff | Splinter Review
4.38 KB, patch
shaver
: review+
shaver
: approval1.9+
Details | Diff | Splinter Review
3.56 KB, application/x-javascript
Details
4.32 KB, application/x-javascript
Details
User-Agent:       Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.12) Gecko/20080201 Firefox/2.0.0.12
Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9b4pre) Gecko/2008021104 Minefield/3.0b4pre

Any Unicode character range in  JavaScript regular expressions that goes over Þ-ß throws an invalid range error  ("Error: invalid range in character class").

This happens in Firefox 3.0b2 - 3.0b4pre (Firefox, Minefield, Swiftfox), but not in current SeaMonkey 1.1.7 and current Firefox 2.0.0.12 releases.

The critical range is between the following two consecutive characters:

U+00DE Þ c39e LATIN CAPITAL LETTER THORN
U+00DF ß c39f LATIN SMALL LETTER SHARP S

There is no problem with these characters alone outside of a range or these characters at the end or the start of a range, respectively.

Reproducible: Always
Assignee: nobody → general
Component: General → JavaScript Engine
Keywords: regression
Product: Firefox → Core
QA Contact: general → general
Version: unspecified → Trunk
Can you provide the exact expression that's generating this error?  Using my current nightly, this works:

javascript:/[Þ-ß]/.exec("Þ")
Þ

Thanks
I have just figured out that it only happens with case insensitive regular expressions that have their "i" flag set:

javascript:/[Þ-ß]/i.exec("Þ")

throws the error as described above.
Severity: major → normal
Status: UNCONFIRMED → NEW
Ever confirmed: true
Summary: Invalid range error for in Þ-ß JavaScript regular expression → Invalid range error for in Þ-ß case-insensitive regular expression
Assignee: general → crowder
Attached patch more liberal range-handling (obsolete) — Splinter Review
I was overzealous when working on this previously.  This patch backs off a bit on my earlier stance and ensures that bitmap-size is correct.
Attachment #309988 - Flags: review?(mrbkap)
This bug apparently breaks the WidEd editor for Wikipedia pages.  It should block.
Flags: blocking1.9?
Comment on attachment 309988 [details] [diff] [review]
more liberal range-handling

Hm, not quite ready for prime-time, this needs some more fixing, to prevent a regression revealed by the test-suite:

ecma_3/RegExp/regress-375715-01-n.js

This test also, by its nature, changed another test or two, more in a second.
Attachment #309988 - Flags: review?(mrbkap)
Attached patch more liberal still (obsolete) — Splinter Review
Attachment #309988 - Attachment is obsolete: true
Attachment #310019 - Flags: review?(mrbkap)
The patch from comment #7 causes the following two tests to need to be reworked:

Testcase ecma_3/RegExp/regress-375711.js failed Bug Number 375711
[ Top of Page ]
STATUS: Do not assert with /[Q-b]/i.exec("")
Failure messages were:
FAILED! expected: [reported from test()] Expected value 'SyntaxError: invalid range in character class', Actual value ''

Testcase ecma_3/RegExp/regress-375715-01-n.js failed Bug Number 375715
[ Top of Page ]
Expected exit code 3, got 0
Testcase terminated with signal 0
Complete testcase output was:
BUGNUMBER: 375715
STATUS: Do not assert: (c2 <= cs->length) && (c1 <= c2)
PASSED! Do not assert: (c2 <= cs->length) && (c1 <= c2) /[Wb-G]/.exec("")
jstest: ecma_3/RegExp/regress-375715-01.js bug: result: PASSED type: shell description: Do not assert: (c2 <= cs->length) && (c1 <= c2) /[Wb-G]/.exec("") expected: actual: reason:
Can we get this in for Beta5?
Flags: blocking1.9? → blocking1.9+
Priority: -- → P1
Comment on attachment 310019 [details] [diff] [review]
more liberal still

yeah, need a more available reviewer.
Attachment #310019 - Flags: review?(mrbkap) → review?(shaver)
Comment on attachment 310019 [details] [diff] [review]
more liberal still

r=shaver, assuming you've verified that it's acceptable to not report an error for hi-lo ranges. Just wondering what motivated the addition of the the test that needs changing.
Attachment #310019 - Flags: review?(shaver) → review+
This restores the range exception for case-insensitive class ranges whose initial value is higher than its end value.
Attachment #310019 - Attachment is obsolete: true
Attachment #310670 - Flags: review?(shaver)
Target Milestone: --- → mozilla1.9beta5
Comment on attachment 310670 [details] [diff] [review]
better error reporting for case-insensitive matches

>         if (state->flags & JSREG_FOLD) {
>             jschar uc = upcase(localMax);
>             jschar dc = downcase(localMax);
> 
>             c = JS_MAX(uc, dc);
>             if (c > localMax)
>                 localMax = c;
>+        } else if (inRange && rangeStart > localMax) {
>+            JS_ReportErrorNumber(state->context,
>+                                 js_GetErrorMessage, NULL,
>+                                 JSMSG_BAD_CLASS_RANGE);
>+            return JS_FALSE;
>         }

Could cite 15.10.2.15 NonemptyClassRanges, the CharacterRange helper function, step 6. To do that nicely, change the else if to else { followed by comma and if-then indented one more basic offset, then }.

>+    JS_ASSERT(c2 <= cs->length);

Blank line before whole-line-or-more comment per house style.

>+    /* Swap, if c1 > c2. */
>+    if (c1 > c2) {
>+        c1 = c1 ^ c2;
>+        c2 = c1 ^ c2;
>+        c1 = c1 ^ c2;
>+    }

Use ^= if XOR wins, but using a tmp is simpler source (not that we eschew bit magic, just not sure it is worth it here -- if it is, ^= must be used for C purity ;-).

/be
Attachment #310670 - Flags: review?(shaver) → review+
Attached patch the landed patchSplinter Review
Attachment #310670 - Attachment is obsolete: true
Status: NEW → RESOLVED
Closed: 16 years ago
Resolution: --- → FIXED
/cvsroot/mozilla/js/tests/ecma_3/RegExp/regress-375711.js,v  <--  regress-375711.js
new revision: 1.2; previous revision: 1.1
Flags: in-testsuite+
Flags: in-litmus-
The fix (and also the testcase) is wrong.

/[q-b]/i.exec("") should throw, as should /[b-Q]/i.exec(""), because the code point of the first character isn't allowed to be greater than the code point of the second character, even in case-insensitive RegExps.

Ranges with non-single characters as start or end like /[\d-q]/ should also throw per ECMA instead of assuming a literal -.

The whole method for case-insensitive character ranges is wrong because it assumes it can simply fill the bits in the bitmap between up- and downcased start and end characters. Instead it needs to calculate up- and downcased characters for each code point between start and end of the range. Note that the distance between upper and lower case variants can differ. \xfe (þ) has its associated upper case character at \xde while \xff (ÿ) has it at \u0178.

Currently /[\xc0-\xdf]/i.exec("\xf6") (/[À-ß]/i.exec("ö")) fails and /[\xc0-\xde]/i.exec("\xf7") (/[À-Þ]/i.exec("÷")), while it should be the other way round. Even the results for ASCII characters are wrong: /[E-f]/i matches only E, F, e and f instead of all ASCII letters plus [, \, ], ^, _ and ` (see example in ECMA 15.10.2.15).
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
(In reply to comment #16)
> /[q-b]/i.exec("") should throw, as should /[b-Q]/i.exec(""), because the code
> point of the first character isn't allowed to be greater than the code point of
> the second character, even in case-insensitive RegExps.

I think the spec is not "spec"ific here.  What section(s) of the spec show this definitively?

> Ranges with non-single characters as start or end like /[\d-q]/ should also
> throw per ECMA instead of assuming a literal -.

Again, I don't think the spec is specific here.  Perl causes \d-q to become \d,'-',q ...  show me what sections you base your conclusion on?

> Currently /[\xc0-\xdf]/i.exec("\xf6") (/[À-ß]/i.exec("ö")) fails and
> /[\xc0-\xde]/i.exec("\xf7") (/[À-Þ]/i.exec("÷")), while it should be the
> other way round. Even the results for ASCII characters are wrong: /[E-f]/i
> matches only E, F, e and f instead of all ASCII letters plus [, \, ], ^, _ and
> ` (see example in ECMA 15.10.2.15).

These results, I agree, are wrong.  I'll rework this patch, unless you have a patch in mind already, x00000000.
Ah right.  Trying some examples out, and looking through my IRC logs is refreshing my memory.  Brendan and I both (I think) agreed on the interpretation my patch imposes here:

23:33 <crowder> So the real question is, what's the right behavior?  The patch I have now will yield two smaller ranges (thanks to swapping)
23:33 <crowder> but perhaps we should have one large range from the lowest possible value to the highest
23:37 <brendan> 15.10.2.15 is clear enough
23:38 <brendan> we should not be using case-folded code points to decide to throw
23:38 <brendan> that's the bug
23:39 <crowder> And we don't throw on case-folded code-points with this patch
23:39 <crowder> And it will build the correct ranges
23:40 <brendan> we should throw on unfolded points being out of order
23:40 <brendan> that's it
23:40 <brendan> it's in ECMA-262 Ed. 3
23:40 <crowder> Right, which we do
23:40 <crowder> The last patch here restores that
23:40 <brendan> but we also throw on folded code points being out of order
23:40 <brendan> which is wrong
23:40 <crowder> no, we don't
23:40 <crowder> not with the patch applied
23:40 <brendan> i'm talking about trunk state
23:40 <crowder> oh right
23:40 <crowder> yeah
23:40 <crowder> We throw for both on current trunk
23:40 <crowder> which is, agreed, bogus
23:41 <brendan> ok, so your first change in the first hunk fixes that
23:41 <brendan> why the second change in that hunk?
23:41 <brendan>             localMax = JS_MAX(localMax, rangeStart);
23:42 <crowder> You must still ensure that the bitmap is large enough to accomodate the largest possible character
23:42 <brendan> and nothing did that before?
23:42 <crowder> localMax here is used eventually to calculate the total bmp
23:42 <brendan> oh, it was relying on the bogus throw to save itself
23:42 <crowder> right

Bug 427244 shows that I am still missing something here, though.
(In reply to comment #17)
> > Currently /[\xc0-\xdf]/i.exec("\xf6") (/[À-ß]/i.exec("ö")) fails and
> > /[\xc0-\xde]/i.exec("\xf7") (/[À-Þ]/i.exec("÷")), while it should be the
> > other way round. Even the results for ASCII characters are wrong: /[E-f]/i
> > matches only E, F, e and f instead of all ASCII letters plus [, \, ], ^, _ and
> > ` (see example in ECMA 15.10.2.15).
> 
> These results, I agree, are wrong.  I'll rework this patch, unless you have a
> patch in mind already, x00000000.

Based on my conversation with Brendan, I now agree with my old self again.  These results are correct by my interpretation of the spec.  x00000000:  I would be thrilled to be proven wrong.
Section 15.10.2.15 of ECMA 262-3 is very clear in the definition of the internal helper function CharacterRange, and it has also "Informative comments" on the topic:

| The internal helper function CharacterRange takes two CharSet parameters A
| and B and performs the following:
|
| 1. If A does not contain exactly one character or B does not contain exactly
|    one character then throw a SyntaxError exception.
| 2. Let a be the one character in CharSet A.
| 3. Let b be the one character in CharSet B.
| 4. Let i be the code point value of character a.
| 5. Let j be the code point value of character b.
| 6. If I > j then throw a SyntaxError exception.
| 7. Return the set containing all characters numbered i through j, inclusive.

| ClassRanges can expand into single ClassAtoms and/or ranges of two
| ClassAtoms separated by dashes. In the latter case the ClassRanges includes
| all characters between the first ClassAtom and the second ClassAtom,
| inclusive; an error occurs if either ClassAtom does not represent a single
| character (for example, if one is \w) or if the first ClassAtom's code point
| value is greater than the second ClassAtom's code point value.
| 
| Even if the pattern ignores case, the case of the two ends of a range is
| significant in determining which characters belong to the range. Thus, for
| example, the pattern /[E-F]/i matches only the letters E, F, e, and f, while
| the pattern /[E-f]/i matches all upper and lower-case ASCII letters as well
| as the symbols [, \, ], ^, _, and `.

The handling of case insensitivity in ECMA occurs at a later stage in the internal helper functions CharacterSetMatcher and Canonicalize, see section 15.10.2.8; the CharSet of the character range consists of the characters as specified, without case insensitivity applied. Of course, an implementation can do it in another way, as long as the final result is the same.

I'm not working on a patch for this.

BTW, the new Unicode 5.1.0 now has also a capital letter for \xdf (ß) at \u1e9e, so bitmaps can get quite large, even if only ISO-8859-1 characters are involved.
at least E-f is explained in Bug 418099.
(In reply to comment #20)
> | Even if the pattern ignores case, the case of the two ends of a range is
> | significant in determining which characters belong to the range. Thus, for
> | example, the pattern /[E-F]/i matches only the letters E, F, e, and f, while
> | the pattern /[E-f]/i matches all upper and lower-case ASCII letters as well
> | as the symbols [, \, ], ^, _, and `.

Ok, so this answers the question I started with in my log (not sure why I didn't notice this when poring over the spec before).  We should be adding all of the characters between MIN(upcase(start), upcase(end)) and MAX(downcase(start), downcase(start)), not just the two subranges.  That fix should be easy.

> BTW, the new Unicode 5.1.0 now has also a capital letter for \xdf (ß) at
> \u1e9e, so bitmaps can get quite large, even if only ISO-8859-1 characters are
> involved.

I'm not sure what this means for Spidermonkey, but it might be something the ES4 committee needs to consider?
> We should be adding all of the characters between MIN(upcase(start), upcase(end))
> and MAX(downcase(start), downcase(start)), not just the two subranges.

I meant MIN(upcase(start), downcase(start)) and MAX(upcase(end), downcase(end)), but that isn't enough, if /[E-f]/.test("A") should be true.
(In reply to comment #22)
> Ok, so this answers the question I started with in my log (not sure why I
> didn't notice this when poring over the spec before).  We should be adding all
> of the characters between MIN(upcase(start), upcase(end)) and
> MAX(downcase(start), downcase(start)), not just the two subranges.  That fix
> should be easy.

No, all characters between start and end plus the results of their individual upcase() and downcase() results.

> I'm not sure what this means for Spidermonkey, but it might be something the
> ES4 committee needs to consider?

A ECMA 3 conforming implementation can use any unicode version starting with 2.1. AFAIK SpiderMonkey still uses a very old version.
(In reply to comment #23)
> if /[E-f]/.test("A") should be true.

I meant /[E-f]/i.test("A")

...  this isn't as easy as I hoped, perhaps.  We actually need to visit each character in the range, calculate its up-/down-case, and compute the range as the minimum of all of those calculations to the maximum.
> ...  this isn't as easy as I hoped, perhaps.  We actually need to visit each
> character in the range, calculate its up-/down-case, and compute the range as
> the minimum of all of those calculations to the maximum.

Yes. But it's as easy as the current version, just more expensive (and it has to be done twice). Except if we want to shortcut this for ASCII-only.
(In reply to comment #26)
> > ...  this isn't as easy as I hoped, perhaps.  We actually need to visit each
> > character in the range, calculate its up-/down-case, and compute the range as
> > the minimum of all of those calculations to the maximum.
> 
> Yes. But it's as easy as the current version, just more expensive (and it has
> to be done twice). Except if we want to shortcut this for ASCII-only.

Actually no. No minimum, no maximum. Just start and end and upcase() and downcase() of each character between them. Then take the maximum them to compute the bitmap length, and then do it again and set each bit accordigly to upcase() and downcase() (in ProcessCharSet).
Attached patch ugh! WIP (obsolete) — Splinter Review
This is horrible.  Need to cache the results computed in CalculateBitmapSize so that we don't have to do another potentially huge round of upcase/downcasing in ProcessCharSet.
(In reply to comment #28)
> Created an attachment (id=314852) [details]
> ugh!  WIP

Seems still wrong. I think it should be like

        /* Throw a SyntaxError here, per ECMA-262, 15.10.2.15. */
        if (inRange && rangeStart > localMax) {
            JS_ReportErrorNumber(state->context,
                                  js_GetErrorMessage, NULL,
                                  JSMSG_BAD_CLASS_RANGE);
            return JS_FALSE;
        }

        if (inRange) {
            if (state->flags & JSREG_FOLD) {
                jsChar rangeEnd = localMax;
                for (c = rangeStart; c <= rangeEnd; ++c) {
                    jschar uc = upcase(c);
                    jschar dc = downcase(c);
                    if (uc > localMax)
                        localMax = uc;
                    if (dc > localMax)
                        localMax = dc;
                }
            }
            inRange = JS_FALSE;
        } else {
            if (src < end - 1) {
                if (*src == '-') {
                    if (!canStartRange) {
                        // throw
                    }
                    ++src;
                    inRange = JS_TRUE;
                    rangeStart = (jschar)localMax;
                    continue;
                }
            }
        }

...

        if (inRange) {
            if (gData->regexp->flags & JSREG_FOLD) {
                for (c = rangeStart; c <= thisCh; ++c) {
                    AddCharacterToCharSet(charSet, upcase(c));
                    AddCharacterToCharSet(charSet, downcase(c));
                }
            } else {
                AddCharacterRangeToCharSet(charSet, rangeStart, thisCh);
            }
            inRange = JS_FALSE;
        } else {

> This is horrible.  Need to cache the results computed in CalculateBitmapSize so
> that we don't have to do another potentially huge round of upcase/downcasing in
> ProcessCharSet.

There's nothing to cache. You would need the information that you're going to retrieve. The loop with the 2 AddCharacterToCharSet() could somewhat optimized.
Attached file *boggle*
Or, I could be wrong, and this could be way faster in SunSpider:

** TOTAL **:           1.046x as fast    3801.3ms +/- 1.5%   3633.7ms +/- 1.0%     significant
x00000000:  Sorry, can you describe the changes you made and why?  Your code looks different, but seems to accomplish the same result.
Your code still doesn't throw for reversed ranges in /i. If you throw, then you can assume rangeStart <= localMax and need no JS_MIN/JS_MAX.

In CalculateBitmapSize only the maximum is relevant, also for upcase(), which can be greater than the result of downcase().

In ProcessCharSet, you still add a whole character range to the set instead of single characters. There can be holes, as for /[Ö-Ø]/i, because the × in between cannot be uppercased, and all the code points between upper Ø and lower ö should also not be included.
See the log with Brendan:  we shouldn't throw for reversed ranges with "fold"ing.  (ie., /i)

Agreed with your other two points, new patch in a second.
I played around with possible ASCII optimization. But I wouldn't check that in shortly before a release with FIRST_UNSAFE_CHAR > 0.

Note that the case for \s is also particulary inefficient.

(In reply to comment #34)
> See the log with Brendan:  we shouldn't throw for reversed ranges with
> "fold"ing.  (ie., /i)

It's not entirely clear to me what Brendan really meant. At that point the code points are always unfolded. But the spec is clear and more relevant than an IRC log.
Attached patch v2 (obsolete) — Splinter Review
Let's worry about ASCII optimization for a follow-up bug.  x00000000, can you take a look at this patch?
Attachment #314852 - Attachment is obsolete: true
Attachment #314955 - Flags: review?(x00000000)
(In reply to comment #35)
> But the spec is clear and more relevant than an IRC log.

I think the spec is not clear in the /i case.  Worse: there is old code on the web which depends on /i not throwing for what you (and the spec, potentially) deem invalid /i character ranges -- there is at least one bug in b.m.o. documenting this, and caused by my -original- interpretation of the spec, which agreed with yours (ie., throw for any bogus range), though I can't find it now.  We didn't throw in this case in Fx2, and introducing a throw now could cause regressions.  Too late for that now, in my opinion.  We need to liberally accept that case (for now, at least) and try to do the closest-to-correct thing with the expression.
Perhaps in strict mode, we could throw on these bogus /i ranges?
SpiderMonkey used to throw on reversed character ranges also for /i before
bug 375711 broke it (also in 1.8.0.12 and 1.8.1.4).

Will look at the patch tomorrow.
Comment on attachment 314955 [details] [diff] [review]
v2

Even assuming that it is correct not to throw on reversed ranges for /i, how do you know what do then? Consider /[e-Q]/i. Swapping e and Q means that it is the same as /[Q-e]/i, where we seem to agree now that it means /[Q-eq-zA-E]/. Why shouldn't it be /[E-Qe-q]/ as it is now? Or /[E-q]/? I can't find a hint for any behavior in the spec. Swapping would make some sense with the broken old behavior that assumed 2 separated ranges for each case-insensitive RegExp. But SpiderMonkey 1.4.2, 1.5 and 1.6 did throw nevertheless.

>+            jschar min = localMax, max = localMax;

min is now unused.

A variable named max already exists in that scope; it may be better to use another name or to set localMax directly and put the calculation of the end condition outside the loop.

>+            for (i = JS_MIN(rangeStart, localMax);
>+                 i < JS_MAX(rangeStart, localMax); i++) {
>+                max = JS_MAX(max, downcase(i));
>+                max = JS_MAX(max, upcase(i));
>+            }
>+            localMax = max;

You need to include the last character. Same off-by-one error in ProcessCharSet.

Using JS_MAX may mean that downcase/upcase will be evaluated twice.

You may also consider to reverse the loop, because higher code points are more likely to produce the final max.

>         } else {

You forgot the not-in-range case for state->flags & JSREG_FOLD (as I did in comment 30).
Attachment #314955 - Flags: review?(x00000000) → review-
Attached patch v3 (obsolete) — Splinter Review
This introduces a failure in ecma_3/RegExp/regress-375715-04.js because the exception being thrown is not caught; the testcase does not crash, though so it just needs a little rework.
Attachment #314955 - Attachment is obsolete: true
Attachment #315155 - Flags: review?(x00000000)
x00000000: you've convinced me of the righteousness of throwing an exception even for folded expressions, if start > end, you'll notice that in this patch.
Attachment #315155 - Attachment is obsolete: true
Attachment #315155 - Flags: review?(x00000000)
Attachment #315157 - Flags: review?(x00000000)
Comment on attachment 315157 [details] [diff] [review]
v4, make throw and assertion match better

mrbkap: Mind re-reviewing this chunk of code for the nth time?
Attachment #315157 - Flags: review?(mrbkap)
Sorry, I could not follow most of your details, so excuse if this comment is irrelevant or ignorant :-)

Shouldn't the correct way for /i be to select all characters of the given range as stated (direction doe not really matter here) and then determine and add the corresponding upper- or lowercase character for each of the selected characters individually.
Cacycle:  Yes, and the current patch should do just that.
Comment on attachment 315157 [details] [diff] [review]
v4, make throw and assertion match better

>+            for (i = rangeStart; i <= localMax; i++) {
>+                jschar uch, dch;
>+
>+                uch = upcase(i);
>+                dch = downcase(i);
>+                localMax = JS_MAX(localMax, uch);
>+                localMax = JS_MAX(localMax, dch);
>+            }

This may extend the range; you need to copy the old localMax before the loop for use in the end condition.

And we only want to do that if state->flags & JSREG_FOLD.

>+        if (state->flags & JSREG_FOLD) {
>+            jschar uc = upcase(localMax);
>+            jschar dc = downcase(localMax);
>+
>+            c = JS_MAX(uc, dc);
>+            if (c > localMax)
>+                localMax = c;
>+        }

This has already been done if inRange. Put it inside the else clause.

It would be prettier to use code more similar to that in the loop, because it does the same as one iteration of the loop (you could even set rangeStart to localMax and use the loop code directly).

There's also the old problem that [\d\D\s\S\w\W] must be treated different than single characters. There have been several solutions hacked in, but the code is still unclean. If we can assume that an up/downcased digit is always a digit, an up/downcased non-digit is always a non-digit etc. (don't know if thats true), then we should set max directly (if appropriate) and continue the while loop without executing any code below the switch. Else we need specialized code. I don't know why that hasn't been done and instead the canStartRange hacked in (can't read bug 375876 and even its check-in comment seems to be faked).
Attachment #315157 - Flags: review?(x00000000) → review-
Attached patch revised (obsolete) — Splinter Review
Thanks for the outstanding reviews.
Attachment #315157 - Attachment is obsolete: true
Attachment #315621 - Flags: review?(x00000000)
Attachment #315157 - Flags: review?(mrbkap)
Attachment #315621 - Flags: review?(mrbkap)
Comment on attachment 315621 [details] [diff] [review]
revised

Looks good now, but the range swapping code in AddCharacterRangeToCharSet shouldn't be needed anymore and could be replaced by an assertion.

>+                rangeStart = localMax; /* one run of the uc/dc loop below */

I think the prevailing style is more than one space before the comment, but I may be wrong.
Attachment #315621 - Flags: review?(x00000000) → review+
put the old assertion back, but without the extra ()s, and added a couple spaces before the comment, per nit (can't go to much further away without violating 79-char limit)

(r? shaver, since I think mrbkap is very busy this week, shaver: retarget if you don't want this, please)
Attachment #315621 - Attachment is obsolete: true
Attachment #316015 - Flags: review?(shaver)
Attachment #315621 - Flags: review?(mrbkap)
x00000000:  Thanks again, btw, for thoughtful and helpful reviews.
Whiteboard: [needs review shaver]
Comment on attachment 316015 [details] [diff] [review]
restoring old assertion

r+a=shaver, largely on the strength of reading x0 and crowder's conversation, but this needs a pile of tests, and I worry about our test coverage for detecting regression.  Can someone generate or author a bunch of suite tests?

We need the regexp-snoop patch like yesterday.
Attachment #316015 - Flags: review?(shaver)
Attachment #316015 - Flags: review+
Attachment #316015 - Flags: approval1.9+
Whiteboard: [needs review shaver] → [needs landing]
jsregexp.c:3.199
Status: REOPENED → RESOLVED
Closed: 16 years ago16 years ago
Resolution: --- → FIXED
Attached file testcase (obsolete) —
This testcase tests if case insensitive ranges have the same bugs (notably bug 428816) as single character classes, except for two known differences: Characters with separate title cases have been partly fixed by this bug (/[\u01c5-\u01c5]/i.test('\u01c5') is now true, but /[\u01c5]/i.test('\u01c5') still false), and /[\0-\x01]/i.test('\0') is true although /[\0]/i.test('\0') is false (bug 429243).

Stage 1 of the testcase prints the failures for single characters, and stage 2 prints any unexpected differences from that for character ranges (it tests some random ranges with higher probability for lower code points). Stage 2 shouldn't find any failures; failures in stage 1 are a different bug.

The testcase may take a few hours to complete. I haven't run it completely, but the lower code points (|limit| set to 0x2000) show no failure.
As previously noted, the following now fail:

ecma_3/RegExp/regress-375711.js:
Expected value 'No Error', Actual value 'SyntaxError: invalid range in character class'

ecma_3/RegExp/regress-375715-04.js:
/ecma_3/RegExp/regress-375715-04.js:56: SyntaxError: invalid range in character class

I'll modify them to match the new behavior and add x0's test from comment 54 this evening.

Flags: in-testsuite+ → in-testsuite?
Attached file less complete but faster testcase (obsolete) —
The previous testcase isn't fast enough for the test suite. This one skips code point blocks with caseless characters only and tests only some random characters. It should take less than a minute and still catch systematic errors (and even that may be overkill for the test suite).
The previous testcase was quite fast because it didn't do anything in stage 2. This one is slower, but can be made much faster once the adaptions for bug 428816 aren't needed anymore.
Attachment #317150 - Attachment is obsolete: true
/cvsroot/mozilla/js/tests/public-failures.txt,v  <--  public-failures.txt
new revision: 1.69; previous revision: 1.68

/cvsroot/mozilla/js/tests/ecma_3/RegExp/regress-375711.js,v  <--  regress-375711.js
new revision: 1.3; previous revision: 1.2

/cvsroot/mozilla/js/tests/ecma_3/RegExp/regress-375715-04.js,v  <--  regress-375715-04.js
new revision: 1.3; previous revision: 1.2

I tried to adapt the test in comment 57 but it is way too slow to be run normally. 

x0, how much faster do you think it will run when the adaptations are removed? The random parts of the test have me a bit concerned in that it is important for the failures to be consistent across runs. Also, can you put the test into the format used by the other tests using reportCompare to report failures? Finally, I see you use the pattern a= rhs. It would be nice if you could instead use a = rhs. Thanks.
for ecma_3/RegExp/regress-375711.js, ecma_3/RegExp/regress-375715-04.js
Flags: in-testsuite? → in-testsuite+
v 1.9.0 for ecma_3/RegExp/regress-375711.js, ecma_3/RegExp/regress-375715-04.js
Status: RESOLVED → VERIFIED
Could this fix be related to Bug 440926?
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: