Closed Bug 416933 Opened 18 years ago Closed 17 years ago

Invalid range error for in Þ-ß case-insensitive regular expression

Tracking

()

Status:

VERIFIED FIXED

Milestone:

mozilla1.9beta5

People

(Reporter: cacyclewp, Assigned: crowderbt)

References

Details

(Keywords: regression)

Attachments

(6 files, 10 obsolete files)

more liberal range-handling 18 years ago Brian Crowder 1.14 KB, patch		Details \| Diff \| Splinter Review
more liberal still 18 years ago Brian Crowder 1.77 KB, patch	shaver : review+	Details \| Diff \| Splinter Review
better error reporting for case-insensitive matches 17 years ago Brian Crowder 2.07 KB, patch	brendan : review+	Details \| Diff \| Splinter Review
the landed patch 17 years ago Brian Crowder 2.12 KB, patch		Details \| Diff \| Splinter Review
ugh! WIP 17 years ago Brian Crowder 3.24 KB, patch		Details \| Diff \| Splinter Review
boggle 17 years ago Brian Crowder 3.40 KB, text/plain		Details
example patch with ASCII optimization 17 years ago x0 9.02 KB, patch		Details \| Diff \| Splinter Review
v2 17 years ago Brian Crowder 3.32 KB, patch	x00000000 : review-	Details \| Diff \| Splinter Review
v3 17 years ago Brian Crowder 3.64 KB, patch		Details \| Diff \| Splinter Review
v4, make throw and assertion match better 17 years ago Brian Crowder 3.64 KB, patch	x00000000 : review-	Details \| Diff \| Splinter Review
revised 17 years ago Brian Crowder 3.63 KB, patch	x00000000 : review+	Details \| Diff \| Splinter Review
restoring old assertion 17 years ago Brian Crowder 4.38 KB, patch	shaver : review+ shaver : approval1.9+	Details \| Diff \| Splinter Review
testcase 17 years ago x0 3.51 KB, application/x-javascript		Details
testcase with minor bugs fixed, plus progess indicator for stage 2 17 years ago x0 3.56 KB, application/x-javascript		Details
less complete but faster testcase 17 years ago x0 4.43 KB, application/x-javascript		Details
previous testcase corrected 17 years ago x0 4.32 KB, application/x-javascript		Details

Cacycle

Reporter

Description

•

18 years ago

User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.12) Gecko/20080201 Firefox/2.0.0.12 Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9b4pre) Gecko/2008021104 Minefield/3.0b4pre Any Unicode character range in JavaScript regular expressions that goes over Þ-ß throws an invalid range error ("Error: invalid range in character class"). This happens in Firefox 3.0b2 - 3.0b4pre (Firefox, Minefield, Swiftfox), but not in current SeaMonkey 1.1.7 and current Firefox 2.0.0.12 releases. The critical range is between the following two consecutive characters: U+00DE Þ c39e LATIN CAPITAL LETTER THORN U+00DF ß c39f LATIN SMALL LETTER SHARP S There is no problem with these characters alone outside of a range or these characters at the end or the start of a range, respectively. Reproducible: Always

Jesse Ruderman

Updated

•

18 years ago

Assignee: nobody → general

Component: General → JavaScript Engine

Keywords: regression

Product: Firefox → Core

QA Contact: general → general

Version: unspecified → Trunk

Brian Crowder

Assignee

Comment 1

•

18 years ago

Can you provide the exact expression that's generating this error? Using my current nightly, this works: javascript:/[Þ-ß]/.exec("Þ") Þ Thanks

Cacycle

Reporter

Comment 2

•

18 years ago

I have just figured out that it only happens with case insensitive regular expressions that have their "i" flag set: javascript:/[Þ-ß]/i.exec("Þ") throws the error as described above.

Brian Crowder

Assignee

Updated

•

18 years ago

Severity: major → normal

Status: UNCONFIRMED → NEW

Ever confirmed: true

Summary: Invalid range error for in Þ-ß JavaScript regular expression → Invalid range error for in Þ-ß case-insensitive regular expression

Brian Crowder

Assignee

Updated

•

18 years ago

Assignee: general → crowder

Brian Crowder

Assignee

Comment 3

•

18 years ago

Attached patch more liberal range-handling (obsolete) — Details — Splinter Review

I was overzealous when working on this previously. This patch backs off a bit on my earlier stance and ensures that bitmap-size is correct.

Attachment #309988 - Flags: review?(mrbkap)

Brian Crowder

Assignee

Comment 4

•

18 years ago

This bug apparently breaks the WidEd editor for Wikipedia pages. It should block.

Flags: blocking1.9?

Brian Crowder

Assignee

Comment 6

•

18 years ago

Comment on attachment 309988 [details] [diff] [review] more liberal range-handling Hm, not quite ready for prime-time, this needs some more fixing, to prevent a regression revealed by the test-suite: ecma_3/RegExp/regress-375715-01-n.js This test also, by its nature, changed another test or two, more in a second.

Attachment #309988 - Flags: review?(mrbkap)

Brian Crowder

Assignee

Comment 7

•

18 years ago

Attached patch more liberal still (obsolete) — Details — Splinter Review

Attachment #309988 - Attachment is obsolete: true

Attachment #310019 - Flags: review?(mrbkap)

Brian Crowder

Assignee

Comment 8

•

18 years ago

The patch from comment #7 causes the following two tests to need to be reworked: Testcase ecma_3/RegExp/regress-375711.js failed Bug Number 375711 [ Top of Page ] STATUS: Do not assert with /[Q-b]/i.exec("") Failure messages were: FAILED! expected: [reported from test()] Expected value 'SyntaxError: invalid range in character class', Actual value '' Testcase ecma_3/RegExp/regress-375715-01-n.js failed Bug Number 375715 [ Top of Page ] Expected exit code 3, got 0 Testcase terminated with signal 0 Complete testcase output was: BUGNUMBER: 375715 STATUS: Do not assert: (c2 <= cs->length) && (c1 <= c2) PASSED! Do not assert: (c2 <= cs->length) && (c1 <= c2) /[Wb-G]/.exec("") jstest: ecma_3/RegExp/regress-375715-01.js bug: result: PASSED type: shell description: Do not assert: (c2 <= cs->length) && (c1 <= c2) /[Wb-G]/.exec("") expected: actual: reason:

Mike Schroepfer

Comment 9

•

17 years ago

Can we get this in for Beta5?

Flags: blocking1.9? → blocking1.9+

Priority: -- → P1

Brian Crowder

Assignee

Comment 10

•

17 years ago

Comment on attachment 310019 [details] [diff] [review] more liberal still yeah, need a more available reviewer.

Attachment #310019 - Flags: review?(mrbkap) → review?(shaver)

Mike Shaver (:shaver emeritus)

Comment 11

•

17 years ago

Comment on attachment 310019 [details] [diff] [review] more liberal still r=shaver, assuming you've verified that it's acceptable to not report an error for hi-lo ranges. Just wondering what motivated the addition of the the test that needs changing.

Attachment #310019 - Flags: review?(shaver) → review+

Brian Crowder

Assignee

Comment 12

•

17 years ago

Attached patch better error reporting for case-insensitive matches (obsolete) — Details — Splinter Review

This restores the range exception for case-insensitive class ranges whose initial value is higher than its end value.

Attachment #310019 - Attachment is obsolete: true

Attachment #310670 - Flags: review?(shaver)

Mike Beltzner [:beltzner, not reading bugmail]

Updated

•

17 years ago

Target Milestone: --- → mozilla1.9beta5

Brendan Eich [:brendan]

Comment 13

•

17 years ago

Comment on attachment 310670 [details] [diff] [review] better error reporting for case-insensitive matches > if (state->flags & JSREG_FOLD) { > jschar uc = upcase(localMax); > jschar dc = downcase(localMax); > > c = JS_MAX(uc, dc); > if (c > localMax) > localMax = c; >+ } else if (inRange && rangeStart > localMax) { >+ JS_ReportErrorNumber(state->context, >+ js_GetErrorMessage, NULL, >+ JSMSG_BAD_CLASS_RANGE); >+ return JS_FALSE; > } Could cite 15.10.2.15 NonemptyClassRanges, the CharacterRange helper function, step 6. To do that nicely, change the else if to else { followed by comma and if-then indented one more basic offset, then }. >+ JS_ASSERT(c2 <= cs->length); Blank line before whole-line-or-more comment per house style. >+ /* Swap, if c1 > c2. */ >+ if (c1 > c2) { >+ c1 = c1 ^ c2; >+ c2 = c1 ^ c2; >+ c1 = c1 ^ c2; >+ } Use ^= if XOR wins, but using a tmp is simpler source (not that we eschew bit magic, just not sure it is worth it here -- if it is, ^= must be used for C purity ;-). /be

Attachment #310670 - Flags: review?(shaver) → review+

Brian Crowder

Assignee

Comment 14

•

17 years ago

Attached patch the landed patch — Details — Splinter Review

Attachment #310670 - Attachment is obsolete: true

Brian Crowder

Assignee

Updated

•

17 years ago

Status: NEW → RESOLVED

Closed: 17 years ago

Resolution: --- → FIXED

Bob Clary [:bc] (inactive)

Comment 15

•

17 years ago

/cvsroot/mozilla/js/tests/ecma_3/RegExp/regress-375711.js,v <-- regress-375711.js new revision: 1.2; previous revision: 1.1

Flags: in-testsuite+

Flags: in-litmus-

Comment 16

•

17 years ago

The fix (and also the testcase) is wrong. /[q-b]/i.exec("") should throw, as should /[b-Q]/i.exec(""), because the code point of the first character isn't allowed to be greater than the code point of the second character, even in case-insensitive RegExps. Ranges with non-single characters as start or end like /[\d-q]/ should also throw per ECMA instead of assuming a literal -. The whole method for case-insensitive character ranges is wrong because it assumes it can simply fill the bits in the bitmap between up- and downcased start and end characters. Instead it needs to calculate up- and downcased characters for each code point between start and end of the range. Note that the distance between upper and lower case variants can differ. \xfe (þ) has its associated upper case character at \xde while \xff (ÿ) has it at \u0178. Currently /[\xc0-\xdf]/i.exec("\xf6") (/[À-ß]/i.exec("ö")) fails and /[\xc0-\xde]/i.exec("\xf7") (/[À-Þ]/i.exec("÷")), while it should be the other way round. Even the results for ASCII characters are wrong: /[E-f]/i matches only E, F, e and f instead of all ASCII letters plus [, \, ], ^, _ and ` (see example in ECMA 15.10.2.15).

Status: RESOLVED → REOPENED

Resolution: FIXED → ---

Brian Crowder

Assignee

Comment 17

•

17 years ago

(In reply to comment #16) > /[q-b]/i.exec("") should throw, as should /[b-Q]/i.exec(""), because the code > point of the first character isn't allowed to be greater than the code point of > the second character, even in case-insensitive RegExps. I think the spec is not "spec"ific here. What section(s) of the spec show this definitively? > Ranges with non-single characters as start or end like /[\d-q]/ should also > throw per ECMA instead of assuming a literal -. Again, I don't think the spec is specific here. Perl causes \d-q to become \d,'-',q ... show me what sections you base your conclusion on? > Currently /[\xc0-\xdf]/i.exec("\xf6") (/[À-ß]/i.exec("ö")) fails and > /[\xc0-\xde]/i.exec("\xf7") (/[À-Þ]/i.exec("÷")), while it should be the > other way round. Even the results for ASCII characters are wrong: /[E-f]/i > matches only E, F, e and f instead of all ASCII letters plus [, \, ], ^, _ and > ` (see example in ECMA 15.10.2.15). These results, I agree, are wrong. I'll rework this patch, unless you have a patch in mind already, x00000000.

Brian Crowder

Assignee

Comment 18

•

17 years ago

Ah right. Trying some examples out, and looking through my IRC logs is refreshing my memory. Brendan and I both (I think) agreed on the interpretation my patch imposes here: 23:33 <crowder> So the real question is, what's the right behavior? The patch I have now will yield two smaller ranges (thanks to swapping) 23:33 <crowder> but perhaps we should have one large range from the lowest possible value to the highest 23:37 <brendan> 15.10.2.15 is clear enough 23:38 <brendan> we should not be using case-folded code points to decide to throw 23:38 <brendan> that's the bug 23:39 <crowder> And we don't throw on case-folded code-points with this patch 23:39 <crowder> And it will build the correct ranges 23:40 <brendan> we should throw on unfolded points being out of order 23:40 <brendan> that's it 23:40 <brendan> it's in ECMA-262 Ed. 3 23:40 <crowder> Right, which we do 23:40 <crowder> The last patch here restores that 23:40 <brendan> but we also throw on folded code points being out of order 23:40 <brendan> which is wrong 23:40 <crowder> no, we don't 23:40 <crowder> not with the patch applied 23:40 <brendan> i'm talking about trunk state 23:40 <crowder> oh right 23:40 <crowder> yeah 23:40 <crowder> We throw for both on current trunk 23:40 <crowder> which is, agreed, bogus 23:41 <brendan> ok, so your first change in the first hunk fixes that 23:41 <brendan> why the second change in that hunk? 23:41 <brendan> localMax = JS_MAX(localMax, rangeStart); 23:42 <crowder> You must still ensure that the bitmap is large enough to accomodate the largest possible character 23:42 <brendan> and nothing did that before? 23:42 <crowder> localMax here is used eventually to calculate the total bmp 23:42 <brendan> oh, it was relying on the bogus throw to save itself 23:42 <crowder> right Bug 427244 shows that I am still missing something here, though.

Brian Crowder

Assignee

Comment 19

•

17 years ago

(In reply to comment #17) > > Currently /[\xc0-\xdf]/i.exec("\xf6") (/[À-ß]/i.exec("ö")) fails and > > /[\xc0-\xde]/i.exec("\xf7") (/[À-Þ]/i.exec("÷")), while it should be the > > other way round. Even the results for ASCII characters are wrong: /[E-f]/i > > matches only E, F, e and f instead of all ASCII letters plus [, \, ], ^, _ and > > ` (see example in ECMA 15.10.2.15). > > These results, I agree, are wrong. I'll rework this patch, unless you have a > patch in mind already, x00000000. Based on my conversation with Brendan, I now agree with my old self again. These results are correct by my interpretation of the spec. x00000000: I would be thrilled to be proven wrong.

Comment 20

•

17 years ago

Section 15.10.2.15 of ECMA 262-3 is very clear in the definition of the internal helper function CharacterRange, and it has also "Informative comments" on the topic: | The internal helper function CharacterRange takes two CharSet parameters A | and B and performs the following: | | 1. If A does not contain exactly one character or B does not contain exactly | one character then throw a SyntaxError exception. | 2. Let a be the one character in CharSet A. | 3. Let b be the one character in CharSet B. | 4. Let i be the code point value of character a. | 5. Let j be the code point value of character b. | 6. If I > j then throw a SyntaxError exception. | 7. Return the set containing all characters numbered i through j, inclusive. | ClassRanges can expand into single ClassAtoms and/or ranges of two | ClassAtoms separated by dashes. In the latter case the ClassRanges includes | all characters between the first ClassAtom and the second ClassAtom, | inclusive; an error occurs if either ClassAtom does not represent a single | character (for example, if one is \w) or if the first ClassAtom's code point | value is greater than the second ClassAtom's code point value. | | Even if the pattern ignores case, the case of the two ends of a range is | significant in determining which characters belong to the range. Thus, for | example, the pattern /[E-F]/i matches only the letters E, F, e, and f, while | the pattern /[E-f]/i matches all upper and lower-case ASCII letters as well | as the symbols [, \, ], ^, _, and `. The handling of case insensitivity in ECMA occurs at a later stage in the internal helper functions CharacterSetMatcher and Canonicalize, see section 15.10.2.8; the CharSet of the character range consists of the characters as specified, without case insensitivity applied. Of course, an implementation can do it in another way, as long as the final result is the same. I'm not working on a patch for this. BTW, the new Unicode 5.1.0 now has also a capital letter for \xdf (ß) at \u1e9e, so bitmaps can get quite large, even if only ISO-8859-1 characters are involved.

mozilla

Comment 21

•

17 years ago

at least E-f is explained in Bug 418099.

Brian Crowder

Assignee

Comment 22

•

17 years ago

(In reply to comment #20) > | Even if the pattern ignores case, the case of the two ends of a range is > | significant in determining which characters belong to the range. Thus, for > | example, the pattern /[E-F]/i matches only the letters E, F, e, and f, while > | the pattern /[E-f]/i matches all upper and lower-case ASCII letters as well > | as the symbols [, \, ], ^, _, and `. Ok, so this answers the question I started with in my log (not sure why I didn't notice this when poring over the spec before). We should be adding all of the characters between MIN(upcase(start), upcase(end)) and MAX(downcase(start), downcase(start)), not just the two subranges. That fix should be easy. > BTW, the new Unicode 5.1.0 now has also a capital letter for \xdf (ß) at > \u1e9e, so bitmaps can get quite large, even if only ISO-8859-1 characters are > involved. I'm not sure what this means for Spidermonkey, but it might be something the ES4 committee needs to consider?

Brian Crowder

Assignee

Comment 23

•

17 years ago

> We should be adding all of the characters between MIN(upcase(start), upcase(end)) > and MAX(downcase(start), downcase(start)), not just the two subranges. I meant MIN(upcase(start), downcase(start)) and MAX(upcase(end), downcase(end)), but that isn't enough, if /[E-f]/.test("A") should be true.

Comment 24

•

17 years ago

(In reply to comment #22) > Ok, so this answers the question I started with in my log (not sure why I > didn't notice this when poring over the spec before). We should be adding all > of the characters between MIN(upcase(start), upcase(end)) and > MAX(downcase(start), downcase(start)), not just the two subranges. That fix > should be easy. No, all characters between start and end plus the results of their individual upcase() and downcase() results. > I'm not sure what this means for Spidermonkey, but it might be something the > ES4 committee needs to consider? A ECMA 3 conforming implementation can use any unicode version starting with 2.1. AFAIK SpiderMonkey still uses a very old version.

Brian Crowder

Assignee

Comment 25

•

17 years ago

(In reply to comment #23) > if /[E-f]/.test("A") should be true. I meant /[E-f]/i.test("A") ... this isn't as easy as I hoped, perhaps. We actually need to visit each character in the range, calculate its up-/down-case, and compute the range as the minimum of all of those calculations to the maximum.

Comment 26

•

17 years ago

> ... this isn't as easy as I hoped, perhaps. We actually need to visit each > character in the range, calculate its up-/down-case, and compute the range as > the minimum of all of those calculations to the maximum. Yes. But it's as easy as the current version, just more expensive (and it has to be done twice). Except if we want to shortcut this for ASCII-only.

Comment 27

•

17 years ago

(In reply to comment #26) > > ... this isn't as easy as I hoped, perhaps. We actually need to visit each > > character in the range, calculate its up-/down-case, and compute the range as > > the minimum of all of those calculations to the maximum. > > Yes. But it's as easy as the current version, just more expensive (and it has > to be done twice). Except if we want to shortcut this for ASCII-only. Actually no. No minimum, no maximum. Just start and end and upcase() and downcase() of each character between them. Then take the maximum them to compute the bitmap length, and then do it again and set each bit accordigly to upcase() and downcase() (in ProcessCharSet).

Brian Crowder

Assignee

Comment 28

•

17 years ago

Attached patch ugh! WIP (obsolete) — Details — Splinter Review

This is horrible. Need to cache the results computed in CalculateBitmapSize so that we don't have to do another potentially huge round of upcase/downcasing in ProcessCharSet.

Comment 30

•

17 years ago

(In reply to comment #28) > Created an attachment (id=314852) [details] > ugh! WIP Seems still wrong. I think it should be like /* Throw a SyntaxError here, per ECMA-262, 15.10.2.15. */ if (inRange && rangeStart > localMax) { JS_ReportErrorNumber(state->context, js_GetErrorMessage, NULL, JSMSG_BAD_CLASS_RANGE); return JS_FALSE; } if (inRange) { if (state->flags & JSREG_FOLD) { jsChar rangeEnd = localMax; for (c = rangeStart; c <= rangeEnd; ++c) { jschar uc = upcase(c); jschar dc = downcase(c); if (uc > localMax) localMax = uc; if (dc > localMax) localMax = dc; } } inRange = JS_FALSE; } else { if (src < end - 1) { if (*src == '-') { if (!canStartRange) { // throw } ++src; inRange = JS_TRUE; rangeStart = (jschar)localMax; continue; } } } ... if (inRange) { if (gData->regexp->flags & JSREG_FOLD) { for (c = rangeStart; c <= thisCh; ++c) { AddCharacterToCharSet(charSet, upcase(c)); AddCharacterToCharSet(charSet, downcase(c)); } } else { AddCharacterRangeToCharSet(charSet, rangeStart, thisCh); } inRange = JS_FALSE; } else { > This is horrible. Need to cache the results computed in CalculateBitmapSize so > that we don't have to do another potentially huge round of upcase/downcasing in > ProcessCharSet. There's nothing to cache. You would need the information that you're going to retrieve. The loop with the 2 AddCharacterToCharSet() could somewhat optimized.

Brian Crowder

Assignee

Comment 31

•

17 years ago

Attached file *boggle* — Details

Or, I could be wrong, and this could be way faster in SunSpider: ** TOTAL **: 1.046x as fast 3801.3ms +/- 1.5% 3633.7ms +/- 1.0% significant

Brian Crowder

Assignee

Comment 32

•

17 years ago

x00000000: Sorry, can you describe the changes you made and why? Your code looks different, but seems to accomplish the same result.

Comment 33

•

17 years ago

Your code still doesn't throw for reversed ranges in /i. If you throw, then you can assume rangeStart <= localMax and need no JS_MIN/JS_MAX. In CalculateBitmapSize only the maximum is relevant, also for upcase(), which can be greater than the result of downcase(). In ProcessCharSet, you still add a whole character range to the set instead of single characters. There can be holes, as for /[Ö-Ø]/i, because the × in between cannot be uppercased, and all the code points between upper Ø and lower ö should also not be included.

Brian Crowder

Assignee

Comment 34

•

17 years ago

See the log with Brendan: we shouldn't throw for reversed ranges with "fold"ing. (ie., /i) Agreed with your other two points, new patch in a second.

Comment 35

•

17 years ago

Attached patch example patch with ASCII optimization — Details — Splinter Review

I played around with possible ASCII optimization. But I wouldn't check that in shortly before a release with FIRST_UNSAFE_CHAR > 0. Note that the case for \s is also particulary inefficient. (In reply to comment #34) > See the log with Brendan: we shouldn't throw for reversed ranges with > "fold"ing. (ie., /i) It's not entirely clear to me what Brendan really meant. At that point the code points are always unfolded. But the spec is clear and more relevant than an IRC log.

Brian Crowder

Assignee

Comment 36

•

17 years ago

Attached patch v2 (obsolete) — Details — Splinter Review

Let's worry about ASCII optimization for a follow-up bug. x00000000, can you take a look at this patch?

Attachment #314852 - Attachment is obsolete: true

Attachment #314955 - Flags: review?(x00000000)

Brian Crowder

Assignee

Comment 37

•

17 years ago

(In reply to comment #35) > But the spec is clear and more relevant than an IRC log. I think the spec is not clear in the /i case. Worse: there is old code on the web which depends on /i not throwing for what you (and the spec, potentially) deem invalid /i character ranges -- there is at least one bug in b.m.o. documenting this, and caused by my -original- interpretation of the spec, which agreed with yours (ie., throw for any bogus range), though I can't find it now. We didn't throw in this case in Fx2, and introducing a throw now could cause regressions. Too late for that now, in my opinion. We need to liberally accept that case (for now, at least) and try to do the closest-to-correct thing with the expression.

Brian Crowder

Assignee

Comment 38

•

17 years ago

Perhaps in strict mode, we could throw on these bogus /i ranges?

Comment 39

•

17 years ago

SpiderMonkey used to throw on reversed character ranges also for /i before bug 375711 broke it (also in 1.8.0.12 and 1.8.1.4). Will look at the patch tomorrow.

Comment 40

•

17 years ago

Comment on attachment 314955 [details] [diff] [review] v2 Even assuming that it is correct not to throw on reversed ranges for /i, how do you know what do then? Consider /[e-Q]/i. Swapping e and Q means that it is the same as /[Q-e]/i, where we seem to agree now that it means /[Q-eq-zA-E]/. Why shouldn't it be /[E-Qe-q]/ as it is now? Or /[E-q]/? I can't find a hint for any behavior in the spec. Swapping would make some sense with the broken old behavior that assumed 2 separated ranges for each case-insensitive RegExp. But SpiderMonkey 1.4.2, 1.5 and 1.6 did throw nevertheless. >+ jschar min = localMax, max = localMax; min is now unused. A variable named max already exists in that scope; it may be better to use another name or to set localMax directly and put the calculation of the end condition outside the loop. >+ for (i = JS_MIN(rangeStart, localMax); >+ i < JS_MAX(rangeStart, localMax); i++) { >+ max = JS_MAX(max, downcase(i)); >+ max = JS_MAX(max, upcase(i)); >+ } >+ localMax = max; You need to include the last character. Same off-by-one error in ProcessCharSet. Using JS_MAX may mean that downcase/upcase will be evaluated twice. You may also consider to reverse the loop, because higher code points are more likely to produce the final max. > } else { You forgot the not-in-range case for state->flags & JSREG_FOLD (as I did in comment 30).

Attachment #314955 - Flags: review?(x00000000) → review-

Brian Crowder

Assignee

Comment 41

•

17 years ago

Attached patch v3 (obsolete) — Details — Splinter Review

This introduces a failure in ecma_3/RegExp/regress-375715-04.js because the exception being thrown is not caught; the testcase does not crash, though so it just needs a little rework.

Attachment #314955 - Attachment is obsolete: true

Attachment #315155 - Flags: review?(x00000000)

Brian Crowder

Assignee

Comment 42

•

17 years ago

Attached patch v4, make throw and assertion match better (obsolete) — Details — Splinter Review

x00000000: you've convinced me of the righteousness of throwing an exception even for folded expressions, if start > end, you'll notice that in this patch.

Attachment #315155 - Attachment is obsolete: true

Attachment #315155 - Flags: review?(x00000000)

Brian Crowder

Assignee

Updated

•

17 years ago

Attachment #315157 - Flags: review?(x00000000)

Brian Crowder

Assignee

Comment 43

•

17 years ago

Comment on attachment 315157 [details] [diff] [review] v4, make throw and assertion match better mrbkap: Mind re-reviewing this chunk of code for the nth time?

Attachment #315157 - Flags: review?(mrbkap)

Cacycle

Reporter

Comment 44

•

17 years ago

Sorry, I could not follow most of your details, so excuse if this comment is irrelevant or ignorant :-) Shouldn't the correct way for /i be to select all characters of the given range as stated (direction doe not really matter here) and then determine and add the corresponding upper- or lowercase character for each of the selected characters individually.

Brian Crowder

Assignee

Comment 45

•

17 years ago

Cacycle: Yes, and the current patch should do just that.

Comment 46

•

17 years ago

Comment on attachment 315157 [details] [diff] [review] v4, make throw and assertion match better >+ for (i = rangeStart; i <= localMax; i++) { >+ jschar uch, dch; >+ >+ uch = upcase(i); >+ dch = downcase(i); >+ localMax = JS_MAX(localMax, uch); >+ localMax = JS_MAX(localMax, dch); >+ } This may extend the range; you need to copy the old localMax before the loop for use in the end condition. And we only want to do that if state->flags & JSREG_FOLD. >+ if (state->flags & JSREG_FOLD) { >+ jschar uc = upcase(localMax); >+ jschar dc = downcase(localMax); >+ >+ c = JS_MAX(uc, dc); >+ if (c > localMax) >+ localMax = c; >+ } This has already been done if inRange. Put it inside the else clause. It would be prettier to use code more similar to that in the loop, because it does the same as one iteration of the loop (you could even set rangeStart to localMax and use the loop code directly). There's also the old problem that [\d\D\s\S\w\W] must be treated different than single characters. There have been several solutions hacked in, but the code is still unclean. If we can assume that an up/downcased digit is always a digit, an up/downcased non-digit is always a non-digit etc. (don't know if thats true), then we should set max directly (if appropriate) and continue the while loop without executing any code below the switch. Else we need specialized code. I don't know why that hasn't been done and instead the canStartRange hacked in (can't read bug 375876 and even its check-in comment seems to be faked).

Attachment #315157 - Flags: review?(x00000000) → review-

Brian Crowder

Assignee

Comment 47

•

17 years ago

Attached patch revised (obsolete) — Details — Splinter Review

Thanks for the outstanding reviews.

Attachment #315157 - Attachment is obsolete: true

Attachment #315621 - Flags: review?(x00000000)

Attachment #315157 - Flags: review?(mrbkap)

Brian Crowder

Assignee

Updated

•

17 years ago

Attachment #315621 - Flags: review?(mrbkap)

Comment 48

•

17 years ago

Comment on attachment 315621 [details] [diff] [review] revised Looks good now, but the range swapping code in AddCharacterRangeToCharSet shouldn't be needed anymore and could be replaced by an assertion. >+ rangeStart = localMax; /* one run of the uc/dc loop below */ I think the prevailing style is more than one space before the comment, but I may be wrong.

Attachment #315621 - Flags: review?(x00000000) → review+

Brian Crowder

Assignee

Comment 49

•

17 years ago

Attached patch restoring old assertion — Details — Splinter Review

put the old assertion back, but without the extra ()s, and added a couple spaces before the comment, per nit (can't go to much further away without violating 79-char limit) (r? shaver, since I think mrbkap is very busy this week, shaver: retarget if you don't want this, please)

Attachment #315621 - Attachment is obsolete: true

Attachment #316015 - Flags: review?(shaver)

Attachment #315621 - Flags: review?(mrbkap)

Brian Crowder

Assignee

Comment 50

•

17 years ago

x00000000: Thanks again, btw, for thoughtful and helpful reviews.

Mike Schroepfer

Updated

•

17 years ago

Whiteboard: [needs review shaver]

Mike Shaver (:shaver emeritus)

Comment 51

•

17 years ago

Comment on attachment 316015 [details] [diff] [review] restoring old assertion r+a=shaver, largely on the strength of reading x0 and crowder's conversation, but this needs a pile of tests, and I worry about our test coverage for detecting regression. Can someone generate or author a bunch of suite tests? We need the regexp-snoop patch like yesterday.

Attachment #316015 - Flags: review?(shaver)

Attachment #316015 - Flags: review+

Attachment #316015 - Flags: approval1.9+

Mike Schroepfer

Updated

•

17 years ago

Whiteboard: [needs review shaver] → [needs landing]

Brian Crowder

Assignee

Comment 52

•

17 years ago

jsregexp.c:3.199

Status: REOPENED → RESOLVED

Closed: 17 years ago → 17 years ago

Resolution: --- → FIXED

Comment 53

•

17 years ago

Attached file testcase (obsolete) — Details

This testcase tests if case insensitive ranges have the same bugs (notably bug 428816) as single character classes, except for two known differences: Characters with separate title cases have been partly fixed by this bug (/[\u01c5-\u01c5]/i.test('\u01c5') is now true, but /[\u01c5]/i.test('\u01c5') still false), and /[\0-\x01]/i.test('\0') is true although /[\0]/i.test('\0') is false (bug 429243). Stage 1 of the testcase prints the failures for single characters, and stage 2 prints any unexpected differences from that for character ranges (it tests some random ranges with higher probability for lower code points). Stage 2 shouldn't find any failures; failures in stage 1 are a different bug. The testcase may take a few hours to complete. I haven't run it completely, but the lower code points (|limit| set to 0x2000) show no failure.

Comment 54

•

17 years ago

Attached file testcase with minor bugs fixed, plus progess indicator for stage 2 — Details

Attachment #317087 - Attachment is obsolete: true

Bob Clary [:bc] (inactive)

Comment 55

•

17 years ago

As previously noted, the following now fail: ecma_3/RegExp/regress-375711.js: Expected value 'No Error', Actual value 'SyntaxError: invalid range in character class' ecma_3/RegExp/regress-375715-04.js: /ecma_3/RegExp/regress-375715-04.js:56: SyntaxError: invalid range in character class I'll modify them to match the new behavior and add x0's test from comment 54 this evening.

Flags: in-testsuite+ → in-testsuite?

Comment 56

•

17 years ago

Attached file less complete but faster testcase (obsolete) — Details

The previous testcase isn't fast enough for the test suite. This one skips code point blocks with caseless characters only and tests only some random characters. It should take less than a minute and still catch systematic errors (and even that may be overkill for the test suite).

Comment 57

•

17 years ago

Attached file previous testcase corrected — Details

The previous testcase was quite fast because it didn't do anything in stage 2. This one is slower, but can be made much faster once the adaptions for bug 428816 aren't needed anymore.

Attachment #317150 - Attachment is obsolete: true

Bob Clary [:bc] (inactive)

Comment 58

•

17 years ago

/cvsroot/mozilla/js/tests/public-failures.txt,v <-- public-failures.txt new revision: 1.69; previous revision: 1.68 /cvsroot/mozilla/js/tests/ecma_3/RegExp/regress-375711.js,v <-- regress-375711.js new revision: 1.3; previous revision: 1.2 /cvsroot/mozilla/js/tests/ecma_3/RegExp/regress-375715-04.js,v <-- regress-375715-04.js new revision: 1.3; previous revision: 1.2 I tried to adapt the test in comment 57 but it is way too slow to be run normally. x0, how much faster do you think it will run when the adaptations are removed? The random parts of the test have me a bit concerned in that it is important for the failures to be consistent across runs. Also, can you put the test into the format used by the other tests using reportCompare to report failures? Finally, I see you use the pattern a= rhs. It would be nice if you could instead use a = rhs. Thanks.

Bob Clary [:bc] (inactive)

Comment 59

•

17 years ago

for ecma_3/RegExp/regress-375711.js, ecma_3/RegExp/regress-375715-04.js

Flags: in-testsuite? → in-testsuite+

Bob Clary [:bc] (inactive)

Comment 60

•

17 years ago

v 1.9.0 for ecma_3/RegExp/regress-375711.js, ecma_3/RegExp/regress-375715-04.js

Status: RESOLVED → VERIFIED

Cacycle

Reporter

Comment 61

•

17 years ago

Could this fix be related to Bug 440926?

Robert O'Callahan (:roc) (email my personal email if necessary)

Updated

•

17 years ago

Whiteboard: [needs landing]

You need to log in before you can comment on or make changes to this bug.