Closed
Bug 416933
Opened 16 years ago
Closed 16 years ago
Invalid range error for in Þ-ß case-insensitive regular expression
Categories
(Core :: JavaScript Engine, defect, P1)
Tracking
()
VERIFIED
FIXED
mozilla1.9beta5
People
(Reporter: cacyclewp, Assigned: crowderbt)
References
Details
(Keywords: regression)
Attachments
(6 files, 10 obsolete files)
2.12 KB,
patch
|
Details | Diff | Splinter Review | |
3.40 KB,
text/plain
|
Details | |
9.02 KB,
patch
|
Details | Diff | Splinter Review | |
4.38 KB,
patch
|
shaver
:
review+
shaver
:
approval1.9+
|
Details | Diff | Splinter Review |
3.56 KB,
application/x-javascript
|
Details | |
4.32 KB,
application/x-javascript
|
Details |
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.12) Gecko/20080201 Firefox/2.0.0.12 Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9b4pre) Gecko/2008021104 Minefield/3.0b4pre Any Unicode character range in JavaScript regular expressions that goes over Þ-ß throws an invalid range error ("Error: invalid range in character class"). This happens in Firefox 3.0b2 - 3.0b4pre (Firefox, Minefield, Swiftfox), but not in current SeaMonkey 1.1.7 and current Firefox 2.0.0.12 releases. The critical range is between the following two consecutive characters: U+00DE Þ c39e LATIN CAPITAL LETTER THORN U+00DF ß c39f LATIN SMALL LETTER SHARP S There is no problem with these characters alone outside of a range or these characters at the end or the start of a range, respectively. Reproducible: Always
Updated•16 years ago
|
Assignee: nobody → general
Component: General → JavaScript Engine
Keywords: regression
Product: Firefox → Core
QA Contact: general → general
Version: unspecified → Trunk
Assignee | ||
Comment 1•16 years ago
|
||
Can you provide the exact expression that's generating this error? Using my current nightly, this works: javascript:/[Þ-ß]/.exec("Þ") Þ Thanks
I have just figured out that it only happens with case insensitive regular expressions that have their "i" flag set: javascript:/[Þ-ß]/i.exec("Þ") throws the error as described above.
Assignee | ||
Updated•16 years ago
|
Severity: major → normal
Status: UNCONFIRMED → NEW
Ever confirmed: true
Summary: Invalid range error for in Þ-ß JavaScript regular expression → Invalid range error for in Þ-ß case-insensitive regular expression
Assignee | ||
Updated•16 years ago
|
Assignee: general → crowder
Assignee | ||
Comment 3•16 years ago
|
||
I was overzealous when working on this previously. This patch backs off a bit on my earlier stance and ensures that bitmap-size is correct.
Attachment #309988 -
Flags: review?(mrbkap)
Assignee | ||
Comment 4•16 years ago
|
||
This bug apparently breaks the WidEd editor for Wikipedia pages. It should block.
Flags: blocking1.9?
Assignee | ||
Comment 6•16 years ago
|
||
Comment on attachment 309988 [details] [diff] [review] more liberal range-handling Hm, not quite ready for prime-time, this needs some more fixing, to prevent a regression revealed by the test-suite: ecma_3/RegExp/regress-375715-01-n.js This test also, by its nature, changed another test or two, more in a second.
Attachment #309988 -
Flags: review?(mrbkap)
Assignee | ||
Comment 7•16 years ago
|
||
Attachment #309988 -
Attachment is obsolete: true
Attachment #310019 -
Flags: review?(mrbkap)
Assignee | ||
Comment 8•16 years ago
|
||
The patch from comment #7 causes the following two tests to need to be reworked: Testcase ecma_3/RegExp/regress-375711.js failed Bug Number 375711 [ Top of Page ] STATUS: Do not assert with /[Q-b]/i.exec("") Failure messages were: FAILED! expected: [reported from test()] Expected value 'SyntaxError: invalid range in character class', Actual value '' Testcase ecma_3/RegExp/regress-375715-01-n.js failed Bug Number 375715 [ Top of Page ] Expected exit code 3, got 0 Testcase terminated with signal 0 Complete testcase output was: BUGNUMBER: 375715 STATUS: Do not assert: (c2 <= cs->length) && (c1 <= c2) PASSED! Do not assert: (c2 <= cs->length) && (c1 <= c2) /[Wb-G]/.exec("") jstest: ecma_3/RegExp/regress-375715-01.js bug: result: PASSED type: shell description: Do not assert: (c2 <= cs->length) && (c1 <= c2) /[Wb-G]/.exec("") expected: actual: reason:
Comment 9•16 years ago
|
||
Can we get this in for Beta5?
Flags: blocking1.9? → blocking1.9+
Priority: -- → P1
Assignee | ||
Comment 10•16 years ago
|
||
Comment on attachment 310019 [details] [diff] [review] more liberal still yeah, need a more available reviewer.
Attachment #310019 -
Flags: review?(mrbkap) → review?(shaver)
Comment on attachment 310019 [details] [diff] [review] more liberal still r=shaver, assuming you've verified that it's acceptable to not report an error for hi-lo ranges. Just wondering what motivated the addition of the the test that needs changing.
Attachment #310019 -
Flags: review?(shaver) → review+
Assignee | ||
Comment 12•16 years ago
|
||
This restores the range exception for case-insensitive class ranges whose initial value is higher than its end value.
Attachment #310019 -
Attachment is obsolete: true
Attachment #310670 -
Flags: review?(shaver)
Updated•16 years ago
|
Target Milestone: --- → mozilla1.9beta5
Comment 13•16 years ago
|
||
Comment on attachment 310670 [details] [diff] [review] better error reporting for case-insensitive matches > if (state->flags & JSREG_FOLD) { > jschar uc = upcase(localMax); > jschar dc = downcase(localMax); > > c = JS_MAX(uc, dc); > if (c > localMax) > localMax = c; >+ } else if (inRange && rangeStart > localMax) { >+ JS_ReportErrorNumber(state->context, >+ js_GetErrorMessage, NULL, >+ JSMSG_BAD_CLASS_RANGE); >+ return JS_FALSE; > } Could cite 15.10.2.15 NonemptyClassRanges, the CharacterRange helper function, step 6. To do that nicely, change the else if to else { followed by comma and if-then indented one more basic offset, then }. >+ JS_ASSERT(c2 <= cs->length); Blank line before whole-line-or-more comment per house style. >+ /* Swap, if c1 > c2. */ >+ if (c1 > c2) { >+ c1 = c1 ^ c2; >+ c2 = c1 ^ c2; >+ c1 = c1 ^ c2; >+ } Use ^= if XOR wins, but using a tmp is simpler source (not that we eschew bit magic, just not sure it is worth it here -- if it is, ^= must be used for C purity ;-). /be
Attachment #310670 -
Flags: review?(shaver) → review+
Assignee | ||
Comment 14•16 years ago
|
||
Attachment #310670 -
Attachment is obsolete: true
Assignee | ||
Updated•16 years ago
|
Status: NEW → RESOLVED
Closed: 16 years ago
Resolution: --- → FIXED
Comment 15•16 years ago
|
||
/cvsroot/mozilla/js/tests/ecma_3/RegExp/regress-375711.js,v <-- regress-375711.js new revision: 1.2; previous revision: 1.1
Flags: in-testsuite+
Flags: in-litmus-
Comment 16•16 years ago
|
||
The fix (and also the testcase) is wrong. /[q-b]/i.exec("") should throw, as should /[b-Q]/i.exec(""), because the code point of the first character isn't allowed to be greater than the code point of the second character, even in case-insensitive RegExps. Ranges with non-single characters as start or end like /[\d-q]/ should also throw per ECMA instead of assuming a literal -. The whole method for case-insensitive character ranges is wrong because it assumes it can simply fill the bits in the bitmap between up- and downcased start and end characters. Instead it needs to calculate up- and downcased characters for each code point between start and end of the range. Note that the distance between upper and lower case variants can differ. \xfe (þ) has its associated upper case character at \xde while \xff (ÿ) has it at \u0178. Currently /[\xc0-\xdf]/i.exec("\xf6") (/[À-ß]/i.exec("ö")) fails and /[\xc0-\xde]/i.exec("\xf7") (/[À-Þ]/i.exec("÷")), while it should be the other way round. Even the results for ASCII characters are wrong: /[E-f]/i matches only E, F, e and f instead of all ASCII letters plus [, \, ], ^, _ and ` (see example in ECMA 15.10.2.15).
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Assignee | ||
Comment 17•16 years ago
|
||
(In reply to comment #16) > /[q-b]/i.exec("") should throw, as should /[b-Q]/i.exec(""), because the code > point of the first character isn't allowed to be greater than the code point of > the second character, even in case-insensitive RegExps. I think the spec is not "spec"ific here. What section(s) of the spec show this definitively? > Ranges with non-single characters as start or end like /[\d-q]/ should also > throw per ECMA instead of assuming a literal -. Again, I don't think the spec is specific here. Perl causes \d-q to become \d,'-',q ... show me what sections you base your conclusion on? > Currently /[\xc0-\xdf]/i.exec("\xf6") (/[À-ß]/i.exec("ö")) fails and > /[\xc0-\xde]/i.exec("\xf7") (/[À-Þ]/i.exec("÷")), while it should be the > other way round. Even the results for ASCII characters are wrong: /[E-f]/i > matches only E, F, e and f instead of all ASCII letters plus [, \, ], ^, _ and > ` (see example in ECMA 15.10.2.15). These results, I agree, are wrong. I'll rework this patch, unless you have a patch in mind already, x00000000.
Assignee | ||
Comment 18•16 years ago
|
||
Ah right. Trying some examples out, and looking through my IRC logs is refreshing my memory. Brendan and I both (I think) agreed on the interpretation my patch imposes here: 23:33 <crowder> So the real question is, what's the right behavior? The patch I have now will yield two smaller ranges (thanks to swapping) 23:33 <crowder> but perhaps we should have one large range from the lowest possible value to the highest 23:37 <brendan> 15.10.2.15 is clear enough 23:38 <brendan> we should not be using case-folded code points to decide to throw 23:38 <brendan> that's the bug 23:39 <crowder> And we don't throw on case-folded code-points with this patch 23:39 <crowder> And it will build the correct ranges 23:40 <brendan> we should throw on unfolded points being out of order 23:40 <brendan> that's it 23:40 <brendan> it's in ECMA-262 Ed. 3 23:40 <crowder> Right, which we do 23:40 <crowder> The last patch here restores that 23:40 <brendan> but we also throw on folded code points being out of order 23:40 <brendan> which is wrong 23:40 <crowder> no, we don't 23:40 <crowder> not with the patch applied 23:40 <brendan> i'm talking about trunk state 23:40 <crowder> oh right 23:40 <crowder> yeah 23:40 <crowder> We throw for both on current trunk 23:40 <crowder> which is, agreed, bogus 23:41 <brendan> ok, so your first change in the first hunk fixes that 23:41 <brendan> why the second change in that hunk? 23:41 <brendan> localMax = JS_MAX(localMax, rangeStart); 23:42 <crowder> You must still ensure that the bitmap is large enough to accomodate the largest possible character 23:42 <brendan> and nothing did that before? 23:42 <crowder> localMax here is used eventually to calculate the total bmp 23:42 <brendan> oh, it was relying on the bogus throw to save itself 23:42 <crowder> right Bug 427244 shows that I am still missing something here, though.
Assignee | ||
Comment 19•16 years ago
|
||
(In reply to comment #17) > > Currently /[\xc0-\xdf]/i.exec("\xf6") (/[À-ß]/i.exec("ö")) fails and > > /[\xc0-\xde]/i.exec("\xf7") (/[À-Þ]/i.exec("÷")), while it should be the > > other way round. Even the results for ASCII characters are wrong: /[E-f]/i > > matches only E, F, e and f instead of all ASCII letters plus [, \, ], ^, _ and > > ` (see example in ECMA 15.10.2.15). > > These results, I agree, are wrong. I'll rework this patch, unless you have a > patch in mind already, x00000000. Based on my conversation with Brendan, I now agree with my old self again. These results are correct by my interpretation of the spec. x00000000: I would be thrilled to be proven wrong.
Comment 20•16 years ago
|
||
Section 15.10.2.15 of ECMA 262-3 is very clear in the definition of the internal helper function CharacterRange, and it has also "Informative comments" on the topic: | The internal helper function CharacterRange takes two CharSet parameters A | and B and performs the following: | | 1. If A does not contain exactly one character or B does not contain exactly | one character then throw a SyntaxError exception. | 2. Let a be the one character in CharSet A. | 3. Let b be the one character in CharSet B. | 4. Let i be the code point value of character a. | 5. Let j be the code point value of character b. | 6. If I > j then throw a SyntaxError exception. | 7. Return the set containing all characters numbered i through j, inclusive. | ClassRanges can expand into single ClassAtoms and/or ranges of two | ClassAtoms separated by dashes. In the latter case the ClassRanges includes | all characters between the first ClassAtom and the second ClassAtom, | inclusive; an error occurs if either ClassAtom does not represent a single | character (for example, if one is \w) or if the first ClassAtom's code point | value is greater than the second ClassAtom's code point value. | | Even if the pattern ignores case, the case of the two ends of a range is | significant in determining which characters belong to the range. Thus, for | example, the pattern /[E-F]/i matches only the letters E, F, e, and f, while | the pattern /[E-f]/i matches all upper and lower-case ASCII letters as well | as the symbols [, \, ], ^, _, and `. The handling of case insensitivity in ECMA occurs at a later stage in the internal helper functions CharacterSetMatcher and Canonicalize, see section 15.10.2.8; the CharSet of the character range consists of the characters as specified, without case insensitivity applied. Of course, an implementation can do it in another way, as long as the final result is the same. I'm not working on a patch for this. BTW, the new Unicode 5.1.0 now has also a capital letter for \xdf (ß) at \u1e9e, so bitmaps can get quite large, even if only ISO-8859-1 characters are involved.
Comment 21•16 years ago
|
||
at least E-f is explained in Bug 418099.
Assignee | ||
Comment 22•16 years ago
|
||
(In reply to comment #20) > | Even if the pattern ignores case, the case of the two ends of a range is > | significant in determining which characters belong to the range. Thus, for > | example, the pattern /[E-F]/i matches only the letters E, F, e, and f, while > | the pattern /[E-f]/i matches all upper and lower-case ASCII letters as well > | as the symbols [, \, ], ^, _, and `. Ok, so this answers the question I started with in my log (not sure why I didn't notice this when poring over the spec before). We should be adding all of the characters between MIN(upcase(start), upcase(end)) and MAX(downcase(start), downcase(start)), not just the two subranges. That fix should be easy. > BTW, the new Unicode 5.1.0 now has also a capital letter for \xdf (ß) at > \u1e9e, so bitmaps can get quite large, even if only ISO-8859-1 characters are > involved. I'm not sure what this means for Spidermonkey, but it might be something the ES4 committee needs to consider?
Assignee | ||
Comment 23•16 years ago
|
||
> We should be adding all of the characters between MIN(upcase(start), upcase(end))
> and MAX(downcase(start), downcase(start)), not just the two subranges.
I meant MIN(upcase(start), downcase(start)) and MAX(upcase(end), downcase(end)), but that isn't enough, if /[E-f]/.test("A") should be true.
Comment 24•16 years ago
|
||
(In reply to comment #22) > Ok, so this answers the question I started with in my log (not sure why I > didn't notice this when poring over the spec before). We should be adding all > of the characters between MIN(upcase(start), upcase(end)) and > MAX(downcase(start), downcase(start)), not just the two subranges. That fix > should be easy. No, all characters between start and end plus the results of their individual upcase() and downcase() results. > I'm not sure what this means for Spidermonkey, but it might be something the > ES4 committee needs to consider? A ECMA 3 conforming implementation can use any unicode version starting with 2.1. AFAIK SpiderMonkey still uses a very old version.
Assignee | ||
Comment 25•16 years ago
|
||
(In reply to comment #23) > if /[E-f]/.test("A") should be true. I meant /[E-f]/i.test("A") ... this isn't as easy as I hoped, perhaps. We actually need to visit each character in the range, calculate its up-/down-case, and compute the range as the minimum of all of those calculations to the maximum.
Comment 26•16 years ago
|
||
> ... this isn't as easy as I hoped, perhaps. We actually need to visit each
> character in the range, calculate its up-/down-case, and compute the range as
> the minimum of all of those calculations to the maximum.
Yes. But it's as easy as the current version, just more expensive (and it has to be done twice). Except if we want to shortcut this for ASCII-only.
Comment 27•16 years ago
|
||
(In reply to comment #26) > > ... this isn't as easy as I hoped, perhaps. We actually need to visit each > > character in the range, calculate its up-/down-case, and compute the range as > > the minimum of all of those calculations to the maximum. > > Yes. But it's as easy as the current version, just more expensive (and it has > to be done twice). Except if we want to shortcut this for ASCII-only. Actually no. No minimum, no maximum. Just start and end and upcase() and downcase() of each character between them. Then take the maximum them to compute the bitmap length, and then do it again and set each bit accordigly to upcase() and downcase() (in ProcessCharSet).
Assignee | ||
Comment 28•16 years ago
|
||
This is horrible. Need to cache the results computed in CalculateBitmapSize so that we don't have to do another potentially huge round of upcase/downcasing in ProcessCharSet.
Comment 30•16 years ago
|
||
(In reply to comment #28) > Created an attachment (id=314852) [details] > ugh! WIP Seems still wrong. I think it should be like /* Throw a SyntaxError here, per ECMA-262, 15.10.2.15. */ if (inRange && rangeStart > localMax) { JS_ReportErrorNumber(state->context, js_GetErrorMessage, NULL, JSMSG_BAD_CLASS_RANGE); return JS_FALSE; } if (inRange) { if (state->flags & JSREG_FOLD) { jsChar rangeEnd = localMax; for (c = rangeStart; c <= rangeEnd; ++c) { jschar uc = upcase(c); jschar dc = downcase(c); if (uc > localMax) localMax = uc; if (dc > localMax) localMax = dc; } } inRange = JS_FALSE; } else { if (src < end - 1) { if (*src == '-') { if (!canStartRange) { // throw } ++src; inRange = JS_TRUE; rangeStart = (jschar)localMax; continue; } } } ... if (inRange) { if (gData->regexp->flags & JSREG_FOLD) { for (c = rangeStart; c <= thisCh; ++c) { AddCharacterToCharSet(charSet, upcase(c)); AddCharacterToCharSet(charSet, downcase(c)); } } else { AddCharacterRangeToCharSet(charSet, rangeStart, thisCh); } inRange = JS_FALSE; } else { > This is horrible. Need to cache the results computed in CalculateBitmapSize so > that we don't have to do another potentially huge round of upcase/downcasing in > ProcessCharSet. There's nothing to cache. You would need the information that you're going to retrieve. The loop with the 2 AddCharacterToCharSet() could somewhat optimized.
Assignee | ||
Comment 31•16 years ago
|
||
Or, I could be wrong, and this could be way faster in SunSpider: ** TOTAL **: 1.046x as fast 3801.3ms +/- 1.5% 3633.7ms +/- 1.0% significant
Assignee | ||
Comment 32•16 years ago
|
||
x00000000: Sorry, can you describe the changes you made and why? Your code looks different, but seems to accomplish the same result.
Comment 33•16 years ago
|
||
Your code still doesn't throw for reversed ranges in /i. If you throw, then you can assume rangeStart <= localMax and need no JS_MIN/JS_MAX. In CalculateBitmapSize only the maximum is relevant, also for upcase(), which can be greater than the result of downcase(). In ProcessCharSet, you still add a whole character range to the set instead of single characters. There can be holes, as for /[Ö-Ø]/i, because the × in between cannot be uppercased, and all the code points between upper Ø and lower ö should also not be included.
Assignee | ||
Comment 34•16 years ago
|
||
See the log with Brendan: we shouldn't throw for reversed ranges with "fold"ing. (ie., /i) Agreed with your other two points, new patch in a second.
Comment 35•16 years ago
|
||
I played around with possible ASCII optimization. But I wouldn't check that in shortly before a release with FIRST_UNSAFE_CHAR > 0. Note that the case for \s is also particulary inefficient. (In reply to comment #34) > See the log with Brendan: we shouldn't throw for reversed ranges with > "fold"ing. (ie., /i) It's not entirely clear to me what Brendan really meant. At that point the code points are always unfolded. But the spec is clear and more relevant than an IRC log.
Assignee | ||
Comment 36•16 years ago
|
||
Let's worry about ASCII optimization for a follow-up bug. x00000000, can you take a look at this patch?
Attachment #314852 -
Attachment is obsolete: true
Attachment #314955 -
Flags: review?(x00000000)
Assignee | ||
Comment 37•16 years ago
|
||
(In reply to comment #35) > But the spec is clear and more relevant than an IRC log. I think the spec is not clear in the /i case. Worse: there is old code on the web which depends on /i not throwing for what you (and the spec, potentially) deem invalid /i character ranges -- there is at least one bug in b.m.o. documenting this, and caused by my -original- interpretation of the spec, which agreed with yours (ie., throw for any bogus range), though I can't find it now. We didn't throw in this case in Fx2, and introducing a throw now could cause regressions. Too late for that now, in my opinion. We need to liberally accept that case (for now, at least) and try to do the closest-to-correct thing with the expression.
Assignee | ||
Comment 38•16 years ago
|
||
Perhaps in strict mode, we could throw on these bogus /i ranges?
Comment 39•16 years ago
|
||
SpiderMonkey used to throw on reversed character ranges also for /i before bug 375711 broke it (also in 1.8.0.12 and 1.8.1.4). Will look at the patch tomorrow.
Comment 40•16 years ago
|
||
Comment on attachment 314955 [details] [diff] [review] v2 Even assuming that it is correct not to throw on reversed ranges for /i, how do you know what do then? Consider /[e-Q]/i. Swapping e and Q means that it is the same as /[Q-e]/i, where we seem to agree now that it means /[Q-eq-zA-E]/. Why shouldn't it be /[E-Qe-q]/ as it is now? Or /[E-q]/? I can't find a hint for any behavior in the spec. Swapping would make some sense with the broken old behavior that assumed 2 separated ranges for each case-insensitive RegExp. But SpiderMonkey 1.4.2, 1.5 and 1.6 did throw nevertheless. >+ jschar min = localMax, max = localMax; min is now unused. A variable named max already exists in that scope; it may be better to use another name or to set localMax directly and put the calculation of the end condition outside the loop. >+ for (i = JS_MIN(rangeStart, localMax); >+ i < JS_MAX(rangeStart, localMax); i++) { >+ max = JS_MAX(max, downcase(i)); >+ max = JS_MAX(max, upcase(i)); >+ } >+ localMax = max; You need to include the last character. Same off-by-one error in ProcessCharSet. Using JS_MAX may mean that downcase/upcase will be evaluated twice. You may also consider to reverse the loop, because higher code points are more likely to produce the final max. > } else { You forgot the not-in-range case for state->flags & JSREG_FOLD (as I did in comment 30).
Attachment #314955 -
Flags: review?(x00000000) → review-
Assignee | ||
Comment 41•16 years ago
|
||
This introduces a failure in ecma_3/RegExp/regress-375715-04.js because the exception being thrown is not caught; the testcase does not crash, though so it just needs a little rework.
Attachment #314955 -
Attachment is obsolete: true
Attachment #315155 -
Flags: review?(x00000000)
Assignee | ||
Comment 42•16 years ago
|
||
x00000000: you've convinced me of the righteousness of throwing an exception even for folded expressions, if start > end, you'll notice that in this patch.
Attachment #315155 -
Attachment is obsolete: true
Attachment #315155 -
Flags: review?(x00000000)
Assignee | ||
Updated•16 years ago
|
Attachment #315157 -
Flags: review?(x00000000)
Assignee | ||
Comment 43•16 years ago
|
||
Comment on attachment 315157 [details] [diff] [review] v4, make throw and assertion match better mrbkap: Mind re-reviewing this chunk of code for the nth time?
Attachment #315157 -
Flags: review?(mrbkap)
Reporter | ||
Comment 44•16 years ago
|
||
Sorry, I could not follow most of your details, so excuse if this comment is irrelevant or ignorant :-) Shouldn't the correct way for /i be to select all characters of the given range as stated (direction doe not really matter here) and then determine and add the corresponding upper- or lowercase character for each of the selected characters individually.
Assignee | ||
Comment 45•16 years ago
|
||
Cacycle: Yes, and the current patch should do just that.
Comment 46•16 years ago
|
||
Comment on attachment 315157 [details] [diff] [review] v4, make throw and assertion match better >+ for (i = rangeStart; i <= localMax; i++) { >+ jschar uch, dch; >+ >+ uch = upcase(i); >+ dch = downcase(i); >+ localMax = JS_MAX(localMax, uch); >+ localMax = JS_MAX(localMax, dch); >+ } This may extend the range; you need to copy the old localMax before the loop for use in the end condition. And we only want to do that if state->flags & JSREG_FOLD. >+ if (state->flags & JSREG_FOLD) { >+ jschar uc = upcase(localMax); >+ jschar dc = downcase(localMax); >+ >+ c = JS_MAX(uc, dc); >+ if (c > localMax) >+ localMax = c; >+ } This has already been done if inRange. Put it inside the else clause. It would be prettier to use code more similar to that in the loop, because it does the same as one iteration of the loop (you could even set rangeStart to localMax and use the loop code directly). There's also the old problem that [\d\D\s\S\w\W] must be treated different than single characters. There have been several solutions hacked in, but the code is still unclean. If we can assume that an up/downcased digit is always a digit, an up/downcased non-digit is always a non-digit etc. (don't know if thats true), then we should set max directly (if appropriate) and continue the while loop without executing any code below the switch. Else we need specialized code. I don't know why that hasn't been done and instead the canStartRange hacked in (can't read bug 375876 and even its check-in comment seems to be faked).
Attachment #315157 -
Flags: review?(x00000000) → review-
Assignee | ||
Comment 47•16 years ago
|
||
Thanks for the outstanding reviews.
Attachment #315157 -
Attachment is obsolete: true
Attachment #315621 -
Flags: review?(x00000000)
Attachment #315157 -
Flags: review?(mrbkap)
Assignee | ||
Updated•16 years ago
|
Attachment #315621 -
Flags: review?(mrbkap)
Comment 48•16 years ago
|
||
Comment on attachment 315621 [details] [diff] [review] revised Looks good now, but the range swapping code in AddCharacterRangeToCharSet shouldn't be needed anymore and could be replaced by an assertion. >+ rangeStart = localMax; /* one run of the uc/dc loop below */ I think the prevailing style is more than one space before the comment, but I may be wrong.
Attachment #315621 -
Flags: review?(x00000000) → review+
Assignee | ||
Comment 49•16 years ago
|
||
put the old assertion back, but without the extra ()s, and added a couple spaces before the comment, per nit (can't go to much further away without violating 79-char limit) (r? shaver, since I think mrbkap is very busy this week, shaver: retarget if you don't want this, please)
Attachment #315621 -
Attachment is obsolete: true
Attachment #316015 -
Flags: review?(shaver)
Attachment #315621 -
Flags: review?(mrbkap)
Assignee | ||
Comment 50•16 years ago
|
||
x00000000: Thanks again, btw, for thoughtful and helpful reviews.
Updated•16 years ago
|
Whiteboard: [needs review shaver]
Comment on attachment 316015 [details] [diff] [review] restoring old assertion r+a=shaver, largely on the strength of reading x0 and crowder's conversation, but this needs a pile of tests, and I worry about our test coverage for detecting regression. Can someone generate or author a bunch of suite tests? We need the regexp-snoop patch like yesterday.
Attachment #316015 -
Flags: review?(shaver)
Attachment #316015 -
Flags: review+
Attachment #316015 -
Flags: approval1.9+
Updated•16 years ago
|
Whiteboard: [needs review shaver] → [needs landing]
Assignee | ||
Comment 52•16 years ago
|
||
jsregexp.c:3.199
Status: REOPENED → RESOLVED
Closed: 16 years ago → 16 years ago
Resolution: --- → FIXED
Comment 53•16 years ago
|
||
This testcase tests if case insensitive ranges have the same bugs (notably bug 428816) as single character classes, except for two known differences: Characters with separate title cases have been partly fixed by this bug (/[\u01c5-\u01c5]/i.test('\u01c5') is now true, but /[\u01c5]/i.test('\u01c5') still false), and /[\0-\x01]/i.test('\0') is true although /[\0]/i.test('\0') is false (bug 429243). Stage 1 of the testcase prints the failures for single characters, and stage 2 prints any unexpected differences from that for character ranges (it tests some random ranges with higher probability for lower code points). Stage 2 shouldn't find any failures; failures in stage 1 are a different bug. The testcase may take a few hours to complete. I haven't run it completely, but the lower code points (|limit| set to 0x2000) show no failure.
Comment 54•16 years ago
|
||
Attachment #317087 -
Attachment is obsolete: true
Comment 55•16 years ago
|
||
As previously noted, the following now fail: ecma_3/RegExp/regress-375711.js: Expected value 'No Error', Actual value 'SyntaxError: invalid range in character class' ecma_3/RegExp/regress-375715-04.js: /ecma_3/RegExp/regress-375715-04.js:56: SyntaxError: invalid range in character class I'll modify them to match the new behavior and add x0's test from comment 54 this evening.
Flags: in-testsuite+ → in-testsuite?
Comment 56•16 years ago
|
||
The previous testcase isn't fast enough for the test suite. This one skips code point blocks with caseless characters only and tests only some random characters. It should take less than a minute and still catch systematic errors (and even that may be overkill for the test suite).
Comment 57•16 years ago
|
||
The previous testcase was quite fast because it didn't do anything in stage 2. This one is slower, but can be made much faster once the adaptions for bug 428816 aren't needed anymore.
Attachment #317150 -
Attachment is obsolete: true
Comment 58•16 years ago
|
||
/cvsroot/mozilla/js/tests/public-failures.txt,v <-- public-failures.txt new revision: 1.69; previous revision: 1.68 /cvsroot/mozilla/js/tests/ecma_3/RegExp/regress-375711.js,v <-- regress-375711.js new revision: 1.3; previous revision: 1.2 /cvsroot/mozilla/js/tests/ecma_3/RegExp/regress-375715-04.js,v <-- regress-375715-04.js new revision: 1.3; previous revision: 1.2 I tried to adapt the test in comment 57 but it is way too slow to be run normally. x0, how much faster do you think it will run when the adaptations are removed? The random parts of the test have me a bit concerned in that it is important for the failures to be consistent across runs. Also, can you put the test into the format used by the other tests using reportCompare to report failures? Finally, I see you use the pattern a= rhs. It would be nice if you could instead use a = rhs. Thanks.
Comment 59•16 years ago
|
||
for ecma_3/RegExp/regress-375711.js, ecma_3/RegExp/regress-375715-04.js
Flags: in-testsuite? → in-testsuite+
Comment 60•16 years ago
|
||
v 1.9.0 for ecma_3/RegExp/regress-375711.js, ecma_3/RegExp/regress-375715-04.js
Status: RESOLVED → VERIFIED
Reporter | ||
Comment 61•16 years ago
|
||
Could this fix be related to Bug 440926?
Whiteboard: [needs landing]
You need to log in
before you can comment on or make changes to this bug.
Description
•