Closed Bug 428816 Opened 16 years ago Closed 8 years ago

Case insensitive RegExps compare wrong for some international characters in character classes

Categories

(Core :: JavaScript Engine, defect)

defect
Not set
normal

Tracking

()

RESOLVED WORKSFORME

People

(Reporter: x00000000, Assigned: x00000000)

References

()

Details

Case insensitiv character classes work by adding their members up- and downcased to a bitmap and then comparing the input string with it, without additional consideration of case.

This method assumes
(1) upcase(downcase(ch)) == upcase(ch)
(2) downcase(upcase(ch)) == downcase(ch)
(3) ch == upcase(ch) || ch == downcase(ch)

But that isn't true for some international characters.

With the currently used Unicode tables in jsstr.c, (1) is violated for U+0130 (İ, Turkish I with dot above) and the old Georgian capital letters U+10A0 .. U+10C5 (not an issue with the current Unicode version, but there are new ones instead). This causes unwanted matches like /[\u0130]/i.exec("i"), while /\u0130/i.exec("i") correctly does not match.

(2) is violated by the variants of the small Greek letters beta, theta, phi, pi, kappa and rho. In the current Unicode version there are more (e.g. final sigma and long s). This causes missed matches; e.g. /[\u03d1]/i.exec("\u03b8") (comparing both small theta variants) fails, whereas /\u03d1/i.exec("\u03b8") succeeds.

(3) is violated by Dž, Lj, Nj and Dz. This is worst because it causes even /[\u01cb]/i.exec("\u01cb") to fail (unlike /[\u01cb]/.exec("\u01cb") or /\u01cb/i.exec("\u01cb")), but easiest to fix (need just to add the character itself to the bitmap, as the patch in bug 315157 does for ranges).

(1) and (2) are hard to fix if performance should not degrade. Basically additional code tables for a modified TOUPPER/TOLOWER are needed, or at least a flag for characters that need special treatment (plus code for that).

Inverted ranges have additional implications, but it's hard to say what correct behavior is in that case, because the standard seems to be flawed: As I understand it, /[^Xx]/i.exec("x") should succeed, which is somewhat counterintuitive.
I wanted to cite bug 416933 and its attachment 315157 [details] [diff] [review].
(In reply to comment #0)
> Inverted ranges have additional implications, but it's hard to say what correct
> behavior is in that case, because the standard seems to be flawed

I misunterstood the spec.

I'm working on this bug. The fix involves major changes to the Unicode tables if it should not result in bad performance, so I'm going to upgrade them to the latest Unicode version in bug 394604 first.

Note that attachment 317093 [details] (see bug 416933 comment 53) is also usable for this bug.
Status: NEW → ASSIGNED
Depends on: 394604
Assignee: general → x00000000
Status: ASSIGNED → NEW
x0, is this going to be fixed by bug 428816?
Also for x0: are you still working on this? It's closely related to bug 502789, which I am starting work on. If you're not working on it, I'll probably want to knock it off while I'm thinking about the topic.
No longer reproducible in current Nightly, probably fixed by bug 1280046. Resolving as WFM.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → WORKSFORME
You need to log in before you can comment on or make changes to this bug.