Closed
Bug 428816
Opened 17 years ago
Closed 8 years ago
Case insensitive RegExps compare wrong for some international characters in character classes
Categories
(Core :: JavaScript Engine, defect)
Core
JavaScript Engine
Tracking
()
RESOLVED
WORKSFORME
People
(Reporter: x00000000, Assigned: x00000000)
References
()
Details
Case insensitiv character classes work by adding their members up- and downcased to a bitmap and then comparing the input string with it, without additional consideration of case.
This method assumes
(1) upcase(downcase(ch)) == upcase(ch)
(2) downcase(upcase(ch)) == downcase(ch)
(3) ch == upcase(ch) || ch == downcase(ch)
But that isn't true for some international characters.
With the currently used Unicode tables in jsstr.c, (1) is violated for U+0130 (İ, Turkish I with dot above) and the old Georgian capital letters U+10A0 .. U+10C5 (not an issue with the current Unicode version, but there are new ones instead). This causes unwanted matches like /[\u0130]/i.exec("i"), while /\u0130/i.exec("i") correctly does not match.
(2) is violated by the variants of the small Greek letters beta, theta, phi, pi, kappa and rho. In the current Unicode version there are more (e.g. final sigma and long s). This causes missed matches; e.g. /[\u03d1]/i.exec("\u03b8") (comparing both small theta variants) fails, whereas /\u03d1/i.exec("\u03b8") succeeds.
(3) is violated by Dž, Lj, Nj and Dz. This is worst because it causes even /[\u01cb]/i.exec("\u01cb") to fail (unlike /[\u01cb]/.exec("\u01cb") or /\u01cb/i.exec("\u01cb")), but easiest to fix (need just to add the character itself to the bitmap, as the patch in bug 315157 does for ranges).
(1) and (2) are hard to fix if performance should not degrade. Basically additional code tables for a modified TOUPPER/TOLOWER are needed, or at least a flag for characters that need special treatment (plus code for that).
Inverted ranges have additional implications, but it's hard to say what correct behavior is in that case, because the standard seems to be flawed: As I understand it, /[^Xx]/i.exec("x") should succeed, which is somewhat counterintuitive.
(In reply to comment #0)
> Inverted ranges have additional implications, but it's hard to say what correct
> behavior is in that case, because the standard seems to be flawed
I misunterstood the spec.
I'm working on this bug. The fix involves major changes to the Unicode tables if it should not result in bad performance, so I'm going to upgrade them to the latest Unicode version in bug 394604 first.
Note that attachment 317093 [details] (see bug 416933 comment 53) is also usable for this bug.
Status: NEW → ASSIGNED
Depends on: 394604
Comment 3•16 years ago
|
||
x0, is this going to be fixed by bug 428816?
Comment 4•16 years ago
|
||
Also for x0: are you still working on this? It's closely related to bug 502789, which I am starting work on. If you're not working on it, I'll probably want to knock it off while I'm thinking about the topic.
Comment 5•8 years ago
|
||
No longer reproducible in current Nightly, probably fixed by bug 1280046. Resolving as WFM.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → WORKSFORME
You need to log in
before you can comment on or make changes to this bug.
Description
•