Closed Bug 428816 Opened 16 years ago Closed 8 years ago

Case insensitive RegExps compare wrong for some international characters in character classes

Status:

RESOLVED WORKSFORME

People

(Reporter: x00000000, Assigned: x00000000)

Assignee

Description

•

16 years ago

Case-insensitive character classes work by adding each member, upcased and downcased, to a bitmap and then comparing the input string against it, without any further consideration of case.

This method assumes
(1) upcase(downcase(ch)) == upcase(ch)
(2) downcase(upcase(ch)) == downcase(ch)
(3) ch == upcase(ch) || ch == downcase(ch)

But that isn't true for some international characters.
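A minimal sketch of checking the three assumptions for a single character. The `UPPER` and `LOWER` maps are a hand-picked, hypothetical excerpt of simple one-to-one case mappings (not the actual tables in jsstr.c), and `violatedAssumptions` is an illustrative helper name:

```javascript
// Hypothetical excerpt of simple (one-to-one) case mappings.
// Code points that are not listed map to themselves.
const UPPER = new Map([
  [0x0069, 0x0049], // i -> I
  [0x03b8, 0x0398], // small theta -> capital theta
  [0x03d1, 0x0398], // theta symbol -> capital theta
  [0x01cb, 0x01ca], // Nj (titlecase) -> NJ
  [0x01cc, 0x01ca], // nj -> NJ
]);
const LOWER = new Map([
  [0x0049, 0x0069], // I -> i
  [0x0130, 0x0069], // I with dot above -> i
  [0x0398, 0x03b8], // capital theta -> small theta
  [0x01ca, 0x01cc], // NJ -> nj
  [0x01cb, 0x01cc], // Nj (titlecase) -> nj
]);
const upcase = (ch) => (UPPER.has(ch) ? UPPER.get(ch) : ch);
const downcase = (ch) => (LOWER.has(ch) ? LOWER.get(ch) : ch);

// Returns the numbers of the assumptions (1..3) that ch violates.
function violatedAssumptions(ch) {
  const broken = [];
  if (upcase(downcase(ch)) !== upcase(ch)) broken.push(1);
  if (downcase(upcase(ch)) !== downcase(ch)) broken.push(2);
  if (ch !== upcase(ch) && ch !== downcase(ch)) broken.push(3);
  return broken;
}

console.log(violatedAssumptions(0x0130)); // [ 1 ]
console.log(violatedAssumptions(0x03d1)); // [ 2 ]
console.log(violatedAssumptions(0x01cb)); // [ 3 ]
console.log(violatedAssumptions(0x0069)); // []
```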

With the currently used Unicode tables in jsstr.c, (1) is violated for U+0130 (İ, Turkish I with dot above) and the old Georgian capital letters U+10A0 .. U+10C5 (not an issue with the current Unicode version, but there are new ones instead). This causes unwanted matches like /[\u0130]/i.exec("i"), while /\u0130/i.exec("i") correctly does not match.

(2) is violated by the variants of the small Greek letters beta, theta, phi, pi, kappa and rho. In the current Unicode version there are more (e.g. final sigma and long s). This causes missed matches; e.g. /[\u03d1]/i.exec("\u03b8") (comparing both small theta variants) fails, whereas /\u03d1/i.exec("\u03b8") succeeds.

(3) is violated by ǅ, ǈ, ǋ and ǲ. This is the worst case because it causes even /[\u01cb]/i.exec("\u01cb") to fail (unlike /[\u01cb]/.exec("\u01cb") or /\u01cb/i.exec("\u01cb")), but it is also the easiest to fix: the character itself just needs to be added to the bitmap, as the patch in bug 315157 does for ranges.
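The three failure modes can be modelled with a small sketch of the bitmap approach. This is not the engine code: a `Set` stands in for the bitmap, the case tables are a hypothetical excerpt of simple mappings, and `buildClassBitmap`, `classMatches` and the `addSelf` flag are illustrative names. The `addSelf` flag models the easy fix for (3).

```javascript
// Hypothetical excerpt of simple case mappings; unlisted code points map to themselves.
const UPPER = new Map([[0x69, 0x49], [0x3b8, 0x398], [0x3d1, 0x398], [0x1cb, 0x1ca], [0x1cc, 0x1ca]]);
const LOWER = new Map([[0x49, 0x69], [0x130, 0x69], [0x398, 0x3b8], [0x1ca, 0x1cc], [0x1cb, 0x1cc]]);
const upcase = (ch) => (UPPER.has(ch) ? UPPER.get(ch) : ch);
const downcase = (ch) => (LOWER.has(ch) ? LOWER.get(ch) : ch);

// Each class member is added upcased and downcased; with addSelf, the member
// itself is added too.
function buildClassBitmap(members, addSelf = false) {
  const bitmap = new Set();
  for (const ch of members) {
    bitmap.add(upcase(ch));
    bitmap.add(downcase(ch));
    if (addSelf) bitmap.add(ch);
  }
  return bitmap;
}

// The input character is looked up as-is, with no case conversion.
function classMatches(members, input, addSelf = false) {
  return buildClassBitmap(members, addSelf).has(input);
}

console.log(classMatches([0x130], 0x69));        // true: unwanted match, from (1)
console.log(classMatches([0x3d1], 0x3b8));       // false: missed match, from (2)
console.log(classMatches([0x1cb], 0x1cb));       // false: missed match, from (3)
console.log(classMatches([0x1cb], 0x1cb, true)); // true: fixed by adding the member itself
```

Adding the member itself only repairs (3); the first two results are unchanged with `addSelf` set.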

(1) and (2) are hard to fix without degrading performance. Basically, additional code tables for a modified TOUPPER/TOLOWER are needed, or at least a flag for characters that need special treatment (plus code to handle them).
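One possible shape for such a fix, sketched under assumptions and not taken from any patch: map every character to a single canonical representative, store canonical forms in the bitmap, and canonicalize the input before the lookup. Here `canonicalize` is a hypothetical helper that uses simple uppercasing, similar in spirit to the Canonicalize step in the ECMAScript specification; the table is again a hand-picked excerpt.

```javascript
// Hypothetical excerpt of simple uppercase mappings; unlisted code points map to themselves.
const UPPER = new Map([[0x69, 0x49], [0x3b8, 0x398], [0x3d1, 0x398], [0x1cb, 0x1ca], [0x1cc, 0x1ca]]);

// One canonical representative per case-equivalence group.
const canonicalize = (ch) => (UPPER.has(ch) ? UPPER.get(ch) : ch);

// The bitmap stores canonical forms only.
function buildCanonicalBitmap(members) {
  const bitmap = new Set();
  for (const ch of members) bitmap.add(canonicalize(ch));
  return bitmap;
}

// The input is canonicalized too; this is the extra per-character table
// lookup that raises the performance concern.
function classMatches(members, input) {
  return buildCanonicalBitmap(members).has(canonicalize(input));
}

console.log(classMatches([0x130], 0x69));  // false: no unwanted match
console.log(classMatches([0x3d1], 0x3b8)); // true: both theta variants match
console.log(classMatches([0x1cb], 0x1cb)); // true: character matches itself
console.log(classMatches([0x69], 0x49));   // true: ordinary i/I still match
```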

Inverted ranges have additional implications, but it's hard to say what correct behavior is in that case, because the standard seems to be flawed: As I understand it, /[^Xx]/i.exec("x") should succeed, which is somewhat counterintuitive.

Assignee

Comment 1

•

16 years ago

I wanted to cite bug 416933 and its attachment 315157.

Assignee

Comment 2

•

16 years ago

(In reply to comment #0)
> Inverted ranges have additional implications, but it's hard to say what correct
> behavior is in that case, because the standard seems to be flawed

I misunderstood the spec.

I'm working on this bug. The fix involves major changes to the Unicode tables if it should not result in bad performance, so I'm going to upgrade them to the latest Unicode version in bug 394604 first.

Note that attachment 317093 (see bug 416933 comment 53) is also usable for this bug.

Status: NEW → ASSIGNED

Depends on: 394604

Assignee

Updated

•

16 years ago

Assignee: general → x00000000

Status: ASSIGNED → NEW

Nochum Sossonko [:Natch]

Comment 3

•

15 years ago

x0, is this going to be fixed by bug 428816?

David Mandelin [:dmandelin]

Comment 4

•

15 years ago

Also for x0: are you still working on this? It's closely related to bug 502789, which I am starting work on. If you're not working on it, I'll probably want to knock it off while I'm thinking about the topic.

André Bargull [:anba]

Comment 5

•

8 years ago

No longer reproducible in current Nightly, probably fixed by bug 1280046. Resolving as WFM.
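A check along the following lines could be used to confirm this in a given engine. The regular expressions are the ones from the description; the `shouldMatch` values follow the description's stated expectations, and the inverted-class case assumes the reading of the spec in which /[^Xx]/i does not match "x". The `cases` and `failures` names are illustrative only.

```javascript
// Test cases taken from the bug description.
const cases = [
  { re: /[\u0130]/i, input: "i", shouldMatch: false },
  { re: /\u0130/i, input: "i", shouldMatch: false },
  { re: /[\u03d1]/i, input: "\u03b8", shouldMatch: true },
  { re: /\u03d1/i, input: "\u03b8", shouldMatch: true },
  { re: /[\u01cb]/i, input: "\u01cb", shouldMatch: true },
  { re: /[\u01cb]/, input: "\u01cb", shouldMatch: true },
  { re: /\u01cb/i, input: "\u01cb", shouldMatch: true },
  // Assumption: an inverted class excludes all case variants of its members.
  { re: /[^Xx]/i, input: "x", shouldMatch: false },
];

// Returns the source text of every regular expression that behaves unexpectedly.
function failures() {
  return cases
    .filter((c) => (c.re.exec(c.input) !== null) !== c.shouldMatch)
    .map((c) => String(c.re));
}

console.log(failures()); // an empty array means the bug does not reproduce
```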

Status: NEW → RESOLVED

Closed: 8 years ago

Resolution: --- → WORKSFORME

Categories

(Core :: JavaScript Engine, defect)
