Closed Bug 502789 Opened 14 years ago Closed 14 years ago

JIT: Case insensitive regular expressions don't work for international characters

Categories

(Core :: JavaScript Engine, defect)

1.9.1 Branch
x86
Windows XP
defect
Not set
normal

Tracking

()

RESOLVED FIXED
Tracking Status
blocking1.9.1 --- -
status1.9.1 --- ?

People

(Reporter: demisone, Assigned: dmandelin)

References

Details

(Keywords: regression, Whiteboard: regression from 3.0)

Attachments

(2 files, 1 obsolete file)

User-Agent:       Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.9.1) Gecko/20090624 Firefox/3.5 (.NET CLR 3.5.30729)
Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.9.1) Gecko/20090624 Firefox/3.5 (.NET CLR 3.5.30729)

If you try to do a "Φ".search("φ","i") it returns -1. If Firefox 3.0 this worked correctly and returned 0.

I don't know if this is also true for other non-latin characters but at least for greek it is.

Reproducible: Always

Steps to Reproduce:
1. "Φ".search("φ","i");

("φ" is the lower-case "Φ")
Actual Results:  
got -1
Version: unspecified → 3.5 Branch
This also happens in the Spanish language. Also I checked that the bug is from upper case to lower case.

>>> var a = "Ñ";
>>> var b="ñ";
>>> a.search(b)
-1
>>> b.search(a,"i")
0
Sorry for my last comment, the third line is wrong. It should be:

>>> a.search(b,"i")
-1
javascript:alert("Φ".search("φ","i"))

jit true: -1
jit false: 0

Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.9.2a1pre) Gecko/20090706 Minefield/3.6a1pre ID:20090706042748
Possibly bug 428816
Assignee: nobody → general
Component: General → JavaScript Engine
Product: Firefox → Core
QA Contact: general → general
Summary: Case incensitive regular expressions don't work for greek characters → JIT: Case incensitive regular expressions don't work for greek characters
Version: 3.5 Branch → 1.9.1 Branch
On Firefox 3.5 (nl_NL) on Mac OS X 10.5.7 (nl_NL):

   javascript:alert("Φ".search("φ","i"))

gives:

   -1
(In reply to comment #3)
> javascript:alert("Φ".search("φ","i"))
> 
> jit true: -1
> jit false: 0

I also get this result in my nightly. Strangely, I get -1 whether the JIT is on or not in a shell build from tip. FWIW, Nitro and V8 give the same result as SpiderMonkey in shell builds. I will look into this further.
Assignee: general → dmandelin
Status: UNCONFIRMED → NEW
Ever confirmed: true
Flags: wanted1.9.1.x?
Flags: blocking1.9.1.1?
Summary: JIT: Case incensitive regular expressions don't work for greek characters → JIT: Case insensitive regular expressions don't work for international characters
Attached patch Patch (obsolete) — Splinter Review
The shell/browser difference probably has something to do with the way non-ASCII chars are handled by the terminal. The test case in the patch uses Unicode escapes so it can be reliably executed in any environment. The FF3 behavior is correct, and this patch fixes the regexp JIT.
Attachment #387320 - Flags: review?(lw)
Attachment #387320 - Flags: review?(lw) → review+
Comment on attachment 387320 [details] [diff] [review]
Patch

Makes sense.

Perhaps the condition in hasCases could be rewritten:

return JS_TOLOWER(ch) != JS_TOUPPER(ch);

?
The patch seems to work, but with it, the jit still fails this test case:

    '\u03d5'.search('\u03a6', 'i');

due to a closely related issue. I need to think about this a bit more. ECMA-262 says the correct way to do a case-insensitive regexp character match is to convert both characters to upper-case and then compare. This isn't how Unicode does it [1], but apparently ECMA-262 is only slightly Unicode-aware. I want to think up an algorithm for doing it the ECMA-262 way without needing to apply case mappings to the text string during matching, if possible.

[1] Unicode specifies doing case-folding by mapping each character to its equivalence class in the equivalence relation where x R y iff x is a lower-, upper- or title-case version of y or vice versa.
I found there are 17 characters with multiple lower-case versions. (Formally, 17 characters |u| such that more than one character maps to |u| as its upper-case.) They are listed at the end of the comment.

Given the small number of sets, it seems reasonable to special-case them and generate 3 or 4 tests as needed.

['i', 'ı']
['s', 'ſ']
['Dž', 'dž']
['Lj', 'lj']
['Nj', 'nj']
['Dz', 'dz']
['β', 'ϐ']
['ε', 'ϵ']
['θ', 'ϑ']
['ͅ', 'ι', 'ι']
['κ', 'ϰ']
['µ', 'μ']
['π', 'ϖ']
['ρ', 'ϱ']
['ς', 'σ']
['φ', 'ϕ']
['ṡ', 'ẛ']
Attached patch Patch 2Splinter Review
This patch correctly handles all the funky extra lower-case forms. Btw, the non-jit regexp engine doesn't pass the included test case.
Attachment #387320 - Attachment is obsolete: true
Attachment #387791 - Flags: review?(brendan)
Attachment #387791 - Flags: review?(brendan) → review?(lw)
First of all, nice job unravelling the mysteries of Unicode.  The logic in the patch looks good.  My only complaint, and maybe you were going to do this anyways, is that after all the story behind this patch, the comment explaining the resulting code is a bit terse.  Perhaps just a bug number?

So I was playing around with the regression test you submitted and I made it test each character in an equivalence class against each other (attached).  Surprisingly, this exposed a bunch of cases of asymmetric matches.  As you can see in the output of running the attachment, the source of the problem is the toUpperCase function, specifically the JS_TOUPPER macro.  I checked on a few code points, and the JS_TOUPPER is definitely diverging from the Unicode spec.
Attachment #387791 - Flags: review?(lw) → review+
Comment on attachment 387791 [details] [diff] [review]
Patch 2

(Existing) JS_UPPERCASE bug notwithstanding, the code in this patch looks good.
Pushed to TM as 274d57daf19c.
http://hg.mozilla.org/mozilla-central/rev/274d57daf19c
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
blocking1.9.1: --- → needed
Flags: wanted1.9.1.x?
Flags: wanted1.9.1.x+
Flags: blocking1.9.1.1?
Flags: blocking1.9.1.1-
Is this still needed on the 1.9.1 branch?
not going to block on a 1.9.1 release, but if you still intend to fix this in 3.5.x please request approval for that branch on an applicable patch.
blocking1.9.1: needed → -
Keywords: regression
Whiteboard: regression from 3.0
You need to log in before you can comment on or make changes to this bug.