Closed Bug 502789 Opened 14 years ago Closed 14 years ago
JIT: Case insensitive regular expressions don't work for international characters
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.9.1) Gecko/20090624 Firefox/3.5 (.NET CLR 3.5.30729) Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.9.1) Gecko/20090624 Firefox/3.5 (.NET CLR 3.5.30729) If you try to do a "Φ".search("φ","i") it returns -1. If Firefox 3.0 this worked correctly and returned 0. I don't know if this is also true for other non-latin characters but at least for greek it is. Reproducible: Always Steps to Reproduce: 1. "Φ".search("φ","i"); ("φ" is the lower-case "Φ") Actual Results: got -1
This also happens in the Spanish language. Also I checked that the bug is from upper case to lower case. >>> var a = "Ñ"; >>> var b="ñ"; >>> a.search(b) -1 >>> b.search(a,"i") 0
Sorry for my last comment, the third line is wrong. It should be: >>> a.search(b,"i") -1
Possibly bug 428816
Assignee: nobody → general
Product: Firefox → Core
QA Contact: general → general
Summary: Case incensitive regular expressions don't work for greek characters → JIT: Case incensitive regular expressions don't work for greek characters
Version: 3.5 Branch → 1.9.1 Branch
Assignee: general → dmandelin
Status: UNCONFIRMED → NEW
Ever confirmed: true
Summary: JIT: Case incensitive regular expressions don't work for greek characters → JIT: Case insensitive regular expressions don't work for international characters
The shell/browser difference probably has something to do with the way non-ASCII chars are handled by the terminal. The test case in the patch uses Unicode escapes so it can be reliably executed in any environment. The FF3 behavior is correct, and this patch fixes the regexp JIT.
Attachment #387320 - Flags: review?(lw)
Comment on attachment 387320 [details] [diff] [review] Patch Makes sense. Perhaps the condition in hasCases could be rewritten: return JS_TOLOWER(ch) != JS_TOUPPER(ch); ?
The patch seems to work, but with it, the jit still fails this test case: '\u03d5'.search('\u03a6', 'i'); due to a closely related issue. I need to think about this a bit more. ECMA-262 says the correct way to do a case-insensitive regexp character match is to convert both characters to upper-case and then compare. This isn't how Unicode does it , but apparently ECMA-262 is only slightly Unicode-aware. I want to think up an algorithm for doing it the ECMA-262 way without needing to apply case mappings to the text string during matching, if possible.  Unicode specifies doing case-folding by mapping each character to its equivalence class in the equivalence relation where x R y iff x is a lower-, upper- or title-case version of y or vice versa.
I found there are 17 characters with multiple lower-case versions. (Formally, 17 characters |u| such that more than one character maps to |u| as its upper-case.) They are listed at the end of the comment. Given the small number of sets, it seems reasonable to special-case them and generate 3 or 4 tests as needed. ['i', 'ı'] ['s', 'ſ'] ['ǅ', 'ǆ'] ['ǈ', 'ǉ'] ['ǋ', 'ǌ'] ['ǲ', 'ǳ'] ['β', 'ϐ'] ['ε', 'ϵ'] ['θ', 'ϑ'] ['ͅ', 'ι', 'ι'] ['κ', 'ϰ'] ['µ', 'μ'] ['π', 'ϖ'] ['ρ', 'ϱ'] ['ς', 'σ'] ['φ', 'ϕ'] ['ṡ', 'ẛ']
This patch correctly handles all the funky extra lower-case forms. Btw, the non-jit regexp engine doesn't pass the included test case.
Attachment #387791 - Flags: review?(brendan) → review?(lw)
First of all, nice job unravelling the mysteries of Unicode. The logic in the patch looks good. My only complaint, and maybe you were going to do this anyways, is that after all the story behind this patch, the comment explaining the resulting code is a bit terse. Perhaps just a bug number? So I was playing around with the regression test you submitted and I made it test each character in an equivalence class against each other (attached). Surprisingly, this exposed a bunch of cases of asymmetric matches. As you can see in the output of running the attachment, the source of the problem is the toUpperCase function, specifically the JS_TOUPPER macro. I checked on a few code points, and the JS_TOUPPER is definitely diverging from the Unicode spec.
Comment on attachment 387791 [details] [diff] [review] Patch 2 (Existing) JS_UPPERCASE bug notwithstanding, the code in this patch looks good.
Pushed to TM as 274d57daf19c.
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Is this still needed on the 1.9.1 branch?
not going to block on a 1.9.1 release, but if you still intend to fix this in 3.5.x please request approval for that branch on an applicable patch.
You need to log in before you can comment on or make changes to this bug.