Closed
Bug 502789
Opened 16 years ago
Closed 16 years ago
JIT: Case insensitive regular expressions don't work for international characters
Categories
(Core :: JavaScript Engine, defect)
Tracking
()
RESOLVED
FIXED
People
(Reporter: demisone, Assigned: dmandelin)
References
Details
(Keywords: regression, Whiteboard: regression from 3.0)
Attachments
(2 files, 1 obsolete file)
13.21 KB,
patch
|
luke
:
review+
|
Details | Diff | Splinter Review |
5.51 KB,
application/javascript
|
Details |
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.9.1) Gecko/20090624 Firefox/3.5 (.NET CLR 3.5.30729)
Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.9.1) Gecko/20090624 Firefox/3.5 (.NET CLR 3.5.30729)
If you try to do a "Φ".search("φ","i") it returns -1. If Firefox 3.0 this worked correctly and returned 0.
I don't know if this is also true for other non-latin characters but at least for greek it is.
Reproducible: Always
Steps to Reproduce:
1. "Φ".search("φ","i");
("φ" is the lower-case "Φ")
Actual Results:
got -1
Reporter | ||
Updated•16 years ago
|
Version: unspecified → 3.5 Branch
Comment 1•16 years ago
|
||
This also happens in the Spanish language. Also I checked that the bug is from upper case to lower case.
>>> var a = "Ñ";
>>> var b="ñ";
>>> a.search(b)
-1
>>> b.search(a,"i")
0
Comment 2•16 years ago
|
||
Sorry for my last comment, the third line is wrong. It should be:
>>> a.search(b,"i")
-1
Comment 3•16 years ago
|
||
javascript:alert("Φ".search("φ","i"))
jit true: -1
jit false: 0
Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.9.2a1pre) Gecko/20090706 Minefield/3.6a1pre ID:20090706042748
Comment 4•16 years ago
|
||
Possibly bug 428816
Assignee: nobody → general
Component: General → JavaScript Engine
Product: Firefox → Core
QA Contact: general → general
Summary: Case incensitive regular expressions don't work for greek characters → JIT: Case incensitive regular expressions don't work for greek characters
Version: 3.5 Branch → 1.9.1 Branch
Comment 5•16 years ago
|
||
On Firefox 3.5 (nl_NL) on Mac OS X 10.5.7 (nl_NL):
javascript:alert("Φ".search("φ","i"))
gives:
-1
Assignee | ||
Comment 6•16 years ago
|
||
(In reply to comment #3)
> javascript:alert("Φ".search("φ","i"))
>
> jit true: -1
> jit false: 0
I also get this result in my nightly. Strangely, I get -1 whether the JIT is on or not in a shell build from tip. FWIW, Nitro and V8 give the same result as SpiderMonkey in shell builds. I will look into this further.
Assignee: general → dmandelin
Status: UNCONFIRMED → NEW
Ever confirmed: true
Updated•16 years ago
|
Flags: wanted1.9.1.x?
Flags: blocking1.9.1.1?
Summary: JIT: Case incensitive regular expressions don't work for greek characters → JIT: Case insensitive regular expressions don't work for international characters
Assignee | ||
Comment 7•16 years ago
|
||
The shell/browser difference probably has something to do with the way non-ASCII chars are handled by the terminal. The test case in the patch uses Unicode escapes so it can be reliably executed in any environment. The FF3 behavior is correct, and this patch fixes the regexp JIT.
Attachment #387320 -
Flags: review?(lw)
Updated•16 years ago
|
Attachment #387320 -
Flags: review?(lw) → review+
Comment 8•16 years ago
|
||
Comment on attachment 387320 [details] [diff] [review]
Patch
Makes sense.
Perhaps the condition in hasCases could be rewritten:
return JS_TOLOWER(ch) != JS_TOUPPER(ch);
?
Assignee | ||
Comment 9•16 years ago
|
||
The patch seems to work, but with it, the jit still fails this test case:
'\u03d5'.search('\u03a6', 'i');
due to a closely related issue. I need to think about this a bit more. ECMA-262 says the correct way to do a case-insensitive regexp character match is to convert both characters to upper-case and then compare. This isn't how Unicode does it [1], but apparently ECMA-262 is only slightly Unicode-aware. I want to think up an algorithm for doing it the ECMA-262 way without needing to apply case mappings to the text string during matching, if possible.
[1] Unicode specifies doing case-folding by mapping each character to its equivalence class in the equivalence relation where x R y iff x is a lower-, upper- or title-case version of y or vice versa.
Assignee | ||
Comment 10•16 years ago
|
||
I found there are 17 characters with multiple lower-case versions. (Formally, 17 characters |u| such that more than one character maps to |u| as its upper-case.) They are listed at the end of the comment.
Given the small number of sets, it seems reasonable to special-case them and generate 3 or 4 tests as needed.
['i', 'ı']
['s', 'ſ']
['Dž', 'dž']
['Lj', 'lj']
['Nj', 'nj']
['Dz', 'dz']
['β', 'ϐ']
['ε', 'ϵ']
['θ', 'ϑ']
['ͅ', 'ι', 'ι']
['κ', 'ϰ']
['µ', 'μ']
['π', 'ϖ']
['ρ', 'ϱ']
['ς', 'σ']
['φ', 'ϕ']
['ṡ', 'ẛ']
Assignee | ||
Comment 11•16 years ago
|
||
This patch correctly handles all the funky extra lower-case forms. Btw, the non-jit regexp engine doesn't pass the included test case.
Attachment #387320 -
Attachment is obsolete: true
Attachment #387791 -
Flags: review?(brendan)
Assignee | ||
Updated•16 years ago
|
Attachment #387791 -
Flags: review?(brendan) → review?(lw)
Comment 12•16 years ago
|
||
First of all, nice job unravelling the mysteries of Unicode. The logic in the patch looks good. My only complaint, and maybe you were going to do this anyways, is that after all the story behind this patch, the comment explaining the resulting code is a bit terse. Perhaps just a bug number?
So I was playing around with the regression test you submitted and I made it test each character in an equivalence class against each other (attached). Surprisingly, this exposed a bunch of cases of asymmetric matches. As you can see in the output of running the attachment, the source of the problem is the toUpperCase function, specifically the JS_TOUPPER macro. I checked on a few code points, and the JS_TOUPPER is definitely diverging from the Unicode spec.
Updated•16 years ago
|
Attachment #387791 -
Flags: review?(lw) → review+
Comment 13•16 years ago
|
||
Comment on attachment 387791 [details] [diff] [review]
Patch 2
(Existing) JS_UPPERCASE bug notwithstanding, the code in this patch looks good.
Assignee | ||
Comment 14•16 years ago
|
||
Pushed to TM as 274d57daf19c.
Comment 15•16 years ago
|
||
Status: NEW → RESOLVED
Closed: 16 years ago
Resolution: --- → FIXED
Updated•16 years ago
|
blocking1.9.1: --- → needed
status1.9.1:
--- → wanted
Flags: wanted1.9.1.x?
Flags: wanted1.9.1.x+
Flags: blocking1.9.1.1?
Flags: blocking1.9.1.1-
Comment 17•15 years ago
|
||
Is this still needed on the 1.9.1 branch?
Comment 18•15 years ago
|
||
not going to block on a 1.9.1 release, but if you still intend to fix this in 3.5.x please request approval for that branch on an applicable patch.
You need to log in
before you can comment on or make changes to this bug.
Description
•