Closed Bug 502789 Opened 16 years ago Closed 16 years ago

JIT: Case insensitive regular expressions don't work for international characters

Tracking

()

Status:

RESOLVED FIXED

Tracking Flags:

Tracking

Status

blocking1.9.1

---

status1.9.1

---

People

(Reporter: demisone, Assigned: dmandelin)

References

Details

(Keywords: regression, Whiteboard: regression from 3.0)

Attachments

(2 files, 1 obsolete file)

Patch 16 years ago David Mandelin [:dmandelin] 4.15 KB, patch	luke : review+	Details \| Diff \| Splinter Review
Patch 2 16 years ago David Mandelin [:dmandelin] 13.21 KB, patch	luke : review+	Details \| Diff \| Splinter Review
expanded and noisy test case 16 years ago Luke Wagner [:luke] 5.51 KB, application/javascript		Details

Nikos Dimitrakopoulos

Reporter

Description

•

16 years ago

User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.9.1) Gecko/20090624 Firefox/3.5 (.NET CLR 3.5.30729) Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-GB; rv:1.9.1) Gecko/20090624 Firefox/3.5 (.NET CLR 3.5.30729) If you try to do a "Φ".search("φ","i") it returns -1. If Firefox 3.0 this worked correctly and returned 0. I don't know if this is also true for other non-latin characters but at least for greek it is. Reproducible: Always Steps to Reproduce: 1. "Φ".search("φ","i"); ("φ" is the lower-case "Φ") Actual Results: got -1

Nikos Dimitrakopoulos

Reporter

Updated

•

16 years ago

Version: unspecified → 3.5 Branch

Javier Gonel

Comment 1

•

16 years ago

This also happens in the Spanish language. Also I checked that the bug is from upper case to lower case. >>> var a = "Ñ"; >>> var b="ñ"; >>> a.search(b) -1 >>> b.search(a,"i") 0

Javier Gonel

Comment 2

•

16 years ago

Sorry for my last comment, the third line is wrong. It should be: >>> a.search(b,"i") -1

John P Baker

Comment 3

•

16 years ago

javascript:alert("Φ".search("φ","i")) jit true: -1 jit false: 0 Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.9.2a1pre) Gecko/20090706 Minefield/3.6a1pre ID:20090706042748

John P Baker

Comment 4

•

16 years ago

Possibly bug 428816

u88484

Updated

•

16 years ago

Assignee: nobody → general

Component: General → JavaScript Engine

Product: Firefox → Core

QA Contact: general → general

Summary: Case incensitive regular expressions don't work for greek characters → JIT: Case incensitive regular expressions don't work for greek characters

Version: 3.5 Branch → 1.9.1 Branch

Ernst de Haan

Comment 5

•

16 years ago

On Firefox 3.5 (nl_NL) on Mac OS X 10.5.7 (nl_NL): javascript:alert("Φ".search("φ","i")) gives: -1

David Mandelin [:dmandelin]

Assignee

Comment 6

•

16 years ago

(In reply to comment #3) > javascript:alert("Φ".search("φ","i")) > > jit true: -1 > jit false: 0 I also get this result in my nightly. Strangely, I get -1 whether the JIT is on or not in a shell build from tip. FWIW, Nitro and V8 give the same result as SpiderMonkey in shell builds. I will look into this further.

Assignee: general → dmandelin

Status: UNCONFIRMED → NEW

Ever confirmed: true

Nochum Sossonko [:Natch]

Updated

•

16 years ago

Flags: wanted1.9.1.x?

Flags: blocking1.9.1.1?

Summary: JIT: Case incensitive regular expressions don't work for greek characters → JIT: Case insensitive regular expressions don't work for international characters

David Mandelin [:dmandelin]

Assignee

Comment 7

•

16 years ago

Attached patch Patch (obsolete) — Details — Splinter Review

The shell/browser difference probably has something to do with the way non-ASCII chars are handled by the terminal. The test case in the patch uses Unicode escapes so it can be reliably executed in any environment. The FF3 behavior is correct, and this patch fixes the regexp JIT.

Attachment #387320 - Flags: review?(lw)

Luke Wagner [:luke]

Updated

•

16 years ago

Attachment #387320 - Flags: review?(lw) → review+

Luke Wagner [:luke]

Comment 8

•

16 years ago

Comment on attachment 387320 [details] [diff] [review] Patch Makes sense. Perhaps the condition in hasCases could be rewritten: return JS_TOLOWER(ch) != JS_TOUPPER(ch); ?

David Mandelin [:dmandelin]

Assignee

Comment 9

•

16 years ago

The patch seems to work, but with it, the jit still fails this test case: '\u03d5'.search('\u03a6', 'i'); due to a closely related issue. I need to think about this a bit more. ECMA-262 says the correct way to do a case-insensitive regexp character match is to convert both characters to upper-case and then compare. This isn't how Unicode does it [1], but apparently ECMA-262 is only slightly Unicode-aware. I want to think up an algorithm for doing it the ECMA-262 way without needing to apply case mappings to the text string during matching, if possible. [1] Unicode specifies doing case-folding by mapping each character to its equivalence class in the equivalence relation where x R y iff x is a lower-, upper- or title-case version of y or vice versa.

David Mandelin [:dmandelin]

Assignee

Comment 10

•

16 years ago

I found there are 17 characters with multiple lower-case versions. (Formally, 17 characters |u| such that more than one character maps to |u| as its upper-case.) They are listed at the end of the comment. Given the small number of sets, it seems reasonable to special-case them and generate 3 or 4 tests as needed. ['i', 'ı'] ['s', 'ſ'] ['ǅ', 'ǆ'] ['ǈ', 'ǉ'] ['ǋ', 'ǌ'] ['ǲ', 'ǳ'] ['β', 'ϐ'] ['ε', 'ϵ'] ['θ', 'ϑ'] ['ͅ', 'ι', 'ι'] ['κ', 'ϰ'] ['µ', 'μ'] ['π', 'ϖ'] ['ρ', 'ϱ'] ['ς', 'σ'] ['φ', 'ϕ'] ['ṡ', 'ẛ']

David Mandelin [:dmandelin]

Assignee

Comment 11

•

16 years ago

Attached patch Patch 2 — Details — Splinter Review

This patch correctly handles all the funky extra lower-case forms. Btw, the non-jit regexp engine doesn't pass the included test case.

Attachment #387320 - Attachment is obsolete: true

Attachment #387791 - Flags: review?(brendan)

David Mandelin [:dmandelin]

Assignee

Updated

•

16 years ago

Attachment #387791 - Flags: review?(brendan) → review?(lw)

Luke Wagner [:luke]

Comment 12

•

16 years ago

Attached file expanded and noisy test case — Details

First of all, nice job unravelling the mysteries of Unicode. The logic in the patch looks good. My only complaint, and maybe you were going to do this anyways, is that after all the story behind this patch, the comment explaining the resulting code is a bit terse. Perhaps just a bug number? So I was playing around with the regression test you submitted and I made it test each character in an equivalence class against each other (attached). Surprisingly, this exposed a bunch of cases of asymmetric matches. As you can see in the output of running the attachment, the source of the problem is the toUpperCase function, specifically the JS_TOUPPER macro. I checked on a few code points, and the JS_TOUPPER is definitely diverging from the Unicode spec.

Luke Wagner [:luke]

Updated

•

16 years ago

Attachment #387791 - Flags: review?(lw) → review+

Luke Wagner [:luke]

Comment 13

•

16 years ago

Comment on attachment 387791 [details] [diff] [review] Patch 2 (Existing) JS_UPPERCASE bug notwithstanding, the code in this patch looks good.

David Mandelin [:dmandelin]

Assignee

Comment 14

•

16 years ago

Pushed to TM as 274d57daf19c.

Robert Sayre

Comment 15

•

16 years ago

http://hg.mozilla.org/mozilla-central/rev/274d57daf19c

Status: NEW → RESOLVED

Closed: 16 years ago

Resolution: --- → FIXED

Mike Beltzner [:beltzner, not reading bugmail]

Updated

•

16 years ago

blocking1.9.1: --- → needed

status1.9.1: --- → wanted

Flags: wanted1.9.1.x?

Flags: wanted1.9.1.x+

Flags: blocking1.9.1.1?

Flags: blocking1.9.1.1-

Samuel Sidler (old account; do not CC)

Comment 17

•

16 years ago

Is this still needed on the 1.9.1 branch?

Daniel Veditz [:dveditz]

Comment 18

•

16 years ago

not going to block on a 1.9.1 release, but if you still intend to fix this in 3.5.x please request approval for that branch on an applicable patch.

blocking1.9.1: needed → -

status1.9.1: wanted → ?

Keywords: regression

Whiteboard: regression from 3.0

You need to log in before you can comment on or make changes to this bug.