Closed Bug 440926 Opened 13 years ago Closed 11 years ago

Regular expression character sets that contain "\u0130" match "i" character

Categories

(Core :: JavaScript Engine, defect)

x86
Windows XP
defect
Not set
major

Tracking

()

RESOLVED FIXED
mozilla1.9.3a1

People

(Reporter: cacyclewp, Assigned: steveharper60)

References

Details

Attachments

(4 files, 6 obsolete files)

User-Agent:       Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.12) Gecko/20080201 Firefox/2.0.0.12
Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9) Gecko/2008052906 Firefox/3.0

Regular expression character set that contain or include "\u0130" match the "i" character.

This does not happen for \u0130 outside a regExp character set.



Reproducible: Always

Steps to Reproduce:
Paste any of the following regExp into the address bar:

javascript:'abcdefghijklmnopqrstuvwxyz'.replace(/([\u0130])/gi, '#')
javascript:'abcdefghijklmnopqrstuvwxyz'.replace(/([\u007f-ffff])/gi, '#')

Actual Results:  
Always: abcdefgh#jklmnopqrstuvwxyz (i is replaced)
This bug might be related to Bug 416933
I'm not sure I agree with you that a case-insensitive match of "Unicode Character 'LATIN CAPITAL LETTER I WITH DOT ABOVE' (U+0130)" should _not_ match a lower-case "i".  Can you provide some argument?
You might be right as this behaviour is seen for standard ASCII characters:

javascript:'aA'.replace(/[a]/gi, '#') results in '##'
javascript:'aA'.replace(/a/gi, '#') results in '##'

But:

javascript:'iI\u0130'.replace(/[\u0130]/gi, '#') results in '###'
javascript:'iI\u0130'.replace(/\u0130/gi, '#') results in 'iI#' and NOT in '###' as one would expect!

There is an discrepancy between normal characters and character ranges as mentioned above.
Severity: major → normal
Also, if you read the Latin Extended-A unicode block code tables you see that there are myriads of variants of ASCII characters named LATIN CAPITAL/SMALL LETTER X WITH Y. 

I do not think that any of these characters should be treated as case variants of ASCII characters (if only to avoid a common source of totally counterintuitive and erratic behavior). 

Also, why should LATIN CAPITAL LETTER I WITH DOT ABOVE be treated differently from LATIN CAPITAL LETTER G WITH DOT ABOVE?

x0'll appreciate this one.  Yeah, we're probably wrong in the case-insensitive situation, for non-ASCII characters.  I'll check this out more on Monday, if x0 hasn't replied by then.
I changed severity to major as this has the potential to block javascripts in an unpredictable way.
Severity: normal → major
Ah, the sins of the past! It seems this is still a remnant from the old hack for Turkish, which has given localization all around the world a lot of trouble, not just at Mozilla.
Brendan:  Do you have any thoughts on this?  It's a bit of unicode weirdness in regexp.  Of the examples in comment #3, which are correct?
Brian: It is correct that "javascript:'iI\u0130'.replace(/\u0130/gi, '#')" results in "iI#". It is a bug that "javascript:'iI\u0130'.replace(/[\u0130]/gi, '#')" results in "###". 

Character ranges in case insensitive regular expressions MUST NOT include any non-ASCII "case" variants.
Bug 378738 is related to this. It shows that non-ASCII Unicode characters should not be treated as ASCII character variants. It would be nice if this could be fixed, please note the high number of votes for this bug.
cc:ing dmandelin since he is closer to JS regexp these days than I am.  Glad to help w/ any questions, though!
(In reply to comment #3)

ECMA-262 says that case-insensitive regular expression matching is done by converting both the pattern character and the text character to upper case, then comparing. So:

This one is wrong, because upper(U+0130) -> U+130, while upper('i') -> 'I', which are not equal:
> javascript:'iI\u0130'.replace(/[\u0130]/gi, '#') results in '###'
(I actually get '#I#' currently, but that's still wrong.)

This one is right:
> javascript:'iI\u0130'.replace(/\u0130/gi, '#') results in 'iI#' and NOT in
> '###' as one would expect!

So, do we want to go with EMCA-262 here, or does web compat require something different? (For comparison, SFX and V8 both give the EMCA-262-specified 'iI#' for both examples.)
Did you try IE?  I'm guessing we should conform, and that they probably do, too.  This is likely just a weird bug we have.
Yeah, try IE and Opera -- maybe we are lone wolf.

/be
IE8 returns '###' for both tests.
Google Chrome 2.0.172.37 returns the correct iI# for both tests. 

I is hard to imagine that any existing script relies in any way on this bug - quite the opposite, it is VERY complicated to create workarounds (and those would not be affected by fixing this bug). 

Therefore I do not see any reason to imitate IE when it is so obviously wrong and weird that even they might fix it soon...
Opera 9.64 returns the correct iI# for both tests.
Duplicate of this bug: 522021
Any chance to get this fixed anytime soon? This has been reported one and a half years ago and it seems like a real no-brainer.
Yes, this should be considered as a spec conformance fix. ES5 is a good occasion although this is not strictly an "ES5" bug. I did bring it up at the Redmond TC39 meeting in July and talk about it the Microsoft rep. No idea when or if they'll fix IE JScript. We should fix soon tho.

/be
Blocks: es5
Status: UNCONFIRMED → NEW
Ever confirmed: true
Assignee: nobody → general
Component: General → JavaScript Engine
Product: Firefox → Core
QA Contact: general → general
Attached patch Patch for Bug 440926 (obsolete) — Splinter Review
I have made some minor changes to the regular expression code in Spidermonkey inorder to ensure regular expressionos in firefox produce the same results as google chrome. It looks like the regular expression parser converts the input string and pattern to lower case before checking for matches. I have attached a patch to this post.
Mr. Harper:  It looks like your patch contains tabs, do you mind uploading a new one without tabs for review?  (otherwise, it looks pretty good to me)  Does it fix this case in particular?  What other tests are you using to compare results against Google Chrome?

Thanks!
I have tested the two regular expressions above:

javascript:'iI\u0130'.replace(/[\u0130]/gi, '#')
javascript:'iI\u0130'.replace(/\u0130/gi, '#')

both return "iI#" (The same result as google chrome)

I check to see if the unicode character has an ascii lowercase character. If it does then i just return the origional uppercase character causing matches against any ascii character to fail. Hope the tabs are ok, I use Visual Studio as my text editor. Im new to the project so im not sure how to run the automated unit tests at the code changes.
Attachment #416270 - Attachment is patch: true
Attachment #416270 - Attachment mime type: application/octet-stream → text/plain
This patch seems to have no spaces at all....  What I meant in comment 22 is that we need patches that contain correct white-spacing (indentation) without tabs.

Thanks!
Attached file Patch for Bug 440926 (without tabs) (obsolete) —
Attachment #416235 - Attachment is obsolete: true
Attachment #416270 - Attachment is obsolete: true
Attachment #416277 - Attachment is obsolete: true
Comment on attachment 416278 [details] [diff] [review]
Patch for Bug 440926 (without tabs)

mrbkap:  Please redirect if you've got a better reviewer.  This looks good to me, at a cursory glance.
Attachment #416278 - Flags: review?(mrbkap)
(In reply to comment #24)
> I check to see if the unicode character has an ascii lowercase character. If it
> does then i just return the origional uppercase character causing matches
> against any ascii character to fail. Hope the tabs are ok, I use Visual Studio
> as my text editor.

Steve Harper: I have not checked the patched code, but want to make sure that above-ASCII characters are never ever be treated as case variants of ASCII characters (as discussed above).
(In reply to comment #29)
> (In reply to comment #24)
> > I check to see if the unicode character has an ascii lowercase character. If it
> > does then i just return the origional uppercase character causing matches
> > against any ascii character to fail. Hope the tabs are ok, I use Visual Studio
> > as my text editor.
> 
> Steve Harper: I have not checked the patched code, but want to make sure that
> above-ASCII characters are never ever be treated as case variants of ASCII
> characters (as discussed above).

Looking at the patch (without knowing the full context of the code) it appears that that is both the purpose and the effect of the patch.
Cacycle: Nathan Tuggy is correct, the intention of the patch is to ensure that ASCII characters never match unicode character variants in a case insensitive regular expression.
Also relevant, the program:

print('iI\u0130'.replace(/[\u0130]/gi, '#'));
print('iI\u0130'.replace(/\u0130/gi, '#'));

prints:

#I# iI#

without JIT and:

#I# #I#

with JIT.
Luke: I have tested the trunk version of Firefox with the patch attached to this bug. Both regular expressions you have listed in comment 32 return the correct result: iI#. This is with both JIT enabled and disabled.
(In reply to comment #33)

Oh sorry, I didn't mean to imply that I got those outputs from running with the patch applied.  That is without the patch; I was just pointing out the different behavior of with and without JIT.
I see that ECMA 262v3 15.10.2.8 describes case insensitive matching in terms of Canonicalize, which is reflected, modulo the IgnoreCase test in step 1, by upcase().  However, the current bug, as I understand it, with \u0130 matching 'i' is not that downcase(\u0130) == 'i' but rather that we don't check that upcase(downcase(\u0130)) == \u0130.  The modifications in the previous patch seem to achieve this for the specific case of \u0130 but perhaps there are other cases where c is non-ASCII, downcase(c) is non-ASCII and upcase(downcase(c)) != c.  I'm not at all a Unicode expert, so maybe I'm wrong here, but the attached patch does what I'm talking about and correctly produces "iI#" for the above tests.
There may be a perf cost to any approach that does a more up/downcasing, have you tested?
(In reply to comment #36)
> There may be a perf cost to any approach that does a more up/downcasing, have
> you tested?

The extra upcast() happens only when compiling the regex, so the overhead is dwarved by the cost of compilation.  To be sure I tested on SS and a micro-bench.
I think I can follow what your trying to explain Luke. But wouldn't upcase(downcase(c)) == c if c was none ASCII and downcase(c) was none ASCII? The small patch I submitted just excludes any characters from the ASCII subset matching lower case unicode characters.

javascript:'i\u00e1\u00c1'.replace(/[\u00c1]/gi, '#') is an example that correctly matches 'LATIN CAPITAL LETTER A WITH ACUTE' to the lower case version.
(In reply to comment #38)
I agree that your patch works for non-ASCII lowercase and uppercase matching like \u00e1 and \u00c1; what I was thinking was that there might be two non-ASCII characters a and b where upcase(a) == a and downcase(a) == b but upcase(b) != a (thus, /[a]/i.test(b) should be false while the earlier patch would, I believe, return true).

I made the attached test using charCodeAt and toUpperCase/toLowerCase to find if there are any characters with this property and found U+10A0 through U+10C5.
Comment on attachment 416278 [details] [diff] [review]
Patch for Bug 440926 (without tabs)

I'm not really the best reviewer here. I think dmandelin would be better, but it seems like there's still some debate over the best way to write the patch, so I'll just remove this request for now.
Attachment #416278 - Flags: review?(mrbkap)
Attached patch Patch for Bug 440926 V2 (obsolete) — Splinter Review
Attachment #416278 - Attachment is obsolete: true
Attachment #417193 - Flags: review?(dmandelin)
The character set ranging from U+10A0 to U+10C5 corresponds to the Georgian characters(http://unicode.org/charts/PDF/U10A0.pdf). It appears that Capital letters in this character set can have lower case versions, but the lower case version cannot map to the upper case version. I have submitted a new patch to account for the fact that upcase(downcase(c)) != c when using characters in the Georgian character set.

The regular expressions:
javascript:'iI\u10a0'.replace(/[\u10d0]/gi, '#') 
javascript:'iI\u10a0'.replace(/\u10d0/gi, '#') 

javascript:'iI\u10d0'.replace(/[\u10a0]/gi, '#') 
javascript:'iI\u10d0'.replace(/\u10a0/gi, '#') 

Should now produce the same results as Google chrome.
(In reply to comment #42)

-    return JS_TOLOWER(ch);
+    cu = JS_TOLOWER(ch);
+    return JS_TOLOWER(cu < 128) ? ch : cu;

Ignoring the typo with |JS_TOLOWER(cu < 128)|, I believe the change to downcase is unnecessary; it's subsumed by the upcase(downcase(c)) != c check.
http://www.iam.uni-bonn.de/~alt/html/unicode_34.html

"Khutsuri 

This is the uppercase of the old ecclesiastical alphabet. The style shown in the code charts is known as Asomtavruli."

See also http://www.unicode.org/reports/tr21/tr21-4.3.html.

/be
Sorry Luke you are right. I must have forgot to revert back to the origional file before making the new changes. I will remove this and check i get the same results.
Brendan: Those two links look interesting. Thanks, I will take a look.
Attachment #417193 - Attachment is obsolete: true
Attachment #417193 - Flags: review?(dmandelin)
Attachment #417250 - Flags: review?(mrbkap)
Comment on attachment 417250 [details] [diff] [review]
Patch for Bug 440926 V3 (Removed superfluous code in downcase function)

Moving review to dmandelin as mrbkap previously requested.
Attachment #417250 - Flags: review?(mrbkap) → review?(dmandelin)
Comment on attachment 417250 [details] [diff] [review]
Patch for Bug 440926 V3 (Removed superfluous code in downcase function)

The logic looks good. I do have a couple of alternate suggestions for how to express this piece, which is part of a somewhat tricky bit of code (at least to me, and I think I wrote it ;-) ). See if you think either is clearer. r+ with the form you think is most readable (including original).

>         if (cs->flags & JSREG_FOLD) {
>             ch = JS_TOUPPER(ch);
>             jschar lch = JS_TOLOWER(ch);
>+            
>+            if (upcase(lch) != ch)
>+              lch = ch;
> 
>             if (ch != lch) {

1. The bottom two statements could be done shorter (also matching the way it's coded later in the patch):

              if (ch != lch && upcase(lch) == ch) {

2. But I think it's still far from obvious what we're doing, which is computing the inverse image (in the mathematical sense) of |ch| for JS_TOUPPER, { x | JS_TOUPPER(x) == ch }.

|lch| is supposed to be a canonical element of the inverse image, namely the one such that |JS_TOLOWER(ch) == lch|, if that element is in the inverse image, or else |ch|. We could try to make this more obvious (heh) like this:

    // Return the 'canonical' inverse upcase of |ch|. That is the character
    // |lch| such that |upcase(lch) == ch| and (|lch| is the lower-case form
    // of |ch| or is |ch|).
    static inline jschar inverse_upcase(jschar ch)
    {
        jschar lch = JS_TOLOWER(ch);
        return upcase(lch) == ch ? lch : ch;
    }

Then we can say:

        if (cs->flags & JSREG_FOLD) {
            ch = JS_TOUPPER(ch);
            jschar lch = inverse_upcase(ch);
            
            if (ch != lch) {

and the same way in the character class computation.
Attachment #417250 - Flags: review?(dmandelin) → review+
Hi David, thanks for taking time to review this patch. Im not sure what you meant by "r+ with the form you think is most readable (including original)". 

I think that Using the inverse_upcase function with the comment above would make the patch intentions more ovbious. Should I submit a new patch, or would you like to work on this?
Steve, just FYI, r+ is short for review+ and means something like "passes my review"; often, it's coupled with conditions that usually involve submitting a new patch, and in this case, David would probably like you to submit a new patch with the inverse_upcase changes, since you both agree that that's the right move.

BTW, thanks for working on this. :)
(In reply to comment #51)
> Steve, just FYI, r+ is short for review+ and means something like "passes my
> review"; often, it's coupled with conditions that usually involve submitting a
> new patch, and in this case, David would probably like you to submit a new
> patch with the inverse_upcase changes, since you both agree that that's the
> right move.

Yes. I was also implying that a second review was not needed; simply consider my suggestions, choose what you think is best, and consider my r+ to apply to your final version.
Attachment #417250 - Attachment is obsolete: true
Can we get this patch into the current trunk build?

Cheers,

Steve
Comment on attachment 418379 [details] [diff] [review]
Patch for Bug 440926 V4

Beautifully simple.
Attachment #418379 - Flags: review+
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Assignee: general → steveharper60
Flags: in-testsuite+
Target Milestone: --- → mozilla1.9.3a1
Version: unspecified → Trunk
You need to log in before you can comment on or make changes to this bug.