Regular expression character sets that contain "\u0130" match "i" character

RESOLVED FIXED in mozilla1.9.3a1

Status

()

--
major
RESOLVED FIXED
11 years ago
5 years ago

People

(Reporter: cacyclewp, Assigned: steveharper60)

Tracking

Trunk
mozilla1.9.3a1
x86
Windows XP
Points:
---
Bug Flags:
in-testsuite +

Firefox Tracking Flags

(Not tracked)

Details

Attachments

(4 attachments, 6 obsolete attachments)

(Reporter)

Description

11 years ago
User-Agent:       Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.12) Gecko/20080201 Firefox/2.0.0.12
Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9) Gecko/2008052906 Firefox/3.0

Regular expression character set that contain or include "\u0130" match the "i" character.

This does not happen for \u0130 outside a regExp character set.



Reproducible: Always

Steps to Reproduce:
Paste any of the following regExp into the address bar:

javascript:'abcdefghijklmnopqrstuvwxyz'.replace(/([\u0130])/gi, '#')
javascript:'abcdefghijklmnopqrstuvwxyz'.replace(/([\u007f-ffff])/gi, '#')

Actual Results:  
Always: abcdefgh#jklmnopqrstuvwxyz (i is replaced)
(Reporter)

Comment 1

11 years ago
This bug might be related to Bug 416933

Comment 2

11 years ago
I'm not sure I agree with you that a case-insensitive match of "Unicode Character 'LATIN CAPITAL LETTER I WITH DOT ABOVE' (U+0130)" should _not_ match a lower-case "i".  Can you provide some argument?
(Reporter)

Comment 3

11 years ago
You might be right as this behaviour is seen for standard ASCII characters:

javascript:'aA'.replace(/[a]/gi, '#') results in '##'
javascript:'aA'.replace(/a/gi, '#') results in '##'

But:

javascript:'iI\u0130'.replace(/[\u0130]/gi, '#') results in '###'
javascript:'iI\u0130'.replace(/\u0130/gi, '#') results in 'iI#' and NOT in '###' as one would expect!

There is an discrepancy between normal characters and character ranges as mentioned above.
Severity: major → normal
(Reporter)

Comment 4

11 years ago
Also, if you read the Latin Extended-A unicode block code tables you see that there are myriads of variants of ASCII characters named LATIN CAPITAL/SMALL LETTER X WITH Y. 

I do not think that any of these characters should be treated as case variants of ASCII characters (if only to avoid a common source of totally counterintuitive and erratic behavior). 

Also, why should LATIN CAPITAL LETTER I WITH DOT ABOVE be treated differently from LATIN CAPITAL LETTER G WITH DOT ABOVE?

Comment 5

11 years ago
x0'll appreciate this one.  Yeah, we're probably wrong in the case-insensitive situation, for non-ASCII characters.  I'll check this out more on Monday, if x0 hasn't replied by then.
(Reporter)

Comment 6

11 years ago
I changed severity to major as this has the potential to block javascripts in an unpredictable way.
Severity: normal → major

Comment 7

10 years ago
Ah, the sins of the past! It seems this is still a remnant from the old hack for Turkish, which has given localization all around the world a lot of trouble, not just at Mozilla.

Comment 8

10 years ago
Brendan:  Do you have any thoughts on this?  It's a bit of unicode weirdness in regexp.  Of the examples in comment #3, which are correct?
(Reporter)

Comment 9

10 years ago
Brian: It is correct that "javascript:'iI\u0130'.replace(/\u0130/gi, '#')" results in "iI#". It is a bug that "javascript:'iI\u0130'.replace(/[\u0130]/gi, '#')" results in "###". 

Character ranges in case insensitive regular expressions MUST NOT include any non-ASCII "case" variants.
(Reporter)

Comment 10

10 years ago
Bug 378738 is related to this. It shows that non-ASCII Unicode characters should not be treated as ASCII character variants. It would be nice if this could be fixed, please note the high number of votes for this bug.

Comment 11

10 years ago
cc:ing dmandelin since he is closer to JS regexp these days than I am.  Glad to help w/ any questions, though!
(In reply to comment #3)

ECMA-262 says that case-insensitive regular expression matching is done by converting both the pattern character and the text character to upper case, then comparing. So:

This one is wrong, because upper(U+0130) -> U+130, while upper('i') -> 'I', which are not equal:
> javascript:'iI\u0130'.replace(/[\u0130]/gi, '#') results in '###'
(I actually get '#I#' currently, but that's still wrong.)

This one is right:
> javascript:'iI\u0130'.replace(/\u0130/gi, '#') results in 'iI#' and NOT in
> '###' as one would expect!

So, do we want to go with EMCA-262 here, or does web compat require something different? (For comparison, SFX and V8 both give the EMCA-262-specified 'iI#' for both examples.)

Comment 13

10 years ago
Did you try IE?  I'm guessing we should conform, and that they probably do, too.  This is likely just a weird bug we have.
Yeah, try IE and Opera -- maybe we are lone wolf.

/be
IE8 returns '###' for both tests.
(Reporter)

Comment 16

10 years ago
Google Chrome 2.0.172.37 returns the correct iI# for both tests. 

I is hard to imagine that any existing script relies in any way on this bug - quite the opposite, it is VERY complicated to create workarounds (and those would not be affected by fixing this bug). 

Therefore I do not see any reason to imitate IE when it is so obviously wrong and weird that even they might fix it soon...
(Reporter)

Comment 17

10 years ago
Opera 9.64 returns the correct iI# for both tests.

Updated

10 years ago
Duplicate of this bug: 522021
(Reporter)

Comment 19

10 years ago
Any chance to get this fixed anytime soon? This has been reported one and a half years ago and it seems like a real no-brainer.
Yes, this should be considered as a spec conformance fix. ES5 is a good occasion although this is not strictly an "ES5" bug. I did bring it up at the Redmond TC39 meeting in July and talk about it the Microsoft rep. No idea when or if they'll fix IE JScript. We should fix soon tho.

/be
Blocks: 445494
Status: UNCONFIRMED → NEW
Ever confirmed: true

Updated

10 years ago
Assignee: nobody → general
Component: General → JavaScript Engine
Product: Firefox → Core
QA Contact: general → general
(Assignee)

Comment 21

9 years ago
Posted patch Patch for Bug 440926 (obsolete) — Splinter Review
I have made some minor changes to the regular expression code in Spidermonkey inorder to ensure regular expressionos in firefox produce the same results as google chrome. It looks like the regular expression parser converts the input string and pattern to lower case before checking for matches. I have attached a patch to this post.

Comment 22

9 years ago
Mr. Harper:  It looks like your patch contains tabs, do you mind uploading a new one without tabs for review?  (otherwise, it looks pretty good to me)  Does it fix this case in particular?  What other tests are you using to compare results against Google Chrome?

Thanks!
(Assignee)

Comment 23

9 years ago
(Assignee)

Comment 24

9 years ago
I have tested the two regular expressions above:

javascript:'iI\u0130'.replace(/[\u0130]/gi, '#')
javascript:'iI\u0130'.replace(/\u0130/gi, '#')

both return "iI#" (The same result as google chrome)

I check to see if the unicode character has an ascii lowercase character. If it does then i just return the origional uppercase character causing matches against any ascii character to fail. Hope the tabs are ok, I use Visual Studio as my text editor. Im new to the project so im not sure how to run the automated unit tests at the code changes.

Updated

9 years ago
Attachment #416270 - Attachment is patch: true
Attachment #416270 - Attachment mime type: application/octet-stream → text/plain

Comment 25

9 years ago
This patch seems to have no spaces at all....  What I meant in comment 22 is that we need patches that contain correct white-spacing (indentation) without tabs.

Thanks!
(Assignee)

Comment 26

9 years ago
Posted file Patch for Bug 440926 (without tabs) (obsolete) —
(Assignee)

Comment 27

9 years ago
Attachment #416235 - Attachment is obsolete: true
Attachment #416270 - Attachment is obsolete: true
Attachment #416277 - Attachment is obsolete: true

Comment 28

9 years ago
Comment on attachment 416278 [details] [diff] [review]
Patch for Bug 440926 (without tabs)

mrbkap:  Please redirect if you've got a better reviewer.  This looks good to me, at a cursory glance.
Attachment #416278 - Flags: review?(mrbkap)
(Reporter)

Comment 29

9 years ago
(In reply to comment #24)
> I check to see if the unicode character has an ascii lowercase character. If it
> does then i just return the origional uppercase character causing matches
> against any ascii character to fail. Hope the tabs are ok, I use Visual Studio
> as my text editor.

Steve Harper: I have not checked the patched code, but want to make sure that above-ASCII characters are never ever be treated as case variants of ASCII characters (as discussed above).
(In reply to comment #29)
> (In reply to comment #24)
> > I check to see if the unicode character has an ascii lowercase character. If it
> > does then i just return the origional uppercase character causing matches
> > against any ascii character to fail. Hope the tabs are ok, I use Visual Studio
> > as my text editor.
> 
> Steve Harper: I have not checked the patched code, but want to make sure that
> above-ASCII characters are never ever be treated as case variants of ASCII
> characters (as discussed above).

Looking at the patch (without knowing the full context of the code) it appears that that is both the purpose and the effect of the patch.
(Assignee)

Comment 31

9 years ago
Cacycle: Nathan Tuggy is correct, the intention of the patch is to ensure that ASCII characters never match unicode character variants in a case insensitive regular expression.
Also relevant, the program:

print('iI\u0130'.replace(/[\u0130]/gi, '#'));
print('iI\u0130'.replace(/\u0130/gi, '#'));

prints:

#I# iI#

without JIT and:

#I# #I#

with JIT.
(Assignee)

Comment 33

9 years ago
Luke: I have tested the trunk version of Firefox with the patch attached to this bug. Both regular expressions you have listed in comment 32 return the correct result: iI#. This is with both JIT enabled and disabled.
(In reply to comment #33)

Oh sorry, I didn't mean to imply that I got those outputs from running with the patch applied.  That is without the patch; I was just pointing out the different behavior of with and without JIT.
I see that ECMA 262v3 15.10.2.8 describes case insensitive matching in terms of Canonicalize, which is reflected, modulo the IgnoreCase test in step 1, by upcase().  However, the current bug, as I understand it, with \u0130 matching 'i' is not that downcase(\u0130) == 'i' but rather that we don't check that upcase(downcase(\u0130)) == \u0130.  The modifications in the previous patch seem to achieve this for the specific case of \u0130 but perhaps there are other cases where c is non-ASCII, downcase(c) is non-ASCII and upcase(downcase(c)) != c.  I'm not at all a Unicode expert, so maybe I'm wrong here, but the attached patch does what I'm talking about and correctly produces "iI#" for the above tests.

Comment 36

9 years ago
There may be a perf cost to any approach that does a more up/downcasing, have you tested?
(In reply to comment #36)
> There may be a perf cost to any approach that does a more up/downcasing, have
> you tested?

The extra upcast() happens only when compiling the regex, so the overhead is dwarved by the cost of compilation.  To be sure I tested on SS and a micro-bench.
(Assignee)

Comment 38

9 years ago
I think I can follow what your trying to explain Luke. But wouldn't upcase(downcase(c)) == c if c was none ASCII and downcase(c) was none ASCII? The small patch I submitted just excludes any characters from the ASCII subset matching lower case unicode characters.

javascript:'i\u00e1\u00c1'.replace(/[\u00c1]/gi, '#') is an example that correctly matches 'LATIN CAPITAL LETTER A WITH ACUTE' to the lower case version.
(In reply to comment #38)
I agree that your patch works for non-ASCII lowercase and uppercase matching like \u00e1 and \u00c1; what I was thinking was that there might be two non-ASCII characters a and b where upcase(a) == a and downcase(a) == b but upcase(b) != a (thus, /[a]/i.test(b) should be false while the earlier patch would, I believe, return true).

I made the attached test using charCodeAt and toUpperCase/toLowerCase to find if there are any characters with this property and found U+10A0 through U+10C5.
Comment on attachment 416278 [details] [diff] [review]
Patch for Bug 440926 (without tabs)

I'm not really the best reviewer here. I think dmandelin would be better, but it seems like there's still some debate over the best way to write the patch, so I'll just remove this request for now.
Attachment #416278 - Flags: review?(mrbkap)
(Assignee)

Comment 41

9 years ago
Posted patch Patch for Bug 440926 V2 (obsolete) — Splinter Review
Attachment #416278 - Attachment is obsolete: true

Updated

9 years ago
Attachment #417193 - Flags: review?(dmandelin)
(Assignee)

Comment 42

9 years ago
The character set ranging from U+10A0 to U+10C5 corresponds to the Georgian characters(http://unicode.org/charts/PDF/U10A0.pdf). It appears that Capital letters in this character set can have lower case versions, but the lower case version cannot map to the upper case version. I have submitted a new patch to account for the fact that upcase(downcase(c)) != c when using characters in the Georgian character set.

The regular expressions:
javascript:'iI\u10a0'.replace(/[\u10d0]/gi, '#') 
javascript:'iI\u10a0'.replace(/\u10d0/gi, '#') 

javascript:'iI\u10d0'.replace(/[\u10a0]/gi, '#') 
javascript:'iI\u10d0'.replace(/\u10a0/gi, '#') 

Should now produce the same results as Google chrome.
(In reply to comment #42)

-    return JS_TOLOWER(ch);
+    cu = JS_TOLOWER(ch);
+    return JS_TOLOWER(cu < 128) ? ch : cu;

Ignoring the typo with |JS_TOLOWER(cu < 128)|, I believe the change to downcase is unnecessary; it's subsumed by the upcase(downcase(c)) != c check.
http://www.iam.uni-bonn.de/~alt/html/unicode_34.html

"Khutsuri 

This is the uppercase of the old ecclesiastical alphabet. The style shown in the code charts is known as Asomtavruli."

See also http://www.unicode.org/reports/tr21/tr21-4.3.html.

/be
(Assignee)

Comment 45

9 years ago
Sorry Luke you are right. I must have forgot to revert back to the origional file before making the new changes. I will remove this and check i get the same results.
(Assignee)

Comment 46

9 years ago
Brendan: Those two links look interesting. Thanks, I will take a look.
(Assignee)

Comment 47

9 years ago
Attachment #417193 - Attachment is obsolete: true
Attachment #417193 - Flags: review?(dmandelin)
(Assignee)

Updated

9 years ago
Attachment #417250 - Flags: review?(mrbkap)

Comment 48

9 years ago
Comment on attachment 417250 [details] [diff] [review]
Patch for Bug 440926 V3 (Removed superfluous code in downcase function)

Moving review to dmandelin as mrbkap previously requested.
Attachment #417250 - Flags: review?(mrbkap) → review?(dmandelin)
Comment on attachment 417250 [details] [diff] [review]
Patch for Bug 440926 V3 (Removed superfluous code in downcase function)

The logic looks good. I do have a couple of alternate suggestions for how to express this piece, which is part of a somewhat tricky bit of code (at least to me, and I think I wrote it ;-) ). See if you think either is clearer. r+ with the form you think is most readable (including original).

>         if (cs->flags & JSREG_FOLD) {
>             ch = JS_TOUPPER(ch);
>             jschar lch = JS_TOLOWER(ch);
>+            
>+            if (upcase(lch) != ch)
>+              lch = ch;
> 
>             if (ch != lch) {

1. The bottom two statements could be done shorter (also matching the way it's coded later in the patch):

              if (ch != lch && upcase(lch) == ch) {

2. But I think it's still far from obvious what we're doing, which is computing the inverse image (in the mathematical sense) of |ch| for JS_TOUPPER, { x | JS_TOUPPER(x) == ch }.

|lch| is supposed to be a canonical element of the inverse image, namely the one such that |JS_TOLOWER(ch) == lch|, if that element is in the inverse image, or else |ch|. We could try to make this more obvious (heh) like this:

    // Return the 'canonical' inverse upcase of |ch|. That is the character
    // |lch| such that |upcase(lch) == ch| and (|lch| is the lower-case form
    // of |ch| or is |ch|).
    static inline jschar inverse_upcase(jschar ch)
    {
        jschar lch = JS_TOLOWER(ch);
        return upcase(lch) == ch ? lch : ch;
    }

Then we can say:

        if (cs->flags & JSREG_FOLD) {
            ch = JS_TOUPPER(ch);
            jschar lch = inverse_upcase(ch);
            
            if (ch != lch) {

and the same way in the character class computation.
Attachment #417250 - Flags: review?(dmandelin) → review+
(Assignee)

Comment 50

9 years ago
Hi David, thanks for taking time to review this patch. Im not sure what you meant by "r+ with the form you think is most readable (including original)". 

I think that Using the inverse_upcase function with the comment above would make the patch intentions more ovbious. Should I submit a new patch, or would you like to work on this?
Steve, just FYI, r+ is short for review+ and means something like "passes my review"; often, it's coupled with conditions that usually involve submitting a new patch, and in this case, David would probably like you to submit a new patch with the inverse_upcase changes, since you both agree that that's the right move.

BTW, thanks for working on this. :)
(In reply to comment #51)
> Steve, just FYI, r+ is short for review+ and means something like "passes my
> review"; often, it's coupled with conditions that usually involve submitting a
> new patch, and in this case, David would probably like you to submit a new
> patch with the inverse_upcase changes, since you both agree that that's the
> right move.

Yes. I was also implying that a second review was not needed; simply consider my suggestions, choose what you think is best, and consider my r+ to apply to your final version.
(Assignee)

Comment 53

9 years ago
Attachment #417250 - Attachment is obsolete: true
(Assignee)

Comment 54

9 years ago
Can we get this patch into the current trunk build?

Cheers,

Steve
Comment on attachment 418379 [details] [diff] [review]
Patch for Bug 440926 V4

Beautifully simple.
Attachment #418379 - Flags: review+
Status: NEW → RESOLVED
Last Resolved: 9 years ago
Resolution: --- → FIXED
Assignee: general → steveharper60
Flags: in-testsuite+
Target Milestone: --- → mozilla1.9.3a1
Version: unspecified → Trunk
You need to log in before you can comment on or make changes to this bug.