Closed Bug 394604 Opened 17 years ago Closed 8 years ago

toLowerCase() and toUpperCase() are wrong for some intl characters

Categories

(Core :: JavaScript: Internationalization API, defect)

defect
Not set
minor

Tracking

()

RESOLVED FIXED

People

(Reporter: Dolske, Assigned: smontagu)

References

(Blocks 1 open bug)

Details

Attachments

(6 files)

There are a few spots in Login Manager which do a case-insensitive comparison. For example, when providing autocomplete results we ignore the case of the current field value. It does this by calling comparing str1.toLowerCase() to str2.toLowerCase() In an old review, Gavin mentioned that this might not work in some locales (Turkish?), with complex case conversion rules. I'm not sure if this is an issue or not, and what the right fix is. I see the form fill code and others using C++'s nsCaseInsensitiveStringComparator but I haven't looked at its implementation.
Shawn pointed out a testcase that, uhh, tests this kind of thing in MozStorage... http://mxr.mozilla.org/seamonkey/source/storage/test/unit/test_unicode.js
so, as long as it ends up using nsUnicharUtils.h's ToLowerCase it will do the trick (this is what mozStorage does).
So, I think this works fine. javascript:alert("\xc6") --> AE javascript:alert("\xc6".toLowerCase()) --> ae I didn't see any obvious replacements for this when I looked at it before, so if .toLowerCase() isn't sufficient for some situation I'd tend to think it's a bug in .toLowerCase().
Status: NEW → RESOLVED
Closed: 17 years ago
Resolution: --- → INVALID
(In reply to comment #2) > so, as long as it ends up using nsUnicharUtils.h's ToLowerCase it will do the > trick (this is what mozStorage does). The JS impl of toLowerCase certainly doesn't rely on intl/ code. It seems to just use http://bonsai.mozilla.org/cvsblame.cgi?file=mozilla/js/src/jsstr.h&rev=3.55#339 . Perhaps smontagu would know if there were any especially problematic cases. Why did you choose to test "\xc6" ?
JS uses character properties from Unicode 2 (which I've beefed about before in bug 258974 comment 4). This means that JS toLowerCase will give the wrong result for over 200 characters which are new since Unicode 2 or have had changes to the case-folding properties since then. I don't know what the autocomplete implementation in the location bar uses in practice. I tested it by visiting http://en.wiktionary.org/wiki/%E2%B2%89%E2%B2%99%E2%B2%9F%E2%B2%A9 and entering both Ⲉ and ⲉ in the location bar (U+2C88 and U+2C89). The site appears in the autocomplete results for both, but the ⲉ in the results is only underlined when using the same case as in the title.
I believe the location bar uses the SQL ToLower function, which uses our unicode functions. Seth would know best though.
justin and simon are on to something here. sdwilsh is right, when we attempt to find matches, we'll end up using our likeCompare(), see http://lxr.mozilla.org/seamonkey/source/storage/src/mozStorageUnicodeFunctions.cpp but when emphasizing, I use toLower() in JS, see http://lxr.mozilla.org/seamonkey/source/toolkit/content/widgets/autocomplete.xml#1237. So this seems like a valid bug, so I'm re-opening. Comment #5 is a specific instance of it with the url bar autocomplete. (Simon, can you spin that off to a new bug?)
Status: RESOLVED → REOPENED
Resolution: INVALID → ---
> Simon, can you spin that off to a new bug? I've spun it off to bug #409873, thanks for catching that simon
So, what's the right fix for this problem? Bug 258974 is about RegExs (and 3 years old :/), but I don't see anything for toLowerCase. This still seems like something that should be addressed by fixing toLowerCase(), although I could see there being tricky issues with ECMA compliance. Is there some awful XPCOM utility function hack to work around the toLowerCase limitation? I suppose this also implies that existing usage in the Mozilla tree should be audited. I see ~500 toLowerCase calls in *.js/xul/xbl, and ~200 toUpperCase calls.
Do we know if nsUnicharUtils.h's ToLowerCase a superset of JS's toLowerCase()?
No it isn't. U+018E maps to U+10DD in nsUnicharUtils but to U+0258 in JS, and the whole Georgian block from U+10a0 to U+10C5 maps to the new Georgian small letters from U+2D00 instead of the Georgian letters from U+10D0.
Attaching this test so it doesn't get lost. It's generated from the Unicode 5 data file.
(In reply to comment #9) > This still seems like something that should be addressed by fixing > toLowerCase(), although I could see there being tricky issues with ECMA > compliance. ECMA says "A conforming implementation of this International standard shall interpret characters in conformance with the Unicode Standard, Version 2.1 _or_ _later_" (emphasis added), so I don't think that there's a problem.
So, sounds to me like this is something that should be fixed in the JS toLowerCase/toUpperCase implementation. Moving to Core:JS.
Assignee: nobody → general
Status: REOPENED → NEW
Component: Password Manager → JavaScript Engine
Product: Firefox → Core
QA Contact: password.manager → general
Summary: look at replacing toLowerCase() usage for L10N? → toLowerCase() and toUpperCase() are wrong for some intl characters
as Edward can tell you, the AwesomeBar relies heavily on toLowerCase(), see http://lxr.mozilla.org/seamonkey/source/toolkit/content/widgets/autocomplete.xml my gut tells me fixing this bug in Core:JS isn't likely to happen for Firefox 3, but I'll seek blocking to get it on the radar / discussion table. If fixing this in Core:JS isn't possible at this stage, what about implementing a "i18n" friendly (but slower, I'm sure) version of toLowerCase() that autocomplete.xml can use? Even if it were measurably slower, I claim (without proof) that it wouldn't be the "long performance" pole in the AwesomeBar. thoughts / comments?
Flags: blocking1.9?
Search functionality-wise, ToLowerCase is being used http://bonsai.mozilla.org/cvsblame.cgi?file=/mozilla/toolkit/components/places/src/nsNavHistoryAutoComplete.cpp&rev=1.49&mark=390-391#390 However, the UI uses string.toLowerCase(), so it might not correctly emphasize matches.
Ok, when considering blocking here, it seems to me that if you are suggesting we change Core:JS toLowerCase(), and at this point, I'd say a cold hard minus. However, it sounds like there should be a fix to *some* implementation of toLowerCase() to make sure the AwesomeBar functions as expected/appropriate. So, on that note, I'm minusing this bug as is, but still sounds like there should be another bug opened per Comment #16.
Flags: blocking1.9? → blocking1.9-
Blocks: 424428
damon: yes, this bug is about the Core:JS toLowerCase impl, and agreed, that would be a blocking-. I'll leave it to JS guru's to decide when / how to fix that. as for the specific AwesomeBar issue, I logged bug #424428 and asked for firefox 3 blocking.
The testcase in attachment 294603 [details] isn't complete. It misses the cases where String.prototype.toUpperCase should return a multi-character string for a single character, e.g.: "\xdf".toUpperCase() should return "SS" (German sharp s 'ß') "\u0130".toLowerCase() should return "i\u0307" (I with point above 'İ') "\ufb00".toUpperCase() should return "FF" (ligature ff 'ff') ECMA states (15.5.4.16): | NOTE The result should be derived according to the case mappings in the | Unicode character database (this explicitly includes not only the | UnicodeData.txt file, but also the SpecialCasings.txt file that accompanies | it in Unicode 2.1.8 and later). The SpecialCasings.txt file contains the multi-character mappings. It is at http://www.unicode.org/Public/UNIDATA/SpecialCasing.txt , the 2.1.8 version at http://www.unicode.org/Public/2.1-Update3/SpecialCasing-1.txt . (In reply to comment #0) > For example, when providing autocomplete results we ignore the case of the > current field value. It does this by calling comparing str1.toLowerCase() to > str2.toLowerCase() Comparing str1.toUpperCase() and str2.toUpperCase() is more reliable.
(In reply to comment #20) > Comparing str1.toUpperCase() and str2.toUpperCase() is more reliable. According to the output in attachment 294604 [details], there are more failures for toUpperCase than toLowerCase.. ? (220 lines of toUpper and 217 lines of toLower)
Yes, because some of the upcasing rules that make it more reliable were added in later Unicode versions. E.g., both small sigmas (σ and ς) now map to Σ, so ς and Σ will compare equal if you use toUpperCase (once it is fixed), but not if you use toLowerCase. For the theta variant it is already working: "ϑ".toUpperCase() == "Θ".toUpperCase() is true, but "ϑ".toLowerCase() == "Θ".toLowerCase() is false.
Blocks: 428816
Status: NEW → ASSIGNED
Assignee: general → x00000000
Status: ASSIGNED → NEW
This patch replaces the old Unicode tables with Unicode 5.1.0 based tables. The layout is different and removes some stuff that isn't used to save space. The total size of the tables increases nevertheless from 9328 to 19554 bytes. That is mostly due to the added characters; almost all code points in the Basic Multilingual Plane are filled now. I've left the General Categories in, although they aren't used now, because they will be needed for bug 258974. They cost about 6 KB. The tables are split into two parts: One for the General Categories plus the XML name and ECMAScript identifier flags, and one for the case mappings (whitespace flags use 2 leftover bits of them). That uses less space than combined tables and saves one layer of indirection for some cases. js_CharAttribTableA has 16 bit entries. 8 bits would suffice with blocks of 64 instead of 32 code points, but js_CharAttribTableB would grow by more than js_CharAttribTableA would shrink then (difference is 320 bytes). 16 bits also give the opportunity to use direct indices, so one shift per lookup is saved. On the other hand, cache usage for lookups to limited ranges will be slightly higher. I've upgraded the XML name flags to XML 1.1. Note that there was a major change in the policy for XML names between 1.0 and 1.1. It's much more liberal now, allowing even unassigned code points to be more forward compatible. Currently, the XML flags are only an approximation to the XML 1.0 rules; apparently they have been hacked into the tables, ignoring the deviations from the Unicode categories (e.g. ª, º and µ from ISO-8859-1 have their flags set although they aren't valid characters for XML names). The patch also adds a small ASCII table to speed up some classification macros and hex digit evaluation, and to get rid of the ctype.h dependency, which will cause failures for locales that aren't ASCII compatible. JS_TOLOWER was misused in some instances, where really only ASCII matching should be done. I fixed them and removed the macro. The valid use instances got specialized code to cope with multi-character results and to optimize the lookup. The patch doesn't implement context sensitive case mapping. Without a specific locale, there's only one case where context sensitive case mapping would be required: Lower case of Σ should be ς instead of σ at the end of a word. Note that the case folding mechanism in the patch (that fixes bug 428816) doesn't follow Unicode rules, but the poorer ECMA3 rules (they don't seem to change in ECMA4, except for the clarification that a user-supplied String.prototype.toUpperCase shouldn't be used for it). 5 testcases fail with the patch because they depend on the old Unicode version. They need to be upgraded too: ecma/String/15.5.4.11-2.js: U+10A0..U+10C5 GEORGIAN CAPITAL LETTER * They have now their own lower case variants and do not longer map to the caseless modern Georgian letters. ecma/String/15.5.4.11-5.js, ecma/String/15.5.4.12-4.js: U+0400 CYRILLIC CAPITAL LETTER IE WITH GRAVE U+040D CYRILLIC CAPITAL LETTER I WITH GRAVE U+0450 CYRILLIC SMALL LETTER IE WITH GRAVE U+045D CYRILLIC SMALL LETTER I WITH GRAVE New characters; the testcase assumes absent upper and lower case for unknown characters. ecma/String/15.5.4.12-1.js: U+00B5 MICRO SIGN U+0149 LATIN SMALL LETTER N PRECEDED BY APOSTROPHE MICRO SIGN upcases to GREEK CAPITAL LETTER MU now. U+0149 had already in Unicode 2.1.8 a multi-character upper case that should be used in toUpperCase per ECMA3, so the testcase is wrong independent of the Unicode version. ecma/String/15.5.4.12-5.js: U+0587 ARMENIAN SMALL LIGATURE ECH YIWN Also bogus testcase because of ignored multi-character upper case. The test in attachment 294603 [details] fails almost completely because it doesn't expect multi-character results for a single character and all results are shifted by the additional characters.
Expects http://www.unicode.org/Public/UNIDATA/UnicodeData.txt and http://www.unicode.org/Public/UNIDATA/SpecialCasing.txt in the current directory, and also the old mozilla/js/src/jsstr.c to report changes from the old Unicode tables (you need to edit the file to change options).
Attachment #317828 - Attachment mime type: application/octet-stream → text/plain
Comprehensive tests for toLowerCase and toUpperCase with Unicode 5.1.0 data. Obsoletes tests/ecma/String/15.5.4.1[12]-*.js. Note that the tests of general function object properties currently fail due to bug 202019.
(In reply to comment #27) > Created an attachment (id=318085) [details] > testcases for tests/ecma_3/String > > Comprehensive tests for toLowerCase and toUpperCase with Unicode 5.1.0 data. > Obsoletes tests/ecma/String/15.5.4.1[12]-*.js. You should call printBugNumber(BUGNUMBER); just prior to printStatus(summary) The tests define escape() which conflicts with a native function, but could use escapeString() from bug 431108 I think. To obsolete tests, add them to spidermonkey-n.tests (in general) and the appropriate branch dependent versions e.g. spidermonkey-n-1.9.0.tests. Is this going to land on 1.9.0 or a 1.9.1 branch?
(In reply to comment #28) > Is this going to land on 1.9.0 or a 1.9.1 branch? Gecko 1.9.0 supports Unicode 5.0 so this shouldn't land on 1.9.0 as it stands. In general, landing should be coordinated with bug 427350.
(In reply to comment #29) > Gecko 1.9.0 supports Unicode 5.0 so this shouldn't land on 1.9.0 as it stands. > In general, landing should be coordinated with bug 427350. The changes between 5.0 and 5.1 are minor compared to those between the currently used version (don't know what exact version that is) and 5.0. These changes from 5.0 to 5.1 affect the tables in the patch: - 5 new capital characters with case mappings to existing characters: U+2C6D LATIN CAPITAL LETTER ALPHA <-> U+0251 LATIN SMALL LETTER ALPHA U+2C6E LATIN CAPITAL LETTER M WITH HOOK <-> U+0271 LATIN SMALL LETTER M WITH HOOK U+2C6F LATIN CAPITAL LETTER TURNED A <-> U+0250 LATIN SMALL LETTER TURNED A U+03CF GREEK CAPITAL KAI SYMBOL <-> U+03D7 GREEK KAI SYMBOL U+A77D LATIN CAPITAL LETTER INSULAR G <-> U+1D79 LATIN SMALL LETTER INSULAR G - 1 new capital character with lower case mapping to an existing one (but not the reverse, so ECMAScript 3 case folding won't be affected): U+1E9E;LATIN CAPITAL LETTER SHARP S -> U+00DF LATIN SMALL LETTER SHARP S - 95 new upper/lower case pairs in the BMP - 1120 caseless new characters in the BMP - 2 General Categoriy changes from Sk to Lm (become valid ECMAScript 3 ids): U+02EC MODIFIER LETTER VOICING U+0374 GREEK NUMERAL SIGN - 2 other General Categoriy changes (not observable without bug 258974 fixed): U+05BE HEBREW PUNCTUATION MAQAF Po -> Pd U+A802 SYLOTI NAGRI SIGN DVISVARA Mc -> Mn But I don't expect to get this into 1.9.0. Although I think it is save (after some more testing) and adds much value, it's still a large patch, and it has much changes to the set of allowed characters for identifiers and XML names (within js).
We need an owner for this bug, plus an assessment against the ES3.1 spec (either this bug's patch or the draft 3.1 spec could be buggy, or both!). /be
Flags: wanted1.9.2?
We have (according to mdc, and I've used them in code AFAIK) toLocaleLowerCase and toLocaleUpperCase. Should we just be using those instead?
(In reply to comment #32) > We have (according to mdc, and I've used them in code AFAIK) toLocaleLowerCase > and toLocaleUpperCase. Should we just be using those instead? From ES3: 15.5.4.17 String.prototype.toLocaleLowerCase ( ) This function works exactly the same as toLowerCase except that its result is intended to yield the correct result for the host environment’s current locale, rather than a locale-independent result. There will only be a difference in the few cases (such as Turkish) where the rules for that language conflict with the regular Unicode case mappings. Not relevant here. This bug is about out of date Unicode tables (universal, hence "Unicode" ;-), not failure to choose a locale-specific case conversion. /be
(In reply to comment #29) > Gecko 1.9.0 supports Unicode 5.0 so this shouldn't land on 1.9.0 as it stands. > In general, landing should be coordinated with bug 427350. That's bug's been fixed, can we look at targeting 1.9.2? [Too late for 1.9.1 now.] (In reply to comment #31) > We need an owner for this bug, plus an assessment against the ES3.1 spec > (either this bug's patch or the draft 3.1 spec could be buggy, or both!). Who'd be a suitable owner for this? What does the assessment entail (or is that obvious to whoever picks this up :)?
x0, are you still owning this bug? We can reassign if you are busy elsewhere. Thanks, /be
filed bug 536672 for the 'ß'.toUpperCase() bug. Feel free to dupe to here or make it dependent.
(In reply to comment #37) > filed bug 536672 for the 'ß'.toUpperCase() bug. Feel free to dupe to here or > make it dependent. I think these bugs can stay separate, as they address pretty separate issues. I repeat Brendan's call for an owner for this. x0, feel free to put yourself back on if you want to work on this. By the way, I think we are going to have to move to something new from the current table system (which was imported from Java 1.1 or so, I think) because it isn't able to encode the data for certain characters, e.g., some "middle-case" characters. A newer JS or Java implementation would probably be a good place to start. These days, I think Java is still using a table format, but a less compressed one.
Assignee: x00000000 → general
Taking
Assignee: general → smontagu
I'm trying to catch up with what has already been said and done in this bug. I tried running attachment 317828 [details] against the current versions (5.2.0) of UnicodeData.txt and SpecialCasing.txt, but it failed with Died at uniprop.pl line 598 (In reply to comment #38) > By the way, I think we are going to have to move to something new from the > current table system (which was imported from Java 1.1 or so, I think) because > it isn't able to encode the data for certain characters, e.g., some > "middle-case" characters. I don't understand this comment, specifically I don't know what you mean by "middle-case" characters. Can you clarify? (In reply to comment #31) > We need an owner for this bug, plus an assessment against the ES3.1 spec > (either this bug's patch or the draft 3.1 spec could be buggy, or both!). Preliminary assessment of the spec: the definitions of toLowerCase and toUpperCase in http://www.ecma-international.org/publications/files/ECMA-ST/ECMA-262.pdf are a little vague, but it's clear that the intention is to follow the mappings in the Unicode Standard (except for supplementary characters). The definition of case-insignificant matching in 15.10.2.8 is not the same as the Unicode definition of simple (i.e. preserving string length) caseless matching, but I need further analysis to be sure what all the practical differences are. Note: being different from the Unicode standard doesn't mean that ES is not conformant to Unicode, since the Unicode standard explicitly says (C20 in http://unicode.org/versions/Unicode5.2.0/ch03.pdf) "A conformant implementation may perform casing operations that are different from the default algorithms, perhaps tailored to a particular orthography, so long as the fact that a tailoring is applied is disclosed."
(In reply to comment #20) > The testcase in attachment 294603 [details] isn't complete. It misses the cases where > String.prototype.toUpperCase should return a multi-character string for a > single character, e.g.: > > "\xdf".toUpperCase() should return "SS" (German sharp s 'ß') > "\u0130".toLowerCase() should return "i\u0307" (I with point above 'İ') > "\ufb00".toUpperCase() should return "FF" (ligature ff 'ff') roc raised the question whether we want to do this in principle? In other words are we OK with the possibility that foo.toLowerCase().length or foo.toUpperCase().length is not necessarily equal to foo.length, or could that break things?
(In reply to comment #41) > roc raised the question whether we want to do this in principle? In other words > are we OK with the possibility that foo.toLowerCase().length or > foo.toUpperCase().length is not necessarily equal to foo.length, or could that > break things? Shouldn't ligatures and the German double S characters count as two?
See https://mail.mozilla.org/pipermail/es-discuss/2009-March/009086.html and other messages in that thread. Quoting Waldemar: "The reason the ES3 specification was the way it was is because converting one character to many during case conversions would be incompatible with regular expressions. The regular expression algorithm refers to String.prototype.toUpperCase." Simon, if you have any comments in reply to Allen Wirfs-Brock's call to Unicode experts, please lay them on es-discuss: https://mail.mozilla.org/pipermail/es-discuss/2009-March/008921.html /be
I bet dmandelin meant "titlecase", not "middlecase". So comment 43 is meant to cite the argument for answering "no, not OK" to roc's question Simon quoted in comment 41: "roc raised the question whether we want to do this in principle? In other words are we OK with the possibility that foo.toLowerCase().length or foo.toUpperCase().length is not necessarily equal to foo.length, or could that break things?" Or to answer the last clause, changing length when changing case would break JS regular expressions. /be
(In reply to comment #44) > Or to answer the last clause, changing length when changing case would break JS > regular expressions. Just to clarify: is that taking into account that the regular expression algorithm is different from the toUpperCase/toLowerCase algorithm and doesn't allow changing length?
(In reply to comment #45) > (In reply to comment #44) > > Or to answer the last clause, changing length when changing case would break JS > > regular expressions. > > Just to clarify: is that taking into account that the regular expression > algorithm is different from the toUpperCase/toLowerCase algorithm and doesn't > allow changing length? No, it's because as Waldemar noted, the regexp code uses the same algorithm -- it uses the original builtin String.prototype.toUpperCase. If that function returns a multi-character string, the result is discarded and the input char is used. See Canonicalize in ES3 or ES5 (link to ES5 spec at http://www.ecmascript.org/). /be
Depends on: 672042
Mass-moving existing Intl-related bugs to the new Core :: JavaScript: Internationalization API component. If you think this bug has been moved in error, feel free to move it back to Core :: JavaScript Engine. [Mass change filter: core-js-intl-api-move]
Component: JavaScript Engine → JavaScript: Internationalization API
Blocks: test262
I am going to assume this was fixed by bug 1318403. Probably not much value in keeping this old bug open, even the Unicode tables changed in the meantime.
Status: NEW → RESOLVED
Closed: 17 years ago8 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: