Closed Bug 203179 Opened 21 years ago Closed 21 years ago

String.prototype.localeCompare() not sorting according to locale

Categories

(Core :: DOM: Core & HTML, defect)

x86
Windows 2000
defect
Not set
normal

Tracking

()

RESOLVED FIXED

People

(Reporter: bmo, Unassigned)

References

()

Details

Attachments

(1 file, 3 obsolete files)

User-Agent:       Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.4a) Gecko/20030401
Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.0; en-US; rv:1.4a) Gecko/20030401

The sample page demonstrates sorting using localeCompare which doesn't sort
correctly in Mozilla. It does however sort correctly in IE 6. I first found this
bug in code which is sorting an array of Japanese strings (Japanese Windows
2000), and found that it doesn't do a locale comparison there either, although
IE6 does.

Reproducible: Always

Steps to Reproduce:
1. go to the URL
2. 'a' and 'frisé' should both come before 'Mission'
Confirming report using Mozilla trunk binary 2003042308 on WinNT.
Compare Moz and IE6 on these examples:

'a'.localeCompare('Mission')
Moz ---> 20
IE6 ---> -1

'a'.localeCompare('a')
Moz ---> 0
IE6 ---> 0

'a'.localeCompare('A')
Moz ---> 32
IE6 ---> -1

'A'.localeCompare('a')
Moz ---> -32
IE6 ---> 1


Note this means in IE6, localeCompare() puts 'a' before 'A',
whereas the canonical string comparison does the reverse. 
In Mozilla, we get the same result for either comparison:


function f(x,y) {return x.localeCompare(y);};

IE6
['a', 'A'].sort()  --->  'A', 'a'
['a', 'A'].sort(f) --->  'a', 'A'

Moz
['a', 'A'].sort()  --->  'A', 'a'
['a', 'A'].sort(f) --->  'A', 'a'
Assignee: rogerl → khanson
Status: UNCONFIRMED → NEW
Ever confirmed: true
Summary: localeCompare not sorting according to locale → String.prototype.localeCompare() not sorting according to locale
From the ECMA-262 Edition 3 Final spec at 
http://www.mozilla.org/js/language/E262-3.pdf:

15.5.4.9 String.prototype.localeCompare (that)

When the localeCompare method is called with one argument |that|,
it returns a number other than NaN that represents the result of
a locale-sensitive string comparison of |this| object (converted
to a string) with |that| (converted to a string). The two strings
are compared in an implementation-defined fashion.

The result is intended to order strings in the sort order specified
by the system default locale, and will be negative, zero, or positive,
depending on whether |this| comes before |that| in the sort order,
the strings are equal, or |this| comes after |that| in the sort order,
respectively.

The actual return values are left implementation-defined to permit
implementers to encode additional information in the result value,
but the function is required to define a total ordering on all strings
and to return 0 when comparing two strings that are considered
canonically equivalent by the Unicode standard.

NOTE: This function is intended to rely on whatever language-sensitive
comparison functionality is available to the ECMAScript environment
from the host environment, and to compare according to the rules of
the host environment’s current locale. It is strongly recommended that
this function treat strings that are canonically equivalent according
to the Unicode standard as identical (i.e. compare the strings as if
they had both been converted to Normalised Form C or D first). It is
also recommended that this function not honour Unicode compatibility
equivalences or decompositions.

If no language-sensitive comparison at all is available from the host
environment, this function may perform a bitwise comparison.
For convenience:

         charCodeAt(0)    Unicode value (hex)
'A'          65                0041
'a'          97                0061
Also:

         charCodeAt(0)    Unicode value (hex)
'M'          77                004D
'a'          97                0061


So both Moz and IE6 will put 'M' before 'a' in the canonical string comparison,
hence 'Mission' before 'a'; as can be seen via these javascript:URLs

javascript: alert('M' < 'a')                 --->  true
javascript: alert(['a', 'Mission'].sort())   ---> 'Mission', 'a'


But as Brodie notes, in IE6, String.prototype.localeCompare() 
puts 'a' before 'M'.
I looked on the web and found a similar report here:
http://www.geocrawler.com/archives/3/113/2001/7/0/6213846/

Message: 6213846 
FROM:    Alex Iliev news
DATE:    07/19/2001 08:49:10
SUBJECT: JavaScript localeCompare()

Hello, perhaps someone can help me with a problem using localeCompare()
in Mozilla (Gecko/20010713 on LinuxPPC) - it does not give me correct
results in the German language, in particular when 'ß' (sz) is involved.
I set <html lang="de-DE"> in the page. still, 'ß' seems to be taken as
a single letter higher in the collation order than 'z'.

Am I specifying the language incorrectly, or is there some problem with 
the implementation?

Thanks very much,
Alex
Spidermonkey contains a function, JS_SetLocaleCallbacks, that's used to set up
the locale functions used to handle localeCompare, as well as toLocaleUpperCase
and toLocaleLowerCase.  In other words, it's the job of the embedding
application, not the JS Engine, to get this right.  But according to
http://lxr.mozilla.org/seamonkey/ident?i=JS_SetLocaleCallbacks, there's nothing
in the Mozilla Browser code that does this.

The component for this bug needs to be changed, but aside from Browser-General,
I'm not sure where it should go.

--scole
well DOM is the spidermonkey host, so i'll push it that way, someone can poke it 
from there.
Assignee: khanson → dom_bugs
Component: JavaScript Engine → DOM Core
QA Contact: pschwartau → desale
This hooks up locale callbacks in the JS engine, but it doesn't seem like we
have any locale aware compare methods in Mozilla, AFAIK, so this still doesn't
fix the problem...
Attachment #126028 - Attachment description: Hook up locale callbacks, though no local aware functionality yet. → Hook up locale callbacks, though no locale aware functionality yet.
Dunno about upper/lower case but nsICollation has a CompareString method...
Comment on attachment 126195 [details] [diff] [review]
locale aware string compare using nsICollation

>+  PRBool result;
>+  rv = gCollation->CompareString(kCollationStrengthDefault, str1, str2,
>+                                 &result);
Last time I looked, result was a PRInt32 (same as strcmp).
CompareString has a performance issue (bug 195008), so please be carefull if
this is performance critical.
Naoki, this is something that's extremely rarely used on webpages, so I think
it's fine to have non-optimal performance here until that bug is fixed.
Attachment #126028 - Attachment is obsolete: true
Attachment #126195 - Attachment is obsolete: true
Attachment #126320 - Flags: superreview?(bzbarsky)
Attachment #126320 - Flags: review?(nhotta)
Comment on attachment 126320 [details] [diff] [review]
Same as above, but clean up gCollation on shutdown.

Bleh, forgot the PRBool -> PRInt32 change...
Attachment #126320 - Attachment is obsolete: true
Attachment #126320 - Flags: superreview?(bzbarsky)
Attachment #126320 - Flags: review?(nhotta)
Attachment #126323 - Flags: superreview?(bzbarsky)
Attachment #126323 - Flags: review?(nhotta)
Comment on attachment 126323 [details] [diff] [review]
Proposed fix.

Is there a reason for removing the initializers on the statics?  I know they
should be 0 anyway, but the code as written makes that clearer...

Also, why use the CID instead of the contractid for the nsICollationFactory ?

In any case, sr=me with either way on those two questions.
Attachment #126323 - Flags: superreview?(bzbarsky) → superreview+
Um, there is no contract id for the collation factory (AFAICT). I put back the
initializers for the statics, the reason I took them out was to make things
consistent, now I went the other way in stead, and added more of them...
Ugh.  Ok, no contractid means you can't use one.. ;)
Comment on attachment 126323 [details] [diff] [review]
Proposed fix.

>+// Thank you Microsoft!
>+#ifdef CompareString
>+#undef CompareString
>+#endif

Sigh. I hate to see another of these go into the tree, but what else can you
do?

>+static JSBool
>+ChangeCase(JSContext *cx, JSString *src, jsval *rval,
>+           void(* changeCaseFnc)(const nsAString&, nsAString&))
>+{
>+  nsDependentString str(NS_REINTERPRET_CAST(const PRUnichar *,
>+                                             ::JS_GetStringChars(src)),
>+                         ::JS_GetStringLength(src));

Nit: regularize the indentation here

r=smontagu
Attachment #126323 - Flags: review?(nhotta) → review+
Fixed.
Status: NEW → RESOLVED
Closed: 21 years ago
Resolution: --- → FIXED
Any toolchain that doesn't ensure uninitialized global (extern/static) variables
are zero-initialized is busted.  But last I checked (a while ago), putting = 0;
all over the place caused explicitly-zero-initialized variables to be put in
.data (initialized data), not in .bss (blank starting store), taking up space in
the object file and time loading (more time than asking the kernel for a
pre-zeroed page).  So I'm generally against explicit zero initialization of globals.

Could the #undef CompareString jazz go into a %{C++ section in the IDL file?

/be
Unfortunately putting the #undef in the .idl doesn't really fix the problem, it
only fixes it in the cases where the .idl file (or generated .h file) is
#included after the file that pulls in MS' #define of CompareString (as was the
case in this patch). This sucks rocks, but I don't think we can do much better
than what we do now, thanks to MS.

I'll pull out those initializers again from all the statics. I don't care either
way, as long as we're consistent.
The #undef can still win in the .idl file if we follow the rule that system
headers (those included via <...> notation) are always included before local
("...") ones.  We seem to follow that rule generally in code I care about.  It's
an old bit of folk wisdom in Unix circles (whence C came).  If it helps
consolidate the #undef next time there's a conflict, let's do it, rather than
copy the #undef all over.

/be
Yeah, if there is a perf hit for having the initializers, remove them.
The problem here is that the header files that pull in the #define CompareString
is included through other mozilla #includes, in this case, either some of the
nspr headers, nsILiveConnect, or nsIJVMManager. I don't have a problem putting
the #undef in the idl, but I still don't think it'll save us much...
In this case, it's nsIJVMManager.h that pulls in whatever header (through jni.h)
that #defines CompareString.
Component: DOM: Core → DOM: Core & HTML
QA Contact: desale → general
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: