localeCompare behaves differently in C.utf8 locale
Categories
(Core :: JavaScript: Internationalization API, defect, P2)
Tracking
()
Release | Tracking | Status |
---|---|---|
firefox-esr78 | --- | unaffected |
firefox84 | --- | unaffected |
firefox85 | --- | wontfix |
firefox86 | --- | wontfix |
firefox87 | --- | wontfix |
firefox88 | --- | wontfix |
firefox89 | --- | wontfix |
firefox90 | --- | wontfix |
firefox91 | --- | fixed |
People
(Reporter: marusak.matej, Assigned: anba)
References
(Regression)
Details
(Keywords: regression)
Attachments
(3 files)
User Agent: Mozilla/5.0 (X11; Linux x86_64; rv:83.0) Gecko/20100101 Firefox/83.0
Steps to reproduce:
I am using the current firefox-nightly (I have it symlinked to 'firefox' in the following examples). This started to happen a few weeks ago. Unfortunately, I cannot pinpoint the exact version.
$ LC_ALL=C.utf8 firefox
"Virtio block device".localeCompare("Virtio SCSI")
1
$ LC_ALL=en_US.UTF-8 firefox
"Virtio block device".localeCompare("Virtio SCSI")
-1
In Chrome it is always -1.
Even when passing { sensitivity: 'base' } to localeCompare, the behavior stays the same.
I am on Fedora 33.
Actual results:
The result differs depending on the locale, and it differs from what other browsers do.
Expected results:
It always should be -1 no matter the locale.
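Until the default-locale behaviour is settled, a page can sidestep it by naming the collation locale itself. The following is a minimal sketch, not a fix for the bug: compareLabels is a hypothetical helper name, and "en-US" is just the locale used elsewhere in this report.

```javascript
// Hypothetical helper: compare two strings with an explicit locale so the
// result does not depend on the process locale (LC_ALL / LANG).
function compareLabels(a, b, locale = "en-US") {
  // The second argument of localeCompare selects the collation locale.
  return a.localeCompare(b, locale);
}

// Negative means the first string sorts before the second.
console.log(compareLabels("Virtio block device", "Virtio SCSI") < 0);
```

The check uses `< 0` rather than `=== -1` because the specification only promises a negative number, not a specific value.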
Assignee | ||
Comment 1•5 years ago
|
||
This is a regression from bug 1635561.
STR:
- Run LC_ALL=C.UTF-8 mozregression
- Open dev-console and evaluate Intl.Collator().resolvedOptions().locale
Before bug 1635561, this returned "en-US", but now it is returning "en-US-posix".
mozregression output:
12:29.00 INFO: No more integration revisions, bisection finished.
12:29.00 INFO: Last good revision: 1ce1ac399abc56ace9d4dd63190dcd3cf897e59a
12:29.00 INFO: First bad revision: 47eb8c778c414b89da6f59c092531252412e7fcb
12:29.00 INFO: Pushlog:
https://hg.mozilla.org/integration/autoland/pushloghtml?fromchange=1ce1ac399abc56ace9d4dd63190dcd3cf897e59a&tochange=47eb8c778c414b89da6f59c092531252412e7fcb
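The dev-console check from the STR above can be wrapped into a small snippet for pasting into affected builds. This is only a sketch; defaultCollatorLocale and hasPosixSubtag are hypothetical helper names, and the subtag test is a plain string check rather than a full BCP 47 parse.

```javascript
// Hypothetical helper: the locale Intl.Collator resolves to when no locale
// is requested, i.e. the engine's default locale.
function defaultCollatorLocale() {
  return new Intl.Collator().resolvedOptions().locale;
}

// Hypothetical helper: true if any subtag of the tag is "posix"
// (covers both "en-US-posix" and "en-US-u-va-posix").
function hasPosixSubtag(tag) {
  return tag.toLowerCase().split("-").includes("posix");
}

const resolved = defaultCollatorLocale();
console.log(resolved, hasPosixSubtag(resolved));
```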
Comment 2•5 years ago
|
||
I ran a quick comparison between Firefox and Chrome:
let a = Intl.Collator("en-US");
let b = Intl.Collator("en-US-posix");
a.compare("Virtio block device", "Virtio SCSI") // -1 in both Chrome and Firefox
b.compare("Virtio block device", "Virtio SCSI") // -1 in Chrome, 1 in Firefox
a.resolvedOptions().locale // "en-US" in both Chrome and Firefox
b.resolvedOptions().locale // "en-US" in Chrome, "en-US-posix" in Firefox
Comment 3•5 years ago
|
||
Is this a bug in V8, a bug on our side, or implementation-specific behavior?
Assignee | ||
Comment 4•5 years ago
|
||
V8/Chrome doesn't ship "en-US-posix", so any request for it will always return the "en-US" fallback. I guess https://phabricator.services.mozilla.com/D98390 changed our behaviour, but I can't tell if the old or the new behaviour is more correct.
ICU changes "C" to "en-US-posix" in uprv_getDefaultLocaleID(), so at least from ICU's side using "en-US-posix" is correct. The collation difference happens because "en-US-posix" uses different rules, cf. https://searchfox.org/mozilla-central/source/intl/icu/source/data/coll/en_US_POSIX.txt.
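To illustrate why different rules can flip the sign, here is a rough sketch that compares strings by code point. It assumes that the "en-US-posix" rules order ASCII characters approximately by code point (uppercase before lowercase); that is an assumption about the linked rules file, not something taken from it line by line. codePointCompare is a hypothetical helper.

```javascript
// Hypothetical helper: compare two strings code point by code point,
// returning -1, 0 or 1. This only approximates a POSIX-style ordering.
function codePointCompare(a, b) {
  const as = Array.from(a);
  const bs = Array.from(b);
  const n = Math.min(as.length, bs.length);
  for (let i = 0; i < n; i++) {
    const diff = as[i].codePointAt(0) - bs[i].codePointAt(0);
    if (diff !== 0) return Math.sign(diff);
  }
  // Common prefix: the shorter string sorts first.
  return Math.sign(as.length - bs.length);
}

// "b" (U+0062) is greater than "S" (U+0053), so this prints 1,
// matching the result reported for the C.utf8 locale.
console.log(codePointCompare("Virtio block device", "Virtio SCSI"));
```

Under a case-insensitive primary ordering such as the regular "en-US" rules, "b" sorts before "s", which gives the -1 seen in the other locale.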
Comment 5•5 years ago
|
||
Ok, so it seems to me like our behavior is correct. The only wiggle room I see is:
- Do we want to ship en-US-posix CLDR data?
- Should we read C as en-US-posix just because ICU does?
Comment 6•5 years ago
|
||
Set release status flags based on info from the regressing bug 1635561
Assignee | ||
Comment 7•5 years ago
|
||
Hmm, ICU canonicalises "en-US-posix" to "en-US-u-va-posix" (cf. Intl.getCanonicalLocales("en-us-posix") in V8/JSC), even though there's no variant mapping for "posix" in https://github.com/unicode-org/cldr/blob/master/common/supplemental/supplementalMetadata.xml. So when Intl.Collator("en-US-posix") is called, "en-US-posix" is first canonicalised through CanonicalizeLocaleList (which results in "en-US-u-va-posix" in V8/JSC). Then, when searching for an available locale in LookupMatcher, any Unicode extension sequences are removed (which means "en-US-u-va-posix" is changed to "en-US" in V8/JSC). So Intl.Collator("en-US-posix") doesn't use the "en-US-posix" locale in V8, because V8 doesn't ship "en-US-posix". And it also doesn't work in JSC, because JSC calls ICU canonicalisation functions which make it impossible to select "en-US-posix".
Maybe the Intl.getCanonicalLocales("en-us-posix") case should go into test262. This will cause test errors in V8/JSC, which may encourage someone to fix this case in ICU... :-)
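The two observable steps described above can be inspected side by side in any engine. A small sketch, with describeLocale as a hypothetical helper; the actual values depend on the engine and its ICU data, so no expected output is given.

```javascript
// Hypothetical helper: report how a tag is canonicalised and which locale
// Intl.Collator finally resolves for it.
function describeLocale(tag) {
  // Canonicalisation step (CanonicalizeLocaleList in ECMA-402).
  const [canonical] = Intl.getCanonicalLocales(tag);
  // Resolution step: the locale actually selected by the collator.
  const resolved = new Intl.Collator(tag).resolvedOptions().locale;
  return { requested: tag, canonical, resolved };
}

console.log(describeLocale("en-us-posix"));
```

Comparing the `canonical` and `resolved` fields across Firefox, V8 and JSC shows where the engines diverge.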
Comment 8•5 years ago
|
||
Thank you for the analysis!
I reported it in https://github.com/tc39/test262/issues/2928 and pending their resolution we'll likely close this bug.
Comment 9•5 years ago
|
||
Andre - from the upstream ticket it seems that en-US-posix canonicalization should lead to en-US-u-va-posix according to LDML, and that this is not just an ICU4C implementation detail.
Would you agree that it means that our implementation is not performing full canonicalization?
Assignee | ||
Comment 10•5 years ago
|
||
I don't think https://unicode.org/reports/tr35/#Legacy_Variants applies for "Unicode BCP 47 locale identifiers", but instead only for older locale identifier syntaxes. In the test262 ticket, you mentioned:
[...] https://unicode.org/reports/tr35/#Canonical_Unicode_Locale_Identifiers which calls https://unicode.org/reports/tr35/#Legacy_Variants .
But I don't see any reference to "3.8.2 Legacy Variants" in "3.2.1 Canonical Unicode Locale Identifiers". And I also don't see it mentioned in Annex C. LocaleId Canonicalization.
Therefore I still think the correct canonicalisation (in an ECMA-402 context) for en-US-posix is en-US-posix.
Comment 11•5 years ago
|
||
My mistake. I also cannot find a reference to 3.8.2 from 3.2.1. Reported upstream
Comment 12•4 years ago
|
||
We now have CLDR consensus - https://unicode-org.atlassian.net/browse/CLDR-14487 - LDML will get updated to apply legacy variants during canonicalization.
Assignee | ||
Comment 13•4 years ago
|
||
Okay, if the resolution is to canonicalise "en-US-posix" to "en-US-u-va-posix", we should simply strip "en-US-posix" from the ICU data file, because "en-US-u-va-posix" can never be selected from Intl service constructors.
From https://tc39.es/ecma402/#sec-internal-slots:
[[AvailableLocales]] is a List that contains structurally valid (6.2.2) and canonicalized (6.2.3) Unicode BCP 47 locale identifiers [...]. Language tags on the list must not have a Unicode locale extension sequence. [...]
Because elements in [[AvailableLocales]] mustn't have Unicode locale extension sequences, like for example "u-va-posix", the input "en-US-u-va-posix" can never be resolved from LookupMatcher and BestFitMatcher and therefore it doesn't make sense to ship the data for it.
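The argument above can be sketched with a simplified lookup. This is not the spec algorithm verbatim: stripUnicodeExtensions and lookupMatch are hypothetical helpers, the regular expression ignores private-use edge cases, and the available-locale lists are made up for illustration.

```javascript
// Hypothetical helper: drop a Unicode extension sequence ("-u-...") from a tag.
// Simplified: does not special-case private-use ("x-") sequences.
function stripUnicodeExtensions(tag) {
  return tag.replace(/-u(?:-[a-z0-9]{2,8})+/i, "");
}

// Hypothetical helper: simplified LookupMatcher-style search. Strips the
// extension, then removes trailing subtags until an available locale is found.
function lookupMatch(availableLocales, requested) {
  let candidate = stripUnicodeExtensions(requested);
  while (candidate) {
    if (availableLocales.includes(candidate)) return candidate;
    let pos = candidate.lastIndexOf("-");
    if (pos === -1) return undefined;
    // Also drop a singleton subtag that would otherwise be left dangling.
    if (pos >= 2 && candidate[pos - 2] === "-") pos -= 2;
    candidate = candidate.slice(0, pos);
  }
  return undefined;
}

// The "u-va-posix" part is gone before the list is consulted, so an
// "en-US-posix" entry in the data could never be the match.
console.log(lookupMatch(["en-US", "en", "de"], "en-US-u-va-posix"));
```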
Assignee | ||
Comment 14•4 years ago
|
||
Zibi, do you agree with the plan to remove "en-US-posix" from the ICU data file?
Comment 15•4 years ago
|
||
Zibi, do you agree with the plan to remove "en-US-posix" from the ICU data file?
Yes, I'm comfortable with it. The data seems to be mostly confusing users and causing web compat issues, since other browsers do not ship it.
Assignee | ||
Comment 16•4 years ago
|
||
Replace "whitelist" and "blacklist" with "includelist" and "excludelist" respectively, because
the latter are now the preferred names in ICU and the ICU docs/examples all use
the new names.
Assignee | ||
Comment 17•4 years ago
|
||
The filter file doesn't support exclusion lists for the "locales" filter type
(https://github.com/unicode-org/icu/blob/main/docs/userguide/icu_data/buildtool.md#filtering-by-locale),
therefore we have to manually exclude "en-US-posix" from the relevant resource
types: "en-US-posix" data is only present for collation, locales, and break
iteration. Break iteration is already completely stripped from the data file,
so we don't need to change anything on that front.
The string must be "en_US_POSIX" to match the resource file name, also see
https://unicode-org.atlassian.net/browse/ICU-21400.
Depends on D117975
Assignee | ||
Comment 18•4 years ago
|
||
This change ensures we don't report "en-US-posix" as the default locale when
LANG=C is set by the user, because that could be confusing after part 2.
The current rules about selecting the appropriate default locale were last
discussed in https://bugzilla.mozilla.org/show_bug.cgi?id=1175347. The
preference in that bug was to accept every part of the default locale as long
as there's a possible fallback locale. For example when the user locale is
"de-ZA", which can be supported through the fallback to "de", "de-ZA" as a whole
is accepted. But "de-ZA" is not accepted when the default locale is for example
just "de".
The test cases were adapted to use a locale which has multiple subtags and which
has only partial support in Intl.Collator: Intl.Collator only natively
supports "az", but not "az-Cyrl-AZ". "az-Cyrl-AZ" is completely supported by all
other Intl service constructors.
Depends on D117976
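The "accepted as a whole when a fallback exists" rule resembles what supportedLocalesOf reports for a requested tag, so it can be probed from script. A minimal sketch; supportedThroughFallback is a hypothetical helper, and the results depend on the locale data shipped with the engine.

```javascript
// Hypothetical helper: true if the constructor can serve the locale, either
// directly or through a fallback such as "de-ZA" -> "de".
function supportedThroughFallback(locale, ctor = Intl.Collator) {
  // supportedLocalesOf returns the requested tag when any fallback matches.
  return ctor.supportedLocalesOf([locale]).length > 0;
}

console.log(supportedThroughFallback("de-ZA"));
console.log(supportedThroughFallback("az-Cyrl-AZ", Intl.DateTimeFormat));
```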
Comment 20•4 years ago
|
||
Comment 21•4 years ago
|
||
bugherder |
https://hg.mozilla.org/mozilla-central/rev/38fcef1d6c87
https://hg.mozilla.org/mozilla-central/rev/44cf438c40fd
https://hg.mozilla.org/mozilla-central/rev/8e44d65bbe08
Comment 22•4 years ago
|
||
Since the statuses are different for nightly and release, what's the status for beta?
For more information, please visit auto_nag documentation.