Closed Bug 1685075 Opened 2 years ago Closed 2 years ago

localeCompare behaves differently in C.utf8 locale

Categories

(Core :: JavaScript: Internationalization API, defect, P2)

Firefox 86
defect

RESOLVED FIXED
91 Branch
Tracking Status
firefox-esr78 --- unaffected
firefox84 --- unaffected
firefox85 --- wontfix
firefox86 --- wontfix
firefox87 --- wontfix
firefox88 --- wontfix
firefox89 --- wontfix
firefox90 --- wontfix
firefox91 --- fixed

People

(Reporter: marusak.matej, Assigned: anba)

References

(Regression)

Details

(Keywords: regression)

Attachments

(3 files)

User Agent: Mozilla/5.0 (X11; Linux x86_64; rv:83.0) Gecko/20100101 Firefox/83.0

Steps to reproduce:

I am using the current firefox-nightly (I have it symlinked to 'firefox' in the following examples). This started happening a few weeks ago. Unfortunately, I cannot pinpoint the exact version.

$ LC_ALL=C.utf8 firefox
"Virtio block device".localeCompare("Virtio SCSI")
1
$ LC_ALL=en_US.UTF-8 firefox
"Virtio block device".localeCompare("Virtio SCSI")
-1

In Chrome it is always -1.

Even with { sensitivity: 'base' } passed to localeCompare, the behavior stays the same.

I am on Fedora 33.

Actual results:

The result differs depending on the locale, and it differs from what other browsers do.

Expected results:

It should always be -1, regardless of the locale.
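
As a possible workaround for the report above, here is a minimal sketch that pins the collation locale instead of relying on the environment's default. It assumes "en-US" is the ordering the page actually wants; the variable names are illustrative only.

```javascript
// Sketch: pass an explicit locale so the comparison does not depend on
// whatever default locale the browser derives from LC_ALL / LANG.
const left = "Virtio block device";
const right = "Virtio SCSI";

// Explicit locale argument to localeCompare.
const viaLocaleCompare = left.localeCompare(right, "en-US");

// Same comparison through a reusable collator, which is usually the
// better choice when sorting many strings.
const collator = new Intl.Collator("en-US");
const viaCollator = collator.compare(left, right);

// Only the sign is guaranteed by the specification, so log the sign.
console.log(Math.sign(viaLocaleCompare), Math.sign(viaCollator));
```

Both calls should report a negative value, matching the en_US.UTF-8 run above.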

Component: Untriaged → JavaScript: Standard Library
Product: Firefox → Core

This is a regression from bug 1635561.

STR:

  • Run LC_ALL=C.UTF-8 mozregression
  • Open dev-console and evaluate Intl.Collator().resolvedOptions().locale

Before bug 1635561, this returned "en-US", but now it is returning "en-US-posix".
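
The check from the STR can be wrapped in a small helper when testing several builds. This is only a sketch: `hasPosixSubtag` and `describeDefaultCollatorLocale` are hypothetical names, and the subtag test is a plain string check, not a full BCP 47 parser.

```javascript
// Returns true if the language tag carries a "posix" subtag, either as a
// variant ("en-US-posix") or inside a Unicode extension ("en-US-u-va-posix").
function hasPosixSubtag(tag) {
  return tag.toLowerCase().split("-").includes("posix");
}

// Reports which locale a default-constructed collator resolved to.
function describeDefaultCollatorLocale() {
  const locale = new Intl.Collator().resolvedOptions().locale;
  return { locale, usesPosix: hasPosixSubtag(locale) };
}

console.log(describeDefaultCollatorLocale());
```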

mozregression output:

12:29.00 INFO: No more integration revisions, bisection finished.
12:29.00 INFO: Last good revision: 1ce1ac399abc56ace9d4dd63190dcd3cf897e59a
12:29.00 INFO: First bad revision: 47eb8c778c414b89da6f59c092531252412e7fcb
12:29.00 INFO: Pushlog:
https://hg.mozilla.org/integration/autoland/pushloghtml?fromchange=1ce1ac399abc56ace9d4dd63190dcd3cf897e59a&tochange=47eb8c778c414b89da6f59c092531252412e7fcb
Status: UNCONFIRMED → NEW
Component: JavaScript: Standard Library → Internationalization
Ever confirmed: true
Regressed by: 1635561
Has Regression Range: --- → yes

I ran a quick comparison between Firefox and Chrome:

let a = Intl.Collator("en-US");
let b = Intl.Collator("en-US-posix");
a.compare("Virtio block device", "Virtio SCSI") // -1 in both Chrome and Firefox
b.compare("Virtio block device", "Virtio SCSI") // -1 in Chrome, 1 in Firefox
a.resolvedOptions().locale // "en-US" in both Chrome and Firefox
b.resolvedOptions().locale // "en-US" in Chrome, "en-US-posix" in Firefox
Assignee: nobody → dminor
Severity: -- → S3

Is this V8's behavior, ours, or implementation-specific behavior?

Flags: needinfo?(andrebargull)

V8/Chrome doesn't ship "en-US-posix", so any request for it will always return the "en-US" fallback. I guess https://phabricator.services.mozilla.com/D98390 changed our behaviour, but I can't tell if the old or the new behaviour is more correct.

ICU changes "C" to "en-US-posix" in uprv_getDefaultLocaleID(), so at least from ICU's side using "en-US-posix" is correct. The collation difference happens because "en-US-posix" uses different rules, cf. https://searchfox.org/mozilla-central/source/intl/icu/source/data/coll/en_US_POSIX.txt.
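
To illustrate why the two locales disagree, here is a rough sketch that compares the strings by code point. It assumes the POSIX tailoring orders ASCII characters by code point, which is an assumption about the linked rules file and not something stated in this bug; `compareByCodePoint` is a hypothetical helper.

```javascript
// Compare two strings code point by code point, returning -1, 0, or 1.
function compareByCodePoint(a, b) {
  const xs = Array.from(a, (ch) => ch.codePointAt(0));
  const ys = Array.from(b, (ch) => ch.codePointAt(0));
  const n = Math.min(xs.length, ys.length);
  for (let i = 0; i < n; i++) {
    if (xs[i] !== ys[i]) return xs[i] < ys[i] ? -1 : 1;
  }
  // Common prefix is identical: the shorter string sorts first.
  return Math.sign(xs.length - ys.length);
}

// "b" (U+0062) sorts after "S" (U+0053) by code point, so this is 1.
const byCodePoint = compareByCodePoint("Virtio block device", "Virtio SCSI");

// The regular "en-US" rules ignore case at the primary level: b < s.
const byEnUS = new Intl.Collator("en-US").compare("Virtio block device", "Virtio SCSI");

console.log(byCodePoint, Math.sign(byEnUS));
```

Under that assumption, the code point comparison reproduces the 1 seen with LC_ALL=C.utf8, while the "en-US" collator gives a negative result.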

Flags: needinfo?(andrebargull)

Ok, so it seems to me like our behavior is correct. The only wiggle room I see is:

  1. Do we want to ship en-US-posix CLDR data?
  2. Should we read C as en-US-posix just because ICU does?

Set release status flags based on info from the regressing bug 1635561

Hmm, ICU canonicalises "en-US-posix" to "en-US-u-va-posix" (cf. Intl.getCanonicalLocales("en-us-posix") in V8/JSC), even though there's no variant mapping for "posix" in https://github.com/unicode-org/cldr/blob/master/common/supplemental/supplementalMetadata.xml. So when Intl.Collator("en-US-posix") is called, "en-US-posix" is first canonicalised through CanonicalizeLocaleList (which results in "en-US-u-va-posix" in V8/JSC). Then, when LookupMatcher searches for an available locale, any Unicode extension sequences are removed (which means "en-US-u-va-posix" is changed to "en-US" in V8/JSC).

So Intl.Collator("en-US-posix") doesn't use the "en-US-posix" locale in V8, because V8 doesn't ship "en-US-posix". And it also doesn't work in JSC, because JSC calls ICU canonicalisation functions which make it impossible to select "en-US-posix".
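
The lookup step described above can be sketched roughly as follows. This is a simplified model of LookupMatcher, not the ECMA-402 algorithm verbatim and not any engine's code: `stripUnicodeExtension` and `lookupMatcher` are hypothetical helpers, the regular expression is an approximation, and inputs are assumed to be canonicalised already.

```javascript
// Remove a Unicode locale extension sequence such as "-u-va-posix".
// Approximation: matches "-u" followed by one or more 2-8 character subtags.
function stripUnicodeExtension(tag) {
  return tag.replace(/-u(?:-[0-9a-z]{2,8})+/i, "");
}

// Simplified lookup: strip the extension, then drop trailing subtags
// until something in availableLocales matches.
function lookupMatcher(availableLocales, requestedLocales) {
  for (const requested of requestedLocales) {
    let candidate = stripUnicodeExtension(requested);
    while (candidate) {
      if (availableLocales.includes(candidate)) return candidate;
      let pos = candidate.lastIndexOf("-");
      if (pos === -1) break;
      // Also drop a dangling singleton subtag, e.g. "-x" or "-t".
      if (pos >= 2 && candidate[pos - 2] === "-") pos -= 2;
      candidate = candidate.slice(0, pos);
    }
  }
  return undefined;
}

// Canonicalised the ICU way, the request can only ever reach "en-US".
console.log(lookupMatcher(["en-US-posix", "en-US"], ["en-US-u-va-posix"]));
// Left uncanonicalised, the same request selects the POSIX data.
console.log(lookupMatcher(["en-US-posix", "en-US"], ["en-US-posix"]));
```

In this model, whether "en-US-posix" is selectable depends entirely on how the request is canonicalised before the lookup runs.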

Maybe the Intl.getCanonicalLocales("en-us-posix") case should go into test262. This will cause test errors in V8/JSC, which may encourage someone to fix this case in ICU... :-)

Thank you for the analysis!

I reported it in https://github.com/tc39/test262/issues/2928 and pending their resolution we'll likely close this bug.

Andre - from the upstream ticket it seems that en-US-posix canonicalization should lead to en-US-u-va-posix according to LDML, and that this is not just an ICU4C implementation detail.

Would you agree that it means that our implementation is not performing full canonicalization?

Flags: needinfo?(andrebargull)

I don't think https://unicode.org/reports/tr35/#Legacy_Variants applies to "Unicode BCP 47 locale identifiers", but instead only to older locale identifier syntaxes. In the test262 ticket, you mentioned:

[...] https://unicode.org/reports/tr35/#Canonical_Unicode_Locale_Identifiers which calls https://unicode.org/reports/tr35/#Legacy_Variants .

But I don't see any reference to "3.8.2 Legacy Variants" in "3.2.1 Canonical Unicode Locale Identifiers". And I also don't see it mentioned in Annex C. LocaleId Canonicalization.

Therefore I still think the correct canonicalisation (in an ECMA-402 context) for en-US-posix is en-US-posix.

Flags: needinfo?(andrebargull)

My mistake. I also cannot find a reference to 3.8.2 from 3.2.1. Reported upstream.

Component: Internationalization → JavaScript: Internationalization API
Priority: -- → P2

We now have CLDR consensus - https://unicode-org.atlassian.net/browse/CLDR-14487 - LDML will get updated to apply legacy variants during canonicalization.

Okay, if the resolution is to canonicalise "en-US-posix" to "en-US-u-va-posix", we should simply strip "en-US-posix" from the ICU data file, because "en-US-u-va-posix" can never be selected from Intl service constructors.

From https://tc39.es/ecma402/#sec-internal-slots:

[[AvailableLocales]] is a List that contains structurally valid (6.2.2) and canonicalized (6.2.3) Unicode BCP 47 locale identifiers [...]. Language tags on the list must not have a Unicode locale extension sequence. [...]

Because elements in [[AvailableLocales]] mustn't have Unicode locale extension sequences, like for example "u-va-posix", the input "en-US-u-va-posix" can never be resolved from LookupMatcher and BestFitMatcher and therefore it doesn't make sense to ship the data for it.

Zibi, do you agree with the plan to remove "en-US-posix" from the ICU data file?

Flags: needinfo?(zbraniecki)

Zibi, do you agree with the plan to remove "en-US-posix" from the ICU data file?

Yes. I'm comfortable with it. The data seems to be mostly confusing users and causing web compat issues since other browsers do not ship it.

Flags: needinfo?(zbraniecki)

Replace "whitelist" and "blacklist" with "includelist" and "excludelist", respectively,
because the latter are now the preferred names in ICU and the ICU docs/examples all use
the new names.

The filter file doesn't support exclusion lists for the "locales" filter type
(https://github.com/unicode-org/icu/blob/main/docs/userguide/icu_data/buildtool.md#filtering-by-locale),
therefore we have to manually exclude "en-US-posix" from the relevant resource
types: "en-US-posix" data is only present for collation, locales, and break
iteration. Break iteration is already completely stripped from the data file,
so we don't need to change anything on that front.

The string must be "en_US_POSIX" to match the resource file name, also see
https://unicode-org.atlassian.net/browse/ICU-21400.

Depends on D117975

This change ensures we don't report "en-US-posix" as the default locale when
LANG=C is set by the user, because that could be confusing after part 2.

The current rules about selecting the appropriate default locale were last
discussed in https://bugzilla.mozilla.org/show_bug.cgi?id=1175347. The
preference in that bug was to accept every part of the default locale as long
as there's a possible fallback locale. For example when the user locale is
"de-ZA", which can be supported through the fallback to "de", "de-ZA" as a whole
is accepted. But "de-ZA" is not accepted when the default locale is for example
just "de".
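
The rule from bug 1175347, as described above, can be sketched like this. It is a hypothetical model of the description, not SpiderMonkey's code; `acceptDefaultLocale` and its return convention are assumptions made for illustration.

```javascript
// Accept the user's locale as a whole if it, or any prefix obtained by
// dropping trailing subtags, is available. Returns undefined otherwise.
function acceptDefaultLocale(userLocale, availableLocales) {
  let candidate = userLocale;
  while (true) {
    if (availableLocales.includes(candidate)) return userLocale;
    const pos = candidate.lastIndexOf("-");
    if (pos === -1) return undefined;
    candidate = candidate.slice(0, pos);
  }
}

// "de-ZA" is accepted as a whole because the fallback "de" exists.
console.log(acceptDefaultLocale("de-ZA", ["de"]));
// A plain "de" stays "de"; it is never widened to "de-ZA".
console.log(acceptDefaultLocale("de", ["de"]));
// With the POSIX data removed, the same rule would still report
// "en-US-posix", because the fallback "en-US" remains available.
console.log(acceptDefaultLocale("en-US-posix", ["en-US"]));
```

The last call shows the situation this patch is meant to avoid: the reported default would name a locale whose data is no longer shipped.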

The test cases were adapted to use a locale which has multiple subtags and which
has only partial support in Intl.Collator: Intl.Collator only natively
supports "az", but not "az-Cyrl-AZ". "az-Cyrl-AZ" is completely supported by all
other Intl service constructors.

Depends on D117976

:anba, thanks for the patches!

Assignee: dminor → andrebargull
Pushed by andre.bargull@gmail.com:
https://hg.mozilla.org/integration/autoland/rev/38fcef1d6c87
Part 1: Replace black/white-list in ICU data filter file. r=zbraniecki
https://hg.mozilla.org/integration/autoland/rev/44cf438c40fd
Part 2: Remove "en-US-posix" locale from ICU data file. r=zbraniecki
https://hg.mozilla.org/integration/autoland/rev/8e44d65bbe08
Part 3: Use the actual supported locale when computing the default locale. r=zbraniecki
Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
Target Milestone: --- → 91 Branch

Since the statuses are different for nightly and release, what's the status for beta?
For more information, please visit auto_nag documentation.