Closed Bug 1522070 Opened 2 years ago Closed 1 year ago

Use UTS 35 Unicode BCP 47 Locale Identifiers instead of RFC-5646/6067 BCP 47 language tags

Categories

(Core :: JavaScript: Internationalization API, enhancement, P2)

Tracking

RESOLVED FIXED
mozilla70
Tracking Status
firefox70 --- fixed

People

(Reporter: anba, Assigned: anba)

Attachments

(18 files)

47 bytes, text/x-phabricator-request
Details | Review

Intl.Locale (bug 1433303) depends on https://github.com/tc39/ecma402/pull/289.

https://github.com/tc39/ecma402/pull/289#issuecomment-444178026 lists multiple open issues, but the PR got merged nonetheless, so it's not entirely clear what to implement in some edge cases.

I propose we first switch the language tag parser over to Unicode BCP 47 locale identifiers and if that works out without causing any web-compat issues, we can proceed to switch the canonicalisation to whatever UTS 35 specifies. (But also see https://github.com/tc39/ecma402/issues/330.)

https://github.com/tc39/ecma402/pull/289 changed ECMA-402 to use Unicode BCP47
locale identifiers instead of BCP47 language tags for language tags. That means
extlang subtags are no longer supported in language tags.

Irregular grandfathered language tags and regular grandfathered tags with
extlang-like subtags can't be parsed as Unicode BCP 47 locale identifiers, so
they now need to be rejected by the language tag parser.

Depends on D23536

Language tags consisting only of private-use subtags are not allowed in Unicode
BCP 47 locale identifiers.

Depends on D23537

Unicode BCP 47 locale identifiers don't support four letter language subtags.
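
Taken together, the restrictions from parts 1 to 4 (no extlang subtags, no irregular or extlang-like grandfathered tags, no private-use-only tags, no four-letter language subtags) mostly fall out of the UTS 35 grammar for the language/script/region/variant portion of an identifier. A rough sketch, not the actual parser from these patches: the name `isUnicodeLanguageId` is made up for illustration, and extensions, private-use sequences and duplicate-variant checks are deliberately left out.

```javascript
// Sketch only: approximates the language-id part of a Unicode BCP 47
// locale identifier. Extensions and "-x-" sequences are not handled.
const UNICODE_LANGUAGE_ID = new RegExp(
  "^(?:[a-z]{2,3}|[a-z]{5,8})" +               // language: 2-3 or 5-8 letters
  "(?:-[a-z]{4})?" +                            // optional script
  "(?:-(?:[a-z]{2}|[0-9]{3}))?" +               // optional region
  "(?:-(?:[a-z0-9]{5,8}|[0-9][a-z0-9]{3}))*$",  // zero or more variants
  "i"
);

// Hypothetical helper name, used only in this sketch.
function isUnicodeLanguageId(tag) {
  return UNICODE_LANGUAGE_ID.test(tag);
}

const samples = [
  "de-CH", "sl-IT-nedis", // still fine
  "zh-yue",               // extlang subtag
  "en-GB-oed",            // irregular grandfathered tag
  "zh-min-nan",           // regular grandfathered tag with extlang-like subtags
  "x-private",            // private-use only
  "abcd-US",              // four-letter language subtag
];
for (const tag of samples) {
  console.log(tag, isUnicodeLanguageId(tag));
}
```

Three-letter subtags such as "yue" or "oed" are too short for a script or variant and are not a valid region, which is why the tags containing them after the language subtag no longer parse.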

Depends on D23538

  • Strict parsing for "u" and "t" extensions is not yet implemented.
  • Canonicalisation per UTS 35 is also not yet implemented, so it still refers to BCP 47 tags.

Depends on D23539

Unicode BCP 47 locale identifiers have stricter requirements for the Unicode ("-u-") and
transformed content ("-t-") extension sequences.

  • Keys in Unicode extensions must be of the form "alphanum alpha".
  • Transformed content extensions need to be parsed following the transformed_extensions
    syntax from UTS 35.
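
To make the first bullet concrete, here is a minimal sketch of a strict check for the body of a Unicode extension (everything after "-u-"). It assumes the UTS 35 shape of optional leading attributes followed by keywords, where a key is "alphanum alpha" and attributes and type subtags are 3 to 8 alphanumerics. The function name is hypothetical, duplicate keys are not detected, and the "-t-" grammar is not covered.

```javascript
// Sketch only, not the parser from this patch.
const KEY = /^[a-z0-9][a-z]$/;               // "alphanum alpha"
const ATTRIBUTE_OR_TYPE = /^[a-z0-9]{3,8}$/; // attribute or type subtag

// `body` is the part after "-u-", e.g. "ca-chinese".
function isValidUnicodeExtensionBody(body) {
  const subtags = body.toLowerCase().split("-");
  let i = 0;

  // Optional leading attributes.
  while (i < subtags.length && ATTRIBUTE_OR_TYPE.test(subtags[i])) {
    i++;
  }
  const sawAttribute = i > 0;

  // Keywords: a key followed by zero or more type subtags.
  let sawKeyword = false;
  while (i < subtags.length) {
    if (!KEY.test(subtags[i])) {
      return false;
    }
    sawKeyword = true;
    i++;
    while (i < subtags.length && ATTRIBUTE_OR_TYPE.test(subtags[i])) {
      i++;
    }
  }
  return sawAttribute || sawKeyword;
}

console.log(isValidUnicodeExtensionBody("ca-chinese")); // key "ca", type "chinese"
console.log(isValidUnicodeExtensionBody("c1-abc"));     // "c1" ends in a digit
```

A key like "c1" consists of two alphanumerics but does not end in a letter, so it fails the "alphanum alpha" requirement.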

Depends on D23540

Duplicate of this bug: 1457571
Assignee: nobody → andrebargull
Status: NEW → ASSIGNED

Pushed by csabou@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/113a287cfb7f
Part 1: Remove support for extlang subtags. r=jwalden
https://hg.mozilla.org/integration/autoland/rev/5a50fd4ec8ba
Part 2: Remove support for irregular grandfathered tags and regular grandfathered tags with extlang-like subtags. r=jwalden
https://hg.mozilla.org/integration/autoland/rev/b215b68fbccc
Part 3: Remove support for privateuse-only language tags. r=jwalden
https://hg.mozilla.org/integration/autoland/rev/c5a97d342431
Part 4: Remove support for four letter language subtags. r=jwalden
https://hg.mozilla.org/integration/autoland/rev/7b0c2144242c
Part 5: Update comments to refer to Unicode BCP 47 locale identifiers. r=jwalden
https://hg.mozilla.org/integration/autoland/rev/1245a50cc3a0
Part 6: Add strict parsing of Unicode and transform extension sequences. r=jwalden

Keywords: checkin-needed

== Change summary for alert #20419 (as of Fri, 12 Apr 2019 06:30:28 GMT) ==

Improvements:

1% Base Content JS linux64-shippable opt 4,023,191.33 -> 4,002,330.67
1% Base Content JS linux64-shippable-qr opt 4,023,148.00 -> 4,002,240.67
1% Base Content JS osx-10-10-shippable opt 4,020,194.67 -> 3,999,178.67
1% Base Content JS windows10-64-shippable opt 4,083,708.00 -> 4,062,744.67
1% Base Content JS windows10-64-shippable-qr opt 4,083,694.00 -> 4,062,758.00
0% Base Content JS linux64-shippable-qr opt 4,020,210.33 -> 4,002,276.67

For up to date results, see: https://treeherder.mozilla.org/perf.html#/alerts?id=20419

I...think this is still open to implement UTS canonicalization from comment 1? Not sure any more from reading the comment history here. I guess that's all the canonical-form things mentioned in http://unicode.org/reports/tr35/#unicode_locale_id like for en-u-ms-imperial → en-u-ms-uksystem.

If that's all that's left, I guess we need to do some more make_intl_data.py hacking to read through all the transform extension data and generate the necessary code to handle that.
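
For illustration, the kind of type-alias replacement mentioned above (ms-imperial becoming ms-uksystem) could be driven by a generated lookup table along these lines. This is only a sketch: the table layout and the function name are assumptions, and the single entry stands in for whatever a generator script would extract from the CLDR data.

```javascript
// Hypothetical generated table: extension key -> (deprecated type -> preferred type).
// Only the one alias mentioned in this bug is listed.
const unicodeTypeAliases = new Map([
  ["ms", new Map([["imperial", "uksystem"]])],
]);

// `keywords` is an array of { key, type } pairs from a parsed "-u-" extension.
function canonicalizeUnicodeKeywords(keywords) {
  return keywords.map(({ key, type }) => {
    const replacement = unicodeTypeAliases.get(key)?.get(type);
    return { key, type: replacement ?? type };
  });
}

console.log(canonicalizeUnicodeKeywords([
  { key: "ms", type: "imperial" },
  { key: "ca", type: "gregory" },
]));
```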

Flags: needinfo?(andrebargull)
Priority: -- → P2
Blocks: 1548877

(In reply to Jeff Walden [:Waldo] from comment #13)

> I...think this is still open to implement UTS canonicalization from comment 1? Not sure any more from reading the comment history here. I guess that's all the canonical-form things mentioned in http://unicode.org/reports/tr35/#unicode_locale_id like for en-u-ms-imperial → en-u-ms-uksystem.

This canonicalisation is currently not required, but probably should be for consistency with Intl.Locale. See also https://github.com/tc39/ecma402/issues/330.

> If that's all that's left, I guess we need to do some more make_intl_data.py hacking to read through all the transform extension data and generate the necessary code to handle that.

The ten patches (actually eleven, but one of them was already reviewed in bug 1433303 and now got moved here) are on the way to you! :-)

Flags: needinfo?(andrebargull)

Start implementing the new canonicalisation algorithm by validating all
subtags are in normalised case.

Updated test cases to conform to the changed canonicalization when variants are
sorted in alphabetical order.
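
As a small sketch of the syntax-level canonical form discussed in the last two comments: subtags are brought into their conventional case (lower-case language, title-case script, upper-case region, lower-case variants) and variants are sorted alphabetically. The function names and the object shape are assumptions for this example, and the sketch converts case rather than merely validating it.

```javascript
// Sketch only: operates on an already parsed tag without extensions.
function toCanonicalSyntax({ language, script, region, variants = [] }) {
  const result = { language: language.toLowerCase() };
  if (script) {
    result.script = script[0].toUpperCase() + script.slice(1).toLowerCase();
  }
  if (region) {
    result.region = region.toUpperCase();
  }
  // Variants are lower-cased and then sorted alphabetically.
  result.variants = variants.map((v) => v.toLowerCase()).sort();
  return result;
}

function stringify({ language, script, region, variants }) {
  return [language, script, region, ...variants].filter(Boolean).join("-");
}

const parsed = {
  language: "NAN",
  script: "hans",
  region: "mm",
  variants: ["Variant2", "variant1"],
};
console.log(stringify(toCanonicalSyntax(parsed)));
```

With sorted variants, a tag written with "variant2" before "variant1" no longer round-trips unchanged, which is why test expectations had to be updated.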

Depends on D37440

The new language mappings data will be retrieved from CLDR, so rename the previous file
before starting to make the switch to CLDR.

Depends on D37443

Switch language and region mappings from IANA to CLDR.

Depends on D37444

  • Add support for language mappings where in addition to the language subtag also
    other subtags are modified.
  • Add support for region mappings where the preferred replacement region is chosen
    based on the likely subtags from the base language and script subtags.
  • Variant subtag replacements are not supported in the currently used CLDR
    algorithm, so remove them for now.
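
The difference between simple and complex mappings can be sketched as follows. The tables are small hand-picked examples rather than the generated CLDR data, the names are hypothetical, and the per-language lookup for regions is a simplification of choosing the replacement through likely subtags.

```javascript
// Illustrative excerpts only; the real tables are generated from CLDR.
const languageMappings = new Map([["iw", "he"], ["in", "id"]]);
const complexLanguageMappings = new Map([
  ["sh", { language: "sr", script: "Latn" }],
]);
const regionMappings = new Map([["BU", "MM"]]);
const complexRegionMappings = new Map([
  // Simplified stand-in for the likely-subtags lookup.
  ["SU", { fallback: "RU", byLanguage: new Map([["hy", "AM"], ["uk", "UA"]]) }],
]);

function applyMappings({ language, script, region }) {
  if (languageMappings.has(language)) {
    language = languageMappings.get(language);
  } else if (complexLanguageMappings.has(language)) {
    const replacement = complexLanguageMappings.get(language);
    language = replacement.language;
    // Subtags already present in the input win over the replacement's.
    script = script ?? replacement.script;
    region = region ?? replacement.region;
  }

  if (region !== undefined) {
    if (regionMappings.has(region)) {
      region = regionMappings.get(region);
    } else if (complexRegionMappings.has(region)) {
      const { fallback, byLanguage } = complexRegionMappings.get(region);
      region = byLanguage.get(language) ?? fallback;
    }
  }
  return { language, script, region };
}

console.log(applyMappings({ language: "sh" }));
console.log(applyMappings({ language: "hy", region: "SU" }));
```

In this sketch a bare "sh" picks up both a new language and a script, while "sh" with an explicit script keeps that script.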

Depends on D37445

Intl.Locale is no longer required to handle (uncanonicalised) grandfathered tags,
so we can directly update grandfathered tags to their modern form in the parser.
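
A minimal sketch of what updating in the parser could look like: the few grandfathered tags that still parse as Unicode BCP 47 locale identifiers are replaced through a direct lookup before any further processing. The table below is an illustrative excerpt and the function name is made up for this example.

```javascript
// Illustrative excerpt; not the complete list handled by the patch.
const grandfatheredMappings = new Map([
  ["art-lojban", "jbo"],
  ["zh-hakka", "hak"],
  ["zh-xiang", "hsn"],
]);

// Returns the modern form for a grandfathered tag, or the input unchanged.
function updateGrandfathered(tag) {
  return grandfatheredMappings.get(tag.toLowerCase()) ?? tag;
}

console.log(updateGrandfathered("art-lojban"));
console.log(updateGrandfathered("de-CH"));
```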

Depends on D37448

Pushed by nbeleuzu@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/c6dd14881fdb
Part 7: Ensure subtags are in normalised case. r=jwalden
https://hg.mozilla.org/integration/autoland/rev/f79999353e41
Part 8: Order all subtags in canonical syntax form. r=jwalden
https://hg.mozilla.org/integration/autoland/rev/2794e19ca868
Part 9: Add BCP47 tokenizer and split stringification from CanonicalizeLanguageTag. r=jwalden
https://hg.mozilla.org/integration/autoland/rev/566e019402f2
Part 10: Canonicalize BCP 47 T extension subtag. r=jwalden
https://hg.mozilla.org/integration/autoland/rev/6059f6604172
Part 11: Rename LangTagMappingsGenerated.js to LangTagMappingsIANAGenerated.js. r=jwalden
https://hg.mozilla.org/integration/autoland/rev/df9468fb8406
Part 12: Add simple language and region mappings from CLDR. r=jwalden
https://hg.mozilla.org/integration/autoland/rev/aa7801f87840
Part 13: Add complex language and region mappings from CLDR. r=jwalden
https://hg.mozilla.org/integration/autoland/rev/1c3668a78484
Part 14: Remove no longer used IANA language subtag registry code. r=jwalden
https://hg.mozilla.org/integration/autoland/rev/6de0fe9b007c
Part 15: Update comment for CanonicalizeUnicodeExtension. r=jwalden
https://hg.mozilla.org/integration/autoland/rev/984a6cb95c09
Part 16: Update grandfathered tags to modern form directly in the parser. r=jwalden
https://hg.mozilla.org/integration/autoland/rev/1fec79fd2faf
Part 17: Remove references to RFC 5646. r=jwalden

Keywords: checkin-needed

// Test valid language tags - derived from IANA and BCP-47 spec
// and our Intl.js implementation.
var validTags = [
  "aa", "ab", "ae", "af", "ak", "am", "an", "ar", "as", "av", "ay", "az",
  "ba", "be", "bg", "bh", "bi", "bm", "bn", "bo", "br", "bs", "ca", "ce",
  "ch", "co", "cr", "cs", "cu", "cv", "cy", "da", "de", "dv", "dz", "ee",
  "el", "en", "eo", "es", "et", "eu", "fa", "ff", "fi", "fj", "fo", "fr",
  "fy", "ga", "gd", "gl", "gn", "gu", "gv", "ha", "he", "hi", "ho", "hr",
  "ht", "hu", "hy", "hz", "ia", "id", "ie", "ig", "ik", "io",
  "is", "it", "iu", "ja", "jv", "ka", "kg", "ki", "kj",
  "kk", "kl", "km", "kn", "ko", "kr", "ks", "ku", "kv", "kw", "ky", "la",
  "lb", "lg", "li", "ln", "lo", "lt", "lu", "lv", "mg", "mh", "mi", "mk",
  "ml", "mn", "mr", "ms", "mt", "my", "na", "nb", "nd", "ne", "ng",
  "nl", "nn", "no", "nr", "nv", "ny", "oc", "oj", "om", "or", "os", "pa",
  "pi", "pl", "ps", "pt", "qu", "rm", "rn", "ro", "ru", "rw", "sa", "sc",
  "sd", "se", "sg", "sh", "si", "sk", "sl", "sm", "sn", "so", "sq", "sr",
  "ss", "st", "su", "sv", "sw", "ta", "te", "tg", "th", "ti", "tk", "tl",
  "tn", "to", "tr", "ts", "tt", "tw", "ty", "ug", "uk", "ur", "uz", "ve",
  "vi", "vo", "wa", "wo", "xh", "yi", "yo", "za", "zh", "zu", "en-US",
  "jp-JS", "pt-PT", "pt-BR", "de-CH", "de-DE-1901", "es-419", "sl-IT-nedis",
  "en-US-boont", "mn-Cyrl-MN", "sr-Cyrl", "sr-Latn",
  "zh-TW", "en-GB-boont-posix-r-extended-sequence-x-private",
  "nan-Hans-MM-variant2-variant1-t-zh-latn-u-ca-chinese-x-private",
  "yue-HK", "de-CH-x-phonebk", "az-Arab-x-aze-derbend",
  "qaa-Qaaa-QM-x-southern",
];


for (var tag of validTags) {
  const expected = `Expect lang to be "${tag}"`;
  data.jsonText = JSON.stringify({
    lang: tag,
  });
  const result = processor.process(data);
  is(result.lang, tag, expected);
}

...yeah, "sh" processed above will ultimately go through https://searchfox.org/mozilla-central/rev/da855d65d1fbdd714190cab2c46130f7422f3699/dom/manifest/ValueExtractor.jsm#73 which will behave differently after these patches. Looks like the stuff in this file needs a regen of expected results, more or less, for at least that value, possibly for others.

ECMA-402 changed the language tag specification from RFC-5646 BCP-47 language
tags to UTS 35 Unicode BCP-47 locale identifiers. Update the expected
canonicalisation results accordingly.

Depends on D37450

Pushed by archaeopteryx@coole-files.de:
https://hg.mozilla.org/integration/autoland/rev/cc29a8483de6
Part 7: Ensure subtags are in normalised case. r=jwalden
https://hg.mozilla.org/integration/autoland/rev/79f844a1bf23
Part 8: Order all subtags in canonical syntax form. r=jwalden
https://hg.mozilla.org/integration/autoland/rev/4e9279f09500
Part 9: Add BCP47 tokenizer and split stringification from CanonicalizeLanguageTag. r=jwalden
https://hg.mozilla.org/integration/autoland/rev/80158bfcf0f5
Part 10: Canonicalize BCP 47 T extension subtag. r=jwalden
https://hg.mozilla.org/integration/autoland/rev/a11b3db4d9ca
Part 11: Rename LangTagMappingsGenerated.js to LangTagMappingsIANAGenerated.js. r=jwalden
https://hg.mozilla.org/integration/autoland/rev/21eae8e6487a
Part 12: Add simple language and region mappings from CLDR. r=jwalden
https://hg.mozilla.org/integration/autoland/rev/2b41a865b58b
Part 13: Add complex language and region mappings from CLDR. r=jwalden
https://hg.mozilla.org/integration/autoland/rev/fec9e8ac45dd
Part 14: Remove no longer used IANA language subtag registry code. r=jwalden
https://hg.mozilla.org/integration/autoland/rev/20bd2706beb6
Part 15: Update comment for CanonicalizeUnicodeExtension. r=jwalden
https://hg.mozilla.org/integration/autoland/rev/be2eb3b3774c
Part 16: Update grandfathered tags to modern form directly in the parser. r=jwalden
https://hg.mozilla.org/integration/autoland/rev/69499c0220c4
Part 17: Remove references to RFC 5646. r=jwalden
https://hg.mozilla.org/integration/autoland/rev/1e0a350b954a
Part 18: Update 'lang' member test for Web manifest to match latest ECMA-402. r=marcosc

Keywords: checkin-needed
Keywords: leave-open
Status: ASSIGNED → RESOLVED
Closed: 1 year ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla70
Duplicate of this bug: 1548877
Regressions: 1567902