Closed Bug 1522070 Opened 6 years ago Closed 6 years ago

Use UTS 35 Unicode BCP 47 Locale Identifiers instead of RFC-5646/6067 BCP 47 language tags

Tracking

Status:

RESOLVED FIXED

Milestone:

mozilla70

Tracking Flags:

firefox70: tracking ---, status fixed

People

(Reporter: anba, Assigned: anba)

References

Details

Attachments

(18 files)

Bug 1522070 - Part 1: Remove support for extlang subtags. r=jwalden! 6 years ago André Bargull [:anba] 47 bytes, text/x-phabricator-request
Bug 1522070 - Part 2: Remove support for irregular grandfathered tags and regular grandfathered tags with extlang-like subtags. r=jwalden! 6 years ago André Bargull [:anba] 47 bytes, text/x-phabricator-request
Bug 1522070 - Part 3: Remove support for privateuse-only language tags. r=jwalden! 6 years ago André Bargull [:anba] 47 bytes, text/x-phabricator-request
Bug 1522070 - Part 4: Remove support for four letter language subtags. r=jwalden! 6 years ago André Bargull [:anba] 47 bytes, text/x-phabricator-request
Bug 1522070 - Part 5: Update comments to refer to Unicode BCP 47 locale identifiers. r=jwalden! 6 years ago André Bargull [:anba] 47 bytes, text/x-phabricator-request
Bug 1522070 - Part 6: Add strict parsing of Unicode and transform extension sequences. r=jwalden! 6 years ago André Bargull [:anba] 47 bytes, text/x-phabricator-request
Bug 1522070 - Part 7: Ensure subtags are in normalised case. r=jwalden! 6 years ago André Bargull [:anba] 47 bytes, text/x-phabricator-request
Bug 1522070 - Part 8: Order all subtags in canonical syntax form. r=jwalden! 6 years ago André Bargull [:anba] 47 bytes, text/x-phabricator-request
Bug 1522070 - Part 9: Add BCP47 tokenizer and split stringification from CanonicalizeLanguageTag. r=jwalden! 6 years ago André Bargull [:anba] 47 bytes, text/x-phabricator-request
Bug 1522070 - Part 10: Canonicalize BCP 47 T extension subtag. r=jwalden! 6 years ago André Bargull [:anba] 47 bytes, text/x-phabricator-request
Bug 1522070 - Part 11: Rename LangTagMappingsGenerated.js to LangTagMappingsIANAGenerated.js. r=jwalden! 6 years ago André Bargull [:anba] 47 bytes, text/x-phabricator-request
Bug 1522070 - Part 12: Add simple language and region mappings from CLDR. r=jwalden! 6 years ago André Bargull [:anba] 47 bytes, text/x-phabricator-request
Bug 1522070 - Part 13: Add complex language and region mappings from CLDR. r=jwalden! 6 years ago André Bargull [:anba] 47 bytes, text/x-phabricator-request
Bug 1522070 - Part 14: Remove no longer used IANA language subtag registry code. r=jwalden! 6 years ago André Bargull [:anba] 47 bytes, text/x-phabricator-request
Bug 1522070 - Part 15: Update comment for CanonicalizeUnicodeExtension. r=jwalden! 6 years ago André Bargull [:anba] 47 bytes, text/x-phabricator-request
Bug 1522070 - Part 16: Update grandfathered tags to modern form directly in the parser. r=jwalden! 6 years ago André Bargull [:anba] 47 bytes, text/x-phabricator-request
Bug 1522070 - Part 17: Remove references to RFC 5646. r=jwalden! 6 years ago André Bargull [:anba] 47 bytes, text/x-phabricator-request
Bug 1522070 - Part 18: Update 'lang' member test for Web manifest to match latest ECMA-402. r=marcosc! 6 years ago André Bargull [:anba] 47 bytes, text/x-phabricator-request

André Bargull [:anba]

Assignee

Description

•

6 years ago

Intl.Locale (bug 1433303) depends on https://github.com/tc39/ecma402/pull/289.

https://github.com/tc39/ecma402/pull/289#issuecomment-444178026 lists multiple open issues, but the PR was merged anyway, so it's not entirely clear what to implement in some edge cases.

André Bargull [:anba]

Assignee

Comment 1

•

6 years ago

I propose we first switch the language tag parser over to Unicode BCP 47 locale identifiers. If that works out without causing any web-compat issues, we can then switch the canonicalisation to whatever UTS 35 specifies. (But also see https://github.com/tc39/ecma402/issues/330.)

André Bargull [:anba]

Assignee

Comment 2

•

6 years ago

Attached file Bug 1522070 - Part 1: Remove support for extlang subtags. r=jwalden! — Details

https://github.com/tc39/ecma402/pull/289 changed ECMA-402 to use Unicode BCP 47
locale identifiers instead of RFC 5646 BCP 47 language tags. That means extlang
subtags are no longer supported in language tags.

André Bargull [:anba]

Assignee

Comment 3

•

6 years ago

Attached file Bug 1522070 - Part 2: Remove support for irregular grandfathered tags and regular grandfathered tags with extlang-like subtags. r=jwalden! — Details

Irregular grandfathered language tags and regular grandfathered tags with
extlang-like subtags can't be parsed as Unicode BCP 47 locale identifiers, so
they now need to be rejected by the language tag parser.

Depends on D23536

André Bargull [:anba]

Assignee

Comment 4

•

6 years ago

Attached file Bug 1522070 - Part 3: Remove support for privateuse-only language tags. r=jwalden! — Details

Language tags consisting only of private-use subtags are not allowed in Unicode
BCP 47 locale identifiers.

Depends on D23537

André Bargull [:anba]

Assignee

Comment 5

•

6 years ago

Attached file Bug 1522070 - Part 4: Remove support for four letter language subtags. r=jwalden! — Details

Unicode BCP 47 locale identifiers don't support four letter language subtags.
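
The restrictions from parts 1 to 4 can be illustrated with a small structural check. This is only a sketch based on the unicode_language_id grammar in UTS 35, not the parser changed by these patches. The name `isStructurallyValidLocaleId` is made up for illustration, extension sequences are checked only loosely, and duplicate variants are not detected.

```javascript
// Sketch of the unicode_language_id grammar from UTS 35.
// Hypothetical helper, not SpiderMonkey's parser.
const ALPHA = "[a-z]";
const ALPHANUM = "[a-z0-9]";

// 2-3 or 5-8 letters, so four letter language subtags are rejected (part 4).
const LANGUAGE = `(?:${ALPHA}{2,3}|${ALPHA}{5,8})`;
const SCRIPT = `${ALPHA}{4}`;
const REGION = `(?:${ALPHA}{2}|[0-9]{3})`;
const VARIANT = `(?:${ALPHANUM}{5,8}|[0-9]${ALPHANUM}{3})`;
// Loose check: any singleton except "x", followed by 2-8 character subtags.
const EXTENSION = `[0-9a-wy-z](?:-${ALPHANUM}{2,8})+`;
const PRIVATE_USE = `x(?:-${ALPHANUM}{1,8})+`;

// A language subtag is mandatory, so privateuse-only tags fail (part 3).
// There is no slot for three letter extlang subtags (parts 1 and 2).
const LOCALE_ID = new RegExp(
  `^${LANGUAGE}(?:-${SCRIPT})?(?:-${REGION})?(?:-${VARIANT})*` +
    `(?:-${EXTENSION})*(?:-${PRIVATE_USE})?$`,
  "i"
);

function isStructurallyValidLocaleId(tag) {
  return LOCALE_ID.test(tag);
}

console.log(isStructurallyValidLocaleId("zh-yue"));          // false (extlang)
console.log(isStructurallyValidLocaleId("en-GB-oed"));       // false (irregular grandfathered)
console.log(isStructurallyValidLocaleId("x-private"));       // false (privateuse-only)
console.log(isStructurallyValidLocaleId("abcd"));            // false (four letter language)
console.log(isStructurallyValidLocaleId("de-CH-x-phonebk")); // true
```

Note that a regular grandfathered tag such as "art-lojban" still passes this check, because "lojban" has the shape of a variant subtag.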

Depends on D23538

André Bargull [:anba]

Assignee

Comment 6

•

6 years ago

Attached file Bug 1522070 - Part 5: Update comments to refer to Unicode BCP 47 locale identifiers. r=jwalden! — Details

Strict parsing for "u" and "t" extensions is not yet implemented.
Canonicalisation per UTS 35 is also not yet implemented, so it still refers to BCP 47 tags.

Depends on D23539

André Bargull [:anba]

Assignee

Comment 7

•

6 years ago

Attached file Bug 1522070 - Part 6: Add strict parsing of Unicode and transform extension sequences. r=jwalden! — Details

Unicode BCP 47 locale identifiers have stricter requirements for the Unicode ("-u-") and
transformed content ("-t-") extension sequences.

Keys in Unicode extensions must be of the form "alphanum alpha".
Transformed content extensions need to be parsed following the transformed_extensions
syntax from UTS 35.
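
As a rough illustration of these stricter rules, the two extension grammars from UTS 35 can be written as regular expressions. This is a sketch under the assumption that each extension sequence is checked on its own, without the leading separator; the names `isValidUnicodeExtension` and `isValidTransformExtension` are hypothetical and do not come from the patch.

```javascript
// Sketch of the UTS 35 grammars for "-u-" and "-t-" extension sequences.
// Hypothetical helpers; input is one extension sequence such as "u-ca-chinese".
const ALNUM = "[a-z0-9]";

// Unicode extension: attributes (3-8 alphanum) and keywords.
// A keyword is a key of the form "alphanum alpha" plus optional type subtags.
const KEYWORD = `${ALNUM}[a-z](?:-${ALNUM}{3,8})*`;
const UNICODE_EXT = new RegExp(
  `^u(?:(?:-${ALNUM}{3,8})+(?:-${KEYWORD})*|(?:-${KEYWORD})+)$`,
  "i"
);

// Transform extension: an optional language part followed by tfields.
// A tfield is a key of the form "alpha digit" plus at least one value subtag.
const TLANG =
  `(?:[a-z]{2,3}|[a-z]{5,8})(?:-[a-z]{4})?(?:-(?:[a-z]{2}|[0-9]{3}))?` +
  `(?:-(?:${ALNUM}{5,8}|[0-9]${ALNUM}{3}))*`;
const TFIELD = `[a-z][0-9](?:-${ALNUM}{3,8})+`;
const TRANSFORM_EXT = new RegExp(
  `^t(?:-${TLANG}(?:-${TFIELD})*|(?:-${TFIELD})+)$`,
  "i"
);

function isValidUnicodeExtension(extension) {
  return UNICODE_EXT.test(extension);
}

function isValidTransformExtension(extension) {
  return TRANSFORM_EXT.test(extension);
}

console.log(isValidUnicodeExtension("u-ca-chinese"));   // true
console.log(isValidUnicodeExtension("u-c1-abc"));       // false, key must end in a letter
console.log(isValidTransformExtension("t-zh-latn"));    // true
console.log(isValidTransformExtension("t-m0"));         // false, tfield needs a value
```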

Depends on D23540

André Bargull [:anba]

Assignee

Updated

•

6 years ago

Assignee: nobody → andrebargull

Status: NEW → ASSIGNED

André Bargull [:anba]

Assignee

Comment 9

•

6 years ago

Try: https://treeherder.mozilla.org/#/jobs?repo=try&revision=f5f2bb8e0636a84c2b4ea4442abe936cf23f3983

Keywords: checkin-needed, leave-open

Pulsebot

Comment 10

•

6 years ago

Pushed by csabou@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/113a287cfb7f
Part 1: Remove support for extlang subtags. r=jwalden
https://hg.mozilla.org/integration/autoland/rev/5a50fd4ec8ba
Part 2: Remove support for irregular grandfathered tags and regular grandfathered tags with extlang-like subtags. r=jwalden
https://hg.mozilla.org/integration/autoland/rev/b215b68fbccc
Part 3: Remove support for privateuse-only language tags. r=jwalden
https://hg.mozilla.org/integration/autoland/rev/c5a97d342431
Part 4: Remove support for four letter language subtags. r=jwalden
https://hg.mozilla.org/integration/autoland/rev/7b0c2144242c
Part 5: Update comments to refer to Unicode BCP 47 locale identifiers. r=jwalden
https://hg.mozilla.org/integration/autoland/rev/1245a50cc3a0
Part 6: Add strict parsing of Unicode and transform extension sequences. r=jwalden

Keywords: checkin-needed

Cosmin Sabou [:CosminS]

Comment 11

•

6 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/113a287cfb7f
https://hg.mozilla.org/mozilla-central/rev/5a50fd4ec8ba
https://hg.mozilla.org/mozilla-central/rev/b215b68fbccc
https://hg.mozilla.org/mozilla-central/rev/c5a97d342431
https://hg.mozilla.org/mozilla-central/rev/7b0c2144242c
https://hg.mozilla.org/mozilla-central/rev/1245a50cc3a0

Florin Strugariu [:Bebe]

Comment 12

•

6 years ago

== Change summary for alert #20419 (as of Fri, 12 Apr 2019 06:30:28 GMT) ==

Improvements:

1% Base Content JS linux64-shippable opt 4,023,191.33 -> 4,002,330.67
1% Base Content JS linux64-shippable-qr opt 4,023,148.00 -> 4,002,240.67
1% Base Content JS osx-10-10-shippable opt 4,020,194.67 -> 3,999,178.67
1% Base Content JS windows10-64-shippable opt 4,083,708.00 -> 4,062,744.67
1% Base Content JS windows10-64-shippable-qr opt 4,083,694.00 -> 4,062,758.00
0% Base Content JS linux64-shippable-qr opt 4,020,210.33 -> 4,002,276.67

For up to date results, see: https://treeherder.mozilla.org/perf.html#/alerts?id=20419

Jeff Walden [:Waldo]

Comment 13

•

6 years ago

I...think this is still open to implement UTS canonicalization from comment 1? Not sure any more from reading the comment history here. I guess that's all the canonical-form things mentioned in http://unicode.org/reports/tr35/#unicode_locale_id like for en-u-ms-imperial ⇒ en-u-ms-uksystem.

If that's all that's left, I guess we need to do some more make_intl_data.py hacking to read through all the transform extension data and generate the necessary code to handle that.
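
For illustration, the kind of replacement mentioned above (en-u-ms-imperial ⇒ en-u-ms-uksystem) could be sketched as a table lookup. The table below contains only that one example, and `TYPE_ALIASES` and `replaceUnicodeTypeAliases` are hypothetical names; real data would have to be generated from CLDR, and multi-subtag types are ignored here.

```javascript
// Hypothetical alias table: key -> deprecated type -> preferred type.
// Only the example from this comment is listed.
const TYPE_ALIASES = {
  ms: { imperial: "uksystem" },
};

function replaceUnicodeTypeAliases(tag) {
  const subtags = tag.split("-");
  let singleton = null; // current extension singleton, if any
  let key = null;       // current Unicode extension key, if any

  for (let i = 1; i < subtags.length; i++) {
    const subtag = subtags[i].toLowerCase();
    if (subtag.length === 1) {
      if (subtag === "x") {
        break; // never touch private-use subtags
      }
      singleton = subtag;
      key = null;
      continue;
    }
    if (singleton !== "u") {
      continue; // only the Unicode extension is handled in this sketch
    }
    if (subtag.length === 2) {
      key = subtag;
      continue;
    }
    const replacement = key !== null && TYPE_ALIASES[key]?.[subtag];
    if (replacement) {
      subtags[i] = replacement;
    }
  }
  return subtags.join("-");
}

console.log(replaceUnicodeTypeAliases("en-u-ms-imperial")); // "en-u-ms-uksystem"
console.log(replaceUnicodeTypeAliases("en-u-ca-gregory"));  // "en-u-ca-gregory"
```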

Flags: needinfo?(andrebargull)

Priority: -- → P2

André Bargull [:anba]

Assignee

Updated

•

6 years ago

Blocks: 1548877

André Bargull [:anba]

Assignee

Comment 14

•

6 years ago

(In reply to Jeff Walden [:Waldo] from comment #13)

> I...think this is still open to implement UTS canonicalization from comment 1? Not sure any more from reading the comment history here. I guess that's all the canonical-form things mentioned in http://unicode.org/reports/tr35/#unicode_locale_id like for en-u-ms-imperial ⇒ en-u-ms-uksystem.

This canonicalisation is currently not required, but probably should be for consistency with Intl.Locale. See also https://github.com/tc39/ecma402/issues/330.

> If that's all that's left, I guess we need to do some more make_intl_data.py hacking to read through all the transform extension data and generate the necessary code to handle that.

The ten patches (actually eleven, but one of them was already reviewed in bug 1433303 and now got moved here) are on the way to you! :-)

Flags: needinfo?(andrebargull)

André Bargull [:anba]

Assignee

Comment 15

•

6 years ago

Attached file Bug 1522070 - Part 7: Ensure subtags are in normalised case. r=jwalden! — Details

Start implementing the new canonicalisation algorithm by validating all
subtags are in normalised case.
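
As a rough picture of what "normalised case" means here: in the canonical form described by UTS 35 the language is lower case, the script is title case, the region is upper case, and everything else (variants, extensions, private use) is lower case. The sketch below converts a tag to that form rather than validating it; `normalizeCase` is a hypothetical name and the input is assumed to be structurally valid already.

```javascript
// Sketch: bring a structurally valid locale identifier into normalised case.
// Hypothetical helper, not the code from this patch.
function normalizeCase(tag) {
  let inExtension = false;
  return tag
    .split("-")
    .map((subtag, index) => {
      // The first singleton starts the extension / private-use part.
      if (index > 0 && subtag.length === 1) {
        inExtension = true;
      }
      if (index === 0 || inExtension) {
        return subtag.toLowerCase(); // language, extensions, private use
      }
      if (/^[a-z]{4}$/i.test(subtag)) {
        // Script subtag: title case.
        return subtag[0].toUpperCase() + subtag.slice(1).toLowerCase();
      }
      if (/^(?:[a-z]{2}|[0-9]{3})$/i.test(subtag)) {
        return subtag.toUpperCase(); // region subtag
      }
      return subtag.toLowerCase(); // variant subtag
    })
    .join("-");
}

console.log(normalizeCase("SR-cyrl-rs"));         // "sr-Cyrl-RS"
console.log(normalizeCase("EN-us-U-CA-Gregory")); // "en-US-u-ca-gregory"
```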

André Bargull [:anba]

Assignee

Comment 16

•

6 years ago

Attached file Bug 1522070 - Part 8: Order all subtags in canonical syntax form. r=jwalden! — Details

Update test cases to match the changed canonicalisation, which now sorts variant
subtags in alphabetical order.
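
A minimal sketch of the variant ordering, assuming a structurally valid tag whose subtags are already in normalised case. `sortVariants` is a hypothetical name, and the ordering of extension sequences is not covered.

```javascript
// Sketch: sort the variant subtags of a locale identifier alphabetically.
// Hypothetical helper; assumes valid, case-normalised input.
function sortVariants(tag) {
  const subtags = tag.split("-");

  // The language identifier part ends at the first singleton, if any.
  let end = subtags.findIndex((subtag, i) => i > 0 && subtag.length === 1);
  if (end === -1) {
    end = subtags.length;
  }

  // Variants are 5-8 alphanumerics, or a digit followed by 3 alphanumerics.
  const isVariant = (subtag) =>
    /^(?:[a-z0-9]{5,8}|[0-9][a-z0-9]{3})$/i.test(subtag);

  // Skip the optional script and region subtags after the language.
  let start = 1;
  while (start < end && !isVariant(subtags[start])) {
    start++;
  }

  const variants = subtags.slice(start, end).sort();
  return [...subtags.slice(0, start), ...variants, ...subtags.slice(end)].join("-");
}

console.log(sortVariants("nan-Hans-MM-variant2-variant1-t-zh-latn-u-ca-chinese-x-private"));
// "nan-Hans-MM-variant1-variant2-t-zh-latn-u-ca-chinese-x-private"
```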

Depends on D37440

André Bargull [:anba]

Assignee

Comment 17

•

6 years ago

Attached file Bug 1522070 - Part 9: Add BCP47 tokenizer and split stringification from CanonicalizeLanguageTag. r=jwalden! — Details

Depends on D37441

André Bargull [:anba]

Assignee

Comment 18

•

6 years ago

Attached file Bug 1522070 - Part 10: Canonicalize BCP 47 T extension subtag. r=jwalden! — Details

Depends on D37442

André Bargull [:anba]

Assignee

Comment 19

•

6 years ago

Attached file Bug 1522070 - Part 11: Rename LangTagMappingsGenerated.js to LangTagMappingsIANAGenerated.js. r=jwalden! — Details

The new language mappings data will be retrieved from CLDR, so rename the previous file
before starting to make the switch to CLDR.

Depends on D37443

André Bargull [:anba]

Assignee

Comment 20

•

6 years ago

Attached file Bug 1522070 - Part 12: Add simple language and region mappings from CLDR. r=jwalden! — Details

Switch language and region mappings from IANA to CLDR.
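
A simple mapping replaces exactly one subtag with another. The sketch below shows the idea with a handful of well-known deprecated codes; the tables are illustrative excerpts rather than the generated data, and `applySimpleMappings` and the `{ language, script, region }` object shape are assumptions made for this example.

```javascript
// Illustrative excerpts only; the real tables are generated from CLDR.
const SIMPLE_LANGUAGE_MAPPINGS = new Map([
  ["iw", "he"],
  ["in", "id"],
]);
const SIMPLE_REGION_MAPPINGS = new Map([
  ["DD", "DE"],
  ["BU", "MM"],
]);

// Hypothetical helper working on an already parsed locale.
function applySimpleMappings({ language, script, region }) {
  return {
    language: SIMPLE_LANGUAGE_MAPPINGS.get(language) ?? language,
    script,
    region:
      region === undefined
        ? undefined
        : (SIMPLE_REGION_MAPPINGS.get(region) ?? region),
  };
}

console.log(applySimpleMappings({ language: "iw" }).language);             // "he"
console.log(applySimpleMappings({ language: "de", region: "DD" }).region); // "DE"
```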

Depends on D37444

André Bargull [:anba]

Assignee

Comment 21

•

6 years ago

Attached file Bug 1522070 - Part 13: Add complex language and region mappings from CLDR. r=jwalden! — Details

Add support for language mappings where, in addition to the language subtag,
other subtags are also modified.
Add support for region mappings where the preferred replacement region is chosen
based on the likely subtags from the base language and script subtags.
Variant subtag replacements are not supported in the currently used CLDR
algorithm, so remove them for now.
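
To make the two kinds of complex mapping more concrete, here is a sketch under several assumptions: the tables are small illustrative excerpts (the "SU" candidate list is abbreviated), `LIKELY_REGIONS` is a stand-in for a real likely-subtags lookup and ignores the script subtag, and `applyComplexMappings` is a hypothetical name rather than a function from the patch.

```javascript
// Language mapping that also supplies a script: "sh" becomes "sr-Latn".
const COMPLEX_LANGUAGE_MAPPINGS = new Map([
  ["sh", { language: "sr", script: "Latn" }],
]);

// Region with several possible replacements (abbreviated list for illustration).
// The first entry is used when likely subtags give no better match.
const COMPLEX_REGION_MAPPINGS = new Map([
  ["SU", ["RU", "AM", "AZ"]],
]);

// Stand-in for a likely-subtags lookup (language -> likely region).
const LIKELY_REGIONS = new Map([
  ["ru", "RU"],
  ["hy", "AM"],
  ["az", "AZ"],
]);

function applyComplexMappings(locale) {
  let { language, script, region } = locale;

  const languageMapping = COMPLEX_LANGUAGE_MAPPINGS.get(language);
  if (languageMapping) {
    language = languageMapping.language;
    // Only fill in the script when the tag doesn't already have one.
    script = script ?? languageMapping.script;
  }

  const candidates = COMPLEX_REGION_MAPPINGS.get(region);
  if (candidates) {
    const likely = LIKELY_REGIONS.get(language);
    region = candidates.includes(likely) ? likely : candidates[0];
  }

  return { language, script, region };
}

console.log(applyComplexMappings({ language: "sh" }));
// { language: "sr", script: "Latn", region: undefined }
console.log(applyComplexMappings({ language: "hy", region: "SU" }).region); // "AM"
```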

Depends on D37445

André Bargull [:anba]

Assignee

Comment 22

•

6 years ago

Attached file Bug 1522070 - Part 14: Remove no longer used IANA language subtag registry code. r=jwalden! — Details

Depends on D37446

André Bargull [:anba]

Assignee

Comment 23

•

6 years ago

Attached file Bug 1522070 - Part 15: Update comment for CanonicalizeUnicodeExtension. r=jwalden! — Details

Depends on D37447

André Bargull [:anba]

Assignee

Comment 24

•

6 years ago

Attached file Bug 1522070 - Part 16: Update grandfathered tags to modern form directly in the parser. r=jwalden! — Details

Intl.Locale no longer needs to handle (uncanonicalised) grandfathered tags,
so we can directly update grandfathered tags to their modern form in the parser.
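
A sketch of what such a direct update might look like, using three regular grandfathered tags and their preferred values from the IANA language subtag registry. The table is a partial, illustrative excerpt, and `updateGrandfathered` is a hypothetical name rather than the function added by this patch.

```javascript
// Illustrative excerpt: grandfathered tag (lower case) -> modern form.
const GRANDFATHERED_MAPPINGS = new Map([
  ["art-lojban", "jbo"],
  ["zh-hakka", "hak"],
  ["zh-xiang", "hsn"],
]);

// Hypothetical helper: replace a grandfathered tag as a whole, otherwise
// return the input unchanged.
function updateGrandfathered(tag) {
  return GRANDFATHERED_MAPPINGS.get(tag.toLowerCase()) ?? tag;
}

console.log(updateGrandfathered("art-lojban")); // "jbo"
console.log(updateGrandfathered("en-US"));      // "en-US"
```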

Depends on D37448

André Bargull [:anba]

Assignee

Comment 25

•

6 years ago

Attached file Bug 1522070 - Part 17: Remove references to RFC 5646. r=jwalden! — Details

Depends on D37449

André Bargull [:anba]

Assignee

Comment 26

•

6 years ago

Try: https://treeherder.mozilla.org/#/jobs?repo=try&revision=1d1089b58023c109bbea1cdcd59e92a126c4ab56

Keywords: checkin-needed

Pulsebot

Comment 27

•

6 years ago

Pushed by nbeleuzu@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/c6dd14881fdb
Part 7: Ensure subtags are in normalised case. r=jwalden
https://hg.mozilla.org/integration/autoland/rev/f79999353e41
Part 8: Order all subtags in canonical syntax form. r=jwalden
https://hg.mozilla.org/integration/autoland/rev/2794e19ca868
Part 9: Add BCP47 tokenizer and split stringification from CanonicalizeLanguageTag. r=jwalden
https://hg.mozilla.org/integration/autoland/rev/566e019402f2
Part 10: Canonicalize BCP 47 T extension subtag. r=jwalden
https://hg.mozilla.org/integration/autoland/rev/6059f6604172
Part 11: Rename LangTagMappingsGenerated.js to LangTagMappingsIANAGenerated.js. r=jwalden
https://hg.mozilla.org/integration/autoland/rev/df9468fb8406
Part 12: Add simple language and region mappings from CLDR. r=jwalden
https://hg.mozilla.org/integration/autoland/rev/aa7801f87840
Part 13: Add complex language and region mappings from CLDR. r=jwalden
https://hg.mozilla.org/integration/autoland/rev/1c3668a78484
Part 14: Remove no longer used IANA language subtag registry code. r=jwalden
https://hg.mozilla.org/integration/autoland/rev/6de0fe9b007c
Part 15: Update comment for CanonicalizeUnicodeExtension. r=jwalden
https://hg.mozilla.org/integration/autoland/rev/984a6cb95c09
Part 16: Update grandfathered tags to modern form directly in the parser. r=jwalden
https://hg.mozilla.org/integration/autoland/rev/1fec79fd2faf
Part 17: Remove references to RFC 5646. r=jwalden

Keywords: checkin-needed

Narcis Beleuzu [:NarcisB]

Comment 28

•

6 years ago

Backed out for mochitest failures on test_ManifestProcessor_lang.html

Backout link: https://hg.mozilla.org/integration/autoland/rev/0d8d40e596d0c5b1455c9e210cb70853613ad6f6
Log link: https://treeherder.mozilla.org/logviewer.html#/jobs?job_id=257230947&repo=autoland&lineNumber=8069

Flags: needinfo?(andrebargull)

Jeff Walden [:Waldo]

Comment 29

•

6 years ago

// Test valid language tags - derived from IANA and BCP-47 spec
// and our Intl.js implementation.
var validTags = [
  "aa", "ab", "ae", "af", "ak", "am", "an", "ar", "as", "av", "ay", "az",
  "ba", "be", "bg", "bh", "bi", "bm", "bn", "bo", "br", "bs", "ca", "ce",
  "ch", "co", "cr", "cs", "cu", "cv", "cy", "da", "de", "dv", "dz", "ee",
  "el", "en", "eo", "es", "et", "eu", "fa", "ff", "fi", "fj", "fo", "fr",
  "fy", "ga", "gd", "gl", "gn", "gu", "gv", "ha", "he", "hi", "ho", "hr",
  "ht", "hu", "hy", "hz", "ia", "id", "ie", "ig", "ik", "io",
  "is", "it", "iu", "ja", "jv", "ka", "kg", "ki", "kj",
  "kk", "kl", "km", "kn", "ko", "kr", "ks", "ku", "kv", "kw", "ky", "la",
  "lb", "lg", "li", "ln", "lo", "lt", "lu", "lv", "mg", "mh", "mi", "mk",
  "ml", "mn", "mr", "ms", "mt", "my", "na", "nb", "nd", "ne", "ng",
  "nl", "nn", "no", "nr", "nv", "ny", "oc", "oj", "om", "or", "os", "pa",
  "pi", "pl", "ps", "pt", "qu", "rm", "rn", "ro", "ru", "rw", "sa", "sc",
  "sd", "se", "sg", "sh", "si", "sk", "sl", "sm", "sn", "so", "sq", "sr",
  "ss", "st", "su", "sv", "sw", "ta", "te", "tg", "th", "ti", "tk", "tl",
  "tn", "to", "tr", "ts", "tt", "tw", "ty", "ug", "uk", "ur", "uz", "ve",
  "vi", "vo", "wa", "wo", "xh", "yi", "yo", "za", "zh", "zu", "en-US",
  "jp-JS", "pt-PT", "pt-BR", "de-CH", "de-DE-1901", "es-419", "sl-IT-nedis",
  "en-US-boont", "mn-Cyrl-MN", "sr-Cyrl", "sr-Latn",
  "zh-TW", "en-GB-boont-posix-r-extended-sequence-x-private",
  "nan-Hans-MM-variant2-variant1-t-zh-latn-u-ca-chinese-x-private",
  "yue-HK", "de-CH-x-phonebk", "az-Arab-x-aze-derbend",
  "qaa-Qaaa-QM-x-southern",
];


for (var tag of validTags) {
  const expected = `Expect lang to be "${tag}"`;
  data.jsonText = JSON.stringify({
    lang: tag,
  });
  const result = processor.process(data);
  is(result.lang, tag, expected);
}

...yeah, "sh" processed above will ultimately go through https://searchfox.org/mozilla-central/rev/da855d65d1fbdd714190cab2c46130f7422f3699/dom/manifest/ValueExtractor.jsm#73 which will behave differently after these patches. Looks like the stuff in this file needs a regen of expected results, more or less, for at least that value, possibly for others.

André Bargull [:anba]

Assignee

Comment 30

•

6 years ago

Attached file Bug 1522070 - Part 18: Update 'lang' member test for Web manifest to match latest ECMA-402. r=marcosc! — Details

ECMA-402 changed the language tag specification from RFC-5646 BCP-47 language
tags to UTS 35 Unicode BCP-47 locale identifiers. Update the expected
canonicalisation results accordingly.

Depends on D37450

André Bargull [:anba]

Assignee

Comment 31

•

6 years ago

Try: https://treeherder.mozilla.org/#/jobs?repo=try&revision=f0032fb96e8190bf9d4230d4bdecb934b0a19627

Flags: needinfo?(andrebargull)

Keywords: checkin-needed

Pulsebot

Comment 32

•

6 years ago

Pushed by archaeopteryx@coole-files.de:
https://hg.mozilla.org/integration/autoland/rev/cc29a8483de6
Part 7: Ensure subtags are in normalised case. r=jwalden
https://hg.mozilla.org/integration/autoland/rev/79f844a1bf23
Part 8: Order all subtags in canonical syntax form. r=jwalden
https://hg.mozilla.org/integration/autoland/rev/4e9279f09500
Part 9: Add BCP47 tokenizer and split stringification from CanonicalizeLanguageTag. r=jwalden
https://hg.mozilla.org/integration/autoland/rev/80158bfcf0f5
Part 10: Canonicalize BCP 47 T extension subtag. r=jwalden
https://hg.mozilla.org/integration/autoland/rev/a11b3db4d9ca
Part 11: Rename LangTagMappingsGenerated.js to LangTagMappingsIANAGenerated.js. r=jwalden
https://hg.mozilla.org/integration/autoland/rev/21eae8e6487a
Part 12: Add simple language and region mappings from CLDR. r=jwalden
https://hg.mozilla.org/integration/autoland/rev/2b41a865b58b
Part 13: Add complex language and region mappings from CLDR. r=jwalden
https://hg.mozilla.org/integration/autoland/rev/fec9e8ac45dd
Part 14: Remove no longer used IANA language subtag registry code. r=jwalden
https://hg.mozilla.org/integration/autoland/rev/20bd2706beb6
Part 15: Update comment for CanonicalizeUnicodeExtension. r=jwalden
https://hg.mozilla.org/integration/autoland/rev/be2eb3b3774c
Part 16: Update grandfathered tags to modern form directly in the parser. r=jwalden
https://hg.mozilla.org/integration/autoland/rev/69499c0220c4
Part 17: Remove references to RFC 5646. r=jwalden
https://hg.mozilla.org/integration/autoland/rev/1e0a350b954a
Part 18: Update 'lang' member test for Web manifest to match latest ECMA-402. r=marcosc

Keywords: checkin-needed

Andreea Pavel [:apavel]

Comment 33

•

6 years ago

bugherder

André Bargull [:anba]

Assignee

Updated

•

6 years ago

Keywords: leave-open

André Bargull [:anba]

Assignee

Updated

•

6 years ago

Status: ASSIGNED → RESOLVED

Closed: 6 years ago

status-firefox70: --- → fixed

Resolution: --- → FIXED

Target Milestone: --- → mozilla70

Jan de Mooij [:jandem]

Updated

•

6 years ago

Regressions: 1567902

Florin Strugariu [:Bebe]

Updated

•

6 years ago

Regressions: 1568760
