Use UTS 35 Unicode BCP 47 Locale Identifiers instead of RFC-5646/6067 BCP 47 language tags

enhancement
P2
normal
ASSIGNED
5 months ago
5 days ago

People

(Reporter: anba, Assigned: anba, NeedInfo)

Tracking

(Blocks 1 bug, {leave-open})

Firefox Tracking Flags

(Not tracked)

Details

Attachments

(6 attachments)

Assignee

Description

5 months ago

Intl.Locale (bug 1433303) depends on https://github.com/tc39/ecma402/pull/289.

https://github.com/tc39/ecma402/pull/289#issuecomment-444178026 lists multiple open issues, but the PR was merged nonetheless, so it's not entirely clear what to implement in some edge cases.

Assignee

Comment 1

3 months ago

I propose we first switch the language tag parser over to Unicode BCP 47 locale identifiers and if that works out without causing any web-compat issues, we can proceed to switch the canonicalisation to whatever UTS 35 specifies. (But also see https://github.com/tc39/ecma402/issues/330.)

Assignee

Comment 2

3 months ago

https://github.com/tc39/ecma402/pull/289 changed ECMA-402 to use Unicode BCP 47
locale identifiers instead of BCP 47 language tags. That means extlang subtags
are no longer supported in language tags.
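
A minimal sketch of why extlang subtags fall out, assuming a simplified reading of the unicode_language_id grammar in UTS 35 (no extensions, no "root"). The regular expression and the function name are illustrative only and are not taken from the patch:

```python
import re

# Simplified unicode_language_id (assumption: extensions and "root" ignored):
#   language: 2-3 or 5-8 letters
#   script:   4 letters (optional)
#   region:   2 letters or 3 digits (optional)
#   variants: 5-8 alphanumerics, or a digit followed by 3 alphanumerics
LANGUAGE_ID = re.compile(
    r"(?:[a-z]{2,3}|[a-z]{5,8})"
    r"(?:-[a-z]{4})?"
    r"(?:-(?:[a-z]{2}|[0-9]{3}))?"
    r"(?:-(?:[a-z0-9]{5,8}|[0-9][a-z0-9]{3}))*",
    re.IGNORECASE,
)


def is_unicode_language_id(tag: str) -> bool:
    """Hypothetical helper: True if tag matches the simplified grammar."""
    return LANGUAGE_ID.fullmatch(tag) is not None


# A three-letter subtag after the language (an extlang in RFC 5646)
# fits none of the script, region, or variant slots, so it is rejected.
print(is_unicode_language_id("zh-Hant-TW"))      # plain language-script-region
print(is_unicode_language_id("zh-cmn-Hans-CN"))  # contains the extlang "cmn"
```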

Assignee

Comment 3

3 months ago

Irregular grandfathered language tags and regular grandfathered tags with
extlang-like subtags can't be parsed as Unicode BCP 47 locale identifiers, so
they now need to be rejected by the language tag parser.

Depends on D23536

Assignee

Comment 4

3 months ago

Language tags consisting only of private-use subtags are not allowed in Unicode
BCP 47 locale identifiers.
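
As a rough illustration, a privateuse-only tag in the RFC 5646 sense is "x" followed by one or more subtags of 1 to 8 alphanumerics. The helper below is hypothetical and only shows the shape of tags the parser now has to reject:

```python
def is_privateuse_only(tag: str) -> bool:
    """Hypothetical helper: True for tags such as "x-private"."""
    subtags = tag.lower().split("-")
    if len(subtags) < 2 or subtags[0] != "x":
        return False
    # Every following subtag must be 1-8 ASCII alphanumerics.
    return all(
        1 <= len(s) <= 8 and s.isascii() and s.isalnum()
        for s in subtags[1:]
    )


print(is_privateuse_only("x-private"))     # privateuse-only tag
print(is_privateuse_only("en-x-private"))  # private use after a language subtag
```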

Depends on D23537

Assignee

Comment 5

3 months ago

Unicode BCP 47 locale identifiers don't support four letter language subtags.

Depends on D23538

Assignee

Comment 6

3 months ago
  • Strict parsing for "u" and "t" extensions is not yet implemented.
  • Canonicalisation per UTS 35 is also not yet implemented, so it still refers to BCP 47 tags.

Depends on D23539

Assignee

Comment 7

3 months ago

Unicode BCP 47 locale identifiers have stricter requirements for the Unicode ("-u-") and
transformed content ("-t-") extension sequences.

  • Keys in Unicode extensions must be of the form "alphanum alpha".
  • Transformed content extensions need to be parsed following the transformed_extensions
    syntax from UTS 35.
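
The two requirements above could be sketched as follows. This is an assumption-laden outline based on the UTS 35 grammar (key = alphanum alpha, tkey = alpha digit, values of 3 to 8 alphanumerics), not the code from the patch; all names are hypothetical, and each function takes only the text after the singleton:

```python
import re

KEY = re.compile(r"[a-z0-9][a-z]")    # Unicode extension key: alphanum alpha
TKEY = re.compile(r"[a-z][0-9]")      # transform field key: alpha digit
VALUE = re.compile(r"[a-z0-9]{3,8}")  # attribute, type, or tvalue subtag
TLANG = re.compile(                   # language, optional script/region, variants
    r"(?:[a-z]{2,3}|[a-z]{5,8})(?:-[a-z]{4})?(?:-(?:[a-z]{2}|[0-9]{3}))?"
    r"(?:-(?:[a-z0-9]{5,8}|[0-9][a-z0-9]{3}))*"
)


def is_valid_unicode_extension(ext: str) -> bool:
    """ext is the text after "-u-", e.g. "ca-gregory-nu-latn"."""
    # Each subtag must be a key or a 3-8 character attribute/type subtag.
    return all(
        KEY.fullmatch(s) or VALUE.fullmatch(s)
        for s in ext.lower().split("-")
    )


def is_valid_transform_extension(ext: str) -> bool:
    """ext is the text after "-t-", e.g. "und-latn-m0-ungegn"."""
    subtags = ext.lower().split("-")
    # Everything before the first tkey is the optional tlang.
    first = next(
        (i for i, s in enumerate(subtags) if TKEY.fullmatch(s)), len(subtags)
    )
    tlang, fields = subtags[:first], subtags[first:]
    if tlang and not TLANG.fullmatch("-".join(tlang)):
        return False
    i = 0
    while i < len(fields):
        if not TKEY.fullmatch(fields[i]):
            return False
        i += 1
        start = i
        while i < len(fields) and VALUE.fullmatch(fields[i]):
            i += 1
        if i == start:
            return False  # a tkey must be followed by at least one tvalue
    return True
```

Under these assumptions a key such as "a1" is rejected in a Unicode extension, because its second character is not a letter, and a transform field such as "m0" with no value is rejected as well.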

Depends on D23540

Assignee

Updated

3 months ago
Duplicate of this bug: 1457571
Assignee

Updated

3 months ago
Assignee: nobody → andrebargull
Status: NEW → ASSIGNED

Comment 10

2 months ago

Pushed by csabou@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/113a287cfb7f
Part 1: Remove support for extlang subtags. r=jwalden
https://hg.mozilla.org/integration/autoland/rev/5a50fd4ec8ba
Part 2: Remove support for irregular grandfathered tags and regular grandfathered tags with extlang-like subtags. r=jwalden
https://hg.mozilla.org/integration/autoland/rev/b215b68fbccc
Part 3: Remove support for privateuse-only language tags. r=jwalden
https://hg.mozilla.org/integration/autoland/rev/c5a97d342431
Part 4: Remove support for four letter language subtags. r=jwalden
https://hg.mozilla.org/integration/autoland/rev/7b0c2144242c
Part 5: Update comments to refer to Unicode BCP 47 locale identifiers. r=jwalden
https://hg.mozilla.org/integration/autoland/rev/1245a50cc3a0
Part 6: Add strict parsing of Unicode and transform extension sequences. r=jwalden

Keywords: checkin-needed

== Change summary for alert #20419 (as of Fri, 12 Apr 2019 06:30:28 GMT) ==

Improvements:

1% Base Content JS linux64-shippable opt 4,023,191.33 -> 4,002,330.67
1% Base Content JS linux64-shippable-qr opt 4,023,148.00 -> 4,002,240.67
1% Base Content JS osx-10-10-shippable opt 4,020,194.67 -> 3,999,178.67
1% Base Content JS windows10-64-shippable opt 4,083,708.00 -> 4,062,744.67
1% Base Content JS windows10-64-shippable-qr opt 4,083,694.00 -> 4,062,758.00
0% Base Content JS linux64-shippable-qr opt 4,020,210.33 -> 4,002,276.67

For up to date results, see: https://treeherder.mozilla.org/perf.html#/alerts?id=20419

I...think this is still open to implement UTS canonicalization from comment 1? Not sure any more from reading the comment history here. I guess that's all the canonical-form things mentioned in http://unicode.org/reports/tr35/#unicode_locale_id like for en-u-ms-imperial → en-u-ms-uksystem.

If that's all that's left, I guess we need to do some more make_intl_data.py hacking to read through all the transform extension data and generate the necessary code to handle that.
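
A rough sketch of the kind of replacement such generated code would perform, assuming a lookup table keyed by (key, type). The table here is illustrative and holds only the example quoted above; a real table would be generated from the CLDR data, and the function name is hypothetical:

```python
# Illustrative table only (assumption): one entry, taken from the example above.
TYPE_ALIASES = {("ms", "imperial"): "uksystem"}


def replace_unicode_type_aliases(tag: str) -> str:
    """Replace aliased types in the "-u-" extension of a lowercase tag.

    Simplification: assumes no "u" subtag appears inside a private-use part.
    """
    subtags = tag.split("-")
    if "u" not in subtags:
        return tag
    start = subtags.index("u") + 1
    out = subtags[:start]
    i = start
    # The extension runs until the next singleton (one-character subtag).
    while i < len(subtags) and len(subtags[i]) > 1:
        if len(subtags[i]) == 2:
            key = subtags[i]
            out.append(key)
            i += 1
            j = i
            while j < len(subtags) and len(subtags[j]) > 2:
                j += 1
            type_ = "-".join(subtags[i:j])
            if type_:
                out.append(TYPE_ALIASES.get((key, type_), type_))
            i = j
        else:
            out.append(subtags[i])  # attribute, copied unchanged
            i += 1
    out.extend(subtags[i:])
    return "-".join(out)


print(replace_unicode_type_aliases("en-u-ms-imperial"))
```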

Flags: needinfo?(andrebargull)
Priority: -- → P2