Closed Bug 1433303 Opened 3 years ago Closed 1 year ago

Implement Intl.Locale

Categories

(Core :: JavaScript: Internationalization API, enhancement, P3)

enhancement

Tracking

()

RESOLVED FIXED
mozilla70
Tracking Status
firefox70 --- fixed

People

(Reporter: zbraniecki, Assigned: anba)

References

(Blocks 1 open bug)

Details

(Keywords: dev-doc-complete, parity-chrome)

Attachments

(5 files, 8 obsolete files)

47 bytes, text/x-phabricator-request
Details | Review
47 bytes, text/x-phabricator-request
Details | Review
47 bytes, text/x-phabricator-request
Details | Review
47 bytes, text/x-phabricator-request
Details | Review
47 bytes, text/x-phabricator-request
Details | Review
Intl.Locale proposal is now at Stage 3 - https://github.com/tc39/proposal-intl-locale/

It would allow us to complete bug 1349377 and help with all language tag operations.
Duplicate of this bug: 1376616
Depends on: 1431957
Depends on: 1382253
Attached patch bug1433303.patch (obsolete) — Splinter Review
A first stab at implementing the Intl.Locale proposal. There are still some spec issues, so it's not always clear to me how some of the operations should actually work. I've worked around some of these issues and simply implemented it the way I expect it to work, for others I've followed the current spec proposal, ignoring that this will create bogus results, e.g. |new Intl.Locale("pt-u-attr-ca-gregory").toString()| returns "pt-u-ca-gregory-attr" which is obviously not the intended result. :-)
Assignee: nobody → andrebargull
Status: NEW → ASSIGNED
Attachment #8948428 - Flags: feedback?(gandalf)
Priority: -- → P3
Comment on attachment 8948428 [details] [diff] [review]
bug1433303.patch

Review of attachment 8948428 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks Anba! That looks great!

I feel like it's distinct enough from other Intl.* APIs that I failed at my attempt to get this working, so I'm particularly thankful you took time to write it.

I have a couple thoughts based on my implementation of that in Rust and JS, they're up to your discretion.

::: js/src/builtin/intl/Locale.js
@@ +15,5 @@
> + * The return type matches the |parseLanguageTag| function, i.e. returns null
> + * if the input could not be parsed, otherwise returns an object holding the
> + * language tag subtags.
> + */
> +function parseLanguageTagSubtag(type, subtag) {

This seems a bit extreme if I understand it correctly. The rules for handling language/script/region subtags are fairly easy ("2*3ALPHA", "4ALPHA", "2ALPHA/3DIGIT" respectively), so I'm not sure why you expand a subtag into a whole tag and then parse it just to retrieve the subtag you need?

@@ +719,5 @@
> +    // Step 3.
> +    var locale = getLocaleInternals(loc).locale;
> +    var localeObj = parseLanguageTag(locale);
> +
> +    // FIXME: spec issue - regular grandfathered tags also match langtag.

I think we decided not to support grandgathered tags at all. (I was also hoping to not support extlang tbh, but I'm not sure if that's an option)

@@ +749,5 @@
> +        ThrowTypeError(JSMSG_INTL_OBJECT_NOT_INITED, "Locale", "script", "Locale");
> +
> +    // Step 3.
> +    var locale = getLocaleInternals(loc).locale;
> +    var localeObj = parseLanguageTag(locale);

Wouldn't it be better to store parsed and canonicalized properties like language/script/region/variants etc and just return them here instead of storing a locale string and parsing it on each get?
Attachment #8948428 - Flags: feedback?(gandalf) → feedback+
(In reply to Zibi Braniecki [:gandalf][:zibi] from comment #3)
> This seems a bit extreme if I understand it correctly. The rules for
> handling language/script/region subtags are fairly easy ("2*3ALPHA",
> "4ALPHA", "2ALPHA/3DIGIT" respectively), so I'm not sure why you expand a
> subtag into a whole tag and then parse it just to retrieve the subtag you
> need?

language subtags are a bit more difficult because of extlang subtags, so I took the easy way out and simply called into the complete parser. But for the final patch this is going to change when we need to start to think about performance tweaks.


> I think we decided not to support grandgathered tags at all. (I was also
> hoping to not support extlang tbh, but I'm not sure if that's an option)

I still need to get back to littledan and ask about this comment [1]. In [2] he said no grandfathered tags, then I asked if he meant only irregular or also regular grandfathered tags, he answered in [1] only irregular grandfathered tags, but also said in [1] the change I suggested was important. But the change is only important if regular grandfathered tags are also unsupported. And that's a contradiction, isn't it? I'm a bit confused right now.

[1] https://github.com/tc39/proposal-intl-locale/issues/12#issuecomment-363550422
[2] https://github.com/tc39/proposal-intl-locale/issues/12#issuecomment-363523930


> Wouldn't it be better to store parsed and canonicalized properties like
> language/script/region/variants etc and just return them here instead of
> storing a locale string and parsing it on each get?

Yes, for the final patch we want to make this more performant.
Anba - do you have any ETA for when this may be ready? I have a number of patches (like bug 1449505) where it would be much cleaner to use this than having to manually parse the language tag.
Flags: needinfo?(andrebargull)
I'd like to land bug 1451082 first, because it simplifies implementing resp. optimising Intl.Locale a bit. And there still a few open issues for the Intl.Locale proposal, so I don't think we want to expose Intl.Locale to non-Chrome content for now. But since most of these open issues are restricted to edge cases, it should be okay to expose it to Chrome content. The new, not yet uploaded, patch works around the open issues as I see fit, but that may not match with what will end up in the spec.
Flags: needinfo?(andrebargull)
> I don't think we want to expose Intl.Locale to non-Chrome content for now.

My focus is on exposing it to chrome-only content via mozIntl. This is also a great vehicle to test the spec on the path to stage 4.

Thank you! I understand why you want to go after bug 1451082 first. I'll wait for that with my code.
Depends on: 1451082
Attached patch bug1433303.patch (obsolete) — Splinter Review
Updated patch, applies on top of bug 1451082.

Also includes some optimisations, for example it avoids parsing the language two times (in ApplyOptionsToTag and in ApplyUnicodeExtensionToTag). Creating Intl.Locale objects is still not super-fast, because it involves a slow C++ -> JS call (~30% of the time needed to construct a simple object like |new Intl.Locale("de")| is spent just to perform the call from C++ to JS).
Attachment #8948428 - Attachment is obsolete: true
Attached patch bug1433303.patch (obsolete) — Splinter Review
Updated for current proposal text plus https://github.com/tc39/proposal-intl-locale/pull/24.
Attachment #8964826 - Attachment is obsolete: true
Depends on: 1457571
Attached patch bug1433303.patch (obsolete) — Splinter Review
Updated to match current proposal text (ea319add7f8719ed17f800892c61575c124dac6b).
Attachment #8969303 - Attachment is obsolete: true
Comment on attachment 8972460 [details] [diff] [review]
bug1433303.patch

Review of attachment 8972460 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/tests/non262/Intl/Locale/locale-constructor-unicode-ext.js
@@ +33,5 @@
> +for (var [langtag, canonical] of Object.entries(validLanguageTags)) {
> +    assertEq(new Intl.Locale(langtag).toString(), canonical);
> +}
> +
> +for (var [langtag, canonical] of invalidLanguageTags) {

Interestingly, this happily runs, but every loop gets langtag="d" and canonical="a" and the rest of the string is dropped.
(In reply to :Ms2ger (⌚ UTC+1/+2) from comment #11)
> Interestingly, this happily runs, but every loop gets langtag="d" and
> canonical="a" and the rest of the string is dropped.

Ugh, that line should have been:
---
for (var langtag of invalidLanguageTags) {
---
Comment on attachment 8972460 [details] [diff] [review]
bug1433303.patch

Review of attachment 8972460 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/tests/non262/Intl/Locale/grandfathered.js
@@ +81,5 @@
> +for (var {tag, options, canonical, extensions} of testData) {
> +    var loc = new Intl.Locale(tag, options);
> +    assertEq(loc.toString(), canonical);
> +
> +    for (var {name, value} of Object.entries(extensions)) {

Should be `[name, value]` or this will be checking `assertEq(loc[undefined], undefined)`
Preparation part 1:
- Splits the tokenizer from the BCP47 language tag parser, so a later patch can reuse the tokenizer.
- Also includes some minor clean-ups to reduce code duplication in the parser.
- And splits the language tag canonicalization and stringification into two separate functions, so later patches can reuse their functionality more easily.
Attachment #8972460 - Attachment is obsolete: true
Attachment #8987665 - Flags: review?(jwalden+bmo)
- Renames intrinsic_IsWrappedArrayBuffer to intrinsic_IsWrappedInstanceOfBuiltin and moves it closer to intrinsic_IsInstanceOfBuiltin/intrinsic_GuardToBuiltin.
- It's also now required to pass an object value to the function for consistency with intrinsic_IsInstanceOfBuiltin/intrinsic_GuardToBuiltin.
- And perform the same changes for intrinsic_IsPossiblyWrappedTypedArray/intrinsic_IsPossiblyWrappedInstanceOfBuiltin.

(Earlier patches for Intl.Locale used IsPossiblyWrappedInstanceOfBuiltin for the new LocaleObject, the current patch uses only IsWrappedInstanceOfBuiltin. But given that I've already written the changes for IsPossiblyWrappedInstanceOfBuiltin, I think it makes sense to keep them, even though they're no longer required for the Intl.Locale patch.)
Attachment #8987666 - Flags: review?(jwalden+bmo)
Now the big patch for Intl.Locale:


js/src/builtin/intl/CommonFunctions.js:
- Add helper functions to parse standalone language, script, region, and Unicode extension types.
- In C++ I'd try to avoid duplicating parts of the BCP47 parser, for (self-hosted) JavaScript code it was easier to just copy a bit of code, at least when trying to get a somewhat performant Intl.Locale object. :-/
- Update CanonicalizeLocaleList per the Intl.Locale proposal to accept Intl.Locale objects.


js/src/builtin/intl/Locale.js:

I've invested a bit of time into performance work for Intl.Locale, so users don't shy away from creating Intl.Locale objects when working with other Intl objects when compared to passing bare BCP47 language tag strings. The implementation is still relatively near to the spec text, but in some places there are extra checks and sometimes some spec steps are skipped or performed earlier/later.

ApplyOptionsToTag:
- Contrary to the spec text, the patch passes a language tag object instead of a language tag string to the function.
- Language tag canonicalization is also performed in the caller instead of directly in the function.

ApplyUnicodeExtensionToTag:
- Instead of creating a separate |records| object, the function modifies the input |options| object. This allows the fast path below step 4.a to directly return to the caller instead of creating a new result object and copying the options values.

InitializeLocale:
- Has more optimizations to skip steps where possible, with (short) comments explaining the short cuts.

Intl_Locale_maximize, Intl_Locale_minimize:
- And more optimizations, so we can skip calling the Intl.Locale constructor (avoids a JS -> C++ -> JS transition, which is kind of slowish) and we don't need to reparse and revalidate known valid language tags.



js/src/builtin/intl/Locale.cpp:

About 80% of the code in this file is for Intl.Locale.prototype.maximize/minimize, which makes it almost sound that something went wrong here. So, err, let me explain this mess.

ICU provides uloc_addLikelySubtags and uloc_minimizeSubtags, which perform (almost) everything we need for Intl.Locale.prototype.maximize/minimize. Except they work on ICU's Locale IDs (not BCP47 language tags!), so a proper citizen is supposed to perform the following calls when maximizing a language tag using ICU:
  1. Call uloc_forLanguageTag to transform the BCP47 language tag into an ICU Locale ID.
  2. Call uloc_addLikelySubtags to maximize the Locale ID.
  3. Call uloc_toLanguageTag to convert the result back into a BCP47 language tag.
  4. And finally pass the BCP47 language tag to the Intl.Locale constructor, so it can be reparsed and canonicalized etc.

Only doing this in exactly that order is really slow (because we'd need to call into ICU multiple times, malloc memory, etc). Fortunately it's relatively easy to skip all steps, except for the call to uloc_addLikelySubtags, so we can save some time here. For example to avoid calling uloc_forLanguageTag and uloc_toLanguageTag, we *only* need to reimplement the parts we actually need, which boils down to a bit of string manipulation. But since we're in C++ and in this case we need to work on C-strings, string manipulation is not as easy and safe as in, let's say JavaScript. But then I heard about this new shiny mozilla::Span thing, which is used by the cool kids, and also provides nice release assertions, so bugs when working with raw char-pointers reduce the likelihood to introduce sec-issues. \o/ 
I started to frantically converting everything to mozilla::Span, only then to realize that mozilla::Span doesn't support nullptr values, so we can't just return a nullptr-span when an error occured. /o\
And because I didn't want to loose the release assertion which we get automatically when working with mozilla::Span, I've sprinkled JS::Result throughout Locale.cpp which probably made it the most non-SpiderMonkey looking file I've ever written.
Attachment #8987667 - Flags: review?(jwalden+bmo)
Attached patch bug1433303-part4-tests.patch (obsolete) — Splinter Review
Tests for Intl.Locale.

Some of the tests in js/src/tests/non262/Intl/Locale were already upstreamed (in some form) to test262 by Ms2ger. I still need to verify if the current test262 cover all functionality when compared to the tests in js/src/tests/non262/Intl/Locale. And some of the test262 tests need fixing (for example because in some cases the original tests written by me were incorrect!).
Attachment #8987668 - Flags: review?(jwalden+bmo)
Would it make sense to move the language tag parsing to C++ or Rust?

In C++ we have a parser here: https://searchfox.org/mozilla-central/source/intl/locale/MozLocale.cpp#19
In Rust we have a parser here: https://github.com/pyfisch/rust-language-tags or https://crates.io/crates/fluent-locale
(In reply to Zibi Braniecki [:gandalf][:zibi] from comment #18)
> Would it make sense to move the language tag parsing to C++ or Rust?

Yes, at some point I'd like to move it to C++. (For performance reasons, but not necessarily because parsing in JavaScript is slow, it's more about avoiding the C++ to JS transition. And when the parser, including its canonicalization support, is in C++, we can move js/src/builtin/intl/LangTagMappingsGenerated.js into C++, too, which could enable us to share the data across compartments, which is nice for memory reasons.)

> In C++ we have a parser here:
> https://searchfox.org/mozilla-central/source/intl/locale/MozLocale.cpp#19
> In Rust we have a parser here: https://github.com/pyfisch/rust-language-tags
> or https://crates.io/crates/fluent-locale

The C++ one only supports a subset, the Rust one seems to be missing some aspects we need (e.g. duplicate variant subtags don't seem to be checked; canonicalization looks incomplete), so I think we're stuck with our own parser for now. :-)
I believe the following tests (from the last patch I looked at) are covered by test262 now:

js/src/tests/non262/Intl/Locale/getters.js
js/src/tests/non262/Intl/Locale/grandfathered-language-likely-subtags.js
js/src/tests/non262/Intl/Locale/grandfathered-language-option.js
js/src/tests/non262/Intl/Locale/grandfathered.js
js/src/tests/non262/Intl/Locale/likely-subtags.js
js/src/tests/non262/Intl/Locale/locale-constructor-empty-options.js
js/src/tests/non262/Intl/Locale/locale-constructor-invalid-options.js
js/src/tests/non262/Intl/Locale/locale-constructor-lang-tags.js
js/src/tests/non262/Intl/Locale/locale-constructor-locale-obj.js
js/src/tests/non262/Intl/Locale/locale-constructor-not-callable.js
js/src/tests/non262/Intl/Locale/locale-constructor-tag-type-check.js
js/src/tests/non262/Intl/Locale/locale-constructor-unicode-ext.js
js/src/tests/non262/Intl/Locale/non-iana-canonicalization.js
js/src/tests/non262/Intl/Locale/surface.js
js/src/tests/non262/Intl/Locale/tags-parsed-twice.js

js/src/tests/non262/Intl/Locale/cross-compartment.js is not.
> The C++ one only supports a subset, the Rust one seems to be missing some aspects we need (e.g. duplicate variant subtags don't seem to be checked; canonicalization looks incomplete), so I think we're stuck with our own parser for now. :-)

I'd be down to extend C++ one to support more (preferably without degrading performance too much, so maybe extension parsing can be done as an option?) because it would mean we can carry full BCP47 tags around Gecko.

The Rust one can be improved to our needs and I'd like to replace the C++ one with the Rust one at some point, but I assume it would be hard to get it into SpiderMonkey since there's no Rust in it yet.

What I'm trying to say is that if we want the parser in compiled language, it'd be very beneficial to have a parser shared between Gecko and SpiderMonkey.
Comment on attachment 8987665 [details] [diff] [review]
bug1433303-part1-split-bcp47-tokenizer.patch

Review of attachment 8987665 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/builtin/intl/CommonFunctions.js
@@ +556,5 @@
> +        // the canonical form.
> +        // TODO: Change to in-place update instead of creating a new object?
> +        var tagObj = parseLanguageTag(localeObj.locale);
> +        assert(tagObj !== null, "grandfathered language tags are well-formed");
> +        return tagObj;

Why can't |localeObj| be returned directly here, as it is for the |language === undefined| case?

@@ +727,5 @@
> +
> +    if (extensions.length > 0)
> +        canonical += "-" + callFunction(std_Array_join, extensions, "-");
> +
> +    if (privateuse)

Not !== undefined?
Attachment #8987665 - Flags: review?(jwalden+bmo) → review+
Attachment #8987666 - Flags: review?(jwalden+bmo) → review+
Comment on attachment 8987667 [details] [diff] [review]
bug1433303-part3-intl-locale.patch

Review of attachment 8987667 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/builtin/intl/CommonFunctions.js
@@ +571,5 @@
> +        if (ts.tokenLength === script.length) {
> +            // The first character of a script code needs to be capitalized.
> +            // "hans" -> "Hans"
> +            return callFunction(std_String_toUpperCase, script[0]) +
> +                   callFunction(std_String_toLowerCase, Substring(script, 1, script.length - 1));

Maybe worth a comment noting that |ts.tokenLength === script.length| guarantees no trailing garbage.

@@ +590,5 @@
> +    //        / 3DIGIT              ; UN M.49 code
> +    if ((ts.tokenLength === 2 && ts.token === ALPHA) ||
> +        (ts.tokenLength === 3 && ts.token === DIGIT))
> +    {
> +        if (ts.tokenLength === region.length) {

And same.

@@ +619,5 @@
> +        if (ts.tokenLength < 3 || ts.tokenLength > 8)
> +            return null;
> +
> +        NEXT_TOKEN_OR_RETURN_NULL(ts);
> +    } while (ts.token !== NONE);

Note how we *do* have to check for trailing garbage here, unlike the others above.

::: js/src/builtin/intl/Locale.cpp
@@ +98,5 @@
> +    Rooted<LocaleObject*> locale(cx, NewObjectWithGivenProto<LocaleObject>(cx, proto));
> +    if (!locale)
> +        return false;
> +
> +    locale->setReservedSlot(LocaleObject::INTERNALS_SLOT, NullValue());

The five lines above should probably be their own function so you can call them here and in intl_CreateUninitializedLocale without duplicating (in particular) the slot-initialization bit of code.

@@ +194,5 @@
> +IsStructurallyValidLanguageTag(mozilla::Span<const CharT> language)
> +{
> +    // BNF: language = 2*8ALPHA.
> +    // Canonical form is lower case.
> +    return (2 <= language.Length() && language.Length() <= 8) &&

Given the rest of our code style, I'd somewhat prefer size()/data() as the accessors into Span data.

@@ +227,5 @@
> +template <typename CharT>
> +using ValidateSubtagFn = decltype(IsStructurallyValidLanguageTag<CharT>);
> +
> +template <ValidateSubtagFn<JS::Latin1Char> validateLatin1,
> +          ValidateSubtagFn<char16_t> validateTwoByte>

You should just make the various validation template functions be functions inside template classes, then you can have a |template <typename CharT> class Validator| template template parameter and not have to pass the same template twice.

Too bad we can't use generic lambdas yet.  Also I feel like there should be *some* way to have a template parameter that is a function template, but I am probably wrong about that apprehension as I couldn't find anything searching online just now.

@@ +241,5 @@
> +static bool
> +IsStructurallyValidLanguageTag(JSLinearString* language)
> +{
> +    return IsStructurallyValidSubtag<IsStructurallyValidLanguageTag,
> +                                     IsStructurallyValidLanguageTag>(language);

I'm mildly surprised you don't have to specify a CharT for these.

@@ +289,5 @@
> +    MOZ_TRY_VAR(result, getLanguage(cx, localeId, language));
> +
> +    // The empty string denotes the root locale in ICU, convert it to "und".
> +    if (result.IsEmpty())
> +        return mozilla::MakeStringSpan("und");

If everything else is including a null terminator in the returned spans -- isn't it? -- then this is wrong.  MakeStringSpan says it doesn't include a null terminator.

I think I would prefer that we never include null-terminators in the sizes of returned spans, and null-terminating is something only the caller ever does to a span.  (Also this goes without saying but it's really awful that ICU demands null termination to use many of these APIs.)  If that can be done to all this patch without too much violence, let's do that.

@@ +327,5 @@
> +}
> +
> +static ConstStringSpan
> +CreateLocaleForLikelySubtags(JSLinearString* language, JSLinearString* script,
> +                             JSLinearString* region, StringSpan locale)

Make these all handles.

@@ +329,5 @@
> +static ConstStringSpan
> +CreateLocaleForLikelySubtags(JSLinearString* language, JSLinearString* script,
> +                             JSLinearString* region, StringSpan locale)
> +{
> +    auto appendSubtag = [](JSLinearString* subtag, StringSpan span, bool addSep) {

And the argument here.

@@ +366,5 @@
> +        if (addSep)
> +            span[0] = '_';
> +        size_t offset = size_t(addSep);
> +        auto subspan = span.Subspan(offset, subtag.Length());
> +        mozilla::PodCopy(subspan.Elements(), subtag.Elements(), subtag.Length());

std::copy_n(subtag.Elements(), subtag.Length(), subspan.Elements());

@@ +409,5 @@
> +{
> +    MOZ_ASSERT(localeId.Last<1>()[0] == '\0', "locale ID must be zero-terminated");
> +
> +    int32_t size = intl::CallICU(cx, [&localeId](char* chars, uint32_t size, UErrorCode* status) {
> +        return likelySubtagsFn(localeId.Elements(), chars, size, status);

AssertedCast<int32_t>(size)

@@ +485,5 @@
> +    ConstStringSpan script;
> +    MOZ_TRY_VAR(script, GetScriptTag(cx, localeId, mozilla::MakeSpan(scriptChars)));
> +    MOZ_TRY(assignSubtag(LikelySubtagsResult_Script, script));
> +
> +    char regionChars[ULOC_COUNTRY_CAPACITY] = {0};

All the arrays of compile-time-known length are...a bit skeevy IMO.  It's not that I don't trust ICU to get it r...oh who am I kidding, these sorts of constants are inherently worrisome.  :-)

Could you clarify somewhere that "language" in these bits of code refer not to the "language" production in BCP47, but specifically to the language production without optional extlang components?  ULOC_LANG_CAPACITY is not sized to either language-without-extlangs *or* to language-the-production (I guess this only works because the min/max algorithms in ICU only work on locale IDs, and those can't be something like "xxx-xxx-xxx-xxx"?), and I got a bit confused checking these capacities against reality just now.

@@ +538,5 @@
> +// 1. Call uloc_forLanguageTag() to transform the input BCP 47 language tag
> +//    into a locale ID.
> +// 2. Call uloc_addLikelySubtags() to add the likely subtags to the locale ID.
> +// 3. Call uloc_toLanguageTag() to transform the resulting locale ID back into
> +//    a BCP 47 language tag.

This is all kind of trash.  I guess okay for now, or something.

But maybe we should contemplate actually fixing ICU to have sensible, efficient APIs for this that don't require transformation and un-transformation?  Add/remove likely subtags is not that crazy an operation, and it's silly for us to be reimplementing these narrow bits just because ICU is awful.

@@ +570,5 @@
> +    MOZ_ASSERT(args[0].isString());
> +    MOZ_ASSERT(args[1].isString() || args[1].isUndefined());
> +    MOZ_ASSERT(args[2].isString() || args[2].isUndefined());
> +
> +    constexpr size_t LanguageTagMaxLength = 8; // language = 2*8ALPHA

Mention "(sans extlangs)" or something.

@@ +575,5 @@
> +    constexpr size_t ScriptTagMaxLength = 4; // script = 4ALPHA
> +    constexpr size_t RegionTagMaxLength = 3; // region = 2ALPHA / 3DIGIT
> +
> +    static_assert(LanguageTagMaxLength < ULOC_LANG_CAPACITY,
> +                  "BCP 47 language tag fits into ULOC_LANG_CAPACITY");

without extlangs

@@ +596,5 @@
> +        script = args[1].toString()->ensureLinear(cx);
> +        if (!script)
> +            return cx->alreadyReportedOOM();
> +        MOZ_ASSERT(IsStructurallyValidScriptTag(script));
> +    }

else {
    MOZ_ASSERT(args[1].isUndefined());
}

@@ +604,5 @@
> +        region = args[2].toString()->ensureLinear(cx);
> +        if (!region)
> +            return cx->alreadyReportedOOM();
> +        MOZ_ASSERT(IsStructurallyValidRegionTag(region));
> +    }

else { ... }

@@ +623,5 @@
> +
> +    // FIXME: Spec needs to define behaviour if no match was found, see UTS35.
> +    // For example uloc_addLikelySubtags("aaa") returns "aaa", but UTS #35
> +    // requires to either report an error or return the result for "und".
> +    // https://github.com/tc39/proposal-intl-locale/issues/42

This got resolved as "aaa", looks like, which seems to mean we just don't need to do anything except remove this comment, right?

::: js/src/builtin/intl/Locale.js
@@ +50,5 @@
> +
> +        region = standaloneRegion;
> +    }
> +
> +    // TODO: Spec issue - do we want to give some kind of indication if the

Everything from here down is longer consistent with latest spec, somewhat unsurprisingly given the delay to review here.  How about a followup patch to fix these steps into shape?  I'm just going to not bother looking at any of this here, I think.

@@ +150,5 @@
> +    for (var i = 0; i < relevantExtensionKeys.length; i++) {
> +        var key = relevantExtensionKeys[i];
> +
> +        // Step 7.a.
> +        var value = undefined;

Could also just not initialize this, IMO.

@@ +226,5 @@
> +    var size = extension.length;
> +
> +    // Step 5.
> +    // |extension| starts with "u-" instead of "-u-" in our implementation, so
> +    // we need to initialize |k| with 2 instead of 3.

assert(extension.length >= 2);
assert(extension[0] === 'u');
assert(extension[1] === '-');

@@ +236,5 @@
> +        // Step 6.a.
> +        var e = callFunction(std_String_indexOf, extension, "-", k);
> +
> +        // Step 6.b.
> +        var len = e < 0 ? size - k : e - k;

(e < 0 ? size : e) - k

@@ +238,5 @@
> +
> +        // Step 6.b.
> +        var len = e < 0 ? size - k : e - k;
> +
> +        // Step 6.c.

Spec text is "Let subtag the String value equal to the..." for this, missing "be" after "subtag".  #lazygithub

@@ +253,5 @@
> +            if (len === 2) {
> +                // Step 6.e.i.1.
> +                // NB: Duplicates are removed in CanonicalizeUnicodeExtension.
> +                _DefineDataProperty(keywords, keywords.length, {key, value});
> +            } else {

This structure all matches the spec, but it would be h*ckin clearer if the spec and this algorithm were written first as a loop over attributes, then a loop over keywords.  IMO.

@@ +344,5 @@
> +        var {key, value} = keywords[i];
> +        if (i === 0 || key !== keywords[i - 1].key) {
> +            extension += "-" + key;
> +            if (value !== "")
> +                extension += "-" + value;

This is going to require a followup to address https://github.com/tc39/proposal-intl-locale/issues/43 which at least does s/true// on whole values and does other transformations I haven't fully puzzled out entirely.  (We gonna have to shell out to ICU to do "u-ca-islamicc" -> "u-ca-islamic-civil" and the whole rest of them?  :-\ )

@@ +516,5 @@
> +    assert(IsObject(tagObj), "CreateLocale called with non-object");
> +    assert(GuardToLocale(otherLocale) !== null, "CreateLocale called with non-Locale");
> +
> +    var locInternals = getLocaleInternals(otherLocale);
> +    var locale = intl_CreateUninitializedLocale();

I have a general preference for creating things as late as possible, i.e. do this *after* you've fully initialized |internals|, right before the slot-is-null assert/set-of-slot.

@@ +687,5 @@
> +    return ToStringLanguageTagObject(tagObj);
> +}
> +
> +/**
> + * get Intl.Locale.prototype.baseName

Spec orders this after calendar -- maybe spec should swap these for alphabetical ordering?

::: js/src/vm/CommonPropertyNames.h
@@ +226,1 @@
>      macro(locale, locale, "locale") \

😬
Attachment #8987667 - Flags: review?(jwalden+bmo) → review+
(In reply to Jeff Walden [:Waldo] from comment #22)
> ::: js/src/builtin/intl/CommonFunctions.js
> @@ +556,5 @@
> > +        // the canonical form.
> > +        // TODO: Change to in-place update instead of creating a new object?
> > +        var tagObj = parseLanguageTag(localeObj.locale);
> > +        assert(tagObj !== null, "grandfathered language tags are well-formed");
> > +        return tagObj;
> 
> Why can't |localeObj| be returned directly here, as it is for the |language
> === undefined| case?

Because we want to return the canonicalized language tag and |localeObj| is not canonicalized for grandfathered tags with modern replacements. (For example for "en-GB-oed", |localeObj| is |{ locale: "en-GB-oxendict", grandfathered: true }|, but now we need to return the complete tag object |{ language: "en", extlang1: undefined, extlang2: undefined, extlang3: undefined, script: undefined, region: "GB", variants: ["oxendict"], extensions: [], privateuse: undefined }|.)


(In reply to Jeff Walden [:Waldo] from comment #23)
> @@ +194,5 @@
> > +IsStructurallyValidLanguageTag(mozilla::Span<const CharT> language)
> > +{
> > +    // BNF: language = 2*8ALPHA.
> > +    // Canonical form is lower case.
> > +    return (2 <= language.Length() && language.Length() <= 8) &&
> 
> Given the rest of our code style, I'd somewhat prefer size()/data() as the
> accessors into Span data.

Hmm, I agree that the lower-case names are a better fit within SpiderMonkey code. But the comment at [1] makes it sound like the lower-case versions should be avoided unless mixing them with C++ standard library functions. And for the Locale.cpp file specifically, using the lower-case versions for |Span::IsEmpty|, |Span::Length|, and |Span::Elements|, but everywhere else being forced to use the upper-case versions, because no lower-case alternatives are available, makes the code look inconsistent. 

[1] https://searchfox.org/mozilla-central/rev/3fdc491e118c5cdfbaf6e2d52f3466d2b27ad1de/mfbt/Span.h#403-410

> @@ +289,5 @@
> > +    MOZ_TRY_VAR(result, getLanguage(cx, localeId, language));
> > +
> > +    // The empty string denotes the root locale in ICU, convert it to "und".
> > +    if (result.IsEmpty())
> > +        return mozilla::MakeStringSpan("und");
> 
> If everything else is including a null terminator in the returned spans --
> isn't it? -- then this is wrong.  MakeStringSpan says it doesn't include a
> null terminator.

Actually this is wrong for a different reason. (GetLanguageTag is called from HasExpectedLanguageTag and HasExpectedLanguageTag expects the result value is always stored in the supplied span.)

> 
> I think I would prefer that we never include null-terminators in the sizes
> of returned spans, and null-terminating is something only the caller ever
> does to a span.  (Also this goes without saying but it's really awful that
> ICU demands null termination to use many of these APIs.)  If that can be
> done to all this patch without too much violence, let's do that.
> 
> @@ +327,5 @@
> > +}
> > +
> > +static ConstStringSpan
> > +CreateLocaleForLikelySubtags(JSLinearString* language, JSLinearString* script,
> > +                             JSLinearString* region, StringSpan locale)
> 
> Make these all handles.

Normally I prefer not to use handles when garbage collection is not possible at all. (I think you once said you prefer if handles are used everywhere, so you don't need to worry about GC. Whereas I see/use handles as an indication that GC is possible, so whenever I stumble across rooted variables where GC is definitely not possible, I get confused and wonder why things were rooted in the first place. I guess it's simply a different mindset resp. approach how to use handles?)

> @@ +366,5 @@
> > +        if (addSep)
> > +            span[0] = '_';
> > +        size_t offset = size_t(addSep);
> > +        auto subspan = span.Subspan(offset, subtag.Length());
> > +        mozilla::PodCopy(subspan.Elements(), subtag.Elements(), subtag.Length());
> 
> std::copy_n(subtag.Elements(), subtag.Length(), subspan.Elements());

Calling memmove (which is what std::copy_n does here, at least for gcc and clang), will be somewhat slower than [1], but it's probably not noticeable here. (I've used PodCopy here, because CopyChars, which is called in the other CreateLocaleForLikelySubtags function, also calls PodCopy.)

[1] https://searchfox.org/mozilla-central/rev/3fdc491e118c5cdfbaf6e2d52f3466d2b27ad1de/mfbt/PodOperations.h#102-110

> 
> @@ +409,5 @@
> > +{
> > +    MOZ_ASSERT(localeId.Last<1>()[0] == '\0', "locale ID must be zero-terminated");
> > +
> > +    int32_t size = intl::CallICU(cx, [&localeId](char* chars, uint32_t size, UErrorCode* status) {
> > +        return likelySubtagsFn(localeId.Elements(), chars, size, status);
> 
> AssertedCast<int32_t>(size)

Hmm, I think that should actually belong into intl::CallICU itself.

> 
> @@ +485,5 @@
> > +    ConstStringSpan script;
> > +    MOZ_TRY_VAR(script, GetScriptTag(cx, localeId, mozilla::MakeSpan(scriptChars)));
> > +    MOZ_TRY(assignSubtag(LikelySubtagsResult_Script, script));
> > +
> > +    char regionChars[ULOC_COUNTRY_CAPACITY] = {0};
> 
> All the arrays of compile-time-known length are...a bit skeevy IMO.  It's
> not that I don't trust ICU to get it r...oh who am I kidding, these sorts of
> constants are inherently worrisome.  :-)

It's actually recommended by ICU to use these constants and ICU itself uses them everywhere. 

> 
> Could you clarify somewhere that "language" in these bits of code refer not
> to the "language" production in BCP47, but specifically to the language
> production without optional extlang components?  ULOC_LANG_CAPACITY is not
> sized to either language-without-extlangs *or* to language-the-production (I
> guess this only works because the min/max algorithms in ICU only work on
> locale IDs, and those can't be something like "xxx-xxx-xxx-xxx"?), and I got
> a bit confused checking these capacities against reality just now.

ULOC_LANG_CAPACITY was changed to 12 in [1] to support language tags starting with "i-" (and "x-i"). The longest grandfathered language starting with "i-" is "i-enochian", but that fits in 10+1 (with the NUL terminator). Maybe 12 was chosen because the longest grandfathered tag needs 11+1 ("cel-gaulish")? No wait, the change was committed 15 Mar 2001 and at that point RFC 5646 and BCP 47 language tags didn't exist yet. In 2001, RFC 3066 [2] was just issued, so maybe the limit of 12 was actually derived by 3 (maximum of the primary subtag) + 8 (maximum of the second subtag) + 1 (NUL-terminator)? (And the third and subsequent subtags were probably simply ignored?)

[1] https://github.com/unicode-org/icu/commit/5f966a61715b217c20421b13125fbbca7db13803#diff-9fd919f1948e3d6ba37080f3d57c0558
[2] https://tools.ietf.org/html/rfc3066


ULOC_SCRIPT_CAPACITY was added in [3], but neither [3] nor [4] give an explanation why the script capacity was set to 5+1 instead of 4+1. (Unless the constant is meant to include the leading '-' separator, but that'd be strange...)

[3] https://github.com/unicode-org/icu/commit/fd7d8e1dc59d042f144b36e509e1709971d8b7e7
[4] https://unicode-org.atlassian.net/browse/ICU-2942

> 
> @@ +538,5 @@
> > +// 1. Call uloc_forLanguageTag() to transform the input BCP 47 language tag
> > +//    into a locale ID.
> > +// 2. Call uloc_addLikelySubtags() to add the likely subtags to the locale ID.
> > +// 3. Call uloc_toLanguageTag() to transform the resulting locale ID back into
> > +//    a BCP 47 language tag.
> 
> This is all kind of trash.  I guess okay for now, or something.
> 
> But maybe we should contemplate actually fixing ICU to have sensible,
> efficient APIs for this that don't require transformation and
> un-transformation?  Add/remove likely subtags is not that crazy an
> operation, and it's silly for us to be reimplementing these narrow bits just
> because ICU is awful.

I use a similar trick like #lazygithub: Add a test262 test and wait until someone else files an ICU feature request, because their JS-engine doesn't pass test262. :-)


> @@ +596,5 @@
> > +        script = args[1].toString()->ensureLinear(cx);
> > +        if (!script)
> > +            return cx->alreadyReportedOOM();
> > +        MOZ_ASSERT(IsStructurallyValidScriptTag(script));
> > +    }
> 
> else {
>     MOZ_ASSERT(args[1].isUndefined());
> }

This is already asserted right at the start of the function. I'll just move the constants below the argument processing, so that (1) the argument checks are closer together and (2) the constants are nearer to where they're actually used.


> @@ +623,5 @@
> > +
> > +    // FIXME: Spec needs to define behaviour if no match was found, see UTS35.
> > +    // For example uloc_addLikelySubtags("aaa") returns "aaa", but UTS #35
> > +    // requires to either report an error or return the result for "und".
> > +    // https://github.com/tc39/proposal-intl-locale/issues/42
> 
> This got resolved as "aaa", looks like, which seems to mean we just don't
> need to do anything except remove this comment, right?

Yes.

> 
> ::: js/src/builtin/intl/Locale.js
> @@ +50,5 @@
> > +
> > +        region = standaloneRegion;
> > +    }
> > +
> > +    // TODO: Spec issue - do we want to give some kind of indication if the
> 
> Everything from here down is longer consistent with latest spec, somewhat
> unsurprisingly given the delay to review here.  How about a followup patch
> to fix these steps into shape?  I'm just going to not bother looking at any
> of this here, I think.

Ok.


> @@ +344,5 @@
> > +        var {key, value} = keywords[i];
> > +        if (i === 0 || key !== keywords[i - 1].key) {
> > +            extension += "-" + key;
> > +            if (value !== "")
> > +                extension += "-" + value;
> 
> This is going to require a followup to address
> https://github.com/tc39/proposal-intl-locale/issues/43 which at least does
> s/true// on whole values and does other transformations I haven't fully
> puzzled out entirely.  (We gonna have to shell out to ICU to do
> "u-ca-islamicc" -> "u-ca-islamic-civil" and the whole rest of them?  :-\ )

I guess because of issues like [1], we'll need to do this in our own code.

[1] https://unicode-org.atlassian.net/browse/ICU-13417


> @@ +687,5 @@
> > +    return ToStringLanguageTagObject(tagObj);
> > +}
> > +
> > +/**
> > + * get Intl.Locale.prototype.baseName
> 
> Spec orders this after calendar -- maybe spec should swap these for
> alphabetical ordering?

Yes.
Comment on attachment 8987668 [details] [diff] [review]
bug1433303-part4-tests.patch

Review of attachment 8987668 [details] [diff] [review]:
-----------------------------------------------------------------

Didn't look at any of the tests that were just enabled without any changes to them.  Maybe I should spot-check some, but this has dragged long enough, better to just get it out there and lean on TC39 doing enough validation pre-landing.

This is a very aggressive r+.  Feel free to rethink it, or to post an interdiff of extra changes for review if there are interesting bits as I suspect there may be.

::: js/src/tests/non262/Intl/Locale/getters.js
@@ +11,5 @@
> +assertEq(loc.region, "DE");
> +assertEq(loc.calendar, "gregory");
> +assertEq(loc.collation, "phonebk");
> +assertEq(loc.hourCycle, "h23");
> +assertEq(loc.caseFirst, "true");

If https://www.unicode.org/reports/tr35/#u_Extension says true subtags are supposed to be removed, that only applies to [[Locale]], right?  And this should indeed still be "true", right?

If true is really not meaningful for this value, I would prefer you used "xyzw" or some similar nonsense string to clarify this isn't actually an intrinsically meaningful value.  Probably same for all the other subtags here, too.

@@ +18,5 @@
> +
> +// Replace all components through option values and validate the getters still
> +// return the expected results.
> +var loc = new Intl.Locale(langtag, {
> +    language: "ja",

On the other hand, all of these option values *are* range-checked, so they do have to stay as-is.  (It is possibly strange that the raw text format lets you create stuff that is un-creatable using options.  Kinda wish there were a light remove-garbage pass on strings as parsed by this.)

@@ +25,5 @@
> +    calendar: "japanese",
> +    collation: "search",
> +    hourCycle: "h24",
> +    caseFirst: "false",
> +    numeric: "true",

This should be a boolean, not a string, for sanity.

@@ +71,5 @@
> +assertEq(loc.region, undefined);
> +
> +// Regular grandfathered language tag.
> +var loc = new Intl.Locale("cel-gaulish");
> +assertEq(loc.baseName, "cel-gaulish");

*mutters about there being any grandfathered tags without langtag-formatted alternatives yet again, why can't upstream IANA add alternatives for all of these yet*

::: js/src/tests/non262/Intl/Locale/grandfathered-language-option.js
@@ +11,5 @@
> +    {
> +        tag: "nb",
> +        options: {
> +            language: "no-bok",
> +            region: "NO",

This property should never be accessed, right?  Step 4b https://tc39.github.io/proposal-intl-locale/#sec-apply-options-to-tag should fail for both these before this property would be examined.  So why, if this is going to exist, shouldn't it be an always-throwing getter or something?  And maybe all the other fields too, for completeness.  (Most complete would be an intercepting proxy and a check of an exact operation sequence, but that level of paranoia seems not absolutely necessary.)

Also, can't this test be upstreamed?

::: js/src/tests/non262/Intl/Locale/grandfathered.js
@@ +5,5 @@
> +    {
> +        tag: "i-default",
> +        options: {
> +            region: "ZZ",
> +            numberingSystem: "latn",

This (non-undefined region or script with a canonicalizes-to-grandfathered tag) should cause a throw under the current proposal -- and the numberingSystem accessor should either not exist or should throw something.

@@ +20,5 @@
> +        options: {
> +            region: "US",
> +            numberingSystem: "latn",
> +        },
> +        canonical: "en-GB-oxendict-u-nu-latn",

Current proposal sez "en-US-oxendict-u-nu-latn".

@@ +34,5 @@
> +            numberingSystem: "latn",
> +        },
> +        canonical: "cel-gaulish",
> +        extensions: {
> +            numberingSystem: undefined,

If I read things correctly, under the current proposal this should fall into https://tc39.github.io/proposal-intl-locale/#sec-apply-unicode-extension-to-tag step 2 and throw away |options.numberingSystem| changes.  These semantics don't seem consistent with throwing for non-undefined region or script in the similar situation, that throws above.  I'm not sure that's ideal; but also I can't remember if we affirmatively decided on these differing semantics in the last call or in a PR discussing it.  :-\

@@ +42,5 @@
> +    // Regular grandfathered without modern replacement.
> +    {
> +        tag: "cel-gaulish",
> +        options: {
> +            region: "FR",

This should cause a throw as noted above.

@@ +67,5 @@
> +    // Regular grandfathered with modern replacement.
> +    {
> +        tag: "art-lojban",
> +        options: {
> +            region: "ZZ",

This seems wrong under the current proposal, looks like.  You get jbo, then you put ZZ into it, then you tack on the Unicode bits.  I'm not sure where ZZ goes away in that.  If you were max/mining, sure, but this isn't.  Right?

::: js/src/tests/non262/Intl/Locale/likely-subtags.js
@@ +37,5 @@
> +    "ar-Arab": "ar",
> +
> +    // Language and region subtags are present.
> +    "en-US": "en",
> +    "en-GB": "en-GB",

This is...interesting, given we use en-GB as the thing we think is predictive of more of our users when we can't get better data.

::: js/src/tests/non262/Intl/Locale/locale-constructor-empty-options.js
@@ +21,5 @@
> +// Empty 'caseFirst' option is rejected.
> +assertThrowsInstanceOf(() => new Intl.Locale("en", {caseFirst: ""}), RangeError);
> +
> +// Empty 'numeric' option is okay.
> +assertEq(new Intl.Locale("en", {numeric: ""}).toString(), "en-u-kn-false");

I mean, yes, but...this is a boolean property, so this is kind of silly.  If other tests test both true/false values for this, I would just omit this test -- or leave a comment about how "numeric" is boolean and so there's no empty value for it.

::: js/src/tests/non262/Intl/Locale/locale-constructor-invalid-options.js
@@ +3,5 @@
> +// Irregular grandfathered language tag.
> +assertThrowsInstanceOf(() => new Intl.Locale("en", {language: "i-klingon"}), RangeError);
> +
> +// Regular grandfathered language tag are currently allowed.
> +// assertEq(new Intl.Locale("zh-Hant", {language: "zh-min-nan"}).toString(), "zh-min-nan-Hant");

You should be able to test for this throwing a RangeError when the language property is grandfathered, https://tc39.github.io/proposal-intl-locale/#sec-apply-options-to-tag step 4b, right?

@@ +42,5 @@
> +
> +// 'hourCycle' option is validated, allowed values: "h11", "h12", "h23", "h24".
> +assertThrowsInstanceOf(() => new Intl.Locale("en", {hourCycle: "h25"}), RangeError);
> +
> +// 'caseFirst' option is validated, allowed values: "upper", "lower", "false".

Maybe s/allowed/exact allowed/

@@ +46,5 @@
> +// 'caseFirst' option is validated, allowed values: "upper", "lower", "false".
> +assertThrowsInstanceOf(() => new Intl.Locale("en", {caseFirst: "UPPER"}), RangeError);
> +
> +// 'numeric' option is not validated, the input is converted through ToBoolean.
> +assertEq(new Intl.Locale("en", {numeric: null}).toString(), "en-u-kn-false");

And a version with numeric being truthy?  Should compute to "en-u-kn", since the true should be eliminated in https://tc39.github.io/proposal-intl-locale/#sec-apply-options-to-tag step 9.

::: js/src/tests/non262/Intl/Locale/locale-constructor-tag-type-check.js
@@ +1,3 @@
> +// |reftest| skip-if(!this.hasOwnProperty('Intl')||(!this.Intl.Locale&&!this.hasOwnProperty('addIntlExtras')))
> +
> +// Throws a TypeError if the |tag| argument is neither a string nor an object.

This test should ideally be moved upstream, or upstream should test this and we should remove this test.  Also -- the upstream test should test |$262.IsHTMLDDA| on this.

@@ +9,5 @@
> +assertThrowsInstanceOf(() => new Intl.Locale(NaN), TypeError);
> +assertThrowsInstanceOf(() => new Intl.Locale(0), TypeError);
> +assertThrowsInstanceOf(() => new Intl.Locale(true), TypeError);
> +assertThrowsInstanceOf(() => new Intl.Locale(false), TypeError);
> +assertThrowsInstanceOf(() => new Intl.Locale(Symbol()), TypeError);

Should add in Symbol.iterator and Symbol.for("foo") to test built-in symbols and symbols in the built-in registry, too.

::: js/src/tests/non262/Intl/Locale/locale-constructor-unicode-ext.js
@@ +7,5 @@
> +    // Keywords currently used in Intl specs are reordered in US-ASCII order.
> +    "zh-u-nu-hans-ca-chinese": "zh-u-ca-chinese-nu-hans",
> +    "zh-u-ca-chinese-nu-hans": "zh-u-ca-chinese-nu-hans",
> +
> +    // Even keywords currently not used in Intl specs are reordered in US-ASCII order.

If we're going for currently-not-used, seems like picking a not-a-keyword-at-all value makes sense.  As I read https://www.unicode.org/reports/tr35/#Locale_Extension_Key_and_Type_Data I don't see that any keywords are explicitly guaranteed to never be used or are provided for private use, but it seems unimaginable that "xx" ever would be, at least.  :-)

@@ +14,5 @@
> +
> +    // Attributes in Unicode extensions are reordered in US-ASCII order.
> +    "pt-u-attr-ca-gregory": "pt-u-attr-ca-gregory",
> +    "pt-u-attr1-attr2-ca-gregory": "pt-u-attr1-attr2-ca-gregory",
> +    "pt-u-attr2-attr1-ca-gregory": "pt-u-attr1-attr2-ca-gregory",

Make these attr1 and xttr2, or something such that a < $thing < ca and a < $thing < gregory.

::: js/src/tests/non262/Intl/Locale/tags-parsed-twice.js
@@ +10,5 @@
> +        options: {
> +            region: "US",
> +            calendar: "gregory",
> +        },
> +        canonical: "en-GB-oxendict-u-ca-gregory",

I think this changed per https://github.com/tc39/proposal-intl-locale/commit/94c9a08958065c63ec0e51723c41bea8e4f348b1 right?

https://tc39.github.io/proposal-intl-locale/#sec-apply-options-to-tag  Step 9 converts "en-GB-oed" to "en-GB-oxendict", so then we'll execute to step 12b(ii) which will "Set tag to tag with the substring corresponding to the region production replaced by the string region." which would intermediately compute "en-US-oxendict" (and then the gregory bit would be tacked on after).

Or am I missing something but I don't think I am?

@@ +23,5 @@
> +        options: {
> +            region: "NO",
> +            calendar: "gregory",
> +        },
> +        canonical: "nb-u-ca-gregory",

Similarly "nb-NO-u-ca-gregory".
Attachment #8987668 - Flags: review?(jwalden+bmo) → review+
(In reply to André Bargull [:anba] from comment #24)
> > std::copy_n(subtag.Elements(), subtag.Length(), subspan.Elements());
> 
> Calling memmove (which is what std::copy_n does here, at least for gcc and
> clang), will be somewhat slower than [1], but it's probably not noticeable
> here.

Probably should use std::copy(elements, elements + length, subspan) then for memmove-ful behavior.  (It is mildly unfortunate that the C++ algorithms for this are not more clear about overlappy-definedness.)

> > @@ +409,5 @@
> > > +{
> > > +    MOZ_ASSERT(localeId.Last<1>()[0] == '\0', "locale ID must be zero-terminated");
> > > +
> > > +    int32_t size = intl::CallICU(cx, [&localeId](char* chars, uint32_t size, UErrorCode* status) {
> > > +        return likelySubtagsFn(localeId.Elements(), chars, size, status);
> > 
> > AssertedCast<int32_t>(size)
> 
> Hmm, I think that should actually belong into intl::CallICU itself.

Does it?  likelySubtagsFn is what takes int32_t; there are lambdas that could be passed in that would want uint32_t.  Or so I would think, at some remove from having written that comment and being slightly too lazy to double-check.

> ULOC_LANG_CAPACITY was changed to 12 in [1] to support language tags
> starting with "i-" (and "x-i"). The longest grandfathered language starting
> with "i-" is "i-enochian", but that fits in 10+1 (with the NUL terminator).
> Maybe 12 was chosen because the longest grandfathered tag needs 11+1
> ("cel-gaulish")? No wait, the change was committed 15 Mar 2001 and at that
> point RFC 5646 and BCP 47 language tags didn't exist yet. In 2001, RFC 3066
> [2] was just issued, so maybe the limit of 12 was actually derived by 3
> (maximum of the primary subtag) + 8 (maximum of the second subtag) + 1
> (NUL-terminator)? (And the third and subsequent subtags were probably simply
> ignored?)

Like I said, these constants are skeevy.  :-)

> ULOC_SCRIPT_CAPACITY was added in [3], but neither [3] nor [4] give an
> explanation why the script capacity was set to 5+1 instead of 4+1. (Unless
> the constant is meant to include the leading '-' separator, but that'd be
> strange...)

Like I said, these constants are skeevy.  :-)

> I use a similar trick like #lazygithub: Add a test262 test and wait until
> someone else files an ICU feature request, because their JS-engine doesn't
> pass test262. :-)

💯
How far are we from landing it (even if just exposed to chrome docs)?
Flags: needinfo?(andrebargull)
The current proposal depends on <https://github.com/tc39/ecma402/pull/289>, which is still under review resp. has multiple open questions. As soon as the revised ECMA-402 PR is applied, additional changes for Intl.Locale may be necessary, depending on how the open issues/questions for the PR are resolved. 

When ignoring the additional canonicalizations from UTS 35, the main issue to solve are how to handle language tags with extlang subtags and grandfathered tags. We could simply reject them for now in Intl.Locale by throwing an error, assuming this case doesn't apply for in-Gecko use. But that'd still mean the implemented Intl.Locale parts don't cover the complete proposal - which no current implementation of Intl.Locale can provide because of the open issues mentioned above...
Flags: needinfo?(andrebargull)
Depends on: 1522070

There are some r+ patches which didn't land and no activity in this bug for 2 weeks.
:anba, could you have a look please?

Flags: needinfo?(andrebargull)

The spec proposal changed which invalidated parts of the patches attached here. New patches will depend on bug 1522070 being fixed.

Flags: needinfo?(andrebargull)

There are some r+ patches which didn't land and no activity in this bug for 2 weeks.
:anba, could you have a look please?

Flags: needinfo?(andrebargull)
Flags: needinfo?(andrebargull)

Can this land now?

Flags: needinfo?(andrebargull)

This feature has shipped on Chrome 74, so maybe a "chrome-parity" tag is needed.

Depends on: es-intl
Blocks: es-intl
No longer depends on: es-intl

Intl.Locale.m{ax,in}imize() are implemented in a part 2.

Attachment #8987665 - Attachment is obsolete: true
Flags: needinfo?(andrebargull)
Attachment #8987666 - Attachment is obsolete: true
Attachment #8987667 - Attachment is obsolete: true
Attachment #8987668 - Attachment is obsolete: true

Pushed by btara@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/2311af7fcff5
Part 1: Implement Intl.Locale proposal. r=jwalden
https://hg.mozilla.org/integration/autoland/rev/c21e82e313c4
Part 2: Add Intl.Locale minimize() and maximize() functions. r=jwalden
https://hg.mozilla.org/integration/autoland/rev/56d8838cd081
Part 3: Tests for Intl.Locale proposal. r=jwalden
https://hg.mozilla.org/integration/autoland/rev/90f6092b76dd
Part 4: Enable test2626 tests for Intl.Locale proposal. r=jwalden
https://hg.mozilla.org/integration/autoland/rev/6c3d3ce32b08
Part 5: Reimport test262 for Intl.Locale tests. r=jwalden

Keywords: checkin-needed
Regressions: 1570921

If I see it correctly, this isn't enabled by default and thus not shipping in Firefox 70, right?
Is there a bug filed to actually ship this?

Flags: needinfo?(andrebargull)

There are some (minor) issues reported to the proposal Github page and while those probably shouldn't result in any breaking changes, it'd still be nice if the proposal champions could address them. (The main proposal champion was on break till September, so I assume he's still working on his backlog.) Well, maybe we could move Intl.Locale to nightly-only in the meantime?

Flags: needinfo?(andrebargull)

Nightly-only isn't crazy, for sure. However -- I think we should get bug 1570370 landed before we enable anything.

There's certainly some value to shipping just to ship. But it seems not entirely helpful in any other aspect to ship code that's a dead end implementation-wise.

If we're going to do parsing and other work in C++ in the long run, and in the medium run, and in the short run too, we should just do it before shipping, even in nightly. Then anyone experimenting will be hammering on the thing we will be using well into the future, not a thing that's gonna die whenever anba finds time to respond to my review comments. :-)

Update since comment 44 -- the C++ changes landed, but working through those revealed to me (anba I'm sure already recognized them 'cause he's way more into the weeds on this than I am :-) ) a few spec concerns that I'm not comfortable shipping til they're resolved. Perhaps most importantly, the current spec has two different ways to canonically represent language tags, with a distinction that is super-confusing and super-difficult to reasonably explain to web developers IMO. Once those are resolved, I think we can consider moving forward again.

Anba - do you consider Intl.Locale to be stable enough to be exposed to the web content? It's exposed in V8.

Flags: needinfo?(andrebargull)

See Waldo's comment in comment #45.

Flags: needinfo?(andrebargull)

The concerns I noted in comment 45 have been resolved by a recent TR35 update that removes the special canonicalization requirements. So there is now only one way to canonicalize, again. And once we implement that, I think we can ship this. I filed bug 1596544 to do so.

You need to log in before you can comment on or make changes to this bug.