Closed Bug 1723586 Opened 4 months ago Closed 1 month ago

Unify the behavior of Locale canonicalization in SpiderMonkey with intl/components

Categories

(Core :: Internationalization, task, P3)

task

Tracking

()

RESOLVED DUPLICATE of bug 1719746

People

(Reporter: gregtatum, Assigned: dminor)

References

(Blocks 2 open bugs)

Details

This bug is tracking unifying the behavior of Locale canonicalization in SpiderMonkey, specifically in the file: src/builtin/intl/Locale.cpp

In Bug 1719540 I'm fleshing out an initial location for canonicalization, but in the ICU4X experimentation, Dan has a patch to back Locale Canonicalization with ICU4X. This bug is for tracking that work. Perhaps Dan can add more details here of the work that is needed.

Flags: needinfo?(dminor)

I think eventually we'll want to match the API provided by the ICU4X locale canonicalizer: canonicalize and the two likely subtags operations. I'd expect the LocaleCanonicalizer to be a singleton.

The upstream canonicalize implementation is not yet complete, but we could potentially replace the unic-locale crate [1] with ICU4X as a test case outside of our unification efforts.

We could also work around some of the missing functionality in the ICU4X version for now (it is just missing a handful of cases that are not present in the CLDR json data.) The canonicalize operation is currently quite a bit slower than the version in SpiderMonkey. We'd have to test to see if that would make a measurable difference in performance (not just microbenchmarks) before switching.

That said, I'm not sure if we need to need to break out the locale canonicalization functionality into a separate class now, or leave it as part of the implementation in the Locale class until we move to ICU4X.

[1] https://searchfox.org/mozilla-central/source/third_party/rust/unic-langid-impl/

Flags: needinfo?(dminor)
Assignee: nobody → dminor

Performance is kind of important here, because the canonicalisation code is performed whenever any Intl object is created (via CanonicalizeLocaleList which calls intl_ValidateAndCanonicalizeLanguageTag). (Some Intl objects are created during start-up, so things like loading ICU resource files have to be delayed until absolutely necessary, see for example bug 919872 and bug 1220693 where we delay calls to ICU resource loading, but there are more bugs elsewhere. Probably not too relevant for this bug, but something which always has to be taken into account.)

Some background info about Intl.Locale, in case it's helpful:

When I wrote Intl.Locale, I've tried to ensure good performance, because one goal of Intl.Locale is to provide easy and fast access to locale identifier subtags. Before Intl.Locale, users most likely used regular expressions for this task. And because regular expressions are heavily optimised in JS engines, any approach for Intl.Locale which is >10x slower than regular expressions seemed unacceptable, because why should a user switch to Intl.Locale when it's a magnitude slower. And because Mozilla was also championing the proposal, it seemed bad to push a proposal which may not get user acceptance because it's inherently slow.

Relevant bugs are:

  • Bug 1522070: Switch from BCP 47 language tags [RFC 5646] to Unicode BCP 47 locale identifiers.
  • Bug 1433303: Initial Intl.Locale implementation.
  • Bug 1570370: Port of language tag parsing from JS to C++ to address memory regressions caused by the other two bugs. Also additionally improved the parser performance.

Probably also worth noting that only SpiderMonkey currently has reasonable performance for Intl.Locale. For example this test case to retrieve the language subtag is much faster in SpiderMonkey when compared to V8 and JSC:

function f() {
  // List of locales. Script and region subtags can be added by removing the
  // commented out lines.
  var locales = [
    "de", "en", "zh", "it",
    "fr", "es", "th", "ja",
    "nl", "fi", "se", "no",
    "pt", "ro", "pl", "ru",
  ]
  // .map(x => x + "-Latn")
  // .map(x => x + "-US")
  ;

  // Regular expression to parse a locale identifier.
  //
  // |unicode_locale_extensions| and |transformed_extensions| are both handled as
  // |other_extensions| for the sake of simplicity.
  var re = /^([a-z]{2,3}|[a-z]{4,8})(?:-([a-z]{4}))?(?:-([a-z]{2}|[0-9]{3}))?(?:-(?:[a-z0-9]{5,8}|[0-9][a-z0-9]{3}))*(?:-[a-wy-z0-9](?:-[a-z0-9]{2,8})+)*(?:-x(?:-[a-z0-9]{1,8})+)?$/i;

  // Loop to parse a locale identifier and retrieve its language subtag.
  var r = 0;
  var t = performance.now();
  for (var i = 0; i < 1_000_000; ++i) {
    var x = locales[i & 0xf];

    // Variant 1: Use Intl.Locale.

    var loc = new Intl.Locale(x);
    r += loc.language.length;

    // Variant 2: Use a regular expression.

    // var lang = re.exec(x);
    // if (lang) r += lang[1].length;
  }
  print(performance.now() - t, r);
}

for (var i = 0; i < 10; ++i) f();

Thanks for the extra context Anba!

I'm now leaning towards "unifying" the current LanguageTag / Locale classes rather than trying to move to ICU4X now. That will give us time to improve the performance of the canonicalize implementation in ICU4X. From the benchmarking we've done so far, the rest of the Locale implementation in ICU4X is as fast or faster than what we have in SpiderMonkey.

Bug 1719746 moved this code from SpiderMonkey to the unified Locale implementation. To match the ICU4X model, we'd want to move the canonicalization bits to the unified LocaleCanonicalizer but I think it makes sense to hold off on doing this until we are actually switching to ICU4X. If we don't end up switching, we could move the one method on LocaleCanonicalizer over to Locale itself.

I'm going to mark this as a dup, since Bug 1719746 fixed the SpiderMonkey portion of this work already.

Status: NEW → RESOLVED
Closed: 1 month ago
Resolution: --- → DUPLICATE
Duplicate of bug: 1719746
Blocks: 1686965
You need to log in before you can comment on or make changes to this bug.