Closed Bug 1331508 Opened 7 years ago Closed 5 months ago

Local data overlay on top of CLDR/ICU

Categories

(Core :: Internationalization, defect, P3)

Tracking

RESOLVED FIXED
122 Branch
Tracking Status
firefox122 --- fixed

People

(Reporter: zbraniecki, Assigned: eemeli)

Attachments

(1 file)

As we're migrating more of Gecko to rely on ICU/CLDR we start seeing places where this regresses our locale coverage.

For example, we release Android in locales such as an, cak, gn, mai, son, tsz, which are not covered by CLDR/ICU.

Our top priority now is to land the new localization framework (l20n) in Firefox for Android, and it relies on ICU.
In order to be able to switch without regressing, we have to solve the coverage problem.

I'm establishing a protocol for L10n Drivers to upstream our localizations back to CLDR, but establishing a new locale in CLDR may be more time-consuming, and potentially more political, than it sounds.

At the Unicode conference I spoke with multiple customers of CLDR/ICU - namely Microsoft, Google and Apple - and they all said that they maintain a local overlay on top of the source data to facilitate their needs.

They aim to keep the layer to a minimum and, optimistically, remove it one day, but for now that setup unblocks them to release their products with their own locale coverage while using CLDR/ICU as a backbone.

In comparison with them, I think our task is slightly easier, since we do not aim to diverge from CLDR and we don't have a need for custom values that differ from CLDR.

But we do need to be able to format numbers and dates (and plural rules, relative time formats, and units) for locales that CLDR doesn't have data for.

I'd like to suggest that we add a similar layer and develop a procedure that will allow L10n Drivers together with localizers to submit CLDR-like data that we will keep locally and use in our ICU calls while we upstream them. Once the upstream is complete, we'll remove those bits from the overlay.
Seeking feedback from :mkato, :jfkthame, :waldo, Andre Bargull. CC :pike.

Note: This is not related to landing ICU in Android (bug 1215247) - because we can land it without some locales.
This is only important for us when we talk about migrating Firefox UI to use ICU.
Flags: needinfo?(m_kato)
Flags: needinfo?(l10n)
Flags: needinfo?(jwalden+bmo)
Flags: needinfo?(jfkthame)
Flags: needinfo?(andrebargull)
(In reply to Zibi Braniecki [:gandalf][:zibi] from comment #0)
> In comparison with them, I think our task is slightly easier, since we do
> not aim to diverge from CLDR and we don't have a need for custom values that
> differ from CLDR.

I don't know which bug that was in, but I recall reading bugmail about date formats and German that suggested that we should in fact tune CLDR-hosted data to make the firefox product better.

(In reply to Zibi Braniecki [:gandalf][:zibi] from comment #1)
> Seeking feedback from :mkato, :jfkthame, :waldo, Andre Bargull. CC :pike.
> 
> Note: This is not related to landing ICU in Android (bug 1215247) - because
> we can land it without some locales.
> This is only important for us when we talk about migrating Firefox UI to use
> ICU.

Can you be precise on this one? It sounds like "pff, wth to an, cak, gn, mai, son, tsz", and I recall phone conversations that sounded like "we only use plural forms, and the missing locales have the same as English, so fall-back is fine".

Those would be two vastly different statements.
Flags: needinfo?(l10n)
(resubmitting due to the number of spellings)

> I don't know which bug that was in, but I recall reading bugmail about date formats and German that suggested that we should in fact tune CLDR-hosted data to make the firefox product better.

My understanding is that the pressure on this is significantly lower, and in such scenarios we can actually use the route to elevate the conversation to CLDR.
Not only because the worst-case scenario is that the translation is slightly worse (rather than nonexistent), but also because it's a matter of preference, and CLDR has pretty solid procedures for solving those (in fact, that's 80% of what they do ;)).

Lastly, if CLDR uses one approach, then it means that host environments in which Firefox runs will use it. If we use a different approach, even if it's slightly better, it's inconsistent with the OS.

Thus, I'd say that we can start by trying to elevate such cases to CLDR and try to solve them there.

> Can you be precise on this one? It sounds like "pff, wth to an, cak, gn, mai, son, tsz", and I recall phone conversations that sounded like "we only use plural forms, and the missing locales have the same as English, so fall-back is fine".
> Those would be two vastly different statements.

Sure.

There are two parts you quoted here:

> Note: This is not related to landing ICU in Android (bug 1215247) - because we can land it without some locales.

I believe that this issue does not block bug 1215247, because we can easily land it and expose the Intl API in Firefox for Android without, for example, the 'an' locale. It's a new feature that does not currently exist, so we're not regressing anything by landing it. Also, the lack of an 'an' locale in CLDR means that other implementations of the Intl API likely don't have it either.

> This is only important for us when we talk about migrating Firefox UI to use ICU.

I believe that the lack of overlays is going to be a recurring theme as we introduce new formatters and transition toward ICU-backed APIs. For example, we just landed PluralRules that use CLDR. If we don't have PluralRules in CLDR for the 'an' locale, we will fall back to English. That works this time, but if we get a new locale that uses different plural rules, we will need an overlay.
Soon, we will want to transition relative time format strings from localization-based to mozIntl.RelativeTimeFormat-based (bug 1270140). If we don't have our overlays, that will regress for locales that we cover but CLDR doesn't.
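
To illustrate the fallback concern, here is a minimal Python sketch. It is not Gecko or ICU code: the rule tables and the `select_plural` helper are hypothetical, the rules are simplified to integer input, and "qaa" is a private-use language code standing in for a locale that the base data does not cover. It shows that such a locale silently gets English plural rules unless an overlay table supplies its own.

```python
from typing import Callable, Dict

PluralRule = Callable[[int], str]

# Stand-in for CLDR-provided plural rules (simplified to integers).
BASE_RULES: Dict[str, PluralRule] = {
    "en": lambda n: "one" if n == 1 else "other",
    "pl": lambda n: (
        "one" if n == 1
        else "few" if n % 10 in (2, 3, 4) and n % 100 not in (12, 13, 14)
        else "many"
    ),
}

# Hypothetical overlay: rules for a locale missing from the base data.
# The rule itself is invented for illustration only.
OVERLAY_RULES: Dict[str, PluralRule] = {
    "qaa": lambda n: "one" if n in (0, 1) else "other",
}


def select_plural(locale: str, n: int, use_overlay: bool = True) -> str:
    """Return a plural category, checking the overlay before the base data."""
    if use_overlay and locale in OVERLAY_RULES:
        return OVERLAY_RULES[locale](n)
    # Locales unknown to the base data fall back to English.
    rule = BASE_RULES.get(locale, BASE_RULES["en"])
    return rule(n)
```

With the overlay disabled, `select_plural("qaa", 0, use_overlay=False)` gives the English answer "other"; with it enabled, the overlay rule answers "one". A locale whose rules happen to match English, as described for 'an', is unaffected either way.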

The bottom line is that there are two uses of ICU/CLDR in Gecko:

1) As a backend for Intl API for the Web
2) As a backend for Firefox UI

For the former, we do not need this bug to be fixed. For the latter, I believe we will need it.

Does it answer your question?
(In reply to Zibi Braniecki [:gandalf][:zibi] from comment #0)
> I'd like to suggest that we add a similar layer and develop a procedure that
> will allow L10n Drivers together with localizers to submit CLDR-like data
> that we will keep locally and use in our ICU calls while we upstream them.
> Once the upstream is complete, we'll remove those bits from the overlay.

When you say "overlay", do you mean adding new locales per http://userguide.icu-project.org/icudata, or something else?
> When you say "overlay", do you mean adding new locales per http://userguide.icu-project.org/icudata, or something else?

There are two possible things we can do:

1) Add a locale that ICU does not have data for.

In this case, we'd like to establish a procedure that will allow us to provide the right data to be included in the Mozilla source for ICU to pick up.

2) Override some CLDR data

In a scenario like the one described by :pike, we may want to modify a CLDR field for a particular locale.
AIUI, we can use the ICU pkgdata tool to package data for (an) additional locale(s) into ICU's format, and then include this in the data ICU will use by calling the udata_setCommonData API.[1] This can be called repeatedly to load multiple data packages, and ICU will search them all when data is requested.

To support adding locales to an installation, I guess we'd probably want to define a locale_data resource directory, and have code that iterates over it at startup and loads whatever data packages are found there. So adding or updating a locale is simply a matter of building a new data package and dropping it into the locale_data dir.

In principle, this should support both (1) and (2) above, provided the added packages are loaded before any (implicit) access to the default built-in data; still, I think we should be _extremely_ hesitant to override CLDR data. It's hard to imagine that we can really maintain locale data better (with more extensive review, etc.) than the CLDR project. If people think CLDR data is incorrect for a supported locale, that should be addressed upstream rather than having Gecko deviate from the (CLDR-based) system behavior.
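
The load-order point can be modelled with a small Python sketch. This is a toy model rather than ICU code: JSON files stand in for pkgdata output, the `DataRegistry` class and `build_registry` helper are hypothetical, and it assumes that the first loaded package containing an item wins, which is what the ordering caveat above implies.

```python
import json
from pathlib import Path
from typing import Any, Dict, List, Optional


class DataRegistry:
    """Toy model of a list of data packages searched in load order."""

    def __init__(self) -> None:
        self._packages: List[Dict[str, Any]] = []

    def add_package(self, package: Dict[str, Any]) -> None:
        # Plays the role of one udata_setCommonData call per package.
        self._packages.append(package)

    def load_dir(self, directory: Path) -> int:
        """Load every *.json file in `directory`; return how many were loaded."""
        count = 0
        for path in sorted(directory.glob("*.json")):
            self.add_package(json.loads(path.read_text(encoding="utf-8")))
            count += 1
        return count

    def lookup(self, key: str) -> Optional[Any]:
        # Earlier packages shadow later ones (an assumption of this model).
        for package in self._packages:
            if key in package:
                return package[key]
        return None


def build_registry(locale_dir: Path, builtin: Dict[str, Any]) -> DataRegistry:
    """Startup sequence: added locale packages first, built-in data last."""
    registry = DataRegistry()
    registry.load_dir(locale_dir)
    registry.add_package(builtin)
    return registry
```

Under these assumptions, adding or updating a locale means dropping a new file into the directory, and an added package can both supply a missing locale and shadow a built-in entry, which is why overrides call for the caution described above.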

[1] http://icu-project.org/apiref/icu4c/udata_8h.html#a467bda719595adb58f959dde735e1153
Flags: needinfo?(jfkthame)
Great to hear! And I agree on being reluctant and cautious with overrides.

It seems then that we know how to do this technically and it doesn't require any new code from us.

I think I'd now like to decide who will maintain the directory with our custom locales, and to agree on an example of what the data for a locale should look like.

Then we can start filing bugs to add such data, prepare data for landing and get review from the overrides directory maintainer.
On Windows and js-standalone, we use linked-in CLDR data, and we probably need to do the same for this.
What does the example data file look like? Is it a JSON file? XML?

I'd like to work with our PM group to build a plan on how we're going to prepare such a "custom locale" dataset for us to include.
(In reply to Zibi Braniecki [:gandalf][:zibi] from comment #0)
> As we're migrating more of Gecko to rely on ICU/CLDR we start seeing places
> where this regresses our locale coverage.
> 
> For example, we release Android in locales such as an, cak, gn, mai, son,
> tsz, which are not covered by CLDR/ICU.

Does the multi-locale build of Android include these localizations? Looking at maemo-locales and all-locales in mobile/android/locales, they aren't included.
Flags: needinfo?(m_kato)
You probably looked at central, which comes with a limited set of locales. The full list is on aurora, https://hg.mozilla.org/releases/mozilla-aurora/file/default/mobile/android/locales/maemo-locales and https://hg.mozilla.org/releases/mozilla-aurora/file/default/mobile/android/locales/all-locales. There they are used.
Flags: needinfo?(andrebargull)
Putting new locales in a separate directory, in the actual upstream format, seems most usable.  Patching *existing* locale data, I'm leery of but don't have significant sense about the avoidability or otherwise of that.  I would prefer the upstreaming route for existing locale data, if at all possible.

And as to web Intl behavior versus Mozilla-specific Intl behavior, and not customizing the former but customizing the latter...well, some of the latter stuff we are working on standardizing, so that I think probably really should be in the first bucket.  And for the rest, I'm not all that convinced that just because it has some exalted use in our own UI, it deserves special treatment.

Other than that, I don't have especial feedback.  Plus I'm out for awhile now, so it's not like I'm going to argue strongly if other actions are taken, exactly -- mostly just clearing the queues as far as possible now.
Flags: needinfo?(jwalden+bmo)
Priority: -- → P3
Severity: normal → S3

We're looking into what it'd take to finally resolve this, and to have a way to overlay fixes on CLDR data as used by Firefox, as well as including at least partial internationalization support for locales that aren't included in CLDR, but for which we do provide a localization.

Assignee: nobody → earo
Status: NEW → ASSIGNED
See Also: → 1410168, 1612379, 1613271

ICU4X data management is designed for this use case, providing two solutions:

  1. OverlayDataProvider

This allows us to establish a chained DataProvider that selectively returns "overlay" data or routes the request to ICU4XDataProvider. This way you can have a light runtime overlay that only overrides data for selected keys, alongside the "full" data provider.

  2. MergedDataProvider

This allows us to combine two sources at build time into a single data package.
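
The difference between the two approaches can be sketched in Python. This models the idea only and is not the ICU4X Rust API: the `DictProvider` and `OverlayProvider` classes, the `merge_at_build_time` function, and the (key, locale) tuples are all hypothetical stand-ins.

```python
from typing import Dict, Optional, Protocol, Tuple

Key = Tuple[str, str]  # (data key, locale), e.g. ("decimal", "an")


class DataProvider(Protocol):
    def load(self, key: Key) -> Optional[str]: ...


class DictProvider:
    """A provider backed by a single in-memory data package."""

    def __init__(self, data: Dict[Key, str]) -> None:
        self._data = dict(data)

    def load(self, key: Key) -> Optional[str]:
        return self._data.get(key)


class OverlayProvider:
    """Runtime chain: try the overlay, otherwise route to the full provider."""

    def __init__(self, overlay: DataProvider, base: DataProvider) -> None:
        self._overlay = overlay
        self._base = base

    def load(self, key: Key) -> Optional[str]:
        hit = self._overlay.load(key)
        return hit if hit is not None else self._base.load(key)


def merge_at_build_time(
    base: Dict[Key, str], overlay: Dict[Key, str]
) -> DictProvider:
    """Build-time merge: one package in which overlay entries replace base ones."""
    merged = dict(base)
    merged.update(overlay)
    return DictProvider(merged)
```

In this model both approaches answer every request identically. The runtime chain keeps the overlay as a separate, swappable piece, while the build-time merge ships a single package and needs no extra lookup step.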

Presumably that would solve the issue for interfaces that use ICU4X internally, but not ICU4C, yes?

For ICU4C, I hope that there's a way to apply patches like this to effect at least some changes, and to reconsider the requirements we impose for such data patches.

I also presume that there will be a small stack of gotchas involved in adding a locale not supported by CLDR.

See Also: → 1642505

I added a patch adding a brief note to the Firefox docs on ICU, stating that small data changes are possible, but that anything significant should be done upstream. At the moment, in particular (but not only) Fluent makes use of Rust crates such as intl_pluralrules that have their own separate CLDR dependencies.

So I'm pretty sure that supporting larger overlays like entirely new locales is not practically possible, but that this could and should be revisited if we get to a point where our CLDR dependencies are once again concentrated in one place, i.e. ICU4X.

Pushed by earo@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/468998fe6c16
Add note to ICU docs about patching CLDR data. r=fluent-reviewers,platform-i18n-reviewers,dminor,flod
Status: ASSIGNED → RESOLVED
Closed: 5 months ago
Resolution: --- → FIXED
Target Milestone: --- → 122 Branch