Local data overlay on top of CLDR/ICU

NEW
Unassigned

Status

P3
normal
2 years ago
8 months ago

People

(Reporter: gandalf, Unassigned)

(Reporter)

Description

2 years ago
As we're migrating more of Gecko to rely on ICU/CLDR we start seeing places where this regresses our locale coverage.

For example, we release Android in locales such as an, cak, gn, mai, son, tsz, which are not covered by CLDR/ICU.

Our top priority now is to land the new localization framework (l20n) in Firefox for Android, and it relies on ICU.
In order to be able to switch without regressing, we have to solve the coverage problem.

I'm establishing a protocol for L10n Drivers to upstream our localizations back to CLDR, but establishing a new locale in CLDR may be more time-consuming, and potentially more political, than it sounds.

At the Unicode conference I spoke with multiple customers of CLDR/ICU - namely Microsoft, Google and Apple - and they all said that they maintain a local overlay on top of the source data to facilitate their needs.

They aim to keep the layer to a minimum and hope to remove it one day, but for now that setup lets them release their products with their own locale coverage while still using CLDR/ICU as a backbone.

In comparison with them, I think our task is slightly easier, since we do not aim to diverge from CLDR and we don't have a need for custom values that differ from CLDR.

But we do need to be able to format numbers and dates (and plural rules and relative time formats and units) for locales that CLDR doesn't have data for.

I'd like to suggest that we add a similar layer and develop a procedure that will allow L10n Drivers together with localizers to submit CLDR-like data that we will keep locally and use in our ICU calls while we upstream them. Once the upstream is complete, we'll remove those bits from the overlay.
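
A minimal sketch of the lifecycle proposed above, assuming the overlay and the upstream data can each be reduced to a set of locale codes. The function names and the sample sets are hypothetical and are not part of ICU, CLDR or Gecko.

```python
def effective_locales(upstream, overlay):
    """Locales available once the local overlay is layered over upstream data."""
    return set(upstream) | set(overlay)


def prunable(upstream, overlay):
    """Overlay entries that upstream now covers, so they can be dropped locally."""
    return sorted(set(upstream) & set(overlay))


# Illustrative sets only; the real lists would come from CLDR and from the overlay.
upstream_locales = {"en", "de", "pl"}
overlay_locales = {"an", "cak", "gn"}
```

If a later upstream release gained "an", `prunable(upstream_locales | {"an"}, overlay_locales)` would report it as ready to be removed from the overlay.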
(Reporter)

Comment 1

2 years ago
Seeking feedback from :mkato, :jfkthame, :waldo, Andre Bargull. CC :pike.

Note: This is not related to landing ICU in Android (bug 1215247) - because we can land it without some locales.
This is only important for us when we talk about migrating Firefox UI to use ICU.
Flags: needinfo?(m_kato)
Flags: needinfo?(l10n)
Flags: needinfo?(jwalden+bmo)
Flags: needinfo?(jfkthame)
Flags: needinfo?(andrebargull)

Comment 2

2 years ago
(In reply to Zibi Braniecki [:gandalf][:zibi] from comment #0)
> In comparison with them, I think our task is slightly easier, since we do
> not aim to diverge from CLDR and we don't have a need for custom values that
> differ from CLDR.

I don't know which bug that was in, but I recall reading bugmail about date formats and German that suggested that we should in fact tune CLDR-hosted data to make the Firefox product better.

(In reply to Zibi Braniecki [:gandalf][:zibi] from comment #1)
> Seeking feedback from :mkato, :jfkthame, :waldo, Andre Bargull. CC :pike.
> 
> Note: This is not related to landing ICU in Android (bug 1215247) - because
> we can land it without some locales.
> This is only important for us when we talk about migrating Firefox UI to use
> ICU.

Can you be precise on this one? It sounds like "pff, wth to an, cak, gn, mai, son, tsz", and I recall phone conversations that sounded like "we only use plural forms, and the missing locales have the same as English, so fall-back is fine".

Those would be two vastly different statements.
Flags: needinfo?(l10n)
Comment hidden (obsolete)
(Reporter)

Comment 4

2 years ago
(resubmitting due to the number of misspellings)

> I don't know which bug that was in, but I recall reading bugmail about date formats and German that suggested that we should in fact tune CLDR-hosted data to make the Firefox product better.

My understanding is that the pressure on this is significantly lower, and in such scenarios we can instead take the route of elevating the conversation to CLDR.
Not only because the worst case scenario is that the translation is slightly worse (rather than nonexistent), but also because it's a matter of preference, and CLDR has pretty solid procedures for solving those (in fact, that's 80% of what they do ;)).

Lastly, if CLDR uses one approach, then it means that host environments in which Firefox runs will use it. If we use a different approach, even if it's slightly better, it's inconsistent with the OS.

Thus, I'd say that we can start by trying to elevate such cases to CLDR and try to solve them there.

> Can you be precise on this one? It sounds like "pff, wth to an, cak, gn, mai, son, tsz", and I recall phone conversations that sounded like "we only use plural forms, and the missing locales have the same as English, so fall-back is fine".
> Those would be two vastly different statements.

Sure.

There are two parts you quoted here:

> Note: This is not related to landing ICU in Android (bug 1215247) - because we can land it without some locales.

I believe that this issue is not blocking for bug 1215247 because we can easily land it and expose the Intl API in Firefox for Android without, for example, the 'an' locale. It's a new feature that doesn't exist yet, and thus we're not regressing anything by landing it. Also, the lack of an 'an' locale in CLDR means that other implementations of the Intl API likely don't have it either.

> This is only important for us when we talk about migrating Firefox UI to use ICU.

I believe that the lack of overlays is going to be a recurring theme as we introduce new formatters and transition toward ICU-backed APIs. For example, we just landed PluralRules that use CLDR. If we don't have PluralRules in CLDR for the 'an' locale, we will fall back to English. That works this time, but if we have a new locale that uses different plural rules, we will need an overlay.
Soon, we will want to transition relative time format strings from localization-based to mozIntl.RelativeTimeFormat-based (bug 1270140). If we don't have our overlays, that will regress for locales that we cover but CLDR doesn't.
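
To make the fallback concern concrete, here is a minimal sketch. It does not call ICU. The `UPSTREAM_RULES` table and the `select_plural` helper are hypothetical stand-ins for CLDR plural data and for an overlay lookup, "xx" is a placeholder for a locale that upstream does not cover, and the rules handle non-negative integers only.

```python
def english_rule(n):
    """English cardinal rule for non-negative integers."""
    return "one" if n == 1 else "other"


def polish_rule(n):
    """Polish cardinal rule for non-negative integers (one / few / many)."""
    if n == 1:
        return "one"
    if n % 10 in (2, 3, 4) and n % 100 not in (12, 13, 14):
        return "few"
    return "many"


# Stand-in for the plural data shipped with CLDR/ICU.
UPSTREAM_RULES = {"en": english_rule, "pl": polish_rule}


def select_plural(locale, n, overlay=None):
    """Pick a plural category: upstream data first, then the overlay, then English."""
    overlay = overlay or {}
    rule = UPSTREAM_RULES.get(locale) or overlay.get(locale) or UPSTREAM_RULES["en"]
    return rule(n)
```

Without an overlay, `select_plural("xx", 5)` silently uses the English rule. Passing `overlay={"xx": polish_rule}` is the kind of local data this bug asks for, on the assumption that "xx" pluralizes like Polish.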

The bottom line is that there are two uses of ICU/CLDR in Gecko:

1) As a backend for Intl API for the Web
2) As a backend for Firefox UI

For the former, we do not need this bug to be fixed. For the latter, I believe we will need it.

Does it answer your question?

Comment 5

2 years ago
(In reply to Zibi Braniecki [:gandalf][:zibi] from comment #0)
> I'd like to suggest that we add a similar layer and develop a procedure that
> will allow L10n Drivers together with localizers to submit CLDR-like data
> that we will keep locally and use in our ICU calls while we upstream them.
> Once the upstream is complete, we'll remove those bits from the overlay.

When you say "overlay", do you mean adding new locales per http://userguide.icu-project.org/icudata, or something else?
(Reporter)

Comment 6

2 years ago
> When you say "overlay", do you mean adding new locales per http://userguide.icu-project.org/icudata, or something else?

There are two possible things we can do:

1) Add a locale that ICU does not have data for.

In this case, we'd like to establish a procedure that will allow us to provide the right data to be included in the Mozilla source for ICU to pick up.

2) Override some CLDR data

In a scenario like the one described by :pike, we may want to modify a CLDR field for a particular locale.

Comment 7

2 years ago
AIUI, we can use the ICU pkgdata tool to package data for one or more additional locales into ICU's format, and then include this in the data ICU will use by calling the udata_setCommonData API.[1] This can be called repeatedly to load multiple data packages, and ICU will search them all when data is requested.

To support adding locales to an installation, I guess we'd probably want to define a locale_data resource directory, and have code that iterates over it at startup and loads whatever data packages are found there. So adding or updating a locale is simply a matter of building a new data package and dropping it into the locale_data dir.
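
The startup scan could look roughly like the sketch below. It only models the directory iteration: the `.dat` extension, the sorted load order, and the `register` callback (standing in for whatever wrapper ends up handing the bytes to ICU) are assumptions for illustration, not existing Gecko code.

```python
from pathlib import Path


def find_data_packages(locale_data_dir):
    """Return the .dat files in the directory, sorted by name for a stable load order."""
    root = Path(locale_data_dir)
    if not root.is_dir():
        return []
    return sorted(p for p in root.iterdir() if p.is_file() and p.suffix == ".dat")


def load_all(locale_data_dir, register):
    """Read each package and pass its bytes to `register`; return the loaded file names."""
    loaded = []
    for path in find_data_packages(locale_data_dir):
        register(path.read_bytes())  # hypothetical hook around the real loader
        loaded.append(path.name)
    return loaded
```

With this shape, adding a locale means dropping one more file into the directory; no code changes are needed.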

In principle, this should support both (1) and (2) above, provided the added packages are loaded before any (implicit) access to the default built-in data; still, I think we should be _extremely_ hesitant to override CLDR data. It's hard to imagine that we can really maintain locale data better (with more extensive review, etc.) than the CLDR project. If people think CLDR data is incorrect for a supported locale, that should be addressed upstream rather than having Gecko deviate from the (CLDR-based) system behavior.
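
The ordering caveat can be illustrated with a toy model. This is not ICU's implementation: `DataRegistry`, its methods and the string values are hypothetical, and the model simply assumes that packages are searched in registration order and that the built-in data joins the end of the list on the first lookup the registered packages cannot satisfy.

```python
class DataRegistry:
    """Toy model of an ordered list of data packages (not ICU's real code)."""

    def __init__(self, builtin):
        self._builtin = builtin          # stands in for the linked-in default data
        self._packages = []              # search order, first match wins
        self._builtin_attached = False

    def set_common_data(self, package):
        """Register one more package; earlier registrations are searched first."""
        self._packages.append(package)

    def lookup(self, key):
        for package in self._packages:
            if key in package:
                return package[key]
        if not self._builtin_attached:
            # First miss: the built-in data is appended to the search list.
            self._packages.append(self._builtin)
            self._builtin_attached = True
            return self._builtin.get(key)
        return None
```

In this model a package registered at startup can both add a locale and shadow a built-in entry, while a package registered after the built-in data has been touched can only add entries that the built-in data lacks.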

[1] http://icu-project.org/apiref/icu4c/udata_8h.html#a467bda719595adb58f959dde735e1153
Flags: needinfo?(jfkthame)
(Reporter)

Comment 8

2 years ago
Great to hear! And I agree on being reluctant and cautious with overrides.

It seems then that we know how to do this technically and it doesn't require any new code from us.

I think I'd now like to decide who would maintain the directory with our custom locales, and to see an example of what the data should look like for a locale.

Then we can start filing bugs to add such data, prepare data for landing and get review from the overrides directory maintainer.

Comment 9

2 years ago
On Windows and in js-standalone builds, we use linked-in CLDR data, and we probably need to link in this data, too.
(Reporter)

Comment 10

2 years ago
What does an example data file look like? Is it a JSON file? XML?

I'd like to work with our PM group to build a plan on how we're going to prepare such a "custom locale" dataset for us to include.

Comment 11

2 years ago
(In reply to Zibi Braniecki [:gandalf][:zibi] from comment #0)
> As we're migrating more of Gecko to rely on ICU/CLDR we start seeing places
> where this regresses our locale coverage.
> 
> For example, we release Android in locales such as an, cak, gn, mai, son,
> tsz, which are not covered by CLDR/ICU.

Does the multi-locale build of Android include these localizations? Looking at maemo-locales and all-locales in mobile/android/locales, they aren't included.
Flags: needinfo?(m_kato)

Comment 12

2 years ago
You probably looked at central, which comes with a limited set of locales. The full list is on aurora, https://hg.mozilla.org/releases/mozilla-aurora/file/default/mobile/android/locales/maemo-locales and https://hg.mozilla.org/releases/mozilla-aurora/file/default/mobile/android/locales/all-locales. There they are used.

Updated

2 years ago
Flags: needinfo?(andrebargull)

Comment 13

Putting new locales in a separate directory, in the actual upstream format, seems most usable.  Patching *existing* locale data, I'm leery of but don't have significant sense about the avoidability or otherwise of that.  I would prefer the upstreaming route for existing locale data, if at all possible.

And as to web Intl behavior versus Mozilla-specific Intl behavior, and not customizing the former but customizing the latter...well, some of the latter stuff we are working on standardizing, so that I think probably really should be in the first bucket.  And for the rest, I'm not all that convinced that just because it has some exalted use in our own UI, it deserves special treatment.

Other than that, I don't have especial feedback.  Plus I'm out for awhile now, so it's not like I'm going to argue strongly if other actions are taken, exactly -- mostly just clearing the queues as far as possible now.
Flags: needinfo?(jwalden+bmo)
Priority: -- → P3