Bug 1613271 (Open) - Opened 5 years ago, Updated 1 year ago

Coordination of CLDR data in Gecko

Categories

(Core :: Internationalization, enhancement, P3)

People

(Reporter: zbraniecki, Unassigned)

References

Details

Historically, we pulled CLDR with every new ICU release, and someone (mostly :anba, with :waldo as a backup) would rebuild the data file using a Python script [0].

With the recent inclusion of Rust crates such as unic-langid and fluent-langneg, and the planned inclusion of pluralrules, I'd like to design a plan for updating their CLDR data in Gecko.

Two important notes on that:

  1. There's a larger conversation currently ongoing in https://github.com/i18n-concept/rust-discuss/issues about different models of data management for Rust ICU. In the long term we expect to have a cohesive meta-crate under Unicode Consortium governance and cohesive data management and updating model.
  2. The crates we are vendoring in right now use CLDR in a particular way - by baking the data into their own source code. This does not scale to other crates like datetimeformat/numberformat, but it is the right choice, even long term, for the current subset of crates in Gecko.

This means that we are looking for a short/medium-term solution for 2-4 crates which should be compiled with some CLDR data in the form of Rust source code.

Each one of the crates provides a script for rebuilding the built-in data from the CLDR JSON source as a separate step.

At the moment, the CLDR data update happens separately for each crate; a minor release is then published to crates.io, and I vendor the new versions of the crates into Gecko. That's unsustainable.

Since the data updates rarely - usually once every 6 months - the ideal model would be a single step, executed when CLDR gets updated, that rebuilds the data for all affected crates.

The challenge is that the crates live in third_party/rust/, and updating their src/data.rs files would likely break the vendoring checksums; it also feels dirty, since the source would diverge from the one vendored in from crates.io.

I assume the first step will be to vendor https://github.com/unicode-cldr/cldr-json into intl/cldr as the source of data for all CLDR-driven crates.

A sub-ideal approach I could try would be to enforce a build.rs file in each crate that runs conditionally when the crate is built as part of Gecko and regenerates the crate's data_tables.rs from intl/cldr.
That would mean each Gecko build regenerates data that actually changes only once every 6 months.
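
A minimal sketch of what such a conditional build.rs could look like - the MOZ_CLDR_DIR variable name and the generator function are hypothetical placeholders; each crate's existing generator script would fill the latter role:

// build.rs - hypothetical sketch; MOZ_CLDR_DIR and generate_data_tables are assumptions
use std::env;
use std::fs;
use std::path::Path;

fn main() {
    // Only regenerate the tables when Gecko points us at its vendored CLDR JSON;
    // otherwise the crate keeps using its checked-in data_tables.rs.
    if let Ok(cldr_dir) = env::var("MOZ_CLDR_DIR") {
        let src = Path::new(&cldr_dir).join("cldr-core/supplemental/likelySubtags.json");
        let json = fs::read_to_string(&src).expect("missing CLDR JSON");
        let generated = generate_data_tables(&json);
        let out_dir = env::var("OUT_DIR").unwrap();
        fs::write(Path::new(&out_dir).join("data_tables.rs"), generated).unwrap();
        println!("cargo:rerun-if-changed={}", src.display());
    }
    println!("cargo:rerun-if-env-changed=MOZ_CLDR_DIR");
}

fn generate_data_tables(_cldr_json: &str) -> String {
    // Stand-in for the per-crate generator script each crate already ships.
    unimplemented!()
}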

Alternatively, maybe there's some way to do a conditional include_str! in the .rs file behind some "gecko" feature of those crates, but having a gecko-specific feature in a public crate feels dirty as well.
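
For illustration, that could look roughly like this - the "gecko" feature name and the GECKO_CLDR_DIR variable are hypothetical:

// Hypothetical sketch: behind a "gecko" feature, read the data from a path
// exported by the Gecko build at compile time; otherwise use the baked-in copy.
#[cfg(feature = "gecko")]
pub const LIKELY_SUBTAGS_JSON: &str =
    include_str!(concat!(env!("GECKO_CLDR_DIR"), "/cldr-core/supplemental/likelySubtags.json"));

#[cfg(not(feature = "gecko"))]
pub const LIKELY_SUBTAGS_JSON: &str = include_str!("data/likelySubtags.json");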

Additionally, it would be nice to have some form of intl/cldr/data_filter.json file that would allow us to control which locales/data get baked into the crates, similarly to what we do for ICU4C. I don't know, though, if there's any standardized model for a vendored third_party crate to read config data from outside of itself in a way that affects its build.

:emilio, :manish - do you have any thoughts on how to design such model?

[0] https://searchfox.org/mozilla-central/rev/2e355fa82aaa87e8424a9927c8136be184eeb6c7/intl/icu_sources_data.py

Flags: needinfo?(manishearth)
Flags: needinfo?(emilio)
Priority: -- → P3
See Also: → 1560038

Additionally, it would be nice to have some form of intl/cldr/data_filter.json file that would allow us to control which locales/data get baked into the crates, similarly to what we do for ICU4C. I don't know, though, if there's any standardized model for a vendored third_party crate to read config data from outside of itself in a way that affects its build.

That sounds like cargo features to me. I don't know whether the CLDR data is too complex to be handled via features, but it seems the most obvious choice (from the point of view of someone as ignorant as me, who doesn't know much about it).

At the moment, the CLDR data update happens separately for each crate; a minor release is then published to crates.io, and I vendor the new versions of the crates into Gecko. That's unsustainable.

Why? (Honest question, as I'm not an expert in what's involved here.) It seems a very similar model to the ICU one.

Flags: needinfo?(emilio)

As far as I can tell, there's a need for a centralized model only for cases where more than one crate uses the same piece of CLDR data. Do we have a list of such potential sharing cases?

For such cases, it would make sense to split the data part into a separate crate that both different use-case code crates depend on, and treating data updates that don't change the data layout as semver non-breaking.

At the moment, the CLDR data update happens separately for each crate; a minor release is then published to crates.io, and I vendor the new versions of the crates into Gecko. That's unsustainable.

Why is it unsustainable?

For cases where different crates depend on disjoint parts of CLDR, it seems to me that trying to unify their data would be unnecessarily ocean-boily.

Just for info, we also have CLDR-derived data that is incorporated directly into Gecko C++ code as compile-time data; I just filed bug 1613350 about updating the localized-quotation-marks data used by intl/locale/Quotes.cpp. It'd be nice if this didn't require any manual work but happened as part of a unified "update Gecko to new CLDR release" process.

That sounds like cargo features to me. I don't know whether the CLDR data is too complex to be handled via features, but it seems the most obvious choice (from the point of view of someone as ignorant as me, who doesn't know much about it).

If I understand correctly, features are really good at binary on/off toggles.

What I'm asking about here is some config file with data such as:

{
    "locales": ["en", "de", "fr", "pl"]
}

being stored in some central location (for example intl/cldr/data_filter.json) and then used to regenerate the data, either at the crate's build time (which means the crate's build.rs has to look up this file from outside of itself) or via some script stored in m-c that regenerates the data.rs files in the crates.
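
Concretely, a script (or build.rs) consuming such a file could be sketched like this, assuming serde/serde_json and the field name from the example above:

// Hypothetical sketch of parsing intl/cldr/data_filter.json.
use serde::Deserialize;

#[derive(Deserialize)]
struct DataFilter {
    locales: Vec<String>,
}

fn load_filter(path: &str) -> std::io::Result<DataFilter> {
    let json = std::fs::read_to_string(path)?;
    Ok(serde_json::from_str(&json).expect("malformed data_filter.json"))
}

fn main() -> std::io::Result<()> {
    let filter = load_filter("intl/cldr/data_filter.json")?;
    // Only regenerate tables for the locales Gecko actually ships.
    for locale in &filter.locales {
        println!("would regenerate data for {}", locale);
    }
    Ok(())
}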

Why? (Honest question, as I'm not an expert in what's involved here.) It seems a very similar model to the ICU one.

ICU is a single bundle vendored in at each release.
In contrast, on the Rust side we currently have two crates - unic-langid and fluent-langneg - and will soon add a third, pluralrules. Later we may add unic-datetime, unic-numberformat, and so on.
Each of them is a separate crate at the moment.

In the long run, we're talking about creating a monorepo, unifying the underlying crates, and coordinating their data management until we can pull the "whole" RustICU in at once; at that point, yes, the behavior will match that of ICU.

Until then, once CLDR gets released I have to update each crate separately, push the releases to crates.io, and then vendor them into Gecko.

What I'd like to do is create a way for those crates to use data from some external source, and then when those crates get vendored into Gecko they could somehow use a single Gecko-chosen CLDR for their data.
As I mentioned, the issue is that since the crates are in third_party/rust/, I'm not sure I can override a file like third_party/rust/unic-langid-impl/src/layout_table.rs. I'm worried that if I try to commit it, it'll create a checksum mismatch.
I can also make it a build.rs step for each crate, with the impact that the data will be rebuilt on every build rather than once per 6 months.

As far as I can tell, there's a need for a centralized model only for cases where more than one crate uses the same piece of CLDR data. Do we have a list of such potential sharing cases?

I'm slightly concerned about a reality in which we have ICU using CLDR 36, unic-langid using CLDR 37, fluent-langneg using CLDR 34, localized quotation marks using CLDR 35, and so on.

I agree that the discrepancy is unlikely to cause any serious impact, but shipping multiple versions of CLDR, each used depending on what you do, makes the product more fragile (for example, ICU's likelySubtags may end up doing something slightly different from unic-langid's, which means that MozLocale::maximize in C++ will do something other than Intl.Locale.maximize in JS).
I'd like to find a way to reduce that risk.

If we could all converge on the intl/cldr JSON source, then we'd have just two sources - ICU and CLDR-JSON.

(In reply to Zibi Braniecki [:zbraniecki][:gandalf] from comment #4)

That sounds like cargo features to me. I don't know whether the CLDR data is too complex to be handled via features, but it seems the most obvious choice (from the point of view of someone as ignorant as me, who doesn't know much about it).

If I understand correctly, features are really good at binary on/off toggles.

What I'm asking about here is some config file with data such as:

[...snip...]

I would expect the crate to have a feature per locale. Then we can, in theory, script something to enable the right cargo features (modifying a Cargo.toml or whatnot).
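
Something along these lines, with hypothetical feature names:

// Hypothetical sketch: one cargo feature per locale gating baked-in data modules.
#[cfg(feature = "locale-en")]
pub mod en { /* generated CLDR tables for "en" */ }
#[cfg(feature = "locale-de")]
pub mod de { /* generated CLDR tables for "de" */ }
#[cfg(feature = "locale-pl")]
pub mod pl { /* generated CLDR tables for "pl" */ }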

I would expect the crate to have a feature per locale. Then we can, in theory, script something to enable the right cargo features (modifying a Cargo.toml or whatnot).

That may not scale well :(

The data filtering for ICU is more sophisticated and I expect that we'll want to replicate that here: https://github.com/unicode-org/icu/blob/master/docs/userguide/icu_data/buildtool.md

In particular, we may want to filter out locales, features, and locales per feature.

For example, we may want to build a general locale list for all crates that build in CLDR, then narrow down the locales we include for Collation or DateTime patterns, and then exclude some locales for a particular data-heavy feature.

Which means that we'd need to create a Cartesian product of locales × features × crates. Having a config file used to rebuild data seems like a more maintainable solution. Do you disagree?
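
For example, a richer filter could be modeled roughly like this (all names hypothetical):

// Hypothetical shape of a richer filter: a default locale list plus
// per-feature overrides, loosely mirroring ICU4C's data filter file.
use std::collections::HashMap;

struct DataFilter {
    // Locales included by default for every CLDR-driven crate.
    locales: Vec<String>,
    // Per-feature overrides, e.g. "collation" -> a narrower locale list.
    features: HashMap<String, Vec<String>>,
}

impl DataFilter {
    // The locale list to bake in for a given feature.
    fn locales_for(&self, feature: &str) -> &[String] {
        self.features
            .get(feature)
            .map_or(self.locales.as_slice(), |v| v.as_slice())
    }
}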

(In reply to Zibi Braniecki [:zbraniecki][:gandalf] from comment #4)

What I'm asking about here is some config file with data such as:

{
    "locales": ["en", "de", "fr", "pl"]
}

What sorts of things do we filter down to a subset of locales for, as opposed to taking the data for all locales that CLDR knows about?

Until then, once CLDR gets released I have to update each crate separately, push the releases to crates.io, and then vendor them into Gecko.

So is the problem updating each crate upstream? Once the upstreams have been updated, the vendoring step could be a script, right?

What I'd like to do is create a way for those crates to use data from some external source, and then when those crates get vendored into Gecko they could somehow use a single Gecko-chosen CLDR for their data.

Shouldn't we want the upstream crates to be up-to-date? It seems bad to come up with a solution that patches the CLDR data for Gecko only and leaves the crates.io copies out-of-date.

If we could all converge on the intl/cldr JSON source, then we'd have just two sources - ICU and CLDR-JSON.

Is JSON space-efficient enough to actually be workable compared to pre-baking the data in some smart way into the read-only segment of the binary?

What sorts of things do we filter down to a subset of locales for, as opposed to taking the data for all locales that CLDR knows about?

As demonstrated in bug 1612578 by :anba, narrowing down the list of locales for currencies and collations is a technique used by Chromium to shrink the data tables. There are other areas where we could selectively adjust the locale list per feature - Google engineers indicated that they plan to further tailor the list over time as well.

Shouldn't we want the upstream crates to be up-to-date? It seems bad to come up with a solution that patches the CLDR data for Gecko only and leaves the crates.io copies out-of-date.

From my POV the issue of outdated CLDR is separate from cohesive CLDR.
I agree that in general you want all CLDR-driven crates to contain the latest stable CLDR. I also believe it's important for Gecko to invest in using a single CLDR for all its APIs, and that's only achievable if we converge on a single CLDR source that all crates feed from.

In my mind a feature of a crate to use a CLDR from a single source is a separate item from the drive to update the crate to use the latest CLDR.

Is JSON space-efficient enough to actually be workable compared to pre-baking the data in some smart way into the read-only segment of the binary?

I must have misstated my position. CLDR-JSON is only my suggestion for a vendored-in source of data. All crates I maintain parse the JSON data into more compact data structures - currently baking the result in as Rust source, but in unic-datetime I'm also experimenting with serde-bincode.
You can see the preliminary results in https://github.com/zbraniecki/intl-measurements/ - BIN is bincode.
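
For context, the bincode variant is roughly: parse the CLDR JSON once at data-generation time into compact structs, serialize those with bincode, and embed/deserialize the blob at runtime. A sketch under those assumptions - the struct and its fields are made up for illustration:

// Hypothetical sketch of the bincode path, assuming serde + bincode 1.x.
use serde::{Deserialize, Serialize};

#[derive(Serialize, Deserialize, PartialEq, Debug)]
struct DatePatterns {
    date_short: String,
    time_short: String,
}

fn main() {
    // Generation step (runs once per CLDR update): build compact structs
    // from the CLDR JSON and serialize them to a small binary blob.
    let patterns = DatePatterns {
        date_short: "M/d/yy".into(),
        time_short: "h:mm a".into(),
    };
    let blob = bincode::serialize(&patterns).unwrap();

    // Runtime step: the crate would embed the blob, e.g.
    //   static DATA: &[u8] = include_bytes!("data/en.bin");
    // and deserialize it lazily on first use.
    let decoded: DatePatterns = bincode::deserialize(&blob).unwrap();
    assert_eq!(decoded, patterns);
}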

(In reply to Zibi Braniecki [:zbraniecki][:gandalf] from comment #8)

As demonstrated in bug 1612578 by :anba, narrowing down the list of locales for currencies

What Chrome does with currencies seems pretty uncool as a matter of principle: They have an arbitrary cut-off for "economy size" and currency-sensitive APIs just don't work for smaller economies. If browsers are unwilling to ship this data, maybe exposing currency-aware formatting to JS was a mistake.

and collations is a technique used by Chromium to shrink the data tables.

Do we currently exclude the collations that are based on legacy CJK encodings? Does anyone have any intention of implementing them in Rust? And if there is an intention to implement them in Rust, shouldn't they be implemented by calling into encoding_rs rather than by importing the data from CLDR if the goal is to minimize binary size?

Shouldn't we want the upstream crates to be up-to-date? It seems bad to come up with a solution that patches the CLDR data for Gecko only and leaves the crates.io copies out-of-date.

From my POV the issue of outdated CLDR is separate from cohesive CLDR.
I agree that in general you want all CLDR-driven crates to contain the latest stable CLDR. I also believe it's important for Gecko to invest in using a single CLDR for all its APIs, and that's only achievable if we converge on a single CLDR source that all crates feed from.

I'm still skeptical of the necessity of such cohesion and of the possibility of getting all crates to agree on a single CLDR source.

Omitting currencies that are no longer in use may be a better cut-off. That means, for example:

10..toLocaleString("en", {style: "currency", currency: "DDM", currencyDisplay: "name"})

will no longer return "10.00 East German marks", but instead "10.00 DDM". Using the currency code already happens today for locales which don't have localised currency names, for example:

10..toLocaleString("pa", {style: "currency", currency: "DDM", currencyDisplay: "name"})

already returns "10.00 DDM".

One additional reason would be to allow us to write local overrides - see bug 1614941 for an example of where we'd like to write one.

If we land the Rust datetimeformat without a centralized CLDR source, we won't be able to.

So, a couple of ways to do this:

  • Have build.rs pick up a path for a custom generated file from an environment variable
  • Have a feature where the build.rs generates the file on the fly, perhaps picking up the data path from an env var
  • Have the data provider be a trait with a default impl and type aliases provided, override in Gecko

I like the last option the best, but it doesn't necessarily scale if things are deep down in a dependency tree.

The last option would basically be that instead of PluralRules you have generic::PluralRules<Provider> and you export generic::PluralRules<DefaultProvider> from lib.rs as crate::PluralRules as a default feature. Gecko can disable the feature and plug in its own Provider impl.
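
Roughly sketched - everything beyond the PluralRules naming is hypothetical:

// Hypothetical sketch of the provider-trait shape.
pub trait DataProvider {
    // However the impl stores it: baked-in, memory-mapped, Gecko-supplied...
    fn plural_rules_data(&self, locale: &str) -> Option<&'static str>;
}

pub mod generic {
    use super::DataProvider;

    pub struct PluralRules<P: DataProvider> {
        provider: P,
    }

    impl<P: DataProvider> PluralRules<P> {
        pub fn new(provider: P) -> Self {
            PluralRules { provider }
        }

        pub fn select(&self, locale: &str, _n: f64) -> Option<&'static str> {
            // Real logic would parse and evaluate the rules; this just shows the flow.
            self.provider.plural_rules_data(locale).map(|_| "other")
        }
    }
}

// Default, baked-in provider behind an on-by-default feature;
// Gecko disables the feature and plugs in a provider backed by intl/cldr.
#[cfg(feature = "default-provider")]
pub struct DefaultProvider;

#[cfg(feature = "default-provider")]
impl DataProvider for DefaultProvider {
    fn plural_rules_data(&self, _locale: &str) -> Option<&'static str> {
        Some(r#"{"en": "one: i = 1 and v = 0"}"#) // stand-in for the baked-in tables
    }
}

#[cfg(feature = "default-provider")]
pub type PluralRules = generic::PluralRules<DefaultProvider>;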

Flags: needinfo?(manishearth)
Severity: normal → S3
See Also: → 1331508