Evaluate replacing some l10n data with CLDR

NEW
Unassigned

enhancement
P4
normal
2 years ago
6 months ago

People

(Reporter: zbraniecki, Unassigned)

Tracking

(Depends on 2 bugs)

Firefox Tracking Flags

(Not tracked)

Details

Attachments

(1 attachment)

We currently have several places where we ask our localizers to translate entries that we already have via ICU/CLDR.

In at least one case - languageNames[0] and regionNames[1] - it's a substantial number of mundane strings (500) that are not very project-specific.

At the same time, without a need for those strings in the JS Intl API, carrying all of them for all locales in every Gecko build is going to increase the bundle size unnecessarily.

I'd like to propose a new approach to handling such data. In this approach, we'd extract parts of the CLDR data files into a format easily usable from within Firefox, treat those files as part of the language resources (i.e. package them into langpacks) and load them alongside other language resources via L10nRegistry.

The simplest approach would be to make the tool take CLDR data from ICU and produce .FTL equivalents of the regionNames and languageNames files. But since we don't need any human interaction here, and the entries are simple key-value pairs, we could make it faster by using a faster-to-parse format like JSON, and possibly also lighter by using some other format that takes fewer bytes to encode the data (maybe even binary?).

Then, I imagine that such a file (let's name it languageNames.json) would be available to be loaded via L10nRegistry and exposed for the special API to use.
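To make the FTL-versus-JSON trade-off concrete, here is a minimal sketch of what such a tool's output step might look like. The function names and the `language-name` message prefix are hypothetical, chosen only for illustration, and the sketch assumes the names contain no characters that need escaping in Fluent (such as `{`).

```python
import json


def to_ftl(names, prefix):
    """Render a code -> name mapping as Fluent (.ftl) messages.

    Assumption: names are plain text with no Fluent special characters.
    """
    lines = []
    for code in sorted(names):
        lines.append(f"{prefix}-{code} = {names[code]}")
    return "\n".join(lines) + "\n"


def to_json(names):
    """Render the same mapping as compact JSON (no extra whitespace)."""
    return json.dumps(
        names, ensure_ascii=False, sort_keys=True, separators=(",", ":")
    )


# Illustrative sample data, not a real extract.
language_names = {"de": "German", "pl": "Polish"}
ftl_text = to_ftl(language_names, "language-name")
json_text = to_json(language_names)
```

Both outputs carry the same key-value pairs; the JSON form only drops the per-message syntax that Fluent needs for human-edited files.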

This way, we could sync our data with CLDR, stop maintaining our own database of language and region names, and decrease the burden on our localizers.

I don't think this will happen anytime soon, as first we'll need to sync our data with CLDR data, and develop the tools to extract CLDR data, and then write the APIs in Gecko, but I think it's a good time to start the discussion about it and possibly link it to other, related bugs.

[0] https://searchfox.org/mozilla-central/source/toolkit/locales/en-US/chrome/global/languageNames.properties
[1] https://searchfox.org/mozilla-central/source/toolkit/locales/en-US/chrome/global/regionNames.properties
Priority: -- → P4
Using stock ftl as file format would have the benefit of language fallback being already implemented, though?

That would make up for partially existing data.
Andre - can you help me understand how we can get the CLDR data here?

An example bit I'm trying to track is an entry for the language name "German" which in CLDR is present in http://www.unicode.org/cldr/charts/31/summary/en.html#194 or http://www.unicode.org/cldr/charts/31/summary/pl.html#192

I can't find such entry or even data file in our `intl/icu` directory which makes me worried that maybe we don't even carry CLDR data in our repository? I know we didn't store raw CLDR, but I assumed we have CLDR via ICU and at least in the source repo we have access to all of it (and then we cut out things we don't need when we build).

If that's not the case, I'm wondering what's the best way to approach the problem I'm trying to tackle here. Should we write a build system script that pulls CLDR data from github? Should we vendor in CLDR separately?

I'd appreciate your thoughts and feedback
Flags: needinfo?(andrebargull)
(In reply to Zibi Braniecki [:gandalf][:zibi] from comment #2)
> I can't find such entry or even data file in our `intl/icu` directory which
> makes me worried that maybe we don't even carry CLDR data in our repository?
> I know we didn't store raw CLDR, but I assumed we have CLDR via ICU and at
> least in the source repo we have access to all of it (and then we cut out
> things we don't need when we build).

We already cut down the data when we clone the ICU svn repository [1]. (At least I think that's the problem in this case, because [2] contains "niemiecki" and that file is part of the directories we wipe out.)

[1] https://searchfox.org/mozilla-central/rev/2031c0f517185b2bd0e8f6f92f9491b3410c1f7f/intl/update-icu.sh#37-50
[2] http://bugs.icu-project.org/trac/browser/trunk/icu4c/source/data/lang/pl.txt
Flags: needinfo?(andrebargull)
Depends on: 1416148
Thanks André!

So, seems like we'd need a script that pulls data such as:
http://bugs.icu-project.org/trac/browser/trunk/icu4c/source/data/lang/pl.txt

and generates:

https://searchfox.org/mozilla-central/source/toolkit/locales/en-US/chrome/global/languageNames.properties

out of it.

That seems reasonable. I guess we do similar things for intl data:
 - for timezones: https://searchfox.org/mozilla-central/source/intl/update-tzdata.sh
 - for ICU use: https://searchfox.org/mozilla-central/source/intl/update-icu.sh
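A rough sketch of what such a script's core could look like follows. It is written under several assumptions: it reads the CLDR JSON packaging (a document shaped like `main.<locale>.localeDisplayNames.languages`) rather than ICU's `.txt` resource bundles linked above, it assumes the `.properties` keys are lowercase codes, and it assumes alternate forms appear as keys containing `-alt-`. The function name is hypothetical.

```python
import json


def cldr_json_to_properties(cldr_text, locale, section):
    """Convert one CLDR JSON names document into .properties text.

    `section` is "languages" or "territories".
    """
    data = json.loads(cldr_text)
    names = data["main"][locale]["localeDisplayNames"][section]
    lines = []
    for code, name in sorted(names.items()):
        # Skip alternate forms such as "CZ-alt-variant" (assumed key shape).
        if "-alt-" in code:
            continue
        # Assumption: the target file uses lowercase keys.
        lines.append(f"{code.lower()} = {name}")
    return "\n".join(lines) + "\n"


# Illustrative input only; a real run would read the file from a CLDR checkout.
sample = json.dumps(
    {"main": {"pl": {"localeDisplayNames": {"languages": {
        "de": "niemiecki",
        "en": "angielski",
    }}}}}
)
properties_text = cldr_json_to_properties(sample, "pl", "languages")
```

Running it once per locale and per section would give one output file per locale per data file.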

Kekoa - is it something you'd be interested in?
Flags: needinfo?(kekoariggin)
This looks like something I can work on. I do have a couple questions:

- What are the files that need to be processed and where can I find them? 
- What output files are we hoping to have? Am I understanding correctly that we need one file per locale per data file?
Flags: needinfo?(kekoariggin) → needinfo?(gandalf)
Depends on: 1433694
I filed bug 1433694 for this particular script. Let's take it there.
Flags: needinfo?(gandalf)
Taking the conversation about the idea of moving some of our l10n data to use CLDR out of bug 1433694:

Gerv in bug 1433694 comment 5 asked:
> there would need to be a very good reason to reopen that discussion. Are you suggesting there is one?

I believe so. We're unifying all of our Internationalization around ICU and CLDR. There are many benefits of doing that, but three main ones are:

1) They already have a lot of data and algorithms that we get for free if we use them
2) They're well maintained by many industry players rather than us trying to maintain our own dataset
3) CLDR/ICU is what's used in MacOS, iOS, Windows and Android at least. I expect all new major systems to use it as well.

This means that by using CLDR here we get consistency for users between what they see in the date/time/regional preferences of their OS and in Firefox.

Since we don't maintain our own Regional Preferences panel, the closer we get to the underlying OS data model, the easier it is for us to just use their settings thus creating a way for users to customize their intl settings once (in the OS) and have Firefox just pick it up.

For example, MacOS locale selection uses ICU methods directly, showing all locale names in CLDR format. When the user then goes to Fx preferences to select the Firefox locale, having the same data shown in the same way would make it easier for the user to scan and find their locales.

In particular in the context of the potential switch of language/region names there are two main values I see:

1) We lower the entry barrier and localization burden on our localizers. Those are hundreds of strings that we can get for free out of CLDR, and we can join the cooperation effort with other industry players by upstreaming our improvements, rather than maintaining the data on our own.
2) We guarantee consistency between our Intl APIs and our L10n data. Otherwise, if any Intl API returns region/language names, it will use CLDR, while our L10n will show different names because they come from a different source.

I don't know if that's enough of the value, but I hope that it at least justifies holding conversation about it.

Can I get your further feedback based on this explanation :gerv?
Flags: needinfo?(gerv)
> Mozilla uses uppercase (e.g. "Italiano" for Italian), CLDR uses lowercase ("italiano"). That would work in the middle of a sentence, not as stand-alone (current use in Mozilla) or at the beginning of a sentence.

Does it also "not work" in MacOS language selection preferences? If it does - let's investigate how they use CLDR to get the outcome we want and maybe we can use it as well.
They're uppercase in macOS Prefs.
They seem to use some capitalization - http://icu-project.org/apiref/icu4c/classicu_1_1Locale.html#acc4a8c21f19103503663cf6fcda9170d

I'll try to investigate how they decide on capitalization. Maybe we can use ICU just to decide on capitalization for standalone, while feeding it data from our own FTL files. I have high hopes that technically we can solve it. :)
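As a baseline for that investigation, the naive way to get a standalone form is to uppercase only the first character. The sketch below is not what ICU does and is not locale-aware (Python's `str.upper()` ignores locale-specific casing rules such as the Turkish dotted i); the function name is hypothetical.

```python
def standalone_form(name):
    """Uppercase only the first character, leaving the rest untouched.

    Unlike str.capitalize(), this does not lowercase the remaining
    characters, so names that are already capitalized stay as they are.
    """
    return name[:1].upper() + name[1:]


label = standalone_form("italiano")
```

Anything beyond this, such as per-language rules on whether standalone names should be capitalized at all, would have to come from the data itself.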

I'd like to tackle comment 7 first to figure out if it makes sense to try at all.
(In reply to Zibi Braniecki [:gandalf][:zibi] from comment #7)
> I don't know if that's enough of the value, but I hope that it at least
> justifies holding conversation about it.

It's certainly enough to hold a conversation :-)

Let's start with a few clarifying questions:

* Are we talking here about country/region names, language names, or both?

* CLDR presumably provides the names in "all" languages? Does GENC, or does it only provide English names?

* Might it be an option to use the CLDR names entirely internally and in APIs, but the GENC names for display within Firefox?

* Might it be an option to not ship a list of names at all, but pick them up from the OS, so we can disclaim responsibility for them? Or would that be a coding nightmare?

* Do we know if any particular OS, e.g. OS X, uses the same list in every shipping version, or do they perhaps customize the list to local political preference?

* Is there any documentation anywhere about how CLDR chooses the names it chooses, and their position on the usual controversial topics?

* Is there a list somewhere of the differences (both additions/removals, and textual changes) between the CLDR country/region name list and/or the GENC country/region name list? Having this data would make it much easier to evaluate both how big of a change this would be, and also how politically sensitive.

Gerv
Flags: needinfo?(gerv)
I can try to answer some of them (the non-technical ones).

> * Are we talking here about country/region names, language names, or both?

I don't think Mozilla's list of languages comes from any shared source, unlike region names. But yes, the final goal would be to use CLDR for both.

> * CLDR presumably provides the names in "all" languages? Does GENC, or does
> it only provide English names?

It provides translations for the vast majority of our languages, and we plan to contribute back our translations where a language is not available.

As far as I can tell, GENC only provides a list of English names.

> * Is there a list somewhere of the differences (both additions/removals, and
> textual changes) between the CLDR country/region name list and/or the GENC
> country/region name list? Having this data would make it much easier to
> evaluate both how big of a change this would be, and also how politically
> sensitive.

Some territories are defined in GENC but not in CLDR

Missing region names in CLDR:
QM: Midway Islands
QS: Bassas da India
QU: Juan de Nova Island
QW: Wake Island
QX: Glorioso Islands
QZ: Akrotiri
XA: Ashmore and Cartier Islands
XB: Baker Island
XC: Coral Sea Islands
XD: Dhekelia
XE: Europa Island
XG: Gaza Strip
XH: Howland Island
XJ: Jan Mayen
XL: Palmyra Atoll
XM: Kingman Reef
XP: Paracel Islands
XQ: Jarvis Island
XR: Svalbard
XS: Spratly Islands
XT: Tromelin Island
XU: Johnston Atoll
XV: Navassa Island
XW: West Bank

Different names are mostly due to ordering, accents, use of & vs 'and', or abbreviations.

The real differences are: BQ, CZ, FK, HK, MM

Different values:
AG
  CLDR: Antigua & Barbuda
  Mozilla: Antigua and Barbuda
BA
  CLDR: Bosnia & Herzegovina
  Mozilla: Bosnia and Herzegovina
BL
  CLDR: St. Barthélemy
  Mozilla: Saint Barthelemy
BQ
  CLDR: Caribbean Netherlands
  Mozilla: Bonaire, Sint Eustatius, and Saba
BS
  CLDR: Bahamas
  Mozilla: Bahamas, The
CD
  CLDR: Congo - Kinshasa
  Mozilla: Congo (Kinshasa)
CG
  CLDR: Congo - Brazzaville
  Mozilla: Congo (Brazzaville)
CV
  CLDR: Cape Verde
  Mozilla: Cabo Verde
CZ
  CLDR: Czechia
  Mozilla: Czech Republic
FK
  CLDR: Falkland Islands
  Mozilla: Falkland Islands (Islas Malvinas)
FM
  CLDR: Micronesia
  Mozilla: Micronesia, Federated States of
GM
  CLDR: Gambia
  Mozilla: Gambia, The
GS
  CLDR: South Georgia & South Sandwich Islands
  Mozilla: South Georgia and South Sandwich Islands
HK
  CLDR: Hong Kong SAR China
  Mozilla: Hong Kong
HM
  CLDR: Heard & McDonald Islands
  Mozilla: Heard Island and McDonald Islands
KN
  CLDR: St. Kitts & Nevis
  Mozilla: Saint Kitts and Nevis
KP
  CLDR: North Korea
  Mozilla: Korea, North
KR
  CLDR: South Korea
  Mozilla: Korea, South
LC
  CLDR: St. Lucia
  Mozilla: Saint Lucia
MF
  CLDR: St. Martin
  Mozilla: Saint Martin
MM
  CLDR: Myanmar (Burma)
  Mozilla: Burma
MO
  CLDR: Macau SAR China
  Mozilla: Macau
PM
  CLDR: St. Pierre & Miquelon
  Mozilla: Saint Pierre and Miquelon
RE
  CLDR: Réunion
  Mozilla: Reunion
SH
  CLDR: St. Helena
  Mozilla: Saint Helena, Ascension, and Tristan da Cunha
ST
  CLDR: São Tomé & Príncipe
  Mozilla: Sao Tome and Principe
TC
  CLDR: Turks & Caicos Islands
  Mozilla: Turks and Caicos Islands
TF
  CLDR: French Southern Territories
  Mozilla: French Southern and Antarctic Lands
TT
  CLDR: Trinidad & Tobago
  Mozilla: Trinidad and Tobago
VC
  CLDR: St. Vincent & Grenadines
  Mozilla: Saint Vincent and the Grenadines
VG
  CLDR: British Virgin Islands
  Mozilla: Virgin Islands, British
VI
  CLDR: U.S. Virgin Islands
  Mozilla: Virgin Islands, U.S.
WF
  CLDR: Wallis & Futuna
  Mozilla: Wallis and Futuna

Full analysis is in bug 1416148.
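For reference, a comparison like the one above can be reproduced with a few lines of scripting. This is only a sketch with hypothetical function names and a tiny illustrative sample; it is not the script used for the analysis in bug 1416148.

```python
def compare_names(cldr, mozilla):
    """Return (missing_in_cldr, different) for two code -> name mappings."""
    missing = {
        code: name for code, name in mozilla.items() if code not in cldr
    }
    different = {
        code: (cldr[code], mozilla[code])
        for code in mozilla
        if code in cldr and cldr[code] != mozilla[code]
    }
    return missing, different


def format_report(missing, different):
    """Format the comparison in the same layout as the lists above."""
    lines = ["Missing region names in CLDR:"]
    lines += [f"{code}: {name}" for code, name in sorted(missing.items())]
    lines.append("")
    lines.append("Different values:")
    for code, (cldr_name, mozilla_name) in sorted(different.items()):
        lines += [code, f"  CLDR: {cldr_name}", f"  Mozilla: {mozilla_name}"]
    return "\n".join(lines)


# Illustrative sample; real input would be the two full region lists.
cldr = {"AG": "Antigua & Barbuda", "CZ": "Czechia", "IT": "Italy"}
mozilla = {
    "AG": "Antigua and Barbuda",
    "CZ": "Czech Republic",
    "IT": "Italy",
    "XW": "West Bank",
}
missing, different = compare_names(cldr, mozilla)
report = format_report(missing, different)
```

A plain string comparison flags every cosmetic difference (accents, `&` vs `and`), so the substantive differences still need to be picked out by hand.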
(In reply to Francesco Lodolo [:flod] from comment #12)
> The real differences are: BQ, CZ, FK, HK, MM

Correction, there are a few more (some more significant than others): BQ, CZ, FM, FK, HK, MM, MO, SH
Full list of CLDR regions
https://github.com/unicode-cldr/cldr-localenames-full/blob/master/main/en/territories.json

CZ, FK, HK, MO have an alt-variant that matches Mozilla values. 
The alt-variant for MM is Myanmar, while for Mozilla is Burma.
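A small sketch of how the alternate forms could be checked against Mozilla's values is below. It assumes the CLDR JSON file stores alternates as extra keys of the form `<code>-alt-<type>` next to the primary entry; the function names are hypothetical and the sample uses only the values quoted in this bug.

```python
def split_alternates(territories):
    """Split a CLDR JSON territories mapping into primary and alternate names."""
    primary, alternates = {}, {}
    for key, name in territories.items():
        # Assumed key shape: "CZ-alt-variant" -> code "CZ", type "variant".
        code, sep, alt_type = key.partition("-alt-")
        if sep:
            alternates.setdefault(code, {})[alt_type] = name
        else:
            primary[key] = name
    return primary, alternates


def matches_via_alternate(code, mozilla_name, alternates):
    """True if any alternate form for `code` equals the Mozilla name."""
    return mozilla_name in alternates.get(code, {}).values()


# Illustrative sample using values mentioned in this bug.
sample = {
    "CZ": "Czechia",
    "CZ-alt-variant": "Czech Republic",
    "MM": "Myanmar (Burma)",
    "MM-alt-variant": "Myanmar",
}
primary, alternates = split_alternates(sample)
cz_matches = matches_via_alternate("CZ", "Czech Republic", alternates)
mm_matches = matches_via_alternate("MM", "Burma", alternates)
```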
Side question, is there a way to browse GENC these days w/out adding a cert exception?
Thanks for the questions Gerv!

(In reply to Gervase Markham [:gerv] from comment #11)
> * Are we talking here about country/region names, language names, or both?

Both.
 
> * CLDR presumably provides the names in "all" languages? Does GENC, or does
> it only provide English names?

They provide the names in all languages, and the list is updated/maintained and extended by the CLDR work group with input from major contributors like Apple, Microsoft and Google but also open to external contributors.

As I mentioned before, CLDR aims to be the Wikipedia for intl data.

> * Might it be an option to use the CLDR names entirely internally and in
> APIs, but the GENC names for display within Firefox?

I'm not sure what you mean here by "CLDR names entirely internally"?

If we believe any of the English names in CLDR to be wrong, I'd suggest we upstream our improvements to benefit all CLDR users.
 
> * Might it be an option to not ship a list of names at all, but pick them up
> from the OS, so we can disclaim responsibility for them? Or would that be a
> coding nightmare?

I'd say that it would be a very cumbersome procedure to try. We'd end up with varied levels of support per OS, with missing matches in places we cannot control.

Having a unified list of language names and regions as part of our toolkit localization dataset makes it much easier to build predictable UIs for selecting locales.
 
> * Do we know if any particular OS, e.g. OS X, uses the same list in every
> shipping version, or do they perhaps customize the list to local political
> preference?

I do not know that. I can ask.
All I know is that they use CLDR.
 
> * Is there any documentation anywhere about how CLDR chooses the names it
> chooses, and their position on the usual controversial topics?

http://cldr.unicode.org/translation/country-names
http://cldr.unicode.org/index/cldr-spec/picking-the-right-language-code

CLDR meets every week on Wednesday at 8am PST. I can ask any of the questions we may have directly to the WG if needed :)
In case of french - bug 1434886 comment 3 - seems like we make mistakes in our translations and CLDR keeps the quality quite high.

Resetting NI on :gerv to help us move forward.
Flags: needinfo?(gerv)
(In reply to Zibi Braniecki [:gandalf][:zibi] from comment #17)
> In case of french - bug 1434886 comment 3 - seems like we make mistakes in
> our translations and CLDR keeps the quality quite high.

Same was for Italian (bug 1434854).
Sorry, guys - was at FOSDEM. (Sad not to see you there, gandalf?) 

CLDR's list was not one of the options considered last time, so it's reasonable to reopen the question.

Looking down that list of diffs, it seems pleasingly minimal. None of my list of "dangerous" names appears on it, so presumably all of them are exactly the same in both lists.[0] Almost all of the changes are either neutral or improvements, with only HK, MM and MO needing further study. 

HK and MO have the same change - the addition of "SAR China". This is a reasonable factual description of their status, so I hope it would not be objectionable.

MM changes "Burma" to "Myanmar (Burma)". The dispute here is whether the military regime has the legitimacy to change the name (both names have valid provenance in the local language) from Burma to Myanmar without a referendum. Including both but leading with the one the government has chosen seems just as fine a solution as any other.
https://en.wikipedia.org/wiki/Names_of_Myanmar

So yes, I approve of this change. We should make it and update the wiki page to match.[1] As always, let's stay out of the business of making country-specific tweaks to the list based on representations from citizens. That way lies madness.

Gerv

[0] 
Kosovo
Tibet
Taiwan
Northern Cyprus
Chechnya
Catalonia
Kurdistan
ISIS
The West Bank / The Gaza Strip / Palestine

[1] https://wiki.mozilla.org/Lists_of_Countries_and_Regions
Flags: needinfo?(gerv)