Closed
Bug 1431324
Opened 7 years ago
Closed 5 years ago
Evaluate replacing some l10n data with CLDR
Categories
(Core :: Internationalization, enhancement, P4)
RESOLVED
WONTFIX
People
(Reporter: zbraniecki, Unassigned)
Attachments
(1 file)
95.84 KB,
image/png
We currently have several places where we ask our localizers to translate entries that we already have via ICU/CLDR.
In at least one case - languageNames[0] and regionNames[1] - it's a substantial number of mundane strings (500) that are not very project-specific.
At the same time, since the JS Intl API doesn't need those strings, carrying all of them for all locales in every Gecko build would increase the bundle size unnecessarily.
I'd like to propose a new approach to handling such data. In this approach, we'd extract parts of the CLDR data files into a format easily usable from within Firefox, treat those files as part of the language resources (i.e. package them into langpacks) and load them alongside other language resources via L10nRegistry.
The simplest approach would be to make the tool take CLDR data from ICU and produce .FTL equivalents of the regionNames and languageNames files. But since we don't need any human interaction here and the entries are simple key-value pairs, we could make it faster by using a format like JSON, and possibly also lighter by using some other format that takes fewer bytes to encode the data (maybe even binary?).
Then, I imagine that such a file (let's name it languageNames.json) would be available to be loaded via L10nRegistry and exposed for the special API to use.
This way, we could sync our data with CLDR, stop maintaining our own database of language and region names, and decrease the burden on our localizers.
I don't think this will happen anytime soon, as first we'll need to sync our data with CLDR data, and develop the tools to extract CLDR data, and then write the APIs in Gecko, but I think it's a good time to start the discussion about it and possibly link it to other, related bugs.
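To make the two format options concrete, here is a minimal Python sketch that serializes the same key-value data both as compact JSON and as Fluent-style messages. The sample dictionary, the function names, and the `language-name-` message prefix are assumptions made for illustration only; a real tool would read its input from CLDR/ICU rather than from a hard-coded dictionary.

```python
import json
from typing import Dict

# Hypothetical sample of language names for the Polish locale.
# A real tool would extract these from CLDR/ICU at build time.
LANGUAGE_NAMES_PL: Dict[str, str] = {
    "de": "niemiecki",
    "en": "angielski",
    "pl": "polski",
}


def to_json(names: Dict[str, str]) -> str:
    """Serialize names as compact JSON with sorted keys for stable output."""
    return json.dumps(
        names, ensure_ascii=False, sort_keys=True, separators=(",", ":")
    )


def to_ftl(names: Dict[str, str], prefix: str = "language-name-") -> str:
    """Serialize names as Fluent messages, one `id = value` per line.

    The message id prefix is a made-up convention for this sketch.
    """
    return "".join(
        f"{prefix}{code} = {name}\n" for code, name in sorted(names.items())
    )


if __name__ == "__main__":
    print(to_json(LANGUAGE_NAMES_PL))
    print(to_ftl(LANGUAGE_NAMES_PL), end="")
```

The JSON form is the smaller of the two and needs no Fluent parser, at the cost of losing whatever the FTL pipeline already provides.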
[0] https://searchfox.org/mozilla-central/source/toolkit/locales/en-US/chrome/global/languageNames.properties
[1] https://searchfox.org/mozilla-central/source/toolkit/locales/en-US/chrome/global/regionNames.properties
Reporter | ||
Updated•7 years ago
Priority: -- → P4
Comment 1•7 years ago
Using stock ftl as file format would have the benefit of language fallback being already implemented, though?
That would make up for partially existing data.
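As a rough illustration of why fallback matters for partially existing data, the Python sketch below resolves a key through a chain of locales. It uses simple prefix truncation plus a default locale; this is an assumption made for the example and not a description of how Fluent or L10nRegistry actually negotiates locales. All names and sample data are hypothetical.

```python
from typing import Dict, List, Optional


def fallback_chain(locale: str, default: str = "en-US") -> List[str]:
    """Return the locale, its shorter prefixes, then the default locale."""
    parts = locale.split("-")
    chain = ["-".join(parts[:i]) for i in range(len(parts), 0, -1)]
    if default not in chain:
        chain.append(default)
    return chain


def lookup(
    key: str, locale: str, resources: Dict[str, Dict[str, str]]
) -> Optional[str]:
    """Return the first value found for `key` along the fallback chain."""
    for candidate in fallback_chain(locale):
        value = resources.get(candidate, {}).get(key)
        if value is not None:
            return value
    return None


# Hypothetical, deliberately incomplete per-locale data.
RESOURCES = {
    "es": {"de": "alemán"},
    "en-US": {"de": "German", "fy": "Western Frisian"},
}

if __name__ == "__main__":
    print(lookup("de", "es-MX", RESOURCES))  # found in "es"
    print(lookup("fy", "es-MX", RESOURCES))  # falls back to "en-US"
```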
Reporter | ||
Comment 2•7 years ago
Andre - can you help me understand how we can get the CLDR data here?
An example bit I'm trying to track is an entry for the language name "German" which in CLDR is present in http://www.unicode.org/cldr/charts/31/summary/en.html#194 or http://www.unicode.org/cldr/charts/31/summary/pl.html#192
I can't find such an entry or even data file in our `intl/icu` directory which makes me worried that maybe we don't even carry CLDR data in our repository? I know we didn't store raw CLDR, but I assumed we have CLDR via ICU and at least in the source repo we have access to all of it (and then we cut out things we don't need when we build).
If that's not the case, I'm wondering what's the best way to approach the problem I'm trying to tackle here. Should we write a build system script that pulls CLDR data from GitHub? Should we vendor in CLDR separately?
I'd appreciate your thoughts and feedback.
Flags: needinfo?(andrebargull)
Comment 3•7 years ago
(In reply to Zibi Braniecki [:gandalf][:zibi] from comment #2)
> I can't find such an entry or even data file in our `intl/icu` directory which
> makes me worried that maybe we don't even carry CLDR data in our repository?
> I know we didn't store raw CLDR, but I assumed we have CLDR via ICU and at
> least in the source repo we have access to all of it (and then we cut out
> things we don't need when we build).
We already cut down the data when we clone the ICU svn repository [1]. (At least I think that's the problem in this case, because [2] contains "niemiecki" and that file is part of the directories we wipe out.)
[1] https://searchfox.org/mozilla-central/rev/2031c0f517185b2bd0e8f6f92f9491b3410c1f7f/intl/update-icu.sh#37-50
[2] http://bugs.icu-project.org/trac/browser/trunk/icu4c/source/data/lang/pl.txt
Flags: needinfo?(andrebargull)
Reporter | ||
Comment 4•7 years ago
Thanks André!
So, seems like we'd need a script that pulls data such as:
http://bugs.icu-project.org/trac/browser/trunk/icu4c/source/data/lang/pl.txt
and generates:
https://searchfox.org/mozilla-central/source/toolkit/locales/en-US/chrome/global/languageNames.properties
out of it.
That seems reasonable. I guess we do similar things for intl data:
- for timezones: https://searchfox.org/mozilla-central/source/intl/update-tzdata.sh
- for ICU use: https://searchfox.org/mozilla-central/source/intl/update-icu.sh
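A minimal sketch of what such a script might do, in Python. It assumes the ICU source text format nests entries as `code{"name"}` inside a `Languages{ ... }` table, and it works on a small hand-written sample instead of downloading the real `pl.txt`. The function names and the `key = value` output layout are illustrative assumptions, not the format the tree actually requires.

```python
import re
from typing import Dict

# Hand-written sample imitating the layout of an ICU lang/*.txt file.
SAMPLE = '''pl{
    Keys{
        calendar{"kalendarz"}
    }
    Languages{
        de{"niemiecki"}
        en{"angielski"}
    }
}
'''

# One simple entry of the form: code{"display name"}
ENTRY = re.compile(r'(\w+)\{"([^"]*)"\}')


def extract_table(text: str, table: str) -> Dict[str, str]:
    """Return the simple string entries of the named table.

    Brace matching is naive: braces inside quoted values would confuse it.
    """
    start = text.find(table + "{")
    if start < 0:
        return {}
    body_start = start + len(table) + 1
    depth = 1
    pos = body_start
    while pos < len(text) and depth > 0:
        if text[pos] == "{":
            depth += 1
        elif text[pos] == "}":
            depth -= 1
        pos += 1
    # pos now points just past the table's closing brace.
    return dict(ENTRY.findall(text[body_start:pos - 1]))


def to_properties(names: Dict[str, str]) -> str:
    """Render the names as sorted `key = value` lines."""
    return "".join(f"{code} = {name}\n" for code, name in sorted(names.items()))


if __name__ == "__main__":
    print(to_properties(extract_table(SAMPLE, "Languages")), end="")
```

A production script would also need to handle nested or aliased entries and comments in the ICU files, which this sketch ignores.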
Kekoa - is it something you'd be interested in?
Flags: needinfo?(kekoariggin)
Comment 5•7 years ago
This looks like something I can work on. I do have a couple questions:
- What are the files that need to be processed and where can I find them?
- What output files are we hoping to have? Am I understanding correctly that we need one file per locale per data file?
Flags: needinfo?(kekoariggin) → needinfo?(gandalf)
Reporter | ||
Comment 6•7 years ago
I filed bug 1433694 for this particular script. Let's take it there.
Flags: needinfo?(gandalf)
Reporter | ||
Comment 7•7 years ago
Taking the conversation about the idea of moving some of our l10n data to use CLDR out of bug 1433694:
Gerv in bug 1433694 comment 5 asked:
> there would need to be a very good reason to reopen that discussion. Are you suggesting there is one?
I believe so. We're unifying all of our Internationalization around ICU and CLDR. There are many benefits of doing that, but three main ones are:
1) They already have a lot of data and algorithms that we get for free if we use them
2) They're well maintained by many industry players rather than us trying to maintain our own dataset
3) CLDR/ICU is what's used in MacOS, iOS, Windows and Android at least. I expect all new major systems to use it as well.
This means that by using CLDR here we get consistency for users between what they see in the date/time/regional preferences of their OS and in Firefox.
Since we don't maintain our own Regional Preferences panel, the closer we get to the underlying OS data model, the easier it is for us to just use their settings thus creating a way for users to customize their intl settings once (in the OS) and have Firefox just pick it up.
For example, MacOS locale selection uses ICU methods directly, showing all locale names in CLDR format. When the user then goes to Fx preferences to select the Firefox locale, having the same data shown in the same way would make it easier for the user to scan and find their locales.
In particular in the context of the potential switch of language/region names there are two main values I see:
1) We lower the entry barrier and localization burden on our localizers. Those are hundreds of strings that we can get for free out of CLDR, and we can join the cooperative effort with other industry players by upstreaming our improvements rather than maintaining the data on our own.
2) We guarantee consistency between our Intl APIs and our L10n data. Otherwise, if any Intl API returns region/language names, it will use CLDR, while our L10n will show different names because they come from a different source.
I don't know if that's enough value, but I hope that it at least justifies holding a conversation about it.
Can I get your further feedback based on this explanation :gerv?
Flags: needinfo?(gerv)
Reporter | ||
Comment 8•7 years ago
> Mozilla uses uppercase (e.g. "Italiano" for Italian), CLDR uses lowercase ("italiano"). That would work in the middle of a sentence, not as stand-alone (current use in Mozilla) or at the beginning of a sentence.
Does it also "not work" in MacOS language selection preferences? If it does - let's investigate how they use CLDR to get the outcome we want and maybe we can use it as well.
Comment 9•7 years ago
They're uppercase in macOS Prefs.
Reporter | ||
Comment 10•7 years ago
They seem to use some capitalization - http://icu-project.org/apiref/icu4c/classicu_1_1Locale.html#acc4a8c21f19103503663cf6fcda9170d
I'll try to investigate how they decide on capitalization. Maybe we can use ICU just to decide on capitalization for standalone, while feeding it data from our own FTL files. I have high hopes that technically we can solve it. :)
I'd like to tackle comment 7 first to figure out whether it makes sense to try at all.
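To illustrate the stand-alone capitalization idea in its simplest form, here is a Python sketch that uppercases only the first character of a name. It is a deliberately naive stand-in: it does not reproduce ICU's display-context handling, and it ignores locale-specific casing rules (for example the Turkish dotted and dotless i). The function name is hypothetical.

```python
def capitalize_standalone(name: str) -> str:
    """Uppercase the first character and leave the rest untouched.

    `str.capitalize()` is avoided because it would lowercase the remainder.
    """
    return name[:1].upper() + name[1:]


if __name__ == "__main__":
    for lowercase_name in ("italiano", "français"):
        print(capitalize_standalone(lowercase_name))
```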
Comment 11•7 years ago
(In reply to Zibi Braniecki [:gandalf][:zibi] from comment #7)
> I don't know if that's enough value, but I hope that it at least
> justifies holding a conversation about it.
It's certainly enough to hold a conversation :-)
Let's start with a few clarifying questions:
* Are we talking here about country/region names, language names, or both?
* CLDR presumably provides the names in "all" languages? Does GENC, or does it only provide English names?
* Might it be an option to use the CLDR names entirely internally and in APIs, but the GENC names for display within Firefox?
* Might it be an option to not ship a list of names at all, but pick them up from the OS, so we can disclaim responsibility for them? Or would that be a coding nightmare?
* Do we know if any particular OS, e.g. OS X, uses the same list in every shipping version, or do they perhaps customize the list to local political preference?
* Is there any documentation anywhere about how CLDR chooses the names it chooses, and their position on the usual controversial topics?
* Is there a list somewhere of the differences (both additions/removals, and textual changes) between the CLDR country/region name list and/or the GENC country/region name list? Having this data would make it much easier to evaluate both how big of a change this would be, and also how politically sensitive.
Gerv
Flags: needinfo?(gerv)
Comment 12•7 years ago
I can try to answer some of them (the non-technical ones).
> * Are we talking here about country/region names, language names, or both?
I don't think Mozilla's list of languages comes from any shared source, unlike region names. But yes, the final goal would be to use CLDR for both.
> * CLDR presumably provides the names in "all" languages? Does GENC, or does
> it only provide English names?
It provides translations for the vast majority of our languages, and we plan to contribute back our translations where a language is not available.
As far as I can tell, GENC only provides a list of English names.
> * Is there a list somewhere of the differences (both additions/removals, and
> textual changes) between the CLDR country/region name list and/or the GENC
> country/region name list? Having this data would make it much easier to
> evaluate both how big of a change this would be, and also how politically
> sensitive.
Some territories are defined in GENC but not in CLDR
Missing region names in CLDR:
QM: Midway Islands
QS: Bassas da India
QU: Juan de Nova Island
QW: Wake Island
QX: Glorioso Islands
QZ: Akrotiri
XA: Ashmore and Cartier Islands
XB: Baker Island
XC: Coral Sea Islands
XD: Dhekelia
XE: Europa Island
XG: Gaza Strip
XH: Howland Island
XJ: Jan Mayen
XL: Palmyra Atoll
XM: Kingman Reef
XP: Paracel Islands
XQ: Jarvis Island
XR: Svalbard
XS: Spratly Islands
XT: Tromelin Island
XU: Johnston Atoll
XV: Navassa Island
XW: West Bank
Different names are mostly due to ordering, accents, use of '&' vs 'and', or abbreviations.
The real differences are: BQ, CZ, FK, HK, MM
Different values:
AG
CLDR: Antigua & Barbuda
Mozilla: Antigua and Barbuda
BA
CLDR: Bosnia & Herzegovina
Mozilla: Bosnia and Herzegovina
BL
CLDR: St. Barthélemy
Mozilla: Saint Barthelemy
BQ
CLDR: Caribbean Netherlands
Mozilla: Bonaire, Sint Eustatius, and Saba
BS
CLDR: Bahamas
Mozilla: Bahamas, The
CD
CLDR: Congo - Kinshasa
Mozilla: Congo (Kinshasa)
CG
CLDR: Congo - Brazzaville
Mozilla: Congo (Brazzaville)
CV
CLDR: Cape Verde
Mozilla: Cabo Verde
CZ
CLDR: Czechia
Mozilla: Czech Republic
FK
CLDR: Falkland Islands
Mozilla: Falkland Islands (Islas Malvinas)
FM
CLDR: Micronesia
Mozilla: Micronesia, Federated States of
GM
CLDR: Gambia
Mozilla: Gambia, The
GS
CLDR: South Georgia & South Sandwich Islands
Mozilla: South Georgia and South Sandwich Islands
HK
CLDR: Hong Kong SAR China
Mozilla: Hong Kong
HM
CLDR: Heard & McDonald Islands
Mozilla: Heard Island and McDonald Islands
KN
CLDR: St. Kitts & Nevis
Mozilla: Saint Kitts and Nevis
KP
CLDR: North Korea
Mozilla: Korea, North
KR
CLDR: South Korea
Mozilla: Korea, South
LC
CLDR: St. Lucia
Mozilla: Saint Lucia
MF
CLDR: St. Martin
Mozilla: Saint Martin
MM
CLDR: Myanmar (Burma)
Mozilla: Burma
MO
CLDR: Macau SAR China
Mozilla: Macau
PM
CLDR: St. Pierre & Miquelon
Mozilla: Saint Pierre and Miquelon
RE
CLDR: Réunion
Mozilla: Reunion
SH
CLDR: St. Helena
Mozilla: Saint Helena, Ascension, and Tristan da Cunha
ST
CLDR: São Tomé & Príncipe
Mozilla: Sao Tome and Principe
TC
CLDR: Turks & Caicos Islands
Mozilla: Turks and Caicos Islands
TF
CLDR: French Southern Territories
Mozilla: French Southern and Antarctic Lands
TT
CLDR: Trinidad & Tobago
Mozilla: Trinidad and Tobago
VC
CLDR: St. Vincent & Grenadines
Mozilla: Saint Vincent and the Grenadines
VG
CLDR: British Virgin Islands
Mozilla: Virgin Islands, British
VI
CLDR: U.S. Virgin Islands
Mozilla: Virgin Islands, U.S.
WF
CLDR: Wallis & Futuna
Mozilla: Wallis and Futuna
Full analysis is in bug 1416148.
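A comparison like the one above can be sketched in Python as follows. The two dictionaries hold only a handful of entries copied from the lists in this comment, and the normalization rules (strip accents, treat '&' as 'and', expand 'St.' to 'Saint') are assumptions chosen to separate cosmetic differences from substantive ones; they are not necessarily the rules used for the full analysis.

```python
import unicodedata
from typing import Dict, List, Tuple

# A few entries copied from the lists above, for illustration only.
CLDR = {
    "AG": "Antigua & Barbuda",
    "BL": "St. Barthélemy",
    "CZ": "Czechia",
    "RE": "Réunion",
}
MOZILLA = {
    "AG": "Antigua and Barbuda",
    "BL": "Saint Barthelemy",
    "CZ": "Czech Republic",
    "RE": "Reunion",
    "XW": "West Bank",
}


def normalize(name: str) -> str:
    """Remove accents and spelling conventions that are merely cosmetic."""
    decomposed = unicodedata.normalize("NFKD", name)
    unaccented = "".join(c for c in decomposed if not unicodedata.combining(c))
    return unaccented.replace("&", "and").replace("St.", "Saint").casefold()


def compare(
    cldr: Dict[str, str], mozilla: Dict[str, str]
) -> Tuple[List[str], List[str], List[str]]:
    """Return (missing in CLDR, cosmetic differences, substantive differences)."""
    missing = sorted(set(mozilla) - set(cldr))
    cosmetic: List[str] = []
    substantive: List[str] = []
    for code in sorted(set(cldr) & set(mozilla)):
        if cldr[code] == mozilla[code]:
            continue
        same_after_cleanup = normalize(cldr[code]) == normalize(mozilla[code])
        (cosmetic if same_after_cleanup else substantive).append(code)
    return missing, cosmetic, substantive


if __name__ == "__main__":
    print(compare(CLDR, MOZILLA))
```

Differences in word order (such as "Korea, North" vs "North Korea") would land in the substantive bucket here, so the output still needs a human review.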
Comment 13•7 years ago
(In reply to Francesco Lodolo [:flod] from comment #12)
> The real differences are: BQ, CZ, FK, HK, MM
Correction, there are a few more (some more significant than others): BQ, CZ, FM, FK, HK, MM, MO, SH
Comment 14•7 years ago
Full list of CLDR regions
https://github.com/unicode-cldr/cldr-localenames-full/blob/master/main/en/territories.json
CZ, FK, HK, MO have an alt-variant that matches Mozilla values.
The alt-variant for MM is Myanmar, while for Mozilla is Burma.
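If we wanted to keep the current Mozilla value for selected regions, one option is to prefer the alternate form where it exists. The Python sketch below assumes the JSON data exposes alternates under keys with an `-alt-variant` suffix next to the plain region code; the sample values are the ones quoted in this bug, and the function name and `prefer_variant` parameter are hypothetical.

```python
from typing import AbstractSet, Dict, Optional

# Sample shaped like a territories map; values are those quoted in this bug.
TERRITORIES = {
    "AG": "Antigua & Barbuda",
    "CZ": "Czechia",
    "CZ-alt-variant": "Czech Republic",
    "FK": "Falkland Islands",
    "FK-alt-variant": "Falkland Islands (Islas Malvinas)",
}


def display_name(
    code: str,
    territories: Dict[str, str],
    prefer_variant: AbstractSet[str] = frozenset(),
) -> Optional[str]:
    """Return the variant name for opted-in codes, else the default name."""
    if code in prefer_variant:
        variant = territories.get(f"{code}-alt-variant")
        if variant is not None:
            return variant
    return territories.get(code)


if __name__ == "__main__":
    keep_mozilla_value = {"CZ", "FK"}
    for region in ("AG", "CZ", "FK"):
        print(region, display_name(region, TERRITORIES, keep_mozilla_value))
```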
Comment 15•7 years ago
Side question, is there a way to browse GENC these days w/out adding a cert exception?
Reporter | ||
Comment 16•7 years ago
Thanks for the questions Gerv!
(In reply to Gervase Markham [:gerv] from comment #11)
> * Are we talking here about country/region names, language names, or both?
Both.
> * CLDR presumably provides the names in "all" languages? Does GENC, or does
> it only provide English names?
They provide the names in all languages, and the list is updated/maintained and extended by the CLDR work group with input from major contributors like Apple, Microsoft and Google but also open to external contributors.
As I mentioned before, CLDR aims to be the Wikipedia for intl data.
> * Might it be an option to use the CLDR names entirely internally and in
> APIs, but the GENC names for display within Firefox?
I'm not sure what you mean here by "CLDR names entirely internally".
If we believe any of the English names in CLDR to be wrong, I'd suggest we upstream our improvements to benefit all CLDR users.
> * Might it be an option to not ship a list of names at all, but pick them up
> from the OS, so we can disclaim responsibility for them? Or would that be a
> coding nightmare?
I'd say that it would be a very cumbersome procedure to try. We'd end up with varied levels of support per OS, with missing matches in places we cannot control.
Having a unified list of language names and regions as part of our toolkit localization dataset makes it much easier to build predictable UIs for selecting locales.
> * Do we know if any particular OS, e.g. OS X, uses the same list in every
> shipping version, or do they perhaps customize the list to local political
> preference?
I do not know that. I can ask.
All I know is that they use CLDR.
> * Is there any documentation anywhere about how CLDR chooses the names it
> chooses, and their position on the usual controversial topics?
http://cldr.unicode.org/translation/country-names
http://cldr.unicode.org/index/cldr-spec/picking-the-right-language-code
CLDR meets every week on Wednesday at 8am PST. I can ask any of the questions we may have directly to the WG if needed :)
Reporter | ||
Comment 17•7 years ago
In the case of French - bug 1434886 comment 3 - it seems that we make mistakes in our translations, while CLDR keeps the quality quite high.
Resetting NI on :gerv to help us move forward.
Flags: needinfo?(gerv)
Comment 18•7 years ago
(In reply to Zibi Braniecki [:gandalf][:zibi] from comment #17)
> In the case of French - bug 1434886 comment 3 - it seems that we make mistakes in
> our translations, while CLDR keeps the quality quite high.
The same was true for Italian (bug 1434854).
Comment 19•7 years ago
Sorry, guys - was at FOSDEM. (Sad not to see you there, gandalf?)
CLDR's list was not one of the options considered last time, so it's reasonable to reopen the question.
Looking down that list of diffs, it seems pleasingly minimal. None of my list of "dangerous" names appears on it, so presumably all of them are exactly the same in both lists.[0] Almost all of the changes are either neutral or improvements, with only HK, MM and MO needing further study.
HK and MO have the same change - the addition of "SAR China". This is a reasonable factual description of their status, so I hope it would not be objectionable.
MM changes "Burma" to "Myanmar (Burma)". The dispute here is whether the military regime has the legitimacy to change the name (both names have valid provenance in the local language) from Burma to Myanmar without a referendum. Including both but leading with the one the government has chosen seems just as fine a solution as any other.
https://en.wikipedia.org/wiki/Names_of_Myanmar
So yes, I approve of this change. We should make it and update the wiki page to match.[1] As always, let's stay out of the business of making country-specific tweaks to the list based on representations from citizens. That way lies madness.
Gerv
[0]
Kosovo
Tibet
Taiwan
Northern Cyprus
Chechnya
Catalonia
Kurdistan
ISIS
The West Bank / The Gaza Strip / Palestine
[1] https://wiki.mozilla.org/Lists_of_Countries_and_Regions
Flags: needinfo?(gerv)
Reporter | ||
Comment 20•5 years ago
This is now obsoleted by the Intl.DisplayNames API.
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → WONTFIX