Closed Bug 1431324 Opened 7 years ago Closed 4 years ago

Evaluate replacing some l10n data with CLDR

Categories

(Core :: Internationalization, enhancement, P4)

RESOLVED WONTFIX

People

(Reporter: zbraniecki, Unassigned)

Attachments

(1 file)

We currently have several places where we ask our localizers to translate entries that we already have via ICU/CLDR. In at least one case - languageNames[0] and regionNames[1] - it's a substantial amount of mundane strings (500) that are not very project specific. At the same time, without a need for those strings in the JS Intl API, carrying all of them for all locales in every Gecko is going to affect the bundle size unnecessarily. I'd like to propose a new approach to handling such data. In this approach, we'd be extracting parts of the CLDR data files into a format easily usable from within Firefox, treating those files as part of the language resources (i.e. packaging them into langpacks) and loading them alongside other language resources via L10nRegistry. The simplest approach to that would be to make the tool take CLDR data from ICU and produce .FTL equivalents of the regionNames and languageNames files, but since we don't need any human interactions here, and the entries are simple key-value pairs, we could make it faster by using a faster format like JSON, and possibly also lighter by using some other format that takes fewer bytes to encode the data (maybe even binary?). Then, I imagine that such a file (let's name it languageNames.json) would be available to be loaded via L10nRegistry and exposed for the special API to use. This way, we could sync our data with CLDR, stop maintaining our own database of language and region names, and decrease the burden on our localizers. I don't think this will happen anytime soon, as first we'll need to sync our data with CLDR data, develop the tools to extract CLDR data, and then write the APIs in Gecko, but I think it's a good time to start the discussion about it and possibly link it to other, related bugs. [0] https://searchfox.org/mozilla-central/source/toolkit/locales/en-US/chrome/global/languageNames.properties [1] https://searchfox.org/mozilla-central/source/toolkit/locales/en-US/chrome/global/regionNames.properties
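The two output shapes mentioned in this comment (Fluent messages versus a plain JSON key-value file) can be compared with a minimal sketch. Everything named here is hypothetical: the `language-name-` message ID prefix, the `toFtl` and `toJson` helpers, and the sample map are illustrations, not an existing Firefox or L10nRegistry format.

```typescript
// Hypothetical sketch: turn a code -> display-name map (as might be extracted
// from CLDR) into either Fluent (.ftl) messages or a compact JSON payload.
type NameMap = Record<string, string>;

// Builds one Fluent message per entry, e.g. `language-name-de = German`.
// The ID prefix is an illustrative choice, not an existing naming scheme.
// Values are written verbatim; names containing `{` would need escaping.
function toFtl(names: NameMap, prefix: string): string {
  const lines = Object.keys(names)
    .sort()
    .map((code) => `${prefix}${code.toLowerCase()} = ${names[code]}`);
  return lines.join("\n") + "\n";
}

// Serializes the same map as JSON without extra whitespace.
function toJson(names: NameMap): string {
  return JSON.stringify(names);
}

const sampleNames: NameMap = { pl: "Polish", de: "German" };
const ftl = toFtl(sampleNames, "language-name-");
const json = toJson(sampleNames);

console.log(ftl);
console.log(json);
```

The Fluent form keeps the data in the format localizers already know and, as noted later in the thread, would reuse existing language fallback; the JSON form is smaller and needs no Fluent parsing.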
Priority: -- → P4
Using stock ftl as file format would have the benefit of language fallback being already implemented, though? That would make up for partially existing data.
Andre - can you help me understand how we can get the CLDR data here? An example bit I'm trying to track is the entry for the language name "German", which in CLDR is present in http://www.unicode.org/cldr/charts/31/summary/en.html#194 or http://www.unicode.org/cldr/charts/31/summary/pl.html#192. I can't find such an entry or even a data file in our `intl/icu` directory, which makes me worried that maybe we don't even carry CLDR data in our repository. I know we didn't store raw CLDR, but I assumed we have CLDR via ICU and at least in the source repo we have access to all of it (and then we cut out things we don't need when we build). If that's not the case, I'm wondering what's the best way to approach the problem I'm trying to tackle here. Should we write a build system script that pulls CLDR data from github? Should we vendor in CLDR separately? I'd appreciate your thoughts and feedback.
Flags: needinfo?(andrebargull)
(In reply to Zibi Braniecki [:gandalf][:zibi] from comment #2)
> I can't find such entry or even data file in our `intl/icu` directory which makes me worried that maybe we don't even carry CLDR data in our repository? I know we didn't store raw CLDR, but I assumed we have CLDR via ICU and at least in the source repo we have access to all of it (and then we cut out things we don't need when we build).
We already cut down the data when we clone the ICU svn repository [1]. (At least I think that's the problem in this case, because [2] contains "niemiecki" and that file is part of the directories we wipe out.)
[1] https://searchfox.org/mozilla-central/rev/2031c0f517185b2bd0e8f6f92f9491b3410c1f7f/intl/update-icu.sh#37-50
[2] http://bugs.icu-project.org/trac/browser/trunk/icu4c/source/data/lang/pl.txt
Flags: needinfo?(andrebargull)
Depends on: 1416148
Thanks André! So, seems like we'd need a script that pulls data such as: http://bugs.icu-project.org/trac/browser/trunk/icu4c/source/data/lang/pl.txt and generates: https://searchfox.org/mozilla-central/source/toolkit/locales/en-US/chrome/global/languageNames.properties out of it. That seems reasonable. I guess we do similar things for intl data: - for timezones: https://searchfox.org/mozilla-central/source/intl/update-tzdata.sh - for ICU use: https://searchfox.org/mozilla-central/source/intl/update-icu.sh Kekoa - is it something you'd be interested in?
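A rough sketch of the conversion described in this comment follows. It rests on assumptions that are not confirmed in this bug: that entries in the ICU `lang/pl.txt` file sit inside a `Languages{ ... }` table as one `code{"name"}` pair per line, and that the target file takes plain `key = value` lines. The function names and the sample input are made up for illustration; the actual script is tracked in bug 1433694.

```typescript
// Sketch only. Assumes entries inside a `Languages{ ... }` table look like
// `de{"niemiecki"}`, one per line, with no nested tables in between.
function extractLanguageNames(icuText: string): Map<string, string> {
  const names = new Map<string, string>();
  let inLanguages = false;
  for (const rawLine of icuText.split("\n")) {
    const line = rawLine.trim();
    if (!inLanguages) {
      // Skip everything until the Languages table opens.
      if (line === "Languages{") inLanguages = true;
      continue;
    }
    if (line === "}") break; // end of the Languages table
    const match = /^([A-Za-z0-9_]+)\{"(.*)"\}$/.exec(line);
    if (match) names.set(match[1], match[2]);
  }
  return names;
}

// Emits `code = name` lines sorted by code (assumed .properties layout).
function toProperties(names: Map<string, string>): string {
  const lines = Array.from(names.keys())
    .sort()
    .map((code) => `${code} = ${names.get(code)}`);
  return lines.join("\n") + "\n";
}

// Hand-written sample in the assumed layout, not copied from ICU.
const plSample = [
  "pl{",
  "    Languages{",
  '        en{"angielski"}',
  '        de{"niemiecki"}',
  "    }",
  "}",
].join("\n");

const properties = toProperties(extractLanguageNames(plSample));
console.log(properties);
```

A real script would also have to handle nested tables, comments, and escapes in the ICU text format, which this sketch ignores.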
Flags: needinfo?(kekoariggin)
This looks like something I can work on. I do have a couple questions: - What are the files that need to be processed and where can I find them? - What output files are we hoping to have? Am I understanding correctly that we need one file per locale per data file?
Flags: needinfo?(kekoariggin) → needinfo?(gandalf)
Depends on: 1433694
I filed bug 1433694 for this particular script. Let's take it there.
Flags: needinfo?(gandalf)
Taking the conversation about the idea of moving some of our l10n data to use CLDR out of bug 1433694:
Gerv in bug 1433694 comment 5 asked:
> there would need to be a very good reason to reopen that discussion. Are you suggesting there is one?
I believe so. We're unifying all of our Internationalization around ICU and CLDR. There are many benefits of doing that, but three main ones are:
1) They already have a lot of data and algorithms that we get for free if we use them.
2) They're well maintained by many industry players rather than us trying to maintain our own dataset.
3) CLDR/ICU is what's used in MacOS, iOS, Windows and Android at least. I expect all new major systems to use it as well. This means that by using CLDR here we get consistency for the user between what they see in the date/time/regional preferences of their OS and in Firefox.
Since we don't maintain our own Regional Preferences panel, the closer we get to the underlying OS data model, the easier it is for us to just use their settings, thus creating a way for users to customize their intl settings once (in the OS) and have Firefox just pick it up. For example, MacOS locale selection directly uses ICU methods, showing all locale names in CLDR format. When the user then goes to Fx preferences to select the Firefox locale, having the same data shown in the same way would make it easier for the user to scan and find their locales.
In particular, in the context of the potential switch of language/region names, there are two main values I see:
1) We lower the entry barrier and localization burden on our localizers. Those are hundreds of strings that we can get for free out of CLDR, and we can join the cooperation effort with other industry players by upstreaming our improvements, rather than maintaining the data on our own.
2) We guarantee consistency between our Intl APIs and our L10n data. If any Intl API returns region/language names, it will use CLDR; if our L10n keeps its own source, it will show different names.
I don't know if that's enough of the value, but I hope that it at least justifies holding conversation about it. Can I get your further feedback based on this explanation, :gerv?
Flags: needinfo?(gerv)
> Mozilla uses uppercase (e.g. "Italiano" for Italian), CLDR uses lowercase ("italiano"). That would work in the middle of a sentence, not as stand-alone (current use in Mozilla) or at the beginning of a sentence.
Does it also "not work" in the MacOS language selection preferences? If it does work there, let's investigate how they use CLDR to get the outcome we want, and maybe we can use it as well.
They're uppercase in macOS Prefs.
They seem to use some capitalization - http://icu-project.org/apiref/icu4c/classicu_1_1Locale.html#acc4a8c21f19103503663cf6fcda9170d I'll try to investigate how they decide on capitalization. Maybe we can use ICU just to decide on capitalization for standalone, while feeding it data from our own FTL files. I have high hopes that technically we can solve it. :) I'd like to tackle comment 7 first to figure out if at all it makes sense to try.
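To make the stand-alone capitalization question concrete, here is a deliberately naive sketch that uppercases only the first code point of a name using locale-aware casing. It is a stand-in for illustration and not ICU's own capitalization logic; `capitalizeForStandalone` is a hypothetical helper name.

```typescript
// Naive sketch: uppercase the first code point of a display name so that a
// mid-sentence form like "italiano" can be shown stand-alone as "Italiano".
// This is an approximation, not how ICU decides on capitalization.
function capitalizeForStandalone(name: string, locale: string): string {
  const first = name.codePointAt(0);
  if (first === undefined) return name; // empty string: nothing to do
  // Code points above U+FFFF occupy two UTF-16 code units.
  const firstLength = first > 0xffff ? 2 : 1;
  return (
    name.slice(0, firstLength).toLocaleUpperCase(locale) +
    name.slice(firstLength)
  );
}

const standalone = capitalizeForStandalone("italiano", "it");
console.log(standalone);
```

Passing the locale matters because casing rules differ between languages (Turkish dotted and dotless i being the usual example), which is one reason to let a library with per-locale rules make this decision.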
(In reply to Zibi Braniecki [:gandalf][:zibi] from comment #7)
> I don't know if that's enough of the value, but I hope that it at least justifies holding conversation about it.
It's certainly enough to hold a conversation :-) Let's start with a few clarifying questions:
* Are we talking here about country/region names, language names, or both?
* CLDR presumably provides the names in "all" languages? Does GENC, or does it only provide English names?
* Might it be an option to use the CLDR names entirely internally and in APIs, but the GENC names for display within Firefox?
* Might it be an option to not ship a list of names at all, but pick them up from the OS, so we can disclaim responsibility for them? Or would that be a coding nightmare?
* Do we know if any particular OS, e.g. OS X, uses the same list in every shipping version, or do they perhaps customize the list to local political preference?
* Is there any documentation anywhere about how CLDR chooses the names it chooses, and their position on the usual controversial topics?
* Is there a list somewhere of the differences (both additions/removals, and textual changes) between the CLDR country/region name list and/or the GENC country/region name list? Having this data would make it much easier to evaluate both how big of a change this would be, and also how politically sensitive.
Gerv
Flags: needinfo?(gerv)
I can try to answer some of them (non technical ones)
> * Are we talking here about country/region names, language names, or both?
I don't think Mozilla's list of languages comes from any shared source, unlike region names. But yes, the final goal would be to use CLDR for both.
> * CLDR presumably provides the names in "all" languages? Does GENC, or does it only provide English names?
It provides translations for the vast majority of our languages, and we plan to contribute back our translations where a language is not available. As far as I can tell, GENC only provides a list of English names.
> * Is there a list somewhere of the differences (both additions/removals, and textual changes) between the CLDR country/region name list and/or the GENC country/region name list? Having this data would make it much easier to evaluate both how big of a change this would be, and also how politically sensitive.
Some territories are defined in GENC but not in CLDR.
Missing region names in CLDR:
QM: Midway Islands
QS: Bassas da India
QU: Juan de Nova Island
QW: Wake Island
QX: Glorioso Islands
QZ: Akrotiri
XA: Ashmore and Cartier Islands
XB: Baker Island
XC: Coral Sea Islands
XD: Dhekelia
XE: Europa Island
XG: Gaza Strip
XH: Howland Island
XJ: Jan Mayen
XL: Palmyra Atoll
XM: Kingman Reef
XP: Paracel Islands
XQ: Jarvis Island
XR: Svalbard
XS: Spratly Islands
XT: Tromelin Island
XU: Johnston Atoll
XV: Navassa Island
XW: West Bank
Different names are mostly due to ordering, accents, use of & vs 'and', or abbreviations. The real differences are: BQ, CZ, FK, HK, MM
Different values:
AG  CLDR: Antigua & Barbuda  Mozilla: Antigua and Barbuda
BA  CLDR: Bosnia & Herzegovina  Mozilla: Bosnia and Herzegovina
BL  CLDR: St. Barthélemy  Mozilla: Saint Barthelemy
BQ  CLDR: Caribbean Netherlands  Mozilla: Bonaire, Sint Eustatius, and Saba
BS  CLDR: Bahamas  Mozilla: Bahamas, The
CD  CLDR: Congo - Kinshasa  Mozilla: Congo (Kinshasa)
CG  CLDR: Congo - Brazzaville  Mozilla: Congo (Brazzaville)
CV  CLDR: Cape Verde  Mozilla: Cabo Verde
CZ  CLDR: Czechia  Mozilla: Czech Republic
FK  CLDR: Falkland Islands  Mozilla: Falkland Islands (Islas Malvinas)
FM  CLDR: Micronesia  Mozilla: Micronesia, Federated States of
GM  CLDR: Gambia  Mozilla: Gambia, The
GS  CLDR: South Georgia & South Sandwich Islands  Mozilla: South Georgia and South Sandwich Islands
HK  CLDR: Hong Kong SAR China  Mozilla: Hong Kong
HM  CLDR: Heard & McDonald Islands  Mozilla: Heard Island and McDonald Islands
KN  CLDR: St. Kitts & Nevis  Mozilla: Saint Kitts and Nevis
KP  CLDR: North Korea  Mozilla: Korea, North
KR  CLDR: South Korea  Mozilla: Korea, South
LC  CLDR: St. Lucia  Mozilla: Saint Lucia
MF  CLDR: St. Martin  Mozilla: Saint Martin
MM  CLDR: Myanmar (Burma)  Mozilla: Burma
MO  CLDR: Macau SAR China  Mozilla: Macau
PM  CLDR: St. Pierre & Miquelon  Mozilla: Saint Pierre and Miquelon
RE  CLDR: Réunion  Mozilla: Reunion
SH  CLDR: St. Helena  Mozilla: Saint Helena, Ascension, and Tristan da Cunha
ST  CLDR: São Tomé & Príncipe  Mozilla: Sao Tome and Principe
TC  CLDR: Turks & Caicos Islands  Mozilla: Turks and Caicos Islands
TF  CLDR: French Southern Territories  Mozilla: French Southern and Antarctic Lands
TT  CLDR: Trinidad & Tobago  Mozilla: Trinidad and Tobago
VC  CLDR: St. Vincent & Grenadines  Mozilla: Saint Vincent and the Grenadines
VG  CLDR: British Virgin Islands  Mozilla: Virgin Islands, British
VI  CLDR: U.S. Virgin Islands  Mozilla: Virgin Islands, U.S.
WF  CLDR: Wallis & Futuna  Mozilla: Wallis and Futuna
Full analysis is in bug 1416148.
(In reply to Francesco Lodolo [:flod] from comment #12) > The real differences are: BQ, CZ, FK, HK, MM Correction, there are a few more (some more significant than others): BQ, CZ, FM, FK, HK, MM, MO, SH
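The kind of comparison behind these lists can be sketched as follows. This is not the script used for the analysis in bug 1416148; the `normalize` and `compareRegionNames` helpers and the result shape are hypothetical, and the sample data is a handful of entries copied from the comment above. The normalization folds only case, accents, "&" versus "and", and the "St." abbreviation, so reordered names such as "Korea, North" would still be reported as substantive.

```typescript
type RegionNames = Record<string, string>;

// Folds differences the thread treats as cosmetic: case, accents,
// "&" vs "and", and the "St." abbreviation.
function normalize(name: string): string {
  return name
    .normalize("NFD") // split accented letters into base + combining mark
    .replace(/[\u0300-\u036f]/g, "") // drop the combining marks
    .toLowerCase()
    .replace(/&/g, "and")
    .replace(/\bst\./g, "saint")
    .replace(/\s+/g, " ")
    .trim();
}

interface Comparison {
  missingInCldr: string[]; // codes Mozilla has but CLDR lacks
  cosmetic: string[]; // names equal after normalization
  substantive: string[]; // names that still differ
}

function compareRegionNames(cldr: RegionNames, mozilla: RegionNames): Comparison {
  const result: Comparison = { missingInCldr: [], cosmetic: [], substantive: [] };
  for (const code of Object.keys(mozilla).sort()) {
    const cldrName = cldr[code];
    const mozillaName = mozilla[code];
    if (cldrName === undefined) {
      result.missingInCldr.push(code);
    } else if (cldrName === mozillaName) {
      continue; // identical, nothing to report
    } else if (normalize(cldrName) === normalize(mozillaName)) {
      result.cosmetic.push(code);
    } else {
      result.substantive.push(code);
    }
  }
  return result;
}

// Sample entries taken from the lists in this thread.
const cldrSample: RegionNames = {
  AG: "Antigua & Barbuda",
  CZ: "Czechia",
  LC: "St. Lucia",
  RE: "Réunion",
};
const mozillaSample: RegionNames = {
  AG: "Antigua and Barbuda",
  CZ: "Czech Republic",
  LC: "Saint Lucia",
  RE: "Reunion",
  XG: "Gaza Strip",
};

const comparison = compareRegionNames(cldrSample, mozillaSample);
console.log(comparison);
```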
Full list of CLDR regions https://github.com/unicode-cldr/cldr-localenames-full/blob/master/main/en/territories.json CZ, FK, HK, MO have an alt-variant that matches Mozilla values. The alt-variant for MM is Myanmar, while for Mozilla is Burma.
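If the alternative names were wanted for particular regions, a lookup with fallback could look like the sketch below. It assumes the alternative form is stored under a key named `<code>-alt-variant` next to the default entry; that key naming, the `pickTerritoryName` helper, and the sample object are assumptions for illustration, with values taken from this thread rather than read from the linked file.

```typescript
type Territories = Record<string, string>;

// Returns the alternative name when requested and present,
// otherwise falls back to the default entry for the code.
function pickTerritoryName(
  data: Territories,
  code: string,
  preferVariant: boolean
): string | undefined {
  if (preferVariant) {
    const variant = data[`${code}-alt-variant`]; // assumed key naming
    if (variant !== undefined) return variant;
  }
  return data[code];
}

// Sample values as described in this thread.
const territoriesSample: Territories = {
  AG: "Antigua & Barbuda",
  CZ: "Czechia",
  "CZ-alt-variant": "Czech Republic",
  MM: "Myanmar (Burma)",
  "MM-alt-variant": "Myanmar",
};

console.log(pickTerritoryName(territoriesSample, "CZ", true));
console.log(pickTerritoryName(territoriesSample, "CZ", false));
console.log(pickTerritoryName(territoriesSample, "AG", true));
```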
Side question, is there a way to browse GENC these days w/out adding a cert exception?
Thanks for the questions Gerv!
(In reply to Gervase Markham [:gerv] from comment #11)
> * Are we talking here about country/region names, language names, or both?
Both.
> * CLDR presumably provides the names in "all" languages? Does GENC, or does it only provide English names?
They provide the names in all languages, and the list is updated, maintained and extended by the CLDR work group with input from major contributors like Apple, Microsoft and Google, but it is also open to external contributors. As I mentioned before, CLDR aims to be the Wikipedia for intl data.
> * Might it be an option to use the CLDR names entirely internally and in APIs, but the GENC names for display within Firefox?
I'm not sure what you mean here by "CLDR names entirely internally". If we believe any of the English names in CLDR to be wrong, I'd suggest we upstream our improvements to benefit all CLDR users.
> * Might it be an option to not ship a list of names at all, but pick them up from the OS, so we can disclaim responsibility for them? Or would that be a coding nightmare?
I'd say that it would be a very cumbersome procedure to try. We'd end up with varied levels of support per OS, with missing matches in places we cannot control. Having a unified list of language names and regions as part of our toolkit localization dataset makes it much easier to build predictable UIs for selecting locales.
> * Do we know if any particular OS, e.g. OS X, uses the same list in every shipping version, or do they perhaps customize the list to local political preference?
I do not know that. I can ask. All I know is that they use CLDR.
> * Is there any documentation anywhere about how CLDR chooses the names it chooses, and their position on the usual controversial topics?
http://cldr.unicode.org/translation/country-names
http://cldr.unicode.org/index/cldr-spec/picking-the-right-language-code
CLDR meets every week on Wednesday at 8am PST. I can ask any of the questions we may have directly to the WG if needed :)
In the case of French - bug 1434886 comment 3 - it seems like we make mistakes in our translations and CLDR keeps the quality quite high. Resetting NI on :gerv to help us move forward.
Flags: needinfo?(gerv)
(In reply to Zibi Braniecki [:gandalf][:zibi] from comment #17)
> In case of french - bug 1434886 comment 3 - seems like we make mistakes in our translations and CLDR keeps the quality quite high.
The same was true for Italian (bug 1434854).
Sorry, guys - was at FOSDEM. (Sad not to see you there, gandalf?) CLDR's list was not one of the options considered last time, so it's reasonable to reopen the question. Looking down that list of diffs, it seems pleasingly minimal. None of my list of "dangerous" names appears on it, so presumably all of them are exactly the same in both lists.[0] Almost all of the changes are either neutral or improvements, with only HK, MM and MO needing further study. HK and MO have the same change - the addition of "SAR China". This is a reasonable factual description of their status, so I hope it would not be objectionable. MM changes "Burma" to "Myanmar (Burma)". The dispute here is whether the military regime has the legitimacy to change the name (both names have valid provenance in the local language) from Burma to Myanmar without a referendum. Including both but leading with the one the government has chosen seems just as fine a solution as any other. https://en.wikipedia.org/wiki/Names_of_Myanmar So yes, I approve of this change. We should make it and update the wiki page to match.[1] As always, let's stay out of the business of making country-specific tweaks to the list based on representations from citizens. That way lies madness.
Gerv
[0] Kosovo, Tibet, Taiwan, Northern Cyprus, Chechnya, Catalonia, Kurdistan, ISIS, The West Bank / The Gaza Strip / Palestine
[1] https://wiki.mozilla.org/Lists_of_Countries_and_Regions
Flags: needinfo?(gerv)

This is now obsoleted by the Intl.DisplayNames API.
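For context, a short sketch of how `Intl.DisplayNames` serves the same need from script. The `displayName` wrapper is a hypothetical helper; the names returned come from whatever locale data the JavaScript runtime ships, so results for less common locales can vary between environments.

```typescript
// Hypothetical wrapper around the standard Intl.DisplayNames API.
// Falls back to the raw code if the runtime has no name for it.
function displayName(
  locale: string,
  type: "language" | "region",
  code: string
): string {
  const names = new Intl.DisplayNames([locale], { type });
  return names.of(code) ?? code;
}

const germanInEnglish = displayName("en", "language", "de");
const germanyInEnglish = displayName("en", "region", "DE");

console.log(germanInEnglish); // "German"
console.log(germanyInEnglish); // "Germany"

// The Polish name discussed earlier in this bug; the result depends on the
// runtime carrying Polish locale data.
console.log(displayName("pl", "language", "de"));
```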

Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → WONTFIX