Open Bug 666662 Opened 9 years ago Updated 2 years ago

Implement master list of language subtags (language, script, region, variant, etc.)

Categories

(Core :: Internationalization, enhancement)

People

(Reporter: GPHemsley, Assigned: GPHemsley)

References

(Depends on 2 open bugs, Blocks 6 open bugs)

Details

(Whiteboard: [bcp47])

Attachments

(2 files, 5 obsolete files)

Our discussion for how to implement BCP 47 (see bug 356038) has called for a master list of language subtags, as culled from the IANA Language Subtag Registry [1]. This includes (but may not be limited to) subtags identifying language, script, region, and variant.

There is a resource on the web [2] that has already gone through the effort of extracting these different categories of subtag, and I am in the process of writing a Python script that can generate files using that data. (Once I get it to more than a proof of concept, I'll put it on the Web somewhere.)

There is a wiki page [3] detailing what our plan is for this, but feedback is welcome!

[1] http://www.iana.org/assignments/language-subtag-registry
[2] http://www.langtag.net/registries.html
[3] https://wiki.mozilla.org/User:GPHemsley/BCP_47
Blocks: 656750
Blocks: 666731
It seemed during the discussion that we ideally would like to have the master list outside of the locale, but something is stopping us from doing so in the short term. (Deficiencies in the l10n tools?) We were also in agreement that we can have two separate lists: language names that should be localized and language names that needn't be.

Would it be possible to have just the list of localizable names inside the locale and then have the other list outside the locale? Then, you can fall back to the unlocalized list if you can't find a language name in the localized list.
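
The fallback described here could look roughly like the following. This is only a sketch: the two dictionaries, their contents, and the function name `language_name` are made up for illustration and do not correspond to any existing Mozilla API.

```python
# Hypothetical data: the localized list is a small subset (here, for a German
# locale); the master list is the complete, unlocalized list.
LOCALIZED_NAMES = {"en": "Englisch", "fr": "Französisch"}
MASTER_NAMES = {"en": "English", "fr": "French", "ht": "Haitian Creole"}


def language_name(subtag, localized=LOCALIZED_NAMES, master=MASTER_NAMES):
    """Return the localized name if present, else the master-list name,
    else the bare subtag."""
    key = subtag.lower()
    if key in localized:
        return localized[key]
    return master.get(key, subtag)
```

With this shape, a locale only has to ship the names it actually translated, and everything else resolves through the master list.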

If we do go that route: Should the unlocalized list be the full list, or just the languages that aren't in the localized file?

Also, where would these files live?
Another thing:
Since we're writing all this stuff in JavaScript, should we implement the master list in JSON? Or is the (arbitrary?) key/value pair format necessary for other things?
Yeah, there's one hangup with l10n tools - how do we make it possible for localizers to translate additional language names that aren't in the "commonly-localized" list?  I don't see a way to do that easily with existing tools like Pootle or Narro.  

We could say "if you want to add additional translations, you'll have to open a UTF-8 editor and add them to your locale's languageNames.properties file".  If we say this, then things are much simpler: we'd just need to add a few new languages to languageNames.properties, and then add the master list outside of the locale (I'd say that this list should be the full list - one less file to touch as we inevitably add languages to languageNames.properties in the future). 

No idea where it should live.
(In reply to comment #3)
> We could say "if you want to add additional translations, you'll have to
> open a UTF-8 editor and add them to your locale's languageNames.properties
> file".  If we say this, then things are much simpler: we'd just need to add
> a few new languages to languageNames.properties, and then add the master
> list outside of the locale (I'd say that this list should be the full list -
> one less file to touch as we inevitably add languages to
> languageNames.properties in the future). 

Ah, yeah, good point. I was thinking we'd just pull it in for the en-US locale upon request and have everyone localize that additional string. But there shouldn't be any reason why any given locale shouldn't be able to decide which language names to localize.

As for the other types of subtags, I suppose that all regions should be localized for each locale. Scripts probably don't need to be too localized, but if we decide to, it'd probably be a two-file system like language names. Variants etc. probably aren't important enough to localize.
Attached file: Script to generate master lists (obsolete)
This is a newly-written Python script that generates the four master lists of subtags, based on the resources provided by LangTag.net. Just run it via command line and you will have 4 new .properties files in that same directory. (You'll also get similar output to the terminal.)

Some notes:
* There is information in the registry that is not easy to implement in a key/value sort of way, at least not without some discussion first. This includes information about some variants that are only valid when prefixed by a certain language tag combination.
* For some subtag instances (usually private use), subtag ranges are represented using two dots ('..'). For ease of localization, I did not attempt to expand them. However, this is something we'll have to take into account when we do the parsing within the browser.
* When a subtag has multiple names associated with it, this script arbitrarily picks the first one. Since this may cause problems for certain subtags, I've included code to allow for overriding such names. I've used Haitian Creole [ht] as an example (bug 531849). This code could also be used to override single names (e.g. when it's politically controversial).
* This script doesn't attempt to create a list of grandfathered tags because the list provided by LangTag.net does not include any information about which tags are redundant or otherwise deprecated. (It also doesn't attempt to create a list of redundant subtags for the obvious reason that they can already be generated from the other lists.)
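
As a rough illustration of the "first name wins, unless overridden" rule from the notes above (this is not the attached script; the input structure, `RECORDS`, `OVERRIDES`, and `properties_lines` are assumptions made up for this sketch):

```python
# Hypothetical input: subtag -> list of names, in the order the source lists them.
RECORDS = {
    "ht": ["Haitian", "Haitian Creole"],
    "ia": ["Interlingua (International Auxiliary Language Association)"],
    "qaa..qtz": ["Private use"],  # ranges are passed through unexpanded
}

# Hand-picked names for subtags where the first listed name is not the one we want.
OVERRIDES = {"ht": "Haitian Creole"}


def properties_lines(records, overrides):
    """Yield 'key = value' lines, one per subtag, sorted by subtag."""
    for subtag in sorted(records):
        names = records[subtag]
        name = overrides.get(subtag, names[0])
        if len(names) > 1:
            # Keep the alternatives visible to localizers as a comment.
            yield "# " + " / ".join(names)
        yield f"{subtag} = {name}"


if __name__ == "__main__":
    print("\n".join(properties_lines(RECORDS, OVERRIDES)))
```

The same `OVERRIDES` table can carry single-name replacements as well, since an override is applied whether or not the subtag has alternative names.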

I think that's about it. Feedback welcome!
Assignee: smontagu → gphemsley
Attachment #541620 - Flags: feedback?(l10n)
Attachment #541620 - Flags: feedback?(kscanne)
(In reply to comment #4) 
> As for the other types of subtags, I suppose that all regions should be
> localized for each locale. Scripts probably don't need to be too localized,
> but if we decide to, it'd probably be a two-file system like language names.
> Variants etc. probably aren't important enough to localize.

I'd like to make it possible to localize everything, if a team wants to.   A lot of us have already translated a lot of language names and script names as part of other projects, e.g. ISO 639 and ISO 15924 codes are translated through the GNU TP: http://translationproject.org/

And there are even some variants that I'd want to translate, like the Cornish language orthographies.

Two-file system is fine with me in all cases.   "Commonly-localized" scripts could be very small indeed, and commonly-localized variants could start out empty even.
(In reply to comment #6)
> (In reply to comment #4) 
> > As for the other types of subtags, I suppose that all regions should be
> > localized for each locale. Scripts probably don't need to be too localized,
> > but if we decide to, it'd probably be a two-file system like language names.
> > Variants etc. probably aren't important enough to localize.
> 
> I'd like to make it possible to localize everything, if a team wants to.   A
> lot of us have already translated a lot of language names and script names
> as part of other projects, e.g. ISO 639 and ISO 15924 codes are translated
> through the GNU TP: http://translationproject.org/
> 
> And there are even some variants that I'd want to translate, like the
> Cornish language orthographies.
> 
> Two-file system is fine with me in all cases.   "Commonly-localized" scripts
> could be very small indeed, and commonly-localized variants could start out
> empty even.

Fair enough. The script, as it stands, never made that decision, so we don't have to change anything.
Attachment #541620 - Flags: feedback?(kscanne) → feedback+
Script looks fine to me.  We might want to leave out the Private Use ranges from the .properties files.   And there's an extra space in the language name "ia" (can someone explain why this is the one line in the registry which breaks the nice clean syntax??)

Eventually might be worth parsing the registry itself instead of relying on a third party.  I can take a crack at that if you think it's worth it.
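
For reference, parsing the registry directly might look something like this. The registry uses a record-jar layout: records separated by `%%` lines, `Field: value` pairs, and folded continuation lines that start with whitespace. The sample text below is an abbreviated, illustrative excerpt rather than a verbatim copy, and `parse_registry` is a hypothetical helper name.

```python
SAMPLE = """\
File-Date: 2011-08-25
%%
Type: language
Subtag: ht
Description: Haitian
Description: Haitian Creole
Suppress-Script: Latn
%%
Type: language
Subtag: ia
Description: Interlingua (International Auxiliary Language
  Association)
"""


def parse_registry(text):
    """Parse record-jar text into a list of dicts mapping field name -> list of values."""
    records = []
    for chunk in text.split("\n%%\n"):
        record = {}
        last = None  # most recent field name, used for folded lines
        for line in chunk.splitlines():
            if not line.strip():
                continue
            if line[0] in " \t" and last is not None:
                # Folded line: append to the previous value of the same field.
                record[last][-1] += " " + line.strip()
                continue
            name, _, value = line.partition(":")
            last = name.strip()
            record.setdefault(last, []).append(value.strip())
        if record:
            records.append(record)
    return records
```

Keeping every field as a list preserves repeated fields such as multiple `Description` lines, which is exactly the information a third-party extract may flatten.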
Also, IANA appears to use a forward slash in many language names (some should properly be |, but whatever) so that might not be the best delimiter to use when there are multiple names for the same language.   Semicolon looks ok for now.
One more detail - some languages have leading '=' in the name (click languages) - this may cause problems with whatever tools parse the .properties files.
(In reply to comment #8)
> Script looks fine to me.  We might want to leave out the Private Use ranges
> from the .properties files.

Ah, yeah, I think that's something that's come up in the past. I'd say that's probably up to Axel to decide. But if we do that, there may be other, less clear-cut things that we'll want to take out, too. (Things like the scripts that start with 'Z', for example.)

> And there's an extra space in the language
> name "ia" (can someone explain why this is the one line in the registry
> which breaks the nice clean syntax??)

Yeah, I know. That's an error in the original registry, and it bugs me every time I look at it. The worst part is, that whole parenthetical isn't really strictly necessary, except that there are two different languages that have at one point gone by the name "Interlingua".

> Eventually might be worth parsing the registry itself instead of relying on
> a third party.  I can take a crack at that if you think it's worth it.

Perhaps, if we feel there's more information that we need. (I haven't taken a look at the Perl script you sent me yet.)

(In reply to comment #9)
> Also, IANA appears to use a forward slash in many language names (some
> should properly be |, but whatever) so that might not be the best delimiter
> to use when there are multiple names for the same language.   Semicolon
> looks ok for now.

That slash was used in the LangTag.net files. And actually, it's ' / ' (with spaces), so I don't think it threw anything off. (It's only carried over into our files in comments. We don't use multiple names anywhere else.)

(In reply to comment #10)
> One more detail - some languages have leading '=' in the name (click
> languages) - this may cause problems with whatever tools parse the
> .properties files.

Yeah, I'm not really sure how any of that works. Are there no click languages in the files currently in use? We'll have to worry about all those special characters if they cause a problem, I guess. If anything, the "real" equals sign is surrounded by spaces, as well.
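
One way to see why the leading '=' need not be fatal: a parser that splits each line on the first '=' only keeps any later '=' as part of the value. The sketch below is a simplified stand-in, not the parser used by Gecko, Pootle, or Narro, whose behavior would still need to be checked; `parse_property_line` and the sample name are illustrative.

```python
def parse_property_line(line):
    """Split 'key = value' on the first '=' only.

    Returns (key, value), or None for blank lines, comments, and lines
    without a separator.
    """
    stripped = line.strip()
    if not stripped or stripped[0] in "#!":
        return None
    key, sep, value = stripped.partition("=")
    if not sep:
        return None
    return key.strip(), value.strip()
```

For example, `parse_property_line("huc = =/Hua")` yields the key `huc` and the value `=/Hua`, because only the first '=' is treated as the separator.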
Blocks: 556237
No longer depends on: 556237
So, I'm thinking the master list files will live at

intl/locale/src/*Names.properties

and the localized files will live at 

toolkit/locales/en-US/chrome/global/*Names.properties

where * = (language|script|region|variant).

This leaves all the legacy files in place while allowing us to build on top of them. Then each can be removed or converted when the time is right.

In particular, languageNames.properties and regionNames.properties already exist within en-US, so they would be joined by scriptNames.properties and variantNames.properties with whatever names are common. Then the master lists (as generated by the script in attachment 541620) would live at those same four names in intl/locale/src/.

I think we'll have to re-sort and reformat the two existing files in en-US. We'll also have to update regionNames to accommodate the new regions and the numbered regions. I think we decided that the full regionNames would be localized, with only a handful of scripts and perhaps zero variants by default. The list of languageNames would be as we decided in bug 356038 and/or on the wiki, though I do think that 2-char and 3-char codes should be sorted separately.

Thoughts?
For those wondering, we're working on the code here:
https://github.com/GPHemsley/BCP47/

(Output files are also in the repository.)
Attachment #541620 - Attachment is obsolete: true
Attachment #541620 - Flags: feedback?(l10n)
This diff only adds new files to intl/locale/src/.

These are the new master lists, including the subtags for language, extlang, script, region, and variant. There are also separate files for deprecated codes (will have various uses later) and Suppress-Script for the language subtags that have them (will be useful for bug 556237).
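
To illustrate what the Suppress-Script data is for: when a tag's script subtag is the one its language is normally written in, the script can be dropped. The following is a minimal sketch under several assumptions: the mapping is a tiny illustrative excerpt, `drop_redundant_script` is a hypothetical name, and extlang subtags (which may sit between language and script) are ignored.

```python
# Illustrative excerpt of a Suppress-Script list (language subtag -> script subtag).
SUPPRESS_SCRIPT = {"en": "Latn", "ru": "Cyrl", "ht": "Latn"}


def drop_redundant_script(tag, suppress=SUPPRESS_SCRIPT):
    """Remove the script subtag when it matches the language's Suppress-Script.

    Assumes the script, if present, is the second subtag (no extlang).
    """
    subtags = tag.split("-")
    language = subtags[0].lower()
    if len(subtags) > 1 and len(subtags[1]) == 4 and subtags[1].isalpha():
        if subtags[1].title() == suppress.get(language):
            del subtags[1]
    return "-".join(subtags)
```

Under these assumptions `en-Latn-US` becomes `en-US`, while `ru-Latn` is left alone because Latn is not the suppressed script for `ru`.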
Attachment #543559 - Flags: review?(smontagu)
Attachment #543559 - Flags: review?(l10n)
Updating the patch to include the overrides for controversial region names, as well as additional files that include information about the scope of language and extlang subtags (macrolanguage, collection, etc.).
Attachment #543559 - Attachment is obsolete: true
Attachment #545051 - Flags: review?(smontagu)
Attachment #545051 - Flags: review?(l10n)
Attachment #543559 - Flags: review?(smontagu)
Attachment #543559 - Flags: review?(l10n)
Oh, I forgot to mention: Simon expressed concern about the number of comments in these files, as he was unsure how they would affect startup or first-read time.

I included them because of their usefulness to a human reader (e.g. a localizer), but it's trivial for me to regenerate these files without the comments.
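
Stripping the comments from an already-generated file is also straightforward. A minimal sketch, assuming one key/value pair per line with no backslash continuations (`strip_comments` is a hypothetical helper, not part of the generator script):

```python
def strip_comments(lines):
    """Return only the key/value lines, dropping comments and blank lines."""
    kept = []
    for line in lines:
        stripped = line.strip()
        # .properties comments start with '#' or '!'.
        if stripped and not stripped.startswith(("#", "!")):
            kept.append(stripped)
    return kept
```

This would let the commented files stay in the repository for human readers while a build step ships the stripped version, if startup cost turns out to matter.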

Axel, what are your thoughts?
(In reply to comment #15)
> Updating the patch to include the overrides for controversial region names,

I should note for the record that the choices made here were discussed in detail on the newsgroups.

My rationale for these choices is here:
http://groups.google.com/group/mozilla.dev.l10n/msg/23689b626b9ec1d7
This drops in on top of the current locations of the localizable lists, adding the (empty) files for the additional subtag types.

This will likely change before the end, but this is what we have as of now.
Attachment #545544 - Flags: review?(smontagu)
Attachment #545544 - Flags: review?(l10n)
Attachment #545544 - Attachment description: Add localizable lists of names → Part 2: Add localizable lists of names
For what it's worth, these two patches are in the latest BCP47 nightly:
http://ftp.mozilla.org/pub/mozilla.org/firefox/nightly/latest-maple/

And because part 2 adds a bunch more language names, it fixes bug 531849, bug 535422, and bug 586085 (among others, I'm sure).
Blocks: 531849, 535422, 586085
Comment on attachment 545051
Part 1: Add master lists of subtags from IANA Language Subtag Registry. (v2)

Review of attachment 545051:
-----------------------------------------------------------------

::: intl/locale/src/regionNames.properties
@@ +358,5 @@
> +
> +VU = Vanuatu
> +WF = Wallis and Futuna
> +WS = Samoa
> +XA..XZ = Private use

This one looks funky?

::: intl/locale/src/scriptNames.properties
@@ +75,5 @@
> +Hant = Han (Traditional variant)
> +Hebr = Hebrew
> +Hira = Hiragana
> +Hmng = Pahawh Hmong
> +Hrkt = (alias for Hiragana + Katakana)

Do we have a better string for this and the other (alias for ..)?
(In reply to comment #20)
> Comment on attachment 545051
> Part 1: Add master lists of subtags from IANA Language Subtag Registry. (v2)
> 
> Review of attachment 545051:
> -----------------------------------------------------------------
> 
> ::: intl/locale/src/regionNames.properties
> @@ +358,5 @@
> > +
> > +VU = Vanuatu
> > +WF = Wallis and Futuna
> > +WS = Samoa
> > +XA..XZ = Private use
> 
> This one looks funky?

Yeah, ranges are represented by <start code>..<end code>, and are usually used by private use subtags, which tend to be contiguous.

We haven't yet figured out how we want to handle either private use subtags or subtag ranges, so they're currently listed as-is. 
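
If we do decide to expand ranges at parse time, it could be done along these lines. This is a sketch only: `expand_range` is a hypothetical helper, it handles purely alphabetic endpoints of equal length, and it reuses the capitalization pattern of the start code.

```python
import string


def expand_range(spec):
    """Expand 'XA..XZ' into ['XA', ..., 'XZ'].

    A spec without '..' is returned as a one-item list.
    """
    if ".." not in spec:
        return [spec]
    start, end = spec.split("..")
    if len(start) != len(end):
        raise ValueError(f"mismatched range endpoints: {spec!r}")

    def to_number(code):
        # Read the code as a base-26 number (a=0 ... z=25).
        n = 0
        for ch in code.lower():
            n = n * 26 + string.ascii_lowercase.index(ch)
        return n

    def to_code(n, template):
        letters = []
        for _ in template:
            n, digit = divmod(n, 26)
            letters.append(string.ascii_lowercase[digit])
        letters.reverse()
        # Reapply the capitalization pattern of the range's start code.
        return "".join(
            ch.upper() if t.isupper() else ch for ch, t in zip(letters, template)
        )

    return [to_code(n, start) for n in range(to_number(start), to_number(end) + 1)]
```

An alternative that avoids expansion entirely is to keep the range as a pair of endpoints and test membership by comparison, which keeps the data small if the private-use entries stay in the files.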

> ::: intl/locale/src/scriptNames.properties
> @@ +75,5 @@
> > +Hant = Han (Traditional variant)
> > +Hebr = Hebrew
> > +Hira = Hiragana
> > +Hmng = Pahawh Hmong
> > +Hrkt = (alias for Hiragana + Katakana)
> 
> Do we have a better string for this and the other (alias for ..)?

These are the names listed in the IANA registry, but I did recently come across some names from Unicode that are a little more helpful:
http://unicode.org/iso15924/iso15924-codes.html

Shall I override them and use the names listed there? And should I keep or drop the parenthetical?
Blocks: 684335
Updated the master list to reflect the 2011-08-25 version of the registry, as well as a number of other improvements pertaining to the names associated with various subtags. (I believe this is the first time the changes for controversial region names have been included in the patch, for example.)
Attachment #545051 - Attachment is obsolete: true
Attachment #562172 - Flags: review?(smontagu)
Attachment #562172 - Flags: review?(l10n)
Attachment #545051 - Flags: review?(smontagu)
Attachment #545051 - Flags: review?(l10n)
(In reply to Gordon P. Hemsley [:gphemsley] from comment #22)
> as well as a number of other improvements pertaining to the names associated
> with various subtags. (I believe this is the first time the changes for
> controversial region names have been included in the patch, for example.)

Oh, nope. They were in v2, as well. But this does update Macedonia to emphasize that the 'f' in 'former' is lowercase and changes the spelling for 'ug' to the more common 'Uyghur'. And it addresses Axel's concern for 'Hrkt', among the other changes involved in the updated registry version.
Admittedly, the reason this is in a separate patch is because I forgot to add the files when I created the previous patch. However, it might be that these files are less important, so they can be reviewed separately, I think.

Scope refers to whether a subtag represents a macrolanguage, etc. It might not be something that the browser needs to know when parsing content already on the Web.
Attachment #562175 - Flags: review?(smontagu)
Attachment #562175 - Flags: review?(l10n)
This is an updated list of localizable names, which is a subset of the master list.

The list of language subtags is culled from a number of current sources, including Mozilla l10n teams, Google locales, and Wikipedia language versions.

The list of region subtags is actually identical to the master list, as it was felt that the list was small and all regions are relevant.

The lists of script, variant, and extlang subtags are empty for now, as it was felt that none of them were particularly important to be localized.

However, the creation of the files opens up the possibility of adding to the localizable lists in the future (perhaps to include common scripts, like Latn, Cyrl, Arab, etc.).
Attachment #545544 - Attachment is obsolete: true
Attachment #562176 - Flags: review?(smontagu)
Attachment #562176 - Flags: review?(l10n)
Attachment #545544 - Flags: review?(smontagu)
Attachment #545544 - Flags: review?(l10n)
I'm starting to worry about the data size here. How much does this add to download size and to disk size, in particular for Fennec?

Also, there's no users for this data yet, nor build infra to ship it, AFAICT? I'm wondering if we would be better off to pull that data into a sqlite database we ship.
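
For the sake of discussion, a sqlite-backed store could be as small as one table. The sketch below uses Python's standard `sqlite3` module purely to show the shape of the data; the table layout, `build_database`, and `lookup` are assumptions, and nothing here reflects how Gecko would actually ship or query such a file.

```python
import sqlite3

# Hypothetical rows: (subtag type, subtag, English name).
ROWS = [
    ("language", "ht", "Haitian Creole"),
    ("region", "VU", "Vanuatu"),
    ("script", "Hebr", "Hebrew"),
]


def build_database(rows, path=":memory:"):
    """Create a subtags table at `path` and fill it with the given rows."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE subtags ("
        " type TEXT NOT NULL,"
        " subtag TEXT NOT NULL COLLATE NOCASE,"
        " name TEXT NOT NULL,"
        " PRIMARY KEY (type, subtag))"
    )
    conn.executemany("INSERT INTO subtags VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn


def lookup(conn, subtag_type, subtag):
    """Return the name for a subtag, or None if it is not in the table."""
    row = conn.execute(
        "SELECT name FROM subtags WHERE type = ? AND subtag = ?",
        (subtag_type, subtag),
    ).fetchone()
    return row[0] if row else None
```

Declaring the subtag column `COLLATE NOCASE` makes lookups case-insensitive, which sidesteps the question of how keys are capitalized in the source files.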

Last resort would be to put the majority of this data into a webservice, and load it on demand. That'd look wrong on the first run, perhaps (depending on how we code the clients), but it'd be much more lightweight.

CCing Taras to get some input on how we can ship data effectively.

For the l10n aspects of this, I'd prefer to not change the region names to be caps, that might be nice style, but we'd throw all our existing data away. That said, I wonder if we can replicate some of the algorithms to create corresponding data (with our special cases excluded) out of cldr?
Blocks: 573320
(Apologies for the delay. I've been rather busy with school.)

(In reply to Axel Hecht [:Pike] from comment #26)
> I'm starting to worry about the data size here. How much does this add to
> download size and to disk size, in particular for Fennec?

I don't know, offhand. The patch itself is ~200 KB. How much is too much?

> Also, there's no users for this data yet, nor build infra to ship it,
> AFAICT? I'm wondering if we would be better off to pull that data into a
> sqlite database we ship.
> 
> Last resort would be to put the majority of this data into a webservice, and
> load it on demand. That'd look wrong on the first run, perhaps (depending on
> how we code the clients), but it'd be much more lightweight.

That's something we can discuss. Keep in mind there are two parts to this bug: one part creates the master list that currently nothing is using (though the idea is to switch everything over to it, piece by piece, once the dust settles), but the other part is bringing the existing lists up to date so that, e.g., spellcheck dictionary authors can actually provide resources with the name of the language included, instead of having to resort to ugly workarounds or defaulting to the bare language code.

> CCing Taras to get some input on how we can ship data effectively.

Still waiting on this?

> For the l10n aspects of this, I'd prefer to not change the region names to
> be caps, that might be nice style, but we'd throw all our existing data
> away.

There is very little existing data that is useful, IMO. The existing file has been updated haphazardly, and manually, over the years, and I think the benefit of having a standardized and automatically-generated file going forward outweighs this one-time cost.

> That said, I wonder if we can replicate some of the algorithms to
> create corresponding data (with our special cases excluded) out of cldr?

Could you elaborate on this? Kevin and I had discussed using CLDR for certain things, but I'm not too familiar with it, so I don't know what data it contains or how we might be able to use it. It was my impression that we may be able to use the data to prepopulate (or source) translations, but I don't see how that relates to the initial list.
(In reply to Gordon P. Hemsley [:gphemsley] from comment #27)
> (In reply to Axel Hecht [:Pike] from comment #26)
> > For the l10n aspects of this, I'd prefer to not change the region names to
> > be caps, that might be nice style, but we'd throw all our existing data
> > away.
> 
> There is very little existing data that is useful, IMO. The existing file
> has been updated haphazardly, and manually, over the years, and I think the
> benefit of having a standardized and automatically-generated file going forward
> outweighs this one-time cost.

I'm just realizing now that you meant the actual localizations, not the version control history. >_<

Is there a valid use case for having case-sensitivity in the localization tools? Or can we somehow update the tools to automatically rename keys to match the current case?
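
The automatic renaming could be a one-time migration that matches keys case-insensitively against the master list. A minimal sketch, assuming each file is already loaded into a dictionary (`rekey_to_master` is a hypothetical helper, not an existing l10n tool feature):

```python
def rekey_to_master(localized, master_keys):
    """Rename localized keys to the master list's casing.

    Returns (renamed, unmatched): entries whose key matched a master key
    ignoring case, and entries with no counterpart in the master list.
    """
    canonical = {key.lower(): key for key in master_keys}
    renamed, unmatched = {}, {}
    for key, value in localized.items():
        target = canonical.get(key.lower())
        if target is None:
            unmatched[key] = value
        else:
            renamed[target] = value
    return renamed, unmatched
```

Returning the unmatched entries separately means no existing translation is silently dropped; a localizer can review those by hand.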
Depends on: 716321
No longer blocks: 531849, 535422, 586085, 684335
Comment on attachment 562176 [details] [diff] [review]
Part 2: Add localizable lists of names (v3)

I've opened bug 716321 to discuss changes to the existing files separately.
Attachment #562176 - Attachment is obsolete: true
Attachment #562176 - Flags: review?(smontagu)
Attachment #562176 - Flags: review?(l10n)
Blocks: 716377
(In reply to Gordon P. Hemsley [:gphemsley] from comment #27)
> (In reply to Axel Hecht [:Pike] from comment #26)
> > That said, I wonder if we can replicate some of the algorithms to
> > create corresponding data (with our special cases excluded) out of cldr?
> 
> Could you elaborate on this? Kevin and I had discussed using CLDR for
> certain things, but I'm not too familiar with it, so I don't know what data
> it contains or how we might be able to use it. It was my impression that we
> may be able to use the data to prepopulate (or source) translations, but I
> don't see how that relates to the initial list.

I've opened bug 716377 to discuss using the data in CLDR.
Depends on: 705542
Attachment #562172 - Flags: review?(smontagu)
Attachment #562172 - Flags: review?(l10n)
Attachment #562175 - Flags: review?(smontagu)
Attachment #562175 - Flags: review?(l10n)
Is that more or less what we did with LocaleService::Locale? Is there anything left in this bug?
Flags: needinfo?(jfkthame)
I'm not sure (or don't recall) what all the potential use-cases here might be, but I don't think this is necessarily addressed yet. Does LocaleService provide lists of valid tags? I don't recall an API that would do that... e.g. so that we could present a list of all the valid language tags for use in accept-languages, or all the script subtags that could be used in specifying font prefs.
Flags: needinfo?(jfkthame)