Closed Bug 716321 Opened 9 years ago Closed 13 days ago

Update existing list of language subtags to reflect more modern usage

Categories

(Core :: Internationalization, defect)

defect
Not set
normal

Tracking

()

RESOLVED INACTIVE

People

(Reporter: GPHemsley, Assigned: GPHemsley)

References

(Blocks 3 open bugs)

Details

(Whiteboard: [bcp47])

Attachments

(1 file, 4 obsolete files)

I'm spinning this off from bug 666662, because that bug requires a logistical discussion that should not block updating the existing list of language subtags, which is used by the existing language preference interface and spellcheck extension authors.

There are numerous bugs open requesting various changes to the existing lists, and this should supersede all of those. (In fact, if it doesn't then this bug should be updated.)

The updated list is created based on various sources that use language subtags, including Kevin's list of spellcheckers, as well as available localizations of Google, Wikipedia, and mozilla-aurora. (See the URL for the makefile which obtains this data.)
Attached patch Update languageNames.properties (obsolete)
This updates the list of language names to the most recent available information.

It sorts the 3-char subtags below the 2-char subtags, which explains some of the apparent deletions.
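The ordering described above can be sketched as follows. This is only an illustration, not the script that produced the patch: the function name and the sample subtags are assumptions chosen for the example. Sorting on a (length, subtag) key places every 2-character subtag ahead of every 3-character one, which is why entries appear to move (and look deleted) in the diff.

```python
def subtag_sort_key(subtag):
    """Sort by subtag length first, then alphabetically (hypothetical helper)."""
    return (len(subtag), subtag)


# Illustrative sample only; the real list is far longer.
subtags = ["haw", "de", "csb", "gd", "dsb", "ht"]
ordered = sorted(subtags, key=subtag_sort_key)
print(ordered)  # ['de', 'gd', 'ht', 'csb', 'dsb', 'haw']
```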

Using bug 399667 as precedent, it also excludes deprecated subtags, though they are sometimes used by the sources. The full list of deprecated subtags is available here: https://github.com/GPHemsley/BCP47/blob/master/languageDeprecated.properties
Attachment #586758 - Flags: review?(l10n)
Blocks: 724594
Blocks: 489404
Add some additional languages and remove parentheticals (which included various disambiguators) from language names.
Attachment #586758 - Attachment is obsolete: true
Attachment #586758 - Flags: review?(l10n)
Attachment #610346 - Flags: review?(l10n)
Add a couple more languages that have spellcheckers available.
Attachment #610346 - Attachment is obsolete: true
Attachment #610346 - Flags: review?(l10n)
Attachment #610440 - Flags: review?(l10n)
Blocks: 741842
This patch updates the list of region subtags to reflect more modern usage (addition of South Sudan, numeric region subtags, etc.). It also enables the test for numeric region subtags.
Assignee: smontagu → gphemsley
Attachment #612048 - Flags: review?(l10n)
Summary: Update existing list of language subtags to reflect more modern usage → Update existing list of language and region subtags to reflect more modern usage
Blocks: 705542
Axel, I don't have too much time to devote to this in the next few weeks, but it would be nice to be able to land it before the uplift of 14 to Aurora next week, especially given that the changes from bug 730209 are already present there. What's left to do here?
Comment on attachment 610440
Update languageNames.properties (v3)

Review of attachment 610440:
-----------------------------------------------------------------

Canceling the review, I can't r+ or r- this without understanding why this is the dataset.
Attachment #610440 - Flags: review?(l10n)
Comment on attachment 612048
Update regionNames.properties (v1)

Review of attachment 612048:
-----------------------------------------------------------------

The regionNames pose the same question, what's the data set, and why?

I find a footnote on http://de.wikipedia.org/wiki/ISO-3166-1-Kodierliste#cite_note-anm1-0 which claims that ea etc shouldn't be included, for example. Can't find a corresponding note in English, sorry.

Technically, I'd prefer if you didn't change whitespace. If you have to, don't align the '=', but just go consistently for ' = '. r- for the technical nit.
Attachment #612048 - Flags: review?(l10n) → review-
(In reply to Axel Hecht [:Pike] from comment #8)
> Comment on attachment 612048
> Update regionNames.properties (v1)
> 
> Review of attachment 612048:
> -----------------------------------------------------------------
> 
> The regionNames pose the same question, what's the data set, and why?
> 
> I find a footnote on
> http://de.wikipedia.org/wiki/ISO-3166-1-Kodierliste#cite_note-anm1-0 which
> claims that ea etc shouldn't be included, for example. Can't find a
> corresponding note in English, sorry.

Well, first off, I should remind you that we're implementing BCP 47, not ISO 3166. It is up to the curators of the IANA Language Subtag Registry to determine whether a particular ISO 3166 code is appropriate for use in a language tag. They have determined that certain reserved codes are appropriate and certain ones are not. (You'll note, for example, that reserved code 'UK' is not in this list, because 'GB' is the code that should be used.)

With that being said, this list is generated from the region subtags listed in the IANA Language Subtag Registry, with the deprecated and private use subtags removed. (It is actually debatable whether we want to exclude the deprecated subtags, but we made the decision to do so.)
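As a rough sketch of that filtering step (this is not the actual get_subtags.py; the function names are hypothetical and the embedded sample is a hand-written excerpt in the registry's record format, so its values should be treated as illustrative), the following keeps region records that are neither deprecated nor private use. It assumes records are separated by `%%` lines and ignores folded continuation lines and repeated fields, which the real registry does contain.

```python
# Hand-written sample in the registry's "Field: value" record format.
# Assumption: values here are illustrative, not copied from the live registry.
SAMPLE_REGISTRY = """\
Type: region
Subtag: AA
Description: Private use
%%
Type: region
Subtag: BU
Description: Burma
Deprecated: 1989-12-05
Preferred-Value: MM
%%
Type: region
Subtag: SS
Description: South Sudan
%%
Type: region
Subtag: 419
Description: Latin America and the Caribbean
%%
Type: language
Subtag: de
Description: German
"""


def parse_records(text):
    """Split registry text into dicts mapping field name to its first value."""
    records = []
    for block in text.split("%%"):
        record = {}
        for line in block.strip().splitlines():
            name, sep, value = line.partition(": ")
            # Lines without "Name: value" shape (e.g. folded lines) are skipped.
            if sep and name not in record:
                record[name] = value
        if record:
            records.append(record)
    return records


def current_regions(records):
    """Keep region subtags that are neither deprecated nor private use."""
    return {
        r["Subtag"]: r["Description"]
        for r in records
        if r.get("Type") == "region"
        and "Deprecated" not in r
        and r.get("Description") != "Private use"
    }


regions = current_regions(parse_records(SAMPLE_REGISTRY))
print(regions)
```

With this sample, only the South Sudan and 419 entries survive: AA is dropped as private use, BU as deprecated, and the language record is not a region at all.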

The files involved in generating this patch are here:
https://github.com/GPHemsley/BCP47/blob/master/get_subtags.py
https://github.com/GPHemsley/BCP47/blob/master/region.txt
https://github.com/GPHemsley/BCP47/blob/master/regionNames.properties
https://github.com/GPHemsley/BCP47/blob/master/regionDeprecated.properties
https://github.com/GPHemsley/BCP47/blob/master/makefile#L67
https://github.com/GPHemsley/BCP47/blob/master/regionNames-l10n.properties

> Technically, I'd prefer if you didn't change whitespace. If you have to,
> don't align the '=', but just go consistently for ' = '. r- for the
> technical nit.

Per BCP 47, a region subtag is either 2 letters or 3 digits. As such, I readjusted the whitespace to match the maximum possible length of a region subtag (instead of the seemingly arbitrary number that currently exists in the file).

If you'd like me to change it to a single space on either side, that's fine by me. Just know that the numerical entries won't be aligned with the alphabetical entries.
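To make the trade-off concrete, here is a small sketch under stated assumptions: the helper names are hypothetical, and the padded layout is only one guess at what "aligned to the maximum subtag length" means in the patch. The regular expression encodes the BCP 47 rule that a region subtag is two letters or three digits.

```python
import re

# BCP 47 region subtag: 2 letters or 3 digits.
REGION_RE = re.compile(r"(?:[A-Za-z]{2}|[0-9]{3})")


def is_region_subtag(subtag):
    """Return True if the whole string has the shape of a region subtag."""
    return REGION_RE.fullmatch(subtag) is not None


def format_aligned(key, name, width=3):
    """Pad the key to the longest possible region subtag (3 characters)."""
    return f"{key:<{width}} = {name}"


def format_plain(key, name):
    """Single space on either side of '=', with no alignment."""
    return f"{key} = {name}"


entries = [("ss", "South Sudan"), ("419", "Latin America and the Caribbean")]
for key, name in entries:
    print(format_aligned(key, name))
for key, name in entries:
    print(format_plain(key, name))
```

In the padded form the `=` signs line up for both 2-letter and 3-digit keys; in the plain form the 3-digit keys push their `=` one column to the right, which is the misalignment mentioned above.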
(In reply to Gordon P. Hemsley [:gphemsley] from comment #9)
> With that being said, this list is generated from the region subtags listed
> in the IANA Language Subtag Registry, with the deprecated and private use
> subtags removed. (It is actually debatable whether we want to exclude the
> deprecated subtags, but we made the decision to do so.)

I should also note that some of the English names have been overridden from the names that are listed in the registry.

The regions in question are here:
https://github.com/GPHemsley/BCP47/blob/master/get_subtags.py#L213

My original justification for these choices is here:
http://groups.google.com/group/mozilla.dev.l10n/browse_thread/thread/97d2dddb8db97248/1231aceeaf2cfc06

(Note: Some of the "renames" I justify in that thread merely involve reverting to the name used in the registry. The get_subtags.py script lists the manual overrides relative to the registry, not relative to the existing names in the Mozilla source.)
Returning the discussion about region subtags to bug 705542. Axel, please respond there.
No longer blocks: 705542
Summary: Update existing list of language and region subtags to reflect more modern usage → Update existing list of language subtags to reflect more modern usage
Attachment #612048 - Attachment is obsolete: true
Blocks: 709930
No longer blocks: 741842
(In reply to Axel Hecht [:Pike] from comment #7)
> Comment on attachment 610440
> Update languageNames.properties (v3)
> 
> Review of attachment 610440:
> -----------------------------------------------------------------
> 
> Canceling the review, I can't r+ or r- this without understanding why this
> is the dataset.

Even though Gordon is doing all the hard work on this stuff, I thought I'd jump in and at least explain the choice of languages in the patch.

The list consists of:
(1) currently localized language names
(2) languages for which there are existing Mozilla l10n efforts (== landed on mozilla-aurora)
(3) languages for which there are existing open source spell checkers
(4) languages with Wikipedias
(5) languages for which Google's search interface is localized

My original conservative proposal was (1)-(3). (3) is personally important to me since I've worked on several spell checkers for languages not on the list and the experience for users of these addons is broken. In particular, AMO reviewers have been unwilling to grant full reviews to these addons.
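The way the five sources combine can be sketched as a set union. The source names and the subtags assigned to each source below are made-up placeholders for illustration only; they do not reflect which languages actually appear in which source.

```python
# Placeholder data: the memberships below are invented for the example.
sources = {
    "localized_names": {"de", "gd", "ht"},   # (1)
    "mozilla_l10n": {"de", "gd", "csb"},     # (2)
    "spellcheckers": {"de", "haw", "hil"},   # (3)
    "wikipedias": {"de", "ht", "dsb"},       # (4)
    "google_ui": {"de", "haw"},              # (5)
}

# Conservative proposal: sources (1)-(3) only.
conservative = (
    sources["localized_names"]
    | sources["mozilla_l10n"]
    | sources["spellcheckers"]
)

# Full list: union of all five sources.
full = set().union(*sources.values())

# Languages that enter the list only because of sources (4) and (5).
added_by_4_and_5 = full - conservative

print(sorted(conservative))
print(sorted(full))
print(sorted(added_by_4_and_5))
```

Because the result is a plain union, a language needs to appear in only one source to be included, and extending the list with another source is a matter of adding one more set.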

We added (4) and (5) based on suggestions on the dev-l10n list; here's that thread:

https://groups.google.com/forum/?fromgroups#!msg/mozilla.dev.l10n/L4KF6mNTwRA/m7EQML0FlkUJ

Were we to go even bigger, the next natural set of languages to include would be the ones in CLDR, but that's another ~300 to add, and we decided that would be an unnecessary burden on localizers.

Hope this helps!
Rebase patch and include additional languages.
Attachment #610440 - Attachment is obsolete: true
Status: NEW → ASSIGNED
No longer blocks: 489404
To expedite the process for existing bugs on file, I've created individual patches for the following bugs:

* Bug 535422: Add support for Lower Sorbian [dsb].
* Bug 586085: Add support for Kashubian [csb], Hawaiian [haw], and Hiligaynon [hil]. 
* Bug 531849: Rename "Haitian" to "Haitian Creole" [ht].
* Bug 724594: Rename "Scots Gaelic" to "Scottish Gaelic" [gd].

I've also filed bug 788178 to remove a bunch of trailing whitespace from language.properties (the file that dictates what is displayed in the Languages preferences dialog list).

For these patches to apply most cleanly, they should be applied in the order they had in my patch queue: the whitespace patch first, then the rest of the bugs in the order they appear above (which, incidentally, is based on how long they've been on file, prioritizing new additions over renamings).
No longer blocks: 709930
No longer blocks: 586085
No longer blocks: 535422, 724594
Blocks: 829658
Blocks: 835074
Status: ASSIGNED → RESOLVED
Closed: 13 days ago
Resolution: --- → INACTIVE