Open Bug 356038 (bcp47) Opened 18 years ago Updated 2 years ago

[meta] BCP 47 (RFC 5646 and 4647; IANA Language Subtag Registry) support

Categories

(Core :: Internationalization, task, P3)

People

(Reporter: smontagu, Unassigned)

References

(Depends on 6 open bugs, Blocks 1 open bug)

Details

(Keywords: meta, Whiteboard: [bcp47] [See comment 34 and beyond for the latest discussion.])

Attachments

(2 files, 2 obsolete files)

RFC 4646 and 4647 are the replacements for RFC 3066
Alias: rfc4646
Attached patch Checkpoint wip (obsolete)
I've gone ahead and synched the previous WIP patch with mozilla-central (as it was previously based off of CVS/Mozilla Suite code) and the current IANA Language Subtag Registry. It contains full versions of languageNames.properties, regionNames.properties, and scriptNames.properties.

Those first two files could probably land immediately, independent of the rest of the BCP 47 refactoring, as their format is the same as what is currently used.

Still missing from the refactoring are the other properties files, which I've yet to create. (I'm using a custom-built PHP script to extract the data, and I'll be sure to post it here once it's in its final version, in order to simplify synching in the future.)

I haven't tested whether this patch compiles, and I'm not sure if I messed up the indentation of the files. (I may have even fixed it; I don't even know.)

What would be useful to know (to save me the trouble of having to figure it out) is what exactly is missing from this code, beyond those other lists.
Attachment #241732 - Attachment is obsolete: true
Oh, and some more notes about the tag lists:

The names associated with each tag are exactly the way they are listed in the IANA registry. This may differ from the names previously used in these files. In the case of multiple names associated with one tag, the first one listed in the registry is used.

These lists include all tags that do not have a 'Preferred-Value' associated with them. This means that deprecated and previously-removed tags may have made a return to the list. (I think this is a Good Thing™.)

The whitespace separators differed in the two existing files. I decided to go with tab separators in these new versions. This may make viewing differences difficult, unless you ignore whitespace.

All symbols that are not alphanumeric, space, or parentheses have been escaped. This may be overkill for basic symbols such as commas or periods, but I don't especially care.

That's about all I can think of right now. If you have any questions or comments, let me know.
This is gonna be troublesome for localizers (so many new properties)... CCing Pike.

Also, the list includes languages that went extinct long ago, such as:
- Polabian
- Prussian 
- Lower Silesian
and non-existent countries, like the Soviet Union and "Serbia and Montenegro".

Another issue is that this patch uses the deprecated \uXXXX encoding of non-ASCII characters in property values (Mozilla property files are UTF-8).
see also bug 230866 (Taiwan / Macedonia) - be VERY careful with changes like that !

and we also have the kp/kr pair (North and South Korea) ...

Also note that various country names are now changed to the official one, not the common English variant for it, so they won't be easily found in an alphabetic list. Like Iran, Cote d'Ivoire, or Tanzania.
Basically, yes, we should totally do this.

For most of the content, which is the three-letter languages, IMHO, we should rip those out of l10n. Having them in some form is cool, but enforcing l10n seems over the top. We might want to add the three-letter lowercase codes to filter.py and offer that localizers can pick some that are relevant for them, but that'd be it, i.e., check the l10n version first and then the one in content.

Regarding the script tags, should those be with capital initial letter? And I'd be interested to see marcoos' questions answered, too :-)
(In reply to comment #5)
> Also, the list includes languages that went extinct long ago, such as:
> - Polabian
> - Prussian 
> - Lower Silesian
> and non-existent countries, like the Soviet Union and "Serbia and Montenegro".
As far as I'm concerned, this code is for processing language content that is available on the Web, not for deciding which languages Mozilla products may be localized for. As such, I think it appropriate to include these things, as a user may still come across samples of these languages. As I stated in comment 4, I have included all tags that have not been replaced using a 'Preferred-Value' field in the IANA registry. That means that, although they may be deprecated (and I'm not saying, necessarily, that the ones you mentioned are), one may still come across samples tagged with those tags, so it would be important to include them, particularly because if we don't, they'll wind up being printed as literals anyway.

> Another issue is that this patch uses the deprecated \uXXXX encoding of
> non-ascii in property values (Mozilla property files are UTF-8).
I wasn't aware of that. I merely duplicated the format of the existing files, which I suppose haven't been updated since the switch to UTF-8. I wish I had known that beforehand, though, because it does take an extra couple of lines of code to do the substitution.
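For converting an existing escaped file to literal UTF-8, a minimal sketch follows. It is not the PHP script mentioned in this bug; the function name `unescape_unicode` is made up for illustration, and it assumes only four-digit \uXXXX escapes in the Basic Multilingual Plane (surrogate pairs and escaped backslashes are not handled).

```python
import re

# Matches a literal backslash, "u", and exactly four hex digits.
ESCAPE = re.compile(r"\\u([0-9a-fA-F]{4})")


def unescape_unicode(line):
    """Replace each \\uXXXX escape in a .properties line with the character itself."""
    return ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), line)


if __name__ == "__main__":
    # The tab-separated key/value layout mirrors the files discussed here.
    print(unescape_unicode("ci\tC\\u00F4te d'Ivoire"))
```

The result would then be written out with an explicit UTF-8 encoding so that the file matches what the build expects.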
(In reply to comment #6)
> see also bug 230866 (Taiwan / Macedonia) - be VERY careful with changes like
> that !
> 
> and we also have the kp/kr pair (North and South Korea) ...
> 
> Also note that various country names are now changed to the official one, not
> the common English variant for it, so they won't be easily found in an
> alphabetic list. Like Iran, Cote d'Ivoire, or Tanzania.
As I said in comment 4, the files were automatically generated using the names officially associated with the tags, as listed in the IANA registry. It is not my place to decide what any country should be called.
(In reply to comment #7)
> Basically, yes, we should totally do this.
> 
> For the most content, which are the triple letter languages, IMHO, we should
> rip those out of l10n. Having them in some form is cool, but enforcing l10n
> seems over the top. We might want to add three lower caps to filter.py and
> offer that localizers can pick some that are relevant for them, but that'd be
> it, i.e., check the l10n version first and then one in content.
I'm not quite sure what you're referring to, given that I don't know this code beyond the files touched by my patch.

> Regarding the script tags, should those be with capital initial letter? And I'd
> be interested to see marcoos' questions answered, too :-)
I would be inclined to agree with you, except there are some caveats:
1) If we did, it would also make sense to change the region tags to uppercase.
2) The region tags are currently lowercase, so I left them the same.
3) Although there is a best practice for the proper capitalization of each type of tag, they are all technically case-insensitive.
4) It is much easier/shorter to code to just force everything to always be lowercase than to attempt to match proper capitalization.
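To make points 3 and 4 concrete, here is a minimal sketch of both options: comparing tags case-insensitively, and applying the position-based casing convention from RFC 5646 section 2.1.1 (everything lowercase, except that two-letter subtags become uppercase and four-letter subtags become title case when they are neither first nor after a singleton). The function names are hypothetical and no validation against the registry is attempted.

```python
def canonical_case(tag):
    """Apply the RFC 5646 section 2.1.1 casing convention to a language tag."""
    result = []
    after_singleton = False
    for index, subtag in enumerate(tag.split("-")):
        if index == 0 or after_singleton:
            # First subtag, and anything after a singleton such as "x", is lowercase.
            result.append(subtag.lower())
        elif len(subtag) == 2:
            result.append(subtag.upper())       # region-style subtag
        elif len(subtag) == 4:
            result.append(subtag.capitalize())  # script-style subtag
        else:
            result.append(subtag.lower())
        if len(subtag) == 1:
            after_singleton = True
    return "-".join(result)


def same_tag(a, b):
    """Tags are case-insensitive, so comparison can simply fold case."""
    return a.lower() == b.lower()


if __name__ == "__main__":
    print(canonical_case("EN-latn-us"))
    print(same_tag("es-US", "es-us"))
```

Forcing lowercase for lookups while using something like `canonical_case` only for display would keep the existing lowercase keys in the properties files untouched.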
Summary: RFC 4646 and 4647 support → RFC 4646 and 4647 (BCP 47; IANA Language Subtag Registry) support
OK, I just attempted to compile this, and it works, but with a couple of caveats.

1) The patch will not compile as-written because it does not include the variantNames.properties and grandfatheredNames.properties files. But all you have to do is `touch` these files, and everything will compile fine.

2) It seems the JS code does not properly account for tags with multiple subtags (e.g. en-US-Latn, although this particular example is technically redundant). Instead, it just goes with the first bit of information and drops the rest. I haven't at all looked into what is causing this; I've only just thrown a quick test together to see what happens.

3) I think I compiled using an old version of mozilla-central (dating from around 2009-08-13, when I did the original work), so things may have changed since then.

4) The patch excludes private use range characters, but I think it would be wise to include them. Otherwise you might get a language of "English (AA)" instead of "English (Private use)".

Even so, despite these caveats, putting this patch into trunk right now would already improve support for BCP 47.
First of all, I support this change; the current language tags are not sufficient to describe different types of Chinese scripts.

(In reply to comment #9)
> (In reply to comment #6)
> > see also bug 230866 (Taiwan / Macedonia) - be VERY careful with changes like
> > that !
> > 
> > and we also have the kp/kr pair (North and South Korea) ...
> > 
> > Also note that various country names are now changed to the official one, not
> > the common English variant for it, so they won't be easily found in an
> > alphabetic list. Like Iran, Cote d'Ivoire, or Tanzania.
> As I said in comment 4, the files were automatically generated using the names
> officially associated with the tags, as listed in the IANA registry. It is not
> my place to decide what any country should be called.

Hi there,

I think the essence here is "not to make a political statement."

"Following standard" is a lazy excuse for the problem as there is no official standard for Country/Region names, at least not one as definitive as a W3C spec. I do not want to start an essay about the definition and recognition of a nation-state; please contact your nearest international relations majors or Wikipedia for details.

Mozilla is a F/OSS project, not the United Nations. No one should be allowed to push their ideology and we should avoid doing so for anyone. ISO 3166 names have caused a lot of headaches among other free software communities. Please try not to bring them to Mozilla.

So, for the sake of political correctness, please keep disputed regions in their short names, without prefix or suffix about their political status ("Republic of", "Region of", "Province of") thus removing the implication. In this way we could keep all parties in peace.


Tim
Mozilla Taiwan Community (MozTW)
(In reply to comment #12)
> Hi there,
> 
> I think the essence here is "not to make a political statement."
> 
> "Following standard" is a lazy excuse for the problem as there is no official
> standard for Country/Region names, at least not one as definitive as a W3C
> spec. I do not want to start an essay about the definition and recognition of
> a nation-state; please contact your nearest international relations majors or
> Wikipedia for details.
> 
> Mozilla is a F/OSS project, not the United Nations. No one should be allowed
> to push their ideology and we should avoid doing so for anyone. ISO 3166 names
> have caused a lot of headaches among other free software communities. Please
> try not to bring them to Mozilla.
> 
> So, for the sake of political correctness, please keep disputed regions in
> their short names, without prefix or suffix about their political status
> ("Republic of", "Region of", "Province of") thus removing the implication. In
> this way we could keep all parties in peace.
> 
> 
> Tim
> Mozilla Taiwan Community (MozTW)
Again, I am not making any decisions. These lists are being automatically generated from the IANA registry data. If someone wants to change some of the listings after this patch is finished and before it is applied to trunk, that's fine. But I'm not going to be the one making those decisions.
Comment on attachment 393959 [details] [diff] [review]
WIP patch synched with mozilla-central and latest IANA registry

Let's put some action items into a review comment:

If parts of this patch were created automatically, the tools for that should be part of the patch.

The generated files should be utf-8, no unicode escapes, please.

The 3-letter language names should be in a separate file, and not in locales/en-US. 

There shouldn't be whitespace-only changes, blame info trumps tool.

There's room for a follow-up patch to add an empty localized version with a comment that it allows adding translations, and that should be tried for entries for triple-letter languages first.

There should be tests for this patch.

What happens with the extended information in cases like en-Latn-US? Might be worth to make that a bit more flexible. Right now, I think we'd say "English (Latin)" which sounds wrong.

Not sure that I like the js code, someone that owns or peers that code should add a review for that.
Attachment #393959 - Flags: review-
(In reply to comment #14)
> (From update of attachment 393959 [details] [diff] [review])
> Let's put some action items into a review comment:
> 
> If parts of this patch were created automatically, the tools for that should be
> part of the patch.
Of course. This wasn't intended as the final patch. It was merely updating the previously-existing patch to work against mozilla-central, since it was so old.

> The generated files should be utf-8, no unicode escapes, please.
Will do. Not a problem.

> The 3-letter language names should be in a separate file, and not in
> locales/en-US. 
Why? They have the same significance as 2-letter language names.

> There shouldn't be whitespace-only changes, blame info trumps tool.
Fair enough. Should I create a second patch afterwards that makes it match? And do we prefer tabs or spaces? (Either way, one of the existing files will have to change, as they don't even match now.)

> There's room for a follow-up patch to add an empty localized version with a
> comment that it allows adding translations, and that should be tried for
> entries for triple-letter languages first.
I'm not sure I follow what you're saying here.

> There should be tests for this patch.
I suppose, but I wouldn't know how to create them. I'm not familiar with the testing system.

> What happens with the extended information in cases like en-Latn-US? Might be
> worth to make that a bit more flexible. Right now, I think we'd say "English
> (Latin)" which sounds wrong.
I'm not sure what your question is here, as that is not a valid language declaration. The correct one would be 'en-US-Latn', and that (I believe) only renders as "English (United States)" with this patch, which is also incorrect. I think 'en-Latn' does render as "English (Latin)", though. Not sure what else to call it, unless you want to artificially add "script" after a lot of the script definitions. That won't work for all of them, though.

> Not sure that I like the js code, someone that owns or peers that code should
> add a review for that.
Well, I didn't write the JS included in this patch, Simon did. That's not to say it doesn't need work, though, because it doesn't even seem to be fully functioning, as I mentioned in comment 11.
(In reply to comment #15)
> > What happens with the extended information in cases like en-Latn-US? Might be
> > worth to make that a bit more flexible. Right now, I think we'd say "English
> > (Latin)" which sounds wrong.
> I'm not sure what your question is here, as that is not a valid language
> declaration. The correct one would be 'en-US-Latn', and that (I believe) only
> renders as "English (United States)" with this patch, which is also incorrect.
> I think 'en-Latn' does render as "English (Latin)", though. Not sure what else
> to call it, unless you want to artificially add "script" after a lot of the
> script definitions. That won't work for all of them, though.
Oh, I take that first part back. You had it right. 'en-Latn-US' is the correct way. I was doing it wrong, so perhaps my comments in comment 11 do not apply after all. I'll have to check.
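For reference, a minimal sketch of how a language-script-region tag such as 'en-Latn-US' could be split and rendered. This is not the JS in the patch; the name tables are tiny hypothetical stand-ins for the .properties files, `display_name` is a made-up function, and the "Language (Script, Region)" output format is only one possible choice. Variants, extensions, and private-use subtags are deliberately out of scope.

```python
import re

# Hypothetical stand-ins for languageNames/scriptNames/regionNames.properties.
LANGUAGE_NAMES = {"en": "English", "es": "Spanish"}
SCRIPT_NAMES = {"latn": "Latin", "cyrl": "Cyrillic"}
REGION_NAMES = {"us": "United States", "gb": "United Kingdom"}

# language, then an optional 4-letter script, then an optional region
# (2 letters or 3 digits), in that order.
TAG_PATTERN = re.compile(
    r"^(?P<language>[a-z]{2,3})"
    r"(?:-(?P<script>[a-z]{4}))?"
    r"(?:-(?P<region>[a-z]{2}|[0-9]{3}))?$"
)


def display_name(tag):
    """Build a display name; tags this sketch cannot handle are returned unchanged."""
    match = TAG_PATTERN.match(tag.lower())
    if match is None:
        return tag
    language = match.group("language")
    name = LANGUAGE_NAMES.get(language, language)
    qualifiers = []
    script = match.group("script")
    if script:
        qualifiers.append(SCRIPT_NAMES.get(script, script))
    region = match.group("region")
    if region:
        qualifiers.append(REGION_NAMES.get(region, region))
    if qualifiers:
        return "{} ({})".format(name, ", ".join(qualifiers))
    return name


if __name__ == "__main__":
    print(display_name("en-Latn-US"))
    print(display_name("en-US-Latn"))  # wrong subtag order, so left as-is
```

Keeping every recognized subtag, rather than stopping at the first qualifier, avoids dropping the region from 'en-Latn-US'.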
(In reply to comment #15)
> (In reply to comment #14)
> > The 3-letter language names should be in a separate file, and not in
> > locales/en-US. 
> Why? They have the same significance as 2-letter language names.

Not in practical matters, really. For the most-spoken locales, 2 letter codes exist, and are used.

For those languages that only have triple-letter codes, most locales are unlikely to have their own names, so using the generic Latin-script names makes sense in most locales. Continued below.

Can we get a mapping for the triple-letter codes that map to two-letter codes?

> > There shouldn't be whitespace-only changes, blame info trumps tool.
> Fair enough. Should I create a second patch afterwards that makes it match. And
> do we prefer tabs or spaces? (Either way, one of the existing files will have
> to change, as they don't even match now.)

IMHO the files should stay as is, I don't see a point to modify the styles in both files to be the same.

> > There's room for a follow-up patch to add an empty localized version with a
> > comment that it allows adding translations, and that should be tried for
> > entries for triple-letter languages first.
> I'm not sure I follow what you're saying here.

... continued from above.

If we rip out the triple letter language names, there might be some of those that make sense to be localized in some locales. That's why we should add a file to support that without breaking when they're not localized. But that's OK for a follow up patch.

> > There should be tests for this patch.
> I suppose, but I wouldn't know how to create them. I'm not familiar with the
> testing system.

Not sure myself.
(In reply to comment #17)
> > > The 3-letter language names should be in a separate file, and not in
> > > locales/en-US. 
> > Why? They have the same significance as 2-letter language names.
> 
> Not in practical matters, really. For the most-spoken locales, 2 letter codes
> exist, and are used.
> 
> For those languages that only have triple-letter codes, most locales are
> unlikely to have their own names, so using the generic Latin-script names makes
> sense in most locales. Continued below.
> > > There's room for a follow-up patch to add an empty localized version with a
> > > comment that it allows adding translations, and that should be tried for
> > > entries for triple-letter languages first.
> > I'm not sure I follow what you're saying here.
> 
> ... continued from above.
> 
> If we rip out the triple letter language names, there might be some of those
> that make sense to be localized in some locales. That's why we should add a
> file to support that without breaking when they're not localized. But that's OK
> for a follow up patch.
As far as I can tell, the only 3-letter languages that are being generated by my script (and, thus, the ones in the patch here) are the ones that do not have a 2-letter counterpart. Also, I am noticing in my latest diff (which allows a clearer picture now that I've stopped changing the whitespace) that there are already a handful of 3-letter codes in the file as it exists today.

Also, the registry does provide some data as to the default script (known as Suppress-Script, since you'll want to avoid the redundancy of specifying the given language with the given script) of some languages, but that information is not currently being used by this patch.
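A minimal sketch of how Suppress-Script data could be applied, assuming a small hand-written table in place of the values that would be extracted from the registry; `SUPPRESS_SCRIPT` and `drop_redundant_script` are hypothetical names, and only a script in the second position is considered.

```python
# Hypothetical extract of the registry's Suppress-Script fields.
SUPPRESS_SCRIPT = {"en": "Latn", "ru": "Cyrl"}


def drop_redundant_script(tag):
    """Remove the script subtag when it is the language's default script."""
    subtags = tag.split("-")
    if len(subtags) >= 2 and len(subtags[1]) == 4 and subtags[1].isalpha():
        default = SUPPRESS_SCRIPT.get(subtags[0].lower())
        if default is not None and default.lower() == subtags[1].lower():
            del subtags[1]
    return "-".join(subtags)


if __name__ == "__main__":
    print(drop_redundant_script("en-Latn-US"))
    print(drop_redundant_script("ru-Latn"))  # not the default script, so kept
```

Applied before generating a display name, this would avoid labels that spell out a script the reader already expects for that language.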

But re-reading your comment, I see that I misunderstood what you meant. You're referring to translating the language names into languages other than English (LOTE). I see. In that case, it may be worth putting them into another file. However, the file is now sorted so that all 3-letter languages are below the 2-letter languages, so the locale translators could be instructed to only translate the first half of the file, perhaps?
Rather not append. Most tools bail if there's an untranslated entry. If we go for an extra file, the tools might bail about an obsolete entry, which is usually signaled as a warning instead of an error.
Comment on attachment 397548 [details] [diff] [review]
WIP synched to tip (changeset aa3a30f3f1e8)

Darn enter key. Didn't mean to submit that like that.

This patch has been updated to patch against tip. It takes into account the whitespace issue and the UTF-8/escaping issue, but it still has all of the language tags in one file.
Attachment #397548 - Attachment description: WIP synched to tip → WIP synched to tip (changeset aa3a30f3f1e8)
Attachment #393959 - Attachment is obsolete: true
This is a testcase presenting examples of a number of valid tags. Most of them are from RFC 4646 itself, as Appendix B provides a list and describes (mostly) what they're supposed to represent. For those that it doesn't, I extrapolated from the IANA registry. I've also included an example that I happened to come across on Wikipedia that uses the script subtags to label things.
comments about the regionNames.properties change :

I'm a bit concerned how the list will be shown in an alphabetic listing. Some
countries have changed their name with a prefix, or have been changed into a
non-English variant. In both cases, the name might be a bit difficult to find
if you don't know about it. There's still a difference between having a
more-or-less "official" name for a country, and one usable in an English
language application. If we insist on using the "official" name in Gecko, then
we also impose limitations on the localisers. Why should the official name only
be used in the English version, and not the other ones ?

- aa 'private use' : maybe correct, but what's it doing here ?

- cd/cg (the Congo's) : technically, the names are now correct, but replacing
Congo-Brazzaville with Congo might confuse people that don't know about the
other (bigger) Congo. Unfortunately, the name Congo is NOT used for the bigger
country as many people think, but for the smaller one, although everyone calls
that Congo-Brazzaville in spoken language. And since that will now be sorted
with the letter T, it will be difficult to find it back. Maybe we should use
"Congo, The Democratic Republic of the" ? Or go back to the old
Congo-Brazzaville and Congo-Kinshasa ? 

- ci 'Côte d'Ivoire' : you won't find it at the letter I as before.
Probably ok, though.

- cs 'Serbia and Montenegro' : the country is now split into rs (Serbia) and
me (Montenegro). See
http://www.iso.org/iso/country_codes/iso_3166_code_lists.htm - which version
did you use ?

- fm 'Federated States of Micronesia' : sorting issue

- ir 'Islamic Republic of Iran' : sorting issue

- kp/kr (North & South Korea) : technically correct, but shouldn't we use the
English names ? Although that also presents a sorting issue.

- la 'Lao People's Democratic Republic' : although different from the other
countries, won't cause a sorting issue

- ly 'Libyan Arab Jamahiriya' : idem as Laos

- mk 'The Former Yugoslav Republic of Macedonia' : sorting issue

- nt 'neutral zone'. That doesn't exist anymore, but the code is still reserved.
Shouldn't we filter out those reserved codes ?

- qm..qz 'Private use' : that will not work, if you need them (?) it has to be
spelled out

- su 'soviet union' : doesn't exist anymore as a country code, it's not even
reserved.

- sy 'Syrian Arab Republic' : idem as Laos

- tw 'Taiwan, Province of China' : serious political issue

- tz  'United Republic of Tanzania' : sorting issue

- va 'Holy See (Vatican City State)' : maybe confusing

- vn 'Viet Nam' : ok

- xa..xz 'Private use' : that will not work, if you need them (?) it has to be
spelled out

- yu 'Yugoslavia' : not used anymore, marked as reserved

- zz 'Private use' : ???

- number codes : do we need those ??? As a region maybe, but then we should not
present the list as a list of countries.
(In reply to comment #23)
> comments about the regionNames.properties change :
> [etc.]
Jo, let me preface by saying this: Please ensure that you have read the entire discussion up until this point. I have stated multiple times the cause of the name changes and also that this is not a final patch. "WIP" means "Work in Progress".

Now perhaps I'm the only one, but I'm not looking at this from the point of view of localizing Gecko builds or creating alphabetical lists (nor is that necessarily what RFC 4646 is intended for). I am looking at this from the point of view of identifying text on the Internet. As I said in comment 4 and elsewhere, this list has been automatically generated from the data contained in the IANA Language Subtag Registry. It is not directly dependent on the contents of ISO 3166. BCP 47 (RFC 4646 and RFC 4647) address this.

That is why there are entries in the lists for countries that no longer exist. The point is that text labeled with those tags may appear on the Internet. Any tag that has not been listed in the IANA registry with a Preferred-Value has been included in these lists, even if it is a deprecated one. (Deprecated tags or subtags that have a Preferred-Value listed in the registry are not included in these lists.) These are also the reasons behind listing the numerical region subtags and private use subtags.

Is Simon available to offer comment?
(In reply to comment #24)
> (In reply to comment #23)
> > comments about the regionNames.properties change :
> > [etc.]
> Jo, let me preface by saying this: Please ensure that you have read the entire
> discussion up until this point. I have stated multiple times the cause of the
> name changes and also that this is not a final patch. "WIP" means "Work in
> Progress".

Of course, but that doesn't mean that there can't be comments.

> 
> Now perhaps I'm the only one, but I'm not looking at this from the point of
> view of localizing Gecko builds or creating alphabetical lists (nor is that
> necessarily what RFC 4646 is intended for). I am looking at this from the point
> of view of identifying text on the Internet. As I said in comment 4 and
> elsewhere, this list has been automatically generated from the data contained
> in the IANA Language Subtag Registry. It is not directly dependent on the
> contents of ISO 3166. BCP 47 (RFC 4646 and RFC 4647) address this.

I just wanted to point out the problems in that list. The ranges for private  use can't be used for instance (unless we add code for that particular syntax). And you're using an older version (Serbia & Montenegro for instance). I have been using the regionNames.properties file before to generate a list of countries (Firefox doesn't show that list in the GUI). My software does it differently - the languages are calculated for a specific country that needs to be selected first (separate mapping list). But now I have to make a new list, because the sorting order is different. Oh well, do what you want. I've had my share of political problems before (can't mention details).
(In reply to comment #25)
> > Now perhaps I'm the only one, but I'm not looking at this from the point of
> > view of localizing Gecko builds or creating alphabetical lists (nor is that
> > necessarily what RFC 4646 is intended for). I am looking at this from the point
> > of view of identifying text on the Internet. As I said in comment 4 and
> > elsewhere, this list has been automatically generated from the data contained
> > in the IANA Language Subtag Registry. It is not directly dependent on the
> > contents of ISO 3166. BCP 47 (RFC 4646 and RFC 4647) address this.
> 
> I just wanted to point out the problems in that list. The ranges for private 
> use can't be used for instance (unless we add code for that particular syntax).
> And you're using an older version (Serbia & Montenegro for instance).
Yes, you're right about the ranges. I had intended to take care of them before submitting the page, but I forgot.
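One way the ranges could be spelled out is sketched below. It assumes the registry's "START..END" notation with purely alphabetic endpoints of equal length (as in QM..QZ, XA..XZ, or qaa..qtz) and lowercases the result to match the keys used in these files; `expand_range` is a hypothetical helper, not part of any patch here.

```python
from string import ascii_lowercase


def expand_range(subtag_range):
    """Expand 'QM..QZ' into ['qm', 'qn', ..., 'qz']; plain subtags pass through."""
    start, separator, end = subtag_range.lower().partition("..")
    if not separator:
        return [start]

    def to_number(subtag):
        # Treat the subtag as a base-26 number with a=0 ... z=25.
        number = 0
        for letter in subtag:
            number = number * 26 + ascii_lowercase.index(letter)
        return number

    def to_subtag(number, length):
        letters = []
        for _ in range(length):
            number, remainder = divmod(number, 26)
            letters.append(ascii_lowercase[remainder])
        return "".join(reversed(letters))

    return [
        to_subtag(number, len(start))
        for number in range(to_number(start), to_number(end) + 1)
    ]


if __name__ == "__main__":
    print(expand_range("QM..QZ"))
```

Each expanded subtag could then be written as its own "Private use" entry, so that lookups need no special range syntax.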

However, I am not using an outdated list, which is what I've been trying to tell you. There just simply is not a one-to-one relationship between the old value and the new value, and it is possible that the old value could show up on the Internet somewhere.

Here is the IANA registry entry for CS:
%%
Type: region
Subtag: CS
Description: Serbia and Montenegro
Added: 2005-10-16
Deprecated: 2006-10-05
Comments: see RS for Serbia or ME for Montenegro
%%

Since there is no Preferred-Value (only a mention of what CS split into in the Comments section), it is included in the list.

Here is an example of an entry (BU) that is not included in the list because a Preferred-Value is set (MM):
%%
Type: region
Subtag: BU
Description: Burma
Added: 2005-10-16
Deprecated: 1989-12-05
Preferred-Value: MM
%%

Thus, BU isn't in the list, but MM is.
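The selection rule can be sketched as follows. This is not the PHP script mentioned earlier; it is a minimal illustration that assumes records are separated by "%%" lines, fields are "Name: value" pairs, and continuation lines start with whitespace. The CS and BU records are copied from above; the MM record is an assumed example added so that the output has something to keep. `parse_records` and `names_for` are hypothetical names.

```python
REGISTRY_SAMPLE = """\
%%
Type: region
Subtag: CS
Description: Serbia and Montenegro
Added: 2005-10-16
Deprecated: 2006-10-05
Comments: see RS for Serbia or ME for Montenegro
%%
Type: region
Subtag: BU
Description: Burma
Added: 2005-10-16
Deprecated: 1989-12-05
Preferred-Value: MM
%%
Type: region
Subtag: MM
Description: Myanmar
Added: 2005-10-16
"""  # The MM record is an assumed example, not quoted from this bug.


def parse_records(text):
    """Split registry text into records mapping field name -> list of values."""
    records, fields, last_key = [], {}, None
    for line in text.splitlines():
        if line.strip() == "%%":
            if fields:
                records.append(fields)
            fields, last_key = {}, None
        elif line[:1].isspace() and last_key:
            # Continuation of the previous field's value.
            fields[last_key][-1] += " " + line.strip()
        elif ":" in line:
            key, _, value = line.partition(":")
            last_key = key.strip()
            fields.setdefault(last_key, []).append(value.strip())
    if fields:
        records.append(fields)
    return records


def names_for(records, record_type):
    """Keep subtags without a Preferred-Value, using the first Description."""
    names = {}
    for record in records:
        if record.get("Type") != [record_type] or "Preferred-Value" in record:
            continue
        names[record["Subtag"][0].lower()] = record["Description"][0]
    return names


if __name__ == "__main__":
    print(names_for(parse_records(REGISTRY_SAMPLE), "region"))
```

With this rule, a deprecated subtag such as CS stays in the output because nothing replaces it, while BU is dropped in favour of MM.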
Comment on attachment 397548 [details] [diff] [review]
WIP synched to tip (changeset aa3a30f3f1e8)

bug 513147 removed metaData.xul and quite a bit of other code, not sure if there's more overlap.

Anyway, this patch won't apply anymore.
(In reply to comment #27)
> (From update of attachment 397548 [details] [diff] [review])
> bug 513147 removed metaData.xul and quite a bit of other code, not sure if
> there's more overlap.
> 
> Anyway, this patch won't apply anymore.
I wish I'd known about that bug earlier. I left a comment on it: bug 513147 comment 61.
Bug 513147 doesn't invalidate this bug: the language names are also used in the language preference dialog (from Preferences | Content)
Depends on: 522913
(In reply to comment #29)
> Bug 513147 doesn't invalidate this bug: the language names are also used in the
> language preference dialog (from Preferences | Content)
Are you sure about that? Because I've edited my Accept-Language settings via about:config to include options that weren't in the list provided, and that menu didn't automatically generate names for them, despite them being made up of identifiable parts.

For example, I added 'es-US' (well, technically, 'es-us', since that menu uses all-lowercase, for some reason), which should easily be identified as "Spanish (United States)". However, it's listed in that menu without a title at all.

(I've also added '*-Latn' to my list, which should match against any language written in a Latin script. When this bug is fixed, that should show up as "All Latin scripts" or something similar.)
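For illustration, here is a minimal sketch of how a range like '*-Latn' matches tags under the extended filtering scheme described in RFC 4647 section 3.3.2. Whether any part of Gecko would apply this scheme to Accept-Language is not established in this bug; `extended_filter_match` is a hypothetical name.

```python
def extended_filter_match(language_range, tag):
    """Return True if tag matches language_range (RFC 4647 extended filtering)."""
    range_subtags = language_range.lower().split("-")
    tag_subtags = tag.lower().split("-")

    # The first subtags must match, unless the range starts with a wildcard.
    if range_subtags[0] != "*" and range_subtags[0] != tag_subtags[0]:
        return False

    r, t = 1, 1
    while r < len(range_subtags):
        if range_subtags[r] == "*":
            r += 1                      # wildcard: skip it in the range
        elif t >= len(tag_subtags):
            return False                # range wants more than the tag has
        elif range_subtags[r] == tag_subtags[t]:
            r += 1
            t += 1
        elif len(tag_subtags[t]) == 1:
            return False                # never skip past a singleton like "x"
        else:
            t += 1                      # skip a non-matching tag subtag
    return True


if __name__ == "__main__":
    print(extended_filter_match("*-Latn", "en-Latn-US"))
    print(extended_filter_match("*-Latn", "sr-Cyrl"))
```

Note that a bare 'en' does not match '*-Latn' here, since the tag carries no explicit script subtag.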
(In reply to comment #30)
> Are you sure about that? Because I've edited my Accept-Language settings via
> about:config to include options that weren't in the list provided, and that
> menu didn't automatically generate names for them, despite them being made up
> of identifiable parts.
> 
> For example, I added 'es-US' (well, technically, 'es-us', since that menu uses
> all-lowercase, for some reason), which should easily be identified as "Spanish
> (United States)". However, it's listed in that menu without a title at all.

I thought there was already a bug about this, but I can't find it. I might have been thinking of bug 269437 comment 7.
For the record, RFC 4646 was obsoleted in September 2009 by RFC 5646. I've updated the bug summary and alias to refer directly to BCP 47. I've also updated the link to point to the HTML version of BCP 47, instead of just the text version. (Please note that, while a PDF version of BCP 47 is available, it has not yet been updated to reflect the inclusion of RFC 5646 in place of RFC 4646.)

Also, the IANA Language Subtag Registry was actually established in a document separate from BCP 47: RFC 4645, which was updated by RFC 5645 (also in September 2009). However, those RFCs merely describe the establishment of the registry; BCP 47 is what describes how to use that data, so BCP 47 is what needs implementing.
Alias: rfc4646 → bcp47
Summary: RFC 4646 and 4647 (BCP 47; IANA Language Subtag Registry) support → BCP 47 (RFC 5646 and 4647; IANA Language Subtag Registry) support
more: bug 586085
I'm starting to get lost among all these different standards and recommendations ...
Boy, I can be quite cranky. Apologies to everyone for the rudeness of my past comments.

Given the changes made through Firefox 4 (with bug 513147 and bug 522913 being particularly notable), the file dependencies for this bug have diminished, along with its scope.

As far as I understand it, the remaining focus of languages in Firefox is the preference box that generates the Accept-Language header (and, possibly, the mechanisms that handle font/encoding negotiation).

I think that means that the list of files we currently have to deal with is this:
browser/components/preferences/languages.js
browser/components/preferences/languages.xul
intl/locale/src/langGroups.properties
intl/locale/src/language.properties
toolkit/locales/en-US/chrome/global/languageNames.properties
toolkit/locales/en-US/chrome/global/regionNames.properties

Given the heated discussion that has occurred in the past and is bound to continue, it would probably be best to separate out each particular task, as they are not as dependent on each other as it may seem. It would probably make sense to separate the tasks out into individual bugs, but let's get on the same page as to what those tasks are first.

In general, the tasks will likely fall into one of two categories:
(1) those that involve the Languages preference dialog, and
(2) those that involve the localization process.

Given these two categories, the files can then be separated thusly (though this is somewhat obvious):
(1) languages.js, languages.xul, langGroups.properties, language.properties
(2) languageNames.properties, regionNames.properties

Now for the tasks (in no particular order at the moment):
* Update the JS code to handle all the new requirements in BCP 47.
* Update list of language names ('Primary Language Subtags'). (Do we exclude extinct/historical languages? If so, based on what criteria?)
* Update list of region names ('Region Subtags'). (Who or what decides when to differ from how the IANA registry lists regions?) 
* Add list of script names ('Script Subtags').
* Intentionally ignore 'Extended Language Subtags', as they are generally for backwards compatibility with 'Primary Language Subtags' that represent macrolanguages.
* Intentionally ignore 'Redundant Registrations', as they are generally for backwards compatibility and can be composed of other valid subtags.
* Decide how to handle 'Variant Subtags', 'Extension Subtags', and 'Private Use Subtags', as well as 'Grandfathered Registrations', as it is unclear how they will come into play with regard to language selection or localization.
* Decide how to clean up and/or reorganize language groups. (Can they be superseded by 'Script Subtags'?)
* Decide whether specifying the "accepted" languages is necessary. (What are the reasons for a language not being "accepted"?)
* Decide how to separate the l10n-necessary language names from the l10n-unnecessary language names. (Do we separate 2-char vs. 3-char, or do we use another method?)
* Decide how to improve the Languages selection interface.
* Decide how we should handle 'q' values in the Accept-Language header. (Should we just allow them to be automatically generated from the given order, as is apparently the existing behavior?)
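To make the 'q' value question concrete, here is a minimal sketch of generating the header from the user's ordered list. The evenly spaced weighting is an assumption made for illustration, not a statement of what the existing preference code does, and `build_accept_language` is a hypothetical name.

```python
def build_accept_language(tags):
    """Join language tags in preference order, assigning descending q-values."""
    count = len(tags)
    parts = []
    for index, tag in enumerate(tags):
        if index == 0:
            # The first choice gets the default weight of 1, which may be omitted.
            parts.append(tag)
            continue
        # Assumed weighting: evenly spaced steps below 1.0, never reaching 0
        # (q=0 would mean "not acceptable").
        q = max(1.0 - index / count, 0.001)
        # HTTP allows at most three decimal places in a q-value.
        weight = f"{q:.3f}".rstrip("0")
        parts.append(f"{tag};q={weight}")
    return ",".join(parts)


print(build_accept_language(["en-US", "en"]))        # en-US,en;q=0.5
print(build_accept_language(["de", "en-US", "en"]))  # de,en-US;q=0.667,en;q=0.333
```

With this approach the user only ever manages the order; the weights are derived and never shown.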

I think that's about where we stand right now. Axel and Simon, what are your thoughts?
Whiteboard: [See comment 34 and beyond for the latest discussion.]
Hi,

Related to this, we have had some problems with these variants:
https://bugzilla.mozilla.org/show_bug.cgi?id=654467

Any news?
(In reply to comment #34)

Nice structure to the challenges here, thanks.

CCing Kevin, who's all around smaller languages, and might have some hints as to how to get control over the number of choices here.

I'm tempted to say that we need to start a UX experiment or two to get started?
Hi, nice to see you here Gordon, and great work on this bug so far.  

One other place where the languages appear in the UI is in the context menu for spell checking.  Spell-checking add-ons contain dictionary and affix files conventionally named xx_YY.dic and xx_YY.aff, and as I understand it these filenames are converted into localized language names using languageNames.properties.

If a language is not listed in languageNames.properties, then the language appears as "xx" in the context menu for spell checking.   This isn't such a big deal, except that in practice the AMO editors haven't been willing to approve spell checking add-ons for which this is the case (hence bug 586085 for haw, hil, and csb, and I have several others with the same issue -- Tok Pisin (tpi) springs to mind).
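As a rough illustration of the behavior described above (not the actual spell-checker code), the filename-to-label step could look something like this. The two dictionaries are hypothetical excerpts standing in for languageNames.properties and regionNames.properties, and `dictionary_label` is a made-up name.

```python
import re

# Hypothetical excerpts standing in for languageNames.properties and
# regionNames.properties.
LANGUAGE_NAMES = {"en": "English", "es": "Spanish"}
REGION_NAMES = {"us": "United States", "gb": "United Kingdom"}

# Matches names such as "en_US.dic", "es-US.aff", or "haw.dic".
DICTIONARY_FILE = re.compile(r"^([A-Za-z]{2,3})(?:[_-]([A-Za-z]{2}))?\.(?:dic|aff)$")


def dictionary_label(filename):
    """Return a menu label for a dictionary file, or the bare code if unknown."""
    match = DICTIONARY_FILE.match(filename)
    if match is None:
        return filename
    language = match.group(1).lower()
    region = match.group(2)
    language_name = LANGUAGE_NAMES.get(language)
    if language_name is None:
        # Unlisted language: the menu shows only the raw code.
        return language
    if region is None:
        return language_name
    # Unlisted region: fall back to the region code itself.
    region_name = REGION_NAMES.get(region.lower(), region.upper())
    return f"{language_name} ({region_name})"


print(dictionary_label("en_US.dic"))   # English (United States)
print(dictionary_label("tpi_PG.dic"))  # tpi
```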

Putting on my localizer's hat for a moment, I like Axel's proposal of a second file for the bulk of the ISO 639-3 codes, outside of locales/en-US, giving localizers the ability to pick and choose any special ones they'd like to localize, and otherwise defaulting to the names in the IANA registry (taking note that these are really *English* names, not, say, autonyms in Latin script which might be more desirable but would be harder to come by).

Doing this much would seem to resolve the spell checking issue once and forever!

As far as which languages to include in the file requiring localization, I'd say we should be as inclusive as possible.  Language names are easy to localize - most of us have translated the same list of languages a dozen times for different projects, and I expect most l10n teams are sophisticated enough to be using translation memories or PO compendia or whatever.   A few concrete thoughts:

* all ISO 639-1 languages
* any other languages that are currently in languageNames.properties (I see 9 ISO 639-3 codes: ast, fur, hsb, kok, nso, son, tig, tlh, wen); would be silly to throw those translations away
* any languages for which there exists a FLOSS spell checker (I maintain a mostly up-to-date list of these; adds maybe a dozen new languages, a quick grep shows cop, csb, dyu, guc, haw, hil, lnc, mos, nds, shs, tet, tpi as needed)
* any other languages which have active l10n teams for Mozilla products (a handful more)

But again, this is a mere suggestion - I'm satisfied with the ability to localize any that are important to my own locale.

The big sticking point would then seem to be the UI for the Accept-Language settings. It's not so easy to use now, and will only be worse with thousands of languages to choose from. I'd say it should have a search box for sure - but I'll leave this to the UX experts.
For those interested, I've set up a wiki article that tries to spell out our plan for tackling this bug. It's meant to be a living document, so feel free to modify it there or comment on things here. I believe Kevin and I are mostly in agreement on its contents; Axel will be on vacation next week, so we probably won't get to hear his thoughts on it until after that.

https://wiki.mozilla.org/User:GPHemsley/BCP_47

Just to explain it a little: The Plan is roughly in order of operation, but it is also shorthand for the various tasks that need to be accomplished. The Tasks section is essentially what I wrote in comment 34, but now we can more easily iterate on the tasks and answer the outstanding questions. The Areas of Focus are essentially how I understand the various areas of the codebase that this bug would have to touch. You may have a different opinion or more information than I do.

So, the current thinking is, we would have a master list of language names (etc.) based on the IANA database. Then each locale would be able to localize the names if they chose to; otherwise, the values would fall back to what is in the database. I believe that it should be possible to not localize any language names (i.e. have them all be optional); Kevin, I think, prefers what he said in comment 37. But that's something we can discuss further.
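The fallback described here might be sketched roughly as follows. Everything in it is illustrative: the data is a tiny made-up excerpt, and `language_name` is a hypothetical helper rather than anything in the tree.

```python
# Master list: names as given in the IANA registry (tiny illustrative excerpt).
MASTER_NAMES = {"ga": "Irish", "haw": "Hawaiian", "tpi": "Tok Pisin"}

# Optional per-locale overrides; a locale may supply none at all.
LOCALE_OVERRIDES = {
    "fr": {"ga": "irlandais"},
}


def language_name(code, ui_locale):
    """Prefer the locale's own name, then the master list, then the raw code."""
    overrides = LOCALE_OVERRIDES.get(ui_locale, {})
    if code in overrides:
        return overrides[code]
    return MASTER_NAMES.get(code, code)


print(language_name("ga", "fr"))   # irlandais
print(language_name("haw", "fr"))  # Hawaiian
print(language_name("xx", "fr"))   # xx
```

Under this scheme a locale with no overrides still gets a complete list, just with registry names.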

So, that's where we are now. Per Axel's suggestion, I'll be sending an e-mail to the .l10n and .i18n newsgroups tomorrow about this. Until then (and after then), feedback welcome—and remember: it's a wiki!
Thanks Gordon.  Actually I agree that localizers should have the option of not localizing any language names.  What's important to me (speaking as a localizer here) is that a list of "commonly localized" language names be made available for translation, perhaps following the scheme I suggested in comment 37.   I added a bullet to the wiki page to this effect.  No translator wants to sift through thousands of language names looking for the translatable ones.
Hi, people from Chromium are working on this as well:
http://codereview.chromium.org/7086017

Maybe we can take advantage of their code :P
Hmm...

I don't know if it will turn out to be relevant for this bug, but I just came across RFC 6067 [1], which defines "BCP 47 Extension U".

Here's the abstract:
>
   This document specifies an Extension to BCP 47 that provides subtags
   that specify language and/or locale-based behavior or refinements to
   language tags, according to work done by the Unicode Consortium.
<

It allows you to encode a language with some more specific Unicode information.

Here's the example they use in the RFC:
>
   For example, the language tag "de-DE-u-attr-co-phonebk" consists of:

   o  The base language tag "de-DE" (German as used in Germany), exactly
      as defined by [BCP47] using subtags from the IANA Language Subtag
      Registry.

   o  The singleton 'u', identifying this extension.

   o  The attribute 'attr', which is an example for illustration (no
      attributes were defined at the time this document was published).

   o  The keyword 'co-phonebk', consisting of the key 'co' (Collation)
      and the type 'phonebk' (Phonebook collation order).
<

The Unicode Common Locale Data Repository has some further information, including an actual, valid example of 'en-GB-u-kn-true'. [2]

[1] https://tools.ietf.org/html/rfc6067
[2] http://cldr.unicode.org/index/bcp47-extension
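For anyone who wants to see the structure spelled out, here is a small sketch that splits a tag the same way the RFC example does. It is a simplification under a few assumptions: it ignores any other extensions that follow, it does not validate subtags, and `split_unicode_extension` is just an illustrative name.

```python
def split_unicode_extension(tag):
    """Split a tag into (base tag, 'u' attributes, 'u' keywords)."""
    original = tag.split("-")
    subtags = [subtag.lower() for subtag in original]

    # Find the 'u' singleton, but stop at 'x': anything after it is private use.
    start = None
    for index, subtag in enumerate(subtags):
        if subtag == "x":
            break
        if subtag == "u" and index > 0:
            start = index
            break
    if start is None:
        return tag, [], {}

    attributes, keywords, key = [], {}, None
    for subtag in subtags[start + 1:]:
        if len(subtag) == 1:
            break  # the next singleton starts a different extension
        if len(subtag) == 2:
            key = subtag  # two-character subtags are keys
            keywords[key] = []
        elif key is None:
            attributes.append(subtag)  # longer subtags before any key
        else:
            keywords[key].append(subtag)  # type subtags belonging to the key

    base = "-".join(original[:start])
    return base, attributes, {k: "-".join(v) for k, v in keywords.items()}


print(split_unicode_extension("de-DE-u-attr-co-phonebk"))
# ('de-DE', ['attr'], {'co': 'phonebk'})
print(split_unicode_extension("en-GB-u-kn-true"))
# ('en-GB', [], {'kn': 'true'})
```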
I've added a list of references to the wiki that links to all the documents we have to take into consideration in this process. If you find any I missed, please add to the list.

https://wiki.mozilla.org/User:GPHemsley/BCP_47#References
I just found this website which will likely come in very handy:

http://www.langtag.net/registries.html

This parses the IANA Subtag Registry into various separate lists for easier use!

The root website also has some interesting resources:

http://www.langtag.net/
Depends on: 666662
Depends on: 556237
Whiteboard: [See comment 34 and beyond for the latest discussion.] → [bcp47] [See comment 34 and beyond for the latest discussion.]
Keywords: meta
No longer blocks: 656750
Depends on: 656750
Depends on: 525494
Depends on: 481389
Depends on: 142092
Depends on: 181520
Depends on: 331779
No longer blocks: 654467
Depends on: 654467
Depends on: 666731
In case you hadn't noticed, I've added a bunch of dependencies and blockers of this bug, which is now a meta bug. I've also added "[bcp47]" to the whiteboard of all such bugs, for easy querying.

Axel, Kevin, and I had a discussion earlier today regarding our plan for the implementation of BCP 47. Some information about that meeting is available on the wiki [1]. We've also updated the main wiki article [2] to reflect some of our discussion.

Much of our discussion focused on how our changes will affect the UI for both language selection and font selection. (They really haven't changed much since the Netscape days.) Axel will talk to the UX about how to bring those UIs into the modern, multi-lingual, Unicode era.

With regard to the backend, Axel suggested we discuss bug 556237 (font and encoding negotiation) with Simon and John Daggett to determine how that will affect our work and vice versa. In addition, I've filed bug 666662 to get the ball rolling on implementing a master list of language (etc.) names. The first feature to use this master list will be the spellchecker interface; I've filed bug 666731 for that.

[1] https://wiki.mozilla.org/User:GPHemsley/BCP_47/2011-06-23
[2] https://wiki.mozilla.org/User:GPHemsley/BCP_47
Depends on: 535422
Blocks: 535422
No longer depends on: 535422
Depends on: 667734
Depends on: 669321
Depends on: 669598
Depends on: 669814
Depends on: 556236
No longer depends on: 556236
Depends on: 672448
Depends on: 672320
I support specifying the Chinese script (zh-Hant for Traditional Chinese and zh-Hans for Simplified Chinese), which BCP 47 promotes, instead of the location (both zh-tw and zh-hk use Traditional Chinese, and zh-cn uses Simplified Chinese). Locations should be subtags of zh-Hant and zh-Hans. But as of Firefox 6.0, zh-Hant and zh-Hans are still not available as options.
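To illustrate, a mapping from the existing region-only tags to script-bearing tags could be sketched like this. The table covers only the three regions mentioned above, and `add_chinese_script` is a hypothetical helper, not an existing function.

```python
# Script implied by each region, limited to the three cases mentioned above.
IMPLIED_SCRIPT = {"tw": "Hant", "hk": "Hant", "cn": "Hans"}


def add_chinese_script(tag):
    """Rewrite zh-<region> as zh-<script>-<REGION>; leave other tags alone."""
    subtags = tag.split("-")
    if len(subtags) != 2 or subtags[0].lower() != "zh":
        return tag
    region = subtags[1]
    script = IMPLIED_SCRIPT.get(region.lower())
    if script is None:
        return tag
    # Conventional casing: title-case script, upper-case region.
    return f"zh-{script}-{region.upper()}"


print(add_chinese_script("zh-tw"))  # zh-Hant-TW
print(add_chinese_script("zh-cn"))  # zh-Hans-CN
```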


(In reply to Tim Guan-tin Chien [:timdream] (MozTW.org) from comment #12)
> First of all, I support this change; the current language tags does not
> sufficient to describe different types of Chinese scripts. 
> 
> (In reply to comment #9)
> > (In reply to comment #6)
> > > see also bug 230866 (Taiwan / Macedonia) - be VERY careful with changes like
> > > that !
> > > 
> > > and we also have the kp/kr pair (North and South Korea) ...
> > > 
> > > Also note that various country names are now changed to the official one, not
> > > the common English variant for it, so they won't be easily found in an
> > > alphabetic list. Like Iran, Cote d'Ivoire, or Tanzania.
> > As I said in comment 4, the files were automatically generated using the names
> > officially associated with the tags, as listed in the IANA registry. It is not
> > my place to decide what any country should be called.
> 
> Hi there,
> 
> I think the essence here is "not to make political statement."
> 
> "Following standard" is a lazy excuse for the problem as there is no
> official standard for Country/Region names, at least not definitive as W3C
> Spec. I do not want to start an essay about the definition and recognition
> of a nation-state; please contact your nearest Int'l relationship majors or
> Wikipedia for detail.
> 
> Mozilla is a F/OSS project, not United Nations. No one should be allowed for
> pushing their ideology and we should avoid doing so for anyone. ISO 3166
> names had caused a lot of headache among other free software communities,
> Please try not to bring it to Mozilla.
> 
> So, for the sake of political correctness, please keep disputed regions in
> their short names, without prefix or suffix about their political status
> ("Republic of", "Region of", "Provence of") thus removing the implication.
> In this way we could keep all parties in peace.
> 
> 
> Tim
> Mozilla Taiwan Community (MozTW)
Depends on: 684335
Depends on: 696642
Depends on: 716321
No longer blocks: 531849, 535422, 586085
Depends on: 716377
Depends on: 730209
Depends on: 705542
maple is now available as per bug 731617#c18. Follow the instructions in the wiki if you still want a project branch [1]. Otherwise, please remove your name from the list of waiting teams.

[1]
https://wiki.mozilla.org/ReleaseEngineering/DisposableProjectBranches#Do_you_need_a_disposable_branch.3F
Depends on: 730625
Depends on: 769872
Depends on: 1048153
Depends on: 1054739
Blocks: 1263437

Jonathan - do you think we can close this now? We use BCP47 across our codebase with MozLocale and LocaleService canonicalizing to that. Maybe some of the old bugs and their dependencies can be closed now?

Flags: needinfo?(jfkthame)

I guess this probably isn't really useful/relevant at this point (but I suspect some of the specific related bugs may still be valid issues that we want to address in some way, so we'd need to look at them individually rather than just mass-closing).

Flags: needinfo?(jfkthame)
Summary: BCP 47 (RFC 5646 and 4647; IANA Language Subtag Registry) support → [meta] BCP 47 (RFC 5646 and 4647; IANA Language Subtag Registry) support

The bug assignee hasn't logged in to Bugzilla in the last 7 months.
:m_kato, could you have a look please?
For more information, please visit auto_nag documentation.

Assignee: smontagu → nobody
Flags: needinfo?(m_kato)
Severity: normal → S3
Type: defect → task
Flags: needinfo?(m_kato)
Priority: -- → P3