[zh-TW] Verify language and region names from CLDR

NEW
Assigned to

Status

enhancement
a year ago
a year ago

People

(Reporter: flod, Assigned: flod)

Tracking

(Blocks 1 bug)

Firefox Tracking Flags

(Not tracked)

Details

(Whiteboard: cldr-data, )

Attachments

(3 attachments, 1 obsolete attachment)

Assignee

Description

a year ago
Posted file Differences for zh-TW (obsolete) —
Translations for both language and region names are available in CLDR, but right now we ask all our localization teams to provide their own translations for them.

For reference, language names are used in preferences to set up the Accept Language setting (Language section in General), but also in the context menu for dictionaries.

In the long term, we'd like to use CLDR as a source for this data; for this reason we're investigating the differences between our localizations and CLDR, spot checking some languages.

So far we've started with French and Italian, finding that CLDR had the right information most of the time. As a next step, we decided to move to a language with a higher percentage of differences:
* There are 203 language names in Mozilla, 195 (96.6%) have a different translation compared to CLDR. That's above average (46%).
* There are 272 region names in Mozilla, 40 (14.11%) have a different translation compared to CLDR. That's below average (31%).

https://hg.mozilla.org/l10n-central/zh-TW/file/default/toolkit/chrome/global/regionNames.properties
https://hg.mozilla.org/l10n-central/zh-TW/file/default/toolkit/chrome/global/languageNames.properties

As a reference I used zh-Hant in CLDR.
https://github.com/unicode-cldr/cldr-localenames-modern/tree/master/main
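
For anyone wanting to reproduce this kind of comparison locally, here is a minimal sketch. It is not the script used to produce the attachment. It assumes the Mozilla names come from simple `key = value` lines and that the CLDR names are already loaded as a flat code-to-name mapping. The function names (`parse_properties`, `compare_names`) and the sample values are illustrative only, and the percentage here is computed over the codes present in both sources, which may differ from how the numbers above were calculated.

```python
import json


def parse_properties(text):
    """Parse simple 'key = value' lines, skipping blanks and '#' comments."""
    entries = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        entries[key.strip().lower()] = value.strip()
    return entries


def compare_names(mozilla, cldr):
    """Return (differences, percentage) for the codes present in both sources."""
    shared = sorted(set(mozilla) & set(cldr))
    diffs = {c: (mozilla[c], cldr[c]) for c in shared if mozilla[c] != cldr[c]}
    pct = 100.0 * len(diffs) / len(shared) if shared else 0.0
    return diffs, pct


# Illustrative sample data, not the real files.
SAMPLE_PROPERTIES = """\
# language names
an = 阿拉貢語
zh = 中文
"""
SAMPLE_CLDR_JSON = '{"an": "阿拉貢文", "zh": "中文"}'

mozilla = parse_properties(SAMPLE_PROPERTIES)
cldr = json.loads(SAMPLE_CLDR_JSON)
diffs, pct = compare_names(mozilla, cldr)
for code, (moz_name, cldr_name) in diffs.items():
    print(f"{code}: Mozilla={moz_name} CLDR={cldr_name}")
print(f"{len(diffs)} differences ({pct:.2f}%)")
```

The real CLDR JSON files nest the names more deeply than this flat mapping, so a loader for them would need an extra step to reach the code-to-name dictionary.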

Here's the ask:
- Can you go through the list of differences, and explain the reason for so many differences?
- Would the CLDR names need to be adapted to be used in Firefox?
- Can the CLDR data be improved? 

One more note: some language and region names differ between Mozilla and CLDR in the English source. As a consequence, the translations will differ too. For example, "Southern Sotho" vs "Sotho, Southern", or "Caribbean Netherlands" vs "Bonaire, Sint Eustatius, and Saba".

See also bug 1434854 for an example (Italian).
Regarding the differences on both sides:

- Language Names. The differences fall into these categories:

  * The difference is between ~文 (CLDR) and ~語 (Mozilla) (e.g., ach)
文 means "script", while 語 means the language itself more generally, or the spoken language. The exceptions in languageNames.properties are zh, where we used "中文" (Chinese) to represent 國語/漢語/普通話/標準華語 (Mandarin, as it is called in different countries), and ja and nl. ja and nl should be corrected to use ~語 if the premise is correct.

  * The Mozilla translation has a country name in parentheses (e.g., ce & ch)
The country or region name in the parentheses is where the language is mostly used or spoken. Most of these are there to help users identify the language, since most of the names are simply phonetic transliterations, which are meaningless if the language is not well known in Taiwan. I don't think these region names should be contributed back to CLDR anyway. I think this category and the previous one are the key reasons the percentage is above average.

  * Different translations on both sides that do not fall into either category above (e.g., br & ca)
Again, language names are usually transliterations, so the CLDR contributor might simply have used a different pronunciation, or referred to a different linguistics textbook or another source than I did. Both can be correct.

  * Minor: One side used a transliteration, the other translated by meaning (e.g., vo)
We should avoid this situation, as it can produce duplicate translations: the Mozilla translation for vo/Volapük is 世界語 ("The World Language"), which conflicts with eo/Esperanto, which uses the same name.

  * Minor: The country where the language is most spoken was used as the translation (e.g., dz)
If the language is less known to local users, we might simply base its translation on the country where the language is widely spoken or is an official language. So for Dzongkha we localized it as 不丹語 "Bhutanese".

- Country Names

  * Dependency prefixes (e.g., AW, MF)
The CLDR translation means "Dutch Aruba" while our source only has "Aruba", so I did not translate 荷屬 "Dutch".

  * Different translations on both sides (e.g., BL & BV)
Same as the third point for language names: just different transliterations made by different people.

  * Geographical suffixes (e.g., JE)
Jersey is an island, but there is no "island" in the source text. Do we have to add it? I didn't, but CLDR did. I don't think we should add it when we are referring to it as a region or country.

  * Different source text (e.g., HK, MO)
As you mentioned, these differences can be due to politics or other reasons.

  * Minor: The part after "and" is missing in the Mozilla translation (e.g., AG, VC)
For example, the translation of Barbuda is missing for AG/Antigua and Barbuda. I will add these back.
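
The duplicate-name problem mentioned for vo and eo can be checked mechanically. A minimal sketch, assuming the names are available as a code-to-name mapping. The helper name `find_duplicate_names` is hypothetical, and the sample values only mirror what is described in this comment (vo and eo sharing 世界語, dz as 不丹語); they were not read from the actual file.

```python
from collections import defaultdict


def find_duplicate_names(names):
    """Group language codes by translated name and keep names used more than once."""
    by_name = defaultdict(list)
    for code, name in names.items():
        by_name[name].append(code)
    return {name: sorted(codes) for name, codes in by_name.items() if len(codes) > 1}


# Illustrative sample based on the cases described above.
sample = {"vo": "世界語", "eo": "世界語", "dz": "不丹語"}
duplicates = find_duplicate_names(sample)
for name, codes in duplicates.items():
    print(f"{name} is used by: {', '.join(codes)}")
```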



Regarding the other questions:
- Would the CLDR names need to be adapted to be used in Firefox?
Yes. If I need to change the language fallback preference, the current implementation of the language dropdown list makes it difficult to find the language I want and to be sure it is the one I'm adding to the list. Maybe sort the list by language code?

Another issue I can think of with using CLDR translations without adaptation: if CLDR someday changed its data because of the well-known political issues, it would be troublesome.

- Can the CLDR data be improved? 
Probably there is no need to improve it, as all of the entries in CLDR are already localized well.


Anyway, thanks for building the list so I can spot the issues in current translation.
Assignee

Comment 2

a year ago
Thanks for looking into it so quickly.

> * When the difference is between ~文 (CLDR) / ~語 (Mozilla) (e.g., ach)
> 文 means "script", while 語 means the more general language itself, or spoken
> language. The exception in languageNames.properties is zh, we used "中文"
> (Chinese) to represent 國語/漢語/普通話/標準華語 (Mandarin, in different countries), ja
> and nl. ja and nl should be corrected to use ~語 if the premise is correct. 

That (-文 vs -語), together with the country name in parentheses, definitely explains the number of differences.

That's also an interesting point to figure out: is the name a reference to written or spoken language? Clearly CLDR thinks it's script/text, and it might make sense given the context.

>   * When the Mozilla translation has a country name in parentheses (e.g., ce
> & ch)
> The country or region name in the parentheses is where the language mostly
> used or spoken. Most of these are used to help users to determine the
> language as most of them are simply phonetic transliteration which makes it
> non-sense if the language is less popular in Taiwan. I don't think these
> region names should be contributed back to CLDR anyway. I think this and the
> previous one are the key reason made it over average.

Agreed. 'ch' is region agnostic, you would then have ch-CH for Swiss spoken in Switzerland.

>   * Different translation in both side, but not fall in both categories
> above (e.g., br & ca)
> Again, the language names are usually a transliteration so the CLDR
> contribution might just had a different pronounciation, referred from a
> different linguistic textbook, or different source to me. Both can be
> correct.

For these cases, it would be useful to figure out if there are third party references that can be used to either improve our localization, or CLDR.

> - Country Names
> 
> * Dependency prefixes (e.g., AW, MF)
> The CLDR translation means "Dutch Aruba" while our source only have "Aruba"
> so I did not translated 荷屬 "Dutch"

So, there are some expected differences. Having said that, AW and MF don't have "Dutch" in the original text on CLDR
https://github.com/unicode-cldr/cldr-localenames-modern/blob/master/main/en/territories.json#L58
https://github.com/unicode-cldr/cldr-localenames-modern/blob/master/main/en/territories.json#L199

Maybe that should be reported back to CLDR, if the zh-Hant translation is arbitrarily adding "Dutch", and there's no external evidence saying it should be added.

Sadly, in my experience it's quite hard to find references for region names (National Geography institutions and similar).

> * Geographical suffixs. (e.g, JE)
> Jersey is a island but there is no "island" in the source text. Do we have
> to add it? I didn't but CLDR did. I don't think we should add when we are
> mentioning it as a region or country. 

As above: JE for CLDR is just "Jersey". Unless there is an established name with "Island", it might be worth reporting it.

>   * Different source text (e.g. HK, MO)

A few years ago, we switched to use an external list (GENC) for Mozilla, which has some differences from the list used by CLDR (some more details are in bug 1431324).

As long as it's an external source, officially recognized, we have a reason to use that nomenclature.

> - Would the CLDR names need to be adapted to be used in Firefox?
> Yes. If I need to change language fallback preference, the current implement
> for language dropdown list would make it difficult to pick and make sure the
> language is what I want to add to list. Maybe sort the list by language code?

To clarify: is that independent from using CLDR, or does the experience significantly degrade when using CLDR names? How so?

> Another issue using CLDR translation without adapt I can think of is if CLDR
> changed data due to the well-known politics decision someday, it would be
> troublesome. 

See point above. It's an official list, supported by all major tech companies (Google, Microsoft, Apple, IBM). If they change something, they need to do it carefully.

> Anyway, thanks for building the list so I can spot the issues in current
> translation.

No problem, let me know if you want an updated list of differences at some point, I just need to run a script on my machine.
(In reply to Francesco Lodolo [:flod] from comment #2)
> Thanks for looking into it so quickly.
> 
> > * When the difference is between ~文 (CLDR) / ~語 (Mozilla) (e.g., ach)
> > 文 means "script", while 語 means the more general language itself, or spoken
> > language. The exception in languageNames.properties is zh, we used "中文"
> > (Chinese) to represent 國語/漢語/普通話/標準華語 (Mandarin, in different countries), ja
> > and nl. ja and nl should be corrected to use ~語 if the premise is correct. 
> 
> That, together with the country name between parenthesis, definitely
> explains the number of differences (-文 vs -語). 
> 
> That's also an interesting point to figure out: is the name a reference to
> written or spoken language? Clearly CLDR thinks it's script/text, and it
> might make sense given the context.

Yeah. My logic is that one script can be used to write different languages, but a spoken language name identifies only that language.

> 
> >   * When the Mozilla translation has a country name in parentheses (e.g., ce
> > & ch)
> > The country or region name in the parentheses is where the language mostly
> > used or spoken. Most of these are used to help users to determine the
> > language as most of them are simply phonetic transliteration which makes it
> > non-sense if the language is less popular in Taiwan. I don't think these
> > region names should be contributed back to CLDR anyway. I think this and the
> > previous one are the key reason made it over average.
> 
> Agreed. 'ch' is region agnostic, you would then have ch-CH for Swiss spoken
> in Switzerland.
> 
> >   * Different translation in both side, but not fall in both categories
> > above (e.g., br & ca)
> > Again, the language names are usually a transliteration so the CLDR
> > contribution might just had a different pronounciation, referred from a
> > different linguistic textbook, or different source to me. Both can be
> > correct.
> 
> For these cases, it would be useful to figure out if there are third party
> references that can be used to either improve our localization, or CLDR.

It would be very difficult, as very few people in TW/HK, or even none, study those languages. Effectively everyone is a third party, and you can't find an authority.

> 
> > - Country Names
> > 
> > * Dependency prefixes (e.g., AW, MF)
> > The CLDR translation means "Dutch Aruba" while our source only have "Aruba"
> > so I did not translated 荷屬 "Dutch"
> 
> So, there are some expected differences. Having said that, AW and MF don't
> have "Dutch" in the original text on CLDR
> https://github.com/unicode-cldr/cldr-localenames-modern/blob/master/main/en/
> territories.json#L58
> https://github.com/unicode-cldr/cldr-localenames-modern/blob/master/main/en/
> territories.json#L199
> 
> Maybe that should be reported back to CLDR, if the zh-Hant translation is
> arbitrarily adding "Dutch", and there's no external evidence saying it
> should be added.

Agreed. 

> Sadly, in my experience it's quite hard to find references for region names
> (National Geography institutions and similar).

The same situation exists here. I can refer to the Ministry of Foreign Affairs for a list of country/region translations, but there are still some region or island names that require other references, such as the National Library or customs/banks.


> > * Geographical suffixs. (e.g, JE)
> > Jersey is a island but there is no "island" in the source text. Do we have
> > to add it? I didn't but CLDR did. I don't think we should add when we are
> > mentioning it as a region or country. 
> 
> As above: JE for CLDR is just "Jersey". Unless there is an established name
> with "Island", it might be worth reporting it.

Agreed. 

> >   * Different source text (e.g. HK, MO)
> 
> A few years ago, we switched to use an external list (GENC) for Mozilla,
> which has some differences from the list used by CLDR (some more details are
> in bug 1431324).
> 
> As long as it's an external source, officially recognized, we have a reason
> to use that nomenclature.

I was just looking for those! Thanks for raising the keyword and bug number.

> > - Would the CLDR names need to be adapted to be used in Firefox?
> > Yes. If I need to change language fallback preference, the current implement
> > for language dropdown list would make it difficult to pick and make sure the
> > language is what I want to add to list. Maybe sort the list by language code?
> 
> To clarify: is that independent from using CLDR, or does the experience
> significantly degrade when using CLDR names? How so?

Ah, I was referring to the UI layout in preferences. It's actually difficult for users of ideographic languages to add a language this way. For example, suppose I have to add Italian to the list:

1. Italian is usually translated as 義大利語 or 意大利語 in Chinese (different characters with the same pronunciation, no difference in meaning).

2. Go through the list, find the position of 義大利 and add it to the list. If I can't find it at first glance, I have to write down "義" and other characters in the list to determine whether to go up or down (the list is ordered by stroke count).

3. If "意"大利語 was what came to mind in step 1, then I won't find it, as the strokes are different.

If it is a 3-character language name, or the translation is totally different, the language code is the last resort. Luckily, people usually don't need to change this preference.
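
To illustrate the "sort by language code" idea from my earlier comment, here is a minimal sketch. It does not reflect how the preferences list is actually built. The helper name `sort_by_code` and the sample entries are illustrative only.

```python
def sort_by_code(names):
    """Return (code, name) pairs ordered by language code rather than by name."""
    return sorted(names.items(), key=lambda item: item[0])


# Illustrative sample; the names are taken from examples in this bug.
sample = {"it": "義大利語", "dz": "不丹語", "an": "阿拉貢語"}
ordered = sort_by_code(sample)
for code, name in ordered:
    # Showing the code first lets a user scan for "it" without guessing
    # which character the translated name starts with.
    print(f"{code}  {name}")
```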


Regarding CLDR names: if we could make it clear that they are language names rather than names of written scripts, and there is no need to have the country name as a remark/suffix, then I think it's fine to use CLDR data.

> > Another issue using CLDR translation without adapt I can think of is if CLDR
> > changed data due to the well-known politics decision someday, it would be
> > troublesome. 
> 
> See point above. It's an official list, supported by all major tech
> companies (Google, Microsoft, Apple, IBM). If they change something, they
> need to do it carefully.

Agreed. Hopefully we won't someday hear the news that Mozilla products are banned in some country for violating its laws.

> 
> > Anyway, thanks for building the list so I can spot the issues in current
> > translation.
> 
> No problem, let me know if you want an updated list of differences at some
> point, I just need to run a script on my machine.

I pushed some changes via Pontoon, which correct the missing parts and mistranslations mentioned above.
https://hg.mozilla.org/l10n-central/zh-TW/rev/01a03a3bbbc7
https://hg.mozilla.org/l10n-central/zh-TW/rev/7aaa491b73ab

One exception I made was mf (Saint-Martin). The reason is simple: to distinguish the French and Dutch dependencies from each other.
Assignee

Comment 4

a year ago
Thanks Peter. Are you able to provide in a comment the list of individual changes that should be reported to CLDR, with an explanation on why the change should happen?

> Ah I was referring the UI layout in preferences. It's actually difficult for
> ideographic language user to add translation this way. For example: If I
> have to put Italian to the list.
> 
> 1.  Italian is usually translated to 義大利語 or 意大利語 in Chinese (different
> characters with same pronounciation, no difference in meanings).
> 
> 2. Go through the list, find out the position of 義大利 and add to the the
> list. If I can't find out at the first glance, then have to write down "義"
> and other characters in the list to determine to go up or down (the list was
> ordered by the stroke amount). 
> 
> 3. If it was "意"大利語 came up in my mind in step 1, then I won't find it as
> the strokes are different. 
> 
> If it was a 3-character language, or the translation is totally different,
> language code is the last resort. Luckily people usually won't need to
> change the preference. 

This might be useful information for the prefs reorg we're doing about languages.
(In reply to Francesco Lodolo [:flod] from comment #4)
> Thanks Peter. Are you able to provide in a comment the list of individual
> changes that should be reported to CLDR, with an explanation on why the
> change should happen?
> 

Sure, but could you re-run the script on the latest data? I will attach an explanation for each name to be reported back.
Flags: needinfo?(francesco.lodolo)
Assignee

Comment 6

a year ago
Updated file with differences
Attachment #8956012 - Attachment is obsolete: true
Flags: needinfo?(francesco.lodolo)
Posted file Language List.txt
I think we should first decide whether or not to report the changes with reason 1, as the ambiguity of "文" plays a big part here.

For reasons 2 & 3 I'm fine to use CLDR.

Changed Reasons:
1. change 文 (script) to 語 (language): all
2. wrong translation
3. the translation was a transliteration, different words were used but have the same or similar pronunciation
4. the translation was a literal translation, different words were used but have the same or similar meaning
5. one side translated the name to "country + language", the other side transliterated the language name
6. others (reason explained one-by-one)

Exception: for all Mozilla translations except eo, ia, ie, nb and nn, please ignore the part in parentheses.
Posted file Region List.txt
Changed Reasons:
1. Both translations are used in Taiwan
2. Transliteration, different words were used but have the same or similar pronunciation
3. Different Translation
4. Others

For different strings with reason 1 & 2, I'm fine if we use the translation from CLDR unless noted.
Assignee

Comment 9

a year ago
Thanks Peter, will take a look next week.

My feeling is that we should not report error 1 for languages, and should use "script" in Mozilla too.

For example, I'm looking at my Google profile, and they use 文. The same happens in Apple's preference to add a language. Microsoft too: https://www.microsoft.com/zh-tw/language/Search
Assignee

Comment 10

a year ago
I've started putting together a spreadsheet:
https://docs.google.com/spreadsheets/d/1-PBB6C6pvcxugIN_3s_i8xK7Y3hg30Mz9Eb-Sz1nK30/edit#gid=1212643450

Two general notes:

1) There are differences even for English, those have a colored background in the spreadsheet. I think it makes sense to completely ignore them. Do you agree?


2) 文 vs 語. Looks like the industry uses the former, but I have no linguistic knowledge to confirm if that's a poor choice or not. Would it be OK to assume that 文 is the right way to translate these names, and ignore the difference when that's the only change between the two translations? I've also double checked all the Traditional Chinese variants in CLDR, and they all use 文 (语 for Simplified).

Besides that, I have one more doubt. Take for example Aragonese (an): 

CLDR: 阿拉貢文
Mozilla: 阿拉貢語

That's clearly due to 文 vs 語.

But, consider Asturian:

CLDR: 阿斯圖里亞文
Mozilla: 阿斯圖里亞語 (西班牙西北)

In your file, it's indicated as (1), but there's also the whole part in parentheses (Google Translate tells me "Northwest Spain"). Is there a reason for adding geographical information to a language?
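
The two cases above could be told apart mechanically. A minimal sketch, assuming the only things to normalize are a trailing parenthetical note and the final 文/語 character. The names `strip_parenthetical` and `classify` are hypothetical, and this is not part of the script used for the attachments.

```python
import re

# Trailing note in ASCII or full-width parentheses, e.g. " (西班牙西北)".
_PAREN_SUFFIX = re.compile(r"\s*[\(（][^\)）]*[\)）]\s*$")


def strip_parenthetical(name):
    """Remove a trailing parenthetical note, if any."""
    return _PAREN_SUFFIX.sub("", name).strip()


def classify(cldr_name, mozilla_name):
    """Return (kind, had_note); kind is 'same', 'suffix-only' or 'different'."""
    base = strip_parenthetical(mozilla_name)
    had_note = base != mozilla_name.strip()
    if base == cldr_name:
        kind = "same"
    elif (
        base
        and cldr_name
        and base[:-1] == cldr_name[:-1]
        and {base[-1], cldr_name[-1]} == {"文", "語"}
    ):
        kind = "suffix-only"
    else:
        kind = "different"
    return kind, had_note


aragonese = classify("阿拉貢文", "阿拉貢語")
asturian = classify("阿斯圖里亞文", "阿斯圖里亞語 (西班牙西北)")
print(aragonese)
print(asturian)
```

Both examples come out as a 文/語-only difference; the second value flags that the Mozilla name also carried a geographical note.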
(In reply to Francesco Lodolo [:flod] from comment #10)
> I've started putting together a spreadsheet:
> https://docs.google.com/spreadsheets/d/1-PBB6C6pvcxugIN_3s_i8xK7Y3hg30Mz9Eb-
> Sz1nK30/edit#gid=1212643450

Great, that's much better than my raw data ;) Will update that spreadsheet.

> Two general notes:
> 
> 1) There are differences even for English, those have a colored background
> in the spreadsheet. I think it makes sense to completely ignore them. Do you
> agree?

That's exactly the same situation as my case 3 for language names: similar pronunciation but no difference in meaning.

For the English part we can ignore the differences. For the Chinese translations, either keeping the difference or accepting the translation from CLDR would be fine.

> 2) 文 vs 語. Looks like the industry uses the former, but I have no linguistic
> knowledge to confirm if that's a poor choice or not. Would it be OK to
> assume that 文 is the right way to translate these names, and ignore the
> difference when that's the only change between the two translations? I've
> also double checked all the Traditional Chinese variants in CLDR, and they
> all use 文 (语 for Simplified).

So it seems the zh-Hans team at Unicode supports my view. Per CLDR's translation guidelines [1], I believe the name should refer to the language itself, but I don't know CLDR's process or whether any resolution led them to use 文/script.

I'm not 100% confident in saying whether it's CLDR's mistake or Mozilla's. While localizing Common Voice, the ambiguity between language/script/spoken names in different contexts actually confused me.

> Besides that, I have one more doubt. Take for example Aragonese (an): 
> 
> CLDR: 阿拉貢文
> Mozilla: 阿拉貢語
> 
> That's clearly due to 文 vs 語.
> 
> But, consider Asturian:
> 
> CLDR: 阿斯圖里亞文
> Mozilla: 阿斯圖里亞語 (西班牙西北)
> 
> In your file, it's indicated as (1), but there's also the whole thing
> between parenthesis (Google Translate tells me "Northwest Spain"). Is there
> are a reason for adding geographical information to a language?

It is actually a practice that can be traced back to the beginning of l10n-central [2]. My wild guess is that the data was a legacy of the Netscape codebase, probably pulled from a source like CLDR with some human revision. From my understanding, it was used to help users confirm which language they are selecting (as you quoted in comment 4).

I agree it is not required and that it sometimes causes issues, but I still keep what is already there in the repository. Hence I said "Exception: for all Mozilla translations except eo, ia, ie, nb and nn, please ignore the part in parentheses."

For newer languages added to languageNames.properties, I tend not to add the geographical suffix (e.g., [3]) unless leaving it out would cause confusion.


[1]
http://cldr.unicode.org/translation/language-names
http://cldr.unicode.org/translation/localepattern
[2] https://hg.mozilla.org/l10n-central/zh-TW/file/d000f82c41eb/toolkit/chrome/global/languageNames.properties
[3] https://hg.mozilla.org/l10n-central/zh-TW/diff/639779e5b9d6/toolkit/chrome/global/languageNames.properties
Assignee

Comment 12

a year ago
Thanks for fixing the regions, I'll report them to CLDR as soon as the Survey tool is open.

What about languages? Ignoring the "文 vs 語", is there anything that should be reported?
(In reply to Francesco Lodolo [:flod] from comment #12)
> Thanks for fixing the regions, I'll report them to CLDR as soon as the
> Survey tool is open.
> 
> What about languages? Ignoring the "文 vs 語", is there's anything that should
> be reported?

Yes. Ignoring the "文 vs 語" question and the country suffix for now, there are still several translations to report: eo, gv, ia, ie, nb, nn, st. See the Notes column in the spreadsheet for my reasons.

Again, when reporting to CLDR, please leave the parentheses out for gv (曼語) & st (南索托語).

I added another value, "?", to the "Report to CLDR?" column, as I'm not sure how we should deal with these transliteration/pronunciation issues.
Assignee

Comment 14

a year ago
Thanks Peter, I'll try to figure out how the CLDR Survey Tool works as soon as it reopens.

Taking the bug for now, so I don't lose track of it.
Assignee: petercpg → francesco.lodolo
Whiteboard: cldr-data