Open
Bug 1123497
Opened 10 years ago
Updated 3 years ago
Some Arabic script based languages are not considered Arabic on font selection
Categories
(Core :: Layout: Text and Fonts, defect)
Tracking
()
UNCONFIRMED
People
(Reporter: ebrahim, Unassigned)
Details
(Whiteboard: [gfx-noted])
Attachments
(2 files)
17.81 KB, image/png
12.78 KB, patch
In Bug 1081514 we changed the sans-serif font for Arabic-script languages to Segoe UI. It seems to work well for languages with two-letter codes but fails for Arabic-script languages with three-letter codes. It also fails in Chrome and IE11, so maybe this is an issue with the standards or ICU?
Here is a testcase:
data:text/html;charset=utf8,<div style="font-family: sans-serif; font-size: 400%"><div lang="ar">%D8%B7%D8%A8%D9%82</div><div lang="fa">%D8%B7%D8%A8%D9%82</div><div lang="ur">%D8%B7%D8%A8%D9%82</div><div lang="ckb">%D8%B7%D8%A8%D9%82</div><div lang="mzn">%D8%B7%D8%A8%D9%82</div><div lang="glk">%D8%B7%D8%A8%D9%82</div></div>
All lines should be rendered with exactly the same font, but they are not (see the attached screenshot).
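For anyone who wants to try other language tags, the testcase can be regenerated with a few lines of Python. This is only a sketch: the helper name `build_testcase` is made up here, and the default word is the same three-letter string that is percent-encoded in the URL above.

```python
from urllib.parse import quote

def build_testcase(langs, word="\u0637\u0628\u0642"):
    """Build a data: URL with one <div lang=...> per language tag.

    `word` is percent-encoded as UTF-8, matching the testcase above.
    """
    rows = "".join(
        f'<div lang="{lang}">{quote(word)}</div>' for lang in langs
    )
    return (
        'data:text/html;charset=utf8,'
        '<div style="font-family: sans-serif; font-size: 400%">'
        f'{rows}</div>'
    )

url = build_testcase(["ar", "fa", "ur", "ckb", "mzn", "glk"])
print(url)
```

Paste the printed URL into the address bar and compare the rendered lines.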
Comment 1•10 years ago
Do you guys have an idea on this?
Flags: needinfo?(jfkthame)
Flags: needinfo?(jdaggett)
Whiteboard: [gfx-noted]
Comment 2•10 years ago
In the case of Firefox, I'd guess this is simply because the language codes involved (along with many others) are not assigned to an appropriate langGroup in http://mxr.mozilla.org/mozilla-central/source/intl/locale/langGroups.properties, which only lists a small selection of all the possible tags.
AFAIK, we don't currently have access to a comprehensive mapping from language codes to the likely script used to write the language, which is what we'd need here; hence we're stuck with the ad hoc list in langGroups.properties. A much more complete resource would be the CLDR data,[1] for example, if we packaged that with a suitable API; this includes mappings such as
<language type="ckb" scripts="Arab"/>
<language type="mzn" scripts="Arab"/>
<language type="glk" scripts="Arab"/>
that would direct us to the appropriate font prefs for the examples given.
Meanwhile, the potential band-aid here would be to add more entries to langGroups.properties.
[1] http://unicode.org/repos/cldr/trunk/common/supplemental/supplementalData.xml
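As a rough illustration of what consuming that data could look like, here is a minimal Python sketch that reads `<language type=... scripts=.../>` entries and keeps only languages that list a single script. The `<languageData>` wrapper and the `xx` entry are placeholders for this example, and this is not how Gecko currently does the lookup.

```python
import xml.etree.ElementTree as ET

# Fragment in the shape quoted above. The wrapper element is an assumption
# and the "xx" entry is a made-up placeholder for a multi-script language.
SAMPLE = """
<languageData>
  <language type="ckb" scripts="Arab"/>
  <language type="mzn" scripts="Arab"/>
  <language type="glk" scripts="Arab"/>
  <language type="xx" scripts="Arab Latn"/>
</languageData>
"""

def single_script_languages(xml_text):
    """Return {language: script} for entries that list exactly one script."""
    mapping = {}
    for elem in ET.fromstring(xml_text).iter("language"):
        scripts = elem.get("scripts", "").split()
        if len(scripts) == 1:
            mapping[elem.get("type")] = scripts[0]
    return mapping

LANG_TO_SCRIPT = single_script_languages(SAMPLE)
print(LANG_TO_SCRIPT)
```

Languages with more than one listed script are skipped on purpose, since picking one of them needs a human decision.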
Flags: needinfo?(jfkthame)
Comment 3•10 years ago
Seems like this is a more appropriate category.
Component: Graphics: Text → Layout: Text
Comment 4•10 years ago
Adding in more extensive lang ==> langGroup mappings. Constructed from the CLDR reference noted in comment 2 and merged into the existing list. For the most part, prefer existing references when they exist.
Flags: needinfo?(jdaggett)
Attachment #8554372 -
Flags: review?(smontagu)
Comment 5•10 years ago
Can we get away with adding a ton of languages to langGroups.properties and not to languageNames.properties?
Comment 6•10 years ago
(In reply to Simon Montagu :smontagu from comment #5)
> Can we get away with adding a ton of languages to langGroups.properties and
> not to languageNames.properties?
I think so? I don't think langGroups is exposed to the UI directly anywhere.
FTR, I've already generated a file based on the Suppress-Script field in the Language Subtag Registry that is a small list of what script a language uses by default if a script subtag isn't specified:
https://github.com/GPHemsley/BCP47/blob/master/properties/full/scriptSuppress.properties
Ideally, I expect this to be a subset of what the CLDR provides, but I didn't confirm that to be the case. Note that it maps directly to script subtags, rather than the legacy Mozilla language groups. (Did we never switch x-western to Latn?) Note also that this list doesn't contain the three 3-char language tags at issue here.
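For reference, extracting Suppress-Script from the registry's record format can be sketched roughly as below. The sample is heavily abbreviated (real records carry more fields), the function name is hypothetical, and folded continuation lines are simply skipped, which is a simplification.

```python
# Abbreviated sample in the registry's record format ("%%" separates records).
SAMPLE_REGISTRY = """\
%%
Type: language
Subtag: ar
Description: Arabic
Suppress-Script: Arab
%%
Type: language
Subtag: fa
Description: Persian
Suppress-Script: Arab
%%
Type: language
Subtag: ckb
Description: Central Kurdish
"""

def suppress_scripts(registry_text):
    """Return {language subtag: script} for records with Suppress-Script."""
    result = {}
    for record in registry_text.split("%%"):
        fields = {}
        for line in record.splitlines():
            # Skip blank lines and folded continuation lines.
            if line[:1].isspace() or ": " not in line:
                continue
            key, value = line.split(": ", 1)
            fields.setdefault(key, value)  # keep the first of repeated fields
        if fields.get("Type") == "language" and "Suppress-Script" in fields:
            result[fields["Subtag"]] = fields["Suppress-Script"]
    return result

print(suppress_scripts(SAMPLE_REGISTRY))
```

The `ckb` record has no Suppress-Script field, so it drops out of the result, which is the gap noted above.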
Comment 7•10 years ago
Comment on attachment 8554372 [details] [diff] [review]
patch, add more lang to langgroup mappings
Review of attachment 8554372 [details] [diff] [review]:
-----------------------------------------------------------------
We should avoid any mapping to a legacy language group that doesn't have a one-to-one relationship with a modern script subtag.
::: intl/locale/langGroups.properties
@@ +3,5 @@
> # License, v. 2.0. If a copy of the MPL was not distributed with this
> # file, You can obtain one at http://mozilla.org/MPL/2.0/.
> #
> +# References: http://unicode.org/repos/cldr/trunk/common/supplemental/supplementalData.xml
> +# http://www.omniglot.com/writing/atoz.htm
The Language Subtag Registry should really be on this list somewhere.
@@ +17,5 @@
> +# the default be the predominantly used script (e.g. Cyrillic in
> +# former-USSR countries where Arabic may also be common). Languages with
> +# special scripts for which we don't define a lang group are omitted
> +# (e.g. Mongolian) or mapped to another more frequently supported language
> +# (e.g. Ainu, which uses kana).
All places where the content in CLDR has been overridden should be noted explicitly, or else we won't be able to keep track of them.
@@ +263,2 @@
> hil=x-western
> +hmd=scripts=Plrd
Conversion issue here.
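A quick check along these lines would catch this kind of slip before review. It is a sketch only: the function name and the pattern for an acceptable value are assumptions, not an existing tool.

```python
import re

# Assumed shape of a valid langGroup value, e.g. "ar" or "x-western".
GROUP_RE = re.compile(r"^[A-Za-z][A-Za-z0-9-]*$")

def malformed_entries(properties_text):
    """Return (line number, line) for entries whose value looks wrong."""
    bad = []
    for number, line in enumerate(properties_text.splitlines(), start=1):
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comments are fine
        key, sep, value = line.partition("=")
        if not sep or not GROUP_RE.match(value.strip()):
            bad.append((number, line))
    return bad

sample = "hil=x-western\nhmd=scripts=Plrd\n# comment\ngv=x-western\n"
print(malformed_entries(sample))
```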
Comment 8•10 years ago
Raw CLDR data is apparently here:
http://unicode.org/repos/cldr/trunk/tools/java/org/unicode/cldr/util/data/language_script_raw.txt
Comment 9•10 years ago
(In reply to Gordon P. Hemsley [:GPHemsley] from comment #7)
> Comment on attachment 8554372 [details] [diff] [review]
> patch, add more lang to langgroup mappings
> The Language Subtag Registry should really be on this list somewhere.
URL?
> All places where the content in CLDR has been overridden should be
> noted explicitly, or else we won't be able to keep track of them.
The problem is that we need to map lang tags to a given group but in some cases the CLDR lists several scripts as primary:
ug Uyghur primary Arab Arabic
ug Uyghur primary Cyrl Cyrillic
ug Uyghur secondary Latn Latin
The ordering here seems to be alphabetical so I wasn't sure which should predominate. My understanding is that for Uyghur, Arabic script is used, which is why I've listed these mappings:
ug-cyrl=x-cyrillic
ug=ar
Turkmen is an even more interesting example, since it appears to have transitioned between several different writing systems over time. According to Wikipedia, pre-1929 Arabic was used. From 1929 through 1991 Cyrillic was used (thanks Joe) and currently the "official" writing system is Latin-based.
tk Turkmen primary Arab Arabic
tk Turkmen primary Cyrl Cyrillic
tk Turkmen primary Latn Latin
The mappings I've listed:
tk-arab=ar
tk=x-cyrillic
#tk=x-western # (The country declared in 1992 to gradually move to Latin script)
So rather than "overriding" here I think the data here needs to make a clear choice based on the scripts/fonts that are commonly used for a given language. Font preferences should really be based on a combination of language *and* script but for now I think this will suffice.
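To make the intended behavior of these entries concrete, here is a sketch of a lookup that tries the full tag first and then drops trailing subtags. The table holds only the four mappings listed above; the function name and the truncation strategy are assumptions for illustration, not a description of the existing Gecko lookup.

```python
# Only the mappings proposed in this comment.
LANG_GROUPS = {
    "ug": "ar",
    "ug-cyrl": "x-cyrillic",
    "tk": "x-cyrillic",
    "tk-arab": "ar",
}

def lang_group(tag, table=LANG_GROUPS):
    """Try the full tag, then drop trailing subtags until something matches."""
    parts = tag.lower().split("-")
    while parts:
        group = table.get("-".join(parts))
        if group is not None:
            return group
        parts.pop()
    return None

for tag in ["ug", "ug-Cyrl", "tk-Arab-AF", "tk-TM", "zz"]:
    print(tag, lang_group(tag))
```

With this scheme an explicit script subtag always wins, and the bare language tag only supplies the default.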
Comment 10•10 years ago
(In reply to John Daggett (:jtd) from comment #9)
> (In reply to Gordon P. Hemsley [:GPHemsley] from comment #7)
> > Comment on attachment 8554372 [details] [diff] [review]
> > patch, add more lang to langgroup mappings
>
> > The Language Subtag Registry should really be on this list somewhere.
>
> URL?
https://www.iana.org/assignments/language-subtag-registry/language-subtag-registry
> > All places where the content in CLDR has been overridden should be
> > noted explicitly, or else we won't be able to keep track of them.
>
> The problem is that we need to map lang tags to a given group but in some
> cases the CLDR lists several scripts as primary:
>
> ug Uyghur primary Arab Arabic
> ug Uyghur primary Cyrl Cyrillic
> ug Uyghur secondary Latn Latin
>
> The ordering here seems to be alphabetical so I wasn't sure which should
> predominate. My understanding is that for Uyghur, Arabic script is used,
> which is why I've listed these mappings:
>
> ug-cyrl=x-cyrillic
> ug=ar
Wikipedia and Ethnologue seem to suggest that Arabic is the primary script used and that Cyrillic is used less. However, that is not going to be a reliable calculation on its own, because both are going to be in use in high numbers. (And there is also going to be a significant amount written in Latin, which we can't discount.)
> Turkmen is an even more interesting example, since it appears to have
> transitioned between several different writing systems over time. According
> to Wikipedia, pre-1929 Arabic was used. From 1929 through 1991 Cyrillic was
> used (thanks Joe) and currently the "official" writing system is Latin-based.
>
> tk Turkmen primary Arab Arabic
> tk Turkmen primary Cyrl Cyrillic
> tk Turkmen primary Latn Latin
>
> The mappings I've listed:
>
> tk-arab=ar
> tk=x-cyrillic
> #tk=x-western # (The country declared in 1992 to gradually move to Latin
> script)
This situation is similar to Uyghur, but with much clearer boundaries. The script in use generally varies by region: regions that use Arabic for other languages use Arabic for Turkmen; regions that use Cyrillic or Latin for other languages use Cyrillic or Latin for Turkmen (likely based on political ideology). There's little way to know in advance which script is used if a script subtag is not specified.
> So rather than "overriding" here I think the data here needs to make a clear
> choice based on the scripts/fonts that are commonly used for a given
> language. Font preferences should really be based on a combination of
> language *and* script but for now I think this will suffice.
I agree. I'm not sure it makes sense to specify a script for a language that doesn't have a single primary script associated with it. Such a language should be tagged with its script; if it's not, the behavior should be undefined. (We might be able to do some predictions based on region subtags, but that quickly gets complex.)
Regarding the syntax/content of the file:
I would recommend putting the unqualified language tag first, followed by those with script subtags, followed by those with region subtags. If a situation is complicated and a decision has been made to override source material, that should be explained in detail in a comment at the top of the language block.
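The ordering I have in mind could be expressed as a sort key, roughly as follows. This is a sketch: the function name is made up, a four-letter second subtag is treated as a script, anything else after the language is treated as a region, and the input tags are just examples.

```python
def entry_sort_key(tag):
    """Sort by language, then bare tag < script-qualified < everything else."""
    parts = tag.lower().split("-")
    language, rest = parts[0], parts[1:]
    if not rest:
        rank = 0  # unqualified language tag
    elif len(rest[0]) == 4 and rest[0].isalpha():
        rank = 1  # script subtag (four letters)
    else:
        rank = 2  # region subtag or anything else
    return (language, rank, rest)

tags = ["tk-arab", "zh-tw", "ug", "tk", "zh-hant", "ug-cyrl", "zh"]
print(sorted(tags, key=entry_sort_key))
```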
| Reporter | ||
Comment 11•10 years ago
I added the Arabic script ones for Chromium at https://codereview.chromium.org/1008343002, based on the patch in this bug. I think the two entries "id=ar" and "ha=ar" should be dropped from this patch, as their Wikipedias are not written in Arabic script either: http://id.wikipedia.org http://ha.wikipedia.org
Comment 12•10 years ago
Comment on attachment 8554372 [details] [diff] [review]
patch, add more lang to langgroup mappings
Review of attachment 8554372 [details] [diff] [review]:
-----------------------------------------------------------------
::: intl/locale/langGroups.properties
@@ +252,4 @@
> gv=x-western
> +gvr=x-devanagari
> +gwi=x-western
> +ha=ar
Hausa is primarily written in Latin script; Arabic script is at most a plurality.
Either way, though, we shouldn't be specifying scripts for languages that have multiple dominant scripts.
@@ +282,4 @@
> ia=x-western
> +iba=x-western
> +ibb=x-western
> +id=ar
Indonesian is written in Latin script.
Comment 13•10 years ago
(In reply to ebrahim from comment #11)
> I added the Arabic script ones for Chromium on
> https://codereview.chromium.org/1008343002 based on patch of this bug. I
> think these two "id=ar" "ha=ar" should be dropped from this patch as their
> Wikipedia are also not with Arabic script: http://id.wikipedia.org
> http://ha.wikipedia.org
Good catch!
Updated•3 years ago
Severity: normal → S3