Use ccTLD and charset to improve language tags
Categories
(Core :: Internationalization, defect, P5)
Tracking
()
People
(Reporter: mozilla, Unassigned)
References
Details
(Keywords: good-first-bug)
Attachments
(1 obsolete file)
Updated•17 years ago
Comment 1•17 years ago
Comment 2•8 years ago
Comment 4•8 years ago
Updated•5 years ago
Comment 6•5 years ago
I agree with Xidorn's comment 4 above. The "real" solution to proper CJK font selection/display is for the content to be properly lang-tagged, but in the absence of that, we try to apply some heuristics, and this would probably reduce the number of times we end up picking an inappropriate font.
Comment 7•5 years ago
Since we're not opposed to this, adding the good-first-bug keyword.
This is pretty easy to fix by extending the method Document::RecomputeLanguageFromCharset() to look at the TLD before it does language = service->GetLocaleLanguage().
The CJK mapping for our font code, which doesn't distinguish CN and SG or HK and MO for font purposes, is:
cn, sg, xn--clchc0ea0b2g2a9gcd, xn--fiqs8s, xn--fiqz9s, xn--yfro4i67o: zh-CN
tw, xn--kprw13d, xn--kpry57d: zh-TW
hk, mo, xn--j6w193g, xn--mix891f: zh-HK
kp, kr, xn--3e0b707e: ko
jp: ja
For non-CJK cases, the exercise is to map https://github.com/hsivonen/chardetng/blob/master/src/tld.rs to https://searchfox.org/mozilla-central/source/intl/locale/encodingsgroups.properties .
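The CJK mapping listed above can be sketched as a lookup table. This is a minimal standalone illustration, not the Gecko patch: the function name CjkLangFromTld is hypothetical, and the sketch assumes the caller has already extracted the last label of the host name in ASCII or Punycode form.

```cpp
#include <algorithm>
#include <cctype>
#include <optional>
#include <string>
#include <string_view>
#include <unordered_map>

// Hypothetical helper (not Gecko code): returns the language tag from the
// comment 7 mapping for a TLD, or std::nullopt when the TLD is not listed.
std::optional<std::string_view> CjkLangFromTld(std::string_view aTld) {
  static const std::unordered_map<std::string_view, std::string_view> kMap = {
      {"cn", "zh-CN"},
      {"sg", "zh-CN"},
      {"xn--clchc0ea0b2g2a9gcd", "zh-CN"},
      {"xn--fiqs8s", "zh-CN"},
      {"xn--fiqz9s", "zh-CN"},
      {"xn--yfro4i67o", "zh-CN"},
      {"tw", "zh-TW"},
      {"xn--kprw13d", "zh-TW"},
      {"xn--kpry57d", "zh-TW"},
      {"hk", "zh-HK"},
      {"mo", "zh-HK"},
      {"xn--j6w193g", "zh-HK"},
      {"xn--mix891f", "zh-HK"},
      {"kp", "ko"},
      {"kr", "ko"},
      {"xn--3e0b707e", "ko"},
      {"jp", "ja"},
  };
  // Host names are case-insensitive, so compare in ASCII lowercase.
  std::string lower(aTld);
  std::transform(lower.begin(), lower.end(), lower.begin(), [](unsigned char c) {
    return static_cast<char>(std::tolower(c));
  });
  auto it = kMap.find(lower);
  if (it == kMap.end()) {
    return std::nullopt;
  }
  return it->second;
}
```

A caller would fall back to the existing encoding-based and UI-locale-based inference whenever the function returns std::nullopt.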
Comment 8•5 years ago
(In reply to Henri Sivonen (:hsivonen) from comment #7)
For non-CJK cases, the exercise is to map https://github.com/hsivonen/chardetng/blob/master/src/tld.rs to https://searchfox.org/mozilla-central/source/intl/locale/encodingsgroups.properties .
Actually, this can be automated in code: Instantiate mozilla::EncodingDetector and then call Guess with the TLD and false. Then pass the result to service->LookupCharSet(). This works even for CJK except for hk, mo, xn--j6w193g, and xn--mix891f, which you need to special-case to zh-HK. (This approach would result in zh-TW without special-casing.)
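The control flow described above (guess an encoding from the TLD, map the encoding to a language group, special-case the Hong Kong and Macau TLDs) might look roughly like the following. This is a sketch under stated assumptions: GuessEncodingForTld and LangGroupForEncoding are hypothetical toy stand-ins for the detector's Guess call and for service->LookupCharSet(), and their table entries are illustrative values rather than data taken from chardetng or encodingsgroups.properties.

```cpp
#include <string>
#include <string_view>

// Toy stand-in (hypothetical) for guessing an encoding from the TLD alone.
// The entries are illustrative assumptions, not the detector's real output.
std::string_view GuessEncodingForTld(std::string_view aTld) {
  if (aTld == "jp") return "Shift_JIS";
  if (aTld == "kr") return "EUC-KR";
  if (aTld == "cn") return "GBK";
  if (aTld == "tw" || aTld == "hk" || aTld == "mo") return "Big5";
  return "windows-1252";
}

// Toy stand-in (hypothetical) for mapping an encoding to a language group.
std::string_view LangGroupForEncoding(std::string_view aEncoding) {
  if (aEncoding == "Shift_JIS") return "ja";
  if (aEncoding == "EUC-KR") return "ko";
  if (aEncoding == "GBK") return "zh-CN";
  if (aEncoding == "Big5") return "zh-TW";
  return "x-western";
}

// The four TLDs that comment 8 says need special-casing to zh-HK.
bool IsHongKongOrMacauTld(std::string_view aTld) {
  return aTld == "hk" || aTld == "mo" || aTld == "xn--j6w193g" ||
         aTld == "xn--mix891f";
}

// Special-case first, then fall through to the encoding-based lookup.
std::string InferLanguageFromTld(std::string_view aTld) {
  if (IsHongKongOrMacauTld(aTld)) {
    return "zh-HK";
  }
  return std::string(LangGroupForEncoding(GuessEncodingForTld(aTld)));
}
```

The special case has to run before the lookup because, in this toy model as in the comment, the encoding guessed for hk maps to zh-TW.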
Comment 9•5 years ago
(In reply to Henri Sivonen (:hsivonen) from comment #8)
(In reply to Henri Sivonen (:hsivonen) from comment #7)
For non-CJK cases, the exercise is to map https://github.com/hsivonen/chardetng/blob/master/src/tld.rs to https://searchfox.org/mozilla-central/source/intl/locale/encodingsgroups.properties .
Actually, this can be automated in code: Instantiate mozilla::EncodingDetector and then call Guess with the TLD and false. Then pass the result to service->LookupCharSet(). This works even for CJK except for hk, mo, xn--j6w193g, and xn--mix891f, which you need to special-case to zh-HK. (This approach would result in zh-TW without special-casing.)
Hmm. This method doesn't distinguish between our Latin and Unicode font groups well.
Comment 10•5 years ago
After thinking about this a bit, doing this for non-CJK cases could do more harm than good at this point.
Comment 11•5 years ago
Comment 12•5 years ago
I don't know how to write automated tests for this. AFAICT, testing this would require reftests that load HTTP instead of file:. AFAICT, our own reftests load from file: and WPT doesn't have the right TLDs available.
Reporter | ||
Comment 13•5 years ago
Inferring Persian language for a site under .ir instead of Arabic sounds like an improvement to me. For spell-checking, keyboard, font selection, etc. The font selection actually does make a difference on Fontconfig-based systems like Firefox Linux.
Updated•5 years ago
Comment 14•5 years ago
(In reply to Henri Sivonen (:hsivonen) from comment #12)
I don't know how to write automated tests for this. AFAICT, testing this would require reftests that load HTTP instead of file:. AFAICT, our own reftests load from file: and WPT doesn't have the right TLDs available.
We actually can load gecko reftests via http by prefixing them with an "HTTP" annotation in the reftest manifest; but I don't know whether we can somehow specify the TLD that would use. Probably not.
Comment 15•5 years ago
To set expectations: I'm primarily working on something else, and I figured I'd fix something easy that'd require a tiny patch while, yesterday, I was temporarily in a situation that wasn't suitable for working on my priority items. That is, unfortunately, I'm not committed to pursuing this at this time if this turns out not to be a small and easy patch.
(In reply to Behdad Esfahbod from comment #0)
Many Chinese websites have a language tag of "zh". This doesn't help pango
and fontconfig choose a font as they can't discriminate between zh_CN and
zh_TW (traditional versus simplified Chinese, etc).
Now that I look at comment 0 more carefully, this patch doesn't address the case of an explicit language tag saying zh if in the no-tag case we'd guess zh-TW. That seems like a separate concern compared to whether we look at the TLD in addition to looking at the legacy encoding and the UI locale.
(In reply to Behdad Esfahbod from comment #13)
Inferring Persian language for a site under .ir instead of Arabic sounds like an improvement to me. For spell-checking, keyboard, font selection, etc. The font selection actually does make a difference on Fontconfig-based systems like Firefox Linux.
Spell-checking and keyboard are input issues, which currently aren't a function of the page language even if tagged, so let's put that aside for this bug. (Also, there are plenty of ccTLDs with one script and font convention but more than one language.)
I see that on Ubuntu 20.04 without any Fontconfig or Firefox font pref changes made by me, untagged, ar, fa, and ur all look different. I can't judge the correctness of untagged, ar, or fa, but I'm pretty confident that the ur results are stylistically unwanted. It seems to me that there's an Ubuntu-level configuration bug involved.
If the system's default Arabic-script font had a repertoire that covers Persian, what effect is passing fa to Fontconfig supposed to have when things are working correctly?
Comment 16•5 years ago
Comment 17•5 years ago
I made an error when reuploading the patch. The version that was supposed to address the review comments did not do so, because my local changes didn't actually go into the patch. Follow-up in bug 1689541.
Comment 18•5 years ago
(In reply to Henri Sivonen (:hsivonen) from comment #15)
Now that I look at comment 0 more carefully, this patch doesn't address the case of an explicit language tag saying zh if in the no-tag case we'd guess zh-TW. That seems like a separate concern compared to whether we look at the TLD in addition to looking at the legacy encoding and the UI locale.
Filed bug 1689542.
I see that on Ubuntu 20.04 without any Fontconfig or Firefox font pref changes made by me, untagged, ar, fa, and ur all look different. I can't judge the correctness of untagged, ar, or fa, but I'm pretty confident that the ur results are stylistically unwanted. It seems to me that there's an Ubuntu-level configuration bug involved.
Filed bug 1689543.
Comment 19•5 years ago
bugherder |
Comment 20•5 years ago
Hmm. I wonder if we're going to get complaints about what this will do to the font selection of Latin-script content on these TLDs. Perhaps it would have been more prudent to plug this deeper into the font fallback logic instead of the page-wide inference.
Comment 21•5 years ago
Backed out as requested by Henri.
Backout link: https://hg.mozilla.org/integration/autoland/rev/e65f687074e78ed1bb55f425ecdbcf3cb417403e
Updated•5 years ago
Updated•5 years ago
Comment 22•5 years ago
Instead of document-level language inference, the inference should be stored into a different field of Document and https://searchfox.org/mozilla-central/rev/f9ad45c76ba50bdee54bebd14e6625ae14d4d085/gfx/thebes/gfxPlatformFontList.cpp#2023 should use that new field.
Per the first paragraph of comment 15, it can take quite a while before I get to it, so anyone with time available right now should feel it's OK to take the bug.
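One way to picture the separate-field idea is the sketch below. Everything in it is hypothetical: DocumentSketch, its member names, and both functions are invented for illustration, and the precedence order (explicit language first, then the TLD inference, then the UI locale) is an assumption, since the thread does not spell out how the font list code should combine them.

```cpp
#include <optional>
#include <string>

// Hypothetical model of a document that keeps the TLD-based inference apart
// from the page-wide language.
struct DocumentSketch {
  // Language declared by the content, if any.
  std::optional<std::string> mContentLanguage;
  // Hypothetical new field: language inferred from the TLD, if any.
  std::optional<std::string> mTldInferredLanguage;
};

// Page-wide language: deliberately ignores the TLD inference, so
// Latin-script content on a CJK TLD is not retagged as a whole.
std::string DocumentLanguage(const DocumentSketch& aDoc,
                             const std::string& aUiLocaleLanguage) {
  return aDoc.mContentLanguage.value_or(aUiLocaleLanguage);
}

// Hint consulted only by the font list code (assumed precedence order).
std::string FontListLanguageHint(const DocumentSketch& aDoc,
                                 const std::string& aUiLocaleLanguage) {
  if (aDoc.mContentLanguage) {
    return *aDoc.mContentLanguage;
  }
  if (aDoc.mTldInferredLanguage) {
    return *aDoc.mTldInferredLanguage;
  }
  return aUiLocaleLanguage;
}
```

Keeping the two accessors separate means the TLD guess can influence font ordering without changing what the rest of the engine sees as the document language, which is the concern raised in comment 20.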
Comment 23•5 years ago
For future copy-paste reference, the inference code with the review comments addressed is at https://hg.mozilla.org/mozilla-central/file/1acbaa0d067cbb68a7dac168ecffb413d66169ce/dom/base/Document.cpp#l16636