Bug 476384 (opened 17 years ago, updated 5 years ago)

Use ccTLD and charset to improve language tags

Component: Core :: Internationalization (defect, P5)
Status: REOPENED
Reporter: mozilla; Assignee: nobody (unassigned)
Keywords: good-first-bug
Attachments: 1 obsolete file

Many Chinese websites have a language tag of "zh". This doesn't help Pango and fontconfig choose a font, as they can't discriminate between zh_CN and zh_TW (Simplified versus Traditional Chinese, etc.). Firefox can improve the situation by appending the ccTLD country code to the "zh" language tag, so that zh_CN is passed down to Pango for .cn websites. The charset tag, if not Unicode, can also be used for further refinement: for example, Big5 is a Taiwanese standard whereas GB2312 is a Chinese one.
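For illustration, a minimal standalone sketch (plain C++, not Gecko code) of the refinement being asked for; the charset and ccTLD values are just the examples from this comment:

// Illustrative only: refine a bare "zh" language tag using the page's
// legacy charset and, failing that, the site's ccTLD.
#include <string>

std::string RefineChineseTag(const std::string& charset, const std::string& ccTLD) {
  // Legacy charsets are strong hints: Big5 implies Traditional Chinese,
  // GB2312/GBK imply Simplified Chinese.
  if (charset == "big5") return "zh-TW";
  if (charset == "gb2312" || charset == "gbk") return "zh-CN";
  // Otherwise fall back to the country-code TLD.
  if (ccTLD == "cn") return "zh-CN";
  if (ccTLD == "tw") return "zh-TW";
  if (ccTLD == "hk") return "zh-HK";
  return "zh";  // no refinement possible
}

// Example: RefineChineseTag("utf-8", "cn") returns "zh-CN".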
Assignee: nobody → smontagu
Component: General → Internationalization
OS: Linux → All
Product: Firefox → Core
QA Contact: general → i18n
Hardware: x86 → All
Version: unspecified → Trunk
We already use the charset as a hint for the language. Using the ccTLD as well is probably a good idea, at least in some cases.
We should probably use zh-CN for .cn, zh-TW for .tw, zh-HK for .hk, ja for .jp, and probably ko for .kr. Maybe some Arabic-script languages could benefit from a similar approach as well?
On the other hand, having Web content change if it's put on a different domain is a very confusing API to present to Web developers. We should really be encouraging use of lang=.
I don't think this is any more confusing than having it depend on the user's system language. We could try to infer the language from the content text, but that's probably more work and could potentially be more confusing. Inferring the language from the ccTLD should be an improvement on CJK display for users anyway. We can show a message in the console saying that we are inferring the language from the domain, but the page should really be tagged properly.
Assignee: smontagu → nobody

Can you triage this, Jonathan?

Flags: needinfo?(jfkthame)

I agree with Xidorn's comment 4 above. The "real" solution to proper CJK font selection/display is for the content to be properly lang-tagged, but in the absence of that, we try to apply some heuristics, and this would probably reduce the number of times we end up picking an inappropriate font.

Severity: normal → S3
Flags: needinfo?(jfkthame)
Priority: -- → P5

Since we're not opposed to this, adding the good-first-bug keyword.

This is pretty easy to fix by extending the method Document::RecomputeLanguageFromCharset() to look at the TLD before it falls back to language = service->GetLocaleLanguage();.

The CJK mapping for our font code, which doesn't distinguish CN and SG or HK and MO for font purposes, is:
cn, sg, xn--clchc0ea0b2g2a9gcd, xn--fiqs8S, xn--fiqz9S, xn--yfro4i67o: zh-CN
tw, xn--kprw13d, xn--kpry57d: zh-TW
hk, mo, xn--j6w193g, xn--mix891f: zh-HK
kp, kr, xn--3e0b707e: ko
jp: ja
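
Purely as a sketch of the shape this could take (plain C++ with a hypothetical helper name, not the actual Gecko types; the real change would live in Document::RecomputeLanguageFromCharset() as noted above):

// Hypothetical helper: map a lowercase (punycode where applicable) ccTLD to a
// CJK language tag per the table above. Returns nullptr when the TLD gives no hint.
#include <cstring>

const char* CJKLangFromTLD(const char* tld) {
  static const struct { const char* tld; const char* lang; } kMap[] = {
      {"cn", "zh-CN"}, {"sg", "zh-CN"},
      {"xn--clchc0ea0b2g2a9gcd", "zh-CN"},
      {"xn--fiqs8s", "zh-CN"}, {"xn--fiqz9s", "zh-CN"},
      {"xn--yfro4i67o", "zh-CN"},
      {"tw", "zh-TW"}, {"xn--kprw13d", "zh-TW"}, {"xn--kpry57d", "zh-TW"},
      {"hk", "zh-HK"}, {"mo", "zh-HK"},
      {"xn--j6w193g", "zh-HK"}, {"xn--mix891f", "zh-HK"},
      {"kp", "ko"}, {"kr", "ko"}, {"xn--3e0b707e", "ko"},
      {"jp", "ja"},
  };
  for (const auto& entry : kMap) {
    if (std::strcmp(tld, entry.tld) == 0) {
      return entry.lang;
    }
  }
  return nullptr;  // no CJK hint from this TLD
}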

For non-CJK cases, the exercise is to map https://github.com/hsivonen/chardetng/blob/master/src/tld.rs to https://searchfox.org/mozilla-central/source/intl/locale/encodingsgroups.properties .

Keywords: good-first-bug

(In reply to Henri Sivonen (:hsivonen) from comment #7)

For non-CJK cases, the exercise is to map https://github.com/hsivonen/chardetng/blob/master/src/tld.rs to https://searchfox.org/mozilla-central/source/intl/locale/encodingsgroups.properties .

Actually, this can be automated in code: Instantiate mozilla::EncodingDetector and then call Guess with the TLD and false. Then pass the result to service->LookupCharSet(). This works even for CJK except for hk, mo, xn--j6w193g, and xn--mix891f, which you need to special-case to zh-HK. (This approach would result in zh-TW without special-casing.)
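
A rough sketch of that flow, including the zh-HK special case. The two declarations here are placeholders standing in for the Gecko calls named above (EncodingDetector::Guess with the TLD and false, and service->LookupCharSet()); the exact Gecko signatures aren't reproduced:

#include <set>
#include <string>

// Placeholder for: mozilla::EncodingDetector instance, Guess(tld, false).
std::string GuessEncodingFromTLD(const std::string& tld);
// Placeholder for: service->LookupCharSet(encoding) returning a language group.
std::string LangGroupForEncoding(const std::string& encoding);

std::string InferLangFromTLD(const std::string& tld) {
  // These TLDs would come out as zh-TW via the detector route, but should
  // map to zh-HK, so special-case them first.
  static const std::set<std::string> kHongKongMacau = {
      "hk", "mo", "xn--j6w193g", "xn--mix891f"};
  if (kHongKongMacau.count(tld)) {
    return "zh-HK";
  }
  // Let the encoding detector guess a legacy encoding for the TLD and map
  // that encoding to a language group.
  return LangGroupForEncoding(GuessEncodingFromTLD(tld));
}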

(In reply to Henri Sivonen (:hsivonen) from comment #8)

Actually, this can be automated in code: Instantiate mozilla::EncodingDetector and then call Guess with the TLD and false. Then pass the result to service->LookupCharSet(). This works even for CJK except for hk, mo, xn--j6w193g, and xn--mix891f, which you need to special-case to zh-HK. (This approach would result in zh-TW without special-casing.)

Hmm. This method doesn't distinguish between our Latin and Unicode font groups well.

After thinking about this a bit, doing this for non-CJK cases could do more harm than good at this point.

I don't know how to write automated tests for this. AFAICT, testing this would require reftests that load HTTP instead of file:. AFAICT, our own reftests load from file: and WPT doesn't have the right TLDs available.

Inferring Persian instead of Arabic for a site under .ir sounds like an improvement to me, for spell-checking, keyboard, font selection, etc. The font selection actually does make a difference on Fontconfig-based systems like Firefox on Linux.

Assignee: nobody → hsivonen
Status: NEW → ASSIGNED

(In reply to Henri Sivonen (:hsivonen) from comment #12)

I don't know how to write automated tests for this. AFAICT, testing this would require reftests that load HTTP instead of file:. AFAICT, our own reftests load from file: and WPT doesn't have the right TLDs available.

We actually can load gecko reftests via http by prefixing them with an "HTTP" annotation in the reftest manifest; but I don't know whether we can somehow specify the TLD that they would use. Probably not.
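
For reference, the annotation looks roughly like this in a reftest manifest (the file names here are made up); it only switches the test from file: to an http URL on the local test server, and, as noted, it's unclear whether the TLD of that host can be controlled:

# reftest.list entry: serve the test over HTTP instead of file:
HTTP == cctld-lang-inference.html cctld-lang-inference-ref.html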

To set expectations: I'm primarily working on something else, and I figured I'd fix something easy that would only require a tiny patch while I was temporarily (yesterday) in a situation that wasn't suitable for working on my priority items. That is, unfortunately, I'm not committed to pursuing this at this time if it turns out not to be a small and easy patch.

(In reply to Behdad Esfahbod from comment #0)

Many Chinese websites have a language tag of "zh". This doesn't help pango
and fontconfig choose a font as they can't discriminate between zh_CN and
zh_TW (traditional versus simplified Chinese, etc).

Now that I look at comment 0 more carefully, this patch doesn't address the case where an explicit language tag says zh but, in the untagged case, we'd have guessed zh-TW. That seems like a separate concern from whether we look at the TLD in addition to the legacy encoding and the UI locale.

(In reply to Behdad Esfahbod from comment #13)

Inferring Persian instead of Arabic for a site under .ir sounds like an improvement to me, for spell-checking, keyboard, font selection, etc. The font selection actually does make a difference on Fontconfig-based systems like Firefox on Linux.

Spell-checking and keyboard are input issues, which currently aren't a function of the page language even if tagged, so let's put that aside for this bug. (Also, there are plenty of ccTLDs with one script and font convention but more than one language.)

I see that on Ubuntu 20.04 without any Fontconfig or Firefox font pref changes made by me, untagged, ar, fa, and ur all look different. I can't judge the correctness of untagged, ar, or fa, but I'm pretty confident that the ur results are stylistically unwanted. It seems to me that there's an Ubuntu-level configuration bug involved.

If the system's default Arabic-script font already covers the Persian repertoire, what effect is passing fa to Fontconfig supposed to have when things are working correctly?

Pushed by hsivonen@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/4500122fa98d
Use the TLD as a CJK font hint for UTF-8 pages. r=jfkthame

I made an error when reuploading the patch. The version that was supposed to address the review comments did not do so, because my local changes didn't actually go into the patch. Follow-up in bug 1689541.

(In reply to Henri Sivonen (:hsivonen) from comment #15)

Now that I look at comment 0 more carefully, this patch doesn't address the case where an explicit language tag says zh but, in the untagged case, we'd have guessed zh-TW. That seems like a separate concern from whether we look at the TLD in addition to the legacy encoding and the UI locale.

Filed bug 1689542.

I see that on Ubuntu 20.04 without any Fontconfig or Firefox font pref changes made by me, untagged, ar, fa, and ur all look different. I can't judge the correctness of untagged, ar, or fa, but I'm pretty confident that the ur results are stylistically unwanted. It seems to me that there's an Ubuntu-level configuration bug involved.

Filed bug 1689543.

Status: ASSIGNED → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → 87 Branch

Hmm. I wonder if we're going to get complaints about what this will do to the font selection of Latin-script content on these TLDs. Perhaps it would have been more prudent to plug this deeper into the font fallback logic instead of the page-wide inference.

Flags: needinfo?(hsivonen)
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Target Milestone: 87 Branch → ---
Attachment #9199581 - Attachment is obsolete: true

Instead of document-level language inference, the inference should be stored in a separate field of Document, and https://searchfox.org/mozilla-central/rev/f9ad45c76ba50bdee54bebd14e6625ae14d4d085/gfx/thebes/gfxPlatformFontList.cpp#2023 should use that new field.
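
As a hypothetical sketch of that shape (standalone C++ with made-up member names, not the real Document class): keep the TLD-derived guess in its own field and let only the font-selection code read it.

#include <string>

// Sketch only: a document keeps its explicit/charset-derived language and a
// separate TLD-derived hint that only font fallback consults.
class DocumentSketch {
 public:
  // Language from an explicit lang attribute or the charset heuristic
  // (existing behavior); exposed as the page language.
  const std::string& Language() const { return mLanguage; }

  // TLD-derived hint, consulted only by font selection/fallback code
  // (e.g. the gfxPlatformFontList path linked above), never reported as
  // the page language.
  const std::string& FontLanguageHint() const { return mFontLanguageHint; }
  void SetFontLanguageHint(const std::string& aHint) { mFontLanguageHint = aHint; }

 private:
  std::string mLanguage;
  std::string mFontLanguageHint;
};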

Per the first paragraph of comment 15, it may take quite a while before I get to it, so anyone with time available right now should feel free to take the bug.

Assignee: hsivonen → nobody
Flags: needinfo?(hsivonen)

For future copy-paste reference, the inference code with the review comments addressed is at https://hg.mozilla.org/mozilla-central/file/1acbaa0d067cbb68a7dac168ecffb413d66169ce/dom/base/Document.cpp#l16636
