Bug 476384 (opened 17 years ago, updated 5 years ago)

Use ccTLD and charset to improve language tags

Component: Core :: Internationalization (defect, P5)
Status: REOPENED
Reporter: mozilla; Assignee: nobody (unassigned)
Keywords: good-first-bug
Attachments: 1 obsolete file

Many Chinese websites have a language tag of "zh". This doesn't help Pango and fontconfig choose a font, as they can't discriminate between zh_CN and zh_TW (Simplified versus Traditional Chinese, etc.). Firefox can improve the situation by appending the ccTLD country code to the "zh" language tag, so that zh_CN is passed down to Pango for .cn websites. The charset tag, if not Unicode, can also be used for further refinement: for example, Big5 is a Taiwanese standard whereas GB2312 is a Chinese one.
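For illustration, a minimal standalone sketch (plain C++, not Gecko code) of the refinement being asked for; the charset and ccTLD values are just the examples from this comment:

// Illustrative only: refine a bare "zh" language tag using the page's
// legacy charset and, failing that, the site's ccTLD.
#include <string>

std::string RefineChineseTag(const std::string& charset, const std::string& ccTLD) {
  // Legacy charsets are strong hints: Big5 implies Traditional Chinese,
  // GB2312/GBK imply Simplified Chinese.
  if (charset == "big5") return "zh-TW";
  if (charset == "gb2312" || charset == "gbk") return "zh-CN";
  // Otherwise fall back to the country-code TLD.
  if (ccTLD == "cn") return "zh-CN";
  if (ccTLD == "tw") return "zh-TW";
  if (ccTLD == "hk") return "zh-HK";
  return "zh";  // no refinement possible
}

// Example: RefineChineseTag("utf-8", "cn") returns "zh-CN".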
Assignee: nobody → smontagu
Component: General → Internationalization
OS: Linux → All
Product: Firefox → Core
QA Contact: general → i18n
Hardware: x86 → All
Version: unspecified → Trunk
We already use the charset as a hint for the language. Using the ccTLD as well is probably a good idea, at least in some cases.
We should probably use zh-CN for .cn, zh-TW for .tw, zh-HK for .hk, ja for .jp, and probably ko for .kr. Maybe some Arabic-script languages could benefit from a similar approach as well?
On the other hand, having Web content change if it's put on a different domain is a very confusing API to present to Web developers. We should really be encouraging use of lang=.
I don't think this is any more confusing than having it depend on the user's system language. We could try to infer the language from the content text, but that's probably more work and could potentially be more confusing. Inferring the language from the ccTLD should be an improvement on CJK display for users anyway. We can show a message in the console saying that we are inferring the language from the domain, but the page should really be tagged properly.
Assignee: smontagu → nobody

Can you triage this, Jonathan?

Flags: needinfo?(jfkthame)

I agree with Xidorn's comment 4 above. The "real" solution to proper CJK font selection/display is for the content to be properly lang-tagged, but in the absence of that, we try to apply some heuristics, and this would probably reduce the number of times we end up picking an inappropriate font.

Severity: normal → S3
Flags: needinfo?(jfkthame)
Priority: -- → P5

Since we're not opposed to this, adding the good-first-bug keyword.

This is pretty easy to fix by extending the method Document::RecomputeLanguageFromCharset() to look at the TLD before it falls back to language = service->GetLocaleLanguage();.

The CJK mapping for our font code, which doesn't distinguish CN and SG or HK and MO for font purposes, is:
cn, sg, xn--clchc0ea0b2g2a9gcd, xn--fiqs8S, xn--fiqz9S, xn--yfro4i67o: zh-CN
tw, xn--kprw13d, xn--kpry57d: zh-TW
hk, mo, xn--j6w193g, xn--mix891f: zh-HK
kp, kr, xn--3e0b707e: ko
jp: ja
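
Purely as a sketch of the shape this could take (plain C++ with a hypothetical helper name, not the actual Gecko types; the real change would live in Document::RecomputeLanguageFromCharset() as noted above):

// Hypothetical helper: map a lowercase (punycode where applicable) ccTLD to a
// CJK language tag per the table above. Returns nullptr when the TLD gives no hint.
#include <cstring>

const char* CJKLangFromTLD(const char* tld) {
  static const struct { const char* tld; const char* lang; } kMap[] = {
      {"cn", "zh-CN"}, {"sg", "zh-CN"},
      {"xn--clchc0ea0b2g2a9gcd", "zh-CN"},
      {"xn--fiqs8s", "zh-CN"}, {"xn--fiqz9s", "zh-CN"},
      {"xn--yfro4i67o", "zh-CN"},
      {"tw", "zh-TW"}, {"xn--kprw13d", "zh-TW"}, {"xn--kpry57d", "zh-TW"},
      {"hk", "zh-HK"}, {"mo", "zh-HK"},
      {"xn--j6w193g", "zh-HK"}, {"xn--mix891f", "zh-HK"},
      {"kp", "ko"}, {"kr", "ko"}, {"xn--3e0b707e", "ko"},
      {"jp", "ja"},
  };
  for (const auto& entry : kMap) {
    if (std::strcmp(tld, entry.tld) == 0) {
      return entry.lang;
    }
  }
  return nullptr;  // no CJK hint from this TLD
}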

For non-CJK cases, the exercise is to map https://github.com/hsivonen/chardetng/blob/master/src/tld.rs to https://searchfox.org/mozilla-central/source/intl/locale/encodingsgroups.properties .

Keywords: good-first-bug

(In reply to Henri Sivonen (:hsivonen) from comment #7)

For non-CJK cases, the exercise is to map https://github.com/hsivonen/chardetng/blob/master/src/tld.rs to https://searchfox.org/mozilla-central/source/intl/locale/encodingsgroups.properties .

Actually, this can be automated in code: Instantiate mozilla::EncodingDetector and then call Guess with the TLD and false. Then pass the result to service->LookupCharSet(). This works even for CJK except for hk, mo, xn--j6w193g, and xn--mix891f, which you need to special-case to zh-HK. (This approach would result in zh-TW without special-casing.)
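
A rough sketch of that flow, including the zh-HK special case. The two declarations here are placeholders standing in for the Gecko calls named above (EncodingDetector::Guess with the TLD and false, and service->LookupCharSet()); the exact Gecko signatures aren't reproduced:

#include <set>
#include <string>

// Placeholder for: mozilla::EncodingDetector instance, Guess(tld, false).
std::string GuessEncodingFromTLD(const std::string& tld);
// Placeholder for: service->LookupCharSet(encoding) returning a language group.
std::string LangGroupForEncoding(const std::string& encoding);

std::string InferLangFromTLD(const std::string& tld) {
  // These TLDs would come out as zh-TW via the detector route, but should
  // map to zh-HK, so special-case them first.
  static const std::set<std::string> kHongKongMacau = {
      "hk", "mo", "xn--j6w193g", "xn--mix891f"};
  if (kHongKongMacau.count(tld)) {
    return "zh-HK";
  }
  // Let the encoding detector guess a legacy encoding for the TLD and map
  // that encoding to a language group.
  return LangGroupForEncoding(GuessEncodingFromTLD(tld));
}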

(In reply to Henri Sivonen (:hsivonen) from comment #8)

Actually, this can be automated in code: Instantiate mozilla::EncodingDetector and then call Guess with the TLD and false. Then pass the result to service->LookupCharSet(). This works even for CJK except for hk, mo, xn--j6w193g, and xn--mix891f, which you need to special-case to zh-HK. (This approach would result in zh-TW without special-casing.)

Hmm. This method doesn't distinguish between our Latin and Unicode font groups well.

After thinking about this a bit, doing this for non-CJK cases could do more harm than good at this point.

I don't know how to write automated tests for this. AFAICT, testing this would require reftests that load HTTP instead of file:. AFAICT, our own reftests load from file: and WPT doesn't have the right TLDs available.

Inferring Persian instead of Arabic for a site under .ir sounds like an improvement to me, for spell-checking, keyboard, font selection, etc. The font selection actually does make a difference on Fontconfig-based systems like Firefox on Linux.

Assignee: nobody → hsivonen
Status: NEW → ASSIGNED

(In reply to Henri Sivonen (:hsivonen) from comment #12)

I don't know how to write automated tests for this. AFAICT, testing this would require reftests that load HTTP instead of file:. AFAICT, our own reftests load from file: and WPT doesn't have the right TLDs available.

We actually can load gecko reftests via http by prefixing them with an "HTTP" annotation in the reftest manifest; but I don't know whether we can somehow specify the TLD that they would use. Probably not.
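
For reference, the annotation looks roughly like this in a reftest manifest (the file names here are made up); it only switches the test from file: to an http URL on the local test server, and, as noted, it's unclear whether the TLD of that host can be controlled:

# reftest.list entry: serve the test over HTTP instead of file:
HTTP == cctld-lang-inference.html cctld-lang-inference-ref.html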

To set expectations: I'm primarily working on something else, and I figured I'd fix something easy that would only require a tiny patch while I was temporarily (yesterday) in a situation that wasn't suitable for working on my priority items. That is, unfortunately, I'm not committed to pursuing this at this time if it turns out not to be a small and easy patch.

(In reply to Behdad Esfahbod from comment #0)

Many Chinese websites have a language tag of "zh". This doesn't help pango
and fontconfig choose a font as they can't discriminate between zh_CN and
zh_TW (traditional versus simplified Chinese, etc).

Now that I look at comment 0 more carefully, this patch doesn't address the case where an explicit language tag says zh but, in the untagged case, we'd have guessed zh-TW. That seems like a separate concern from whether we look at the TLD in addition to the legacy encoding and the UI locale.

(In reply to Behdad Esfahbod from comment #13)

Inferring Persian instead of Arabic for a site under .ir sounds like an improvement to me, for spell-checking, keyboard, font selection, etc. The font selection actually does make a difference on Fontconfig-based systems like Firefox on Linux.

Spell-checking and keyboard are input issues, which currently aren't a function of the page language even if tagged, so let's put that aside for this bug. (Also, there are plenty of ccTLDs with one script and font convention but more than one language.)

I see that on Ubuntu 20.04 without any Fontconfig or Firefox font pref changes made by me, untagged, ar, fa, and ur all look different. I can't judge the correctness of untagged, ar, or fa, but I'm pretty confident that the ur results are stylistically unwanted. It seems to me that there's an Ubuntu-level configuration bug involved.

If the system's default Arabic-script font already covers the Persian repertoire, what effect is passing fa to Fontconfig supposed to have when things are working correctly?

Pushed by hsivonen@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/4500122fa98d
Use the TLD as a CJK font hint for UTF-8 pages. r=jfkthame

I made an error when reuploading the patch. The version that was supposed to address the review comments did not do so, because my local changes didn't actually go into the patch. Follow-up in bug 1689541.

(In reply to Henri Sivonen (:hsivonen) from comment #15)

Now that I look at comment 0 more carefully, this patch doesn't address the case where an explicit language tag says zh but, in the untagged case, we'd have guessed zh-TW. That seems like a separate concern from whether we look at the TLD in addition to the legacy encoding and the UI locale.

Filed bug 1689542.

I see that on Ubuntu 20.04 without any Fontconfig or Firefox font pref changes made by me, untagged, ar, fa, and ur all look different. I can't judge the correctness of untagged, ar, or fa, but I'm pretty confident that the ur results are stylistically unwanted. It seems to me that there's an Ubuntu-level configuration bug involved.

Filed bug 1689543.

Status: ASSIGNED → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → 87 Branch

Hmm. I wonder if we're going to get complaints about what this will do to the font selection of Latin-script content on these TLDs. Perhaps it would have been more prudent to plug this deeper into the font fallback logic instead of the page-wide inference.

Flags: needinfo?(hsivonen)
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Target Milestone: 87 Branch → ---
Attachment #9199581 - Attachment is obsolete: true

Instead of document-level language inference, the inference should be stored in a separate field of Document, and https://searchfox.org/mozilla-central/rev/f9ad45c76ba50bdee54bebd14e6625ae14d4d085/gfx/thebes/gfxPlatformFontList.cpp#2023 should use that new field.
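
As a hypothetical sketch of that shape (standalone C++ with made-up member names, not the real Document class): keep the TLD-derived guess in its own field and let only the font-selection code read it.

#include <string>

// Sketch only: a document keeps its explicit/charset-derived language and a
// separate TLD-derived hint that only font fallback consults.
class DocumentSketch {
 public:
  // Language from an explicit lang attribute or the charset heuristic
  // (existing behavior); exposed as the page language.
  const std::string& Language() const { return mLanguage; }

  // TLD-derived hint, consulted only by font selection/fallback code
  // (e.g. the gfxPlatformFontList path linked above), never reported as
  // the page language.
  const std::string& FontLanguageHint() const { return mFontLanguageHint; }
  void SetFontLanguageHint(const std::string& aHint) { mFontLanguageHint = aHint; }

 private:
  std::string mLanguage;
  std::string mFontLanguageHint;
};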

Per the first paragraph of comment 15, it may take quite a while before I get to it, so anyone with time available right now should feel free to take the bug.

Assignee: hsivonen → nobody
Flags: needinfo?(hsivonen)

For future copy-paste reference, the inference code with the review comments addressed is at https://hg.mozilla.org/mozilla-central/file/1acbaa0d067cbb68a7dac168ecffb413d66169ce/dom/base/Document.cpp#l16636
