1706864 - chardetng not running on .in TLD is a problem for legacy Japanese generic TLD usage of .in

Masayuki Nakano [:masayuki] (he/him)(JST, +0900)(away 4/27 - 5/6)

Reporter

Description

3 years ago

The following page is detected as a Western text encoding page, but it's written in Japanese.

http://urawaza.in/md/index.htm

The page is text/html, the HTTP header does not have "charset", and the page does not have <meta> element specifying the charset.
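The "no declared encoding" condition can be illustrated with a rough sketch. It is not how Firefox implements the check: the function name `declared_charset` and both regular expressions are assumptions made for illustration, and BOM sniffing and the full HTML prescan algorithm are deliberately left out. The 1024-byte window mirrors the prescan limit in the HTML specification.

```python
import re
from typing import Optional

# Hypothetical, simplified patterns; the real HTML prescan is stricter.
_HEADER_CHARSET = re.compile(r'charset\s*=\s*"?([^\s;"]+)', re.IGNORECASE)
_META_CHARSET = re.compile(
    rb'<meta[^>]*charset\s*=\s*["\']?([A-Za-z0-9_.:-]+)', re.IGNORECASE
)


def declared_charset(content_type: str, body: bytes) -> Optional[str]:
    """Return the charset label declared by the header or a <meta>, if any."""
    header_match = _HEADER_CHARSET.search(content_type)
    if header_match:
        return header_match.group(1)
    # Only look at the start of the document, like the spec's prescan.
    meta_match = _META_CHARSET.search(body[:1024])
    if meta_match:
        return meta_match.group(1).decode("ascii")
    return None  # Nothing declared: this is where a detector would step in.
```

A page like the one reported would yield `None` here, which is the situation in which encoding detection becomes relevant.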

Google Chrome and Chromium Edge for Windows correctly detect the right encoding.

Henri Sivonen (:hsivonen)

Assignee

Comment 1

3 years ago

This is because the detector is disabled on the .in TLD in the hope of accommodating old font hacks, especially for Tamil.

This is problematic when .in is used as a generic domain for purposes where non-Latin legacy encodings are relevant.

We don't really have good information on:

- How commonly users still browse pre-mobile-era font hack-based content and have the relevant intentionally mis-encoded fonts installed.
- How commonly the .in TLD is used as a generic domain for non-Latin legacy-encoded content in languages not typically used in India.

(A couple of years ago, I went through old Bugzilla bugs that complained about font hack-based sites in India, and all those sites had migrated to Unicode.)

Notably, on a page like the one reported, choosing View: Text Encoding: Automatic works. If the detector were enabled for .in automatically, View: Text Encoding: Automatic would not make a font hack-based page readable.

Henri Sivonen (:hsivonen)

Assignee

Updated

3 years ago

Summary: The detector considers a page written in Japanese as a Western text encoding → chardetng not running on .in TLD is a problem for legacy Japanese generic TLD usage of .in

Masayuki Nakano [:masayuki] (he/him)(JST, +0900)(away 4/27 - 5/6)

Reporter

Comment 2

3 years ago

Thank you for the quick investigation!

(In reply to Henri Sivonen (:hsivonen) from comment #1)

How commonly the .in TLD is used as a generic domain for non-Latin legacy-encoded content in languages not typically used in India.

Well, some TLDs are used for Japanese pages even if they are for non-related countries. So deciding by TLD whether to enable auto-detection may not make sense. But I have no better idea other than using the "Accept-Language" setting. (I think that typical Japanese users don't access Western-language pages directly.)

Henri Sivonen (:hsivonen)

Assignee

Comment 3

3 years ago

Well, some TLDs are used for Japanese pages even if they are for non-related countries.

For Japanese content, this issue doesn't apply to all TLDs the same way.

For .in and .lk the detector is completely turned off due to the expectation of font hack legacy. For .cn, .mo, .hk, .sg, .tw, .kr, and .kp, if the result is valid EUC-KR, GBK, or Big5 (taking into account known extensions) as applicable, the detector never guesses a Japanese encoding. Nearly all EUC-JP that uses neither half-width katakana nor JIS X 0212 is also valid EUC-KR or GBK, so this makes EUC-JP practically undetectable on these TLDs. That is the price of not misdetecting EUC-KR or GBK (as applicable) as EUC-JP.
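The overlap between EUC-JP and GBK can be seen with a small experiment. This is only an indicative sketch: it uses Python's built-in codecs, whose tables are not identical to the decoders Firefox uses, and the helper `decodes_cleanly` and the sample string are made up for illustration.

```python
def decodes_cleanly(data: bytes, encoding: str) -> bool:
    """Return True if every byte sequence in data is valid in encoding."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


# Sample text: kanji, hiragana, and full-width katakana only
# (no half-width katakana, no JIS X 0212 characters).
japanese = "日本語のテキスト"
euc_jp_bytes = japanese.encode("euc_jp")

# The same bytes are also structurally valid GBK; they just mean
# different characters, so validity alone cannot tell the two apart.
as_gbk = euc_jp_bytes.decode("gbk")

print(decodes_cleanly(euc_jp_bytes, "gbk"))
print(as_gbk != japanese)
```

Because the bytes pass as GBK without any decoding error, a rule of the form "if it is valid GBK on .cn, never guess Japanese" necessarily hides EUC-JP content of this kind.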

Detecting Japanese legacy content on other non-.jp TLDs should work about as well as it does on .com.

Henri Sivonen (:hsivonen)

Assignee

Comment 4

3 years ago

(The menu item "Automatic" intentionally does not consider the TLD to allow an override when the TLD signal gave the unwanted result.)
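The TLD handling described in this bug, before any patch, can be summarized as a small decision function. This is a hypothetical model for discussion, not chardetng's or Gecko's API: the names `detection_policy` and `tld_of` and the policy strings are invented here. Only the TLD lists and the menu override behavior come from the comments above.

```python
# TLDs where the detector was completely turned off (font hack legacy).
DETECTOR_DISABLED_TLDS = {"in", "lk"}
# TLDs where a Japanese guess is suppressed if the content is valid
# EUC-KR, GBK, or Big5, as applicable.
NO_JAPANESE_GUESS_TLDS = {"cn", "mo", "hk", "sg", "tw", "kr", "kp"}


def tld_of(host: str) -> str:
    """Return the last label of a host name, lowercased."""
    return host.rstrip(".").rsplit(".", 1)[-1].lower()


def detection_policy(host: str, manual_automatic: bool = False) -> str:
    """Describe how unlabeled content on host would be handled (model only)."""
    if manual_automatic:
        # The "Automatic" menu item intentionally ignores the TLD.
        return "detect-ignoring-tld"
    tld = tld_of(host)
    if tld in DETECTOR_DISABLED_TLDS:
        return "no-detection"
    if tld in NO_JAPANESE_GUESS_TLDS:
        return "detect-without-japanese-if-valid-local-encoding"
    return "detect"


print(detection_policy("urawaza.in"))
print(detection_policy("urawaza.in", manual_automatic=True))
```

In this model the reported page gets no detection on load, while choosing the menu item bypasses the TLD rule, which matches the observation that View: Text Encoding: Automatic fixes the page.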

Henri Sivonen (:hsivonen)

Assignee

Comment 5

3 years ago

(In reply to Henri Sivonen (:hsivonen) from comment #3)

For .in and .lk the detector is completely turned off due to the expectation of font hack legacy.

This is controlled by the prefs intl.charset.detector.ng.in.enabled and intl.charset.detector.ng.lk.enabled.

Henri Sivonen (:hsivonen)

Assignee

Comment 6

3 years ago

The reason why I'm not taking action at this time based on the current level of information:

If the font hack issue is still relevant and chardetng is enabled for .in, then for unlabeled font hack-dependent content that gets detected as something other than windows-1252, there would be no way for the user to remedy the situation from the Firefox UI after bug 1687635. However, the problem of unlabeled Japanese content on .in can be remedied from the menu.

I welcome information about the current user-facing relevance of the Tamil and Devanagari font hacks that Chrome knows about.

One way to proceed without actual information about content in the field would be to study the structure of the font hack encodings that Chrome knows about and reason about whether they would always score negative for all the encodings that chardetng supports. If they would always score negative except for the rarer Cyrillic encodings, we could exclude those Cyrillic encodings from consideration for .in.

It's worth noting that legacy content from India isn't constrained to .in, but also occurs on .com/.org/.net where chardetng runs without excluding encodings from consideration by TLD. We have zero complaints about font hacks breaking on .com/.org/.net. However, we don't know which one of these explains it:

1. The font hacks aren't relevant anymore.
2. The font hacks yield negative (or rejected) scores for all supported encodings, so windows-1252 prevails for generic TLDs and the font hacks work.
3. The font hacks are relevant and they are broken on .com/.org/.net, but we lack a feedback loop from users to Bugzilla.

If the reason was either 1 or 2, we could remove the special case. If it's 3, then removing the special case for .in would break things further.

The "bilingual" (English and Tamil) flavor of Tamil99 is the closest to an official font hack. By looking at its structure, the notion that Tamil99 content would score negative (or rejected) on all probes and, therefore, already be detected as windows-1252 is plausible. It should score rejected for CJK, Hebrew, Arabic, Greek, and Thai. It should score negative for any Latin encoding. It probably scores negative for all Cyrillic encodings, although IBM866 might be a problem.

Henri Sivonen (:hsivonen)

Assignee

Comment 7

3 years ago

I found online converters that allowed me to synthesize a bit of content in all the Devanagari and most Tamil encodings that Chrome's detector knows about. All of these scored rejected or negative for all encodings that chardetng knows about, so the fallback is windows-1252.
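The fallback reasoning can be sketched as follows. This is a deliberately simplified model, not chardetng's real scoring or API: the function `pick_encoding`, the use of `None` for "rejected", and the sample scores are all invented for illustration. It assumes that a candidate wins only with a positive score and that windows-1252 is used otherwise.

```python
from typing import Mapping, Optional

FALLBACK = "windows-1252"


def pick_encoding(scores: Mapping[str, Optional[int]]) -> str:
    """Pick the best-scoring candidate, or the fallback if none is viable.

    scores maps an encoding label to a score; None means the candidate
    was rejected outright (the bytes are invalid in that encoding).
    """
    viable = {enc: s for enc, s in scores.items() if s is not None and s > 0}
    if not viable:
        return FALLBACK
    return max(viable, key=viable.get)


# Invented scores for font hack-style content: everything is rejected
# or negative, so the fallback wins and the font hack keeps working.
font_hack_scores = {"Shift_JIS": None, "windows-1251": -3, "IBM866": -12}
print(pick_encoding(font_hack_scores))
```

Under this model, enabling the detector on .in is harmless for font hack content as long as no supported encoding ever reaches a positive score for it, which is what the synthesized samples indicated.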

Henri Sivonen (:hsivonen)

Assignee

Comment 8

3 years ago

Attached file Bug 1706864 - Enable chardetng for .in and .lk TLDs.

Henri Sivonen (:hsivonen)

Assignee

Comment 9

3 years ago

https://treeherder.mozilla.org/#/jobs?repo=try&revision=1f4da642d5b19816b927062a0a9ac735c9255e6b

Phabricator Automation

Updated

3 years ago

Attachment #9220318 - Attachment description: Bug 1706864 - Enable chardetng for .in and .lk TLDs. → WIP: Bug 1706864 - Enable chardetng for .in and .lk TLDs.

Henri Sivonen (:hsivonen)

Assignee

Comment 10

3 years ago

https://treeherder.mozilla.org/#/jobs?repo=try&revision=6aa37dbe3729953f53edc421dc1a0500415789f9

Henri Sivonen (:hsivonen)

Assignee

Comment 11

3 years ago

needinfoing mhoye for license review.

These tests include CC-by-sa Wikipedia content. See bug 1432728 for precedent. mhoye, are you OK with how the licensing is indicated both in the README (see patch) and in the tests themselves (see below)?

The binary files in the patch are non-UTF-8 text files that have either of the following forms:

<script src="/resources/testharness.js"></script>
<script src="/resources/testharnessreport.js"></script>
(converted copypaste from the start of the Tamil Wikipedia article about the planet Mars)
<script>
test(function() {
  assert_equals(document.characterSet, "windows-1252");
},"Should fall back to windows-1252");
</script>
The text content above originates from <a href="https://ta.wikipedia.org/w/index.php?title=%E0%AE%9A%E0%AF%86%E0%AE%B5%E0%AF%8D%E0%AE%B5%E0%AE%BE%E0%AE%AF%E0%AF%8D_(%E0%AE%95%E0%AF%8B%E0%AE%B3%E0%AF%8D)&oldid=3129711">Wikipedia</a> and
is licensed under the <a href="https://creativecommons.org/licenses/by-sa/3.0/legalcode">Creative Commons Attribution-ShareAlike 3.0 Unported</a> license.

<script src="/resources/testharness.js"></script>
<script src="/resources/testharnessreport.js"></script>
(converted copypaste from the start of the Hindi Wikipedia article about the planet Mars)
<script>
test(function() {
  assert_equals(document.characterSet, "windows-1252");
},"Should fall back to windows-1252");
</script>
The text content above originates from <a href="https://hi.wikipedia.org/w/index.php?title=%E0%A4%AE%E0%A4%82%E0%A4%97%E0%A4%B2_%E0%A4%97%E0%A5%8D%E0%A4%B0%E0%A4%B9&oldid=5105576">Wikipedia</a> and
is licensed under the <a href="https://creativecommons.org/licenses/by-sa/3.0/legalcode">Creative Commons Attribution-ShareAlike 3.0 Unported</a> license.

Assignee: nobody → hsivonen

Status: NEW → ASSIGNED

Flags: needinfo?(mhoye)

Phabricator Automation

Updated

3 years ago

Attachment #9220318 - Attachment description: WIP: Bug 1706864 - Enable chardetng for .in and .lk TLDs. → Bug 1706864 - Enable chardetng for .in and .lk TLDs.

Phabricator Automation

Updated

3 years ago

Attachment #9220318 - Attachment description: Bug 1706864 - Enable chardetng for .in and .lk TLDs. → WIP: Bug 1706864 - Enable chardetng for .in and .lk TLDs.

Phabricator Automation

Updated

3 years ago

Attachment #9220318 - Attachment description: WIP: Bug 1706864 - Enable chardetng for .in and .lk TLDs. → Bug 1706864 - Enable chardetng for .in and .lk TLDs.

Mike Hoye [:mhoye]

Comment 12

3 years ago

This is fine from a licensing perspective - our longstanding policy, as you note, is that tests can incorporate data licensed under CC-BY or CC-BY-SA, provided that data is stored in its own separate directory along with a LICENSE file that meets the CC-BY or CC-BY-SA requirements.

Flags: needinfo?(mhoye)

Henri Sivonen (:hsivonen)

Assignee

Comment 13

3 years ago

(In reply to Mike Hoye [:mhoye] from comment #12)

This is fine from a licensing perspective - our longstanding policy, as you note, is that tests can incorporate data licensed under CC-BY or CC-BY-SA, provided that data is stored in its own separate directory along with a LICENSE file that meets the CC-BY or CC-BY-SA requirements.

Thanks. I added a copy of the license as a LICENSE file and tweaked the attribution to restate the title even though it was already part of the URL, too.

Pulsebot

Comment 14

3 years ago

Pushed by hsivonen@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/3737510242c5
Enable chardetng for .in and .lk TLDs. r=dminor

Natalia Csoregi [:nataliaCs]

Comment 15

3 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/3737510242c5

Status: ASSIGNED → RESOLVED

Closed: 3 years ago

status-firefox90: --- → fixed

Resolution: --- → FIXED

Target Milestone: --- → 90 Branch