Closed Bug 1706864 Opened 3 years ago Closed 3 years ago

chardetng not running on .in TLD is a problem for legacy Japanese generic TLD usage of .in

Categories

(Core :: Internationalization, defect)

RESOLVED FIXED
90 Branch
Tracking Status
firefox90 --- fixed

People

(Reporter: masayuki, Assigned: hsivonen)

Details

(Keywords: parity-chrome, parity-edge)

Attachments

(1 file)

The following page is detected as using a Western text encoding, but it's written in Japanese.

The page is text/html, the HTTP header does not have a "charset" parameter, and the page does not have a <meta> element specifying the charset.
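For illustration, the "unlabeled" condition described above can be sketched roughly as follows. This is only a simplified sketch, not what Firefox does: the function name `is_unlabeled_html` is hypothetical, the regular expression is a crude stand-in for the real <meta> prescan, and a byte order mark is not considered at all.

```python
import re
from email.message import Message

# Crude stand-in for the HTML <meta> prescan (assumption: the real
# algorithm is considerably more involved than a single regex).
META_CHARSET_RE = re.compile(rb"<meta[^>]*charset\s*=", re.IGNORECASE)


def is_unlabeled_html(content_type: str, body: bytes) -> bool:
    """Return True if neither the header nor a <meta> declares a charset."""
    msg = Message()
    msg["Content-Type"] = content_type
    if msg.get_content_type() != "text/html":
        return False
    if msg.get_content_charset() is not None:
        return False  # the HTTP header carries a charset parameter
    # Only look at the start of the document, as a prescan would.
    return META_CHARSET_RE.search(body[:1024]) is None
```

Only pages for which a check like this comes out true would reach the detector in the first place.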

Google Chrome and Chromium Edge for Windows correctly detect the right encoding.

This is because the detector is disabled on the .in TLD in the hope of accommodating old font hacks, especially for Tamil.

This is problematic when .in is used as a generic domain for purposes where non-Latin legacy encodings are relevant.

We don't really have good information on:

  • How commonly users still browse pre-mobile-era font hack-based content and have the relevant intentionally mis-encoded fonts installed.
  • How commonly the .in TLD is used as a generic domain for non-Latin legacy-encoded content in languages not typically used in India.

(A couple of years ago, I went through old Bugzilla bugs that complained about font hack-based sites in India, and all those sites had migrated to Unicode.)

Notably, on a page like the one reported, choosing View: Text Encoding: Automatic works. If the detector were enabled for .in automatically, View: Text Encoding: Automatic would not make a font hack-based page readable.

Summary: The detector considers a page written in Japanese as a Western text encoding → chardetng not running on .in TLD is a problem for legacy Japanese generic TLD usage of .in

Thank you for the quick investigation!

(In reply to Henri Sivonen (:hsivonen) from comment #1)

  • How commonly the .in TLD is used as a generic domain for non-Latin legacy-encoded content in languages not typically used in India.

Well, some TLDs are used for Japanese pages even if they are for non-related countries. So, deciding by TLD whether to enable auto-detection may not make sense. But I have no better idea other than using the "Accept-Language" setting. (I think that typical Japanese users don't access Western-language pages directly.)

Well, some TLDs are used for Japanese pages even if they are for non-related countries.

For Japanese content, this issue doesn't apply to all TLDs the same way.

For .in and .lk the detector is completely turned off due to the expectation of font hack legacy. For .cn, .mo, .hk, .sg, .tw, .kr, and .kp, if the result is valid EUC-KR, GBK, or Big5 (taking into account known extensions) as applicable, the detector never guesses a Japanese encoding. Pretty much all EUC-JP that doesn't use either half-width katakana or JIS X 0212 is valid EUC-KR or GBK, too, so this pretty much makes EUC-JP undetectable on these TLDs in order not to misdetect EUC-KR or GBK (as applicable) as EUC-JP.
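The overlap between EUC-JP and the other encodings can be illustrated with Python's built-in codecs. This is a sketch under assumptions: the codecs' strict decoding is only an approximation of what chardetng treats as valid (chardetng takes known extensions into account), and `decodes_as` is a helper made up for this example.

```python
def decodes_as(data: bytes, codec: str) -> bool:
    """Return True if the bytes are structurally valid in the given codec."""
    try:
        data.decode(codec)
        return True
    except UnicodeDecodeError:
        return False


# Ordinary kanji from JIS X 0208, encoded as EUC-JP.
plain = "\u65e5\u672c\u8a9e".encode("euc_jp")

# Half-width katakana, which EUC-JP encodes with the 0x8E lead byte.
halfwidth = "\uff76\uff80\uff76\uff85".encode("euc_jp")

for codec in ("euc_kr", "gbk"):
    print(codec, decodes_as(plain, codec), decodes_as(halfwidth, codec))
```

The plain sample decodes without error as both EUC-KR and GBK (as unrelated text, of course), so validity alone cannot tell the encodings apart. The half-width katakana sample is not valid EUC-KR because 0x8E is not an EUC-KR lead byte.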

Detecting Japanese legacy content on other non-.jp TLDs should work about as well as it does on .com.
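To summarize the TLD rules described in this comment, here is a rough sketch. It is not chardetng's actual code or API: `TldPolicy`, `tld_policy`, and the `valid_as_local_legacy` and `use_tld` parameters are all names invented for this example, and the per-TLD choice between EUC-KR, GBK, and Big5 is deliberately left out.

```python
from dataclasses import dataclass

# TLD groups as described in the comment above.
DETECTOR_OFF_TLDS = {"in", "lk"}
NO_JAPANESE_GUESS_TLDS = {"cn", "mo", "hk", "sg", "tw", "kr", "kp"}


@dataclass(frozen=True)
class TldPolicy:
    run_detector: bool
    allow_japanese: bool


def tld_policy(tld: str, valid_as_local_legacy: bool,
               use_tld: bool = True) -> TldPolicy:
    """Hypothetical summary of how the TLD constrains the guess.

    valid_as_local_legacy: whether the content is valid in the legacy
    encoding associated with the TLD (EUC-KR, GBK, or Big5 as applicable).
    use_tld: False models the "Automatic" menu item, which ignores the TLD.
    """
    if not use_tld:
        return TldPolicy(run_detector=True, allow_japanese=True)
    tld = tld.lower().lstrip(".")
    if tld in DETECTOR_OFF_TLDS:
        return TldPolicy(run_detector=False, allow_japanese=False)
    if tld in NO_JAPANESE_GUESS_TLDS and valid_as_local_legacy:
        return TldPolicy(run_detector=True, allow_japanese=False)
    return TldPolicy(run_detector=True, allow_japanese=True)
```

For example, `tld_policy("in", False)` reports that the detector does not run at all, while `tld_policy("in", False, use_tld=False)` models the menu override and allows a Japanese guess.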

(The menu item "Automatic" intentionally does not consider the TLD to allow an override when the TLD signal gave the unwanted result.)

(In reply to Henri Sivonen (:hsivonen) from comment #3)

For .in and .lk the detector is completely turned off due to the expectation of font hack legacy.

This is controlled by the prefs intl.charset.detector.ng.in.enabled and intl.charset.detector.ng.lk.enabled.

The reason why I'm not taking action at this time based on the current level of information:

If the font hack issue is still relevant and chardetng is enabled for .in, for unlabeled font hack dependent content that gets detected as something other than windows-1252, there would be no way for the user to take action in the Firefox UI to remedy the situation after bug 1687635. However, the problem of unlabeled Japanese content on .in can be remedied from the menu.

I welcome information about the current user-facing relevance of the Tamil and Devanagari font hacks that Chrome knows about.

One way to proceed without actual information about content in the field would be to examine the structure of the font hack encodings that Chrome knows about and to reason about whether they would always score negative for all the encodings that chardetng supports. If they would always score negative except for the rarer Cyrillic encodings, we could exclude those encodings from consideration for .in.

It's worth noting that legacy content from India isn't constrained to .in, but also occurs on .com/.org/.net where chardetng runs without excluding encodings from consideration by TLD. We have zero complaints about font hacks breaking on .com/.org/.net. However, we don't know which one of these explains it:

  1. The font hacks aren't relevant anymore.
  2. The font hacks yield negative (or rejected) scores for all supported encodings, so windows-1252 prevails for generic TLDs and the font hacks work.
  3. The font hacks are relevant and they are broken on .com/.org/.net, but we lack a feedback loop from users to Bugzilla.

If the reason is either 1 or 2, we could remove the special case. If it's 3, then removing the special case for .in would break things further.

The "bilingual" (English and Tamil) flavor of Tamil99 is the closest to an official font hack. Judging by its structure, it is plausible that Tamil99 content would score negative (or rejected) on all probes and, therefore, already be detected as windows-1252. It should score rejected for CJK, Hebrew, Arabic, Greek, and Thai, and negative for any Latin encoding. It probably scores negative for all Cyrillic encodings, though IBM866 might be a problem.

I found online converters that allowed me to synthesize a bit of content in all the Devanagari and most Tamil encodings that Chrome's detector knows about. All of these scored rejected or negative for all encodings that chardetng knows about, so the fallback is windows-1252.
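The fallback behavior this relies on can be sketched as follows. This is only a toy model with made-up scores: `pick_encoding` is a hypothetical function, `None` stands for a rejected candidate, and chardetng's real scoring and tie-breaking are more involved.

```python
from typing import Dict, Optional

FALLBACK = "windows-1252"


def pick_encoding(scores: Dict[str, Optional[int]]) -> str:
    """Pick the best positively scoring candidate, else the fallback.

    A score of None means the candidate was rejected outright.
    """
    best_name, best_score = FALLBACK, 0
    for name, score in scores.items():
        if score is None:
            continue  # rejected: the bytes are not valid in this encoding
        if score > best_score:
            best_name, best_score = name, score
    return best_name


# Made-up numbers: every candidate is rejected or negative, as observed
# for the synthesized font hack samples, so the fallback wins.
font_hack_like = {"Shift_JIS": None, "windows-1251": -3, "IBM866": -12}
print(pick_encoding(font_hack_like))
```

Under this model, font hack content keeps being treated as windows-1252 as long as no supported encoding scores above zero, which is what makes enabling the detector on .in and .lk safe for it.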

Attachment #9220318 - Attachment description: Bug 1706864 - Enable chardetng for .in and .lk TLDs. → WIP: Bug 1706864 - Enable chardetng for .in and .lk TLDs.

needinfoing mhoye for license review.

These tests include CC-by-sa Wikipedia content. See bug 1432728 for precedent. mhoye, are you OK with how the licensing is indicated both in the README (see patch) and in the tests themselves (see below)?

The binary files in the patch are non-UTF-8 text files that have either of the following forms:


<script src="/resources/testharness.js"></script>
<script src="/resources/testharnessreport.js"></script>
(converted copypaste from the start of the Tamil Wikipedia article about the planet Mars)
<script>
test(function() {
  assert_equals(document.characterSet, "windows-1252");
},"Should fall back to windows-1252");
</script>
The text content above originates from <a href="https://ta.wikipedia.org/w/index.php?title=%E0%AE%9A%E0%AF%86%E0%AE%B5%E0%AF%8D%E0%AE%B5%E0%AE%BE%E0%AE%AF%E0%AF%8D_(%E0%AE%95%E0%AF%8B%E0%AE%B3%E0%AF%8D)&oldid=3129711">Wikipedia</a> and
is licensed under the <a href="https://creativecommons.org/licenses/by-sa/3.0/legalcode">Creative Commons Attribution-ShareAlike 3.0 Unported</a> license.


<script src="/resources/testharness.js"></script>
<script src="/resources/testharnessreport.js"></script>
(converted copypaste from the start of the Hindi Wikipedia article about the planet Mars)
<script>
test(function() {
  assert_equals(document.characterSet, "windows-1252");
},"Should fall back to windows-1252");
</script>
The text content above originates from <a href="https://hi.wikipedia.org/w/index.php?title=%E0%A4%AE%E0%A4%82%E0%A4%97%E0%A4%B2_%E0%A4%97%E0%A5%8D%E0%A4%B0%E0%A4%B9&oldid=5105576">Wikipedia</a> and
is licensed under the <a href="https://creativecommons.org/licenses/by-sa/3.0/legalcode">Creative Commons Attribution-ShareAlike 3.0 Unported</a> license.

Assignee: nobody → hsivonen
Status: NEW → ASSIGNED
Flags: needinfo?(mhoye)
Attachment #9220318 - Attachment description: WIP: Bug 1706864 - Enable chardetng for .in and .lk TLDs. → Bug 1706864 - Enable chardetng for .in and .lk TLDs.
Attachment #9220318 - Attachment description: Bug 1706864 - Enable chardetng for .in and .lk TLDs. → WIP: Bug 1706864 - Enable chardetng for .in and .lk TLDs.
Attachment #9220318 - Attachment description: WIP: Bug 1706864 - Enable chardetng for .in and .lk TLDs. → Bug 1706864 - Enable chardetng for .in and .lk TLDs.

This is fine from a licensing perspective - our longstanding policy, as you note, is that tests can incorporate data licensed under CC-BY or CC-BY-SA, provided that data is stored in its own separate directory along with a LICENSE file that meets the CC-BY or CC-BY-SA requirements.

Flags: needinfo?(mhoye)

(In reply to Mike Hoye [:mhoye] from comment #12)

This is fine from a licensing perspective - our longstanding policy, as you note, is that tests can incorporate data licensed under CC-BY or CC-BY-SA, provided that data is stored in its own separate directory along with a LICENSE file that meets the CC-BY or CC-BY-SA requirements.

Thanks. I added a copy of the license as a LICENSE file and tweaked the attribution to restate the title even though it was already part of the URL, too.

Pushed by hsivonen@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/3737510242c5
Enable chardetng for .in and .lk TLDs. r=dminor
Status: ASSIGNED → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → 90 Branch