chardetng not running on .in TLD is a problem for legacy Japanese generic TLD usage of .in
Categories
(Core :: Internationalization, defect)
Tracking
| | Tracking | Status |
|---|---|---|
| firefox90 | --- | fixed |
People
(Reporter: masayuki, Assigned: hsivonen)
References
Details
(Keywords: parity-chrome, parity-edge)
Attachments
(1 file)
The following page is detected as a Western text encoding page, but it's written in Japanese.
The page is served as text/html, the HTTP Content-Type header has no charset parameter, and the page has no <meta> element specifying the charset.
Google Chrome and Chromium-based Edge for Windows detect the correct encoding.
Comment 1 • 3 years ago (Assignee)
This is because the detector is disabled on the .in TLD in the hope of accommodating old font hacks, especially for Tamil.
This is problematic when .in is used as a generic domain for purposes where non-Latin legacy encodings are relevant.
We don't really have good information on:
- How commonly users still browse pre-mobile-era font hack-based content and have the relevant intentionally mis-encoded fonts installed.
- How commonly the .in TLD is used as a generic domain for non-Latin legacy-encoded content in languages not typically used in India.
(A couple of years ago, I went through old Bugzilla bugs that complained about font hack-based sites in India, and all those sites had migrated to Unicode.)
Notably, on a page like the one reported, choosing View: Text Encoding: Automatic works. If the detector were enabled automatically for .in, View: Text Encoding: Automatic would not make a font hack-based page readable.
Comment 2 • 3 years ago (Reporter)
Thank you for the quick investigation!
(In reply to Henri Sivonen (:hsivonen) from comment #1)
> - How commonly the .in TLD is used as a generic domain for non-Latin legacy-encoded content in languages not typically used in India.
Well, some TLDs are used for Japanese pages even if they are for non-related countries. So deciding whether to enable auto-detection based on the TLD may not make sense. But I have no better idea other than using the "Accept-Language" setting. (I think that typical Japanese users don't access Western-language pages directly.)
Comment 3 • 3 years ago (Assignee)
> Well, some TLDs are used for Japanese pages even if they are for non-related countries.
For Japanese content, this issue doesn't apply to all TLDs the same way.
For .in and .lk the detector is completely turned off due to the expectation of font hack legacy. For .cn, .mo, .hk, .sg, .tw, .kr, and .kp, if the content is valid EUC-KR, GBK, or Big5 (taking into account known extensions), as applicable to the TLD, the detector never guesses a Japanese encoding. Nearly all EUC-JP that uses neither half-width katakana nor JIS X 0212 is also valid EUC-KR or GBK, so this makes EUC-JP practically undetectable on these TLDs in order not to misdetect EUC-KR or GBK (as applicable) as EUC-JP.
Detecting Japanese legacy content on other non-.jp TLDs should work about as well as it does on .com.
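The overlap described above can be seen with a small sketch. It uses Python's built-in codecs rather than chardetng, and Python's tables are not identical to the ones browsers use (for example, Python's `euc_kr` is narrower than the EUC-KR of the web platform), so treat it as an illustration only. The sample string and the helper name `decodes_cleanly` are assumptions made for this example.

```python
def decodes_cleanly(data: bytes, encoding: str) -> bool:
    """Return True if `data` is a valid byte sequence in `encoding`."""
    try:
        data.decode(encoding, errors="strict")
        return True
    except UnicodeDecodeError:
        return False


# Ordinary kanji text, with no half-width katakana and no JIS X 0212.
text = "日本語"
euc_jp_bytes = text.encode("euc_jp")

for other in ("gbk", "euc_kr"):
    if decodes_cleanly(euc_jp_bytes, other):
        misread = euc_jp_bytes.decode(other)
        # The bytes are structurally valid, but the decoded text is mojibake.
        print(f"{other}: valid, decodes to {misread!r}")
    else:
        print(f"{other}: invalid")
```

Because the same bytes pass validity checks in more than one encoding, validity alone cannot tell EUC-JP apart from GBK or EUC-KR, which is why the TLD is used as a tie-breaker.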
Comment 4 • 3 years ago (Assignee)
(The menu item "Automatic" intentionally does not consider the TLD to allow an override when the TLD signal gave the unwanted result.)
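As a rough model of the rules described in comments 3 and 4 at this point in the thread, the TLD handling could be sketched as follows. This is not chardetng's code: the names `TldPolicy`, `policy_for`, and `from_menu` are hypothetical, and only the TLD rules quoted in this thread are modeled.

```python
from enum import Enum


class TldPolicy(Enum):
    DETECTOR_OFF = "detector not run"
    NO_JAPANESE_IF_VALID_LOCAL = "never guess Japanese if valid EUC-KR/GBK/Big5"
    UNRESTRICTED = "neither of the two restrictions above applies"


# TLD lists as given in comment 3.
FONT_HACK_TLDS = frozenset({"in", "lk"})
LOCAL_CJK_TLDS = frozenset({"cn", "mo", "hk", "sg", "tw", "kr", "kp"})


def policy_for(host: str, from_menu: bool = False) -> TldPolicy:
    """Pick the TLD rule for `host` (hypothetical helper).

    `from_menu` models the "Automatic" menu item, which intentionally
    does not consider the TLD.
    """
    if from_menu:
        return TldPolicy.UNRESTRICTED
    # Take the last label of the host name, ignoring a trailing dot.
    tld = host.rstrip(".").rsplit(".", 1)[-1].lower()
    if tld in FONT_HACK_TLDS:
        return TldPolicy.DETECTOR_OFF
    if tld in LOCAL_CJK_TLDS:
        return TldPolicy.NO_JAPANESE_IF_VALID_LOCAL
    return TldPolicy.UNRESTRICTED


print(policy_for("example.in"))
print(policy_for("example.in", from_menu=True))
```

The `from_menu` override is what lets a user recover an unlabeled Japanese page on .in even while the automatic path skips detection.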
Comment 5 • 3 years ago (Assignee)
(In reply to Henri Sivonen (:hsivonen) from comment #3)
> For .in and .lk the detector is completely turned off due to the expectation of font hack legacy.
This is controlled by the prefs intl.charset.detector.ng.in.enabled and intl.charset.detector.ng.lk.enabled.
Comment 6 • 3 years ago (Assignee)
The reason why I'm not taking action at this time, based on the current level of information:
If the font hack issue is still relevant and chardetng is enabled for .in, then for unlabeled font hack-dependent content that gets detected as something other than windows-1252, there would be no way for the user to remedy the situation in the Firefox UI after bug 1687635. However, the problem of unlabeled Japanese content on .in can be remedied from the menu.
I welcome information about the current user-facing relevance of the Tamil and Devanagari font hacks that Chrome knows about.
One way to proceed without actual information about content in the field would be to find out about the structure of the font hack encodings that Chrome knows about and to reason about whether they would always score negative for all the encodings that chardetng supports. If they would always score negative except for the rarer Cyrillic encodings, we could exclude those encodings from consideration for .in.
It's worth noting that legacy content from India isn't confined to .in but also occurs on .com/.org/.net, where chardetng runs without excluding encodings from consideration by TLD. We have zero complaints about font hacks breaking on .com/.org/.net. However, we don't know which one of these explains it:
1. The font hacks aren't relevant anymore.
2. The font hacks yield negative (or rejected) scores for all supported encodings, so windows-1252 prevails for generic TLDs and the font hacks work.
3. The font hacks are relevant and they are broken on .com/.org/.net, but we lack a feedback loop from users to Bugzilla.
If the reason is either 1 or 2, we could remove the special case. If it's 3, then removing the special case for .in would break things further.
The "bilingual" (English and Tamil) flavor of Tamil99 is the closest thing to an official font hack. Judging by its structure, it is plausible that Tamil99 content would score negative (or rejected) on all probes and, therefore, already be detected as windows-1252: it should score rejected for CJK, Hebrew, Arabic, Greek, and Thai; it should score negative for any Latin encoding; and it probably scores negative for all Cyrillic encodings. IBM866 might be a problem.
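The reasoning above rests on the fallback rule that, when every candidate encoding is rejected or scores negative, windows-1252 prevails. A minimal sketch of that rule follows; it is not chardetng's implementation. The function name `pick_encoding`, the use of `None` for "rejected", the treatment of a zero score as non-negative, and all score values are assumptions made for illustration.

```python
from typing import Dict, Optional

FALLBACK = "windows-1252"


def pick_encoding(scores: Dict[str, Optional[int]]) -> str:
    """Return the best non-negative candidate, or the fallback.

    `scores` maps an encoding label to its score; None means the
    candidate was rejected outright (hypothetical representation).
    """
    viable = {enc: s for enc, s in scores.items() if s is not None and s >= 0}
    if not viable:
        # Everything was rejected or negative, so the fallback prevails.
        return FALLBACK
    return max(viable, key=lambda enc: viable[enc])


# Made-up scores for font hack content where every probe fails.
all_fail = {"Shift_JIS": None, "windows-1251": -40, "IBM866": -5}
print(pick_encoding(all_fail))

# If one Cyrillic candidate scored positive, it would win instead,
# which is the concern raised about IBM866.
ibm866_wins = {"Shift_JIS": None, "windows-1251": -40, "IBM866": 12}
print(pick_encoding(ibm866_wins))
```

Under this model, font hack pages keep working only as long as no supported encoding produces a non-negative score for them.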
Comment 7 • 3 years ago (Assignee)
I found online converters that allowed me to synthesize a bit of content in all the Devanagari and most Tamil encodings that Chrome's detector knows about. All of these scored rejected or negative for all encodings that chardetng knows about, so the fallback is windows-1252.
Comment 8 • 3 years ago (Assignee)
Comment 9 • 3 years ago (Assignee)
Comment 10 • 3 years ago (Assignee)
Comment 11 • 3 years ago (Assignee)
needinfoing mhoye for license review.
These tests include CC-BY-SA Wikipedia content. See bug 1432728 for precedent. mhoye, are you OK with how the licensing is indicated both in the README (see patch) and in the tests themselves (see below)?
The binary files in the patch are non-UTF-8 text files that have either of the following forms:
<script src="/resources/testharness.js"></script>
<script src="/resources/testharnessreport.js"></script>
(converted copypaste from the start of the Tamil Wikipedia article about the planet Mars)
<script>
test(function() {
  assert_equals(document.characterSet, "windows-1252");
}, "Should fall back to windows-1252");
</script>
The text content above originates from <a href="https://ta.wikipedia.org/w/index.php?title=%E0%AE%9A%E0%AF%86%E0%AE%B5%E0%AF%8D%E0%AE%B5%E0%AE%BE%E0%AE%AF%E0%AF%8D_(%E0%AE%95%E0%AF%8B%E0%AE%B3%E0%AF%8D)&oldid=3129711">Wikipedia</a> and
is licensed under the <a href="https://creativecommons.org/licenses/by-sa/3.0/legalcode">Creative Commons Attribution-ShareAlike 3.0 Unported</a> license.
<script src="/resources/testharness.js"></script>
<script src="/resources/testharnessreport.js"></script>
(converted copypaste from the start of the Hindi Wikipedia article about the planet Mars)
<script>
test(function() {
  assert_equals(document.characterSet, "windows-1252");
}, "Should fall back to windows-1252");
</script>
The text content above originates from <a href="https://hi.wikipedia.org/w/index.php?title=%E0%A4%AE%E0%A4%82%E0%A4%97%E0%A4%B2_%E0%A4%97%E0%A5%8D%E0%A4%B0%E0%A4%B9&oldid=5105576">Wikipedia</a> and
is licensed under the <a href="https://creativecommons.org/licenses/by-sa/3.0/legalcode">Creative Commons Attribution-ShareAlike 3.0 Unported</a> license.
Comment 12 • 3 years ago
This is fine from a licensing perspective - our longstanding policy, as you note, is that tests can incorporate data licensed under CC-BY or CC-BY-SA, provided that data is stored in its own separate directory along with a LICENSE file that meets the CC-BY or CC-BY-SA requirements.
Comment 13 • 3 years ago (Assignee)
(In reply to Mike Hoye [:mhoye] from comment #12)
> This is fine from a licensing perspective - our longstanding policy, as you note, is that tests can incorporate data licensed under CC-BY or CC-BY-SA, provided that data is stored in its own separate directory along with a LICENSE file that meets the CC-BY or CC-BY-SA requirements.
Thanks. I added a copy of the license as a LICENSE file and tweaked the attribution to restate the title even though it was already part of the URL, too.
Comment 14 • 3 years ago
Pushed by hsivonen@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/3737510242c5 Enable chardetng for .in and .lk TLDs. r=dminor
Comment 15 • 3 years ago
bugherder