Page with lots of half-width katakana not detected as Shift_JIS
Categories
(Core :: Internationalization, defect)
Tracking
()
| | Tracking | Status |
|---|---|---|
| firefox91 | --- | fixed |
People
(Reporter: masayuki, Assigned: hsivonen)
References
Details
(Whiteboard: [If you have another example of the issue, please leave the URL in a comment], [wptsync upstream])
Attachments
(1 file)
The correct text encoding of the following page is not detected.
It is served as text/html, the HTTP header does not specify "charset", and the page does not have a <meta> element specifying the charset.
Also, these days this cannot be fixed with the "Text Encoding" menu.
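For readers who want to check whether a page falls into the same "nothing declared" situation, here is a minimal Python sketch. The function name `declared_charset` and the regular expression are illustrative assumptions; the pattern is a deliberate simplification and not the `<meta>` prescan algorithm that browsers implement.

```python
import re
from email.message import Message

# Simplified pattern for illustration only; NOT the HTML Standard's prescan algorithm.
META_CHARSET = re.compile(rb'<meta[^>]*charset\s*=\s*["\']?([A-Za-z0-9_-]+)', re.IGNORECASE)


def declared_charset(content_type, body):
    """Return the declared charset label in lower case, or None if nothing is declared."""
    # Let the standard library parse the Content-Type header parameters.
    msg = Message()
    msg["Content-Type"] = content_type
    from_header = msg.get_content_charset()
    if from_header:
        return from_header
    # Fall back to a <meta> declaration near the start of the document (first 1024 bytes).
    match = META_CHARSET.search(body[:1024])
    return match.group(1).decode("ascii").lower() if match else None


# No charset in the header and no <meta>: the browser has to guess.
print(declared_charset("text/html", b"<html><body>text</body></html>"))
```

When this returns `None`, the encoding has to be guessed from the bytes, which is the case described in this report.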
Assignee
Comment 1 • 3 years ago
(In reply to Masayuki Nakano [:masayuki] (he/him)(JST, +0900)(Still not recoverd perfectly) from comment #0)
> The correct text encoding of the following page is not detected.
> It is served as text/html, the HTTP header does not specify "charset", and the page does not have a <meta> element specifying the charset.
> Also, these days this cannot be fixed with the "Text Encoding" menu.
The reason why this page doesn't work right away is that its first non-ASCII characters are a pair of Shift_JIS half-width katakana characters that form a valid (non-half-width) EUC-JP character. The quick-deciding Japanese-only detector is based on the bet that pages wouldn't start like this. Of course, the Web finds a counter-example.
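The ambiguity can be reproduced with Python's built-in codecs. This is only a sketch: the two bytes below are picked for illustration (half-width KA and NA) and are not taken from the page in this bug, and the helper `is_halfwidth_katakana` is a made-up name.

```python
# Two half-width katakana, KA and NA, as Shift_JIS bytes (one byte each).
# Assumption: these bytes are chosen for illustration, not copied from the reported page.
raw = bytes([0xB6, 0xC5])

as_shift_jis = raw.decode("shift_jis")  # two half-width katakana characters
as_euc_jp = raw.decode("euc_jp")        # the same bytes read as one two-byte character


def is_halfwidth_katakana(ch):
    """True for the JIS X 0201 katakana range as mapped to Unicode (U+FF61..U+FF9F)."""
    return "\uff61" <= ch <= "\uff9f"


print(len(as_shift_jis), all(map(is_halfwidth_katakana, as_shift_jis)))
print(len(as_euc_jp), any(map(is_halfwidth_katakana, as_euc_jp)))
```

Both decodes succeed without error, so a detector that looks only at the first non-ASCII bytes has no way to tell the two readings apart.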
I need to take a better look to explain why the item "Automatic" in the Text Encoding menu can't deal with this, either. My first guess is that this page has too many half-width katakana characters relative to the number of full-width characters.
Out of curiosity: What's the purpose of using half-width katakana on this page (as opposed to full-width characters for all Japanese text on the page)?
Reporter
Comment 2 • 3 years ago
(In reply to Henri Sivonen (:hsivonen) from comment #1)
> Out of curiosity: What's the purpose of using half-width katakana on this page (as opposed to full-width characters for all Japanese text on the page)?
I think that this is a rare case. Basically, most Japanese people don't use half-width kana without a special reason because:
- It requires the voiced or semi-voiced sound mark as a separate character after the base kana character.
- It is not rendered as beautifully.
- It does not match other types of characters such as hiragana and Chinese characters.
On the other hand, some people (I guess mostly people older than me) simply prefer it to (full-width) katakana out of habit.
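The first point in the list can be seen with Unicode normalization. A small sketch using only the standard library; the variable names are illustrative:

```python
import unicodedata

# HALFWIDTH KATAKANA LETTER KA followed by HALFWIDTH KATAKANA VOICED SOUND MARK:
# the voicing mark is a separate character after the base kana.
halfwidth = "\uff76\uff9e"

# NFKC maps both to their regular forms and composes them into one character.
fullwidth = unicodedata.normalize("NFKC", halfwidth)

print(len(halfwidth), len(fullwidth), fullwidth == "\u30ac")  # U+30AC is KATAKANA LETTER GA
```

In half-width form the voiced syllable takes two characters, whereas the full-width form has a single precomposed character.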
Assignee
Comment 3 • 3 years ago
> I think that this is a rare case. Basically, most Japanese people don't use half-width kana without a special reason because:
> - It requires the voiced or semi-voiced sound mark as a separate character after the base kana character.
> - It is not rendered as beautifully.
> - It does not match other types of characters such as hiragana and Chinese characters.
This is the underlying assumption of the detectors. Pure JIS X 0201 text files are intentionally unsupported as too ancient relative to the complications their support would cause for other non-Latin 8-bit encodings. I've assumed that when half-width katakana appears, it would appear in old terminal examples inside a page where the bulk of the text is full-width. I.e. some half-width katakana surrounded by full-width characters.
The case at hand where there are a handful of full-width characters surrounded by half-width katakana comes as a surprise to me.
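To make the difference between the two situations concrete, here is a rough Python sketch that measures how much of a decoded text's non-ASCII content is half-width katakana. It is not how chardetng scores candidates; the function name `halfwidth_share` and both sample strings are made up for illustration.

```python
def halfwidth_share(text):
    """Fraction of non-ASCII characters in the JIS X 0201 katakana range (U+FF61..U+FF9F)."""
    non_ascii = [ch for ch in text if ord(ch) > 0x7F]
    if not non_ascii:
        return 0.0
    half = sum(1 for ch in non_ascii if "\uff61" <= ch <= "\uff9f")
    return half / len(non_ascii)


# Made-up samples (assumptions, not taken from any real page):
# some half-width katakana surrounded by full-width text, the case the detector expected...
expected_kind = "端末に ｺﾝﾆﾁﾊ と表示されます"
# ...and a few full-width characters surrounded by half-width katakana, the surprising case.
surprising_kind = "ｺﾝﾆﾁﾊ､ｺﾚﾊ 日本語 ﾉ ﾍﾟｰｼﾞﾃﾞｽ"

print(halfwidth_share(expected_kind))
print(halfwidth_share(surprising_kind))
```

The first sample yields a share well below one half and the second a share well above it, which is the kind of page the original design did not anticipate.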
> On the other hand, some people (I guess mostly people older than me) simply prefer it to (full-width) katakana out of habit.
I wasn't aware of this, and the lack of awareness is reflected in the detector design. :-(
Assignee
Comment 4 • 3 years ago
For people who may encounter this same issue and figure they shouldn't file a duplicate: Please leave a comment with the URL you are seeing this problem at.
(Also, even URLs of non-broken pages where half-width katakana is dominant would help me understand what the detector should look for if we decide to put time into fixing this.)
Assignee
Comment 5 • 3 years ago
Assignee
Comment 6 • 3 years ago
The patch here makes the menu item "Automatic" work. Bug 1711476 would make having to use the menu unnecessary. Bug 1687635 removes the confusion between the items "Automatic" and "Japanese".
Assignee
Comment 7 • 3 years ago
Assignee
Comment 10 • 3 years ago
Thanks for the r+.
Comment 11 • 3 years ago
Pushed by hsivonen@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/c7c1c74df7ea
Make chardetng detect half-width katakana. r=emk
Created web-platform-tests PR https://github.com/web-platform-tests/wpt/pull/29164 for changes under testing/web-platform/tests
Comment 13 • 3 years ago
bugherder |
Upstream PR merged by moz-wptsync-bot