Closed Bug 1706862 Opened 3 years ago Closed 3 years ago

Page with lots of half-width katakana not detected as Shift_JIS

Categories

(Core :: Internationalization, defect)

Tracking

RESOLVED FIXED
91 Branch
Tracking Status
firefox91 --- fixed

People

(Reporter: masayuki, Assigned: hsivonen)

Details

(Whiteboard: [If you have another example of the issue, please leave the URL in a comment], [wptsync upstream])

Attachments

(1 file)

Firefox fails to detect the correct text encoding of the following page.

The page is served as text/html, the HTTP header has no "charset" parameter, and the page has no <meta> element specifying the charset.

Also, this currently cannot be fixed from the "Text Encoding" menu.

(In reply to Masayuki Nakano [:masayuki] (he/him)(JST, +0900)(Still not recoverd perfectly) from comment #0)

> Firefox fails to detect the correct text encoding of the following page.
>
> The page is served as text/html, the HTTP header has no "charset" parameter, and the page has no <meta> element specifying the charset.
>
> Also, this currently cannot be fixed from the "Text Encoding" menu.

The reason why this page doesn't work right away is that its first non-ASCII characters are a pair of Shift_JIS half-width katakana characters that form a valid (non-half-width) EUC-JP character. The quick-deciding Japanese-only detector is based on the bet that pages wouldn't start like this. Of course, the Web finds a counter-example.
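The byte-level overlap described above can be illustrated with a minimal sketch using Python's standard codecs. The two bytes are an arbitrary example chosen for illustration, not the bytes from the reported page: each is a single-byte half-width katakana character in Shift_JIS, and together they form one valid two-byte character in EUC-JP.

```python
# Two bytes that are each a half-width katakana character in Shift_JIS.
# (Illustrative bytes only; not taken from the page in this bug.)
data = bytes([0xB1, 0xB2])

as_shift_jis = data.decode("shift_jis")  # two single-byte characters
as_euc_jp = data.decode("euc_jp")        # one two-byte character

# Both decodes succeed, so validity alone cannot tell the encodings apart.
print(len(as_shift_jis), as_shift_jis)
print(len(as_euc_jp), as_euc_jp)
```

Because both interpretations are valid, a detector that decides on the first non-ASCII bytes has to guess, which is the bet mentioned above.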

I need to take a better look to explain why the item "Automatic" in the Text Encoding menu can't deal with this, either. My first guess is that this page has too many half-width katakana characters relative to the number of full-width characters.

Out of curiosity: What's the purpose of using half-width katakana on this page (as opposed to full-width characters for all Japanese text on the page)?

Summary: Failed to detect right text encoding and cannot fix it from the "Text Encoding" menu → Page with lots of half-width katakana not detected as Shift_JIS

(In reply to Henri Sivonen (:hsivonen) from comment #1)

> Out of curiosity: What's the purpose of using half-width katakana on this page (as opposed to full-width characters for all Japanese text on the page)?

I think this is a rare case. Most Japanese people don't use half-width kana without a special reason because:

  • It requires the voiced or semi-voiced sound mark to be a separate character after the base kana character.
  • It is not rendered very attractively.
  • It does not match other kinds of characters such as hiragana and Chinese characters.
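As a small illustration of the first point, the following sketch uses Python's standard `unicodedata` module. Half-width "ga" takes two code points (base kana plus a separate voiced sound mark), while full-width "ga" is a single code point; NFKC normalization maps the former to the latter. The example characters are chosen for illustration only.

```python
import unicodedata

half_width = "\uff76\uff9e"  # half-width KA followed by half-width voiced sound mark
full_width = "\u30ac"        # full-width GA as a single code point

# NFKC folds the half-width pair into the single full-width character.
normalized = unicodedata.normalize("NFKC", half_width)

print(len(half_width), len(full_width), normalized == full_width)
```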

On the other hand, some people (I guess mostly people older than me) simply prefer it to full-width katakana out of habit.

> I think this is a rare case. Most Japanese people don't use half-width kana without a special reason because:
>
>   • It requires the voiced or semi-voiced sound mark to be a separate character after the base kana character.
>   • It is not rendered very attractively.
>   • It does not match other kinds of characters such as hiragana and Chinese characters.

This is the underlying assumption of the detectors. Pure JIS X 0201 text files are intentionally unsupported as too ancient relative to the complications their support would cause for other non-Latin 8-bit encodings. I've assumed that when half-width katakana appears, it would appear in old terminal examples inside a page where the bulk of the text is full-width. I.e. some half-width katakana surrounded by full-width characters.

The case at hand where there are a handful of full-width characters surrounded by half-width katakana comes as a surprise to me.
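To make the contrast between the two cases concrete, here is a rough sketch that measures how dominant half-width katakana is in already-decoded text. This is not chardetng's scoring logic; the function name `half_width_katakana_ratio` and the idea of a simple ratio are assumptions for illustration only.

```python
def half_width_katakana_ratio(text: str) -> float:
    """Share of non-ASCII characters in the half-width katakana range.

    Hypothetical helper for illustration; U+FF61..U+FF9F covers half-width
    katakana together with the half-width CJK punctuation next to it.
    """
    non_ascii = [ch for ch in text if ord(ch) > 0x7F]
    if not non_ascii:
        return 0.0
    half = sum(1 for ch in non_ascii if 0xFF61 <= ord(ch) <= 0xFF9F)
    return half / len(non_ascii)


# Expected case: a little half-width katakana inside mostly full-width text.
mostly_full = "\u65e5\u672c\u8a9e\u306e\u6587\u7ae0 \uff83\uff7d\uff84"
# Case from this bug: a few full-width characters among half-width katakana.
mostly_half = "\uff8a\uff9d\uff76\uff78\uff76\uff85 \u65e5\u672c"

print(half_width_katakana_ratio(mostly_full))
print(half_width_katakana_ratio(mostly_half))
```

A page like the one reported here would score well above one half on such a measure, whereas the expected case stays well below it.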

> On the other hand, some people (I guess mostly people older than me) simply prefer it to full-width katakana out of habit.

I wasn't aware of this, and the lack of awareness is reflected in the detector design. :-(

For people who may encounter this same issue and figure they shouldn't file a duplicate: Please leave a comment with the URL you are seeing this problem at.

(Also, even URLs of non-broken pages where half-width katakana is dominant would help me understand what the detector should look for if we decide to put time into fixing this.)

Whiteboard: [If you have another example of the issue, please leave the URL in a comment]
Assignee: nobody → hsivonen
Status: NEW → ASSIGNED

The patch here makes the menu item "Automatic" work. Bug 1711476 would make having to use the menu unnecessary. Bug 1687635 removes the confusion between the items "Automatic" and "Japanese".

See Also: → 1711476

Review ping.

Flags: needinfo?(VYV03354)

Sorry for overlooking the review request.

Flags: needinfo?(VYV03354)

Thanks for the r+.

Pushed by hsivonen@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/c7c1c74df7ea
Make chardetng detect half-width katakana. r=emk
Created web-platform-tests PR https://github.com/web-platform-tests/wpt/pull/29164 for changes under testing/web-platform/tests
Whiteboard: [If you have another example of the issue, please leave the URL in a comment] → [If you have another example of the issue, please leave the URL in a comment], [wptsync upstream]
Status: ASSIGNED → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → 91 Branch
Upstream PR merged by moz-wptsync-bot