Auto-detect cannot detect the encoding of an old Japanese document
Categories
(Core :: Internationalization, defect)
People
(Reporter: emk, Unassigned)
Details
I had to select Japanese explicitly to view this text.
Maybe because this document contains some non-standard Shift-JIS characters.
Comment 1 (Reporter) • 4 years ago
I can no longer select the correct encoding once the text encoding menu is removed.
Moreover, Chrome and Edge auto-detect the correct encoding without any user interaction.
Comment 2 • 4 years ago
I guess the first question is: Should we treat the sequences in this document that the spec currently treats as errors as spec bugs instead?
> Moreover, Chrome and Edge auto-detect the correct encoding without any user interaction.
Two possible explanations for this:
- Chromium only inspects the prefix of the stream, and the non-standard stuff is relatively late in the stream.
- ced has no relation to the Encoding Standard or the decoders in Chromium, so ced might not reject this file even if it inspected the stream up to the byte sequences that Chromium's decoders treat as errors.
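The first explanation can be illustrated with a minimal Python sketch. This is not how ced or chardetng work: Python's strict `euc_jp` decoder is used only as a stand-in validity check, and the lone 0xFF byte is a hypothetical substitute for the non-standard sequences in the actual document. The helper name `decodes_cleanly` and the 64-byte limit are assumptions made for illustration.

```python
import codecs
from typing import Optional


def decodes_cleanly(data: bytes, encoding: str, limit: Optional[int] = None) -> bool:
    """Return True if data (or only its first `limit` bytes) decodes without error."""
    chunk = data if limit is None else data[:limit]
    decoder = codecs.getincrementaldecoder(encoding)(errors="strict")
    try:
        # When only a prefix is examined, a multi-byte character cut off at
        # the boundary is not an error, so final=False is used in that case.
        decoder.decode(chunk, final=limit is None)
        return True
    except UnicodeDecodeError:
        return False


valid = ("日本語のテキスト。" * 10).encode("euc_jp")
# Hypothetical stand-in for a late non-standard sequence:
# the byte 0xFF is never valid in EUC-JP.
data = valid + b"\xff" + valid

print(decodes_cleanly(data, "euc_jp", limit=64))  # prefix only
print(decodes_cleanly(data, "euc_jp"))            # whole stream
```

A check limited to the prefix never reaches the offending byte, so it accepts the stream, while a check over the whole stream rejects it.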
Comment 3 • 4 years ago
Aside: I'm uneasy about the Encoding Standard not having Windows-consistent PUA mappings for EUC-KR and the lowest part of Big5 for a similar reason.
Comment 4 • 4 years ago
> Should we treat the sequences in this document that the spec currently treats as errors as spec bugs instead?
Do I understand correctly that the byte sequences that aren't supported by the Encoding Standard fall into two categories:
- Later revisions of the de jure JIS X 0208 standard that weren't actually widely adopted in implementations.
- Vendor extensions from something called PC-9801 that weren't widely adopted by other vendors.
?
I.e. we shouldn't change the spec?
Comment 5 (Reporter) • 4 years ago
I won't request adding actual mappings for those characters. The problem is that a single occurrence of one of those characters makes the entire document unreadable.
Can the Encoding Standard treat those areas specially, so that the encoding detector won't bail even if they have no actual mappings?
Comment 6 • 4 years ago
> Can the Encoding Standard treat those areas specially, so that the encoding detector won't bail even if they have no actual mappings?
The Encoding Standard doesn't specify detection at this point, but perhaps chardetng and a possible future chardetng-based detection spec need to special-case these.
So far, though, I think it's a bit premature to change chardetng based on this file only, since this particular file intentionally contains weird byte sequences. In that sense, it's like a test case / demo instead of being normal legacy content. It's a bit unfortunate that it mixes informative content and test case-like data in the same file.
I think it's premature to make changes based on this file, especially considering that the Chromium reason for detecting this file as EUC-JP appears to be that Chromium only examines the prefix. If I remove the prefix of the file before the part that starts discussing JIS X 0208-1983 and JIS X 0208-1990, Chromium doesn't detect the remaining file as EUC-JP. That is, this file happens to be lucky in having a long enough prefix of valid Shift_JIS to hide the invalid parts from ced.
I'll ask the maintainer of the site to add the appropriate charset parameter to the HTTP header.
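As a rough sketch of what a declared charset buys, the Python snippet below reads the `charset` parameter from a `Content-Type` value and uses it to pick a decoder, so no detection is involved. The header value is an assumption (it presumes the document is EUC-JP, as Chromium's detection suggests); the site's real header is not shown in this bug. Python's codec label lookup is also not the Encoding Standard's label matching.

```python
import codecs
from email.message import Message

# Assumed header value, for illustration only.
header = Message()
header["Content-Type"] = "text/plain; charset=EUC-JP"

# The declared label is read from the header rather than guessed from the bytes.
label = header.get_content_charset()
codec = codecs.lookup(label)

body = "日本語".encode("euc_jp")
text = body.decode(codec.name)
print(label, codec.name, text)
```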
Comment 7 • 4 years ago
(In reply to Masatoshi Kimura [:emk] from comment #0)
> Maybe because this document contains some non-standard Shift-JIS characters.
(It seems to be EUC-JP.)
Comment 8 • 4 years ago
Given that this is EUC-JP and not Shift_JIS, and given the structural overlap between EUC-JP, EUC-KR, and GBK, I'm even more hesitant to make chardetng tolerate invalid byte sequences instead of treating them as "not EUC-JP" decision points.
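The structural overlap can be seen in a small Python sketch. It uses Python's own `euc_jp`, `euc_kr`, and `gbk` codecs, which are an assumption here and are not the decoders chardetng or the Encoding Standard use. The same bytes decode without error in all three encodings, which is why a byte sequence that is invalid in one of them is valuable evidence for ruling that encoding out.

```python
# Bytes produced by encoding Japanese text as EUC-JP.
sample = "日本語".encode("euc_jp")

# Strict decoding raises UnicodeDecodeError on an invalid sequence, so
# reaching the end of this comprehension means all three codecs accepted
# the very same bytes.
decoded = {name: sample.decode(name) for name in ("euc_jp", "euc_kr", "gbk")}

for name, text in decoded.items():
    print(name, text)
```

Only the EUC-JP result is the intended text; the other two are structurally valid but are different characters.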
Comment 9 • 4 years ago
Fixed by the site maintainer.
Updated (Reporter) • 4 years ago
Comment 10 • 4 years ago
I took another look at this document, and I'm confused by the section titled ○旧 JIS と新 JIS の違い. It talks about characters defined in the 1990 revision of JIS X 0208 and lists characters that have the 0x81 lead byte in Shift_JIS but aren't in the Encoding Standard. I don't have an actual copy of the 1990 revision of JIS X 0208 standard at hand, but evidence in both Lunde's book and in Wikipedia suggests that the Encoding Standard matches the 1990 revision for row 2.
What are those other characters about in that section?
OTOH, perhaps I should make chardetng tolerate the JIS X 0213 extensions to the NEC extension row.
Comment 11 • 4 years ago
Also, perhaps chardetng should tolerate MacJapanese extensions.
Comment 12 (Reporter) • 4 years ago
(In reply to Henri Sivonen (:hsivonen) from comment #10)
> I took another look at this document, and I'm confused by the section titled ○旧 JIS と新 JIS の違い. It talks about characters defined in the 1990 revision of JIS X 0208 and lists characters that have the 0x81 lead byte in Shift_JIS but aren't in the Encoding Standard. I don't have an actual copy of the 1990 revision of JIS X 0208 standard at hand, but evidence in both Lunde's book and in Wikipedia suggests that the Encoding Standard matches the 1990 revision for row 2.
> What are those other characters about in that section?
I think this list also includes undefined code points to make the code point sequence contiguous. Otherwise it does not match the character count written above (39 special characters and 32 box-drawing characters).
When this text file was displayed on MS-DOS PCs, undefined code points were shown as blanks rather than replacement characters because legacy-encoding-to-Unicode conversion was not performed. So readers didn't notice the "mojibake", I guess.
Comment 13 • 4 years ago
(In reply to Masatoshi Kimura [:emk] from comment #12)
> (In reply to Henri Sivonen (:hsivonen) from comment #10)
> > I took another look at this document, and I'm confused by the section titled ○旧 JIS と新 JIS の違い. It talks about characters defined in the 1990 revision of JIS X 0208 and lists characters that have the 0x81 lead byte in Shift_JIS but aren't in the Encoding Standard. I don't have an actual copy of the 1990 revision of JIS X 0208 standard at hand, but evidence in both Lunde's book and in Wikipedia suggests that the Encoding Standard matches the 1990 revision for row 2.
> > What are those other characters about in that section?
> I think this list also includes undefined code points to make the code point sequence contiguous. Otherwise it does not match the character count written above (39 special characters and 32 box-drawing characters).
> When this text file was displayed on MS-DOS PCs, undefined code points were shown as blanks rather than replacement characters because legacy-encoding-to-Unicode conversion was not performed. So readers didn't notice the "mojibake", I guess.
Thanks. Considering that IE didn't render blanks for unmapped code points, I think it's relatively safe to expect these particular ones not to be a significant compat issue.