Closed Bug 23944 Opened 25 years ago Closed 25 years ago

Incorrect identification of japanese SJIS charset

Categories

(MailNews Core :: Internationalization, enhancement, P3)

Tracking

(Not tracked)

VERIFIED FIXED

People

(Reporter: jmdesp, Assigned: ftang)

Details

Attachments

(3 files)

On an SJIS-encoded mail, auto-detection displays "It's NOT shift_JIS - byte 240 (F0)" and then tries to display the mail as UTF-8. But F0 is a legal code for SJIS. In the mail, the code was present in the hiragana wo character.
I need more info: Do you know which character causes this? Is it 0xF0XX or 0xXXF0? Are you able to reproduce this in the browser?
I realized that auto-detection displays the offset of the error, not the value of the byte. So the problem might have nothing to do with the value 0xF0. But I can't work out how it counts the offset inside the message, so I can't say which byte value triggers the problem. I've been testing, and it seems to count more than one for each byte. I'm not able to reproduce the bug in the browser, because the browser displays the mail correctly when it is saved as a file. But if you give me a mail address, I can forward the mail shortened to just one line that is still incorrectly detected as UTF-8 instead of SJIS. I also have another mail that displays as Korean under Linux and as garbage under Windows, despite the fact that the auto-converter says it has recognized it as EUC-JP (it is indeed EUC-JP). Once again, when saved as a file, it is correctly displayed in the browser. It might be a different problem. All tests were with M12-seamonkey.
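For reference, the byte order of the character can be checked outside the mail client. This is a minimal sketch that assumes Python's built-in `shift_jis` codec as a stand-in; it is not the Mozilla converter, and the variable names are illustrative only.

```python
# Assumption: Python's "shift_jis" codec stands in for the mail client's converter.
wo = "\u3092"  # HIRAGANA LETTER WO

sjis_bytes = wo.encode("shift_jis")
lead, trail = sjis_bytes[0], sjis_bytes[1]

# Show which of the two bytes carries the value 0xF0.
print(f"lead=0x{lead:02X} trail=0x{trail:02X}")

# Check whether the same two bytes would also pass as UTF-8.
try:
    sjis_bytes.decode("utf-8")
    valid_utf8 = True
except UnicodeDecodeError:
    valid_utf8 = False
print("valid as UTF-8:", valid_utf8)
```

Under that assumption, F0 is the second byte of the pair (the 0xXXF0 form), and the pair is not valid UTF-8.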
Target Milestone: M14
nhotta is on vacation this week. marking for M14 for now.
Hi, Jean-Marc. Could you possibly send me some test msgs with the problem characters in them? Send to: momoi@netscape.com.
"byte 240 (F0)" means "the 240th (0xF0) byte from the beginning of the file", not "the byte value 240 (0xF0)".
Change OS and Platform to ALL
OS: Linux → All
Hardware: PC → All
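The difference between a byte's position and its value in the "byte 240 (F0)" message can be illustrated with a short sketch. It assumes Python's strict decoders as a stand-in for the detector, a hypothetical helper named `first_bad_byte`, and 1-based counting; how the real debug message counts positions is not stated in this bug.

```python
def first_bad_byte(data: bytes, encoding: str):
    """Return (position, value) of the first undecodable byte, or None.

    The position is 1-based, which is an assumption made for this illustration.
    """
    try:
        data.decode(encoding)
    except UnicodeDecodeError as err:
        return err.start + 1, data[err.start]
    return None


# 239 ASCII bytes followed by one non-ASCII byte: the 240th byte is 0x82.
sample = b"a" * 239 + b"\x82"
position, value = first_bad_byte(sample, "ascii")
print(f"byte {position} ({position:X}) has value 0x{value:02X}")
```

In this sample the position is 240 (hex F0) while the offending byte value is 0x82, so a message of the form "byte 240 (F0)" says nothing about the value F0 itself.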
Reassign to ftang.
Assignee: nhotta → ftang
Status: NEW → ASSIGNED
Jean-Marc, can you attach the problem message to this bug? That way I can download the file later and run it through my local build.
It seems auto-recognition is not reset between the parts of a multi-part email message. It should be, as the parts may have different encodings. This should be added somewhere in the wish list for future versions.
These 2 test msgs have non-Japanese charsets. One says iso-8859-1 (main body) and the other says us-ascii (2nd body). I don't think we can currently display these because the charset info does not match the content of the mail. Also, we are not supposed to apply auto-detection to mail msgs which already have a charset parameter. I think these will become displayable when we implement charset override via the Character coding menu.
What's the point of auto-detection if you just read the charset parameter? M12 tries to do auto-detection on every message I receive instead of using the charset parameter. If you have a robust auto-detection algorithm, it's a much better way to go than using charset headers; the average user just doesn't set the charset correctly. I used ISO-8859 for convenience, because Netscape and Outlook will convert SJIS and EUC to ISO-2022 when sending mail, but I think I can recreate the samples with correct headers. The samples are based on mail I actually received (it's from an open mailing list, not personal mail, and I removed any content that could be considered personal), and the encoding was incorrectly set from the start.
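As an illustration of handling each part on its own, here is a minimal sketch using Python's standard `email` package. The two-part message, the boundary name and the helper `decode_parts` are made up for this example, and the sketch relies on the declared per-part charsets rather than on detection; it only shows that no charset decision is carried over from one part to the next.

```python
import email

# Hypothetical two-part message; each part declares and uses its own charset.
WO = "\u3092"  # HIRAGANA LETTER WO
raw = b"".join([
    b"MIME-Version: 1.0\r\n",
    b'Content-Type: multipart/mixed; boundary="sep"\r\n',
    b"\r\n",
    b"--sep\r\n",
    b"Content-Type: text/plain; charset=shift_jis\r\n",
    b"Content-Transfer-Encoding: 8bit\r\n",
    b"\r\n",
    WO.encode("shift_jis") + b"\r\n",
    b"--sep\r\n",
    b"Content-Type: text/plain; charset=euc-jp\r\n",
    b"Content-Transfer-Encoding: 8bit\r\n",
    b"\r\n",
    WO.encode("euc_jp") + b"\r\n",
    b"--sep--\r\n",
])


def decode_parts(raw_message: bytes):
    """Decode every text part separately, sharing no state between parts."""
    results = []
    for part in email.message_from_bytes(raw_message).walk():
        if part.get_content_maintype() != "text":
            continue  # skip the multipart container itself
        charset = part.get_content_charset()
        body = part.get_payload(decode=True)
        results.append((charset, body.decode(charset).strip()))
    return results


for charset, text in decode_parts(raw):
    print(charset, repr(text))
```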
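The policy described here, together with a possible override, can be sketched as follows. This is not Mozilla's code: `pick_charset`, its `ignore_header` flag and the candidate list are hypothetical, and strict decoding with Python codecs is used as a crude substitute for a real detector.

```python
def pick_charset(declared, body, ignore_header=False,
                 candidates=("utf-8", "euc_jp", "shift_jis")):
    """Return the charset to use for body.

    A declared charset wins unless ignore_header is set; otherwise the first
    candidate that decodes the body without error is returned (or None).
    """
    if declared and not ignore_header:
        return declared
    for encoding in candidates:
        try:
            body.decode(encoding)
        except UnicodeDecodeError:
            continue
        return encoding
    return None


# EUC-JP content sent with a wrong iso-8859-1 label, as in the test messages.
body = "\u3092".encode("euc_jp")
print(pick_charset("iso-8859-1", body))                      # header trusted
print(pick_charset("iso-8859-1", body, ignore_header=True))  # header ignored
print(repr(body.decode("iso-8859-1")))                       # mislabeled result
```

With the header trusted, the body is decoded as iso-8859-1 and comes out as unrelated Latin-1 characters; with the header ignored, the fallback settles on EUC-JP for this sample.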
OK, I'll correct my mistake. If I set the correct content encoding inside the messages, Netscape displays them correctly, so you're right, Momoi: the problem comes from the header, and if I set no content encoding, the auto-converter does quite a good job. So let's close this as a bug. But I'd like two things to be enhanced:
- It would be nice to have a way to ignore the content of the header, other than manually selecting the correct encoding. Headers are very often wrong.
- Change something in the converter log that appears in the log window. It's very misleading. When it tells me it has identified as UTF-8 a message that is not UTF-8 and that is also not displayed correctly, I tend to believe this is due to an error in the auto-converter, whereas it's a mistake in the headers.
One last point: when there's no content header, the auto-converter seems to try to do the conversion line by line. In the second sample, if you remove the charset and then retest it in Mozilla, the first line of EUC code displays as if it were ISO-8859-1, and the following lines are converted to Japanese. After some thinking, I don't think this is really a good feature, but it would need some larger-scale testing to decide between the two possibilities:
- Either have line-by-line conversion and sometimes errors, because it's impossible to decide on just one line.
- Or use only one encoding for the whole message, and not be able to correctly decode messages that mix several different encodings.
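The trade-off between the two possibilities above can be sketched as follows. This is only an illustration: strict decoding with Python's codecs stands in for the real detector, and the candidate list and the helper names `plausible` and `per_line` are assumptions, not anything taken from Mozilla.

```python
CANDIDATES = ("utf-8", "euc_jp", "shift_jis")


def plausible(data: bytes):
    """Return the candidate encodings under which data decodes without error."""
    ok = []
    for encoding in CANDIDATES:
        try:
            data.decode(encoding)
        except UnicodeDecodeError:
            continue
        ok.append(encoding)
    return ok


def per_line(message: bytes):
    """Evaluate every line on its own, as a line-by-line converter would."""
    return [plausible(line) for line in message.split(b"\n")]


# Two-line EUC-JP message: HIRAGANA LETTER A, then HIRAGANA LETTER WO.
message = "\u3042\n\u3092".encode("euc_jp")

for number, encodings in enumerate(per_line(message), start=1):
    print(f"line {number}: {encodings}")
print("whole message:", plausible(message))
```

In this sample the first line, taken alone, is still a valid byte sequence in more than one candidate encoding, while the message as a whole leaves a single candidate. Conversely, a whole-message decision cannot handle a message whose lines really are in different encodings.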
Severity: normal → enhancement
OK, I fixed it for the test case.
Status: ASSIGNED → RESOLVED
Closed: 25 years ago
Resolution: --- → FIXED
** Checked with 2/17/2000 Linux build ** Unix is the only platform left which still emits debug msgs. So I tried jean-marc's EUC-JP example with the above build in the mailbox with Japanese Auto-Detection on. The detection correctly identified EUC-JP on all 6 of the lines. For this example, the previously observed problem seems to have been fixed. I'm going to mark this fix verified. Jean-Marc, try it out yourself. I could read your msg without any error. You say that you wrote a book on how to learn Kanji faster. Send me more info on that, please.
Status: RESOLVED → VERIFIED
Sample 3 was a shortened version of sample 2. In sample 2, you could see the message was truly from Yves Maniette, who translated and adapted into French the kanji learning book of James W. HEISIG. I felt free to use this short text as a test, because it WAS originally a test Yves Maniette wrote to a public mailing list to check whether he was able to send correct Japanese messages with his Netscape 4.x. I should have kept the attribution of the message. ftang wrote earlier that he had fixed the sample; when I've got time, I will download and test the latest build again to check whether I can find other cases of bad conversion.
Product: MailNews → Core
Product: Core → MailNews Core