Closed Bug 23944 Opened 26 years ago Closed 26 years ago

Incorrect identification of Japanese SJIS charset

Tracking

(Not tracked)

Status:

VERIFIED FIXED

Milestone:

M14

People

(Reporter: jmdesp, Assigned: ftang)

Details

Attachments

(3 files)

email that is recognised as UTF8 instead of SJIS (26 years ago, Jean-Marc Desperrier, 1.53 KB, text/plain)
email identified as EUC, but incorrectly displayed (26 years ago, Jean-Marc Desperrier, 2.11 KB, text/plain)
Test mail : First line incorrectly recognised, the rest identified as EUC (26 years ago, Jean-Marc Desperrier, 835 bytes, text/plain)

Jean-Marc Desperrier

Reporter

Description

•

26 years ago

On an SJIS-encoded mail, auto-detection displays "It's NOT shift_JIS - byte 240 (F0)" and then tries to display the mail as UTF-8. But F0 is a legal code for SJIS. In the mail, the code was present in the hiragana "wo" character.
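
The claim that F0 is legal in SJIS can be checked with a minimal sketch. It uses Python's built-in codecs as a stand-in, not the Mozilla detector discussed in this bug, and `is_valid` is a hypothetical helper name introduced only for illustration.

```python
# Hiragana "wo" (U+3092) encoded as Shift_JIS with Python's built-in codec.
wo = "\u3092".encode("shift_jis")
print(wo.hex())  # the two-byte sequence; 0xF0 appears as the trail byte


def is_valid(data: bytes, encoding: str) -> bool:
    """Return True if `data` decodes cleanly in `encoding` (hypothetical helper)."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


print("valid Shift_JIS:", is_valid(wo, "shift_jis"))
print("valid UTF-8:", is_valid(wo, "utf-8"))
```

With these codecs, the character is the two-byte sequence 0x82 0xF0, so 0xF0 is a legal trail byte in Shift_JIS, and the same bytes are not valid UTF-8.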

nhottanscp

Comment 1

•

26 years ago

I need more info: Do you know which character causes this? Is it 0xF0XX or 0xXXF0? Are you able to reproduce this in the browser?

Jean-Marc Desperrier

Reporter

Comment 2

•

26 years ago

I realized that auto-detection displays the offset of the error, not the value of the byte. So the problem might have nothing to do with the value 0xF0. But I can't work out how it counts the offset inside the message, so I can't say which byte value triggers the problem. I've been testing, and it seems to count more than one for each byte. I'm not able to reproduce the bug in the browser, because the browser displays the mail correctly when it is saved as a file. But if you give me an email address, I can forward the mail, shortened to just one line that is still incorrectly detected as UTF-8 instead of SJIS. I also have another mail that displays as Korean under Linux and as garbage under Windows, despite the fact that the auto-converter says it has recognised it as EUC-JP (it is indeed EUC-JP). Once again, when saved as a file, it is correctly displayed in the browser. It might be a different problem. All tests were with M12-seamonkey.

bobj

Updated

•

26 years ago

Target Milestone: M14

bobj

Comment 3

•

26 years ago

nhotta is on vacation this week. Marking for M14 for now.

Katsuhiko Momoi

Comment 4

•

26 years ago

hi, jean-marc. Could you possibly send me some test msgs with the problem characters in it? Send to: momoi@netscape.com.

Frank Tang

Assignee

Comment 5

•

26 years ago

"byte 240 (F0)" means "the 240th (0xF0) byte from the beginning of the file", not "the byte with value 240 (0xF0)".
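
The offset-versus-value distinction can be sketched as follows. This is only an illustration with Python's built-in codecs, not the detector's actual logging code; `first_bad_byte` is a hypothetical helper, and whether the real log counts offsets from 0 or from 1 is not stated in this bug.

```python
from typing import Optional, Tuple


def first_bad_byte(data: bytes, encoding: str) -> Optional[Tuple[int, int]]:
    """Return (offset, value) of the first undecodable byte, or None.

    The offset is 0-based here; that is an assumption of this sketch.
    """
    try:
        data.decode(encoding)
    except UnicodeDecodeError as err:
        return err.start, data[err.start]
    return None


# Three ASCII bytes followed by hiragana "wo" in Shift_JIS.
sample = b"abc" + "\u3092".encode("shift_jis")

result = first_bad_byte(sample, "utf-8")
if result is not None:
    offset, value = result
    # Reporting both numbers avoids the ambiguity discussed above.
    print(f"not UTF-8: offset {offset}, byte value 0x{value:02X}")
```

A log message that prints both the offset and the byte value, each clearly labelled, would not be open to the misreading in the original description.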

Frank Tang

Assignee

Comment 6

•

26 years ago

Change OS and Platform to ALL

OS: Linux → All

Hardware: PC → All

nhottanscp

Comment 7

•

26 years ago

Reassign to ftang.

Assignee: nhotta → ftang

Frank Tang

Assignee

Updated

•

26 years ago

Status: NEW → ASSIGNED

Frank Tang

Assignee

Comment 8

•

26 years ago

Jean-Marc, can you attach the problem message to this bug, so that I can later download the file and run it through my local build?

Jean-Marc Desperrier

Reporter

Comment 9

•

26 years ago

Attached file email that is recognised as UTF8 instead of SJIS — Details

Jean-Marc Desperrier

Reporter

Comment 10

•

26 years ago

Attached file email identified as EUC, but incorrectly displayed — Details

Jean-Marc Desperrier

Reporter

Comment 11

•

26 years ago

It seems auto-recognition is not reset between the parts of a multi-part email message. It should be, as the parts may have different encodings. This should be added somewhere in the wish list for future versions.
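
The request above can be pictured with a minimal sketch in which detection starts fresh for every part. The try-decode loop is a simplified stand-in for a real detector, the candidate order is an assumption chosen for this example, and `detect` and `decode_parts` are hypothetical names.

```python
from typing import List, Optional, Tuple

# Candidate order matters for this naive approach: EUC-JP is tried before
# Shift_JIS because many EUC-JP byte sequences also decode as Shift_JIS.
CANDIDATES = ("utf-8", "euc_jp", "shift_jis")


def detect(data: bytes) -> Optional[str]:
    """Return the first candidate encoding that decodes `data` cleanly."""
    for enc in CANDIDATES:
        try:
            data.decode(enc)
        except UnicodeDecodeError:
            continue
        return enc
    return None


def decode_parts(parts: List[bytes]) -> List[Tuple[Optional[str], Optional[str]]]:
    """Detect and decode each part independently; no state is carried over."""
    results = []
    for raw in parts:
        enc = detect(raw)  # fresh detection for every part
        results.append((enc, raw.decode(enc) if enc else None))
    return results


# Two bodies of one hypothetical multi-part message, in different encodings.
parts = ["\u3092".encode("shift_jis"), "\u6f22\u5b57".encode("euc_jp")]
for enc, text in decode_parts(parts):
    print(enc, text)
```

Because nothing is remembered from one part to the next, a Shift_JIS body followed by an EUC-JP body is handled part by part rather than being forced into a single encoding.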

Katsuhiko Momoi

Comment 12

•

26 years ago

These 2 test msgs have non-Japanese charsets. One says iso-8859-1 (main body) and the other says us-ascii (2nd body). I don't think we can currently display these because the charset info does not match the content of the mail. Also, we are not supposed to apply auto-detection to mail msgs which already have a charset parameter. I think these will become displayable when we implement charset override via the Character coding menu.

Jean-Marc Desperrier

Reporter

Comment 13

•

26 years ago

What's the point of auto-detection if you just read the charset parameter? M12 tries to do auto-detection on every message I receive instead of using the charset parameter. If you have a robust auto-detection algorithm, it's a much better way to go than using charset headers; the average user just doesn't set the charset correctly. I used ISO-8859 for convenience, because Netscape and Outlook will convert SJIS and EUC to ISO-2022 when sending mail, but I think I can recreate the samples with correct headers. The samples are based on mail I actually received (it's from an open mailing list, not personal mail, and I removed any content that could be considered personal), and the encoding was incorrectly set from the start.

Jean-Marc Desperrier

Reporter

Comment 14

•

26 years ago

OK, I'll correct my mistake. If I set the correct content encoding inside the messages, Netscape displays them correctly, so you're right, Momoi: the problem comes from the header, and if I set no content encoding, the auto-converter does quite a good job. So let's close this as a bug.

But I'd like two things to be enhanced:
- It would be nice to have a way to ignore the content of the header, other than manually selecting the correct encoding. Headers are very often wrong.
- Change something in the converter log that appears in the log window. It's very misleading. When it tells me it has identified as UTF-8 a message that is not UTF-8 and is also not correctly displayed, I tend to believe this is due to an error in the auto-converter, whereas it's a mistake in the headers.

One last point: when there's no content header, the auto-converter seems to try to do conversion line by line. In the second sample, if you remove the charset and then retest it in Mozilla, the first line of EUC code displays as if it were ISO-8859-1, and the following lines are converted to Japanese. After some thinking, I don't think this is really a good feature, but it would need some larger-scale testing to decide between the two possibilities:
- Either have line-by-line conversion and sometimes errors, because it's impossible to decide on just one line.
- Or use only one encoding for the whole message, and not be able to correctly decode messages that mix several different encodings.
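
The trade-off between the two possibilities can be illustrated with a small sketch. It only checks which encodings decode a byte string without error, using Python's built-in codecs; this is not how the Mozilla detector works, and `plausible_encodings` and the sample text are assumptions made up for the example.

```python
from typing import List

CANDIDATES = ("utf-8", "euc_jp", "shift_jis")


def plausible_encodings(data: bytes) -> List[str]:
    """Return every candidate encoding in which `data` decodes without error."""
    ok = []
    for enc in CANDIDATES:
        try:
            data.decode(enc)
        except UnicodeDecodeError:
            continue
        ok.append(enc)
    return ok


# A two-line EUC-JP message: a hiragana-only line, then a line with kanji.
line1 = "\u3042\u3044\u3046".encode("euc_jp")
line2 = "\u6f22\u5b57".encode("euc_jp")
message = line1 + b"\n" + line2

# Line-by-line: each line is judged on its own bytes only.
for line in message.split(b"\n"):
    print("per line:", plausible_encodings(line))

# Whole message: every byte in the message has to fit one encoding.
print("whole message:", plausible_encodings(message))
```

With these codecs, the hiragana-only line is byte-for-byte valid both as EUC-JP and as Shift_JIS (where the same bytes read as half-width katakana), so a line-by-line decision on it is a guess. Only when the second line is taken into account is Shift_JIS ruled out. The price of the whole-message approach is the one named above: a message that really mixes encodings cannot be decoded correctly.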

Severity: normal → enhancement

Jean-Marc Desperrier

Reporter

Comment 15

•

26 years ago

Attached file Test mail : First line incorrectly recognised, the rest identified as EUC — Details

Frank Tang

Assignee

Comment 16

•

26 years ago

OK, I fixed it for the test case.

Status: ASSIGNED → RESOLVED

Closed: 26 years ago

Resolution: --- → FIXED

Katsuhiko Momoi

Comment 17

•

26 years ago

** Checked with 2/17/2000 Linux build ** Unix is the only platform left which still emits debug msgs. So I tried jean-marc's EUC-JP example with the above build in the mailbox with Japanese Auto-Detection on. The detection correctly identified EUC-JP on all 6 of the lines. For this example, the previously observed problem seems to have been fixed. I'm going to mark this fix verified. Jean-Marc, try it out yourself. I could read your msg without any error. You say that you wrote a book on how to learn Kanji faster. Send me more info on that, please.

Status: RESOLVED → VERIFIED

Jean-Marc Desperrier

Reporter

Comment 18

•

26 years ago

Sample 3 was a shortened version of sample 2. In sample 2, you could see the message was truly from Yves Maniette, who translated and adapted into French the kanji learning book of James W. HEISIG. I felt free to use this short text as a test, because it WAS originally a test Yves Maniette wrote to a public mailing list to check if he was able to send correct Japanese messages with his Netscape 4.x. I should have kept the attribution of the message. ftang wrote earlier that he had fixed the sample. When I've got time, I will download and test the latest build again to check whether I'm able to find other cases of bad conversion.

Myk Melez [:myk] [@mykmelez]

Updated

•

21 years ago

Product: MailNews → Core

Nobody; OK to take it and work on it

Updated

•

17 years ago

Product: Core → MailNews Core
