Closed
Bug 23944
Opened 25 years ago
Closed 25 years ago
Incorrect identification of japanese SJIS charset
Categories
(MailNews Core :: Internationalization, enhancement, P3)
MailNews Core
Internationalization
Tracking
(Not tracked)
VERIFIED
FIXED
M14
People
(Reporter: jmdesp, Assigned: ftang)
Details
Attachments
(3 files)
On a SJIS-encoded mail, auto-detection displays:
It's NOT shift_JIS - byte 240 (F0)
and then tries to display the mail as UTF-8.
But F0 is a legal code in SJIS.
In the mail, that code was present in the hiragana wo character.
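As a quick check of the claim that F0 is legal in SJIS, the sketch below encodes hiragana wo with Python's built-in `shift_jis` codec. The codec is used only as a reference here; it is not the detector discussed in this bug.

```python
# Encode hiragana "wo" (U+3092) with Python's built-in Shift_JIS codec.
wo_bytes = "\u3092".encode("shift_jis")

# The character is two bytes long; unpack the lead and trail byte values.
lead, trail = wo_bytes

# The trail byte is 0xF0 (decimal 240), so F0 is legal as a second byte.
print(f"lead=0x{lead:02X} trail=0x{trail:02X}")
```

So F0 can appear as the second byte of a valid SJIS character, which is consistent with the reporter's statement.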
Comment 1•25 years ago
I need more info:
Do you know which character causes this? Is it 0xF0XX or 0xXXF0?
Are you able to reproduce this in the browser?
Reporter | ||
Comment 2•25 years ago
I realized that auto-detection displays the offset of the error, not the value
of the byte. So the problem might have nothing to do with the value 0xF0.
But I can't work out how it counts the offset inside the message,
so I can't say which byte value triggers the problem.
I've been testing, and it seems to count more than one for each byte.
I'm not able to reproduce the bug in the browser, because the browser displays
the mail correctly when it is saved as a file.
But if you give me a mail address, I can forward the mail, shortened to just one
line that is still incorrectly detected as UTF-8 instead of SJIS.
I also have another mail that displays as Korean under Linux and as garbage under
Windows, despite the fact that the auto-converter says it has recognised it as
EUC-JP (it is indeed EUC-JP). Once again, when saved as a file, it is correctly
displayed in the browser. It might be a different problem.
All tests were done with M12-seamonkey.
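To illustrate the difference between the offset of a failing byte and the value of that byte, here is a minimal Python sketch. The helper `first_invalid_offset` is hypothetical and relies on Python's own codecs; how the Mozilla detector actually counts offsets is not established in this thread.

```python
from typing import Optional


def first_invalid_offset(data: bytes, encoding: str) -> Optional[int]:
    """Return the 0-based offset of the first undecodable byte, or None."""
    try:
        data.decode(encoding)
    except UnicodeDecodeError as err:
        # err.start is a position in the input, not a byte value.
        return err.start
    return None


# Three ASCII bytes followed by hiragana "wo" encoded as Shift_JIS.
sample = b"abc" + "\u3092".encode("shift_jis")

offset = first_invalid_offset(sample, "utf-8")
# Print the offset and, separately, the value of the byte found there.
print(offset, hex(sample[offset]))
```

The two numbers are unrelated: the offset says where decoding failed, while the byte value has to be looked up at that position.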
Comment 4•25 years ago
Hi, Jean-Marc. Could you possibly send me some test msgs with the problem characters in them?
Send to: momoi@netscape.com.
Assignee | ||
Comment 5•25 years ago
"byte 240 (F0)" means "the 240th (0xF0) byte from the beginning of the file", not
"the byte value 240 (0xF0)".
Assignee | ||
Updated•25 years ago
Status: NEW → ASSIGNED
Assignee | ||
Comment 8•25 years ago
Jean-Marc, can you attach the problem message to this bug? Then I
can download the file later and run it through my local build.
Reporter | ||
Comment 9•25 years ago
Reporter | ||
Comment 10•25 years ago
Reporter | ||
Comment 11•25 years ago
It seems auto-recognition is not reset between the parts of a multi-part email
message.
It should be, as the parts may have different encodings.
This should be added somewhere in the wish list for future versions.
Comment 12•25 years ago
These 2 test msgs have non-Japanese charsets.
One says iso-8859-1 (main body) and the other says us-ascii (2nd body).
I don't think we can currently display these because the charset info does not
match the content of the mail.
Also, we are not supposed to apply auto-detection to mail msgs which already have
a charset parameter. I think these will become displayable when we implement charset override
via the Character coding menu.
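The rule described in this comment (a declared charset parameter wins, and auto-detection applies only when none is declared) can be sketched with Python's standard `email` package. The function `charset_to_use` and the `detect` callback are hypothetical names for illustration; this is not the MailNews code.

```python
from email import message_from_string
from typing import Callable


def charset_to_use(raw_message: str, detect: Callable[[str], str]) -> str:
    """Prefer the declared charset; fall back to detection if there is none."""
    msg = message_from_string(raw_message)
    declared = msg.get_content_charset()  # None when no charset parameter exists
    if declared is not None:
        return declared
    return detect(msg.get_payload())


labelled = "Content-Type: text/plain; charset=iso-8859-1\n\nbody\n"
unlabelled = "Subject: test\n\nbody\n"


def fake_detector(_body: str) -> str:
    # Stand-in for a real detector; it always answers "euc-jp".
    return "euc-jp"


print(charset_to_use(labelled, fake_detector))
print(charset_to_use(unlabelled, fake_detector))
```

Under this rule a message labelled iso-8859-1 is never sent to the detector, even if its bytes are really Japanese, which matches the behaviour described for the two test msgs.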
Reporter | ||
Comment 13•25 years ago
What's the point of auto-detection if you just read the charset parameter?
M12 tries to do auto-detection on every message I receive instead of using
the charset parameter.
If you have a robust auto-detection algorithm, it's a much better way to go than
relying on charset headers: the average user just doesn't set the charset correctly.
I used ISO-8859 for convenience, because Netscape and Outlook will convert SJIS
and EUC to ISO-2022 when sending mail, but I think I can recreate the samples
with correct headers.
The samples are based on mail I actually received (it's from an open mailing
list, not personal mail, and I removed any content that could be considered
personal), and the encoding was incorrectly set from the start.
Reporter | ||
Comment 14•25 years ago
OK, I'll correct my mistake.
If I set the correct content encoding inside the messages, Netscape displays
them correctly, so you're right, Momoi: the problem comes from the header. And if
I set no content encoding, the auto-converter does quite a good job.
So let's close this as a bug.
But I'd like two things to be enhanced:
- It would be nice to have a way to ignore the content of the header, other than
manually selecting the correct encoding. Headers are very often wrong.
- Change something in the converter log that appears in the log window. It's
very misleading. When it tells me it has identified my message as UTF-8 when it is
not UTF-8 and is not displayed correctly either, I tend to believe this is
due to an error in the auto-converter, whereas it's a mistake in the headers.
One last point: when there's no content header, the auto-converter seems to do
the conversion line by line.
In the second sample, if you remove the charset and then retest it in Mozilla,
the first line of EUC code displays as if it were ISO-8859-1, and the following
lines are converted to Japanese.
After some thinking, I don't think this is really a good feature, but it would
need some larger-scale testing to decide between the two possibilities:
- Either have line-by-line conversion and sometimes errors, because it's
impossible to decide on just one line.
- Or use only one encoding for the whole message, and not be able to
correctly decode messages that mix several different encodings.
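The two possibilities listed above can be compared with a small Python sketch. It uses naive trial decoding as a stand-in for a real detector, and the candidate list and its order are assumptions made for illustration only.

```python
from typing import List, Optional

# Illustrative candidate order (an assumption, not the Mozilla detector's).
CANDIDATES = ["ascii", "euc_jp", "shift_jis"]


def guess(data: bytes) -> Optional[str]:
    """Return the first candidate encoding that decodes data without error."""
    for encoding in CANDIDATES:
        try:
            data.decode(encoding)
            return encoding
        except UnicodeDecodeError:
            continue
    return None


def guess_per_line(lines: List[bytes]) -> List[Optional[str]]:
    """Possibility 1: decide separately for every line."""
    return [guess(line) for line in lines]


def guess_whole(lines: List[bytes]) -> Optional[str]:
    """Possibility 2: decide once for the whole message."""
    return guess(b"\n".join(lines))


message = [b"Hello,", "これはテストです".encode("euc_jp")]
print(guess_per_line(message))
print(guess_whole(message))
```

Per-line guessing gives each line its own label, so a line with too little evidence can be labelled differently from the rest of the message. Whole-message guessing gives one label, but it cannot handle a message that really mixes encodings.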
Severity: normal → enhancement
Reporter | ||
Comment 15•25 years ago
Assignee | ||
Comment 16•25 years ago
OK, I fixed it for the test case.
Status: ASSIGNED → RESOLVED
Closed: 25 years ago
Resolution: --- → FIXED
Comment 17•25 years ago
** Checked with 2/17/2000 Linux build **
Unix is the only platform left which still emits debug
msgs. So I tried jean-marc's EUC-JP example with the
above build in the mailbox with Japanese Auto-Detection on.
The detection correctly identified EUC-JP on all 6 of the
lines.
For this example, the previously observed problem seems to have been
fixed.
I'm going to mark this fix verified.
Jean-Marc, try it out yourself. I could read your msg without any
error. You say that you wrote a book on how to learn Kanji faster.
Send me more info on that, please.
Status: RESOLVED → VERIFIED
Reporter | ||
Comment 18•25 years ago
Sample 3 was a shortened version of sample 2. In sample 2, you could see the message was truly from Yves
Maniette, who translated and adapted into French the kanji learning book by James W. HEISIG.
I felt free to use this short text as a test, because it WAS originally a test Yves Maniette wrote to a public
mailing list to check whether he was able to send correct Japanese messages with his Netscape 4.x.
I should have kept the attribution of the message.
ftang wrote earlier that he had fixed the sample. When I've got time, I will download and test the latest
build again to check whether I can find other cases of bad conversion.
Updated•21 years ago
Product: MailNews → Core
Updated•17 years ago
Product: Core → MailNews Core