378008 - different charsets for headers and mail body (Import from Outlook 2003 puts different charset from one in <meta> tag in Content-Type:. Tb uses charset in Content-Type: instead of charset in <meta http-equiv>)

Reporter

Description

•

18 years ago

User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.3) Gecko/20070309 Firefox/2.0.0.3
Build Identifier: version 2.0.0.0 (20070326)

There is a mail imported from Outlook 2003.

Headers:
Content-Type: text/html; charset=windows-1251

Actual mail:
meta http-equiv=Content-Type content="text/html; charset=koi8-r"

By default Thunderbird detects it and displays it using the windows-1251 charset. The encoding must be changed by hand to koi8-r to display it properly.

Reproducible: Always

Steps to Reproduce:
1. set intl.charset.detector to ruprob
2. set mailnews.view_default_charset to KOI8-R
3. import attached eml
4. view this mail

Actual Results:
message displayed using windows-1251 charset (from headers)

Expected Results:
message displayed using koi8-r charset (actual encoding)

Was fine in Outlook 2003
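The symptom described above can be reproduced outside any mail client. The following is only a minimal sketch: the sample text is an assumption and is not taken from the attached mail. It shows that bytes written in koi8-r still decode without error as windows-1251, but produce unreadable text.

```python
# Illustrative sample text (an assumption, not the content of the attached .eml).
original = "Привет, мир"

# The body bytes as the mail actually stores them (koi8-r).
raw = original.encode("koi8_r")

# What the reader sees when the charset from Content-Type: (windows-1251) is applied.
wrong = raw.decode("cp1251")

# What the reader sees when the charset from <meta http-equiv> (koi8-r) is applied.
right = raw.decode("koi8_r")

print(wrong)
print(right)
```

Because both charsets are single-byte and cover the same byte range for Cyrillic letters, no decoding error is raised in either case, so a decoder cannot notice the mismatch on its own.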

Dmitry Kubov

Reporter

Comment 1

•

18 years ago

Attached file mail with multiple charsets in header/body — Details

Dmitry Kubov

Reporter

Comment 2

•

18 years ago

That's the case for all messages composed by MS Outlook with MS Word as the editor.

Component: Migration → General

WADA:World Anti-bad-Duping Agency

Comment 3

•

18 years ago

> mail with multiple charsets in header/body

(message header for mail body)
Content-Type: text/html; charset=windows-1251
(meta in <head>)
<meta http-equiv=Content-Type content="text/html; charset=koi8-r">

With HTTP (browser), <meta ... charset=xxx> is applied only when the HTTP header (Content-Type:) doesn't have a charset. This is defined by RFC. (IE6 does the opposite. I don't know about IE7.) And as far as I remember, auto-detect is invoked only when there is no <meta ... charset=xxx>.

Although this is a message header and an MUA rather than an HTTP header and a browser, I think applying the HTTP header rule (charset=windows-1251) is reasonable in this case.

To Dmitry Kubov (bug opener):
Is the mail incorrectly rendered when the header is "Content-Type: text/html;" (no charset)?
When there is no charset in Content-Type: and the default charset is other than koi8-r (such as utf-8), is the mail rendered incorrectly?
(Please set View/Character Encoding/Auto Detect=On and choose the appropriate one)
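The precedence described in this comment can be written down as a small sketch. It assumes the order stated above (header charset first, then <meta>, then auto-detect). The function name pick_charset, its parameters, and the "utf-8" fallback are hypothetical and are not Thunderbird code.

```python
from typing import Optional


def pick_charset(
    header_charset: Optional[str],
    meta_charset: Optional[str],
    detected_charset: Optional[str],
    default: str = "utf-8",  # hypothetical fallback, not a Thunderbird default
) -> str:
    """Return the first charset available, in header > meta > detector order."""
    for candidate in (header_charset, meta_charset, detected_charset):
        if candidate:
            return candidate.strip().lower()
    return default


# This bug's case: the header wins, so koi8-r from <meta> is never consulted.
print(pick_charset("windows-1251", "koi8-r", None))

# With no charset in Content-Type:, the <meta> charset is used.
print(pick_charset(None, "koi8-r", "windows-1251"))
```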

WADA:World Anti-bad-Duping Agency

Comment 4

•

18 years ago

Adding an explanation to the summary for ease of understanding.

Summary: different charsets for headers and mail body → different charsets for headers and mail body (charset in Content-Type: is used instead of charset in <meta http-equiv>)

Dmitry Kubov

Reporter

Comment 5

•

18 years ago

Autodetect is already configured by "set intl.charset.detector to ruprob".

Tweaking the .eml file from
Content-Type: text/html; charset=windows-1251
to
Content-Type: text/html
fixes the rendering issue. It then uses koi8-r from the HTML part of the message, even with other default charsets.

But it's not a solution for every buggy message.
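For a single saved message, the manual tweak described here can be scripted. This is a rough sketch using Python's standard email package. The sample message text is made up for illustration, and it does nothing for mail already stored in a Thunderbird folder.

```python
from email import message_from_string

# Made-up sample standing in for the attached .eml (an assumption).
raw_eml = (
    "Subject: test\n"
    "MIME-Version: 1.0\n"
    "Content-Type: text/html; charset=windows-1251\n"
    "\n"
    '<html><head><meta http-equiv=Content-Type content="text/html; charset=koi8-r">'
    "</head><body>...</body></html>\n"
)

msg = message_from_string(raw_eml)

# Drop only the charset parameter; the media type text/html is kept.
msg.del_param("charset")

print(msg["Content-Type"])
```

After this change the header carries no charset, so a reader has to fall back to the <meta> tag or to its own detection, which is the behavior reported in this comment.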

Phil Ringnalda (:philor)

Updated

•

18 years ago

QA Contact: migration → general

Dan Mosedale (:dmosedale, :dmose)

Updated

•

17 years ago

Assignee: mscott → nobody

WADA:World Anti-bad-Duping Agency

Comment 7

•

16 years ago

The opener of DUPed bug 505072 says:
> Version 2.0b1pre (build id: 20090702001417) doesn't have such problem.

Dmitry Kubov (bug opener): Can you reproduce the problem with the newest trunk nightly build?

[:Aureliano Buendía]

Comment 8

•

16 years ago

Reporter, can you reply to comment #7?

Zane U. Ji

Comment 9

•

16 years ago

I just installed 2009072012539. Unfortunately, I can still reproduce the bug.

Zane U. Ji

Comment 10

•

16 years ago

I checked the email twice and found that the inline HTML body is encoded in UTF8. To show the content SeaMonkey has to 1) decode the HTML body in UTF8 as specified in mail header, 2) display the HTML body in GB2312 as its header indicates. I am not sure if this format is correct.

WADA:World Anti-bad-Duping Agency

Comment 11

•

16 years ago

(In reply to comment #9)
> I just installed 2009072012539. Unfortunately, I can still reproduce the bug.

Which build (which Gecko)?
> seamonkey-2.0b1pre / seamonkey-2.0b2pre (Gecko 1.9.1)
> http://ftp.mozilla.org/pub/mozilla.org/seamonkey/nightly/latest-comm-1.9.1/
> seamonkey-2.1a1pre (Gecko 1.9.2)
> http://ftp.mozilla.org/pub/mozilla.org/seamonkey/nightly/latest-comm-central-trunk/

> I checked the email twice and found that the inline HTML body is encoded in UTF8.

Does it mean the following?
> (Case-B) Your case
> Content-Type: text/html; charset="utf-8"
> (snip)
> <meta http-equiv="Content-Type" content="text/html; charset=gb2312" />
> HTML source is written in UTF-8

If so, it's the opposite of this bug...
> (Case-A) This bug's case
> Content-Type: text/html; charset="windows-1251"
> (snip)
> <meta http-equiv="Content-Type" content="text/html; charset=koi8-r" />
> HTML source is written in koi8-r

If "Content-Type: charset=..." is used, the complaint of this bug occurs with Case-A, but if "Content-Type: charset=..." is not used, the complaint of bug 505072 arises with Case-B...

WADA:World Anti-bad-Duping Agency

Comment 12

•

16 years ago

I could confirm that bug 505072 is Case-B above, as follows.
1. View/Source => View/Character Encoding is set to UTF-8 => no characters are displayed as U+FFFD(�)
2. Change View/Character Encoding to gb2312 => many characters are displayed as U+FFFD(�)

Re-opening bug 505072, because it is the opposite problem to this bug, and I think bug 505072 is a regression caused by some change.

Zane U. Ji

Comment 13

•

16 years ago

(In reply to comment #11)
> Which build (which Gecko)?
> > seamonkey-2.0b1pre / seamonkey-2.0b2pre (Gecko 1.9.1)
> > http://ftp.mozilla.org/pub/mozilla.org/seamonkey/nightly/latest-comm-1.9.1/
> > seamonkey-2.1a1pre (Gecko 1.9.2)
> > http://ftp.mozilla.org/pub/mozilla.org/seamonkey/nightly/latest-comm-central-trunk/

The 2nd. Build identifier: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9.2a1pre) Gecko/20090720 SeaMonkey/2.1a1pre

> > I checked the email twice and found that the inline HTML body is encoded in UTF8.
>
> Does it mean the following?
> > (Case-B) Your case
> > Content-Type: text/html; charset="utf-8"
> > (snip)
> > <meta http-equiv="Content-Type" content="text/html; charset=gb2312" />
> > HTML source is written in UTF-8
>
> If so, it's the opposite of this bug...
> > (Case-A) This bug's case
> > Content-Type: text/html; charset="windows-1251"
> > (snip)
> > <meta http-equiv="Content-Type" content="text/html; charset=koi8-r" />
> > HTML source is written in koi8-r
>
> If "Content-Type: charset=..." is used, the complaint of this bug occurs with
> Case-A, but if "Content-Type: charset=..." is not used, the complaint of bug 505072
> arises with Case-B...

My humble guess is that SeaMonkey mail tries to display an HTML email without decoding it. We can try decoding the inline HTML body of Case-A as windows-1251 and displaying it using the koi8-r encoding. If the result is correct, then the problem is solved. I don't know the difference between windows-1251 and ASCII, and I have no idea what the correct result is, so I can't do that by myself.

WADA:World Anti-bad-Duping Agency

Comment 14

•

16 years ago

(In reply to comment #13)
> And I have no idea what the correct result is.

Me neither. The problem is that the correct action is not clear (a) when the mail is malformed, or (b) when the real character encoding of the mail data is altered by a server (a cache server, the server that holds the mail data, an intermediate server in mail delivery, ...).

(1) Bug 505072
Content-Type: ... charset=CHARSET-1, <meta ... charset=CHARSET-2>, HTML is written in CHARSET-1.
(2) This bug
Content-Type: ... charset=CHARSET-1, <meta ... charset=CHARSET-2>, HTML is written in CHARSET-2.
(3) Worst case
Content-Type: ... charset=CHARSET-1, <meta ... charset=CHARSET-2>, HTML is written in CHARSET-3.

I think the charset of Content-Type: should be used, to support (b) above well, as HTTP does with the HTTP header, but I'm not sure.

WADA:World Anti-bad-Duping Agency

Comment 15

•

16 years ago

(In reply to comment #5)
> But it's not a solution for every buggy message.

To be tolerant of such malformed mails from bad mailers, Tb has a feature.

Folder properties
Default Character Encoding: koi8-r
[ X ] Apply default to all messages in the folder (individual message character encoding settings and auto-detection will be ignored)

In your case, it's equivalent to replacing mail header (a) with mail header (b).
> (a) Content-Type: text/html; charset=windows-1251
> (b) Content-Type: text/html; charset=koi8-r

It's a feature made to relieve the many victims of malformed mail like you. If you copy the mails to a folder with this option set (and rebuild the index if required), the text you want is displayed and the mail body becomes readable, provided the mail body is written in koi8-r.

(Correction of comment #7)
Bug 505072 was a different issue from this bug, on similar mail data. Dmitry Kubov (bug opener), there is no need to test again, because nothing has changed for this bug.

[:Aureliano Buendía]

Updated

•

16 years ago

Keywords: testcase

Jorg K (CEST = GMT+2)

Comment 17

•

15 years ago

The problem is here in the code: http://mxr.mozilla.org/comm-central/source/mailnews/import/outlook/src/nsOutlookCompose.cpp#607. The import tries to get the charset from the header. If none is there, it defaults to the system character set; for Windows machines with an English version of Windows, that's normally "windows-1251". Instead, the import should extract the charset from the appropriate HTML header.
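The fallback order suggested in this comment might look roughly like the sketch below. It is written in Python for brevity and is not the code in nsOutlookCompose.cpp. The function name charset_for_import, its parameters, and the regular expression are all hypothetical.

```python
import re
from typing import Optional

# Hypothetical pattern: finds charset=... inside a <meta> tag, quoted or not.
_META_CHARSET = re.compile(
    rb"<meta[^>]*?charset\s*=\s*[\"']?([A-Za-z0-9_.:-]+)", re.IGNORECASE
)


def charset_for_import(
    header_charset: Optional[str], html_body: bytes, system_default: str
) -> str:
    """Prefer the message header, then the HTML <meta> tag, then the system charset."""
    if header_charset:
        return header_charset.lower()
    match = _META_CHARSET.search(html_body)
    if match:
        return match.group(1).decode("ascii").lower()
    return system_default


html = b'<html><head><meta http-equiv=Content-Type content="text/html; charset=koi8-r"></head></html>'

# No charset on the source message: the <meta> value is used, not the system default.
print(charset_for_import(None, html, "windows-1251"))
```

A regular expression is used only to keep the sketch short; a real importer would more likely rely on an HTML parser.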

WADA:World Anti-bad-Duping Agency

Updated

•

15 years ago

Component: General → Import

Product: Thunderbird → MailNews Core

QA Contact: general → import

Summary: different charsets for headers and mail body (charset in Content-Type: is used instead of charset in <meta http-equiv>) → different charsets for headers and mail body (Import from Outlook 2003 puts different charset from one in <meta> tag in Content-Type:. Tb uses charset in Content-Type: instead of charset in <meta http-equiv>)

Ludovic Hirlimann [:Usul]

Updated

•

15 years ago

Status: UNCONFIRMED → RESOLVED

Closed: 15 years ago

Resolution: --- → DUPLICATE