Closed Bug 378008 Opened 17 years ago Closed 14 years ago

different charsets for headers and mail body (Import from Outlook 2003 puts different charset from one in <meta> tag in Content-Type:. Tb uses charset in Content-Type: instead of charset in <meta http-equiv>)

Categories

(MailNews Core :: Import, defect)

x86
Windows XP
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED DUPLICATE of bug 547119

People

(Reporter: dmitry, Unassigned)

References

Details

(Keywords: testcase)

Attachments

(1 file)

User-Agent:       Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.1.3) Gecko/20070309 Firefox/2.0.0.3
Build Identifier: version 2.0.0.0 (20070326)

The mail was imported from Outlook 2003.
Headers:
Content-Type: text/html; charset=windows-1251

Actual mail:
<meta http-equiv=Content-Type content="text/html; charset=koi8-r">

By default Thunderbird takes the charset from the header and displays the mail as windows-1251. The encoding must be changed by hand to koi8-r for the mail to display properly.

Reproducible: Always

Steps to Reproduce:
1. set intl.charset.detector to ruprob
2. set mailnews.view_default_charset to KOI8-R
3. import attached eml
4. view this mail

Actual Results:  
message displayed using windows-1251 charset (from headers)

Expected Results:  
message displayed using koi8-r charset (actual encoding)

Was fine in Outlook 2003
That applies to all messages composed in MS Outlook with MS Word as the editor.
Component: Migration → General
> mail with multiple charsets in header/body

(message header for mail body)
Content-Type: text/html; charset=windows-1251
(meta in <head>)
<meta http-equiv=Content-Type content="text/html; charset=koi8-r">

In HTTP (browsers), <meta ... charset=xxx> is applied only when the HTTP header (Content-Type:) doesn't specify a charset.
This is defined by RFC. (IE6 does the opposite. I don't know about IE7.)
And as far as I remember, auto-detect is invoked only when there is no <meta ... charset=xxx>.
Although this is a message header in an MUA rather than an HTTP header in a browser, I think applying the same rule as for an HTTP header (charset=windows-1251) is reasonable in this case.
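
The precedence described above can be sketched as follows. This is only an illustration in Python, not Thunderbird's actual code; the function name `pick_charset` and its parameters are hypothetical.

```python
def pick_charset(header_charset=None, meta_charset=None,
                 detected_charset=None, default_charset="utf-8"):
    """Return the charset to use for rendering (hypothetical helper).

    Precedence, as described in the comment above:
    Content-Type header, then <meta>, then auto-detection, then the default.
    """
    for candidate in (header_charset, meta_charset, detected_charset):
        if candidate:
            return candidate.lower()
    return default_charset.lower()
```

Under this rule, the mail in this bug (header windows-1251, meta koi8-r) is rendered as windows-1251, which matches the reported behaviour.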

To Dmitry Kubov(bug opener) :

Is the mail rendered incorrectly when the header is "Content-Type: text/html;" (no charset)?
When there is no charset in Content-Type: and the default charset is other than koi8-r (such as utf-8), is the mail rendered incorrectly?
(Please set View/Character Encoding/Auto Detect=On and choose the appropriate one)
Adding an explanation to the summary for ease of understanding.
Summary: different charsets for headers and mail body → different charsets for headers and mail body (charset in Content-Type: is used instead of charset in <meta http-equiv>)
Autodetect is already configured by "set intl.charset.detector to ruprob".

Tweaking the .eml file from
Content-Type: text/html; charset=windows-1251
to
Content-Type: text/html
fixes the rendering issue. It then uses koi8-r from the HTML part of the message, even with other default charsets.

But it's not a solution for every buggy message.
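
For anyone who wants to apply the same tweak to many .eml files, here is a rough sketch using Python's standard `email` package. The function name `strip_header_charset` is made up for this example, and it is only a workaround for Case-A style mails, not a fix for the import code.

```python
from email import message_from_bytes


def strip_header_charset(raw_eml: bytes) -> bytes:
    """Return the message with the charset parameter removed from Content-Type.

    Hypothetical helper: it mirrors the manual edit described above.
    """
    msg = message_from_bytes(raw_eml)
    # del_param() works on the Content-Type header by default.
    msg.del_param("charset")
    return msg.as_bytes()
```

After this, the only charset hint left is the one in the <meta> tag of the HTML body.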
QA Contact: migration → general
Assignee: mscott → nobody
Opener of DUPed bug 505072 says:
> Version 2.0b1pre (build id: 20090702001417) doesn't have such problem.
Dmitry Kubov (bug opener): Can you reproduce the problem with the newest trunk nightly build?
Reporter, can you reply to comment #7?
I just installed 2009072012539. Unfortunately, I can still reproduce the bug.
I checked the email twice and found that the inline HTML body is encoded in UTF8.

To show the content SeaMonkey has to
1) decode the HTML body in UTF8 as specified in mail header,
2) display the HTML body in GB2312 as its header indicates.

I am not sure if this format is correct.
(In reply to comment #9)
> I just installed 2009072012539. Unfortunately, I can still reproduce the bug.

Which build(which Gecko)?
> seamonkey-2.0b1pre / seamonkey-2.0b2pre (Gecko 1.9.1)
> http://ftp.mozilla.org/pub/mozilla.org/seamonkey/nightly/latest-comm-1.9.1/
> seamonkey-2.1a1pre (Gecko 1.9.2)
> http://ftp.mozilla.org/pub/mozilla.org/seamonkey/nightly/latest-comm-central-trunk/

> I checked the email twice and found that the inline HTML body is encoded in
UTF8.

Does it mean next?
> (Case-B) Your case
> Content-Type: text/html; charset="utf-8"
>(snip)
> <meta http-equiv="Content-Type" content="text/html; charset=gb2312" />
> HTML source is written in UTF-8

If so, it's opposite to this bug...
> (Case-A) This bug's case
> Content-Type: text/html; charset="windows-1251"
>(snip)
> <meta http-equiv="Content-Type" content="text/html; charset=koi8-r" />
> HTML source is written in koi8-r

If "Content-Type: charset=..." is used, complaint of this bug occurs with Case-A, but if "Content-Type: charset=..." is not used, complaint of bug 505072 arises with Case-B...
I could confirm that bug 505072 is Case-B above, as follows.
  1. View/Source => View/Character Encoding is set to UTF-8
     => no characters are displayed as U+FFFD (�)
  2. Change View/Character Encoding to gb2312
     => many characters are displayed as U+FFFD (�)
Re-opening bug 505072, because it is the opposite problem to this bug, and I think bug 505072 is a regression caused by some change.
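
The same check can be approximated outside the mail client by counting how many U+FFFD characters each candidate encoding produces. A minimal sketch in Python follows; the function name `replacement_count` and the sample text are assumptions made for illustration.

```python
def replacement_count(raw: bytes, charset: str) -> int:
    """Count the U+FFFD characters produced when raw is decoded as charset."""
    return raw.decode(charset, errors="replace").count("\ufffd")


# Sample body written in UTF-8, as in Case-B (text chosen for illustration).
sample = "中文邮件".encode("utf-8")
utf8_errors = replacement_count(sample, "utf-8")
gb2312_errors = replacement_count(sample, "gb2312")
```

A count of zero for one encoding and a non-zero count for the other suggests which encoding the body is really written in. It is a weak signal for single-byte encodings such as koi8-r and windows-1251, where almost any byte sequence decodes without errors.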
(In reply to comment #11)
> Which build(which Gecko)?
> > seamonkey-2.0b1pre / seamonkey-2.0b2pre (Gecko 1.9.1)
> > http://ftp.mozilla.org/pub/mozilla.org/seamonkey/nightly/latest-comm-1.9.1/
> > seamonkey-2.1a1pre (Gecko 1.9.2)
> > http://ftp.mozilla.org/pub/mozilla.org/seamonkey/nightly/latest-comm-central-trunk/
The 2nd.
Build identifier: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US;
rv:1.9.2a1pre) Gecko/20090720 SeaMonkey/2.1a1pre
> 
> > I checked the email twice and found that the inline HTML body is encoded in
> UTF8.
> 
> Does it mean next?
> > (Case-B) Your case
> > Content-Type: text/html; charset="utf-8"
> >(snip)
> > <meta http-equiv="Content-Type" content="text/html; charset=gb2312" />
> > HTML source is written in UTF-8
> 
> If so, it's opposite to this bug...
> > (Case-A) This bug's case
> > Content-Type: text/html; charset="windows-1251"
> >(snip)
> > <meta http-equiv="Content-Type" content="text/html; charset=koi8-r" />
> > HTML source is written in koi8-r
> 
> If "Content-Type: charset=..." is used, complaint of this bug occurs with
> Case-A, but if "Content-Type: charset=..." is not used, complaint of bug 505072
> arises with Case-B...

My humble guess is that SeaMonkey mail tries to display an HTML email without
decoding it. We can try decoding the inline HTML body of Case-A as windows-1251,
and displaying it using koi8-r encoding. If the result is correct, then the problem is solved.

I don't know the difference between windows-1251 and ASCII. And I have no idea
what the correct result is. So I can't do that by myself.
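
For what it's worth, windows-1251 and koi8-r are both single-byte Cyrillic encodings that assign different letters to the same byte values, so Case-A bytes usually decode without any error under either one, but only one result is readable. A small sketch in Python, with the sample word chosen purely for illustration:

```python
text = "Привет"               # sample text, an assumption for this example
raw = text.encode("koi8-r")   # the bytes as actually written in Case-A

as_header_says = raw.decode("windows-1251")  # charset claimed by Content-Type
as_meta_says = raw.decode("koi8-r")          # charset claimed by <meta>
```

Only the koi8-r decoding gives back the original word. The windows-1251 decoding yields other Cyrillic letters, which is the garbled display described in this bug.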
(In reply to comment #13)
> And I have no idea what the correct result is.

Me too.
The problem is that the correct action is not clear (a) when the mail is malformed, or (b) when the real character encoding of the mail data is altered by a server (a cache server, the server that holds the mail data, an intermediate server in mail delivery, ...).
(1) Bug 505072
    Content-Type: ... charset=CHARSET-1, <meta ... charset=CHARSET-2>,
    HTML is written in CHARSET-1.
(2) This bug
    Content-Type: ... charset=CHARSET-1, <meta ... charset=CHARSET-2>,
    HTML is written in CHARSET-2.
(3) Worst case
    Content-Type: ... charset=CHARSET-1, <meta ... charset=CHARSET-2>,
    HTML is written in CHARSET-3.
I think the charset of Content-Type: should be used, to support (b) above well, as HTTP does with its header, but I'm not sure.
(In reply to comment #5)
> But its not a solution for every buggy message.

To be tolerant of such malformed mails from bad mailers, Tb has a feature.
  Folder properties
    Default Character Encoding: koi8-r
    [ X ] Apply default to all messages in the folder (individual message
          character encoding settings and auto-detection will be ignored)
It's equivalent to replacing mail header (a) with mail header (b) in your case.
> (a) Content-Type: text/html; charset=windows-1251
> (b) Content-Type: text/html; charset=koi8-r
It's a feature made to relieve the many, many victims of malformed mail like you.

If you copy the mails to a folder with this option set (and rebuild the index if
required), the text you want is displayed and the mail body becomes readable, provided the mail body is written in koi8-r.
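
The effect of that folder option can be pictured with the sketch below. It is an illustration in Python for a simple non-multipart message, not how Tb implements it; the function `decode_body` and its parameters `folder_charset` and `apply_to_all` are hypothetical names.

```python
from email import message_from_bytes


def decode_body(raw_eml: bytes, folder_charset=None, apply_to_all=False) -> str:
    """Decode a non-multipart message body (hypothetical helper).

    With apply_to_all set, the folder default overrides the charset
    declared in the message's own Content-Type header.
    """
    msg = message_from_bytes(raw_eml)
    body = msg.get_payload(decode=True)  # bytes, transfer encoding undone
    if apply_to_all and folder_charset:
        charset = folder_charset
    else:
        charset = msg.get_content_charset() or folder_charset or "utf-8"
    return body.decode(charset, errors="replace")
```

With the override enabled and the folder default set to koi8-r, a Case-A mail decodes correctly even though its header still says windows-1251.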

(Correction of comment #7)
Bug 505072 was a different issue from this bug on similar mail data.
Dmitry Kubov (bug opener), there is no need to test again, because nothing has changed for this bug.
Keywords: testcase
The problem is here in the code:

http://mxr.mozilla.org/comm-central/source/mailnews/import/outlook/src/nsOutlookCompose.cpp#607.

The import tries to get the charset from the header. If none is there, it defaults to the system character set; on Windows machines with an English version of Windows, that's normally "windows-1252" (a Russian version would give "windows-1251").

Instead, what should be done is to extract the charset from the appropriate HTML header.
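
As a rough illustration of that idea (in Python rather than the C++ of nsOutlookCompose.cpp, and not a proposed patch), the charset could be pulled from the <meta> tag and the system default used only as a fallback. The function name `charset_from_meta` and the loose regular expression are assumptions of this sketch.

```python
import re

# Deliberately loose pattern: a sketch, not a real HTML parser.
_META_CHARSET = re.compile(rb'<meta[^>]*charset\s*=\s*["\']?([\w.:-]+)',
                           re.IGNORECASE)


def charset_from_meta(html: bytes, fallback: str) -> str:
    """Return the charset named in a <meta> tag, or fallback if none is found."""
    match = _META_CHARSET.search(html)
    if match:
        return match.group(1).decode("ascii").lower()
    return fallback
```

For the mail in this bug, such a lookup would find koi8-r in the HTML head, so the system default would never be written into Content-Type:.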
Component: General → Import
Product: Thunderbird → MailNews Core
QA Contact: general → import
Summary: different charsets for headers and mail body (charset in Content-Type: is used instead of charset in <meta http-equiv>) → different charsets for headers and mail body (Import from Outlook 2003 puts different charset from one in <meta> tag in Content-Type:. Tb uses charset in Content-Type: instead of charset in <meta http-equiv>)
Status: UNCONFIRMED → RESOLVED
Closed: 14 years ago
Resolution: --- → DUPLICATE