Closed Bug 593894 Opened 14 years ago Closed 14 years ago

HTML mail body with both MIME charset and HTML charset specified display garbled (with html5.enable=true, charset in <meta> is not ignored, and UTF-8 in <meta> is required for expected rendering of non-ascii characters)

Categories

(Thunderbird :: General, defect)

x86
Windows XP
defect
Not set
major

Tracking

(Not tracked)

RESOLVED DUPLICATE of bug 594646

People

(Reporter: mikekaganski, Unassigned)

Details

Attachments

(6 files)

User-Agent:       Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0)
Build Identifier: Mozilla/5.0 (Windows NT 5.1; rv:2.0b5pre) Gecko/20100907 Shredder/3.2a1pre

In the test build of Shredder, if an HTML message contains both MIME charset info (Content-Type: text/html; charset=koi8-r) and HTML charset info (<META content="text/html; charset=koi8-r" http-equiv=Content-Type>), then the text is in most cases shown garbled. The only exception seen so far is when the charset is utf-8.
The text that is displayed is the result of the following procedure:
1. Take the original text
2. Using the first charset setting, convert it to utf-8.
3. Treat the text from step 2 as the text to be displayed and "decode" it for display using the second charset setting.
So the text that is shown usually takes 2 letters for every original letter; these represent the bytes of the utf-8 sequence converted from the specified charset.
The source window displays the text OK.
This is clearly a regression, as the vanilla TB 3.1.2 shows such messages OK.
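
The procedure above can be reproduced outside Thunderbird. The following is only a minimal sketch of the suspected double conversion, not the actual mail code; the sample word and the variable names are assumptions chosen for illustration.

```python
# Sketch of the suspected double conversion (sample text is an assumption).
original = "Привет"                      # the text the sender intended
raw = original.encode("koi8-r")          # bytes on the wire, per Content-Type

# Step 2: decode with the first (MIME) charset and re-encode as UTF-8.
utf8_bytes = raw.decode("koi8-r").encode("utf-8")

# Step 3: the UTF-8 bytes are then decoded again with the second (<meta>) charset.
garbled = utf8_bytes.decode("koi8-r")

# Each Cyrillic letter is two bytes in UTF-8, so two characters are shown per letter.
print(len(original), len(garbled))

# Reversing step 3 recovers the text, which is what a mojibake decoder does.
recovered = garbled.encode("koi8-r").decode("utf-8")
```

If this model is right, `garbled` is twice as long as `original`, and `recovered` equals `original`.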

Reproducible: Always

Steps to Reproduce:
1. Open the attached mail in a recent test build of TB
Actual Results:  
The message text looks garbled

Expected Results:  
The message reads OK
> Actual Results:  
> The message text looks garbled

What phenomenon do you call "garbled"? Same display as bug 506504?
Same problem as bug 528736?
See bug 572886 comment #13, bug 572886 comment #15.
(In reply to comment #3)

I wrote in the problem description:
"So the text that is shown usually takes 2 letters for every original letter;
these represent the bytes of the utf-8 sequence converted from the
specified charset."

When I read the description of the bugs you mentioned, I see that the problem there was displaying HTML tags and other technical info along with the message text; maybe I'm wrong?
In this case, no HTML/MIME source is displayed; instead, the text of the message becomes unreadable (every non-ascii char becomes 2 other chars corresponding to the utf-8 bytes that encode the intended char, but interpreted in the original charset). I checked this using the Universal Cyrillic decoder (http://2cyr.com/decode/?lang=en).
When I open the first attachment of Bug 528736, I don't see any problem, maybe because it's all ascii. I don't know whether it is relevant here.
By the way, this bug appears regardless of the multipart type; even if you open a message whose topmost type is "text/html" (and which therefore has a single MIME part), the result is the same.
Attachment 1 was originally a multipart/mixed message with attachments; for the purpose of this bug, I removed the attachments from the source. Now I have further modified this message to be a single-part text/html message, and the problem is still there.
And another thing: the problem is seen regardless of the display mode (View->Message Body As): it's the same in "Original HTML", "Simple HTML" and "Plain Text".
Edit: if I select View->Message Body As->Plain Text, then close the message and reopen it, it shows OK. If the selected mode is either Original HTML or Simple HTML, the problem is there. If in any mode I touch auto-detect, then the message headers are shown before the text, but the text itself looks OK.
I checked bug 572886 again. The problem of bug 572886 was reproduced with the 6/09 build, but was already resolved by the Tb trunk 8/09 build.
This bug could be reproduced with the Tb trunk 6/09 build and the 9/06 build using the attached test mails. I could see the same display as in your screenshot.
The problem couldn't be observed with html5.enable=false.
If View/Character Encoding is set to one compatible with 7-bit us-ascii (other than the UTF-16 family), a similar pattern with different glyphs was observed on non-ascii characters. The wrong conversion appears to happen during <meta> charset handling, as you say.

Confirming, and CC-ing to Henri Sivonen.
Severity: normal → major
Status: UNCONFIRMED → NEW
Ever confirmed: true
Version: unspecified → Trunk
Attached are test mails of Bug 508946, with Content-Type: Shift_JIS (changed from UTF-8 in the original case).
The expected display is as follows.
> ここは日本語です。 Here is Japanese.
If utf-8 is specified in the <meta> tag, non-ascii characters are shown as expected.
This bug is the Content-Type: non-utf-8/non-utf-16-family case of Bug 508946.
I should have checked the non-utf-8 case in Bug 508946...

The problem is probably a mismatch between the converter's output and the parser's expectation in <meta> charset processing.
- The converter converts data from the charset in Content-Type: to UTF-8.
- If Content-Type: is UTF-8, <meta> charset processing is not invoked,
  or the parser correctly identifies the data as UTF-8. (Bug 508946 is fixed)
- If Content-Type: is not UTF-8, the parser interprets the converted UTF-8 data
  as data in the charset from the <meta> tag.
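
The three cases above can be modelled in a few lines. This is a rough sketch of the suspected behaviour only; `render` is a hypothetical helper, not a function in the mail or parser code, and undecodable bytes are replaced rather than reported.

```python
def render(raw: bytes, mime_charset: str, meta_charset: str) -> str:
    """Model the suspected pipeline: convert to UTF-8, then decode per <meta>."""
    # Converter: bytes in the Content-Type charset become UTF-8 bytes.
    utf8_bytes = raw.decode(mime_charset).encode("utf-8")
    # Parser: trusts the <meta> charset even though the data is already UTF-8.
    return utf8_bytes.decode(meta_charset, errors="replace")

text = "ここは日本語です。"
raw = text.encode("shift_jis")

shown_meta_utf8 = render(raw, "shift_jis", "utf-8")       # <meta> says utf-8
shown_meta_sjis = render(raw, "shift_jis", "shift_jis")   # <meta> repeats Shift_JIS
```

Under this model only the utf-8 <meta> reproduces the expected text, which matches the observation that UTF-8 in <meta> is required for correct rendering.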
Attachment #472916 - Attachment mime type: text/plain → text/plain; charset="Shift_JIS"
CC-ing Zane U. Ji (patch creator of Bug 505072).
I thought the charset of Content-Type: is always used and the charset in the <meta> tag is always ignored after your patch. Is this the HTML parser side's fault?
It seems that there is a regression. The problem reported in Bug 505072 reappears.
(In reply to comment #12)
> The problem reported in Bug 505072 reappears.

I could also see Bug 508946 (spin-off from your Bug 505072) with Content-Type: UTF-8. Rough regression window:
(No problem) thunderbird-3.2a1pre 20100501 build.
             Bug 508946 and this bug don't occur.
             default of html5.enable = false
(Problem occurs) thunderbird-3.2a1pre 20100513 build.
             Bug 508946 and this bug occur.
             default of html5.enable = true
AFAIR, the default of html5.enable was changed from false to true between 5/01 and 5/13, which is why I had the 5/01 and 5/13 builds on my PC.

Your patch for Bug 505072 looks effective with html5.enable=false only.
Can you provide a patch for the html5.enable=true case?
Summary: HTML mail body with both MIME charset and HTML charset specified display garbled → HTML mail body with both MIME charset and HTML charset specified display garbled (with html5.enable=true)
Summary: HTML mail body with both MIME charset and HTML charset specified display garbled (with html5.enable=true) → HTML mail body with both MIME charset and HTML charset specified display garbled (with html5.enable=true, charset in <meta> is not ignored, and UTF-8 in <meta> is required for expected rendering of non-ascii characters)
Attachment #472916 - Attachment mime type: text/plain; charset="Shift_JIS" → text/html; charset="Shift_JIS"
Attachment #472916 - Attachment mime type: text/html; charset="Shift_JIS" → text/plain; charset="Shift_JIS"
How does the mail code pass the charset from the MIME layer to the HTML parser? Is kCharsetFromChannel specified as the charset source?
(In reply to comment #14)
> How does the mail code pass the charset from the MIME layer to the HTML parser?

As far as I know, suppose the input is
...
Content-Type: text/html; charset=koi8-r

<html><head>
<META content="text/html; charset=koi8-r">
</head>
<body/>
</html>
, MIME will try to output:
<html>
<head/>
<body/>
</html>
, which is encoded in UTF8.
Not sure if it still applies to the latest code.

> Is kCharsetFromChannel specified as the charset source?
Yes.
(In reply to comment #15)
> (In reply to comment #14)
> > How does the mail code pass the charset from the MIME layer to the HTML parser?
> 
> As far as I know, suppose the input is
> ...
> Content-Type: text/html; charset=koi8-r
> 
> <html><head>
> <META content="text/html; charset=koi8-r">
> </head>
> <body/>
> </html>
> , MIME will try to output:
> <html>
> <head/>
> <body/>
> </html>
> , which is encoded in UTF8.
> Not sure if it still applies to the latest code.

Do you mean the HTML parser doesn't see the bytes decoded from MIME but the MIME layer performs further modifications on those bytes?
(In reply to comment #16)
> (In reply to comment #15)
> > (In reply to comment #14)
> > > How does the mail code pass the charset from the MIME layer to the HTML parser?
> > 
> > As far as I know, suppose the input is
> > ...
> > Content-Type: text/html; charset=koi8-r
> > 
> > <html><head>
> > <META content="text/html; charset=koi8-r">
> > </head>
> > <body/>
> > </html>
> > , MIME will try to output:
> > <html>
> > <head/>
> > <body/>
> > </html>
> > , which is encoded in UTF8.
> > Not sure if it still applies to the latest code.
> 
> Do you mean the HTML parser doesn't see the bytes decoded from MIME but the
> MIME layer performs further modifications on those bytes?
No, I don't. I'm saying that MIME parser tries to apply <META> tag. So the charset might not be passed to HTML parser if there is only one <META>.
(In reply to comment #17)
> No, I don't. I'm saying that MIME parser tries to apply <META> tag.

Can the MIME layer be changed not to look at <meta> in the MIME payload?
(In reply to comment #18)
> Can the MIME layer be changed not to look at <meta> in the MIME payload?

Yes. However, there will be a performance impact.
(In reply to comment #15)

I have to correct myself.
MIME expects <META> to be in the following format without any line break:
<META HTTP-EQUIV=... CONTENT=... CHARSET=...>

> 
> As far as I know, suppose the input is
> ...
> Content-Type: text/html; charset=koi8-r
> 
> <html><head>
> <META content="text/html; charset=koi8-r">
Should be <META HTTP-EQUIV="Content-Type" CONTENT="text/html; CHARSET=koi8-r">.

> </head>
> <body/>
> </html>

> , MIME will try to output:
> <html>
> <head/>
Should be 
<head>
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; ">
</head>

> <body/>
> </html>
> , which is encoded in UTF8.
> Not sure if it still applies to the latest code.
> 

The problem is caused by two facts:
1. MIME doesn't recognize the <META> tag in attachment 472514. It is not in the form the MIME parser expects, so it is passed to the HTML5 parser untouched.
2. It seems that when HTML5 parser sees the charset information, it immediately believes that the input content is encoded in that charset rather than UTF8.
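
To illustrate fact 1, here is a small sketch of how a matcher that only accepts the HTTP-EQUIV, CONTENT, CHARSET order on one line would miss a reordered tag. The regular expression and the `strip_meta_charset` helper are hypothetical and are not taken from the MIME code.

```python
import re

# Hypothetical strict pattern: HTTP-EQUIV must precede CONTENT, all on one line.
STRICT_META = re.compile(
    r'(<META\s+HTTP-EQUIV="?Content-Type"?\s+CONTENT="text/html;\s*)CHARSET=[^">]*',
    re.IGNORECASE,
)

def strip_meta_charset(html: str) -> str:
    """Drop the charset parameter only when the tag matches the strict layout."""
    return STRICT_META.sub(r"\1", html)

expected_form = '<META HTTP-EQUIV="Content-Type" CONTENT="text/html; CHARSET=koi8-r">'
reordered_form = '<META content="text/html; charset=koi8-r" http-equiv=Content-Type>'

print(strip_meta_charset(expected_form))    # charset parameter removed
print(strip_meta_charset(reordered_form))   # tag passes through untouched
```

With the reordered attributes the charset survives and reaches the HTML parser, which is consistent with the attribute-order dependency described in this bug.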
(In reply to comment #20)
> 2. It seems that when HTML5 parser sees the charset information, it immediately
> believes that the input content is encoded in that charset rather than UTF8.

Only if the calling code set the charset source to a tentative value:
http://mxr.mozilla.org/mozilla-central/source/parser/htmlparser/public/nsIParser.h#100

If you create the parser with a confident charset source, it won't second-guess the charset.

If you let the HTML5 parser figure out the charset on its own, it won't reload if it finds the meta within the first 1024 bytes. If the meta appears later and the channel doesn't support reloading, the parser gives the docshell an opportunity to cancel the reload request.
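
The distinction between a tentative and a confident charset source can be sketched as follows. The enum names and their ordering are assumptions made for illustration; they are not the real constants or values from nsIParser.h.

```python
from enum import IntEnum
from typing import Optional

class CharsetSource(IntEnum):
    # Hypothetical ranking, lowest to highest confidence.
    TENTATIVE = 1
    META_TAG = 2
    CONFIDENT = 3

def effective_charset(declared: str, source: CharsetSource,
                      meta_charset: Optional[str]) -> str:
    """Return the charset a parser would use after seeing a <meta> charset."""
    # A <meta> declaration only wins over a source ranked below it.
    if meta_charset is not None and source < CharsetSource.META_TAG:
        return meta_charset
    return declared

tentative = effective_charset("utf-8", CharsetSource.TENTATIVE, "koi8-r")
confident = effective_charset("utf-8", CharsetSource.CONFIDENT, "koi8-r")
```

In this sketch the tentative caller ends up with koi8-r and the confident caller keeps utf-8, which is the behaviour described above.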
The problem depends on View/Message Body As:
- Original HTML, attribute order in meta dependent : bug 594646
- Simple HTML, Content-Transfer-Encoding dependent : bug 598740
Setting dependency to those bugs.
Depends on: 594646, 598740
As this bug is not about the quoted-printable/base64 case, and as it relates to the order of attributes in the meta tag and "View/Message Body/Original HTML", closing as a dup of bug 594646.
Bug opener, please reopen if duping is wrong.
..
No longer blocks: 508946
Status: NEW → RESOLVED
Closed: 14 years ago
No longer depends on: 594646, 598740
Resolution: --- → DUPLICATE