Closed Bug 593894 Opened 14 years ago Closed 14 years ago

HTML mail body with both MIME charset and HTML charset specified display garbled (with html5.enable=true, charset in <meta> is not ignored, and UTF-8 in <meta> is required for expected rendering of non-ascii characters)

Tracking

(Not tracked)

Status:

RESOLVED DUPLICATE of bug 594646

People

(Reporter: mikekaganski, Unassigned)

Details

Attachments

(6 files)

Bad-looking mail using koi8-r charset 14 years ago Mike Kaganski 6.99 KB, application/octet-stream		Details
Bad-looking mail using windows-1251 charset 14 years ago Mike Kaganski 71.76 KB, application/octet-stream		Details
Screenshot of garbled text of the first attachment 14 years ago Mike Kaganski 32.56 KB, image/png		Details
Screenshot of the same text of the first attachment looking OK in TB 3.1.2 14 years ago Mike Kaganski 31.18 KB, image/png		Details
Modified version of attachment 1, showing the same problem 14 years ago Mike Kaganski 6.83 KB, message/rfc822		Details
Content-Type: Shift_JIS version of test case of Bug 508946 14 years ago WADA:World Anti-bad-Duping Agency 3.84 KB, text/plain; charset="Shift_JIS"		Details

Mike Kaganski

Reporter

Description

•

14 years ago

User-Agent:       Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; WOW64; Trident/4.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0)
Build Identifier: Mozilla/5.0 (Windows NT 5.1; rv:2.0b5pre) Gecko/20100907 Shredder/3.2a1pre

In the test build of Shredder, if an HTML message contains both MIME charset info (Content-Type: text/html; charset=koi8-r) and HTML charset info (<META content="text/html; charset=koi8-r" http-equiv=Content-Type>), then the text is in most cases shown garbled. The only seen exception thus far is when the charset is utf-8.
The text that is displayed is the result of the following procedure:
1. Take the original text
2. Taking the first charset setting into account, convert it to utf-8.
3. Handling the text from step 2 as the text to be displayed, "decode" it for display using the second charset setting.
So the text that is shown usually takes 2 letters for every original letter, that represent the values of the surrogate pairs of utf-8 converted from specified charset.
The source window displays the text OK.
This is clearly a regression, as the vanilla TB 3.1.2 shows such messages OK.
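
The three steps above can be reproduced outside Thunderbird. The following Python sketch only illustrates the suspected double conversion using Python's built-in codecs; it is not the mail code, and the sample word and variable names are assumptions made for the example.

```python
# Illustration only: mimics the suspected double conversion.
# The sample text is an arbitrary assumption, not taken from the attachments.
original = "Привет"

# Step 1: the message body as stored in the mail, encoded in koi8-r.
wire_bytes = original.encode("koi8_r")

# Step 2: convert to UTF-8, honouring the first (MIME) charset setting.
utf8_bytes = wire_bytes.decode("koi8_r").encode("utf-8")

# Step 3: wrongly "decode" the UTF-8 bytes using the second (<meta>) charset.
garbled = utf8_bytes.decode("koi8_r")

# Each Cyrillic letter occupies two bytes in UTF-8, so the garbled text
# has two characters for every original letter.
print(len(original), len(garbled))
```

If the second charset is utf-8 instead, step 3 simply returns the original text, which matches the one exception noted above.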

Reproducible: Always

Steps to Reproduce:
1. Open an attached mail in a recent test build of TB
Actual Results:  
The message text looks garbled

Expected Results:  
The message reads OK

Mike Kaganski

Reporter

Comment 1

•

14 years ago

Attached file Bad-looking mail using koi8-r charset — Details

Mike Kaganski

Reporter

Comment 2

•

14 years ago

Attached file Bad-looking mail using windows-1251 charset — Details

WADA:World Anti-bad-Duping Agency

Comment 3

•

14 years ago

> Actual Results:  
> The message text looks garbled

What phenomenon do you call "garbled"? Same display as bug 506504?
Same problem as bug 528736?
See bug 572886 comment #13, bug 572886 comment #15.

Mike Kaganski

Reporter

Comment 4

•

14 years ago

(In reply to comment #3)

I wrote in the problem description:
"So the text that is shown usually takes 2 letters for every original letter,
that represent the values of the surrogate pairs of utf-8 converted from
specified charset."

When I read the description of the bugs you mentioned, I see that the problem there was displaying HTML tags and other technical info along with the message text; maybe I'm wrong?
In this case, no HTML/MIME source is displayed; instead, the text of the message becomes unreadable (every non-ascii char becomes 2 other chars corresponding to utf-8 codes that substitute the intended char, but interpreted as original charset). I checked this using the Universal Cyrillic decoder (http://2cyr.com/decode/?lang=en).
When I open the first attachment of Bug 528736, I don't see any problem, maybe because it's all ascii. I don't know whether it is relevant here.
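
The kind of check described above (undoing the garbling to confirm what happened) can be approximated in a few lines of Python. This is a sketch under the assumption that the wrongly applied charset was koi8-r; the helper name `undo_double_decoding` and the sample word are hypothetical.

```python
def undo_double_decoding(garbled: str, wrong_charset: str = "koi8_r") -> str:
    """Re-encode with the charset that was wrongly applied, then decode as UTF-8."""
    return garbled.encode(wrong_charset).decode("utf-8")


# Build a garbled sample the same way the bug is suspected to produce it.
sample = "Тест"
garbled_sample = sample.encode("utf-8").decode("koi8_r")

# If the round trip restores the text, the garbling was UTF-8 read as koi8-r.
print(undo_double_decoding(garbled_sample) == sample)
```

This only works when the wrongly applied charset maps every byte to a character, as koi8-r does; with charsets that have unassigned bytes the reverse step can fail.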

Mike Kaganski

Reporter

Comment 5

•

14 years ago

Attached image Screenshot of garbled text of the first attachment — Details

Mike Kaganski

Reporter

Comment 6

•

14 years ago

Attached image Screenshot of the same text of the first attachment looking OK in TB 3.1.2 — Details

Mike Kaganski

Reporter

Comment 7

•

14 years ago

Attached file Modified version of attachment 1, showing the same problem — Details

By the way, this bug appears regardless of the multipart type; even if you open a message whose topmost type is "text/html" (and which therefore has a single MIME part), the result is the same.
Attachment 1 originally was a multipart/mixed with attachments; for the purpose of this bug, I removed the attachments from the source. Now I further modified this message to be a single-part text/html message, and the problem is still there.
And another thing: the problem is seen regardless of the display mode (View->Message Body As): it's the same in "Original HTML", "Simple HTML" and "Plain Text".

Mike Kaganski

Reporter

Comment 8

•

14 years ago

Edit: if I select View->Message Body As->Plain Text, then close the message and reopen it, it shows OK. If the selected mode is either Original HTML or Simple HTML, the problem is there. If in any mode I touch auto-detect, then the message headers are shown before the text, but the text itself looks OK.

WADA:World Anti-bad-Duping Agency

Comment 9

•

14 years ago

I checked bug 572886 again. The problem of bug 572886 was reproduced with the 6/09 build, but was already resolved by the Tb trunk 8/09 build.
This bug could be reproduced with the Tb trunk 6/09 build and 9/06 build using the attached test mails. I could see the same display as in your screenshot.
The problem couldn't be observed with html5.enable=false.
If View/Character Encoding is set to one compatible with 7-bit us-ascii (other than the UTF-16 family), a similar pattern with different glyphs was observed on non-ascii characters. The wrong conversion looks to happen upon <meta> charset handling, as you say.

Confirming, and CC-ing to Henri Sivonen.

Severity: normal → major

Status: UNCONFIRMED → NEW

Ever confirmed: true

Version: unspecified → Trunk

WADA:World Anti-bad-Duping Agency

Comment 10

•

14 years ago

Attached file Content-Type: Shift_JIS version of test case of Bug 508946 — Details

Attached is the test mail of Bug 508946, with Content-Type: Shift_JIS (changed from UTF-8 in the original case).
Expected display is as follows.
> ここは日本語です。 Here is Japanese.
If utf-8 is specified in <meta> tag, non-ascii characters are shown as expected.
This bug is the Content-Type: non-utf-8/non-utf16-family case of Bug 508946.
I should have checked non utf-8 case in Bug 508946...

The problem is probably a mismatch between the converter's output and the parser's expectation in <meta> charset processing.
- Converter converts data from charset in Content-Type: to UTF-8.
- If Content-Type: is UTF-8, <meta> charset processing is not invoked,
  or parser correctly identifies data as UTF-8. (Bug 508946 is fixed)
- If Content-Type: is not UTF-8, parser interprets the converted UTF-8 data
  as data of charset in <meta> tag.
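
The suspected mismatch in the last bullet can be modelled in a few lines of Python. This is a simplified sketch of the hypothesis, not the actual converter or parser code; the function name `displayed_text` is hypothetical, and `errors="replace"` is used only so that invalid byte sequences do not stop the demonstration.

```python
def displayed_text(body: bytes, content_type_charset: str, meta_charset: str) -> str:
    """Model of the hypothesis: convert to UTF-8, then decode using <meta>."""
    # Converter: Content-Type charset -> UTF-8.
    utf8_bytes = body.decode(content_type_charset).encode("utf-8")
    # Parser (suspected behaviour): treats the UTF-8 data as <meta> charset data.
    return utf8_bytes.decode(meta_charset, errors="replace")


expected = "ここは日本語です。 Here is Japanese."
body = expected.encode("shift_jis")

# <meta> says utf-8: the converted data is read correctly.
ok = displayed_text(body, "shift_jis", "utf-8")
# <meta> repeats Shift_JIS: the UTF-8 data is misread.
bad = displayed_text(body, "shift_jis", "shift_jis")

print(ok == expected, bad == expected)
```

In this model the text survives only when the <meta> charset is utf-8, which is consistent with the observation above.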

WADA:World Anti-bad-Duping Agency

Updated

•

14 years ago

Attachment #472916 - Attachment mime type: text/plain → text/plain; charset="Shift_JIS"

WADA:World Anti-bad-Duping Agency

Updated

•

14 years ago

Blocks: 508946

WADA:World Anti-bad-Duping Agency

Comment 11

•

14 years ago

CC-ing to Zane U. Ji(patch creator of Bug 505072).
I thought charset of Content-Type: is always used and charset in <meta> tag is always ignored after your patch. HTML parser side's fault?

Zane U. Ji

Comment 12

•

14 years ago

It seems that there is a regression. The problem reported in Bug 505072 reappears.

WADA:World Anti-bad-Duping Agency

Comment 13

•

14 years ago

(In reply to comment #12)
> The problem reported in Bug 505072 reappears.

I could also see Bug 508946 (spin-off from your Bug 505072), with Content-Type: UTF-8. Rough regression window:
(No problem) thunderbird-3.2a1pre 20100501 build.
             Bug 508946 and this bug don't occur.
             default of html5.enable = false
(Problem occurs) thunderbird-3.2a1pre 20100513 build.
             Bug 508946 and this bug occur.
             default of html5.enable = true
AFAIR, the default of html5.enable was changed from false to true between 5/01 and 5/13, and that was the reason why I had the 5/01 build and the 5/13 build on my PC.

Your patch of Bug 505072 looks effective with html5.enable=false only.
Can you provide patch for html5.enable=true case?

Summary: HTML mail body with both MIME charset and HTML charset specified display garbled → HTML mail body with both MIME charset and HTML charset specified display garbled (with html5.enable=true)

WADA:World Anti-bad-Duping Agency

Updated

•

14 years ago

Summary: HTML mail body with both MIME charset and HTML charset specified display garbled (with html5.enable=true) → HTML mail body with both MIME charset and HTML charset specified display garbled (with html5.enable=true, charset in <meta> is not ignored, and UTF-8 in <meta> is required for expected rendering of non-ascii characters)

WADA:World Anti-bad-Duping Agency

Updated

•

14 years ago

Attachment #472916 - Attachment mime type: text/plain; charset="Shift_JIS" → text/html; charset="Shift_JIS"

WADA:World Anti-bad-Duping Agency

Updated

•

14 years ago

Attachment #472916 - Attachment mime type: text/html; charset="Shift_JIS" → text/plain; charset="Shift_JIS"

Henri Sivonen (:hsivonen)

Comment 14

•

14 years ago

How does the mail code pass the charset from the MIME layer to the HTML parser? Is kCharsetFromChannel specified as the charset source?

Zane U. Ji

Comment 15

•

14 years ago

(In reply to comment #14)
> How does the mail code pass the charset from the MIME layer to the HTML parser?

As far as I know, suppose the input is
...
Content-Type: text/html; charset=koi8-r

<html><head>
<META content="text/html; charset=koi8-r">
</head>
<body/>
</html>
, MIME will try to output:
<html>
<head/>
<body/>
</html>
, which is encoded in UTF8.
Not sure if it still applies to the latest code.

> Is kCharsetFromChannel specified as the charset source?
Yes.

Henri Sivonen (:hsivonen)

Comment 16

•

14 years ago

(In reply to comment #15)
> (In reply to comment #14)
> > How does the mail code pass the charset from the MIME layer to the HTML parser?
> 
> As far as I know, suppose the input is
> ...
> Content-Type: text/html; charset=koi8-r
> 
> <html><head>
> <META content="text/html; charset=koi8-r">
> </head>
> <body/>
> </html>
> , MIME will try to output:
> <html>
> <head/>
> <body/>
> </html>
> , which is encoded in UTF8.
> Not sure if it still applies to the latest code.

Do you mean the HTML parser doesn't see the bytes decoded from MIME but the MIME layer performs further modifications on those bytes?

Zane U. Ji

Comment 17

•

14 years ago

(In reply to comment #16)
> (In reply to comment #15)
> > (In reply to comment #14)
> > > How does the mail code pass the charset from the MIME layer to the HTML parser?
> > 
> > As far as I know, suppose the input is
> > ...
> > Content-Type: text/html; charset=koi8-r
> > 
> > <html><head>
> > <META content="text/html; charset=koi8-r">
> > </head>
> > <body/>
> > </html>
> > , MIME will try to output:
> > <html>
> > <head/>
> > <body/>
> > </html>
> > , which is encoded in UTF8.
> > Not sure if it still applies to the latest code.
> 
> Do you mean the HTML parser doesn't see the bytes decoded from MIME but the
> MIME layer performs further modifications on those bytes?
No, I don't. I'm saying that MIME parser tries to apply <META> tag. So the charset might not be passed to HTML parser if there is only one <META>.

Henri Sivonen (:hsivonen)

Comment 18

•

14 years ago

(In reply to comment #17)
> No, I don't. I'm saying that MIME parser tries to apply <META> tag.

Can the MIME layer be changed not to look at <meta> in the MIME payload?

Zane U. Ji

Comment 19

•

14 years ago

(In reply to comment #18)
> Can the MIME layer be changed not to look at <meta> in the MIME payload?

Yes. However, there will be a performance impact.

Zane U. Ji

Comment 20

•

14 years ago

(In reply to comment #15)

I have to correct myself.
MIME expects <META> to be in the following format without any line break:
<META HTTP-EQUIV=... CONTENT=... CHARSET=...>

> 
> As far as I know, suppose the input is
> ...
> Content-Type: text/html; charset=koi8-r
> 
> <html><head>
> <META content="text/html; charset=koi8-r">
Should be <META HTTP-EQUIV="Content-Type" CONTENT="text/html; CHARSET=koi8-r">.

> </head>
> <body/>
> </html>

> , MIME will try to output:
> <html>
> <head/>
Should be 
<head>
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; ">
</head>

> <body/>
> </html>
> , which is encoded in UTF8.
> Not sure if it still applies to the latest code.
> 

The problem is caused by two facts:
1. MIME doesn't recognize the <META> tag in attachment 472514. It is not what the MIME parser expects, so it is passed to the HTML5 parser untouched.
2. It seems that when HTML5 parser sees the charset information, it immediately believes that the input content is encoded in that charset rather than UTF8.
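
To illustrate fact 1, here is a small Python sketch of how a matcher that expects one fixed attribute order can miss a <META> tag written in a different order. The regular expression and the function name `strip_charset` are hypothetical and are not taken from the MIME code; they only mimic the format described at the top of this comment.

```python
import re

# Hypothetical strict pattern: HTTP-EQUIV must come before CONTENT,
# all on one line (only spaces and tabs are allowed between attributes).
STRICT_META = re.compile(
    r'<META[ \t]+HTTP-EQUIV="?Content-Type"?[ \t]+'
    r'CONTENT="text/html;[ \t]*CHARSET=[^">]*"[ \t]*>',
    re.IGNORECASE,
)


def strip_charset(html: str) -> str:
    """Drop the charset from <META> tags that the strict pattern recognises."""
    return STRICT_META.sub('<META HTTP-EQUIV="Content-Type" CONTENT="text/html; ">', html)


recognised = '<META HTTP-EQUIV="Content-Type" CONTENT="text/html; CHARSET=koi8-r">'
unrecognised = '<META content="text/html; charset=koi8-r" http-equiv=Content-Type>'

print(strip_charset(recognised))
# The second tag has the attributes in the other order, so it passes through untouched.
print(strip_charset(unrecognised) == unrecognised)
```

The second tag is the form quoted in the bug description, so under this assumption its charset would reach the HTML parser unchanged.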

Henri Sivonen (:hsivonen)

Comment 21

•

14 years ago

(In reply to comment #20)
> 2. It seems that when HTML5 parser sees the charset information, it immediately
> believes that the input content is encoded in that charset rather than UTF8.

Only if the calling code set the charset source to a tentative value:
http://mxr.mozilla.org/mozilla-central/source/parser/htmlparser/public/nsIParser.h#100

If you create the parser with a confident charset source, it won't second-guess the charset.

If you let the HTML5 parser figure out the charset on its own, it won't reload if it finds the meta within the first 1024 bytes. If the meta appears later and the channel doesn't support reloading, the parser gives the docshell an opportunity to cancel the reload request.
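
The difference between a tentative and a confident charset source can be pictured with a simplified Python model. The enum levels and the function name below are hypothetical; the real constants are the ones in nsIParser.h, and this sketch leaves out the 1024-byte prescan and the reload handling.

```python
from enum import IntEnum
from typing import Optional


class CharsetSource(IntEnum):
    # Hypothetical confidence levels, ordered from least to most confident.
    TENTATIVE = 1
    META = 2
    CONFIDENT = 3


def effective_charset(initial: str, source: CharsetSource,
                      meta_charset: Optional[str]) -> str:
    """Return the charset this simplified parser model ends up using."""
    # A <meta> charset only wins when the caller's source is less confident.
    if meta_charset is not None and source < CharsetSource.META:
        return meta_charset
    return initial


# Caller is confident the data is UTF-8: <meta> is not allowed to override it.
print(effective_charset("UTF-8", CharsetSource.CONFIDENT, "koi8-r"))
# Caller passed only a tentative value: <meta> takes over.
print(effective_charset("UTF-8", CharsetSource.TENTATIVE, "koi8-r"))
```

Under this model, mail code that hands already-converted UTF-8 data to the parser would need to pass a confident source to keep the <meta> charset from being applied.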

WADA:World Anti-bad-Duping Agency

Comment 22

•

14 years ago

The problem depends on View/Message Body As:
- Original HTML, attribute order in meta dependent : bug 594646
- Simple HTML, Content-Transfer-Encoding dependent : bug 598740
Setting dependency to those bugs.

Depends on: 594646, 598740

WADA:World Anti-bad-Duping Agency

Comment 23

•

14 years ago

As this bug is not for quoted-printable/base64 case, and as this bug is relevant to order of attributes in meta tag and "View/Message Body/Original HTML", closing as dup of bug 594646.
Bug opener, please reopen if duping is wrong.

No longer blocks: 508946

Status: NEW → RESOLVED

Closed: 14 years ago

No longer depends on: 594646, 598740

Resolution: --- → DUPLICATE
