Open
Bug 997050
Opened 11 years ago
Updated 3 years ago
Decode Content Email error (if charset=utf-16le or utf-16be and is QP encoded, =00 for 7bit-ascii character is replaced by =20 == space)
Categories
(MailNews Core :: MIME, defect)
Tracking
(Not tracked)
NEW
People
(Reporter: thierry.bruyere, Unassigned)
References
(Depends on 1 open bug)
Details
Attachments
(3 files, 3 obsolete files)
User Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:28.0) Gecko/20100101 Firefox/28.0 (Beta/Release)
Build ID: 20140317233623
Steps to reproduce:
I cannot read an email: its content is decoded incorrectly.
The content is displayed as unreadable characters.
If I reply to the email, the content is decoded correctly.
Confirming. When just viewing the message, it shows all text in Chinese and missing chars (on Win XP).
When replying to it, it shows paragraphs of French(?) interleaved with paragraphs of Chinese.
The message is in utf-16le encoding, defined like this:
Content-Type: multipart/alternative;
 boundary="=_30CA4D538242981EE100000082975D1C"
--=_30CA4D538242981EE100000082975D1C
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain;
 charset="utf-16le"
Content-Disposition: inline
Content-Description: plainpart
jcranmer, is this supposed to work?
Status: UNCONFIRMED → NEW
Ever confirmed: true
Flags: needinfo?(Pidgeot18)
Comment 2•11 years ago
If View/Character Encoding/Unicode (UTF-8) is chosen:
View/Message Body As/Plain Text
   Text is shown, although some letters are shown as U+FFFD and a space is shown between each letter
   (because UTF-16 is displayed as UTF-8).
View/Message Body As/any HTML
   HTML source is shown as a text string, without being parsed as HTML.
   Text is shown, although some letters are shown as U+FFFD and a space is shown between each letter.
The external phenomenon itself is similar to Bug 571704 (see also duped Bug 594646).
If each part is changed to base64, the same result as above is observed without changing Character Encoding to UTF-8, and with no "space between each letter".
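The "space between each letter" and U+FFFD symptoms are what one would expect when UTF-16LE bytes are decoded as UTF-8. A minimal Python sketch of that mismatch follows; the sample letters are taken from the French text of the test mail, and the variable names are illustrative only.

```python
# UTF-16LE stores each ASCII letter as the letter byte followed by 0x00.
raw = "rép".encode("utf-16-le")        # b'r\x00\xe9\x00p\x00'

# Decoded as UTF-8, the ASCII letters survive, a NUL is left after each
# one (often rendered as a gap), and the lone 0xE9 byte becomes U+FFFD
# because it is not valid UTF-8 on its own.
shown = raw.decode("utf-8", errors="replace")
print(repr(shown))
```

This only models the charset mismatch; it says nothing about why Tb picks the wrong charset.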
1. View/Message Body As/All Body Parts, then save each part to a file.
   The data is correct UTF-16LE without BOM,
   although an excess <HEAD><meta HTTP-EQUIV="Content-Type" CONTENT="text/html;charset=UTF-8">
   is contained after the last </html>.
2. Compose a mail in Tb and attach the two saved files (UTF-16LE without BOM). Send Later.
   Because the data is UTF-16LE, Tb encodes it in base64.
3. Edit the mail in Outbox: set the charset correctly, change the filenames for ease of testing, then Repair Folder.
Comment 3•11 years ago
Comment 4•11 years ago
Comment 5•11 years ago
Updated•11 years ago
Attachment #8408744 -
Attachment description: Crafted mail folder file : multipart/mixed mail with text/plain & text/htlm utf-16le part → Crafted mail folder file : multipart/mixed mail with base64 encoded utf-16le parts
Comment 6•11 years ago
(In reply to :aceman from comment #1)
> Confirming. When just viewing the message it shows all text in Chinese an
> missing chars (on Win XP).
> When replying to it, it shows paragraphs of French(?) interleaved with
> paragraphs of Chinese.
I recognized "Veuillez" in the beginning which is indeed French.
> jcranmer, is this supposed to work?
UTF-16LE should work, which means someone in the depths of libmime is gunking this stuff up badly. I don't have the time right now to figure out what, though.
Flags: needinfo?(Pidgeot18)
Comment 7•11 years ago
Note:
Copy the attached file as a "local mail folder file", restart Tb, then Repair Folder.
Because raw UTF-16LE binary is used in the mail data stream, it contains 0x00, so copy/move of the mail won't work on the mail data of Case-1/Case-2.
Case-1. Crafted mail folder file : text/plain utf-16le mail
   The mail is shown with View/Character Encoding/UTF-8.
Case-2. Crafted mail folder file : text/html utf-16le mail
   The mail is shown with View/Character Encoding/UTF-8.
   HTML is rendered.
Case-3. Crafted mail folder file : multipart/mixed mail with base64 encoded utf-16le parts
   The text/plain and text/html parts are shown without View/Character Encoding/UTF-8.
   The text/html part is not rendered as HTML; the HTML source is shown as a string.
(A) Garbled display in Case-1/Case-2 : data converted to UTF-8 is shown as UTF-16LE.
   If the UTF-16LE binary were interpreted as UTF-8 by View/Character Encoding/UTF-8,
   "space between each letter" should occur, but it doesn't.
   View/Character Encoding/UTF-8 is perhaps internally applied to "data converted to UTF-8",
   because the mail text is not encoded in Case-1/Case-2.
(B) For an attachment (shown in an IFRAME), data converted to UTF-8 seems to be shown as UTF-8.
   However, Tb seems to forget to call the HTML parser.
(C) For a text/plain or text/html part under multipart/alternative,
   a phenomenon similar to Case-1/Case-2 seems to occur, because these parts are the "message body".
   For the "message body", a "problem when encoded" (base64 or quoted-printable) may occur.
Updated•11 years ago
Summary: Decode Content Email error → Decode Content Email error (if charset=utf-16le, data converted to utf-8 is shown as utf-16le, and if HTML message body is encoded in QP or base64, it's not rendered as HTML correctly)
Updated•11 years ago
Attachment #8408742 -
Attachment mime type: text/plain → text/plain; charset=iso-8859-1
Updated•11 years ago
Attachment #8408743 -
Attachment mime type: text/plain → text/plain; charset=iso-8859-1
Updated•11 years ago
Summary: Decode Content Email error (if charset=utf-16le, data converted to utf-8 is shown as utf-16le, and if HTML message body is encoded in QP or base64, it's not rendered as HTML correctly) → Decode Content Email error (if charset=utf-16le or utf-16be and is QP/base64 encoded, data doesn't look converted correctly, or wrong charset seems applied aftter conversion)
Comment 8•11 years ago
To avoid content sniffing, "<" is changed to "[ ". The following shows the glyphs of the text in UTF-8 ([CRLF] = 0x0D0A).
> [ html>[CRLF]
> [ head>[CRLF]
> [ meta charset="utf-8">[CRLF]
> [ meta http-equiv="Content-Type" content="text/html; charset='utf-8'">[CRLF]
> [ /head>[CRLF]
> [ body>
> [ p lang="ru">абвгдеёжзий[ /p>[CRLF]
> [ p lang="ja">ここは、日本語の文字です。[ /p>[CRLF]
> [ p lang="en">Hello World[ /p>[CRLF]
> [ p lang="fr">Veuillez ne répondez pas à cet e-mail pour mettre à jour votre Service Ticket.[ /p>[CRLF]
> [ p lang="zh-Hant">請以手足關係的精神相對待[ /p>[CRLF]
> [ /body>[CRLF]
> [ /html>[CRLF]
The above data is saved as a UTF-16BE text file, encoded with base64, and the base64-encoded data is put in the mail data stream.
> Content-Type: text/plain; charset=utf-16be
> Content-Transfer-Encoding: base64
>
> AFsAIABoAHQAbQBsAD4ADQAKAFsAIABoAGUAYQBkAD4ADQAKACAAIAAgACAAWwAgAG0AZQB0
>(snip)
> ACAALwBiAG8AZAB5AD4ADQAKAFsAIAAvAGgAdABtAGwAPgANAAo=
When this mail is displayed by Tb,
View/Character Encoding : Unicode (UTF-16BE) is checked,
and DOM Inspector shows the following data
in the #text node under the PRE element, which is held in a DIV.
> [ html>
> 嬀 栀攀愀搀㸀ഊ [ meta charset="utf-8">
> 嬀 洀攀琀愀 栀琀琀瀀ⴀ攀焀甀椀瘀㴀∀䌀漀渀琀攀渀琀ⴀ吀礀瀀攀∀ 挀漀渀琀攀渀琀㴀∀琀攀砀琀⼀栀琀洀氀㬀 挀栀愀爀猀攀琀㴀✀甀琀昀ⴀ㠀✀∀㸀ഊ[ /head>
> 嬀 戀漀搀礀㸀ഊ[ p lang="ru">абвгдеёжзий[ /p>
> 嬀 瀀 氀愀渀最㴀∀樀愀∀㸰匰匰漰ťⲊ鸰湥蝛地朰夰Ȁ嬀 ⼀瀀㸀ഊ[ p lang="en">Hello World[ /p>
> 嬀 瀀 氀愀渀最㴀∀昀爀∀㸀嘀攀甀椀氀氀攀稀 渀攀 爀瀀漀渀搀攀稀 瀀愀猀 挀攀琀 攀ⴀ洀愀椀氀 瀀漀甀爀 洀攀琀琀爀攀 樀漀甀爀 瘀漀琀爀攀 匀攀爀瘀椀挀攀 吀椀挀欀攀琀⸀嬀 ⼀瀀㸀ഊ[ p lang="zh-Hant">請以手足關係的精神相對蔀嬀 ⼀瀀㸀ഊ[ /body>
"Single byte code in UTF-8" only is displayed normally?
If so, why are the second "[ meta http-equiv=...", "[ p lang="ja">" and "[ /p>", "[ p lang="zh-Hant">" and "[ /p>" garbled?
Is the conversion of utf-16le/utf-16be broken? If so, is it only when QP/base64 encoded?
Or is the problem in "display as #text node in <PRE> element" (text/plain case)?
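Two things can be checked outside Tb. The Python sketch below first decodes the first base64 line quoted in this comment as UTF-16BE, which suggests the payload itself is intact. It then shows that the CJK-looking glyphs appear to carry the same byte values as the ASCII text, read one byte out of step. This only illustrates the byte pattern and is not a statement about where libmime loses a byte.

```python
import base64

# First base64 line of the test mail, copied from this comment.
first_line = ("AFsAIABoAHQAbQBsAD4ADQAKAFsAIABoAGUAYQBkAD4ADQAK"
              "ACAAIAAgACAAWwAgAG0AZQB0")
text = base64.b64decode(first_line).decode("utf-16-be")
print(repr(text))

# "[ head>" as UTF-16BE is 00 5B 00 20 00 68 ...; drop the leading 0x00
# and pad the end, so each pair becomes (letter, 0x00) instead.
good = "[ head>".encode("utf-16-be")
garbled = (good[1:] + b"\x00").decode("utf-16-be")
print([f"U+{ord(ch):04X}" for ch in garbled])
```

The second list starts with U+5B00, the code point of the first garbled glyph shown by DOM Inspector.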
Comment 9•11 years ago
The phenomenon I reported in comment #8 was bug 604284. Setting dependency for ease of tracking and analysis.
Depends on: 604284
Comment 10•11 years ago
Test mail. multipart/mixed. UTF-16LE, quoted-printable.
No newline character (U+000A, U+000D) in the quoted-printable data.
body, text/plain : ABCDあいうえお
   (あいうえお is Japanese Hiragana, AIUEO)
   (ABCD in QP : A=00B=00C=00D=00)
   (あいうえお in QP : =42=30=44=30=46=30=48=30=4A=30)
part 2, text/plain : <p>ABCDあいうえお</p>
part 3, text/html : <p>ABCDあいうえお</p>
(1) View/Character Encoding correctly shows Unicode (UTF-16LE).
(2) In any part,
   7bit-ascii characters are garbled.
   Japanese characters are displayed normally.
(3) In the text/html part,
   HTML tags are not detected, so they are shown as a string under <BODY>.
The cause of the problem in quoted-printable is perhaps the 0x00 for single-byte ASCII in UTF-16.
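For reference, the quoted-printable strings above decode to the expected text when the 0x00 bytes are preserved. A minimal check with Python's standard `quopri` module (variable names are illustrative):

```python
import quopri

# QP body of the test mail: "ABCD" followed by the five Hiragana letters.
qp_body = b"A=00B=00C=00D=00" + b"=42=30=44=30=46=30=48=30=4A=30"

raw = quopri.decodestring(qp_body)   # plain QP decoding, NUL bytes kept
text = raw.decode("utf-16-le")
print(text)
```

So the test data itself is valid; any garbling has to come from how the decoded bytes are handled afterwards.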
Attachment #8408742 -
Attachment is obsolete: true
Attachment #8408743 -
Attachment is obsolete: true
Attachment #8408744 -
Attachment is obsolete: true
Updated•11 years ago
Summary: Decode Content Email error (if charset=utf-16le or utf-16be and is QP/base64 encoded, data doesn't look converted correctly, or wrong charset seems applied aftter conversion) → Decode Content Email error (if charset=utf-16le or utf-16be and is QP encoded, data doesn't look converted correctly, or wrong charset seems applied aftter conversion)
Comment 11•11 years ago
Garbled display in test case attached to comment #10.
> "UTF-16LE binary for ABCDあいうえお in text/plain(quoted-printable)" is shown as;
> ⁁⁂⁃⁄あいうえお
Updated•11 years ago
Component: Message Reader UI → MIME
OS: Linux → All
Product: Thunderbird → MailNews Core
Hardware: x86_64 → All
Version: 28 → 24
Updated•11 years ago
Summary: Decode Content Email error (if charset=utf-16le or utf-16be and is QP encoded, data doesn't look converted correctly, or wrong charset seems applied aftter conversion) → Decode Content Email error (if charset=utf-16le or utf-16be and is QP encoded, 7bit-ascii character doesn't look converted correctly)
Comment 12•11 years ago
FYI.
In test case attached to comment #10,
> <p>ABCDあいうえお<\p> is converted to
> ‼⁰‾⁁⁂⁃⁄あいうえお‼ ⁰‾
So, in the text/html part, the HTML tags are not processed and are shown as <body> text.
If CR and/or LF is contained, bug 604284 occurs. Because LF is used by the original mail, both bug 604284 (due to CR/LF) and this bug (QP, 7bit-ascii char) occur at the same time on the original mail.
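The effect on markup can be sketched in Python, assuming each 0x00 byte becomes 0x20 before charset conversion. `mangle` is a hypothetical helper for illustration, not a libmime function.

```python
def mangle(text: str) -> str:
    """Encode as UTF-16LE, turn NUL bytes into spaces, decode again."""
    raw = text.encode("utf-16-le")
    return raw.replace(b"\x00", b"\x20").decode("utf-16-le")

shown = mangle("<p>ABCD</p>")
print([f"U+{ord(ch):04X}" for ch in shown])
# Every ASCII character has moved into the U+20xx range, so an HTML
# parser no longer finds any "<" to start a tag.
```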
The following are different issues:
- Garbled display when 0x5C (backslash in ASCII) is contained in UTF-16 binary.
- Garbled display when raw binary of 0x00 in UTF-16 is sent as mail data stream
  (Content-Transfer-Encoding: 8bit instead of base64/quoted-printable).
  This won't occur if SMTP/IMAP/POP3 is used.
  However, a Web mail system can do anything, and an .eml file can contain anything.
Comment 13•11 years ago
FYI. The culprit of the =5C=0D case was 0x0D instead of 0x5C. See bug 604284 comment #5, please.
Comment 14•11 years ago
If quoted-printable, ABCD in UTF-16LE/UTF-16BE is shown with the glyphs ⁁⁂⁃⁄ by Tb 24.
When this string was copied and pasted into a text editor and saved in UTF-8, the following binary was obtained.
E28181 E28182 E28183 E28184
This was:
UTF-8   Code point                    UTF-16  QP/UTF-16BE     ABCD of QP/UTF-16BE
E28181  http://codepoints.net/U+2041  0x2041  =20=41       <- =00=41 === A
E28182  http://codepoints.net/U+2042  0x2042  =20=42       <- =00=42 === B
E28183  http://codepoints.net/U+2043  0x2043  =20=43       <- =00=43 === C
E28184  http://codepoints.net/U+2044  0x2044  =20=44       <- =00=44 === D
This doesn't occur with base64.
If the part is saved to a file, the original binary of =00=41=00=42=00=43=00=44 is kept.
The problem is:
   Upon processing quoted-printable text for text display/HTML parsing,
   =00 (NUL) is replaced by =20 (space), regardless of charset.
   For UTF-16 or UTF-32, =00 should be preserved, even when quoted-printable,
   as is done correctly with base64.
Is this a quirk for malformed mail in quoted-printable?
Or is text processing done on the quoted-printable string directly, without decoding quoted-printable?
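The table can be reproduced with a few lines of Python, assuming the only defect is a NUL-to-space substitution applied to the decoded bytes before UTF-16BE conversion. The `rows` list is just a container for this illustration.

```python
import quopri

rows = []
for qp in (b"=00=41", b"=00=42", b"=00=43", b"=00=44"):
    original = quopri.decodestring(qp)             # e.g. b"\x00A"
    damaged = original.replace(b"\x00", b"\x20")   # NUL turned into space
    shown = damaged.decode("utf-16-be")            # one character per row
    rows.append((original.decode("utf-16-be"),
                 f"U+{ord(shown):04X}",
                 shown.encode("utf-8").hex().upper()))

for row in rows:
    print(*row)
```

Each printed row gives the intended letter, the code point actually displayed, and its UTF-8 bytes, matching the first, second and last columns of the table.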
Summary: Decode Content Email error (if charset=utf-16le or utf-16be and is QP encoded, 7bit-ascii character doesn't look converted correctly) → Decode Content Email error (if charset=utf-16le or utf-16be and is QP encoded, =00 for 7bit-ascii character is replaced by =20 == space)
Comment 15•11 years ago
FYI.
If raw UTF-16 binary is placed in a text/plain part (Content-Type: text/plain; charset=UTF-16BE, Content-Transfer-Encoding: 8bit), "remove 0x00" occurs, and 0x00 is lost by "Save".
Raw UTF-16BE data in mail : 0x0041 0042 0043 0044 4230 4430 4630 4830 4A30
Binary of text by "Save"  : 0x41424344 4230 4430 4630 4830 4A30
                              ^        ^
                              |        |
                              |        +-- correctly displayed
                              +-- Shown as 0x4142, 0x4344 of UTF-16, so broken letters are shown
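The two byte strings above differ only by the removed 0x00 bytes, which can be confirmed with a short Python check. The hex strings are copied verbatim from this comment; the variable names are illustrative.

```python
raw_in_mail = bytes.fromhex("0041 0042 0043 0044 4230 4430 4630 4830 4A30")
saved = bytes.fromhex("41424344 4230 4430 4630 4830 4A30")

# Stripping every NUL byte from the mail data yields exactly the saved bytes.
stripped = raw_in_mail.replace(b"\x00", b"")
print(stripped == saved)

# The first four saved bytes are then read as two UTF-16BE code units.
broken = saved[:4].decode("utf-16-be")
print([f"U+{ord(ch):04X}" for ch in broken])
```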
Comment 16•9 years ago
This is a regression from Bug 243199
Attachment 148398 does replace NUL chars with space chars:
- *out++ = (char) c;
+ /* treat null bytes as spaces per bug 243199 comment 7 */
+ *out++ = c ? (char) c : ' ';
The current code was indeed changed a bit, but still works that way.
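In Python terms, the patched line behaves roughly like the sketch below. The function and its `nul_to_space` flag are hypothetical and only mirror the quoted diff; this is not the libmime code, and the decoder ignores soft line breaks and malformed escapes for brevity.

```python
def decode_qp(data: bytes, nul_to_space: bool) -> bytes:
    """Tiny quoted-printable decoder that handles only "=XX" escapes."""
    out = bytearray()
    i = 0
    while i < len(data):
        if data[i] == ord("=") and i + 2 < len(data):
            c = int(data[i + 1:i + 3].decode("ascii"), 16)
            i += 3
        else:
            c = data[i]
            i += 1
        # Mirrors the diff: *out++ = c ? (char) c : ' ';
        out.append(0x20 if (nul_to_space and c == 0) else c)
    return bytes(out)

qp = b"A=00B=00C=00D=00"
print(decode_qp(qp, nul_to_space=False).decode("utf-16-le"))
print(decode_qp(qp, nul_to_space=True).decode("utf-16-le"))
```

With the flag off the UTF-16LE text survives; with it on, the output is the garbled ⁁⁂⁃⁄ sequence reported earlier in this bug.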
Comment 17•9 years ago
Without attachment 148398, the second test mail displays fine.
Comment 18•9 years ago
(In reply to WADA from comment #12)
> If CR and/or LF is contained, bug 604284 occurs. Because LF is used by
> original mail, both bug 604284(due to CR/LF) and this bug(QP, 7bits-ascii
> char) occurs at same time on original mail.
Confirmed.
Without attachment 148398, the Plain-Text view of the first test mail also displays fine, but the HTML view is still broken.
If I remove the LF from the HTML part, this view is also OK.
Comment 19•6 years ago
Confirming presence of bug in TB 52.7 and 60.4
Comment 20•5 years ago
Still present in 60.7.2
Updated•3 years ago
Severity: normal → S3