Open Bug 997050 Opened 11 years ago Updated 3 years ago

Decode Content Email error (if charset=utf-16le or utf-16be and is QP encoded, =00 for 7bit-ascii character is replaced by =20 == space)

Categories

(MailNews Core :: MIME, defect)

Tracking

(Not tracked)

People

(Reporter: thierry.bruyere, Unassigned)

References

(Depends on 1 open bug)

Details

Attachments

(3 files, 3 obsolete files)

User Agent: Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:28.0) Gecko/20100101 Firefox/28.0 (Beta/Release)
Build ID: 20140317233623

Steps to reproduce:
I cannot read an email. The content is decoded incorrectly and displayed as unreadable characters. If I reply to the email, the content is decoded correctly.
Confirming. When just viewing the message, it shows all text in Chinese and missing chars (on Win XP). When replying to it, it shows paragraphs of French(?) interleaved with paragraphs of Chinese.

The message is in utf-16le encoding, defined like this:

Content-Type: multipart/alternative; boundary="=_30CA4D538242981EE100000082975D1C"

--=_30CA4D538242981EE100000082975D1C
Content-Transfer-Encoding: quoted-printable
Content-Type: text/plain; charset="utf-16le"
Content-Disposition: inline
Content-Description: plainpart

jcranmer, is this supposed to work?
Status: UNCONFIRMED → NEW
Ever confirmed: true
Flags: needinfo?(Pidgeot18)
If View/Character Encoding/Unicode(UTF-8) is chosen:
- View/Message Body As/Plain Text: text is shown, although some letters are shown as U+FFFD, and a space is shown between each letter (because utf-16 is shown as utf-8).
- View/Message Body As/any HTML: HTML source is shown as a text string, without being parsed as HTML. Text is shown, although some letters are shown as U+FFFD, and a space is shown between each letter.
The external phenomenon itself is similar to Bug 571704 (see also duped Bug 594646).
If each part is changed to base64, the same result as above is observed without changing Character Encoding to UTF-8, and with no "space between each letter".
1. View/Message Body As/All Body Parts, save each part to a file. The data is correct utf-16le without BOM, although an excess <HEAD><metaHTTP-EQUIV="Content-Type" CONTENT="text/html;charset=UTF-8"> is contained after the last </html>.
2. Compose a mail in Tb, attach the two saved files in utf-16le without BOM. Send Later. Because the data is utf-16le, Tb encodes it in base64.
3. Edit the mail in Outbox, set the charset correctly, change the filename for ease of testing, Repair Folder.
Attachment #8408744 - Attachment description: Crafted mail folder file : multipart/mixed mail with text/plain & text/htlm utf-16le part → Crafted mail folder file : multipart/mixed mail with base64 encoded utf-16le parts
(In reply to :aceman from comment #1)
> Confirming. When just viewing the message it shows all text in Chinese and
> missing chars (on Win XP).
> When replying to it, it shows paragraphs of French(?) interleaved with
> paragraphs of Chinese.
I recognized "Veuillez" in the beginning which is indeed French.
> jcranmer, is this supposed to work?
UTF-16LE should work, which means someone in the depths of libmime is gunking this stuff up badly. I don't have the time right now to figure out what, though.
Flags: needinfo?(Pidgeot18)
Note: Copy the attached file as a "local mail folder file", restart Tb, then Repair Folder. Because raw utf-16le binary is used in the mail data stream, 0x00 is contained, so copy/move of the mail won't work on the mail data of Case-1/Case-2.

Case-1. Crafted mail folder file : text/plain utf-16le mail
  Mail is shown by View/Character Encoding/UTF-8.
Case-2. Crafted mail folder file : text/html utf-16le mail
  Mail is shown by View/Character Encoding/UTF-8. HTML is rendered.
Case-3. Crafted mail folder file : multipart/mixed mail with base64 encoded utf-16le parts
  The text/plain and text/html parts are shown without View/Character Encoding/UTF-8. The text/html part is not rendered as HTML; the HTML source is shown as a string.

(A) Garbled display in Case-1/Case-2 : data converted to utf-8 is shown as utf-16le. If the utf-16le binary were interpreted as utf-8 by View/Character Encoding/UTF-8, "space between each letter" should occur, but it doesn't. View/Character Encoding/UTF-8 is perhaps internally applied to "data converted to utf-8", because the mail text is not encoded in Case-1/Case-2.
(B) For an attachment (shown in an IFRAME), data converted to utf-8 appears to be shown as utf-8. However, Tb seems to forget to call the HTML parser.
(C) For a text/plain or text/html part under multipart/alternative, a phenomenon similar to Case-1/Case-2 seems to occur, because they are the "message body". For the "message body", a "problem when encoded" (base64 or quoted-printable) may occur.
Summary: Decode Content Email error → Decode Content Email error (if charset=utf-16le, data converted to utf-8 is shown as utf-16le, and if HTML message body is encoded in QP or base64, it's not rendered as HTML correctly)
Attachment #8408742 - Attachment mime type: text/plain → text/plain; charset=iso-8859-1
Attachment #8408743 - Attachment mime type: text/plain → text/plain; charset=iso-8859-1
Summary: Decode Content Email error (if charset=utf-16le, data converted to utf-8 is shown as utf-16le, and if HTML message body is encoded in QP or base64, it's not rendered as HTML correctly) → Decode Content Email error (if charset=utf-16le or utf-16be and is QP/base64 encoded, data doesn't look converted correctly, or wrong charset seems applied after conversion)
To avoid content sniffing, "<" is changed to "[ ". The following is the glyph of the text in utf-8. ([CRLF]=0x0D0A)
> [ html>[CRLF]
> [ head>[CRLF]
> [ meta charset="utf-8">[CRLF]
> [ meta http-equiv="Content-Type" content="text/html; charset='utf-8'">[CRLF]
> [ /head>[CRLF]
> [ body>
> [ p lang="ru">абвгдеёжзий[ /p>[CRLF]
> [ p lang="ja">ここは、日本語の文字です。[ /p>[CRLF]
> [ p lang="en">Hello World[ /p>[CRLF]
> [ p lang="fr">Veuillez ne répondez pas à cet e-mail pour mettre à jour votre Service Ticket.[ /p>[CRLF]
> [ p lang="zh-Hant">請以手足關係的精神相對待[ /p>[CRLF]
> [ /body>[CRLF]
> [ /html>[CRLF]
The above data is saved as a utf-16be text file, the data is encoded with base64, and the base64 encoded data is put in the mail data stream.
> Content-Type: text/plain; charset=utf-16be
> Content-Transfer-Encoding: base64
>
> AFsAIABoAHQAbQBsAD4ADQAKAFsAIABoAGUAYQBkAD4ADQAKACAAIAAgACAAWwAgAG0AZQB0
>(snip)
> ACAALwBiAG8AZAB5AD4ADQAKAFsAIAAvAGgAdABtAGwAPgANAAo=
If this mail is displayed by Tb, View/Character Encoding : Unicode(UTF-16BE) is checked, and DOM Inspector showed the following data in the #text node content under PRE, which is held in DIV.
> [ html>
> ਀嬀 栀攀愀搀㸀ഊ [ meta charset="utf-8">
> ਀    嬀 洀攀琀愀 栀琀琀瀀ⴀ攀焀甀椀瘀㴀∀䌀漀渀琀攀渀琀ⴀ吀礀瀀攀∀ 挀漀渀琀攀渀琀㴀∀琀攀砀琀⼀栀琀洀氀㬀 挀栀愀爀猀攀琀㴀✀甀琀昀ⴀ㠀✀∀㸀ഊ[ /head>
> ਀嬀 戀漀搀礀㸀ഊ[ p lang="ru">абвгдеёжзий[ /p>
> ਀嬀 瀀 氀愀渀最㴀∀樀愀∀㸰匰匰漰ťⲊ鸰湥蝛地朰夰Ȁ嬀 ⼀瀀㸀ഊ[ p lang="en">Hello World[ /p>
> ਀嬀 瀀 氀愀渀最㴀∀昀爀∀㸀嘀攀甀椀氀氀攀稀 渀攀 爀瀀漀渀搀攀稀 瀀愀猀  挀攀琀 攀ⴀ洀愀椀氀 瀀漀甀爀 洀攀琀琀爀攀  樀漀甀爀 瘀漀琀爀攀 匀攀爀瘀椀挀攀 吀椀挀欀攀琀⸀嬀 ⼀瀀㸀ഊ[ p lang="zh-Hant">請以手足關係的精神相對੟蔀嬀 ⼀瀀㸀ഊ[ /body>
Is only "single byte code in UTF-8" displayed normally? If so, why are the second "[ meta http-equiv=...", "[ p lang="ja">" and "[ /p>", "[ p lang="zh-Hant">" and "[ /p>" garbled? Is the conversion of utf-16le/utf-16be broken? If so, is it only when QP/base64 encoded? Or is it a problem in "display as #text node in <PRE> element" (text/plain case)?
The phenomenon I reported in comment #8 was bug 604284. Setting dependency for ease of tracking and analysis.
Depends on: 604284
Test mail. multipart/mixed. UTF-16LE, quoted-printable, no newline characters (U+000A, U+000D) in the quoted-printable data.
  body, text/plain : ABCDあいうえお (あいうえお is Japanese Hiragana, AIUEO)
    (ABCD in QP : A=00B=00C=00D=00)
    (あいうえお in QP : =42=30=44=30=46=30=48=30=4A=30)
  part 2, text/plain : <p>ABCDあいうえお</p>
  part 3, text/html : <p>ABCDあいうえお</p>
(1) View/Character Encoding/Unicode(UTF-16LE) is correctly shown.
(2) In every part, 7bit-ascii characters are garbled. Japanese characters are displayed normally.
(3) In the text/html part, HTML tags are not detected, so they are shown as a string under <BODY>.
The cause of the problem in Quoted-Printable is perhaps the 0x00 for single-byte ascii in UTF-16.
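For reference, a minimal sketch of how the quoted-printable test data above maps to UTF-16LE code units when every =XX escape, including =00, is decoded literally. The helper names (qp_hex_value, qp_decode_raw, utf16le_unit) are hypothetical and are not libmime functions; soft line breaks are deliberately not handled because the test mail contains none.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Convert one hex digit to its value, or -1 if it is not a hex digit. */
static int qp_hex_value(char ch)
{
    if (ch >= '0' && ch <= '9') return ch - '0';
    if (ch >= 'A' && ch <= 'F') return ch - 'A' + 10;
    if (ch >= 'a' && ch <= 'f') return ch - 'a' + 10;
    return -1;
}

/*
 * Hypothetical helper: decode a quoted-printable string into raw bytes,
 * keeping every decoded byte as-is (0x00 included).
 * Soft line breaks ("=" at end of line) are NOT handled in this sketch.
 * Returns the number of bytes written to out.
 */
static size_t qp_decode_raw(const char *in, uint8_t *out, size_t out_cap)
{
    size_t n = 0;
    assert(in != NULL && out != NULL);
    while (*in != '\0' && n < out_cap) {
        if (in[0] == '=' && qp_hex_value(in[1]) >= 0 && qp_hex_value(in[2]) >= 0) {
            out[n++] = (uint8_t)(qp_hex_value(in[1]) * 16 + qp_hex_value(in[2]));
            in += 3;
        } else {
            out[n++] = (uint8_t)*in++;   /* literal character */
        }
    }
    return n;
}

/* Read the i-th 16-bit code unit from little-endian UTF-16 bytes. */
static uint16_t utf16le_unit(const uint8_t *bytes, size_t i)
{
    return (uint16_t)(bytes[2 * i] | (bytes[2 * i + 1] << 8));
}
```

Decoded this way, "A=00" yields the bytes 0x41 0x00, which is U+0041 in UTF-16LE, and "=42=30" yields U+3042 (あ).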
Attachment #8408742 - Attachment is obsolete: true
Attachment #8408743 - Attachment is obsolete: true
Attachment #8408744 - Attachment is obsolete: true
Summary: Decode Content Email error (if charset=utf-16le or utf-16be and is QP/base64 encoded, data doesn't look converted correctly, or wrong charset seems applied after conversion) → Decode Content Email error (if charset=utf-16le or utf-16be and is QP encoded, data doesn't look converted correctly, or wrong charset seems applied after conversion)
Garbled display in test case attached to comment #10. > "UTF-16LE binary for ABCDあいうえお in text/plain(quoted-printable)" is shown as; > ⁁⁂⁃⁄あいうえお
Component: Message Reader UI → MIME
OS: Linux → All
Product: Thunderbird → MailNews Core
Hardware: x86_64 → All
Version: 28 → 24
Summary: Decode Content Email error (if charset=utf-16le or utf-16be and is QP encoded, data doesn't look converted correctly, or wrong charset seems applied aftter conversion) → Decode Content Email error (if charset=utf-16le or utf-16be and is QP encoded, 7bit-ascii character doesn't look converted correctly)
FYI. In the test case attached to comment #10,
> <p>ABCDあいうえお<\p>
is converted to
> ‼⁰‾⁁⁂⁃⁄あいうえお‼ ⁰‾
So the HTML tag is not processed in the text/html part, and is shown as <body> text.
If CR and/or LF is contained, bug 604284 occurs. Because LF is used by the original mail, both bug 604284 (due to CR/LF) and this bug (QP, 7bit-ascii chars) occur at the same time on the original mail.
The following are different issues:
- Garbled display when 0x5C (backslash in ascii) is contained in UTF-16 binary.
- Garbled display when raw binary 0x00 in UTF-16 is sent in the mail data stream (Content-Transfer-Encoding: 8bit instead of base64/quoted-printable). This won't occur if SMTP/IMAP/POP3 is used. However, a Web mail system can do anything, and an .eml file can contain anything.
FYI. The culprit of the =5C=0D case was 0x0D instead of 0x5C. See bug 604284 comment #5, please.
If quoted-printable, ABCD in UTF-16LE/UTF-16BE is shown with the glyphs ⁁⁂⁃⁄ by Tb 24. When this string was copied and pasted into a text editor and saved in UTF-8, the following binary was obtained.
  E28181 E28182 E28183 E28184
This was:
  UTF-8                                UTF-16  QP/UTF-16BE     ABCD of QP/UTF-16BE
  E28181 http://codepoints.net/U+2041  0x2041  =20=41      <-  =00=41 === A
  E28182 http://codepoints.net/U+2042  0x2042  =20=42      <-  =00=42 === B
  E28183 http://codepoints.net/U+2043  0x2043  =20=43      <-  =00=43 === C
  E28184 http://codepoints.net/U+2044  0x2044  =20=44      <-  =00=44 === D
This doesn't occur with base64. If saved to a file, the original binary of =00=41=00=42=00=43=00=44 is kept.
The problem is: upon processing quoted-printable text for text display/HTML parsing, =00 (NUL) is replaced by =20 (space), regardless of charset. For UTF-16 or UTF-32, =00 should be preserved even when quoted-printable, as is done correctly with base64.
Is this a quirk for malformed mail in quoted-printable? Or is text processing done on the quoted-printable string directly, without decoding quoted-printable?
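The arithmetic above can be checked with a small sketch: replace each 0x00 byte of UTF-16BE "ABCD" with 0x20, read the result back as UTF-16BE, and encode the code point as UTF-8. The function names are hypothetical, and the NUL-to-space step only mimics the observed behaviour; it is not the actual Thunderbird code.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Replace every 0x00 byte with 0x20, mimicking the observed behaviour. */
static void nul_to_space(uint8_t *bytes, size_t len)
{
    assert(bytes != NULL);
    for (size_t i = 0; i < len; i++) {
        if (bytes[i] == 0x00)
            bytes[i] = 0x20;
    }
}

/* Read the i-th 16-bit code unit from big-endian UTF-16 bytes. */
static uint16_t utf16be_unit(const uint8_t *bytes, size_t i)
{
    return (uint16_t)((bytes[2 * i] << 8) | bytes[2 * i + 1]);
}

/*
 * Encode one BMP code point (assumed not to be a surrogate) as UTF-8.
 * Returns the number of bytes written (1 to 3).
 */
static size_t utf8_encode_bmp(uint16_t cp, uint8_t out[3])
{
    if (cp < 0x80) {
        out[0] = (uint8_t)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (uint8_t)(0xC0 | (cp >> 6));
        out[1] = (uint8_t)(0x80 | (cp & 0x3F));
        return 2;
    }
    out[0] = (uint8_t)(0xE0 | (cp >> 12));
    out[1] = (uint8_t)(0x80 | ((cp >> 6) & 0x3F));
    out[2] = (uint8_t)(0x80 | (cp & 0x3F));
    return 3;
}
```

With the input bytes 00 41, the replaced pair 20 41 reads as U+2041, whose UTF-8 form is E2 81 81, matching the bytes obtained from the text editor.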
Summary: Decode Content Email error (if charset=utf-16le or utf-16be and is QP encoded, 7bit-ascii character doesn't look converted correctly) → Decode Content Email error (if charset=utf-16le or utf-16be and is QP encoded, =00 for 7bit-ascii character is replaced by =20 == space)
FYI. If raw UTF-16 binary is placed in a text/plain part (Content-Type: text/plain; charset=UTF-16BE, Content-Transfer-Encoding: 8bit), "remove 0x00" occurs, and 0x00 is lost by "Save".
  Raw UTF-16BE data in mail : 0x0041 0042 0043 0044 4230 4430 4630 4830 4A30
  Binary of text by "Save"  : 0x41424344 4230 4430 4630 4830 4A30
                                A        A
                                |        |
                                |        +-- correctly displayed
                                +-- Shown as 0x4142, 0x4344 of UTF-16, so broken letter is shown
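A minimal sketch of the "remove 0x00" effect described above, assuming the bytes are simply filtered one by one. strip_nul_bytes and utf16be_unit_at are hypothetical names used for illustration, not functions from the Thunderbird source.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Copy in[0..len) to out, skipping every 0x00 byte.
 * out must have room for len bytes. Returns the number of bytes kept.
 */
static size_t strip_nul_bytes(const uint8_t *in, size_t len, uint8_t *out)
{
    size_t n = 0;
    assert(in != NULL && out != NULL);
    for (size_t i = 0; i < len; i++) {
        if (in[i] != 0x00)
            out[n++] = in[i];
    }
    return n;
}

/* Read the i-th 16-bit code unit from big-endian UTF-16 bytes. */
static uint16_t utf16be_unit_at(const uint8_t *bytes, size_t i)
{
    return (uint16_t)((bytes[2 * i] << 8) | bytes[2 * i + 1]);
}
```

Applied to the first six code units of the data above, the four ASCII letters collapse into the two units 0x4142 and 0x4344, while the units that contain no 0x00 byte pass through unchanged.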
This is a regression from Bug 243199. Attachment 148398 replaces NUL chars with space chars:

-      *out++ = (char) c;
+      /* treat null bytes as spaces per bug 243199 comment 7 */
+      *out++ = c ? (char) c : ' ';

The current code has indeed changed a bit, but still works that way.
Without attachment 148398, the second test mail displays fine.
(In reply to WADA from comment #12)
> If CR and/or LF is contained, bug 604284 occurs. Because LF is used by
> original mail, both bug 604284(due to CR/LF) and this bug(QP, 7bits-ascii
> char) occurs at same time on original mail.
Confirmed. Without attachment 148398, the Plain-Text view of the first test mail also displays fine. But the HTML view is still broken. If I remove the LF from the HTML part, this view is also OK.
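One possible direction, sketched under assumptions rather than taken from any patch on this bug: keep the NUL-to-space quirk for ordinary charsets, but preserve 0x00 when the part's charset is UTF-16 or UTF-32, as suggested in the earlier analysis. Both function names and the keep_nul parameter are hypothetical; the real decoder does not necessarily know the charset at this point.

```c
#include <assert.h>
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical check: does the charset name start with "utf-16" or
 * "utf-32" (case-insensitive)? Such charsets legitimately contain 0x00.
 */
static bool charset_uses_nul_bytes(const char *charset)
{
    static const char *const prefixes[] = { "utf-16", "utf-32" };
    assert(charset != NULL);
    for (size_t p = 0; p < sizeof prefixes / sizeof prefixes[0]; p++) {
        const char *a = charset;
        const char *b = prefixes[p];
        while (*b != '\0' && tolower((unsigned char)*a) == *b) {
            a++;
            b++;
        }
        if (*b == '\0')
            return true;   /* whole prefix matched */
    }
    return false;
}

/*
 * Hypothetical variant of the byte-emission step from the diff above:
 * a decoded NUL becomes a space only when keep_nul is false.
 */
static char emit_decoded_byte(unsigned char c, bool keep_nul)
{
    return (c != 0 || keep_nul) ? (char)c : ' ';
}
```

A caller would pass keep_nul = charset_uses_nul_bytes(charset) for the part being decoded, so that mail in single-byte charsets keeps the behaviour introduced for bug 243199.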

Confirming presence of bug in TB 52.7 and 60.4

Still present in 60.7.2

Severity: normal → S3