Open
Bug 604284
Opened 14 years ago
Updated 2 years ago
base64 encoded text UTF-16 text is displayed garbled after newline (both U+000A/U+000D, and 0x0D/0x0A as second byte of UTF-16)
Categories
(MailNews Core :: MIME, defect)
Tracking
(Not tracked)
NEW
People
(Reporter: bmo, Unassigned)
References
(Depends on 1 open bug, Blocks 1 open bug)
Details
Attachments
(4 files, 1 obsolete file)
The following email MIME code is displayed in Thunderbird 3.1.4 with corrupted text following the first LF character. This may be a problem with the base64 decoder?

--=_dbae96014315fdcdc85247a6c4ff9209
Content-Transfer-Encoding: base64
Content-Type: text/plain; charset=utf-16; name=example.licence.utf16.txt
Content-Disposition: attachment; filename=example.licence.utf16.txt

IABMAGkAYwBlAG4AYwBlADoAIAAgAFMAbwBtAGUAcABsAGEAYwBlAAoASABvAHMAdABuAGEAbQBl
ADoAIABTAG8AbQBlAHcAaABlAHIAZQAKAA==
--=_dbae96014315fdcdc85247a6c4ff9209--

Results are:
Found user stream: F00D0007 (example.licence)
-- example.licence.utf16.txt --
Licence: Someplace䠀漀猀琀渀愀洀攀㨀 匀漀洀攀眀栀攀爀攀ഀ

Expected results are:
Found user stream: F00D0007 (example.licence)
-- example.licence.utf16.txt --
Licence: Someplace
Hostname: Somewhere
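The garbling can be approximated outside Thunderbird with a short sketch. It assumes (a hypothesis only, not confirmed Thunderbird behaviour) that some step treats the single byte 0x0A as a line break, so the rest of the UTF-16LE stream is decoded one byte out of alignment. The helper name `decode_after_lf_byte` is made up for illustration.

```python
import base64

# Base64 payload copied from the MIME part in the report.
PAYLOAD = (
    "IABMAGkAYwBlAG4AYwBlADoAIAAgAFMAbwBtAGUAcABsAGEAYwBlAAoASABvAHMAdABuAGEAbQBl"
    "ADoAIABTAG8AbQBlAHcAaABlAHIAZQAKAA=="
)

raw = base64.b64decode(PAYLOAD)

# Correct handling: decode the whole buffer as UTF-16LE in one go.
expected = raw.decode("utf-16-le")


def decode_after_lf_byte(data):
    """Hypothetical faulty path: split on the raw byte 0x0A, then decode
    each fragment as UTF-16LE, dropping any odd trailing byte."""
    lines = []
    for fragment in data.split(b"\x0a"):
        even_length = len(fragment) - (len(fragment) % 2)
        lines.append(fragment[:even_length].decode("utf-16-le"))
    return lines


garbled = decode_after_lf_byte(raw)

print(repr(expected))
for line in garbled:
    print(repr(line))
```

Under this assumption the first fragment decodes correctly and the second one begins with 䠀漀猀琀渀愀洀攀㨀, as in the reported display. The trailing ഀ in the report is not reproduced by this simplified model.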
Comment 1•14 years ago
Content of the file generated by "Save As" in Tb 3.0.4. File size is 82 bytes. Hex dump of the file:
> 20004C006900630065006E00630065003A002000200053006F006D00650070006C00610063006500
> 0A0048006F00730074006E0061006D0065003A00200053006F006D00650077006800650072006500
> 0A00
The data is UTF-16LE without a BOM. There is no problem in the base64 decoding, and the result is no different for UTF-16LE data with a BOM. Notepad.exe shows the saved file as expected (0x0A00 is shown as LF, i.e. 0x0A of us-ascii).
The problem is observed in a trunk build too.
> Mozilla/5.0 (Windows NT 5.1; rv:2.0b7pre) Gecko/20100925 Thunderbird/3.3a1pre
This seems to be an issue only with text files in UTF-16 (LE/BE, and UTF-32?). No problem if the file is HTML?
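The claim that the base64 step itself is fine can be checked with a few lines. This is only a consistency check between the hex dump and the base64 text quoted in this bug; the variable names are illustrative.

```python
import base64

# Hex dump of the file saved via "Save As", as quoted in the comment.
HEX_DUMP = (
    "20004C006900630065006E00630065003A002000200053006F006D00650070006C00610063006500"
    "0A0048006F00730074006E0061006D0065003A00200053006F006D00650077006800650072006500"
    "0A00"
)

# Base64 payload from the original report.
PAYLOAD = (
    "IABMAGkAYwBlAG4AYwBlADoAIAAgAFMAbwBtAGUAcABsAGEAYwBlAAoASABvAHMAdABuAGEAbQBl"
    "ADoAIABTAG8AbQBlAHcAaABlAHIAZQAKAA=="
)

saved = bytes.fromhex(HEX_DUMP)
decoded = base64.b64decode(PAYLOAD)

# A UTF-16 BOM would be FF FE (little endian) or FE FF (big endian).
has_bom = saved[:2] in (b"\xff\xfe", b"\xfe\xff")

print(len(saved))                      # size of the saved file in bytes
print(saved == decoded)                # does the dump match the base64 data?
print(has_bom)
print(repr(saved.decode("utf-16-le")))
```

If the two byte strings are equal, the corruption has to happen after base64 decoding, at or before the point where the text is prepared for display.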
Comment 2•10 years ago
mail #1 : Glyph of test data.
Cyrillic, French, Japanese, Traditional Chinese letters are contained.
mail #2 : UTF-16BE, text/plain, base64 encoded.
Garbled display by newline is observed.
mail #3 : UTF-16BE, text/html, base64 encoded. Newline character is removed
=> "Garbled display due to newline character in HTML source" disappears.
=> Garbled display due to "tag detection/newline detection failure"
started at mid of string in <p lang="zh-Hant">.
Conversion from UTF-16 to UTF-8 is done at wrong binary data boundary?
Comment 3•10 years ago
Cause of garbled display in mail #3 was backslash(0x5C) in Chinese character.
(mail #1 to mail #3 are unchanged)
mail #1 : Glyph of test data.
Cyrillic, French, Japanese, Traditional Chinese letters are contained.
mail #2 : UTF-16BE, text/plain, base64 encoded.
Garbled display by newline is observed.
mail #3 : UTF-16BE, text/html, base64 encoded. Newline character is removed
=> "Garbled display due to newline character in HTML source" disappears.
=> Garbled display due to "tag detection/newline detection failure"
started at mid of string in <p lang="zh-Hant">.
Conversion from UTF16 to UTF-8 is done at wrong binary data boundary?
(mail #4 is added)
mail #4 : UTF-16BE, text/html, base64 encoded. Newline character is removed.
Difference between mail #3 and mail #4 is one Traditional Chinese letter (對 is replaced by 請).
mail #3 : 請以手足關係的精神相對待
mail #4 : 請以手足關係的精神相請待
=> "Garbled display due to newline character in HTML source" disappears.
=> Garbled display due to "tag detection/newline detection failure"
observed in mail #3 doesn't occur.
> Traditional Chinese letter 對
> http://www.fileformat.info/info/unicode/char/5c0d/index.htm
> UTF-16 (hex) = 0x5C0D
> 0x5C in ascii = \ (Backslash which is used for escaping)
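To see which bytes distinguish the two test strings, the following sketch lists every LF, CR or backslash byte in the UTF-16BE form of each string. The function name and the choice of "suspect" bytes are assumptions made for illustration.

```python
# Bytes that a byte-oriented (non UTF-16 aware) parser might treat specially.
SUSPECT_BYTES = {0x0A: "LF", 0x0D: "CR", 0x5C: "backslash"}


def find_suspect_bytes(text, encoding="utf-16-be"):
    """Return (offset, byte value) for every LF, CR or backslash byte
    found in the encoded form of text."""
    encoded = text.encode(encoding)
    return [(i, b) for i, b in enumerate(encoded) if b in SUSPECT_BYTES]


mail3 = "請以手足關係的精神相對待"
mail4 = "請以手足關係的精神相請待"

for name, text in (("mail #3", mail3), ("mail #4", mail4)):
    hits = find_suspect_bytes(text)
    print(name, [(offset, SUSPECT_BYTES[b]) for offset, b in hits])
```

Note that the encoded form of 對 (0x5C 0x0D) contains a CR byte as well as a backslash byte, so this comparison alone cannot tell which of the two triggers the breakage.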
Attachment #8409270 -
Attachment is obsolete: true
Comment 4•10 years ago
With Quoted-Printable and non-ascii (DBCS) characters, it's easy to observe the garbage caused by CR/LF.
Text data : ABCDあいうえおABCD[CRLF]あいうえお
ABCD in UTF-16LE : =41=00=42=00=43=00=44=00 (equivalently A=00B=00C=00D=00)
あいうえお in UTF-16LE : =42=30=44=30=46=30=48=30=4A=30 (Japanese Hiragana. A, I, U, E, O)
CR in UTF-16LE : =0D=00
LF in UTF-16LE : =0A=00
(0) ABCD => converted to "⁁⁂⁃⁄" (without the quotes; this is bug 997050)
(1) CRLF is inserted before 2nd あいうえお
⁁⁂⁃⁄あいうえお⁁⁂⁃⁄ഠ あいうえお
(2) LF is inserted before 2nd あいうえお
⁁⁂⁃⁄あいうえお⁁⁂⁃⁄䈠䐰䘰䠰䨰
(3) CR is inserted before 2nd あいうえお
⁁⁂⁃⁄あいうえお⁁⁂⁃⁄䈠䐰䘰䠰䨰
(4) HT is inserted before 2nd あいうえお => No problem
⁁⁂⁃⁄あいうえお⁁⁂⁃⁄ あいうえお
- If CR or LF, a byte for the newline is generated, and an excess byte is left over. The UTF-16 binary then starts from the excess byte, so the characters are broken.
- If CRLF, the two excess bytes are merged into "ഠ ", so the DBCS letters after it are fortunately not affected.
- With base64, bug 997050 doesn't occur, so alphabet characters are not altered to letters like ⁁⁂⁃⁄. However, alphabet characters are 0x00## in UTF-16, so 0x00 appears in the binary. The excess byte then breaks the binary for alphabet characters.
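The =XX forms listed above can be regenerated with a small helper. This is a minimal sketch: `qp_hex` is a made-up name, it escapes every byte (which quoted-printable permits), and it ignores the 76-character line limit that real quoted-printable encoders apply.

```python
def qp_hex(text, encoding="utf-16-le"):
    """Write every encoded byte of text as a quoted-printable =XX escape."""
    return "".join("=%02X" % b for b in text.encode(encoding))


samples = [
    ("ABCD", "ABCD"),
    ("あいうえお", "あいうえお"),
    ("CR", "\r"),
    ("LF", "\n"),
]

for label, sample in samples:
    print(label, "in UTF-16LE :", qp_hex(sample))

# The same helper covers the big endian test data used in later comments.
print("LF in UTF-16BE :", qp_hex("\n", "utf-16-be"))
```

Escaping every byte makes the position of 0x0A and 0x0D inside each two-byte code unit easy to see when building test mails by hand.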
Updated•10 years ago
Component: Message Reader UI → MIME
Product: Thunderbird → MailNews Core
Version: 3.1 → 24
Updated•10 years ago
OS: Windows 7 → All
Hardware: x86_64 → All
Comment 5•10 years ago
This is not only a problem with UTF-16 CR (U+000D) / LF (U+000A). "0x0D/0x0A as the second byte of a UTF-16 code unit" always produces broken letters. With UTF-16BE, any letter after =??=0A / =??=0D is broken. With quoted-printable, it's easy to observe.
Test data:
=7C=BE=79=5E=76=F8=??=0#=5F=85=04=34=04=36=04=37=04=38=04=39
where ?? is 5C, 6C, 7C etc.
0# is 09, 0A, 0B, 0C, 0D, 0E
(not all combinations are contained).
Broken pattern:
=??=0D (or =??=0A) itself is displayed normally.
=0A is inserted after =??=0D (or =??=0A)
-> =0A=5F =85=04 =34=04 =36=04 =37=04 =38=04 =39
The last =39 is ignored, or merged with the binary after the last =39.
This perhaps also applies to =00=0A, =00=0D, =00=0D=00=0A. After =00=0D=00=0A (or =00=0A, =00=0D), 0x0A is inserted, and that 0x0A is merged with the binary after =00=0D=00=0A (or =00=0A, =00=0D).
Problem upon putting the #text node in <PRE> for text display? Problem upon parsing the HTML source text which is held in the #text node?
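The broken pattern described in this comment can be modelled as follows. This is a sketch of the reporter's description only (one excess 0x0A byte appearing after a code unit whose second byte is 0x0A or 0x0D), not of any known Thunderbird code path; both function names are hypothetical.

```python
def insert_excess_lf(data):
    """Hypothetical model: after the first UTF-16BE code unit whose low
    (second) byte is 0x0A or 0x0D, one extra 0x0A byte appears."""
    for i in range(1, len(data), 2):
        if data[i] in (0x0A, 0x0D):
            return data[: i + 1] + b"\x0a" + data[i + 1 :]
    return data


def decode_utf16be_lenient(data):
    """Decode as UTF-16BE, ignoring an odd trailing byte."""
    return data[: len(data) - len(data) % 2].decode("utf-16-be")


# Test data from the comment with ?? = 5C and 0# = 0D.
original = bytes.fromhex("7CBE795E76F85C0D5F8504340436043704380439")
broken = insert_excess_lf(original)

print(["U+%04X" % ord(c) for c in decode_utf16be_lenient(original)])
print(["U+%04X" % ord(c) for c in decode_utf16be_lenient(broken)])
```

With the extra byte in place, the code units after U+5C0D are read as 0A5F, 8504, 3404, 3604, 3704, 3804 and the final 0x39 is left over, which is the regrouping shown in the broken pattern.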
Updated•10 years ago
Summary: base64 encoded text UTF-16 text is displayed garbled after newline → base64 encoded text UTF-16 text is displayed garbled after newline (both U+000A/U+000D, and 0x0D/0x0A as second byte of UTF-16)
Comment 6•10 years ago
FYI. Original characters (in UTF-16BE) with newline characters. [CRLF] = U+000D U+000A.
> Line# Glyph Data represented in UTF-16BE/QP
> #1 : AAAA00AA[CRLF] =00=41 =00=41 =00=41 =00=41 =00=30 =00=30 =00=41 =00=41 =00=0D =00=0A
> #2 : AAAA01AA[CRLF] =00=41 =00=41 =00=41 =00=41 =00=30 =00=31 =00=41 =00=41 =00=0D =00=0A
> #3 : AAAA02AA[CRLF] =00=41 =00=41 =00=41 =00=41 =00=30 =00=32 =00=41 =00=41 =00=0D =00=0A
> #4 : AAAA03AA[CRLF] =00=41 =00=41 =00=41 =00=41 =00=30 =00=33 =00=41 =00=41 =00=0D =00=0A
> #5 : AAAA04AA[CRLF] =00=41 =00=41 =00=41 =00=41 =00=30 =00=34 =00=41 =00=41 =00=0D =00=0A
> #6 : AAAA05AA[CRLF] =00=41 =00=41 =00=41 =00=41 =00=30 =00=35 =00=41 =00=41 =00=0D =00=0A
> #7 : AAAA06AA[CRLF] =00=41 =00=41 =00=41 =00=41 =00=30 =00=36 =00=41 =00=41 =00=0D =00=0A
This data is attached to a mail as base64, text/plain.
Data shown in the message pane, in UTF-8:
> row #1 : 4141 4141 3030 4141
> row #2 : E0A880 E48480 E48480 E48480 E48480 E38080 E38480 E48480 E48480 E0B48A 4141 4141 3032 4141
> row #3 : E0A880 E48480 E48480 E48480 E48480 E38080 E38C80 E48480 E48480 E0B48A 4141 4141 3034 4141
> row #4 : E0A880 E48480 E48480 E48480 E48480 E38080 E39480 E48480 E48480 E0B48A 4141 4141 3036 4141
Corresponding Unicode characters:
> E0A880 U+0A00 http://www.fileformat.info/info/unicode/char/0a00/index.htm
> E48480 U+4100 http://www.fileformat.info/info/unicode/char/4100/index.htm
> E38080 U+3000 http://www.fileformat.info/info/unicode/char/3000/index.htm
> E38480 U+3100 http://www.fileformat.info/info/unicode/char/3100/index.htm
> E38C80 U+3300 http://www.fileformat.info/info/unicode/char/3300/index.htm
> E39480 U+3500 http://www.fileformat.info/info/unicode/char/3500/index.htm
> E0B48A U+0D0A http://www.fileformat.info/info/unicode/char/d0a/index.htm
Why and how it is broken is similar to the "0x0D/0x0A in second byte of UTF-16 binary" case. Either 0x0A or 0x0D is generated by the newline, or an orphaned 0x0A or 0x0D is treated as a newline, and it is merged with the binary after U+000D/U+000A.
After the merge in the UTF-16 binary, conversion to UTF-8 is done for text display and HTML parsing.
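The mapping from the displayed UTF-8 bytes back to code points can be reproduced directly. The sketch below decodes row #2 as quoted above and re-serialises the garbled part as UTF-16BE so that the byte shift becomes visible; the variable names are illustrative.

```python
# Row #2 as shown in the message pane (UTF-8, hex), copied from the comment.
ROW2_UTF8_HEX = (
    "E0A880 E48480 E48480 E48480 E48480 E38080 E38480 E48480 E48480 E0B48A "
    "4141 4141 3032 4141"
)

text = bytes.fromhex(ROW2_UTF8_HEX).decode("utf-8")

# The first ten characters are the garbled line #2, the rest is line #3.
garbled_part, intact_part = text[:10], text[10:]

code_points = ["U+%04X" % ord(c) for c in garbled_part]
utf16be_hex = garbled_part.encode("utf-16-be").hex().upper()

print(code_points)
print(utf16be_hex)
print(intact_part)
```

Read as UTF-16BE bytes, the garbled part starts with 0A 00 41 00 and ends with 0D 0A, whereas line #2 as sent starts with 00 41 and ends with 00 0D 00 0A.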
Comment 7•10 years ago
FYI. Actual letters shown at row #1 to row #4 by Tb 24 on Win-XP.
> AAAA00AA
> 䄀䄀䄀䄀 䄀䄀ഊAAAA02AA
> 䄀䄀䄀䄀 ㌀䄀䄀ഊAAAA04AA
> 䄀䄀䄀䄀 㔀䄀䄀ഊAAAA06AA
Because bug 997050 doesn't occur with base64, 7-bit ASCII characters (U+00##) are shown normally before the CRLF, and again after the excess 0x0A or 0x0D has been eaten up by the next excess 0x0A or 0x0D.
Updated•2 years ago
Severity: normal → S3