Open Bug 604284 Opened 14 years ago Updated 2 years ago

base64 encoded text UTF-16 text is displayed garbled after newline (both U+000A/U+000D, and 0x0D/0x0A as second byte of UTF-16)

Categories

(MailNews Core :: MIME, defect)

Tracking

(Not tracked)

People

(Reporter: bmo, Unassigned)

References

(Depends on 1 open bug, Blocks 1 open bug)

Attachments

(4 files, 1 obsolete file)

Thunderbird 3.1.4 displays the following MIME part with corrupted text
after the first LF character. Could this be a problem in the base64
decoder?

--=_dbae96014315fdcdc85247a6c4ff9209
Content-Transfer-Encoding: base64
Content-Type: text/plain; charset=utf-16;
 name=example.licence.utf16.txt
Content-Disposition: attachment;
 filename=example.licence.utf16.txt

IABMAGkAYwBlAG4AYwBlADoAIAAgAFMAbwBtAGUAcABsAGEAYwBlAAoASABvAHMAdABuAGEAbQBl
ADoAIABTAG8AbQBlAHcAaABlAHIAZQAKAA==
--=_dbae96014315fdcdc85247a6c4ff9209--

Results are:

Found user stream: F00D0007 (example.licence)
-- example.licence.utf16.txt --
 Licence:  Someplace਍䠀漀猀琀渀愀洀攀㨀 匀漀洀攀眀栀攀爀攀ഀ


Expected results are:

Found user stream: F00D0007 (example.licence)
-- example.licence.utf16.txt --
 Licence:  Someplace
Hostname: Somewhere
Content of the file generated by "Save As" in Tb 3.0.4.
File size is 82 bytes. Hex dump of the file:
> 20004C006900630065006E00630065003A002000200053006F006D00650070006C00610063006500
> 0A0048006F00730074006E0061006D0065003A00200053006F006D00650077006800650072006500
> 0A00
The data is UTF-16LE without a BOM, so base64 decoding itself is not the problem.
The result is the same for UTF-16LE data with a BOM.
Notepad.exe shows the saved file as expected (0x0A00 is shown as LF, the 0x0A of us-ascii).
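
As a cross-check of the hex dump above, here is a minimal Python sketch. It decodes the base64 body from the report, confirms the 82-byte UTF-16LE payload, and then shows that the garbled glyphs can be reproduced if every 0x0A byte is expanded to 0x0D 0x0A before UTF-16 decoding. That expansion is only an assumption made to illustrate the byte misalignment; it is not confirmed to be what the MIME code actually does.

```python
import base64

# Base64 body copied from the MIME part in the report.
BODY = (
    "IABMAGkAYwBlAG4AYwBlADoAIAAgAFMAbwBtAGUAcABsAGEAYwBlAAoASABvAHMAdABuAGEAbQBl"
    "ADoAIABTAG8AbQBlAHcAaABlAHIAZQAKAA=="
)

raw = base64.b64decode(BODY)
expected = raw.decode("utf-16-le")  # little-endian, no BOM

# Assumption for illustration only: a byte-oriented line handler rewrites each
# 0x0A byte as 0x0D 0x0A, shifting all following UTF-16 code units by one byte.
shifted = raw.replace(b"\n", b"\r\n")
garbled = shifted.decode("utf-16-le")

print(len(raw))        # payload size in bytes
print(repr(expected))  # text as it should be displayed
print(repr(garbled))   # text after the assumed one-byte shift
```

Under this assumption the text after "Someplace" decodes to U+0A0D followed by byte-swapped letters (U+4800, U+6F00, ...), which is the pattern shown under "Results are".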
The problem is observed in a trunk build too.
> Mozilla/5.0 (Windows NT 5.1; rv:2.0b7pre) Gecko/20100925 Thunderbird/3.3a1pre

This seems to affect only text files in UTF-16 (LE/BE, and perhaps UTF-32?).
Is there no problem with HTML files?
mail #1 : Glyphs of the test data.
          Contains Cyrillic, French, Japanese, and Traditional Chinese letters.
mail #2 : UTF-16BE, text/plain, base64 encoded.
          Garbled display caused by newline is observed.
mail #3 : UTF-16BE, text/html, base64 encoded. Newline characters are removed.
     => "Garbled display due to newline character in HTML source" disappears.
     => Garbled display due to "tag detection/newline detection failure"
        starts in the middle of the string in <p lang="zh-Hant">.
        Is conversion from UTF-16 to UTF-8 done at the wrong binary data boundary?
The cause of the garbled display in mail #3 was a backslash (0x5C) byte in a Chinese character.

(mail #1 to mail #3 are unchanged)
mail #1 : Glyphs of the test data.
          Contains Cyrillic, French, Japanese, and Traditional Chinese letters.
mail #2 : UTF-16BE, text/plain, base64 encoded.
          Garbled display caused by newline is observed.
mail #3 : UTF-16BE, text/html, base64 encoded. Newline characters are removed.
     => "Garbled display due to newline character in HTML source" disappears.
     => Garbled display due to "tag detection/newline detection failure"
        starts in the middle of the string in <p lang="zh-Hant">.
        Is conversion from UTF-16 to UTF-8 done at the wrong binary data boundary?

(mail #4 is added)
mail #4 : UTF-16BE, text/html, base64 encoded. Newline characters are removed.
  The only difference between mail #3 and mail #4 is one Traditional Chinese letter.
    mail #3 : 請以手足關係的精神相對待
                               A
                               |
                               V
    mail #4 : 請以手足關係的精神相請待 
    => "Garbled display due to newline character in HTML source" disappears.
    => Garbled display due to "tag detection/newline detection failure"
       observed in mail #3 doesn't occur.

> Traditional Chinese letter 對
> http://www.fileformat.info/info/unicode/char/5c0d/index.htm
>   UTF-16 (hex)  = 0x5C0D
>   0x5C in ascii = \ (Backslash which is used for escaping)
Attachment #8409270 - Attachment is obsolete: true
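
To make the difference between mail #3 and mail #4 concrete, the sketch below lists the UTF-16BE bytes of the two letters that differ and flags bytes that are special in ASCII. The helper name `suspicious_bytes` is hypothetical. Note that 對 contains both 0x5C (backslash) and 0x0D (CR); this sketch does not establish which of the two triggers the garbling.

```python
# Bytes that have a special meaning when UTF-16 data is treated as ASCII text.
SPECIAL = {0x5C: "backslash", 0x0D: "CR", 0x0A: "LF"}

def suspicious_bytes(ch: str) -> list:
    """Return the bytes of ch in UTF-16BE that are backslash, CR or LF."""
    return [b for b in ch.encode("utf-16-be") if b in SPECIAL]

for ch in ("對", "請"):  # letter in mail #3, letter that replaces it in mail #4
    data = ch.encode("utf-16-be")
    names = [SPECIAL[b] for b in suspicious_bytes(ch)]
    print(ch, f"U+{ord(ch):04X}", data.hex().upper(), names)
```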
With Quoted-Printable and non-ASCII (DBCS) characters, the garbage caused by CR/LF is easy to observe.

Text data : ABCDあいうえおABCD[CRLF]あいうえお
    ABCD    in UTF-16LE : =41=00=42=00=43=00=44=00
                          A=00B=00C=00D=00
    あいうえお in UTF-16LE : =42=30=44=30=46=30=48=30=4A=30
                          (Japanese Hiragana. A, I, U, E, O)
    CR      in UTF-16LE : =0D=00
    LF      in UTF-16LE : =0A=00
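
The =XX byte sequences listed above can be regenerated with a few lines of Python. `qp_all` is a hypothetical helper that writes every byte as =XX; it is not a complete Quoted-Printable encoder (a real encoder would leave printable ASCII such as "A" unescaped, giving the A=00B=00 form).

```python
def qp_all(text: str, encoding: str = "utf-16-le") -> str:
    """Encode text and write every resulting byte in =XX form."""
    return "".join(f"={b:02X}" for b in text.encode(encoding))

for label, text in (("ABCD", "ABCD"), ("kana", "あいうえお"), ("CR", "\r"), ("LF", "\n")):
    print(f"{label:5} {qp_all(text)}")
```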

(0) ABCD => converted to "⁁⁂⁃⁄" (no quote, this is bug 997050)
(1) CRLF is inserted before 2nd あいうえお
    ⁁⁂⁃⁄あいうえお⁁⁂⁃⁄਍ഠ あいうえお
(2) LF is inserted before 2nd あいうえお
    ⁁⁂⁃⁄あいうえお⁁⁂⁃⁄਍䈠䐰䘰䠰䨰
(3) CR is inserted before 2nd あいうえお
    ⁁⁂⁃⁄あいうえお⁁⁂⁃⁄਍䈠䐰䘰䠰䨰
(4) HT is inserted before 2nd あいうえお => No problem
    ⁁⁂⁃⁄あいうえお⁁⁂⁃⁄ あいうえお

- With CR or LF, a byte for the newline is generated, which leaves one excess byte.
  The UTF-16 binary then starts from the excess byte, so the characters are broken.
- With CRLF, the two excess bytes are merged into "ഠ ", so the DBCS letters after it
  are fortunately not affected.
- With base64, bug 997050 doesn't occur, so alphabet characters are not altered
  to letters like ⁁⁂⁃⁄.
  However, alphabet characters are 0x00## in UTF-16, so 0x00 appears in the binary.
  The excess byte then breaks the binary for alphabet characters as well.
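
The excess-byte explanation can be tried out with a small simulation. It rests on two assumptions that are inferred from the glyphs in cases (0) to (4) and are not confirmed against the code: that the effect of bug 997050 on this data is 0x00 becoming 0x20, and that every bare CR or LF byte is expanded to CR LF. The function name `simulate` is hypothetical.

```python
import re

KANA = "あいうえお"

def simulate(separator: str):
    """Return (decoded text, leftover bytes) for ABCD+kana+ABCD+separator+kana."""
    raw = f"ABCD{KANA}ABCD{separator}{KANA}".encode("utf-16-le")
    raw = raw.replace(b"\x00", b"\x20")         # assumed effect of bug 997050
    raw = re.sub(rb"\r\n|\r|\n", b"\r\n", raw)  # assumed byte-level newline expansion
    even = len(raw) - len(raw) % 2              # decode only complete 16-bit units
    return raw[:even].decode("utf-16-le"), raw[even:]

for name, sep in (("CRLF", "\r\n"), ("LF", "\n"), ("CR", "\r"), ("HT", "\t")):
    text, leftover = simulate(sep)
    print(name, " ".join(f"U+{ord(c):04X}" for c in text[13:]), leftover.hex())
```

With a single CR or LF the stream grows by one byte, so the second あいうえお decodes as U+4220 U+4430 and so on, with one byte left over. With CRLF it grows by two bytes, so the kana stay aligned.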
Component: Message Reader UI → MIME
Product: Thunderbird → MailNews Core
Version: 3.1 → 24
OS: Windows 7 → All
Hardware: x86_64 → All
This is not only a problem with UTF-16 CR (U+000D)/LF (U+000A).
A 0x0D or 0x0A as the second byte of a UTF-16 code unit always produces broken letters.

With UTF-16BE, any letter after =??=0A / =??=0D is broken. With quoted-printable, this is easy to observe.
Test data:
 =7C=BE=79=5E=76=F8=??=0#=5F=85=04=34=04=36=04=37=04=38=04=39
   where ?? is 5C, 6C, 7C etc.
         0# is 09, 0A, 0B, 0C, 0D, 0E
   Not all combinations are included.
Broken pattern:
 =??=0D (or =??=0A) itself is displayed normally.
 =0A is inserted after =??=0D (or =??=0A)
  -> =0A=5F =85=04 =34=04 =36=04 =37=04 =38=04 =39
     The last =39 is ignored, or merged with the binary after it.
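
The broken pattern above can be reproduced on the test data with ?? = 5C and 0# = 0D. The sketch inserts one 0x0A byte after the =5C=0D pair, as described, and decodes the result as UTF-16BE. Where the extra byte really comes from is not shown here; the insertion is simply taken from the description above. `decode_even` is a hypothetical helper.

```python
# 精神相對待 followed by джзий, in UTF-16BE (test data with ?? = 5C, 0# = 0D).
original = bytes.fromhex("7CBE 795E 76F8 5C0D 5F85 0434 0436 0437 0438 0439")

def decode_even(data: bytes, encoding: str):
    """Decode the complete 16-bit units and return any trailing odd byte."""
    even = len(data) - len(data) % 2
    return data[:even].decode(encoding), data[even:]

pos = original.index(b"\x5c\x0d") + 2             # just after the =5C=0D pair
broken = original[:pos] + b"\x0a" + original[pos:]

text, leftover = decode_even(broken, "utf-16-be")
print(original.decode("utf-16-be"))
print(" ".join(f"U+{ord(c):04X}" for c in text))
print(leftover.hex().upper())                     # the orphaned last byte
```

The units after 對 come out as U+0A5F U+8504 U+3404 U+3604 U+3704 U+3804, and the final 0x39 is left without a partner.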

This perhaps also applies to =00=0A, =00=0D, and =00=0D=00=0A.
  After =00=0D=00=0A (or =00=0A, =00=0D), 0x0A is inserted,
  and that 0x0A is merged with the binary after =00=0D=00=0A (or =00=0A, =00=0D).

Is this a problem when putting the #text node in <PRE> for text display?
Or a problem when parsing HTML source text that is held in a #text node?
Summary: base64 encoded text UTF-16 text is displayed garbled after newline → base64 encoded text UTF-16 text is displayed garbled after newline (both U+000A/U+000D, and 0x0D/0x0A as second byte of UTF-16)
FYI.
Original character(in UTF-16BE) with newline character. [CRLF] = U+000D U+000A.
> Line#   Glyph             Data represented in UTF-16BE/QP
>  #1  :  AAAA00AA[CRLF]    =00=41 =00=41 =00=41 =00=41 =00=30 =00=30 =00=41 =00=41 =00=0D =00=0A
>  #2  :  AAAA01AA[CRLF]    =00=41 =00=41 =00=41 =00=41 =00=30 =00=31 =00=41 =00=41 =00=0D =00=0A
>  #3  :  AAAA02AA[CRLF]    =00=41 =00=41 =00=41 =00=41 =00=30 =00=32 =00=41 =00=41 =00=0D =00=0A
>  #4  :  AAAA03AA[CRLF]    =00=41 =00=41 =00=41 =00=41 =00=30 =00=33 =00=41 =00=41 =00=0D =00=0A
>  #5  :  AAAA04AA[CRLF]    =00=41 =00=41 =00=41 =00=41 =00=30 =00=34 =00=41 =00=41 =00=0D =00=0A
>  #6  :  AAAA05AA[CRLF]    =00=41 =00=41 =00=41 =00=41 =00=30 =00=35 =00=41 =00=41 =00=0D =00=0A
>  #7  :  AAAA06AA[CRLF]    =00=41 =00=41 =00=41 =00=41 =00=30 =00=36 =00=41 =00=41 =00=0D =00=0A
This data is attached to a mail as text/plain with base64 encoding.
Data shown in the message pane, dumped as UTF-8:
> row #1 :  4141 4141 3030 4141
> row #2 :  E0A880 E48480 E48480 E48480 E48480 E38080 E38480 E48480 E48480 E0B48A 4141 4141 3032 4141
> row #3 :  E0A880 E48480 E48480 E48480 E48480 E38080 E38C80 E48480 E48480 E0B48A 4141 4141 3034 4141
> row #4 :  E0A880 E48480 E48480 E48480 E48480 E38080 E39480 E48480 E48480 E0B48A 4141 4141 3036 4141
Corresponding Unicode characters:
>   E0A880  U+0A00   http://www.fileformat.info/info/unicode/char/0a00/index.htm
>   E48480  U+4100   http://www.fileformat.info/info/unicode/char/4100/index.htm
>   E38080  U+3000   http://www.fileformat.info/info/unicode/char/3000/index.htm
>   E38480  U+3100   http://www.fileformat.info/info/unicode/char/3100/index.htm
>   E38C80  U+3300   http://www.fileformat.info/info/unicode/char/3300/index.htm
>   E39480  U+3500   http://www.fileformat.info/info/unicode/char/3500/index.htm
>   E0B48A  U+0D0A   http://www.fileformat.info/info/unicode/char/d0a/index.htm
Why and how it breaks is similar to the "0x0D/0x0A in second byte of UTF-16 binary" case.
  A 0x0A or 0x0D is generated by the newline, or an orphaned 0x0A or 0x0D is treated as a newline,
  and it is merged with the binary after U+000D/U+000A.
  Only after this merge in the UTF-16 binary
  is conversion to UTF-8 done for text display and HTML parsing.
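
To see the merge at the byte level, the sketch below converts the UTF-8 dump of row #2 back to code points and to UTF-16BE bytes, then compares them with the original bytes of line #2 (AAAA01AA[CRLF]). It only re-derives what the dump already contains; it does not identify which code inserts or drops bytes.

```python
# UTF-8 bytes of row #2 as shown in the message pane (copied from the dump above).
ROW2_UTF8 = (
    "E0A880 E48480 E48480 E48480 E48480 E38080 E38480 E48480 E48480 "
    "E0B48A 4141 4141 3032 4141"
)

shown = bytes.fromhex(ROW2_UTF8).decode("utf-8")
as_utf16be = shown.encode("utf-16-be")

print(" ".join(f"U+{ord(c):04X}" for c in shown))
print(as_utf16be.hex(" ", 2).upper())

# Original bytes of line #2 in UTF-16BE, for comparison.
line2 = "AAAA01AA\r\n".encode("utf-16-be")
print(line2.hex(" ", 2).upper())
```

In this row the first 20 displayed bytes equal one leading 0x0A, then line #2 up to and including =00=0D, then 0x0A: relative to the original, one 0x0A is added in front and the 0x00 of the final =00=0A is missing. The remaining bytes are line #3 (AAAA02AA), aligned again.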
FYI.
Actual letters shown at row #1 to row #4 by Tb 24 on Win-XP.
> AAAA00AA
> ਀䄀䄀䄀䄀 ㄀䄀䄀ഊAAAA02AA
> ਀䄀䄀䄀䄀 ㌀䄀䄀ഊAAAA04AA
> ਀䄀䄀䄀䄀 㔀䄀䄀ഊAAAA06AA
Because bug 997050 doesn't occur with base64, 7-bit ASCII characters (U+00##) are shown normally before the CRLF, and again once the excess 0x0A or 0x0D has been eaten up by the next excess 0x0A or 0x0D.
Bug 244829 was found for the UTF-16/base64 case.
Depends on: 244829
See Also: → 1642917
Severity: normal → S3