Open Bug 604284 Opened 14 years ago Updated 2 years ago

base64 encoded text UTF-16 text is displayed garbled after newline (both U+000A/U+000D, and 0x0D/0x0A as second byte of UTF-16)

Tracking

(Not tracked)

Status:

NEW

People

(Reporter: bmo, Unassigned)

References

(Depends on 1 open bug, Blocks 1 open bug)

Details

Attachments

(4 files, 1 obsolete file)

Example email with corrupted display attachment 14 years ago Brodie 1.42 KB, message/rfc822		Details
mail folder file : base64 encoded UTF-16BE HTML mail 10 years ago WADA:World Anti-bad-Duping Agency 4.04 KB, text/plain		Details
mail folder file (updated): base64 encoded UTF-16BE HTML mail 10 years ago WADA:World Anti-bad-Duping Agency 5.61 KB, text/plain		Details
CR/LF in UTF-16LE, Quoted-Printable mail with non-ascii DBCS character 10 years ago WADA:World Anti-bad-Duping Agency 2.11 KB, text/plain		Details
EML file, QP, UTF16BE, Any letter after =??=0A / =??=0D is broken 10 years ago WADA:World Anti-bad-Duping Agency 2.84 KB, text/plain		Details

Brodie

Reporter

Description

•

14 years ago

Attached file Example email with corrupted display attachment — Details

The following email MIME code is displayed in Thunderbird 3.1.4 with 
corrupted text following the first LF character. This may be a problem
with the base64 decoder?

--=_dbae96014315fdcdc85247a6c4ff9209
Content-Transfer-Encoding: base64
Content-Type: text/plain; charset=utf-16;
 name=example.licence.utf16.txt
Content-Disposition: attachment;
 filename=example.licence.utf16.txt

IABMAGkAYwBlAG4AYwBlADoAIAAgAFMAbwBtAGUAcABsAGEAYwBlAAoASABvAHMAdABuAGEAbQBl
ADoAIABTAG8AbQBlAHcAaABlAHIAZQAKAA==
--=_dbae96014315fdcdc85247a6c4ff9209--

Results are:

Found user stream: F00D0007 (example.licence)
-- example.licence.utf16.txt --
 Licence:  Someplace਍䠀漀猀琀渀愀洀攀㨀 匀漀洀攀眀栀攀爀攀ഀ


Expected results are:

Found user stream: F00D0007 (example.licence)
-- example.licence.utf16.txt --
 Licence:  Someplace
Hostname: Somewhere
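
The garbled result above looks consistent with one extra byte entering the stream at each LF. The following is a minimal Python sketch of that hypothesis, not code from Thunderbird. It assumes the attachment is UTF-16LE without a BOM and that every 0x0A byte is rewritten to 0x0D 0x0A before charset decoding. The name `garble_lf_to_crlf` is hypothetical.

```python
EXPECTED = " Licence:  Someplace\nHostname: Somewhere\n"


def garble_lf_to_crlf(text: str) -> str:
    """Hypothetical model: rewrite each 0x0A byte to 0x0D 0x0A, then decode."""
    raw = text.encode("utf-16-le")  # assumption: little-endian, no BOM
    shifted = raw.replace(b"\x0a", b"\x0d\x0a")
    # An odd byte count would leave a dangling byte; "replace" keeps decoding going.
    return shifted.decode("utf-16-le", errors="replace")


garbled = garble_lf_to_crlf(EXPECTED)
# Code points right after "Someplace": the merged CR/LF unit, then shifted letters.
print([f"U+{ord(c):04X}" for c in garbled[20:24]])
```

Under this model the first 20 characters survive, the first LF turns into a single U+0A0D unit, and every letter of the second line is read with its two bytes swapped until the second LF restores the alignment.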

WADA:World Anti-bad-Duping Agency

Comment 1

•

14 years ago

Content of the file generated by "Save As" in Tb 3.0.4.
File size is 82 bytes. Hex dump of the file:
> 20004C006900630065006E00630065003A002000200053006F006D00650070006C00610063006500
> 0A0048006F00730074006E0061006D0065003A00200053006F006D00650077006800650072006500
> 0A00
The data is UTF-16LE without a BOM, so the base64 decoding itself is fine.
The result is the same for UTF-16LE data with a BOM.
Notepad.exe shows the saved file as expected (0x0A00 is shown as LF, i.e. 0x0A in US-ASCII).
The problem is observed in a trunk build too.
> Mozilla/5.0 (Windows NT 5.1; rv:2.0b7pre) Gecko/20100925 Thunderbird/3.3a1pre

It seems to be an issue only with text files in UTF-16 (LE/BE, and UTF-32?).
No problem if it is an HTML file?
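
The base64 check can be repeated outside Thunderbird. This is a small Python sketch using the payload from the description. The only assumption is the `utf-16-le` codec name, since the part itself declares plain `utf-16` and has no BOM.

```python
import base64

# Payload copied from the MIME part in the description.
B64 = (
    "IABMAGkAYwBlAG4AYwBlADoAIAAgAFMAbwBtAGUAcABsAGEAYwBlAAoASABvAHMAdABuAGEAbQBl"
    "ADoAIABTAG8AbQBlAHcAaABlAHIAZQAKAA=="
)

raw = base64.b64decode(B64)
print(len(raw))                 # byte count of the decoded attachment
print(raw.hex().upper())        # compare with the hex dump of the saved file
text = raw.decode("utf-16-le")  # assumption: little-endian, no BOM
print(repr(text))
```

If the decoded bytes match the saved file and decode cleanly, the corruption has to happen after base64 decoding.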

WADA:World Anti-bad-Duping Agency

Updated

•

10 years ago

Blocks: 997050

WADA:World Anti-bad-Duping Agency

Comment 2

•

10 years ago

Attached file mail folder file : base64 encoded UTF-16BE HTML mail (obsolete) — Details

mail #1 : Glyphs of the test data.
          Contains Cyrillic, French, Japanese, and Traditional Chinese letters.
mail #2 : UTF-16BE, text/plain, base64 encoded.
          Garbled display after a newline is observed.
mail #3 : UTF-16BE, text/html, base64 encoded. Newline characters are removed.
     => "Garbled display due to newline character in HTML source" disappears.
     => Garbled display due to "tag detection/newline detection failure"
        starts in the middle of the string in <p lang="zh-Hant">.
        Is the conversion from UTF-16 to UTF-8 done at the wrong byte boundary?

WADA:World Anti-bad-Duping Agency

Comment 3

•

10 years ago

Attached file mail folder file (updated): base64 encoded UTF-16BE HTML mail — Details

The cause of the garbled display in mail #3 was the backslash byte (0x5C) in a Chinese character.

(mail #1 to mail #3 are unchanged)
mail #1 : Glyphs of the test data.
          Contains Cyrillic, French, Japanese, and Traditional Chinese letters.
mail #2 : UTF-16BE, text/plain, base64 encoded.
          Garbled display after a newline is observed.
mail #3 : UTF-16BE, text/html, base64 encoded. Newline characters are removed.
     => "Garbled display due to newline character in HTML source" disappears.
     => Garbled display due to "tag detection/newline detection failure"
        starts in the middle of the string in <p lang="zh-Hant">.
        Is the conversion from UTF-16 to UTF-8 done at the wrong byte boundary?

(mail #4 is added)
mail #4 : UTF-16BE, text/html, base64 encoded. Newline characters are removed.
  The only difference between mail #3 and mail #4 is one Traditional Chinese letter:
    mail #3 : 請以手足關係的精神相對待
                               A
                               |
                               V
    mail #4 : 請以手足關係的精神相請待 
    => "Garbled display due to newline character in HTML source" disappears.
    => The garbled display due to "tag detection/newline detection failure"
       observed in mail #3 doesn't occur.

> Traditional Chinese letter 對
> http://www.fileformat.info/info/unicode/char/5c0d/index.htm
>   UTF-16 (hex)  = 0x5C0D
>   0x5C in ascii = \ (Backslash which is used for escaping)
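
To find which characters of a test string carry such bytes, the UTF-16BE encoding of each character can be scanned. This is a Python sketch. The helper name `sensitive_chars` and the choice of watched bytes (LF, CR, backslash) are assumptions made for illustration.

```python
# Byte values that seem to be treated specially somewhere in the display path.
SENSITIVE = {0x0A: "LF", 0x0D: "CR", 0x5C: "backslash"}


def sensitive_chars(text: str) -> list[tuple[str, str, list[str]]]:
    """Return (char, UTF-16BE hex, matched byte names) for each suspicious char."""
    hits = []
    for ch in text:
        encoded = ch.encode("utf-16-be")
        names = [SENSITIVE[b] for b in encoded if b in SENSITIVE]
        if names:
            hits.append((ch, encoded.hex().upper(), names))
    return hits


mail3 = "請以手足關係的精神相對待"
mail4 = "請以手足關係的精神相請待"
print(sensitive_chars(mail3))
print(sensitive_chars(mail4))
```

Note that 對 (0x5C0D) matches twice: its first byte is the backslash value and its second byte is the CR value, so this test alone cannot tell which of the two triggers the breakage.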

Attachment #8409270 - Attachment is obsolete: true

WADA:World Anti-bad-Duping Agency

Comment 4

•

10 years ago

Attached file CR/LF in UTF-16LE, Quoted-Printable mail with non-ascii DBCS character — Details

With Quoted-Printable and non-ASCII (DBCS) characters, the garbage caused by CR/LF is easy to observe.

Text data : ABCDあいうえおABCD[CRLF]あいうえお
    ABCD    in UTF-16LE : =41=00=42=00=43=00=44=00
                          A=00B=00C=00D=00
    あいうえお in UTF-16LE : =42=30=44=30=46=30=48=30=4A=30
                          (Japanese Hiragana: A, I, U, E, O)
    CR      in UTF-16LE : =0D=00
    LF      in UTF-16LE : =0A=00

(0) ABCD => converted to "⁁⁂⁃⁄" (without the quotes; this is bug 997050)
(1) CRLF is inserted before 2nd あいうえお
　　　　⁁⁂⁃⁄あいうえお⁁⁂⁃⁄਍ഠ あいうえお
(2) LF is inserted before 2nd あいうえお
    ⁁⁂⁃⁄あいうえお⁁⁂⁃⁄਍䈠䐰䘰䠰䨰
(3) CR is inserted before 2nd あいうえお
    ⁁⁂⁃⁄あいうえお⁁⁂⁃⁄਍䈠䐰䘰䠰䨰
(4) HT is inserted before 2nd あいうえお => No problem
    ⁁⁂⁃⁄あいうえお⁁⁂⁃⁄ あいうえお

- With CR or LF alone, the bytes for a newline are generated, plus one excess byte.
  The following UTF-16 binary then starts from the excess byte, so the characters are broken.
- With CRLF, the two excess bytes are merged into "ഠ ", so the DBCS letters after it
  are fortunately not affected.
- With base64, bug 997050 doesn't occur, so alphabet characters are not altered
  to letters like ⁁⁂⁃⁄.
  However, alphabet characters are 0x00## in UTF-16, so 0x00 appears in the binary.
  The excess byte then breaks the binary of the alphabet characters as well.
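
The cases above can be approximated with a byte-level model. The Python sketch below is a guess that appears to fit the observed glyphs, not Thunderbird's actual logic. It assumes every bare 0x0D or 0x0A byte becomes 0x0D 0x0A and every 0x00 byte becomes 0x20 (the effect attributed to bug 997050). The function names are hypothetical and the test string is a shortened version of the one above.

```python
def qp_escape_all(data: bytes) -> str:
    """Write every byte as =XX, the form used in the tables above."""
    return "".join(f"={b:02X}" for b in data)


def hypothetical_mangle(data: bytes) -> bytes:
    """Assumed model: bare CR or LF byte -> CR LF, NUL byte -> space."""
    out = bytearray()
    for b in data:
        if b in (0x0A, 0x0D):
            out += b"\r\n"
        elif b == 0x00:
            out.append(0x20)  # NUL -> space, as bug 997050 appears to do
        else:
            out.append(b)
    return bytes(out)


def shown_code_points(text: str) -> list[str]:
    mangled = hypothetical_mangle(text.encode("utf-16-le"))
    even = mangled[: len(mangled) - len(mangled) % 2]  # drop a dangling byte
    return [f"U+{ord(c):04X}" for c in even.decode("utf-16-le")]


print(qp_escape_all("ABCD".encode("utf-16-le")))
print(shown_code_points("ABCD\nあいうえお"))    # lone LF: hiragana shifted by one byte
print(shown_code_points("ABCD\r\nあいうえお"))  # CRLF: hiragana stays aligned
```

In this model a lone CR and a lone LF give the same result, and CRLF adds an even number of bytes, which would explain why the letters after CRLF survive.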

WADA:World Anti-bad-Duping Agency

Updated

•

10 years ago

Component: Message Reader UI → MIME

Product: Thunderbird → MailNews Core

Version: 3.1 → 24

WADA:World Anti-bad-Duping Agency

Updated

•

10 years ago

OS: Windows 7 → All

Hardware: x86_64 → All

WADA:World Anti-bad-Duping Agency

Comment 5

•

10 years ago

Attached file EML file, QP, UTF16BE, Any letter after =??=0A / =??=0D is broken — Details

This is not only a UTF-16 CR (U+000D) / LF (U+000A) problem.
"0x0D/0x0A as the second byte of a UTF-16 code unit" always produces broken letters.

With UTF-16BE, any letter after =??=0A / =??=0D is broken. With quoted-printable, it's easy to observe.
Test data:
 =7C=BE=79=5E=76=F8=??=0#=5F=85=04=34=04=36=04=37=04=38=04=39
   where ?? is 5C, 6C, 7C etc.
         0# is 09, 0A, 0B, 0C, 0D, 0E
   Not all combinations are included.
Broken pattern:
 =??=0D (or =??=0A) itself is displayed normally.
 =0A is inserted after =??=0D (or =??=0A)
  -> =0A=5F =85=04 =34=04 =36=04 =37=04 =38=04 =39
     The last =39 is ignored, or merged with the binary after the last =39.

This perhaps also applies to =00=0A, =00=0D, =00=0D=00=0A.
  After =00=0D=00=0A (or =00=0A, =00=0D), 0x0A is inserted,
  and the 0x0A is merged with the binary after =00=0D=00=0A (or =00=0A, =00=0D).

Is the problem in putting the #text node in <PRE> for text display?
Or in parsing the HTML source text held in the #text node?

WADA:World Anti-bad-Duping Agency

Updated

•

10 years ago

Summary: base64 encoded text UTF-16 text is displayed garbled after newline → base64 encoded text UTF-16 text is displayed garbled after newline (both U+000A/U+000D, and 0x0D/0x0A as second byte of UTF-16)

WADA:World Anti-bad-Duping Agency

Comment 6

•

10 years ago

FYI.
Original characters (in UTF-16BE) with newline characters. [CRLF] = U+000D U+000A.
> Line#   Glyph             Data represented in UTF-16BE/QP
>  #1  :  AAAA00AA[CRLF]    =00=41 =00=41 =00=41 =00=41 =00=30 =00=30 =00=41 =00=41 =00=0D =00=0A
>  #2  :  AAAA01AA[CRLF]    =00=41 =00=41 =00=41 =00=41 =00=30 =00=31 =00=41 =00=41 =00=0D =00=0A
>  #3  :  AAAA02AA[CRLF]    =00=41 =00=41 =00=41 =00=41 =00=30 =00=32 =00=41 =00=41 =00=0D =00=0A
>  #4  :  AAAA03AA[CRLF]    =00=41 =00=41 =00=41 =00=41 =00=30 =00=33 =00=41 =00=41 =00=0D =00=0A
>  #5  :  AAAA04AA[CRLF]    =00=41 =00=41 =00=41 =00=41 =00=30 =00=34 =00=41 =00=41 =00=0D =00=0A
>  #6  :  AAAA05AA[CRLF]    =00=41 =00=41 =00=41 =00=41 =00=30 =00=35 =00=41 =00=41 =00=0D =00=0A
>  #7  :  AAAA06AA[CRLF]    =00=41 =00=41 =00=41 =00=41 =00=30 =00=36 =00=41 =00=41 =00=0D =00=0A
This data is attached to a mail as text/plain with base64 encoding.
Data shown in the message pane, in UTF-8:
> row #1 :  4141 4141 3030 4141
> row #2 :  E0A880 E48480 E48480 E48480 E48480 E38080 E38480 E48480 E48480 E0B48A 4141 4141 3032 4141
> row #3 :  E0A880 E48480 E48480 E48480 E48480 E38080 E38C80 E48480 E48480 E0B48A 4141 4141 3034 4141
> row #4 :  E0A880 E48480 E48480 E48480 E48480 E38080 E39480 E48480 E48480 E0B48A 4141 4141 3036 4141
Corresponding Unicode characters:
>   E0A880  U+0A00   http://www.fileformat.info/info/unicode/char/0a00/index.htm
>   E48480  U+4100   http://www.fileformat.info/info/unicode/char/4100/index.htm
>   E38080  U+3000   http://www.fileformat.info/info/unicode/char/3000/index.htm
>   E38480  U+3100   http://www.fileformat.info/info/unicode/char/3100/index.htm
>   E38C80  U+3300   http://www.fileformat.info/info/unicode/char/3300/index.htm
>   E39480  U+3500   http://www.fileformat.info/info/unicode/char/3500/index.htm
>   E0B48A  U+0D0A   http://www.fileformat.info/info/unicode/char/d0a/index.htm
Why and how it breaks is similar to the "0x0D/0x0A as second byte of a UTF-16 code unit" case.
  Either a 0x0A or 0x0D is generated by the newline, or an orphaned 0x0A or 0x0D is treated as a newline,
  and it is merged with the binary after U+000D/U+000A.
  The conversion to UTF-8 for text display and HTML parsing is done
  after this merge in the UTF-16 binary.
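
The one-byte shift can be checked directly from the UTF-8 dump of row #2. This Python sketch converts the displayed characters back to UTF-16BE bytes and drops the leading byte. The variable names are only for illustration.

```python
# Row #2 as shown in the message pane (UTF-8 hex, copied from the table above).
ROW2_UTF8_HEX = "E0A880 E48480 E48480 E48480 E48480 E38080 E38480 E48480 E48480 E0B48A"

shown = bytes.fromhex(ROW2_UTF8_HEX.replace(" ", "")).decode("utf-8")
code_points = [f"U+{ord(c):04X}" for c in shown]

as_utf16be = shown.encode("utf-16-be")          # bytes the converter must have seen
realigned = as_utf16be[1:17].decode("utf-16-be")  # skip the leading 0x0A byte
leftover = as_utf16be[17:].hex().upper()

print(code_points)
print(realigned, leftover)
```

Skipping the first byte gives back the original line #2 text, followed by the bytes 00 0D 0A, so the displayed row is the original data shifted by exactly one byte.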

WADA:World Anti-bad-Duping Agency

Comment 7

•

10 years ago

FYI.
Actual characters shown in rows #1 to #4 by Tb 24 on Win-XP:
> AAAA00AA
> ਀䄀䄀䄀䄀　㄀䄀䄀ഊAAAA02AA
> ਀䄀䄀䄀䄀　㌀䄀䄀ഊAAAA04AA
> ਀䄀䄀䄀䄀　㔀䄀䄀ഊAAAA06AA
Because bug 997050 doesn't occur with base64, 7-bit ASCII characters (U+00##) are shown normally before the first CRLF, and again after the excess 0x0A or 0x0D has been absorbed by the next excess 0x0A or 0x0D.

WADA:World Anti-bad-Duping Agency

Comment 8

•

10 years ago

Found bug 244829, which covers the UTF-16/base64 case.

Depends on: 244829

Alfred Peters [:infofrommozilla]

Updated

•

4 years ago

Updated

•

2 years ago

Severity: normal → S3


Categories

(MailNews Core :: MIME, defect)
