Closed Bug 1508136 Opened 6 years ago Closed 6 years ago

ISO-2022-JP text is displayed with a replacement character <?> if it contains a zero-length ASCII run due to concatenation

Categories

(Core :: Internationalization, defect)

60 Branch
defect
Not set
normal

RESOLVED INVALID

People

(Reporter: kzmizzz, Unassigned)

Attachments

(2 files)

User Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:63.0) Gecko/20100101 Firefox/63.0

Steps to reproduce:

Open "sample.eml" with Thunderbird. (60.3.1 Windows)

"sample.eml" contains ISO-2022-JP text like:

   <ESC>$Bこんに<ESC>(B<ESC>$Bちは<ESC>(B

This text can also be viewed as the concatenation of these two ISO-2022-JP texts:

   <ESC>$Bこんに<ESC>(B
   <ESC>$Bちは<ESC>(B

"From", "To", and "Subject" headers in "sample.eml" also have such text.
These are described in "details.txt."
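
For illustration, the concatenation described above can be reproduced with a minimal Python sketch. It assumes CPython's built-in "iso2022_jp" codec, which is not the encoder or decoder Thunderbird uses; the variable names are made up for this example.

```python
# Encode the two halves separately, the way a system that joins
# independently encoded ISO-2022-JP strings would.
first = "こんに".encode("iso2022_jp")   # ESC $ B ... ESC ( B
second = "ちは".encode("iso2022_jp")    # ESC $ B ... ESC ( B

# Joining the encoder outputs puts ESC ( B directly before ESC $ B,
# i.e. an ASCII run of length zero.
joined = first + second
print(joined)

# CPython's decoder accepts the empty run without inserting U+FFFD.
print(joined.decode("iso2022_jp"))
```

The joined bytes have the same shape as the text in "sample.eml": each half is valid on its own, and only the join produces the back-to-back escape sequences.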



Actual results:

See screenshots.
Thunderbird 60.3.1 -> TB60.3.1(wrong).png
Thunderbird 54.0.b3 -> TB54.0.b3(good).png

In 60.3.1, U+FFFD (REPLACEMENT CHARACTER) is displayed in the text.
54.0.b3 seems to display it correctly.



Expected results:

U+FFFD should not appear in the text.

The root cause appears to be in the WHATWG Encoding Standard. https://encoding.spec.whatwg.org/
The reference implementation of Encoding Standard inserts U+FFFD for the zero-length content between escape sequences.
Encoding Standard says:
> The ISO-2022-JP encoder is the only encoder for which the concatenation of multiple outputs can result in an error when run through the corresponding decoder.
(https://encoding.spec.whatwg.org/#iso-2022-jp-encoder)
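
The effect of that rule can be approximated with the following Python sketch. It is not the state machine from the Encoding Standard and not the encoding_rs code: it is a simplified stand-in that splits the input at escape sequences and emits U+FFFD whenever two escape sequences are adjacent. The function name and the regular expression are assumptions for this example, only four escape sequences are recognized, and the content of each run is decoded with CPython's "iso2022_jp" codec.

```python
import re

# Escape sequences handled by this sketch: ASCII, JIS X 0201 Roman,
# JIS X 0208-1978 and JIS X 0208-1983.
ESC_SEQ = re.compile(rb"\x1b(?:\(B|\(J|\$@|\$B)")

def decode_flagging_empty_runs(data: bytes) -> str:
    """Decode ISO-2022-JP, emitting U+FFFD for zero-length runs."""
    out = []
    pos = 0
    current_escape = b""   # the initial state is ASCII
    seen_escape = False
    for m in ESC_SEQ.finditer(data):
        run = data[pos:m.start()]
        if run:
            # Decode the run in the character set selected before it.
            out.append((current_escape + run).decode("iso2022_jp"))
        elif seen_escape:
            # No content between two escape sequences.
            out.append("\ufffd")
        current_escape = m.group()
        seen_escape = True
        pos = m.end()
    tail = data[pos:]
    if tail:
        out.append((current_escape + tail).decode("iso2022_jp"))
    return "".join(out)

sample = b"\x1b$B$3$s$K\x1b(B\x1b$B$A$O\x1b(B"
print(decode_flagging_empty_runs(sample))
```

With the sample bytes from this report, the sketch yields the two halves of the greeting separated by one U+FFFD, which matches what Thunderbird 60 displays.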

This behavior is also what is specified for TextDecoder and its underlying libraries.
See bug 1506049.

However, Thunderbird should display ISO-2022-JP text correctly.
Inserting U+FFFD will reduce the potential risk of XSS, but Thunderbird should not rely on it.
Thanks for the detailed report. I'll copy some of it into this comment for easier accessibility. Let's focus on the body problem:

Message content: <ESC>$B$3$s$K<ESC>(B<ESC>$B$A$O<ESC>(B

Message display: こんに<?>ちは

ISO-2022-JP text in the message body (hex):
 1b 24 42  : <ESC $ B> select JIS X 0208-1983 to be used
 24 33     : character "こ" in JIS X 0208-1983
 24 73     : character "ん" in JIS X 0208-1983
 24 4b     : character "に" in JIS X 0208-1983
 1b 28 42  : <ESC ( B> select ASCII to be used
 1b 24 42  : <ESC $ B> select JIS X 0208-1983 to be used
 24 41     : character "ち" in JIS X 0208-1983
 24 4f     : character "は" in JIS X 0208-1983
 1b 28 42  : <ESC ( B> select ASCII to be used
 0d  0a    : <CR LF>

The replacement character appears where <ESC>(B appears in the message body.
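
As a quick cross-check of the hex dump above, the bytes can be rebuilt in Python and searched for the point where the shift to ASCII is immediately followed by the shift back to JIS X 0208. This is only a sketch; the variable names are invented here.

```python
# Bytes from the hex dump, including the trailing CR LF.
data = bytes.fromhex(
    "1b2442 2433 2473 244b 1b2842 1b2442 2441 244f 1b2842 0d0a"
)

# ESC ( B directly followed by ESC $ B is the zero-length ASCII run.
empty_run = b"\x1b(B" + b"\x1b$B"
offset = data.find(empty_run)
print(offset, data[offset:offset + len(empty_run)].hex(" "))
```

The reported offset is where the first <ESC>(B starts, which is the position at which the replacement character shows up in the rendered text.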

So much for the facts presented. Now our reply:

Yes, handling of all encoding changed in bug 1363281 in Thunderbird 56 beta. So yes, 54 and 60 may behave differently.

We discussed "zero-length ASCII runs" at length in bug 1374149, see bug 1374149 comment #3 and below. As per bug 1374149 comment #5 these zero-length runs are invalid. We only tolerate them at the end of an RFC 2047 token, but not in the middle of a string. Sorry. Where do these invalid messages come from?

Henri, anything to add here?
Status: UNCONFIRMED → RESOLVED
Closed: 6 years ago
Flags: needinfo?(hsivonen)
Resolution: --- → INVALID
Summary: "concatenated" ISO-2022-JP text is displayed incorrectly → ISO-2022-JP text is displayed with a replacement character <?> if it contains a zero-length ASCII run due to concatenation
Component: Folder and Message Lists → Internationalization
Product: Thunderbird → Core
Version: 60 → 60 Branch
I found this issue in some emails from a mailing-list system.
That mailing-list system concatenates ISO-2022-JP text when it changes the Subject header or modifies content.
This is not common, but it generates valid ISO-2022-JP text.
(In reply to Jorg K (GMT+1) from comment #1)
> Henri, anything to add here?

I still don't understand the benefit of the U+FFFD generation as an XSS defense, considering that there are other cases left undefended: https://github.com/whatwg/encoding/issues/115#issuecomment-312645847

OTOH, generating a U+FFFD when there is no content between ISO-2022-JP shift sequences is mentioned in the Unicode Security Considerations:
https://www.unicode.org/reports/tr36/#Some_Output_For_All_Input

If you are interested in getting this changed, the next steps would be: 1) Finding out what IE, Edge, Chrome and Safari do and 2) finding out why the Unicode Security Considerations say what they say about this.
Attached file iso-2022-jp.html
(In reply to Henri Sivonen (:hsivonen) from comment #3)
> 1) Finding out what IE, Edge, Chrome and Safari do and 2) finding out why the
> Unicode Security Considerations say what they say about this.
For number 1) you can use the attached page.

Thunderbird displays the body as Firefox would display a web page.

While I'm here, I tried IE and Edge; neither shows the <?>. I don't have Chrome yet on this fairly new machine, but bug 1374149 comment #3 says: Chrome matches encoding_rs, that is TB/FF.
Flags: needinfo?(hsivonen)
Edge and IE don't generate a REPLACEMENT CHARACTER. Firefox, Chrome and Safari do. (With the caveat that my Mac is stuck on El Capitan, so I couldn't test the latest Safari.)
I posted to the Unicode mailing list about this:
https://www.unicode.org/mail-arch/unicode-ml/y2018-m11/0106.html
Thanks Henri, that's a very long post. From a layman's point of view, showing the <?> in the otherwise readable text doesn't appear very "useful", and uconv and Microsoft browsers don't insert it.

I have trouble understanding why this is done, maybe it's related to this:
  "Security software written to the formal specification may not detect malicious text
   (for example, "delete" with a shift-to-double-byte then an immediate shift-to-ASCII
   in the middle)."

How would one create or hide "malicious text" using those "no-op escape sequences"?
(In reply to Jorg K (GMT+1) from comment #7)
> I have trouble understanding why this is done, maybe it's related to this:
>   "Security software written to the formal specification may not detect
> malicious text
>    (for example, "delete" with a shift-to-double-byte then an immediate
> shift-to-ASCII
>    in the middle)."
> 
> How would one create or hide "malicious text" using those "no-op escape
> sequences"?

If the ASCII string "delete" has ISO-2022-JP shift sequences added between the characters, "security software" scanning the content on the byte level does not see the sequence of bytes as containing "delete" but after decoding, the text says "delete" unless REPLACEMENT CHARACTERS are injected.
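
A small Python sketch of that scenario follows. It assumes a lenient decoder, here CPython's "iso2022_jp" codec, standing in for any decoder that does not inject REPLACEMENT CHARACTERs; the payload and variable names are invented for the example.

```python
# "delete" with a shift to JIS X 0208 and an immediate shift back to
# ASCII inserted in the middle.
payload = b"del" + b"\x1b$B" + b"\x1b(B" + b"ete"

# A scanner that works on raw bytes does not find the keyword ...
found_in_bytes = b"delete" in payload

# ... but a lenient decoder drops the no-op escapes and produces it.
decoded = payload.decode("iso2022_jp")
found_in_text = "delete" in decoded

print(found_in_bytes, found_in_text, decoded)
```

A decoder that emits U+FFFD for the empty run would instead produce text in which the keyword is visibly broken up, so the byte-level and decoded views agree that "delete" is absent.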
I got it now, thanks. Surely the security software needs to be a bit context aware, no? HTML del<span></span>ete is also not detected as "delete".
Duplicate of this bug: 1864978