Closed
Bug 1508136
Opened 6 years ago
Closed 6 years ago
ISO-2022-JP text is displayed with a replacement character <?> if it contains a zero-length ASCII run due to concatenation
Categories
(Core :: Internationalization, defect)
RESOLVED
INVALID
People
(Reporter: kzmizzz, Unassigned)
Attachments
(2 files)
User Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:63.0) Gecko/20100101 Firefox/63.0

Steps to reproduce:

Open "sample.eml" with Thunderbird (60.3.1, Windows).

"sample.eml" contains ISO-2022-JP text like:

<ESC>$Bこんに<ESC>(B<ESC>$Bちは<ESC>(B

This text can also be regarded as the concatenation of these two ISO-2022-JP texts:

<ESC>$Bこんに<ESC>(B
<ESC>$Bちは<ESC>(B

The "From", "To", and "Subject" headers in "sample.eml" also contain such text. These are described in "details.txt".

Actual results:

See screenshots.
Thunderbird 60.3.1 -> TB60.3.1(wrong).png
Thunderbird 54.0.b3 -> TB54.0.b3(good).png

In 60.3.1, U+FFFD (REPLACEMENT CHARACTER) is displayed in the text. 54.0.b3 seems to display the text correctly.

Expected results:

U+FFFD should not appear in the text.

The root cause appears to be in the WHATWG Encoding Standard.
https://encoding.spec.whatwg.org/

The reference implementation of the Encoding Standard inserts U+FFFD for the zero-length content between escape sequences.

The Encoding Standard says:
> The ISO-2022-JP encoder is the only encoder for which the concatenation of multiple outputs can result in an error when run through the corresponding decoder.
(https://encoding.spec.whatwg.org/#iso-2022-jp-encoder)

This behavior is also specified for TextDecoder and its underlying libraries. See bug 1506049.

However, Thunderbird should display ISO-2022-JP text correctly. Inserting U+FFFD will reduce the potential risk of XSS, but Thunderbird should not rely on it.
Comment 1•6 years ago
Thanks for the detailed report. I'll copy some of it into this comment for easier accessibility. Let's focus on the body problem:

Message content: <ESC>$B$3$s$K<ESC>(B<ESC>$B$A$O<ESC>(B
Message display: こんに<?>ちは

ISO-2022-JP text in the message body (hex):
1b 24 42 : <ESC $ B> select JIS X 0208-1983 to be used
24 33    : character "こ" in JIS X 0208-1983
24 73    : character "ん" in JIS X 0208-1983
24 4b    : character "に" in JIS X 0208-1983
1b 28 42 : <ESC ( B> select ASCII to be used
1b 24 42 : <ESC $ B> select JIS X 0208-1983 to be used
24 41    : character "ち" in JIS X 0208-1983
24 4f    : character "は" in JIS X 0208-1983
1b 28 42 : <ESC ( B> select ASCII to be used
0d 0a    : <CR LF>

The replacement character appears where <ESC>(B appears in the message body.

So much for the facts presented. Now our reply:

Yes, handling of all encodings changed in bug 1363281 in Thunderbird 56 beta. So yes, 54 and 60 may behave differently.

We discussed "zero-length ASCII runs" at length in bug 1374149, see bug 1374149 comment #3 and below. As per bug 1374149 comment #5, these zero-length runs are invalid. We only tolerate them at the end of an RFC 2047 token, but not in the middle of a string. Sorry.

Where do these invalid messages come from?

Henri, anything to add here?
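For readers who want to check a message for this pattern, here is a minimal Python sketch that scans raw bytes for an escape sequence immediately following another one, i.e. a zero-length run. It is a simplified byte-level check, not the Encoding Standard's decoder state machine. The name `find_empty_runs` is hypothetical, and the set of escape sequences is assumed to be the five that the WHATWG ISO-2022-JP decoder recognises.

```python
import re

# Bytes taken from the hex dump in this comment.
BODY = bytes.fromhex(
    "1b2442" "2433" "2473" "244b" "1b2842"
    "1b2442" "2441" "244f" "1b2842" "0d0a"
)

# Assumed set of escape sequences: ESC $ @, ESC $ B, ESC ( B, ESC ( J, ESC ( I.
ESCAPE = re.compile(rb"\x1b(?:\$[@B]|\([BJI])")


def find_empty_runs(data: bytes) -> list[int]:
    """Return the offsets of escape sequences that directly follow another one."""
    offsets = []
    previous_end = None
    for match in ESCAPE.finditer(data):
        if match.start() == previous_end:
            # Nothing was encoded between the previous escape and this one.
            offsets.append(match.start())
        previous_end = match.end()
    return offsets


print(find_empty_runs(BODY))
```

For the body above, the reported offset is that of the second <ESC>$B, which directly follows <ESC>(B.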
Status: UNCONFIRMED → RESOLVED
Closed: 6 years ago
Flags: needinfo?(hsivonen)
Resolution: --- → INVALID
Summary: "concatenated" ISO-2022-JP text is displayed incorrectly → ISO-2022-JP text is displayed with a replacement character <?> if it contains a zero-length ASCII run due to concatenation
Updated•6 years ago
Component: Folder and Message Lists → Internationalization
Product: Thunderbird → Core
Version: 60 → 60 Branch
Reporter
Comment 2•6 years ago
I found this issue in some emails from a mailing-list system. That system concatenates ISO-2022-JP text when it changes the Subject header or modifies the content. This is not common, but it generates valid ISO-2022-JP text.
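The kind of concatenation described here can be illustrated with a short Python sketch. It assumes CPython's built-in `iso2022_jp` codec, which returns to ASCII at the end of each encoded string. How the mailing-list system actually builds its output is not known, so this only models the effect.

```python
# Encode the two halves independently, as a system that edits text
# piece by piece might do.
first = "こんに".encode("iso2022_jp")
second = "ちは".encode("iso2022_jp")

# Byte-level concatenation: the first piece ends with ESC ( B and the
# second starts with ESC $ B, so a zero-length ASCII run sits in between.
concatenated = first + second

# Alternative: join the decoded text and encode once. No shift back to
# ASCII is emitted in the middle of the string.
merged = (first.decode("iso2022_jp") + second.decode("iso2022_jp")).encode("iso2022_jp")

EMPTY_RUN = b"\x1b(B\x1b$B"
print(EMPTY_RUN in concatenated, EMPTY_RUN in merged)
```

Joining in Unicode and encoding once avoids the empty run, which is one way a sender could sidestep the problem regardless of how receivers treat it.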
Comment 3•6 years ago
(In reply to Jorg K (GMT+1) from comment #1)
> Henri, anything to add here?

I still don't understand the benefit of the U+FFFD generation as an XSS defense, considering that there are other cases left undefended:
https://github.com/whatwg/encoding/issues/115#issuecomment-312645847

OTOH, generating a U+FFFD when there is no content between ISO-2022-JP shift sequences is mentioned in the Unicode Security Considerations:
https://www.unicode.org/reports/tr36/#Some_Output_For_All_Input

If you are interested in getting this changed, the next steps would be:
1) Finding out what IE, Edge, Chrome and Safari do and
2) finding out why the Unicode Security Considerations say what they say about this.
Comment 4•6 years ago
(In reply to Henri Sivonen (:hsivonen) from comment #3)
> 1) Finding out what IE, Edge, Chrome and Safari do and 2) finding out why the
> Unicode Security Considerations say what they say about this.

For number 1) you can use the attached page. Thunderbird displays the body as Firefox would display a web page.

While I'm here, I tried IE and Edge; neither shows the <?>. I don't have Chrome (yet) on this fairly new machine, but bug 1374149 comment #3 says that Chrome matches encoding_rs, that is, TB/FF.
Flags: needinfo?(hsivonen)
Comment 5•6 years ago
Edge and IE don't generate a REPLACEMENT CHARACTER. Firefox, Chrome and Safari do. (With the caveat that my Mac is stuck on El Capitan, so I couldn't test the latest Safari.)
Comment 6•6 years ago
I posted to the Unicode mailing list about this: https://www.unicode.org/mail-arch/unicode-ml/y2018-m11/0106.html
Comment 7•6 years ago
Thanks Henri, that's a very long post.

From a layman's point of view, showing the <?> in otherwise readable text doesn't appear very "useful", and uconv and the Microsoft browsers don't insert it.

I have trouble understanding why this is done. Maybe it's related to this:
"Security software written to the formal specification may not detect malicious text (for example, "delete" with a shift-to-double-byte then an immediate shift-to-ASCII in the middle)."

How would one create or hide "malicious text" using those "no-op escape sequences"?
Comment 8•6 years ago
(In reply to Jorg K (GMT+1) from comment #7)
> I have trouble understanding why this is done, maybe it's related to this:
> "Security software written to the formal specification may not detect
> malicious text
> (for example, "delete" with a shift-to-double-byte then an immediate
> shift-to-ASCII
> in the middle)."
>
> How would one create or hide "malicious text" using those "no-op escape
> sequences"?

If the ASCII string "delete" has ISO-2022-JP shift sequences added between the characters, "security software" scanning the content on the byte level does not see the sequence of bytes as containing "delete", but after decoding, the text says "delete" unless REPLACEMENT CHARACTERs are injected.
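The scenario can be sketched in Python under some loud assumptions: `obfuscate` and `decode_ascii_only` are hypothetical helpers, and the toy decoder handles only ASCII text interleaved with the two escape sequences <ESC>$B and <ESC>(B. It is not a real ISO-2022-JP decoder. Its `strict` mode models the behavior discussed in this bug (U+FFFD when an escape directly follows another escape), while the lenient mode silently drops the escapes.

```python
ESC_TO_JIS = b"\x1b$B"    # shift to JIS X 0208
ESC_TO_ASCII = b"\x1b(B"  # shift back to ASCII
NOOP = ESC_TO_JIS + ESC_TO_ASCII


def obfuscate(word: bytes) -> bytes:
    """Insert a no-op shift pair between every two ASCII bytes."""
    return NOOP.join(bytes([b]) for b in word)


def decode_ascii_only(data: bytes, strict: bool) -> str:
    """Toy decoder: ASCII bytes plus the two escape sequences above.

    With strict=True, an escape that directly follows another escape
    yields U+FFFD; with strict=False, escapes are simply skipped.
    """
    out = []
    i = 0
    after_escape = False
    while i < len(data):
        if data[i:i + 3] in (ESC_TO_JIS, ESC_TO_ASCII):
            if strict and after_escape:
                out.append("\ufffd")
            after_escape = True
            i += 3
        else:
            out.append(chr(data[i]))
            after_escape = False
            i += 1
    return "".join(out)


payload = obfuscate(b"delete")
print(b"delete" in payload)                               # byte-level scan
print(decode_ascii_only(payload, strict=False))           # lenient decode
print("delete" in decode_ascii_only(payload, strict=True))  # strict decode
```

In this model, the byte-level scan misses the word, the lenient decode reconstructs it, and the strict decode breaks it up with replacement characters.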
Comment 9•6 years ago
I got it now, thanks. Surely the security software needs to be a bit context aware, no? HTML del<span></span>ete is also not detected as "delete".