Closed Bug 1374149 Opened 3 years ago Closed 3 years ago

Folded subject header using ISO-2022-JP doesn't get decoded properly

Categories

(MailNews Core :: Internationalization, defect)

defect
Not set

Tracking

(thunderbird_esr5255+ fixed, thunderbird55 fixed, thunderbird56 fixed)

RESOLVED FIXED
Thunderbird 56.0
Tracking Status
thunderbird_esr52 55+ fixed
thunderbird55 --- fixed
thunderbird56 --- fixed

People

(Reporter: t.matsuu, Assigned: jorgk-bmo)

References

Details

Attachments

(3 files, 3 obsolete files)

Tabs between MIME encoding lines are improperly treated.

Maybe a regression of bug 1261841.
Which Product/Component is better for this bug, "MailNews Core/Internationalization" or "Core/Internationalization"?

User Agent: Mozilla/5.0 (Windows NT 6.1; rv:56.0) Gecko/20100101 Thunderbird/56.0a1
Build ID: 20170618030210


Good:
Subject: =?UTF-8?B?44GC44GC44GC44GC44GC44GC44GC44GC44GC44GC44GC44GC44GC44GC?=
 =?UTF-8?B?44GC44GC44GC44GC44GC44GC44GC44GC44GC44GC44GC44GC44GC44GC44GC44GC?=
 =?UTF-8?B?44GC44GC44GC44GC44GC44GC44GC44GC44GC44GC44GC44GC44GC44GC44GC44GC?=
 =?UTF-8?B?44GC44GC44GC44GC44GC44GC44GC44GC44GC44GC44GC44GC44GC44GC44GC44GC?=
 =?UTF-8?B?44GC44GC44GC44GC44GC44GC44GC44GC44GC44GC44GC44GC44GC44GC44GC44GC?=
 =?UTF-8?B?44GC44GC44GC44GC44GC44GC44GC44GC44GC44GC44GC44GC44GC44GC?=

-> ああああああああああああああああああああああああああああああああああああああああああああああああああああああああああああああああああああああああああああああああああああああああああああ


NG:
Subject: =?ISO-2022-JP?B?GyRCJCIkIiQiJCIkIiQiJCIkIiQiJCIkIhsoQg==?=
	=?ISO-2022-JP?B?GyRCJCIkIiQiJCIkIiQiJCIkIiQiJCIkIiQiJCIbKEI=?=
	=?ISO-2022-JP?B?GyRCJCIkIiQiJCIkIiQiJCIkIiQiJCIkIiQiJCIbKEI=?=
	=?ISO-2022-JP?B?GyRCJCIkIiQiJCIkIiQiJCIkIiQiJCIkIiQiJCIbKEI=?=
	=?ISO-2022-JP?B?GyRCJCIkIiQiJCIkIiQiJCIkIiQiJCIkIiQiJCIbKEI=?=
	=?ISO-2022-JP?B?GyRCJCIkIiQiJCIkIiQiJCIkIiQiJCIkIiQiJCIbKEI=?=
	=?ISO-2022-JP?B?GyRCJCIkIiQiJCIkIiQiJCIkIiQiJCIkIiQiJCIbKEI=?=
	=?ISO-2022-JP?B?GyRCJCIkIiQiGyhC?=

-> あああああああああああ�あああああああああああああ�あああああああああああああ�あああああああああああああ�あああああああああああああ�あああああああああああああ�あああああああああああああ�あああ
(In reply to Takanori MATSUURA from comment #0)
> Tabs between MIME encoding lines are improperly treated.
Where are the tabs? Oh, I see:
> Subject: =?ISO-2022-JP?B?GyRCJCIkIiQiJCIkIiQiJCIkIiQiJCIkIhsoQg==?=
> 	=?ISO-2022-JP?B?GyRCJCIkIiQiJCIkIiQiJCIkIiQiJCIkIiQiJCIbKEI=?=
  ^^^^^^
What happens when you swap these tabs for spaces? In fact, I tried, that doesn't make a difference.

> Maybe a regression of bug 1261841.
Did it work in previous versions of TB? Yes, I tested TB 55 and it shows correctly there.

> Which Product/Component is better for this bug, "MailNews
> Core/Internationalization" or "Core/Internationalization"?
Good question, if it's cause by bug 1261841, then "Core/Internationalization", otherwise "MailNews Core/Internationalization".

There are two possibilities here:
1) Either we assemble the subject line incorrectly and the new ISO-2022-JP decoder is less tolerant to bad input
2) There is a problem in the new ISO-2022-JP decoder, which I doubt.

I'll have to check what exactly gets passed to the decoder.
Summary: Tabs between MIME encoding lines are improperly treated → Folded subject header using ISO-2022-JP doesn't get decoded properly
I've reduced the test case to:
Subject: =?ISO-2022-JP?B?GyRCJCIkIiQiJCIkIiQiJCIkIiQiJCIkIhsoQg==?=
 =?ISO-2022-JP?B?GyRCJCIkIiQiGyhC?=
That still shows a � in the decoded subject.

The header decoding is done in JSMime and there I found this comment:
https://dxr.mozilla.org/comm-central/rev/0cac30b866ef093d76eadbe6d55f5d1c916f39d2/mailnews/mime/jsmime/jsmime.js#682
related to cutting multi-byte characters in the middle. Perhaps that's what happened here.

Takanori-san, which software produced this header? It wasn't TB since we do all headers in UTF-8.

Henri, do you have any idea what's going on here? As I said before, the decoding takes place in JSMime, using .decode on a TextDecoder. To port bug 1261841 in bug 1363281, I didn't make any changes to the MIME code.
Flags: needinfo?(t.matsuu)
Flags: needinfo?(hsivonen)
var decoder = new TextDecoder("iso-2022-jp");
var encoder = new TextEncoder();
decoder.decode(encoder.encode(atob('GyRCJCIkIiQiJCIkIiQiJCIkIiQiJCIkIhsoQg==')),{stream:true});
decoder.decode(encoder.encode(atob('GyRCJCIkIiQiGyhC')));
Nightly 56: "\ufffdあああ"
Release 54: "あああ"
The encoding_rs iso-2022-jp decoder inserts "\ufffd" when it encounters the "\u001b(B\u001b$B" sequence. (probably in order to prevent XSS attacks.) Henri would know the answer. Chrome matches encoding_rs, by the way.

If we do *NOT* pass {streaming: true}, it works as expected because TextDecoder no longer encounters the "\u001b(B\u001b$B" sequence.
(In reply to Masatoshi Kimura [:emk] from comment #3)
> If we do *NOT* pass {streaming: true}, it works as expected because
> TextDecoder no longer encounters the "\u001b(B\u001b$B" sequence.

It will break the splitted multi-byte case, of course. Another workaround is to trim the trailing "\u001b(B\" and the leading "\u001b$B" manually before passing the buffer to TextDecoder.
Each base64 segment ends with an escape that transitions back to the ASCII state. Each base64 segment starts with a escape that transitions to the JIS X 0208 state. When these are concatenated, they form a zero-length ASCII run, which is an error per spec.

The remedy would be to omit the {stream: true} parameter on the TextDecoder call, but per the comment there, that would break the decoding of RFC 2047-incompliant input. What emk says about trimming the escapes would work here but could potentially break other (incompliant) input.

I'm not sure what the rationale is for zero-length runs between escapes being errors considering all the mischief that's possible by transitioning from the ASCII state to the ASCII state or alternating between the ASCII and Roman states. annevk, why is a zero-length run between escapes an error?
Flags: needinfo?(hsivonen) → needinfo?(annevk)
Probably the most robust hack would be to omit {stream: true} parameter if the encoding is ISO-2022-JP and the previous segment ended with ESC ( B.
(In reply to Henri Sivonen (:hsivonen) from comment #6)
> Probably the most robust hack would be to omit {stream: true} parameter if
> the encoding is ISO-2022-JP and the previous segment ended with ESC ( B.

Correction: Omit {stream: true} if the encoding is ISO-2022-JP and the segment being decoded ends with ESC ( B. (And with any encoding when decoding the last segment.)

Basically, compliant producers produce output as if the consumers didn't do streaming decode. Thunderbird does streaming decode in case a producer is incompliant. ISO-2022-JP is the only encoding that's not indifferent to content produced for non-streaming consumption being consumed in the streaming mode. :-(
Masatoshi and Henri have already started the tech discussion.
Thanks.

FYI, such subject line is generated by "Active! mail" which is one of the famous on-premises web mail system in Japan.
Flags: needinfo?(t.matsuu)
(In reply to Henri Sivonen (:hsivonen) from comment #5)
> I'm not sure what the rationale is for zero-length runs between escapes
> being errors considering all the mischief that's possible by transitioning
> from the ASCII state to the ASCII state or alternating between the ASCII and
> Roman states. annevk, why is a zero-length run between escapes an error?

It seems the spec behavior came from https://www.w3.org/Bugs/Public/show_bug.cgi?id=27256 which claims Chrome had this behavior at the time.

Additionally, https://web.archive.org/web/20140912104045/http://upokecenter.dreamhosters.com/articles/2013/04/differences-in-the-iso-2022-jp-encoding-between-browsers/ claims Firefox had this behavior in the reverse case (switch to multibyte and then back to ASCII without anything in between).
Given the Chrome behavior, it's probably the best not to tweak the Encoding Standard here and for Thunderbird to adopt the tweak from comment 7 to its Postel hack.
Summarising: So basically this is caused by
return output + currentDecoder.decode(buffer, {stream: true});
https://dxr.mozilla.org/comm-central/rev/0cac30b866ef093d76eadbe6d55f5d1c916f39d2/mailnews/mime/jsmime/jsmime.js#686
where the streaming is used to be tolerant to multi-bytes broken in the middle.

For ISO-2022-JP that causes a problem since the input presented represents correct RCF2047 chunks, each started and terminated correctly, and those termination marks are now causing a problem in "streaming mode" since a zero-length run between them is now considered to be an error by the new decoder, but not the old. Hmm.
(In reply to Jorg K (GMT+2) from comment #11)
> Summarising: So basically this is caused by
> return output + currentDecoder.decode(buffer, {stream: true});
> https://dxr.mozilla.org/comm-central/rev/
> 0cac30b866ef093d76eadbe6d55f5d1c916f39d2/mailnews/mime/jsmime/jsmime.js#686
> where the streaming is used to be tolerant to multi-bytes broken in the
> middle.
> 
> For ISO-2022-JP that causes a problem since the input presented represents
> correct RCF2047 chunks, each started and terminated correctly, and those
> termination marks are now causing a problem in "streaming mode" since a
> zero-length run between them is now considered to be an error by the new
> decoder, but not the old. Hmm.

Right. Clearly, RFC 2047 tries to require producers to do things in such a way that consumption in either the streaming or non-streaming mode works, but RFC 2047 doesn't foresee empty sequences between escapes becoming an error for ISO-2022-JP. This would not be a problem if Thunderbird relied on producers being RFC 2047-compliant and each segment, therefore, being consumable in non-streaming mode. But when Thunderbird's attempt to deal with non-compliant producers is added to the mix, the combination of factors produces undesirable results.

Therefore, I suggest that Thunderbird turn off its attempt to deal with non-compliant consumers when the input looks compliant enough for the non-compliance workaround being unnecessary. I.e. when decoding iso-2022-jp and the segment does, in fact, end with ESC ( B like the RFC requires.

Additionally, passing {stream:true} unconditionally causes *another* bug for all (multibyte) encodings: It masks errors at the end of the last segment, i.e. the end of the stream is currently not signaled to the decoder.
This addresses the ISO-2022-JP issue. This comment is not addressed so far:
> Additionally, passing {stream:true} unconditionally causes *another*
> bug for all (multibyte) encodings.
Assignee: nobody → jorgk
Status: NEW → ASSIGNED
OK, now taking care of not calling this with streaming on the last token.
Attachment #8879086 - Attachment is obsolete: true
(In reply to Henri Sivonen (:hsivonen) from comment #10)
> Given the Chrome behavior, it's probably the best not to tweak the Encoding
> Standard here and for Thunderbird to adopt the tweak from comment 7 to its
> Postel hack.

After thinking about this some more, filed a spec bug anyway in order to have this properly considered:
https://github.com/whatwg/encoding/issues/115
Tweaked comment and order to make the more likely clause execute first.
Attachment #8879095 - Attachment is obsolete: true
This adds a test.
Attachment #8879096 - Attachment is obsolete: true
Attachment #8879101 - Flags: review?(Pidgeot18)
Duplicate of this bug: 1374926
https://github.com/whatwg/encoding/issues/115#issuecomment-309411747 seems like a good enough reason to treat this as an error and maybe special case in email?
Flags: needinfo?(annevk)
(In reply to Anne (:annevk) from comment #19)
> https://github.com/whatwg/encoding/issues/115#issuecomment-309411747 seems
> like a good enough reason to treat this as an error and maybe special case
> in email?

What's unclear to me is: What scenario makes zero-length sequences between escapes dangerous but doesn't make escape to ascii state, one character, another escape to ASCII state or alternating between ASCII and Roman states for a run of ASCII characters with at least one character in each state dangerous?
Flags: needinfo?(annevk)
Not sure, let's ask emk: see comment 20.
Flags: needinfo?(annevk) → needinfo?(VYV03354)
Uconv considered ASCII-to-ASCII switches danger:
https://dxr.mozilla.org/mozilla-beta/rev/c836770a64c15192c42d33340abf861a1d201723/intl/uconv/ucvja/nsJapaneseToUnicode.cpp#511-516
Exactly because of the security reason (bug 407161).

If the Encoding Standard does not treat them danger, I think we should do it.
Flags: needinfo?(VYV03354)
emk, what about ASCII-to-Roman?
Comment on attachment 8879101 [details] [diff] [review]
1374149-decoding-streaming.patch (v2b) + test

Masatoshi-san, could you review this for me please. It's about time we fix this regressions which is affecting our Japanese Daily users since popular Japanese web services (see comment #8) are sending ISO-2022-JP encoded subject headers.
Attachment #8879101 - Flags: review?(Pidgeot18) → review?(VYV03354)
Attachment #8879101 - Flags: review?(VYV03354) → review+
https://hg.mozilla.org/comm-central/rev/78a4fd6d79b50c334acfbae9220a92dc384be321
Status: ASSIGNED → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → Thunderbird 56.0
Looks like this caused a test failure in test_jsmime_charset.js:
"�" == ""
Since the last RFC 2047 token is now decoded with no streaming, the encoder notices the wrong input immediately and returns a � instead of waiting for the next chunk that never arrives. So here we need to change the expected result.
Comment on attachment 8879101 [details] [diff] [review]
1374149-decoding-streaming.patch (v2b) + test

Bug 1301989 depends on this one, so let's uplift this one, too.
Attachment #8879101 - Flags: approval-comm-esr52?
Attachment #8879101 - Flags: approval-comm-beta+
Attachment #8879101 - Flags: approval-comm-esr52? → approval-comm-esr52+
You need to log in before you can comment on or make changes to this bug.