Thunderbird breaks UTF-8 character withing encoded words (violates RFC 2047) on RFC 2231 multiline filename generation
Categories
(MailNews Core :: Attachments, defect)
Tracking
(Not tracked)
People
(Reporter: 3APA3A, Assigned: infofrommozilla)
Details
Attachments
(2 files, 1 obsolete file)
|
11.95 KB,
patch
|
jorgk-bmo
:
review+
|
Details | Diff | Splinter Review |
|
12.23 KB,
patch
|
jorgk-bmo
:
review+
|
Details | Diff | Splinter Review |
User Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:65.0) Gecko/20100101 Firefox/65.0
Steps to reproduce:
Thunderbird 60.5.1 32-bit Windows
- Create new message
- Attach file with name "Информационное письмо.pdf"
- Send message
- Check message source
Actual results:
Generated attachment part headers are
Content-Type: application/pdf;
name="=?UTF-8?B?0JjQvdGE0L7RgNC80LDRhtC40L7QvdC90L7QtSDQv9C40YHRjNC80L4u?=
=?UTF-8?Q?pdf?="
Content-Transfer-Encoding: base64
Content-Disposition: attachment;
filename0=utf-8''%D0%98%D0%BD%D1%84%D0%BE%D1%80%D0%BC%D0%B0%D1%86%D0%B8;
filename1=%D0%BE%D0%BD%D0%BD%D0%BE%D0%B5%20%D0%BF%D0%B8%D1%81%D1%8C%D0;
filename2=%BC%D0%BE%2E%70%64%66
note, UTF-8 character "м" (%D0%BC) is split between filename1 and filename2. Filename is incorrectly parsed/displayed by recipient. Thunderbird itself displays filename correctly despite of the bug due to fix
https://bugzilla.mozilla.org/show_bug.cgi?id=493544
Expected results:
RFC 2231 extends encoded-word definition from RFC 2047, according to RFC 2047
Each 'encoded-word' MUST represent an integral number of characters.
A multi-octet character may not be split across adjacent 'encoded-
word's.
Comment 1•6 years ago
|
||
Interesting, thanks for the report. Alfred, you feel like poking around that stuff?
| Assignee | ||
Comment 2•6 years ago
|
||
I can confirm this bug.
It occurs because we first escape the entire file name, and then fold in front of any '%' character.
The problem is even known (However, the comment is in the wrong line):
https://dxr.mozilla.org/comm-central/rev/2a29ee0adb310b54a6a2df72034953fed8f2b043/comm/mailnews/compose/src/nsMsgCompUtils.cpp#1062
| Assignee | ||
Comment 3•6 years ago
|
||
This patch already completely fixes the problem. But there is more to do.
But before I would like to know if you would accept the patch in principle like that.
So far, we try to minimize the charset. However, JSMIME always uses UTF-8 as charset in the 'Content-Type' header. In the patch, I now do the same for the 'Content-Disposition' header. That simplifies the thing a bit.
ToDo:
- The charset minimization is now obsolete and can be removed:
https://dxr.mozilla.org/comm-central/rev/2a29ee0adb310b54a6a2df72034953fed8f2b043/comm/mailnews/compose/src/nsMsgCompUtils.cpp#570 - The charset parameters of RFC2231ParmFolding() could be omitted.
- The language parameter isn't used at all and could also be omitted.
https://dxr.mozilla.org/comm-central/rev/2a29ee0adb310b54a6a2df72034953fed8f2b043/comm/mailnews/compose/src/nsMsgCompUtils.cpp#937 - And finally some tests must be adapted.
However, we do not need a new test. This bug will be covered already test_attachment.js.
Comment 4•6 years ago
|
||
Comment 5•6 years ago
|
||
Comment 6•6 years ago
|
||
(back to Alfred)
| Assignee | ||
Comment 7•6 years ago
|
||
(In reply to Jorg K (GMT+2) from comment #4)
Comment on attachment 9056407 [details] [diff] [review]
So that header has UTF-8 and windows-1252. Personally, I don't mind going
down the UTF-8 route, but we already got complaints, see bug 1340773.
It would be doable for 8-bit charsets. But with MBCSs the things go complicated. We would have to check character by character to see if the encoded text broke the limit.
@@ +1045,3 @@
}
// *end is now a '%'.// Check if the following UTF-8 octet is a continuation char.You need to educate the reviewer here. What's a continuation character in
UTF-8?
The first octet of a UTF-8 character defines how many (1 to 4) octets the character consists of. With 'continuation char' I meant the possibly following octets.
Now I had to look for myself, where I picked up that name. ;-)
OK, Wikipedia names them 'continuation bytes'
https://en.wikipedia.org/wiki/UTF-8
| Assignee | ||
Comment 8•6 years ago
|
||
- Bugfix for UTF-8
- Always use UTF-8 when encoding is needed.
- Remove of parameter
carsetandlanguageofRFC2231ParmFolding(). - Fixed tests.
Comment 9•6 years ago
|
||
| Assignee | ||
Comment 10•6 years ago
|
||
(In reply to Jorg K (GMT+2) from comment #9)
Should we try https://hg.mozilla.org/users/Pidgeot18_gmail.com/patch-queues/file/tip/patches-newcompose/attachment-structured ?
+ " name*1*=EFGHIJKLMNOPQRSTUVWXYZ%5b%5c%5d^_`abcdefghijklmnopqrstuvwxyz{|}~%c2;\r\n"+
^^^
+ " name*2*=%a0%c2%a1%c2%a2%c2%a3%c2%a4%c2%a5%c2%a6%c2%a7%c2%a8%c2%a9%c2%aa%c2;\r\n"+
^^^
%C2%A0 => <u+00A0> NO-BREAK SPACE;
That doesn't even fix our problem.
Comment 11•6 years ago
|
||
OK, so you are saying that Joshua's patch modifies the test and the modified test shows that UTF-8 characters are broken left, right and centre, see:
+ Name: 'Content-Type: text/plain; charset=US-ASCII;\r\n'+
+ " name*0*=UTF-8''%20%21%22#$%25&'%28%29*+%2c-./0123456789%3a%3b%3c=%3e?%40ABCD;\r\n"+
+ " name*1*=EFGHIJKLMNOPQRSTUVWXYZ%5b%5c%5d^_`abcdefghijklmnopqrstuvwxyz{|}~%c2;\r\n"+
+ " name*2*=%a0%c2%a1%c2%a2%c2%a3%c2%a4%c2%a5%c2%a6%c2%a7%c2%a8%c2%a9%c2%aa%c2;\r\n"+
+ " name*3*=%ab%c2%ac%c2%ad%c2%ae%c2%af%c2%b0%c2%b1%c2%b2%c2%b3%c2%b4%c2%b5%c2;\r\n"+
+ " name*4*=%b6%c2%b7%c2%b8%c2%b9%c2%ba%c2%bb%c2%bc%c2%bd%c2%be%c2%bf%c3%80%c3;\r\n"+
+ " name*5*=%81%c3%82%c3%83%c3%84%c3%85%c3%86%c3%87%c3%88%c3%89%c3%8a%c3%8b%c3;\r\n"+
+ " name*6*=%8c%c3%8d%c3%8e%c3%8f%c3%90%c3%91%c3%92%c3%93%c3%94%c3%95%c3%96%c3;\r\n"+
+ " name*7*=%97%c3%98%c3%99%c3%9a%c3%9b%c3%9c%c3%9d%c3%9e%c3%9f%c3%a0%c3%a1%c3;\r\n"+
+ " name*8*=%a2%c3%a3%c3%a4%c3%a5%c3%a6%c3%a7%c3%a8%c3%a9%c3%aa%c3%ab%c3%ac%c3;\r\n"+
+ " name*9*=%ad%c3%ae%c3%af%c3%b0%c3%b1%c3%b2%c3%b3%c3%b4%c3%b5%c3%b6%c3%b7%c3;\r\n"+
+ " name*10*=%b8%c3%b9%c3%ba%c3%bb%c3%bc%c3%bd%c3%be%c3%bf.txt\r\n",
All the line end with "c2" the first byte of a UTF-8 character and the "continuation bytes" (note: reviewer was educated) are on the next line, line "a0", "ab", "b6", etc.
| Assignee | ||
Comment 12•6 years ago
|
||
(In reply to Jorg K (GMT+2) from comment #11)
OK, so you are saying that Joshua's patch modifies the test and the modified test shows that UTF-8 characters are broken
Yes. That's the reason for this bug.
All the line end with "c2" the first byte of a UTF-8 character and the "continuation bytes" (note: reviewer was educated) are on the next line, line "a0", "ab", "b6", etc.
It is a coincidence that the error is now so frequent. Joshua's patch just reformats the output.
Comment 13•6 years ago
•
|
||
| Assignee | ||
Comment 14•6 years ago
|
||
(In reply to Jorg K (GMT+2) from comment #13)
Review of attachment 9062711 [details] [diff] [review]:
@@ -960,4 @@
}
- if (dupParm.IsEmpty())
- return nullptr;
Should we leave this in? It could be called with a null string.
That should already have been checked at:
NS_ENSURE_TRUE(parmName && *parmName && parmValue && *parmValue, nullptr);
https://dxr.mozilla.org/comm-central/source/xpcom/base/nsDebug.h#201
These need to be compiled regardless of the DEBUG flag.
@@ +1014,5 @@
}
// *end is now a '%'.// Check if the following UTF-8 octet is a continuation.while(end > start + 3 && (*(end + 1) == '8' || *(end + 1) == '9'|| *(end + 1) == 'A' || *(end + 1) == 'B')) {Hmm, can this be written as
... && '8' <= *(end + 1) && *(end + 1) <= 'B')?
Hmm - it makes the code a bit unreadable - right? But if you prefer it this way: OK.
Comment 15•6 years ago
•
|
||
OK, let's lose the if (dupParm.IsEmpty()) as you had it. Checking four subquent (<== wrong!!) characters with four comparisons doesn't appear to be particularly elegant. But hey, I got confused with hex values, between '8' and 'B' we have 9 : ; < = > ? @ A. So although what I suggested would work, it's not particularly correct. Please leave what you had. Sorry.
So that leaves the "UTF-8" string, .Assign() and the while (end - 3 > start from my comment. I can fix that myself if you agree.
| Assignee | ||
Comment 16•6 years ago
|
||
(In reply to Jorg K (GMT+2) from comment #15)
OK, let's lose the
if (dupParm.IsEmpty())as you had it. Checking four subquent (<== wrong!!) characters with four comparisons doesn't appear to be particularly elegant. But hey, I got confused with hex values, between '8' and 'B' we have 9 : ; < = > ? @ A. So although what I suggested would work, it's not particularly correct. Please leave what you had. Sorry.So that leaves the "UTF-8" string,
.Assign()and thewhile (end - 3 > startfrom my comment. I can fix that myself if you agree.
Sure, just as you like.
Comment 17•6 years ago
•
|
||
Check the interdiff:
https://bugzilla.mozilla.org/attachment.cgi?oldid=9062711&action=interdiff&newid=9062803&headers=1
Damn, I forgot the space after the while. Will be fixed when landing.
Comment 18•6 years ago
|
||
Pushed by mozilla@jorgk.com:
https://hg.mozilla.org/comm-central/rev/b1ee53e20973
Don't split UTF-8 characters in 'filename' parameter of MIME 'Content-Disposition' header. r=jorgk
Updated•6 years ago
|
Description
•