Closed Bug 1532668 Opened 6 years ago Closed 6 years ago

Thunderbird breaks UTF-8 character withing encoded words (violates RFC 2047) on RFC 2231 multiline filename generation

Categories

(MailNews Core :: Attachments, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED
Thunderbird 68.0

People

(Reporter: 3APA3A, Assigned: infofrommozilla)

Details

Attachments

(2 files, 1 obsolete file)

User Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:65.0) Gecko/20100101 Firefox/65.0

Steps to reproduce:

Thunderbird 60.5.1 32-bit Windows

  1. Create new message
  2. Attach file with name "Информационное письмо.pdf"
  3. Send message
  4. Check message source

Actual results:

Generated attachment part headers are

Content-Type: application/pdf;
name="=?UTF-8?B?0JjQvdGE0L7RgNC80LDRhtC40L7QvdC90L7QtSDQv9C40YHRjNC80L4u?=
=?UTF-8?Q?pdf?="
Content-Transfer-Encoding: base64
Content-Disposition: attachment;
filename0=utf-8''%D0%98%D0%BD%D1%84%D0%BE%D1%80%D0%BC%D0%B0%D1%86%D0%B8;
filename1=%D0%BE%D0%BD%D0%BD%D0%BE%D0%B5%20%D0%BF%D0%B8%D1%81%D1%8C%D0;
filename2=%BC%D0%BE%2E%70%64%66

note, UTF-8 character "м" (%D0%BC) is split between filename1 and filename2. Filename is incorrectly parsed/displayed by recipient. Thunderbird itself displays filename correctly despite of the bug due to fix
https://bugzilla.mozilla.org/show_bug.cgi?id=493544

Expected results:

RFC 2231 extends encoded-word definition from RFC 2047, according to RFC 2047

Each 'encoded-word' MUST represent an integral number of characters.
A multi-octet character may not be split across adjacent 'encoded-
word's.

Component: Untriaged → Attachments
Product: Thunderbird → MailNews Core

Interesting, thanks for the report. Alfred, you feel like poking around that stuff?

Flags: needinfo?(infofrommozilla)

I can confirm this bug.
It occurs because we first escape the entire file name, and then fold in front of any '%' character.

The problem is even known (However, the comment is in the wrong line):
https://dxr.mozilla.org/comm-central/rev/2a29ee0adb310b54a6a2df72034953fed8f2b043/comm/mailnews/compose/src/nsMsgCompUtils.cpp#1062

This patch already completely fixes the problem. But there is more to do.

But before I would like to know if you would accept the patch in principle like that.

So far, we try to minimize the charset. However, JSMIME always uses UTF-8 as charset in the 'Content-Type' header. In the patch, I now do the same for the 'Content-Disposition' header. That simplifies the thing a bit.

ToDo:

However, we do not need a new test. This bug will be covered already test_attachment.js.

Flags: needinfo?(infofrommozilla)
Attachment #9056407 - Flags: feedback?(jorgk)
Comment on attachment 9056407 [details] [diff] [review] Don't split UTF-8 characters in 'filename' parameter of MIME 'Content-Disposition' header. Review of attachment 9056407 [details] [diff] [review]: ----------------------------------------------------------------- You're asking a jolly good question here. With a file äöü and sending in windows-1252, I got these headers: ``` Content-Type: text/plain; charset=UTF-8; name="=?UTF-8?B?w6TDtsO8LnR4dA==?=" Content-Transfer-Encoding: base64 Content-Disposition: attachment; filename*=windows-1252''%E4%F6%FC%2E%74%78%74 ``` So that header has UTF-8 and windows-1252. Personally, I don't mind going down the UTF-8 route, but we already got complaints, see bug 1340773. So let me ask you back: Would to be harder to maintain using the body charset? Maybe I can catch Joshua (:jcranmer) on IRC, he's sometimes around on Sundays to ask how he feels about this. Re. the other clean-up you suggested: You know that `nsMsgI18NFileSystemCharset()` was dumbed-down in bug 1381762 and is not working properly any more. For example, for a Polish locale it doesn't return Windows ANSI charset with horrible consequences, see bug 1505315. So I'd definitely remove its use. The patch as many white-space nits, I'll point some out. ::: mailnews/compose/src/nsMsgCompUtils.cpp @@ +944,2 @@ > needEscape = true; > + nsAutoCString nativeParmValue=NS_ConvertUTF16toUTF8(parmValue); ws around operator. @@ +1037,5 @@ > if (*end && needEscape) > { > + // Check to see if we are in the middle of escaped char. > + // We use ESCAPE_ALL, so every third character is a '%'. > + if (end-1 > start && *(end-1) == '%') { ws around -, here and below. @@ +1045,3 @@ > } > + // *end is now a '%'. > + // Check if the following UTF-8 octet is a continuation char. You need to educate the reviewer here. What's a continuation character in UTF-8?
Attachment #9056407 - Flags: feedback?(jorgk)
Comment on attachment 9056407 [details] [diff] [review] Don't split UTF-8 characters in 'filename' parameter of MIME 'Content-Disposition' header. It was easier to get hold of Joshua than I had thought, here's the result: 21:25:48 - jorgk: so the question is: How do you feel of ripping that out and always doing UTF-8. 21:26:02 - jcranmer: jorgk: I thought I had a patch that already did that 21:26:26 - jcranmer: in other words "OH HELL YES USE ONLY UTF-8" 21:27:42 - jorgk: jcranmer: You know where you have your patch? 21:29:35 - jcranmer: https://hg.mozilla.org/users/Pidgeot18_gmail.com/patch-queues/file/tip/patches-newcompose I browsed those patches and found: https://hg.mozilla.org/users/Pidgeot18_gmail.com/patch-queues/file/tip/patches-newcompose/attachment-structured Pretty radical, please take a look.
Attachment #9056407 - Flags: feedback+

(back to Alfred)

Assignee: nobody → infofrommozilla
Flags: needinfo?(infofrommozilla)

(In reply to Jorg K (GMT+2) from comment #4)

Comment on attachment 9056407 [details] [diff] [review]

So that header has UTF-8 and windows-1252. Personally, I don't mind going
down the UTF-8 route, but we already got complaints, see bug 1340773.

It would be doable for 8-bit charsets. But with MBCSs the things go complicated. We would have to check character by character to see if the encoded text broke the limit.

@@ +1045,3 @@

     }
  •    // *end is now a '%'.
    
  •    // Check if the following UTF-8 octet is a continuation char.
    

You need to educate the reviewer here. What's a continuation character in
UTF-8?

The first octet of a UTF-8 character defines how many (1 to 4) octets the character consists of. With 'continuation char' I meant the possibly following octets.

Now I had to look for myself, where I picked up that name. ;-)

OK, Wikipedia names them 'continuation bytes'
https://en.wikipedia.org/wiki/UTF-8

  • Bugfix for UTF-8
  • Always use UTF-8 when encoding is needed.
  • Remove of parameter carset and language of RFC2231ParmFolding().
  • Fixed tests.
Attachment #9056407 - Attachment is obsolete: true
Flags: needinfo?(infofrommozilla)
Attachment #9062711 - Flags: review?(jorgk)

(In reply to Jorg K (GMT+2) from comment #9)

Should we try https://hg.mozilla.org/users/Pidgeot18_gmail.com/patch-queues/file/tip/patches-newcompose/attachment-structured ?

https://hg.mozilla.org/users/Pidgeot18_gmail.com/patch-queues/file/tip/patches-newcompose/attachment-structured#l617

+        " name*1*=EFGHIJKLMNOPQRSTUVWXYZ%5b%5c%5d^_`abcdefghijklmnopqrstuvwxyz{|}~%c2;\r\n"+
                                                                                   ^^^
+        " name*2*=%a0%c2%a1%c2%a2%c2%a3%c2%a4%c2%a5%c2%a6%c2%a7%c2%a8%c2%a9%c2%aa%c2;\r\n"+
                   ^^^

%C2%A0 => <u+00A0> NO-BREAK SPACE;

That doesn't even fix our problem.

OK, so you are saying that Joshua's patch modifies the test and the modified test shows that UTF-8 characters are broken left, right and centre, see:

+  Name: 'Content-Type: text/plain; charset=US-ASCII;\r\n'+
+        " name*0*=UTF-8''%20%21%22#$%25&'%28%29*+%2c-./0123456789%3a%3b%3c=%3e?%40ABCD;\r\n"+
+        " name*1*=EFGHIJKLMNOPQRSTUVWXYZ%5b%5c%5d^_`abcdefghijklmnopqrstuvwxyz{|}~%c2;\r\n"+
+        " name*2*=%a0%c2%a1%c2%a2%c2%a3%c2%a4%c2%a5%c2%a6%c2%a7%c2%a8%c2%a9%c2%aa%c2;\r\n"+
+        " name*3*=%ab%c2%ac%c2%ad%c2%ae%c2%af%c2%b0%c2%b1%c2%b2%c2%b3%c2%b4%c2%b5%c2;\r\n"+
+        " name*4*=%b6%c2%b7%c2%b8%c2%b9%c2%ba%c2%bb%c2%bc%c2%bd%c2%be%c2%bf%c3%80%c3;\r\n"+
+        " name*5*=%81%c3%82%c3%83%c3%84%c3%85%c3%86%c3%87%c3%88%c3%89%c3%8a%c3%8b%c3;\r\n"+
+        " name*6*=%8c%c3%8d%c3%8e%c3%8f%c3%90%c3%91%c3%92%c3%93%c3%94%c3%95%c3%96%c3;\r\n"+
+        " name*7*=%97%c3%98%c3%99%c3%9a%c3%9b%c3%9c%c3%9d%c3%9e%c3%9f%c3%a0%c3%a1%c3;\r\n"+
+        " name*8*=%a2%c3%a3%c3%a4%c3%a5%c3%a6%c3%a7%c3%a8%c3%a9%c3%aa%c3%ab%c3%ac%c3;\r\n"+
+        " name*9*=%ad%c3%ae%c3%af%c3%b0%c3%b1%c3%b2%c3%b3%c3%b4%c3%b5%c3%b6%c3%b7%c3;\r\n"+
+        " name*10*=%b8%c3%b9%c3%ba%c3%bb%c3%bc%c3%bd%c3%be%c3%bf.txt\r\n",

All the line end with "c2" the first byte of a UTF-8 character and the "continuation bytes" (note: reviewer was educated) are on the next line, line "a0", "ab", "b6", etc.

(In reply to Jorg K (GMT+2) from comment #11)

OK, so you are saying that Joshua's patch modifies the test and the modified test shows that UTF-8 characters are broken

Yes. That's the reason for this bug.

All the line end with "c2" the first byte of a UTF-8 character and the "continuation bytes" (note: reviewer was educated) are on the next line, line "a0", "ab", "b6", etc.

It is a coincidence that the error is now so frequent. Joshua's patch just reformats the output.

Comment on attachment 9062711 [details] [diff] [review] Don't split UTF-8 characters in 'filename' parameter of MIME 'Content-Disposition' header. Review of attachment 9062711 [details] [diff] [review]: ----------------------------------------------------------------- Looks OK to me. Note that the mailnews/compose hasn't bee reformatted to Google style yet. That will be done in the next few days. See bug 1546364. ::: mailnews/compose/src/nsMsgCompUtils.cpp @@ +921,4 @@ > > bool needEscape; > nsCString dupParm; > + nsCString charset("UTF-8"); We could hard-code that in the two/three spots where this is used. @@ +927,2 @@ > needEscape = true; > + dupParm = parmValue; Maybe better: dupParam.Assign(parmValue) ? @@ -960,4 @@ > } > > - if (dupParm.IsEmpty()) > - return nullptr; Should we leave this in? It could be called with a null string. @@ +1014,4 @@ > } > + // *end is now a '%'. > + // Check if the following UTF-8 octet is a continuation. > + while(end > start + 3 && (*(end + 1) == '8' || *(end + 1) == '9' `while (space) (end - 3 > start`, for consistency, no? Above we have `end - 1` and `end - 2`. @@ +1014,5 @@ > } > + // *end is now a '%'. > + // Check if the following UTF-8 octet is a continuation. > + while(end > start + 3 && (*(end + 1) == '8' || *(end + 1) == '9' > + || *(end + 1) == 'A' || *(end + 1) == 'B')) { Hmm, can this be written as ... && '8' <= *(end + 1) && *(end + 1) <= 'B')? EDIT: Fixed last code just above.
Attachment #9062711 - Flags: review?(jorgk) → review+

(In reply to Jorg K (GMT+2) from comment #13)

Review of attachment 9062711 [details] [diff] [review]:

@@ -960,4 @@

}

  • if (dupParm.IsEmpty())
  • return nullptr;

Should we leave this in? It could be called with a null string.

That should already have been checked at:
NS_ENSURE_TRUE(parmName && *parmName && parmValue && *parmValue, nullptr);

https://dxr.mozilla.org/comm-central/source/xpcom/base/nsDebug.h#201
These need to be compiled regardless of the DEBUG flag.

@@ +1014,5 @@

     }
  •    // *end is now a '%'.
    
  •    // Check if the following UTF-8 octet is a continuation.
    
  •    while(end > start + 3 && (*(end + 1) == '8' || *(end + 1) == '9'
    
  •                              || *(end + 1) == 'A' || *(end + 1) == 'B')) {
    

Hmm, can this be written as
... && '8' <= *(end + 1) && *(end + 1) <= 'B')?

Hmm - it makes the code a bit unreadable - right? But if you prefer it this way: OK.

Flags: needinfo?(jorgk)

OK, let's lose the if (dupParm.IsEmpty()) as you had it. Checking four subquent (<== wrong!!) characters with four comparisons doesn't appear to be particularly elegant. But hey, I got confused with hex values, between '8' and 'B' we have 9 : ; < = > ? @ A. So although what I suggested would work, it's not particularly correct. Please leave what you had. Sorry.

So that leaves the "UTF-8" string, .Assign() and the while (end - 3 > start from my comment. I can fix that myself if you agree.

Flags: needinfo?(jorgk)

(In reply to Jorg K (GMT+2) from comment #15)

OK, let's lose the if (dupParm.IsEmpty()) as you had it. Checking four subquent (<== wrong!!) characters with four comparisons doesn't appear to be particularly elegant. But hey, I got confused with hex values, between '8' and 'B' we have 9 : ; < = > ? @ A. So although what I suggested would work, it's not particularly correct. Please leave what you had. Sorry.

So that leaves the "UTF-8" string, .Assign() and the while (end - 3 > start from my comment. I can fix that myself if you agree.

Sure, just as you like.

Check the interdiff:
https://bugzilla.mozilla.org/attachment.cgi?oldid=9062711&action=interdiff&newid=9062803&headers=1

Damn, I forgot the space after the while. Will be fixed when landing.

Attachment #9062803 - Flags: review+

Pushed by mozilla@jorgk.com:
https://hg.mozilla.org/comm-central/rev/b1ee53e20973
Don't split UTF-8 characters in 'filename' parameter of MIME 'Content-Disposition' header. r=jorgk

Status: UNCONFIRMED → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Target Milestone: --- → Thunderbird 68.0
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: