Encoding errors of characters ś and ą in subject, body and attachment names when sending message via "Send to > Mail recipient" on Polish Windows (caused by interpreting MAPI data as ISO-8859-2 instead of windows-1250)

RESOLVED FIXED in Thunderbird 65.0

Type: defect
Opened 9 months ago, closed 8 months ago

Reporter: mkasprowicz, Assigned: jorgk

Keywords: regression
Target Milestone: Thunderbird 65.0

Thunderbird Tracking Flags: thunderbird_esr60 64+ fixed, thunderbird64 fixed, thunderbird65 fixed

Attachments: 6 attachments, 1 obsolete attachment

Posted image Bez tytułu.jpg
User Agent: Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:63.0) Gecko/20100101 Firefox/63.0

Steps to reproduce:

I updated the program to the latest version


Actual results:

no Polish characters in the title and attachments


Expected results:

There should be Polish characters. For example, the attachment name should contain "Mątwicki".
We need the full message as .eml file to be able to investigate, or at least all the message headers and also the MIME part headers for at least one attachment.

Looking at the subject in the picture, which has a [009C] character, there's most likely an encoding error on the sender's side. TB is pretty solid these days when it comes to encoding. That said, TB 60 and later may be a little stricter in what they allow.
I do not know if it's okay, but I've added the .eml file above.
Attachment #9023221 - Attachment mime type: message/rfc822 → text/plain
I am a simple boy; could you give me a simpler explanation? I do not understand.
Thanks.

The subject contains various RFC 2047 encoded strings, the first one is
Subject: =?UTF-8?Q?Wysy=c5=82anie_wiadomo=c2=9cci_e-mail=3a_M=c5=a1twicki_M?=

The character that's not decoded properly is c2 9c. That's an encoding error:
https://www.utf8-zeichentabelle.de/unicode-utf8-table.pl:
U+009C c2 9c <control>
So that's a unicode control character with no textual representation, so the TB display is correct.

Looking at the attachment headers:
Content-Type: application/pdf;
 name="=?UTF-8?Q?M=c5=a1twicki_M=2e_-_PPE_za_10-2018=2epdf?="
Content-Type: application/pdf;
 name="=?UTF-8?Q?M=c5=a1twicki_M=2e_-_PIT-4_za_10-2018=2epdf?="

The character in question, c5 a1, *is*
U+0161 š c5 a1	LATIN SMALL LETTER S WITH CARON
https://www.utf8-zeichentabelle.de/unicode-utf8-table.pl?number=1024
which is what TB displays.
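
The decoding above can be checked mechanically. Below is a minimal sketch using Python's standard email.header module, applied to the subject and the first attachment name quoted above. Python is used only for illustration and is not part of Thunderbird; decode_rfc2047 is a hypothetical helper name.

```python
from email.header import decode_header


def decode_rfc2047(value: str) -> str:
    """Decode every RFC 2047 encoded-word in a header value to text."""
    parts = []
    for data, charset in decode_header(value):
        # decode_header yields bytes for encoded words and str for plain text
        if isinstance(data, bytes):
            parts.append(data.decode(charset or "ascii"))
        else:
            parts.append(data)
    return "".join(parts)


subject = decode_rfc2047(
    "=?UTF-8?Q?Wysy=c5=82anie_wiadomo=c2=9cci_e-mail=3a_M=c5=a1twicki_M?="
)
name = decode_rfc2047("=?UTF-8?Q?M=c5=a1twicki_M=2e_-_PPE_za_10-2018=2epdf?=")

# List the code point of every non-ASCII character in the decoded subject
for ch in subject:
    if ord(ch) > 0x7F:
        print(f"U+{ord(ch):04X}")
print(name)
```

The code points listed for the subject include U+009C and U+0161, so the wrong characters are already present in the encoded header and are not introduced when the message is displayed.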

Now I see that you created the message with TB 60.3, so I don't know how it got mis-encoded. I've just extracted the attachment from the message, changed the š to an ą and sent it to you. As far as I can tell, the ą is in the attachment name in the sent message.
Out of curiosity, I saved the attachment you sent and sent it back, and the problem persists.


Wysyłanie wiadomoœci e-mail: Mštwicki M. - PPE za 10-2018

Wiadomoœć jest gotowa do wysłania wraz z następujšcymi załšcznikami (plikami lub łšczami):
Mštwicki M. - PPE za 10-2018


What's interesting is that some Polish characters do show correctly, for example "ł", "ć" and "ę".

Will you look into my problem? It is not a big problem, but the wrong characters are annoying.
Yes, I would like to understand the problem. By the looks of it, you didn't compose the e-mail by hand, right? Maybe you used the "Send to > Mail recipient" from the Windows desktop.

The subject was:
Wysyłanie wiadomo[009C]ci e-mail: Mštwicki M. - PPE za 10-2018, Mštwicki M. - komornik, Mštwicki M. - PIT-4 za 10-2018
but it should have been:
Wysyłanie wiadomości e-mail: Mątwicki M. - PPE za 10-2018, Mątwicki M. - komornik, Mątwicki M. - PIT-4 za 10-2018
(translated: Sending an e-mail: ...)

So somehow the ś got corrupted. Also the attachment names. I'll send the e-mail again to you so you can see that it works.

So the question is: How did you generate this e-mail? Looks like there is a bug in Windows or the Windows/Thunderbird interface.

Does it work if you create a new e-mail, paste the subject "Wysyłanie wiadomości e-mail: Mątwicki ..." and add the attachment(s) manually?
(In reply to Jorg K (GMT+1) from comment #7)
> Maybe you used the "Send to > Mail
> recipient" from the Windows desktop.
Right, the message body is:
Wiadomo[009C]ć jest gotowa do wysłania wraz z następujšcymi załšcznikami (plikami lub łšczami):

The message is ready to send with the following attachments (files or links):

That's generated badly by Windows.
Summary: no Polish characters in the title and attachments → Encoding errors of characters ś and ą in subject, body and attachment names when sending message via "Send to > Mail recipient" on Polish Windows
Duplicate of this bug: 1505889
Someone might also take a look at bug 689942 and close it out if it no longer exists.
I have the same problem.

Since I updated Thunderbird to version 60.0 or later, there is a problem when creating a new email by right-clicking a file on the desktop and choosing Send to > Mail recipient.
The same happens when a different program prepares a message and uses Thunderbird as the default mail client. I am not the only one affected: many people on the Polish Mozilla forum have the same problem. I have 5 different computers, and the problem appears whenever I upgrade Thunderbird to version 60.

Actual results:

Thunderbird creates the new email as usual, but the Polish letters in the subject are wrong. The same is true of the message body: the default text that is inserted there also has wrong Polish letters.
I don't have any extra add-ons installed in Thunderbird. Resetting Thunderbird to its default settings does not help at all.
Composing a normal email works well; the wrong Polish letters appear only when a file is sent in the way I described.
This will be hard to debug since I'd have to install a Polish language pack.

So there are only these two characters wrong?
U+0105 ą c4 85 comes out as U+0161 š c5 a1
U+015B ś c5 9b comes out as U+009C <control> c2 9c

Any other characters that are broken? Anything in Hungarian or Czech also broken?
For the record, I had to compile a 32bit version although I usually use 64bit for development, but that would crash due to bug 393302.

Then I had to set
  HKLM\SOFTWARE\Clients\Mail\Mozilla Thunderbird\DLLPath
to
  C:\mozilla-source\comm-central\obj-i686-pc-mingw32\comm\mailnews\mapi\mapiDLL\mozMapi32.dll

and
  HKLM\SOFTWARE\Classes\CLSID\{29F458BE-8866-11D5-A3DD-00B0D0F3BAA7}\LocalServer32 - Default
to
  "C:\mozilla-source\comm-central\obj-i686-pc-mingw32\dist\bin\thunderbird.exe" /MAPIStartup

Sadly I don't get a compose window when using "Send to > Mail recipient", but the main window opens instead :-(

My plan is to produce a debug version that an affected Polish user can run and report back the results.
I do not know if this will help, but I've checked how each of the Polish characters comes out.


the original file name: ąęćżźłóśń

title of the message: Wysyłanie wiadomoœci e-mail: šę濟łóœń
So problems with ąźś.
FRG, any idea how I can convince TB to open a compose window instead of the main window? See comment #13.
Flags: needinfo?(frgrahl)
I just installed TB 60.3 in a vm. Works of course with en-US. Did you set the registration paths for the dlls too?

https://dxr.mozilla.org/comm-esr60/source/mail/installer/windows/nsis/shared.nsh#369
Flags: needinfo?(frgrahl)
Yes, I did, see comment #13. I've even done
regsvr32.exe /s C:\mozilla-source\comm-central\obj-i686-pc-mingw32\comm\mailnews\mapi\mapiDLL\mozMapi32.dll
now. Nothing helped.

But here comes the success story now. I did |mach package| and got myself
C:\mozilla-source\comm-central\obj-i686-pc-mingw32\dist\install\sea\thunderbird-65.0a1.en-US.win32.installer.exe

Installing that, everything works now. I can now add my debug and see what happens.
I meant MapiProxy_InUse.dll. Didn't see it in comment 13 but installing the build is probably cleaner anyway.

> Installing that, everything works now. I can now add my debug and see what happens.

You probably know this, but it is really easy now to build an l10n version. Just add ac_add_options --with-l10n-base=d:/seamonkey/l10n/l10n-esr60 (your path, of course) and run mach build installers-pl -v for Polish after you have done the en-US build.
Well, while looking where to put the debug, I found the problem.

In https://searchfox.org/comm-central/source/mailnews/mapi/mapihook/src/msgMapiHook.cpp we convert the data passed in from Windows to Unicode by calling nsMsgI18NFileSystemCharset() and then nsMsgI18NConvertToUnicode(platformCharSet, ...).

nsMsgI18NFileSystemCharset() got somewhat simplified and no longer returns what it returned before. It now simply returns the fallback encoding for the locale, see:
https://hg.mozilla.org/comm-central/rev/0b0cba8d70bd#l1.31

Looks like for Polish, that is ISO-8859-2, also called Latin-2:
https://searchfox.org/mozilla-central/rev/4e094f66ced333d69b24cd49273789e3a1173dfc/dom/encoding/localesfallbacks.properties#57
https://de.wikipedia.org/wiki/ISO_8859-2

However, the real Windows file system charset is windows-1250, https://en.wikipedia.org/wiki/Windows-1250.

Let's see: In windows-1250 our ą is 0xB9, and in ISO-8859-2 that is a š :-(. In windows-1250 the ś is 0x9C and in ISO-8859-2 that is a control character :-( - Exactly what we observed. ę is 0xEA in both encodings so that's why that works.

So the root cause of the problem is that Windows delivers the data as windows-1250 via the MAPI interface, we used to interpret it correctly, but now we interpret it as ISO-8859-2.
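
The byte-level argument above can be reproduced in a few lines. Below is a minimal sketch in Python, used only for illustration. The codec names cp1250 and iso8859_2 are Python's names for windows-1250 and ISO-8859-2, and misdecode is a hypothetical helper, not Thunderbird code. It encodes each Polish lowercase letter the way Polish Windows would and then decodes it the way the regressed code did.

```python
def misdecode(text: str) -> str:
    """Encode as windows-1250 (what Windows hands over via MAPI),
    then decode as ISO-8859-2 (what the regressed code assumed)."""
    return text.encode("cp1250").decode("iso8859_2")


letters = "ąęćżźłóśń"

# Letters whose windows-1250 byte means something else in ISO-8859-2
broken = [ch for ch in letters if misdecode(ch) != ch]

for ch in broken:
    wrong = misdecode(ch)
    print(
        f"{ch}: windows-1250 byte 0x{ch.encode('cp1250')[0]:02X} "
        f"read as U+{ord(wrong):04X}, UTF-8 {wrong.encode('utf-8').hex(' ')}"
    )
```

Under these assumptions only ą, ź and ś are affected, which matches the reporter's list, and the UTF-8 bytes for the mis-decoded ą and ś are the c5 a1 and c2 9c seen in the message headers.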

Now that we know what broke it, how do we fix it?

The fix would be to implement MAPISENDMAILW instead of MAPISENDMAIL https://docs.microsoft.com/en-gb/windows/desktop/api/mapi/nc-mapi-mapisendmailw to send a unicode message.

A workaround might be to set the pref intl.charset.fallback.override to windows-1250.

Reporters, can you please try that.
Blocks: 1381762
Keywords: regression
Summary: Encoding errors of characters ś and ą in subject, body and attachment names when sending message via "Send to > Mail recipient" on Polish Windows → Encoding errors of characters ś and ą in subject, body and attachment names when sending message via "Send to > Mail recipient" on Polish Windows (caused by interpreting MAPI data as ISO-8859-2 instead of windows-1250)
Masatoshi-san, I need your help.

As you can see from comment #20, the problem was caused by dropping nsIPlatformCharset and dumbing down nsMsgI18NFileSystemCharset().

So here I'm trying to solve the problem by moving to MAPISendMailW which can supposedly handle "unicode".

The patch works insofar as "Send to > Mail recipient" will start a compose window and attach the selected file. So the HandleAttachments() function works.

However, the subject is a single "E" and the body a single "Y". The print shows:
=== E, 45
===  , 0
===  , 0
===  , 0
=== [02], 2
===  , 0
===  , 0
===  , 0
===  , 0
===  , 0
===  , 0
===  , 0
=== [02], 2
===  , 0
===  , 0
===  , 0
=== Y, 59
===  , 0
===  , 0
===  , 0

I don't know what Microsoft means by "unicode". Since it's still passed as LPSTR I assumed that it's UTF-8, but it doesn't appear to be. Even interpreted as UTF-16 I don't get a better result.

I'm sure you have more experience with this.
Assignee: nobody → jorgk
Status: UNCONFIRMED → ASSIGNED
Ever confirmed: true
Attachment #9024216 - Flags: feedback?(VYV03354)
I forgot to say:
https://docs.microsoft.com/en-gb/windows/desktop/api/mapi/nc-mapi-mapisendmailw
That talks of Unicode and that has lpMapiMessageW instead of lpMapiMessage. But both structures appear to be the same. We model this structure here: https://searchfox.org/comm-central/rev/ed86e0f292198530a321ba1dd2ebc9b3b8a7f506/mailnews/mapi/mapihook/build/msgMapi.idl#35
"unicode" means UTF-16LE in Microsoft terms.
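
As a general illustration of what UTF-16LE means for code that expects a narrow string: every ASCII character is followed by a zero byte, so a reader that treats the buffer as a NUL-terminated char string stops after the first character. Below is a minimal Python sketch, for illustration only. read_narrow is a hypothetical helper imitating a C-style string read, and the sketch is not claimed to explain the exact debug output above.

```python
def read_narrow(buffer: bytes) -> bytes:
    """Imitate reading a NUL-terminated char* string: stop at the first zero byte."""
    return buffer.split(b"\x00", 1)[0]


# UTF-16LE text followed by a two-byte (wide) terminator
wide = "Wysyłanie".encode("utf-16-le") + b"\x00\x00"

print(wide[:6])                        # raw bytes of the first three characters
print(read_narrow(wide))               # what a narrow-string reader would see
print(wide[:-2].decode("utf-16-le"))   # the correct wide-string interpretation
```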
Component: Untriaged → Simple MAPI
Product: Thunderbird → MailNews Core
Comment on attachment 9024216
1505315-MAPI-unicode.patch - WIP

Sorry, lpMapiMessageW is different, see:
typedef struct MapiMessageW {
  ULONG            ulReserved;
  PWSTR            lpszSubject;
  PWSTR            lpszNoteText;
  PWSTR            lpszMessageType;
  PWSTR            lpszDateReceived;
  PWSTR            lpszConversationID;
  FLAGS            flFlags;
  lpMapiRecipDescW lpOriginator;
  ULONG            nRecipCount;
  lpMapiRecipDescW lpRecips;
  ULONG            nFileCount;
  lpMapiFileDescW  lpFiles;
}  *lpMapiMessageW;

Somehow I could only find this in the Google cache.
Attachment #9024216 - Flags: feedback?(VYV03354)
Some more:

typedef struct MapiRecipDescW {
  ULONG ulReserved;
  ULONG ulRecipClass;
  PWSTR lpszName;
  PWSTR lpszAddress;
  ULONG ulEIDSize;
  PVOID lpEntryID;
}  *lpMapiRecipDescW;

typedef struct MapiFileDescW {
  ULONG ulReserved;
  ULONG flFlags;
  ULONG nPosition;
  PWSTR lpszPathName;
  PWSTR lpszFileName;
  PVOID lpFileType;
}  *lpMapiFileDescW;
Adding MAPISendMailW is great, but you cannot remove MAPISendMail because this is an implementation of Microsoft Messaging Application Programming Interface (MAPI) and we have no control over callers. Some legacy callers might be hardcoding MAPISendMail.
Please use NS_CopyUnicodeToNative/NS_CopyNativeToUnicode (or use MultiByteToWideChar/WideCharToMultiByte directly) instead of depending on dumb FallbackEncoding.
OK, so with that hint, the fix is very simple. NS_CopyNativeToUnicode() internally uses MultiByteToWideChar() and that will use the correct code page.

I wonder which other bugs we have now due to the dumbing down of nsMsgI18NFileSystemCharset(). There are a few call sites.

We could use NS_CopyNativeToUnicode() in some call sites, but sadly that doesn't return a status, so we won't notice if something can't be encoded, for example here:
https://dxr.mozilla.org/comm-central/rev/2a29ee0adb310b54a6a2df72034953fed8f2b043/comm/mailnews/base/src/nsMessenger.cpp#1854

This needs a follow-up bug to check all those call sites. Here for example
https://dxr.mozilla.org/comm-central/rev/2a29ee0adb310b54a6a2df72034953fed8f2b043/comm/mailnews/addrbook/src/nsAbManager.cpp#827
we could just use NS_CopyUnicodeToNative.
Attachment #9024216 - Attachment is obsolete: true
Attachment #9024236 - Flags: review?(VYV03354)
Attachment #9024236 - Flags: review?(VYV03354) → review+
Pushed by mozilla@jorgk.com:
https://hg.mozilla.org/comm-central/rev/e1449ad9e4d6
Use NS_CopyNativeToUnicode() in MAPI to respect Windows code page. r=emk
Status: ASSIGNED → RESOLVED
Closed: 8 months ago
Resolution: --- → FIXED
Target Milestone: --- → Thunderbird 65.0
Attachment #9024236 - Flags: approval-comm-esr60+
Attachment #9024236 - Flags: approval-comm-beta+
Reporters, an unofficial English build of TB 60.3.1 is now available here:
https://queue.taskcluster.net/v1/task/J_C7Dh-AQY6DqO3uVgOSnA/runs/0/artifacts/public/build/install/sea/target.installer.exe
Please try it.
the original file name: ąęćżźłóńś

Subject: Wysyłanie wiadomości e-mail: ąęćżźłóńś

message: Wiadomość jest gotowa do wysłania wraz z następującymi załącznikami (plikami lub linkami):
ąęćżźłóńś


It seems that everything is fine. When can we expect an official update?
I hope within the next five days, sadly I don't set release dates myself.
I believe there is another issue which I will fix in bug 1506422. Please try this for me:

Send yourself a plaintext e-mail with only ą in it or save a draft. You can use Shift+Click "Write" if you're usually composing in HTML. Or you can use the message you produced above. Save the e-mail or draft as text file. Open that file in Notepad. I think you will see š.

We will save the file using ISO-8859-2 and Polish Windows will open the file assuming windows-1250 encoding.
Flags: needinfo?(mkasprowicz)
I do not know if I understood correctly, but this is what I did:
1. I sent the file using "Send to".
2. I saved this message as a text file and opened it in Notepad.

In the text file, I have it:

Subject:
Wysyłanie wiadomości e-mail: ąęćżźłóńś
From:
Marcin Kasprowicz <mkasprowicz@o2.pl>
Date:
11.11.2018, 19:57
To:
mkasprowicz@o2.pl

Wiadomość jest gotowa do wysłania wraz z następującymi załącznikami (plikami lub linkami):
ąęćżźłóńś



I did all of this using the build you provided.
Flags: needinfo?(mkasprowicz)
OK, thanks. Can you attach that text file here? I want to check which encoding it uses.
Thanks, for some reason this got saved as UTF-8 and not ISO-8859-2. So bug 1506422 wasn't a problem here.
I have another favour to ask: Address book export. Please do this:

Open the address book.
File > New > Address book. Call it whatever you want, like xxx.
Right-click on the new address book, New Contact. Call the person ąęćżźłóńś.
Export this address book: Tools > Export, choose "Comma Separated (System Charset)" - Not UTF-8.
Check the content of the file. If in doubt, attach it here. I actually get
&#261;&#281;&#263;&#380;&#378;&#322;ó&#324;&#347;
since the Polish characters can't be stored in my system charset.

You can of course delete that address book now. Thanks in advance.
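
The export result quoted above is what one gets when characters missing from the target charset are replaced by numeric character references. Below is a minimal Python sketch, assuming (this is not stated in the thread) that the commenter's system charset is windows-1252, in which only ó of these nine letters exists. It illustrates the effect and does not claim to show how Thunderbird implements the export.

```python
name = "ąęćżźłóńś"

# Assumption: a Western European system charset (windows-1252).
# Characters it cannot represent become &#NNN; references.
exported = name.encode("cp1252", errors="xmlcharrefreplace")
print(exported.decode("cp1252"))

# In the Polish system charset (windows-1250) every letter fits in one byte.
print(len(name.encode("cp1250")))
```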
Flags: needinfo?(mkasprowicz)
Flags: needinfo?(mkasprowicz)
Posted image contact.jpg
Thank you, pretty much what I got.
Duplicate of this bug: 1506794
TB 60.3.1 which contains the fix has now been released, Polish version here:
https://download.mozilla.org/?product=thunderbird-60.3.1-SSL&os=win&lang=pl