Closed Bug 1506422 Opened 10 months ago Closed 10 months ago

Replace use of nsMsgI18NFileSystemCharset() with NS_CopyUnicodeToNative/NS_CopyNativeToUnicode()

Categories

(MailNews Core :: General, enhancement)

enhancement
Not set

Tracking

(thunderbird_esr6064+ fixed, thunderbird64 fixed, thunderbird65 fixed)

RESOLVED FIXED
Thunderbird 65.0
Tracking Status
thunderbird_esr60 64+ fixed
thunderbird64 --- fixed
thunderbird65 --- fixed

People

(Reporter: jorgk, Assigned: jorgk)

Details

Attachments

(4 files, 1 obsolete file)

After bug 1381762 nsMsgI18NFileSystemCharset() uses a simple fallback scheme.

NS_CopyUnicodeToNative/NS_CopyNativeToUnicode() are much smarter about the native code page used by Windows.
Masatoshi-san, what do you think?

Sadly this doesn't fix
https://dxr.mozilla.org/comm-central/rev/2a29ee0adb310b54a6a2df72034953fed8f2b043/comm/mailnews/base/src/nsMessenger.cpp#1854
were we save a message to file.

So in the example of our Polish friends from the other bug, they will save as ISO-8859-2, and then Windows comes along and opens as windows-1250 and some characters will be wrong.

We had much discussion about this piece of code on bug 1271864 comment #52 and above. Perhaps it's time to always go to UTF-8 now. What do you think?
Assignee: nobody → jorgk
Status: NEW → ASSIGNED
Attachment #9024244 - Flags: feedback?(VYV03354)
(In reply to Jorg K (GMT+1) from comment #1)
> We had much discussion about this piece of code on bug 1271864 comment #52
> and above. Perhaps it's time to always go to UTF-8 now. What do you think?

I found that the Japanese localization already migrated to unicode:
https://hg.mozilla.org/l10n-central/ja/rev/11867217bdd2
So I concur. But please file a new bug about that.
Comment on attachment 9024244 [details] [diff] [review]
1506422-replace-nsMsgI18NFileSystemCharset.patch (v1)

nsMsgI18NConvertToUnicode is not using nsMsgI18NFileSystemCharset:
https://dxr.mozilla.org/comm-central/source/comm/mailnews/base/util/nsMsgI18N.cpp#67-93

It is not using FallbackEncoding either. Why is it relevant?
(In reply to Masatoshi Kimura [:emk] from comment #3)
> It is not using FallbackEncoding either. Why is it relevant?

Sorry, I don't understand the question. nsMsgI18NFileSystemCharset() is just too dump these days, so it's safer to use the Windows-aware NS_CopyUnicodeToNative/NS_CopyNativeToUnicode() where we can. What am I missing?
I think this should do it, unless I'm missing something.
Attachment #9024244 - Attachment is obsolete: true
Attachment #9024244 - Flags: feedback?(VYV03354)
Attachment #9024248 - Flags: review?(VYV03354)
(In reply to Masatoshi Kimura [:emk] from comment #2)
> I found that the Japanese localization already migrated to unicode:
> https://hg.mozilla.org/l10n-central/ja/rev/11867217bdd2
> So I concur. But please file a new bug about that.
This bug here is about removing some of the dangers of the nsMsgI18NFileSystemCharset() which is too dump, see the Polish example. So I don't see the need for a separate bug.
(In reply to Jorg K (GMT+1) from comment #4)
> Sorry, I don't understand the question. nsMsgI18NFileSystemCharset() is just
> too dump these days, so it's safer to use the Windows-aware
> NS_CopyUnicodeToNative/NS_CopyNativeToUnicode() where we can. What am I
> missing?

Is this code path specific to Windows? NS_CopyUnicodeToNative is just an alias of CopyUTF16toUTF8 on other platforms.
In general we should reduce using NS_CopyUnicodeToNative as much as possible. It makes lossy conversions on Windows. In MAPI case, it is absolutely required due to API compatibility, but we should not use it in general.
(In reply to Masatoshi Kimura [:emk] from comment #7)
> Is this code path specific to Windows? NS_CopyUnicodeToNative is just an
> alias of CopyUTF16toUTF8 on other platforms.
Right. The code path is not Windows-specific. However, how do we solve the situation of nsMsgI18NFileSystemCharset() coming up with the "wrong" charset? Previously it was based on nsIPlatformCharset.getCharset(). But as the Polish MAPI case shows, it will return ISO8859-2 instead of windows-1250. So we encode in the former, and when the user opens the file in Windows with Notepad, the content will look wrong. That's will be true for address book export (nsAbManager.cpp) and message saving, well, that one we switched to UTF-8 anyway. The code in nsMsgSend.cpp was also wrong since it translates a native filename to UTF-16.

So I think the question comes down to: Which encoding do Linux and Mac use for a) file names b) text file content. I think we discussed this for Linux file names before and we can't tell, but UTF-8 is the best guess. Mac file names are UTF-8 (right?) and I don't know what the answer for the content is.
And the code in nsMsgFilter*.cpp is also about filenames.
(In reply to Masatoshi Kimura [:emk] from comment #8)
> In general we should reduce using NS_CopyUnicodeToNative as much as
> possible. It makes lossy conversions on Windows.
Indeed. That's why I'm only using it in the address book export where the user also has the option to export as UTF-8. "Native" export is needed since some MS programs can only import native files (bug 117236). In message export (nsMessenger.cpp) I always use UTF-8 now to avoid loss of information. As stated above, we can't use the simplified fallback charset.
Marcin and Alice, can you please do me a favour:

Download either the ZIP file
https://queue.taskcluster.net/v1/task/SrMd-0zhRaGtEDR2cuiUMg/runs/0/artifacts/public/build/target.zip
or the installer
https://queue.taskcluster.net/v1/task/SrMd-0zhRaGtEDR2cuiUMg/runs/0/artifacts/public/build/install/sea/target.installer.exe
for a special version of TB 60.3.1 from my try build:
https://treeherder.mozilla.org/#/jobs?repo=try-comm-central&selectedJob=211713183&revision=82a16ce05e7ef06f16ca82ef6246ad7809e35df3

Hint for Marcin: It's easier to get the ZIP file, unpack it to some C:\Temp directory and then open a DOS shell to that directory and type "thunderbird.exe". You can run it on your production profile without any risk.

Then please create a new address book as per bug 1505315 comment #37 and create a new contact with either Polish (Marcin) or Japanese (Alice) content. Then export as CSV using the system charset (not unicode). I'm curious to see the results.

On English Windows I get aeczzlóns for ąęćżźłóńś which is a reasonable approximation and better than the current
ąęćżźłóńś
Flags: needinfo?(mkasprowicz)
Flags: needinfo?(alice0775)
(In reply to Jorg K (GMT+1) from comment #12)
> Marcin and Alice, can you please do me a favour:
> 
> Download either the ZIP file
> https://queue.taskcluster.net/v1/task/SrMd-0zhRaGtEDR2cuiUMg/runs/0/
> artifacts/public/build/target.zip
> or the installer
> https://queue.taskcluster.net/v1/task/SrMd-0zhRaGtEDR2cuiUMg/runs/0/
> artifacts/public/build/install/sea/target.installer.exe
> for a special version of TB 60.3.1 from my try build:
> https://treeherder.mozilla.org/#/jobs?repo=try-comm-
> central&selectedJob=211713183&revision=82a16ce05e7ef06f16ca82ef6246ad7809e35d
> f3
> 
> Hint for Marcin: It's easier to get the ZIP file, unpack it to some C:\Temp
> directory and then open a DOS shell to that directory and type
> "thunderbird.exe". You can run it on your production profile without any
> risk.
> 
> Then please create a new address book as per bug 1505315 comment #37 and
> create a new contact with either Polish (Marcin) or Japanese (Alice)
> content. Then export as CSV using the system charset (not unicode). I'm
> curious to see the results.
> 
> On English Windows I get aeczzlóns for ąęćżźłóńś which is a reasonable
> approximation and better than the current
> ąęćżźłóńś

I ran the trybuild in Japanese Edition of Windows10 (Shift_JIS),
I got "______o__" instead "ąęćżźłóńś" using the trybuild. So, The try build seems to convert all unicode to 0x5f.
Flags: needinfo?(alice0775)
Flags: needinfo?(mkasprowicz)
Attached image contact 15-11-2018.jpg
I installed it from the installer - the word DOS frightened me;) - see if that's what you meant.
Marcin: Thank you. The CSV looks perfect to me if opened in a text editor and viewed in windows-1250. Do you agree? Have you opened it in Notepad or Notepad++? I see ąęćżźłóńś,ąęćżźłóńś,ąęćżźłóńś ąęćżźłóńś,ąęćżźłóńś,ąęćżźłóńś@gmail.com,ąęćżźłóńś.

Alice: Thank you, but there has been a misunderstanding. I wanted you to try some Japanese text and see what happens. The resulting CSV should be encoded in Shift_JIS/Windows-932 and look good in a text editor. Yes, any character that can't be represented is replaced by a _, so you got those for the Polish characters which are not in your "local" charset. But it should look fine for Japanese.
Flags: needinfo?(alice0775)
Masatoshi-san, I think the experiments we're doing here (will) show that this patch helps with address book export. We already discussed switching message export to UTF-8 exclusively. If you prefer, I can find another reviewer.
Flags: needinfo?(VYV03354)
I opened in notepad, wordpad, libreoffice calc, everything was fine. Good job !
(In reply to Jorg K (GMT+1) from comment #17)
> 
> Alice: Thank you, but there has been a misunderstanding. I wanted you to try
> some Japanese text and see what happens. The resulting CSV should be encoded
> in Shift_JIS/Windows-932 and look good in a text editor. Yes, any character
> that can't be represented is replaced by a _, so you got those for the
> Polish characters which are not in your "local" charset. But it should look
> fine for Japanese.

Okay, see attached. So, Polish characters is broken, but Japanese characters is OK with exported csv in Japanese Edition of Windows10.
Flags: needinfo?(alice0775)
Flags: needinfo?(VYV03354)
Attachment #9024248 - Flags: review?(VYV03354) → review+
Pushed by mozilla@jorgk.com:
https://hg.mozilla.org/comm-central/rev/4a063ea24a79
Replace use of nsMsgI18NFileSystemCharset() with NS_CopyUnicodeToNative/NS_CopyNativeToUnicode(). r=emk
Status: ASSIGNED → RESOLVED
Closed: 10 months ago
Resolution: --- → FIXED
Attachment #9024248 - Flags: approval-comm-esr60+
Attachment #9024248 - Flags: approval-comm-beta+
Some more reading post landing. Henri gives a good account in bug 1381762 comment #6 (copied completely here):

===
The purpose of mozilla::dom::FallbackEncoding is to model the legacy encoding for *consuming* unlabeled HTML content on the Web. While it is strongly correlated with Windows "ANSI code page" concept, it's not the same e.g. for Polish (for questionable historical reasons).

On cursory look, it seems that mailnews uses "file system charset" for operations that involve paths of some kind (unclear to me what kind exactly) e.g. at https://searchfox.org/comm-central/source/mailnews/base/search/src/nsMsgFilter.cpp#814 . For file system paths, the things to use are NS_CopyNativeToUnicode() and NS_CopyUnicodeToNative(). These use the glibc environment on Linux and use the pre-NTFS legacy encoding on Windows. (Windows code really should be using the W variants of APIs for file system paths.)

That is to say: All this stuff could use a review of *why* non-UTF-8 is used and what the legacy source of non-UTF-8ness is in each case. mozilla::dom::FallbackEncoding is unlikely to be the right answer for all of them.
===

So according to this, replacing the dumbed-down (by bug 1381762) nsMsgI18NFileSystemCharset() with NS_CopyUnicodeToNative/NS_CopyNativeToUnicode() was the right thing to do.

Looking at the removal of nsIPlatformCharset here https://hg.mozilla.org/mozilla-central/rev/436dbeb1d9b4, I can see that:
a) Mac used to return UTF-8, so no change with this patch
   https://hg.mozilla.org/mozilla-central/rev/436dbeb1d9b4#l3.32
   In fact, using that is more correct than the locale's fallback.
b) Windows NS_CopyNativeToUnicode() is using the correct Windows code page, so much better than the fallback
   and returning to the old nsIPlatformCharset.
c) Linux: Hard to tell what the original code did, but it looks like it assumed mostly UTF-8 according to
   https://hg.mozilla.org/mozilla-central/rev/436dbeb1d9b4#l11.127 although this list here
   https://hg.mozilla.org/mozilla-central/rev/436dbeb1d9b4#l12.22
   tells a different story. I can't see a reference to "glibc" anywhere.
Comment on attachment 9024248 [details] [diff] [review]
1506422-replace-nsMsgI18NFileSystemCharset.patch (v2)

Review of attachment 9024248 [details] [diff] [review]:
-----------------------------------------------------------------

::: mailnews/addrbook/src/nsAbManager.cpp
@@ -829,5 @@
> -                                               valueCStr);
> -              NS_ENSURE_SUCCESS(rv,rv);
> -
> -              if (NS_FAILED(rv)) {
> -                NS_ERROR("failed to convert string to system charset.  use LDIF");

Can't this happen with the new code?

::: mailnews/compose/src/nsMsgSend.cpp
@@ -2488,5 @@
>          nsString errorMsg;
> -        nsresult rv = nsMsgI18NConvertToUnicode(nsMsgI18NFileSystemCharset(),
> -                                                m_attachments[i]->m_realName,
> -                                                attachmentFileName);
> -        if (attachmentFileName.IsEmpty() && m_attachments[i]->mURL) {

Why can we drop this block?
Thanks for looking this over.

(In reply to :aceman from comment #24)
> Can't this happen with the new code?
This block
-              NS_ENSURE_SUCCESS(rv,rv);
-
-              if (NS_FAILED(rv)) {
-                NS_ERROR("failed to convert string to system charset.  use LDIF");
-                valueCStr = "?";
was of course nonsense, since the |if (NS_FAILED(rv)) {| block was unreachable.

CopyUTF16toUTF8() and NS_CopyUnicodeToNative() don't fail, hence the removal of the error checking. NS_CopyUnicodeToNative() on Windows is lossy, see comment #17, last paragraph. Characters that can't be converted are turned into "_", but that's OK since in the case of address book export, the user can choose lossless UTF-8 export. The "native" export is only there for some M$ programs that can't import UTF-8 CSV yet, details in bug 117236 (mentioned in comment #11).

NS_CopyNativeToUnicode() also doesn't fail, Linux/Mac version here:
https://searchfox.org/mozilla-central/rev/647b9eb1099e373e079e16f147f74f3d1d65eed3/xpcom/io/nsNativeCharsetUtils.cpp#19
Windows version here:
https://searchfox.org/mozilla-central/rev/647b9eb1099e373e079e16f147f74f3d1d65eed3/xpcom/io/nsNativeCharsetUtils.cpp#45
Note, the code of NS_CopyUnicodeToNative() is just below and also returns NS_OK always.

> ::: mailnews/compose/src/nsMsgSend.cpp
> @@ -2488,5 @@
> >          nsString errorMsg;
> > -        nsresult rv = nsMsgI18NConvertToUnicode(nsMsgI18NFileSystemCharset(),
> > -                                                m_attachments[i]->m_realName,
> > -                                                attachmentFileName);
> > -        if (attachmentFileName.IsEmpty() && m_attachments[i]->mURL) {
> 
> Why can we drop this block?
The idea is that
+        NS_CopyNativeToUnicode(m_attachments[i]->m_realName, attachmentFileName);
doesn't fail and that m_attachments[i]->m_realName contains a non-empty native filename. So the converted file name will also not be empty. In case of a crazy Linux system using filenames in, say, windows-1252, and all highbit characters > 0x7F are in fact invalid unicode. In that case the the attachmentFileName would still not be empty since CopyUTF8toUTF16() used inside NS_CopyNativeToUnicode():
https://searchfox.org/mozilla-central/rev/647b9eb1099e373e079e16f147f74f3d1d65eed3/xpcom/io/nsNativeCharsetUtils.cpp#21
uses replacement characters:
https://searchfox.org/mozilla-central/rev/647b9eb1099e373e079e16f147f74f3d1d65eed3/xpcom/string/nsReadableUtils.h#214

So these changes are fine. 

I was more worried about the replacement of the total-nonsense code here ... and here:
https://hg.mozilla.org/comm-central/rev/4a063ea24a79#l3.57
https://hg.mozilla.org/comm-central/rev/4a063ea24a79#l3.86
as I wasn't sure what
-  nsTextFormatter::ssprintf(unicodeString, unicodeFormatter, value.get());
was doing. It looked like a "silly" (see comment above) UTF-8 to UTF-16 conversion.
You need to log in before you can comment on or make changes to this bug.