Closed Bug 1271864 Opened 8 years ago Closed 7 years ago

Saved message truncated if it contains a character that can't be encoded with the nsMsgI18NFileSystemCharset()

Categories

(Thunderbird :: Message Reader UI, defect, P1)

45 Branch
x86_64
Windows
defect

Tracking

(thunderbird51 wontfix, thunderbird52 fixed, thunderbird53 fixed)

RESOLVED FIXED
Thunderbird 53.0
Tracking Status
thunderbird51 --- wontfix
thunderbird52 --- fixed
thunderbird53 --- fixed

People

(Reporter: techadmin, Assigned: jorgk-bmo)

Details

(Keywords: regression)

Attachments

(5 files, 1 obsolete file)

User Agent: Mozilla/5.0 (Windows NT 10.0; WOW64; Trident/7.0; .NET4.0C; .NET4.0E; .NET CLR 2.0.50727; .NET CLR 3.0.30729; .NET CLR 3.5.30729; rv:11.0) like Gecko

Steps to reproduce:

Right-click a message in the e-mail list pane and save the message in text format.
Thunderbird version 45.0


Actual results:

The e-mail text includes the "ü" character.
The saved text file does not contain the text after "ü".

example

<E-mail>
Thank you for your contract note.

Please send asap by air, is it possible this week?

Meilleures salutations/mit freundlichen Grüßen/Best regards 
(after omit)

<Text File>
Thank you for your contract note.

Please send asap by air, is it possible this week?

Meilleures salutations/mit freundlichen Gr


Expected results:

All of the text in the e-mail is saved to the text file.
Severity: normal → major
Priority: -- → P1
OS: Unspecified → Windows
Hardware: Unspecified → x86_64
Not a security issue.
Group: mail-core-security
I can't reproduce this (and I'm German, so I have e-mail with ü to save).
Can you attach the message that fails to save correctly?
Save as .eml, then attach to the bug.

Which program are you using to view the saved text file?
Severity: major → normal
Attached file E-Mail File
This is E-Mail File.
(In reply to Jorg K (PTO during summer, NI me) from comment #2)
> I can't reproduce this (and I'm German, so I have e-mail with ü to save).
> Can you attach the message that fails to save correctly?
> Save as .eml, then attach to the bug.
> 
> Which program are you using to view the saved text file?

Thank you.
The viewer programs are Hidemaru, Notepad, WordPad, and IE.

The versions we are currently using are Thunderbird 45.0 and 45.1,
on Windows 10 Pro x64 and Windows 7 Pro x64.

With Thunderbird 35.* on Windows 7 Pro, this bug did not occur.
I have no idea how you managed to export this file like that.
Standard saving from TB doesn't export all the mail headers.
In the attachment you can see that saving as a text file works. Most likely you're using an add-on which stopped working in TB 45. Here's a list of the ones we know are incompatible:

Conversations, MoreFunctionsforAddressBook, Mnenhy, Quicktext, QuoteAndComposeManager. I've read that HeaderToolsLite is also incompatible, but I can't confirm that.

Run TB in safe-mode (add-ons turned off, Help menu) and you will see that it works.
Status: UNCONFIRMED → RESOLVED
Closed: 8 years ago
Resolution: --- → INVALID
Whiteboard: [addon]
(In reply to Jorg K (PTO during summer, NI me) from comment #6)
> Created attachment 8752079 [details]
> Re  Contract note.txt saved by JK
> 
> I have no idea how you managed to export this file like that.
> Standard saving from TB doesn't export all the mail headers.
> In the attachment you can see that saving as a text file works. Most likely
> you're using an add-on which stopped working in TB 45. Here's a list of the
> ones we know are incompatible:
> 
> Conversations, MoreFunctionsforAddressBook, Mnenhy, Quicktext,
> QuoteAndComposeManager. I've read that HeaderToolsLite is also incompatible,
> but I can't confirm that.
> 
> Run TB in safe-mode (add-ons turned off, Help menu) and you will see that it
> works.

Thank you for your reply.

I exported the mail with TB in safe mode, but the results were the same.

We are using the Japanese edition, and the default language is Japanese.
The setting to export all mail headers is in the menu bar under 'View' → 'Headers'.
Oh, sorry, I was completely wrong then ;-(
Well, I exported the e-mail with all its headers and there was no problem.

Maybe there is a problem with the Japanese version. Or there is a problem with the unicode setup on your system. Could you install an en-US version and try it there?
Status: RESOLVED → REOPENED
Ever confirmed: true
Resolution: INVALID → ---
Summary: I saved e-mail message include character "ü" to text format,but after "ü" sentence can't save → Saving UTF-8 e-mail message that includes character "ü" to text format, truncates before the ü (Japanese version of TB)
Whiteboard: [addon]
(In reply to Jorg K (PTO during summer, NI me) from comment #8)
> Oh, sorry, I was completely wrong then ;-(
> Well, I exported the e-mail with all its headers and there was no problem.
> 
> Maybe there is a problem with the Japanese version. Or there is a problem
> with the unicode setup on your system. Could you install an en-US version and
> try it there?

Thank you for your reply.
I installed the en-US version of Thunderbird from the official en-US site below on another computer (Windows 7 Pro x64, Japanese):

https://www.mozilla.org/en-US/thunderbird/

From the mail list pane, I used the right-click menu "Save As..." to save in text format,
but the results were the same.
As I said, I saved attachment 8751993 [details] using an en-US version of TB on an English Win 7 (x64) system with no problem.
If you check:
Control Panels > Region and Language, Administrative tab, Language for non-Unicode programs. What have you set there?

Try changing that to English (United States). That requires a restart.

Chiaki-san, sorry, I keep forgetting, do you have a Windows system? Could you try saving the e-mail from attachment 8751993 [details] as a text file and see whether it gets truncated before the "ü"?

We made changes between TB 38 and TB 45 related to the encoding engine, see bug 1202401. But I wouldn't expect effects like this.
Flags: needinfo?(ishikawa)
Status: REOPENED → UNCONFIRMED
Ever confirmed: false
Thank you for your reply.

(In reply to Jorg K (PTO during summer, NI me) from comment #10)
> As I said, I saved attachment 8751993 [details] using a en-US version of TB
> on an English Win 7 (x64) system with no problem.
> If you check:
> Control Panels > Region and Language, Administrative tab, Language for
> non-Unicode programs. What have you set there?
> 
> Try changing that to English (United States). That requires a restart.

After changing this setting and restarting, Save as Text works without problems.
Well, OK, but Mozilla software *is* Unicode aware, so this setting shouldn't matter.

We'll look into it further. Thanks for the report.

Magnus, why do strings get truncated when the Windows system locale is not en-US? Any ideas?
Flags: needinfo?(mkmelin+mozilla)
Summary: Saving UTF-8 e-mail message that includes character "ü" to text format, truncates before the ü (Japanese version of TB) → Saving UTF-8 e-mail message that includes character "ü" to text format, truncates before the ü (if Windows locate is *not* English but Japanese).
(In reply to Jorg K (PTO during summer, NI me) from comment #10)

> Chiaki-san, sorry, I keep forgetting, do you have a Windows system? Could
> you try saving the e-mail from attachment 8751993 [details] as a text file
> and see whether it gets truncated before the "ü"?
> 
Hi,
I think I have encountered a different problem.
I saved the attachment as a file (.eml) file from firefox.
I opened it from TB (open a message).
Then when I tried to save it using (save under a new name) which is control-S (I think),
nothing happens. No menu for file chooser (!)
I am afraid TB does not want to save this e-mail file. I have no idea.
Maybe I have to edit it (change it) before TB would agree to save it (?)
No, I can't even seem to edit it.
Hmm... I have no idea what is going on :-(

This windows TB 45.10 (Japanese version) under Windows 64-bit (japanese).

[I run a version of TB under Linux on different PC. So I get to see the bugs of both versions :-) ]

OK, I restarted TB, and this time, I could edit the message and then when I tried to save it I could save it under a different name, but egad, I was composing HTML e-mail and then TB offered to save the e-mail as HTML document !?
Anyway, when I looked at the saved file using firefox, at least no truncation occurs.
(Yes, it WAS in HTML. I looked at the saved file using emacs.)
Strange things do happen, don't they?
(In reply to ISHIKAWA, Chiaki from comment #13)

> Strange things do happen, don't they?

Silly me. I think this is because I did not change the from field, etc.
Obviously TB thinks this is not something MY account should handle because it contained somebody else's e-mail address NOT understood by any of my account settings, 
and refused to save,
etc. initially.
But it should WARN this loudly IMHO either when I opened it OR
when I tried to edit it AND save it.
Oh well. Should I file an RFE ?

TIA
Flags: needinfo?(ishikawa)
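Dear ISHIKAWA, Chiaki,

Thank you for your reply.

Since you appear to understand Japanese, I will summarize the symptoms again
in Japanese for the sake of accuracy.
I apologize if this duplicates effort.

For an e-mail whose body contains ü, or more precisely üß,
right-clicking it in the message list pane, choosing "Save Message",
and saving in text format produces a file in which nothing from ü onward is written.
The original e-mail and the output are attached.

If you want to reproduce the issue,
save the first attached text somewhere with the .eml extension
and drag and drop it onto the message list pane.

Alternatively, you can copy the following text into a new e-mail and send it to an account that you can receive with TB.

Thank you for your contract note.

Please send asap by air, is it possible this week?

Meilleures salutations/mit freundlichen Grüßen/Best regards

If you right-click the e-mail shown in the list pane and choose "Save Message",
you can change the save format to text file.

At my company, we have confirmed this issue on three machines:

Windows 10 Pro 64-bit, Japanese edition: 1 machine
Windows 7 Pro 64-bit, Japanese edition: 2 machines

The issue occurred with Thunderbird 45.0 and 45.1.
It did not occur with Windows 7 Pro 64-bit and TB 35.*.

This time I also tried two other message bodies.

(1) Saving as text an e-mail that contains ü, where ü is not followed by ß
Body:
"
aaaaaaaaa

as aaaüber

oooo 
"

Result:
"
(header information)
aaaaaaaaa

as aaaber

oooo"

The file was saved with only the ü missing.


(2) Saving as text an e-mail that contains ü followed by ß
Body:
"test Grüßen test"

Result:
"(header information)
test Gr"

As in the first case, the saved file is cut off from ü onward.
The üß sequence looks suspicious.

The character encoding of the e-mails is Unicode in all cases,
and the output text files are all Shift-JIS with CR+LF line endings.

This is a quick summary,
but I hope it helps the investigation.
Thank you.
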

(In reply to ISHIKAWA, Chiaki from comment #13)
> (In reply to Jorg K (PTO during summer, NI me) from comment #10)
> 
> > Chiaki-san, sorry, I keep forgetting, do you have a Windows system? Could
> > you try saving the e-mail from attachment 8751993 [details] as a text file
> > and see whether it gets truncated before the "ü"?
> > 
> Hi,
> I think I have encountered a different problem.
> I saved the attachment as a file (.eml) file from firefox.
> I opened it from TB (open a message).
> Then when I tried to save it using (save under a new name) which is
> control-S (I think),
> nothing happens. No menu for file chooser (!)
> I am afraid TB does not want to save this e-mail file. I have no idea.
> Maybe I have to edit it (change it) before TB would agree to save it (?)
> No, I can't even seem to edit it.
> Hmm... I have no idea what is going on :-(
> 
> This windows TB 45.10 (Japanese version) under Windows 64-bit (japanese).
> 
> [I run a version of TB under Linux on different PC. So I get to see the bugs
> of both versions :-) ]
> 
> OK, I restarted TB, and this time, I could edit the message and then when I
> tried to save it I could save it under a different name, but egad, I was
> composing HTML e-mail and then TB offered to save the e-mail as HTML
> document !?
> Anyway, when I looked at the saved file using firefox, at least no
> truncation occurs.
> (Yes, it WAS in HTML. I looked at the saved file using emacs.)
> Strange things do happen, don't they?
Might be because the Japanese localization sets mailnews.disable_fallback_to_utf8.<charset> true? 
<charset> likely being ISO-2022-JP
Flags: needinfo?(mkmelin+mozilla)
Toshiyuki-san, could you check preferences that start with mailnews.disable_fallback_to_utf8 in the config editor. Make sure mailnews.disable_fallback_to_utf8.ISO-2022-JP is false.
Summary: Saving UTF-8 e-mail message that includes character "ü" to text format, truncates before the ü (if Windows locate is *not* English but Japanese). → Saving UTF-8 e-mail message that includes character "ü" to text format, truncates before the ü (if Windows locale is *not* English but Japanese).
(In reply to Jorg K (PTO during summer, NI me) from comment #17)
> Toshiyuki-san, could you check preferences that start with
> mailnews.disable_fallback_to_utf8 in the config editor. Make sure
> mailnews.disable_fallback_to_utf8.ISO-2022-JP is false.

Thank you for your reply

Copy from my config setting ↓

>mailnews.disable_fallback_to_utf8.ISO-2022-JP;false
So, so far only setting the Windows locale (comment #11) has helped, right?
(In reply to Jorg K (PTO during summer, NI me) from comment #19)
> So, so far only setting the Windows locale (comment #11) has helped, right?

Thank you for your reply.

On both machines,
"Windows 7 Pro 64-bit, TB en-US version, Windows locale en-US" and
"Windows 10 Pro 64-bit, TB Japanese version, Windows locale Japanese",
the config setting is "mailnews.disable_fallback_to_utf8.ISO-2022-JP;false".
I will try to see what happens on my computer and also with the local linux build with the suggested test. Maybe I can see what is going on more easily there.


(In reply to Toshiyuki Tanigawa from comment #15)

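> Dear ISHIKAWA, Chiaki,
> 
> Thank you for your reply.
> 
> Since you appear to understand Japanese, I will summarize the symptoms again
> in Japanese for the sake of accuracy.
> I apologize if this duplicates effort.
> 

I will try it. I will also test with the binary on Linux.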
OK, I can confirm that problem happens.

I opened the attachment in my TB (Japanese version) 
under Windows 10 64-bit (Japanese environment)

I edited the message as new and changed from and to field to that of my address.
I sent it.

When I received it, here is what I did.

I right-clicked to open the context-sensitive menu.
I chose
save the message (I don't know the English menu message right now).

NEXT STEP IS IMPORTANT:
When the file chooser appears, I selected the
.TXT file type (!) [saving .eml file does not show the problem.]

I saved the text file. Voila. The message is truncated as has been reported.

mailnews.disable_fallback_to_utf8.ISO-2022-JP;false

I think the issue has something to do with the fact the message is UTF8, but the saved text seems to be simple SJIS (MS-KANJI) file type. Hmm...

OK, what happens with linux build.
Status: UNCONFIRMED → NEW
Ever confirmed: true
This is with linux build. (local build based on 10-day old C-C tree or something like that.)

Windows and linux versions differ in a crucial place.

When I try saving a message (by right clicking),
- under Windows, the windows chooser shows a few different file types (.eml, .html, and .txt type) and text type causes the problem, whereas

- under Linux, the file chooser shows a plain vanilla all file type by default and I can select Mail file (presumably .eml), html and txt (I am using Debian GNU/Linux and the desktop system is I think a variant of Gnome) 

I got curious and checked what happens if I select the file type explicitly.
- Chose Mail file  ... file chooser shows the extension of .eml.
- Chose HTML type  ... file chooser still shows the extension of .eml.
- Chose TEXT type  ... file chooser still shows the extension of .eml.
- Chose plain vanilla type shown by default 

Anyway what I obtained for the chosen file type:
Mail file:
File is encoded as unicode (and DOS crlf ending according to Emacs editor I used).
The file contains all the headers (I think). There was no truncation.

HTML type:
File is encoded as unicode, but this time, the line ending is simple LF only (again according to Emacs).
The file does not contain any mail headers, only the main message text PLUS Subject, From:, Date:, To: all in HTML constructs. The whole file is in HTML format.
There was no truncation.

Text type:
File is encoded as unicode, but this time the line ending is simple LF only (according to Emacs).
The file does not contain any mail headers except for Subject, From:, Date:, To:
*BUT* these header lines have a tendency to contain NEWLINE immediately after ":" which makes the text file very awkward to use WITHOUT EDITING. I think I have noticed this before somewhere.
Maybe this should be filed as a bug(!)
E.g.,
--- quote ---

Subject:
Re: Contract note
From:
ishikawa <ishikawa@localhost>
Date:
2016年05月20日 01:11
To:
ishikawa <ishikawa@localhost>

--- end quote ---
(BTW, this NEWLINE immediately after ":" occurs under Windows 10 when I save text file.)

There is no truncation after u umlaut.

Plain vanilla all type:
This produces the same file as when I selected Mail file type.


I think the fact that Windows uses SJIS (MS-KANJI) for
saving text files is one cause of the problem.
(In this particular case, since the text from the u umlaut onward is gone, no non-ASCII character remains, and Emacs did not show any particular character code when I visited the saved text file under Windows 10.)
When I saved a Japanese message as a text file, it was indeed saved as an SJIS (MS-KANJI) file
with DOS CRLF ending.
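
The encoding and line-ending observations above were made by eye in Emacs. A rough equivalent can be scripted. The sketch below is a heuristic only: the helper name `describe_saved_file` is made up for illustration, and Python's `cp932` codec is assumed to stand in for SJIS/MS-Kanji. A file that was truncated before the "ü" is pure ASCII, so it cannot reveal which charset was intended.

```python
def describe_saved_file(data):
    """Guess the encoding family and line-ending style of raw file bytes."""
    if all(b < 0x80 for b in data):
        encoding = "ascii"          # no byte above 0x7F at all
    else:
        try:
            data.decode("utf-8")
            encoding = "utf-8"
        except UnicodeDecodeError:
            try:
                # Assumption: cp932 approximates SJIS/MS-Kanji.
                data.decode("cp932")
                encoding = "cp932"
            except UnicodeDecodeError:
                encoding = "unknown"

    crlf = data.count(b"\r\n")
    lf_only = data.count(b"\n") - crlf  # LF bytes not preceded by CR
    if crlf and not lf_only:
        newline = "CRLF"
    elif lf_only and not crlf:
        newline = "LF"
    elif crlf and lf_only:
        newline = "mixed"
    else:
        newline = "none"
    return encoding, newline


# Example: a UTF-8 file with DOS line endings.
print(describe_saved_file("Grüßen\r\n".encode("utf-8")))
```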

Hope this helps.
Given that we have not heard many German users (where TB usage is high)
complain about the issue,
I think it could be related to the Japanese Windows version only (argh).

We may have to forgo using the system-supplied default character encoding for various output routines under Windows
and set UTF-8 explicitly (this may or may not be an issue here in Japan...),
but I think it is OK. Unlike mail clients of various origins, the editors supplied with Windows itself have been capable of handling UTF-8 for a long time now.
(In reply to ISHIKAWA, Chiaki from comment #23)

> I think that Windows uses  SJIS (MS-KANJI) for
> saving text file is one cause of the problem, I think?
> (In this particular case, since there is no non-ASCII range character since
> text after u umlaut inclusively is gone, Emacs did not show any particular
> character code when I visited the saved text file under windows 10.)  
> When I saved a Japanese messaage as text file, it was indeed saved as SJIS
> (MS-KANJI) file with DOS CRLF ending.

Thanks. Did you try setting the locale to en-US as per comment #10?

Can you find out where the truncation happens. The reporter says that this wasn't an issue in TB 38. We've changed some encoding in TB 45 in bug 1202401.

I guess it happens here:
https://dxr.mozilla.org/comm-central/source/mailnews/base/util/nsMsgI18N.cpp#34
https://dxr.mozilla.org/comm-central/source/mailnews/base/util/nsMsgI18N.cpp#99 <== Failure here.

Maybe we get the charset for text encoding from the system
https://dxr.mozilla.org/comm-central/source/mailnews/base/util/nsMsgI18N.cpp#186
and UTF-8 including German ÄÖÜäöü might not be encodable in the system charset (Shift-JIS/MS-Kanji).
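
Whether a given line survives encoding in the system charset can be checked outside Thunderbird. A minimal sketch in Python, assuming that Python's `cp932` codec is a reasonable stand-in for the Shift-JIS/MS-Kanji charset a Japanese Windows locale would report (the helper name `unmappable_chars` is hypothetical and not part of Thunderbird):

```python
def unmappable_chars(text, charset):
    """Return the characters of `text` that `charset` cannot encode."""
    missing = []
    for ch in text:
        try:
            ch.encode(charset)
        except UnicodeEncodeError:
            missing.append(ch)
    return missing


line = "Meilleures salutations/mit freundlichen Grüßen/Best regards"
# Assumption: cp932 approximates the Japanese system charset.
print(unmappable_chars(line, "cp932"))
# A Western European charset for comparison.
print(unmappable_chars(line, "windows-1252"))
```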
(In reply to Jorg K (PTO during summer, NI me) from comment #25)
> (In reply to ISHIKAWA, Chiaki from comment #23)
> 
> > I think that Windows uses  SJIS (MS-KANJI) for
> > saving text file is one cause of the problem, I think?
> > (In this particular case, since there is no non-ASCII range character since
> 
> Thanks. Did you try setting the locale to en-US as per comment #10?

No I did not. I assume with that setting it would work as reported.
(Maybe on next reboot of the PC).

> Can you find out where the truncation happens. The reporter says that this
> wasn't an issue in TB 38. We've changed some encoding in TB 45 in bug
> 1202401.

This is a little difficult. I mean, I don't have Windows build environment.
Oh, you mean bisecting? I think I can do that over the weekend.

> I guess it happens here:
> https://dxr.mozilla.org/comm-central/source/mailnews/base/util/nsMsgI18N.
> cpp#34
> https://dxr.mozilla.org/comm-central/source/mailnews/base/util/nsMsgI18N.
> cpp#99 <== Failure here.
> 
I looked at the code here.
I suspect that once the error [I mean the failure to find mapping] happens, the conversion is given up there.
BTW, in bug 1202401, I saw a comment that says "You should emphasize here that NS_ERROR_UENC_NOMAPPING is a success code, despite its name."
(bug 1202401 comment 6)
A tricky code, indeed.

> Maybe we get the charset for text encoding from the system
> https://dxr.mozilla.org/comm-central/source/mailnews/base/util/nsMsgI18N.
> cpp#186

It looks that way, although there are some obscure mentions of this in bug 1202401 that read like gibberish to me, since I am not very familiar with how charsets are set and used inside TB.

> and UTF-8 including German ÄÖÜäöü might not be encodable in the system
> charset (Shift-JIS/MS-Kanji).

I don't think umlaut characters are representable in shift-JIS/MS-Kanji.
So that would be the problem.

So it would be great if someone with Windows build environment can produce a debug version that prints out the charset value picked up at
> https://dxr.mozilla.org/comm-central/source/mailnews/base/util/nsMsgI18N.
> cpp#186

If this charset is shift-JIS/MS-Kanji, we should override it into utf-8, I think.
This is a kludge to work against the Windows code trying to be clever in Japanese locale setting.

At the same time, bug 1202401 seems to refer to a framework of trying a charset conversion and if it fails, trying another generic code conversion. But I suspect that framework does not get implemented.
I thought I would bisect and find where the issue occurred by downloading the old TB binaries from
https://ftp.mozilla.org/pub/thunderbird/releases/

At the same time, I did not want to mess up my usual profile and decided to
create a new profile.
thunderbird.exe -p 
did the trick to invoke profile manager (and I created the profile with minimum setup.)

However, I also do not want to connect to my ordinary mail account either.

Well, there shows up a problem.

When I tried 38.8.0, the install insists that I set up an e-mail account and even when I try to go the "manual setup" way, it insists on testing the connection and unless the connection is good, it won't let me finish the setup successfully.
Yeah, this would be good for "TB for dummies".

But I don't want to do anything more than just downloading and opening the problematic .eml file, etc., that is a simple testing for this bug.

Any thought?

Someone with a spare e-mail account can test this (for bi-secting purposes) under Japanese windows if there is nothing simpler than this crazy "TB for dummies" mail setup, which incidentally was offensive to some system admin types if I recall from reading posts to newsgroups and mailing lists.

I give up unless there is something simpler: just opening the .eml file and putting it in my Inbox or wherever, without setting up an extra e-mail account just for this.

Thank you.
The best way to test various older version is to have a test profile that you can select with -p. In the test profile you have to configure an account, so I usually just use an IMAP account that doesn't interfere with my POP mail.
(In reply to Jorg K (PTO during summer, NI me) from comment #28)
> The best way to test various older version is to have a test profile that
> you can select with -p. In the test profile you have to configure an
> account, so I usually just use an IMAP account that doesn't interfere with
> my POP mail.

The problem is that 38.8.x, which I tried, would not let me finish the configuration because the server doesn't exist or something like that. How do you manage the IMAP account server?
Do you use a real IMAP server and simply not bother to download from it, or something like that?
I have a davecot/imap (I think it is the name) on a linux image running on my test PC and pointed my incoming IMAP to it, and then pop3 since davecot supports both protocols (!) and both somehow failed the test done by the configuration and thus I could not let configuration finish. (davecot.{imap,pop3} are accessible and tested inside the linux image and so there may be some configuration issues.) I really wish this "TB for dummies" has an escape hatch to let a sysadmin test some features without finishing the configuration completely (of course, then there is a chance for causing misfeatures that only occur when the configuration is not completed. Oh well.) 
My ISP does not seem to have IMAP interface. Oh well...
You just press "Manual Setup" in the wizard to enter things yourself.
But, you don't need any account to test an .eml, you can open it and use Copy To > folder
(In reply to Magnus Melin from comment #30)
> You just press "Manual Setup" in the wizard to enter things yourself.
> But, you don't need any account to test an .eml, you can open it and use
> Copy To > folder

I may be too dumb and irritated by this "TB for dummies" setup thingy, but believe me,
the "FILE" menu won't show up in menu bar 
in 38.8.0 (?) if I don't set up the mail account first.
That is, the whole set of menu bar entries for mail functions are not there to begin with, sigh...
(So I can not even LOAD the message in the file system because there is no OPEN menu item.)

Maybe I missed something also, but even "Manual Setup" insists that the account info
which I typed in must be TESTED (and it failed.) before accepting it.
This is the most irritating part.

I have a test account now from Jorg and so should be able to perform the bi-section (unless the "TB for dummies" setup interferes in an unexpected manner...)
Come to think of it, I had to go through the hoops when I set up test account for TB testing under linux once, and I have used the account info ever since. I had no idea how I managed to create the account and that was when some people other than me also complained about the strange UI interaction of mail account configuration (I learned them by searching the websites to solve my problem.)

Will report the bi-section result in a day or two.

TIA
After "Manual config", the "Advanced config" button is there - if you click that you end up in the account manager and can configure the account however you want.
(In reply to Masatoshi Kimura [:emk] from comment #37)
> 
> It is impossible to save umlaut in the Shift-JIS encoding. This bug is not
> about that. This bug is about chopping-off behavior.

Right. I was looking at too old version since the original poster mentioned something about 
38: no actually he mentioned 35 was OK. Somewhere I misread :-)

Anyway, 
From comment 35
 So try a TB 44 beta. You could also compare TB 45 dailies of 2015-11-21 and the day after.

From comment 36:
Or rather 2015-11-15 -> 2015-11-16 since bug 1214619 ripped out a bunch of the functionality.

So I will check the following
last TB 44 beta.
Is the following the one I should check?
   http://ftp.mozilla.org/pub/thunderbird/releases/44.0b1/win32/ja/

For the following, I pick up the
DATE-comm-central instead of DATE-comm-aurora, correct?
http://ftp.mozilla.org/pub/thunderbird/nightly/2015/11/

2015-11-15
2015-11-16
...
2015-11-21

TIA
Ok, I found that 44.0b1 does not cut short the .txt saving although it is in ASCII/SJIS
and umlaut and eszett(?) are replaced by "??" characters.
> Meilleures salutations/mit freundlichen Gr??en/Best regards

As for the guesses where the problem occurred.
I found that the INITIAL version in the list I mentioned, the 45.0a1 Nov-15 version, ALREADY has this issue of the .txt file being cut in the middle (!).

I think we need to look BACK FURTHER (!)

Here is what I found. 
Something has happened between 2015-11-14 (not cut, replacement with ??) and
2015-11-15 (cut in the middle as reported.).


(I said ASCII below, but TB probably used SJIS/MS-Kanji for text output.)

TB version             | .eml                 | .html               | .txt
-----------------------+----------------------+---------------------+--------------------------------------------
44.0b1                 | not cut, UTF-8, CRLF | not cut, UTF-8, LF  | not cut, ASCII?, CRLF [*]
45.0a1 11-05 (Nov-05)  | not cut, UTF-8, CRLF | not cut, UTF-8, LF  | not cut, ???, LF, replacement with '?' [*]
11-10 (Nov-10)         | not cut, UTF-8, CRLF | not cut, UTF-8, LF  | not cut, ???, LF, replacement with '?' [*]
45.0a1 11-14 (Nov-14)  | not cut, UTF-8, CRLF | not cut, UTF-8      | not cut, ???, LF, replacement with '?' [*]
45.0a1 11-15 (Nov-15)  | not cut, UTF-8, CRLF | not cut, UTF-8, LF  | cut short!    BAD

[*] The saved line reads: Meilleures salutations/mit freundlichen Gr??en/Best regards


Hope this helps.
My previous test is under Windows 10 64-bit Japanese version.
I am not sure what happens under Windows 7, but I think as far as character code is concerned the different windows versions behave in a very similar manner.
(However, I don't know exactly what string is returned for the default system charset encoding: cp932, Shift-JIS, MS-Kanji, etc.)

It is probably easiest to use UTF-8 for text output irrespective of the system default encoding, announcing this fact in big bold letters in the changelog/update document/notice.
Short of that, we simply detect Japanese character code (cp932, shift-jis, MS-Kanji aliases) and use UTF-8 in that case (?).
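
The second option (detect the Japanese aliases and switch to UTF-8) could look roughly like the sketch below. It is written in Python purely for illustration, not in Thunderbird's C++. `choose_text_charset` is a hypothetical helper, and matching aliases through Python's codec registry is an assumption about which names the system would actually report.

```python
import codecs

# Assumption: these Python codec names cover the cp932 / Shift-JIS /
# MS-Kanji aliases mentioned above.
JAPANESE_LEGACY_CODECS = {"cp932", "shift_jis"}


def choose_text_charset(system_charset):
    """Return 'utf-8' for Japanese legacy charsets, else the name unchanged."""
    try:
        canonical = codecs.lookup(system_charset).name
    except LookupError:
        return "utf-8"  # unknown charset name: fall back to UTF-8
    if canonical in JAPANESE_LEGACY_CODECS:
        return "utf-8"
    return system_charset


for name in ("Shift-JIS", "MS_Kanji", "cp932", "windows-1252"):
    print(name, "->", choose_text_charset(name))
```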

I have to wonder aloud what happens to a German user who receives a Japanese message and
tries to save the message as ".txt" TEXT type by "Save As" menu.
Or for that matter to a Chinese user who receives a Japanese message and tries to save the message as .txt.
Alice, perhaps you can pin this down a little further.

STR:
Import the attached .eml file.
Save it as text on a PC where the locale is set to Japanese.

Before Nov 14, 2015 we got Meilleures salutations/mit freundlichen Gr??en/Best regards.
After  Nov 15, 2015 we get Meilleures salutations/mit freundlichen Gr (truncated).

We thought the cause would be bug 1214619, but that landed on 2015-11-17.

Anyway, we're close, so perhaps you can pin this down further.

As always: Thanks a lot for your help in advance.
Flags: needinfo?(alice0775)
Via local build,
Last Good : m-c 202b199b9fcf c-c 16b805aecc02
First Bad : m-c 202b199b9fcf c-c 15454eb97bbe


This was regressed by 
15454eb97bbe	Magnus Melin — Bug 1202401 - Prepare for m-c removal of nsISaveAsCharset. r=jcranmer, a=mkmelin
Blocks: 1202401
Flags: needinfo?(alice0775)
Keywords: regression
I think we just need to do some debugging where I said in comment #25:
https://dxr.mozilla.org/comm-central/source/mailnews/base/util/nsMsgI18N.cpp#99

If there is a mapping failure, we should just keep going instead of stopping.
(In reply to Magnus Melin from comment #32)
> After "Manual config", the "Advanced config" button is there - if you click
> that you end up in the account manager and can configure the account however
> you want.

I think I was just confused with the UI in question here.
So what I did was simple.
I downloaded TB 1.5.0.14 which does not have this configurator and let it accept my bogus mail account information so that I can read and save an .eml file on my local directory.

It works!  (I suspect that I have been using an old profile that was created in this manner for a long time???)

Anyway, I already found an issue with 38.8.0, which I chose as the initial lower bound to start bisecting (!).

Windows 10 64-bit, Japanese version.
(After reading the problematic e-mail by Open and then copy it to Inbox, and then Save as:
With 38.8.0 
- if I save the generic file type (*.*) and let TB 38.8.0 save the message as .eml file,
the file is not accidentally shortened. The file is saved in UTF-8 with DOS-ending (CRLF) according to Emacs with which I am reading the file.

- if I save the message as HTML type with .html ending, it is again saved without getting chopped.
It is in UTF-8 but with LF only for line ending. (Of course, it is HTML content.).

- Problem. If I choose TEXT type, and save it as a .txt file:
I think it is plain ASCII somehow, with DOS line endings (CRLF).
It is *NOT* cut off, but the umlaut characters are not reproduced correctly obviously in ASCII.
The line has been converted to

Meilleures salutations/mit freundlichen Gr??en/Best regards

that is, umlaut and eszett or whatever are now ??. (Two literal question mark characters!)

Any thought?

Which version should I go back to? (Now I am not sure where to start bisecting, or in which direction.)

TIA
Again my point is
38.8.0 already cannot save the said e-mail message correctly since it seems to save it in US ASCII ?
However, it does not chop it at the umlaut. Umlaut and other characters are replaced with "?".

TIA
Let me repeat comment #10:
We made changes between TB 38 and TB 45 related to the encoding engine, see bug 1202401.

That bug landed on TB 45. So try a TB 44 beta. You could also compare TB 45 dailies of 2015-11-21 and the day after.
(In reply to Jorg K (PTO during summer, NI me) from comment #43)
> If there is a mapping failure, we should just keep going instead of stopping.

We already do, since NS_ERROR_UENC_NOMAPPING is a success code(!)

Probably though, we just don't do the kOnError_Replace to ? on "errors" now. I'd say that's not so much of a bug. If we can't map we should use utf-8 instead. And that fallback apparently is missing when saving as plain text.
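
To illustrate the "success code" remark above: a status can be non-zero and still count as a success, because the failure test looks only at the severity bit. The following is a minimal sketch with simplified stand-ins; `Result`, `Failed`, `Succeeded` and the three constants are hypothetical names and illustrative values, not the real Gecko definitions.

```cpp
#include <cstdint>

// Simplified stand-ins for an nsresult-style status type. These names and
// values are illustrative assumptions, not copied from the Gecko tree.
using Result = std::uint32_t;

constexpr Result kOk = 0;
// Hypothetical "could not map a character" status: non-zero, severity bit clear.
constexpr Result kNoMappingStatus = 0x00500023u;
// Hypothetical hard failure: severity bit (bit 31) set.
constexpr Result kHardFailure = 0x80004005u;

// Failure is decided by the top bit alone, not by "non-zero".
constexpr bool Failed(Result rv) { return (rv & 0x80000000u) != 0; }
constexpr bool Succeeded(Result rv) { return !Failed(rv); }
```

Under this model a check of the form "if failed, break" never triggers for the no-mapping status, so a caller has to compare against that status explicitly if it wants to react to it.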
Or rather 2015-11-15 -> 2015-11-16 since bug 1214619 ripped out a bunch of the functionality.
So why does it get truncated? I'd understand if those characters simply went missing. I'm not at my desk right now, so I can't debug it. Also, I prefer not to set my system to Japanese.
IIRC that's just what encoders do when they can't really encode properly and you don't allow replacements. May be an implementation detail of each encoder too I suppose, but bailing seems a proper response to me.
(In reply to ISHIKAWA, Chiaki from comment #34)
> Again my point is
> 38.8.0 already cannot save the said e-mail message correctly since it seems
> to save it in US ASCII ?
> However, it does not chop it at the umlaut. Umlaut and other characters are
> replaced with "?".

It is impossible to save umlaut in the Shift-JIS encoding. This bug is not about that. This bug is about chopping-off behavior.
(In reply to Jorg K (PTO during summer, NI me) from comment #45)
> So why does it get truncated? I'd understand if those characters simply went
> missing. I'm not at my desk right now, so I can't debug it. Also, I prefer
> not to set my system to Japanese.

(In reply to Magnus Melin from comment #46)
> IIRC that's just what encoders do when they can't really encode properly and
> you don't allow replacements. May be an implementation detail of each
> encoder too I suppose, but bailing seems a proper response to me.

So I assume the above discussion leads to this:
- the characters may get replaced with "?"s, but
  it should not be truncated, BUT
- it may be up to each encoder to make it possible to do so, and some encoders
  may not support that and in that case, we simply bail out (AND LET THE USER KNOW THAT
  THE SAVING FAILED ?)

???
So for example, we may want to special-case NS_ERROR_UENC_NOMAPPING:

    if (rv == NS_ERROR_UENC_NOMAPPING) {
      mappingFailure = true;
    }

*** here we check that NS_ERROR_UENC_NOMAPPING and continue (?)
    if (NS_FAILED(rv) || dstLength == 0)
      break;
    outString.Append(localbuf, dstLength);


This part also requires special-casing NS_ERROR_UENC_NOMAPPING 


  rv = encoder->Finish(localbuf, &dstLength);
*** here ***
  if (NS_SUCCEEDED(rv)) {
    if (dstLength)
      outString.Append(localbuf, dstLength);
    return !mappingFailure ? rv: NS_ERROR_UENC_NOMAPPING;
  }

And the caller of this function also needs to understand NS_ERROR_UENC_NOMAPPING
(I see such a call in the patch in the Bug 1202401 )

But the issue is that replacement with "?" (or whatever character is designated for it) does not seem to work (?) ... this has to be fixed.

Anyway, the encoder called for each charset combination needs to return a non-zero *dstLength for this continuation to work.
(That is what Magnus seems to say.) 

Is my understanding correct?

TIA
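
The two behaviours discussed above (stop at the first unmappable character versus replace it with "?" and keep going) can be sketched as follows. This is a toy model, not the nsMsgI18N code: the target charset is plain 7-bit ASCII, and `OnUnmappable`, `EncodeResult` and `EncodeToAscii` are hypothetical names introduced only for the illustration.

```cpp
#include <string>

// What the toy encoder does when a character has no mapping in the target.
enum class OnUnmappable { Stop, Replace };

struct EncodeResult {
  std::string bytes;            // encoded output
  bool mappingFailure = false;  // true if any character could not be mapped
};

// Toy encoder: the target charset is 7-bit ASCII, so every code unit
// at or above 0x80 counts as unmappable.
EncodeResult EncodeToAscii(const std::u16string& in, OnUnmappable policy) {
  EncodeResult result;
  for (char16_t ch : in) {
    if (ch < 0x80) {
      result.bytes.push_back(static_cast<char>(ch));
      continue;
    }
    result.mappingFailure = true;
    if (policy == OnUnmappable::Stop) {
      break;  // everything after this point is lost (the truncation seen here)
    }
    result.bytes.push_back('?');  // record a placeholder and keep going
  }
  return result;
}
```

With the input "Grüßen", the Stop policy yields "Gr" and the Replace policy yields "Gr??en". In both cases the mapping-failure flag is set, so a caller could still decide to redo the conversion with another output encoding.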
Not sure I understand your question, but for the save as .txt case, we need to check the return code. If it says it couldn't convert, we need to do it again with output encoding being UTF-8 so it will succeed. 
Code should be around nsMessenger::SaveAs - http://mxr.mozilla.org/comm-central/source/mailnews/base/src/nsMessenger.cpp#1001
We should handle NS_ERROR_UENC_NOMAPPING here:
https://dxr.mozilla.org/comm-central/rev/764af75d8f8cb66b4a998979a47dc19056eadf3e/mailnews/base/src/nsMessenger.cpp#1858-1860

(In reply to Jorg K (GMT+1) from comment #44)
> So why does it get truncated? I'd understand if those characters simply went
> missing.

Because the encoder stops converting when it encounters an unmappable character.
Or, just change the last parameter of nsMsgI18NConvertFromUnicode to false. Although umlauts will still be changed to "?" on Japanese locale, it is not worse than Thunderbird 38 or earlier.
This is all just ridiculous. I see no reason we wouldn't always output the text file as UTF-8 instead of what charset the platform happens to use as default, which would guarantee correct output with no truncation.
QA Contact: mkmelin+mozilla
Assignee: nobody → mkmelin+mozilla
QA Contact: mkmelin+mozilla
Masatoshi-san, thanks for your comments.

Magnus, thanks for taking this on. Since there was no movement here after May 2016, this slipped off my radar until the duplicate arrived.
(In reply to Magnus Melin from comment #52)
> This is all just ridiculous. I see no reason we wouldn't always output the
> text file as UTF-8 instead of what charset the platform happens to use as
> default, which would guarantee correct output with no truncation.

On MS Win, I think it was perhaps because of notepad. IIRC, even after notepad started to support utf-8, it required a BOM, even though the file was utf-8 rather than utf-16. Recent notepad on MS Windows appears to show a utf-8 file as utf-8 even when it has no BOM.
However, notepad still saves utf-8 text with a BOM, so many people still suffer from it.
> http://stackoverflow.com/questions/8432584/how-to-make-notepad-to-save-text-in-utf-8-without-bom
I think "notepad exists on MS Win" was the reason why saving in utf-8 was avoided on MS Win.
If Thunderbird saves in utf-8, many users will surely say that Thunderbird is the culprit for all their utf-8 BOM problems. :-)
This is the reason why Notepad++ is still mandatory for us.
> https://notepad-plus-plus.org/
> http://www.larshaendler.com/2015/01/20/remove-bom-with-notepad/

As notepad already shows "utf-8 without BOM" as "utf-8", I believe "saving in utf-8 always" is already possible even on CJK MS Win.
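
For readers unfamiliar with the BOM mentioned above: in UTF-8 it is the three bytes EF BB BF at the very start of the file. A minimal sketch of detecting and stripping it, with `kUtf8Bom`, `HasUtf8Bom` and `StripUtf8Bom` being hypothetical helper names and not anything from Thunderbird:

```cpp
#include <string>

// U+FEFF encoded as UTF-8: the byte order mark some editors write first.
const std::string kUtf8Bom = "\xEF\xBB\xBF";

// True if the byte sequence starts with the UTF-8 BOM.
bool HasUtf8Bom(const std::string& bytes) {
  return bytes.compare(0, kUtf8Bom.size(), kUtf8Bom) == 0;
}

// Returns the bytes without a leading BOM; input without a BOM is unchanged.
std::string StripUtf8Bom(const std::string& bytes) {
  return HasUtf8Bom(bytes) ? bytes.substr(kUtf8Bom.size()) : bytes;
}
```

A tool that does not know about the BOM sees those three bytes as stray characters at the start of the text, which is the kind of complaint described above.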
Summary: Saving UTF-8 e-mail message that includes character "ü" to text format, truncates before the ü (if Windows locale is *not* English but Japanese). → Saved message truncated if it contains a character that can't be encoded with the nsMsgI18NFileSystemCharset()
I see no reason why we shouldn't use the system charset if we can. That's what we do on other occasions.
Attachment #8827102 - Flags: review?(mkmelin+mozilla)
Comment on attachment 8827102 [details] [diff] [review]
1271864-encode-as-utf8.patch (v1).

This works great, I saved an e-mail in ISO-2022-JP. Without the patch the e-mail gets saved as ANSI and truncated, with the patch the full e-mail is exported as UTF-8 and looks nice when opened in the text editor.
Comment on attachment 8827102 [details] [diff] [review]
1271864-encode-as-utf8.patch (v1).

Review of attachment 8827102 [details] [diff] [review]:
-----------------------------------------------------------------

But why not always use UTF-8 straight away? For all intents and purposes utf-8 encoded text files are not a problem.

I mean this is seriously messed up: the mail is written in one encoding, then we try to save it to ... no, not that encoding, not utf-8, no, to whatever the user system happens to use by default. This makes no sense whatsoever. Let's quit the madness.

::: mailnews/base/src/nsMessenger.cpp
@@ +1858,5 @@
>      rv = nsMsgI18NConvertFromUnicode(nsMsgI18NFileSystemCharset(),
>        utf16Buffer, outCString, false, true);
> +    if (rv == NS_ERROR_UENC_NOMAPPING) {
> +      // If we can't encode with the preferred charset, use UTF-8.
> +      rv = nsMsgI18NConvertFromUnicode("UTF-8", utf16Buffer, outCString);

CopyUTF16toUTF8(utf16Buffer, outCString);
Bug 181456 has collected some of Thunderbird's bug reports for this I believe.
(In reply to Magnus Melin from comment #57)
> But why not always use UTF-8 straight away? For all intents and purposes
> utf-8 encoded text files are not a problem.
Why should I? The user always exports e-mail and gets the system charset. You don't know what tools they use to view text files. So now suddenly they get a message that can't be saved in the charset they are used to. So instead of truncating the message or showing ?, we use UTF-8.

I see no reason to change the existing behaviour. I don't know that somewhere in Japan there isn't a person using some old program that can't deal with UTF-8.
I believe all programs should do their part in utf-8 convergence, and this is a good opportunity to do one little step. If the tool they use can't view utf-8 (which I doubt) that tool should be fixed.

We talk about technical debt a lot. This is exactly part of that - designing for fictional persons in Japan. When simplifying is possible, we should just do that instead.
I don't really agree.
Attachment #8828511 - Flags: review?(mkmelin+mozilla)
Attachment #8828511 - Flags: review?(mkmelin+mozilla)
Here is v1 with the nit fixed.

(In reply to Magnus Melin from comment #60)
> I believe all programs should do their part in utf-8 convergence, and this
> is a good opportunity to do one little step. If the tool they use can't view
> utf-8 (which I doubt) that tool should be fixed.
> We talk about technical debt a lot. This is exactly part of that - designing
> for fictional persons in Japan. When simplifying is possible, we should just
> do that instead.

I don't agree with this at all. We are here to fix a bug, not to teach UTF-8 evangelism. We have a very conservative user base who get upset with the slightest change we make (like people complaining that the Reply-to is not honoured any more in all cases when they type <ctrl>R).

We don't have a single good reason to change existing working behaviour (other than being arrogant and telling the users how they should use our product). You can't exclude the possibility that somewhere there is a system in Europe that reads ANSI exported files and can't deal with UTF-8 (which are multibyte). You needlessly change the encoding of all e-mail which has European international characters like äöüáóú. Personally, I don't want to deal with the needless regressions this might cause.

Europeans during their lifetime may not get in contact with UTF-8, since ANSI, the default charset in their machines, does everything they need. The standard Windows text editor, Notepad, shows this ...

  This file contains characters in Unicode format which will
  be lost if you save this file as ANSI encoded text file.
  To keep the Unicode information, click Cancel below and
  then select one of the Unicode option from the Encoding
  drop down list. Continue?

rather scary warning when people come near unicode.

So kindly approve the patch and please stop teaching evangelism, changing working systems and telling people what they should be doing. If their file system charset is UTF-8, this is what will be exported anyway. Or do we need to take this to the (non-existent) tech committee?

Didn't you say in bug 1301640 comment #109:
> The OS knows best what the user wants.
Apparently now you know even better ;-)
Attachment #8827102 - Attachment is obsolete: true
Attachment #8827102 - Flags: review?(mkmelin+mozilla)
Attachment #8828680 - Flags: review?(mkmelin+mozilla)
As you are talking about what users may want:

I am a mere user (from Europe) and have no problem with UTF-8. In fact, I would prefer this solution. But I am also okay with any other solution, as long as emails saved as text files don't get truncated. This is a very annoying situation at my working place.

(As I have written in my original bug report, I have no problems with emails containing ä etc., but with silly symbols like the Unicode letters U+1F44D or U+1F44E.)

Best wishes
(In reply to isyahadin from comment #63)
> (As I have written in my original bug report, I have no problems with emails
> containing ä etc., but with silly symbols like the Unicode letters U+1F44D
> or U+1F44E.)
Sure, those would be exported as UTF-8.
Comment on attachment 8828680 [details] [diff] [review]
1271864-encode-as-utf8.patch (v1a).

Review of attachment 8828680 [details] [diff] [review]:
-----------------------------------------------------------------

::: mailnews/base/src/nsMessenger.cpp
@@ +1859,5 @@
>        utf16Buffer, outCString, false, true);
> +    if (rv == NS_ERROR_UENC_NOMAPPING) {
> +      // If we can't encode with the preferred charset, use UTF-8.
> +      CopyUTF16toUTF8(utf16Buffer, outCString);
> +      rv = NS_OK;

NS_ERROR_UENC_NOMAPPING is a success code, so you don't need this rv assignment
Attachment #8828680 - Flags: review?(mkmelin+mozilla)
(In reply to Jorg K (GMT+1) from comment #62)

> I don't agree with this at all. We are here to fix a bug, not to teach UTF-8
> evangelism. 

Yes but the bug can be fixed two ways. 

> conservative user base who get upset with the slightest change we make

In comparison, a way larger change than the proposed one, making UTF-8 the default for outgoing mail, didn't get a single complaint as far as I remember.

> honoured any more in all cases when the type <ctrl>R).
> 
> We don't have a single good reason to change existing working behaviour
> (other than being arrogant and telling the users how they should use our
> product). 

This is pretty analogous to web standards advocacy, solving similar issues. 

I'd see two kinds of users:
 - the ones that know about encodings - they will want utf-8 if asked.
 - the ones who have no idea and just want things to work. UTF-8 will work for them too.

> Europeans during their lifetime may not get in contact with UTF-8, since

Eh, probably at this point way over 90% of anything you come in contact with is UTF-8 if you take a closer look. 

> > The OS knows best what the user wants.
> Apparently now you know even better ;-)

What the user wants is pretty clear: text that's not garbled. The system charset is not really a choice of text encoding either...

But, anyhow. Let's make aceman review this instead :)
Which approach do you prefer?
Flags: needinfo?(acelists)
(In reply to Magnus Melin from comment #65)
> NS_ERROR_UENC_NOMAPPING is a success code, so you don't need this rv
> assignment
Sure, I can take it out if the reviewer wants it taken out. I think it's better for clarity especially given that NS_ERROR_UENC_NOMAPPING is non-zero, yet a success code.

(In reply to Magnus Melin from comment #66)
> UTF-8 will work for them too.
I wouldn't bet my life on it, would you?
Attachment #8828511 - Flags: review?(acelists)
Comment on attachment 8828680 [details] [diff] [review]
1271864-encode-as-utf8.patch (v1a).

Hey Aceman, no diplomacy will help you here ;-) Pick one!
Attachment #8828680 - Flags: review?(acelists)
(In reply to Jorg K (GMT+1) from comment #62)
> Europeans during their lifetime may not get in contact with UTF-8, since
> ANSI, the default charset in their machines does everything the need. The
> standard Windows text editor, Notepad, shows this ...

True, I haven't seen any utf-8 encoded files in Windows in my area. If anything, there were some system files that looked like utf-16 (fixed 2 bytes wide chars).

I need to check how utf-8 displays on Windows installs that I have access to (I promise to skip the majority being Windows XP from the testing :)).

> Eh, probably at this point way over 90% of anything you come in contact with
> is UTF-8 if you take a closer look. 

Well, I could completely ignore utf-8 until about a year ago. Only things like TB .properties files were encoded in utf-8 and it caused a pain to me as my Linux wasn't set to utf-8 by default. I made the transition to utf-8 for reasons I do not remember right now, but surely the continuing system upgrades put many new hurdles in my way (even the kernel defaulted to utf-8 but it could be turned off) so I finally caved in.

Then it took some time to convert my data to utf-8 and I am still not finished (I encounter new stuff till today).
I preferred the iso-8859-2 encoding as all chars were 1 byte.
 
> What the user wants is pretty clear: text that's not garbled. The system
> charset is not really a choice of text encoding either...

If there is a preferred charset imposed on the system that is thus implied by most apps, we should try to use that one.

So if you guys still want my utf8-ignorant person to look at this, I can try :)
(In reply to :aceman from comment #69)
> So if you guys still want my utf8-ignorant person to look at this, I can try

From what I read in your comment #69, I get the impression that you prefer patch v1. Since Magnus nominated you, I see no reason to go looking for another adjudicator now ;-)

As for your question: UTF-8 encoded files will open in Notepad just fine (although the display of Japanese looks bad, too small). Notepad's "Save Dialog" offers ANSI, Unicode (16bit, LE), Unicode (BE) and UTF-8.

Let's face it, ANSI (windows-1252 and ISO-8859-x) is not going away in a rush. And if we wanted to push UTF-8, we should do it openly in another bug. We had this long discussion whether to export the address book in UTF-8 and ended up offering both options since both were needed.
Flags: needinfo?(acelists)
Comment on attachment 8828680 [details] [diff] [review]
1271864-encode-as-utf8.patch (v1a).

Review of attachment 8828680 [details] [diff] [review]:
-----------------------------------------------------------------

I like this version, which first tries the native charset and falls back to utf-8 later.

Maybe there could be a pref like mailnews.force_save_as_UTF8 to always use utf-8 if the user chooses so?

::: mailnews/base/src/nsMessenger.cpp
@@ +1859,5 @@
>        utf16Buffer, outCString, false, true);
> +    if (rv == NS_ERROR_UENC_NOMAPPING) {
> +      // If we can't encode with the preferred charset, use UTF-8.
> +      CopyUTF16toUTF8(utf16Buffer, outCString);
> +      rv = NS_OK;

I would keep this. Letting NS_ERROR_UENC_NOMAPPING propagate away from this code may only cause problems as nobody expects it to mean success when just reading the name of the thing. I agree with setting a safe and well-known success value.
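
The approach reviewed above (try the preferred charset first, fall back to UTF-8 on a mapping failure, and hand callers a plain OK) can be sketched in self-contained form. This is only an illustration under stated assumptions: Latin-1 stands in for the system charset, and `Status`, `EncodeLatin1`, `Utf16ToUtf8` and `EncodeForSave` are hypothetical names, not the functions used in nsMessenger.cpp.

```cpp
#include <cstddef>
#include <string>

enum class Status { Ok, NoMapping };

// Toy stand-in for the system-charset encoder: Latin-1 can only represent
// code units up to 0xFF. On NoMapping the output is incomplete.
Status EncodeLatin1(const std::u16string& in, std::string& out) {
  out.clear();
  for (char16_t ch : in) {
    if (ch > 0xFF) return Status::NoMapping;
    out.push_back(static_cast<char>(static_cast<unsigned char>(ch)));
  }
  return Status::Ok;
}

// UTF-16 to UTF-8. Surrogate pairs are combined; an unpaired surrogate
// becomes U+FFFD.
std::string Utf16ToUtf8(const std::u16string& in) {
  std::string out;
  for (std::size_t i = 0; i < in.size(); ++i) {
    char32_t cp = in[i];
    if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < in.size() &&
        in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
      cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
      ++i;
    } else if (cp >= 0xD800 && cp <= 0xDFFF) {
      cp = 0xFFFD;
    }
    if (cp < 0x80) {
      out.push_back(static_cast<char>(cp));
    } else if (cp < 0x800) {
      out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
      out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else if (cp < 0x10000) {
      out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
      out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
      out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else {
      out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
      out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
      out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
      out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    }
  }
  return out;
}

// Try the preferred charset; on a mapping failure redo the whole text as
// UTF-8 and report a plain Ok so callers never see the special status.
Status EncodeForSave(const std::u16string& text, std::string& out) {
  Status rv = EncodeLatin1(text, out);
  if (rv == Status::NoMapping) {
    out = Utf16ToUtf8(text);
    rv = Status::Ok;
  }
  return rv;
}
```

In this sketch "Grüßen" stays in the single-byte charset, while a text containing U+1F44D is written entirely as UTF-8. Resetting the status to Ok mirrors the point made above that the special status should not leak out to callers.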
Attachment #8828680 - Flags: review?(acelists) → review+
https://hg.mozilla.org/comm-central/rev/48282013040ad39e8e7736103fd52031449a101b

As for the preference: To use Thomas' words: That would be very discoverable ;-)
And Magnus hates extra preferences.
Assignee: mkmelin+mozilla → jorgk
Status: NEW → RESOLVED
Closed: 8 years ago → 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → Thunderbird 53.0
Attachment #8828511 - Attachment description: 1271864-encode-as-utf8.patch (v2). → Alternative, NOT landed: 1271864-encode-as-utf8.patch (v2).
Attachment #8828511 - Flags: review?(acelists)
Comment on attachment 8828680 [details] [diff] [review]
1271864-encode-as-utf8.patch (v1a).

Surely a bug we should fix in TB 52. Too late for another beta 51 now.
Attachment #8828680 - Flags: approval-comm-aurora+
I just updated to TB 52.0, and the problem is solved - thank you!
I confirmed it is fine. Thank you so much for fixing it.