Bug 23418 - "File | save as file" has problems with Japanese messages saved as plain text
nsbeta3+, patch in hand, reviewed
Product: MailNews Core
Classification: Components
Component: Internationalization
Version: Trunk
Platform: x86 Windows NT
Importance: P2 major
Target Milestone: M18
Assigned To: nhottanscp
QA Contact: Katsuhiko Momoi
Duplicates: 39357
Reported: 2000-01-07 19:30 PST by Katsuhiko Momoi
Modified: 2008-07-31 01:22 PDT

a JPN msg saved with html option -- user manually writing in the extension. (39 bytes, text/html)
2000-01-07 19:31 PST, Katsuhiko Momoi
Patch to help fix the header to UTF-8 conversion problem (3.71 KB, patch)
2000-02-18 10:09 PST, rhp (gone)
This is the message you should try Saving As TEXT (8.71 KB, image/gif)
2000-02-18 10:11 PST, rhp (gone)
A message shown in its entirety. Take a look at the next msg to see how it is saved into .txt format. (62.66 KB, image/jpeg)
2000-03-08 02:27 PST, Katsuhiko Momoi
The msg is saved into .txt format -- now shows truncation. (43.78 KB, image/jpeg)
2000-03-08 02:28 PST, Katsuhiko Momoi
5 test message in this file -- 4 JPN and 1 UTF-8 encoded JPN message. (38.98 KB, text/plain)
2000-03-08 02:32 PST, Katsuhiko Momoi
a patch to use a correct string length (1.00 KB, patch)
2000-09-12 15:44 PDT, nhottanscp

Description Katsuhiko Momoi 2000-01-07 19:30:16 PST
** Observed with 1/6/99 Win32 build **

When a non-ASCII msg is saved as a file using the File | Save as menu,
the file is currently saved without any extension and "as is", whether
the option selected by the user is HTML or plain text.

Some users may be tempted to supply the .html extension themselves.
If so, the result will be a Unicode-encoded, somewhat deficient HTML
file. This needs to be corrected.

I guess HTML format should save as is -- JPN msg should be saved in
HTML format and in JIS.

We should also try supplying the .txt extension and see what happens.

Also, if there isn't one, we should file a separate bug for
the format option malfunction at the save as dialog window.
Comment 1 Katsuhiko Momoi 2000-01-07 19:31:59 PST
Created attachment 4067
a JPN msg saved with html option -- user manually writing in the extension.
Comment 2 nhottanscp 2000-01-10 10:29:59 PST
Reassign to jefft.
Save as text/html -> Currently saved as UTF-8 (wrong) - we need to convert UTF-8
to mail charset (e.g. ISO-2022-JP).
Save as text/plain -> IQA needs to test the current behavior. The spec is to
convert the data to the platform file charset. Below is the code to get the platform
file charset.

  #define NS_IMPL_IDS
  #include "nsIPlatformCharset.h"
  #undef NS_IMPL_IDS

  nsCOMPtr <nsIPlatformCharset> platformCharset;
  nsAutoString aPlatformCharset;
  rv = nsComponentManager::CreateInstance(NS_PLATFORMCHARSET_PROGID, nsnull,
  if (NS_SUCCEEDED(rv))
  rv = platformCharset->GetCharset(kPlatformCharsetSel_FileName,
Comment 3 Katsuhiko Momoi 2000-01-10 15:05:59 PST
Currently, when we save as text, this is what happens:

1. Save as text without the user explicitly supplying the .txt
   extension. --> saves the whole msg including vCard, etc. in
   ISO-2022-JP. This looks like the source data itself.

2. Save as text, supplying the .txt extension yourself.
   Saves without vCard and other extra parts, but in ISO-2022-JP
   rather than the expected Shift_JIS for JPN Windows.
Comment 4 Katsuhiko Momoi 2000-01-10 15:06:59 PST
Both case 1 and case 2 should save without extra parts and into Shift_JIS
rather than in ISO-2022-JP.
Comment 5 Jean-Francois Ducarroz 2000-01-14 17:49:59 PST
I have just added a new function in nsMsgI18N.h named msgCompFileSystemCharset
(oops, not a generic enough name, maybe we should rename it) that gives you back
the file system character set.
Comment 6 Katsuhiko Momoi 2000-02-03 02:14:06 PST
I would like to designate this as beta1 since "file | save as" is something
users do use and its UI should be non-confusing and
its resulting effects should be consistent and correct.
Comment 7 leger 2000-02-03 17:20:32 PST
Putting on PDT+ radar for beta1.
Comment 8 Phil Peterson 2000-02-07 13:38:10 PST
Load-balance to rhp
Comment 9 rhp (gone) 2000-02-09 13:30:46 PST
Ok, I have most of this done for plain text, but I have a question on saving 
messages as HTML. Why can't I emit a META tag header with UTF-8 as the output 
for the saved HTML web page? With the current architecture, your suggestion of 
going back to the original charset of the message is full of problems and I 
don't think it will give us a very good result.


- rhp 
Comment 10 Katsuhiko Momoi 2000-02-09 14:02:13 PST
I think users will want to see the original data encoding.
On most platforms, you can then open such a file with 
a text editor and see & edit the content without a problem.
That is not going to be the case if you save into UTF-8.

That and in the past we have been saving the original data
as is under HTML extension and I don't think we should break the
familiar behavior. 
Comment 11 rhp (gone) 2000-02-09 16:04:01 PST
I understand your point and I think I have an idea how to fix this...but just 
to clarify, I don't think this feature really works on 4.x. When you do a Save 
As for a mail message in 4.x (HTML format), you basically get the raw contents 
of the message dumped to the file with RFC822 headers, etc. 

We'll do better than this for 5.x  :-)

- rhp
Comment 12 rhp (gone) 2000-02-10 00:49:45 PST
Ok, I've tried to improve this. Is this all going to work 100%? 
Probably not. For plain text, I am getting the output from libmime and saving 
it as plain text in the charset of the system.

For HTML, I am trying to output the message in the original charset without any 

momoi san,
Now, there are a million combinations that can happen here and scenarios that 
will probably break, but what I need you to do is help me with the major issues 
and I'll debug from there.

- rhp
Comment 13 Katsuhiko Momoi 2000-02-10 01:01:46 PST
Hi, Rich. In the case of HTML msg, are you going to strip out
some headers and MIME structural material? Under 4.x, we didn't
do any of that and saved the original msg. In case we run into
complicated problems, can we go back to that simple msg
saving method for HTML. 
Comment 14 rhp (gone) 2000-02-10 08:27:53 PST
Actually, the way this works is as follows:

 - Save As: .EML (or no extension) - Saves the raw RFC822 message to a file
 - Save As: .TXT - Does its best job at converting what you see in your message 
display into a .TXT file (in the native charset of the system)
 - Save As: .HTML - Again, it outputs HTML that will give you a display similar 
to what you see in your 3 pane message display

Actually, Communicator 4.x "Save As" HTML doesn't really work. It simply saves 
the raw RFC822 text into a file. So, in 5.x we will actually generate something 
that is an HTML document without any RFC822 messages.

- rhp
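The extension-to-format mapping described in this comment can be sketched roughly as follows. This is only an illustration: `SaveFormat` and `formatForFileName` are hypothetical names, not identifiers from the mailnews code, and the sketch assumes a bare file name (no directory part), case-insensitive matching, and a fallback to the raw message for any unrecognized extension.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical names for illustration; the real mailnews code is not shown in this bug.
enum class SaveFormat { RawMessage, PlainText, Html };

// Pick the output format from the extension of a bare file name.
// No extension, ".eml", or anything unrecognized falls back to the raw RFC822 message.
SaveFormat formatForFileName(const std::string& fileName) {
    const std::string::size_type dot = fileName.find_last_of('.');
    if (dot == std::string::npos) {
        return SaveFormat::RawMessage;
    }
    std::string ext = fileName.substr(dot + 1);
    // Lower-case the extension so "test.TXT" and "test.txt" are treated alike.
    std::transform(ext.begin(), ext.end(), ext.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    if (ext == "txt") return SaveFormat::PlainText;
    if (ext == "html") return SaveFormat::Html;
    return SaveFormat::RawMessage;
}
```

With this mapping, a name typed without an extension is saved as the raw message, which matches the "all saved in .eml format" behavior reported later in the thread.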
Comment 15 Katsuhiko Momoi 2000-02-10 09:53:57 PST
Great! Looking forward to seeing the results in today's build.
Comment 16 Katsuhiko Momoi 2000-02-10 11:25:42 PST
I'm not done looking at this yet, but so far I have found a couple of problems.

1. Even though the save as dialog shows possible extension types, it does not supply one
   automatically unless the user writes it in. Otherwise it's all saved in .eml format.
2. When saving into .txt format, the headers (To:, CC:, Date:, etc.) and their content are 
   separated into different lines, e.g.

   Jane Banning

Comment 17 rhp (gone) 2000-02-10 11:34:17 PST
The separate line problem is in the HTML - TEXT converter. Table conversion is 
pretty bad. Don't reopen this bug on that issue.

- rhp
Comment 18 Katsuhiko Momoi 2000-02-10 11:43:06 PST
Thanks. Browser save as has the same problems as you say for .txt format.
The charset conversion to system charset in .txt format seems to be working well.
Comment 19 Katsuhiko Momoi 2000-02-18 00:43:51 PST
Rich, sorry it's taking me longer to verify this fix.
I've found some definite problems:

1. We are saving the original MIME-encoded subject headers
   in UTF-8 rather than in the original encoding indicated by the 
   MIME charset in the header when using the HTML save option.

2. Earlier .txt (when supplied by the user) files were generally
   saved correctly -- that is what I reported. But now with this
   build, all I see is the same data as .eml file. What happened
   since then?

3. We don't automatically supply the extensions even though the
   file | save as file dialog window clearly shows the 
   possible extension types. Users are used to the application
   automatically supplying the extension in the dialog if none
   is entered.

Re-opening for the above reasons... there may have been regressions
in the last few days also for .txt type.
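The behavior asked for in item 3 could look something like the sketch below. It is an assumption about the desired result, not the code that was checked in: `withDefaultExtension` is a hypothetical helper, and the default extension would come from the type selected in the dialog.

```cpp
#include <string>

// Hypothetical helper: append `defaultExt` (for example ".txt") when the user
// typed a name without an extension. A leading dot alone does not count as one.
std::string withDefaultExtension(const std::string& fileName,
                                 const std::string& defaultExt) {
    const std::string::size_type dot = fileName.find_last_of('.');
    const bool hasExtension = dot != std::string::npos && dot > 0;
    return hasExtension ? fileName : fileName + defaultExt;
}
```

A name that already carries an extension is left alone, so a user who types "msg.html" while ".txt" is selected still gets "msg.html".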
Comment 20 Katsuhiko Momoi 2000-02-18 00:47:16 PST
The build I saw the problems reported above was: Win32 2000021708.
Comment 21 rhp (gone) 2000-02-18 07:14:23 PST
Ok, well the REGRESSION that we are seeing (i.e. extensions are ignored) is a 
bug that I think I found in nsString(). 

Rick: On line 517 of xpcom/ds/nsStr.cpp, there is a line:

  const char* destLast  = root+((aDest.mLength-1)*aDelta); //pts to last char in aDest (likely null)

but if you really want to point to the NULL character of this string, the "-1" 
shouldn't be there. I changed the line to remove the -1 and it started working 
again. With the -1, you never get a hit.

The headers issue sounds like a bug I will tackle. Also, I will look at adding 
the extensions for these files.

Technically, I'm supposed to be on vacation today, but I'll see what I can do.

- rhp
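A simplified model of the off-by-one described above, for illustration only. This is not the nsStr.cpp routine, and the exact failure there may differ. It assumes the search treats the end pointer as an exclusive bound: with the extra "-1" the last character is never examined, so a match that ends at the end of the string, such as a file extension, is missed. `findIn` and `endAdjust` are hypothetical names.

```cpp
#include <cstddef>
#include <string>

// Returns the offset of the first occurrence of `needle` in `text`, or -1.
// `endAdjust` models the disputed "-1": 1 reproduces the miss, 0 is the fix.
long findIn(const std::string& text, const std::string& needle,
            std::size_t endAdjust) {
    if (text.size() < endAdjust) return -1;
    const std::size_t end = text.size() - endAdjust;  // exclusive search bound
    for (std::size_t i = 0; i + needle.size() <= end; ++i) {
        if (text.compare(i, needle.size(), needle) == 0) {
            return static_cast<long>(i);
        }
    }
    return -1;
}
```

Searching "test.TXT" for ".TXT" finds nothing when `endAdjust` is 1 and finds the match at offset 4 when it is 0, which would explain why the extension check stopped working.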
Comment 22 rhp (gone) 2000-02-18 10:08:33 PST
Ok, I think I have a fix for the header conversion stuff, but I need Naoki's 
help now.

I will attach a patch that you can apply from the same level as your /mozilla 
directory. It fixes the headers being converted into UTF-8. Now, I have a 
problem with the ConvertFromUnicode() call. I pass in a bunch of unicode data 
(I think) and I want it converted, but for one of the Smoketest email messages, 
the ConvertFromUnicode() routine seems to truncate after the text "Subject: ". 
Can you help me with this one?

Other than that, I think this is close to being fixed. 

- rhp
Comment 23 rhp (gone) 2000-02-18 10:09:51 PST
Created attachment 5422
Patch to help fix the header to UTF-8 conversion problem
Comment 24 rhp (gone) 2000-02-18 10:11:41 PST
Created attachment 5423
This is the message you should try Saving As TEXT
Comment 25 nhottanscp 2000-02-18 10:22:18 PST
I had just started to rebuild the tree. I'll try the patch when it's done.
I looked at the patch. I have not followed this bug in detail, so the header is 
supposed to be converted to the charset specified in MIME header instead of 
UTF-8? I am not sure which part of the code fixed it.
Anyway, I'll test after the build is done.
Comment 26 rhp (gone) 2000-02-18 10:24:42 PST
Right, we should be saving the file as the original message charset. The thing 
that fixed the problem was taking out the "header=quoting" line. This changes 
the behavior in libmime.

If you can try to save that message off, you'll see the conversion problem. 
Thanks for the help!

- rhp
Comment 27 nhottanscp 2000-02-18 12:35:55 PST
Okay, my build completed. And I applied the patch and did nmake at 
mailnews/base. But headers are still saved as UTF-8 (I saved as html).
I set a break point at the changed line as below but didn't hit.

--- 1417,1423 ----
        rv = ConvertFromUnicode(msgCompFileSystemCharset(), m_msgBuffer, 
Comment 28 rhp (gone) 2000-02-18 12:39:55 PST
You won't hit that line unless you save as plain text. Make sure you are saving 
the file as "test.TXT".

- rhp
Comment 29 nhottanscp 2000-02-18 12:48:24 PST
I saved as .TXT (selected it from the popup and added the extension manually); that 
works (saved correctly) but it didn't hit the break point.

Save as html, just above the break point I mentioned, it checks a value 
In my case that's 0 (false), do I need to change my environment?

BTW, I think the reason for the conversion failure you saw is probably that you are using 
US Windows. In that case, the file system charset is windows-1252 and the converter 
will fail to convert the Japanese text.
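The failure mode suggested here can be illustrated with a toy single-byte encoder. This is a stand-in, not the Mozilla converter: it uses ISO-8859-1 (which maps U+0000..U+00FF one-to-one) in place of windows-1252, and it assumes the converter stops at the first character it cannot represent. `encodeLatin1` is a hypothetical name.

```cpp
#include <string>

// Encode UTF-16 text into a single-byte charset. Returns false at the first
// code unit the charset cannot represent, leaving the output converted so far.
// windows-1252 differs from ISO-8859-1 in the 0x80..0x9F range, but it
// likewise has no Japanese characters.
bool encodeLatin1(const std::u16string& text, std::string& out) {
    out.clear();
    for (char16_t unit : text) {
        if (unit > 0xFF) return false;  // not representable in this charset
        out.push_back(static_cast<char>(unit));
    }
    return true;
}
```

Under that assumption, a header such as "Subject: " followed by Japanese text comes out as just "Subject: ", which is consistent with the truncation rhp saw on his machine.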
Comment 30 rickg 2000-02-19 01:17:10 PST
I've landed my change to nsStr which partially fixes this problem. RHP -- it's 
up to you now.
Comment 31 rhp (gone) 2000-02-19 08:20:10 PST
Ok, I have this boiled down to an issue of "Save As" into HTML still converting
the headers to UTF-8. I know what I need to do so this shouldn't be too hard to fix.

- rhp
Comment 32 rhp (gone) 2000-02-19 21:01:18 PST
Ok, I have this fixed now and will get it into the tree when I get approval.

- rhp
Comment 33 rhp (gone) 2000-02-21 21:59:11 PST
Ok, this should be much better than it was now. - rhp
Comment 34 rhp (gone) 2000-02-22 07:48:14 PST
Reopening this bug. I found a test case that broke it, but the good thing is 
that I have it fixed and just need to get permission to checkin.

- rhp
Comment 35 rhp (gone) 2000-02-22 07:48:51 PST
Just updating the whiteboard. I have a pretty simple fix in hand.

- rhp
Comment 36 rhp (gone) 2000-02-22 15:02:05 PST
Ok, I think my latest checkins fixed this problem now. - rhp
Comment 37 Katsuhiko Momoi 2000-02-24 15:02:48 PST
I still see 2 problems with ** 2/24/2000 Win32 build **

1. Extensions need to be supplied to get the right results.
   --> Maybe this should go to another bug.

2. When saving into .txt file format, it does not save into
   the system charset. So this means that JPN mail msg is saved
   in ISO-2022-JP rather than in Shift_JIS.
Comment 38 rhp (gone) 2000-02-24 15:17:55 PST
I have the first problem fixed in my tree...let's not file a new bug on this. 
On the second one, I save the plain text file in the following charset:


if this is wrong, I would have Naoki look at it. 

I really want to get this bug resolved...I've spent a lot of time on this one 

- rhp
Comment 39 rhp (gone) 2000-02-24 16:18:19 PST
I've tested this to the best of my abilities on  my machine. I am converting 
the Unicode text from the mail message into the charset I get from:


I can send you a recent diff to look at what is going into the tree eventually 
to fix the extension request.

- rhp
Comment 40 nhottanscp 2000-02-24 16:47:22 PST
Okay, I'll take a look.
Comment 41 nhottanscp 2000-02-24 17:15:30 PST
msgCompFileSystemCharset() returns Shift_JIS on my NT-J.
I debugged the conversion code; it is getting Shift_JIS as the conversion charset, 
so it is supposed to convert from Unicode to Shift_JIS.
But the data it's getting is ISO-2022-JP in UCS2 format. The Shift_JIS converter 
just removes the padded zeros from each character, the result we get is 
Comment 42 nhottanscp 2000-02-24 17:22:48 PST
On my machine it happens with nsMessenger.cpp 1.136 or later
but does not happen with 1.135 (at least for the body).
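The diagnosis in comment 41 can be modeled in a few lines. This is a sketch under stated assumptions, not the actual converter: ISO-2022-JP uses only 7-bit bytes, so if those bytes are widened to 16-bit units without being decoded, every unit looks like ASCII, and a Unicode-to-Shift_JIS conversion that passes ASCII through unchanged (as the Windows variant does) hands back the ISO-2022-JP bytes. Both function names are hypothetical.

```cpp
#include <string>

// Widen each byte to one 16-bit unit without decoding it: the faulty input
// described in comment 41 ("ISO-2022-JP in UCS2 format").
std::u16string widenBytes(const std::string& bytes) {
    std::u16string out;
    for (unsigned char b : bytes) out.push_back(static_cast<char16_t>(b));
    return out;
}

// ASCII part of a Unicode-to-Shift_JIS conversion: U+0000..U+007F map to the
// same byte. Non-ASCII input is outside this sketch and becomes '?'.
std::string asciiOnlyToShiftJis(const std::u16string& text) {
    std::string out;
    for (char16_t unit : text) {
        out.push_back(unit < 0x80 ? static_cast<char>(unit) : '?');
    }
    return out;
}
```

Feeding ISO-2022-JP bytes through `widenBytes` and then `asciiOnlyToShiftJis` returns the same bytes, so the saved .txt file is still ISO-2022-JP even though the target charset was correctly detected as Shift_JIS.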
Comment 43 rhp (gone) 2000-02-24 17:38:18 PST
will investigate.
Comment 44 rhp (gone) 2000-02-24 19:34:20 PST
Ok, I think I have a fix for the Save As Text. To tell for sure, I'm going to
have to send 2 patches to Naoki for him to try.

Let me generate those and I will send it in email. This will also address the
extension issue.

- rhp
Comment 45 rhp (gone) 2000-02-25 15:10:57 PST
Ok, after working with naoki and momoi san, I have this fixed...really :-) I
will get this reviewed.

- rhp
Comment 46 rhp (gone) 2000-02-29 10:01:54 PST
Ok, this one should be fixed once and for all :-)

- rhp
Comment 47 Katsuhiko Momoi 2000-03-08 02:25:26 PST
** Checked with 3/7/2000 Win32 build **

Saving as into .eml, .html, and .txt is now generally working.
You can let the dialog supply an extension or supply one yourself.
Either way, this works. 

I have not seen any problem with saving into .eml and .html files.
There is a problem, however saving into .txt file, particularly
Japanese (ISO-2022-JP) messages. The data is cut off in the process
of saving. I'll append 2 images showing one such example.
I'll also upload a test msg file showing 4 JPN msgs + 1 UTF-8 (JPN)
test message. 
It's hard to predict where the cut off occurs saving into .txt
format. But in all 4 messages, it does. I've seen something similar
before in processing ISO-2022-JP and mistaking some bytes as
HTML tags or special characters. It also does not look like we 
handle ASCII well saving into .txt format. You get a "?" symbol
in one of the msgs. The cut off may be occurring for more than
one reasons.

I'm re-opening this...
Comment 48 Katsuhiko Momoi 2000-03-08 02:27:40 PST
Created attachment 6263
A message shown in its entirety. Take a look at the next msg to see how it is saved into .txt format.
Comment 49 Katsuhiko Momoi 2000-03-08 02:28:52 PST
Created attachment 6264
The msg is saved into .txt format -- now shows truncation.
Comment 50 Katsuhiko Momoi 2000-03-08 02:30:52 PST
The image above is a relatively benign case of a few characters at the
end being cut off. 5 test messages show much more severe
truncation of one kind or another.
The test file is attached below.
Comment 51 Katsuhiko Momoi 2000-03-08 02:32:50 PST
Created attachment 6265
5 test message in this file -- 4 JPN and 1 UTF-8 encoded JPN message.
Comment 52 Katsuhiko Momoi 2000-03-08 02:42:54 PST
Put the mailbox file into your Local folder and
save each one into .txt format under Japanese Windows.
Well, this could be a problem for rhp. 
Naoki, can you help with debugging this problem?
Comment 53 Katsuhiko Momoi 2000-03-08 02:56:47 PST
I'm going to re-do the quoted portion below because some crucial words
have been omitted due to my poor typing.

"It's hard to predict where the cut off occurs saving into .txt
format. But in all 4 messages, it does. I've seen something similar
before in processing ISO-2022-JP and mistaking some bytes as
HTML tags or special characters. It also does not look like we 
handle ASCII well saving into .txt format. You get a "?" symbol
in one of the msgs. The cut off may be occurring for more than
one reasons."

should read:

"It's hard to predict where the cut off occurs saving into .txt
format. But in all *5* messages, it does. I've seen something similar
before in processing ISO-2022-JP and mistaking some bytes as
HTML tags or special characters. It also does not look like we 
handle ASCII *space* well saving into .txt format. You get a "?" symbol
in one of the msgs. The cut off may be occurring for more than
one reasons."

I've indicated addition of 2 words with * *.
Comment 54 rhp (gone) 2000-03-08 07:54:47 PST
I'll investigate, but I've put so much time into this already, I need to focus 
on a bunch of other issues. Sorry, this is good enough for beta. I will change 
the summary, clear the whiteboard, etc.. and work on it later.

- rhp
Comment 55 rhp (gone) 2000-05-17 12:42:31 PDT
*** Bug 39357 has been marked as a duplicate of this bug. ***
Comment 56 rhp (gone) 2000-07-22 11:44:52 PDT
I don't seem to be getting cutoff files when saved. Can you retest this?

- rhp
Comment 57 Katsuhiko Momoi 2000-09-09 03:18:44 PDT
Sorry. Somehow this escaped my attention for a while.
I looked at this problem with 9/8/2000 Win32 build.
The same problem still exists. 

The body text: XXXPlain text 
is saved as: XXXPlain te

thus missing the last 2 letters.

I tried a number of Japanese messages both HTML and plain text
type, they all suffer from this problem. 

I'm re-opening this bug. I think you need to have a Japanese
Windows system to see this problem clearly because it involves 
saving into Shift_JIS. I don't see this problem when saving
ASCII mail.

Naoki, please take a look at this. There seems to be a problem
in converting to Shift_JIS.
Comment 58 rhp (gone) 2000-09-09 06:21:51 PDT
If you can see if this is a conversion error, that would be great.

- rhp
Comment 59 nhottanscp 2000-09-11 11:59:59 PDT
I can reproduce when saving as a text file but not html.
It is always cut at the end of the file, so I assume there is something wrong in 
the length calculation.
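One way such a tail truncation can arise, sketched as an assumption (the actual fix, attachment 14532, is not reproduced in this report): the write step uses the length of the Unicode string as the byte count, but each Japanese character takes two bytes in Shift_JIS, so the converted buffer is longer than the character count and the extra bytes at the end are dropped. `writeBytes` and the constants are hypothetical.

```cpp
#include <cstddef>
#include <string>

// Hypothetical write step: emits the first `byteCount` bytes of `converted`.
std::string writeBytes(const std::string& converted, std::size_t byteCount) {
    return converted.substr(0, byteCount);
}

// Two double-byte characters (日本 = 0x93FA 0x967B in Shift_JIS)
// followed by the ASCII text "Plain text": 14 bytes in total.
const std::string kConverted = "\x93\xFA\x96\x7B" "Plain text";

// Length of the same text counted in Unicode characters: 2 + 10.
const std::size_t kUnicodeLength = 12;
```

Writing `kUnicodeLength` bytes yields the two Japanese characters followed by "Plain te", the same shape as the truncation reported in comment 57; writing `kConverted.size()` bytes keeps the whole text.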
Comment 60 Katsuhiko Momoi 2000-09-11 22:13:04 PDT
Saving into a plain text file is problematic whether the message in
question is of HTML or plain text type. 
Of all the saving options, this one is probably used the most.
Nominating for nsbeta3.
Comment 61 Frank Tang 2000-09-12 14:49:50 PDT
nsbeta3+ P2
Comment 62 nhottanscp 2000-09-12 15:43:57 PDT
I have a patch for this.
Comment 63 nhottanscp 2000-09-12 15:44:47 PDT
Created attachment 14532
a patch to use a correct string length
Comment 64 nhottanscp 2000-09-14 19:58:24 PDT
Fix checked in.
Comment 65 ji 2000-09-19 12:03:28 PDT
Verified with win32 2000091909 and linux 2000091906 build. It's fixed.
