Closed Bug 23418 Opened 25 years ago Closed 24 years ago

"File | save as file" has problems with Japanese messages saved as plain text

Categories

(MailNews Core :: Internationalization, defect, P2)

x86
Windows NT

Tracking

(Not tracked)

VERIFIED FIXED

People

(Reporter: momoi, Assigned: nhottanscp)

Details

(Whiteboard: nsbeta3+, patch in hand, reviewed)

Attachments

(7 files)

** Observed with 1/6/99 Win32 build **

When a non-ASCII msg is saved as a file using the File | Save as menu,
the file is currently saved without any extension and "as is", whether
the option selected by the user is HTML or plain text.

Some users may be tempted to supply the .html extension themselves.
If so, the result will be a Unicode-encoded, somewhat deficient HTML
file. This needs to be corrected.

I guess HTML format should save as is -- JPN msg should be saved in
HTML format and in JIS.

We should also try supplying .txt extension and see what happens.

Also, if there isn't one, we should file a separate bug for
the format option malfunction at the save as dialog window.
Assignee: nhotta → jefft
Reassign to jefft.
Save as text/html -> Currently saved as UTF-8 (wrong) - we need to convert UTF-8
to mail charset (e.g. ISO-2022-JP).
Save as text/plain -> IQA needs to test the current behavior. The spec is to
convert the data to platform file charset. Below is the code to get the platform
file charset.

  #define NS_IMPL_IDS
  #include "nsIPlatformCharset.h"
  #undef NS_IMPL_IDS

  // Look up the platform file charset through the platform charset service.
  nsCOMPtr<nsIPlatformCharset> platformCharset;
  nsAutoString aPlatformCharset;
  rv = nsComponentManager::CreateInstance(NS_PLATFORMCHARSET_PROGID, nsnull,
                                          NS_GET_IID(nsIPlatformCharset),
                                          getter_AddRefs(platformCharset));
  if (NS_SUCCEEDED(rv))
  {
    // On success, aPlatformCharset holds the charset name.
    rv = platformCharset->GetCharset(kPlatformCharsetSel_FileName,
                                     aPlatformCharset);
  }
Status: NEW → ASSIGNED
Target Milestone: M14
Currently when we save as into txt format, this is what happens:

1. Save as into text without the user explicitly supplying the .txt
   extension. --> saves the whole msg including vCard, etc. in
   ISO-2022-JP. This looks like the source data itself.

2. Save as into text by supplying the .txt extension yourself.
   Saves without vCard and other extra parts but in ISO-2022-JP
   rather than the expected Shift_JIS for JPN Windows.
Both case 1 and case 2 should save without extra parts and in Shift_JIS
rather than in ISO-2022-JP.
I have just added a new function in nsMsgI18N.h named msgCompFileSystemCharset
(oops, not a generic enough name; maybe we should rename it) that gives you back
the file system character set.
I would like to designate this as beta1 since "file | save as" is something
users do use and its UI should be non-confusing and
its resulting effects should be consistent and correct.
Keywords: beta1
Putting on PDT+ radar for beta1.
Whiteboard: [PDT+]
Load-balance to rhp
Assignee: jefft → rhp
Status: ASSIGNED → NEW
Status: NEW → ASSIGNED
Whiteboard: [PDT+] → [PDT+] Land by: 2/11/00
Depends on: 1775
No longer depends on: 1775
Ok, I have most of this done for plain text, but I have a question on saving
messages as HTML. Why can't I emit a META tag header with UTF-8 as the output
for the saved HTML web page? With the current architecture, your suggestion of
going back to the original charset of the message is full of problems and I
don't think it will give us a very good result.

Comments?

- rhp 
I think users will want to see the original data encoding.
On most platforms, you can then open such a file with 
a text editor and see & edit the content without a problem.
That is not going to be the case if you save into UTF-8.

That and in the past we have been saving the original data
as is under HTML extension and I don't think we should break the
familiar behavior. 
I understand your point and I think I have an idea how to fix this...but just 
to clarify, I don't think this feature really works on 4.x. When you do a Save 
As for a mail message in 4.x (HTML format), you basically get the raw contents 
of the message dumped to the file with RFC822 headers, etc. 

We'll do better than this for 5.x  :-)

- rhp
Ok, I've tried to improve this performance. Is this all going to work 100%?
Probably not. For plain text, I am getting the output from libmime and saving
it as plain text in the charset of the system.

For HTML, I am trying to output the message in the original charset without any
conversion.

momoi san,
Now, there are a million combinations that can happen here and scenarios that
will probably break, but what I need you to do is help me with the major issues
and I'll debug from there.

- rhp
Status: ASSIGNED → RESOLVED
Closed: 25 years ago
Resolution: --- → FIXED
Hi, Rich. In the case of HTML msg, are you going to strip out
some headers and MIME structural material? Under 4.x, we didn't
do any of that and saved the original msg. In case we run into
complicated problems, can we go back to that simple msg
saving method for HTML?
Actually, the way this works is as follows:

 - Save As: .EML (or no extension) - Saves the raw RFC822 message to a file
 - Save As: .TXT - Does its best job at converting what you see in your message 
display into a .TXT file (in the native charset of the system)
 - Save As: .HTML - Again, it outputs HTML that will give you a display similar 
to what you see in your 3 pane message display
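
The extension-to-format mapping listed above can be sketched roughly as below. This is a standalone illustration, not the code in the tree: `SaveFormat` and `FormatFromFileName` are hypothetical names, and treating an unrecognized extension like .eml is an assumption.

```cpp
#include <algorithm>
#include <cctype>
#include <string>

// Hypothetical enum for the three output kinds described above.
enum class SaveFormat { RawMessage, PlainText, Html };

// Pick the output format from the file name's extension (case-insensitive).
// No extension means the raw RFC822 message, as stated in the comment above.
SaveFormat FormatFromFileName(const std::string& fileName) {
  const std::string::size_type dot = fileName.find_last_of('.');
  const std::string::size_type sep = fileName.find_last_of("/\\");
  // No dot at all, or the last dot belongs to a directory name.
  if (dot == std::string::npos ||
      (sep != std::string::npos && dot < sep)) {
    return SaveFormat::RawMessage;
  }

  std::string ext = fileName.substr(dot + 1);
  std::transform(ext.begin(), ext.end(), ext.begin(), [](unsigned char c) {
    return static_cast<char>(std::tolower(c));
  });

  if (ext == "txt") return SaveFormat::PlainText;
  if (ext == "html") return SaveFormat::Html;
  // ".eml" lands here; so does any other extension (assumption).
  return SaveFormat::RawMessage;
}
```

Because the choice hangs entirely on the extension, a dialog that does not append one always ends up in the raw-message branch.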

Actually, Communicator 4.x "Save As" HTML doesn't really work. It simply saves 
the raw RFC822 text into a file. So, in 5.x we will actually generate something 
that is an HTML document without any RFC822 messages.

- rhp
Great! Look forward to seeing the results in today's build.
I'm not done looking at this yet, but so far I have found a couple of problems.

1. Even though the save as dialog shows possible extension types, it does not supply one
   automatically unless the user writes it in. Otherwise it's all saved in .eml format.
2. When saving into .txt format, the headers (To:, CC:, Date:, etc.) and their content are 
   separated into different lines, e.g.

   From: 
   Jane Banning


Momoi,
The separate line problem is in the HTML - TEXT converter. Table conversion is 
pretty bad. Don't reopen this bug on that issue.

- rhp
Thanks. Browser save as has the same problems as you say for .txt format.
The charset conversion to system charset in .txt format seems to be working well.
Rich, sorry it's taking me longer to verify this fix.
I've found some definite problems:

1. We are saving the original MIME-encoded subject headers
   in UTF-8 rather than in the original encoding indicated by the 
   MIME charset in the header when using the HTML save option.

2. Earlier .txt (when supplied by the user) files were generally
   saved correctly -- that is what I reported. But now with this
   build, all I see is the same data as .eml file. What happened
   since then?

3. We don't automatically supply the extensions even though the
   file | save as file dialog window clearly shows the 
   possible extension types. Users are used to the application
   automatically supplying the extension in the dialog if none
   is entered.

Re-opening for the above reasons... there may have been regressions
in the last few days also for .txt type.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
The build in which I saw the problems reported above was: Win32 2000021708.
Ok, well the REGRESSION that we are seeing (i.e. extensions are ignored) is a 
bug that I think I found in nsString(). 

Rick: On line 517 of xpcom/ds/nsStr.cpp, there is a line:

  const char* destLast  = root+((aDest.mLength-1)*aDelta); //pts to last char in aDest (likely null)

but if you really want to point to the NULL character of this string, the "-1" 
shouldn't be there. I changed the line to remove the -1 and it started working 
again. With the -1, you never get a hit.
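
The difference between the two end pointers can be shown with a small standalone sketch. It does not reproduce nsStr.cpp: the helper names are hypothetical, the element size (aDelta) is taken to be 1, and the scan is only an example of a loop that stops before its end pointer.

```cpp
#include <cassert>
#include <cstddef>

// Address of the last real character of a non-empty string.
const char* LastCharOf(const char* root, std::size_t length) {
  assert(length > 0);
  return root + (length - 1);
}

// Address of the terminating null, one past the last real character.
const char* TerminatorOf(const char* root, std::size_t length) {
  return root + length;
}

// Hypothetical scan over the half-open range [root, end).
bool ContainsBefore(const char* root, const char* end, char ch) {
  for (const char* p = root; p < end; ++p) {
    if (*p == ch) return true;
  }
  return false;
}
```

With an end pointer that includes the "-1", a loop of this shape never examines the final character; with the terminator as the end pointer it does.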

The headers issue sounds like a bug I will tackle. Also, I will look at adding 
the extensions for these files.

Technically, I'm supposed to be on vacation today, but I'll see what I can do.

- rhp
Status: REOPENED → ASSIGNED
Ok, I think I have a fix for the header conversion stuff, but I need Naoki's 
help now.

I will attach a patch that you can apply from the same level as your /mozilla
directory. It fixes the headers being converted into UTF-8. Now, I have a
problem with the ConvertFromUnicode() call. I pass in a bunch of unicode data 
(I think) and I want it converted, but for one of the Smoketest email messages, 
the ConvertFromUnicode() routine seems to truncate after the text "Subject: ". 
Can you help me with this one?

Other than that, I think this is close to being fixed. 

- rhp
Whiteboard: [PDT+] Land by: 2/11/00 → [PDT+]
I had just started to rebuild the tree. I'll try the patch when it's done.
I looked at the patch. I have not followed this bug in detail, so the header is
supposed to be converted to the charset specified in the MIME header instead of
UTF-8? I am not sure which part of the code fixed it.
Anyway, I'll test after the build is done.
Right, we should be saving the file in the original message charset. The thing
that fixed the problem was taking out the "header=quoting" line. This changed
the behavior in libmime.

If you can try to save that message off, you'll see the conversion problem. 
Thanks for the help!

- rhp
Okay, my build completed. And I applied the patch and did nmake at 
mailnews/base. But headers are still saved as UTF-8 (I saved as html).
I set a break point at the changed line as below but didn't hit.

--- 1417,1423 ----
        ConvertBufToPlainText(m_msgBuffer);
        rv = ConvertFromUnicode(msgCompFileSystemCharset(), m_msgBuffer,
                                &conBuf);
You won't hit that line unless you save as plain text. Make sure you are saving 
the file as "test.TXT".

- rhp
I saved as .TXT (selected it from the popup and added the extension manually); that
works (saved correctly) but it didn't hit the break point.

Save as html, just above the break point I mentioned, it checks a value 
m_doCharsetConversion.
In my case that's 0 (false), do I need to change my environment?

BTW, I think the reason for the conversion failure you saw is probably that you are using
US Windows. In that case, the file system charset is windows-1252 and the converter
will fail to convert the Japanese text.
Whiteboard: [PDT+] → [PDT+] [2/19/00]
I've landed my change to nsStr which partially fixes this problem. RHP -- it's 
up to you now.
Ok, I have this boiled down to an issue of "Save As" into HTML still converting
the headers to UTF-8. I know what I need to do so this shouldn't be too hard to fix.

- rhp
Ok, I have this fixed now and will get it into the tree when I get approval.

- rhp
Ok, this should be much better than it was now. - rhp
Status: ASSIGNED → RESOLVED
Closed: 25 years ago → 25 years ago
Resolution: --- → FIXED
Reopening this bug. I found a test case that broke it, but the good thing is 
that I have it fixed and just need to get permission to checkin.

- rhp
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Just updating the whiteboard. I have a pretty simple fix in hand.

- rhp
Whiteboard: [PDT+] [2/19/00] → [PDT+] [2/22/00]
Status: REOPENED → ASSIGNED
Ok, I think my latest checkins fixed this problem now. - rhp
Status: ASSIGNED → RESOLVED
Closed: 25 years ago → 25 years ago
Resolution: --- → FIXED
I still see 2 problems with ** 2/24/2000 Win32 build **

1. Extensions need to be supplied to get the right results.
   --> Maybe this should go to another bug.

2. When saving into .txt file format, it does not save into
   the system charset. So this means that JPN mail msg is saved
   in ISO-2022-JP rather than in Shift_JIS.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
I have the first problem fixed in my tree...let's not file a new bug on this. 
On the second one, I save the plain text file in the following charset:

                nsAutoCString(msgCompFileSystemCharset())

if this is wrong, I would have Naoki look at it. 

I really want to get this bug resolved...I've spent a lot of time on this one 
issue.

- rhp
I've tested this to the best of my abilities on my machine. I am converting
the Unicode text from the mail message into the charset I get from:

                    msgCompFileSystemCharset()

I can send you a recent diff to look at what is going into the tree eventually 
to fix the extension request.

- rhp
Assignee: rhp → nhotta
Status: REOPENED → NEW
Okay, I'll take a look.
Status: NEW → ASSIGNED
msgCompFileSystemCharset() returns Shift_JIS on my NT-J.
I debugged the conversion code; it is getting Shift_JIS as the conversion charset,
so it's supposed to convert from Unicode to Shift_JIS.
But the data it's getting is ISO-2022-JP in UCS2 format. The Shift_JIS converter
just removes the padded zeros from each character, so the result we get is
ISO-2022-JP.
On my machine it happens with nsMessenger.cpp 1.136 or later
but does not happen with 1.135 (at least for the body).
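
The effect described in the comment above can be reproduced in isolation. The sketch below is standalone and its helpers are hypothetical; `EncodeAsciiRange` is only a stand-in for a Unicode-to-Shift_JIS encoder, under the assumption that code units below 0x80 come out as the same single byte.

```cpp
#include <string>

// Widen each byte to one 16-bit unit without decoding it. This models
// "ISO-2022-JP in UCS2 format": the escape sequences and 7-bit bytes are
// simply zero-padded instead of being turned into Japanese characters.
std::u16string WidenBytes(const std::string& bytes) {
  std::u16string out;
  for (unsigned char b : bytes) out.push_back(static_cast<char16_t>(b));
  return out;
}

// Stand-in encoder limited to the range below 0x80. Anything else becomes
// '?' because no real Shift_JIS mapping table is included in this sketch.
std::string EncodeAsciiRange(const std::u16string& text) {
  std::string out;
  for (char16_t u : text) {
    out.push_back(u < 0x80 ? static_cast<char>(u) : '?');
  }
  return out;
}
```

Since ISO-2022-JP is a 7-bit encoding, every widened unit falls below 0x80, so encoding the widened text gives back the original ISO-2022-JP bytes unchanged, whatever the target charset was meant to be.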
Assignee: nhotta → rhp
Status: ASSIGNED → NEW
will investigate.
Status: NEW → ASSIGNED
Whiteboard: [PDT+] [2/22/00] → [PDT+]
Ok, I think I have a fix for the Save As Text. To tell for sure, I'm going to
have to send 2 patches to Naoki for him to try.

Let me generate those and I will send it in email. This will also address the
extension issue.

- rhp
Ok, after working with naoki and momoi san, I have this fixed...really :-) I
will get this reviewed.

- rhp
Whiteboard: [PDT+] → [PDT+] CAN CHECKIN FIX ANY TIME
Summary: "File | save as file" saves in Unicode when user supplies html extension → [FIXED] "File | save as file" saves in Unicode when user supplies html extension
Ok, this one should be fixed once and for all :-)

- rhp
Status: ASSIGNED → RESOLVED
Closed: 25 years ago → 25 years ago
Resolution: --- → FIXED
Summary: [FIXED] "File | save as file" saves in Unicode when user supplies html extension → "File | save as file" saves in Unicode when user supplies html extension
** Checked with 3/7/2000 Win32 build **

Saving as into .eml, .html, and .txt is now generally working.
You can let the dialog supply an extension or supply one yourself.
Either way, this works. 

I have not seen any problem with saving into .eml and .html files.
There is a problem, however, when saving into a .txt file, particularly
with Japanese (ISO-2022-JP) messages. The data is cut off in the process
of saving. I'll append 2 images showing one such example.
I'll also upload a test msg file showing 4 JPN msgs + 1 UTF-8 (JPN)
test message. 
It's hard to predict where the cut off occurs saving into .txt
format. But in all 4 messages, it does. I've seen something similar
before in processing ISO-2022-JP and mistaking some bytes as
HTML tags or special characters. It also does not look like we 
handle ASCII well saving into .txt format. You get a "?" symbol
in one of the msgs. The cut off may be occurring for more than
one reason.

I'm re-opening this...
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
The image above is a relatively benign case of a few characters at the
end being cut off. The 5 test messages show much more severe
truncation of one kind or another.
The test file is attached below.
Put the mailbox file into your Local folder and
save each one into .txt format under Japanese Windows.
Well, this could be a problem for rhp. 
Naoki, can you help with debugging this problem?
I'm going to re-do the quoted portion below because some crucial words
have been omitted due to my poor typing.

"It's hard to predict where the cut off occurs saving into .txt
format. But in all 4 messages, it does. I've seen something similar
before in processing ISO-2022-JP and mistaking some bytes as
HTML tags or special characters. It also does not look like we 
handle ASCII well saving into .txt format. You get a "?" symbol
in one of the msgs. The cut off may be occurring for more than
one reason."

should read:

"It's hard to predict where the cut off occurs saving into .txt
format. But in all *5* messages, it does. I've seen something similar
before in processing ISO-2022-JP and mistaking some bytes as
HTML tags or special characters. It also does not look like we 
handle ASCII *space* well saving into .txt format. You get a "?" symbol
in one of the msgs. The cut off may be occurring for more than
one reason."

I've indicated addition of 2 words with * *.
I'll investigate, but I've put so much time into this already, I need to focus 
on a bunch of other issues. Sorry, this is good enough for beta. I will change 
the summary, clear the whiteboard, etc.. and work on it later.

- rhp
Status: REOPENED → ASSIGNED
Keywords: beta1
Summary: "File | save as file" saves in Unicode when user supplies html extension → "File | save as file" has problems with Japanese messages saved as plain text
Whiteboard: [PDT+] CAN CHECKIN FIX ANY TIME
Target Milestone: M14 → M17
*** Bug 39357 has been marked as a duplicate of this bug. ***
Target Milestone: M17 → M18
I don't seem to be getting cutoff files when saved. Can you retest this?

- rhp
Status: ASSIGNED → RESOLVED
Closed: 25 years ago → 24 years ago
Resolution: --- → WORKSFORME
Sorry. Somehow this escaped my attention for a while.
I looked at this problem with 9/8/2000 Win32 build.
The same problem still exists. 

The body text: XXXPlain text 
is saved as: XXXPlain te

thus missing the last 2 letters.

I tried a number of Japanese messages both HTML and plain text
type, they all suffer from this problem. 

I'm re-opening this bug. I think you need to have a Japanese
Windows system to see this problem clearly because it involves 
saving into Shift_JIS. I don't see this problem when saving
ASCII mail.

Naoki, please take a look at this. There seems to be a problem
in converting to Shift_JIS.
Status: RESOLVED → REOPENED
Resolution: WORKSFORME → ---
Naoki,
If you can see if this is a conversion error, that would be great.

- rhp
Assignee: rhp → nhotta
Status: REOPENED → NEW
I can reproduce this when saving as a text file but not as html.
It always cuts off at the bottom of the file, so I assume there is something wrong in the
length calculation.
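
One way a length miscalculation can cut the tail off a converted file is sketched below. The actual cause is not stated in this bug, so this is only an assumed scenario with hypothetical helper names: the writer is handed the character count from before conversion instead of the byte count of the converted buffer.

```cpp
#include <cstddef>
#include <string>

// Hypothetical writer: returns the first `count` bytes of `encoded`,
// standing in for what would reach the file.
std::string WriteBytes(const std::string& encoded, std::size_t count) {
  return encoded.substr(0, count);
}

// Assumed buggy variant: uses the number of characters before conversion
// as the number of bytes to write.
std::string SaveWithCharCount(const std::string& encoded,
                              std::size_t charCount) {
  return WriteBytes(encoded, charCount);
}

// Variant that uses the length of the converted buffer itself.
std::string SaveWithByteCount(const std::string& encoded) {
  return WriteBytes(encoded, encoded.size());
}
```

In Shift_JIS each Japanese character takes two bytes, so under this assumption the file loses one byte at the end for every double-byte character in the text, while ASCII-only mail is unaffected.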
Saving into a plain text file is problematic when the message in
question is of HTML or plain text type.
Of all the saving options, this one is probably used the most.
Nominating for nsbeta3.
Keywords: nsbeta3
nsbeta3+ P2
Priority: P3 → P2
Whiteboard: nsbeta3+
I have a patch for this.
Status: NEW → ASSIGNED
Whiteboard: nsbeta3+ → nsbeta3+, patch in hand need review
Whiteboard: nsbeta3+, patch in hand need review → nsbeta3+, patch in hand, reviewed
Fix checked in.
Status: ASSIGNED → RESOLVED
Closed: 24 years ago → 24 years ago
Resolution: --- → FIXED
Verified with win32 2000091909 and linux 2000091906 build. It's fixed.
Status: RESOLVED → VERIFIED
Product: MailNews → Core
Product: Core → MailNews Core