"File | save as file" has problems with Japanese messages saved as plain text

VERIFIED FIXED in M18

Status

MailNews Core
Internationalization
P2
major
VERIFIED FIXED
18 years ago
9 years ago

People

(Reporter: Katsuhiko Momoi, Assigned: nhottanscp)

Tracking

Trunk
x86
Windows NT

Firefox Tracking Flags

(Not tracked)

Details

(Whiteboard: nsbeta3+, patch in hand, reviewed)

Attachments

(7 attachments)

(Reporter)

Description

18 years ago
** Observed with 1/6/99 Win32 build **

When a non-ASCII msg is saved as a file using the File | Save as menu,
the file is currently saved without any extension and "as is", whether
the option selected by the user is HTML or plain text.

Some users may be tempted to supply the .html extension themselves.
If so, the result will be a Unicode-encoded, somewhat deficient HTML
file. This needs to be corrected.

I guess HTML format should save as is -- JPN msg should be saved in
HTML format and in JIS.

We should also try supplying .txt extension and see what happens.

Also, if there isn't one, we should file a separate bug for
the format option malfunction at the save as dialog window.
(Reporter)

Comment 1

18 years ago
Created attachment 4067 [details]
a JPN msg saved with html option -- user manually writing in the extension.
(Assignee)

Updated

18 years ago
Assignee: nhotta → jefft
(Assignee)

Comment 2

18 years ago
Reassign to jefft.
Save as text/html -> Currently saved as UTF-8 (wrong) - we need to convert UTF-8
to mail charset (e.g. ISO-2022-JP).
Save as text/plain -> IQA needs to test the current behavior. The spec is to
convert the data to platform file charset. Below is the code to get the platform
file charset.

  #define NS_IMPL_IDS
  #include "nsIPlatformCharset.h"
  #undef NS_IMPL_IDS

  nsCOMPtr<nsIPlatformCharset> platformCharset;
  nsAutoString aPlatformCharset;
  rv = nsComponentManager::CreateInstance(NS_PLATFORMCHARSET_PROGID, nsnull,
                                          NS_GET_IID(nsIPlatformCharset),
                                          getter_AddRefs(platformCharset));
  if (NS_SUCCEEDED(rv))
  {
    rv = platformCharset->GetCharset(kPlatformCharsetSel_FileName,
                                     aPlatformCharset);
  }

Updated

18 years ago
Status: NEW → ASSIGNED
Target Milestone: M14
(Reporter)

Comment 3

18 years ago
Currently when we save as into txt format, this is what happens:

1. Save as into text without explicitly supplying the .txt extension
   by the user. --> saves the whole msg including vCard, etc. into
   ISO-2022-JP. This looks like source data themselves.

2. Save as into text by supplying .txt extension yourself.
   Saves without vCard and other extra parts, but into ISO-2022-JP
   rather than the expected Shift_JIS for JPN Windows.
(Reporter)

Comment 4

18 years ago
Both case 1 and case 2 should save without extra parts and into Shift_JIS
rather than in ISO-2022-JP.
I have just added a new function in nsMsgI18N.h named msgCompFileSystemCharset
(oops, not a generic enough name, maybe we should rename it) that gives you back
the file system character set.
(Reporter)

Comment 6

18 years ago
I would like to designate this as beta1 since "file | save as" is something
users do use and its UI should be non-confusing and
its resulting effects should be consistent and correct.
Keywords: beta1

Comment 7

18 years ago
Putting on PDT+ radar for beta1.
Whiteboard: [PDT+]

Comment 8

18 years ago
Load-balance to rhp
Assignee: jefft → rhp
Status: ASSIGNED → NEW

Updated

18 years ago
Status: NEW → ASSIGNED

Updated

18 years ago
Whiteboard: [PDT+] → [PDT+] Land by: 2/11/00

Updated

18 years ago
Depends on: 1775

Updated

18 years ago
No longer depends on: 1775

Comment 9

18 years ago
Ok, I have most of this done for plain text, but I have a question on saving 
messages as HTML. Why can't I emit a META tag header with UTF-8 as the output 
for the saved HTML web page? With the current architecture, your suggestion of 
going back to the original charset of the message is full of problems and I 
don't think it will give us a very good result.
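
For reference, the META declaration in question is a single line in the head of the saved page. Below is a minimal sketch of building it for either choice (UTF-8 or the message's original charset). `makeCharsetMeta` is a hypothetical helper made up for illustration, not a function in the Mozilla tree.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: builds the META declaration that tells a browser
// which charset the saved HTML file is encoded in.
std::string makeCharsetMeta(const std::string& charset) {
  assert(!charset.empty());  // a charset name is required
  return "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=" +
         charset + "\">";
}
```

The declaration only labels the file; the bytes written after it still have to be in that charset, which is the part this bug is about.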

Comments?

- rhp 
(Reporter)

Comment 10

18 years ago
I think users will want to see the original data encoding.
On most platforms, you can then open such a file with 
a text editor and see & edit the content without a problem.
That is not going to be the case if you save into UTF-8.

That and in the past we have been saving the original data
as is under HTML extension and I don't think we should break the
familiar behavior. 

Comment 11

18 years ago
I understand your point and I think I have an idea how to fix this...but just 
to clarify, I don't think this feature really works on 4.x. When you do a Save 
As for a mail message in 4.x (HTML format), you basically get the raw contents 
of the message dumped to the file with RFC822 headers, etc. 

We'll do better than this for 5.x  :-)

- rhp

Comment 12

18 years ago
Ok, I've tried to improve this performance. Is this all going to work 100%? 
Probably not. For plain text, I am getting the output from libmime and saving 
it in plaintext as the charset of the system.

For HTML, I am trying to output the message in the original charset without any 
conversion.

momoi san,
Now, there are a million combinations that can happen here and scenarios that 
will probably break, but what I need you to do is help me with the major issues 
and I'll debug from there.

- rhp
Status: ASSIGNED → RESOLVED
Last Resolved: 18 years ago
Resolution: --- → FIXED
(Reporter)

Comment 13

18 years ago
Hi, Rich. In the case of HTML msg, are you going to strip out
some headers and MIME structural material? Under 4.x, we didn't
do any of that and saved the original msg. In case we run into
complicated problems, can we go back to that simple msg
saving method for HTML?

Comment 14

17 years ago
Actually, the way this works is as follows:

 - Save As: .EML (or no extension) - Saves the raw RFC822 message to a file
 - Save As: .TXT - Does its best job at converting what you see in your message 
display into a .TXT file (in the native charset of the system)
 - Save As: .HTML - Again, it outputs HTML that will give you a display similar 
to what you see in your 3 pane message display
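
The three cases above amount to a dispatch on the file name's extension. Below is a minimal sketch of that dispatch. `SaveFormat` and `formatForFileName` are hypothetical names, and both the `.htm` spelling and the fallback to the raw message for unrecognized extensions are assumptions, not something this bug states.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <string>

enum class SaveFormat { RawMessage, PlainText, Html };

// Hypothetical helper: picks the output format from the extension of the
// file name, case-insensitively.
SaveFormat formatForFileName(const std::string& fileName) {
  assert(!fileName.empty());
  const std::string::size_type dot = fileName.find_last_of('.');
  if (dot == std::string::npos) return SaveFormat::RawMessage;  // no extension
  std::string ext = fileName.substr(dot + 1);
  std::transform(ext.begin(), ext.end(), ext.begin(), [](unsigned char c) {
    return static_cast<char>(std::tolower(c));
  });
  if (ext == "txt") return SaveFormat::PlainText;
  if (ext == "html" || ext == "htm") return SaveFormat::Html;  // .htm assumed
  return SaveFormat::RawMessage;  // .eml, and (assumed) anything unrecognized
}
```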

Actually, Communicator 4.x "Save As" HTML doesn't really work. It simply saves 
the raw RFC822 text into a file. So, in 5.x we will actually generate something 
that is an HTML document without any RFC822 messages.

- rhp
(Reporter)

Comment 15

17 years ago
Great! Look forward to seeing the results in today's build.
(Reporter)

Comment 16

17 years ago
I'm not done looking at this yet, but so far I have found a couple of problems.

1. Even though the save as dialog shows possible extension types, it does not supply one
   automatically unless the user writes it in. Otherwise it's all saved in .eml format.
2. When saving into .txt format, the headers (To:, CC:, Date:, etc.) and their content are 
   separated into different lines, e.g.

   From: 
   Jane Banning


Comment 17

17 years ago
Momoi,
The separate line problem is in the HTML - TEXT converter. Table conversion is 
pretty bad. Don't reopen this bug on that issue.

- rhp
(Reporter)

Comment 18

17 years ago
Thanks. Browser save as has the same problems as you say for .txt format.
The charset conversion to system charset in .txt format seems to be working well.
(Reporter)

Comment 19

17 years ago
Rich, sorry it's taking me longer to verify this fix.
I've found some definite problems:

1. We are saving the original MIME-encoded subject headers
   in UTF-8 rather than in the original encoding indicated by the 
   MIME charset in the header when using the HTML save option.

2. Earlier .txt (when supplied by the user) files were generally
   saved correctly -- that is what I reported. But now with this
   build, all I see is the same data as .eml file. What happened
   since then?

3. We don't automatically supply the extensions even though the
   file | save as file dialog window clearly shows the 
   possible extension types. Users are used to the application
   automatically supplying the extension in the dialog if none
   is entered.
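
Item 3 can be sketched as a small post-processing step on the name returned by the dialog. `withDefaultExtension` is a hypothetical helper for illustration only; how the real file picker reports the selected type is not shown in this bug.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: appends defaultExt (e.g. ".txt") when the user typed
// a file name with no extension. A name that already has one is left alone.
std::string withDefaultExtension(const std::string& fileName,
                                 const std::string& defaultExt) {
  assert(!defaultExt.empty() && defaultExt[0] == '.');
  const std::string::size_type sep = fileName.find_last_of("/\\");
  const std::string::size_type dot = fileName.find_last_of('.');
  // A dot only counts if it is in the last path component and is followed
  // by at least one character.
  const bool hasExt = dot != std::string::npos &&
                      (sep == std::string::npos || dot > sep) &&
                      dot + 1 < fileName.size();
  return hasExt ? fileName : fileName + defaultExt;
}
```

With this, "message" becomes "message.txt" when the .txt type is selected, while "message.html" is kept as typed.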

Re-opening for the above reasons... there may have been regressions
in the last few days also for .txt type.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
(Reporter)

Comment 20

17 years ago
The build in which I saw the problems reported above was: Win32 2000021708.

Comment 21

17 years ago
Ok, well the REGRESSION that we are seeing (i.e. extensions are ignored) is a 
bug that I think I found in nsString(). 

Rick: On line 517 of xpcom/ds/nsStr.cpp, there is a line:

  const char* destLast  = root+((aDest.mLength-1)*aDelta); //pts to last char in aDest (likely null)

but if you really want to point to the NULL character of this string, the "-1" 
shouldn't be there. I changed the line to remove the -1 and it started working 
again. With the -1, you never get a hit.
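
To make the off-by-one concrete, here is a small standalone sketch of where the two candidate pointers land. It is an illustration only, not the nsStr code; the function names are made up, and `delta` stands for the character width in bytes.

```cpp
#include <cassert>
#include <cstddef>

// Illustration only: with the -1, the pointer lands on the last real
// character of a buffer holding `length` characters of `delta` bytes each.
const char* lastCharPtr(const char* root, std::size_t length, std::size_t delta) {
  assert(length > 0);
  return root + (length - 1) * delta;
}

// Without the -1, the pointer lands one past the last character, which is
// where the null terminator sits.
const char* terminatorPtr(const char* root, std::size_t length, std::size_t delta) {
  return root + length * delta;
}
```

For the 5-character string "a.txt" with 1-byte characters, the first pointer refers to the final 't' and the second to the terminating null.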

The headers issue sounds like a bug I will tackle. Also, I will look at adding 
the extensions for these files.

Technically, I'm supposed to be on vacation today, but I'll see what I can do.

- rhp
Status: REOPENED → ASSIGNED

Comment 22

17 years ago
Ok, I think I have a fix for the header conversion stuff, but I need Naoki's 
help now.

I will attach a patch that you can apply from the same level as your /mozilla 
directory. It fixes the headers being converted into UTF-8. Now, I have a 
problem with the ConvertFromUnicode() call. I pass in a bunch of unicode data 
(I think) and I want it converted, but for one of the Smoketest email messages, 
the ConvertFromUnicode() routine seems to truncate after the text "Subject: ". 
Can you help me with this one?

Other than that, I think this is close to being fixed. 

- rhp
Whiteboard: [PDT+] Land by: 2/11/00 → [PDT+]

Comment 23

17 years ago
Created attachment 5422 [details] [diff] [review]
Patch to help fix the header to UTF-8 conversion problem

Comment 24

17 years ago
Created attachment 5423 [details]
This is the message you should try Saving As TEXT
(Assignee)

Comment 25

17 years ago
I had just started to rebuild the tree. I'll try the patch when it's done.
I looked at the patch. I have not followed this bug in detail, so is the header 
supposed to be converted to the charset specified in the MIME header instead of 
UTF-8? I am not sure which part of the code fixed it.
Anyway, I'll test after the build is done.

Comment 26

17 years ago
Right, we should be saving the file as the original message charset. The thing 
that fixed the problem was taking out the "header=quoting" line. This changes 
the behavior in libmime.

If you can try to save that message off, you'll see the conversion problem. 
Thanks for the help!

- rhp
(Assignee)

Comment 27

17 years ago
Okay, my build completed. And I applied the patch and did nmake at 
mailnews/base. But headers are still saved as UTF-8 (I saved as html).
I set a break point at the changed line as below but didn't hit.

--- 1417,1423 ----
        ConvertBufToPlainText(m_msgBuffer);
        rv = ConvertFromUnicode(msgCompFileSystemCharset(), m_msgBuffer, &conBuf);

Comment 28

17 years ago
You won't hit that line unless you save as plain text. Make sure you are saving 
the file as "test.TXT".

- rhp
(Assignee)

Comment 29

17 years ago
I saved as .TXT (selected it from the popup and added the extension manually). That 
works (saved correctly) but it didn't hit the break point.

Save as html, just above the break point I mentioned, it checks a value 
m_doCharsetConversion.
In my case that's 0 (false), do I need to change my environment?

BTW, I think the reason for the conversion failure you saw is probably that you are 
using US Windows. In that case, the file system charset is windows-1252 and the 
converter will fail to convert the Japanese text.
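
A minimal sketch of why a Western file system charset would produce the truncation after "Subject: " reported earlier. ISO-8859-1 is used here as a simple stand-in for windows-1252, and the stop-at-first-failure behavior is an assumption chosen to match the symptom; the real converter may behave differently. `encodeLatin1Prefix` is a hypothetical name.

```cpp
#include <cassert>
#include <string>

// Illustration only: encodes UTF-32 text into a single-byte Latin charset
// and stops at the first character that charset cannot represent, returning
// whatever was converted up to that point.
std::string encodeLatin1Prefix(const std::u32string& text) {
  std::string out;
  for (char32_t cp : text) {
    if (cp > 0xFF) break;  // Japanese characters fall outside the charset
    out.push_back(static_cast<char>(static_cast<unsigned char>(cp)));
  }
  assert(out.size() <= text.size());
  return out;
}
```

Given a header such as "Subject: " followed by Japanese text, only the ASCII prefix survives.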

Updated

17 years ago
Whiteboard: [PDT+] → [PDT+] [2/19/00]

Comment 30

17 years ago
I've landed my change to nsStr which partially fixes this problem. RHP -- it's 
up to you now.

Comment 31

17 years ago
Ok, I have this boiled down to an issue of "Save As" into HTML still converting
the headers to UTF-8. I know what I need to do so this shouldn't be too hard to fix.

- rhp

Comment 32

17 years ago
Ok, I have this fixed now and will get it into the tree when I get approval.

- rhp

Comment 33

17 years ago
Ok, this should be much better than it was now. - rhp
Status: ASSIGNED → RESOLVED
Last Resolved: 18 years ago → 17 years ago
Resolution: --- → FIXED

Comment 34

17 years ago
Reopening this bug. I found a test case that broke it, but the good thing is 
that I have it fixed and just need to get permission to checkin.

- rhp
Status: RESOLVED → REOPENED
Resolution: FIXED → ---

Comment 35

17 years ago
Just updating the whiteboard. I have a pretty simple fix in hand.

- rhp
Whiteboard: [PDT+] [2/19/00] → [PDT+] [2/22/00]

Updated

17 years ago
Status: REOPENED → ASSIGNED

Comment 36

17 years ago
Ok, I think my latest checkins fixed this problem now. - rhp
Status: ASSIGNED → RESOLVED
Last Resolved: 17 years ago → 17 years ago
Resolution: --- → FIXED
(Reporter)

Comment 37

17 years ago
I still see 2 problems with ** 2/24/2000 Win32 build **

1. Extensions need to be supplied to get the right results.
   --> Maybe this should go to another bug.

2. When saving into .txt file format, it does not save into
   the system charset. So this means that JPN mail msg is saved
   in ISO-2022-JP rather than in Shift_JIS.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---

Comment 38

17 years ago
I have the first problem fixed in my tree...let's not file a new bug on this. 
On the second one, I save the plain text file in the following charset:

                nsAutoCString(msgCompFileSystemCharset())

if this is wrong, I would have Naoki look at it. 

I really want to get this bug resolved...I've spent a lot of time on this one 
issue.

- rhp

Comment 39

17 years ago
I've tested this to the best of my abilities on my machine. I am converting 
the Unicode text from the mail message into the charset I get from:

                    msgCompFileSystemCharset()

I can send you a recent diff to look at what is going into the tree eventually 
to fix the extension request.

- rhp
Assignee: rhp → nhotta
Status: REOPENED → NEW
(Assignee)

Comment 40

17 years ago
Okay, I'll take a look.
Status: NEW → ASSIGNED
(Assignee)

Comment 41

17 years ago
msgCompFileSystemCharset() returns Shift_JIS on my NT-J.
I debugged the conversion code, and it is getting Shift_JIS as the conversion charset, 
so it's supposed to convert from Unicode to Shift_JIS.
But the data it's getting is ISO-2022-JP in UCS2 format. The Shift_JIS converter 
just removes the padded zeros from each character, so the result we get is 
ISO-2022-JP.
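
The effect can be reproduced in a few lines. This is an illustration only: `widenBytes` mimics "ISO-2022-JP in UCS2 format", and `encodeAsciiRange` is a toy stand-in for the Unicode to Shift_JIS encoder that covers only the ASCII range, assumed here to map each code unit below 0x80 to the same byte. Neither is the actual converter code.

```cpp
#include <cassert>
#include <string>

// Illustration only: widen each byte to one 16-bit unit (zero-padded).
std::u16string widenBytes(const std::string& bytes) {
  std::u16string wide;
  for (unsigned char b : bytes) wide.push_back(static_cast<char16_t>(b));
  assert(wide.size() == bytes.size());
  return wide;
}

// Toy stand-in for the encoder, ASCII range only (assumption: identity
// mapping below 0x80; anything else becomes '?').
std::string encodeAsciiRange(const std::u16string& text) {
  std::string out;
  for (char16_t u : text) out.push_back(u < 0x80 ? static_cast<char>(u) : '?');
  return out;
}

// ISO-2022-JP bytes: ESC $ B, two JIS X 0208 characters, ESC ( B.
const std::string kIso2022JpSample = "\x1B$B\x46\x7C\x4B\x5C\x1B(B";
```

Because ISO-2022-JP is a 7-bit encoding, every widened unit is below 0x80, so `encodeAsciiRange(widenBytes(kIso2022JpSample))` returns the input bytes unchanged. The fix therefore has to decode the message to real Unicode before this conversion step.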
(Assignee)

Comment 42

17 years ago
On my machine it happens with nsMessenger.cpp 1.136 or later
but does not happen with 1.135 (at least for body).
Assignee: nhotta → rhp
Status: ASSIGNED → NEW

Comment 43

17 years ago
will investigate.
Status: NEW → ASSIGNED
Whiteboard: [PDT+] [2/22/00] → [PDT+]

Comment 44

17 years ago
Ok, I think I have a fix for the Save As Text. To tell for sure, I'm going to
have to send 2 patches to Naoki for him to try.

Let me generate those and I will send it in email. This will also address the
extension issue.

- rhp

Comment 45

17 years ago
Ok, after working with naoki and momoi san, I have this fixed...really :-) I
will get this reviewed.

- rhp
Whiteboard: [PDT+] → [PDT+] CAN CHECKIN FIX ANY TIME

Updated

17 years ago
Summary: "File | save as file" saves in Unicode when user supplies html extension → [FIXED] "File | save as file" saves in Unicode when user supplies html extension

Comment 46

17 years ago
Ok, this one should be fixed once and for all :-)

- rhp
Status: ASSIGNED → RESOLVED
Last Resolved: 17 years ago → 17 years ago
Resolution: --- → FIXED
Summary: [FIXED] "File | save as file" saves in Unicode when user supplies html extension → "File | save as file" saves in Unicode when user supplies html extension
(Reporter)

Comment 47

17 years ago
** Checked with 3/7/2000 Win32 build **

Saving as into .eml, .html, and .txt is now generally working.
You can let the dialog supply an extension or supply one yourself.
Either way, this works. 

I have not seen any problem with saving into .eml and .html files.
There is a problem, however saving into .txt file, particularly
Japanese (ISO-2022-JP) messages. The data is cut off in the process
of saving. I'll append 2 images showing one such example.
I'll also upload a test msg file showing 4 JPN msgs + 1 UTF-8 (JPN)
test message. 
It's hard to predict where the cut off occurs saving into .txt
format. But in all 4 messages, it does. I've seen something similar
before in processing ISO-2022-JP and mistaking some bytes as
HTML tags or special characters. It also does not look like we 
handle ASCII well saving into .txt format. You get a "?" symbol
in one of the msgs. The cut off may be occurring for more than
one reasons.

I'm re-opening this...
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
(Reporter)

Comment 48

17 years ago
Created attachment 6263 [details]
A message shown in its entirety. Take a look at the next msg to see how it is saved into .txt format.
(Reporter)

Comment 49

17 years ago
Created attachment 6264 [details]
The msg is saved into .txt format -- now shows truncation.
(Reporter)

Comment 50

17 years ago
The image above is a relatively benign case of a few characters at the
end being cut off. 5 test messages show much more severe
truncation of one kind or another.
The test file is attached below.
(Reporter)

Comment 51

17 years ago
Created attachment 6265 [details]
5 test message in this file -- 4 JPN and 1 UTF-8 encoded JPN message.
(Reporter)

Comment 52

17 years ago
Put the mailbox file into your Local folder and
save each one into .txt format under Japanese Windows.
Well, this could be a problem for rhp. 
Naoki, can you help with debugging this problem?
(Reporter)

Comment 53

17 years ago
I'm going to re-do the quoted portion below because some crucial words
have been omitted due to my poor typing.

"It's hard to predict where the cut off occurs saving into .txt
format. But in all 4 messages, it does. I've seen something similar
before in processing ISO-2022-JP and mistaking some bytes as
HTML tags or special characters. It also does not look like we 
handle ASCII well saving into .txt format. You get a "?" symbol
in one of the msgs. The cut off may be occurring for more than
one reasons."

should read:

"It's hard to predict where the cut off occurs saving into .txt
format. But in all *5* messages, it does. I've seen something similar
before in processing ISO-2022-JP and mistaking some bytes as
HTML tags or special characters. It also does not look like we 
handle ASCII *space* well saving into .txt format. You get a "?" symbol
in one of the msgs. The cut off may be occurring for more than
one reasons."

I've indicated addition of 2 words with * *.

Comment 54

17 years ago
I'll investigate, but I've put so much time into this already, I need to focus 
on a bunch of other issues. Sorry, this is good enough for beta. I will change 
the summary, clear the whiteboard, etc.. and work on it later.

- rhp
Status: REOPENED → ASSIGNED
Keywords: beta1
Summary: "File | save as file" saves in Unicode when user supplies html extension → "File | save as file" has problems with Japanese messages saved as plain text
Whiteboard: [PDT+] CAN CHECKIN FIX ANY TIME
Target Milestone: M14 → M17

Comment 55

17 years ago
*** Bug 39357 has been marked as a duplicate of this bug. ***

Updated

17 years ago
Target Milestone: M17 → M18

Comment 56

17 years ago
I don't seem to be getting cutoff files when saved. Can you retest this?

- rhp
Status: ASSIGNED → RESOLVED
Last Resolved: 17 years ago → 17 years ago
Resolution: --- → WORKSFORME
(Reporter)

Comment 57

17 years ago
Sorry. Somehow this escaped my attention for a while.
I looked at this problem with 9/8/2000 Win32 build.
The same problem still exists. 

The body text: XXXPlain text 
is saved as: XXXPlain te

thus missing the last 2 letters.

I tried a number of Japanese messages both HTML and plain text
type, they all suffer from this problem. 

I'm re-opening this bug. I think you need to have a Japanese
Windows system to see this problem clearly because it involves 
saving into Shift_JIS. I don't see this problem when saving
ASCII mail.

Naoki, please take a look at this. There seems to be a problem
in converting to Shift_JIS.
Status: RESOLVED → REOPENED
Resolution: WORKSFORME → ---

Comment 58

17 years ago
Naoki,
If you can see if this is a conversion error, that would be great.

- rhp
Assignee: rhp → nhotta
Status: REOPENED → NEW
(Assignee)

Comment 59

17 years ago
I can reproduce when saving as a text file but not html.
It always cuts off at the bottom of the file, so I assume there is something wrong 
in the length calculation.
(Reporter)

Comment 60

17 years ago
Saving into a plain text file is problematic whether the message in
question is HTML or plain text type. 
Of all the saving options, this one is probably used the most.
Nominating for nsbeta 3.
Keywords: nsbeta3

Comment 61

17 years ago
nsbeta3+ P2
Priority: P3 → P2
Whiteboard: nsbeta3+
(Assignee)

Comment 62

17 years ago
I have a patch for this.
Status: NEW → ASSIGNED
(Assignee)

Comment 63

17 years ago
Created attachment 14532 [details] [diff] [review]
a patch to use a correct string length

Updated

17 years ago
Whiteboard: nsbeta3+ → nsbeta3+, patch in hand need review
(Assignee)

Updated

17 years ago
Whiteboard: nsbeta3+, patch in hand need review → nsbeta3+, patch in hand, reviewed
(Assignee)

Comment 64

17 years ago
Fix checked in.
Status: ASSIGNED → RESOLVED
Last Resolved: 17 years ago → 17 years ago
Resolution: --- → FIXED

Comment 65

17 years ago
Verified with win32 2000091909 and linux 2000091906 build. It's fixed.
Status: RESOLVED → VERIFIED
Product: MailNews → Core
Product: Core → MailNews Core