Closed Bug 23418 Opened 25 years ago Closed 24 years ago

"File | save as file" has problems with Japanese messages saved as plain text

Categories

(MailNews Core :: Internationalization, defect, P2)

x86
Windows NT

Tracking

(Not tracked)

VERIFIED FIXED

People

(Reporter: momoi, Assigned: nhottanscp)

References

Details

(Whiteboard: nsbeta3+, patch in hand, reviewed)

Attachments

(7 files)

** Observed with 1/6/99 Win32 build ** When a non-ASCII msg is saved as a file using the File | Save as menu, the file is currently saved without any extension and "as is", whether the option selected by the user is HTML or plain text. Some users may be tempted to supply the .html extension themselves. If so, the result will be a Unicode-encoded, somewhat deficient HTML file. This needs to be corrected. I guess HTML format should save as is -- a JPN msg should be saved in HTML format and in JIS. We should also try supplying the .txt extension and see what happens. Also, if there isn't one, we should file a separate bug for the format option malfunction in the save as dialog window.
Assignee: nhotta → jefft
Reassign to jefft.
Save as text/html -> Currently saved as UTF-8 (wrong) - we need to convert UTF-8 to the mail charset (e.g. ISO-2022-JP).
Save as text/plain -> IQA needs to test the current behavior. The spec is to convert the data to the platform file charset.
Below is the code to get the platform file charset.
  #define NS_IMPL_IDS
  #include "nsIPlatformCharset.h"
  #undef NS_IMPL_IDS
  nsCOMPtr<nsIPlatformCharset> platformCharset;
  nsAutoString aPlatformCharset;
  rv = nsComponentManager::CreateInstance(NS_PLATFORMCHARSET_PROGID, nsnull, NS_GET_IID(nsIPlatformCharset), getter_AddRefs(platformCharset));
  if (NS_SUCCEEDED(rv)) {
    rv = platformCharset->GetCharset(kPlatformCharsetSel_FileName, aPlatformCharset);
  }
Status: NEW → ASSIGNED
Target Milestone: M14
Currently when we save as into txt format, this is what happens:
1. Save as text without the user explicitly supplying the .txt extension. --> Saves the whole msg including vCard, etc. in ISO-2022-JP. This looks like the source data itself.
2. Save as text, supplying the .txt extension yourself. --> Saves without vCard and other extra parts, but in ISO-2022-JP rather than the expected Shift_JIS for JPN Windows.
Both case 1 and case 2 should save without the extra parts and in Shift_JIS rather than ISO-2022-JP.
I have just added a new function in nsMsgI18N.h named msgCompFileSystemCharset (oops, not a generic enough name, maybe we should rename it) that gives you back the file system character set.
I would like to designate this as beta1 since "file | save as" is something users do use and its UI should be non-confusing and its resulting effects should be consistent and correct.
Keywords: beta1
Putting on PDT+ radar for beta1.
Whiteboard: [PDT+]
Load-balance to rhp
Assignee: jefft → rhp
Status: ASSIGNED → NEW
Status: NEW → ASSIGNED
Whiteboard: [PDT+] → [PDT+] Land by: 2/11/00
Depends on: 1775
No longer depends on: 1775
Ok, I have most of this done for plain text, but I have a question on saving messages as HTML. Why can't I emit a META tag header with UTF-8 as the output for the saved HTML web page? With the current architecture, your suggestion of going back to the original charset of the message is full of problems and I don't think it will give us a very good result. Comments? - rhp
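For reference, the kind of META declaration being discussed can be sketched as below. This is only an illustration: `MetaCharsetTag` is a hypothetical helper, not a function from the Mozilla tree, and the charset name passed in has to match the encoding the file bytes are actually written in, whichever side of this discussion wins.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper: builds the META element that declares the charset
// of a saved HTML page. The caller supplies the charset label.
std::string MetaCharsetTag(const std::string& charset) {
  return "<META HTTP-EQUIV=\"Content-Type\" CONTENT=\"text/html; charset=" +
         charset + "\">";
}
```

Calling `MetaCharsetTag("UTF-8")` gives the declaration for the UTF-8 proposal, and `MetaCharsetTag("ISO-2022-JP")` the one for a file kept in the original mail charset.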
I think users will want to see the original data encoding. On most platforms, you can then open such a file with a text editor and see & edit the content without a problem. That is not going to be the case if you save into UTF-8. That and in the past we have been saving the original data as is under HTML extension and I don't think we should break the familiar behavior.
I understand your point and I think I have an idea how to fix this...but just to clarify, I don't think this feature really works on 4.x. When you do a Save As for a mail message in 4.x (HTML format), you basically get the raw contents of the message dumped to the file with RFC822 headers, etc. We'll do better than this for 5.x :-) - rhp
Ok, I've tried to improve this performance. Is this all going to work 100%? Probably not. For plain text, I am getting the output from libmime and saving it as plain text in the charset of the system. For HTML, I am trying to output the message in the original charset without any conversion. momoi san, there are a million combinations that can happen here and scenarios that will probably break, but what I need you to do is help me with the major issues and I'll debug from there. - rhp
Status: ASSIGNED → RESOLVED
Closed: 25 years ago
Resolution: --- → FIXED
Hi, Rich. In the case of HTML msg, are you going to strip out some headers and MIME structural material? Under 4.x, we didn't do any of that and saved the original msg. In case we run into complicated problems, can we go back to that simple msg saving method for HTML.
Actually, the way this works is as follows:
- Save As: .EML (or no extension) - Saves the raw RFC822 message to a file
- Save As: .TXT - Does its best job at converting what you see in your message display into a .TXT file (in the native charset of the system)
- Save As: .HTML - Again, it outputs HTML that will give you a display similar to what you see in your 3 pane message display
Actually, Communicator 4.x "Save As" HTML doesn't really work. It simply saves the raw RFC822 text into a file. So, in 5.x we will actually generate something that is an HTML document without any RFC822 messages. - rhp
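The extension-based behavior described in this comment could be sketched roughly as follows. This is a standalone illustration, not the nsMessenger code: `SaveFormat` and `FormatForFileName` are hypothetical names, the case-insensitive match is an assumption, and the input is assumed to be a bare leaf name without directory components.

```cpp
#include <algorithm>
#include <cassert>
#include <cctype>
#include <string>

// Hypothetical enum: the three outcomes listed in the comment.
enum class SaveFormat { RawMessage, PlainText, Html };

// Hypothetical helper: pick a save format from the file name's extension.
// No extension, .eml, or anything unrecognized falls back to the raw message.
SaveFormat FormatForFileName(const std::string& fileName) {
  const std::string::size_type dot = fileName.rfind('.');
  if (dot == std::string::npos) return SaveFormat::RawMessage;

  // Lower-case the extension so "test.TXT" and "test.txt" behave the same.
  std::string ext = fileName.substr(dot + 1);
  std::transform(ext.begin(), ext.end(), ext.begin(), [](unsigned char c) {
    return static_cast<char>(std::tolower(c));
  });

  if (ext == "txt") return SaveFormat::PlainText;
  if (ext == "html") return SaveFormat::Html;
  return SaveFormat::RawMessage;
}
```

With this sketch, `FormatForFileName("test.TXT")` selects `SaveFormat::PlainText`, while a name with no extension selects `SaveFormat::RawMessage`.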
Great! Look forward to seeing the results in today's build.
I'm not done looking at this yet, but so far I have found a couple of problems.
1. Even though the save as dialog shows possible extension types, it does not supply one automatically unless the user writes it in. Otherwise it's all saved in .eml format.
2. When saving into .txt format, the headers (To:, CC:, Date:, etc.) and their content are separated into different lines, e.g.
From:
Jane Banning
Momoi, the separate line problem is in the HTML - TEXT converter. Table conversion is pretty bad. Don't reopen this bug on that issue. - rhp
Thanks. Browser save as has the same problems as you say for .txt format. The charset conversion to system charset in .txt format seems to be working well.
Rich, sorry it's taking me longer to verify this fix. I've found some definite problems:
1. We are saving the original MIME-encoded subject headers in UTF-8 rather than in the original encoding indicated by the MIME charset in the header when using the HTML save option.
2. Earlier, .txt files (when the extension was supplied by the user) were generally saved correctly -- that is what I reported. But now with this build, all I see is the same data as the .eml file. What happened since then?
3. We don't automatically supply the extensions even though the file | save as file dialog window clearly shows the possible extension types. Users are used to the application automatically supplying the extension in the dialog if none is entered.
Re-opening for the above reasons... there may have been regressions in the last few days also for .txt type.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
The build I saw the problems reported above was: Win32 2000021708.
Ok, well the REGRESSION that we are seeing (i.e. extensions are ignored) is a bug that I think I found in nsString(). Rick: On line 517 of xpcom/ds/nsStr.cpp, there is a line:
  const char* destLast = root+((aDest.mLength-1)*aDelta); //pts to last char in aDest (likely null)
but if you really want to point to the NULL character of this string, the "-1" shouldn't be there. I changed the line to remove the -1 and it started working again. With the -1, you never get a hit. The headers issue sounds like a bug I will tackle. Also, I will look at adding the extensions for these files. Technically, I'm supposed to be on vacation today, but I'll see what I can do. - rhp
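The thread does not show the function around that line, so the following is only an analogous sketch of how an end pointer that is one short can make an extension check miss. `EndsWith` and `EndsWithOffByOne` are hypothetical helpers written for this illustration; they are not the nsStr.cpp code.

```cpp
#include <cassert>
#include <cstring>

// Returns true when `text` ends with `suffix`. `end` points at the
// terminating null, i.e. text + length with no "-1".
bool EndsWith(const char* text, const char* suffix) {
  const std::size_t textLen = std::strlen(text);
  const std::size_t suffixLen = std::strlen(suffix);
  if (suffixLen > textLen) return false;
  const char* end = text + textLen;  // points at the null terminator
  return std::memcmp(end - suffixLen, suffix, suffixLen) == 0;
}

// Same check, but anchored on the last character (text + length - 1).
// The compared window is shifted one position to the left, so a real
// suffix such as ".txt" is compared against "t.tx" and does not match.
bool EndsWithOffByOne(const char* text, const char* suffix) {
  const std::size_t textLen = std::strlen(text);
  const std::size_t suffixLen = std::strlen(suffix);
  if (textLen == 0 || suffixLen > textLen - 1) return false;
  const char* last = text + textLen - 1;  // points at the last character
  return std::memcmp(last - suffixLen, suffix, suffixLen) == 0;
}
```

For "test.txt" and ".txt", `EndsWith` reports a match and `EndsWithOffByOne` does not, which is the same shape of failure as extensions being ignored.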
Status: REOPENED → ASSIGNED
Ok, I think I have a fix for the header conversion stuff, but I need Naoki's help now. I will attach a patch that you can apply from the same level as your /mozilla directory. It fixes the headers being converted into UTF-8. Now, I have a problem with the ConvertFromUnicode() call. I pass in a bunch of unicode data (I think) and I want it converted, but for one of the Smoketest email messages, the ConvertFromUnicode() routine seems to truncate after the text "Subject: ". Can you help me with this one? Other than that, I think this is close to being fixed. - rhp
Whiteboard: [PDT+] Land by: 2/11/00 → [PDT+]
I had just started to rebuild the tree. I'll try the patch when it's done. I looked at the patch. I have not followed this bug in detail, so the header is supposed to be converted to the charset specified in the MIME header instead of UTF-8? I am not sure which part of the code fixed it. Anyway, I'll test after the build is done.
Right, we should be saving the file in the original message charset. The thing that fixed the problem was taking out the "header=quoting" line. This changes the behavior in libmime. If you can try to save that message off, you'll see the conversion problem. Thanks for the help! - rhp
Okay, my build completed. I applied the patch and did nmake at mailnews/base. But headers are still saved as UTF-8 (I saved as html). I set a break point at the changed line below but it was not hit.
--- 1417,1423 ----
  ConvertBufToPlainText(m_msgBuffer);
  rv = ConvertFromUnicode(msgCompFileSystemCharset(), m_msgBuffer, &conBuf);
You won't hit that line unless you save as plain text. Make sure you are saving the file as "test.TXT". - rhp
I saved as .TXT (selected it from the popup and added the extension manually). That works (saved correctly) but it didn't hit the break point. For save as html, just above the break point I mentioned, it checks a value m_doCharsetConversion. In my case that's 0 (false); do I need to change my environment? BTW, I think the reason for the conversion failure you saw is probably that you are using US Windows. In that case, the file system charset is windows-1252 and the converter will fail to convert the Japanese text.
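Why a single-byte Western charset cannot take Japanese text can be shown with a small sketch. Everything here is an assumption made for illustration: `ConvertToLatin1` is a hypothetical stand-in for a Unicode converter, ISO-8859-1 is used as a simplified proxy for windows-1252 (the two differ in the 0x80-0x9F range), and stopping at the first unconvertible character is just one possible failure mode. The thread does not say how the real ConvertFromUnicode() behaves on failure.

```cpp
#include <cassert>
#include <string>

// Hypothetical converter from UTF-16 code units to ISO-8859-1 bytes.
// It stops at the first code unit that has no single-byte equivalent
// and reports failure, leaving the bytes converted so far in `out`.
bool ConvertToLatin1(const std::u16string& in, std::string& out) {
  out.clear();
  for (char16_t unit : in) {
    if (unit > 0xFF) return false;  // e.g. kana or kanji
    out.push_back(static_cast<char>(static_cast<unsigned char>(unit)));
  }
  return true;
}
```

Under these assumptions, converting a header whose value starts with Japanese characters leaves only the ASCII prefix "Subject: " in the output, which resembles the truncation reported earlier in this bug.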
Whiteboard: [PDT+] → [PDT+] [2/19/00]
I've landed my change to nsStr which partially fixes this problem. RHP -- it's up to you now.
Ok, I have this boiled down to an issue of "Save As" into HTML still converting the headers to UTF-8. I know what I need to do so this shouldn't be too hard to fix. - rhp
Ok, I have this fixed now and will get it into the tree when I get approval. - rhp
Ok, this should be much better than it was now. - rhp
Status: ASSIGNED → RESOLVED
Closed: 25 years ago → 25 years ago
Resolution: --- → FIXED
Reopening this bug. I found a test case that broke it, but the good thing is that I have it fixed and just need to get permission to checkin. - rhp
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Just updating the whiteboard. I have a pretty simple fix in hand. - rhp
Whiteboard: [PDT+] [2/19/00] → [PDT+] [2/22/00]
Status: REOPENED → ASSIGNED
Ok, I think my latest checkins fixed this problem now. - rhp
Status: ASSIGNED → RESOLVED
Closed: 25 years ago → 25 years ago
Resolution: --- → FIXED
I still see 2 problems with ** 2/24/2000 Win32 build ** 1. Extensions need to be supplied to get the right results. --> Maybe this should go to another bug. 2. When saving into .txt file format, it does not save into the system charset. So this means that JPN mail msg is saved in ISO-2022-JP rather than in Shift_JIS.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
I have the first problem fixed in my tree...let's not file a new bug on this. On the second one, I save the plain text file in the following charset: nsAutoCString(msgCompFileSystemCharset()). If this is wrong, I would have Naoki look at it. I really want to get this bug resolved...I've spent a lot of time on this one issue. - rhp
I've tested this to the best of my abilities on my machine. I am converting the Unicode text from the mail message into the charset I get from msgCompFileSystemCharset(). I can send you a recent diff to look at what is eventually going into the tree to fix the extension request. - rhp
Assignee: rhp → nhotta
Status: REOPENED → NEW
Okay, I'll take a look.
Status: NEW → ASSIGNED
msgCompFileSystemCharset() returns Shift_JIS on my NT-J. I debugged the conversion code; it's getting Shift_JIS as the conversion charset, so it's supposed to convert from Unicode to Shift_JIS. But the data it's getting is ISO-2022-JP in UCS2 format. The Shift_JIS converter just removes the padded zeros from each character, so the result we get is ISO-2022-JP.
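The effect described here can be reproduced in a few lines. This is a simplified sketch with hypothetical helpers: `WidenBytes` models "ISO-2022-JP in UCS2 format" by zero-padding each byte to 16 bits, and `EncodeAsciiRange` stands in only for the ASCII path of a Unicode to Shift_JIS converter, where code units below 0x80 map to the same single byte. Because ISO-2022-JP is a 7-bit encoding, every widened unit takes that path.

```cpp
#include <cassert>
#include <string>

// Zero-pad each byte to one 16-bit code unit.
std::u16string WidenBytes(const std::string& bytes) {
  std::u16string wide;
  for (unsigned char b : bytes) wide.push_back(static_cast<char16_t>(b));
  return wide;
}

// Hypothetical stand-in for the ASCII path of a Unicode -> Shift_JIS
// converter. Anything outside ASCII is replaced, since real double-byte
// mapping is out of scope for this sketch.
std::string EncodeAsciiRange(const std::u16string& wide) {
  std::string out;
  for (char16_t unit : wide) {
    out.push_back(unit < 0x80 ? static_cast<char>(unit) : '?');
  }
  return out;
}

// ISO-2022-JP bytes for a three-character Japanese word:
// ESC $ B, three two-byte JIS codes, ESC ( B.
const std::string kIso2022JpSample = "\x1B" "$BF|K\\8l" "\x1B" "(B";
```

Running `EncodeAsciiRange(WidenBytes(kIso2022JpSample))` returns the input bytes unchanged, so the "converted" file still contains ISO-2022-JP even though the target charset was Shift_JIS.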
On my machine it happens with nsMessenger.cpp 1.136 or later but does not happen with 1.135 (at least for the body).
Assignee: nhotta → rhp
Status: ASSIGNED → NEW
will investigate.
Status: NEW → ASSIGNED
Whiteboard: [PDT+] [2/22/00] → [PDT+]
Ok, I think I have a fix for the Save As Text. To tell for sure, I'm going to have to send 2 patches to Naoki for him to try. Let me generate those and I will send it in email. This will also address the extension issue. - rhp
Ok, after working with naoki and momoi san, I have this fixed...really :-) I will get this reviewed. - rhp
Whiteboard: [PDT+] → [PDT+] CAN CHECKIN FIX ANY TIME
Summary: "File | save as file" saves in Unicode when user supplies html extension → [FIXED] "File | save as file" saves in Unicode when user supplies html extension
Ok, this one should be fixed once and for all :-) - rhp
Status: ASSIGNED → RESOLVED
Closed: 25 years ago → 25 years ago
Resolution: --- → FIXED
Summary: [FIXED] "File | save as file" saves in Unicode when user supplies html extension → "File | save as file" saves in Unicode when user supplies html extension
** Checked with 3/7/2000 Win32 build ** Saving as into .eml, .html, and .txt is now generally working. You can let the dialog supply an extension or supply one yourself. Either way, this works. I have not seen any problem with saving into .eml and .html files. There is a problem, however, with saving into a .txt file, particularly for Japanese (ISO-2022-JP) messages. The data is cut off in the process of saving. I'll append 2 images showing one such example. I'll also upload a test msg file showing 4 JPN msgs + 1 UTF-8 (JPN) test message. It's hard to predict where the cut off occurs saving into .txt format. But in all 4 messages, it does. I've seen something similar before in processing ISO-2022-JP and mistaking some bytes as HTML tags or special characters. It also does not look like we handle ASCII well saving into .txt format. You get a "?" symbol in one of the msgs. The cut off may be occurring for more than one reason. I'm re-opening this...
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
The image above is a relatively benign case of a few characters at the end being cut off. 5 test messages show much more severe truncation of one kind or another. The test file is attached below.
Put the mailbox file into your Local folder and save each one into .txt format under Japanese Windows. Well, this could be a problem for rhp. Naoki, can you help with debugging this problem?
I'm going to re-do the quoted portion below because some crucial words have been omitted due to my poor typing.
"It's hard to predict where the cut off occurs saving into .txt format. But in all 4 messages, it does. I've seen something similar before in processing ISO-2022-JP and mistaking some bytes as HTML tags or special characters. It also does not look like we handle ASCII well saving into .txt format. You get a "?" symbol in one of the msgs. The cut off may be occurring for more than one reason."
should read:
"It's hard to predict where the cut off occurs saving into .txt format. But in all *5* messages, it does. I've seen something similar before in processing ISO-2022-JP and mistaking some bytes as HTML tags or special characters. It also does not look like we handle ASCII *space* well saving into .txt format. You get a "?" symbol in one of the msgs. The cut off may be occurring for more than one reason."
I've indicated the addition of 2 words with * *.
I'll investigate, but I've put so much time into this already, I need to focus on a bunch of other issues. Sorry, this is good enough for beta. I will change the summary, clear the whiteboard, etc. and work on it later. - rhp
Status: REOPENED → ASSIGNED
Keywords: beta1
Summary: "File | save as file" saves in Unicode when user supplies html extension → "File | save as file" has problems with Japanese messages saved as plain text
Whiteboard: [PDT+] CAN CHECKIN FIX ANY TIME
Target Milestone: M14 → M17
*** Bug 39357 has been marked as a duplicate of this bug. ***
Target Milestone: M17 → M18
I don't seem to be getting cutoff files when saved. Can you retest this? - rhp
Status: ASSIGNED → RESOLVED
Closed: 25 years ago → 24 years ago
Resolution: --- → WORKSFORME
Sorry. Somehow this escaped my attention for a while. I looked at this problem with the 9/8/2000 Win32 build. The same problem still exists. The body text:
XXXPlain text
is saved as:
XXXPlain te
thus missing the last 2 letters. I tried a number of Japanese messages, both HTML and plain text type, and they all suffer from this problem. I'm re-opening this bug. I think you need to have a Japanese Windows system to see this problem clearly because it involves saving into Shift_JIS. I don't see this problem when saving ASCII mail. Naoki, please take a look at this. There seems to be a problem in converting to Shift_JIS.
Status: RESOLVED → REOPENED
Resolution: WORKSFORME → ---
Naoki, If you can see if this is a conversion error, that would be great. - rhp
Assignee: rhp → nhotta
Status: REOPENED → NEW
I can reproduce this when saving as a text file but not as html. It is always cut at the end of the file, so I assume there is something wrong in the length calculation.
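The thread does not record what the actual length error was or what the patch changed. As a general illustration only, one common way a tail gets cut in this kind of code is to write the converted buffer using the length of the source text: Shift_JIS uses two bytes for each Japanese character, so the encoded byte count exceeds the UTF-16 code unit count. `WriteWithLength` and the sample strings below are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical writer: emits `count` bytes of `encoded`, the way a file
// write with an explicit length argument would.
std::string WriteWithLength(const std::string& encoded, std::size_t count) {
  return encoded.substr(0, count);
}

// Sample: three Japanese characters followed by "Plain text".
// As UTF-16 this is 13 code units; as Shift_JIS it is 16 bytes.
const std::u16string kSourceText = u"\u65E5\u672C\u8A9EPlain text";
const std::string kShiftJisText = "\x93\xFA\x96\x7B\x8C\xEA" "Plain text";
```

Writing with `kSourceText.size()` drops the last bytes of the file, while writing with `kShiftJisText.size()` keeps all of them. ASCII-only mail has equal counts in both forms, which would be consistent with the problem not showing up there.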
Saving into a plain text file is problematic whether the message in question is HTML or plain text type. Of all the saving options, this one is probably used the most. Nominating for nsbeta3.
Keywords: nsbeta3
nsbeta3+ P2
Priority: P3 → P2
Whiteboard: nsbeta3+
I have a patch for this.
Status: NEW → ASSIGNED
Whiteboard: nsbeta3+ → nsbeta3+, patch in hand need review
Whiteboard: nsbeta3+, patch in hand need review → nsbeta3+, patch in hand, reviewed
Fix checked in.
Status: ASSIGNED → RESOLVED
Closed: 24 years ago → 24 years ago
Resolution: --- → FIXED
Verified with win32 2000091909 and linux 2000091906 build. It's fixed.
Status: RESOLVED → VERIFIED
Product: MailNews → Core
Product: Core → MailNews Core