Closed Bug 230042 Opened 21 years ago Closed 16 years ago

Saving UTF-8 Mail as text results in Question Marks

Categories

(Thunderbird :: Mail Window Front End, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED DUPLICATE of bug 181456

People

(Reporter: henrik.pauli, Unassigned)

References

Details

(Keywords: intl)

Attachments

(2 files, 1 obsolete file)

User-Agent:       Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.6a) Gecko/20031106 Firebird/0.7+
Build Identifier: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.6a) Gecko/20031106 Firebird/0.7+

I tried to 'Save As...' a UTF-8 encoded mail as a simple text file.  The
nonlatin characters involved (they were Cyrillic, but I doubt that matters) were
turned into question marks and I got a plain and simple ANSI txt file.

Reproducible: Always

Steps to Reproduce:
1. right click in mail window
2. Save As... text file

Actual Results:  
>> try, hopefully it doesn't break, do tell if it does: ????????? ?????????
>> ??????: ??????? ??????) was waaaaaaaay too boring.

(yes, those are \x3F)

Expected Results:  
> try, hopefully it doesn't break, do tell if it does: Александр Сергеевич
> Пушкин: Евгений Онегин) was waaaaaaaay too boring.
Now of course the Expected Results part would make a lot more sense if Bugzilla
didn't escape characters into amp-hash-foo-semicolon.

Also, I forgot my Thunderbird build: Mozilla Thunderbird 0.4 (20031205)
This seems to be a dupe of bug 181456 -- that's a browser bug, but the save as 
text function is the same in all the Mozilla programs, I believe.

I'm not sure how the text charset is chosen; on my Windows system, it appears to 
use Win-1252.  You want the plain text file saved *as* UTF-8?  Or to select the 
charset for plain text?

See Mail/News bug 112069; browser bug 29272

I'd suggest saving as HTML, but that's not likely to work either; see Mail/News 
bug 74424, bug 153401; browser bug 212158.

And for reference: bug 233650 (Composer).
Keywords: intl
OS: Windows XP → All
Hardware: PC → All
Summary: Saving UTF-8 Mail Results In Question Marks → Saving UTF-8 Mail as text results in Question Marks
I want to save the file as it is.  No conversion, no nothing (that would be more
of a feature request than a bug report, after all!)

The fact that it was UTF-8 and it broke is merely just a reason to file a bug.
(In reply to comment #3)
> I want to save the file as it is.  No conversion, no nothing 

Then save it as a .EML file instead of text.
I think you misinterpreted what I meant, and I misphrased what I intended to
say.  EML would do me no good, as I don't want to deal with EML to
sensible-format conversions myself.  I guess it would have been a better idea to
say "I want to save it just as I see it".

And as for your observation on the codepage of your Mailer, the text charset is
chosen regarding the encoding line in the mail and can be optionally overridden
in the View menu.  I need no further conversion, so while in the save dialogue
it might be a fine enhancement to offer codepages of equal or better value (for
example: Some japanese don't like Shift-JIS, so these people might like to save
pages or whatever they see on the screen, saved in EUC; who knows!), that would
be a serious vanity enhancement.
Hello 

> Actual Results:  
>> try, hopefully it doesn't break, do tell if it does: ????????? ?????????
>> ??????: ??????? ??????) was waaaaaaaay too boring.

You have used an editor that does not itself support UTF-8. Take
notepad.exe from W2K or XP, for example, and you will see the Cyrillic letters. 

That program detects the encoding of the file and its characters.
No, I used Mozilla Thunderbird, not an external editor.  Obviously I wouldn't
have posted the bug here if it were a problem in another editor. (DUH.)

Mozilla didn't manage to save my mail, back then, into a proper UTF-8 or U16
file, and that was my problem.  Since it still does not work, I'll now attach a
screenshot for easier understandability.
Attached image An example screenshot.
Thunderbird (0.9 and below for sure) shows Unicode characters, but cannot save
them into file, as shown by Notepad.
(In reply to comment #8)
> Created an attachment (id=169814) [edit]
> An example screenshot.

Your screen shot indicates that the mail is long.
Reporter, isn't this same problem as Bug 269812?
(Not UTF-8 only problem. All non-ascii, when long mail.)
Interesting.  Yes, it was a very long mail.

I just sent myself an UTF-8 e-mail whose "real" content was a mere 863 bytes
(the same text as in the screenshot) and it resulted in question marks again. 
The e-mail itself is reported to be ~2KB.  Same thing happened with an e-mail
sent in the Windows-1251 charset.  The SMTP server choked on UTF-16, so I
couldn't try that one :)

I certainly welcome more suggestions.
Putting bug 269812 in "Depends on:".
Depends on: 269812
Wada, I certainly experienced this without hitting the size limits experienced
in your bug-starting testcase in 269812; my latest testing was done with a 2KByte
e-mail with ~900Byte "true content".

I have my doubts that it depends on that one, but who knows...  The two things
might just have a common source in the code and that's it.
When text format is selected, Thunderbird converts the mail content into the
_system charset_ before saving. ( see
http://lxr.mozilla.org/seamonkey/source/mailnews/base/src/nsMessenger.cpp#1907 )

Thus characters that do not belong to the system charset may be saved as
question marks. By contrast, Firefox seems to respect the currently selected
encoding when saving a page as text.

Possible workaround: Save the mail as HTML and open it with Firefox. Then try to
save it as text using Firefox with UTF-8 encoding selected.
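
The lossy conversion described in this comment can be reproduced outside Thunderbird. The following is a minimal Python sketch, not the nsMessenger.cpp code: it assumes a Windows-1250 system charset (the reporter's locale, as stated later in this thread) and uses Python's `errors="replace"` as a stand-in for whatever substitution the Mozilla converter performs.

```python
# Illustration only: Python's codecs stand in for Mozilla's charset converter.
text = "Александр Сергеевич"

# Assumed system charset: Windows-1250 (Central European), which has no Cyrillic.
as_system_charset = text.encode("cp1250", errors="replace")
print(as_system_charset)  # every Cyrillic letter is replaced by 0x3F ('?')

# Encoding with a charset that covers the text keeps it intact.
as_utf8 = text.encode("utf-8")
print(as_utf8.decode("utf-8") == text)
```

The substituted bytes match the pattern in the original report: one question mark per Cyrillic letter, with ASCII characters such as the space left untouched.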
(In reply to comment #12)

Henrik Pauli, I now also doubt the long-mail explanation, because I have looked
at your screenshot more carefully than before.
Removing it from "depends on:".
(But please be careful in testing, because bug 269812 occurs with long mails.)

As Jshin says, this sounds like a normal unicode->system charset conversion when
the charset doesn't have Cyrillic letters, or a unicode->system charset
conversion problem like Bug 120728.

(Q1) What is the locale/charset of your Hungarian MS Win-XP?
     (Question on the character set which the OS uses.)
     (eg. Shift_JIS when Japanese MS Win, EUC-JP usually when Japanese Linux.)
     (If changed to Windows-1252, this problem may possibly be resolved.)
(Q2) What about very short plain/text mail?
     (single line Subject: header, ascii only subject)  
     - UTF-8 and single line of normal mail body with Cyrillic letters.
     - Windows-1251 and single line of normal mail body with Cyrillic letters.
     (question to clarify "Cyrillic character related problem")
(Q3) Is there any relation to first character of a mail line?
        (See Bug 257828)
     (a) usual mail line with Cyrillic letters
     (b) ">" : Quoted text part with Cyrillic letters
     (c) "--" : Signature part with Cyrillic letters
(Q4) Is there any relation to Subject: header?
     - Single line Subject: header
     - Multiple line Subject: header
(Q5) Can next be a workaround?
     (1) CTRL+A(select all)/CTRL+C(copy) at mail window
     (2) Open notepad.exe
     (3) CTRL+V(paste) at notepad.exe

By the way, I think this problem can be re-created with "Draft", without
sending. Is it right?
  
No longer depends on: 269812
Correction. 
 "As Kim says" (<== "As Jshin says")
Kim, sorry for my mistake.
(I pasted text from bug 269812 in which I'm asking Jshin's help, but forgot to
change name portion...)
(A1) Windows-1250 "Central European".  If Kim Jeongkyu's assumptions are
correct, then if I changed my locale to Windows-1251 "Cyrillic" I could save the
name of Yevgeniy Anegin in Cyrillics to disk.  I have yet to try that.

(A2) My newer testcase, suggesting that this isn't related to the mail size bug
you've experienced, wasn't a single line, but still very short.  Will try this.

(A3) No.  My new test did not contain any quotation characters.  I also suspect
that #257828 is a separate issue -- although also a bug in the txt writer
component -- simply having to do with a quirky implementation of the signature,
nothing to do with encodings really.

(A4) Never before seen/produced a multiple Subject: line.

(A5) Certainly.  Works flawlessly.
(In reply to comment #15)
> Correction. 
>  "As Kim says" (<== "As Jshin says")
> Kim, sorry for my mistake.
> (I pasted text from bug 269812 in which I'm asking Jshin's help, but forgot to
> change name portion...)

wada, no problem at all :-)
Text still doesn't work as of Mozilla Thunderbird version 1.6a1 (20050903), HTML
does, so Jeongkyu Kim's workaround suggestion is applicable.

It also doesn't seem to matter whether it's a multibyte encoding or a singlebyte
one (Re: #16 A1)

Re: #14, yes, it works just the same in unsent mail; thus it doesn't depend on
mailservers' possible misprocessing of the e-mails -- it's a purely Mozilla bug.
QA Contact: front-end
Bug 269812 Comment #13 has pointed out a "split of 3 bytes code", and UTF-8 has 3-byte encodings. If this is the case, a "very large mail size" is not needed, and "greater than buffer size (4K)" is a sufficient condition. Please watch Bug 269812.
Adding Bug 269812 in depends-on again. 
Depends on: 269812
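
To make the "split of 3 bytes code" hypothesis concrete, here is a hedged Python sketch. It does not reproduce Thunderbird's code; the function names are hypothetical and the 4-byte chunk is a tiny stand-in for the guessed 4K buffer. It shows how decoding each buffer independently corrupts a character that straddles a boundary, while a decoder that carries leftover bytes forward does not.

```python
import codecs


def decode_chunks_naive(data: bytes, chunk_size: int) -> str:
    """Decode each chunk on its own; a character split across chunks is damaged."""
    return "".join(
        data[i:i + chunk_size].decode("utf-8", errors="replace")
        for i in range(0, len(data), chunk_size)
    )


def decode_chunks_incremental(data: bytes, chunk_size: int) -> str:
    """Carry incomplete trailing bytes over to the next chunk."""
    decoder = codecs.getincrementaldecoder("utf-8")()
    parts = [
        decoder.decode(data[i:i + chunk_size])
        for i in range(0, len(data), chunk_size)
    ]
    parts.append(decoder.decode(b"", final=True))  # flush; raises if bytes are left over
    return "".join(parts)


sample = ("\u20ac" * 10).encode("utf-8")  # ten Euro signs, 3 bytes each
CHUNK = 4                                  # assumption: stand-in for the 4K buffer

print(decode_chunks_naive(sample, CHUNK) == "\u20ac" * 10)        # False
print(decode_chunks_incremental(sample, CHUNK) == "\u20ac" * 10)  # True
```

Whether a given mail is affected then depends only on where the buffer boundary falls relative to a multi-byte character, not on the total mail size.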
Similar problem in Mac 2.0.0.12. A mail with Chinese characters saved as txt files results in the saved text as "我是刘源" characters instead of the proper chinese ones. But somehow I don't get ????? characters, so I'm not sure if it's entirely the same...
(In reply to comment #20)
> But somehow I don't get ????? characters, so I'm not sure if it's entirely the same...

The garbage generated by "save as text file" depends on the charset of the mail, the characters of that charset used in the mail, the position of the problematic character in the mail, and the OS/language (the char code used, based on locale). There are several corruption patterns. The glyphs displayed for the text file depend on the corruption pattern, the detected/used charset, the font used for display, and the application used for display.
See Bug 269812 and bugs listed in dependency tree for Bug 269812, and read Bug 269812 Comment #18.

To Gary Kwong:
If the mail contains characters with 3-byte UTF-8 codes, and if it is this bug (or Bug 269812), I think your problem will disappear with the following test.
 1. Create a mail folder (say Test)
 2. Copy the UTF-8 mail to Test folder twice (==> 2 identical mails)
 3. Click other mail folder(force close of file of Test)
 4. Edit file of Test (not Test.msf) by text editor (open with UTF-8)
 5. Remove lower half lines in mail body of first mail  (make it less than 4K) 
 6. Remove upper half lines in mail body of second mail (make it less than 4K)
 7. Click Test folder, "Rebuild Index" via folder property
 8. Save as text file. (both of first & second mail)
 (If large mail, increase number of copies, and increase number of remove lines) 
Is a 3-byte UTF-8 code involved?
Does your problem disappear with the above procedure?

By the way, for UTF-8 mail, I guess the problem can easily be seen with a test mail of many "Euro Sign only" lines, because the Euro Sign is 3 bytes in UTF-8.
> ABCDEF[CRLF]
> EuroSign000001EuroSignEuroSignEuroSign...EuroSign[CRLF]
> EuroSign000002EuroSignEuroSignEuroSign...EuroSign[CRLF]
>              | (<== line number to see "corrupted at where")
> EuroSign009999EuroSignEuroSignEuroSign...EuroSign[CRLF]
> UVWXYZ[CRLF]
Try it.
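
A test body of the shape sketched above can be generated rather than typed. This is a minimal Python sketch under stated assumptions: the helper name `build_euro_body` and the count of 20 trailing Euro signs per line are hypothetical, since the comment elides the exact line length with "...".

```python
def build_euro_body(line_count: int, euros_per_line: int) -> str:
    """Build a mail body of numbered Euro-sign lines between two ASCII marker lines."""
    euro = "\u20ac"  # Euro sign, 3 bytes in UTF-8
    rows = ["ABCDEF"]
    for n in range(1, line_count + 1):
        # The 6-digit line number shows where corruption starts.
        rows.append(f"{euro}{n:06d}{euro * euros_per_line}")
    rows.append("UVWXYZ")
    return "\r\n".join(rows) + "\r\n"


# Assumed sizes: 9999 lines as in the sketch above, 20 trailing Euro signs per line.
body = build_euro_body(line_count=9999, euros_per_line=20)
encoded = body.encode("utf-8")
print(len(encoded))  # total size of the UTF-8 body in bytes
```

The resulting text can be pasted into a plain-text draft with UTF-8 selected as the character encoding.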
(In reply to comment #21)
>  5. Remove lower half lines in mail body of first mail  (make it less than 4K) 
>  6. Remove upper half lines in mail body of second mail (make it less than 4K)

I don't quite get it. That mail I tested only has 4 chinese characters in each of subject and body... it shouldn't be larger than 4k.
(In reply to comment #22)
> it shouldn't be larger than 4k
Oh, sorry for the ambiguous/uncertain number. My 4K meant the internal buffer size, which I'm guessing. Even when the mail size is smaller than 4K, the internal data size can exceed the buffer size (I'm guessing it's 4K), because a single ASCII byte is possibly converted to 2-byte Unicode or another code internally during processing.

> That mail I tested only has 4 chinese characters in each of subject and body...

Can you attach the mail to the bug?
 (a. Save as ".eml", and attach the .eml file via the "Add an attachment" link)
 (b. Create a Test folder, copy the mail, click another folder, attach the file of Test)
If sensitive information such as a mail address is included in the file, replace it with a dummy string such as xxx@bbb.ccc, keeping the string length.
Attached file probable testcase (obsolete) —
I'm not sure if this example is a decent / correct one. I tried reimporting the .eml and then exporting txt but it just didn't work.

I could export the original mail as txt though.
(In reply to comment #24)
> probable testcase
I believe your problem is not this bug.

Gary Kwong:
What is your OS? What is locale of your OS? What char code is used by your OS?
(When Japanese MS Win, locale is Japan, and system char code = Shift_JIS)
What program is used to display the saved text file?
What is displayed by next test?
(1) Save the mail as text file
(2) Open it with Firefox or other browser, and change View/Character Encoding  
    GB2312(charset of mail), Your OS's char code, Windows-1252, ISO-8859-1 etc.
(In reply to comment #25)
> What is your OS? What is locale of your OS? What char code is used by your OS?
> (When Japanese MS Win, locale is Japan, and system char code = Shift_JIS)

English Mac OS X 10.5 Leopard, with Chinese char support, English menus.

> What program is used to display the saved text file?

TextEdit. (The Mac's notepad equivalent)

> What is displayed by next test?
> (1) Save the mail as text file
> (2) Open it with Firefox or other browser, and change View/Character Encoding  
>     GB2312(charset of mail), Your OS's char code, Windows-1252, ISO-8859-1 etc.
> 

It works fine when viewed in UTF-8! (but had defaulted to viewing in ISO-8859-1)

(Thus I am still unable to confirm this bug...)
(In reply to comment #26)
> 
> It works fine when viewed in UTF-8! (but had defaulted to viewing in
> ISO-8859-1)
> 
> (Thus I am still unable to confirm this bug...)
> 

You're fine then -- it's just your text editor/viewer being unable to guess whether it's Latin-1 or UTF-8 (which /is/ fairly hard without heuristics -- or a BOM at the beginning of the file)
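
The difference between "misread" and "destroyed" can be shown in a few lines. This is a hedged Python sketch, not tied to any particular viewer: it assumes the viewer falls back to ISO-8859-1, as in the comment above, and uses Python's `utf-8-sig` codec to write the BOM.

```python
import codecs

text = "Пушкин"
utf8_bytes = text.encode("utf-8")

# A viewer that assumes ISO-8859-1 shows one glyph per byte: garbage on screen,
# but the bytes in the file are still intact.
misread = utf8_bytes.decode("latin-1")
print(misread != text)

# Re-interpreting the same bytes as UTF-8 recovers the original text.
print(misread.encode("latin-1").decode("utf-8") == text)

# A BOM at the start of the file lets a viewer recognise UTF-8 without guessing.
with_bom = text.encode("utf-8-sig")
print(with_bom.startswith(codecs.BOM_UTF8))
```

By contrast, once characters have been replaced by 0x3F in the file, no choice of viewing encoding can bring them back.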

In my case, the non-ASCII characters were entirely turned into dust (i.e. question marks), regardless of the e-mail size (still not sure how that could affect encoding capabilities in the first place).

I haven't used Thunderbird or Mozilla Mail in years now (thank goodness), especially not on Windows, so I can't tell whether it's still valid in the Windows builds or not.  It would be interesting to hear from someone trying to reproduce this.
I can reproduce it in Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9b5pre) Gecko/2008032502 Thunderbird/3.0a1pre 

1. Write a plain text message containing a Cyrillic character
2. Choose: Options - Character Encoding - Unicode (UTF-8)
3. Save the message as a draft
4. Select the draft, right-click, and choose Save As...
5. Choose Save As Type: Text Files and press OK

Result: loss of data without any warning.  Viewed in a hex editor, the resulting file contains hex 3F (question mark) where the original message contained hex D0 89 (UTF-8 encoded capital lje).
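
A check of the kind that would allow a warning before such data loss can be sketched as follows. This is an illustrative Python sketch only: `lost_characters` is a hypothetical helper, and Windows-1252 is an assumption about the system charset of the en-US Windows XP build in this comment.

```python
def lost_characters(text: str, charset: str) -> list:
    """Return the characters of `text` that `charset` cannot represent."""
    lost = []
    for ch in text:
        try:
            ch.encode(charset)
        except UnicodeEncodeError:
            lost.append(ch)
    return lost


lje = "\u0409"  # CYRILLIC CAPITAL LETTER LJE
print(lje.encode("utf-8").hex())                   # d089, as seen in the original message
print(lost_characters("Lje: " + lje, "cp1252"))    # the character that would become '?'
print(lost_characters("Lje: " + lje, "utf-8"))     # empty: nothing is lost in UTF-8
```

A non-empty result means that saving in that charset is lossy, which is the point at which a save dialog could warn or offer another encoding.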

(In reply to comment #28)
> I can reproduce it in Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US;
> rv:1.9b5pre) Gecko/2008032502 Thunderbird/3.0a1pre 

Confirming based on original report and the above comment. Nominating wanted-thunderbird3 for impact on l10n.
Status: UNCONFIRMED → NEW
Ever confirmed: true
Flags: wanted-thunderbird3?
Mail folder file contains 8 UTF-8 mails with Euro-Sign only.

>(Mail body)
> ABCDEFGH
> Euro000001EuroEuro...EuroEuro<Pad>
> Euro000002EuroEuro...EuroEuro
>          |
> Euro000030EuroEuro...EuroEuro
> STUVWXYZ
>
>
 
Mail-0 : Subject: Euro-Sign-0030-0000  ( <Pad> Length=0, <Pad>="" )
Mail-1 : Subject: Euro-Sign-0030-0001  ( <Pad> Length=1, <Pad>="A" )
Mail-2 : Subject: Euro-Sign-0030-0002  ( <Pad> Length=2, <Pad>="AB" )
Mail-3 : Subject: Euro-Sign-0030-0003  ( <Pad> Length=3, <Pad>="ABC" )
Mail-4 : Subject: Euro-Sign-0030-0004  ( <Pad> Length=4, <Pad>="ABCD" )
Mail-5 : Subject: Euro-Sign-0030-0005  ( <Pad> Length=5, <Pad>="ABCDE" )
Mail-6 : Subject: Euro-Sign-0030-0006  ( <Pad> Length=6, <Pad>="ABCDEF" )
Mail-7 : Subject: Euro-Sign-0030-0007  ( <Pad> Length=7, <Pad>="ABCDEFG" )

(Test Result of save as text, on Japanese MS Win-XP SP2, Tb trunk 2008/3/27)
A. Mail-0(pad_len=0),Mail-3(pad_len=3),Mail-6(pad_len=6) => Successfully saved
B. Mail-1(pad_len=1),Mail-4(pad_len=4),Mail-7(pad_len=7) => Null file
C. Mail-2(pad_len=2),Mail-5(pad_len=5)                   => Null file

Note:
When Japanese MS Win, the system charset is Shift_JIS, and Shift_JIS doesn't have the Euro-Sign. So the Euro-Sign is converted to ASCII "EUR" when saved as text, and a null file seems to be generated if the conversion fails.
The test result depends on the system charset. A different result may be obtained on a different OS/locale.
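
The Euro-Sign to "EUR" substitution noted above can be imitated with a custom codec error handler. This is a rough Python sketch, not the Mozilla converter: the handler name `translit_sketch` and the `FALLBACKS` table are hypothetical, and only the Euro-Sign mapping comes from the note above.

```python
import codecs

# Hypothetical fallback table; only Euro-Sign -> "EUR" is taken from the note above.
FALLBACKS = {"\u20ac": "EUR"}


def transliterate(error):
    """Replace unencodable characters with an ASCII fallback, or '?' if none is known."""
    if not isinstance(error, UnicodeEncodeError):
        raise error
    bad = error.object[error.start:error.end]
    replacement = "".join(FALLBACKS.get(ch, "?") for ch in bad)
    return replacement, error.end  # resume encoding after the bad span


codecs.register_error("translit_sketch", transliterate)

line = "\u20ac000001\u20ac\u20ac"
print(line.encode("shift_jis", errors="translit_sketch"))
```

Because the Euro-Sign is 3 bytes in UTF-8 and "EUR" is 3 bytes in ASCII, such a substitution leaves the byte count of an all-Euro line unchanged.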
Dupe of bug 181456?
Assignee: mscott → nobody
(In reply to comment #32)
> Dupe of bug 181456?
No.

The following are the file sizes produced by "save as text" with the test case of Comment #30.
The problem/phenomenon I explained in Comment #30 is corruption of the text file (a null file, in this test case on Japanese MS Win).
When saved with size=4K, the Euro-Sign is successfully converted to "EUR" in my test (on Japanese MS Win).    
> File Name               Size
> ---------------------  -----
> EuroSign-0030-0000.txt 4,062
> EuroSign-0030-0001.txt     0
> EuroSign-0030-0002.txt     0
> EuroSign-0030-0003.txt 4,065
> EuroSign-0030-0004.txt     0
> EuroSign-0030-0005.txt     0
> EuroSign-0030-0006.txt 4,068
> EuroSign-0030-0007.txt     0

 
(In reply to comment #32)
> Dupe of bug 181456?

My test case of Comment #30 seems to be a different problem (Bug 269812 itself) from this bug, i.e. Henrik Pauli (bug opener)'s case. The original problem of this bug looks to be a DUP of Bug 181456.
I seem to have confused this bug with the phenomenon of Bug 269812 Comment #9 through Bug 269812 Comment #12. Sorry for my confusion.
Status: NEW → RESOLVED
Closed: 16 years ago
Flags: wanted-thunderbird3?
Resolution: --- → DUPLICATE