Closed Bug 181456 Opened 22 years ago Closed 3 years ago

Save As text loses characters outside the page or message encoding

Categories

(Core :: Internationalization, defect)

defect
Not set
critical

RESOLVED WORKSFORME

People

(Reporter: evgen_k, Assigned: smontagu)

References

(Depends on 1 open bug)

Details

(Keywords: dataloss, intl)

User-Agent:       Mozilla/5.0 (OS/2; U; Warp 4.5; en-US; rv:1.2b) Gecko/20021030
Build Identifier: Mozilla/5.0 (OS/2; U; Warp 4.5; en-US; rv:1.2b) Gecko/20021030

Achtung!
It is a very big file, about 4.5 MB, and it almost froze OS/2 with Mozilla 1.0
(1.2b does not freeze), so it can be expected to crash some other systems.
Some other bugs may be found as well.
Please send what you think about such HTML code to openoffice.org

By the way, there are two (!!!) different charset metas in the header!
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN"><html
xmlns:o="urn:schemas-microsoft-com:office:office"
xmlns:x="urn:schemas-microsoft-com:office:excel"
xmlns="http://www.w3.org/TR/REC-html40"><head>
  <meta http-equiv="Content-Type" content="text/html;CHARSET=iso-8859-1">
  <meta content="text/html; charset=windows-1252" http-equiv="CONTENT-TYPE">

So you can also send what you think about such HTML code to
www.microsoft.org

URL is the same as for bug 181450



Reproducible: Always

Steps to Reproduce:
1. Go to the URL and save it as text

Actual Results:  
All characters like &#1100; are turned into ? (question marks)

Expected Results:  
All characters like &#1100; should be turned into some code, not into ???
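
The reported loss can be reproduced outside the browser. This is a minimal Python sketch, assuming the save path encodes the text into the page's declared single-byte charset and substitutes '?' on failure. That policy is an assumption made for illustration, not Mozilla's actual code.

```python
# &#1100; refers to U+044C (CYRILLIC SMALL LETTER SOFT SIGN).
text = "soft sign: " + chr(1100)

# Assumption: saving encodes into the page charset (iso-8859-1 here, taken
# from the first meta tag) and replaces anything it cannot represent.
saved = text.encode("iso-8859-1", errors="replace")

print(saved)  # b'soft sign: ?'
```

Once the '?' has been written, the original character cannot be recovered from the saved file, which is why the bug carries the dataloss keyword.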

Mozilla could do either of two things:

1) It could automatically switch to a Unicode encoding as soon as there are
characters that do not fit into the page-specified encoding. This is à la
TextEdit on Mac OS X. Preferably, it would save such a file as UTF-16 with a
BOM, so that a text editor can automatically determine the encoding. This would
preserve any character in the document. More advanced, it could offer a list of
possible encodings (also à la TextEdit) when saving.

2) It could save special entities outside the charset unparsed (as "&#1100;"),
although this could be very unreadable depending on the amount of entities.

Item 1 could be turned into a feature request. As it is, Mozilla seems to just
ignore characters outside the given charset when saving as text (old Netscape
behavior). A little testing shows that it isn't actually following the charset
strictly. Saving a UTF-16 page with a BOM as text yields a UTF-16 text file,
although without a BOM (so the text editor had to be instructed which charset to
use).
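
The two options in this comment can be sketched in Python. The function names `save_as_text` and `save_with_ncrs` are hypothetical, and the fallback policy is only one possible reading of option 1. None of this is Mozilla code.

```python
import codecs


def save_as_text(text: str, page_charset: str) -> bytes:
    """Option 1: keep the page charset if lossless, else UTF-16 with a BOM."""
    try:
        return text.encode(page_charset)
    except UnicodeEncodeError:
        # Python's "utf-16" codec prepends a BOM in native byte order.
        return text.encode("utf-16")


def save_with_ncrs(text: str, page_charset: str) -> bytes:
    """Option 2: write unrepresentable characters as numeric references."""
    return text.encode(page_charset, errors="xmlcharrefreplace")


def has_utf16_bom(data: bytes) -> bool:
    """Report whether the data starts with either UTF-16 byte order mark."""
    return data[:2] in (codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)


sample = "a" + chr(1100)
print(has_utf16_bom(save_as_text(sample, "iso-8859-1")))  # True
print(save_with_ncrs(sample, "iso-8859-1"))               # b'a&#1100;'
```

Option 1 preserves every character at the cost of changing the file's encoding. Option 2 keeps the encoding but leaves markup in what is supposed to be plain text.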
-> Internationalization, not sure if that is right.

also cc'ing bz, since it could be file handling
Component: Browser-General → Internationalization
grrr, forgot to reassign
Assignee: asa → smontagu
QA Contact: asa → ylong
I thought this would be a dupe, but I can't find another report.

We should:

(1) Offer the option to save as charset (defaulting to the charset that the page
is currently being viewed as).
(2) Warn the user when the page contains codepoints outside the chosen charset.
(3) Be clever enough to offer only charsets that can represent every codepoint
in the page.

I believe that Composer already does at least (1) and (2).
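
As a rough illustration of items (2) and (3), the sketch below checks which charsets can hold a given text and which characters would trigger a warning. The helper names and the candidate list are hypothetical. A real save dialog would draw its candidates from the charsets the application supports.

```python
def lossless_charsets(text, candidates):
    """Return the candidate charsets that can encode every character."""
    usable = []
    for name in candidates:
        try:
            text.encode(name)
        except UnicodeEncodeError:
            continue
        usable.append(name)
    return usable


def unencodable(text, charset):
    """Return the distinct characters the charset cannot represent."""
    bad = []
    for ch in text:
        try:
            ch.encode(charset)
        except UnicodeEncodeError:
            if ch not in bad:
                bad.append(ch)
    return bad


page_text = "caf\u00e9 " + chr(1100)
print(lossless_charsets(page_text, ["ascii", "iso-8859-1", "utf-8", "utf-16"]))
print(unencodable(page_text, "iso-8859-1"))
```

A non-empty result from `unencodable` is the condition under which item (2) would warn the user before saving.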
Status: UNCONFIRMED → NEW
Ever confirmed: true
Keywords: dataloss
Summary: Can't convert unicode charactres to 1-byte chars while "save page as ->text flies" (i.e. all chars like ;&#1100; are turned to ? - question marks ) → Save As text loses characters outside the page encoding
Keywords: intl
By the definitions on <http://bugzilla.mozilla.org/bug_status.html#severity> and
<http://bugzilla.mozilla.org/enter_bug.cgi?format=guided>, crashing and dataloss
bugs are of critical or possibly higher severity.  Only changing open bugs to
minimize unnecessary spam.  Keywords to trigger this would be crash, topcrash,
topcrash+, zt4newcrash, dataloss.
Severity: normal → critical
By definition, save as TXT converts the file into the codepage of the operating
system.

This WILL cause loss.

Leaving them as HTML entities is the wrong idea, since people complain when Save
As Text leaves any HTML at all in. It should convert the file to text.

Personally, I think this is a WONTFIX.

the only thing of interest here is an RFE to specify the charset during save as
text.
OS: other → All
Hardware: Other → All
This bug affects not just characters "outside" of some character set, but also
character entities that should be rendered well in ASCII, for example &#147;
should be rendered as "``", &#148; as "''", &#151; as "--", etc.  This in effect
makes it impossible to save a properly marked-up document as plain text, which
is annoying.

I am using Mozilla 1.5 on Linux.
Sorry, forgot to mention that these character entities are rendered as '?' in
the text file.
#147 in what encoding? U+0093 (&#147;) is not a graphic character but a C1
control character. Are you assuming Windows-1252? (You're not supposed to do
that. NCRs in any HTML/XHTML should refer to Unicode code points.) Anyway, 0x93
in Windows-1252 is not covered by many character encodings. What's your locale?
If your locale codeset doesn't cover 0x93 in Windows-1252, it's natural that
it's turned into '?'. However, your request for transliteration is valid. So, in
addition to the three items in comment #4, we can transliterate (maybe,
optionally).
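
The distinction between the code point U+0093 and the Windows-1252 byte 0x93 can be checked directly, and the requested transliteration can be sketched with a small lookup table. The table and the `transliterate` helper are hypothetical. They use the ASCII forms suggested in the earlier comment and fall back to '?' for everything else.

```python
import unicodedata

# &#147; is an NCR for U+0093, which Unicode classifies as a control character.
print(unicodedata.category(chr(147)))  # Cc

# Byte 0x93 in Windows-1252 is U+201C LEFT DOUBLE QUOTATION MARK.
print(bytes([0x93]).decode("cp1252") == "\u201c")  # True

# Hypothetical fallback table: curly quotes and the em dash to ASCII.
ASCII_FALLBACKS = {"\u201c": "``", "\u201d": "''", "\u2014": "--"}


def transliterate(text, charset):
    """Keep encodable characters; substitute a fallback or '?' otherwise."""
    out = []
    for ch in text:
        try:
            ch.encode(charset)
            out.append(ch)
        except UnicodeEncodeError:
            out.append(ASCII_FALLBACKS.get(ch, "?"))
    return "".join(out)


print(transliterate("\u201cHi\u201d \u2014 ok", "ascii"))  # ``Hi'' -- ok
```

Transliteration is lossy too, but the result stays readable, which is the point of the request.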
(In reply to comment #6)
> the only thing of interest here is an RFE to specify the charset during save as
> text.

And that would be bug 171103.
Depends on: 171103
(In reply to comment #6)
> By definition, save as TXT converts the file into the codepage of the operating
> system.
>
> This WILL cause loss.

Why not change that, use UTF-8 instead?
QA Contact: amyy → i18n
(In reply to comment #11)
> Why not change that, use UTF-8 instead?

Then a user would not be able to view the file. TXT files (at least on Windows) don't have a codepage associated with them.

You need to save in the native codepage so someone could open up say Notepad and look at the file.
Huh? Even Notepad handles UTF-8 just fine; its own "Save As" dialog has the option to save as UTF-8. It uses the UTF-8 BOM, I believe.
True. I guess we could look into this. It would be a pretty big change to existing behavior, though, and would need a lot of testing to see what it does on other platforms.
Generally speaking the UTF-8 BOM causes more problems than it solves, even on Windows.
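
A short Python sketch of what the UTF-8 BOM does to a saved file. The `utf-8-sig` codec is Python's name for UTF-8 with a leading BOM. The example shows one common failure mode and is not a statement about how any particular editor behaves.

```python
import codecs

text = "soft sign: " + chr(1100)

plain = text.encode("utf-8")         # no BOM
with_bom = text.encode("utf-8-sig")  # prepends the bytes EF BB BF

print(with_bom[:3] == codecs.BOM_UTF8)  # True
print(with_bom[3:] == plain)            # True

# A reader that decodes as plain UTF-8 keeps the BOM as a stray U+FEFF.
print(with_bom.decode("utf-8")[0] == "\ufeff")  # True

# A BOM-aware reader strips it and recovers the original text.
print(with_bom.decode("utf-8-sig") == text)     # True
```

The BOM helps tools that use it to detect the encoding, but any consumer that does not strip it sees an extra invisible character at the start of the file.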
Updating the summary, because I suspect that saving as text is a more common use case for mail than it is for browsing.
Summary: Save As text loses characters outside the page encoding → Save As text loses characters outside the page or message encoding
This still persists in Firefox 24.

(I found that I can use the Save As Text option instead of saving as a complete web page and then deleting the additional files. I need the complete page because it captures content loaded via AJAX, e.g. a news feed from friends or subscriptions, or a person's profile page with older posts loaded, on modern sites like VKontakte, Facebook, and Twitter. That is how I discovered this bug.)

Hello! I will close this issue as RESOLVED-WORKSFORME since there weren't any new crashes in the last 6 months with this crash signature. If the issue is still present, please feel free to reopen it.

Thank you and have a nice day!

Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → WORKSFORME