User-Agent: Mozilla/5.0 (OS/2; U; Warp 4.5; en-US; rv:1.2b) Gecko/20021030
Build Identifier: Mozilla/5.0 (OS/2; U; Warp 4.5; en-US; rv:1.2b) Gecko/20021030

Warning: it is a very big file, about 4.5 MB, and it nearly froze OS/2 with Mozilla 1.0 (1.2b does not freeze), so it may well crash some other systems. Some other bugs may turn up too. Please send what you think about such HTML code to openoffice.org.

By the way, there are two (!!!) different charset metas in the header:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN"><html xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:x="urn:schemas-microsoft-com:office:excel" xmlns="http://www.w3.org/TR/REC-html40"><head>
<meta http-equiv="Content-Type" content="text/html;CHARSET=iso-8859-1">
<meta content="text/html; charset=windows-1252" http-equiv="CONTENT-TYPE">

So you can also send what you think about such HTML code to www.microsoft.org.

The URL is the same as for bug 181450.

Reproducible: Always

Steps to Reproduce:
1. Go to the URL and save it as text.

Actual Results: all characters like ь are turned into ? (question marks).

Expected Results: all characters like ь should be turned into some code, not into ???.
Mozilla could do either of two things:
1) It could automatically switch to a Unicode encoding as soon as there are characters that do not fit into the page-specified encoding. This is à la TextEdit on Mac OS X. Preferably, it would save such a file as UTF-16 with a BOM, so that a text editor can automatically determine the encoding. This would preserve any character in the document. More advanced, it could offer a list of possible encodings (also à la TextEdit) when saving.
2) It could save special entities outside the charset unparsed (as "ь"), although this could be very unreadable, depending on the number of entities.
Item 1 could be turned into a feature request. As it is, Mozilla seems to just ignore characters outside the given charset when saving as text (old Netscape behavior). A little testing shows that it isn't actually following the charset strictly: saving a UTF-16 page with a BOM as text yields a UTF-16 text file, although without a BOM (so the text editor had to be told which charset to use).
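A minimal sketch of option 1 above, in Python rather than Mozilla's own code: try the page charset strictly and fall back to UTF-16 with a BOM only when something does not fit. The function name `encode_for_save` is hypothetical and is not part of any Mozilla API.

```python
from typing import Tuple


def encode_for_save(text: str, page_charset: str) -> Tuple[bytes, str]:
    """Return (data, charset_used) for a hypothetical Save As Text step."""
    try:
        # Strict mode raises instead of silently substituting '?'.
        return text.encode(page_charset, errors="strict"), page_charset
    except UnicodeEncodeError:
        # Python's "utf-16" codec prepends a BOM in native byte order,
        # so an editor can detect the encoding on its own.
        return text.encode("utf-16"), "utf-16"


data, used = encode_for_save("soft sign: \u044c", "iso-8859-1")
print(used)
```

Text that fits the page charset is saved unchanged in that charset, so the fallback only affects pages that would otherwise lose characters.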
-> Internationalization, not sure if that is right. also cc'ing bz, since it could be file handling
Component: Browser-General → Internationalization
grrr, forgot to reassign
Assignee: asa → smontagu
QA Contact: asa → ylong
I thought this would be a dupe, but I can't find another report. We should: (1) Offer the option to save as charset (defaulting to the charset that the page is currently being viewed as). (2) Warn the user when the page contains codepoints outside the chosen charset. (3) Be clever enough to offer only charsets that can represent every codepoint in the page. I believe that Composer already does at least (1) and (2).
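Item (3) could be approximated by test-encoding the text against each candidate charset. This is only a sketch in Python: the helper name `usable_charsets` and the candidate list are assumptions for illustration, not anything Mozilla or Composer is known to use.

```python
from typing import Iterable, List


def usable_charsets(text: str, candidates: Iterable[str]) -> List[str]:
    """Keep only the charsets that can encode every character in text."""
    usable = []
    for name in candidates:
        try:
            text.encode(name)  # strict by default: raises on any misfit
        except UnicodeEncodeError:
            continue
        usable.append(name)
    return usable


# Hypothetical shortlist a save dialog might offer.
CANDIDATES = ["iso-8859-1", "windows-1252", "koi8-r", "utf-8"]

print(usable_charsets("caf\u00e9", CANDIDATES))
print(usable_charsets("\u044c", CANDIDATES))
```

The warning in item (2) falls out of the same check: if the currently chosen charset is missing from the returned list, the page contains codepoints it cannot represent.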
Status: UNCONFIRMED → NEW
Ever confirmed: true
Summary: Can't convert unicode charactres to 1-byte chars while "save page as ->text flies" (i.e. all chars like ;ь are turned to ? - question marks ) → Save As text loses characters outside the page encoding
By the definitions on <http://bugzilla.mozilla.org/bug_status.html#severity> and <http://bugzilla.mozilla.org/enter_bug.cgi?format=guided>, crashing and dataloss bugs are of critical or possibly higher severity. Only changing open bugs to minimize unnecessary spam. Keywords to trigger this would be crash, topcrash, topcrash+, zt4newcrash, dataloss.
Severity: normal → critical
By definition, Save As TXT converts the file into the codepage of the operating system. This WILL cause loss. Leaving them as HTML entities is the wrong idea, since people complain when Save As Text leaves any HTML in at all. It should convert the file to text. Personally, I think this is a WONTFIX. The only thing of interest here is an RFE to specify the charset during Save As Text.
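The two behaviours being weighed here can be illustrated with Python's codec error handlers (a sketch only; ISO-8859-1 stands in for whatever the operating system codepage happens to be):

```python
text = "soft sign: \u044c"

# Lossy conversion: unencodable characters become '?', as reported.
lossy = text.encode("iso-8859-1", errors="replace")

# Alternative: keep them as numeric character references, which
# preserves the data but leaves HTML-style markup in the text file.
ncr = text.encode("iso-8859-1", errors="xmlcharrefreplace")

print(lossy)  # b'soft sign: ?'
print(ncr)    # b'soft sign: &#1100;'
```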
OS: other → All
Hardware: Other → All
This bug affects not just characters "outside" of some character set, but also character entities that should be rendered well in ASCII, for example #147; should be rendered as "``", #148; as "''", #151; as "--", etc. In effect, this makes it impossible to save a properly marked-up document as plain text, which is annoying. I am using Mozilla 1.5 on Linux.
Sorry, forgot to mention that these character entities are rendered as '?' in the text file.
#147 in what encoding? U+0093 (“) is not a graphic character but a C1 control character. Are you assuming Windows-1252? (You're not supposed to do that. NCRs in any HTML/XHTML should refer to Unicode code points.) Anyway, the character at 0x93 in Windows-1252 is not covered by many character encodings. What's your locale? If your locale codeset doesn't cover the character at 0x93 in Windows-1252, it's natural that it's turned to '?'. However, your request for transliteration is valid. So, in addition to the three items in comment #4, we can transliterate (maybe optionally).
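The distinction, and the kind of transliteration being requested, can be sketched in Python. The `ASCII_FALLBACKS` table and the `transliterate` helper are hypothetical; the replacements are the ones suggested in comment #7.

```python
import unicodedata

# Read as a Unicode code point, 147 is a C1 control character.
print(unicodedata.category(chr(147)))  # Cc

# Read as a Windows-1252 byte, 0x93 is a left double quotation mark.
as_cp1252 = bytes([147]).decode("windows-1252")
print(hex(ord(as_cp1252)))  # 0x201c

# Hypothetical ASCII fallbacks for the three characters mentioned.
ASCII_FALLBACKS = {
    0x201C: "``",  # left double quotation mark (0x93 in Windows-1252)
    0x201D: "''",  # right double quotation mark (0x94)
    0x2014: "--",  # em dash (0x97)
}


def transliterate(text: str) -> str:
    """Replace characters that have a known ASCII fallback."""
    return text.translate(ASCII_FALLBACKS)


print(transliterate("\u201cquoted\u201d \u2014 text"))
```

Characters without an entry in the table pass through unchanged, so a lossy '?' substitution would still be needed as a last resort.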
(In reply to comment #6)
> the only thing of interest here is an RFE to specify the charset during save as text.
And that would be bug 171103.
(In reply to comment #6)
> By definition, save as TXT converts the file into the codepage of the operating system.
>
> This WILL cause loss.
Why not change that, use UTF-8 instead?
QA Contact: amyy → i18n
(In reply to comment #11)
> Why not change that, use UTF-8 instead?
Then a user would not be able to view the file. TXT files (at least on Windows) don't have a codepage associated with them. You need to save in the native codepage so that someone can open, say, Notepad and look at the file.
Huh? Even Notepad handles UTF-8 just fine; its "Save as" dialog itself has the option to save as UTF-8. It uses the UTF-8 BOM, I believe.
True. I guess we could look into this. It would be a pretty big change to existing behavior, though, and would need a lot of testing to see what it does on other platforms.
Generally speaking the UTF-8 BOM causes more problems than it solves, even on Windows.
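One commonly cited problem, sketched here in Python as an illustration rather than as a description of any Mozilla behavior: a reader that decodes the file as plain UTF-8 without stripping the BOM ends up with a stray U+FEFF at the start of the text.

```python
import codecs

# "utf-8-sig" writes the three-byte UTF-8 BOM before the content.
with_bom = "text".encode("utf-8-sig")
print(with_bom.startswith(codecs.BOM_UTF8))  # True

# A BOM-unaware reader keeps U+FEFF as the first character.
naive = with_bom.decode("utf-8")
print(naive[0] == "\ufeff")  # True

# A BOM-aware reader strips it.
aware = with_bom.decode("utf-8-sig")
print(aware)  # text
```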
Updating the summary, because I suspect that saving as text is a more common use case for mail than it is for browsing.
Summary: Save As text loses characters outside the page encoding → Save As text loses characters outside the page or message encoding
This still persists in Firefox 24. (I found that I can use the Save As Text option instead of saving as a complete web page and then deleting the additional files. I need the complete page because it saves content that is loaded via AJAX, e.g. a news feed from friends or subscriptions, or a person's "profile page" with older posts loaded, on modern sites like VKontakte, Facebook, and Twitter. That is how I discovered this bug.)