181456 - Save As text loses characters outside the page or message encoding

Reporter

Description

•

22 years ago

User-Agent:       Mozilla/5.0 (OS/2; U; Warp 4.5; en-US; rv:1.2b) Gecko/20021030
Build Identifier: Mozilla/5.0 (OS/2; U; Warp 4.5; en-US; rv:1.2b) Gecko/20021030

Achtung!
This is a very big file (about 4.5 MB) and it almost froze OS/2 with Mozilla 1.0
(1.2b does not freeze), so it may be expected to crash some other systems.
Some other bugs may be found as well.
Please send what you think about such HTML code to openoffice.org

By the way, there are two (!!!) different charset metas in the header!
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN"><html
xmlns:o="urn:schemas-microsoft-com:office:office"
xmlns:x="urn:schemas-microsoft-com:office:excel"
xmlns="http://www.w3.org/TR/REC-html40"><head>
  <meta http-equiv="Content-Type" content="text/html;CHARSET=iso-8859-1">
  <meta content="text/html; charset=windows-1252" http-equiv="CONTENT-TYPE">

So you can also send what you think about such HTML code to
www.microsoft.org

URL is the same as for bug 181450



Reproducible: Always

Steps to Reproduce:
1. Go to the URL and save it as text

Actual Results:  
All characters like &#1100; are turned into ? (question marks)

Expected Results:  
All characters like &#1100; should be turned into some code, not into ???
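
The reported behavior matches what a lossy encoder does with characters it cannot represent. A minimal Python sketch, assuming the save path encodes to the page charset with a "replace" policy (an illustration only, not Mozilla's actual code; `save_as_text_lossy` is a hypothetical name):

```python
def save_as_text_lossy(text: str, charset: str = "iso-8859-1") -> bytes:
    """Encode text, substituting '?' for anything the charset cannot hold."""
    return text.encode(charset, errors="replace")


# U+00E9 fits in ISO-8859-1; U+044C (Cyrillic soft sign, &#1100;) does not.
sample = "caf\u00e9 \u044c"
print(save_as_text_lossy(sample))  # b'caf\xe9 ?'
```

Once the '?' is written, the original character cannot be recovered from the saved file, which is why the report is about data loss.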


Niklas Dougherty

Comment 1

•

22 years ago

Mozilla could do either of two things:

1) It could automatically switch to a Unicode encoding as soon as there are
characters that do not fit into the page-specified encoding. This is à la
TextEdit on Mac OS X. Preferably, it would save such a file as UTF-16 with a
BOM, so that a text editor can automatically determine the encoding. This would
preserve any character in the document. More advanced, it could offer a list of
possible encodings (also à la TextEdit) when saving.

2) It could save special entities outside the charset unparsed (as "&#1100;"),
although this could be very unreadable depending on the amount of entities.

Item 1 could be turned into a feature request. As it is, Mozilla seems to just
ignore characters outside the given charset when saving as text (old Netscape
behavior). A little testing shows that it isn't actually following the charset
strictly. Saving a UTF-16 page with a BOM as text yields a UTF-16 text file,
although without a BOM (so the text editor had to be instructed which charset to
use).
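
The two options in this comment can be sketched in Python. This is only an illustration under assumptions: the function names are hypothetical, and little-endian byte order is an arbitrary choice for the UTF-16 case.

```python
import codecs


def save_as_utf16_with_bom(text: str) -> bytes:
    """Option 1: write UTF-16 (little-endian here) with a leading BOM."""
    return codecs.BOM_UTF16_LE + text.encode("utf-16-le")


def save_with_char_refs(text: str, charset: str) -> bytes:
    """Option 2: keep unencodable characters as numeric character references."""
    return text.encode(charset, errors="xmlcharrefreplace")


print(save_as_utf16_with_bom("\u044c"))             # b'\xff\xfeL\x04'
print(save_with_char_refs("a\u044c", "iso-8859-1"))  # b'a&#1100;'
```

Option 1 preserves every character and lets an editor detect the encoding from the BOM; option 2 stays in the page charset but leaves markup-like text in the output.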

Christian :Biesinger (don't email me, ping me on IRC)

Comment 2

•

22 years ago

-> Internationalization, not sure if that is right.

also cc'ing bz, since it could be file handling

Component: Browser-General → Internationalization

Christian :Biesinger (don't email me, ping me on IRC)

Comment 3

•

22 years ago

grrr, forgot to reassign

Assignee: asa → smontagu

QA Contact: asa → ylong

Simon Montagu :smontagu

Assignee

Comment 4

•

22 years ago

I thought this would be a dupe, but I can't find another report.

We should:

(1) Offer the option to save as charset (defaulting to the charset that the page
is currently being viewed as).
(2) Warn the user when the page contains codepoints outside the chosen charset.
(3) Be clever enough to offer only charsets for which the page contains NO
codepoints outside the charset.

I believe that Composer already does at least (1) and (2).
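
Items (2) and (3) above could look roughly like the following Python sketch. The function names and the candidate list are assumptions made for illustration; they are not taken from Mozilla or Composer.

```python
def unencodable(text: str, charset: str) -> list[str]:
    """Item (2): characters a warning would report for the chosen charset."""
    missing = set()
    for ch in text:
        try:
            ch.encode(charset)
        except UnicodeEncodeError:
            missing.add(ch)
    return sorted(missing)


def usable_charsets(text: str, candidates: list[str]) -> list[str]:
    """Item (3): keep only the charsets that can encode the whole text."""
    return [name for name in candidates if not unencodable(text, name)]


page = "a\u00e9\u044c"
print(unencodable(page, "iso-8859-1"))  # ['ь']
print(usable_charsets("\u044c", ["iso-8859-1", "koi8-r", "cp1251", "utf-8"]))
# ['koi8-r', 'cp1251', 'utf-8']
```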

Status: UNCONFIRMED → NEW

Ever confirmed: true

Keywords: dataloss

Summary: Can't convert unicode charactres to 1-byte chars while "save page as ->text flies" (i.e. all chars like ;ь are turned to ? - question marks ) → Save As text loses characters outside the page encoding

Yuying Long

Updated

•

22 years ago

Keywords: intl

Brant Gurganus

Comment 5

•

22 years ago

By the definitions on <http://bugzilla.mozilla.org/bug_status.html#severity> and
<http://bugzilla.mozilla.org/enter_bug.cgi?format=guided>, crashing and dataloss
bugs are of critical or possibly higher severity.  Only changing open bugs to
minimize unnecessary spam.  Keywords to trigger this would be crash, topcrash,
topcrash+, zt4newcrash, dataloss.

Severity: normal → critical

Mike Kaply [:mkaply]

Comment 6

•

21 years ago

By definition, save as TXT converts the file into the codepage of the operating
system.

This WILL cause loss.

Leaving them as HTML entities is the wrong idea, since people complain when Save
As Text leaves any HTML at all in. It should convert the file to text.

Personally, I think this is a WONTFIX.

the only thing of interest here is an RFE to specify the charset during save as
text.

OS: other → All

Hardware: Other → All

Anton Ivanov

Comment 7

•

21 years ago

This bug affects not just characters "outside" of some character set, but also
character entities that should be rendered well in ASCII, for example #147;
should be rendered as "``", #148; as "''", #151; as "--", etc.  This in effect
makes it impossible to save a properly marked-up document as plain text, which is
annoying.

I am using Mozilla 1.5 on Linux.

Anton Ivanov

Comment 8

•

21 years ago

Sorry, forgot to mention that these character entities are rendered as '?' in
the text file.

Jungshik Shin

Comment 9

•

21 years ago

#147 in what encoding? U+0093 (&#147;) is not a graphic character but a C1
control character. Are you assuming Windows-1252? (You're not supposed to do
that. NCRs in any HTML/XHTML should refer to Unicode code points.) Anyway, 0x93 in
Windows-1252 is not covered by many character encodings. What's your locale? If
your locale codeset doesn't cover 0x93 in Windows-1252, it's natural that it's
turned to '?'. However, your request for transliteration is valid. So, in
addition to the three items in comment #4, we can transliterate (maybe optionally).
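
The distinction drawn here, and the transliteration idea, can be checked with a short Python sketch. The `TRANSLIT` table and function names are hypothetical; the table covers only the three characters mentioned in comment 7.

```python
import unicodedata

# As a Unicode code point, 147 (U+0093) is a C1 control character ...
print(unicodedata.category(chr(147)))  # Cc
# ... while the byte 0x93 in Windows-1252 is a left double quotation mark.
print(bytes([0x93]).decode("cp1252") == "\u201c")  # True

# Hypothetical ASCII fallbacks for the characters behind 0x93, 0x94, 0x97.
TRANSLIT = {"\u201c": "``", "\u201d": "''", "\u2014": "--"}


def transliterate(text: str) -> str:
    """Replace characters that have a known ASCII fallback."""
    return "".join(TRANSLIT.get(ch, ch) for ch in text)


def to_ascii(text: str) -> bytes:
    """Transliterate first; only what is still unencodable becomes '?'."""
    return transliterate(text).encode("ascii", errors="replace")


print(to_ascii("\u201chi\u201d \u2014 \u044c"))  # b"``hi'' -- ?"
```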

:aceman

Comment 10

•

20 years ago

(In reply to comment #6)
> the only thing of interest here is an RFE to specify the charset during save as
> text.

And that would be bug 171103.

Boris Zbarsky [:bzbarsky]

Updated

•

20 years ago

Depends on: 171103

Magnus Melin [:mkmelin]

Comment 11

•

16 years ago

(In reply to comment #6)
> By definition, save as TXT converts the file into the codepage of the operating
> system.
>
> This WILL cause loss.

Why not change that, use UTF-8 instead?

QA Contact: amyy → i18n

Mike Kaply [:mkaply]

Comment 12

•

16 years ago

(In reply to comment #11)
> Why not change that, use UTF-8 instead?

Then a user would not be able to view the file. TXT files (at least on Windows) don't have a codepage associated with them.

You need to save in the native codepage so someone could open up say Notepad and look at the file.

Magnus Melin [:mkmelin]

Comment 13

•

16 years ago

Huh? Even Notepad handles UTF-8 just fine; in "Save as" it has the option to save as UTF-8 itself. It uses the UTF-8 BOM, I believe.

Mike Kaply [:mkaply]

Comment 14

•

16 years ago

True. I guess we could look into this. This would be a pretty big change to existing behavior, though, and would need a lot of testing to see what it does on other platforms.

Simon Montagu :smontagu

Assignee

Comment 15

•

16 years ago

Generally speaking the UTF-8 BOM causes more problems than it solves, even on Windows.
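
The trade-off discussed in comments 13 to 15 can be illustrated in Python, where the standard `utf-8-sig` codec writes and strips the UTF-8 BOM. This is a sketch only; `save_utf8` is a hypothetical helper, not anything from the Mozilla code base.

```python
import codecs


def save_utf8(text: str, with_bom: bool = False) -> bytes:
    """Encode as UTF-8, optionally prefixed with the BOM (EF BB BF)."""
    return text.encode("utf-8-sig" if with_bom else "utf-8")


data = save_utf8("\u044c", with_bom=True)
print(data.startswith(codecs.BOM_UTF8))  # True

# A BOM-aware reader strips the signature; a plain UTF-8 reader keeps it
# as an invisible U+FEFF at the start of the text.
print(data.decode("utf-8-sig") == "\u044c")     # True
print(data.decode("utf-8") == "\ufeff\u044c")   # True
```

The last line shows one way a BOM can cause trouble: tools that decode as plain UTF-8 see an extra character at the start of the file.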

Dan Mosedale (:dmosedale, :dmose)

Comment 19

•

15 years ago

Updating the summary, because I suspect that saving as text is a more common use case for mail than it is for browsing.

Summary: Save As text loses characters outside the page encoding → Save As text loses characters outside the page or message encoding

Dinar

Comment 20

•

11 years ago

This still persists in Firefox 24.

(I found that I can use the Save As text option instead of saving as a complete webpage and then deleting the additional files. I need the complete page because it includes content loaded via AJAX, e.g. a feed of news from friends/subscriptions, or a person's "profile page" with older posts loaded, on modern sites like VKontakte, Facebook, Twitter. That is how I discovered this bug.)

Negritas Sergiu, Desktop QA

Comment 21

•

3 years ago

Hello! I will close this issue as RESOLVED-WORKSFORME since there weren't any new crashes in the last 6 months with this crash signature. If the issue is still present, please feel free to reopen it.

Thank you and have a nice day!

Status: NEW → RESOLVED

Closed: 3 years ago

Resolution: --- → WORKSFORME

Categories: Core :: Internationalization, defect
Reporter: evgen_k
Assigned: smontagu
Depends on: 1 open bug
Keywords: dataloss, intl
Security: public
