Closed Bug 271239 Opened 20 years ago Closed 3 years ago

"Save page, complete" doesn't save encoding information

Categories

(Firefox :: File Handling, defect)

Type: defect
Priority: Not set
Severity: major

Tracking

RESOLVED WORKSFORME

People

(Reporter: xanthian, Unassigned)

Details

(Keywords: helpwanted)

User-Agent:       Mozilla/5.0 (Windows; U; Win98; en-US; rv:1.8a5) Gecko/20041029
Build Identifier: Mozilla/5.0 (Windows; U; Win98; en-US; rv:1.8a5) Gecko/20041029

Doing a "File => Save page as" on the indicated URL, and choosing
"Web Page Complete" as the "Save as type" entry and using local
file name "xhtml1.html" (since that URL won't admit its file name),
saves a _rendered_ copy of the page text, with the HTML entities of
the original replaced by Unicode trigraphs. As a result, when the
local copy is displayed, things represented by entities, like the
copyright, trademark, and registered symbols, are shown as triples
of nonsense characters instead of the appropriate glyphs. Proper
behavior would be instead to save the document with the &#whatever;
entities in their original format.

Reproducible: Always
Steps to Reproduce:
1. Open indicated URL
2. Do a "File => Save page as..."
3. Choose Save as type: Web Page Complete
4. Open local copy in another tab
5. Do "View => Page Source" on original page and on local copy
6. See HTML entities in body source of original, see rendered
   unicode trigraphs in body source of saved copy

Actual Results:  
Entities for copyright, trademark, and registered symbols were
replaced by the ASCII characters of the Unicode trigraphs in the
source HTML of the saved local copy, resulting in munged special
symbols in the rendered version of the local copy.

Expected Results:  
The "File => Save page as..." functionality should have downloaded
and saved a clean, unrendered copy of the original HTML source, or
if one still existed, saved a locally cached unrendered version of
the HTML source.

A bugzilla search on "unicode entities" shows many bugs that may be
symptoms of the same base design failure as this one, but none seemed
to capture the "mis-saving" aspect of the problem.

I've marked this bug as "major", since the page saving feature is
broken in a way that makes saving a working local copy of a page
containing HTML entities impossible, but other thinking might
classify it as "critical", since data is lost/damaged during the
save process.

Comment: The bugzilla page on which this is entered asks the user
to select "component". This is a bad idea, since there is in
general no way a user not familiar with the code can make an
intelligent choice there. The breakdown needs to be changed to a
breakdown by functionality/widget accessed/menu item selected,
since the user certainly knows that the "Save page as"
functionality is what is broken.

Also, the use there of the acronym "DOM" and similarly opaque
namings is user hostile for the user who is not a Mozilla developer.
I don't know where or how to file this as a bug, or against what,
since it is a meta issue not about the browser, but I'm hoping some
developer will accept the initiative to do that bug filing for me,
based on this comment text.
Product: Browser → Seamonkey
Probably dupe of Bug 220782
The problem here is that the page is UTF-8 encoded and we don't save that
information with the page.  As a result, when it's loaded from disk it's parsed
as ISO-8859-1 or whatever the user has set as the default.
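
The effect described above can be reproduced outside the browser. The following is a minimal sketch, not Mozilla code: it assumes the fallback charset behaves like windows-1252 (Python's `cp1252`), which is how browsers commonly treat an ISO-8859-1 label, and the helper name `misdecode` is made up for illustration. Under that assumption the two-byte UTF-8 sequences for the copyright and registered signs come back as two characters each, and the three-byte sequence for the trademark sign comes back as three.

```python
# Characters the report mentions: copyright, registered, trademark.
SYMBOLS = ["\u00a9", "\u00ae", "\u2122"]


def misdecode(text: str, wrong_encoding: str = "cp1252") -> str:
    """Encode text as UTF-8, then decode those bytes with the wrong charset.

    Hypothetical helper: it mimics a file saved as UTF-8 being reopened
    with a different default encoding.
    """
    return text.encode("utf-8").decode(wrong_encoding)


if __name__ == "__main__":
    for ch in SYMBOLS:
        raw = ch.encode("utf-8")
        # Show code point, UTF-8 bytes in hex, and the mis-decoded result.
        print(f"U+{ord(ch):04X} -> {raw.hex()} -> {misdecode(ch)!r}")
```

The bytes on disk are the same in both cases; only the charset used to interpret them differs, which is why the saved copy needs to carry that information.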

We should be inserting the appropriate meta tag for HTML in the persistence
object and the XML serializer should be adjusting the XML decl appropriately for
XML.
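
As a rough illustration of the HTML half of that proposal, the sketch below inserts a charset declaration into the markup before writing it out. It is an assumption-laden stand-in, not the persistence object's actual logic: `add_charset_meta` and `save_page` are hypothetical names, the regex-based `<head>` detection is deliberately naive, and the XML declaration case is not covered.

```python
import re

# Naive patterns, good enough for a sketch; a real serializer works on the DOM.
_HEAD_OPEN = re.compile(r"<head\b[^>]*>", re.IGNORECASE)
_HAS_CHARSET = re.compile(r"<meta\b[^>]*charset\s*=", re.IGNORECASE)


def add_charset_meta(html: str, encoding: str = "UTF-8") -> str:
    """Return html with a charset <meta> inserted right after <head>.

    Hypothetical helper: the input is returned unchanged if it already
    declares a charset or has no <head> element.
    """
    if _HAS_CHARSET.search(html):
        return html
    match = _HEAD_OPEN.search(html)
    if match is None:
        return html
    meta = (
        '<meta http-equiv="Content-Type" '
        f'content="text/html; charset={encoding}">'
    )
    return html[: match.end()] + meta + html[match.end():]


def save_page(html: str, path: str, encoding: str = "UTF-8") -> None:
    """Write the page in the same encoding that the <meta> announces."""
    with open(path, "w", encoding=encoding, newline="") as f:
        f.write(add_charset_meta(html, encoding))
```

The point of the design is that the declared charset and the bytes written always agree, so the file no longer depends on the user's default encoding when reopened.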

And no, this has nothing to do with bug 220782.
Assignee: download-manager → file-handling
Status: UNCONFIRMED → NEW
Component: Download Manager → File Handling
Ever confirmed: true
OS: Windows 98 → All
Product: Seamonkey → Core
QA Contact: ian
Hardware: PC → All
Keywords: helpwanted
Summary: "File=>Save page as" saves rendered page, not original, breaks HTML entities → "Save page, complete" doesn't save encoding information
Received in email, not correctly reflected here:

> bzbarsky@mit.edu changed:

>            What    |Removed                     |Added
> ----------------------------------------------------------------------------
>             Summary|"File=>Save page as" saves  |"Save page, complete"
>                    |rendered page, not original,|doesn't save encoding
>                    |breaks HTML entities        |information

Unfortunately, that merely changes the bug summary
to a summary of something you'd _like_ to fix, and
ignores completely the true problem, which is that
you are saving from a rendered copy rather than the
original. You can do countless fixes of the symptoms
of the real bug, or you can fix the original
mis-design by simply and directly doing the save
from original HTML code.

    I'm not your mommy, so it isn't my duty to teach
    you good sense, and your increasing
    obnoxiousness in email leaves me reluctant to
    cooperate with you at all, but as a programmer
    since 1961, I know where I'd expend _my_ efforts
    on a fix, and it's not on attempts to retrofit
    information you should never have discarded in
    the first place.

[It's worth commenting that this blunder of trying
to handle stuff needing original HTML sources from
derived sources instead is by no means unique to
Mozilla. My web site at anycities.com has the same
problem; when I am handed a copy of my web page HTML
to edit, it has the HTML entities for the less-than
and greater-than signs already rendered, rather than
giving me back the entities I put into the page
source. This means I have to retrofit each instance,
each time I update a web page, which stinks. If I
forget to put the angle brackets back to entities,
the contents between them are treated at the next
rendering as nonsense tags and ignored, rather than
as material which is supposed to be displayed
between angle brackets in the rendered HTML, as
intended. The fix there is the same as the fix here:
the end user wants the original code, not the munged
rendered version of the code with some retrofit
attempts applied.]

The point, which you seem intent on ignoring, is that modifying the source
involved parsing it, which inherently discards information if the HTML parser in
Mozilla is used for the job.  Writing an entire separate HTML parser for this
operation is simply not warranted.
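
The information loss mentioned here is easy to see with any HTML parser that resolves character references. The sketch below uses Python's standard `html.parser` purely as a stand-in (it is not Mozilla's parser, and `TextCollector` and `parsed_text` are illustrative names): four different source spellings of the copyright sign all produce the same parsed text, so a serializer working from the parsed result cannot tell which spelling the author used.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect text nodes, with character references already resolved."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def parsed_text(markup: str) -> str:
    """Return the concatenated text content the parser reports for markup."""
    parser = TextCollector()
    parser.feed(markup)
    parser.close()
    return "".join(parser.chunks)


# Named entity, decimal reference, hex reference, and the literal character.
VARIANTS = ["<p>&copy;</p>", "<p>&#169;</p>", "<p>&#xA9;</p>", "<p>\u00a9</p>"]

if __name__ == "__main__":
    for markup in VARIANTS:
        print(f"{markup!r:>20} -> {parsed_text(markup)!r}")
```
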
(in reply to comment #4)

> The point, which you seem intent on ignoring, is
> that modifying the source involved parsing it,
> which inherently discards information if the HTML
> parser in Mozilla is used for the job.

Yep, now you've identified the design blunder: using
renderer output for URL localization input where the
original HTML should have been used, instead.

> Writing an entire separate HTML parser for this
> operation is simply not warranted.

Now you've identified, and rejected, the needed fix.

And you are _way_ naive if you think the total work
for all the bandages you are going to have to apply
with your approach is "simpler". The needed HTML
parser and URL localizer can fairly easily be
written in sed(), it is no big deal [and no, I'm not
volunteering to do it]. Retrofitting lost information
as you intend is an artificial intelligence task,
and is both a huge effort and one doomed always to
retain failures on the fringes, since there is no
possible way without the original sources to know
whether you are "retrofitting" something that
happens to look like rendered entities, but is in
fact text from the original document intended to
look that way in the rendered form too.

You invest your life in whichever "solution"
supports your self-esteem, but only the solution
that goes back to original HTML sources is ever
going to support Mozilla's correct functioning.

"Save page, complete" isn't intended to be saving the original file. It's
intended to save the page, as it's displayed, so that the user can reopen it and
see the same thing.

The problem described in comment 0 is that the file doesn't render the same way
anymore when reopened. This is because we don't save encoding information. It
has nothing to do with the original markup.

I assure you, serialising a DOM is not a problem that requires AI. (Parsing
HTML, on the other hand, is not far from being impossible to do, and is
certainly not as simple as "just writing a sed script".)
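
To illustrate why serialising does not need the original markup to round-trip correctly, here is one possible approach, offered as a sketch rather than as what Mozilla does: write the serialized markup as pure ASCII and let the encoder emit numeric character references for everything else. `to_ascii_markup` is a hypothetical name; the `xmlcharrefreplace` error handler is standard Python. The input is assumed to be already-serialized markup, so `<` and `&` are left untouched.

```python
import html


def to_ascii_markup(markup: str) -> bytes:
    """Encode markup as ASCII, writing non-ASCII characters as &#N; references."""
    return markup.encode("ascii", errors="xmlcharrefreplace")


if __name__ == "__main__":
    original = "<p>\u00a9 2004 Example\u2122</p>"
    saved = to_ascii_markup(original)
    print(saved)
    # The saved bytes mean the same thing under any ASCII-compatible charset.
    for charset in ("utf-8", "cp1252", "iso-8859-1"):
        assert html.unescape(saved.decode(charset)) == original
```

The output does not reproduce the author's original entity spellings, but it does render the same way regardless of the encoding the reader assumes, which is the goal stated above.
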
Grrr. Putting my login ID in the comment body, so the brain-dead
bugzilla search mechanism finds this bug when I ask for all bugs
containing "xanthian", which just happens to be part of the
"reporter" field, which bugzilla _should_ scan. Please ignore,
sorry for the inconvenience, but bugzilla's plentiful search
failures are a big time waster for bug reporters.

xanthian.
andrew, do you see this?
bz, I'm trying to reconcile your comments with Hixie's.  Seems like two choices, enh or wontfix.  What do you think?
Assignee: file-handling → nobody
QA Contact: ian → file-handling
Quoting Comment 6:  "Save page, complete" isn't intended to be saving the original file. It's
intended to save the page, as it's displayed, so that the user can reopen it and
see the same thing.

I suspect that if you polled the users, at least half of them would indicate that they always thought it was to save the original file(s).  I know that's how I intuitively thought it worked, until I found out otherwise.

And certainly, even though it may be intended to save the rendered document, I think there's a valid use case for saving the raw information.  It may even be the more common use case.
Product: Core → Firefox
Version: Trunk → unspecified

Since this is a very old report I'm going to close it for now.

If the issue still occurs, please feel free to re-open the report.

Thank you!

Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → WORKSFORME