Closed
Bug 271239
Opened 20 years ago
Closed 3 years ago
"Save page, complete" doesn't save encoding information
Categories
(Firefox :: File Handling, defect)
RESOLVED
WORKSFORME
People
(Reporter: xanthian, Unassigned)
Details
(Keywords: helpwanted)
User-Agent: Mozilla/5.0 (Windows; U; Win98; en-US; rv:1.8a5) Gecko/20041029
Build Identifier: Mozilla/5.0 (Windows; U; Win98; en-US; rv:1.8a5) Gecko/20041029

Doing a "File => Save page as" on the indicated URL, and choosing "Web Page Complete" as the "Save as type" entry and using local file name "xhtml1.html" (since that URL won't admit its file name), saves a _rendered_ copy of the page text, with the HTML entities of the original replaced by Unicode trigraphs. As a result, when the local copy is displayed, things represented by entities, like the copyright, trademark, and registered symbols, are shown as triples of nonsense characters instead of the appropriate glyphs. Proper behavior would be instead to save the document with the #whatever; entities in their original format.

Reproducible: Always

Steps to Reproduce:
1. Open indicated URL
2. Do a "File => Save page as..."
3. Choose Save as type: Web Page Complete
4. Open local copy in another tab
5. Do "View => Page Source" on original page and on local copy
6. See HTML entities in body source of original, see rendered unicode trigraphs in body source of saved copy

Actual Results:
Entities for copyright, trademark, and registered symbols were replaced by the ASCII characters of the Unicode trigraphs in the source HTML of the saved local copy, resulting in munged special symbols in the rendered version of the local copy.

Expected Results:
The "File => Save page as..." functionality should have downloaded and saved a clean, unrendered copy of the original HTML source, or if one still existed, saved a locally cached unrendered version of the HTML source.

A bugzilla search on "unicode entities" shows many bugs that may be symptoms of the same base design failure as this one, but none seemed to capture the "mis-saving" aspect of the problem.
I've marked this bug as "major", since the page saving feature is broken in a way that makes saving a working local copy of a page containing HTML entities impossible, but other thinking might classify it as "critical", since data is lost/damaged during the save process.

Comment: The bugzilla page on which this is entered asks the user to select "component". This is a bad idea, since there is in general no way a user not familiar with the code can make an intelligent choice there. The breakdown needs to be changed to a breakdown by functionality/widget accessed/menu item selected, since the user certainly knows that the "Save page as" functionality is what is broken. Also, the use there of the acronym "DOM" and similarly opaque namings is user hostile for the user who is not a Mozilla developer. I don't know where or how to file this as a bug, or against what, since it is a meta issue not about the browser, but I'm hoping some developer will accept the initiative to do that bug filing for me, based on this comment text.
Updated•20 years ago
Product: Browser → Seamonkey
Comment 1•20 years ago
Probably dupe of Bug 220782
Comment 2•20 years ago
The problem here is that the page is UTF-8 encoded and we don't save that information with the page. As a result, when it's loaded from disk it's parsed as ISO-8859-1 or whatever the user has set as the default. We should be inserting the appropriate meta tag for HTML in the persistence object and the XML serializer should be adjusting the XML decl appropriately for XML. And no, this has nothing to do with bug 220782.
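The failure described in this comment can be reproduced outside the browser. The sketch below is illustrative only: it uses Python's built-in codecs rather than any Mozilla code, and it assumes the fallback encoding is windows-1252, which is how browsers commonly treat the ISO-8859-1 label.

```python
# Characters the reporter mentions: copyright, trademark, registered.
symbols = "\u00a9 \u2122 \u00ae"

# What ends up on disk when the page is saved as UTF-8.
utf8_bytes = symbols.encode("utf-8")

# What the reader sees when those bytes are reloaded with no encoding
# declaration and parsed with the assumed single-byte fallback.
misread = utf8_bytes.decode("cp1252")

# What the reader sees when the encoding is known.
correct = utf8_bytes.decode("utf-8")

print(misread)   # each symbol becomes two or three unrelated characters
print(correct)   # the original symbols
```

The trademark sign occupies three bytes in UTF-8, which is consistent with the "triples of nonsense characters" in the original report.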
Assignee: download-manager → file-handling
Status: UNCONFIRMED → NEW
Component: Download Manager → File Handling
Ever confirmed: true
OS: Windows 98 → All
Product: Seamonkey → Core
QA Contact: ian
Hardware: PC → All
Updated•20 years ago
Keywords: helpwanted
Updated•20 years ago
Summary: "File=>Save page as" saves rendered page, not original, breaks HTML entities → "Save page, complete" doesn't save encoding information
Comment 3•20 years ago (Reporter)
Received in email, not correctly reflected here:

> bzbarsky@mit.edu changed:
>            What    |Removed                     |Added
> ----------------------------------------------------------------------------
>             Summary|"File=>Save page as" saves  |"Save page, complete"
>                    |rendered page, not original,|doesn't save encoding
>                    |breaks HTML entities        |information

Unfortunately, that merely changes the bug summary to a summary of something you'd _like_ to fix, and ignores completely the true problem, which is that you are saving from a rendered copy rather than the original. You can do countless fixes of the symptoms of the real bug, or you can fix the original mis-design by simply and directly doing the save from original HTML code.

I'm not your mommy, so it isn't my duty to teach you good sense, and your increasing obnoxiousness in email leaves me reluctant to cooperate with you at all, but as a programmer since 1961, I know where I'd expend _my_ efforts on a fix, and it's not on attempts to retrofit information you should never have discarded in the first place.

[It's worth commenting that this blunder of trying to handle stuff needing original HTML sources from derived sources instead is by no means unique to Mozilla. My web site at anycities.com has the same problem; when I am handed a copy of my web page HTML to edit, it has the HTML entities for the less-than and greater-than signs already rendered, rather than giving me back the entities I put into the page source. This means I have to retrofit each instance, each time I update a web page, which stinks. If I forget to put the angle brackets back to entities, the contents between them are treated at the next rendering as nonsense tags and ignored, rather than as material which is supposed to be displayed between angle brackets in the rendered HTML, as intended. The fix there is the same as the fix here: the end user wants the original code, not the munged rendered version of the code with some retrofit attempts applied.]
Comment 4•20 years ago
The point, which you seem intent on ignoring, is that modifying the source involved parsing it, which inherently discards information if the HTML parser in Mozilla is used for the job. Writing an entire separate HTML parser for this operation is simply not warranted.
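The information loss mentioned here can be shown with a small, hedged sketch. It uses Python's standard `html.unescape` as a stand-in for a parser's character-reference handling (it is not Mozilla's HTML parser): several different source spellings decode to the same character, so the original spelling cannot be recovered from the parsed result.

```python
import html

# Four ways an author might write the copyright sign in HTML source.
spellings = ["&copy;", "&#169;", "&#xA9;", "\u00a9"]

# After character references are resolved, the spelling is gone.
decoded = [html.unescape(s) for s in spellings]

# True when every spelling collapsed to the same single character.
all_same = len(set(decoded)) == 1
print(all_same)
```

Once the text is in this decoded form, a serializer has to pick one spelling for output, which is why the saved file can differ from the original source even when nothing is "wrong" with it.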
Comment 5•20 years ago (Reporter)
(in reply to comment #4)
> The point, which you seem intent on ignoring, is
> that modifying the source involved parsing it,
> which inherently discards information if the HTML
> parser in Mozilla is used for the job.

Yep, now you've identified the design blunder: using renderer output for URL localization input where the original HTML should have been used, instead.

> Writing an entire separate HTML parser for this
> operation is simply not warranted.

Now you've identified, and rejected, the needed fix. And you are _way_ naive if you think the total work for all the bandages you are going to have to apply with your approach is "simpler". The needed HTML parser and URL localizer can fairly easily be written in sed(), it is no big deal [and no, I'm not volunteering to do it].

Retrofitting lost information as you intend is an artificial intelligence task, and is both a huge effort and one doomed always to retain failures on the fringes, since there is no possible way without the original sources to know whether you are "retrofitting" something that happens to look like rendered entities, but is in fact text from the original document intended to look that way in the rendered form too.

You invest your life in whichever "solution" supports your self-esteem, but only the solution that goes back to original HTML sources is ever going to support Mozilla's correct functioning.
Comment 6•20 years ago
"Save page, complete" isn't intended to be saving the original file. It's intended to save the page, as it's displayed, so that the user can reopen it and see the same thing. The problem described in comment 0 is that the file doesn't render the same way anymore when reopened. This is because we don't save encoding information. It has nothing to do with the original markup. I assure you, serialising a DOM is not a problem that requires AI. (Parsing HTML, on the other hand, is not far from being impossible to do, and is certainly not as simple as "just writing a sed script".)
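As a rough illustration of "saving encoding information", the sketch below writes a charset declaration into the serialized markup and reads it back on load. The function names `serialize_with_charset` and `load_saved_page` are hypothetical, the regex-based handling is a simplification for demonstration, and the windows-1252 fallback is an assumption; none of this reflects Mozilla's actual persistence code.

```python
import re


def serialize_with_charset(markup: str, encoding: str = "utf-8") -> bytes:
    """Hypothetical helper: record the encoding inside the saved file."""
    meta = f'<meta http-equiv="Content-Type" content="text/html; charset={encoding}">'
    # Insert the declaration right after the first opening <head> tag.
    tagged, count = re.subn(
        r"(<head(?:\s[^>]*)?>)",
        lambda m: m.group(1) + meta,
        markup,
        count=1,
        flags=re.IGNORECASE,
    )
    if count == 0:
        # No <head> found: fall back to putting the declaration first.
        tagged = meta + markup
    return tagged.encode(encoding)


def load_saved_page(data: bytes, fallback: str = "cp1252") -> str:
    """Hypothetical helper: decode using the declared charset if present."""
    match = re.search(rb"""charset=["']?([A-Za-z0-9_-]+)""", data[:1024])
    encoding = match.group(1).decode("ascii") if match else fallback
    return data.decode(encoding)


page = "<html><head><title>t</title></head><body>\u00a9 \u2122</body></html>"

with_declaration = load_saved_page(serialize_with_charset(page))
without_declaration = load_saved_page(page.encode("utf-8"))
```

With the declaration in place the symbols survive the round trip; without it the loader falls back to the single-byte encoding and the symbols come back garbled, independent of how the original source spelled them.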
Comment 7•20 years ago (Reporter)
Grrr. Putting my login ID in the comment body, so the brain-dead bugzilla search mechanism finds this bug when I ask for all bugs containing "xanthian", which just happens to be part of the "reporter" field, which bugzilla _should_ scan. Please ignore, sorry for the inconvenience, but bugzilla's plentiful search failures are a big time waster for bug reporters. xanthian.
Comment 8•17 years ago
andrew, do you see this?
Comment 9•16 years ago
bz, I'm trying to reconcile your comments with Hixie's. Seems like two choices, enh or wontfix. What do you think?
Assignee: file-handling → nobody
QA Contact: ian → file-handling
Comment 10•13 years ago
Quoting Comment 6:
> "Save page, complete" isn't intended to be saving the original file. It's intended to save the page, as it's displayed, so that the user can reopen it and see the same thing.

I suspect that if you polled the users, at least half of them would indicate that they always thought it was to save the original file(s). I know that's how I intuitively thought it worked, until I found out otherwise. And certainly, even though it may be intended to save the rendered document, I think there's a valid use case for saving the raw information. It may even be the more common use case.
Updated•8 years ago
Product: Core → Firefox
Version: Trunk → unspecified
Comment 11•3 years ago
Since this is a very old report I'm going to close it for now.
If the issue still occurs, please feel free to re-open the report.
Thank you!
Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → WORKSFORME