Open Bug 115328 (opened 20 years ago, updated 4 years ago)

Save As Web Page Complete saves both scripts and output from scripts, resulting in duplicated content

Categories: Core :: DOM: Navigation (defect)

Severity: critical

Target Milestone: Future

People: (Reporter: adamlock, Unassigned)

Keywords: dataloss, topembed-; Whiteboard: se-radar

Attachments: (1 file)

In mfcEmbed, load a page containing some JS that adds extra elements (e.g. banner
adverts). Call nsIWebBrowserPersist::SaveDocument on webBrowser and it saves the
modified DOM, not the original one.

This means that the saved copy contains the elements added by the JS, and when the
saved copy is loaded again, the JS runs once more, adding more elements and
effectively doubling up everything.

If possible, webBrowser should pull the original HTML from the cache for the
current page, parse that into its own DOM and save from that.
Sample website:
http://msn.espn.go.com/main.html


The menus at the top and bottom of the page, as well as the Flash content, are
added by document.write() statements.
*** Bug 118792 has been marked as a duplicate of this bug. ***
Target Milestone: --- → Future
Expanding this bug to cover the general issue of how to get a fresh DOM from the
cached data for the currently loaded URI. I think it could be pretty tricky,
especially when frames and such are taken into account.
OS: Windows 2000 → All
Hardware: PC → All
Summary: webbrowser saves the modified DOM, not the original → Dirty DOM being fed to webbrowserpersist - need to parse a fresh one from cache
This makes the "save web page, complete" feature completely useless to me. No
chance of getting to this sooner? Also, it's in all builds, not just embedding
ones.
There is not much I can do about this problem until there is some way of
obtaining the original DOM (and sub-DOMs for framesets) for a given URI. To me
that means either a copy of the original should be kept for such purposes, or it
should be possible to reconstruct one from the cached data.
The original DOM is not accessible after it's loaded; we don't keep an extra
copy of the DOM in memory, for obvious reasons.
Bringing over dataloss keyword from dupe. Blah.
Keywords: dataloss
*** Bug 137784 has been marked as a duplicate of this bug. ***
Can't the solution used for bug 40867 be used here to get back the original
content?
Of course, you would still need to reparse the DOM from the copy in the cache.
Keywords: topembed
given the complexity of the true fix here (after discussions with Adam), we're
topembed minusing this.
Keywords: topembed → topembed-
*** Bug 148614 has been marked as a duplicate of this bug. ***
*** Bug 154902 has been marked as a duplicate of this bug. ***
another sample: doing a save "web page, complete" on the following page with
frames also exhibits this bug:

http://developer.apple.com/techpubs/macosx/Essentials/AquaHIGuidelines/index.html
Whiteboard: se-radar
Would Mozilla use the current DOM or a new DOM (from the cached html) when
determining which images to save?
Requested downgrade of stopper status on proprietary embed client - request
denied. Adding kw:topembed to open this up for review. Please see:

http://bugscape.netscape.com/show_bug.cgi?id=12989


for more detailed information. 
Keywords: topembed- → topembed
topembed+ per EDT
Keywords: topembed → topembed+
*** Bug 179490 has been marked as a duplicate of this bug. ***
*** Bug 182546 has been marked as a duplicate of this bug. ***
By the definitions on <http://bugzilla.mozilla.org/bug_status.html#severity> and
<http://bugzilla.mozilla.org/enter_bug.cgi?format=guided>, crashing and dataloss
bugs are of critical or possibly higher severity.  Only changing open bugs to
minimize unnecessary spam.  Keywords to trigger this would be crash, topcrash,
topcrash+, zt4newcrash, dataloss.
Severity: normal → critical
Depends on: 191023
5/5 EDT triage: minusing topembed+ status.  Dropping this from the radar to
better focus on existing working set.
Keywords: topembed+ → topembed-
*** Bug 218416 has been marked as a duplicate of this bug. ***
*** Bug 274745 has been marked as a duplicate of this bug. ***
*** Bug 283622 has been marked as a duplicate of this bug. ***
*** Bug 299752 has been marked as a duplicate of this bug. ***
Hrm, I don't want it to save the original HTML; I want it to save the current
DOM exactly as I see it on the screen. But in that case I probably also want to
remove all scripts from the page.
If you remove all scripts, then the page may not work properly after being saved...
Well, in most cases I don't want it to "work" at all: most of the time when I
save a page it is a receipt or other record of what I am currently viewing: I
don't want it to change and I don't want it to run scripts with the security
privileges of a file: URL anyway (that's a different bug, I know).

I can understand there might be other situations in which saving the original
HTML with all its scripts is a good idea, but I don't want this behavior to be
left undiscussed and treated as if there was only one true way.
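The "receipt snapshot" behavior suggested above could be sketched as a post-processing step (an assumption for illustration, not anything Firefox actually does): strip the `<script>` elements from the serialized page so the generated content survives exactly once and can never be regenerated. A real implementation would prune script nodes in the DOM before serializing rather than use a regex:

```javascript
// Naive sketch (assumption, not Firefox behavior): remove <script>
// elements from serialized HTML so the saved page keeps the generated
// content exactly once and never re-runs the script that produced it.
function stripScripts(html) {
  return html.replace(/<script\b[^>]*>[\s\S]*?<\/script>/gi, '');
}

// Rendered DOM: the script plus the <div> it wrote via document.write().
const rendered =
  "<p>receipt total: $42</p>" +
  "<script>document.write('<div>ad</div>')</script><div>ad</div>";

console.log(stripScripts(rendered));
// "<p>receipt total: $42</p><div>ad</div>"
```

This preserves "what the user sees" while avoiding both the duplication and the file:-URL script-privilege concern, at the cost of breaking script-driven interactions in the saved copy.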
(In reply to comment #28)
> Well, in most cases I don't want it to "work" at all: most of the time when I
> save a page it is a receipt or other record of what I am currently viewing: I
> don't want it to change and I don't want it to run scripts with the security
> privileges of a file: URL anyway (that's a different bug, I know).
> 
> I can understand there might be other situations in which saving the original
> HTML with all its scripts is a good idea, but I don't want this behavior to be
> left undiscussed and treated as if there was only one true way.

In my opinion, the discussion of whether Firefox's JavaScript implementation
is secure or not is another issue. It has nothing to do with saving the webpage.
Internet Explorer uses its own model for saving - that is ****. Opera does
almost what I want (save the webpage exactly as it comes from the server), but
Opera does not change the links inside <noscript> tags. I think that "HTTrack
Website Copier" is a good example of how to save a webpage. It leaves the files
as they are, replaces all the links in a webpage with relative paths, and
saves the referenced files (for example images, CSS, scripts, ...). Since
HTTrack is open source, maybe this is a possible way for Firefox too.
Summary: Dirty DOM being fed to webbrowserpersist - need to parse a fresh one from cache → Save As Web Page Complete saves both scripts and output from scripts, resulting in duplicated content
I encountered this problem today at Eyewonder.  I'll attach a testcase in a sec.

We should save the files exactly as they come from the server. Being able to
save generated content, instead of the files as they are on the server, is bug 120457.

--> Reassigning
Brian, saving as "web page, complete" modifies the content by definition. There
is no feasible way to save the original, non-generated content when doing a "web
page, complete" save. Please read up on the code before making any more
comments like that, ok?
Attached file testcase
Being feasible and being possible are two different things. It's not reasonable
to have both the generated content and the original JS saved in the same file
(plus it makes debugging pages a lot harder). I know that making additional
requests to the server for the HTML file is dangerous, especially when it's a CGI
request, but I don't see a reason why we can't store the original HTML file in
memory until the frame is destroyed, and then just re-request any images, JS,
etc. that go with it. Even with 10 pages open, we're only talking about an
additional 400K of memory on average.
Boris: I understand link URLs will be modified, but besides that, can't the DOM
be left intact?

See also:
Bug 60426 - Allow users to choose between generated and source html in view-source
Bug 120457 - "Save As" should optionally allow to save generated content of a page
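The in-memory-copy idea from the earlier comment ("store the original HTML file in memory until the frame is destroyed") could be sketched roughly as follows. Every name here is hypothetical; this is not actual Gecko code, just the bookkeeping the proposal implies:

```javascript
// Hypothetical sketch of the proposal: cache the original response body
// per document URI, prefer it at save time, and drop it with the frame.
const originalSource = new Map(); // uri -> original HTML as received

function onDocumentLoaded(uri, html) {
  originalSource.set(uri, html); // ~40K average per page, per the comment
}

function onFrameDestroyed(uri) {
  originalSource.delete(uri); // release the copy along with the frame
}

function saveOriginal(uri, liveDomSerialization) {
  // Prefer the pristine source; fall back to the live DOM if it's gone.
  return originalSource.get(uri) ?? liveDomSerialization;
}

onDocumentLoaded('http://example.test/', '<p>content</p><script></script>');
console.log(saveOriginal('http://example.test/', '<p>dirty DOM</p>'));
// "<p>content</p><script></script>"

onFrameDestroyed('http://example.test/');
console.log(saveOriginal('http://example.test/', '<p>dirty DOM</p>'));
// "<p>dirty DOM</p>"
```

The hard parts the sketch glosses over are exactly the ones raised in the bug: framesets need one cached copy per sub-document, and the saved copy's subresource links still have to be rewritten to relative paths.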
> Boris: I understand link URLs will be modified, but besides that, can't the DOM
> be left intact?

The "web page, complete" mode is meant to be a "save what the user sees" mode
and that's what it does.  Doing that means saving the DOM.  That's not likely to
change unless the UI folks decide this option should have a completely different
behavior in general.
Thanks for clarifying. So the conundrum is whether we want to save the scripts
or remove them. Removing them would cause problems for user-initiated events,
such as expanding divs, whereas leaving them can cause duplicated content. For
testing problems with user-initiated events, leaving the scripts in is useful, but
it could also be a security issue (comment #28) and can cause duplicated content
when the DOM is written to (the issue in this bug). An additional option for saving
the original page (with links modified) is an RFE, bug 271571 (which I reported in
2004 and forgot about), and isn't relevant to this bug.

"Workaround": I found a quick way (based on Boris's description of our
behavior) to get rid of the duplicated content in simple cases. If you look at
the testcase, then delete the generated content with DOM Inspector (in this
case, the DIV), then "Save Complete", you'll get a version without the data
repeated. This will be very hard for really complicated generated content,
but will work well for pages where content is generated in only one place.

Boris: Is bug 120457 already covered by our current behavior and therefore INVALID?

Boris: I think what we might be able to do (some day down the road) for all bugs
regarding generated content, like this one and bug 60426, is keep a list of page
changes, sort of like an undo list, so that we can allow people to choose which
generated content they want to remove from the page. This could be pref-disabled
by default so as not to bloat memory usage. Do you think this would be possible in
the back-end, and beyond that, would it also be possible for a third-party
extension to perform this function?
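The "undo list" idea above could be sketched like this (speculative; every name is hypothetical): log each script-generated insertion as it happens, then let the user exclude chosen entries when the page is serialized for saving:

```javascript
// Speculative sketch of the "undo list" proposal (not real Gecko code):
// record each script-generated insertion so the user can pick which
// generated chunks to drop from the serialized page at save time.
const changeLog = [];

function recordInsertion(id, markup) {
  changeLog.push({ id, markup }); // one entry per generated DOM chunk
}

function serializeWithout(serializedDom, excludedIds) {
  let out = serializedDom;
  for (const change of changeLog) {
    if (excludedIds.includes(change.id)) {
      out = out.replace(change.markup, ''); // drop that generated chunk
    }
  }
  return out;
}

recordInsertion('banner', '<div>banner</div>');
const dom = '<p>content</p><div>banner</div>';
console.log(serializeWithout(dom, ['banner'])); // "<p>content</p>"
console.log(serializeWithout(dom, []));         // unchanged
```

As the comment notes, the log could sit behind a pref so that users who never save pages pay no memory cost, and an extension could in principle maintain such a log itself via DOM mutation notifications.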
*** Bug 305437 has been marked as a duplicate of this bug. ***
*** Bug 364711 has been marked as a duplicate of this bug. ***
Assignee: adamlock → nobody
QA Contact: adamlock → docshell
Duplicate of this bug: 395875
Duplicate of this bug: 499909
Duplicate of this bug: 506469
hello. document.write()'s content should not be saved a second time, duplicated
alongside the regular HTML outside of <script></script>: the saved page should be
as the original, and most people (around 98%) open the saved page with JavaScript
turned on. If the page developer wants the document.write() content to be visible
after the page is saved and opened in a browser without script support, let him
duplicate the content in <noscript></noscript> or
<noscript><iframe></iframe></noscript>, which as far as I know Firefox saves even
when JavaScript is turned on.
Correction to "as far as I know Firefox saves them even when JavaScript is turned
on": no, it is not so; it does not save iframe content inside <noscript> when
JavaScript is on.