Closed Bug 40867 Opened 24 years ago Closed 22 years ago

Need means to reuse/reload current page without refetching from server

Categories

(Core :: DOM: Navigation, defect, P3)

Tracking

VERIFIED FIXED
mozilla1.0

People

(Reporter: jmd, Assigned: rpotts)

References

(Blocks 2 open bugs)

Details

(5 keywords, Whiteboard: [Hixie-P1] partial fix is checked into 0.9.2 branch)

Attachments

(9 files, 2 obsolete files)

When viewing any page, we need to keep a copy of the source around, instead of
opening it again, in the case of the file protocol, or downloading it again, in
the case of the http and ftp protocols.

The reason this must be done is simple. Save As, in other applications, means:
save what I am currently viewing to a file that I will specify. In most cases,
this is what Mozilla does with its Save As. However, because we are re-fetching
the page, there is a chance that it has changed or been removed completely.
This goes completely against the 'save what I am currently viewing to a file'
meaning most users expect Save As to have.

One of the worst consequences of this is for web developers. Say you are
designing a page, and have a copy open in Mozilla. You make an unwanted change
to the source file, and are unable to undo it. Or maybe you even accidentally
deleted the source file. No worries, you have a copy cached in Mozilla, right?
Wrong.

Also note that trying to save it, after the file has been deleted, will crash
Mozilla if the file was accessed via ftp or file. I've opened bug 40792 on this
problem. If the file was accessed via http and was deleted, Mozilla will
quietly save the 404 page. This is bad. Very, very bad.

Another problem is pages with dynamic content. You want to save the page you are
viewing. Maybe it was loaded 2 hours ago, but that's the page you want saved,
not whatever is there now.

Yet another problem is large pages for users on dialup connections. The Lynx
browser makes you redownload a page when you want to view its source. I found
this infuriating when I was a dialup'er, trying to speed up my surfing by using
Lynx. It made no sense to have to redownload what I seemingly was staring right
at.

For the composer, a bug was filed a month ago (Bug 37023), which is related to
this. davidr8 reported that composer was forcing you to save before you could
view source. beppe commented that it was more efficient to read from a file, and
the bug was set to VERIFIED WONTFIX. This is exactly the same problem. Do not
read the source from a file. Say I make a page in composer, save it, then edit
that saved file in another editor. If I switch back to composer, I'm looking at
the old version of the file. If I view source, I want, and expect, to see the
old source. What if I needed to reference it? It's seemingly there, hell, I'm
staring right at it!

Saving the pre-rendered page in memory is a huge bandwidth savings if it was
accessed over http/ftp, and is just plain the right thing to do.
Great bug!  I think it's the same as bug 6119, though.
That is only regarding view source. I proposed that the system be used for Save
As, and view source. I'll put a note on that one.
Somehow, I would think the cache should take care of this, given a "Once Per
Session" recheck pref (in Advanced|Cache in the prefs).

Adding "perf" keyword.
Keywords: perf
Looks similar to bug 6119. Reassigning to Bill Law.
Neeti
Assignee: gordon → law
Component: Networking: Cache → XML
I suggest that bug 6119, bug 17889, bug 37023, and bug 40792 (and probably a few 
other bugs as well) all be made dependent on this bug; and that this bug be upped 
to major severity, since it involves loss of data.

Otherwise those other bugs are all going to run around doing various 
inappropriate things with the cache, when commands operating on the current file 
(save, print, view source, etc) shouldn't be dependent on the cache at all (e.g. 
if the sum total of files in the current Web page is larger than the disk cache 
size).

This bug really isn't much to do with XML. Mozilla should hold the source of
*all* files currently being used, whether they be XML, PNG, CSS, or whatever.
Changing component from XML to XPApps (as I think was intended) and nominating 
for nsbeta3 since I think this is a major problem - we need to use the cache for 
these things (and get things out of the cache even when expired in certain 
circumstances).
Component: XML → XP Apps
Keywords: nsbeta3
Nav triage team: Already works the way 4.x works.
Whiteboard: [nsbeta3-]
Removing nsbeta3- to trigger re-evaluation.

PDT: This causes silent data loss!

If you want to save the page currently being shown, but the page in question is 
no longer on the server, or is dynamically created, then Mozilla will silently 
lose the page, saving the 404 file not found page, or the different version in 
the case of dynamic content, without telling the user. 

This is also a problem with View Source, one that content developers find 
incredibly annoying with dynamic content. Say there is a bug on a dynamically
generated page. The user right clicks on the page to view the source, but
when the source comes up it is for a totally different page, since the server
has generated a whole new document.

There are many other examples of when this is critical (e.g. dialup connection
is dropped before user saves page, view source on a POST operation such as after
a money transaction over the net, saving NetCenter content that changes 
regularly, and so on and so forth).

IE does this correctly, so marking 4xp. Moving up to 'major' severity, since
this is a major loss of function. Is there a keyword for this kind of bug?
Marking 'correctness' unless someone can think of a more appropriate keyword...
Severity: enhancement → major
Keywords: 4xp, correctness
Whiteboard: [nsbeta3-]
[To reverse the old maxim: it's not a feature, it's a bug! Resummarizing.]
Hardware: PC → All
Summary: [RFE] Mozilla Needs to Hold Source of Current Page in Memory → Hold source of current page in memory
nav triage team: nsbeta3-, while we believe this to be important, we are 
concerned that this would be too difficult to do safely, this late in the game.
Whiteboard: [nsbeta3-]
Sanity check: this bug is nsbeta3-, but blocks bug 17889 which is nsbeta3+.
Removing nsbeta3- again for re-evaluation. We have a lot more information on
this bug now. Bugs 17889 and 6119 are closely related, and likely have the same
fix. 17889 is nsbeta3+, and I just nominated 6119 for + and dogfood. Unless
somebody can come up with a super-workaround for both bugs, this bug should
really be fixed.

Ian Hickson has excellent comments in bug 6119 describing the problem, and I
copied comments from a dup of 6119 which may help towards a fix.
Whiteboard: [nsbeta3-]
adding cc
See also bug 39957...

Okay, so I'm thinking about this now. I'm not sure what we currently do, but I
would imagine that we keep a DOM tree for our current document if it's HTML or
XML, and otherwise, we just display the document source. It would probably not
be a Good Thing to keep both a DOM and the source of a large document around at
once, but it might be the compromise we would have to make for beta3.

In the long run, what might be possible is to have our object model code for
XML/HTML store enough information to be reserialized exactly as it was in the
source, such as whitespace, capitalization, comments, entity substitutions, etc.
A lot of thought is being given to this stuff in the W3C's work on the infoset
(http://www.w3.org/TR/xml-infoset) and this sort of stuff has been mulled over
plenty in SGML's "groves" -- the issue of what data constitutes the "information
represented by the document" and what is also in the document source but is
considered extraneous... Anyway, the XML and HTML parsers could store this extra
information in the object model (and hopefully by this time we'll have a unified
object model behind all our DOMs), and thus avoid any duplication of
information.

That would be the intelligent solution to the view-source, save-as, and
send-page bugs (see bug 6119). The printing and changing charset bugs don't need
the exact source, so this isn't a direct solution to those, but the DOM does
need to get held in memory for them...

Now that I think about it, why should the printing and changing charset bugs
even exist, if the DOM is kept in memory already (which it must, right?) Does
the document actually get reloaded and reparsed before doing those things? That
would be really messed up.

I'm thinking that it might be two different underlying bugs: (1) we need to hold
the source for the view-source/save-as type bug, and (2) we need to make print,
change charset, etc. use the current DOM we have in memory.

Does this sound like a reasonable analysis?
nav triage team:
There are other ways to fix the specific user problems (like reading from the
cache) that are much less of a hit than this suggested fix.
nsbeta3-
M Future to consider later.
Whiteboard: [nsbeta3-]
Target Milestone: --- → Future
Johng, that is incorrect. You can't read from the cache for a page which Mozilla 
has been told not to cache, because it's not in the cache.

For example -- you want to save, or print, the order confirmation page for 
something you've bought over the Web, but the server (correctly) has told Mozilla 
not to cache that page for privacy reasons. So Mozilla goes to the cache, finds 
that the page isn't there, and then *reloads the page* ... so (unless the shop 
site is quite cleverly designed) you end up inadvertently doubling your order. 
Need I say, that is not desirable?
I'm not 100% positive it works that way.  I think most sites are set up so that 
an order triggers a redirect to the receipt page.  So a reload just reloads that 
receipt page.  Going *back* triggers a "re-post form data?" confirmation dialog.

But I'm out of my area of expertise, here.

Regardless, the bottom line is that there is no way for us (speaking for the 
"front end") to  tell Necko to do anything more than fetch it from the cache (if 
possible).  See 
http://lxr.mozilla.org/seamonkey/source/netwerk/base/public/nsIChannel.idl#198

There is no "reload from the secret stash that's not the cache."

I think, if the request is legitimate (which it very well might be), that it 
would have to be implemented, for the most part, in the http channel 
implementation, with a new load flag that we could then use to fetch the data in 
the various places where we want to reuse it (view source, save as, print, 
etc.).

So, I'm resetting the component.  At the least, Gagan will be able to comment on 
the subject with more authority than I possibly could.

Re-summarizing, too.
Assignee: law → neeti
Component: XP Apps → Networking: Cache
Summary: Hold source of current page in memory → Need means to reload current page without refetching from server
Not actually sure if this is a networking bug. I'm wondering what piece of code
actually hangs on to the document while it's being rendered, while the DOM is
being modified by scripts, etc.

Who owns the current document?... Please have a look at the attachment...

De-nominating for beta3, removing milestone, but marking blocker, since it
blocks two closely-related [nsbeta3+][dogfood+] bugs. This needs to be on
somebody's radar, I just don't know whose.
Severity: major → blocker
Keywords: nsbeta3
Summary: Need means to reload current page without refetching from server → Need means to reuse/reload current page without refetching from server
Whiteboard: [nsbeta3-]
Target Milestone: Future → ---
forgive me if i'm stating the obvious with any of this. even worse, i'm not quite
sure what DOM means, used in the context here. i'm taking it to mean the html
equivalent of java bytecode (optimized half-rendered page), so if i guessed wrong
there, i guess you should stop reading.

we need three source caches. the traditional disk and memory caches, plus one to
hold the source of the current page, regardless of the settings of disk/memory
caches, and whether or not the server told us to cache the current page.

separately, there is the DOM cache (perhaps storing DOM for the last N pages
would be a neat cpu/memory trade, but let's not stray).

print/redraw and maybe character switching (17889) would ask for the DOM cache.
view source/send page/save would ask for the source cache, which would fetch
from the always-available current-page cache.

the other alternative, the reserialization dr explained, seems like a pain, and
error prone... maybe worth investigating in the future, i don't know.
jeremy: dom is the w3c standard object model interface for representing a
document.

> print/redraw and maybe character switching (17889) would ask for the DOM
> cache. view source/send page/save would ask for the source cache, which would
> fetch from the always-available current-page cache.

that's one way of looking at it, but you have the right idea: use the object
model where you don't need the source, to avoid reparsing; use the source where
you do need it.

> the other alternative, the reserialization dr explained, seems like a pain,
> and error prone... maybe worth investigating in the future, i dont know.

to your credit, it is definitely a pain if not well-thought-out beforehand.
there is talk, though, of redoing our dom code sometime in the future (because
we currently have separate code for html, xul, and other doms). this would be a
place to include the reserialization information for xml/html. but, like i said,
that would be for the future.

for the present, we just need to keep the source of the current document in
memory. perhaps if we're worried about bloat, we could choose only to keep it in
memory when it's the result of a form submission... something like that.
Bug 17889 (to do with changing encoding), nsbeta3+, means you need to retain 
source at the *byte* level ...
No longer blocks: 6119
For pages that don't give special no-cache settings, we behave the same as 4.x.

But if the page had headers that disabled caching, then we will hit this
bug, in 4.x too. Would be nice to fix this. Since we don't have time,
we will fix this in a later release.

The general idea of the fix is to keep all pages in the cache, including those
marked no-cache, but not to serve them from the cache on a normal load if they
weren't supposed to be cached. Then add another reload policy that says: get it
from the cache, overriding these hints.
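
A minimal sketch of that idea, in Python for brevity rather than in Necko's
own C++. Every name here (LoadPolicy, REUSE_CURRENT, PageCache) is made up for
illustration and is not an existing Necko interface or load flag. It assumes the
cache stores every response together with the server's no-cache hint, and that
only an explicit policy may override that hint.

```python
from dataclasses import dataclass
from enum import Enum, auto
from typing import Dict, Optional


class LoadPolicy(Enum):
    NORMAL = auto()         # honour the server's no-cache hint
    REUSE_CURRENT = auto()  # hypothetical policy for Save As, View Source, Print


@dataclass
class Entry:
    body: bytes
    no_cache: bool  # True if the server asked that the response not be cached


class PageCache:
    def __init__(self) -> None:
        self._entries: Dict[str, Entry] = {}

    def store(self, url: str, body: bytes, no_cache: bool) -> None:
        # Keep every response, but remember the server's hint alongside it.
        self._entries[url] = Entry(body, no_cache)

    def lookup(self, url: str, policy: LoadPolicy) -> Optional[bytes]:
        entry = self._entries.get(url)
        if entry is None:
            return None
        if entry.no_cache and policy is LoadPolicy.NORMAL:
            # Behave as a cache miss; an ordinary load goes back to the server.
            return None
        return entry.body
```

With this shape, an ordinary navigation to a no-cache receipt page still goes
to the server, while Save As can ask with the overriding policy and get the
bytes that were already shown. It does not cover the case raised elsewhere in
this bug where a newer response has replaced the entry for the same URL.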

Removing 4xp
Keywords: 4xp
Target Milestone: --- → Future
No longer blocks: 39957
We don't behave the same as 4.x here -- bug 20843 is what is the same as 4.x,
that is, post results aren't cached, but in 4.x you at least were able to view
the source or save the current page without refetching. that has nothing to do
with the behavior of the cache.

Bill Law summarized the bug best by saying "There is no 'reload from the secret
stash that's not the cache'."

This bug probably shouldn't even be in the Cache component. Also, adding
dataloss (because it should have been there all along).
Keywords: dataloss
Is this the same as bug 50949?  The bugs that depend on each seem pretty 
similar, and that one's fixed :)
As of the 09-27-00 build (give or take a day or two), the Advanced Debug option
for "Enable Memory Cache" is gone, and documents do not seem to be getting
cached in memory.  Is this related to all this discussion of cache/no-cache
here (bug 40867)?

As I understand it, there are supposed to be two caches, disk and memory.  If a
document's HTTP header (set by the HTTP server) says "Don't cache", the document
should not be stored in the disk cache.  But shouldn't it still be stored in the
memory cache, for re-use by Save As, Print, etc.?
atovar: there (are/should be) three caches, disk, mem, and current. current
caches objects regardless of whether disk/mem are turned off, or if http sets
nocache.

I say are/should be, because i'm not sure of the status. if are is true, this
bug is half closed.
Per law@netscape.com, this bug now also tracks the usage for View Source, i.e.
"Make View source never load from the server". Thus nominating for mozilla0.9.
See recent comments in bug 6119 for rationales.
Keywords: mozilla0.9
Blocks: 56346
When I type the url, "about:cache", I never see anything listed under memory
cache, even when I have Debug / Enable Memory Cache set.

Does this accurately show that the memory cache is never used?

I was also thinking: Why is the memory cache disabled by default?  Is this
because we are relying on the operating system to provide sufficient in-memory
caching for objects stored to the disk cache?
Blocks: 55583
This bug is a *major* pain for Zope developers, since on server exceptions we 
embed a traceback in an HTML comment in the error page, so that it can be 
inspected with View Source.  Couldn't the original text simply be attached to 
the root node of the DOM as a read-only attribute?
Blocks: 6119
This seems to be an architectural problem, and a cascading one that's causing
many other bugs.  (With perhaps more bugs yet to be identified.)

My understanding (correct me if I'm wrong) is that all these various functions
(View Source, Save As, etc.) all ask for the page from the Necko library, which
tries to fetch it from the normal memory or disk cache if possible, loading it
from the network (or generating an error?) if it's not in the cache.  I assume
that the cache management subsystem is part of Necko, not external to it.  Is
this an accurate summary of how it works?  Have I made any errors?

Users and web developers alike have the same (reasonable) expectation -- that
what they see in the browser window is what they'll get if they view the source,
save it, etc.  While that's often true, clearly it sometimes isn't.  It can
cause data loss and confusion, impede debugging of web applications, etc.

I think the root cause of this problem is that these functions are asking Necko
for the information, when Necko's function is to fetch data from URLs of all
sorts.  The very concept of "loading" the data to be viewed or saved is really
nonsensical, because it's obviously ALREADY been loaded once!  Forcing that
"load" to come from the cache may improve the situation, but it doesn't solve
the problem at all.  It only masks the more obvious symptoms of the problem.

I hate to suggest architectural changes, because they can often be problematic:
any code that contains implicit assumptions based on the old architecture has
to be evaluated and fixed.  However, I believe the only solution that will
ever fix this problem for good will probably need to be architectural in nature.

Leaving Necko out of the picture for the moment, there should be an underlying
relationship between the data structures that reflects the way we want to use
them.  For functions like "View Source" and "Save As", we want to make sure that
we ALWAYS refer to the same source that is already being displayed in the window
where the functions were triggered.  This is a user requirement, and should be
reflected in the data structures for maximum consistency and correctness.

The simplest way to achieve this would be for the data structures associated
with the window in question to contain or directly point to the exact source
they are using.  I assume they must already contain the DOM generated from that
source, since that would be necessary to reflow the page if the window is
resized.  The simple solution would be to associate the original source with the
DOM, and access THAT via the window's data structures for functions like "View
Source" and "Save As", leaving Necko out of the picture entirely.

This would necessarily be more correct and more reliable than asking Necko to
fetch it from the cache.  Older windows with outdated content would work as
expected, despite the existence of updated content in the cache.  POST results
from form submission would work as expected.  Clearing the cache wouldn't
matter.  Things would work much more as the user expects them to, and this would
be a relatively small architectural change for a big improvement in correctness.
 The obvious downside of this simple solution is that it would use more memory;
many pages would be saved that would duplicate information already in the cache.

The better solution is probably a larger architectural change, separating the
cache manager from the Necko library.  After all, Necko depends on the caching
functionality, and it would be best for these original source copies to be
integrated cleanly with the existing cache -- but there's no good reason for
Necko to be in the middle of functions operating on previously-loaded data.  It
would be cleaner to make the cache manager independent of Necko, while keeping
Necko dependent on the cache manager.  These functions on previously-loaded data
could then depend on the independent cache manager without depending on Necko.
This architectural change would make it impossible for these functions to
unexpectedly reload the page from the network.

An independent cache manager would need to be more flexible than Necko needs.
(I don't know how flexible the current cache manager actually is; maybe it's
already more capable than I imagine.)  While Necko needs size-bounded in-memory
and on-disk LRU caches keyed on URL information, functions such as "View Source"
and "Save As" (AND Back/Forward history functions) need a reference-counted
and/or garbage-collected memory management system, bounded only by available
memory.  When the LRU cache Necko needs happens to point to identical data, the
cache manager could share the memory space.  However, if Necko needs to replace
its cache entry when content has been updated, or delete an entry, it must NOT
change or discard any data still being referenced from, say, a browser window,
if functions on previously-loaded data are to perform as expected.
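
To make the sharing idea concrete, here is a rough sketch in Python. The class
and method names (SourceStore, pin, release) are hypothetical and do not
correspond to the existing cache code. It assumes a URL-keyed LRU index for the
network layer and separate handles held by windows or history entries, both
pointing at the same immutable byte buffers.

```python
from collections import OrderedDict
from typing import Dict, Optional


class SourceStore:
    """Hypothetical store: an LRU index for Necko plus pins held by windows."""

    def __init__(self, lru_capacity: int) -> None:
        self._lru: "OrderedDict[str, bytes]" = OrderedDict()
        self._capacity = lru_capacity
        self._pins: Dict[int, bytes] = {}
        self._next_handle = 0

    def put(self, url: str, body: bytes) -> None:
        # Insert or replace the LRU entry, then evict the oldest entries.
        self._lru[url] = body
        self._lru.move_to_end(url)
        while len(self._lru) > self._capacity:
            self._lru.popitem(last=False)

    def get(self, url: str) -> Optional[bytes]:
        body = self._lru.get(url)
        if body is not None:
            self._lru.move_to_end(url)
        return body

    def pin(self, url: str) -> int:
        # The pin refers to the same bytes object as the LRU entry (no copy).
        handle = self._next_handle
        self._next_handle += 1
        self._pins[handle] = self._lru[url]
        return handle

    def pinned_source(self, handle: int) -> bytes:
        return self._pins[handle]

    def release(self, handle: int) -> None:
        # Called when the window or history entry goes away.
        del self._pins[handle]
```

Replacing or evicting the LRU entry only drops the index's reference; a window
that pinned the old content keeps seeing exactly the bytes it loaded until it
calls release. In C++ the same effect would need explicit reference counting.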

I've dodged a sticky issue.  What should happen if the DOM is changed by
Javascript code between the time the page is loaded and when the "View Source"
or "Save As" function is called?  There are good arguments for modifying the
source to match the DOM changes, but this is clearly MUCH more difficult than
tracking a copy of the original source.  Even if this can be done, sometimes the
user is likely to want the original source despite the DOM changes, so it should
only be optional.  (I just filed bug 63892 about this, since it's really an
enhancement anyhow.)
No longer blocks: 6119
Keywords: nsbeta1
Blocks: 6119
cc to self
neeti, are you going to fix this, or should it be reassigned to some non-Necko
person?  It shouldn't be Futured without helpwanted for long.

/be
This is something I'd be happy to volunteer for, if nobody else is ready to work
on it.  However, I can't guarantee how much time I'll have available to work on
it, or how long it would take me.  I'm not familiar with the Necko code or the
current cache architecture, so I'd have to study it.  I also don't know all the
places it would need to be linked in.  (Some pointers would be helpful.)

My point is, I'm willing to work on this, but can't promise I'll get it done in
any reasonable timeframe, if at all.  If nobody else would be working on it
anyway, then there's nothing to lose.  If someone with more time available can
dedicate it to this, I can look for other things to work on.

Should I wait and see if anyone picks this up?  Should I start working towards
it just in case?  Should I not have said anything until I had a patch in hand?
Blocks: 57724
No longer blocks: 57724
Wouldn't it be nice if it were possible to generate the original source
accurately (including formatting) from the information in the dom?  If that were
true, then none of this would be an issue, we could just use the dom to
regenerate the source without having to keep an extra copy around.  Not throwing
away whitespace, and finding a way to represent whitespace in attributes, would
be a nice memory efficient solution to this problem, and would have the side
benefit that people would be much happier with composer since it wouldn't always
be reformatting their documents.
That seems like a possible solution, but it would require a considerable number
of extensions to the DOM (some HTML-only, some perhaps XML-only):

 * remembering the original case of normalized attribute and element names and
attribute values
 * remembering the whitespace in a tag:
     + after the element name
     + around the = separating attribute name from value
     + after each attribute
 * remembering the type of quotation marks around each attribute value
 * remembering the entire contents of any markup declarations (e.g., DOCTYPE)
 * remembering the delimiters of included marked sections
 * remembering the contents and delimiters of ignored marked sections
 * remembering which characters were originally entity references
 * remembering how every entity reference was terminated
 * properly remembering comment delimiters, since the W3C DOM Core doesn't fully
describe SGML comment structure
 * remembering whatever badly formed HTML (and perhaps XML, too, for view-source
and save) is thrown at us
 * ...what did I forget?
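
Several items in the list above can be seen with a small experiment. The sketch
below uses Python's standard html.parser module, not Mozilla's parser, and a
deliberately naive serializer (NaiveReserializer and round_trip are names
invented here). It only illustrates which details a normalizing parse drops.

```python
from html import escape
from html.parser import HTMLParser


class NaiveReserializer(HTMLParser):
    """Rebuilds markup from parse events only, without the original text."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.out = []

    def handle_starttag(self, tag, attrs):
        # The parser has already lowercased names and stripped the quotes.
        parts = [tag]
        for name, value in attrs:
            parts.append('{}="{}"'.format(name, escape(value or "")))
        self.out.append("<" + " ".join(parts) + ">")

    def handle_endtag(self, tag):
        self.out.append("</" + tag + ">")

    def handle_data(self, data):
        # Character references were decoded; we can only guess how to re-escape.
        self.out.append(escape(data, quote=False))


def round_trip(source: str) -> str:
    parser = NaiveReserializer()
    parser.feed(source)
    parser.close()
    return "".join(parser.out)


original = "<A  HREF = 'x.html' >Fish &#38; chips</A>"
regenerated = round_trip(original)
```

Comparing original with regenerated shows the element and attribute names
lowercased, the whitespace inside the tag collapsed, the quote style changed,
and the numeric character reference replaced by a named one. Each of those
would need its own bookkeeping to survive a round trip.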
This is an HTTP issue, not an HTML issue. Since right now things don't work 
right for *any* document type, it is premature to be discussing a complex 
optimization for one particular document type.

And if you ask me, reconstructing the *exact* source from the DOM in a way that 
works 100% of the time, and is more efficient than simply keeping a copy of the 
original source, sounds completely hopeless.
It's surely not more efficient in terms of CPU time to regenerate the original
source from the DOM, even if it's extended to represent insignificant (to the
DOM) aspects such as whitespace.  More importantly, it's a significant increase
in complexity that begs for subtle bugs where you may reconstruct the exact
source 99.9% of the time, but never 100% of the time.  This would really work
more cleanly and reliably at a lower level.

The best solution is probably to save the content, byte for byte.  Then it's
straightforward to "recreate" the original source, because it's not a complex
process of reconstruction from the DOM.  Also, saving a byte-for-byte copy
allows for changing the character set (bug 17889) or potentially reinterpreting
the content as a different MIME type entirely.  (Suppose the MIME type specified
by the server is wrong?  How about a function to reinterpret the data under a
user-specified MIME type?)
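
As a small illustration of why the bytes matter (a Python sketch; the
LoadedDocument class is hypothetical), keeping the raw response lets the same
data be decoded again under a different character set without touching the
network:

```python
class LoadedDocument:
    """Holds the response body exactly as received, plus the charset in use."""

    def __init__(self, raw: bytes, charset: str) -> None:
        self.raw = raw
        self.charset = charset

    def text(self) -> str:
        # Decoding happens on demand; the stored bytes are never modified.
        return self.raw.decode(self.charset, errors="replace")

    def with_charset(self, charset: str) -> "LoadedDocument":
        # Reinterpret the same bytes; no refetch and no copy of the data.
        return LoadedDocument(self.raw, charset)


# A UTF-8 page that was first assumed to be ISO-8859-1.
wrong = LoadedDocument("caf\u00e9".encode("utf-8"), "iso-8859-1")
right = wrong.with_charset("utf-8")
```

Here wrong.text() is mojibake and right.text() is the intended word, both
produced from one stored buffer. If only the decoded text had been kept, the
wrong guess could not always be undone.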

Also, if memory efficiency is a concern, and CPU time is considered a worthwhile
tradeoff, simply compressing the source in-memory with zlib is a viable option. 
The DOM could also be compressed if it's not actively in use (i.e. history)...
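
A sketch of the compression trade-off, using Python's standard zlib bindings
(the CompressedSource wrapper is a made-up name, and the sample page is
artificially repetitive, so real savings will vary):

```python
import zlib


class CompressedSource:
    """Keeps a page's original bytes in deflated form until they are needed."""

    def __init__(self, raw: bytes) -> None:
        self.original_size = len(raw)
        self._packed = zlib.compress(raw)

    @property
    def stored_size(self) -> int:
        return len(self._packed)

    def raw(self) -> bytes:
        # Pay the CPU cost only when View Source or Save As asks for the data.
        return zlib.decompress(self._packed)


page = b"<tr><td>row</td></tr>\n" * 500
held = CompressedSource(page)
```

held.raw() returns the page byte for byte, while held.stored_size is much
smaller than held.original_size for markup like this.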
Regarding documents being reformatted by composer, a more straightforward
solution (especially if the byte-for-byte source is to be saved for this bug)
would be to associate parsed elements in the DOM with byte ranges in the saved
source.  This is probably much simpler than annotating the DOM to retain case
and whitespace information.  Wouldn't that be as effective?  (Modified elements
could replace byte ranges in place -- fancier versions could try to guess some
of the formatting conventions by heuristic tests on the surrounding source...)
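
A rough sketch of the byte-range idea, again with Python's html.parser standing
in for the real parser. RangeRecorder is a hypothetical name, and decoding as
ISO-8859-1 is just a trick so that one character equals one byte and the
parser's positions can be used as byte offsets.

```python
from html.parser import HTMLParser
from typing import List, Tuple


class RangeRecorder(HTMLParser):
    """Records (tag, start, end) byte offsets of each start tag in the source."""

    def __init__(self, raw: bytes) -> None:
        super().__init__(convert_charrefs=False)
        self.raw = raw
        self.ranges: List[Tuple[str, int, int]] = []
        text = raw.decode("iso-8859-1")  # one character per byte
        # Offsets at which each line begins, to convert (line, column) pairs.
        self._line_starts = [0]
        for index, char in enumerate(text):
            if char == "\n":
                self._line_starts.append(index + 1)
        self.feed(text)
        self.close()

    def handle_starttag(self, tag, attrs):
        line, column = self.getpos()  # position of the tag's opening "<"
        start = self._line_starts[line - 1] + column
        end = start + len(self.get_starttag_text())
        self.ranges.append((tag, start, end))


raw = b"<HTML>\n<BODY  BGCOLOR = 'white'>\n<P>hi</P></BODY></HTML>"
recorder = RangeRecorder(raw)
tag, start, end = recorder.ranges[1]
# Replace only the edited element; every other byte keeps its formatting.
patched = raw[:start] + b'<BODY BGCOLOR="black">' + raw[end:]
```

Only the BODY start tag is rewritten in patched; the uppercase HTML tag, the
line breaks, and everything after the edit are carried over untouched.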
IMO, viewing the source from the cache should be perfectly fine. If you need to 
debug, you should be able to control your server to ensure that the page can be 
cached or there should be a control in the browser to ignore "no-cache" HTTP 
headers. Please don't implement some drastic, complicated solution such as 
regenerating the source from the DOM or whatever. K.I.S.S.
I agree that reconstructing the source from the DOM is a drastic, complicated
solution.  However, I strongly disagree that simply reading from the cache is an
acceptable solution.  Not only do you have a problem with content that asks not
to be cached, but also with older content that has been replaced in the cache
with newer content (but may still be displayed in a window), form post results,
missing data from manually clearing the cache, etc.  The cache is inappropriate
for the purpose, when the true intent of the functions (in the user's mind) is
to operate on the EXACT data that they see in the window (or saw before).  This
is why I'm suggesting saving a byte-for-byte copy of the content, irrespective
of whether that content should normally be in the cache.  (Ideally, data thus
saved would share memory with duplicate cache data for efficiency reasons...)
To reiterate:

1. You can't fetch the source from the cache, since it may not be there, or may 
have been replaced with a newer copy.  I want the source for what I'm looking 
at, not what I'd get if I reloaded!
2. You can't reconstruct the source from the DOM, since the DOM may have 
changed, and it's terribly fragile to expect an accurate reconstruction even if 
it didn't change.

A copy of the source needs to be associated with the 
window/frame/layer/whatever.  It can be compressed.  It may be able to be 
shared with the cache, but that's tricky.  For now, why not just keep the 
compressed copy as an attribute of the appropriate DOM node, or history object, 
or whatever, and release it when the object it's attached to is released?
Speaking as someone who's already had to cancel a duplicate order caused by
trying to save an order confirmation, I'd say pulling the data from cache is not
sufficient.
You all have valid points. However, I think viewing the source is something that 
is only useful to developers, who will have the savvy to do what is necessary to 
look at the source (i.e. turn off the no-cache header in HTTP server or tell 
browser to ignore no-cache, don't reload page in another window, etc). 99% of 
ordinary users will never bother to view the source.

Saving the current page is a different activity. Ideally, the page save function 
will offer the ability to save in a variety of formats, including HTML. However, 
I think that the data to save can just be generated from the DOM object and 
there is no requirement that the HTML (if the HTML format is chosen) be anything 
like the original source code. I agree that "Save" should absolutely never cause 
another request to be sent.

In summary, I don't believe that "saving a page" is the same thing as 
viewing/saving the contents of the HTTP reply message that generated the page.
> Not only do you have a problem with content that asks not
> to be cached, but also with older content that has been replaced in the cache
> with newer content (but may still be displayed in a window), form post results,
> missing data from manually clearing the cache, etc.

A more flexible cache architecture could handle that (have a way of marking a
mem-cached page as "window still visible, so don't overwrite until window goes
away, but data not to be saved to disk cache" or some such).

> 2. You can't reconstruct the source from the DOM, since the DOM may have
> changed, and it's terribly fragile to expect an accurate reconstruction even  if
> it didn't change.

If the DOM has changed, then so has the display in the window; don't you want to
see the source for what you're really looking at, rather than what you loaded
some time ago?

> to associate parsed elements in the DOM with byte ranges in the saved source.

That would be very useful for composer.  The parser would have to do the
association, though; trying to reconstruct it later would be error prone and
probably CPU intensive.
cc'ing gordon, who is working on the new cache.
Nice summary, Evan.  Personally, I agree with everything you said in it.

Ken, bug 6119 was specific to viewing the source; this bug applies as much to
saving a page as to viewing the source.  You seem to be suggesting that saving a
page generated from the DOM would be sufficient.  As a user, I would NEVER want
the browser doing this.  Even ignoring the readability of the code, if the
browser is just saving what it *thinks* the source represents, it invites errors
to creep into saved pages that weren't in the original source.  A bug in the
HTML parser could cause a buggy page to be saved.  (What if it didn't render
properly and the user is saving the page in order to view it in another
browser?)  Unsupported HTML tags (or attributes) would be discarded.  Also,
Mozilla itself could fail to view the saved page as before, if there's a bug in
the generation code.

While regenerating the HTML from the DOM is possible, it's complex and risky,
and offers little advantage over the straightforward solution of saving the
original source in some fashion...
Akkana: See my comments (for this bug) of 2000-12-28 14:39 for some of my
earlier thoughts about how an independent cache manager could meet the needs of
Necko's LRU cache and this sort of caching of historical content simultaneously.

As for DOM changes, I don't think it's cut-and-dried.  Probably you'll want to
see the original source sometimes and the source as modified by DOM changes at
other times, depending on the circumstances.  As a user, I certainly wouldn't
want to lose the ability to save a byte-for-byte copy of the original source,
even if the modified DOM could be used to generate a more-current version.  Both
options would be nice, but I believe the original source is critical...

As far as the parser needing to associate the byte ranges within the DOM, that
was implied.  It would be senseless to try to figure that out afterwards; the
parser would definitely need to do that part of the job.

Neeti/Gordon: I guess I'm glad someone is going to be working on this, but I'm a
little disappointed at the same time.  I was hoping to take a stab at it; it
sounded like fun!  (But I wouldn't want it held up waiting on me, since I can't
guarantee I'd find the time -- I don't have the luxury of being paid to hack
Mozilla full-time...)

If a wild hare makes me play with it anyway, I'll let you know.  (But only if
there's something worth seeing!)
Actually, when I look at the source, 99% of the time I am interested in the 
source as it came from the server.  This is certainly the case when I try to 
save a page.  If the DOM has been modified by client-side scripts, I would want 
to be able to save the original source and scripts and not have extra HTML 
that's generated from the modifications to the DOM.

In an ideal world, we would have both options:
1) looking at the source of "what we see now".  This would need to be generated 
from the DOM.
2) looking at the original source.  This is most likely impossible to generate 
from the DOM because it does not necessarily correspond to the DOM (elements may 
have been added to the DOM through scripts).

It looks like the original purpose of this bug was to cover option #2.  Should 
we open a separate bug on option #1?
I wanted to disagree with KenW's comments.  I think that it is not a safe
assumption to make that a developer should be able to frob the cache directive
if they're debugging server-generated pages.  IMO, view-source should function
independently of any cache directives.

Furthermore, when I save a document, I want the original source in all cases
that I've encountered so far.  In my experience, I end up wanting to save a page
so that I can do a diff against an earlier saved page.  I think it's dangerous
to think of saving a page as a non-developer function.

Both viewing the source and saving the page should be lossless.

We have been working on some design changes to the cache that should make it 
easier to address these problems.  I hope to push a proposal out to mozilla.org 
in the next few days.

The decision to keep the source around for currently viewed pages doesn't belong 
to the cache, but some of the changes we want to make will make it easier for 
current pages to share data with the cache avoiding unnecessary duplication.
Adding self to CC.
*** Bug 67781 has been marked as a duplicate of this bug. ***
*** Bug 68677 has been marked as a duplicate of this bug. ***
I don't think 99% of the users never bother to view the source. Although HTML 
was not meant for such a purpose, there are many people out here that generate 
pages by hand, and thus want access to the source whenever requested. That 
includes Javascript that can modify the code - both unaltered and modified 
source are of interest in that case, but the unaltered one is the more 
important.

When saving a page to the disk, the original page needs to be saved. The changes 
in the DOM will operate when loading the page in a browser, executing whatever 
javascripts etc. there are in the page. If the settings of the browser have 
changed (disabling Javascript...), so will the displayed page. I consider this 
as a matter of consistency in the browser rendering.

A question I have is whether or not to keep images and embedded objects of the 
page in memory. Suppose I want to print a page: using the pictures from the 
DOM would be OK, wouldn't it? There is no use in refetching them.
cc'ing pavlov and saari for their view on images and printing.  Short answer: 
sometimes (often?) the images need to be redecoded at a different resolution for 
printing.
Yeah, pulling the unmodified data out of the disk cache and redecoding it when
we need to print would be ideal for images with greater than screen resolution dpi. 
jido@respublica.fr
don't think ... <people> never bother to <act>
is a double negative. While it is valid English, people tend to worry that the 
author is confused. the literal meaning is: believe <people> <act>.

Saving objects with the page is the subject of other bugs.

printing should use the current data.  WYSIWYG
Printing should produce a WYSIWYG replication of what you see on screen in the
highest possible resolution, and that will be bounded by the printing device or
resolution of the image, whichever is smaller.

We're not printing a 72 dpi image if we have a 300dpi copy sitting around.
so long as we know that the image @300dpi is related to the one @72 dpi and not 
something that could be entirely different.
Yes, both/all resolutions are derived from the same http get of the compressed data.
Blocks: 55055
Blocks: 68412
Gordon: do we have any progress on this bug? I'm currently working on a set of
webpage scripts that have form POSTs all over them and debugging them is almost
impossible due to this bug. It's driving me up the wall :(
I'm not sure if the problem I was encountering was related to this.
When I was using some sort of web account and making operations and submitting
forms with cookies, mozilla sometimes re-posted my login information when the page
directed me back to the main page (which should be cached), and the server thought
I was attempting multiple logins and refused my request.
Cache bugs to Gordon
Assignee: neeti → gordon
Is the new Cache Manager going to fix this?

It'd be a real handy thing to have, as a developer.  e-commerce and other apps 
where reloading will totally screw you could use it, too.
The new cache itself won't fix this, but will enable the docShell to hold onto 
references to cache entries for the current page ensuring they won't get fetched 
from the server again.  I'm not sure if docShell will take advantage of this 
however.
Whiteboard: [cache]
Resetting milestone from "Future" because at the time it was set it just meant
"post netscape6 release", and we have milestones up to mozilla 1.2 by now.
Keywords: mozilla0.9.1
Target Milestone: Future → ---
Blocks: 74349
Depends on: 72519
Target Milestone: --- → mozilla0.9.1
Blocks: 64100
Any follow-up on this one now that the new cache is backing in the tree?
Keywords: mozilla0.9
From what I've learned about the new cache, we now have a way to keep the 
source for currently viewed pages in the cache using the hard reference 
capability.  Just keep the hard reference around.  This capability is not 
used yet, however.
If I understand this correctly-- at this stage this is just a docshell issue.
->docshell owner then.
Assignee: gordon → adamlock
Component: Networking: Cache → Embedding: Docshell
QA Contact: tever → adamlock
I'm slightly worried about letting the cache handle source storage.  Consider 
this: user visits a dynamically generated page, and leaves the window open.  A 
little later, they visit the page again in a different window, getting 
different results.  Now the "same" URL is open in two different windows, with 
two different source texts.  Can the cache handle this correctly, or will both 
windows seem to have the source text that was loaded more recently?
The cache can handle this correctly *if* the docShell holds strong references 
(cache tokens) to the items in the cache that it used to render itself.  If 
docShell doesn't hold the strong references, viewing source on either window 
would use the latest version (possibly fetching a new third version from the 
net).
*** Bug 78740 has been marked as a duplicate of this bug. ***
Yes ... and it's much more efficient to hold it in the cache because then you 
have only one copy of the document ... why keep a 100K document around in 
your current page structure for a long period of time if it's already sitting 
in the cache?  Here's the sequence of events you're talking about:

1. User surfs to http://www.yahoo.com.
- http://www.yahoo.com comes down from the network.  We grab a copy from the 
cache.
- We grab a hard reference to that copy immediately.
- We keep that hard reference around.  This means the cache agrees to never get 
rid of it.

2. The page expires from the cache.
- Since the entry has a hard reference, it stays around in memory but gets 
Doomed.  Thus it cannot be searched.
- Note that this is also what will happen if the page is set not to be cached 
at all.

3. The user surfs to http://www.yahoo.com in another browser.
- Since the entry is doomed, the search through the cache does not find it.
- http://www.yahoo.com comes down from the network.  We grab this new copy from 
the cache.
- We grab a hard reference to that copy immediately.
- We keep that hard reference around.  This means the cache agrees to never get 
rid of it.
- Both copies of the page are in memory now; each hard reference refers to its 
own copy.

4. User closes the first browser.
- The browser lets go of the first hard reference.
- The cache realizes that the entry is doomed and so immediately dumps the 
document from memory.

5. User surfs somewhere else in the second browser.
- The browser lets go of the second hard reference.
- The cache entry is not removed yet (because it's not ready to expire) but 
when it expires it will be removed normally now instead of being Doomed.
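The five steps above can be sketched with a toy model. This is only an illustration of the hard-reference and doomed-entry behaviour as described in this comment; `ToyCache`, `open`, `expire`, `release` and the `fetchPage` helper are made-up names, not the real cache API.

```javascript
// Toy model of the lifecycle described above; NOT the real Necko cache API.
class ToyCache {
  constructor() {
    this.searchable = new Map(); // url -> entry (findable by a URL lookup)
  }

  // Return a hard reference to the entry for `url`, fetching it if needed.
  open(url, fetch) {
    let entry = this.searchable.get(url);
    if (!entry) {
      entry = { url, body: fetch(url), refs: 0, doomed: false, freed: false };
      this.searchable.set(url, entry);
    }
    entry.refs += 1;
    return entry;
  }

  // Expiry: a referenced entry is doomed (unsearchable but kept in memory);
  // an unreferenced entry is simply dropped.
  expire(url) {
    const entry = this.searchable.get(url);
    if (!entry) return;
    this.searchable.delete(url);
    if (entry.refs > 0) entry.doomed = true;
    else entry.freed = true;
  }

  // Drop a hard reference; a doomed entry is freed with its last reference.
  release(entry) {
    entry.refs -= 1;
    if (entry.refs === 0 && entry.doomed) entry.freed = true;
  }
}

let hits = 0;
const fetchPage = (url) => `version ${++hits} of ${url}`; // stand-in for the net

const cache = new ToyCache();
const url = "http://www.yahoo.com";
const first = cache.open(url, fetchPage);   // step 1: load and pin
cache.expire(url);                          // step 2: doomed, still in memory
const second = cache.open(url, fetchPage);  // step 3: new copy from the net
cache.release(first);                       // step 4: doomed entry is freed
cache.release(second);                      // step 5: stays until it expires
```

In this sketch each window keeps its own body for as long as it holds its reference, which is the property the two-window question was asking about.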
Exactly!
Gordon, can the cache token be passed around?  For example, view source loads in
a different docshell than the original document (it's in a different <browser/>
element...).  Is there a way to pass the cache token to the new docshell?  And
are such things scriptable?
Yeah, they could be passed around, but some recipients may not have enough 
context to be able to use them.  We would definitely want to enable their use by 
view-source.  The sooner we can work out the details of what's required the 
better.

I believe the http channel is scriptable (that's where you get and set cache 
tokens).  There may still be a bug open to implement the setting of cache tokens 
on http channels; Darin would know for sure.  We had been waiting to get an 
actual client for them first.  Take a look at nsICachingChannel.idl.

Cache tokens are just an opaque nsISupports that happen to contain an 
nsICacheEntryDescriptor, so a http channel knows how to use them to rehydrate 
itself.
this bug depends on a bug that is  marked future. I'm reflecting that in this
bug's milestone.
Target Milestone: mozilla0.9.1 → Future
Considering that this is a correctness/dataloss bug with 36 votes that has been
targeted to mozilla0.9.1, and that the reason given for futuring bug 72519 is
"there are no consumers for it", wouldn't it perhaps be better to ask whether
the futuring of bug 72519 could be re-evaluated than to just future this bug
which obviously a lot of people consider to be very important...

It's also not as if depending on a futured bug is an absolute impediment to this
being fixed; I notice that 3 of the bugs "blocked" by this bug, and one other
bug "blocked" by bug 72519, are already fixed anyway.
This no longer depends on bug 72519 anyway.  The function we need is 
GetCacheToken() and that is implemented now.  The bug covers both 
GetCacheToken() and SetCacheToken(); it is the latter that is not yet working, 
and that is not necessary for this bug.

Everything necessary to fix this bug is now in Mozilla.  There is no need to 
change the milestone unless the developers cannot do it in time.
clobbering milestone and killing dependency.
No longer depends on: 72519
Target Milestone: Future → ---
Question: should the browser keep references to the current page's images 
also?  It seems like it should.  Also if the current page contains frames we 
need to keep around references to the frameset as well as every single frame.  
Otherwise this fix is not accomplishing everything it should--namely a perfect 
representation of the current page for Save, View Source and Print.
Setting cache tokens probably *is* necessary for this bug to be fixed reliably.  
Simply holding a cache token won't guarantee that what you request from HTTP is 
the same thing that you're holding the cache token for.
OK, my apologies.  Then there's something I don't understand.  When I read bug 
72519 I naturally assumed that GetCacheToken got you the 
nsICacheEntryDescriptor (hard reference) referring to the document.  Indeed, 
reading the code, that seems to be the case.

Hmm ... perhaps the hard reference is not always set up by the time you call 
GetCacheToken()?  Is SetCacheToken() supposed to place the document into the 
cache and set up the hard reference?  Or perhaps does http place the document 
into the cache automatically but SetCacheToken() has to be called to keep and 
hold a hard reference to the document?
GetCacheToken() will allow you to keep an entry in the cache, but only 
SetCacheToken() will allow you to recreate a channel to that entry.  Otherwise 
you have to create a channel with a URL which may or may not be what the cache 
token refers to (the entry has been doomed).
Oh, I see.  So what you're saying is we will do this:
* User types in URL
  - Browser creates HTTP channel
  - Browser has HTTP channel retrieve content from URL (possibly through cache)
  - Browser grabs hard reference from HTTP channel and keeps it around
    (GetCacheToken)
  - Browser destroys HTTP channel
* User hits View Source
  - View Source gets cache token from Browser
  - View Source creates HTTP channel
  - View Source has HTTP channel retrieve content using hard reference
    (SetCacheToken)
  - View Source destroys HTTP channel

Why not change the second part to this:
* User hits View source
  - View Source gets cache token from Browser
  - View Source grabs content directly from cache token

Then you don't need SetCacheToken().  HTTP channel seems like an unnecessary 
extra step in this case.  Maybe it's impossible to get a stream out of the 
cache, and that's why it's done this way?
I think this also causes us to suck on the IBENCH tests.
No that's bug 61363.  The IBENCH tests we're looking at aren't doing printing, 
saving, or view-source.  This bug is for operations on the currently displayed 
page, so they don't result in refetching from the net.  When the user wants to 
print, save, or view-source for the front-most window, they don't expect to get 
different results from the net.
Do you think we can get this resolved in time for mozilla0.9.1?
BTW, another situation when we should use the current version instead of
refetching is the "Send page" function (which does not work correctly even in
Netscape 4). With "Send page" not only might you end up losing the data, you may
also end up e-mailing something very different from what you thought you were
e-mailing...
-> 0.9.3
Target Milestone: --- → mozilla0.9.3
*** Bug 79758 has been marked as a duplicate of this bug. ***
Whiteboard: [cache] → [Hixie-P1] (HTTP) [cache]
Blocks: 83792
bug 83792 - same cause, more unwanted effects.
Fix this NOW, please.

(Sorry for the spam)
No longer blocks: 83792
Blocks: 84106
Blocks: 85128
Blocks: 86261
Keywords: nsenterprise
a possible (minimal) solution:

toplevel page is loaded... before the necko channel goes away, QI to
nsICachingChannel.  then call nsICachingChannel::GetCacheKey.  store
the cache-key someplace.

when you want to reload the document, just create a new necko channel,
but before calling AsyncOpen, QI the new channel to nsICachingChannel,
and call SetCacheKey, passing the stored cache-key to the channel.
also, be sure to call SetLoadFlags(nsIRequest::LOAD_FROM_CACHE) on the
necko channel as well to prevent the usual cache validation.
the cache-key is used, for example, to distinguish the results from 
different POST requests on the same URL.

this solution does not guarantee that data will be fetched from the
cache, but it does do the "right thing" regardless of whether the cache
is present or not, etc.  and, for file->SaveAs and view-source it will
almost always avoid refetching from the net.
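A rough sketch of that sequence, using stand-ins only: `ToyChannel`, its `cacheKey`/`loadFlags` fields and the `network` function are hypothetical and merely mimic the shape of the GetCacheKey / SetCacheKey / LOAD_FROM_CACHE steps named above; they are not the real nsICachingChannel or nsIRequest signatures.

```javascript
// Hypothetical stand-ins for the necko channel and its cache; illustration only.
const LOAD_FROM_CACHE = 1;
let nextKey = 0;

class ToyChannel {
  constructor(url, cache, network) {
    this.url = url;
    this.cache = cache;     // Map: cacheKey -> body
    this.network = network; // function(url) -> body
    this.cacheKey = null;   // set by the caller to reuse an earlier load
    this.loadFlags = 0;
    this.source = null;     // "cache" or "network", filled in by open()
  }

  open() {
    const cached =
      this.cacheKey !== null ? this.cache.get(this.cacheKey) : undefined;
    if ((this.loadFlags & LOAD_FROM_CACHE) && cached !== undefined) {
      this.source = "cache";
      return cached;
    }
    // No usable entry (or validation was not suppressed): go to the net.
    const body = this.network(this.url);
    this.source = "network";
    if (this.cacheKey === null) this.cacheKey = `key-${++nextKey}`;
    this.cache.set(this.cacheKey, body);
    return body;
  }
}

let served = 0;
const network = (url) => `response ${++served} for ${url}`;
const entries = new Map();
const orderUrl = "http://example.test/order"; // made-up URL

// Original page load: remember the key before the channel goes away.
const original = new ToyChannel(orderUrl, entries, network);
const shown = original.open();
const savedKey = original.cacheKey;

// Save As / view-source: new channel, stored key, validation suppressed.
const again = new ToyChannel(orderUrl, entries, network);
again.cacheKey = savedKey;
again.loadFlags |= LOAD_FROM_CACHE;
const saved = again.open();

// If the cache was cleared in the meantime, the same code falls back to the net.
entries.clear();
const fallback = new ToyChannel(orderUrl, entries, network);
fallback.cacheKey = savedKey;
fallback.loadFlags |= LOAD_FROM_CACHE;
const refetched = fallback.open();
```

The last part shows the caveat from the comment: the key makes a cache hit likely, but nothing here forces one.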
Folks, could someone please step up and make a damn architectural decision here?
Over a year of thinking about this issue in this bug alone and we don't seem
much closer to a solution. There's been a lot of good discussions and ideas
kicked around, and a number of possible solutions proposed, but we still don't
appear to have anyone setting a stake in the ground and saying "do this." (or
better yet a group of core moz folk that can agree on a single solution.)

I'm currently losing a $100 argument with buy.com over a duplicate order that
resulted from printing a copy of the order summary. At this point mozilla is
unusable for the average user's e-commerce (at least with sites as poorly
engineered as buy.com). I've had to resort to another browser for any form of
online purchase.
putterman: were you asking me about ecommerce and multiple orders?
Radha, what's the session history perspective on this?
PDT+ per selmer's request.
Whiteboard: [Hixie-P1] (HTTP) [cache] → [Hixie-P1] PDT+
Session History comes into picture only when you hit reload() which uses the
VALIDATE_ALWAYS loadflag.  Session History however stores the cachekey for the
current page in mOSHE. This cacheKey is passed to nsICachingChannel when the
loadtype is LOAD_HISTORY/RELOAD_NORMAL/RELOAD_CHARSET_CHANGE. If the solution to
this bug is to do a similar thing for view-source/printing, then I think the
current setup in nsDocShell::DoURILoad()  can be extended to meet the needs here. 
I'm not seeing any problems with printing POST results.  On both the trunk and
the 0.9.2 branch when printing the results of a form POST we *do not* reload
from the network (and do another POST)...

Are there any sites that exhibit this problem when printing?  (ie. reposting
when printing...)

If not, I suspect that this is only a SaveAs issue...  Because the printing code
deals with documents in a completely different way...

-- rick
The attached patch implements Darin's suggestion for passing in the cache key to
the request generated by a Save As... operation. The patch attempts to load out
of the cache. If there is post data and the stream isn't in the cache, it brings
up a dialog asking if the user wants to repost.

The Save As... menu item works as expected with the patch. I saw a problem with
the Save As... context menu item, but it seemed as if the JS in nsContextMenu.js
was passing a bogus false value for doNotValidate. Still investigating, but the
patch should be set to go in as-is pending r/sr's.
Rick, was that a recent change in printing? I know it used to go back to the
network (for get requests anyway) in pre 0.8 days... I had gotten burnt a few
times printing the results of some internal tools we use and getting hard copy
of the input forms instead.

with current trunk builds here's the matrix from a quick test I just did:
(pre vidur's patch which I just mid-air collided with... building with it now.)

               POST     GET
---------------------------
print           OK      OK
view source     OK      OK
save as        BAD!     OK

hmmm.... one in six... russian roulette anyone?

(p.s. after four months, mention of the state attorney general
last week seems to have finally solved the discussions with buy.com)
with that patch my test case is 100% OK.
Chris, how about "File -> Send page" (bug 86261)? I haven't checked it yet, but
I am afraid it will add a bullet or two to your "Russian roulette".
               POST    GET
---------------------------
print           OK      OK
view source     OK      OK
save as         OK      OK
send page      BAD!     OK


http://ntiaotiant2.ntia.doc.gov/test/mozillatest is a much better test case from
cweiss@iname.com on bug 55583.

summary for those not on that bug:

save as: "successfully" re-requests! (number incremented)  
view source: fails completely (server error, suspect GET instead of POST)
send page: fails completely (server error, ditto)
print: success :)

vidur/adam - any update to this PDT+ bug?  Pls update status whiteboard.
It seems from Chris Abbey's comments that the patch from Vidur works ok?   If
so, can this patch get r and sr and be checked in?  It looks like the patch will
fix Save As, but not Send Page.  Perhaps that is a separate bug.
Latest patch looks good to me so r=adamlock@netscape.com.

I think Send Page should be new bug.
Well, according to Chris Abbey's last test results, _all_ methods currently
re-request, so there is still a need for some universal method of avoiding
re-requesting. But once such method is implemented, making use of it becomes
separate bugs - bug 55583 for "view source", bug 84106 for "save as" and bug
86261 for "send page".

IMHO, what this all means is that the "save as" patch belongs to bug 84106 and
not to this bug.
lchaing: my append on 6 July was with the trivial test case I attached, the
append on 7 July however was with the more realistic testcase at the URL. (I'm
not sure of the difference, aside from a cfm script vs a php script, but it clearly
demonstrates one). As a result I wouldn't put much faith in the value of my
append from 6 July if I were anyone.... ;) 

Aleksy: as rick pointed out, and the URL proves, print works fine... but then
print is also a completely different beast in that it really doesn't *need* the
original html, the dom is fine.  Otherwise I agree with your take on this bug 100%.
The patch ensures that for Save As we first attempt to get the data from the
cache and, only if that isn't possible (e.g. someone explicitly clears the
cache), we put up a dialog asking if we should repost. The same pattern should
be used for the View Source and Send Page cases as well, though the fixes for
these two will require changing different code paths.

I got a verbal sr=rpotts@netscape.com, so I'm inclined to get this into the
trunk for some bake time.
Vidur, exactly, and this is why the patch belongs to bug 84106. What this bug is
about is creating a mechanism that would ensure that if you are viewing a page,
then you also have a cache token that will ensure that the source will be in the
cache if you need it. This patch is doing the right thing (hopefully) and is
addressing a related issue (as covered by bug 84106), but it is not addressing
this bug.
Vidur, does that patch also fix "Save Frame As..."?  It also uses the savePage()
function, but I'm not sure the history entry you're getting has the post data
for a frame... (in fact, based on my testing it seems not to).
Also how about "View Frame Source"? I was having trouble with that not working
correctly for POSTed data with 0.9.2 today, which is what reminded me.
"View Frame Source" is what _I'm_ working on (bug 55583).  That was my
'testing'.  If there's a way to get the postData and cacheKey associated with a
document in a frame, I have yet to find it....  Hence my previous comment in
this bug.
Not sure if this is pertinent but thought someone should check. There have been
a number of comments posted about "postData" does this need to include/does
include the cookie data at time of posting?  Obviously if you were refetching
the page from the network that would then be important, but is it important for
re-obtaining the page from the cache?  or any other location?

For instance you go to the same page URL+Post data, but since the submission has
already been posted once, now the source could change based on it being the
second post.  So then seemingly identical "posts" could be different based on
cookies, or the current time, etc, but would both appear to be the same in the
cache, does the second post overwrite the first cache? should it?

Glad to see this bug finally getting some heavy attention....
docshell member mOSHE has the cachekey and postdata for the document. This
applies for subframes too.
Radha: is that member somehow scriptable?  The problem is that we're working in
JS here...

Wiggins:  The important thing is the combination of cacheKey + postData.  The
behavior on load is then as follows:

if (cacheKey && postData) {
  // try to load from cache using cachekey.  If that fails pop up a warning
  // dialog before requesting from the network
} else if (cacheKey) {
  // load from cache but do not prompt before re-requesting
} else {
  // just load
}
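To make the three branches concrete, here is a small runnable sketch. Everything in it is assumed for illustration: `loadDocument`, and the `fromCache`, `fromNetwork` and `confirmRepost` callbacks stand in for the cache lookup, the network request and the repost warning dialog; none of them is an existing Mozilla function.

```javascript
// Hypothetical helper: `deps` bundles the cache lookup, the network fetch and
// the confirmation dialog so the branching can be exercised in isolation.
function loadDocument({ cacheKey, postData }, deps) {
  if (cacheKey) {
    const cached = deps.fromCache(cacheKey);
    if (cached !== undefined) return cached;
    // Cache miss: only a POST needs the user's consent before being re-sent.
    if (postData && !deps.confirmRepost()) return null;
  }
  // No cache key, a plain GET miss, or a confirmed repost: just load.
  return deps.fromNetwork(postData);
}

const log = [];
const deps = {
  fromCache: (key) => (key === "k1" ? "cached page" : undefined),
  fromNetwork: (postData) => {
    log.push(postData ? "POST" : "GET");
    return "fresh page";
  },
  confirmRepost: () => {
    log.push("prompt");
    return false; // the user declines in this example
  },
};

// Cache hit: no prompt and no network traffic.
const hit = loadDocument({ cacheKey: "k1", postData: "a=1" }, deps);
// POST result missing from the cache: prompt first; the user says no.
const declined = loadDocument({ cacheKey: "gone", postData: "a=1" }, deps);
// GET result missing from the cache: silently re-request.
const plainGet = loadDocument({ cacheKey: "gone", postData: null }, deps);
```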
mOSHE is a private member of nsDocShell, not available from JS. 
is all or parts of this on the trunk or branch?
what's left to do?  
can someone summarize and project an ETA?
Here's what needs to happen as far as I can tell:

1)  An API is needed to get from a docshell (any docshell, not just the root of
    the docshell tree which has the session history attached to it) the
    information needed to recreate it (load from history or reload from the web
    with the same post data).  At the moment this would be a cache key and a
    post data stream.  These could be gotten as themselves, in the form of an
    nsISHEntry, or whatever.
2)  A method needs to be added to the nsIWebNavigation interface that takes the
    information gotten in step #1 and loads a page using it.  This could be
    similar to the current nsIDocShell::loadURI or something along those lines.

Once that's done, this bug can probably be marked as fixed.  The dependent bugs
can then handle using these APIs to make save page, save frame, send page, view 
source, view frame source, and so on work correctly.

Ccing Rick and Jud who've been talking about this a bit lately, since they'd be
the ones who could provide an ETA (and corrections in case I've gotten something
wrong).
What's up with this bug?  No current status and time is almost gone!
At chofmann's request, I'm posting this summary of status. Currently, the
following work correctly if there's post data associated with a page:

- print page
- print frame
- save page

The following still don't work correctly if there's post data (I believe they
will always refetch with a HTTP GET):

- save frame
- view page source
- view frame source
- send page

For NS6.1 RTM we have the following options:
1) Disable the latter set of operations for pages that have post data. To do
this correctly, specifically to even be able to determine whether the document
in an individual frame has post data associated with it, we still need some
amount of infrastructure (as opposed to chrome JS) change. Our options include:
 a) Expose the currently private nsISHEntry member of a docshell.
 b) Expose a mechanism for getting information about post data at the
nsIWebNavigation level for both top-level content windows and individual frames.
Once one of these changes is made, we'd need to make JS changes to check for
the existence of post data before completing the corresponding operation.

2) Come up with a complete fix. Boris Zbarsky's description of such a fix (or at
least the infrastructure for it) is on the money. We'd have to:
 a) Expose a mechanism for getting the post data at the nsIWebNavigation level
(as in 1b).
 b) Allow for the post data to be passed in as an optional parameter to
nsIWebNavigation::LoadURI (bug 46870).
 c) Either expose the notion of a cache key in a similar manner as the post data
stream or hide the cache key completely. The latter option would require us to
create an internal mapping between URIs and cache keys (including URIs that are
not "officially" in the cache, such as those that have post data).
Once these changes are made, we'd also need to make JS changes to retrieve the
post data and pass it into the load methods used for the different operations.

3) Live with the status quo

Rick Potts is looking at the infrastructure changes for 1 and 2. He will first
work on the infrastructure changes needed for disabling (probably option 1b/2a).
He will then check in an existing patch for 2b (see bug 46870). The ETA for
these changes is 7/18.

At chofmann's request, I will open bugs for making the chrome JS changes
necessary to disable the features for pages with post data. If Rick can't get
the complete fix done by 7/18, chofmann will find owners for the chrome bugs.
By "parts working" I assume you mean there was a checkin on the branch.  If so,
I'm leaning toward option 3.
A few comments on what vidur wrote:

> 1) Disable the latter set of operations for pages that have post data.

This would be a branch-only thing, I presume?  On the trunk we would keep
working toward option #2?

> The latter option would require us to create an internal mapping between URIs
> and cache keys

This may not work so well for view source...  Consider the page at
http://www.mozilla.org.  To view its source we load the URL
view-source:http://www.mozilla.org.  The protocol handler for view-source:
creates an HTTP channel for http://www.mozilla.org and loads it.  _But_, we want
to use the source from the original URL.  So we want to take the cache key and
post data corresponding to the http: url and load the view-source: URL with those.

This does not mean that the cache key has to be exposed, it just means that
there needs to be a way to load a url of the form view-source:whatever using the
information for the "whatever" URL.
Assignee: adamlock → rpotts
>This would be a branch-only thing, I presume?

yes
 
> On the trunk we would keep working toward option #2?

yes

The most troublesome remaining problem for most users would
be "send page".  I could see users completing an ecommerce
transaction and then attempting to send the page to themselves
as a confirmation of the transaction.

bug over to rpotts since he is working the infrastructure changes
that are now the focus of this bug.
In order to make this work correctly, 3 different API changes must be made:
1. Provide a 'postData' attribute on nsIWebNavigation to allow JS to access the
postData.
2. nsIWebNavigation::LoadURI(...) needs an additional 'postData' argument.
3. The underlying necko 'cache key' needs to be hidden from non-networking APIs.

Clearly, we can't make all of these changes on the branch :-)  #3 alone requires
a fair amount of work - and darin is currently gone :-(

So, here's my proposed plan...
-----------------------------
For the branch I can make API change #1 and expose a postData attribute to JS. 
This will be enough to allow all JS implementations to 'disable' if postData is
present.
Fortunately, this API change is *very* small and should not introduce
regressions...  it does not change any existing APIs and the only consumers will
be the new JS code written to disable the features.

For the trunk I can land API change #1 and #2 and start working on #3...  Once
all three changes have been made, we will be able to 'correctly' implement these
features in JS...

Does this sound reasonable to everyone?

-- rick
sounds like a good approach to me.
Okay, I'm an outsider here, but I'm slightly worried about the #3 approach.

Consider the following scenario. I'm assuming a page that simply provides an
incrementing counter every time it's hit, but the same argument applies to
almost any page that can give different results on successive hits with the same
querystring and/or post data - anything that, for example, is time-sensitive (eg
frequently updated news sites), relies on a backing database on the server which
changes state, is content-negotiated (if the user changed their language prefs)
or cookie based, etc etc etc.

User opens page in window 1. Page is fetched from server, returning "1". Cache
entry #1 is created, containing "1". User leaves this window open on this site
and goes to work in another window.

User later opens same page in window 2. Page is fetched from server again,
returning "2". Cache entry #2 is created, containing "2". Cache entry #2 has the
exact same URI, and also the exact same post data (if any) as #1.

User goes back to window 1 and selects "View source". If the "View source"
implementation only has the uri and post data to go on, it has no way to
distinguish between cache entries #1 and #2, but they contain different things!
Since #2 was created later it will probably get that, which is wrong.

Surely a better way to go here would be to expose the cache key on the document
directly (as an opaque pointer that you can't do anything with except load, of
course) and use this for "view source", "send page" etc. Relying on the URI plus
post data seems like a step backward.

Am I missing something?
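
The two-window scenario above can be modelled with a toy cache. This is only a sketch: `ToyCache`, its methods, and the example URL are invented for illustration and are not Necko APIs. It assumes, as the scenario does, that the second load creates a second entry rather than overwriting the first.

```javascript
// Toy model of the scenario above; none of these names are real Necko APIs.
class ToyCache {
  constructor() {
    this.entries = []; // every stored response, oldest first
    this.nextKey = 1;
  }

  // Store a response and return an opaque key identifying exactly this entry.
  store(uri, postData, body) {
    const key = this.nextKey++;
    this.entries.push({ key, uri, postData, body });
    return key;
  }

  // Lookup by URI + post data: the most recent matching entry wins.
  lookupByRequest(uri, postData) {
    for (let i = this.entries.length - 1; i >= 0; i--) {
      const e = this.entries[i];
      if (e.uri === uri && e.postData === postData) return e.body;
    }
    return null;
  }

  // Lookup by the opaque key handed out at store time.
  lookupByKey(key) {
    const e = this.entries.find((entry) => entry.key === key);
    return e ? e.body : null;
  }
}

const counterCache = new ToyCache();
const counterUri = "http://example.com/counter.cgi"; // hypothetical URL
const window1Key = counterCache.store(counterUri, null, "1"); // window 1 loads the page
const window2Key = counterCache.store(counterUri, null, "2"); // window 2 loads it later

// "View source" in window 1:
const byRequest = counterCache.lookupByRequest(counterUri, null); // window 2's content
const byKey = counterCache.lookupByKey(window1Key); // what window 1 actually displays
console.log(byRequest, byKey);
```

With only the URI and post data to go on, window 1 gets window 2's content; with the key it gets its own.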
I have the same concern as Stuart.  More specifically, hiding the cache key
should still somehow allow loading url "view-source:foo" and getting the source
of "foo" from the corresponding window from cache.
>Does this sound reasonable to everyone?
After the long wait, sounds like music to the ears :-) Although the reasoning
so far throughout the bug has been that the cache key is "key" to fixing the bug 
as sballard has illustrated.
A second URL fetch performed with identical post data will not retrieve the same
cache entry unless a "cache key" or "cache token" (from the original channel)
was provided to the HTTP channel.  The original cache key can not be recreated
given an identical URL and identical POST data.

I'm not clear on why we want/need to hide the cache key.
One of the issues is that there is not a "single" cache key associated with the
document - correct?

there is a cache key per URI loaded for the document.  This means that to be
"correct" someone needs to hold onto *all* of the cache keys - including the
ones for linked CSS and JS...

Holding onto the cache key for the document URI will definitely make things
"better" but i don't believe that it is sufficient to be "correct".

If these assumptions are correct, then I believe that someone has to hold onto
*all* of the cache keys...  And i'm not about to add an nsVoidArray argument
(for them) to LoadURI(...) :-)

I believe that we need to come up with a "correct" approach.
rick, 
when can you get:

   1. Provide a 'postData' attribute on nsIWebNavigation to 
   allow JS to access the postData.

on the branch?  let's make that happen and then we can take the pdt+ off
this bug.

vidur, 
can you add the bug numbers for the open bugs for making 
the chrome JS changes as a reference in this bug?
those also need to get pdt+



No issues with the branch which can have its quick fix -- but if it takes
adding a nsVoidArray argument for the correct fix later on the trunk, then there 
are 59 votes and XXX persons cc:ed here for that :-)
Can any netscape people comment on how 4.x handled this? (I never saw 4.x get it
wrong, although I can't say I tried very hard).

A single cache key would work fine for view source, at least.

Is there any way that the document itself can hold cache keys to the other
things that were loaded by it? That way everything gets cleaned up through
normal refcounting and we only need the single cache key to the document.

Perhaps the extra VoidArray should be added to the CacheKey object?
stop it... you all are hurting me :-) 

with nsVoidArray example rick meant to say that this is a bad idea... we clearly 
need to work a better solution than having to keep all the cache-keys (or any)

Let's stay focussed on resolving this for both short term and eventually with a
better solution past this release. And rick: next time you suggest something 
like a nsVoidArray of cache keys (without sarcasm/kidding tags around it)... I 
am gonna come looking for you with my nerf gun :-)
gagan,

how many smiley faces do i need?  i always thought that one was enough :-)
but perhaps when the cache is involved i need a few more :-) :-) :-)

-- rick
once again i seem to have ignited a hot topic :-)

My main concern with exposing a cache key via the nsIWebNavigation API is that i
do not believe that it is sufficient to guarantee that *all* content will come
out of the cache.

I believe that we may need some other mechanism in order to "get it right" 100%
of the time.  So, rather than put the cache key there "for now" (and
realistically never revisit the issue until another PDT+ bug turns up) i think
that we should hold off and consider what the correct solution is - for the trunk.

I'll try to attach a patch for the branch (really soon) which exposes a
read-only attribute for the postData on nsIWebNavigation...   At least this will
allow us to *not* do the "wrong thing" :-)

Once the branch work has been done, i'll start looking into what the "right
thing" is for the trunk...

So, Gagan, get your nerf gun ready ;-)
I've just attached a patch which adds a read-only attribute to nsIWebNavigation
allowing access to the postData (if there is any...)

So with this patch, given an nsIDocShell you can QI to nsIWebNavigation and
check for postData.

When dealing with frames it is important to get the nsIDocShell associated with
the particular frame in order to ensure that you are getting the right post data :-)

-- rick
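
A minimal sketch of how chrome JS might use the new attribute to decide which features to turn off. Only the read-only `postData` attribute comes from the patch described above; `commandsToDisable`, the command names, and the mock objects are assumptions made so the example runs outside Mozilla.

```javascript
// Hypothetical helper: decide which commands to disable for a frame.
// `webNav` stands in for the nsIWebNavigation of that frame's docshell; only
// its read-only `postData` attribute (described above) is consulted.
function commandsToDisable(webNav) {
  const resultOfPost = webNav.postData !== null && webNav.postData !== undefined;
  // Command names are invented for this sketch.
  return resultOfPost ? ["viewSource", "sendPage"] : [];
}

// Plain objects used as mocks so the sketch runs outside Mozilla.
const getFrame = { postData: null };
const postFrame = { postData: { /* opaque input stream */ } };

console.log(commandsToDisable(getFrame));  // nothing to disable
console.log(commandsToDisable(postFrame)); // both commands disabled
```

With frames, each frame's own object would have to be passed in, for the reason given above.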
As far as that goes...  we need to get frame docshells in response to context
menu actions.

At the moment when a context menu comes up we have the document involved
(this.target.parentDocument) and the window involved
(document.commandDispatcher.focusedWindow).

Neither of these allows easy access to the corresponding docshell.  One can walk
the docshell tree and for each docshell compare nsIDocShell.document to
this.target.parentDocument till a match is found (I have a function that does
this lying around).  Is there a better way?
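
The tree walk described above could look roughly like this. It is a sketch against mock objects, not the function mentioned in the comment: `document` mirrors the nsIDocShell.document attribute, while `children` is a hypothetical stand-in for however child docshells are enumerated.

```javascript
// Depth-first search for the docshell whose document is `targetDocument`.
// `document` mirrors the nsIDocShell.document attribute mentioned above;
// `children` is a hypothetical stand-in for however child docshells are listed.
function findDocShellFor(rootDocShell, targetDocument) {
  if (rootDocShell.document === targetDocument) return rootDocShell;
  for (const child of rootDocShell.children) {
    const found = findDocShellFor(child, targetDocument);
    if (found) return found;
  }
  return null;
}

// Mock frameset: a top-level page with two frames; the second contains a
// nested frame.
const docTop = { name: "top" };
const docNav = { name: "nav" };
const docBody = { name: "body" };
const docAd = { name: "ad" };
const adShell = { document: docAd, children: [] };
const tree = {
  document: docTop,
  children: [
    { document: docNav, children: [] },
    { document: docBody, children: [adShell] },
  ],
};

console.log(findDocShellFor(tree, docAd) === adShell);
```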
from an nsIDOMDocument you can do the following to find the corresponding
nsDocShell:

  QI(nsIDOMDocument) -> nsIDocument
  nsIDocument->GetScriptGlobalObject(...)
  nsIScriptGlobalObject->GetDocShell(...)

and you're there!!!
nsIDocument is not scriptable... I'm working in JS.
In terms of the question of how to fix it "right", my suggestion on 2000-12-28
was to make a separate cache manager instead of trying to adapt Necko to do it.
 This still seems cleaner to me; am I alone in thinking along these lines?
Stuart, as far as I remember 4x indeed did it right most of the time, but not
for the "send page".
Blocks: 91341
Blocks: 91342
The bugs to disable view source and send page on the branch are 91341 and 91342.
No longer blocks: 91341, 91342
The PDT+ on this bug on the branch is *just* for the patch necessary to make bug
91341 and 91342 possible (we've just PDT+ed those bugs.)  Can this be checked in
today?
Could someone review and superreview Rick's branch only patch? thanks!
Rick, we really need a way to get the docshell of a document from js. Do you
know of one?
r=bbaetz for the branch.

Having the input stream implement nsIRandomAccessStore doesn't help js, because
that interface isn't scriptable, and we need this for js. For the branch, blake
only needs to know if the data is post data or not, so that's OK. It needs work
for the trunk, and possibly a different solution.
The input stream is rewound at
http://lxr.mozilla.org/seamonkey/source/docshell/base/nsDocShell.cpp#4526 and
any loads that use the nsIWebNavigation APIs should be passing through there...

Code that creates channels directly instead of using the nsDocShell stuff will
have a C++ part that can rewind the stream.
Blake super-reviewed Rick Potts's branch-only change and also checked it into the
branch. PDT should remove PDT+ from this as it is no longer a stopper for the
branch.
Whiteboard: [Hixie-P1] PDT+ → [Hixie-P1] PDT+ branch fix is checked in.
hey blake,
jst and I talked about the best way to get an nsIWebNavigation from an
nsIDOMWindow...  and as you know, currently, there is no easy way.

The best solution we came up with was to add nsIWebNavigation as one of the
interfaces that you could get via GetInterface(...). This would require two small
changes in nsGlobalWindow.cpp:

1. Add nsIWebNavigation to the GetInterface(...) implementation.

2. Change the ClassInfo flags so nsIInterfaceRequestor was visible to JS via
XPConnect.

Let me know if you think that these changes are necessary for the branch...  I
think that we will definitely want to do this on the trunk... since we need a
way to map from an nsIDOMWindow to the embedding interfaces...

-- rick
Rick, thanks for the info. Last I heard was that we could live with it being a
problem in View Frame Source on the branch.

Lisa, does PDT want to pursue fixing it for view frame source too?

[removing PDT+; there's nothing that needs verification here anyways]
Whiteboard: [Hixie-P1] PDT+ branch fix is checked in. → [Hixie-P1] branch fix is checked in.
No way Netscape 4x did it right most of the time...
How many times did I curse Navigator for trying to fetch the document again in offline mode for printing or for view source? And what about when View Source provided the source of the error page instead of the displayed one?
Maybe it got better with 4.75 or 4.76 but it was a very late effort.
Netscape 4.x doesn't do it right.  I just tested 4.76, and it's very broken. 
I'm sure 4.78 is much the same.

I've uploaded an attachment with the Perl code of the CGI that I used to test
with; it sends back the current date/time, the method used to call the CGI, and
a link to itself to trigger the GET method, and a submit button to itself to
trigger the POST method, so both can be tested.  If you wait at least 1 second
between tests, you can tell whether or not the CGI was reloaded.
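
For readers without the attachment, a page generator with the same behaviour could look roughly like this. It is not the attached Perl code, only a sketch of the description above; `renderTestPage`, `escapeHtml`, and the example script path are invented names.

```javascript
// Escape text before placing it in HTML.
function escapeHtml(text) {
  return String(text)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Build the test page: timestamp, request method, a self-link (GET) and a
// self-submitting form (POST). `selfUrl` is wherever the script is installed.
function renderTestPage(method, selfUrl, now = new Date()) {
  const url = escapeHtml(selfUrl);
  return [
    "<html><head><title>Reload test</title></head><body>",
    `<p>Generated at: ${escapeHtml(now.toISOString())}</p>`,
    `<p>Request method: ${escapeHtml(method)}</p>`,
    `<p><a href="${url}">Request this page again with GET</a></p>`,
    `<form method="POST" action="${url}">`,
    '<input type="submit" value="Request this page again with POST">',
    "</form>",
    "</body></html>",
  ].join("\n");
}

// Hypothetical install path, for illustration only.
console.log(renderTestPage("GET", "/cgi-bin/reload-test.cgi"));
```

If the timestamp in "View Source" differs from the one in the window, the browser re-ran the script instead of using the copy it was displaying.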

Under 4.76 (Linux), here's what I found:

The GET method will perform the "View Source", "Print", "Send Page" and "Edit
Page" functions, but ALL of these will re-fetch the page and run the CGI again,
since the content has expired from the cache.  (If it's in the cache, it will
probably use that for some or all of these functions, but that's not correct
behavior either.)  All functions using the GET method will use the page, but
none of them use the copy being displayed in the window.

The POST method is even worse.  "Send Page" and "Edit Page" both reload the page
with a GET request (not POST), running the CGI again.  "Print" fails completely,
sending nothing to the "lpr" process, which returns the error "lpr: stdin: empty
input file".  "View Source" will bring up a window, but the source displayed is
that of the usual POST error message:

     <TITLE>Missing Post reply data</TITLE>
     <H1>Data Missing</H1>
     This document resulted from a POST operation and has expired from the
     cache.  If you wish you can repost the form data to recreate the
     document by pressing the <b>reload</b> button.

Basically, NONE of the functions in Netscape 4.x that should be using the
original source are correctly doing so.  When they manage to, it's only because
the content happens to be in the cache, and it could expire from the cache or be
replaced with a different version at any time.  For content that expires right
away, the ONLY thing that works on the original data is cut & paste with the
mouse into another window.  That's all.

Netscape 4.x doesn't even come close to getting it right.
As rpotts explained earlier, I have attached a patch to nsGlobalWindow.cpp and
nsDOMClassInfo.cpp that exposes nsIInterfaceRequestor to JS. I need this
functionality for some other purposes too.
Radha, don't expose new interfaces on the window object in JS unless it's
absolutely required; in this case it isn't. Instead, allow scripts to QI the
global object to this interface when this interface needs to be accessed. This
means that the window object in JS won't have a getInterface() method on it;
instead you have to do
window.QueryInterface(Components.interfaces.nsIInterfaceRequestor).getInterface(...).
This way we can support new functionality w/o polluting the global namespace on
web pages.

I'll attach a patch that does this part, your change to nsGlobalWindow.cpp is
fine, but don't check in the nsDOMClassInfo.cpp change in your patch. With my
patch (once tested n' all that) you'll have sr=jst
Re: jst's patch, I'm assuming that the new inclusion of
nsIXPCScriptable::DONT_ENUM_QUERYINTERFACE in the scriptable flags for Nodes,
Arrays, etc. is not going to break anyone.

sr=vidur
r=rpotts on jst's patch to nsGlobalWindow to expose nsIInterfaceRequestor via an
explicit QI(...)

and r=rpotts on radha's patch to add nsIWebNavigation to getInterface(...) once
the DOM_CLASSINFO_MAP_ENTRY(nsIInterfaceRequestor) is removed...
Vidur, the scriptable flags remain unchanged for Nodes, Arrays etc. (they
already have nsIXPCScriptable::DONT_ENUM_QUERYINTERFACE set), only the flags for
Window are changed.
The latest patch consolidates the previous 2 patches (patch ids 43271, 43347)
and a patch given by jband privately for an assertion problem. This patch
securely exposes nsGlobalWindow interfaces to JS.  I will be checking this in to
the trunk. Reviews were given to individual patchlets.
I don't quite see how printing comes into this. Doesn't print work on the
(currently displayed) DOM? IMHO it should, printing should be wysiwyg.
*** Bug 93157 has been marked as a duplicate of this bug. ***
Target Milestone: mozilla0.9.3 → mozilla0.9.4
*** Bug 93890 has been marked as a duplicate of this bug. ***
*** Bug 94417 has been marked as a duplicate of this bug. ***
What's the latest on this?  They want me to disable edit page/send page/view
source again for pages with postdata on the trunk unless this gets fixed the
proper way.
Depends on: 94205
Moving to nsenterprise+, adding nsBranch. I'm assuming the 0.9.2 branch fix 
hasn't been checked into the trunk as it wasn't the "complete, right fix."
Keywords: nsBranch → nsbranch
not happening for 0.9.4. some underlying work Rick is doing will get us closer
to this, but we're not there yet.
Target Milestone: mozilla0.9.4 → mozilla0.9.5
Blocks: 99194
Checkpoint for nsbranch . . .

Judson - Are we there yet on this one?
No longer blocks: 99194
Minusing this one per my email exchange with Judson. This looks like a good one
to get, but it is not gonna be fixed until later in TM0.9.5. 
Keywords: nsbranch → nsbranch-
marking nsenterprise-.
*** Bug 96159 has been marked as a duplicate of this bug. ***
-> 0.9.6
Target Milestone: mozilla0.9.5 → mozilla0.9.6
Blocks: 104166
Blocks: 107067
Keywords: nsbranch-
Whiteboard: [Hixie-P1] branch fix is checked in. → [Hixie-P1] partial fix is checked into 0.9.2 branch
->0.9.7

bug #94205 will be landing soon - hope :-)
Target Milestone: mozilla0.9.6 → mozilla0.9.7
*** Bug 106931 has been marked as a duplicate of this bug. ***
Is there something corrupt in Comment #158 (from 7-20-01) that causes it not to
wrap in the Mozilla 0.9.6 window?  I don't know if this is a regression in
Mozilla (I doubt it) or something getting messed up in the Bugzilla database.
Also, I apologize to anyone who thinks this is trivial; it is, but I'm hoping
it's easily fixed.
that was a bugzilla bug that was recently fixed i believe.
OK.  To summarize the current situation and get the ball rolling again.... At
the moment we can:

1)  Get the right docshell
2)  Get the postdata associated with the page
3)  Call loadURI with this info (bug 94205 has been fixed).

This is not sufficient for doing view source completely correctly, since there
could be multiple entries in the cache all with the same URI and same postdata
but different content.  It _is_ sufficient for fixing bug 64100.

Rick said:

> there is a cache key per URI loaded for the document.  This means that to be
> "correct" someone needs to hold onto *all* of the cache keys - including the
> ones for linked CSS and JS...

This is true in general.  In reality, however, we're dealing with the following
operations:

A)  save/save as.  This uses nsIWebBrowserPersist now anyway, so I am not sure
    this discussion applies anymore...  At the very least, this should be spun
    off into a separate bug on nsIWebBrowserPersist if it's still a problem.
    (bug 84106 should be retested).  Using the PERSIST_FLAGS_FROM_CACHE flag
    should have the desired effect here.

B)  Send page.  Is this going to send all the linked materials together with the 
    page?   (I sincerely hope this is not the case; munging urls in the source to
    point them to the online content seems like a better solution, a la
    nsIWebBrowserPersist).  If it is _not_ sending the linked stuff, then we
    don't need a "whole page" solution.

C)  View (frame) source.  This could not care less about things linked from the
    page/frame. All it wants is the source for the page/frame itself as it came 
    off the wire.

D)  Changing character set.  This could actually benefit from the more general
    approach.

This seems to cover all the things mentioned in the bugs this blocks.

I propose:

I)  Bug 64100 should get reopened and fixed (this is not very difficult to do
    now, with the infrastructure we have in place).
II) Rick and company should decide whether we actually need a generic api for
    this (especially since intl seems to be opposed to the concept of fixing
    item D).

I'm getting very tempted to move view source into C++, from whence I could just
call nsIDocShell::LoadURI directly and not have all these problems.  The current
apis would allow this to work just fine.
*** Bug 118487 has been marked as a duplicate of this bug. ***
What about the milestone setting?  0.9.7 is long gone, are we going to have this
for 0.9.9 or 1.0??
*** Bug 55583 has been marked as a duplicate of this bug. ***
cc: self
I noted an interesting thing with the current implementation of view source.

It does not send cookies to the server?

I use cookies for session control. If you try to access the page
without cookie you get redirected to login page.

When I tried view source I got a blank page?
This should only be the case when you open without the session cookie?

I don't know if this is relevant or has any impact but I did not see any
reference to this situation.

I do not have a test case at this moment but I can supply one if the need
should arise.
Technically View Source should not be sending anything as is its dependency on
this bug.  View Source should only show what was received by Mozilla from the
server.  It should not cause a second request.
This might be redundant information, but I just can't bring myself to read all
the posts. It seems to me that Internet Explorer makes heavy use of the disk
cache, and is even efficient at it.

When you view source and look at the file, it's not a file in memory, but rather
a physical copy. So can't we just stream the source to memory AND disk when
loading a page, and have view source just open the file at that particular time?
Correct me if I'm wrong, but Internet Explorer's view source doesn't apply to
dynamically generated content. (So it's not fetched from memory, but rather disk.)

Does anyone know how Opera handles this?
Mike: it is true that IE like mozilla will reuse content from the [disk or
memory] cache when it is available.  but, unlike mozilla, IE will generate HTML
when the [disk or memory] cache does not contain the requested document.

This bug is mostly solved already.  Mozilla now puts every downloaded document
into the cache, regardless of cache control headers (the headers are honored
when fetching the documents from the cache for normal page loads).

This does NOT guarantee that the document will exist in the cache when the user
views or saves the source of a document in a random browser window, but it does
pretty much guarantee that it'll work just after the document is loaded.

NOTE: i'm only talking about HTTP and HTTPS, not FILE or FTP.  though, i think
FTP could be modified to use the cache for downloaded material in a similar fashion.
Is http://bugzilla.mozilla.org/show_bug.cgi?id=120809 related?
An image gets refetched from the server every time, but definitely is
in the cache. 
Blocks: 120809
darin: If I go to a page that is generated on the fly (e.g. slashdot), open it
in two windows (thus causing each window to have different content) and then
save both of these windows, will the saved copies be different?
you will save the most recently fetched version of slashdot.  is this a problem
we want to solve?  if so, then it means pinning documents in the cache, which
sort of goes against the idea of having a fixed sized cache.  i'm not saying
it's impossible... indeed, the cache provides facilities for pinning an entry in
the cache, but it certainly is not without its tradeoffs.
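
To make the tradeoff concrete, here is a toy fixed-size cache with pinning. It is a sketch only: `PinnableCache` and its methods are invented for illustration and say nothing about how the real cache implements pinning or eviction.

```javascript
// Toy fixed-size cache with least-recently-used eviction and pinning.
// Not the Necko cache API; all names here are invented for illustration.
class PinnableCache {
  constructor(capacity) {
    this.capacity = capacity;
    this.entries = new Map(); // insertion order doubles as LRU order
    this.pinned = new Set();
  }

  put(key, value) {
    this.entries.delete(key); // re-inserting moves the key to the newest slot
    this.entries.set(key, value);
    this.evictIfNeeded();
  }

  get(key) {
    if (!this.entries.has(key)) return undefined;
    const value = this.entries.get(key);
    this.entries.delete(key);
    this.entries.set(key, value); // mark as most recently used
    return value;
  }

  pin(key) {
    if (this.entries.has(key)) this.pinned.add(key);
  }

  unpin(key) {
    this.pinned.delete(key);
    this.evictIfNeeded();
  }

  // Evict the oldest unpinned entries; pinned ones can hold the cache over
  // its nominal capacity, which is the cost of pinning.
  evictIfNeeded() {
    for (const key of this.entries.keys()) {
      if (this.entries.size <= this.capacity) break;
      if (!this.pinned.has(key)) this.entries.delete(key);
    }
  }
}

const pageCache = new PinnableCache(2);
pageCache.put("window1", "<html>lunchtime</html>");
pageCache.pin("window1"); // window 1 still displays this page
pageCache.put("a", "A");
pageCache.put("b", "B"); // over capacity: "a" is evicted, "window1" stays
console.log(pageCache.get("window1") !== undefined, pageCache.entries.size);
```

Every pinned page takes a slot away from ordinary browsing, and if enough pages are pinned the cache can no longer stay within its configured size.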
My opinion is that "yes, it is a problem we want to solve". I've got some CGI
scripts which generate oodles of completely different types of content under
various circumstances. While debugging, I'll often look at the sources of
different outputs (using some non-Mozilla browser).

perhaps Yet Another Pref?

"Save raw HTML source in cache to avoid new requests when using the view-source
command."

Turned off by default, easy for web developer types to turn on.
Only folks who want the feature have to deal with the tradeoff.

-matt
Yes, it is the problem we want to solve, but No, it should not be a pref. This 
is what you would expect to happen, that the page you look at gets saved. If we 
don't pin it there is no way to be sure that that happens.

IMHO we also want to pin it for at least three other things; view-source, send-
page and history. Pinning the htmlsource should not be that expensive, how many 
html-pages does a "defaultinstall doing normal browsing" contain approx? I have 
171 files in my cache, though i guess some of them are images, i would happily 
waste 10-20 of that on pinned down pages
I might be naive, but when I say "save page as" or "send page", I mean "do it
with the stuff I am seeing on my screen right now", and it is very disturbing
for me to see a refetch when I do it.  Many pages are relatively small and
static, most servers are very responsive and many people have good connections,
but this is not always the case, so IMHO, treating a refetch as a Very Expensive
Operation and doing it only when absolutely necessary or when explicitly asked
for is The Right Thing.
As I understand it, we _do_ want to avoid having another refresh. Unfortunately,
that _doesn't_ necessarily mean "the contents on your screen right now."

Why? Because you can change the stuff on your screen. Fill in forms, execute
javascript, whatever. Things of that nature should, IMHO, not be reflected in
"view source," "save page as" or "send page". But what about changing the
character set? (that's bug 17889 btw) Changing the charset shouldn't cause a
page to reset.

If we do want to support a save/send page option which maintains the current
state of the page, it should be considered a different feature and not interfere
with the current effort to implement the current features. 
I don't think that saving the page EXACTLY like it looks now is what anyone
wants. The ideal situation would be to save the HTML document to the cache as
the server gave it to you, and then also save all the other files the server
sent over that the document needs to render correctly, such as StyleSheets,
images, Java and other embedded things (Flash, etc.)

That way, whenever you want to either see the source, or save the entire page
(HTML+dependencies) etc it is all already there, moz would just need to copy it
elsewhere. The only reason that moz needs to refetch anything from the server is
if the user hits Reload or the cache TTL expires. Or if the meta refresh is set,
of course.
-- #197 From Darin Fisher 2002-01-27 16:14 said: -------
> you will save the most recently fetched version of slashdot.  is this a problem
> we want to solve?

  YES.  View Source, Save As, Print -- they should ALL act on the presently
viewed page.
  In the case of view source, we should always view the html for the page that
is displayed in the window.  If this means that I have 10 browser windows open,
all which requested some http://something/time.cgi URL (which outputs the time
of day in html) at slightly different times, I should get 10 different view
sources.  Always.  This is vital to being able to develop content for Mozilla,
because all major server-side development these days is dynamic, and it is
*categorically impossible* to test or debug dynamic content with Mozilla if it
does a refetch when the user hits "view source."

  The same goes for Save As and Print from a user's perspective.  If I just hit
the "submit" button to place an order, I don't want that posted again if I want
to print the invoice page or save it to disk (which would place another order).
 And if I choose to place two different orders to the same site in two different
windows (sequentially), and then save/print each invoice after each has been
placed, it should "do the right thing" and save/print two different invoices.


------- #201 From Nick Lewycky 2002-01-27 18:14 said: -------
> As I understand it, we _do_ want to avoid having another refresh.
> Unfortunately,
> that _doesn't_ necessarily mean "the contents on your screen right now."
> 
> Why? Because you can change the stuff on your screen. Fill in forms, execute
> javascript, whatever. Things of that nature should, IMHO, not be reflected in
> "view source," "save page as" or "send page". But what about changing the
> character set? (that's bug 17889 btw) Changing the charset shouldn't cause a
> page to reset.

  Filling in forms doesn't change the source html.
  Executing javascript, in most cases, doesn't change the source html either
(document.write() is the only exception I can think of, and I believe that is a
Microsoftism), so the browser should just show the original html as sent by the
server for save/view source.

  But this does bring up an interesting print problem... Printing should, IMO,
show the current form contents, as well as any dhtml changes via javascript,
etc.  That information is not contained in the source html.
From previous discussion, I understood that Print worked with the DOM representation of the page anyway. So it's probably Ok already. Who wants to test printing a filled form?
Hey, what's wrong with the formatting of my previous comment?
Is it possible to edit a comment after posting it?
> Print

Not relevant.  Printing prints the DOM, not the source.  So you always print
exactly what you see.

All the concrete examples people have given so far of cases where they would
want to see separate source for the same url involve post data that was sent to
the URL.  In those cases, we _do_ cache each response separately.
The view source problem isn't limited to post data sent to the URL.  A
"timeofday.cgi" is a good example of a non-post request sent to a URL which
returns dynamic content, and the "view source" should be different for each
browser window/history/tab.

This isn't a contrived example.  JSPs work this way as well.  As do ASPs.  As do
PHPs.  etc.  URL accesses may cause the server to generate and send dynamic
content /without/ sending state information via (get/post/cookies/URLrewrite)
which must be cached for view source.
(I got a midair collision with somebody saying the same thing as me; this is in
response to comment #206)

Not at all; it applies to *any* site whose content might change over time. If I
visit slashdot.org at lunchtime, leave the window open, do some other browsing
in other windows, including visiting slashdot.org again later with new stories,
then go to view source of the original page, I expect to get the source of that
page, not the source of slashdot.org at some arbitrary later time that I
happened to view it.

From a user's perspective, this is basic functionality. If the page is sitting
right there in a window, why should it be impossible to view its source? I do
understand the implementation reasons why it's hard, but it's also hard to
implement a functional CSS engine and HTML4 viewer - that's not a reason to
choose not to do it, or we'd still be using MozillaClassic :)

It should *never* be possible to have a page in a window and not be able to view
source, or send or save the page, and get any result other than the exact
content of the window that is on your screen. Even if the page specifies
no-store. That's what users expect; it should be what we do. No exceptions.
i think one point needs to be made clear:

it is trivial (modulo some API issues) to fix mozilla to not hit the server for
view->source and file->saveas nearly 99% of the time (including POST results).
it is far more complex to ensure that this is correct 100% of the time, and it
is also far more complex to ensure that view->source and file->saveas correspond
to the page being viewed.

of course, i understand why solving this is important, and i would love to see
it solved.  i also think that we have to be careful when we go to solve this
problem, since there is a bloat factor to consider... even if it wouldn't be an
issue for the majority of users, we still need to think about people with
systems "circa" 266 MHz.  moreover, some users disable their disk cache.  we
need a solution that works (or can be disabled) when there is no disk cache.
Blocks: 118487
Another thing I wanted to mention is "back" (the leftmost button in the toolbar).
moz treats "back" as a _refetch_ (unless it's an anchor in this page).
IMO this is wrong: "back" means "give me what I already saw", not "refetch".
Again, treating a refetch as a Very Expensive Operation and doing it only when
absolutely necessary or when explicitly asked for is The Right Thing.
sam: back currently loads the page from the cache.  it is true that if that page
is dynamically generated and has been more recently visited and hence modified
that you will not see what you originally saw.  fixing this would be difficult
and probably should not be done as it would result in some serious bloat,
especially for SSL pages which cannot be cached on disk for security reasons.
Does the cache not have a (timestamp visited), (timestamp last modified) for the
entry?
yes it does, but how would that help this situation?  sure it would allow you to
determine that the page in the cache is not the same page as you previously
visited, but then what?  you'd still be without the original page, so why not
just show what you've got in the cache.
What you're looking at is the need to cache everything but (invalidate/show/hide)
the cache on specific requests (or specific components.)

It's agreed that Print, Save As, View Source and some other commands do not need
to reload the URL from the network.   One option to alleviate this, is to cache
all requests in both digested and raw formats regardless, but make them visible
only to specific services (Print, Save As, View Source, etc.) The "stealth
cached" items expire and are replaced/expired as a normally cached item (perhaps
with a rule always cache).  By using this method, you can have one efficient
cache for the system.

* Cache everything
* On cache request, determine visibility to decide whether to report that the
item is not in cache (even if it is) and must be reloaded from the network.
** If do not cache pref is set, then the item isn’t visible in the cache (to the
browser) and must be refetched.
** If Print, Save As, View Source, all cache is visible always.

This would involve some checking between components to see who is authorized to
view the stealth cache, but I think it would improve performance in addition to
eliminating double posts and gets.
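
The visibility rules in the list above could be sketched roughly as follows. This is only an illustration in Python, not Mozilla code: `StealthCache`, `Requester`, and the `do_not_cache` parameter are hypothetical names made up for the example.

```python
from enum import Enum, auto


class Requester(Enum):
    BROWSER = auto()      # ordinary page load
    PRINT = auto()
    SAVE_AS = auto()
    VIEW_SOURCE = auto()


# Services that may always see "stealth" entries (list taken from the comment).
PRIVILEGED = {Requester.PRINT, Requester.SAVE_AS, Requester.VIEW_SOURCE}


class StealthCache:
    """Caches everything, but hides do-not-cache entries from normal loads."""

    def __init__(self):
        self._entries = {}  # url -> (body, do_not_cache)

    def store(self, url, body, do_not_cache=False):
        # Cache everything, remembering whether the entry must stay hidden.
        self._entries[url] = (body, do_not_cache)

    def lookup(self, url, requester):
        """Return the cached body, or None if the caller must refetch."""
        entry = self._entries.get(url)
        if entry is None:
            return None
        body, do_not_cache = entry
        if do_not_cache and requester not in PRIVILEGED:
            return None  # report "not in cache" even though it is
        return body
```

With this shape, a normal load of a do-not-cache page goes back to the network, while View Source on the same page is served from the hidden copy.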

These are just suggestions and probably nothing new, but they might help in the
discussion for a direction to go...
thomas: most of what you've described, mozilla already does (modulo some bugs).
it achieves the effect using load flags.  there are three principal load flags:
LOAD_NORMAL, LOAD_FROM_CACHE, LOAD_BYPASS_CACHE.  clicking on a link or entering
an address in the URL bar sets the LOAD_NORMAL flag.  file->saveas,
view->source, file->print set LOAD_FROM_CACHE, and shift-reload sets
LOAD_BYPASS_CACHE.  i should really say that this is how mozilla is designed...
there are bugs and mozilla may not set the load flags correctly in all cases.

what mozilla does not do is "pin" entries in the cache.  this means that a page
you are viewing could be evicted from the cache if you spend sufficient time
browsing in another window.  as a result, file->saveas, etc. may require a
server hit.
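
A rough sketch of the flag mapping described above, in Python rather than Mozilla's own code. The three flag names come from the comment; the `ACTION_FLAGS` table, the action strings, and `needs_server_hit` are hypothetical, and cache validation for normal loads is deliberately simplified.

```python
from enum import Enum


class LoadFlag(Enum):
    LOAD_NORMAL = "normal"              # use the cache when possible
    LOAD_FROM_CACHE = "from_cache"      # prefer the cached copy
    LOAD_BYPASS_CACHE = "bypass_cache"  # always go to the network


# User actions and the flag each one is designed to set, per the comment.
ACTION_FLAGS = {
    "click_link": LoadFlag.LOAD_NORMAL,
    "url_bar": LoadFlag.LOAD_NORMAL,
    "save_as": LoadFlag.LOAD_FROM_CACHE,
    "view_source": LoadFlag.LOAD_FROM_CACHE,
    "print": LoadFlag.LOAD_FROM_CACHE,
    "shift_reload": LoadFlag.LOAD_BYPASS_CACHE,
}


def needs_server_hit(action, in_cache):
    """Return True if the action would have to contact the server."""
    flag = ACTION_FLAGS[action]
    if flag is LoadFlag.LOAD_BYPASS_CACHE:
        return True
    # Without pinning, an evicted entry forces a server hit even for
    # LOAD_FROM_CACHE. Simplification: a cached copy is treated as valid
    # for LOAD_NORMAL as well, ignoring expiry and revalidation.
    return not in_cache
```

The `in_cache=False` case for `save_as` is exactly the eviction problem that pinning would remove.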
if(cached(current_page)) {
   display_source(cache(current_page));
} else {
   echo "Sorry, the page in question was not found in the cache, click here to re-retrieve this page";
}


If a refetch is required, the user should be well aware of it (and mozilla
should ask permission), period.  Else I assume that's the source of the current
page.

This is just my take/solution on the whole thing, and what I expect should be
done, obviously the real code will probably be quite a bit more complex than my
lame pseudo code.  :)

Is it really that unacceptable to say "sorry I couldn't view the source, here's
why:"?
steve: you have described exactly what mozilla already tries to do (again,
modulo some bugs).
i think that a separate bug should be filed to lock a window's history in the
disk cache, but i feel that full source versions of pages should not be stored
in memory. if someone decides to turn off disk cache it should be acceptable to
require a refetch from the server on view source.

what people are attempting to add into this bug would be best implemented by
changing the cache system--anything else is just a hack. i don't think anyone
wants to see the cache system extended this close to 1.0.

file a bug and dep it on this one. i see wins here if the cache is modified to
provide for a true history, especially since the cache's traditional role is all
but obsolete due to dynamic pages and fast connections.

this really should be discussed in a news group.
darin@netscape.com 2002-01-30 11:54 wrote (#215):
> what mozilla does not do is "pin" entries in the
> cache.  this means that a page
> you are viewing could be evicted from the cache if
> you spend sufficient time
> browsing in another window.  as a result,
> file->saveas, etc. may require a
> server hit.

  The only problem with this, and I don't mean to harp on this point if it is
already abundantly clear, is that Mozilla's DOM is effectively a "memory cache",
and it does get out of sync with the real cache.

  If I can see the page in the mozilla window, it is only reasonable to expect
that I can view its source, save it to disk, or send it in email, WITHOUT
requiring the browser to get it from the server again.

  My humble picks from the best solutions yet suggested:

  1) Sync the caches.  Sync the caches.  Sync the caches.  Either (a) purge DOM
from memory when a page is purged from disk, or (b) pin entries in the cache. 
(a) is the same fix as popping up a "page is gone" requestor, but doesn't leave
the user scratching his head wondering why in the world Netscape's engineers
can't save the page to disk if they can render it right there on the screen,
because it will no longer render to screen.  If it's gone from the cache, it
can't be rendered without a request.

  or

  2) add yet another cache that mirrors the DOM, saving the pure HTML.  Better
yet, attach it somewhere as part of the DOM.  This eliminates the sync problem
because if Mozilla ever goes to the cache, it will reconstruct the DOM from
(and with) the html.

  IMO, 1 is the best "fix it now!" solution and 2 is the best "real" fix, even
though it is either ugly or less memory efficient.  If I'm speaking out of my
****, forgive me.  Trying to help.
> purge DOM from memory when a page is purged from disk

You mean blank out the page the user is currently viewing?  The only place the
DOM is "cached" is in the actual view the user has.  As soon as you view
something else the DOM is destroyed....

> If it's gone from the cache, it can't be rendered without a request.

Correct.  Then if you hit "back" to go back to a page that's gone from the cache
you will get a request (and an alert).  The problem is when the source has
already been rendered.
OK. Then we can come to the conclusion that the only correct solution is to hold
a direct cache key to a pinned cache entry for every DOM that is held in mozilla.
But this was already suggested.

If I go back in history to another page and it's cached, I load it and then
I get a new direct cache key. So there is no need to mess with direct cache
keys and history. Certainly this behaviour is not perfect, but it can't be
made perfect without other tradeoffs.

Note that there are other bugs where some people want the entire DOM cached
for the purpose of page history. In that case the cache keys for the sources
should be stored in that DOM history.
Sorry for the unconstructive spam but...
If the only way to fix this is to modify the cache to allow "pinning" pages in 
the cache, we've had over 1.5 years to realize this and so I *do* think Mozilla 
1.0 should wait on it.

Also, at this point there have been SO MANY examples and counter-examples that 
I don't remember what works and what doesn't.  The basic idea seems to be: the 
original HTML for whatever is displayed (in a Mozilla window) should be 
accessible to various functions (Print, Save As, View Source, etc.) without a 
refetch.  Is this not correct, i.e. do these functions already work as I just 
described?

It seems that we are down to 2 areas of misbehavior:

1. GET requests that have been evicted from the cache will hit the server
*without* asking the user first.

2. POST requests do not work.

To address these problems i'll expose some kind of nsIWebPageDescriptor
interface that allows access to the following information:
  a) URI
  b) PostData
  c) Cache key

There will also be a LoadURI(nsIWebPageDescriptor) -- somewhere...

Additionally, i'll make sure that if the 'load-from-cache-only' load flag is set
on the view-source channel (which it should be) then the user is prompted BEFORE
a request is made to the server (in the case where the page has been evicted
from the cache...)

This is basically what boris proposed way back when in comment #186 ;-)

-- rick
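
A minimal sketch of the descriptor idea above, assuming a dictionary stands in for the cache. Only the three fields (URI, post data, cache key) and the prompt-before-refetch rule come from the comment; `PageDescriptor`, `load_for_view_source`, and the callback parameters are hypothetical names, not the eventual interface.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PageDescriptor:
    uri: str
    post_data: Optional[bytes]  # None for a GET request
    cache_key: int              # identifies the exact cache entry


def load_for_view_source(desc, cache, confirm_refetch, fetch):
    """Return the page source, prompting before any server request.

    cache:           mapping of cache_key -> body (stand-in for the real cache)
    confirm_refetch: callable(uri) -> bool, asks the user for permission
    fetch:           callable(uri, post_data) -> body, hits the server
    """
    body = cache.get(desc.cache_key)
    if body is not None:
        return body  # served from cache, no network traffic
    if not confirm_refetch(desc.uri):
        return None  # user declined; nothing is re-sent
    # Re-issue the original request, including POST data if there was any.
    return fetch(desc.uri, desc.post_data)
```

Carrying the post data in the descriptor is what would let a declined prompt avoid a double POST, and an accepted one repeat the request faithfully.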
If the majority of dynamic pages contain a "no-cache" meta tag, does the
"cache-pinning" solution mean that we're back to square 1 for developers wishing
to see their outputted page source, and for anyone wishing to save-as or send a
dynamic page (e.g., an invoice)?
Question: "Pinning" refers to preventing the expiration from cache of any 
element currently displayed in a browser window?  If not, what exactly does it 
mean?
Al: 'no-cache' documents are still put into the cache.  they are there for the
purposes of view->source, etc.  normal page loads of a 'no-cache' document will
bypass the cache.
How about no-store documents? Is there any way to ensure that those get pinned,
say, only in the memory cache - so that they can still have view-source done on
them, but will never ever enter the disk cache?
stuart: yes, even 'no-store' documents are cached (in the memory cache only) for
the purposes of view->source, etc.  this was only recently changed to be this
way... previously 'no-store' documents were not cached.
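
The policy described in the last few comments might be summarized like this. It is a sketch under assumptions: `cache_policy` and its return values are hypothetical, and treating 'no-store' documents as unusable for normal page loads is my reading rather than something stated above.

```python
def cache_policy(directives):
    """Map Cache-Control style directives to (device, usable_for_normal_load).

    device is where the copy kept for view->source etc. may live.
    """
    seen = {d.strip().lower() for d in directives}
    if "no-store" in seen:
        # Kept in the memory cache only, never written to disk.
        return ("memory", False)
    if "no-cache" in seen:
        # Still cached, but normal page loads bypass the cached copy.
        return ("memory_or_disk", False)
    return ("memory_or_disk", True)
```

In every case a copy exists for view->source and file->saveas; the directives only change where it lives and whether ordinary navigation may use it.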
*** Bug 120457 has been marked as a duplicate of this bug. ***
darin: Would it be possible to associate a particular cache entry with each
document currently on the screen, and (through the magic of reference counting
or something along those lines) guarantee that that cache entry not be
overwritten so long as the relevant page is open? That would mean that the cache
could end up containing three or four copies of a page with the same URI, so
that view source would work (you would use these cache entries to load the page
in view source or save as); as soon as those windows got closed, the entries in
the cache would be deleted (or marked as out of date or whatever). In the real
world this would not cause much bloat (how often do you visit the same page in
multiple windows?) but it would solve the view-source-of-the-same-page-multiple-
times problem.
hixie: the cache already provides support for exactly what you're describing. 
we'd just need to hook up docshell and friends to utilize it.  there are issues,
of course, that need to be resolved... consider the hypothetical example of an
ever increasing number of windows (eg. you've just visited an evil website)...
at what point does the cache refuse to hold items in the cache?  does it ever
refuse?  does it honor or ignore its size limits?  it seems to me like we need a
solution that allows the cache to honor its size limits, otherwise we run the
risk of serious meltdown resulting from a malicious website.
Also, the cache would only hold multiple entries for the same URI if HTTP
thought they resulted in different documents.  Otherwise, the reference held by
each window points to the same cache entry.
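
The reference-counting idea discussed above could look roughly like this. It is a toy sketch, not the cache's real API: `PinnedCache` and its methods are hypothetical, and the key is assumed to be a (URI, variant) pair so that two windows share one entry unless the documents actually differ.

```python
class PinnedCache:
    """Cache entries stay alive while at least one window references them."""

    def __init__(self):
        self._entries = {}  # key -> body
        self._pins = {}     # key -> number of windows showing this entry

    def open_in_window(self, key, body):
        # Windows showing the same document share a single entry.
        self._entries.setdefault(key, body)
        self._pins[key] = self._pins.get(key, 0) + 1

    def close_window(self, key):
        self._pins[key] -= 1
        if self._pins[key] == 0:
            del self._pins[key]  # entry becomes an ordinary, evictable one

    def evict(self, key):
        """Try to evict; refuse while any window still pins the entry."""
        if self._pins.get(key, 0) > 0:
            return False
        self._entries.pop(key, None)
        return True

    def get(self, key):
        return self._entries.get(key)
```

Note that this sketch never evicts a pinned entry, so it does not address the size-limit concern raised above; a real design would need some policy for that case.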
I'd venture to suggest that if we refuse to hold the source in the cache, we
should refuse to open the window ( / display the page), period.

My suggestion would be to allow the cache to expand a limited amount over its
official maximum size for this purpose (eg 10% more, with a lower limit to
handle the case where the cache is set to zero) and if loading a page would
break this rule, refuse to load the page.

The really difficult case would be what to do with a huge (bigger-than-cache)
file being displayed in a pre-existing window. I'm not sure what we should do
there - either truncate the page, or make an exception to the cache-size rules
for a single page that's bigger than the cache for the duration of time that
it's being displayed in a window. Either way, I don't think we should *ever*
break the contract that if we can display something in a window, we can provide
its source on demand exactly as it was received from the server.
Darin:

A malicious website could exhaust the system memory just by launching a ton of
windows, but the problem would be escalated by the additional cache footprint if
pinned entries in the cache are never released as well.  I think the overhead
of the windows would have more of an impact.

FWIW, the cache should honor its constraints. I don't think the cache should
ever start refusing to add entries, but it definitely should start expiring
entries even pinned entries when the situation you described exists.

Could there be a warning or a notice to tell a user that they may want to
increase these options if the pinned cache size gets too close to the limits?
thomas: and if we're talking about the memory cache?  HTTPS only goes into the
memory cache... also, what if the user has disabled their disk cache?  then,
currently, only the memory cache is used.  meltdown is more serious and more
common if there are not hard limits on the size of the memory cache.
> My suggestion would be to allow the cache to expand a limited amount over it's
> official maximum size for this purpose

Currently-open document source (not DOM) should NOT count against cache. It's a
functionality issue, not performance. If it's not available, Save and view
source BREAK. Not performance "oh it's slow" break... dataloss "#!&%$" break.

Save and view source should work perfectly if the user has 0 mem and 0 disk
cache. Think lower-end machines with a network wide cache on the LAN. Why waste
separate resources on the local machine? But certainly those users should still
expect save and view source (etc) to work.

Or think power users that know HTTP caching. Even they won't understand what a
DOM cache is, and why what they see they can't save.
> we should refuse to open the window ( / display the page) 

Let's not lose sight of things here.  This is a browser.  Its primary purpose is to display pages.  All else is secondary and, ultimately, unnecessary.  We should display the page no matter what.  If we can't show the source for a particular page we are displaying we should just say so ("The page has expired, ...."). 
I wouldn't go so far as to say non-browsing features are "unnecessary".  After
performing some sort of financial transaction via the web, saving the resulting
confirmation/tracking page can certainly be considered necessary.
Keywords: topembed
There has been a lot of discussion here about how to handle this bug.  I am not
sure that this is the appropriate place for discussion, but since everyone else
is doing it, I guess I will too.  :)

The end users expect to be able to save, print, or view source of any
web page that they are looking at.  Any other behavior should be considered a flaw. 

In order to accomplish this, we need to be able to access the source for every
document currently in any open window or tab. 

The consensus seems to be that the fix is to pin the source for each currently
open document in either memory cache or disk cache.  The debate seems to be
about how to handle the various scenarios presented by the fact that memory
cache and disk cache are both limited resources and too many open documents
being pinned can potentially cause Mozilla to deplete one or both of those
resources.

To address the potential issues, I propose the following (*see footnote) :

A) If disk cache is disabled :

1) The source for all documents currently in an active window or tab is pinned
in memory cache until memory cache is exhausted.

2) When memory cache is exhausted, prompt the user that memory cache is running
out and give 2 choices: a) close some open windows or tabs in order to free
memory cache, b) enable disk cache.

B) If disk cache is enabled :

1) Store and pin open https document source in memory cache.

2) Store and pin open non-https document source to disk cache.

3) If memory cache is exhausted by pinned https document source, prompt the user
that memory cache is running low and give 2 choices: a) close some open windows
or tabs to free memory cache, b) break security and temporarily enable
disk cache for https document source.

4) If the specified disk cache limit is reached, prompt the user that the disk
cache limit has been reached and give 3 choices: a) closing some of the open
windows or tabs, b) temporarily allow the disk cache limit to be exceeded, c)
increase disk cache limit.

C) Alternative :
1) Automatically store and pin non-https document source to disk even if disk
cache is disabled or the disk cache limit has been reached.  

2) If disk cache is disabled, these document source files can be deleted from
disk as soon as the document is no longer in an open window.  

3) If disk cache is enabled, these document source files just become unpinned
cache files and are handled accordingly by the disk cache.
 

This alternative method of addressing the issue can be a preference setting so
that users can avoid almost all of the prompts described in the previous method
(the exception being when memory cache becomes full of open https source). 

The only real danger of this alternative method is in the case of the malicious
website that opens an infinite number of windows and thus fills the user's hard
drive with pinned source files, but we can just file a different bug for that,
right?  :D
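
For readers who prefer code, sections A and B of the proposal above reduce to something like the following. This is a sketch only: `pin_location`, `choices_when_full`, and the choice strings are hypothetical names that paraphrase the prompts in the proposal.

```python
def pin_location(is_https, disk_cache_enabled):
    """Where the source of an open document is pinned (rules A1, B1, B2)."""
    if not disk_cache_enabled or is_https:
        return "memory"
    return "disk"


def choices_when_full(device, disk_cache_enabled):
    """Options offered to the user when a cache device fills up."""
    if device == "memory":
        if disk_cache_enabled:
            # Rule B3: memory cache exhausted by pinned https source.
            return ["close windows or tabs",
                    "temporarily allow https source in disk cache"]
        # Rule A2: no disk cache to fall back on.
        return ["close windows or tabs", "enable disk cache"]
    # Rule B4: disk cache limit reached.
    return ["close windows or tabs",
            "temporarily exceed disk cache limit",
            "increase disk cache limit"]
```

Alternative C would replace the non-https branch of `pin_location` with an unconditional "disk" and drop the B4 prompt.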



(*Note: I am by no means claiming that this solution is all my own original
idea. Obviously others have previously presented many of the aspects of this
solution; I am only trying to summarize and organize it all into one concise
description of a solution.  If anyone can think of a scenario that I have not
addressed, please let me know --preferably via email.)
I'm going to put my two cents in.  I'm not a mozilla developer but instead an
end user.  I develop web sites and thus have an understanding of what I expect
of mozilla based on this.  I don't understand a lot of the technical debate
regarding this bug and I don't have time to.  What I do know is what I expect
the browser to do.  Maybe by presenting my expectations you guys can figure out
how to 'make it so'

When I view the source of a web page, I expect the browser to show the source
for that page.  If I've got two or three different 'versions' of the page open,
then the source for each version should be correct for that version.  So if I'm
working on a CGI script and I make a change to the script (but haven't refreshed
the browser window yet) I expect to see the original scripts results, not the
changes I've just made.  NOTE: If I want an up to date view of the latest source
I will refresh the page before viewing the source. If I haven't reloaded the
page, then I want to see the source for the un-reloaded page, not the new source.

The same goes for printing, or any other aspect of the page.

As I said, I don't understand a lot of the technical aspects of caching and
what-not.  However I do expect the browser to act in a predictable way.  This
means that when i view source, or print, I get results based on the page as it
is, not as it may be now.  this means that if I have numerous versions of a page
 in various windows, and I have made changes to the underlying file, then each
page should be different and reflect this in the source, etc.

Hope this helps ;-]
Actually, now that I think about it a little more, this is what I think:

The "cached" copies of documents that are *currently* being displayed in
currently open browser windows (or tabs) should *not* be counted towards the
total size of the cache, whether it be memory or disk. In other words, the
maximum possible size of the cache should be the maximum size specified by the
user *plus* the size of all documents currently open in browser windows.

As soon as a window is closed, or the user browses to a different document
(causing a document to no longer be displayed in an open window), its size
should start counting towards the total cache size; at this point, therefore,
older entries should be evicted in order to bring the cache down within its size
limits.

This works out right if the cache is disabled, too: then the default cache size
is zero, so as soon as a document is no longer being viewed, it will become the
oldest document in the cache and be immediately evicted.

At first glance this seems like weird behavior (or at least it seemed weird to
me) due to the timing of when things would be evicted. But actually it's exactly
the behavior you would expect if you treat the "currently-viewed cache" as
completely distinct from the regular cache. As soon as a page is no longer
currently-viewed, it moves into the regular cache, and that has the same effect
as what happens today when a page is put into the regular cache.
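
To make the timing concrete, here is a small sketch of the two-tier behavior described above. `TwoTierCache` and its methods are hypothetical; sizes are counted as `len(body)` and "oldest" means least recently moved into the regular cache, both simplifications.

```python
from collections import OrderedDict


class TwoTierCache:
    """Currently viewed pages are held outside the size-limited cache."""

    def __init__(self, limit):
        self.limit = limit            # user-specified cache size
        self.viewed = {}              # url -> body, not counted
        self.regular = OrderedDict()  # url -> body, oldest first

    def view(self, url, body):
        # A page on screen never counts towards the limit.
        self.regular.pop(url, None)
        self.viewed[url] = body

    def stop_viewing(self, url):
        # Window closed or navigated away: the page starts counting now.
        self.regular[url] = self.viewed.pop(url)
        while self.regular and self.size() > self.limit:
            self.regular.popitem(last=False)  # evict the oldest entry

    def size(self):
        return sum(len(body) for body in self.regular.values())
```

With `limit=0` (cache disabled) a page is dropped the moment it stops being viewed, which matches the behavior described for a disabled cache.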
> should *not* be counted towards the total size of the cache

The problem with this is that users limit cache sizes for a reason.  I have a
limited quota and I limit my cache such that cache will not push me over quota. 
If we go over that limit we better be doing it in some temp directory somewhere
and not in the profile....
FWIW, I agree that currently displayed elements should not count towards the
cache limits.  As far as most users are concerned, a cache is where *done* items
are stored for quick retrieval.  I also think that the common user would
*expect* a browser to fail gracefully and refuse to open new windows when, as
they see it, "their system's memory has been taxed to the limit."

In other words, don't be afraid to present the user with a refusal to perform. 
They are more likely to blame it on their computer, its lack of memory, or slow
processor, :-| than on mozilla itself. (And, imho, rightfully so!  It isn't
mozilla's responsibility to make sure there are enough resources to open yet
another item.  That duty belongs to the user or, more to the point, the user's
pocketbook.)
> I have a limited quota and I limit my cache such that

This is nothing to do with disk quota. What I (and about 3 other people now)
propose is keeping it in core. If it uses the normal memory cache mechanism,
fine. If it's separate, whatever. Once the user leaves the page then Mozilla
will decide whether to keep it in memory or disk cache, or neither.

Users expect the more pages they have actively open, the more memory usage will
rise. This is how any other app works, cache or no cache.
just to bring this discussion back on track

 > consider the hypothetical example of an
 > ever increasing number of windows (eg. you've just visited an evil website)...

i just tried this out to see what currently happens in this situation, mozilla
crashed with 146 pages open (using 100% of cpu and all ram - 384meg). so
basically mozilla is going to crash or the system is going to become
*completely unusable long before the size of total pages open exceeds size of
cache*.

it would take 500 - 1000 pages to exceed the default disk cache size of 50000k
so this should never become an issue.

so all that needs to be done is to 

 > hook up docshell and friends to utilize it.

So I think that (per benm comments) we shouldn't try to put resource control 
into cache management routines (resources in the global meaning, that is,
utilization of machine's disk and CPU resources by Mozilla), and in this bug not
worry about this again.
A separate bug should be filed about implementing in the future some sort of
global resource control in Mozilla that would prevent /Java/Javascript/Flash
plugin/ any other plugin/ from executing DOS attacks that lock up client machine
(eg. by JavaScript opening too many windows, or injecting too much HTML code
into the document). This feature would impose dynamic limits on memory
allocation, cpu utilization, number of opened windows, etc. The limits would be
dependent on amount of resources available on the particular machine, would be
calculated dynamically and the user would be notified gracefully if the limits
were exceeded.
This issue is _separate_.
Please someone qualified (preferably someone who knows a proper component owner
for that :-)) file a separate bug.

The limits on cache size should only pertain to content that isn't displayed in
any window (that is, to content that is not pinned).
This may be difficult, but a browser isn't made in 6 days, especially one that
is implemented correctly :-)
benm, olo: we should not try to design a system that makes this worse.  i don't
understand why current problems w/ many windows is justification for not solving
this bug correctly.  we should care about enforcing cache limits... the cache
limits are a user preference.  you may argue that they are not your preference,
but then that is yet another preference.  IMO we should strive to come up with a
good compromise solution to this problem.  we shouldn't race to solve this bug
by pushing out any concern of cache limits.
darin: those of us proposing "ignoring" cache limits for currently-visible files
are not proposing that the cache limits be weakened, but rather that the
*meaning* of "cache limits" be redefined. When I wear my "developer" hat, I
understand the reason why you might want to consider a currently-being-viewed
page as part of the cache.

But when I put on my "user" hat, I don't think of it as part of the cache at
all: the cache is an *optional* feature, which can be disabled without hurting
browsing. It stores *recently accessed* pages, for ready availability if I
should view the same page *again*. From a user standpoint, even if not from an
implementation one, this is significantly different than storing the *currently
visible* page for ready access if I need to *do something else with that page,
such as see its source*.

So my position is that as a user, I would not expect the cache limits to include
a limit on the number of pages I can keep open at once without losing
functionality like "view source". When I, as a user, set a cache limit, I'm
making a tradeoff of memory/disk versus performance, not memory/disk versus
incorrectness.

The difficulty comes when the disk cache comes into play, and the user specifies
a location for that disk cache. By my theory, "currently viewed pages" should
not be stored in that location because they do not count towards disk cache
size, and the user might limit their cache size based on a known amount of free
space in the cache location. On the other hand, it is vital that moving items
from currently-viewed to cached (and back, if necessary) be an efficient
operation, which may not be the case between two different cache locations,
especially if they are on separate drives.

The only easy solution I can think of would be to ensure that currently viewed
documents only get stored in memory, not disk. But that adds to bloat. I guess
the UI could give a disclaimer that the cache directory will also be used to
store currently viewed pages and therefore will take more space than the limit
you give it?
sballard: i agree with your conclusion... clearly it is bad to store stuff in
the disk cache w/o counting toward the total size of the disk cache.  and
clearly storing in-use pages only in memory is bad too because of the bloat that
would incur.  and, i too agree that it would be very nice from the point of view
of the user/webdeveloper to be able to count on view->source and file->saveas
always giving me what i am looking at.  i'm not really arguing against that as a
goal.  instead, i'm just saying that we need to weigh the cost of doing so, and
consider that it might end up breaking down at some point.  that there might
need to be some threshold at which we simply cannot guarantee that the source
for an in-use page is accessible on the local system.  we should be able to set
that threshold high enough so that 99% of the users will never know it exists. 
i'm suggesting that a threshold exist and that it be configurable.
How about a pref that goes with the caching prefs, and looks like this:

"Allow [ 10]Mb of additional space in the disk cache for pages currently being
viewed (necessary for "view source" and "save page", among other things)."

Ideally, that number would default to something nice and high, but the option of
setting it to "unlimited" should also be available (I'd certainly choose that,
as a web developer). Perhaps:

      ( ) Unlimited
Allow                additional space in the ... (as above)
      (*) [ 10]Mb of

This wording would clearly indicate that this amount would be *additional* to
the amount already specified in the disk cache, so users would know to make sure
to allow enough extra space for it on disk.

If a user reduces that number, they can't complain if "view source" starts
working wrong. But if they specify "unlimited", it should be absolutely
impossible to get into the state where a currently-visible page cannot have its
source viewed or saved (correctly!).
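
The proposed pref amounts to a simple budget check. A sketch, assuming the pref is read as a number of megabytes or `None` for "unlimited"; `can_pin` and its parameters are hypothetical names.

```python
MB = 1024 * 1024


def can_pin(pinned_bytes, new_bytes, extra_limit_mb):
    """Return True if a newly viewed page may be pinned.

    pinned_bytes:   bytes already used by currently viewed pages
    new_bytes:      size of the page about to be displayed
    extra_limit_mb: the pref value, or None for "unlimited"
    """
    if extra_limit_mb is None:
        return True  # "unlimited": pinning can never be refused
    return pinned_bytes + new_bytes <= extra_limit_mb * MB
```

Only the `None` case gives the guarantee asked for above; any finite value reintroduces the state where source for a visible page may be unavailable.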
i'm going to put on mpt's hat 'no more prefs'

you have 3 prefs (mem size, disk size, when to check for new objects), they 
should be enough.

I will ask one question: What do you expect mozilla to do when it encounters a 
single page that requires say 1gb of memory to render?

I have two versions of the page on my system, one with local images, and one 
with remote images.  I've loaded the page on things ranging from 24.0 dialup to 
a fast university net connection.  [My system has never had more than 1/2gb of 
physical ram]

If I were to load that page and leave some other windows open, should i expect 
mozilla to keep around the source and images for all my open windows?
timeless- Yes.


and give the user an enormous big red button that says 
"due to low system resources you will not be able to both load this entire page
and reliably use your other windows" 
[close other windows] [do my best to keep them all]
A couple questions for timeless -

After rendering those pages in a current version of Mozilla, does it do a
refetch of the source when you request to view source?

If you look at those pages in another browser (IE, Opera, whatever), are you
able to view source without a refetch?

And some comments for everyone -

The way I see it the primary issue of concern in this bug is about the text
source of the pages.  If Mozilla were to pin the text files in cache and not pin
any of the images, flash, sound, etc then that would resolve most of the
problems we currently face now and since the text source files are generally
fairly small, we are not likely to have the cache limit exceeded by pinned
source unless disk cache is disabled or set to a very low limit.

In regards to viewing or saving source, pinning only the text files would make
it work the way it should.

In regards to saving a whole page (text source, images, and all other
components), there could still be some problems with dynamic content if
everything did not get pinned, but we would not be any worse off than we are now
(in regards to saving the whole page correctly).

As for printing, I am not up to speed about how printing works now, does it
depend on refetching or cached source or is it entirely independent of this bug?
 I personally have not had any problems printing pages correctly lately, but I
can not say for sure if that is due to recent printing code changes or just good
luck.

It has been said before but I'll say it again:

Printing is NOT affected by this bug in ANY way. Printing works on the DOM and 
doesn't use the source. So printing never refetches from server even if the 
source of the page has disappeared from all caches. It also works perfectly if 
you have 10 pages from the same url open and they all look different.

If you don't like this behaviour then PLEASE don't argue about it here, this 
bug is spammy as is anyway. File a NEW bug and present your arguments THERE.
what are the chances of getting a fix for this before 099?
Keywords: nsbeta1
No longer blocks: 107067
> pin the text files in cache and not pin any of the images

How do you make this distinction?  What about SVG or other XML-based image
formats?  Those are definitely "images" and they definitely have a "source".  (I
can see not pinning things referenced by <img> tags or something, I guess).

Just something that should be kept in mind, there is such a thing as
dynamically-generated images -- think weather maps, for example.  So the "don't
pin images" idea will not be bulletproof.  Whether or not it's an acceptable
tradeoff I couldn't really say.
excuse me sticking my oar in as well, but at least one bug depending on this
assumes this will fix images, not just text files.

btw i'm not quite sure what's going on with the dependencies - bug 118487 is
dependent on this bug, but that is marked as a duplicate of 115174. shouldn't
someone fix it so the original is the dependent? bug 120809 is the other one
that wouldn't be fixed with a text-file-only pinning.
It's simple: pin the source text of the displayed URL.  If you open an SVG image
in its own window, then of course you pin its source.  On the other hand, if the
image is merely linked in the href of an HTML page, you don't pin it.  The same
applies to linked scripts and stylesheets.
> if the image is merely linked in the href of an HTML page, you don't pin it

This would definitely be unacceptable. EVERY object on EVERY open page (windows
and tabs) needs to have its original data available in system memory. Anything
that can be changed through the DOM or whatnot would need a separate copy, which
I presume would be created on first change.
_Why_ do you want every object to be immediately available?
I think it's enough to have every object that can be directly
saved/view-source'd (e.g. images, iframes, normal frames, the document itself)
available.

Things like .css or .js files seem to be unnecessary to me.
> Things like .css or .js files seem to be unnecessary to me.

"Save as..." -> "web page complete" needs them. We should be able to load a
complex page with all kinds of dynamic content and crazy nocache directives,
lose the network connection, and still be able to save the whole page.
This bug's summary is "need means to reuse/reload current page. . . " not
current image or current JavaScript or otherwise. In reality, everything on the
web is static, except for numerous HTML entities and some images. To fix this
bug, we have to focus on reusing the already downloaded HTML file. 

There is no way we can recreate an HTML file from the docshell exactly as it
was. For "view source" and other operations, however, that is the requirement. 

The simplest and most effective solution is to keep a copy of the current HTML
file in core. There would be no error messages, no cache keys, no pinning, and
no prefs. It would cost memory, but sometimes you have to bite the bullet. For
every open window that is displaying an HTML file (or files, if there are
frames), Mozilla should store the HTML file(s) in core. As a later enhancement,
we could gzip the HTML files so they will consume less RAM.

I assume that after Mozilla downloads an HTML file for display, it generates a
docshell, then it tosses the HTML file away. My suggestion is to not toss the
HTML file away until the window closes or the window displays some other page.

This will be hoggish when users start downloading 2GB HTML files every day, but
that hasn't happened yet to any degree of frequency. Should that become common,
we can address that problem at that time. Today, however, storing the HTML in
core makes sense. Take the worst case scenario, a power user with 20 Mozilla
windows open, each with 100k HTML files. That's only 2 Meg or so. As a power
user, he should expect to use a lot of memory. Most users have one or two
windows open, and most users download HTML files of less than 100k. Admittedly,
Slashdot is the exception. In this case, however, the exception proves the rule.

It would be reasonable to limit the size of any HTML file stored in core to an
arbitrary file size limit, such as 1 megabyte.

Then we can employ a simple formula such as: when view source (or related
operation) called, use HTML file from core. If not in core, use HTML file in
memory cache. If not in memory cache, use HTML file in disk cache. If not in
disk cache, re-download the file. When such operations ask for images,
JavaScript files, and stylesheets, just get them from cache or redownload them.
They're static anyway, so the only issue is performance.
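
The lookup order proposed above can be sketched roughly as follows. This is only
an illustration in Python, not Mozilla code: the three stores are modelled as
plain dictionaries, and `get_source` and `refetch` are hypothetical names
invented for the sketch.

```python
from typing import Callable, Dict, Tuple

# Each tier maps a URL to the stored page source. Plain dicts stand in for
# the real stores, which this sketch does not try to model.
Tier = Dict[str, bytes]


def get_source(url: str, core: Tier, memory_cache: Tier, disk_cache: Tier,
               refetch: Callable[[str], bytes]) -> Tuple[str, bytes]:
    """Return (where_found, source), trying the cheapest tier first."""
    tiers = (("core", core),
             ("memory cache", memory_cache),
             ("disk cache", disk_cache))
    for name, tier in tiers:
        source = tier.get(url)
        if source is not None:
            return name, source
    # Last resort: go back to the server (hypothetical callback).
    return "network", refetch(url)
```

The point of the ordering is that the network is touched only when every local
copy is gone, which is the case this bug is about.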

If a 100 MB HTML file is generated dynamically, heaven help that server farm.

As for the possible SVG image problem, a separate bug should be filed for that.

Plain text files (TXT's in the Win32 world) should not be stored in core, since
they are going to be static files anyway.

As for dynamic images, we can't fix that. Remember that printing is NOT
affected by this bug. See comment 255. 

One may wish to send an image via e-mail without redownloading the image. Most
dynamic images, however, will be "mapquest" maps (courtesy of Navigation
Technologies Inc., BTW), and will be the exact same when redownloaded. As for
weather maps, there is not any real problem. If a user downloads a weather map
at 12:00, and then waits 6 hours and finally sends the file by e-mail, and he
expects the sent map to be just as it was at 12:00, he has a problem that
Mozilla just can't solve. Most weather maps will be looked at and handled during
a few minutes time at most, making any change negligible. It is a safe
assumption that most users would want the most current map anyway. Finally, many
weather map sites force a redownload every so often, so there is little point in
trying to "fix" this.
Let us first get a mechanism for pinning the *html* page (or image or whatever 
is currently being viewed) before starting to talk about all linked resources. 

The scope of this bug is about creating a mechanism for pinning things, in 
particular the current page. Everything else is subject for another bug. Feel 
free to create a bug on "Make all linked resources to a page refetchable 
without refetching from server".

I won't give my thoughts on whether I think that should be done in this bug,
since it would only encourage further discussion.


At least this is my opinion; rpotts, darin, do you agree?
Why can't we add a new attribute to the DOM, say document.source, which would be
read-only attribute containing the HTML code of the corresponding page?
Um, why on earth would we need to expose the original source of a document
(potentially hundreds of kB) *through the DOM* as text?

PS. I haven't followed this discussion, so I could be missing context here...
jst, that was just an idea I had. It's not like we _need_ to do it.

I don't see, though, where the difference is between using the DOM to access it
or some other way.
Any change we make to the DOM can affect any webpage out there (by introducing
new names into the namespace in question), so to expose things through the DOM
we need a really good reason to do so. And exposing huge strings to JS is in
general not a good idea since they end up living in the JS GC world and
potentially staying around in memory for a long time after being accessed.
People, the problem is not storing the data. The problem is storing the data 
_without adding too much bloat_. Just so that everybody understands, bloat in 
this context is memory consumption. So simply storing the data in the document 
is not an acceptable solution, since it will consume too much memory.

Remember that Mozilla is already criticized for using too much memory, and that 
is just for features like CSS/DOM/XML/HTML/JS, i.e. things that many people 
think are more important than the features that are blocked by this bug.

I'm not saying that this is not important or that I think it shouldn't be 
solved. I'm just making sure you all understand what the problem is.

Using the cache system and pinning things in the cache saves us much memory, 
since most of the time we will get the pinned source for free; it already 
exists in the cache anyway. So what we need to solve is the remaining cases 
where the current cache system isn't solving the problem.

I.e. we need to give the cache the ability to guarantee that the source is 
available, as often as possible, preferably always. Please don't come with 
arguments like "anything less than 100% is unacceptable"; isn't it good if we 
can increase the number from the current, say, 90% to 99%? Sure, we should aim 
for 100%, but if we don't get there right away it's not the end of the world. 
There's always tomorrow.
The advantages of hanging the source off of the DOM are:

1- it's the most natural place (from an internal perspective) for it to go,
since the DOM is what Mozilla keeps around to render/print the page, and
whenever the page is renderable/printable, it should be
view-sourceable/sendable/save-asable.

2- no ugly cache pinning/duplicating, etc problems

3- all the problems and nuances of pinning discussed in the last 50 or so
comments go away.  Viewable-but-not-saveasable goes away.  The 0-size cache
problem goes away.  Even the oh-gee-this-page-is-7-terabytes-what-should-we-do?
problem goes away because you simply run out of memory and go away or give an error.


The disadvantages:

1- this exposes the .source node to the user (through javascript or whatnot),
which really isn't a good idea or even beneficial to the user in really any way.

2- memory bloat.  I question this, however -- doesn't representing the document
as DOM take up more memory than representing it as plain html?  Even if the DOM
were a bit smaller than the html, we aren't talking about any orders of
magnitude, are we?
  It seems to me that if you want to cut back on Mozilla's bloat, there are a
thousand other things to do away with before you do away with "correct
functionality" and "the right solution."  I mean, for heaven's sake, purge all
the forward/backward pages and make me reload those (from the cache or the
server) before you make me reload the CURRENT page just to save it to disk.


  Can we somehow attach the source to the DOM internally without exposing it to
the user?  There is a 1-to-1 correspondence between pages that need the source
held and pages which have a current DOM in memory.  This is not true of any of
the cache pinning solutions.

  I really don't think it is unreasonable in any way to make the browser keep a
copy of the html of visible pages in memory -- isn't this how the most
rudimentary browser you could conceive of would do it anyway?  If we don't
attach it to the DOM, is there some other reasonable place in memory we can
stick the source, even if it means paging some of this out to disk once in a
while to appease the memory-bloat argument?  I'd hate to suggest the creation of
yet another cache, but it seems to me (a completely non partial observer :) that
page source needs to be held, and it seems equally obvious (from this week's
traffic on this bug) that the traditional cache is not the place to be doing it
unless you want to add all manner of kludges...



  I think I've now gone over my allotted 2 cents.
Yes, the in-memory DOM representation is larger than the source of the document,
but whether or not it's an order of magnitude larger is beside the point; the
point is that it's large enough to not want to keep it in memory. Either way,
keeping the source in memory, even if it is far smaller than the DOM
representation of the source is unnecessary bloat that we can not in any way
afford. We're way way too bloated as it is, if we'd start holding on to the
source of the document in memory too we'd need a really really good reason to do
that. I still haven't heard any reasons close to good enough to pay that cost
(not that I've listened that closely, but still...).
>Can we somehow attach the source to the DOM internally without exposing it to
>the user? 

Hm... from my understanding of DOM and IDL, this would be possible:
Just add this to nsIDOMDocumentInternal:
[noscript] nsAString getSource();
(or [noscript] string getSource(); or whatever an appropriate string type would be)

This would, as I understand, eliminate problems with GC, as the function would
only be called by C++ Code, which isn't GC'ed. Or is it?
Why does everyone seem to want to put everything in memory? The only thing that
_absolutely_ needs to be put in memory is encrypted files. Personally I don't
care if it takes 5 msec to see the source or 5 sec, just as long as I can
actually see it.
So why not just follow what others already have posted and dump everything we
get from the server in eg. /tmp, that includes images, html, xml etc. This
enables the view-source, save-as and other? features and makes them function
correctly.
This disk usage should not count against the setting for the cache, because it
is not cache. The cache setting is to specify just how much memory/disk you are
willing to spend on *caching* data! (I have never understood the memory cache of
mozilla/netscape, why is it needed? The OS just does fine caching the files
anyway, while being much better at handling the size of it...)
Various people suggested that this bug should refer to all accessible documents 
(including images and .js and .css files). However, we should think about the 
average use case. People who need save as/view source want it to save the 
correct HTML page they are looking at, without a refetch. When users use 
power-features like save entire page, they should expect a refetch of images or 
other non-critical data if it has been expired from cache. Also, they should 
expect a more updated version be saved, just as if they reopened the URL in a 
new window - You don't expect to see the exact same thing.
Therefore, I think only the currently-viewed HTML/xml/whatever source should be 
saved, and only if it's uncachable (if it's cachable, then we don't expect it to 
change, thus we may refetch). Maybe we should pin cachable documents to make 
them remain in cache.

Another option is that if a cachable page is removed from cache and then 
refetched for save/view source, we should check that the newly fetched page is 
the same as the old one (via hashing or other means), and if not, warn the user 
with a message such as:

"The page you are about to save/view may be different from the displayed page. 
Proceed with view/save?"
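
A minimal sketch of the "via hashing" check, in Python for illustration only. It
assumes a digest of the displayed page was recorded when the page was first
loaded; the choice of SHA-256 and the names `digest` and `needs_warning` are
assumptions made for the sketch, not anything that exists in Mozilla.

```python
import hashlib


def digest(source: bytes) -> str:
    """Fingerprint of a page's raw bytes, recorded when it is displayed."""
    return hashlib.sha256(source).hexdigest()


def needs_warning(displayed_digest: str, refetched: bytes) -> bool:
    """True if the refetched bytes differ from what the window is showing."""
    return digest(refetched) != displayed_digest
```

Only the digest has to be kept per window, so the memory cost is a fixed few
dozen bytes per page regardless of page size. The cost is that a mismatch can
only be reported, not repaired.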

I suggest adding a pref such as:

Keep source of viewed pages:
( ) Keep the entire page (increases memory consumption).
(*) Keep only the main page and frames.
( ) Do not keep source (Save and View Source may not work as expected).

In my opinion, if a user views a very large static HTML file, it is very 
important to save it because the user will most likely want to save the file 
(think the Jargon file in one big HTML document).

In any case, if the fetch of original text is not applicable, I suggest a 
message similar to expired POST data:

"The page you are trying to view/save has expired from cache. Refetch?"
The only reason we care about bloat is that it has a negative impact on
performance (in virtual memory operating systems). If we keep the source file in
DOM, we would increase Mozilla's memory consumption by a few hundred kilobytes
on average. This would hurt performance. If we resolve this bug in a way that
results in heavy disk access by every browser window, as suggested by comment
274 and many others, however, then performance will take a huge hit. The first
option is better overall.

I'll be quiet after this message. I've made the case for keeping the source in
DOM. As suggested above, the potential security problem can be avoided. Finally,
after a couple more of the keyword: footprint bugs are fixed, the memory use
issue will balance out.
Responses to what various people have said, and my last comment in this bug
until there is actual code involved:

> downloads an HTML file for display, it generates a docshell, then it tosses
> the HTML file away.

You make it sound like we have the HTML and then drop it.  We never have the
entire HTML file at once (except in the cache).  As packets come in necko sends
notifications to the parser which processes the new data and waits for more.  We
would need to keep reallocing the data as more came in (this would be quite slow). 

> Plain text files [...] are going to be static files anyway.

This is a completely bogus assumption...

> I mean, for heaven's sake, purge all the forward/backward pages and make me
> reload those (from the cache or the server) 

That's what we already do.  All that stuff is stored in the disk cache and if
it's expired from there we put up a dialog telling you so and saying that we
will be reloading from the server.  How often do you run across this dialog?

> isn't this how the most rudimentary browser you could conceive of would do it
> anyway?

The most rudimentary browser I can conceive of would do a lot of things that
scale poorly and are inefficient, including that one, yes.  It would also not
lay out most web pages out there...

At this point, the thing I suggest is that we wait for Rick's
nsIWebPageDescriptor work.  That should allow immediate improvements in View
Source and Save Page/Save Image that will solve the problems people are seeing
in most cases.  At that point we can see what issues still remain and how common
they are, and discuss ways to solve them.
1. Can someone confirm that "90% of the time" the source-HTML is not refetched 
when doing View Source/Save?

2. Am I correct in this summary of the competing PRO arguments?
- Pinning-the-cache is more "elegant" because it potentially
  does away with any memory bloat
- attaching the source-HTML to the DOM is simpler since
  there's already the one-to-one relationship with browser windows

3. What does the memory cache DO anyway?  I assumed the DOM for all open 
windows was stored separately since the memory footprint swells consistently as 
you open more windows.  Or, is the DOM part of the memory-cache but the memory-
cache is implemented per-window?

4. If we implement some form of cache-pinning, I would suggest a user-prompt 
like this,
 "A cached copy of this web page is no longer available.
  (You may need to increase the size of your disk/memory cache)
  Would you like to continue, by reloading the page from the web-site?"
  YES / NO

5. When you do a Save As in IE 5.5 it *reloads* most page elements, though I 
don't know about the HTML specifically.
Sorry for the additional spam, but I just did a quick test of the Save As.  
------------------
I had this bug's Bugzilla page open in both Mozilla 20020129 and in IE 5.5.  I
did a Save As from both.  Next, I opened a new IE window and submitted my
previous set of comments.  Then, after a minute-long pause for Bugzilla to
process it, I re-did the Save As from both original windows; my comments weren't
included.  So, in this simple test case, both Mozilla and IE continued to save
the displayed HTML rather than refetching.

Perhaps a new bug(s) needs to be created for the specific instances where
refetches occur?
I don't think we need a new bug; that's what this one is for.  The easiest case
is just a simple form using the POST method.  Type something in, hit submit and
then view source: the page now shows the submitted text, but when you view
source you still get the original source of the form.  This is a huge issue
since it makes Mozilla hard to use for web developers, the people we want using
Mozilla to keep the web from becoming an IE-only wasteland.

Just go to http://joshuaeichorn.com/mozilla/view_source_example.php to see the
effect.

I think the majority of users would be happy if this case was fixed.
Did we max the Bugzilla db or something?  I'm seeing "only" the first 280
replies on the web-page now.
-------------------
Mark, Joshua, thank you for the clarifications.  Sorry, Joshua, but I got an
"unknown domain" error when I tried to use your test page
-------------------
There are way too many comments on this bug and I think it's because the title
makes it sound like a widespread problem.  Maybe it should be renamed, "Need
improved means to reuse current page without refetching (see blocked bugs)."

I also think we need to create more bugs (which will be marked as
blocked-by-40867) describing the remaining problem areas.  And, if we create two
new bugs for the two possible solutions (marked as *blocking* 40867), we could
then leave 40867 as the tracking bug.  Specifically, create the following new bugs:
- Need means to pin open-windows in cache
- Need means to attach source-HTML to DOM
- Need means to individually cache dynamic-URLs in multiple windows
- Need means to prevent expiration of cached copy of still-displayed windows
- Need means to "cache" (for re-use) pages marked no-cache (?)
- Need means to cache/re-use HTML forms without refetch (bug 68412 or 84106?)

I leave it for someone else (more knowledgeable than I) to do any of these
suggestions.

Finally, is there something specific about the HTTP info for
http://weather.yahoo.com/forecast/USCA0982_f.html that causes it to refetch on
Save As?  If so, this could become another blocked bug.
I think breaking this into separate bugs is a good idea at this point (actually,
a whole lot sooner would have been better).

- Need means to pin open-windows in cache
- Need means to attach source-HTML to DOM
These are separate issues and deserve separate bugs to discuss their merits. 
"Need means to pin open-windows in cache" should probably be named:
- Open windows need to hold cache tokens for elements (or something like that).


- Need means to individually cache dynamic-URLs in multiple windows
- Need means to prevent expiration of cached copy of still-displayed windows
- Need means to "cache" (for re-use) pages marked no-cache (?)
These are not necessary.  The first two are already supported by the cache
service, and will be addressed if open windows hold cache tokens.  We are
already doing the third (though someone could open a bug to say we shouldn't).

- Need means to cache/re-use HTML forms without refetch (bug 68412 or 84106?)
I'm not quite sure what this means.  Data received from a server will be cached,
and reuse can be discussed in one of the bugs above.  If this refers to data
entered into a form by the user, then that would require a different bug,
because that data is not stored in the cache.


Keywords: nsbeta1 → nsbeta1+
Target Milestone: mozilla0.9.7 → mozilla0.9.9
Bug 115832 asks for page to be reused without refetching when doing File -> Edit
Page.
Blocks: 115832
Let me see if I've got this straight.

* This bug NEEDS to be fixed, ASAP.  E-commerce is unsafe while this bug is
outstanding; 'view source' works in a non-WYSIWYG, bandwidth-hogging, and
generally unacceptable manner; etc.  I personally will expect all the major
issues related to this to be fixed by Mozilla 1.0, or I'll abandon Mozilla in
disgust.

* This bug has been open for a year and a half, and there's been a lot of
confusion about its nature.

THE BUG IS:
"view source", "save", "save as", and "send page" should use the exact source
received from the server for the page currently being viewed in the current
window.  (Some people think "Back" should too, though this is debatable). (This
manifests itself in dozens of strange ways).


Early on, *two* fully functional solutions were proposed.
1) Store the raw text (usually HTML) of a document (page, frame, etc.) as 
a node in the DOM, and grab that.

People didn't like this because it added a bunch of text to the DOM, but it
would work.  No attempt has been made to implement it.

2) Store a 'hard reference', not URL-based, to the document's place in the cache
(it's stored in raw form in the cache) -- make sure that the cache doesn't wipe
out the document until the window containing it is gone by holding onto the hard
reference -- grab the document source using this.  (This is also known as the
'cache pinning' or 'cache keys' solution.)

This required an improvement in the cache to support this functionality.  These
changes to the cache are DONE: see comment #74.  So this solution is
HALF-IMPLEMENTED.

So now all that's needed is for the front end to actually use this.  Any page in
an open window needs a 'cache key' a.k.a. 'hard reference' which it holds on to.
 'View Source' and friends need to use this key to retrieve the source.
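
The "hard reference" idea can be illustrated with a toy cache, written in Python
purely as a sketch. Nothing here is the real Mozilla cache API: `PinningCache`,
`store`, `release` and `get` are hypothetical names, and the keys are opaque
tokens rather than URLs, so two loads of the same URL get separate entries.

```python
import itertools
from collections import OrderedDict
from typing import Dict, Optional


class PinningCache:
    """Toy cache: oldest entries are evicted first, pinned ones never."""

    def __init__(self, max_entries: int) -> None:
        self.max_entries = max_entries
        self._entries: "OrderedDict[int, bytes]" = OrderedDict()
        self._pins: Dict[int, int] = {}      # key -> number of holders
        self._keys = itertools.count(1)      # opaque, not URL-based

    def store(self, data: bytes) -> int:
        """Store data and return a key that the caller holds (pinned)."""
        key = next(self._keys)
        self._entries[key] = data
        self._pins[key] = 1
        self._evict()
        return key

    def release(self, key: int) -> None:
        """Called when the window showing this document goes away."""
        self._pins[key] -= 1
        if self._pins[key] == 0:
            del self._pins[key]
        self._evict()

    def get(self, key: int) -> Optional[bytes]:
        return self._entries.get(key)

    def _evict(self) -> None:
        # Drop the oldest unpinned entries until we are back under the limit.
        for key in list(self._entries):
            if len(self._entries) <= self.max_entries:
                break
            if key not in self._pins:
                del self._entries[key]
```

In this sketch a window would call `store` when its document loads, use `get`
for View Source and friends, and call `release` when it closes. Note that
pinned entries can push the cache over `max_entries`, which is exactly the
size-limit tradeoff argued about in this bug.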

This has been waiting for nine months, and I can't figure out why.

1) Because nobody is willing to put the interface code in to hold a hard
reference for each document being displayed in a window?

2) Because nobody is willing to put the code in for the individual commands
(View Source, etc.) to retrieve their information using this hard reference?

3) Because people are too confused about the source of the bug to realize that
this is what needs to be done?

If number 3 is the problem, I hope I've solved it.  Number 2 is probably blocked
by number 1.  (Also number 2 deserves a separate bug for each command.)  What's
blocking number 1, which sounds like less than an hour's work for someone who
knows the code ?!?!
This bug deserves priority P1, for obvious reasons.
4) Because there's still back end work required to actually *use* cache keys.

I believe there is back end work being done behind the scenes.  I don't remember
where Boris said it, but he is waiting on someone else so he can do the front
end work.   (IIRC)
Nathan: I think you are referring to rpotts' comment #223 and bzbarsky's comment
#277.
Currently it doesn't appear there's a bug filed on this, I could only see
nsIWebPageDescriptor mentioned by Boris Zbarsky in bug 99642, so it really looks
like Rick Potts works on this thing behind-the-scenes now.
We have to wait.
Keywords: topembed → topembed+
Target Milestone: mozilla0.9.9 → mozilla1.0
I like the speed of the rendering engine, I like the way Mozilla looks...
But I get really angry when I do a simple view-page-source and get
unexpected results!
I lived with the bug for almost one year, expecting it to be fixed from version
to version.
Hell, no: almost 2 years (going by these posts) and it's still there! I have
always had to keep a copy of Netscape around, and I can't say that I'm happy
with this.

I read many of the notes here, let me phrase some of my thoughts, in no
particular order.
- I never understood why a view-page/print/whatever would have to go and reread
  it from the network. The content might change at any time no matter what the
  headers, URL or GET/POST data say. As somebody already said: even if the
  content changes and I open the same link in another window, I expect each
  window to give me the correct information on print/save/view-source, i.e.
  the information it was built from.

- I don't get the memory consumption argument in this thread...
  I want a program to behave correctly and not to eat too many resources. In
  that order! So I would have expected to be given correct information when
  doing a view-source. Only if that worked would I mind about the memory
  consumption. If it doesn't work, I consider it a bug and the whole program is
  already compromised, i.e. I wouldn't trust it anymore (since it's not doing
  things right).

- I don't get the reason for not keeping, for every window, the _original_
  html/js/images/etc. bytes that are used to render the page. And I don't mean
  _only_ the HTML files. You never know when somebody will need to save some
  JavaScript, some images or some Flash file...
  Why not do it? Memory consumption? Wow, that's a good one. I really think
  that HTML files (the most common content type) are about 1 order of magnitude
  smaller than their DOM representation, not to mention all the other
  structures needed to keep information about the window state, the
  XML/HTML/DOM representation, XUL, theme and many other such things.

- Somebody said that after opening more than 100 windows the system crashed
  while the browser's cache of 50 MB wasn't yet filled. Wow, that reinforces my
  thought that the original content retrieved from the web would not be the
  memory hog... So then, again: why not keep it as is and make the browser
  behave as it should?
  Of course I also can't imagine somebody with more than 20-30 windows open :-)

- I don't mind where you keep the original bytes, be it cache, separate memory
  space or file system. Of course I think memory is the normal place to use. I
  do care, though, that they are kept! I would rather accept the browser telling
  me that it doesn't have enough memory to open new windows than get unexpected
  content when doing a preview/save/view-source. Of course, this would probably
  only happen when reaching that >100 open windows case :-)

- I wouldn't count the original bytes as belonging to the cache (even if they
  are stored in there somehow, as pinned entries?). Some of the posters keep
  saying that the cache size limits (memory cache size, file system cache size)
  _must_ be respected and that the cache can't keep all the original bytes. OK,
  if you insist: what good is this when I still can't predict or limit the
  amount of memory the whole browser process uses? What good is it to know that
  the cache is limited to xx MB and that you have very strict rules to obey
  this (while not having the expected behaviour) when, after opening 2-3
  windows, the process is already 30 MB in size and grows a lot with each newly
  opened window?

- I don't want to switch to another browser, but if Mozilla keeps behaving
  wrongly...
- As a memory optimisation, you can always compare each new content to the ones
  that could be the same (same URL, for example) and keep only one
  reference-counted memory block. Of course, the cache will probably also use
  the last-modified version of these URLs. And whenever a window gets closed,
  decrement the reference count of all objects it used.
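
The reference-counting optimisation in the last point could look roughly like
this. It is a Python sketch only, with the hypothetical names `SourceStore`,
`acquire` and `release`. One deliberate swap: instead of comparing candidates
with the same URL, it finds identical content by hashing the bytes, which is
simpler to show in a few lines.

```python
import hashlib
from typing import Dict, List


class SourceStore:
    """Keeps one copy of each distinct page source, shared by refcount."""

    def __init__(self) -> None:
        # digest -> [data, refcount]
        self._blocks: Dict[str, List] = {}

    def acquire(self, data: bytes) -> str:
        """A window starts using this source; returns the handle to keep."""
        key = hashlib.sha256(data).hexdigest()
        entry = self._blocks.setdefault(key, [data, 0])
        entry[1] += 1
        return key

    def release(self, key: str) -> None:
        """A window closes; free the block when nobody uses it anymore."""
        entry = self._blocks[key]
        entry[1] -= 1
        if entry[1] == 0:
            del self._blocks[key]

    def get(self, key: str) -> bytes:
        return self._blocks[key][0]

    def __len__(self) -> int:
        return len(self._blocks)
```

Two windows showing byte-identical content share a single block, while two
windows showing different content from the same URL each keep their own.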

In fewer words: I care more about getting this wrong behaviour fixed than about
memory consumption; otherwise we get a bad program.

Sorry if this was a little bit too long, but I'm a bit angry about this behaviour.
Here's some initial work to add the infrastructure necessary for new windows to
leverage cached content...  More work is needed, but this gives an idea of the
direction...

-- rick
Rick, that looks pretty good at first glance.  The only two issues I see on the
view source front are: not handling "view frame source" (need to pass the page
descriptor to openDialog() in nsContextMenu.js inside viewFrameSource()) and the
code in viewsource.js that does argument handling.  typeof(null) == "object",
unfortunately.  It should be fine to just assume the second arg to be the
charset (if not null) and the third arg to be the descriptor (if not null).

On a non-viewsource topic, the persistence object could use a page descriptor as
well to save pages that are the result of post requests (right now it sets the
post data stream, but not the cache key on the channel).  It looks like you're
trying to keep the descriptors as opaque as possible, so maybe it would make
sense to have a "open channel using this descriptor" global helper that would
open a channel, set the post data stream, cache key, whatever flags are needed
to make sure the channel reads from cache, etc.  This would encapsulate
knowledge of what these descriptors actually are in just two places (with some
bending over, one _could_ try to use this helper in docshell, but that seems
uncalled-for to me).

Thanks for doing this,
Boris
hi boris,

this patch smacks nsContextMenu.js a bit so that view-frame-source works....

i wish i could have avoided introducing the BrowserViewFrameSource() function
and instead just called BrowserViewSourceOfWindow(...) passing in the 'focused'
window.  but i had *no* idea how to do this :-(

i also added a try-block to the argument parsing in viewsource.js to deal with
the situation where 'null' is passed.  i believe it should just ignore the
'null' and keep looking for more args.

let me know if you think this patch is the right direction... if so, i'll clean
up the patch and land it...

thanks,
-- rick
I am adding my vote to the 174 that already exist, to show my annoyance
with this bug, especially while troubleshooting a web script.  However, from the
very recent patches that I notice, it looks like this might be nearly fixed.

I appreciate the hard work of the coders, and hope to see the results in the
next version of Mozilla.  Thanks, guys!
That looks reasonable...

I have to admit I'm still a little confused as to why the code in viewsource.js
can't just assume that the second arg (if non-null) is the charset and the third
(if non-null) is the descriptor.  That lets people add more args as needed and
not have them accidentally used as descriptors or charsets (if a future fourth
arg happens to be a string starting with "charset=") or the like...
hey boris,

you're absolutely right!!  i think it must have been paranoia (or lack of sleep)
on my part -- it's getting so i can't tell the difference :-)

i'll fix up viewsource.js to assume the following (possibly null) arguments:
  [0] - url
  [1] - charset
  [2] - page cookie

thanks,
-- rick
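
The positional convention agreed on above (url, charset, page cookie, any of
which may be null) can be sketched like this. The real code lives in
viewsource.js and is JavaScript; this Python version only illustrates the idea,
and `parse_view_source_args` is a hypothetical name.

```python
from typing import Optional, Sequence, Tuple


def parse_view_source_args(
    args: Sequence[Optional[str]],
) -> Tuple[Optional[str], Optional[str], Optional[str]]:
    """Read (url, charset, page_cookie) strictly by position.

    Missing or null arguments come back as None. Anything past the third
    position is ignored, so a future fourth argument can never be mistaken
    for a charset or a descriptor.
    """
    first_three = list(args[:3])
    first_three += [None] * (3 - len(first_three))
    url, charset, page_cookie = first_three
    return url, charset, page_cookie
```

Reading by position rather than by inspecting each value is what avoids the
problem Boris describes, where a later string argument that happens to start
with "charset=" would be picked up by accident.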
Comment on attachment 73999 [details] [diff] [review]
initial patch allowing view-source to use cached content...

Minor problem: nsIWebPageDescriptor.idl uses the wrong (old) license.
dude this bug sucks my ass. your fixing of this bug would please me greatly and
I will be more than happy to send you beer if it is fixed.
Blocks: 132638
No longer blocks: 132638
Many, many kudos to Rick.

(You're about to fix a bug with 188 (the third-most) votes; blocking eight bugs
including three major; a correctness, data loss *and* performance issue; a
Netscape 4 parity issue (despite lack of tag); a problem which prevents
architecture stabilization and dates back several years!  Your solution will
also, it seems, fix 55583, a bug with 240 (the second-most) votes and 93 (the
most) duplicates!)

If you can get this fixed by Mozilla 1.0, you will be my hero.

Is there anything we the observers can do to help make sure this actually gets
in by 1.0?  It doesn't seem to have any of the markers indicating that
mozilla.org considers it a must-have-for-1.0 (although it obviously is a
must-have to a lot of web developers, including me).
Attachment #73999 - Attachment is obsolete: true
Attachment #74439 - Attachment is obsolete: true
r=bzbarsky on the xpfe changes.

The docshell changes look good to me too, with one possible caveat... could it 
become an issue that we have two distinct session history entries floating 
about that hold the same post data stream?  Would it ever happen that we'd try 
to rewind the stream and then read it on two different threads or something 
evil like that (causing the stream to be rewound on one thread while it's being 
read from on another)?  I don't recall ever running into that problem when I 
played with a similar approach that cloned the session history entry a while 
back, but....
- In nsDocShell::LoadPage():

+        nsCString spec, newSpec;

Use nsAutoCString, or whatever it's called nowadays.

+        newSpec.Append(NS_LITERAL_CSTRING("view-source:"));
+        newSpec.Append(spec);

do newSpec.Append(NS_LITERAL_CSTRING(...) + spec); to avoid double append and
potentially double reallocs in the string code.

- In nsSHEntry::Clone():

+    rv = dest->QueryInterface(NS_GET_IID(nsISHEntry), (void**) aResult);
+    NS_RELEASE(dest);

Why not simply:

  *aResult = dest;

I don't see the need for the QI call here, the compiler should do the right cast
with the above and it's less code, and one fewer AddRef/Release.

Other than that the changes look good to me, sr=jst
hey johnny,

thanks for the comments -- i'll fix up the patch...

The extra QI/Release in Clone() is purely habit :-)  It came from when we used
to hand-roll factory functions !!  Since the [out] result was nsISupports the QI
was necessary and we used to have a 'rule' about using this form :-)

But you're absolutely right, when assigning a known class into a correctly typed
result, the compiler will do the right thing...

Hey, old habits die hard ;-)

-- rick
cc'ing darin, who can comment on the postdata issue further.

The session history changes look good to me. Make sure that subframe navigation
continues to work fine. As for cloning the postdata, I don't think it will be an
issue, because nsDocShell::GetCurrentDescriptor() primarily tries to hand out mOSHE
as the page descriptor. mOSHE is either the entry for the current page, if the
current page is done loading and docshell is in a stable state, or the entry for the
previous page, if the user has just started loading a page and necko has not yet
started the data transfer. mOSHE will be null only when session history is
disabled, in which case mLSHE will also be null and GetCurrentDescriptor() will
return an error.

A sticky situation could arise when docshell is loading a page with postdata and
has just set mOSHE in Embed() (i.e., data transfer and consumption have
started), but the network is having trouble completing the postdata submission to the
server (and so is possibly in the middle of reading the postdata stream), *and*
the user does a view-source at this time, which will rewind the postdata stream.
But I think that while reading from a stream, necko maintains offset values which
enable it to resume reading from where it left off. So this shouldn't be an
issue, but network experts can comment further.
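
To make the worry concrete, here is a toy model of the rewind hazard. It is only a sketch under stated assumptions: ToyStream and OffsetReader are hypothetical classes, not necko's stream API, and nothing here shows what necko actually does. It contrasts a consumer that relies on the stream's shared cursor with one that remembers its own offset.

```javascript
// Toy model only; these classes are hypothetical, not necko's stream API.
class ToyStream {
  constructor(data) {
    this.data = data;
    this.pos = 0; // single cursor shared by every consumer
  }
  read(n) {
    const chunk = this.data.slice(this.pos, this.pos + n);
    this.pos += chunk.length;
    return chunk;
  }
  rewind() {
    this.pos = 0;
  }
}

// A reader that keeps its own offset and seeks there before each read.
class OffsetReader {
  constructor(stream) {
    this.stream = stream;
    this.offset = 0;
  }
  read(n) {
    this.stream.pos = this.offset;
    const chunk = this.stream.read(n);
    this.offset += chunk.length;
    return chunk;
  }
}

const post = "name=rick&bug=40867";

// Case 1: the submission relies on the shared cursor.
const shared = new ToyStream(post);
let naive = shared.read(9); // submission is part-way through the data
shared.rewind();            // view-source rewinds the same stream
naive += shared.read(10);   // submission resumes, but from the start

// Case 2: the submission tracks its own offset.
const shared2 = new ToyStream(post);
const reader = new OffsetReader(shared2);
let safe = reader.read(9);
shared2.rewind();
safe += reader.read(10);
```

In this model the first consumer ends up with duplicated leading data after the rewind, while the second reassembles the original postdata. Whether the real submission path behaves like the second case is exactly the question left for the network experts.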
What are the footprint impacts of the current patch?  What if a large number of
windows/tabs are open, or very large flat documents?  If there are impacts, how
can they be removed by people (embedders, etc) who don't need this?  (No view
source/etc available).

This pagedescriptor - are we "pinning" the source in the cache as was mentioned?
 Which cache, memory or disk?  What if we have no disk cache or it's turned off?
The current implementation incurs NO extra bloat... It merely allows the
existing caching mechanisms to work...

In the future we can deal with the cases where the cached content is not
available.  However, I believe that this new API is sufficient to deal with more
complex 'pinning' strategies...

Right now, I think it's important to leverage the current 'pinning' strategy. 
Especially since this should deal with 99.9% of the common cases (for
view-source at least).

-- rick
Comment on attachment 76086 [details] [diff] [review]
New patch that addresses the previous comments...

a=asa (on behalf of drivers) for checkin to the 1.0 trunk
Attachment #76086 - Flags: approval+
>In the future we can deal with the cases where the cached content is not
>available.  

If my reading of comment #74 is correct, the current 'pinning' implementation in
the cache ensures that the cached content is always available (provided Mozilla
doesn't crash or run out of memory) for an open window or a frame in an open
window.  Even for pages which are 'not cached' (such pages will only be
accessible through hard references, not through the regular cache mechanism). 

Since 'view source' as currently designed can only be applied to pages in an
open window, this should deal with *100%* of the cases for 'view source'.
Am I correct in assuming that this will mean hitting the back button will ALWAYS
pull the cached copy?  While pages that have "pragma: no cache" should not have
a physical cache for good reasons, it still should be at least cached in the
memory, so that when you hit the back button, you're not reloading the damn page
again.  IE has the same problem.

This has major problems with CGI scripts with FORMs that disappear when you hit
the back button.  All of the contents disappear, and you have to start over
again.  Is this another bug, or directly linked to this one?
Brendan: You're probably thinking of bug 112564.
Patch checked in!!!
*sniff*

rpotts, please tell me that's not an april fool's joke
It's not.
The test case now works.  Congratulations on the good work.

However, the problem isn't solved yet. *sigh*

Go to a page which changes often, like 
http://gcc.gnu.org/ml/gcc-patches/2002-04/

Leave that window around for a while.  Open the same page in another window. 
See, it's changed.  Try 'view source' in the first window.  You get the source
for the newer version.

Obviously the 'cache key' currently used is *not* exactly what we want.
(Trying to figure out the maze of interfaces in Mozilla is still very difficult
for me, so I thought it was.)  Apparently what we want is now called a 'cache
token', and is *still* not fully implemented.  Please add dependency on bug 72519.
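
A toy model of the difference between looking content up by cache key and holding on to the entry itself may help. This is a sketch only: ToyCache and its methods are hypothetical and are not Mozilla's cache API; holding the returned entry object merely stands in for what is called a 'cache token' above.

```javascript
// Toy model only; ToyCache is hypothetical, not Mozilla's cache API.
class ToyCache {
  constructor() {
    this.entries = new Map();
  }
  // Storing under an existing key replaces what a key lookup will find.
  store(key, body) {
    const entry = { key, body };
    this.entries.set(key, entry);
    return entry; // keeping this object plays the role of a "cache token"
  }
  lookup(key) {
    const entry = this.entries.get(key);
    return entry ? entry.body : null;
  }
}

const cache = new ToyCache();
const url = "http://example.invalid/list";

// Window A loads the page; later window B loads a newer version of it.
const windowA = { key: url, token: cache.store(url, "version 1") };
const windowB = { key: url, token: cache.store(url, "version 2") };

const byKey = cache.lookup(windowA.key); // what a key-only lookup finds
const byToken = windowA.token.body;      // what window A actually displayed
```

In this model a key-only lookup from the first window finds the newer content, while the held entry still carries the content that window displayed. The cost is that every held entry stays alive until its window lets go of it.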

Nathanael: reusing cache tokens was not part of the design of view-source, etc.
for a good reason.  it impairs the cache's ability to manage resources and to
make room for newer content.  that said, i can see the benefits of holding cache
tokens for HTML documents, and since there aren't very many at one time, we
probably could get away with pinning these in the cache.  however, i think
that's a second-level enhancement to view-source.  the solution thus far ensures
that we do the right thing on pages that are freshly navigated to, which IMO
covers the majority of the view-source cases.
yeah, it is a second-level enhancement.  I still think it's important but what's
left isn't really a 'must be fixed by 1.0' kind of thing.
I'm sorry to spam a load of people with this message, but I've been keeping an
eye on this bug ever since I started using Mozilla, as it irks me so much. What
I've never fully understood through this whole period is /why/ Moz seems to have
such difficulty in doing View Source when others like NS4.7x and IE6 handle it
fine. Is there some fundamental flaw in the way Moz is/was written from the
beginning that stops VS working like it does in the other browsers?
I'd be grateful for any explanation (preferably by email, so this bug
isn't tangled up). Can anyone come up with a good reason
why VS was never catered for from the beginning? Other
than "Because" and "Most people don’t use it", I mean.
Because ;-)

Marking fixed since it is.
Status: NEW → RESOLVED
Closed: 22 years ago
Resolution: --- → FIXED
Please open a new bug to deal with the case that's still broken, if you plan on
marking this one fixed.

(I'm referring to the situation where the same page is open in multiple windows
with different contents; only one cache entry is preserved for the content,
instead of one for each different set of content). Otherwise we still *don't*
have a way to "reuse/reload current page without refetching from server" for all
cases, so this bug is NOT fixed.
I would actually like to understand this bug better myself.  A technical
explanation of the issues would help... how do Netscape 4.x/IE 6 deal with the
view-source issue?  Why doesn't Mozilla use the same style?  What are the
advantages to Mozilla's method?

And I have to agree that I don't think this is completely fixed, although it is
1000 times better than what it was!  Good work!
I just checked and 4.x *does* get the remaining case right: I viewed the top
story on slashdot.org and then opened the same page in a new window. I refreshed
the new window until the number of posts for "-1" in the "select threshold"
dropdown list went up, without touching the original window. Then I did "View
source" in the original window and looked at the bit of the source that
identifies the threshold dropdown. It indicated the original number, as it
should have.

(Oh, and I made sure I got to the story page just by clicking a link, so there
was no postdata)

I haven't tested, but from comments here Mozilla would have given the value from
the other window. The fact that NS4.x gets this right is an indication that the
tradeoffs described in this bug *aren't* necessary - although I didn't test with
cache disabled or set too small to hold a slashdot page, so I can't say for sure
what happens in that pathological case for 4.x.

caillon, you marked this fixed, do you have an opinion on whether this bug
should be kept open for the remaining cases or closed in favor of a new bug for
the case where the same source is in two windows? How about a new bug for the
pathological case (cache too small to hold all open windows) too?
Please don't re-use this bug for new issues, it's big enough as it is...
Thank you again for a fix to this bug.  Remaining issue spun off to bug 136633
Verifying. Man, is it good to see this in!
Status: RESOLVED → VERIFIED
Blocks: majorbugs
No longer blocks: majorbugs
Why is this bug marked as VERIFIED FIXED in 2002 when Firefox 3.x still has this same issue?  I still am unable to save, view source, etc. pages, images and others.  It reloads them from the network every time.

KenW: "I think that the data to save can just be generated from the DOM object and there is no requirement that the HTML (if the HTML format is chosen) be
anything like the original source code."

I disagree.  There is a very strong requirement that the source saved be exactly, byte-for-byte what the server provided.  These are the fundamental and essential functions of an HTTP browser: download, view, and save a copy of a web page.  If you are not saving the original web page, then the browser has failed in its most basic functioning.
Bug 288462 provides the full summary of all related issues, complete with RFC violations.  Should be linked to this one.
Here I was getting totally crazy why some $_POST (php) output I printed in <!-- --> didn't show up in view source...

Really, if I want to get the source of a clean request to a URL, then I'll use wget!

Not verified!

Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.0.6) Gecko/2009020410 Fedora/3.0.6-1.fc10 Firefox/3.0.6