When browsing in session history we should be able to display previous pages without fetching them again from the network, regardless of whether they are "cacheable" or not. Therefore *all* pages should be cached, if not in disk cache, then in memory cache, for re-use within the current session when requested by browsing in session history.

One special case is the result of an HTTP POST (see bug 55055). In that case we must be able to associate the cache entry with the post data or a dedicated session history entry.

Reasons are explained in RFC 2616 (HTTP/1.1):

--8<----
13.13 History Lists

User agents often have history mechanisms, such as "Back" buttons and history lists, which can be used to redisplay an entity retrieved earlier in a session.

History mechanisms and caches are different. In particular history mechanisms SHOULD NOT try to show a semantically transparent view of the current state of a resource. Rather, a history mechanism is meant to show exactly what the user saw at the time when the resource was retrieved.

By default, an expiration time does not apply to history mechanisms. If the entity is still in storage, a history mechanism SHOULD display it even if the entity has expired, unless the user has specifically configured the agent to refresh expired history documents.

This is not to be construed to prohibit the history mechanism from telling the user that a view might be stale.

Note: if history list mechanisms unnecessarily prevent users from viewing stale resources, this will tend to force service authors to avoid using HTTP expiration controls and cache controls when they would otherwise like to. Service authors may consider it important that users not be presented with error messages or warning messages when they use navigation controls (such as BACK) to view previously fetched resources. Even though sometimes such resources ought not to cached, or ought to expire quickly, user interface considerations may force service authors to resort to other means of preventing caching (e.g. "once-only" URLs) in order not to suffer the effects of improperly functioning history mechanisms.
--8<----
Since this bug blocks bug 55055 (per Radha) which is nominated for 6.01, I'm confirming this bug and nominating it also. Sounds like it's too late for RTM...
There is a related bug about Save As and such.
Severity to critical. This can cause dataloss or even monetary loss. See bug 57880 for details.
Hmmm. This bug looks like it's associated with bug 40867, which talks about needing a mechanism for storing pages for the purposes of Save As and View Source, which also reload data. (Could it be a dupe of this one?)
Yes, this is the bug I meant. Adding as blocker.
Nominating for the next release like 55055 (which depends on this one)
cc to self
a "can-do" for new cache. marking till that lands.
> a "can-do" for new cache.

Please note that this bug blocks an IMO important [nsbeta1+] bug.
shouldn't we then make this a "blocker" severity? this bug is blocking 55055, a highly visible, dataloss, nsbeta1+ bug.
This bug requires major architecture changes to the current cache. This will be resolved when the code for the new cache lands.
After discussing with gordon and neeti, marking this dupe of 20843. 55055, which is dependent on this, will be re-wired to 20843.
Actually making it a dupe *** This bug has been marked as a duplicate of 20843 ***
Bug 20843 covers only caching of HTTP form POSTs. This bug is about a "history mechanism" (as defined in RFC 2616), which allows even expired pages to be re-displayed in their original state when browsing in session history. Thus reopening this bug, but resetting severity, target milestone and keywords for reevaluation (assuming 20843 will cover the most important issue with form POSTs); adding dependency on bug 20843.
RFC 2616 13.13 doesn't REQUIRE we store all pages. It says "If the entity is still in storage.." we should display it, even if it has expired. That is a separate issue from the fact that POST data is not currently incorporated as part of the key used for cache entries (covered by bug 20843). And both are separate issues from enabling the use of the memory cache for non-cacheable items (which is covered by bug 66482, among others).
Readding keywords - they still apply, though with less severity.

> separate issues from enabling the use of the memory cache for non-cacheable
> items (which is covered by bug 66482, among others).

If bug 66482 really covers non-cacheable items (there's no mention of it, so I'd assume it's only about cacheable ones), what is left for this bug?

> RFC 2616 13.13 doesn't REQUIRE we store all pages. It says "If the entity
> is still in storage.."

It also says:

> a history mechanism is meant to show exactly what the user saw at the time
> when the resource was retrieved.

I guess the if clause has been added so UAs don't have to cache *everything* (possibly hundreds of pages) in the history.
Also related to bug 38486
Cache bugs to Gordon
It's really up to the cache client to decide what to store in the cache. I'll let HTTP take the first crack at it. To Darin, with love...
I'm thinking that all no-cache responses can be handled using the cacheKey attribute from nsICachingChannel. we just need to add a load flag on the HTTP channel to indicate that caching should be done even if the protocol would not consider the response valid for future requests. these responses would be given a special cache key that would not equal the URL. this would mean that only clients with a reference to the cacheKey, provided by HTTP, would be able to recover these responses from the cache. radha: this would work just like it does for POST transactions.
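A minimal sketch of the keying idea described above, not the actual Necko code: cacheable responses are stored under their URL, while no-cache responses are stored under an opaque key that only the original requester holds. The class name, the method names, and the key format are all hypothetical and chosen for illustration only.

```python
import itertools


class SessionCache:
    """Toy cache: normal entries keyed by URL, history-only entries by an opaque key."""

    def __init__(self):
        self._entries = {}
        self._ids = itertools.count(1)

    def store(self, url, body, cacheable):
        # Cacheable responses use the URL as key, so any client can find them.
        # Non-cacheable ones get a key that never equals the plain URL
        # (the "id=...&uri=..." format is an assumption for this sketch).
        key = url if cacheable else f"id={next(self._ids)}&uri={url}"
        self._entries[key] = body
        return key

    def lookup(self, key):
        # Returns None when nothing is stored under this key.
        return self._entries.get(key)


cache = SessionCache()
history_key = cache.store("http://example.com/quote", "price: 42", cacheable=False)
```

A new visit that looks up the plain URL misses this entry and goes to the network, while session history, which kept `history_key`, can still recover the original response.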
darin: From your comments above, this is what I understand docshell should do:

1) When a page with "no-cache" attribute is loaded for the very first time, docshell should verify that and set a special load attribute on the channel, before handing it off to necko for loading. (I'm presuming you would tell me what that attribute is)
2) In OnStartRequest(), while saving the url in SH, I would also save the cache key for it (just like I do for postdata results)
3) When the user goes back/forward to a page that had a "no-cache" setting, I would restore the cache key on the channel before handing it off to necko again.

Does this look right?
that's almost exactly it, except that instead of docshell having to check for documents with the "no-cache" tag, docshell would simply set a policy on the channel telling it to go ahead and cache such pages (that it normally would not cache). the only necessary thing missing from our api is this flag.
radha: this bug is assigned to me so that i can add this flag... once i've added it, i'll reassign the bug to you so you can add the necessary docshell/history support.
Shouldn't the existing VALIDATE_NEVER on the nsIChannel satisfy this? The way I see it, when docshell first requests a document it sends the standard set of cache flags (default being LOAD_NORMAL). HTTP then caches it either in memory or on disk (but it ensures that it's available somewhere, even with a no-cache response). Then if the user hits the back button, docshell adds a VALIDATE_NEVER to the request, which tells HTTP to simply return the document the way it is without checking its validity. I think there are more benefits this way. If we were to create a new flag, then it would be hard for us to detect when the window goes away (especially if I have multiple windows open and close one of them) and that it's ok for us to throw something out of the cache. With the suggestion I am making, this would get handled just like the rest of the cached objects and we won't need a special way to clean out these "held for the rest of the session" objects. The same should work for POST results without much change from either SH or HTTP.
ok, i've discussed this with gagan and gordon, and we've come to the conclusion that HTTP should just put everything in the cache. later on, i'd like to make this more customizable, since some clients may not wish to implement session history. so this means, that the flag i was talking about earlier will not be necessary at this time. i'll submit a patch to enable caching of *all* documents.
Gagan: How would that work with two different windows that had the same page in their histories, but from different points in time when the content was different? When each window goes back, it should see only the version it had originally displayed, regardless of other cached versions or changes in the live page...
IFF the pages (with identical URLs) were generated by standard GET hits and had no cache directives, the data for the older version would be replaced by the newer version. So going back in history on either of the windows will only show you the latest version of the pages. If this is not desirable then we'd have to figure out a way to keep these pages unique across other windows. A way to do that would be to associate a window-id with each of the cache-ids, which, if you think about it, will break basic caching behaviour. That is, pages being loaded in a newer window may not be able to reuse the cached version from an existing window. IMHO, losing the older cached version in the handful of multi-window cases where the content is replaced is a better deal than this.
Reread that quote from RFC 2616 in this bug's description: "History mechanisms and caches are different. In particular history mechanisms SHOULD NOT try to show a semantically transparent view of the current state of a resource. Rather, a history mechanism is meant to show exactly what the user saw at the time when the resource was retrieved." Note that this says nothing about cache directives or which method was used to retrieve the resource. It flatly states that it's meant to show "exactly what the user saw", not something similar that's more convenient for the application to display. To go back and show a newer copy of the page than was originally shown in that window violates the RFC guidelines AND the user's reasonable expectations -- why would we want that to happen?
Deven: I understand the spec's recommendation-- the real problem is that our history implementation uses the cache. Which means that any changes happening to the cache are reflected in the history, unless we keep unique (in time) copies of pages. It should be possible to use a window id to create an inbetween solution of keeping unique pages unique wrt window ids. darin/gordon/et al what say you?
deven: while i understand your point, i don't think we should try to solve the problem of history completely. for MOST dynamic content, such as the results from a form POST or the response from a '?' GET, we'll be storing separate snapshots of the content so that history works as expected. For all other content, we don't plan on keeping historical snapshots... the thinking being that this would be inefficient. anyways, it is already the case that content referenced by history may expire from the cache. for pages with the same URL, we'll just be expiring the older page sooner than what a LRU based algorithm would prescribe. forcibly keeping all historical content in the cache for the lifetime of the browser sessions just seems to me like a recipe for bloat... note: we use the MEMORY cache for HTTPS, so it's not like we are just talking about using more or less disk space.
I want the same thing that deven wants, but I understand how a perfect solution would be inefficient. I really don't think that covering POST and GET-with-? is a broad enough middle-solution, though. I was thinking that there might be a way to do slightly more without sacrificing too much: what if we treated the history operations differently, and only stored snapshots for those locations that are reachable by the BACK button on some window -- that set should be much smaller than all locations in the history, and I could live with the FORWARD button behaving differently.
actually, the default maximum number of times you can go back is 50... users may set this to be whatever they like. are you sure you want us to forcibly keep around 50 full pages of content? instead, we're aiming toward an impl that uses the cache's eviction mechanism to control the number of full pages "recallable" via history. IMO this is the right solution.
It makes sense to set a tunable upper bound.
I think we should try to solve the problem of history completely. But for 1.0 the proposed solution is IMHO enough. In the long term we could keep "uncachable" cache entries unique per request and limit them to a maximum (i.e. expire them more quickly than other pages). I think a total of about 1 MB for such entries should be enough. Normally, only a small part of that would be in memory. So it wouldn't introduce more additional bloat than it is worth.
Gagan: You have to keep unique (in time) copies of form POST results; why should GET requests be any different? Fetching updated data from the cache is wrong, especially for operations such as "Save As" and "View Source" when the page is still being displayed. Darin: I disagree -- I think we SHOULD be trying to solve the problem of history completely. For 1.0, not someday in the vague future. Sacrificing correctness for efficiency isn't acceptable, at least not on this scale. We're not talking about the difference between 99.99% and 100%, it's more like the difference between 95% and 100%. Also, I've seen no evidence that handling history correctly would NECESSARILY be less efficient. More complex? Probably. B-Trees are much more complex than simple binary trees, yet they are nevertheless far more efficient. Complexity doesn't imply inefficiency. I believe the heart of the problem lies in the mistaken assumption that the cache is an appropriate place to store history information. It isn't, and a host of problems have resulted from the mistake of using the cache as a history mechanism, including problems with form POST results, GET queries with ?, View Source and Save As returning incorrect data, etc. Worse yet, the loss of correctness isn't negligible; it violates the user's expectations in a severe way, and for that reason it isn't acceptable -- no matter how convenient it may be or how many other implementations may do the same thing. This is an architectural problem, and the right solution will be architectural in nature -- continually manipulating details of how the cache works will never provide a satisfactory solution, because a LRU cache is inappropriate for storing history information in the first place. I pointed this out months ago (on 2000-12-28) in bug 40867, and even offered to try to attack the problem myself (in what little time I have), but I was told that Gordon was working on a solution so I left it alone. 
There's no reason why a proper history mechanism must translate into "bloat". First, calling it "bloat" is mischaracterizing it in the first place; it's saved data, not an inadvertent waste of memory. (People don't call Linux "bloated" for using all available memory for disk caching.) Also, it's not inherently necessary to keep history content in memory; there's no reason it couldn't be offloaded to disk. And limits on the total amount of memory and disk space used would also be feasible -- as user-tunable policy mechanisms, not accidents of implementation. The only history content that should be _completely_ inviolate is that associated with a page that is CURRENTLY being displayed in one window or another. (That guarantees that "Save As" and "View Source" will work -- the user can always close that window if they're short of memory.) Basically, as I first mentioned last year, I believe the history mechanism should be independent of the cache mechanism instead of the current kludge of pretending that the cache is an acceptable substitute for a proper history mechanism. This is critical to meeting the actual needs of the users, which is what the application exists for in the first place...
I think the simplest complete solution is a weak reference scheme: Every history item can have a "reference" to its cached page via a unique ID or even via a pointer. When the user hits a page (or maybe even images within a page, if one is so inclined--but that would be considerably harder, involving the changing of the page structure), the browser, as always, first searches the cache. If it finds the page, it gets another reference to it (increasing the reference count). If it does not, the browser goes to the network, gets a new page, caches it, and gets a reference to the page in the cache. Then when the browser wants to pull something from the cache, it goes through the pointer (or calls the cache using the unique ID). Multiple places could reference the same cache item (two gets that both hit the cache for their data). When a history item is cleared (i.e. the browser closes or the user goes to a URL in the middle of history), the item is dereferenced. The cache, for its part, guarantees not to remove any item from the cache that is referenced.
Deven: I think the reason the cache is used for history is so that you only have one copy of the page around, and so that other browsers can search for the page (which will certainly use the cache). I say just modify the cache from LRU to be LRUNRBHI (Least Recently Used page that is Not Referenced By a History Item).
John: your solution _is_ something we considered doing, and we do have a mechanism for ``pinning'' entries in the cache with a reference count. keeping hard references to cache entries allows us to keep them ``alive'' even when collisions occur. the cache (really a data store with eviction capabilities) allows entries to exist in the cache in a doomed state and still be completely valid from the point-of-view of those owning the hard references. a doomed entry is deleted once its reference count goes to zero (as expected). but invoking this feature of the cache for all pages ever visited during a session up to the last N pages was deemed excessive. deven may disagree, but we felt that this would solve the bigger problems such as printing and save as for content recently visited. history as we've defined it is really just a list of URLs visited, with form POST data and some other state remembered. as a web surfer, i've never found this to be insufficient. i mainly use history to go back to revisit an URL without having to type it in. i suspect this is the case for most users. they care about the URL, and couldn't care less if the content of the page is not exactly as it was before. so long as it is accurate now, i think most users will be more than content.
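The pinning behaviour described above can be illustrated with a small sketch. This is an assumption-laden toy, not the real cache code: the names `Entry`, `PinningCache`, `pin` and `unpin` are invented here, and a collision is modelled simply as storing a second body under the same key.

```python
class Entry:
    def __init__(self, key, body):
        self.key = key
        self.body = body
        self.refs = 0  # number of hard references (pins) held by clients


class PinningCache:
    def __init__(self):
        self.live = {}    # key -> Entry visible to normal lookups
        self.doomed = []  # replaced entries that are still pinned

    def put(self, key, body):
        old = self.live.pop(key, None)
        # On a collision the old entry is doomed: invisible to lookups,
        # but kept alive for as long as somebody still pins it.
        if old is not None and old.refs > 0:
            self.doomed.append(old)
        entry = Entry(key, body)
        self.live[key] = entry
        return entry

    def pin(self, entry):
        entry.refs += 1

    def unpin(self, entry):
        entry.refs -= 1
        # A doomed entry is dropped for good once nobody holds it any more.
        if entry.refs == 0 and entry in self.doomed:
            self.doomed.remove(entry)
```

A window that pinned the first version of a page keeps reading `entry.body` unchanged after a newer version is stored under the same key; the old data goes away only when that window releases its pin.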
Most Internet users do not care if history goes back to the original page ... but Intranet users, especially users of web applications that require complicated, multi-page editing and saving of data, will care. How about this: since it doesn't look that hard to implement (hard references already implemented in cache), put the capability in, disable it by default, but make it a Preference (checkbox in Advanced > Cache: "Keep history pages in cache"--or maybe even "Keep up to n history pages in cache"). This would make web applications work real nice but still make normal Internet surfing viable in terms of memory. This would totally avoid the window ID kludge, allow caching to work 100% the way everyone wants it, and still wouldn't be awful hard to do.
Enabling multi-page web apps involves a much wider array of issues, which cannot be solved simply by changing the caching policy. With the new cache we have ways of avoiding collisions with POST results and GET queries with ?, and we have a mechanism (yet to be taken advantage of) to hold cache entries for the currently viewed page to get more accurate Printing, Save As..., and View Source. It seems like this should go a long way in supporting the kind of Intranet web application you're talking about, John. We're very interested in providing more support for multi-page web apps, but that will involve work in many areas, and if changes need to be made to http and the cache to support that work, they will be made. You are correct, John, that the reason the history uses the cache is to avoid storing multiple copies of the same document (which in the case of most GETs would be the case). For history to store and manage a separate copy of every page (and associated documents) would be very wasteful. Whether that space is in memory or on disk is not the issue. On a more general note: I believe RFC 2616 is a spec for HTTP/1.1. I don't believe it is a spec for how user agents MUST implement history mechanisms, but rather a warning to http implementers of what to expect from user agents. That said, I think we are following the spirit of section 13.13: we certainly aren't trying "to show a semantically transparent view of the current state of a resource". However, if our copy of the resource has been evicted (for whatever reason), we refetch it from the net, warning the user in the case of POSTs because of potential side effects. Now, bugs in our current validation code are another matter... :-)
As long as we accept as a sad fact of life that we can't keep all the pages around (since the user may go to hundreds of pages in a single session), LRU is a decent approximation of "give me the last n pages", even though it will not work per-window and will give priority to pages that are recently viewed but cleared from history. We *could* implement a "priority" scheme in cache that would help, but it might not be worth it.

On to the issue of GET versus POST and '?' GET requests. If I understand you correctly, you are saying that, for POST and GET w/ '?' entries, we are able to store multiple snapshots in time even if the '?' parameters and the POST parameters are the same. But with normal GET we cannot. I assume that we are currently doing something like the following:

1. For GET without '?' we always search the cache based on the URL. If we don't find the entry we go to the network.
2. For POST and GET with '?' we hold some kind of pointer or unique ID to the original cache entry so we can get back to it. If the unique entry is gone from cache, then we go back to the network rather than search by URL + parameters. These entries are presumably *not* pinned in the cache, however, because that would be inefficient.

If I understand all this, then what prevents us from using the same mechanism for GET requests to get a "unique ID" to entries in the cache which are not pinned down? *Then*, if the unique entry is gone from the cache, we search the cache based on the URL. Then if we find nothing we go to the network. I may be misunderstanding what we do for POST and GET '?' requests.
You are correct about POST and GET ? urls. However, the unique ID becomes part of the key that they are searched for in the cache, because those responses are only valid when viewed by history and we never want to find them in the cache when visiting a "new" url (clicking a link or typing in the location field). If we used this approach for ALL GETs, we would never be able to reuse anything we put in the cache for "new" visits.
We do not need this approach for *all* GETs. If a GET is cacheable, subsequent requests will be served from cache and thus result in identical pages. But we should use it for any uncacheable GET.
How about a weak reference scheme to deal with this? When someone requests a document and gets it out of the cache, give them back the cache ID of the request. On subsequent attempts to get the document, they do a call like GetCacheItemByID() which will attempt to find the original item, and if they cannot find it (if it's expired), they try to find by the URL. Basically two different search mechanisms to get at the data. Maybe even a weak reference pointer class with a pointer to the nsCacheEntry (or Descriptor?), and when the cache expires the entry, it goes out to all these weak reference objects and sets the pointer to NULL. This could be simpler to implement, and would be pretty efficient too, I imagine. When a new copy of a URL is fetched from the network, the old copy is not destroyed in this case, but instead pushed aside, so that a search by URL will turn up the new copy. Perhaps they are ordered in the index by date (most recently fetched is the one to grab). This *does* mean that the index in the list of cache entries will not be unique anymore. Is this feasible? Is the index to the cache in memory? Can the current "key" be non-unique as long as the search algorithm returns the most recently created cache entry with that key? This could be useful elsewhere, I am sure. I think we may have a blizzard tomorrow, in which case I'd be happy to work on this. I've been perusing the cache code.
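As a rough sketch of the two-step lookup proposed above (first by cache ID, then by URL), assuming the key may be non-unique and the most recently stored entry wins a URL search. `TwoWayCache`, `get_by_id` and `get_by_url` are hypothetical names for this illustration; `GetCacheItemByID()` in the comment above is likewise only a suggested call, not an existing API.

```python
import itertools


class TwoWayCache:
    def __init__(self):
        self._by_id = {}       # entry id -> (url, body)
        self._ids_by_url = {}  # url -> list of entry ids, oldest first
        self._next_id = itertools.count(1)

    def store(self, url, body):
        # An older copy of the same URL is pushed aside, not destroyed.
        entry_id = next(self._next_id)
        self._by_id[entry_id] = (url, body)
        self._ids_by_url.setdefault(url, []).append(entry_id)
        return entry_id

    def evict(self, entry_id):
        url, _ = self._by_id.pop(entry_id)
        self._ids_by_url[url].remove(entry_id)

    def get_by_url(self, url):
        # A URL search returns the most recently stored entry, if any.
        ids = self._ids_by_url.get(url)
        return self._by_id[ids[-1]][1] if ids else None

    def get_by_id(self, entry_id, url):
        # History asks for the exact entry first and falls back to the URL
        # search only if that entry has been evicted.
        hit = self._by_id.get(entry_id)
        return hit[1] if hit is not None else self.get_by_url(url)
```

If both lookups return `None`, the caller would go to the network, as in the comment above.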
I should point out that there is nothing inherently special about the '?' part of a GET request -- there is no reason why a query couldn't be part of the host name, in fact. For example, see: http://$sum(42,17,-3).x42.com/ ...which is a TOTALLY VALID domain name and triggers a CGI script. There is no difference between that and http://www.example.com/sum?42&17&-3 ...which we treat differently (?). I don't know if this means we have to file a new bug or something...
I never suggested that space should be wasted with identical copies of data. Yes, a naive implementation would waste space on identical copies, but a better solution would not. A quote from my 2000-12-28 message under bug 40867: "When the LRU cache Necko needs happens to point to identical data, the cache manager could share the memory space." The desire to avoid redundant copies of data is an excellent reason to have Necko's LRU cache and the history mechanism depend on a common, independent cache manager subsystem, but it's NOT such a good reason to have the history mechanism depend on Necko's LRU cache. Doing so has caused a number of bugs and guaranteed that unexpected behavior (e.g. reloading from the network) is always possible. I think the reason why these issues have been so complicated was because functions that should have been independent have always been merged (document caching mechanisms to manage previously-retrieved documents vs. LRU cache mechanisms to avoid redundant network transfers). While they're both caching functions, one is needed for correctness, the other for efficiency. By merging them, we've achieved efficiency without correctness, and caused a wide variety of problems that could have been avoided by using a cleaner design from the beginning. It doesn't make sense to go through the networking layer to retrieve an in-memory copy of a document that was previously retrieved. I think the best solution is to have TWO caches, the LRU cache and an independent cache, with the LRU cache dependent on the independent cache. Basically, what I'm suggesting is when data is retrieved (by ANY mechanism, whether "cacheable" or not, even "file:"), the independent cache manager would create a new "handle" and return it to Necko, which would save that handle in the LRU cache for that URL (if it's cacheable), and return the handle to its caller with the data. 
If another request comes to Necko for the same URL, and the handle is in the LRU cache, Necko would fetch the data from the independent cache manager and return it along with the same handle that was returned before, since the contents haven't changed. If the data is fetched again (forced reload, expired LRU cache entry, etc.), then the new handle would be stored in the LRU cache after saving the new data in the independent cache. If Necko's caller (e.g. a DocShell) later wants a copy of the data again (Back, View Source, etc.), it would use the handle to request the data from the independent cache manager, leaving Necko out of the loop. Therefore, it would ALWAYS receive exactly the same data for the content in question, or an error if it's not available -- it would never receive an updated version unexpectedly, since it would have to ask Necko for an updated copy if necessary. Ideally, the independent cache should determine if identical content was already in the cache (when being handed new content by Necko), and avoid saving a redundant copy, returning the original handle instead. The independent cache manager would have the right to move content between memory and disk at will, retaining the same handle to refer to it. (This might be done with another LRU mechanism inside the independent cache, independent of Necko's LRU cache.) The handle could either be a hard reference, or a weak reference allowing "locking" to keep the content from being deleted (e.g. when currently displayed in a window). Using a weak reference with locking is probably the most flexible solution, and allows user-tunable policy (e.g. size limits) to be implemented for the independent cache manager. 
It should be obvious that this two-tier solution would (1) not generally waste memory on redundant copies of data, yet (2) keep older copies (as well as new ones) when it's appropriate to do so, and (3) never cause the kind of behavior that has triggered so many bug reports with the current solution. While it may be somewhat more complex, it's also a cleaner design, and I think it's the right solution. What value is there in keeping the current design, besides inertia? John, while I think your "LRUNRBHI" cache might be an improvement, I believe that separating out the LRU cache from the reference-counted cache is much cleaner and safer. Darin, I think you're mistaken in your belief that the current definition of "history" is adequate, even for average users, but especially for power users and web developers. Even the average user is likely to be upset if they can't go "back" to the content they remember seeing -- if a page changes, and they saw the old version in their history, they'll rightly expect to be able to go back and see what the old one looked like. But forget the average user -- even if the masses accept this sort of limitation, web developers won't be so forgiving, and we do WANT web developers to target Mozilla as a preferred platform, don't we? If we don't do it right, the web developers won't like working with Mozilla and that would necessarily impede Mozilla's acceptance. In itself, that should be sufficient reason to want to get this exactly right, not just close enough. If there's ever a preference option for this, the correct behavior should be the default, not a hidden mode that you need to make an effort to enable. Gordon, what issues do the multi-page web apps involve that wouldn't be solved by the two-tier cache architecture I'm suggesting? 
Also, I _strongly_ disagree that the current approach follows the spirit of section 13.13 -- we aren't making an effort to show EXACTLY what the user saw previously (which is explicitly stated in 13.13, not just implied), and we ARE showing a "semantically transparent view of the current resource" by refetching it from the network any time the content has been evicted. In fact, the ONLY reason that it doesn't always show the most current state of the resource is because the cache is used to avoid a network transfer -- when this is correctly showing exactly what the user saw before, this is by coincidence, not by design. While RFC 2616 and section 13.13 may not be telling us how we MUST implement history mechanisms, the guidelines laid down couldn't be more plain, and we're certainly not following those guidelines in the current implementation, either in the spirit or by the letter. Saying that we aren't REQUIRED to do it doesn't change the fact that these are GOOD guidelines, and we don't really have a good reason to be violating them -- it's just been done for convenience and out of some fears of inefficiency. (And I don't think the right solution need be inefficient at all.) Instead of EVER refetching from the network automatically, it would probably be better to put up a page like NN4 does saying that the previous content has expired from the cache (regardless of whether it's a POST or not), and require the user to hit Reload to see the content. (Maybe for non-POST data that can be identified as not having been changed since the original retrieval, it might make sense to automatically retrieve the page again, but only if it's fairly certain that the content is unchanged. If so, it should be possible to disable this with a preference option.) As for an approach that can be used on *all* GETs, the two-tier cache approach I'm suggesting would work fine when applied to all GET operations. 
Any time the cache would have returned the contents, a shared handle will be used. If the content is non-cacheable or has expired, it will be fetched again as expected, but old copies will remain available if they are still in use. This approach doesn't require the browser to second-guess the server based on using POST or GET with ? to guess whether content is dynamic; it would just work regardless. Okay, so tell me. What's wrong with my suggestion?
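For anyone trying to picture the two-tier arrangement described above, here is a minimal sketch in Python. It is only an illustration under stated assumptions: the names `ContentStore` and `LruUrlCache` are hypothetical and are not Necko interfaces, and the real cache would deal with streams, disk storage and metadata that are ignored here. The point it shows is that the LRU tier and a history entry can share handles to the same content, so evicting or replacing an LRU entry does not destroy content that history still references.

```python
from collections import OrderedDict


class ContentStore:
    """Tier 1 (hypothetical): content kept alive purely by reference counts."""

    def __init__(self):
        self._blobs = {}   # handle -> bytes
        self._refs = {}    # handle -> reference count
        self._next = 0

    def add(self, data):
        handle = self._next
        self._next += 1
        self._blobs[handle] = data
        self._refs[handle] = 1   # the caller (e.g. a history entry) owns this reference
        return handle

    def acquire(self, handle):
        self._refs[handle] += 1
        return handle

    def release(self, handle):
        self._refs[handle] -= 1
        if self._refs[handle] == 0:
            # Nobody (neither the LRU tier nor history) needs this copy any more.
            del self._refs[handle]
            del self._blobs[handle]

    def get(self, handle):
        return self._blobs[handle]

    def __contains__(self, handle):
        return handle in self._blobs


class LruUrlCache:
    """Tier 2 (hypothetical): URL -> handle map with LRU eviction."""

    def __init__(self, store, capacity):
        self._store = store
        self._capacity = capacity
        self._entries = OrderedDict()

    def put(self, url, handle):
        if url in self._entries:
            # A newer copy replaces the old one in the LRU tier only.
            self._store.release(self._entries.pop(url))
        self._entries[url] = self._store.acquire(handle)
        while len(self._entries) > self._capacity:
            _, evicted = self._entries.popitem(last=False)
            self._store.release(evicted)

    def lookup(self, url):
        handle = self._entries.get(url)
        if handle is not None:
            self._entries.move_to_end(url)   # mark as most recently used
        return handle
```

With this split, refetching a page puts the new copy in the LRU tier while a history entry that still holds the old handle keeps seeing exactly what was displayed before.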
I think your solution is a good idea. Having two levels of the cache separates out some stuff that I was combining in mine. What I was proposing was a single cache that you can search both by a unique ID or reference and by URL. You propose a data store indexable by unique ID/reference, and a separate search mechanism; this independence would have three major benefits:
1. It makes it easier to understand: the real "cache" is just a way of storing data that can be expired, and all other ways of accessing it are just views into it.
2. It would make the cache easily searchable in other ways down the road.
3. It would make it possible to make the URL search more efficient by not placing (for example) POSTs and '?' GETs in the search list, since they are not ever really searched by URL anyway. Smaller data size means faster search.
I agree that it should be possible for two copies of the same URL to be in the cache. From that point of view, the cache needs a little work, IMO. It should just be a store with a bunch of entries which can be expired. I disagree, however, that the browser should use this feature to its limit, and require the cache to keep a copy of every single page in history until the history item goes away. The Netscape guys here have a very good point, in that users could *easily* visit hundreds of pages in the history of a single browser window. If anything, the cache should assign priority to those pages that are most recent in browser history. And while we're talking about content that should be kept around for history lists, if we want to be strict about the RFC, we have to keep all *images* around too. What if they are graphs, for example, dependent on time--like perhaps an up-to-the-minute stock chart? Now we're talking an extreme memory burden. Memory has to be a factor. However, it could be OK (for web developers' sake) to turn on the "keep all pages around" feature as a Pref, or (probably better) "keep n pages in history around".
Then the question becomes whether to keep the images too.
Oh, one more thing that makes a separation of the URL search index from the data store itself desirable: the HTTP directive that says not to cache the item. This way, you can keep the non-cached item around in the data store and just not keep the index in the URL search. The expiration policy could be improved, too. The first things you get rid of from the cache (you *always* get rid of them) are entries that are non-searchable (like POST or '?' GET or no-cache entries) and which do not have any references to them. Really, you could safely get rid of these as soon as the reference count goes to 0. When an entry expires from the cache (the directive that says cache for x minutes) the entry becomes a no-cache entry, too, and is removed from the search list.
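To make the store/index split and the expiration rule above concrete, here is a rough Python sketch. Everything in it is an assumption for illustration: `Store`, `Entry`, and the `method`/`no_cache` parameters are made-up names, not anything in the existing cache code. It shows a data store keyed by ID, a URL index that only lists searchable entries, and the rule that an entry which is both unindexed and unreferenced is dropped immediately.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    url: str
    body: bytes
    refcount: int = 1   # whoever requested the content holds the first reference


class Store:
    """Hypothetical data store with a separate URL search index."""

    def __init__(self):
        self.entries = {}     # entry id -> Entry (the data store itself)
        self.url_index = {}   # url -> entry id (searchable entries only)
        self._next_id = 0

    def add(self, url, body, method="GET", no_cache=False):
        entry_id = self._next_id
        self._next_id += 1
        self.entries[entry_id] = Entry(url, body)
        # POSTs, '?' GETs and no-cache responses are stored but never indexed by URL.
        if method == "GET" and "?" not in url and not no_cache:
            previous = self.url_index.get(url)
            self.url_index[url] = entry_id
            if previous is not None:
                self._collect(previous)   # the older copy is no longer searchable
        return entry_id

    def lookup(self, url):
        return self.url_index.get(url)

    def release(self, entry_id):
        self.entries[entry_id].refcount -= 1
        self._collect(entry_id)

    def expire(self, entry_id):
        """Freshness ran out: the entry becomes unsearchable, like a no-cache entry."""
        entry = self.entries[entry_id]
        if self.url_index.get(entry.url) == entry_id:
            del self.url_index[entry.url]
        self._collect(entry_id)

    def _collect(self, entry_id):
        entry = self.entries[entry_id]
        indexed = self.url_index.get(entry.url) == entry_id
        if entry.refcount == 0 and not indexed:
            del self.entries[entry_id]
```

A size-based or LRU limit on the indexed entries would sit on top of this; the sketch only covers the "non-searchable and unreferenced goes first" rule.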
ian: the '?' behavior comes as a recommendation from rfc 2616 on the basis of maintaining compatibility with existing browser behavior and especially older servers. we would actually be ``correct'' in not handling '?' GETs any differently from normal GETs... that is, we should be able to do the right thing just based on the values of the headers; however, some older servers may not send the correct headers, so we chose to heed the rfc's recommendation on this point.
Darin: ok, that's cool. So basically what we do is simply ignore the cache headers for GET requests with '?' in the URI? If so, then that's fair enough (I was concerned that we might be doing the opposite, namely ignoring the spec in all cases rather than just one). Cheers!
Ian: I don't know exactly what we are doing, but RFC 2616 says in section 13.9: "caches MUST NOT treat responses to such URIs [containing "?"] as fresh unless the server provides an explicit expiration time". I'm not happy with that, but it is the spec.
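As a rough illustration of the section 13.9 rule quoted above (and not a description of what the current code does, which I have not checked), the decision could be sketched in Python as follows. The function names are hypothetical, and "explicit expiration time" is assumed here to mean an `Expires` header or a `Cache-Control: max-age` directive.

```python
def has_explicit_expiration(headers):
    """True if the response carries Expires or Cache-Control: max-age (assumed definition)."""
    lowered = {name.lower(): value for name, value in headers.items()}
    if "expires" in lowered:
        return True
    directives = [d.strip().lower() for d in lowered.get("cache-control", "").split(",")]
    return any(d.startswith("max-age=") for d in directives)


def may_be_treated_as_fresh(url, headers):
    """Apply the '?' rule: query URLs are fresh only with an explicit expiration time."""
    without_fragment = url.split("#", 1)[0]
    if "?" in without_fragment:
        return has_explicit_expiration(headers)
    # For other URLs the usual freshness rules (including heuristics) may apply.
    return True
```

Note that this only decides whether a stored response may be reused without revalidation; it says nothing about whether the response may be stored for history purposes.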
Yes, the user could easily visit hundreds of pages in a single browser window. So what? That's no reason to architect the system not to save them all -- it's a reason to include user-tunable preferences to control the policy on how much to keep for how long. I did mention that another LRU mechanism could exist in the independent cache manager for this sort of policy-based expiration, which would be unrelated to Necko's LRU cache mechanism. Keep in mind that costs of memory and disk space keep dropping, and as a user, I'd rather have hundreds of history pages cached than to have the space wasted. And if I can't afford to use the space for caching, I'll set the preferences so that the policy will be only to keep a limited amount of data. I'm only recommending that the current document in each window be locked in the cache; any history document not currently displayed would be a possible candidate for eviction based on the policy preferences. In general, I'd suggest evicting cacheable content FIRST, since it's more likely to be unchanged if it must be reloaded from the network. Content that cannot be cached is more likely to be dynamic content, and therefore more important to keep around for history pages. After giving non-cacheable content priority, an LRU mechanism could be used to evict the least recently used pages first. Perhaps a ranking scheme that balances the time of last access against dynamic content and maybe size would be the best solution. (Size because expiring a few large documents might avoid the need to expire many small ones.) Regardless, all of these are expiration policy mechanisms, and (as with Usenet) there's no single solution that everyone will find acceptable, so tunable parameters are the most appropriate solution. There's good reason to store EVERY page retrieved in the independent cache, whether or not it belongs in Necko's LRU cache.
But just because it gets stored there doesn't mean it has to stay indefinitely just because it's well back in the user's history. For one thing, if the history itself has a fixed limit (hardcoded to 50 right now?), this would provide one limiting factor. (Such a limit should, of course, be possible to tune or disable in the user's prefs.) Otherwise, pages still "available" in the history could be expired early if necessary, and then perhaps reloaded from the network if needed later, with or without user interaction according to the prefs. It would make sense to remember whether or not an evicted page was cacheable, since a reasonable default would be to reload from the network transparently for static data that was expired early (with a warning message in the status bar), and put up a "press reload" page for dynamic content (as NN4 does with POST). (I'm not sure offhand if any other metadata would be worth saving.) The independent cache could also potentially store the associated DOM tree for some of the saved pages -- this would have to be expired quickest from the cache (without expiring the source), but for machines with sufficient memory, this could be a BIG performance win for very recently-accessed pages. Just think how impressive it would be if the first few times you press "Back" could be drawn nearly as fast as incremental reflows happen when you resize a window now...
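The eviction order suggested above (never evict the displayed document, evict cacheable content before non-cacheable, then least recently used first) can be sketched in a few lines of Python. This is only one possible reading of the policy; `HistoryPage`, `eviction_order`, `evict_until` and the byte budget are hypothetical names and parameters, and size-based ranking is left out.

```python
from dataclasses import dataclass


@dataclass
class HistoryPage:
    url: str
    cacheable: bool       # could the page be refetched and likely be unchanged?
    last_access: float    # timestamp of last display
    size: int             # bytes held in the independent cache
    displayed: bool = False


def eviction_order(pages):
    """Candidates sorted so that cacheable pages go first, then oldest access first."""
    candidates = [p for p in pages if not p.displayed]   # current documents stay locked
    return sorted(candidates, key=lambda p: (not p.cacheable, p.last_access))


def evict_until(pages, budget):
    """Return the pages to evict so that the total size fits within the budget."""
    total = sum(p.size for p in pages)
    evicted = []
    for page in eviction_order(pages):
        if total <= budget:
            break
        evicted.append(page)
        total -= page.size
    return evicted
```

The budget would come from a user preference; a more elaborate ranking could fold size into the sort key as well.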
the fix for the original bug report has been checked in along with the patch for bug 75679. this does not include the RFE to make history keep around "old" content... a separate bug, filed against the history component, should be opened to track that RFE.
The quote from RFC 2616 in the original description of this bug merely codifies the actual expectations of the user -- to the extent that these guidelines have been violated, the user can also expect to view the behavior as incorrect (i.e. a bug). I can't speak for the original reporter, but the quote suggests that this bug was intended to cover ALL of the deviations from the quoted RFC guidelines, not merely the most egregious ones. If the patch hasn't addressed the problem of returning old versions of content or retaining content for history when the user flushes the cache, then it would appear that this bug is NOT fixed. If this bug is under the wrong component, then change it, but don't represent a partial fix as a complete one! Finishing the job is hardly an "enhancement"...
deven: my concern is first and foremost that of providing parity with existing browser behavior. i feel that this has been done, and with it the criticality of this issue has been addressed. you may argue that more work remains to be done, but in my mind you're talking about an enhancement to the current browser requirements/feature-set. so, please feel free to file a new bug to track the new feature you describe. i suggest filing it against the history component, as the new cache already provides support for holding hard references to stored content.
Why not leave this bug open and reduce the severity to normal? After all, the description still encompasses the remaining problems, but (as you point out) it's no longer as critical as before. If the history component makes more sense, change that too. All I'm saying is that we shouldn't mark the entire bug as fixed just because a partial fix made the remainder less critical. The problem with closing this bug and making a new one is that it isn't really equivalent -- this bug has history (in the comments) and people who have marked themselves as interested (via CC's and/or votes) that would not be associated with a newly-filed bug. If you leave the bug open but reduce the severity and change the component, all of that history will remain intact. And people who are no longer interested in the remaining parts of the bug can always remove themselves from the CC list or cancel their vote. In general, this seems like a cleaner way to handle partial fixes that address the critical aspects of a bug -- there seems to be a preference for closing bugs entirely and generating new reports for old problems, which seems like a poor solution to tracking an old problem...
Oh, and I'm still not talking about an enhancement -- I'm talking about a bug that happens to also exist in previous browsers like NN4. Bug parity does not an enhancement make.
I agree...file a new bug, against Session History (not Global History)
Deven: don't morph this bug. Do cite it via "bug 56346" or "bug #56346" or equivalent (bugzilla will linkify). You're describing a different bug, in a different component. /be
With recent builds, if I type some comments in a bugzilla page and an error in the cc line, click Commit, get the error page telling me to go back, and do so, I reload the page with none of my changes there -- whatever comments and changes I made are lost forever. A week ago, comments were remembered. Is that a side effect of this cache fix? Is it a different bug? Already filed? It's a serious usability regression for people who use bugzilla a lot.
akkana: i'm pretty certain that that bug was around before i landed.
This sounds more like frameset restoration (ie. layoutHistoryState) stuff than the cache to me :-) -- rick
I'm looking into the problem with form value restoration in bugzilla. Bug 74639 already addresses this bug.
*** Bug 76150 has been marked as a duplicate of this bug. ***
It seems that bug 55583 (view-source should show original source) may be the best place to discuss the remaining issues at this point...