Closed Bug 12274 (prefetch) Opened 25 years ago Closed 22 years ago

RFE: Browser should prefetch LINK tag documents

Categories

(Core :: Networking: Cache, enhancement, P5)

Tracking

VERIFIED FIXED
mozilla1.2alpha

People

(Reporter: sirilyan, Assigned: darin.moz)

Details

(Keywords: embed, topembed, Whiteboard: [parity-webtv])

Attachments

(3 files, 4 obsolete files)

From a discussion in netscape.public.mozilla.wishlist, under the subject "browse in a non perishable cache and silent browse ahead":

Matt Fletcher wrote:
>
> Richard Shaw wrote:
> [snip non-perishable cache idea]
> > Also how about as an alternative to silent download, silent browse
> > ahead. This would when the user is doing nothing would, use up the
> > bandwidth by browsing ahead through the links on a page putting them
> > in a temporary cache only adding them to a normal cache if they are
> > actually visited. This temporary cache should be emptied on exit.
>
> A lot of ISPs and websites would blacklist Mozilla if it did this. For
> instance, imdb.com had a section on how web 'accelerators' really slow
> down the web (i.e. they waste bandwidth on pages at which one never
> looks).
>
> A cvs-like system that checks frequently viewed pages for changes and
> updates those that did might work well with an expanded cache (whether
> your idea or another form). I believe this to be a more net friendly
> and practical option.
>
> Fletch

(The article Matt mentions can be found at http://www.imdb.com/irony.)

This has actually come up a few times in npm.wishlist and I couldn't find a Bugzilla entry on it just now, so here it is. My admittedly imperfect memory gives this list of the things mentioned for "look ahead" caching:

1. Any look ahead solution should respect the robots.txt exclusion standard, at a bare minimum.
2. Ideally any look ahead will be history based, not link based. Preloading pages you have already visited and will probably visit again is good; preloading pages that you may or may not visit because they are linked from whatever's in the browser window is bad.
3. An implementation of the LINK tag relationship values "prev" and "next" might also cache the appropriate resources, regardless of whether they were seen or not, in the hopes that someone who is going through a series of documents will probably read them in series.

There are some good reasons to close this bug early, though:

1. There are already many third-party products that provide these capabilities by acting as proxy servers, so this may be just another creeping feature for Mozilla.
2. Implementing this *badly* for Mozilla would be worse than doing nothing at all.

Somewhat related is bug #11644 asking for more control over what types of resources get cached.
Assignee: gagan → nobody
Summary: [RFE] Browser should look ahead and cache [frequently visited] sites automatically → [HELP WANTED] Browser should look ahead and cache [frequently visited] sites automatically
Assigned to nobody@mozilla.org to flag as unclaimed feature request.
Note: WebTV precaches the document pointed to by <link rel="next"> elements, so if we do anything it should probably be that.
Bulk move of all Cache (to be deleted component) bugs to new Networking: Cache component.
Assignee: nobody → fur
->fur
Target Milestone: M20
We probably won't get to this for this release, but I'm going to leave it in the list in case someone wants to volunteer.
Assigning fur's cache bugs to Gordon. He can split them up with davidm.
Keywords: helpwanted
Summary: [HELP WANTED] Browser should look ahead and cache [frequently visited] sites automatically → Browser should look ahead and cache [frequently visited] sites automatically
spam, changing qa contact from paulmac to tever@netscape.com on networking/RDF bugs
QA Contact: paulmac → tever
marking rfe.
Summary: Browser should look ahead and cache [frequently visited] sites automatically → RFE: Browser should look ahead and cache [frequently visited] sites automatically
Moving to target milestone FUTURE. We'll take a look at it again after we ship N6.
Target Milestone: M20 → Future
Changing component and summary according to hixie's suggestion.
Component: Networking: Cache → Parser
Summary: RFE: Browser should look ahead and cache [frequently visited] sites automatically → RFE: Browser should prefetch LINK tag documents
Reassigning to component & QA owners.
Assignee: gordon → harishd
QA Contact: tever → bsharma
This isn't a parser bug. Giving to nobody.
Assignee: harishd → nobody
Component: Parser → Networking: Cache
Priority: P3 → P5
Who puts in next, index and prev links if they don't want them to be used? I don't think this is bad in any way; it is very good for the user. The browser should precache next, index and prev, in that order. index and prev will be in the cache 90% of the time, so they are no big deal for the web load. But precaching them will be good 90% of the remaining 10% of the time. IMHO, this RFE is much more important than the link navigation GUI. Any webpage must define navigation in HTML anyway (most browsers don't implement link navigation) :-( However important, and I would *love* it for 1.0, it *is* an RFE and it is *not* critical ;-)
I also want this feature, although it's a minor problem. For example, http://www.cessna.com has this problem. On other browsers including NS4.78, prefetch works; i.e., hovering the mouse cursor over the middle menu items ("Our Aircraft", "Owners..", etc.) shows a different picture on the right. On Mozilla, it's very slow (and it actually loads the picture over the Internet each time), but on NS4.78 it's very quick and doesn't use the net at all. Apparently NS4.78 prefetches pictures.
Whiteboard: parity-webtv
I don't think it would make sense on Google: most of the time, I just look at the first page of results. (Google doesn't use link rel=next now, but they might add it if browsers get good keyboard shortcuts to access link rel=next.) On the other hand, this would make sense on most sites that use link rel=next, and rel=index certainly makes sense. Overall I think this rfe would be good to implement.
Prefetching is generally considered evil. To take an example close to heart, it would approximately double the load on Bugzilla.
Whiteboard: parity-webtv → WONTFIX? parity-webtv
I don't think I've ever used Bugzilla's first/last/prev/next links, and I'm not sure Bugzilla is even using those links correctly.
I have used them, and they are used correctly.
-> darin
Assignee: nobody → darin
we recently discussed prefetching with the caveat that the additional downloads would be low-priority, limited possibly to a single network connection, and preempted by new page loads. with our existing support for partial cache entries and byte range requests, we should be able to aggressively drop partial prefetches without sacrificing all of the work done to prefetch. *** This bug has been marked as a duplicate of 159044 ***
Status: NEW → RESOLVED
Closed: 22 years ago
Resolution: --- → DUPLICATE
oops, i closed the wrong bug :(
Status: RESOLVED → REOPENED
Resolution: DUPLICATE → ---
*** Bug 159044 has been marked as a duplicate of this bug. ***
Status: REOPENED → ASSIGNED
Target Milestone: Future → mozilla1.2alpha
I am pretty worried about comment #16... in our recent discussions, we really haven't addressed server load as a potential concern. sure, any prefetching done might improve browser performance, but the overall benefits really depend on the likelihood that a prefetched document will actually be viewed. perhaps this is reason enough to trigger off a different tag? that way, servers could opt in. maybe a HTTP header would even be more appropriate, so as to ensure that only server admins get to choose the appropriate load level due to prefetching. or, maybe that shouldn't be any of mozilla's concern?! hmm...
IMO, Bugzilla should hide the list links unless you click "Go through this list one bug at a time". The current list links are wrong more often than they're right, take up a lot of space (especially if you haven't disabled the site navigation bar), and break caching. Bugzilla's strangeness should not stop us from making Mozilla faster on sites where Next actually means something.
updated status whiteboard and keywords
Whiteboard: WONTFIX? parity-webtv → [parity-webtv]
Depends on: 142255
Depends on: 163746
Attached file v0 patch (.tar.gz) (obsolete)
this patch implements a very simple version of prefetching. during page load it collects all of the link rel=next hrefs. when the page finishes loading, the prefetch code loads the first collected URL in the background. when that completes, it loads the next URL. it does this until all URLs are loaded or until the user clicks on a link or otherwise starts loading a new page. i've added some code to kill off prefetched loads of content that either is already from the cache or could not be taken from the cache without server revalidation. this should hopefully avoid taxing servers that misuse link rel=next (e.g., bugzilla).
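The queueing behaviour described in this comment can be modelled in a few lines. This is only a sketch of the idea, not the code in the patch: `PrefetchQueue`, `collect`, `step`, and `cancel_all` are hypothetical names, the `fetch` callback stands in for the real background load, and the cache checks mentioned above are left out.

```python
from collections import deque


class PrefetchQueue:
    """Toy model: queue rel=next URLs and load them one at a time."""

    def __init__(self, fetch):
        self._fetch = fetch      # hypothetical callable(url) that loads url in the background
        self._pending = deque()  # URLs collected during page load, in document order

    def collect(self, url):
        """Called during page load for every <link rel=next> href seen."""
        self._pending.append(url)

    def step(self):
        """Prefetch the next queued URL; return False once the queue is empty."""
        if not self._pending:
            return False
        self._fetch(self._pending.popleft())
        return True

    def cancel_all(self):
        """Called when the user clicks a link or otherwise starts a new page load."""
        self._pending.clear()
```

In this model the page-finished-loading event would call `step()` repeatedly, and any new page load would call `cancel_all()` so that no further prefetches start.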
for those inside the netscape firewall, there's a link rel=next testcase at http://unagi.mcom.com/~darinf/test0.html
Attached patch v0.1 patch (obsolete) — Splinter Review
this patch just tidies things up a bit. i've moved the prefetch code into the uriloader module (i think that makes sense). this patch works for the 1.0 branch and trunk (makefile.win changes only apply to the branch, of course). i haven't done the mac project changes yet.
Attachment #96109 - Attachment is obsolete: true
Attachment #96364 - Flags: review+
Comment on attachment 96364 v0.1 patch r=dougt I spoke with darin about this patch and it is a great start. We/I would like to add a few things like per window precache queues, an attribute on nsICacheableChannel to determine if a channel can or should be precached, and maybe some kind of UI that shows that there is precaching going on. Worse is better. Let's get this in.
one change i'd like to make is to not prefetch links containing a query string. those most likely correspond to dynamic content that will in most cases come back without cache headers forcing us to hit the server when loading the content for real. so, the value of prefetching such content is minimal. this nicely addresses the bugzilla problem BTW :-)
Attached patch v0.2 patch (obsolete) — Splinter Review
ok, minor change... just blocked URLs w/ query strings, and touched up some comments.
Attachment #96364 - Attachment is obsolete: true
Comment on attachment 96936 v0.2 patch carrying forward r=dougt
Attachment #96936 - Flags: review+
Attached patch v1 patch (obsolete) — Splinter Review
ok, this patch changes things a bit. instead of modifying the HTML content sink to invoke nsIPrefetchService::PrefetchURI, the prefetch service now hooks itself up as a tag observer (implementing nsIElementObserver). this way it'll be notified whenever the HTML parser encounters a <link> tag. i've also added code to make the prefetch service observe HTTP response headers. this would allow, for example, a proxy cache to dynamically introduce prefetch requests for content that is statistically very popular.
Attachment #96936 - Attachment is obsolete: true
Comment on attachment 97232 v1 patch r=dougt
Attachment #97232 - Flags: review+
replaced the switch statement with an if-else to simplify code. thx dougt!
Attachment #97232 - Attachment is obsolete: true
Attachment #97242 - Flags: review+
hey darin, this looks really good. the ownership model of the nsIURI within the nsPrefetchNode is kinda scary :-) You might want to add a comment or two explaining it ;-) also, could you leverage the loadgroup (associated with each document) to hold the prefetch requests... that way, you won't need to worry about cancelling them when a new document load is initiated... i don't think this is a big deal... just a thought. -- rick
Comment on attachment 97242 v1.1 patch - revised per comments from dougt over AIM sr=rpotts@netscape.com
Attachment #97242 - Flags: superreview+
thx for the comments rick... 1- yeah, i'll add some comments on the URI ownership 2- i don't think using the loadgroup of the document would work. consider the case of two documents. one does prefetching, and in the other a user clicks on a link. now, loading the new page must contend with the prefetch traffic. whereas what i'd really like is to kill off all prefetch traffic when any other part of mozilla requests a page load.
patch landed on trunk... minus mac project changes. working on that now.
For the sake of people that have download limits on their accounts, I hope that an option has been put into place to turn this on and off as per a user's requirements!!
I sure hope so too! I don't like the idea of URL prefetching. I have scarce bandwidth. Let me turn this off and I'll shut up. :-)
Chris, aaronl: Only links in the form of <link rel="next" href="..."> are prefetched, not <a href="...">. And there is a preference; disable prefetching with: user_pref("network.prefetch-next", false);
the preference is only configurable from all.js for now. i probably should have made it dynamic (i.e., settable via prefs.js). the other thing to note is that prefetching will only occur when the browser has nothing else to do w/ the network connection, and furthermore any other browser requests to load anything will kill off any and all prefetch requests. we are also very selective in what we'll prefetch. that is, we only prefetch documents that can be reused.
what is now prefetched? Static files linked with <link rel="next"/> only, right?
any http:// URL that does not contain a ?query string will be prefetched. if the http headers indicate that the document would have to be fetched fresh each and every time then we'll cancel the prefetch. in other words, yes, only static content will be prefetched.
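The URL rule stated here is simple enough to sketch. The function name `is_prefetchable` is hypothetical and the check follows the wording of the comment literally (http scheme only, no query string); it does not model the later header-based cancellation.

```python
from urllib.parse import urlsplit


def is_prefetchable(url):
    """Return True for http:// URLs that carry no ?query string."""
    parts = urlsplit(url)
    # urlsplit lower-cases the scheme, so "HTTP://" is accepted as well.
    return parts.scheme == "http" and parts.query == ""
```

For example, `is_prefetchable("http://example.org/chapter2.html")` is true, while a Bugzilla-style URL such as `http://example.org/show_bug.cgi?id=12274` is rejected because of its query string.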
Assuming it works as described above, I am impressed by the thought that has gone into this. Let's hope it makes pages appear to render nice and fast! Do we also support HTTP "Link" headers? They are supported for linking to CSS style sheets, so presumably it should work for this too, but they are not supported for site icons, so presumably we still don't have a single Link service yet, and this means it probably won't work for this either...
the prefetch service is a HTTP header observer, which means that it will pick up HTTP Link headers, but one downside is that HTTP only reports headers that are new. IOW, loading a page from the cache will not trigger HTTP header observers. that's one way in which Link: header differs from LINK tag. perhaps a unified Link service would be the best way to resolve these differences.
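As a rough illustration of what picking up a Link header involves, the sketch below pulls the rel=next targets out of a header value. It is an assumption-laden toy, not the observer code in the patch: `next_targets` is a hypothetical name, and the parser does not handle quoted parameter values that themselves contain commas or semicolons.

```python
import re

# One link-value: <target> followed by zero or more ";name=value" parameters.
_LINK_VALUE = re.compile(r'<([^>]*)>((?:\s*;\s*[^;,]+)*)')


def next_targets(link_header):
    """Return the targets of all link-values whose rel contains "next"."""
    targets = []
    for target, params in _LINK_VALUE.findall(link_header):
        for param in params.split(";"):
            name, _, value = param.strip().partition("=")
            if name.lower() != "rel":
                continue
            # rel may hold several space-separated relation types.
            if "next" in value.strip('"').lower().split():
                targets.append(target)
                break
    return targets
```

A header such as `Link: </page2.html>; rel=next` would yield `["/page2.html"]`; resolving that relative target against the document URL is left out here.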
We have several consumers of link elements and headers, each done slightly differently (e.g. the stylesheet thing probes, rather than listening, and the site icon code only does <link> elements). One day someone will snap and unify all this, hopefully. :-) Great to hear that Link: headers were taken into account though! How about <meta http-equiv="link"> ?
nope... looks like the <meta HTTP-EQUIV="link" ...> would be missed :-( oh well... will work on a follow up patch! 1) make prefetching a user preference 2) add support for <meta HTTP-EQUIV="link" CONTENT="...">
checked in mac project changes.. so this should be in all builds of mozilla 1.2 alpha :-)
Status: ASSIGNED → RESOLVED
Closed: 22 years ago
Resolution: --- → FIXED
Would it be worth a special case to not stop the prefetching when the user clicks on a link to the page that is currently being prefetched? It seems that the probability of this happening is quite high on the rel=next case. I understand that the partial cache entries and byte range requests help with this somewhat, but the worst case is still tearing a tcp connection down, re-establishing it and sending a new request, right?
marko, if the next page references a large document that is being prefetched, and the user clicks a link to advance to the next page, then if we don't cancel the prefetched load, loading of the next page will appear to stop on the prefetched document. only when the prefetched document is entirely downloaded will the document appear to snap into place/view. this happens because the prefetch load and the new load are not at all tied together. instead, the second one gets blocked waiting for access to the cache. so, while it might be true that not canceling would give better page load times, the result is something that most likely won't appeal to many users. i think it's better to cancel the prefetch partway thru, so we can go ahead and display what we've already got ASAP.
Wouldn't it be cool if instead of blocking, it noticed that a fetch was already in progress for that resource, and simply hooked into it?
it would certainly be cool, but not at all easy to implement.
Concerning comment #38 (2): Why kill this rather than just suspend it until any other activity ends? (Example situation: a webpage with search results. I open some of them in new windows/tabs to see if they contain what I want, but if not, I want to check the next page of results. It would be nice if it was there already.) Concerning comment #47: I wouldn't worry. If the document is already in cache, then there's a good chance that either the "next" document will be there as well, or it will not be needed, since if the user didn't follow it the first time, they won't follow it now either. Worth consideration: Mozilla vs Law. "By clicking on this link you agree to terms of conditions...". The server logs will say you followed this link even though you didn't. Luckily webmasters rather don't implement such points as <link... Also consider the possibility of maliciously formed pages using this feature to exploit remote server vulnerabilities. More on topic: http://www.phrack.org/show.php?p=57&a=10 Limiting the number of links to follow to a fairly low number (8?) would prevent abuse of this.
Bartosz wrote:
>Worth consideration: Mozilla vs Law. "By clicking on this link you agree to
>terms of conditions...". The server logs will say you followed this link even
>though you didn't. Luckily webmasters rather don't implement such points as
><link...

As I was reading through this bug I was thinking exactly the same thing. TBH I was considering whether 'next' might be considered a 'submit' link (e.g. in a wizard/druid of some kind), in which case preloading it would submit without the user's express permission, as Bartosz said. But I'm sure there are worse ways this could be exploited, much as I think it's a good idea ;| At some level this might be considered a bug in the web-app, but if a user switches to a gecko browser and suddenly finds that their web-email inbox is empty due to mozilla precaching the 'remove this email' link, there could be problems...
Why would anyone ever say <link rel="next" href="./delete-mail"> ...and would they ever do it without a query string? I doubt it.
here's a site that uses <link rel=next> without any cache control headers. the pages appear to be served up using PHP without the use of query strings. as a result we prefetch each next page, but then kill off the load once we see that it doesn't have any cache headers. needless to say this is bad for a number of reasons. perhaps the best recourse is to evangelize such sites. hmm... http://www.gnome.org/start/2.0/releasenotes.html
I take it "Last-Modified" is one of those cache headers? VERIFIED FIXED. This bug makes reading the HTML4 spec so much nicer. Thanks Darin. You kick ass.
Status: RESOLVED → VERIFIED
yes, Last-Modified is good enough, because it lets us take a guess at how long the document can remain stale ((date - lastmodified)/10), and when the document does expire, all we have to do is send a conditional If-Modified-Since request to the server (allowing the server to say 304 not modified).
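The (date - lastmodified)/10 guess can be worked through with real header values. The sketch below assumes both headers are present and well formed; `heuristic_lifetime_seconds` is a hypothetical helper name, not a function from the patch.

```python
from email.utils import parsedate_to_datetime


def heuristic_lifetime_seconds(date_header, last_modified_header):
    """Estimate how long a response may be reused: (Date - Last-Modified) / 10."""
    date = parsedate_to_datetime(date_header)
    last_modified = parsedate_to_datetime(last_modified_header)
    age = (date - last_modified).total_seconds()
    # A Last-Modified later than Date gives no usable estimate; treat it as zero.
    return max(age, 0) / 10
```

For a document served with `Date: Tue, 15 Oct 2002 12:00:00 GMT` and `Last-Modified: Sat, 05 Oct 2002 12:00:00 GMT`, the ten-day gap gives a lifetime of one day (86400 seconds), after which a conditional If-Modified-Since request would be needed.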
Uhm... why do I see two downloads of the same document with ethereal? I'm browsing the HTML specs and I see two full GETs for each page...! I'm using build 2002091505.
With a current trunk build on win2k, I can't reproduce this (although instead of using ethereal I was just checking breakpoints in the code and also checking the server logs for a prefetch testcase modelled on the w3 TR documents). I see a GET for the first document, then a prefetch of the second. On moving to the second document, I see a prefetch of the third document, etc. Note: if you click on the 'Next' link in those pages before the prefetch of that next document was complete then this would cancel that pending prefetch and a new (partial) GET request would be issued for the next document. But this is by design.
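The partial GET mentioned here can be pictured as an ordinary HTTP range request for the bytes the cancelled prefetch did not get to. The sketch below uses standard HTTP header names, but the function `resume_request_headers` and its parameters are hypothetical and say nothing about how the cache actually stores partial entries.

```python
def resume_request_headers(bytes_cached, validator):
    """Build request headers for resuming a partially cached document.

    bytes_cached: number of bytes already stored for the entry.
    validator: the ETag or Last-Modified value saved with the partial entry.
    """
    if bytes_cached <= 0 or not validator:
        return {}  # nothing reusable in the cache: issue a plain GET
    return {
        "Range": "bytes=%d-" % bytes_cached,  # ask only for the missing tail
        "If-Range": validator,  # if the document changed, the server sends the full body
    }
```

With 4096 bytes cached, the resumed request would carry `Range: bytes=4096-`, so the work done by the cancelled prefetch is not thrown away.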
nick: you should also verify that ethereal isn't "lying to you" ... sometimes (especially under windows) it'll report a packet twice. check the sequence numbers to be certain you aren't seeing an ethereal bug. barring that there is always the possibility that you are loading pages that do not allow caching. after loading one of the pages, you could look at the page info for the page to see if the server specified an expiration time. once that expiration time is past, the prefetched document will have to be validated the next time it is loaded.
Attached patch 1.0 branch patch
combines original patch + all subsequent fixes (bug 166647, bug 171102, and bug 173278).
there's quite a bit of whitespace-noise in the real branch patch. review this one instead.
Alias: prefetch
While I agree that PRTimeToSeconds should really live in NSPR, we should minimize having multiple definitions of the same function (I count 3 more within Necko alone: FTP has 2 and HTTP has another one). Can we at least consolidate that into Necko's common util.h file? (I forget the exact name; it's been that long!) Other than this nitpick it looks great! r=gagan
Comment on attachment 102309 1.0 branch patch w/ no whitespace changes r=gagan
Attachment #102309 - Flags: review+
gagan: thanks for the review. i talked to wtc about PRTimeToSeconds (can't remember the bug no), and he decided that he didn't want to put that function in NSPR because (1) no way to represent dates before 1970 and (2) no way to represent dates after ~2130. i see his point, and i'm hoping to eventually clean things up so that we can have one instance of this function.
Comment on attachment 102309 1.0 branch patch w/ no whitespace changes sr=rpotts@netscape.com
Attachment #102309 - Flags: superreview+
Comment on attachment 102309 1.0 branch patch w/ no whitespace changes a=rjesup@wgate.com for 1.0 branch with the proviso that the default be that the feature is disabled unless the pref is used to enable it, as per driver discussions and Darin's agreement.
Attachment #102309 - Flags: approval+
default disabled sounds perfectly reasonable to me.
Discussed in bBird team meeting. We will give adt approval for this as soon as test cases are attached to this bug. Please make sure Bindu is cc'ed on bug. We need the checkin to happen by 10/16 COB. I will watch this bug and immediately plus it when the test cases are attached. We are going to reserve the right to ship with this turned off if problems are discovered.
this URL contains links to some examples (sorry for the internal site): http://unagi/~darinf/prefetch/testcases.html
Plussing per email from darin indicating lyecies' approval to ship pref'ed on, and attached test cases.
Keywords: adt1.0.2 → adt1.0.2+
Blocks: grouper
marking fixed1.0.2 belatedly. patch landed 10/16.
Bulk adding topembed keyword. Gecko/embedding needed.
Keywords: topembed
Verified on 2002-10-25-branch build on Win 2K. The test case works as expected.