Closed Bug 12274 (prefetch) Opened 25 years ago Closed 22 years ago

RFE: Browser should prefetch LINK tag documents

Categories

(Core :: Networking: Cache, enhancement, P5)


Tracking


VERIFIED FIXED
mozilla1.2alpha

People

(Reporter: sirilyan, Assigned: darin.moz)


Details

(Keywords: embed, topembed, Whiteboard: [parity-webtv])

Attachments

(3 files, 4 obsolete files)

From a discussion in netscape.public.mozilla.wishlist, under the subject "browse
in a non perishable cache and silent browse ahead":

Matt Fletcher wrote:
>
> > Richard Shaw wrote:
> [snip non-perishable cache idea]
> > Also, how about silent browse ahead as an alternative to silent
> > download?  This would, when the user is doing nothing, use spare
> > bandwidth by browsing ahead through the links on a page, putting them
> > in a temporary cache and only adding them to the normal cache if they
> > are actually visited.  This temporary cache should be emptied on exit.
>
> A lot of ISPs and websites would blacklist Mozilla if it did this.  For
> instance, imdb.com had a section on how web 'accelerators' really slow
> down the web (i.e. they waste bandwidth on pages at which one never
> looks).
>
> A cvs-like system that checks frequently viewed pages for changes and
> updates those that did might work well with an expanded cache (whether
> your idea or another form).  I believe this to be a more net friendly
> and practical option.
>
> Fletch

(The article Matt mentions can be found at http://www.imdb.com/irony.)

This has actually come up a few times in npm.wishlist and I couldn't find a
Bugzilla entry on it just now, so here it is.  My admittedly imperfect memory
gives this list of the things mentioned for "look ahead" caching:

1.  Any look ahead solution should respect the robots.txt exclusion standard, at
a bare minimum.

2.  Ideally any look ahead will be history based, not link based.  Preloading
pages you have already visited and will probably visit again is good; preloading
pages that you may or may not visit because they are linked from whatever's in
the browser window is bad.

3.  An implementation of the LINK tag relationship values "prev" and "next"
might also cache the appropriate resources, regardless of whether they were seen
or not, in the hopes that someone who is going through a series of documents
will probably read them in series.
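As an illustration, the LINK relationships mentioned in point 3 look like this in practice (the filenames here are placeholders for a hypothetical multi-part document):

```html
<head>
  <!-- hypothetical chapter page in a series of documents -->
  <link rel="prev" href="chapter2.html">
  <link rel="next" href="chapter4.html">
  <link rel="index" href="index.html">
</head>
```

A browser following this RFE could fetch `chapter4.html` in the background once the current page finishes loading.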

There are some good reasons to close this bug early, though:

1.  There are already many third-party products that provide these capabilities
by acting as proxy servers, so this may be just another creeping feature for
Mozilla.

2.  Implementing this *badly* for Mozilla would be worse than doing nothing at
all.

Somewhat related is bug #11644 asking for more control over what types of
resources get cached.
Assignee: gagan → nobody
Summary: [RFE] Browser should look ahead and cache [frequently visited] sites automatically → [HELP WANTED] Browser should look ahead and cache [frequently visited] sites automatically
Assigned to nobody@mozilla.org to flag as unclaimed feature request.
Note: WebTV precaches the document pointed to by <link rel="next"> elements, so
if we do anything it should probably be that.
Bulk move of all Cache (to be deleted component) bugs to new Networking: Cache
component.
Assignee: nobody → fur
->fur
Target Milestone: M20
We probably won't get to this for this release, but I'm going to leave it in
the list in case someone wants to volunteer.
Assigning fur's cache bugs to Gordon. He can split them up with davidm.
Keywords: helpwanted
Summary: [HELP WANTED] Browser should look ahead and cache [frequently visited] sites automatically → Browser should look ahead and cache [frequently visited] sites automatically
spam, changing qa contact from paulmac to tever@netscape.com on networking/RDF 
bugs
QA Contact: paulmac → tever
marking rfe.
Summary: Browser should look ahead and cache [frequently visited] sites automatically → RFE: Browser should look ahead and cache [frequently visited] sites automatically
Moving to target milestone FUTURE. We'll take a look at it again after we ship 
N6.
Target Milestone: M20 → Future
Changing component and summary according to hixie's suggestion.
Component: Networking: Cache → Parser
Summary: RFE: Browser should look ahead and cache [frequently visited] sites automatically → RFE: Browser should prefetch LINK tag documents
Reassigning to component & QA owners.
Assignee: gordon → harishd
QA Contact: tever → bsharma
This isn't a parser bug.  Giving to nobody.
Assignee: harishd → nobody
Component: Parser → Networking: Cache
Priority: P3 → P5
Who put in next, index and prev links if they don't want them
to be used? I don't think this is bad in any way; it's very good
for the user. The browser should precache next, index and prev
in that order. index and prev will be in the cache 90% of the
time, so they are no big deal to the web load. But precaching
them will be good 90% of the remaining 10% of the time.

IMHO, this RFE is much more important than the link navigation
GUI. Any webpage must define navigation in HTML anyway (most
browsers don't implement link navigation) :-(

However important, and I would *love* it for 1.0, it *is* an
RFE and it is *not* critical ;-)
I also want this feature, although it's a minor problem.  For example,
http://www.cessna.com
has this problem.  On other browsers including NS4.78, prefetch works; ie
hovering mouse cursor over the middle menu "Our Aircraft", "Owners.." etc you
will see a different picture on the right.  On mozilla, it's very slow(and
actually it loads the picture each time via Internet), but NS4.78, it's very
quick and it doesn't use the net at all.  Apparently NS4.78 prefetches pictures.
Whiteboard: parity-webtv
I don't think it would make sense on Google: most of the time, I just look at
the first page of results.  (Google doesn't use link rel=next now, but they
might add it if browsers get good keyboard shortcuts to access link rel=next.)

On the other hand, this would make sense on most sites that use link rel=next,
and rel=index certainly makes sense.  Overall I think this rfe would be good to
implement.
Prefetching is generally considered evil. To take an example close to heart, it
would approximately double the load on Bugzilla.
Whiteboard: parity-webtv → WONTFIX? parity-webtv
I don't think I've ever used Bugzilla's first/last/prev/next links, and I'm not
sure Bugzilla is even using those links correctly.
I have used them, and they are used correctly.
-> darin
Assignee: nobody → darin
we recently discussed prefetching with the caveat that the additional downloads
would be low-priority, limited possibly to a single network connection, and
preempted by new page loads.  with our existing support for partial cache
entries and byte range requests, we should be able to aggressively drop partial
prefetches without sacrificing all of the work done to prefetch.

*** This bug has been marked as a duplicate of 159044 ***
Status: NEW → RESOLVED
Closed: 22 years ago
Resolution: --- → DUPLICATE
oops, i closed the wrong bug :(
Status: RESOLVED → REOPENED
Resolution: DUPLICATE → ---
*** Bug 159044 has been marked as a duplicate of this bug. ***
Status: REOPENED → ASSIGNED
Target Milestone: Future → mozilla1.2alpha
I am pretty worried about comment #16... in our recent discussions, we really
haven't addressed server load as a potential concern.  sure, any prefetching
done might improve browser performance, but the overall benefits really depend
on the likelihood that a prefetched document will actually be viewed.  perhaps
this is reason enough to trigger off a different tag?  that way, servers could
opt in.  maybe a HTTP header would even be more appropriate, so as to ensure
that only server admins get to choose the appropriate load level due to
prefetching.  or, maybe that shouldn't be any of mozilla's concern?!  hmm...
IMO, Bugzilla should hide the list links unless you click "Go through this list
one bug at a time".  The current list links are wrong more often than they're
right, take up a lot of space (especially if you haven't disabled the site
navigation bar), and break caching.  Bugzilla's strangeness should not stop us
from making Mozilla faster on sites where Next actually means something.
updated status whiteboard and keywords
Whiteboard: WONTFIX? parity-webtv → [parity-webtv]
Depends on: 142255
Depends on: 163746
Attached file v0 patch (.tar.gz) (obsolete) —
this patch implements a very simple version of prefetching.  during page load
it collects all of the link rel=next hrefs.  when the page finishes loading,
the prefetch code loads the first collected URL in the background.  when that
completes, it loads the next URL.  it does this until all URLs are loaded or
until the user clicks on a link or otherwise starts loading a new page.

i've added some code to kill off prefetched loads of content that either is
already from the cache or could not be taken from the cache without server
revalidation.  this should hopefully avoid taxing servers that misuse link
rel=next (e.g., bugzilla).
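The queueing behaviour the patch describes can be sketched roughly as follows. This is an illustrative model, not the actual Mozilla code; the class and method names are invented for the sketch:

```python
from collections import deque

class PrefetchQueue:
    """Sequential background prefetcher: collect rel=next URLs during
    page load, fetch them one at a time once the page is done, and
    abandon everything if the user starts a new load."""

    def __init__(self, fetch):
        self.fetch = fetch        # callable(url); assumed to do the network I/O
        self.queue = deque()
        self.cancelled = False

    def enqueue(self, url):
        # called for each <link rel="next"> seen during page load
        self.queue.append(url)

    def on_page_load_complete(self):
        # drain the queue only after the main page has finished loading
        while self.queue and not self.cancelled:
            self.fetch(self.queue.popleft())

    def on_user_navigation(self):
        # any new user-initiated load kills all pending prefetches
        self.cancelled = True
        self.queue.clear()
```

A queue with two URLs fetches both after page load completes; if the user navigates first, nothing is fetched.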
for those inside the netscape firewall, there's a link rel=next testcase at
http://unagi.mcom.com/~darinf/test0.html
Attached patch v0.1 patch (obsolete) — Splinter Review
this patch just tidies things up a bit.  i've moved the prefetch code into the
uriloader module (i think that makes sense).  this patch works for the 1.0
branch and trunk (makefile.win changes only apply to the branch, of course).  i
haven't done the mac project changes yet.
Attachment #96109 - Attachment is obsolete: true
Attachment #96364 - Flags: review+
Comment on attachment 96364 [details] [diff] [review]
v0.1 patch

r=dougt
I spoke with darin about this patch and it is a great start.  We/I would like
to add a few things like per window precache queues, an attribute on
nsICacheableChannel to determine if a channel can or should be precached, and
maybe some kind of UI that shows that there is precaching going on.  Worse is
better.  Let's get this in.
one change i'd like to make is to not prefetch links containing a query string.
 those most likely correspond to dynamic content that will in most cases come
back without cache headers forcing us to hit the server when loading the content
for real.  so, the value of prefetching such content is minimal.  this nicely
addresses the bugzilla problem BTW :-)
Attached patch v0.2 patch (obsolete) — Splinter Review
ok, minor change... just blocked URLs w/ query strings, and touched up some
comments.
Attachment #96364 - Attachment is obsolete: true
Comment on attachment 96936 [details] [diff] [review]
v0.2 patch

carrying forward r=dougt
Attachment #96936 - Flags: review+
Attached patch v1 patch (obsolete) — Splinter Review
ok, this patch changes things a bit.  instead of modifying the HTML content
sink to invoke nsIPrefetchService::PrefetchURI, the prefetch service now hooks
itself up as a tag observer (implementing nsIElementObserver).	this way it'll
be notified whenever the HTML parser encounters a <link> tag.  i've also added
code to make the prefetch service observe HTTP response headers.  this would
allow, for example, a proxy cache to dynamically introduce prefetch requests
for content that is statistically very popular, for example.
Attachment #96936 - Attachment is obsolete: true
Comment on attachment 97232 [details] [diff] [review]
v1 patch

r=dougt
Attachment #97232 - Flags: review+
replaced the switch statement with an if-else to simplify code.  thx dougt!
Attachment #97232 - Attachment is obsolete: true
Attachment #97242 - Flags: review+
hey darin,

this looks really good.  

the ownership model of the nsIURI within the nsPrefetchNode is kinda scary :-) 
You might want to add a comment or two explaining it ;-)

also, could you leverage the loadgroup (associated with each document) to hold
the prefetch requests... that way, you won't need to worry about cancelling them
when a new document load is initiated...

i don't think this is a big deal... just a thought.

-- rick
Comment on attachment 97242 [details] [diff] [review]
v1.1 patch - revised per comments from dougt over AIM

sr=rpotts@netscape.com
Attachment #97242 - Flags: superreview+
thx for the comments rick...

1- yeah, i'll add some comments on the URI ownership

2- i don't think using the loadgroup of the document would work.  consider the
case of two documents.  one does prefetching, and in the other a user clicks on
a link.  now, loading the new page must contend with the prefetch traffic. 
whereas what i'd really like is to kill off all prefetch traffic when any other
part of mozilla requests a page load.
patch landed on trunk... minus mac project changes.  working on that now.
For the sake of people who have download limits on their accounts, I hope that
an option has been put into place to turn this on and off as per a user's
requirements!!
I sure hope so too! I don't like the idea of URL prefetching. I have scarce bandwidth. Let me turn this off and I'll shut up. :-)
Chris, aaronl: Only links in the form of <link rel="next" href="..."> are
prefetched. not <a href="...">.

and there is a preference, disable this with:
user_pref("network.prefetch-next", false);
the preference is only configurable from all.js for now.  i probably should have
made it dynamic (i.e., settable via prefs.js).  the other thing to note is that
prefetching will only occur when the browser has nothing else to do w/ the
network connection, and furthermore any other browser requests to load anything
will kill off any and all prefetch requests.  we are also very selective in what
we'll prefetch.  that is, we only prefetch documents that can be reused.
what is now prefetched? Static files linked with <link rel="next"/> only, right?
any http:// URL that does not contain a ?query string will be prefetched.  if
the http headers indicate that the document would have to be fetched fresh
each-and-everytime then we'll cancel the prefetch.  in other words, yes, only
static content will be prefetched.
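The eligibility rules described above amount to a simple pre-flight check before any network traffic. A rough sketch (the function name is invented; the real cache-header check happens only after the response starts arriving):

```python
from urllib.parse import urlparse

def is_prefetchable(url):
    """Pre-flight eligibility per the rules above: only http:// URLs
    without a query string are prefetch candidates.  Responses whose
    headers forbid reuse are cancelled later, after they arrive."""
    parts = urlparse(url)
    if parts.scheme != "http":
        return False          # only plain http:// URLs are considered
    if parts.query:
        return False          # ?query strings imply dynamic content
    return True
```

For example, a Bugzilla URL like `show_bug.cgi?id=12274` is rejected immediately, while a static page passes the pre-flight check.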
Assuming it works as described above, I am impressed by the thought that has
gone into this. Let's hope it makes pages appear to render nice and fast!

Do we also support HTTP "Link" headers? They are supported for linking to CSS
style sheets, so presumably it should work for this too, but they are not
supported for site icons, so presumably we still don't have a single Link
service yet, and this means it probably won't work for this either...
the prefetch service is a HTTP header observer, which means that it will pick up
HTTP Link headers, but one downside is that HTTP only reports headers that are
new.  IOW, loading a page from the cache will not trigger HTTP header observers.
 that's one way in which Link: header differs from LINK tag.  perhaps a unified
Link service would be the best way to resolve these differences.
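For reference, the HTTP Link header form being discussed carries the same relation as the markup, so a server or proxy could trigger a prefetch without touching the page body (the URL here is a placeholder):

```http
HTTP/1.1 200 OK
Content-Type: text/html
Link: </chapter4.html>; rel="next"
```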
We have several consumers of link elements and headers, each done slightly
differently (e.g. the stylesheet thing probes, rather than listening, and the
site icon code only does <link> elements). One day someone will snap and unify
all this, hopefully. :-)

Great to hear that Link: headers were taken into account though! How about <meta
http-equiv="link"> ?
nope... looks like the <meta HTTP-EQUIV="link" ...> would be missed :-(
oh well... will work on a follow up patch!

1) make prefetching a user preference
2) add support for <meta HTTP-EQUIV="link" CONTENT="...">
checked in mac project changes.. so this should be in all builds of mozilla 1.2
alpha :-)
fixed-on-trunk

see these bugs for the remaining issues:
http://bugzilla.mozilla.org/show_bug.cgi?id=166647
http://bugzilla.mozilla.org/show_bug.cgi?id=166648
Status: ASSIGNED → RESOLVED
Closed: 22 years ago22 years ago
Resolution: --- → FIXED
Would it be worth a special case to not stop the prefetching when the user
clicks on a link to the page that is currently being prefetched? It seems that
the probability of this happening is quite high on the rel=next case.

I understand that the partial cache entries and byte range requests help with
this somewhat, but the worst case is still tearing a tcp connection down,
re-establishing it and sending a new request, right?
marko, if the next page references a large document that is being prefetched,
and the user clicks a link to advance to the next page, then if we don't cancel
the prefetched load, loading of the next page will appear to stop on the
prefetched document.  only when the prefetched document is entirely downloaded
will the document appear to snap into place/view.  this happens because the
prefetch load and the new load are not at all tied together.  instead, the
second one gets blocked waiting for access to the cache.  so, while it might be
true that not canceling would give better page load times, the result is
something that most likely won't appeal to many users.  i think it's better to
cancel the prefetch partway thru, so we can go ahead and display what we've
already got ASAP.
Wouldn't it be cool if instead of blocking, it noticed that a fetch was already
in progress for that resource, and simply hooked into it?
it would certainly be cool, but not at all easy to implement.
Concerning comment #38 (2):
Why kill this, rather than just suspending it until any other activity ends?
(example situation: webpage with search results. I open some of them in new
windows/tabs to see if they contain what I want, but if not - I want to check
the next page of results. Would be nice if it was there already.)

Concerning comment #47:
I wouldn't worry. If the document is already in cache, then there's a good
chance that either the "next" document will be there as well, or it will not be
needed, since if the user didn't follow it the first time, they won't follow it now either.

Worth consideration: Mozilla vs the law. "By clicking on this link you agree to
the terms and conditions...". The server logs will say you followed this link even
though you didn't. Luckily, webmasters rarely implement such pages as <link...

Also consider the possibility of maliciously formed pages using this feature to
exploit remote server vulnerabilities. More on topic:
http://www.phrack.org/show.php?p=57&a=10
Limiting the number of links to follow to a fairly low number (8?) would prevent
abuse of this.
Bartosz wrote:
>Worth consideration: Mozilla vs the law. "By clicking on this link you agree to
>the terms and conditions...". The server logs will say you followed this link
>even though you didn't. Luckily, webmasters rarely implement such pages as
><link...

As I was reading through this bug I was thinking exactly the same thing. TBH I
was considering whether 'next' might be considered a 'submit' link (eg. in a
wizard/druid of some kind), in which case preloading it would submit without the
user's express permission - as Bartosz said. But I'm sure there are worse ways
this could be exploited, much as I think it's a good idea ;|

At some level this might be considered a bug in the web-app, but if a user
switches to a gecko browser and suddenly finds that their web-email inbox is
empty due to mozilla precaching the 'remove this email' link, there could be
problems...
Why would anyone ever say

   <link rel="next" href="./delete-mail">

...and would they ever do it without a query string? I doubt it.
here's a site that uses <link rel=next> without any cache control headers.  the
pages appear to be served up using PHP without the use of query strings.  as a
result we prefetch each next page, but then kill off the load once we see that
it doesn't have any cache headers.  needless to say this is bad for a number of
reasons.  perhaps the best recourse is to evangelize such sites.  hmm...

http://www.gnome.org/start/2.0/releasenotes.html
I take it "Last-Modified" is one of those cache headers?


VERIFIED FIXED. This bug makes reading the HTML4 spec so much nicer. Thanks
Darin. You kick ass.
Status: RESOLVED → VERIFIED
yes, Last-Modified is good enough, because it lets us take a guess at how long
the document can remain stale ((date - lastmodified)/10), and when the document
does expire, all we have to do is send a conditional If-Modified-Since request
to the server (allowing the server to say 304 not modified).
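The heuristic expiration mentioned above is simple arithmetic: a document with only a Last-Modified header is treated as fresh for one tenth of the time elapsed since it was last modified. A minimal sketch (the function name is illustrative):

```python
from datetime import datetime, timedelta

def heuristic_freshness(date, last_modified):
    """How long a document with only a Last-Modified header may be
    considered fresh: (Date - Last-Modified) / 10, as described above."""
    return (date - last_modified) / 10

# e.g. a page last modified 10 days ago may be reused for about 1 day
# before a conditional If-Modified-Since revalidation is needed
```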
Uhm... why do I see two downloads of the same document with ethereal? I'm
browsing the HTML specs and I see two full GETs for each page...! I'm using
build 2002091505 .
With a current trunk build on win2k, I can't reproduce this (although instead of
using ethereal I was just checking breakpoints in the code and also checking the 
server logs for a prefetch testcase modelled on the w3 TR documents). I see a GET
for the first document, then a prefetch of the second. On moving to the second 
document, I see a prefetch of the third document, etc.

Note: if you click on the 'Next' link in those pages before the prefetch of that
next document was complete then this would cancel that pending prefetch and a
new (partial) GET request would be issued for the next document. But this is by
design.
nick: you should also verify that ethereal isn't "lying to you" ... sometimes
(especially under windows) it'll report a packet twice.  check the sequence
numbers to be certain you aren't seeing an ethereal bug.  barring that there is
always the possibility that you are loading pages that do not allow caching. 
after loading one of the pages, you could look at the page info for the page to
see if the server specified an expiration time.  once that expiration time is
past, the prefetched document will have to be validated the next time it is loaded.
Attached patch 1.0 branch patchSplinter Review
combines original patch + all subsequent fixes (bug 166647, bug 171102, and bug
173278).
there's quite a bit of whitespace-noise in the real branch patch.  review this
one instead.
Alias: prefetch
While I agree that PRTimeToSeconds should really live in NSPR, we should
minimize having multiple definitions of the same function (I count 3 more within
Necko alone-- FTP has 2 and HTTP has another one).  Can we at least consolidate
that into Necko's common util.h file (I forget the exact name-- it's been that
long!)

Other than this nitpick it looks great! r=gagan
Comment on attachment 102309 [details] [diff] [review]
1.0 branch patch w/ no whitespace changes

r=gagan
Attachment #102309 - Flags: review+
gagan: thanks for the review.  i talked to wtc about PRTimeToSeconds (can't
remember the bug no), and he decided that he didn't want to put that function in
NSPR because (1) no way to represent dates before 1970 and (2) no way to
represent dates after ~2130.  i see his point, and i'm hoping to eventually
clean things up so that we can have one instance of this function. 
Comment on attachment 102309 [details] [diff] [review]
1.0 branch patch w/ no whitespace changes

sr=rpotts@netscape.com
Attachment #102309 - Flags: superreview+
Comment on attachment 102309 [details] [diff] [review]
1.0 branch patch w/ no whitespace changes

a=rjesup@wgate.com for 1.0 branch with the proviso that the default be that the
feature is disabled unless the pref is used to enable it, as per driver
discussions and Darin's agreement.
Attachment #102309 - Flags: approval+
default disabled sounds perfectly reasonable to me.
Discussed in bBird team meeting.  We will give adt approval for this as soon as
test cases are attached to this bug.  Please make sure Bindu is cc'ed on bug. 
We need the checkin to happen by 10/16 COB.  I will watch this bug and
immediately plus it when the test cases are attached.

We are going to reserve the right to ship with this turned off if problems are
discovered.
this URL contains links to some examples (sorry for the internal site):

http://unagi/~darinf/prefetch/testcases.html
Plussing per email from darin indicated lyecies approval to ship pref'ed on, and
attached test cases.
Keywords: adt1.0.2 → adt1.0.2+
Blocks: grouper
marking fixed1.0.2 belatedly.  patch landed 10/16.
Bulk adding topembed keyword.  Gecko/embedding needed.
Keywords: topembed
Verified on 2002-10-25-branch build on Win 2K.

The test case works as expected.