Last Comment Bug 453403 - add DNS pre-fetching to Necko and Firefox
: add DNS pre-fetching to Necko and Firefox
Status: RESOLVED FIXED
: fixed1.9.1, perf
Product: Core
Classification: Components
Component: Networking (show other bugs)
: Trunk
: All All
: -- enhancement with 6 votes (vote)
: mozilla1.9.1b2
Assigned To: Patrick McManus [:mcmanus]
:
Mentors:
: 220942 (view as bug list)
Depends on: 459724 463215 463263 464838 467648 475603 486127 488162 492196
Blocks: 437953
  Show dependency treegraph
 
Reported: 2008-09-02 21:41 PDT by Asa Dotzler [:asa]
Modified: 2012-03-29 02:11 PDT (History)
54 users (show)
pavlov: wanted1.9.1+
bzbarsky: in‑testsuite?
See Also:
Crash Signature:
(edit)
QA Whiteboard:
Iteration: ---
Points: ---
Has Regression Range: ---
Has STR: ---


Attachments
dns prefetch patch mentioned in comment 8 (27.98 KB, patch)
2008-09-12 19:34 PDT, Patrick McManus [:mcmanus]
no flags Details | Diff | Review
version 3 of patch (38.84 KB, patch)
2008-10-09 13:01 PDT, Patrick McManus [:mcmanus]
no flags Details | Diff | Review
version 4 of patch (50.13 KB, patch)
2008-10-14 19:13 PDT, Patrick McManus [:mcmanus]
no flags Details | Diff | Review
version 5 of patch (49.44 KB, patch)
2008-10-23 08:16 PDT, Patrick McManus [:mcmanus]
bzbarsky: review-
Details | Diff | Review
version 6 of patch (53.73 KB, patch)
2008-10-30 11:21 PDT, Patrick McManus [:mcmanus]
bzbarsky: review+
bzbarsky: superreview+
Details | Diff | Review
version 7 of patch (56.89 KB, patch)
2008-10-31 09:36 PDT, Patrick McManus [:mcmanus]
no flags Details | Diff | Review
version 8 of patch (56.84 KB, patch)
2008-11-03 14:45 PST, Patrick McManus [:mcmanus]
bzbarsky: review+
bzbarsky: superreview+
Details | Diff | Review
Additional patch to fix shutdown nsHostResolver leak (1.08 KB, patch)
2008-11-04 09:20 PST, Boris Zbarsky [:bz]
no flags Details | Diff | Review
version 9 of patch (43.32 KB, patch)
2008-11-07 14:02 PST, Patrick McManus [:mcmanus]
bzbarsky: review+
Details | Diff | Review
version 10 of patch (43.44 KB, patch)
2008-11-07 15:09 PST, Patrick McManus [:mcmanus]
mbeltzner: approval1.9.1b2+
Details | Diff | Review

Description Asa Dotzler [:asa] 2008-09-02 21:41:12 PDT
Not sure what it actually does but Google's Chrome beta has a "enable DNS pre-fetching" preferences and if it's what I think it is, we should do something like that.

I figure we could either pre-fetch DNS for all links, visible links (in the viewport,) or links that get a mouse hover for some millisecond count.
Comment 1 Stuart Parmenter 2008-09-02 21:47:29 PDT
blake/pat: is this something we could do as part of the speculative loading and any idea what the effect here would be if we were to lazily lookup dns for all the links on a page?
Comment 2 Mike Shaver (:shaver -- probably not reading bugmail closely) 2008-09-02 22:01:55 PDT
I don't think you want to do it lazily; we already do DNS lookups as lazily as we can, and you want to do them more eagerly here.

Might be some privacy concerns, so we should have an about:config to turn it off, but just kicking off a DNS poke for every link as we construct the page might be sufficient.
Comment 3 Patrick McManus [:mcmanus] 2008-09-03 07:31:50 PDT
I think this is probably a fabulous idea, I wish I had suggested it - especially for mobile where the RTT of the lookup is so high.

prefetching things has a mixed track record - but this is a perfect candidate: low bandwidth, low cpu, low state.. and in high latency envs a significant payback.

personally, I wouldn't overlap this with the speculative loader bug. If it is overlapped it would be a significant expansion of that scope which right now only deals with resources we are pretty sure we are loading (they are speculative in the sense that js or something similar may undo that decision), whereas I would think in this case we would want to do the async DNS resolution of every link we see.

definitely worth prototyping and measuring the results.
Comment 4 benc 2008-09-03 17:50:44 PDT
This needs to be reviewed from a network security standpoint before it is implemented. There are a variety of concerns when you load and keep DNS data. In the past, the cache implementation was very short, the goal was to prevent the page loads from being excessively "bursty". We also went through misguided phases where people loaded DNS info forever. There are trade offs both ways.

The underlying problem is we don't have a lot of ways of supporting DNS TTL when we cache. This is due to the fact we don't really USE dns, we typically call libraries that implement resolvers, and for reasons I don't understand, those libraries don't support exposing TTLs to applications, even in this modern day and age. 

The other factor to keep in mind is that many modern OS's actually implement DNS caching as well, (which might be part of the argument for not augmenting the libraries w/ TTL support). In fact, I don't know if anyone has really done a sit down w/ a logging DNS server and packet tracer to see if the existing DNS caching scheme helps much on the current tier 1 platforms.

If you are talking about a "necko should work great everywhere" approach, then you really gotta start looking at the event model and counting packets on real systems.
Comment 5 Vladimir Vukicevic [:vlad] [:vladv] 2008-09-03 21:14:13 PDT
I would guess the plan for this was to just use the normal OS DNS cache -- that is, just make a normal DNS lookup request, not implement everything ourselves.  Given that, there should be no security concerns over any that might be present due to whatever way the OS does dns lookups and caching anyway.
Comment 6 Mark Finkle (:mfinkle) (use needinfo?) 2008-09-07 20:21:22 PDT
related to ideas in bug 220942?
Comment 7 Patrick McManus [:mcmanus] 2008-09-12 18:49:31 PDT
re: the OS cache comments in comment 5 and 6.  the gnu libc resolver code does not contain a cache. Each call to getaddrinfo() results in a network round trip. Some folks run "ncsd" locally to accomplish this, but it does not seem standard.

from experiments - OS X does seem to have a operating system cache. I cannot say regarding windows.

For mobile, where glibc is a common resolver and the network round trip time is exorbitant (think 750 - 1000ms) some kind of cache to resolve the "bursts of duplicates", like we currently have, is critical. 

I did a spot check using a large page: http://planet.mozilla.org/ - when I did the test that page needed to resolve 134 hostnames in order to render. Only 23 of them were unique, so the cache saved 111 network lookups.

The current default timeout is 1 minute. In order to gain the most benefit of this (migrating between adjacent pages) I might argue for upping it to 3 or so, but nothing bigger than that in order to not violate any high-availability expectations service providers might have. Even within that constraint I expect this strategy can be quite effective.

We might conditionally compile the cache off on OS's known to have their own caches, but the heart of the pre-fetching would be unchanged - it is just a matter of who is managing the cache that is being speculatively loaded.
Comment 8 Patrick McManus [:mcmanus] 2008-09-12 19:32:56 PDT
I will attach, shortly, a proof of concept patch for DNS prefetching. Read on to the end of this comment and you will see it demonstrates not only the expected benefit, but an unexpected one too.

The patch implements speculative resolution of anchor href's and img src's at the time they are bound to the DOM.

The DNS service is expanded from one class of service into three. The three are high, medium, and low. medium and low are used for pre-fetching and can be disabled via a preference ("network.dns.disablePrefetch"). All previous DNS uses are mapped to the high class of service, images are looked up with medium (see below) and hrefs are looked up with low. Servicing is done with strict priortiy. A request which is originally queued in a low class of service may be upgraded if a new query comes along for it at a higher class before it has actually been de-queued.

The DNS service has always allowed up to 8 threads to be used. That is unchanged. If the pool grows up to 6 or more threads, those threads are used exclusively for high priority items (i.e. non prefetches) - ensuring that high priority requests will never be blocked behind just speculative lookups.

Previously, if the DNS service expanded beyond one thread it very aggressively pruned itself back down at the first opportunity. With the increased load of prefetching that pattern results in excessive creation and destruction of threads. Not only is there OS overhead in doing so, but when using the thread for DNS resolution for the first time libc must initialize the resolver (res_init), which involves opening and parsing files such as /etc/resolv.conf and /etc/hosts which is certainly something that we don't want to do over and over. The patch softens this behavior some by applying the same soft-timeout logic to all of the initialized threads (which are still created in a dynamically growing pool) instead of just the first one.

Resolving hostnames in img urls just before loading them (literally the line before LoadImage()) sounds like a waste. However it turns out to have a significant benefit as the actual load may take a while to be executed due to limits on the number of parallel HTTP connections. 

Running the DNS lookup in parallel with the other loads blocking the image load makes a lot of sense - it is a low bandwidth but potentially high latency operation that nicely overlaps. When the image's turn does come up in the HTTP queue it benefits from a warm DNS cache when starting the transfer.

This has the neat effect of using DNS prefetching to improve page load times on the first page view, not just the subsequent pages in a link-chain.

As I mentioned in comment 7, I ran a set of tests using planet.mozilla.org. (I also did these tests with cnn.com - the results where similar but less dramatic as cnn is a smaller page). I did the test both in a broadband environment, and over a simulated wireless edge network. (54/160 kbit, 770ms latency)

23 unique DNS lookups are needed to render the planet page (as of the time I did the tests anyhow). With the prefetch patches applied 16 of the 23 lookups were improved on the edge tests and 22 of the 23 on the broadband tests. By improved I mean when the HTTP transaction requested the lookup the name was either already in the cache or the request for it had already gone out on the network. Requests that were just sitting in the queue don't count as improved.

This does translate into improved page load time for a page with lots of images in a high latency environment. Loading the initial planet page on the mobile network improved over 4% with just this change, even while absorbing the overhead of looking up dozens of extra other hostnames in order to warm the cache for the next click! This is a nice unexpected bonus and the overhead is really not that onerous.

On broadband the savings are less impressive because the lookups cost so much less - I saved about 15ms on the page load (.1%). The savings should actually be a bit better than that on broadband because the cache of my recursive resolver was full for this test and I did not emulate a lag to reach it on the broadband test as I did for the edge test. Treating it as a LAN (<1ms of latency) instead of as broadband (maybe 70ms of latency) under estimates the benefit of the patch in that scenario. Whether or not the recursive resolver is warm in reality depends on whom it is being shared with and how popular the site is - it will vary widely.

The CNN tests showed improvement too - but less so. It had 35 names, 16 of them unique. Prefetching improved 7 of the 16 on broadband and 5 of the 16 on edge. The CNN tests don't muck up the HTTP queues quite as much, so the opportunity for getting ahead of the image load isn't as strong. The page load times improved marginally as well.

Of course the main goal was in eliminating some latency when going from page A to page B, where B is in a different hostname than A. My edge tests showed the page load time of page B improving by 800ms, which is what we expect.
Comment 9 Patrick McManus [:mcmanus] 2008-09-12 19:34:02 PDT
Created attachment 338419 [details] [diff] [review]
dns prefetch patch mentioned in comment 8
Comment 10 Jo Hermans 2008-09-13 02:18:41 PDT
You can see the windows cache with this command :
  ipconfig /displaydns
and flush it with
  ipconfig /flushdns

<http://planet.mozilla.org/> uses 18 elements in a normal Firefox 3 without the patch

See also bug 208312 : DNS: negative (NXDOMAIN) cache
Comment 11 John Mellor (Jomel) 2008-09-27 09:15:32 PDT
Google blogged about this FYI: http://blog.chromium.org/2008/09/dns-prefetching-or-pre-resolving.html
Comment 12 Jo Hermans 2008-09-27 10:17:35 PDT
more documentation : http://dev.chromium.org/developers/design-documents/dns-prefetching
Comment 13 Ryan VanderMeulen [:RyanVM] 2008-09-27 15:04:07 PDT
Out of curiosity, what happens to those affected by bug 306922 when this lands? Are we going to see random UI lockups while Fx is doing background DNS lookups?
Comment 14 Ted Mielczarek [:ted.mielczarek] 2008-10-01 11:25:25 PDT
Patrick: is this patch ready for review, or are you waiting on something else?
Comment 15 Patrick McManus [:mcmanus] 2008-10-02 11:04:08 PDT
I guess it has received all the feedback it is going to get without a review flag, so I have added that now.
Comment 16 Stuart Parmenter 2008-10-02 15:52:33 PDT
Comment on attachment 338419 [details] [diff] [review]
dns prefetch patch mentioned in comment 8

bz, can you review this?
Comment 17 Boris Zbarsky [:bz] 2008-10-02 19:16:44 PDT
I can try, though I'm not very familiar with this code.  Then again, probably no one is, other than Darin.  ccing biesi just in case, though.

Some questions off the top of my head just skimming this:

1) Why are we manually doing DNS cache stuff from image code instead of just
   building that into the HTTP code and making use of existing request
   priorities?  I can see how we might need this for anchors, since we're not
   actually starting a load there, but the image thing doesn't seem like it
   would work as well as something that also covers scripts, stylesheets, etc.
   To be honest, I'm not really even sure why this helps; don't we just go and
   start resolving from under here anyway once we open the HTTP channel?

2) What's the performance impact from the nsHTMLAnchorElement change?  That's
   typically very perf-sensitive code, and pages often have a _lot_ of links on
   them...  I haven't read the guts of the prefetch code yet, so maybe this is
   not a problem because we coalesce well?  If there is any impact at all, it
   would make more sense to do the prefetches more lazily, though I guess they
   only help if you click a link within the dns cache timeout span?

3) As written, this is going to regress bug 223861 (just from a quick look at
   the CVS blame here).  I'm not happy doing that without at least better data
   akin to that in bug 162871 comment 84.

4) Do we want something akin to chromium's https thing here?

5) Do we want some way for sites to opt out like chromium does?
Comment 18 Patrick McManus [:mcmanus] 2008-10-03 08:10:39 PDT
(In reply to comment #17)
> I can try, though I'm not very familiar with this code.  Then again, probably
> no one is, other than Darin.  ccing biesi just in case, though.
> 
> Some questions off the top of my head just skimming this:
> 
> 1) Why are we manually doing DNS cache stuff from image code instead of just
>    building that into the HTTP code and making use of existing request
>    priorities? 

the HTTP request priorities are not really relevant as far as I can tell - the image dns prefetch is of lower priority than the image dns fetch associated with the http get regardless of the http priorities. I never want the speculative cache thing to tie up the resolver system and create a wait for "I just clicked on it" events.

While it would be nice to put this in the http layer, it needs semantic understanding that this is something that should have its dns prefetched - which is why it is explicitly hung off anchor href's and imgs now. We could add a generic interface for http for things that are being fetched "very soon" like images, but you'd still need a different path for the anchors which we aren't as sure are going to be retrieved.

> I can see how we might need this for anchors, since we're not
>    actually starting a load there, but the image thing doesn't seem like it
>    would work as well as something that also covers scripts, stylesheets, etc.
>    To be honest, I'm not really even sure why this helps; don't we just go and
>    start resolving from under here anyway once we open the HTTP channel?

This is one of the cool and counter intuitive things about this patch. It works because opening the HTTP channel might be a significant amount of time away from identifying the channel hostname. This is because there are limits on the number of channels which are easily bumped into and this code essentially allows the dns resolution of images to "queue jump" for only the DNS portion of the transaction. 

Normally queue jumping is evil behavior, but in this case it really works out well. The DNS transaction is, compared to the HTTP transactions, extremely low bandwidth and dominated by latency. Because it is low bandwdith (1 packet each way) it really does not meaningfully interfere with existing transfers, and its high latency property allows it to essentially overlap for free with the HTTP transfer.. when the image HTTP request does go to open its channel it sees a cache hit.

This manifests itself most strongly in mobile where latencies are very high and the http channel pool is more likely to backup for a long time due to bandwidth constraints.

I did make some measurement of this phenomenon that I mention in comment 8. One of the tests needed to do 23 unique DNS lookups for images and with this change 16 of them were either already completed or already under way when the HTTP channel code asked for the name leading to a significant reduction in wait time.

> 
> 2) What's the performance impact from the nsHTMLAnchorElement change?  That's
>    typically very perf-sensitive code, and pages often have a _lot_ of links on
>    them...  I haven't read the guts of the prefetch code yet, so maybe this is
>    not a problem because we coalesce well?  If there is any impact at all, it
>    would make more sense to do the prefetches more lazily, though I guess they
>    only help if you click a link within the dns cache timeout span?
> 

As I noted, I did a bunch of measurements - and the faster the network the less of a beneift (to the point of their being no perceived benefit) - but I didn't note anything getting slower. 

But there are a million points of view on what to measure. I'm happy to look into any particular scenario you are suspicious about. 

The code is completely fire and forget wrt nsHTMLAnchorElement - it just puts an entry in the queue and moves on. It never even gets called back - the payoff is in the form of a cache hit later. So I do not suspect this will be a big deal for it.

> 3) As written, this is going to regress bug 223861 (just from a quick look at
>    the CVS blame here).  I'm not happy doing that without at least better data
>    akin to that in bug 162871 comment 84.
> 

223861 lowers the default cache time from 300 to 60. This patch raises it to 180.

The spectrum is a tradeoff of course. I inched the limit up because the likelyhood of a payoff also increased - the cache is now a richer object and to make use of it on inter-page clicks some "think time" needs to be accounted for.

It can return to 60 at the unknown cost of lower cache hit rates. I first thought a median page-time-spent metric would be interesting here but I think any median or average would really obscure the large number "click.. click.. click" sequences that typically lead up to a "lets read this in full" kind of page.

As for site ttls, my experience is that the Akamai default of 5 minutes is becoming pretty common for folks doing dynamic dns. Content providers have come to understand that DNS latency is part of their performance and some minimal caching can be their friend.. Services like keynote have helped drive this home.. this is borne out some by rechecking the data listed in bug 162871 and comparing it to today - everything has stayed the same or gotten more generous.

       old  today
amazon 60   60
cnn    215  3600
imbd   60   60
google 118  300

Roll into this that the firefox cache has always had the potential to be over the ttl even with ttls of at least 60. At 60 (the minimum in the data set), it is quite possible for firefox to pick up a 59 second old entry from its local caching resolver and then hold onto it for another 60 seconds - doubling the nameserver allowed validity period. That's life, and life would be a heck of a lot better if getaddrinfo() gave some TTL and age information.

> 4) Do we want something akin to chromium's https thing here?
> 5) Do we want some way for sites to opt out like chromium does?

Yes, I have come to believe it would be good to be pretty compatible with chromium where those ideas make sense.

So I'll make the following changes:
  * lookups derived from https fetches will be tied to a second configuration property (anded with the existing one) that will default to false
  * honor x-dns-prefetch-control in both the response headers and meta to opt out of a particular page
  *  implement link rel="dns-prefetch" though I am dubious of its value
Comment 19 Boris Zbarsky [:bz] 2008-10-03 08:24:22 PDT
OK, so would it make sense to start prefetch at AsyncOpen() time (from inside necko) as opposed to from the image loader?  I'm still not sure why it's better for images to jump the queue but for scripts not to do so (esp. given that the latter have a lot more impact on pageload).  And it's pretty common to have ad scripts on web pages that come from all sorts of servers, as I recall.

> I'm happy to look into any particular scenario you are suspicious about. 

Typical broadband (cable/dsl) network, page that's just a list of 3-4 hundred links, measuring total load time.  This is actually a good approximation of some pages out there.

I wouldn't be too surprised if there is no perf impact, given your description.

> this is borne out some by rechecking the data listed in bug 162871

That's exactly the data I was looking for.  I'm probably ok with the 180 thing given that, though it would be good to get darin to comment.  Want to just bounce him an e-mail?

I don't think we should worry about rel="dns-prefetch" for now.  Unlike the opt-out changes, this one wouldn't break sites if not done, just enhance the ones that make special changes even more if it is done.  We can land without it and do it as a followup if we decide we want to (and I'm not entirely convinced we do, just as you are not).
Comment 20 Boris Zbarsky [:bz] 2008-10-08 09:36:18 PDT
For what it's worth we had an existing bug on this: bug 220942.  Might be worth reading through the comments there too.
Comment 21 Boris Zbarsky [:bz] 2008-10-08 09:36:27 PDT
*** Bug 220942 has been marked as a duplicate of this bug. ***
Comment 22 Patrick McManus [:mcmanus] 2008-10-09 08:24:51 PDT
An interesting bit in the chromium dns prefetch design document, they give 

 meta http-equiv="x-dns-prefetch-control" value="on"

as an example of a meta element.

I presume that the value attribute should really be a content attribute - that is how it is normally done with http-equiv, right? So I am presuming its just a thinko in the chromium doc..
Comment 23 Patrick McManus [:mcmanus] 2008-10-09 13:01:28 PDT
Created attachment 342476 [details] [diff] [review]
version 3 of patch

A new revision of the patch, which incorporates the above feedback.

The changes:
 * lookups are no longer triggered from imglib for img elements. Instead they are triggered in http::AsyncOpen() which should bring the "early lookup" benefits to js, css, etc as well
 * prefetching is not done by default from an https based document, as per chromium. can be overriden with new pref 
 * implement x-dns-prefetch-control, both "on" and "off" in both response header and meta contexts.. 
 * implement link rel="dns-prefetch" for explicit lookups
Comment 24 Patrick McManus [:mcmanus] 2008-10-14 19:13:26 PDT
Created attachment 343153 [details] [diff] [review]
version 4 of patch

Two changes from rev 3
 * a bug with the idl part of the patch
 * v3 contained code for a problem I found with normalizing the flags used in the DNS cache.. I broke out this code and attached that patch to bug 459724 (which also uses DNS flags and needs this code)

This patch now depends on the  "Fix normalization of flags in DNS cache" attachment in bug 459724 in order to work.
Comment 25 Patrick McManus [:mcmanus] 2008-10-23 08:16:19 PDT
Created attachment 344475 [details] [diff] [review]
version 5 of patch

simply removes an entanglement with 459724 that doesnt belong in this patch
Comment 26 Boris Zbarsky [:bz] 2008-10-23 19:18:25 PDT
Comment on attachment 344475 [details] [diff] [review]
version 5 of patch

>+++ b/content/base/src/nsContentSink.cpp
>+nsContentSink::PrefetchDNS(const nsAString &aHref)
>+  if (StringBeginsWith(aHref, NS_LITERAL_STRING("//")))  {
>+    hostname = aHref;
>+    hostname.Cut (0,2);

hostname = Substring(aHref, 2)

>+  nsHTMLDNSPrefetch *prefetch = new nsHTMLDNSPrefetch (hostname, mDocument);

This part should imo look more like this:

  nsRefPtr<nsHTMLDNSPrefetch> prefetch = new nsHTMLDNSPrefetch(...);
  if (prefetch) {
    prefetch->PrefetchLow();
  }

(no manual delete on reference counted object).  Also, no space before the '(' in the constructor call.

>+++ b/content/base/src/nsContentSink.h
>+  void PrefetchDNS(const nsAString &aHref);

This should probably document that aHref might not actually be a URI but might rather be a "//hostname" kind of thing.

>+++ b/content/html/content/src/nsHTMLAnchorElement.cpp
>+nsHTMLAnchorElement::PrefetchDNS()
>+{
>+  nsAutoString hostname;
>+  nsresult rv;
>+  
>+  rv = GetHostname (hostname);
>+  if (NS_FAILED(rv) || hostname.IsEmpty())
>+    return;
>+
>+  nsHTMLDNSPrefetch *prefetch = new nsHTMLDNSPrefetch (hostname, GetOwnerDoc());

That's a pretty suboptimal way of doing things.  If you trace through this code, it creates an nsIURI for the href, gets its spec, creates another nsIURI from the spec, and then gets its hostname.

It's probably better to call GetHrefURI() here and then use the nsIURI version of prefetch.

And the same comments as in the sink in terms of the actual construction/destruction pattern.  Let refcounting actually work for you.  ;)

>@@ -191,30 +212,32 @@ nsHTMLAnchorElement::GetDraggable(PRBool
>+
> nsresult
> nsHTMLAnchorElement::BindToTree(nsIDocument* aDocument, nsIContent* aParent,

Nix that whitespace change?

>+++ b/content/html/content/src/nsHTMLDNSPrefetch.cpp

The licence block seems to be missing the "The Original Code is" paragraph and more importantly the paragraph that lists initial developer (Mozilla Corporation in this case) and copyright date.

>+/* -*- Mode: C++; tab-width: 4; indent-tabs-mode: nil; c-basic-offset: 4 -*- */
>+/* vim:set tw=80 expandtab softtabstop=2 ts=2 sw=2: */

Those modelines are inconsitent.  You probably want s/4/2/ in the emacs modeline (and reindent accordingly), since content code typically does 2-space indent.

>+nsHTMLDNSPrefetch::nsHTMLDNSPrefetch(nsAString &hostname, nsIDocument *aDocument)
>+    mAllowed = IsAllowed(aDocument);
>+    mHostnamePtr = new NS_ConvertUTF16toUTF8 (hostname);

Why not just have mHostname be an nsCString (not a pointer) and use CopyUTF16toUTF8 here?

>+nsHTMLDNSPrefetch::nsHTMLDNSPrefetch(nsIURI *aURI, nsIDocument *aDocument)
>+    mHostnamePtr = new nsCString();
>+    aURI->GetHost(*mHostnamePtr);

And then here you could just GetAsciiHost(mHostname) (or GetHost as you do, but DNS will end up having to do the converstion to ACE for IDN hostnames in that case; if we're going to use this nsIURI for anything else at least it will already have the ASCII-ified version cached)..

>+nsHTMLDNSPrefetch::IsSecureBaseContext (nsIDocument *aDocument)
>+    if (aDocument != nsnull)

Can we just require the callers to alwys pass in a document (and assert that in the constructor)?

Also (and this applies to all the methods), nix the space before '('.

>+nsHTMLDNSPrefetch::IsAllowed (nsIDocument *aDocument)
>+    if (IsSecureBaseContext(aDocument))
>+        {

Module style is usually to put the curly for an 'if' at the end of the if line.

>+            if (nsContentUtils::GetBoolPref("network.dns.disablePrefetchFromHTTPS", PR_TRUE))

Do we really want to hit the pref service on every single link in an SSL page?  This seems like a good use case for nsContentUtils::AddBoolPrefVarCache (which is underdocumented, I know) called from a method run at module startup with a static PRBool to store the current state.  The only word of warning there is that AddBoolPrefVarCache will default to false on first call if not set, and since you want a default of true you'll need a manual GetBoolPref call there too (at module startup).

>+                rv = PR_FALSE;

Why even keep track of rv?  If it's false here, we'll return false, so just do an early return.

>+    if (rv && (aDocument != nsnull))

Again, I think we should just require a non-null document.

>+            // checks if x-dns-prefetch-control HTTP response header is set to override default
>+            // may also be set by meta tag

s/if/whether the/.  And please capitalize sentence beginning and punctuate sentence ends?

>+            nsAutoString control;
>+            aDocument->GetHeaderData(nsGkAtoms::headerDNSPrefetchControl, control);

So for what it's worth, this might be a little slow... we can see how it goes.  In general, the link stuff could be a Tp hazard; we should watch out for that when we land.  It might be better to just add an API on nsIDocument exposing this boolean value (updated by SetHeaderData) instead of recomputing it on every single link.

>+            if (rv && control.LowerCaseEqualsLiteral ("off"))

We know rv is true here.

I hate to ask this... but what does chromium do if the value is neither "on" or "off"?  Does it treat that as "on" or "off"?  We should do the same (and ask them to update their docs accordingly).

One thing probably worth checking in this code is that the document has a window.  I see no reason to do DNS prefetch in XMLHttpRequest result documents, for example.

>+nsHTMLDNSPrefetch::Prefetch(PRUint16 flags)
>+    nsCOMPtr<nsIDNSService> dns = do_GetService(kDNSServiceCID, &rv);

This is another possible source of performance impact to watch out for.  We might want to cache this service somewhere (nsContentUtils, say).

>+    // this reference is dropped at the end of Prefetch and ensures the object
>+    // cannot be deleted by another thread (initiated from AsyncResolve())
>+    // while this function is running
>+    NS_ADDREF_THIS();    
>+  
>+    // this reference will be dropped by OnLookupComplete()
>+    NS_ADDREF_THIS();    

OK.  So if we have the callers hold a strong ref while calling Prefetch() (as they should per XPCOM rules), can we nix both of these self-addrefs (and the selfrelease in OnLookupComplete())?  Certainly the DNS service will hold a reference to us until it calls OnLookupComplete().

>+++ b/content/html/content/src/nsHTMLDNSPrefetch.h
>+/* -*- Mode: C++; tab-width: 4; indent-tabs-mode: nil; c-basic-offset: 4 -*- */
>+/* vim:set tw=80 expandtab softtabstop=2 ts=2 sw=2: */

Again, these don't match; should use 2 throughout.  And again, the license block is missing pieces.

>+#include "nsIDNSRecord.h"

Do we really need that header?  Would a forward-declaration do the trick?

>+#include "nsIDNSService.h"

Same here.  Seems like we only need that type in the .cpp, not here.

>+#include "nsICancelable.h"

This can probably also be a forward-declaration: see below.

>+public:
>+	NS_DECL_ISUPPORTS
>+    NS_DECL_NSIDNSLISTENER

Fix indent?

>+    // The aDocument parameter is the context requesting the prefetch - under
>+    // certain circumstances (e.g. headers, or security context) associated with
>+    // the context the prefetch will not be performed. nsnull is acceptable and 
>+    // always allows prefetches

As I said, I don't think there's a good reason to allow a null document.

>+    // call one of the following methods, and the class is self destructing if they
>+    // return NS_OK

I don't think there's a need to describe a special ownership model; this should just work like any other refcounted object.

>+    nsCOMPtr<nsICancelable> mOutstanding;

We don't use this for anything.  Why bother storing it?  Could just put it in a stack nsCOMPtr when we call asyncResolve, and then let it go out of scope.

>+++ b/content/html/document/src/nsHTMLContentSink.cpp

This part looks fine, but do we want to do the prefetch in XHTML too?  If so, we need to adjust the XML sink...

>+++ b/netwerk/base/public/nsNetError.h
>+#define NS_ERROR_SERVICE_FULL \

NS_ERROR_DNS_LOOKUP_QUEUE_FULL seems clearer...  Unless we think this error code will be used for other things, but then the comment needs fixing.

>+    NS_ERROR_GENERATE_FAILURE(NS_ERROR_MODULE_NETWORK, 31)

That gives it the same value as NS_ERROR_REDIRECT_LOOP.  I'd prefer a unique value here.

>+++ b/netwerk/base/src/nsDNSPrefetch.cpp
>+/* -*- Mode: C++; tab-width: 4; indent-tabs-mode: nil; c-basic-offset: 4 -*- */
>+/* vim:set tw=80 expandtab softtabstop=2 ts=2 sw=2: */

Again, mismatch in the modelines and missing parts in the license.

In necko, I think 4-space indent is standard, so that's what this file should use and what the modelins should declare.

A lot of the nsHTMLDNSPrefetch comments apply here and in the header for this class as well, especially the GetAsciiHost thing, since in most cases the URI here will in fact have the ascii host cached.

Is it worth trying to reduce the code duplication by having a virtual method for whether the prefetch is allowed and having the content class subclass this one?  If so, I'm ok with that as a followup patch, in the interests of getting this landed.

>+++ b/netwerk/base/src/nsDNSPrefetch.h
>+/* -*- Mode: C++; tab-width: 4; indent-tabs-mode: nil; c-basic-offset: 4 -*- */
>+/* vim:set tw=80 expandtab softtabstop=2 ts=2 sw=2: */

Usual thing with modelines and license.

>+++ b/netwerk/dns/public/nsIDNSService.idl
>+    /**
>+     * if set, the query is given lower priority
>+     */
>+    const unsigned long RESOLVE_PRIORITY_MEDIUM = (1 << 2);
>+    const unsigned long RESOLVE_PRIORITY_LOW    = (1 << 3);

What happens if both are set?  Probably we should document that we treat it as PRIORITY_MEDIUM in that case, if that's what we do.

>+      
> };

Weird whitespace there.  Take it out?

>+++ b/netwerk/dns/src/nsHostResolver.cpp
>+#define MAX_NON_PRIORITY 150

MAX_NON_PRIORITY_REQUESTS maybe?

>+// we dont want to be spinning new threads and redoing res_init() all the time in the new thread

"don't", and please do the whole punctuation/capitalization thing?

Maybe document where this '4' number for NumIdleThreads comes from?  And what NumIdleThreads means?  Shouldn't it be called something more like MAX_IDLE_THREADS or some such?

>+static PRUint32 LongIdleTimeout = 300;            // for threads 1 -> numidlethreads 
>+static PRUint32 ShortIdleTimeout = 60;            // for threads numidlethreads+1 -> MAX_RESOLVER_THREADS

Those are in seconds, right?  Document that?  Or really, why not store these in interval units so we don't have to keep calling PR_SecondsToInterval all the time?

Also, add a PR_STATIC_ASSERT(MAX_RESOLVER_THREADS > NumIdleThreads), since we seem to be assuming that?

And NumIdleThreads in those comments, right?

> nsHostResolver::nsHostResolver(PRUint32 maxCacheEntries,
>+    , mHaveIdleThread(0)

So mHaveIdleThread is clearly a counter of some sort now, right?  What's it really counting?  It should get renamed accordingly...  Perhaps mNumIdleThreads?

>+    int i = 0, j;

|i| is only really needed to set .id on the mResolverID[n], right?  That .id seems to be unused; can we just drop it?  Or am I missing something?

>@@ -368,28 +391,31 @@ nsHostResolver::Shutdown()
>+        MoveCList(mHighQ, pendingQ);
>+        MoveCList(mMediumQ, pendingQ);
>+        MoveCList(mLowQ, pendingQ);

That doesn't look right, since as far as I can tell MoveCList fails if the target PRCList is not empty (and it should probably assert that its target is in fact empty, or be rewritten to allow "appending" a PRCList to the target).

>@@ -441,17 +467,17 @@ nsHostResolver::ResolveHost(const char  
>-
>+            

Please lose the whitespace change?

>@@ -479,26 +505,48 @@ nsHostResolver::ResolveHost(const char  
>+            else if ((mPendingCount >= MAX_NON_PRIORITY) &&
>+                     (flags & (RES_PRIORITY_LOW | RES_PRIORITY_MEDIUM)) &&
>+                     (!he->rec->resolving)) {

Drop the extra parens from around the first and third parts of the '&&', please.

>                 if (!he->rec->resolving) {
>+
>                     he->rec->flags = flags;

Why add the blank line?

>+                    // consider the case where we are on a pending queue of 
>+                    // lower priority than the request is being made at.
>+                    // in that case we should upgrade queues

Punctuation + capitalization?  Same in other comments in this bit here.

>+                    if ((!(flags & (RES_PRIORITY_LOW | RES_PRIORITY_MEDIUM))) &&
>+                        (he->rec->flags & (RES_PRIORITY_LOW | RES_PRIORITY_MEDIUM))) {
>+                        // move from (low|med) to high
>+                        MoveQueue(he->rec, mHighQ);
>+                    } else if ((flags & RES_PRIORITY_MEDIUM) &&
>+                               (he->rec->flags & RES_PRIORITY_LOW)) {
>+                        // move from low to med
>+                        MoveQueue(he->rec, mMediumQ);

Do we need to update he->rec->flags somewhere there to indicate that it's on a different queue now?

> nsHostResolver::IssueLookup(nsHostRecord *rec)

>+    if ((rec->flags & (RES_PRIORITY_LOW | RES_PRIORITY_MEDIUM)) == 0)

OK, so I think we could use some static inline functions like:

static inline PRBool IsHighPriority(nsHostRecord *rec);
static inline PRBool IsMediumPriority(nsHostRecord *rec);
static inline PRBool IsLowPriority(nsHostRecord *rec);

I think that would make a lot of this code much more readable.  If we want, we can make the function take flags instead of rec, so in ResolveHost we can use the same functions on the incoming flags too.  Or we could even have functions that take both, with the nsHostRecord ones calling the flags ones on rec->flags.

>+    else if (((!(rec->flags & (RES_PRIORITY_LOW | RES_PRIORITY_MEDIUM))) && (mThreadCount < MAX_RESOLVER_THREADS)) ||
>+             ((rec->flags & (RES_PRIORITY_LOW | RES_PRIORITY_MEDIUM)) && (mThreadCount < NumIdleThreads))) {

If NumIdleThreads < MAX_RESOLVER_THREADS (which we're asserting statically, right?) this looks equivalent to:

  else if (mThreadCount < NumIdleThreads ||
           (IsHighPriority(rec) && mThreadCount < MAX_RESOLVER_THREADS)) {

and this way of writing it seems a lot more readable.

>+                                        mResolverID + mThreadCount++,

This would be a lot clearer if mResolverID were named mResolverThreadInfoArr or some such....

That said, I'm not sure I follow this.  Say we get three requests in rapid succession and spin up three threads to handle them (numbered 0, 1, 2).  Thread 1 finishes its lookup and then waits a bit and then exits.  Now mThreadCount is 2.  If we then go to spin up another thread, we'll point it to slot 2 in the array.

Now if the first 2 items in the array are supposed to be any-request-goes while the third is high-priority-only, after we spin up the new thread we'll have two high-priority-only threads and only one any-request-goes.  Or am I missing something?

>-                                        0);
>+                                        256 * 1024);  // 256KB stack is plenty

Is there a reason for that change?

>+nsHostResolver::MoveQueue(nsHostRecord *aRec, PRCList &aDestQ)

Is there a reason this is a member method?  Seems like it could just be static, or certainly a static class method.

>+    NS_ASSERTION (aRec->onQueue, "Moving Host Record Not Currently Queued");

Nix the space before '('

>+    return;

No need for that.

>+nsHostResolver::GetHostToLookup(nsHostRecord **result, struct nsHostResolverID *aID)
>+    do  {

Why not |while (!mShutdown)|?  We could assert that if mShutdown is false and we're here then all the queues are empty, so it should be equivalent to your code, but without the extra break in the loop....

>+++ b/netwerk/dns/src/nsHostResolver.h
>+#define MAX_RESOLVER_THREADS_ANY  5

MAX_RESOLVER_THREADS_ANY_REQUESTS, perhaps?

>+#define MAX_RESOLVER_THREADS_HIGH 3

MAX_RESOLVER_THREADS_HIGH_PRIORITY_REQUESTS_ONLY?  Or too long?

>+    PRBool  onQueue;  /* true if pending and on the queue (not yet given to getaddrinfo())*/

How many nsHostRecord()s do we have in-flight at once, and how long do they live for?  Would it make sense to use PRPackedBool, or should we just not worry about this?


>+   // nsHostResolverID * is passed to the ThreadFunc
>+   struct nsHostResolverID
>+   {
>+       nsHostResolver *self;
>+       PRUint32 id;
>+       PRBool   onlyHighPriority;
>+   } mResolverID[MAX_RESOLVER_THREADS];

So the thing is, this isn't really an ID.  This is more of a nsHostResolverThreadInfo or some such, no?  Also, the |id| member seems to be unused.

>+    PRCList       mHighQ;
>+        PRCList       mMediumQ;
>+        PRCList       mLowQ;

Fix the indent?

>+++ b/netwerk/protocol/http/src/nsHttpChannel.cpp
>+    nsDNSPrefetch *prefetch = new nsDNSPrefetch (mURI);

Same comments as for HTML about using nsRefPtr, etc.

This looks like great stuff; let's beat the issues out and get this landed.  ;)
Comment 27 Boris Zbarsky [:bz] 2008-10-24 07:19:41 PDT
One other thought.  For anchors, it seems a bit wasteful to create the URI object and create the HTMLDNSPrefetch object just to bail out because prefetch is not allowed. So maybe a better split there is a separate IsDNSPrefetchAllowed check (either on nsIDocument or on nsContentUtils and taking an nsIDocument arg) and if it succeeds just using the necko nsDNSPrefetch?
Comment 28 Patrick McManus [:mcmanus] 2008-10-30 11:21:50 PDT
Created attachment 345527 [details] [diff] [review]
version 6 of patch

Boris, thank you for the extensive comments in 26. I believe I had everything in there that is appropriate to do so with this update. It is made much stronger with your help. 

I didn't note until now your comment 27. Let me know if you think that should block this - my opinion is that these details should remain abstracted inside the prefetch class.
Comment 29 Boris Zbarsky [:bz] 2008-10-30 21:37:23 PDT
Comment on attachment 345527 [details] [diff] [review]
version 6 of patch

>+++ b/content/html/content/src/nsHTMLAnchorElement.cpp
>+nsHTMLAnchorElement::PrefetchDNS()
>+    nsRefPtr<nsHTMLDNSPrefetch> prefetch = new nsHTMLDNSPrefetch(hrefURI, GetOwnerDoc());

Please wrap to 80 cols?

>+++ b/content/html/content/src/nsHTMLDNSPrefetch.cpp
>+    // This would be better done with a static constructor method, but those cannot
>+    // be portably used within mozilla.

What you probably want to do is to just make initialization here a class static method and run it from nsLayoutStatics::Initialize.  Sorry I wasn't more clear about that.

>+nsHTMLDNSPrefetch::IsSecureBaseContext (nsIDocument *aDocument)
>+  if (scheme.EqualsLiteral("https"))
>+    return PR_TRUE;
>+  return PR_FALSE;

  return scheme.EqualsLiteral("https");

>+nsHTMLDNSPrefetch::IsAllowed (nsIDocument *aDocument)
>+  if (IsSecureBaseContext(aDocument)) {
>+    if (GetDisablePrefetchHTTPSPref())

This would be more readable with a && in my opinion.

>+static nsIDNSService *sDNSService = nsnull;

We never release this, so we'd leak it...  Probably best to just set it up in nsLayoutStatics::Initialize and release it in nsLayoutStatics::Shutdown.  Then in Prefetch just null-check sDNSService.  Or we could cache it in nsContentUtils with other services, like I said in the original review.  Either way, but we need to make sure to release at shutdown.

>+nsHTMLDNSPrefetch::Prefetch(PRUint16 flags)
>+  nsresult rv;
>+  
>+  rv = dns->AsyncResolve(mHostname, flags, this, nsnull,
>+                         getter_AddRefs(tmpOutstanding));
>+  return rv;

Just |return dns->.....|

>+class nsHTMLDNSPrefetch : public nsIDNSListener
>+  // the context the prefetch will not be performed. The nsDNSPrefetch class
>+  // does not require an nsIDocument.

I'm not sure we want that last sentence there.

>+++ b/netwerk/base/src/nsDNSPrefetch.cpp
>+static nsIDNSService *sDNSService = nsnull;

Again, we're not releasing this.  nsNetShutdown might be an ok place to do that.

>+    nsresult rv;
>+
>+    rv = dns->AsyncResolve(mHostname, flags, this, nsnull,
>+                           getter_AddRefs(tmpOutstanding));
>+    return rv;

Again, just |return dns->....|

>+++ b/netwerk/dns/src/nsHostResolver.cpp
>+// Use a persistent thread pool inorder to avoid spinning up new threads all the time.

"in order"

>+// go first to an idle thread, if that cannot be found and there are less than MAX_RESOLVER_THREADS

'.' after "thread", please. And s/less/fewer/

>+// currently in the pool a new threads

"thread"

>+// there are less than HighThreadThreshold currently outstanding. If a thread cannot be

s/less/fewer/

Please fix the shutdown leaks and the minor nits, and this looks great.  Just keep an eye out on Tp!
Comment 30 Patrick McManus [:mcmanus] 2008-10-31 09:36:32 PDT
Created attachment 345727 [details] [diff] [review]
version 7 of patch

I believe this addresses everything from comment 29.
Comment 31 Patrick McManus [:mcmanus] 2008-11-01 06:44:30 PDT
please note when checking in from comment 24:

This patch now depends on the  "Fix normalization of flags in DNS cache"
attachment in bug 459724 in order to work correctly.
Comment 32 Boris Zbarsky [:bz] 2008-11-02 09:59:46 PST
Comment on attachment 345727 [details] [diff] [review]
version 7 of patch

This is almost all good stuff, but please use a static raw pointer with manual NS_RELEASE when cleaning up (and NS_ADDREF if needed when setting, e.g. in the necko nsDNSPrefetch code) instead of a static nsCOMPtr...
Comment 33 Christian :Biesinger (don't email me, ping me on IRC) 2008-11-03 13:14:21 PST
Comment on attachment 345727 [details] [diff] [review]
version 7 of patch

+nsDNSPrefetch::nsDNSPrefetch(nsAString &hostname)
+{
+    CopyUTF16toUTF8(hostname, mHostname);


why are you using UTF-16 hostnames in necko? In fact this constructor doesn't seem to be called at all.

@@ -3981,16 +3982,23 @@ nsHttpChannel::AsyncOpen(nsIStreamListen
+    // Start a DNS lookup very early in case the real open is queued the DNS can 
+    // happen in parallel.
+    nsRefPtr<nsDNSPrefetch> prefetch = new nsDNSPrefetch(mURI);

Doesn't the queuing mean that we are currently processing another load to the same host, meaning that the host should already be cached from that? What advantage does this additional prefetch have?
Comment 34 Patrick McManus [:mcmanus] 2008-11-03 13:35:56 PST
(In reply to comment #32)
> (From update of attachment 345727 [details] [diff] [review])
> This is almost all good stuff, but please use a static raw pointer with manual
> NS_RELEASE when cleaning up (and NS_ADDREF if needed when setting, e.g. in the
> necko nsDNSPrefetch code) instead of a static nsCOMPtr...

hi - I've made this change and it will be reflected in v8 of the patch (to be posted tonight), but can you help me understand the difference/advantage? Thanks.
Comment 35 Patrick McManus [:mcmanus] 2008-11-03 13:43:44 PST
Christian, thanks for the comments!

(In reply to comment #33)
> (From update of attachment 345727 [details] [diff] [review])
> +nsDNSPrefetch::nsDNSPrefetch(nsAString &hostname)
> +{

this isn't used in the necko part, as you note, so I've removed it in v8.


> @@ -3981,16 +3982,23 @@ nsHttpChannel::AsyncOpen(nsIStreamListen
> +    // Start a DNS lookup very early in case the real open is queued the DNS
> can 
> +    // happen in parallel.
> +    nsRefPtr<nsDNSPrefetch> prefetch = new nsDNSPrefetch(mURI);
> 
> Doesn't the queuing mean that we are currently processing another load to the
> same host, meaning that the host should already be cached from that? 

not necessarily - there are configurable quotas on the total number of streams that can be created at one time, seperate from the per host quota. (usually 6 in total I think). In my testing it isn't unusual to hit the limit on sites that embed a lot of content by reference. (blogs, aggregators, youtube, photos, counters, analytics, jscript libraries, etc.) Realistically all it takes is references to 3 different hosts.
Comment 36 Patrick McManus [:mcmanus] 2008-11-03 14:45:33 PST
Created attachment 346134 [details] [diff] [review]
version 8 of patch
Comment 37 Boris Zbarsky [:bz] 2008-11-03 17:28:20 PST
> but can you help me understand the difference/advantage?

Just a matter of static nsCOMPtrs confusing leak-detection tools, mostly (and the whole portability guideline thing about static objects with constructors/destructors, which nsCOMPtr has).
Comment 38 Boris Zbarsky [:bz] 2008-11-04 09:20:22 PST
Created attachment 346274 [details] [diff] [review]
Additional patch to fix shutdown nsHostResolver leak

Leak tests on tinderbox caught this.  We need to wake up all the worker threads on shutdown so they can exit, and we were only waking up one.
Comment 39 Boris Zbarsky [:bz] 2008-11-04 18:38:34 PST
Pushed changeset e8fd3f4c52b6 for the main patch and changeset 19b3caf108d1 for the leak fix.
Comment 40 Justin Dolske [:Dolske] 2008-11-05 12:58:21 PST
Backed out due to leaks (see bug 463263).
Comment 41 Boris Zbarsky [:bz] 2008-11-05 17:10:06 PST
So from the leak logs it looks like we were managing to leak a nsDNSAsyncRequest.  Everything else that was leaked would be held by one of those.
Comment 42 Boris Zbarsky [:bz] 2008-11-05 17:32:58 PST
Actually, scratch that.  nsDNSAsyncRequest doesn't hold an nsHostRecord, but losing nsHostRecords could certainly cause us to leak their callbacks, which are nsDNSAsyncRequests.
Comment 43 Bill Gianopoulos [:WG9s] 2008-11-05 17:49:26 PST
I have verified the checkins from this bug as causing the regression noted in bug 463215.
Comment 44 Patrick McManus [:mcmanus] 2008-11-06 06:17:37 PST
I agree we must be losing nsHostRecords somehow, but sadly I get a clean leak report when I run the same test tinderbox does. Of course I am still looking at it and 463215.
Comment 45 Bill Gianopoulos [:WG9s] 2008-11-06 06:50:20 PST
(In reply to comment #44)
> I agree we must be losing nsHostRecords somehow, but sadly I get a clean leak
> report when I run the same test tinderbox does. Of course I am still looking at
> it and 463215.

When looking at 463215, please note that we now have a report of the same type of issue with No proxy selected rather than auto-detect.  Evidently I just did not run long enough yesterday with no proxy set to see the issue, probably because I changed back to auto-detect in order to isolate the regression cause.
Comment 46 Boris Zbarsky [:bz] 2008-11-06 07:32:45 PST
Tinderbox didn't hit the leak reliably either; just "often".  :(  There's almost certainly some sort of timing issue involved...
Comment 47 Patrick McManus [:mcmanus] 2008-11-06 08:19:54 PST
I have been able to make the leak happen a couple of times now. My data set is too small to be sure, but it appears to be leaking outstanding requests during shutdown..  (or maybe ones that are being added during shutdown..)
Comment 48 Boris Zbarsky [:bz] 2008-11-06 08:36:38 PST
We shouldn't be adding requests during shutdown, I would think: we have an mShutdown check in nsHostResolver::ResolveHost.
Comment 49 Patrick McManus [:mcmanus] 2008-11-06 10:54:56 PST
I think this whole leak is innocuous.

resolver::shutdown() is called which wakes up anybody in the thread pool after setting the shutdown flag and then, after cleaning up a few other things, exits. Shutdown() is synchronous, but the other thread exits are not, and they hold references to the resolver.

That leaves a race between the firefox shutdown/leak code and the pool threads getting scheduled and running to completion. The pool thread could take a fair chunk of time if it is blocked on getaddrinfo() which is why I can reproduce this easily if I add a network delay getting to my name server.

spinning on while(mthreadcount) at the end of shutdown reliably fixes the leak. We probably want to join the threads for a real fix.

interestingly, this is not a result of my patch - but the asynchronous nature of the prefetch code brings it to light in what is otherwise a synchronous test.

My only hesitation on doing joins or the spin is that the time it takes to shutdown is going to be bound by the operating system timing out PR_GetAddrInfoByName(). The fx isn't really interruptible or configurable, and some OS's might give it a very long time indeed.

I suppose we could spin with an upper bound of a couple seconds..

none of this has anything to do with 463215.
Comment 50 Boris Zbarsky [:bz] 2008-11-06 11:08:36 PST
Hmm.  Benjamin, is there anything specific we should be doing with the worker threads to make shutdown leak logging happy?  Given comment 49, the "spin-or-join with timeout" approach sounds good to me, but if there's something better I'd love to know.
Comment 51 Patrick McManus [:mcmanus] 2008-11-06 11:32:37 PST
fyi: this is the patch I would propose (I'll roll them all together into one attachment) if Benjamin doesn't have a silver bullet.

diff --git a/netwerk/dns/src/nsHostResolver.cpp b/netwerk/dns/src/nsHostResolver.cpp
--- a/netwerk/dns/src/nsHostResolver.cpp
+++ b/netwerk/dns/src/nsHostResolver.cpp
@@ -444,16 +444,25 @@ nsHostResolver::Shutdown()
         PRCList *node = evictionQ.next;
         while (node != &evictionQ) {
             nsHostRecord *rec = static_cast<nsHostRecord *>(node);
             node = node->next;
             NS_RELEASE(rec);
         }
     }

+    // Logically join the outstanding worker threads with a timeout.
+    // Use this approach instead of PR_JoinThread() because that does
+    // not allow a timeout which may be necessary for a responsive
+    // shutdown if the thread is blocked on a very slow DNS resolution.
+
+    PRIntervalTime delay = PR_MillisecondsToInterval(25);
+    PRIntervalTime stopTime = PR_IntervalNow() + PR_SecondsToInterval(2);
+    while (mThreadCount && PR_IntervalNow() < stopTime)
+        PR_Sleep(delay);
 }

 static inline PRBool
 IsHighPriority(PRUint16 flags)
 {
     return !(flags & (nsHostResolver::RES_PRIORITY_LOW | nsHostResolver::RES_PRIORITY_MEDIUM));
 }
Comment 52 Benjamin Smedberg [:bsmedberg] 2008-11-06 11:33:49 PST
The normal/correct behavior is that all XPCOM threads must be joined before or during the xpcom-shutdown-threads global observer notification.

At some point we specifically decided to absolve the DNS lookup thread(s) from this responsibility because PR_GetAddrInfoByName was uninterruptably blocking. We cannot delay shutdown for any significant amount of time. This meant, IIRC, that the DNS threads couldn't participate in XPCOM.
Comment 53 Boris Zbarsky [:bz] 2008-11-06 11:41:23 PST
Unfortunately, the DNS threads effectively hold owning references to nsHostRecord objects (or more precisely the objects are not released until the lookup is done).  That's the problem we're having here...
Comment 54 Benjamin Smedberg [:bsmedberg] 2008-11-06 11:57:32 PST
We could perhaps use the magic shutdown flag differently to abandon, so that we can simply abandon the worker thread by having the main thread do cleanup for the abandoned worker thread. I can try to do this, if you file a separate bug. The big problem is that you have to avoid referencing mLock after the thread has been abandoned, which sounds like it would introduce a race condition. Anyone know if it's ok to intentionally leak a PRLock?
Comment 55 Patrick McManus [:mcmanus] 2008-11-07 10:57:09 PST
let me know if I'm reading the situation wrong.

What I see is that the DNS code has always violated, with absolution, the edict to join those threads at shutdown. Furthermore, spinning/blocking there just adds delay and isn't really desirable (I agree). However, with the prefetch code the test harness now sees an always existing race between shutdown and the leak detection code. There isn't any race that produces an actual leak - just the detection of it. Nonetheless the leak detection code is important.

I'm going to post a patch that splits the baby. The patch will spin-with-timeout in the shutdown code in debug mode only (which I believe is a pre-req for the leack check code) so we can get reasonable reports but in non-debug builds we will just do the shutdown async as always. I'll also file a follwon bug to see if this can be disentangled as per comment 54.
Comment 56 Benjamin Smedberg [:bsmedberg] 2008-11-07 11:19:40 PST
This could be an actual leak, though not a serious one: it's quite possible for the application to terminate before the worker thread which is performing a DNS resolution returns.

The debug-only spin thing makes me slightly uncomfortable, but not enough to seriously object. The other alternative would be to stop leak-checking nsHostRecord, which is probably worse.
Comment 57 Ted Mielczarek [:ted.mielczarek] 2008-11-07 11:22:20 PST
Note that the refcount leak checking code is available in non-debug builds. The leaks that this patch was backed out for (in mochitest) run on non-debug builds.
Comment 58 Boris Zbarsky [:bz] 2008-11-07 11:33:52 PST
Yeah; we should condition the spinning on NS_BUILD_REFCNT_LOGGING, not DEBUG.
Comment 59 Patrick McManus [:mcmanus] 2008-11-07 14:02:10 PST
Created attachment 346958 [details] [diff] [review]
version 9 of patch

Boris - thanks for the pointer to NS_BUILD_REFCNT_LOGGING, I was having trouble finding that symbol.

This patch should apply to moz-central and it contains:
 * most of v8 (which was backed out)
 * the PR_NotifyAllCondVar change attached to this bug (which was backed out)
 * the wait-for-thread-exit code in shutdown(), which should help the leak detection tool
 * fixes for the problem reported in 463215 - primarily the 'high-priority-only' threads were not being created. Based on this evidence, I also tilted the pool a little towards high-only threads (which could cause queueing in the prefetch requests, but leaves more slots for click driven lookups.) 

Note that v8 contained 4 new files and those were not backed out of hg (they were just removed from the Makefile). v9 does not readd them and requires no changes to them.
Comment 60 Boris Zbarsky [:bz] 2008-11-07 14:40:13 PST
Comment on attachment 346958 [details] [diff] [review]
version 9 of patch

Add a brief comment about it being ok to read mThreadCount outside the lock there because the worst that will happen if we race on it is that we'll wait an extra 25ms?
Comment 61 Patrick McManus [:mcmanus] 2008-11-07 15:09:56 PST
Created attachment 346976 [details] [diff] [review]
version 10 of patch

version 9 with an updated comment as per comment. Think it is checkin-ready?
Comment 62 Patrick McManus [:mcmanus] 2008-11-07 15:10:44 PST
The follow-on bug, regarding joining the threads, is 463724.
Comment 63 Boris Zbarsky [:bz] 2008-11-07 17:21:09 PST
Comment on attachment 346976 [details] [diff] [review]
version 10 of patch

Yeah, this is ready to go.  It would be great to get beta baking for this!
Comment 64 Mike Beltzner [:beltzner, not reading bugmail] 2008-11-10 16:44:04 PST
Comment on attachment 346976 [details] [diff] [review]
version 10 of patch

a=beltzner; feels like a big (and risky) change so watch closely for regressions. You're getting approval because the potential benefit to responsiveness is also pretty big :)
Comment 65 Damon Sicore (:damons) 2008-11-11 08:01:34 PST
Has this landed?
Comment 66 Boris Zbarsky [:bz] 2008-11-11 08:06:58 PST
Not yet.  Tree's been closed ever since it got a= due to orange.
Comment 67 Damon Sicore (:damons) 2008-11-11 11:28:32 PST
Should probably get it in the queue:  https://wiki.mozilla.org/1.9.1_beta_2_checkins
Comment 68 Boris Zbarsky [:bz] 2008-11-11 21:23:25 PST
Pushed changeset 0496b58bb9f8 with that latest diff.
Comment 69 Boris Zbarsky [:bz] 2008-11-13 11:12:52 PST
So this seems to have caused a small (2-3%) Tp3 regression at least on Mac and Linux....  (well, that or one of the other changes I pushed, but this one is the most likely culprit of the set).
Comment 70 Patrick McManus [:mcmanus] 2008-11-13 11:29:02 PST
I'm not familiar with how to run tp3 - any pointers are appreciated.

https://wiki.mozilla.org/Perfomatic/Test_Name_Mappings describes tp3 as "iframe-based pageload test embedded into Talos; loads content via local proxy server.".. If that's accurate it is only going to measure overhead for this feature (there is no upside when the proxy server is doing the DNS), but that doesn't mean we can't work on any overhead.

is tp3 measuring cpu time or elapsed time?
Comment 71 David Baron :dbaron: ⌚️UTC-7 (review requests must explain patch) 2008-11-13 11:32:06 PST
The primary number (the one we're looking at) is a measure of elapsed time.
Comment 72 Boris Zbarsky [:bz] 2008-11-13 11:47:01 PST
Not sure how one runs tp3 locally...  I can't seem to find anything about it (or talos, for that matter) at https://wiki.mozilla.org/Performance:Tinderbox_Tests

Alice, is there a way Patrick can run Tp3 with the tinderbox pageset locally?
Comment 73 Shawn Wilsher :sdwilsh 2008-11-13 11:53:54 PST
You'll likely want to run standalone talos:
https://wiki.mozilla.org/StandaloneTalos

Note: the pageset is different however (we cannot publish our pageset for legal reasons if I recall correctly)
Comment 74 Vladimir Vukicevic [:vlad] [:vladv] 2008-11-13 12:01:28 PST
I put a little xul app that can run Tp and reftests at http://people.mozilla.com/~vladimir/misc/testtool.zip ; though if you have a build with --enable-tests you can just run with the right command line options yourself (-help, it's the -tp option).  The full talos script uses this for all the work, though it also sets up a local http server as well.
Comment 75 Boris Zbarsky [:bz] 2008-11-13 18:59:08 PST
I filed bug 464838 on investigating the Tp regression here.
Comment 76 TheRave 2009-04-26 01:51:03 PDT
Can somebody of the devs here have a look at the following thread in mozillazine forum? It seems to we have a problem with this bugfix.

http://forums.mozillazine.org/viewtopic.php?f=23&t=1150335&start=15
Comment 77 Karsten Düsterloh 2009-04-26 10:10:22 PDT
Did *anyone* give a thought about the security implications of this?
*Any* local HTML document on trunk now causes a DNS query, thus it's *extremely* easy to track who's reading a specific doc just by adding some unique URL in it - the reader doesn't even need to click the URL to be tracked!
That's just mad or utterly broken...
Comment 78 Karsten Düsterloh 2009-04-26 10:14:14 PDT
(In reply to comment #77)
> *Any* local HTML document 

... containing a mere HTML link, e.g.

<html><body>
<a href="http://what.about.sub.domains.in.html.urls/I.wonder">No need to click me!</a>
</body></html>
Comment 79 Shawn Wilsher :sdwilsh 2009-04-26 10:20:12 PDT
(In reply to comment #77)
> Did *anyone* give a thought about the security implications of this?
> *Any* local HTML document on trunk now causes a DNS query, thus it's
> *extremely* easy to track who's reading a specific doc just by adding some
> unique URL in it - the reader doesn't even need to click the URL to be tracked!
> That's just mad or utterly broken...
That's only true if the dns server that you use is owned by the website in question.  A DNS lookup doesn't go to the server of the site for a link...
Comment 80 Mike Shaver (:shaver -- probably not reading bugmail closely) 2009-04-26 10:23:38 PDT
(In reply to comment #77)
> Did *anyone* give a thought about the security implications of this?
> *Any* local HTML document on trunk now causes a DNS query, thus it's
> *extremely* easy to track who's reading a specific doc just by adding some
> unique URL in it 

You mean by putting an <img> or <script> or <link rel="many choices"> in it?  Yeah, long-standing problem, but not what this bug is about.

DNS resolution only helps if you can log queries and associate them back against the originating source, which latter is typically pretty challenging, AFAICT.
Comment 81 Karsten Düsterloh 2009-04-26 11:36:04 PDT
> That's only true if the dns server that you use is owned by the website in
> question.  A DNS lookup doesn't go to the server of the site for a link...

Trackable by the DNS provider of the linked domain, not of the website where the link appears.

> DNS resolution only helps if you can log queries and associate them back
> against the originating source, which latter is typically pretty challenging,
> AFAICT.

Pretty challenging?! 
- set up your own DNS server for your domain
- create unique, not yet used DNS entries for each user you want to track and you are able to get to see certain webpages (e.g. by sending him a spam mail, see bug 488162)
=> You won't get *all* hits of the DNS resolution attempt, but you'll get the *first* one for sure, since the new DNS record needs to find its way to the caches.
Comment 82 Bruno 'Aqualon' Escherl 2009-04-26 13:03:02 PDT
Another problem could be with plans in Germany to log every attempt to access websites on a blacklist with child pornography by using DNS bypasses to a stop page. If the provider logs that you made an DNS query for a forbidden webpage, it could result in a house search by the police, even if you never clicked on it.
Comment 83 Vladimir Vukicevic [:vlad] [:vladv] 2009-04-26 13:46:53 PDT
As Mike said in comment #80, malicious web pages can already force a DNS lookup and even a HTTP request by using an <img> or <script> or <link> tag to any arbitrary site.  Those requests would just show up nicely in their http logs, along with the dns request; if someone had malicious intent, why would they bother to go the complicated and fragile route of setting up dns and hoping that dns prefetching hits their site, when they can just put in an img tag?

Regardless, this bug is not the place for this discussion -- I'd suggest either mozillazine or posting to one of the mozilla newsgroups.
Comment 84 James May [:fowl] 2009-04-26 14:23:48 PDT
Would it be possible to plug this in to the same system that allows extensions like ad block plus to veto requests?
Comment 85 Boris Zbarsky [:bz] 2009-04-26 16:04:02 PDT
There would be a performance cost, but maybe.  File a bug, please, and request blocking if you think it needs to happen for 1.9.1?
Comment 86 Boris Zbarsky [:bz] 2009-04-26 16:26:56 PDT
Oh, and comment 76 also needs a separate bug, with blocking requested on it and marked as blocking this bug.
Comment 87 TheRave 2009-05-03 06:50:11 PDT
For comment 76 i have opened the bug 491140.

Note You need to log in before you can comment on or make changes to this bug.