(I thought we had a bug on this...) We should briefly cache DNS errors so that we do not bang our heads against the same DNS failure again and again. I know that the local DNS server has caching, but some people have a long round trip to their DNS server, so they still experience really long delays. There was one bug that described this problem, and I can't find it now. There is also bug 17433, where usage of PAC functions that are DNS-dependent can result in really slow performance.
Certainly a good idea but defining "briefly" will probably be very hard.
I have used this technique in two projects for my company, which couldn't use the OS resolver because we wanted more control. Both were (and still are) used in large-scale VoIP systems, which can generate up to 100 lookups per second (continuously!).

The first one had a configurable cache time, which defaulted to 60 seconds, but that proved a bit too difficult to configure just right. Often a DNS lookup failed (with a 4-second timeout), but a second lookup could succeed after 5 to 10 seconds, because the reply had arrived a bit late.

The second project used an adaptive algorithm. We started with a 4-second cache, and doubled it every time we had a timeout. The cache time was capped at 120 seconds, with 4 seconds of jitter. The first lookup was done synchronously (with the 4-second timeout); later lookups would return a failure after those 4 seconds, although the lookup was still performed (asynchronously). If we knew a negative TTL (from a domain above), then we would use it too, but still capped at 120 seconds. The result was that we got an answer within 4 seconds, even if the answer was negative (the system could cope with those failures). But the lookups kept continuing (if triggered, of course), just further and further apart. So we never overloaded the network, but we still got results within 4 seconds at most. And we could restore functionality very quickly when the DNS server came back online again.

I hope this helps, even though this might be more than you would need for regular surfing (4-second timeouts are really short, and a negative result would still be unusable for a browser). But I think it can be used for PAC lookups. Some PAC files might contain more than a dozen lookups, which could result in many failures when a router goes down.
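The adaptive scheme described above (start at 4 seconds, double the hold time on each repeated timeout, cap at 120 seconds plus jitter) can be sketched roughly like this. All class and method names here are hypothetical illustrations, not the code from either project:

```python
import random
import time

class AdaptiveNegativeCache:
    """Sketch of an adaptive negative DNS cache with exponential backoff."""

    BASE = 4.0      # initial negative-cache hold time (seconds)
    CAP = 120.0     # maximum hold time
    JITTER = 4.0    # spread retries so they don't synchronize

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._entries = {}  # hostname -> (expiry time, current backoff)

    def record_failure(self, host):
        """Called when a lookup times out; doubles the backoff, up to CAP."""
        _, backoff = self._entries.get(host, (0.0, self.BASE / 2))
        backoff = min(backoff * 2, self.CAP)
        hold = backoff + random.uniform(0, self.JITTER)
        self._entries[host] = (self._clock() + hold, backoff)

    def record_success(self, host):
        """A successful (possibly late) reply clears the negative entry."""
        self._entries.pop(host, None)

    def is_negative(self, host):
        """True while the cached failure is still considered fresh."""
        entry = self._entries.get(host)
        if entry is None:
            return False
        expiry, _ = entry
        if self._clock() >= expiry:
            del self._entries[host]
            return False
        return True
```

A caller would consult `is_negative()` before issuing a lookup, return failure immediately on a hit, and still fire the real lookup asynchronously so `record_success()` can clear the entry as soon as the server recovers.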
i'm not sure i agree that this is something we want... but, the DNS rewrite will make this trivial to implement.
Severity: normal → enhancement
Status: NEW → ASSIGNED
Depends on: 205726
Target Milestone: --- → Future
Bug 200994 explains a good reason why we need a negative cache (remember failed lookups for a while). Not all OSes provide this service in their resolver library, or they have a very tiny fixed cache (Windows NT has 16 entries; Mac OS 8 had only 10). Of course, the system that I described above is much more than we need; I just gave it as an example of what is used in large systems. We had to design it because the Tru64 UNIX resolver was too small and too slow for our servers (big Compaq iron). Last week I had to make some changes, because 200 parallel lookups weren't sufficient anymore. And that's continuous! For Mozilla, a 10-second cache would help page load (as in bug 200994 and bug 17433). Maybe a 1-minute cache.
see bug 205726 comment 69 for the necessary changes
If this is implemented, it should be used for all lookup-failures, not just NXDOMAIN. See bug 68796 why (especially bug 68796 comment 19).
Assignee: darin → nobody
Status: ASSIGNED → NEW
QA Contact: benc → networking
Target Milestone: Future → ---
Created attachment 342910 [details] [diff] [review]
cache negative host lookups

This has special currency for mobile, where any high-latency operation like DNS threatens a really long and painful serialization.

* Cache a failed host lookup for 1 minute.
* In recognition that some failed lookups are ephemeral, upon reuse of a cached negative lookup, start an asynchronous refresh in the background to update the entry.

This has the net effect of the same number of network lookups being issued, but within (on average) 90 seconds of a failed lookup the caller does not wait for verification of host-not-found.
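The behaviour this patch comment describes can be sketched roughly as follows. The names are hypothetical stand-ins, not the actual nsHostResolver code from the attachment:

```python
import threading
import time

NEGATIVE_TTL = 60.0  # cache a failed lookup for one minute

class NegativeDnsCache:
    """Sketch: a negative-cache hit returns failure instantly while a
    background refresh re-resolves the host, so a recovered host is
    found quickly without the caller ever blocking on the timeout."""

    def __init__(self, resolver, clock=time.monotonic):
        self._resolve = resolver   # callable: host -> address or None
        self._clock = clock
        self._neg = {}             # host -> negative-entry expiry time
        self._lock = threading.Lock()

    def lookup(self, host):
        with self._lock:
            expiry = self._neg.get(host)
            hit = expiry is not None and self._clock() < expiry
        if hit:
            # Serve the cached failure immediately, but revalidate in
            # the background so the next lookup can see a recovery.
            threading.Thread(target=self._refresh, args=(host,),
                             daemon=True).start()
            return None
        return self._refresh(host)

    def _refresh(self, host):
        addr = self._resolve(host)
        with self._lock:
            if addr is None:
                self._neg[host] = self._clock() + NEGATIVE_TTL
            else:
                self._neg.pop(host, None)
        return addr
```

This reproduces the trade-off described above: the same number of real lookups go out, but only the first one pays the full host-not-found wait.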
How does this behave if I turn on my computer and my browser starts before the network has come up (and hence gets all sorts of error pages on my start tabs)? Does reloading right after network starts work?
Oh, and may I suggest adding:

[diff]
git = 1
showfunc = true
unified = 8

to your .hgrc to get prettier mq diffs?
Comment on attachment 342910 [details] [diff] [review]
cache negative host lookups

>+    PRBool negative;   /* true if this record is a cache of a failed lookup
>+                          negative cache entries are valid just like any other
>+                          (though never for more than 60 seconds), but a use
>+                          of that negative entry forces an asynchronous refresh */

"True if this ... lookup. Negative ... refresh." (punctuation and capitalization).

With that and an answer to my question about reloading, looks good.
(In reply to comment #8)
> How does this behave if I turn on my computer and my browser starts before the
> network has come up (and hence gets all sorts of error pages on my start tabs)?
> Does reloading right after network starts work?

You would need to reload twice, but you would not have to wait for the timeout. The first reload uses the negative cache hit but also kicks off a revalidation of the cached entry in the background; the second reload would pick up the positive entry generated by that. This is probably not perceptible to someone just hitting reload repeatedly while waiting for the network to come around.
I'd think the person would wait for their connection thingie in the titlebar, or GNOME whatever-it-is, or the Windows taskbar, to show connected, then do "reload all tabs". I agree that when it doesn't work they'd just do it again, but it seems better to make this common use case Just Work without extra clicks. Do we get some sort of offline/online notification in cases like that? If so, can we hook into it to invalidate the negative cache?
The definition of "network up" tends to lead to insanity, in my past experience. You run into distinctions between link state, IP addresses, and even firewall layers; all of these things can toggle back and forth a few times on startup. It's really miserable to try to drive a state machine off of. And even then, sometimes other parts of the network aren't really ready to accept you: spanning-tree algorithms on the switch are classic for swallowing packets for a settling period after you join the tree, and this is transparent to the host. And of course any such technique is completely unportable between OSes, so I would try to avoid going down that road.

Here's an alternate suggestion: nsIDNSService.idl already has a flag to bypass the cache. That would bypass the negative cache entry too, if it were set. I can't find any user of the flag in mozilla-central, but if we managed to set it on manual reloads, I think that would do the trick. I'm not sure where to start with that, but I'll research it. Pointers are welcomed.
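The bypass-flag suggestion can be sketched like this. The flag constant is modeled loosely on the cache-bypass flag in nsIDNSService.idl; the class and resolver callable are hypothetical stand-ins, not Mozilla's implementation:

```python
RESOLVE_BYPASS_CACHE = 1 << 0  # modeled on the flag in nsIDNSService.idl

class FlaggedResolver:
    """Sketch: a manual reload passes RESOLVE_BYPASS_CACHE so the
    negative cache is skipped and the host is re-resolved for real."""

    def __init__(self, resolver):
        self._resolve = resolver   # callable: host -> address or None
        self._negative = set()     # hosts with a cached failure

    def lookup(self, host, flags=0):
        if host in self._negative and not (flags & RESOLVE_BYPASS_CACHE):
            return None            # fast failure from the negative cache
        addr = self._resolve(host)
        if addr is None:
            self._negative.add(host)
        else:
            self._negative.discard(host)
        return addr
```

A plain page load would call `lookup(host)` and benefit from the fast negative answer, while a manual reload would call `lookup(host, RESOLVE_BYPASS_CACHE)` and always hit the network, clearing the negative entry on success.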
> It's really miserable to try to drive a state machine off of.

Sure. I'm not saying we need to do something complicated here; just the cheap thing, if it works. We might not even need anything here, since now that I look at nsIOService::SetOffline, it looks like the DNS service isn't even running when we're offline. Does shutting down the DNS service clear out the cache we're looking at here? I assume it does...

Doing the DNS cache bypass thing might be a good idea too, but it can be a followup bug as far as I'm concerned.
(In reply to comment #14)
> when we're offline. Does shutting down the DNS service clear out the cache
> we're looking at here? I assume it does...

Yes, doing that removes all the cache entries. So I'll attach a patch with the comments fixed up to match your feedback, and I'll file a follow-on bug regarding reload.
Created attachment 342935 [details] [diff] [review] negative dns cache - v2 updated from v1 based on review comments
Attachment #342910 - Attachment is obsolete: true
Bug 459724 is the follow-on bug regarding reload.
Pushed changeset 1e71f0540534. Patrick, thanks for doing this! Not sure whether there's a good way to write a test for this...
Assignee: nobody → mcmanus
Status: NEW → RESOLVED
Last Resolved: 10 years ago
Resolution: --- → FIXED