Closed Bug 208312 Opened 21 years ago Closed 16 years ago

DNS: negative (NXDOMAIN) cache

Categories

(Core :: Networking, enhancement)


Tracking


RESOLVED FIXED

People

(Reporter: benc, Assigned: mcmanus)

References

Details

(Whiteboard: [DNS])

Attachments

(1 file, 1 obsolete file)

(I thought we had a bug on this...)

We should briefly cache DNS errors so that we do not bang our heads against the
same DNS failure again and again. I know that the local DNS server has caching,
but some people have a high-latency path to their DNS server, so they still
experience really long delays.

There was one bug that described this problem, and I can't find it now.

There is also bug 17433, where usage of DNS-dependent PAC functions can
result in really slow performance.
Certainly a good idea but defining "briefly" will probably be very hard.
I have used this technique in two projects for my company, which couldn't use
the OS resolver because we wanted more control. Both were (and still are) used
in large-scale VOIP systems, which could generate up to 100 lookups per second
(continuously!).

The first one had a configurable cache time, which defaulted to 60 seconds, but
that proved a bit too difficult to configure just right. Often a DNS lookup
failed (with a 4 second timeout), but a second lookup could succeed after 5 to
10 seconds, because the reply had arrived a bit late.

A second project used an adaptive algorithm. We started with a 4 second cache
time and doubled it every time we had a timeout. The cache time was capped at
120 seconds, with 4 seconds of jitter. The first lookup was done synchronously
(with the 4 second timeout); subsequent ones would return a failure after those
4 seconds, although the lookup was still done (asynchronously). If we knew a
negative TTL (from a parent domain), then we would use it too, but still capped
at 120 seconds.

The result was that we got an answer within 4 seconds, even if the answer was
negative (the system could cope with those failures). But the lookups kept
going (if triggered, of course), just further and further apart. So we never
overloaded the network, but we still got results within 4 seconds at most.
And we could restore functionality very quickly when the DNS server came back
online again.
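
A minimal sketch of that adaptive back-off, in C++ (every name here is
invented for illustration; only the 4 second start, 120 second cap, and
4 second jitter come from the description above):

  // Sketch only: hypothetical names, not taken from any real resolver.
  #include <algorithm>
  #include <cstdlib>
  #include <ctime>

  const int kInitialBackoffSecs = 4;    // first failure cached 4 seconds
  const int kMaxBackoffSecs     = 120;  // never cache a failure longer
  const int kJitterSecs         = 4;    // spread expirations around a bit

  struct NegativeEntry {
      int    backoffSecs = kInitialBackoffSecs;
      time_t expires     = 0;
  };

  // Record a lookup timeout: cache the failure for the current window
  // (or a known negative TTL, still capped), then double the back-off.
  void OnLookupTimeout(NegativeEntry &e, int negativeTtlSecs /* 0 = unknown */)
  {
      int window = e.backoffSecs;
      if (negativeTtlSecs > 0)
          window = std::min(negativeTtlSecs, kMaxBackoffSecs);
      window += std::rand() % (kJitterSecs + 1);
      e.expires = std::time(nullptr) + window;
      e.backoffSecs = std::min(e.backoffSecs * 2, kMaxBackoffSecs);
  }

  // A cached negative answer is served only while the window is open.
  bool IsNegativeCached(const NegativeEntry &e)
  {
      return std::time(nullptr) < e.expires;
  }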

I hope this helps, even though this might be more than you need for regular
surfing (4 second timeouts are really short, and a negative result would still
be unusable for a browser). But I think it can be used for PAC lookups. Some
PAC files might contain more than a dozen lookups, which could result in many
failures when a router goes down.
Blocks: 200994
I'm not sure I agree that this is something we want... but the DNS rewrite
will make this trivial to implement.
Severity: normal → enhancement
Status: NEW → ASSIGNED
Depends on: 205726
Whiteboard: [DNS]
Target Milestone: --- → Future
Bug 200994 explains a good reason why we need a negative cache (remembering
failed lookups for a while). Not all OSes provide this service in their
resolver library, or they have a very tiny fixed cache (Windows NT has 16
entries; Mac OS 8 had only 10).

Of course, the system that I described above was much more than we need. I just
gave it as an example of what is used in large systems. We had to design it
because the Tru64 UNIX resolver was too small and too slow for our servers (big
Compaq iron). Last week I had to make some changes, because 200 parallel lookups
weren't sufficient anymore. And that's continuous!

For Mozilla, a 10 second cache would help page load (as in bug 200994 and bug
17433). Maybe even a 1 minute cache.
see bug 205726 comment 69 for the necessary changes
If this is implemented, it should be used for all lookup failures, not just
NXDOMAIN. See bug 68796 for why (especially bug 68796 comment 19).
Blocks: 154816
Assignee: darin → nobody
Status: ASSIGNED → NEW
QA Contact: benc → networking
Target Milestone: Future → ---
Attached patch cache negative host lookups (obsolete) — Splinter Review
This has special currency for mobile, where any high-latency operation like DNS threatens a really long and painful serialization.

* Cache a failed host lookup for 1 minute

* In recognition that some failed lookups are ephemeral, upon
  successful reuse of a cached negative lookup start an asynchronous
  refresh in the background to update the entry

The net effect is the same number of network lookups being issued, but within (on average) 90 seconds of a failed lookup the caller does not wait for verification of host-not-found.
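
For readers who don't want to open the patch, here is a rough sketch of that
reuse-triggers-refresh idea; the struct and function names are hypothetical
stand-ins, not the actual resolver internals:

  // Sketch only, not the attachment: hypothetical names throughout.
  #include <ctime>

  const int kNegativeCacheSecs = 60;  // failed lookups cached for 1 minute

  struct HostRecord {
      bool   negative;  // true if this record caches a failed lookup
      time_t expires;   // entry is valid until this time
  };

  // Called on a cache hit. A still-valid negative entry answers the
  // caller immediately, but its reuse also schedules a background
  // re-resolve so that an ephemeral failure heals on its own.
  bool UseCachedRecord(HostRecord &rec, void (*asyncRefresh)(HostRecord &))
  {
      if (std::time(nullptr) >= rec.expires)
          return false;          // expired: caller does a fresh lookup
      if (rec.negative)
          asyncRefresh(rec);     // revalidate without blocking the caller
      return true;               // serve the cached (possibly negative) answer
  }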
Attachment #342910 - Flags: review?(bzbarsky)
Blocks: 437953
How does this behave if I turn on my computer and my browser starts before the network has come up (and hence gets all sorts of error pages on my start tabs)?  Does reloading right after network starts work?
Oh, and may I suggest adding:

  [diff]
  git = 1
  showfunc = true
  unified = 8

to your .hgrc to get prettier mq diffs?
Comment on attachment 342910 [details] [diff] [review]
cache negative host lookups

>+    PRBool       negative;   /* true if this record is a cache of a failed lookup
>+                                negative cache entries are valid just like any other
>+                                (though never for more than 60 seconds), but a use
>+                                of that negative entry forces an asynchronous refresh */

  True if this ... lookup.
  Negative ... refresh.

(punctuation and capitalization).

With that and an answer to my question about reloading, looks good.
Attachment #342910 - Flags: superreview+
Attachment #342910 - Flags: review?(bzbarsky)
Attachment #342910 - Flags: review+
(In reply to comment #8)
> How does this behave if I turn on my computer and my browser starts before the
> network has come up (and hence gets all sorts of error pages on my start tabs)?
>  Does reloading right after network starts work?

You would need to reload twice, but you would not have to wait for the timeout.

The first reload uses the negative cache hit but also kicks off a revalidation of the cached entry in the background. The second reload would pick up the positive entry generated by that.

This is probably not perceptible to someone just hitting reload repeatedly while waiting for the network to come around.
I'd think the person would wait for their connection thingy in the titlebar, or the GNOME whatever-it-is, or the Windows taskbar, to show connected, then do "reload all tabs". I agree that when it doesn't work they'd just do it again, but it seems better to make this common use case Just Work without extra clicks.

Do we get some sort of offline/online notifications in cases like that? If so, can we hook into those to invalidate the negative cache?
The definition of "network up" tends to lead to insanity, in my past experience. You run into distinctions between the link layer, IP addresses, and even firewall layers... all of these things can toggle back and forth a few times on startup. It's really miserable to try to drive a state machine off of that. And even then, sometimes other parts of the network aren't really ready to accept you: spanning tree algorithms on the switch are classic for swallowing packets for a settling period after you join the tree, and this is transparent to the host.

And of course any such technique is completely unportable between OSes, so I would try to avoid going down that road.

Here's an alternate suggestion: nsIDNSService.idl already has a flag to bypass the cache, which would bypass the negative cache entry too if it were set. I can't find any user of the flag in mozilla-central, but if we managed to set it on manual reloads, I think that would do the trick.

I'm not sure where to start with that, but I'll research it. Pointers are welcomed.
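
For what it's worth, a sketch of what a bypassing lookup might look like
(this assumes RESOLVE_BYPASS_CACHE also skips negative entries, and leaves
the nsIDNSListener implementation to the caller):

  // Sketch only: assumes RESOLVE_BYPASS_CACHE skips negative entries too.
  #include "nsCOMPtr.h"
  #include "nsICancelable.h"
  #include "nsIDNSListener.h"
  #include "nsIDNSService.h"
  #include "nsServiceManagerUtils.h"

  nsresult ResolveBypassingCache(const nsACString &aHost,
                                 nsIDNSListener *aListener)
  {
      nsresult rv;
      nsCOMPtr<nsIDNSService> dns =
          do_GetService("@mozilla.org/network/dns-service;1", &rv);
      if (NS_FAILED(rv))
          return rv;

      nsCOMPtr<nsICancelable> cancelable;
      // RESOLVE_BYPASS_CACHE forces a fresh network lookup, ignoring
      // any cached entry -- including a cached negative one.
      return dns->AsyncResolve(aHost,
                               nsIDNSService::RESOLVE_BYPASS_CACHE,
                               aListener,
                               nsnull,  // callbacks on the calling thread
                               getter_AddRefs(cancelable));
  }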
> It's really miserable to try to drive a state machine off of that.

Sure. I'm not saying we need to do something complicated here; just the cheap thing, if it works. We might not even need anything here, since now that I look at nsIOService::SetOffline it looks like the DNS service isn't even running when we're offline. Does shutting down the DNS service clear out the cache we're looking at here? I assume it does...

Doing the DNS cache bypass thing might be a good idea too, but it can be a followup bug as far as I'm concerned.
(In reply to comment #14)

> when we're offline. Does shutting down the DNS service clear out the cache
> we're looking at here? I assume it does...

Yes, doing that removes all the cache entries.

So I'll attach a patch with the comments fixed up to match your feedback, and I'll make a follow-on bug regarding reload.
Updated from v1 based on review comments.
Attachment #342910 - Attachment is obsolete: true
Bug 459724 is the follow-on bug regarding reload.
Pushed changeset 1e71f0540534.  Patrick, thanks for doing this!

Not sure whether there's a good way to write a test for this...
Assignee: nobody → mcmanus
Status: NEW → RESOLVED
Closed: 16 years ago
Resolution: --- → FIXED
