Closed Bug 208312 Opened 22 years ago Closed 16 years ago

DNS: negative (NXDOMAIN) cache

Status:

RESOLVED FIXED

People

(Reporter: benc, Assigned: mcmanus)

References

Details

(Whiteboard: [DNS])

Attachments

(1 file, 1 obsolete file)

cache negative host lookups; 16 years ago; Patrick McManus [:mcmanus]; 3.11 KB, patch; bzbarsky: review+, bzbarsky: superreview+
negative dns cache - v2; 16 years ago; Patrick McManus [:mcmanus]; 3.45 KB, patch

benc

Reporter

Description

•

22 years ago

(I thought we had a bug on this...) We should briefly cache DNS errors so that we do not bang our head against the same DNS failure again and again. I know that the local DNS server has caching, but some people have a long-pole to their DNS server, so they still experience really long delays. There was one bug that described this problem, and I can't find it now. There is also bug 17433, where usage of PAC functions that are DNS dependent can result in really slow performance.

tenthumbs

Comment 1

•

22 years ago

Certainly a good idea but defining "briefly" will probably be very hard.

Jo Hermans

Comment 2

•

22 years ago

I have used this technique in two projects for my company, which couldn't use the OS resolver because we wanted to have more control. Both were (and still are) used in large-scale VoIP systems, which could generate up to 100 lookups per second (continuously!).

The first one had a configurable cache time, which defaulted to 60 seconds, but it proved a bit too difficult to configure just right. Often a DNS lookup failed (with a 4 second timeout), but a second lookup could succeed after 5 to 10 seconds, because the reply had arrived a bit late.

The second project used an adaptive algorithm. We started with a 4 second cache, and doubled it every time we had a timeout. The timeout was maxed at 120 seconds, w/ jitter of 4 seconds. The first lookup was done synchronously (w/ the 4 second timeout); the others would return a failure after those 4 seconds, although the lookup was still done (asynchronously). If we knew a negative TTL (from a domain above), then we would use it too, but still maxed at 120 seconds.

The result was that we got an answer within 4 seconds, even if the answer was negative (the system could cope with those failures). But the lookups kept continuing (if triggered, of course), just further and further apart. So we never overloaded the network, but we still got results within 4 seconds at the most. And we could restore functionality very quickly when the DNS server came back online again.

I hope this helps, even though this might be more than you would need for regular surfing (4 second timeouts are really short, and a negative result would still be unusable for a browser). But I think it can be used for PAC lookups. Some PAC files might contain more than a dozen lookups, which could result in many failures when a router goes down.
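For readers who want to see the adaptive scheme from comment 2 in concrete terms, here is a minimal Python sketch. The 4 second start, the doubling, the 120 second cap and the 4 second jitter come from the comment; the function name, the constants' names and the way jitter is applied (added on top of the capped value, uniformly between 0 and 4 seconds) are assumptions, since the comment does not spell them out. The negative-TTL input mentioned in the comment is left out.

```python
import random

INITIAL_CACHE_SECONDS = 4.0   # starting negative-cache time (from comment 2)
MAX_CACHE_SECONDS = 120.0     # upper bound (from comment 2)
JITTER_SECONDS = 4.0          # jitter size (from comment 2)


def negative_cache_seconds(consecutive_timeouts, jitter=None):
    """Return how long to cache a failure after N consecutive timeouts.

    Hypothetical helper: the name and signature are illustrative only.
    """
    if consecutive_timeouts < 1:
        raise ValueError("expected at least one timeout")
    # Cap the exponent so the power stays small; 4 * 2**5 already exceeds the cap.
    doublings = min(consecutive_timeouts - 1, 5)
    base = min(INITIAL_CACHE_SECONDS * 2 ** doublings, MAX_CACHE_SECONDS)
    if jitter is None:
        # Assumption: jitter is added uniformly in [0, JITTER_SECONDS].
        jitter = random.uniform(0.0, JITTER_SECONDS)
    return base + jitter
```

With jitter forced to zero, seven consecutive timeouts give cache times of 4, 8, 16, 32, 64, 120 and 120 seconds, which is the "further and further apart" behaviour the comment describes.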

benc

Reporter

Updated

•

22 years ago

Blocks: 200994

Darin Fisher

Comment 3

•

22 years ago

i'm not sure i agree that this is something we want... but, the DNS rewrite will make this trivial to implement.

Severity: normal → enhancement

Status: NEW → ASSIGNED

Depends on: 205726

Whiteboard: [DNS]

Target Milestone: --- → Future

Jo Hermans

Comment 4

•

22 years ago

Bug 200994 explains a good reason why we need a negative cache (remember failed lookups for a while). Not all OSes provide this service in their resolver library. Or they have a very tiny fixed cache (Windows NT has 16 entries, Mac OS 8 had only 10). Of course, the system that I described above was much more than we need. I just gave it as an example of what is used in large systems. We had to design it because the Tru64 UNIX resolver was too small and too slow for our servers (big Compaq iron). Last week I had to make some changes, because 200 parallel lookups weren't sufficient anymore. And that's continuously! For Mozilla, a 10 second cache would help page load (as in bug 200994 and bug 17433). Maybe a 1 minute cache.

Jo Hermans

Comment 5

•

21 years ago

see bug 205726 comment 69 for the necessary changes

Jo Hermans

Comment 6

•

21 years ago

If this is implemented, it should be used for all lookup failures, not just NXDOMAIN. See bug 68796 for why (especially bug 68796 comment 19).

Aleksander Adamowski

Updated

•

21 years ago

Blocks: 154816

Darin Fisher

Updated

•

19 years ago

Assignee: darin → nobody

Status: ASSIGNED → NEW

QA Contact: benc → networking

Target Milestone: Future → ---

Patrick McManus [:mcmanus]

Assignee

Comment 7

•

16 years ago

Attached patch: cache negative host lookups (obsolete)

This has special currency for mobile, where any high-latency operation like DNS threatens a really long and painful serialization.

* Cache a failed host lookup for 1 minute.
* In recognition that some failed lookups are ephemeral, upon successful reuse of a cached negative lookup, start an asynchronous refresh in the background to update the entry.

This has the net effect of the same number of network lookups being issued, but within (on avg) 90 seconds of a failed lookup the caller does not wait for verification of host-not-found.
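The behaviour described in comment 7 can be illustrated with a small Python sketch. This is not the attached patch: the class, method and exception names are hypothetical, the positive lifetime is an assumed value, and the "asynchronous refresh" is modelled as an explicit queue that the caller drains, so the example stays deterministic. Only the 1 minute negative lifetime and the refresh-on-reuse rule come from the comment.

```python
import time

NEGATIVE_TTL = 60.0   # "1 minute", from comment 7
POSITIVE_TTL = 60.0   # assumption: the thread does not give the positive lifetime


class HostNotFound(Exception):
    """Raised when a lookup fails, whether fresh or served from the cache."""


class HostCache:
    """Hypothetical cache that remembers failed lookups for NEGATIVE_TTL seconds."""

    def __init__(self, resolve, clock=time.monotonic):
        self._resolve = resolve      # callable(host) -> addresses; raises OSError on failure
        self._clock = clock
        self._entries = {}           # host -> (addresses or None, expiry time)
        self.pending_refresh = []    # hosts whose negative entry was just reused

    def lookup(self, host):
        entry = self._entries.get(host)
        if entry is not None and self._clock() < entry[1]:
            addresses = entry[0]
            if addresses is None:
                # Negative hit: fail at once, but schedule a background refresh.
                if host not in self.pending_refresh:
                    self.pending_refresh.append(host)
                raise HostNotFound(host)
            return addresses
        return self._resolve_and_store(host)

    def _resolve_and_store(self, host):
        try:
            addresses = self._resolve(host)
        except OSError:
            self._entries[host] = (None, self._clock() + NEGATIVE_TTL)
            raise HostNotFound(host) from None
        self._entries[host] = (addresses, self._clock() + POSITIVE_TTL)
        return addresses

    def run_pending_refreshes(self):
        """Stand-in for the background refresh; re-resolves queued hosts."""
        hosts, self.pending_refresh = self.pending_refresh, []
        for host in hosts:
            try:
                self._resolve_and_store(host)
            except HostNotFound:
                pass  # still failing; the negative entry was renewed
```

In this sketch the second `lookup` of a failed host returns immediately without touching the network, and the queued refresh is what eventually replaces the negative entry with a positive one.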

Attachment #342910 - Flags: review?(bzbarsky)

Patrick McManus [:mcmanus]

Assignee

Updated

•

16 years ago

Blocks: 437953

Boris Zbarsky [:bzbarsky]

Comment 8

•

16 years ago

How does this behave if I turn on my computer and my browser starts before the network has come up (and hence gets all sorts of error pages on my start tabs)? Does reloading right after network starts work?

Boris Zbarsky [:bzbarsky]

Comment 9

•

16 years ago

Oh, and may I suggest adding:

  [diff]
  git = 1
  showfunc = true
  unified = 8

to your .hgrc to get prettier mq diffs?

Boris Zbarsky [:bzbarsky]

Comment 10

•

16 years ago

Comment on attachment 342910: cache negative host lookups

>+    PRBool negative; /* true if this record is a cache of a failed lookup
>+                        negative cache entries are valid just like any other
>+                        (though never for more than 60 seconds), but a use
>+                        of that negative entry forces an asynchronous refresh */

True if this ... lookup. Negative ... refresh. (punctuation and capitalization).

With that and an answer to my question about reloading, looks good.

Attachment #342910 - Flags: superreview+

Attachment #342910 - Flags: review?(bzbarsky)

Attachment #342910 - Flags: review+

Patrick McManus [:mcmanus]

Assignee

Comment 11

•

16 years ago

(In reply to comment #8)
> How does this behave if I turn on my computer and my browser starts before the
> network has come up (and hence gets all sorts of error pages on my start tabs)?
> Does reloading right after network starts work?

You would need to reload twice, but you would not have to wait for the timeout. The first reload uses the negative cache hit but also kicks off a revalidation of the cached entry in the background. The second reload would pick up the positive entry generated by that. This is probably not perceptible to someone just hitting reload repetitively while waiting for the network to come around.

Boris Zbarsky [:bzbarsky]

Comment 12

•

16 years ago

I'd think the person would wait for their connection thingie in the titlebar or GNOME whatever-it-is or Windows taskbar to show connected, then do "reload all tabs". I agree that when it doesn't work they'd just do it again, but it seems better to make this common use case Just Work without extra clicks. Do we get some sort of offline/online notifications in cases like that? If so, can we hook into those to invalidate the negative cache?

Patrick McManus [:mcmanus]

Assignee

Comment 13

•

16 years ago

The definition of "network up" tends to lead to insanity in my past experience. You run into distinctions between link and IP addresses and even firewall layers. All of these things can toggle back and forth a few times on startup. It's really miserable to try and drive a state machine off of. And even then sometimes other parts of the network aren't really ready to accept you: spanning tree algorithms on the switch are classic for swallowing packets for a settling period after you join the tree, and this is transparent to the host. And of course any technique is completely unportable between OSes, so I would try and avoid going down that road.

Here's an alternate suggestion: nsIDNSService.idl already has a flag to bypass the cache. That would bypass the negative cache entry too if it were set. I can't find any user of the flag in mozilla-central, but if we managed to set it on manual reloads, I think that would do the trick. I'm not sure where to start with that, but I'll research it. Pointers are welcomed.
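To make the bypass idea from comment 13 concrete, here is a rough Python sketch of a lookup that skips any cached entry, negative or positive, when a bypass option is set. It does not use the real nsIDNSService interface: the function, the `bypass_cache` parameter and the dictionary cache are all hypothetical, and expiry is omitted for brevity.

```python
def lookup(host, cache, resolve, bypass_cache=False):
    """Resolve host, consulting cache unless bypass_cache is set.

    cache maps host -> list of addresses, or None for a cached failure.
    resolve is a callable(host) that raises OSError on failure.
    All names here are illustrative, not part of any Mozilla API.
    """
    if not bypass_cache and host in cache:
        addresses = cache[host]
        if addresses is None:
            # Negative entry: fail without going to the network.
            raise OSError("cached failure for " + host)
        return addresses
    try:
        addresses = resolve(host)
    except OSError:
        cache[host] = None   # remember the failure
        raise
    cache[host] = addresses  # a fresh answer replaces any stale entry
    return addresses
```

A manual reload would correspond to calling `lookup(..., bypass_cache=True)`, which replaces a stale negative entry in one step instead of needing two reloads.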

Boris Zbarsky [:bzbarsky]

Comment 14

•

16 years ago

> It's really miserable to try and drive a state machine off of.

Sure. I'm not saying we need to do something complicated here; just the cheap thing if it works. We might not even need anything here, since now that I look at nsIOService::SetOffline it looks like the DNS service isn't even running when we're offline. Does shutting down the DNS service clear out the cache we're looking at here? I assume it does...

Doing the DNS cache bypass thing might be a good idea too, but it can be a followup bug as far as I'm concerned.

Patrick McManus [:mcmanus]

Assignee

Comment 15

•

16 years ago

(In reply to comment #14)
> when we're offline. Does shutting down the DNS service clear out the cache
> we're looking at here? I assume it does...

Yes, doing that removes all the cache entries. So I'll attach a patch with the comments fixed up to match your feedback, and I'll make a follow-on bug regarding reload.

Patrick McManus [:mcmanus]

Assignee

Comment 16

•

16 years ago

Attached patch: negative dns cache - v2

updated from v1 based on review comments

Attachment #342910 - Attachment is obsolete: true

Patrick McManus [:mcmanus]

Assignee

Comment 17

•

16 years ago

bug 459724 is the follow on bug regarding reload

Boris Zbarsky [:bzbarsky]

Comment 18

•

16 years ago

Pushed changeset 1e71f0540534. Patrick, thanks for doing this! Not sure whether there's a good way to write a test for this...

Assignee: nobody → mcmanus

Status: NEW → RESOLVED

Closed: 16 years ago

Resolution: --- → FIXED
