Open Bug 1463374 Opened 6 years ago Updated 2 months ago

evict expired DNS cache entries first

Categories

(Core :: Networking: DNS, enhancement, P2)

enhancement

Tracking

()

Tracking Status
firefox62 --- affected

People

(Reporter: bagder, Unassigned)

References

(Blocks 1 open bug)

Details

(Whiteboard: [necko-triaged][necko-priority-next])

Attachments

(1 file)

The DNS cache currently works like this:

Once a name has been resolved, the cache entry gets added to the end of a linked list (mEvictionQ). If the size of that list reaches the max cache size limit, the first entry is evicted.

The design of this traces back to long before there was any notion or awareness of individual TTL values per DNS cache entry and when all entries have the same life-time, evicting them in this first in, first out basis makes sense.

On Windows with the native resolver, or on any system when TRR is used, where the DNS cache gets separate individual life times for each entry, the FIFO model is no longer as good. We routinely get names resolved with TTL values ranging from single digit seconds up to several thousands.

The FIFO design now makes it likely that the code evicts entries that are still valid and leaves expired entries in, since the eviction order is no longer linear with the life-time in the cache.

I propose we adjust the eviction code to first check if there's an expired entry to expire before it evicts a non-expired one! And when doing so (iterating over the list), it can just as well evict *all* the expired ones at once to make it more effective.
Depends on: 1464999
Flags: needinfo?(n.nethercote)
sorry, mistake
Flags: needinfo?(n.nethercote)
My work on this change has also made it apparent that the TTL values we read sometimes are *ridiculously* low, even as far down as to zero at times. Since the Linux/mac approach with a fixed 60 second TTL still don't show any significant problems, I draw the conclusion that we should be able to extend the "silly" numbers slightly and get better caching. My current approach adds 20 seconds to all TTL values under 20.

I'll post my current patch for a first review and discussion very soon. It depends on bug 1464999 getting fixed, or it will crash miserably.
My new take adds all "eviction nodes" to a splay tree sorted on the expiry time (and since the SplayTree code doesn't allow duplicate identical values, it also adds a "salt" to each node to make sure two nodes are different even if they would otherwise expire at the same time).

When the eviction queue (I still refer to it as a queue in the code, just to maintain the terminology) is full when a new entry is about to be added, the code will now evict the node from the splay tree with the lowest expire time (and if identical, the one with the lowest salt). In the ideal case, that's an already expired node.

As mentioned above in comment #2, I'm also adding 20 seconds to all TTLs lower than 20 since with this system, those entries would otherwise be evicted again almost instantly at times.
Thinking about it a bit more, I think we should instead use the grace period better for the very low TTL records. Like this:

If TTL + grace < 60, then increase grace so that TTL + grace becomes 60.

It makes the entry get cached for 60 seconds, but when referenced within the grace period it will trigger a refresh of the entry and thus ideally update the record in the background if the low TTL actually implied that it would change.
Comment on attachment 8983329 [details]
bug 1463374 - store the DNS cache eviction entries in a splay tree

https://reviewboard.mozilla.org/r/247840/#review255390

Looks great. A couple minor nits on my part.
It would be great if we had a unit test for the expire+grace = 60 case.

::: netwerk/dns/nsHostResolver.h:175
(Diff revision 1)
>  
>      mozilla::net::ResolverMode mResolverMode;
>  
> +    // SplayTree comparison
> +    static int compare(const nsHostRecord & aOne, const nsHostRecord & aTwo);
> +    bool isInEvictionQ();

nit: method names should start with a capital letter - except compare I suppose.

::: netwerk/dns/nsHostResolver.h:177
(Diff revision 1)
>  
> +    // SplayTree comparison
> +    static int compare(const nsHostRecord & aOne, const nsHostRecord & aTwo);
> +    bool isInEvictionQ();
> +    void addedToEvictQ(uint32_t aSalt);
> +    void removedFromEvictQ();

nit: I don't have a big problem with this, but in my mind methods that start with a verb in the past tense would return a bool. ex: AddedSomething(), RemovedSomething(), WasSomething().
I'd prefer if we named these two methods as MarkAddedToEvictQ() and MarkRemovedFromEvictQ(). Or  similar if you can think of something better.

::: netwerk/dns/nsHostResolver.cpp
(Diff revision 1)
>      bool timedOut = false;
>      TimeDuration timeout;
>      TimeStamp epoch, now;
>  
>      MutexAutoLock lock(mLock);
> -

nit: WS only change
Attachment #8983329 - Flags: review?(valentin.gosu) → review+
Comment on attachment 8983329 [details]
bug 1463374 - store the DNS cache eviction entries in a splay tree

https://reviewboard.mozilla.org/r/247840/#review255390

Yeah, I'll have to think about how to do a test for that the best way that doesn't mean having to actually wait. The entire expire + grace logic is however used extensively by the native resolver logic on Linux and Mac so I'm not seeing any larger risk in this function.

> nit: method names should start with a capital letter - except compare I suppose.

Yes, 'compare' is set in SplayTree.h
if you're changing the length of time we consider something valid, please r? me (and explain the change) and block on that for landing. thanks. also would like to understand plan for evaluating this change before landing.
(The patch still has some issues on try so either way its not ready to land yet.)

I propose that in situations where the TTL is less than a minute, we "pad" the expire time with a grace period. Basically like this (psuedo code):

  if ( ttl < 60) {
     grace = 60 - ttl;
  }

The reasoning being that we *already* have expire 60 + grace 60 by default and always for mac and Linux and that generally works really fine so adding up to 60 seconds should not be a major concern while at the same time actually help the cache.

Of course the "60" limit there is totally arbitrary and we could make it 30 or 45 instead too. What triggered me into this path was realizing that the resolver actually quite often gets a *zero* TTL value which thus makes the entry expire immediately.

If we didn't have the "control group" that the mac and linux users are here, I wouldn't have suggested this.

I'm also fine with moving that logic out of this patch to discuss/work on it separately. It's just that this patch makes this behavior more obvious. Even previously the cache entries are expired this fast, they're just not evicted this fast.
Flags: needinfo?(mcmanus)
Note: the padded gracing has landed independently of this work in bug 1493619.
Flags: needinfo?(mcmanus)
Assignee: daniel → nobody

Mass-removing myself from cc; search for 12b9dfe4-ece3-40dc-8d23-60e179f64ac1 or any reasonable part thereof, to mass-delete these notifications (and sorry!)

Priority: P2 → P3
Severity: normal → S3

This seems like something we'd still want to do.

Priority: P3 → P2
Whiteboard: [necko-triaged] → [necko-triaged][necko-priority-next]
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: