increase stickiness of DNS resolution

RESOLVED FIXED in mozilla20

People

(Reporter: mcmanus, Assigned: mcmanus)

Tracking

16 Branch
mozilla20
x86_64
Linux

Our DNS usage isn't very sticky. Assuming constant use of a hostname,
every 30 seconds we will refresh our DNS cache entry (though we don't
block on the refresh). If it changes, we will follow that change
almost immediately.

This is largely based on the idea that the DNS database has a low rate
of change and we really want to pick up those changes quickly because
they are important. But that's not really true anymore. Query
twitter.com or www.facebook.com from 8.8.8.8 - you'll see that they
are changing all the time (load balancing) but these results aren't
really reflective of IP changes in their respective sites.

This current algorithm has a few downsides:

* most critically, it invalidates TLS session resumption, adding a
  round trip to TLS handshakes, because the resumption cache is
  IP based. I've seen this frequently with facebook and twitter,
  which have short TTLs and balanced pools of IPs. Obviously we'd
  like low latency handshakes with sites designed that way. Brian has
  suggested that the TLS layer could do session resumption keyed off
  of hostname instead of IP, but my research indicates this is often
  a limitation on the server side too (the caches really are local to
  a device), so the answer is to stop meandering around when everything
  is working ok.

* it screws up the spdy host coalescing, which includes the rule that
  different names that resolve to the same address are eligible for
  coalescing (assuming they meet a stack of other requirements too).

* wandering between hosts makes it harder for back end systems that
  have to migrate server side caches in the cloud. That slows stuff
  down for our users and hurts reliability completely outside of ssl.
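To illustrate the first downside, here is a minimal sketch of a resumption cache keyed only by IP address. It is a toy model, not the NSS or Necko code: the function name `count_resumptions` and the address pool are made up for illustration, and the only thing modeled is that a session can be resumed when we reconnect to an IP we already hold a session for.

```python
def count_resumptions(resolved_ips):
    """Return (full, resumed) handshake counts for a sequence of connections.

    Assumption for this sketch: the session cache is keyed by IP address
    only, and a session never expires once it has been stored.
    """
    session_cache = set()  # IPs we hold a resumable session for
    full = 0
    resumed = 0
    for ip in resolved_ips:
        if ip in session_cache:
            resumed += 1          # abbreviated handshake, one round trip saved
        else:
            full += 1             # full handshake, then remember the session
            session_cache.add(ip)
    return full, resumed


# Hypothetical pool of four load-balanced addresses (documentation range).
pool = ["192.0.2.1", "192.0.2.2", "192.0.2.3", "192.0.2.4"]

# Sticky: every connection reuses the first DNS answer.
sticky = count_resumptions([pool[0]] * 4)

# Wandering: each DNS refresh hands back a different address.
wandering = count_resumptions(pool)
```

With the sticky sequence only the first handshake is full; with the wandering sequence every handshake is full, even though the hostname never changed.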

The upside of this behavior is that a site that is down or being
migrated gets that information picked up very quickly. While providing
a mechanism to make this happen is important, it's clear that it is a
much less frequent event than those associated with the downsides. We
should be more optimistic, while planning for rain.

Here are the changes:

* During the grace period (which is effectively days) don't actively
  renew dns cache entries that returned successful results and haven't
  had any connection errors reported against them. Any screen with a
  try again button is going to have triggered one of those conditions
  and the dns entry will be refreshed. Same thing with subresource
  failures.

* change normal refresh semantics (f5, ^r) to no longer bypass and
  refresh the DNS cache. A force reload (shift-ctrl-r, ctrl f5) will
  still do so. Normal refresh is a very common operation meant to get
  data updates, not reset the infrastructure, and this change makes it
  consistent with other parts of the system. For example - normal
  refresh never requires that a new connection be made (a new connection
  would require a new dns lookup, but the refresh might just reuse an old
  connection with old dns info), and normal refresh does not disable
  use of cached data; it just requires that the data be confirmed with
  the server using a conditional request. Forced reloads require new
  connections and totally fresh downloads - this change makes DNS
  match those semantics. Our current behavior means people that are
  sitting there polling for facebook wall updates are also trashing
  perfectly fine DNS records and being forced into cold session
  restarts.
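The two changes above can be sketched as a single refresh decision. This is only an illustration of the policy as described in this comment, not the attached patch: `CacheEntry`, `should_refresh`, and `GRACE_PERIOD_SECONDS` are hypothetical names, and the two-day value is an assumption standing in for "effectively days".

```python
from dataclasses import dataclass

# Assumed value; the description only says the grace period is "effectively days".
GRACE_PERIOD_SECONDS = 2 * 24 * 60 * 60


@dataclass
class CacheEntry:
    """Hypothetical DNS cache entry, reduced to the fields the policy needs."""
    negative: bool = False        # the lookup failed (no usable addresses)
    connection_errors: int = 0    # connection errors reported against the entry
    age_seconds: float = 0.0      # time since the entry was resolved


def should_refresh(entry, force_reload=False):
    """Decide whether to re-resolve a hostname instead of reusing the entry."""
    if force_reload:
        # shift-ctrl-r / ctrl-f5 still bypasses the DNS cache.
        return True
    if entry.negative or entry.connection_errors > 0:
        # Fail-quickly path: bad entries are refreshed right away.
        return True
    # Healthy entries are left alone until the grace period runs out.
    return entry.age_seconds > GRACE_PERIOD_SECONDS


# A normal refresh (f5) on a healthy ten-minute-old entry keeps the cached address.
normal_refresh = should_refresh(CacheEntry(age_seconds=600))

# The same entry after a reported connection error is re-resolved.
after_error = should_refresh(CacheEntry(age_seconds=600, connection_errors=1))
```

Note that a normal refresh does not pass `force_reload`, so it is treated like any other use of the hostname.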

I believe that including the quick-refresh provisions for negative dns
entries or entries that have experienced connection errors, along with
the force reload semantics, will preserve the fail-quickly behavior
while generally giving us better stickiness.

I hope to validate some of this with TLS telemetry in bug 807435, even though that bug does not explicitly include resumption information (which afaict would require changing nss to expose it).
Posted patch: patch 0
I should add that this patch notably improves ssl resumption rates, judging anecdotally from watching the logging.
Attachment #677140 - Flags: review?(joshmoz)
Attachment #677140 - Flags: review?(joshmoz) → review+
Depends on: 807435
(In reply to Patrick McManus [:mcmanus] from comment #1)
> I should add that this patch notably improves ssl resumption rates in an
> anecdotal way of watching the logging.

If H = the number of times HandshakeCallback is called,
and A = the number of times AuthCertificateHook is called,
then the number of full handshakes is A and the number of resumed handshakes is H - A, because AuthCertificateHook is only called for full handshakes. So, we can (and should) add telemetry for this.
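As a quick worked example of that arithmetic, here is a small sketch. The function name `handshake_counts` and the sample numbers are made up; it also assumes every handshake that reaches AuthCertificateHook goes on to complete, so that A never exceeds H.

```python
def handshake_counts(h, a):
    """Split handshakes into (full, resumed, resumption_rate).

    h: number of HandshakeCallback calls (all completed handshakes)
    a: number of AuthCertificateHook calls (full handshakes only)
    Assumes a <= h, i.e. no handshake fails after certificate auth.
    """
    full = a
    resumed = h - a
    rate = resumed / h if h else 0.0
    return full, resumed, rate


# Made-up sample: 10 completed handshakes, 4 of which authenticated a cert.
full, resumed, rate = handshake_counts(10, 4)
```

For the sample numbers this gives 4 full handshakes, 6 resumed handshakes, and a resumption rate of 0.6.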
https://hg.mozilla.org/mozilla-central/rev/72d7159b813b
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla20