Our DNS usage isn't very sticky. Assuming constant use of a hostname, we refresh our DNS cache entry every 30 seconds (though we don't block on the refresh). If the result changes, we follow that change almost immediately. This is largely based on the idea that the DNS database has a low rate of change and that we really want to pick up those changes quickly because they are important. But that's not really true anymore. Query twitter.com or www.facebook.com from 18.104.22.168 - you'll see that the answers change all the time (load balancing), but these results don't reflect actual IP changes at their respective sites.

The current algorithm has a few downsides:

* Most critically, it invalidates TLS session resumption, adding a round trip to TLS handshakes, because the resumption cache is IP based. I've seen this frequently with facebook and twitter, which have short TTLs and balanced pools of IPs. Obviously we'd like low latency handshakes with sites designed that way. Brian has suggested that the TLS layer could do session resumption keyed off of hostname instead of IP, but my research indicates this is often a limitation on the server side too (the caches really are local to a device), so the answer is to stop meandering around when everything is working OK.

* It screws up the spdy host coalescing, which includes the rule that different names that resolve to the same address are eligible for coalescing (assuming they meet a stack of other requirements too).

* Wandering between hosts makes it harder for back end systems that have to migrate server side caches in the cloud. That slows stuff down for our user and hurts reliability, completely independent of SSL.

The upside of this behavior is that when a site is down or being migrated, we pick up that information very quickly. While providing a mechanism to make this happen is important, it's clear that it is a much less frequent event than those associated with the downsides.
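To make the first downside concrete, here is a minimal sketch of a resumption cache keyed by IP. It is a toy model, not the NSS or necko implementation: the function name, the address pool (documentation addresses from 192.0.2.0/24), and the assumption that cached sessions never expire are all hypothetical.

```python
from itertools import cycle, islice

# Hypothetical pool of load-balanced addresses for one hostname.
POOL = ["192.0.2.1", "192.0.2.2", "192.0.2.3", "192.0.2.4"]


def count_full_handshakes(resolved_ips):
    """Count full (non-resumed) handshakes for a sequence of connections.

    Assumes the resumption cache is keyed by IP and never expires, so a
    connection can resume only if we already connected to that same IP.
    """
    ips_with_session = set()
    full = 0
    for ip in resolved_ips:
        if ip not in ips_with_session:
            full += 1                  # no session for this IP: full handshake
            ips_with_session.add(ip)
    return full


# Eight connections that stick with the first address we were given.
sticky = [POOL[0]] * 8
# Eight connections that follow a DNS answer rotating through the pool.
rotating = list(islice(cycle(POOL), 8))

print("sticky:  ", count_full_handshakes(sticky))
print("rotating:", count_full_handshakes(rotating))
```

In this model the sticky client pays for one full handshake and the rotating client pays for one per address in the pool. Real session caches also expire entries, which would make wandering between addresses cost more than this sketch shows.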
We should be more optimistic, while planning for rain. Here are the changes:

* During the grace period (which is effectively days), don't actively renew DNS cache entries that returned successful results and haven't had any connection errors reported against them. Any screen with a "try again" button is going to have triggered one of those conditions, so the DNS entry will be refreshed. Same thing with subresource failures.

* Change normal refresh semantics (F5, ^R) to no longer bypass and refresh the DNS cache. A force reload (Shift-Ctrl-R, Ctrl-F5) will still do so. Normal refresh is a very common operation meant to get data updates, not to reset the infrastructure, and this change makes it consistent with other parts of the system. For example, normal refresh never requires that a new connection be made (if it did, that would require a new DNS lookup, but it may just reuse an old connection with old DNS info), and normal refresh does not disable use of the cached data; it just requires that the data be confirmed with the server using a conditional request. Forced reloads require new connections and totally fresh downloads - this change makes DNS match those semantics.

Our current behavior means people who are sitting there polling for facebook wall updates are also trashing perfectly fine DNS records and being forced into cold session restarts.

I believe that including the quick-refresh provisions for negative DNS entries and for entries that have experienced connection errors, along with the force reload semantics, will preserve the fail-quickly behavior while generally giving us better stickiness. I hope to validate some of this with TLS telemetry in bug 807435, even though that bug does not explicitly include resumption information (which, afaict, would require changing NSS to expose it).
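The proposed rules can be sketched as a single decision function. This is only an illustration of the policy described above, not the patch itself: `LoadType`, `DnsCacheEntry`, its fields, and `should_refresh` are hypothetical names, and treating an entry outside its grace period as needing a lookup is my assumption.

```python
from dataclasses import dataclass
from enum import Enum


class LoadType(Enum):
    NORMAL = "normal"              # ordinary navigation or subresource load
    RELOAD = "reload"              # F5, Ctrl+R
    FORCE_RELOAD = "force_reload"  # Shift+Ctrl+R, Ctrl+F5


@dataclass
class DnsCacheEntry:
    negative: bool = False          # the lookup failed
    connection_error: bool = False  # a connection error was reported against it
    in_grace_period: bool = True    # entry is still within its grace period


def should_refresh(entry: DnsCacheEntry, load_type: LoadType) -> bool:
    """Return True if the DNS cache entry should be looked up again."""
    if load_type is LoadType.FORCE_RELOAD:
        return True    # forced reloads still bypass the DNS cache
    if entry.negative or entry.connection_error:
        return True    # keep the fail-quickly behavior for bad entries
    # Healthy entries are left alone for the whole grace period, and a
    # normal reload no longer counts as a reason to refresh.
    return not entry.in_grace_period


healthy = DnsCacheEntry()
print(should_refresh(healthy, LoadType.RELOAD))
print(should_refresh(healthy, LoadType.FORCE_RELOAD))
print(should_refresh(DnsCacheEntry(connection_error=True), LoadType.NORMAL))
```

The point of putting the force-reload check first is that it is the one user action that should always reset the infrastructure state, regardless of how healthy the entry looks.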
I should add that this patch notably improves SSL resumption rates, at least anecdotally from watching the logging.
Attachment #677140 - Flags: review?(joshmoz)
(In reply to Patrick McManus [:mcmanus] from comment #1)
> I should add that this patch notably improves ssl resumption rates in an
> anecdotal way of watching the logging.

If H = the number of times HandshakeCallback is called, and A = the number of times AuthCertificateHook is called, then the number of full handshakes is A and the number of resumed handshakes is H - A, because AuthCertificateHook is only called for full handshakes. So, we can (and should) add telemetry for this.
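As a worked example of that arithmetic, here is a small sketch. The function name and the sample counts are made up for illustration; only the relationship between H and A comes from the comment above.

```python
def handshake_counts(handshake_callbacks: int, auth_certificate_hooks: int):
    """Derive full and resumed handshake counts from two call counters.

    handshake_callbacks    -- H, times HandshakeCallback was called
    auth_certificate_hooks -- A, times AuthCertificateHook was called
    Returns (full, resumed, resumption_rate).
    """
    if auth_certificate_hooks > handshake_callbacks:
        raise ValueError("A cannot exceed H if A only fires on full handshakes")
    full = auth_certificate_hooks
    resumed = handshake_callbacks - auth_certificate_hooks
    rate = resumed / handshake_callbacks if handshake_callbacks else 0.0
    return full, resumed, rate


# Hypothetical counters: 10 completed handshakes, 4 certificate checks.
full, resumed, rate = handshake_counts(10, 4)
print(full, resumed, rate)
```

With H = 10 and A = 4 this gives 4 full handshakes, 6 resumed handshakes, and a resumption rate of 0.6.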
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla20