Apparently Chromium is allowing up to 50 DNS requests outstanding: http://code.google.com/p/chromium/issues/detail?id=44489 We allow 8 (and only 3 non high-priority requests): http://mxr.mozilla.org/mozilla-central/source/netwerk/dns/nsHostResolver.h#73 It looks like 50 may cause some resolvers to barf (see chromium bug), but it seems likely that we could be doing more than 8. We might also be able to be smart about detecting if we're overloading the DNS resolver from the error code it returns, and back off down to a number it can handle.
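To make the back-off idea concrete, here is a minimal Python sketch, not anything that exists in the tree. It assumes we can tell from the resolver's error that it was overloaded, which is exactly the part nobody has worked out. The class name, the floor of 3, and the additive-increase/halve-on-error policy are all hypothetical choices for illustration; the start of 8 and ceiling of 50 are the two limits mentioned above.

```python
class AdaptiveLimit:
    """Hypothetical cap on outstanding DNS requests that backs off on overload."""

    def __init__(self, start=8, floor=3, ceiling=50):
        self.floor = floor      # never go below this many outstanding requests
        self.ceiling = ceiling  # never probe above this many
        self.limit = start

    def on_success(self):
        # Additive increase: allow one more outstanding request after a clean answer.
        self.limit = min(self.ceiling, self.limit + 1)

    def on_overload(self):
        # Multiplicative decrease: halve the cap when an error is taken to
        # mean "resolver overloaded" (detecting that is the assumed part).
        self.limit = max(self.floor, self.limit // 2)


limit = AdaptiveLimit()
for _ in range(100):
    limit.on_success()
print(limit.limit)  # 50
limit.on_overload()
print(limit.limit)  # 25
```

Growing by one per successful answer is probably too eager for real use; a real version would more likely grow once per window of successes.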
How do we feel about just bumping the numbers up some and seeing what happens? Assuming we want to keep the same ratio, that would mean 6/16 for the non-high-priority/total outstanding DNS request limits, or more optimistically 12/32. Implementing the smarts to detect that we're overloading the resolver is probably not going to happen soon, and Chrome doesn't seem to be running into more than sporadic failures with the limit at 50.
Worth also adding an about:config setting to limit the # of outstanding DNS requests, in case users run into problems with their resolver? Of course we don't want to have this happen often, but it could be useful during beta testing to determine a limit that's unlikely to fail in the first place.
50 outstanding doesn't bother me from a DNS perspective, but as each one is implemented on top of NSPR blocking threads calling gethostbyname(), I am less sanguine about the threading implications. FWIW, the ratio isn't that important. The high/non-high (i.e. on-demand/prefetch) split just ensures that some threads are available for on-demand resolution and that prefetch hasn't totally gummed up the works. The ideal thing would be an interface that gets rid of the thread requirement; then we could have a very high limit as long as we rate limited their introduction into the network. I actually think once in a while we overflow simple UDP network buffers on some SOHO recursive resolvers right now, so we would just want to limit introducing them to the network at 8 per ms or somesuch.
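A rough Python sketch of what the on-demand/prefetch split buys us, using the 8 total / 3 non-high-priority numbers from this bug. This is a simplified model, not the nsHostResolver code; the class and method names are made up for illustration.

```python
class ResolverSlots:
    """Hypothetical admission check: prefetch may use only part of the pool."""

    def __init__(self, max_total=8, max_low=3):
        self.max_total = max_total  # all outstanding requests
        self.max_low = max_low      # non-high-priority (prefetch) requests
        self.high = 0
        self.low = 0

    def try_acquire(self, high_priority):
        # Refuse anything once the whole pool is busy.
        if self.high + self.low >= self.max_total:
            return False
        # Refuse prefetch once it has used its share, so on-demand
        # lookups always have slots left.
        if not high_priority and self.low >= self.max_low:
            return False
        if high_priority:
            self.high += 1
        else:
            self.low += 1
        return True

    def release(self, high_priority):
        if high_priority:
            self.high -= 1
        else:
            self.low -= 1


slots = ResolverSlots()
prefetch = [slots.try_acquire(False) for _ in range(5)]
on_demand = [slots.try_acquire(True) for _ in range(6)]
print(prefetch)   # [True, True, True, False, False]
print(on_demand)  # [True, True, True, True, True, False]
```

Even with prefetch asking first, five slots stay available for on-demand lookups, which is the only property the split is meant to guarantee.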
> the ideal thing would be an interface that gets rid of the thread requirement
I.e., stop using getaddrinfo and write our own DNS resolver, right? Note that Chromium is now testing code to do async DNS: http://codereview.chromium.org/9369045/
Created attachment 642272: fix v1.0. Let's get a conservative upgrade in. This patch ups the total allowed to 12. We can see how this goes and adjust it again later.
Comment on attachment 642272: fix v1.0. I don't think we want to make a change here without some data behind it. My largest concern is that DNS is a UDP-based protocol and it's easy to overflow hardware queues with those. Every hw queue depth is different of course, so I can't say how many is too many with any confidence. But if you do overflow the queue, the retry behavior is a lot more painful (it's fixed to a constant, not sized to the network) than just having waited in the first place. NAT tables are another place that has the same kind of problem. Chrome believes they have this problem, afaik, from conference conversations. It could be DNS related or it could be TCP SYN related. Lacking any data that suggests the bottleneck is a problem, I've been hesitant to make a change. For the queue drop problem, the right scheme is some kind of rate pacing algorithm (send them out every ms or so, instead of as close to in parallel as you can manage). For the NAT thing I'm not sure what to do. To proceed (with even more aggressive limits) I'd like to see a plan to track telemetry for the retry rate and for how long a query sits in the queue (excluding cache hits). We'd have to add both of these, I think; we might be able to deduce the retry rate already, not sure. That way we could at least know if turning the knob was making things better in aggregate and if the edge cases were being hurt.
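As a sketch of the rate pacing idea, assuming a 1 ms spacing (the "every ms or so" figure above) and a hypothetical Pacer class that only computes send times rather than touching the network:

```python
class Pacer:
    """Hypothetical pacer: spaces query sends at least `interval` seconds apart."""

    def __init__(self, interval=0.001):
        self.interval = interval
        self.next_free = 0.0  # earliest time the next query may go out

    def schedule(self, now):
        # A query arriving at `now` goes out immediately if the line is
        # idle, otherwise in the next free slot after earlier queries.
        send_at = max(now, self.next_free)
        self.next_free = send_at + self.interval
        return send_at


pacer = Pacer(interval=0.001)
# Four queries all arrive at t = 10.0 s; they leave 1 ms apart.
burst = [pacer.schedule(now=10.0) for _ in range(4)]
delays_ms = [round((t - 10.0) * 1000) for t in burst]
print(delays_ms)  # [0, 1, 2, 3]
```

The difference between the returned send time and `now` is also the queue wait for that query, so the same spot would be a natural place to record the time-in-queue measurement.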
I don't think 50 would be a good number, as it has already been established that it's easy to overdo it and cause massive problems. I would maybe suggest 24 on-demand DNS threads and 8 non-priority/prefetch threads so the resolver doesn't get swamped with requests, or you could introduce about:config prefs so we could find a few optimal values through trial and error.
I'm not going to get to doing the measurements required for this any time soon; hopefully someone else can pick this up. I'd prioritize bringing in custom resolver code (e.g. bug 773648) over this.
easiest to consolidate this