Open Bug 731735 Opened 12 years ago Updated 2 years ago

Improve DNS requests scheduling and handling parallelism

Categories: (Core :: Networking: DNS, defect, P3)

People: (Reporter: mayhemer, Unassigned)

Details: (Keywords: perf, Whiteboard: [necko-backlog])

Reality:
- in the profiler from bug 729182 I can see we usually have at most 4 resolver threads
- one of them mostly sleeps (not sure whether it's dedicated to priority requests only) while the others are occupied with requests
- it often happens that all 3 working threads are blocked, with some 5-10 requests pending, because some DNS packets got lost (we time out) or simply because the server responds slowly
- sometimes I have also seen a real example where some 15 requests scheduled at the same time were all handled by just a single thread, actually in serial, and the longest time to handle a request was an incredible 2506 ms; it happened after a page had loaded, so it might only be some prefetching?
- on my Win7 machine I also see requests time out after 2 s while the whole page load just waits for this single request to finish

Suggestion:
- when GetAddrInfo doesn't return within some 45 to 90 ms (based on DNS look-up time telemetry), reschedule the request to a new thread and reissue it with another call to GetAddrInfo; this would be similar to what we do for TCP SYN loss (needs a watchdog thread; see the sketch after this list)
- that time threshold may be adjusted at runtime
- increasing the number of threads (bug 580037) to, say, 12 is logical
- as Patrick suggests in bug 580037, we may issue look-ups with small delays once we are over some threshold of outstanding requests (a candidate for an experiment)
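
To make the watchdog idea concrete, here is a minimal sketch in portable C++ (std::async/std::future rather than the resolver's real thread pool); ResolveBlocking, ResolveWithRetry, and the 60 ms constant are hypothetical stand-ins, not necko code:

#include <chrono>
#include <future>
#include <string>

struct LookupResult {
    bool ok = false;
    std::string host;
};

// Stand-in for the blocking platform call (PR_GetAddrInfoByName /
// getaddrinfo()); the real one can hang for seconds on packet loss.
static LookupResult ResolveBlocking(const std::string& host) {
    return LookupResult{true, host};
}

// If the first attempt hasn't returned within the threshold, reissue the
// lookup on a second worker - analogous to retransmitting after TCP SYN
// loss. The 60 ms value sits in the 45-90 ms window suggested above and
// would really be tuned from telemetry.
LookupResult ResolveWithRetry(const std::string& host) {
    using namespace std::chrono;
    const auto kRetryThreshold = milliseconds(60);

    auto first = std::async(std::launch::async, ResolveBlocking, host);
    if (first.wait_for(kRetryThreshold) == std::future_status::ready) {
        return first.get(); // fast path: the resolver answered in time
    }
    // Slow path: a real watchdog would race both futures and take the
    // earlier answer; for brevity this sketch just waits on the retry.
    auto second = std::async(std::launch::async, ResolveBlocking, host);
    return second.get();
}
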
Speculative prefetching and the way the pools are separated really complicate this analysis, but...

(In reply to Honza Bambas (:mayhemer) from comment #0)
> - sometimes I have also seen a real example where some 15 requests
> scheduled at the same time were all handled by just a single thread,
> actually in serial, and the longest time to handle a request was an
> incredible 2506 ms; it happened after a page had loaded, so it might only
> be some prefetching?

I'd dig into *that* as an immediate action item, especially if it's reproducible. I can't think of any scenario where that should be completely serialized. (Which of course just means I can't think of one, not that there isn't one.)

> - when GetAddrInfo doesn't return within some 45 to 90 ms (based on DNS
> look-up time telemetry), reschedule the request to a new thread and
> reissue it with another call to GetAddrInfo; this would be similar to
> what we do for TCP SYN loss (needs a watchdog thread)

In general I think that's good (though the timers strike me as kinda low). Just beware that this happens all the time in DNS not because of packet loss but because of name servers that aren't in operation... often pointed at by some tracking image or async JS or something. So the scales are somewhat different.
The 'serialization case' probably is some kind of prefetching I'm not aware of. In the profiler it starts exactly 5000 ms after the start of the app. And the actual serialization happened, I bet, because only two threads had been created by then, and prefetch requests are low priority, which may block parallelism. Other runs that open 4 threads leave just a single thread idle while the other 3 take care of this prefetch.
I agree with Patrick that the serialization conflicts with what the code suggests should happen. (I also agree that there may be a scenario I can't think of where one thread would be left to deal with med/low priority requests on its own - pondering those possibilities.) Here is what is supposed to happen (based on the code).

nsHostResolver::IssueLookup is the context where threads are created.  This is the logic:

if (mNumIdleThreads) {
    // wake up idle thread to process this lookup
    mIdleThreadCV.Notify();
}
else if ((mThreadCount < HighThreadThreshold) ||   // 3 threads max for ANY prio
         (IsHighPriority(rec->flags) && mThreadCount < MAX_RESOLVER_THREADS)) {
    // dispatch new worker thread                  // 8 threads total for HIGH
    ...
}

So, (per the code) if there are a lot of high priority requests at the same time as med/low priority requests, we could have 8 threads spawned. All 8 threads would process high prio requests until there were no more to dequeue.

5 of the threads will exit when all high priority requests are dequeued.  The remaining 3 threads will service the last 3 high prio requests and then start dequeuing from the med and low queues.  These threads should stay active until all med/low requests are dequeued and processed.  This is based on ThreadFunc exiting when GetHostToLookup returns false, and it should only do that when all request queues, high/med/low, are empty.

(Patrick, or someone, please check my understanding here).
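
To make the dispatch rule easier to check against the counters, here is a compile-able restatement of the branch quoted above - a sketch with simplified names, using the constants 3 and 8 from the snippet:

#include <cstddef>

static const std::size_t HighThreadThreshold = 3;  // threads allowed for ANY priority
static const std::size_t MAX_RESOLVER_THREADS = 8; // total cap; the extra 5 are reserved for HIGH

enum Priority { HIGH, MEDIUM, LOW };

// Mirrors the branch in IssueLookup: an idle thread is woken in
// preference to spawning, and only high priority lookups may push the
// pool past HighThreadThreshold.
bool MaySpawnThread(std::size_t threadCount, std::size_t idleThreads, Priority prio) {
    if (idleThreads > 0) {
        return false; // mIdleThreadCV.Notify() wakes an idle thread instead
    }
    return threadCount < HighThreadThreshold ||
           (prio == HIGH && threadCount < MAX_RESOLVER_THREADS);
}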

I found a potential weakness in that logic, though, but I still think that 3 threads should remain active if there are more than 3 requests in the queues.  In ResolveHost, if a previous request was made for a hostname at low priority, and that hostname is now being requested at medium priority, then the request is bumped up to the med queue, but no new thread is created - only the currently idle threads are notified.  So if the low prio request was issued when no more threads were allowed for non-high requests, then when it gets moved to the medium queue, no new thread will be created.

  } else if (IsMediumPriority(flags) && IsLowPriority(he->rec->flags)) {
      // Move from low to med.
      MoveQueue(he->rec, mMediumQ);
      he->rec->flags = flags;
      mIdleThreadCV.Notify();
  }
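
One possible fix, sketched under the assumption that the thread-creation branch from IssueLookup can be factored into a helper - ConditionallyCreateThread is hypothetical here, not an existing function in this code:

  } else if (IsMediumPriority(flags) && IsLowPriority(he->rec->flags)) {
      // Move from low to med.
      MoveQueue(he->rec, mMediumQ);
      he->rec->flags = flags;
      if (mNumIdleThreads) {
          // An idle worker exists; waking it is enough.
          mIdleThreadCV.Notify();
      } else {
          // Hypothetical helper wrapping the same threshold checks
          // IssueLookup performs before dispatching a new worker.
          ConditionallyCreateThread(he->rec);
      }
  }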

So, I'm thinking that there's either a problem (like with one of the thread counters maybe), or else the requests are being serviced as fast as they arrive in the queue - unlikely, right?

Or, is it possible that one of the threads moves to idle, releases the lock, and then a series of ResolveHost calls happens? Does mIdleThreadCV.Notify() guarantee that the next entity to get the mutex after a call to ResolveHost is the notified thread?  If not, then maybe that's something to check for?
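
For what it's worth, condition variables generally make no such guarantee: between the Notify() and the woken thread reacquiring the mutex, any other contender (e.g. a new ResolveHost caller) can win the lock first. The usual defense is for the woken thread to retest its predicate in a loop. A generic sketch with std::condition_variable (not the NSPR CV the resolver actually uses):

#include <condition_variable>
#include <mutex>
#include <queue>
#include <string>

std::mutex gLock;
std::condition_variable gIdleCV;
std::queue<std::string> gPendingHosts;
bool gShuttingDown = false;

// Analog of GetHostToLookup: returns false only at shutdown. The while
// loop retests the predicate because another thread may have emptied the
// queue between the Notify() and this thread winning the mutex.
bool GetHostToLookupSketch(std::string& host) {
    std::unique_lock<std::mutex> lock(gLock);
    while (gPendingHosts.empty() && !gShuttingDown) {
        gIdleCV.wait(lock);
    }
    if (gShuttingDown) {
        return false;
    }
    host = gPendingHosts.front();
    gPendingHosts.pop();
    return true;
}
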
Severity: enhancement → normal
Keywords: perf
See Also: → 580037
Whiteboard: [necko-backlog]
Bulk change to priority: https://bugzilla.mozilla.org/show_bug.cgi?id=1399258
Priority: -- → P1
Bulk change to priority: https://bugzilla.mozilla.org/show_bug.cgi?id=1399258
Priority: P1 → P3
Severity: normal → S3