Closed Bug 2009372 Opened 4 months ago Closed 2 months ago

Fenix: DoH performance shows high latency when resolving multiple hostnames in parallel (bis)

Categories

(Core :: Networking: DNS, defect, P2)


Tracking


RESOLVED FIXED
149 Branch
Tracking Status
firefox149 --- fixed

People

(Reporter: valentin, Assigned: valentin)

References

(Blocks 3 open bugs)

Details

(Whiteboard: [necko-triaged][necko-priority-queue])

Attachments

(10 files, 1 obsolete file)


(In reply to Andrew Creskey [:acreskey] from bug 1994314 comment #0)

Very similar to bug 1976964, I can see long DNS resolution times when using DoH in Fenix on sites that need to resolve numerous domains simultaneously.

Steps to reproduce:

Load https://www.usatoday.com

I'm using
Wifi
DoH Mode 2

Samsung A54,

Dns53 lookup times: starts off at 33ms, grows to ~40-60ms
https://share.firefox.dev/3JeisEJ

Cloudflare TRR: starts off at 37ms, but then most requests quickly end up resolving in ~100-200ms
https://share.firefox.dev/4hc6LuW

CIRA TRR: starts off at 35ms, but then most requests quickly end up resolving in ~150-300ms
https://share.firefox.dev/4nOxxvE

Expectation:
When making multiple requests simultaneously, the latency should not increase to such a degree

This test creates a benchmarking request and reports benchReq.domainLookupEndTime - benchReq.domainLookupStartTime
We run the test multiple times, and make sure the request is performed while lots of other requests are in progress.
We would like to see that the lookup time for the benchmarked request does not (significantly) increase as more requests are added.

See Also: → 2010767
See Also: → 2010819
See Also: → 2011397

Status update:

My investigation into this issue has uncovered a number of other small perf wins, but I haven't yet managed to fix the DoH performance issue.
Recent work seems to indicate that the problem is events dispatched to the socket thread. Regular connections and TRR connections all post events to be run on the socket thread. Because there's no prioritization of the runnables, the TRR runnables end up being blocked by network activity from other channels. This seems to mostly be a problem when the socket thread is close to 100% usage.

A few other promising leads:

  • The TRR thread we use for AsyncOpen and for parsing DoH packets is mostly idle, but the fact that we need to dispatch runnables between the socket and TRR threads multiple times suggests a potential bottleneck. I moved the TRR service to the socket thread and noticed no degradation in performance, but also no improvement, probably because runnables are still dispatched to the end of the queue even when the call is already on the socket thread. Executing the runnable immediately when it's created on the socket thread is likely to improve things.
  • DNSListenerProxy::OnLookupComplete has a similar issue: it always dispatches, even when already on the target thread. However, when I tried executing this one synchronously, I encountered a socket thread hang. Most likely nsHostResolver calls OnResolveHostComplete while holding a lock.

A few other things I tried, which could be potential improvements, but did not immediately fix the DoH performance issue:

  • Signal mPollableEvent when nsSocketTransportService::Dispatch is called.
  • Avoid expensive calls to IsAlive in nsHttpConnection::CanReuse for HTTP/2 connections. The profile shows a lot of time spent in the PR_Recv call in nsSocketTransport::IsAlive.
  • Reading from TRR sockets first, then processing the event loop, then the rest of the sockets.
  • Setting SSL_NO_LOCKS for necko sockets.

Going forward I will focus on the most likely fix for DoH performance (prioritizing the DoH runnables).

Hi folks,

Exciting progress news. I spent a bunch of time last week adding an extra priority event queue to the socket transport service, to confirm whether the previous theory was correct. I made sure that socket thread events triggered by and for TRR channels get increased priority, and did the same for the DNSListenerProxy runnable. After running with just these few changes I was a bit disappointed in the results.

However, I just checked what would happen if, in nsSocketTransportService::DoPollIteration, we didn't read from non-TRR sockets while we had a TRR socket or a high-priority runnable to handle, and that actually made a 10X improvement in DNS timings. The overall duration of each request didn't seem to go down, but parallelism seems to have increased.

This result is very promising and seems to indicate there are easy performance wins to be had by doing less when high-priority events are pending.
There is a risk of TRR connections starving others - but that's something we can test. I'll start posting patches for review in the next few days.

This is an early profile showing that DNS timings don't balloon to hundreds of ms when there are many requests in the queue: https://share.firefox.dev/45wQcoQ

This patch adds a pref (for testing) that determines whether to use a dedicated
TRR thread to parse DoH responses, or whether to do it on the socket thread to
avoid thread hops.

With the full patch stack in this bug, using the TRR thread seems to double the
DNS resolution time from ~40ms to ~80 ms on my machine.

If TRRService::DispatchTRRRequest is called on the target TRR thread
then we can immediately run it instead of dispatching to the end of the queue.

The PrioritizableRunnable asserts that it runs on the main thread.
In order to prioritize events on the socket thread we extend
PrioritizableRunnable and override Run to avoid the assertion.

This change makes the OnInputStreamReady events delivered for the
TRRServiceChannel transaction implement nsIRunnablePriority
so they can be prioritized.

This queue of events is processed before the regular events that get dispatched
to the Socket thread event loop.

Very nice work.
I'll run this stack (and a variant with network.trr.parse_on_socket_thread:false) against our CI Fenix DoH tests and on a local device.

Attachment #9540910 - Attachment description: Bug 2009372 - Add AnyThreadPrioritizableRunnable r=#necko → Bug 2009372 - Allow PrioritizableRunnable::Run to run on socket thread r=#necko

Jeff had asked for simpleperf profiles of slow resolution through DoH.

These are from Fenix Nightly, Feb 2. DoH mode 2, through Cloudflare, in Ottawa Canada.
Without Gecko Profiler markers it's harder to identify the slow resolutions via TRR, but these are profiles of scenarios that have consistently led to problems.

Test page that loads 100 fav-icons, thus 100 domains resolved:
https://share.firefox.dev/3Zgl4GK

USAToday navigation:
https://share.firefox.dev/3NRjWqB

NYTimes.com navigation and scrolling:
https://share.firefox.dev/4tdNte8

Flags: needinfo?(jmuizelaar)

(In reply to Andrew Creskey [:acreskey] from comment #14)

Jeff had asked for simpleperf profiles of slow resolution through DoH.

These are from Fenix Nightly, Feb 2. DoH mode 2, through Cloudflare, in Ottawa Canada.
Without Gecko Profiler markers it's harder to identify the slow resolutions via TRR, but these are profiles of scenarios that have consistently led to problems.

Test page that loads 100 fav-icons, thus 100 domains resolved:
https://share.firefox.dev/3Zgl4GK

USAToday navigation:
https://share.firefox.dev/3NRjWqB

In both of these profiles more than 50% of the time is being spent in NSS and a lot of the time spent in NSS doesn't seem to be running crypto algorithm code. I think it's worth trying to improve that.

Flags: needinfo?(jmuizelaar)

This initial perf run looks promising. DNS timings are way down, while pageload time is roughly the same
https://perf.compare/compare-lando-results?baseLando=177149&newLando=177150&baseRepo=try&newRepo=try&framework=13&search=trr

I've taken an initial look at this patch, focusing just on DNS resolution (not the impact on other network timings or visual metrics).

It's a bit tricky to compare DoH against the OS resolver in CI because the Android OS will cache the hostnames after the first iteration, thus giving the OS resolver an "unfair advantage."
So I've run it locally, flushing the OS cache between iterations.

The results are reproducible and actually match what Valentin found in his perf run, comment 16.

These are the variants that I compared:

  • Native has DoH disabled
  • Baseline TRR is a build of today's Fenix Nightly, using either Cloudflare (CF) or CIRA
  • TRR Prio is Valentin's TRR prioritization patch stack

I used the trr-multi test, which loads 50 favicons. It's a stress test for DoH since there are so many simultaneous requests (although some top sites request a similar number).

Findings:

  1. TRR Prio reduces TRR lookup times by 91-98ms (53-61%)
  2. With TRR Prio, TRR overhead vs Native is reduced to 29-40ms (62-89%)
  3. Without TRR Prio, TRR overhead vs Native is 122-146ms (244-324%)
  Comparison Table (Mean / Median)

                      Mean      Median
  --------------------------------------
  Native               50ms       45ms
  Baseline TRR CF     184ms      179ms
  Baseline TRR CIRA   172ms      191ms
  TRR Prio CF          86ms       85ms
  TRR Prio CIRA        81ms       74ms

  TRR Prio Improvement vs Baseline TRR

  Provider      Mean Reduction     Median Reduction
  -------------------------------------------------
  Cloudflare    -98ms (-53%)       -94ms (-53%)
  CIRA          -91ms (-53%)      -117ms (-61%)

  TRR Overhead vs Native (with TRR Prio)

  Provider      Mean Overhead      Median Overhead
  -------------------------------------------------
  Cloudflare    +36ms (+72%)       +40ms (+89%)
  CIRA          +31ms (+62%)       +29ms (+64%)

  TRR Overhead vs Native (Baseline, without TRR Prio)

  Provider      Mean Overhead      Median Overhead
  -------------------------------------------------
  Cloudflare   +134ms (+268%)     +134ms (+298%)
  CIRA         +122ms (+244%)     +146ms (+324%)

Some profiles from the run:

TRR Baseline: https://share.firefox.dev/4a95Bx2
TRR Prio: https://share.firefox.dev/4tppSHF
Native resolver (OS Dns cache flushed): https://share.firefox.dev/4tiKrVW

edit: taken on Samsung A54, PGO builds, wifi

Simpleperf profile of trr-multi with the TRR prioritization patch stack applied:
https://share.firefox.dev/3ZNBOoP

Native 50ms 45ms

Note that my patch stack also includes ⚙ D280956 Bug 2009372 - Increase priority for DNSListenerProxy runnable r=necko, which means even native DNS resolutions would benefit from the current changes.
I think if we want to compare to the current state of native resolutions (Baseline Native) we need to set network.dns.high_priority_dispatch to false. 🙂

(In reply to Valentin Gosu [:valentin] (he/him) from comment #20)

Native 50ms 45ms

Note that my patch stack also includes ⚙ D280956 Bug 2009372 - Increase priority for DNSListenerProxy runnable r=necko, which means even native DNS resolutions would benefit from the current changes.
I think if we want to compare to the current state of native resolutions (Baseline Native) we need to set network.dns.high_priority_dispatch to false. 🙂

Ah, yes.
Since I used separate binaries, I do have those numbers.
baseline_1 and baseline_2 are fenix nightly from m-c, without network.dns.high_priority_dispatch.
trr_prio_1, trr_prio_2 have network.dns.high_priority_dispatch:true as they are from the patch stack.
The dns timings, from this limited test, seem roughly comparable.

  Run              Count       Mean     Median
  --------------------------------------------
  baseline_1          97       47ms       45ms
  baseline_2          95       53ms       44ms
  trr_prio_1          97       67ms       51ms
  trr_prio_2          95       46ms       44ms

I should note that the trr-multi test was designed as a distilled example of a scenario in which DoH showed higher latency than the OS resolver (making a large number of DNS requests while simultaneously processing numerous other HTTP requests).

We don't necessarily need parity with the OS resolver in this test to get comparable overall performance, since there are other scenarios in which DoH outperforms it (e.g. with DoH we see fewer sporadic multi-second requests, and the DNS cache is not cleared when the network changes).

See Also: → 1997973

Bug 1997973 will help here, as the socket thread can be busy in FindCoalescableConnection in 7-8% of samples.

(In reply to Jeff Muizelaar [:jrmuizel] from comment #15)

In both of these profiles more than 50% of the time is being spent in NSS and a lot of the time spent in NSS doesn't seem to be running crypto algorithm code. I think it's worth trying to improve that.

I'll bring this up with the NSS and necko folks - maybe a meta bug to capture what portions of TLS we could make async via other threads, etc.

Attachment #9540907 - Attachment is obsolete: true
Depends on: 2016567
Pushed by valentin.gosu@gmail.com:
https://github.com/mozilla-firefox/firefox/commit/d1a957952613
https://hg.mozilla.org/integration/autoland/rev/e79fb83a85eb
Add benchmarking xpcshell test for DoH timings while other networking is in progress r=acreskey,necko-reviewers,kershaw
https://github.com/mozilla-firefox/firefox/commit/843517a0ab4e
https://hg.mozilla.org/integration/autoland/rev/c931cc0478c8
Use Socket thread to parse TRR response instead of TRR thread to avoid thread hops r=necko-reviewers,kershaw
https://github.com/mozilla-firefox/firefox/commit/0fd8e2d182e1
https://hg.mozilla.org/integration/autoland/rev/48c2836aa7e0
Mark TRR connections and sockets r=necko-reviewers,kershaw,webrtc-reviewers,mjf
https://github.com/mozilla-firefox/firefox/commit/53664bf0f37a
https://hg.mozilla.org/integration/autoland/rev/b9a740f7a6a8
Allow PrioritizableRunnable::Run to run on socket thread r=necko-reviewers,kershaw,xpcom-reviewers,nika
https://github.com/mozilla-firefox/firefox/commit/a95437aebbe1
https://hg.mozilla.org/integration/autoland/rev/ea6983eacec7
Pipe runnable prioritization r=necko-reviewers,kershaw
https://github.com/mozilla-firefox/firefox/commit/07a14723bc09
https://hg.mozilla.org/integration/autoland/rev/aedf96510656
Increase runnable priority for TRR channels r=necko-reviewers,jesup
https://github.com/mozilla-firefox/firefox/commit/b4bdee9befe8
https://hg.mozilla.org/integration/autoland/rev/1eff69480b58
Increase priority for DNSListenerProxy runnable r=necko-reviewers,kershaw
https://github.com/mozilla-firefox/firefox/commit/535606169556
https://hg.mozilla.org/integration/autoland/rev/49cb1fd639b5
Add mPriorityEventQueue to SocketTransportService r=necko-reviewers,jesup
https://github.com/mozilla-firefox/firefox/commit/3b82fab07829
https://hg.mozilla.org/integration/autoland/rev/1ff8800632be
Add locking to mPriorityEventQueue r=necko-reviewers,jesup
https://github.com/mozilla-firefox/firefox/commit/71907b427963
https://hg.mozilla.org/integration/autoland/rev/8502d135245f
Prioritize H3 TRR runnables too r=necko-reviewers,jesup
Pushed by sstanca@mozilla.com:
https://github.com/mozilla-firefox/firefox/commit/6b07694ccf35
https://hg.mozilla.org/integration/autoland/rev/6e35b0425dfc
Revert "Bug 2009372 - Prioritize H3 TRR runnables too r=necko-reviewers,jesup" for causing leak failures.

Reverted this because it was causing leak failures.

  • Revert link
  • Push with failures
  • Failure Log
  • Failure line: TEST-UNEXPECTED-FAIL | LeakSanitizer | leak at CallbackHolder::CallbackHolder, nsPipeInputStream::AsyncWait, mozilla::RemoteLazyInputStream::AsyncWait, mozilla::nsMultiplexInputStream::AsyncWaitInternal

PS Please also check these mda failures.

Flags: needinfo?(valentin.gosu)
Depends on: 2016649
Pushed by valentin.gosu@gmail.com:
https://github.com/mozilla-firefox/firefox/commit/65c19f7231e9
https://hg.mozilla.org/integration/autoland/rev/46887b4cbf76
Add benchmarking xpcshell test for DoH timings while other networking is in progress r=acreskey,necko-reviewers,kershaw
https://github.com/mozilla-firefox/firefox/commit/cf1b9db0d77b
https://hg.mozilla.org/integration/autoland/rev/422fe0b0035b
Use Socket thread to parse TRR response instead of TRR thread to avoid thread hops r=necko-reviewers,kershaw
https://github.com/mozilla-firefox/firefox/commit/ee334bd83462
https://hg.mozilla.org/integration/autoland/rev/2802442f8a6c
Mark TRR connections and sockets r=necko-reviewers,kershaw,webrtc-reviewers,mjf
https://github.com/mozilla-firefox/firefox/commit/ce7603163784
https://hg.mozilla.org/integration/autoland/rev/d7912d95df76
Allow PrioritizableRunnable::Run to run on socket thread r=necko-reviewers,kershaw,xpcom-reviewers,nika
https://github.com/mozilla-firefox/firefox/commit/2a95d5ec8ea1
https://hg.mozilla.org/integration/autoland/rev/57af501a1b75
Pipe runnable prioritization r=necko-reviewers,kershaw
https://github.com/mozilla-firefox/firefox/commit/60b4ab73b55f
https://hg.mozilla.org/integration/autoland/rev/8d16e3e7455b
Increase runnable priority for TRR channels r=necko-reviewers,jesup
https://github.com/mozilla-firefox/firefox/commit/4502ff33d787
https://hg.mozilla.org/integration/autoland/rev/db195751f853
Increase priority for DNSListenerProxy runnable r=necko-reviewers,kershaw
https://github.com/mozilla-firefox/firefox/commit/61a43db866b0
https://hg.mozilla.org/integration/autoland/rev/135db32e8d09
Add mPriorityEventQueue to SocketTransportService r=necko-reviewers,jesup
https://github.com/mozilla-firefox/firefox/commit/fdd33c257d9c
https://hg.mozilla.org/integration/autoland/rev/d065c8f881ce
Add locking to mPriorityEventQueue r=necko-reviewers,jesup
https://github.com/mozilla-firefox/firefox/commit/7e1de86c275c
https://hg.mozilla.org/integration/autoland/rev/0dd114b6fdf4
Prioritize H3 TRR runnables too r=necko-reviewers,jesup
Flags: needinfo?(valentin.gosu)
Blocks: 2016649
No longer depends on: 2016649
No longer blocks: 2016649
Regressions: 2016649
QA Whiteboard: [qa-triage-done-c150/b149]
Duplicate of this bug: 1998646
Regressions: 2026504
