Fenix: DoH performance shows high latency when resolving multiple hostnames in parallel (bis)
Categories
(Core :: Networking: DNS, defect, P2)
Tracking
| | Tracking | Status |
|---|---|---|
| firefox149 | --- | fixed |
People
(Reporter: valentin, Assigned: valentin)
References
(Blocks 3 open bugs)
Details
(Whiteboard: [necko-triaged][necko-priority-queue])
Attachments
(10 files, 1 obsolete file)
48 bytes, text/x-phabricator-request | Details
(In reply to Andrew Creskey [:acreskey] from bug 1994314 comment #0)
Very similar to bug 1976964, I can see long dns resolution times when using DoH in Fenix on sites that need to resolve numerous domains simultaneously.
Steps to reproduce:
I'm using Wifi, DoH Mode 2, Samsung A54.
Dns53 lookup times: start off at 33ms, grow to ~40-60ms
https://share.firefox.dev/3JeisEJ
Cloudflare TRR: starts off at 37ms, but then quickly ends up resolving most requests in ~100-200ms
https://share.firefox.dev/4hc6LuW
CIRA TRR: starts off at 35ms, but then quickly ends up resolving most requests in ~150-300ms
https://share.firefox.dev/4nOxxvE
Expectation:
When making multiple requests simultaneously, the latency does not increase to such a degree.
Comment 1•4 months ago (Assignee)
This test creates a benchmarking request and reports benchReq.domainLookupEndTime - benchReq.domainLookupStartTime.
We run the test multiple times, and make sure the request is performed while lots of other requests are in progress.
We would like to see that the lookup time for the benchmarked request does not (significantly) increase as more requests are added.
Comment 2•3 months ago (Assignee)
Status update:
My investigation into this issue has uncovered a number of other small perf wins, but I haven't yet managed to fix the DoH performance issue.
Recent work seems to indicate that the problem is events dispatched to the socket thread. Regular connections and TRR connections all post events to be run on the socket thread. Because there's no prioritization of the runnables, the TRR runnables end up being blocked by network activity from other channels. This seems to mostly be a problem when the socket thread is close to 100% usage.
A few other promising leads:
- The TRR thread we use for AsyncOpen, and parsing of DoH packets is mostly unused, but the fact that we need to dispatch runnables between the socket and TRR thread multiple times indicates this might be a potential bottleneck. I've moved the TRR service to the socket thread and noticed no degradation in performance, but also no improvement, probably due to runnables still being dispatched to the end of the queue even when the call is already on the socket thread. Executing the runnable immediately when it's created on the socket thread is likely to improve things.
- DNSListenerProxy::OnLookupComplete has a similar issue: it always dispatches, even when already on the target thread. However, when I tried executing this one synchronously, I encountered a socket thread hang. Most likely there's a bug in nsHostResolver calling OnResolveHostComplete while holding a lock.
A few other things I tried, which could be potential improvements, but did not immediately fix the DoH performance issue:
- Signal mPollableEvent when nsSocketTransportService::Dispatch is called.
- Avoid expensive calls to IsAlive in nsHttpConnection::CanReuse for HTTP/2 connections. The profile shows a lot of time spent in PR_Recv call in nsSocketTransport::IsAlive
- Reading from TRR sockets first, then processing the event loop, then the rest of the sockets.
- Setting SSL_NO_LOCKS for necko sockets.
Going forward I will focus on the most likely fix for DoH performance (prioritizing the DoH runnables).
Comment 3•3 months ago (Assignee)
Hi folks,
Exciting progress to report. I spent a bunch of time last week adding an extra priority event queue to the socket transport service, to confirm whether the previous theory was correct. I made sure that socket thread events triggered by and for TRR channels get an increased priority, as does DNSListenerProxy. After running with just these few changes I was a bit disappointed in the results.
However, I just checked to see what would happen if in nsSocketTransportService::DoPollIteration we didn't read from non-TRR sockets if we had a socket or a high priority runnable to handle, and that actually made a 10X improvement in DNS timings. The overall duration of each request didn't seem to go down, but parallelism seems to have increased.
This result is very promising and seems to indicate there are easy performance wins that can be achieved by doing less when there are high priority events pending.
There is a risk of TRR connections starving others - but that's something we can test. I'll start posting patches for review in the next few days.
This is an early profile showing that DNS timings don't balloon to hundreds of ms when there are many requests in the queue: https://share.firefox.dev/45wQcoQ
Comment 4•3 months ago (Assignee)
This patch adds a pref (for testing) that determines whether to use a dedicated
TRR thread to parse DoH responses, or whether to do it on the socket thread to
avoid thread hops.
With the full patch stack in this bug, using the TRR thread seems to double the
DNS resolution time from ~40ms to ~80 ms on my machine.
Comment 5•3 months ago (Assignee)
If TRRService::DispatchTRRRequest is called on the target TRR thread
then we can immediately run it instead of dispatching to the end of the queue.
Comment 6•3 months ago (Assignee)
Comment 7•3 months ago (Assignee)
The PrioritizableRunnable asserts that it runs on the main thread.
In order to prioritize events on the socket thread we extend
PrioritizableRunnable and override Run to avoid the assertion.
Comment 8•3 months ago (Assignee)
This change makes the OnInputStreamReady events delivered for the
TRRServiceChannel transaction implement nsIRunnablePriority
so they can be prioritized.
Comment 9•3 months ago (Assignee)
Comment 10•3 months ago (Assignee)
Comment 11•3 months ago (Assignee)
This queue of events is processed before the regular events that get dispatched
to the Socket thread event loop.
Comment 12•3 months ago (Assignee)
Comment 13•3 months ago
Very nice work.
I'll run this stack (and a variant with network.trr.parse_on_socket_thread:false) against our CI Fenix DoH tests and on a local device.
Updated•3 months ago
Comment 14•3 months ago
Jeff had asked for simpleperf profiles of slow resolution through DoH.
These are from Fenix Nightly, Feb 2. DoH mode 2, through Cloudflare, in Ottawa Canada.
Without geckoprofiler markers it's harder to identify the slow resolutions via TRR, but these are profiles of scenarios that have consistently led to problems.
Test page that loads 100 fav-icons, thus 100 domains resolved:
https://share.firefox.dev/3Zgl4GK
USAToday navigation:
https://share.firefox.dev/3NRjWqB
NYTimes.com navigation and scrolling:
https://share.firefox.dev/4tdNte8
Comment 15•3 months ago
(In reply to Andrew Creskey [:acreskey] from comment #14)
Jeff had asked for simpleperf profiles of slow resolution through DoH.
These are from Fenix Nightly, Feb 2. DoH mode 2, through Cloudflare, in Ottawa Canada.
Without geckoprofiler markers it's harder to identify the slow resolutions via TRR, but these are profiles of scenarios that have consistently led to problems.
Test page that loads 100 fav-icons, thus 100 domains resolved:
https://share.firefox.dev/3Zgl4GK
USAToday navigation:
https://share.firefox.dev/3NRjWqB
In both of these profiles more than 50% of the time is being spent in NSS and a lot of the time spent in NSS doesn't seem to be running crypto algorithm code. I think it's worth trying to improve that.
Comment 16•3 months ago (Assignee)
This initial perf run looks promising. DNS timings are way down, while pageload time is roughly the same:
https://perf.compare/compare-lando-results?baseLando=177149&newLando=177150&baseRepo=try&newRepo=try&framework=13&search=trr
Comment 17•3 months ago (Assignee)
Comment 18•3 months ago
I've taken an initial look at this patch, focusing just on DNS resolution (not the impact on other network timings or visual metrics).
It's a bit tricky to compare DoH against the OS resolver in CI because the Android OS will cache the hostnames after the first iteration, thus giving the OS resolver an "unfair advantage."
So I've run it locally, flushing the OS cache between iterations.
The results are reproducible and actually match what Valentin found in his perf run, comment 16.
These are the variants that I compared:
- Native has DoH disabled
- Baseline TRR is a push of today's Fenix nightly, using either Cloudflare (CF) or CIRA
- TRR Prio is Valentin's TRR prioritization patch stack
I used the trr-multi test, which loads 50 favicons.
It's a stress test for DoH as there are so many simultaneous requests. (Although some top sites request a similar amount.)
Findings:
- TRR Prio reduces TRR lookup times by 91-98ms (53-61%)
- With TRR Prio, TRR overhead vs Native is reduced to 29-40ms (62-89%)
- Without TRR Prio, TRR overhead vs Native is 122-146ms (244-324%)
Comparison Table (Mean / Median)

| Variant | Mean | Median |
|---|---|---|
| Native | 50ms | 45ms |
| Baseline TRR CF | 184ms | 179ms |
| Baseline TRR CIRA | 172ms | 191ms |
| TRR Prio CF | 86ms | 85ms |
| TRR Prio CIRA | 81ms | 74ms |

TRR Prio Improvement vs Baseline TRR

| Provider | Mean Reduction | Median Reduction |
|---|---|---|
| Cloudflare | -98ms (-53%) | -94ms (-53%) |
| CIRA | -91ms (-53%) | -117ms (-61%) |

TRR Overhead vs Native (with TRR Prio)

| Provider | Mean Overhead | Median Overhead |
|---|---|---|
| Cloudflare | +36ms (+72%) | +40ms (+89%) |
| CIRA | +31ms (+62%) | +29ms (+64%) |

TRR Overhead vs Native (Baseline, without TRR Prio)

| Provider | Mean Overhead | Median Overhead |
|---|---|---|
| Cloudflare | +134ms (+268%) | +134ms (+298%) |
| CIRA | +122ms (+244%) | +146ms (+324%) |
Some profiles from the run:
TRR Baseline: https://share.firefox.dev/4a95Bx2
TRR Prio: https://share.firefox.dev/4tppSHF
Native resolver (OS Dns cache flushed): https://share.firefox.dev/4tiKrVW
edit: taken on Samsung A54, PGO builds, wifi
Comment 19•3 months ago
Simpleperf profile of trr-multi with the TRR prioritization patch stack applied:
https://share.firefox.dev/3ZNBOoP
Comment 20•3 months ago (Assignee)
Native 50ms 45ms
Note that my patch stack also includes ⚙ D280956 Bug 2009372 - Increase priority for DNSListenerProxy runnable r=necko, which means even native DNS resolutions would benefit from the current changes.
I think if we want to compare to the current state of native resolutions (Baseline Native) we need to set network.dns.high_priority_dispatch to false. 🙂
Comment 21•3 months ago
(In reply to Valentin Gosu [:valentin] (he/him) from comment #20)
Native 50ms 45ms
Note that my patch stack also includes ⚙ D280956 Bug 2009372 - Increase priority for DNSListenerProxy runnable r=necko, which means even native DNS resolutions would benefit from the current changes.
I think if we want to compare to the current state of native resolutions (Baseline Native) we need to set network.dns.high_priority_dispatch to false. 🙂
Ah, yes.
Since I used separate binaries, I do have those numbers.
baseline_1 and baseline_2 are fenix nightly from m-c, without network.dns.high_priority_dispatch.
trr_prio_1, trr_prio_2 have network.dns.high_priority_dispatch:true as they are from the patch stack.
The dns timings, from this limited test, seem roughly comparable.
| Run | Count | Mean | Median |
|---|---|---|---|
| baseline_1 | 97 | 47ms | 45ms |
| baseline_2 | 95 | 53ms | 44ms |
| trr_prio_1 | 97 | 67ms | 51ms |
| trr_prio_2 | 95 | 46ms | 44ms |
Comment 22•3 months ago
I should note that the trr-multi test was designed as a distilled example of a scenario in which DoH showed higher latency than the os resolver. (Making a large number of dns requests while at the same time processing numerous other http requests).
We don't necessarily need parity with the os resolver in this test to give us comparable overall performance since there are other scenarios in which DoH can outperform. (e.g. with DoH we see fewer of the sporadic multi-second requests, the DNS cache is not cleared on change of network, etc).
Comment 23•3 months ago
Bug 1997973 will help here, as the socket thread can be busy in FindCoalescableConnection in 7-8% of samples.
Comment 24•3 months ago
(In reply to Jeff Muizelaar [:jrmuizel] from comment #15)
In both of these profiles more than 50% of the time is being spent in NSS and a lot of the time spent in NSS doesn't seem to be running crypto algorithm code. I think it's worth trying to improve that.
I'll bring this up with the NSS and necko folks - maybe a meta bug to capture what portions of TLS we could make async via other threads, etc.
Updated•3 months ago
Comment 25•3 months ago
Comment 26•3 months ago
Comment 27•3 months ago
Reverted this because it was causing leak failures.
- Revert link
- Push with failures
- Failure Log
- Failure line: TEST-UNEXPECTED-FAIL | LeakSanitizer | leak at CallbackHolder::CallbackHolder, nsPipeInputStream::AsyncWait, mozilla::RemoteLazyInputStream::AsyncWait, mozilla::nsMultiplexInputStream::AsyncWaitInternal
PS Please also check these mda failures.
Comment 28•3 months ago
Please check this mda failure as well -> https://treeherder.mozilla.org/logviewer?job_id=548555430&repo=autoland&task=b5lpmPUpRX-0UKEvsCfiJA.0
Comment 29•2 months ago
Comment 30•2 months ago
bugherder
https://hg.mozilla.org/mozilla-central/rev/46887b4cbf76
https://hg.mozilla.org/mozilla-central/rev/422fe0b0035b
https://hg.mozilla.org/mozilla-central/rev/2802442f8a6c
https://hg.mozilla.org/mozilla-central/rev/d7912d95df76
https://hg.mozilla.org/mozilla-central/rev/57af501a1b75
https://hg.mozilla.org/mozilla-central/rev/8d16e3e7455b
https://hg.mozilla.org/mozilla-central/rev/db195751f853
https://hg.mozilla.org/mozilla-central/rev/135db32e8d09
https://hg.mozilla.org/mozilla-central/rev/d065c8f881ce
https://hg.mozilla.org/mozilla-central/rev/0dd114b6fdf4
Updated•2 months ago