Closed Bug 2009372 Opened 4 months ago Closed 2 months ago

Fenix: DoH performance shows high latency when resolving multiple hostnames in parallel (bis)

Categories

(Core :: Networking: DNS, defect, P2)


Tracking


RESOLVED FIXED
149 Branch
Tracking Status
firefox149 --- fixed

People

(Reporter: valentin, Assigned: valentin)

References

(Blocks 3 open bugs)

Details

(Whiteboard: [necko-triaged][necko-priority-queue])

Attachments

(10 files, 1 obsolete file)


(In reply to Andrew Creskey [:acreskey] from bug 1994314 comment #0)

Very similar to bug 1976964, I can see long DNS resolution times when using DoH in Fenix on sites that need to resolve numerous domains simultaneously.

Steps to reproduce:

Load https://www.usatoday.com

I'm using
Wifi
DoH Mode 2

Samsung A54,

Dns53 lookup times: starts off at 33ms, grows to ~40-60ms
https://share.firefox.dev/3JeisEJ

Cloudflare TRR: starts off at 37ms, but then most requests quickly end up resolving in ~100-200ms
https://share.firefox.dev/4hc6LuW

CIRA TRR: starts off at 35ms, but then most requests quickly end up resolving in ~150-300ms
https://share.firefox.dev/4nOxxvE

Expectation:
When making multiple requests simultaneously, the latency should not increase to such a degree

This test creates a benchmarking request and reports benchReq.domainLookupEndTime - benchReq.domainLookupStartTime
We run the test multiple times, and make sure the request is performed while lots of other requests are in progress.
We would like to see that the lookup time for the benchmarked request does not (significantly) increase as more requests are added.

See Also: → 2010767
See Also: → 2010819
See Also: → 2011397

Status update:

My investigation into this issue has uncovered a number of other small perf wins, but I haven't yet managed to fix the DoH performance issue.
Recent work seems to indicate that the problem is events dispatched to the socket thread. Regular connections and TRR connections all post events to be run on the socket thread. Because there's no prioritization of the runnables, the TRR runnables end up being blocked by network activity from other channels. This seems to mostly be a problem when the socket thread is close to 100% usage.

A few other promising leads:

  • The TRR thread we use for AsyncOpen and for parsing DoH packets is mostly idle, but the fact that we need to dispatch runnables between the socket and TRR threads multiple times suggests a potential bottleneck. I moved the TRR service to the socket thread and noticed no degradation in performance, but also no improvement, probably because runnables are still dispatched to the end of the queue even when the call is already on the socket thread. Executing the runnable immediately when it's created on the socket thread is likely to improve things.
  • DNSListenerProxy::OnLookupComplete has a similar issue: it always dispatches, even when already on the target thread. However, when I tried executing this one synchronously, I encountered a socket thread hang. Most likely nsHostResolver calls OnResolveHostComplete while holding a lock.

A few other things I tried, which could be potential improvements, but did not immediately fix the DoH performance issue:

  • Signal mPollableEvent when nsSocketTransportService::Dispatch is called.
  • Avoid expensive calls to IsAlive in nsHttpConnection::CanReuse for HTTP/2 connections. The profile shows a lot of time spent in the PR_Recv call in nsSocketTransport::IsAlive.
  • Reading from TRR sockets first, then processing the event loop, then the rest of the sockets.
  • Setting SSL_NO_LOCKS for necko sockets.

Going forward I will focus on the most likely fix for DoH performance (prioritizing the DoH runnables).

Hi folks,

Exciting progress news. I spent a bunch of time last week adding an extra priority event queue to the socket transport service, to confirm whether the previous theory was correct. I made sure that socket thread events triggered by and for TRR channels get increased priority, and did the same for the DNSListenerProxy runnable. After running with just these few changes I was a bit disappointed in the results.

However, I just checked what would happen if, in nsSocketTransportService::DoPollIteration, we didn't read from non-TRR sockets while we had a TRR socket or a high-priority runnable to handle, and that actually made a 10X improvement in DNS timings. The overall duration of each request didn't seem to go down, but parallelism seems to have increased.

This result is very promising and seems to indicate there are easy performance wins to be had by doing less when high-priority events are pending.
There is a risk of TRR connections starving others - but that's something we can test. I'll start posting patches for review in the next few days.

This is an early profile showing that DNS timings don't balloon to hundreds of ms when there are many requests in the queue: https://share.firefox.dev/45wQcoQ

This patch adds a pref (for testing) that determines whether to use a dedicated
TRR thread to parse DoH responses, or whether to do it on the socket thread to
avoid thread hops.

With the full patch stack in this bug, using the TRR thread seems to double the
DNS resolution time from ~40ms to ~80 ms on my machine.

If TRRService::DispatchTRRRequest is called on the target TRR thread
then we can immediately run it instead of dispatching to the end of the queue.

The PrioritizableRunnable asserts that it runs on the main thread.
In order to prioritize events on the socket thread we extend
PrioritizableRunnable and override Run to avoid the assertion.

This change makes the OnInputStreamReady events delivered for the
TRRServiceChannel transaction implement nsIRunnablePriority
so they can be prioritized.

This queue of events is processed before the regular events that get dispatched
to the Socket thread event loop.

Very nice work.
I'll run this stack (and a variant with network.trr.parse_on_socket_thread:false) against our CI Fenix DoH tests and on a local device.

Attachment #9540910 - Attachment description: Bug 2009372 - Add AnyThreadPrioritizableRunnable r=#necko → Bug 2009372 - Allow PrioritizableRunnable::Run to run on socket thread r=#necko

Jeff had asked for simpleperf profiles of slow resolution through DoH.

These are from Fenix Nightly, Feb 2. DoH mode 2, through Cloudflare, in Ottawa Canada.
Without Gecko Profiler markers it's harder to identify the slow resolutions via TRR, but these are profiles of scenarios that have consistently led to problems.

Test page that loads 100 fav-icons, thus 100 domains resolved:
https://share.firefox.dev/3Zgl4GK

USAToday navigation:
https://share.firefox.dev/3NRjWqB

NYTimes.com navigation and scrolling:
https://share.firefox.dev/4tdNte8

Flags: needinfo?(jmuizelaar)

(In reply to Andrew Creskey [:acreskey] from comment #14)

Jeff had asked for simpleperf profiles of slow resolution through DoH.

These are from Fenix Nightly, Feb 2. DoH mode 2, through Cloudflare, in Ottawa Canada.
Without Gecko Profiler markers it's harder to identify the slow resolutions via TRR, but these are profiles of scenarios that have consistently led to problems.

Test page that loads 100 fav-icons, thus 100 domains resolved:
https://share.firefox.dev/3Zgl4GK

USAToday navigation:
https://share.firefox.dev/3NRjWqB

In both of these profiles more than 50% of the time is being spent in NSS and a lot of the time spent in NSS doesn't seem to be running crypto algorithm code. I think it's worth trying to improve that.

Flags: needinfo?(jmuizelaar)

This initial perf run looks promising. DNS timings are way down, while pageload time is roughly the same
https://perf.compare/compare-lando-results?baseLando=177149&newLando=177150&baseRepo=try&newRepo=try&framework=13&search=trr

I've taken an initial look at this patch, focusing just on DNS resolution (not the impact on other network timings or visual metrics).

It's a bit tricky to compare DoH against the OS resolver in CI because the Android OS will cache the hostnames after the first iteration, thus giving the OS resolver an "unfair advantage."
So I've run it locally, flushing the OS cache between iterations.

The results are reproducible and actually match what Valentin found in his perf run, comment 16.

These are the variants that I compared:

  • Native has DoH disabled
  • Baseline TRR is a build of today's Fenix Nightly, using either Cloudflare (CF) or CIRA
  • TRR Prio is Valentin's TRR prioritization patch stack

I used the trr-multi test, which loads 50 favicons. It's a stress test for DoH since there are so many simultaneous requests (although some top sites request a similar number).

Findings:

  1. TRR Prio reduces TRR lookup times by 91-98ms (53-61%)
  2. With TRR Prio, TRR overhead vs Native is reduced to 29-40ms (62-89%)
  3. Without TRR Prio, TRR overhead vs Native is 122-146ms (244-324%)
  Comparison Table (Mean / Median)

                      Mean      Median
  --------------------------------------
  Native               50ms       45ms
  Baseline TRR CF     184ms      179ms
  Baseline TRR CIRA   172ms      191ms
  TRR Prio CF          86ms       85ms
  TRR Prio CIRA        81ms       74ms

  TRR Prio Improvement vs Baseline TRR

  Provider      Mean Reduction     Median Reduction
  -------------------------------------------------
  Cloudflare    -98ms (-53%)       -94ms (-53%)
  CIRA          -91ms (-53%)      -117ms (-61%)

  TRR Overhead vs Native (with TRR Prio)

  Provider      Mean Overhead      Median Overhead
  -------------------------------------------------
  Cloudflare    +36ms (+72%)       +40ms (+89%)
  CIRA          +31ms (+62%)       +29ms (+64%)

  TRR Overhead vs Native (Baseline, without TRR Prio)

  Provider      Mean Overhead      Median Overhead
  -------------------------------------------------
  Cloudflare   +134ms (+268%)     +134ms (+298%)
  CIRA         +122ms (+244%)     +146ms (+324%)

Some profiles from the run:

TRR Baseline: https://share.firefox.dev/4a95Bx2
TRR Prio: https://share.firefox.dev/4tppSHF
Native resolver (OS Dns cache flushed): https://share.firefox.dev/4tiKrVW

edit: taken on Samsung A54, PGO builds, wifi

Simpleperf profile of trr-multi with the TRR prioritization patch stack applied:
https://share.firefox.dev/3ZNBOoP

Native 50ms 45ms

Note that my patch stack also includes ⚙ D280956 Bug 2009372 - Increase priority for DNSListenerProxy runnable r=necko, which means even native DNS resolutions would benefit from the current changes.
I think if we want to compare to the current state of native resolutions (Baseline Native) we need to set network.dns.high_priority_dispatch to false. 🙂

(In reply to Valentin Gosu [:valentin] (he/him) from comment #20)

Native 50ms 45ms

Note that my patch stack also includes ⚙ D280956 Bug 2009372 - Increase priority for DNSListenerProxy runnable r=necko, which means even native DNS resolutions would benefit from the current changes.
I think if we want to compare to the current state of native resolutions (Baseline Native) we need to set network.dns.high_priority_dispatch to false. 🙂

Ah, yes.
Since I used separate binaries, I do have those numbers.
baseline_1 and baseline_2 are fenix nightly from m-c, without network.dns.high_priority_dispatch.
trr_prio_1, trr_prio_2 have network.dns.high_priority_dispatch:true as they are from the patch stack.
The dns timings, from this limited test, seem roughly comparable.

  Run              Count       Mean     Median
  --------------------------------------------
  baseline_1          97       47ms       45ms
  baseline_2          95       53ms       44ms
  trr_prio_1          97       67ms       51ms
  trr_prio_2          95       46ms       44ms

I should note that the trr-multi test was designed as a distilled example of a scenario in which DoH showed higher latency than the OS resolver (making a large number of DNS requests while simultaneously processing numerous other HTTP requests).

We don't necessarily need parity with the OS resolver in this test to get comparable overall performance, since there are other scenarios in which DoH outperforms it (e.g. with DoH we see fewer sporadic multi-second requests, and the DNS cache is not cleared when the network changes).

See Also: → 1997973

Bug 1997973 will help here, as the socket thread can be busy in FindCoalescableConnection in 7-8% of samples.

(In reply to Jeff Muizelaar [:jrmuizel] from comment #15)

In both of these profiles more than 50% of the time is being spent in NSS and a lot of the time spent in NSS doesn't seem to be running crypto algorithm code. I think it's worth trying to improve that.

I'll bring this up with the NSS and necko folks - maybe a meta bug to capture what portions of TLS we could make async via other threads, etc.

Attachment #9540907 - Attachment is obsolete: true
Depends on: 2016567
Pushed by valentin.gosu@gmail.com:
https://github.com/mozilla-firefox/firefox/commit/d1a957952613
https://hg.mozilla.org/integration/autoland/rev/e79fb83a85eb
Add benchmarking xpcshell test for DoH timings while other networking is in progress r=acreskey,necko-reviewers,kershaw
https://github.com/mozilla-firefox/firefox/commit/843517a0ab4e
https://hg.mozilla.org/integration/autoland/rev/c931cc0478c8
Use Socket thread to parse TRR response instead of TRR thread to avoid thread hops r=necko-reviewers,kershaw
https://github.com/mozilla-firefox/firefox/commit/0fd8e2d182e1
https://hg.mozilla.org/integration/autoland/rev/48c2836aa7e0
Mark TRR connections and sockets r=necko-reviewers,kershaw,webrtc-reviewers,mjf
https://github.com/mozilla-firefox/firefox/commit/53664bf0f37a
https://hg.mozilla.org/integration/autoland/rev/b9a740f7a6a8
Allow PrioritizableRunnable::Run to run on socket thread r=necko-reviewers,kershaw,xpcom-reviewers,nika
https://github.com/mozilla-firefox/firefox/commit/a95437aebbe1
https://hg.mozilla.org/integration/autoland/rev/ea6983eacec7
Pipe runnable prioritization r=necko-reviewers,kershaw
https://github.com/mozilla-firefox/firefox/commit/07a14723bc09
https://hg.mozilla.org/integration/autoland/rev/aedf96510656
Increase runnable priority for TRR channels r=necko-reviewers,jesup
https://github.com/mozilla-firefox/firefox/commit/b4bdee9befe8
https://hg.mozilla.org/integration/autoland/rev/1eff69480b58
Increase priority for DNSListenerProxy runnable r=necko-reviewers,kershaw
https://github.com/mozilla-firefox/firefox/commit/535606169556
https://hg.mozilla.org/integration/autoland/rev/49cb1fd639b5
Add mPriorityEventQueue to SocketTransportService r=necko-reviewers,jesup
https://github.com/mozilla-firefox/firefox/commit/3b82fab07829
https://hg.mozilla.org/integration/autoland/rev/1ff8800632be
Add locking to mPriorityEventQueue r=necko-reviewers,jesup
https://github.com/mozilla-firefox/firefox/commit/71907b427963
https://hg.mozilla.org/integration/autoland/rev/8502d135245f
Prioritize H3 TRR runnables too r=necko-reviewers,jesup
Pushed by sstanca@mozilla.com:
https://github.com/mozilla-firefox/firefox/commit/6b07694ccf35
https://hg.mozilla.org/integration/autoland/rev/6e35b0425dfc
Revert "Bug 2009372 - Prioritize H3 TRR runnables too r=necko-reviewers,jesup" for causing leak failures.

Reverted this because it was causing leak failures.

  • Revert link
  • Push with failures
  • Failure Log
  • Failure line: TEST-UNEXPECTED-FAIL | LeakSanitizer | leak at CallbackHolder::CallbackHolder, nsPipeInputStream::AsyncWait, mozilla::RemoteLazyInputStream::AsyncWait, mozilla::nsMultiplexInputStream::AsyncWaitInternal

PS Please also check these mda failures.

Flags: needinfo?(valentin.gosu)
Depends on: 2016649
Pushed by valentin.gosu@gmail.com:
https://github.com/mozilla-firefox/firefox/commit/65c19f7231e9
https://hg.mozilla.org/integration/autoland/rev/46887b4cbf76
Add benchmarking xpcshell test for DoH timings while other networking is in progress r=acreskey,necko-reviewers,kershaw
https://github.com/mozilla-firefox/firefox/commit/cf1b9db0d77b
https://hg.mozilla.org/integration/autoland/rev/422fe0b0035b
Use Socket thread to parse TRR response instead of TRR thread to avoid thread hops r=necko-reviewers,kershaw
https://github.com/mozilla-firefox/firefox/commit/ee334bd83462
https://hg.mozilla.org/integration/autoland/rev/2802442f8a6c
Mark TRR connections and sockets r=necko-reviewers,kershaw,webrtc-reviewers,mjf
https://github.com/mozilla-firefox/firefox/commit/ce7603163784
https://hg.mozilla.org/integration/autoland/rev/d7912d95df76
Allow PrioritizableRunnable::Run to run on socket thread r=necko-reviewers,kershaw,xpcom-reviewers,nika
https://github.com/mozilla-firefox/firefox/commit/2a95d5ec8ea1
https://hg.mozilla.org/integration/autoland/rev/57af501a1b75
Pipe runnable prioritization r=necko-reviewers,kershaw
https://github.com/mozilla-firefox/firefox/commit/60b4ab73b55f
https://hg.mozilla.org/integration/autoland/rev/8d16e3e7455b
Increase runnable priority for TRR channels r=necko-reviewers,jesup
https://github.com/mozilla-firefox/firefox/commit/4502ff33d787
https://hg.mozilla.org/integration/autoland/rev/db195751f853
Increase priority for DNSListenerProxy runnable r=necko-reviewers,kershaw
https://github.com/mozilla-firefox/firefox/commit/61a43db866b0
https://hg.mozilla.org/integration/autoland/rev/135db32e8d09
Add mPriorityEventQueue to SocketTransportService r=necko-reviewers,jesup
https://github.com/mozilla-firefox/firefox/commit/fdd33c257d9c
https://hg.mozilla.org/integration/autoland/rev/d065c8f881ce
Add locking to mPriorityEventQueue r=necko-reviewers,jesup
https://github.com/mozilla-firefox/firefox/commit/7e1de86c275c
https://hg.mozilla.org/integration/autoland/rev/0dd114b6fdf4
Prioritize H3 TRR runnables too r=necko-reviewers,jesup
Flags: needinfo?(valentin.gosu)
Blocks: 2016649
No longer depends on: 2016649
No longer blocks: 2016649
Regressions: 2016649
QA Whiteboard: [qa-triage-done-c150/b149]
Duplicate of this bug: 1998646
Regressions: 2026504
