Closed Bug 1976964 Opened 10 months ago Closed 9 months ago

Fenix: DoH performance degrades under high concurrent request load

Categories

(Core :: Networking: DNS, defect, P3)


Tracking

RESOLVED FIXED
143 Branch
Tracking Status
firefox143 --- fixed

People

(Reporter: acreskey, Assigned: kershaw)

References

(Blocks 3 open bugs)

Details

(Whiteboard: [necko-triaged])

Attachments

(7 files)

Testing on a mid-range Samsung A54 with a PGO-optimized Nightly build (-O2), I'm seeing DoH DNS resolution time degrade when a large number of requests are made in a short period of time.

This test site looks up 100 domains by creating an image element for each domain's favicon:
https://acreskeymoz.github.io/dns/test_lookups.html

While this is a lot of lookups, designed to stress the system, top news sites like www.wsj.com can make a similar number of domain requests.
(Note that with both DoH and native DNS we make three requests for every domain: A, AAAA, and HTTPS.)

What I'm seeing is that under heavy load, resolution time for DoH requests can grow much more rapidly than for native DNS.

==================================================
Metric          Native DNS      TRR DNS        
--------------------------------------------------
Count           306             297            
Mean (ms)       399.535         1048.750       
Median (ms)     210.036         811.085        
Std Dev (ms)    647.351         803.899        
Min (ms)        4.573           179.793        
Max (ms)        3867.147        2992.642       

Summary:
Native DNS is faster on average by 649.214 ms

For this test the OS DNS cache was flushed between configuration changes to ensure native DNS requests had to at least go off device.
Testing was done on home Wi-Fi.
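The summary statistics in the tables above can be reproduced from raw lookup timings with a short script. This is a sketch with hypothetical sample values, not the actual measurement harness used for these runs:

```python
import statistics

def summarize(label, times_ms):
    # Compute the same summary statistics shown in the tables above.
    return {
        "label": label,
        "count": len(times_ms),
        "mean": statistics.mean(times_ms),
        "median": statistics.median(times_ms),
        "stdev": statistics.stdev(times_ms),
        "min": min(times_ms),
        "max": max(times_ms),
    }

# Hypothetical sample data; the real runs collected ~300 lookups each.
native = [4.6, 210.0, 399.5, 620.1, 3867.1]
trr = [179.8, 811.1, 1048.8, 1500.2, 2992.6]

for s in (summarize("Native DNS", native), summarize("TRR DNS", trr)):
    print(f"{s['label']}: n={s['count']} "
          f"mean={s['mean']:.3f}ms median={s['median']:.3f}ms")
```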

Plotting DNS lookup time by request number shows a pattern where DoH requests experience increasing delays relative to native DNS.

Blocks: necko-perf

Profile:
https://share.firefox.dev/4lNoFFf

Profile with moz_logs (very slow, captured a portion of the navigation)
https://share.firefox.dev/4eIlBIa

(In reply to Andrew Creskey [:acreskey] from comment #2)

Profile:
https://share.firefox.dev/4lNoFFf

Wondering if we're reaching some concurrent stream limit on the TRR connection? Socket thread is doing work but doesn't look to be 100% bottlenecked.

As an aside, in the field experiment we saw ~20% of TRR requests complete over HTTP/1.1 on Fenix -- a high domain request count would be a real problem there.

See Also: → 1976996

In bug 1976996 we're seeing a higher rate (up to 20%) of TRR service channel connections being made over HTTP/1.1, but in these local runs I'm seeing the service channel connection being made over HTTP/2.

It looks like the problem is much worse on slower networks and, of course, on sites that connect to numerous domains.

In this test I'm using an optimized Fenix Nightly on a Pixel 6, connecting to my Wi-Fi router.
Test site: https://www.wsj.com/

==================================================
Metric          Native DNS      TRR DNS        
--------------------------------------------------
Count           123             122            
Mean (ms)       133.476         1597.327       
Median (ms)     60.551          1324.477       
Std Dev (ms)    466.338         1112.312       
Min (ms)        2.130           112.107        
Max (ms)        3075.796        5649.653       

Summary:
Native DNS is faster on average by 1463.851 ms
Attached image wsj_lookup_time.png

Lookup times by request number, native/doh for www.wsj.com

Attached image wsj_cdf.png

CDF of lookup times, native/doh for www.wsj.com

This is a profile of the Pixel 6 loading www.wsj.com: https://share.firefox.dev/4lpK6fP
Note that logging was active during capture, so the timing will be distorted.

I found two areas where we could improve:

  1. Proxy resolution
    The log below shows that proxy resolution takes around 500ms. This delay occurs because we need to post a runnable from the TRR background thread and wait for the result on the main thread.
2025-07-11 14:10:33.917739501 UTC - [Parent Process 29829: TRR Background]: D/nsHttp TRRServiceChannel::ResolveProxy [this=77b10e7800]
2025-07-11 14:10:34.501613769 UTC - [Parent Process 29829: GeckoMain]: D/nsHttp TRRServiceChannel::ResolveProxy [this=77b10e7800]
2025-07-11 14:10:34.502220947 UTC - [Parent Process 29829: GeckoMain]: D/nsHttp TRRServiceChannel::OnProxyAvailable [this=77b10e7800 pi=0 status=0 mStatus=0]
2025-07-11 14:10:34.543864501 UTC - [Parent Process 29829: TRR Background]: D/nsHttp TRRServiceChannel::OnProxyAvailable [this=77b10e7800 pi=0 status=0 mStatus=0]
  2. HTTP/2 stream limit handling
    The log also indicates that we hit the HTTP/2 maximum concurrent stream limit. When this happens, we don't handle it efficiently: the current code wakes all waiting streams in the queue when a single stream slot becomes available, even though only one can proceed. This results in unnecessary wakeups and wasted cycles.
Blocks: 1801530
Assignee: nobody → kershaw
Status: NEW → ASSIGNED

This pref was implemented a while ago for the socket process but was never enabled. Let's try enabling it in early Beta and see if we can observe any performance improvements.

Pushed by chorotan@mozilla.com:
https://github.com/mozilla-firefox/firefox/commit/ed90036c6f61
https://hg.mozilla.org/integration/autoland/rev/b9ca1ed1b4e9
Revert "Bug 1976964 - Don't try to activate all HTTP/2 streams at once, r=necko-reviewers,jesup" for causing xpcshell failures on test_trr.js

Backed out for causing xpcshell failures on test_trr.js

Backout link

Push with failures

Failure log

Flags: needinfo?(kershaw)

I took Kershaw's two fixes from this bug and re-ran the scenario where DoH on Fenix was giving poor performance results (loading the high-domain-count site, wsj.com, on a medium-quality Wi-Fi connection, using an older Pixel 6 phone).

With the patches this scenario looks to be greatly improved, with performance roughly comparable to native DNS (see attachments).

We no longer see the TRR requests getting progressively slower, and there are no more multi-second resolution times.

Attached image updated_doh_cdf_wsj.png

CDF is improved in this scenario as well.

Flags: needinfo?(kershaw)
Status: ASSIGNED → RESOLVED
Closed: 9 months ago
Resolution: --- → FIXED
Target Milestone: --- → 143 Branch

Is there a bug for the proxy resolution problem?

Flags: needinfo?(kershaw)

(In reply to Jeff Muizelaar [:jrmuizel] from comment #20)

Is there a bug for the proxy resolution problem?

Not sure what you mean. We have bug 1770153 for improving the caching of system proxy information.

Unfortunately, the trick we used in that bug can't be applied to all HTTP requests, since whether a request uses a proxy can vary depending on the URL. We could potentially implement a general cache that stores the mapping between URLs and proxy information; maybe that would help.
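Such a cache might look like the following sketch. The class name, keying scheme, and TTL policy are all hypothetical illustrations, not an existing Necko API:

```python
import time

class ProxyInfoCache:
    """Hypothetical cache mapping a URL's (scheme, host, port) to resolved
    proxy info, so repeated requests to the same origin can skip the
    cross-thread proxy resolution round trip."""

    def __init__(self, ttl_seconds=300.0):
        self._ttl = ttl_seconds
        self._entries = {}  # (scheme, host, port) -> (proxy_info, expiry)

    def get(self, scheme, host, port, now=None):
        # Return cached proxy info, or None on a miss or expired entry.
        now = time.monotonic() if now is None else now
        entry = self._entries.get((scheme, host, port))
        if entry is not None and entry[1] > now:
            return entry[0]
        return None

    def put(self, scheme, host, port, proxy_info, now=None):
        now = time.monotonic() if now is None else now
        self._entries[(scheme, host, port)] = (proxy_info, now + self._ttl)
```

Keying on origin rather than full URL keeps the cache small, under the assumption that proxy selection rarely differs between paths on the same host; that assumption would need to be validated against PAC-script behavior before shipping anything like this.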

Flags: needinfo?(kershaw)
Regressions: 1979377
QA Whiteboard: [qa-triage-done-c144/b143]
See Also: → 1994314
