Fenix: DoH performance degrades under high concurrent request load
Categories
(Core :: Networking: DNS, defect, P3)
Tracking
()
| | Tracking | Status |
|---|---|---|
| firefox143 | --- | fixed |
People
(Reporter: acreskey, Assigned: kershaw)
References
(Blocks 3 open bugs)
Details
(Whiteboard: [necko-triaged])
Attachments
(7 files)
Testing on a mid-range Samsung A54 with a PGO-optimized Nightly build (-O2), I'm seeing DoH DNS resolution time degrade when a large number of requests are made in a short period of time.
This test site looks up 100 domains by creating an image of each one's favicon.
https://acreskeymoz.github.io/dns/test_lookups.html
While this is a lot of lookups, designed to stress the system, top news sites like www.wsj.com can make a similar number of domain requests.
(Note that with DoH and native dns we make 3 requests for every domain: A, AAAA, HTTPS)
What I'm seeing is that under heavy load the resolution time for DoH requests can grow much more rapidly than that for native dns.
| Metric | Native DNS | TRR DNS |
|---|---|---|
| Count | 306 | 297 |
| Mean (ms) | 399.535 | 1048.750 |
| Median (ms) | 210.036 | 811.085 |
| Std Dev (ms) | 647.351 | 803.899 |
| Min (ms) | 4.573 | 179.793 |
| Max (ms) | 3867.147 | 2992.642 |

Summary:
Native DNS is faster on average by 649.214 ms
For this test the OS DNS cache was flushed between configuration changes to ensure native dns requests had to at least go off device.
Done on home wifi
Comment 1 (Reporter) • 10 months ago
Plotting DNS lookup time by request number shows a pattern where DoH requests experience increasing delays relative to native DNS.
Comment 2 (Reporter) • 10 months ago
Profile:
https://share.firefox.dev/4lNoFFf
Profile with moz_logs (very slow, captured a portion of the navigation)
https://share.firefox.dev/4eIlBIa
Comment 3 (Reporter) • 10 months ago
(In reply to Andrew Creskey [:acreskey] from comment #2)
> Profile:
> https://share.firefox.dev/4lNoFFf
Wondering if we're reaching some concurrent stream limit on the TRR connection? Socket thread is doing work but doesn't look to be 100% bottlenecked.
As an aside, in the field experiment we saw ~20% of TRR requests complete over HTTP/1.1 on Fenix -- a high domain request count would be a real problem there.
Comment 4 (Reporter) • 10 months ago
In bug 1976996 we're seeing a higher rate (up to 20%) of TRR service channel connections being made over HTTP/1.1, but in these local runs I'm seeing the service channel connection being made over HTTP/2.
Comment 5 (Reporter) • 10 months ago
It looks like the problem is much worse on slower networks and, of course, on sites that connect to numerous domains.
In this test I'm using a Pixel 6 with an optimized Fenix Nightly build, connected to my wifi router.
Test site: https://www.wsj.com/
| Metric | Native DNS | TRR DNS |
|---|---|---|
| Count | 123 | 122 |
| Mean (ms) | 133.476 | 1597.327 |
| Median (ms) | 60.551 | 1324.477 |
| Std Dev (ms) | 466.338 | 1112.312 |
| Min (ms) | 2.130 | 112.107 |
| Max (ms) | 3075.796 | 5649.653 |

Summary:
Native DNS is faster on average by 1463.851 ms
Comment 6 (Reporter) • 10 months ago
Lookup times by request number, native/doh for www.wsj.com
Comment 7 (Reporter) • 10 months ago
CDF of lookup times, native/doh for www.wsj.com
Comment 8 (Reporter) • 10 months ago
This is a profile of the Pixel 6 loading www.wsj.com, https://share.firefox.dev/4lpK6fP
But I'm seeing log messages in there so the timing will be distorted.
Comment 9 (Assignee) • 10 months ago
I found two areas where we could improve:
- Proxy resolution
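The log below shows that proxy resolution takes roughly 600 ms, about 584 ms of which is spent waiting for the runnable to run on the main thread. This delay occurs because we need to post a runnable from the TRR background thread to the main thread and then wait for the result to come back.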
2025-07-11 14:10:33.917739501 UTC - [Parent Process 29829: TRR Background]: D/nsHttp TRRServiceChannel::ResolveProxy [this=77b10e7800]
2025-07-11 14:10:34.501613769 UTC - [Parent Process 29829: GeckoMain]: D/nsHttp TRRServiceChannel::ResolveProxy [this=77b10e7800]
2025-07-11 14:10:34.502220947 UTC - [Parent Process 29829: GeckoMain]: D/nsHttp TRRServiceChannel::OnProxyAvailable [this=77b10e7800 pi=0 status=0 mStatus=0]
2025-07-11 14:10:34.543864501 UTC - [Parent Process 29829: TRR Background]: D/nsHttp TRRServiceChannel::OnProxyAvailable [this=77b10e7800 pi=0 status=0 mStatus=0]
- HTTP/2 stream limit handling
The log also indicates that we hit the maximum concurrent stream limit for HTTP/2. When this happens, we don’t handle it efficiently. The current code wakes all waiting streams in the queue when a single stream slot becomes available, even though only one can proceed. This results in unnecessary wakeups and wasted cycles.
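To give a rough sense of why waking every waiter is wasteful, here is a toy model in Python. It is not Gecko code: the function name `count_wakeups`, the limit of 100 concurrent streams, and the figure of 300 queued lookups (100 domains times A/AAAA/HTTPS) are all assumptions chosen for illustration. Each loop iteration stands for one active stream finishing and freeing a single slot.

```python
from collections import deque


def count_wakeups(num_requests: int, max_streams: int, wake_all: bool) -> int:
    """Count how many times queued streams are woken until the queue drains.

    Toy model: each loop iteration represents one active stream finishing,
    which frees exactly one slot on the connection.
    """
    # Requests beyond the stream limit have to wait in a queue.
    queued = deque(range(max(0, num_requests - max_streams)))
    wakeups = 0
    while queued:
        if wake_all:
            # Every queued stream wakes up and re-checks for a free slot.
            wakeups += len(queued)
        else:
            # Only the stream at the head of the queue is woken.
            wakeups += 1
        # Either way, exactly one stream gets the slot and leaves the queue.
        queued.popleft()
    return wakeups


if __name__ == "__main__":
    # Assumed numbers: 300 lookups and a hypothetical 100-stream limit.
    print(count_wakeups(300, 100, wake_all=True))   # 20100
    print(count_wakeups(300, 100, wake_all=False))  # 200
```

Under these assumptions, waking all waiters costs a number of wakeups that grows quadratically with the queue length, while waking only the head of the queue grows linearly.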
Comment 10 (Assignee) • 10 months ago
Comment 11 (Assignee) • 10 months ago
This pref was implemented a while ago for socket process but was never enabled. Let’s try enabling it in early Beta and see if we can observe any performance improvements.
Comment 12 • 9 months ago
Comment 13 • 9 months ago
Comment 14 • 9 months ago
Comment 15 (Reporter) • 9 months ago
I took Kershaw's two fixes from this bug and re-ran the scenario where DoH on Fenix was giving poor performance results (loading wsj.com, a site with a high domain count, on a medium-quality wifi connection with an older phone, a Pixel 6).
With the patches this scenario looks to be greatly improved, with performance roughly comparable to native DNS (see attachments).
Comment 16 (Reporter) • 9 months ago
We no longer see the TRR requests getting slower and slower.
No multi-second resolution times.
Comment 17 (Reporter) • 9 months ago
CDF is improved in this scenario as well.
Comment 18 • 9 months ago
Comment 19 (bugherder) • 9 months ago
https://hg.mozilla.org/mozilla-central/rev/0ef1d095bbeb
https://hg.mozilla.org/mozilla-central/rev/947d60ec731d
Comment 20 • 9 months ago
Is there a bug for the proxy resolution problem?
Comment 21 (Assignee) • 9 months ago
(In reply to Jeff Muizelaar [:jrmuizel] from comment #20)
> Is there a bug for the proxy resolution problem?
Not sure what you mean. We have bug 1770153 for improving the caching of system proxy information.
Unfortunately, the trick we used in that bug can't be applied to all HTTP requests, since whether a request uses a proxy can vary depending on the URL. We could potentially implement a general cache that stores the mapping between URLs and proxy information; maybe that would help.
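As a purely illustrative sketch of that idea, a URL-to-proxy-info cache could look something like the following Python. None of this reflects an existing Gecko API: the class name `ProxyInfoCache`, the time-to-live policy, and the `resolve` callback (standing in for the slow main-thread proxy resolution) are all hypothetical.

```python
import time
from typing import Callable, Dict, Tuple


class ProxyInfoCache:
    """Hypothetical URL -> proxy-info cache with a time-to-live."""

    def __init__(
        self,
        resolve: Callable[[str], str],
        ttl_seconds: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._resolve = resolve      # slow path, e.g. a main-thread round trip
        self._ttl = ttl_seconds      # assumed expiry policy
        self._clock = clock          # injectable so tests can control time
        self._entries: Dict[str, Tuple[float, str]] = {}

    def lookup(self, url: str) -> str:
        """Return cached proxy info for url, resolving it if missing or stale."""
        now = self._clock()
        cached = self._entries.get(url)
        if cached is not None and now - cached[0] < self._ttl:
            return cached[1]
        proxy = self._resolve(url)
        self._entries[url] = (now, proxy)
        return proxy

    def invalidate(self) -> None:
        """Drop all entries, e.g. after a proxy configuration change."""
        self._entries.clear()
```

A real implementation would also need to decide when entries become invalid (for example, on a proxy configuration or network change) and whether keying by full URL rather than by origin is worth the memory cost; the sketch leaves those questions open.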