Closed Bug 1976964 Opened 10 months ago Closed 9 months ago

Fenix: DoH performance degrades under high concurrent request load

Categories

(Core :: Networking: DNS, defect, P3)


Tracking

RESOLVED FIXED
143 Branch
Tracking Status
firefox143 --- fixed

People

(Reporter: acreskey, Assigned: kershaw)

References

(Blocks 3 open bugs)

Details

(Whiteboard: [necko-triaged])

Attachments

(7 files)

Testing on a mid-range Samsung A54 with a PGO-optimized Nightly build (-O2), I'm seeing DoH DNS resolution time degrade when a large number of requests are made in a short period of time.

This test site looks up 100 domains by creating an image element for each domain's favicon:
https://acreskeymoz.github.io/dns/test_lookups.html

While this is a lot of lookups, designed to stress the system, top news sites like www.wsj.com can make a similar number of domain requests.
(Note that with both DoH and native DNS we make three requests for every domain: A, AAAA, and HTTPS.)

What I'm seeing is that under heavy load, resolution time for DoH requests can grow much more rapidly than for native DNS.

==================================================
Metric          Native DNS      TRR DNS        
--------------------------------------------------
Count           306             297            
Mean (ms)       399.535         1048.750       
Median (ms)     210.036         811.085        
Std Dev (ms)    647.351         803.899        
Min (ms)        4.573           179.793        
Max (ms)        3867.147        2992.642       

Summary:
Native DNS is faster on average by 649.214 ms

For this test the OS DNS cache was flushed between configuration changes to ensure native DNS requests had to at least go off device.
Testing was done on home Wi-Fi.
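The summary statistics in the tables above can be reproduced from raw lookup timings with a short script. This is a sketch with hypothetical sample values, not the actual measurement harness used for these runs:

```python
import statistics

def summarize(label, times_ms):
    # Compute the same summary statistics shown in the tables above.
    return {
        "label": label,
        "count": len(times_ms),
        "mean": statistics.mean(times_ms),
        "median": statistics.median(times_ms),
        "stdev": statistics.stdev(times_ms),
        "min": min(times_ms),
        "max": max(times_ms),
    }

# Hypothetical sample data; the real runs collected ~300 lookups each.
native = [4.6, 210.0, 399.5, 620.1, 3867.1]
trr = [179.8, 811.1, 1048.8, 1500.2, 2992.6]

for s in (summarize("Native DNS", native), summarize("TRR DNS", trr)):
    print(f"{s['label']}: n={s['count']} "
          f"mean={s['mean']:.3f}ms median={s['median']:.3f}ms")
```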

Plotting DNS lookup time by request number shows a pattern where DoH requests experience increasing delays relative to native DNS.

Blocks: necko-perf

Profile:
https://share.firefox.dev/4lNoFFf

Profile with moz_logs (very slow, captured a portion of the navigation)
https://share.firefox.dev/4eIlBIa

(In reply to Andrew Creskey [:acreskey] from comment #2)

Profile:
https://share.firefox.dev/4lNoFFf

Wondering if we're reaching some concurrent stream limit on the TRR connection? Socket thread is doing work but doesn't look to be 100% bottlenecked.

As an aside, in the field experiment we saw ~20% of TRR requests complete over HTTP/1.1 on Fenix -- a high domain request count would be a real problem there.

See Also: → 1976996

In bug 1976996 we're seeing a higher rate (up to 20%) of TRR service channel connections being made over HTTP/1.1, but in these local runs I'm seeing the service channel connection being made over HTTP/2.

It looks like the problem is much worse on slower networks and, of course, on sites that connect to numerous domains.

In this test I'm using an optimized Fenix Nightly on a Pixel 6, connecting to my Wi-Fi router.
Test site: https://www.wsj.com/

==================================================
Metric          Native DNS      TRR DNS        
--------------------------------------------------
Count           123             122            
Mean (ms)       133.476         1597.327       
Median (ms)     60.551          1324.477       
Std Dev (ms)    466.338         1112.312       
Min (ms)        2.130           112.107        
Max (ms)        3075.796        5649.653       

Summary:
Native DNS is faster on average by 1463.851 ms
Attached image wsj_lookup_time.png

Lookup times by request number, native/doh for www.wsj.com

Attached image wsj_cdf.png

CDF of lookup times, native/doh for www.wsj.com

This is a profile of the Pixel 6 loading www.wsj.com: https://share.firefox.dev/4lpK6fP
Note that logging was active during capture, so the timing will be distorted.

I found two areas where we could improve:

  1. Proxy resolution
    The log below shows that proxy resolution takes around 500ms. This delay occurs because we need to post a runnable from the TRR background thread and wait for the result on the main thread.
2025-07-11 14:10:33.917739501 UTC - [Parent Process 29829: TRR Background]: D/nsHttp TRRServiceChannel::ResolveProxy [this=77b10e7800]
2025-07-11 14:10:34.501613769 UTC - [Parent Process 29829: GeckoMain]: D/nsHttp TRRServiceChannel::ResolveProxy [this=77b10e7800]
2025-07-11 14:10:34.502220947 UTC - [Parent Process 29829: GeckoMain]: D/nsHttp TRRServiceChannel::OnProxyAvailable [this=77b10e7800 pi=0 status=0 mStatus=0]
2025-07-11 14:10:34.543864501 UTC - [Parent Process 29829: TRR Background]: D/nsHttp TRRServiceChannel::OnProxyAvailable [this=77b10e7800 pi=0 status=0 mStatus=0]
  2. HTTP/2 stream limit handling
    The log also indicates that we hit the HTTP/2 maximum concurrent stream limit. When this happens, we don't handle it efficiently: the current code wakes all waiting streams in the queue when a single stream slot becomes available, even though only one can proceed. This results in unnecessary wakeups and wasted cycles.
Blocks: 1801530
Assignee: nobody → kershaw
Status: NEW → ASSIGNED

This pref was implemented a while ago for the socket process but was never enabled. Let's try enabling it in early Beta and see if we can observe any performance improvements.

Pushed by chorotan@mozilla.com:
https://github.com/mozilla-firefox/firefox/commit/ed90036c6f61
https://hg.mozilla.org/integration/autoland/rev/b9ca1ed1b4e9
Revert "Bug 1976964 - Don't try to activate all HTTP/2 streams at once, r=necko-reviewers,jesup" for causing xpcshell failures on test_trr.js

Backed out for causing xpcshell failures on test_trr.js

Backout link

Push with failures

Failure log

Flags: needinfo?(kershaw)

I took Kershaw's two fixes from this bug and re-ran the scenario where DoH on Fenix was giving poor performance results (loading the high-domain-count site, wsj.com, on a medium-quality Wi-Fi connection, using an older Pixel 6 phone).

With the patches this scenario looks to be greatly improved, with performance roughly comparable to native DNS (see attachments).

We no longer see the TRR requests getting progressively slower, and there are no more multi-second resolution times.

Attached image updated_doh_cdf_wsj.png

CDF is improved in this scenario as well.

Flags: needinfo?(kershaw)
Status: ASSIGNED → RESOLVED
Closed: 9 months ago
Resolution: --- → FIXED
Target Milestone: --- → 143 Branch

Is there a bug for the proxy resolution problem?

Flags: needinfo?(kershaw)

(In reply to Jeff Muizelaar [:jrmuizel] from comment #20)

Is there a bug for the proxy resolution problem?

Not sure what you mean. We have bug 1770153 for improving the caching of system proxy information.

Unfortunately, the trick we used in that bug can't be applied to all HTTP requests, since whether a request uses a proxy can vary depending on the URL. We could potentially implement a general cache that stores the mapping between URLs and proxy information; maybe that would help.
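Such a cache might look like the following sketch. The class name, keying scheme, and TTL policy are all hypothetical illustrations, not an existing Necko API:

```python
import time

class ProxyInfoCache:
    """Hypothetical cache mapping a URL's (scheme, host, port) to resolved
    proxy info, so repeated requests to the same origin can skip the
    cross-thread proxy resolution round trip."""

    def __init__(self, ttl_seconds=300.0):
        self._ttl = ttl_seconds
        self._entries = {}  # (scheme, host, port) -> (proxy_info, expiry)

    def get(self, scheme, host, port, now=None):
        # Return cached proxy info, or None on a miss or expired entry.
        now = time.monotonic() if now is None else now
        entry = self._entries.get((scheme, host, port))
        if entry is not None and entry[1] > now:
            return entry[0]
        return None

    def put(self, scheme, host, port, proxy_info, now=None):
        now = time.monotonic() if now is None else now
        self._entries[(scheme, host, port)] = (proxy_info, now + self._ttl)
```

Keying on origin rather than full URL keeps the cache small, under the assumption that proxy selection rarely differs between paths on the same host; that assumption would need to be validated against PAC-script behavior before shipping anything like this.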

Flags: needinfo?(kershaw)
Regressions: 1979377
QA Whiteboard: [qa-triage-done-c144/b143]
See Also: → 1994314
