Fenix: DoH performance shows high latency when resolving multiple hostnames in parallel
Categories: Core :: Networking: DNS, defect, P2
People: Reporter: acreskey, Assigned: kershaw
References: Blocks 2 open bugs
Details: Keywords: leave-open, Whiteboard: [necko-triaged][necko-priority-queue]
Attachments: 1 file
Description
Very similar to bug 1976964, I can see long DNS resolution times when using DoH in Fenix on sites that need to resolve numerous domains simultaneously.
Steps to reproduce:
I'm using:
- Wifi
- DoH Mode 2
- Samsung A54
Dns53 lookup times: start off at 33ms, grow to ~40-60ms
https://share.firefox.dev/3JeisEJ
Cloudflare TRR: starts off at 37ms, but most requests quickly end up resolving in ~100-200ms
https://share.firefox.dev/4hc6LuW
CIRA TRR: starts off at 35ms, but most requests quickly end up resolving in ~150-300ms
https://share.firefox.dev/4nOxxvE
Expectation:
When making multiple requests simultaneously, the latency should not increase to such a degree.
Comment 1 • Reporter • 9 days ago
Binary here
I've added some adb instrumentation, but nothing that I think could add 50-250ms per request.
Comment 2 • Reporter • 9 days ago
On this site I'm counting ~26 nearly simultaneous domain lookups that come from the document.
Even with A, AAAA, and HTTPS records for each, we should be under the HTTP/2 stream limit.
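As a rough sanity check of that claim, here is a back-of-envelope sketch (assumption: a typical HTTP/2 SETTINGS_MAX_CONCURRENT_STREAMS value of around 100; the resolver may advertise a different limit):

```ts
// Back-of-envelope check: concurrent DoH requests vs. a typical HTTP/2 stream limit.
// The 100-stream limit is an assumption (a common server default), not a measured value.
const domains = 26;                 // nearly simultaneous lookups from the document
const recordTypesPerDomain = 3;     // A, AAAA, HTTPS
const concurrentDohRequests = domains * recordTypesPerDomain; // 78
const typicalStreamLimit = 100;

console.log(concurrentDohRequests, concurrentDohRequests <= typicalStreamLimit);
// -> 78 true: the stream limit alone shouldn't be the bottleneck
```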
Comment 3 • Reporter • 9 days ago
DoH, mode 3 profile, showing the issue
https://share.firefox.dev/4nSFTmg
Comment 4 • Reporter • 9 days ago
From today's Fenix Nightly Playstore build, with moz_logs captured
https://share.firefox.dev/4nbB4mT
Numerous multi-second DNS resolutions (some delays expected due to logging).
Comment 5 • Assignee • 8 days ago
Thanks a lot for the log. It's really helpful.
I noticed one TRR request timed out and was canceled, which closed the entire DoH connection. As a result, the other in-flight requests on that connection also failed. A single request cancellation shouldn’t tear down the whole connection. I’ll dig into this and work on improving the behavior.
2025-10-15 13:31:02.286487792 UTC - [Parent Process 27169: TRR Background]: D/nsHttp TRRServiceChannel::Cancel [this=7858368800 status=804b0055]
2025-10-15 13:31:02.286505371 UTC - [Parent Process 27169: TRR Background]: D/nsHttp nsHttpConnectionMgr::CancelTransaction [trans=7809736610 reason=804b0055]
2025-10-15 13:31:02.286537109 UTC - [Parent Process 27169: Socket Thread]: D/nsHttp nsHttpConnection::CloseTransaction[this=78411a3400 trans=784d2b9100 reason=804b0055]
2025-10-15 13:31:02.286558105 UTC - [Parent Process 27169: TRR Background]: D/nsSocketTransport STS dispatch [7841359420]
2025-10-15 13:31:02.286585937 UTC - [Parent Process 27169: Socket Thread]: D/nsHttp nsHttpConnection::DontReuse 78411a3400 spdysession=784d2b9100
2025-10-15 13:31:02.286609375 UTC - [Parent Process 27169: TRR Background]: D/nsSocketTransport PollableEvent::Signal
2025-10-15 13:31:02.286634765 UTC - [Parent Process 27169: Socket Thread]: D/nsHttp nsHttpConnection::MarkAsDontReuse 78411a3400
2025-10-15 13:31:02.286650878 UTC - [Parent Process 27169: Socket Thread]: D/nsHttp Http2Session::DontReuse 784d2b9100
2025-10-15 13:31:02.286666259 UTC - [Parent Process 27169: Socket Thread]: D/nsHttp closing associated mTransaction
2025-10-15 13:31:02.286748779 UTC - [Parent Process 27169: Socket Thread]: D/nsHttp Http2Session::Close 784d2b9100 804B0055
Comment 6 • Reporter • 8 days ago
(In reply to Kershaw Chang [:kershaw] from comment #5)
> Thanks a lot for the log. It's really helpful.
> I noticed one TRR request timed out and was canceled, which closed the entire DoH connection. As a result, the other in-flight requests on that connection also failed. A single request cancellation shouldn’t tear down the whole connection. I’ll dig into this and work on improving the behavior.
Good discovery.
I hope the timeout is not caused by running with about:logging, which was used to record the moz_logs in this profile.
I can record again with a reduced set of logged modules.
Comment 7 • Assignee • 8 days ago
Treat NS_ERROR_NET_TIMEOUT_EXTERNAL as a soft error, since a single TRR timeout shouldn’t close the entire DoH connection.
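To illustrate the intent of the patch (a hedged sketch only, not the actual Necko change; the types and method names below are hypothetical): a per-request timeout should reset only its own HTTP/2 stream instead of tearing down the shared DoH session.

```ts
// Illustrative sketch -- not real Gecko code. The point is the distinction between
// a stream-level ("soft") error and a session-level ("hard") error when one DoH
// request is cancelled.
type CancelReason = "timeout-external" | "protocol-error" | "shutdown";

interface DohSession {
  resetStream(streamId: number): void; // hypothetical: drop just this stream (RST_STREAM)
  close(): void;                       // hypothetical: tear down the whole connection
}

function isSoftError(reason: CancelReason): boolean {
  // Mirrors the patch's intent: NS_ERROR_NET_TIMEOUT_EXTERNAL is a caller-imposed
  // timeout on a single lookup, not a failure of the connection itself.
  return reason === "timeout-external";
}

function cancelDohRequest(session: DohSession, streamId: number, reason: CancelReason): void {
  if (isSoftError(reason)) {
    session.resetStream(streamId); // only this lookup is dropped
  } else {
    session.close();               // other in-flight lookups on the session fail too
  }
}
```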
Comment 8 • Assignee • 8 days ago
(In reply to Kershaw Chang [:kershaw] from comment #7)
> Created attachment 9520465 [details]
> Bug 1994314 - Treat NS_ERROR_NET_TIMEOUT_EXTERNAL as a soft error, r=#necko
>
> Treat NS_ERROR_NET_TIMEOUT_EXTERNAL as a soft error, since a single TRR timeout shouldn’t close the entire DoH connection.
Note: This patch doesn’t address the bottleneck. The bigger question is why the TRR requests time out. From the logs:
- ~194 ms until the socket thread receives the transaction.
- ~800 ms on the socket thread before the request is sent.
- ~1.2 s to process the response (including RTT).
This suggests the socket thread is really busy. The TRR request is competing with other HTTP work, so it’s delayed.
2025-10-15 13:31:00.759210205 UTC - [Parent Process 27169: TRR Background]: D/nsHttp TRRServiceChannel ctor [this=785834e000]
2025-10-15 13:31:00.764458007 UTC - [Parent Process 27169: TRR Background]: D/nsHttp TRRServiceChannel 785834e000 created nsHttpTransaction 7830dba810
2025-10-15 13:31:00.958610351 UTC - [Parent Process 27169: Socket Thread]: D/nsHttp nsHttpConnectionMgr::OnMsgNewTransaction [trans=7830dba800]
2025-10-15 13:31:01.075690185 UTC - [Parent Process 27169: Socket Thread]: D/nsHttp nsHttpTransaction::ReadSegments 7830dba800
2025-10-15 13:31:02.284472412 UTC - [Parent Process 27169: Socket Thread]: D/nsHttp nsHttpTransaction::WriteSegments 7830dba800
2025-10-15 13:31:02.284645507 UTC - [Parent Process 27169: TRR Background]: D/nsHttp TRRServiceChannel::Cancel [this=785834e000 status=804b0055]
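For reference, the deltas between the quoted log lines can be computed directly; a small sketch (timestamps copied from the excerpt above, expressed as seconds after 13:31:00; the excerpt doesn't show every intermediate step, so this is only a partial picture):

```ts
// Deltas between the log lines quoted above (seconds after 13:31:00).
const events: Array<[string, number]> = [
  ["TRRServiceChannel ctor",                0.759210],
  ["nsHttpTransaction created",             0.764458],
  ["OnMsgNewTransaction (socket thread)",   0.958610],
  ["nsHttpTransaction::ReadSegments",       1.075690],
  ["nsHttpTransaction::WriteSegments",      2.284472],
  ["TRRServiceChannel::Cancel",             2.286487],
];
for (let i = 1; i < events.length; i++) {
  const deltaMs = Math.round((events[i][1] - events[i - 1][1]) * 1000);
  console.log(`${events[i][0]}: +${deltaMs} ms`);
}
// ctor -> Cancel is ~1527 ms in total, i.e. the request is cancelled roughly
// 1.5 s after it was created (presumably the external per-request timeout
// referred to in comment 5).
```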
Comment 9 • 7 days ago
I think the delay makes sense.
The socket thread is busy, so the processing of the transaction dispatched from mozilla::net::TRRServiceChannel::Connect will wait in the queue behind other things.
Then when the DoH response is received, we process the sockets in order in mozilla::net::nsSocketTransportService::DoPollIteration, so we might actually do other things before processing the DoH response.
Then we read from the socket, the data goes to the TRR thread, we parse the packet, and we call nsHostResolver::CompleteLookupLocked, which then goes to mozilla::net::DNSListenerProxy::OnLookupComplete, which dispatches to the original thread that called AsyncResolve. In the common case, when the resolution was triggered from mozilla::net::DnsAndConnectSocket::TransportSetup::ResolveHost, this was the socket thread, so we need to wait for the event queue to be processed yet again in order for the connection to proceed.
A few potential solutions I've discussed with Kershaw:
- Moving TRR resolution to the socket process - this would bypass the busy socket thread issue, but we'd still have the busy event queue, and would make DNS debugging a bit more difficult.
- Increasing the priority of the runnables posted to the socket thread (see the sketch after this list):
  - mozilla::net::nsHttpConnectionMgr::AddTransaction already has a priority attribute.
  - The other runnable we'd need to increase the priority for would be mozilla::net::DNSListenerProxy::OnLookupComplete.
- Make sure we're not needlessly cycling the TRR connection when network.trr.retry_on_recoverable_errors is true in mozilla::net::TRRService::ConfirmationContext::HandleEvent.
- If the prioritization of the runnables doesn't improve things, consider also calling OnSocketReady for TRR sockets before all others. Not sure if this would actually improve performance, and I'm reluctant to do it because it seems fishy for the socket transport service to be aware of the purpose of each socket.
I think options 2 and 3 are the most promising and relatively straightforward.
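To make option 2 concrete (a hedged sketch of the general idea, not the XPCOM event queue; the task names are just examples): if the socket thread drains its queue in priority order, bumping the priority of the DNS-completion runnable moves it ahead of ordinary HTTP work that is already queued.

```ts
// Toy model of priority-based dispatch -- not actual Gecko code.
interface Task { name: string; priority: number } // higher priority runs first

const socketThreadQueue: Task[] = [
  { name: "process image response",             priority: 0 },
  { name: "continue TLS handshake",              priority: 0 },
  { name: "DNSListenerProxy::OnLookupComplete",  priority: 1 }, // proposed bump
];

// Array.prototype.sort is stable, so equal-priority tasks keep their FIFO order.
const runOrder = [...socketThreadQueue]
  .sort((a, b) => b.priority - a.priority)
  .map(t => t.name);

console.log(runOrder);
// ["DNSListenerProxy::OnLookupComplete", "process image response", "continue TLS handshake"]
```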
Comment 10 • Reporter • 7 days ago
We're having some issues connecting profiles to the symbol server at the moment, but from what I've seen the busy socket thread does seem to be the problem.
With infinite time, I would consider a variation on option 1:
- Move all general networking to the socket process, and all DNS resolution, including TRR resolution, to a DNS process.
This way general networking would not be delayed by TRR requests and vice versa.
For option 2, increasing the priority of the runnables:
Since it's a relatively straightforward change, I think it's worth trying.
But to me there is a significant risk -- in the case of this site at least, we would be prioritizing TRR transactions (mostly for ad networks and trackers, it appears) over the actual contents of the site (js, images, etc.).
So we would need to carefully measure the impact of such a change.
Comment 11 • Pulsebot • 6 days ago
Comment 12 • 6 days ago
(In reply to Pulsebot from comment #11)
> Pushed by kjang@mozilla.com:
> https://github.com/mozilla-firefox/firefox/commit/d735f7f7b8fc
> https://hg.mozilla.org/integration/autoland/rev/1dab44851e4e
> Treat NS_ERROR_NET_TIMEOUT_EXTERNAL as a soft error, r=necko-reviewers,sunil
FWIW, it's probably better to file separate bugs for things like this. That makes it easier to track what's landed and whether there are regressions.
Comment 13 • Reporter • 6 days ago
This view of the profile contains symbols.
Comment 14 • 6 days ago
(In reply to Andrew Creskey [:acreskey] from comment #13)
> This view of the profile contains symbols.
This only has 120 samples on the socket thread over a 442ms range. This suggests that we're probably CPU-starved during the time period, so it's pretty hard to reason about the contents of the profile. (Turning off Java sampling might be a way to reduce some of the CPU starvation.)
Also, the simpleperf profile doesn't seem to have a lot of CPU time being spent on the socket thread. Given this, it seems like we have to be really careful about trusting what we see in the profile. It also suggests to me that changing the priority of things or moving to a different process isn't necessarily going to give great wins.
Some things I did notice:
- Glean telemetry overhead is still showing up.
- The socket thread seems to be doing blocking I/O in sdb_FindObjects via sqlite3, which seems not ideal.
- The CPU time being spent in the TRR Background thread seems like mostly overhead.
Comment 15 • bugherder • 6 days ago
Comment 16 • Reporter • 3 days ago
I haven't looked at it in depth, but this profile has disabled Java sampling.
Focused on the period with the slower DNS resolution.
Comment 17 • Reporter • 3 days ago
In the above and other profiles we can also see the logs from the TRRServiceChannel timing distribution accumulations.
e.g.
TimingDistribution::accumulate — networking.trr_dns_start/private.canadianshield.cira.ca_2 private.canadianshield.cira.ca_2: 7.808µs
TimingDistribution::accumulate — networking.trr_dns_end/private.canadianshield.cira.ca_2 private.canadianshield.cira.ca_2: 2.438769ms
TimingDistribution::accumulate — networking.trr_open_to_first_sent/private.canadianshield.cira.ca_2 private.canadianshield.cira.ca_2: 9.179693ms
TimingDistribution::accumulate — networking.trr_first_sent_to_last_received/private.canadianshield.cira.ca_2 private.canadianshield.cira.ca_2: 221.474577ms
TimingDistribution::accumulate — networking.trr_complete_load/private.canadianshield.cira.ca_2 private.canadianshield.cira.ca_2: 230.65427ms
TimingDistribution::accumulate — networking.trr_open_to_first_received/private.canadianshield.cira.ca_2 private.canadianshield.cira.ca_2: 230.624693ms
MemoryDistribution::accumulate - networking.trr_request_size private.canadianshield.cira.ca_2: 383
MemoryDistribution::accumulate - networking.trr_response_size private.canadianshield.cira.ca_2: 233
Counter::add - networking.trr_request_count regular: 1
Timings in this range look pretty common, suggesting that for whatever reason it's taking a long time to receive the response.
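A quick check of those numbers (values copied from the log lines above) confirms that the load time is dominated by the wait between sending the request and receiving the response:

```ts
// Values taken from the TimingDistribution accumulations quoted above (milliseconds).
const openToFirstSent = 9.179693;
const firstSentToLastReceived = 221.474577;
const completeLoad = 230.65427;

console.log((openToFirstSent + firstSentToLastReceived).toFixed(3)); // "230.654" ~ complete_load
console.log(((firstSentToLastReceived / completeLoad) * 100).toFixed(1) + "%");
// ~96.0% of the total load time is spent waiting for the DoH response after it was sent.
```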
Comment 18 • Reporter • 3 days ago
This is a simple test page that stresses the browser's DNS resolution:
https://acreskeymoz.github.io/dns/dns_prefetch_and_requests.html
There are three tests, each initiated by a button press:
- Test one DNS resolution every second via dns-prefetch
- Test 100 DNS requests at once via dns-prefetch
- Test 100 DNS requests at once by loading favicons from 100 domains
This profile captures all three:
https://deploy-preview-5578--perf-html.netlify.app/public/8f8b8v1hw46427qjx9bt5rhb9j22md83jfzkzcg/marker-chart/?globalTrackOrder=504213&hiddenGlobalTracks=12&markerSearch=lookup&thread=0&v=12
- Test 1: very roughly 50-80ms / resolution
- Test 2: ~160-260ms / resolution
- Test 3: 80-1800ms / resolution (socket thread busy with connect, handshake, cert work, etc.)
Not clear how representative this is of real world situations.
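For reference, a hedged sketch of what such a page might do (the actual page at the URL above may differ; the hostnames below are placeholders): test 2 injects dns-prefetch hints for many distinct hosts at once, while test 3 forces full connections by loading a favicon from each host.

```ts
// Illustrative sketch of a DNS-stress test page -- not the actual page source.
// HOSTS is a placeholder list of 100 distinct hostnames.
const HOSTS: string[] = Array.from({ length: 100 }, (_, i) => `host${i}.example.com`);

// Test 2: 100 dns-prefetch hints at once. Each hint asks the browser to resolve
// the hostname without opening a connection.
function prefetchAll(): void {
  for (const host of HOSTS) {
    const link = document.createElement("link");
    link.rel = "dns-prefetch";
    link.href = `//${host}`;
    document.head.appendChild(link);
  }
}

// Test 3: 100 favicon loads at once. Each image load needs DNS resolution plus a
// TCP/TLS connection, so this also stresses the socket thread.
function loadFavicons(): void {
  for (const host of HOSTS) {
    const img = new Image();
    img.src = `https://${host}/favicon.ico`;
    document.body.appendChild(img);
  }
}
```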