Bug 1994314 (Open): Opened 9 days ago, Updated 3 days ago

Fenix: DoH performance shows high latency when resolving multiple hostnames in parallel

Categories

(Core :: Networking: DNS, defect, P2)

Tracking

Status: ASSIGNED

People

(Reporter: acreskey, Assigned: kershaw)

References

(Blocks 2 open bugs)

Details

(Keywords: leave-open, Whiteboard: [necko-triaged][necko-priority-queue])

Attachments

(1 file)

Very similar to bug 1976964, I can see long DNS resolution times when using DoH in Fenix on sites that need to resolve numerous domains simultaneously.

Steps to reproduce:

Load https://www.usatoday.com

I'm using:
Wi-Fi
DoH Mode 2
Samsung A54

Dns53 (plain DNS) lookup times: start off at 33 ms, grow to ~40-60 ms
https://share.firefox.dev/3JeisEJ

Cloudflare TRR: starts off at 37 ms, but most requests quickly end up resolving in ~100-200 ms
https://share.firefox.dev/4hc6LuW

CIRA TRR: starts off at 35 ms, but most requests quickly end up resolving in ~150-300 ms
https://share.firefox.dev/4nOxxvE

Expectation:
When making multiple requests simultaneously, the latency should not increase to such a degree.

Binary here
I've added some adb instrumentation, but I don't think it's anything that could add 50-250 ms per request.

On this site I'm counting ~26 nearly simultaneous domain lookups that come from the document.
Even with A, AAAA, and HTTPS records for each, we should be under the HTTP/2 stream limit.

DoH mode 3 profile, showing the issue:
https://share.firefox.dev/4nSFTmg

From today's Fenix Nightly Play Store build, with moz_logs captured:

https://share.firefox.dev/4nbB4mT

Numerous multi-second DNS resolutions (some delay is expected due to logging).

Thanks a lot for the log. It's really helpful.
I noticed one TRR request timed out and was canceled, which closed the entire DoH connection. As a result, the other in-flight requests on that connection also failed. A single request cancellation shouldn’t tear down the whole connection. I’ll dig into this and work on improving the behavior.

2025-10-15 13:31:02.286487792 UTC - [Parent Process 27169: TRR Background]: D/nsHttp TRRServiceChannel::Cancel [this=7858368800 status=804b0055]
2025-10-15 13:31:02.286505371 UTC - [Parent Process 27169: TRR Background]: D/nsHttp nsHttpConnectionMgr::CancelTransaction [trans=7809736610 reason=804b0055]
2025-10-15 13:31:02.286537109 UTC - [Parent Process 27169: Socket Thread]: D/nsHttp nsHttpConnection::CloseTransaction[this=78411a3400 trans=784d2b9100 reason=804b0055]
2025-10-15 13:31:02.286558105 UTC - [Parent Process 27169: TRR Background]: D/nsSocketTransport STS dispatch [7841359420]
2025-10-15 13:31:02.286585937 UTC - [Parent Process 27169: Socket Thread]: D/nsHttp nsHttpConnection::DontReuse 78411a3400 spdysession=784d2b9100
2025-10-15 13:31:02.286609375 UTC - [Parent Process 27169: TRR Background]: D/nsSocketTransport PollableEvent::Signal
2025-10-15 13:31:02.286634765 UTC - [Parent Process 27169: Socket Thread]: D/nsHttp nsHttpConnection::MarkAsDontReuse 78411a3400
2025-10-15 13:31:02.286650878 UTC - [Parent Process 27169: Socket Thread]: D/nsHttp Http2Session::DontReuse 784d2b9100
2025-10-15 13:31:02.286666259 UTC - [Parent Process 27169: Socket Thread]: D/nsHttp closing associated mTransaction
2025-10-15 13:31:02.286748779 UTC - [Parent Process 27169: Socket Thread]: D/nsHttp Http2Session::Close 784d2b9100 804B0055
Priority: P3 → P2
Whiteboard: [necko-triaged] → [necko-triaged][necko-priority-queue]

(In reply to Kershaw Chang [:kershaw] from comment #5)

Thanks a lot for the log. It's really helpful.
I noticed one TRR request timed out and was canceled, which closed the entire DoH connection. As a result, the other in-flight requests on that connection also failed. A single request cancellation shouldn’t tear down the whole connection. I’ll dig into this and work on improving the behavior.

Good discovery.
I hope the timeout is not caused by running with about:logging, which was used to record the moz_logs in this profile.
I can record again with a reduced set of logged modules.

Treat NS_ERROR_NET_TIMEOUT_EXTERNAL as a soft error, since a single TRR timeout shouldn’t close the entire DoH connection.
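For illustration only (this is not the actual patch): a minimal standalone sketch of what treating the timeout as a soft error means, using a hypothetical helper that decides whether closing one transaction should also tear down the shared DoH connection.

// Hypothetical sketch, not Gecko code: treat NS_ERROR_NET_TIMEOUT_EXTERNAL
// as a "soft" error that ends only the timed-out transaction, while other
// failures still close the connection. The numeric value below is taken from
// the 804b0055 status seen in the log above.
#include <cstdint>
#include <cstdio>

using nsresult = uint32_t;
constexpr nsresult NS_OK = 0x0;
constexpr nsresult NS_ERROR_NET_TIMEOUT_EXTERNAL = 0x804B0055;

// Should closing a single transaction with this status also close the HTTP/2
// connection that the other in-flight TRR requests share?
bool ShouldTearDownConnection(nsresult aCloseReason) {
  if (aCloseReason == NS_OK) {
    return false;  // normal completion: keep the connection for reuse
  }
  if (aCloseReason == NS_ERROR_NET_TIMEOUT_EXTERNAL) {
    return false;  // soft error: one timed-out lookup shouldn't kill the session
  }
  return true;  // other (hard) errors still tear the connection down
}

int main() {
  std::printf("timeout tears down connection: %d\n",
              ShouldTearDownConnection(NS_ERROR_NET_TIMEOUT_EXTERNAL));
  std::printf("other failure tears down connection: %d\n",
              ShouldTearDownConnection(0x804B0047 /* some other network error, for contrast */));
}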

Assignee: nobody → kershaw
Status: NEW → ASSIGNED

(In reply to Kershaw Chang [:kershaw] from comment #7)

Created attachment 9520465 [details]
Bug 1994314 - Treat NS_ERROR_NET_TIMEOUT_EXTERNAL as a soft error, r=#necko

Treat NS_ERROR_NET_TIMEOUT_EXTERNAL as a soft error, since a single TRR timeout shouldn’t close the entire DoH connection.

Note: This patch doesn’t address the bottleneck. The bigger question is why the TRR requests time out. From the logs:

  • ~194 ms until the socket thread receives the transaction.
  • ~800 ms on the socket thread before the request is sent.
  • ~1.2 s to process the response (including RTT).

This suggests the socket thread is really busy. The TRR request is competing with other HTTP work, so it's delayed.

2025-10-15 13:31:00.759210205 UTC - [Parent Process 27169: TRR Background]: D/nsHttp TRRServiceChannel ctor [this=785834e000]
2025-10-15 13:31:00.764458007 UTC - [Parent Process 27169: TRR Background]: D/nsHttp TRRServiceChannel 785834e000 created nsHttpTransaction 7830dba810
2025-10-15 13:31:00.958610351 UTC - [Parent Process 27169: Socket Thread]: D/nsHttp nsHttpConnectionMgr::OnMsgNewTransaction [trans=7830dba800]
2025-10-15 13:31:01.075690185 UTC - [Parent Process 27169: Socket Thread]: D/nsHttp nsHttpTransaction::ReadSegments 7830dba800
2025-10-15 13:31:02.284472412 UTC - [Parent Process 27169: Socket Thread]: D/nsHttp nsHttpTransaction::WriteSegments 7830dba800
2025-10-15 13:31:02.284645507 UTC - [Parent Process 27169: TRR Background]: D/nsHttp TRRServiceChannel::Cancel [this=785834e000 status=804b0055]

I think the delay makes sense.
The socket thread is busy, so the transaction dispatched from mozilla::net::TRRServiceChannel::Connect waits in the queue behind other work.
Then, when the DoH response is received, we process the sockets in order in mozilla::net::nsSocketTransportService::DoPollIteration, so we may do other work before processing the DoH response.
Then we read from the socket and the data goes to the TRR thread, where we parse the packet and call nsHostResolver::CompleteLookupLocked. That goes to mozilla::net::DNSListenerProxy::OnLookupComplete, which dispatches to the original thread that called AsyncResolve. In the common case, when the resolution was triggered from mozilla::net::DnsAndConnectSocket::TransportSetup::ResolveHost, that thread is the socket thread, so we need to wait for its event queue to be processed yet again before the connection can proceed.

A few potential solutions I've discussed with Kershaw:

  1. Moving TRR resolution to the socket process - this would bypass the busy socket thread issue, but we'd still have the busy event queue, and it would make DNS debugging a bit more difficult.
  2. Increasing the priority of the runnables posted to the socket thread (a standalone sketch of this idea follows after this list).
  3. Making sure we're not needlessly cycling the TRR connection when network.trr.retry_on_recoverable_errors is true in mozilla::net::TRRService::ConfirmationContext::HandleEvent.
  4. If the prioritization of the runnables doesn't improve things, consider also calling OnSocketReady for TRR sockets before all others. Not sure if this would actually improve performance, and I'm reluctant to do it because it seems fishy for the socket transport service to be aware of the purpose of each socket.

I think 2 and 3 are the most promising and relatively straightforward.
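For what it's worth, here is a standalone sketch of the idea behind option 2. This is not Gecko's actual event queue API; it's a minimal, hypothetical two-tier runnable queue showing what "drain TRR work before ordinary socket-thread work" would mean, so a DoH response isn't stuck behind a backlog of other transactions.

// Standalone illustration (hypothetical, not Gecko code): a two-tier runnable
// queue that drains TRR-related runnables before ordinary socket-thread work.
#include <deque>
#include <functional>
#include <iostream>

class PrioritizedRunnableQueue {
 public:
  using Runnable = std::function<void()>;

  void Dispatch(Runnable aRunnable, bool aIsTrr) {
    (aIsTrr ? mTrrQueue : mNormalQueue).push_back(std::move(aRunnable));
  }

  // Run all queued TRR work first, then everything else.
  void ProcessPending() {
    while (!mTrrQueue.empty()) {
      auto run = std::move(mTrrQueue.front());
      mTrrQueue.pop_front();
      run();
    }
    while (!mNormalQueue.empty()) {
      auto run = std::move(mNormalQueue.front());
      mNormalQueue.pop_front();
      run();
    }
  }

 private:
  std::deque<Runnable> mTrrQueue;
  std::deque<Runnable> mNormalQueue;
};

int main() {
  PrioritizedRunnableQueue queue;
  // Simulate a busy socket thread: ordinary HTTP work queued first...
  for (int i = 0; i < 3; ++i) {
    queue.Dispatch([i] { std::cout << "http transaction " << i << "\n"; },
                   /* aIsTrr = */ false);
  }
  // ...then a DoH response arrives; with prioritization it runs first.
  queue.Dispatch([] { std::cout << "process DoH response\n"; },
                 /* aIsTrr = */ true);
  queue.ProcessPending();
}

The caveat raised later in this bug still applies: prioritizing TRR runnables could push DoH work ahead of page content on the socket thread, so the effect would need to be measured.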

Keywords: leave-open

We're having some issues connecting profiles to the symbol server at the moment, but from what I've seen the busy socket thread does seem to be the problem.

With infinite time, I would consider a variation on 1:

  • Move all general networking to the socket process, and all DNS resolution, including TRR resolution, to a dedicated DNS process,
    so that general networking would not be delayed by TRR requests and vice versa.

For option 2, increasing the priority of the runnables:
Since it's a relatively straightforward change, I think it's worth trying.

But to me there is a significant risk: on this site at least, we would be prioritizing TRR transactions (mostly for ad networks and trackers, it appears) over the actual contents of the site (JS, images, etc.).
So we would need to carefully measure the impact of such a change.

(In reply to Pulsebot from comment #11)

Pushed by kjang@mozilla.com:
https://github.com/mozilla-firefox/firefox/commit/d735f7f7b8fc
https://hg.mozilla.org/integration/autoland/rev/1dab44851e4e
Treat NS_ERROR_NET_TIMEOUT_EXTERNAL as a soft error, r=necko-reviewers,sunil

FWIW, it's probably better to file separate bugs for things like this. That makes it easier to track what's landed and whether there are regressions.

This view of the profile contains symbols.

(In reply to Andrew Creskey [:acreskey] from comment #13)

This view of the profile contains symbols.

This only has 120 samples on the socket thread over a 442 ms range. This suggests that we're probably CPU-starved during that time period, so it's pretty hard to reason about the contents of the profile. (Turning off Java sampling might be a way to reduce some of the CPU starvation.)

Also, the simpleperf profile doesn't seem to show a lot of CPU time being spent on the socket thread. Given this, it seems like we have to be really careful about trusting what we see in the profile. It also suggests to me that changing the priority of things or moving to a different process isn't necessarily going to give great wins.

Some things I did notice:

  • Glean telemetry overhead is still showing up.
  • The socket thread seems to be doing blocking I/O in sdb_FindObjects via sqlite3, which seems not ideal.
  • The CPU time being spent in the TRR Background thread seems like mostly overhead.

I haven't looked at it in depth, but this profile has Java sampling disabled.
It's focused on the period with the slower DNS resolution:

https://deploy-preview-5578--perf-html.netlify.app/public/237xp1evxm0apcyz5ffj4bb9s6wr1rrsmke1m38/stack-chart/?globalTrackOrder=504213&hiddenGlobalTracks=12&range=4019m338&thread=1&v=12

In the above and other profiles we can also see the logs from the TRRServiceChannel timing distribution accumulations.
e.g.

TimingDistribution::accumulate — networking.trr_dns_start/private.canadianshield.cira.ca_2 private.canadianshield.cira.ca_2: 7.808µs
TimingDistribution::accumulate — networking.trr_dns_end/private.canadianshield.cira.ca_2 private.canadianshield.cira.ca_2: 2.438769ms
TimingDistribution::accumulate — networking.trr_open_to_first_sent/private.canadianshield.cira.ca_2 private.canadianshield.cira.ca_2: 9.179693ms
TimingDistribution::accumulate — networking.trr_first_sent_to_last_received/private.canadianshield.cira.ca_2 private.canadianshield.cira.ca_2: 221.474577ms
TimingDistribution::accumulate — networking.trr_complete_load/private.canadianshield.cira.ca_2 private.canadianshield.cira.ca_2: 230.65427ms
TimingDistribution::accumulate — networking.trr_open_to_first_received/private.canadianshield.cira.ca_2 private.canadianshield.cira.ca_2: 230.624693ms
MemoryDistribution::accumulate - networking.trr_request_size private.canadianshield.cira.ca_2: 383
MemoryDistribution::accumulate - networking.trr_response_size private.canadianshield.cira.ca_2: 233
Counter::add - networking.trr_request_count regular: 1

here

Timings in this range look pretty common, suggesting that for whatever reason it's taking a long time to receive the response.

This is a simple test page that stresses the browser's DNS resolution:
https://acreskeymoz.github.io/dns/dns_prefetch_and_requests.html

There are three tests, each initiated by a button press:

  1. Test one DNS resolution every second via dns-prefetch
  2. Test 100 DNS requests at once via dns-prefetch
  3. Test 100 DNS requests at once by loading favicons from 100 domains

This profile captures all three:
https://deploy-preview-5578--perf-html.netlify.app/public/8f8b8v1hw46427qjx9bt5rhb9j22md83jfzkzcg/marker-chart/?globalTrackOrder=504213&hiddenGlobalTracks=12&markerSearch=lookup&thread=0&v=12

  1. Very roughly 50-80 ms per resolution
  2. ~160-260 ms per resolution
  3. 80-1800 ms per resolution (socket thread busy with connect, handshake, cert work, etc.)

Not clear how representative this is of real world situations.
