Bug 1994314 (Open): Opened 9 days ago, Updated 3 days ago

Fenix: DoH performance shows high latency when resolving multiple hostnames in parallel

Categories

(Core :: Networking: DNS, defect, P2)

Tracking

Status: ASSIGNED

People

(Reporter: acreskey, Assigned: kershaw)

References

(Blocks 2 open bugs)

Details

(Keywords: leave-open, Whiteboard: [necko-triaged][necko-priority-queue])

Attachments

(1 file)

Very similar to bug 1976964, I can see long DNS resolution times when using DoH in Fenix on sites that need to resolve numerous domains simultaneously.

Steps to reproduce:

Load https://www.usatoday.com

I'm using:
Wi-Fi
DoH Mode 2
Samsung A54

Dns53 (plain DNS) lookup times: start off at 33 ms, grow to ~40-60 ms
https://share.firefox.dev/3JeisEJ

Cloudflare TRR: starts off at 37 ms, but most requests quickly end up resolving in ~100-200 ms
https://share.firefox.dev/4hc6LuW

CIRA TRR: starts off at 35 ms, but most requests quickly end up resolving in ~150-300 ms
https://share.firefox.dev/4nOxxvE

Expectation:
When making multiple requests simultaneously, the latency should not increase to such a degree.

Binary here
I've added some adb instrumentation, but I don't think it's anything that could add 50-250 ms per request.

On this site I'm counting ~26 nearly simultaneous domain lookups that come from the document.
Even with A, AAAA, and HTTPS records for each, we should be under the HTTP/2 stream limit.

DoH mode 3 profile, showing the issue:
https://share.firefox.dev/4nSFTmg

From today's Fenix Nightly Play Store build, with moz_logs captured:

https://share.firefox.dev/4nbB4mT

Numerous multi-second DNS resolutions (some delay is expected due to logging).

Thanks a lot for the log. It's really helpful.
I noticed one TRR request timed out and was canceled, which closed the entire DoH connection. As a result, the other in-flight requests on that connection also failed. A single request cancellation shouldn’t tear down the whole connection. I’ll dig into this and work on improving the behavior.

2025-10-15 13:31:02.286487792 UTC - [Parent Process 27169: TRR Background]: D/nsHttp TRRServiceChannel::Cancel [this=7858368800 status=804b0055]
2025-10-15 13:31:02.286505371 UTC - [Parent Process 27169: TRR Background]: D/nsHttp nsHttpConnectionMgr::CancelTransaction [trans=7809736610 reason=804b0055]
2025-10-15 13:31:02.286537109 UTC - [Parent Process 27169: Socket Thread]: D/nsHttp nsHttpConnection::CloseTransaction[this=78411a3400 trans=784d2b9100 reason=804b0055]
2025-10-15 13:31:02.286558105 UTC - [Parent Process 27169: TRR Background]: D/nsSocketTransport STS dispatch [7841359420]
2025-10-15 13:31:02.286585937 UTC - [Parent Process 27169: Socket Thread]: D/nsHttp nsHttpConnection::DontReuse 78411a3400 spdysession=784d2b9100
2025-10-15 13:31:02.286609375 UTC - [Parent Process 27169: TRR Background]: D/nsSocketTransport PollableEvent::Signal
2025-10-15 13:31:02.286634765 UTC - [Parent Process 27169: Socket Thread]: D/nsHttp nsHttpConnection::MarkAsDontReuse 78411a3400
2025-10-15 13:31:02.286650878 UTC - [Parent Process 27169: Socket Thread]: D/nsHttp Http2Session::DontReuse 784d2b9100
2025-10-15 13:31:02.286666259 UTC - [Parent Process 27169: Socket Thread]: D/nsHttp closing associated mTransaction
2025-10-15 13:31:02.286748779 UTC - [Parent Process 27169: Socket Thread]: D/nsHttp Http2Session::Close 784d2b9100 804B0055
Priority: P3 → P2
Whiteboard: [necko-triaged] → [necko-triaged][necko-priority-queue]

(In reply to Kershaw Chang [:kershaw] from comment #5)

Thanks a lot for the log. It's really helpful.
I noticed one TRR request timed out and was canceled, which closed the entire DoH connection. As a result, the other in-flight requests on that connection also failed. A single request cancellation shouldn’t tear down the whole connection. I’ll dig into this and work on improving the behavior.

Good discovery.
I hope the timeout is not caused by running with about:logging, which was used to record the moz_logs in this profile.
I can record again with a reduced set of logged modules.

Treat NS_ERROR_NET_TIMEOUT_EXTERNAL as a soft error, since a single TRR timeout shouldn’t close the entire DoH connection.
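For illustration only (this is not the actual patch): a minimal standalone sketch of what treating the timeout as a soft error means, using a hypothetical helper that decides whether closing one transaction should also tear down the shared DoH connection.

// Hypothetical sketch, not Gecko code: treat NS_ERROR_NET_TIMEOUT_EXTERNAL
// as a "soft" error that ends only the timed-out transaction, while other
// failures still close the connection. The numeric value below is taken from
// the 804b0055 status seen in the log above.
#include <cstdint>
#include <cstdio>

using nsresult = uint32_t;
constexpr nsresult NS_OK = 0x0;
constexpr nsresult NS_ERROR_NET_TIMEOUT_EXTERNAL = 0x804B0055;

// Should closing a single transaction with this status also close the HTTP/2
// connection that the other in-flight TRR requests share?
bool ShouldTearDownConnection(nsresult aCloseReason) {
  if (aCloseReason == NS_OK) {
    return false;  // normal completion: keep the connection for reuse
  }
  if (aCloseReason == NS_ERROR_NET_TIMEOUT_EXTERNAL) {
    return false;  // soft error: one timed-out lookup shouldn't kill the session
  }
  return true;  // other (hard) errors still tear the connection down
}

int main() {
  std::printf("timeout tears down connection: %d\n",
              ShouldTearDownConnection(NS_ERROR_NET_TIMEOUT_EXTERNAL));
  std::printf("other failure tears down connection: %d\n",
              ShouldTearDownConnection(0x804B0047 /* some other network error, for contrast */));
}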

Assignee: nobody → kershaw
Status: NEW → ASSIGNED

(In reply to Kershaw Chang [:kershaw] from comment #7)

Created attachment 9520465 [details]
Bug 1994314 - Treat NS_ERROR_NET_TIMEOUT_EXTERNAL as a soft error, r=#necko

Treat NS_ERROR_NET_TIMEOUT_EXTERNAL as a soft error, since a single TRR timeout shouldn’t close the entire DoH connection.

Note: This patch doesn’t address the bottleneck. The bigger question is why the TRR requests time out. From the logs:

  • ~194 ms until the socket thread receives the transaction.
  • ~800 ms on the socket thread before the request is sent.
  • ~1.2 s to process the response (including RTT).

This suggests the socket thread is really busy. The TRR request is competing with other HTTP work, so it's delayed.

2025-10-15 13:31:00.759210205 UTC - [Parent Process 27169: TRR Background]: D/nsHttp TRRServiceChannel ctor [this=785834e000]
2025-10-15 13:31:00.764458007 UTC - [Parent Process 27169: TRR Background]: D/nsHttp TRRServiceChannel 785834e000 created nsHttpTransaction 7830dba810
2025-10-15 13:31:00.958610351 UTC - [Parent Process 27169: Socket Thread]: D/nsHttp nsHttpConnectionMgr::OnMsgNewTransaction [trans=7830dba800]
2025-10-15 13:31:01.075690185 UTC - [Parent Process 27169: Socket Thread]: D/nsHttp nsHttpTransaction::ReadSegments 7830dba800
2025-10-15 13:31:02.284472412 UTC - [Parent Process 27169: Socket Thread]: D/nsHttp nsHttpTransaction::WriteSegments 7830dba800
2025-10-15 13:31:02.284645507 UTC - [Parent Process 27169: TRR Background]: D/nsHttp TRRServiceChannel::Cancel [this=785834e000 status=804b0055]

I think the delay makes sense.
The socket thread is busy, so the transaction dispatched from mozilla::net::TRRServiceChannel::Connect waits in the queue behind other work.
Then, when the DoH response is received, we process the sockets in order in mozilla::net::nsSocketTransportService::DoPollIteration, so we may do other work before processing the DoH response.
Then we read from the socket and the data goes to the TRR thread, where we parse the packet and call nsHostResolver::CompleteLookupLocked. That goes to mozilla::net::DNSListenerProxy::OnLookupComplete, which dispatches to the original thread that called AsyncResolve. In the common case, when the resolution was triggered from mozilla::net::DnsAndConnectSocket::TransportSetup::ResolveHost, that thread is the socket thread, so we need to wait for its event queue to be processed yet again before the connection can proceed.

A few potential solutions I've discussed with Kershaw:

  1. Moving TRR resolution to the socket process - this would bypass the busy socket thread issue, but we'd still have the busy event queue, and it would make DNS debugging a bit more difficult.
  2. Increasing the priority of the runnables posted to the socket thread (a standalone sketch of this idea follows after this list).
  3. Making sure we're not needlessly cycling the TRR connection when network.trr.retry_on_recoverable_errors is true in mozilla::net::TRRService::ConfirmationContext::HandleEvent.
  4. If the prioritization of the runnables doesn't improve things, consider also calling OnSocketReady for TRR sockets before all others. Not sure if this would actually improve performance, and I'm reluctant to do it because it seems fishy for the socket transport service to be aware of the purpose of each socket.

I think 2 and 3 are the most promising and relatively straightforward.
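For what it's worth, here is a standalone sketch of the idea behind option 2. This is not Gecko's actual event queue API; it's a minimal, hypothetical two-tier runnable queue showing what "drain TRR work before ordinary socket-thread work" would mean, so a DoH response isn't stuck behind a backlog of other transactions.

// Standalone illustration (hypothetical, not Gecko code): a two-tier runnable
// queue that drains TRR-related runnables before ordinary socket-thread work.
#include <deque>
#include <functional>
#include <iostream>

class PrioritizedRunnableQueue {
 public:
  using Runnable = std::function<void()>;

  void Dispatch(Runnable aRunnable, bool aIsTrr) {
    (aIsTrr ? mTrrQueue : mNormalQueue).push_back(std::move(aRunnable));
  }

  // Run all queued TRR work first, then everything else.
  void ProcessPending() {
    while (!mTrrQueue.empty()) {
      auto run = std::move(mTrrQueue.front());
      mTrrQueue.pop_front();
      run();
    }
    while (!mNormalQueue.empty()) {
      auto run = std::move(mNormalQueue.front());
      mNormalQueue.pop_front();
      run();
    }
  }

 private:
  std::deque<Runnable> mTrrQueue;
  std::deque<Runnable> mNormalQueue;
};

int main() {
  PrioritizedRunnableQueue queue;
  // Simulate a busy socket thread: ordinary HTTP work queued first...
  for (int i = 0; i < 3; ++i) {
    queue.Dispatch([i] { std::cout << "http transaction " << i << "\n"; },
                   /* aIsTrr = */ false);
  }
  // ...then a DoH response arrives; with prioritization it runs first.
  queue.Dispatch([] { std::cout << "process DoH response\n"; },
                 /* aIsTrr = */ true);
  queue.ProcessPending();
}

The caveat raised later in this bug still applies: prioritizing TRR runnables could push DoH work ahead of page content on the socket thread, so the effect would need to be measured.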

Keywords: leave-open

We're having some issues connecting profiles to the symbol server at the moment, but from what I've seen the busy socket thread does seem to be the problem.

With infinite time, I would consider a variation on 1:

  • Move all general networking to the socket process, and all DNS resolution, including TRR resolution, to a dedicated DNS process,
    so that general networking would not be delayed by TRR requests and vice versa.

For option 2, increasing the priority of the runnables:
Since it's a relatively straightforward change, I think it's worth trying.

But to me there is a significant risk: on this site at least, we would be prioritizing TRR transactions (mostly for ad networks and trackers, it appears) over the actual contents of the site (JS, images, etc.).
So we would need to carefully measure the impact of such a change.

(In reply to Pulsebot from comment #11)

Pushed by kjang@mozilla.com:
https://github.com/mozilla-firefox/firefox/commit/d735f7f7b8fc
https://hg.mozilla.org/integration/autoland/rev/1dab44851e4e
Treat NS_ERROR_NET_TIMEOUT_EXTERNAL as a soft error, r=necko-reviewers,sunil

FWIW, it's probably better to file separate bugs for things like this. That makes it easier to track what's landed and whether there are regressions.

This view of the profile contains symbols.

(In reply to Andrew Creskey [:acreskey] from comment #13)

This view of the profile contains symbols.

This only has 120 samples on the socket thread over a 442 ms range. This suggests that we're probably CPU-starved during that time period, so it's pretty hard to reason about the contents of the profile. (Turning off Java sampling might be a way to reduce some of the CPU starvation.)

Also, the simpleperf profile doesn't seem to show a lot of CPU time being spent on the socket thread. Given this, it seems like we have to be really careful about trusting what we see in the profile. It also suggests to me that changing the priority of things or moving to a different process isn't necessarily going to give great wins.

Some things I did notice:

  • Glean telemetry overhead is still showing up.
  • The socket thread seems to be doing blocking I/O in sdb_FindObjects via sqlite3, which seems not ideal.
  • The CPU time being spent in the TRR Background thread seems like mostly overhead.

I haven't looked at it in depth, but this profile has Java sampling disabled.
It's focused on the period with the slower DNS resolution:

https://deploy-preview-5578--perf-html.netlify.app/public/237xp1evxm0apcyz5ffj4bb9s6wr1rrsmke1m38/stack-chart/?globalTrackOrder=504213&hiddenGlobalTracks=12&range=4019m338&thread=1&v=12

In the above and other profiles we can also see the logs from the TRRServiceChannel timing distribution accumulations.
e.g.

TimingDistribution::accumulate — networking.trr_dns_start/private.canadianshield.cira.ca_2 private.canadianshield.cira.ca_2: 7.808µs
TimingDistribution::accumulate — networking.trr_dns_end/private.canadianshield.cira.ca_2 private.canadianshield.cira.ca_2: 2.438769ms
TimingDistribution::accumulate — networking.trr_open_to_first_sent/private.canadianshield.cira.ca_2 private.canadianshield.cira.ca_2: 9.179693ms
TimingDistribution::accumulate — networking.trr_first_sent_to_last_received/private.canadianshield.cira.ca_2 private.canadianshield.cira.ca_2: 221.474577ms
TimingDistribution::accumulate — networking.trr_complete_load/private.canadianshield.cira.ca_2 private.canadianshield.cira.ca_2: 230.65427ms
TimingDistribution::accumulate — networking.trr_open_to_first_received/private.canadianshield.cira.ca_2 private.canadianshield.cira.ca_2: 230.624693ms
MemoryDistribution::accumulate - networking.trr_request_size private.canadianshield.cira.ca_2: 383
MemoryDistribution::accumulate - networking.trr_response_size private.canadianshield.cira.ca_2: 233
Counter::add - networking.trr_request_count regular: 1

here

Timings in this range look pretty common, suggesting that for whatever reason it's taking a long time to receive the response.

This is a simple test page that stresses the browser's DNS resolution:
https://acreskeymoz.github.io/dns/dns_prefetch_and_requests.html

There are three tests, each initiated by a button press:

  1. Test one DNS resolution every second via dns-prefetch
  2. Test 100 DNS requests at once via dns-prefetch
  3. Test 100 DNS requests at once by loading favicons from 100 domains

This profile captures all three:
https://deploy-preview-5578--perf-html.netlify.app/public/8f8b8v1hw46427qjx9bt5rhb9j22md83jfzkzcg/marker-chart/?globalTrackOrder=504213&hiddenGlobalTracks=12&markerSearch=lookup&thread=0&v=12

  1. Very roughly 50-80 ms per resolution
  2. ~160-260 ms per resolution
  3. 80-1800 ms per resolution (socket thread busy with connect, handshake, cert work, etc.)

Not clear how representative this is of real world situations.
