Fenix HTTP/2 Session to TRR is closing frequently (more than once per minute)
Categories
(Core :: Networking: DNS, defect, P3)
Tracking
()
| Tracking | Status | |
|---|---|---|
| firefox147 | --- | fixed |
People
(Reporter: acreskey, Assigned: valentin)
References
(Blocks 3 open bugs)
Details
(Whiteboard: [necko-triaged])
Attachments
(3 files)
I added some adb logging where we record the trr_request_count_per_conn telemetry (Http2Session::~Http2Session).
What I'm seeing is that under active browsing, the HTTP2Session is being closed frequently, generally more than once per minute
e.g.
10-28 14:27:17.000 4361 4529 V AJC : TRR telemetry: host=mozilla.cloudflare-dns.com_h2, requests=542, protocol=HTTP/2
10-28 14:28:00.961 4361 4529 V AJC : TRR telemetry: host=mozilla.cloudflare-dns.com_h2, requests=433, protocol=HTTP/2
10-28 14:28:19.743 4361 4529 V AJC : TRR telemetry: host=mozilla.cloudflare-dns.com_h2, requests=293, protocol=HTTP/2
10-28 14:28:46.857 4361 4529 V AJC : TRR telemetry: host=mozilla.cloudflare-dns.com_h2, requests=179, protocol=HTTP/2
10-28 14:31:22.869 4361 4529 V AJC : TRR telemetry: host=mozilla.cloudflare-dns.com_h2, requests=529, protocol=HTTP/2
10-28 14:32:10.666 4361 4529 V AJC : TRR telemetry: host=mozilla.cloudflare-dns.com_h2, requests=388, protocol=HTTP/2
10-28 14:32:31.619 4361 4529 V AJC : TRR telemetry: host=mozilla.cloudflare-dns.com_h2, requests=314, protocol=HTTP/2
I'll capture minimal moz_logs to try and find out why.
The expectation is that the connection is long lived to minimized connection overhead.
| Reporter | ||
Updated•1 month ago
|
| Reporter | ||
Comment 1•1 month ago
|
||
I likely won't have the chance to analyze this until I'm back, but in this profile with moz_logs, the http2 session is closed four times in less than one minute.
Two of the sessions only processed 17 and 15 requests.
10-28 14:44:17.160 12203 12532 V AJC : TRR telemetry: host=mozilla.cloudflare-dns.com_h2, requests=263, protocol=HTTP/2
10-28 14:44:58.907 12203 12532 V AJC : TRR telemetry: host=mozilla.cloudflare-dns.com_h2, requests=57, protocol=HTTP/2
10-28 14:44:59.695 12203 12532 V AJC : TRR telemetry: host=mozilla.cloudflare-dns.com_h2, requests=17, protocol=HTTP/2
10-28 14:45:05.191 12203 12532 V AJC : TRR telemetry: host=mozilla.cloudflare-dns.com_h2, requests=15, protocol=HTTP/2
https://profiler.firefox.com/public/5et7pydcjjzw4ys031f0xh1xk9rzch7c0vvccxg
Perhaps the connection is maintained but we are using new http2 sessions?
| Reporter | ||
Updated•1 month ago
|
| Reporter | ||
Comment 2•1 month ago
|
||
This is with "Increased Protection", i.e. trr.mode:2
| Reporter | ||
Comment 3•1 month ago
|
||
Smaller profile that's more manageable
https://share.firefox.dev/4oOnUxc
Seeing what looks like three connections (or at least HTTP2 sessions) closing within ~11 seconds
@ ~17 seconds: Counter::add - networking.trr_request_count_per_conn mozilla.cloudflare-dns.com_h2: 74
@ ~19 seconds: Counter::add - networking.trr_request_count_per_conn mozilla.cloudflare-dns.com_h2: 12
@ ~28 seconds: Counter::add - networking.trr_request_count_per_conn mozilla.cloudflare-dns.com_h2: 43
Looking at the first one
Counter::add - networking.trr_request_count_per_conn mozilla.cloudflare-dns.com_h2: 74
LogMessages — (nsHttp) FindCoalescableConnection(.SA.......[tlsflags0x00000000]mozilla.cloudflare-dns.com:443[TRR:1]) no matching conn
LogMessages — (nsHttp) GetH2orH3ActiveConn() request for ent 78f94fd340 .SA.......[tlsflags0x00000000]mozilla.cloudflare-dns.com:443[TRR:1] did not find an active connection
LogMessages — (nsHttp) FindCoalescableConnection .SA.......[tlsflags0x00000000]mozilla.cloudflare-dns.com:443[TRR:1]
LogMessages — (nsHttp) FindCoalescableConnection(.SA.......[tlsflags0x00000000]mozilla.cloudflare-dns.com:443[TRR:1]) no matching conn
LogMessages — (nsHttp) GetH2orH3ActiveConn() request for ent 78f94fd340 .SA.......[tlsflags0x00000000]mozilla.cloudflare-dns.com:443[TRR:1] did not find an active connection
...
LogMessages — (nsHttp) GetH2orH3ActiveConn() request for ent 78f94fd340 .SA.......[tlsflags0x00000000]mozilla.cloudflare-dns.com:443[TRR:1] did not find an active connection
LogMessages — (nsHttp) nsHttpConnectionMgr::MakeNewConnection [ci = .SA.......[tlsflags0x00000000]mozilla.cloudflare-dns.com:443[TRR:1]] Not Available Due to RestrictConnections()
LogMessages — (nsHttp) nsHttpConnectionMgr::TryDispatchTransaction without conn [trans=790eb2b700 ci=784318b300 ci=.SA.......[tlsflags0x00000000]mozilla.cloudflare-dns.com:443[TRR:1] caps=1081113 onlyreused=0 active=0 idle=0]
And I think we are indeed closing the connection as I see these:
2025-11-05 22:06:32.132521484 UTC - [Parent Process 31417: Socket Thread]: D/nsHttp ConnectionEntry::ClosePersistentConnections [ci=.SA.......[tlsflags0x00000000]mozilla.cloudflare-dns.com:443[TRR:1]]
...
2025-11-05 22:06:40.999251464 UTC - [Parent Process 31417: Socket Thread]: D/nsHttp ConnectionEntry::ClosePersistentConnections [ci=.SA.......[tlsflags0x00000000]mozilla.cloudflare-dns.com:443[TRR:1]]
...
2025-11-05 22:06:47.442156738 UTC - [Parent Process 31417: Socket Thread]: D/nsHttp ConnectionEntry::ClosePersistentConnections [ci=.SA.......[tlsflags0x00000000]mozilla.cloudflare-dns.com:443[TRR:1]]
| Reporter | ||
Comment 4•1 month ago
|
||
It also looks like we attempt four trr confirmations in this 45 second navigation
https://share.firefox.dev/43L4D7M
Comment 5•1 month ago
|
||
The confirmation was triggered because the previous TRR request failed due to a timeout.
2025-11-05 22:06:31.991982177 UTC - [Parent Process 31417: TRR Background]: D/nsHostResolver nsHostResolver::CompleteLookup rumcdn.geoedge.be 78276bb6f0 804B001E resolver=1 stillResolving=0
2025-11-05 22:06:31.992014648 UTC - [Parent Process 31417: TRR Background]: D/nsHostResolver nsHostResolver::MaybeRetryTRRLookup triggering Confirmation and retrying with TRR, skip reason was 11
I suspect the timeout was caused by the overhead of verbose logging. Andrew, could you try to reproduce with a reduced log level and see if it still occurs?
| Reporter | ||
Comment 6•1 month ago
|
||
Yes, it's a bit tricky to observe without modifying the underlying behaviour
For this one I had only have nsHttp:1,nsHostResolver:1 and I'm seeing what looks to be similar results.
https://share.firefox.dev/4p0aH4B
In bug 1992479 I'm landing new dns markers (including confirmation events), and then we can do marker-only profiling without any sampling or logging and that should at least be near-zero performance impact.
| Assignee | ||
Comment 7•1 month ago
|
||
So, I took a look at this last week.
The problem here is the network.trr.retry_on_recoverable_errors pref.
What happens is that we get a timeout for a TRR request, which could also happen because the DNS server is very slow.
When that happens, in nsHostResolver::MaybeRetryTRRLookup we call mozilla::net::TRRService::RetryTRRConfirm which calls mozilla::net::TRRService::ConfirmationContext::HandleEvent and sets aUseFreshConnection to true for mozilla::net::TRR::TRR. It then dispatches the TRR request again.
The confirmation with aUseFreshConnection == true will create a new connection, and the queued TRR request will be resolved with the fresh connection - but will fallback to Do53 if another failure occurs.
AFAIKT this mechanism was introduced to deal with unreasonlably high fallback rates when the DoH connection was hanging (maybe due to a network change event).
So we have a bunch of conflicting requirements here:
- We don't want to fallback to Do53 if we don't have to (so we retry the TRR with a new request)
- We want to recover from a hanging H2 connection quickly (so we create a new one).
- We want failing TRR requests to fallback to DoH quickly (initial TRR + retry with new connection + fallback to native vs initial TRR + wait for confirmation to time out + retry with new connection + fallback to native).
This ends up causing us issues when a random DoH request times out - either because the Socket/main thread is super busy, or simply because the DoH sever or authoritative server is slow to respond to some requests.
I have a couple of potential fixes in mind:
- Don't retry a TRR request with DoH if in the time between when the request was created and when it timed out we still received OnDataAvailable for other TRR requests - as that would indicate the connection still works.
- (this one would be more complicated) Instead of just setting LOAD_FRESH_CONNECTION on the confirmation to trigger a new connection, we could trigger a new confirmation with a new originSuffix/isolationKey - this way we could check if the new connection works while we keep using the old one.
| Reporter | ||
Comment 8•1 month ago
|
||
FYI, overhead profile (no sampling) of Nov 12 Fenix Nightly, loading and scrolling through NYtimes
https://share.firefox.dev/47QPcw3
I see some slower dns resolutions, up to 200-300ms, but no TRR confirmations here.
| Reporter | ||
Comment 9•1 month ago
|
||
This is a low-overhead profile of the dns-stress site, https://acreskeymoz.github.io/dns/dns_prefetch_and_requests.html
https://share.firefox.dev/49hgmi3
Many slow resolutions, up to 1000ms in duration but no TRR confirmations.
So this may be happening less frequently than I initially thought. Perhaps it's easier to reproduce on a slower network.
| Reporter | ||
Comment 10•1 month ago
|
||
This is another profile, from a Pixel 6.
While I don't see any confirmations in this one, it looks midway through the navigation session we fallback to dns53, and then go back to DoH
https://share.firefox.dev/4p81AyD
| Assignee | ||
Comment 11•1 month ago
|
||
Also makes the test use the node TRRServer instead of relying on the global moz-http2 server
Updated•1 month ago
|
| Assignee | ||
Comment 12•1 month ago
|
||
This change does the following:
- Sets network.trr.retry_on_recoverable_errors to false, which means that failed TRR requests immediately fallback to native in mode 2.
- Makes it so a failed confirmation will trigger the use of a new connection for the next confirmation
- Triggers a confirmation for every TRR failure
| Assignee | ||
Comment 13•1 month ago
|
||
This new test checks that failed confirmations trigger the use of a new connection for the following confirmation
Comment 14•1 month ago
|
||
Comment 15•1 month ago
|
||
| bugherder | ||
https://hg.mozilla.org/mozilla-central/rev/d4aa61618355
https://hg.mozilla.org/mozilla-central/rev/f599bd5f4d1a
https://hg.mozilla.org/mozilla-central/rev/e75efc0ed3b6
Updated•9 days ago
|
Description
•