Closed Bug 1996813 Opened 1 month ago Closed 1 month ago

Fenix HTTP/2 Session to TRR is closing frequently (more than once per minute)

Categories

(Core :: Networking: DNS, defect, P3)

defect

Tracking

()

RESOLVED FIXED
147 Branch
Tracking Status
firefox147 --- fixed

People

(Reporter: acreskey, Assigned: valentin)

References

(Blocks 3 open bugs)

Details

(Whiteboard: [necko-triaged])

Attachments

(3 files)

I added some adb logging where we record the trr_request_count_per_conn telemetry (Http2Session::~Http2Session).

What I'm seeing is that under active browsing, the HTTP2Session is being closed frequently, generally more than once per minute

e.g.

10-28 14:27:17.000  4361  4529 V AJC     : TRR telemetry: host=mozilla.cloudflare-dns.com_h2, requests=542, protocol=HTTP/2
10-28 14:28:00.961  4361  4529 V AJC     : TRR telemetry: host=mozilla.cloudflare-dns.com_h2, requests=433, protocol=HTTP/2
10-28 14:28:19.743  4361  4529 V AJC     : TRR telemetry: host=mozilla.cloudflare-dns.com_h2, requests=293, protocol=HTTP/2
10-28 14:28:46.857  4361  4529 V AJC     : TRR telemetry: host=mozilla.cloudflare-dns.com_h2, requests=179, protocol=HTTP/2
10-28 14:31:22.869  4361  4529 V AJC     : TRR telemetry: host=mozilla.cloudflare-dns.com_h2, requests=529, protocol=HTTP/2
10-28 14:32:10.666  4361  4529 V AJC     : TRR telemetry: host=mozilla.cloudflare-dns.com_h2, requests=388, protocol=HTTP/2
10-28 14:32:31.619  4361  4529 V AJC     : TRR telemetry: host=mozilla.cloudflare-dns.com_h2, requests=314, protocol=HTTP/2

I'll capture minimal moz_logs to try and find out why.

The expectation is that the connection is long lived to minimized connection overhead.

I likely won't have the chance to analyze this until I'm back, but in this profile with moz_logs, the http2 session is closed four times in less than one minute.
Two of the sessions only processed 17 and 15 requests.

10-28 14:44:17.160 12203 12532 V AJC     : TRR telemetry: host=mozilla.cloudflare-dns.com_h2, requests=263, protocol=HTTP/2
10-28 14:44:58.907 12203 12532 V AJC     : TRR telemetry: host=mozilla.cloudflare-dns.com_h2, requests=57, protocol=HTTP/2
10-28 14:44:59.695 12203 12532 V AJC     : TRR telemetry: host=mozilla.cloudflare-dns.com_h2, requests=17, protocol=HTTP/2
10-28 14:45:05.191 12203 12532 V AJC     : TRR telemetry: host=mozilla.cloudflare-dns.com_h2, requests=15, protocol=HTTP/2

https://profiler.firefox.com/public/5et7pydcjjzw4ys031f0xh1xk9rzch7c0vvccxg

Perhaps the connection is maintained but we are using new http2 sessions?

Whiteboard: [necko-triaged]

This is with "Increased Protection", i.e. trr.mode:2

Smaller profile that's more manageable

https://share.firefox.dev/4oOnUxc

Seeing what looks like three connections (or at least HTTP2 sessions) closing within ~11 seconds

@ ~17 seconds:  Counter::add - networking.trr_request_count_per_conn mozilla.cloudflare-dns.com_h2: 74
@ ~19 seconds:  Counter::add - networking.trr_request_count_per_conn mozilla.cloudflare-dns.com_h2: 12
@ ~28 seconds:  Counter::add - networking.trr_request_count_per_conn mozilla.cloudflare-dns.com_h2: 43

Looking at the first one

Counter::add - networking.trr_request_count_per_conn mozilla.cloudflare-dns.com_h2: 74
LogMessages — (nsHttp) FindCoalescableConnection(.SA.......[tlsflags0x00000000]mozilla.cloudflare-dns.com:443[TRR:1]) no matching conn

LogMessages — (nsHttp) GetH2orH3ActiveConn() request for ent 78f94fd340 .SA.......[tlsflags0x00000000]mozilla.cloudflare-dns.com:443[TRR:1] did not find an active connection
LogMessages — (nsHttp) FindCoalescableConnection .SA.......[tlsflags0x00000000]mozilla.cloudflare-dns.com:443[TRR:1]
LogMessages — (nsHttp) FindCoalescableConnection(.SA.......[tlsflags0x00000000]mozilla.cloudflare-dns.com:443[TRR:1]) no matching conn
LogMessages — (nsHttp) GetH2orH3ActiveConn() request for ent 78f94fd340 .SA.......[tlsflags0x00000000]mozilla.cloudflare-dns.com:443[TRR:1] did not find an active connection
...
LogMessages — (nsHttp) GetH2orH3ActiveConn() request for ent 78f94fd340 .SA.......[tlsflags0x00000000]mozilla.cloudflare-dns.com:443[TRR:1] did not find an active connection
LogMessages — (nsHttp) nsHttpConnectionMgr::MakeNewConnection [ci = .SA.......[tlsflags0x00000000]mozilla.cloudflare-dns.com:443[TRR:1]] Not Available Due to RestrictConnections()
LogMessages — (nsHttp) nsHttpConnectionMgr::TryDispatchTransaction without conn [trans=790eb2b700 ci=784318b300 ci=.SA.......[tlsflags0x00000000]mozilla.cloudflare-dns.com:443[TRR:1] caps=1081113 onlyreused=0 active=0 idle=0]

And I think we are indeed closing the connection as I see these:

2025-11-05 22:06:32.132521484 UTC - [Parent Process 31417: Socket Thread]: D/nsHttp ConnectionEntry::ClosePersistentConnections [ci=.SA.......[tlsflags0x00000000]mozilla.cloudflare-dns.com:443[TRR:1]]
...
2025-11-05 22:06:40.999251464 UTC - [Parent Process 31417: Socket Thread]: D/nsHttp ConnectionEntry::ClosePersistentConnections [ci=.SA.......[tlsflags0x00000000]mozilla.cloudflare-dns.com:443[TRR:1]]
...
2025-11-05 22:06:47.442156738 UTC - [Parent Process 31417: Socket Thread]: D/nsHttp ConnectionEntry::ClosePersistentConnections [ci=.SA.......[tlsflags0x00000000]mozilla.cloudflare-dns.com:443[TRR:1]]

It also looks like we attempt four trr confirmations in this 45 second navigation
https://share.firefox.dev/43L4D7M

See Also: → 1994314

The confirmation was triggered because the previous TRR request failed due to a timeout.

2025-11-05 22:06:31.991982177 UTC - [Parent Process 31417: TRR Background]: D/nsHostResolver nsHostResolver::CompleteLookup rumcdn.geoedge.be 78276bb6f0 804B001E resolver=1 stillResolving=0
2025-11-05 22:06:31.992014648 UTC - [Parent Process 31417: TRR Background]: D/nsHostResolver nsHostResolver::MaybeRetryTRRLookup triggering Confirmation and retrying with TRR, skip reason was 11

I suspect the timeout was caused by the overhead of verbose logging. Andrew, could you try to reproduce with a reduced log level and see if it still occurs?

Yes, it's a bit tricky to observe without modifying the underlying behaviour

For this one I had only have nsHttp:1,nsHostResolver:1 and I'm seeing what looks to be similar results.
https://share.firefox.dev/4p0aH4B

In bug 1992479 I'm landing new dns markers (including confirmation events), and then we can do marker-only profiling without any sampling or logging and that should at least be near-zero performance impact.

So, I took a look at this last week.
The problem here is the network.trr.retry_on_recoverable_errors pref.
What happens is that we get a timeout for a TRR request, which could also happen because the DNS server is very slow.
When that happens, in nsHostResolver::MaybeRetryTRRLookup we call mozilla::net::TRRService::RetryTRRConfirm which calls mozilla::net::TRRService::ConfirmationContext::HandleEvent and sets aUseFreshConnection to true for mozilla::net::TRR::TRR. It then dispatches the TRR request again.

The confirmation with aUseFreshConnection == true will create a new connection, and the queued TRR request will be resolved with the fresh connection - but will fallback to Do53 if another failure occurs.

AFAIKT this mechanism was introduced to deal with unreasonlably high fallback rates when the DoH connection was hanging (maybe due to a network change event).
So we have a bunch of conflicting requirements here:

  • We don't want to fallback to Do53 if we don't have to (so we retry the TRR with a new request)
  • We want to recover from a hanging H2 connection quickly (so we create a new one).
  • We want failing TRR requests to fallback to DoH quickly (initial TRR + retry with new connection + fallback to native vs initial TRR + wait for confirmation to time out + retry with new connection + fallback to native).

This ends up causing us issues when a random DoH request times out - either because the Socket/main thread is super busy, or simply because the DoH sever or authoritative server is slow to respond to some requests.

I have a couple of potential fixes in mind:

  • Don't retry a TRR request with DoH if in the time between when the request was created and when it timed out we still received OnDataAvailable for other TRR requests - as that would indicate the connection still works.
  • (this one would be more complicated) Instead of just setting LOAD_FRESH_CONNECTION on the confirmation to trigger a new connection, we could trigger a new confirmation with a new originSuffix/isolationKey - this way we could check if the new connection works while we keep using the old one.

FYI, overhead profile (no sampling) of Nov 12 Fenix Nightly, loading and scrolling through NYtimes
https://share.firefox.dev/47QPcw3
I see some slower dns resolutions, up to 200-300ms, but no TRR confirmations here.

This is a low-overhead profile of the dns-stress site, https://acreskeymoz.github.io/dns/dns_prefetch_and_requests.html
https://share.firefox.dev/49hgmi3
Many slow resolutions, up to 1000ms in duration but no TRR confirmations.
So this may be happening less frequently than I initially thought. Perhaps it's easier to reproduce on a slower network.

This is another profile, from a Pixel 6.

While I don't see any confirmations in this one, it looks midway through the navigation session we fallback to dns53, and then go back to DoH
https://share.firefox.dev/4p81AyD

Also makes the test use the node TRRServer instead of relying on the global moz-http2 server

Assignee: nobody → valentin.gosu
Status: NEW → ASSIGNED

This change does the following:

  • Sets network.trr.retry_on_recoverable_errors to false, which means that failed TRR requests immediately fallback to native in mode 2.
  • Makes it so a failed confirmation will trigger the use of a new connection for the next confirmation
  • Triggers a confirmation for every TRR failure

This new test checks that failed confirmations trigger the use of a new connection for the following confirmation

Pushed by valentin.gosu@gmail.com: https://github.com/mozilla-firefox/firefox/commit/12e4d8e30af2 https://hg.mozilla.org/integration/autoland/rev/d4aa61618355 Move test_connection_reuse_and_cycling into a separate trr test file r=necko-reviewers,kershaw https://github.com/mozilla-firefox/firefox/commit/1984fd7905bb https://hg.mozilla.org/integration/autoland/rev/f599bd5f4d1a Change network.trr.retry_on_recoverable_errors to false and cycle connections when confirmation fails r=necko-reviewers,kershaw https://github.com/mozilla-firefox/firefox/commit/bceb8a919a51 https://hg.mozilla.org/integration/autoland/rev/e75efc0ed3b6 Add new test for connection cycling when network.trr.retry_on_recoverable_errors is false r=necko-reviewers,kershaw
Blocks: 1801530
Status: ASSIGNED → RESOLVED
Closed: 1 month ago
Resolution: --- → FIXED
Target Milestone: --- → 147 Branch
QA Whiteboard: [qa-triage-done-c148/b147]
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: