1958018 - We appear to be making 5 dns requests for each host lookup (an extra A and AAAA request)

Reporter

Description

•

1 year ago

This profile with logs and profiler stacks included shows a reproducible example where we make 5 TRR requests for a single host lookup (www.instagram.com) -- an extra A (type 1) and an extra AAAA (type 2) request.

I've modifed TRR::SendHTTPRequest to also output the pointer %p of the TRR object.

https://share.firefox.dev/4j8cLoe

LogMessages — (nsHostResolver) TRR::SendHTTPRequest resolve 29a63e200 www.instagram.com type 65
LogMessages — (nsHostResolver) TRR::SendHTTPRequest resolve 29a63e4d0 www.instagram.com type 28
LogMessages — (nsHostResolver) TRR::SendHTTPRequest resolve 29a63e5c0 www.instagram.com type 1
LogMessages — (nsHostResolver) TRR::SendHTTPRequest resolve 29a63cf40 www.instagram.com type 28
LogMessages — (nsHostResolver) TRR::SendHTTPRequest resolve 29a63e110 www.instagram.com type 1

Note the 5 separate TRR objects and requests. (Each also receives OnDataAvailable and OnStopRequest).

This build makes the TRR constructor log, and I can see that two of the TRR objects are made in the parent process main thread:
https://share.firefox.dev/43CEwAS

LogMessages — (nsHostResolver) TRR::TRR resolve 29a63cf40 www.instagram.com type 28 
LogMessages — (nsHostResolver) TRR::TRR resolve 29a63e110 www.instagram.com type 1

Via this codepath:

mozilla::net::TRR::TRR(AHostResolver*, nsHostRecord*, mozilla::net::TrrType)/Users/acreskey/dev/mc3/netwerk/dns/TRR.cpp
mozilla::net::TRR::TRR(AHostResolver*, nsHostRecord*, mozilla::net::TrrType)/Users/acreskey/dev/mc3/netwerk/dns/TRR.cpp
mozilla::net::TRRQuery::PrepareQuery(mozilla::net::TrrType, nsTArray<RefPtr<mozilla::net::TRR> >&)/Users/acreskey/dev/mc3/netwerk/dns/TRRQuery.cpp
mozilla::net::TRRQuery::DispatchLookup(mozilla::net::TRR*)/Users/acreskey/dev/mc3/netwerk/dns/TRRQuery.cpp
nsHostResolver::TrrLookup(nsHostRecord*, mozilla::detail::BaseAutoLock<mozilla::Mutex&> const&, mozilla::net::TRR*)/Users/acreskey/dev/mc3/netwerk/dns/nsHostResolver.cpp
nsHostResolver::NameLookup(nsHostRecord*, mozilla::detail::BaseAutoLock<mozilla::Mutex&> const&)/Users/acreskey/dev/mc3/netwerk/dns/nsHostResolver.cpp
nsHostResolver::ResolveHost(nsTSubstring<char> const&, nsTSubstring<char> const&, int, unsigned short, mozilla::OriginAttributes const&, nsIDNSService::DNSFlags, unsigned short, nsResolveHostCallback*)/Users/acreskey/dev/mc3/netwerk/dns/nsHostResolver.cpp
nsDNSService::AsyncResolveInternal(nsTSubstring<char> const&, unsigned short, nsIDNSService::DNSFlags, nsIDNSAdditionalInfo*, nsIDNSListener*, nsIEventTarget*, mozilla::OriginAttributes const&, nsICancelable**)/Users/acreskey/dev/mc3/netwerk/dns/nsDNSService2.cpp
nsDNSPrefetch::Prefetch(nsIDNSService::DNSFlags)/Users/acreskey/dev/mc3/netwerk/base/nsDNSPrefetch.cpp
mozilla::net::nsHttpChannel::MaybeStartDNSPrefetch()/Users/acreskey/dev/mc3/netwerk/protocol/http/nsHttpChannel.cpp
mozilla::net::nsHttpChannel::BeginConnect()/Users/acreskey/dev/mc3/netwerk/protocol/http/nsHttpChannel.cpp

The other three are made on the socket thread:
https://share.firefox.dev/4laQvM6

LogMessages — (nsHostResolver) TRR::TRR resolve 29a63e200 www.instagram.com type 65 
LogMessages — (nsHostResolver) TRR::TRR resolve 29a63e4d0 www.instagram.com type 28 
LogMessages — (nsHostResolver) TRR::TRR resolve 29a63e5c0 www.instagram.com type 1

via this codepath

nsHostResolver::TrrLookup(nsHostRecord*, mozilla::detail::BaseAutoLock<mozilla::Mutex&> const&, mozilla::net::TRR*)/Users/acreskey/dev/mc3/netwerk/dns/nsHostResolver.cpp
nsHostResolver::NameLookup(nsHostRecord*, mozilla::detail::BaseAutoLock<mozilla::Mutex&> const&)/Users/acreskey/dev/mc3/netwerk/dns/nsHostResolver.cpp
nsHostResolver::ResolveHost(nsTSubstring<char> const&, nsTSubstring<char> const&, int, unsigned short, mozilla::OriginAttributes const&, nsIDNSService::DNSFlags, unsigned short, nsResolveHostCallback*)/Users/acreskey/dev/mc3/netwerk/dns/nsHostResolver.cpp
nsDNSService::AsyncResolveInternal(nsTSubstring<char> const&, unsigned short, nsIDNSService::DNSFlags, nsIDNSAdditionalInfo*, nsIDNSListener*, nsIEventTarget*, mozilla::OriginAttributes const&, nsICancelable**)/Users/acreskey/dev/mc3/netwerk/dns/nsDNSService2.cpp
mozilla::net::HTTPSRecordResolver::FetchHTTPSRRInternal(nsIEventTarget*, nsICancelable**)/Users/acreskey/dev/mc3/netwerk/protocol/http/HTTPSRecordResolver.cpp
mozilla::net::SpeculativeTransaction::FetchHTTPSRR()/Users/acreskey/dev/mc3/netwerk/protocol/http/SpeculativeTransaction.cpp
mozilla::net::nsHttpConnectionMgr::DoSpeculativeConnectionInternal(mozilla::net::ConnectionEntry*, mozilla::net::SpeculativeTransaction*, bool)/Users/acreskey/dev/mc3/netwerk/protocol/http/nsHttpConnectionMgr.cpp
RefPtr<mozilla::net::nsHttpConnectionMgr>::Proxy<void, int, mozilla::net::ARefBase*>::operator()<int&, RefPtr<mozilla::net::ARefBase>&>(int&, RefPtr<mozilla::net::ARefBase>&)/Users/acreskey/dev/mc3/obj-ff-dbg/dist/include/mozilla/RefPtr.h
mozilla::net::ConnEvent::Run()/Users/acreskey/dev/mc3/netwerk/protocol/http/nsHttpConnectionMgr.cpp

Andrew Creskey [:acreskey]

Reporter

Comment 1

•

1 year ago

This is another example -- new profile, default DoH provider
Note the 5 decoded TRR responses for the domain:
https://share.firefox.dev/42f8E2G

Valentin Gosu [:valentin] {{ PTO 3 July - 3 August }}

Assignee

Comment 2

•

1 year ago

The problem is that mozilla::net::HTTPSRecordResolver::FetchHTTPSRRInternal is setting the RESOLVE_CANONICAL_NAME flag, while nsDNSPrefetch::Prefetch is not.
Technically, we should be able to use records that RESOLVE_CANONICAL_NAME and are already in the cache, but because the flag is part of the key, we don't.
I guess it doesn't only affect TRR, but native DNS too.

For an immediate fix, I think I might do a second lookup in the DNS cache, with the additional flag. But long term I'm thinking of getting rid of the RESOLVE_CANONICAL_NAME flag entirely, and always set it when doing native DNS lookups.

Assignee: nobody → valentin.gosu

Blocks: necko-perf

Severity: -- → S3

Priority: -- → P1

Whiteboard: [necko-triaged]

Valentin Gosu [:valentin] {{ PTO 3 July - 3 August }}

Assignee

Comment 3

•

1 year ago

Attached file Bug 1958018 - Always set AI_CANONNAME flag when making DNS requests r=#necko — Details

This flag controls whether ai_canonname is set to the target CNAME when
calling getaddrinfo.
However, since it is part of the host record key, a request that doesn't have
the flag set will not reuse a DNS cache entry that does, even though
all of the information it needs is there.
This is a problem because consumers set this flag inconsistently.
Ideally we would remove it in the future.

Pulsebot

Comment 4

•

1 year ago

Pushed by valentin.gosu@gmail.com: https://hg.mozilla.org/integration/autoland/rev/f3e4d8e77ad1 Always set AI_CANONNAME flag when making DNS requests r=necko-reviewers,kershaw

Serban Stanca [:SerbanS]

Comment 5

•

1 year ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/f3e4d8e77ad1

Status: NEW → RESOLVED

Closed: 1 year ago

status-firefox139: --- → fixed

Resolution: --- → FIXED

Target Milestone: --- → 139 Branch

Andrew Creskey [:acreskey]

Reporter

Updated

•

1 year ago

Comment 6

•

1 year ago

This fix may somehow be the cause of the regression to later percentile top-level document dns lookup times seen primarily in Fenix
https://bugzilla.mozilla.org/show_bug.cgi?id=1960020

Andrew Creskey [:acreskey]

Reporter

Comment 7

•

1 year ago

Attached file Backout Bug 1958018 We appear to be making 5 trr requests for each host lookup r=#necko-reviewers,kershaw — Details

We've observed a large performance regression and this bug is a possible reason -- backing out to verify. See bug 1960020

Pulsebot

Comment 8

•

1 year ago

Backout by acreskey@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/00b688b8f45e Backout We appear to be making 5 trr requests for each host lookup r=kershaw,necko-reviewers

Atila Butkovits

Comment 9

•

1 year ago

Backed out for causing build bustages.

Backout link: https://hg-edge.mozilla.org/integration/autoland/rev/0500a825b5aedb1e00550decd05fccc4e5bfcc04

Push with failures: https://treeherder.mozilla.org/jobs?repo=autoland&selectedTaskRun=UpwznvA8SZiJ-BHEzfagKA.0&resultStatus=testfailed%2Cbusted%2Cexception%2Cretry%2Cusercancel&revision=00b688b8f45ea0a59bb253185473b96d00f90aba

Failure log: https://treeherder.mozilla.org/logviewer?job_id=504086976&repo=autoland&lineNumber=1171

Flags: needinfo?(acreskey)

Pulsebot

Comment 10

•

1 year ago

Backout by acreskey@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/bc76dbe69b70 Backout We appear to be making 5 trr requests for each host lookup r=kershaw,necko-reviewers

agoloman

Comment 11

•

1 year ago

Backout merged into central: https://hg-edge.mozilla.org/mozilla-central/rev/bc76dbe69b708949bc97ed9901121933ebf65cc3

Valentin Gosu [:valentin] {{ PTO 3 July - 3 August }}

Assignee

Updated

•

1 year ago

Status: RESOLVED → REOPENED

Resolution: FIXED → ---

BugBot [:suhaib / :marco/ :calixte]

Updated

•

1 year ago

status-firefox139: fixed → affected

Andrew Creskey [:acreskey]

Reporter

Comment 12

•

1 year ago

It looks like this patch was the cause of the Fenix dns regression, as backing it out looks to have restored the distribution

Flags: needinfo?(acreskey)

Mathew Hodson

Updated

•

1 year ago

Regressions: 1960020

Updated

•

1 year ago

Status: REOPENED → ASSIGNED

status-firefox139: affected → ---

Target Milestone: 139 Branch → ---

Valentin Gosu [:valentin] {{ PTO 3 July - 3 August }}

Assignee

Comment 13

•

1 year ago

Thank you for backing it out Andrew.
I'll try to look into whether ai_canonname lookups are slower that ones without the flag on Android.

Valentin Gosu [:valentin] {{ PTO 3 July - 3 August }}

Assignee

Comment 14

•

1 year ago

var listener1 = {
  onLookupComplete(inRequest, inRecord, inStatus) {
    this.count++;
    if (this.count == this.maxCount) {
      console.log(performance.now() - this.start);
    }
  },
  QueryInterface: ChromeUtils.generateQI(["nsIDNSListener"]),
  start: null,
  maxCount: 1000,
  count: 0,
};

  var flags = Ci.nsIDNSService.RESOLVE_CANONICAL_NAME; // Set to 0 to avoid setting ai_cannonname flag
  var random = Math.random().toString().slice(2)
  listener1.start = performance.now();
  for (let i =0; i < listener1.maxCount; i++) {
    // This one will be canceled with cancelAsyncResolve.
    Services.dns.asyncResolve(
      `test${random}${i}.wildcard.gosu.se`,
      Ci.nsIDNSService.RESOLVE_TYPE_DEFAULT,
      flags,
      null, // resolverInfo
      listener1,
      Services.tm.currentThread,
      {}
    );

  }

I used the script above to test whether ai_cannoname resolutions are slower on Android (or any other platform), and manual testing didn't show any obvious issues with that.

Andrew Creskey [:acreskey]

Reporter

Comment 15

•

1 year ago

Good, idea -- let me try this script and see if I can spot a difference in a higher latency environment.
Maybe the recursive canonical name lookup is adding to the timings in the field.

Flags: needinfo?(acreskey)

Andrew Creskey [:acreskey]

Reporter

Comment 16

•

1 year ago

I observed the same behaviour -- ai_cannoname resolution was comparable on Android (Fenix nightly, wifi), even with additional latency.

+200ms rtt
flags = Ci.nsIDNSService.RESOLVE_CANONICAL_NAME: 17208ms
flags = 0: 17206ms

+1000ms rtt
flags = Ci.nsIDNSService.RESOLVE_CANONICAL_NAME: 29349ms
flags = 0: 28951ms

Flags: needinfo?(acreskey)

Andrew Creskey [:acreskey]

Reporter

Updated

•

1 year ago

Blocks: perf-android

Andrew Creskey [:acreskey]

Reporter

Comment 17

•

1 year ago

Because we measured such a large impact on document dns lookups, let me see if I can reproduce this locally via pageload tests.

Flags: needinfo?(acreskey)

Lars Eggert [:lars]

Comment 18

•

1 year ago

Is the issue maybe that canonical lookups involve additional roundtrips (to obtain that canonical name)? tcpdump might tell us.

Greg Hess

Updated

•

1 year ago

Whiteboard: [necko-triaged] → [necko-triaged][necko-priority-queue]

Andrew Creskey [:acreskey]

Reporter

Updated

•

1 year ago

Flags: needinfo?(acreskey)

Summary: We appear to be making 5 trr requests for each host lookup (an extra A and AAAA request) → We appear to be making 5 dns requests for each host lookup (an extra A and AAAA request)

Andrew Creskey [:acreskey]

Reporter

Comment 19

•

1 year ago

(In reply to Lars Eggert [:lars] from comment #18)

Is the issue maybe that canonical lookups involve additional roundtrips (to obtain that canonical name)? tcpdump might tell us.

This seems reasonable to me.

But locally even in pageload tests I can't see a performance difference between resolving traditional dns with and without the canonical flag.

In Fenix we clearly saw the difference in top-level-document dns lookup time.

And glam also picked it up on all native dns lookup times (Fenix).

But what's interesting is that going back to the pageload event, I don't see other timings further down the navigation process being affected.
e.g. time to request start didn't appear to move

Which makes me wonder if the change to resolve with the canonical flag somehow affected how we report "dns lookup time", but not the actual time to resolve?

Andrew Creskey [:acreskey]

Reporter

Comment 20

•

1 year ago

(In reply to Andrew Creskey [:acreskey] from comment #19)

Which makes me wonder if the change to resolve with the canonical flag somehow affected how we report "dns lookup time", but not the actual time to resolve?

So by adding the RESOLVE_CANONICAL_NAME flag to all requests we end up using the DNS record from the cache (before key compare would fail and we would issue a new request).
Perhaps for some reason, P75 onwards, the dns request starts earlier (but still completes at the same time).
And so we record a longer "dns lookup time", but haven't actually delayed the network requests at all (which is what the pageload event is showing).

Just a hypothesis!

FYI
Nightly Fenix build with RESOLVE_CANONICAL_NAME on all requests (moz_logs, symbols blank)
https://share.firefox.dev/4mheJ7X

Current Fenix nightly build RESOLVE_CANONICAL_NAME not on all requests (moz_logs, symbols blank)
https://share.firefox.dev/3Fa3JbR

Valentin Gosu [:valentin] {{ PTO 3 July - 3 August }}

Assignee

Comment 21

•

1 year ago

Which makes me wonder if the change to resolve with the canonical flag somehow affected how we report "dns lookup time", but not the actual time to resolve?

Excellent point, Andrew.
I think both with TRR and native DNS, what happens here is that the initial DNS lookup prewarms the cache (either the TRR server cache, or the OS cache). We didin't end up using the speculative lookup, but the second lookup ended up completing much faster (it took less time, but it completed later than the first one did).

I think it's a positive sign that the pageload time didn't regress.
We might want to look at the time between asyncOpen -> lookupEnd instead of lookupStart -> lookupEnd (unfortunately not currently captured)

Andrew Creskey [:acreskey]

Reporter

Comment 22

•

1 year ago

•

Edited

(In reply to Valentin Gosu [:valentin] (he/him) from comment #21)

Which makes me wonder if the change to resolve with the canonical flag somehow affected how we report "dns lookup time", but not the actual time to resolve?

Excellent point, Andrew.
I think both with TRR and native DNS, what happens here is that the initial DNS lookup prewarms the cache (either the TRR server cache, or the OS cache). We didin't end up using the speculative lookup, but the second lookup ended up completing much faster (it took less time, but it completed later than the first one did).

I think it's a positive sign that the pageload time didn't regress.

Yes, I think we're getting closer -- this might be an instrumentation issue.

Through glam we can look at the old HTTP_PAGE_* probes recorded in the loadgroup

network_dns_start doesn't look to have changed:
https://glam.telemetry.mozilla.org/fenix/probe/network_dns_start/explore?ping_type=metrics&timeHorizon=QUARTER

But we can see that network_dns_end dropped with your RESOLVE_CANONICAL_NAME patch.
https://glam.telemetry.mozilla.org/fenix/probe/network_dns_end/explore?ping_type=metrics&timeHorizon=QUARTER

i.e. this probe recorded the change as an improvement (perhaps just looking at the 2nd dns request, not the "warmed up one"?)

But the pageload event recorded it as a regression, maybe timing from the warm-up start to the final dns resolution? (recorded from the timed channel)
But it's still not clear to me how it could end up reporting a longer duration (assuming the use of the warmup request was helpful).

We might want to look at the time between asyncOpen -> lookupEnd instead of lookupStart -> lookupEnd (unfortunately not currently captured)

I'll keep looking for something like " asyncOpen -> lookupEnd "

Valentin Gosu [:valentin] {{ PTO 3 July - 3 August }}

Assignee

Comment 23

•

1 year ago

So if average DNS time goes down with the patch, but pageload DNS time goes up, that means:

the average DNS resolution will be hitting the cache more often (which is to be expected given they now share cache keys)
the pageload DNS resolution is reporting longer times - meaning the cache wasn't warmed up before the load. Most likely now the document channel is now reporting the correct DNS timings.

Given that other timings weren't influenced by this, I'd be inclined to land it again and let it ride the trains.

Andrew Creskey [:acreskey]

Reporter

Updated

•

1 year ago

Comment 24

•

1 year ago

Agreed -- I'm in favor of re-landing this as well as there is no evidence of performance regression.

I would like to have a better understanding of the DNS timings on the document channel, and I logged that in bug 1967433

Pulsebot

Comment 25

•

1 year ago

Pushed by valentin.gosu@gmail.com: https://hg.mozilla.org/integration/autoland/rev/5fce48333c13 Always set AI_CANONNAME flag when making DNS requests r=necko-reviewers,kershaw

Cristian Tuns

Comment 26

•

1 year ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/5fce48333c13

Status: ASSIGNED → RESOLVED

Closed: 1 year ago → 1 year ago

status-firefox140: --- → fixed

Resolution: --- → FIXED

Target Milestone: --- → 140 Branch

Andrew Creskey [:acreskey]

Reporter

Comment 27

•

1 year ago

This is riding the trains, but looking at dns lookup volume in release, I think this should save a couple of hundred billion lookups per day.

https://glam.telemetry.mozilla.org/firefox/probe/dns_lookup_time/explore?channel=release&normalizationType=non_normalized&process=parent&ref=20250407111027&timeHorizon=QUARTER

Andrew Creskey [:acreskey]

Reporter

Comment 28

•

11 months ago

If I have this correct, from Glam we were collecting about 6 trillion samples for dns_native_lookup_time, i.e. calls to getaddrinfo(), in total per Release (e.g. Fx 138 and Fx 139), ~200 billion / day. (Seems a little high).

https://glam.telemetry.mozilla.org/fog/probe/dns_native_lookup_time/explore?aggregationLevel=version&app_id=release&normalizationType=non_normalized&visiblePercentiles=%5B99.9%2C99%2C95%2C75%2C50%2C25%2C5%5D

But expecting this to drop for Fx 140 when this fix hit Release.

I made this query to count samples, grouped by major version, not clear yet if the sum of Fx139 + Fx140 is actually lower:
https://sql.telemetry.mozilla.org/queries/109135/source?p_channel=release&p_days=20#268036

Andrew Creskey [:acreskey]

Reporter

Comment 29

•

11 months ago

I made this query to count samples, grouped by major version, not clear yet if the sum of Fx139 + Fx140 is actually lower:
https://sql.telemetry.mozilla.org/queries/109135/source?p_channel=release&p_days=20#268036

Only seeing up to about 3 billion samples per day with the above query.

Bug 1958018 - Always set AI_CANONNAME flag when making DNS requests r=#necko 1 year ago Valentin Gosu [:valentin] {{ PTO 3 July - 3 August }} 48 bytes, text/x-phabricator-request		Details \| Review
Backout Bug 1958018 We appear to be making 5 trr requests for each host lookup r=#necko-reviewers,kershaw 1 year ago Andrew Creskey [:acreskey] 48 bytes, text/x-phabricator-request		Details \| Review