We appear to be making 5 dns requests for each host lookup (an extra A and AAAA request)
Categories
(Core :: Networking: DNS, defect, P1)
Tracking
()
| Tracking | Status | |
|---|---|---|
| firefox140 | --- | fixed |
People
(Reporter: acreskey, Assigned: valentin)
References
(Blocks 2 open bugs)
Details
(Whiteboard: [necko-triaged][necko-priority-queue])
Attachments
(2 files)
This profile with logs and profiler stacks included shows a reproducible example where we make 5 TRR requests for a single host lookup (www.instagram.com) -- an extra A (type 1) and an extra AAAA (type 2) request.
I've modifed TRR::SendHTTPRequest to also output the pointer %p of the TRR object.
https://share.firefox.dev/4j8cLoe
LogMessages — (nsHostResolver) TRR::SendHTTPRequest resolve 29a63e200 www.instagram.com type 65
LogMessages — (nsHostResolver) TRR::SendHTTPRequest resolve 29a63e4d0 www.instagram.com type 28
LogMessages — (nsHostResolver) TRR::SendHTTPRequest resolve 29a63e5c0 www.instagram.com type 1
LogMessages — (nsHostResolver) TRR::SendHTTPRequest resolve 29a63cf40 www.instagram.com type 28
LogMessages — (nsHostResolver) TRR::SendHTTPRequest resolve 29a63e110 www.instagram.com type 1
Note the 5 separate TRR objects and requests. (Each also receives OnDataAvailable and OnStopRequest).
This build makes the TRR constructor log, and I can see that two of the TRR objects are made in the parent process main thread:
https://share.firefox.dev/43CEwAS
LogMessages — (nsHostResolver) TRR::TRR resolve 29a63cf40 www.instagram.com type 28
LogMessages — (nsHostResolver) TRR::TRR resolve 29a63e110 www.instagram.com type 1
Via this codepath:
mozilla::net::TRR::TRR(AHostResolver*, nsHostRecord*, mozilla::net::TrrType)/Users/acreskey/dev/mc3/netwerk/dns/TRR.cpp
mozilla::net::TRR::TRR(AHostResolver*, nsHostRecord*, mozilla::net::TrrType)/Users/acreskey/dev/mc3/netwerk/dns/TRR.cpp
mozilla::net::TRRQuery::PrepareQuery(mozilla::net::TrrType, nsTArray<RefPtr<mozilla::net::TRR> >&)/Users/acreskey/dev/mc3/netwerk/dns/TRRQuery.cpp
mozilla::net::TRRQuery::DispatchLookup(mozilla::net::TRR*)/Users/acreskey/dev/mc3/netwerk/dns/TRRQuery.cpp
nsHostResolver::TrrLookup(nsHostRecord*, mozilla::detail::BaseAutoLock<mozilla::Mutex&> const&, mozilla::net::TRR*)/Users/acreskey/dev/mc3/netwerk/dns/nsHostResolver.cpp
nsHostResolver::NameLookup(nsHostRecord*, mozilla::detail::BaseAutoLock<mozilla::Mutex&> const&)/Users/acreskey/dev/mc3/netwerk/dns/nsHostResolver.cpp
nsHostResolver::ResolveHost(nsTSubstring<char> const&, nsTSubstring<char> const&, int, unsigned short, mozilla::OriginAttributes const&, nsIDNSService::DNSFlags, unsigned short, nsResolveHostCallback*)/Users/acreskey/dev/mc3/netwerk/dns/nsHostResolver.cpp
nsDNSService::AsyncResolveInternal(nsTSubstring<char> const&, unsigned short, nsIDNSService::DNSFlags, nsIDNSAdditionalInfo*, nsIDNSListener*, nsIEventTarget*, mozilla::OriginAttributes const&, nsICancelable**)/Users/acreskey/dev/mc3/netwerk/dns/nsDNSService2.cpp
nsDNSPrefetch::Prefetch(nsIDNSService::DNSFlags)/Users/acreskey/dev/mc3/netwerk/base/nsDNSPrefetch.cpp
mozilla::net::nsHttpChannel::MaybeStartDNSPrefetch()/Users/acreskey/dev/mc3/netwerk/protocol/http/nsHttpChannel.cpp
mozilla::net::nsHttpChannel::BeginConnect()/Users/acreskey/dev/mc3/netwerk/protocol/http/nsHttpChannel.cpp
The other three are made on the socket thread:
https://share.firefox.dev/4laQvM6
LogMessages — (nsHostResolver) TRR::TRR resolve 29a63e200 www.instagram.com type 65
LogMessages — (nsHostResolver) TRR::TRR resolve 29a63e4d0 www.instagram.com type 28
LogMessages — (nsHostResolver) TRR::TRR resolve 29a63e5c0 www.instagram.com type 1
via this codepath
nsHostResolver::TrrLookup(nsHostRecord*, mozilla::detail::BaseAutoLock<mozilla::Mutex&> const&, mozilla::net::TRR*)/Users/acreskey/dev/mc3/netwerk/dns/nsHostResolver.cpp
nsHostResolver::NameLookup(nsHostRecord*, mozilla::detail::BaseAutoLock<mozilla::Mutex&> const&)/Users/acreskey/dev/mc3/netwerk/dns/nsHostResolver.cpp
nsHostResolver::ResolveHost(nsTSubstring<char> const&, nsTSubstring<char> const&, int, unsigned short, mozilla::OriginAttributes const&, nsIDNSService::DNSFlags, unsigned short, nsResolveHostCallback*)/Users/acreskey/dev/mc3/netwerk/dns/nsHostResolver.cpp
nsDNSService::AsyncResolveInternal(nsTSubstring<char> const&, unsigned short, nsIDNSService::DNSFlags, nsIDNSAdditionalInfo*, nsIDNSListener*, nsIEventTarget*, mozilla::OriginAttributes const&, nsICancelable**)/Users/acreskey/dev/mc3/netwerk/dns/nsDNSService2.cpp
mozilla::net::HTTPSRecordResolver::FetchHTTPSRRInternal(nsIEventTarget*, nsICancelable**)/Users/acreskey/dev/mc3/netwerk/protocol/http/HTTPSRecordResolver.cpp
mozilla::net::SpeculativeTransaction::FetchHTTPSRR()/Users/acreskey/dev/mc3/netwerk/protocol/http/SpeculativeTransaction.cpp
mozilla::net::nsHttpConnectionMgr::DoSpeculativeConnectionInternal(mozilla::net::ConnectionEntry*, mozilla::net::SpeculativeTransaction*, bool)/Users/acreskey/dev/mc3/netwerk/protocol/http/nsHttpConnectionMgr.cpp
RefPtr<mozilla::net::nsHttpConnectionMgr>::Proxy<void, int, mozilla::net::ARefBase*>::operator()<int&, RefPtr<mozilla::net::ARefBase>&>(int&, RefPtr<mozilla::net::ARefBase>&)/Users/acreskey/dev/mc3/obj-ff-dbg/dist/include/mozilla/RefPtr.h
mozilla::net::ConnEvent::Run()/Users/acreskey/dev/mc3/netwerk/protocol/http/nsHttpConnectionMgr.cpp
| Reporter | ||
Comment 1•1 year ago
|
||
This is another example -- new profile, default DoH provider
Note the 5 decoded TRR responses for the domain:
https://share.firefox.dev/42f8E2G
| Assignee | ||
Comment 2•1 year ago
|
||
The problem is that mozilla::net::HTTPSRecordResolver::FetchHTTPSRRInternal is setting the RESOLVE_CANONICAL_NAME flag, while nsDNSPrefetch::Prefetch is not.
Technically, we should be able to use records that RESOLVE_CANONICAL_NAME and are already in the cache, but because the flag is part of the key, we don't.
I guess it doesn't only affect TRR, but native DNS too.
For an immediate fix, I think I might do a second lookup in the DNS cache, with the additional flag. But long term I'm thinking of getting rid of the RESOLVE_CANONICAL_NAME flag entirely, and always set it when doing native DNS lookups.
| Assignee | ||
Comment 3•1 year ago
|
||
This flag controls whether ai_canonname is set to the target CNAME when
calling getaddrinfo.
However, since it is part of the host record key, a request that doesn't have
the flag set will not reuse a DNS cache entry that does, even though
all of the information it needs is there.
This is a problem because consumers set this flag inconsistently.
Ideally we would remove it in the future.
Comment 5•1 year ago
|
||
| bugherder | ||
| Reporter | ||
Comment 6•1 year ago
|
||
This fix may somehow be the cause of the regression to later percentile top-level document dns lookup times seen primarily in Fenix
https://bugzilla.mozilla.org/show_bug.cgi?id=1960020
| Reporter | ||
Comment 7•1 year ago
|
||
We've observed a large performance regression and this bug is a possible reason -- backing out to verify. See bug 1960020
Comment 9•1 year ago
|
||
Backed out for causing build bustages.
Backout link: https://hg-edge.mozilla.org/integration/autoland/rev/0500a825b5aedb1e00550decd05fccc4e5bfcc04
Failure log: https://treeherder.mozilla.org/logviewer?job_id=504086976&repo=autoland&lineNumber=1171
Comment 10•1 year ago
|
||
Comment 11•1 year ago
|
||
Backout merged into central: https://hg-edge.mozilla.org/mozilla-central/rev/bc76dbe69b708949bc97ed9901121933ebf65cc3
| Assignee | ||
Updated•1 year ago
|
Updated•1 year ago
|
| Reporter | ||
Comment 12•1 year ago
|
||
It looks like this patch was the cause of the Fenix dns regression, as backing it out looks to have restored the distribution
Updated•1 year ago
|
Updated•1 year ago
|
| Assignee | ||
Comment 13•1 year ago
|
||
Thank you for backing it out Andrew.
I'll try to look into whether ai_canonname lookups are slower that ones without the flag on Android.
| Assignee | ||
Comment 14•1 year ago
|
||
var listener1 = {
onLookupComplete(inRequest, inRecord, inStatus) {
this.count++;
if (this.count == this.maxCount) {
console.log(performance.now() - this.start);
}
},
QueryInterface: ChromeUtils.generateQI(["nsIDNSListener"]),
start: null,
maxCount: 1000,
count: 0,
};
var flags = Ci.nsIDNSService.RESOLVE_CANONICAL_NAME; // Set to 0 to avoid setting ai_cannonname flag
var random = Math.random().toString().slice(2)
listener1.start = performance.now();
for (let i =0; i < listener1.maxCount; i++) {
// This one will be canceled with cancelAsyncResolve.
Services.dns.asyncResolve(
`test${random}${i}.wildcard.gosu.se`,
Ci.nsIDNSService.RESOLVE_TYPE_DEFAULT,
flags,
null, // resolverInfo
listener1,
Services.tm.currentThread,
{}
);
}
I used the script above to test whether ai_cannoname resolutions are slower on Android (or any other platform), and manual testing didn't show any obvious issues with that.
| Reporter | ||
Comment 15•1 year ago
|
||
Good, idea -- let me try this script and see if I can spot a difference in a higher latency environment.
Maybe the recursive canonical name lookup is adding to the timings in the field.
| Reporter | ||
Comment 16•1 year ago
|
||
I observed the same behaviour -- ai_cannoname resolution was comparable on Android (Fenix nightly, wifi), even with additional latency.
+200ms rtt
flags = Ci.nsIDNSService.RESOLVE_CANONICAL_NAME: 17208ms
flags = 0: 17206ms
+1000ms rtt
flags = Ci.nsIDNSService.RESOLVE_CANONICAL_NAME: 29349ms
flags = 0: 28951ms
| Reporter | ||
Updated•1 year ago
|
| Reporter | ||
Comment 17•1 year ago
|
||
Because we measured such a large impact on document dns lookups, let me see if I can reproduce this locally via pageload tests.
Comment 18•1 year ago
|
||
Is the issue maybe that canonical lookups involve additional roundtrips (to obtain that canonical name)? tcpdump might tell us.
| Reporter | ||
Updated•1 year ago
|
| Reporter | ||
Comment 19•1 year ago
|
||
(In reply to Lars Eggert [:lars] from comment #18)
Is the issue maybe that canonical lookups involve additional roundtrips (to obtain that canonical name)?
tcpdumpmight tell us.
This seems reasonable to me.
But locally even in pageload tests I can't see a performance difference between resolving traditional dns with and without the canonical flag.
In Fenix we clearly saw the difference in top-level-document dns lookup time.
And glam also picked it up on all native dns lookup times (Fenix).
But what's interesting is that going back to the pageload event, I don't see other timings further down the navigation process being affected.
e.g. time to request start didn't appear to move
Which makes me wonder if the change to resolve with the canonical flag somehow affected how we report "dns lookup time", but not the actual time to resolve?
| Reporter | ||
Comment 20•1 year ago
|
||
(In reply to Andrew Creskey [:acreskey] from comment #19)
Which makes me wonder if the change to resolve with the canonical flag somehow affected how we report "dns lookup time", but not the actual time to resolve?
So by adding the RESOLVE_CANONICAL_NAME flag to all requests we end up using the DNS record from the cache (before key compare would fail and we would issue a new request).
Perhaps for some reason, P75 onwards, the dns request starts earlier (but still completes at the same time).
And so we record a longer "dns lookup time", but haven't actually delayed the network requests at all (which is what the pageload event is showing).
Just a hypothesis!
FYI
Nightly Fenix build with RESOLVE_CANONICAL_NAME on all requests (moz_logs, symbols blank)
https://share.firefox.dev/4mheJ7X
Current Fenix nightly build RESOLVE_CANONICAL_NAME not on all requests (moz_logs, symbols blank)
https://share.firefox.dev/3Fa3JbR
| Assignee | ||
Comment 21•1 year ago
|
||
Which makes me wonder if the change to resolve with the canonical flag somehow affected how we report "dns lookup time", but not the actual time to resolve?
Excellent point, Andrew.
I think both with TRR and native DNS, what happens here is that the initial DNS lookup prewarms the cache (either the TRR server cache, or the OS cache). We didin't end up using the speculative lookup, but the second lookup ended up completing much faster (it took less time, but it completed later than the first one did).
I think it's a positive sign that the pageload time didn't regress.
We might want to look at the time between asyncOpen -> lookupEnd instead of lookupStart -> lookupEnd (unfortunately not currently captured)
| Reporter | ||
Comment 22•1 year ago
•
|
||
(In reply to Valentin Gosu [:valentin] (he/him) from comment #21)
Which makes me wonder if the change to resolve with the canonical flag somehow affected how we report "dns lookup time", but not the actual time to resolve?
Excellent point, Andrew.
I think both with TRR and native DNS, what happens here is that the initial DNS lookup prewarms the cache (either the TRR server cache, or the OS cache). We didin't end up using the speculative lookup, but the second lookup ended up completing much faster (it took less time, but it completed later than the first one did).I think it's a positive sign that the pageload time didn't regress.
Yes, I think we're getting closer -- this might be an instrumentation issue.
Through glam we can look at the old HTTP_PAGE_* probes recorded in the loadgroup
network_dns_start doesn't look to have changed:
https://glam.telemetry.mozilla.org/fenix/probe/network_dns_start/explore?ping_type=metrics&timeHorizon=QUARTER
But we can see that network_dns_end dropped with your RESOLVE_CANONICAL_NAME patch.
https://glam.telemetry.mozilla.org/fenix/probe/network_dns_end/explore?ping_type=metrics&timeHorizon=QUARTER
i.e. this probe recorded the change as an improvement (perhaps just looking at the 2nd dns request, not the "warmed up one"?)
But the pageload event recorded it as a regression, maybe timing from the warm-up start to the final dns resolution? (recorded from the timed channel)
But it's still not clear to me how it could end up reporting a longer duration (assuming the use of the warmup request was helpful).
We might want to look at the time between asyncOpen -> lookupEnd instead of lookupStart -> lookupEnd (unfortunately not currently captured)
I'll keep looking for something like " asyncOpen -> lookupEnd "
| Assignee | ||
Comment 23•1 year ago
|
||
So if average DNS time goes down with the patch, but pageload DNS time goes up, that means:
- the average DNS resolution will be hitting the cache more often (which is to be expected given they now share cache keys)
- the pageload DNS resolution is reporting longer times - meaning the cache wasn't warmed up before the load. Most likely now the document channel is now reporting the correct DNS timings.
Given that other timings weren't influenced by this, I'd be inclined to land it again and let it ride the trains.
| Reporter | ||
Comment 24•1 year ago
|
||
Agreed -- I'm in favor of re-landing this as well as there is no evidence of performance regression.
I would like to have a better understanding of the DNS timings on the document channel, and I logged that in bug 1967433
Comment 25•1 year ago
|
||
Comment 26•1 year ago
|
||
| bugherder | ||
| Reporter | ||
Comment 27•1 year ago
|
||
This is riding the trains, but looking at dns lookup volume in release, I think this should save a couple of hundred billion lookups per day.
| Reporter | ||
Comment 28•11 months ago
|
||
If I have this correct, from Glam we were collecting about 6 trillion samples for dns_native_lookup_time, i.e. calls to getaddrinfo(), in total per Release (e.g. Fx 138 and Fx 139), ~200 billion / day. (Seems a little high).
But expecting this to drop for Fx 140 when this fix hit Release.
I made this query to count samples, grouped by major version, not clear yet if the sum of Fx139 + Fx140 is actually lower:
https://sql.telemetry.mozilla.org/queries/109135/source?p_channel=release&p_days=20#268036
| Reporter | ||
Comment 29•11 months ago
|
||
I made this query to count samples, grouped by major version, not clear yet if the sum of Fx139 + Fx140 is actually lower:
https://sql.telemetry.mozilla.org/queries/109135/source?p_channel=release&p_days=20#268036
Only seeing up to about 3 billion samples per day with the above query.
Description
•