Extremely long DNS resolution times with Fenix, no fallback resolver
Categories
(Core :: Networking: DNS, defect, P1)
Tracking
()
Performance Impact | medium |
People
(Reporter: denispal, Unassigned)
References
(Blocks 2 open bugs)
Details
(Keywords: perf:pageload, perf:responsiveness, Whiteboard: [necko-triaged])
I am seeing some incredibly long DNS resolution times in Fenix, upwards of 5-10 seconds. This is most noticeable from applink navigations, but I see it occasionally during normal navigation as well.
Here are some profiles:
https://share.firefox.dev/3bS7hN4
https://share.firefox.dev/3m96vQm
Reporter
Comment 1•4 years ago
:acreskey noticed most of the time is spent sitting in Android's getaddrinfo. There is mention of Chrome using their own DNS resolver that uses failovers to different servers so it could help explain the difference I am seeing between Firefox and other browsers here.
Reporter
Comment 2•4 years ago
Seems my ISP is giving me a pretty slow DNS server with DHCP. Simply changing DNS servers fixes this for me, or enabling DoH/VPN also fixes it. Seems like we might not have any failover for a slow DNS request which may leave users blaming Fenix for slow performance. I haven't been experiencing this problem in the other webview browsers.
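One way to check whether the system resolver is the bottleneck is to time a blocking lookup outside the browser. The following is a minimal Python sketch, not Firefox code: `time_lookup` is a hypothetical helper, and `socket.getaddrinfo` stands in for the platform call mentioned in comment 1. The `resolve` parameter is an assumption added so that another resolver can be swapped in for comparison.

```python
import socket
import time


def time_lookup(host, resolve=socket.getaddrinfo):
    """Time one blocking lookup; return (elapsed_seconds, result_or_error)."""
    start = time.monotonic()
    try:
        result = resolve(host, 443)
    except OSError as exc:  # socket.gaierror is a subclass of OSError
        result = exc
    return time.monotonic() - start, result


if __name__ == "__main__":
    elapsed, result = time_lookup("example.com")
    status = "failed" if isinstance(result, Exception) else "ok"
    print(f"lookup {status} after {elapsed:.3f}s")
```

Running this a few times on the affected network, then again after changing DNS servers, should show whether multi-second delays come from the resolver rather than from the browser.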
Comment 3•4 years ago
(In reply to Denis Palmeiro [:denispal] from comment #2)
Seems my ISP is giving me a pretty slow DNS server with DHCP. Simply changing DNS servers fixes this for me, or enabling DoH/VPN also fixes it. Seems like we might not have any failover for a slow DNS request which may leave users blaming Fenix for slow performance. I haven't been experiencing this problem in the other webview browsers.
I think there is not much we can do if the system resolver is slow.
Valentin, what do you think?
Comment 4•4 years ago
I agree.
From what I can tell, mobile Chrome sometimes uses their DoH implementation to do DNS prefetches, but otherwise they also use getaddrinfo, so they should have the same performance.
Comment 5•4 years ago
Keeping this open, but unfortunately there's not much we can do about it.
We can't control which DNS servers we use with getaddrinfo.
We might improve this with DoH - see bug 1664878.
Comment 6•3 years ago
https://glam.telemetry.mozilla.org/firefox/probe/dns_lookup_time/explore?process=parent suggests slow DNS lookup times are a real issue. GLAM doesn't seem to give me information on the DoH lookup time; do we have some idea of how it compares? And do we have data on mobile?
Comment 7•3 years ago
DoH has never been enabled on mobile so we don't have any data on how it would perform.
For desktop, you can compare the lookup times between these two probes.
https://glam.telemetry.mozilla.org/firefox/probe/dns_native_lookup_time/explore?channel=release&process=parent
https://glam.telemetry.mozilla.org/firefox/probe/dns_trr_lookup_time3/explore?channel=release&process=parent
Comment 8•3 years ago
This issue is very annoying, as pages often take an extremely long time to load. I must note that e.g. the DuckDuckGo browser, which uses either Chromium or WebView, is much faster at DNS resolution; not sure why.
Is there anything blocking DoH from being enabled, to at least try to work around it?
Comment 9•2 years ago
We investigated the possibility of racing the native resolver and DoH:
- this would add some technical complexity, though not a great deal.
- this would also need further investigation regarding policy compliance in different countries.
Considering that we want to turn on DoH for more users, we decided to investigate the possibility to enable DoH for such users instead of implementing the racing. We are going to continue the investigation in this direction.
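For illustration, the racing idea can be sketched in a few lines of Python. This is not the Necko design that was investigated: `race_resolvers` and the resolver callables passed to it are hypothetical, and each callable is assumed to take a hostname and return a list of addresses or raise on failure.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def race_resolvers(host, resolvers, timeout=5.0):
    """Run all resolvers in parallel; return (name, addresses) from the first success.

    `resolvers` maps a label (e.g. "native", "doh") to a callable(host) -> list.
    Raises concurrent.futures.TimeoutError if nothing finishes within `timeout`.
    """
    pool = ThreadPoolExecutor(max_workers=len(resolvers))
    futures = {pool.submit(fn, host): name for name, fn in resolvers.items()}
    try:
        for future in as_completed(futures, timeout=timeout):
            try:
                addresses = future.result()
            except Exception:
                continue  # this resolver failed; keep waiting for the others
            if addresses:
                return futures[future], addresses
        raise LookupError(f"no resolver returned an address for {host}")
    finally:
        # Do not block on the loser; let it finish in the background.
        pool.shutdown(wait=False)
```

The loser of the race still runs to completion, which is part of the extra cost of this approach: every lookup is sent to both resolvers.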
Comment 10•2 years ago
The Performance Priority Calculator has determined this bug's performance priority to be P2. If you'd like to request re-triage, you can reset the Performance flag to "?" or needinfo the triage sheriff.
Platforms: Android
Page load impact: Severe
[x] Bug affects multiple sites
Comment 11•2 years ago
Why P2? Even on P1 it’s taking extremely long for this bug to even be looked at. It impacts me every day every single time I open any webpage on my phone. I’m losing 30 to 90 seconds on every webpage every day just because I cannot enable DoH to speed up DNS lookups.
If you’re downgrading this issue, you could’ve at least made DoH available as an opt-in setting on the stable channel.
Comment 12•2 years ago
Hold on, which one is higher, P1 or P2? I’m confused.
Comment 13•2 years ago
(In reply to Andrej Shadura from comment #11)
Why P2? Even on P1 it’s taking extremely long for this bug to even be looked at. It impacts me every day every single time I open any webpage on my phone. I’m losing 30 to 90 seconds on every webpage every day just because I cannot enable DoH to speed up DNS lookups.
If you’re downgrading this issue, you could’ve at least made DoH available as an opt-in setting on the stable channel.
Apologies, I missed some criteria when re-triaging this bug; it should indeed be a P1 Performance bug. To be clear, the issue is still important to Fx. This particular priority setting is very specific to the performance team and does not denote the overall priority to the team responsible for the fix.
The Performance Priority Calculator has determined this bug's performance priority to be P1. If you'd like to request re-triage, you can reset the Performance flag to "?" or needinfo the triage sheriff.
Platforms: Android
Impact on browser UI: Renders browser effectively unusable
Impact on site: Renders site effectively unusable
Page load impact: Severe
[x] Bug affects multiple sites
Comment 14•2 years ago
The severity field for this bug is set to S3. However, the Performance Impact field flags this bug as having a high impact on the performance.
:jesup, could you consider increasing the severity of this performance-impacting bug? Alternatively, if you think the performance impact is lower than previously assessed, could you request a re-triage from the performance team by setting the Performance Impact flag to "?"?
For more information, please visit auto_nag documentation.
Comment 15•2 years ago
Bumped priority and severity.
My concern with addressing this via DoH is that it's only available in certain regions (and may also be blocked by certain providers), so it's not a general fix. It may be a useful partial fix or bandaid, however, and lower impact/cost to implement.
Anecdotally, I frequently see long (seconds) delays on mobile Fenix with the loading indicator stalled and the lock icon not yet showing locked (i.e. we don't have TLS up, and my bet is we're in DNS lookup). I imagine this may be the same issue. These are usually when going to a site from the google newsfeed (so likely applink)
Denis/perf team - do you still see this? Can you get profiles showing this? Or even better (though perhaps a pain to get), logs with the settings in about:networking.
Comment 16•2 years ago
I do also see the occasional issues with loading being frozen the same way as described by Randell. I haven't captured this in a profile though; perhaps Denis has recently?
Reporter
Comment 17•2 years ago
Yes, this is still an occasional issue, although the delays are not usually as long as in the originally reported profiles, a problem I have mostly addressed by switching to Cloudflare's DNS servers.
Comment 18•1 year ago
In Bug 1838240 we migrated the DNS probes to glean so we can look at this on Fenix.
Because dns resolution time is part of the pageload_event, we can also look at it from this perspective. See https://bugzilla.mozilla.org/show_bug.cgi?id=1838240#c1
Comment 19•11 months ago
While DoH for Android progresses, bug 1874464 may provide a solution to this.
Comment 20•8 months ago
Ni? me to figure out if long resolution times are limited to older Android versions.
Comment 21•7 months ago
I think the approach here should be something similar to bug 1895226.
If we see that the browser doesn't have IPv6 connectivity and/or IPv6 DNS requests keep timing out, we can avoid doing them at all.
We have to be careful about the fact that IPv6 connectivity to the internet may be missing, but the local network might still depend on IPv6 working.
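A rough Python sketch of that idea follows. It is a heuristic under stated assumptions, not what Firefox implements: `has_ipv6_route`, `lookup_family`, and `resolve` are hypothetical names, and the probe only tests for a route to a global IPv6 address (the documentation prefix 2001:db8::/32), so it shares the caveat above about local networks that rely on IPv6.

```python
import socket


def has_ipv6_route(probe_addr="2001:db8::1"):
    """Heuristic: connect() on a UDP socket sends no packets but needs a route."""
    try:
        with socket.socket(socket.AF_INET6, socket.SOCK_DGRAM) as sock:
            sock.connect((probe_addr, 53))
        return True
    except OSError:
        return False


def lookup_family(ipv6_available):
    """AF_UNSPEC requests both A and AAAA records; AF_INET requests A only."""
    return socket.AF_UNSPEC if ipv6_available else socket.AF_INET


def resolve(host, port=443, ipv6_available=None):
    """Resolve `host`, skipping AAAA lookups when no IPv6 route is detected."""
    if ipv6_available is None:
        ipv6_available = has_ipv6_route()
    return socket.getaddrinfo(
        host, port, family=lookup_family(ipv6_available), type=socket.SOCK_STREAM
    )
```

A real implementation would also need to re-run the check when the network changes, since connectivity on a phone differs between Wi-Fi and mobile data.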
Comment 22•7 months ago
This is an older bug, so let me summarize what it describes.
The user is experiencing slow or stalled DNS requests due to their ISP-provided DNS resolvers being problematic. This can happen when an ISP assigns DNS servers that are slow or unreliable (comment 2).
This is not a problem when using Chrome on the same network (we believe because they use a fallback resolver).
In Firefox, changing dns servers or using DoH resolves the problem.
As we're not expecting to roll out DoH for Android in the short term, maybe a solution could be to race the https resource records query, bug 1852752?
See also: bug 1895908
Comment 23•7 months ago
Increasing the severity level to match the high performance impact, but this decision is nuanced, so let me know if this adjustment doesn't seem fully accurate or appropriate.
Comment 24•7 months ago
I think we should move this back to S3 - we have workarounds (change dns to (say) 8.8.8.8, use DoH); it likely doesn't affect many users (though it's hard to tell how many), the impact on an individual user usually isn't horrible unless DNS is totally broken - and if so it really is an ISP/etc issue, even if Chrome manages to work around it. So S3, even if a higher-priority S3, imo
Comment 25•7 months ago
S2 (Serious) Major functionality/product severely impaired or a high impact issue and a satisfactory workaround does not exist
S3 (Normal) Blocks non-critical functionality and a work around exists
Agreed :)
But we will try to find a solution for our Android users.
Comment 26•5 months ago
According to Ilya Grigorik, as documented in "The Performance of Open Source Software: High Performance Networking in Chrome" (https://aosabook.org/en/posa/high-performance-networking-in-chrome.html), Chrome uses its own async resolver, which mitigates slow DNS queries via the following methods:
- better control of retransmission timers, and ability to execute multiple queries in parallel
- visibility into record TTLs, which allows Chrome to refresh popular records ahead of time
- better behavior for dual stack implementations (IPv4 and IPv6)
- failovers to different servers, based on RTT or other signals
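The last item, failover based on RTT, can be sketched as follows. This is an illustrative Python sketch and not Chrome's resolver: `FailoverResolver` is a hypothetical class, the `query(server, host)` callable is an assumed stand-in for sending a DNS query to one server, and the smoothing factor and failure penalty are arbitrary values.

```python
import time


class FailoverResolver:
    """Try DNS servers in order of smoothed RTT, failing over on errors."""

    def __init__(self, servers, query, alpha=0.3, penalty=5.0):
        self.query = query      # assumed callable: query(server, host) -> list of addresses
        self.alpha = alpha      # weight of the newest RTT sample
        self.penalty = penalty  # RTT (seconds) charged to a server that fails
        self.rtt = {server: 0.0 for server in servers}

    def _record(self, server, sample):
        # Exponentially weighted moving average of the observed RTT.
        self.rtt[server] = (1 - self.alpha) * self.rtt[server] + self.alpha * sample

    def resolve(self, host):
        # sorted() is stable, so untested servers keep their configured order.
        for server in sorted(self.rtt, key=self.rtt.get):
            start = time.monotonic()
            try:
                addresses = self.query(server, host)
            except OSError:  # includes TimeoutError
                self._record(server, self.penalty)
                continue
            self._record(server, time.monotonic() - start)
            return server, addresses
        raise LookupError(f"all servers failed for {host}")
```

With this ordering, a slow or failing ISP server is tried first only until it accumulates a worse average than an alternative; a browser using `getaddrinfo` has no equivalent control, as noted in comment 5.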
Comment 27•5 months ago
We've pulled this from the priority queue because adding a fallback resolver isn't directly actionable and is also beyond scope for the priority queue. This bug will be taken to DNS/DoH strategy scope.
Comment 28•1 month ago
Denis, Necko folks have just resolved a DNS performance issue, bug 1122907, which may affect this scenario.
But the change is preffed off by default for now.
If you're still seeing the issue in this bug, can you try turning on the pref network.dns.skip_ipv6_when_no_addresses to see if it helps?
Reporter
Comment 29•1 month ago
This hasn't been a problem for me for a while now so I can't really provide any useful feedback on that pref unfortunately.