Open Bug 1664492 Opened 4 years ago Updated 1 month ago

Extremely long DNS resolution times with Fenix, no fallback resolver

Categories

(Core :: Networking: DNS, defect, P1)

Unspecified
Android
defect

Performance Impact medium

People

(Reporter: denispal, Unassigned)

References

(Blocks 2 open bugs)

Details

(Keywords: perf:pageload, perf:responsiveness, Whiteboard: [necko-triaged])

I am seeing some incredibly long DNS resolution times in Fenix, upwards of 5-10 seconds. This is most noticeable from applink navigations, but I see it occasionally during normal navigation as well.

Here are some profiles:
https://share.firefox.dev/3bS7hN4
https://share.firefox.dev/3m96vQm

:acreskey noticed most of the time is spent sitting in Android's getaddrinfo. There is mention of Chrome using their own DNS resolver that uses failovers to different servers so it could help explain the difference I am seeing between Firefox and other browsers here.
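
To check where the time goes outside the browser, the blocking system lookup can be timed directly. Below is a minimal Python sketch, not anything from Gecko: CPython's socket.getaddrinfo wraps the platform's getaddrinfo, and the helper name time_lookup and its injectable resolve parameter are assumptions made for illustration.

```python
import socket
import time


def time_lookup(host, port=443, resolve=socket.getaddrinfo):
    """Return (elapsed_seconds, results) for one blocking lookup.

    `resolve` defaults to the system resolver. It is injectable (an
    assumption of this sketch) so the helper can be exercised offline.
    """
    start = time.monotonic()
    results = resolve(host, port, type=socket.SOCK_STREAM)
    return time.monotonic() - start, results
```

Calling time_lookup("example.com") on the affected network and again after switching DNS servers should show whether the delay sits in the system resolver.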

Seems my ISP is giving me a pretty slow DNS server with DHCP. Simply changing DNS servers fixes this for me, or enabling DoH/VPN also fixes it. Seems like we might not have any failover for a slow DNS request which may leave users blaming Fenix for slow performance. I haven't been experiencing this problem in the other webview browsers.

Component: Performance → Networking: DNS
Whiteboard: [qf] → [qf:p1:responsiveness]
Whiteboard: [qf:p1:responsiveness] → [qf:p1:pageload]

(In reply to Denis Palmeiro [:denispal] from comment #2)

Seems my ISP is giving me a pretty slow DNS server with DHCP. Simply changing DNS servers fixes this for me, or enabling DoH/VPN also fixes it. Seems like we might not have any failover for a slow DNS request which may leave users blaming Fenix for slow performance. I haven't been experiencing this problem in the other webview browsers.

I think there is not much we can do if the system resolver is slow.
Valentin, what do you think?

Flags: needinfo?(valentin.gosu)

I agree.
From what I can tell, mobile Chrome sometimes uses their DoH implementation to do DNS prefetches, but otherwise they also use getaddrinfo, so they should have the same performance.

Flags: needinfo?(valentin.gosu)

Keeping this open, but unfortunately there's not much we can do about it.
We can't control which DNS servers we use with getaddrinfo.
We might improve this with DoH - see bug 1664878.

Severity: -- → S3
OS: Unspecified → Android
Priority: -- → P3
See Also: → 1664878
Whiteboard: [qf:p1:pageload] → [qf:p1:pageload][necko-triaged]

https://glam.telemetry.mozilla.org/firefox/probe/dns_lookup_time/explore?process=parent suggests slow DNS lookup times are a real issue. GLAM doesn't seem to give information to me on the DoH lookup time, do we have some idea of how it compares? And do we have data on mobile?

Flags: needinfo?(valentin.gosu)

DoH has never been enabled on mobile so we don't have any data on how it would perform.

For desktop, you can compare the lookup times between these two probes.
https://glam.telemetry.mozilla.org/firefox/probe/dns_native_lookup_time/explore?channel=release&process=parent
https://glam.telemetry.mozilla.org/firefox/probe/dns_trr_lookup_time3/explore?channel=release&process=parent

Flags: needinfo?(valentin.gosu)

This issue is very annoying, as pages often take an extremely long time to load. I must note that, for example, the DuckDuckGo browser, which uses either Chromium or WebView, is much faster at DNS resolution; I'm not sure why.
Is there anything blocking DoH from being enabled, to at least try to work around it?

Performance Impact: --- → P1
Keywords: perf:pageload
Whiteboard: [qf:p1:pageload][necko-triaged] → [necko-triaged]

We investigated the possibility of racing the native resolver and DoH:

  • this would add some technical complexity, though not a great deal.
  • this would also need further investigation regarding policy compliance in different countries.

Considering that we want to turn on DoH for more users, we decided to investigate enabling DoH for such users instead of implementing the racing. We are going to continue the investigation in this direction.
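
For readers unfamiliar with the racing idea discussed here, it can be sketched roughly as follows. This is an illustrative Python sketch only, not the Necko design: race_resolvers and the resolver callables passed to it are hypothetical names, and each callable is assumed to take a hostname and return a list of addresses or raise.

```python
import concurrent.futures


def race_resolvers(host, resolvers, timeout=5.0):
    """Run every resolver in parallel and return (name, addresses) from the
    first one that succeeds. Raises if all fail or the timeout expires.

    `resolvers` maps a label to a callable taking a hostname (hypothetical).
    """
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=len(resolvers))
    futures = {pool.submit(fn, host): name for name, fn in resolvers.items()}
    try:
        for done in concurrent.futures.as_completed(futures, timeout=timeout):
            if done.exception() is None:
                return futures[done], done.result()
        raise OSError(f"all resolvers failed for {host}")
    finally:
        # Do not block on the slower resolver; it finishes in the background.
        pool.shutdown(wait=False)
```

The sketch ignores the policy question raised above: it would send every hostname to both resolvers, which is exactly what needs the compliance investigation.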

The Performance Priority Calculator has determined this bug's performance priority to be P2. If you'd like to request re-triage, you can reset the Performance flag to "?" or needinfo the triage sheriff.

Platforms: Android
Page load impact: Severe
[x] Bug affects multiple sites

Performance Impact: P1 → P2

Why P2? Even on P1 it’s taking extremely long for this bug to even be looked at. It impacts me every day every single time I open any webpage on my phone. I’m losing 30 to 90 seconds on every webpage every day just because I cannot enable DoH to speed up DNS lookups.

If you’re downgrading this issue, you could’ve at least made DoH available as an opt-in setting on the stable channel.

Hold on, which one is higher, P1 or P2? I’m confused.

(In reply to Andrej Shadura from comment #11)

Why P2? Even on P1 it’s taking extremely long for this bug to even be looked at. It impacts me every day every single time I open any webpage on my phone. I’m losing 30 to 90 seconds on every webpage every day just because I cannot enable DoH to speed up DNS lookups.

If you’re downgrading this issue, you could’ve at least made DoH available as an opt-in setting on the stable channel.

Apologies, I missed some criteria when re-triaging this bug, and it should indeed be a P1 Performance bug. To be clear, the issue is still important to Fx. This particular priority setting is very specific to the performance team and does not denote the overall priority to the team responsible for the fix.

The Performance Priority Calculator has determined this bug's performance priority to be P1. If you'd like to request re-triage, you can reset the Performance flag to "?" or needinfo the triage sheriff.

Platforms: Android
Impact on browser UI: Renders browser effectively unusable
Impact on site: Renders site effectively unusable
Page load impact: Severe
[x] Bug affects multiple sites

Performance Impact: P2 → P1

The severity field for this bug is set to S3. However, the Performance Impact field flags this bug as having a high impact on the performance.
:jesup, could you consider increasing the severity of this performance-impacting bug? Alternatively, if you think the performance impact is lower than previously assessed, could you request a re-triage from the performance team by setting the Performance Impact flag to "?"?

For more information, please visit auto_nag documentation.

Flags: needinfo?(rjesup)
Severity: S3 → S2
Priority: P3 → P2

Bumped priority and severity.

My concern with addressing this via DoH is that it's only available in certain regions (and may also be blocked by certain providers), so it's not a general fix. It may be a useful partial fix or bandaid, however, and lower impact/cost to implement.

Anecdotally, I frequently see long (seconds) delays on mobile Fenix with the loading indicator stalled and the lock icon not yet showing locked (i.e. we don't have TLS up, and my bet is we're in DNS lookup). I imagine this may be the same issue. These usually happen when going to a site from the Google news feed (so likely applink).

Denis/perf team - do you still see this? Can you get profiles showing this? Or even better (though perhaps a pain to get), logs with the settings in about:networking.

Flags: needinfo?(rjesup)
Flags: needinfo?(dpalmeiro)
Flags: needinfo?(bas)

I do also see the occasional issues with loading being frozen the same way as described by Randell. I haven't captured this in a profile though; perhaps Denis has recently?

Flags: needinfo?(bas)
Whiteboard: [necko-triaged] → [necko-triaged][necko-priority-next]

Yes, this is still an occasional issue, although the delays are usually not as long as in the originally reported profiles; I have mostly addressed those by switching to Cloudflare's DNS servers.

Flags: needinfo?(dpalmeiro)

In bug 1838240 we migrated the DNS probes to Glean so we can look at this on Fenix.

Because DNS resolution time is part of the pageload_event, we can also look at it from this perspective. See https://bugzilla.mozilla.org/show_bug.cgi?id=1838240#c1

Severity: S2 → S3

While DoH for Android progresses, bug 1874464 may provide a solution to this.

See Also: → 1874464

Ni? me to figure out if long resolution times are limited to older Android versions.

Flags: needinfo?(valentin.gosu)
Whiteboard: [necko-triaged][necko-priority-next] → [necko-triaged][necko-priority-review]
Blocks: perf-android
See Also: → 1122907
Blocks: 1894804
Flags: needinfo?(valentin.gosu)

I think the approach here should be something similar to bug 1895226.
If we see that the browser doesn't have IPv6 connectivity and/or IPv6 DNS requests keep timing out, we can avoid doing them at all.
We have to be careful about the fact that IPv6 connectivity to the internet may be missing, but the local network might still depend on IPv6 working.
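
To make the idea concrete, here is a rough Python sketch of such a gate. It is not the logic from bug 1895226: the class name Ipv6Gate, the threshold of consecutive timeouts, and the crude "localhost or .local" test for local names are all assumptions chosen for illustration.

```python
import socket


class Ipv6Gate:
    """Stop issuing AAAA lookups after `limit` consecutive IPv6 timeouts."""

    def __init__(self, limit=3):
        self.limit = limit
        self.consecutive_timeouts = 0

    def record(self, timed_out):
        # Any IPv6 lookup that completes resets the counter.
        self.consecutive_timeouts = self.consecutive_timeouts + 1 if timed_out else 0

    def family_for(self, host):
        # Never restrict local names (hypothetical heuristic): the local
        # network may depend on IPv6 even without IPv6 internet connectivity.
        if host == "localhost" or host.endswith(".local"):
            return socket.AF_UNSPEC
        if self.consecutive_timeouts >= self.limit:
            return socket.AF_INET  # IPv4 only: skip the AAAA query
        return socket.AF_UNSPEC    # ask for both address families
```

The returned value could be passed as the family argument of socket.getaddrinfo. A real implementation would also need to re-probe IPv6 periodically or on network change, since a closed gate never sees another IPv6 result on its own.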

This is an older bug, so let me summarize what it describes.

The user is experiencing slow or stalled DNS requests due to their ISP-provided DNS resolvers being problematic. This can happen when an ISP assigns DNS servers that are slow or unreliable (comment 2).

This is not a problem when using Chrome on the same network (we believe because they use a fallback resolver).

In Firefox, changing DNS servers or using DoH resolves the problem.

As we're not expecting to roll out DoH for Android in the short term, maybe a solution could be to race the https resource records query, bug 1852752?
See also: bug 1895908

Increasing the severity level to match the high performance impact, but this decision is nuanced, so let me know if this adjustment doesn't seem fully accurate or appropriate.

Severity: S3 → S2

I think we should move this back to S3. We have workarounds (change DNS to, say, 8.8.8.8, or use DoH), and it likely doesn't affect many users, though it's hard to tell how many. The impact on an individual user usually isn't horrible unless DNS is totally broken, and if so it really is an ISP/etc. issue, even if Chrome manages to work around it. So S3, even if a higher-priority S3, imo.

Flags: needinfo?(acreskey)
S2	(Serious) Major functionality/product severely impaired or a high impact issue and a satisfactory workaround does not exist
S3	(Normal) Blocks non-critical functionality and a work around exists

Agreed :)
But we will try to find a solution for our Android users.

Severity: S2 → S3
Performance Impact: high → medium
Flags: needinfo?(acreskey)
Priority: P2 → P1
Whiteboard: [necko-triaged][necko-priority-review] → [necko-triaged][necko-priority-queue]

According to Ilya Grigorik's chapter "High Performance Networking in Chrome" in "The Performance of Open Source Applications", Chrome uses its own async resolver, which mitigates slow DNS queries via the following methods:

https://aosabook.org/en/posa/high-performance-networking-in-chrome.html


  • better control of retransmission timers, and ability to execute multiple queries in parallel
  • visibility into record TTLs, which allows Chrome to refresh popular records ahead of time
  • better behavior for dual stack implementations (IPv4 and IPv6)
  • failovers to different servers, based on RTT or other signals
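
The last item, failover based on RTT, can be sketched roughly as follows. This Python sketch is not Chrome's implementation: FailoverResolver, the query(server, host) callable, the fixed failure penalty, and the 0.5 smoothing weight are all assumptions made for illustration.

```python
import time


class FailoverResolver:
    """Try DNS servers in order of smoothed RTT, moving on when one fails."""

    def __init__(self, servers, query, penalty=5.0):
        # `query(server, host)` is a hypothetical callable that returns a
        # list of addresses or raises OSError (for example on timeout).
        self.query = query
        self.penalty = penalty  # RTT in seconds charged to a failed server
        self.rtt = {server: 0.0 for server in servers}

    def _record(self, server, sample):
        # Exponentially weighted moving average, weight 0.5 on the new sample.
        self.rtt[server] = 0.5 * self.rtt[server] + 0.5 * sample

    def resolve(self, host):
        # sorted() is stable, so untried servers keep their configured order.
        for server in sorted(self.rtt, key=self.rtt.get):
            start = time.monotonic()
            try:
                addresses = self.query(server, host)
            except OSError:
                self._record(server, self.penalty)
                continue
            self._record(server, time.monotonic() - start)
            return server, addresses
        raise OSError(f"no server answered for {host}")
```

With this shape, a slow or failing ISP server is tried first only once; later lookups go straight to whichever server has been answering fastest, until the penalty decays.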

We've pulled this from the priority queue because adding a fallback resolver isn't directly actionable and is also beyond scope for the priority queue. This bug will be taken to DNS/DoH strategy scope.

Whiteboard: [necko-triaged][necko-priority-queue] → [necko-triaged]
Summary: Extremely long DNS resolution times with Fenix → Extremely long DNS resolution times with Fenix, no fallback resolver
See Also: 1664878, 1801530

Denis, Necko folks have just resolved a DNS performance issue, bug 1122907, which may affect this scenario.
But the change is preffed off by default for now.

If you're still seeing the issue in this bug, can you try turning on the pref network.dns.skip_ipv6_when_no_addresses to see if it helps?

Flags: needinfo?(dpalmeiro)

This hasn't been a problem for me for a while now so I can't really provide any useful feedback on that pref unfortunately.

Flags: needinfo?(dpalmeiro)