Open Bug 1664492 Opened 4 years ago Updated 10 days ago

Extremely long DNS resolution times with Fenix

Categories

(Core :: Networking: DNS, defect, P2)

Unspecified
Android
defect

Tracking

()

Performance Impact high

People

(Reporter: denispal, Unassigned, NeedInfo)

References

(Blocks 1 open bug)

Details

(Keywords: perf:pageload, perf:responsiveness, Whiteboard: [necko-triaged][necko-priority-review])

I am seeing some incredibly long DNS resolution times in Fenix, upwards of 5-10 seconds. This is most noticeable from applink navigations, but I see it occasionally during normal navigation as well.

Here are some profiles:
https://share.firefox.dev/3bS7hN4
https://share.firefox.dev/3m96vQm

:acreskey noticed most of the time is spent sitting in Android's getaddrinfo. There is mention of Chrome using their own DNS resolver that uses failovers to different servers so it could help explain the difference I am seeing between Firefox and other browsers here.

Seems my ISP is giving me a pretty slow DNS server with DHCP. Simply changing DNS servers fixes this for me, or enabling DoH/VPN also fixes it. Seems like we might not have any failover for a slow DNS request which may leave users blaming Fenix for slow performance. I haven't been experiencing this problem in the other webview browsers.
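
To check whether the system resolver is the slow part outside of a profiler, a lookup can be timed directly. The following is a minimal sketch, not anything from Gecko or Fenix: `time_lookup` is a hypothetical helper name, and it relies on Python's `socket.getaddrinfo`, which wraps the platform's getaddrinfo. The `resolve` parameter exists only so a different resolver function can be swapped in.

```python
import socket
import time


def time_lookup(host, resolve=socket.getaddrinfo):
    """Time a single lookup; return (elapsed_seconds, result_or_exception).

    `resolve` defaults to the system resolver via socket.getaddrinfo.
    It is called as resolve(host, None), so any replacement must accept
    a host and a port argument.
    """
    start = time.perf_counter()
    try:
        result = resolve(host, None)
    except OSError as exc:  # socket.gaierror is a subclass of OSError
        result = exc
    return time.perf_counter() - start, result


# Example call (performs a real lookup through the system resolver):
#   elapsed, result = time_lookup("example.com")
```

Running this before and after changing the configured DNS server would show whether the delay follows the resolver, as described above. Note that the operating system may cache results, so repeated lookups of the same name can be faster than the first.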

Component: Performance → Networking: DNS
Whiteboard: [qf] → [qf:p1:responsiveness]
Whiteboard: [qf:p1:responsiveness] → [qf:p1:pageload]

(In reply to Denis Palmeiro [:denispal] from comment #2)

Seems my ISP is giving me a pretty slow DNS server with DHCP. Simply changing DNS servers fixes this for me, or enabling DoH/VPN also fixes it. Seems like we might not have any failover for a slow DNS request which may leave users blaming Fenix for slow performance. I haven't been experiencing this problem in the other webview browsers.

I think there is not much we can do if the system resolver is slow.
Valentin, what do you think?

Flags: needinfo?(valentin.gosu)

I agree.
From what I can tell, mobile Chrome sometimes uses their DoH implementation to do DNS prefetches, but otherwise they also use getaddrinfo, so they should have the same performance.

Flags: needinfo?(valentin.gosu)

Keeping this open, but unfortunately there's not much we can do about it.
We can't control which DNS servers we use with getaddrinfo.
We might improve this with DoH - see bug 1664878.

Severity: -- → S3
OS: Unspecified → Android
Priority: -- → P3
See Also: → 1664878
Whiteboard: [qf:p1:pageload] → [qf:p1:pageload][necko-triaged]

https://glam.telemetry.mozilla.org/firefox/probe/dns_lookup_time/explore?process=parent suggests slow DNS lookup times are a real issue. GLAM doesn't seem to give information to me on the DoH lookup time, do we have some idea of how it compares? And do we have data on mobile?

Flags: needinfo?(valentin.gosu)

DoH has never been enabled on mobile so we don't have any data on how it would perform.

For desktop, you can compare the lookup times between these two probes.
https://glam.telemetry.mozilla.org/firefox/probe/dns_native_lookup_time/explore?channel=release&process=parent
https://glam.telemetry.mozilla.org/firefox/probe/dns_trr_lookup_time3/explore?channel=release&process=parent

Flags: needinfo?(valentin.gosu)

This issue is very annoying, as pages often take an extremely long time to load. I must note that e.g. the DuckDuckGo browser, which uses either Chromium or WebView, is much faster at DNS resolution; I am not sure why.
Is there anything blocking DoH from being enabled, to at least try to work around it?

Performance Impact: --- → P1
Keywords: perf:pageload
Whiteboard: [qf:p1:pageload][necko-triaged] → [necko-triaged]

We investigated the possibility of racing the native resolver against DoH:

  • this would add some technical complexity, though not a great deal.
  • this would also need further investigation regarding policy compliance in different countries.

Considering that we want to turn on DoH for more users, we decided to investigate the possibility of enabling DoH for such users instead of implementing the racing. We are going to continue the investigation in this direction.
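
For readers unfamiliar with the racing idea mentioned above, here is a rough sketch of what "race two resolvers and take the first success" means in general. It is an illustration only and does not reflect how Gecko would implement it: `race_resolvers`, `slow_native`, and `fast_doh` are hypothetical names, the two stand-in functions simulate resolvers with a sleep rather than doing real lookups, and the addresses come from the reserved documentation range.

```python
import concurrent.futures as cf
import time


def race_resolvers(host, resolvers, timeout=5.0):
    """Run every resolver in parallel; return (name, addresses) of the
    first one that succeeds. `resolvers` maps a name to a callable
    taking the host. Raises OSError if all of them fail."""
    pool = cf.ThreadPoolExecutor(max_workers=len(resolvers))
    futures = {pool.submit(fn, host): name for name, fn in resolvers.items()}
    try:
        for fut in cf.as_completed(futures, timeout=timeout):
            if fut.exception() is None:
                return futures[fut], fut.result()
        raise OSError(f"all resolvers failed for {host!r}")
    finally:
        # Do not wait for the loser; drop anything still queued (Python 3.9+).
        pool.shutdown(wait=False, cancel_futures=True)


def slow_native(host):
    """Stand-in for a slow system resolver (simulated delay, no lookup)."""
    time.sleep(0.3)
    return ["192.0.2.1"]


def fast_doh(host):
    """Stand-in for a faster DoH resolver (no network traffic)."""
    return ["192.0.2.2"]


# Example call:
#   race_resolvers("example.test", {"native": slow_native, "doh": fast_doh})
```

The sketch leaves out everything that makes the real decision hard, including the policy questions listed above and what to do when the two resolvers return different answers.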

The Performance Priority Calculator has determined this bug's performance priority to be P2. If you'd like to request re-triage, you can reset the Performance flag to "?" or needinfo the triage sheriff.

Platforms: Android
Page load impact: Severe
[x] Bug affects multiple sites

Performance Impact: P1 → P2

Why P2? Even on P1 it’s taking extremely long for this bug to even be looked at. It impacts me every day every single time I open any webpage on my phone. I’m losing 30 to 90 seconds on every webpage every day just because I cannot enable DoH to speed up DNS lookups.

If you’re downgrading this issue, you could’ve at least made DoH available as an opt-in setting on the stable channel.

Hold on, which one is higher, P1 or P2? I’m confused.

(In reply to Andrej Shadura from comment #11)

Why P2? Even on P1 it’s taking extremely long for this bug to even be looked at. It impacts me every day every single time I open any webpage on my phone. I’m losing 30 to 90 seconds on every webpage every day just because I cannot enable DoH to speed up DNS lookups.

If you’re downgrading this issue, you could’ve at least made DoH available as an opt-in setting on the stable channel.

Apologies, I missed some criteria when re-triaging this bug, and it should indeed be a P1 Performance bug. To be clear, the issue is still important to Fx. This particular priority setting is very specific to the performance team and does not denote the overall priority to the team responsible for the fix.

The Performance Priority Calculator has determined this bug's performance priority to be P1. If you'd like to request re-triage, you can reset the Performance flag to "?" or needinfo the triage sheriff.

Platforms: Android
Impact on browser UI: Renders browser effectively unusable
Impact on site: Renders site effectively unusable
Page load impact: Severe
[x] Bug affects multiple sites

Performance Impact: P2 → P1

The severity field for this bug is set to S3. However, the Performance Impact field flags this bug as having a high impact on the performance.
:jesup, could you consider increasing the severity of this performance-impacting bug? Alternatively, if you think the performance impact is lower than previously assessed, could you request a re-triage from the performance team by setting the Performance Impact flag to "?"?

For more information, please visit auto_nag documentation.

Flags: needinfo?(rjesup)
Severity: S3 → S2
Priority: P3 → P2

Bumped priority and severity.

My concern with addressing this via DoH is that it's only available in certain regions (and may also be blocked by certain providers), so it's not a general fix. It may be a useful partial fix or bandaid, however, and lower impact/cost to implement.

Anecdotally, I frequently see long (seconds) delays on mobile Fenix with the loading indicator stalled and the lock icon not yet showing locked (i.e. we don't have TLS up, and my bet is we're in DNS lookup). I imagine this may be the same issue. These usually happen when going to a site from the Google newsfeed (so likely applink).

Denis/perf team - do you still see this? Can you get profiles showing this? Or even better (though perhaps a pain to get), logs with the settings in about:networking.

Flags: needinfo?(rjesup)
Flags: needinfo?(dpalmeiro)
Flags: needinfo?(bas)

I do also see the occasional issues with loading being frozen the same way as described by Randell. I haven't captured this in a profile though; perhaps Denis has recently?

Flags: needinfo?(bas)
Whiteboard: [necko-triaged] → [necko-triaged][necko-priority-next]

Yes, this is still an occasional issue, although the delays are not usually as long as originally reported in this profile, which I have mostly addressed by switching to Cloudflare's DNS servers.

Flags: needinfo?(dpalmeiro)

In bug 1838240 we migrated the DNS probes to Glean so we can look at this on Fenix.

Because DNS resolution time is part of the pageload_event, we can also look at it from this perspective. See https://bugzilla.mozilla.org/show_bug.cgi?id=1838240#c1

Severity: S2 → S3

While DoH for Android progresses, bug 1874464 may provide a solution to this.

See Also: → 1874464

Ni? me to figure out if long resolution times are limited to older Android versions.

Flags: needinfo?(valentin.gosu)
Whiteboard: [necko-triaged][necko-priority-next] → [necko-triaged][necko-priority-review]
Blocks: 1889193