Closed Bug 1702025 Opened 4 years ago Closed 15 days ago

Slow DNS times (500ms+) seen on Fenix applink cold startup

Categories

(Core :: Networking: DNS, defect, P3)

RESOLVED WORKSFORME

People

(Reporter: acreskey, Unassigned)

References

(Depends on 1 open bug, Blocks 1 open bug)

Details

(Whiteboard: [necko-triaged])

We are sometimes seeing exceptionally slow DNS resolution times in Fenix applink cold startup scenarios.

In this case the app is launched directly by the android OS with a target URL.

DNS resolution of ~1900ms
https://share.firefox.dev/3cAhy2g

DNS resolution of 527ms
https://share.firefox.dev/2PKO5cN

These profiles were captured by different developers, both on Moto G5.

I believe this is a different root cause from Bug 1664492 where the user's ISP was at fault.

I managed to get similarly poor times for WARM/HOT page loads while navigating to these pages from the in-app UI (i.e. not via Intents).

I also tried on more heavily trafficked sites (roku.com, amazon.com, facebook.com) but got more reasonable results (< 65ms, usually much less).

:mcomella provided a profile with dns threads captured.
This one is only 150ms, but unfortunately the time is spent in android_getaddrinfofornetcontext:
https://share.firefox.dev/3czxUYU

So yeah, we rely on the OS for the actual DNS resolution, and if the time is being spent in getaddrinfo, it looks like either this is beyond our control or we're going to have to be clever. E.g. is there something that Android is doing that makes the first lookups take longer? Are there just too many lookups going on? I need to look at the profile later.
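
One rough way to probe the "first lookups take longer" question outside the browser is to time the first call to the OS resolver against immediate repeats. The sketch below is only an illustration: it uses Python's socket.getaddrinfo on whatever machine runs it, not Android's android_getaddrinfofornetcontext, and the helper names (time_lookup, compare_first_and_repeat) are hypothetical.

```python
import socket
import time


def time_lookup(host, resolver=socket.getaddrinfo, clock=time.perf_counter):
    """Return (elapsed_ms, succeeded) for a single resolver call.

    `resolver` defaults to the OS resolver; it is injectable so the
    timing logic can be exercised without touching the network.
    """
    start = clock()
    try:
        resolver(host, None)
        succeeded = True
    except OSError:
        # socket.gaierror is a subclass of OSError.
        succeeded = False
    return (clock() - start) * 1000.0, succeeded


def compare_first_and_repeat(host, repeats=3, resolver=socket.getaddrinfo):
    """Time the first lookup of `host`, then `repeats` follow-up lookups.

    A large gap between the first value and the rest would suggest a
    cold-cache or resolver start-up cost rather than a slow network.
    """
    first_ms, _ = time_lookup(host, resolver)
    repeat_ms = [time_lookup(host, resolver)[0] for _ in range(repeats)]
    return first_ms, repeat_ms


if __name__ == "__main__":
    first, rest = compare_first_and_repeat("example.com")
    print(f"first: {first:.1f}ms, repeats: {[round(ms, 1) for ms in rest]}")
```

The numbers this prints depend on the local resolver cache and network, so they are only comparable within a single run on a single device.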

Meanwhile, curious what Valentin thinks, though he's away until next week.

Flags: needinfo?(valentin.gosu)

I'm also curious whether, in the cases where we see really long delays, the time is likewise spent in Android's getaddrinfo.

:mcomella, as we were discussing, ni'ed you to try with DoH enabled

Flags: needinfo?(michael.l.comella)

Is there a way to confirm DoH is enabled from the profile (or even the device)? I set network.trr.mode=3 (I didn't see a doh-rollout.enabled config option on Android) and I got a 500ms resolution again. However, 240ms is spent in each of two calls to Android's getaddrinfo, so I wonder if DoH was working. Here's the profile: https://share.firefox.dev/3ucNVtZ

Flags: needinfo?(michael.l.comella)

On Android we don't turn on DoH by default. It can only be enabled via about:config, since we don't have any UI for it in Fenix.

This specific delay is caused exclusively by the call to getaddrinfo in the libc implementation.
Using DoH might improve it in the edge cases where this is taking too long - the profiles show long waits in fread - so it's waiting for the DNS daemon to return something.

I don't know if we can do much about this issue. We can probably improve some of the corner cases with DoH, but our DoH implementation also uses regular DNS to bootstrap & all, so it might not be an all around fix. That said, I don't think we've ever looked at DoH performance on Fenix so I'd be interested to see if that fixes anything.
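
To make the bootstrap point concrete, here is a minimal sketch of how a DoH request in the RFC 8484 GET form is put together. It is not a description of what Gecko's TRR code actually sends, and the endpoint https://doh.example/dns-query is a hypothetical placeholder. The request is an ordinary HTTPS URL, so the endpoint's own hostname still has to be resolved somehow before the first DoH query can go out.

```python
import base64
import struct


def build_dns_query(hostname, qtype=1):
    """Build a DNS wire-format query (QTYPE 1 = A record, QCLASS IN).

    The message ID is 0, as RFC 8484 recommends for cache friendliness,
    and only the RD (recursion desired) flag is set.
    """
    header = struct.pack("!HHHHHH", 0, 0x0100, 1, 0, 0, 0)
    qname = b""
    for label in hostname.rstrip(".").encode("ascii").split(b"."):
        if not 1 <= len(label) <= 63:
            raise ValueError("each DNS label must be 1 to 63 bytes long")
        qname += bytes([len(label)]) + label
    qname += b"\x00"  # root label terminates the name
    return header + qname + struct.pack("!HH", qtype, 1)


def doh_get_url(hostname, endpoint):
    """Return the RFC 8484 GET URL: base64url query with padding removed."""
    query = build_dns_query(hostname)
    encoded = base64.urlsafe_b64encode(query).rstrip(b"=").decode("ascii")
    return f"{endpoint}?dns={encoded}"


if __name__ == "__main__":
    # Hypothetical endpoint; its hostname needs a regular lookup first.
    print(doh_get_url("www.example.com", "https://doh.example/dns-query"))
```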

Flags: needinfo?(valentin.gosu)

(In reply to Michael Comella (:mcomella) [needinfo or I won't see it] from comment #5)

> Is there a way to confirm DoH is enabled from the profile (or even the device)? I set network.trr.mode=3 (I didn't see a doh-rollout.enabled config option on Android) and I got a 500ms resolution again. However, 240ms is spent in each of two calls to Android's getaddrinfo, so I wonder if DoH was working. Here's the profile: https://share.firefox.dev/3ucNVtZ

You can check in about:networking if we used DoH for that resolution.
Also, the pref value should be reflected in about:support.

Put this in P3, since the delay is not caused by our code.

Severity: -- → S4
Priority: -- → P3
Whiteboard: [necko-triaged]

ni' myself to see if this still happens, and also whether it's fixed by the pref added in bug 1122907.

Flags: needinfo?(acreskey)
See Also: → 1122907

I haven't reproduced this myself, but if anyone is seeing this behaviour I would like to hear about it.

Flags: needinfo?(acreskey)
Depends on: 1926244

We now have accurate telemetry on applink initial DNS timings.

From this set, the timings don't look problematic:

p25: 0ms
p50: 0ms
p75: 3ms
p95: 45ms
p99: 150ms
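
For anyone comparing their own captured timings against these figures, percentiles of this kind can be computed from raw samples roughly as follows. This is a sketch using the nearest-rank definition; the telemetry pipeline may well use a different estimator, and the sample values in the usage section are made up for illustration.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: the smallest sample with at least p% of
    the samples at or below it. `p` is an integer from 1 to 100."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(samples_ms, points=(25, 50, 75, 95, 99)):
    """Return a {"p25": ..., "p50": ...} summary of DNS timings in ms."""
    return {f"p{p}": percentile(samples_ms, p) for p in points}


if __name__ == "__main__":
    # Made-up timings in milliseconds, not real telemetry.
    timings_ms = [0, 0, 0, 1, 2, 3, 5, 40, 45, 150]
    print(summarize(timings_ms))
```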

If anyone can reproduce a problem and capture a profile, please feel free to re-open the bug.

Status: NEW → RESOLVED
Closed: 15 days ago
Resolution: --- → WORKSFORME