Closed Bug 1583298 Opened 5 years ago Closed 2 years ago

Determine if we are indeed spending up to 3x as long in DNS resolution as Chrome on some sites

Categories

(Core :: Performance, task, P1)

ARM64
Android
task

Tracking

()

RESOLVED WORKSFORME
Performance Impact medium
Tracking Status
firefox99 - ---

People

(Reporter: acreskey, Assigned: acreskey)

References

Details

(Keywords: perf:pageload, Whiteboard: [performance-pageload])

Attachments

(5 files)

While looking at profiles generated as part of Bug 1582820, Bas and I noticed that we may be spending considerably more time in DNS resolution than Chrome.

This is captured in browsertime and so I collected it here.
Note the rows for imdb and booking.com in particular, where we may be spending over 3x as long:
Also note that not all early navigation metrics are reported by both browsers, so this data is incomplete.

This may be a problem in this profile, as we see numerous requests with 100ms at "After DNS request", which should correspond to domain lookup end:
https://perfht.ml/2l2AnmD

See Also: → 1508326

This profile includes DNS resolver threads:
https://perfht.ml/2l7YRuL

See Also: → 1583230

I've spent a bit of time going over the results from my browsertime live runs and also the results from a long-running test that :denispal ran in July.

You can find a plain-text summary of my live site data here (fenix_none is Fenix with tracking protection disabled)
https://paste.rs/LbS

And Denis' results (subset of sites that overlapped with tp6p1)
https://paste.rs/pEb

One area where the datasets agree is the last early navigation-metric time point, responseEnd.
They both show Fenix being significantly faster than Chrome: a median of 411ms for Fenix and 563.5ms for Chrome in my dataset (368.5ms to 481ms in favor of Fenix in Denis' data).

The performance difference varies between sites:

Denis' results show generally matching site-to-site performance differences, with a couple of exceptions.

In both datasets, Chrome and Fenix are roughly equal at responseStart time.
However Chrome reports being significantly slower by responseEnd.

Looking at the sum of the early metrics in my run, you can see Chrome's longer serverResponseTime (responseEnd - requestStart in browsertime), caused by the longer delivery time:

Attached image Early metrics by site

This is an overall view of our early metrics performance on these pages.
The sites with the long fetchStarts include a redirection.

So, going back to the bug as logged, are we spending more time doing DNS resolution?
Other than perhaps fetchStart, this may be the only metric we can improve on.
My dataset says yes, Denis' says no.
So I think I'll collect results from a third party.

Site by site domain lookup time from my results.

As another datapoint, these are the results from :nalexander's live site tests in April 2019.
(These particular ones are from the Moto G5)
https://paste.rs/RlO

Here Fenix is faster for the median result, but slower for the mean (due to slow outliers which you can see in the attached graph).
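
To illustrate how a few slow outliers can flip the comparison between median and mean, here is a minimal sketch. The sample values are hypothetical and chosen only to show the effect; they are not taken from any of the datasets in this bug.

```python
from statistics import mean, median

# Hypothetical domain-lookup samples in ms (illustrative only, not real data).
browser_a = [20, 22, 25, 27, 30, 400]  # mostly fast, with one slow outlier
browser_b = [35, 36, 38, 40, 41, 43]   # consistently moderate


def summarize(samples):
    """Return (median, mean) for a list of timings in ms."""
    return median(samples), mean(samples)


a_median, a_mean = summarize(browser_a)
b_median, b_mean = summarize(browser_b)

# browser_a wins on the median but loses on the mean because of the outlier.
print(f"A: median={a_median} mean={a_mean:.1f}")
print(f"B: median={b_median} mean={b_mean:.1f}")
```

This is why it helps to look at both statistics, and at the distribution itself, before concluding that one browser is faster.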

Again, Fenix is faster than Chrome in overall early navigation metrics (up to responseEnd).

My own view is that these browser-reported early navigation metrics should be taken with a grain of salt.

I think more profiling would be valuable to get a better idea of how many requests spend a large amount of time waiting for DNS resolution.

Just collecting some ideas if we want to experiment with this:
• I see that we do DNS prefetching in certain cases -- perhaps this could be expanded on or tuned?

• One of the reasons that Chrome dropped the OS's getaddrinfo() in favor of their own resolver was so they could "failover to different servers based on RTT or other signals". This could be worth trying.
With our own resolver, we could also race DNS requests, also discussed here
https://groups.google.com/forum/#!topic/mozilla.dev.tech.network/D8UmrLZZh5k

• The resolver thread pool size could be experimented with. We saw a lot of requests that were blocked on what looked like DNS. Maybe high-core-count CPUs would do better with more threads.

• With DNS-over-https coming online soon, maybe that's what I should be looking at?
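
As a rough sketch of the "race DNS requests" idea from the list above: run several lookups concurrently and take the first one that succeeds. This is not how Necko or Chrome implement it. The resolver functions below are hypothetical stand-ins that simulate latency with sleep, and the returned addresses are documentation-range placeholders.

```python
import time
from concurrent.futures import ThreadPoolExecutor, as_completed


def race_lookups(resolvers, host):
    """Run every resolver concurrently and return the first successful answer."""
    pool = ThreadPoolExecutor(max_workers=len(resolvers))
    try:
        futures = [pool.submit(resolve, host) for resolve in resolvers]
        errors = []
        for future in as_completed(futures):
            try:
                return future.result()
            except OSError as exc:
                # A failed resolver does not end the race; keep waiting.
                errors.append(exc)
        raise OSError(f"all resolvers failed for {host}: {errors}")
    finally:
        # Do not block on the slower lookups still in flight.
        pool.shutdown(wait=False)


# Hypothetical resolvers standing in for queries to different DNS servers.
def slow_resolver(host):
    time.sleep(0.2)
    return ("slow", "192.0.2.1")


def fast_resolver(host):
    time.sleep(0.01)
    return ("fast", "192.0.2.2")


def failing_resolver(host):
    raise OSError("simulated failure")


winner = race_lookups([slow_resolver, fast_resolver, failing_resolver], "example.com")
print(winner)
```

The trade-off is extra DNS traffic in exchange for lower tail latency, so any real experiment would need to measure both.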

I ran some tests where I increased the size of the resolver thread pool, hoping to coax some performance out of devices like the Octa-Core Pixel 3.
But it looks like increasing the pool sizes on Pixel 3 doesn't improve any meaningful metrics:
https://docs.google.com/spreadsheets/d/1bhdXkziUFBE5f4PbOiKN561TVAGAG6A8zrT--Uxq5Bw/edit#gid=1015141094
Mean backend time may be improved, but nothing user-facing.

See Also: → 1596935

From this comment, it looks like we are misreporting our DNS resolution times (they end up exaggerated):
https://bugzilla.mozilla.org/show_bug.cgi?id=1596935#c21

Performance Impact: --- → ?
Whiteboard: [qf:investigate]

hey Andrew, is there anything more to do here? Dragana mentioned bug 1626958 to fix the misreporting, but that bug doesn't seem explicit about this...

Thanks!

Flags: needinfo?(acreskey)
See Also: → 1761528

To me this still seems important, so I logged Bug 1761528 to see if the reporting is indeed incorrect and to see if we can fix it.

See Also: → 1626958
Depends on: 1761528
Flags: needinfo?(acreskey)
See Also: 1761528
Severity: normal → S1
Performance Impact: ? → P2
Keywords: perf:pageload
Priority: -- → P1
Whiteboard: [performance-pageload
Whiteboard: [performance-pageload → [performance-pageload]

99 is about to ship, not tracking for this release.

Assignee: nobody → acreskey

One thing to note: these measurements were all made with OS-based DNS lookup, not DoH.

I'm going to see if this Chrome to Firefox discrepancy still reproduces.

Using the live site capabilities of our performance infrastructure, I'm not seeing a significant difference in DNS resolution times between Chrome and Firefox (Desktop and Pixel 2).
https://docs.google.com/spreadsheets/d/1iUuMtNZAIWMOO9SmBS_70RmIKnLK0DA_zj9MSOnD-eE/edit#gid=490683402

But note that the reported DNS times (domainLookupEnd - domainLookupStart) in the desktop tests are just a few milliseconds on both Firefox and Chrome.
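
For reference, a minimal sketch of how these derived values can be computed from Navigation Timing fields. The field names are the standard Navigation Timing ones; the helper names and the sample numbers are hypothetical and are not taken from the spreadsheet.

```python
def dns_time(timing):
    """DNS time as used in this bug: domainLookupEnd - domainLookupStart (ms)."""
    return timing["domainLookupEnd"] - timing["domainLookupStart"]


def server_response_time(timing):
    """serverResponseTime as defined earlier in this bug: responseEnd - requestStart (ms)."""
    return timing["responseEnd"] - timing["requestStart"]


# Hypothetical timing entry in ms relative to navigation start (illustrative only).
sample = {
    "domainLookupStart": 12,
    "domainLookupEnd": 15,
    "requestStart": 60,
    "responseEnd": 300,
}

print("dns:", dns_time(sample), "ms")
print("serverResponseTime:", server_response_time(sample), "ms")
```

Note that a cached lookup can report domainLookupStart equal to domainLookupEnd, which yields a DNS time of 0.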

So I'll do a small test locally as a sanity check.

Locally, on Desktop I'm seeing that we start the DNS lookup later than Chrome, but ultimately we complete the request in a similar amount of time.
https://docs.google.com/spreadsheets/d/1iUuMtNZAIWMOO9SmBS_70RmIKnLK0DA_zj9MSOnD-eE/edit#gid=490683402&range=65:75x

This bug was opened 3 years ago, but at the moment, I don't see any evidence that supports keeping it open.

In Bug 1761528 we verified that DNS resolution times are now accurate.
With the accuracy verified and because I'm no longer reproducing a performance difference between Chrome and Firefox, I'm going to close this bug.

Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → WORKSFORME

With this being closed, do you know if bug 1702025 or bug 1664492, which are other Fenix DNS issues, can be closed?

Flags: needinfo?(acreskey)

Michael, I think the first thing would be to verify the current state of DNS over HTTPS on Fenix.

To my understanding, both bug 1702025 and bug 1664492 occurred when we used the legacy native DNS resolution.

Since I think it's fair to say that we won't be investing in improving that workflow, if these bugs can't be reproduced with DoH, then I think it's fair to at least lower their priority.

Flags: needinfo?(acreskey)