Closed Bug 1524609 Opened 5 years ago Closed 5 years ago

Investigate performance impact of tuning RCWN heuristics

Categories

(Core :: Performance, enhancement, P1)

RESOLVED INACTIVE
Tracking Status
firefox67 --- affected

People

(Reporter: acreskey, Assigned: acreskey)

References

(Blocks 1 open bug)

Attachments

(7 files)

The Necko "Race Cache With Network" (RCWN) system decides whether to fetch a given resource from the disk cache or from the network. The decision is based on several heuristics (e.g. disk cache speed, resource size).

A quick test of disabling the feature (network.http.rcwn.enabled=false) shows significant potential for performance improvements in multiple raptor tp6 page load tests.

e.g. load event and hero elements are 20%+ faster on some sites.

Some tests, notably instagram, appear to have regressed significantly.

https://treeherder.mozilla.org/perf.html#/compare?originalProject=mozilla-inbound&newProject=try&newRevision=225ed801cc271e0902260fb443f6dd75da173d15&framework=10&showOnlyComparable=1&selectedTimeRange=1209600

In addition, the Noise Metric (~ the sum of test std dev) dropped significantly.
e.g. windows10-64   down 59.39%.
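
As a rough illustration of the "sum of test std dev" description above, the sketch below computes such a figure from per-test replicate values and the percent change between two pushes. The function names and the sample values are hypothetical and made up for illustration, and Perfherder's actual Noise Metric formula may differ.

```python
from statistics import stdev


def noise_metric(results):
    """Sum of per-test sample standard deviations.

    `results` maps a test name to its list of replicate values (ms).
    This follows the approximate description in the comment above;
    it is not taken from Perfherder's source.
    """
    return sum(stdev(values) for values in results.values() if len(values) > 1)


def percent_change(base, new):
    """Signed percent change from base to new (negative means a drop)."""
    return (new - base) / base * 100.0


# Made-up replicate values, for illustration only.
base = {"test-a": [100.0, 120.0, 80.0], "test-b": [200.0, 260.0, 140.0]}
new = {"test-a": [100.0, 110.0, 90.0], "test-b": [200.0, 230.0, 170.0]}

base_noise = noise_metric(base)
new_noise = noise_metric(new)
change = percent_change(base_noise, new_noise)
```

With these invented inputs the summed deviation halves, so `change` is negative, which is the direction reported for windows10-64.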


Caveats:
The raptor tests are run in a lab, and perhaps the low-latency network may skew the value of RCWN.
In addition, the http sessions are played back from mitmproxy recordings.


This bug should cover the work of investigating the performance impact of RCWN tuning under "real world" conditions.

Hey Vicky, I'm thinking perhaps we should disable this while we're exploring tuning. What do you think? Is there any reason to leave it enabled given these results?

Flags: needinfo?(vchin)

Hi Selena,

One scenario where I suspect that RCWN is helping us is on machines like the reference laptops where the slow platter hard disk is perpetually chugging away.
Network resources may very well be in the cache but it may also take a very long time to retrieve them.
I don't have actual data on this though.

Disabling RCWN makes app-prod.js come in several times faster on https://www.youtube.com/tv/ based on some quick measurements (40-100ms down to ~10ms). This doesn't necessarily make the site faster in this case, since there's other work to do while we're waiting for the network, but it is a bad sign. This is on a very fast machine where the network is actually very fast at getting this file, so RCWN doesn't even seem to be helping in that case.

Running a try patch on the reference hardware (-ux) that disables RCWN:
https://treeherder.mozilla.org/perf.html#/compare?originalProject=mozilla-central&newProject=try&newRevision=136092e3f9435e91f4d0554d66d172cbb5f9b37a&framework=10&selectedTimeRange=604800

Not the best test because this is against recorded http sessions instead of live sites.

These tests on the reference hardware in the lab timeout as often as they succeed.
See: https://treeherder.mozilla.org/#/jobs?repo=try&revision=136092e3f9435e91f4d0554d66d172cbb5f9b37a&selectedJob=229015294
However overall they look to improve the raptor-measured metrics.

When I start this investigation the plan is to also collect data points from these sources:
-raptor suite against live sites (:rwood has a patch to enable this)
-web page test with a script to enable the flag (live sites)
-browsertime against live sites

:selena I'd like to see the results of turning this off against live sites first as outlined in comment 5.

Flags: needinfo?(vchin)

Now that raptor can run tp6 pageload tests against live sites (Bug 1531169) I've started running tests where we should expect to see the impact of RCWN tuning.

So this is a comparison between raptor live sites (Base, left) and raptor live sites with RCWN disabled (New, right):

https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=a1ea76df63c20c933b2a5b8826363147cf647979&newProject=try&newRevision=558f44100b18bee85ec7547c504116531d904fb7&framework=10

And the same test using the reference laptop in the lab (-ux): raptor live sites (Base, left) and raptor live sites with RCWN disabled (New, right):
https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=3aa96f138ab4cd33efa8bbe9cbf55bae3f8871e7&newProject=try&newRevision=be4bbbb12e811c54aeb5947c5bb393425129e26c&framework=10

I don't think I have enough runs in yet, and the meaning of these results is not clear, especially compared to the results from Comment 1 (recorded http sessions).

I'll run a live site vs live site comparison (no other changes) this weekend when I can get lots of jobs done quickly to get a better understanding of the baseline noise in live site comparisons.

One reason I'm using the raptor tp6 page load tests is that they are actually page reload tests. The initial page load is discarded ("cold loads" coming soon to raptor). So they really exercise the cached resource codepaths.

I have some results from local runs of raptor-tp6-1 (amazon,facebook,google,youtube)
Each run in raptor is 24 reloads of the page for each site. For the results below, I collected the results from 5 runs.
So 120 reloads of each site for each hardware config.

2017 Macbook Pro (great wifi), summary here:
https://docs.google.com/spreadsheets/d/1lYXjy0FiQJf-0qPXMGcDsBY9mwV-QOatblVqOm1w4ys/edit#gid=0&range=228:228

2017 Reference laptop (wired connection), summary here:
https://docs.google.com/spreadsheets/d/1lYXjy0FiQJf-0qPXMGcDsBY9mwV-QOatblVqOm1w4ys/edit#gid=1492341646&range=228:228

Although this is noisy data, it looks like RCWN is reducing loadtime on the reference laptop by ~10% (a lot!) for facebook and amazon. No effect on the other sites.

On the Macbook Pro there is some evidence that RCWN regresses loadtime on facebook and google (~10%), maybe youtube.

^^ Those runs are using raptor with live sites enabled.

Interesting! So, the Macbook Pro has an SSD, correct? Do we have telemetry on loadtime that we can correlate with disk type?

Flags: needinfo?(acreskey)

Yes, the Macbook Pro is indeed SSD. Unfortunately we don't have telemetry on disk type yet (I just logged it as Bug 1533861), but it looks like I can get the hdd model from the Telemetry Environment (Windows only), which could work.
I can also run local tests, e.g. using my external spinning platter drive on my MacBook Pro.

From my understanding of the Race Cache With Network code, it makes its decisions based on the queue of items to be retrieved from the disk cache and also if that cache retrieval is getting slow. So it should handle both SSD and spinning disk.
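
To make that description concrete, here is a minimal sketch of a decision of this shape. It is not Necko's implementation: the `CacheState` fields, both thresholds, and the function names are hypothetical. It assumes that "getting slow" means recent reads are slow relative to the same disk's own history, which is one way a single check could adapt to both SSD and spinning disk.

```python
from dataclasses import dataclass


@dataclass
class CacheState:
    queued_reads: int        # entries waiting in the disk cache I/O queue
    recent_read_ms: float    # recent average time to open a cache entry
    typical_read_ms: float   # long-term average for the same operation


# Illustrative thresholds only; these are not Necko's actual values.
QUEUE_THRESHOLD = 8
SLOWDOWN_FACTOR = 2.0


def cache_is_slow(state: CacheState) -> bool:
    """True when recent reads are much slower than this disk's own average."""
    return state.recent_read_ms > SLOWDOWN_FACTOR * state.typical_read_ms


def should_race_immediately(state: CacheState) -> bool:
    """Trigger the network request right away if the cache is backed up or slow."""
    return state.queued_reads > QUEUE_THRESHOLD or cache_is_slow(state)
```

Because the slowness check in this sketch is relative rather than absolute, a platter drive with a steady 50ms read time would not count as slow, while an SSD that suddenly goes from 1ms to 5ms would.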

Right now I'm trying to find a scenario where the feature provides a strong win or a strong loss so that I can compare profiles and get a better understanding of why.

Flags: needinfo?(acreskey)

So at the moment I don't think I'll be able to use results from raptor live sites on try because of the noise.

I compared tp6-1 and tp6-2 results from separate pushes of the same revision and even with high repeated job counts I'm seeing results that differ significantly.
e.g.
Amazon load time improved by 32.94% over 20 runs on OSX with no code change.
Amazon load time regressed by 27.18% on linux PGO over 25 runs with no code change.

I was expecting that as the repeat job count increased, the results from each push would converge.
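
The convergence expectation can be made concrete under one assumption: that each job is an independent draw from a single stable distribution. In that case the standard error of the mean shrinks with the square root of the job count. The sketch below uses hypothetical function names and a made-up mean and spread, not values from these pushes.

```python
import math


def standard_error(sample_stdev, n):
    """Standard error of the mean for n independent runs."""
    return sample_stdev / math.sqrt(n)


def max_expected_diff_pct(mean, sample_stdev, n, z=1.96):
    """Approximate 95% bound on the percent difference between two
    n-run means taken from the same build (normal approximation)."""
    se_diff = math.sqrt(2) * standard_error(sample_stdev, n)
    return z * se_diff / mean * 100.0


# Made-up example: 1000 ms mean loadtime with a 200 ms standard deviation.
bound_20_runs = max_expected_diff_pct(1000.0, 200.0, 20)
bound_25_runs = max_expected_diff_pct(1000.0, 200.0, 25)
```

Differences much larger than such a bound would suggest that the independence assumption does not hold, for example because live site content or network conditions changed between the two pushes.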

https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=c0dfc4da1fca8b3f054b96ed77f693aba21d3e5e&newProject=try&newRevision=26626d4540eb1f2e6917bcead61995db5ad7eced&framework=10

I've re-run a large series of raptor tp6 tests to get a better understanding of the impact of RCWN on our current test infrastructure:

https://treeherder.mozilla.org/perf.html#/compare?originalProject=mozilla-inbound&newProject=try&newRevision=81be09264a0f3ad0092a8b2c4a8ae0f22e38c043&framework=10&selectedTimeRange=1209600

I'm not seeing the drastic improvements in performance that I saw in Comment 1, although I am still seeing the Noise Metric being reduced significantly (most prominent on linux64 where it drops ~54%).

Updates:

I attempted to use the Windows Environment telemetry probe system.hdd.profile.model to compare by SSD / HDD. Unfortunately, as others who have tried this route found, the 8k+ entries make this impractical without first classifying them.

But while looking at the telemetry I did make this observation: the NetworkDelayedRace outcome is exceptionally rare (e.g. 0.25% of outcomes) if I read this correctly.
This is the scenario where the network wins even after it's been given a delayed start.

So I’ve tried experiments which exclude or minimize this code path (since it can still incur a cost to the parent process's main thread once the delayed nsHttpChannel is created)

1. Only RCWN if the cache is slow (otherwise just use cache)
https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=4a567ef71c971269481a6089b0d576252efe5a00&newProject=try&newRevision=b8f99d69619bfa841ad296a66afbbbf01af53936&framework=10
There may be some small gains here, in the 1-3% range on multiple sites.

2. Don't delay network requests when racing
https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=4a567ef71c971269481a6089b0d576252efe5a00&newProject=try&newRevision=daa7a113d2f309137cd240428c1664400191e1f2&framework=10
This appears to significantly regress multiple sites. e.g. 6.73% on amazon (osx), 9% on microsoft (osx)

:valentin, :michal, do you think there is potential in approach 1? Perhaps if the definition of a slow cache was modified?

I did discover something interesting while stepping through the cache code though:
on android the http memory cache size was fixed at 1MB about 10 years ago. This is probably way too small for modern android devices. Logged for investigation in Bug 1536171

Flags: needinfo?(valentin.gosu)
Flags: needinfo?(michal.novotny)

(In reply to Andrew Creskey from comment #16)

> But while looking at the telemetry I did make this observation: the NetworkDelayedRace outcome is exceptionally rare (e.g. 0.25% of outcomes) if I read this correctly.
> This is the scenario where the network wins even after it's been given a delayed start.
>
> So I’ve tried experiments which exclude or minimize this code path (since it can still incur a cost to the parent process's main thread once the delayed nsHttpChannel is created)

What we should try to minimize is the scenario when the cache wins when the delayed network request was triggered. Unfortunately, the current probe doesn't provide this information. CacheDelayedRace is reported regardless of whether the network request was sent or not.

> 1. Only RCWN if the cache is slow (otherwise just use cache)
> https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=4a567ef71c971269481a6089b0d576252efe5a00&newProject=try&newRevision=b8f99d69619bfa841ad296a66afbbbf01af53936&framework=10
> There may be some small gains here, in the 1-3% range on multiple sites.
>
> 2. Don't delay network requests when racing
> https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=4a567ef71c971269481a6089b0d576252efe5a00&newProject=try&newRevision=daa7a113d2f309137cd240428c1664400191e1f2&framework=10
> This appears to significantly regress multiple sites. e.g. 6.73% on amazon (osx), 9% on microsoft (osx)
>
> :valentin, :michal, do you think there is potential in approach 1? Perhaps if the definition of a slow cache was modified?

We definitely should not remove delayed racing because detecting slow cache is always tricky, so having some reasonable delay is good.
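
For readers unfamiliar with the pattern, delayed racing can be sketched with asyncio as below. This is an illustration, not Necko code: `race_cache_with_network`, `source`, and the latencies are hypothetical. The sketch also returns whether the network request was actually sent, which is the detail the reply above says the current probe does not capture.

```python
import asyncio


async def race_cache_with_network(read_cache, fetch_network, delay_s):
    """Start the cache read now and the network fetch after delay_s seconds.

    Returns (winner, value, network_sent). The loser is cancelled. If the
    cache finishes before the delay expires, the network fetch never starts.
    """
    state = {"network_sent": False}

    async def delayed_network():
        await asyncio.sleep(delay_s)
        state["network_sent"] = True
        return await fetch_network()

    cache_task = asyncio.create_task(read_cache())
    net_task = asyncio.create_task(delayed_network())
    done, pending = await asyncio.wait(
        {cache_task, net_task}, return_when=asyncio.FIRST_COMPLETED
    )
    for task in pending:
        task.cancel()
    await asyncio.gather(*pending, return_exceptions=True)
    # Prefer the cache if both happened to finish together.
    if cache_task in done:
        return "cache", cache_task.result(), state["network_sent"]
    return "network", net_task.result(), state["network_sent"]


async def source(latency_s, value):
    """Stand-in for a cache read or network fetch with a fixed latency."""
    await asyncio.sleep(latency_s)
    return value
```

A call such as `asyncio.run(race_cache_with_network(lambda: source(0.08, "cached"), lambda: source(0.2, "fresh"), 0.02))` models the case singled out above: the cache still wins, but only after the delayed network request has already been triggered.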

Flags: needinfo?(michal.novotny)

Thanks for the feedback Michal.

By the way, relative to physical drive types, I verified that raptor tests on these platforms are on SSD: linux64, windows10-64, windows7-32

And these are on platter HDD: osx-10-10, windows10-64-ux

Blocks: 1425268
Flags: needinfo?(valentin.gosu)
Priority: -- → P1

I haven't been able to spend a lot of time on this issue but I was able to collect results from a long-running live site test.
This was run on the Acer reference laptop using the Browsertime framework:

https://paste.rs/DDT

I found these datapoints to be interesting:
-disabling RCWN led to a 12% regression in median loadtime on the buzzfeed site, although mean firstPaint and firstContentfulPaint improved significantly (~40%)
-disabling RCWN led to 69% and 61% regressions in firstPaint and firstContentfulPaint, respectively, on the wired site
-disabling RCWN appears to improve most metrics on the washingtonpost site

As a datapoint, this is a raptor tp6 comparison of running with and without rcwn on android (Moto G5):
https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=2ff6be5092202f8b43b1757505a4c77a8c33ae15&newProject=try&newRevision=edbf678e015090d4f9778d8e7357a3324701f50c&framework=10

I would say that the results are within the error bars of the tests.

I plan to revisit this when I can fix the network latency in a test framework. (Likely through tsproxy or similar).

I ran a large scale pageload test over various fixed network conditions to try and find areas where RCWN was helping or else hindering performance. With those cases in mind the plan was to test variations on the RCWN tuning parameters.

Notes:
• These were run on the Moto G5 android device with Geckoview_Example (05/24/2019) and on the 2017 Reference Laptop (Acer-Aspire-i3) with Firefox Nightly 69.0a
• The web pages were recorded once and played back using Web Page Replay and Browsertime
• I used tsproxy to simulate network conditions.
• tsproxy defines its network presets [here](https://github.com/catapult-project/catapult/blob/484f9f764dc58973a0466e4bdf1bfd50c75165e2/telemetry/telemetry/page/traffic_setting.py#L39). I used 'NONE' (0MS), 'WIFI', and a custom setting of 50 ms rtt.
• The loadtime was measured on "warm" page loads -- i.e. the page was loaded once and then reloaded 25 times. It was the reload performance that was captured (to ensure more resources were in the network cache). I do have data from "cold" loads if anyone is interested.

This first round of testing was simple: baseline (RCWN on) and RCWN off.
I'll attach the raw data and boxplots generated w/ R.
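
For anyone working from the raw data without R, the numbers behind a boxplot can be approximated with the Python standard library. This is a sketch with hypothetical function names; it makes no assumption about the layout of the attached files, and R's default quartile method may give slightly different values.

```python
from statistics import median, quantiles


def boxplot_summary(loadtimes_ms):
    """Return (min, q1, median, q3, max) for one site and configuration."""
    q1, _, q3 = quantiles(loadtimes_ms, n=4, method="inclusive")
    return (min(loadtimes_ms), q1, median(loadtimes_ms), q3, max(loadtimes_ms))


def median_change_pct(baseline_ms, variant_ms):
    """Percent change in median loadtime; positive means the variant is slower."""
    base = median(baseline_ms)
    return (median(variant_ms) - base) / base * 100.0
```

Here `baseline_ms` would hold the reload times with RCWN on and `variant_ms` those with RCWN off, for the same site and network setting.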

Attached image g5_0MS.png

Moto G5 - 0ms rtt

Attached image g5_50ms.png

Moto G5 - 50ms rtt

Attached image g5_WIFI.png

Moto G5 - 'WIFI' setting (30 Mbps down and 2ms rtt)

Reference laptop - 0ms rtt

Reference laptop - 50ms rtt.

Reference laptop - 'WIFI' preset

Attached file browsertime-tests.zip

Raw results from browsertime for the above charts.

One thing that stands out to me:
On the reference laptop (slow platter drive) with low latency conditions (0ms and 'WIFI' (2ms)), we can see that disabling RCWN massively degrades performance and increases variance on the wired site (see Comment 25 and Comment 27).

Or put another way, RCWN is very helpful in improving performance and reducing variance on that site for the reference laptop.
But note that with an added 50ms of latency I'm not seeing the performance win for that site.

Unfortunately my android and laptop pagesets were a bit different so I don't have wired on Moto G5 to compare against.

But even so I don't see a clear path to tune RCWN based on these results.
They are in many ways like the perfherder changes that Michal and I have put up -- some small wins here and there and maybe some small losses here and there.
Perhaps my pageset doesn't capture enough sites like wired that are impacted by RCWN.
Perhaps my test methodology isn't ideal.

But I do believe that my initial hypothesis in Comment 1 -- disabling RCWN leads to large performance gains -- is wrong.
Note that I was comparing my revision against mozilla-central, and not a proper baseline parent revision. (My mistake as a newbie. Mozilla-central changes significantly.)

One more thought:
The "sterile" environment in which these tests are run (Windows with Windows Defender disabled, minimal set of processes running, 1 tab open, etc) is probably not ideal to surface cases where RCWN is helping.

A better scenario might be:
The user has 5 applications running, 10 tabs are open, OS is paging to disk, real-time virus checking is running, and Netflix is playing in a second window.
In this case it's easy to imagine RCWN being a big win even with higher latency network. But hard to test!

> The "sterile" environment in which these tests are run (Windows with Windows Defender disabled, minimal set of processes running, 1 tab open, etc) is probably not ideal to surface cases where RCWN is helping.
>
> A better scenario might be:
> The user has 5 applications running, 10 tabs are open, OS is paging to disk, real-time virus checking is running, and Netflix is playing in a second window.
> In this case it's easy to imagine RCWN being a big win even with higher latency network. But hard to test!

Right. When we landed RCWN, we had some telemetry around it; I don't know what it showed. The way to test this would be to land some changes and run an experiment in Nightly with A/B comparisons, and see how the telemetry differs. You can't compare on one site, but you can compare average loadtimes, cache hit rates, etc.

I was not able to find any performance improvements by tuning RCWN.

Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → INACTIVE