Open Bug 1502138 Opened 1 year ago Updated 3 months ago

[meta] Determine cause(s) of variable page load timing results

Categories

(Core :: Performance, task, P1)

task

Tracking


Tracking Status
firefox65 --- affected

People

(Reporter: acreskey, Assigned: acreskey)

References

(Depends on 4 open bugs)

Details

(Keywords: meta)

Attachments

(3 files)

When using Firefox we observe a high degree of variability in page load times.

Possibly the best way to reproduce this is to run the raptor tp6 tests locally.
These tests use mitmproxy to record pages and play them back locally, thus minimizing network variability.

But the recorded results still reveal that page load events occur at significantly different times on multiple runs.

Consider this example of the hero element timing on the captured amazon.com page:
(Note that the first sample is the cold load, and is thus expected to be slower)

"name": "raptor-tp6-amazon-firefox-hero:hero1", 
          "replicates": [
            1630, 
            848, 
            850, 
            835, 
            858, 
            846, 
            797, 
            854, 
            891, 
            778, 
            840, 
            847, 
            786, 
            820, 
            793, 
            833, 
            1177, 
            793, 
            795, 
            851, 
            846, 
            846, 
            791, 
            805, 
            823
          ]

Ignoring the first load sample, these results show a relative standard deviation of 9.15%.
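For reference, the 9.15% figure can be reproduced from the replicates above with a short Python snippet (sample standard deviation, cold-load sample dropped):

```python
from statistics import mean, stdev

# Replicates from the raptor-tp6-amazon hero1 run quoted above.
replicates = [1630, 848, 850, 835, 858, 846, 797, 854, 891, 778,
              840, 847, 786, 820, 793, 833, 1177, 793, 795, 851,
              846, 846, 791, 805, 823]

warm = replicates[1:]  # drop the first (cold load) sample
rsd = stdev(warm) / mean(warm) * 100  # relative standard deviation, percent
print(f"RSD: {rsd:.2f}%")  # -> RSD: 9.15%
```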

This bug is to determine the components, and the functionality within them, that are causing this high degree of variability.
Updates:

• The variance has been reproduced on Windows and OSX, locally and on lab hardware.

• The DAMP tools for minimizing noise (e.g. forcing GCs) cannot be used here since the tests are driven by a web extension without access to privileged APIs.
I did make a patch which forced GCs and CCs, but this did not reduce noise.

• An implementation of the setTimeout() clamp discussed here also did not reduce noise:
https://bugzilla.mozilla.org/show_bug.cgi?id=1373723

• Significant variance is seen in Chrome as well, although it's noticeably lower.

• As a test, disabling the baseline jit significantly reduces variance. However there are still very long outliers.

• When profiling subtests, large outliers can frequently be seen.
Repeatedly, these are caused by long network requests.

e.g.
Compare these profiles of the same page reload:
fast (hover to see that load 708 takes 1.2ms):
https://perf-html.io/public/2c1f6d9e42e5c9734cd139728116c8ab549a2d6b/network-chart/?globalTrackOrder=0-1-2-3-4&hiddenGlobalTracks=1-2-3&localTrackOrderByPid=23635-1-0~23638-0~23636-0~&range=1.9505_3.4173~2.0038_2.0852&thread=5&v=3

slow (hover to see that load 2085 takes 187ms):
https://perf-html.io/public/afc715b6d5f2385651a5aa03636aa88e02bbe1f1/network-chart/?globalTrackOrder=0-1-2-3-4&hiddenGlobalTracks=1-2-3&localTrackOrderByPid=23635-1-0~23638-0~23636-0~&range=1.9510_3.6641~1.9989_2.2543&thread=5&v=3
Related: :jesup put up a patch that puts the Necko request cache status (Hit, Miss, etc.) in the Network markers of the profiler:
https://bugzilla.mozilla.org/show_bug.cgi?id=1504930
Blocks: 1505256
Updates:

• On local raptor tp6 tests (both Windows and OSX opt release), noise (measured as relative standard deviation) can be cut approximately in half by making a hacked change to Necko which increases the expiration time of http resources. This prevents the large http resource response time discrepancies that we see in the above profiles.

• Strangely, when this test change is run on a try server, a similar drop in noise has not yet been observed, i.e. I'm not seeing much of a difference in the lab.
I'm investigating why this is the case.
Making the small changes required to allow raptor-tp6 to capture 25 profiles on the try server allowed me to see that http response time is also a contributing factor in the slowest test results on lab hardware, at least when profiled.

Compare these two captures, run in the lab, where one loads in 932ms:
https://perfht.ml/2OGZXGQ

And the other in 1070ms:
https://perfht.ml/2OH059i
Priority: -- → P1
Updates:

• The long http requests, mentioned above, were discovered to also be cases where cached resources were not used (e.g. the "no-cache" directive), and either mitmproxy or some component in the runtime is taking an exceptionally long time to complete them.
Extending the Necko cache hack ("generous-cache") allows these items to remain in the cache by ignoring cache validation protocol. 
This is useful only for testing of performance changes, but cannot be enabled in the runtime.
But it raises the question of _why_ we are seeing these spikes in processing of recorded requests.

• In profiling the youtube pageload (almost always the noisiest suite), I found another problem:
it doesn’t appear that there is sufficient time between page cycles to finish executing JS. In addition, the screen is always refreshing due to a spinner.
This isn’t the cleanest profile, but I think you can see that the browser is anything but idle prior to the test start, and we see fallout from that in GC major:
https://perfht.ml/2OOvQ08

A simple patch extends the 1 second delay between page loads to 15 seconds and this significantly drops the noise.

Patches have been made to bandaid these sources of noise, and the reduced variance can be seen in these results:
(from Windows 10, x64 opt, run in the lab):
https://docs.google.com/spreadsheets/d/1nMQjBARxlz1mVZR_I11MRXTjG6o6ndfze_cBhW6-6ow/edit#gid=0
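The cache-validation behaviour behind the first bullet follows the standard HTTP freshness rules; a minimal sketch of the decision (greatly simplified from RFC 7234 — the `needs_revalidation` helper is illustrative, not Necko's actual logic):

```python
def needs_revalidation(age_s, max_age_s, no_cache=False):
    """Simplified HTTP freshness check: a stored response may be reused
    without contacting the server only while it is still fresh and not
    marked no-cache. The "generous-cache" hack described above effectively
    forces this to return False for every recorded resource."""
    if no_cache:
        return True  # e.g. the "no-cache" directive seen in the profiles
    return age_s >= max_age_s  # stale responses must be revalidated

# Over 24 reloads, resources whose max-age elapses mid-run flip from
# cache hits to (slow) revalidations, producing bimodal load times.
```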
Updates:
• Building on the above patches, I ran brief experiments to defer garbage collection by modifying the heap growth parameters.
This in fact increased the noisiness of the tests, suggesting that more frequent GC could possibly reduce noise.
However the impact of GC is still minor compared to the network requests.

• Realized that our recorded-webpage playback tool, mitmproxy, by default makes live connections to upstream servers (e.g. https://www.amazon.com). (This was very unexpected to me!)
The reason is a feature called "upstream certificate sniffing"
https://mitmproxy.readthedocs.io/en/v2.0.2/features/upstreamcerts.html
mitmproxy pauses the recorded playback and makes a brief connection to the upstream server to have a look at the SSL certificate.
This is needed for cases where the client connects by direct ip instead of hostname (and two other subtle cert naming fallbacks).
Naturally this introduces significant variability (particularly locally over wifi) due to the network requests.

Using the "--no-upstream-cert" option allows the proxy to play back entirely from disk without a network connection.
I have evidence that this reduces the variance significantly, and am investigating the implications and results.
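For reference, offline replay can be driven like this (a sketch using the mitmproxy 2.x command-line flags from the docs linked above; the recording file name is hypothetical):

```python
import subprocess

# Replay a recorded session entirely from disk: "--no-upstream-cert"
# stops mitmproxy from opening a live connection to the upstream server
# just to inspect its TLS certificate, and "-S" selects server replay.
# "amazon.mp" is a made-up recording file name for illustration.
cmd = ["mitmdump", "--no-upstream-cert", "-S", "amazon.mp"]
# subprocess.run(cmd)  # uncomment to actually launch the proxy
```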
Updates:

• Disabling the mitmproxy "upstream certificate sniffing" behavior mentioned in Comment 7 reduces the variance of the tp6 raptor tests in general (by ~25% in lab tests and ~50% in local testing). In some cases the removal does regress performance. I'm investigating why (unexpected, since the network responses are now all served from disk).

• In looking closer at the metrics recorded, I logged this issue where it appears that the hero element timing (measured through IntersectionObserver) is not lining up with the recorded screenshots for a given profile:
https://bugzilla.mozilla.org/show_bug.cgi?id=1510999
(understandable since the rendering occurs asynchronously).

• The Performance and Performance Test Engineering teams shared a very productive meeting in Mozlando.
Out of this discussion came the consensus that pageload (effectively loadEvent) would be added as the tier 1 performance test. See https://bugzilla.mozilla.org/show_bug.cgi?id=1504013
The existing tests will be moved to tier 2. (Previously their results were geomeaned to form the pageload value).
This should let us track a metric more representative of the perceived load and also facilitate further noise reduction, since we're no longer tracking 4 markers (dcm, fnbp, ttfi, hero). ttfi never fires for some pages (e.g. youtube, due to constant JS), and, as we saw, the hero element tracking is not correct in at least two pages.

• As Kyle Lahnakoski pointed out, the noise in the raptor tp6 tests is bimodal, which is making some of our current statistical comparisons effectively invalid. The plan is still to find the source of the behaviors which introduce the bimodality, rather than using statistical techniques to deal with the bimodal data.
So when looking at the page load times I'm seeing a bimodal distribution where the significantly slower results are caused by one of the following:

1. Over the course of the 24 reloads, certain HTTP resources will (based on their cache directive) need to be refetched, or at the very least, re-validated. This will take significantly longer to load/validate than items in the cache. This can throw the load time of a page off by hundreds of milliseconds.

2. Over the course of the 24 reloads there will be runs with significant Garbage Collection during the early stages of the pageload. This has a tendency to delay other operations (which presumably require access to the parent process's main thread). This can also throw off the load time of a page by hundreds of milliseconds.
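The effect on summary statistics can be sketched with synthetic numbers (illustrative only, not taken from the tests): a handful of slow-mode runs inflates a single mean/standard-deviation summary, while splitting on the gap between the modes makes the bimodality explicit.

```python
from statistics import mean, stdev

# Synthetic replicates: a fast mode near 800 ms and a slow mode near
# 1300 ms (cache refetch / GC-heavy runs, as described above).
fast_mode = [795, 801, 810, 788, 805, 799, 812, 790]
slow_mode = [1280, 1320, 1295]
runs = fast_mode + slow_mode

# A single mean/stdev pair blurs the two modes together...
blurred = (mean(runs), stdev(runs))

# ...while splitting on the gap makes them explicit.
threshold = 1000  # chosen by eye from the gap between the modes
low = [r for r in runs if r < threshold]
high = [r for r in runs if r >= threshold]
```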

Notes:
• An experiment was done with the latest mitmproxy 4.0.4 and deterministic JS added. (Not a full integration of mitmproxy 4.0.4 into raptor, but recording and playback of a page in Firefox using the Performance.timing web API.) The load times were still bimodal.

• An experiment was done in which the Necko feature "race cache with network" (RCWN) was disabled. This did not remove the bimodal load time behaviour from the tests.

• An experiment was done where the raptor tests were run with the Necko cache (memory/disk) disabled. This did significantly reduce the noise (generally all loads were very slow). 
But the practicality of this change is questionable as the cache is a key component of the browser.

It was suggested that a better solution would be to have raptor tp6 run cold load tests (i.e. loads from a single profile) as well as the current reload tests. Note that cold load tests are already planned for 2019.

The most significant update is that :jesup landed a large performance optimization that deferred the execution of JS triggered by setTimeout() to an idle queue.
bug 1270059

This had the side effect of cutting the Noise Metric (roughly the sum of std deviations) by ~40% to 60% depending on the platform, visible here:

https://treeherder.mozilla.org/perf.html#/compare?originalProject=mozilla-central&newProject=try&newRevision=b9acdea4f9d6df7c48ddbf1dbcc4f930b66f18af&framework=10&showOnlyImportant=1&selectedTimeRange=1209600
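As a rough sketch of the Noise Metric mentioned above (sum of per-subtest standard deviations — the exact Perfherder definition may differ, and the replicate numbers are hypothetical):

```python
from statistics import stdev

def noise_metric(subtests):
    """Approximate Noise Metric: the sum of the standard deviations
    of each subtest's replicates."""
    return sum(stdev(reps) for reps in subtests.values())

# Hypothetical before/after replicate sets for two subtests:
before = {"amazon": [848, 850, 1177, 835], "youtube": [900, 1400, 950, 930]}
after  = {"amazon": [848, 850, 860, 835],  "youtube": [900, 940, 950, 930]}

reduction = 1 - noise_metric(after) / noise_metric(before)  # fraction cut
```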

I haven't investigated why exactly this reduces the noise, but it's fair to say that potentially executing long JS functions during pageload (depending on timing) was leading to nondeterministic scheduling.
This was observed in profiles.

Bug 1517323 captures the work of removing mitmproxy's live upstream connections.

Unfortunately it is regressing performance tests.
I investigated and determined that at least for some of the tests (e.g. bing), the performance drop comes from the fact that mitmproxy running in "offline mode" drops http/2 playback to http/1.

This can have a very significant performance impact, particularly when playing back from cached recordings.

I'll be looking into this issue shortly - Bug 1524609.

At least on these raptor pageload tests in the lab, disabling RCWN is reducing the noise by 7% to 70%, depending on the platform.

My goal will be primarily performance tuning of RCWN, but it appears that noise reduction is very possible.
I had earlier done tests of disabling RCWN but at that time I was more focused on finding the source of the bimodal loading distributions and did not notice a drop in the Noise Metric.

Depends on: 1525017

It looks like the browser is still busy doing work which interferes with page load after the 30 seconds that raptor waits post-startup.
A quick experiment here extends raptor's startup delay from 30 to 90 seconds and noise is significantly reduced:

https://treeherder.mozilla.org/perf.html#/compare?originalProject=mozilla-inbound&newProject=try&newRevision=3bbe84b59015f54cfe174a0d2430ec5fb2571ace&framework=10&selectedTimeRange=2592000

Just logged Bug 1543776 after :vchin noticed the erratic load times on www.allrecipes.com (see image)

I've seen slow loads coinciding with long GC majors many times.

If we can resolve this we will significantly reduce noise in numerous tp6 tests.

Fixing Bug 1548355 led to a very significant drop in noise (and load time) on raptor-tp6-reddit-firefox loadtime

https://treeherder.mozilla.org/perf.html#/graphs?timerange=5184000&series=mozilla-central,1934845,1,10

We believe that this is because it defers GC out of the loading stage, and thus off-thread parsing isn't blocked (Bug 1543806).

Type: defect → task

This issue is a major contributor to noise: Bug 1555306
For this given site, on the 2017 reference laptop page load results will vary from ~2000ms to ~5000ms when testing against a recorded http session.

Depends on: 1564569

The behaviour described in Bug 1564569 was identified as the cause of large spikes in recorded load time such as these (amazon cold load tests on android).

See Also: → 1558189, 1558191
See Also: → 1565325

I described my findings in this post:
https://groups.google.com/a/mozilla.com/d/msg/perfteam/Zk9GN6kx2Fc/NU6QQby8DwAJ

Summarized here:
In Bug 1564569 it was shown that the injection of new resources via setTimeout() (which may or may not occur before onload) is the source of the bimodal load performance in many, likely most, cases (the first 3 of 3 in tp6m).

You can see the difference between our current behaviour (top) and a prototype in which we run the setTimeout()s after load here:
https://imgur.com/NBOKnLe

Given that the prototype solution is not viable (hurts visual metrics, poor behaviour on some sites), two alternatives were discussed:

• For completeness, Markus raised the idea of creating our own metric, e.g. onunbrokendependencyload which would then exhibit less noise.
• Given the preference for visual metrics, the idea of focusing noise reduction on a visual metric (which may be more stable) was discussed and was favorably received.

Making this into a meta bug; we should link any bugs affecting variability to this, and reports on variability in general should go here.

Keywords: meta
Summary: Determine cause(s) of variable page load timing results → [meta] Determine cause(s) of variable page load timing results

Linking bugs that were closed and worked on this quarter in an effort to reduce noise in the pageload tests:

Bug 1558189 Determine if android device temperature is introducing noise in test results (it's not the cause of the major outliers)
Bug 1563209 Bi-modal raptor tests results: this was caused by mitmproxy's live upstream cert issue which Tarek found a workaround for
Bug 1565325 Determine if using the same mitmproxy session for all cold page load browser cycles introduces noise (ruled out as a source of major noise)
Bug 1558191 Determine if different android devices are introducing noise in test results (not closed but ruled out as a source of major outliers)
Bug 1543776 5+ second delays seen while loading www.allrecipes.com (although not resolved, this issue has been worked on by :smaug, :jonco and myself (e.g. Bug 1575943, Bug 1579426))

My recommendations going forward are:
• Pursue Bug 1561324: being able to test on reference PC hardware is important and will be even more so with Fission
• As discussed here, begin noise reduction efforts for visual metrics: compare them to current navigation timing metrics, evaluate noise between visual metrics (speedIndex, contentfulnessIndex, etc.), and find the sources of noise within those metrics

FYI, currently attempting to reduce noise on the reference laptops in the lab so that they may be usable for performance work. (Bug 1561324)

Found a significant source of noise on desktop cold pageload tests: Bug 1589070
The desktop cold load tests were incorrectly re-using the same profile for each cold load so resources would be in the disk cache for subsequent loads.

Each load was designed to be independent, but because of this bug the performance results would be bimodal with the first out of every 25 cold loads being slow (pictured).

For example:

"name": "loadtime", 
          "replicates": [
            1506, 
            752, 
            697, 
            717, 
            715,
Depends on: 1589070
Depends on: 1590443

We noticed in Bug 1595537 that the JS debug option for async stacks ends up enabled in the raptor performance tests (desktop).
Disabling this pref reduces noise significantly and improves test 'realism', so that's landing shortly.

noise metric
linux64-shippable            -9.52%
macosx1014-64-shippable      -6.86%
windows10-64-shippable      -22.01%

See also: Talos version - bug 1597297

Depends on: 1595537

In Bug 1597862 stephend found that the shift to conditioned profiles is cutting the standard deviation on test results by half.
This will be landing shortly on desktop.

See Also: → 1597862