Closed Bug 866195 Opened 12 years ago Closed 12 years ago

Improve Autophone s1s2 test reproducibility

Categories

(Testing Graveyard :: Autophone, defect)

All
Android
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: bc, Assigned: bc)

References

()

Details

Attachments

(4 files, 3 obsolete files)

Recently, when attempting to fix bug 862508 to allow Autophone to test Android 4.2 devices, I ran into an apparent regression in throbberstart due to the change in how Fennec is launched. While investigating the regression, I compared how Fennec is launched using the current Autophone run_fennec_with_profile method (call it A) against versions of run_fennec_with_profile which used mozbase's launchFennec without (call it B) and with (call it C) the -W argument to am. There was no consistent ordering of the 3 values across phones, and values A and B, which should have been identical, were consistently different. In addition, the variability of the test results, as shown by the standard deviations, made it difficult to make accurate comparisons. Something to note is that I ran A, B, and C in the same session with the same profile without rebooting in between.

The variability of the test results is very apparent when looking at the recent results at <http://mrcote.info/phonedash>. This is especially true for the remote page tests, where the network connections are throttled via ipfw to 780Kbps down and 330Kbps up. I do not know whether the throttled network is the cause of the variability or if it has some other cause, but in my opinion these results are so noisy as to be unusable. I checked the performance of the phonedash web server using httperf as compared to nginx for serving the static files for Twitter2.html and did not see any performance issue which might affect the tests.

This led me to reading up on test reproducibility and got me thinking about how we actually run the tests. A simplified description of how Autophone runs the s1s2 tests is:

reboot phone
install fennec
for cache in disabled, enabled:
    create profile with cache preferences
    launch fennec to initialize the profile
    for test in local-blank, local-twitter, remote-blank, remote-twitter:
        for iteration in iterations:
            launch fennec and load the test page
            collect and report throbberstart, throbberstop values

We use the throbberstart and throbberstop values from the iterations to calculate a mean and standard deviation which are displayed on phonedash.

I see several problems in this approach which may affect the reliability and reproducibility of the tests:

1. We test the uncached behavior using prefs to disable the cache. I think it would be more realistic if we used the normal cache settings and treated the first load of a page as the uncached test and the second load of a page as the cached test.

2. We do not reboot the phone between the test pages or between the iterations. We treat each test and iteration as an independent measurement when, as currently implemented, each test and iteration depends upon the previous tests and iterations. I think we should reboot before each measurement to eliminate any influences from prior measurements.

3. Our choice of iterations is ad hoc. Rather than focusing on the standard deviation of the measurements, we can choose the iterations to reduce the standard error of the mean and perhaps obtain more reliable results.

I think that implementing #2 will help improve reliability and help in choosing a realistic and useful iteration count.
For background reading:
http://en.wikipedia.org/wiki/Standard_error
http://en.wikipedia.org/wiki/Standard_deviation

The proposed test execution would look like:

reboot phone
install fennec
launch fennec to make sure the build is usable
if fennec is not usable:
    abort test
for test in local-blank, local-twitter, remote-blank, remote-twitter:
    for iteration in iterations:
        reboot phone
        create profile
        launch fennec to initialize the profile
        launch fennec and load the test page
        collect and report uncached throbberstart, throbberstop values
        launch fennec and load the test page
        collect and report cached throbberstart, throbberstop values

I would like to take some of the redundant phones out of the production Autophone s1s2 runs to use in testing this approach. We could keep the current Autophone running in production on the remaining phones until we determine if this is a valid approach. If we do switch to this new behavior, I can rerun previous data to fill in the historical behavior. Thoughts?
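For illustration only, here is a minimal, runnable Python sketch of the proposed execution order. Every helper function here is a hypothetical placeholder standing in for the real Autophone/s1s2test methods, not the actual API:

# Sketch of the proposed per-iteration reboot flow. Helper functions are
# hypothetical stubs, not the real Autophone code.
TESTS = ['local-blank', 'local-twitter', 'remote-blank', 'remote-twitter']
ITERATIONS = 8  # illustrative; a value of 32 is discussed later in this bug

def reboot_phone(): pass
def install_fennec(): pass
def fennec_is_usable(): return True
def create_profile(): pass
def initialize_profile(): pass
def load_test_page(test): return {'throbberstart': 0, 'throbberstop': 0}
def report(test, label, values): print('%s %s %r' % (test, label, values))

def run_job():
    reboot_phone()
    install_fennec()
    if not fennec_is_usable():
        return  # abort the test for this build
    for test in TESTS:
        for _ in range(ITERATIONS):
            reboot_phone()
            create_profile()
            initialize_profile()
            report(test, 'uncached', load_test_page(test))  # first load
            report(test, 'cached', load_test_page(test))    # second load

run_job()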
I'm all for being more deliberate with what we are measuring and how we are measuring it. I think this is a good step in the right direction and I look forward to what your experiment turns up w.r.t. measurement stability.
(In reply to Bob Clary [:bc:] from comment #0)
> 1. We test the uncached behavior using prefs to disable the cache. I think
> it would be more realistic if we used the normal cache settings and treated
> the first load of a page as the uncached test and treated the second load of
> a page as the cached test.

A few slightly related ideas here:

We should keep in mind that first load with caching enabled is not the same as caching disabled: with caching enabled, we are storing that first load, and it is possible that the disk writes, and additional processing, will affect performance.

We don't normally test Fennec in configurations that differ from the defaults we ship (except in ad-hoc, one-off tests to investigate the effect of configuration changes). Network disk cache is enabled in Fennec and has been for a long time now. Why are we routinely running tests with cache disabled?

With caching enabled, it seems wrong to put the first load results in the same bucket as subsequent results -- we expect them to differ, and we lose that information if we routinely just look at the aggregate.

The current iterations:

for test in local-blank, local-twitter, remote-blank, remote-twitter:
    for iteration in iterations:
        launch fennec and load the test page

are problematic/unrealistic with caching enabled because we are testing the result of loading the same page multiple times in succession, with little happening in between. There is a risk here that we will benefit not only from our own caching, but also from OS-level caching: disk buffers may still be in memory, for instance, resulting in improved performance that wouldn't be seen in normal use. The proposed execution is much better in this respect, since each pair of results is against new cache files (new profile).
(In reply to Geoff Brown [:gbrown] from comment #2)
> We don't normally test Fennec in configurations that differ from the
> defaults we ship (except in ad-hoc, one-off tests to investigate the effect
> of configuration changes). Network disk cache is enabled in Fennec and has
> been for a long time now. Why are we routinely running tests with cache
> disabled?

IIRC blassey suggested this. I think he wanted to try to remove variability caused by writing to disk, or at least try to see its effect vs. standard cache usage.

> With caching enabled, it seems wrong to put the first load results in the
> same bucket as subsequent results -- we expect them to differ and we lose
> that information if we routinely just look at the aggregate.

Actually, we omit the first result of the cached runs when calculating the mean & standard deviation. For comparison, you can see the initial result--and only the initial result--by enabling "show initial only" in phonedash.

> The current iterations:
>
>     for test in local-blank, local-twitter, remote-blank, remote-twitter:
>         for iteration in iterations:
>             launch fennec and load the test page
>
> are problematic/unrealistic with caching enabled because we are testing the
> result of loading the same page multiple times in succession, with little
> happening in between. There is a risk here that we will benefit not only
> from our own caching, but also from OS level caching: disk buffers may still
> be in memory for instance, resulting in improved performance that wouldn't
> be seen in normal use. The proposed execution is much better in this
> respect, since each pair of results is against new cache files (new profile).

Yeah, I think we originally were taking a trade-off of speed of execution (and potentially device life) by only rebooting in between tests. I would expect less variation if we rebooted between every run. If we can still keep up with the results, then yeah, let's try this.
(In reply to Geoff Brown [:gbrown] from comment #2)
> (In reply to Bob Clary [:bc:] from comment #0)
> > 1. We test the uncached behavior using prefs to disable the cache. I think
> > it would be more realistic if we used the normal cache settings and treated
> > the first load of a page as the uncached test and treated the second load of
> > a page as the cached test.
>
> A few slightly related ideas here:
>
> We should keep in mind that first load with caching enabled is not the same
> as caching disabled: With caching enabled, we are storing that first load,
> and it is possible that the disk writes, and additional processing, will
> affect performance.

Great point.

> We don't normally test Fennec in configurations that differ from the
> defaults we ship (except in ad-hoc, one-off tests to investigate the effect
> of configuration changes). Network disk cache is enabled in Fennec and has
> been for a long time now. Why are we routinely running tests with cache
> disabled?

This was the choice I made when blassey asked for results for uncached vs. cached. I agree it is the wrong thing to do, and I am happy with the change to count the first load as the uncached value and the second as the cached value.

> With caching enabled, it seems wrong to put the first load results in the
> same bucket as subsequent results -- we expect them to differ and we lose
> that information if we routinely just look at the aggregate.

As Mark mentioned, we currently don't use the first values unless you ask for them, but with the proposed changes I think it would be good to keep them.

> The current iterations:
>
>     for test in local-blank, local-twitter, remote-blank, remote-twitter:
>         for iteration in iterations:
>             launch fennec and load the test page
>
> are problematic/unrealistic with caching enabled because we are testing the
> result of loading the same page multiple times in succession, with little
> happening in between. There is a risk here that we will benefit not only
> from our own caching, but also from OS level caching: disk buffers may still
> be in memory for instance, resulting in improved performance that wouldn't
> be seen in normal use. The proposed execution is much better in this
> respect, since each pair of results is against new cache files (new profile).

(In reply to Mark Côté ( :mcote ) from comment #3)
> Yeah I think we originally were taking a trade off of speed of execution
> (and potentially device life) by only rebooting in between tests. I would
> expect less variation if we rebooted between every run. If we can still
> keep up with the results, then yeah let's try this.

I have been testing the approach where we reboot between each iteration of the test page load vs. only rebooting for each job. I'm not sure about the results yet, but rebooting between iterations causes each job to run for several hours, which I am not sure gives sufficient benefit for the cost. I am in the process of comparing reboot for each iteration, reboot for each test page, and reboot for each job.

I am very happy with the focus on standard error of the mean instead of standard deviation of the results, though. I think we will be able to get tighter test results.

One question I have for you all (blassey especially) is the acceptable percentage standard error of the mean. Getting below 1% has been problematic so far. 2% is probably achievable without increasing the measurement count too much. I think 3% would be too high.
I have a roundtable item in today's #ateam meeting if you want to join and help with the discussion.
Attached patch phonedash patch (obsolete) — Splinter Review
Calculate and display standard error of the mean. Mark, I've left the "initial only" check in for the moment, but I don't think it is needed any more. I was wondering if it would be good to allow the user to choose whether to display the standard error of the mean or the standard deviation. Showing the standard error gives us an idea of how good the measurement is, while showing the standard deviation gives us an idea of the variability of the data.
Attachment #744092 - Flags: feedback?(mcote)
Attached patch autophone patch — Splinter Review
This patch follows the general outline given before but instead reboots before each test. I am not completely sure this is better than the case where we just reboot before each job. See the next comment.
Attachment #744093 - Flags: review?(mcote)
This shows some comparative data, with 32 iterations, for two runs on the same build revision for:

base - current autophone behavior, with a change to move the loggerdeco call out of the starttime measurement / run fennec sequence, which probably is a cause of some of the variability in the previous results.

reboot_job - current approach but only rebooting before each job.

reboot_test - current approach but rebooting before each job and each test.

reboot_iteration - current approach but rebooting before each iteration of a test. I only have one run for this; it results in a multi-hour run for each build and isn't practical, but I wanted to compare it to the others.

This shows a comparison of the different test runs using a two-sided Welch's t-test to see if the means from the different runs were distinguishable, as well as sample data for the 4th, 8th, 16th and 32nd values for each run. I believe 32 is a good value for the iterations, as it will reduce the standard error to roughly 1/5 of the standard deviation (stderr = stddev/sqrt(count)).

We can see that the two base runs disagree on 7 out of 24 measurements between the runs on an identical build, which confirms our experience. The most consistent runs are the two reboot_job runs (4/24), followed closely by the reboot_test runs (5/24). On the face of the data, we might pick the reboot_job approach, but I think reboot_test is the better approach even though its runs are slightly less consistent.

Mark, if you can patch mrcote.info/phonedash_bc and clear its database, I can do a test run for the last few days so we can compare the current production results to the results we would get with the proposed patch.
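For reference, a pairwise comparison like the one above can be done with a two-sided Welch's t-test. A minimal sketch in Python using scipy (an assumption -- the bug doesn't say how the comparison was actually computed), with made-up throbberstart samples standing in for the real per-run data:

from scipy import stats

# Hypothetical throbberstart samples (ms) from two runs of the same build.
run_a = [2710, 2685, 2732, 2701, 2694, 2720, 2688, 2705]
run_b = [2748, 2722, 2760, 2731, 2755, 2740, 2719, 2751]

# Two-sided Welch's t-test; equal_var=False drops the equal-variance assumption.
t_stat, p_value = stats.ttest_ind(run_a, run_b, equal_var=False)
if p_value < 0.05:
    print('means are distinguishable (p=%.3f)' % p_value)
else:
    print('means are not distinguishable (p=%.3f)' % p_value)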
Adding bug 862508 and bug 860750 as dependencies to show the prior patches that need to be applied.
Depends on: 862508, 860750
Comment on attachment 744092 [details] [diff] [review] phonedash patch

Review of attachment 744092 [details] [diff] [review]:
-----------------------------------------------------------------

Yeah, if we are using the first value as the uncached value, we should take out the "initial only" control. Would it make sense to allow the display of either stddev or stderr? I'm still not sure exactly in what way one is more useful than the other.
Attachment #744092 - Flags: feedback?(mcote) → feedback+
The standard deviation estimates the variability in the measurement distribution, while the standard error of the mean is an estimate of the error in the mean itself. So, for any given individual measurement, we would expect it to fall within 1 standard deviation of the mean 68% of the time and within 2 standard deviations 95% of the time. For multiple runs with the same conditions, we expect the means to occur within 1 standard error 68% of the time and within 2 standard errors 95% of the time. While the standard deviation doesn't decrease with an increasing number of measurements, the standard error of the mean does, as ~ stddev/sqrt(count).
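A small Python sketch of those quantities, along the lines of what the autophone patch computes (r['stderr'], r['stderrp'] in the review below). The sample data is made up and the function is illustrative, not the patch itself; stderrp is the standard error expressed as a percentage of the mean:

from math import sqrt

def summarize(measurements):
    # Mean, sample standard deviation, standard error of the mean, and
    # standard error as a percentage of the mean.
    count = len(measurements)
    mean = sum(measurements) / float(count)
    variance = sum((m - mean) ** 2 for m in measurements) / (count - 1)
    stddev = sqrt(variance)
    stderr = stddev / sqrt(count)
    stderrp = 100.0 * stderr / mean
    return {'count': count, 'mean': mean, 'stddev': stddev,
            'stderr': stderr, 'stderrp': stderrp}

# Going from 8 to 32 measurements halves the standard error
# (stderr ~ stddev/sqrt(count)) even though the stddev stays about the same.
print(summarize([2710, 2685, 2732, 2701, 2694, 2720, 2688, 2705]))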
Comment on attachment 744093 [details] [diff] [review] autophone patch

Review of attachment 744093 [details] [diff] [review]:
-----------------------------------------------------------------

Getting complicated, but looks good.

::: tests/s1s2test.py
@@ +40,5 @@
> +    r['stderr'] = r['stddev']/sqrt(r['count'])
> +    r['stderrp'] = 100.0*r['stderr']/float(r['mean'])
> +    return r
> +
> +def is_stderr_acceptible(dataset):

That spelling is unacceptable. ;)

@@ +127,5 @@
> +                                  build_metadata['buildid']))
> +        self.set_status(msg='Could not run Fennec. Aborting test for '
> +                        'build %s' %
> +                        (attempt,
> +                         build_metadata['buildid']))

"attempt" shouldn't be in the substitution list here.

@@ +305,5 @@
> +              'shell.checkDefaultClient': False,
> +              'browser.warnOnQuit': False,
> +              'browser.EULA.override': True,
> +              'toolkit.telemetry.prompted': telemetry_prompt,
> +              'toolkit.telemetry.notifiedOptOut': telemetry_prompt}

Drop the } to the beginning of the next line.
Attachment #744093 - Flags: review?(mcote) → review+
Attachment #744092 - Attachment is obsolete: true
Attachment #744222 - Flags: review?(mcote)
Comment on attachment 744222 [details] [diff] [review] phonedash patch v2

Review of attachment 744222 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good.

::: html/scripts/app.js
@@ +42,5 @@
>           revisions[buildtime] = data[phone][test][metric][builddate].revision;
> +        if (params.errorbartype == 'standarderror')
> +          errorbarvalue = data[phone][test][metric][builddate].stderr
> +        else
> +          errorbarvalue = data[phone][test][metric][builddate].stddev

Standard JS coding style is to use braces even if blocks are only one line long.
Attachment #744222 - Flags: review?(mcote) → review+
Status: ASSIGNED → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Attached patch follow up patch v1 (obsolete) — Splinter Review
I missed the other places where logging calls or overhead also affect the measurements. This patch:

1. removes the unnecessary logging call from run_fennec_with_profile, moves the starttime collection inside run_fennec, and changes run_fennec_with_profile to return the starttime. I also added failIfRunning=False to launchFennec, which prevents launchApplication from testing whether the fennec process already exists.

2. removes the removal of the sessionstore.{js,bak} files. I don't think it is appropriate to be removing these between the uncached and cached tests in particular, and in general I don't think we should be hacking the profile like this in the middle of a test if we want to measure something related to what a user experiences.

#1 eliminates the previously discovered regression seen when switching from the old run_fennec_with_profile to using launchFennec. I ran the original code for several builds before realizing my mistake, then ran a build with #1 before realizing I should do #2 as well. I'm restarting the phonedash_bc run again. Hopefully this will get even better results. See http://mrcote.info/phonedash_bc for the running results.
Attachment #744953 - Flags: review?(mcote)
Comment on attachment 744953 [details] [diff] [review] follow up patch v1

I should have listened to Welch's t-test. The reboot between each test takes too long on the production phones. I'm starting another test, with modified hardware values reporting to phonedash_bc, to compare the results with the reboot removed. We may also need to lower the iterations value, but I would definitely rather keep it as high as possible.
Attachment #744953 - Attachment is obsolete: true
Attachment #744953 - Flags: review?(mcote)
Attached patch follow up patch v2 (obsolete) — Splinter Review
Same as v1, but no longer rebooting between each test and now with -W. I ran a partial test with just the rebooting change and am now running another with the rebooting change and the use of the wait=True parameter on launchFennec. The three sets of results are being reported to phonedash_bc with the device names:

original run - original device id
patch v1 - original device id with -r
patch v2 - original device id with -w

It will be a couple of hours before we have 2 data points for -w, but I think this will be it, so I'd like to start the review request before we get too close to the end of the day.
Attachment #745284 - Flags: review?(mcote)
I don't think -W gives any benefit over not specifying it; it may slow down some phones, notably the Nexus One, and may result in slightly larger stderrs. Please review with the assumption that I will specify wait=False in run_fennec_with_profile.
Right patch for the right repo this time. Has wait=False.
Attachment #745284 - Attachment is obsolete: true
Attachment #745284 - Flags: review?(mcote)
Attachment #745360 - Flags: review?(mcote)
Comment on attachment 745360 [details] [diff] [review] follow up patch v3 Review of attachment 745360 [details] [diff] [review]: ----------------------------------------------------------------- Looks good. ::: phonetest.py @@ +146,5 @@ > + starttime = int(self.dm.getInfo('uptimemillis')['uptimemillis'][0]) > + except IndexError: > + # uptimemillis is not supported in all implementations > + # therefore we can not exclude such cases. > + starttime = 0 I can't recall what this is from, but since you're moving this around anyway, let's take out the try/except. I think the results would be very strange if this happened.
Attachment #745360 - Flags: review?(mcote) → review+
https://github.com/mozilla/autophone/commit/ee040e2d4c59cb6d1c3a8e3c9ff5b7108c6d2318 We've copied the old production results to http://mrcote.info/phonedash_bc/ and will be reporting new results to http://mrcote.info/phonedash/ until the move to the new moco hosts. I'll backfill the data as time permits. /me crosses fingers.
Status: REOPENED → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Product: Testing → Testing Graveyard