Closed Bug 796167 Opened 13 years ago Closed 10 years ago

investigate restarting browser vs internal cycles for statistical variance

Categories

(Testing :: Talos, defect)

Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED WORKSFORME

People

(Reporter: k0scist, Unassigned)

Details

(Whiteboard: [SfN])

See https://bugs.webkit.org/show_bug.cgi?id=97510 : rniwa found significantly less variance when the browser is restarted per data point than when multiple measurements are taken in a single browser session:

"""
Hi all,

I've recently conducted an experiment to reduce the per-run variance (i.e. the test score difference between different runs of the same test) of performance tests in WebKit and wanted to share the result. I tried two approaches:

1. Increasing the sample size - I've tried quadrupling the sample size, e.g. if we were getting 20 samples previously, then I would get 100 of them.
2. Getting the samples from different instances of the test runner program - We use a special test runner program called DumpRenderTree or WebKitTestRunner to run performance tests. I've divided the sample size by 4 and got samples in separate instances of DumpRenderTree, e.g. if we were to get 20 samples, then I would get 5 samples from a single instance of DumpRenderTree and run the test 4 times to get a total of 20 samples.

The conclusion is that approach 2 works while approach 1 only amplifies the difference between runs. Approach 2 does not only increase the variance in each run but also smoothes values between runs, so that the mean we get in each run is much closer to the mean of all runs. You can see 3 sample data sets on https://bugs.webkit.org/show_bug.cgi?id=97510.

Best,
Ryosuke Niwa
Software Engineer
Google Inc.
"""

This should be investigated for Talos and Signal from Noise. I am actually quite surprised at this, though with address space randomization... maybe it makes sense?

(TL;DR: This is a little like putting the cart before the horse. We know we need to figure out how many data points we need (e.g. per test) for a decent distribution sample.)

We should quantify what difference this makes. We should also quantify what the run time will be. At some point we will need to figure out a good balance of test time vs. statistical viability. For my money, I would much rather have a few tests I can count on with good samples than a lot of tests that I can't count on.
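A minimal sketch of how the difference could be quantified, assuming the nested-list layout described above (each inner list = the replicates from one browser/DumpRenderTree instance); none of these names are part of Talos:

```python
import statistics

def per_run_means(runs):
    """runs: list of lists; each inner list holds the replicate values
    collected from one browser (or DumpRenderTree) instance."""
    return [statistics.mean(run) for run in runs]

def between_run_spread(runs):
    """Spread of the per-run means -- the run-to-run difference that
    restarting the browser per batch is claimed to smooth out."""
    return statistics.stdev(per_run_means(runs))

def within_run_spread(runs):
    """Average spread inside a single run (the per-run variation)."""
    return statistics.mean(statistics.stdev(run) for run in runs)

# Approach 1: one long session, e.g. a single run of all N replicates.
# Approach 2: restart between batches, e.g. 4 runs of N/4 replicates each.
# Comparing between_run_spread() across repeated executions of each scheme
# would show whether restarting really brings each run's mean closer to the
# overall mean, and at what cost in total test time.
```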
Whiteboard: [SfN]
I need more information regarding rniwa's experiments:

1. Does the test runner program run on one machine or on many machines to collect data?
2. Why is increased within-run variation good, and how does it "smooth" values between runs?
3. What is "the mean of all runs" that he is referring to?

FYI: Our team worked closely with :jmaher on exploratory experiments to study the most efficient way to collect data (optimizing for time and other factors). Sometime in Jan/Feb 2012, we ran experiments involving rebooting the computer, and also restarting the browser for every replicate, to see how the results change. We also looked into how collecting data in row-major versus column-major order influenced the variation within the test replicates.

Result: We saw a big improvement in reducing variation in the data when we changed the order in which we collect the data, but no significant difference from rebooting/restarting. I am not sure how representative Google's testing environment/machines/etc. are of Mozilla's, but I think these issues were carefully considered at the very early stages of the Signal from Noise project.
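For illustration, a small sketch of the two collection orders mentioned above. The names are illustrative rather than the actual Talos scheduler, and `measure` stands in for whatever loads a page and records one timing:

```python
def row_major(pages, n_replicates, measure):
    # All replicates of one page back to back: page1 x N, then page2 x N, ...
    return {page: [measure(page) for _ in range(n_replicates)] for page in pages}

def column_major(pages, n_replicates, measure):
    # One replicate of every page per pass: (page1, page2, ...) repeated N times,
    # so consecutive replicates of the same page are spread out in time.
    results = {page: [] for page in pages}
    for _ in range(n_replicates):
        for page in pages:
            results[page].append(measure(page))
    return results
```

Per the comment above, changing between these two orders affected the variation within the replicates more than rebooting or restarting the browser did.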
3 years later, we haven't done this and have instead ensured our tests measure useful things. In fact, we have found that most of our difficulty in actually pinpointing regressions comes down to scheduling issues. Assuming that is fixed, we could look for the next bit of low-hanging fruit.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → WORKSFORME