Bug 796167 - investigate restarting browser vs internal cycles for statistical variance
Opened 13 years ago, closed 10 years ago
Categories: Testing :: Talos, defect
Tracking: (Not tracked)
Status: RESOLVED WORKSFORME
People: Reporter: k0scist; Unassigned
Whiteboard: [SfN]
See https://bugs.webkit.org/show_bug.cgi?id=97510. :rniwa found significantly less variance when the browser is restarted per data point (for example) than when multiple data points are taken in a single browser session:
"""
Hi all,
I've recently conducted an experiment to reduce the per-run variance
(i.e. the test score difference between different runs of the same
test) of performance tests in WebKit and wanted to share the result.
I tried two approaches:

1. Increasing the sample size - I've tried quadrupling the sample size, e.g. if we were getting 20 samples previously, then I would get 100 of them.

2. Getting the samples from different instances of the test runner program - We use a special test runner program called DumpRenderTree or WebKitTestRunner to run performance tests. I've divided the sample size by 4 and got the samples in separate instances of DumpRenderTree, e.g. if we were to get 20 samples, then I would get 5 samples from a single instance of DumpRenderTree and run the test 4 times to get a total of 20 samples.
The conclusion is that approach 2 works while approach 1 only amplifies the difference between runs. Approach 2 not only increases the variance within each run but also smooths values between runs, so that the mean we get in each run is much closer to the mean of all runs.
You can see 3 sample data on
https://bugs.webkit.org/show_bug.cgi?id=97510.
Best,
Ryosuke Niwa
Software Engineer
Google Inc.
"""
This should be investigated for Talos and Signal from Noise. I am actually quite surprised at this, though with address space randomization... maybe it makes sense?
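To illustrate why per-session state could behave this way, here is a minimal, purely illustrative Python sketch (not from the bug or from rniwa's harness; the distribution parameters are invented) that gives every simulated browser session its own bias and compares the two sampling strategies:

    import random
    import statistics

    def run_session(n_samples, true_value=100.0, session_bias_sd=3.0, noise_sd=1.0):
        # One simulated browser/DumpRenderTree session: every sample in the
        # session shares a single per-session bias (standing in for effects
        # such as address space layout).
        bias = random.gauss(0.0, session_bias_sd)
        return [true_value + bias + random.gauss(0.0, noise_sd)
                for _ in range(n_samples)]

    def approach_1_mean(total_samples=20):
        # Approach 1: take all samples from a single session.
        return statistics.mean(run_session(total_samples))

    def approach_2_mean(total_samples=20, sessions=4):
        # Approach 2: split the same sample budget across fresh sessions.
        per_session = total_samples // sessions
        samples = [s for _ in range(sessions) for s in run_session(per_session)]
        return statistics.mean(samples)

    # Compare how much the reported per-run mean wanders from run to run.
    runs = 200
    print("stdev of per-run means, single session:",
          round(statistics.stdev(approach_1_mean() for _ in range(runs)), 2))
    print("stdev of per-run means, four sessions: ",
          round(statistics.stdev(approach_2_mean() for _ in range(runs)), 2))

Under these assumptions the per-session bias averages out across sessions, which is consistent with the observation that approach 2 brings each run's mean closer to the mean of all runs.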
(TL;DR: This is a little like putting the cart before the horse. We know we need to figure out how many points we need (e.g. per test) for a decent distribution sample.)

We should quantify what difference this makes. We should also quantify what the run time will be. At some point we will need to figure out a good balance of test time vs. statistical viability. For my money, I would much rather have a few tests I can count on, with good samples, than a lot of tests that I can't count on.
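As one hedged way to approach the "how many points per test" question, the sketch below (an assumption on my part, not an agreed Talos method; the pilot numbers and target are placeholders) estimates the replicate count needed for a given confidence-interval width:

    import math
    import statistics

    def replicates_needed(pilot_samples, target_halfwidth_ms, z=1.96):
        # Rough normal-approximation estimate of how many replicates are
        # needed so the 95% confidence interval for the mean is within
        # +/- target_halfwidth_ms.
        sd = statistics.stdev(pilot_samples)
        return math.ceil((z * sd / target_halfwidth_ms) ** 2)

    # Hypothetical pilot replicates from one Talos-style page load test (ms).
    pilot = [252.1, 249.8, 255.3, 250.6, 248.9, 253.4, 251.7, 256.0]
    print(replicates_needed(pilot, target_halfwidth_ms=1.0))

Feeding real per-test pilot data into something like this would also give a direct handle on the test-time vs. statistical-viability trade-off.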
Updated by reporter 13 years ago
Whiteboard: [SfN]
Comment 1 • 13 years ago
I need more information regarding rniwa's experiments:
1. Does the test runner program run on one machine or on many machines to collect data?
2. Why is increased within-run variation good, and how does it "smooth" values between runs?
3. What is "the mean of all runs" that he is referring to?
FYI: Our team worked closely with :jmaher on exploratory experiments to find the most efficient way to collect data (optimizing for time and other factors). Sometime in Jan/Feb 2012, we ran experiments that rebooted the computer, and also restarted the browser, for every replicate to see how the results changed. We also looked into how collecting data in row-major versus column-major order influenced the variation within the test replicates.

Result: we saw a big improvement in reducing variation in the data when we changed the order in which we collect the data, but no significant difference from rebooting/restarting.

I am not sure how representative Google's testing environment/machines/etc. are of Mozilla's, but I think these issues were carefully considered at the very early stages of the Signal from Noise project.
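For readers unfamiliar with the row-major/column-major terminology above, here is a minimal sketch of the two collection orders over a pages-by-replicates grid (page names and counts are made up; this is not the actual Talos harness code):

    # Hypothetical page names; real Talos suites use their own page sets.
    pages = ["tp5_page_a", "tp5_page_b", "tp5_page_c"]
    replicates = 3

    # Row-major: finish all replicates of one page before moving to the next.
    row_major = [(page, rep) for page in pages for rep in range(replicates)]

    # Column-major: visit every page once per pass, then repeat the pass.
    col_major = [(page, rep) for rep in range(replicates) for page in pages]

    print(row_major[:4])  # [('tp5_page_a', 0), ('tp5_page_a', 1), ('tp5_page_a', 2), ('tp5_page_b', 0)]
    print(col_major[:4])  # [('tp5_page_a', 0), ('tp5_page_b', 0), ('tp5_page_c', 0), ('tp5_page_a', 1)]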
Comment 2 • 10 years ago
Three years later, we haven't done this; instead we have ensured our tests measure useful things. In fact, we have found that most of our difficulty in actually pinpointing regressions comes down to scheduling issues. Assuming that is fixed, we can look for the next bit of low-hanging fruit.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → WORKSFORME