Jan's thesis (see url) has recommendations for a better statistical technique than our t-test to use when trying to identify regressions/wins in performance data. It's neato, let's do it.
We discussed this at length at the meeting during the work week.
We can definitely improve the statistics we use, but I'm not convinced that Jan's exponential averaging is really what we want.
(In reply to Justin Lebar [:jlebar] from comment #1)
> We can definitely improve the statistics we use, but I'm not convinced that
> Jan's exponential averaging is really what we want.
If I remember correctly, you said that the fact that it adapts slowly to changes could be a disadvantage if you get a bunch of small, undetected regressions, right? So maybe it could be combined with another technique that explicitly tries to detect that kind of thing.
I went through and read this section in your thesis.
There's an explicit trade-off here with respect to whether we use a forward window of size 1 or larger. On the one hand, a size-1 window lets us capture regressions which appear and quickly disappear (either because of a backout or, more worryingly, because some other change landed and improved performance). On the other hand, using a larger forward window lets us detect smaller regressions.
To be concrete, in Figure 5.1 (pg 65), exponential smoothing would ignore any value within the green 95% confidence interval, while the t-test might not. In other words, with exponential smoothing, we cannot detect regressions smaller than our confidence interval.
Offhand, I imagine that the chances of us landing a regression followed by an improvement which hides the regression on all tests are pretty slim, so it seems to me that the benefit of having a size-1 forward window is pretty small.
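For concreteness, here is a minimal sketch of the kind of exponential smoothing detector under discussion. This is not Jan's exact method; the `alpha`, `k`, and `warmup` values below are arbitrary choices of mine. A value outside a k-sigma band around the exponentially weighted mean is flagged; note how the band widens after a flagged jump, which is the slow-adaptation behavior mentioned above:

```python
import statistics

def ewma_flags(values, alpha=0.3, k=2.0, warmup=5):
    """Return indices of values falling outside a k-sigma band around an
    exponentially weighted moving mean/variance. The first `warmup` points
    seed the estimates and are never flagged."""
    mean = statistics.fmean(values[:warmup])
    var = statistics.pvariance(values[:warmup])
    flagged = []
    for i in range(warmup, len(values)):
        x = values[i]
        sd = var ** 0.5
        if abs(x - mean) > k * sd:
            flagged.append(i)
        # exponentially weighted updates of mean and variance
        diff = x - mean
        mean += alpha * diff
        var = (1 - alpha) * (var + alpha * diff * diff)
    return flagged
```

With a stable series followed by a jump, only the first post-jump point tends to be flagged; the estimates then adapt toward the new level, which is exactly the trade-off versus the t-test being debated here.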
When talking about regression finding in the current Talos data, I think it is worth pointing out that the numbers currently being used, both on graphs.m.o and in Jan's work, are two steps of aggregation removed from the raw results. The attachment contains four graphs of the tsvg test suite to illustrate this fact.
The first graph shows the data points as currently interpreted (in the graphs-new style). This suite is a combination of 12 different test pages which have their own execution time. Each data point in this graph is the result of averaging the values generated for each page.
The second graph shows how widely these pages vary in terms of execution time. As a result the averaging done to produce the first graph is already hiding a lot of detail.
The third graph selects one of those pages (hixie-007.xml). In this view the changes are already much more evident than in the first graph. However, even this graph is the result of an aggregation step: for each run of the test, the pages are loaded multiple times, and this graph takes the median of those raw results.
The fourth graph shows the distribution of the raw results for the hixie-007.xml page split into the different time periods identified through the third graph.
The summary of this is that 1) the aggregation steps hide many of the changes that occur, except those affecting the longest-running pages, and 2) we are already taking multiple samples of each build, and we should consider collecting and using those raw results to do the regression testing.
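As a rough illustration of point 1 (the page names and times below are invented, not real tsvg numbers): a 20% regression on a short-running page barely moves the suite average, which is all the first graph shows:

```python
# Hypothetical per-page times (ms) for a 4-page suite, before and after
# a change that regresses only the short-running page-d by 20%.
before = {"page-a": 1000.0, "page-b": 800.0, "page-c": 50.0, "page-d": 40.0}
after = dict(before, **{"page-d": 48.0})  # 20% slower on page-d only

avg_before = sum(before.values()) / len(before)  # suite average before
avg_after = sum(after.values()) / len(after)     # suite average after

suite_change = (avg_after - avg_before) / avg_before  # well under 1%
page_change = (after["page-d"] - before["page-d"]) / before["page-d"]  # 20%
```

The suite-level average moves by a fraction of a percent, far inside typical noise, while the per-page view shows a clear 20% regression.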
Hmm, indeed the median may not be the most indicator to use. For one thing, if results are bimodal the median could end up on either side of the distribution, increasing noise in the later aggregates.
Assuming the number of samples isn't prohibitively low, you could consider doing some form of iterative refinement on the mean:
1) Sort the samples in order of magnitude
2) Calculate the standard deviation
3) Calculate the mean of say the center 50% (25%-75%) or 75% (12.5%-87.5%) of samples (to avoid outliers skewing the mean)
4) Discard outliers more than n standard deviations from this mean
5) If no outliers were found, calculate the sample mean, otherwise go back to step 2)
This could be modified to start with a high n (say 10), loop until no outliers are found, then loop again with a lower n (down to a reasonable limit - you don't want to run out of samples!)
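The steps above can be sketched roughly as follows (the function name, the default n, and `min_keep` guard are my own choices, and this keeps a single fixed n rather than the stepped-down n just described):

```python
import statistics

def refined_mean(samples, n=2.0, min_keep=4):
    """Iteratively discard outliers more than n standard deviations from a
    trimmed (center-50%) mean, then return the mean of what remains.
    `min_keep` guards against running out of samples."""
    data = sorted(samples)                      # step 1: sort
    while len(data) > min_keep:
        sd = statistics.stdev(data)             # step 2: standard deviation
        lo, hi = len(data) // 4, (3 * len(data)) // 4
        center = statistics.fmean(data[lo:hi])  # step 3: center-50% mean
        kept = [x for x in data if abs(x - center) <= n * sd]  # step 4
        if len(kept) == len(data) or len(kept) < min_keep:
            break                               # step 5: no outliers found
        data = kept                             # otherwise, go around again
    return statistics.fmean(data)
```

For example, a single wild sample among otherwise tight measurements gets discarded on the first pass, and the second pass finds no further outliers.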
My knowledge of statistics is admittedly limited, but I got the inspiration for this strategy from a program used in astronomy to determine the luminosity of a star: the area surrounding the star is used to determine the amount of light in that area not originating from the star (say from glowing interstellar dust), but if there are other faint sources in that area, they need to be filtered out. I never got to see the source code for that program, so the above is my personal take on the idea.
s/most/best in the first sentence. Also, I wanted to add that you could also run the iterative refinement until either n reaches the set lower limit *or* a set number of samples remain (to avoid the problem I mentioned with bimodal distributions).
These are great points about how we aggregate our test results, but can we discuss them in a separate bug, please?
That is a good point, to that effect see: bug 707486 for a discussion of the impact of aggregation on talos. I will be opening up some additional bugs in the next week or so to document the findings/recommendations before I depart.
Doesn't sound like there's consensus here yet on which path to take. Could we use both techniques -- Jan's and the existing one -- and send mail if either method finds a regression?
We're not talking about changing the data we gather AIUI, just how we aggregate it.
> We're not talking about changing the data we gather AIUI, just how we aggregate it.
My guess is that this bug isn't a particularly big issue at the moment, and e.g. bug 707486 is more important. But maybe our resident statisticians have a different feeling about it?
Right now we are readjusting the way we run the tests and report on them. For example, we are looking into running the tests for 30 cycles and dropping the first 10 numbers, leaving 20 numbers which are much more consistent and reliable. Also, taking the median of 20 numbers vs. 9 (currently) is going to give us a more reliable result.
Another thing we are doing is looking to upload the raw values collected to the graph server, rather than a single calculated number per page. This will give us better insight into how the tests are being run and the flexibility to do these calculations on the graph server side and with other tools, instead of tweaking Talos.
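A minimal sketch of the trimming just described (the function name and shape are mine, not actual Talos code): drop the warmup cycles, then take the median of the stable remainder.

```python
import statistics

def summarize_cycles(times, warmup=10):
    """Drop the first `warmup` cycles (cold caches, warmup effects) and
    report the median of the remaining, more stable, measurements."""
    stable = times[warmup:]
    return statistics.median(stable)
```

With 30 cycles and warmup=10, this yields the median of 20 numbers, as proposed above.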
(In reply to Joel Maher (:jmaher) from comment #11)
> right now we are readjusting the way we run the tests and report on them.
> For example, we are looking into running the tests in 30 cycles and dropping
> the first 10 numbers leaving 20 numbers which are much more consistent and
> reliable. Also taking the median of 20 numbers vs 9 (currently) is going to
> give us a more reliable result.
We've run a few cycles on this and have made slight adjustments to our existing algorithm. Going per page tends to make it too difficult to weed through the noise and file actionable bugs for developers.