Closed Bug 1171694 Opened 10 years ago Closed 10 years ago

provide number of data points or estimate of standard error for old and new in perfherder compare summary

Categories

(Tree Management :: Perfherder, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED WORKSFORME

People

(Reporter: karlt, Unassigned)

References

Details

(Whiteboard: [dupme])

As indicated in bug 1164894, mean and stddev are not enough to draw statistical conclusions about changes in the mean value. Adding either the number of data points or an estimate of the standard error of each mean would provide enough information to determine confidence intervals and to estimate how many data points would be required to achieve sufficient confidence. If both the number of data points and the standard error are provided, then the standard deviation can be dropped. The confidence number added since bug 1164894 is nice for quickly scanning through and finding the tests with the most statistically significant differences, but it doesn't provide enough information for before-and-after confidence intervals.
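The request above can be made concrete with a small sketch (all numbers below are hypothetical, not real Perfherder data): given a mean, a standard deviation, and the number of data points n, the standard error and a confidence interval follow directly, but without n neither can be computed.

```python
import math

# Hypothetical per-test summary values (not real Perfherder data):
mean, stddev, n = 850.0, 20.0, 12

# Standard error of the mean -- this requires n, which mean/stddev
# alone do not provide.
std_err = stddev / math.sqrt(n)

# Approximate 95% confidence interval using the normal critical value
# 1.96; for small n a Student's t critical value would be more accurate.
ci_low = mean - 1.96 * std_err
ci_high = mean + 1.96 * std_err
print(std_err, ci_low, ci_high)
```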
See Also: → 1160613
Two things:

1. I think this is not quite so simple as looking at the set of samples per test, because the results of each run also depend on the environment it was run in (is there a background task ongoing, etc.). Thus, current best practice is to not be confident in any result without at least 5 or 6 retriggers (which usually picks up this kind of environmental variation). We want to make that clearer in the UI; this is covered by bug 1164898.

2. I'm a bit concerned about overloading what's essentially meant to be an "executive summary" with this type of information. I'd rather explain things in the very simple terms we use right now, and then let people dive into the details if they want to. If there is inaccuracy or ambiguity in the data we're presenting, let's address that head on rather than adding information that requires a strong stats background to decipher. We have bug 1164891 on file to allow digging into the per-testrun results in more detail; we could definitely include this type of information there.

Avi and/or Joel: Do you have any further thoughts? My inclination is to address this request in bug 1164891.
Flags: needinfo?(jmaher)
Flags: needinfo?(avihpit)
See Also: → 1164898
I think this depends on the test and the data, but if we want a standard approach we should find what works for most tests and have options to get more info. The simpler we can keep the math and the summaries, the greater the chance that anyone can interpret them. Maybe bug 1164891 could include a few metrics, and we could show them on mouseover.
Flags: needinfo?(jmaher)
The standard error is already used to calculate the confidence level (t-test value), so that's covered already. As for displaying before/after standard errors, personally I think it's overkill; I don't see much value in it or where it could be put to good use.

Overall, the statistics are there to present some additional info that can occasionally help understand the data better, but we don't care _that_ much about minute differences in values. The view is designed to give a comparison overview and some statistical analysis of the data we have, and I think it does so reasonably well and is reasonably useful. I really think there's not much value in high-resolution stats here. I could be mistaken, and if someone has good use cases for such high-res stats we can reconsider, but until then I think it's overkill with little to no value.
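For context, a Welch-style t statistic is one common way to combine before/after means, standard deviations, and sample counts into a single confidence-like number. The sketch below illustrates the idea with made-up values; it is not necessarily the exact formula Perfherder uses.

```python
import math

def welch_t(mean_old, sd_old, n_old, mean_new, sd_new, n_new):
    """Welch's t statistic for two independent samples.

    A sketch of how a single 'confidence' value can be derived from
    summary statistics; the exact Perfherder formula may differ.
    """
    se = math.sqrt(sd_old ** 2 / n_old + sd_new ** 2 / n_new)
    return (mean_new - mean_old) / se

# Example with hypothetical numbers:
t = welch_t(100.0, 5.0, 10, 110.0, 5.0, 10)
print(t)
```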
Flags: needinfo?(avihpit)
OK, after 6 months of use there hasn't been broad demand for more data in compare view per se (though we do now display the individual values if you hover over an item). Let's resolve this as incomplete for now; we can reopen if there is broad consensus that more data would be useful.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → INCOMPLETE
There is now a number-of-runs column as requested, and the mouseover explains that the ± value is a standard deviation, so these can be used to determine the standard error. I don't know exactly which bug covered "# Runs", but this is fixed now. Thanks!
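Deriving the standard error from what the compare view now shows is a one-liner (the values below are made up for illustration):

```python
import math

# Values read off the compare view (hypothetical): the mouseover says
# the ± value is a standard deviation, and "# Runs" gives the sample size.
stddev = 12.0
n_runs = 6

std_err = stddev / math.sqrt(n_runs)  # standard error of the mean
print(std_err)
```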
Resolution: INCOMPLETE → WORKSFORME
Whiteboard: [dupme]