Bug 1160613 (Closed). Opened 10 years ago; closed 10 years ago.

come up with a more accurate method for highlighting regressions/improvement in compare mode

Categories: Tree Management :: Perfherder (defect)
Severity: normal
Tracking: Not tracked
Status: RESOLVED FIXED
People: Reporter: jmaher; Assigned: avih
Attachments: 1 file

a spinoff from bug 1142680. We have an OK solution right now, but it should be optimized and accurate for the data we are using.
So, Joel has since changed it to some formula which I'm not able to understand because I don't know what its inputs are and the inputs don't look symmetric to me, which I find suspicious (i.e. I don't think it should matter what's considered old/new revision in order to decide if the diff between them is meaningful enough to highlight): https://github.com/jmaher/treeherder-ui/blob/nits/webapp/app/js/perf.js#L174 Kyle, let's say we have: - N retriggers for test T at revision R1 - M retriggers for the same test T at revision R2 What would you suggest to use in order take these N values and M values and decide if the difference between them is meaningful (in order to highlight the diff in red/green when it's meaningful)?
Flags: needinfo?(klahnakoski)
Blocks: 1142680
Use the t-test if the samples are performance samples with no outliers. Otherwise use the median test[1]; it is less sensitive at detecting population differences, but forgiving of bad data. If your number of samples is small (<20), *and* you want more sensitivity than the median test, *and* you still want to be insensitive to outliers, then use the Kruskal-Wallis one-way analysis of variance by ranks[2][3].

[1] http://docs.scipy.org/doc/scipy-0.15.1/reference/generated/scipy.stats.median_test.html
[2] http://en.wikipedia.org/wiki/Kruskal%E2%80%93Wallis_one-way_analysis_of_variance
[3] http://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.stats.f_oneway.html
Flags: needinfo?(klahnakoski)
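The median-test option above can be sketched in a few lines. This is a simplified, hypothetical illustration of the idea (counts above/below the pooled median fed into a 2x2 chi-square, with no tie handling and no continuity correction), not the scipy implementation linked above:

```python
from statistics import median

def mood_median_test(a, b):
    """Simplified sketch of Mood's median test: chi-square statistic on a
    2x2 table of counts above vs. at-or-below the pooled median."""
    grand = median(list(a) + list(b))
    aa = sum(1 for x in a if x > grand)   # sample A, above pooled median
    ab = sum(1 for x in b if x > grand)   # sample B, above pooled median
    ba = len(a) - aa                      # sample A, at or below
    bb = len(b) - ab                      # sample B, at or below
    n = len(a) + len(b)
    denom = (aa + ab) * (ba + bb) * (aa + ba) * (ab + bb)
    if denom == 0:
        return 0.0                        # degenerate table, no evidence
    return n * (aa * bb - ab * ba) ** 2 / denom
```

Because it only looks at which side of the pooled median each value falls on, a single wild outlier moves the statistic by at most one count, which is exactly the robustness being traded for sensitivity here.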
I don't think we can say for a fact that there are no outliers, but OTOH all the values come from running the same test on the same code base a few times, so we _probably_ don't have big outliers. N and M are typically about 5-6, possibly less (if users retrigger only twice, for instance), and definitely not more than 20. So with this added info, which would you choose? Roberto, if you have some ideas, feel free to help too.
Flags: needinfo?(klahnakoski)
Maybe the t-test is best. Then if there are outliers in the samples, alerting on their significance may be the correct side effect. 5-6 samples on either side is right on the cusp of making the median test useful; with fewer samples you will get too much noise. I have little experience with f_oneway(), but since it uses rank instead of amplitude, it will suffer the same problem the median test has.
Flags: needinfo?(klahnakoski)
I agree with Kyle that with this few datapoints, and presumably no extreme outliers, a t-test seems the way to go.
After discussing this with Roberto on IRC, we think the way to go is an unpaired t-test, where the sample sizes can differ, as can the variance between the groups (the latter since we know that some patches can increase or decrease the noise level). This is the relevant section: http://en.wikipedia.org/wiki/Student%27s_t-test#Equal_or_unequal_sample_sizes.2C_unequal_variances

I'll take this.
QA Contact: avihpit
Assignee: nobody → avihpit
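For reference, the unpaired, unequal-variance (Welch) t statistic from the Wikipedia section linked above can be sketched as follows. This is an illustrative stdlib-only version, not the perf.js code under review:

```python
from math import sqrt
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic for two independent samples with (possibly)
    unequal sizes and unequal variances: |mean(a) - mean(b)| over the
    standard error sqrt(s_a^2/n_a + s_b^2/n_b).
    Sketch only: no degrees-of-freedom or p-value computation, and each
    sample needs at least 2 values for statistics.variance()."""
    va, vb = variance(a), variance(b)        # sample variances (n-1 denominator)
    se = sqrt(va / len(a) + vb / len(b))     # standard error of the difference
    return abs(mean(a) - mean(b)) / se
```

The absolute value is used because for highlighting purposes only the magnitude of the difference matters; the sign (regression vs. improvement) comes from comparing the means directly.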
Oh cool. I had experimented with this a bit; using the t-test from graph server was easy to add to the JavaScript code, and we can adjust it as needed. In an hour or so I can get the code into a branch on GitHub and we can take it from there.
jmaher, wlach, if one of you thinks the other is enough for the review, feel free to remove yourself from the r?. Again, not sure how the interaction between GitHub and Bugzilla should be, so posting the PR info here as well:

---------------------------------

The confidence level is used for deciding which changes to highlight (red/green) and also for display in the table view.

New behavior:
- Ignore differences below 1.5% and low (<0.5) t-test values.
- Diff above 1.5% and t-value below 1.0 -> not-sure-regression.
- Above t-value of 1.0 -> confident -> highlight red/green.
- Added the t-test value as the last column of the table ("Conf.").
- Added the actual runs values to the "Runs" tooltip.

TODO:
- Fine-tune the t-test constants 0.5 and 1.0 if required.
- Possibly unify this calculation with our regression alert thresholds.
Attachment #8606976 - Flags: review?(wlachance)
Attachment #8606976 - Flags: review?(jmaher)
Attachment #8606976 - Flags: feedback?(rvitillo)
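The bucketing described in the PR summary above can be sketched as follows. The function and parameter names here are hypothetical; only the 1.5%, 0.5, and 1.0 constants come from this bug:

```python
def classify_change(old_avg, new_avg, t_value,
                    min_diff_pct=1.5, t_low=0.5, t_high=1.0):
    """Bucket a perf comparison for highlighting, per the thresholds
    described in this bug. Returns one of: "ignore", "not-sure",
    "confident"."""
    diff_pct = abs(new_avg - old_avg) / old_avg * 100
    if diff_pct < min_diff_pct or t_value < t_low:
        return "ignore"       # change too small, or confidence too low
    if t_value < t_high:
        return "not-sure"     # visible diff, but weak statistical support
    return "confident"        # highlight red/green
```

Requiring both a minimum percentage difference and a minimum t-value avoids flagging tiny-but-consistent changes as well as large-but-noisy ones.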
Comment on attachment 8606976 [details] [review]
Confidence level for compare-herder

I made a few comments in the GitHub PR. It looks like you are following http://en.wikipedia.org/wiki/Student's_t-test#Equal_or_unequal_sample_sizes.2C_unequal_variances which seems to be in line with what we are doing in our other calculations. Overall the cleanup of the code looks good!
Attachment #8606976 - Flags: review?(jmaher) → review+
(In reply to Joel Maher (:jmaher) from comment #9)

> I made a few comments in the github PR.

Thanks, I think I responded to all of them, and the only change I need to make is the 4/2 indentation. If you feel I should address more things, let me know.

> it looks like you are following:
> http://en.wikipedia.org/wiki/Student's_t-test#Equal_or_unequal_sample_sizes.2C_unequal_variances

Indeed, as I said in comment 6.

> this seems to be inline with what we are doing in our other calculations.

Good to know :)

> Overall the cleanup with the code looks good!

Thanks!
I commented on the github PR.
(In reply to Roberto Agostino Vitillo (:rvitillo) from comment #11)

> I commented on the github PR.

Thanks. I'm summarizing it here: Roberto says that t-test calculations in which the stddev and the means use geomeans also need a log-transform of the values, since using geomeans implies that the sample distribution is logarithmic, which is what the formula assumes. The patch doesn't use a log-transform. For us, the distribution is probably not logarithmic, and regardless, our variance is low enough for that not to matter. Still, to follow the correct definition of the t-test, we should either use geomeans with a log-transform throughout, or just use plain averages everywhere. I'll change the code to use plain averages, since we don't need geomeans in this case.
What I meant is that you should implement the t-test with the arithmetic mean and the arithmetic standard deviation, as defined e.g. on Wikipedia. Now, if you want to use the test to compare the arithmetic means of two groups A and B, you just invoke the t-test function on your data as-is. If you instead want to compare the geometric means of A and B, then you log-transform the values of A and B before invoking the t-test.
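A small illustration of this point, using an ordinary Welch t statistic (a sketch, not the patch's code): comparing geometric means is the same as running the plain arithmetic t-test on log-transformed values.

```python
from math import exp, log, sqrt
from statistics import mean, variance

def welch_t(a, b):
    """Plain (arithmetic) Welch t statistic; needs >= 2 values per sample."""
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return abs(mean(a) - mean(b)) / se

def geomean(xs):
    """Geometric mean via the log/exp identity: exp(mean(log(x)))."""
    return exp(mean(log(x) for x in xs))

# To compare geometric means, log-transform first, then call the same
# unmodified t-test -- no geomean-flavored stddev formula needed.
a = [100.0, 110.0, 105.0]
b = [120.0, 125.0, 130.0]
t_geometric = welch_t([log(x) for x in a], [log(x) for x in b])
```

This is exactly the "either log-transform everywhere or use plain averages everywhere" choice: mixing geomeans into the mean while keeping an arithmetic stddev belongs to neither variant.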
I added the following commits:
- Review comments (indentation, averages instead of geomeans for the t-test).
- Improved stddev for 1 sample: now uses the stddev of the other set.
- Bugfix where .isEmpty wasn't true when both sets are empty.
- Improvement: make a correct distinction between legitimate 0 and N/A values.
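The "stddev for 1 sample" item can be illustrated with a hypothetical helper (a sketch of the described fallback, not the actual patch): a single-value set has no defined sample stddev, so the other set's stddev is borrowed as an estimate.

```python
from statistics import stdev

def stddev_with_fallback(xs, other):
    """Sample stddev of xs; if xs has fewer than 2 values (stddev
    undefined), fall back to the other set's stddev, and to 0.0 if
    neither set has enough values. Hypothetical helper names."""
    if len(xs) < 2:
        return stdev(other) if len(other) >= 2 else 0.0
    return stdev(xs)
```
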
Comment on attachment 8606976 [details] [review]
Confidence level for compare-herder

Pretty close. I didn't look at the math in enormous detail, assuming that rvitillo had already done so. There are some UI and other minor changes that I'd like to see happen before landing, but in general this looks good! r? me again once you've addressed those.
Attachment #8606976 - Flags: review?(wlachance) → review-
Attachment #8606976 - Flags: feedback?(rvitillo) → feedback+
Commit pushed to master at https://github.com/mozilla/treeherder-ui
https://github.com/mozilla/treeherder-ui/commit/d120b6109acddc33e9560385fa0069ef16b69d2f

Bug 1160613: compare perf: use t-test for confidence level

The confidence level is used for deciding which changes to highlight (red/green) and also for display in the table view.

New behavior:
- Ignore differences below 1.5% and t-test values below 0.5.
- Diff above 1.5% and t-value below 1.0 -> not-sure-regression.
- Above t-value of 1.0 -> confident -> highlight red/green.
- Added the t-test value as the last column of the table ("Confidence").
- Added the actual runs values to the "Runs" tooltip.

TODO:
- Fine-tune the t-test constants 0.5 and 1.0 if required.
Comment on attachment 8606976 [details] [review] Confidence level for compare-herder Issues addressed, r+'ed and merged!
Attachment #8606976 - Flags: review- → review+
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Commit pushed to master at https://github.com/mozilla/treeherder
https://github.com/mozilla/treeherder/commit/87705012c42f200be052db7747c136f8827b32b7

Bug 1160613: compare perf: use t-test for confidence level

The confidence level is used for deciding which changes to highlight (red/green) and also for display in the table view.

New behavior:
- Ignore differences below 1.5% and t-test values below 0.5.
- Diff above 1.5% and t-value below 1.0 -> not-sure-regression.
- Above t-value of 1.0 -> confident -> highlight red/green.
- Added the t-test value as the last column of the table ("Confidence").
- Added the actual runs values to the "Runs" tooltip.

TODO:
- Fine-tune the t-test constants 0.5 and 1.0 if required.
See Also: → 1171694