Closed
Bug 1160613
Opened 10 years ago
Closed 10 years ago
come up with a more accurate method for highlighting regressions/improvement in compare mode
Categories
(Tree Management :: Perfherder, defect)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: jmaher, Assigned: avih)
References
Details
Attachments
(1 file)
A spinoff from bug 1142680. We have an OK solution right now, but it should be optimized and made accurate for the data we are using.
Reporter
Comment 1•10 years ago
So, Joel has since changed it to a formula which I'm not able to understand, because I don't know what its inputs are, and the inputs don't look symmetric to me, which I find suspicious (i.e. I don't think it should matter which revision is considered old/new when deciding if the diff between them is meaningful enough to highlight): https://github.com/jmaher/treeherder-ui/blob/nits/webapp/app/js/perf.js#L174
Kyle, let's say we have:
- N retriggers for test T at revision R1
- M retriggers for the same test T at revision R2
What would you suggest to use in order to take these N values and M values and decide if the difference between them is meaningful (in order to highlight the diff in red/green when it's meaningful)?
Flags: needinfo?(klahnakoski)
Comment 2•10 years ago
Use the t-test, if the samples are performance samples with no outliers. Otherwise use the median test[1]; it is less sensitive at detecting population differences, but forgiving of bad data. If your number of samples is small (<20), *and* you want more sensitivity than the median test, *and* you still want to be insensitive to outliers, then use the Kruskal–Wallis one-way analysis of variance by ranks[2][3].
[1] http://docs.scipy.org/doc/scipy-0.15.1/reference/generated/scipy.stats.median_test.html
[2] http://en.wikipedia.org/wiki/Kruskal%E2%80%93Wallis_one-way_analysis_of_variance
[3] http://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.stats.f_oneway.html
Flags: needinfo?(klahnakoski)
Assignee
Comment 3•10 years ago
I don't think we can say for a fact that there are no outliers, but OTOH all the values come from running the same test on the same code base a few times, so we _probably_ don't have big outliers.
N and M are typically about 5-6, possibly less (if users retrigger only twice for instance) and definitely not more than 20.
So with this added info, which will you choose?
Roberto, if you have some ideas, feel free to help too.
Flags: needinfo?(klahnakoski)
Comment 4•10 years ago
Maybe the t-test is best. If there are outliers in the samples, then alerting on significance may be the correct side effect. 5-6 samples on either side is right on the cusp of making the median test useful; with fewer samples you will get too much noise. I have little experience with f_oneway(), but since it uses rank instead of amplitude, it will suffer the same problem that the median test has.
Flags: needinfo?(klahnakoski)
Comment 5•10 years ago
I agree with Kyle that with so few data points, and presumably no extreme outliers, a t-test seems the way to go.
Assignee
Comment 6•10 years ago
After discussing this with Roberto on IRC, we think the way to go is an unpaired t-test that allows different sample sizes as well as different variances between the groups (the latter because we know some patches can increase or decrease the noise level).
This is the relevant section http://en.wikipedia.org/wiki/Student%27s_t-test#Equal_or_unequal_sample_sizes.2C_unequal_variances
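The statistic from that Wikipedia section (Welch's unequal-variances, unequal-sample-sizes t-test) can be sketched in plain JavaScript. This is an illustrative sketch, not the actual perf.js code; the function names are assumptions:

```javascript
// Arithmetic mean of a sample.
function mean(values) {
  return values.reduce(function (a, b) { return a + b; }, 0) / values.length;
}

// Bessel-corrected sample variance (divide by n - 1).
function sampleVariance(values) {
  var m = mean(values);
  var sumSq = values.reduce(function (acc, v) {
    return acc + (v - m) * (v - m);
  }, 0);
  return sumSq / (values.length - 1);
}

// Welch's t statistic: difference of means over the combined
// standard error; sample sizes and variances may differ.
function welchT(oldRuns, newRuns) {
  var delta = mean(newRuns) - mean(oldRuns);
  var stderr = Math.sqrt(
    sampleVariance(oldRuns) / oldRuns.length +
    sampleVariance(newRuns) / newRuns.length
  );
  return Math.abs(delta / stderr);
}
```

The absolute value is taken so that, as raised in comment 1, the result does not depend on which revision is considered old or new.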
I'll take this.
QA Contact: avihpit
Assignee
Updated•10 years ago
Assignee: nobody → avihpit
QA Contact: avihpit
Reporter
Comment 7•10 years ago
Oh cool. I had experimented with this a bit; the t-test from graph server was easy to add to the JavaScript code, and we can adjust it as needed. In an hour or so I can get the code in a branch on GitHub and we can take it from there.
Assignee
Comment 8•10 years ago
jmaher, wlach, if one of you thinks the other is enough for the review, feel free to remove yourself from the r?.
Again, not sure how the interaction between github and bugzilla should be, so posting the PR info here as well:
---------------------------------
The confidence level is used for deciding which changes to highlight (red/green)
and also for display at the table view.
New behavior:
- Ignore differences below 1.5% and low (<0.5) t-test values.
- Diff above 1.5%, and t-value below 1.0 -> not-sure-regression.
- Above t-value of 1.0 -> confident -> highlight red/green.
- Added the t-test value as the last column at the table ("Conf.")
- Added the actual runs values at the "Runs" tooltip.
TODO:
- Fine tune the t-test constants 0.5 and 1.0 if required.
- Possibly unify this calculation with our regression alerts thresholds.
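The highlighting rules above could be sketched roughly like this. It is an illustrative sketch under assumed names and return values, not the actual patch:

```javascript
// Classify a comparison given the percentage difference between
// revisions and the t-test value. Thresholds (1.5%, 0.5, 1.0) are
// the tuneable constants from the PR description.
function classifyChange(pctDiff, tValue) {
  if (Math.abs(pctDiff) < 1.5 || tValue < 0.5) {
    return 'ignore';     // diff too small or t-value too low to call
  }
  if (tValue < 1.0) {
    return 'maybe';      // diff above 1.5% but low confidence
  }
  return 'confident';    // highlight red (regression) or green (improvement)
}
```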
Attachment #8606976 -
Flags: review?(wlachance)
Attachment #8606976 -
Flags: review?(jmaher)
Attachment #8606976 -
Flags: feedback?(rvitillo)
Reporter
Comment 9•10 years ago
Comment on attachment 8606976 [details] [review]
Confidence level for compare-herder
I made a few comments in the GitHub PR. It looks like you are following:
http://en.wikipedia.org/wiki/Student's_t-test#Equal_or_unequal_sample_sizes.2C_unequal_variances
This seems to be in line with what we are doing in our other calculations.
Overall the cleanup with the code looks good!
Attachment #8606976 -
Flags: review?(jmaher) → review+
Assignee
Comment 10•10 years ago
(In reply to Joel Maher (:jmaher) from comment #9)
> I made a few comments in the github PR.
Thanks, I think I responded to all of them and the only change I need to do is the 4/2 indentation. If you feel I should address more things, let me know.
> it looks like you are following:
> http://en.wikipedia.org/wiki/Student's_t-test#Equal_or_unequal_sample_sizes.
> 2C_unequal_variances
Indeed, as I said on comment 6.
> this seems to be inline with what we are doing in our other calcuations.
Good to know :)
> Overall the cleanup with the code looks good!
Thanks!
Comment 11•10 years ago
I commented on the github PR.
Assignee
Comment 12•10 years ago
(In reply to Roberto Agostino Vitillo (:rvitillo) from comment #11)
> I commented on the github PR.
Thanks. I'm summarizing it here:
Roberto says that t-test calculations in which the stddev and the means use geomeans also need a log-transform of the values, since the use of geomeans implies that the sample distribution is logarithmic, which is what the formula assumes. The patch doesn't use a log-transform.
For us, the distribution is probably not logarithmic, and regardless, our variance is low enough for that not to matter. Still, to follow the correct definition of the t-test calculations, we should either use geomeans with a log-transform everywhere, or just use plain averages everywhere.
I'll change the code to use plain averages since we don't have a need for geomeans in this case.
Comment 13•10 years ago
What I meant is that you should implement the t-test with the arithmetic mean and the arithmetic standard deviation as defined on e.g. Wikipedia.
Now, if you want to use the test to compare the arithmetic means of two groups A and B, you should just invoke the t-test function on your data as-is. If you desire instead to compare the geometric means of A and B, then you log-transform the values of A and B before invoking the t-test.
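The distinction Roberto describes can be summarized in a small sketch. `logTransform` is illustrative, and `runTTest` is a hypothetical stand-in for whatever t-test function is used, not a real API:

```javascript
// Log-transform a sample so that an ordinary (arithmetic-mean)
// t-test on the transformed values compares geometric means.
function logTransform(values) {
  return values.map(Math.log);
}

// To compare arithmetic means: runTTest(a, b)
// To compare geometric means:  runTTest(logTransform(a), logTransform(b))
```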
Assignee
Comment 14•10 years ago
I added the following commits:
- Review comments (indent, averages instead of geomeans for the t-test).
- Improved stddev for 1 sample: now uses stddev of the other set.
- Bugfix where .isEmpty wasn't true when both sets are empty.
- Improvement: make correct distinction between legitimate 0 and N/A values.
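The single-sample stddev fallback in the second commit could look something like this. It is an illustrative sketch, not the actual perf.js change:

```javascript
// Sample stddev for one side of the t-test. With only one retrigger,
// a sample stddev is undefined (division by n - 1 = 0), so fall back
// to the stddev of the other set, as described above.
function stddevForTTest(values, otherStddev) {
  if (values.length < 2) {
    return otherStddev; // can't compute a stddev from one point
  }
  var m = values.reduce(function (a, b) { return a + b; }, 0) / values.length;
  var sumSq = values.reduce(function (acc, v) {
    return acc + (v - m) * (v - m);
  }, 0);
  return Math.sqrt(sumSq / (values.length - 1));
}
```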
Comment 15•10 years ago
Comment on attachment 8606976 [details] [review]
Confidence level for compare-herder
Pretty close, I didn't look at the math in enormous detail, assuming that rvitillo had already done so. Some UI and other minor changes that I'd like to see happen before landing, but in general this looks good! r? me again once you've addressed those.
Attachment #8606976 -
Flags: review?(wlachance) → review-
Updated•10 years ago
Attachment #8606976 -
Flags: feedback?(rvitillo) → feedback+
Comment 16•10 years ago
Commit pushed to master at https://github.com/mozilla/treeherder-ui
https://github.com/mozilla/treeherder-ui/commit/d120b6109acddc33e9560385fa0069ef16b69d2f
Bug 1160613: compare perf: use t-test for confidence level
The confidence level is used for deciding which changes to highlight (red/green)
and also for display at the table view.
New behavior:
- Ignore differences below 1.5% and t-test values below 0.5.
- Diff above 1.5%, and t-value below 1.0 -> not-sure-regression.
- Above t-value of 1.0 -> confident -> highlight red/green.
- Added the t-test value as the last column at the table ("Confidence").
- Added the actual runs values at the "Runs" tooltip.
TODO:
- Fine tune the t-test constants 0.5 and 1.0 if required.
Comment 17•10 years ago
Comment on attachment 8606976 [details] [review]
Confidence level for compare-herder
Issues addressed, r+'ed and merged!
Attachment #8606976 -
Flags: review- → review+
Updated•10 years ago
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Comment 18•10 years ago
Commit pushed to master at https://github.com/mozilla/treeherder
https://github.com/mozilla/treeherder/commit/87705012c42f200be052db7747c136f8827b32b7
Bug 1160613: compare perf: use t-test for confidence level
The confidence level is used for deciding which changes to highlight (red/green)
and also for display at the table view.
New behavior:
- Ignore differences below 1.5% and t-test values below 0.5.
- Diff above 1.5%, and t-value below 1.0 -> not-sure-regression.
- Above t-value of 1.0 -> confident -> highlight red/green.
- Added the t-test value as the last column at the table ("Confidence").
- Added the actual runs values at the "Runs" tooltip.
TODO:
- Fine tune the t-test constants 0.5 and 1.0 if required.