Closed
Bug 1160613
Opened 10 years ago
Closed 10 years ago
come up with a more accurate method for highlighting regressions/improvement in compare mode
Categories
(Tree Management :: Perfherder, defect)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: jmaher, Assigned: avih)
References
Details
Attachments
(1 file)
A spinoff from bug 1142680. We have an OK solution right now, but it should be optimized and made accurate for the data we are using.
Reporter
Comment 1•10 years ago
So, Joel has since changed it to a formula which I'm not able to understand, because I don't know what its inputs are, and the inputs don't look symmetric to me, which I find suspicious (i.e. I don't think it should matter which revision is considered old/new when deciding if the diff between them is meaningful enough to highlight): https://github.com/jmaher/treeherder-ui/blob/nits/webapp/app/js/perf.js#L174
Kyle, let's say we have:
- N retriggers for test T at revision R1
- M retriggers for the same test T at revision R2
What would you suggest to use in order to take these N values and M values and decide if the difference between them is meaningful (in order to highlight the diff in red/green when it's meaningful)?
Flags: needinfo?(klahnakoski)
Comment 2•10 years ago
Use the t-test, if the samples are performance samples with no outliers. Otherwise use the median test[1]; it is less sensitive at detecting population differences, but forgiving of bad data. If your number of samples is small (<20), *and* you want more sensitivity than the median test, *and* you still want to be insensitive to outliers, then use the Kruskal–Wallis one-way analysis of variance by ranks[2][3].
[1] http://docs.scipy.org/doc/scipy-0.15.1/reference/generated/scipy.stats.median_test.html
[2] http://en.wikipedia.org/wiki/Kruskal%E2%80%93Wallis_one-way_analysis_of_variance
[3] http://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.stats.f_oneway.html
Flags: needinfo?(klahnakoski)
Assignee
Comment 3•10 years ago
I don't think we can say for a fact that there are no outliers, but OTOH all the values come from running the same test on the same code base a few times, so we _probably_ don't have big outliers.
N and M are typically about 5-6, possibly less (if users retrigger only twice for instance) and definitely not more than 20.
So with this added info, which will you choose?
Roberto, if you have some ideas, feel free to help too.
Flags: needinfo?(klahnakoski)
Comment 4•10 years ago
Maybe the t-test is best. If there are outliers in the samples, then alerting on significance may be the correct side effect. 5-6 samples on either side is right on the cusp of making the median test useful; with fewer samples you will get too much noise. I have little experience with f_oneway(), but since it uses rank instead of amplitude, it will suffer the same problem that the median test has.
Flags: needinfo?(klahnakoski)
Comment 5•10 years ago
I agree with Kyle that with so few data points, and presumably no extreme outliers, a t-test seems the way to go.
Assignee
Comment 6•10 years ago
After discussing this with Roberto on IRC, we think the way to go is an unpaired t-test that allows different sample sizes as well as different variances between the groups (the latter because we know some patches can increase or decrease the noise level).
This is the relevant section http://en.wikipedia.org/wiki/Student%27s_t-test#Equal_or_unequal_sample_sizes.2C_unequal_variances
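The statistic from that Wikipedia section (Welch's unequal-variances, unequal-sample-sizes t-test) can be sketched in plain JavaScript. This is an illustrative sketch, not the actual perf.js code; the function names are assumptions:

```javascript
// Arithmetic mean of a sample.
function mean(values) {
  return values.reduce(function (a, b) { return a + b; }, 0) / values.length;
}

// Bessel-corrected sample variance (divide by n - 1).
function sampleVariance(values) {
  var m = mean(values);
  var sumSq = values.reduce(function (acc, v) {
    return acc + (v - m) * (v - m);
  }, 0);
  return sumSq / (values.length - 1);
}

// Welch's t statistic: difference of means over the combined
// standard error; sample sizes and variances may differ.
function welchT(oldRuns, newRuns) {
  var delta = mean(newRuns) - mean(oldRuns);
  var stderr = Math.sqrt(
    sampleVariance(oldRuns) / oldRuns.length +
    sampleVariance(newRuns) / newRuns.length
  );
  return Math.abs(delta / stderr);
}
```

The absolute value is taken so that, as raised in comment 1, the result does not depend on which revision is considered old or new.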
I'll take this.
QA Contact: avihpit
Assignee
Updated•10 years ago
Assignee: nobody → avihpit
QA Contact: avihpit
Reporter
Comment 7•10 years ago
Oh cool. I had experimented with this a bit; the t-test from graph server was easy to add to the JavaScript code, and we can adjust it as needed. In an hour or so I can get the code in a branch on GitHub and we can take it from there.
Assignee
Comment 8•10 years ago
jmaher, wlach, if one of you thinks the other is enough for the review, feel free to remove yourself from the r?.
Again, not sure how the interaction between github and bugzilla should be, so posting the PR info here as well:
---------------------------------
The confidence level is used for deciding which changes to highlight (red/green)
and also for display at the table view.
New behavior:
- Ignore differences below 1.5% and low (<0.5) t-test values.
- Diff above 1.5%, and t-value below 1.0 -> not-sure-regression.
- Above t-value of 1.0 -> confident -> highlight red/green.
- Added the t-test value as the last column at the table ("Conf.")
- Added the actual runs values at the "Runs" tooltip.
TODO:
- Fine tune the t-test constants 0.5 and 1.0 if required.
- Possibly unify this calculation with our regression alerts thresholds.
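The highlighting rules above could be sketched roughly like this. It is an illustrative sketch under assumed names and return values, not the actual patch:

```javascript
// Classify a comparison given the percentage difference between
// revisions and the t-test value. Thresholds (1.5%, 0.5, 1.0) are
// the tuneable constants from the PR description.
function classifyChange(pctDiff, tValue) {
  if (Math.abs(pctDiff) < 1.5 || tValue < 0.5) {
    return 'ignore';     // diff too small or t-value too low to call
  }
  if (tValue < 1.0) {
    return 'maybe';      // diff above 1.5% but low confidence
  }
  return 'confident';    // highlight red (regression) or green (improvement)
}
```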
Attachment #8606976 -
Flags: review?(wlachance)
Attachment #8606976 -
Flags: review?(jmaher)
Attachment #8606976 -
Flags: feedback?(rvitillo)
Reporter
Comment 9•10 years ago
Comment on attachment 8606976 [details] [review]
Confidence level for compare-herder
I made a few comments in the GitHub PR. It looks like you are following:
http://en.wikipedia.org/wiki/Student's_t-test#Equal_or_unequal_sample_sizes.2C_unequal_variances
This seems to be in line with what we are doing in our other calculations.
Overall the cleanup with the code looks good!
Attachment #8606976 -
Flags: review?(jmaher) → review+
Assignee
Comment 10•10 years ago
(In reply to Joel Maher (:jmaher) from comment #9)
> I made a few comments in the github PR.
Thanks, I think I responded to all of them and the only change I need to do is the 4/2 indentation. If you feel I should address more things, let me know.
> it looks like you are following:
> http://en.wikipedia.org/wiki/Student's_t-test#Equal_or_unequal_sample_sizes.
> 2C_unequal_variances
Indeed, as I said on comment 6.
> this seems to be inline with what we are doing in our other calcuations.
Good to know :)
> Overall the cleanup with the code looks good!
Thanks!
Comment 11•10 years ago
I commented on the github PR.
Assignee
Comment 12•10 years ago
(In reply to Roberto Agostino Vitillo (:rvitillo) from comment #11)
> I commented on the github PR.
Thanks. I'm summarizing it here:
Roberto says that t-test calculations in which the stddev and the means use geomeans also need a log-transform of the values, since the use of geomeans implies that the sample distribution is logarithmic, which is what the formula assumes. The patch doesn't use a log-transform.
For us, the distribution is probably not logarithmic, and regardless, our variance is low enough for that not to matter. Still, to follow the correct definition of the t-test calculations, we should either use geomeans with a log-transform everywhere, or just use plain averages everywhere.
I'll change the code to use plain averages since we don't have a need for geomeans in this case.
Comment 13•10 years ago
What I meant is that you should implement the t-test with the arithmetic mean and the arithmetic standard deviation as defined on e.g. Wikipedia.
Now, if you want to use the test to compare the arithmetic means of two groups A and B, you should just invoke the t-test function on your data as-is. If you desire instead to compare the geometric means of A and B, then you log-transform the values of A and B before invoking the t-test.
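The distinction Roberto describes can be summarized in a small sketch. `logTransform` is illustrative, and `runTTest` is a hypothetical stand-in for whatever t-test function is used, not a real API:

```javascript
// Log-transform a sample so that an ordinary (arithmetic-mean)
// t-test on the transformed values compares geometric means.
function logTransform(values) {
  return values.map(Math.log);
}

// To compare arithmetic means: runTTest(a, b)
// To compare geometric means:  runTTest(logTransform(a), logTransform(b))
```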
Assignee
Comment 14•10 years ago
I added the following commits:
- Review comments (indent, averages instead of geomeans for the t-test).
- Improved stddev for 1 sample: now uses stddev of the other set.
- Bugfix where .isEmpty wasn't true when both sets are empty.
- Improvement: make correct distinction between legitimate 0 and N/A values.
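The single-sample stddev fallback in the second commit could look something like this. It is an illustrative sketch, not the actual perf.js change:

```javascript
// Sample stddev for one side of the t-test. With only one retrigger,
// a sample stddev is undefined (division by n - 1 = 0), so fall back
// to the stddev of the other set, as described above.
function stddevForTTest(values, otherStddev) {
  if (values.length < 2) {
    return otherStddev; // can't compute a stddev from one point
  }
  var m = values.reduce(function (a, b) { return a + b; }, 0) / values.length;
  var sumSq = values.reduce(function (acc, v) {
    return acc + (v - m) * (v - m);
  }, 0);
  return Math.sqrt(sumSq / (values.length - 1));
}
```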
Comment 15•10 years ago
Comment on attachment 8606976 [details] [review]
Confidence level for compare-herder
Pretty close, I didn't look at the math in enormous detail, assuming that rvitillo had already done so. Some UI and other minor changes that I'd like to see happen before landing, but in general this looks good! r? me again once you've addressed those.
Attachment #8606976 -
Flags: review?(wlachance) → review-
Updated•10 years ago
Attachment #8606976 -
Flags: feedback?(rvitillo) → feedback+
Comment 16•10 years ago
Commit pushed to master at https://github.com/mozilla/treeherder-ui
https://github.com/mozilla/treeherder-ui/commit/d120b6109acddc33e9560385fa0069ef16b69d2f
Bug 1160613: compare perf: use t-test for confidence level
The confidence level is used for deciding which changes to highlight (red/green)
and also for display at the table view.
New behavior:
- Ignore differences below 1.5% and t-test values below 0.5.
- Diff above 1.5%, and t-value below 1.0 -> not-sure-regression.
- Above t-value of 1.0 -> confident -> highlight red/green.
- Added the t-test value as the last column at the table ("Confidence").
- Added the actual runs values at the "Runs" tooltip.
TODO:
- Fine tune the t-test constants 0.5 and 1.0 if required.
Comment 17•10 years ago
Comment on attachment 8606976 [details] [review]
Confidence level for compare-herder
Issues addressed, r+'ed and merged!
Attachment #8606976 -
Flags: review- → review+
Updated•10 years ago
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Comment 18•10 years ago
Commit pushed to master at https://github.com/mozilla/treeherder
https://github.com/mozilla/treeherder/commit/87705012c42f200be052db7747c136f8827b32b7
Bug 1160613: compare perf: use t-test for confidence level
The confidence level is used for deciding which changes to highlight (red/green)
and also for display at the table view.
New behavior:
- Ignore differences below 1.5% and t-test values below 0.5.
- Diff above 1.5%, and t-value below 1.0 -> not-sure-regression.
- Above t-value of 1.0 -> confident -> highlight red/green.
- Added the t-test value as the last column at the table ("Confidence").
- Added the actual runs values at the "Runs" tooltip.
TODO:
- Fine tune the t-test constants 0.5 and 1.0 if required.