Closed Bug 1164898 Opened 9 years ago Closed 9 years ago

perfherder compare view should make it more clear when we don't have enough data to be confident in our comparison

Categories

(Tree Management :: Perfherder, defect)

Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: jmaher, Assigned: wlach)

References

Details

Attachments

(1 file)

46 bytes, text/x-github-pull-request
jfrench: review+
avih: feedback+
jmaher: feedback+
This could be related to or part of bug 1160613.

When we have a single data point, it is not useful for helping us determine whether we have a regression or an improvement.  In fact, we should have a minimum of 6 data points to be confident in a change.

We could require this for the stddev calculation and report N/A unless there are >= 6 data points.  But if there are not enough, should we show them at all?  It would be nice to hide them with a URL param or a checkbox in the UI:
hideIncompleteResults=1

That name doesn't sound right, but whenever we don't have enough data, those results should be easy to hide from the view (and hidden by default).
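For illustration, a minimal sketch of the hiding behaviour described above, assuming a simple list-of-rows structure (the field names, the 6-point cutoff, and the hide_incomplete flag are all illustrative, not anything from the Perfherder code):

# Hypothetical sketch: hide comparison rows that have fewer than the
# suggested minimum of 6 data points on either side.
MIN_DATA_POINTS = 6

def visible_rows(rows, hide_incomplete=True):
    """Return the rows to display, optionally hiding under-sampled ones."""
    if not hide_incomplete:
        return rows
    return [row for row in rows
            if len(row["old_values"]) >= MIN_DATA_POINTS
            and len(row["new_values"]) >= MIN_DATA_POINTS]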

Likewise, finding some way to show the end user which results lack sufficient data would be good (along with a button to retrigger!!).
(In reply to Joel Maher (:jmaher) from comment #0)
> This could be related to or part of bug 1160613.
> 
> When we have a single data point, it is not useful for helping us determine
> whether we have a regression or an improvement.  In fact, we should have a
> minimum of 6 data points to be confident in a change.

I don't think that's true. If, for instance, you have a few data points for "old" and their stddev is low (say, < 5%), and you have a single "new" data point with a value of twice the average (or geomean) of the old values, then you can be pretty confident it's a regression (assuming "lower is better" for this test).

A confidence level can be defined in a way which takes into account the number of values in each group, and, in fact, a t-test does just that.
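As an illustration of that point, a minimal sketch (not the actual Perfherder code) of a Welch-style t statistic, where the size of each group directly feeds into the result:

from statistics import mean, stdev

def welch_t(old, new):
    """Welch's t statistic for two groups of replicate values.

    The group sizes appear in the denominator, so having few replicates
    on either side automatically lowers the resulting confidence.
    """
    se2 = stdev(old) ** 2 / len(old) + stdev(new) ** 2 / len(new)
    return abs(mean(new) - mean(old)) / se2 ** 0.5

# Four stable "old" runs vs. two clearly slower "new" runs.
print(welch_t([100, 101, 99, 100], [205, 210]))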

Unless you think otherwise, I suggest that the above approach would be part of bug 1160613 (I actually have a patch ready, will PR today hopefully).
While I agree with you, how will we know what the historical noise is on a test with a single data point?  Assuming we solve that, I believe we can build a good solution for notifying the user whether we have enough data or need more.
The historical noise doesn't matter IMO.

Each group of values has its own noise level, and that is what matters. Why would you care if a week ago this test had a high noise level, when today (at one or both of "old" and "new") the noise level is low?
My comment specifically called out "a single data point".  We have enough random hiccups in our data, and our noise levels change (like in the case of svg, opacity) or on weekends.  At the very least you need 2 data points to get a stddev, and I don't think your confidence can be high with just two.
A single data point would obviously affect the confidence level (for the worse), and it's true that with a single point we can only guess a stddev value. However, my patch for bug 1160613 assumes a stddev of 15% for a single point, which is quite a bit higher than our average, so if it's mistaken, it probably errs towards the lower-confidence side.

Though it's obviously not perfect since there's only so much info you can derive from a single point.

Regardless, for a single data point it should produce a reasonably low confidence level, and we can further fine-tune the thresholds as we go.
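A sketch of how that fallback might look; the 15% figure is the one mentioned above, but everything else here is illustrative rather than taken from the patch for bug 1160613:

from statistics import mean, stdev

def group_variance(values, fallback_stddev_ratio=0.15):
    """Variance of a group, guessing the stddev for a single-value group."""
    if len(values) > 1:
        return stdev(values) ** 2
    # Single data point: assume a pessimistic 15% stddev.
    return (values[0] * fallback_stddev_ratio) ** 2

def welch_t_with_fallback(old, new):
    se2 = group_variance(old) / len(old) + group_variance(new) / len(new)
    return abs(mean(new) - mean(old)) / se2 ** 0.5

# A single "new" point at twice the old average still yields a usable,
# though deliberately conservative, confidence value.
print(welch_t_with_fallback([100, 102, 99, 101], [205]))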
I think I'd prefer to solve this by showing some kind of warning banner at the top of the page if the # of replicates is < 5 on either the original or compared revision, probably using something like this:

http://getbootstrap.com/components/#alerts

I think that should be enough to get the point across that the results should not necessarily be trusted. If we just omit information without context, it's not particularly helpful IMO.

Thoughts?
Flags: needinfo?(jmaher)
Flags: needinfo?(avihpit)
The idea of a banner is great.  How do we handle the case where we have 100 data points and maybe 6 or 10 of the tests are missing a job or two? We just can't trust those results, but we can trust the rest.

Maybe we could show a warning and hide the results that don't have enough data, and the warning could say "view all results, including ones with not enough data to make a comparison" or something like that.
Flags: needinfo?(jmaher)
The idea of a warning is good IMO. Like Joel mentioned, it should probably be per "line" rather than for the entire page, since (1) it's technically easier to do, and (2) it's possible that some tests are missing data while some tests are fine.

I don't think we should hide results with not enough data, but it should be clear what the status is.

How about we add a warning icon with some tooltip either near the runs value (which would be easiest to implement) or at the comparison or confidence columns?
Flags: needinfo?(avihpit)
Summary: perfherder compare view should not show results by default if we do not have enough data points → perfherder compare view should make it more clear when we don't have enough data to be confident in our comparison
See Also: → 1171694
Attached file PR
This does what :avih suggested: pop up a warning icon in the comparison view when we have fewer than 5 retriggers.

Screenshot in PR. Let me know what you think.
Assignee: nobody → wlachance
Attachment #8634506 - Flags: review?(tojonmz)
Attachment #8634506 - Flags: feedback?(jmaher)
Attachment #8634506 - Flags: feedback?(avihpit)
I would make it 6 data points, which is equal to the original + 5 retriggers.  Otherwise the screenshot looks good.
Attachment #8634506 - Flags: feedback?(jmaher) → feedback+
Comment on attachment 8634506 [details] [review]
PR

Commenting on the PR itself. Looks like we got a winner with the last suggestion of "5 Base/New runs recommended for increased confidence in comparison"
Attachment #8634506 - Flags: feedback?(avihpit) → feedback+
Comment on attachment 8634506 [details] [review]
PR

r+ tested locally on Nightly. I've provided a couple of observations in the PR fwiw to take or leave :)
Attachment #8634506 - Flags: review?(tojonmz) → review+
Commit pushed to master at https://github.com/mozilla/treeherder

https://github.com/mozilla/treeherder/commit/0f2a0b8f25474458459cef150b9cc1c4112e2a80
Bug 1164898 - Warn when more retriggers needed to be confident in comparison
Thanks for the great feedback all! This is now pushed.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Commit pushed to master at https://github.com/mozilla/treeherder

https://github.com/mozilla/treeherder/commit/f1055425bdbabe64d7bd428e81d034e354901239
Bug 1164898 - A few more fixups to display of confidence warning

* Don't display it if we don't have any runs at all for either original or new
* Some capitalization tweaks
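Putting the two conditions together (warn below the suggested run count, but not when a side has no runs at all), a hedged sketch of the row-level check; the constant, names, and threshold here are illustrative, not taken from the Treeherder source:

MIN_RUNS = 6  # original push plus 5 retriggers, per the discussion above

def needs_more_runs_warning(original_runs, new_runs):
    """Show the warning only when a side has some runs, but too few."""
    return (0 < original_runs < MIN_RUNS) or (0 < new_runs < MIN_RUNS)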