Closed
Bug 1164898
Opened 9 years ago
Closed 9 years ago
perfherder compare view should make it more clear when we don't have enough data to be confident in our comparison
Categories
(Tree Management :: Perfherder, defect)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: jmaher, Assigned: wlach)
References
Details
Attachments
(1 file)
This could be related to or part of bug 1160613.

When we have a single data point, it is not useful for helping us determine whether we have a regression or improvement. In fact we should have 6 data points minimum to be sure of a change. We could require this for the stddev and report N/A unless there are >=6 data points.

But if there are not enough data points, should we show the results at all? It would be nice to hide them with a url param or a checkbox in the UI: hideIncompleteResults=1

That name doesn't sound right, but whenever we don't have enough data it should be hidden from the view easily/by default. Likewise, finding some method to show the end user which results lack sufficient data would be good (along with a button to retrigger!!).
Comment 1•9 years ago
(In reply to Joel Maher (:jmaher) from comment #0)
> This could be related or part of bug 1160613.
>
> When we have a single data point, it is not useful for helping us determine
> if we have a regression or improvement. In fact we should have 6 data
> points minimum to ensure a change.

I don't think that's true. If, for instance, you have few data points for "old" and their stddev is low (say, < 5%), and you have a single "new" data point with a value of twice the average (or geomean) of the old values, then you can be pretty confident it's a regression (assuming "lower is better" for this test).

A confidence level can be defined in a way which takes into account the number of values in each group, and, in fact, the t-test does just that.

Unless you think otherwise, I suggest that the above approach be part of bug 1160613 (I actually have a patch ready; will PR today hopefully).
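The t-test approach described above can be sketched roughly as follows. This is an illustration, not the actual patch for bug 1160613; the function name and exact formula (Welch's t statistic) are assumptions here:

```python
from statistics import mean, stdev

def welch_t(old, new):
    """Welch's t statistic for two groups of replicate values.

    Confidence scales with both the spread and the number of values
    in each group: fewer replicates or noisier data shrink |t|.
    """
    # Standard error of the difference of means, allowing the two
    # groups to have different variances and sizes.
    se = (stdev(old) ** 2 / len(old) + stdev(new) ** 2 / len(new)) ** 0.5
    return abs(mean(new) - mean(old)) / se
```

With a tight "old" group and a clearly shifted "new" group, even three new replicates produce a large t value, while a small shift relative to the noise produces a small one.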
Reporter | ||
Comment 2•9 years ago
While I agree with you, how will we know what the historical noise is for a test with a single data point? Assuming we solve that, I believe we can make a great solution to notify the user whether we have enough data or need more.
Comment 3•9 years ago
The historical noise doesn't matter IMO. Each group of values has its own noise level which matters. Why would you care if a week ago this test had a high noise level but today (or at one or both of "old" and "new") the noise level is now low?
Reporter | ||
Comment 4•9 years ago
My comment specifically called out "a single data point". We have enough random hiccups in our data, and our noise levels change (as in the case of svg opacity) or on weekends. At the very least you need 2 data points to get a stddev, and I don't think your confidence can be high with just two.
Comment 5•9 years ago
A single data point would obviously lower the confidence level, and it's true that with a single point we can only guess a stddev value. However, my patch for bug 1160613 assumes a stddev of 15% for a single point, which is quite a bit higher than our average, so if it's mistaken, it's probably toward the less-confident side. It's obviously not perfect, since there's only so much information you can derive from a single point. Regardless, for a single data point it should produce a reasonably low confidence level, and we can further fine-tune the thresholds as we go.
Assignee | ||
Comment 6•9 years ago
I think I'd prefer to solve this by showing some kind of warning banner at the top of the page if the number of replicates is < 5 on either the original or compared revision, probably using something like this: http://getbootstrap.com/components/#alerts I think that should be enough to get the point across that the results should not necessarily be trusted. If we just omit information without context, it's not particularly helpful IMO. Thoughts?
Flags: needinfo?(jmaher)
Flags: needinfo?(avihpit)
Reporter | ||
Comment 7•9 years ago
The idea of a banner is great. How do we solve this when we have 100 data points and maybe 6 or 10 of the tests are missing a job or two? We just can't trust those results, but we can trust the rest. Maybe we could show a warning and hide the results that don't have enough data, and the warning could say "view all results, including ones without enough data to make a comparison" or something like that.
Flags: needinfo?(jmaher)
Comment 8•9 years ago
The idea of a warning is good IMO. Like Joel mentioned, it should probably be per "line" rather than for the entire page, since (1) it's technically easier to do, and (2) it's possible that some tests are missing data while other tests are fine. I don't think we should hide results with not enough data, but it should be clear what the status is. How about we add a warning icon with a tooltip, either near the runs value (which would be easiest to implement) or at the comparison or confidence columns?
Flags: needinfo?(avihpit)
Assignee | ||
Updated•9 years ago
Summary: perfherder compare view should not show results by default if we do not have enough data points → perfherder compare view should make it more clear when we don't have enough data to be confident in our comparison
Assignee | ||
Comment 9•9 years ago
This does what :avih suggested: pop up a warning icon in the comparison view when we have fewer than 5 retriggers. Screenshot in the PR. Let me know what you think.
Assignee: nobody → wlachance
Attachment #8634506 -
Flags: review?(tojonmz)
Attachment #8634506 -
Flags: feedback?(jmaher)
Attachment #8634506 -
Flags: feedback?(avihpit)
Reporter | ||
Comment 10•9 years ago
I would make it 6 data points, which equals the original + 5 retriggers. Otherwise the screenshot looks good.
Reporter | ||
Updated•9 years ago
Attachment #8634506 -
Flags: feedback?(jmaher) → feedback+
Comment 11•9 years ago
Comment on attachment 8634506 [details] [review] PR

Commenting on the PR itself. Looks like we got a winner with the last suggestion of "5 Base/New runs recommended for increased confidence in comparison".
Attachment #8634506 -
Flags: feedback?(avihpit) → feedback+
Comment 12•9 years ago
Comment on attachment 8634506 [details] [review] PR

r+ tested locally on Nightly. I've provided a couple of observations in the PR fwiw to take or leave :)
Attachment #8634506 -
Flags: review?(tojonmz) → review+
Comment 13•9 years ago
Commit pushed to master at https://github.com/mozilla/treeherder
https://github.com/mozilla/treeherder/commit/0f2a0b8f25474458459cef150b9cc1c4112e2a80
Bug 1164898 - Warn when more retriggers needed to be confident in comparison
Comment 14•9 years ago
Commit pushed to master at https://github.com/mozilla/treeherder
https://github.com/mozilla/treeherder/commit/81537e49a03269f8cfae512470d96cddc3858d56
Bug 1164898 - Update recommendation to 6 runs (was 5)
Assignee | ||
Comment 15•9 years ago
Thanks for the great feedback all! This is now pushed.
Status: NEW → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED
Comment 16•9 years ago
Commit pushed to master at https://github.com/mozilla/treeherder
https://github.com/mozilla/treeherder/commit/a1377948092f9ba9f64dfb85d02b2211bc7adc6e
Bug 1164898 - Fix spacing issue in comparison view when deploy happens
Comment 17•9 years ago
Commit pushed to master at https://github.com/mozilla/treeherder
https://github.com/mozilla/treeherder/commit/f1055425bdbabe64d7bd428e81d034e354901239
Bug 1164898 - A few more fixups to display of confidence warning
* Don't display it if we don't have any runs at all for either original or new
* Some capitalization tweaks
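The behavior that landed across comments #9, #10, and #17 amounts to a small predicate: recommend more runs when either side has some data but fewer than 6 runs, and stay quiet when a side has no runs at all. This sketch is illustrative; the names are hypothetical and the real logic lives in the linked commits:

```python
MIN_RECOMMENDED_RUNS = 6  # original + 5 retriggers, per comment #10

def needs_more_runs(base_runs, new_runs):
    """Whether to show the low-confidence warning icon for a row."""
    if base_runs == 0 or new_runs == 0:
        # No runs at all on one side: nothing to compare, so no
        # confidence warning either (the fixup in comment #17).
        return False
    return base_runs < MIN_RECOMMENDED_RUNS or new_runs < MIN_RECOMMENDED_RUNS
```

For example, a row with 3 base runs and 10 new runs warns, while a row with 6 runs on each side does not.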