Closed
Bug 1186228
Opened 10 years ago
Closed 10 years ago
Simplify and clarify display of information in perfherder compare view
Categories
(Tree Management :: Perfherder, defect)
Tree Management
Perfherder
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: wlach, Assigned: wlach)
Details
Attachments
(3 files)
I had a productive discussion with :bwinton yesterday on the perf comparison view. He had a number of recommendations that so far haven't been captured by existing bugs, mainly oriented around simplifying/clarifying the display of information:
1. Don't have seperate columns for base/new standard deviation (just include the raw value, with details in a tooltip)
2. Make it more obvious that you can hover over the means for additional information.
3. Separate out the # of runs from the confidence / means into their own column. Put alert on insufficient number of runs there. [ it's confusing that this information is mixed up with both the means and confidence ]
4. Don't show partial results (rows with only one set of data)
5. Make it optional to show parts where there hasn't been a significant amount of change
In general, the principle was "less is more". Most people looking at the comparison view just want an idea of whether things got better/worse, and by roughly how much. Certainly, there are use cases for diving in deeper to the data, but that sort of stuff is better encapsulated by tooltips and/or detail views.
Before I go too far down this rabbit hole, I thought I'd post a screenshot of my work in progress for people to look at and comment on.
Attachment #8636825 -
Flags: feedback?(jmaher)
Attachment #8636825 -
Flags: feedback?(bwinton)
Attachment #8636825 -
Flags: feedback?(avihpit)
Comment 1•10 years ago
|
||
Comment on attachment 8636825 [details]
work in progress
It's not as simple as i think it could be, but I like it. ;)
A couple of things to add:
1) A grey border around the red bar after the delta column. (We talked about this earlier, so I suspect it's upcoming.)
2) What about having a red "<", or green ">", or grey "≈" between the Base and New columns? So it would read like:
17.41 ± 0.23 < 24.27 ± 0.34 7.06 40.54% ==== 38.46 (high) 5/5
to indicate that the base was less than the new…
Attachment #8636825 -
Flags: feedback?(bwinton) → feedback+
Comment 2•10 years ago
|
||
Comment on attachment 8636825 [details]
work in progress
(In reply to William Lachance (:wlach) from comment #0)
> 1. Don't have seperate columns for base/new standard deviation (just include
> the raw value, with details in a tooltip)
Very good. I'd argue that the main stddev display should be in percentage rather than absolute value. This makes it much easier to say "this test appears much noisier than that test" (10% is more than 5% regardless of absolute values). We can display the absolute value in a tooltip.
> 2. Make it more obvious that you can hover over the means for additional
> information.
I like the light dotted underscore for that (as in - as is).
> 3. Separate out the # of runs from the confidence / means into their own
> column. Put alert on insufficient number of runs there. [ it's confusing
> that this information is mixed up with both the means and confidence ]
While the overall confidence level also depends on the number of runs, the number of runs is more enlightening for the stddev value IMO. I think the number of runs should be within the same column as the value and stddev (%). We could add the warning icon at this column too - per base/new.
> 4. Don't show partial results (rows with only one set of data)
> 5. Make it optional to show parts where there hasn't been a significant
> amount of change
I tend to disagree with that. IMO it's better to show whatever we've got. It's easy enough to deduce that the comparison is missing because data is (visibly) missing on one side.
> In general, the principle was "less is more". Most people looking at the
> comparison view just want an idea of whether things got better/worse, and by
> roughly how much. Certainly, there are use cases for diving in deeper to the
> data, but that sort of stuff is better encapsulated by tooltips and/or
> detail views.
Pretty much agreed. I'd also remove the delta absolute value from the default view. Percentage diff is enough for the main view IMO (especially when the base/new absolute values are visible), and put the absolute diff in a tooltip.
Another thing we could improve here is to make it clearer per test if higher-is-better or the other way around. While it's mostly obvious when the delta is colored in red/green, is impossible to tell when it isn't.
I'd suggest adding a tooltip to the delta column which says if higher or lower is better for this delta, and add a visible asterisk (doesn't require a tooltip) on the delta column for tests which are higher-is-better since these are the minority.
As another approach to make stuff easier on the eyes, we could use gray (lighter) color for values which are not immediately important, such as the stddev value, while keeping the important values (test name, base/new means, delta %, confidence) in darker color.
Overall, /me like.
One last thing though, it shows (graph) instead of (subtests) near the test name. Is that intentional? (IMO it should stay subtests).
Attachment #8636825 -
Flags: feedback?(avihpit) → feedback+
Comment 3•10 years ago
|
||
Comment on attachment 8636825 [details]
work in progress
this is great stuff! I would like to revisit the bar graphs we have. Also having more mouse over bits would be nice.
Do we need the number of data points all the time? Could we only display the numbers and warning sign iff there are < 6 data points for either side?
I think adding a value to the mouseover which shows the raw data would be useful:
runs (5): <3.14, 3.15, 3.13, 3.14, 3.14>
The idea here is that if we are missing data it would show like a todo list.
Attachment #8636825 -
Flags: feedback?(jmaher) → feedback+
Comment 4•10 years ago
|
||
(In reply to Joel Maher (:jmaher) from comment #3)
> Do we need the number of data points all the time? Could we only display
> the numbers and warning sign iff there are < 6 data points for either side?
I think it gives good value to display it right away without tooltip. We could use a lighter color for this number such that it doesn't clutter the view.
Comment 5•10 years ago
|
||
good point- make it grey by default, then turn it yellow/orange if we should collect more data.
| Assignee | ||
Comment 6•10 years ago
|
||
Attachment #8637266 -
Flags: feedback?(bwinton)
| Assignee | ||
Comment 7•10 years ago
|
||
(In reply to Avi Halachmi (:avih) from comment #2)
> Comment on attachment 8636825 [details]
> work in progress
>
> (In reply to William Lachance (:wlach) from comment #0)
> > 1. Don't have seperate columns for base/new standard deviation (just include
> > the raw value, with details in a tooltip)
>
> Very good. I'd argue that the main stddev display should be in percentage
> rather than absolute value. This makes it much easier to say "this test
> appears much noisier than that test" (10% is more than 5% regardless of
> absolute values). We can display the absolute value in a tooltip.
Hmm, just tried this and it seems to make comparing the two values more difficult than before in many cases.
Consider e.g. "17.41 ± 0.23, 24.47 ± 0.34". If we expressed that in percentages it would like: "17.41 ± 1.30%, 24.47 ± 1.40%". Much less clear IMO.
> > 3. Separate out the # of runs from the confidence / means into their own
> > column. Put alert on insufficient number of runs there. [ it's confusing
> > that this information is mixed up with both the means and confidence ]
>
> While the overall confidence level also depends on the number of runs, the
> number of runs is more enlightening for the stddev value IMO. I think the
> number of runs should be within the same column as the value and stddev (%).
> We could add the warning icon at this column too - per base/new.
So I think this interferes with the skimmability of the results. :bwinton argued (and I tend to agree) that most developers (who are not perf domain experts) are just going to be confused by number of runs / standard deviation. I do think we need to include the information, but if it's off to the side it's easier to disregard.
> > 4. Don't show partial results (rows with only one set of data)
> > 5. Make it optional to show parts where there hasn't been a significant
> > amount of change
>
> I tend to disagree with that. IMO it's better to show whatever we've got.
> It's easy enough to deduce that the comparison is missing because data is
> (visibly) missing on one side.
We have a warning banner about missing tests up top. We could include this information there. Obviously we can have a toggle to show the missing data.
> > In general, the principle was "less is more". Most people looking at the
> > comparison view just want an idea of whether things got better/worse, and by
> > roughly how much. Certainly, there are use cases for diving in deeper to the
> > data, but that sort of stuff is better encapsulated by tooltips and/or
> > detail views.
>
> Pretty much agreed. I'd also remove the delta absolute value from the
> default view. Percentage diff is enough for the main view IMO (especially
> when the base/new absolute values are visible), and put the absolute diff in
> a tooltip.
Done.
> Another thing we could improve here is to make it clearer per test if
> higher-is-better or the other way around. While it's mostly obvious when the
> delta is colored in red/green, is impossible to tell when it isn't.
>
> I'd suggest adding a tooltip to the delta column which says if higher or
> lower is better for this delta, and add a visible asterisk (doesn't require
> a tooltip) on the delta column for tests which are higher-is-better since
> these are the minority.
Seems sensible. I'll file a followup bug for this.
> As another approach to make stuff easier on the eyes, we could use gray
> (lighter) color for values which are not immediately important, such as the
> stddev value, while keeping the important values (test name, base/new means,
> delta %, confidence) in darker color.
I'm concerned that this might make the display a bit overly busy. Let's see how we do with the current design and we can tweak stuff further if necessary.
> One last thing though, it shows (graph) instead of (subtests) near the test
> name. Is that intentional? (IMO it should stay subtests).
The screenshot was of a subtest comparison view, which doesn't have subtests itself. :) The graph option for subtests is borderline useless (it just displays the highlighted datapoints on a graph, which is useless for try builds), I will probably just take it out.
Comment 8•10 years ago
|
||
(In reply to William Lachance (:wlach) from comment #7)
> Hmm, just tried this and it seems to make comparing the two values more
> difficult than before in many cases.
>
> Consider e.g. "17.41 ± 0.23, 24.47 ± 0.34". If we expressed that in
> percentages it would like: "17.41 ± 1.30%, 24.47 ± 1.40%". Much less clear
> IMO.
TBH for me the the percentage doesn't make it less clear and personally the vast majority of times I looked at the stddev value I always looked at the percentage one. You see +- 1% you know the test is stable without thinking much, but if you see +- 80% (and we do have those) then it's clear something is fishy.
Another big benefit of percentage is to be able to compare noise level between two tests which potentially have meaningfully different absolute values, e.g. 17 +- 7 and 350 +- 80 <-- which is noisier? obvious as percentage but harder to grasp with absolute values without giving it some consideration.
> So I think this interferes with the skimmability of the results. :bwinton
> argued (and I tend to agree) that most developers (who are not perf domain
> experts) are just going to be confused by number of runs / standard
> deviation. I do think we need to include the information, but if it's off to
> the side it's easier to disregard.
Well, the mean and stddev are over the number of runs. If you move that number aside you _might_ make it less cluttered around the stddev value but more crowded around the comparison bottom line, AND it loses its context.
I can live with your suggestion, but I still think it'd be better near the stddev or mean value.
> > > 4. Don't show partial results (rows with only one set of data)
> > > 5. Make it optional to show parts where there hasn't been a significant
> > > amount of change
> >
> > I tend to disagree with that. IMO it's better to show whatever we've got.
> > It's easy enough to deduce that the comparison is missing because data is
> > (visibly) missing on one side.
>
> We have a warning banner about missing tests up top. We could include this
> information there. Obviously we can have a toggle to show the missing data.
Maybe I didn't get this one right. The scenario you describe is that a test X on platform Y has results only for base (or only new), right? In that case, I think we should display what we have and possibly a warning where we don't have values. Did you want to completely not display those lines? I fail to see how it helps TBH.
Another interpretation for "one set of data" is that we have only one run (for either/both old/new), but I believe this should be solved with the runs warning, and still, displaying values which we do have is better than not displaying them IMO.
> > As another approach to make stuff easier on the eyes, we could use gray
> > (lighter) color for values which are not immediately important, such as the
> > stddev value, while keeping the important values (test name, base/new means,
> > delta %, confidence) in darker color.
>
> I'm concerned that this might make the display a bit overly busy. Let's see
> how we do with the current design and we can tweak stuff further if
> necessary.
None of this is necessary TBH. All the stuff on this bug is eye candy (useful to a degree, but not _necessaey_).
Comment 9•10 years ago
|
||
Comment on attachment 8637266 [details]
Example of putting comparison directly in chart
Yeah, looks good. I think adding the red/green/grey text-color will make it a little easier to parse, but that's a minor enhancement. (Also, it should be centered between the two numbers. It looks like it's a little too far right. :)
Attachment #8637266 -
Flags: feedback?(bwinton) → feedback+
| Assignee | ||
Comment 10•10 years ago
|
||
Belated update...
(In reply to Avi Halachmi (:avih) from comment #8)
> (In reply to William Lachance (:wlach) from comment #7)
> > Hmm, just tried this and it seems to make comparing the two values more
> > difficult than before in many cases.
> >
> > Consider e.g. "17.41 ± 0.23, 24.47 ± 0.34". If we expressed that in
> > percentages it would like: "17.41 ± 1.30%, 24.47 ± 1.40%". Much less clear
> > IMO.
>
> TBH for me the the percentage doesn't make it less clear and personally the
> vast majority of times I looked at the stddev value I always looked at the
> percentage one. You see +- 1% you know the test is stable without thinking
> much, but if you see +- 80% (and we do have those) then it's clear something
> is fishy.
>
> Another big benefit of percentage is to be able to compare noise level
> between two tests which potentially have meaningfully different absolute
> values, e.g. 17 +- 7 and 350 +- 80 <-- which is noisier? obvious as
> percentage but harder to grasp with absolute values without giving it some
> consideration.
Yeah, I was convinced by this argument. It's displaying percentages now (you can get the absolute difference via the tooltip)
> > > I tend to disagree with that. IMO it's better to show whatever we've got.
> > > It's easy enough to deduce that the comparison is missing because data is
> > > (visibly) missing on one side.
> >
> > We have a warning banner about missing tests up top. We could include this
> > information there. Obviously we can have a toggle to show the missing data.
>
> Maybe I didn't get this one right. The scenario you describe is that a test
> X on platform Y has results only for base (or only new), right? In that
> case, I think we should display what we have and possibly a warning where we
> don't have values. Did you want to completely not display those lines? I
> fail to see how it helps TBH.
>
> Another interpretation for "one set of data" is that we have only one run
> (for either/both old/new), but I believe this should be solved with the runs
> warning, and still, displaying values which we do have is better than not
> displaying them IMO.
This is being discussed in bug 1187000 (via a filter menu)
| Assignee | ||
Comment 11•10 years ago
|
||
Ok, finished work on this and put up on stage for testing. Here's an example (assuming no one overwrites me on stage):
https://treeherder.allizom.org/perf.html#/compare?originalProject=try&originalRevision=027bf5a1e5eb&newProject=try&newRevision=b015adea7c85
More improvements are forthcoming (especially wrt the bar graphs which seem to be a frequent topic of complaint), but I think this organization of information is clearer than before.
Attachment #8640638 -
Flags: review?(jmaher)
Attachment #8640638 -
Flags: feedback?(bwinton)
Attachment #8640638 -
Flags: feedback?(avihpit)
Comment 12•10 years ago
|
||
Comment on attachment 8640638 [details] [review]
PR
(In reply to William Lachance (:wlach) from comment #10)
> https://treeherder.allizom.org/perf.html#/compare?originalProject=try&originalRevision=027bf5a1e5eb&newProject=try&newRevision=b015adea7c85
...
> This is being discussed in bug 1187000 (via a filter menu)
Not sure I get it. Bug 1187000 is about missing e10s view (probably rows), while this bug is about the column and their data in the default view. I _think_ they're mostly tangential.
To the point of this bug as far as I can tell:
- It _seems_ to me that rows with partial data are hidden, e.g. "v8_7 opt", "kraken opt", "dromaeo* opt" _seems_ to me to be missing XP results - I can't really tell if it's completely missing or only partially missing, since it's just not there, while it is there for the other tests. That's basically my request from earlier to display all the data we have even if in some ways it's lacking.
- On some rows it displays "incomplete" (e.g. on "cart opt") even when the confidence level is very high. I believe we mentioned in the past that the warning should be there only for non-high confidence levels, since "high" confidence implies that actually we do have enough data. An exception to this rule would maybe be for a single data point, but I'm not even sure about that TBH.
Other than those two points, looks good to me.
Attachment #8640638 -
Flags: feedback?(avihpit) → feedback-
Comment 13•10 years ago
|
||
Comment on attachment 8640638 [details] [review]
PR
I commented on the pull request with some suggestions, but in general this looks better to me.
Attachment #8640638 -
Flags: feedback?(bwinton) → feedback+
| Assignee | ||
Comment 14•10 years ago
|
||
Comment on attachment 8640638 [details] [review]
PR
Discussed things in person with Avi: some things mentioned above were misunderstandings, others will be addressed in future bugs, and yet more are being addressed right here.
I updated the PR with the following changes:
* Only show a warning about # of runs if we have less than 2 (on either side) OR confidence is not high. The "we need 6 runs to be really really confident" rule may be useful for sheriffs, but it is overkill for many other cases: Avi's t-test algorithm is reliable enough for most purposes. Sheriffs can continue doing their own thing, of course (the UI makes it obvious when there are less than 6 runs).
* Made it slightly more obvious when there are no results for one side of the comparison by printing "No results" in muted text for this case.
* Updated UI styling per :bwinton's suggestions in the PR.
Avi, could you have a look and let me know what you think? Once again the PR is on stage:
https://treeherder.allizom.org/perf.html#/compare?originalProject=try&originalRevision=027bf5a1e5eb&newProject=try&newRevision=b015adea7c85
Attachment #8640638 -
Flags: feedback- → feedback?(avihpit)
Comment 15•10 years ago
|
||
Comment on attachment 8640638 [details] [review]
PR
LGTM. Thanks! :)
Attachment #8640638 -
Flags: feedback?(avihpit) → feedback+
Comment 16•10 years ago
|
||
Comment on attachment 8640638 [details] [review]
PR
thanks, no issues in the PR comments either.
Attachment #8640638 -
Flags: review?(jmaher) → review+
Comment 17•10 years ago
|
||
Commit pushed to master at https://github.com/mozilla/treeherder
https://github.com/mozilla/treeherder/commit/903a7805a6055b793f427dfc8be9d1c51c6ae085
Bug 1186228 - Simplify display of information in compareperf
| Assignee | ||
Comment 18•10 years ago
|
||
Merged the PR. Resolving this, let's handle further feedback/work in separate bugs.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
You need to log in
before you can comment on or make changes to this bug.
Description
•