Bug 1481821 (Open) · Opened last year · Updated 4 days ago
Perfherder compare view is very limited when developers are investigating a regression or trying to make their test relevant or stable
Our Perfherder compare view shows a lot of related data, which is useful in many cases, but not all of them. When a regression shows up or a test is noisy, a test owner will often need to look at the data in more detail, sometimes down to the raw replicates. One idea from the DevTools team is to look at different calculations on the data (not always the median value). There are probably dozens of ways to look at the data that would help us understand what a test is doing.
:ochameau can you provide a link/screenshot of some of your tools for analyzing the subtests/replicates of the damp test?
Here is a link: https://firefox-dev.tools/performance-dashboard/inspect/index.html?base=7337cfb80e8b285aab95789b04daab6825d765f3&new=03e1f8759a9d7cfe3de45a450d4513917ac866db&platform=linux64-opt

(Note that it may break at any time, as it's a work in progress that I may easily change.) Here is a screenshot in case it breaks: https://screenshots.firefox.com/UP2OGzt5LlpkRHhT/firefox-dev.tools

My goal here was to tweak both the dashboard *and* the test harness/scripts so that the dashboard reports a "computed" difference close to 0% for all subtests, and so that all the warnings stay off when comparing two distinct try pushes against the same m-c changeset. Another test was to see the dashboard report a 1% difference when introducing a fake 1% regression.

In this prototype I experiment with:
* multiple data sets: with/without replicates, and also with some filtering of data points outside the q1 <=> q3 range
* multiple maths: mean, median, confidence interval

The first conclusion of this experiment was that boxplots were more obvious to everyone I demoed this tool to, i.e. it is easier to draw a conclusion on a subtest by comparing the two boxplots than by reading statistical numbers.
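For reference, here is a minimal Python sketch of the two ingredients mentioned above: filtering replicates to the q1..q3 range, and computing mean/median plus a confidence interval. The function names are mine, and the normal-approximation 95% interval is one common, simple choice; the dashboard's actual formulas may differ:

```python
import statistics
from math import sqrt

def iqr_filter(values):
    """Keep only data points between the first and third quartiles (q1..q3),
    in the spirit of the filtering experiment described above."""
    q1, _, q3 = statistics.quantiles(sorted(values), n=4)
    return [v for v in values if q1 <= v <= q3]

def summarize(values):
    """Return the mean, the median, and a normal-approximation 95% confidence
    interval for the mean (an assumption here, not Perfherder's exact math)."""
    mean = statistics.mean(values)
    median = statistics.median(values)
    half_width = 1.96 * statistics.stdev(values) / sqrt(len(values))
    return mean, median, (mean - half_width, mean + half_width)

# Example: one outlier replicate (50) gets dropped by the q1..q3 filter.
replicates = [10, 12, 11, 13, 50, 9, 11, 12]
filtered = iqr_filter(replicates)
mean, median, ci = summarize(filtered)
```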
In retrospect I feel like the compare view is the weakest part of perfherder, but it's also one of the hardest things to get right. I think one thing I would definitely change would be to focus more on displaying the *distribution* of results, and less on derived measures like means/medians/standard deviations/confidence intervals (which can be pretty misleading and hard for most people to interpret correctly). To that end, I quite like the box plot / distribution views in that screenshot.
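To make the "show the distribution, not just derived measures" idea concrete, here is a small sketch of the numbers a box plot is built from, plus a rough box-overlap check. Both helpers and the overlap heuristic are hypothetical illustrations, not anything Perfherder implements (and whisker conventions vary; min/max is used here for simplicity):

```python
import statistics

def five_number_summary(values):
    """The numbers behind a basic box plot: min, q1, median, q3, max."""
    q1, med, q3 = statistics.quantiles(sorted(values), n=4)
    return min(values), q1, med, q3, max(values)

def boxes_overlap(base, new):
    """Rough heuristic: do the interquartile boxes of two pushes overlap?
    Non-overlapping boxes are an easy visual cue for a real shift."""
    _, b_q1, _, b_q3, _ = five_number_summary(base)
    _, n_q1, _, n_q3, _ = five_number_summary(new)
    return b_q1 <= n_q3 and n_q1 <= b_q3

# Example: a clearly shifted "new" push produces a non-overlapping box.
base = [10, 11, 12, 11, 10, 12]
new = [20, 21, 22, 21, 20, 22]
```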
Flags: needinfo?(jmaher) → needinfo?(igoldan)
Type: defect → task