Closed
Bug 1190664
Opened 10 years ago
Closed 10 years ago
Geometric mean doesn't work
Categories
(Tree Management :: Perfherder, defect)
Tracking
(Not tracked)
RESOLVED
INCOMPLETE
People
(Reporter: glandium, Unassigned)
References
Details
Attachments
(1 file, 1 obsolete file)
62.19 KB,
text/html
I'm sorry for the provocative summary, but perfherder is getting on my nerves.
For example look at this:
https://treeherder.mozilla.org/perf.html#/compare?originalProject=try&originalRevision=fa358692aeaf&newProject=try&newRevision=54483e48d958
look under dromaeo_dom opt, and see how windows 8-64 is red and linux32 is green, at -1.51% and 2.43% respectively.
Now click on subtests.
In the windows 8-64 case, there is only one test going in the direction indicated by the global score difference, and it's not even outside the stddev margin. So here we have 3 uncertain positive results and 1 uncertain negative result, and the global result is a red regression.
In the linux32 case, we have 2 uncertain negative results, 2 uncertain positive results, and the global result is a green improvement.
My interpretation of the subtests page is that there's nothing to see here. Yet perfherder tells me there is. Which means I now need to go through *all* the other subtests to see whether there is actually something worrying or not.
Why not normalize results to aggregate them?
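One way to read that suggestion (a hypothetical sketch, not anything from Perfherder's actual code): express each subtest as a ratio against its baseline before aggregating, so subtests on wildly different scales contribute equally to the summary.

```python
# Hypothetical sketch of "normalize results to aggregate them":
# express each subtest as a ratio against its baseline, then average
# the ratios, so large-valued subtests don't dominate the summary.
def normalized_aggregate(old, new):
    ratios = [n / o for o, n in zip(old, new)]
    return sum(ratios) / len(ratios) - 1  # average relative change

# Two subtests on very different scales, both improving by ~10%:
# normalized_aggregate([100.0, 75000.0], [110.0, 82500.0]) -> ~0.10
```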
Comment 1•10 years ago
thanks for bringing this up- there are a few issues with perfherder summarization, all related to tests which have unique calculations.
1) dromaeo_dom: this has to summarize the subtests differently; then we can use the existing geomean on the resulting subtests
2) kraken: this needs to aggregate the subtests differently
3) canvasmark: this needs to aggregate the subtests differently (same formula as kraken)
4) v8_7: this needs to aggregate the subtests differently
some related bugs:
* bug 1184966
* bug 1186076 (resolved now)
The key here is to ingest the new summaries so that we can use the proper formula- it is realistic to have this resolved this week.
Comment 2•10 years ago
So I think there are actually two things here:
1. As Joel said, we should be summarizing the dromaeo results differently. This may or may not fix this particular problem.
2. We should probably not be saying that something is either good or bad when the "regression" / "improvement" is significantly less than the standard deviation. I know for Dromaeo in particular we have said that regressions/improvements less than 10% are not really meaningful: we should probably capture this somehow in the compare UI.
Avi, Joel, do you have any thoughts on (2)? My strawman proposal is not to highlight regressions/improvements that are less than the standard deviation.
Flags: needinfo?(jmaher)
Flags: needinfo?(avihpit)
Comment 3•10 years ago
(In reply to William Lachance (:wlach) from comment #2)
> 2. We should probably not be saying that something is either good or bad
> when the "regression" / "improvement" is significantly less than the
> standard deviation.
>
> Avi, Joel, do you have any thoughts on (2) ? My strawman proposal is not to
> highlight regressions/improvements that are less than the standard deviation.
The t-test calculation ("confidence level") already takes that into account.
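For reference, here is a rough sketch of that kind of t-value ("confidence level") comparison, assuming a Welch-style two-sample statistic over per-run means; the exact formula Perfherder uses is an assumption here, not taken from its source.

```python
import math

# Rough sketch of a two-sample t-value ("confidence level"), assuming a
# Welch-style statistic over per-run means. The exact Perfherder formula
# may differ; this is illustrative only.
def t_value(mean_old, std_old, n_old, mean_new, std_new, n_new):
    # standard error of the difference between the two sample means
    se = math.sqrt(std_old ** 2 / n_old + std_new ** 2 / n_new)
    return abs(mean_new - mean_old) / se

# e.g. a shift of two standard deviations across 4+4 runs scores well
# above the arbitrary "high confidence" threshold of 1.0 mentioned here
```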
However, maybe we're not summarizing the sub-tests into the "run" value correctly? In general, we should geometric-mean the subtests into a single value, however, looking at the sub-tests of dromaeo_dom, we see these deltas:
Win8-64: 0.4%, -0.5%, 0.75%, 0.78%, with "high" confidence levels (between 1.4 to 2.8, and we consider "high" confidence arbitrarily as above 1.0), while the overall delta is displayed as -1.5%.
Linux32: -0.4%, -1.1%, 0.7%, 0.4%, with confidence between 0.5 to 2.5, and the overall delta as +2.4%.
So to me it looks like the delta % of the subtests doesn't get good representation at the overall delta.
Joel, what's our exact calculation from all the subtests of all the runs into a single value at the overall view?
E.g. if we have:
run1.subtest1
run1.subtest2
run1.subtest3
run2.subtest1
run2.subtest2
run2.subtest3
I'd expect the overall value to be:
plain_average(
geomean(run1.subtest1, run1.subtest2, run1.subtest3),
geomean(run2.subtest1, run2.subtest2, run2.subtest3)
)
And displayed at the subtests view as:
subtest1: plain_average(run1.subtest1, run2.subtest1)
subtest2: plain_average(run1.subtest2, run2.subtest2)
subtest3: plain_average(run1.subtest3, run2.subtest3)
I.e. values of the same "kind" (from the same subtest or runs of the same test/platform) are averaged plainly, while values of different "kinds" (e.g. different subtests) are aggregated using geometric mean.
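The expected aggregation described above can be sketched as follows (the run and subtest values are illustrative only, loosely modeled on the dromaeo_dom numbers quoted later in this bug):

```python
import math

# Sketch of the proposed aggregation: plain-average values of the same
# kind (same subtest across runs), geomean values of different kinds
# (different subtests within one run). Values are illustrative.
def geomean(values):
    return math.exp(sum(math.log(v) for v in values) / len(values))

def plain_average(values):
    return sum(values) / len(values)

runs = [
    [411.3, 446.6, 76405.8],  # run1: subtest1..subtest3
    [414.5, 444.3, 76980.9],  # run2: subtest1..subtest3
]

# Overall value: plain average of the per-run geomeans
overall = plain_average([geomean(run) for run in runs])

# Subtests view: plain average of each subtest across runs
per_subtest = [plain_average(col) for col in zip(*runs)]
```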
Flags: needinfo?(avihpit)
Comment 4•10 years ago
yes, we should only show things that we find useful, i.e. dromaeo >=10%. How to document that and where to put these rules, that is something to figure out. I would suggest maybe a thresholds/rules file in perfherder that compare view uses as well as the alert generation code :)
regarding change vs stddev- we should flag it in the UI as either a lower confidence or some other indicator. Maybe we require more data for those situations (10 points vs 6), or we are explicit about not wasting time on those.
Avi, you asked some good questions, not all tests are equal:
* v8_7 has an internal summarization of the subtests
* kraken and canvasmark summarize yet another way
* dromaeo subtests actually have sub-subtests, so the subtests need a special calculation, but dromaeo proper uses a geometric mean for the resulting suite metric.
* the rest use a geometric mean, although some tests like ts_paint, tresize, sessionrestore* have only one data point, so a geometric mean is not as useful.
Overall the subtests for the non benchmark tests follow:
* drop first [0,1,2,5] data points
* take the median value of the remaining points
So it isn't perfect; we can change what we feel needs to be changed. Right now we are building the system to work more smoothly.
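The per-subtest summarization above can be sketched minimally as follows (the drop count of 1 and the use of `statistics.median` are illustrative; the real harness code may differ):

```python
import statistics

# Minimal sketch of the per-subtest summarization: discard the first
# few (warm-up) replicates, then take the median of what remains.
# The drop count varies per test ([0, 1, 2, 5] above); 1 here is
# purely illustrative.
def summarize_subtest(replicates, drop=1):
    remaining = replicates[drop:]
    return statistics.median(remaining)

# summarize_subtest([100, 50, 52, 54, 51]) -> 51.5 (warm-up 100 dropped)
```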
Flags: needinfo?(jmaher)
Comment 5•10 years ago
(In reply to Joel Maher (:jmaher) from comment #4)
> Avi, you asked some good questions, not all tests are equal:
> * v8_7 has an internal summarization of the subtests
> ...
So, that still doesn't explain (to me) the apparent difference between the subtests' deltas in % and the overall delta in %. Something seems inconsistent to me between those views.
Even if there are such special calculations, I'd _think_ they shouldn't create an appearance of inconsistency.
Could you please list how the overall view and the subtests view calculate their values for:
- "normal" tests (I'd expect those to work as I described at comment 3. Do they?)
- "special" tests (per case).
Comment 6•10 years ago
the only subtest special case is dromaeo. I really cannot figure out what pieces of this you don't understand. Please show this with examples including data so I can help find the confusion on my end or your end.
Comment 7•10 years ago
(In reply to Joel Maher (:jmaher) from comment #6)
> I really cannot figure out what
> pieces of this you don't understand. Please show this with examples
> including data so I can help find the confusion on my end or your end.
I don't understand why on both the following cases all the subtests deltas are way below the overall delta. I expect the overall delta to be sort of the average of all the subtests data:
(In reply to Avi Halachmi (:avih) from comment #3)
> Win8-64: 0.4%, -0.5%, 0.75%, 0.78%, with "high" confidence levels (between
> 1.4 to 2.8, and we consider "high" confidence arbitrarily as above 1.0),
> while the overall delta is displayed as -1.5%.
>
> Linux32: -0.4%, -1.1%, 0.7%, 0.4%, with confidence between 0.5 to 2.5, and
> the overall delta as +2.4%.
>
> So to me it looks like the delta % of the subtests doesn't get good
> representation at the overall delta.
(In reply to Joel Maher (:jmaher) from comment #6)
> the only subtest special case is dromaeo.
From the following, it seems to me like you were saying that v8_7, kraken, canvasmark and dromaeo are special in some way.
(In reply to Joel Maher (:jmaher) from comment #4)
> Avi, you asked some good questions, not all tests are equal:
> * v8_7 has an internal summarization of the subtests
> * kraken and canvasmark summarize yet another way
> * dromaeo subtests actually have sub-subtests so the subtests needs a
> special calculation, but dromaeo proper uses a geometric mean for the
> resulting suite metric.
> * the rest uses a geometric mean, although some tests like ts_paint,
My main point is that it _seems_ to me that the overall delta % doesn't represent well the subtests delta %, and I want to understand why, and for that I'm asking what are the calculations from the subtests results to the overall delta, both in the "normal" case and in any kinds of "special" cases.
Comment 8•10 years ago
my confusion is around where this data is. Are we talking about what talos records, what it reports, what graph server shows, or what compare view shows?
I would like to see the raw data from a talos run and what we see in compare view (I assume that is where the confusion is). It does appear that we are not communicating well in the bug. I am viewing this from raw talos data -> subtest result -> suite result.
Comment 9•10 years ago
The data is what comment 0 refers to: the overall view vs. the subtests view of dromaeo_dom-opt on win8-64 and linux32. That's the main example which shows the issue for now.
Comment 10•10 years ago
I would prefer to talk about tests which are not internal benchmarks with known-incorrect summaries. These would be ones like: tart, cart, tp5, svgx, tscrollx, tp5o_scroll. When we have the correct reporting for dromaeo I would be happier to talk about it.
Comment 11•10 years ago
But the bug was opened with these specific cases as examples.
If you think it's an invalid bug, please say why. Otherwise, let's try to understand why the overall delta in these examples appears not to reflect the subtests' deltas correctly. That shouldn't be related to the reliability of the tests - it's just plain numbers which seem not to add up.
Comment 12•10 years ago
I believe this is invalid if it is only related to dromaeo. I would be happy to discuss details of results when we have the proper dromaeo score posted. Do we need to validate other geometric means?
Comment 13•10 years ago
From IRC:
<jmaher> avih: dromaeo, kraken, v8, canvasmark are all incorrect- we are close to having them fixed, but until then we shouldn't waste a lot of time on them
Looking at bug 1184966, which was referenced in comment 1 as the relevant bug: it says mainly that dromaeo, canvasmark, v8 and kraken have their own way of summarizing the subtests into a single value, but it doesn't say how this ends up as an issue, and where.
"our" method of combining subtests into a single overall value is the geometric mean (where more than one subtest exists). With a geomean, the overall delta % should roughly come out as the average of the subtests' delta %. These 4 tests might be using a different calculation, or maybe giving special weights to some subtests, etc.
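That rough equivalence can be checked numerically: for small per-subtest changes, the delta % of the geometric mean lands very close to the plain average of the per-subtest delta percentages. The deltas below mirror the Win8-64 numbers quoted in comment 3; the baseline values are illustrative.

```python
import math

# Numeric check: for small per-subtest changes, the delta of the
# geomean is approximately the plain average of the per-subtest
# deltas. Deltas mirror comment 3; baselines are illustrative.
def geomean(values):
    return math.exp(sum(math.log(v) for v in values) / len(values))

old = [411.3, 446.6, 76405.8, 19067.7]
deltas = [0.0078, -0.0051, 0.0075, 0.0039]  # +0.78%, -0.51%, +0.75%, +0.39%
new = [o * (1 + d) for o, d in zip(old, deltas)]

overall_delta = geomean(new) / geomean(old) - 1  # ~0.35%
average_delta = sum(deltas) / len(deltas)        # ~0.35%
```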
Combining the data, I'm guessing the following:
1. The subtests view is raw values from the subtests.
2. The overview page shows the special "internal" summarization of these tests as run values, and NOT a geomean of the subtests.
Therefore it appears inconsistent; however, if we believe their summarization is good, then we should believe the overall delta %.
However, this still doesn't explain how none of the subtests has a delta % bigger than the overall delta %. I believe we still need to answer that.
Comment 14•10 years ago
Summary of how I understand it so far:
1. At the time of filing this bug, dromaeo overview numbers should have been calculated using geomean and not using any "special" method, therefore the reason given earlier at this bug (that dromaeo uses a "special" calculation) is not likely to have affected this bug. This has changed in bug 1184966, and now the overview should have some "special" calculation.
2. Since bug 1184966 is now fixed, it might be hard to reproduce the same issue again with dromaeo, since it doesn't use geomean between the subtests anymore.
3. Joel mentioned on IRC that specifically with dromaeo "it was a geomean of the raw replicates for everything iirc" [instead of geomean of all the subtests of a run, and then averaging the runs for the overview - avih], so this could be yet another issue, but more details are missing on this one and AFAIK there's no bug filed for it.
4. I did want to follow all the data manually and try to find where it could have gone wrong, but I don't think it's possible at the compare-perf page since it sorts the values of the runs/subtests, so it's impossible to know the correct way to aggregate it manually.
- Maybe we should do something to make it easier to track the raw data instead of users effectively being forced to trust it relatively blindly.
I filed bug 1194333 to track all talos/treeherder/perferder data correctness issues.
As for this bug, not sure what to do about it since the compare view should now yield different results, which makes it both harder and potentially useless to try and figure out what went wrong here.
If someone is able to help me getting the raw data for the views mentioned in comment 0, I'm still willing to try and figure out where it went wrong, and I think it could still be valuable, but for now, I don't know how to do that.
What do you guys think?
Comment 15•10 years ago
(In reply to Avi Halachmi (:avih) from comment #14)
> If someone is able to help me getting the raw data for the views mentioned
> in comment 0, I'm still willing to try and figure out where it went wrong,
> and I think it could still be valuable, but for now, I don't know how to do
> that.
>
> What do you guys think?
Thanks for volunteering to look at this.
Here's an ipython notebook with the raw json numbers for dromaeo dom win8-64 included, indexed by result set id (= push) and job id (= individual talos job).
You can either analyze it using ipython notebook or import it into your tool of choice. :)
Updated•10 years ago
Attachment #8647662 -
Attachment mime type: text/plain → application/json
Comment 16•10 years ago
(In reply to William Lachance (:wlach) from comment #15)
> Created attachment 8647662 [details]
> Reproducing bug 1190664 dromaeo dom windows8.ipynb
>
> (In reply to Avi Halachmi (:avih) from comment #14)
> > If someone is able to help me getting the raw data for the views mentioned
> > in comment 0, I'm still willing to try and figure out where it went wrong,
> > and I think it could still be valuable, but for now, I don't know how to do
> > that.
> >
> > What do you guys think?
>
> Thanks for volunteering to look at this.
>
> Here's an ipython notebook with the raw json numbers for dromaeo dom win8-64
> included, indexed by result set id (= push) and job id (= individual talos
> job).
>
> You can either analyze it using ipython notebook or import it into your tool
> of choice. :)
Hmm, nbviewer doesn't seem to work with bz attachments, here's a link to it via my people account which does:
http://nbviewer.ipython.org/url/people.mozilla.org/~wlachance/Reproducing%20bug%201190664%20dromaeo%20dom%20windows8.ipynb
Comment 17•10 years ago
Comment on attachment 8647662 [details]
Reproducing bug 1190664 dromaeo dom windows8.ipynb
Oops, this one had a bug in it too. Just ignore it and use the one on my people account.
Attachment #8647662 -
Attachment is obsolete: true
Comment 18•10 years ago
OK, so I took the data which wlach gave me, for dromaeo_dom opt x64, and wrote a program to manually calculate the regressions from it (run it by clicking the attachment - the page contains the raw data on which it operates).
As far as I can tell the subtests view is correct, while the overview page is incorrect.
Specifically, the value of each run in the overview page doesn't seem to be the average of the geomeans per run.
Here's how the output looks from this program:
> Sub-tests view - average each subtest over all runs:
> ----------------------------------------------------
> sub-test 0: old: 411.3 +- 5.5 new: 414.5 +- 3.8 delta: 0.78 %
> sub-test 1: old: 446.6 +- 5.2 new: 444.3 +- 5.1 delta: -0.51 %
> sub-test 2: old: 76405.8 +- 913.2 new: 76980.9 +- 1424.0 delta: 0.75 %
> sub-test 3: old: 19067.7 +- 79.6 new: 19142.8 +- 87.2 delta: 0.39 %
>
> geomean of (averages of) subtests over all runs: old: 4044.6 new: 4058.9 delta: 0.35 %
>
>
> Overview - geomeans all the subtests per run:
> ---------------------------------------------
> old geomeans: 4018.5042963333626, 4030.496158300653, 4065.1916525024917, 4063.7774073699106, 4044.2157544639635, 4068.8796673796965, 4039.679244938808, 4073.61257833807, 4060.795046125907, 4010.6109566095715, 4026.9081170231198, 4078.644580102007, 4036.878933079546, 4045.780541505174, 4000.702238023214, 4028.5369518769253, 4084.2661094908476, 4024.9919638681026, 4055.184074289028, 4031.226821247839
> new geomeans: 4023.7734479425403, 4060.229790243121, 4058.4772632560043, 4000.157355006849, 4072.830026841685, 4051.432223589525, 4078.003429471551, 4051.301162847771, 4051.51377152406, 4074.4373950795384, 4064.044520783285, 4094.1792763400977, 4065.819551970244, 4063.84862242265, 4061.4342847012435, 4079.9464228761944, 4064.8307635351816, 4037.0440135580893, 4044.7054801681343, 4075.0064184538387
>
> Averages (of geomeans of subtests per run): old: 4044.4 new: 4058.7 delta: 0.35 %
The perfherder overview page shows the runs values as ~2730, while the geomean of ~411, ~447, ~76406, ~19058 is actually ~4045.
Note also how consistent the two approaches are:
- geomean of averages per subtest: old: 4044.6 new: 4058.9 delta: 0.35 %
- average of geomeans per run: old: 4044.4 new: 4058.7 delta: 0.35 %
So something in the overview page does not crunch the data correctly, at least for this sample of data.
---------------------------------
Off topic but highly interesting:
The raw data has both the means and the medians of the replicates per subtest per run. The program uses the mean value since this is what perfherder uses, and indeed the results match the subtests view exactly. However, if I take the medians instead, the regression is doubled.
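That mean-vs-median sensitivity is easy to reproduce with made-up replicates: a single slow outlier moves the mean but not the median, so the two summaries can even disagree on the direction of a change. The numbers below are fabricated for illustration, not taken from the attached data.

```python
import statistics

# Illustration of mean-vs-median sensitivity with made-up replicates:
# one outlier in the old run shifts the mean but not the median, so
# the two summaries report opposite signs for the same data.
old_replicates = [100, 101, 102, 130]   # 130 is a noisy outlier
new_replicates = [103, 104, 105, 106]

mean_delta = statistics.mean(new_replicates) / statistics.mean(old_replicates) - 1
median_delta = statistics.median(new_replicates) / statistics.median(old_replicates) - 1
# mean_delta < 0 while median_delta > 0: opposite signs from one dataset
```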
Comment 19•10 years ago
Specifically, it seems that on the overview page, the geomean-per-job ("runs") values are incorrect.
Comment 20•10 years ago
More specifically, it looks like the geomean values at "ORIGINAL SERIES BLOBS" and "NEW SERIES BLOBS" at http://nbviewer.ipython.org/url/people.mozilla.org/~wlachance/Reproducing%20bug%201190664%20dromaeo%20dom%20windows8.ipynb at "In [12]" and "In [13]" (not part of the data used for the calculations at comment 18) are ~27xx. So this data is incorrect.
Comment 21•10 years ago
I really wish we were doing this on an active test suite. For example, we have disabled all dromaeo dom coverage due to it causing crashes/hangs/timeouts in automation. I assume the findings you are seeing would apply generically to other tests like svg opacity or a11y (where we have a range of subtest results)- maybe we should be validating those instead.
Comment 22•10 years ago
It's not an issue to validate other results. The program is in place, just need to run it on different data.
Regardless, though, this data was collected at a time when, AFAICT, the calculations should have been normal (i.e. not special) for dromaeo dom, and AFAICT there were no crashes or hangs.
This is very clear from the fact that the subtests results (which should be the source for all our views, regression detection, etc.) are super stable in this set of data.
So if the input data is stable and good, and it surely looks to me like it's very good, then there shouldn't be an issue where the overview page is inconsistent with the subtests data.
That being said, sure, the program is attached. Just replace the data or post it someplace and I'll run it on a new data to validate other tests/cases/etc.
Comment 23•10 years ago
Some more info:
1. The input data which I used for validation in comment 18 is apparently already a derivative of the actual input used in treeherder. That is to say, it's the input to the overview/subtests data in compare-perf, but not the input to treeherder. This means that the issue lies somewhere between the actual inputs and the inputs which I used in comment 18. Getting the original input data to replicate the entire pipeline is apparently hard.
2. As far as I understand, for the overview page, treeherder calculates the input geomeans over all the replicates. I'm not sure how it calculates the subtests data from the replicates.
3. As far as I understand, the calculations done by treeherder in this bug are not unique to dromaeo, which means that all tests are affected by this issue. It might be the case that some tests are affected more than others, dromaeo possibly more than most (one explanation I heard is that "the dromaeo data is pathologically bad", though I can't find evidence to that with the data set I used in comment 18).
4. Apparently, it's hard to programmatically find/understand the exact extent to which we're affected.
5. Bug 1184966 moved the summarization (of replicates to subtests, and of subtests to test/suite) from treeherder to "the harness" (read: to the talos side); therefore it should completely bypass any calculation issues in treeherder which led to this bug.
Overall conclusion:
1. Unless proven otherwise, we should suspect every compare-view used in the past. It's hard to tell if the issue is only in one of subtests/overview or in both of them, but my current guess is that the overview is affected while the subtests are not.
2. We need to understand the point[s] in time where data correctness status changes, preferably also being able to say what was wrong before the change.
@jmaher/wlach
Is it your opinion that since bug 1184966 has landed about a week ago, any treeherder data views (graphs, compare-perf overview or subtests) should be correct for data collected today onward?
Comment 24•10 years ago
when we land bug 1193385 I believe we will be correct and can trust the results. That doesn't mean old comparisons are bad, it just means we might have been over/under-reporting in the past.
Comment 25•10 years ago
(In reply to Joel Maher (:jmaher) from comment #24)
> That doesn't mean old comparisons are bad, it just means we might
> be over/under reporting in the past.
It doesn't indeed, but since we can't say exactly to what extent this bug has affected us, it's better to suspect the data than to assume it's correct.
Another example from the same compare view of comment 0:
tpaint opt XP: overview: -1.67, subtests (single test): -1.79
With only a single subtest, the overview page should show exactly the same delta as the subtests view, and yet it doesn't.
I don't know if the tpaint issue is this bug, or maybe bug 1193385, since neither of those bugs declare explicitly which tests are affected and to what extent.
Comment 26•10 years ago
We know that the old perfherder summarization algorithm had problems, let's just say "and data before August 10th is suspect" and leave it at that. If new compelling reasons to revisit the old data materialize, we can always revisit this, but I'd rather spend time validating what we're using than figuring out exactly how wrong we were before.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → INCOMPLETE
Comment 27•10 years ago
(In reply to William Lachance (:wlach) from comment #26)
> let's just say "and data before August 10th is suspect" and leave it at that.
That's a fine statement which indeed doesn't require further digging, because it specifies the extent of the issue. I wouldn't have dug into the data had you said it earlier.