perfherder should let harness do summarization

Status: RESOLVED FIXED (defect)
(Reporter: jmaher, Assigned: jmaher)

Attachments

(1 attachment)

(Assignee)

Description

4 years ago
if we are going to be using a standardized metric, we need to adhere to it as much as we can.

V8 has a specific metric for combining the subtests into a single number; we need to use it:
http://hg.mozilla.org/build/talos/file/8a61f8d21fc0/talos/output.py#l99
(Assignee)

Comment 1

4 years ago
Kraken and CanvasMark also have their own summarization calculations:
http://hg.mozilla.org/build/talos/file/8a61f8d21fc0/talos/output.py#l288

And for Dromaeo we use a different metric for the subtests:
http://hg.mozilla.org/build/talos/file/8a61f8d21fc0/talos/filter.py#l49
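For readers following along, a rough sketch of the shape of a V8-style scoring metric (the reference runtimes, subtest names, and scaling factor below are illustrative assumptions, not the actual talos constants; see the linked output.py for the real implementation):

```python
import math

def geometric_mean(values):
    """Geometric mean of a list of positive numbers."""
    return math.exp(sum(math.log(v) for v in values) / len(values))

# Hypothetical reference runtimes (ms), one per subtest; the real
# constants live in talos' output.py.
V8_REFERENCE = {"Richards": 35302, "RayTrace": 739989, "RegExp": 910985}

def v8_subtest_score(runtimes_ms, name):
    """Score one subtest as reference runtime over measured mean runtime,
    so lower (faster) runtimes yield higher scores."""
    mean_runtime = sum(runtimes_ms) / len(runtimes_ms)
    return 100.0 * V8_REFERENCE[name] / mean_runtime

def v8_metric(subtest_scores):
    """Overall suite score: geometric mean of the subtest scores."""
    return geometric_mean(subtest_scores)
```

The key property is that the suite number is a geometric mean of ratio-style scores, not an arithmetic mean of raw times, which is why substituting a generic summary changes the reported result.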
Let's morph this bug into one covering letting the harness specify the summarization, since that's what we've decided to do.
Depends on: 1186076
Summary: perfherder summarization for v8 needs to use the v8 metric → perfherder should let harness do summarization
No longer blocks: 1150152
Depends on: 1150152
What's the "bug" here? i.e. does the fact that these tests have their own way of summarizing the subtests hurt us in some way? if yes, how does it hurt us?

Do we have reasons to believe that "our standard metric" would be better than their custom summarization? I'd think that unless they were sloppy, there was a good reason for them to choose "special" summarization, so why do we believe our method would be better?
(In reply to Avi Halachmi (:avih) from comment #3)
> Do we have reasons to believe that "our standard metric" would be better
> than their custom summarization? I'd think that unless they were sloppy,
> there was a good reason for them to choose "special" summarization, so why
> do we believe our method would be better?

This bug is precisely about letting tests specify their own summarization, instead of perfherder doing so. We're in violent agreement.
Thanks. I must have misinterpreted "if we are going to be using a standardized metric, we need to adhere to it as much as we can.".

Anyway, so as far as I understand, we're currently using the "standardized" calculation instead of the intended "custom" calculation which the test uses.

Where does it hurt us? E.g. where should it show X but instead shows Y?
In bug 1167409, a V8-v7 regression was reported as an improvement, misleading the developer. Now I'm not sure why perfherder is interpreting raw times as scores - that might be orthogonal to this bug. But I believe the intent is that it should see the scores provided by V8 (which group some tests together), including the final tally.

IIRC this is especially important for V8 since it adjusts the number of runs per test based on the runtime - so a perf improvement could cause it to do more runs, leading to a higher total time - though note that this also isn't great for the stability of the scores it reports: it can make Splay bimodal for instance.
(In reply to Emanuel Hoogeveen [:ehoogeveen] from comment #6)
> But I believe
> the intent is that it should see the scores provided by V8 (which group some
> tests together), including the final tally.

I don't disagree with that intent, I'm just asking for examples where it should show one set of numbers but instead it shows other numbers (e.g. links to pages which show the bad numbers, and link to the "good" numbers someplace else).

That's for reference, such that it's clear for whoever following this bug to understand what's broken, where, and why.
I don't know where to find the correct numbers, but an example of the 'inverted' numbers for V8-v7 is:

https://treeherder.mozilla.org/perf.html#/comparesubtest?originalProject=mozilla-inbound&originalRevision=8a9734a9d97d&newProject=try&newRevision=bb349bc1f1a3&originalSignature=f3cdc5746374e146c7cc7c679e89f6dc5931f68e&newSignature=f3cdc5746374e146c7cc7c679e89f6dc5931f68e

Note the high number for RegExp (the test with our lowest score on V8) and the low number for RayTrace (the test with our highest score on V8). Also note how the *increase* in the Base -> New numbers for RayTrace is marked as an improvement rather than a regression - though again, that may be orthogonal to this bug.
There seems to be some confusion here. 

Here are the tests for which perfherder is currently using a geometric mean to summarize subtest results, but shouldn't:

* Any test with responsiveness in its name
* v8_7
* kraken
* canvasmark

Each of these tests uses its own custom metric to summarize results, defined in this file:

http://hg.mozilla.org/build/talos/file/b6589aef27a5/talos/output.py#l95

We were submitting these summary values to graphserver before. This bug is just about making sure perfherder summarizes things in the same way. I believe subtest results should stay the same.
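To make the difference concrete, here's a minimal sketch contrasting the generic geometric-mean summary with a total-runtime style summary for a time-based suite like kraken (reading the kraken summary as a plain sum of subtest times is an assumption here, not a quote of the talos code):

```python
import math

def perfherder_summary(subtest_values):
    # The uniform summary: geometric mean across subtests.
    return math.exp(sum(math.log(v) for v in subtest_values)
                    / len(subtest_values))

def total_time_summary(subtest_values):
    # What a total-runtime suite intends: the sum of subtest times.
    return sum(subtest_values)

times = [100.0, 400.0]  # hypothetical subtest times (ms)
geo = perfherder_summary(times)    # ~200 (geometric mean)
total = total_time_summary(times)  # 500 (total runtime)
```

A regression in a single subtest moves these two summaries by different percentages, so the harness needs to own the choice of summary.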

See also bug 1190957, which is about documenting how we do these summaries in the wiki.
See Also: → 1190957
Ah, my bad. I do believe the subtest view is wrong for V8-v7 though - it should either get the subtest scores from V8 or start treating them as times rather than scores. Do you know if there's a bug open for that?
(In reply to Emanuel Hoogeveen [:ehoogeveen] from comment #10)
> Ah, my bad. I do believe the subtest view is wrong for V8-v7 though - it
> should either get the subtest scores from V8 or start treating them as times
> rather than scores. Do you know if there's a bug open for that?

Are you saying the numbers are incorrect, or that the way the numbers are displayed is incorrect (i.e. we're showing a regression where we should be showing an improvement)?

With respect to the former, perfherder currently just calculates the mean of results for the subtests; if talos is doing something different there, this bug should also cover that.
(In reply to William Lachance (:wlach) from comment #11)
> Are you saying the numbers are incorrect, or the way that the numbers are
> displayed are incorrect (i.e. we're showing a regression where we should be
> showing an improvement)?

Both: it's displaying the time that each test took (instead of the score that V8 assigns, which might depend on the number of runs it decides to do), but it's interpreting the times as scores, so regressions become improvements.
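A toy illustration of that inversion, with hypothetical numbers:

```python
def percent_change(base, new):
    """Signed percent change from base to new."""
    return 100.0 * (new - base) / base

base_ms, new_ms = 500.0, 450.0          # runtimes: lower is better
delta = percent_change(base_ms, new_ms)  # negative: runtime went down

# The runtime dropped by 10%, a real improvement. But if the UI treats
# the value as a score (higher is better), that same negative change
# reads as a regression, flipping the direction of the result.
```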
(Assignee)

Comment 13

4 years ago
Assignee: nobody → jmaher
Status: NEW → ASSIGNED
Attachment #8643926 - Flags: review?(wlachance)
Comment on attachment 8643926 [details] [review]
support annotated talosdata

Looks good. As discussed, we should test this on stage for a bit before landing on master. I'll take care of that later today.
Attachment #8643926 - Flags: review?(wlachance) → review+
Duplicate of this bug: 1184968
I think we're good to go with this now, changes have been merged into master.
Status: ASSIGNED → RESOLVED
Last Resolved: 4 years ago
Resolution: --- → FIXED
Blocks: 1194333
Just a note that talos chooses a slightly different set of results for calculating the geometric mean than perfherder did (talos often drops the first few sets of results, depending on the test), so this changes some results for those tests, though the difference is not as great as it is for e.g. v8. Example:

https://treeherder.mozilla.org/perf.html#/graphs?timerange=1209600&series=[mozilla-inbound,ca94ff5540bd7a90cd41306de55b1b69962aca1e,1]
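In other words, the per-subtest input to the geometric mean differs. A sketch of that selection step (dropping exactly one warm-up replicate here is an illustrative assumption; the real per-test drop counts live in the talos test configuration):

```python
import math

def subtest_mean(replicates, ignore_first=1):
    """Average a subtest's replicates after dropping the first few
    warm-up runs (the exact count varies per test in talos)."""
    kept = replicates[ignore_first:]
    return sum(kept) / len(kept)

def suite_summary(subtest_replicates, ignore_first=1):
    """Geometric mean over the per-subtest means."""
    means = [subtest_mean(r, ignore_first) for r in subtest_replicates]
    return math.exp(sum(math.log(m) for m in means) / len(means))
```

With a noisy first replicate like `[100, 10, 10]`, dropping it moves the subtest mean from 40 to 10, which is exactly the kind of small summary shift described above.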
No longer depends on: 1150152