1184966 - perfherder should let harness do summarization

Assignee

Description

•

10 years ago

if we are going to be using a standardized metric, we need to adhere to it as much as we can. v8 has a specific metric for combining the subtests into a single number, we need to use this: http://hg.mozilla.org/build/talos/file/8a61f8d21fc0/talos/output.py#l99

Joel Maher ( :jmaher ) (UTC -8)

Assignee

Comment 1

•

10 years ago

also kraken and canvasmark have their own summarization calculation: http://hg.mozilla.org/build/talos/file/8a61f8d21fc0/talos/output.py#l288 and for dromaeo we have a different metric for the subtests: http://hg.mozilla.org/build/talos/file/8a61f8d21fc0/talos/filter.py#l49

William Lachance (:wlach)

Comment 2

•

10 years ago

Let's morph this bug into one covering letting the harness specify the summarization, since that's what we've decided to do.

Depends on: 1186076

Summary: perfherder summarization for v8 needs to use the v8 metric → perfherder should let harness do summarization

William Lachance (:wlach)

Updated

•

10 years ago

No longer blocks: 1150152

Depends on: 1150152

Avi Halachmi (:avih)

Comment 3

•

10 years ago

What's the "bug" here? i.e. does the fact that these tests have their own way of summarizing the subtests hurt us in some way? if yes, how does it hurt us? Do we have reasons to believe that "our standard metric" would be better than their custom summarization? I'd think that unless they were sloppy, there was a good reason for them to choose "special" summarization, so why do we believe our method would be better?

William Lachance (:wlach)

Comment 4

•

10 years ago

(In reply to Avi Halachmi (:avih) from comment #3) > Do we have reasons to believe that "our standard metric" would be better > than their custom summarization? I'd think that unless they were sloppy, > there was a good reason for them to choose "special" summarization, so why > do we believe our method would be better? This bug is precisely about letting tests specify their own summarization, instead of perfherder doing so. We're in violent agreement.

Avi Halachmi (:avih)

Comment 5

•

10 years ago

Thanks. I must have misinterpreted "if we are going to be using a standardized metric, we need to adhere to it as much as we can.". Anyway, so as far as I understand, we're currently using the "standardized" calculation instead of the intended "custom" calculation which the test uses. Where does it hurt us? E.g. where should it show X but instead shows Y?

Emanuel Hoogeveen [:ehoogeveen]

Comment 6

•

10 years ago

In bug 1167409, a V8-v7 regression was reported as an improvement, misleading the developer. Now I'm not sure why perfherder is interpreting raw times as scores - that might be orthogonal to this bug. But I believe the intent is that it should see the scores provided by V8 (which group some tests together), including the final tally. IIRC this is especially important for V8 since it adjusts the number of runs per test based on the runtime - so a perf improvement could cause it to do more runs, leading to a higher total time - though note that this also isn't great for the stability of the scores it reports: it can make Splay bimodal for instance.

Avi Halachmi (:avih)

Comment 7

•

10 years ago

(In reply to Emanuel Hoogeveen [:ehoogeveen] from comment #6) > But I believe > the intent is that it should see the scores provided by V8 (which group some > tests together), including the final tally. I don't disagree with that intent, I'm just asking for examples where it should show one set of numbers but instead it shows other numbers (e.g. links to pages which show the bad numbers, and link to the "good" numbers someplace else). That's for reference, such that it's clear for whoever following this bug to understand what's broken, where, and why.

Emanuel Hoogeveen [:ehoogeveen]

Comment 8

•

10 years ago

I don't know where to find the correct numbers, but an example of the 'inverted' numbers for V8-v7 is: https://treeherder.mozilla.org/perf.html#/comparesubtest?originalProject=mozilla-inbound&originalRevision=8a9734a9d97d&newProject=try&newRevision=bb349bc1f1a3&originalSignature=f3cdc5746374e146c7cc7c679e89f6dc5931f68e&newSignature=f3cdc5746374e146c7cc7c679e89f6dc5931f68e Note the high number for RegExp (the test with our lowest score on V8) and the low number for RayTrace (the test with our highest score on V8). Also note how the *increase* in the Base -> New numbers for RayTrace is marked as an improvement rather than a regression - though again, that may be orthogonal to this bug.

William Lachance (:wlach)

Comment 9

•

10 years ago

There seems to be some confusion here. Here are the tests for which perfherder is currently using a geometric mean to summarize subtest results, but shouldn't: * Any test with responsiveness in its name * v8_7 * kraken * canvasmark Each of these tests uses its own custom metric to summarize results, defined in this file: http://hg.mozilla.org/build/talos/file/b6589aef27a5/talos/output.py#l95 We were submitting these summaries values to graphserver before. This bug is just about making sure perfherder summarizes things in the same way. I believe subtest results should stay the same. See also bug 1190957, which is about documenting how we do these summaries in the wiki.

Comment 10

•

10 years ago

Ah, my bad. I do believe the subtest view is wrong for V8-v7 though - it should either get the subtest scores from V8 or start treating them as times rather than scores. Do you know if there's a bug open for that?

William Lachance (:wlach)

Comment 11

•

10 years ago

(In reply to Emanuel Hoogeveen [:ehoogeveen] from comment #10) > Ah, my bad. I do believe the subtest view is wrong for V8-v7 though - it > should either get the subtest scores from V8 or start treating them as times > rather than scores. Do you know if there's a bug open for that? Are you saying the numbers are incorrect, or the way that the numbers are displayed are incorrect (i.e. we're showing a regression where we should be showing an improvement)? With respect to the former, currently perfherder just calculates the mean of results for the subtests, if talos is doing something different there this bug should also cover that.

Emanuel Hoogeveen [:ehoogeveen]

Comment 12

•

10 years ago

(In reply to William Lachance (:wlach) from comment #11) > Are you saying the numbers are incorrect, or the way that the numbers are > displayed are incorrect (i.e. we're showing a regression where we should be > showing an improvement)? Both: it's displaying the time that each test took (instead of the score that V8 assigns, which might depend on the number of runs it decides to do), but it's interpreting the times as scores, so regressions become improvements.

Joel Maher ( :jmaher ) (UTC -8)

Assignee

Comment 13

•

10 years ago

Attached file support annotated talosdata — Details

Assignee: nobody → jmaher

Status: NEW → ASSIGNED

Attachment #8643926 - Flags: review?(wlachance)

William Lachance (:wlach)

Comment 14

•

10 years ago

Comment on attachment 8643926 [details] [review] support annotated talosdata Looks good. As discussed we should test this on stage for a bit before landing on master. I'll take care of that later today.

Attachment #8643926 - Flags: review?(wlachance) → review+

Treeherder GitHub Bugbot

Comment 16

•

10 years ago

Commits pushed to master at https://github.com/mozilla/treeherder https://github.com/mozilla/treeherder/commit/61a4291292a21435fe9bc36a984dfb90163fa335 Bug 1184966 - perfherder should let harness do summarization https://github.com/mozilla/treeherder/commit/e5b97151e12c0064bf14534fe83e14c6f892c7f2 Bug 1184966 - Incorporate optional "value" parameter for subtests

William Lachance (:wlach)

Comment 17

•

10 years ago

I think we're good to go with this now, changes have been merged into master.

Status: ASSIGNED → RESOLVED

Closed: 10 years ago

Resolution: --- → FIXED

Avi Halachmi (:avih)

Updated

•

10 years ago

Blocks: 1194333

William Lachance (:wlach)

Comment 18

•

10 years ago

Just a note, that talos chooses a slightly different set of results to calculate the geometric mean than perfherder did (talos often drops the first few sets of results, depending on the test), so this changes some results for those tests, though the difference is not as great as it is for e.g. v8. Example: https://treeherder.mozilla.org/perf.html#/graphs?timerange=1209600&series=[mozilla-inbound,ca94ff5540bd7a90cd41306de55b1b69962aca1e,1]

William Lachance (:wlach)

Updated

•

10 years ago

No longer depends on: 1150152