Closed Bug 1504227 Opened 6 years ago Closed 5 years ago

Add raptor test option to specify what measurement(s) to alert on

Categories

(Testing :: Raptor, enhancement, P1)

Tracking

(firefox66 fixed)

RESOLVED FIXED
mozilla66

People

(Reporter: jmaher, Assigned: rwood)

Attachments

(2 files)

Currently we take the geometric mean [1] of all metrics we record and report that as a single number per page for alert detection/sheriffing. If we have a hero element, report that as the single number instead.

If we do not have a hero element, let's keep doing what we currently do until we agree on something else.


[1] https://searchfox.org/mozilla-central/source/testing/raptor/raptor/output.py#657
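
For reference, a rough sketch of the current behaviour (a hypothetical helper, not the actual output.py code linked above): the suite-level value is the geometric mean of all subtest values for the page.

    import math

    # Rough illustrative sketch: the suite value is the geometric mean of
    # all subtest values (e.g. dcf, fnbpaint, hero, ttfi) for the page.
    def geometric_mean(values):
        values = [v for v in values if v > 0]  # guard against log of zero
        if not values:
            return 0.0
        return math.exp(sum(math.log(v) for v in values) / len(values))

    # e.g. geometric_mean([558, 1228, 545, 1776]) -> one number to alert on
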
:vchin, is there a data point (first non-blank paint, dom content flushed, etc.) you would prefer we track/monitor instead of the geometric mean of all measurements when we do not have a hero element?
Flags: needinfo?(vchin)
Interesting. We could add an option in the raptor test INI, something like 'alert_on =' and it would have to equal one of the values that is measured / from the 'measure' INI setting.
(In reply to Robert Wood [:rwood] from comment #2)
> Interesting. We could add an option in the raptor test INI, something like
> 'alert_on =' and it would have to equal one of the values that is measured /
> from the 'measure' INI setting.

(and if it's not there, use geometric mean as we do now)
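
A hypothetical test INI snippet for that idea (option name and fallback semantics as proposed above; nothing is implemented at this point in the discussion):

    # hypothetical raptor test INI section
    [raptor-tp6-google-firefox]
    measure = fnbpaint, hero, dcf, ttfi
    # alert_on must be one of the 'measure' values;
    # if omitted, fall back to the geometric mean
    alert_on = hero
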
Assignee: nobody → rwood
Status: NEW → ASSIGNED
It appears that we're sending all of the metrics to Perfherder as subtests, and the top-level test uses the geometric mean as the datapoint. Would it be possible to just configure which test(s) generate alerts? This would allow us to alert on just the hero element when desired, and suppress the top-level or other subtest alerts.

https://treeherder.mozilla.org/perf.html#/graphs?series=autoland,1782727,1,10&series=autoland,1782728,1,10&series=autoland,1782729,1,10&series=autoland,1782730,1,10&series=autoland,1799131,1,10
Flags: needinfo?(igoldan)
(In reply to Dave Hunt [:davehunt] [he/him/his] ⌚️UTC+1 from comment #4)
> Would it be
> possible to just configure which test(s) generate alerts? This would allow
> us to alert on just the hero element when desired, and suppress the top-level
> or other subtest alerts.

From what I know, each perf test outputs a PERFHERDER_DATA JSON dump into the logs. These JSONs can include a "shouldAlert" field, which Perfherder respects. It can be applied at the suite level and at the subtest level as well.

Basically, Perfherder doesn't dictate what alerts to ignore. The data it ingests does that.
Flags: needinfo?(igoldan)
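
For illustration, a trimmed, hypothetical PERFHERDER_DATA shape showing where that field can appear (a full real example is pasted later in this bug):

    {"framework": {"name": "raptor"},
     "suites": [{"name": "raptor-tp6-google-firefox",
                 "value": 569.9,
                 "shouldAlert": false,
                 "subtests": [{"name": "raptor-tp6-google-firefox-fcp",
                               "value": 303,
                               "shouldAlert": true}]}]}
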
Would it make sense to do this in two patches, the first to enable alerts for the hero element, and the second to disable alerts for the top-level geometric mean? Is there a way to verify this change without having to wait for an alert?
Flags: needinfo?(jmaher)
The geometric mean isn't needed at all; that is just something we added to help catch things. This is just changing what value we report, so we only need some small changes.

Right now we have:
https://searchfox.org/mozilla-central/source/testing/raptor/raptor/output.py#657

We can check if there is a hero element in the values and use that instead. I believe we have an array of [value, name] pairs like:
[[a, dcf],
 [b, ttfnbp],
 [c, hero1],
 [d, ttfi]]

So if we are expecting hero1, we could do:

hero_vals = [i for i, j in vals if j == 'hero1']
if hero_vals:
    return hero_vals[0]

Or something like that. We always need to alert on the top-level value (I think), but we can optionally alert on subtests.
Flags: needinfo?(jmaher)
Ok, so it looks like we need to make a change on the Raptor side since we always alert on the top-level value (and we don't want to alert on both the geometric mean and hero element/subtests).

What if there are multiple hero elements on the page (as we might have in the future for Instagram, for example)? Would it be better to make this configurable in the test INI? Like 'alert_on = hero1', where the alert_on value would need to be one of the values from the 'measure =' key in the same test INI. That way we could change it from one particular hero element to another.
(In reply to Robert Wood [:rwood] from comment #8)
> Ok, so it looks like we need to make a change on the Raptor side since we
> always alert on the top-level value (and we don't want to alert on both the
> geometric mean and hero element/subtests).
> 
> What if there are multiple hero elements on the page (as we might have in
> the future for Instagram, for example)? Would it be better to make this
> configurable in the test INI? Like 'alert_on = hero1', where the alert_on
> value would need to be one of the values from the 'measure =' key in the
> same test INI. That way we could change it from one particular hero element
> to another.

I agree that it makes sense for this to be configurable. I'm not sure 'alert_on' is ideal though, as we could later have a requirement to alert on subtests. Maybe something like 'suite_value_source'?
Summary: adjust data point we alert on to be hero element when available → add raptor test option to specify what measurement to report as the top level result/alert on
Blocks: 1503315
No longer blocks: 1473078
In bug 1504013 we are changing from the geometric mean to "loadtime" as the top-level metric to report/alert on. Would we still want to substitute the hero element timing when present? I would prefer the top-level metric to be consistent, and to optionally enable reporting/alerting on the hero element subtests if these need to be monitored closely.
Flags: needinfo?(acreskey)
First, I think this is very important work.

Right now when we compare top-level results to Chrome I think we have a problem:

For Firefox we use the geomean of {dcf, fnbpaint, hero, ttfi} for most sites.

For Chrome we use the geomean of {fcp, hero}.


And these are compared directly, e.g.

https://treeherder.mozilla.org/perf.html#/graphs?timerange=5184000&series=mozilla-central,1731806,1,10&series=mozilla-central,1792346,1,10&selected=mozilla-central,1731806,413387,666816833,10


But since different marker elements are used in the two geomeans, these are effectively apples-to-oranges comparisons.

In particular, the Firefox measurement includes the very long TTFI marker which Chrome does not.

Anyway, I'm personally very much in favor of moving to "loadtime" as the top level result / alert metric.



From what I've been able to tell the Hero element is problematic because we can detect when it's in the DOM, but not necessarily when it's been rendered. So as it is, it's not an ideal metric as the end user doesn't actually care about what's in the DOM, only what he or she can see :)

See: https://bugzilla.mozilla.org/show_bug.cgi?id=1510999
Flags: needinfo?(acreskey)
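
To make the apples-to-oranges point concrete with made-up numbers, assuming both browsers show the hero at the same time:

    # Purely illustrative, hypothetical replicates in ms:
    firefox = {"dcf": 300, "fnbpaint": 320, "hero": 800, "ttfi": 1700}
    chrome = {"fcp": 310, "hero": 800}

    # geomean(firefox) = (300 * 320 * 800 * 1700) ** 0.25  ~= 601 ms
    # geomean(chrome)  = (310 * 800) ** 0.5                ~= 498 ms
    # The ~100 ms gap comes entirely from which markers each geomean
    # includes (ttfi in particular), not from anything the user sees.
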
(In reply to Andrew Creskey from comment #11)
> First, I think this is very important work.
> 
> Right now when we compare top-level results to Chrome I think we have a
> problem:
> 
> For Firefox we use the geomean of {dcf, fnbpaint, hero, ttfi} for most sites.
> 
> For Chrome we use the geomean of {fcp, hero}.
> 
> 
> And these are compared directly, e.g.
> 
> https://treeherder.mozilla.org/perf.html#/graphs?timerange=5184000&series=mozilla-central,1731806,1,10&series=mozilla-central,1792346,1,10&selected=mozilla-central,1731806,413387,666816833,10
> 
> But since different marker elements are used in the two geomeans, these are
> effectively apples-to-oranges comparisons.
> 
> In particular, the Firefox measurement includes the very long TTFI marker
> which Chrome does not.
> 
> Anyway, I'm personally very much in favor of moving to "loadtime" as the top
> level result / alert metric.
> 
> From what I've been able to tell the Hero element is problematic because
> we can detect when it's in the DOM, but not necessarily when it's been
> rendered. So as it is, it's not an ideal metric as the end user doesn't
> actually care about what's in the DOM, only what he or she can see :)
> 
> See: https://bugzilla.mozilla.org/show_bug.cgi?id=1510999

It is possible to compare the results of the subtests, although I don't know if there are suitable comparable measurements. Here is Firefox first non-blank paint vs Chrome first contentful paint: https://treeherder.mozilla.org/perf.html#/graphs?timerange=5184000&series=mozilla-central,1731768,1,10&series=mozilla-central,1792350,1,10

After speaking with Rob, we like the idea of being able to specify the subtest(s) that should alert for each test. Where loadtime is consistently available we can use this, and where it's not we can use the geometric mean. This might require some changes to Perfherder to silence alerting on the top-level suite value, which must always be available.
Okay, so based on the comments in bug 1504227 I think we should look into the alert_on suggestion from :rwood in comment 2. Instead of using this to influence the suite value, it should control the shouldAlert values of the suite/subtests in PERFHERDER_DATA. The suite value should remain the geometric mean, and we should default alerting on this value. We should also accept multiple values and ensure the values are valid choices.
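
A minimal sketch of what that could look like on the Raptor side (function and variable names are hypothetical, not the actual patch):

    # Hypothetical sketch: apply the test INI's 'alert_on' entries to the
    # PERFHERDER_DATA subtests; the suite value stays the geometric mean
    # and alerts by default, so nothing is changed at the suite level.
    def apply_alert_on(subtests, measure, alert_on):
        invalid = [m for m in alert_on if m not in measure]
        if invalid:
            # ensure the values are valid choices from 'measure ='
            raise ValueError("invalid alert_on value(s): %s" % ", ".join(invalid))
        for subtest in subtests:
            # assumes subtest names end with the measurement, e.g. '...-firefox-fcp'
            if any(subtest["name"].endswith("-" + m) for m in alert_on):
                subtest["shouldAlert"] = True
        return subtests
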
Flags: needinfo?(vchin)
Summary: add raptor test option to specify what measurement to report as the top level result/alert on → Add raptor test option to specify what measurement(s) to alert on
Priority: -- → P1

Question: given the new info that the geometric mean really isn't useful, should we make the top-level reported value in Treeherder/Perfherder be FCP instead? Then we can alert on that by default since it's the top-level; and then in the subtests we will also have shouldAlert set for 'loadtime'. What do you think? In the case where a test INI doesn't have 'fcp' listed to measure, the test could still run but wouldn't dump anything to Perfherder in production. Just an idea; I don't see the point in calculating and reporting the geomean if it's not useful.

Flags: needinfo?(dave.hunt)

(In reply to Robert Wood [:rwood] from comment #15)

(in the case where the test INI has 'alert_on' = fcp, loadtime)

Or perhaps we could use the first 'alert_on' value (i.e. fcp) as the top-level overall result reported. Or have another INI setting, e.g. 'suite_result = fcp', that determines the overall result reported; and then have 'alert_on = fcp, loadtime'.

I don't like the idea of having the top-level reported value be variable. If fcp is guaranteed to be available then I'm okay with using that, however I think I'd prefer to either leave the top-level as geometric mean for now, or to use 0. We've discussed that fcp and loadtime are useful to engineers, especially for alerting on regressions, however there is also the possibility that some calculated score may be useful for dashboards. The geometric mean may not be the ideal approach, but any such score would make sense as the top-level value.

:ekyle do you have some thoughts on this?

Flags: needinfo?(dave.hunt) → needinfo?(klahnakoski)

The geometric mean is a good statistic to alert on: it has less variance than the individual measures, so it is less prone to false positives.
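
A quick illustrative simulation of that variance claim (made-up distributions, not real raptor data):

    import math
    import random
    import statistics

    random.seed(0)

    def geomean(xs):
        return math.exp(sum(math.log(x) for x in xs) / len(xs))

    single, combined = [], []
    for _ in range(1000):
        # four noisy measures, each with ~10% relative noise
        metrics = [random.gauss(mu, mu * 0.1) for mu in (300, 320, 800, 1700)]
        single.append(metrics[0])
        combined.append(geomean(metrics))

    cv = lambda xs: statistics.stdev(xs) / statistics.mean(xs)
    # cv(single) is ~0.10 while cv(combined) is ~0.05: the geo-mean has
    # roughly half the relative noise of any individual measure.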

I attended Vicky's performance team meeting, and my conclusions on what to build have not changed:

  • Engineers require dashboards that are fine-grained and measure what they can control; they should be directed to Perfherder whenever possible.
  • The performance sheriffs are interested in raising issues on regressions: This is different from the engineers; sheriffs care why there is a regression only insofar as it helps them identify the engineer who can fix the problem; otherwise the why is not relevant. What's important to sheriffs is controlling the inevitable volume of alerts and minimizing the number of false positives within reason. The geo-mean provides this: it reduces the number of tracked statistics, and it reduces variance so there are fewer false positives.
  • Management wants KPI tracking: these are usually large aggregates over many measures, smoothed over some time window. Management wants to know how far we are from our goal. A geo-mean of particular measures can be used here (our particular geo-mean is unlikely to be useful to management). The geo-mean is not valued for its reduced variance; rather, it is valued for combining multiple disparate measures into a single statistic. The health dashboard is where such aggregate numbers can/should be reported.

So, keeping in mind we have three different categories of customer, we should not change TP6's suite geo-mean calculation; our team (igoldan) will use it.

We can turn on alerting for the subtest measures the engineers are interested in. I am concerned that these subtests will have too much noise, produce too many false positives, and will increase our sheriffing load with little benefit. I suspect our team will alert on regressions in the geo-mean, highlight the particular measure that caused the geo-mean regression, and then inform the engineer of what happened. I will leave it to the perf sheriffs to figure that out.

Flags: needinfo?(klahnakoski)

Ok, thank you for the excellent feedback guys! I'll leave it as-is regarding the geomean, and just add a new option for 'alert_on'.

Here's an example of the output for tp6-google (with just 2 page-cycles) with 'alert_on = fcp, loadtime' set in the test INI:

15:44:08 INFO - raptor-output PERFHERDER_DATA: {"framework": {"name": "raptor"}, "suites": [{"extraOptions": [], "name": "raptor-tp6-google-firefox", "lowerIsBetter": true, "alertThreshold": 2.0, "value": 569.9, "subtests": [{"name": "raptor-tp6-google-firefox-fcp", "lowerIsBetter": true, "alertThreshold": 2.0, "value": 303, "shouldAlert": true, "replicates": [558, 303], "unit": "ms"}, {"name": "raptor-tp6-google-firefox-fnbpaint", "lowerIsBetter": true, "alertThreshold": 2.0, "value": 303, "replicates": [558, 303], "unit": "ms"}, {"name": "raptor-tp6-google-firefox-hero:hero1", "lowerIsBetter": true, "alertThreshold": 2.0, "value": 790, "replicates": [1228, 790], "unit": "ms"}, {"name": "raptor-tp6-google-firefox-loadtime", "lowerIsBetter": true, "alertThreshold": 2.0, "value": 1305, "shouldAlert": true, "replicates": [1811, 1305], "unit": "ms"}, {"name": "raptor-tp6-google-firefox-dcf", "lowerIsBetter": true, "alertThreshold": 2.0, "value": 291, "replicates": [545, 291], "unit": "ms"}, {"name": "raptor-tp6-google-firefox-ttfi", "lowerIsBetter": true, "alertThreshold": 2.0, "value": 1241, "replicates": [1776, 1241], "unit": "ms"}], "type": "pageload", "unit": "ms"}]}

Pushed by rwood@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/9a008f412fd7
Add raptor test option to specify what measurement(s) to alert on; r=davehunt
https://hg.mozilla.org/integration/autoland/rev/07500fee706b
Add raptor test option to specify what measurement(s) to alert on; r=davehunt
Status: ASSIGNED → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla66