Closed Bug 1504227 Opened 6 years ago Closed 5 years ago

Add raptor test option to specify what measurement(s) to alert on

Categories

(Testing :: Raptor, enhancement, P1)

Tracking

(firefox66 fixed)

RESOLVED FIXED
mozilla66

People

(Reporter: jmaher, Assigned: rwood)

Attachments

(2 files)

Currently we take the geometric mean [1] of all metrics we record and report that as a single number per page for alert detection/sheriffing. If we have a hero element, report that as the single number instead.

If we do not have a hero element, let's keep doing what we currently do until we agree on something else.


[1] https://searchfox.org/mozilla-central/source/testing/raptor/raptor/output.py#657
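
For reference, a rough sketch of the current behaviour (a hypothetical helper, not the actual output.py code linked above): the suite-level value is the geometric mean of all subtest values for the page.

    import math

    # Rough illustrative sketch: the suite value is the geometric mean of
    # all subtest values (e.g. dcf, fnbpaint, hero, ttfi) for the page.
    def geometric_mean(values):
        values = [v for v in values if v > 0]  # guard against log of zero
        if not values:
            return 0.0
        return math.exp(sum(math.log(v) for v in values) / len(values))

    # e.g. geometric_mean([558, 1228, 545, 1776]) -> one number to alert on
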
:vchin, is there a data point (first non-blank paint, dom content flushed, etc.) you would prefer we track/monitor instead of the geometric mean of all measurements when we do not have a hero element?
Flags: needinfo?(vchin)
Interesting. We could add an option in the raptor test INI, something like 'alert_on =' and it would have to equal one of the values that is measured / from the 'measure' INI setting.
(In reply to Robert Wood [:rwood] from comment #2)
> Interesting. We could add an option in the raptor test INI, something like
> 'alert_on =' and it would have to equal one of the values that is measured /
> from the 'measure' INI setting.

(and if it's not there, use geometric mean as we do now)
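
A hypothetical test INI snippet for that idea (option name and fallback semantics as proposed above; nothing is implemented at this point in the discussion):

    # hypothetical raptor test INI section
    [raptor-tp6-google-firefox]
    measure = fnbpaint, hero, dcf, ttfi
    # alert_on must be one of the 'measure' values;
    # if omitted, fall back to the geometric mean
    alert_on = hero
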
Assignee: nobody → rwood
Status: NEW → ASSIGNED
It appears that we're sending all of the metrics to Perfherder as subtests, and the top-level test uses the geometric mean as the datapoint. Would it be possible to just configure which test(s) generate alerts? This would allow us to alert on just the hero element when desired, and suppress the top-level or other subtest alerts.

https://treeherder.mozilla.org/perf.html#/graphs?series=autoland,1782727,1,10&series=autoland,1782728,1,10&series=autoland,1782729,1,10&series=autoland,1782730,1,10&series=autoland,1799131,1,10
Flags: needinfo?(igoldan)
(In reply to Dave Hunt [:davehunt] [he/him/his] ⌚️UTC+1 from comment #4)
> Would it be
> possible to just configure which test(s) generate alerts? This would allow
> us to alert on just the hero element when desired, and suppress the top-level
> or other subtest alerts.

From what I know, each perf test outputs a PERFHERDER_DATA JSON dump into the logs. These JSONs can include a "shouldAlert" field, which Perfherder respects. It can be applied at the suite level and at the subtest level as well.

Basically, Perfherder doesn't dictate what alerts to ignore. The data it ingests does that.
Flags: needinfo?(igoldan)
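
For illustration, a trimmed, hypothetical PERFHERDER_DATA shape showing where that field can appear (a full real example is pasted later in this bug):

    {"framework": {"name": "raptor"},
     "suites": [{"name": "raptor-tp6-google-firefox",
                 "value": 569.9,
                 "shouldAlert": false,
                 "subtests": [{"name": "raptor-tp6-google-firefox-fcp",
                               "value": 303,
                               "shouldAlert": true}]}]}
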
Would it make sense to do this in two patches, the first to enable alerts for the hero element, and the second to disable alerts for the top-level geometric mean? Is there a way to verify this change without having to wait for an alert?
Flags: needinfo?(jmaher)
The geometric mean isn't needed at all; that is just something we added to help catch things. This is just changing what value we report, so we only need some small changes.

Right now we have:
https://searchfox.org/mozilla-central/source/testing/raptor/raptor/output.py#657

We can check if there is a hero element in the values and use that instead. I believe we have an array of [value, name] pairs like:
[[a, dcf],
 [b, ttfnbp],
 [c, hero1],
 [d, ttfi]]

So if we are expecting hero1, we could do:

hero_vals = [i for i, j in vals if j == 'hero1']
if hero_vals:
    return hero_vals[0]

Or something like that. We always need to alert on the top-level value (I think), but we can optionally alert on subtests.
Flags: needinfo?(jmaher)
Ok, so it looks like we need to make a change on the Raptor side since we always alert on the top-level value (and we don't want to alert on both the geometric mean and hero element/subtests).

What if there are multiple hero elements on the page (as we might have in the future for Instagram, for example)? Would it be better to make this configurable in the test INI? Like 'alert_on = hero1', where the alert_on value would need to be one of the values from the 'measure =' key in the same test INI. That way we could change it from one particular hero element to another.
(In reply to Robert Wood [:rwood] from comment #8)
> Ok, so it looks like we need to make a change on the Raptor side since we
> always alert on the top-level value (and we don't want to alert on both the
> geometric mean and hero element/subtests).
> 
> What if there are multiple hero elements on the page (as we might have in
> the future for Instagram, for example)? Would it be better to make this
> configurable in the test INI? Like 'alert_on = hero1', where the alert_on
> value would need to be one of the values from the 'measure =' key in the
> same test INI. That way we could change it from one particular hero element
> to another.

I agree that it makes sense for this to be configurable. I'm not sure 'alert_on' is ideal though, as we could later have a requirement to alert on subtests. Maybe something like 'suite_value_source'?
Summary: adjust data point we alert on to be hero element when available → add raptor test option to specify what measurement to report as the top level result/alert on
Blocks: 1503315
No longer blocks: 1473078
In bug 1504013 we are changing from the geometric mean to "loadtime" as the top-level metric to report/alert on. Would we still want to substitute the hero element timing when present? I would prefer the top-level metric to be consistent, and to optionally enable reporting/alerting on the hero element subtests if these need to be monitored closely.
Flags: needinfo?(acreskey)
First, I think this is very important work.

Right now when we compare top-level results to Chrome I think we have a problem:

For Firefox we use the geomean of {dcf, fnbpaint, hero, ttfi} for most sites.

For Chrome we use the geomean of {fcp, hero}.


And these are compared directly, e.g.

https://treeherder.mozilla.org/perf.html#/graphs?timerange=5184000&series=mozilla-central,1731806,1,10&series=mozilla-central,1792346,1,10&selected=mozilla-central,1731806,413387,666816833,10


But since different marker elements are used in the two geomeans, these are effectively apples-to-oranges comparisons.

In particular, the Firefox measurement includes the very long TTFI marker which Chrome does not.

Anyway, I'm personally very much in favor of moving to "loadtime" as the top level result / alert metric.



From what I've been able to tell the Hero element is problematic because we can detect when it's in the DOM, but not necessarily when it's been rendered. So as it is, it's not an ideal metric as the end user doesn't actually care about what's in the DOM, only what he or she can see :)

See: https://bugzilla.mozilla.org/show_bug.cgi?id=1510999
Flags: needinfo?(acreskey)
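
To make the apples-to-oranges point concrete with made-up numbers, assuming both browsers show the hero at the same time:

    # Purely illustrative, hypothetical replicates in ms:
    firefox = {"dcf": 300, "fnbpaint": 320, "hero": 800, "ttfi": 1700}
    chrome = {"fcp": 310, "hero": 800}

    # geomean(firefox) = (300 * 320 * 800 * 1700) ** 0.25  ~= 601 ms
    # geomean(chrome)  = (310 * 800) ** 0.5                ~= 498 ms
    # The ~100 ms gap comes entirely from which markers each geomean
    # includes (ttfi in particular), not from anything the user sees.
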
(In reply to Andrew Creskey from comment #11)
> First, I think this is very important work.
> 
> Right now when we compare top-level results to Chrome I think we have a
> problem:
> 
> For Firefox we use the geomean of {dcf, fnbpaint, hero, ttfi} for most sites.
> 
> For Chrome we use the geomean of {fcp, hero}.
> 
> 
> And these are compared directly, e.g.
> 
> https://treeherder.mozilla.org/perf.html#/graphs?timerange=5184000&series=mozilla-central,1731806,1,10&series=mozilla-central,1792346,1,10&selected=mozilla-central,1731806,413387,666816833,10
> 
> But since different marker elements are used in the two geomeans, these are
> effectively apples-to-oranges comparisons.
> 
> In particular, the Firefox measurement includes the very long TTFI marker
> which Chrome does not.
> 
> Anyway, I'm personally very much in favor of moving to "loadtime" as the top
> level result / alert metric.
> 
> From what I've been able to tell the Hero element is problematic because
> we can detect when it's in the DOM, but not necessarily when it's been
> rendered. So as it is, it's not an ideal metric as the end user doesn't
> actually care about what's in the DOM, only what he or she can see :)
> 
> See: https://bugzilla.mozilla.org/show_bug.cgi?id=1510999

It is possible to compare the results of the subtests, although I don't know if there are suitable comparable measurements. Here is Firefox first non-blank paint vs Chrome first contentful paint: https://treeherder.mozilla.org/perf.html#/graphs?timerange=5184000&series=mozilla-central,1731768,1,10&series=mozilla-central,1792350,1,10

After speaking with Rob, we like the idea of being able to specify the subtest(s) that should alert for each test. Where loadtime is consistently available we can use this, and where it's not we can use the geometric mean. This might require some changes to Perfherder to silence alerting on the top-level suite value, which must always be available.
Okay, so based on the comments in bug 1504227 I think we should look into the alert_on suggestion from :rwood in comment 2. Instead of using this to influence the suite value, it should control the shouldAlert values of the suite/subtests in PERFHERDER_DATA. The suite value should remain the geometric mean, and we should default alerting on this value. We should also accept multiple values and ensure the values are valid choices.
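
A minimal sketch of what that could look like on the Raptor side (function and variable names are hypothetical, not the actual patch):

    # Hypothetical sketch: apply the test INI's 'alert_on' entries to the
    # PERFHERDER_DATA subtests; the suite value stays the geometric mean
    # and alerts by default, so nothing is changed at the suite level.
    def apply_alert_on(subtests, measure, alert_on):
        invalid = [m for m in alert_on if m not in measure]
        if invalid:
            # ensure the values are valid choices from 'measure ='
            raise ValueError("invalid alert_on value(s): %s" % ", ".join(invalid))
        for subtest in subtests:
            # assumes subtest names end with the measurement, e.g. '...-firefox-fcp'
            if any(subtest["name"].endswith("-" + m) for m in alert_on):
                subtest["shouldAlert"] = True
        return subtests
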
Flags: needinfo?(vchin)
Summary: add raptor test option to specify what measurement to report as the top level result/alert on → Add raptor test option to specify what measurement(s) to alert on
Priority: -- → P1

Question: given the new info that the geometric mean really isn't useful, should we make the top-level reported value in Treeherder/Perfherder be FCP instead? Then we can alert on that by default since it's the top-level; and then in the subtests we will also have shouldAlert set for 'loadtime'. What do you think? In the case where a test INI doesn't have 'fcp' listed to measure, the test could still run but wouldn't dump anything to Perfherder in production. Just an idea; I don't see the point in calculating and reporting the geomean if it's not useful.

Flags: needinfo?(dave.hunt)

(In reply to Robert Wood [:rwood] from comment #15)

(in the case where the test INI has 'alert_on' = fcp, loadtime)

Or perhaps we could use the first 'alert_on' value (i.e. fcp) as the top-level overall result reported. Or have another INI setting, e.g. 'suite_result = fcp', that determines the overall result reported; and then have 'alert_on = fcp, loadtime'.

I don't like the idea of having the top-level reported value be variable. If fcp is guaranteed to be available then I'm okay with using that, however I think I'd prefer to either leave the top-level as geometric mean for now, or to use 0. We've discussed that fcp and loadtime are useful to engineers, especially for alerting on regressions, however there is also the possibility that some calculated score may be useful for dashboards. The geometric mean may not be the ideal approach, but any such score would make sense as the top-level value.

:ekyle do you have some thoughts on this?

Flags: needinfo?(dave.hunt) → needinfo?(klahnakoski)

The geometric mean is a good statistic to alert on: it has less variance than the individual measures, so it is less prone to false positives.
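
A quick illustrative simulation of that variance claim (made-up distributions, not real raptor data):

    import math
    import random
    import statistics

    random.seed(0)

    def geomean(xs):
        return math.exp(sum(math.log(x) for x in xs) / len(xs))

    single, combined = [], []
    for _ in range(1000):
        # four noisy measures, each with ~10% relative noise
        metrics = [random.gauss(mu, mu * 0.1) for mu in (300, 320, 800, 1700)]
        single.append(metrics[0])
        combined.append(geomean(metrics))

    cv = lambda xs: statistics.stdev(xs) / statistics.mean(xs)
    # cv(single) is ~0.10 while cv(combined) is ~0.05: the geo-mean has
    # roughly half the relative noise of any individual measure.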

I attended Vicky's performance team meeting, and my conclusions on what to build have not changed:

  • Engineers require dashboards that are fine-grained and measure what they can control; they should be directed to Perfherder whenever possible.
  • The performance sheriffs are interested in raising issues on regressions: This is different from the engineers; sheriffs care why there is a regression only insofar as it helps them identify the engineer who can fix the problem; otherwise the why is not relevant. What's important to sheriffs is controlling the inevitable volume of alerts and minimizing the number of false positives within reason. The geo-mean provides this: it reduces the number of tracked statistics, and it reduces variance so there are fewer false positives.
  • Management wants KPI tracking: these are usually large aggregates over many measures, smoothed over some time window. Management wants to know how far we are from our goal. A geo-mean of particular measures can be used here (our particular geo-mean is unlikely to be useful to management). The geo-mean is not valued for its reduced variance; rather, it is valued for combining multiple disparate measures into a single statistic. The health dashboard is where such aggregate numbers can/should be reported.

So, keeping in mind we have three different categories of customer, we should not change TP6's suite geo-mean calculation; our team (igoldan) will use it.

We can turn on alerting for the subtest measures the engineers are interested in. I am concerned that these subtests will have too much noise, produce too many false positives, and will increase our sheriffing load with little benefit. I suspect our team will alert on regressions in the geo-mean, highlight the particular measure that caused the geo-mean regression, and then inform the engineer of what happened. I will leave it to the perf sheriffs to figure that out.

Flags: needinfo?(klahnakoski)

Ok, thank you for the excellent feedback guys! I'll leave it as-is regarding the geomean, and just add a new option for 'alert_on'.

Here's an example of the output for tp6-google (with just 2 page-cycles) with 'alert_on = fcp, loadtime' set in the test INI:

15:44:08 INFO - raptor-output PERFHERDER_DATA: {"framework": {"name": "raptor"}, "suites": [{"extraOptions": [], "name": "raptor-tp6-google-firefox", "lowerIsBetter": true, "alertThreshold": 2.0, "value": 569.9, "subtests": [{"name": "raptor-tp6-google-firefox-fcp", "lowerIsBetter": true, "alertThreshold": 2.0, "value": 303, "shouldAlert": true, "replicates": [558, 303], "unit": "ms"}, {"name": "raptor-tp6-google-firefox-fnbpaint", "lowerIsBetter": true, "alertThreshold": 2.0, "value": 303, "replicates": [558, 303], "unit": "ms"}, {"name": "raptor-tp6-google-firefox-hero:hero1", "lowerIsBetter": true, "alertThreshold": 2.0, "value": 790, "replicates": [1228, 790], "unit": "ms"}, {"name": "raptor-tp6-google-firefox-loadtime", "lowerIsBetter": true, "alertThreshold": 2.0, "value": 1305, "shouldAlert": true, "replicates": [1811, 1305], "unit": "ms"}, {"name": "raptor-tp6-google-firefox-dcf", "lowerIsBetter": true, "alertThreshold": 2.0, "value": 291, "replicates": [545, 291], "unit": "ms"}, {"name": "raptor-tp6-google-firefox-ttfi", "lowerIsBetter": true, "alertThreshold": 2.0, "value": 1241, "replicates": [1776, 1241], "unit": "ms"}], "type": "pageload", "unit": "ms"}]}

Pushed by rwood@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/9a008f412fd7
Add raptor test option to specify what measurement(s) to alert on; r=davehunt
https://hg.mozilla.org/integration/autoland/rev/07500fee706b
Add raptor test option to specify what measurement(s) to alert on; r=davehunt
Status: ASSIGNED → RESOLVED
Closed: 5 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla66