Talos: We need higher resolution regression alerts

Status

RESOLVED FIXED

People

(Reporter: avih, Unassigned)

Tracking

({ateam-talos-task})

Trunk
Firefox Tracking Flags

(Not tracked)

Details

(Reporter)

Description

4 years ago
Some of our tests combine many sub-tests into a single test score: e.g. TP5 with many page scores, TART with many different animation cases, and possibly the WebGL test (bug 1020663), which produces 67 different sub-results with non-uniform magnitudes.

Currently, our regression alerts (both the m.dev.tree-management email alerts and the regression bugs which are filed manually) are based on the formula by which graphserver processes all the sub-results from a full talos test run into a single number:

- A talos run of a test consists of N executions of each sub-test (mostly N==25).
- Take median for each N results per sub-test.
- Ignore the highest of those medians.
- Average the rest <-- the "final" number by which we calculate regressions and which graphserver displays in its graphs.
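For reference, those steps can be sketched in a few lines of Python (the function name is hypothetical; it only mirrors the list above, not graphserver's actual code):

```python
from statistics import median

def graphserver_score(subtest_runs):
    """Collapse one talos run into the single "final" number.

    subtest_runs maps sub-test name -> list of its N raw results.
    """
    # 1. Take the median of the N results for each sub-test.
    medians = [median(runs) for runs in subtest_runs.values()]
    # 2. Drop the single highest median, treated as an outlier.
    medians.remove(max(medians))
    # 3. Average the rest: the number graphserver displays and alerts on.
    return sum(medians) / len(medians)
```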

This formula was, AFAIK, mostly designed to handle TP5-like tests, where the page load times are similar between pages, no page has any particular importance, and therefore the page which performed the worst was dropped as an outlier, and the rest were averaged.

This formula is not good for tests where each sub-test was designed specifically to reflect a well-intended perspective, and where each sub-test is meaningful independently (e.g. TART/CART/WebGL).

We've already seen cases where the average showed a relatively small regression which was due to a single sub-test with a really big and meaningful regression (e.g. bug 1004429, which regressed the newtab animation by ~40%, was perceived as a 5% regression because it only regressed 4 of TART's 30 sub-results).

This bug is about being able to detect (and display, etc.) meaningful regressions which manifest in one or a few sub-results of a test, when the sub-test is meaningful independently.

I'm not sure exactly which form this high-res detection should take, or which technical approach to use.

We could use the backend of either graphserver or datazilla for this. While graphserver displays only this average, if it also happens to store all the individual sub-results per run, then it might be a good starting point. Otherwise, maybe datazilla would be more suitable for this.

As far as the alerts themselves go, I think we could use something like this:
- If the average (of all sub-tests) regressed more than A%, issue an alert on the average.
- Regardless, if a single sub-test regressed considerably more (TBD) than the average did, then issue an alert which includes the most regressed sub-test and also the 2nd most regressed sub-test, together with the average regression.
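A minimal sketch of those two rules (all names and both threshold values here are illustrative placeholders; the "A%" and "considerably more" factors are exactly the TBD parts):

```python
def propose_alerts(avg_pct, subtest_pct, a_threshold=3.0, rel_factor=2.0):
    """Sketch of the two proposed alert rules.

    avg_pct: regression of the overall average, in percent.
    subtest_pct: sub-test name -> its own regression, in percent.
    """
    alerts = []
    # Rule 1: the average itself regressed more than A%.
    if avg_pct > a_threshold:
        alerts.append("average regressed %.1f%%" % avg_pct)
    # Rule 2: some sub-test regressed considerably more than the average;
    # report the two most-regressed sub-tests alongside the average.
    ranked = sorted(subtest_pct.items(), key=lambda kv: kv[1], reverse=True)
    if ranked and ranked[0][1] > rel_factor * avg_pct:
        top_two = [name for name, _ in ranked[:2]]
        alerts.append("subtests %s regressed (average: %.1f%%)"
                      % (", ".join(top_two), avg_pct))
    return alerts
```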

Any ideas on how to approach this?
(Reporter)

Comment 1

4 years ago
Regardless of higher resolution: if we want to give each sub-test an equal weight (and we usually do) while each of them sits in a different range of values, as often happens with tests whose sub-tests are independently meaningful, we could use the geometric mean instead of the average (of all the medians, and without dropping the highest).

This should improve our tracking practically for free. Assuming, of course, that we don't already do it.
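A sketch of that alternative (hypothetical function name, same input shape as before):

```python
from math import exp, log
from statistics import median

def geomean_score(subtest_runs):
    """Geometric mean of the per-sub-test medians, dropping nothing.

    subtest_runs maps sub-test name -> list of its N raw results.
    """
    medians = [median(runs) for runs in subtest_runs.values()]
    # Computed in the log domain: exp(mean(log(m))).
    return exp(sum(log(m) for m in medians) / len(medians))
```

The point of the geometric mean is that doubling any one sub-test multiplies the score by the same factor regardless of that sub-test's absolute magnitude, so a 40 ms animation case and a 2000 ms page carry equal weight.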
Comment 2

4 years ago
Looking into this, we send this data (using tsvgx as an example) to graph server:
START
VALUES
qm-pxp01,tsvgx,,a49657b0b23b,20140603105048,1401971200
0,209.00,gearflowers.svg
1,42.00,composite-scale.svg
2,41.00,composite-scale-opacity.svg
3,41.00,composite-scale-rotate.svg
4,41.50,composite-scale-rotate-opacity.svg
5,388.00,hixie-001.xml
6,405.00,hixie-002.xml
7,148.00,hixie-003.xml
8,320.50,hixie-004.xml
9,1322.50,hixie-005.xml
10,2199.50,hixie-006.xml
11,1823.00,hixie-007.xml
END

Here is the data we send to datazilla:
[{"test_machine": {"platform": "x86_64", "osversion": "Ubuntu 14.04", "os": "linux", "name": "qm-pxp01"}, "testrun": {"date": 1401971200, "suite": "tsvgx", "options": {"responsiveness": false, "tpmozafterpaint": false, "tpchrome": true, "tppagecycles": 25, "tpcycles": 1, "tprender": false, "shutdown": false, "extensions": [{"name": "pageloader@mozilla.org"}], "rss": false}}, "results": {"gearflowers.svg": [237.0, 211.0, 200.0, 205.0, 210.0, 212.0, 448.0, 206.0, 205.0, 196.0, 200.0, 203.0, 192.0, 196.0, 204.0, 213.0, 216.0, 198.0, 204.0, 214.0, 221.0, 216.0, 215.0, 217.0, 220.0], "hixie-007.xml": [1871.0, 1827.0, 1833.0, 1816.0, 1813.0, 1808.0, 1815.0, 1802.0, 1804.0, 1808.0, 1800.0, 1810.0, 1805.0, 1818.0, 1861.0, 1879.0, 1833.0, 1842.0, 1845.0, 1824.0, 1822.0, 1832.0, 1836.0, 1878.0, 1844.0], "hixie-003.xml": [181.0, 149.0, 150.0, 152.0, 151.0, 152.0, 146.0, 160.0, 144.0, 146.0, 157.0, 142.0, 143.0, 142.0, 153.0, 153.0, 142.0, 148.0, 147.0, 152.0, 154.0, 141.0, 164.0, 148.0, 154.0], "composite-scale-rotate.svg": [42.0, 40.0, 48.0, 39.0, 36.0, 41.0, 41.0, 40.0, 39.0, 47.0, 54.0, 40.0, 40.0, 42.0, 41.0, 38.0, 39.0, 39.0, 44.0, 43.0, 39.0, 42.0, 42.0, 43.0, 40.0], "hixie-006.xml": [2180.0, 2189.0, 2216.0, 2242.0, 2165.0, 2133.0, 2234.0, 2171.0, 2175.0, 2185.0, 2187.0, 2224.0, 2215.0, 2163.0, 2185.0, 2213.0, 2157.0, 2244.0, 2152.0, 2181.0, 2226.0, 2226.0, 2233.0, 2212.0, 2235.0], "composite-scale-rotate-opacity.svg": [45.0, 41.0, 38.0, 44.0, 47.0, 46.0, 40.0, 40.0, 45.0, 47.0, 39.0, 40.0, 43.0, 38.0, 41.0, 42.0, 43.0, 44.0, 44.0, 41.0, 41.0, 41.0, 40.0, 44.0, 42.0], "composite-scale.svg": [80.0, 39.0, 42.0, 37.0, 41.0, 41.0, 41.0, 44.0, 43.0, 40.0, 37.0, 40.0, 42.0, 42.0, 42.0, 44.0, 41.0, 42.0, 44.0, 42.0, 43.0, 38.0, 39.0, 42.0, 40.0], "hixie-005.xml": [1354.0, 1314.0, 1310.0, 1301.0, 1320.0, 1339.0, 1325.0, 1328.0, 1330.0, 1317.0, 1317.0, 1331.0, 1320.0, 1299.0, 1325.0, 1309.0, 1305.0, 1350.0, 1317.0, 1352.0, 1353.0, 1316.0, 1333.0, 1313.0, 1317.0], 
"hixie-001.xml": [406.0, 396.0, 388.0, 403.0, 411.0, 434.0, 398.0, 384.0, 387.0, 385.0, 392.0, 390.0, 394.0, 388.0, 409.0, 377.0, 385.0, 387.0, 388.0, 381.0, 397.0, 391.0, 385.0, 382.0, 388.0], "hixie-002.xml": [384.0, 394.0, 377.0, 388.0, 397.0, 415.0, 404.0, 411.0, 402.0, 411.0, 406.0, 413.0, 409.0, 407.0, 417.0, 389.0, 388.0, 398.0, 393.0, 389.0, 384.0, 400.0, 399.0, 408.0, 409.0], "hixie-004.xml": [300.0, 296.0, 332.0, 290.0, 346.0, 355.0, 344.0, 351.0, 341.0, 323.0, 320.0, 298.0, 296.0, 333.0, 288.0, 321.0, 341.0, 296.0, 298.0, 330.0, 321.0, 314.0, 290.0, 309.0, 317.0], "composite-scale-opacity.svg": [46.0, 42.0, 42.0, 40.0, 42.0, 42.0, 42.0, 39.0, 42.0, 40.0, 39.0, 41.0, 40.0, 40.0, 43.0, 43.0, 43.0, 40.0, 41.0, 40.0, 38.0, 38.0, 44.0, 46.0, 44.0]}, "test_build": {"version": "32.0a1", "revision": "a49657b0b23b", "id": "20140603105048", "branch": "", "name": "Firefox"}}]

You can see that for each page, we ignore the first 5, then take the median of the remaining 20; that is the single number we report to graph server.
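So the per-page value that reaches graph server reduces to (a sketch; the warm-up count of 5 is taken from the description above, and the function name is hypothetical):

```python
from statistics import median

def page_value(replicates, warmup=5):
    """Drop the first `warmup` cycles of a page (warm-up noise), then
    take the median of the remaining replicates; this is the single
    per-page number sent in the VALUES block."""
    return median(replicates[warmup:])
```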

Here is the code that takes the post data and does something with it; you can see we key off of 'VALUES':
http://hg.mozilla.org/graphs/file/217745463427/server/pyfomatic/collect.py#l272

That moves us to:
http://hg.mozilla.org/graphs/file/217745463427/server/pyfomatic/collect.py#l175

and in line 204, we actually insert each single value. 


In summary we have access to a data point for each page, we can consider alerting if a single page has regressed greatly.

Some thoughts:
* should we stop ignoring the max value and just average all pages together?
** if we don't do that should we consider posting a dummy page that has a value of 1000000 or something?
* this would greatly increase the time to analyze data; could we be smarter about this and not do it for all pages, etc.?
(Reporter)

Comment 3

4 years ago
(In reply to Joel Maher (:jmaher) from comment #2)
> You can see that for each page, we ignore the first 5, then take the median
> of the remaining 20, that is the single number we report to graph server.

I can't see it from these outputs, but it's aligned with what I know.
 
> and in line 204, we actually insert each single value. 

I'll take your word. I don't trust my own parsing of that code.

> In summary we have access to a data point for each page, we can consider
> alerting if a single page has regressed greatly.

This is great. In my mind, this (having good enough data for this on graphserver) was a big question mark.

> Some thoughts:
> * should we stop ignoring the max value and just average all pages together?

I've considered it in the past, and I think we should. However, this could bring a new issue where the highest value is of a different magnitude than the others, and therefore gets a much higher weight in the average.

Which is where comment 1 offers a solution (geometric mean).

Regardless of which change we decide on, however, it's important to remember that once this change has taken place, all the new results will _not_ be comparable to older results, due to the different formula which produces them.

However #2, since we have all the historic data anyway, the new formula could also be applied to historic data (e.g. at graphserver), which will make newer results comparable to older results and would just have a different view for the old/new results.

> ** if we don't do that should we consider posting a dummy page that has a
> value of 1000000 or something?

Yes. This is also something which came up in the past. It doesn't _need_ to be a fixed value; it could be 10% higher (or lower, depending on whether higher is worse for this test) than the highest value.

> * this would greatly increase the time to analyze data, could we be smarter
> about this and not do it for all pages, etc.?

How much of an issue/bottleneck is this processing? If it turns out to be a real issue, then sure, we could do this better processing less frequently, but personally I think we should start without such optimizations and see how bad it gets, if at all.
Comment 4

4 years ago
Right now the process hangs often (maybe once/month). I doubt it uses up 100% CPU load, but alerting per page would probably add 10 times the processing to the system (think 49 pages for tp5o, we would have to compare all of those).

Would we want to compare the pages for a sustained regression as we do the *average*?

Assuming all is well, this would greatly increase the volume of alerts. We have already needed to build a webapp to manage the alerts; we could adapt it to work with the larger volume. My gut tells me (from data I get on datazilla alerts) that we will see 4 times the alerts.

Any thoughts on this?
* limit the higher resolution to specific tests?
* ensure we have a solid webapp (treeherder) to handle this?
* only alert if a single page is >10% (or xx%) (probably <twice the volume)
* would we need to map the alert to the test name (i.e. alipay.com regresses, this means tp5 needs to run)
* how do we handle a bimodal test page?
(Reporter)

Comment 5

4 years ago
(In reply to Joel Maher (:jmaher) from comment #4)
> ... but alerting per page would probably add 10 times the
> processing to the system (think 49 pages for tp5o, we would have to compare
> all of those).

No, TP5 doesn't need this kind of processing, and therefore could be left using the old processing method.

> Would we want to compare the pages for a sustained regression as we do the
> *average*?

Not sure I understand the question. Rephrase please.

> Assuming all is well, this would greatly increase the volume of alerts. 

I don't think so. Or at least not necessarily, and regardless, we could adjust the thresholds to keep the same number of alerts, if this issue matters.

What we would hopefully get is higher-quality alerts (both in what they catch and in how they report).

> Right now we have needed to build a webapp to manage the alerts- we could
> adapt this to work with the larger volume.

Again, not sure I'm following. Are you saying we've built (and are now using) some web app which was built specifically to analyze results? And that this would need some modification - to handle a different volume of alerts (which doesn't _have_ to be bigger), or for some other reason?


> Any thoughts on this?
> * limit the higher resolution to specific tests?

Absolutely.

But first let's list the potential changes which some tests could benefit from:
1. Don't drop the highest median.
2. Use geometric mean instead of average (assuming currently we do average).

These two alone will improve the quality of the graphs and alerts without changing anything else.

3. When reporting and checking for regressions, also check the values of sub tests.

#3 is the only thing which requires more processing etc.


- TP5 definitely doesn't need such high res.

- tscrollx and tsvgx are not terrible as-is, but we've already seen cases of missed tsvgx regressions because only one page regressed (either because it was dropped as an outlier or because it didn't affect the average enough). So they would definitely benefit from 1+2, but maybe we could skip 3.

- TART and CART (and possibly WebGL) - would benefit from 1+2 and also from 3.

- tresize - I _think_ it doesn't need highres, but I'll need to examine it.

- Other tests: not sure, I'll need to examine.

Regardless, we can decide which tests will get this extra processing, and right now I think it's most important for only a few tests.

> * ensure we have a solid webapp (treeherder) to handle this?

I don't know what this is...

> * only alert if a single page is >10% (or xx%) (probably <twice the volume)

I'd say the exact same process which we currently apply to the average (with noise level/Z etc.), just also for each sub-test, and trigger an alert if it's above some threshold, possibly a relative one, e.g. if a specific sub-test regressed more than twice what the average regressed.

> * would we need to map the alert to the test name (i.e. alipay.com
> regresses, this means tp5 needs to run)

The report will need to display the sub test name, yes. What does "this means tp5 needs to run" mean?

> * how do we handle a bimodal test page?

The same we handle them now: 1. try to not have bimodal results. 2. ignore them if they're not reliable. 3. try to still use them somehow even while being bimodal.
Comment 6

4 years ago
>> Would we want to compare the pages for a sustained regression as we do the
>> *average*?
>
>Not sure I understand the question. Rephrase please.

Currently the single-point average of all the pages is compared with pre/post data points to verify a sustained regression; would we do the exact same process for each individual page?
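That pre/post comparison could look roughly like this per page (window size and threshold here are made up for illustration; graphserver's actual sustained-regression analysis has its own parameters):

```python
from statistics import mean

def sustained_regression(history, window=6, threshold_pct=3.0):
    """Compare the mean of `window` points before a suspect change with
    the mean of the `window` points after it; flag a sustained
    regression when the increase exceeds threshold_pct (higher = worse).

    history: the values around the suspect point, oldest first.
    """
    if len(history) < 2 * window:
        return False  # not enough data on both sides yet
    before = mean(history[:window])
    after = mean(history[window:2 * window])
    return (after - before) / before * 100.0 > threshold_pct
```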

----

The alert UI (http://54.215.155.53:8080/alerts.html#) is very primitive and works for the volume and style of alerts we have now. Assuming per-page alerts, we would have to adjust it, and with a larger volume of alerts we might need to work the user interface differently. I am working on that for the datazilla alerts, so this might not be a big issue.

----

Getting graph server to alert on specific pages and tests with different thresholds is doable, but it will be some work. I don't imagine it to be too easy, and there will be some lag between changes and live deployment.

----

>> * would we need to map the alert to the test name (i.e. alipay.com
>> regresses, this means tp5 needs to run)
>
>The report will need to display the sub test name, yes. What does "this means tp5 needs to run" mean?

If iframe.svg fails, we need to let people know that it is part of the test tsvgx so they can run it locally; more code will be needed in the alerting or in the talos harness to support this.

----

If we were to alert on pages from graph server, we would need a UI for it. Right now datazilla provides that, if you can get it to load data. Without a history of a specific page, an alert loses importance: a developer wants to see that the regression is real and how relevant it is, and whether the change is at their changeset or another.

Similarly, if we retrigger on tbpl, we would need a quick way to view the subtest results. Right now you can retrigger on tbpl and look at the summary pane to see the reported number by clicking on each completed instance.


----

Still, the suggestion of the geometric mean for all the pages (instead of dropping one and averaging the rest) seems to be by far the most effective. While I want per-page alerts, I think it is too early to implement that.
(Reporter)

Comment 7

4 years ago
(In reply to Joel Maher (:jmaher) from comment #6)
> currently the single point average of all the pages is compared with
> pre/post data points to verify a sustained regression, would we do the exact
> same process for each individual page?

I would think so, yes. The regression detection system (i.e. function which looks at previous and possibly future data points and decides if the current point is a meaningful deviation) should not be modified IMO. Only applied, on top of the regular average tracking, also to individual sub-results.

> The alert UI (http://54.215.155.53:8080/alerts.html#) is very primitive and ..

> getting graph server to alert on specific pages, tests with different
> thresholds is doable, but it will be some work.  I don't imagine it to be
> too easy and there will be some lag between changes and live deployment.

> if iframe.svg fails, we need to let people know that is part of the test
> tsvgx so they can run it locally- more code will be needed in the alerting
> or in the talos harness to support this.

> If we were to alert on pages from graph server we would need a UI for it.

Right, so I'm guessing that all those are systems which use the alerts. Depending on how these systems are implemented, some or all of them would need to be modified for this.

What I had in mind was mostly the dev.tree-management mailing list and the regression bugs which you file. Would we be able to limit the scope of this high-res detection to only those two? would this reduce the amount of work for this?


> Similarly if we retrigger on tbpl, we would need a quick way to view the
> subtest results.  Right now you can retrigger on tbpl and look at the
> summary pane to see the reported number which clicking on each completed
> instance.

It would indeed, and the datazilla link would provide it because it displays individual sub-results. tbpl currently doesn't show regressions anyway - just numbers. We could modify tbpl to show sub-results as well, but TBH, I don't think it's required at this stage (and possibly never).


> Still the suggestion of the geometric mean for all the pages (instead of
> drop one and average the rest) seems to be by far the most effective.

I don't know if most, but quite important, yes.

> While I want per page alerts, I think it is too early to implement that.

Too early why? We already have cases where it would help, the tests which benefit from this are already running for many months... and we could have been benefiting from such higher resolution right now if it was working. How is it early?
Comment 8

4 years ago
It is too early because we have no way to view the history of an individual test page. Datazilla is not a solution and won't be for the short term; the webpages are not loading consistently (if at all). Also, the rest of the tooling needs to be in place.

I see these problems:
* I am the only one who understands these and files bugs
* Without a proper UI and method for testing on try to understand the results, developers will just be confused and distrust all of talos or absolve themselves of responsibility for reported regressions.
(Reporter)

Comment 9

4 years ago
We already use datazilla when only some of the sub-results have changed, which happens occasionally. It's not perfect yet, but it's the best we have right now. Compare-talos is also available as an online tool to compare sub-results effectively between two builds.

I'm guessing it should be possible to modify it (if at all required) to show sub-results from a single cset (i.e. without comparing to another cset). Matt?

True, graphserver doesn't currently display sub-results, but changing graphserver's display is not a requirement IMO for us to be able to detect regressions in high-res.

And if you think it's absolutely a requirement, then let's fix that too, and if you can't fix it, then we'll either find someone who can, or I'll give it a go myself.

It's certainly not too early as far as our need for it goes.
Flags: needinfo?(MattN+bmo)
(Reporter)

Comment 10

4 years ago
Regardless, the geometric mean and not disregarding the highest sub-result indeed probably requires much less effort, and only involves changing one formula (probably at graphserver, but possibly other places which duplicate it) - for tests we decide should be calculated this way (e.g. not for TP5), and the rest will happen automatically.

Can we declare this as stage one, and in parallel unlift the blocks for stage 2 which is the high-res regressions detection?
(Reporter)

Comment 11

4 years ago
errr.. s/unlift/lift/
Comment 12

4 years ago
stage 1:
* will we rename the tests as the numbers could change significantly?
* this should apply to all tests, not just webgl or tsvgx, any objections?

Here is where we ignore the max and average the results;
http://hg.mozilla.org/graphs/file/217745463427/server/pyfomatic/collect.py#l212

basically we do this:
select avg(value) from test_run_values where test_run_id = %s and value not in (select max(value) from test_run_values where test_run_id = %s)

instead we could do this:
select exp(sum(log(value))/count(value)) from test_run_values where test_run_id = %s
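That SQL expression is the log-domain form of the geometric mean, exp(mean(log(x))), which also avoids overflow when many values are multiplied. A quick check that it matches the direct definition, using a few of the tsvgx page medians above:

```python
from math import exp, log, prod

values = [209.0, 42.0, 388.0, 1322.5]  # sample per-page medians from above

# Log-domain form, as in the SQL: exp(sum(log(v)) / count(v)).
log_form = exp(sum(log(v) for v in values) / len(values))

# Direct definition: the n-th root of the product.
direct = prod(values) ** (1.0 / len(values))

assert abs(log_form - direct) < 1e-9 * direct  # equal up to rounding
```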

Actually, doing this on new test names sounds like the right thing to do. This way we can have a solid history (not a prepopulated history of averages).

If we do that, maybe we could create a new function in graph server to process the data which does the geomean and we would call that if we send data of type GEOVALUES (instead of type VALUES):
http://hg.mozilla.org/graphs/file/217745463427/server/pyfomatic/collect.py#l266

Then the tests we want to report this can be modified in talos to be 'tsvg_geo' or something like that.

As for round 2, even if we disagree about the requirements of tools we can at least do some of the pre-requisites.
(Reporter)

Updated

4 years ago
Depends on: 1021842
(Reporter)

Comment 13

4 years ago
(In reply to Joel Maher (:jmaher) from comment #12)
> stage 1:
> * will we rename the tests as the numbers could change significantly?
> * this should apply to all tests, not just webgl or tsvgx, any objections?

I've filed bug 1021842 for stage 1. Let's continue this discussion there and keep this bug for the "proper" high-resolution detection.


> As for round 2, even if we disagree about the requirements of tools we can
> at least do some of the pre-requisites.

OK, what are the prerequisite items as far as you're aware? Please mention which you think are really important (and why), and which you think could maybe wait for a later stage.

Do keep in mind that for people who bump into regressions only at individual sub-tests (which is not rare at all), using either datazilla or compare-talos is already a requirement. We file regression bugs, and they can't use graphserver or tbpl to track their try-pushes (because they don't show the stuff their pushes affect), so they _have_ to use datazilla or compare-talos already today.
(Reporter)

Comment 14

4 years ago
Also, we could start with applying the high-res detection and output it (wherever we do regressions detection) into some side-log without actually changing the current notifications to the mailing list or the regression bugs which we file.

This will allow us to understand what changes are required and where - to apply this high-res detection, and also evaluate its outputs without actually affecting anything else.

Joel, what say you? sounds good as a first step for this?
Flags: needinfo?(jmaher)
Comment 15

4 years ago
I think a win/win here would be to create a new table in the database that stores all the alerts, with a simple JSON interface to query them. No other logic is needed, just the raw data we send to dev.tree-management. This would benefit my UI greatly (no need to sync up with the newsgroup, and I could get more accurate alerts).

In addition this would store per page regressions.  The only consumer for the time being will be the alerts UI, so there would be no confusion in emails.  We could let this sit for a few weeks and quantify the volume of alerts.

My only concern is the load this puts on the server.  We should be careful about that and design this with a way to throttle (i.e. ignore pages in list or only do pages in list)
Flags: needinfo?(jmaher)
(Reporter)

Comment 16

4 years ago
I'm not following.

As far as I understand, the graph server already stores all the data we need in order to detect regressions in higher resolution - these are the sub-tests results (or medians thereof).

All we need to do, on top of the "normal" detection, is apply exactly the same procedure also to sub-results (which are already stored at the db), possibly with slightly higher thresholds. That's it.

Why a new table? What does it mean to "store an alert"? To what info expressed so far does "win/win" refer?
Comment 17

4 years ago
AFAIK the graph server does not store any data about alerts; it just does a calculation on the data and sends email if it detects a regression. So if we could put these alerts into a database that we can query in JSON format, we could do this more easily without affecting developers.
(Reporter)

Comment 18

4 years ago
Joel and I just had a chat on IRC. We both agree that the process should be as follows:

1. Implement the new detection code.
2. Put its output someplace such that we can evaluate and tune it.
3. Replace the old alerts once we're happy with it.

Since 1 will be no-op without 2, comment 15 tries to first solve 2.

FWIW, I don't care much to which system it reports as long as we can examine its output and tune the system. Though it should be a temporary thing because step 3 will use the new alerts instead of the old ones.

Joel currently uses this "Talos Alerts UI" page: http://54.215.155.53:8080/alerts.html - It parses newsgroup posts and generates the UI to manage the alerts (tracking them, filing regression bugs, etc.).

We both agreed that the high-res detection code could report to exactly the same newsgroups, only with a title tag like "[ignore - evaluating] xx% regression.. blah blah". The current UI could then ignore it; OTOH, it could be extended to allow showing either old or new alerts with some UI switch, and eventually, once we're happy, it will only be used with the new high-res alerts.

Emails to patch authors could either be suppressed altogether or sent with this title tag so hopefully they'll get ignored.

Anyway, the title tag hack gives us the storage (or "temp side log", etc.) without adding a new table to the DB, and the UI could be adapted to process it lazily.
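The filtering side of that hack could be as small as a prefix check on the post subject (the tag text is taken from the comment above; the function name is hypothetical):

```python
EVAL_TAG = "[ignore - evaluating]"

def is_evaluation_alert(subject):
    """True for high-res alerts posted under the evaluation title tag,
    so the current alerts UI and email consumers can skip them."""
    return subject.lstrip().startswith(EVAL_TAG)
```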

Back to square one then? actually calculate the new regressions? :)
(Reporter)

Comment 19

4 years ago
Apparently there's been very similar work making progress on back channels (github, irc, emails, https://etherpad.mozilla.org/SignalFromNoise , etc), which is behind bug 962378.

The goal seems identical to this bug - be able to detect regressions using more data than just "the average of sub-results".

That other bug aims at using datazilla though rather than graphserver (even if this bug is not about graphserver or datazilla but generally about detecting regressions better).

Still not sure if they should be combined or not, but I just became aware of it, so I decided to post.

Kyle, Joel, do you think we should merge these bugs into one and focus our efforts productively only on one of the approaches?
Flags: needinfo?(klahnakoski)
Flags: needinfo?(jmaher)
See Also: → bug 962378
Comment 20

4 years ago
I don't think we can merge these bugs quite yet, for a couple of reasons:
1) Datazilla is not reliable yet; depending on alerts that reference and point to a system that doesn't provide a stable UI is not a step forward.
2) Mozilla is familiar with graph server; we can use these per-page alerts as a bridge to migrate to datazilla.

I see it working this way:
* continue to refine our alerts on datazilla
* tweak our UI to work with per page alerts (applicable to datazilla + graph server)
* when we file bugs that are per page specific, it will get people using datazilla, compare-talos, etc. (i.e. not graph server)
* once we have validated that datazilla alerts are >= graph server alerts, we should have a full performance data interface in treeherder
* Q3 we can lay the foundation for transitioning fully to Datazilla
* Q4 we can do the transition
* Q1 we can turn off graph server for good 

This means developing alerts in parallel for a while.  A lot of the lessons learned to date on datazilla will reduce the amount of work on graph server.  Likewise the more eyes on alerts from graph server will find the remaining corner cases in the datazilla alerts.
Flags: needinfo?(jmaher)
(Reporter)

Comment 21

4 years ago
The reasons I think they should be combined are as follows:

1. They both aim to solve _exactly_ the same problem.
2. The considerations on how to evaluate discrete per-sub-result regressions and how to send the alerts are identical.

The fact that one (this) was focused on graphserver as backend and the other on datazilla is purely circumstantial (because I wasn't aware of the other effort and graphserver seems to have enough data for this).

It shouldn't really matter which backend we use for this system, because the result would/should be the same. Any difference due to still more data at datazilla (like the individual 25 runs of each page) is _probably_ not meaningful enough to make these 2 different projects.

Also, this bug discusses the problem in detail and publicly suggests and weighs approaches for a solution; the other project doesn't seem to have (as far as I'm aware at this stage) any kind of design, goals, or evaluation steps that were discussed publicly in a "normal" bug.

And since this will affect many people, it deserves all the attention it can get, and keeping these efforts independent would be a plain waste of time.

I think we should:

1. Decide and agree on a method to detect "high-res" alerts for both this and the other bug.

2. Evaluate the product of this system (as apparently happened to some degree) and refine it to a level we all agree on.

3. maybe implement the above with both graphserver and datazilla backends independently. _Maybe_. The big effort should be at the system level and methods. Not the backend used.
Comment 22

4 years ago
So high-res alerts on graph server give us data, but no way to view it. We would have to view the raw data on Datazilla. We could code up some way for graph server to display per-page information, or push to get datazilla to return results a bit faster in the UI and continue to use that.

Either way we need a way to view the alerts, filter them, and take action on them. I am starting to agree that one way is the right way; maybe, since we are halfway there with datazilla alerts, go for it and live with it.

Kyle what do you think?
(Reporter)

Comment 23

4 years ago
You're talking details, I'm talking essence.

The fact that graphserver and datazilla are different shouldn't matter when we design the system. High res alerts are useful regardless of the back end used to generate them, though obviously they'll be more useful if they can be examined after they're generated.

These high res regressions could be generated either using graphserver backend or datazilla backend, and viewing them right now is only possible with datazilla or compare-talos.

But again, these are implementation details.

We shouldn't have two projects solving exactly the same problem with different backends.
(Reporter)

Comment 24

4 years ago
(In reply to Avi Halachmi (:avih) from comment #23)
> We shouldn't have two projects solving exactly the same problem with
> different backends.

Or, at the very least, they should share everything except how they interact with the backend. "Everything" here means how we define a per-sub-result alert, how many alerts are generated, what kinds of thresholds are used, etc.
Very true.  I have been looking into datazilla and the UI will be deprecated soon.  It is realistic that a beta version, integrated into the tbpl replacement treeherder, will be online in July.  By the end of Q2 it is realistic that we will have per-page graphing tools built into treeherder, which will make performance debugging much easier and solve the majority of our UI problems.  This is a priority project and is actively being worked on right now.

For the short term, we would be duplicating work by writing a new UI on graph server, and nobody is going to work on datazilla to make it snappier.
(Reporter)

Comment 26

4 years ago
(In reply to Joel Maher (:jmaher) from comment #25)
> for the short term, we would be duplicating work writing a new UI on graph
> server and nobody is going to work on datazilla to make it snappier.

If you're convinced that this detection would be meaningless as long as we don't have a proper UI for it, and also convinced that we won't have a proper UI for it until either datazilla stabilizes (which you can't see happening anytime soon) or treeherder can display it (by which time the other project will already be deployed), then I don't see any reason not to wontfix this bug.

Otherwise, I don't understand what kind of work you think should take place in this bug, and why.

I only wish I had known all of this right after the bug was filed, or preferably even earlier. It would have saved us lots of time spent discussing a solution for a problem which is already in the final stages of being solved elsewhere, and where all or most of these decisions have already been made.
In comment #4 and comment #6 I mention datazilla alerts.  This is the right time to start making decisions for these alerts, now that we have a way to reliably get the data and perform operations on it.  What we need are rules:
* what branches
* what tests
* what pages
* improvements or just regressions
* what algorithm
* what window size
* per test only, or per suite as well
* thresholds for each alert
* how to handle bimodal data (i.e. v8, dromaeo, etc.)
* noisy pages such as alipay.com in tp5
* content of the alerts

I am not saying we require a UI to do these alerts, but this is not scalable to a sheriff role or a self-managed developer role without a UI in place.  I want a plan for a UI in place while we investigate and tweak the alerts.

If we are looking to roll these out in 6 weeks, then let's put down answers for the questions above and start watching the alerts.  Within a week we will have enough data to know whether we are on the right path or not.

The need for these alerts happens to come at the right time.  Four weeks ago it might not have been possible to drive on datazilla alerts, but now we are in the right position to make these adjustments.
(Reporter)

Comment 28

4 years ago
(In reply to Joel Maher (:jmaher) from comment #27)
> If we are looking to roll these out in 6 weeks, then lets put answers down
> for these questions above and start watching the alerts.  Within a week we
> will have enough data to know if we are on the right path or not.
> 
> The need for these alerts happen to be at the right time.  4 weeks ago, it
> might not have been possible to drive on datazilla alerts, but now we are in
> the right position to make these adjustments.

So it sounds as if most of these questions were given at least some answers (maybe even all of them, if deployment is only a few weeks away).

Can we start by enumerating the answers to these questions as they manifest on the other project?

And again, is there any reason not to wontfix this and put all the effort into the other project?
We can wontfix this, but we will open a new bug with the details from this one, so let me respond with what I believe we have in place for some of the above questions:

* what branches
** currently: all branches but try, cedar
** ideal: all sheriff-monitored branches (this would ignore ash, jamon, and other experimental branches)

* what tests
** currently: all tests
** ideally: a simple configured list of tests we care about (tsvgx, tscrollx, cart, tart, tp5, a11y) ?
*** do we care about dromaeo, kraken, v8?

* what pages
** currently: all
** ideally: a configurable list to include/exclude specific pages (maybe this isn't needed)

* improvements or just regressions
** currently: only regressions are emailed out
** ideally: a UI to filter on unique regressions, but the ability to see improvements as well

* what algorithm
** currently: median test over the window size (kyle needs to confirm this)
** ideally: TBD

* what window size
** currently: 20 data points before/after
** ideally: a lower number like 10 (graph server uses 12)

* per test only, or per suite as well
** currently: only per-page regressions >5% are reported
** ideally: per page >10%, per suite >5%, TBD

* thresholds for each alert
** currently: 5%
** ideally: TBD

* how to handle bimodal data (i.e. v8, dromaeo, etc.)
** currently: no method (we have 2 problem pages: alipay.com in tp5 and iframe.svg in tsvgx)
** ideally: fix tests, firefox, harness, or just disable the specific page.

* noisy pages such as alipay.com in tp5
** see above about bimodal data.

* content of the alerts
** currently: link to datazilla, current changeset, previous changeset, information about values
** ideally: TBD
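
As a rough illustration of the "median test over the window size" detection described above: compare the median of the window before each data point with the median of the window after it, and alert when the relative change exceeds the threshold. This is only a sketch, not dzAlerts code; the window size (20) and threshold (5%) are the values quoted above, and the function name is made up:

```python
from statistics import median

# Illustrative sketch only (not dzAlerts code): flag points where the
# median of the after-window is more than `threshold` worse than the
# median of the before-window. Higher scores mean slower (worse) here.
def find_regressions(values, window=20, threshold=0.05):
    alerts = []
    for i in range(window, len(values) - window):
        before = median(values[i - window:i])
        after = median(values[i:i + window])
        if before > 0 and (after - before) / before > threshold:
            alerts.append(i)
    return alerts
```

Note that every window straddling a real step change will trigger, so a single regression produces a run of adjacent alert indices unless the detector deduplicates them.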
Just refinements to Joel's answers:

* what tests
** currently: all tests (except dromaeo, which has multiple sub-test results jammed into one array)

* what pages
** currently: all (and we can turn them on/off)

* improvements or just regressions
** currently: only regressions are emailed out (we can alert on regressions, improvements, or both with a switch)

* what algorithm
** currently: median test over the window size (CONFIRMED)
** ideally: piece-wise linear regression analysis (http://en.wikipedia.org/wiki/Segmented_regression)

* what window size
** currently: 20 data points before/after (globally adjustable; working on making it a per-test parameter)

* how to handle bimodal data (i.e. v8, dromaeo, etc.)
** currently: using median as our metric mitigates some of the bimodal problem by naturally selecting the most common mode. 
** ideally: using a mixture model (http://en.wikipedia.org/wiki/Mixture_model) will help characterize the two modes and help us construct a useful aggregate

* noisy pages such as alipay.com in tp5
** short term: with larger windows, noisy data should simply result in less sensitive regression detection.
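
A toy example of why the median mitigates bimodal data (the replicate values are invented: most runs land near a fast mode at ~100, with an occasional slow mode at ~300, e.g. a GC pause):

```python
from statistics import mean, median

# Invented replicate values: a fast common mode near 100 and an
# occasional slow second mode near 300.
replicates = [100, 102, 98, 300, 101, 99, 103, 300, 100, 97]

print(mean(replicates))    # 140.0 -- pulled up by the slow mode
print(median(replicates))  # 100.5 -- stays near the common mode
```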
Flags: needinfo?(klahnakoski)
(Reporter)

Comment 31

4 years ago
I suggest the following terminology:

- Suite: A group of tests which can be chosen to be executed together by tbpl. E.g. suite 's' is tsvgr_opacity, tsvgx, tscrollx, tart, cart.

- Test: like tscrollx or tart, etc.

- Sub-result: (old term: "page"). "Page" is pageloader/tp5 terminology, where each of those was indeed a real single page which the pageloader addon loaded and measured. Many new tests output several such "pages" which are actually sub-results of the test, where each one is independently meaningful. In those cases we shouldn't throw away outlier "pages", because they're not actually pages but meaningful sub-results.

Though since there's no contradiction with "page", it's OK to keep using this name, as long as we remember that its result is independently meaningful and often has nothing to do with any actual page.


(In reply to Kyle Lahnakoski [:ekyle] from comment #30)
> * what tests
> ** currently: all tests (except dromaeo, which has multiple sub-test results
> jammed into one array)

Other tests have sub-results as well, like tscrollx, tart, etc. Why can't we just "split" the dromaeo array into "normal" sub-results like we do with other tests, so datazilla can track them independently, etc.? I bet it's half an hour of work at most, on the test side alone.


> * what pages
> ** currently: all (and we can turn them on/off)

We should turn off tests which are meaningless or completely not useful, etc. From the tests I'm familiar with (mostly the 's' suite), I don't think we should throw away any sub-results.


> * improvments or just regressions
> ** currently: regressions only are emailed out (we can alert regressions,
> improvements, or both with a switch)

> * what algorithm
> ** currently: median test over the window size (CONFIRMED)
> ** ideally: piece-wise linear regression analysis
> (http://en.wikipedia.org/wiki/Segmented_regression)

> * what window size
> ** currently: 20 data point before/after (globally adjustable, working on
> making it a per test parameter)


Any interesting reasons to deviate from the graphserver approach/formulas/parameters? It's not that I have any reason to want the old behavior, but a change usually needs a good reason. E.g.:

- Why drop improvement alerts? IMO they're very nice to get as a dev.
- (if the algorithm differs) why different algorithm?
- Why 20 and not 12 datapoints before/after?

- Why the 5%/10% thresholds rather than what graphserver uses (which I believe is lower than this)?

> * how to handle bimodal data (i.e. v8, dromaeo, etc.)
> ** currently: using median as our metric mitigates some of the bimodal
> problem by naturally selecting the most common mode. 
> ** ideally: using a mixture model
> (http://en.wikipedia.org/wiki/Mixture_model) will help characterize the two
> modes and help us construct a useful aggregate

Same as before: Is it different than graphserver? if yes, for what reason?

> * noisy pages such as alipay.com in tp5
> ** short term: with larger windows, noisy data should simply result in less
> sensitive regression detection.

Without looking at alipay.com graphs: if a page is particularly noisy and we don't have an interesting enough reason to keep it (and on tp5 we probably don't have interesting reasons to stick to any specific page, IMO), then why keep it? Completely without checking any numbers, I'd say that a "random" page with a noise level above ~50 is probably not worth keeping, possibly even much less.

For tests with actual meaningful sub-results (rather than mostly random pages from alexa), a noisy sub-result needs to be examined and probably fixed.

I'm not sure I've seen the answer to this, or maybe it was included in "what alerts: per page": I think that this is too high a resolution by default. The sub-results are grouped into tests for a reason, and this reason is that the sub-results measure different aspects of a specific test or subject.

Therefore, when we send alerts on improvements/regressions, they should alert per test rather than alert-per-page IMO.

HOWEVER, in order to understand if we need to send an alert for a specific test, we should also check whether any of its sub-results independently goes over our threshold (I suggested that if a sub-result regresses by more than twice the average regression of all sub-results, then this sub-result should be displayed/included together with the test regression email).

This means a regression alert could look as follows:

tscrollx regressed 2%: sub-result 1 regressed 10%, sub-result 4 regressed 13%, etc. Group all the regressions on sub-results of a single test into a single regression email for this test.
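
The grouping rule proposed above might be sketched like this. Purely illustrative: the function name, input format, and rendering are made up; only the "more than twice the average regression" rule and the per-test email shape come from the comment:

```python
# Illustrative sketch of the proposed rule: alert per test, and list
# every sub-result whose regression exceeds twice the average
# regression across all sub-results of that test.
def format_alert(test_name, sub_regressions):
    # sub_regressions maps sub-result name -> fractional regression
    # (0.10 == 10% slower; 0.0 == unchanged).
    avg = sum(sub_regressions.values()) / len(sub_regressions)
    notable = sorted((name, r) for name, r in sub_regressions.items()
                     if r > 2 * avg)
    details = ", ".join(f"{name} regressed {r:.0%}" for name, r in notable)
    return f"{test_name} regressed {avg:.0%}: {details}"
```

With a test of 10 sub-results where only two regress (10% and 13%), this yields an alert like "tscrollx regressed 2%: sub-result1 regressed 10%, sub-result4 regressed 13%".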
(In reply to Avi Halachmi (:avih) from comment #31)
> I suggest the following terminology:
My terminology was referring one level deeper (tscrollx is a "suite", and alipay.com is a "test" in the tp5o "suite").  Your terminology is fine for this bug, but then my "sub-results" will be referred to as "sub-sub-results" using your terminology.

> Other tests have sub results as well, like tscrollx, tart, etc. Why can't we
> just "split" the dromaeo array into "normal" sub-results like we do with
> other tests? So datazilla can track them independently etc. I bet it's half
> an hour work at most only on the test side itself.
Yes, breaking out the sub-sub-results in the sub-results into something that can be easily analyzed is planned. 

> We should turn off tests which are meaningless or completely not useful etc.
> From the tests I'm familiar with (mostly the 's' suite) I don't think we
> should throw away any sub-results.
I am agnostic about which tests to report and which tests to turn on/off.  I really need a domain expert, such as yourself, to make those types of decisions.
 
> Any interesting reasons to deviate from the graphserver
> approach/formulas/parameters? It's not that I have any reason to want the
> old behavior, but a change usually needs a good reason. E.g.:
>
> - Why drop improvement alerts? IMO they're very nice to get as a dev.
Alerting on improvements is not actionable.  The reason for dzAlerts is to reduce the human resources needed to isolate performance regressions that are severe enough that action should be taken.  We want to streamline the process of identifying, confirming, and opening bugs on performance regressions.  I think this is where dzAlerts and graph server deviate in purpose.

> - (if the algorithm differs) why different algorithm?
> - Why 20 and not 12 datapoints before/after?
dzAlerts uses the median test, and more points allow it to produce fewer false positives on (non-gaussian) noisy data.  The t-test is too sensitive to the non-gaussian noise we are witnessing and generates too many false alerts.

> - Why the 5%/10% thresholds rather than what graphserver uses (which I
> believe is lower than this).
Same reason as above: a higher threshold produces alerts that are actionable.  Furthermore, the sub-test analysis gives a much stronger signal than can be seen when analyzing a suite aggregate, so we need higher thresholds to compensate.  We can lower the threshold, but the amount of manual work required to confirm the additional regressions may not be worth the gain.  We are still working on the process that consumes these alerts, and as it gets better I am sure this threshold will be lowered.

> > * how to handle bimodal data (i.e. v8, dromaeo, etc.)
> > ** currently: using median as our metric mitigates some of the bimodal
> > problem by naturally selecting the most common mode. 
> > ** ideally: using a mixture model
> > (http://en.wikipedia.org/wiki/Mixture_model) will help characterize the two
> > modes and help us construct a useful aggregate
> 
> Same as before: Is it different than graphserver? if yes, for what reason?
The test results being analyzed by graph server benefit from the dampening effect of aggregation.  The noise seen in the sub-results is more pronounced, but so is the signal.  The t-test strategy used by graph server performs less well in this noisier environment, so the median test is used instead.

> > * noisy pages such as alipay.com in tp5
> > ** short term: with larger windows, noisy data should simply result in less
> > sensitive regression detection.
> 
> Without looking at alipay.com graphs, if a page is particularly noisy and we
> don't have an interesting enough reason to keep it (and on tp5 we probably
> don't have interesting reasons to want to stick to any specific page IMO),
> then why keep it? Completely without checking any numbers, I'd say that a
> "random" page with a noise level of above ~50 is probably not worth keeping,
> possibly even much less.
Despite the noise, we can clearly see a signal in alipay.com.  This requires larger windows to detect with existing algorithms.  I completely agree that fixing or removing noisy tests is a good strategy to improve signal, but sometimes that strategy can take a long time to implement.  For example, the garbage collection pauses are causing multi-modal sub-results in V8, which is taking a while to fix.  In the meantime, a noisy test can still be useful.
 
> Not sure I've seen the answer to, or maybe it was included in "what alerts:
> per page", I think that this is too high a resolution by default. The
> sub-results are grouped into tests for a reason, and this reason is that the
> sub-results measure different aspects of a specific test, or subject.
> 
> Therefore, when we send alerts on improvements/regressions, they should
> alert per test rather than alert-per-page IMO.
The page-level analysis contributes to the alert being actionable by providing information on which pages regressed.  It is rare that a performance regression in a test result is caused by multiple sub-test regressions; usually only one has regressed.

> HOWEVER, in order to understand if we need to send an alert for a specific
> test, we should also check if any of its sub-results independently goes over
> our threshold (I suggested that if a sub-result regresses in more than twice
> the average regression of all sub results, then this sub-result should be
> displayed/included together with the test regression email).
Yes.  It is often the case that one sub-result regresses while the others are unchanged.
(Reporter)

Comment 33

4 years ago
(In reply to Kyle Lahnakoski [:ekyle] from comment #32)
> (In reply to Avi Halachmi (:avih) from comment #31)
> > I suggest the following terminology:
> My terminology was referring one level deeper (scrollx is a "suite", and
> alipay.com is a "test" in the tp5o "suite").  Your terminology is fine for
> this bug, but I then my "sub-results" will be referred to a
> "sub-sub-results" using your terminology.

I've lost you here, and I don't understand what this deeper level you say you referred to is. Even more so, I don't understand whether you chose to keep "your" terminology or "mine" throughout the rest of your post. How does this relate to:

(In reply to Kyle Lahnakoski [:ekyle] from comment #30)
> Just refinements to Joel's answers:
> * what tests
> ** currently: all tests (except dromaeo, which has multiple sub-test results
> jammed into one array)
> 
> * what pages
> ...

This clearly suggests that when you said "test" you referred to "dromaeo" (even if we actually have 2 dromaeo tests: dromaeo_css and dromaeo_dom) or "tscrollx" or "tart", etc.

And that when you say "page" you mean one of the sub-results of those tests, such as dojo.html as a "page" of dromaeo_css, tiled-downscale.html as a "page" of tscrollx, and icon-close_DPI1.all.TART as a "page" of TART.

The datazilla UI supports this terminology, where under "Browse by Tests" it puts test names, such as tscrollx or TART.

I wasn't changing the depth level of this terminology, but rather only suggested changing the terminology from "page" to "sub-result" (or, more accurately, to "sub-test", which produces a single sub-result), because in many cases they're not really pages (as with tart or cart), while in others they really are a "page", such as the alipay.com "page" in the tp5o "test".


> Yes, breaking out the sub-sub-results in the sub-results to something that
> can be easily analyzed is planned.

I'm losing you again. I don't understand what a sub-sub-result is, neither in "my" terminology nor in yours. Please keep in mind that I'm completely unfamiliar with dromaeo, so if it has more "levels" than what I suggested, then I wasn't aware of it. In the dz UI it seems to have the same levels as other tests such as tscrollx or tp5o.

We should really make as much effort as required to align our (and others) terminologies, such that when we discuss stuff we're on the same page, pardon the pun ;)

> > - Why drop improvement alerts? IMO they're very nice to get as a dev.
> Alerting on improvements are not actionable.  The reason for dzAlerts is to
> reduce the human resources to isolate performance regressions that are
> severe enough that action should be taken.  We want to streamline the
> process of identifying, confirming and opening bugs on performance
> regressions.  I think this is where dzAlerts and graph server deviate in
> purpose.

Improvements and regressions are two sides of exactly the same coin, and this coin is called "performance changes as a result of a patch".

By notifying devs only of regressions you're showing them only half of the picture as far as performance changes from their patch go.

While it's true that for the most part regressions are actionable and improvements are not, improvements are usually the result of actions taken on regressions (or other actions), and are therefore as important for devs as regressions are; arguably even more important when they land a change which should make some big improvement and need feedback that results have indeed improved. Improvements are the goal; regressions are undesirable side effects.

Splitting on what's actionable and what's not is not as generic or as good as splitting on what's useful info and what isn't. "Useful" contains info which is not always actionable, yet is still useful. We should pick the latter because it provides more useful info (even if not much more actionable info), and it doesn't cost us more in almost any aspect of "cost".

Not to mention the morale boost of an improvement notification vs a regression notification...

> > - (if the algorithm differs) why different algorithm?
> > - Why 20 and not 12 datapoints before/after?
> dzAlerts uses the median test, and more points allow it to produce less
> false positives on (non-gaussian) noisy data.  The t-test is too sensitive
> to the non-gaussian noise we are witnessing and generating too many false
> alerts.
> 
> > - Why the 5%/10% thresholds rather than what graphserver uses (which I
> > believe is lower than this).
> Same reason as above: A higher threshold produces alerts that are
> actionable.  Furthermore, the sub-test analysis gives a much stronger signal
> than can be seen when analyzing a suite aggregate, so we need higher
> thresholds to compensate.

Did you get back to using "your" terminology where a tscrollx is a "suite" rather than a "test"? (I'm not nitpicking - I'm really asking to check if I understand you correctly. I really think we should align our terminology for this bug and all other places, which is the reason I started with it in my previous reply. I don't care which terminology it is, as long as we're all aligned on using the same terminology).

Regardless, when you say better algorithm, less noise, etc.: is there any kind of data which shows how bad the old "system" is, and how these changes make the new system better? Does this data also show what we lose by making these changes (e.g. maybe not discovering some changes which were discovered before)?

> > Same as before: Is it different than graphserver? if yes, for what reason?
> The test results being analyzed by graph server benefit from dampening
> effect of aggregation.  The noise seen in the sub-results is more
> pronounced, but so is the signal.  The t-test strategy used by graph server
> performs less well in this noisier environment: so the median test is used
> instead.

Same as before: is there any data which shows how the change affects the detections? E.g. by taking the same data, applying both methods to it, and examining how the output differs?


> Despite the noise, we can clearly see a signal in alipay.com.

I specifically said "random page". As in, a page which we don't have any particular reason to keep. For such pages, if they're too noisy to be useful, we should just drop them rather than jumping through hoops to make their data useful.

OTOH, for sub-tests, if those are noisy, then they should be investigated at the very least, because typically a sub-test has a specific goal (compared to a random "page" which is just another "random" data point). And if hoops need to be jumped through to make those useful, then sometimes it could be worth the effort.

The main difference between a sub-test (sub-result) and a page is that while both are at the same "depth" level, "page" is picked as a relatively random data point (such as X most popular pages on alexa on tp5o test), while a sub-test was specifically designed to reflect a well-intended perspective of performance.

Of course, if the signal is strong enough to be useful despite the noise level, then my comment doesn't apply at all.


> > Therefore, when we send alerts on improvements/regressions, they should
> > alert per test rather than alert-per-page IMO.
> The page-level analysis contributes to the alert being actionable by
> providing information on which pages regressed.  It is rare that a
> performance regression in a test result is a caused by multiple sub-test
> regressions; usually only one has regressed.

I'm not sure I understand whether your explanation is meant to support my suggestion, to provide an alternative one, or something else. I _think_ you're agreeing with me and providing a bit more info?
(In reply to Avi Halachmi (:avih) from comment #33)
> I've lost you here and I don't understand what's this deeper level you say
> you referred to. Even more so, I don't understand if you chose to keep
> "your" terminology or "mine" throughout the rest of your post. How does this
> relate to:
> 
> (In reply to Kyle Lahnakoski [:ekyle] from comment #30)
> > Just refinements to Joel's answers:
> > * what tests
> > ** currently: all tests (except dromaeo, which has multiple sub-test results
> > jammed into one array)
> > 
> > * what pages
> > ...
> ...
> 
> > Yes, breaking out the sub-sub-results in the sub-results to something that
> > can be easily analyzed is planned.
> 
> I'm losing you again. I don't understand what a sub-sub-result is,
Looking at a dromaeo_dom test, it has 4 sub-results, one of which is attr.html. (https://datazilla.mozilla.org/?start=1401710458&stop=1402474256&product=Firefox&repository=Mozilla-Inbound&os=win&os_version=6.1.7601&test=dromaeo_dom&graph_search=817ede736aab&tr_id=5747657&graph=attr.html&x86=true&x86_64=false&project=talos)  If you look at the replicates of any one sub-result of attr.html, you will notice they are clustered in groups of 5 (click on a data point and see the test replicates in the bar chart at the bottom).  These clusters exist because attr.html itself has child tests, each run 5 times.  I do not know what they are named, nor what they are testing, but let us call them "attr.html.1" thru "attr.html.6" respectively. attr.html.1 is an example of the sub-sub-results I was referring to.

These sub-sub-results are found in all dromaeo_dom and dromaeo_css tests and make the existing sub-result analysis useless.  They must be broken up, named (like above), and given first-class status for analysis.
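
Splitting those clusters back out could be as simple as grouping the flat replicate array. This is a hypothetical sketch only: the "attr.html.N" names follow the ad-hoc naming above, and the real replicate ordering may differ:

```python
# Hypothetical sketch: break a flat list of replicates into the
# clusters of 5 described above, naming them "attr.html.1" onward.
def split_replicates(replicates, group_size=5):
    return {
        f"attr.html.{i // group_size + 1}": replicates[i:i + group_size]
        for i in range(0, len(replicates), group_size)
    }
```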

> We should really make as much effort as required to align our (and others)
> terminologies, such that when we discuss stuff we're on the same page,
> pardon the pun ;)
I hope my example cleared up the confusion I caused with "sub-sub-results".  I wish I had a better name for these in your terminology.

> > > - Why drop improvement alerts? IMO they're very nice to get as a dev.
> > Alerting on improvements are not actionable.  The reason for dzAlerts is to
> > reduce the human resources to isolate performance regressions that are
> > severe enough that action should be taken.  We want to streamline the
> > process of identifying, confirming and opening bugs on performance
> > regressions.  I think this is where dzAlerts and graph server deviate in
> > purpose.
> 
> Improvements and regressions are two sides of exactly the same coin, and
> this coin is called "performance changes as a result of a patch".
> 
> By notifying devs only of regressions you're showing them only half of the
> picture as far as performance changes from their patch go.
> 
> While it's true that for the most part regressions are actionable and
> improvements are not, improvements are usually the results of those actions
> taken on regressions (or other actions), and therefore as important for devs
> as regressions are, and arguably even more important when they land a change
> which should make some big improvement and they need feedback that results
> have indeed improved. Improvements are the goals, regressions are
> undesirable side effects.
> 
> Splitting on what's actionable and what's not is not as generic or as good
> as a split on "what's useful info and what isn't". "Useful" contains info
> which is not always actionable, and it's still useful. We should pick the
> latter because it provides more useful info (even if not much more
> actionable info), and it doesn't cost us more in almost any aspect of "cost".
> 
> Not to mention the morale boost of an improvement notification vs a
> regression notification...
Interesting.  I have not given much thought to who or what would consume improvement alerts.  If we have a channel to broadcast such alerts, I will not argue against doing so.


> > > - (if the algorithm differs) why different algorithm?
> > > - Why 20 and not 12 datapoints before/after?
> > dzAlerts uses the median test, and more points allow it to produce less
> > false positives on (non-gaussian) noisy data.  The t-test is too sensitive
> > to the non-gaussian noise we are witnessing and generating too many false
> > alerts.
> > 
> > > - Why the 5%/10% thresholds rather than what graphserver uses (which I
> > > believe is lower than this).
> > Same reason as above: A higher threshold produces alerts that are
> > actionable.  Furthermore, the sub-test analysis gives a much stronger signal
> > than can be seen when analyzing a suite aggregate, so we need higher
> > thresholds to compensate.
> 
> Did you get back to using "your" terminology where a tscrollx is a "suite"
> rather than a "test"? 
Doh!  Yes, it should have read "...seen when analyzing a TEST aggregate..."

> Regardless, when you say better algorithm, less noise, etc, is there any
> kind of data which shows how bad is the old "system", and how these changes
> make the new system better? Does this data also show what we lose by making
> these changes? (e.g. maybe not discovering some changes which were
> discovered before?).
We are going through the process of comparing the graph server alerts with dzAlerts to ensure they are of higher quality, meaning they catch everything graph server does, and more.  We still have issues with the greater sub-test noise causing too many false positives (or alert redundancies), but we are close to solving that too.  Joel can confirm we can already catch regressions that graph server is missing (but not many).

> > > Same as before: Is it different than graphserver? if yes, for what reason?
> > The test results being analyzed by graph server benefit from dampening
> > effect of aggregation.  The noise seen in the sub-results is more
> > pronounced, but so is the signal.  The t-test strategy used by graph server
> > performs less well in this noisier environment: so the median test is used
> > instead.
> Same as before: any data which shows how does the change affects the
> detections? E.g. by taking the same data, applying both methods on the data,
> and examining how the output differs?
I think the answer above answers this too.
 
> > > Therefore, when we send alerts on improvements/regressions, they should
> > > alert per test rather than alert-per-page IMO.
> > The page-level analysis contributes to the alert being actionable by
> > providing information on which pages regressed.  It is rare that a
> > performance regression in a test result is a caused by multiple sub-test
> > regressions; usually only one has regressed.
> 
> I'm not sure I understand if your explanation is meant to support my
> suggestion or to provide an alternative one, or for some other reason. I
> _think_ that you're agreeing with me and provides a bit more info?
Yes, I am agreeing with you, and adding more info: specifically, in practice, when you see a regression in a test aggregate, it is usually caused by only one sub-test, with the others unchanged.  Furthermore, we sometimes see regressions in one sub-test while seeing improvements in other sub-tests of the same test; this would imply it's possible to have sub-test regressions while the test aggregate appears unchanged.
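
A tiny numeric illustration of that last point (invented numbers): one sub-test regresses while another improves by the same amount, so the aggregate looks unchanged:

```python
from statistics import mean

# Invented sub-test scores (higher == slower).
before = {"sub_a": 100, "sub_b": 100, "sub_c": 100}
after  = {"sub_a": 120, "sub_b":  80, "sub_c": 100}  # sub_a regressed 20%,
                                                     # sub_b improved 20%

print(mean(before.values()))  # 100 -- aggregate before
print(mean(after.values()))   # 100 -- aggregate after: looks unchanged
```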
(Reporter)

Comment 35

4 years ago
(In reply to Kyle Lahnakoski [:ekyle] from comment #34)
> Looking at a dromaeo_dom test, it has 4 sub-results, one of which is
> attr.html.
> (https://datazilla.mozilla.org/ ...)
> If you look at the replicates of any one sub-result of attr.html

What are replicates? Are these the individual runs which talos does in a single "instance" (do we have a term for "instance" or "one run" etc.?), which are usually 25 runs (resulting in 25 individual numbers, all for the same sub-test), of which talos sends the median to graphserver (after discarding a few) as one of the sub-results (my terminology) of a single test (same terminology), while sending all of them to datazilla?

Could you please choose any terminology which you like, and list all the suites, tests, pages, results, sub results, sub-sub results etc (and whatever else exists with test names/runs/etc), into one table which I could use as reference to make sure I can follow the discussion?

Pick whatever terminology which exists or which you find suitable. Forget about any terminology which I suggested.
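For reference, the aggregation scheme described in comment 0 can be sketched in a few lines of Python. This is my own illustrative sketch, not talos or graphserver code, and the names are made up:

```python
import statistics

def graphserver_score(subtest_replicates):
    """Reduce one talos run to a single number, following the scheme
    from comment 0: take the median of each sub-test's replicates,
    drop the highest of those medians, and average the rest.
    (Illustrative only; not the actual graphserver implementation.)"""
    medians = [statistics.median(reps) for reps in subtest_replicates]
    medians.remove(max(medians))      # drop the worst sub-test as an outlier
    return statistics.mean(medians)   # the "final" reported number

# Three sub-tests, each with 5 replicates (talos normally runs 25):
run = [
    [10, 11, 10, 12, 10],   # median 10
    [20, 21, 19, 20, 22],   # median 20
    [50, 49, 51, 50, 48],   # median 50 -> dropped as the max
]
print(graphserver_score(run))   # averages 10 and 20 -> 15
```

Note how a regression confined to the dropped sub-test would be completely invisible in the final number.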

> I do not know what they are named, nor what
> they are testing, but let us call them "attr.html.1" thru "attr.html.6"
> respectively. attr.html.1 is an example sub-sub-result I was referring to.   

I'll have to examine this together with looking at the source code. As I mentioned, I never looked into the dromaeo test or its results. Thanks for the details.

> I wish I had a better name for these in your terminology.

Forget my terminology. It never existed. It's a fiction. By all means I will align 100% with whatever terminology you choose or have :)

Just please list it (fully, not with samples or examples) as I suggested above and we'll be good on this one forever :)

> If we have a channel to broadcast such alerts, I will
> not argue against doing so.

Hmm.. wouldn't that be the exact same channel which you intend to use (/are already using) for the regression alerts? i.e. wherever you calculate/display/send regression alerts, just do the same also for improvements?

> We are going though the process of comparing the graph server alerts with
> dzAlerts to ensure they are of higher quality, meaning they catch everything
> graph server does, and more.  We still have issues with the greater sub-test
> noise causing too many false positives (or alert redundancies), but we are
> close to solving that too.  Joel can confirm we can already catch
> regressions that graph server is missing (but not many).

I believe you. I'd like to join this discussion, if possible, and preferably also get up to date by following its history. Is it being tracked some place? something which describes the design, work, decisions, arguments, etc which you've gone through, as well as what its current status is?
Avi, 

Thank you for making me realize dzAlerts has a serious deficiency: There is no wiki page.  This project grew organically from the Signal from Noise Meetings, and never was much of anything until recently.  How about I put it here: https://wiki.mozilla.org/Auto-tools/Projects/Alerts.  I will start with nomenclature.  And use your questions to guide what is relevant in the docs.

(In reply to Avi Halachmi (:avih) from comment #35)
> Could you please choose any terminology which you like, and list all the
> suites, tests, pages, results, sub results, sub-sub results etc (and
> whatever else exists with test names/runs/etc), into one table which I could
> use as reference to make sure I can follow the discussion?
I do not think a full list can be made; I am not even sure what all is being analyzed.  The analysis engine is given a schema (e.g. https://github.com/klahnakoski/datazilla-alerts/blob/master/dimension_talos.json) to interpret the sub-test results as points in a data cube, which is then sliced along the interesting dimensions (edges at https://github.com/klahnakoski/datazilla-alerts/blob/master/resources/settings/talos_settings.json#L57) to provide a multitude of data series for analysis.  The domain of all dimensions is discovered at run time; if certain tests have not run recently, they are not discovered.  This allows the analysis to stay flexible as tests and sub-tests are added and removed over time.

I will do my best to lay out the nomenclature on the new wiki page.


> Pick whatever terminology which exists or which you find suitable. Forget
> about any terminology which I suggested.
I will add nomenclature to the wiki.

> > If we have a channel to broadcast such alerts, I will
> > not argue against doing so.
> 
> Hmm.. wouldn't that be the exact same channel which you intend to use (/are
> already using) for the regression alerts? i.e. wherever you
> calculate/display/send regression alerts, just do the same also for
> improvements?
They are not the same: The regression alerts channel is used to notify a human that a regression exists and should be confirmed, and a bug generated to fix it.  Improvements do not require similar action so are just noise to this process.  If you know of something or someone that will make use of improvement alerts we can send them there instead.  Regressions and improvements may have a mathematical symmetry, but they feed into different business processes.

> > We are going though the process of comparing the graph server alerts with
> > dzAlerts to ensure they are of higher quality, meaning they catch everything
> > graph server does, and more.  We still have issues with the greater sub-test
> > noise causing too many false positives (or alert redundancies), but we are
> > close to solving that too.  Joel can confirm we can already catch
> > regressions that graph server is missing (but not many).
> 
> I believe you. I'd like to join this discussion, if possible, and preferably
> also get up to date by following its history. Is it being tracked some
> place? something which describes the design, work, decisions, arguments, etc
> which you've gone through, as well as what its current status is?

I will start filling in https://wiki.mozilla.org/Auto-tools/Projects/Alerts
(Reporter)

Comment 37

4 years ago
(In reply to Kyle Lahnakoski [:ekyle] from comment #36)
> Avi, 
> 
> Thank you for making me realize dzAlerts has a serious deficiency: There is
> no wiki page.  This project grew organically from the Signal from Noise
> Meetings, and never was much of anything until recently.  How about I put it
> here: https://wiki.mozilla.org/Auto-tools/Projects/Alerts.  I will start
> with nomenclature.  And use your questions to guide what is relevant in the
> docs.

Sounds great. Thanks for picking it up. Are these meetings still taking place?

> I will do my best to lay out the nomenclature on the new wiki page.

Appreciated.

> They are not the same: The regression alerts channel is used to notify a
> human that a regression exists and should be confirmed, and a bug generated
> to fix it.  Improvements do not require similar action so are just noise to
> this process.  If you know of something or someone that will make use of
> improvement alerts we can send them there instead.  Regressions and
> improvements may have a mathematical symmetry, but they feed into different
> business processes.

I think the improvements should appear on the same "boards" which display and track the regressions, possibly with some UI switch to choose either/both, as well as being emailed to patch authors, like graphserver does now. The only thing we might skip for improvements is filing bugs on them after manually reviewing them. The rest (as far as I can guess what it includes) could probably be the same.

Since I'm not yet familiar with the scope of the system, I can't know how much any suggestion would deviate from the design/goals/implementations.

> I will start filling in https://wiki.mozilla.org/Auto-tools/Projects/Alerts

Cheers. Looking forward to it.
(In reply to Avi Halachmi (:avih) from comment #37)
> (In reply to Kyle Lahnakoski [:ekyle] from comment #36)
> Sounds great. Thanks for picking it up. Are these meetings still taking
> place?
Yes, every second Friday @ 8amPDT.  Next one is tomorrow!
(Reporter)

Comment 39

4 years ago
(In reply to Kyle Lahnakoski [:ekyle] from comment #38)
> Yes, every second Friday @ 8amPDT.  Next one is tomorrow!

Thanks!

OK, so now that we see more bits of the puzzle - mainly that the dzAlerts project exists with a seemingly identical goal to this bug (and apparently wider scope, longer history and considerable progress as well) - the questions I have in mind are:

- When can we hope that it ships?

- If we know that the road could still be long for dzAlerts, is there still a place for making some progress with graph server based alerts improvements? mainly (independently) this bug (higher resolution) and bug 1021842 (better formula)?

Especially bug 1021842 looks to me like potentially big win with relatively low investment, to fill in some of the gap until dzAlerts is deployed?

What say you?
Flags: needinfo?(klahnakoski)
Flags: needinfo?(jmaher)
(In reply to Avi Halachmi (:avih) from comment #39)

> - When can we hope that it ships?
We have dzAlerts running on a staging server continuously.  There are known issues (https://bugzilla.mozilla.org/showdependencytree.cgi?id=962378&hide_resolved=1), but they are being worked on.  When it comes to the alerts, I do not believe there is more than a quarter's work left until they are of high quality.

> Especially bug 1021842 looks to me like potentially big win with relatively
> low investment, to fill in some of the gap until dzAlerts is deployed?
> 
> What say you?

How about we discuss direction tomorrow?  You able to make the SfN meeting (8am PDT tomorrow (Friday))?
Flags: needinfo?(klahnakoski)
Personally I think we can move forward on graph server by switching to the geometric mean of all values instead of the average of all but the max value.  Other than that, we are close enough on dzAlerts to use them and fine-tune them.
Flags: needinfo?(jmaher)
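To make the difference between the two formulas concrete, here is a small Python sketch (my own illustration with invented numbers, not graphserver code):

```python
import math
import statistics

def old_score(values):
    """Current graphserver formula: average of all but the max value."""
    return statistics.mean(sorted(values)[:-1])

def geomean(values):
    """Proposed formula: geometric mean of all values."""
    return math.exp(statistics.mean(math.log(v) for v in values))

# Four sub-tests with very different inherent magnitudes:
subtests = [10.0, 20.0, 40.0, 400.0]
print(round(old_score(subtests), 2))   # 23.33 -- the 400 sub-test is dropped entirely
print(round(geomean(subtests), 2))     # 42.3  -- every sub-test contributes
```

With the old formula, a regression confined to the largest sub-test never reaches the final number; with the geometric mean every sub-test contributes, and no single large-magnitude sub-test dominates the way it would in a plain average. (Python 3.8+ also ships `statistics.geometric_mean`, which could replace the manual log/exp version.)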
(Reporter)

Comment 42

4 years ago
(In reply to Joel Maher (:jmaher) from comment #41)
> Personally I think we can move forward on graph server with fixing the
> geomean for all values instead of the average of all but the max value. 
> Other than that, we are close enough on dzalerts to use and fine tune them.

I can live with that.

Can you live with the fact that all the graphs will change after this lands, and that numbers after this change will probably not be comparable to numbers before it? Meaning you won't be able to detect regressions around the switch, or compare values from before/after the switch.

I think it's still OK. It's like a reset point for all the data.
Regarding the resetting of the raw data values: we need to bite the bullet at some point, but I don't think it will be that scary.

If we get this deployed at least 2 weeks prior to the next uplift, Aurora will have a set of results with the geomean, as will trunk; then when we uplift, it will be a diff of geomean v32 to geomean v33.
(Reporter)

Comment 44

4 years ago
(In reply to Joel Maher (:jmaher) from comment #43)
> regarding the resetting of the raw data values, we need to bite the bullet
> at some point but I don't think it will be that scary.
> 
> If we get this deployed at least 2 weeks prior to the next uplift Aurora
> will have a set of results with the geomean as will trunk, then when we
> uplift it will be a diff of geomean v32 to geomean v33.

Sounds good. Shall we continue this at bug 1021842?
Duplicate of this bug: 556574
Depends on: 1045002
Not sure why mattn is needinfo'd here; let's clean up what we can.
Flags: needinfo?(MattN+bmo)
Looking through some old bugs, I found this one. We have been using the geometric mean for the last year; would that satisfy the need for this bug? If so, I would like to close it out.
Flags: needinfo?(avihpit)
(Reporter)

Comment 48

3 years ago
(In reply to Avi Halachmi (:avih) from comment #0)
> This formula is not good for tests where each sub-test was designed
> specifically to reflect a well-intended perspective, and where each sub-test
> is meaningful independently (e.g. TART/CART/WebGL).

The geometric mean handles this ^ aspect of the bug, which is to improve the overall score formula where sub-tests have different inherent magnitudes.


> This bug is about being able to detect (and display etc) meaningful
> regressions which manifest at one or few sub-results of a test ...

And this ^ is covered too these days, since perfherder can now display per-subtest perf changes.

So this bug should be closed.
Flags: needinfo?(avihpit)
Status: NEW → RESOLVED
Last Resolved: 3 years ago
Resolution: --- → FIXED