Talos: We need higher resolution regression alerts

Status

RESOLVED FIXED

People

(Reporter: avih, Unassigned)

Tracking

({ateam-talos-task})

Trunk
Firefox Tracking Flags

(Not tracked)

Details

(Reporter)

Description

4 years ago
Some of our tests combine many sub-tests into a single test score: e.g. TP5 with many page scores, TART with many different animation cases, and possibly the WebGL test (bug 1020663), which produces 67 different sub-results with non-uniform magnitudes.

Currently, our regression alerts (both the m.dev.tree-management email alerts and the regression bugs which are filed manually) are based on the formula by which graphserver processes all the sub-results from a full talos test run into a single number:

- A talos run of a test consists of N executions of each sub-test (mostly N==25).
- Take median for each N results per sub-test.
- Ignore the highest of those medians.
- Average the rest <-- the "final" number by which we calculate regressions and which graphserver displays in its graphs.
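For reference, those steps can be sketched in a few lines of Python (the function name is hypothetical; it only mirrors the list above, not graphserver's actual code):

```python
from statistics import median

def graphserver_score(subtest_runs):
    """Collapse one talos run into the single "final" number.

    subtest_runs maps sub-test name -> list of its N raw results.
    """
    # 1. Take the median of the N results for each sub-test.
    medians = [median(runs) for runs in subtest_runs.values()]
    # 2. Drop the single highest median, treated as an outlier.
    medians.remove(max(medians))
    # 3. Average the rest: the number graphserver displays and alerts on.
    return sum(medians) / len(medians)
```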

This formula was, AFAIK, mostly designed to handle TP5-like tests, where the page load times are similar between pages, no page has any particular importance, and therefore the page which performed the worst was dropped as an outlier, and the rest were averaged.

This formula is not good for tests where each sub-test was designed specifically to reflect a well-intended perspective, and where each sub-test is meaningful independently (e.g. TART/CART/WebGL).

We've already seen cases where the average showed a relatively small regression which was due to a single sub-test with a really big and meaningful regression (e.g. bug 1004429, which regressed the newtab animation by ~40%, was perceived as a 5% regression because it only regressed 4 of TART's 30 sub-results).

This bug is about being able to detect (and display, etc.) meaningful regressions which manifest in one or a few sub-results of a test, when the sub-test is meaningful independently.

I'm not sure exactly which form this high-res detection should take, or which technical approach to use.

We could use the backend of either graphserver or datazilla for this. While graphserver displays only this average, if it also happens to store all the individual sub-results per run, then it might be a good starting point. Otherwise, maybe datazilla would be more suitable for this.

As far as the alerts themselves go, I think we could use something like this:
- If the average (of all sub-tests) regressed more than A%, issue an alert on the average.
- Regardless, if a single sub-test regressed considerably more (TBD) than the average did, then issue an alert which includes the most regressed sub-test and also the 2nd most regressed sub-test, together with the average regression.
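A minimal sketch of those two rules (all names and both threshold values here are illustrative placeholders; the "A%" and "considerably more" factors are exactly the TBD parts):

```python
def propose_alerts(avg_pct, subtest_pct, a_threshold=3.0, rel_factor=2.0):
    """Sketch of the two proposed alert rules.

    avg_pct: regression of the overall average, in percent.
    subtest_pct: sub-test name -> its own regression, in percent.
    """
    alerts = []
    # Rule 1: the average itself regressed more than A%.
    if avg_pct > a_threshold:
        alerts.append("average regressed %.1f%%" % avg_pct)
    # Rule 2: some sub-test regressed considerably more than the average;
    # report the two most-regressed sub-tests alongside the average.
    ranked = sorted(subtest_pct.items(), key=lambda kv: kv[1], reverse=True)
    if ranked and ranked[0][1] > rel_factor * avg_pct:
        top_two = [name for name, _ in ranked[:2]]
        alerts.append("subtests %s regressed (average: %.1f%%)"
                      % (", ".join(top_two), avg_pct))
    return alerts
```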

Any ideas on how to approach this?
(Reporter)

Comment 1

4 years ago
Regardless of higher resolution: if we want to give each sub-test an equal weight (and we usually do) while each of them sits in a different range of values, as often happens with tests whose sub-tests are independently meaningful, we could use the geometric mean instead of the average (of all the medians, and without dropping the highest).

This should improve our tracking practically for free. Assuming, of course, that we don't already do it.
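A sketch of that alternative (hypothetical function name, same input shape as before):

```python
from math import exp, log
from statistics import median

def geomean_score(subtest_runs):
    """Geometric mean of the per-sub-test medians, dropping nothing.

    subtest_runs maps sub-test name -> list of its N raw results.
    """
    medians = [median(runs) for runs in subtest_runs.values()]
    # Computed in the log domain: exp(mean(log(m))).
    return exp(sum(log(m) for m in medians) / len(medians))
```

The point of the geometric mean is that doubling any one sub-test multiplies the score by the same factor regardless of that sub-test's absolute magnitude, so a 40 ms animation case and a 2000 ms page carry equal weight.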
Comment 2

4 years ago
Looking into this, we send this data (using tsvgx as an example) to graph server:
START
VALUES
qm-pxp01,tsvgx,,a49657b0b23b,20140603105048,1401971200
0,209.00,gearflowers.svg
1,42.00,composite-scale.svg
2,41.00,composite-scale-opacity.svg
3,41.00,composite-scale-rotate.svg
4,41.50,composite-scale-rotate-opacity.svg
5,388.00,hixie-001.xml
6,405.00,hixie-002.xml
7,148.00,hixie-003.xml
8,320.50,hixie-004.xml
9,1322.50,hixie-005.xml
10,2199.50,hixie-006.xml
11,1823.00,hixie-007.xml
END

Here is the data we send to datazilla:
[{"test_machine": {"platform": "x86_64", "osversion": "Ubuntu 14.04", "os": "linux", "name": "qm-pxp01"}, "testrun": {"date": 1401971200, "suite": "tsvgx", "options": {"responsiveness": false, "tpmozafterpaint": false, "tpchrome": true, "tppagecycles": 25, "tpcycles": 1, "tprender": false, "shutdown": false, "extensions": [{"name": "pageloader@mozilla.org"}], "rss": false}}, "results": {"gearflowers.svg": [237.0, 211.0, 200.0, 205.0, 210.0, 212.0, 448.0, 206.0, 205.0, 196.0, 200.0, 203.0, 192.0, 196.0, 204.0, 213.0, 216.0, 198.0, 204.0, 214.0, 221.0, 216.0, 215.0, 217.0, 220.0], "hixie-007.xml": [1871.0, 1827.0, 1833.0, 1816.0, 1813.0, 1808.0, 1815.0, 1802.0, 1804.0, 1808.0, 1800.0, 1810.0, 1805.0, 1818.0, 1861.0, 1879.0, 1833.0, 1842.0, 1845.0, 1824.0, 1822.0, 1832.0, 1836.0, 1878.0, 1844.0], "hixie-003.xml": [181.0, 149.0, 150.0, 152.0, 151.0, 152.0, 146.0, 160.0, 144.0, 146.0, 157.0, 142.0, 143.0, 142.0, 153.0, 153.0, 142.0, 148.0, 147.0, 152.0, 154.0, 141.0, 164.0, 148.0, 154.0], "composite-scale-rotate.svg": [42.0, 40.0, 48.0, 39.0, 36.0, 41.0, 41.0, 40.0, 39.0, 47.0, 54.0, 40.0, 40.0, 42.0, 41.0, 38.0, 39.0, 39.0, 44.0, 43.0, 39.0, 42.0, 42.0, 43.0, 40.0], "hixie-006.xml": [2180.0, 2189.0, 2216.0, 2242.0, 2165.0, 2133.0, 2234.0, 2171.0, 2175.0, 2185.0, 2187.0, 2224.0, 2215.0, 2163.0, 2185.0, 2213.0, 2157.0, 2244.0, 2152.0, 2181.0, 2226.0, 2226.0, 2233.0, 2212.0, 2235.0], "composite-scale-rotate-opacity.svg": [45.0, 41.0, 38.0, 44.0, 47.0, 46.0, 40.0, 40.0, 45.0, 47.0, 39.0, 40.0, 43.0, 38.0, 41.0, 42.0, 43.0, 44.0, 44.0, 41.0, 41.0, 41.0, 40.0, 44.0, 42.0], "composite-scale.svg": [80.0, 39.0, 42.0, 37.0, 41.0, 41.0, 41.0, 44.0, 43.0, 40.0, 37.0, 40.0, 42.0, 42.0, 42.0, 44.0, 41.0, 42.0, 44.0, 42.0, 43.0, 38.0, 39.0, 42.0, 40.0], "hixie-005.xml": [1354.0, 1314.0, 1310.0, 1301.0, 1320.0, 1339.0, 1325.0, 1328.0, 1330.0, 1317.0, 1317.0, 1331.0, 1320.0, 1299.0, 1325.0, 1309.0, 1305.0, 1350.0, 1317.0, 1352.0, 1353.0, 1316.0, 1333.0, 1313.0, 1317.0], 
"hixie-001.xml": [406.0, 396.0, 388.0, 403.0, 411.0, 434.0, 398.0, 384.0, 387.0, 385.0, 392.0, 390.0, 394.0, 388.0, 409.0, 377.0, 385.0, 387.0, 388.0, 381.0, 397.0, 391.0, 385.0, 382.0, 388.0], "hixie-002.xml": [384.0, 394.0, 377.0, 388.0, 397.0, 415.0, 404.0, 411.0, 402.0, 411.0, 406.0, 413.0, 409.0, 407.0, 417.0, 389.0, 388.0, 398.0, 393.0, 389.0, 384.0, 400.0, 399.0, 408.0, 409.0], "hixie-004.xml": [300.0, 296.0, 332.0, 290.0, 346.0, 355.0, 344.0, 351.0, 341.0, 323.0, 320.0, 298.0, 296.0, 333.0, 288.0, 321.0, 341.0, 296.0, 298.0, 330.0, 321.0, 314.0, 290.0, 309.0, 317.0], "composite-scale-opacity.svg": [46.0, 42.0, 42.0, 40.0, 42.0, 42.0, 42.0, 39.0, 42.0, 40.0, 39.0, 41.0, 40.0, 40.0, 43.0, 43.0, 43.0, 40.0, 41.0, 40.0, 38.0, 38.0, 44.0, 46.0, 44.0]}, "test_build": {"version": "32.0a1", "revision": "a49657b0b23b", "id": "20140603105048", "branch": "", "name": "Firefox"}}]

You can see that for each page, we ignore the first 5, then take the median of the remaining 20; that is the single number we report to graph server.
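So the per-page value that reaches graph server reduces to (a sketch; the warm-up count of 5 is taken from the description above, and the function name is hypothetical):

```python
from statistics import median

def page_value(replicates, warmup=5):
    """Drop the first `warmup` cycles of a page (warm-up noise), then
    take the median of the remaining replicates; this is the single
    per-page number sent in the VALUES block."""
    return median(replicates[warmup:])
```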

Here is the code that takes the post data and does something with it; you can see we key off of 'VALUES':
http://hg.mozilla.org/graphs/file/217745463427/server/pyfomatic/collect.py#l272

That moves us to:
http://hg.mozilla.org/graphs/file/217745463427/server/pyfomatic/collect.py#l175

and in line 204, we actually insert each single value. 


In summary we have access to a data point for each page, we can consider alerting if a single page has regressed greatly.

Some thoughts:
* should we stop ignoring the max value and just average all pages together?
** if we don't do that should we consider posting a dummy page that has a value of 1000000 or something?
* this would greatly increase the time to analyze data; could we be smarter about this and not do it for all pages, etc.?
(Reporter)

Comment 3

4 years ago
(In reply to Joel Maher (:jmaher) from comment #2)
> You can see that for each page, we ignore the first 5, then take the median
> of the remaining 20, that is the single number we report to graph server.

I can't see it from these outputs, but it's aligned with what I know.
 
> and in line 204, we actually insert each single value. 

I'll take your word. I don't trust my own parsing of that code.

> In summary we have access to a data point for each page, we can consider
> alerting if a single page has regressed greatly.

This is great. In my mind, this (having good enough data for this on graphserver) was a big question mark.

> Some thoughts:
> * should we stop ignoring the max value and just average all pages together?

I've considered it in the past, and I think we should. However, this could bring a new issue where the highest value is of a different magnitude than the others, and therefore gets a much higher weight in the average.

Which is where comment 1 offers a solution (geometric mean).

Regardless of which change we decide on, however, it's important to remember that once this change has taken place, all the new results will _not_ be comparable to older results, due to the different formula which produces them.

However #2, since we have all the historic data anyway, the new formula could also be applied to historic data (e.g. at graphserver), which will make newer results comparable to older results and would just have a different view for the old/new results.

> ** if we don't do that should we consider posting a dummy page that has a
> value of 1000000 or something?

Yes. This is also something which came up in the past. It doesn't _need_ to be a fixed value; it could be 10% higher (or lower, depending on whether higher is worse for this test) than the highest value.

> * this would greatly increase the time to analyze data, could we be smarter
> about this and not do it for all pages, etc.?

How much of an issue/bottleneck is this processing? If it turns out to be a real issue, then sure, we could do this better processing less frequently, but personally I think we should start without such optimizations and see how bad it gets, if at all.
Comment 4

4 years ago
Right now the process hangs often (maybe once/month). I doubt it uses up 100% CPU load, but alerting per page would probably add 10 times the processing to the system (think 49 pages for tp5o, we would have to compare all of those).

Would we want to compare the pages for a sustained regression as we do the *average*?

Assuming all is well, this would greatly increase the volume of alerts. We have already needed to build a webapp to manage the alerts; we could adapt it to work with the larger volume. My gut tells me (from data I get on datazilla alerts) that we will see 4 times the alerts.

Any thoughts on this?
* limit the higher resolution to specific tests?
* ensure we have a solid webapp (treeherder) to handle this?
* only alert if a single page is >10% (or xx%) (probably <twice the volume)
* would we need to map the alert to the test name (i.e. alipay.com regresses, this means tp5 needs to run)
* how do we handle a bimodal test page?
(Reporter)

Comment 5

4 years ago
(In reply to Joel Maher (:jmaher) from comment #4)
> ... but alerting per page would probably add 10 times the
> processing to the system (think 49 pages for tp5o, we would have to compare
> all of those).

No, TP5 doesn't need this kind of processing, and therefore could be left using the old processing method.

> Would we want to compare the pages for a sustained regression as we do the
> *average*?

Not sure I understand the question. Rephrase please.

> Assuming all is well, this would greatly increase the volume of alerts. 

I don't think so. Or at least not necessarily, and regardless, we could adjust the thresholds to keep the same number of alerts, if this issue matters.

What we would hopefully get is higher-quality alerts (both in what they catch and in how they report).

> Right now we have needed to build a webapp to manage the alerts- we could
> adapt this to work with the larger volume.

Again, not sure I'm following. Are you saying we've built (and are now using) some web app which was built specifically to analyze results? And that this would need some modification - to handle a different volume of alerts (which doesn't _have_ to be bigger), or for some other reason?


> Any thoughts on this?
> * limit the higher resolution to specific tests?

Absolutely.

But first let's list the potential changes which some tests could benefit from:
1. Don't drop the highest median.
2. Use geometric mean instead of average (assuming currently we do average).

These two alone will improve the quality of the graphs and alerts without changing anything else.

3. When reporting and checking for regressions, also check the values of sub tests.

#3 is the only thing which requires more processing etc.


- TP5 definitely doesn't need such high res.

- tscrollx and tsvgx are not terrible as-is, but we've already seen cases of missed tsvgx regressions because only one page regressed (either because it was dropped as an outlier or because it didn't affect the average enough). So they would definitely benefit from 1+2, but maybe we could skip 3.

- TART and CART (and possibly WebGL) - would benefit from 1+2 and also from 3.

- tresize - I _think_ it doesn't need highres, but I'll need to examine it.

- Other tests: not sure, I'll need to examine.

Regardless, we can decide which tests will get this extra processing, and right now I think it's most important for only a few tests.

> * ensure we have a solid webapp (treeherder) to handle this?

I don't know what this is...

> * only alert if a single page is >10% (or xx%) (probably <twice the volume)

I'd say the exact same process which we currently apply to the average (with noise level/Z etc.), just also for each sub-test, and trigger an alert if it's above some threshold, possibly a relative one, e.g. if a specific sub-test regressed more than twice what the average regressed.

> * would we need to map the alert to the test name (i.e. alipay.com
> regresses, this means tp5 needs to run)

The report will need to display the sub test name, yes. What does "this means tp5 needs to run" mean?

> * how do we handle a bimodal test page?

The same we handle them now: 1. try to not have bimodal results. 2. ignore them if they're not reliable. 3. try to still use them somehow even while being bimodal.
Comment 6

4 years ago
>> Would we want to compare the pages for a sustained regression as we do the
>> *average*?
>
>Not sure I understand the question. Rephrase please.

Currently the single-point average of all the pages is compared with pre/post data points to verify a sustained regression; would we do the exact same process for each individual page?
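That pre/post comparison could look roughly like this per page (window size and threshold here are made up for illustration; graphserver's actual sustained-regression analysis has its own parameters):

```python
from statistics import mean

def sustained_regression(history, window=6, threshold_pct=3.0):
    """Compare the mean of `window` points before a suspect change with
    the mean of the `window` points after it; flag a sustained
    regression when the increase exceeds threshold_pct (higher = worse).

    history: the values around the suspect point, oldest first.
    """
    if len(history) < 2 * window:
        return False  # not enough data on both sides yet
    before = mean(history[:window])
    after = mean(history[window:2 * window])
    return (after - before) / before * 100.0 > threshold_pct
```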

----

The alert UI (http://54.215.155.53:8080/alerts.html#) is very primitive and works for the volume and style of alerts we have now. Assuming per-page alerts, we would have to adjust it, and with a larger volume of alerts we might need to work the user interface differently. I am working on that for the datazilla alerts, so this might not be a big issue.

----

Getting graph server to alert on specific pages and tests with different thresholds is doable, but it will be some work. I don't imagine it to be too easy, and there will be some lag between changes and live deployment.

----

>> * would we need to map the alert to the test name (i.e. alipay.com
>> regresses, this means tp5 needs to run)
>
>The report will need to display the sub test name, yes. What does "this means tp5 needs to run" mean?

If iframe.svg fails, we need to let people know that it is part of the test tsvgx so they can run it locally; more code will be needed in the alerting or in the talos harness to support this.

----

If we were to alert on pages from graph server, we would need a UI for it. Right now datazilla provides that, if you can get it to load data. Without a history of a specific page, an alert loses importance: a developer wants to see that the regression is real and how relevant it is, and whether the change is at their changeset or another.

Similarly, if we retrigger on tbpl, we would need a quick way to view the subtest results. Right now you can retrigger on tbpl and look at the summary pane to see the reported number by clicking on each completed instance.


----

Still, the suggestion of the geometric mean for all the pages (instead of dropping one and averaging the rest) seems to be by far the most effective. While I want per-page alerts, I think it is too early to implement that.
(Reporter)

Comment 7

4 years ago
(In reply to Joel Maher (:jmaher) from comment #6)
> currently the single point average of all the pages is compared with
> pre/post data points to verify a sustained regression, would we do the exact
> same process for each individual page?

I would think so, yes. The regression detection system (i.e. function which looks at previous and possibly future data points and decides if the current point is a meaningful deviation) should not be modified IMO. Only applied, on top of the regular average tracking, also to individual sub-results.

> The alert UI (http://54.215.155.53:8080/alerts.html#) is very primitive and ..

> getting graph server to alert on specific pages, tests with different
> thresholds is doable, but it will be some work.  I don't imagine it to be
> too easy and there will be some lag between changes and live deployment.

> if iframe.svg fails, we need to let people know that is part of the test
> tsvgx so they can run it locally- more code will be needed in the alerting
> or in the talos harness to support this.

> If we were to alert on pages from graph server we would need a UI for it.

Right, so I'm guessing that all those are systems which use the alerts. Depending on how these systems are implemented, some or all of them would need to be modified for this.

What I had in mind was mostly the dev.tree-management mailing list and the regression bugs which you file. Would we be able to limit the scope of this high-res detection to only those two? would this reduce the amount of work for this?


> Similarly if we retrigger on tbpl, we would need a quick way to view the
> subtest results.  Right now you can retrigger on tbpl and look at the
> summary pane to see the reported number which clicking on each completed
> instance.

It would indeed, and the datazilla link would provide it because it displays individual sub-results. tbpl currently doesn't show regressions anyway - just numbers. We could modify tbpl to show sub-results as well, but TBH, I don't think it's required at this stage (and possibly never).


> Still the suggestion of the geometric mean for all the pages (instead of
> drop one and average the rest) seems to be by far the most effective.

I don't know if most, but quite important, yes.

> While I want per page alerts, I think it is too early to implement that.

Too early why? We already have cases where it would help, the tests which benefit from this are already running for many months... and we could have been benefiting from such higher resolution right now if it was working. How is it early?
Comment 8

4 years ago
It is too early because we have no way to view the history of an individual test page. Datazilla is not a solution and won't be for the short term; the webpages are not loading consistently (if at all). Also, the rest of the tooling needs to be in place.

I see these problems:
* I am the only one who understands these and files bugs
* Without a proper UI and method for testing on try to understand the results, developers will just be confused and distrust all of talos or absolve themselves of responsibility for reported regressions.
(Reporter)

Comment 9

4 years ago
We already use datazilla when only some of the sub-results have changed, which happens occasionally. It's not perfect yet, but it's the best we have right now. Compare-talos is also available as an online tool to compare sub-results effectively between two builds.

I'm guessing it should be possible to modify it (if at all required) to show sub-results from a single cset (i.e. without comparing to another cset). Matt?

True, graphserver doesn't currently display sub-results, but changing graphserver's display is not a requirement IMO for us to be able to detect regressions in high-res.

And if you think it's absolutely a requirement, then let's fix that too, and if you can't fix it, then we'll either find someone who can, or I'll give it a go myself.

It's certainly not too early as far as our need for it goes.
Flags: needinfo?(MattN+bmo)
(Reporter)

Comment 10

4 years ago
Regardless, the geometric mean and not disregarding the highest sub-result indeed probably requires much less effort, and only involves changing one formula (probably at graphserver, but possibly other places which duplicate it) - for tests we decide should be calculated this way (e.g. not for TP5), and the rest will happen automatically.

Can we declare this as stage one, and in parallel unlift the blocks for stage 2 which is the high-res regressions detection?
(Reporter)

Comment 11

4 years ago
errr.. s/unlift/lift/
Comment 12

4 years ago
stage 1:
* will we rename the tests as the numbers could change significantly?
* this should apply to all tests, not just webgl or tsvgx, any objections?

Here is where we ignore the max and average the results;
http://hg.mozilla.org/graphs/file/217745463427/server/pyfomatic/collect.py#l212

basically we do this:
select avg(value) from test_run_values where test_run_id = %s and value not in (select max(value) from test_run_values where test_run_id = %s)

instead we could do this:
select exp(sum(log(value))/count(value)) from test_run_values where test_run_id = %s
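That SQL expression is the log-domain form of the geometric mean, exp(mean(log(x))), which also avoids overflow when many values are multiplied. A quick check that it matches the direct definition, using a few of the tsvgx page medians above:

```python
from math import exp, log, prod

values = [209.0, 42.0, 388.0, 1322.5]  # sample per-page medians from above

# Log-domain form, as in the SQL: exp(sum(log(v)) / count(v)).
log_form = exp(sum(log(v) for v in values) / len(values))

# Direct definition: the n-th root of the product.
direct = prod(values) ** (1.0 / len(values))

assert abs(log_form - direct) < 1e-9 * direct  # equal up to rounding
```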

Actually, doing this on new test names sounds like the right thing to do. This way we can have a solid history (not a prepopulated history of averages).

If we do that, maybe we could create a new function in graph server to process the data which does the geomean and we would call that if we send data of type GEOVALUES (instead of type VALUES):
http://hg.mozilla.org/graphs/file/217745463427/server/pyfomatic/collect.py#l266

Then the tests we want to report this can be modified in talos to be 'tsvg_geo' or something like that.

As for round 2, even if we disagree about the requirements of tools we can at least do some of the pre-requisites.
(Reporter)

Updated

4 years ago
Depends on: 1021842
(Reporter)

Comment 13

4 years ago
(In reply to Joel Maher (:jmaher) from comment #12)
> stage 1:
> * will we rename the tests as the numbers could change significantly?
> * this should apply to all tests, not just webgl or tsvgx, any objections?

I've filed bug 1021842 for stage 1. Let's continue this discussion there and keep this bug for the "proper" high-resolution detection.


> As for round 2, even if we disagree about the requirements of tools we can
> at least do some of the pre-requisites.

OK, what are the prerequisite items as far as you're aware? Please mention which you think are really important (and why), and which you think could maybe wait for a later stage.

Do keep in mind that for people who bump into regressions only at individual sub-tests (which is not rare at all), using either datazilla or compare-talos is already a requirement. We file regression bugs, and they can't use graphserver or tbpl to track their try-pushes (because they don't show the stuff their pushes affect), so they _have_ to use datazilla or compare-talos already today.
(Reporter)

Comment 14

4 years ago
Also, we could start with applying the high-res detection and output it (wherever we do regressions detection) into some side-log without actually changing the current notifications to the mailing list or the regression bugs which we file.

This will allow us to understand what changes are required and where - to apply this high-res detection, and also evaluate its outputs without actually affecting anything else.

Joel, what say you? sounds good as a first step for this?
Flags: needinfo?(jmaher)
Comment 15

4 years ago
I think a win/win here would be to create a new table in the database that stores all the alerts, with a simple JSON interface to query them. No other logic is needed, just the raw data we send to dev.tree-management. This would benefit my UI greatly (no need to sync up with the newsgroup, and I could get more accurate alerts).

In addition this would store per page regressions.  The only consumer for the time being will be the alerts UI, so there would be no confusion in emails.  We could let this sit for a few weeks and quantify the volume of alerts.

My only concern is the load this puts on the server.  We should be careful about that and design this with a way to throttle (i.e. ignore pages in list or only do pages in list)
Flags: needinfo?(jmaher)
(Reporter)

Comment 16

4 years ago
I'm not following.

As far as I understand, the graph server already stores all the data we need in order to detect regressions in higher resolution - these are the sub-tests results (or medians thereof).

All we need to do, on top of the "normal" detection, is apply exactly the same procedure also to sub-results (which are already stored at the db), possibly with slightly higher thresholds. That's it.

Why a new table? What does it mean to "store an alert"? To what info expressed so far does "win/win" refer?
Comment 17

4 years ago
AFAIK the graph server does not store any data about alerts; it just does a calculation on the data and sends email if it detects a regression. So if we could put these alerts into a database that we can query in JSON format, we could do this more easily without affecting developers.
(Reporter)

Comment 18

4 years ago
Joel and I just had a chat on IRC. We both agree that the process should be as follows:

1. Implement the new detection code.
2. Put its output someplace such that we can evaluate and tune it.
3. Replace the old alerts once we're happy with it.

Since 1 will be no-op without 2, comment 15 tries to first solve 2.

FWIW, I don't care much to which system it reports as long as we can examine its output and tune the system. Though it should be a temporary thing because step 3 will use the new alerts instead of the old ones.

Joel currently uses this "Talos Alerts UI" page: http://54.215.155.53:8080/alerts.html - It parses newsgroup posts and generates the UI to manage the alerts (tracking them, filing regression bugs, etc.).

We both agreed that the high-res detection code could report to exactly the same newsgroups, only with a title tag like "[ignore - evaluating] xx% regression.. blah blah". The current UI could then ignore it; OTOH, it could be extended to allow showing either old or new alerts with some UI switch, and eventually, once we're happy, it will only be used with the new high-res alerts.

Emails to patch authors could either be suppressed altogether or sent with this title tag so hopefully they'll get ignored.

Anyway, the title tag hack gives us the storage (or "temp side log", etc.) without adding a new table to the DB, and the UI could be adapted to process it lazily.
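The filtering side of that hack could be as small as a prefix check on the post subject (the tag text is taken from the comment above; the function name is hypothetical):

```python
EVAL_TAG = "[ignore - evaluating]"

def is_evaluation_alert(subject):
    """True for high-res alerts posted under the evaluation title tag,
    so the current alerts UI and email consumers can skip them."""
    return subject.lstrip().startswith(EVAL_TAG)
```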

Back to square one then? actually calculate the new regressions? :)
(Reporter)

Comment 19

4 years ago
Apparently there's been very similar work making progress on back channels (github, irc, emails, https://etherpad.mozilla.org/SignalFromNoise , etc), which is behind bug 962378.

The goal seems identical to this bug - be able to detect regressions using more data than just "the average of sub-results".

That other bug aims at using datazilla though rather than graphserver (even if this bug is not about graphserver or datazilla but generally about detecting regressions better).

Still not sure if they should be combined or not, but I just became aware of it, so I decided to post.

Kyle, Joel, do you think we should merge these bugs into one and focus our efforts productively only on one of the approaches?
Flags: needinfo?(klahnakoski)
Flags: needinfo?(jmaher)
See Also: → bug 962378
Comment 20

4 years ago
I don't think we can merge these bugs quite yet, for a couple of reasons:
1) Datazilla is not reliable yet; depending on alerts that reference and point to a system that doesn't provide a stable UI is not a step forward.
2) Mozilla is familiar with graph server; we can use these per-page alerts as a bridge to migrate to datazilla.

I see it working this way:
* continue to refine our alerts on datazilla
* tweak our UI to work with per page alerts (applicable to datazilla + graph server)
* when we file bugs that are per page specific, it will get people using datazilla, compare-talos, etc. (i.e. not graph server)
* once we have validated that datazilla alerts are >= graph server alerts, we should have a full performance data interface in treeherder
* Q3 we can lay the foundation for transitioning fully to Datazilla
* Q4 we can do the transition
* Q1 we can turn off graph server for good 

This means developing alerts in parallel for a while.  A lot of the lessons learned to date on datazilla will reduce the amount of work on graph server.  Likewise the more eyes on alerts from graph server will find the remaining corner cases in the datazilla alerts.
Flags: needinfo?(jmaher)
(Reporter)

Comment 21

4 years ago
The reasons I think they should be combined are as follows:

1. They both aim to solve _exactly_ the same problem.
2. The considerations on how to evaluate discrete per-sub-result regressions and how to send the alerts are identical.

The fact that one (this) was focused on graphserver as backend and the other on datazilla is purely circumstantial (because I wasn't aware of the other effort and graphserver seems to have enough data for this).

It shouldn't really matter which backend we use for this system, because the result would/should be the same. Any difference due to still more data at datazilla (like the individual 25 runs of each page) is _probably_ not meaningful enough to make these 2 different projects.

Also, this bug discusses the problem in detail and publicly suggests and weighs approaches for a solution; the other project doesn't seem to have (as far as I'm aware at this stage) any kind of design, goals, or evaluation steps that were discussed publicly in a "normal" bug.

And since this will affect many people, it deserves all the attention it can get, and keeping these efforts independent would be a plain waste of time.

I think we should:

1. Decide and agree on a method to detect "high-res" alerts for both this and the other bug.

2. Evaluate the product of this system (as apparently happened to some degree) and refine it to a level we all agree on.

3. maybe implement the above with both graphserver and datazilla backends independently. _Maybe_. The big effort should be at the system level and methods. Not the backend used.
Comment 22

4 years ago
So high-res alerts on graph server give us data, but no way to view it. We would have to view the raw data on Datazilla. We could code up some way for graph server to display per-page information, or push to get datazilla to return results a bit faster in the UI and continue to use that.

Either way we need a way to view the alerts, filter them, and take action on them. I am starting to agree that one way is the right way; maybe, since we are halfway there with datazilla alerts, go for it and live with it.

Kyle what do you think?
(Reporter)

Comment 23

4 years ago
You're talking details, I'm talking essence.

The fact that graphserver and datazilla are different shouldn't matter when we design the system. High res alerts are useful regardless of the back end used to generate them, though obviously they'll be more useful if they can be examined after they're generated.

These high res regressions could be generated either using graphserver backend or datazilla backend, and viewing them right now is only possible with datazilla or compare-talos.

But again, these are implementation details.

We shouldn't have two projects solving exactly the same problem with different backends.
(Reporter)

Comment 24

4 years ago
(In reply to Avi Halachmi (:avih) from comment #23)
> We shouldn't have two projects solving exactly the same problem with
> different backends.

Or, at the very least, they should share everything except how they interact with the backend. "Everything" here means how we define a per-sub-result alert, how many alerts are generated, what kinds of thresholds are used, etc.
Very true.  I have been looking into datazilla and the UI will be deprecated soon.  It is realistic that a beta version, integrated into the tbpl replacement treeherder, will be online in July.  By the end of Q2 it is realistic that we will have per-page graphing tools built into treeherder, which will make performance debugging much easier and solve the majority of our UI problems.  This is a priority project and is actively being worked on right now.

For the short term, we would be duplicating work by writing a new UI on graph server, and nobody is going to work on datazilla to make it snappier.
(Reporter)

Comment 26

4 years ago
(In reply to Joel Maher (:jmaher) from comment #25)
> for the short term, we would be duplicating work writing a new UI on graph
> server and nobody is going to work on datazilla to make it snappier.

If you're convinced that this detection would be meaningless as long as we don't have a proper UI for it, and also convinced that we won't have a proper UI for it until either datazilla stabilizes (which you can't see happening anytime soon) or treeherder can display it (by which time the other project will already be deployed), then I don't see any reason not to wontfix this bug.

Otherwise, I don't understand what kind of work you think should take place in this bug, and why.

I only wish I had known all of this right after the bug was filed, or preferably even earlier. It would have saved us lots of time spent discussing a solution for a problem which is already in the final stages of being solved elsewhere, and where all or most of these decisions have already been made.
In comment #4 and comment #6 I mention datazilla alerts.  This is the right time to start making decisions for these alerts, now that we have a way to reliably get the data and perform operations on it.  What we need are rules:
* what branches
* what tests
* what pages
* improvements or just regressions
* what algorithm
* what window size
* per test only, or per suite as well
* thresholds for each alert
* how to handle bimodal data (i.e. v8, dromaeo, etc.)
* noisy pages such as alipay.com in tp5
* content of the alerts

I am not saying we require a UI to do these alerts, but this is not scalable to a sheriff role or a self-managed developer role without a UI in place.  I want a plan for a UI in place while we investigate and tweak the alerts.

If we are looking to roll these out in 6 weeks, then let's put down answers for the questions above and start watching the alerts.  Within a week we will have enough data to know whether we are on the right path or not.

The need for these alerts happens to come at the right time.  Four weeks ago it might not have been possible to drive on datazilla alerts, but now we are in the right position to make these adjustments.
(Reporter)

Comment 28

4 years ago
(In reply to Joel Maher (:jmaher) from comment #27)
> If we are looking to roll these out in 6 weeks, then lets put answers down
> for these questions above and start watching the alerts.  Within a week we
> will have enough data to know if we are on the right path or not.
> 
> The need for these alerts happen to be at the right time.  4 weeks ago, it
> might not have been possible to drive on datazilla alerts, but now we are in
> the right position to make these adjustments.

So it sounds as if most of these questions were given at least some answers (maybe even all of them, if deployment is only a few weeks away).

Can we start by enumerating the answers to these questions as they manifest on the other project?

And again, is there any reason not to wontfix this and put all the effort into the other project?
We can wontfix this, but we will open a new bug with the details from this one, so let me respond with what I believe we have in place for some of the above questions:

* what branches
** currently: all branches but try, cedar
** ideal: all sheriff-monitored branches (this would ignore ash, jamon, and other experimental branches)

* what tests
** currently: all tests
** ideally: a simple configured list of tests we care about (tsvgx, tscrollx, cart, tart, tp5, a11y) ?
*** do we care about dromaeo, kraken, v8?

* what pages
** currently: all
** ideally: a configurable list to include/exclude specific pages (maybe this isn't needed)

* improvements or just regressions
** currently: only regressions are emailed out
** ideally: a UI to filter on unique regressions, but the ability to see improvements as well

* what algorithm
** currently: median test over the window size (kyle needs to confirm this)
** ideally: TBD

* what window size
** currently: 20 data points before/after
** ideally: a lower number like 10 (graph server uses 12)

* per test only, or per suite as well
** currently: only per-page regressions >5% are reported
** ideally: per page >10%, per suite >5%, TBD

* thresholds for each alert
** currently: 5%
** ideally: TBD

* how to handle bimodal data (i.e. v8, dromaeo, etc.)
** currently: no method (we have 2 problem pages: alipay.com in tp5 and iframe.svg in tsvgx)
** ideally: fix tests, firefox, harness, or just disable the specific page.

* noisy pages such as alipay.com in tp5
** see above about bimodal data.

* content of the alerts
** currently: link to datazilla, current changeset, previous changeset, information about values
** ideally: TBD
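
As a rough illustration of the "median test over the window size" detection described above: compare the median of the window before each data point with the median of the window after it, and alert when the relative change exceeds the threshold. This is only a sketch, not dzAlerts code; the window size (20) and threshold (5%) are the values quoted above, and the function name is made up:

```python
from statistics import median

# Illustrative sketch only (not dzAlerts code): flag points where the
# median of the after-window is more than `threshold` worse than the
# median of the before-window. Higher scores mean slower (worse) here.
def find_regressions(values, window=20, threshold=0.05):
    alerts = []
    for i in range(window, len(values) - window):
        before = median(values[i - window:i])
        after = median(values[i:i + window])
        if before > 0 and (after - before) / before > threshold:
            alerts.append(i)
    return alerts
```

Note that every window straddling a real step change will trigger, so a single regression produces a run of adjacent alert indices unless the detector deduplicates them.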
Just refinements to Joel's answers:

* what tests
** currently: all tests (except dromaeo, which has multiple sub-test results jammed into one array)

* what pages
** currently: all (and we can turn them on/off)

* improvements or just regressions
** currently: only regressions are emailed out (we can alert on regressions, improvements, or both with a switch)

* what algorithm
** currently: median test over the window size (CONFIRMED)
** ideally: piece-wise linear regression analysis (http://en.wikipedia.org/wiki/Segmented_regression)

* what window size
** currently: 20 data points before/after (globally adjustable; working on making it a per-test parameter)

* how to handle bimodal data (i.e. v8, dromaeo, etc.)
** currently: using median as our metric mitigates some of the bimodal problem by naturally selecting the most common mode. 
** ideally: using a mixture model (http://en.wikipedia.org/wiki/Mixture_model) will help characterize the two modes and help us construct a useful aggregate

* noisy pages such as alipay.com in tp5
** short term: with larger windows, noisy data should simply result in less sensitive regression detection.
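
A toy example of why the median mitigates bimodal data (the replicate values are invented: most runs land near a fast mode at ~100, with an occasional slow mode at ~300, e.g. a GC pause):

```python
from statistics import mean, median

# Invented replicate values: a fast common mode near 100 and an
# occasional slow second mode near 300.
replicates = [100, 102, 98, 300, 101, 99, 103, 300, 100, 97]

print(mean(replicates))    # 140.0 -- pulled up by the slow mode
print(median(replicates))  # 100.5 -- stays near the common mode
```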
Flags: needinfo?(klahnakoski)
(Reporter)

Comment 31

4 years ago
I suggest the following terminology:

- Suite: A group of tests which can be chosen to be executed together by tbpl. E.g. suite 's' is tsvgr_opacity, tsvgx, tscrollx, tart, cart.

- Test: like tscrollx or tart, etc.

- Sub-result: (old term: "page"). "Page" is pageloader/tp5 terminology, where each of those was indeed a real single page which the pageloader addon loaded and measured. Many new tests output several such "pages" which are actually sub-results of the test, where each one is independently meaningful. In those cases we shouldn't throw away outlier "pages", because they're not actually pages but meaningful sub-results.

Though since there's no contradiction with "page", it's OK to keep using this name, as long as we remember that its result is independently meaningful and often has nothing to do with any actual page.


(In reply to Kyle Lahnakoski [:ekyle] from comment #30)
> * what tests
> ** currently: all tests (except dromaeo, which has multiple sub-test results
> jammed into one array)

Other tests have sub-results as well, like tscrollx, tart, etc. Why can't we just "split" the dromaeo array into "normal" sub-results like we do with other tests, so datazilla can track them independently, etc.? I bet it's half an hour of work at most, on the test side alone.


> * what pages
> ** currently: all (and we can turn them on/off)

We should turn off tests which are meaningless or completely not useful, etc. From the tests I'm familiar with (mostly the 's' suite), I don't think we should throw away any sub-results.


> * improvments or just regressions
> ** currently: regressions only are emailed out (we can alert regressions,
> improvements, or both with a switch)

> * what algorithm
> ** currently: median test over the window size (CONFIRMED)
> ** ideally: piece-wise linear regression analysis
> (http://en.wikipedia.org/wiki/Segmented_regression)

> * what window size
> ** currently: 20 data point before/after (globally adjustable, working on
> making it a per test parameter)


Any interesting reasons to deviate from the graphserver approach/formulas/parameters? It's not that I have any reason to want the old behavior, but a change usually needs a good reason. E.g.:

- Why drop improvement alerts? IMO they're very nice to get as a dev.
- (if the algorithm differs) why different algorithm?
- Why 20 and not 12 datapoints before/after?

- Why the 5%/10% thresholds rather than what graphserver uses (which I believe is lower than this)?

> * how to handle bimodal data (i.e. v8, dromaeo, etc.)
> ** currently: using median as our metric mitigates some of the bimodal
> problem by naturally selecting the most common mode. 
> ** ideally: using a mixture model
> (http://en.wikipedia.org/wiki/Mixture_model) will help characterize the two
> modes and help us construct a useful aggregate

Same as before: Is it different than graphserver? if yes, for what reason?

> * noisy pages such as alipay.com in tp5
> ** short term: with larger windows, noisy data should simply result in less
> sensitive regression detection.

Without looking at alipay.com graphs: if a page is particularly noisy and we don't have an interesting enough reason to keep it (and on tp5 we probably don't have interesting reasons to stick to any specific page, IMO), then why keep it? Completely without checking any numbers, I'd say that a "random" page with a noise level above ~50 is probably not worth keeping, possibly even much less.

For tests with actual meaningful sub-results (rather than mostly random pages from alexa), a noisy sub-result needs to be examined and probably fixed.

I'm not sure I've seen the answer to this, or maybe it was included in "what alerts: per page": I think that this is too high a resolution by default. The sub-results are grouped into tests for a reason, and this reason is that the sub-results measure different aspects of a specific test or subject.

Therefore, when we send alerts on improvements/regressions, they should alert per test rather than alert-per-page IMO.

HOWEVER, in order to understand if we need to send an alert for a specific test, we should also check whether any of its sub-results independently goes over our threshold (I suggested that if a sub-result regresses by more than twice the average regression of all sub-results, then this sub-result should be displayed/included together with the test regression email).

This means a regression alert could look as follows:

tscrollx regressed 2%: sub-result 1 regressed 10%, sub-result 4 regressed 13%, etc. Group all the regressions on sub-results of a single test into a single regression email for this test.
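
The grouping rule proposed above might be sketched like this. Purely illustrative: the function name, input format, and rendering are made up; only the "more than twice the average regression" rule and the per-test email shape come from the comment:

```python
# Illustrative sketch of the proposed rule: alert per test, and list
# every sub-result whose regression exceeds twice the average
# regression across all sub-results of that test.
def format_alert(test_name, sub_regressions):
    # sub_regressions maps sub-result name -> fractional regression
    # (0.10 == 10% slower; 0.0 == unchanged).
    avg = sum(sub_regressions.values()) / len(sub_regressions)
    notable = sorted((name, r) for name, r in sub_regressions.items()
                     if r > 2 * avg)
    details = ", ".join(f"{name} regressed {r:.0%}" for name, r in notable)
    return f"{test_name} regressed {avg:.0%}: {details}"
```

With a test of 10 sub-results where only two regress (10% and 13%), this yields an alert like "tscrollx regressed 2%: sub-result1 regressed 10%, sub-result4 regressed 13%".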
(In reply to Avi Halachmi (:avih) from comment #31)
> I suggest the following terminology:
My terminology was referring one level deeper (tscrollx is a "suite", and alipay.com is a "test" in the tp5o "suite").  Your terminology is fine for this bug, but then my "sub-results" will be referred to as "sub-sub-results" using your terminology.

> Other tests have sub results as well, like tscrollx, tart, etc. Why can't we
> just "split" the dromaeo array into "normal" sub-results like we do with
> other tests? So datazilla can track them independently etc. I bet it's half
> an hour work at most only on the test side itself.
Yes, breaking out the sub-sub-results in the sub-results into something that can be easily analyzed is planned. 

> We should turn off tests which are meaningless or completely not useful etc.
> From the tests I'm familiar with (mostly the 's' suite) I don't think we
> should throw away any sub-results.
I am agnostic about which tests to report and which tests to turn on/off.  I really need a domain expert, such as yourself, to make those types of decisions.
 
> Any interesting reasons to deviate from the graphserver
> approach/formulas/parameters? It's not that I have any reason to want the
> old behavior, but a change usually needs a good reason. E.g.:
>
> - Why drop improvement alerts? IMO they're very nice to get as a dev.
Alerting on improvements is not actionable.  The reason for dzAlerts is to reduce the human resources needed to isolate performance regressions that are severe enough that action should be taken.  We want to streamline the process of identifying, confirming, and opening bugs on performance regressions.  I think this is where dzAlerts and graph server deviate in purpose.

> - (if the algorithm differs) why different algorithm?
> - Why 20 and not 12 datapoints before/after?
dzAlerts uses the median test, and more points allow it to produce fewer false positives on (non-gaussian) noisy data.  The t-test is too sensitive to the non-gaussian noise we are witnessing and generates too many false alerts.

> - Why the 5%/10% thresholds rather than what graphserver uses (which I
> believe is lower than this).
Same reason as above: a higher threshold produces alerts that are actionable.  Furthermore, the sub-test analysis gives a much stronger signal than can be seen when analyzing a suite aggregate, so we need higher thresholds to compensate.  We can lower the threshold, but the amount of manual work required to confirm the additional regressions may not be worth the gain.  We are still working on the process that consumes these alerts, and as it gets better I am sure this threshold will be lowered.

> > * how to handle bimodal data (i.e. v8, dromaeo, etc.)
> > ** currently: using median as our metric mitigates some of the bimodal
> > problem by naturally selecting the most common mode. 
> > ** ideally: using a mixture model
> > (http://en.wikipedia.org/wiki/Mixture_model) will help characterize the two
> > modes and help us construct a useful aggregate
> 
> Same as before: Is it different than graphserver? if yes, for what reason?
The test results being analyzed by graph server benefit from the dampening effect of aggregation.  The noise seen in the sub-results is more pronounced, but so is the signal.  The t-test strategy used by graph server performs less well in this noisier environment, so the median test is used instead.

> > * noisy pages such as alipay.com in tp5
> > ** short term: with larger windows, noisy data should simply result in less
> > sensitive regression detection.
> 
> Without looking at alipay.com graphs, if a page is particularly noisy and we
> don't have an interesting enough reason to keep it (and on tp5 we probably
> don't have interesting reasons to want to stick to any specific page IMO),
> then why keep it? Completely without checking any numbers, I'd say that a
> "random" page with a noise level of above ~50 is probably not worth keeping,
> possibly even much less.
Despite the noise, we can clearly see a signal in alipay.com.  This requires larger windows to detect with existing algorithms.  I completely agree that fixing or removing noisy tests is a good strategy to improve signal, but sometimes that strategy can take a long time to implement.  For example, the garbage collection pauses are causing multi-modal sub-results in V8, which is taking a while to fix.  In the meantime, a noisy test can still be useful.
 
> Not sure I've seen the answer to, or maybe it was included in "what alerts:
> per page", I think that this is too high a resolution by default. The
> sub-results are grouped into tests for a reason, and this reason is that the
> sub-results measure different aspects of a specific test, or subject.
> 
> Therefore, when we send alerts on improvements/regressions, they should
> alert per test rather than alert-per-page IMO.
The page-level analysis contributes to the alert being actionable by providing information on which pages regressed.  It is rare that a performance regression in a test result is caused by multiple sub-test regressions; usually only one has regressed.

> HOWEVER, in order to understand if we need to send an alert for a specific
> test, we should also check if any of its sub-results independently goes over
> our threshold (I suggested that if a sub-result regresses in more than twice
> the average regression of all sub results, then this sub-result should be
> displayed/included together with the test regression email).
Yes.  It is often the case that one sub-result regresses while the others are unchanged.
(Reporter)

Comment 33

4 years ago
(In reply to Kyle Lahnakoski [:ekyle] from comment #32)
> (In reply to Avi Halachmi (:avih) from comment #31)
> > I suggest the following terminology:
> My terminology was referring one level deeper (scrollx is a "suite", and
> alipay.com is a "test" in the tp5o "suite").  Your terminology is fine for
> this bug, but I then my "sub-results" will be referred to a
> "sub-sub-results" using your terminology.

I've lost you here, and I don't understand what this deeper level you say you referred to is. Even more so, I don't understand whether you chose to keep "your" terminology or "mine" throughout the rest of your post. How does this relate to:

(In reply to Kyle Lahnakoski [:ekyle] from comment #30)
> Just refinements to Joel's answers:
> * what tests
> ** currently: all tests (except dromaeo, which has multiple sub-test results
> jammed into one array)
> 
> * what pages
> ...

This clearly suggests that when you said "test" you referred to "dromaeo" (even if we actually have 2 dromaeo tests: dromaeo_css and dromaeo_dom) or "tscrollx" or "tart", etc.

And that when you say "page" you mean one of the sub-results of those tests, such as dojo.html as a "page" of dromaeo_css, tiled-downscale.html as a "page" of tscrollx, and icon-close_DPI1.all.TART as a "page" of TART.

The datazilla UI supports this terminology, where under "Browse by Tests" it puts test names, such as tscrollx or TART.

I wasn't changing the depth level of this terminology, but rather only suggested changing the terminology from "page" to "sub-result" (or, more accurately, to "sub-test", which produces a single sub-result), because in many cases they're not really pages (as with tart or cart), while in others they really are a "page", such as the alipay.com "page" in the tp5o "test".


> Yes, breaking out the sub-sub-results in the sub-results to something that
> can be easily analyzed is planned.

I'm losing you again. I don't understand what a sub-sub-result is, neither in "my" terminology nor in yours. Please keep in mind that I'm completely unfamiliar with dromaeo, so if it has more "levels" than what I suggested, then I wasn't aware of it. In the dz UI it seems to have the same levels as other tests such as tscrollx or tp5o.

We should really make as much effort as required to align our (and others) terminologies, such that when we discuss stuff we're on the same page, pardon the pun ;)

> > - Why drop improvement alerts? IMO they're very nice to get as a dev.
> Alerting on improvements are not actionable.  The reason for dzAlerts is to
> reduce the human resources to isolate performance regressions that are
> severe enough that action should be taken.  We want to streamline the
> process of identifying, confirming and opening bugs on performance
> regressions.  I think this is where dzAlerts and graph server deviate in
> purpose.

Improvements and regressions are two sides of exactly the same coin, and this coin is called "performance changes as a result of a patch".

By notifying devs only of regressions you're showing them only half of the picture as far as performance changes from their patch go.

While it's true that for the most part regressions are actionable and improvements are not, improvements are usually the result of actions taken on regressions (or other actions), and are therefore as important for devs as regressions are; arguably even more important when they land a change which should make some big improvement and need feedback that results have indeed improved. Improvements are the goal; regressions are undesirable side effects.

Splitting on what's actionable and what's not is not as generic or as good as splitting on what's useful info and what isn't. "Useful" contains info which is not always actionable, yet is still useful. We should pick the latter because it provides more useful info (even if not much more actionable info), and it doesn't cost us more in almost any aspect of "cost".

Not to mention the morale boost of an improvement notification vs a regression notification...

> > - (if the algorithm differs) why different algorithm?
> > - Why 20 and not 12 datapoints before/after?
> dzAlerts uses the median test, and more points allow it to produce less
> false positives on (non-gaussian) noisy data.  The t-test is too sensitive
> to the non-gaussian noise we are witnessing and generating too many false
> alerts.
> 
> > - Why the 5%/10% thresholds rather than what graphserver uses (which I
> > believe is lower than this).
> Same reason as above: A higher threshold produces alerts that are
> actionable.  Furthermore, the sub-test analysis gives a much stronger signal
> than can be seen when analyzing a suite aggregate, so we need higher
> thresholds to compensate.

Did you get back to using "your" terminology where a tscrollx is a "suite" rather than a "test"? (I'm not nitpicking - I'm really asking to check if I understand you correctly. I really think we should align our terminology for this bug and all other places, which is the reason I started with it in my previous reply. I don't care which terminology it is, as long as we're all aligned on using the same terminology).

Regardless, when you say better algorithm, less noise, etc.: is there any kind of data which shows how bad the old "system" is, and how these changes make the new system better? Does this data also show what we lose by making these changes (e.g. maybe not discovering some changes which were discovered before)?

> > Same as before: Is it different than graphserver? if yes, for what reason?
> The test results being analyzed by graph server benefit from dampening
> effect of aggregation.  The noise seen in the sub-results is more
> pronounced, but so is the signal.  The t-test strategy used by graph server
> performs less well in this noisier environment: so the median test is used
> instead.

Same as before: is there any data which shows how the change affects the detections? E.g. by taking the same data, applying both methods to it, and examining how the output differs?


> Despite the noise, we can clearly see a signal in alipay.com.

I specifically said "random page". As in, a page which we don't have any particular reason to keep. For such pages, if they're too noisy to be useful, we should just drop them rather than jumping through hoops to make their data useful.

OTOH, for sub-tests, if those are noisy, then they should be investigated at the very least, because typically a sub-test has a specific goal (compared to a random "page" which is just another "random" data point). And if hoops need to be jumped through to make those useful, then sometimes it could be worth the effort.

The main difference between a sub-test (sub-result) and a page is that while both are at the same "depth" level, "page" is picked as a relatively random data point (such as X most popular pages on alexa on tp5o test), while a sub-test was specifically designed to reflect a well-intended perspective of performance.

Of course, if the signal is strong enough to be useful despite the noise level, then my comment doesn't apply at all.


> > Therefore, when we send alerts on improvements/regressions, they should
> > alert per test rather than alert-per-page IMO.
> The page-level analysis contributes to the alert being actionable by
> providing information on which pages regressed.  It is rare that a
> performance regression in a test result is a caused by multiple sub-test
> regressions; usually only one has regressed.

I'm not sure I understand whether your explanation is meant to support my suggestion, to provide an alternative one, or something else. I _think_ you're agreeing with me and providing a bit more info?
(In reply to Avi Halachmi (:avih) from comment #33)
> I've lost you here and I don't understand what's this deeper level you say
> you referred to. Even more so, I don't understand if you chose to keep
> "your" terminology or "mine" throughout the rest of your post. How does this
> relate to:
> 
> (In reply to Kyle Lahnakoski [:ekyle] from comment #30)
> > Just refinements to Joel's answers:
> > * what tests
> > ** currently: all tests (except dromaeo, which has multiple sub-test results
> > jammed into one array)
> > 
> > * what pages
> > ...
> ...
> 
> > Yes, breaking out the sub-sub-results in the sub-results to something that
> > can be easily analyzed is planned.
> 
> I'm losing you again. I don't understand what a sub-sub-result is,
Looking at a dromaeo_dom test, it has 4 sub-results, one of which is attr.html. (https://datazilla.mozilla.org/?start=1401710458&stop=1402474256&product=Firefox&repository=Mozilla-Inbound&os=win&os_version=6.1.7601&test=dromaeo_dom&graph_search=817ede736aab&tr_id=5747657&graph=attr.html&x86=true&x86_64=false&project=talos)  If you look at the replicates of any one sub-result of attr.html, you will notice they are clustered in groups of 5 (click on a data point and see the test replicates in the bar chart at the bottom).  These clusters exist because attr.html itself has child tests, each run 5 times.  I do not know what they are named, nor what they are testing, but let us call them "attr.html.1" thru "attr.html.6" respectively. attr.html.1 is an example of the sub-sub-results I was referring to.

These sub-sub-results are found in all dromaeo_dom and dromaeo_css tests and make the existing sub-result analysis useless.  They must be broken up, named (like above), and given first-class status for analysis.
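
Splitting those clusters back out could be as simple as grouping the flat replicate array. This is a hypothetical sketch only: the "attr.html.N" names follow the ad-hoc naming above, and the real replicate ordering may differ:

```python
# Hypothetical sketch: break a flat list of replicates into the
# clusters of 5 described above, naming them "attr.html.1" onward.
def split_replicates(replicates, group_size=5):
    return {
        f"attr.html.{i // group_size + 1}": replicates[i:i + group_size]
        for i in range(0, len(replicates), group_size)
    }
```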

> We should really make as much effort as required to align our (and others)
> terminologies, such that when we discuss stuff we're on the same page,
> pardon the pun ;)
I hope my example cleared up the confusion I caused with "sub-sub-results".  I wish I had a better name for these in your terminology.

> > > - Why drop improvement alerts? IMO they're very nice to get as a dev.
> > Alerting on improvements are not actionable.  The reason for dzAlerts is to
> > reduce the human resources to isolate performance regressions that are
> > severe enough that action should be taken.  We want to streamline the
> > process of identifying, confirming and opening bugs on performance
> > regressions.  I think this is where dzAlerts and graph server deviate in
> > purpose.
> 
> Improvements and regressions are two sides of exactly the same coin, and
> this coin is called "performance changes as a result of a patch".
> 
> By notifying devs only of regressions you're showing them only half of the
> picture as far as performance changes from their patch go.
> 
> While it's true that for the most part regressions are actionable and
> improvements are not, improvements are usually the results of those actions
> taken on regressions (or other actions), and therefore as important for devs
> as regressions are, and arguably even more important when they land a change
> which should make some big improvement and they need feedback that results
> have indeed improved. Improvements are the goals, regressions are
> undesirable side effects.
> 
> Splitting on what's actionable and what's not is not as generic or as good
> as a split on "what's useful info and what isn't". "Useful" contains info
> which is not always actionable, and it's still useful. We should pick the
> latter because it provides more useful info (even if not much more
> actionable info), and it doesn't cost us more in almost any aspect of "cost".
> 
> Not to mention the morale boost of an improvement notification vs a
> regression notification...
Interesting.  I have not given much thought to who or what would consume improvement alerts.  If we have a channel to broadcast such alerts, I will not argue against doing so.


> > > - (if the algorithm differs) why different algorithm?
> > > - Why 20 and not 12 datapoints before/after?
> > dzAlerts uses the median test, and more points allow it to produce less
> > false positives on (non-gaussian) noisy data.  The t-test is too sensitive
> > to the non-gaussian noise we are witnessing and generating too many false
> > alerts.
> > 
> > > - Why the 5%/10% thresholds rather than what graphserver uses (which I
> > > believe is lower than this).
> > Same reason as above: A higher threshold produces alerts that are
> > actionable.  Furthermore, the sub-test analysis gives a much stronger signal
> > than can be seen when analyzing a suite aggregate, so we need higher
> > thresholds to compensate.
> 
> Did you get back to using "your" terminology where a tscrollx is a "suite"
> rather than a "test"? 
Doh!  Yes, it should have read "...seen when analyzing a TEST aggregate..."

> Regardless, when you say better algorithm, less noise, etc, is there any
> kind of data which shows how bad is the old "system", and how these changes
> make the new system better? Does this data also show what we lose by making
> these changes? (e.g. maybe not discovering some changes which were
> discovered before?).
We are going through the process of comparing the graph server alerts with dzAlerts to ensure they are of higher quality, meaning they catch everything graph server does, and more.  We still have issues with the greater sub-test noise causing too many false positives (or alert redundancies), but we are close to solving that too.  Joel can confirm we can already catch regressions that graph server is missing (but not many).

> > > Same as before: Is it different than graphserver? if yes, for what reason?
> > The test results being analyzed by graph server benefit from dampening
> > effect of aggregation.  The noise seen in the sub-results is more
> > pronounced, but so is the signal.  The t-test strategy used by graph server
> > performs less well in this noisier environment: so the median test is used
> > instead.
> Same as before: any data which shows how does the change affects the
> detections? E.g. by taking the same data, applying both methods on the data,
> and examining how the output differs?
I think the answer above answers this too.
 
> > > Therefore, when we send alerts on improvements/regressions, they should
> > > alert per test rather than alert-per-page IMO.
> > The page-level analysis contributes to the alert being actionable by
> > providing information on which pages regressed.  It is rare that a
> > performance regression in a test result is a caused by multiple sub-test
> > regressions; usually only one has regressed.
> 
> I'm not sure I understand if your explanation is meant to support my
> suggestion or to provide an alternative one, or for some other reason. I
> _think_ that you're agreeing with me and provides a bit more info?
Yes, I am agreeing with you, and adding more info: specifically, in practice, when you see a regression in a test aggregate, it is usually caused by only one sub-test, with the others unchanged.  Furthermore, we sometimes see regressions in one sub-test while seeing improvements in other sub-tests of the same test; this would imply it's possible to have sub-test regressions while the test aggregate appears unchanged.
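
A tiny numeric illustration of that last point (invented numbers): one sub-test regresses while another improves by the same amount, so the aggregate looks unchanged:

```python
from statistics import mean

# Invented sub-test scores (higher == slower).
before = {"sub_a": 100, "sub_b": 100, "sub_c": 100}
after  = {"sub_a": 120, "sub_b":  80, "sub_c": 100}  # sub_a regressed 20%,
                                                     # sub_b improved 20%

print(mean(before.values()))  # 100 -- aggregate before
print(mean(after.values()))   # 100 -- aggregate after: looks unchanged
```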
(Reporter)

Comment 35

4 years ago
(In reply to Kyle Lahnakoski [:ekyle] from comment #34)
> Looking at a dromaeo_dom test, it has 4 sub-results, one of which is
> attr.html.
> (https://datazilla.mozilla.org/ ...)
> If you look at the replicates of any one sub-result of attr.html

What are replicates? Are these the individual runs which talos does in a single "instance" (do we have a term for "instance" or "one run" etc.?), which are usually 25 runs (resulting in 25 individual numbers, all for the same sub-test), of which talos sends the median to graphserver (after discarding a few) as one of the sub-results (my terminology) of a single test (same terminology), while sending all of them to datazilla?

Could you please choose any terminology which you like, and list all the suites, tests, pages, results, sub results, sub-sub results etc (and whatever else exists with test names/runs/etc), into one table which I could use as reference to make sure I can follow the discussion?

Pick whatever terminology which exists or which you find suitable. Forget about any terminology which I suggested.
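For reference, the aggregation scheme described in comment 0 can be sketched in a few lines of Python. This is my own illustrative sketch, not talos or graphserver code, and the names are made up:

```python
import statistics

def graphserver_score(subtest_replicates):
    """Reduce one talos run to a single number, following the scheme
    from comment 0: take the median of each sub-test's replicates,
    drop the highest of those medians, and average the rest.
    (Illustrative only; not the actual graphserver implementation.)"""
    medians = [statistics.median(reps) for reps in subtest_replicates]
    medians.remove(max(medians))      # drop the worst sub-test as an outlier
    return statistics.mean(medians)   # the "final" reported number

# Three sub-tests, each with 5 replicates (talos normally runs 25):
run = [
    [10, 11, 10, 12, 10],   # median 10
    [20, 21, 19, 20, 22],   # median 20
    [50, 49, 51, 50, 48],   # median 50 -> dropped as the max
]
print(graphserver_score(run))   # averages 10 and 20 -> 15
```

Note how a regression confined to the dropped sub-test would be completely invisible in the final number.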

> I do not know what they are named, nor what
> they are testing, but let us call them "attr.html.1" thru "attr.html.6"
> respectively. attr.html.1 is an example sub-sub-result I was referring to.   

I'll have to examine this together with looking at the source code. As I mentioned, I never looked into the dromaeo test or its results. Thanks for the details.

> I wish I had a better name for these in your terminology.

Forget my terminology. It never existed. It's a fiction. By all means I will align 100% with whatever terminology you choose or have :)

Just please list it (fully, not with samples or examples) as I suggested above and we'll be good on this one forever :)

> If we have a channel to broadcast such alerts, I will
> not argue against doing so.

Hmm.. wouldn't that be the exact same channel which you intend to use (/are already using) for the regression alerts? i.e. wherever you calculate/display/send regression alerts, just do the same also for improvements?

> We are going though the process of comparing the graph server alerts with
> dzAlerts to ensure they are of higher quality, meaning they catch everything
> graph server does, and more.  We still have issues with the greater sub-test
> noise causing too many false positives (or alert redundancies), but we are
> close to solving that too.  Joel can confirm we can already catch
> regressions that graph server is missing (but not many).

I believe you. I'd like to join this discussion, if possible, and preferably also get up to date by following its history. Is it being tracked some place? something which describes the design, work, decisions, arguments, etc which you've gone through, as well as what its current status is?
Avi, 

Thank you for making me realize dzAlerts has a serious deficiency: There is no wiki page.  This project grew organically from the Signal from Noise Meetings, and never was much of anything until recently.  How about I put it here: https://wiki.mozilla.org/Auto-tools/Projects/Alerts.  I will start with nomenclature.  And use your questions to guide what is relevant in the docs.

(In reply to Avi Halachmi (:avih) from comment #35)
> Could you please choose any terminology which you like, and list all the
> suites, tests, pages, results, sub results, sub-sub results etc (and
> whatever else exists with test names/runs/etc), into one table which I could
> use as reference to make sure I can follow the discussion?
I do not think a full list can be made; I am not even sure what all is being analyzed.  The analysis engine is given a schema (e.g. https://github.com/klahnakoski/datazilla-alerts/blob/master/dimension_talos.json) to interpret the sub-test results as points in a data cube, which is then sliced along the interesting dimensions (edges at https://github.com/klahnakoski/datazilla-alerts/blob/master/resources/settings/talos_settings.json#L57) to provide a multitude of data series for analysis.  The domain of all dimensions is discovered at run time; if certain tests have not run recently, they are not discovered.  This allows the analysis to stay flexible as tests and sub-tests are added and removed over time.

I will do my best to lay out the nomenclature on the new wiki page.


> Pick whatever terminology which exists or which you find suitable. Forget
> about any terminology which I suggested.
I will add nomenclature to the wiki.

> > If we have a channel to broadcast such alerts, I will
> > not argue against doing so.
> 
> Hmm.. wouldn't that be the exact same channel which you intend to use (/are
> already using) for the regression alerts? i.e. wherever you
> calculate/display/send regression alerts, just do the same also for
> improvements?
They are not the same: The regression alerts channel is used to notify a human that a regression exists and should be confirmed, and a bug generated to fix it.  Improvements do not require similar action so are just noise to this process.  If you know of something or someone that will make use of improvement alerts we can send them there instead.  Regressions and improvements may have a mathematical symmetry, but they feed into different business processes.

> > We are going though the process of comparing the graph server alerts with
> > dzAlerts to ensure they are of higher quality, meaning they catch everything
> > graph server does, and more.  We still have issues with the greater sub-test
> > noise causing too many false positives (or alert redundancies), but we are
> > close to solving that too.  Joel can confirm we can already catch
> > regressions that graph server is missing (but not many).
> 
> I believe you. I'd like to join this discussion, if possible, and preferably
> also get up to date by following its history. Is it being tracked some
> place? something which describes the design, work, decisions, arguments, etc
> which you've gone through, as well as what its current status is?

I will start filling in https://wiki.mozilla.org/Auto-tools/Projects/Alerts
(Reporter)

Comment 37

4 years ago
(In reply to Kyle Lahnakoski [:ekyle] from comment #36)
> Avi, 
> 
> Thank you for making me realize dzAlerts has a serious deficiency: There is
> no wiki page.  This project grew organically from the Signal from Noise
> Meetings, and never was much of anything until recently.  How about I put it
> here: https://wiki.mozilla.org/Auto-tools/Projects/Alerts.  I will start
> with nomenclature.  And use your questions to guide what is relevant in the
> docs.

Sounds great. Thanks for picking it up. Are these meetings still taking place?

> I will do my best to lay out the nomenclature on the new wiki page.

Appreciated.

> They are not the same: The regression alerts channel is used to notify a
> human that a regression exists and should be confirmed, and a bug generated
> to fix it.  Improvements do not require similar action so are just noise to
> this process.  If you know of something or someone that will make use of
> improvement alerts we can send them there instead.  Regressions and
> improvements may have a mathematical symmetry, but they feed into different
> business processes.

I think the improvements should appear on the same "boards" which display and track the regressions, possibly with some UI switch to choose either/both, as well as being emailed to patch authors, like graphserver does now. The only thing we might skip for improvements is filing bugs on them after manually reviewing them. The rest (as far as I can guess what it includes) could probably be the same.

Since I'm not yet familiar with the scope of the system, I can't know how much any suggestion would deviate from the design/goals/implementations.

> I will start filling in https://wiki.mozilla.org/Auto-tools/Projects/Alerts

Cheers. Looking forward to it.
(In reply to Avi Halachmi (:avih) from comment #37)
> (In reply to Kyle Lahnakoski [:ekyle] from comment #36)
> Sounds great. Thanks for picking it up. Are these meetings still taking
> place?
Yes, every second Friday @ 8amPDT.  Next one is tomorrow!
(Reporter)

Comment 39

4 years ago
(In reply to Kyle Lahnakoski [:ekyle] from comment #38)
> Yes, every second Friday @ 8amPDT.  Next one is tomorrow!

Thanks!

OK, so now that we see more bits of the puzzle - mainly that the dzAlerts project exists with a seemingly identical goal to this bug (and apparently wider scope, longer history and considerable progress as well) - the questions I have in mind are:

- When can we hope that it ships?

- If we know that the road could still be long for dzAlerts, is there still a place for making some progress with graph server based alerts improvements? mainly (independently) this bug (higher resolution) and bug 1021842 (better formula)?

Especially bug 1021842 looks to me like potentially big win with relatively low investment, to fill in some of the gap until dzAlerts is deployed?

What say you?
Flags: needinfo?(klahnakoski)
Flags: needinfo?(jmaher)
(In reply to Avi Halachmi (:avih) from comment #39)

> - When can we hope that it ships?
We have dzAlerts running on a staging server continuously.  There are known issues (https://bugzilla.mozilla.org/showdependencytree.cgi?id=962378&hide_resolved=1), but they are being worked on.  When it comes to the alerts, I do not believe there is more than a quarter's work left until they are of high quality.

> Especially bug 1021842 looks to me like potentially big win with relatively
> low investment, to fill in some of the gap until dzAlerts is deployed?
> 
> What say you?

How about we discuss direction tomorrow?  You able to make the SfN meeting (8am PDT tomorrow (Friday))?
Flags: needinfo?(klahnakoski)
Personally I think we can move forward on graph server by switching to the geometric mean of all values instead of the average of all but the max value.  Other than that, we are close enough on dzAlerts to use them and fine-tune them.
Flags: needinfo?(jmaher)
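To make the difference between the two formulas concrete, here is a small Python sketch (my own illustration with invented numbers, not graphserver code):

```python
import math
import statistics

def old_score(values):
    """Current graphserver formula: average of all but the max value."""
    return statistics.mean(sorted(values)[:-1])

def geomean(values):
    """Proposed formula: geometric mean of all values."""
    return math.exp(statistics.mean(math.log(v) for v in values))

# Four sub-tests with very different inherent magnitudes:
subtests = [10.0, 20.0, 40.0, 400.0]
print(round(old_score(subtests), 2))   # 23.33 -- the 400 sub-test is dropped entirely
print(round(geomean(subtests), 2))     # 42.3  -- every sub-test contributes
```

With the old formula, a regression confined to the largest sub-test never reaches the final number; with the geometric mean every sub-test contributes, and no single large-magnitude sub-test dominates the way it would in a plain average. (Python 3.8+ also ships `statistics.geometric_mean`, which could replace the manual log/exp version.)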
(Reporter)

Comment 42

4 years ago
(In reply to Joel Maher (:jmaher) from comment #41)
> Personally I think we can move forward on graph server with fixing the
> geomean for all values instead of the average of all but the max value. 
> Other than that, we are close enough on dzalerts to use and fine tune them.

I can live with that.

Can you live with the fact that all the graphs will change after this lands, and that numbers after this change will probably not be comparable to numbers before it? Meaning you won't be able to detect regressions around the switch, or compare values from before/after the switch.

I think it's still OK. It's like a reset point for all the data.
Regarding the resetting of the raw data values: we need to bite the bullet at some point, but I don't think it will be that scary.

If we get this deployed at least 2 weeks prior to the next uplift, Aurora will have a set of results with the geomean, as will trunk; then when we uplift, it will be a diff of geomean v32 to geomean v33.
(Reporter)

Comment 44

4 years ago
(In reply to Joel Maher (:jmaher) from comment #43)
> regarding the resetting of the raw data values, we need to bite the bullet
> at some point but I don't think it will be that scary.
> 
> If we get this deployed at least 2 weeks prior to the next uplift Aurora
> will have a set of results with the geomean as will trunk, then when we
> uplift it will be a diff of geomean v32 to geomean v33.

Sounds good. Shall we continue this at bug 1021842?
Duplicate of this bug: 556574
Depends on: 1045002
Not sure why mattn is needinfo'd here; let's clean up what we can.
Flags: needinfo?(MattN+bmo)
Looking through some old bugs, I found this one. We have been using the geometric mean for the last year; would that satisfy the need for this bug? If so, I would like to close it out.
Flags: needinfo?(avihpit)
(Reporter)

Comment 48

3 years ago
(In reply to Avi Halachmi (:avih) from comment #0)
> This formula is not good for tests where each sub-test was designed
> specifically to reflect a well-intended perspective, and where each sub-test
> is meaningful independently (e.g. TART/CART/WebGL).

The geometric mean handles this ^ aspect of the bug, which is to improve the overall score formula where sub-tests have different inherent magnitudes.


> This bug is about being able to detect (and display etc) meaningful
> regressions which manifest at one or few sub-results of a test ...

And this ^ is covered too these days, since perfherder can now display per-subtest perf changes.

So this bug should be closed.
Flags: needinfo?(avihpit)
Status: NEW → RESOLVED
Last Resolved: 3 years ago
Resolution: --- → FIXED