Consider alternate aggregation schemes for Talos results.

RESOLVED WONTFIX

Status

Testing
Talos
6 years ago
2 years ago

People

(Reporter: Stephen Lewchuk, Unassigned)

Tracking

Firefox Tracking Flags

(Not tracked)

Details

(Whiteboard: [SfN])

(Reporter)

Description

6 years ago
One result of my work with Talos is a clearer view of some of the limitations imposed by the current aggregation pipeline.  Each data point surfaced on graphs.m.o is the result of two stages of aggregation, which reduce the amount of information that can be gleaned from it.

A bit of background on this aggregation: each test suite (tp5, tdhtml, etc.) is composed of a set of pages (tp5: 100, tsvg: 12, tdhtml: 18).  For each run, this set of pages is loaded multiple times (tp5: 10, the rest: 5).  The values for each page are filtered to remove the maximum, and the median of the remainder is returned.  The medians for all the pages are then filtered once more to remove the maximum, and the average of what remains is the final result for this run of the test suite.  (Note that the first level of aggregation is done before the data is sent to graphs.m.o, so the raw results are not accessible from that service.)
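
To make the pipeline concrete, here is a minimal sketch of the two stages, assuming both filtering steps discard the single highest value. The function names, page names, and sample values are made up for illustration and are not taken from the Talos source.

```python
from statistics import mean, median


def drop_max(values):
    """Return the values sorted, with the single largest one removed."""
    return sorted(values)[:-1]


def aggregate_page(samples):
    """Stage 1: discard the slowest load of a page, return the median of the rest."""
    return median(drop_max(samples))


def aggregate_suite(page_samples):
    """Stage 2: discard the largest page median, average the remaining medians."""
    page_medians = [aggregate_page(s) for s in page_samples.values()]
    return mean(drop_max(page_medians))


# Hypothetical run: three pages, five loads each (times in ms).
run = {
    "page-a": [100, 102, 98, 140, 101],
    "page-b": [200, 205, 198, 202, 260],
    "page-c": [50, 51, 49, 52, 90],
}

page_medians = {name: aggregate_page(s) for name, s in run.items()}
result = aggregate_suite(run)
print(page_medians, result)
```

Fifteen raw measurements collapse into one number here, and the slowest page (page-b) does not contribute to it at all.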

Some discussion [1] has occurred around the two filtering steps in the aggregation process, so I will not repeat it here.  No action has been taken, and further discussion is welcome, though it should perhaps get its own bug.

The consequence of these two steps of aggregation is that we lose insight from information we are spending resources to collect.  [2] contains four graphs to illustrate this point. (Note that the filtering described above was not applied in the examples presented.)

Graph 1 shows the data as currently interpreted. Here changes can be observed but are small compared to the noise of the data.

Graph 2 shows the range of values the different pages can take.  Because of this spread, the averaging in the second step hides the impact of changes that do not affect all pages in a testsuite.
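
As a rough illustration of how much the averaging dilutes a single-page change, the sketch below uses made-up numbers: 18 pages (the tdhtml count quoted above), all with the same baseline, and a hypothetical 20% regression on one of them. As in the graphs, no filtering is applied.

```python
from statistics import mean

PAGES = 18  # tdhtml page count quoted in the description

# Hypothetical per-page values: identical baseline for every page.
before = [100.0] * PAGES
after = before.copy()
after[0] *= 1.20  # assumed 20% regression on a single page

page_change = (after[0] - before[0]) / before[0]
suite_change = (mean(after) - mean(before)) / mean(before)

print(f"page change:  {page_change:.1%}")
print(f"suite change: {suite_change:.1%}")
```

The suite-level change is the page-level change divided by the number of pages, so a 20% shift on one page moves the aggregate by roughly 1.1%, which is easy to lose in run-to-run noise.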

Graph 3 shows the results for a single page in the tdhtml testsuite, hixie-007.xml.  Here the performance changes are much more obvious than in graph 1.

Graph 4 shows the distribution of the raw results for the hixie-007.xml page split into the different time periods identified through the third graph.

While using the non-aggregated data as the primary interface for developers would worsen the problem of having too many graphs to look at, automated tools that can handle the additional views into the data should be developed, so that all the available information is used and not just the aggregates.

[1] https://groups.google.com/forum/#!topic/mozilla.dev.platform/kXUFafYInWs
[2] http://people.mozilla.org/~slewchuk/graphs/snowleopard-tsvg-hixie-007.pdf

Some thoughts about the first stage of aggregation:

As I suggested in bug 706912 comment #6, using the median of the samples for each page may not provide the best data. There are two sides to this:
1) Mathematically, the median improves as an estimator for the normal distribution as the number of samples goes up, but the number of samples here is pretty low.
2) From a practical standpoint, if the performance is somehow bimodal or even more convoluted, the median could easily alternate between various dense regions in the distribution, thus increasing noise.
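
A small sketch of point 2, using made-up samples: two runs of the same page whose load times fall into a fast cluster near 100 and a slow cluster near 150. The only difference between the runs is whether three or two of the five loads land in the fast cluster.

```python
from statistics import mean, median

# Hypothetical bimodal samples for one page (times in ms).
run_a = [100, 101, 99, 150, 151]   # three fast loads, two slow
run_b = [100, 101, 149, 150, 151]  # two fast loads, three slow

median_a, median_b = median(run_a), median(run_b)
mean_a, mean_b = mean(run_a), mean(run_b)

print("median:", median_a, "->", median_b)
print("mean:  ", mean_a, "->", mean_b)
```

One load switching clusters moves the median from one dense region to the other (101 to 149), while the mean only shifts by 10. With so few samples, that kind of flip can happen from run to run without any change in the code under test.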

In bug 706912 comment #6 I suggested an iterative refinement technique to find an estimator that minimizes the weaknesses of the mean and the median:
1) By discarding outliers based on distance from the mean of a sorted subset of the samples, thus reducing the amount by which the estimator is skewed by outliers.
2) By averaging over the remaining samples, thus reducing the effects of having various local maxima in the density function of the (unknown) distribution.

However this technique works better as the number of samples goes up, and with 5 samples the accuracy of its measure of distance is probably too limited to be effective. Unfortunately, I imagine that increasing the number of samples per page would make the tests prohibitively slow.

For 5 samples, a common technique is to discard the maximum and minimum samples, then take the arithmetic mean of the remaining 3. In this case, however, an additional consideration is that while measurements on the low end are bounded by the performance of the system (assuming nothing is broken), measurements on the high end are likely inflated by things like I/O, network performance, et cetera. As such, the distribution is likely skewed, and it might actually be better to take the mean of the lower 3 values and discard the higher 2, as this would still give some measure of variance while reducing the impact of outliers on the high end. Similarly, we could take the mean of the lowest 6 or 7 of the 10 tp5 samples, depending on how skewed the distribution actually is.
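
The two estimators can be compared with a short sketch. The helper names and the sample values are hypothetical; the samples are chosen to have two inflated loads, as might happen with an I/O spike.

```python
from statistics import mean


def trimmed_mean(samples):
    """Discard the minimum and the maximum, then average the rest."""
    return mean(sorted(samples)[1:-1])


def lower_mean(samples, keep):
    """Average the `keep` smallest samples and discard the larger ones."""
    return mean(sorted(samples)[:keep])


# Hypothetical page with two slow loads out of five (times in ms).
samples = [100, 98, 102, 130, 160]

symmetric = trimmed_mean(samples)      # averages 100, 102, 130
lower = lower_mean(samples, keep=3)    # averages 98, 100, 102

print(symmetric, lower)
```

The symmetric trim still lets one of the slow loads through, whereas the lower-3 mean stays at 100. For tp5, the same helper would be called with `keep=6` or `keep=7` on the 10 samples.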

To determine the percentage of samples that should be discarded, we could take a larger number of measurements to determine the percentage of samples likely to fall in the interesting part of the distribution. While the percentage should be large enough to detect changes in the variance of the distribution, it should be small enough that (large) spikes in I/O are unlikely to affect the result.

It would also be interesting to propagate some measure of the variance to the second stage so that if the variance in the individual measurements increases significantly, it won't just show up as an increase in the mean. With only 5 or 10 samples, however, determining the variance or standard deviation becomes very hit or miss. Indeed I remember reading a paper that suggested that taking the distance of the maximum value from the mean is a better measure of variance for small sets of samples than the usual [mean of squares] - [square of mean] (unfortunately I don't have that source on hand).
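
For reference, both spread measures mentioned above are easy to compute. This is only a sketch on made-up samples; the function names are my own, and it does not settle which measure is more reliable for 5 or 10 samples.

```python
from statistics import mean, pvariance


def variance_from_moments(samples):
    """[mean of squares] - [square of mean], i.e. the population variance."""
    m = mean(samples)
    return mean([x * x for x in samples]) - m * m


def max_distance_from_mean(samples):
    """Distance of the largest sample from the mean."""
    return max(samples) - mean(samples)


# Hypothetical samples for one page (times in ms).
samples = [98, 100, 101, 103, 160]

var = variance_from_moments(samples)
spread = max_distance_from_mean(samples)

# The moment formula should agree with the standard library's pvariance.
print(var, pvariance(samples), spread)
```

Either value could be sent to the second stage alongside the per-page estimate, so that an increase in spread is visible separately from a shift in the mean.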

Skimming the discussion in [1] above, it is mentioned that for multi-modal results, throwing out the high values is losing valuable information. While I agree that the sources of this multi-modality should be understood (Jan Larres' thesis from bug 706912 might be a good source for this), I actually think it's likely that most of them come from sources we have no control over and reflect events on the system that happen to be clustered because they generally take roughly the same amount of time (by user mode scheduling or otherwise). This is just my suspicion, of course, and I'm looking forward to others' thoughts on this.
(Reporter)

Updated

6 years ago
Depends on: 710484

Updated

5 years ago
Whiteboard: [SfN]
This is something we are not realistically going to get to.  Changing our tests so much in the last 2 years, and learning how to make results actionable, has changed what data we need.
Status: NEW → RESOLVED
Last Resolved: 2 years ago
Resolution: --- → WONTFIX