Closed Bug 710484 Opened 10 years ago Closed 10 years ago

Change Talos Filtering to stop stripping maximum and to start separating the first run

Categories

(Testing :: Talos, defect)

defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: stephen.lewchuk, Assigned: k0scist)

References

Details

(Whiteboard: [SfN])

Attachments

(1 file)

A Talos test suite produces a matrix of values [1]: a set of repetitions for each component in the suite.  The process for reducing this to a single value is as follows:

1. Eliminate the maximum value from the results for each component.
2. Generate a median for each component from the remaining values.
3. Eliminate the maximum component median.
4. Generate an average across the remaining components.

Steps 1 and 2 are performed in the Python script; steps 3 and 4 in the graph server database.
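For illustration, here is a minimal sketch of this reduction, assuming the results arrive as a dict mapping each component name to its list of repetition times (the function and variable names are made up for this example, not the actual Talos or graph server code):

def old_suite_value(results):
    medians = []
    for component, reps in results.items():
        trimmed = sorted(reps)[:-1]          # step 1: drop the maximum repetition
        mid = len(trimmed) // 2              # step 2: median of the remaining values
        if len(trimmed) % 2:
            medians.append(trimmed[mid])
        else:
            medians.append((trimmed[mid - 1] + trimmed[mid]) / 2.0)
    medians = sorted(medians)[:-1]           # step 3: drop the maximum component median
    return sum(medians) / len(medians)       # step 4: average the remaining medians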

The first filtering step reduces noise at the cost of artificially skewing the data.
The second filtering step hides the results of the longest-running component and adds no value.

The new proposed pipeline is:

1. Eliminate the first run of each component.*
2. Generate median of the remaining values for each component.
3. Generate the average of the component medians.

* Eventually all individual values for the test suite should be recorded, but the graph server does not yet have the capability to do this.
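For comparison, a minimal sketch of the proposed reduction under the same illustrative assumptions (a dict of component name to repetition times; names are made up, not the actual Talos code):

def new_suite_value(results):
    medians = []
    for component, reps in results.items():
        values = sorted(reps[1:])            # step 1: ignore the first run
        mid = len(values) // 2               # step 2: median of the remaining runs
        if len(values) % 2:
            medians.append(values[mid])
        else:
            medians.append((values[mid - 1] + values[mid]) / 2.0)
    return sum(medians) / len(medians)       # step 3: average of the component medians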

The rationale for treating the first run differently is that it is quite different from the rest of the values [2]: it ranges both higher and lower than the subsequent runs and thus introduces a large amount of noise into the data.

[1] http://people.mozilla.org/~slewchuk/graphs/tsvg_testframe.pdf
[2] http://people.mozilla.org/~slewchuk/graphs/histograms_by_runs.pdf
In bug 707486 you mention that we take 5 measurements per component (with the exception of TP5, which takes 10). Would it be feasible to take 6 measurements per component, drop the 1st measurement as you suggest, then drop the lowest and highest measurements and take the mean of the remaining 3? For TP5 we could similarly (without changing the number of measurements) drop the first measurement, drop the highest and lowest 2 measurements, and take the mean of the remaining 5.
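To make the suggestion concrete, a rough sketch (the function name and trim counts are illustrative, not existing Talos code):

def trimmed_mean(measurements, trim=1):
    values = sorted(measurements[1:])        # drop the first (warm-up) measurement
    values = values[trim:-trim]              # drop the `trim` lowest and highest values
    return sum(values) / float(len(values))

# 6 measurements with trim=1 -> mean of 3; TP5's 10 measurements with trim=2 -> mean of 5.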
What you propose isn't infeasible from an implementation point of view, though having different logic for TP5 compared to the other tests would add some complexity. I would suggest, however, that clipping both the top and the bottom doesn't solve the problem I was trying to tackle. As you mentioned in the other bug, whether the largest values are normally uncontrollable anomalies or legitimate manifestations of Firefox execution is unclear and should be investigated more fully.

I would argue, however, that in many cases this technique hides real data, has the potential to make real structure in the data appear as an anomaly, and gives developers a false sense of the consistency of Firefox.  When our browser can take a wide range of different times to load a page, Firefox has structural ambiguity, which in my opinion is a negative feature of the browser and should be rooted out, not hidden.

The two minor changes proposed here were intended to halt bad practices in the short term and get people thinking about Talos as we move to implement some of the more aggressive changes outlined in bug 710788 and in https://wiki.mozilla.org/Metrics/Talos_Investigation.  The other half of your suggestion is to move from the median to the mean; in the long term, when we have larger samples, I agree, but I'm not sure what the magnitude of the improvement would be with such small samples.
Right, I suppose it depends on what you think Talos should be trying to achieve.

My suggestions above are based on the assumption that Talos should have as little noise as possible so that consistent but small regressions can more easily be spotted. With such a small number of measurements the median is likely to be fairly noisy, whereas the mean can easily be skewed by outliers beyond our control. The intent of dropping the maximum and minimum values is to get the best of both worlds, that is, a less noisy alternative to the median that is still robust to outliers.

In the long term I agree that the results should include a measure of their uncertainty, so that a change that e.g. improves best-case results at the cost of a regression in the worst case can be spotted - but that would require a lot more samples than are currently taken, as your analysis in bug 710788 indicates.

(In reply to Stephen Lewchuk from comment #2)
> The other half of your suggestion is to move from the median to the mean,
> in the long term when we have larger samples I agree, I'm not sure what the
> magnitude of the improvement would be at the low samples.

That's a good question! I ran some Monte Carlo simulations to quantify the difference between the median and mean-of-3 approaches for the following cases: (1) a normal distribution, (2) a bimodal but symmetrical combination of two gaussians, and (3) a bimodal, asymmetric combination of two gaussians. Although I imagine the models you used in bug 710788 are more sophisticated, I figured these would still give some interesting data.

For each case I repeatedly chose 5 points at random from a large (n = 500 million) set of points with the desired distribution, then recorded the absolute difference between (a) the population mean and the median, and (b) the population mean and the mean-of-three, and calculated the mean and standard deviation of those datasets:

(1)        mean     sd
median:    8.534    10.71
mean-of-3: 7.600     9.53
diff:      12.29%   12.38%

(2)        mean     sd
median:    28.20    31.42
mean-of-3: 20.80    25.06
diff:      35.58%   25.38%

(3)        mean     sd
median:    23.47    27.17
mean-of-3: 19.01    22.76
diff:      23.46%   19.38%

As you can see the mean-of-3 consistently outperforms the median, most substantially in the symmetric bimodal situation. Since the points are also more tightly grouped, it should be less noisy. Looking at histograms reveals that the median shows the same bimodal behavior as the data (as I earlier suspected), whereas the mean-of-3 estimators show maxima much nearer the population mean and only slight bumps where the data is densest.

(1) http://i.imgur.com/KQswK.png
    http://i.imgur.com/mLWyv.png
    http://i.imgur.com/FHLXo.png

(2) http://i.imgur.com/l7dBr.png
    http://i.imgur.com/Fml0Z.png
    http://i.imgur.com/Uhe7c.png

(3) http://i.imgur.com/ENNfG.png
    http://i.imgur.com/cWaHV.png
    http://i.imgur.com/XkjbQ.png
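For reference, a rough sketch of this kind of Monte Carlo comparison; the population mixture, sample sizes, and iteration counts below are illustrative assumptions, not the exact parameters used for the numbers above:

import random

def sample_population(n):
    # 50/50 mixture of two gaussians (the symmetric bimodal case)
    return [random.gauss(100, 10) if random.random() < 0.5 else random.gauss(200, 10)
            for _ in range(n)]

population = sample_population(500000)
pop_mean = sum(population) / len(population)

median_err, mean3_err = [], []
for _ in range(10000):
    draw = sorted(random.sample(population, 5))
    median = draw[2]                          # median of the 5 sampled points
    mean_of_3 = sum(draw[1:4]) / 3.0          # drop min and max, mean of the rest
    median_err.append(abs(pop_mean - median))
    mean3_err.append(abs(pop_mean - mean_of_3))

print(sum(median_err) / len(median_err), sum(mean3_err) / len(mean3_err))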
That is some interesting stuff; I'd be interested to see what the effect on live data is. For jmaher: what level of effort would this require from an engineering standpoint? I suspect it might be easier once the changes to break apart Talos are made; please correct me if I am wrong. If that is the case, would we want to try this now, or build testing it into the longer-term changes to Talos?
This is fairly easy to do in talos proper.  The danger here is changing the numbers that we report and needing to rebaseline.  I think the next steps should be:
* modify talos to do this (via a config file option for now)
* land this and roll it out branch by branch as resources are available for side by side staging

Overall this will take about 2 months to roll out, but we can get the initial branches started within a week or two.  Really just getting this landed on m-c and m-i will be huge.

We will be breaking apart talos to run per page in January and this would be a first step in that.
This is currently done in http://hg.mozilla.org/build/talos/file/de761a6183df/talos/run_tests.py#l115 for send_to_csv. However, I thought what we cared about were the graph server results? send_to_graph does not do this, unless the logic is hiding somewhere besides run_tests.py. I'm not sure how these are run in production. I'll investigate further and report back.
Assignee: nobody → jhammel
Assuming that the platform fields were set unusually specifically in error; please correct me if I am wrong and we really only want this on Mac.
OS: Mac OS X → All
Hardware: x86 → All
What we currently send to graphserver: https://bug716670.bugzilla.mozilla.org/attachment.cgi?id=587180

(see bug 716670 for details)
It appears that we overlooked the obvious here.

From pageloader, we get:
NOISE: |i|pagename|median|mean|min|max|runs|
NOISE: |0;thesartorialist.blogspot.com;852;864.3333333333334;809;1135;951;852;920;865;809;858;851;1135;836;837
NOISE: |1;cakewrecks.blogspot.com;264;266.55555555555554;252;651;651;263;252;273;264;275;260;292;268;252

Then in talos, we parse this line and create:
0,852.00,thesartorialist.blogspot.com
1,264.00,cakewrecks.blogspot.com


We can fix this in pageloader:
http://hg.mozilla.org/build/pageloader/file/beca399c3a16/chrome/report.js#l97

One caveat here is that any change to pageloader needs to be configurable, so this would be another option sent to the tp command line.


We could also fix this in talos by adding an option here:
http://hg.mozilla.org/build/talos/file/4887e3aebe37/talos/run_tests.py#l82

We could parse out the raw values and have more control over what we calculate.  I prefer this method, since changing pageloader is a bit more complicated and carries a higher chance of error (fewer tests, more chance of accidentally changing the numbers).
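As a rough sketch of that second option, assuming the field layout shown in the NOISE lines above (function names are hypothetical, not the existing run_tests.py code):

def parse_noise_line(line):
    # e.g. "NOISE: |0;thesartorialist.blogspot.com;852;864.33;809;1135;951;852;..."
    fields = line.strip().split('|', 1)[-1].split(';')
    index, page = int(fields[0]), fields[1]
    runs = [float(v) for v in fields[6:]]    # raw per-run values follow i, page, median, mean, min, max
    return index, page, runs

def median_ignoring_first(runs):
    values = sorted(runs[1:])                # drop the first page load
    mid = len(values) // 2
    if len(values) % 2:
        return values[mid]
    return (values[mid - 1] + values[mid]) / 2.0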
Running tsvg, here are some values that I get:
NEW: gearflowers.svg: 88.0
OLD: gearflowers.svg: 89.0
NEW: composite-scale.svg: 46.0
OLD: composite-scale.svg: 45.5
NEW: composite-scale-opacity.svg: 20.0
OLD: composite-scale-opacity.svg: 19.0
NEW: composite-scale-rotate.svg: 19.0
OLD: composite-scale-rotate.svg: 19.0
NEW: composite-scale-rotate-opacity.svg: 20.0
OLD: composite-scale-rotate-opacity.svg: 19.5
NEW: hixie-001.xml: 15028.0
OLD: hixie-001.xml: 15027.0
NEW: hixie-002.xml: 15032.0
OLD: hixie-002.xml: 15022.0
NEW: hixie-003.xml: 5028.0
OLD: hixie-003.xml: 5027.5
NEW: hixie-004.xml: 5047.0
OLD: hixie-004.xml: 5046.0
NEW: hixie-005.xml: 4686.0
OLD: hixie-005.xml: 4667.0
NEW: hixie-006.xml: 5697.0
OLD: hixie-006.xml: 5697.0
NEW: hixie-007.xml: 2221.0
OLD: hixie-007.xml: 2199.5
Comment on attachment 588061 [details] [diff] [review]
ignore first page load in calculating median (1.0)

-        if options.develop == True:
-            if options.resultsServer == '':
-                options.resultsServer = ' '
-            if options.resultsLink == '':
-                options.resultsLink = ' '
-
-            if options.webServer == '':
-              options.webServer = "localhost:%s" % (findOpenPort('127.0.0.1'))
-
         # if resultsServer and resultsLinks are given replace resultsURL from there
         if options.resultsServer and options.resultsLink:
             if options.resultsURL:
                 raise Configuration("Can't use resultsServer/resultsLink and resultsURL; use resultsURL instead")
             options.resultsURL = 'http://%s%s' % (options.resultsServer, options.resultsLink)
 
+        if options.develop == True:
+            if options.resultsURL == '':
+                options.resultsURL = ' '
+
+            if options.webServer == '':
+              options.webServer = "localhost:%s" % (findOpenPort('127.0.0.1'))
+

This was already taken care of in bug 716670

Other than that, looks good!
Attachment #588061 - Flags: review?(jhammel) → review+
inbound:
http://hg.mozilla.org/build/talos/rev/eeee5b50266b

steps to get this live:
* deploy talos to production
* determine which tests on m-c/m-i we want to convert to this method
* work with releng to adjust buildbot-config to do side by side staging
** repeat if necessary
Whiteboard: [SfN]
Can we close this bug?  In general we have done this for all pageloader-based tests.
Sounds good to me
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED