Bug 1210509 (Closed)
Opened 10 years ago • Closed 8 years ago
perfherder compare view should do a better job dealing with high std dev sets of data
Categories: Tree Management :: Perfherder (defect)
Tracking: (Not tracked)
Status: RESOLVED WONTFIX
People: Reporter: jmaher; Unassigned
Attachments: 3 files
Right now we have a great system for comparing one change to the previous change and showing the differences in the compare view. For a really noisy test, though, we can generate many data points for each revision where there is no real regression, yet relative to each other the two revisions appear to show one.
One example is this:
https://treeherder.allizom.org/perf.html#/compare?originalProject=mozilla-inbound&originalRevision=d51440cc7a2f&newProject=mozilla-inbound&newRevision=aa61d48eb6ae
What we have here is tcheck2 on Android:
https://treeherder.mozilla.org/perf.html#/graphs?timerange=1209600&series=[mozilla-inbound,fdad6ae27544b0dd52113fce3184968100190e76,1]
You can see that the base and new revisions are well within the normal range.
Another example:
https://treeherder.mozilla.org/perf.html#/graphs?timerange=1209600&series=[mozilla-inbound,fdad6ae27544b0dd52113fce3184968100190e76,1]
Looking at Windows XP tsvgx, here is the graph:
https://treeherder.mozilla.org/perf.html#/graphs?timerange=2592000&series=[mozilla-inbound,f437443b6f45bafdf125532c1c3b59db1db3fd83,1]
This is the same pattern: a noisy test where one of our two revisions happens to land lower and the other higher for the median value, so we report a regression.
How can we solve this? I don't know if we can rely 100% on data from both revisions. Would it be possible to calculate a stddev for a given test/platform based on historical data on a reference branch (say mozilla-central)? Then we could apply that stddev to the values we are comparing and lower our confidence accordingly.
Another option is to put each test into a bucket based on its stddev and apply a different formula in the compare view for each bucket. Bucket 1 would be low stddev and treated as we do now; bucket 2 would be higher stddev (maybe not crazy) and we could set a higher bar for marking the alert; bucket 3 would be for the crazy tests, which would need the end user to cross-reference historical data or to collect 20+ data points.
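A minimal sketch of that bucketing idea, assuming we classify each test/platform by the relative stddev of its recent reference-branch data (the function name and thresholds are invented for illustration and would need tuning against real Talos numbers):

```python
import statistics

def noise_bucket(historical_values):
    """Classify a test by its relative stddev over historical reference-branch data."""
    mean = statistics.mean(historical_values)
    rel_stddev = statistics.stdev(historical_values) / mean if mean else float("inf")
    if rel_stddev < 0.02:
        return 1  # low stddev: treat as we do now
    if rel_stddev < 0.10:
        return 2  # noisier: require a higher bar before marking an alert
    return 3      # very noisy: needs manual cross-referencing or 20+ data points
```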
Comment 1 (Reporter) • 10 years ago
Avi, we discussed this briefly in the perf testing meeting today. I would really like to figure this out in the coming weeks; do you have any suggestions?
Flags: needinfo?(avihpit)
Comment 2 • 10 years ago
I agree we should really understand how to tackle this, but my plate is super full right now. I'll try to see if I can make some time for this next week, but I can't promise.
I'll keep the needinfo on myself for now and keep reading other messages here as they come.
Comment 3 • 10 years ago
The first 6 are from revision AA (aa61d48eb6ae), the last 6 are from revision D5 (d51440cc7a2f).
Comment 4 • 10 years ago
All replicates from all revisions in the two days involving the problem
Comment 5 • 10 years ago
This appears to be another case of running-the-same-test-at-around-the-same-time-gets-the-same-results. That would seem to be a good thing, but it means the variance between samples is artificially low, making the difference appear significant.
* Joel is right, we could use the test results from the revisions before and after to increase our sample size (assuming there is no other regression/improvement in those other revisions), or help us characterize the overall behavior of this test.
* We can spread out the re-triggers over the course of a day (or week) so that the results will include the variability we usually see.
* We can do more inspection into the replicates:
> aa61d48eb6ae [3.7462988,1.7083443,4.8776793,3.0694685,4.7481084]
> aa61d48eb6ae [1.894531,3.8034909,0.3970367,15.30059,13.777445]
> aa61d48eb6ae [7.53143,3.6273627,1.9312166,2.9777503,8.878902]
> aa61d48eb6ae [12.246477,4.720222,0.6612913,2.7869997,1.1115423]
> aa61d48eb6ae [6.792979,6.6584945,6.072397,8.44784,2.8630354]
> aa61d48eb6ae [5.6984944,6.106548,5.258944,1.6848508,3.2468348]
> d51440cc7a2f [7.174985,4.557762,1.8127912,3.5609336,6.730564]
> d51440cc7a2f [4.309075,4.6112494,6.5570354,1.1544898,3.3047724]
> d51440cc7a2f [10.469094,2.5542858,7.62965,0.0044093817,1.6415395]
> d51440cc7a2f [3.8039446,2.3816898,2.503575,5.57297,0.7044095]
> d51440cc7a2f [7.4866786,0.11583299,5.926416,0.8506108,2.1789682]
> d51440cc7a2f [8.9439335,0.9512888,0.15097,1.1893138,5.720781]
http://activedata.allizom.org/tools/query.html#query_id=80Q8sUjK
This shows us that the intra-test variance is high, which in turn increases our expected variance when comparing the inter-test median values.
I also looked at the replicates over the past two days to confirm the variance is high, and there is no pattern that might explain that high variance.
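To illustrate, here is a small sketch (using the replicate values quoted above) of how the per-run medians that Perfherder compares bounce around purely because of that intra-run variance:

```python
import statistics

# Replicates quoted above, one list per run (6 runs of aa61d48eb6ae, then 6 of d51440cc7a2f).
runs = {
    "aa61d48eb6ae": [
        [3.7462988, 1.7083443, 4.8776793, 3.0694685, 4.7481084],
        [1.894531, 3.8034909, 0.3970367, 15.30059, 13.777445],
        [7.53143, 3.6273627, 1.9312166, 2.9777503, 8.878902],
        [12.246477, 4.720222, 0.6612913, 2.7869997, 1.1115423],
        [6.792979, 6.6584945, 6.072397, 8.44784, 2.8630354],
        [5.6984944, 6.106548, 5.258944, 1.6848508, 3.2468348],
    ],
    "d51440cc7a2f": [
        [7.174985, 4.557762, 1.8127912, 3.5609336, 6.730564],
        [4.309075, 4.6112494, 6.5570354, 1.1544898, 3.3047724],
        [10.469094, 2.5542858, 7.62965, 0.0044093817, 1.6415395],
        [3.8039446, 2.3816898, 2.503575, 5.57297, 0.7044095],
        [7.4866786, 0.11583299, 5.926416, 0.8506108, 2.1789682],
        [8.9439335, 0.9512888, 0.15097, 1.1893138, 5.720781],
    ],
}

for rev, replicate_lists in runs.items():
    medians = [statistics.median(r) for r in replicate_lists]
    print(rev, "per-run medians:", [round(m, 2) for m in medians],
          "stdev of medians:", round(statistics.stdev(medians), 2))
```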
Comment 6 • 10 years ago
TBH, I still don't understand the issue, or rather, I'm not sure I understand the examples (and as a result also the issue).
(In reply to Joel Maher (:jmaher) from comment #0)
> One example is this:
> https://treeherder.allizom.org/perf.html#/compare?originalProject=mozilla-
> inbound&originalRevision=d51440cc7a2f&newProject=mozilla-
> inbound&newRevision=aa61d48eb6ae
>
> What we have here is tcheck2 on android:
> https://treeherder.mozilla.org/perf.html#/
> graphs?timerange=1209600&series=[mozilla-inbound,
> fdad6ae27544b0dd52113fce3184968100190e76,1]
>
> you can see that the base and new revision are well within the normal range.
In this example, the first link (overview page) shows 39.8% tcheck2 regression, but I really can't tell where these two revisions are on the second link (the graph). I don't think I see two "marked" revisions on the graph...
(Off topic: the subtests view for this test shows a 49% regression and there's only a single subtest. Shouldn't it also show 39.8%?)
What I see on the graph are these "ranges":
- Sep 18 - Sep 24: range ~12 - ~22, with an average of ~17
- Sep 24 - Sep 24: range ~1 - ~3, with an average of ~2
- Sep 25 - Sep 28: same as the first range (12-22, 17)
- Sep 29 - : range ~1 - ~6, with an average of ~3-4
At which of those ranges are "base" and "new" from the first link?
Also, what does "base and new are well within the normal range" mean? What is this "normal range"?
> another example:
> https://treeherder.mozilla.org/perf.html#/
> graphs?timerange=1209600&series=[mozilla-inbound,
> fdad6ae27544b0dd52113fce3184968100190e76,1]
This link is identical to the second link above. If it's intentional, then I don't understand how it's "another example".
> looking for windows xp tsvgx, here is the graph:
> https://treeherder.mozilla.org/perf.html#/
> graphs?timerange=2592000&series=[mozilla-inbound,
> f437443b6f45bafdf125532c1c3b59db1db3fd83,1]
>
> This is the same pattern- a noisy test and it happens to be that our two
> revisions hit lower/higher than the other for the median value and we report
> a regression.
I don't think it's the same pattern as the second (tcheck2) graph.
The tcheck2 graph is noisy, but it clearly changes its attributes and values at the date ranges I listed above.
But this tsvgx graph belongs to the category of "noisy on weekdays, stable on weekends". IMO a completely unrelated pattern to the tcheck2 graph above.
> How can we solve this?
So what is "this" exactly?
Flags: needinfo?(avihpit)
Comment 7 • 10 years ago
Avi,
The revisions in question are hard to see. I highlighted the revisions in this image.
Comment 8 (Reporter) • 10 years ago
The issue here is that on any given day (excluding weekends) a test could produce data anywhere between X and Y (let's say 10-20). The problem is that we have two arbitrary revisions, each with 6 data points. Let's say rev 1 shows data in the 12-15 range and rev 2 shows data in the 15-18 range. Both fall within the expected pattern of data (between 10-20) based on historical and future data. Yet looking only at rev 1 vs rev 2, it appears that rev 2 has a regression; in this case there is no regression, just data falling within the normal range.
What Kyle mentions is interesting: if we have to factor time into the equation, how do we ensure we do it smartly? Our goal is to get data as fast as possible, but if we had a robot doing this, it could schedule 1 job/hour until we have our 6 data points. Would that be representative? Since the data on weekends is different from normal, we know that time can matter, but why is that the case? What factors involved in when the tests are run cause this?
Comment 9 • 10 years ago
So is that just due to too few retriggers? I.e., if we did 20 retriggers instead of 6, would the differences between the revisions be less meaningful?
Or would it not matter, and even with 20 or 100 retriggers Perfherder would display the diff as meaningful and of the same magnitude as it did with 6 retriggers?
I.e., was it just statistical chance that the retriggers ended up with a meaningful difference, or is something causing the builds to have these differences consistently?
And if something is causing this difference, what would that something be? The revision? The test's time of day? The machine on which the test runs? Etc.
Comment 10 • 10 years ago
Avi,
I cannot say what is causing the difference, or why this test has a range spanning over 3 orders of magnitude.
> d51440cc7a2f [10.469094,2.5542858,7.62965,0.0044093817,1.6415395]
I suspect that adding more re-triggers is not sufficient to catch the scope of the problem. There are other tests that have this same issue: Behaving consistently when re-triggered in a batch and scheduled for around the same time. I hope that spreading the tests over time will help get a more varied sample.
I am open to the possibility this is "just statistical chance that the retriggers ended with meaningful difference": Given we have only five replicates per test, and the replicate variance is high, the median statistic is expected to be as unstable as we see in perfherder. Looking at the individual replicates exposes this high variance.
Comment 11 • 10 years ago
(In reply to Kyle Lahnakoski [:ekyle] from comment #10)
> I can not say what is causing the difference, or why this test has has a
> range spanning over 3 orders of magnitude.
>
> > d51440cc7a2f [10.469094,2.5542858,7.62965,0.0044093817,1.6415395]
I'm assuming this data is from tcheck2. IMO this indicates that the test itself is not good. If this test consistently produces such values, then I don't think we should be using it to observe performance changes.
> There are other tests that have this same issue: Behaving
> consistently when re-triggered in a batch and scheduled for around the same
> time.
Can we list these tests explicitly? Does that list change frequently?
> I hope that spreading the tests over time will help get a more varied sample.
I think we should first understand why it's happening. Specifically, if the test itself is bad then we should be fixing it or dropping it IMO. If we can't, and if spreading the runs over more time does help, then the remaining question is how do we do this in practice.
> I am open to the possibility this is "just statistical chance that the
> retriggers ended with meaningful difference": Given we have only five
> replicates per test, and the replicate variance is high, the median
> statistic is expected to be as unstable as we see in perfherder. Looking
> at the individual replicates exposes this high variance.
The last sentence specifically leads me again to the conclusion that the test you were observing is not a good test, and we should not be using it.
Comment 12 (Reporter) • 10 years ago
I would like to build a list of tests which are problematic when comparing 6 retriggers.
Should we look at raw replicates to indicate possible flaws?
My initial thought would be to list all tests by the number of data points needed: maybe certain test/platform combinations would need 8 or 10 data points, whereas the majority will be fine with 6.
Can we get some agreement on what we would like to do here and then start gathering data?
Comment 13 • 10 years ago
Joel,
Before you try gathering a list (which seems like a lot of work, and then you are never really sure you got them all, and it changes over time), I would suggest we decide on a strategy for this test and confirm it works. Then apply this strategy to all other significant differences to see if they disappear.
The strategy of using the variance in revisions found before and revisions found after (not including revisions on the other side of a known discontinuity) is probably best; it is the one that Joel used to determine this difference was not significant, and it uses the least testing resources.
Comment 14 • 10 years ago
I don't think we can decide on a strategy before we know how many tests are affected. If only tcheck2 is affected, then let's just drop it and be done with it.
But if more tests are affected, then before we decide on a strategy, I'd like to examine the tests themselves and try to understand whether the tests themselves are bad.
Comment 15 • 10 years ago
Joel, both you and Kyle suggested that this doesn't affect just a single test. So let's start by you listing the existing examples which you have in mind, and let's examine those individually (I'll do that myself once I have a list to work with).
Comment 16 (Reporter) • 10 years ago
Kyle, that is a good point. I think we need to apply this manually and document the steps required to reach a good/bad conclusion for a test.
We have tcheck2; let's look at some others. The problem here is that we are looking at tests which, when retriggered many times, produce a small cluster of results that isn't representative of where results might lie.
We can't necessarily measure by the test's stddev; we need to measure clustering of results which fall into a small window of a larger stddev.
I think Windows XP tsvgx falls into this category:
https://treeherder.mozilla.org/perf.html#/comparesubtest?originalProject=mozilla-inbound&originalRevision=88adc47cb47e&newProject=try&newRevision=6e1fa8593bbc&originalSignature=f437443b6f45bafdf125532c1c3b59db1db3fd83&newSignature=f437443b6f45bafdf125532c1c3b59db1db3fd83 (feel free to look at the summary)
http://graphs.mozilla.org/graph.html#tests=[[281,131,45]]&sel=1443548636319,1444153436319&displayrange=7&datatype=geo
I would rather look at these one at a time and then find a pattern which we might be able to script.
Comment 17 • 10 years ago
> The problem here is we are looking at tests which when retriggered many times result
> in a small cluster of results, but that isn't representative of where results might lie.
Starting to analyze all tests would be fine, and I'd love it if we could do that, but earlier comments suggest that this is a known problem, i.e. that there are existing examples of this issue. So let's start by listing the existing examples which we have.
So far, as far as I can tell:
- tcheck2 - most likely
- tsvgx on XP - maybe.
Do we have any more platform+test combinations which we suspect belong to this category?
Comment 18 (Reporter) • 10 years ago
Those are the two examples I know of. Looking at more data in:
https://treeherder.allizom.org/perf.html#/compare?originalProject=mozilla-inbound&originalRevision=5d303b961af1&newProject=mozilla-inbound&newRevision=dbe8d3254ccd
I see Windows XP tscrollx causing problems (bimodal bits). It seems that the base had 2 instances in the high mode, and the new had 1 instance in the high mode.
https://treeherder.mozilla.org/perf.html#/graphs?timerange=604800&series=[mozilla-inbound,e5c782b789ec02e8add7dae8c97ca3f42ed46442,1]&series=[mozilla-inbound,c3eb69f0719aacca596bca0626205e4b30953034,0]&highlightedRevisions=5d303b961af1&highlightedRevisions=dbe8d3254ccd&zoom=1443982433926.7017,1444197344000,2.1980661585711054,4.227051665817482
-----------------
tresize windows 8:
https://treeherder.mozilla.org/perf.html#/graphs?timerange=1209600&series=[mozilla-inbound,bb06e2fd4b3adb76776faa7fe3d4b5a4e0228128,1]&highlightedRevisions=5d303b961af1&highlightedRevisions=dbe8d3254ccd
and tresize linux64:
https://treeherder.mozilla.org/perf.html#/graphs?timerange=1209600&series=[mozilla-inbound,72f4651f24362c87efb15d5f4113b9ca194d8e3f,1]&highlightedRevisions=5d303b961af1&highlightedRevisions=dbe8d3254ccd
-----------
windows 8 tpaint:
https://treeherder.mozilla.org/perf.html#/graphs?timerange=1209600&series=[mozilla-inbound,9681e921b531333a04976dd2036de3d2ee686780,1]&highlightedRevisions=5d303b961af1&highlightedRevisions=dbe8d3254ccd
shows a small regression because we have 1 data point as an outlier
------------------------
windows 7 glterrain:
https://treeherder.mozilla.org/perf.html#/graphs?timerange=1209600&series=[mozilla-inbound,f882417b93d2cbcfb747772556799a66e8a950c8,1]&highlightedRevisions=5d303b961af1&highlightedRevisions=dbe8d3254ccd
---------------------
linux64 damp:
https://treeherder.mozilla.org/perf.html#/graphs?timerange=1209600&series=[mozilla-inbound,25169736932930bc20f0437757e9399dadf68976,1]&highlightedRevisions=5d303b961af1&highlightedRevisions=dbe8d3254ccd&zoom=1443904920634.9207,1444202738000,71.98070581408513,129.95172030683875
more points in the lower mode for the new revision, showing an improvement
-----------------------
a11yr windows xp:
https://treeherder.mozilla.org/perf.html#/graphs?timerange=604800&series=[mozilla-inbound,2bab095ad2eb5a626f575231ea6f838d424fe6c6,1]&highlightedRevisions=5d303b961af1&highlightedRevisions=dbe8d3254ccd
That seems fairly representative of what I see normally; there is no OS X 10.10 in the list.
Comment 19 • 10 years ago
Thanks, good data.
Question: did you look at all the graphs, or only the graphs which showed regressions (and improvements?) in the compareperf main view?
Ultimately, what interests us in this bug would be all the tests/combos which _could_ have this issue - meaning that their results cluster in some way.
But let's go on with the list we have so far.
So, as far as I can tell, the data splits into what I think are two mostly unrelated categories.
1. Plain relatively uniform noise with some expected outliers:
- tpaint
X. Not sure if plain noise or bi-modal:
- damp - I suspect plain noise, but need to look at denser graphs to decide.
2. Clearly bi-modal (the rest):
- tscrollx on XP (with or without e10s)
- glterrain on win8-64 (non e10s)
- tresize on win8-64 (non e10s)
- tresize on linux64 - more weight to the lower bucket, and somewhat noisy in general.
- a11y on xp non e10s - more weight to the lower bucket, and somewhat noisy in general.
Kyle, Joel, do you agree with these categories and this division?
I think plain noise doesn't belong in this bug, since noise and outliers are just a fact of life. We would like it to be less noisy and with fewer outliers, but for now, that's how it is.
For plain noise, a few more retriggers of the same build should help, but gathering data from other builds or from tests which were executed at a different time would not make the data more useful than plain retriggers of the same build.
Which leaves us with the oldest issue in the Talos books: bimodal results.
Before we go on, do we all agree that the issue is not "dealing with high stddev" (which probably includes bimodal but is not limited to bimodal) but rather more specifically dealing with tests which produce bimodal results?
Comment 20 • 10 years ago
Also, just to make sure we're on the same page: in general we should only care about bimodality if the two (or more) buckets are far enough from each other, i.e. if one set of results lands in one bucket and another set lands in the other bucket, it would show a meaningful "fake" difference.
I think we should probably care about bimodality only where the buckets are more than ~1% apart.
All the tests which I identified as bimodal in comment 19 belong to this category. Their clusters are about 10%-30% apart.
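For illustration, a crude check for this (only a sketch; the ~1% threshold is the one mentioned above, and the gap-splitting heuristic is an assumption, not anything Perfherder implements) could split the sorted values at the largest gap and compare the cluster means:

```python
import statistics

def bimodal_gap_fraction(values):
    """Split the sorted values at the largest gap and return the distance between
    the two cluster means, relative to the overall mean."""
    v = sorted(values)
    # index just after the largest jump between consecutive values
    split = max(range(len(v) - 1), key=lambda i: v[i + 1] - v[i]) + 1
    low, high = v[:split], v[split:]
    return (statistics.mean(high) - statistics.mean(low)) / statistics.mean(v)

# Per the rule of thumb above, a series would only be worrying if, e.g.,
# bimodal_gap_fraction(series) > 0.01 and both clusters hold a fair share of points.
```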
Comment 21 • 10 years ago
Avi,
I believe there is a third pattern: Results with statistics that vary based on time of day, or day of week.
I believe, without evidence, that the specific issue in tcheck2 raised in this bug is of that third type. I realize that this particular bug can be explained by random noise, given the low number of replicates in each sample. Looking at the test results, tcheck2 does not appear to be bimodal, but I do agree that bimodal results will exhibit the same problems.
Comment 22 • 10 years ago
Right, good observation. It didn't appear in the examples in comment 18, and I forgot to mention it too.
Do we have examples other than tcheck2 which are not bimodal, not normal noise, but still exhibit clustering?
Joel, when was the last time that tcheck2 proved useful to anyone?
Currently, I really think tcheck2 is just a bad test, and unless other tests exhibit this pattern, we should not try to find general solutions for this specific test. Instead, we should just drop it, or ask whoever wrote it to fix it.
Regardless of normal noise and the tcheck2 issue, we're still left with the bimodality issue, which is IMO the most important issue we should try to address.
Comment 23 (Reporter) • 10 years ago
There are 11 bugs associated with tcheck2; the most recent ones:
* https://bugzilla.mozilla.org/show_bug.cgi?id=1111565 - Dec 15th (2015)
* https://bugzilla.mozilla.org/show_bug.cgi?id=1122012 - Jan 15th
* https://bugzilla.mozilla.org/show_bug.cgi?id=1199683 - Aug 28th
I suspect other tests might cluster, but I have no direct evidence.
Comment 24 • 10 years ago
(In reply to Avi Halachmi (:avih) from comment #6)
> What I see at the graph are these "ranges" [the tcheck graph]
> - Sep 18 - Sep 24: range ~12 - ~22, with an average of ~17
> - Sep 24 - Sep 24: range ~1 - ~3, with an average of ~2
> - Sep 25 - Sep 28: same as the first range (12-22, 17)
> - Sep 29 - : range ~1 - ~6, with an average of ~3-4
So I thought it looked very noisy but otherwise "plain noise".
I examined 1 year worth of tcheck2: https://treeherder.mozilla.org/perf.html#/graphs?timerange=31536000&series=[mozilla-inbound,fdad6ae27544b0dd52113fce3184968100190e76,1]
And I see that it was mostly bimodal but not very noisy until about a month ago (Sep-2nd).
However, since around Sep-2nd, it appears to me to have become "plain noise" - and quite a lot of it.
So I don't think we should be looking at tcheck2 data more than a month old, because it looks like it had very different behavior back then.
As I noted in my quote above, since Sep-29th it's only slightly less noisy than before, but with a much lower average, which makes the noise much more meaningful as a percentage of the average. I don't know whether that would "deceive" the t-test we use to determine meaningfulness (I would like to think that it wouldn't).
However, I can't see evidence of clustering. It just looks to me like plain noise, really.
_If_ it is plain noise, then more retriggers of the same build should eventually fit into the statistical pattern of this test IMO.
If we're not satisfied with just doing more retriggers (or if they just don't help), then let's examine the clustering.
Let's assume that there is clustering, and try hypotheses as to which parameter is responsible for it. For now, let's assume there's only one parameter which is responsible for the clustering.
This clustering parameter could be:
1. The changeset id.
2. The machine (or class of machines) on which the test was executed.
3. The data center at which the test was executed (if we have more than one).
4. The day of the week at which the test was executed.
5. The time of day at which the test was executed.
Feel free to add more.
Luckily, we already have two sets of tcheck2 results which we suspect belong to different clusters - the comparison at the first link from comment 0.
We could start by comparing the 5 parameters above (and others we could suspect) between those result sets, and the parameters which are the same could be eliminated as suspects for the clustering.
We could probably apply the same hypotheses and elimination process to bimodal results. It needs time, but if we really want to get to the root of it, we should spend it. Maybe execute the same batch of tests a few times per day, every day, at different data centers, on different machines, with different changesets, etc., until we nail the reason for the clustering.
Comment 25 (Reporter) • 10 years ago
One other slightly different scenario, also confusing for consumers, is:
https://treeherder.allizom.org/perf.html#/compare?originalProject=fx-team&originalRevision=ff8ebe5251ec&newProject=fx-team&newRevision=c951580b6f9b
Look at Windows XP damp: it shows a 35% regression, but there is none; this is all due to 1 large outlier. The question is how we can detect something that is a far outlier, and what we can do to ignore it.
Comment 26 • 10 years ago
I'm not sure what method is being used to compare (a t-test?), but anything based on means is not robust.
You could consider robust alternatives, or t-tests based on "trimmed/studentized" means, which essentially drop the top/bottom x%.
I can't recommend a solution without really understanding the mechanics of the test and whether this is a common pattern.
Comment 27 (Reporter) • 10 years ago
Here is the code that calculates the fields in the compare view:
https://github.com/mozilla/treeherder/blob/master/ui/js/perf.js#L278
We do use a t-test, it appears; it might take a bit more explaining to get to the bottom of it. Assume we have 6 data points on either side: requiring more would get expensive, but not impossible.
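For reference, the kind of comparison involved looks roughly like the Welch's t statistic below (the sample values are made up, and the linked perf.js is the authoritative implementation, which may differ in detail):

```python
import math
import statistics

def welch_t(base, new):
    """Welch's t statistic for two small samples."""
    m1, m2 = statistics.mean(base), statistics.mean(new)
    v1, v2 = statistics.variance(base), statistics.variance(new)
    return (m2 - m1) / math.sqrt(v1 / len(base) + v2 / len(new))

# With only 6 points per side, a single outlier moves both the mean and the
# variance, which is exactly the sensitivity discussed throughout this bug.
base = [12.1, 13.0, 12.7, 12.4, 13.2, 12.9]  # made-up values
new = [13.4, 13.1, 12.8, 13.6, 13.0, 13.3]   # made-up values
print(round(welch_t(base, new), 2))
```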
Comment 28 • 10 years ago
More data points might not be the solution if you have really large outliers. That's why you need something that is robust, i.e. less affected by the strength of an outlier. This might mean a different testing method, a different form of aggregation, etc. (dropping outliers means losing observations, which isn't the preferred route when you only have 6 of them).
Again, will need to study more before recommending anything.
Comment 29 • 10 years ago
So one thing with the compare view is that we're currently using the mean of the values to calculate the difference between two sets of data. Wouldn't it be better to use the median? In theory that should be more resilient to outliers.
It wouldn't help that much for the particular cases we're talking about here, but where there are more minor differences in data, it seems to give better results. Take this example:
https://treeherder.mozilla.org/perf.html#/compare?originalProject=mozilla-central&originalRevision=84a7cf29f4f1&newProject=try&newRevision=ceaaed5dbb2a&filterTest=tcanvasmark%20opt%20e10s&filterPlatform=linux64&showOnlyImportant=0
Old with mean:
Base: 6510.00
New: 6299.20
(~3% difference)
With median:
Base: 6258.75
New: 6332.5
(~1% difference)
This is because the base results have an outlier (7344.5) which gets a disproportionately high weighting when we just take the mean.
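To make the mean-vs-median point concrete, here is a toy version of that situation (only the 7344.5 outlier comes from the actual data above; the other values are made up):

```python
import statistics

# Hypothetical base run: five typical results plus the 7344.5 outlier mentioned above.
base = [6250.0, 6260.0, 6255.0, 6262.5, 6270.0, 7344.5]

print("mean:  ", round(statistics.mean(base), 2))    # dragged upward by the outlier
print("median:", round(statistics.median(base), 2))  # barely affected by it
```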
Kyle, Saptarshi: Would there be any downsides to using the median here? I can't think of anything that isn't contrived, but I'm not a statistics expert.
Flags: needinfo?(sguha)
Flags: needinfo?(klahnakoski)
Comment 30 • 10 years ago
dzAlerts used a median test as a pre-filter to avoid the negative effects of outliers (type 1), but it is necessarily insensitive, requiring more data to detect real signals. The median test is insensitive (or overly sensitive) with bimodal data (type 2), and just as ineffective as a t-test when dealing with cyclic patterns (type 3). Reporting a difference-of-medians makes sense when using the median test.
I doubt we can mathematically combine the t-test statistics with difference-of-medians in a consistent manner. Attempting to do so will leave us with many more "edge cases" than we have now. For example, distributions that better match log-normal (like positive test times near zero, perhaps) will have very small difference-of-medians and may never raise an alert. Bimodal data has medians that are one of the two modes, sometimes making the difference-of-medians much larger than the difference-of-means.
Difference-of-medians can be used to report a discontinuity to a human, but I suggest not using it to compare against the t-test statistics.
Flags: needinfo?(klahnakoski)
Comment 31 • 10 years ago
So:
1. Changing the test won't help with this bimodal data. My stand on bimodal data is to figure out why it's bimodal and treat it differently; that's for another bug.
2. Use a test robust to outliers:
a) Do outliers occur on just some tests? For those where outliers are super rare and the assumptions of the t-test are met, use a t-test.
b) Where the assumptions are not met at all: use a non-parametric test. These lose some ability to detect change, but they are not affected by the outlier values.
3. Increase the sample size so that an outlier will have less of an effect.
In the example above, if you use a log scale, what happens?
I think we should move away from this manner of testing and treat it like a time series and detect spikes and/or systemic changes.
Flags: needinfo?(sguha)
Comment 32 • 10 years ago
There are also "Bayesian" methods for the t-test that handle outliers.
http://doingbayesiandataanalysis.blogspot.com/2011/06/better-than-t-test-robust-bayesian.html
Comment 33 • 10 years ago
(In reply to "Saptarshi Guha[:joy]" from comment #32)
> There are also 'bayesian' methods for t test than handle outliers.
> http://doingbayesiandataanalysis.blogspot.com/2011/06/better-than-t-test-
> robust-bayesian.html
There are also modifications to the t-test that handle the "winsorized" mean (i.e. replace outliers with, e.g., the 98th percentile) and the trimmed mean (drop those values), but the computation is more involved.
Comment 34 (Reporter) • 10 years ago
Interesting idea to use the trimmed mean: at Microsoft we used the 80% mean (remove the 10% lowest and 10% highest) and that seemed to give us a good data set to work with. The problem is we have 6 data points, so 1 outlier does a lot of damage.
We could revisit the raw replicates, where we normally have 20+ replicates. In the case where we drop the 1st value and take the median of the remaining X (19 or 24), we could apply the trimmed mean to that data set and remove the 2 highest/lowest data points for a set of 20 replicates. That would leave 16 for us to use in our median calculations. Keep in mind that is per subpage; we then take a geometric mean of all the subpages to produce the score for the test.
Should we require more data? Can we identify an outlier (say 1 or 2 values are not in the range of the others) and then notify the UI that we need more data? I think more math is fine, as long as we can explain it in English.
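A sketch of that proposed aggregation, assuming 20+ replicates per subpage (the exact drop/trim counts are placeholders and would need to match whatever Talos settles on):

```python
import math
import statistics

def subpage_score(replicates, trim=2):
    """Drop the 1st replicate (as today), trim the `trim` highest and lowest
    values, then take the median of what remains."""
    values = sorted(replicates[1:])
    trimmed = values[trim:-trim] if trim else values
    return statistics.median(trimmed)

def test_score(subpage_replicates):
    """Geometric mean of the per-subpage scores, as described above."""
    scores = [subpage_score(r) for r in subpage_replicates]
    return math.exp(statistics.mean(math.log(s) for s in scores))
```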
Comment 35 • 10 years ago
Avi and I had another discussion about this last week in the context of bug 1227635 and I've been thinking about it a bunch since.
In a sense this is a generalization of the same problem Joel noticed with MacOS X 10.10 until it was fixed recently: some tests don't produce consistent results and you need many more retriggers before you can be confident in them, but this is only visible by looking at the historical data.
We've done quite a bit of back & forth in this bug about how to represent that. I'd prefer to avoid approaches which are too complicated (i.e. trying to figure out dynamically inside the compare view whether each test is "noisy", that'd take forever). I think a good first pass would be to find some kind of single measure that expresses the characteristics of being "unreliable" for each test, calculate it, then hide tests by default in the compare view if that measure exceeds a certain threshold, just like we did for OSX.
Qualitatively, "unreliable" means something like:
1. High standard deviation
2. Bi-modal behaviour (possibly this can be expressed in terms of 1).
We could pre-calculate this measure for each test based on its last week's worth of data (actually, looking back on 1210509, this is basically what you proposed). I'm not sure what the best way of expressing this is. Do there exist good statistical methods for detecting bi-modality and/or high std dev over a period of time?
A side bonus of this approach is that it will easily allow us to generate a report on what the noisy tests are, perhaps even generating alerts if a test starts getting noisy (or vice versa).
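As a starting point, such a measure could combine the coefficient of variation with a simple clustering check over the last week of results. This is only an illustrative sketch (the thresholds and the gap heuristic are assumptions, not anything Perfherder implements):

```python
import statistics

def noise_score(last_week_values):
    """Single 'unreliability' number per test/platform: relative spread plus a
    penalty when the distribution looks clustered."""
    mean = statistics.mean(last_week_values)
    cv = statistics.stdev(last_week_values) / mean if mean else float("inf")

    # Crude multi-modality hint: how big is the largest gap between consecutive
    # sorted values compared to the overall spread?
    v = sorted(last_week_values)
    spread = v[-1] - v[0]
    largest_gap = max(v[i + 1] - v[i] for i in range(len(v) - 1))
    gap_fraction = largest_gap / spread if spread else 0.0

    return cv + (0.5 if gap_fraction > 0.4 else 0.0)

# The compare view could then hide (or merely flag) any test whose pre-calculated
# noise_score exceeds a chosen threshold, analogous to what was done for OS X 10.10.
```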
Comment 36 • 10 years ago
(In reply to William Lachance (:wlach) from comment #35)
> Qualitatively, "unreliable" means something like:
>
> 1. High standard deviation
> 2. Bi-modal behaviour (possibly this can be expressed in terms of 1).
I still think it's only 2. High stddev with relatively normal distribution is fine, and the t-test takes that into account correctly.
But bimodal results could just produce groups of results which end up in different clusters, and there's nothing we can do about that.
So my suggestion would be to get a list of platform+test combos which have bimodal results and treat them differently. You suggested hiding them; I suggest showing them as orange, for instance, and asking the user to retrigger and compare them visually on the graph.
Comment 37 (Reporter) • 10 years ago
Identifying multi-modal signature series is doable, but turning a job orange is probably unrealistic. We could flag them in the UI in a compare view or an alert view, and decide whether to flag/hide by default or do other creative things.
One thing we should do is stop graph server from emailing developers about regressions.
Comment 38 • 8 years ago
Hey Joel, I suspect we should close this as WONTFIX. We've done a bunch of work in the UI to make it easier to analyze this type of data (increasing the confidence threshold, allowing viewing of the distribution), but I don't think there are any other quick wins.
Flags: needinfo?(jmaher)
Comment 39 (Reporter) • 8 years ago
I added a noise metric to the compare view in bug 1416347, which is similar to this. I wanted to come up with a way to track and alert on a noise metric; ideally we could still do that, but it would be difficult. Since there are no actionable ideas here, I suggest closing this bug; we can revisit when we want to do something like:
* in-tree distribution expectations (X-mode, X%noise, etc.)
* flags for extreme outliers
* noise tracking over time
* alerts for changes in noise/distributions
Status: NEW → RESOLVED
Closed: 8 years ago
Flags: needinfo?(jmaher)
Resolution: --- → WONTFIX