Closed Bug 968920 Opened 8 years ago Closed 8 years ago

Autophone - use standard error percentage as a discriminator

Categories

(Testing :: Autophone, enhancement)


Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: bc, Assigned: bc)

References

Details

Attachments

(4 files)

On a regular basis, some phones randomly misbehave due to unknown causes and produce throbber measurements which are far out of character with respect to measurements just prior to and after the outlier. These 'bad' measurements typically have a very large standard deviation/standard error which is out of line with other measurements.

If the identical build is retested on the device, the 'bad' behavior disappears and the new measurement has a much improved standard deviation/standard error.

I would like to use the standard error as a percentage of the measurement value to determine if a test run should be kept or retested.

s1s2test currently has a constant STDERRP_THRESHOLD which is intended to be used to terminate a series of iterations if the standard error percentage falls below the value. I have temporarily set this to 0 to make sure that all iterations are run, but am considering re-enabling it along with another threshold that will be used to determine if a run is too noisy and should be rejected/retested.

For example:

We might say that we normally want to run 8 iterations of a test to get a reasonable estimate of the mean. But if the standard error percentage falls below 1%, we could terminate the test run early and keep the values.

On the flip side, once a test run completes, we could reject it if its standard error percentage was over 3% and retest the build. If the standard error does not fall below our threshold after some number of attempts, we would keep the most recent test run's values.
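The accept/reject logic described above could be sketched as follows (a minimal sketch; the function names and the 1%/3% defaults are illustrative, not taken from s1s2test):

```python
import math

def stderr_percent(values):
    """Standard error of the mean, expressed as a percentage of the mean."""
    n = len(values)
    mean = sum(values) / n
    # Sample standard deviation with Bessel's correction (N - 1).
    stddev = math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1))
    return 100.0 * (stddev / math.sqrt(n)) / mean

def decide(values, accept=1.0, reject=3.0):
    """Classify a completed set of replicates against the two thresholds."""
    p = stderr_percent(values)
    if p < accept:
        return 'accept-early'   # quiet enough to terminate iterations early
    if p > reject:
        return 'reject'         # too noisy; retest the build
    return 'keep'
```

Here decide would be consulted after each batch of iterations; a run classified as 'reject' would be queued for retesting up to some maximum number of attempts.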

Thoughts?
PS. I forgot to mention that we would probably need hardware-specific thresholds, e.g. for nexus-one, nexus-7. We could add an additional section to the test manifest in bug 968905 which lists the hardware-specific thresholds for each of the device types being tested.
Flags: needinfo?(klahnakoski)
I like the idea of using standard error to indicate the quality of the test results.  My only concern is that the 'bad' measurements may actually be part of another (rarer) mode, in which case their results and frequency should be monitored, if only to ensure the other mode continues to be rare.

If you plan to terminate testing early in light of low standard error, be sure the standard deviation is corrected for bias.  An estimate is good enough: http://en.wikipedia.org/wiki/Unbiased_estimation_of_standard_deviation#Rule_of_thumb_for_the_normal_distribution

In the event of high standard error: I suggest not throwing away the test results, but concatenate them to the results of the retest.  This will have the unfortunate side effect of generating a different number of replicates for each test "run".  The benefit is we will keep a record of the 'bad' behavior, and whether it is changing over time.  In general, I want to avoid deciding whether to keep the data after looking at it because it can be a source of bias.  I am less concerned with diluting 'bad' data with more samples.

In the test runs where we keep the results of 'bad' behavior, the mean and variance become meaningless.  This can be mitigated by rejecting the extreme percentiles before calculating the mean or, in the limit, by using the median.
Flags: needinfo?(klahnakoski)
(In reply to Kyle Lahnakoski [:ekyle] from comment #2)
> I like the idea of using standard error to indicate the quality of the test
> results.  My only concern is if the 'bad' measurements are actually part of
> a another (rarer) mode; in which their results and frequency should be
> monitored, if only to ensure the other mode continues to be rare.
> 

Hopefully this will not be that common and I can send a notification.

> If you plan to terminate testing early in light of low standard error, be
> sure the standard deviation is corrected for bias.  An estimate is good
> enough:
> http://en.wikipedia.org/wiki/
> Unbiased_estimation_of_standard_deviation#Rule_of_thumb_for_the_normal_distri
> bution
> 

I'm conflicted about this option but wanted to consider it. So, unbiased standard deviation = sqrt( sum((Xi - mean)^2) / (N - 1.5) ) ?

> In the event of high standard error: I suggest not throwing away the test
> results, but concatenate them to the results of the retest.  This will have
> the unfortunate side effect of generating a different number of replicates
> for each test "run".  The benefit is we will keep a record of the 'bad'
> behavior, and whether it is changing over time.  In general, I want to avoid
> deciding whether to keep the data after looking at it because it can be a
> source of bias.  I am less concerned with diluting 'bad' data with more
> samples.
> 

It sounds like one way of accomplishing this would be to let the test continue to run until the standard error drops below the threshold. This might not alleviate temporary problems that would be fixed by a reinstall or reboot, though.

Phonedash currently drops any existing test results for a build, but we could change that behavior to keep them and just accumulate more measurements. I think this is closer to what you had in mind. Mark, what would you say to not dropping existing results?

I can run some example runs and see how these choices affect the results.

> In the test runs where we keep the results of 'bad' behavior, the mean and
> variance become meaningless.  This can be mitigated by rejecting the extreme
> percentiles before calculating mean, or in the limit; use the median.

"in the limit" of what?
On 06/02/2014 2:26 PM, bugzilla-daemon@mozilla.org wrote:

>
> I'm conflicted about this option but wanted to consider it. So, unbiased
> standard deviation = sqrt( sum((Xi - mean)^2) / (N - 1.5) ) ?

Yes.  I understand the hesitation in using this imprecise formula.  I am only trying to prevent underestimating the standard error in the case where there are very few samples (fewer than 8).
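A sketch of that rule of thumb (illustrative only; the actual patch may implement it differently):

```python
import math

def unbiased_stddev(values):
    """Approximately unbiased sample standard deviation for normally
    distributed data, using the N - 1.5 rule-of-thumb denominator
    in place of Bessel's N - 1."""
    n = len(values)
    mean = sum(values) / n
    return math.sqrt(sum((v - mean) ** 2 for v in values) / (n - 1.5))
```

For small N this yields a slightly larger estimate than the usual N - 1 form, counteracting the downward bias of the sample standard deviation.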

>> In the event of high standard error: I suggest not throwing away the test
>> results, but concatenate them to the results of the retest.  This will have
>> the unfortunate side effect of generating a different number of replicates
>> for each test "run".  The benefit is we will keep a record of the 'bad'
>> behavior, and whether it is changing over time.  In general, I want to avoid
>> deciding whether to keep the data after looking at it because it can be a
>> source of bias.  I am less concerned with diluting 'bad' data with more
>> samples.
>>
>
> This sounds like one way of accomplishing this would be to let the test
> continue to run until the standard error drops below the threshold. 

Given the 'bad' behavior is not Gaussian, I suspect collecting samples until the standard error drops below threshold will result in too many test runs.  Instead, perform the tests in batches and only look at the standard error within each batch; I think that's like your original plan, except that we keep the previous test results.

>> In the test runs where we keep the results of 'bad' behavior, the mean and
>> variance become meaningless.  This can be mitigated by rejecting the extreme
>> percentiles before calculating mean, or in the limit; use the median.
>
> "in the limit" of what?

If we take the mean of the middle x%, then as x->0 we get the median.
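For illustration, a trimmed mean that converges to the median as the kept percentage shrinks (a sketch, not code from the patches):

```python
def trimmed_mean(values, keep_percent):
    """Mean of the middle keep_percent of the sorted values.
    As keep_percent -> 0 this converges to the median."""
    s = sorted(values)
    n = len(s)
    drop = int(round(n * (100 - keep_percent) / 200.0))  # trimmed from each tail
    middle = s[drop:n - drop] or [s[n // 2]]  # keep at least the middle value
    return sum(middle) / len(middle)
```

trimmed_mean(values, 100) is the plain mean; small keep_percent values approach the median, which is robust to the 'bad' outliers discussed above.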
Kyle, is using the full set of data + keeping the middle part of the data meant just to keep historical records of bad data so that we can mine for patterns? It won't provide us any 'better' estimate than the most recent batch whose standard error falls below the threshold, will it?

If that is the case, what do you think about keeping all of the data from the attempts, but when displaying the values on phonedash.mozilla.org we only use the accepted batch to calculate the displayed mean and std dev/std error? As Mark suggested offline, we could simply mark the 'historical' runs with a flag. By default phonedash would display the unflagged most recent batch, and we could add an option to incorporate all results in the display. It may also be useful to be able to choose in the phonedash UI between the mean, the median, or a middle-percentage value. I think I would defer most of the UI changes / phonedash server changes (apart from using the unflagged values) to bug 967052.
* phonedash server
** add a rejected boolean column to rawfennecstart
** use unbiased estimation of the stddev
** add a count property to the result returned by S1S2RawFennecData
   so we can display the actual number of replicates used in a point.
** add new check and reject apis. I actually only use the check
   at the moment since I removed the automatic deletion of old
   results and decided that rejecting old results wasn't the right
   choice either. I can remove the reject if you want, but thought
   I'd leave it in since it didn't hurt.
* phonedash ui
** add a checkbox to the ui to include rejected values in the plot.
** add a N=<count> to the tooltip

I applied this to phonedash-dev with some sample data:

<http://phonedash-dev.allizom.org/#/org.mozilla.fennec/throbberstop/remote-blank/norejected/2014-02-14/2014-02-21/notcached/errorbars/standarderror>
Attachment #8380012 - Flags: feedback?(mcote)
* Changed stddev to use unbiased estimation
* Added 3 options to the test config file
** stderrp_accept - accept early if stderrp is below this threshold
** stderrp_reject - reject if stderrp is above this threshold
** stderrp_attempts - number of retries
* No longer delete old results.
* Use the new check api to avoid retesting builds which have already been
  completed and accepted. This will help when a job terminates abnormally
  and we have to restart it.
* Simplified the attempts and data collection in runtests.
* Originally accepted the last set of measurements even if it was above the
  reject threshold. I changed that to just leave it as rejected as that seemed 
  more appropriate. I left the code in but commented out. I'll remove it if you 
  agree when I submit the final patch.
Attachment #8380019 - Flags: feedback?(mcote)
Please check out <http://phonedash-dev.allizom.org>, especially Remote Blank or Remote Twitter. This run was done with stderrp_accept = 0.5%, stderrp_reject = 5% and stderrp_attempts = 3. I think in production I would only use stderrp_attempts = 2 instead of 3.
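For reference, the corresponding section of the test config file might look like this (the option names come from the patch; the [settings] section name and the iterations value are assumptions for illustration):

```ini
[settings]
iterations = 8
stderrp_accept = 0.5
stderrp_reject = 5.0
stderrp_attempts = 3
```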

Note how the gs2 and gs3 are rejected in the remote graphs. These two phones have been acting up and I'm not sure how this will play on the production phones. This will certainly help me decide which phones aren't good enough to keep in production.

If any of the throbberstart or throbberstop measurements exceeded the rejection threshold, I rejected all of the measurements for the test. This can be a little confusing, but I think it is the right choice.

When 'Include rejected results' is checked, all data are included. When it is unchecked, only the data that were not rejected are displayed.

I'm not doing any trimming of outliers or attempting to use median instead of mean.

If this seems to be a good approach, I'll finish off the patches and submit them for final review.
Flags: needinfo?(klahnakoski)
Assignee: nobody → bclary
Status: NEW → ASSIGNED
Comment on attachment 8380012 [details] [diff] [review]
bug-968920-stderr-phonedash.patch

Review of attachment 8380012 [details] [diff] [review]:
-----------------------------------------------------------------

::: server/handlers.py
@@ +108,5 @@
> +                    phoneid=result['phoneid'],
> +                    bldtype=result['bldtype'],
> +                    testname=result['testname'],
> +                    cached=result['cached'],
> +                    rejected=result['rejected'])

Is this leftover debug code?

@@ +130,5 @@
> +            autophonedb.SQL_TABLE, what=what, where=where,
> +            vars=dict(phoneid=phoneid, testname=testname,
> +                      revision=revision, productname=productname))
> +
> +        return {"result": len(list(data)) > 0 }

You can just do bool(data).  It'll be False if no rows were returned.
Attachment #8380012 - Flags: feedback?(mcote) → feedback+
Comment on attachment 8380019 [details] [diff] [review]
bug-968920-stderr-autophone.patch

Review of attachment 8380019 [details] [diff] [review]:
-----------------------------------------------------------------

::: tests/s1s2test.py
@@ +113,5 @@
>              # [settings]
>              self._iterations = cfg.getint('settings', 'iterations')
> +            try:
> +                self.stderrp_accept = cfg.getfloat('settings', 'stderrp_accept')
> +            except ConfigParser.noOptionError:

NoOptionError, not noOptionError.

@@ +117,5 @@
> +            except ConfigParser.noOptionError:
> +                self.stderrp_accept = 0
> +            try:
> +                self.stderrp_reject = cfg.getfloat('settings', 'stderrp_reject')
> +            except ConfigParser.noOptionError:

Ditto.

@@ +121,5 @@
> +            except ConfigParser.noOptionError:
> +                self.stderrp_reject = 100
> +            try:
> +                self.stderrp_attempts = cfg.getint('settings', 'stderrp_attempts')
> +            except ConfigParser.noOptionError:

Ditto.

@@ +138,5 @@
> +    def is_stderr_below_threshold(self, dataset, threshold):
> +        """Return True if all of the measurements in the dataset have
> +        standard errors of the mean below the threshold.
> +
> +        Return False if at least one measurement is above the threshold.

Should amend this to say that it also returns False if one or more datasets have only one value.

@@ +254,5 @@
> +                #elif attempt == self.stderrp_attempts - 1:
> +                #    rejected = False
> +                #    self.loggerdeco.info(
> +                #        'Accepted noisy test (%d/%d) after %d/%d iterations' %
> +                #        (testnum, testcount, iteration+1, self._iterations))

Yeah I think it's a good idea to remove this.

@@ +536,4 @@
>  
>          return (start_time, throbber_start_time, throbber_stop_time)
>  
> +    def reject_results_for_build(self, build_metadata):

I think I'd take this out unless you have definite plans to use it.

@@ +616,5 @@
> +                    e,
> +                    query['phoneid'], query['test'],
> +                    query['revision'], query['product']))
> +            return False
> +        else:

You can remove the else since you always return if there's an expected exception.
Attachment #8380019 - Flags: feedback?(mcote) → feedback+
(In reply to Mark Côté ( :mcote ) from comment #9)
> ::: server/handlers.py
> @@ +108,5 @@
> > +                    phoneid=result['phoneid'],
> > +                    bldtype=result['bldtype'],
> > +                    testname=result['testname'],
> > +                    cached=result['cached'],
> > +                    rejected=result['rejected'])
> 
> Is this leftover debug code?

Yeah. cleaned up.

(In reply to Mark Côté ( :mcote ) from comment #10)

> 
> NoOptionError, not noOptionError.

DOH

Thanks.
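For reference, the corrected fallback pattern looks like this (sketched with Python 3's configparser; the patch itself uses the Python 2 ConfigParser module, where the exception has the same NoOptionError spelling):

```python
import configparser  # the 'ConfigParser' module in the Python 2 code under review

def get_float_option(cfg, section, option, default):
    """Read a float option, falling back to a default when it is absent."""
    try:
        return cfg.getfloat(section, option)
    except configparser.NoOptionError:  # note the capital N
        return default
```

This mirrors the stderrp_accept/stderrp_reject/stderrp_attempts handling in s1s2test.py, with each option defaulting sensibly when missing from the config file.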
> Kyle, Is using the full set of data + keeping the middle part of
> the data is meant to just keep historical records for bad data
> in order that we can mine for patterns? 

Yes, I guess so.  If your current analysis does not require the historical records, then do not throw them away;  keep them for possible later analysis.  My concern is that the current analysis is a "prototype":  Throwing away test results based on the decisions of this prototype may introduce bias in the results.  Keeping the historical test results, even if they are not used, will allow future analysis to detect this bias, if it exists.
Flags: needinfo?(klahnakoski)
Attachment #8380650 - Flags: review?(mcote) → review+
Attachment #8380651 - Flags: review?(mcote) → review+
https://github.com/mozilla/autophone/commit/270db9e1448a5fb7ffeb50407cdd033a190574cb
https://github.com/markrcote/phonedash/commit/e4a72859dc066e4ec6917d6dee6165d93628bc5e

I will perform a dump of the current database, then alter it to add the new column. I think a stderrp_reject of 10% will catch the most egregious device issues, and stderrp_attempts = 2 will allow a retry without causing the machines to get backlogged on mozilla-inbound.
Status: ASSIGNED → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED