Closed Bug 866195 Opened 12 years ago Closed 12 years ago

Improve Autophone s1s2 test reproducibility

Categories

(Testing Graveyard :: Autophone, defect)

All
Android
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: bc, Assigned: bc)

References

()

Details

Attachments

(4 files, 3 obsolete files)

Recently, when attempting to fix bug 862508 to allow Autophone to test Android 4.2 devices, I ran into an apparent regression in throbberstart due to the change in how Fennec is launched. While investigating the regression, I compared how Fennec is launched using the current Autophone run_fennec_with_profile method (call it A) against versions of run_fennec_with_profile which used mozbase's launchFennec without (call it B) and with (call it C) the -W argument to am. There was no consistent ordering of the 3 values across phones, and values A and B, which should have been identical, were consistently different. In addition, the variability of the test results, as shown by the standard deviations, made it difficult to make accurate comparisons. Something to note is that I ran A, B, and C in the same session with the same profile without rebooting in between.

The variability of the test results is very apparent when looking at the recent results at <http://mrcote.info/phonedash>. This is especially true for the remote page tests, where the network connections are throttled via ipfw to 780Kbps down and 330Kbps up. I do not know whether the throttled network is the cause of the variability or if it has some other cause, but in my opinion these results are so noisy as to be unusable. I checked the performance of the phonedash web server using httperf as compared to nginx for serving the static files for Twitter2.html and did not see any performance issue which might affect the tests.

This led me to reading up on test reproducibility and got me thinking about how we actually run the tests. A simplified description of how Autophone runs the s1s2 tests is:

reboot phone
install fennec
for cache in disabled, enabled:
    create profile with cache preferences
    launch fennec to initialize the profile
    for test in local-blank, local-twitter, remote-blank, remote-twitter:
        for iteration in iterations:
            launch fennec and load the test page
            collect and report throbberstart, throbberstop values

We use the throbberstart and throbberstop values from the iterations to calculate a mean and standard deviation which are displayed on phonedash.

I see several problems in this approach which may affect the reliability and reproducibility of the tests:

1. We test the uncached behavior using prefs to disable the cache. I think it would be more realistic if we used the normal cache settings and treated the first load of a page as the uncached test and the second load of a page as the cached test.

2. We do not reboot the phone between the test pages or between the iterations. We treat each test and iteration as an independent measurement when, as currently implemented, each test and iteration depends upon the previous tests and iterations. I think we should reboot before each measurement to eliminate any influences from prior measurements.

3. Our choice of iterations is ad hoc. Rather than focusing on the standard deviation of the measurements, we can choose the iterations to reduce the standard error of the mean and perhaps obtain more reliable results.

I think that implementing #2 will help improve reliability and help in choosing a realistic and useful iteration count.
For background reading:
http://en.wikipedia.org/wiki/Standard_error
http://en.wikipedia.org/wiki/Standard_deviation

The proposed test execution would look like:

reboot phone
install fennec
launch fennec to make sure the build is usable
if fennec is not usable:
    abort test
for test in local-blank, local-twitter, remote-blank, remote-twitter:
    for iteration in iterations:
        reboot phone
        create profile
        launch fennec to initialize the profile
        launch fennec and load the test page
        collect and report uncached throbberstart, throbberstop values
        launch fennec and load the test page
        collect and report cached throbberstart, throbberstop values

I would like to take some of the redundant phones out of the production Autophone s1s2 runs to use in testing this approach. We could keep the current Autophone running in production on the remaining phones until we determine if this is a valid approach. If we do switch to this new behavior, I can rerun previous data to fill in the historical behavior. Thoughts?
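For illustration only, here is a minimal, runnable Python sketch of the proposed execution order. Every helper function here is a hypothetical placeholder standing in for the real Autophone/s1s2test methods, not the actual API:

# Sketch of the proposed per-iteration reboot flow. Helper functions are
# hypothetical stubs, not the real Autophone code.
TESTS = ['local-blank', 'local-twitter', 'remote-blank', 'remote-twitter']
ITERATIONS = 8  # illustrative; a value of 32 is discussed later in this bug

def reboot_phone(): pass
def install_fennec(): pass
def fennec_is_usable(): return True
def create_profile(): pass
def initialize_profile(): pass
def load_test_page(test): return {'throbberstart': 0, 'throbberstop': 0}
def report(test, label, values): print('%s %s %r' % (test, label, values))

def run_job():
    reboot_phone()
    install_fennec()
    if not fennec_is_usable():
        return  # abort the test for this build
    for test in TESTS:
        for _ in range(ITERATIONS):
            reboot_phone()
            create_profile()
            initialize_profile()
            report(test, 'uncached', load_test_page(test))  # first load
            report(test, 'cached', load_test_page(test))    # second load

run_job()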
I'm all for being more deliberate with what we are measuring and how we are measuring it. I think this is a good step in the right direction and I look forward to what your experiment turns up w.r.t. measurement stability.
(In reply to Bob Clary [:bc:] from comment #0)
> 1. We test the uncached behavior using prefs to disable the cache. I think
> it would be more realistic if we used the normal cache settings and treated
> the first load of a page as the uncached test and treated the second load of
> a page as the cached test.

A few slightly related ideas here:

We should keep in mind that first load with caching enabled is not the same as caching disabled: with caching enabled, we are storing that first load, and it is possible that the disk writes, and additional processing, will affect performance.

We don't normally test Fennec in configurations that differ from the defaults we ship (except in ad-hoc, one-off tests to investigate the effect of configuration changes). Network disk cache is enabled in Fennec and has been for a long time now. Why are we routinely running tests with cache disabled?

With caching enabled, it seems wrong to put the first load results in the same bucket as subsequent results -- we expect them to differ, and we lose that information if we routinely just look at the aggregate.

The current iterations:

for test in local-blank, local-twitter, remote-blank, remote-twitter:
    for iteration in iterations:
        launch fennec and load the test page

are problematic/unrealistic with caching enabled because we are testing the result of loading the same page multiple times in succession, with little happening in between. There is a risk here that we will benefit not only from our own caching, but also from OS-level caching: disk buffers may still be in memory, for instance, resulting in improved performance that wouldn't be seen in normal use. The proposed execution is much better in this respect, since each pair of results is against new cache files (new profile).
(In reply to Geoff Brown [:gbrown] from comment #2)
> We don't normally test Fennec in configurations that differ from the
> defaults we ship (except in ad-hoc, one-off tests to investigate the effect
> of configuration changes). Network disk cache is enabled in Fennec and has
> been for a long time now. Why are we routinely running tests with cache
> disabled?

IIRC blassey suggested this. I think he wanted to try to remove variability caused by writing to disk, or at least try to see its effect vs. standard cache usage.

> With caching enabled, it seems wrong to put the first load results in the
> same bucket as subsequent results -- we expect them to differ and we lose
> that information if we routinely just look at the aggregate.

Actually, we omit the first result of the cached runs when calculating the mean & standard deviation. For comparison, you can see the initial result--and only the initial result--by enabling "show initial only" in phonedash.

> The current iterations:
>
>     for test in local-blank, local-twitter, remote-blank, remote-twitter:
>         for iteration in iterations:
>             launch fennec and load the test page
>
> are problematic/unrealistic with caching enabled because we are testing the
> result of loading the same page multiple times in succession, with little
> happening in between. There is a risk here that we will benefit not only
> from our own caching, but also from OS level caching: disk buffers may still
> be in memory for instance, resulting in improved performance that wouldn't
> be seen in normal use. The proposed execution is much better in this
> respect, since each pair of results is against new cache files (new profile).

Yeah, I think we originally were taking a trade-off of speed of execution (and potentially device life) by only rebooting in between tests. I would expect less variation if we rebooted between every run. If we can still keep up with the results, then yeah, let's try this.
(In reply to Geoff Brown [:gbrown] from comment #2)
> (In reply to Bob Clary [:bc:] from comment #0)
> > 1. We test the uncached behavior using prefs to disable the cache. I think
> > it would be more realistic if we used the normal cache settings and treated
> > the first load of a page as the uncached test and treated the second load of
> > a page as the cached test.
>
> A few slightly related ideas here:
>
> We should keep in mind that first load with caching enabled is not the same
> as caching disabled: With caching enabled, we are storing that first load,
> and it is possible that the disk writes, and additional processing, will
> affect performance.

Great point.

> We don't normally test Fennec in configurations that differ from the
> defaults we ship (except in ad-hoc, one-off tests to investigate the effect
> of configuration changes). Network disk cache is enabled in Fennec and has
> been for a long time now. Why are we routinely running tests with cache
> disabled?

This was the choice I made when blassey asked for results for uncached vs. cached. I agree it is the wrong thing to do, and I am happy with the change to count the first load as the uncached value and the second as the cached value.

> With caching enabled, it seems wrong to put the first load results in the
> same bucket as subsequent results -- we expect them to differ and we lose
> that information if we routinely just look at the aggregate.

As Mark mentioned, we currently don't use the first values unless you ask for them, but with the proposed changes I think it would be good to keep them.

> The current iterations:
>
>     for test in local-blank, local-twitter, remote-blank, remote-twitter:
>         for iteration in iterations:
>             launch fennec and load the test page
>
> are problematic/unrealistic with caching enabled because we are testing the
> result of loading the same page multiple times in succession, with little
> happening in between. There is a risk here that we will benefit not only
> from our own caching, but also from OS level caching: disk buffers may still
> be in memory for instance, resulting in improved performance that wouldn't
> be seen in normal use. The proposed execution is much better in this
> respect, since each pair of results is against new cache files (new profile).

(In reply to Mark Côté ( :mcote ) from comment #3)
> Yeah I think we originally were taking a trade off of speed of execution
> (and potentially device life) by only rebooting in between tests. I would
> expect less variation if we rebooted between every run. If we can still
> keep up with the results, then yeah let's try this.

I have been testing the approach where we reboot between each iteration of the test page load vs. only rebooting for each job. I'm not sure about the results yet, but rebooting between iterations causes each job to run for several hours, which I am not sure gives sufficient benefit for the cost. I am in the process of comparing reboot for each iteration, reboot for each test page, and reboot for each job.

I am very happy with the focus on standard error of the mean instead of standard deviation of the results, though. I think we will be able to get tighter test results.

One question I have for you all (blassey especially) is the acceptable percentage standard error of the mean. Getting below 1% has been problematic so far. 2% is probably achievable without increasing the measurement count too much. I think 3% would be too high.
I have a roundtable item in today's #ateam meeting if you want to join and help with the discussion.
Attached patch phonedash patch (obsolete) — Splinter Review
Calculate and display standard error of the mean. Mark, I've left the "initial only" check in for the moment, but I don't think it is needed any more. I was wondering if it would be good to allow the user to choose whether to display the standard error of the mean or the standard deviation. Showing the standard error gives us an idea of how good the measurement is, while showing the standard deviation gives us an idea of the variability of the data.
Attachment #744092 - Flags: feedback?(mcote)
Attached patch autophone patch — Splinter Review
This patch follows the general outline given before but instead reboots before each test. I am not completely sure this is better than the case where we just reboot before each job. See the next comment.
Attachment #744093 - Flags: review?(mcote)
This shows some comparative data, with 32 iterations, for two runs on the same build revision for:

base - current autophone behavior, with a change to move the loggerdeco call out of the starttime measurement / run fennec sequence, which probably is a cause of some of the variability in the previous results.

reboot_job - current approach but only rebooting before each job.

reboot_test - current approach but rebooting before each job and each test.

reboot_iteration - current approach but rebooting before each iteration of a test. I only have one run for this; it results in a multi-hour run for each build and isn't practical, but I wanted to compare it to the others.

This shows a comparison of the different test runs using a two-sided Welch's t-test to see if the means from the different runs were distinguishable, as well as sample data for the 4th, 8th, 16th and 32nd values for each run. I believe 32 is a good value for the iterations, as it will reduce the standard error to roughly 1/5 of the standard deviation (stderr = stddev/sqrt(count)).

We can see that the two base runs disagree on 7 out of 24 measurements between the runs on an identical build, which confirms our experience. The most consistent runs are the two reboot_job runs (4/24), followed closely by the reboot_test runs (5/24). On the face of the data, we might pick the reboot_job approach, but I think reboot_test is the better approach even though its runs are slightly less consistent.

Mark, if you can patch mrcote.info/phonedash_bc and clear its database, I can do a test run for the last few days so we can compare the current production results to the results we would get with the proposed patch.
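For reference, a pairwise comparison like the one above can be done with a two-sided Welch's t-test. A minimal sketch in Python using scipy (an assumption -- the bug doesn't say how the comparison was actually computed), with made-up throbberstart samples standing in for the real per-run data:

from scipy import stats

# Hypothetical throbberstart samples (ms) from two runs of the same build.
run_a = [2710, 2685, 2732, 2701, 2694, 2720, 2688, 2705]
run_b = [2748, 2722, 2760, 2731, 2755, 2740, 2719, 2751]

# Two-sided Welch's t-test; equal_var=False drops the equal-variance assumption.
t_stat, p_value = stats.ttest_ind(run_a, run_b, equal_var=False)
if p_value < 0.05:
    print('means are distinguishable (p=%.3f)' % p_value)
else:
    print('means are not distinguishable (p=%.3f)' % p_value)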
Adding bug 862508 and bug 860750 as dependencies to show the prior patches that need to be applied.
Depends on: 862508, 860750
Comment on attachment 744092 [details] [diff] [review] phonedash patch

Review of attachment 744092 [details] [diff] [review]:
-----------------------------------------------------------------

Yeah, if we are using the first value as the uncached value, we should take out the "initial only" control. Would it make sense to allow the display of either stddev or stderr? I'm still not sure exactly in what way one is more useful than the other.
Attachment #744092 - Flags: feedback?(mcote) → feedback+
The standard deviation estimates the variability in the measurement distribution, while the standard error of the mean is an estimate of the error in the mean itself. So, for any given individual measurement, we would expect it to fall within 1 standard deviation of the mean 68% of the time and within 2 standard deviations 95% of the time. For multiple runs with the same conditions, we expect the means to occur within 1 standard error 68% of the time and within 2 standard errors 95% of the time. While the standard deviation doesn't decrease with an increasing number of measurements, the standard error of the mean does, as ~ stddev/sqrt(count).
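A small Python sketch of those quantities, along the lines of what the autophone patch computes (r['stderr'], r['stderrp'] in the review below). The sample data is made up and the function is illustrative, not the patch itself; stderrp is the standard error expressed as a percentage of the mean:

from math import sqrt

def summarize(measurements):
    # Mean, sample standard deviation, standard error of the mean, and
    # standard error as a percentage of the mean.
    count = len(measurements)
    mean = sum(measurements) / float(count)
    variance = sum((m - mean) ** 2 for m in measurements) / (count - 1)
    stddev = sqrt(variance)
    stderr = stddev / sqrt(count)
    stderrp = 100.0 * stderr / mean
    return {'count': count, 'mean': mean, 'stddev': stddev,
            'stderr': stderr, 'stderrp': stderrp}

# Going from 8 to 32 measurements halves the standard error
# (stderr ~ stddev/sqrt(count)) even though the stddev stays about the same.
print(summarize([2710, 2685, 2732, 2701, 2694, 2720, 2688, 2705]))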
Comment on attachment 744093 [details] [diff] [review] autophone patch

Review of attachment 744093 [details] [diff] [review]:
-----------------------------------------------------------------

Getting complicated, but looks good.

::: tests/s1s2test.py
@@ +40,5 @@
> +    r['stderr'] = r['stddev']/sqrt(r['count'])
> +    r['stderrp'] = 100.0*r['stderr']/float(r['mean'])
> +    return r
> +
> +def is_stderr_acceptible(dataset):

That spelling is unacceptable. ;)

@@ +127,5 @@
> +                                  build_metadata['buildid']))
> +        self.set_status(msg='Could not run Fennec. Aborting test for '
> +                        'build %s' %
> +                        (attempt,
> +                         build_metadata['buildid']))

"attempt" shouldn't be in the substitution list here.

@@ +305,5 @@
> +              'shell.checkDefaultClient': False,
> +              'browser.warnOnQuit': False,
> +              'browser.EULA.override': True,
> +              'toolkit.telemetry.prompted': telemetry_prompt,
> +              'toolkit.telemetry.notifiedOptOut': telemetry_prompt}

Drop the } to the beginning of the next line.
Attachment #744093 - Flags: review?(mcote) → review+
Attachment #744092 - Attachment is obsolete: true
Attachment #744222 - Flags: review?(mcote)
Comment on attachment 744222 [details] [diff] [review] phonedash patch v2

Review of attachment 744222 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good.

::: html/scripts/app.js
@@ +42,5 @@
>           revisions[buildtime] = data[phone][test][metric][builddate].revision;
> +        if (params.errorbartype == 'standarderror')
> +          errorbarvalue = data[phone][test][metric][builddate].stderr
> +        else
> +          errorbarvalue = data[phone][test][metric][builddate].stddev

Standard JS coding style is to use braces even if blocks are only one line long.
Attachment #744222 - Flags: review?(mcote) → review+
Status: ASSIGNED → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Attached patch follow up patch v1 (obsolete) — Splinter Review
I missed the other places where logging calls or overhead also affect the measurements. This patch:

1. removes the unnecessary logging call from run_fennec_with_profile, moves the starttime collection inside run_fennec, and changes run_fennec_with_profile to return the starttime. I also added failIfRunning=False to launchFennec, which prevents launchApplication from testing whether the fennec process already exists.

2. removes the removal of the sessionstore.{js,bak} files. I don't think it is appropriate to be removing these between the uncached and cached tests in particular, and in general I don't think we should be hacking the profile like this in the middle of a test if we want to measure something related to what a user experiences.

#1 eliminates the previously discovered regression seen when switching from the old run_fennec_with_profile to using launchFennec. I ran the original code for several builds before realizing my mistake, then ran a build with #1 before realizing I should do #2 as well. I'm restarting the phonedash_bc run again. Hopefully this will get even better results. See http://mrcote.info/phonedash_bc for the running results.
Attachment #744953 - Flags: review?(mcote)
Comment on attachment 744953 [details] [diff] [review] follow up patch v1

I should have listened to Welch's t-test. The reboot between each test takes too long on the production phones. I'm starting another test, with modified hardware values reporting to phonedash_bc, to compare the results with the reboot removed. We may also need to lower the iterations value, but I would definitely rather keep it as high as possible.
Attachment #744953 - Attachment is obsolete: true
Attachment #744953 - Flags: review?(mcote)
Attached patch follow up patch v2 (obsolete) — Splinter Review
Same as v1, but no longer rebooting between each test and now with -W. I ran a partial test with just the rebooting change and am now running another with the rebooting change and the use of the wait=True parameter on launchFennec. The three sets of results are being reported to phonedash_bc with the device names:

original run - original device id
patch v1 - original device id with -r
patch v2 - original device id with -w

It will be a couple of hours before we have 2 data points for -w, but I think this will be it, so I'd like to start the review request before we get too close to the end of the day.
Attachment #745284 - Flags: review?(mcote)
I don't think -W gives any benefit over not specifying it; it may slow down some phones, notably the Nexus One, and may result in slightly larger stderrs. Please review with the assumption that I will specify wait=False in run_fennec_with_profile.
Right patch for the right repo this time. Has wait=False.
Attachment #745284 - Attachment is obsolete: true
Attachment #745284 - Flags: review?(mcote)
Attachment #745360 - Flags: review?(mcote)
Comment on attachment 745360 [details] [diff] [review] follow up patch v3 Review of attachment 745360 [details] [diff] [review]: ----------------------------------------------------------------- Looks good. ::: phonetest.py @@ +146,5 @@ > + starttime = int(self.dm.getInfo('uptimemillis')['uptimemillis'][0]) > + except IndexError: > + # uptimemillis is not supported in all implementations > + # therefore we can not exclude such cases. > + starttime = 0 I can't recall what this is from, but since you're moving this around anyway, let's take out the try/except. I think the results would be very strange if this happened.
Attachment #745360 - Flags: review?(mcote) → review+
https://github.com/mozilla/autophone/commit/ee040e2d4c59cb6d1c3a8e3c9ff5b7108c6d2318 We've copied the old production results to http://mrcote.info/phonedash_bc/ and will be reporting new results to http://mrcote.info/phonedash/ until the move to the new moco hosts. I'll backfill the data as time permits. /me crosses fingers.
Status: REOPENED → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Product: Testing → Testing Graveyard