Closed Bug 1112149 Opened 10 years ago Closed 9 years ago

[raptor] Determine the feasibility of running perf tests on the b2g-emulator on the cloud

Categories

(Firefox OS Graveyard :: Gaia::PerformanceTest, defect)

Hardware: Other
OS: Gonk (Firefox OS)
Type: defect
Priority: Not set
Severity: normal

Tracking

(Not tracked)

RESOLVED FIXED

People

(Reporter: rwood, Assigned: rwood)

References

Details

(Keywords: perf)

Determine if it is feasible to use the b2g-emulator to generate a performance baseline for the raptor tests. Also determine if AWS will be the location/cloud of choice for this automation.

Will the raptor perf tests run successfully on the b2g-emulator (stability wise)?
Will the generated numbers be consistent, and enable the establishment of a perf baseline?

Try this locally first, then on an AWS instance.
Created an AWS raptor instance for this work; going directly to the instance since we want to get numbers from there to see if it's feasible to establish a raptor performance baseline.
Depends on: 1123884
Depends on: 1129948
AWS instance: r3.xlarge ubuntu 14.04
Docker image: https://github.com/rwood-moz/raptor-docker-runner
Cmd: DEBUG=* RUNS=100 APPS='clock' RAPTOR_EMULATOR=1 node tests/raptor/emulator_launch_test.js
Duration for 1x cmd: Approx 25 minutes
Restarted the emulator (and did a make raptor) before each cmd.
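
For reference, here is a minimal sketch of how these repeated invocations could be scripted (a hypothetical driver, not the actual harness; it assumes it is run from the gaia checkout used above and that restarting the emulator is handled separately):

// run-iterations.js - hypothetical driver for repeating the raptor cmd above
'use strict';
const { execSync } = require('child_process');

const ITERATIONS = 3; // number of repeated cmd invocations (illustrative)
const CMD = "DEBUG=* RUNS=100 APPS='clock' RAPTOR_EMULATOR=1 " +
            "node tests/raptor/emulator_launch_test.js";

for (let i = 0; i < ITERATIONS; i++) {
  // Placeholder: restart the emulator here (environment-specific step).
  execSync('make raptor', { stdio: 'inherit' });  // rebuild/push the raptor profile
  execSync(CMD, { stdio: 'inherit' });            // one 100-launch test run
}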

Results: https://gist.github.com/rwood-moz/17604d572bfa4be06217
Flags: needinfo?(eperelman)
Same emulator setup as in comment 2, with updated/latest gaia and gecko.

Results: https://gist.github.com/rwood-moz/e52492f11323c484ee2e


For comparison, the launch test was run on an actual flame-kk device from my local VM.
Cmd: DEBUG=* RUNS=100 APPS='clock' node tests/raptor/launch_test.js
Duration for 1x cmd: Approx 14 minutes

Results: https://gist.github.com/rwood-moz/db58832d9ab1089efcb8
For the results of the emulator tests in comment 2:

➜ logfile 1
Mean: 6814.98
Median: 6668.5
Minimum: 5421
Maximum: 8961
Standard Deviation: 724.532255662196

➜ logfile 2
Mean: 6792.14
Median: 6681.5
Minimum: 5370
Maximum: 8942
Standard Deviation: 760.2411236315791

➜ logfile 3
Mean: 6704.27
Median: 6529
Minimum: 5691
Maximum: 9571
Standard Deviation: 768.0729902052462
Flags: needinfo?(eperelman)
For emulator tests in comment 3:

Mean 0: 7077.551724137931
Median 0: 7042
Mode 0: 7042
Minimum 0: 5529
Maximum 0: 9318
Standard Deviation 0: 896.145369450003
95th Percentile 0: 8925.65

 ----------------

Mean 1: 7080.0344827586205
Median 1: 6981
Mode 1: 6981
Minimum 1: 5510
Maximum 1: 8851
Standard Deviation 1: 910.0654952845379
95th Percentile 1: 8714.2

 ----------------

Mean 2: 6939.931034482759
Median 2: 6912
Mode 2: 6912
Minimum 2: 5491
Maximum 2: 9589
Standard Deviation 2: 863.0025345003932
95th Percentile 2: 8307.449999999997

 ----------------

Mean 3: 7251.586206896552
Median 3: 7224
Mode 3: 7224
Minimum 3: 5394
Maximum 3: 9852
Standard Deviation 3: 1072.862960598204
95th Percentile 3: 9344.699999999997

 ----------------

Mean 4: 6885.482758620689
Median 4: 6784
Mode 4: 6784
Minimum 4: 5510
Maximum 4: 8529
Standard Deviation 4: 851.6285259750632
95th Percentile 4: 8280.099999999999

 ----------------

Mean 5: 6998.620689655172
Median 5: 6981
Mode 5: 6981
Minimum 5: 5491
Maximum 5: 9589
Standard Deviation 5: 860.0780002235203
95th Percentile 5: 8887.899999999998

 ----------------

Mean 6: 7171.103448275862
Median 6: 7053
Mode 6: 7053
Minimum 6: 5394
Maximum 6: 9852
Standard Deviation 6: 985.0501945841027
95th Percentile 6: 8952.349999999997

 ----------------

Mean 7: 6946.793103448276
Median 7: 6912
Mode 7: 6912
Minimum 7: 5529
Maximum 7: 9318
Standard Deviation 7: 917.6021743093602
95th Percentile 7: 8568.449999999997

 ----------------

Mean 8: 7057.172413793103
Median 8: 6981
Mode 8: 6981
Minimum 8: 5491
Maximum 8: 8851
Standard Deviation 8: 854.2952759727435
95th Percentile 8: 8714.2

 ----------------

Mean 9: 7108.275862068966
Median 9: 6942
Mode 9: 6942
Minimum 9: 5394
Maximum 9: 9852
Standard Deviation 9: 1056.702577688359
95th Percentile 9: 9602.149999999998

 ----------------

Mean 10: 7077.551724137931
Median 10: 7042
Mode 10: 7042
Minimum 10: 5529
Maximum 10: 9318
Standard Deviation 10: 896.145369450003
95th Percentile 10: 8925.65

 ----------------

Mean 11: 7080.0344827586205
Median 11: 6981
Mode 11: 6981
Minimum 11: 5510
Maximum 11: 8851
Standard Deviation 11: 910.0654952845379
95th Percentile 11: 8714.2

 ----------------

Mean 12: 6939.931034482759
Median 12: 6912
Mode 12: 6912
Minimum 12: 5491
Maximum 12: 9589
Standard Deviation 12: 863.0025345003932
95th Percentile 12: 8307.449999999997

 ----------------

Mean 13: 7251.586206896552
Median 13: 7224
Mode 13: 7224
Minimum 13: 5394
Maximum 13: 9852
Standard Deviation 13: 1072.862960598204
95th Percentile 13: 9344.699999999997

 ----------------

Mean 14: 6885.482758620689
Median 14: 6784
Mode 14: 6784
Minimum 14: 5510
Maximum 14: 8529
Standard Deviation 14: 851.6285259750632
95th Percentile 14: 8280.099999999999

 ----------------

Mean 15: 6998.620689655172
Median 15: 6981
Mode 15: 6981
Minimum 15: 5491
Maximum 15: 9589
Standard Deviation 15: 860.0780002235203
95th Percentile 15: 8887.899999999998

 ----------------

Mean 16: 7168.8
Median 16: 7072.5
Mode 16: 7072.5
Minimum 16: 5394
Maximum 16: 9852
Standard Deviation 16: 1052.1954001039921
95th Percentile 16: 9171
Status: NEW → ASSIGNED
Keywords: perf
For device results in comment 3:

Mean 0: 1177.5862068965516
Median 0: 1181
Mode 0: 1181
Minimum 0: 1101
Maximum 0: 1252
Standard Deviation 0: 39.15091687954342
95th Percentile 0: 1238.7

 ----------------

Mean 1: 1188.2413793103449
Median 1: 1163
Mode 1: 1163
Minimum 1: 1077
Maximum 1: 1338
Standard Deviation 1: 65.49633060130493
95th Percentile 1: 1330.4

 ----------------

Mean 2: 1168.1379310344828
Median 2: 1173
Mode 2: 1173
Minimum 2: 1053
Maximum 2: 1281
Standard Deviation 2: 51.406912280374705
95th Percentile 2: 1244.8999999999999

 ----------------

Mean 3: 1181.1724137931035
Median 3: 1181
Mode 3: 1181
Minimum 3: 1080
Maximum 3: 1273
Standard Deviation 3: 42.81239666226018
95th Percentile 3: 1251.15

 ----------------

Mean 4: 1185.896551724138
Median 4: 1163
Mode 4: 1163
Minimum 4: 1077
Maximum 4: 1338
Standard Deviation 4: 64.53833120716668
95th Percentile 4: 1330.4

 ----------------

Mean 5: 1168.0344827586207
Median 5: 1163
Mode 5: 1163
Minimum 5: 1053
Maximum 5: 1263
Standard Deviation 5: 50.40011135634981
95th Percentile 5: 1245.9

 ----------------

Mean 6: 1184.1724137931035
Median 6: 1187
Mode 6: 1187
Minimum 6: 1080
Maximum 6: 1281
Standard Deviation 6: 48.01154240928353
95th Percentile 6: 1273.3999999999999

 ----------------

Mean 7: 1177.3793103448277
Median 7: 1171
Mode 7: 1171
Minimum 7: 1077
Maximum 7: 1330
Standard Deviation 7: 54.41818521673611
95th Percentile 7: 1310.05

 ----------------

Mean 8: 1179.1379310344828
Median 8: 1163
Mode 8: 1163
Minimum 8: 1107
Maximum 8: 1338
Standard Deviation 8: 57.3343034695641
95th Percentile 8: 1297.1499999999999

 ----------------

Mean 9: 1171.344827586207
Median 9: 1170
Mode 9: 1170
Minimum 9: 1053
Maximum 9: 1281
Standard Deviation 9: 53.742858939948846
95th Percentile 9: 1273.3999999999999

 ----------------

Mean 10: 1177.5862068965516
Median 10: 1181
Mode 10: 1181
Minimum 10: 1101
Maximum 10: 1252
Standard Deviation 10: 39.15091687954342
95th Percentile 10: 1238.7

 ----------------

Mean 11: 1188.2413793103449
Median 11: 1163
Mode 11: 1163
Minimum 11: 1077
Maximum 11: 1338
Standard Deviation 11: 65.49633060130493
95th Percentile 11: 1330.4

 ----------------

Mean 12: 1168.1379310344828
Median 12: 1173
Mode 12: 1173
Minimum 12: 1053
Maximum 12: 1281
Standard Deviation 12: 51.406912280374705
95th Percentile 12: 1244.8999999999999

 ----------------

Mean 13: 1181.1724137931035
Median 13: 1181
Mode 13: 1181
Minimum 13: 1080
Maximum 13: 1273
Standard Deviation 13: 42.81239666226018
95th Percentile 13: 1251.15

 ----------------

Mean 14: 1185.896551724138
Median 14: 1163
Mode 14: 1163
Minimum 14: 1077
Maximum 14: 1338
Standard Deviation 14: 64.53833120716668
95th Percentile 14: 1330.4

 ----------------

Mean 15: 1168.0344827586207
Median 15: 1163
Mode 15: 1163
Minimum 15: 1053
Maximum 15: 1263
Standard Deviation 15: 50.40011135634981
95th Percentile 15: 1245.9

 ----------------

Mean 16: 1180.6
Median 16: 1179.5
Mode 16: 1179.5
Minimum 16: 1080
Maximum 16: 1281
Standard Deviation 16: 52.30812556381657
95th Percentile 16: 1277
Just a note that the results in comments 5 and 6 were re-chunked into groups of 30 runs each.
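
For reference, a rough sketch of the kind of chunked-statistics calculation behind these summaries (hypothetical; the actual test-stats script may differ). It takes a flat list of launch times in milliseconds and reports mean, median, min, max, standard deviation, and 95th percentile for each group of 30 runs:

// chunk-stats.js - hypothetical sketch approximating the summaries above
'use strict';

function stats(values) {
  const sorted = values.slice().sort((a, b) => a - b);
  const n = sorted.length;
  const mean = sorted.reduce((s, v) => s + v, 0) / n;
  const median = n % 2 ? sorted[(n - 1) / 2]
                       : (sorted[n / 2 - 1] + sorted[n / 2]) / 2;
  const variance = sorted.reduce((s, v) => s + (v - mean) ** 2, 0) / n;
  const rank = 0.95 * (n - 1);               // linear-interpolated 95th percentile
  const lo = Math.floor(rank);
  const hi = Math.ceil(rank);
  const p95 = sorted[lo] + (sorted[hi] - sorted[lo]) * (rank - lo);
  return { mean, median, min: sorted[0], max: sorted[n - 1],
           stdev: Math.sqrt(variance), p95 };
}

// Re-chunk a flat list of launch times into groups of 30 runs each.
function chunkedStats(launchTimes, size = 30) {
  const chunks = [];
  for (let i = 0; i + size <= launchTimes.length; i += size) {
    chunks.push(stats(launchTimes.slice(i, i + size)));
  }
  return chunks;
}

// Usage: chunkedStats(times).forEach((s, i) => console.log(`Mean ${i}: ${s.mean}`));
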
AWS instance: c3.2xlarge ubuntu 14.04
Docker image: https://github.com/rwood-moz/raptor-docker-runner 
Cmd: DEBUG=* RUNS=30 APPS='clock' RAPTOR_EMULATOR=1 node tests/raptor/emulator_launch_test.js
Duration for 1x cmd: Approx 10 minutes
Restarted the emulator (and did a make raptor) before each cmd.

Results: https://gist.github.com/rwood-moz/5f43461b237118ecd198
For emulator-kk results in comment 8 (thanks for the script, Eli!):

Mean 0: 6335.172413793103
Median 0: 6086
Mode 0: 6086
Minimum 0: 5311
Maximum 0: 8721
Standard Deviation 0: 843.0406816774259
95th Percentile 0: 7528.749999999997

 ---------------- 

Mean 1: 6330
Median 1: 6244
Mode 1: 6244
Minimum 1: 4919
Maximum 1: 7786
Standard Deviation 1: 622.0492274565657
95th Percentile 1: 7420.249999999998

 ---------------- 

Mean 2: 6004.862068965517
Median 2: 5950
Mode 2: 5950
Minimum 2: 4870
Maximum 2: 7528
Standard Deviation 2: 709.3941428716105
95th Percentile 2: 7443.45

 ---------------- 

Mean 3: 5969.620689655172
Median 3: 5608
Mode 3: 5608
Minimum 3: 4814
Maximum 3: 9034
Standard Deviation 3: 931.0047377113093
95th Percentile 3: 7767.649999999996

 ---------------- 

Mean 4: 6380.793103448276
Median 4: 6069
Mode 4: 6069
Minimum 4: 5052
Maximum 4: 8127
Standard Deviation 4: 877.9617609813886
95th Percentile 4: 7969.299999999999

 ---------------- 

Mean 5: 5848.310344827586
Median 5: 5734
Mode 5: 5734
Minimum 5: 4632
Maximum 5: 7849
Standard Deviation 5: 800.2816790074938
95th Percentile 5: 7676.099999999999

 ---------------- 

Mean 6: 6036.758620689655
Median 6: 5847
Mode 6: 5847
Minimum 6: 4911
Maximum 6: 7882
Standard Deviation 6: 678.3625139678034
95th Percentile 6: 7572.299999999999

 ---------------- 

Mean 7: 6458.724137931034
Median 7: 6253
Mode 7: 6253
Minimum 7: 5291
Maximum 7: 7844
Standard Deviation 7: 750.3344953649683
95th Percentile 7: 7672.05

 ---------------- 

Mean 8: 6522.275862068966
Median 8: 6561
Mode 8: 6561
Minimum 8: 5022
Maximum 8: 8422
Standard Deviation 8: 785.2974352633233
95th Percentile 8: 7695.249999999997

 ---------------- 

Mean 9: 6437.827586206897
Median 9: 6300
Mode 9: 6300
Minimum 9: 5252
Maximum 9: 8384
Standard Deviation 9: 835.9394495876675
95th Percentile 9: 8162.65
For the emulator results in comment 5:

Smallest mean: 6885ms
Largest mean: 7251ms
Range: 366ms


10% regression:
  6885 + 689 = 7574
  7251 + 725 = 7976
5% regression:
  6885 + 344 = 7229
  7251 + 363 = 7614

Assessment:
A 5% regression falls within the normal range of values, making it impossible to detect regressions unless they are greater than 5%, and even then a regression should probably be significantly above 5% in order to prove that the maximum of the range hasn't simply increased. Falling back to requiring a 10% regression to fail a test means needing a *700ms* regression. Regressions in the 3-10% range, which are the most common, would usually never be caught, leaving this mechanism able to catch only the most serious regressions. I'm not sure, but it seems inefficient to allocate this much time to testing every patch if it can only prevent 10% of the regressions that ever arise.

The emulator values do exhibit quite a bit of variance, with minimum-to-maximum values potentially swinging by close to 100%.

---

For the device results in comment 6:

Smallest mean: 1168ms
Largest mean: 1188ms
Range: 20ms

10% regression:
  1168 + 117 = 1285
  1188 + 119 = 1307
5% regression:
  1168 + 58 = 1226
  1188 + 60 = 1248
3% regression:
  1168 + 35 = 1203
  1188 + 36 = 1224

Assessment:
Even a 3% regression in the smallest mean puts it 15ms higher than the largest mean. I would feel comfortable failing regressions of 3-4% if they ran on devices. The values produced by devices seem to be pretty stable, with minimal variance compared to emulator values.
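
For illustration, a tiny hypothetical helper encoding the threshold arithmetic above (not part of raptor): a run would fail if its mean exceeds the baseline mean scaled by the allowed regression percentage.

// threshold.js - hypothetical illustration of the regression arithmetic above
'use strict';

function regressionThreshold(baselineMean, pct) {
  return baselineMean * (1 + pct / 100); // fail if the new mean exceeds this
}

console.log(Math.round(regressionThreshold(6885, 10))); // ~7574 (emulator, comment 5)
console.log(Math.round(regressionThreshold(6885, 5)));  // ~7229
console.log(Math.round(regressionThreshold(1168, 3)));  // ~1203 (device, comment 6)
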
For the more powerful emulator results in comment 9:

Smallest mean: 5848ms
Largest mean: 6522ms
Range: 674ms

Assessment:
With a more powerful emulator, we are experiencing a much larger range of means? That doesn't bode well for the reliability of the emulator. A range of 674ms already represents more than 10% of the largest mean, meaning we would need a regression threshold of roughly 15% just to ensure validity, which also means we wouldn't catch any regressions smaller than 15%.

:(
Duration for the emulator results in comment 5:

6 min (approx) for initial emulator boot-up and make raptor
+ 25 min for 100 RUNS of 1 single app (a single test iteration of 100 launches of a single app)
= 31 minutes for a single app

Extend it to all apps:
6 min initial emulator boot
+ (25 min for 100 RUNS x 12 apps minimum)
= 306 minutes for all apps if done in the same test run
x2 (master, master + PR) although they could be run concurrently via different tasks


Duration for device results in comment 6:

2 min (approx) for initial device boot-up and make raptor
+ 14 min for 100 RUNS of 1 single app (a single test iteration of 100 launches of a single app)
= 16 minutes for a single app

Extend it to all apps:
2 min initial device boot
+ (14 min for 100 RUNS x 12 apps minimum)
= 170 minutes for all apps if done in the same test run
x2 (master, master + PR) although they could be run concurrently via different devices
Duration for emulator-kk results in comment 9:

6 min (approx) for initial emulator boot-up and make raptor
+ 10 min for 30 RUNS of 1 single app (a single test iteration of 30 launches of a single app)
= 16 minutes for a single app

Extend it to all apps:
6 min initial emulator boot
+ (10 min for 30 RUNS x 12 apps minimum)
= 126 minutes for all apps if done in the same test run
x2 (master, master + PR) although they could be run concurrently via different tasks
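
For reference, a small hypothetical sketch of the duration arithmetic used in these estimates (illustrative numbers taken from the comments above):

// suite-duration.js - hypothetical estimate of total wall-clock time per branch
'use strict';

function suiteMinutes(bootMin, perAppMin, appCount) {
  return bootMin + perAppMin * appCount; // one boot, then one test run per app
}

console.log(suiteMinutes(6, 25, 12)); // 306 (emulator, 100 runs per app)
console.log(suiteMinutes(2, 14, 12)); // 170 (device, 100 runs per app)
console.log(suiteMinutes(6, 10, 12)); // 126 (emulator-kk, 30 runs per app)
// x2 for master and master + PR, although those could run as concurrent tasks.
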
Depends on: 1138074
(In reply to :Eli Perelman from comment #10)
> Assessment:
> A 5% regression falls within the normal range of values, making it
> impossible to determine regressions unless it is greater than 5%, and even
> then should probably be significantly above 5% in order prove that the max
> range hasn't increased. Falling back to requiring a 10% regression to fail a
> test means needing to have a *700ms* regression. The potential for missed

This doesn't surprise me in the least. I feel that emulator checks are going to be useful as sanity checks only. We shouldn't count on these as the be-all and end-all of how we monitor performance.

Among other things, as we discussed on IRC, there is no guarantee of similar performance between separate instances hosting the emulator, or even the same instance at different times. That means the most we can pull out is a simple back to back A->B test on the same instance.

So while this can be ok, within reason, for "fail a build if it's grossly different" it'll be pretty useless for giving us a picture of performance over time. Yesterday's results won't be comparable to today's results because they'll possibly be on a fundamentally different host speed.

We could *try* to stitch it together from a series of deltas (last test had an A->B diff of 3%, the next test a B->C diff of 5%, therefore we extrapolate that A->C had a diff of 1.03 * 1.05 - 1 = 8.15%). But it's going to be awfully indirect compared to just testing on a more stable environment like a device.
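
To make the delta-stitching idea concrete, a tiny hypothetical sketch that compounds a series of relative push-over-push diffs into one cumulative change:

// Compound relative diffs, e.g. A->B of 3% and B->C of 5%, into an A->C diff.
function compoundDeltas(deltas) {
  return deltas.reduce((acc, d) => acc * (1 + d), 1) - 1;
}

console.log((compoundDeltas([0.03, 0.05]) * 100).toFixed(2) + '%'); // 8.15%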

Even using this for sanity checks, I suspect we will have spikes that foul those tests, depending on how long that takes to execute; the performance can change underneath the test as the hosting service prioritizes/deprioritizes VMs. 

I think we should roll it out and see, but I also think we need to have good sanity check protocols in hand, like retesting the same build a bunch of times over through the final test architecture. If it starts failing itself when the build's otherwise bit-identical, we might have an issue depending on how often that happens.

> Even a 3% regression in the smallest mean puts it 15ms higher than the
> largest mean. I would feel comfortable failing regressions of 3-4% if they
> ran on devices. The values produced by devices seems to be pretty stable and
> with minimal variance compared to emulator values.

Again, reiterating an IRC discussion for the purpose of rolling it up into Bugzilla: my concern with this line of thinking, on both device and emulator, is that we shouldn't over-rely on the mean, or even the median. Two very different sets of performance characteristics can have similar values for these depending on how spread out the results are on either side of the center.

While I'm wary of standard deviation on a non-Gaussian distribution (task time tests are always positive-skewed due to a hard lower bound on the minimum possible time to complete the task), it may be good enough to correlate with the distribution shape. Alternatively, possibly looking at a higher percentile (90th/95th) along with the mean or median would capture that.

The problem with both of these approaches is that I'm not sure you'll get stable results on them from only 30 data points; they're both very distribution-dependent, and small samples tend to have variable distributions. A 90th percentile on that is the 4th-highest point, and the 95th percentile is the 2nd-highest point. Similarly, a standard deviation covers about 20 of the 30 points, with only about 10 points outside it split between the two sides. Those all strike me as pretty thin.
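
To put numbers on that small-sample concern, a quick hypothetical check of where the high percentiles land in a 30-point sample:

// With n = 30 and linear interpolation, the zero-based percentile rank is p * (n - 1):
// p95 -> rank ~27.55, i.e. between the 3rd- and 2nd-highest points;
// p90 -> rank ~26.1, i.e. between the 4th- and 3rd-highest points.
function percentileRank(n, p) {
  return p * (n - 1);
}
console.log(percentileRank(30, 0.95)); // ~27.55
console.log(percentileRank(30, 0.90)); // ~26.1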

On the plus side, I do think even a mean/median-only approach will be ok catching instances where the distribution doesn't change and the numbers just get higher. That would correspond with a single operation adding a predictable amount of time, which is probably the most common scenario. 

As long as we supplement this kind of approach with longer-running tests with more data (I'd suggest a daily 20-something hour run, with as many points as we can gather in that time) to catch other types of perf regressions I think it can be OK.
(In reply to :Eli Perelman from comment #11)
> For the more powerful emulator results in comment 9:
> 
> Smallest mean: 5848ms
> Largest mean: 6522ms
> Range: 674ms
> 
> Assessment:
> With a more powerful emulator, we are experiencing a much larger range of
> means? That doesn't bode well for the reliability of the emulator. A range
> 674ms already represents more than 10% of the value of the largest mean,
> concluding that we would need a regression threshold of 15% just to ensure
> its validity, which also means that we wouldn't catch any regressions less
> than 15%.
> 
> :(

I'm not ready to conclude that the increased variability correlates with the power of the emulator. I think you might find that there's a high level of variability, period, and any one of these sessions may or may not capture a minimum or maximum amount of fluctuation.

I would run a *lot* of these before drawing any conclusions. Try running a day's worth of 30-test results and see what comes out then.
(In reply to Geo Mealer [:geo] from comment #15)
> I'm not ready to conclude that the increased variability correlates with the
> power of the emulator.

Oh yeah, I wasn't making that connection, it was just disappointing to see that increased power hadn't improved the situation.

That said, I agree with all your points. I do believe that on the post-commit automation side, we need to crank up the amount of data so the visualization is logical and useful, contrary to what we currently have. Also like you said, reliance on mean/median/p50 in post-commit automation hasn't been particularly insightful, hence needing to move to p90/95 there with much more data. Then again, this data will continue to run against devices for the foreseeable future, which have been pretty reliable.

For the pre-commit side, it's really a balance between time and resources. Do you think it's possible to get a valuable determination from the current state of the emulators if we had more runs (maybe 40-60) and did a percentile analysis?
Flags: needinfo?(gmealer)
Depends on: 1139428
Depends on: 1139448
(In reply to :Eli Perelman from comment #16)
> 
> For the pre-commit side, it's really a balance between time and resources.
> Do you think its possible to get a valuable determination from the current
> state of the emulators if we had more runs (maybe 40-60) and did a
> percentile analysis?

Probably, but I honestly still don't know. But I think you should try it, just maybe not with a blocking test yet.

This is long. Sorry.

Let me separate concerns. 

First, "current state of the emulators":

My primary concern comes back to how performance-stable the host instance for the emulator is, nothing about the emulators themselves. 

If your host's performance is changing (or I/O latency changes to virtualized disk stores, or any other systemic dependency) it's kind of obvious that your overall performance will change with that. And further, we host on Amazon, and it's obvious from hunting around the web that the performance of those hosts is typically quite variable.

So my conclusion there:

Your Firefox OS emulators, as currently hosted on Amazon (and probably any other VPS- or EC2-like host), will not be performance-stable, unless you've somehow locked down performance-stable host instances.

That drives my conclusion that this won't be reliable for comparing results from sessions at different times; the host instance performance can and probably will change between them. 

But what I don't know is *if the performance fluctuates significantly during a short period of time*. 

If so, that means even the proposed immediate A/B test is vulnerable, because the performance might spike or trough during the A/B test.

My suspicion is that yes, it can. So then it comes down to how variable it can be, and that's why I suggested a very long-running set of performance tests in the same way you've done above, to get as much data as reasonably possible so you can see what the spread is. It has the side effect of making sure it's all during one theoretical session, so you can also verify whether or not performance changes during that session.

But there are other concerns, and these -do- have to do with emulators and Firefox OS:

The host OS itself may not be performance-stable. The emulators run on Linux. When do maintenance daemons kick in and steal performance from the emulator? If you don't know that, you don't know when spikes will be introduced.

And Firefox OS isn't very performance-stable as it is. You can rerun simple timing tests on it over and over and get pretty variable results, with unpredictable spikes. That doesn't surprise me in a multi-layered OS with a bunch of concurrency going on, but does mean performance testing it in general isn't going to be as precise as you'd like unless you learn how to turn off a lot of background activity.

The golden rule of performance testing is that the only thing that's supposed to be variable is the exact thing you're testing. Everything else just craps up the test.

But we're talking:

Variable virtualized instance + variable host OS + variable Firefox OS.

The emulator is probably the only stable part of it, to be honest, hence the least of my concerns.

And not to be negative, but keep in mind I'm still not sure we've nailed down how to do this right on just "variable Firefox OS".

But I do think you should try it. I'd recommend making sure it works to your satisfaction on-device first, -then- moving to emulator, though. Take on one stabilization challenge at a time.

And you might consider self-hosting the machines or VMs used for this particular test. That might go a long way towards stabilizing down their performance. Maybe not everything can/should be elastic. 

Either way you go, also keep in mind that I/O is a VM's kryptonite. It's not just about stabilizing the VM--gotta stabilize its external disk stores too.


Second, "40 or 50 data points":

I think you should get as many as you can afford to get time-wise during one run, as long as they're all on the same instance. Don't parallelize this between instances, due to the problem above.

There's nothing magic about 30 data points, aside from the fact that samples of that size are considered adequate for aggregation via a mean of means by way of the Central Limit Theorem. I assume someone knew that significance and decided 30 would be a good number.

But to make that work, you have to have a lot of independent 30+ point samples from the same build, running in the same environment (i.e. from the same population). We don't test that way, so 30 points is meaningless to us. You can't take samples from different builds and aggregate them like that, and we only generate one sample per build.

So it comes down to this: the more points you can get--again, assuming things are otherwise performance stable--the closer your mean (or percentile, or whatever) will be to the "true" mean. You will decrease variability due to random sampling.

So get as many as you can afford to get for now, but temper that by doing more trials after reducing the sample size and see if things get much more variable. Somewhere there's a sweet spot where sample consistency will be "good enough," and aim for that, no less, no more.
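
As a rough illustration of that sampling trade-off (hypothetical numbers, assuming independent runs), the noise in the mean shrinks with the square root of the sample size:

// Standard error of the mean = sigma / sqrt(n), using a ~900ms per-run standard
// deviation in the ballpark of the emulator samples above.
const sigma = 900;
for (const n of [30, 60, 120, 240]) {
  console.log(`n=${n}: +/-${(sigma / Math.sqrt(n)).toFixed(0)}ms standard error`);
}
// Quadrupling the sample size roughly halves the noise in the mean (~164ms -> ~82ms).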

It may take time and a bunch of false positives or negatives to find that. That's why I'm saying do this by staging the test as a non-blocking test for the time being.

Also, this -all- assumes things are even stable enough to test on, per the first concern.


Finally, "percentile analysis":

My observation so far from your samples is that the means/medians are way more consistent than your high percentiles. My guess is that's because you have more outliers than we'd probably like, but that probably reflects the shifting ground you're testing on.

So for the type of simple qualification test you're doing, I'd align on the statistic or combination of statistics that A) is the most stable build over build but B) will change if there's a large enough problem.

That's probably mean/median, possibly considered in combination with standard deviation.

I do think that *acceptance* should be done on 90+ percentile analysis, for the simple reason that it answers the question we probably care about: "What level of quality will people experience the grand majority of the time?" as opposed to "What level of quality will people experience half the time?" Standard web testing practice seems to agree with me, so I'm confident on that.

But you need way more points to make that practical, so I think it's moot here. For a simple qualification test, just do whatever seems to call out a problem.

Just don't get hung up on the threshold. It's probably going to need to be an insensitive test to avoid a lot of false positives, given the stabilization problems I outlined above. It's a sanity check; don't treat it as anything more than that.

Even if you only catch within a 20% threshold, go with it. It's better than nothing, and we can shrink it later once we figure out how to stabilize.
Flags: needinfo?(gmealer)
> And Firefox OS isn't very performance-stable as it is. You can rerun simple timing tests on it over and over and get pretty variable results, with unpredictable spikes. That doesn't surprise me in a multi-layered OS with a bunch of concurrency going on, but does mean performance testing it in general isn't going to be as precise as you'd like unless you learn how to turn off a lot of background activity.

I should also call out that this is its own issue. The users experience that kind of variability as well. 

It's possible one of the things we should be tracking is how variable the OS is inherently, and tightening that up -not- just for the purposes of stabilizing tests, but also to make the user experience more predictable.
More tests in progress, tweaking the emulator to see if that results in reduced variance.

https://gist.github.com/rwood-moz/3732062d4aeea123f95a
After more tests with emulator tweaks, this set of numbers is a bit more promising concerning variance. Note that this is on emulator and not emulator-kk (emulator-kk blocked by bug 1139428).

AWS r3.xlarge ubuntu 14.04
Emulator (NOT emulator-kk)
Cfg: 2048 RAM (default 2047MB partition size which is max)
Added: -wipe-data (rest of settings default)
Cmd: DEBUG=* RUNS=30 APPS='settings' RAPTOR_EMULATOR=1 node tests/raptor/emulator_launch_test.js
Iterations: 10 (shutdown and restart emulator and do make raptor before each cmd iteration)
Separate iterations but appending to one log.

Summary of results (via test-stats):
 
Mean 0: 14510.137931034482
Median 0: 14265
Mode 0: 14265
Minimum 0: 13736
Maximum 0: 15820
Standard Deviation 0: 607.9252860252013
95th Percentile 0: 15706.949999999999
 
----------------
 
Mean 1: 14179.275862068966
Median 1: 13896
Mode 1: 13896
Minimum 1: 13465
Maximum 1: 15896
Standard Deviation 1: 679.9015596368863
95th Percentile 1: 15744
 
----------------
 
Mean 2: 14358.206896551725
Median 2: 14066
Mode 2: 14066
Minimum 2: 13661
Maximum 2: 15906
Standard Deviation 2: 623.9393968863645
95th Percentile 2: 15639.05
 
----------------
 
Mean 3: 14014.448275862069
Median 3: 13933
Mode 3: 13933
Minimum 3: 13193
Maximum 3: 15219
Standard Deviation 3: 462.8235372525455
95th Percentile 3: 15188.6
 
----------------
 
Mean 4: 14843.620689655172
Median 4: 14652
Mode 4: 14652
Minimum 4: 14001
Maximum 4: 16042
Standard Deviation 4: 555.3590583666294
95th Percentile 4: 15707.6
 
----------------
 
Mean 5: 14158.586206896553
Median 5: 14006
Mode 5: 14006
Minimum 5: 13627
Maximum 5: 15366
Standard Deviation 5: 452.38696566800195
95th Percentile 5: 15283.349999999999
 
----------------
 
Mean 6: 14172.241379310344
Median 6: 13995
Mode 6: 13995
Minimum 6: 13545
Maximum 6: 15519
Standard Deviation 6: 553.3661664303498
95th Percentile 6: 15311.9
 
----------------
 
Mean 7: 14049.586206896553
Median 7: 13991
Mode 7: 13991
Minimum 7: 13558
Maximum 7: 15243
Standard Deviation 7: 366.07456534968014
95th Percentile 7: 14619.799999999997
 
----------------
 
Mean 8: 14069.931034482759
Median 8: 13863
Mode 8: 13863
Minimum 8: 13558
Maximum 8: 15389
Standard Deviation 8: 488.00119636897466
95th Percentile 8: 15214.2
 
----------------
 
Mean 9: 14224.310344827587
Median 9: 13895
Mode 9: 13895
Minimum 9: 13409
Maximum 9: 15485
Standard Deviation 9: 660.7001043015046
95th Percentile 9: 15348.2
 
----------------

(Raw data here: https://gist.github.com/rwood-moz/164db86970da075ea1e6)
Switched to using the 'template' app. Initial results here:

https://gist.github.com/rwood-moz/964a4fea4affdae263c4
Making progress. Latest results with more emulator/qemu tweaks:

https://gist.github.com/rwood-moz/53ee14e19de8ac63bcaf

Next I will try running the emulator on a ramdisk that is mounted in the docker container.
Latest numbers, running the emulator on a ramdisk:

https://gist.github.com/rwood-moz/d16e44d7d48203d8f92c
To summarize all of the test data thus far, this is the best setup for reducing launch test variance. Note, this is using the ics emulator; emulator-kk is currently out of commission for these tests because of bug 1139428 and bug 1139448. Hopefully this setup will hold for emulator-kk as well.

AWS r3.xlarge ubuntu 14.04
Docker container: https://github.com/rwood-moz/raptor-docker-runner.git
Emulator (NOT emulator-kk)
** Emulator running on a 1024MB RAMDISK mounted into the docker container **
Cfg: 2048 RAM, 2047 partition (max partition size)
Removed: -skin, -skindir, -camera-back
Added: -wipe-data, -no-skin (rest of settings default)
Added: '-net none -bt hci,null' to end of -qemu $TAIL_ARGS
Added: -cache-size 2048
Cmd: DEBUG=* RUNS=30 APPS='template' RAPTOR_EMULATOR=1 node tests/raptor/emulator_launch_test.js
Iterations: 10 (shutdown and restart emulator and do make raptor before each cmd iteration)
Same setup as above, except on a c4.2xlarge AWS instance type (thanks to Jonas' info on his instance performance tests/rankings). The app launch times are faster, so the suite will run faster; however, the variance is not improved much at all compared with the tests on the r3.xlarge (at least in this limited sample).

https://gist.github.com/rwood-moz/d16e44d7d48203d8f92c#file-ramdisk_6-txt
Did more cycles on the c4.2xlarge and it seems pretty consistent.  This is on the ics emulator currently but will move to emulator-kk when it is working with raptor again. This looks like the best setup:

AWS c4.2xlarge ubuntu 14.04
** Emulator running on a 1024MB RAMDISK mounted into the docker container **
Cfg: 2048 RAM, 2047 partition (max partition size)
Removed: -skin, -skindir, -camera-back
Added: -wipe-data, -no-skin (rest of settings default)
Added: '-net none -bt hci,null' to end of -qemu $TAIL_ARGS
Added: -cache-size 2048
Cmd: DEBUG=* RUNS=30 APPS='template' RAPTOR_EMULATOR=1 node tests/raptor/emulator_launch_test.js
Iterations: 10 (shutdown and restart emulator and do make raptor before each cmd iteration) 

Eli, what do you think of these latest numbers (this file and three more cycles below it):

https://gist.github.com/rwood-moz/d16e44d7d48203d8f92c#file-ramdisk_6-txt
Flags: needinfo?(eperelman)
Looks like we've made some good progress. Just for kicks, can you do a set of 30 runs for the Settings app so we can compare?
Flags: needinfo?(eperelman)
Switched back to emulator-kk as it is working again now. Unfortunately the launch time is slower and the variance is higher on emulator-kk. :(

AWS c4.2xlarge ubuntu 14.04
** Emulator-KK running on a 1024MB RAMDISK mounted into the docker container **
Cfg: 2048 RAM, 2047 partition (max partition size)
Removed: -skin, -skindir, -camera-back
Added: -no-skin (not using -wipe-data b/c on emulator-kk it actually removes /data folder, need that)
Added: '-net none -bt hci,null' to end of -qemu $TAIL_ARGS
Added: -cache-size 2048
Cmd: DEBUG=* RUNS=30 APPS='template' RAPTOR_EMULATOR=1 node tests/raptor/emulator_launch_test.js
Iterations: 10 (shutdown and restart emulator and do make raptor before each cmd iteration)

Results: https://gist.github.com/rwood-moz/d16e44d7d48203d8f92c#file-ramdisk_9a-txt

Same setup, but 'settings' app: https://gist.github.com/rwood-moz/d16e44d7d48203d8f92c#file-ramdisk_9b-txt

What do you think, Eli?
Flags: needinfo?(eperelman)
Rob and I discussed this in person, and while the results from emulator-kk aren't as good as emulator, the standard deviation as a percentage of p95 sits at 8% for emulator and 12% for emulator-kk. A difference of 4% will have to be tolerated in this instance. I think for getting this to the POC stage, we should start with a threshold of 15% and evaluate the efficacy and rate of false positives to see if this is valid.
Flags: needinfo?(eperelman)
No longer depends on: 1139448
Work is now underway to integrate the raptor launch test on emulator with gaia-try via taskcluster, using the parameters/test environment that has been established here.
Status: ASSIGNED → RESOLVED
Closed: 9 years ago
Resolution: --- → FIXED