Closed Bug 950841 - Opened 11 years ago, Closed 10 years ago

Measure Baseline FPS Across Gaia Apps w/ Async Pan and Zoom and Tiling

Component: Core :: Graphics (defect, P1)
Platform: ARM / Gonk (Firefox OS)
Status: RESOLVED WORKSFORME
Target: 1.3 C1 / 1.4 S1 (20dec)
People: Reporter: mchang, Assigned: mchang
Keywords: perf
Whiteboard: [c=handeye p=4 s= u=]
Attachments: 2 files, 1 obsolete file

Measure baseline FPS numbers in three configurations: async pan and zoom off with tiling off, async pan and zoom on with tiling on, and async pan and zoom on with tiling off.
Attached patch: fpsMeasure.patch
Measures FPS over 700 frames and calculates the average FPS and its standard deviation. The first 300 frames are chopped off to let the app do some caching; our partners also let the apps cache some data before measuring FPS.
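
For reference, a minimal sketch of the aggregation the patch performs (the real patch instruments the compositor in C++; this Python version uses a hypothetical fps_samples list and just shows the arithmetic: skip the warm-up frames, then summarize the next 700):

    import statistics

    def summarize(fps_samples, warmup=300, count=700):
        # Discard the warm-up frames so app start-up/caching doesn't skew the numbers,
        # then summarize the next `count` per-frame FPS readings.
        window = fps_samples[warmup:warmup + count]
        return statistics.mean(window), statistics.pstdev(window)

    # e.g. avg, std = summarize(fps_samples)
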
Before Patch (Async pan/zoom off, tiling off):
Settings: Average is: 49.567143, std dev: 10.551631
Contacts: Average is: 51.201429, std dev: 9.525500
Messages: Average is: 48.150000, std dev: 13.465896
Email: Average is: 56.665714, std dev: 5.565169
Gallery: Average is: 54.080000, std dev: 10.600780
Music: Average is: 53.662857, std dev: 10.241284

After Patch (Async pan zoom ON, tiling off):
Settings: Average is: 54.728571, std dev: 7.472849
Contacts: Average is: 52.288571, std dev: 12.399517
Messages: Average is: 51.032857, std dev: 12.642778
Email: Average is: 55.447143, std dev: 7.039181
Gallery: Average is: 53.352857, std dev: 7.103505
Music: Average is: 53.704286, std dev: 9.760400

Experimental setup:
Gaia 1.3 Rev: 
Merge: ac541b8 c9d5064
Author: Malini Das <thehyperballad@gmail.com>
Date:   Tue Dec 17 09:18:06 2013 -0800

Merge pull request #14760 from malini/v1.3mixin
Reland Uplifted changes for Bug 925398 (includes Bug 945284, Bug 947001)

Mozilla-Aurora Gecko Rev:
changeset:   168845:49aa881ba686
tag:         qparent
user:        Dave Hunt <dhunt@mozilla.com>
date:        Mon Dec 16 15:53:03 2013 -0500
summary:     Bug 949406 - Bump marionette_client version to 0.7.1, r=mdas, a=test-only

Gaia Reference Workload Light
Email: My personal email. Inbox: 823 messages / 13.4 MB; dev-b2g mailing list: 1,602 messages / 10.2 MB.
Count 700 frames from the compositor (frames 300-1000). We disregard the first 300 frames because they usually include the homescreen unlock, swipe, and app-open animations.
Buri firmware: ro.build.date=Fri Oct 11 22:28:25 CST 2013

Settings: Scroll up/down 3 times. Go into developer -> back out to main. Scroll a couple times. Go into Lock Screen, enable / disable lockscreen. Scroll a couple more times.
Email: Scroll up/down the inbox once. Check one email, scroll the content, go back to the inbox. Check the next email, scroll the content, go back out. Go to dev-b2g, scroll a couple of times. Check the first email, go back out. Check the next email, scroll around, go back out.
Messages / Contacts: Scroll up / down once. Click on first contact, scroll down, scroll up, pan around, go back out. Click Settings, go back out. Click add Contact, go back out. Scroll down to bottom, click bottom contact, pan around a couple of times. Scroll back out. Scroll up. 
Gallery: Scroll down to bottom, scroll up to top. Click top image, swipe right 10 x, swipe left 5 times. Zoom in, pan right, pan left, zoom out. Go back out. Scroll to bottom. Swipe left 5 times.
Music: Click on songs, scroll to bottom, scroll to top. Click on artists, scroll down, scroll back up. Click on top artist album, go back to artists. Click on albums. Scroll down, scroll up. Click on album, go back out. 

Reboot between measurements.
Updated with proper 11/15 Buri Firmware:

Before Patch (Async pan/zoom off, tiling off):
Settings: Average is: 54.710000, std dev: 6.682400
Contacts: Average is: 51.511429, std dev: 10.210562
Messages: Average is: 50.824286, std dev: 13.496740
Email: Average is: 50.125714, std dev: 13.164397
Gallery: Average is: 49.471429, std dev: 13.332613
Music: Average is: 52.668571, std dev: 10.418878

After Patch (Async pan zoom ON, tiling off):
Settings: Average is: 54.244286, std dev: 7.350338
Contacts: Average is: 50.327143, std dev: 11.599389
Messages: Average is: 51.874286, std dev: 11.276077
Email: Average is: 55.424286, std dev: 6.707883
Gallery: Average is: 54.442857, std dev: 9.746407
Music: Average is: 51.115714, std dev: 12.656090
I am surprised we don't reach a stable fps across all the apps. Shouldn't the compositor be scheduled at a fixed fps and preempt anything else, no matter what the system load is? Worst case, we don't deliver updates quickly enough and checkerboard, but we should always hit the full composited fps. BenWa?
Flags: needinfo?(bgirard)
@gal - From previous scroll FPS measurements, we plateau around 55+ fps after enough scrolls/actions, once everything gets cached. However, our partners don't wait that long; they measure FPS after 2-3 scrolls to cache everything. The FPS curve starts low as the app starts and crawls its way up to 55+ fps, which is the segment I tried to capture.
FPS for Async Pan / Zoom ON, Tiling ON
Settings: Average is: 53.917143, std dev: 8.655713
Contacts: Average is: 52.184286, std dev: 11.651538
Messages: Average is: 49.418571, std dev: 13.192550
Email: Average is: 55.328571, std dev: 4.852456
Gallery: Average is: 53.582857, std dev: 9.915352
Music: Average is: 52.321429, std dev: 10.482685
Scrolling of Browser (nytimes.com). Click on the favorite, wait until loading bar is done, scroll up/down:
Async Pan/Zoom off, Tiling off: Average is: 50.702857, std dev: 7.458475
Async Pan/Zoom ON, Tiling off: Average is: 51.110000, std dev: 6.413215
Async Pan/zoom ON, Tiling ON: Average is: 55.567143, std dev: 7.050010

Are these all the numbers you needed, Milan? Other numbers are in comment 3.
Flags: needinfo?(milan)
The standard deviations on these are really large, to the point I'm not sure they tell us anything significant. 

Taking Settings in comment 2 as an example, we have a "before" that says there's a 68% chance of the true mean being within one standard deviation (~10.5 fps) of the calculated mean, assuming this is a normal distribution. That means the true mean could be anywhere from ~39 to ~60 fps at 68% confidence (with a 32% chance it's either higher or lower than that range).

Then we have an "after" that says there is a 68% chance of the true mean being within one standard deviation (~7.5 fps) of the calculated mean, so ~47 fps to ~62 fps (but we know it's capped at 60) at 68% confidence.
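
Spelling those two intervals out with the comment 2 Settings numbers (just the one-standard-deviation band around each sample mean, as described above):

    before: 49.57 ± 10.55  ≈  [39.0, 60.1] fps
    after:  54.73 ±  7.47  ≈  [47.3, 62.2] fps (display caps at 60)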

That's a pretty huge overlap, from ~47-60 fps, taking the cap into account. If I understand correctly, these two samples could easily both represent the same performance. And that's at 68% confidence. The usual target is quite a bit higher, with correspondingly wider ranges.

In general, my understanding of how this works is that if you have a std deviation calculated at 10fps, two sample calculations within 20fps (or more, at higher confidence) of each other aren't meaningfully indicative of a different true mean. The resolution of the test is just too low due to variability, whether that noise is introduced via the test procedure, the system under test, or whatever.

Can someone from metrics, or someone with a better grounding in stats, correct me here?

It's a big concern of mine that ultimately these types of test results aren't enough to go on to indicate a regression. It's not a matter of "anything is better than nothing" when making these sorts of decisions: the noise level has to be low enough to be useful.
Reading the FPS counter is not the best indicator of what framerate we're actually getting. A high-speed camera would be great. BenWa is talking with Mason offline about the scroll graph method/tool we have started on, to see if it can be used here.
Flags: needinfo?(milan)
Flags: needinfo?(bgirard)
(In reply to Geo Mealer [:geo] from comment #8)
> The standard deviations on these are really large, to the point I'm not sure
> they tell us anything significant. 

After cramming a bunch of stats research into a couple of days, I'm convinced I'm off-base in my understanding of how these stats are best applied. Applying t-tests to the samples, assuming an N of 700, shows them to be largely significant, though I'm unclear on how to interpret the magnitude of that significance.
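
One step worth making explicit here (standard t-test reasoning, and it assumes the 700 frame samples are independent, which is questioned later in this comment): the t-test compares means using the standard error of the mean, not the raw per-sample spread. Using the ~10.55 fps Settings standard deviation from comment 2:

    SE = sigma / sqrt(N) ≈ 10.55 / sqrt(700) ≈ 0.40 fps

so differences of a few fps between runs can come out statistically significant even with a per-frame standard deviation around 10 fps.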

That said, every time we've tried measuring fps in the past--especially with camera--we've seen a large variability in the results. At one point, testing different cameras with Eideticker, we also saw a bimodal distribution instead of a normal distribution. We landed on different local maxima/minima of this in two different test runs and assumed a difference that turned out not to be there. A longer test run showed the real distribution. Averages are really only going to work with a normal distribution, so that's a potential problem.

So, even though I'm probably not the best person to be doing the stats analysis on this, I'm pretty convinced someone should be. We got a lot of benefit from metrics team analyzing our datazilla results--they pointed out some pretty big problems that were skewing them, and clarified the noise level. 

That sort of thing is a concern here as well because we're testing a non-deterministic system using partially manual methods. Both of those may be noisy. And while the device is rebooted between runs, the measurements themselves all come from a single run and are not independent. If the system did a GC in the middle of it, for example, that'd throw off the distribution.

So, upshot, I think someone from metrics team should do an analysis across this to understand significance.

The patch in question collected 700 fps samples and output an average and standard deviation. A couple of immediate questions I'd have:

A) what was the median result? Comparing that to the mean should give us some idea of skew.

B) did the collected results reflect a normal distribution around the calculated mean? 

With 700 results, I'd expect a pattern to emerge if this is a normal distribution unaffected by background noise. Since the individual results are collected into buckets, that should be something we could dump as well (just a dump of all the bucket counts would be adequate); a rough sketch of these checks is included below.

C) assuming we are noisy, do things improve if we automate the scrolls?

I'd be willing to help redo the experiment on one or two of the test points with an updated patch and/or gaia automation if that would get us there, but at this point probably cannot until January.
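
A rough sketch of the checks in questions A and B above, assuming the ~700 raw per-frame FPS values are available as a flat list (the `samples` argument is hypothetical):

    import statistics
    from collections import Counter

    def describe(samples):
        mean = statistics.mean(samples)
        median = statistics.median(samples)   # A) compare to the mean to gauge skew
        stdev = statistics.pstdev(samples)
        # B) dump integer-fps bucket counts to eyeball the shape of the distribution
        buckets = Counter(round(s) for s in samples)
        for fps in sorted(buckets):
            print("%3d fps: %d" % (fps, buckets[fps]))
        return mean, median, stdev
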
David, per Geo's comment 8 and comment 10 can you or anyone else on the Metrics team help us analyze Mason's FPS findings?
Flags: needinfo?(dzeber)
Sorry for the delay - I was on PTO for the first part of the week. 

In response to comment 10: the t-test should be valid unless the distributions are extremely skewed or otherwise weird, as the sample size (N = 700) is quite large. 

Some results are significant, but some are borderline to not significant (assuming independent samples). However, since the comparisons are before-and-after for the same apps, it would be appropriate to use the paired t-test, which would reduce the standard error and possibly increase the significance. 

It's difficult to judge the magnitude of significance using the p-value on an absolute scale (the p-value is the false positive rate you'd expect if you rerun the experiment many many times and the means are actually *the same*), but you can compare p-values to each other or compare them to a threshold value. 

If you can send me the raw data (700 values for each run), I can look into Geo's questions about normality and skewness and provide more details.
Flags: needinfo?(dzeber)
(In reply to dzeber from comment #12)
> 
> ... 
> 
> If you can send me the raw data (700 values for each run), I can look into
> Geo's questions about normality and skewness and provide more details.

Thanks David.

Mason, please attach the raw data to this bug so David can take a closer look.
Flags: needinfo?(mchang)
I can't find all the raw data; I was just copying/pasting the pre-calculated values after making sure the patch worked. I can redo the experiments if needed, but the results won't be the same as what's posted here before.
Flags: needinfo?(mchang)
You can do t-tests with the summaries posted above, but we'd need the original data to answer Geo's questions about the shape of the distributions, to check assumptions, and to run paired t-tests. Since a few of the p-values are hovering close to the cutoff between what we'd consider significant and non-significant (in the 0.05 to 0.01 range), it'd probably be best to get the raw values for further investigation. 

It's fine if the data are from a new run - we should get similar results, unless other things have changed (e.g. version). I suspect the shapes of the distributions will not be that different.
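
As an illustration, the unpaired (Welch) t-test can be run straight from the posted summaries; a sketch with SciPy, plugging in the Settings numbers from the 11/15-firmware runs above and assuming N = 700 frames for both runs:

    from scipy import stats

    # Settings: APZ off (before) vs. APZ on (after), 700 frames each.
    t, p = stats.ttest_ind_from_stats(
        mean1=54.710000, std1=6.682400, nobs1=700,
        mean2=54.244286, std2=7.350338, nobs2=700,
        equal_var=False,  # Welch's t-test: don't assume equal variances
    )
    print(t, p)
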
Attached file: Raw Data Dump from Settings App (obsolete)
Raw data dump from the Compositor::DrawFPS data. The left number is the FPS measurement; the right number is how many times that FPS value was recorded. Is this what you needed? Thanks!
Flags: needinfo?(dzeber)
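
A small sketch of how a dump in that two-column format (FPS value, occurrence count) could be expanded back into summary statistics; the file name is hypothetical:

    import math

    def stats_from_dump(path):
        samples = []
        with open(path) as f:
            for line in f:
                fps, count = line.split()
                samples.extend([float(fps)] * int(count))
        n = len(samples)
        mean = sum(samples) / n
        stdev = math.sqrt(sum((s - mean) ** 2 for s in samples) / n)
        return n, mean, stdev

    # e.g. n, mean, stdev = stats_from_dump("settings_apz_on.txt")
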
Do we know what the hardware VSync rate is for the Buri? I would have thought 60Hz, but Ben pointed out that there was some chance it was higher. I notice there are some results above 60 in the data.
(In reply to Mason Chang [:mchang] from comment #16)
> Created attachment 8358699 [details]
> Raw Data Dump from Settings App

Thanks! Which combination of pan/zoom/tiling and which app was this for? Would you be able to post the data dump from the one you want to compare to (the before/after comparison)?
Flags: needinfo?(dzeber) → needinfo?(mchang)
Attachment #8358699 - Attachment is obsolete: true
Flags: needinfo?(mchang)
I uploaded the raw data with both APZ enabled and disabled. Is this good?
Flags: needinfo?(dzeber)
That's great - thanks.
Flags: needinfo?(dzeber)
Looking at the data, one thing that jumps out is that we're naturally bounded at/around 60 fps by the LCD refresh rate, which is predictably where most of the results lie, with only an insignificant number of results binned beyond that.

Since what we're really looking for is a smooth 60 fps scroll, I wonder if we should look at this more as a discrete probability distribution, with a probability mass function around the chance that it takes less than 16.7 ms to output a frame: either we hit the next refresh with the next frame of animation or we don't.

Maybe what we should really be measuring is the number of refresh misses per total frames, with baselines set around the maximum number of misses. 

Combined with something like an X/Y graph of "number of refreshes" vs. "number of frames" (similar to Benoit's scroll graph, but without the concept of work done per frame), that would also display jank very visibly.
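
A minimal sketch of that refresh-miss metric, assuming per-composite timestamps in milliseconds can be logged (the `frame_times_ms` input is hypothetical) and a 60 Hz panel:

    def refresh_misses(frame_times_ms, refresh_ms=1000.0 / 60.0):
        # Count frame intervals that overshot one vsync period (i.e. missed a refresh),
        # allowing a little slack for timestamp jitter.
        misses = 0
        total = 0
        for prev, cur in zip(frame_times_ms, frame_times_ms[1:]):
            total += 1
            if (cur - prev) > refresh_ms * 1.5:
                misses += 1
        return misses, total

    # e.g. misses, total = refresh_misses(frame_times_ms); miss_rate = misses / total
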
May I ask how these tests are being performed? When I last saw b2gperf, while it's good to have any number to look at, the rate at which it scrolls is too slow to really test performance. Ideally we want a variety of flings that mimic human input, including some higher-velocity flings.

It would be good to get numbers with tiling *and* progressive updates on too (layers.progressive-paint), which ought to reduce contention and thus reduce jank (and possibly increase overall framerate).
Keeping in mind this isn't our ultimate test methodology, I think the flicks to date have largely been manual. Differences in velocity are one of the reasons I suspect the test results will be noisy. When I tried this manually before, the line between "flick fast enough to max out the system" and "flick so fast that the screen doesn't register it" was pretty thin.

Eideticker, which will probably represent our final methodology, automates flicks in a repeatable fashion with Orangutan, and can flick at different velocities without the same touchscreen limitations. I agree that we need to explore a range of flick effort at that point to understand how it affects performance.

As far as what makes it into an ongoing suite, I was thinking more in terms of picking the one that reliably maxed out performance, but there's certainly an argument for "max" and "typical" as two different tests. I think it depends on how much headroom we have in terms of suite runtime, really. One-offs like this probably can use a more comprehensive range, if it turns out to be useful.
I really hate to sound like a broken record, but we should be collecting performance profiles for these runs. From a profile we can see whether the compositor is stuck in a GL call or is sleeping because the test doesn't require 60 FPS scrolling. We don't need to be guessing here.

My guess:
APZC causes bigger layers for display ports. We block the compositor for these allocations. If that's the case, I'm surprised that the APZC numbers are even as good as they are. APZC + buffer rotation is not a winning combination.

If we can confirm the above with profiles then I have solid evidence to start working on gralloc tiling ASAP.
I recommend we simply focus on APZC at this point. If it's slower at anything, let's measure that, attach a profile, and file bugs. We already know the major sources. We need gralloc tiling and gralloc allocation off the compositor. The latter is almost in the bag.
Blocks: 962687
Moving the discussion to bug 962687 so we can track issues, get profiles, and solve performance problems. Closing this bug, since its original intent was just to measure baseline FPS.
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Resolution: --- → WORKSFORME
Blocks: b2g-tiling