Closed Bug 950841 - Opened 11 years ago, Closed 10 years ago

Measure Baseline FPS Across Gaia Apps w/ Async Pan and Zoom and Tiling

Component: Core :: Graphics (defect, P1)
Platform: ARM / Gonk (Firefox OS)
Status: RESOLVED WORKSFORME
Target: 1.3 C1 / 1.4 S1 (20dec)
People: Reporter: mchang, Assigned: mchang
Keywords: perf
Whiteboard: [c=handeye p=4 s= u=]
Attachments: 2 files, 1 obsolete file

Measure baseline FPS numbers in three configurations: async pan and zoom off with tiling off, async pan and zoom on with tiling on, and async pan and zoom on with tiling off.
Attached patch: fpsMeasure.patch
Measures FPS over 700 frames and calculates the average FPS and its standard deviation. The first 300 frames are chopped off to let the app do some caching; our partners also let the apps cache some data before measuring FPS.
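
For reference, a minimal sketch of the aggregation the patch performs (the real patch instruments the compositor in C++; this Python version uses a hypothetical fps_samples list and just shows the arithmetic: skip the warm-up frames, then summarize the next 700):

    import statistics

    def summarize(fps_samples, warmup=300, count=700):
        # Discard the warm-up frames so app start-up/caching doesn't skew the numbers,
        # then summarize the next `count` per-frame FPS readings.
        window = fps_samples[warmup:warmup + count]
        return statistics.mean(window), statistics.pstdev(window)

    # e.g. avg, std = summarize(fps_samples)
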
Before Patch (Async pan/zoom off, tiling off):
Settings: Average is: 49.567143, std dev: 10.551631
Contacts: Average is: 51.201429, std dev: 9.525500
Messages: Average is: 48.150000, std dev: 13.465896
Email: Average is: 56.665714, std dev: 5.565169
Gallery: Average is: 54.080000, std dev: 10.600780
Music: Average is: 53.662857, std dev: 10.241284

After Patch (Async pan zoom ON, tiling off):
Settings: Average is: 54.728571, std dev: 7.472849
Contacts: Average is: 52.288571, std dev: 12.399517
Messages: Average is: 51.032857, std dev: 12.642778
Email: Average is: 55.447143, std dev: 7.039181
Gallery: Average is: 53.352857, std dev: 7.103505
Music: Average is: 53.704286, std dev: 9.760400

Experimental setup:
Gaia 1.3 Rev: 
Merge: ac541b8 c9d5064
Author: Malini Das <thehyperballad@gmail.com>
Date:   Tue Dec 17 09:18:06 2013 -0800

Merge pull request #14760 from malini/v1.3mixin
Reland Uplifted changes for Bug 925398 (includes Bug 945284, Bug 947001)

Mozilla-Aurora Gecko Rev:
changeset:   168845:49aa881ba686
tag:         qparent
user:        Dave Hunt <dhunt@mozilla.com>
date:        Mon Dec 16 15:53:03 2013 -0500
summary:     Bug 949406 - Bump marionette_client version to 0.7.1, r=mdas, a=test-only

Gaia Reference Workload Light
Email: My personal email. Inbox: 823 messages / 13.4 MB; dev-b2g mailing list: 1,602 messages / 10.2 MB.
Count 700 frames from the compositor (frames 300-1000). We disregard the first 300 frames because they usually include the homescreen unlock, swipe, and app-open animations.
Buri firmware: ro.build.date=Fri Oct 11 22:28:25 CST 2013

Settings: Scroll up/down 3 times. Go into developer -> back out to main. Scroll a couple times. Go into Lock Screen, enable / disable lockscreen. Scroll a couple more times.
Email: Scroll up/down the inbox once. Check one email, scroll the content, go back to the inbox. Check the next email, scroll the content, go back out. Go to dev-b2g, scroll a couple of times. Check the first email, go back out. Check the next email, scroll around, go back out.
Messages / Contacts: Scroll up / down once. Click on first contact, scroll down, scroll up, pan around, go back out. Click Settings, go back out. Click add Contact, go back out. Scroll down to bottom, click bottom contact, pan around a couple of times. Scroll back out. Scroll up. 
Gallery: Scroll down to bottom, scroll up to top. Click top image, swipe right 10 x, swipe left 5 times. Zoom in, pan right, pan left, zoom out. Go back out. Scroll to bottom. Swipe left 5 times.
Music: Click on songs, scroll to bottom, scroll to top. Click on artists, scroll down, scroll back up. Click on top artist album, go back to artists. Click on albums. Scroll down, scroll up. Click on album, go back out. 

Reboot between measurements.
Updated with proper 11/15 Buri Firmware:

Before Patch (Async pan/zoom off, tiling off):
Settings: Average is: 54.710000, std dev: 6.682400
Contacts: Average is: 51.511429, std dev: 10.210562
Messages: Average is: 50.824286, std dev: 13.496740
Email: Average is: 50.125714, std dev: 13.164397
Gallery: Average is: 49.471429, std dev: 13.332613
Music: Average is: 52.668571, std dev: 10.418878

After Patch (Async pan zoom ON, tiling off):
Settings: Average is: 54.244286, std dev: 7.350338
Contacts: Average is: 50.327143, std dev: 11.599389
Messages: Average is: 51.874286, std dev: 11.276077
Email: Average is: 55.424286, std dev: 6.707883
Gallery: Average is: 54.442857, std dev: 9.746407
Music: Average is: 51.115714, std dev: 12.656090
I am surprised we don't reach a stable fps across all the apps. Shouldn't the compositor be scheduled at a fixed fps and preempt anything else, no matter what the system load is? Worst case, we don't deliver updates quickly enough and checkerboard, but we should always hit the full composited fps. BenWa?
Flags: needinfo?(bgirard)
@gal - From previous scroll FPS measurements, we plateau around 55+ fps after enough scrolls/actions, once everything gets cached. However, our partners don't wait that long; they measure FPS after 2-3 scrolls to cache everything. The FPS curve starts low as the app starts and crawls its way up to 55+ fps, which is the segment I tried to capture.
FPS for Async Pan / Zoom ON, Tiling ON
Settings: Average is: 53.917143, std dev: 8.655713
Contacts: Average is: 52.184286, std dev: 11.651538
Messages: Average is: 49.418571, std dev: 13.192550
Email: Average is: 55.328571, std dev: 4.852456
Gallery: Average is: 53.582857, std dev: 9.915352
Music: Average is: 52.321429, std dev: 10.482685
Scrolling of Browser (nytimes.com). Click on the favorite, wait until loading bar is done, scroll up/down:
Async Pan/Zoom off, Tiling off: Average is: 50.702857, std dev: 7.458475
Async Pan/Zoom ON, Tiling off: Average is: 51.110000, std dev: 6.413215
Async Pan/zoom ON, Tiling ON: Average is: 55.567143, std dev: 7.050010

Are these all the numbers you needed, Milan? Other numbers are in comment 3.
Flags: needinfo?(milan)
The standard deviations on these are really large, to the point I'm not sure they tell us anything significant. 

Taking Settings in comment 2 as an example, we have a "before" that says there's a 68% chance of the true mean being within one standard deviation (~10.5 fps) of the calculated mean, assuming this is a normal distribution. That means the true mean could be anywhere from ~39 to ~60 fps at 68% confidence (with a 32% chance it's either higher or lower than that range).

Then we have an "after" that says there is a 68% chance of the true mean being within one standard deviation (~7.5 fps) of the calculated mean, so ~47 fps to ~62 fps (but we know it's capped at 60) at 68% confidence.
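
Spelling those two intervals out with the comment 2 Settings numbers (just the one-standard-deviation band around each sample mean, as described above):

    before: 49.57 ± 10.55  ≈  [39.0, 60.1] fps
    after:  54.73 ±  7.47  ≈  [47.3, 62.2] fps (display caps at 60)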

That's a pretty huge overlap, from ~47-60 fps, taking the cap into account. If I understand correctly, these two samples could easily both represent the same performance. And that's at 68% confidence. The usual target is quite a bit higher, with correspondingly wider ranges.

In general, my understanding of how this works is that if you have a std deviation calculated at 10fps, two sample calculations within 20fps (or more, at higher confidence) of each other aren't meaningfully indicative of a different true mean. The resolution of the test is just too low due to variability, whether that noise is introduced via the test procedure, the system under test, or whatever.

Can someone from metrics, or someone with a better grounding in stats, correct me here?

It's a big concern of mine that ultimately these types of test results aren't enough to go on to indicate a regression. It's not a matter of "anything is better than nothing" when making these sorts of decisions: the noise level has to be low enough to be useful.
Reading the FPS counter is not the best indicator of what framerate we're actually getting. A high-speed camera would be great. BenWa is talking with Mason offline about the scroll graph method/tool we have started on, to see if it can be used here.
Flags: needinfo?(milan)
Flags: needinfo?(bgirard)
(In reply to Geo Mealer [:geo] from comment #8)
> The standard deviations on these are really large, to the point I'm not sure
> they tell us anything significant. 

After cramming a bunch of stats research into a couple of days, I'm convinced I'm off-base in my understanding of how these stats are best applied. Applying t-tests to the samples, assuming an N of 700, shows them to be largely significant, though I'm unclear on how to interpret the magnitude of that significance.
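
One step worth making explicit here (standard t-test reasoning, and it assumes the 700 frame samples are independent, which is questioned later in this comment): the t-test compares means using the standard error of the mean, not the raw per-sample spread. Using the ~10.55 fps Settings standard deviation from comment 2:

    SE = sigma / sqrt(N) ≈ 10.55 / sqrt(700) ≈ 0.40 fps

so differences of a few fps between runs can come out statistically significant even with a per-frame standard deviation around 10 fps.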

That said, every time we've tried measuring fps in the past--especially with camera--we've seen a large variability in the results. At one point, testing different cameras with Eideticker, we also saw a bimodal distribution instead of a normal distribution. We landed on different local maxima/minima of this in two different test runs and assumed a difference that turned out not to be there. A longer test run showed the real distribution. Averages are really only going to work with a normal distribution, so that's a potential problem.

So, even though I'm probably not the best person to be doing the stats analysis on this, I'm pretty convinced someone should be. We got a lot of benefit from metrics team analyzing our datazilla results--they pointed out some pretty big problems that were skewing them, and clarified the noise level. 

That sort of thing is a concern here as well because we're testing a non-deterministic system using partially manual methods. Both of those may be noisy. And while the device is rebooted between runs, the measurements themselves all come from a single run and are not independent. If the system did a GC in the middle of it, for example, that'd throw off the distribution.

So, upshot, I think someone from metrics team should do an analysis across this to understand significance.

The patch in question collected 700 fps samples and output an average and standard deviation. A couple of immediate questions I'd have:

A) what was the median result? Comparing that to the mean should give us some idea of skew.

B) did the collected results reflect a normal distribution around the calculated mean? 

With 700 results, I'd expect a pattern to emerge if this is a normal distribution unaffected by background noise. Since the individual results are collected into buckets, that should be something we could dump as well (just a dump of all the bucket counts would be adequate); a rough sketch of these checks is included below.

C) assuming we are noisy, do things improve if we automate the scrolls?

I'd be willing to help redo the experiment on one or two of the test points with an updated patch and/or gaia automation if that would get us there, but at this point probably cannot until January.
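
A rough sketch of the checks in questions A and B above, assuming the ~700 raw per-frame FPS values are available as a flat list (the `samples` argument is hypothetical):

    import statistics
    from collections import Counter

    def describe(samples):
        mean = statistics.mean(samples)
        median = statistics.median(samples)   # A) compare to the mean to gauge skew
        stdev = statistics.pstdev(samples)
        # B) dump integer-fps bucket counts to eyeball the shape of the distribution
        buckets = Counter(round(s) for s in samples)
        for fps in sorted(buckets):
            print("%3d fps: %d" % (fps, buckets[fps]))
        return mean, median, stdev
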
David, per Geo's comment 8 and comment 10 can you or anyone else on the Metrics team help us analyze Mason's FPS findings?
Flags: needinfo?(dzeber)
Sorry for the delay - I was on PTO for the first part of the week. 

In response to comment 10: the t-test should be valid unless the distributions are extremely skewed or otherwise weird, as the sample size (N = 700) is quite large. 

Some results are significant, but some are borderline to not significant (assuming independent samples). However, since the comparisons are before-and-after for the same apps, it would be appropriate to use the paired t-test, which would reduce the standard error and possibly increase the significance. 

It's difficult to judge the magnitude of significance using the p-value on an absolute scale (the p-value is the false positive rate you'd expect if you rerun the experiment many many times and the means are actually *the same*), but you can compare p-values to each other or compare them to a threshold value. 

If you can send me the raw data (700 values for each run), I can look into Geo's questions about normality and skewness and provide more details.
Flags: needinfo?(dzeber)
(In reply to dzeber from comment #12)
> 
> ... 
> 
> If you can send me the raw data (700 values for each run), I can look into
> Geo's questions about normality and skewness and provide more details.

Thanks David.

Mason, please attach the raw data to this bug so David can take a closer look.
Flags: needinfo?(mchang)
I can't find all the raw data; I was just copying/pasting the pre-calculated values after making sure the patch worked. I can redo the experiments if needed, but the results won't be the same as what's posted here before.
Flags: needinfo?(mchang)
You can do t-tests with the summaries posted above, but we'd need the original data to answer Geo's questions about the shape of the distributions, to check assumptions, and to run paired t-tests. Since a few of the p-values are hovering close to the cutoff between what we'd consider significant and non-significant (in the 0.05 to 0.01 range), it'd probably be best to get the raw values for further investigation. 

It's fine if the data are from a new run - we should get similar results, unless other things have changed (e.g. version). I suspect the shapes of the distributions will not be that different.
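
As an illustration, the unpaired (Welch) t-test can be run straight from the posted summaries; a sketch with SciPy, plugging in the Settings numbers from the 11/15-firmware runs above and assuming N = 700 frames for both runs:

    from scipy import stats

    # Settings: APZ off (before) vs. APZ on (after), 700 frames each.
    t, p = stats.ttest_ind_from_stats(
        mean1=54.710000, std1=6.682400, nobs1=700,
        mean2=54.244286, std2=7.350338, nobs2=700,
        equal_var=False,  # Welch's t-test: don't assume equal variances
    )
    print(t, p)
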
Attached file: Raw Data Dump from Settings App (obsolete)
Raw data dump from the Compositor::DrawFPS data. The left number is the FPS measurement; the right number is how many times that FPS value was recorded. Is this what you needed? Thanks!
Flags: needinfo?(dzeber)
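
A small sketch of how a dump in that two-column format (FPS value, occurrence count) could be expanded back into summary statistics; the file name is hypothetical:

    import math

    def stats_from_dump(path):
        samples = []
        with open(path) as f:
            for line in f:
                fps, count = line.split()
                samples.extend([float(fps)] * int(count))
        n = len(samples)
        mean = sum(samples) / n
        stdev = math.sqrt(sum((s - mean) ** 2 for s in samples) / n)
        return n, mean, stdev

    # e.g. n, mean, stdev = stats_from_dump("settings_apz_on.txt")
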
Do we know what the hardware VSync rate is for the Buri? I would have thought 60Hz, but Ben pointed out that there was some chance it was higher. I notice there are some results above 60 in the data.
(In reply to Mason Chang [:mchang] from comment #16)
> Created attachment 8358699 [details]
> Raw Data Dump from Settings App

Thanks! Which combination of pan/zoom/tiling and which app was this for? Would you be able to post the data dump from the one you want to compare to (the before/after comparison)?
Flags: needinfo?(dzeber) → needinfo?(mchang)
Attachment #8358699 - Attachment is obsolete: true
Flags: needinfo?(mchang)
I uploaded the raw data with both APZ enabled and disabled. Is this good?
Flags: needinfo?(dzeber)
That's great - thanks.
Flags: needinfo?(dzeber)
Looking at the data, one thing that jumps out is that we're naturally bounded at/around 60 fps by the LCD refresh rate, which is predictably where most of the results lie, with only an insignificant number of results binned beyond that.

Since what we're really looking for is a smooth 60 fps scroll, I wonder if we should look at this more as a discrete probability distribution, with a probability mass function around the chance that it takes less than 16.7 ms to output a frame: either we hit the next refresh with the next frame of animation or we don't.

Maybe what we should really be measuring is the number of refresh misses per total frames, with baselines set around the maximum number of misses. 

Combined with something like an X/Y graph of "number of refreshes" vs. "number of frames" (similar to Benoit's scroll graph, but without the concept of work done per frame), that would also display jank very visibly.
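
A minimal sketch of that refresh-miss metric, assuming per-composite timestamps in milliseconds can be logged (the `frame_times_ms` input is hypothetical) and a 60 Hz panel:

    def refresh_misses(frame_times_ms, refresh_ms=1000.0 / 60.0):
        # Count frame intervals that overshot one vsync period (i.e. missed a refresh),
        # allowing a little slack for timestamp jitter.
        misses = 0
        total = 0
        for prev, cur in zip(frame_times_ms, frame_times_ms[1:]):
            total += 1
            if (cur - prev) > refresh_ms * 1.5:
                misses += 1
        return misses, total

    # e.g. misses, total = refresh_misses(frame_times_ms); miss_rate = misses / total
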
May I ask how these tests are being performed? When I last saw b2gperf, while it's good to have any number to look at, the rate at which it scrolls is too slow to really test performance. Ideally we want a variety of flings that mimic human input, including some higher-velocity flings.

It would be good to get numbers with tiling *and* progressive updates on too (layers.progressive-paint), which ought to reduce contention and thus reduce jank (and possibly increase overall framerate).
Keeping in mind this isn't our ultimate test methodology, I think the flicks to date have largely been manual. Differences in velocity are one of the reasons I suspect the test results will be noisy. When I tried this manually before, the line between "flick fast enough to max out the system" and "flick so fast that the screen doesn't register it" was pretty thin.

Eideticker, which will probably represent our final methodology, automates flicks in a repeatable fashion with Orangutan, and can flick at different velocities without the same touchscreen limitations. I agree that we need to explore a range of flick effort at that point to understand how it affects performance.

As far as what makes it into an ongoing suite, I was thinking more in terms of picking the one that reliably maxed out performance, but there's certainly an argument for "max" and "typical" as two different tests. I think it depends on how much headroom we have in terms of suite runtime, really. One-offs like this probably can use a more comprehensive range, if it turns out to be useful.
I really hate to sound like a broken record, but we should be collecting performance profiles for these runs. From a profile we can see whether the compositor is stuck in a GL call or is sleeping because the test doesn't require 60 FPS scrolling. We don't need to be guessing here.

My guess:
APZC causes bigger layers for display ports. We block the compositor for these allocations. If that's the case, I'm surprised that the APZC numbers are even as good as they are. APZC + buffer rotation is not a winning combination.

If we can confirm the above with profiles then I have solid evidence to start working on gralloc tiling ASAP.
I recommend we simply focus on APZC at this point. If it's slower at anything, let's measure that, attach a profile, and file bugs. We already know the major sources. We need gralloc tiling and gralloc allocation off the compositor. The latter is almost in the bag.
Blocks: 962687
Moving the discussion to bug 962687 so we can track issues, get profiles, and solve performance problems. Closing this bug, since its original intent was just to measure baseline FPS.
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Resolution: --- → WORKSFORME
Blocks: b2g-tiling