Closed
Bug 962687
Opened 11 years ago
Closed 11 years ago
[meta] Investigate APZ Performance Issues
Categories
(Core :: Graphics, defect, P2)
Tracking
RESOLVED
WORKSFORME
People
(Reporter: mchang, Assigned: mchang)
References
Details
(Keywords: perf, Whiteboard: [c=profiling p= s=2014.08.15 u=])
Attachments
(3 files)
2.19 KB, patch
3.00 KB, application/force-download
27.37 KB, application/force-download
Investigate root causes of APZ performance issues and fix them.
Comment 1•11 years ago
I looked at this yesterday and have already begun speaking with Cwiiis about it. Cwiiis is filing a bug with our requirements for tiling, which will be one of the major prerequisites for this work. We will also want to make other tweaks to address compositor slowdowns, such as bug 939348.
Depends on: 939348
Assignee
Comment 2•11 years ago
Some profiles are up for the E-Mail app on v1.3 in bug 962699. The original profile shows 5x layer drawing; I removed one layer by giving the scrollable content an opaque background.
Comment 3•11 years ago
I've taken a look at the data Mason posted in bug 950841.
The main thing I noticed is that in both cases, the data appear to be bimodal. The upper part of the distribution is quite concentrated around 58-59 fps, while the lower part is much more spread out and centred around 45-50 fps. This suggested that the frame rate measurement is determined by (at least) 2 competing underlying processes. I've attached a plot showing the distributions and the mean and median.
For these types of distributions, doing t-tests doesn't really tell you much, because a t-test uses only the mean to summarize all of the information contained in the distribution. From the plot the two cases look quite different, but these types of differences are not captured by the mean, and the t-test comes back as not significant. If the distributions for the other tests are similar, this would explain why a number of the t-test results appear not to be significant. The t-test is correct in what it measures, but it is not able to pick up on the more complicated types of differences that exist between the distributions.
I tried fitting a mixture of 2 normal distributions to each case, and the fit is decent (see attached plot). For the "with APZ" case, we get (mean = 48.08, sd = 8.79) for the lower group and (mean = 57.86, sd = 2.16) for the upper group. For the "without APZ" case, we have (mean = 46.25, sd = 8.84) for the lower group and (mean = 58.77, sd = 1.47) for the upper group. In both cases, the mixing proportions are the same, attributing 40% of the observations to the lower group and 60% to the upper group.
I find that the means of the upper groups are different between the two cases, but that the means of the lower groups are not significantly different. Notice also that the spread of the lower group remains the same between the two cases, but the spread of the upper group is lower for "without APZ". Hence the difference due to APZ is affecting the upper group but not the lower group.
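For reference, a minimal sketch (not part of the original analysis) of how a two-component normal mixture fit like the one described above could be reproduced in Python, assuming scikit-learn is available; the fps_values array and the input file name are hypothetical stand-ins for the raw per-sample FPS data.

# Sketch: fit a 2-component Gaussian mixture to raw FPS samples.
# "fps_samples.txt" is a hypothetical file with one FPS value per line.
import numpy as np
from sklearn.mixture import GaussianMixture

fps_values = np.loadtxt("fps_samples.txt")
gmm = GaussianMixture(n_components=2, random_state=0)
gmm.fit(fps_values.reshape(-1, 1))          # sklearn expects a 2-D array

for weight, mean, cov in zip(gmm.weights_, gmm.means_, gmm.covariances_):
    print("proportion=%.2f  mean=%.2f  sd=%.2f"
          % (weight, mean[0], np.sqrt(cov[0][0])))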
Comment 4•11 years ago
Comment 5•11 years ago
Comment 6•11 years ago
(In reply to dzeber from comment #5)
> Created attachment 8365134 [details]
> Plot of mixture fits
Just to clarify, the blue curve is the fitted normal for the lower group, and the red line is the fitted normal for the upper group. The black dashed line is the fitted mixture distribution.
I don't know enough about the testing process to know what's causing the grouping, but caching was mentioned in bug 950841. If this is affected by the order of the test runs (eg later replicates tend to be higher than earlier ones), this might be accounted for by recording the order of the FPS observations.
The upper bound forcing that grouping is hardware-based; the LCD (almost certainly) refreshes at 60 Hz. A "good" test result would look like almost everything at 60 with only a few values straying lower. The fewer strays, the better the result.
I don't know what's causing the lower bound. However, with what I said above in mind, I don't think either the median or the mean is actually what we want to know. We want to know how many strays there are.
Repeating what I said in https://bugzilla.mozilla.org/show_bug.cgi?id=950841#c22. The discussion got reined in there, but I'd be very interested in comments here:
Looking at the data, one thing that jumps out is that we're naturally bounded at/around 60fps by the LCD refresh rate, which is predictably where most of the results lie with only insignificant binning afterwards.
Since what we're really looking for is a smooth 60fps scroll I wonder if we should look at this more as a discrete probability distribution with a probability mass function around the chance that it takes less than 16.7ms to output a frame: either we hit the next refresh with the next frame of animation or not.
Maybe what we should really be measuring is the number of refresh misses per total frames, with baselines set around the maximum number of misses.
Combined with something like an X/Y graph of "number of refreshes" vs. "number of frames" (similar to Benoit's scroll graph, but without the concept of work done per frame) that would also display jank very visibly.
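As a rough illustration only (not existing test code), counting refresh misses from per-frame timestamps could look like the Python sketch below; the timestamps are hypothetical and a fixed 60 Hz refresh is assumed.

# Sketch: count refresh misses, i.e. frame intervals that overran the
# ~16.7 ms budget of a 60 Hz display. The example timestamps are made up.
REFRESH_MS = 1000.0 / 60.0

def count_misses(frame_times_ms):
    """Return (misses, total_intervals) for a list of frame timestamps."""
    intervals = [b - a for a, b in zip(frame_times_ms, frame_times_ms[1:])]
    misses = sum(1 for dt in intervals if dt > REFRESH_MS)
    return misses, len(intervals)

# Example: the third interval (33.6 ms) missed a refresh.
misses, total = count_misses([0.0, 16.6, 33.2, 66.8, 83.4])
print("%d/%d refresh misses" % (misses, total))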
Edit for this bug: I know there's a goal to get scrollgraph working, but my concern is that it'll only work in cases where we can measure work done per refresh (how many pixels something moved). That's very appropriate for pan/scroll, but we might have other animations we want to measure that don't involve moving something recognizable across the screen.
For more general animation testing, I think my proposal would be more versatile. We could do both and apply them as appropriate; the basic dependency of detecting something happening per refresh would serve both. It's really the only dependency of this proposal, so we might get it stood up as a milestone.
I should correct myself a little bit. I think taking the median or mean of the number of refresh misses might be a great way to aggregate that. I don't think taking those stats across instantaneous fps is as valuable.
Among other things, there's no such thing as truly instantaneous fps (it has to be measured across several refreshes), so I think those results are likely dependent because each one drags in history: if the last value was low, I'd expect the next to be low too, just rising or falling gradually. You don't change fps on a dime. Frame misses, by contrast, are instantaneous data.
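To illustrate that dependence point (a hypothetical sketch, not harness code): consecutive windowed FPS values share most of their frames, while per-interval miss flags each depend on only a single interval.

# Sketch: a windowed "FPS" value reuses most of the frames from the
# previous window, so consecutive values are strongly correlated.
def sliding_fps(frame_times_ms, window=6):
    """Approximate FPS over each span of 'window' consecutive frames."""
    out = []
    for i in range(len(frame_times_ms) - window + 1):
        span_ms = frame_times_ms[i + window - 1] - frame_times_ms[i]
        out.append(1000.0 * (window - 1) / span_ms)
    return out

def miss_flags(frame_times_ms, budget_ms=1000.0 / 60.0):
    """1 if an interval overran the refresh budget, else 0."""
    return [1 if b - a > budget_ms else 0
            for a, b in zip(frame_times_ms, frame_times_ms[1:])]

# e.g. sliding_fps(times)[k] and sliding_fps(times)[k+1] share window-1
# frames, whereas miss_flags(times)[k] depends only on the k-th interval.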
(In reply to dzeber from comment #6)
> I don't know enough about the testing process to know what's causing the
> grouping,
One thing came out in the meeting that I think was unclear. I said it there, but I'm repeating it here for others watching the bug.
These aren't 700 (or 1000 on recent runs, I think) repeated measurements of the same thing. They're 700/1000 instantaneous fps readings taken across a single scrolling scenario. That's what I meant in comment 8 about the measurements probably being dependent: fps #30 will include maybe frames 25-30, whereas fps #31 will include maybe frames 26-31. They overlap.
But that lack of repetition is the other reason I think trend data is misleading here: any glitch (whether in measurement or not) affects the entire run and won't get "ironed out" by repetition. Noise will always stack across the test.
Comment 10•11 years ago
To follow up on the discussion in last week's meeting and comments in the bugs, it seems like the underlying question is how best to draw valid conclusions and reliably detect the problems we are interested in. Given the data we have, we can look for the method that makes the best use of it. However, it is also very important to make sure we are collecting data in the best possible way at the source, as this upper-bounds the effectiveness of any analysis we try. I wanted to put down some thoughts about this (sorry it's a bit long).
- One thing is to collect data that is as unaggregated as possible. We want measurements at the most basic level, but ones that still give a complete description of the process. This is what ensures the framework is versatile and extensible. If the data we collect are already aggregated, we are limiting the analyses available, and we are also limiting our ability to check the assumptions that determine whether an analysis will work as expected. This is important because, in my experience with tech/web data, the typical assumptions are often not met. If the data are unaggregated, we can always aggregate afterwards, and we keep the ability to apply different methods to answer different questions later on.
For example, if we only report the means and standard deviations, we can do a t-test, but we can't tell whether or not to trust the results. If we get the FPS values as frequency counts, we can look at the distribution as I have done (and notice the bimodality), but we can't refine it further. If we know the individual FPS values with timestamps, we might be able to explain away some of the bimodality as a time effect, making our final comparison even more reliable.
- Another factor in data collection is to make sure the observations are as independent as possible. Our goal is to limit and explain the noise, and it's easier if very little noise is due to any carry-over between measurements. As Geo mentions in comment 9, the FPS values probably do have such a carry-over effect.
- Some of the noise can also be explained by external factors like available system resources. This was mentioned in https://bugzilla.mozilla.org/show_bug.cgi?id=950841#c25. It would be very useful to collect this information as well, so that we can explain observations that seem not to fit the trend (eg. the lower mode in the bimodal distribution). If I'm comparing two test runs with the current data, I'd be inclined to ignore the measurements from the lower mode, because I don't know if they are caused by the same thing across test runs. If I can explain them using these external factors, then I can use the complete data.
- Finally, it's important to know the exact details of the test procedure to know how to compare measurements from different test runs. For example, scrolling involves variable work, and if the tests are run manually, the amount of work will be different each time. To make test outcomes most comparable, it would be best to automate the tests (so that they run exactly the same each time), and record what is being done at each time interval.
Geo's proposal addresses many of these issues. For each frame, measure the time between consecutive frames (or just timestamps), and for each refresh, the timestamp and whether or not it was a miss. Individual frames should be reasonably independent (assuming we also have information about system resources and the test procedure), and from these measurements we can recover an approximate local FPS rate or other summaries. We can also look at proportions of misses, we can make the plot Geo suggests in comment 7, and we can try other analyses.
I'm sure there are technical and other constraints to weigh these points against. However, collecting more information organized in a useful way is generally the most versatile approach.
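As a sketch only (the log format and field names below are assumptions, not an existing harness), one unaggregated row per refresh, with aggregation deferred until after collection, would support all of the analyses mentioned above.

# Sketch: derive a miss rate and a median local FPS from a hypothetical
# per-refresh CSV log with columns timestamp_ms, frame_presented, free_mem_kb.
import csv
from statistics import median

def summarize(log_path):
    with open(log_path) as f:
        rows = list(csv.DictReader(f))
    misses = sum(1 for r in rows if r["frame_presented"] == "0")
    times = [float(r["timestamp_ms"]) for r in rows if r["frame_presented"] == "1"]
    local_fps = [1000.0 / (b - a) for a, b in zip(times, times[1:]) if b > a]
    return {"miss_rate": misses / len(rows),
            "median_fps": median(local_fps) if local_fps else None}

Keeping the raw rows around means other summaries, such as the refresh-vs-frame plot from comment 7, can be produced later from the same data.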
Assignee
Comment 11•11 years ago
Resolving, as APZ seems to be pretty stable now.
Status: ASSIGNED → RESOLVED
Closed: 11 years ago
Resolution: --- → WORKSFORME
Updated•11 years ago
Whiteboard: [c=profiling p=3 s= u=] → [c=profiling p= s=2014.08.15 u=]