Open Bug 899424 Opened 6 years ago Updated 6 years ago

FloatUtils.floatFuzzyEquals is consuming more CPU time on page load than it should

Categories

(Firefox for Android :: Toolbar, defect, minor)

ARM
Android
defect
Not set
minor

Tracking

()

People

(Reporter: ckitching, Unassigned)

References

(Blocks 1 open bug)

Details

Attachments

(1 file)

Profiling has revealed that 0.3% of the CPU time involved in loading a page is spent in FloatUtils.fuzzyEquals - an entirely uninteresting function to spend so long in.

Yes, it's 0.3% - whoopdediddleydoo - I present a one-line patch reducing this to 0.1% by eliminating the extra function call to Math.abs (Which ProGuard would be unable to optimise out anyway)
Our pages will load.. Uh.. 30 milliseconds faster. Every little helps!
Assignee: nobody → ckitching
Attachment #782944 - Flags: review?
Severity: normal → minor
OS: Linux → Android
Hardware: x86_64 → ARM
Attachment #782944 - Flags: review? → review?(bugmail.mozilla)
Comment on attachment 782944 [details] [diff] [review]
fuzzyEqualsWithoutMathAbs.patch

Review of attachment 782944 [details] [diff] [review]:
-----------------------------------------------------------------

While I'm not entirely opposed to doing this, I need more justification for it before I'll agree to it. Specifically:

1) Please post the profile data that shows your results. How precise is the profile data? I'd like to be reasonably certain that the measurement difference you noticed isn't just noise, so measuring a few samples and generating a standard deviation would be nice.

2) How many times is this function called, and from where? Micro-optimizations like this are acceptable if there's no higher-level point in the call graph that can be optimized. If there is, though, we should look at that first. Reducing the number of calls to e.g. ImmutableViewportMetrics.fuzzyEquals, or even just rearranging the order of calls in that function to better take advantage of short-circuiting is preferable to me. Getting an idea of where this function is being called from and the relative counts of those call sites would be a good start.

3) Even if it turns out that this change is justified I would like to do the minimum change needed in the code itself to allow Proguard to do the rest. In this case, you're changing a readable Math.abs call into something much harder to read. I would prefer pulling out a FloatUtils.abs private function instead, which Proguard can optimize away. Failing that this code change is at least worth a comment.
Attachment #782944 - Flags: review?(bugmail.mozilla) → review-
I'm very sorry, I did an extremely shoddy job of explaining my thinking about this one.

The breakdown of calls to FloatUtils.fuzzyEquals is, roughly:

40% from RectUtils.fuzzyEquals
34% from GeckoLayerClient.progressiveUpdateCallback
9% from FloatUtils.fuzzyEquals(Point, Point)
8% from ScrollbarLayer.fade
5% from Layer$RenderContext.fuzzyEquals
1.5% from ImmutableViewportMetrics.setMargins

And a few other tiny fragments of little interest.

When we apply ProGuard to this, the situation hardly changes - the Math.abs call in FloatUtils.fuzzyEquals
can't be inlined, but also seems to cause ProGuard not to inline the body of fuzzyEquals into the callers
(Which would at least eliminate all the call overhead they're incurring calling fuzzyEquals)

Looking into RectUtils.fuzzyEquals, we come across an unrelated thing of interest:

>public static boolean fuzzyEquals(RectF a, RectF b) {
>    if (a == null && b == null)
>        return true;
>    else if ((a == null && b != null) || (a != null && b == null))
>        return false;
>    else
>        return FloatUtils.fuzzyEquals(a.top, b.top)
>            && FloatUtils.fuzzyEquals(a.left, b.left)
>            && FloatUtils.fuzzyEquals(a.right, b.right)
>            && FloatUtils.fuzzyEquals(a.bottom, b.bottom);
>}

The central if statement can be rewritten equivalently as as if (a == null || b == null). An optimisation
ProGuard plausibly can make by itself, but perhaps worth doing just for clarity.

It doesn't seem to be obvious that there's a way we can optimise this method to make less use of fuzzyEquals,
sadly. 
In fact, exploring the places that use FloatUtils.fuzzyEquals in detail does not seem to provide obvious
routes for reducing the calls to fuzzyEquals, alas. Some of these parent fuzzyEquals functions do seem to 
be being hit pretty hard, though, and most of them are doing little except calling FloatUtils.fuzzyEquals
on a bunch of different things.

Applying the optimisation, Proguard, disturbingly, seems still not to inline the FloatUtils.fuzzyEquals method
itself (Something that I'd have thought would be something that it would do...). Maybe this is just some profiler
effect, but I do not believe so - ProGuard modifies the bytecode itself, and I have successfully measured
ProGuard's improvements using the profiler.
Anyway, on to the actual numbers observed for this change. Both were also ProGuard'd (That hasn't landed yet,
but it's only a matter of time, we hope...)

With the patch (All times in ms):

Total Time in FloatUtils.fuzzyEquals    Calls   Time Per call
23.118                                  5253    .0044
16.847                                  4258    .0040
16.652                                  4303    .0039

5.432                                   1408    .0039
4.158                                   1269    .0033
4.98                                    1205    .0041

3.175                                   772     .0041
2.844                                   617     .0046
2.875                                   617     .0047

10.525                                  2780    .0038
2.84                                    630     .0045
2.438                                   634     .0038

Avg time per call: 0.0041ms
Standard deviation: 0.0004

Without the patch (But still ProGuardified)

Total Time in FloatUtils.fuzzyEquals    Calls   Time Per call
58.989                                  4901    .0120
46.52                                   4086    .0114
50.944                                  4249    .0120

19.447                                  1691    .0115
15.923                                  1281    .0124
14.699                                  1193    .0123

9.061                                   783     .0116
7.365                                   621     .0119
7.019                                   656     .0107

14.44                                   1336    .0108
7.654                                   664     .0115
7.146                                   643     .0111

Avg time per call: 0.0116ms
STDev: 0.0006

Each group of three profiles are three loads of the same page (Respectively: a new york times article, a Reddit thread, a Google search, and an xkcd comic.).
These results make it fairly clear that this change has a significant impact on the speed of this function (Call overhead of the .abs call dominates).

Adding our own version of library functions/classes which ProGuard is unable to optimise is, in general, a good plan. For example, it will likelly prove useful to make our own version of Rect, so calls to boring functions like getWidth() can be optimised away.
In this case, adding our own abs function seems like an excellent idea - additional neatness and no slower. (Although doing so eliminates all non-Proguarded performance benefit of this patch, but that's okay)

Profiles with Proguardification and our own abs function:
https://www.dropbox.com/sh/y9h9cddoinonipm/EmiPuSLEVY

Profiles with Proguardification and my patch as presented (Not really any difference):
https://www.dropbox.com/sh/0w1m0gv0zmufhio/0tLcvT22pD

Profiles with Proguardification but no patch for this problem:
https://www.dropbox.com/sh/dghb4jkvcipf3rk/rBoCc-QEsj
So I looked at the nytimes1.trace file from your dropbox on the assumption that since it was the largest trace file it would have the most data.

I found the FloatUtils.fuzzyEquals, which is reported as taking up 0.3% of the total time (as you also stated in comment 0) and the top caller of it being RectUtils.fuzzyEquals, at 42.3%, which is roughly what you had above.

If I work my way up the stack another level, to the callers of RectUtils.fuzzyEquals, it is called entirely by LayerRenderer$Frame.beginDrawing. I looked at the code for this function and the only call to fuzzyEquals in there is at [1], but it's to RenderContext.fuzzyEquals [2]. I assume the RenderContext.fuzzyEquals function is getting inlined by Proguard and so ends up calling Rect.fuzzyEquals directly on the handful of rects inside the RenderContext.

So now what I would like to do is optimize at this level, and try to reduce the number of calls to RenderContext.fuzzyEquals. Note that mLastPageContext takes on the value of mPageContext a few lines below; this condition is basically checking to see if the mPageContext changes from one frame of composition to the next. I know this part of the code pretty well (considering I wrote that line) and I think that during page load, in most cases, the mPageContext and mLastPageContext will have the same data inside. So if we can detect that fact earlier in the process we can short-circuit this fuzzyEquals check entirely.

If you go up to [3] you'll see that the mPageContext is created entirely from data in the ImmutableViewportMetrics object and some fields of LayerRenderer. (Actually it also takes an offset but with my patch for bug 892267 that can also be pulled out of the ImmutableViewportMetrics). I'm pretty sure that a lot of the time, the ImmutableViewportMetrics object passed in there will actually be the exact same object. At least during page load. So the first high-level optimization that comes to mind is to modify the createPageContext function to be more like this:

createPageContext(ImmutableViewportMetrics metrics) {
  if (mCachedMetrics == metrics) {
     return mCachedPageContext;
  }
  mCachedMetrics = metrics;
  return mCachedPageContext = createContext(...);
}

and then stick in an extra condition in the mPageContext.fuzzyEquals(mLastPageContext) that short-circuits if mPageContext == mLastPageContext. This will save both needless creation of RenderContext instances and also the fuzzyEquals calls.

That being said, the first step is always to verify the assumptions :) In this case the assumption I'm making is that the ImmutableViewportMetrics object passed in to the Frame constructor is going to be the same object a nontrivial amount of time - this can easily be verified by sticking in a few lines of code there.

Thoughts?

[1] http://hg.mozilla.org/mozilla-central/file/c2b375f3a909/mobile/android/base/gfx/LayerRenderer.java#l472
[2] http://hg.mozilla.org/mozilla-central/file/c2b375f3a909/mobile/android/base/gfx/Layer.java#l196
[3] http://hg.mozilla.org/mozilla-central/file/c2b375f3a909/mobile/android/base/gfx/LayerRenderer.java#l428
Ah! Great spot! Having such a good understanding of the code makes finding high level optimisations much easier, it seems. This is still quite unfamiliar (And sadly undocumented :( ) code to me, so I find myself doing peephole-esque optimisations that annoy reviewers.

What you propose seems at least worth investigating - but why not both have our cake and eat it, so to speak? It sounds like you've found a further optimisation - why not add it to the one I propose? (Refactoring mine to instead use our own abs function which can be Proguard-optimised instead of using the Math.abs has no effect on performance gain, but does prevent it from harming the clarity of fuzzyEquals so badly)

Or is the trick of duplicating or repacking library functions used by critial code with a view to being able to optimise them with ProGuard discouraged? If so I feel we'll be blocking quite a few less trivial optimisations, too.

I'll give your latest suggestion a proper investigation tomorrow. The amount of call overhead we're doing to, essentially, do a very simple sum, is quite depressing.
It also occurs to me that another valid approach to finding fuzzy equals would be along the lines of (boolean) ((b-a)*(10^12)). Probably more cryptic than the own-abs approach.
(In reply to Chris Kitching [:ckitching] from comment #5)
> Or is the trick of duplicating or repacking library functions used by
> critial code with a view to being able to optimise them with ProGuard
> discouraged? If so I feel we'll be blocking quite a few less trivial
> optimisations, too.

I'm ok with doing this, but only where "it makes sense". Saving 30ms on a time scale of 20 seconds for something that may not be on the critical path and does impact readability of the code, to me, doesn't really make much sense. And my thinking is that if the optimization I described above works out, then the 0.3% time spent in this function will drop to 0.2% and the benefit from the inlining optimization will be even less worth it.

I guess this is pretty hand-wavy as to what's worth it and what's not but in my experience peephole optimizations like this never make a noticeable difference and therefore are a waste of developer time, unless you can show they're in hot code that gets called millions of times. Just my two cents.
Assignee: chriskitching → nobody
You need to log in before you can comment on or make changes to this bug.