Closed Bug 633627 Opened 13 years ago Closed 13 years ago

Optimize gfxAlphaBoxBlur::Paint on ARM

Categories

(Core :: Graphics, defect)

Hardware: ARM
OS: All
Type: defect
Priority: Not set
Severity: normal

Tracking

RESOLVED FIXED
mozilla9
Tracking Status
firefox9: fixed
fennec: 9+

People

(Reporter: stechz, Unassigned)

References

Details

(Keywords: mobile, perf)

Attachments

(5 files, 15 obsolete files)

3.21 KB, text/html
3.34 KB, patch
3.11 KB, patch (joe: review+)
4.01 KB, patch (jrmuizel: review+)
2.10 KB, patch (roc: review+)
This takes significant time while waiting for content to refresh on Fennec. Doing an early return speeds up rendering time significantly.
Assignee: nobody → blassey.bugs
tracking-fennec: --- → ?
Blocks: 459117
Keywords: mobile, perf
See bug 509052. Maybe this should be a duplicate of that. In any case, the suggestions and comments in that bug apply here.

The first thing we need is a good performance testcase so we can measure the effects of optimizations in a reliable way.
Brad Lassey was working on ARM/NEON optimizations for this code.  Let's use this bug to track ARM-specific optimizations, and make it a dependency of bug 509052.
Blocks: 509052
Summary: Optimize gfxAlphaBoxBlur::Paint → Optimize gfxAlphaBoxBlur::Paint on ARM
We should make sure to fix the core algorithm before NEON optimizing the existing code
(In reply to comment #3)
> We should make sure to fix the core algorithm before NEON optimizing the
> existing code

If you can point out what's broken about it, I can pick this back up. Otherwise someone who knows this code better should take it.
Assignee: blassey.bugs → nobody
Probably referring to bug 509052 comment 3.
See comment #1.

We need a testcase here. Once we have a testcase we can discuss what should be done to improve it.
Stuart asked me to post some example sites that use box shadows:
* gmail.com for dropdowns
* reader.google.com (interactive) for dropdowns
* flickr.com for text shadow
apple.com also uses text shadow.
Note that we can also paint shadows in a lot of cases using gradients. And we'll get any optimizations in the cairo backends. (This will help other platforms as well.)

In general I'm guessing the gradient drawing code should be faster than blurring code.
(In reply to comment #9)
> Note that we can also paint shadows in a lot of cases using gradients. And we'll
> get any optimizations in the cairo backends. (This will help other platforms as
> well.)
>
> In general I'm guessing the gradient drawing code should be faster than
> blurring code.

Maybe so but it's unclear it'd be a big win on ARM.

(In reply to comment #7)
> Stuart asked me to post some example sites that use box shadows:
> * gmail.com for dropdowns
> * reader.google.com (interactive) for dropdowns
> * flickr.com for text shadow

We need testcases, not sites here. Or are you saying that every site that uses any kind of blurred shadow is slow in a measurable way on mobile?

Note that any optimizations needed for blurred text-shadow will be completely different than for box-shadow. That's why we need specific testcases to focus on instead of just trying to "make everything fast".
Nice to have, but not blocking. If there is data to support blocking, we'd love to see it.
tracking-fennec: ? → 2.0-
Whiteboard: [fennec-4.1?]
(In reply to comment #2)
> Brad Lassey was working on ARM/NEON optimizations for this code.

Brad, did you write any code for this?

(In reply to comment #4)
> (In reply to comment #3)
> > We should make sure to fix the core algorithm before NEON optimizing the
> > existing code
> 
> If you can point out what's broken about it, I can pick this back up. Otherwise
> someone who knows this code better should take it.

What's wrong with the C implementation? Does it not implement what the CSS specification recommends as an approximation of a Gaussian blur?

(In reply to comment #10)
> > In general I'm guessing the gradient drawing code should be faster than
> > blurring code.
> 
> Maybe so but it's unclear it'd be a big win on ARM.

Gradient drawing will almost certainly be faster than a true blur. However, it doesn't work for text shadows, and things might get complicated at curved corners and the like.

> Note that any optimizations needed for blurred text-shadow will be completely
> different than for box-shadow. That's why we need specific testcases to focus
> on instead of just trying to "make everything fast".

If we use a Gaussian blur approximation, it should apply equally to text and box shadows. (If I understand the CSS specification correctly, they use the same algorithm.) Box shadows, of course, are rather more regularly-shaped than text shadows, so we can make some special-case optimizations such as the "aSkipRect" in the current C implementation, or gradient generation for simple rectangles (as proposed above).
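(For reference, the approximation the SVG spec recommends for feGaussianBlur, which I believe is what CSS shadows inherit, is three successive box blurs of size

    d = floor(s * 3 * sqrt(2 * pi) / 4 + 0.5)

where s is the Gaussian standard deviation.)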

----

I think I can pick this up, if no-one is already working on it, but I'd like to understand what the correct output _should_ be so I can get a known-valid (if slow) reference implementation.
I was planning on working on bug 648449 which is basically to implement the third paragraph of bug 509052, comment 8 in order to optimize box-shadow.
Just to be clear, that optimization will hopefully make any ARM-specific optimizations unnecessary.
(In reply to comment #14)
> Just to be clear, that optimization will hopefully make any ARM-specific
> optimizations unnecessary.

What about text shadows? Are they worth optimizing?
(In reply to comment #15)
> (In reply to comment #14)
> > Just to be clear, that optimization will hopefully make any ARM-specific
> > optimizations unnecessary.
> 
> What about text shadows? Are they worth optimizing?

I hope so. Lots of desktop sites employ text shadow.
Yeah, I guess you're right, it's worth having ARM code for text-shadow. Just make sure you run your performance tests on text-shadow :-)
tracking-fennec: - → 7+
Whiteboard: [fennec-4.1?]
I can try to write some NEON code for text-shadow. Do we want to do it in this bug or a new one?
Doing it in this bug makes sense.
(In reply to comment #18)
> I can try to write some NEON code for text-shadow. Do we want to do it in
> this bug or a new one?

Excellent!

You might be interested to learn that Szeged University recently contributed some NEON blur code to WebKit: http://trac.webkit.org/browser/trunk/Source/WebCore/platform/graphics/filters/arm

They've got an RGB blur (maybe RGBA) and also an alpha-only blur, for shadow effects and suchlike.
Can someone please point me to the relevant code for text-shadow?
gfxBlur.cpp. It gets used for text-shadow, box-shadow and canvas.
Thanks, roc and tn (who helped on irc).

I talked with stechz about this, and we have some ideas for how to optimize it. There is one question though, about the skip rect. Is it just an optimization - are we allowed to write to those pixels, if say the skip rect is very small (and we think it would be faster that way)?
Attached patch Factor out skip rect (obsolete)
Comment on attachment 540792 [details] [diff] [review]
Factor out skip rect

For anyone following along, Alon and I figured the first step to optimizing this algorithm is to simplify all the critical loops so that they'd be easier to rewrite with NEON. To that end, I've refactored the loops so that we no longer store a rectangle to skip, but instead store the strips we need to paint.

As it turns out, when you subtract a rectangle from another rectangle there are two optimal ways to represent this space: as a set of horizontal strips and as a set of vertical strips. This is convenient for us, because for this simplified box blur algorithm, the horizontal blur needs horizontal strips and the vertical blur needs vertical strips for the bounds to be correct.
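For the record, the decomposition for the horizontal case looks roughly like this (a sketch with made-up names, not the patch itself; the vertical case is symmetric with x and y swapped, and the skip rect is assumed to be contained in the bounds):

#include <vector>

// Sketch only: an nsIntRect-like type for illustration.
struct Rect {
    int x, y, width, height;
    Rect(int aX, int aY, int aW, int aH)
        : x(aX), y(aY), width(aW), height(aH) {}
};

// Subtract 'skip' from 'outer', returning the remainder as horizontal
// strips: a full-width band above, left/right pieces beside, and a
// full-width band below the skip rect.
std::vector<Rect> SubtractAsHorizontalStrips(const Rect& outer, const Rect& skip)
{
    std::vector<Rect> strips;
    int skipBottom = skip.y + skip.height;
    int skipRight = skip.x + skip.width;
    int outerBottom = outer.y + outer.height;
    int outerRight = outer.x + outer.width;

    if (skip.y > outer.y)         // band above the skip rect
        strips.push_back(Rect(outer.x, outer.y, outer.width, skip.y - outer.y));
    if (skip.x > outer.x)         // piece left of the skip rect
        strips.push_back(Rect(outer.x, skip.y, skip.x - outer.x, skip.height));
    if (skipRight < outerRight)   // piece right of the skip rect
        strips.push_back(Rect(skipRight, skip.y, outerRight - skipRight, skip.height));
    if (skipBottom < outerBottom) // band below the skip rect
        strips.push_back(Rect(outer.x, skipBottom, outer.width, outerBottom - skipBottom));
    return strips;
}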

Right now, I seem to have some sort of offset bug. I have a shadow benchmarking program that causes Firefox to crash rather quickly, and sometimes the shadows look incorrect.
Attachment #540792 - Flags: feedback?(azakai)
Attached patch Factor out skip rect (obsolete)
Attachment #540792 - Attachment is obsolete: true
Attachment #540792 - Flags: feedback?(azakai)
Comment on attachment 540802 [details] [diff] [review]
Factor out skip rect

Oops, outside callers depend on that top left device context change.

After that, if I disable skip rect functionality altogether everything is fine, so it must be something in my rect difference code.
Attachment #540802 - Flags: feedback?(azakai)
Attached patch Factor out skip rect (obsolete)
Not seeing any more rendering issues. Thanks for the debugging help, Alon!

Using my benchmark on my Nexus S:
w/o patch: 11606 frames over 500 seconds (23.212 FPS)
w/  patch: 19951 frames over 500 seconds (39.902 FPS)

I'll upload my benchmark after this; to be clear, I'm not confident we should
be making any decisions based on it. If it is benchmarking shadows properly,
then it's quite insightful to me. Simply by removing logic from the critical
loops, we see quite a significant speedup. I assume this is all thanks to
better instruction pipelining.

One concern is that we are already quite fast at this benchmark, as opposed to
what I believed earlier. According to this benchmark, shadows do not seem to be
a bottleneck for repaint speed.
Attachment #540802 - Attachment is obsolete: true
Attachment #540802 - Flags: feedback?(azakai)
Attached file Shadow Benchmark
Aha. If I bump the radius possibilities up to 100 or 200, then I begin to see 2-5 frames per second.
Attachment #540874 - Attachment is obsolete: true
I created a modified benchmark that does a paragraph of shadowed text:
w/o patch: 1287 frames over 500 seconds (2.574 FPS)
w/ new patch: 1326 frames over 500 seconds (2.652 FPS)

The benchmark is currently on my home directory here: http://people.mozilla.org/~bstover/test3.html

This seems like a better benchmark to me. We could argue about whether lots of websites use paragraphs of shadow text, but I'm not sure any of us have a perfect idea of what's being used out there.
I suspect this patch (applied on top of stechz's) might give some additional speedup. Instead of doing 3 horizontal blurs one after the other on the entire area, it does the 3 iterations on each row. This should be more cache-friendly.

Initial results of stechz's benchmark:

With this patch  : 31.812 fps
Before this patch: 19.854 fps
Note that I haven't fully checked the patch for correctness yet.
With my benchmark from comment 32:
Alon's new patch: 1405 frames over 500 seconds (2.81 FPS)
I don't see a good way to use NEON on this problem. We use the three-pass box filter to approximate a Gaussian, and it doesn't match up to NEON instructions in a natural way that I can see. I hope I am missing something here, please let me know.

Meanwhile I did some more benchmarking with the two patches we have so far. I don't see a benefit to the second patch anymore, so perhaps my first results were just noise?
(In reply to comment #36)
> I don't see a good way to use neon on this problem. We use the three-pass
> box filter to approximate a gaussian, and it doesn't match up to neon
> instructions in a natural way that I can see. I hope I am missing something
> here, please let me know.

Szeged saw some improvement in WebKit with their three-pass-per-direction approach. It wasn't as much improvement as we'd normally see from a NEON routine, but it was still considerably faster than the C code.

I was originally thinking about implementing a kernel-based filter. This is more computationally expensive than the box-filter method, but requires only two passes in total. (One horizontal, one vertical.) The idea I had was to do something like this:

  * Generate a one-dimensional Gaussian kernel. This can be done in C code for simplicity. We can fairly easily cache kernels that are generated at run-time, as blur radii are likely to get re-used.
  * Run the lines through the kernel, and when we store the result back to memory, flip the X and Y axes.
  * The same routine can be run a second time to process the vertical lines (which are now sequential in memory). The X-Y flip will put the image back in the right orientation.

I never got as far as prototyping it, but that method should minimize the number of passes, it should be as cache-friendly as possible and it's also probably quite space-efficient. Its success relies on the weighted-average implementation being faster than the additional memory accesses in the box-filter method.

There is added complexity if you want to handle large or arbitrarily-sized kernels because you'd need modified NEON routines to handle them (or some slow, generic code), but I think you'd have the same complexity in a box filter too.
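For illustration, a rough C++ prototype of one such pass might look like this (an untested sketch; the NEON version would vectorize the inner loops, and the kernel weights would likely be fixed-point rather than float):

#include <algorithm>
#include <cstdint>
#include <vector>

// Convolve each row of 'src' with a 1-D kernel, writing the result
// transposed so that a second call over the (now contiguous) columns both
// blurs vertically and restores the original orientation.
static void BlurAndTranspose(const uint8_t* src, uint8_t* dst,
                             int width, int height,
                             const std::vector<float>& kernel)
{
    int half = (int)kernel.size() / 2;
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            float sum = 0;
            for (int k = 0; k < (int)kernel.size(); ++k) {
                int sx = std::min(std::max(x + k - half, 0), width - 1);
                sum += kernel[k] * src[y * width + sx];
            }
            dst[x * height + y] = (uint8_t)(sum + 0.5f); // transposed store
        }
    }
}

// Usage:
//   BlurAndTranspose(src, tmp, w, h, kernel); // horizontal pass, tmp is h*w
//   BlurAndTranspose(tmp, dst, h, w, kernel); // vertical pass, back to w*h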
Getting rid of division (replacing it with multiplications & shifts) in the inner loop is quite important for improving performance on any architecture, and especially on ARM (because there is no hardware division instruction in the current mainstream ARM processors). A similar optimization was discussed in another bug:
    https://bugzilla.mozilla.org/show_bug.cgi?id=659725#c40
I agree, it definitely looks like removing division is crucial here.

jchen has a patch that replaces it with fixed-point multiplication, should be up soon.

I have another 2 ideas for optimization here, but before writing I would like some feedback as to whether they are worthwhile.

So, right now we allocate two buffers of the size of the blur. In a horizontal blur, we do one pass from 1 to 2, then 2 to 1, then 1 to 2, for a total of three passes. However the rows are independent, so we could instead allocate two small buffers of a single row, and go from the main buffer to temp1, then temp1 to temp2, and finally temp2 to the main buffer (so we finish a single row's three passes before going on to the next). This would save some memory, usually very little, but if a box the size of the screen has a blurred border, we could save, say, 1MB.

A second thing we can then do, is to avoid 2/3 of the divides/fixed-point multiplies+shift, at least in most cases. We can make the two temp buffers 32-bit integers, and just add - no division at all. So the first initial value is max 255, then after one pass it is 255*blurWidth, then 255*blurWidth^2, and during the last calculation we get a value of max 255*blurWidth^3 before dividing. So if blurWidth <= 256, this will fit in a 32-bit integer. So we end up doing a single division/fixed-point multiply+shift instead of three. (If blurWidth > 256 (which is probably very rare), we would need to fall back to the normal 3 divisions.)
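In sketch form, for a single row (made-up names; assumes blurWidth <= 256 so that 255 * blurWidth^3 fits in 32 bits, and reduces edge handling to simple clamping):

#include <algorithm>
#include <stdint.h>

static void BoxPass(const uint32_t* in, uint32_t* out, int width, int blurWidth)
{
    int half = blurWidth / 2;
    for (int x = 0; x < width; ++x) {
        uint32_t sum = 0;
        for (int i = -half; i < blurWidth - half; ++i)
            sum += in[std::min(std::max(x + i, 0), width - 1)];
        out[x] = sum; // raw sum, no division
    }
}

static void BlurRowThreePasses(const uint8_t* src, uint8_t* dst,
                               uint32_t* tmp1, uint32_t* tmp2,
                               int width, int blurWidth)
{
    for (int x = 0; x < width; ++x)
        tmp1[x] = src[x];                  // widen to 32 bits, max 255
    BoxPass(tmp1, tmp2, width, blurWidth); // max 255 * blurWidth
    BoxPass(tmp2, tmp1, width, blurWidth); // max 255 * blurWidth^2
    BoxPass(tmp1, tmp2, width, blurWidth); // max 255 * blurWidth^3
    uint32_t divisor = (uint32_t)blurWidth * blurWidth * blurWidth;
    for (int x = 0; x < width; ++x)
        dst[x] = (uint8_t)(tmp2[x] / divisor); // single divide per pixel
}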
Would it make sense to restrict blur radii to powers of 2 such that it's always just a shift?
Attached patch Patch to optimize divisions (obsolete)
As a proof-of-concept, this patch replaces divisions with multiplications. It works by calculating the reciprocal of the divisor outside of loops, and switching to multiplication inside the loops. The 'smmul' instruction calculates d = (s1 * s2) >> 32.
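In C terms, the transformation is roughly this (illustrative names; the real loops are in BoxBlurHorizontal/BoxBlurVertical in gfxBlur.cpp):

#include <stdint.h>

static void DivideRowByBoxSize(const uint32_t* aAlphaSums, uint8_t* aOutput,
                               int aWidth, uint32_t aBoxSize)
{
    // Hoisted out of the loop: floor(2^32 / aBoxSize). Note that
    // aBoxSize == 1 overflows a 32-bit reciprocal (see the boxSize == 1
    // discussion below).
    uint32_t reciprocal = (uint32_t)((1ULL << 32) / aBoxSize);
    for (int x = 0; x < aWidth; ++x) {
        // Equivalent to aAlphaSums[x] / aBoxSize up to rounding; the high
        // half of the 64-bit product is what smmul computes directly.
        aOutput[x] = (uint8_t)(((uint64_t)aAlphaSums[x] * reciprocal) >> 32);
    }
}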

Using stechz's benchmark on an HTC Vision,
Unoptimized: 14.664 fps
Optimized:   21.186 fps

oprofile report (unoptimized):
CPU: ARM Cortex-A8, speed 0 MHz (estimated)
Counted CPU_CYCLES events (Number of CPU cycles) with a unit mask of 0x00 (No unit mask) count 100000
samples  %        app name                 symbol name
12265    20.3579  libmozalloc.so (deleted) __aeabi_idiv
5281      8.7656  libxul.so (deleted)      bits_image_fetch_bilinear_affine_none_a8
5019      8.3307  libxul.so (deleted)      BoxBlurVertical(unsigned char*, unsigned char*, int, int, int, int, nsIntRect const&)
4990      8.2826  libmozalloc.so (deleted) __aeabi_uidivmod
4766      7.9108  libxul.so (deleted)      pixman_composite_src_8888_8888_asm_neon
4378      7.2668  libxul.so (deleted)      BoxBlurHorizontal(unsigned char*, unsigned char*, int, int, int, int, nsIntRect const&)
3812      6.3273  libxul.so (deleted)      pixman_composite_over_8888_0565_asm_neon
3199      5.3098  libc.so                  /system/lib/libc.so
2599      4.3139  vmlinux                  /mnt/sdcard/vmlinux
1660      2.7553  libxul.so (deleted)      pixman_composite_src_n_8888_asm_neon
1260      2.0914  libxul.so (deleted)      pixman_composite_src_0565_0565_asm_neon
960       1.5934  libxul.so (deleted)      pixman_composite_over_n_8_8888_asm_neon
896       1.4872  libxul.so (deleted)      pixman_composite_src_n_0565_asm_neon
880       1.4607  libxul.so (deleted)      fast_composite_in_n_8_8

oprofile report (optimized):
CPU: ARM Cortex-A8, speed 0 MHz (estimated)
Counted CPU_CYCLES events (Number of CPU cycles) with a unit mask of 0x00 (No unit mask) count 100000
samples  %        app name                 symbol name
7722     12.8358  libxul.so (deleted)      bits_image_fetch_bilinear_affine_none_a8
7247     12.0462  libxul.so (deleted)      pixman_composite_src_8888_8888_asm_neon
6485     10.7796  libxul.so (deleted)      BoxBlurVertical(unsigned char*, unsigned char*, int, int, int, int, nsIntRect const&)
5611      9.3268  libxul.so (deleted)      pixman_composite_over_8888_0565_asm_neon
5473      9.0974  libxul.so (deleted)      BoxBlurHorizontal(unsigned char*, unsigned char*, int, int, int, int, nsIntRect const&)
4688      7.7926  libc.so                  /system/lib/libc.so
1751      2.9106  libxul.so (deleted)      pixman_composite_src_0565_0565_asm_neon
1559      2.5914  libxul.so (deleted)      pixman_composite_src_n_8888_asm_neon
1425      2.3687  libxul.so (deleted)      pixman_composite_over_n_8_8888_asm_neon
1257      2.0894  libxul.so (deleted)      fast_composite_in_n_8_8
1222      2.0312  libxul.so (deleted)      pixman_composite_src_n_0565_asm_neon

The calls to the division helper functions are completely gone in the optimized profile.
(In reply to comment #42)
> Created attachment 541769 [details] [diff] [review] [review]
> Patch to optimize divisions
> 
> As a proof-of-concept, this patch replaces divisions with multiplications.
> It works by calculating the reciprocal of the divisor outside of loops, and
> switching to multiplication inside the loops. The 'smmul' instruction
> calculates d = (s1 * s2) >> 32.

Nice. This is surely going to be faster, and your benchmark results confirm it.

Regarding your patch, there is a problem with the use of SMMUL instruction. It is only supported starting from ARMv6 and it is not compatible with thumb1, so the "#if defined(__GNUC__) && (defined(__arm__) || defined(__thumb__))" check around it is not quite correct. Also compiling a small test code snippet with gcc 4.5 and clang 2.9 provides the following results:

$ cat test.c
#include <stdint.h>

int32_t smmul(int32_t a, int32_t b)
{
    return ((int64_t)a * b) >> 32;
}

$ gcc -c -mcpu=cortex-a8 -O2 test.c
$ objdump -d test.o

00000000 <smmul>:
   0:   e0c10091        smull   r0, r1, r1, r0
   4:   e1a00001        mov     r0, r1
   8:   e12fff1e        bx      lr

$ clang -c -mcpu=cortex-a8 -O2 test.c
$ objdump -d test.o

00000000 <smmul>:
   0:   e750f011        smmul   r0, r1, r0
   4:   e12fff1e        bx      lr

$ clang -c -O2 test.c
$ objdump -d test.o

00000000 <smmul>:
   0:   e0c32091        smull   r2, r3, r1, r0
   4:   e1a00003        mov     r0, r3
   8:   e12fff1e        bx      lr

So in the worst case we get the SMULL instruction (with gcc 4.3 and 4.4 too); at least some of the compilers can generate SMMUL when given the right -mcpu= option. SMULL is just one cycle slower than SMMUL, which probably is not going to make a really big difference for this particular loop unless it is completely converted to assembly and unrolled.



Also this

> 7722     12.8358  libxul.so (deleted)     
> bits_image_fetch_bilinear_affine_none_a8

and this

> 1257      2.0894  libxul.so (deleted)      fast_composite_in_n_8_8

is bad, because NEON optimizations apparently are missing for such operations in pixman, and this also has some impact on the performance of this test. These are separate performance bugs though.
Also submitted this SMMUL related missed optimization issue to gcc bugzilla: http://gcc.gnu.org/bugzilla/show_bug.cgi?id=49526
(In reply to comment #41)
> Would it make sense to restrict blur radii to powers of 2 such that it's
> always just a shift?

No :-)
Attachment #541255 - Attachment is obsolete: true
Attachment #540927 - Attachment description: Factor out skip rect and eliminate half of the build-up alpha loop → Part 1: Factor out skip rect and eliminate half of the build-up alpha loop
I am writing a patch to do fixed-point division in C code rather than asm. This is much faster on ARM, but it turns out to be 2% slower on desktop x86. So we want this to be ARM-specific, I assume.

What's the proper way to ifdef this? On defined(__arm__)?
It would be good to understand why the code C generates is faster too! :)
Ignore my previous comment, the 2% was within the margin of noise for this benchmark, turns out.
(In reply to comment #47)
> It would be good to understand why the code C generates is faster too! :)

The compiler does not know anything about the latency of the instruction injected by inline assembly. So this disrupts instruction scheduling a bit and may even turn out to be a performance loss.
(In reply to comment #49)
> (In reply to comment #47)
> > It would be good to understand why the code C generates is faster too! :)
> 
> The compiler does not know anything about the latency of the instruction
> injected by inline assembly. So this disrupts instruction scheduling a bit
> and may even turn out to be a performance loss.

So is that what's happening? Is there significantly different instruction scheduling happening with the code generated from C compared to direct assembly?
The inline assembly was all about a tiny saving of 1 cycle (SMMUL vs. SMULL). Slightly worse instruction scheduling by the compiler can also easily cause a loss of 1 cycle or more. But I expect that any differences are going to be hardly noticeable, and my guess is that the originally measured difference was likely just noise (see comment 48).

Using a single instruction via inline assembly just does not make much sense here because it is not worth it. Converting the whole inner loop to hand-optimized assembly may still be worth evaluating.
Updated patch to use fixedpoint division only on ARM. For ARM, this is a 3.3X speedup on blurs.

Would be great if we can get this into FF7 in time. The other patches are important too, but this one is extremely short and simple, and much more significant in terms of performance.
Attachment #541769 - Attachment is obsolete: true
Attachment #542544 - Flags: review?
Why not use fixed point on all platforms? At least for the sake of consistency and perfect reproducibility of the results. Integer division is much slower than multiplication everywhere.

Also is (boxSize == 1) a valid value? If yes, then this patch has a bug.
(In reply to comment #53)
> Why not use fixed point on all platforms? At least for the sake of
> consistency and perfect reproducibility of the results. Integer division is
> much slower than multiplication everywhere.

Yes please.
(In reply to comment #53)
> Also is (boxSize == 1) a valid value? If yes, then this patch has a bug.

I don't quite understand. If boxSize == 1 then reciprocal is 2^32. alphaSum will necessarily be < 2^8 (the sum of one Uint8). So multiplied we get a maximum of 2^40, which fits in 64 bits, I think.
My mistake, jchen points out to me that we are storing the reciprocal in a PRUint32 and not 64...
Fixed patch. Also doing fixedpoint on all platforms as requested. Seems to give a small perf improvement on desktop, 3-4% (on mobile, over 300% faster as mentioned before).

Who would like to review this?
Attachment #542544 - Attachment is obsolete: true
Attachment #542609 - Flags: review?
Attachment #542544 - Flags: review?
Comment on attachment 542609 [details] [diff] [review]
Part 1: Fixedpoint division on ARM v2

Review of attachment 542609 [details] [diff] [review]:
-----------------------------------------------------------------

Joe can do it I guess
Attachment #542609 - Flags: review? → review?(joe)
There is a reftest failure with the Part 1 patch. The difference is extremely minor; I believe we get slightly different values due to rounding differences between normal division and fixed-point.

Do we care enough about this to not use the optimization on desktop? Or should I update the test to the new values?
(In reply to comment #59)
> Do we care enough about this to not use the optimization on desktop? Or
> should I update the test to the new values?

Update the test.
(In reply to comment #56)
> My mistake, jchen points out to me that we are storing the reciprocal in a
> PRUint32 and not 64...

This brings in "64-bit * 64-bit -> 64-bit" multiplication, which is somewhat slower than "32-bit * 32-bit -> 64-bit". Though if no performance regressions can be measured, then it's probably fine.

But if avoiding 64-bit multiplication is desired, then the special case of boxSize == 1 can probably be handled separately (is this the case of having just a simple copy without blur?).
I was wondering: the values in aOutput are rounded down - is this intentional? 

There are also some arithmetic differences between the two versions: whenever alphaSum is an integer multiple of boxSize, and boxSize is not a power of 2, alphaSum / boxSize will give the exact answer whereas the version in this patch will give 1 lower. For example, if boxSize = 3 and alphaSum = 36,
alphaSum / boxSize = 12 (exact), but
reciprocal = 1431655765 (not exact) and
(alphaSum * reciprocal) >> 32 = 11.

If the intention is for aOutput to hold rounded values, that is
aOutput[foo] = (double)alphaSum / (double)boxSize + 0.5; or equivalently
aOutput[foo] = (alphaSum + boxSize / 2) / boxSize;

... then this could be approximated with 
reciprocal = ((1ULL << 48) + boxSize / 2) / boxSize;
aOutput[foo] = (alphaSum * reciprocal + (1ULL << (48-1)) ) >> 48;

This still gives some rounding differences (136/65280 for boxSize 1-255 and alphaSum 0-255), but far fewer than the previous. Using 16 instead of 48 in the above gives 183/65280 rounding differences, but would allow 32-bit math. 24 gives 171/65280, 36 gives 163/65280. Other values tend to be worse.

(I hope I didn't make any mistakes in the above. If the little program I whipped up to brute force this would be helpful I can post it here)
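Roughly, the check looks like this (a sketch of it, not the exact program):

#include <stdint.h>
#include <stdio.h>

int main(void)
{
    const int SHIFT = 48; // try 16, 24, 36, ... to compare
    int mismatches = 0, total = 0;
    for (uint32_t boxSize = 1; boxSize <= 255; ++boxSize) {
        uint64_t reciprocal = ((1ULL << SHIFT) + boxSize / 2) / boxSize;
        for (uint32_t alphaSum = 0; alphaSum <= 255; ++alphaSum) {
            // Correctly-rounded division...
            uint32_t exact = (alphaSum + boxSize / 2) / boxSize;
            // ...versus the fixed-point approximation with round-to-nearest.
            uint32_t approx =
                (uint32_t)((alphaSum * reciprocal + (1ULL << (SHIFT - 1))) >> SHIFT);
            mismatches += (exact != approx);
            ++total;
        }
    }
    printf("%d/%d rounding differences\n", mismatches, total);
    return 0;
}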
Attachment #542804 - Attachment is obsolete: true
Attached patch Part 2: Stop using nsTArray (obsolete)
Comment on attachment 541702 [details] [diff] [review]
"Optimize gfxAlphaBoxBlur::Paint on ARM" []

Obsoleting refactoring-related patches for now. We should open a new bug for these patches at some point.
Attachment #541702 - Attachment is obsolete: true
Attachment #540927 - Attachment is obsolete: true
So, these newly attached patches are fairly straightforward:


Create shadow buffers in device space
Fennec sets resolutions other than 1, so the context matrix often does not have an identity scale. When we are zoomed out (sometimes at zoom levels of .5 or less!) this means we are making unnecessarily large shadow buffers and *then* blitting to the screen using bilinear filtering!

Note that one potential optimization on big areas is to make the shadow buffer *smaller* than it would be in device space. This has to be weighed with bilinear filtering, but I'm guessing it could be a big win with acceptable quality loss.


Stop using nsTArray
During some shotgun profiling with gdb and CTRL-C I noticed I kept hitting nsTArray loops of constructing and destructing all those bytes in the temporary buffer. Since we don't really need nsTArray at all, I replaced it with RAII-style byte arrays.


roc, who might be good reviewers for this? I'd love to get these in for the upcoming Aurora merge.
Comment on attachment 542609 [details] [diff] [review]
Part 1: Fixedpoint division on ARM v2

Need to figure out the test failures for Part 1.
Attachment #542609 - Flags: review?(joe)
Here are a few runs with the attached shadow text benchmark.

Nightly:
4499 frames over 500s (8.998 FPS)
4527 frames over 500s (9.054 FPS)

With these patches:
6190 frames over 500s (12.38 FPS)
5887 frames over 500s (11.75 FPS)
Attachment #540875 - Attachment is obsolete: true
Attachment #540875 - Attachment is obsolete: false
Final version of the Part 1 patch to use fixedpoint division.

After much testing, it turns out the compiler is finicky about optimizing 32/64-bit multiplication. The best way is to store the reciprocal in a 64-bit value instead of casting to one later. This leads to a 20% regression on x86 and a 300% improvement on ARM. So, the patch does this only for ARM.
Attachment #542609 - Attachment is obsolete: true
Attachment #542980 - Flags: review?(joe)
(In reply to comment #67)
> Note that one potential optimization on big areas is to make the shadow
> buffer *smaller* than it would be in device space. This has to be weighed
> with bilinear filtering, but I'm guessing it could be a big win with
> acceptable quality loss.

That sounds like a very good idea.

> Stop using nsTArray
> During some shotgun profiling with gdb and CTRL-C I noticed I kept hitting
> nsTArray loops of constructing and destructing all those bytes in the
> temporary buffer. Since we don't really need nsTArray at all, I replaced it
> with RAII-style byte arrays.

In an opt build, we really do work that's proportional to the number of bytes in the buffer? That sounds really bad. Or do you mean it's the memory allocation/free cost that's hurting?

> roc, who might be good reviewers for this? I'd love to get these in for the
> upcoming Aurora merge.

Joe I guess.
(In reply to comment #70)
> Created attachment 542980 [details] [diff] [review] [review]
> Part 1: Fixedpoint division on ARM v3
> 
> Final version of the Part 1 patch to use fixedpoint division.
> 
> After much testing, turns out the compiler is finicky about optimizing 32/64
> bit multiplication. The best way is to store the reciprocal in a 64-bit
> value instead of casting to one later. This leads to a 20% regression on x86
> and a 300% improvement on arm. So, the patch does this only for arm.

What is the difference between the generated code on x86? I can't think of any reason that the division should be faster.
(In reply to comment #72)
> (In reply to comment #70)
> > Created attachment 542980 [details] [diff] [review] [review] [review]
> > Part 1: Fixedpoint division on ARM v3
> > 
> > Final version of the Part 1 patch to use fixedpoint division.
> > 
> > After much testing, turns out the compiler is finicky about optimizing 32/64
> > bit multiplication. The best way is to store the reciprocal in a 64-bit
> > value instead of casting to one later. This leads to a 20% regression on x86
> > and a 300% improvement on arm. So, the patch does this only for arm.
> 
> What is the difference between the generated code on x86? I can't think of
> any reason that the division should be faster.

I'm not sure exactly why there is such a difference. It depends on 64-bit multiplication versus 32-bit division I am guessing (and not 32 vs. 32), but it must depend on the rest of the loop somehow, since when I do a simple standalone benchmark I see basically no difference between the two. But in any case that shows there is no point in doing it on desktop anyhow since it won't improve anything, and just introduce some minor rounding differences.
(In reply to comment #73)
> I'm not sure exactly why there is such a difference. It depends on 64-bit
> multiplication versus 32-bit division I am guessing (and not 32 vs. 32), but
> it must depend on the rest of the loop somehow, since when I do a simple
> standalone benchmark I see basically no difference between the two. But in
> any case that shows there is no point in doing it on desktop anyhow since it
> won't improve anything, and just introduce some minor rounding differences.

I disagree. Doing them differently introduces the rounding differences; we should keep them the same absent a reason to do otherwise. Also, I'm a bit worried about your measurement methodology. Integer division is, in the best case, 10 cycles, and multiplication is, in the worst case, 5 cycles. Also, the new code seems to ask for a 64-bit x 64-bit multiplication, which is not so good.

In the following example:
    int f(int32_t a, int32_t d) {
            int64_t b = (1LL<<32)/d;
            return a*b>>32;
    }
     
    int g(int32_t a, int32_t d) {
            int32_t b = (1LL<<32)/d;
            return a*(int64_t)b>>32;
    }

g produces better code on x86-32 and arm, on x86-64 they are the same.

We should either special case a box size of 1 or fix any callers to not use it.
Attachment #542811 - Attachment description: Stop using nsTArray → Part 2: Stop using nsTArray
Attachment #542811 - Flags: review?(joe)
Comment on attachment 542809 [details] [diff] [review]
Part 3: Create shadow bitmap in device space

># HG changeset patch
># User Benjamin Stover <bstover@mozilla.com>
># Date 1309356577 25200
># Node ID 8ef0dea26bea39fbda2f7b3cb2a1e1b8b5171886
># Parent  5974e4cb3fc29826bbe275caf5fa1bdf3b4cd968
>Bug 633627 Create shadow bitmap in device space
>
>diff --git a/layout/base/nsCSSRendering.cpp b/layout/base/nsCSSRendering.cpp
>--- a/layout/base/nsCSSRendering.cpp
>+++ b/layout/base/nsCSSRendering.cpp
>@@ -4004,46 +4004,69 @@ nsContextBoxBlur::Init(const nsRect& aRe
>                        PRUint32 aFlags)
> {
>   if (aRect.IsEmpty()) {
>     mContext = nsnull;
>     return nsnull;
>   }
> 
>   gfxIntSize blurRadius = ComputeBlurRadius(aBlurRadius, aAppUnitsPerDevPixel);
>-  PRInt32 spreadRadius = NS_MIN(PRInt32(aSpreadRadius / aAppUnitsPerDevPixel),
>-                                PRInt32(MAX_SPREAD_RADIUS));
>+  aDestinationCtx->UserToDevice(blurRadius);
>+
>+  PRInt32 spreadRadiusScalar =
>+    NS_MIN(PRInt32(aSpreadRadius / aAppUnitsPerDevPixel),
>+    PRInt32(MAX_SPREAD_RADIUS));
>+  gfxIntSize spreadRadius(spreadRadiusScalar, spreadRadiusScalar);
>+  aDestinationCtx->UserToDevice(spreadRadius);
>+
>   mDestinationCtx = aDestinationCtx;
> 
>   // If not blurring, draw directly onto the destination device
>-  if (blurRadius.width <= 0 && blurRadius.height <= 0 && spreadRadius <= 0 &&
>+  if (blurRadius.width <= 0 && blurRadius.height <= 0 && spreadRadiusScalar <= 0 &&
>       !(aFlags & FORCE_MASK)) {
>     mContext = aDestinationCtx;
>     return mContext;
>   }
> 
>   // Convert from app units to device pixels
>   gfxRect rect = nsLayoutUtils::RectToGfxRect(aRect, aAppUnitsPerDevPixel);
> 
>+  gfxRect deviceRect = mDestinationCtx->UserToDevice(rect);
>+
>   gfxRect dirtyRect =
>     nsLayoutUtils::RectToGfxRect(aDirtyRect, aAppUnitsPerDevPixel);
>   dirtyRect.RoundOut();
> 
>+  gfxRect deviceDirtyRect = mDestinationCtx->UserToDevice(dirtyRect);
>+
>+  gfxRect deviceSkipRect;
>+  gfxRect* deviceSkipRectPtr = NULL;
>+  if (aSkipRect) {
>+    deviceSkipRect = mDestinationCtx->UserToDevice(*aSkipRect);
>+    deviceSkipRectPtr = &deviceSkipRect;
>+  }
>+
>   // Create the temporary surface for blurring
>-  mContext = blur.Init(rect, gfxIntSize(spreadRadius, spreadRadius),
>-                       blurRadius, &dirtyRect, aSkipRect);
>+  mContext = blur.Init(deviceRect, spreadRadius, blurRadius,
>+                       &deviceDirtyRect, deviceSkipRectPtr);
>+
>+  gfxMatrix matrix = aDestinationCtx->CurrentMatrix();
>+  mContext->Scale(matrix.xx, matrix.yy);
>   return mContext;
> }
> 
> void
> nsContextBoxBlur::DoPaint()
> {
>   if (mContext == mDestinationCtx)
>     return;
> 
>+  gfxContextMatrixAutoSaveRestore save(mDestinationCtx);
>+  gfxMatrix matrix = mDestinationCtx->CurrentMatrix();
>+  mDestinationCtx->Scale(1 / matrix.xx, 1 / matrix.yy);
>   blur.Paint(mDestinationCtx);
> }
> 
> gfxContext*
> nsContextBoxBlur::GetContext()
> {
>   return mContext;
> }
Attachment #542809 - Attachment description: Create shadow bitmap in device space → Part 3: Create shadow bitmap in device space
Attachment #542809 - Flags: review?(joe)
Sorry about the spam above.

> > Stop using nsTArray
> > During some shotgun profiling with gdb and CTRL-C I noticed I kept hitting
> > nsTArray loops of constructing and destructing all those bytes in the
> > temporary buffer. Since we don't really need nsTArray at all, I replaced it
> > with RAII-style byte arrays.
> 
> In an opt build, we really do work that's proportional to the number of
> bytes in the buffer? That sounds really bad. Or do you mean it's the memory
> allocation/free cost that's hurting?

Yes, we loop through the bytes. The shotgun profiling was with a debug build, but I didn't see anything in nsTArray that only happened for debug.
(In reply to comment #74)
> I'm a bit
> worried about your measurement methodology. Integer division is, in the best
> case, 10 cycles and multiplication is, in the worst case, 5 cycles.

Well, I ran the checks several times to confirm; however, it is possible I am doing something wrong and just can't see what that is. If someone wants to check my data that would be great.

Note that I don't know much assembly. However, changing the division to fixedpoint doesn't just change a single instruction. The compiler might end up needing to extend values from 32 to 64 bits during the loop, and/or decide to change the ordering, and this could interact in various ways with the surrounding code, making things faster or slower. It could also be a compiler bug. Again, I am not an assembly expert, so I didn't debug the generated binary; what I did was try all permutations of the source code (32 vs. 64, where to do the type casting, etc.) and benchmark on desktop x86, x86 Fennec, and ARM Fennec.

> Also, the new
> code seems to ask for a 64bitx64bit multiplication which is not so good.
> 

As mentioned earlier, all other code permutations did not end up with fast code on ARM; this was the only way to write it (except for writing raw ARM assembly, but we saw some reasons against that before).

> In the following example:
>     int f(int32_t a, int32_t d) {
>             int64_t b = (1LL<<32)/d;
>             return a*b>>32;
>     }
>      
>     int g(int32_t a, int32_t d) {
>             int32_t b = (1LL<<32)/d;
>             return a*(int64_t)b>>32;
>     }
> 
> g produces better code on x86-32 and arm, on x86-64 they are the same.

This could depend on the compiler version, I think. Note that we are forced to use an earlier, customized gcc in the Android NDK.

In summary, I think the results we have so far are clear enough, but I admit I could be mistaken somewhere, and I would greatly appreciate someone else trying to replicate them or find the mistake. But in any case, Part 1 is a very short, very low-risk patch that gives a very big performance win on ARM. I ask, nay implore, that we consider reviewing it so it can land for FF7, with additional investigation done later as a followup if we want fixedpoint for desktop as well.
(In reply to comment #76)
> Yes, we loop through the bytes. The shotgun profiling was with a debug
> build, but I didn't see anything in nsTArray that only happened for debug.

I'm very sure that in an opt build, nsAutoTArray<PRUint8,1000> array; array.SetLength(1000); should not loop through all the bytes. In a debug build some PRUint8 "constructor" might be run, but it should do nothing and be inlined away (along with the loop) in an opt build.

If that is not the case, we need to make it true as a very high priority, because we use nsTArray buffers like that all over the place!
Comment on attachment 542811 [details] [diff] [review]
Part 2: Stop using nsTArray

Review of attachment 542811 [details] [diff] [review]:
-----------------------------------------------------------------

::: gfx/thebes/gfxBlur.cpp
@@ +408,5 @@
> +public:
> +    AutoBuffer(PRSize aSize) : mData(new unsigned char[aSize]) {}
> +    ~AutoBuffer() { delete[] mData; }
> +    unsigned char* mData;
> +};

Can you just use nsAutoArrayPtr? Alternately, use fallible new, since this size is content-controlled, right?
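i.e., something like this (just a sketch; AllocBlurBuffer is a made-up name, and std::nothrow stands in for whatever fallible-allocation mechanism we pick):

#include "nsAutoPtr.h" // nsAutoArrayPtr
#include <new>         // std::nothrow
#include <string.h>

static unsigned char* AllocBlurBuffer(PRSize aSize)
{
    // aSize is content-controlled, so allocate fallibly and bail out on
    // failure instead of aborting.
    nsAutoArrayPtr<unsigned char> data(new (std::nothrow) unsigned char[aSize]);
    if (!data)
        return nsnull;
    memset(data, 0, aSize);
    return data.forget();
}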
Attachment #542811 - Flags: review?(joe) → review-
Comment on attachment 542809 [details] [diff] [review]
Part 3: Create shadow bitmap in device space

Review of attachment 542809 [details] [diff] [review]:
-----------------------------------------------------------------

::: layout/base/nsCSSRendering.cpp
@@ +4008,5 @@
>      return nsnull;
>    }
>  
>    gfxIntSize blurRadius = ComputeBlurRadius(aBlurRadius, aAppUnitsPerDevPixel);
> +  aDestinationCtx->UserToDevice(blurRadius);

You have a lot of UserToDevice conversions here. You should probably add a comment above saying something about the fact that we want to do this in device space to avoid needing to do any filtering/resizing.

@@ +4012,5 @@
> +  aDestinationCtx->UserToDevice(blurRadius);
> +
> +  PRInt32 spreadRadiusScalar =
> +    NS_MIN(PRInt32(aSpreadRadius / aAppUnitsPerDevPixel),
> +    PRInt32(MAX_SPREAD_RADIUS));

If it fits, put the NS_MIN on the previous line, and definitely put the PRInt32(MAX_SPREAD_RADIUS) directly below the PRInt32(aSpreadRadius...).

@@ +4048,5 @@
> +  mContext = blur.Init(deviceRect, spreadRadius, blurRadius,
> +                       &deviceDirtyRect, deviceSkipRectPtr);
> +
> +  gfxMatrix matrix = aDestinationCtx->CurrentMatrix();
> +  mContext->Scale(matrix.xx, matrix.yy);

Can you explain to me why we scale this way in Init(), and the opposite way in DoPaint()?
Attachment #542809 - Flags: review?(joe) → review+
Comment on attachment 542980 [details] [diff] [review]
Part 1: Fixedpoint division on ARM v3

I'm passing this review off to Jeff, who might have more interesting things to say about it than I do.
Attachment #542980 - Flags: review?(joe) → review?(jmuizelaar)
(In reply to comment #77)
> (In reply to comment #74)
> > g produces better code on x86-32 and arm, on x86-64 they are the same.
> 
> This could depend on compiler version I think. Note that we are forced to
> use an earlier, customized gcc in the Android NDK.

I also did my tests using the gcc from the Android NDK. Can you attach the assembly from both versions to compare?
Comment on attachment 542811 [details] [diff] [review]
Part 2: Stop using nsTArray

After looking at the generated optimized Android code, we don't seem to do any loops before <memset> is called, so this patch is unnecessary.
Attachment #542811 - Attachment is obsolete: true
Assignee: nobody → ben
tracking-fennec: 7+ → 8+
What's the status here?
Regarding part 1 (I don't know the status of part 3 - stechz?), I re-ran all my benchmarks. It looks like I was wrong about the differences between the various ways to do fixedpoint math: Jeff, it is indeed a little faster the way you suggested. I must have made a mistake in collecting the results, or didn't notice how noisy the data is, sorry about that.

I still see a 4-5% slowdown on x86, however.

Attached is some x86 data: The generated code without the patch, the patch code, and the generated code with the patch. Perhaps someone that knows x86 asm can figure out why the patch makes it slower. If not, I still think we should take this patch (but just for arm).
Derf, would you mind looking at the assembly?
tracking-fennec: 8+ → 9+
(In reply to Benjamin Stover (:stechz) from comment #86)
> Derf, would you mind looking at the assembly?

So, part of the problem here is that you're doing a mixed signed-unsigned multiplication. That requires a cltd instruction to sign-extend alphaSum and then _two_ multiplies (one for the original 32-bit value and one for the extended sign). That both gives you a nice long pipeline stall and requires a bunch of extra register shuffling. The second version spills an extra 5 variables to the stack (that may be due to the extra multiply and the register requirements it implies, or it may just be that you got an unlucky roll with the random number generator that is the gcc register allocator).

AFAICT, there's no way that alphaSum can be negative (unless it overflows), so it would probably make more sense to make it unsigned, so a single unsigned 32x32->64 bit multiply will suffice.
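That is, something like this (sketch):

#include <stdint.h>

// With alphaSum unsigned, the high word of the product needs only a single
// unsigned 32x32->64-bit multiply (mull on x86, umull on ARM); no
// sign-extension (cltd) and no second multiply.
static inline uint8_t DivideByReciprocal(uint32_t alphaSum, uint32_t reciprocal)
{
    return (uint8_t)(((uint64_t)alphaSum * reciprocal) >> 32);
}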

Also, for future reference, it is _far_ easier to parse a dump of gcc's asm output if you compile with debug information, so that it includes line number annotations.
> 
> You have a lot of UserToDevice conversions here. You should probably add a
> comment above saying something about the fact that we want to do this in
> device space to avoid needing to do any filtering/resizing.

Comment added.

> If it fits, put the NS_MIN on the previous line, and definitely put the
> PRInt32(MAX_SPREAD_RADIUS) directly below the PRInt32(aSpreadRadius...).

Done.

> Can you explain to me why we scale this way in Init(), and the opposite way
> in DoPaint()?

mContext is the context we paint the shadow mask in. The painter will be painting in content coordinates, so we need to apply a transform that maps content to device.

mDestinationCtx is where the blurred shadow mask will be painted to. There should be no scale being applied when we blit because we shouldn't be oversampling or undersampling, so this just makes sure we have identity scale.
Attachment #542809 - Attachment is obsolete: true
Comment on attachment 556936 [details] [diff] [review]
Create shadow bitmap in device space

Asking for review again just to make sure we're on the same page.
Attachment #556936 - Flags: review?(joe)
Attachment #556936 - Flags: review?(joe) → review+
Assignee: ben → azakai
Note Talos also reported a 17.8% FennecBench Zoom regression.

Regression :( FennecBench Zoom increase 17.8% on Android 2.2 Mozilla-Inbound
----------------------------------------------------------------------------
    Previous: avg 289.850 stddev 6.745 of 30 runs up to revision ba8bbef0fdf9
    New     : avg 341.356 stddev 9.867 of 5 runs since revision 84766ee09777
    Change  : +51.506 (17.8% / z=7.636)
    Graph   : http://mzl.la/nm5Pt4

Changeset range: http://hg.mozilla.org/integration/mozilla-inbound/pushloghtml?fromchange=ba8bbef0fdf9&tochange=84766ee09777

Changesets:
  * http://hg.mozilla.org/integration/mozilla-inbound/rev/62f2756245c5
    : Benjamin Stover <bstover@mozilla.com> - Bug 633627 Create shadow bitmap in device space r=joe
    : http://bugzilla.mozilla.org/show_bug.cgi?id=633627

  * http://hg.mozilla.org/integration/mozilla-inbound/rev/84766ee09777
    : Josh Matthews <josh@joshmatthews.net> - Bug 683614 - Fix typo in oninvalid event handler name in test. r=mounir
    : http://bugzilla.mozilla.org/show_bug.cgi?id=683614
derf, thanks for the help! (Sorry about the lack of debug info - I didn't know about that.)

Here is an updated patch that fixes that stuff. Speed appears unchanged on desktop, and is over twice as fast on Android.

Assuming we are ok with this patch, the next step for fixedpoint stuff will be to modify the relevant tests as mentioned previously.
Attachment #542980 - Attachment is obsolete: true
Attachment #552563 - Attachment is obsolete: true
Attachment #557662 - Flags: review?(jmuizelaar)
Attachment #542980 - Flags: review?(jmuizelaar)
Attachment #557662 - Flags: review?(jmuizelaar) → review+
(In reply to Bas Schouten (:bas) from comment #93)
> Note Talos also reported a 17.8% FennecBench Zoom regression.

It turns out this regression was incorrectly blamed on this patch.  Backing out the patch did not fix the Talos tzoom regression.
Regarding the fixedpoint patch, I am not sure what to do about the tests:

1. layout/reftests/box-shadow/boxshadow-blur-2.html , a != comparison. Visually I don't see any difference in the output, so I assume the change from before is tiny rounding stuff. However, this is a != comparison, which means we should render something different than a particular image - it seems very odd that the particular image given happens to be exactly what is now rendered with the fixedpoint patch. In other words, hitting on the exact single output not allowed by this test seems implausible. So I assume I am missing something?

2. layout/reftests/bugs/536061.html , a == comparison. The comparison is to layout/reftests/bugs/536061-ref.html which has some additional transforms applied, and I guess the idea is that they should not affect the output. However they do affect the output with this patch, I assume due to minor rounding differences. Visually a very small difference can be seen in comparison to without this patch (but both look fine). Should I just disable this test? Not sure what else to do.
Yeah, as I noted in comment #62 there are going to be rounding differences, including some where the math was exact before - but it won't be off by more than 1/255. Even before the patch, the blur rounded down instead of toward the nearest integer (though that may be to spec, I don't know), so maybe the test should be modified to test whether each pixel of the result is within a given acceptable range of the expected (say within 1%)?
(In reply to Emanuel Hoogeveen from comment #97)
> maybe the
> test should be modified to test whether each pixel of the result is within a
> given acceptable range of the expected (say within 1%)?

That sounds like it would work for the == test. Do we have such a mechanism available in the reftest harness? If not hopefully it wouldn't be hard to add.

Not sure what would make sense for the != test though.
Hmm, the test itself is a bit beyond me (I was expecting a reference image, not a bunch of transforms). I don't think there's any kind of 'fuzzy compare' in the harness, though I could be wrong. Maybe the test could be turned into a pass/fail test that does the comparison itself, but I don't know how hard that would be or if there's any precedent. I'm more of an interested lurker than a developer :(
(In reply to Emanuel Hoogeveen from comment #97)
> Yeah, as I noted in comment #62 there are going to be rounding differences,
> including some where the math was exact before - but it won't be off by more
> than 1/255. Even before the patch, the blur rounded down instead of toward
> the nearest integer (though that may be to spec, I don't know), so maybe the
> test should be modified to test whether each pixel of the result is within a
> given acceptable range of the expected (say within 1%)?

We generally don't do this. Adding fudge values like that makes tests very difficult to manage.

If only a few pixels are different you can extend the "censoring" black divs to cover them.

If everything's slightly off then you should probably just disable the test for platforms where this code is used.
Hmm, I think the differences between the fixedpoint and normal division code can be bigger than expected. We can get an off-by-one error on each division, and those errors can accumulate, since we run the blur a few times (to make it smoother). I think this explains why the differences are visually noticeable in some cases (but still small, and they still look OK).

This ends up affecting a lot of pixels in the two tests in question. This happens on all platforms though. Does anyone object to just disabling these two tests?
This patch disables the two problematic tests. I verified that there are no test failures after applying this.
Attachment #559208 - Flags: review?(roc)
Can you just make these random-if(Android)?
The failures are on desktop though, not Android. Sorry if I wasn't clear before.
OK. Can you make them random instead of skip?
Ok, changed to random.
Attachment #559208 - Attachment is obsolete: true
Attachment #559605 - Flags: review?(roc)
Attachment #559208 - Flags: review?(roc)
Pushed the fixedpoint stuff to inbound.

There is also stechz's patch for doing the calculations in device space, which bounced. Do we still want that?
Whiteboard: [inbound]
Fixedpoint division in blur code:
https://hg.mozilla.org/mozilla-central/rev/2e63e7fcecd6

Disable tests that fail with fixedpoint blurring:
https://hg.mozilla.org/mozilla-central/rev/f1586578808b

Leaving open for comment 108.
Status: NEW → ASSIGNED
Whiteboard: [inbound]
Target Milestone: --- → mozilla9
I will not have time to try to fix stechz's patch. Also marking this as RESOLVED since the fixedpoint patch has landed and I don't see any responses to comment 108 which asks whether we still want stechz's patch.
Assignee: azakai → nobody
Status: ASSIGNED → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Mozilla/5.0 (Android; Linux armv7l; rv:10.0a1) Gecko/20111019 Firefox/10.0a1 Fennec/10.0a1
Device: LG Optimus 2X

I tried to verify this: on gmail.com and reader.google.com I could see the shadow for dropdowns. On flickr.com and apple.com, where should I look for text shadow? And should I look for anything else in order to verify this bug?
Blocks: 739969
No longer blocks: 739969