Closed Bug 659725 Opened 13 years ago Closed 5 years ago

Optimize Canvas putImageData conversion loop

Categories

(Core :: Graphics: Canvas2D, defect)

Version: Other Branch
Type: defect
Priority: Not set
Severity: normal

Tracking

RESOLVED INCOMPLETE

People

(Reporter: azakai, Unassigned)

References

Details

Attachments

(4 files, 5 obsolete files)

Canvas demos will usually call putImageData many times per second, ideally something like 50-60. |nsCanvasRenderingContext2D::PutImageData_explicit| loops over the input to convert it to the format we need later (cairo), reordering RGBA to BGRA (little endian) or ARGB (big endian):

http://mxr.mozilla.org/mozilla-central/source/content/canvas/src/nsCanvasRenderingContext2D.cpp#3925

We might be able to optimize that loop.
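
For reference, the per-pixel work in that loop is roughly the following (a simplified sketch written from the description above, not the exact tree code, and the function name is made up; it assumes little-endian BGRA output and a 256x256 premultiply table indexed as [alpha][value]):

  static void
  ConvertRGBAtoPremultipliedBGRA(PRUint8 *dst, const PRUint8 *src, PRUint32 len)
  {
    for (PRUint32 i = 0; i < len; i += 4) {
      PRUint8 r = src[i], g = src[i+1], b = src[i+2], a = src[i+3];
      dst[i]   = sPremultiplyTable[a][b]; // B, premultiplied by alpha
      dst[i+1] = sPremultiplyTable[a][g]; // G
      dst[i+2] = sPremultiplyTable[a][r]; // R
      dst[i+3] = a;                       // alpha unchanged
    }
  }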

I did some benchmarks on a 320x200 canvas. On desktop I get 1ms for the loop. At 50fps we have 20ms per frame, so the loop takes about 5% of our CPU time, which is not too bad. This is with a small canvas, though - a larger one, or even one that fills a typical monitor, would be slower.

I also benchmarked on fennec on an HTC desire. It takes 4ms there, which is 20% of our CPU time - which looks very significant.
(In reply to comment #0)
> I also benchmarked on fennec on an HTC desire. It takes 4ms there, which is
> 20% of our CPU time - which looks very significant.

You can check bug 534215 comment 1 for the way of doing fast premultiplication on ARM.
Attached patch one possibility (obsolete) — Splinter Review
Silly patch with one quick improvement, no SSE or NEON. Just little endian so far. Patch also includes the timing code.

On Android, this gets me from 4ms to 2.5ms for the loop.

Asking for feedback because I'm not sure if this is a good idea or not, and what is the right approach.
Attachment #535226 - Flags: feedback?(jmuizelaar)
Attachment #535226 - Flags: feedback?(bas.schouten)
Here you're only considering conversions from/to a few different formats, right?

Just asking because if you ever had to do conversions from/to a large number of formats, you might then be interested in reusing the code we have in WebGL for that, see
http://mxr.mozilla.org/mozilla-central/source/content/canvas/src/WebGLContextGL.cpp#3313
and
http://mxr.mozilla.org/mozilla-central/source/content/canvas/src/WebGLTexelConversions.h

Here we have 155 different paths, generated by C++ templates (x86-64 code size: 24K), but no SSE/NEON. In your case, if you have only a few cases to handle, it's of course much more worthwhile to do SSE/NEON.
Looking at your code here, are you sure that sPremultiplyTable is a good idea at all? Not saying that it isn't, but that really needs to be validated by experiment. If sPremultiplyTable fits entirely in the CPU cache, then I'm willing to believe that it's an optimization. But since we're talking about 64K here and cell phones, I don't know. That would depend on many factors: the device, the image, and the rest of the test case (what else is competing for CPU cache memory).
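
For reference, the 64K figure comes from the table being 256x256 bytes. A sketch of how such a table gets filled, assuming the (x * a + 254) / 255 rounding used by the current code (the exact indexing is an assumption here):

  static PRUint8 sPremultiplyTable[256][256];

  static void EnsurePremultiplyTable()
  {
    for (PRUint32 a = 0; a < 256; a++)
      for (PRUint32 x = 0; x < 256; x++)
        sPremultiplyTable[a][x] = (x * a + 254) / 255;
  }

Whether caching that is a win over doing the multiply per pixel is exactly the question: it trades arithmetic for 64K of cache footprint.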
I'm new to this code myself so I don't know what the usual ways are to optimize this sort of thing. Looking around for tools, I found the orc (optimized inner loop runtime compiler),

http://code.entropywave.com/projects/orc/

Looks like a simple language that compiles into SSE, NEON, etc. bjacob, if I understand it correctly, it might help with the dozens of paths you have in WebGL (less work than writing SSE/NEON by hand)? If no one has investigated it yet I can take a closer look.
> Looking at your code here, are you sure that sPremultiplyTable is a good idea 
> at all? Not saying that it isn't, but that really needs to be validated by 
> experiment.

See bug 519400 for those experiments (desktop only).

I don't know how this will perform on mobile compared to the original code [1].  The ARM Cortex-A8 has 32KB of L1, so of course the lookup table doesn't fit in L1.  But neither does the image data, so either way we're blowing away L1.

It's hard to quantify the cost of blowing away 64KB of L2, and perhaps this isn't worthwhile when the canvas is small.  (Perhaps we should use a different approach on very small canvases!)

I couldn't figure out how to do

   (x*alpha + 254) / 255

with NEON or SSE, since neither has an integer division instruction IIRC.  But maybe there's a clever transformation on this.

I'm curious how this patch performs on desktop.  There's a lot of data parallelism, which is promising.

[1] http://hg.mozilla.org/mozilla-central/file/61e6d741/content/canvas/src/nsCanvasRenderingContext2D.cpp#l3703
(In reply to comment #6)
> > Looking at your code here, are you sure that sPremultiplyTable is a good idea 
> > at all? Not saying that it isn't, but that really needs to be validated by 
> > experiment.
> 
> See bug 519400 for those experiments (desktop only).
> 
> I don't know how this will perform on mobile compared to the original code
> [1].  The ARM Cortex-A8 has 32KB of L1, so of course the lookup table
> doesn't fit in L1.  But neither does the image data, so either way we're
> blowing away L1.
> 
> It's hard to quantify the cost of blowing away 64KB of L2, and perhaps this
> isn't worthwhile when the canvas is small.  (Perhaps we should use a
> different approach on very small canvases!)
> 
> I couldn't figure out how to do
> 
>    (x*alpha + 254) / 255
> 
> with NEON or SSE, since neither has an integer division instructions IIRC. 
> But maybe there's a clever transformation on this.

GCC knows how to optimize away integer divisions.

This code:

unsigned char f(unsigned char x, unsigned char alpha)
{
  return ((unsigned short)(x)*alpha + 254) / 255;
}

with this command line (gcc 4.5.3):

gcc a.c -S -o a.s -O3

gives this assembly:

        movzbl  %dil, %edi
        movzbl  %sil, %esi
        movl    $-2139062143, %edx
        imull   %edi, %esi
        addl    $254, %esi
        movl    %esi, %eax
        imull   %edx
        leal    (%rdx,%rsi), %eax
        sarl    $7, %eax
        ret
(In reply to comment #5)
> I'm new to this code myself so I don't know what the usual ways are to
> optimize this sort of thing. Looking around for tools, I found the orc
> (optimized inner loop runtime compiler),
> 
> http://code.entropywave.com/projects/orc/
> 
> Looks like a simple language that compiles into SSE, NEON, etc. bjacob, if I
> understand it correctly, it might help with the dozens of paths you have in
> WebGL (less work than writing SSE/NEON by hand)? If no one has investigated
> it yet I can take a closer look.

Thanks; it's not absurd to use a runtime compiler for WebGL format conversions, since there are just too many cases there to have built-in SSE paths for all of them. However, I don't know yet that it's performance critical enough that we need to go this far. The only common case where it's really performance critical is when streaming video textures.
Ah, that's a good idea.

It hinges on the sarl (shift right 7 bits), which iirc doesn't exist in SSE or NEON.  I'll ponder this more, but maybe Tim can shed some light for us.
(In reply to comment #9)
> Ah, that's a good idea.
> 
> It hinges on the sarl (shift right 7 bits), which iirc doesn't exist in SSE
> or NEON.  I'll ponder this more, but maybe Tim can shed some light for us.

If SSE/NEON don't have the right shift instruction you need here, you could at least do right shifts on 64-bit ints, followed by an AND with a mask; that would still be only 3 instructions, not too bad. If SSE/NEON has a 128-bit right shift instruction, then you're down to 2 instructions (shift+and).
(In reply to comment #6)
> > Looking at your code here, are you sure that sPremultiplyTable is a good idea 
> > at all? Not saying that it isn't, but that really needs to be validated by 
> > experiment.
> 
> See bug 519400 for those experiments (desktop only).
> 
> I don't know how this will perform on mobile compared to the original code
> [1].  The ARM Cortex-A8 has 32KB of L1, so of course the lookup table
> doesn't fit in L1.  But neither does the image data, so either way we're
> blowing away L1.
> 
> It's hard to quantify the cost of blowing away 64KB of L2, and perhaps this
> isn't worthwhile when the canvas is small.  (Perhaps we should use a
> different approach on very small canvases!)
> 
> I couldn't figure out how to do
> 
>    (x*alpha + 254) / 255

Division by 255 is common in graphics and has different recipes. The common one is (x+127)/255 which is x*257>>8 or (x + x>>8)>>8. Is the 254 correct? I seem to remember looking into that before but I don't remember the answer.
> It hinges on the sarl (shift right 7 bits), which iirc doesn't exist in SSE
> or NEON.  I'll ponder this more, but maybe Tim can shed some light for us.

Arithmetic right shifts by constant amounts certainly exist in both SSE and NEON (though it makes more sense to use a logical one here). See also bug 594920 comment 8 and comment 10. The code there uses an actual multiply, as it's fewer instructions and fewer uops, though someone might be able to do better if they spent some time on it.
Hm, yes.  I'm not sure why I thought otherwise, but I found these instructions in the ISA pretty easily...

Alon, do you want to try and spin a patch, or would you like me to?
(In reply to comment #13)
> Hm, yes.  I'm not sure why I thought otherwise, but I found these
> instructions in the ISA pretty easily...
> 
> Alon, do you want to try and spin a patch, or would you like me to?

I've not written code like this before, but I'd be happy to try with some guidance. What's the closest existing example of this sort of thing in our code so I can study that?
If you want to do SSE, you could look at gfxAlphaRecoverySSE2.cpp or nsUTF8ToUnicodeSSE2.cpp.

For NEON, maybe yuv_convert_arm.cpp.  Also [1].

If you have a mobile build set up, I might recommend starting with NEON, since it's much more sane than SSE and it's easy to load r/g/b/a into separate registers.

[1] http://blogs.arm.com/software-enablement/161-coding-for-neon-part-1-load-and-stores/
Thanks for the links Justin!
Attached patch NEON patch (obsolete) — Splinter Review
Ok, here is a NEON patch.

This patch gives me 3-4ms in a benchmark compared to 7-8ms with the current code. It probably isn't as optimized as it could be; this is my first time using ARM assembly.
Attachment #535226 - Attachment is obsolete: true
Attachment #535226 - Flags: feedback?(jmuizelaar)
Attachment #535226 - Flags: feedback?(bas.schouten)
Attachment #536467 - Flags: review?(justin.lebar+bug)
Attached patch NEON patch (obsolete) — Splinter Review
Oops, wrong patch before, sorry about that. Here's the real one.
Attachment #536467 - Attachment is obsolete: true
Attachment #536467 - Flags: review?(justin.lebar+bug)
Attachment #536469 - Flags: review?(justin.lebar+bug)
Comment on attachment 536469 [details] [diff] [review]
NEON patch

I'm happy to give feedback, but I'm not a peer, so this will need to be reviewed by someone else once we're done.

> diff --git a/content/canvas/src/convert_from_canvas_rgba.cpp b/content/canvas/src/convert_from_canvas_rgba.cpp
> new file mode 100644
> --- /dev/null
> +++ b/content/canvas/src/convert_from_canvas_rgba.cpp

Can we put these functions in a namespace?

> +void convert_from_canvas_rgba_noarch(PRUint8 *dst, PRUint8 *src, int n, PRUint8 (*premultiplyTable)[256])

I'd prefer a const pointer to src and the premultiply table.  Also, please assert that n is a multiple of 4.

>+#if defined(MOZILLA_MAY_SUPPORT_NEON) && defined(IS_LITTLE_ENDIAN)
>+void convert_from_canvas_rgba_neon_littleendian(PRUint8 *dst, PRUint8 *src, int n, PRUint8 (*premultiplyTable)[256])
>+{

Const pointers.

>+  if (((int)dst)%32 != ((int)src)%32) {

Maybe use NS_PTR_TO_INT32() here and elsewhere?

>+  int align = ((int)dst)%32;
>+  if (align != 0) {
>+    // Advance to 32-bit alignment
>+    convert_from_canvas_rgba_noarch(dst, src, 32-align, premultiplyTable);
>+    dst += align;
>+    src += align;
>+  }

ptr % 32 == 0 is checking for 32-*byte* alignment, which I don't think we need.  vld4.8 only requires 32-bit alignment.

If it's not 32-*bit* aligned to begin with, then no amount of realignment will help, because we have to walk in 32-bit increments in the unvectorized path.  I don't know if we can rely on these pointers being word aligned; if so, we should just assert.  If not, we should fall back to the slow path.

>+  align = n % 32;
>+  if (align != 0) {
>+    // Do any bytes at the end that are not 32-bit aligned

Here you correctly make sure that we process a multiple of 32 *bytes* using the vector ops.  I wouldn't call this "aligning", though, since it's OK if src + n isn't a multiple of 32.

>+    convert_from_canvas_rgba_noarch(dst + n - align, src + n - align, align, premultiplyTable);
>+    n = n - align;
>+  }

Probably should do this after the vector ops; it's better in general to access data sequentially, since the MMU can prefetch that better.  When you move it below the asm, you may want to take advantage of the fact that dst will be pointing to the next byte to process.

I'm not sure it's worthwhile to do this using the table lookup method.  As it is, we have to compute the premultiply table and hold it around, even when we don't have to do any work at the end.  Maybe it would be better if we had a convert_from_canvas_rgba_noarch_notable method which computed the values directly, and if convert_from_canvas_rgba_noarch ensured that the premultiply table exists. Then we wouldn't need to compute the table before calling the neon function.
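
Something along these lines is what I have in mind (just an untested sketch, using the same (x * a + 254) / 255 rounding the table encodes):

  static void
  convert_from_canvas_rgba_noarch_notable(PRUint8 *dst, const PRUint8 *src, int n)
  {
    for (int i = 0; i < n; i += 4) {
      PRUint8 r = src[i], g = src[i+1], b = src[i+2], a = src[i+3];
      dst[i]   = (b * a + 254) / 255; // B
      dst[i+1] = (g * a + 254) / 255; // G
      dst[i+2] = (r * a + 254) / 255; // R
      dst[i+3] = a;
    }
  }

For the handful of leftover pixels at the edges, the per-pixel division should be cheap enough not to matter.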

(I imagine we may want to reorganize how we handle the (un)premultiply table, since gfxUtils.cpp duplicated this code.  But we can worry about that in a separate bug...)

>+  PRUint8 *last = dst + n;
>+
>+  asm volatile (

Does it not work if it's not volatile?  It should work with just regular asm if you have the right [read, write, clobber] settings.

>+".fpu neon\n"
>+  "mov r0, %[src]\n"
>+  "mov r1, %[dst]\n"
>+  "mov r2, %[last]\n"

You'll use fewer registers if you just use %[src] wherever you use r0, and so on.

>+  "vmov.i16 q6, #254\n" // a constant we need later
>+  "vmov.i16 q7, #255\n" // another constant we need later
>+"1:\n"
>+  "vld4.8 {d0,d1,d2,d3}, [r0]\n" // interleaved load so that d0-d3 contain R,G,B,A data respectively
>+
>+  "vmovl.u8 q2, d0\n" // Double to 16 bits
>+  "vmovl.u8 q3, d1\n"
>+  "vmovl.u8 q4, d2\n"
>+  "vmovl.u8 q5, d3\n"

Might be a bit faster if you do the d3 -> q5 conversion before d1 -> q3 and d2
-> q4, since then the first multiply by alpha doesn't have to wait for the
conversion to finish.

>+  "vmul.u16 q2, q5\n" // Multiply by alpha
>+  "vmul.u16 q3, q5\n"
>+  "vmul.u16 q4, q5\n"
>+
>+  "vadd.u16 q2, q6\n" // Add 254
>+  "vadd.u16 q3, q6\n"
>+  "vadd.u16 q4, q6\n"
>+
>+  "vshr.u16 q8, q2, #8\n" // Divide by 255
>+  "vshr.u16 q9, q3, #8\n"
>+  "vshr.u16 q10, q4, #8\n"
>+  "vadd.u16 q2, q8\n"
>+  "vadd.u16 q3, q9\n"
>+  "vadd.u16 q4, q10\n"
>+  "vshr.u16 q2, #8\n"
>+  "vshr.u16 q3, #8\n"
>+  "vshr.u16 q4, #8\n"
>+
>+  "vmovn.i16 d2, q2\n" // Return to 8 bits, swapping R and B as we go
>+  "vmovn.i16 d1, q3\n"
>+  "vmovn.i16 d0, q4\n"

I'm not sure this is exactly the same as the original code; it appears to be off by 1 in some cases.  Testcase at the bottom of the comment.  Maybe I'm interpreting the asm incorrectly?

Is it possible to write a test for this?  I don't know what the state is of our mobile test infrastructure.

>+  "vst4.8 {d0,d1,d2,d3}, [r1]\n" // interleaved save
>+  "add r0, #32\n" // increment pointers
>+  "add r1, #32\n"

If you do "vst4.8 {d0,d1,d2,d3}, [dst]!", then the pointer will be incremented for you.  Same for the vld4.8 above.

>+  "cmp r1, r2\n" // check if we are done

It probably doesn't matter, but you might get a speedup by comparing src to src + n instead of comparing dst to dst + n, since if you change the "vld [src]" to "vld [src]!", we can be sure that we won't have to stall the pipeline waiting for dst += 32 to be computed.

>+	: [src] "r" (src), [dst] "r" (dst), [last] "r" (last)

If you use src and dst directly (rather than through r0, r1), this needs to be "+r" for src and dst, since you're reading and writing them.

>+	: "cc", "memory", "d0", "d1", "d2", "d3", "d4", "d5", "d6", "r0", "r1", "r2", "q0", "q1", "q2", "q3", "q4", "q5",
>+    "q6", "q7", "q8", "q9", "q10"

And then you can take out r0, r1, r2 here.

Last time I checked, GCC doesn't understand that {d0, d1} is the same as q0.  Therefore to be totally correct, I think the list of clobbered registers has to include all of the doubleword registers through d20.  It doesn't really matter, though, since there's no more asm in this function.

>diff --git a/content/canvas/src/nsCanvasRenderingContext2D.cpp b/content/canvas/src/nsCanvasRenderingContext2D.cpp
>--- a/content/canvas/src/nsCanvasRenderingContext2D.cpp
>+++ b/content/canvas/src/nsCanvasRenderingContext2D.cpp
>+    convert_from_canvas_rgba(dst, src, w*h*4, sPremultiplyTable);

We assert that aDataLen == w*h*4, so maybe you should use that?


Testcase:

  #include <stdio.h>
  #include <stdint.h>

  uint8_t method1(uint16_t x, uint16_t a) {
    return (uint8_t) ((x * a + 254) / 255);
  }

  uint8_t method2(uint16_t x, uint16_t a) {
    uint16_t y = x * a + 254;
    return (uint8_t) ((y + (y >> 8)) >> 8);
  }

  int main() {
    for (uint16_t x = 0; x <= 255; x++) {
      for (uint16_t a = 0; a <= 255; a++) {
        uint8_t m1 = method1(x, a);
        uint8_t m2 = method2(x, a);
        if (m1 != m2) {
          printf("Difference at x=%3d, a=%3d. m1=%3d, m2=%3d\n", x, a, m1, m2);
        }
      }
    }
  }

Outputs:

  Difference at x=  1, a=  1. m1=  1, m2=  0
  Difference at x=  2, a=128. m1=  2, m2=  1
  Difference at x=  4, a= 64. m1=  2, m2=  1
  Difference at x=  7, a= 73. m1=  3, m2=  2
  Difference at x=  8, a= 32. m1=  2, m2=  1
  etc.
Attachment #536469 - Flags: review?(justin.lebar+bug) → review-
A lot of performance is lost if not using prefetch here, because many of the existing ARM processors don't have an automatic hardware prefetcher. There is a PLD instruction available on ARM for this purpose.

And just for the reference, a similar code for doing premultiplication as ((x * a + 127) / 255) is implemented here:
    http://cgit.freedesktop.org/pixman/tree/pixman/pixman-arm-neon-asm.S?id=pixman-0.22.0#n2184

Which can be used as the following function if bypassing standard pixman API is desired:
    void pixman_composite_src_pixbuf_8888_asm_neon(
        int32_t w, int32_t h,
        uint32_t *dst, int32_t dst_stride,
        uint32_t *src, int32_t src_stride)
Comment on attachment 536469 [details] [diff] [review]
NEON patch

Review of attachment 536469 [details] [diff] [review]:
-----------------------------------------------------------------

I know I wasn't asked, but....

::: content/canvas/src/convert_from_canvas_rgba.cpp
@@ +90,5 @@
> +".fpu neon\n"
> +  "mov r0, %[src]\n"
> +  "mov r1, %[dst]\n"
> +  "mov r2, %[last]\n"
> +  "vmov.i16 q6, #254\n" // a constant we need later

q4-q7 are callee-saved in the ARM ABI, so if you use them, they have to be saved/restored. I _believe_ gcc will do this for you if you put them in the clobber list, but it's better to just use different registers. There's more than enough.

@@ +91,5 @@
> +  "mov r0, %[src]\n"
> +  "mov r1, %[dst]\n"
> +  "mov r2, %[last]\n"
> +  "vmov.i16 q6, #254\n" // a constant we need later
> +  "vmov.i16 q7, #255\n" // another constant we need later

q7 is never used.

@@ +93,5 @@
> +  "mov r2, %[last]\n"
> +  "vmov.i16 q6, #254\n" // a constant we need later
> +  "vmov.i16 q7, #255\n" // another constant we need later
> +"1:\n"
> +  "vld4.8 {d0,d1,d2,d3}, [r0]\n" // interleaved load so that d0-d3 contain R,G,B,A data respectively

This load and the subsequent store are unaligned as written. You need at least 16-byte alignment to make them any faster (indicated with [r0, :128] for 128-bit alignment in the GNU syntax). But unless 16-byte alignment is something we can expect, you might as well leave them unaligned and get rid of all the alignment code.

@@ +95,5 @@
> +  "vmov.i16 q7, #255\n" // another constant we need later
> +"1:\n"
> +  "vld4.8 {d0,d1,d2,d3}, [r0]\n" // interleaved load so that d0-d3 contain R,G,B,A data respectively
> +
> +  "vmovl.u8 q2, d0\n" // Double to 16 bits

There's an instruction that combines vmovl followed by vmul: vmull.u8 (more accurately, it does the multiply on two 8-bit operands and then keeps all 16 bits of the result).

@@ +108,5 @@
> +  "vadd.u16 q2, q6\n" // Add 254
> +  "vadd.u16 q3, q6\n"
> +  "vadd.u16 q4, q6\n"
> +
> +  "vshr.u16 q8, q2, #8\n" // Divide by 255

The sequence you actually want here is
(y + ((y + (y>>7))>>8))>>8
(try plugging that into jlebar's test case and see if it doesn't match; though in the (y+(y>>8))>>8 method's defense, it's only wrong for at most one alpha value per x value)

But, for your convenience, NEON has a "shift right and add" instruction: vsra, as well as a "shift right and narrow" instruction: vshrn. So the correct thing won't take any more instructions than what you've already got.
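
If you want to check that sequence against jlebar's test case above, a variant like this should do it (note the uint32_t intermediate: the y + (y>>7) step can exceed 16 bits):

  #include <stdio.h>
  #include <stdint.h>

  static uint8_t method1(uint16_t x, uint16_t a) {
    return (uint8_t) ((x * a + 254) / 255);
  }

  static uint8_t method3(uint16_t x, uint16_t a) {
    uint32_t y = (uint32_t)x * a + 254;
    return (uint8_t) ((y + ((y + (y >> 7)) >> 8)) >> 8);
  }

  int main() {
    for (uint16_t x = 0; x <= 255; x++)
      for (uint16_t a = 0; a <= 255; a++)
        if (method1(x, a) != method3(x, a))
          printf("Difference at x=%3d, a=%3d\n", x, a);
    return 0;
  }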
(In reply to comment #21)
> to include all of the doubleword registers through d20.  It doesn't really
> matter, though, since there's no more asm in this function.

This is not actually true, if you want to use things like LTO (not with the gcc currently available for Android, but someday...).
(In reply to comment #11)
> (In reply to comment #6)
> > I couldn't figure out how to do
> > 
> >    (x*alpha + 254) / 255
> 
> Division by 255 is common in graphics and has different recipes. The common
> one is (x+127)/255 which is x*257>>8 or (x + x>>8)>>8. Is the 254 correct? I
> seem to remember looking into that before but I don't remember the answer.

This 254 also does not look correct to me. The math involved here can be verified by comparing the difference of calculated results between fixed point implementation and a reference floating point implementation of premultiplication:

double reference_float_premultiply(int alpha, int x)
{
    /* convert to floating point */
    double alpha_f = (double) alpha / 255.0;
    double x_f = (double) x / 255.0;
    /* do the premultiplication */
    double result = x_f * alpha_f;
    /* convert back from [0.0, 1.0] to [0, 255] range, but don't do rounding yet */
    return result * 255.0;
}

int premultiply(int alpha, int x)
{
    return (x * alpha + 127) / 255;
}



for (alpha = 0; alpha <= 255; alpha++)
{
    for (x = 0; x <= 255; x++)
    {
        double result1 = premultiply(alpha, x);
        double result2 = reference_float_premultiply(alpha, x);
        /* do some clever comparison between result1 and result2 */
    }
}
(In reply to comment #25)
> This 254 also does not look correct to me. The math involved here can be
> verified by comparing the difference of calculated results between fixed
> point implementation and a reference floating point implementation of
> premultiplication:

My understanding was the ceil((a/255.0)*x) behavior in the premultiply allows it to be a direct inverse to the floor((255.0/a)*x) behavior in the unmultiply, meaning there's no round-trip error if you unmultiply the data and then premultiply it again. This is decidedly not true if you round both ways.
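
A quick way to check that round-trip property with the current recipes (written from the description above rather than the tree code: premultiply rounds up via the +254 offset, unmultiply truncates):

  #include <assert.h>

  static int premultiply(int x, int a)   { return (x * a + 254) / 255; }
  static int unpremultiply(int x, int a) { return (x * 255) / a; }

  int main() {
    for (int a = 1; a <= 255; a++)
      for (int p = 0; p <= a; p++) // valid premultiplied values satisfy p <= a
        assert(premultiply(unpremultiply(p, a), a) == p);
    return 0;
  }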
> This load and the subsequent store are unaligned as written. You need at least 
> 16-byte alignment to make them any faster (indicated with [r0, :128] for 
> 128-bit alignment in the GNU syntax). But unless 16-byte alignment is something 
> we can expect, you might as well leave them unaligned and get rid of all the 
> alignment code.

As I read the ARM manual, vld4.8 only takes 32-bit alignment.  (Table 4-12, "Permitted combinations of parameters".)  Am I misunderstanding what this means?
(In reply to comment #27)
> As I read the ARM manual, vld4.8 only takes 32-bit alignment.  (Table 4-12,
> "Permitted combinations of parameters".)  Am I misunderstanding what this
> means?

Yes.

See footnote b, "align can be omitted. In this case, standard alignment rules apply," although it's not clear in that manual what the "standard alignment rules" are (see below). But what the '-' entries in the align column basically mean is, "It's never useful to specify an alignment for this access," e.g., it doesn't do you any good to add an alignment for an 8-bit single-lane load to one register. Note also that that table is all for single-lane loads, and we're using a multi-element load here (i.e., Table 4-14).

The actual operation of the instruction is defined by the ARM Architecture Reference Manual ARMv7-A and ARMv7-R edition (page A8-614 in my version). In particular (describing how the instruction is decoded):

alignment = if align == '00' then 1 else 4 << Uint(align)
...
address = R[n]; if (address MOD alignment) != 0 then GenerateAlignmentException();
(In reply to comment #26)
> My understanding was the ceil((a/255.0)*x) behavior in the premultiply
> allows it to be a direct inverse to the floor((255.0/a)*x) behavior in the
> unmultiply, meaning there's no round-trip error if you unmultiply the data
> and then premultiply it again. This is decidedly not true if you round both
> ways.

Hmm, that's an interesting point. But this seems to work for me:

#include <assert.h>

int premultiply(int x, int a)
{
    return (x * a + 127) / 255;
}

int unpremultiply(int x, int a)
{
    return (255 * x + 127) / a;
}

int main()
{
    int a, x;
    for (a = 1; a <= 255; a++)
    {
        for (x = 0; x <= 255; x++)
        {
            assert(x == premultiply(unpremultiply(x, a), a));
        }
    }
    return 0;
}

Or maybe premultiply/unpremultiply behaviour is strictly defined by some web standard?
(In reply to comment #29)
> Hmm, that's an interesting point. But this seems to work for me:

I think you're doing something wrong.

> int unpremultiply(int x, int a)
> {
>     return (255 * x + 127) / a;
> }

127 is not the proper offset to round division by a. Consider x=1, a=1. (255*1+127)/1 == 382, which is a bit off from 255, where it should be. Note also that 382 will get truncated to 126 when it gets stored back to an 8-bit value, which your example doesn't account for, and which causes a mismatch in this case.
Right, the correct offset is "a / 2"

int unpremultiply(int x, int a)
{
    return (255 * x + a / 2) / a;
}
And division free unpremultiply could probably look like this (one table lookup in a small table per pixel):

int unpremultiply(int x, int a)
{
    int shift = 14;
    int mulconst = 255 * (1 << shift) / a; /* do table lookup for this */
    return (x * mulconst + (1 << (shift - 1))) >> shift;
}

It would be great to somehow tweak the code to actually fit 'mulconst' into 16 bits, which would make it better suited for SIMD optimizations.
Attached patch NEON patch v2 (obsolete) — Splinter Review
Thanks for the feedback guys! Here is an updated patch.

Implementing (x + ((x + (x>>7))>>8)) >> 8 is a little clunky, but maybe that's just me not finding a shortcut. However, it doesn't give perfect results - the 255/255 case gives 254 instead of 255. I get the same problem with the test code on desktop, *if* I add some forced casts to uint16_t in there (which is closer to what the NEON code is doing).

Given that, maybe it makes sense to go with the faster and simpler (x + (x>>8))>>8? It has a lot more off-by-one errors, but at most one per alpha?
Attachment #536469 - Attachment is obsolete: true
I'll do a pass over this in a moment.
The Canvas 2D Context spec mandates the following:
http://dev.w3.org/html5/2dcontext/#dom-context-2d-putimagedata
> The handling of pixel rounding when the specified coordinates do not exactly
> map to the device coordinate space is not defined by this specification,
> except that the following must result in no visible changes to the rendering:
>
> context.putImageData(context.getImageData(x, y, w, h), p, q);
> ...for any value of x, y, w, and h and where p is the smaller of x and the
> sum of x and w, and q is the smaller of y and the sum of y and h; and except
> that the following two calls:
>
> context.createImageData(w, h);
> context.getImageData(0, 0, w, h);
>
> ...must return ImageData objects with the same dimensions, for any value of w
> and h.
Is this satisfied?
(In reply to comment 35)
> Is this satisfied?

We should probably have a test for this...
> int initial = 32 - (n % 32);
> if (initial != 32) {
>   // Do enough bytes at the beginning so that we are left with a multiple of 32
>   convert_from_canvas_rgba_noarch(dst, src, initial, premultiplyTable);

I'd really prefer if we didn't have to populate the lookup table when NEON is available; that seems like a waste.

> +  "vmov.i16 q12, #254\n" // a constant we need later
> +"1:\n"
> +  "vld4.8 {d0,d1,d2,d3}, [%[src]]!\n" // interleaved load so that d0-d3 contain R,G,B,A data respectively
> +
> +  "vmull.u8 q2, d3, d0\n" // Multiply by alpha
> +  "vmull.u8 q3, d3, d1\n"
> +  "vmull.u8 q10, d3, d2\n"
> +
> +  "vadd.u16 q13, q2, q12\n" // Add 254
> +  "vadd.u16 q14, q3, q12\n"
> +  "vadd.u16 q15, q10, q12\n"
> +  // now q13 is q2 + 254 = x
> +  "vsra.u16 q13, q13, #7\n" // Divide by 255
> +  "vsra.u16 q14, q14, #7\n"
> +  "vsra.u16 q15, q15, #7\n"
> +  // now q13 is x + (x>>7)
> +  "vshr.u16 q13, q13, #8\n"
> +  "vshr.u16 q14, q14, #8\n"
> +  "vshr.u16 q15, q15, #8\n"
> +  // now q13 is (x + x>>7) >> 8
> +  "vadd.u16 q13, q2\n"
> +  "vadd.u16 q14, q3\n"
> +  "vadd.u16 q15, q10\n"
> +  "vadd.u16 q13, q12\n" // Add 254 again
> +  "vadd.u16 q14, q12\n"
> +  "vadd.u16 q15, q12\n"
> +  // now q13 is x + ((x + x>>7) >> 8)

Maybe Tim can do better than this, but you should at least be able to combine
the final vshr with the next vadd, right?

Also, see comment 22:

 A lot of performance is lost if not using prefetch here, because many of the
 existing ARM processors don't have automatic hardware prefetcher. There is PLD
 instruction available on ARM for this purpose.

You'll want to experiment with how far ahead to prefetch, but in my limited experience, the sweet spot was pretty large.

> +  : [src] "r" (src), [dst] "r" (dst), [last] "r" (last)

Since you modify src and dst, it needs to be

  : [src] "+r" (src), [dst] "+r" (dst), [last] "r" (last)

> +    "q0", "q1", "q2", "q3", "q8", "q9", "q10", "q12", "q13",
> +    "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9",
> +    "d10", "d11", "d12", "d13", "d14", "d15", "d16", "d17", "d18", "d19",
> +    "d20", "d21", "d22", "d23", "d24", "d25", "d26"

Please update this.
(In reply to comment #33)
> Implementing (x + (x + (x>>7))>>8 ) >> 8 is a little clunky but maybe that's
> just my not finding a shortcut. However, it doesn't give perfect results -
> the 255/255 case gives 254 instead of 255. I get the same problem with the
> test code on desktop, *if* I add some forced casts to uint16_t in there
> (which is closer to what the NEON code is doing).
> 
> Given that, maybe it makes sense to go with the faster and simpler (x +
> x>>8)>>8? It has a lot more 1-off errors, but at most one per alpha?

I think it makes a lot of sense to first decide what would be the "correct" implementation before trying hard to optimize it.
(In reply to comment #32)
> And division free unpremultiply could probably look like this (one table
> lookup in a small table per pixel):

It's going to be hard to use a lookup table like this in NEON, because of the long delay to get stuff from NEON to ARM registers so it can be used in an address calculation. Maybe you could load the alpha values a second time on the ARM side. But you'll also need to assemble the constants after you load them on the NEON side... single-lane loads are slow and trying to pack things together with ARM instructions isn't much better.

> It would be great to somehow tweak the code to actually fit 'mulconst' into
> 16 bits, that would make it better suitable for SIMD optimizations.

There's a standard algorithm for generating such constants. See "N-Bit Unsigned Division via N-Bit Multiply-Add," In Proc. 17th IEEE Symposium on Computer Arithmetic (ARITH'05), pp. 131--139, June 2005.

The gist of it boils down to Algorithm 1 (reproduced here for those without access to the paper):

Inputs: uword d and n, with n>=1 and 1<=d<(1<<n)
int m := floor(log2(d));
uword a, b;
if d == (1<<m) then
    a := (1<<n)-1;
    b := (1<<n)-1;
else
    uword t := floor((1<<(m+n))/d);
    uword r := (t*d+d) mod (1<<n)
    if r <= (1<<m) then
        a := t+1;
        b := 0;
    else
        a := t;
        b := t;
    endif
endif

Then the division is equivalent to ((a*x+b)>>(m+n)). In this case n==16. I don't know if this will work better than the constants you proposed. This method only requires one multiplication, instead of two, which will be the slowest step in NEON. But the table needs values for a and b (16-bits each) and m (0 to 7). b is either the same as a, or 0, so in theory you could pack all three parameters into 3 bytes, but taking advantage of that may cost more computation than it saves cache misses.
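
Transcribed to C for concreteness (a direct rendering of Algorithm 1 above, with n fixed at 16; for d=255 it produces the a=0x8081, b=0, m=7 constants mentioned below):

  #include <stdint.h>

  // After this returns, x / d == (a * x + b) >> (m + 16) for any 16-bit x.
  static void division_constants(uint32_t d, uint32_t *a, uint32_t *b, int *m)
  {
    const int n = 16;
    int mm = 0;
    while ((2u << mm) <= d)            // mm = floor(log2(d))
      mm++;
    if (d == (1u << mm)) {
      *a = (1u << n) - 1;
      *b = (1u << n) - 1;
    } else {
      uint32_t t = (1u << (mm + n)) / d;
      uint32_t r = (t * d + d) % (1u << n);
      if (r <= (1u << mm)) { *a = t + 1; *b = 0; }
      else                 { *a = t;     *b = t; }
    }
    *m = mm;
  }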

In any case, this lookup table approach is not so obvious a win as the premultiply case.

(In reply to comment #38)
> I think it makes a lot of sense to first decide what would be the "correct"
> implementation before trying hard to optimize it.

This is true, but in general we're going to need a "divide by 255" step.

For the premultiply case (d=255), a=0x8081, b=0, and m=7. So (y+((y+(y>>7))>>8))>>8 was my attempt to compute y*0x8081>>23 without a 16x16->32 multiply, because NEON doesn't have one that returns just the high 16 bits (SSE does). Alon is right, though, that the (y+(y>>7)) step can overflow a 16-bit integer.

> uint8_t premultiply_neon_ready(uint8_t x, uint8_t a)
> {
>    uint32_t tmp = x * a + 128;
>    return (tmp + (tmp >> 8)) >> 8;
> }

I notice you adding 128 here instead of 127. I wondered about this in Jeff's comment 11, because if you don't add the extra 1, (x+(x>>8))>>8 is certainly not equivalent to x/255. However, this trick appears to work for 254 as well. I.e., y=x*a+255; return (y+(y>>8))>>8 gives the correct value for all cases (I didn't check all integers, but it is at least correct for all pairs of x and a; it also doesn't overflow 16 bits). So we can use that formulation regardless of which rounding offset we decide on.
(In reply to comment #37)
> see comment 22:
> 
>  A lot of performance is lost if not using prefetch here, because many of the
>  existing ARM processors don't have automatic hardware prefetcher. There is
> PLD
>  instruction available on ARM for this purpose.
> 
> You'll want to experiment with how far ahead to prefetch, but in my limited
> experience, the sweet spot was pretty large.
> 

I am not sure I understand exactly what PLD does from the ARM manual. It basically hints the CPU to prefetch data into its cache? So at the beginning of the loop, I would tell it to prefetch the address I will need for the iteration after the current one, something like that?

> > +  : [src] "r" (src), [dst] "r" (dst), [last] "r" (last)
> 
> Since you modify src and dst, it needs to be
> 
>   : [src] "+r" (src), [dst] "+r" (dst), [last] "r" (last)

GCC complains "error: input operand constraint contains '+'" with that. If I move src and dst to be output operands, it compiles but doesn't work (I suspect that src and dst are no longer being read). I'm probably missing something basic here. But in any case, while we do write to src and dst, they are not used as outputs (we don't read them after the asm block) - so is it still strictly necessary to mark them as "+"? I can't figure this out from the GCC docs, but it compiles and runs properly with just "r".
> I am not sure I understand exactly what PLD does from the ARM manual. It 
> basically hints the CPU to prefetch data into its cache? So at the beginning of 
> the loop, I would tell it to prefetch the address I will need for the iteration 
> after the current one, something like that?

Yup, an iteration or more after the current one.  If you prefetch too close to the current iteration, the data won't be ready by the time you want it.  But if you prefetch too far from the current iteration, you could end up knocking the prefetched data out of cache before you need it!

> GCC complains "error: input operand constraint contains '+'" with that. If I 
> move src and dst to be output operands, it compiles but doesn't work (I 
> suspect that src and dst are no longer being read).

Ah, yes, +r only applies to output regs.  So 

 : [src] "+r" (src), [dst] "+r" (dst)
 : [last] "r" (last)
 
doesn't work?

> But in any case, while we do write to src and dst, they are not used as 
> outputs (we don't read them after the asm block) - so is it still strictly 
> necessary to mark them as "+"?

Since GCC isn't doing any intraprocedural optimizations, it doesn't matter AFAIK.  But in a hypothetical LTO world, you'd want to inform the compiler that you modify as well as read src and dst.
(In reply to comment #42)
> 
> > GCC complains "error: input operand constraint contains '+'" with that. If I 
> > move src and dst to be output operands, it compiles but doesn't work (I 
> > suspect that src and dst are no longer being read).
> 
> Ah, yes, +r only applies to output regs.  So 
> 
>  : [src] "+r" (src), [dst] "+r" (dst)
>  : [last] "r" (last)
>  
> doesn't work?
> 

It compiles but nothing is actually done when the code runs. I suspect src and dst do not contain the right values when they are listed as outputs.
(In reply to comment #38)
> I think it makes a lot of sense to first decide what would be the "correct"
> implementation before trying hard to optimize it.

So, the current behavior comes from bug 389366 comment 1. As long as the two properties listed there ({255, 255, 255, a} round-trips to {255, 255, 255, a} and premultiply(unmultiply(x))==x) are satisfied, I think we can use whatever rounding we want. Given that, I think 127 and (a/2) make more sense as rounding offsets than 254 and 0.
For what it's worth Chrome uses:

/** Return (a*b)/255, taking the ceiling of any fractional bits. Only valid if
    both a and b are 0..255. The expected result equals (a * b + 254) / 255.
 */
static inline U8CPU SkMulDiv255Ceiling(U8CPU a, U8CPU b) {
    SkASSERT((uint8_t)a == a);
    SkASSERT((uint8_t)b == b);
    unsigned prod = SkMulS16(a, b) + 255;
    return (prod + (prod >> 8)) >> 8;
}
I'd also recommend not changing the rounding behavior as part of this bug. If we do decide to change the rounding behaviour we might want to encourage Webkit to change to. I would also suggest documenting some of the rationale for the current behaviour in the code, so we don't need to think as much the next time this comes up.
Attached patch NEON patch v3 (obsolete) — Splinter Review
Updated patch. Uses PLD and (x*y+254)/255 = ((x*y+255) + ((x*y+255)>>8))>>8 (which is true for all 8-bit x, y).

I get 2.45ms compared to 7.5ms with the slow path, so over 3x faster :)

Did I forget anything? I'm not sure what was said before about the premultiply table - do we want to remove it in the slow path?
Attachment #536798 - Attachment is obsolete: true
> I'm not sure what was said before about the premultiply 
> table - do we want to remove it in the slow path?

The issue is that the premultiply table needs to be computed, and that's a waste of time if we have NEON.  So I'd like to use the table-based method only when we don't have NEON, and only call EnsurePremultiplyTable() in that case.  When we do have NEON, we can just do an element-by-element arithmetic conversion for the alignment step.

I'm not sure what's going on with the +r business, but I'll try to get a mobile build environment going tomorrow so I can play around with it.

> I get 2.45ms compared to 7.5ms with the slow path, so over 3x faster :)

Awesome!
Comment on attachment 536969 [details] [diff] [review]
NEON patch v3

>+  "pld [%[src], #32]\n" // a little preloading helps the bits go down
>    ...
>+  "pld [%[src], #64]\n" // a little more preloading, for the next cache line

The loop does src += 32, so you end up preloading each address twice.  Is it any slower if you just do

  pld [%[src], #64]

at the beginning of the loop in place of both existing pld calls?
Even if we have NEON, though, we still need the slow path for cases where the number of pixels is not a multiple of 8. Or are you saying we shouldn't create the table, and should instead do the math directly on each pixel for those?

If so, is the suggestion to have three code paths (table, neon, and no-neon-no-table), or to stop using the table in the slow path and just do (x*y+255 + (x*y+255)>>8)>>8?
> Even if we have NEON, though, we still need the slow path for cases where the 
> number of pixels is not a multiple of 8.

Yes.  But when we have NEON, we need to process at most 31 pixels using the slow path, right?

> If so, is the suggestion to have three code paths (table, neon, and no-neon-
> no-table), or to stop using the table in the slow path and just do (x*y+255 + 
> (x*y+255)>>8)>>8?

The table is still faster than arithmetic on x86 and when we don't have NEON, so yes, I was thinking three code paths.

(Note that you probably don't need to code up the arithmetic code path explicitly using shifts -- it looks like GCC is more than capable of converting the division to bit twiddling.)
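
In other words, something shaped roughly like this (just a sketch; whether the runtime check is mozilla::supports_neon() or something else is a detail for the patch):

  void convert_from_canvas_rgba(PRUint8 *dst, const PRUint8 *src, int n)
  {
  #if defined(MOZILLA_MAY_SUPPORT_NEON) && defined(IS_LITTLE_ENDIAN)
    if (mozilla::supports_neon()) {
      // NEON path; handles the few leftover pixels arithmetically, no table needed.
      convert_from_canvas_rgba_neon_littleendian(dst, src, n);
      return;
    }
  #endif
    // No NEON: build the table once and use the table-based loop.
    EnsurePremultiplyTable();
    convert_from_canvas_rgba_noarch(dst, src, n, sPremultiplyTable);
  }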
(In reply to comment #49)
> Comment on attachment 536969 [details] [diff] [review] [review]
> NEON patch v3
> 
> >+  "pld [%[src], #32]\n" // a little preloading helps the bits go down
> >    ...
> >+  "pld [%[src], #64]\n" // a little more preloading, for the next cache line
> 
> The loop does src += 32, so you end up preloading the each address twice. 
> Is it any slower if you just do
> 
>   pld [%[src], #64]
> 
> at the beginning of the loop in place of both existing pld calls?

Nice, it is faster that way, 2.35 ms vs 2.45 :)
You may be able to squeeze a bit more out of it by experimenting with changing the #64.  Maybe it should be #32, #96, or higher.
This is an initial preview of the NEON code for "(255 * x + a / 2) / a" type of unpremultiply. A basic benchmark from Galaxy Tab (1GHz ARM Cortex-A8):
    Small buffer (L1 cache) : 156.42 MPix/s
    Large buffer (memory)   : 130.54 MPix/s

Which means that even this naive implementation is already (barely) fast enough to be mostly limited by the available memory bandwidth on this device (and random pixels vs. zero filled source buffer does not make any significant difference). Though more headroom still could be useful.

The remaining things to do are just:
1. proper support for non multiple of 8 pixels buffer sizes
2. browser specific style of rounding (+254 offset).
(In reply to comment #40)
> There's a standard algorithm for generating such constants. See "N-Bit
> Unsigned Division via N-Bit Multiply-Add," In Proc. 17th IEEE Symposium on
> Computer Arithmetic (ARITH'05), pp. 131--139, June 2005.

Yes, I know. I have seen a number of similar papers related to replacing divisions with multiplications.

> The gist of it boils down to Algorithm 1 (reproduced here for those without
> access to the paper):
> 
> Inputs: uword d and n, with n>=1 and 1<=d<(1<<<n)
> int m := floor(log2(d));
> uword a, b;
> if d == (1<<m) then
>     a := (1<<n)-1;
>     b := (1<<n)-1;
> else
>     uword t := floor((1<<(m+n))/d);
>     uword r := (t*d+d) mod (1<<n)
>     if r <= (1<<m) then
>         a := t+1;
>         b := 0;
>     else
>         a := t;
>         b := t;
>     endif
> endif
> 
> Then the division is equivalent to ((a*x+b)>>(m+n)). In this case n==16. I
> don't know if this will work better than the constants you proposed. This
> method only requires one multiplication, instead of two, which will be the
> slowest step in NEON.

I specifically tried to tailor the algorithm to better fit the available NEON instructions that we have. The initial 8-bit long multiplication is almost free, because widening of 8-bit data to prepare it for 16-bit multiplication is still needed in any case.

As for the multiplications, they are reasonably fast with NEON. Peak performance is:
 1/8 cycle per multiplication for 8-bit data
 1/4 cycle per multiplication for 16-bit data
 1 cycle per multiplication for 32-bit data
Latency is not an issue if the code can be sufficiently unrolled and pipelined. And this is our case.

Having 8-bit multiplication followed by 16-bit multiplication is faster than a single 32-bit multiplication with NEON.

And I got these table constants by simple bruteforcing, there is no particular need for any clever algorithm in this particular case :)

> But the table needs values for a and b (16-bits each)
> and m (0 to 7). b is either the same as a, or 0, so in theory you could pack
> all three parameters into 3 bytes, but taking advantage of that may cost
> more computation than it saves cache misses.

Yes, it's still faster to have 4 bytes per table entry even though this wastes 1 byte of space.
(In reply to comment #46)
> If we do decide to change the rounding behaviour we might want to encourage
> Webkit to change to.

Do they have a bugtracker? Or what is the preferred way to contact them? It would be kind of funny if it turns out that they are also doing premultiplication in this weird way exactly because they are trying to be more compatible with Mozilla ;)
(In reply to comment #55)
> I specifically tried to tailor the algorithm to better fit the available
> NEON instructions that we have. The initial 8-bit long multiplication is
> almost free, because widening of 8-bit data to prepare it for 16-bit
> multiplication is still needed in any case.

I thought VMULL had a 5- to 6-cycle latency. You've only got 3 8-bit muls, so that's not enough to hide all the latency. It's not a lot of room to do much of anything fancier, I'll agree.

> Having 8-bit multiplication followed by 16-bit multiplication is faster than
> a single 32-bit multiplication with NEON.

Neither approach needs a 32-bit mul. They both use 16x16->32.

> Yes, it's still faster to have 4 bytes per table entry even though this
> wastes 1 byte of space.

I agree, I was just pointing out that with two 16-bit parameters and a 3-bit parameter, you needed _more_ than 4 bytes with my approach, unless you take advantage of the fact that b=={0 or a} (though that means more work at run-time).

In any case, this came out somewhat better than I was expecting. There's probably still a few cycles to be wrung out of it, for example by moving up the first vuzp (or maybe replacing the first two with a single vtrn? fewer instructions, but the same latency... you'd need something to stick in-between that and the third vuzp, though), as well as moving up the vld4/vmin's, to get better balance between the LS and NEON pipelines, and eliminating the use of q4-q7.

As for the vmin's themselves, that's not the behavior of the current C code (particularly for alpha==0 it passes the colors through unchanged). But I don't know if it's even possible to have "super-luminescent" values in our canvas data (and I certainly don't think the current behavior for alpha>0 is very useful).
Attached patch NEON patch v4Splinter Review
Updated multiply patch. Changed preload to 256 (I checked lots of values, that seems optimal), and does not create the premultiply table if we don't need it.

I don't understand if the recent discussion in the last few comments here is relevant to this patch, or to future work? Do we want to split things out into separate bugs perhaps?
Attachment #536969 - Attachment is obsolete: true
Comment on attachment 537179 [details] [diff] [review]
NEON patch v4

> +  "vadd.u16 q8, q2\n" // add 254&divide by 255: we calculate that as ((x+255) + ((x+255)>>8))>>8

Nit: spaces around "&" (or make it "and").

> +    "q2", "q8", "q9", "q10",
> +    "d0", "d1", "d2", "d3", "d4", "d5",
> +    "d16", "d17", "d18", "d19", "d20", "d21"

Since you clobber d0, d1, d2, d3, I think you need to inform gcc that you clobber q0, q1.

This looks good to me, although I'd really like to understand what's going on with +r not working before we check it in.

I'm not totally pleased with the fact that logic around the premultiply table passes through these two files so much, but I think the right way to work around this is to move this code into gfxUtils, where we don't have to worry about storing the premultiply table in a separate file from where we use it.  

Would you be interested in doing that, Alon?  It could be a separate patch, or a separate bug, if you'd like.

Let's move the discussion about the unpremultiply step into a separate bug, please, since the premultiply patch is useful on its own.
Attachment #537179 - Flags: feedback+
With +r it "optimizes" out the whole neon section...
But it seems to generate the right asm with +r and volatile.  Not sure why it needs a volatile.
(In reply to comment #61)
> But it seems to generate the right asm with +r and volatile.  Not sure why
> it needs a volatile.

Because gcc is dumb. I assume it does not interpret the "memory" clobber as meaning you actually wrote useful values to memory, just that you _might_ have changed something. You'd think not having any output operands would also get a block scheduled for removal by the optimizer, but I suppose that broke too much code for them to actually use that strategy. But removing the code once you give it output operands and then don't use the results apparently did not.
(In reply to comment #28)
> (In reply to comment #27)
> > As I read the ARM manual, vld4.8 only takes 32-bit alignment.  (Table 4-12,
> > "Permitted combinations of parameters".)  Am I misunderstanding what this
> > means?
> 
> Yes.

Thanks a lot for clarifying this.
(In reply to comment #57)
> > Having 8-bit multiplication followed by 16-bit multiplication is faster than
> > a single 32-bit multiplication with NEON.
> 
> Neither approach needs a 32-bit mul. They both use 16x16->32.

It may still be interesting to try 32-bit multiplications for non-SIMD code. The table only needs to be modified to contain single 32-bit values (c8 * c16) instead of pairs.

> In any case, this came out somewhat better than I was expecting. There's
> probably still a few cycles to be wrung out of it, for example by moving up
> the first vuzp (or maybe replacing the first two with a single vtrn? fewer
> instructions, but the same latency... you'd need something to stick
> in-between that and the third vuzp, though), as well as moving up the
> vld4/vmin's, to get better balance between the LS and NEON pipelines, and
> eliminating the use of q4-q7.

As I already mentioned, this code was just a demonstration of how this algorithm maps to NEON instructions, so readability was clearly preferred. It was intentionally not optimized at all, and there is a lot of room for improvement. Yes, it's possible to eliminate all the stalls for ARM Cortex-A9 and also at least partially dual-issue VLD/VST/VUZP/VMOVN instructions on ARM Cortex-A8.

Earlier I provided a link to the existing NEON premultiply code from pixman, it can be used as an example of how such loops can be optimized to get better use of dual-issue and hide instruction latencies for ARM NEON:
    http://cgit.freedesktop.org/pixman/tree/pixman/pixman-arm-neon-asm.S?id=pixman-0.22.0#n2202

The 'pixman_composite_src_pixbuf_8888_process_pixblock_tail_head' macro contains all the code from the inner loop, which gets executed per 8 pixels. Macro 'fetch_src_pixblock' expands to a single VLD4.8 instruction (alternatively it can be also expanded to a simple code chunk implementing nearest scaling fetcher, but that's not interesting for us here). The instructions with 'PF' prefix implement cross-scanline prefetching, which is useful when image width is much smaller than stride, and they normally don't cause any overhead because they are run on ARM pipeline.

So optimized premultiply function is quite trivial to implement (with any variant of rounding) using the existing macro templates. Unfortunately these templates can't be effectively used for unpremultiply function, because of the need to do parallel access to the source buffer using ARM instructions.

> As for the vmin's themselves, that's not the behavior of the current C code
> (particularly for alpha==0 it passes the colors through unchanged). But I
> don't know if it's even possible to have "super-luminescent" values in our
> canvas data (and I certainly don't think the current behavior for alpha>0 is
> very useful).

There are two tricky cases. One of them is the handling of "super-luminescent" values (a term which actually originates from Jim Blinn's book "Dirty Pixels", according to Soeren Sandmann [1]). These values can't be obtained as a result of premultiplication, but the unpremultiply function still has to do something sane when it gets a buffer that just happens to contain such values. Mozilla's current behaviour is to just happily overflow and return whatever gets calculated, modulo 256. That behaviour is quite tricky to emulate in SIMD-optimized code if it suddenly becomes a requirement (hopefully it will not). An alternative way of dealing with "super-luminescent" values is to just saturate the result to 255. That would probably produce a somewhat smaller difference from the original value after an unpremultiply/premultiply round-trip. Those VMIN instructions serve this purpose, but it is also perfectly fine to change VMOVN -> VQMOVN and get the same end result.

Another case is when alpha == 0. Division by zero is clearly unwanted. Normal premultiplied pixels also can't have a color component set to anything other than 0 when alpha is 0 (otherwise it would be just another variant of "super-luminescent" pixel). And any color component value converts to zero when premultiplying it with zero alpha, so technically we can't even be too wrong by returning any arbitrary value when unpremultiplying zero alpha. There are still some alternatives:

a) Return some constant (for example 0) with the code:

    if (a == 0)
        return 0;

b) Return the color component value itself:

    if (a == 0)
        return x;

c) Return 255 when alpha == 0 and color component != 0 just to be consistent with "super-luminescent" pixels handling:

    if (a == 0)
        return x > 0 ? 255 : 0;

All of these variants are possible with my variant of NEON unpremultiply code by just tweaking the table entry for alpha == 0 and getting rid of VMIN instructions. It would be:
a) { 0, 0 }
b) { 2, 32768 }
c) { 255, 65535 }

Regarding unpremultiply with the Mozilla-specific rounding, it's a bit more tricky and the algorithm needs some modification:

uint8_t mozunpremultiply_neon_ready(uint8_t x, uint8_t a)
{
    uint32_t tmp;
    /* a single lookup in a 1K size table */
    struct mozunpremultiply_consts consts = mozunpremultiply_consts_array[a];
    /* unsigned 8-bit long multiplication */
    tmp = x * consts.mul8;
    /* unsigned 16-bit long multiplication */
    tmp = tmp * consts.mul16;
    /* add some offset and shift the value */
    tmp = (tmp + consts.add8) >> 16;
    /* saturate instead of using modulo 256 */
    return tmp <= 255 ? tmp : 255;
}

There is a need for the new 8-bit constant, but we conveniently have a vacant space for it in the lookup table. The operation "(tmp + consts.add8) >> 16" maps to VADDHN.U32 instruction, so it does not become a performance disaster. If you would like to have a look at it, I can provide a complete NEON code. Also not fully performance optimized yet.


1. http://www.mail-archive.com/pixman@lists.freedesktop.org/msg00465.html
(In reply to comment #59)
> Let's move the discussion about the unpremultiply step into a separate bug,
> please, since the premultiply patch is useful on its own.

I'm not so sure about this. The premultiply and unpremultiply operations are closely related and need to use the same math. If that was not important, then we would have already almost no work to do here. Mozilla's sources already contain NEON optimized premultiply function (see comment 1 and comment 22), it would just need to be called from the right place and we are done.

So instead I propose:
1. Add NEON optimized premultiply/unpremultiply functions using existing Mozilla's rounding (tweaking unpremultiply to use saturation instead of modulo 256). And make sure that they are used for both canvas *and* gfxUtils.cpp (thus also solving bug 534215)
2. Create a new bug about changing premultiply/unpremultiply rounding behaviour. Because current rounding seems wrong and round-trip unpremultiply/premultiply requirement is fulfilled by just having two bugs, which happen to cancel each other.
(In reply to comment #65)
> > Let's move the discussion about the unpremultiply step into a separate bug,
> > please, since the premultiply patch is useful on its own.
> 
> I'm not so sure about this. The premultiply and unpremultiply operations are

No, I agree with jlebar here. They need to use the same math, yes, but the current C code for the unmultiply step already uses compatible math with the proposed premultiply asm. Adding unmultiply asm is a separate issue, and should be tracked in its own bug.
Register q2 is initialized with "vmov.u16 q2, #255". The inner loop of the
NEON code looks as follows (different instruction streams use different
indentation for better readability):

        vshr.u16    q11, q8,  #8
        vswp        d3,  d31
        vshr.u16    q12, q9,  #8
        vshr.u16    q13, q10, #8
    vld4.8 {d0, d1, d2, d3}, [SRC]!
        vaddhn.u16  d30, q11, q8
        vaddhn.u16  d29, q12, q9
    vrev32.16   q8,  q2
        vaddhn.u16  d28, q13, q10
    vrev32.16   q9,  q2
    vmlal.u8    q8,  d3, d0
    vrev32.16   q10, q2
    vmlal.u8    q9,  d3, d1
    vmlal.u8    q10, d3, d2
        vst4.8 {d28, d29, d30, d31}, [DST_W, :128]!
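
For reference, here is a rough scalar C model of what each iteration above computes per pixel. This is just a sketch to make the math explicit (it is not code from any patch): each colour channel gets the (value + 255) rounding and the result is written out in B,G,R,A byte order.

#include <stdint.h>

/* Scalar model of one pixel of the NEON loop above (illustrative only):
 * q8/q9/q10 start at 255 in every 16-bit lane, vmlal.u8 adds a * channel,
 * and the vshr.u16 + vaddhn.u16 pair computes (v + (v >> 8)) >> 8. */
static void premultiply_pixel_model(const uint8_t rgba[4], uint8_t bgra[4])
{
    uint8_t a = rgba[3];
    for (int c = 0; c < 3; c++) {
        uint32_t v = 255u + (uint32_t)a * rgba[c];     /* vmov #255 + vmlal.u8 */
        bgra[2 - c] = (uint8_t)((v + (v >> 8)) >> 8);  /* vshr + vaddhn        */
    }
    bgra[3] = a;                                       /* vswp d3, d31         */
}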

And the synthetic benchmark numbers are the following:

=== premultiply code, based on pixman template ===

  Cortex-A8 1GHz (Galaxy Tab):
    Small buffer (L1 cache) : 375.64 MPix/s
    Large buffer (memory)   : 163.11 MPix/s

  Cortex-A9 1GHz (early prototype board with slow RAM):
    Small buffer (L1 cache) : 291.43 MPix/s
    Large buffer (memory)   : 63.66 MPix/s

=== NEON patch v4 from Alon Zakai ===

  Cortex-A8 1GHz (Galaxy Tab):
    Small buffer (L1 cache) : 285.14 MPix/s
    Large buffer (memory)   : 152.73 MPix/s

  Cortex-A9 1GHz (early prototype board with slow RAM):
    Small buffer (L1 cache) : 217.60 MPix/s
    Large buffer (memory)   : 63.96 MPix/s


This all shows that memory bandwidth on ARM devices is quite far from being
able to keep up with the performance of the NEON unit. And the same applies
to almost all basic non-scaled 2D graphics operations.

BTW, I placed an order for http://www.origenboard.org/ some time ago (it should have mostly the same hardware as the Galaxy S2). Let's see if it lives up to its promise of providing "high bandwidth DDR3". That's especially interesting, considering that the NEON unit was downgraded in Cortex-A9.
(In reply to comment #66)
> No, I agree with jlebar here. They need to use the same math, yes, but the
> current C code for the unmultiply step already uses compatible math with the
> proposed premultiply asm. Adding unmultiply asm is a separate issue, and
> should be tracked in its own bug.

I can agree only on one condition: even if those bugs are tracked separately, they need to be solved at the same time, and we would be better off doing this *right now* so that we don't waste extra time getting back to them months later. There are already enough unattended bugs technically related to the same issue:
- bug 534215 ('Optimize WebGL premultiply')
- bug 594883 ('Speed up ARGB premultiply in PNG decoder ')
- bug 633467 ('2 Premultiply tables both gfxUtil and nsCanvasRenderingContext2D')

And from this list, bug 594883 is quite interesting because it shows that there is *one more* way of doing premultiplication (+0 offset) in the Mozilla source code:
    http://mxr.mozilla.org/mozilla2.0/source/gfx/thebes/gfxColor.h#129
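
Just to illustrate what the offset difference means in practice (a rough sketch based on the descriptions in this bug, not the actual gfxColor.h code), compare:

#include <stdint.h>

/* Rounding with the +255 offset, the flavour used by the canvas patch in this bug. */
static uint8_t premultiply_plus255(uint8_t x, uint8_t a)
{
    uint32_t v = (uint32_t)x * a + 255;
    return (uint8_t)((v + (v >> 8)) >> 8);
}

/* Illustrative "+0 offset" variant: here x = 255, a = 255 already comes out
 * as 254 instead of 255, which is why the rounding choice matters. */
static uint8_t premultiply_plus0(uint8_t x, uint8_t a)
{
    uint32_t v = (uint32_t)x * a;
    return (uint8_t)((v + (v >> 8)) >> 8);
}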

According to your suggestion, I have submitted two new bugs:
- bug 662130 ('Wrong rounding used for alpha (un)premultiply operations in Canvas 2D and WebGL')
- bug 662134 ('Implement ARM NEON optimization for alpha unmultiply operation for Canvas 2D and WebGL')

The only special thing about this particular bug 659725 is that it got into the spotlight, so competent people are readily available to comment and review patches. This unfortunately does not appear to be the case for the other closely related bugs, and they don't get enough attention.
Comment on attachment 537179 [details] [diff] [review]
NEON patch v4

Very impressive results Siarhei!

Based on them I am marking my patch as obsolete. Siarhei's patch looks like the way to go. Is it ready for review? If so let's mark it.
Attachment #537179 - Attachment is obsolete: true
(In reply to comment #69)
> Siarhei's patch looks like the way to go. Is it ready for review?
> If so let's mark it.

My patch only shows that it is still possible to improve instruction scheduling for the premultiply code, but it will not show any substantial practical improvement unless it is used on L1/L2-cached data. Which means that the performance of your patch is also fine for the practical usage. As the performance is memory bandwidth limited, the best that can be done is to avoid walking over large pixel buffers multiple times and to process as much as possible in a single pass.

The biggest problem with my patch is the reliance on the header file from pixman, which is not suitable for the unmultiply function anyway. That's why I think it is better to write new ARM NEON macro-based code specifically for premultiply/unmultiply (see the interface sketch at the end of this comment), with support for:
1. arbitrary ARGB color components layout for both source and destination
2. different variants of rounding
3. optional checking whether there are any non-opaque pixels (needed for PNG in bug 594883)

As soon as such a mini library with a collection of optimized routines is ready, it can be used to solve the whole class of premultiply/unmultiply-related performance problems.

And the part of your patch which moves the premultiply functionality into a separate file (not only the assembly, but also the generic C implementation) is actually the right way to go, in my opinion.
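
To make this proposal a bit more concrete, here is a purely hypothetical sketch of what the entry point of such a mini library could look like. None of these names exist anywhere yet; they just map the three requirements above to parameters:

#include <stddef.h>
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical pixel layouts and rounding modes (requirements 1 and 2). */
typedef enum { PIXEL_LAYOUT_RGBA, PIXEL_LAYOUT_BGRA, PIXEL_LAYOUT_ARGB } pixel_layout_t;
typedef enum { ROUND_OFFSET_0, ROUND_OFFSET_255 } round_mode_t;

/* Premultiplies n pixels from src into dst, converting between layouts.
 * Returns true if at least one non-opaque pixel was seen, which is the
 * check the PNG decoder needs in bug 594883 (requirement 3). */
bool premultiply_pixels(uint8_t *dst, pixel_layout_t dst_layout,
                        const uint8_t *src, pixel_layout_t src_layout,
                        round_mode_t rounding, size_t n);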
>  Which means that the performance of your patch is also fine for the practical 
> usage.

I agree.  The interleaved scheduling is clever, but our lives will be easier if we don't rely on all that magic.
Comment on attachment 537179 [details] [diff] [review]
NEON patch v4

Ok, cool.

Asking for review on my patch then.
Attachment #537179 - Attachment is obsolete: false
Attachment #537179 - Flags: review?(bas.schouten)
(In reply to comment #71)
> I agree.  The interleaved scheduling is clever,

That's called "software pipelining". There are many pages around in the Internet describing it, but some basic information is even available from wikipedia:
    http://en.wikipedia.org/wiki/Software_pipelining

In pixman we use simple 2-stage pipelining (stage 1 is called 'head' and stage 2 is called 'tail' for convenience). Processing of each pixel spans two loop iterations, which is also useful when dealing with non-uniform latencies for different instructions, where simple unrolling would just fail.

> but our lives will be easier if we don't rely on all that magic.

It would be very easy to also add software pipelining to Alon's code, especially considering that it does not have to handle block sizes that are not multiples of 8 pixels.
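
For anyone unfamiliar with the idea, here is a rough C sketch of the 2-stage head/tail structure. The stage functions are placeholders standing in for the real NEON load/compute and store sequences; nothing here is from pixman or from any patch:

#include <stdint.h>

#define BLOCK_PIXELS 8

typedef struct { uint32_t px[BLOCK_PIXELS]; } block_t;

/* Placeholder stage 1 ('head'): issue the loads / long-latency work. */
static block_t head(const uint32_t *src)
{
    block_t b;
    for (int i = 0; i < BLOCK_PIXELS; i++)
        b.px[i] = src[i];
    return b;
}

/* Placeholder stage 2 ('tail'): finish the arithmetic and store. */
static void tail(const block_t *b, uint32_t *dst)
{
    for (int i = 0; i < BLOCK_PIXELS; i++)
        dst[i] = b->px[i];
}

/* Two-stage software pipelined loop (nblocks >= 1): the head of block i
 * overlaps with the tail of block i-1, so each block spans two iterations. */
static void pipelined_convert(uint32_t *dst, const uint32_t *src, int nblocks)
{
    block_t cur = head(src);
    src += BLOCK_PIXELS;
    for (int i = 1; i < nblocks; i++) {
        block_t next = head(src);       /* start block i          */
        src += BLOCK_PIXELS;
        tail(&cur, dst);                /* finish block i-1       */
        dst += BLOCK_PIXELS;
        cur = next;
    }
    tail(&cur, dst);                    /* finish the last block  */
}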
> That's called "software pipelining".

I took a class a while back from Monica Lam, who loves this stuff.  I guess on ARM, where we have details about how the microarchitecture works, it might be feasible to take NEON compiler intrinsics and automatically software pipeline them in a reasonable way.

In fact, gcc even has -fmodulo-sched.  But I'm not sure it does anything.  And anyway GCC totally fails at NEON intrinsics.  Guess we have to continue relying on our brains for some time more.  :)
(In reply to comment #72)
> Comment on attachment 537179 [details] [diff] [review] [review]
> NEON patch v4
> 
> Ok, cool.
> 
> Asking for review on my patch then.

+  "bne 1b\n"
+  :
+  : [src] "r" (src), [dst] "r" (dst), [last] "r" (last)
+  : "cc", "memory",
+    "q2", "q8", "q9", "q10",
+    "d0", "d1", "d2", "d3", "d4", "d5",
+    "d16", "d17", "d18", "d19", "d20", "d21"
+  );

Modifying registers from the input operands list is wrong, even if gcc seems to misbehave when you put them in as "+r" output operands. Just try changing this inline assembly block to "asm volatile" and using earlyclobber "+&r" operands instead of "+r". I'm not sure whether they are really supposed to make any difference in this particular case or whether some gcc bug gets exposed, but if, for example, you use:
    int copy_of_src = src;
    asm (
        " ... do something ..."
        : [src] "+r" (src)
        : [copy_of_src] "r" (copy_of_src)
        : "cc", "memory");
Then gcc is perfectly fine with assigning the same register to the "src" and "copy_of_src" variables. So in order to be on the safe side, it's better to use the earlyclobber modifier (&).
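
In other words, something along these lines (a minimal sketch, not the actual patch; the loop body is just a plain copy, and the real premultiply/reorder instructions would go where the comment is):

#include <stdint.h>

/* The pointers that the asm post-increments are declared as earlyclobber
 * read-write outputs ("+&r") instead of plain inputs, and the block is
 * marked volatile. Assumes (last - src) is a non-zero multiple of 32 bytes. */
static void convert_block_neon(uint8_t *dst, const uint8_t *src, const uint8_t *last)
{
    asm volatile (
        "1:\n"
        "vld4.8   {d0, d1, d2, d3}, [%[src]]!\n"
        /* ... premultiply / channel reorder instructions go here ... */
        "vst4.8   {d0, d1, d2, d3}, [%[dst]]!\n"
        "cmp      %[src], %[last]\n"
        "bne      1b\n"
        : [src] "+&r" (src), [dst] "+&r" (dst)  /* modified by the asm */
        : [last] "r" (last)                     /* read-only input     */
        : "cc", "memory", "d0", "d1", "d2", "d3"
    );
}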
Bas, review ping?
3-month review ping?
Comment on attachment 537179 [details] [diff] [review]
NEON patch v4

Review of attachment 537179 [details] [diff] [review]:
-----------------------------------------------------------------

Derf, could you look at this first? I'm not comfortable enough with ARM to be sure this is exactly right.
Attachment #537179 - Flags: review?(bas.schouten) → review?(tterribe)
Comment on attachment 537179 [details] [diff] [review]
NEON patch v4

Review of attachment 537179 [details] [diff] [review]:
-----------------------------------------------------------------

The actual NEON code looks fine.

r+ with a couple of minor nits.

There's also now an nsCanvasRenderingContext2DAzure, which duplicates much of the code from nsCanvasRenderingContext2D. That should probably be updated to use this, as well.

::: content/canvas/src/convert_from_canvas_rgba.cpp
@@ +44,5 @@
> +PRBool need_premultiply_table()
> +{
> +#if defined(MOZILLA_MAY_SUPPORT_NEON) && defined(IS_LITTLE_ENDIAN)
> +  return !mozilla::supports_neon();
> +#endif

Could I get an #else here that wraps the non-NEON case?
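
Something like this would do (just a sketch; returning PR_TRUE on the non-NEON path is my assumption about what the fallback should be):

PRBool need_premultiply_table()
{
#if defined(MOZILLA_MAY_SUPPORT_NEON) && defined(IS_LITTLE_ENDIAN)
  return !mozilla::supports_neon();
#else
  return PR_TRUE;
#endif
}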

@@ +55,5 @@
> +  value = value + 255;
> +  return (value + (value >> 8)) >> 8;
> +}
> +
> +void convert_from_canvas_rgba_noarch(PRUint8 *dst, PRUint8 const *src, int n, PRUint8 const (*premultiplyTable)[256])

This function should be static (else you'll have to check n%4==0 here, too).

@@ +99,5 @@
> +  }
> +}
> +
> +#if defined(MOZILLA_MAY_SUPPORT_NEON) && defined(IS_LITTLE_ENDIAN)
> +void convert_from_canvas_rgba_neon_littleendian(PRUint8 *dst, PRUint8 const *src, int n, PRUint8 const (*premultiplyTable)[256])

Same here.

@@ +139,5 @@
> +  "bne 1b\n"
> +  :
> +  : [src] "r" (src), [dst] "r" (dst), [last] "r" (last)
> +  : "cc", "memory",
> +    "q2", "q8", "q9", "q10",

I'm not actually clear on gcc's precise rules, but either you don't need any of the q register names, or you also need q0 and q1.
Attachment #537179 - Flags: review?(tterribe) → review+
Is this optimization still meaningful?
Flags: needinfo?(azakai)
No idea, sorry - been 3 years since I was involved with that code.
Flags: needinfo?(azakai)

Tentatively closing this.

Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → INCOMPLETE