Closed Bug 659725 Opened 13 years ago Closed 5 years ago

Optimize Canvas putImageData conversion loop

Categories

(Core :: Graphics: Canvas2D, defect)

Version: Other Branch
Type: defect
Priority: Not set
Severity: normal

Tracking

RESOLVED INCOMPLETE

People

(Reporter: azakai, Unassigned)

References

Details

Attachments

(4 files, 5 obsolete files)

Canvas demos will usually call putImageData many times per second, ideally something like 50-60. |nsCanvasRenderingContext2D::PutImageData_explicit| loops over the input to convert it to the format we need later (cairo), reordering RGBA to BGRA (little endian) or ARGB (big endian):

http://mxr.mozilla.org/mozilla-central/source/content/canvas/src/nsCanvasRenderingContext2D.cpp#3925

We might be able to optimize that loop.
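
For reference, the per-pixel work in that loop is roughly the following (a simplified sketch written from the description above, not the exact tree code, and the function name is made up; it assumes little-endian BGRA output and a 256x256 premultiply table indexed as [alpha][value]):

  static void
  ConvertRGBAtoPremultipliedBGRA(PRUint8 *dst, const PRUint8 *src, PRUint32 len)
  {
    for (PRUint32 i = 0; i < len; i += 4) {
      PRUint8 r = src[i], g = src[i+1], b = src[i+2], a = src[i+3];
      dst[i]   = sPremultiplyTable[a][b]; // B, premultiplied by alpha
      dst[i+1] = sPremultiplyTable[a][g]; // G
      dst[i+2] = sPremultiplyTable[a][r]; // R
      dst[i+3] = a;                       // alpha unchanged
    }
  }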

I did some benchmarks on a 320x200 canvas. On desktop I get 1ms for the loop. At 50fps we have 20ms per frame, so the loop takes about 5% of our CPU time, which is not too bad. This is with a small canvas, though - a larger one, or even one that fills a typical monitor, would be slower.

I also benchmarked on fennec on an HTC desire. It takes 4ms there, which is 20% of our CPU time - which looks very significant.
(In reply to comment #0)
> I also benchmarked on fennec on an HTC desire. It takes 4ms there, which is
> 20% of our CPU time - which looks very significant.

You can check bug 534215 comment 1 for the way of doing fast premultiplication on ARM.
Attached patch one possibility (obsolete) — Splinter Review
Silly patch with one quick improvement, no SSE or NEON. Just little endian so far. Patch also includes the timing code.

On Android, this gets me from 4ms to 2.5ms for the loop.

Asking for feedback because I'm not sure if this is a good idea or not, and what is the right approach.
Attachment #535226 - Flags: feedback?(jmuizelaar)
Attachment #535226 - Flags: feedback?(bas.schouten)
Here you're only considering conversions from/to a few different formats, right?

Just asking because if you ever had to do conversions from/to a large number of formats, you might then be interested in reusing the code we have in WebGL for that, see
http://mxr.mozilla.org/mozilla-central/source/content/canvas/src/WebGLContextGL.cpp#3313
and
http://mxr.mozilla.org/mozilla-central/source/content/canvas/src/WebGLTexelConversions.h

Here we have 155 different paths, generated by C++ templates (x86-64 code size: 24K), but no SSE/NEON. In your case, if you have only a few cases to handle, it's of course much more worthwhile to do SSE/NEON.
Looking at your code here, are you sure that sPremultiplyTable is a good idea at all? Not saying that it isn't, but that really needs to be validated by experiment. If sPremultiplyTable fits entirely in the CPU cache, then I'm willing to believe that it's an optimization. But since we're talking about 64K here and cell phones, I don't know. That would depend on many factors: the device, the image, and the rest of the test case (what else is competing for CPU cache memory).
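
For reference, the 64K figure comes from the table being 256x256 bytes. A sketch of how such a table gets filled, assuming the (x * a + 254) / 255 rounding used by the current code (the exact indexing is an assumption here):

  static PRUint8 sPremultiplyTable[256][256];

  static void EnsurePremultiplyTable()
  {
    for (PRUint32 a = 0; a < 256; a++)
      for (PRUint32 x = 0; x < 256; x++)
        sPremultiplyTable[a][x] = (x * a + 254) / 255;
  }

Whether caching that is a win over doing the multiply per pixel is exactly the question: it trades arithmetic for 64K of cache footprint.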
I'm new to this code myself so I don't know what the usual ways are to optimize this sort of thing. Looking around for tools, I found the orc (optimized inner loop runtime compiler),

http://code.entropywave.com/projects/orc/

Looks like a simple language that compiles into SSE, NEON, etc. bjacob, if I understand it correctly, it might help with the dozens of paths you have in WebGL (less work than writing SSE/NEON by hand)? If no one has investigated it yet I can take a closer look.
> Looking at your code here, are you sure that sPremultiplyTable is a good idea 
> at all? Not saying that it isn't, but that really needs to be validated by 
> experiment.

See bug 519400 for those experiments (desktop only).

I don't know how this will perform on mobile compared to the original code [1].  The ARM Cortex-A8 has 32KB of L1, so of course the lookup table doesn't fit in L1.  But neither does the image data, so either way we're blowing away L1.

It's hard to quantify the cost of blowing away 64KB of L2, and perhaps this isn't worthwhile when the canvas is small.  (Perhaps we should use a different approach on very small canvases!)

I couldn't figure out how to do

   (x*alpha + 254) / 255

with NEON or SSE, since neither has an integer division instruction IIRC.  But maybe there's a clever transformation on this.

I'm curious how this patch performs on desktop.  There's a lot of data parallelism, which is promising.

[1] http://hg.mozilla.org/mozilla-central/file/61e6d741/content/canvas/src/nsCanvasRenderingContext2D.cpp#l3703
(In reply to comment #6)
> > Looking at your code here, are you sure that sPremultiplyTable is a good idea 
> > at all? Not saying that it isn't, but that really needs to be validated by 
> > experiment.
> 
> See bug 519400 for those experiments (desktop only).
> 
> I don't know how this will perform on mobile compared to the original code
> [1].  The ARM Cortex-A8 has 32KB of L1, so of course the lookup table
> doesn't fit in L1.  But neither does the image data, so either way we're
> blowing away L1.
> 
> It's hard to quantify the cost of blowing away 64KB of L2, and perhaps this
> isn't worthwhile when the canvas is small.  (Perhaps we should use a
> different approach on very small canvases!)
> 
> I couldn't figure out how to do
> 
>    (x*alpha + 254) / 255
> 
> with NEON or SSE, since neither has an integer division instructions IIRC. 
> But maybe there's a clever transformation on this.

GCC knows how to optimize away integer divisions.

This code:

unsigned char f(unsigned char x, unsigned char alpha)
{
  return ((unsigned short)(x)*alpha + 254) / 255;
}

with this command line (gcc 4.5.3):

gcc a.c -S -o a.s -O3

gives this assembly:

        movzbl  %dil, %edi
        movzbl  %sil, %esi
        movl    $-2139062143, %edx
        imull   %edi, %esi
        addl    $254, %esi
        movl    %esi, %eax
        imull   %edx
        leal    (%rdx,%rsi), %eax
        sarl    $7, %eax
        ret
(In reply to comment #5)
> I'm new to this code myself so I don't know what the usual ways are to
> optimize this sort of thing. Looking around for tools, I found the orc
> (optimized inner loop runtime compiler),
> 
> http://code.entropywave.com/projects/orc/
> 
> Looks like a simple language that compiles into SSE, NEON, etc. bjacob, if I
> understand it correctly, it might help with the dozens of paths you have in
> WebGL (less work than writing SSE/NEON by hand)? If no one has investigated
> it yet I can take a closer look.

Thanks; it's not absurd to use a runtime compiler for WebGL format conversions, since there are just too many cases there to have built-in SSE paths for all of them. However, I don't know yet that it's performance critical enough that we need to go this far. The only common case where it's really performance critical is when streaming video textures.
Ah, that's a good idea.

It hinges on the sarl (shift right 7 bits), which iirc doesn't exist in SSE or NEON.  I'll ponder this more, but maybe Tim can shed some light for us.
(In reply to comment #9)
> Ah, that's a good idea.
> 
> It hinges on the sarl (shift right 7 bits), which iirc doesn't exist in SSE
> or NEON.  I'll ponder this more, but maybe Tim can shed some light for us.

If SSE/NEON don't have the right shift instruction you need here, you could at least do right shifts on 64-bit ints, followed by an AND with a mask; that would still be only 3 instructions, not too bad. If SSE/NEON has a 128-bit right shift instruction, then you're down to 2 instructions (shift+and).
(In reply to comment #6)
> > Looking at your code here, are you sure that sPremultiplyTable is a good idea 
> > at all? Not saying that it isn't, but that really needs to be validated by 
> > experiment.
> 
> See bug 519400 for those experiments (desktop only).
> 
> I don't know how this will perform on mobile compared to the original code
> [1].  The ARM Cortex-A8 has 32KB of L1, so of course the lookup table
> doesn't fit in L1.  But neither does the image data, so either way we're
> blowing away L1.
> 
> It's hard to quantify the cost of blowing away 64KB of L2, and perhaps this
> isn't worthwhile when the canvas is small.  (Perhaps we should use a
> different approach on very small canvases!)
> 
> I couldn't figure out how to do
> 
>    (x*alpha + 254) / 255

Division by 255 is common in graphics and has different recipes. The common one is (x+127)/255 which is x*257>>8 or (x + x>>8)>>8. Is the 254 correct? I seem to remember looking into that before but I don't remember the answer.
> It hinges on the sarl (shift right 7 bits), which iirc doesn't exist in SSE
> or NEON.  I'll ponder this more, but maybe Tim can shed some light for us.

Arithmetic right shifts by constant amounts certainly exist in both SSE and NEON (though it makes more sense to use a logical one here). See also bug 594920 comment 8 and comment 10. The code there uses an actual multiply, as it's fewer instructions and fewer uops, though someone might be able to do better if they spent some time on it.
Hm, yes.  I'm not sure why I thought otherwise, but I found these instructions in the ISA pretty easily...

Alon, do you want to try and spin a patch, or would you like me to?
(In reply to comment #13)
> Hm, yes.  I'm not sure why I thought otherwise, but I found these
> instructions in the ISA pretty easily...
> 
> Alon, do you want to try and spin a patch, or would you like me to?

I've not written code like this before, but I'd be happy to try with some guidance. What's the closest existing example of this sort of thing in our code so I can study that?
If you want to do SSE, you could look at gfxAlphaRecoverySSE2.cpp or nsUTF8ToUnicodeSSE2.cpp.

For NEON, maybe yuv_convert_arm.cpp.  Also [1].

If you have a mobile build set up, I might recommend starting with NEON, since it's much more sane than SSE and it's easy to load r/g/b/a into separate registers.

[1] http://blogs.arm.com/software-enablement/161-coding-for-neon-part-1-load-and-stores/
Thanks for the links Justin!
Attached patch NEON patch (obsolete) — Splinter Review
Ok, here is a NEON patch.

This patch gives me 3-4ms in a benchmark compared to 7-8ms with the current code. It probably isn't as optimized as it could be; this is my first time using ARM assembly.
Attachment #535226 - Attachment is obsolete: true
Attachment #535226 - Flags: feedback?(jmuizelaar)
Attachment #535226 - Flags: feedback?(bas.schouten)
Attachment #536467 - Flags: review?(justin.lebar+bug)
Attached patch NEON patch (obsolete) — Splinter Review
Oops, wrong patch before, sorry about that. Here's the real one.
Attachment #536467 - Attachment is obsolete: true
Attachment #536467 - Flags: review?(justin.lebar+bug)
Attachment #536469 - Flags: review?(justin.lebar+bug)
Comment on attachment 536469 [details] [diff] [review]
NEON patch

I'm happy to give feedback, but I'm not a peer, so this will need to be reviewed by someone else once we're done.

> diff --git a/content/canvas/src/convert_from_canvas_rgba.cpp b/content/canvas/src/convert_from_canvas_rgba.cpp
> new file mode 100644
> --- /dev/null
> +++ b/content/canvas/src/convert_from_canvas_rgba.cpp

Can we put these functions in a namespace?

> +void convert_from_canvas_rgba_noarch(PRUint8 *dst, PRUint8 *src, int n, PRUint8 (*premultiplyTable)[256])

I'd prefer a const pointer to src and the premultiply table.  Also, please assert that n is a multiple of 4.

>+#if defined(MOZILLA_MAY_SUPPORT_NEON) && defined(IS_LITTLE_ENDIAN)
>+void convert_from_canvas_rgba_neon_littleendian(PRUint8 *dst, PRUint8 *src, int n, PRUint8 (*premultiplyTable)[256])
>+{

Const pointers.

>+  if (((int)dst)%32 != ((int)src)%32) {

Maybe use NS_PTR_TO_INT32() here and elsewhere?

>+  int align = ((int)dst)%32;
>+  if (align != 0) {
>+    // Advance to 32-bit alignment
>+    convert_from_canvas_rgba_noarch(dst, src, 32-align, premultiplyTable);
>+    dst += align;
>+    src += align;
>+  }

ptr % 32 == 0 is checking for 32-*byte* alignment, which I don't think we need.  vld4.8 only requires 32-bit alignment.

If it's not 32-*bit* aligned to begin with, then no amount of realignment will help, because we have to walk in 32-bit increments in the unvectorized path.  I don't know if we can rely on these pointers being word aligned; if so, we should just assert.  If not, we should fall back to the slow path.

>+  align = n % 32;
>+  if (align != 0) {
>+    // Do any bytes at the end that are not 32-bit aligned

Here you correctly make sure that we process a multiple of 32 *bytes* using the vector ops.  I wouldn't call this "aligning", though, since it's OK if src + n isn't a multiple of 32.

>+    convert_from_canvas_rgba_noarch(dst + n - align, src + n - align, align, premultiplyTable);
>+    n = n - align;
>+  }

Probably should do this after the vector ops; it's better in general to access data sequentially, since the MMU can prefetch that better.  When you move it below the asm, you may want to take advantage of the fact that dst will be pointing to the next byte to process.

I'm not sure it's worthwhile to do this using the table lookup method.  As it is, we have to compute the premultiply table and hold it around, even when we don't have to do any work at the end.  Maybe it would be better if we had a convert_from_canvas_rgba_noarch_notable method which computed the values directly, and if convert_from_canvas_rgba_noarch ensured that the premultiply table exists. Then we wouldn't need to compute the table before calling the neon function.
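
Something along these lines is what I have in mind (just an untested sketch, using the same (x * a + 254) / 255 rounding the table encodes):

  static void
  convert_from_canvas_rgba_noarch_notable(PRUint8 *dst, const PRUint8 *src, int n)
  {
    for (int i = 0; i < n; i += 4) {
      PRUint8 r = src[i], g = src[i+1], b = src[i+2], a = src[i+3];
      dst[i]   = (b * a + 254) / 255; // B
      dst[i+1] = (g * a + 254) / 255; // G
      dst[i+2] = (r * a + 254) / 255; // R
      dst[i+3] = a;
    }
  }

For the handful of leftover pixels at the edges, the per-pixel division should be cheap enough not to matter.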

(I imagine we may want to reorganize how we handle the (un)premultiply table, since gfxUtils.cpp duplicated this code.  But we can worry about that in a separate bug...)

>+  PRUint8 *last = dst + n;
>+
>+  asm volatile (

Does it not work if it's not volatile?  It should work with just regular asm if you have the right [read, write, clobber] settings.

>+".fpu neon\n"
>+  "mov r0, %[src]\n"
>+  "mov r1, %[dst]\n"
>+  "mov r2, %[last]\n"

You'll use fewer registers if you just use %[src] wherever you use r0, and so on.

>+  "vmov.i16 q6, #254\n" // a constant we need later
>+  "vmov.i16 q7, #255\n" // another constant we need later
>+"1:\n"
>+  "vld4.8 {d0,d1,d2,d3}, [r0]\n" // interleaved load so that d0-d3 contain R,G,B,A data respectively
>+
>+  "vmovl.u8 q2, d0\n" // Double to 16 bits
>+  "vmovl.u8 q3, d1\n"
>+  "vmovl.u8 q4, d2\n"
>+  "vmovl.u8 q5, d3\n"

Might be a bit faster if you do the d3 -> q5 conversion before d1 -> q3 and d2
-> q4, since then the first multiply by alpha doesn't have to wait for the
conversion to finish.

>+  "vmul.u16 q2, q5\n" // Multiply by alpha
>+  "vmul.u16 q3, q5\n"
>+  "vmul.u16 q4, q5\n"
>+
>+  "vadd.u16 q2, q6\n" // Add 254
>+  "vadd.u16 q3, q6\n"
>+  "vadd.u16 q4, q6\n"
>+
>+  "vshr.u16 q8, q2, #8\n" // Divide by 255
>+  "vshr.u16 q9, q3, #8\n"
>+  "vshr.u16 q10, q4, #8\n"
>+  "vadd.u16 q2, q8\n"
>+  "vadd.u16 q3, q9\n"
>+  "vadd.u16 q4, q10\n"
>+  "vshr.u16 q2, #8\n"
>+  "vshr.u16 q3, #8\n"
>+  "vshr.u16 q4, #8\n"
>+
>+  "vmovn.i16 d2, q2\n" // Return to 8 bits, swapping R and B as we go
>+  "vmovn.i16 d1, q3\n"
>+  "vmovn.i16 d0, q4\n"

I'm not sure this is exactly the same as the original code; it appears to be off by 1 in some cases.  Testcase at the bottom of the comment.  Maybe I'm interpreting the asm incorrectly?

Is it possible to write a test for this?  I don't know what the state is of our mobile test infrastructure.

>+  "vst4.8 {d0,d1,d2,d3}, [r1]\n" // interleaved save
>+  "add r0, #32\n" // increment pointers
>+  "add r1, #32\n"

If you do "vst4.8 {d0,d1,d2,d3}, [dst]!", then the pointer will be incremented for you.  Same for the vld4.8 above.

>+  "cmp r1, r2\n" // check if we are done

It probably doesn't matter, but you might get a speedup by comparing src to src + n instead of comparing dst to dst + n, since if you change the "vld [src]" to "vld [src]!", we can be sure that we won't have to stall the pipeline waiting for dst += 32 to be computed.

>+	: [src] "r" (src), [dst] "r" (dst), [last] "r" (last)

If you use src and dst directly (rather than through r0, r1), this needs to be "+r" for src and dst, since you're reading and writing them.

>+	: "cc", "memory", "d0", "d1", "d2", "d3", "d4", "d5", "d6", "r0", "r1", "r2", "q0", "q1", "q2", "q3", "q4", "q5",
>+    "q6", "q7", "q8", "q9", "q10"

And then you can take out r0, r1, r2 here.

Last time I checked, GCC doesn't understand that {d0, d1} is the same as q0.  Therefore to be totally correct, I think the list of clobbered registers has to include all of the doubleword registers through d20.  It doesn't really matter, though, since there's no more asm in this function.

>diff --git a/content/canvas/src/nsCanvasRenderingContext2D.cpp b/content/canvas/src/nsCanvasRenderingContext2D.cpp
>--- a/content/canvas/src/nsCanvasRenderingContext2D.cpp
>+++ b/content/canvas/src/nsCanvasRenderingContext2D.cpp
>+    convert_from_canvas_rgba(dst, src, w*h*4, sPremultiplyTable);

We assert that aDataLen == w*h*4, so maybe you should use that?


Testcase:

  #include <stdio.h>
  #include <stdint.h>

  uint8_t method1(uint16_t x, uint16_t a) {
    return (uint8_t) ((x * a + 254) / 255);
  }

  uint8_t method2(uint16_t x, uint16_t a) {
    uint16_t y = x * a + 254;
    return (uint8_t) ((y + (y >> 8)) >> 8);
  }

  int main() {
    for (uint16_t x = 0; x <= 255; x++) {
      for (uint16_t a = 0; a <= 255; a++) {
        uint8_t m1 = method1(x, a);
        uint8_t m2 = method2(x, a);
        if (m1 != m2) {
          printf("Difference at x=%3d, a=%3d. m1=%3d, m2=%3d\n", x, a, m1, m2);
        }
      }
    }
  }

Outputs:

  Difference at x=  1, a=  1. m1=  1, m2=  0
  Difference at x=  2, a=128. m1=  2, m2=  1
  Difference at x=  4, a= 64. m1=  2, m2=  1
  Difference at x=  7, a= 73. m1=  3, m2=  2
  Difference at x=  8, a= 32. m1=  2, m2=  1
  etc.
Attachment #536469 - Flags: review?(justin.lebar+bug) → review-
A lot of performance is lost if not using prefetch here, because many of the existing ARM processors don't have an automatic hardware prefetcher. There is a PLD instruction available on ARM for this purpose.

And just for the reference, a similar code for doing premultiplication as ((x * a + 127) / 255) is implemented here:
    http://cgit.freedesktop.org/pixman/tree/pixman/pixman-arm-neon-asm.S?id=pixman-0.22.0#n2184

Which can be used as the following function if bypassing standard pixman API is desired:
    void pixman_composite_src_pixbuf_8888_asm_neon(
        int32_t w, int32_t h,
        uint32_t *dst, int32_t dst_stride,
        uint32_t *src, int32_t src_stride)
Comment on attachment 536469 [details] [diff] [review]
NEON patch

Review of attachment 536469 [details] [diff] [review]:
-----------------------------------------------------------------

I know I wasn't asked, but....

::: content/canvas/src/convert_from_canvas_rgba.cpp
@@ +90,5 @@
> +".fpu neon\n"
> +  "mov r0, %[src]\n"
> +  "mov r1, %[dst]\n"
> +  "mov r2, %[last]\n"
> +  "vmov.i16 q6, #254\n" // a constant we need later

q4-q7 are callee-saved in the ARM ABI, so if you use them, they have to be saved/restored. I _believe_ gcc will do this for you if you put them in the clobber list, but it's better to just use different registers. There's more than enough.

@@ +91,5 @@
> +  "mov r0, %[src]\n"
> +  "mov r1, %[dst]\n"
> +  "mov r2, %[last]\n"
> +  "vmov.i16 q6, #254\n" // a constant we need later
> +  "vmov.i16 q7, #255\n" // another constant we need later

q7 is never used.

@@ +93,5 @@
> +  "mov r2, %[last]\n"
> +  "vmov.i16 q6, #254\n" // a constant we need later
> +  "vmov.i16 q7, #255\n" // another constant we need later
> +"1:\n"
> +  "vld4.8 {d0,d1,d2,d3}, [r0]\n" // interleaved load so that d0-d3 contain R,G,B,A data respectively

This load and the subsequent store are unaligned as written. You need at least 16-byte alignment to make them any faster (indicated with [r0, :128] for 128-bit alignment in the GNU syntax). But unless 16-byte alignment is something we can expect, you might as well leave them unaligned and get rid of all the alignment code.

@@ +95,5 @@
> +  "vmov.i16 q7, #255\n" // another constant we need later
> +"1:\n"
> +  "vld4.8 {d0,d1,d2,d3}, [r0]\n" // interleaved load so that d0-d3 contain R,G,B,A data respectively
> +
> +  "vmovl.u8 q2, d0\n" // Double to 16 bits

There's an instruction that combines vmovl followed by vmul: vmull.u8 (more accurately, it does the multiply on two 8-bit operands and then keeps all 16 bits of the result).

@@ +108,5 @@
> +  "vadd.u16 q2, q6\n" // Add 254
> +  "vadd.u16 q3, q6\n"
> +  "vadd.u16 q4, q6\n"
> +
> +  "vshr.u16 q8, q2, #8\n" // Divide by 255

The sequence you actually want here is
(y + ((y + (y>>7))>>8))>>8
(try plugging that into jlebar's test case and see if it doesn't match; though in the (y+(y>>8))>>8 method's defense, it's only wrong for at most one alpha value per x value)

But, for your convenience, NEON has a "shift right and add" instruction: vsra, as well as a "shift right and narrow" instruction: vshrn. So the correct thing won't take any more instructions than what you've already got.
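
If you want to check that sequence against jlebar's test case above, a variant like this should do it (note the uint32_t intermediate: the y + (y>>7) step can exceed 16 bits):

  #include <stdio.h>
  #include <stdint.h>

  static uint8_t method1(uint16_t x, uint16_t a) {
    return (uint8_t) ((x * a + 254) / 255);
  }

  static uint8_t method3(uint16_t x, uint16_t a) {
    uint32_t y = (uint32_t)x * a + 254;
    return (uint8_t) ((y + ((y + (y >> 7)) >> 8)) >> 8);
  }

  int main() {
    for (uint16_t x = 0; x <= 255; x++)
      for (uint16_t a = 0; a <= 255; a++)
        if (method1(x, a) != method3(x, a))
          printf("Difference at x=%3d, a=%3d\n", x, a);
    return 0;
  }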
(In reply to comment #21)
> to include all of the doubleword registers through d20.  It doesn't really
> matter, though, since there's no more asm in this function.

This is not actually true, if you want to use things like LTO (not with the gcc currently available for Android, but someday...).
(In reply to comment #11)
> (In reply to comment #6)
> > I couldn't figure out how to do
> > 
> >    (x*alpha + 254) / 255
> 
> Division by 255 is common in graphics and has different recipes. The common
> one is (x+127)/255 which is x*257>>8 or (x + x>>8)>>8. Is the 254 correct? I
> seem to remember looking into that before but I don't remember the answer.

This 254 also does not look correct to me. The math involved here can be verified by comparing the difference of calculated results between fixed point implementation and a reference floating point implementation of premultiplication:

double reference_float_premultiply(int alpha, int x)
{
    /* convert to floating point */
    double alpha_f = (double) alpha / 255.0;
    double x_f = (double) x / 255.0;
    /* do the premultiplication */
    double result = x_f * alpha_f;
    /* convert back from [0.0, 1.0] to [0, 255] range, but don't do rounding yet */
    return result * 255.0;
}

int premultiply(int alpha, int x)
{
    return (x * alpha + 127) / 255;
}



for (alpha = 0; alpha <= 255; alpha++)
{
    for (x = 0; x <= 255; x++)
    {
        double result1 = premultiply(alpha, x);
        double result2 = reference_float_premultiply(alpha, x);
        /* do some clever comparison between result1 and result2 */
    }
}
(In reply to comment #25)
> This 254 also does not look correct to me. The math involved here can be
> verified by comparing the difference of calculated results between fixed
> point implementation and a reference floating point implementation of
> premultiplication:

My understanding was the ceil((a/255.0)*x) behavior in the premultiply allows it to be a direct inverse to the floor((255.0/a)*x) behavior in the unmultiply, meaning there's no round-trip error if you unmultiply the data and then premultiply it again. This is decidedly not true if you round both ways.
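
A quick way to check that round-trip property with the current recipes (written from the description above rather than the tree code: premultiply rounds up via the +254 offset, unmultiply truncates):

  #include <assert.h>

  static int premultiply(int x, int a)   { return (x * a + 254) / 255; }
  static int unpremultiply(int x, int a) { return (x * 255) / a; }

  int main() {
    for (int a = 1; a <= 255; a++)
      for (int p = 0; p <= a; p++) // valid premultiplied values satisfy p <= a
        assert(premultiply(unpremultiply(p, a), a) == p);
    return 0;
  }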
> This load and the subsequent store are unaligned as written. You need at least 
> 16-byte alignment to make them any faster (indicated with [r0, :128] for 
> 128-bit alignment in the GNU syntax). But unless 16-byte alignment is something 
> we can expect, you might as well leave them unaligned and get rid of all the 
> alignment code.

As I read the ARM manual, vld4.8 only takes 32-bit alignment.  (Table 4-12, "Permitted combinations of parameters".)  Am I misunderstanding what this means?
(In reply to comment #27)
> As I read the ARM manual, vld4.8 only takes 32-bit alignment.  (Table 4-12,
> "Permitted combinations of parameters".)  Am I misunderstanding what this
> means?

Yes.

See footnote b, "align can be omitted. In this case, standard alignment rules apply," although it's not clear in that manual what the "standard alignment rules" are (see below). But what the '-' entries in the align column basically mean is, "It's never useful to specify an alignment for this access," e.g., it doesn't do you any good to add an alignment for an 8-bit single-lane load to one register. Note also that that table is all for single-lane loads, and we're using a multi-element load here (i.e., Table 4-14).

The actual operation of the instruction is defined by the ARM Architecture Reference Manual ARMv7-A and ARMv7-R edition (page A8-614 in my version). In particular (describing how the instruction is decoded):

alignment = if align == '00' then 1 else 4 << Uint(align)
...
address = R[n]; if (address MOD alignment) != 0 then GenerateAlignmentException();
(In reply to comment #26)
> My understanding was the ceil((a/255.0)*x) behavior in the premultiply
> allows it to be a direct inverse to the floor((255.0/a)*x) behavior in the
> unmultiply, meaning there's no round-trip error if you unmultiply the data
> and then premultiply it again. This is decidedly not true if you round both
> ways.

Hmm, that's an interesting point. But this seems to work for me:

#include <assert.h>

int premultiply(int x, int a)
{
    return (x * a + 127) / 255;
}

int unpremultiply(int x, int a)
{
    return (255 * x + 127) / a;
}

int main()
{
    int a, x;
    for (a = 1; a <= 255; a++)
    {
        for (x = 0; x <= 255; x++)
        {
            assert(x == premultiply(unpremultiply(x, a), a));
        }
    }
    return 0;
}

Or maybe premultiply/unpremultiply behaviour is strictly defined by some web standard?
(In reply to comment #29)
> Hmm, that's an interesting point. But this seems to work for me:

I think you're doing something wrong.

> int unpremultiply(int x, int a)
> {
>     return (255 * x + 127) / a;
> }

127 is not the proper offset to round division by a. Consider x=1, a=1. (255*1+127)/1 == 382, which is a bit off from 255, where it should be. Note also that 382 will get truncated to 126 when it gets stored back to an 8-bit value, which your example doesn't account for, and which causes a mismatch in this case.
Right, the correct offset is "a / 2"

int unpremultiply(int x, int a)
{
    return (255 * x + a / 2) / a;
}
And division free unpremultiply could probably look like this (one table lookup in a small table per pixel):

int unpremultiply(int x, int a)
{
    int shift = 14;
    int mulconst = 255 * (1 << shift) / a; /* do table lookup for this */
    return (x * mulconst + (1 << (shift - 1))) >> shift;
}

It would be great to somehow tweak the code to actually fit 'mulconst' into 16 bits, which would make it better suited for SIMD optimizations.
Attached patch NEON patch v2 (obsolete) — Splinter Review
Thanks for the feedback guys! Here is an updated patch.

Implementing (x + ((x + (x>>7))>>8)) >> 8 is a little clunky, but maybe that's just me not finding a shortcut. However, it doesn't give perfect results - the 255/255 case gives 254 instead of 255. I get the same problem with the test code on desktop, *if* I add some forced casts to uint16_t in there (which is closer to what the NEON code is doing).

Given that, maybe it makes sense to go with the faster and simpler (x + (x>>8))>>8? It has a lot more off-by-one errors, but at most one per alpha?
Attachment #536469 - Attachment is obsolete: true
I'll do a pass over this in a moment.
The Canvas 2D Context spec mandates the following:
http://dev.w3.org/html5/2dcontext/#dom-context-2d-putimagedata
> The handling of pixel rounding when the specified coordinates do not exactly
> map to the device coordinate space is not defined by this specification,
> except that the following must result in no visible changes to the rendering:
>
> context.putImageData(context.getImageData(x, y, w, h), p, q);
> ...for any value of x, y, w, and h and where p is the smaller of x and the
> sum of x and w, and q is the smaller of y and the sum of y and h; and except
> that the following two calls:
>
> context.createImageData(w, h);
> context.getImageData(0, 0, w, h);
>
> ...must return ImageData objects with the same dimensions, for any value of w
> and h.
Is this satisfied?
(In reply to comment 35)
> Is this satisfied?

We should probably have a test for this...
> int initial = 32 - (n % 32);
> if (initial != 32) {
>   // Do enough bytes at the beginning so that we are left with a multiple of 32
>   convert_from_canvas_rgba_noarch(dst, src, initial, premultiplyTable);

I'd really prefer if we didn't have to populate the lookup table when NEON is available; that seems like a waste.

> +  "vmov.i16 q12, #254\n" // a constant we need later
> +"1:\n"
> +  "vld4.8 {d0,d1,d2,d3}, [%[src]]!\n" // interleaved load so that d0-d3 contain R,G,B,A data respectively
> +
> +  "vmull.u8 q2, d3, d0\n" // Multiply by alpha
> +  "vmull.u8 q3, d3, d1\n"
> +  "vmull.u8 q10, d3, d2\n"
> +
> +  "vadd.u16 q13, q2, q12\n" // Add 254
> +  "vadd.u16 q14, q3, q12\n"
> +  "vadd.u16 q15, q10, q12\n"
> +  // now q13 is q2 + 254 = x
> +  "vsra.u16 q13, q13, #7\n" // Divide by 255
> +  "vsra.u16 q14, q14, #7\n"
> +  "vsra.u16 q15, q15, #7\n"
> +  // now q13 is x + (x>>7)
> +  "vshr.u16 q13, q13, #8\n"
> +  "vshr.u16 q14, q14, #8\n"
> +  "vshr.u16 q15, q15, #8\n"
> +  // now q13 is (x + x>>7) >> 8
> +  "vadd.u16 q13, q2\n"
> +  "vadd.u16 q14, q3\n"
> +  "vadd.u16 q15, q10\n"
> +  "vadd.u16 q13, q12\n" // Add 254 again
> +  "vadd.u16 q14, q12\n"
> +  "vadd.u16 q15, q12\n"
> +  // now q13 is x + ((x + x>>7) >> 8)

Maybe Tim can do better than this, but you should at least be able to combine
the final vshr with the next vadd, right?

Also, see comment 22:

 A lot of performance is lost if not using prefetch here, because many of the
 existing ARM processors don't have automatic hardware prefetcher. There is PLD
 instruction available on ARM for this purpose.

You'll want to experiment with how far ahead to prefetch, but in my limited experience, the sweet spot was pretty large.

> +  : [src] "r" (src), [dst] "r" (dst), [last] "r" (last)

Since you modify src and dst, it needs to be

  : [src] "+r" (src), [dst] "+r" (dst), [last] "r" (last)

> +    "q0", "q1", "q2", "q3", "q8", "q9", "q10", "q12", "q13",
> +    "d0", "d1", "d2", "d3", "d4", "d5", "d6", "d7", "d8", "d9",
> +    "d10", "d11", "d12", "d13", "d14", "d15", "d16", "d17", "d18", "d19",
> +    "d20", "d21", "d22", "d23", "d24", "d25", "d26"

Please update this.
(In reply to comment #33)
> Implementing (x + (x + (x>>7))>>8 ) >> 8 is a little clunky but maybe that's
> just my not finding a shortcut. However, it doesn't give perfect results -
> the 255/255 case gives 254 instead of 255. I get the same problem with the
> test code on desktop, *if* I add some forced casts to uint16_t in there
> (which is closer to what the NEON code is doing).
> 
> Given that, maybe it makes sense to go with the faster and simpler (x +
> x>>8)>>8? It has a lot more 1-off errors, but at most one per alpha?

I think it makes a lot of sense to first decide what would be the "correct" implementation before trying hard to optimize it.
(In reply to comment #32)
> And division free unpremultiply could probably look like this (one table
> lookup in a small table per pixel):

It's going to be hard to use a lookup table like this in NEON, because of the long delay to get stuff from NEON to ARM registers so it can be used in an address calculation. Maybe you could load the alpha values a second time on the ARM side. But you'll also need to assemble the constants after you load them on the NEON side... single-lane loads are slow and trying to pack things together with ARM instructions isn't much better.

> It would be great to somehow tweak the code to actually fit 'mulconst' into
> 16 bits, that would make it better suitable for SIMD optimizations.

There's a standard algorithm for generating such constants. See "N-Bit Unsigned Division via N-Bit Multiply-Add," In Proc. 17th IEEE Symposium on Computer Arithmetic (ARITH'05), pp. 131--139, June 2005.

The gist of it boils down to Algorithm 1 (reproduced here for those without access to the paper):

Inputs: uword d and n, with n>=1 and 1<=d<(1<<n)
int m := floor(log2(d));
uword a, b;
if d == (1<<m) then
    a := (1<<n)-1;
    b := (1<<n)-1;
else
    uword t := floor((1<<(m+n))/d);
    uword r := (t*d+d) mod (1<<n)
    if r <= (1<<m) then
        a := t+1;
        b := 0;
    else
        a := t;
        b := t;
    endif
endif

Then the division is equivalent to ((a*x+b)>>(m+n)). In this case n==16. I don't know if this will work better than the constants you proposed. This method only requires one multiplication, instead of two, which will be the slowest step in NEON. But the table needs values for a and b (16-bits each) and m (0 to 7). b is either the same as a, or 0, so in theory you could pack all three parameters into 3 bytes, but taking advantage of that may cost more computation than it saves cache misses.
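
Transcribed to C for concreteness (a direct rendering of Algorithm 1 above, with n fixed at 16; for d=255 it produces the a=0x8081, b=0, m=7 constants mentioned below):

  #include <stdint.h>

  // After this returns, x / d == (a * x + b) >> (m + 16) for any 16-bit x.
  static void division_constants(uint32_t d, uint32_t *a, uint32_t *b, int *m)
  {
    const int n = 16;
    int mm = 0;
    while ((2u << mm) <= d)            // mm = floor(log2(d))
      mm++;
    if (d == (1u << mm)) {
      *a = (1u << n) - 1;
      *b = (1u << n) - 1;
    } else {
      uint32_t t = (1u << (mm + n)) / d;
      uint32_t r = (t * d + d) % (1u << n);
      if (r <= (1u << mm)) { *a = t + 1; *b = 0; }
      else                 { *a = t;     *b = t; }
    }
    *m = mm;
  }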

In any case, this lookup table approach is not so obvious a win as the premultiply case.

(In reply to comment #38)
> I think it makes a lot of sense to first decide what would be the "correct"
> implementation before trying hard to optimize it.

This is true, but in general we're going to need a "divide by 255" step.

For the premultiply case (d=255), a=0x8081, b=0, and m=7. So (y+((y+(y>>7))>>8))>>8 was my attempt to compute y*0x8081>>23 without a 16x16->32 multiply, because NEON doesn't have one that returns just the high 16 bits (SSE does). Alon is right, though, that the (y+(y>>7)) step can overflow a 16-bit integer.

> uint8_t premultiply_neon_ready(uint8_t x, uint8_t a)
> {
>    uint32_t tmp = x * a + 128;
>    return (tmp + (tmp >> 8)) >> 8;
> }

I notice you adding 128 here instead of 127. I wondered about this in Jeff's comment 11, because if you don't add the extra 1, (x+(x>>8))>>8 is certainly not equivalent to x/255. However, this trick appears to work for 254 as well. I.e., y=x*a+255; return (y+(y>>8))>>8 gives the correct value for all cases (I didn't check all integers, but it is at least correct for all pairs of x and a; it also doesn't overflow 16 bits). So we can use that formulation regardless of which rounding offset we decide on.
(In reply to comment #37)
> see comment 22:
> 
>  A lot of performance is lost if not using prefetch here, because many of the
>  existing ARM processors don't have automatic hardware prefetcher. There is
> PLD
>  instruction available on ARM for this purpose.
> 
> You'll want to experiment with how far ahead to prefetch, but in my limited
> experience, the sweet spot was pretty large.
> 

I am not sure I understand exactly what PLD does from the ARM manual. It basically hints the CPU to prefetch data into its cache? So at the beginning of the loop, I would tell it to prefetch the address I will need for the iteration after the current one, something like that?

> > +  : [src] "r" (src), [dst] "r" (dst), [last] "r" (last)
> 
> Since you modify src and dst, it needs to be
> 
>   : [src] "+r" (src), [dst] "+r" (dst), [last] "r" (last)

GCC complains "error: input operand constraint contains '+'" with that. If I move src and dst to be output operands, it compiles but doesn't work (I suspect that src and dst are no longer being read). I'm probably missing something basic here. But in any case, while we do write to src and dst, they are not used as outputs (we don't read them after the asm block) - so is it still strictly necessary to mark them as "+"? I can't figure this out from the GCC docs, but it compiles and runs properly with just "r".
> I am not sure I understand exactly what PLD does from the ARM manual. It 
> basically hints the CPU to prefetch data into its cache? So at the beginning of 
> the loop, I would tell it to prefetch the address I will need for the iteration 
> after the current one, something like that?

Yup, an iteration or more after the current one.  If you prefetch too close to the current iteration, the data won't be ready by the time you want it.  But if you prefetch too far from the current iteration, you could end up knocking the prefetched data out of cache before you need it!

> GCC complains "error: input operand constraint contains '+'" with that. If I 
> move src and dst to be output operands, it compiles but doesn't work (I 
> suspect that src and dst are no longer being read).

Ah, yes, +r only applies to output regs.  So 

 : [src] "+r" (src), [dst] "+r" (dst)
 : [last] "r" (last)
 
doesn't work?

> But in any case, while we do write to src and dst, they are not used as 
> outputs (we don't read them after the asm block) - so is it still strictly 
> necessary to mark them as "+"?

Since GCC isn't doing any intraprocedural optimizations, it doesn't matter AFAIK.  But in a hypothetical LTO world, you'd want to inform the compiler that you modify as well as read src and dst.
(In reply to comment #42)
> 
> > GCC complains "error: input operand constraint contains '+'" with that. If I 
> > move src and dst to be output operands, it compiles but doesn't work (I 
> > suspect that src and dst are no longer being read).
> 
> Ah, yes, +r only applies to output regs.  So 
> 
>  : [src] "+r" (src), [dst] "+r" (dst)
>  : [last] "r" (last)
>  
> doesn't work?
> 

It compiles but nothing is actually done when the code runs. I suspect src and dst do not contain the right values when they are listed as outputs.
(In reply to comment #38)
> I think it makes a lot of sense to first decide what would be the "correct"
> implementation before trying hard to optimize it.

So, the current behavior comes from bug 389366 comment 1. As long as the two properties listed there ({255, 255, 255, a} round-trips to {255, 255, 255, a} and premultiply(unmultiply(x))==x) are satisfied, I think we can use whatever rounding we want. Given that, I think 127 and (a/2) make more sense as rounding offsets than 254 and 0.
For what it's worth Chrome uses:

/** Return (a*b)/255, taking the ceiling of any fractional bits. Only valid if
    both a and b are 0..255. The expected result equals (a * b + 254) / 255.
 */
static inline U8CPU SkMulDiv255Ceiling(U8CPU a, U8CPU b) {
    SkASSERT((uint8_t)a == a);
    SkASSERT((uint8_t)b == b);
    unsigned prod = SkMulS16(a, b) + 255;
    return (prod + (prod >> 8)) >> 8;
}
I'd also recommend not changing the rounding behavior as part of this bug. If we do decide to change the rounding behaviour we might want to encourage Webkit to change to. I would also suggest documenting some of the rationale for the current behaviour in the code, so we don't need to think as much the next time this comes up.
Attached patch NEON patch v3 (obsolete) — Splinter Review
Updated patch. Uses PLD and (x*y+254)/255 = ((x*y+255) + ((x*y+255)>>8))>>8 (which is true for all 8-bit x, y).

I get 2.45ms compared to 7.5ms with the slow path, so over 3x faster :)

Did I forget anything? I'm not sure what was said before about the premultiply table - do we want to remove it in the slow path?
Attachment #536798 - Attachment is obsolete: true
> I'm not sure what was said before about the premultiply 
> table - do we want to remove it in the slow path?

The issue is that the premultiply table needs to be computed, and that's a waste of time if we have NEON.  So I'd like to use the table-based method only when we don't have NEON, and only call EnsurePremultiplyTable() in that case.  When we do have NEON, we can just do an element-by-element arithmetic conversion for the alignment step.

I'm not sure what's going on with the +r business, but I'll try to get a mobile build environment going tomorrow so I can play around with it.

> I get 2.45ms compared to 7.5ms with the slow path, so over 3x faster :)

Awesome!
Comment on attachment 536969 [details] [diff] [review]
NEON patch v3

>+  "pld [%[src], #32]\n" // a little preloading helps the bits go down
>    ...
>+  "pld [%[src], #64]\n" // a little more preloading, for the next cache line

The loop does src += 32, so you end up preloading each address twice.  Is it any slower if you just do

  pld [%[src], #64]

at the beginning of the loop in place of both existing pld calls?
Even if we have NEON, though, we still need the slow path for cases where the number of pixels is not a multiple of 8. Or are you saying we shouldn't create the table, and should instead do the math directly on each pixel for those?

If so, is the suggestion to have three code paths (table, neon, and no-neon-no-table), or to stop using the table in the slow path and just do (x*y+255 + (x*y+255)>>8)>>8?
> Even if we have NEON, though, we still need the slow path for cases where the 
> number of pixels is not a multiple of 8.

Yes.  But when we have NEON, we need to process at most 31 pixels using the slow path, right?

> If so, is the suggestion to have three code paths (table, neon, and no-neon-
> no-table), or to stop using the table in the slow path and just do (x*y+255 + 
> (x*y+255)>>8)>>8?

The table is still faster than arithmetic on x86 and when we don't have NEON, so yes, I was thinking three code paths.

(Note that you probably don't need to code up the arithmetic code path explicitly using shifts -- it looks like GCC is more than capable of converting the division to bit twiddling.)
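
In other words, something shaped roughly like this (just a sketch; whether the runtime check is mozilla::supports_neon() or something else is a detail for the patch):

  void convert_from_canvas_rgba(PRUint8 *dst, const PRUint8 *src, int n)
  {
  #if defined(MOZILLA_MAY_SUPPORT_NEON) && defined(IS_LITTLE_ENDIAN)
    if (mozilla::supports_neon()) {
      // NEON path; handles the few leftover pixels arithmetically, no table needed.
      convert_from_canvas_rgba_neon_littleendian(dst, src, n);
      return;
    }
  #endif
    // No NEON: build the table once and use the table-based loop.
    EnsurePremultiplyTable();
    convert_from_canvas_rgba_noarch(dst, src, n, sPremultiplyTable);
  }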
(In reply to comment #49)
> Comment on attachment 536969 [details] [diff] [review] [review]
> NEON patch v3
> 
> >+  "pld [%[src], #32]\n" // a little preloading helps the bits go down
> >    ...
> >+  "pld [%[src], #64]\n" // a little more preloading, for the next cache line
> 
> The loop does src += 32, so you end up preloading the each address twice. 
> Is it any slower if you just do
> 
>   pld [%[src], #64]
> 
> at the beginning of the loop in place of both existing pld calls?

Nice, it is faster that way, 2.35 ms vs 2.45 :)
You may be able to squeeze a bit more out of it by experimenting with changing the #64.  Maybe it should be #32, #96, or higher.
This is an initial preview of the NEON code for "(255 * x + a / 2) / a" type of unpremultiply. A basic benchmark from Galaxy Tab (1GHz ARM Cortex-A8):
    Small buffer (L1 cache) : 156.42 MPix/s
    Large buffer (memory)   : 130.54 MPix/s

Which means that even this naive implementation is already (barely) fast enough to be mostly limited by the available memory bandwidth on this device (and random pixels vs. zero filled source buffer does not make any significant difference). Though more headroom still could be useful.

The remaining things to do are just:
1. proper support for non multiple of 8 pixels buffer sizes
2. browser specific style of rounding (+254 offset).
(In reply to comment #40)
> There's a standard algorithm for generating such constants. See "N-Bit
> Unsigned Division via N-Bit Multiply-Add," In Proc. 17th IEEE Symposium on
> Computer Arithmetic (ARITH'05), pp. 131--139, June 2005.

Yes, I know. I have seen a number of similar papers related to replacing divisions with multiplications.

> The gist of it boils down to Algorithm 1 (reproduced here for those without
> access to the paper):
> 
> Inputs: uword d and n, with n>=1 and 1<=d<(1<<<n)
> int m := floor(log2(d));
> uword a, b;
> if d == (1<<m) then
>     a := (1<<n)-1;
>     b := (1<<n)-1;
> else
>     uword t := floor((1<<(m+n))/d);
>     uword r := (t*d+d) mod (1<<n)
>     if r <= (1<<m) then
>         a := t+1;
>         b := 0;
>     else
>         a := t;
>         b := t;
>     endif
> endif
> 
> Then the division is equivalent to ((a*x+b)>>(m+n)). In this case n==16. I
> don't know if this will work better than the constants you proposed. This
> method only requires one multiplication, instead of two, which will be the
> slowest step in NEON.

I specifically tried to tailor the algorithm to better fit the available NEON instructions that we have. The initial 8-bit long multiplication is almost free, because widening of 8-bit data to prepare it for 16-bit multiplication is still needed in any case.

As for the multiplications, they are reasonably fast with NEON. Peak performance is:
 1/8 cycle per multiplication for 8-bit data
 1/4 cycle per multiplication for 16-bit data
 1 cycle per multiplication for 32-bit data
Latency is not an issue if the code can be sufficiently unrolled and pipelined. And this is our case.

Having 8-bit multiplication followed by 16-bit multiplication is faster than a single 32-bit multiplication with NEON.

And I got these table constants by simple bruteforcing, there is no particular need for any clever algorithm in this particular case :)

> But the table needs values for a and b (16-bits each)
> and m (0 to 7). b is either the same as a, or 0, so in theory you could pack
> all three parameters into 3 bytes, but taking advantage of that may cost
> more computation than it saves cache misses.

Yes, it's still faster to have 4 bytes per table entry even though this wastes 1 byte of space.
(In reply to comment #46)
> If we do decide to change the rounding behaviour we might want to encourage
> Webkit to change to.

Do they have a bugtracker? Or what is the preferred way to contact them? It would be kind of funny if it turns out that they are also doing premultiplication in this weird way exactly because they are trying to be more compatible with Mozilla ;)
(In reply to comment #55)
> I specifically tried to tailor the algorithm to better fit the available
> NEON instructions that we have. The initial 8-bit long multiplication is
> almost free, because widening of 8-bit data to prepare it for 16-bit
> multiplication is still needed in any case.

I thought VMULL had a 5- to 6-cycle latency. You've only got 3 8-bit muls, so that's not enough to hide all the latency. It's not a lot of room to do much of anything fancier, I'll agree.

> Having 8-bit multiplication followed by 16-bit multiplication is faster than
> a single 32-bit multiplication with NEON.

Neither approach needs a 32-bit mul. They both use 16x16->32.

> Yes, it's still faster to have 4 bytes per table entry even though this
> wastes 1 byte of space.

I agree, I was just pointing out that with two 16-bit parameters and a 3-bit parameter, you needed _more_ than 4 bytes with my approach, unless you take advantage of the fact that b=={0 or a} (though that means more work at run-time).

In any case, this came out somewhat better than I was expecting. There's probably still a few cycles to be wrung out of it, for example by moving up the first vuzp (or maybe replacing the first two with a single vtrn? fewer instructions, but the same latency... you'd need something to stick in-between that and the third vuzp, though), as well as moving up the vld4/vmin's, to get better balance between the LS and NEON pipelines, and eliminating the use of q4-q7.

As for the vmin's themselves, that's not the behavior of the current C code (particularly for alpha==0 it passes the colors through unchanged). But I don't know if it's even possible to have "super-luminescent" values in our canvas data (and I certainly don't think the current behavior for alpha>0 is very useful).
Attached patch NEON patch v4Splinter Review
Updated multiply patch. Changed preload to 256 (I checked lots of values, that seems optimal), and does not create the premultiply table if we don't need it.

I don't understand if the recent discussion in the last few comments here is relevant to this patch, or to future work? Do we want to split things out into separate bugs perhaps?
Attachment #536969 - Attachment is obsolete: true
Comment on attachment 537179 [details] [diff] [review]
NEON patch v4

> +  "vadd.u16 q8, q2\n" // add 254&divide by 255: we calculate that as ((x+255) + ((x+255)>>8))>>8

Nit: spaces around "&" (or make it "and").

> +    "q2", "q8", "q9", "q10",
> +    "d0", "d1", "d2", "d3", "d4", "d5",
> +    "d16", "d17", "d18", "d19", "d20", "d21"

Since you clobber d0, d1, d2, d3, I think you need to inform gcc that you clobber q0, q1.

This looks good to me, although I'd really like to understand what's going on with +r not working before we check it in.

I'm not totally pleased with the fact that logic around the premultiply table passes through these two files so much, but I think the right way to work around this is to move this code into gfxUtils, where we don't have to worry about storing the premultiply table in a separate file from where we use it.  

Would you be interested in doing that, Alon?  It could be a separate patch, or a separate bug, if you'd like.

Let's move the discussion about the unpremultiply step into a separate bug, please, since the premultiply patch is useful on its own.
Attachment #537179 - Flags: feedback+
With +r it "optimizes" out the whole neon section...
But it seems to generate the right asm with +r and volatile.  Not sure why it needs a volatile.
(In reply to comment #61)
> But it seems to generate the right asm with +r and volatile.  Not sure why
> it needs a volatile.

Because gcc is dumb. I assume it does not interpret the "memory" clobber as meaning you actually wrote useful values to memory, just that you _might_ have changed something. You'd think not having any output operands would also get a block scheduled for removal by the optimizer, but I suppose that broke too much code for them to actually use that strategy. But removing the code once you give it output operands and then don't use the results apparently did not.
(In reply to comment #28)
> (In reply to comment #27)
> > As I read the ARM manual, vld4.8 only takes 32-bit alignment.  (Table 4-12,
> > "Permitted combinations of parameters".)  Am I misunderstanding what this
> > means?
> 
> Yes.

Thanks a lot for clarifying this.
(In reply to comment #57)
> > Having 8-bit multiplication followed by 16-bit multiplication is faster than
> > a single 32-bit multiplication with NEON.
> 
> Neither approach needs a 32-bit mul. They both use 16x16->32.

It may still be interesting to try 32-bit multiplications for non-SIMD code. The table only needs to be modified to contain single 32-bit values (c8 * c16) instead of pairs.

> In any case, this came out somewhat better than I was expecting. There's
> probably still a few cycles to be wrung out of it, for example by moving up
> the first vuzp (or maybe replacing the first two with a single vtrn? fewer
> instructions, but the same latency... you'd need something to stick
> in-between that and the third vuzp, though), as well as moving up the
> vld4/vmin's, to get better balance between the LS and NEON pipelines, and
> eliminating the use of q4-q7.

As I already mentioned, this code was just a demonstration of how this algorithm maps to NEON instructions, so readability was clearly preferred. It was intentionally not optimized at all, and there is a lot of room for improvement. Yes, it's possible to eliminate all the stalls for ARM Cortex-A9 and also at least partially dual-issue VLD/VST/VUZP/VMOVN instructions on ARM Cortex-A8.

Earlier I provided a link to the existing NEON premultiply code from pixman, it can be used as an example of how such loops can be optimized to get better use of dual-issue and hide instruction latencies for ARM NEON:
    http://cgit.freedesktop.org/pixman/tree/pixman/pixman-arm-neon-asm.S?id=pixman-0.22.0#n2202

The 'pixman_composite_src_pixbuf_8888_process_pixblock_tail_head' macro contains all the code from the inner loop, which gets executed per 8 pixels. Macro 'fetch_src_pixblock' expands to a single VLD4.8 instruction (alternatively it can be also expanded to a simple code chunk implementing nearest scaling fetcher, but that's not interesting for us here). The instructions with 'PF' prefix implement cross-scanline prefetching, which is useful when image width is much smaller than stride, and they normally don't cause any overhead because they are run on ARM pipeline.

So optimized premultiply function is quite trivial to implement (with any variant of rounding) using the existing macro templates. Unfortunately these templates can't be effectively used for unpremultiply function, because of the need to do parallel access to the source buffer using ARM instructions.

> As for the vmin's themselves, that's not the behavior of the current C code
> (particularly for alpha==0 it passes the colors through unchanged). But I
> don't know if it's even possible to have "super-luminescent" values in our
> canvas data (and I certainly don't think the current behavior for alpha>0 is
> very useful).

There are two tricky cases. One of them is the handling of "super-luminescent" values (a term which actually originates from Jim Blinn's book "Dirty Pixels", according to Soeren Sandmann [1]). These values can't be obtained as a result of premultiplication, but the unpremultiply function still has to do something sane when it gets a buffer that just happens to contain such values. Mozilla's current behaviour is to just happily overflow and return whatever gets calculated, modulo 256. That behaviour is quite tricky to emulate in SIMD-optimized code if it suddenly becomes a requirement (hopefully it will not). An alternative way of dealing with "super-luminescent" values is to just saturate the result to 255. That would probably produce a somewhat smaller difference from the original value after an unpremultiply/premultiply round-trip. Those VMIN instructions serve this purpose, but it is also perfectly fine to change VMOVN -> VQMOVN and get the same end result.

Another case is when alpha == 0. Division by zero is clearly unwanted. Normal premultiplied pixels also can't have a color component set to anything other than 0 when alpha is 0 (otherwise it would be just another variant of "super-luminescent" pixel). And any color component value converts to zero when premultiplying it with zero alpha, so technically we can't even be too wrong by returning any arbitrary value when unpremultiplying zero alpha. There are still some alternatives:

a) Return some constant (for example 0) with the code:

    if (a == 0)
        return 0;

b) Return the color component value itself:

    if (a == 0)
        return x;

c) Return 255 when alpha == 0 and color component != 0 just to be consistent with "super-luminescent" pixels handling:

    if (a == 0)
        return x > 0 ? 255 : 0;

All of these variants are possible with my variant of NEON unpremultiply code by just tweaking the table entry for alpha == 0 and getting rid of VMIN instructions. It would be:
a) { 0, 0 }
b) { 2, 32768 }
c) { 255, 65535 }

Regarding unpremultiply with the Mozilla-specific rounding, it's a bit more tricky and the algorithm needs some modification:

uint8_t mozunpremultiply_neon_ready(uint8_t x, uint8_t a)
{
    uint32_t tmp;
    /* a single lookup in a 1K size table */
    struct mozunpremultiply_consts consts = mozunpremultiply_consts_array[a];
    /* unsigned 8-bit long multiplication */
    tmp = x * consts.mul8;
    /* unsigned 16-bit long multiplication */
    tmp = tmp * consts.mul16;
    /* add some offset and shift the value */
    tmp = (tmp + consts.add8) >> 16;
    /* saturate instead of using modulo 256 */
    return tmp <= 255 ? tmp : 255;
}

There is a need for the new 8-bit constant, but we conveniently have a vacant space for it in the lookup table. The operation "(tmp + consts.add8) >> 16" maps to VADDHN.U32 instruction, so it does not become a performance disaster. If you would like to have a look at it, I can provide a complete NEON code. Also not fully performance optimized yet.


1. http://www.mail-archive.com/pixman@lists.freedesktop.org/msg00465.html
(In reply to comment #59)
> Let's move the discussion about the unpremultiply step into a separate bug,
> please, since the premultiply patch is useful on its own.

I'm not so sure about this. The premultiply and unpremultiply operations are closely related and need to use the same math. If that was not important, then we would have already almost no work to do here. Mozilla's sources already contain NEON optimized premultiply function (see comment 1 and comment 22), it would just need to be called from the right place and we are done.

So instead I propose:
1. Add NEON optimized premultiply/unpremultiply functions using existing Mozilla's rounding (tweaking unpremultiply to use saturation instead of modulo 256). And make sure that they are used for both canvas *and* gfxUtils.cpp (thus also solving bug 534215)
2. Create a new bug about changing premultiply/unpremultiply rounding behaviour. Because current rounding seems wrong and round-trip unpremultiply/premultiply requirement is fulfilled by just having two bugs, which happen to cancel each other.
(In reply to comment #65)
> > Let's move the discussion about the unpremultiply step into a separate bug,
> > please, since the premultiply patch is useful on its own.
> 
> I'm not so sure about this. The premultiply and unpremultiply operations are

No, I agree with jlebar here. They need to use the same math, yes, but the current C code for the unmultiply step already uses compatible math with the proposed premultiply asm. Adding unmultiply asm is a separate issue, and should be tracked in its own bug.
Register q2 is initialized with "vmov.u16 q2, #255". The inner loop of the
NEON code looks as follows (different instruction streams use different
indentation for better readability):

        vshr.u16    q11, q8,  #8
        vswp        d3,  d31
        vshr.u16    q12, q9,  #8
        vshr.u16    q13, q10, #8
    vld4.8 {d0, d1, d2, d3}, [SRC]!
        vaddhn.u16  d30, q11, q8
        vaddhn.u16  d29, q12, q9
    vrev32.16   q8,  q2
        vaddhn.u16  d28, q13, q10
    vrev32.16   q9,  q2
    vmlal.u8    q8,  d3, d0
    vrev32.16   q10, q2
    vmlal.u8    q9,  d3, d1
    vmlal.u8    q10, d3, d2
        vst4.8 {d28, d29, d30, d31}, [DST_W, :128]!
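
For reference, here is a rough scalar C model of what each iteration above computes per pixel. This is just a sketch to make the math explicit (it is not code from any patch): each colour channel gets the (value + 255) rounding and the result is written out in B,G,R,A byte order.

#include <stdint.h>

/* Scalar model of one pixel of the NEON loop above (illustrative only):
 * q8/q9/q10 start at 255 in every 16-bit lane, vmlal.u8 adds a * channel,
 * and the vshr.u16 + vaddhn.u16 pair computes (v + (v >> 8)) >> 8. */
static void premultiply_pixel_model(const uint8_t rgba[4], uint8_t bgra[4])
{
    uint8_t a = rgba[3];
    for (int c = 0; c < 3; c++) {
        uint32_t v = 255u + (uint32_t)a * rgba[c];     /* vmov #255 + vmlal.u8 */
        bgra[2 - c] = (uint8_t)((v + (v >> 8)) >> 8);  /* vshr + vaddhn        */
    }
    bgra[3] = a;                                       /* vswp d3, d31         */
}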

And the synthetic benchmark numbers are the following:

=== premultiply code, based on pixman template ===

  Cortex-A8 1GHz (Galaxy Tab):
    Small buffer (L1 cache) : 375.64 MPix/s
    Large buffer (memory)   : 163.11 MPix/s

  Cortex-A9 1GHz (early prototype board with slow RAM):
    Small buffer (L1 cache) : 291.43 MPix/s
    Large buffer (memory)   : 63.66 MPix/s

=== NEON patch v4 from Alon Zakai ===

  Cortex-A8 1GHz (Galaxy Tab):
    Small buffer (L1 cache) : 285.14 MPix/s
    Large buffer (memory)   : 152.73 MPix/s

  Cortex-A9 1GHz (early prototype board with slow RAM):
    Small buffer (L1 cache) : 217.60 MPix/s
    Large buffer (memory)   : 63.96 MPix/s


This all shows that memory bandwidth on ARM devices is quite far from being
able to keep up with the performance of the NEON unit. And the same applies
to almost all basic non-scaled 2D graphics operations.

BTW, I placed an order for http://www.origenboard.org/ some time ago (it should have mostly the same hardware as the Galaxy S2). Let's see if it lives up to its promise of providing "high bandwidth DDR3". That's especially interesting, considering that the NEON unit was downgraded in Cortex-A9.
(In reply to comment #66)
> No, I agree with jlebar here. They need to use the same math, yes, but the
> current C code for the unmultiply step already uses compatible math with the
> proposed premultiply asm. Adding unmultiply asm is a separate issue, and
> should be tracked in its own bug.

I can agree only on one condition: even if those bugs are tracked separately, they need to be solved at the same time, and we would be better off doing this *right now* so that we don't waste extra time getting back to them months later. There are already enough unattended bugs technically related to the same issue:
- bug 534215 ('Optimize WebGL premultiply')
- bug 594883 ('Speed up ARGB premultiply in PNG decoder ')
- bug 633467 ('2 Premultiply tables both gfxUtil and nsCanvasRenderingContext2D')

And from this list, bug 594883 is quite interesting because it shows that there is *one more* way of doing premultiplication (+0 offset) in the Mozilla source code:
    http://mxr.mozilla.org/mozilla2.0/source/gfx/thebes/gfxColor.h#129
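
Just to illustrate what the offset difference means in practice (a rough sketch based on the descriptions in this bug, not the actual gfxColor.h code), compare:

#include <stdint.h>

/* Rounding with the +255 offset, the flavour used by the canvas patch in this bug. */
static uint8_t premultiply_plus255(uint8_t x, uint8_t a)
{
    uint32_t v = (uint32_t)x * a + 255;
    return (uint8_t)((v + (v >> 8)) >> 8);
}

/* Illustrative "+0 offset" variant: here x = 255, a = 255 already comes out
 * as 254 instead of 255, which is why the rounding choice matters. */
static uint8_t premultiply_plus0(uint8_t x, uint8_t a)
{
    uint32_t v = (uint32_t)x * a;
    return (uint8_t)((v + (v >> 8)) >> 8);
}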

According to your suggestion, I have submitted two new bugs:
- bug 662130 ('Wrong rounding used for alpha (un)premultiply operations in Canvas 2D and WebGL')
- bug 662134 ('Implement ARM NEON optimization for alpha unmultiply operation for Canvas 2D and WebGL')

The only special thing about this particular bug 659725 is that it got into the spotlight, so competent people are readily available to comment and review patches. This unfortunately does not appear to be the case for the other closely related bugs, and they don't get enough attention.
Comment on attachment 537179 [details] [diff] [review]
NEON patch v4

Very impressive results Siarhei!

Based on them I am marking my patch as obsolete. Siarhei's patch looks like the way to go. Is it ready for review? If so let's mark it.
Attachment #537179 - Attachment is obsolete: true
(In reply to comment #69)
> Siarhei's patch looks like the way to go. Is it ready for review?
> If so let's mark it.

My patch only shows that it is still possible to improve instruction scheduling for the premultiply code, but it will not show any substantial practical improvement unless it is used on L1/L2-cached data. Which means that the performance of your patch is also fine for the practical usage. As the performance is memory bandwidth limited, the best that can be done is to avoid walking over large pixel buffers multiple times and to process as much as possible in a single pass.

The biggest problem with my patch is the reliance on the header file from pixman, which is not suitable for the unmultiply function anyway. That's why I think it is better to write new ARM NEON macro-based code specifically for premultiply/unmultiply (see the interface sketch at the end of this comment), with support for:
1. arbitrary ARGB color components layout for both source and destination
2. different variants of rounding
3. optional checking whether there are any non-opaque pixels (needed for PNG in bug 594883)

As soon as such a mini library with a collection of optimized routines is ready, it can be used to solve the whole class of premultiply/unmultiply-related performance problems.

And the part of your patch which moves the premultiply functionality into a separate file (not only the assembly, but also the generic C implementation) is actually the right way to go, in my opinion.
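
To make this proposal a bit more concrete, here is a purely hypothetical sketch of what the entry point of such a mini library could look like. None of these names exist anywhere yet; they just map the three requirements above to parameters:

#include <stddef.h>
#include <stdint.h>
#include <stdbool.h>

/* Hypothetical pixel layouts and rounding modes (requirements 1 and 2). */
typedef enum { PIXEL_LAYOUT_RGBA, PIXEL_LAYOUT_BGRA, PIXEL_LAYOUT_ARGB } pixel_layout_t;
typedef enum { ROUND_OFFSET_0, ROUND_OFFSET_255 } round_mode_t;

/* Premultiplies n pixels from src into dst, converting between layouts.
 * Returns true if at least one non-opaque pixel was seen, which is the
 * check the PNG decoder needs in bug 594883 (requirement 3). */
bool premultiply_pixels(uint8_t *dst, pixel_layout_t dst_layout,
                        const uint8_t *src, pixel_layout_t src_layout,
                        round_mode_t rounding, size_t n);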
>  Which means that the performance of your patch is also fine for the practical 
> usage.

I agree.  The interleaved scheduling is clever, but our lives will be easier if we don't rely on all that magic.
Comment on attachment 537179 [details] [diff] [review]
NEON patch v4

Ok, cool.

Asking for review on my patch then.
Attachment #537179 - Attachment is obsolete: false
Attachment #537179 - Flags: review?(bas.schouten)
(In reply to comment #71)
> I agree.  The interleaved scheduling is clever,

That's called "software pipelining". There are many pages around in the Internet describing it, but some basic information is even available from wikipedia:
    http://en.wikipedia.org/wiki/Software_pipelining

In pixman we use simple 2-stage pipelining (stage 1 is called 'head' and stage 2 is called 'tail' for convenience). Processing of each pixel spans two loop iterations, which is also useful when dealing with non-uniform latencies for different instructions, where simple unrolling would just fail.

> but our lives will be easier if we don't rely on all that magic.

It would be very easy to also add software pipelining to Alon's code, especially considering that it does not have to handle block sizes that are not multiples of 8 pixels.
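
For anyone unfamiliar with the idea, here is a rough C sketch of the 2-stage head/tail structure. The stage functions are placeholders standing in for the real NEON load/compute and store sequences; nothing here is from pixman or from any patch:

#include <stdint.h>

#define BLOCK_PIXELS 8

typedef struct { uint32_t px[BLOCK_PIXELS]; } block_t;

/* Placeholder stage 1 ('head'): issue the loads / long-latency work. */
static block_t head(const uint32_t *src)
{
    block_t b;
    for (int i = 0; i < BLOCK_PIXELS; i++)
        b.px[i] = src[i];
    return b;
}

/* Placeholder stage 2 ('tail'): finish the arithmetic and store. */
static void tail(const block_t *b, uint32_t *dst)
{
    for (int i = 0; i < BLOCK_PIXELS; i++)
        dst[i] = b->px[i];
}

/* Two-stage software pipelined loop (nblocks >= 1): the head of block i
 * overlaps with the tail of block i-1, so each block spans two iterations. */
static void pipelined_convert(uint32_t *dst, const uint32_t *src, int nblocks)
{
    block_t cur = head(src);
    src += BLOCK_PIXELS;
    for (int i = 1; i < nblocks; i++) {
        block_t next = head(src);       /* start block i          */
        src += BLOCK_PIXELS;
        tail(&cur, dst);                /* finish block i-1       */
        dst += BLOCK_PIXELS;
        cur = next;
    }
    tail(&cur, dst);                    /* finish the last block  */
}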
> That's called "software pipelining".

I took a class a while back from Monica Lam, who loves this stuff.  I guess on ARM, where we have details about how the microarchitecture works, it might be feasible to take NEON compiler intrinsics and automatically software pipeline them in a reasonable way.

In fact, gcc even has -fmodulo-sched.  But I'm not sure it does anything.  And anyway GCC totally fails at NEON intrinsics.  Guess we have to continue relying on our brains for some time more.  :)
(In reply to comment #72)
> Comment on attachment 537179 [details] [diff] [review] [review]
> NEON patch v4
> 
> Ok, cool.
> 
> Asking for review on my patch then.

+  "bne 1b\n"
+  :
+  : [src] "r" (src), [dst] "r" (dst), [last] "r" (last)
+  : "cc", "memory",
+    "q2", "q8", "q9", "q10",
+    "d0", "d1", "d2", "d3", "d4", "d5",
+    "d16", "d17", "d18", "d19", "d20", "d21"
+  );

Modifying registers from the input operands list is wrong, even if gcc seems to misbehave when you put them in as "+r" output operands. Just try changing this inline assembly block to "asm volatile" and using earlyclobber "+&r" operands instead of "+r". I'm not sure whether they are really supposed to make any difference in this particular case or whether some gcc bug gets exposed, but if, for example, you use:
    int copy_of_src = src;
    asm (
        " ... do something ..."
        : [src] "+r" (src)
        : [copy_of_src] "r" (copy_of_src)
        : "cc", "memory");
Then gcc is perfectly fine with assigning the same register to the "src" and "copy_of_src" variables. So in order to be on the safe side, it's better to use the earlyclobber modifier (&).
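
In other words, something along these lines (a minimal sketch, not the actual patch; the loop body is just a plain copy, and the real premultiply/reorder instructions would go where the comment is):

#include <stdint.h>

/* The pointers that the asm post-increments are declared as earlyclobber
 * read-write outputs ("+&r") instead of plain inputs, and the block is
 * marked volatile. Assumes (last - src) is a non-zero multiple of 32 bytes. */
static void convert_block_neon(uint8_t *dst, const uint8_t *src, const uint8_t *last)
{
    asm volatile (
        "1:\n"
        "vld4.8   {d0, d1, d2, d3}, [%[src]]!\n"
        /* ... premultiply / channel reorder instructions go here ... */
        "vst4.8   {d0, d1, d2, d3}, [%[dst]]!\n"
        "cmp      %[src], %[last]\n"
        "bne      1b\n"
        : [src] "+&r" (src), [dst] "+&r" (dst)  /* modified by the asm */
        : [last] "r" (last)                     /* read-only input     */
        : "cc", "memory", "d0", "d1", "d2", "d3"
    );
}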
Bas, review ping?
3-month review ping?
Comment on attachment 537179 [details] [diff] [review]
NEON patch v4

Review of attachment 537179 [details] [diff] [review]:
-----------------------------------------------------------------

Derf, could you look at this first? I'm not comfortable enough with ARM to be sure this is exactly right.
Attachment #537179 - Flags: review?(bas.schouten) → review?(tterribe)
Comment on attachment 537179 [details] [diff] [review]
NEON patch v4

Review of attachment 537179 [details] [diff] [review]:
-----------------------------------------------------------------

The actual NEON code looks fine.

r+ with a couple of minor nits.

There's also now an nsCanvasRenderingContext2DAzure, which duplicates much of the code from nsCanvasRenderingContext2D. That should probably be updated to use this, as well.

::: content/canvas/src/convert_from_canvas_rgba.cpp
@@ +44,5 @@
> +PRBool need_premultiply_table()
> +{
> +#if defined(MOZILLA_MAY_SUPPORT_NEON) && defined(IS_LITTLE_ENDIAN)
> +  return !mozilla::supports_neon();
> +#endif

Could I get an #else here that wraps the non-NEON case?
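
Something like this would do (just a sketch; returning PR_TRUE on the non-NEON path is my assumption about what the fallback should be):

PRBool need_premultiply_table()
{
#if defined(MOZILLA_MAY_SUPPORT_NEON) && defined(IS_LITTLE_ENDIAN)
  return !mozilla::supports_neon();
#else
  return PR_TRUE;
#endif
}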

@@ +55,5 @@
> +  value = value + 255;
> +  return (value + (value >> 8)) >> 8;
> +}
> +
> +void convert_from_canvas_rgba_noarch(PRUint8 *dst, PRUint8 const *src, int n, PRUint8 const (*premultiplyTable)[256])

This function should be static (else you'll have to check n%4==0 here, too).

@@ +99,5 @@
> +  }
> +}
> +
> +#if defined(MOZILLA_MAY_SUPPORT_NEON) && defined(IS_LITTLE_ENDIAN)
> +void convert_from_canvas_rgba_neon_littleendian(PRUint8 *dst, PRUint8 const *src, int n, PRUint8 const (*premultiplyTable)[256])

Same here.

@@ +139,5 @@
> +  "bne 1b\n"
> +  :
> +  : [src] "r" (src), [dst] "r" (dst), [last] "r" (last)
> +  : "cc", "memory",
> +    "q2", "q8", "q9", "q10",

I'm not actually clear on gcc's precise rules, but either you don't need any of the q register names, or you also need q0 and q1.
Attachment #537179 - Flags: review?(tterribe) → review+
Is this optimization still meaningful?
Flags: needinfo?(azakai)
No idea, sorry - been 3 years since I was involved with that code.
Flags: needinfo?(azakai)

Tentatively closing this.

Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → INCOMPLETE