Closed Bug 885496 Opened 7 years ago Closed 6 years ago

Optimize KissFFT with NEON instructions

Categories

(Core :: Web Audio, defect)

Platform: x86, macOS
Priority: Not set

Tracking


RESOLVED WONTFIX

People

(Reporter: ehsan, Assigned: jwwang)

References

Details

(Keywords: perf, Whiteboard: [blocking-webaudio-])

Attachments

(4 files, 2 obsolete files)

Depends on: 882543
No longer depends on: 884253
Is there a conventional Mozilla way to apply such a foreign patch?

Should I create a patch by extracting the useful parts and discarding those we don't need (e.g., README.swe and kissfft.gyp)?

Should we also add NEON implementations of kf_bfly3() and kf_bfly5()?

What we usually end up doing is to keep the patch at the root of the directory containing the library, and add a line to the 'update.sh' script to apply the patch after having pulled from upstream. See [1] for an example of a library where we keep a patch on top of upstream.

Not sure about the kf_bfly3() and kf_bfly5() implementation, probably not worth it if we don't call them.

[1]: http://mxr.mozilla.org/mozilla-central/source/media/libsoundtouch/
Attachment #773039 - Flags: review?(ehsan)
Attached patch [Part 2] Update update.sh. (obsolete) — Splinter Review
It is kinda mind-bending to have a patch (part 2) which will create another patch which shares a common hunk (modification to kiss_fft.c) with the previous patch (part 1).
Attachment #773041 - Flags: review?(ehsan)
Comment on attachment 773039 [details] [diff] [review]
[Part 1] Optimize KissFFT with NEON instructions.

Review of attachment 773039 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good, although I'm not really familiar with NEON myself.  Tim, do you mind having a quick look over the NEON code to see if there is anything wrong there?
Attachment #773039 - Flags: review?(ehsan) → review?(tterribe)
Attachment #773041 - Flags: review?(ehsan) → review+
(In reply to jwwang from comment #4)
> Created attachment 773041 [details] [diff] [review]
> [Part 2] Update update.sh.
> 
> It is kinda mind-bending to have a patch (part 2) which will create another
> patch which shares a common hunk (modification to kiss_fft.c) with the
> previous patch (part 1).

It's true.  :-)
Assignee: nobody → jwwang
(In reply to :Ehsan Akhgari (needinfo? me!) from comment #5)
> Looks good, although I'm not really familiar with NEON myself.  Tim, do you
> mind having a quick look over the NEON code to see if there is anything
> wrong there?

I don't mind, but I've got a bunch of reviews in my queue, so it will likely be a few days before I can get to this.
(In reply to comment #7)
> (In reply to :Ehsan Akhgari (needinfo? me!) from comment #5)
> > Looks good, although I'm not really familiar with NEON myself.  Tim, do you
> > mind having a quick look over the NEON code to see if there is anything
> > wrong there?
> 
> I don't mind, but I've got a bunch of reviews in my queue, so it will likely be
> a few days before I can get to this.

Tim: ping? :)
Comment on attachment 773039 [details] [diff] [review]
[Part 1] Optimize KissFFT with NEON instructions.

Review of attachment 773039 [details] [diff] [review]:
-----------------------------------------------------------------

r- for the lack of runtime detection. The actual asm could also certainly be a lot better.

::: media/kiss_fft/kiss_fft.c
@@ +17,5 @@
>  /* The guts header contains all the multiplication and addition macros that are defined for
>   fixed or floating point complex numbers.  It also delares the kf_ internal functions.
>   */
>  
> +#ifdef HAVE_ARM_NEON

We need run-time NEON detection, even on ARMv7 devices.
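As a sketch of what call-site dispatch could look like (function names here are hypothetical, and the detection is stubbed with a compile-time check; the real check must happen at run time):

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical run-time dispatch in kiss_fft.c: call the NEON butterfly
 * only when the CPU actually has NEON, otherwise fall back to the
 * portable C routine. */
typedef void (*bfly2_fn)(void *Fout, size_t fstride, const void *tw, int m);

static int neon_called, c_called;

static void kf_bfly2_neon(void *Fout, size_t fstride, const void *tw, int m)
{ (void)Fout; (void)fstride; (void)tw; (void)m; neon_called = 1; }
static void kf_bfly2_c(void *Fout, size_t fstride, const void *tw, int m)
{ (void)Fout; (void)fstride; (void)tw; (void)m; c_called = 1; }

/* Stub: a real implementation would query the OS/CPU at run time
 * (e.g. getauxval(AT_HWCAP) on Linux), not a compile-time macro. */
static int have_neon(void) {
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    return 1;
#else
    return 0;
#endif
}

static void kf_bfly2_dispatch(void *Fout, size_t fstride, const void *tw, int m) {
    bfly2_fn fn = have_neon() ? kf_bfly2_neon : kf_bfly2_c;
    fn(Fout, fstride, tw, m);
}
```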

::: media/kiss_fft/kiss_fft_bfly2_neon.s
@@ +1,2 @@
> +/*
> +* Copyright (c) 2013, The Linux Foundation. All rights reserved.

Has anyone pinged licensing@mozilla.org to make sure this is okay? I know BSD code is generally fine, but we may need to update about:license, etc.

@@ +29,5 @@
> +
> +@ NEON optimized assembly routine of kf_bfly2()
> +
> +    .text
> +    .fpu neon

You need to set .arch to something with NEON, and .object_arch to some lowest-common-denominator thing to, e.g., make sure this works when building an ARMv6 binary that might also be run on a device with NEON. See gfx/ycbcr/yuv_row_arm.s for an example.

@@ +32,5 @@
> +    .text
> +    .fpu neon
> +    .align 4
> +    .global     kf_bfly2
> +    .func       kf_bfly2

Need a .fnstart to generate an unwind table entry.

@@ +35,5 @@
> +    .global     kf_bfly2
> +    .func       kf_bfly2
> +
> +kf_bfly2:
> +    stmdb            sp!, {r4, r5, r6, r7, r8, r9, sl, fp, lr}

You should always push an even number of registers to keep the stack pointer 8-byte aligned. In this case, r9, sl, fp, and lr are unused. You need to use lr to be able to pop the return address into pc. You aren't using r12, which is caller-saved. Ditch r9, sl, and fp, replace r7 with r12 (which you don't need to save here), and r8 with lr. Then you're saving just 4 registers (r4, r5, r6, lr). You can get rid of r6 by following the advice below. You could also get rid of r4 by decrementing r3 by 4 at a time instead.

@@ +36,5 @@
> +    .func       kf_bfly2
> +
> +kf_bfly2:
> +    stmdb            sp!, {r4, r5, r6, r7, r8, r9, sl, fp, lr}
> +@    vstmdb           sp!, {d8-d15}

If we don't need this code, just delete it.

@@ +41,5 @@
> +                                                @ r0 - Fout| r1 - fstride | r2 - st | r3 - m
> +    pld             [r0, #0]
> +    mov             r8, r3, asl #3              @ convert m into bytes count (m*8)
> +    add             r5, r0, r8                  @ Fout2 = Fout + m;
> +    add             r6, r2, #264                @ tw1 = st->twiddles

This is very fragile. You should just pass st->twiddles directly.
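The magic 264 here is offsetof(struct kiss_fft_state, twiddles) under the upstream layout. A short C check (a sketch, assuming upstream's MAXFACTORS of 32 and a 4-byte int) shows why hard-coding it is fragile: any change to the struct silently breaks the asm.

```c
#include <assert.h>
#include <stddef.h>

#define MAXFACTORS 32
typedef struct { float r; float i; } kiss_fft_cpx;

/* Upstream kiss_fft's internal state layout (from _kiss_fft_guts.h). */
struct kiss_fft_state {
    int nfft;
    int inverse;
    int factors[2 * MAXFACTORS];
    kiss_fft_cpx twiddles[1];
};
```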

@@ +59,5 @@
> +    vld2.32         {d16-d19}, [r6], r7         @ load *tw1; tw1 += (fstride*4);
> +    pld             [r6, #0]                    @ preload next tw1
> +    vmul.f32        q2, q10, q12                @ C_MUL (t,  *Fout2 , *tw1);
> +    vmul.f32        q3, q11, q13
> +    vsub.f32        q8, q2, q3

This is what VMLS was made for.

@@ +62,5 @@
> +    vmul.f32        q3, q11, q13
> +    vsub.f32        q8, q2, q3
> +    vmul.f32        q2, q10, q13
> +    vmul.f32        q3, q11, q12
> +    vadd.f32        q9, q2, q3

This is what VMLA was made for.
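For reference, C_MUL is the usual complex product; the second multiply in each pair is exactly a fused multiply-subtract/multiply-add, which VMLS/VMLA provide in one instruction each. A scalar C sketch of the identity:

```c
#include <assert.h>

typedef struct { float r, i; } cpx;

/* C_MUL(m, a, b): (a.r + a.i*j) * (b.r + b.i*j).
 * On NEON: m.r = VMUL(a.r, b.r) then VMLS by (a.i, b.i);
 *          m.i = VMUL(a.r, b.i) then VMLA by (a.i, b.r). */
static cpx c_mul(cpx a, cpx b) {
    cpx m;
    m.r = a.r * b.r - a.i * b.i;   /* VMUL + VMLS */
    m.i = a.r * b.i + a.i * b.r;   /* VMUL + VMLA */
    return m;
}
```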

@@ +64,5 @@
> +    vmul.f32        q2, q10, q13
> +    vmul.f32        q3, q11, q12
> +    vadd.f32        q9, q2, q3
> +
> +    vld2.32         {d0-d3}, [r0]               @ load *Fout;

You should start this load earlier. As it is, it is guaranteed to cause a 2-cycle stall, at a minimum. Putting it after the first mul is probably best.

@@ +82,5 @@
> +@.kf_bfly2_process_remaining:
> +    asr             r8, r3, #31
> +    lsr             r7, r8, #30
> +    add             r4, r7, r3
> +    ands            r3, r4, #3                  @ if (k % 4 == 0)

This whole thing can just be replaced by ands r3, r3, #3. m is always non-negative.
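The asr/lsr/add sequence is the compiler's signed-remainder idiom; since m is non-negative, m % 4 reduces to a single AND. A C sketch of the equivalence:

```c
#include <assert.h>

/* For m >= 0, m % 4 == (m & 3); the rounding correction performed by the
 * asr/lsr/add sequence only matters for negative m. */
static int rem4_mask(int m) { return m & 3; }
```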

@@ +83,5 @@
> +    asr             r8, r3, #31
> +    lsr             r7, r8, #30
> +    add             r4, r7, r3
> +    ands            r3, r4, #3                  @ if (k % 4 == 0)
> +    beq             .kf_bfly2_done

Instead of branching, just ldmeqfd here.

@@ +92,5 @@
> +                                                @ float32x4x2_t t;          d4 {s8,s9}
> +
> +
> +.bfly2_do_while1:                               @ do { //process 1 sample per iteration
> +    vld1.32         {d2}, [r5]                  @ load *Fout2;{s16,s17}

The comment is wrong.

@@ +93,5 @@
> +
> +
> +.bfly2_do_while1:                               @ do { //process 1 sample per iteration
> +    vld1.32         {d2}, [r5]                  @ load *Fout2;{s16,s17}
> +    vld1.32         {d3}, [r6], r1              @ load *tw1; tw1 += (fstride);{s24,s25}

The comment is wrong.

@@ +99,5 @@
> +    vmul.f32        d1, d2, d3                  @ @ C_MUL (t,  *Fout2 , *tw1);
> +    vsub.f32        s8, s2, s3
> +    vmul.f32        s2, s4, s7
> +    vmul.f32        s3, s5, s6
> +    vadd.f32        s9, s2, s3

VMLA

@@ +101,5 @@
> +    vmul.f32        s2, s4, s7
> +    vmul.f32        s3, s5, s6
> +    vadd.f32        s9, s2, s3
> +
> +    vld1.32         {d0}, [r0]                  @ load *Fout;

See above about moving this up.

@@ +114,5 @@
> +    subs            r3, r3, #1                  @ }while(--k);
> +    bne             .bfly2_do_while1
> +
> +.kf_bfly2_done:
> +@    vldmia           sp!, {d8-d15}

If we don't need this code, just delete it.

@@ +116,5 @@
> +
> +.kf_bfly2_done:
> +@    vldmia           sp!, {d8-d15}
> +    ldmia            sp!, {r4, r5, r6, r7, r8, r9, sl, fp, pc}
> +    nop

What's the nop for?

@@ +118,5 @@
> +@    vldmia           sp!, {d8-d15}
> +    ldmia            sp!, {r4, r5, r6, r7, r8, r9, sl, fp, pc}
> +    nop
> +
> +    .endfunc

Need a .fnend to generate an unwind table entry. Also, you should provide a .size entry so that debuggers can identify this function in crash reports, etc.
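Putting the directives together, a function skeleton in the style of gfx/ycbcr/yuv_row_arm.s might look like this (a sketch; the exact register list follows the comments above):

```
    .arch armv7-a
    .fpu neon
    .object_arch armv4t      @ keep the object file linkable into ARMv6 builds
    .text
    .align 4
    .global kf_bfly2
    .type   kf_bfly2, %function
kf_bfly2:
    .fnstart
    stmdb   sp!, {r4, r5, r6, lr}
    @ ... body ...
    ldmia   sp!, {r4, r5, r6, pc}
    .fnend
    .size   kf_bfly2, . - kf_bfly2
```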

::: media/kiss_fft/kiss_fft_bfly4_neon.s
@@ +1,1 @@
> +/*

Assume unless otherwise specified that every comment on kf_bfly2 applies to this file as well.

@@ +35,5 @@
> +    .global     kf_bfly4
> +    .func       kf_bfly4
> +
> +kf_bfly4:
> +    stmdb           sp!, {r4, r5, r6, r7, r8, r9, r10, r11, r12, lr}

You aren't using lr. Replace r11 with lr and don't save r12. Two fewer registers to save/restore. You can eliminate r4 and r6 as above, as well.

@@ +39,5 @@
> +    stmdb           sp!, {r4, r5, r6, r7, r8, r9, r10, r11, r12, lr}
> +@   vstmdb          sp!, {d8-d15}
> +                                                @ r0 - Fout| r1 - fstride | r2 - st | r3 - m
> +    pld             [r0, #0]
> +    mov             r5, r3

Don't make a copy and then destructively modify the source. Swap the roles of r3 and r5 instead.

@@ +45,5 @@
> +    add             r6, r2, #264                @ tw1 = st->twiddles
> +    pld             [r6, #0]
> +    mov             r7, r6                      @ tw2 = st->twiddles
> +    mov             r8, r7                      @ tw3 = st->twiddles
> +    ldr             r2, [r2, #4]                @ st->inverse

You should not be branching in an inner loop based on a value fixed outside that loop. Make a separate function for the inverse. That also means you can avoid passing in st without using more than 4 parameters. Assemble it with macros if you're worried about code duplication.
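A call-site split along these lines (names hypothetical; the real patch could generate both bodies from one assembler macro) keeps the branch out of the inner loop and drops the st parameter entirely:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical: two specialized butterflies instead of one that tests
 * st->inverse on every pass; the caller picks once, outside the hot loop. */
static int fwd_calls, inv_calls;

static void kf_bfly4_fwd(float *Fout, size_t fstride, const float *tw, int m)
{ (void)Fout; (void)fstride; (void)tw; (void)m; fwd_calls++; }
static void kf_bfly4_inv(float *Fout, size_t fstride, const float *tw, int m)
{ (void)Fout; (void)fstride; (void)tw; (void)m; inv_calls++; }

static void kf_bfly4(float *Fout, size_t fstride, const float *tw, int m, int inverse) {
    if (inverse)
        kf_bfly4_inv(Fout, fstride, tw, m);
    else
        kf_bfly4_fwd(Fout, fstride, tw, m);
}
```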

@@ +60,5 @@
> +    beq             .kf_bfly4_do_while1         @ if(k==0)
> +
> +.kf_bfly4_do_while4:                            @ do { //process 4 samples per iteration
> +    add             r11, r0, r3                 @ fom = Fout+m;
> +    mov             r12, r11

Instead of making this copy, you can recompute the value below just before you need it, which allows you to keep it in r11 instead of r12, saving you another register.

@@ +66,5 @@
> +    vld1.32         {d20}, [r6], r1             @ rtwd1 = vld1_f32((const float32_t*)tw1); tw1 += fstride;
> +    vld1.32         {d21}, [r6], r1             @ rtwd2 = vld1_f32((const float32_t*)tw1); tw1 += fstride;
> +    vld1.32         {d22}, [r6], r1             @ rtwd3 = vld1_f32((const float32_t*)tw1); tw1 += fstride;
> +    vld1.32         {d23}, [r6], r1             @ rtwd4 = vld1_f32((const float32_t*)tw1); tw1 += fstride;
> +    vuzp.32         q10, q11                    @ scratch3 = vuzpq_f32(vcombine_f32(rtwd1, rtwd2), vcombine_f32(rtwd3, rtwd4));

Loading into d20, d22, d21, then d23 and using vtrn instead of vuzp will save you a cycle. The same applies for the other twiddles below.

@@ +67,5 @@
> +    vld1.32         {d21}, [r6], r1             @ rtwd2 = vld1_f32((const float32_t*)tw1); tw1 += fstride;
> +    vld1.32         {d22}, [r6], r1             @ rtwd3 = vld1_f32((const float32_t*)tw1); tw1 += fstride;
> +    vld1.32         {d23}, [r6], r1             @ rtwd4 = vld1_f32((const float32_t*)tw1); tw1 += fstride;
> +    vuzp.32         q10, q11                    @ scratch3 = vuzpq_f32(vcombine_f32(rtwd1, rtwd2), vcombine_f32(rtwd3, rtwd4));
> +    vld2.32         {d0-d3}, [r11], r3          @ rfout = vld2q_f32((const float32_t*)(fom1)); fom2 = Fout+m2;

Do this load before the VUZP (VTRN).

@@ +70,5 @@
> +    vuzp.32         q10, q11                    @ scratch3 = vuzpq_f32(vcombine_f32(rtwd1, rtwd2), vcombine_f32(rtwd3, rtwd4));
> +    vld2.32         {d0-d3}, [r11], r3          @ rfout = vld2q_f32((const float32_t*)(fom1)); fom2 = Fout+m2;
> +    vmul.f32        q2, q0, q10                 @ C_MUL_NEON(scratch0, rfout, scratch3);
> +    vmul.f32        q3, q1, q11
> +    vsub.f32        q12, q2, q3

VMLS

@@ +73,5 @@
> +    vmul.f32        q3, q1, q11
> +    vsub.f32        q12, q2, q3
> +    vmul.f32        q2, q0, q11
> +    vmul.f32        q3, q1, q10
> +    vadd.f32        q13, q2, q3

VMLA

@@ +80,5 @@
> +    vld1.32         {d20}, [r7], r9             @ rtwd1 = vld1_f32((const float32_t*)tw2); tw2 += fstride*2;
> +    vld1.32         {d21}, [r7], r9             @ rtwd2 = vld1_f32((const float32_t*)tw2); tw2 += fstride*2;
> +    vld1.32         {d22}, [r7], r9             @ rtwd3 = vld1_f32((const float32_t*)tw2); tw2 += fstride*2;
> +    vld1.32         {d23}, [r7], r9             @ trtwd4 = vld1_f32((const float32_t*)tw2); tw2 += fstride*2;
> +    vuzp.32         q10, q11                    @ scratch3 = vuzpq_f32(vcombine_f32(rtwd1, rtwd2), vcombine_f32(rtwd3, rtwd4));

You'd benefit a lot from figuring out how to interleave these instructions with prior arithmetic on A8-class hardware. It can share the first and last cycle of a load/store/permute instruction with that of an ALU instruction, potentially saving you 10 cycles here (since all of these instructions are multi-cycle... though you'll "save" fewer when you convert to using VMLS/VMLA since there will be fewer instructions to share with). You have to switch to using Q8/Q9 instead of Q10/Q11, though, or you'll run into data dependency problems.

@@ +81,5 @@
> +    vld1.32         {d21}, [r7], r9             @ rtwd2 = vld1_f32((const float32_t*)tw2); tw2 += fstride*2;
> +    vld1.32         {d22}, [r7], r9             @ rtwd3 = vld1_f32((const float32_t*)tw2); tw2 += fstride*2;
> +    vld1.32         {d23}, [r7], r9             @ trtwd4 = vld1_f32((const float32_t*)tw2); tw2 += fstride*2;
> +    vuzp.32         q10, q11                    @ scratch3 = vuzpq_f32(vcombine_f32(rtwd1, rtwd2), vcombine_f32(rtwd3, rtwd4));
> +    vld2.32         {d0-d3}, [r11], r3          @ rfout = vld2q_f32((const float32_t*)(fom2)); fom3 = Fout+m3;

See above.

@@ +84,5 @@
> +    vuzp.32         q10, q11                    @ scratch3 = vuzpq_f32(vcombine_f32(rtwd1, rtwd2), vcombine_f32(rtwd3, rtwd4));
> +    vld2.32         {d0-d3}, [r11], r3          @ rfout = vld2q_f32((const float32_t*)(fom2)); fom3 = Fout+m3;
> +    vmul.f32        q2, q0, q10                 @ C_MUL_NEON(scratch1, rfout, scratch3);
> +    vmul.f32        q3, q1, q11
> +    vsub.f32        q14, q2, q3

VMLS

@@ +87,5 @@
> +    vmul.f32        q3, q1, q11
> +    vsub.f32        q14, q2, q3
> +    vmul.f32        q2, q0, q11
> +    vmul.f32        q3, q1, q10
> +    vadd.f32        q15, q2, q3

VMLA

@@ +95,5 @@
> +    vld1.32         {d21}, [r8], r10            @ rtwd2 = vld1_f32((const float32_t*)tw3); tw3 += fstride*3;
> +    vld1.32         {d22}, [r8], r10            @ rtwd3 = vld1_f32((const float32_t*)tw3); tw3 += fstride*3;
> +    vld1.32         {d23}, [r8], r10            @ rtwd4 = vld1_f32((const float32_t*)tw3); tw3 += fstride*3;
> +    vuzp.32         q10, q11                    @ scratch3 = vuzpq_f32(vcombine_f32(rtwd1, rtwd2), vcombine_f32(rtwd3, rtwd4));
> +    vld2.32         {d0-d3}, [r11]              @ rfout = vld2q_f32((const float32_t*)(fom3));

See above.

@@ +98,5 @@
> +    vuzp.32         q10, q11                    @ scratch3 = vuzpq_f32(vcombine_f32(rtwd1, rtwd2), vcombine_f32(rtwd3, rtwd4));
> +    vld2.32         {d0-d3}, [r11]              @ rfout = vld2q_f32((const float32_t*)(fom3));
> +    vmul.f32        q2, q0, q10                 @ C_MUL_NEON(scratch2, rfout, scratch3);
> +    vmul.f32        q3, q1, q11
> +    vsub.f32        q8, q2, q3

VMLS

@@ +101,5 @@
> +    vmul.f32        q3, q1, q11
> +    vsub.f32        q8, q2, q3
> +    vmul.f32        q2, q0, q11
> +    vmul.f32        q3, q1, q10
> +    vadd.f32        q9, q2, q3

VMLA

@@ +139,5 @@
> +                                                @ }
> +.c_end4:
> +    vst2.32         {d20-d23}, [r12], r3        @ vst2q_f32((float32_t*)(fom), scratch3); fom2 = Fout+m2;
> +    vst2.32         {d16-d19}, [r12], r3        @ vst2q_f32((float32_t*)fom2, scratch2); fom3 = Fout+m3;
> +    vst2.32         {d28-d31}, [r12]            @ vst2q_f32((float32_t*)(fom3), scratch1);

You can do much better by interleaving the last 8 arithmetic ops with the stores. To maximize dual issuing and minimize pipeline stalls, you want something like (for the not_inverse4 case)

vadd.f32 q8, q0, q10
vadd.f32 q9, q1, q11
vadd.f32 q14, q2, q13
vadd.f32 q15, q3, q12
vst2.32 {d16-d19}, [r0]!
vsub.f32 q0, q0, q10
vsub.f32 q1, q1, q11
vst2.32 {d28-d31}, [r12], r3
vsub.f32 q8, q2, q13
vst2.32 {d0-d3}, [r12], r3
vadd.f32 q9, q3, q12
vst2.32 {d16-d19}, [r12]

There's still a 2-cycle stall while executing the last store, but there are no more computations to stick in there. Overall this saves 32 - 26 = 6 cycles.

@@ +153,5 @@
> +    add             r4, r4, r5
> +    ands            r5, r4, #3                  @ if (k%4 == 0)
> +    beq             .kf_bfly4_done
> +
> +.kf_bfly4_do_while1:                            @ do { //process 1 sample per iteration

Most of the comments for the above loop apply here as well.

@@ +156,5 @@
> +
> +.kf_bfly4_do_while1:                            @ do { //process 1 sample per iteration
> +    pld             [r7, #0]
> +    vld1.32         {d18}, [r6], r1             @ rtwd1 = vld1_f32((const float32_t*)tw1); tw1 += fstride;
> +    vuzp.32         d18, d19                    @ scratch3 = vuzp_f32(rtwd1, rtwd2); //d11 is empty

The comment is wrong (here and everywhere that mentions d11).

@@ +186,5 @@
> +    vmul.f32        q1, q0, q9                  @ C_MUL_NEON(scratch2, rfout, scratch3);
> +    vsub.f32        d16, d2, d3
> +    vmul.f32        d2, d0, d19
> +    vmul.f32        d3, d1, d18
> +    vadd.f32        d17, d2, d3

This stuff is misguided. The Q version of VMUL is twice as slow as the D version, so it doesn't actually save anything to VUZP everything.

Really, what you want is to load all three twiddles and all three coefficients into adjacent registers, and then do 3 simultaneous complex multiplies the same way you do 4 of them in the first loop.

@@ +200,5 @@
> +    vsub.f32        q2, q2, q8                  @ C_SUB_NEON(scratch0, scratch0, scratch2);
> +
> +    vsub.f32        q8, q0, q9                  @ C_SUB_NEON(scratch2, rfout, scratch3);
> +
> +    vadd.f32        q0, q0, q9                  @ C_ADD_NEON(rfout, rfout, scratch3);

Similarly, here you want the real/imaginary values packed together. The Q version of VADD/VSUB is also twice as slow as the D version, so you're halving your throughput.
Attachment #773039 - Flags: review?(tterribe) → review-
Whiteboard: [blocking-webaudio-]
Comment on attachment 773039 [details] [diff] [review]
[Part 1] Optimize KissFFT with NEON instructions.

Review of attachment 773039 [details] [diff] [review]:
-----------------------------------------------------------------

::: media/kiss_fft/kiss_fft.c
@@ +17,5 @@
>  /* The guts header contains all the multiplication and addition macros that are defined for
>   fixed or floating point complex numbers.  It also delares the kf_ internal functions.
>   */
>  
> +#ifdef HAVE_ARM_NEON

mozilla::supports_abc() is only usable in C++ files. Is there any other run-time detection mechanism for C files?

::: media/kiss_fft/kiss_fft_bfly2_neon.s
@@ +35,5 @@
> +    .global     kf_bfly2
> +    .func       kf_bfly2
> +
> +kf_bfly2:
> +    stmdb            sp!, {r4, r5, r6, r7, r8, r9, sl, fp, lr}

What's the purpose of keeping the stack pointer 8-byte aligned? I see no stack operations in the following instructions.
jwwang, see how I worked around it in [1] and [2] (I use it at [3]). Essentially, you wrap the C++ functions you need in C functions, and include the C header where you need to do the runtime detection. Glandium explained to me all the build system magic we had to do in order to make this work, so it should be painless for you this time. Don't hesitate to ask me if it does not work.

That said, if this is going to be common, maybe we should expose a C API to SSE.h.

[1]: http://mxr.mozilla.org/mozilla-central/source/media/libspeex_resampler/src/sse_detect.h
[2]: http://mxr.mozilla.org/mozilla-central/source/media/libspeex_resampler/src/sse_detect.cpp
[3]: http://mxr.mozilla.org/mozilla-central/source/media/libspeex_resampler/src/resample.c#359
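Following the sse_detect pattern, the C-visible shim could look like this (file and function names are hypothetical; the real body would live in a .cpp file calling the C++ detection):

```c
/* neon_detect.h (hypothetical): the only declaration kiss_fft.c needs. */
int moz_has_neon(void);

/* neon_detect.cpp (hypothetical) would implement it roughly as:
 *   extern "C" int moz_has_neon(void) { return mozilla::supports_neon() ? 1 : 0; }
 * Stubbed here with a compile-time check so the sketch is self-contained. */
int moz_has_neon(void) {
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    return 1;
#else
    return 0;
#endif
}
```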
I see. Thanks. I would prefer a clean cut: exposing C APIs in SSE.h and arm.h.
Rewrite kiss_fft_bfly4_neon.s.
Attachment #773039 - Attachment is obsolete: true
Attachment #802161 - Flags: review?(tterribe)
Attachment #773041 - Attachment is obsolete: true
Attachment #802162 - Flags: review+
Attachment #802159 - Flags: review?(ehsan) → review+
Whiteboard: [blocking-webaudio-] → [blocking-webaudio-][webaudio-perf]
Keywords: perf
Whiteboard: [blocking-webaudio-][webaudio-perf] → [blocking-webaudio-]
Do we still need this bug once Bug 926838 is landed?
I don't think so (that is why I've held off spending the time to review it).
Let's make that explicit!
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → WONTFIX
Attachment #802161 - Flags: review?(tterribe)
Attached file kiss_fft_bfly5_neon.s