Closed Bug 882171 Opened 11 years ago Closed 11 years ago

Optimize the AudioNodeEngine.cpp routines for NEON

Categories

(Core :: Web Audio, defect)

x86
macOS
defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla26
blocking-b2g koi+
Tracking Status
firefox25 --- fixed
firefox26 --- fixed

People

(Reporter: ehsan.akhgari, Assigned: jwwang)

References

Details

(Whiteboard: [FT: Media Recording, Sprint 1])

Attachments

(4 files, 11 obsolete files)

5.30 KB, text/x-c++src
Details
17.31 KB, text/x-c++src
Details
13.29 KB, patch
glandium
: review+
Details | Diff | Splinter Review
5.75 KB, patch
jwwang
: review+
Details | Diff | Splinter Review
      No description provided.
Ben, would you share your crazy setup you use to write ARM assembly for Opus on your FxOS device with me ? Otherwise, I'm gonna try to find a Pandaboard somewhere.
Flags: needinfo?(ben)
Before we do this (or other optimization work) I think we need some benchmarks. Filed bug 882543 for that.
Depends on: 882543
The SSE2 work is in bug 877662. That bug also makes audio blocks be 16-byte aligned.
Assignee: paul → nobody
Depends on: 877662
Assignee: nobody → jwwang
I am planning to use this library.
http://projectne10.github.io/Ne10/doc/index.html

Or is there a similar library integrated to gecko already?
This library does not seem to suit our needs, from what I read.

Working directly with NEON intrinsics should be easy and fast enough for what we have to do to write the code ourselves without having to rely on a library.

If we consider after profiling that the speedup is not that great and that we need more speed, we can then consider writing assembly directly, but this would be way slower to write.
Yeah, I agree.  We should try to optimize Kiss FFT for NEON and then measure some test cases (ConvolverNode should be the most interesting consumer) and see if the FFT is the bottleneck or not.
(In reply to Paul Adenot (:padenot) from comment #5)
> This library does not seem to suit our needs, from what I read.

I fail to see how this library doesn't fit our needs. Below is a list of functions that can be imiplemented in Ne10 functions.

void AudioBufferAddWithScale(const float* aInput, float aScale, float* aOutput, uint32_t aSize)
* ne10_add_float (when aScale == 1.0f)
* ne10_mlac_float

AudioBlockCopyChannelWithScale(const float* aInput, float aScale, float* aOutput)
* ne10_mulc_float

AudioBlockCopyChannelWithScale(const float aInput[WEBAUDIO_BLOCK_SIZE], const float aScale[WEBAUDIO_BLOCK_SIZE], float aOutput[WEBAUDIO_BLOCK_SIZE])
* ne10_mul_float
I'm sure you can do something with this lib, but I'm not sure it is worth importing it in the tree.

ne10_add_float is just a vector add, ne10_mlac_float is just a multiply and accumulate. Nothing difficult to write, or I'm missing something.
Just a thought about not reinventing the wheel.
If it is not worth the porting, let roll our own.
(Sorry, for some reason I previously commented on this bug thinking that it's bug 885496!)

(In reply to comment #9)
> Just a thought about not reinventing the wheel.
> If it is not worth the porting, let roll our own.

I believe that the code involved here should be fairly simple and localized, and I would prefer to not have to import a large library just so that we can use a few of their routines.
Attachment #770000 - Flags: review?(ehsan)
Attachment #770001 - Flags: review?(ehsan)
Comment on attachment 770000 [details] [diff] [review]
NEON implementation of AudioNodeEngine.cpp part 1

Review of attachment 770000 [details] [diff] [review]:
-----------------------------------------------------------------

::: content/media/AudioNodeEngineNEON.cpp
@@ +29,5 @@
> +  for (unsigned i = 0; i < aSize; i+=4) {
> +    vin = vld1q_f32((float32_t*)aInput+i);
> +    vout = vld1q_f32((float32_t*)aOutput+i);
> +    vout = vmlaq_f32(vout, vin, vscale);
> +    vst1q_f32((float32_t*)aOutput+i, vout);

This doesn't properly handle the case where aSize is not a multiple of 4.

@@ +83,5 @@
> +
> +  for (uint32_t i = 0; i < aSize * aChannelCount; i+=4) {
> +    vin = vld1q_f32((float32_t*)aBlock+i);
> +    vout = vmulq_f32(vin, vscale);
> +    vst1q_f32((float32_t*)aBlock+i, vout);

Same here.
Handle the case where array size is not a multiple of 4.
Attachment #770000 - Attachment is obsolete: true
Attachment #770000 - Flags: review?(ehsan)
Attachment #770086 - Flags: review?(roc)
Comment on attachment 770086 [details] [diff] [review]
NEON implementation of AudioNodeEngine.cpp part 1

Review of attachment 770086 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good to me.
Attachment #770086 - Flags: review?(roc) → review?(paul)
Comment on attachment 770086 [details] [diff] [review]
NEON implementation of AudioNodeEngine.cpp part 1

Review of attachment 770086 [details] [diff] [review]:
-----------------------------------------------------------------

Could you try hand unrolling a couple functions, to check if we can get some speedup? Just to check if it is worth it. For example, I could get 50% speedup by unrolling the inplace gain implementation four times when using SSE.

derf, could you sanity check this if you have a minute? I'm afraid I'm ignorant when it comes to ARM.

::: content/media/AudioNodeEngineNEON.cpp
@@ +30,5 @@
> +  aSize -= dif;
> +  unsigned i = 0;
> +  for (; i < aSize; i+=4) {
> +    vin = vld1q_f32((float32_t*)aInput+i);
> +    vout = vld1q_f32((float32_t*)aOutput+i);

I remember that derf told me it was preferable to use array indexing rather that pointer arithmetic, in terms of speed.

Anyways, I'd rather have static_casts at the beginning of the function that cluttering the inner loop.
Attachment #770086 - Flags: review?(tterribe)
Attachment #770086 - Flags: review?(paul)
Attachment #770086 - Flags: review+
Comment on attachment 770001 [details] [diff] [review]
NEON implementation of AudioNodeEngine.cpp part 2

Review of attachment 770001 [details] [diff] [review]:
-----------------------------------------------------------------

::: content/media/AudioNodeEngine.cpp
@@ +22,5 @@
>    aChunk->mDuration = WEBAUDIO_BLOCK_SIZE;
>    aChunk->mChannelData.SetLength(aChannelCount);
>    float* data = static_cast<float*>(buffer->Data());
>    for (uint32_t i = 0; i < aChannelCount; ++i) {
> +    float* check = data + i*WEBAUDIO_BLOCK_SIZE;

Nit: please call this channelData.
Attachment #770001 - Flags: review?(ehsan) → review+
Attachment #770001 - Attachment is obsolete: true
Attachment #770587 - Flags: review+
Attachment #770587 - Attachment is obsolete: true
Attachment #770588 - Flags: review+
test results:
AudioBufferAddWithScale                      : loop=1000000, time=1.553925
AudioBufferAddWithScale_NEON                 : loop=1000000, time=0.730693
AudioBufferAddWithScale_NEON_unroll          : loop=1000000, time=0.741942
AudioBufferAddWithScale_NEON_unroll2         : loop=1000000, time=0.701317
AudioBlockCopyChannelWithScale               : loop=1000000, time=0.788323
AudioBlockCopyChannelWithScale_NEON          : loop=1000000, time=0.635880
AudioBlockCopyChannelWithScale_NEON_unroll   : loop=1000000, time=0.495064
AudioBlockCopyChannelWithScale_NEON_unroll2  : loop=1000000, time=0.503057
AudioBlockCopyChannelWithScale               : loop=1000000, time=1.164082
AudioBlockCopyChannelWithScale_NEON          : loop=1000000, time=0.838090
AudioBlockCopyChannelWithScale_NEON_unroll   : loop=1000000, time=0.678961
AudioBlockCopyChannelWithScale_NEON_unroll2  : loop=1000000, time=0.663595
AudioBufferInPlaceScale                      : loop=1000000, time=0.923929
AudioBufferInPlaceScale_NEON                 : loop=1000000, time=0.535680
AudioBufferInPlaceScale_NEON_unroll          : loop=1000000, time=0.469168
AudioBufferInPlaceScale_NEON_unroll2         : loop=1000000, time=0.472063
AudioBlockPanStereoToStereo                  : loop=1000000, time=1.942470
AudioBlockPanStereoToStereo_NEON             : loop=1000000, time=1.273460
AudioBlockPanStereoToStereo_NEON_unroll      : loop=1000000, time=1.103122
AudioBlockPanStereoToStereo_NEON_unroll2     : loop=1000000, time=1.114045

Generally, unroll level 4 is better than those without unrolling. However, for some functions (ex: AudioBufferAddWithScale_NEON_unroll2), unroll level 2 is better than level 4. I think we might have to tweak functions case by case.

Btw, there is no performance difference whether to use pointer arithmetic or array indexing. I guess the compiler is smart enough to figure this out.
The file that goes with the benchmark test.
(In reply to jwwang from comment #20)
> Btw, there is no performance difference whether to use pointer arithmetic or
> array indexing. I guess the compiler is smart enough to figure this out.

Well, the important point, I think, is whether you change the base pointer in the loop (vin = vld1q_f32((float32_t*)aInput); aInput+=4;) vs. accessing it with the loop index (as you do here). I don't think the exact syntax matters. This is just a general observation that compilers usually produce better code in the latter case, but it may not always apply. See, e.g., <http://support.amd.com/us/Processor_TechDocs/47414_15h_sw_opt_guide.pdf> Section 3.2.
With loop unrolling.
Attachment #770086 - Attachment is obsolete: true
Attachment #770086 - Flags: review?(tterribe)
Attachment #772519 - Flags: review?(tterribe)
Comment on attachment 772519 [details] [diff] [review]
NEON implementation of AudioNodeEngine.cpp part 1

Review of attachment 772519 [details] [diff] [review]:
-----------------------------------------------------------------

I took a look at the actual asm this generates, and it is moderately awful, as one would expect from intrinsics (lots of redundant addressing calculations, unaligned load/stores, small load/stores instead of large ones, etc.). It all looks like it will work, though.

r=me once you fix the build issue.

::: content/media/AudioNodeEngineNEON.cpp
@@ +22,5 @@
> +                                  float* aOutput,
> +                                  uint32_t aSize)
> +{
> +  ASSERT_ALIGNED(aInput);
> +  ASSERT_ALIGNED(aOutput);

I don't think you actually get any benefit from this, because there are no intrinsics that specifically take aligned pointers. All the asm it generates uses unaligned loads.

@@ +33,5 @@
> +  aSize -= dif;
> +  unsigned i = 0;
> +  for (; i < aSize; i+=16) {
> +    vin0 = vld1q_f32(ADDRESS_OF(aInput, i));
> +    vin1 = vld1q_f32(ADDRESS_OF(aInput, i+4));

Ugh, there's no 4-register vld1 available from intrinsics? That's pretty poor.

::: content/media/AudioNodeEngineNEON.h
@@ +1,4 @@
> +/* -*- mode: c++; tab-width: 2; indent-tabs-mode: nil; c-basic-offset: 2 -*- */
> +/* this source code form is subject to the terms of the mozilla public
> + * license, v. 2.0. if a copy of the mpl was not distributed with this file,
> + * You can obtain one at http://mozilla.org/MPL/2.0/. */

This header has no multiple-include guard.

::: content/media/moz.build
@@ +119,5 @@
>  ]
> +
> +if CONFIG['CPU_ARCH'] == 'arm' and CONFIG['HAVE_ARM_NEON']:
> +    CPP_SOURCES += [
> +        'AudioNodeEngineNEON.cpp',

This won't build. HAVE_ARM_NEON merely asserts that the compiler can build NEON code. It doesn't actually force using -mfloat-abi=softfp -mfpu=neon, which is required for these intrinsics to work.

This was easy to fix in the old Make-based build system, but I have no idea how to do it with moz.build files.
Attachment #772519 - Flags: review?(tterribe) → review+
(In reply to Timothy B. Terriberry (:derf) from comment #25)
> Ugh, there's no 4-register vld1 available from intrinsics? That's pretty
> poor.

FWIW, since you don't actually care how this data is arranged in the registers, you could maybe use vld2q_f32() to load 32 bytes at a time? The cycle cost should be the same as a real 4-register vld1.32.
(In reply to Timothy B. Terriberry (:derf) from comment #26)
> FWIW, since you don't actually care how this data is arranged in the
> registers, you could maybe use vld2q_f32() to load 32 bytes at a time? The
> cycle cost should be the same as a real 4-register vld1.32.

Using float32x4x2_t in c code will result in stack push/pop instructions which will decrease the speed. 

Lately I've been poking around source code and build flags to see if I can get a better output which includes:
1. -funroll-loop which adds further improvement slightly over code which is already manually unrolled.
2. memory barrier which prevents improper instruction reordering from compiler which generates code where vld1.32 is followed by vmla.f32 immediately and stalls the CPU pipeline.
3. __builtin_prefetch (available from GCC 4.7) which doesn't add visible improvement.
4. __builtin_assume_aligned which doesn't work well as intended.

I think if we need further optimization, we should go to assembly to get most out of it.
(In reply to jwwang from comment #27)
> Lately I've been poking around source code and build flags to see if I can
> get a better output which includes:
> 1. -funroll-loop which adds further improvement slightly over code which is
> already manually unrolled.
> 2. memory barrier which prevents improper instruction reordering from
> compiler which generates code where vld1.32 is followed by vmla.f32
> immediately and stalls the CPU pipeline.
> 3. __builtin_prefetch (available from GCC 4.7) which doesn't add visible
> improvement.
> 4. __builtin_assume_aligned which doesn't work well as intended.

We should be able to do all of these later in separate bugs.
(In reply to Timothy B. Terriberry (:derf) from comment #25)
> ::: content/media/moz.build
> @@ +119,5 @@
> >  ]
> > +
> > +if CONFIG['CPU_ARCH'] == 'arm' and CONFIG['HAVE_ARM_NEON']:
> > +    CPP_SOURCES += [
> > +        'AudioNodeEngineNEON.cpp',
> 
> This won't build. HAVE_ARM_NEON merely asserts that the compiler can build
> NEON code. It doesn't actually force using -mfloat-abi=softfp -mfpu=neon,
> which is required for these intrinsics to work.
> 
> This was easy to fix in the old Make-based build system, but I have no idea
> how to do it with moz.build files.

Perhaps gps can tell us how to do that.
Flags: needinfo?(gps)
Blocks: 894856
Needinfo?ing some other build system peers.
Flags: needinfo?(ted)
Flags: needinfo?(mh+mozilla)
What exactly do you need to do? Add some compiler flags? You can OS_CXXFLAGS or CXXFLAGS += extra flags in Makefile.in (these variables aren't yet ported to moz.build).
Flags: needinfo?(ted)
Flags: needinfo?(mh+mozilla)
Flags: needinfo?(gps)
(In reply to Gregory Szorc [:gps] from comment #31)
> What exactly do you need to do? Add some compiler flags? You can OS_CXXFLAGS
> or CXXFLAGS += extra flags in Makefile.in (these variables aren't yet ported
> to moz.build).

Which needs to happen only on one file, so you need AudioNodeEngineNEON.$(OBJ_SUFFIX): CXXFLAGS += -mfpu=neon. We'll need a special syntax for thes things in mo.build, but we don't have one yet.
You can add DEFINES+=['-mfpu=neon'] in moz.build. However, it won't work for it will be overwritten by default flags (-mfpu=vfp). Is this a bug of moz.build?
Attachment #772519 - Attachment is obsolete: true
Attachment #778259 - Flags: review+
It looks like part 2 can't be landed until Bug 877662 is resolved since it depends on the 16 byte alignment of SharedBuffer.
(In reply to jwwang from comment #33)
> You can add DEFINES+=['-mfpu=neon'] in moz.build. However, it won't work for
> it will be overwritten by default flags (-mfpu=vfp). Is this a bug of
> moz.build?

That's not a DEFINE, so it doesn't get put in the right place in the compile commandline. Use glandium's suggestion in comment 32.
It looks like Bug 877662 is gonna take a while. Should I remove the dependency and 16-byte alignment assertion so that this patch can land first?
blocking-b2g: --- → koi+
Whiteboard: [FT: Media Recording, Sprint]
Whiteboard: [FT: Media Recording, Sprint] → [FT: Media Recording, Sprint 1]
Flags: needinfo?(ben)
(In reply to jwwang from comment #37)
> It looks like Bug 877662 is gonna take a while. Should I remove the
> dependency and 16-byte alignment assertion so that this patch can land first?

Sure, it would be nice if you could do that!
Attachment #778259 - Attachment is obsolete: true
Attachment #787285 - Flags: review+
Attached patch Part 2 - call NEON functions. (obsolete) — Splinter Review
Remove 16-byte alignment which is not required by NEON.
Attachment #770588 - Attachment is obsolete: true
Attachment #787287 - Flags: review+
No longer depends on: 877662
Keywords: checkin-needed
More specifically, armv6 bustage.
/usr/bin/ccache /builds/slave/try-and-a6-0000000000000000000/build/android-ndk/toolchains/arm-linux-androideabi-4.7/prebuilt/linux-x86_64/bin/arm-linux-androideabi-g++ -o AudioNodeEngineNEON.o -c  -fvisibility=hidden -DMOZILLA_INTERNAL_API -DIMPL_LIBXUL  -DSTATIC_EXPORTABLE_JS_API -DNO_NSPR_10_SUPPORT -DOS_POSIX=1 -DOS_LINUX=1  -D_IMPL_NS_LAYOUT -I/builds/slave/try-and-a6-0000000000000000000/build/ipc/chromium/src -I/builds/slave/try-and-a6-0000000000000000000/build/ipc/glue -I../../ipc/ipdl/_ipdlheaders  -I/builds/slave/try-and-a6-0000000000000000000/build/content/media -I. -I../../dist/include  -I/builds/slave/try-and-a6-0000000000000000000/build/obj-firefox/dist/include/nspr -I/builds/slave/try-and-a6-0000000000000000000/build/obj-firefox/dist/include/nss     -I/builds/slave/try-and-a6-0000000000000000000/build/content/base/src -I/builds/slave/try-and-a6-0000000000000000000/build/layout/generic -I/builds/slave/try-and-a6-0000000000000000000/build/layout/xul/base/src   -fPIC -isystem /builds/slave/try-and-a6-0000000000000000000/build/android-ndk/platforms/android-9/arch-arm/usr/include  -Wall -Wpointer-arith -Woverloaded-virtual -Werror=return-type -Wtype-limits -Wempty-body -Wsign-compare -Wno-invalid-offsetof -mandroid -fno-short-enums -fno-exceptions -Wno-psabi -march=armv6 -mfpu=vfp -I/builds/slave/try-and-a6-0000000000000000000/build/build/stlport/stlport -I/builds/slave/try-and-a6-0000000000000000000/build/android-ndk/sources/cxx-stl/system/include -fno-exceptions -fno-strict-aliasing -fno-rtti -ffunction-sections -fdata-sections -fno-exceptions -std=gnu++0x -pipe  -DNDEBUG -DTRIMMED -g -Os -freorder-blocks -fno-reorder-functions -fomit-frame-pointer -Werror -Wno-error=uninitialized -Wno-error=deprecated-declarations  -mfpu=neon  -isystem /builds/slave/try-and-a6-0000000000000000000/build/android-ndk/platforms/android-9/arch-arm/usr/include  -DMOZILLA_CLIENT -include ../../mozilla-config.h -MD -MP -MF .deps/AudioNodeEngineNEON.o.pp  /builds/slave/try-and-a6-0000000000000000000/build/content/media/AudioNodeEngineNEON.cpp

There is the "-mfpu=neon" option when compiling AudioNodeEngineNEON.cpp.
However...

In file included from /builds/slave/try-and-a6-0000000000000000000/build/content/media/AudioNodeEngineNEON.cpp:7:0:
/builds/slave/try-and-a6-0000000000000000000/build/android-ndk/toolchains/arm-linux-androideabi-4.7/prebuilt/linux-x86_64/bin/../lib/gcc/arm-linux-androideabi/4.7/include/arm_neon.h:32:2: error: #error You must enable NEON instructions (e.g. -mfloat-abi=softfp -mfpu=neon) to use arm_neon.h

Will march=armv6 cause mfpu=neon to be ignored?
NEON is not available in ARMv6 AFAIK, so we should only build AudioNodeEngineNEON.cpp for ARMv7, and just use the C++ implementation for ARMv6.  Sorry I did not catch this before. :/
We have
if CONFIG['CPU_ARCH'] == 'arm' and CONFIG['HAVE_ARM_NEON']:
    CPP_SOURCES += [
        'AudioNodeEngineNEON.cpp',
    ]

in moz.build.

It looks like CONFIG['HAVE_ARM_NEON'] is true when building for ARMv6. We should have this fixed or have some other way to identify CPU arch.
Use BUILD_ARM_NEON instead of HAVE_ARM_NEON.
It looks like AudioNodeEngineNEON.cpp is still included using BUILD_ARM_NEON.
https://tbpl.mozilla.org/?tree=Try&rev=6fe0abfe1724
(In reply to jwwang from comment #48)
> It looks like AudioNodeEngineNEON.cpp is still included using BUILD_ARM_NEON.
> https://tbpl.mozilla.org/?tree=Try&rev=6fe0abfe1724

Unfortunately, BUILD_ARM_NEON is 0 instead of nothing, so the test doesn't do what you think it does. Feel free to change that in build/autoconf/arch.m4 and js/src/build/autoconf/arch.m4.
Note that part 2 is still likely to fail on armv6 because it will still try to use the functions that are provided by that file you don't want to build.
Attachment #787285 - Attachment is obsolete: true
Attachment #789401 - Flags: review?(mh+mozilla)
Attached patch Part 2 - call NEON functions. (obsolete) — Splinter Review
Attachment #787287 - Attachment is obsolete: true
Attachment #789404 - Flags: review?(mh+mozilla)
It is tricky that CONFIG['BUILD_ARM_NEON'] is a string (which is '1') and it returns false when being compared with 1 (the integer).

It is confusing to have both HAVE_ARM_NEON and BUILD_ARM_NEON. I would like to have HAVE_ARM_NEON defined only when $ARM_ARCH >= 7 and remove BUILD_ARM_NEON if feasible.
Comment on attachment 789401 [details] [diff] [review]
Part 1 - Add AudioNodeEngineNEON.cpp

Review of attachment 789401 [details] [diff] [review]:
-----------------------------------------------------------------

::: content/media/moz.build
@@ +125,5 @@
>      'VideoUtils.cpp',
>      'WebVTTLoadListener.cpp',
>  ]
> +
> +if CONFIG['CPU_ARCH'] == 'arm' and CONFIG['BUILD_ARM_NEON'] == '1':

Please just change the silliness in build/autoconf/arch.m4 (replace 0 with nothing), and just use CONFIG['BUILD_ARM_NEON'] here.
Attachment #789401 - Flags: review?(mh+mozilla) → review-
Attachment #789404 - Flags: review?(mh+mozilla) → review+
Attachment #789401 - Attachment is obsolete: true
Attachment #789485 - Flags: review?(mh+mozilla)
Attachment #789404 - Attachment is obsolete: true
Attachment #789486 - Flags: review+
Attachment #789485 - Flags: review?(mh+mozilla) → review+
Keywords: checkin-needed
Whiteboard: [FT: Media Recording, Sprint 1] → [FT: Media Recording, Sprint 1][checkin-needed-aurora]
https://hg.mozilla.org/mozilla-central/rev/305facc074df
https://hg.mozilla.org/mozilla-central/rev/84ec68df9294
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla26
https://hg.mozilla.org/releases/mozilla-aurora/rev/141f5c87fbf7
https://hg.mozilla.org/releases/mozilla-aurora/rev/66856ecb3698
Whiteboard: [FT: Media Recording, Sprint 1][checkin-needed-aurora] → [FT: Media Recording, Sprint 1]
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: