Closed Bug 877662 Opened 7 years ago Closed 4 years ago

Optimize the AudioNodeEngine.cpp routines

Categories

(Core :: Web Audio, defect, P1)

x86
macOS
defect

Tracking

()

RESOLVED FIXED
mozilla48
Tracking Status
firefox48 --- fixed

People

(Reporter: ehsan, Assigned: dminor)

References

(Blocks 1 open bug)

Details

(Keywords: perf, Whiteboard: [blocking-webaudio-])

Attachments

(9 files, 2 obsolete files)

18.56 KB, text/x-csrc
Details
127.96 KB, image/png
Details
2.88 KB, patch
ehsan
: review+
roc
: review-
Details | Diff | Splinter Review
26.80 KB, patch
ehsan
: review+
Details | Diff | Splinter Review
58 bytes, text/x-review-board-request
padenot
: review+
Details
58 bytes, text/x-review-board-request
padenot
: review+
Details
58 bytes, text/x-review-board-request
padenot
: review+
Details
58 bytes, text/x-review-board-request
padenot
: review+
Details
58 bytes, text/x-review-board-request
padenot
: review+
Details
We need SSE2 versions of those routines.  Also it would be nice to ensure that AllocateAudioBlock always allocates 16-byte aligned buffers where we have SSE2 available.
Attached file Microbenchmark
Here is a reimplementation in sse2 of all the functions we have in AudioNodeEngine.cpp (but not WriteZeroesToAudioBlock, which is essentially a memset), in the form of a independent microbenchmark with various flavor of C++, hand-unrolled C++, SSE2 intrinsics and hand unrolled SSE2 intrinsics, with some basic tests of the output.
Attached image benchmark results
And some encouraging results.
Nice!  BTW, I forgot to ask, do you also have NEON experience?  We should definitely optimize these routines for mobile as well...
No, but this is something I've been considering for some time. Now is the time look into it, I think.
Mass moving Web Audio bugs to the Web Audio component.  Filter on duckityduck.
Component: Video/Audio → Web Audio
derf, if you have a minute, could you have a look at the functions in
AudioNodeEngineSSE.cpp?
Attachment #758675 - Flags: review?(tterribe)
Attachment #758675 - Flags: review?(ehsan)
I'm also planning on adding some form of compiled test for those functions.
Comment on attachment 758671 [details] [diff] [review]
Align AudioBlock on 16 bytes when SSE2 is available. r=

Review of attachment 758671 [details] [diff] [review]:
-----------------------------------------------------------------

::: content/media/AudioNodeEngine.cpp
@@ +14,2 @@
>    nsRefPtr<SharedBuffer> buffer =
>      SharedBuffer::Create(WEBAUDIO_BLOCK_SIZE*aChannelCount*sizeof(float));

Please MOZ_ASSERT the alignment here.

::: content/media/SharedBuffer.h
@@ +28,5 @@
>   * Typically you would allocate one of these, fill it in, and then treat it as
>   * immutable while it's shared.
>   * This only guarantees 4-byte alignment of the data. For alignment we
>   * simply assume that the refcount is at least 4-byte aligned and its size
>   * is divisible by 4.

Please update this comment.
Attachment #758671 - Flags: review?(ehsan) → review+
Comment on attachment 758675 [details] [diff] [review]
Add an SSE2 implementation of AudioNodeEngine.cpp functions. r=

Review of attachment 758675 [details] [diff] [review]:
-----------------------------------------------------------------

Ted, can you please do the honors for the build system stuff here?
Attachment #758675 - Flags: review?(ted)
Comment on attachment 758675 [details] [diff] [review]
Add an SSE2 implementation of AudioNodeEngine.cpp functions. r=

Review of attachment 758675 [details] [diff] [review]:
-----------------------------------------------------------------

::: content/media/AudioNodeEngineSSE2.cpp
@@ +38,5 @@
> +
> +    vscaled0 = _mm_mul_ps(vin0, vgain);
> +    vscaled1 = _mm_mul_ps(vin1, vgain);
> +    vscaled2 = _mm_mul_ps(vin2, vgain);
> +    vscaled3 = _mm_mul_ps(vin3, vgain);

It might make sense to have a separate path for the aScale==1.0 case, but it's fine if you want to do that in a follow-up.

::: content/media/AudioNodeStream.cpp
@@ +331,5 @@
>    AllocateAudioBlock(outputChannelCount, &aTmpChunk);
>    float silenceChannel[WEBAUDIO_BLOCK_SIZE] = {0.f};
>    // The static storage here should be 1KB, so it's fine
> +  nsAutoTArray<float, GUESS_AUDIO_CHANNELS*WEBAUDIO_BLOCK_SIZE + 15> downmixBuffer;
> +  float* alignedDownmixBuffer = ALIGN16(downmixBuffer.Elements());

Please declare this variable below, as this value here will be invalid if we end up doing a dynamic allocation.  Also, we don't need this value here anyway.

@@ +358,2 @@
>          for (uint32_t j = 0; j < outputChannelCount; ++j) {
> +          outputChannels[j] = &alignedDownmixBuffer[j * WEBAUDIO_BLOCK_SIZE];

So here we're assuming that WEBAUDIO_BLOCK_SIZE is a multiple of 16, which is fine, but please static assert that here to be explicit about it.
Attachment #758675 - Flags: review?(ehsan) → review+
Comment on attachment 758675 [details] [diff] [review]
Add an SSE2 implementation of AudioNodeEngine.cpp functions. r=

Review of attachment 758675 [details] [diff] [review]:
-----------------------------------------------------------------

r=me with nits fixed.

::: content/media/AudioNodeEngineSSE2.cpp
@@ +8,5 @@
> +
> +#ifdef DEBUG
> +  #define ASSERT_ALIGNED(ptr)                                                  \
> +            MOZ_ASSERT((((uintptr_t)ptr + 15) & ~0x0F) == (uintptr_t)ptr,      \
> +                       #ptr " has to be aligned on 16 bytes.");

"16-byte aligned"

@@ +54,5 @@
> +    _mm_store_ps(&aOutput[i], vout0);
> +    _mm_store_ps(&aOutput[i + 4], vout1);
> +    _mm_store_ps(&aOutput[i + 8], vout2);
> +    _mm_store_ps(&aOutput[i + 12], vout3);
> +  }

Did you look at the compiler's output on x86-32? I worry that it will spill lots with this unrolled 4 times (since it can't prove the loads don't alias with the stores, it needs at least 9 registers, plus one because register allocation is NP-hard).

@@ +183,5 @@
> +    _mm_store_ps(in, vout0);
> +    _mm_store_ps(in+4, vout1);
> +    _mm_store_ps(in+8, vout2);
> +    _mm_store_ps(in+12, vout3);
> +    in += 16;

It's generally better for the compiler if you do array indexing (like your other functions) than if you do pointer arithmetic.

::: content/media/AudioNodeStream.cpp
@@ +10,5 @@
>  #include "ThreeDPoint.h"
>  
>  using namespace mozilla::dom;
>  
> +#define ALIGN16(ptr) (float*)(((uintptr_t)ptr + 15) & ~0x0F);

Parentheses around ptr, as well as the whole macro body, and remove the trailing ;

@@ +330,5 @@
>  
>    AllocateAudioBlock(outputChannelCount, &aTmpChunk);
>    float silenceChannel[WEBAUDIO_BLOCK_SIZE] = {0.f};
>    // The static storage here should be 1KB, so it's fine
> +  nsAutoTArray<float, GUESS_AUDIO_CHANNELS*WEBAUDIO_BLOCK_SIZE + 15> downmixBuffer;

You're allocating 15 extra floats when I think you want 15 extra bytes.

@@ +352,5 @@
>      } else if (channels.Length() > outputChannelCount) {
>        if (mChannelInterpretation == ChannelInterpretation::Speakers) {
>          nsAutoTArray<float*,GUESS_AUDIO_CHANNELS> outputChannels;
>          outputChannels.SetLength(outputChannelCount);
> +        downmixBuffer.SetLength(outputChannelCount * WEBAUDIO_BLOCK_SIZE + 15);

See above.

::: content/media/Makefile.in
@@ +26,5 @@
>  CFLAGS   += $(GSTREAMER_CFLAGS)
>  CXXFLAGS += $(GSTREAMER_CFLAGS)
> +
> +ifneq (,$(INTEL_ARCHITECTURE))
> +# gcc requires -msse2 on AudioNodeEngine.cpp since it uses SSE intrinsics.

The comment says -msse2 but the flag you're actually adding is -msse. Since you're using XMM registers, you want the former.

Also, shouldn't this be on AudioNodeEngineSSE2.$(OBJ_SUFFIX) instead?

@@ +28,5 @@
> +
> +ifneq (,$(INTEL_ARCHITECTURE))
> +# gcc requires -msse2 on AudioNodeEngine.cpp since it uses SSE intrinsics.
> +ifdef GNU_CC
> +AudioNodeEngine.$(OBJ_SUFFIX): CXXFLAGS+=-msse

The comment says -msse2 but the flag you're actually adding is -msse. Since you're using XMM registers, you want the former.

@@ +32,5 @@
> +AudioNodeEngine.$(OBJ_SUFFIX): CXXFLAGS+=-msse
> +endif
> +
> +ifdef SOLARIS_SUNPRO_CXX
> +AudioNodeEngine.$(OBJ_SUFFIX): CXXFLAGS+=-xarch=sse -xO4

Again, sse vs. sse2.

::: content/media/moz.build
@@ +43,5 @@
>  MODULE = 'content'
>  
> +# Are we targeting x86 or x64?  If so, build SSE2 files.
> +if CONFIG['INTEL_ARCHITECTURE']:
> +    DEFINES += ['-DUSE_SSE']

Unless you're planning to use gfx2d's Factory::HasSSE2() for some bizarre reason, I think it makes more sense to use xpcom's SSE.h and MOZILLA_MAY_SUPPORT_SSE2/MOZILLA_PRESUME_SSE2 (which you don't have to define yourself here).
Comment on attachment 758675 [details] [diff] [review]
Add an SSE2 implementation of AudioNodeEngine.cpp functions. r=

Forgot to set the flag, too.
Attachment #758675 - Flags: review?(tterribe) → review+
Comment on attachment 758675 [details] [diff] [review]
Add an SSE2 implementation of AudioNodeEngine.cpp functions. r=

Review of attachment 758675 [details] [diff] [review]:
-----------------------------------------------------------------

Makefile/moz.build bits look fine modulo derf's concerns.
Attachment #758675 - Flags: review?(ted) → review+
The previous patch did not include the SSE2 / scalar dispatch code, for some
reason. Here it is, in AudioNodeEngine.cpp. I basically followed the
recommendations at the top of SSE.h.

The unity gain fastpath is handled outside of SSE2/scalar specific code.
Attachment #759249 - Flags: review?(ehsan)
Attachment #758675 - Attachment is obsolete: true
(In reply to Timothy B. Terriberry (:derf) from comment #12)
> @@ +54,5 @@
> > +    _mm_store_ps(&aOutput[i], vout0);
> > +    _mm_store_ps(&aOutput[i + 4], vout1);
> > +    _mm_store_ps(&aOutput[i + 8], vout2);
> > +    _mm_store_ps(&aOutput[i + 12], vout3);
> > +  }
> 
> Did you look at the compiler's output on x86-32? I worry that it will spill
> lots with this unrolled 4 times (since it can't prove the loads don't alias
> with the stores, it needs at least 9 registers, plus one because register
> allocation is NP-hard).

http://www.pastebin.mozilla.org/2492550, so it looks like it's OK.
Attachment #759249 - Flags: review?(ehsan) → review+
This makes sure that we get aligned pointers even when we could borrow data from
the AudioBufferSourceNode buffer.
Attachment #760660 - Flags: review?(ehsan)
Attachment #759249 - Attachment is obsolete: true
Comment on attachment 760660 [details] [diff] [review]
Add an SSE2 implementation of AudioNodeEngine.cpp functions. r=

Review of attachment 760660 [details] [diff] [review]:
-----------------------------------------------------------------

Nice catch!

Note that I added a bunch of functions to AudioNodeEngine.cpp in bug 815643, sorry if it bitrots this patch.  I filed bug 881587 to optimize those.
Attachment #760660 - Flags: review?(ehsan) → review+
Comment on attachment 758671 [details] [diff] [review]
Align AudioBlock on 16 bytes when SSE2 is available. r=

Review of attachment 758671 [details] [diff] [review]:
-----------------------------------------------------------------

::: content/media/SharedBuffer.h
@@ +35,5 @@
>  public:
> +  void* Data() {
> +    if (mozilla::supports_sse()) {
> +      // return a pointer aligned on 16 bytes.
> +      void * storage_start = (uint8_t*)(this + 1);

I think this should be unconditional. It's simpler that way, and there's pretty much no benefit to being less-aligned on non-SSE platforms.
Attachment #758671 - Flags: review-
Paul, can you please address comment 19 and land these patches?
Flags: needinfo?(paul)
Some code has landed that allocate new buffers (like in the convolver node), so I need to make all that aligned, and make sure it works. But yes, I'll get to it next week.
Flags: needinfo?(paul)
No longer blocks: 882171
Whiteboard: [blocking-webaudio-]
Whiteboard: [blocking-webaudio-] → [blocking-webaudio-][webaudio-perf]
Keywords: perf
Whiteboard: [blocking-webaudio-][webaudio-perf] → [blocking-webaudio-]
Making a P1 since early results showed it would help with perf.  If it's not going to have a big enough impact, we can drop it to P2.
Priority: -- → P1
Karl -- Paul is clearing his plate to focus on bug 848954. Can you pick this up?   Also, see Comment 22.  If you think this should be lowered to a P2, I'm fine with that.  Thanks.
Assignee: paul → karlt
Per email discussion with Karl, lowering this to P2.
Priority: P1 → P2
If we want to refocus on this (and its sister bug), we really should spend some time profiling. Last time I checked, it was not clear all the processing functions were the bottlenecks, and the code has moved a lot since then.
Yes, the only method I've seen over 1% of a busy profile was BufferComplexMultiply()
Blocks: 881587
Priority: P2 → P1
Rank: 15
No longer blocks: webaudioperf
I talked with Karl today, and he and I agree that it'd be more efficient to find another person to work on this one.  Same with Bug 881587.
Assignee: karlt → nobody
Assignee: nobody → dminor
To be able to use SSE2 routines, we need to audio buffers to be allocated
on 16 byte boundaries.

Review commit: https://reviewboard.mozilla.org/r/44347/diff/#index_header
See other reviews: https://reviewboard.mozilla.org/r/44347/
Attachment #8738465 - Flags: review?(padenot)
Attachment #8738466 - Flags: review?(padenot)
Attachment #8738467 - Flags: review?(padenot)
Comment on attachment 8738467 [details]
MozReview Request: Bug 877662 - Use SSE2 versions of AudioNodeEngine functions r?padenot

https://reviewboard.mozilla.org/r/44351/#review41969

::: dom/media/webaudio/blink/ReverbAccumulationBuffer.cpp:100
(Diff revision 1)
>      MOZ_ASSERT(isSafe);
>      if (!isSafe)
>          return 0;
>  
> -    AudioBufferAddWithScale(source, 1.0f, destination + writeIndex, numberOfFrames1);
> +#ifdef USE_SSE2
> +    // It is unlikely either the source is aligned or the number of values

This is not ideal, we should either:

- fall back to scalar when we have a number of items that is not a multiple of 16
- fall back to scalar if the input aligned, or process the first few elements in scalar and switch to vector and back for the last few items.

Can you file a followup for this?
Attachment #8738467 - Flags: review?(padenot) → review+
Comment on attachment 8738465 [details]
MozReview Request: Bug 877662 - Align audio buffer allocations to 16 byte boundaries r?padenot

https://reviewboard.mozilla.org/r/44347/#review41965

::: dom/media/webaudio/AudioBlock.cpp
(Diff revision 1)
>   * buffer can reuse and modify its contents next iteration if other references
>   * are all downstream temporary references held by AudioBlock.
> - *
> - * This only guarantees 4-byte alignment of the data. For alignment we simply
> - * assume that the memory from malloc is at least 4-byte aligned and that
> - * AudioBlockBuffer's size is divisible by 4.

Can you leave a note about alignment here ?

::: dom/media/webaudio/AudioNodeExternalInputStream.cpp:91
(Diff revision 1)
>  {
>    NS_ASSERTION(aSegment->GetDuration() == WEBAUDIO_BLOCK_SIZE, "Bad segment duration");
>  
> +// To be safe, let's copy data if we're using SSE2 to ensure everything is
> +// aligned properly. Simply checking for channel alignment does not seem
> +// to be sufficient here.

Do we know why ?

::: dom/media/webaudio/BiquadFilterNode.cpp:141
(Diff revision 1)
>                      const AudioBlock& aInput,
>                      AudioBlock* aOutput,
>                      bool* aFinished) override
>    {
> -    float inputBuffer[WEBAUDIO_BLOCK_SIZE];
> +    float inputBuffer[WEBAUDIO_BLOCK_SIZE + 4];
> +    float* alignedInputBuffer = (float* )(((uintptr_t)inputBuffer + 15) & ~0x0F);

Maybe assert the alignment here ? Or maybe I'm just paranoid, do we have the guarantee that the stack is aligned in all the platforms we support ?
Attachment #8738465 - Flags: review?(padenot) → review+
Comment on attachment 8738466 [details]
MozReview Request: Bug 877662 - Update SSE2 versions of AudioNodeEngine functions r?padenot

https://reviewboard.mozilla.org/r/44349/#review41967

::: dom/media/webaudio/AudioNodeEngineSSE2.cpp:31
(Diff revision 1)
> +void
> +AudioBufferAddWithScale_SSE(const float* aInput,
> +                            float aScale,
> +                            float* aOutput,
> +                            uint32_t aSize)
> +{

How are all those different from my initial patch ? It's a bit hard to review, can you make this sit on top of the old patches?
Attachment #8738466 - Flags: review?(padenot)
https://reviewboard.mozilla.org/r/44347/#review41965

> Do we know why ?

I think I was accidentally only checking the alignment of the first channel. Works fine if I iterate over each channel.
Review commit: https://reviewboard.mozilla.org/r/46183/diff/#index_header
See other reviews: https://reviewboard.mozilla.org/r/46183/
Attachment #8738466 - Attachment description: MozReview Request: Bug 877662 - Add SSE2 versions of AudioNodeEngine functions r?padenot → MozReview Request: Bug 877662 - Update SSE2 versions of AudioNodeEngine functions r?padenot
Attachment #8741091 - Flags: review?(padenot)
Attachment #8738466 - Flags: review?(padenot)
Comment on attachment 8738465 [details]
MozReview Request: Bug 877662 - Align audio buffer allocations to 16 byte boundaries r?padenot

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/44347/diff/1-2/
Comment on attachment 8738466 [details]
MozReview Request: Bug 877662 - Update SSE2 versions of AudioNodeEngine functions r?padenot

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/44349/diff/1-2/
Comment on attachment 8738467 [details]
MozReview Request: Bug 877662 - Use SSE2 versions of AudioNodeEngine functions r?padenot

Review request updated; see interdiff: https://reviewboard.mozilla.org/r/44351/diff/1-2/
Attachment #8741091 - Flags: review?(padenot) → review+
Comment on attachment 8741091 [details]
MozReview Request: Bug 877662 - Add AlignmentUtils.h r?padenot

https://reviewboard.mozilla.org/r/46183/#review42969
Comment on attachment 8741092 [details]
MozReview Request: Bug 877662 - Add an SSE2 implementation of AudioNodeEngine.cpp functions. r=ehsan

https://reviewboard.mozilla.org/r/46185/#review42971

I've written this, but I remember it being reviewed by someone, so I'm setting it as r+.
Attachment #8741092 - Flags: review+
Comment on attachment 8738466 [details]
MozReview Request: Bug 877662 - Update SSE2 versions of AudioNodeEngine functions r?padenot

https://reviewboard.mozilla.org/r/44349/#review42973
Attachment #8738466 - Flags: review?(padenot) → review+
https://reviewboard.mozilla.org/r/46185/#review42971

ehsan did the review of your original patch.
You need to log in before you can comment on or make changes to this bug.