Closed Bug 1000044 Opened 10 years ago Closed 10 years ago

audio resampling consumes about 5% of cpu when gUM with audio

Categories

(Core :: Web Audio, defect)

ARM
Gonk (Firefox OS)
defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla35
tracking-b2g backlog

People

(Reporter: slee, Assigned: karlt)

References

Details

(Keywords: perf, Whiteboard: [priority][webrtc_uplift])

Attachments

(3 files)

Attached file gUM.zip
Go to http://mozilla.github.io/webrtc-landing/gum_test.html and press "Audio". At peak, resampling accounts for about 5% of CPU usage, and total CPU usage of the browser is about 22%.
If this is a concern for you, you can try lowering the quality of the resampler [1].

[1]: http://dxr.mozilla.org/mozilla-central/source/content/media/MediaStreamGraph.cpp?from=MediaStreamGraph.cpp#2295
The profile shows the interpolating resampler is in use.
I wonder whether this is resampling from 16 to 44.1 kHz.
The denominator of the ratio there would be just outside the range at which the direct resampler would be used.
http://people.mozilla.org/~karlt/sampleRate.html tells you the output sample rate.

If inputs are likely to be at 16 kHz, then it would be more efficient to resample to 48 kHz.  You can experiment by changing AudioStream::InitPreferredSampleRate() to return 48000.

Another option to experiment with may be to modify the test at http://hg.mozilla.org/mozilla-central/annotate/8e797a1a4d65/media/libspeex_resampler/src/resample.c#l614
so that the direct resampler is used even at the cost of more memory.

A test such as (den_rate * filt_len <= oversample * filt_len + 8 || den_rate * filt_len <= 28224) would mean "I want to use the direct resampler even if it uses more memory up to 55.125 kB (on systems with 16 bit samples)".  If I calculate correctly, that should be just enough to use the direct resampler when resampling from 16 to 44.1.  Whether that is actually faster depends on cache size, I assume.
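A minimal sketch of the relaxed test described above, written as a standalone function so the numbers can be checked. The function name is hypothetical and is not part of libspeex_resampler; the defaults mentioned in the comments (filt_len = 64, oversample = 8) are the figures used in this bug for default quality.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper, not a libspeex_resampler function.
   Returns true when the relaxed test would pick the direct resampler. */
bool prefers_direct_resampler(uint32_t den_rate, uint32_t filt_len,
                              uint32_t oversample)
{
    /* 441 * 64 table entries: the 16 kHz -> 44.1 kHz case at default quality. */
    const uint64_t max_direct_entries = 28224;

    assert(den_rate > 0 && filt_len > 0 && oversample > 0);

    /* Use 64-bit products so large rates cannot overflow. */
    uint64_t direct_entries = (uint64_t)den_rate * filt_len;
    uint64_t interp_entries = (uint64_t)oversample * filt_len + 8;

    return direct_entries <= interp_entries ||
           direct_entries <= max_direct_entries;
}
```

With filt_len = 64 and oversample = 8, den_rate = 441 passes the second clause exactly, while den_rate = 442 fails both clauses and falls back to the interpolating resampler.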
Blocks: 984239
When adaptive rates are added for handling clock drift, we may prefer the interpolating resampler because it is cheaper to set up for new rates.
For rates that don't lead to small denominators, that will be the only option.
It depends how often the rate is likely to vary from the expected rate.
I tried 
1. Set the init flag to SPEEX_RESAMPLER_QUALITY_MIN or SPEEX_RESAMPLER_QUALITY_VOIP.
2. Force AudioStream::InitPreferredSampleRate() to return 48000.

I used gunshup to test, because I want to compare the CPU usage with the old version, which does the resampling in GIPS.
Here are the test settings.
* codec - g711
* aec -> pref("media.getusermedia.aec", 4);

* Old version
** browser -> 24%
* New version
** without any modification
*** browser -> 37%
** SPEEX_RESAMPLER_QUALITY_MIN
*** browser -> 34%
** SPEEX_RESAMPLER_QUALITY_VOIP
*** browser -> 39%
** Force AudioStream::InitPreferredSampleRate() to return 48000
*** browser -> 29%

[1] http://dxr.mozilla.org/mozilla-central/source/content/media/MediaStreamGraph.cpp?from=MediaStreamGraph.cpp#2295
[2] http://hg.mozilla.org/mozilla-central/annotate/8e797a1a4d65/media/libspeex_resampler/src/resample.c#l614
Attached file perf data of comment 4
Nominating this as a v2.0 blocker and then will remove it as a dependency for Bug 984239 (the user story for getting H.264 functional on FxOS 2.0).
blocking-b2g: --- → 2.0?
No longer blocks: 984239
Why is this needed for 2.0 specifically? Is this required for the Loop's MVP support?
Flags: needinfo?(mreavy)
Keywords: perf
It's not needed for Loop's MVP.  We just really want to improve the perf for 2.0 (gecko 32).  What's the proper way to mark this?
Flags: needinfo?(mreavy)
(In reply to Maire Reavy [:mreavy] (Please needinfo me) from comment #8)
> It's not needed for Loop's MVP.  We just really want to improve the perf for
> 2.0 (gecko 32).  What's the proper way to mark this?

I'd mark this as backlog in the blocking-b2g flag with '[priority]' in the whiteboard to indicate that this is important to get resolved. I wouldn't block the release on it, mainly because we only block on issues that must be fixed in Gecko 32 no matter what (true stop ship).
blocking-b2g: 2.0? → backlog
Whiteboard: [priority]
Some neon optimizations landed upstream, which would reduce resampling time by one third (on supporting systems).
http://git.xiph.org/?p=speexdsp.git;a=commitdiff;h=0e5d424fdba2fd1c132428da38add0c0845b4178
Is there an eta for getting those speex changes into Mozilla?
Flags: needinfo?(jmvalin)
Flags: needinfo?(giles)
AFAIK, Tristan's working on this. I'm not sure what the ETA is, but I'm sure getting more testing can speed things up. There are also some changes to use the "direct" method (instead of interpolation) that should help as well. I'm not sure if they're in the Mozilla tree. Also, are you guys compiling this code as fixed-point, and what complexity are you using? Each of these issues (direct resampler patch, fixed-point, complexity) has roughly the same performance impact as the Neon code.
Flags: needinfo?(jmvalin)
Yes, this is compiled as fixed point on mobile.

--enable-resample-full-sinc-table won't be suitable because there could be rate changes, but a change to the HUGEMEM heuristic may be an option (comment 2 and 3).

SPEEX_RESAMPLER_QUALITY_DEFAULT is used but comment 4 indicated little gain from changing that.  (I don't know whether this is "complexity".)
Karl is driving this, I think. If there's stuff upstream I'm happy to update the in-tree version if that's helpful. Just file a bug and assign it to me. Are we ready to do that, Karl?
Flags: needinfo?(giles)
I started getting some things ready for an update in bug 1033122 and bug 1033140.

See also these changes for updating to bbe7e099, which doesn't include neon optimizations, and are missing the hugemem patches from opus-tools.
https://hg.mozilla.org/try/rev/7827546c45a4
https://hg.mozilla.org/try/rev/be59b95af8d2

I'm taking some vacation, and so it'll be about 2 weeks before I get back to that.
There may also be a little work enabling dynamic SSE detection.
Depends on: 1042508
The upstream neon optimizations have only been implemented for the direct
resampler, so we'd either need to switch to the direct resampler, to benefit
from the optimizations, or write some optimizations for the interpolating
resampler.

At 16 -> 44.1 using the direct resampler means about 441/8 ≅ 55 times as much
memory/initialization as interpolating resampler at default quality.  That is
only 55.125 kB with 16 bit samples, but still 55 times as much initialization
work.  The current hugemem patch uses the direct resampler with up to 32 times
as much memory/initialization at default quality.
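To make the "times as much memory/initialization" comparison concrete, here is a rough sketch that compares the two sinc table sizes. The function name is hypothetical; the table-size expressions (den_rate * filt_len for direct, oversample * filt_len + 8 for interpolating) are the ones from the test quoted earlier in this bug, and filt_len = 64 with oversample = 8 is assumed for default quality.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: how many times larger the direct resampler's sinc
   table is than the interpolating resampler's table. */
double direct_to_interp_table_ratio(uint32_t den_rate, uint32_t filt_len,
                                    uint32_t oversample)
{
    assert(filt_len > 0 && oversample > 0);

    double direct_entries = (double)den_rate * filt_len;
    double interp_entries = (double)oversample * filt_len + 8.0;
    return direct_entries / interp_entries;
}
```

For den_rate = 441 this gives a little over 54, slightly below 441/8 because of the + 8 term, which is in line with the ≅ 55 estimate above.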

One option to consider might be to make this large memory choice only for
initializing a new resampler, not for rate changes, but there would be a
benefit even with rate changes if they are not too frequent.

It would be possible to refactor things a bit to do a lazy sinc table
initialization, but that would only be of benefit if processing chunks of less
than 10 ms at each rate.  I expect we need to process somewhat more than 10
ms at 16 -> 44.1 to benefit from the direct resampler.
Given we are not doing much in the way of rate changes ATM, I suggest we start with this and we can look at other improvements in the future if/when rate changes become a problem.

This also changes the HUGEMEM test to be independent of quality.
|oversample| has a maximum value of 32, so we know that the interpolating
resampler will use less memory when st->den_rate > 441.
Assignee: nobody → karlt
Status: NEW → ASSIGNED
Attachment #8461197 - Flags: review?(jmvalin)
Whiteboard: [priority] → [priority][webrtc_uplift]
Comment on attachment 8461197 [details] [diff] [review]
use direct resampler for 16->44.1k

Review of attachment 8461197 [details] [diff] [review]:
-----------------------------------------------------------------

This looks like the diff of a diff. I'm having a hard time making any sense of it.
(In reply to Jean-Marc Valin (:jmspeex) from comment #18)
> Comment on attachment 8461197 [details] [diff] [review]
> use direct resampler for 16->44.1k
> 
> Review of attachment 8461197 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> This looks like the diff of a diff. I'm having a hard time making any sense
> of it.

That is just an addition to the patchset that gets applied on top of uplifts of new Opus source; it's just updating that patchset to include the change to the source file so we won't lose the change on the next uplift of Opus source.
Yes, the changes to resample.c are the real changes.
The changes to hugemem.patch are just the same changes to a patch applied on uplifts, from speexdsp now.
On further thought, I don't think there is much to gain from the more complex
heuristic in comment 21.  It avoids the direct resampler when it is
particularly costly, but in those situations (large down-sampling ratios) the
processing is very costly and there is more potential to gain from the direct
resampler.

The heuristic in attachment 8461197 makes more sense, I think, because it weighs
up the relative costs of initialization and processing, which are both
proportional to filt_len (in an ideal implementation).  If the initialization
costs are too high because the down-sampling ratio makes filt_len too high,
then the processing costs will be too high anyway.  There is a different
problem to solve for large-ratio down-sampling and the heuristic of comment 21
would not solve that.
Attachment #8461197 - Flags: review?(jmvalin) → review-
Comment on attachment 8461197 [details] [diff] [review]
use direct resampler for 16->44.1k

Review of attachment 8461197 [details] [diff] [review]:
-----------------------------------------------------------------

re-marking for review per discussion in #media
Attachment #8461197 - Flags: review- → review?(jmvalin)
Comment on attachment 8461197 [details] [diff] [review]
use direct resampler for 16->44.1k

Review of attachment 8461197 [details] [diff] [review]:
-----------------------------------------------------------------

r+ with the following check

I believe that the line
if (st->den_rate <= 441)
can probably have the bounds changed to 160 and still work for the 44.1<->48k case.
Attachment #8461197 - Flags: review?(jmvalin) → review+
(In reply to Jean-Marc Valin (:jmspeex) from comment #25)

Thanks for having a look.

Yes, 44k1:48k simplifies to 147/160, and so 160 would be sufficient for
44k1<->48k.

However, here we are resampling from 16k to 44k1.  We no longer have the common
factor of 3 and so this simplifies only to 160:441.  den_rate derives from out_rate and so we need to compare against 441 to use the direct resampler for 16k->44k1.

Does 441 seem reasonable, given that?
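The ratio arithmetic above can be checked with a small sketch that reduces a pair of sample rates by their greatest common divisor. The struct and function names are hypothetical and only illustrate the arithmetic; the denominator is taken from the output rate, as described above.

```c
#include <assert.h>
#include <stdint.h>

struct rate_ratio {
    uint32_t num; /* derived from the input rate */
    uint32_t den; /* derived from the output rate */
};

/* Euclid's algorithm for the greatest common divisor. */
static uint32_t gcd_u32(uint32_t a, uint32_t b)
{
    while (b != 0) {
        uint32_t t = a % b;
        a = b;
        b = t;
    }
    return a;
}

/* Hypothetical helper: reduce in_rate:out_rate to lowest terms. */
struct rate_ratio reduce_rates(uint32_t in_rate, uint32_t out_rate)
{
    assert(in_rate > 0 && out_rate > 0);

    uint32_t g = gcd_u32(in_rate, out_rate);
    struct rate_ratio r = { in_rate / g, out_rate / g };
    return r;
}
```

reduce_rates(44100, 48000) yields 147:160, reduce_rates(16000, 44100) yields 160:441, and reduce_rates(16000, 48000) yields 1:3, which is why a 48 kHz output rate is so much cheaper for 16 kHz input.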
Flags: needinfo?(jmvalin)
Memory use of the direct resampler is filt_len * den_rate * word_size.
At default quality, when up-sampling, filt_len = 64.
On mobile word_size = 2.
So for 16k->44k1 on mobile, at default quality, memory use is
  64 * 441 * 2 = 55.125 kB

Memory use can be higher when down-sampling, if ratios are not nice.
I don't think we use higher qualities.

192k->44k1 gives 640:147 and so direct resampler memory use is about
  64 * 640 * 2 = 80 kB
Both existing and proposed heuristics would select the direct resampler for
this situation.

There are possible situations that would use significantly more memory with
the new approach, but those situations are expected only in demanding Web
Audio use.  The patch increases the switch point at large down-sampling from
16*(1+8) to 441, which makes the unlikely worst cases worse by a factor of 3.
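The memory figures above can be reproduced with a short sketch. It follows the arithmetic in this comment only: it assumes that when down-sampling the filter is lengthened by num_rate / den_rate, so that filt_len * den_rate becomes the base length times num_rate, and it ignores any rounding the real code may apply. The function name is hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: approximate size in bytes of the direct resampler's
   sinc table for a reduced ratio num_rate:den_rate. */
size_t direct_table_bytes(uint32_t num_rate, uint32_t den_rate,
                          uint32_t base_filt_len, size_t word_size)
{
    assert(num_rate > 0 && den_rate > 0);

    /* Up-sampling: base_filt_len * den_rate entries.
       Down-sampling (assumed): the filter grows by num_rate / den_rate,
       giving roughly base_filt_len * num_rate entries. */
    uint32_t larger = num_rate > den_rate ? num_rate : den_rate;
    return (size_t)base_filt_len * larger * word_size;
}
```

direct_table_bytes(160, 441, 64, 2) gives 56448 bytes (55.125 kB) and direct_table_bytes(640, 147, 64, 2) gives 81920 bytes (80 kB), matching the two cases worked through above.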
If you're fine with the memory use, then go ahead. Also, can the user cause the resampler quality parameter to change? If so, you might want to check the max memory for quality=10.
Flags: needinfo?(jmvalin)
The user has no control over the quality.  Karl: what's the CPU win of this patch on b2g?
(In reply to Randell Jesup [:jesup] from comment #29)
> Karl: what's the CPU win of this patch on b2g?

I don't have my own measurements, but the direct resampler performs 1/4 as many multiplications as the interpolating resampler.  Multiplications are not the only work as there is some copying involved.  The time taken by the direct resampler for 96->44.1 has been measured on ARM to be about 1/3 that of the interpolating resampler.  That's consistent with the 48k numbers in comment 4, so I think we can expect something similar.

On top of that, only the direct resampler has neon optimizations.  It can expect a further reduction in resampling time of 1/2, giving 1/6 overall, on neon architectures.
https://hg.mozilla.org/mozilla-central/rev/61147905cbf2
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla35
blocking-b2g: backlog → ---