Upstream issues with OpenSLES audio code in webrtc

RESOLVED FIXED in Firefox 24

Status

()

defect
RESOLVED FIXED
6 years ago
9 months ago

People

(Reporter: jesup, Assigned: jesup)

Tracking

(Blocks 1 bug)

22 Branch
mozilla26
Points:
---
Dependency tree / graph

Firefox Tracking Flags

(blocking-b2g:koi+, firefox24 fixed, firefox25 fixed, firefox26 fixed, firefox-esr24 wontfix)

Details

(Whiteboard: [FT: Media Recording, Sprint 2][qa-])

Attachments

(5 attachments, 1 obsolete attachment)

See bug 825112, especially comment 136
https://bugzilla.mozilla.org/show_bug.cgi?id=825112#c136

There are also some TSAN data-race issues flagged earlier in the bug against a copy of the JNI audio code, that are generally shared by many or all of the audio backends in the upstream.  Depending on the compiler, this may or may not be an issue - though generally not in practice (though they should be fixed).
includes changes to the Playout stop code from upstream trunk, though we don't use playout
Attachment #782644 - Flags: review?(tterribe)
Attachment #782644 - Flags: review?(paul)
Attachment #782644 - Flags: review?(mwu)
Depends on: webrtc_updates
Depends on: webrtc_upstream_bugs
No longer depends on: webrtc_updates
Blocks: 899204
Comment on attachment 782644 [details] [diff] [review]
clean up record issues in webrtc OpenSLES code

Review of attachment 782644 [details] [diff] [review]:
-----------------------------------------------------------------

::: media/webrtc/trunk/webrtc/modules/audio_device/android/audio_device_opensles_android.cc
@@ +1371,2 @@
>  
> +  {

Why this anonymous scope if you don't use RAII? I assume the WebRTC code is compiled with -fno-except, and they don't have some |CriticalSectionExitScoped| facility.

@@ +1375,3 @@
>        if (rec_voe_audio_queue_.size() <= 0) {
> +        crit_sect_.Leave();
> +        rec_timer_.Wait(40);

Comment over this magic number, or constant with a meaningful name.

@@ +1377,5 @@
> +        rec_timer_.Wait(40);
> +        crit_sect_.Enter();
> +        if (rec_voe_audio_queue_.size() <= 0) {
> +          // if we went 40ms without a buffer, we need to see if we're still
> +          // recording or have been Stop()ed, and it's waiting for us to return

In fact, I feel that this comment block should go above the |Wait(40)|, explain the general strategy, and give a rational behind the 40ms number, or say it is arbitrary.
Attachment #782644 - Flags: review?(paul) → review+
(In reply to Paul Adenot (:padenot) from comment #3)
> Comment on attachment 782644 [details] [diff] [review]
> clean up record issues in webrtc OpenSLES code
> 
> Review of attachment 782644 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> :::
> media/webrtc/trunk/webrtc/modules/audio_device/android/
> audio_device_opensles_android.cc
> @@ +1371,2 @@
> >  
> > +  {
> 
> Why this anonymous scope if you don't use RAII? I assume the WebRTC code is
> compiled with -fno-except, and they don't have some
> |CriticalSectionExitScoped| facility.

There's no way to release the scoped/RAII version and then re-acquire it (just release), so I can't use their RAII functions.  The block was already there; I can remove it.  It does provide visual scoping, but no functional scoping now.

> @@ +1375,3 @@
> >        if (rec_voe_audio_queue_.size() <= 0) {
> > +        crit_sect_.Leave();
> > +        rec_timer_.Wait(40);
> 
> Comment over this magic number, or constant with a meaningful name.
> 
> @@ +1377,5 @@
> > +        rec_timer_.Wait(40);
> > +        crit_sect_.Enter();
> > +        if (rec_voe_audio_queue_.size() <= 0) {
> > +          // if we went 40ms without a buffer, we need to see if we're still
> > +          // recording or have been Stop()ed, and it's waiting for us to return
> 
> In fact, I feel that this comment block should go above the |Wait(40)|,
> explain the general strategy, and give a rational behind the 40ms number, or
> say it is arbitrary.

sure, thanks
(In reply to Randell Jesup [:jesup] from comment #4)

> There's no way to release the scoped/RAII version and then re-acquire it
> (just release), so I can't use their RAII functions. 

Correction: CriticalSectionScoped::Leave() is private, so I don't even have that.
Attachment #782644 - Attachment is obsolete: true
Attachment #782644 - Flags: review?(tterribe)
Attachment #782644 - Flags: review?(mwu)
Attachment #783163 - Flags: review?(tterribe)
Attachment #783163 - Flags: review?(mwu)
Comment on attachment 783163 [details] [diff] [review]
clean up record issues in webrtc OpenSLES code

Carry forward r=padenot
Attachment #783163 - Flags: review+

Comment 8

6 years ago
Comment on attachment 783163 [details] [diff] [review]
clean up record issues in webrtc OpenSLES code

Review of attachment 783163 [details] [diff] [review]:
-----------------------------------------------------------------

::: media/webrtc/trunk/webrtc/modules/audio_device/android/audio_device_opensles_android.cc
@@ +1267,5 @@
>    return (rec_warning_ > 0);
>  }
>  
>  bool AudioDeviceAndroidOpenSLES::RecordingError() const {
> +  CriticalSectionScoped lock(&crit_sect_);

This api (recording/playout warning/error) is pretty crappy overall, but it seems like it doesn't need to be very good and thread safe to do what it does..
Attachment #783163 - Flags: review?(mwu) → review+
Comment on attachment 783163 [details] [diff] [review]
clean up record issues in webrtc OpenSLES code

Review of attachment 783163 [details] [diff] [review]:
-----------------------------------------------------------------

This is forward progress, but it does not by any means resolve all of the issues, even with just the recording pieces.

::: media/webrtc/trunk/webrtc/modules/audio_device/android/audio_device_opensles_android.cc
@@ +1176,4 @@
>  
> +  if (sles_player_itf_) {
> +    // Make sure player is stopped
> +    int32_t res = (*sles_player_itf_)->SetPlayState(sles_player_itf_,

Note for upstream: If you try to fix the races in PlayerSimpleBufferQueueCallbackHandler() by adding locking, this will deadlock/leak, and you can't just release crit_sect_, as Terminate() calls this function while holding the lock as well.

@@ +1370,5 @@
> +  while (is_recording_) {
> +    if (rec_voe_audio_queue_.size() <= 0) {
> +      crit_sect_.Leave();
> +      // Wait for max 40ms for incoming audio data before looping the
> +      // poll and checking for ::Stop() being called (which waits for us

Why don't you just have StopRecording() frob rec_timer_ (which, BTW, is not a timer)? Waiting around 40 ms to tear down the recording thread for no reason seems dumb (which thread does destruction happen on? I hope it's not the main thread).

@@ +1395,3 @@
>      voe_audio_buffer_->SetRecordedBuffer(buf, num_samples);
>      voe_audio_buffer_->SetVQEData(playout_delay_, recording_delay_, 0);
>      voe_audio_buffer_->DeliverRecordedData();

If you hold crit_sect_ while calling all of these functions (which, you know, you need to, unless you can prove races with AttachAudioBuffer, etc., can't happen), you are blocking RecorderSimpleBufferQueueCallbackHandler() for at least as long as it takes them to execute. This means there is _no_ advantage to doing all of this on a separate thread, not even a theoretical "we shouldn't be doing any processing in the OpenSL ES callback because some day it might not have its own independent thread" (which I don't buy anyway).

I reiterate that the whole thread should just go away. It serves no purpose except overhead and complexity.

@@ +1428,5 @@
> +    if (rec_voe_ready_queue_.size() <= 0) {
> +      // Log Error.
> +      rec_error_ = 1;
> +      WEBRTC_OPENSL_TRACE(kTraceError, kTraceAudioDevice, id_,
> +                          "  Audio Rec thread buffers underrun");

Didn't notice this in my earlier review, but we don't change audio here, yet we call Enqueue() anyway. That means that if this ever actually triggers, we'll have two copies of this buffer running through the queues, with its contents potentially getting overwritten before they can be used elsewhere, or while they're being used. The chance of this is high because AudioRecord delivers data to OpenSL ES in bursty (64 ms) chunks.

We will never recover from this state. Every time the error triggers, it will get worse.
Attachment #783163 - Flags: review?(tterribe) → review+
(In reply to Timothy B. Terriberry (:derf) from comment #9)
> Comment on attachment 783163 [details] [diff] [review]
> clean up record issues in webrtc OpenSLES code
> 
> Review of attachment 783163 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> This is forward progress, but it does not by any means resolve all of the
> issues, even with just the recording pieces.

Thanks!

> @@ +1370,5 @@
> > +  while (is_recording_) {
> > +    if (rec_voe_audio_queue_.size() <= 0) {
> > +      crit_sect_.Leave();
> > +      // Wait for max 40ms for incoming audio data before looping the
> > +      // poll and checking for ::Stop() being called (which waits for us
> 
> Why don't you just have StopRecording() frob rec_timer_ (which, BTW, is not
> a timer)? Waiting around 40 ms to tear down the recording thread for no
> reason seems dumb (which thread does destruction happen on? I hope it's not
> the main thread).

Frobbing rec_timer_ will be at best iffy, since it may well (pretty likely in fact) get serviced before ::Stop() is called, in which case it will still wait 40ms.  ::Stop() waits for up to 10 seconds for the process to exit.  We could frob it, but we still may wait, and likely will.  The API doesn't give us an easy way to abort waits at ::Stop().  (We could add an "I'm going to kill you!" state var and then frob it, or even ask it to kill itself.)

> @@ +1395,3 @@
> >      voe_audio_buffer_->SetRecordedBuffer(buf, num_samples);
> >      voe_audio_buffer_->SetVQEData(playout_delay_, recording_delay_, 0);
> >      voe_audio_buffer_->DeliverRecordedData();
> 
> If you hold crit_sect_ while calling all of these functions (which, you
> know, you need to, unless you can prove races with AttachAudioBuffer, etc.,
> can't happen), you are blocking RecorderSimpleBufferQueueCallbackHandler()
> for at least as long as it takes them to execute. This means there is _no_
> advantage to doing all of this on a separate thread, not even a theoretical
> "we shouldn't be doing any processing in the OpenSL ES callback because some
> day it might not have its own independent thread" (which I don't buy anyway).
> 
> I reiterate that the whole thread should just go away. It serves no purpose
> except overhead and complexity.

You have a point here, but more directly it points out that in the process of making this change, the crit_sect lock grew to include those functions, in the mods to handle not busy-waiting.  The original unlocked around all three; all other desktop drivers unlock only around DeliverRecordedData() which seems a much better idea.

> @@ +1428,5 @@
> > +    if (rec_voe_ready_queue_.size() <= 0) {
> > +      // Log Error.
> > +      rec_error_ = 1;
> > +      WEBRTC_OPENSL_TRACE(kTraceError, kTraceAudioDevice, id_,
> > +                          "  Audio Rec thread buffers underrun");
> 
> Didn't notice this in my earlier review, but we don't change audio here, yet
> we call Enqueue() anyway. That means that if this ever actually triggers,
> we'll have two copies of this buffer running through the queues, with its
> contents potentially getting overwritten before they can be used elsewhere,
> or while they're being used. The chance of this is high because AudioRecord
> delivers data to OpenSL ES in bursty (64 ms) chunks.
> 
> We will never recover from this state. Every time the error triggers, it
> will get worse.

Clearly it shouldn't Enqueue in this case. Also, if this does happen, we'll end up with more than one buffer waiting to be enqueued in a future callback.  (and there's a hole where if ALL the buffers end up being processed, then we won't get called back, and thus won't enqueue, and will be hung.)

Clearly, we don't want to underrun....
Attachment #785447 - Flags: review?(tterribe)
Flag this bug to koi+ and add keyword in whiteboard for tracking in media recording agile team for FFOS 1.2. Since it is under review, I assume it will be resolved soon.
blocking-b2g: --- → koi+
Whiteboard: [FT: Media Recording, Sprint 2]
Hi Randell,

May I know if there is any targeted date for this bug to be resolved? I am asking since this function is very important for getting the user media which is related to a big user story in FFOS 1.2 media recording team. Thank you very much.

Ivan Tsay
Flags: needinfo?(rjesup)
Ivan - I'll be landing the initial 2 patches (or an update of them) as soon as they've both reviewed (still waiting on one).  I'll likely incorporate comments on the patches from the upstream (webrtc.org) source that improve it further (getting rid of the magic 40ms value).  These should land in 26.

Note that the main effect here is performance (kill 1ms poll!), and removing TSAN data races, which in practice would be unlikely to cause problems (but that can't be guaranteed without examining the assembler generated by the compiler).
Flags: needinfo?(rjesup)
Setting a deadline of 9/4 to get this into our code.
Target Milestone: --- → mozilla26
also fixed sampling-rate initialization (which was getting overwritten immediately with the correct value, so it wasn't causing a bug)
Attachment #785447 - Flags: review?(tterribe) → review?(paul)
Attachment #800256 - Flags: review?(paul)
Attachment #785447 - Flags: review?(paul) → review+
Attachment #800256 - Flags: review?(paul) → review+
https://hg.mozilla.org/mozilla-central/rev/a058786a8080
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Whiteboard: [FT: Media Recording, Sprint 2] → [FT: Media Recording, Sprint 2][qa-]
Comment on attachment 802741 [details] [diff] [review]
(beta 24 version) clean up record issues in webrtc OpenSLES code + wallpaper  to be done upstream and then will replace this

[Approval Request Comment]
Bug caused by (feature/regressing bug #): 904784

User impact if declined: some devices will fail to shut down camera and mic, and will lock it from other apps until the browser is force-quited

Testing completed (on m-c, etc.): on m-c.  Reviewed by padenot, derf and mwu, and the main Google engineer who handles the upstream code.

Risk to taking this patch (and alternatives if risky): See email; additional risk on top of 904784 is low.

String or IDL/UUID changes made by this patch: none
Attachment #802741 - Flags: approval-mozilla-beta?
Attachment #802741 - Flags: approval-mozilla-aurora?
Comment on attachment 802741 [details] [diff] [review]
(beta 24 version) clean up record issues in webrtc OpenSLES code + wallpaper  to be done upstream and then will replace this

Approving as discussed via email.
Attachment #802741 - Flags: approval-mozilla-beta?
Attachment #802741 - Flags: approval-mozilla-beta+
Attachment #802741 - Flags: approval-mozilla-aurora?
Attachment #802741 - Flags: approval-mozilla-aurora+
verified fixed on android 24.0 beta 10 build-candidate #2
Backed out from esr24 because it's mobile-only and shouldn't have been merged there.
https://hg.mozilla.org/releases/mozilla-esr24/rev/c9b5ea714522

Updated

9 months ago
No longer blocks: webrtc_upstream_bugs
You need to log in before you can comment on or make changes to this bug.