Closed Bug 818822 Opened 7 years ago Closed 6 years ago

Need to resample inputs to MediaStreamGraph


(Core :: WebRTC: Audio/Video, defect)

(Reporter: jesup, Assigned: padenot)


(Blocks 2 open bugs)


(Whiteboard: [s=fx32][p=13, 1.5:p1, ft:webrtc])


(4 files, 2 obsolete files)

We need to resample some/most inputs to the MediaStreamGraph in order to avoid drift issues with audio being sampled on clocks other than the CPU clock (or sampled on an entirely different machine).

There has been considerable discussion (see below and bug 691234 for part of it), and roc, mreavy, derf and I talked extensively about it at IETF and at other points.

roc: you may want to summarize your thoughts/plan here.  My understanding was that we would try clocking off the playback driver, and resample inputs.

Handling mic inputs (which may have a different clock than the audio out!) is tricky, especially with echo cancellation, which really needs to see the actual samples that went out to the hardware. 

I *believe* the right way to handle that if the mic and speakers don't share a clock is to resample the "far-end" sample (going to the speakers) to match the input clock rate, since that's effectively what the differing hardware sample rates are doing to you and them anyways.

From bug 691234 comment 41:

 Randell Jesup [:jesup] 2012-06-21 11:03:38 PDT

Right - jitter buffers move a lot more than any reasonable hardware clock, so they need to do more than resample.  The Time Scale Modification (TSM) stuff in NetEQ is a common and quite effective trick.  (For anyone who didn't know.)

I certainly would be tempted to use the hardware capture clock for incoming mic data, and the playout clock (not guaranteed to be the same as the capture clock!) for data from the PeerConnection, and if the two are connected directly (audio.src = getUserMedia_stream) then force the resample; but I realize that's not how MediaStreams work.

An example for a videophone (DSP details omitted).  Buffer nodes rebuffer into 10ms frames.
Recording taps, VAD, control, etc. are all omitted.  A lot of this is inside the webrtc code, but it helps to understand the complexity of stream management.

The wonder of Emacs artist-mode!
             |                                     |
       +-------------+                             |    Send audio thread
       | Jitter Buff |     RTP reception thread    | 
       |   Insert    |                             |         RTP
       +------+------+                             |          |
              |                                    |  +-----------------+ 
 -------------+------------------------------------+  |  Speech Encode  |
       +------+------+     Receive audio           |  |                 | 
       | Jitter Buff |         thread              |  +--------+--------+
       |   Dequeue   |                             |  +--------+--------+
       +------+------+                             |  |   Mixer         +- Other
       +------+------+                             |  +--------+--------+ sources
       |  Decode     |                             |   +-------+--------+
       |             |                             |   |  Buffer        |
       +------+------+                             |   +-------+--------+
       +------+------+                             |           |
       |Time Scale   |                             |    +------+----------+
       |Modification |                             |    |   Conference    |
       +------+------+                             | /--+Bridge (optional)|
       +------+------+       +-------------+   /---+-   +-------+---------+
       |   Buffer    +-------+   audio     +---    |            |
       |             |       | conf. bridge|       |     +------+-------+
       +------+------+       +-------------+       |     |AGC and output|
       +------+------+                             |     | level adjust |
       |   Mixer     +-- other sources             |     |& DTMF insert |
       |             |                             |     +------+-------+
       +------+------+                             |            |
       +------+------+                             |     +------+-------+
       |AGC, DC Filt |            +---------------+|     |    Main      |
       | & volume    +------------+ fsb queue     ++-----+    AEC       |
       +------+------+            |               ||     +-------+------+
       +------+------+            +---------------+|      +------+------+
       |Playout audio|                             |      |Capture audio|
       |    driver   |                             |      |    driver   |
       +------+------+                             |      +------+------+
       +------+------+    Linux Kernel                           |
       |  Buffers    |                                    +------+-------+
       +-------------+                                    |  Buffers     |
       |             |                                    +--------------+
       +-------------+                                    |              |
       |             |                                    +--------------+
       +----------+--+                                    |              |
                  |                                       +-------+------+
               |                                                          |
               |           Hardware Audio Codec                           |
               | Note: Use HW Biquad filters to flatten response if avail |
                  |                                         |
                ..|.                                        |
               ..  ..                                     --+--
               ..  ..                                     --+--
              ..    ..                                   (     )
             ..........                                   -----

This let the output (mic -> RTP) thread run with microphone timestamps, and the input thread (after the TSM and associated buffer node) run with audio playout timestamps.  RTP input to TSM runs with RTP source timestamps; the TSM takes the place of a resampler.  The only complexity is at the cross input/output thread links, which would need resampling if the mic and playout clocks aren't synced (mine were), and mixing of other sources (ditto).


It's probably helpful to label your clock sources; merging two streams with the same clock source doesn't need a resample; merging them with different sources does.  The complication here is that MediaStreams run on internal (CPU/NTP) clock time; the only sources that do so are probably pre-recorded sources.
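As a hedged illustration of the clock-labeling idea, a merge could be decided with something like the following. The types and names here are hypothetical, not Gecko's:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical clock-domain tag: one id per physical clock (a sound card,
// a remote RTP source, or the internal CPU/NTP clock).
struct StreamClock {
  uint32_t domainId;   // which physical clock drives this stream
  double nominalRate;  // nominal sample rate in Hz
};

// Merging two streams only needs a resampler when they are driven by
// different physical clocks, or run at different nominal rates.
bool NeedsResample(const StreamClock& a, const StreamClock& b) {
  return a.domainId != b.domainId || a.nominalRate != b.nominalRate;
}
```

Under this scheme, pre-recorded sources would all carry the internal-clock id, so mixing them stays resample-free.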

So, for all sources that aren't internal (sourced at a video or audio decoder element), the inputs will need associated resamplers or timestamp transcoders.  For audio, the sources will need to monitor input samples versus nominal frequency (at CPU/NTP clock rate), perhaps with a PLL, and use that to drive the settings for the resampler.  For video, I would simply use the PLL to convert input timestamps to internal CPU timestamps.  This might need some tweakery, as many RTP sinks assume video sources are at consistent cadences and timestamp deltas in order to provide smooth playback, generally some integer frame rate, or some integer number of frames each on a 30 (or 25!) fps boundary (i.e. timestamp deltas of 3000 at a 90 kHz timestamp rate).

Note that many audio sources may require a resampler anyways to feed into the core, so this would just affect settings for it.  The PLL aspect of it does mean it may take a short time to converge after starting.
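A minimal sketch of the monitoring idea: a first-order loop (a trivial software PLL) that low-passes the observed input rate and derives a resampler ratio from it. All names here are illustrative, and the loop gain is an arbitrary choice; a small gain is what makes it "take a short time to converge":

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical drift estimator: tracks a source's true rate against the
// graph (CPU/NTP) clock and derives the resampler setting from it.
class DriftEstimator {
public:
  explicit DriftEstimator(double aNominalRate)
    : mNominalRate(aNominalRate), mEstimatedRate(aNominalRate) {}

  // Called per callback with the samples delivered and the elapsed
  // graph-clock time (seconds) for that interval.
  void Feed(uint32_t aSamples, double aGraphSeconds) {
    const double observed = aSamples / aGraphSeconds;
    // Low-pass the instantaneous estimate so callback jitter doesn't
    // swing the resampler; smaller gain = slower but smoother lock.
    mEstimatedRate += kLoopGain * (observed - mEstimatedRate);
  }

  // Ratio to program into the resampler (output rate / input rate).
  double ResampleRatio() const { return mNominalRate / mEstimatedRate; }

private:
  static constexpr double kLoopGain = 0.05;
  double mNominalRate;
  double mEstimatedRate;
};
```

For example, a device nominally at 48000 Hz that actually delivers ~48100 samples per graph-clock second would converge to a ratio just under 1.0, telling the resampler to discard the excess.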

We probably need this to be a core part of MediaStream data inputs, with an option to disable it if we know the source is synced to the CPU.

Sorry for the long message...
(In reply to Randell Jesup [:jesup] from comment #0)
> I *believe* the right way to handle that if the mic and speakers don't share
> a clock is to resample the "far-end" sample (going to the speakers) to match
> the input clock rate, since that's effectively what the differing hardware
> sample rates are doing to you and them anyways.

I think we should do things almost exactly the other way around (and I think I've said this before): the audio output should be the master clock, and we should resample at the microphones to compensate for drift. Part of the reason for this is that while we will basically always have a single audio output device, we won't always have exactly one input device. In fact we will often have zero. In a system with multiple microphones, none of which are in use, which clock do you sync to?

In any case, from a practical point of view, the Speex resampler was designed to handle exactly this case (small, continuously changing sample rate differences), and is already in the tree, so we should probably just use that. After bug 818327 it should even work.
I tend to agree with Tim that syncing with the playback is better than syncing with capture, for exactly the same reasons. I don't think it makes any difference to the AEC whether it's the capture or the playback that's resampled. The other possibility is to go with the "global clock" (RTC/NTP-based I assume) that's used for the RTP timestamps. The advantage of this is that you never have to deal with any clock drift and/or RTP timestamp adjustments. I think the RTC is also much more accurate than what a soundcard gives you. Of course, the downside is that it means twice the resampling.
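For illustration only, here is a toy variable-ratio resampler using linear interpolation; the real code would use the in-tree Speex resampler, which supports exactly this kind of on-the-fly rate change. This stand-in is hypothetical and unfiltered, so it is a sketch of the mechanism, not production audio code:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Resample one block of mono audio. `ratio` is output samples per input
// sample (e.g. slightly above 1.0 when the capture clock runs slow).
// Linear interpolation only; a real resampler uses a polyphase filter.
std::vector<float> Resample(const std::vector<float>& in, double ratio) {
  std::vector<float> out;
  const double step = 1.0 / ratio;
  for (double pos = 0.0; pos + 1.0 < in.size(); pos += step) {
    const size_t i = static_cast<size_t>(pos);
    const double frac = pos - static_cast<double>(i);
    out.push_back(
        static_cast<float>((1.0 - frac) * in[i] + frac * in[i + 1]));
  }
  return out;
}
```

Calling this per block with a ratio continuously updated from the drift estimate is the "small, continuously changing sample rate differences" case; the Speex resampler handles the same job with proper anti-aliasing.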
My apologies - it was late when I entered that bug.  My discussions with roc involved using the output clock as the master, not the input clock - I got it backwards when writing it down.  And as mentioned, that would mean resampling the input, not the output.

I think locking to the playout clock also may help us control playout latency.  Note that the output frequency may jump if the output destination changes.
(In reply to Randell Jesup [:jesup] from comment #3)
> latency.  Note that the output frequency may jump if the output destination
> changes.

This is something Jean-Marc pointed out to me, as well, but I think it's something we'd want to know about (and since this can happen somewhere beyond our control, may be our only way of finding out).
Summary: Need to resample inputs to MediaStramGraph → Need to resample inputs to MediaStreamGraph
Duplicate of this bug: 771834
Blocks: 884365
Blocks: 908834
Blocks: 958090
When there are two different output cards connected to the MSG, I wonder whether there is anything ensuring that they are in sync.  If nothing guarantees this, then something needs to be done on every output except one, to avoid drift.
Correct: nothing guarantees two outputs are synced.  In theory the "right" way to do it is to have the graph self-divide (graph nodes are driven by the output), and when there's a cross-connect, force a resample.  Most graphs will end in one output, and not cross-connect.  In some rare instances a source might get played on both outputs, in which case at the join point (the source) you'd need to resample.

It may be simpler to choose one output as "main", and resample anything sent to the other one to that (if both are in use).  It gets trickier when there's a switch from two to one outputs active, or from one to another, but probably not too problematic modulo that you might need to leave a small buffer in place after updating the resample rate which you wouldn't need if it had been originally run without a resampler.  (i.e. if A is direct (no resample), B is added (and resampled to match A), then A stops, B will still need a buffer even if the resampler is turned off (though maybe you can drift it slowly until the buffer underflows and you can remove the buffer...))
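A hedged sketch of the "choose one output as main" policy (illustrative types only, not Gecko's): every other active output whose clock domain differs from the main's gets a resampler; the main output is always played directly.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical per-output description.
struct Output {
  uint32_t clockDomain;  // id of the physical clock driving this device
  bool active;
};

// For a chosen main output, mark which other outputs need a resampler.
std::vector<bool> ResamplePlan(const std::vector<Output>& outs,
                               size_t mainIdx) {
  std::vector<bool> plan(outs.size(), false);
  for (size_t i = 0; i < outs.size(); ++i) {
    plan[i] = outs[i].active && i != mainIdx &&
              outs[i].clockDomain != outs[mainIdx].clockDomain;
  }
  return plan;
}
```

When the main output goes away, recomputing the plan with a new mainIdx is the "switch" case above, and is where the leftover buffer would have to be drained.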
Assignee: roc → paul
Blocks: 970426, 694814, 929128
Duplicate of this bug: 970715
If we're going to provide a single mixed stream from the MediaStreamGraph back to the AEC (and to support driving the MSG from the outputs), we need to resample to 48000 Hz at all inputs that aren't 48000 Hz, even if they don't need it for clock domain reasons (like gUM audio does).

gUM (or anything else with a different clockbase) will need to be resampled/buffered.
This resamples all SourceMediaStream to have a rate of IdealAudioRate(), so that
all the audio that flows through the graph has the same sample rate, so that we
can easily mix the tracks.

The resampling happens in `SourceMediaStream::AppendToTrack`.
Attachment #8389635 - Flags: review?(roc)
Blocks: 982490
Blocks: 983023
Blocks: 983052
Blocks: 983062
This is necessary for the rest to work.
Attachment #8397196 - Flags: review?(rjesup)
Comment on attachment 8397196 [details] [diff] [review]
Update AudioConduit so it can work at 44.1kHz. r=

Review of attachment 8397196 [details] [diff] [review]:

r+ but this change is also included in my patchset
Attachment #8397196 - Flags: review?(rjesup) → review+
Attachment #8389635 - Attachment is obsolete: true
Rebased and ready to land.
Attachment #8397196 - Attachment is obsolete: true
Also may be new

13:16:47     INFO -  [2358] ###!!! ASSERTION: Buffer underran: 'bufferEnd >= mCurrentTime', file /builds/slave/m-in-lx-d-00000000000000000000/build/content/media/MediaStreamGraph.cpp, line 473
Backed out of central:
Resolution: FIXED → ---
Depends on: 991504
Got some logs via testing/marionette/client/marionette/ changes:

     env = {'MOZ_CRASHREPORTER': '1',
+           'NSPR_LOG_MODULES': 'mediastreamgraph:5,signaling:5,getusermedia:5'}

13:24:39     INFO -  138 INFO TEST-PASS | /tests/dom/media/tests/mochitest/test_peerConnection_setLocalOfferInHaveRemoteOffer.html | iceConnectionState should not be undefined
13:24:39     INFO -  139 INFO TEST-INFO | /tests/dom/media/tests/mochitest/test_peerConnection_setLocalOfferInHaveRemoteOffer.html | PeerConnectionWrapper (pcRemote): oniceconnectionstatechange fired, new state is: closed
13:24:39     INFO -  140 INFO TEST-INFO | /tests/dom/media/tests/mochitest/test_peerConnection_setLocalOfferInHaveRemoteOffer.html | PeerConnectionWrapper (pcRemote): Closed connection.
13:24:39     INFO -  141 INFO TEST-INFO | /tests/dom/media/tests/mochitest/test_peerConnection_setLocalOfferInHaveRemoteOffer.html | Test finished
13:24:39     INFO -  142 INFO TEST-UNEXPECTED-FAIL | /tests/dom/media/tests/mochitest/test_peerConnection_setLocalOfferInHaveRemoteOffer.html | Test timed out.

Probably some type of shutdown/cleanup of the MSG or stream...
See bug 991866 for code needed to run full M10 mochitest-remote locally (and no, it didn't fail for me)
Hmmm, it didn't seem to have any extra logging in the above.  :-( may also be relevant.  The test destroyed the peerconnection at timestamp 489.xx - which may be near the mochitest timeout time
(In reply to Randell Jesup [:jesup] from comment #23)
> Hmmm, it didnt' seem to have any extra logging in the above.  :-(

I wonder whether this might be to blame:
Depends on: 992436
Attachment #8402226 - Flags: review?(paul)

The problem is the debug emulator running out of virtual CPU power to keep up with realtime inputs.

The patch reduces fake video and audio input rates to reduce the CPU load from the media streams.
M10 is 11 for 11 green :-)
Once this is r+'d we can reland
(without the wakelock stuff) also green
Attachment #8402226 - Flags: review?(paul) → review+
Comment on attachment 8402226 [details] [diff] [review]
Reduce fake audio/video rates on b2g debug only to avoid overloading mochitest emulator VMs

>+#if defined(MOZ_WIDGET_GONK) && defined(DEBUG)
>+// B2G emulator debug is very, very slow and has problems dealing with realtime audio inputs
>+  mTimer->InitWithCallback(this, (1000 / mOpts.mFPS)/10, nsITimer::TYPE_REPEATING_SLACK);
>   mTimer->InitWithCallback(this, 1000 / mOpts.mFPS, nsITimer::TYPE_REPEATING_SLACK);

I would have naively guessed that reducing the interval to 1/10th would mean the timer is called 10x as often, *increasing* the video rate.
No longer depends on: 992436
This test a) has issues with missing preload/loops which makes it flaky, b) even with that fixed, on b2g emulator opt (both local and on tbpl) randomly fails with timeouts with this patch.  Likely this is a timing issue simply provoked by the changes in this bug.

See bug 994351
Attachment #8404286 - Flags: review?(drno)
Comment on attachment 8404286 [details] [diff] [review]
Disable problematic test for B2G

We'll add the annotation to bug 994351 as well
Comment on attachment 8404286 [details] [diff] [review]
Disable problematic test for B2G

Review of attachment 8404286 [details] [diff] [review]:

One nit.
Otherwise looks good to me.

::: content/media/test/crashtests/crashtests.list
@@ +65,5 @@
>  load 952756.html
>  load 986901.html
>  load buffer-source-ended-1.html
>  load offline-buffer-source-ended-1.html
> +skip-if(B2G) HTTP load media-element-source-seek-1.html

Can we add a comment referring to bug 994351, please?
Otherwise looks good to me.
Attachment #8404286 - Flags: review?(drno) → review+
Forgot to re-push the "slow down inputs for B2G emulator debug" patch this time:
No longer blocks: MediaSegmentBase
Depends on: MediaSegmentBase
Depends on: 996853
Depends on: 999267
Whiteboard: [s=fx32]
Updating this to WebRTC since that's the work driving this.
Component: Video/Audio → WebRTC: Audio/Video
Whiteboard: [s=fx32] → [s=fx32][p=2, 1.5:p1, ft:webrtc]
Whiteboard: [s=fx32][p=2, 1.5:p1, ft:webrtc] → [s=fx32][p=13, 1.5:p1, ft:webrtc]
Depends on: 994351
Depends on: 1014862
No longer blocks: 983023
No longer blocks: 983062