901831 - Major regression in audio drift (delay buildup) in getUserMedia starting 7/27 when freeing video frames

Reporter

Description

•

12 years ago

There was a major regression on the 7/27 nightly build on Windows (it *appears* to not regress on mac, but that's not fully tested - only 8/2 and 8/5 nightlies were tested on mac). I haven't tested Linux yet. The delay buildup is very large; a second in a few minutes (and continues to build at an apparently constant rate, though it seems like the rate may vary some from one run/build to another). Note this is on a machine with a 44100 hz mic, but the headset I was using was set in the windows sound properties to 48000 hz mono (and will be resampled with the speex resampler to 16000 to insert into the mediastream). I tried 7/20, 7/24, 7/25, 7/26, 7/27 and 7/28, nightlies (and 8/5) and the delay starts with the 7/27 nightly. The main suspect is the runtime SSE2 changes to the libspeex resampler in bug 894941, which landed on 7/26 and was in the regression window of 46d73e889cb4:fb48c7d58b8b The other bugs that landed (and touched media) were: bug 893307 - Surround speaker support in cubeb_wasapi - theoretically possible, since I'd only seen this on windows Bug 882543 - Very unlikely since it's for offline graphs, but it does touch the MediaStreamGraph code.

Randell Jesup [:jesup] (needinfo me)

Reporter

Updated

•

12 years ago

Blocks: 785584

Randell Jesup [:jesup] (needinfo me)

Reporter

Comment 1

•

12 years ago

Argh. I typed in an update and added people, but must have forgotten to save it. So: it does not *appear* to be the speex resampler or wasapi changes, and the date of regression is a bit less sure. Attempts to bisect led to inconsistent results, partly due to interactions with the other browser I believe. It's still possible they're indirectly involved, but I currently mostly dought that. It *does* seem to be strongly correlated to running two browsers at once (one with a large profile but idle), where all my tests were done. When it's 'bad' (typically while I'm running an idle Aurora), there's a fairly constant drift (increasing delay) which you can also see by continuously increasing memory usage. There also are (sometimes) "bursts" of CPU use every 4-5 minutes for 1 minute or slightly over, combined with rapidly rising delay and memory use (and usually global underruns in MSG). These bursts can happen when I'm only running one browser, and also when I'm not in a call (gum local rendering only). Note: my laptop is set to maximum performance, and during a burst the CPU use rises from say 25% to 60% (for the FF process alone). The first, more general issue seems to be made far worse when there are two browsers running. This is odd, but profiles show that MediaStreamGraph is spending an average of 30% of the time in freeing video frames: Starting at layers:Image::Release() ~Image() -> ~nsAutoPtr(layers::ImageBackendData) -> layers::PlanarYCbCrD3D10BackendData::vector-deleting-destuctor -> ditto::~PlanarYCbCrD3D10BackendData() -> ~nsRefPtr<dom::RTCPeerConnectionIceEvent() ->CLayeredObjectWithCLS<...>::CContainedObject::Release() -> CUseCountedObject<...>::Release() -> NOutermost::CDevices::LOEnter() -> RtlpDeCommitFreeBlock() -> ntWaitForSingleObject() (The ICE entry in there is spurious due to the linker merging identical functions.) When I profile with only one browser running, I see no hits there at all. Also, while it shows 30% of the time that thread is there, that doesn't tell you how long the calls took - if 1 in 100 takes 100ms, that could cause major havoc and slips in MSG. I assume there's *some* sort of interlock between the browsers in the DirectX handling. The questions are: 1) why the long waits to free video frames with two browsers? 1a) is this something new? (perhaps around 7/26-7-29?) 1b) is it a bug, or is it something that was always possible - if so, we need to move frame releases off MSG on windows at least 1c) how long are there free calls taking? 2) How much delay or frame-processing overrun can be handled before MSG "slips" a stream? 2a) How can we minimize the processing on MSG? (see bug 884365) 2b) How can we adapt to compensate for slips? (Using a resampler per bug 884365, dropping audio entirely if the latency is too large, etc) 2c) How can we refactor MSG to maximize the ability to avoid slipping a stream or to recover when we do? 2d) Exactly how sensitive is MSG to processing overruns on a specific call? 3) what causes the large bumps in CPU use every ~5 minutes? (I have a few traces in them; pure CPU traces (xperf) just show things taking longer; SPS traces show real processing code taking longer (~2-3x), but delays (waiting for video frame frees) being similar. 4) How much of this applies to Mac/Linux/mobile? Some of these we may have seen bits of there (MSG drift on overload), but much of what's being reported here seems far worse or only is on Windows. I'm sure more details will become obvious as we dig in.

Summary: Major regression in audio drift (delay buildup) in (windows?) getUserMedia starting 7/27/2013 → Major regression in audio drift (delay buildup) in getUserMedia starting 7/27 when freeing video frames

Randell Jesup [:jesup] (needinfo me)

Reporter

Comment 2

•

12 years ago

I tested a patch to save off but not release all video/audio frames in MediaSegment.cpp; the net result is that there is *no* increase in delay in the 30 seconds I can run without running out of memory. (You can Reload the page to stop the leak and reclaim the memory, and so run the test multiple times for up to 30 seconds.) If this is something expected for us to deal with (slow release of video frames), we'll want to transfer them to a thread specifically for releasing them.

Randell Jesup [:jesup] (needinfo me)

Reporter

Comment 3

•

12 years ago

Possible regression range for the "two browsers" case: Bad: http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-inbound-win32/1374803980/ Good: http://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-inbound-win32/1374802360/ Changeset range: (on inbound): c35d8edebbe1:13e83433293b (about 35 changesets) I've switched back and forth 4 times between the builds with consistent behavior. This is on the 26th but likely did not make the 26th nightly, so it appears to match the Nightly regression window. Most seem innocuous. Interesting ones: layout/base/nsDisplayList.cpp (ok, not so likely, but maybe affects gfx usage indirectly) content/media/MediaRecorder.cpp (but why would *that* be active?) a *whole* bunch of changes to sessionstore for promises. Note that my profile here is likely small, but the profile in the Aurora I have running is ~4MB. If that changeset added stalls on disk io or other jank issues somewhere, this might affect it (waiting for aurora to flush perhaps). Still, overall this is a longshot, and no idea why this would appear to impact freeing of video frames! The rest don't even seem feasible. Effing weird.

Tim Taubert [:ttaubert] (inactive)

Comment 4

•

12 years ago

FTR, this is the regression range mentioned by Randell: http://hg.mozilla.org/integration/mozilla-inbound/pushloghtml?fromchange=c35d8edebbe1&tochange=13e83433293b

Tim Taubert [:ttaubert] (inactive)

Comment 5

•

12 years ago

(In reply to Randell Jesup [:jesup] from comment #3) > a *whole* bunch of changes to sessionstore for promises. That's not really a big patch... I can't imagine this being the cause as most of this code is only active on startup or shortly after opening a new window. Did you try to use the Gecko Profiler? I think doing a 'hg bisect' shouldn't be too bad either with only 33 changesets.

Randell Jesup [:jesup] (needinfo me)

Reporter

Comment 6

•

12 years ago

(In reply to Tim Taubert [:ttaubert] from comment #5) > (In reply to Randell Jesup [:jesup] from comment #3) > > a *whole* bunch of changes to sessionstore for promises. > > That's not really a big patch... I can't imagine this being the cause as > most of this code is only active on startup or shortly after opening a new > window. I agree it doesn't make much sense... though if it somehow caused MainThread event queue issues (jank) it might in theory be the cause. > Did you try to use the Gecko Profiler? I think doing a 'hg bisect' shouldn't > be too bad either with only 33 changesets. Spoken like someone who isn't building on Windows.... ;-) Though it isn't horrible with a small range; that's likely a small enough range to cut the build times from ~50 min to ~10-15. It takes most of a day to bisect a day's worth of m-i changes The profiler is how I identified releasing video frames as being involved -- which totally doesn't match what we see in this changeset range. I am locally building the 'bad' version, and will either selectively backout or bisect from there. In around an hour....

Robert O'Callahan (:roc) (email my personal email if necessary)

Comment 7

•

12 years ago

(In reply to Randell Jesup [:jesup] from comment #1) > I assume there's *some* sort of interlock between the browsers in the > DirectX handling. > > The questions are: > > 1) why the long waits to free video frames with two browsers? > 1a) is this something new? (perhaps around 7/26-7-29?) > 1b) is it a bug, or is it something that was always possible - if so, we > need to move frame releases off MSG on windows at least We could make video buffer release happen on other threads, or recycle buffers to avoid having to free them at all. But it would be very useful to know what triggers this, so I'm looking forward to the results of that bisect? > 2) How much delay or frame-processing overrun can be handled before MSG > "slips" a stream? > 2a) How can we minimize the processing on MSG? (see bug 884365) > 2b) How can we adapt to compensate for slips? (Using a resampler per bug > 884365, dropping audio entirely if the latency is too large, etc) Sounds like we should do both. We should be able to drop data in SourceMediaStreams if the source is real-time and the buffer (i.e. latency) gets too large; this could be done by the code feeding the SourceMediaStream (i.e. gUM or PeerConnection), or we could add that feature to SourceMediaStream itself.

Randell Jesup [:jesup] (needinfo me)

Reporter

Comment 8

•

12 years ago

Ok, this broke with changeset da5fa63e90cb (Bug 8962500), which was a DisplayList.cpp change that was backed out soon thereafter (and unbroke). Then a day or two later changeset d0a9de949678 landed (a backout of bug 873378), which broke it again (again, a DisplayList change to the max layers it appears). Apparently these somehow interact (indirectly) with DirectX/video-frame-freeing when another browser is running.

Randell Jesup [:jesup] (needinfo me)

Reporter

Comment 9

•

12 years ago

Verified after re-landing of bug 873378 (the palindrome bug) that the *horrible* regression is gone, but a persistent steady buildup of delay remains (with two browsers), and it goes away if I don't free the video frames.

Randell Jesup [:jesup] (needinfo me)

Reporter

Comment 10

•

11 years ago

Need to reverify if the problem still exists (many gfx/media changes since then). Related to the issues with "local drift" for MediaRecording, etc. No rush to reverify, but we should make sure it doesn't fall off the radar entirely.

backlog: --- → webRTC+

Rank: 45

Flags: needinfo?(rjesup)

Flags: needinfo?(jbecerra)

Priority: -- → P4

Randell Jesup [:jesup] (needinfo me)

Reporter

Comment 11

•

10 years ago

I think we can close this

Flags: needinfo?(rjesup) → needinfo?(mreavy)

Maire Reavy [:mreavy]

Updated

•

10 years ago

Status: NEW → RESOLVED

Closed: 10 years ago

Flags: needinfo?(mreavy)

Resolution: --- → WORKSFORME

Bugzilla

Major regression in audio drift (delay buildup) in getUserMedia starting 7/27 when freeing video frames

Categories

(Core :: WebRTC: Audio/Video, defect, P4)

Tracking

()

People

(Reporter: jesup, Unassigned, NeedInfo)

References

Details

(Keywords: regression, Whiteboard: [getUserMedia])

Crash Data

Security

(public)

User Story

Description

Updated

Comment 1

Comment 2

Comment 3

Comment 4

Comment 5

Comment 6

Comment 7

Comment 8

Comment 9

Comment 10

Comment 11

Updated