Closed Bug 1406135 Opened 7 years ago Closed 7 years ago

Can't subscribe to large amount of webRTC streams (regression)

Categories

(Core :: WebRTC, defect, P2)

56 Branch
Unspecified
Windows 10
defect

Tracking

()

RESOLVED FIXED
mozilla59
Tracking Status
thunderbird_esr52 --- unaffected
firefox-esr52 --- unaffected
firefox56 --- wontfix
firefox57 --- wontfix
firefox58 + wontfix
firefox59 --- fixed

People

(Reporter: mun.dave, Assigned: padenot)

References

Details

(Keywords: regression)

Attachments

(2 files)

Attached image FFSUB.png
User Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.100 Safari/537.36 Steps to reproduce: I tried to use the OpenTok JS SDK to receive 12 webRTC video+audio streams on Win 10 FF 56. Actual results: Typically, if I try to subscribe to 12 streams, usually ~8 render video onto the DOM, and the other do not. I have attached a screenshot repro-ing on a computer with a wired connection that I do not expect to have a CPU or network bottleneck. Here is a YT link (3rd party repro-ing on a different machine) recording the behavior discrepancy between Win FF and Win Chrome: https://youtu.be/-voCghEJfdQ?t=92 Expected results: I would expect all 12 to receive video.
OS: Unspecified → Windows 10
I know you would like me to reproduce with a SDK agnostic + minimal footprint, but I am having trouble finding another free service that allows more than 8 concurrent streams.
Here's a plain JS repro of what is probably the same issue: https://output.jsbin.com/dezobe Like the one with OpenTok, this only reproduces on Firefox Windows (tested Chrome & FF on macOS & Windows).
Improved version of jsbin :-) https://output.jsbin.com/sizekiy
Ok final update for now: https://output.jsbin.com/nubuhap This adds the option to add a delay between each stream / peer connection. Based on our customer bug report I thought that FF might work if the streams were fed slowly but I still got the issue with a 3000ms delay between each one. It seems .play() never completes for the 6th stream. This is both FF nightly and stable, on Windows.
Component: Untriaged → WebRTC
Product: Firefox → Core
Nils, this looks like Bug 1402510?
Flags: needinfo?(drno)
See Also: → 1402510
(In reply to Michael Froman [:mjf] from comment #5) > Nils, this looks like Bug 1402510? Indeed this appears to be the same issue. Interestingly the test case works fine for me with 56 and 58 on Mac. But it reproduces immediately on Win 10 with Fx 56 and 58. I'll try to investigate...
Flags: needinfo?(drno)
Assignee: nobody → drno
Status: UNCONFIRMED → NEW
Rank: 25
Ever confirmed: true
Priority: -- → P2
Doesn't repro on Linux with Fx 56 either.
So here is an interesting observation: when I take out the 'audio: true' from the gUM request in the test page and use mode 'parallel' it works for me on Windows 10. For some unknown reason if I take out 'audio: true' the serial mode stops working. Andrew can you please confirm that making the gUM request video if it works for you then as well?
Flags: needinfo?(voltrevo)
When looking at about:webrtc it's pretty clear that the connection gets established, but the sender does not actually start sending any RTP packets. Since there is only a single gUM request in the test code, I would guess this is some problem with the code replicating an existing track to other streams (MSG)?! Paul: does the description of being able to request a maximum of 6 audio streams ring a bell? Any thoughts on this?
Flags: needinfo?(padenot)
(In reply to Nils Ohlmeier [:drno] from comment #11) > When looking at about:webrtc it's pretty clear that the connection gets > established, but the sender does not actually start sending any RTP packets. > Since there is only a single gUM request in the test code, I would guess > this is some problem with the code replicating an existing track to other > streams (MSG)?! > > Paul: does the description of being able to request a maximum of 6 audio > streams ring a bell? Any thoughts on this? I would believe it can happen with some hardware/driver combination, but I don't have equipment to reproduce this. That said, if we fail opening a system-level audio stream, we should fall-back to a normal thread and re-try periodically to open a new audio stream, so we should be able to send packets (since we fake a clock and keep things rolling).
Flags: needinfo?(padenot)
(In reply to Nils Ohlmeier [:drno] from comment #10) > So here is an interesting observation: when I take out the 'audio: true' > from the gUM request in the test page and use mode 'parallel' it works for > me on Windows 10. > > For some unknown reason if I take out 'audio: true' the serial mode stops > working. > > Andrew can you please confirm that making the gUM request video if it works > for you then as well? Hmm for me it works without audio for both parallel and serial.
Flags: needinfo?(voltrevo)
@jib: what kind of logs do I need to gather to allow someone to analyze this?
Flags: needinfo?(jib)
Dunno, depends on whether we suspect the problem is network or local. If it's audio, maybe MOZ_LOG=timestamp:1,MediaManager:5,MediaStreamGraph:4,TrackUnionStream:5,webrtc_trace:65535 ? Paul, any others? I'd try to reduce the JS test case to see if it repros without the peer connection piece. Do we suspect a regression? If so, a regression range might help.
Flags: needinfo?(jib) → needinfo?(padenot)
I'm pretty sure it's not a network problem. ICE established the PeerConnection just fine. But we appear to never start sending any media. I'll try yo get the logs tomorrow.
(In reply to Jan-Ivar Bruaroey [:jib] (needinfo? me) from comment #15) > MOZ_LOG=timestamp:1,MediaManager:5,MediaStreamGraph:4,TrackUnionStream:5, > webrtc_trace:65535 ? > > Paul, any others? MOZ_LOG=timestamp:1,MediaManager:5,MediaStreamGraph:4,TrackUnionStream:5,webrtc_trace:65535,cubeb:5
Flags: needinfo?(padenot)
While testing with this jsbin (50 streams in parallel) I got a shutdownhang in latest Nightly, [1] (actually I got two, but they seem to be dupes, see [2]). In the jsbin I consistently see video frames for 26 streams, while the connection is established for all streams (so 50). Interestingly there are 26 threads in the crash report matching "webrtc::AudioDeviceLinuxPulse::RecThreadProcess". The weird part is that full duplex is on so audio capture should be handled by cubeb. Paul, do you know more here on what is expected? [1] https://crash-stats.mozilla.com/report/index/e57d6d23-8b59-4e6d-9c6d-6eebd0171024#allthreads [2] https://crash-stats.mozilla.com/report/index/e728a85a-c57a-4fb7-8117-e15970171024#allthreads
Flags: needinfo?(padenot)
This is either 1394163 or 1397793, depending on how you look at it (the problem or the solution). Since WebRTC.org 57 landed, we always initialize and create threads and devices using webrtc.org, but we never use them. This is already covered by my patches that refactor all this. I also have a plan to only land this on its own, but I've been told it's now unnecessary to land it early, because we thought this was only useful to unblock sandboxing, and they can wait for 59. In light of the above, maybe we can revisit this plan.
Flags: needinfo?(padenot)
A bit of a longer explanation. Since the last webrtc uplift, we initialize "platform objects for audio" [0] each time we initialize VoEBaseImpl [1]. We initialize VoEBaseImpl in the following situations: - Anytime we EnumerateAudioDevice [2] - Anytime we Initialize an AudioDevice [3] - Anytime we create an AudioConduit: [4] Sometimes we clean those up, sometimes we don't, so sometimes threads and other resources are sticking around, sometimes not. The patch set in bug 1397793 fixes all of this. [0]: http://searchfox.org/mozilla-central/source/media/webrtc/trunk/webrtc/modules/audio_device/audio_device_impl.cc#189 [1]: http://searchfox.org/mozilla-central/source/media/webrtc/trunk/webrtc/voice_engine/voe_base_impl.cc#197 [2]: http://searchfox.org/mozilla-central/source/dom/media/webrtc/MediaEngineWebRTC.cpp#314 [3]: http://searchfox.org/mozilla-central/source/dom/media/webrtc/MediaEngineWebRTCAudio.cpp#787 [4]: http://searchfox.org/mozilla-central/source/media/webrtc/signaling/src/media-conduit/AudioConduit.cpp#320
Thanks Paul. I just confirmed that your latest WIP set of patches in bug 1397793, [1] worked for me with the same steps as in comment 18. This was an unoptimized local build even, so extra good. [1] https://treeherder.mozilla.org/#/jobs?repo=try&revision=7fafc2ddbf2d7cf3a03c730df04f190a02d1923a
Depends on: 1397793
Let's start review on bug 1397793 and see how far we get and how comfortable we are with landing in 58. It's not like this is a common use-case, but spreading patches out a bit might be good, and unblocking sandboxing earlier would be bonus.
[Tracking Requested - why for this release]: Based on comment 20 this regressed in bug 1341285.
Blocks: 1341285
Keywords: regression
Summary: Can't subscribe to large amount of webRTC streams → Can't subscribe to large amount of webRTC streams (regression)
We usually don't track bugs that are fix optional. If we want to consider tracking this, we should change 58 status to affected.
Hold on guys the try build from comment #22 from this URL https://treeherder.mozilla.org/#/jobs?repo=try&revision=7fafc2ddbf2d7cf3a03c730df04f190a02d1923a does not solve the problem, instead it crashes Firefox w/o a crash report.
drno pointed out this basically blocks anyone who goes into a bigger mesh call.
To clarify my comment #25: on my Win 10 laptop when I try https://output.jsbin.com/nubuhap RTP never gets send for the 6 or 7th PeerConnection. But when I tried Andreas try build (see above) instead of not sending any media Firefox crashes. And it crashes so hard that it doesn't even record any crash in about:crashes. I ran out of time to gather logs or run it with a debugger.
Nils, I mentioned it on irc but not on the bug, those patches are WIP, and only do something meaningful on Linux for now.
Attached file log.txt.child-2.gz
So this is the log from stock Nightly (so no patches applied) with the log parameters from comment #17.
Tracking 58+ based on Comment 26.
Assignee: drno → nobody
Andreas, Paul: could either of you take a look at the log I uploaded in comment #29 to asses what is going on there and if bug 1397793 is really going to fix that (note: Andreas try build with the patches from that bug crashed on my Win10 laptop so hard it wouldn't even submit a crash report).
Flags: needinfo?(padenot)
Flags: needinfo?(apehrson)
I see 12 "webrtc::AudioDeviceModuleImpl::CreatePlatformSpecificObjects" where each will create threads and set up audio apis and more. I don't know in more detail what goes wrong after that, but I confirmed that on linux Paul's patches help. To figure it out completely I would start to look at where in the chain between MediaStreamGraph (since AIUI all peer connections use the same video track) and the sending side of an RTCPeerConnection frames get blocked. If all connections are using the same track I'd expect MediaPipeline and VideoConduit to work, so that suggests something happens under the hood in webrtc.org. We know that audio triggers it, but is it a general issue of creating too many threads (the platform specifics mentioned above) or something more explicit in the webrtc stack? It could even be the former and Paul's patches just pushing the boundary a bit.
Flags: needinfo?(apehrson)
What I find interesting in the log is after the 12th "webrtc::AudioDeviceModuleImpl::AudioDeviceModuleImpl". Could 12 instances be 6 sender PCs and 6 receiver PCs, where the ones created after this didn't work? > 2017-10-25 22:38:09.296000 UTC - [Main Thread]: D/webrtc_trace (audio_device_wave_win.cc:234): failed to start the timer event That's http://searchfox.org/mozilla-central/rev/40e8eb46609dcb8780764774ec550afff1eed3a5/media/webrtc/trunk/webrtc/modules/audio_device/win/audio_device_wave_win.cc#234 (But for some reason it's forwarded to our log as debug level instead of error -- had it been error it would have been easier to find) > 2017-10-25 22:38:10.297000 UTC - [Main Thread]: D/webrtc_trace (audio_device_impl.cc:534): Audio device initialization failed. > 2017-10-25 22:38:10.298000 UTC - [Main Thread]: D/webrtc_trace (audio_device_impl.cc:1441): webrtc::AudioDeviceModuleImpl::StopPlayout > 2017-10-25 22:38:10.298000 UTC - [Main Thread]: D/webrtc_trace (audio_device_impl.cc:1483): webrtc::AudioDeviceModuleImpl::StopRecording > 2017-10-25 22:38:10.298000 UTC - [Main Thread]: D/webrtc_trace (audio_device_impl.cc:1509): webrtc::AudioDeviceModuleImpl::RegisterEventObserver > 2017-10-25 22:38:10.298000 UTC - [Main Thread]: D/webrtc_trace (audio_device_impl.cc:1522): webrtc::AudioDeviceModuleImpl::RegisterAudioCallback > 2017-10-25 22:38:10.298000 UTC - [Main Thread]: D/webrtc_trace (audio_device_buffer.cc:88): webrtc::AudioDeviceBuffer::RegisterAudioCallback > 2017-10-25 22:38:10.298000 UTC - [Main Thread]: D/webrtc_trace (audio_device_impl.cc:547): webrtc::AudioDeviceModuleImpl::Terminate > 2017-10-25 22:38:10.298000 UTC - [Main Thread]: D/webrtc_trace (audio_device_impl.cc:408): webrtc::AudioDeviceModuleImpl::~AudioDeviceModuleImpl > 2017-10-25 22:38:10.298000 UTC - [Main Thread]: D/webrtc_trace (audio_device_buffer.cc:82): AudioDeviceBuffer::~dtor This init failure seems like a 1 second timeout after the previous timer error. Logged at http://searchfox.org/mozilla-central/rev/40e8eb46609dcb8780764774ec550afff1eed3a5/media/webrtc/trunk/webrtc/modules/audio_device/audio_device_impl.cc#534
The timer start failure is from [1]: > timerID_ = timeSetEvent(time, 0, (LPTIMECALLBACK)HANDLE(event_), 0, > TIME_PERIODIC | TIME_CALLBACK_EVENT_PULSE); > ... > return timerID_ != NULL; So it's an error from timeSetEvent, [2]: > Return value > Returns an identifier for the timer event if successful or an error otherwise. This function returns NULL if it fails and the timer event was not created. (This identifier is also passed to the callback function.) But no further info on why it would fail. At least this suggests that getting rid of all these extra audio stacks would help, assuming that this timer failure was the reason the frames stop showing up. [1] http://searchfox.org/mozilla-central/rev/40e8eb46609dcb8780764774ec550afff1eed3a5/media/webrtc/trunk/webrtc/system_wrappers/source/event_timer_win.cc#66 [2] https://msdn.microsoft.com/en-us/library/dd757634(v=vs.85).aspx
(In reply to Nils Ohlmeier [:drno] from comment #31) > Andreas, Paul: could either of you take a look at the log I uploaded in > comment #29 to asses what is going on there and if bug 1397793 is really > going to fix that (note: Andreas try build with the patches from that bug > crashed on my Win10 laptop so hard it wouldn't even submit a crash report). Looking at your log (tell me if I got any of the following wrong): - This is Windows - The input device has a rate of 16kHz, the output device 48kHz - The period is 10ms (one callback each 10ms, providing 10ms of microphone audio, and requesting 10ms of output audio). - We have a single cubeb stream (duplex) as expected, it's doing fine (we see the callback messages at a correct rate, around one each 10ms). It's doing fine because it's been created first (we create an MSG, it creates and output audio stream, then switches to a duplex stream. We see perfectly normal callbacks throughout, with sensible driver switches. The only error here is that we can't register the endpoint notification client, which is caused by the sandbox on windows, and there is a workaround in place and shipped to release to workaround this. - Around line 1075, we start seeing some webrtc things. Immediately, we see activity in `audio_device_impl.cc`. This is what I'm explaining in comment 20. Look at the amount of `AudioDeviceModuleImpl::Init` that are happening. Those are very expensive objects. - Since the computer resources are not infinite, we see things starting to fail (line 6798, benign failure, but it shows the beginning of the resource exhaustion) - Line 10074 (note that a few seconds have passed since the beginning of the call, around 9 according to the log), we see the culprit: "Audio device initialization failed." from webrtc.org. This is [0], called by [1], called by [2] (it's in the conduit, because we've already initialized successfully at all the other call sites, that happen sequentially before). So the AudioConduit creation failed. This returns error after error, basically unwinding everything until we reach the DOM bindings. This is what you're looking for. - Around line 26431, we see more logs from `audio_device_impl.cc`, because we're probably shutting things down or the GC kicks in or the like. Remember we already had a bunch of those running, so we need to shut them down. - Right after this, we don't have an input stream anymore, cubeb switches to an output only stream, and continues running the graph, until the end. So, long story short, this will be fixed by this patch set, since one of the things that it does is stopping this `audio_device_impl.cc` madness. Again, this is a work in progress, and it's not ready. [0]: http://searchfox.org/mozilla-central/source/media/webrtc/trunk/webrtc/modules/audio_device/audio_device_impl.cc#534 [1]: http://searchfox.org/mozilla-central/source/media/webrtc/trunk/webrtc/voice_engine/voe_base_impl.cc#239 [2]: http://searchfox.org/mozilla-central/source/media/webrtc/signaling/src/media-conduit/AudioConduit.cpp#320
Flags: needinfo?(padenot) → needinfo?(drno)
Marking wontfix after discussion with QA.
Meaning we'll continue reviews and get this ready to land as soon as 59 opens.
Discussed on IRC last Friday.
Flags: needinfo?(drno)
See Also: → 1420815
This is now fixed.
Status: NEW → RESOLVED
Closed: 7 years ago
Resolution: --- → FIXED
Assignee: nobody → padenot
Target Milestone: --- → mozilla59
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: