1397881 - WebRTC crash: send_stream_impl is null

Reporter

Description

•

7 years ago

User Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36 Steps to reproduce: Using Twilio's Video SDK and SFU, it's quite easy to crash a tab through normal usage in Firefox 55. For example, see here: https://crash-stats.mozilla.com/report/index/c498f3cb-494a-49f2-9602-c2ec10170831 It seems that an `RTC_CHECK` for ensuring that `send_stream_impl` is not `null` fails in `DestroyVideoSendStream`, although I don't know enough about what this means... Actual results: I hope to followup with a minimal repro example, but for now, all I can say is that using Twilio's QuickStart application (https://github.com/twilio/video-quickstart-js) with a Group Room (SFU) will cause this issue. Let two Participants join a Room in two Firefox instances, and the crash will occur, typically on disconnect.

Kohei Yoshino

Updated

•

7 years ago

Crash Signature: [@ mozalloc_abort | abort | rtc::FatalMessage::~FatalMessage ]

Component: Untriaged → WebRTC

Keywords: crash, crashreportid

Product: Firefox → Core

Randell Jesup [:jesup] (needinfo me)

Comment 1

•

7 years ago

All crashes in the last month are linux, mac or android - not windows. IN fact, almost all are android. Clearly a timing hole tearing down PeerConnectionMedia. All crashes are nullptrs. Note: be careful; some of these crashes (especially 53 and earlier) are other bugs which hit the generic fatal assert in webrtc code; we should have it be marked as a "skip over" when generating signatures for crash-stats. I only saw the ~PeerConnectioNMedia stacks in 55+, but I didn't do searches to confirm that (use "proto signature" to check for "does Not contain" ~PeerConnectionMedia, or something).

Rank: 15

status-firefox55: --- → wontfix

status-firefox56: --- → affected

status-firefox57: --- → affected

status-firefox-esr52: --- → ?

Keywords: csectype-nullptr

Priority: -- → P1

mroberts

Reporter

Comment 2

•

7 years ago

Attached file repro1.js — Details

Here is a reproduction example. It may not be minimal (there are 5 rounds of negotiation here); however, the call to RTCPeerConnection#close at the end of the test definitely causes a crash.

mroberts

Reporter

Comment 3

•

7 years ago

Attached file repro2.js — Details

I have attached a minimal reproduction example (2 rounds of negotiation) demonstrating changes to the BUNDLE that appear to cause the crash. Notice that, in the second round of negotiation, two new m= sections are added (video and audio with MIDs "video0" and "audio0", respectively), and the previous m= section with MID "sdparta_1" is rejected.

mroberts

Reporter

Comment 4

•

7 years ago

Attached file repro3.js — Details

repro2.js included audio m= sections. These are not necessary to demonstrate the bug. I've removed them in repro3.js.

[:philipp]

Comment 5

•

7 years ago

is this the linux/osx variant of the signature in bug 1345652 which is crashing with a fairly high volume on windows?

Bulk Bug Changes for mreavy's org

Comment 6

•

7 years ago

Mass change P1->P2 to align with new Mozilla triage process

Priority: P1 → P2

Alteisen

Comment 7

•

7 years ago

We suspect that we are affected by this issue. Our reproducer uses WebRTC sessions with multiple attendees (>5). When Firefox joins we sporadically see that it fails to start one (or more) of the incoming video streams. We did not find any solution to repair this (e.g. sending keyframes). When we end the PeerConnection we run into the exact same FatalMessage. To the best of our knowledge this is very timing dependent, i.e., looks like a race. We could reproduce with Firefox, Firefox Beta, Firefox Nightly. On Nightly we gave a debug build a go and ended up in an earlier assertion within WebRTC: ``` Process 78902 stopped * thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x0) frame #0: 0x0000000100d92049 libmozglue.dylib`mozalloc_abort(msg=<unavailable>) at mozalloc_abort.cpp:33 [opt] 30 #ifdef MOZ_WIDGET_ANDROID 31 abortThroughJava(msg); 32 #endif -> 33 MOZ_CRASH(); 34 } 35 36 #ifdef MOZ_WIDGET_ANDROID (lldb) bt * thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x0) * frame #0: 0x0000000100d92049 libmozglue.dylib`mozalloc_abort(msg=<unavailable>) at mozalloc_abort.cpp:33 [opt] frame #1: 0x0000000100d920b0 libmozglue.dylib`::abort() at mozalloc_abort.cpp:80 [opt] frame #2: 0x00000001089a7045 XUL`rtc::FatalMessage::~FatalMessage(this=<unavailable>) at checks.cc:109 [opt] frame #3: 0x00000001089a4509 XUL`rtc::FatalMessage::~FatalMessage(this=<unavailable>) at checks.cc:102 [opt] frame #4: 0x0000000108b0972c XUL`webrtc::internal::Call::CreateVideoReceiveStream(this=<unavailable>, configuration=<unavailable>) at call.cc:644 [opt] frame #5: 0x00000001090a4fe5 XUL`mozilla::WebrtcVideoConduit::CreateRecvStream(this=0x000000011f0bf800) at VideoConduit.cpp:500 [opt] frame #6: 0x00000001090a6e32 XUL`mozilla::WebrtcVideoConduit::SetRemoteSSRC(this=0x000000011f0bf800, ssrc=<unavailable>) at VideoConduit.cpp:882 [opt] frame #7: 0x00000001090ad7c7 XUL`mozilla::media::LambdaRunnable<mozilla::WebrtcVideoConduit::ReceivedRTPPacket(void const*, int, unsigned int)::$_3>::Run() [inlined] mozilla::WebrtcVideoConduit::ReceivedRTPPacket(void const*, int, unsigned int)::$_3::operator()() at VideoConduit.cpp:2100 [opt] frame #8: 0x00000001090ad794 XUL`mozilla::media::LambdaRunnable<mozilla::WebrtcVideoConduit::ReceivedRTPPacket(void const*, int, unsigned int)::$_3>::Run(this=0x000000011f95ad00) at MediaUtils.h:199 [opt] frame #9: 0x0000000104d33a62 XUL`nsThread::ProcessNextEvent(this=0x0000000101009690, aMayWait=<unavailable>, aResult=0x00007fff5f1d3177) at nsThread.cpp:1039 [opt] frame #10: 0x0000000104d31819 XUL`NS_ProcessPendingEvents(aThread=0x0000000101009690, aTimeout=10) at nsThreadUtils.cpp:463 [opt] frame #11: 0x000000010774ebc1 XUL`nsBaseAppShell::NativeEventCallback(this=0x00000001049ca470) at nsBaseAppShell.cpp:99 [opt] frame #12: 0x00000001077b4af7 XUL`nsAppShell::ProcessGeckoEvents(aInfo=0x00000001049ca470) at nsAppShell.mm:436 [opt] frame #13: 0x00007fffc4427321 CoreFoundation`__CFRUNLOOP_IS_CALLING_OUT_TO_A_SOURCE0_PERFORM_FUNCTION__ + 17 frame #14: 0x00007fffc440821d CoreFoundation`__CFRunLoopDoSources0 + 557 frame #15: 0x00007fffc4407716 CoreFoundation`__CFRunLoopRun + 934 frame #16: 0x00007fffc4407114 CoreFoundation`CFRunLoopRunSpecific + 420 frame #17: 0x00007fffc3967ebc HIToolbox`RunCurrentEventLoopInMode + 240 frame #18: 0x00007fffc3967cf1 HIToolbox`ReceiveNextEventCommon + 432 frame #19: 0x00007fffc3967b26 HIToolbox`_BlockUntilNextEventMatchingListInModeWithFilter + 71 frame #20: 0x00007fffc1f00a54 AppKit`_DPSNextEvent + 1120 frame #21: 0x00007fffc267c7ee AppKit`-[NSApplication(NSEvent) _nextEventMatchingEventMask:untilDate:inMode:dequeue:] + 2796 frame #22: 0x00000001077b3d46 XUL`::-[GeckoNSApplication nextEventMatchingMask:untilDate:inMode:dequeue:](self=0x00000001049caa10, _cmd=<unavailable>, mask=18446744073709551615, expiration=4001-01-01 00:00:00 UTC, mode="kCFRunLoopDefaultMode", flag=YES) at nsAppShell.mm:158 [opt] frame #23: 0x00007fffc1ef53db AppKit`-[NSApplication run] + 926 frame #24: 0x00000001077b529d XUL`nsAppShell::Run(this=0x00000001049ca470) at nsAppShell.mm:715 [opt] frame #25: 0x0000000108fdfdb3 XUL`XRE_RunAppShell() at nsEmbedFunctions.cpp:880 [opt] frame #26: 0x00000001052adc67 XUL`mozilla::ipc::MessagePumpForChildProcess::Run(this=0x000000010107db50, aDelegate=0x00007fff5f1d4ab8) at MessagePump.cpp:269 [opt] frame #27: 0x0000000105262e35 XUL`MessageLoop::Run() [inlined] MessageLoop::RunHandler(this=<unavailable>) at message_loop.cc:319 [opt] frame #28: 0x0000000105262e30 XUL`MessageLoop::Run(this=<unavailable>) at message_loop.cc:299 [opt] frame #29: 0x0000000108fdfa15 XUL`XRE_InitChildProcess(aArgc=16, aArgv=0x00007fff5f1d4d38, aChildData=<unavailable>) at nsEmbedFunctions.cpp:705 [opt] frame #30: 0x0000000100a2aee9 plugin-container`main [inlined] content_process_main(bootstrap=0x00000001010160d0, argc=<unavailable>, argv=0x00007fff5f1d4d38) at plugin-container.cpp:63 [opt] frame #31: 0x0000000100a2aebd plugin-container`main(argc=<unavailable>, argv=0x00007fff5f1d4d38) at MozillaRuntimeMain.cpp:25 [opt] frame #32: 0x0000000100a2ae84 plugin-container`start + 52 ``` It seems as if the wrong thread queries into WebRTC. If anyone has any update we would greatly appreciate it, thx!

Dan Minor [:dminor]

Assignee

Updated

•

7 years ago

Assignee: nobody → dminor

Dan Minor [:dminor]

Assignee

Comment 8

•

7 years ago

This crash is caused by assertions in the webrtc.org code that prevent a ssrc from being reused. The reproduction example does not specify ssrcs which means we are generating them. I think the problem occurs here [1]. We're reusing the track from one msection for another msection without changing the generated ssrc. I think in this case we should either be creating a new track or at least generating new ssrcs, but I'm not that familiar with the signaling side of things. :drno, does this make sense to you? Thanks! [1] http://searchfox.org/mozilla-central/rev/a4702203522745baff21e519035b6c946b7d710d/media/webrtc/signaling/src/jsep/JsepSessionImpl.cpp#1055

Status: NEW → ASSIGNED

Flags: needinfo?(drno)

Randell Jesup [:jesup] (needinfo me)

Comment 9

•

7 years ago

SSRCs are per-RTP-session, which is an m-line in "normal" RTP. In bundle, the m-lines can share an RTP session and thus can collide -- but SSRC collisions are something that always can happen. If we are moving a track to a different M-line (really, are we doing that??) regenerating the SSRC is probably a smart thing in any case.

Nils Ohlmeier [:drno]

Comment 10

•

7 years ago

@Alteisen: because of this frame #7: 0x00000001090ad7c7 XUL`mozilla::media::LambdaRunnable<mozilla::WebrtcVideoConduit::ReceivedRTPPacket(void const*, int, unsigned int)::$_3>::Run() [inlined] mozilla::WebrtcVideoConduit::ReceivedRTPPacket(void const*, int, unsigned int)::$_3::operator()() at VideoConduit.cpp:2100 [opt] in your stack trace your bug is probably a different problem, then the original bug report here. That frame indicates your problem hits when receiving media. Where the attached test cases here involve signaling only if I'm not mistaken.

Flags: needinfo?(drno)

Nils Ohlmeier [:drno]

Comment 11

•

7 years ago

@bwc: to me this look like the far side marked the first m-section as inactive in the renegoatiation. But we keep our local track around. Then when we get to find a new local track for the new second m-section we find the local track which used to be attached to the first m-section. And then webrtc.org "complains" with it's assertion that we can't setup another listener for that SSRC. Now I'm not 100% sure how we should fix this best. I currently see these options: - stop the receiver when signaling realizes an m-section has been set to inactive. Then when we re-assign the local track and start a receiver webrtc.org should no longer complain about double SSRCs. - try to detect in the function Dan pointed at in comment #8 that we are about to move a local track from one m-section to another and only then: - either stop the old receiver and then start a new one - or simply skip starting of a new receiver and keep the existing receiver running as is What do you think is the best option here?

Flags: needinfo?(docfaraday)

Byron Campen [:bwc]

Comment 12

•

7 years ago

(In reply to Nils Ohlmeier [:drno] from comment #11) > @bwc: to me this look like the far side marked the first m-section as > inactive in the renegoatiation. But we keep our local track around. Then > when we get to find a new local track for the new second m-section we find > the local track which used to be attached to the first m-section. And then > webrtc.org "complains" with it's assertion that we can't setup another > listener for that SSRC. > > Now I'm not 100% sure how we should fix this best. I currently see these > options: > - stop the receiver when signaling realizes an m-section has been set to > inactive. Then when we re-assign the local track and start a receiver > webrtc.org should no longer complain about double SSRCs. > - try to detect in the function Dan pointed at in comment #8 that we are > about to move a local track from one m-section to another and only then: > - either stop the old receiver and then start a new one > - or simply skip starting of a new receiver and keep the existing receiver > running as is > > What do you think is the best option here? I have not checked your analysis, but are you talking about marking inactive, or rejecting? If this is a scenario caused by rejection of an m-section, the transceivers work will not try to re-use the local track (or its SSRC), so a fix is already in the pipeline. If we're talking about just marking a=inactive, I have test-cases already in the transceivers work that do this, so again I think we'll be ok.

Flags: needinfo?(docfaraday)

Dan Minor [:dminor]

Assignee

Comment 13

•

7 years ago

It is the marking a=inactive case. It sounds like this is covered by transceivers, I'm going to stop looking at this until after that lands. I'll retest at that point. Thanks!

Depends on: 1290948

Alteisen

Comment 14

•

7 years ago

Thanks for the updates on this! Is the "the marking a=inactive case" already on master? If so I could give it a go. @drno The stack trace that I provided comes "before" the fatal message that is pinpointed by this bug (Given a debug build of Firefox). My assumption was that it shows an earlier symptom of the same defect. Possibly its another defect. I will retest once a solution was accepted. Lets cross the fingers that its just one defect :)

Nils Ohlmeier [:drno]

Updated

•

7 years ago

Comment 15

•

7 years ago

As the FatalMessage abort was a catch all, but this report is specifically about the assertion in DestroyVideoSendStream I added the more precise crash signatures. Note: the crash stats have also change in September to no longer track the generic FatalMessage.

Nils Ohlmeier [:drno]

Updated

•

7 years ago

Byron Campen [:bwc]

Comment 16

•

7 years ago

Try to reproduce this with bug 1290948.

Flags: needinfo?(docfaraday)

Dan Minor [:dminor]

Assignee

Comment 17

•

7 years ago

This is working for me with the changes in Bug 1290948.

Dan Minor [:dminor]

Assignee

Updated

•

7 years ago

Status: ASSIGNED → RESOLVED

Closed: 7 years ago

Flags: needinfo?(docfaraday)

Resolution: --- → WORKSFORME

repro1.js 7 years ago mroberts 11.81 KB, text/javascript		Details
repro2.js 7 years ago mroberts 2.15 KB, text/javascript		Details
repro3.js 7 years ago mroberts 1.89 KB, text/javascript		Details