WebRTC crash: send_stream_impl is null

RESOLVED WORKSFORME

Status

()

P2
normal
Rank:
15
RESOLVED WORKSFORME
a year ago
10 months ago

People

(Reporter: mroberts, Assigned: dminor)

Tracking

({crash, crashreportid, csectype-nullptr})

55 Branch
crash, crashreportid, csectype-nullptr
Points:
---

Firefox Tracking Flags

(firefox-esr52 ?, firefox55 wontfix, firefox56 affected, firefox57 affected)

Details

(crash signature)

Attachments

(3 attachments)

11.81 KB, text/javascript
Details
2.15 KB, text/javascript
Details
1.89 KB, text/javascript
Details
(Reporter)

Description

a year ago
User Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/60.0.3112.113 Safari/537.36

Steps to reproduce:

Using Twilio's Video SDK and SFU, it's quite easy to crash a tab through normal usage in Firefox 55. For example, see here: https://crash-stats.mozilla.com/report/index/c498f3cb-494a-49f2-9602-c2ec10170831

It seems that an `RTC_CHECK` for ensuring that `send_stream_impl` is not `null` fails in `DestroyVideoSendStream`, although I don't know enough about what this means...




Actual results:

I hope to followup with a minimal repro example, but for now, all I can say is that using Twilio's QuickStart application (https://github.com/twilio/video-quickstart-js) with a Group Room (SFU) will cause this issue. Let two Participants join a Room in two Firefox instances, and the crash will occur, typically on disconnect.
Crash Signature: [@ mozalloc_abort | abort | rtc::FatalMessage::~FatalMessage ]
Component: Untriaged → WebRTC
Keywords: crash, crashreportid
Product: Firefox → Core
All crashes in the last month are linux, mac or android - not windows.  IN fact, almost all are android.  Clearly a timing hole tearing down PeerConnectionMedia.

All crashes are nullptrs.  

Note: be careful; some of these crashes (especially 53 and earlier) are other bugs which hit the generic fatal assert in webrtc code; we should have it be marked as a "skip over" when generating signatures for crash-stats.  I only saw the ~PeerConnectioNMedia stacks in 55+, but I didn't do searches to confirm that (use "proto signature" to check for "does Not contain" ~PeerConnectionMedia, or something).
Rank: 15
status-firefox55: --- → wontfix
status-firefox56: --- → affected
status-firefox57: --- → affected
status-firefox-esr52: --- → ?
Keywords: csectype-nullptr
Priority: -- → P1
(Reporter)

Comment 2

a year ago
Created attachment 8906788 [details]
repro1.js

Here is a reproduction example. It may not be minimal (there are 5 rounds of negotiation here); however, the call to RTCPeerConnection#close at the end of the test definitely causes a crash.
(Reporter)

Comment 3

a year ago
Created attachment 8906820 [details]
repro2.js

I have attached a minimal reproduction example (2 rounds of negotiation) demonstrating changes to the BUNDLE that appear to cause the crash. Notice that, in the second round of negotiation, two new m= sections are added (video and audio with MIDs "video0" and "audio0", respectively), and the previous m= section with MID "sdparta_1" is rejected.
(Reporter)

Comment 4

a year ago
Created attachment 8906822 [details]
repro3.js

repro2.js included audio m= sections. These are not necessary to demonstrate the bug. I've removed them in repro3.js.

Comment 5

a year ago
is this the linux/osx variant of the signature in bug 1345652 which is crashing with a fairly high volume on windows?
Mass change P1->P2 to align with new Mozilla triage process
Priority: P1 → P2

Comment 7

a year ago
We suspect that we are affected by this issue. Our reproducer uses WebRTC sessions with multiple attendees (>5). When Firefox joins we sporadically see that it fails to start one (or more) of the incoming video streams. We did not find any solution to repair this (e.g. sending keyframes). When we end the PeerConnection we run into the exact same FatalMessage. To the best of our knowledge this is very timing dependent, i.e., looks like a race.

We could reproduce with Firefox, Firefox Beta, Firefox Nightly.

On Nightly we gave a debug build a go and ended up in an earlier assertion within WebRTC:
```
Process 78902 stopped
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x0)
    frame #0: 0x0000000100d92049 libmozglue.dylib`mozalloc_abort(msg=<unavailable>) at mozalloc_abort.cpp:33 [opt]
   30  	#ifdef MOZ_WIDGET_ANDROID
   31  	    abortThroughJava(msg);
   32  	#endif
-> 33  	    MOZ_CRASH();
   34  	}
   35  	
   36  	#ifdef MOZ_WIDGET_ANDROID
(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=1, address=0x0)
  * frame #0: 0x0000000100d92049 libmozglue.dylib`mozalloc_abort(msg=<unavailable>) at mozalloc_abort.cpp:33 [opt]
    frame #1: 0x0000000100d920b0 libmozglue.dylib`::abort() at mozalloc_abort.cpp:80 [opt]
    frame #2: 0x00000001089a7045 XUL`rtc::FatalMessage::~FatalMessage(this=<unavailable>) at checks.cc:109 [opt]
    frame #3: 0x00000001089a4509 XUL`rtc::FatalMessage::~FatalMessage(this=<unavailable>) at checks.cc:102 [opt]
    frame #4: 0x0000000108b0972c XUL`webrtc::internal::Call::CreateVideoReceiveStream(this=<unavailable>, configuration=<unavailable>) at call.cc:644 [opt]
    frame #5: 0x00000001090a4fe5 XUL`mozilla::WebrtcVideoConduit::CreateRecvStream(this=0x000000011f0bf800) at VideoConduit.cpp:500 [opt]
    frame #6: 0x00000001090a6e32 XUL`mozilla::WebrtcVideoConduit::SetRemoteSSRC(this=0x000000011f0bf800, ssrc=<unavailable>) at VideoConduit.cpp:882 [opt]
    frame #7: 0x00000001090ad7c7 XUL`mozilla::media::LambdaRunnable<mozilla::WebrtcVideoConduit::ReceivedRTPPacket(void const*, int, unsigned int)::$_3>::Run() [inlined] mozilla::WebrtcVideoConduit::ReceivedRTPPacket(void const*, int, unsigned int)::$_3::operator()() at VideoConduit.cpp:2100 [opt]
    frame #8: 0x00000001090ad794 XUL`mozilla::media::LambdaRunnable<mozilla::WebrtcVideoConduit::ReceivedRTPPacket(void const*, int, unsigned int)::$_3>::Run(this=0x000000011f95ad00) at MediaUtils.h:199 [opt]
    frame #9: 0x0000000104d33a62 XUL`nsThread::ProcessNextEvent(this=0x0000000101009690, aMayWait=<unavailable>, aResult=0x00007fff5f1d3177) at nsThread.cpp:1039 [opt]
    frame #10: 0x0000000104d31819 XUL`NS_ProcessPendingEvents(aThread=0x0000000101009690, aTimeout=10) at nsThreadUtils.cpp:463 [opt]
    frame #11: 0x000000010774ebc1 XUL`nsBaseAppShell::NativeEventCallback(this=0x00000001049ca470) at nsBaseAppShell.cpp:99 [opt]
    frame #12: 0x00000001077b4af7 XUL`nsAppShell::ProcessGeckoEvents(aInfo=0x00000001049ca470) at nsAppShell.mm:436 [opt]
    frame #13: 0x00007fffc4427321 CoreFoundation`__CFRUNLOOP_IS_CALLING_OUT_TO_A_SOURCE0_PERFORM_FUNCTION__ + 17
    frame #14: 0x00007fffc440821d CoreFoundation`__CFRunLoopDoSources0 + 557
    frame #15: 0x00007fffc4407716 CoreFoundation`__CFRunLoopRun + 934
    frame #16: 0x00007fffc4407114 CoreFoundation`CFRunLoopRunSpecific + 420
    frame #17: 0x00007fffc3967ebc HIToolbox`RunCurrentEventLoopInMode + 240
    frame #18: 0x00007fffc3967cf1 HIToolbox`ReceiveNextEventCommon + 432
    frame #19: 0x00007fffc3967b26 HIToolbox`_BlockUntilNextEventMatchingListInModeWithFilter + 71
    frame #20: 0x00007fffc1f00a54 AppKit`_DPSNextEvent + 1120
    frame #21: 0x00007fffc267c7ee AppKit`-[NSApplication(NSEvent) _nextEventMatchingEventMask:untilDate:inMode:dequeue:] + 2796
    frame #22: 0x00000001077b3d46 XUL`::-[GeckoNSApplication nextEventMatchingMask:untilDate:inMode:dequeue:](self=0x00000001049caa10, _cmd=<unavailable>, mask=18446744073709551615, expiration=4001-01-01 00:00:00 UTC, mode="kCFRunLoopDefaultMode", flag=YES) at nsAppShell.mm:158 [opt]
    frame #23: 0x00007fffc1ef53db AppKit`-[NSApplication run] + 926
    frame #24: 0x00000001077b529d XUL`nsAppShell::Run(this=0x00000001049ca470) at nsAppShell.mm:715 [opt]
    frame #25: 0x0000000108fdfdb3 XUL`XRE_RunAppShell() at nsEmbedFunctions.cpp:880 [opt]
    frame #26: 0x00000001052adc67 XUL`mozilla::ipc::MessagePumpForChildProcess::Run(this=0x000000010107db50, aDelegate=0x00007fff5f1d4ab8) at MessagePump.cpp:269 [opt]
    frame #27: 0x0000000105262e35 XUL`MessageLoop::Run() [inlined] MessageLoop::RunHandler(this=<unavailable>) at message_loop.cc:319 [opt]
    frame #28: 0x0000000105262e30 XUL`MessageLoop::Run(this=<unavailable>) at message_loop.cc:299 [opt]
    frame #29: 0x0000000108fdfa15 XUL`XRE_InitChildProcess(aArgc=16, aArgv=0x00007fff5f1d4d38, aChildData=<unavailable>) at nsEmbedFunctions.cpp:705 [opt]
    frame #30: 0x0000000100a2aee9 plugin-container`main [inlined] content_process_main(bootstrap=0x00000001010160d0, argc=<unavailable>, argv=0x00007fff5f1d4d38) at plugin-container.cpp:63 [opt]
    frame #31: 0x0000000100a2aebd plugin-container`main(argc=<unavailable>, argv=0x00007fff5f1d4d38) at MozillaRuntimeMain.cpp:25 [opt]
    frame #32: 0x0000000100a2ae84 plugin-container`start + 52
```
It seems as if the wrong thread queries into WebRTC.

If anyone has any update we would greatly appreciate it, thx!
(Assignee)

Updated

a year ago
Assignee: nobody → dminor
(Assignee)

Comment 8

a year ago
This crash is caused by assertions in the webrtc.org code that prevent a ssrc from being reused. The reproduction example does not specify ssrcs which means we are generating them. I think the problem occurs here [1]. We're reusing the track from one msection for another msection without changing the generated ssrc. I think in this case we should either be creating a new track or at least generating new ssrcs, but I'm not that familiar with the signaling side of things. :drno, does this make sense to you? Thanks!

[1] http://searchfox.org/mozilla-central/rev/a4702203522745baff21e519035b6c946b7d710d/media/webrtc/signaling/src/jsep/JsepSessionImpl.cpp#1055
Status: NEW → ASSIGNED
Flags: needinfo?(drno)
SSRCs are per-RTP-session, which is an m-line in "normal" RTP.  In bundle, the m-lines can share an RTP session and thus can collide -- but SSRC collisions are something that always can happen.  If we are moving a track to a different M-line (really, are we doing that??) regenerating the SSRC is probably a smart thing in any case.
@Alteisen: because of this 
frame #7: 0x00000001090ad7c7 XUL`mozilla::media::LambdaRunnable<mozilla::WebrtcVideoConduit::ReceivedRTPPacket(void const*, int, unsigned int)::$_3>::Run() [inlined] mozilla::WebrtcVideoConduit::ReceivedRTPPacket(void const*, int, unsigned int)::$_3::operator()() at VideoConduit.cpp:2100 [opt]
in your stack trace your bug is probably a different problem, then the original bug report here. That frame indicates your problem hits when receiving media. Where the attached test cases here involve signaling only if I'm not mistaken.
Flags: needinfo?(drno)
@bwc: to me this look like the far side marked the first m-section as inactive in the renegoatiation. But we keep our local track around. Then when we get to find a new local track for the new second m-section we find the local track which used to be attached to the first m-section. And then webrtc.org "complains" with it's assertion that we can't setup another listener for that SSRC.

Now I'm not 100% sure how we should fix this best. I currently see these options:
- stop the receiver when signaling realizes an m-section has been set to inactive. Then when we re-assign the local track and start a receiver webrtc.org should no longer complain about double SSRCs.
- try to detect in the function Dan pointed at in comment #8 that we are about to move a local track from one m-section to another and only then:
  - either stop the old receiver and then start a new one
  - or simply skip starting of a new receiver and keep the existing receiver running as is

What do you think is the best option here?
Flags: needinfo?(docfaraday)
(In reply to Nils Ohlmeier [:drno] from comment #11)
> @bwc: to me this look like the far side marked the first m-section as
> inactive in the renegoatiation. But we keep our local track around. Then
> when we get to find a new local track for the new second m-section we find
> the local track which used to be attached to the first m-section. And then
> webrtc.org "complains" with it's assertion that we can't setup another
> listener for that SSRC.
> 
> Now I'm not 100% sure how we should fix this best. I currently see these
> options:
> - stop the receiver when signaling realizes an m-section has been set to
> inactive. Then when we re-assign the local track and start a receiver
> webrtc.org should no longer complain about double SSRCs.
> - try to detect in the function Dan pointed at in comment #8 that we are
> about to move a local track from one m-section to another and only then:
>   - either stop the old receiver and then start a new one
>   - or simply skip starting of a new receiver and keep the existing receiver
> running as is
> 
> What do you think is the best option here?

I have not checked your analysis, but are you talking about marking inactive, or rejecting? If this is a scenario caused by rejection of an m-section, the transceivers work will not try to re-use the local track (or its SSRC), so a fix is already in the pipeline. If we're talking about just marking a=inactive, I have test-cases already in the transceivers work that do this, so again I think we'll be ok.
Flags: needinfo?(docfaraday)
(Assignee)

Comment 13

a year ago
It is the marking a=inactive case. It sounds like this is covered by transceivers, I'm going to stop looking at this until after that lands. I'll retest at that point. Thanks!
Depends on: 1290948

Comment 14

a year ago
Thanks for the updates on this! Is the "the marking a=inactive case" already on master? If so I could give it a go.

@drno The stack trace that I provided comes "before" the fatal message that is pinpointed by this bug (Given a debug build of Firefox). My assumption was that it shows an earlier symptom of the same defect. Possibly its another defect. I will retest once a solution was accepted. Lets cross the fingers that its just one defect :)

Updated

11 months ago
See Also: → bug 1409167

Comment 15

11 months ago
As the FatalMessage abort was a catch all, but this report is specifically about the assertion in DestroyVideoSendStream I added the more precise crash signatures.

Note: the crash stats have also change in September to no longer track the generic FatalMessage.

Updated

11 months ago
Crash Signature: [@ mozalloc_abort | abort | rtc::FatalMessage::~FatalMessage ] → [@ mozalloc_abort | abort | rtc::FatalMessage::~FatalMessage ] [@ mozalloc_abort | webrtc::internal::Call::DestroyVideoSendStream ] [@ mozalloc_abort | abort | webrtc::internal::Call::DestroyVideoSendStream ]

Comment 16

10 months ago
Try to reproduce this with bug 1290948.
Flags: needinfo?(docfaraday)
(Assignee)

Comment 17

10 months ago
This is working for me with the changes in Bug 1290948.
(Assignee)

Updated

10 months ago
Status: ASSIGNED → RESOLVED
Last Resolved: 10 months ago
Flags: needinfo?(docfaraday)
Resolution: --- → WORKSFORME
You need to log in before you can comment on or make changes to this bug.