Closed Bug 1409167 Opened 3 years ago Closed 2 years ago

Fatal full-browser crash playing with two-way WebRTC demo in two tabs, or simply ending a call

Categories

(Core :: WebRTC, defect, P2)

Unspecified
macOS
defect

Tracking

RESOLVED DUPLICATE of bug 1479853
Tracking Status
firefox-esr52 --- unaffected
firefox-esr60 --- fixed
firefox56 --- wontfix
firefox57 --- wontfix
firefox58 --- wontfix
firefox59 --- wontfix
firefox60 --- wontfix
firefox61 --- wontfix
firefox62 --- wontfix

People

(Reporter: jib, Unassigned)

Details

(Keywords: regression)

I'm reporting this while it's still fresh in my mind, even though I haven't been able to repro yet.

I had lots of tabs open in Nightly, but chiefly among them were two tabs with https://jsfiddle.net/jib1/5Lwjazuf/ running an active WebRTC connection between them.

To start it, I'd opened the fiddle in two tabs first, clicked the [Call] button in one, then visited both tabs to allow cam+mic sharing in both (it's a two-way call demo).

I was then working in a third, unrelated jsfiddle, and I believe I was moving between the different tabs. I don't recall exactly, but I MAY have tried to hit refresh or the "Run" button in one of the fiddles, or something similar, to end one side of the call, which I THINK is when the Firefox browser suddenly disappeared on me. It may also have been unprovoked; I'm unsure.

I got a crash reporter where I typed "jib fiddle" into the comments field and submitted, and then Firefox was gone and was restarting.

After restarting Firefox, I went to about:crashes, which showed TWO crashes at the exact same time:

  bp-145020e7-cd1b-4c01-af34-a6cd00171016    10/16/17	3:59 PM
  bp-8fa20bd0-761e-4b5a-af38-0f9a90171016    10/16/17	3:59 PM

Top is content process: [@ mozalloc_abort | abort | webrtc::internal::Call::DestroyVideoReceiveStream ]
2nd is master process:  [@ ReceivePort::WaitForMessage | mozilla::ipc::SharedMemoryBasic::ShareToProcess ]

The 2nd has the comment "jib fiddle".
Actually, as I retrace what I was doing, I may have been just trying to cut'n'paste from one fiddle into another or from a github issue page to a fiddle, when it crashed.
This happened while destructing the VideoConduit so reloading or re-running a fiddle sounds more plausible (and the parent process seems to have crashed because the IPC port went down?).

I can't see any obvious holes in VideoConduit. We only delete a receive stream after having created one (which sets `mRecvStream`) with the same `webrtc::Call` instance, and we don't double delete (since we reset `mRecvStream` on deleting). I also don't see anything emptying out `webrtc::Call::receive_stream_ssrcs_` under our feet.

I'll note though, there are similar crashes for DestroyVideoSendStream too, e.g., https://crash-stats.mozilla.com/report/index/ace3b8c6-f8e9-4a67-9e9b-50fc70171004
Looks like for the parent process crash we already have a long standing bug https://bugzilla.mozilla.org/show_bug.cgi?id=1264209
See Also: → 1264209
adding some more signatures... including the equivalent Send side crash
Crash Signature: [@ mozalloc_abort | abort | webrtc::internal::Call::DestroyVideoReceiveStream ] → [@ mozalloc_abort | abort | webrtc::internal::Call::DestroyVideoReceiveStream ] [@ abort | webrtc::internal::Call::DestroyVideoReceiveStream ] [@ mozalloc_abort | abort | webrtc::internal::Call::DestroyVideoSendStream ]
Note that until recently, these would all get lumped under RTCFatalMessage
As P1s need owners, can I assign this to you for now, Andreas?
Feel free to re-assign to someone more suitable or with more capacity.
Assignee: nobody → apehrson
Based on the numbers from crash stats, this goes back to 55.
Note: all of these are assertions in the webrtc.org code, and there aren't many of those.  So it's safe sec-wise, and in the field, even with all the signatures, it's low rate - though note that before Sept, you probably need to search for RTCFatalMessage with a proto signature matching this, so the graph above doesn't show that.

Probably doesn't *need* to be fixed in 57 unless it's blocking a usecase that we care about, or if my assertion about frequency is wrong.
I'm starting to wonder if this is caused by the SSRC switching code added in bug 1337777, although that code landed in 54. It also looks like the frequency picked up in 56, so it might have gotten worse from timing changes in the webrtc.org 57 merge in Fx 56.
I think jib only wanted to ensure that this is not a recent regression. Since it clearly is an older problem and its frequency is not very high, I think this is not a P1.

This might also be related to bug 1397881, which hits the same impl = null check on the sender side.
Priority: P1 → P2
See Also: → 1397881
Assignee: apehrson → nobody
Duplicate of this bug: 1431549
See Also: → 1394602
See Also: → 1431604
Rank 9 is for P1s. I'll move this P2 down to 13 for now. Feel free to adjust.
Rank: 9 → 13
Duplicate of this bug: 1473983
A coworker just saw the DestroyVideoReceiveStream signature when ending a call on 60 ESR, and did not have two WebRTC demo tabs open.

That might broaden the impact of this bug, and perhaps warrants an ESR60 fix.
Summary: Fatal full-browser crash playing with two-way WebRTC demo in two tabs → Fatal full-browser crash playing with two-way WebRTC demo in two tabs, or simply ending a call
To clarify, in case it helps: the call was torn down by the remote end.
See Also: → 1490462
This crash happens frequently with me and my team. Linked below are three crash reports generated from a single crash.
https://crash-stats.mozilla.com/report/index/ff29aed5-6d77-4ffa-96db-5b6fc0180925
https://crash-stats.mozilla.com/report/index/6e23cd71-15a6-49e0-8e58-9948a0180925
https://crash-stats.mozilla.com/report/index/1cbdd8a5-5d99-4cd6-a905-db6a30180925

Steps to reproduce
Join a meet room in Firefox at https://meet.jit.si/atheer and open a 'New Incognito Window' in Firefox with the same room name.
Hang up and join multiple times. At some point Firefox crashes.

Environment details
Firefox 62.0.2 running on macOS High Sierra (10.13.2)
Just in case my links expire, here is the crash log of https://crash-stats.mozilla.com/report/index/6e23cd71-15a6-49e0-8e58-9948a0180925

0	libmozglue.dylib	mozalloc_abort	memory/mozalloc/mozalloc_abort.cpp:34
1	XUL	rtc::FatalMessage::~FatalMessage()	media/webrtc/trunk/webrtc/base/checks.cc:109
2	libmozglue.dylib	BaseAllocator::free(void*)	memory/build/mozjemalloc.cpp:3525
3	libmozglue.dylib	mozilla::detail::MutexImpl::unlock()	mozglue/misc/Mutex_posix.cpp:181
(In reply to Karim Fikani from comment #16)
> This crash happens frequently with me and my team. Link are three bug
> reports generated from a single crash.
> https://crash-stats.mozilla.com/report/index/ff29aed5-6d77-4ffa-96db-5b6fc0180925
> https://crash-stats.mozilla.com/report/index/6e23cd71-15a6-49e0-8e58-9948a0180925
> https://crash-stats.mozilla.com/report/index/1cbdd8a5-5d99-4cd6-a905-db6a30180925
> 
> Steps to reproduce
> Join a meet room on firefox https://meet.jit.si/atheer and open 'New
> Incognito Window' on firefox with the same room name.
> Hang up and join multiple times. At some point firefox crashes.
> 
> Environment details
> Firefox 62.0.2 running on MacOS High Sierra (10.13.2)

I managed to repro this with the same setup on the first attempt.

It seems to crash shortly after frames are flowing over the network.

I'm not sure this is the same as the original issue reported however. I'll dig a bit more and perhaps file a new bug if it indeed seems different.
It's the same issue indeed. I'm getting the same symptom as bug 1431549 when reproducing.

Having somewhat reliable steps to reproduce is definitely a step forward. I'll see if I can get this in a debugger, and hopefully figure out where we are racing.
What made this harder to debug is that the check against overwriting the receive stream for an existing SSRC is debug-only [1], whereas the check that fires when the first (now overwritten) stream is deleted applies to release builds as well.

Flipping [1] to release makes it fail first.


[1] https://searchfox.org/mozilla-central/rev/ffe6eaf2f032e58ec3b0650a87df2c62ae4ca441/media/webrtc/trunk/webrtc/call/call.cc#653-654
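
The debug-only versus release-mode asymmetry described above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not the webrtc.org code: `ToyCall`, `FatalCheck`, `check_on_create` and `first_failure` are hypothetical names, and the map keyed by SSRC is a simplification. It shows that with the create-side check disabled, a second stream for the same SSRC silently replaces the first entry, so the failure only surfaces later when the first stream is destroyed.

```python
class FatalCheck(Exception):
    """Stands in for a fatal check that would abort the process."""


class ToyCall:
    """Hypothetical stand-in for a call object that tracks receive streams by SSRC."""

    def __init__(self, check_on_create):
        # check_on_create=False mimics a build where the create-side check is compiled out.
        self.check_on_create = check_on_create
        self.stream_by_ssrc = {}

    def create_receive_stream(self, ssrc):
        if self.check_on_create and ssrc in self.stream_by_ssrc:
            raise FatalCheck(f"create: SSRC {ssrc} already has a receive stream")
        stream = object()  # unique token representing the new stream
        self.stream_by_ssrc[ssrc] = stream  # silently overwrites any earlier entry
        return stream

    def destroy_receive_stream(self, stream):
        owned = [s for s, v in self.stream_by_ssrc.items() if v is stream]
        if not owned:
            # Always-on check: the stream is no longer registered with this call.
            raise FatalCheck("destroy: stream is not registered with this call")
        for ssrc in owned:
            del self.stream_by_ssrc[ssrc]


def first_failure(check_on_create):
    """Create two streams for one SSRC, destroy the first, and report where it fails."""
    call = ToyCall(check_on_create)
    first = call.create_receive_stream(1878989383)
    try:
        call.create_receive_stream(1878989383)
    except FatalCheck:
        return "create"
    try:
        call.destroy_receive_stream(first)
    except FatalCheck:
        return "destroy"
    return "none"
```

In this model, `first_failure(False)` reports the failure at destroy time, far from the real cause, while `first_failure(True)` reports it at the second create, which corresponds to flipping the create-side check to release.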
The only way I see this happening is if a peer connection has two transceivers (using the same webrtc::Call instance) [1], where both connected VideoConduits end up creating receive streams for the same SSRC [2].


[1] https://searchfox.org/mozilla-central/rev/ffe6eaf2f032e58ec3b0650a87df2c62ae4ca441/media/webrtc/signaling/src/peerconnection/PeerConnectionMedia.cpp#978,988
[2] https://searchfox.org/mozilla-central/rev/ffe6eaf2f032e58ec3b0650a87df2c62ae4ca441/media/webrtc/signaling/src/media-conduit/VideoConduit.cpp#1635
This is hard to repro, especially with logging enabled, but I tried to figure out what is going on in terms of creating/destroying receive streams in VideoConduit with some tailored custom logs.

I've grouped the rows per call instance. The rows for each call are sorted chronologically.

> 0x1228d4000 (Call 0x1228d8000) Creating receive stream for SSRC 721991488 (0x2b08b740)
> — 0x1228d4000 (Call 0x1228d8000) Destroying receive stream
> 0x1228d4000 (Call 0x1228d8000) Creating receive stream for SSRC 1878989383 (0x6fff1a47)
> — 0x1228d4000 (Call 0x1228d8000) Destroying receive stream
> 0x1228d4000 (Call 0x1228d8000) Creating receive stream for SSRC 2396950355 (0x8ede8f53)
> 0x12b2e6000 (Call 0x1228d8000) Creating receive stream for SSRC 4286033713 (0xff77af31)
> — 0x12b2e6000 (Call 0x1228d8000) Destroying receive stream
> 0x12b2e6000 (Call 0x1228d8000) Creating receive stream for SSRC 1878989383 (0x6fff1a47)
> — 0x1228d4000 (Call 0x1228d8000) Destroying receive stream
> 0x1228d4000 (Call 0x1228d8000) Creating receive stream for SSRC 240469126 (0xe554486)
> — 0x1228d4000 (Call 0x1228d8000) Destroying receive stream
> — 0x12b2e6000 (Call 0x1228d8000) Destroying receive stream

> 0x152dc0800 (Call 0x152d67000) Creating receive stream for SSRC 2091158466 (0x7ca48bc2)
> — 0x152dc0800 (Call 0x152d67000) Destroying receive stream
> 0x152dc0800 (Call 0x152d67000) Creating receive stream for SSRC 390404851 (0x17451af3)
> — 0x152dc0800 (Call 0x152d67000) Destroying receive stream
> 0x152dc0800 (Call 0x152d67000) Creating receive stream for SSRC 2396950355 (0x8ede8f53)
> 0x15bd88000 (Call 0x152d67000) Creating receive stream for SSRC 3141479317 (0xbb3f2b95)
> — 0x15bd88000 (Call 0x152d67000) Destroying receive stream
> 0x15bd88000 (Call 0x152d67000) Creating receive stream for SSRC 390404851 (0x17451af3)
> — 0x152dc0800 (Call 0x152d67000) Destroying receive stream
> — 0x15bd88000 (Call 0x152d67000) Destroying receive stream

We can see that in both groups there are different VideoConduits on the same Call creating receive streams for the same SSRC. Example:
> 0x1228d4000 (Call 0x1228d8000) Creating receive stream for SSRC 1878989383 (0x6fff1a47)
> 0x12b2e6000 (Call 0x1228d8000) Creating receive stream for SSRC 1878989383 (0x6fff1a47)

It's all on the main thread so it smells to me like an ordering problem in TransceiverImpl or above.
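
For longer logs, the same grouping can be done mechanically. The following is a minimal sketch that assumes the custom log line format shown above. `shared_ssrcs` and `CREATE_RE` are hypothetical names, and `SAMPLE` is a subset of the "Creating" lines from the first group. It reports every SSRC for which more than one VideoConduit on the same Call created a receive stream. It does not check whether those streams were alive at the same time.

```python
import re
from collections import defaultdict

# Matches: <conduit> (Call <call>) Creating receive stream for SSRC <decimal ssrc>
CREATE_RE = re.compile(
    r"(0x[0-9a-f]+) \(Call (0x[0-9a-f]+)\) Creating receive stream for SSRC (\d+)"
)


def shared_ssrcs(log_lines):
    """Return {(call, ssrc): set of conduits} for SSRCs created by more than one conduit."""
    conduits = defaultdict(set)
    for line in log_lines:
        match = CREATE_RE.search(line)
        if match is None:
            continue  # "Destroying" lines and anything else are ignored
        conduit, call, ssrc = match.group(1), match.group(2), int(match.group(3))
        conduits[(call, ssrc)].add(conduit)
    return {key: value for key, value in conduits.items() if len(value) > 1}


SAMPLE = [
    "0x1228d4000 (Call 0x1228d8000) Creating receive stream for SSRC 721991488 (0x2b08b740)",
    "0x1228d4000 (Call 0x1228d8000) Creating receive stream for SSRC 1878989383 (0x6fff1a47)",
    "0x12b2e6000 (Call 0x1228d8000) Creating receive stream for SSRC 4286033713 (0xff77af31)",
    "0x12b2e6000 (Call 0x1228d8000) Creating receive stream for SSRC 1878989383 (0x6fff1a47)",
]

if __name__ == "__main__":
    for (call, ssrc), conduit_set in shared_ssrcs(SAMPLE).items():
        print(f"Call {call}: SSRC {ssrc} created by {sorted(conduit_set)}")
```

Using `re.search` rather than `re.match` lets the pattern skip any quote marker or other prefix in front of the conduit address.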

Byron, you know the layers above VideoConduit better. Can you take a look?

Note that the call that triggers creating receive streams (and deleting any previously created stream) is ConfigureRecvMediaCodecs, which is only called at [1].


[1] https://searchfox.org/mozilla-central/rev/ffe6eaf2f032e58ec3b0650a87df2c62ae4ca441/media/webrtc/signaling/src/peerconnection/TransceiverImpl.cpp#905
Flags: needinfo?(docfaraday)
This might be a dupe of bug 1479853. I don't see these crashes on 64 at all, and not on 63 since bug 1479853 landed.
Flags: needinfo?(docfaraday)
That looks very plausible, thanks! I'll close this and related bugs as dupes.
Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → DUPLICATE
Duplicate of bug: 1479853