Closed Bug 1273652 Opened 4 years ago Closed 4 years ago

No video rendering after renegotiating recvonly video (e.g. on meet.jit.si)

Categories

(Core :: WebRTC, defect, P1)

defect

Tracking

()

RESOLVED FIXED
mozilla50
Tracking Status
firefox49 --- fixed
firefox50 --- fixed

People

(Reporter: drno, Assigned: jesup)

References

Details

Attachments

(8 files)

Occasionally when being in >= 3 multi-party call on meet.jit.si one of the video streams gets rendered as just as black.

- this affects all version of Firefox (up to release)
- I think I have not seen this for a two party call
- it can affect the main video rendering as well as one of the preview renderer
- it can happen right when joining a running conference
- sometimes I think it happens also for video which has been rendered successfully

I'm attaching a NSPR & webrtc log from a time where I had left a 3 way call and rejoined the call and got black video for one of the two other participants.

Because of the SSRC re-writting done by the SFU I'm wondering if this might be related to Simulcast support?! Marking this P1 until we have an idea what is causing this.
I doubled checked already that the SDP we see in Firefox on about:webrtc for these calls is actually the SDP we got over the wire (oppose to Firefox re-creating a strange looking SDP from a confused internal state).

What I'm referring to is this:

...
m=video 1 RTP/SAVPF 100
a=ssrc:1019964314 cname:6f53568c-1802-4029-b63d-b71790779aec
a=ssrc:1083845631 cname:1c04ce18-1ef5-43d1-8a1d-b977949ed95c
a=ssrc:4016385049 cname:1c04ce18-1ef5-43d1-8a1d-b977949ed95c
…
m=video 1 RTP/SAVPF 100
a=ssrc:3232102231 cname:1c04ce18-1ef5-43d1-8a1d-b977949ed95c
…
m=video 1 RTP/SAVPF 100
a=ssrc:3969758265 cname:mixed
a=ssrc:3969758265 label:mixedlabelvideo0
a=ssrc:3969758265 mslabel:mixedmslabel
…
The patch from bug 1213773 might potentially fix this issue. Need to verify.
See Also: → 1213773
We investigated the situation and these additional SSRCs are to be expected. They are recvonly SSRCs from the remote endpoints that we decide to allocate to existing m-lines in order to keep the m-line number low.

On a side note, we’re able to reproduce the non-rendering issue even in a 1-1 call, albeit very rarely.

Hopefully bug 1213773 fixes this.
George - can you test if this is fixed in Today's (5/25) Nightly?  the patch first was in Nightly today.
Thanks!
(Note: today's nightly has a bug where the AEC is only on for the first use of getUserMedia, so use headsets.  Hopefully the fix for that will be in tomorrow's nightly)
Flags: needinfo?(gp)
It seems I'm no longer able to "reproduce" the issue. I put reproduce in quotes because we never a reliable way to reproduce it, it's random. I'll give it a few more shots to make sure I wasn't just being lucky.
I "reproduced" it just now with two Nightly's on Linux and Mac. So it is not fixed :-(

I took me a couple of attempts. Now I have 1:1 call, but that call has two m=audio and two m=video lines in the SDP. Maybe the black video gets triggered only if multiple m-lines are present? That would at least explains why it is harder to repro on 1:1, but easier with on a 3-way call.

Another observation from this 1:1 call:
in the "RTP Stats" section of about:webrtc I see:
 - outbound_rt[c]p_audio_0
 - outbound_rt[c]p_video_1
this makes sense as the SDP says sendrecv on one audio and one video m-line.
But then on the receiving side I see this:
 - inbound_rt[c]p_audio_0 <- which appears dead; SSRC 0; no packets
 - inbound_rt[c]p_video_1 <- which also appears to be dead; SSRC 0; no packets
 - inbound_rt[c]p_audio_3
 - inbound_rt[c]p_video_4
Why is number 2 missing from that list? Interestingly the amount of SSRC's in the remote SDP is 5 (which matches the 0-4 range).
Apparently even a fresh 1:1 call results in two audio plus two video m lines.

A new theory from my side: does the black video get triggered when you re-use a room (at least in the 1:1 scenario)?
I think I have not seen the black video when I used a room/URL for the first time.
Now I'm able to repro it too :(

> But then on the receiving side I see this:

The dead channels are expected to be dead. We pre-allocate 2 SSRCs for our server (1 for audio and 1 for video), but we typically don't send anything there. 

> Why is number 2 missing from that list?

I think number 2 is missing because it's for SCTP, not RTP.
> I think I have not seen the black video when I used a room/URL for the first time.

I'm able to repro it even when I use a room for the first time :(
^ comes from a 1-1 call between FF46<->current Nightly. Both participants joined just once.
Flags: needinfo?(gp)
A little more information about the captured about:webrtc page. It's coming from the endpoint that renders black video. The SSRC that fails to be rendered is 1520089304. You can check that FF is getting data by looking at inbound_rtp_video_4:

Decoder: Avg. bitrate: 1.77 Mbps (0.38 SD) Avg. framerate: 29.21 fps (1.15 SD)

Local: 13:04:32 GMT-0500 (CDT) inboundrtp SSRC: 1520089304 Received: 243081 packets (260614.48 Kb) Lost: 331 Jitter: 1.35

Remote: 13:04:33 GMT-0500 (CDT) outboundrtp SSRC: 1520089304 Sent: 243041 packets (255825.53 Kb)
Yeah the missing number 2 in the RTP stats happens when the m=application is the middle of the m-sections in the SDP, rather then the end.

I also verified via PCAP that packets are in fact still arrive with the expected SSRCs in case of the black rendering.
One thing that I notice between the first two calls in that attachment is that there's an extra a=ssrc line in a remote video m-section in the first one (groom2, which I assume failed?), but not the previous one (groom1):

groom2 (failed?)
a=ssrc:619954680 cname:mixed
a=ssrc:619954680 label:mixedlabelvideo0
a=ssrc:619954680 mslabel:mixedmslabel
a=ssrc:2033112203 cname:a0be499b-8416-44a4-b7d4-1144fbfaa4bd

groom1 (succeeded?)
a=ssrc:1513371698 cname:mixed
a=ssrc:1513371698 label:mixedlabelvideo0
a=ssrc:1513371698 mslabel:mixedmslabel

Also, in groom1 (the succeeded case?) not all of the m-sections are bundled, whereas in groom2 (the failed case?) they are. This smells like a bundle demux problem to me.
(In reply to George Politis [:gp] from comment #12)
> A little more information about the captured about:webrtc page. It's coming
> from the endpoint that renders black video. The SSRC that fails to be
> rendered is 1520089304. You can check that FF is getting data by looking at
> inbound_rtp_video_4:
> 
> Decoder: Avg. bitrate: 1.77 Mbps (0.38 SD) Avg. framerate: 29.21 fps (1.15
> SD)
> 
> Local: 13:04:32 GMT-0500 (CDT) inboundrtp SSRC: 1520089304 Received: 243081
> packets (260614.48 Kb) Lost: 331 Jitter: 1.35
> 
> Remote: 13:04:33 GMT-0500 (CDT) outboundrtp SSRC: 1520089304 Sent: 243041
> packets (255825.53 Kb)

   That's kinda weird. Because neither of the two ssrcs that are in the previous video m-section show up in the stats. What is that m-section for?
> That's kinda weird. Because neither of the two ssrcs that are in the previous video m-section show up in the stats. What is that m-section for?

Those are the pre-allocate video SSRCs for our server. We typically don't send anything there.
(In reply to Byron Campen [:bwc] (PTO May 26, PTO Jun 6-10) from comment #14)
> One thing that I notice between the first two calls in that attachment is
> that there's an extra a=ssrc line in a remote video m-section in the first
> one (groom2, which I assume failed?), but not the previous one (groom1):
> 
> groom2 (failed?)
> a=ssrc:619954680 cname:mixed
> a=ssrc:619954680 label:mixedlabelvideo0
> a=ssrc:619954680 mslabel:mixedmslabel
> a=ssrc:2033112203 cname:a0be499b-8416-44a4-b7d4-1144fbfaa4bd
> 
> groom1 (succeeded?)
> a=ssrc:1513371698 cname:mixed
> a=ssrc:1513371698 label:mixedlabelvideo0
> a=ssrc:1513371698 mslabel:mixedmslabel
> 
> Also, in groom1 (the succeeded case?) not all of the m-sections are bundled,
> whereas in groom2 (the failed case?) they are. This smells like a bundle
> demux problem to me.

This additional SSRC is to be expected. It's a recvonly SSRCs from the remote endpoint that we decide to allocate to existing m-lines in order to keep the m-line number low.

I don't remember what the scenario was for groom1 or if it succeeded, I should probably have taken a note about that and I should also have clarified that the failed captured session was groom2, the very top one.
(In reply to Byron Campen [:bwc] (PTO May 26, PTO Jun 6-10) from comment #14)
> Also, in groom1 (the succeeded case?) not all of the m-sections are bundled,
> whereas in groom2 (the failed case?) they are. This smells like a bundle
> demux problem to me.

Interesting observation. But in all the logs I captured everything appears to be bundled correctly.
Assignee: nobody → drno
Rank: 10
Summary: Black video rendering in multiparty calls on meet.jit.si → Black video rendering in calls on meet.jit.si
See Also: → 1277984
Sample NSPR log file from a call with black video rendering
Sample NSPR log file from a working call
So after the two log files in attachment 8761070 [details] and attachment 8761071 [details] my current educated guess is that the Jitsi bridge fails in certain call scenarios to send/forward RTCP between the sender and the receiver.

Note: all my attempts to debug this have shown that for unknown reasons this problem appears to be timing sensitive. E.g. with a non-optimized, debug build of Firefox I was not able to reproduce the problem today at all. But after compiling the exact same code base with optimization and no debug it became easy to reproduce. My guess is that debug vs non-debug somehow impacts the speed with which re-negotiations are being taken care of and that might change something on the bridge side.

If you execute the following command:
  grep "received RTCP" nspr.log.good-with-rtcp | cut -d '|' -f 2 | sort | uniq -c
you get something like this:
   3  Receive audio[467ea7df-437c-44f2-adf5-f856c7fd84bf] received RTCP packet.
   1  Receive audio[5c45c565-e7be-441e-842c-747337ee9788] received RTCP packet.
   5  Receive video[833496da-85f1-4a88-a1ba-d9bad271d27b] received RTCP packet.
  36  Receive video[f8939cf5-b735-4873-814b-643aef2e1ae9] received RTCP packet.
 142  Transmit audio[{77f7d662-db59-334e-9a9f-a2eb86fa327d}] received RTCP packet.
 142  Transmit video[{cd24f8a8-31d4-4f4a-a847-6ef57ebdc00e}] received RTCP packet.

But if you execute this:
  grep "received RTCP" nspr.log.bad-with-rtcp | cut -d '|' -f 2 | sort | uniq -c
you something like this:
    19  Receive audio[5c45c565-e7be-441e-842c-747337ee9788] received RTCP packet.
 559  Transmit audio[{80505a4a-d359-2341-945d-d270f3f708cf}] received RTCP packet.
 559  Transmit video[{911b4eb7-f24c-ab41-a7e0-b86b7c36384b}] received RTCP packet.
The important part here is that "Receive video[...]" is missing, which indicates that FF MediaPipelines never forwarded any RTCP on its video RTP receiver to the webrtc.org code.

Now this could also indicate a bug in FF MediaPipelines, especially the filter which lets RTP and RTCP pass for each pipeline based on the SSRC's from the SDP. But the log file clearly shows that the "Receive video[..]" pipeline does actually receive video RTP packets, which means the SSRC filter seems to be set properly for that pipeline (plus I manually tracked in the log file the SSRC's getting updated through renegotiation). As the RTCP part of the MediaPipeline shares the same filter as the RTP part I'm pretty sure the filter is setup properly for RTCP part.
And the fact that "Receive audio[...]" still shows up in the table above means that we are getting at least RTCP for the audio receiver.

Now this only indicates that FF is not receiving RTCP sender reports from the sender side, which if I understand it correctly is less important for working video, then RTCP receiver reports. The log file shows even in the bad call scenario FF sending out receiver reports regularly. But if this is really a problem in the bridge I assume it is very likely that it affects RTCP sender and receiver reports for a given video stream.

I guess the remaining step which we could do is check in a PCAP of a bad call for the RTCP traffic and the SSRC's on these reports.
Another try with NSPR logs + pcap show that FF actually sends sender reports on the recvonly video m-section... hmmm I guess I keep digging.
(In reply to Nils Ohlmeier [:drno] from comment #21)
> I guess the remaining step which we could do is check in a PCAP of a bad
> call for the RTCP traffic and the SSRC's on these reports.

Hi Nils, the bridge is sending sender reports for the SSRCs that it forwards, which include report blocks for the SSRCs that it receives. I run a quick test and the RTCP seemed fine while video was not rendering. If you want to run additional tests, there's a way to have the bridge dump the decrypted traffic.
Looking at some log files and pcap even more I found that in a 1:1 call the Firefox which renders black initially sends RTCP receiver and sender reports. But then it stops sending receiver reports and only send sender reports, for audio and video. My current guess is that one of the renegotiations triggers this behavior.
Further the audio side only sending sender reports probably does not affect the audio, because the RTCP feedback is not essential. But in case of video looking at the frequency of the sender reports it actually looks more like these are requests to submit a missing I-frame. I guess these I-Frame requests inside of sender reports get ignore on the sender side and therefore the sender never submits a new I-frame, and the receiver stays black because it can't rendering only with p-frames.

I'll continue to add more logging to figure out why the receiver switches to sending sender reports only on all channels.
I located the code which is probably causing the sending instead of just receiving behavior. But that code got added pretty recently in 49.
Additionally I see also more confusion in my logs regarding which webrtc.org engines are suppose to be used for video decoding.
I'm finally able to repro the problem locally with a mochitest. It appears we have a problem when reconfiguring a recvonly video codec (which gets triggered through renegotiation - and meet.jit.si sometimes does two rounds and sometimes three rounds of offer-answer (where the third offer is logically identical to the second offer from before)).
NSPR and Webrtc log from a failed mochitest run.
Modified version of a test case which for me reproduces the issue after renegotiating.
Assignee: drno → rjesup
Status: NEW → ASSIGNED
Attachment #8770176 - Flags: review?(pkerr) → review+
I manually did over 20 calls between my Mac and Linux machine with the patch applied on both sides and the calls were rock solid (where before I would have gotten black video with 5 calls max).

I'm still working on writing a mochitest to prevent regressions of this in the future.
Summary: Black video rendering in calls on meet.jit.si → No video rendering after renegotiating recvonly video (e.g. on meet.jit.si)
leave-open for the test
Keywords: leave-open
Pushed by rjesup@wgate.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/381f21c8b4c0
Always reinitialize the receiver/jitterbuffer when reseting video decoding r=pkerr
I guess we need to wait until the actual fix has made its way into MC, before we can land the test (I'm assuming your patch is not going to show up on the mozreview autoland repo).
Attachment #8770335 - Flags: review?(rjesup) → review+
Comment on attachment 8770335 [details]
Bug 1273652: mochi test to verify rendering happens after double negotiation.

https://reviewboard.mozilla.org/r/63800/#review61096
https://hg.mozilla.org/mozilla-central/rev/4957d2668f55
Status: ASSIGNED → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla50
Comment on attachment 8770176 [details] [diff] [review]
Always reinitialize the receiver/jitterbuffer when reseting video decoding

Approval Request Comment
[Feature/regressing bug #]: N/A (upstream bug)

[User impact if declined]: random video failure in some renegotiation situations depending on timing

[Describe test coverage new/current, TreeHerder]: Includes test (request is for both patches)

[Risks and why]: Very low risk
 
[String/UUID change made/needed]: none
Attachment #8770176 - Flags: approval-mozilla-aurora?
Attachment #8770335 - Flags: approval-mozilla-aurora?
See Also: → 1287615
Comment on attachment 8770176 [details] [diff] [review]
Always reinitialize the receiver/jitterbuffer when reseting video decoding

Review of attachment 8770176 [details] [diff] [review]:
-----------------------------------------------------------------

This patch fixes some video failure in some renegotiation. Let's take it in aurora.
Attachment #8770176 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Comment on attachment 8770335 [details]
Bug 1273652: mochi test to verify rendering happens after double negotiation.

Add a test. Take it in aurora.
Attachment #8770335 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
You need to log in before you can comment on or make changes to this bug.