Closed Bug 1155246 Opened 5 years ago Closed 5 years ago

Video RTCP RRs are never sent when multistream+bundle is used

Categories

(Core :: WebRTC: Networking, defect, P1)

x86_64
Linux
defect

Tracking

()

RESOLVED INVALID
Blocking Flags:

People

(Reporter: gp, Assigned: drno)

Details

Attachments

(1 file, 1 obsolete file)

User Agent: Mozilla/5.0 (X11; Linux x86_64; rv:40.0) Gecko/20100101 Firefox/40.0
Build ID: 20150416030209

Steps to reproduce:

I opened https://george.jitsi.net/bugzilla in Chrome and in FF (> 37). The page makes use of the newly introduced multistream support in FF. FF is told to use bundle and rtcp-mux by signaling.




Actual results:

The video playback in FF freezes while it plays in Chrome.




Expected results:

The video playback in FF should not freeze.

Here's what I think is happening:

- FF is told to use bundle and rtcp-mux by signaling.
- The webrtc engine is requesting a PLI for a receive only video channel. It builds an RR carrying a PLI.
- That PLI goes down to..
    MediaPipeline::PipelineTransport::SendRtcpPacket_s()
    MediaPipeline::SendPacket() (where "this" is a MediaPipelineReceiveVideo)
    TransportLayerIce::SendPacket()
    NrIceMediaStream::SendPacket()
    nr_ice_media_stream_send()

and there, in the nr_ice_media_stream_send() method, the component returned by the nr_ice_peer_ctx_find_component() method has the wrong destination address. It turns out that the packet doesn't reach the network either.
Thanks, George, for filing this. Nils is back today, and he's looking at this now.  I'm also adding Byron, Martin, and Randell to the cc.
Assignee: nobody → drno
Status: UNCONFIRMED → NEW
Rank: 10
Ever confirmed: true
Priority: -- → P1
When testing on a single Mac it was a little harder to reproduce the problem, but I have now seen it. George, are you testing this on a single machine or with multiple machines?
Extra logging in Nightly shows me only a single remote destination used for all packets.
It's true that the problem does not happen all the time. I suspect that you have to have some packet loss/delay for it to show. On the Mac I'm using Network Link Conditioner; it's similar to netem for Linux.

About the single remote destination, that's right, we're using bundle and rtcp-mux. The problem is that the webrtc engine generates RRs that carry PLIs and hands them to the transport layer but they never reach the network. You will notice that there isn't a single RR coming out of FF (whether it contains a PLI or some other feedback message, so this isn't only about PLIs really).

I've debugged FF in lldb and "seen" the PLIs that webrtc is generating and the path they take in FF. When the PLI goes down to the ICE transport layer, the ICE component has the wrong destination address and on top of that, somehow the PLI doesn't come out of FF.
Looking at un-rotting the patch in 859971 to verify if that fixes the problem.
I commented in 859971 that the original patch in there is rotten pretty badly. But I was able to modify our signaling unit tests to show that we indeed still have a problem with RTCP going in both directions. So I think we are getting close to closing this here as a duplicate of 859971.
It turns out that we actually have a signaling unit test which kind of resembles the use case here, which renegotiates an extra audio & video stream, but only in one direction: 
https://dxr.mozilla.org/mozilla-central/source/media/webrtc/signaling/test/signaling_unittests.cpp#2363

Looking at the full log files from a test run shows some interesting log lines:
- Dropping rtcp packet (which increases significantly after the renegotiation)
- Error unprotecting SRTCP packet error=7 (which only shows up after the renegotiation)
error=7 apparently translates into an auth failure:
https://dxr.mozilla.org/mozilla-central/source/netwerk/srtp/src/srtp/srtp.c#1944
It sounds like the authorization contexts for SRTP (specifically SRTCP) for reception on the sender side are fubar'd

If the SSRC changed, or if the DTLS-SRTP key was changed, perhaps the RTP was updated but not RTCP in the one-directional case.
When I do a test call with the provided test URL I don't see the SRTCP errors, but I do see lots of the "Dropping rtcp packet" messages.
The implementation in bundle cases may forward RTCP to all transports, not just the one with the right SSRC.  Need to see more (signaling + trace logs, maybe dumps of the RRs) or look in a debugger.  This has been the topic of a bunch of ... annoyance in the past.
Assignee: drno → rjesup
Status: NEW → ASSIGNED
Silly bugzilla...  NI drno to make sure he sees this
Assignee: rjesup → drno
Flags: needinfo?(drno)
I actually see RR going out, but they seem to be empty...
Flags: needinfo?(drno)
Going out from where? From FF (the receiver), from Chrome (the sender) or from JVB (the server)? In any case RRs from FF should contain the PLIs that the engine is generating.
It looks like Fx gets majorly confused which RTCP reports need to go where in this use case.
(In reply to Nils Ohlmeier [:drno] from comment #17)
> It looks like Fx gets majorly confused which RTCP reports need to go where
> in this use case.

We only do very rudimentary filtering of incoming RTCP in the bundle case, because it is very complicated to do so, and because the webrtc.org code does filtering under the hood. We're doing basically the same filtering as libjingle. As for any weirdness for outgoing RTCP, we just send what webrtc.org tells us to.
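
To illustrate what "rudimentary filtering" of incoming RTCP on a bundled transport could look like, here is a minimal sketch. It is not Firefox's actual filter: the `RtcpFilter` type and its policy (match the sender SSRC of the first sub-packet against the SSRCs signaled for an m-section, and let everything through when nothing was signaled) are assumptions made for illustration only.

```cpp
#include <cstddef>
#include <cstdint>
#include <set>

// Hypothetical per-pipeline filter for incoming RTCP on a bundled transport.
struct RtcpFilter {
  std::set<uint32_t> remote_ssrcs;  // SSRCs signaled for this m-section

  // Returns true when the packet should be handed to this pipeline.
  bool Accept(const uint8_t* data, std::size_t len) const {
    if (len < 8) {
      return false;  // too short to hold an RTCP header plus sender SSRC
    }
    if (remote_ssrcs.empty()) {
      return true;  // nothing signaled: pass it on and let the engine decide
    }
    // The sender SSRC occupies bytes 4..7 of the first sub-packet.
    const uint32_t ssrc = (static_cast<uint32_t>(data[4]) << 24) |
                          (static_cast<uint32_t>(data[5]) << 16) |
                          (static_cast<uint32_t>(data[6]) << 8) |
                          static_cast<uint32_t>(data[7]);
    return remote_ssrcs.count(ssrc) != 0;
  }
};
```

A filter this simple cannot tell which report blocks inside a compound packet belong to which stream, which is one reason the finer-grained filtering is left to the engine.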
> As for any weirdness for outgoing RTCP, we just send what webrtc.org tells us to.

That's what should happen in theory, but it doesn't in practice, it seems.
(In reply to George Politis [:gp] from comment #19)
> > As for any weirdness for outgoing RTCP, we just send what webrtc.org tells us to.
> 
> That's what should happen in theory, but it doesn't in practice, it seems.

Yeah, there's some bug in the transport stuff for RTCP, probably. We aren't doing any filtering for outgoing is what I'm getting at.
Byron, Nils -- I/we believe this is the last bug to get Jitsi truly working with Firefox, and I'd like to find out as soon as we can what's really going on here -- and how realistic it is to get a fix into Fx38 and/or Fx39.  I need to be away from the office for part of the day today, but can you two discuss this issue and figure out a plan to debug it?  I'm fine with either of you taking the lead (or both of you sharing the lead). Nils did a lot of investigation on this while Byron was on PTO. I just need this to get understood and fixed as soon as we can.  Can you collaborate and pull in Randell as needed to get this resolved as soon as we can?  Thanks!
Flags: needinfo?(drno)
Flags: needinfo?(docfaraday)
(In reply to George Politis [:gp] from comment #0)
> User Agent: Mozilla/5.0 (X11; Linux x86_64; rv:40.0) Gecko/20100101
> Firefox/40.0
> Build ID: 20150416030209
> 
> Steps to reproduce:
> 
> I opened https://george.jitsi.net/bugzilla in Chome and in FF (> 37). The
> page makes use of the newly introduced multistream support in FF. FF is told
> to use bundle and rtcp-mux by signaling.
> 
> 
> 
> 
> Actual results:
> 
> The video playback in FF freezes while it plays in Chrome.
> 
> 
> 
> 
> Expected results:
> 
> The video playback in FF should not freeze.
> 
> Here's what I think is happening:
> 
> - FF is told to use bundle and rtcp-mux by signaling.
> - The webrtc engine is requesting a PLI for a receive only video channel. It
> builds an RR carrying a PLI.
> - That PLI goes down to..
>     MediaPipeline::PipelineTransport::SendRtcpPacket_s()
>     MediaPipeline::SendPacket() (where "this" is a MediaPipelineReceiveVideo)
>     TransportLayerIce::SendPacket()
>     NrIceMediaStream::SendPacket()
>     nr_ice_media_stream_send()
> 
> and there, in the the nr_ice_media_stream_send() method, the component
> returned by the nr_ice_peer_ctx_find_component() method has the wrong
> destination address. It turns out that the packet doesn't reach the network
> either.

I'm most interested in this last statement. When you say "doesn't reach the network"
do you mean that it doesn't get to the other end or doesn't even get transmitted
on the wire?

Which IP address is it? Some other component? Something random?
Flags: needinfo?(gp)
> When you say "doesn't reach the network"
do you mean that it doesn't get to the other end or doesn't even get transmitted
on the wire?

It doesn't even get transmitted on the wire.

> Which IP address is it? Some other component? Something random?

It's the correct IP address but the wrong port. This is a very wild guess, probably utterly wrong, but it could be that although rtcp-mux is used, multiple components get created anyway (i.e. one for RTP and one for RTCP), but they're not properly initialized and when they're used for sending, well it fails.
Bad formatting, sorry :-( Also clearing the needinfo request.
Flags: needinfo?(gp)
(In reply to George Politis [:gp] from comment #23)
> > When you say "doesn't reach the network"
> do you mean that it doesn't get to the other end or doesn't even get
> transmitted
> on the wire?
> 
> It doesn't even get transmitted on the wire.

That's interesting. Were you able to trace through the code to see why?


> > Which IP address is it? Some other component? Something random?
> 
> It's the correct IP address but the wrong port. This is a very wild guess,
> probably utterly wrong, but it could be that although rtcp-mux is used,
> multiple components get created anyway (i.e. one for RTP and one for RTCP),
> but they're not properly initialized and when they're used for sending, well
> it fails.

This actually seems like a fairly reasonable guess. Maybe for some reason
we're coalescing them for non-bundle but not for bundle. Remember that the offerer
needs to make two components.

Three questions:
1. What is nr_ice_media_stream::label
(https://dxr.mozilla.org/mozilla-central/source/media/mtransport/third_party/nICEr/src/ice/ice_media_stream.h#43)

2. What is the component index:
https://dxr.mozilla.org/mozilla-central/source/media/mtransport/third_party/nICEr/src/ice/ice_component.h#63

3. Does it match a candidate sent by the other side?
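
The component question can be made concrete with a small sketch. It relies on two facts from the ICE and rtcp-mux specifications: an RTP media stream uses component ID 1 for RTP and 2 for RTCP, and with rtcp-mux negotiated RTCP shares the RTP component. The helper names are hypothetical and do not correspond to nICEr functions.

```cpp
// ICE component IDs for an RTP media stream (RFC 5245).
enum IceComponentId { kRtpComponent = 1, kRtcpComponent = 2 };

// Hypothetical helper: which component should carry an outgoing packet?
// With rtcp-mux negotiated (RFC 5761), RTCP shares the RTP component.
int ComponentForPacket(bool is_rtcp, bool rtcp_mux) {
  if (is_rtcp && !rtcp_mux) {
    return kRtcpComponent;
  }
  return kRtpComponent;
}

// Hypothetical helper: how many components must exist for a stream?
// Until rtcp-mux is confirmed by the answer, two are needed, which is why
// an offerer still creates a second component.
int ComponentsNeeded(bool rtcp_mux_confirmed) {
  return rtcp_mux_confirmed ? 1 : 2;
}
```

If an RTCP packet were routed to component 2 after rtcp-mux had been confirmed, it would target the right IP address but a port that is no longer in use, which is the symptom described above.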
I'm having no luck reproducing this. When I test, I see RTCP RRs being sent regularly for the streams we are receiving RTP for. In a chrome vs firefox test, the first two m-sections (msid:mixedmslabel) are dormant and receive no RTP or RTCP traffic from either side, the second two are active and firefox regularly sends RTCP RR for both.
Flags: needinfo?(docfaraday)
I believe the problem isn't triggered in Byron's case because the receive stream gets assigned to the channel that is also sending, so we have a sendrecv channel internally. That can happen.

If, on the other hand, the receive stream gets assigned to a channel that is *not* sending, so we have a recvonly channel internally, then you should be able to reproduce the problem.
P.S. You can reproduce the issue more reliably if you open two Chrome instances and one FF instance.
This dumps the SSRCs in RR's to the NSPR log.
Attachment #8594549 - Attachment is obsolete: true
Flags: needinfo?(drno)
I was able to reproduce the problem of stalling video. I highly recommend using two different cameras for Chrome and Firefox, because when both share the same camera it is really hard to spot the problem: the Jitsi bridge shows your own preview until the remote video comes in.

But I was never able to follow the problem of RR's being sent to the wrong destination.

I think the problem occurs if Fx is sending on the first sendrecv stream (note: in fact this stream is sendonly, but it is not marked as such), but receives the Chrome video on a second recvonly stream. My current guess is that something screws up the SSRCs in this scenario.
Ok, tried that, and seeing some abnormalities. Here's a pruned down summary of the m-sections:

audio, firefox sendrecv, gateway sendrecv, RTP active, no RTCP RR
audio, firefox recvonly, gateway sendrecv, RTP active, has RTCP RR
audio, firefox recvonly, gateway sendrecv, RTP dormant (msid:mixedmslabel)
video, firefox sendrecv, gateway sendrecv, RTP active, no RTCP RR
video, firefox recvonly, gateway sendrecv, RTP active, has RTCP RR
video, firefox recvonly, gateway sendrecv, RTP dormant (msid:mixedmslabel)

On the m-sections without RTCP RR, I observe no RTCP RR coming from the webrtc.org code at all until the call is ended (I'm guessing this is a BYE or something). However, I do see plenty of SRs for m-sections 0 and 3, and they are definitely compound RTCP (I'm not yet able to figure out what exactly is in the compound packet since that gets encrypted). Maybe webrtc.org has decided to piggyback the RRs on the SRs for sendrecv, and sends standalone RRs when recvonly?
Attachment #8598901 - Attachment is patch: true
Comment on attachment 8598901 [details] [diff] [review]
Extended RTCP dumping

Review of attachment 8598901 [details] [diff] [review]:
-----------------------------------------------------------------

::: media/webrtc/signaling/src/media-conduit/VideoConduit.cpp
@@ +48,5 @@
> +DumpRTCP(void *obj, const char *type, const void *data, int len)
> +{
> +  const uint8_t *ptr = static_cast<const uint8_t*>(data);
> +  const uint8_t *end = ptr + len; // one past end of buffer
> +  uint32_t ssrc = (ptr[4] << 24 | ptr[5] << 16 | ptr[6] << 8 | ptr[7]);

Not all RTCPs start with an RR or SR if non-compound RTCPs are allowed (i.e. it could be a bare RTCPFB/NACK packet, etc).  Of course this is just for a debug, so if it's wrong it's not a big deal.
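
A dump routine that does not assume the first sub-packet is an SR or RR can walk the packet generically. The following is a minimal sketch, separate from the attached patch: it iterates over the sub-packets of a decrypted RTCP buffer using the common header layout from RFC 3550 and records the packet type, the count/FMT field and the first SSRC. The names `RtcpSubPacket` and `WalkRtcp` are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

struct RtcpSubPacket {
  uint8_t packet_type;     // 200 = SR, 201 = RR, 205 = RTPFB, 206 = PSFB
  uint8_t count_or_fmt;    // RC for SR/RR, FMT for feedback packets
  uint32_t sender_ssrc;    // first SSRC after the 4-byte common header
  std::size_t size_bytes;  // size of this sub-packet including its header
};

// Walks a decrypted (possibly compound) RTCP buffer and returns one entry
// per sub-packet. Stops at the first truncated or malformed sub-packet.
std::vector<RtcpSubPacket> WalkRtcp(const uint8_t* data, std::size_t len) {
  std::vector<RtcpSubPacket> out;
  std::size_t pos = 0;
  while (len - pos >= 8) {  // need the common header plus one SSRC
    const uint8_t* p = data + pos;
    if ((p[0] >> 6) != 2) {
      break;  // RTCP version must be 2
    }
    // The length field counts 32-bit words minus one.
    const std::size_t size =
        (static_cast<std::size_t>((p[2] << 8) | p[3]) + 1) * 4;
    if (size < 8 || size > len - pos) {
      break;  // no SSRC present, or the sub-packet runs past the buffer
    }
    RtcpSubPacket sub;
    sub.packet_type = p[1];
    sub.count_or_fmt = p[0] & 0x1f;
    sub.sender_ssrc = (static_cast<uint32_t>(p[4]) << 24) |
                      (static_cast<uint32_t>(p[5]) << 16) |
                      (static_cast<uint32_t>(p[6]) << 8) |
                      static_cast<uint32_t>(p[7]);
    sub.size_bytes = size;
    out.push_back(sub);
    pos += size;
  }
  return out;
}
```

Because every sub-packet is treated the same way, a bare feedback packet (for example PT 205 or 206 with no leading SR/RR) is reported with its own type instead of being mislabeled.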
(In reply to Byron Campen [:bwc] from comment #31)
> Ok, tried that, and seeing some abnormalities. Here's a pruned down summary
> of the m-sections:
> 
> audio, firefox sendrecv, gateway sendrecv, RTP active, no RTCP RR
> audio, firefox recvonly, gateway sendrecv, RTP active, has RTCP RR
> audio, firefox recvonly, gateway sendrecv, RTP dormant (msid:mixedmslabel)
> video, firefox sendrecv, gateway sendrecv, RTP active, no RTCP RR
> video, firefox recvonly, gateway sendrecv, RTP active, has RTCP RR
> video, firefox recvonly, gateway sendrecv, RTP dormant (msid:mixedmslabel)
> 
> On the m-sections without RTCP RR, I observe no RTCP RR coming from the
> webrtc.org code at all until the call is ended (I'm guessing this is a BYE
> or something). However, I do see plenty of SRs for m-sections 0 and 3, and
> they are definitely compound RTCP (I'm not yet able to figure out what
> exactly is in the compound packet since that gets encrypted). Maybe
> webrtc.org has decided to piggyback the RRs on the SRs for sendrecv, and
> sends standalone RRs when recvonly?

Nils' patch (derivative of my older one with SSRCs added) will dump all the sub-packets of an RTCP packet, and is easily extended to dump more data.
Yeah, I just took that for a spin, I'm not seeing any cases where a RR has been tacked onto a SR. I am continuing to look.
Ok, that RTCP dumping shows that webrtc.org is not sending RRs for the sendrecv sections.
Ok, just saw this on about:webrtc

inbound_rtp_video_3

Decoder: Avg. bitrate: 0.47 Mbps (0.15 SD) Avg. framerate: 16.74 fps (1.77 SD)

Local: 15:29:21 GMT-0700 (PDT) inboundrtp SSRC: 715982246 Received: 18445 packets (17860.54 Kb) Lost: 53 Jitter: 0.479

Remote: 15:29:21 GMT-0700 (PDT) outboundrtp SSRC: 715982246 Sent: 127843 packets (131360.27 Kb)

That's the stats for the m-section that isn't sending RRs. Seems fishy. The other stream looks like this:

inbound_rtp_video_4

Decoder: Avg. bitrate: 0.80 Mbps (0.23 SD) Avg. framerate: 13.54 fps (3.18 SD)

Local: 15:29:21 GMT-0700 (PDT) inboundrtp SSRC: 1955998194 Received: 31159 packets (32048.27 Kb) Lost: 672 Jitter: 0.358

Remote: 15:29:21 GMT-0700 (PDT) outboundrtp SSRC: 1955998194 Sent: 35111 packets (35467.59 Kb)
Letting it sit for a while seems to normalize the incoming packet rate; I'm now receiving RTP packets at the same rate as Chrome is sending on both video streams.
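
The two about:webrtc snapshots can be compared numerically. The sketch below is a hypothetical helper, not part of Firefox: it computes the locally observed loss fraction and the share of the remote sender's reported packet count that was received. The remote "Sent" counter comes from the sender's own reports and may cover a longer period than the local counters, so a low share is a hint rather than proof of a problem.

```cpp
#include <cstdint>

// Counters as shown on about:webrtc for one inbound RTP stream.
struct InboundStats {
  uint64_t received;     // packets received locally
  uint64_t lost;         // packets counted as lost locally
  uint64_t remote_sent;  // packets the remote side reports having sent
};

// Fraction of expected packets (received + lost) that were lost.
double LossFraction(const InboundStats& s) {
  const uint64_t expected = s.received + s.lost;
  return expected == 0 ? 0.0 : static_cast<double>(s.lost) / expected;
}

// Share of the remote sender's reported packets that arrived locally.
double DeliveredShare(const InboundStats& s) {
  return s.remote_sent == 0
             ? 0.0
             : static_cast<double>(s.received) / s.remote_sent;
}
```

With the numbers above, inbound_rtp_video_3 (18445 received, 127843 sent) has a delivered share of roughly 14%, while inbound_rtp_video_4 (31159 received, 35111 sent) is at roughly 89%.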
(In reply to Byron Campen [:bwc] from comment #31)
> Ok, tried that, and seeing some abnormalities. Here's a pruned down summary
> of the m-sections:
> 
> audio, firefox sendrecv, gateway sendrecv, RTP active, no RTCP RR
> audio, firefox recvonly, gateway sendrecv, RTP active, has RTCP RR
> audio, firefox recvonly, gateway sendrecv, RTP dormant (msid:mixedmslabel)
> video, firefox sendrecv, gateway sendrecv, RTP active, no RTCP RR

My understanding is that in this video stream Firefox only sends our outgoing video, and we never receive any incoming video. If that is really the case I could understand why no RR's are going out, as there is nothing to report about. And even if we did send them, they would be either boring or confusing for the sending side if they report only zeros.

> video, firefox recvonly, gateway sendrecv, RTP active, has RTCP RR

George can we verify on your server that these RR's make it to Chrome?

> video, firefox recvonly, gateway sendrecv, RTP dormant (msid:mixedmslabel)
Flags: needinfo?(gp)
Detailing what should be sent for each m-line per the spec

(In reply to Nils Ohlmeier [:drno] from comment #38)
> (In reply to Byron Campen [:bwc] from comment #31)
> > Ok, tried that, and seeing some abnormalities. Here's a pruned down summary
> > of the m-sections:
> > 
> > audio, firefox sendrecv, gateway sendrecv, RTP active, no RTCP RR

Should send SR, perhaps occasionally RR (but not likely)

> > audio, firefox recvonly, gateway sendrecv, RTP active, has RTCP RR

Should send RR

> > audio, firefox recvonly, gateway sendrecv, RTP dormant (msid:mixedmslabel)

Can send RR (with RC=0), does not need to

> > video, firefox sendrecv, gateway sendrecv, RTP active, no RTCP RR

Should send SR, perhaps occasionally RR (especially if we're not using non-compound RTCP and occasionally send multiple NACKs/etc in between sending frames, or if the video source is slow, e.g. screencapture).

> My understanding is that in this video stream only Firefox sends our
> outgoing video, but we never receive any incoming video. If that is really
> the case I could understand why no RR's are going out, as their is nothing
> to report about. And even if we would send them, they would be either boring
> or confusing for the sending side if they report zeros only.
> 
> > video, firefox recvonly, gateway sendrecv, RTP active, has RTCP RR

Should send RR

> George can we verify on your server that these RR's make it to Chrome?
> 
> > video, firefox recvonly, gateway sendrecv, RTP dormant (msid:mixedmslabel)

Can send RR with RC=0, doesn't need to
Ah, that's right, SRs contain RR data. So there's nothing anomalous that I can reproduce...
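
The per-m-line expectations listed above follow from the basic RTCP rule in RFC 3550: a participant that has recently sent RTP emits a Sender Report, otherwise a Receiver Report, and both carry one reception report block per source heard from. A minimal sketch of that decision, with hypothetical names and the "recently sent" test reduced to a boolean:

```cpp
enum RtcpReportType { kSenderReport = 200, kReceiverReport = 201 };

struct ReportPlan {
  RtcpReportType type;
  int report_blocks;  // value of the RC field, 0..31
};

// Hypothetical helper: picks the report type and RC value for one stream.
ReportPlan PlanReport(bool sent_rtp_recently, int sources_received) {
  ReportPlan plan;
  plan.type = sent_rtp_recently ? kSenderReport : kReceiverReport;
  if (sources_received < 0) {
    sources_received = 0;
  }
  // RC is a 5-bit field, so at most 31 blocks fit in one SR or RR.
  plan.report_blocks = sources_received > 31 ? 31 : sources_received;
  return plan;
}
```

Under this rule a sendrecv m-section that is actively sending shows up as SRs with report blocks inside, a recvonly m-section with active RTP shows up as RRs, and a dormant recvonly m-section yields at most an RR with RC=0.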
> > video, firefox recvonly, gateway sendrecv, RTP active, has RTCP RR
> 
> George can we verify on your server that these RR's make it to Chrome?

I can verify this today.
> > video, firefox recvonly, gateway sendrecv, RTP active, has RTCP RR
> 
> George can we verify on your server that these RR's make it to Chrome?

In my tests RRs don't get transmitted. SRs are transmitted normally. Have you used tcpdump/wireshark to observe those RRs coming out of FF?
Flags: needinfo?(gp)
P.S. WebRTC piggybacks PLIs to compound RRs for recvonly channels. So, if FF doesn't emit RRs, it doesn't emit PLIs either.
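
For reference, this is roughly what such a packet looks like before encryption. The sketch builds an empty RR (RC=0) followed by a PLI, using the RR layout from RFC 3550 and the PLI layout from RFC 4585 (PT=206, FMT=1). The function names are hypothetical, and the SDES item that a fully compliant compound packet would also carry is omitted for brevity.

```cpp
#include <cstdint>
#include <vector>

// Appends a 32-bit value in network byte order.
static void PutU32(std::vector<uint8_t>& buf, uint32_t v) {
  buf.push_back(static_cast<uint8_t>(v >> 24));
  buf.push_back(static_cast<uint8_t>(v >> 16));
  buf.push_back(static_cast<uint8_t>(v >> 8));
  buf.push_back(static_cast<uint8_t>(v));
}

// Builds an unencrypted RR (no report blocks) followed by a PLI.
std::vector<uint8_t> BuildRrWithPli(uint32_t sender_ssrc,
                                    uint32_t media_ssrc) {
  std::vector<uint8_t> buf;
  // RR: V=2, P=0, RC=0 | PT=201 | length=1 (two 32-bit words in total)
  buf.push_back(0x80);
  buf.push_back(201);
  buf.push_back(0);
  buf.push_back(1);
  PutU32(buf, sender_ssrc);
  // PLI: V=2, P=0, FMT=1 | PT=206 | length=2 (three 32-bit words in total)
  buf.push_back(0x81);
  buf.push_back(206);
  buf.push_back(0);
  buf.push_back(2);
  PutU32(buf, sender_ssrc);  // SSRC of the packet sender
  PutU32(buf, media_ssrc);   // SSRC of the media source being asked
  return buf;
}
```

The result is 20 bytes: 8 for the RR and 12 for the PLI. If the RR never leaves the browser, the PLI riding behind it is lost as well.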
P.P.S. For smoother debugging you can hack the WebRTC engine to dump in plaintext (decrypted) all the RTP/RTCP traffic that it generates. I'm doing the same thing at the gateway level. That said, most likely you're already aware of this trick and you're already using it. If that's the case I'm sorry for the redundant/useless information.
(In reply to George Politis [:gp] from comment #42)
> > > video, firefox recvonly, gateway sendrecv, RTP active, has RTCP RR
> > 
> > George can we verify on your server that these RR's make it to Chrome?
> 
> In my tests RRs don't get transmitted. SRs are transmitted normally. Have
> you used tcpdump/wireshark to observed those RRs coming out of FF?

Yeah, I see loads of UDP packets that start with 81:c9:00 being sent, and they are being sent to the right place. I do not see these packets arriving on the other side, although it is possible the SSRCs are different (the SSRC in the RR hasn't been advertised in SDP, since we aren't transmitting anything).
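
The 81:c9 prefix can be decoded by hand, since the first bytes of an SRTCP packet are not encrypted. A minimal sketch with hypothetical names follows; the RTP-versus-RTCP test (second byte in the range 192..223) is a commonly used demultiplexing heuristic for rtcp-mux and is an assumption here, not something taken from the Firefox code.

```cpp
#include <cstdint>

struct RtcpHeaderInfo {
  int version;           // top two bits of the first byte
  bool padding;          // P bit
  int count;             // RC (SR/RR) or FMT (feedback), low five bits
  int packet_type;       // second byte
  bool looks_like_rtcp;  // heuristic result, see DecodeFirstBytes
};

// Decodes the first two bytes of a packet seen on a muxed RTP/RTCP port.
RtcpHeaderInfo DecodeFirstBytes(uint8_t b0, uint8_t b1) {
  RtcpHeaderInfo info;
  info.version = b0 >> 6;
  info.padding = ((b0 >> 5) & 1) != 0;
  info.count = b0 & 0x1f;
  info.packet_type = b1;
  // Assumed heuristic: RTCP packet types fall in 192..223 on a muxed port.
  info.looks_like_rtcp = info.version == 2 && b1 >= 192 && b1 <= 223;
  return info;
}
```

For 0x81 0xc9 this gives version 2, no padding, RC=1 and packet type 201, i.e. a Receiver Report carrying one report block.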
Maybe we have something platform-dependent here? What platform have you been testing on? I've done my testing on linux, I can try OS X if that's where you're seeing problems.
Flags: needinfo?(gp)
I see the same results on OS X.
(In reply to George Politis [:gp] from comment #44)
> P.P.S. For smoother debugging you can hack the WebRTC engine to dump in
> plaintext (decrypted) all the RTP/RTCP traffic that it generates. I'm doing
> the same thing at the gateway level. That said, most likely you're already
> aware of this trick and you're already using it. If that's the case I'm
> sorry for the redundant/useless information.

George,

Can you please answer the questions I asked in c25?
> George,
>
> Can you please answer the questions I asked in c25?

Eric, I'm sorry I haven't done so already, I haven't forgotten about it, I'm just short on bandwidth. I'll do that by tomorrow, this will surely yield something interesting.
(In reply to George Politis [:gp] from comment #49)
> > George,
> >
> > Can you please answer the questions I asked in c25?
> 
> Eric, I'm sorry I haven't done so already, I haven't forgotten about it, I'm
> just short on bandwidth. I'll do that by tomorrow, this will surely yield
> something interesting.

No worries. I just wasn't sure if it got lost in the flood of comments.

Thanks.
> Maybe we have something platform-dependent here? What platform have you been testing on? I've done my testing on linux, I can try OS X if that's where you're seeing problems.

I've tried on both platforms as well. I don't understand why we don't observe the same thing. Since I'm the only one able to repro this, please give me some time until I can answer Eric's questions.
I think I finally understand what George is concerned about: SDP shows only a single port in use (as everything should be bundled), but on the wire I actually see two UDP source and destination ports being used?!
I have no idea yet where that is coming from and why it does not show up in our logging.
(In reply to Nils Ohlmeier [:drno] from comment #52)
> I think I finally understand what George is concerned about: SDP shows only
> a single port in use (as everything should be bundled), but on the wire I
> actually see two UDP source and destination ports being used?!
> I have no idea yet where that is coming from and why it does not show up in
> our logging.

Never mind. I should have known better. The second UDP stream is obviously from Google Chrome.
(In reply to George Politis [:gp] from comment #42)
> In my tests RRs don't get transmitted. SRs are transmitted normally. Have
> you used tcpdump/wireshark to observed those RRs coming out of FF?

I looked at two calls with wireshark. In both cases I see lots of RR's being sent by Firefox.

The first call worked fine; in the second call I had some choppy and stalling video issues.
The noticeable difference between the RR's is that in the first case the RR's are just 64 bytes in total size (=rtcp length 1) and essentially empty, meaning they only have the sender SSRC in them and nothing else to report.
In the second case the RR's are usually 132 bytes (=rtcp length 7), though there are also bigger and smaller packets.

Noticeably, the RR's going to Chrome look quite different in frequency and size. Unfortunately most of the detailed information from within the RTCP packets appears to be useless, as my wireshark does not understand that these are SRTCP and not just plain RTCP.
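
The length values quoted above can be checked with a little arithmetic. The RTCP length field counts 32-bit words minus one, and an RR is 8 bytes plus 24 bytes per report block (RFC 3550). The frame-size estimate below is a hypothetical helper that additionally assumes Ethernet, IPv4 and UDP headers without options, a 4-byte SRTCP index and an authentication tag whose size is passed in.

```cpp
#include <cstddef>
#include <cstdint>

// RTCP length field: number of 32-bit words in the packet, minus one.
std::size_t RtcpBytesFromLengthField(uint16_t length_field) {
  return (static_cast<std::size_t>(length_field) + 1) * 4;
}

// Size of a Receiver Report with the given number of report blocks.
std::size_t RrBytes(unsigned report_blocks) {
  return 8 + 24 * static_cast<std::size_t>(report_blocks);
}

// Hypothetical estimate of the captured frame size for one SRTCP packet.
// Assumes Ethernet (14) + IPv4 (20) + UDP (8) and a 4-byte SRTCP index.
std::size_t FrameBytes(std::size_t rtcp_bytes, std::size_t auth_tag_bytes) {
  const std::size_t kEthernet = 14;
  const std::size_t kIpv4 = 20;
  const std::size_t kUdp = 8;
  const std::size_t kSrtcpIndex = 4;
  return kEthernet + kIpv4 + kUdp + rtcp_bytes + kSrtcpIndex + auth_tag_bytes;
}
```

Length 1 corresponds to an 8-byte RR with no report blocks, and length 7 to a 32-byte RR with exactly one report block. Under the stated assumptions and a 10-byte authentication tag, an empty RR gives a 64-byte frame, which is consistent with the first capture.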
I am no longer able to reproduce this. FF transmits RRs just fine. Has anything changed in the transport layer recently?
Maybe bug 1146462 fixed this somehow?
(In reply to Byron Campen [:bwc] from comment #56)
> Maybe bug 1146462 fixed this somehow?

Nope, seems to work fine before this changeset.
I'm very confused about this and I'm still looking for an explanation although I'm afraid I got a bit too excited and I jumped to conclusions too fast without triple checking everything. It seems more and more likely that this was never a real thing and that I messed this up :-( I deeply apologize for the waste of your time.
Flags: needinfo?(gp)
backlog: --- → webRTC+
I suggest that we close this as invalid.
Status: ASSIGNED → RESOLVED
Closed: 5 years ago
Resolution: --- → INVALID