Closed Bug 1231039 Opened 9 years ago Closed 8 years ago

STUN binding requests problems with Cisco Spark

Categories

(Infrastructure & Operations Graveyard :: NetOps, task, P1)

Tracking

(Not tracked)

RESOLVED WORKSFORME

People

(Reporter: drno, Unassigned)

Originally reported in bug 1225248, comment #19 (https://bugzilla.mozilla.org/show_bug.cgi?id=1225248#c19) through comment #25.

It appears that we have some kind of confusion or race condition between Firefox and Cisco Spark's Linus, where Firefox sends STUN binding requests for the second media stream (usually video), but never gets a binding response from Linus.
Apparently Linus receives the binding request, but holds off sending a binding response because it has itself sent a binding request to Firefox and hasn't gotten a response.
(In reply to Hank Peng from comment #25)
> More from Nathan:
> Thach took a deeper look and pointed out a couple places that my analysis
> was incorrect.
> 
> linus is receiving STUN binding requests on the video port but is not
> responding to them b/c they are aggressive nomination and linus has not yet
> received a binding response. linus has sent a binding request on the video
> port but never receives a response.
> 
> so the root cause appears to be the same, linus does not receive a binding
> response from firefox, but hopefully the extra information will be useful.

Yes, that additional information is very helpful. I'm wondering if this problem is triggered by running Firefox on a machine with a public IP, or if it is just a timing/race thing.

It is plausible that we have a problem in this area. For example, the statistics on about:webrtc definitely get confused when Linus re-uses the same IP and port for both ICE streams. So far I thought only the statistics got confused, but maybe there is more to the problem.
(In reply to Nils Ohlmeier [:drno] from comment #1)
> (In reply to Hank Peng from comment #25)
> > More from Nathan:
> > Thach took a deeper look and pointed out a couple places that my analysis
> > was incorrect.
> > 
> > linus is receiving STUN binding requests on the video port but is not
> > responding to them b/c they are aggressive nomination and linus has not yet
> > received a binding response. linus has sent a binding request on the video
> > port but never receives a response.
> > 
> > so the root cause appears to be the same, linus does not receive a binding
> > response from firefox, but hopefully the extra information will be useful.
> 
> Yes that additional information is very helpful. I'm wondering if this
> problem is triggered by running Firefox on a machine with public IP, or if
> it is just a timing/race thing.
> 
> It is plausible that we have a problem in this area. E.g. at least the
> statistics on about:webrtc definitely get confused with Linus re-using the
> same IP and port for both ICE streams. So far I thought only the
> statistics got confused, but maybe there is more to the problem.

Waiting for STUN responses before sending your own STUN response seems like a bad idea to me, although I would have to see a pcap before I could guess why the response from Firefox wasn't getting through.
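If a capture does become available, a small script could help pick out the packets that matter. The following is only a minimal sketch, not code from Firefox or Linus: it parses one raw UDP payload using the STUN header layout from RFC 5389 and reports whether it is a binding request or a binding success response, and whether it carries the USE-CANDIDATE attribute from RFC 5245. The function name and the returned field names are made up for illustration.

```python
import struct

MAGIC_COOKIE = 0x2112A442      # fixed value in every RFC 5389 STUN header
BINDING_REQUEST = 0x0001
BINDING_SUCCESS = 0x0101
USE_CANDIDATE = 0x0025         # ICE attribute type defined in RFC 5245


def describe_stun(payload: bytes) -> dict:
    """Summarize one UDP payload that is expected to hold a STUN message."""
    if len(payload) < 20:
        raise ValueError("too short for a STUN header")
    msg_type, msg_len, cookie = struct.unpack("!HHI", payload[:8])
    if cookie != MAGIC_COOKIE:
        raise ValueError("magic cookie mismatch, probably not STUN")
    txid = payload[8:20]

    # Walk the attributes: 2-byte type, 2-byte length, value padded to 4 bytes.
    attrs = []
    offset, end = 20, min(20 + msg_len, len(payload))
    while offset + 4 <= end:
        attr_type, attr_len = struct.unpack("!HH", payload[offset:offset + 4])
        attrs.append(attr_type)
        offset += 4 + ((attr_len + 3) // 4) * 4

    kind = {BINDING_REQUEST: "binding request",
            BINDING_SUCCESS: "binding success response"}.get(msg_type, "other")
    return {"kind": kind,
            "transaction_id": txid.hex(),
            "use_candidate": USE_CANDIDATE in attrs}
```

Matching transaction IDs between requests and responses in each direction would then show which side's response goes missing.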
(In reply to Hank Peng from comment #24)
> Got the reply from linus:
> a few things of note...
> 
> 1) firefox has changed their SDP a bit and it is causing linus to not
> advertise bundle. we used to do bundle, but no longer do with FF42. linus
> wants to see the ICE caps on the audio and video m-lines be the same, but
> with FF42 they are different (the port numbers are different).

Sorry, I'm not sure I understand what 'ICE caps' means in this case.
As the offerer, Firefox does not know whether the answerer will support bundle, so we need to offer different port numbers and ICE streams per m-line in case it doesn't. To check for bundle support you should not look at port numbers, but at the 'a=group:BUNDLE' attribute.
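To illustrate the point about the 'a=group:BUNDLE' attribute, here is a minimal sketch of a check that ignores port numbers entirely. The SDP below is a made-up, heavily trimmed offer (not a real Firefox offer), and the helper names are hypothetical.

```python
from typing import List

# Made-up, trimmed offer: the two m-lines use different ports, yet bundle is offered.
OFFER = """v=0
o=- 0 0 IN IP4 0.0.0.0
s=-
t=0 0
a=group:BUNDLE audio video
m=audio 50000 UDP/TLS/RTP/SAVPF 109
a=mid:audio
m=video 50002 UDP/TLS/RTP/SAVPF 120
a=mid:video
"""


def bundled_mids(sdp: str) -> List[str]:
    """Return the mids listed in the first a=group:BUNDLE line, or [] if none."""
    for line in sdp.splitlines():
        line = line.strip()
        if line.startswith("a=group:BUNDLE"):
            return line.split()[1:]
    return []


def media_ports(sdp: str) -> List[int]:
    """Return the port of every m-line, only to show that the ports differ."""
    return [int(line.split()[1])
            for line in sdp.splitlines() if line.startswith("m=")]


print(bundled_mids(OFFER))   # mids offered for bundling
print(media_ports(OFFER))    # per-m-line ports, which may legitimately differ
```

An answerer keyed on `bundled_mids` would still detect bundle support for this offer even though `media_ports` returns two different values.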
(In reply to Byron Campen [:bwc] from comment #2)
>    Waiting for STUN responses before sending your own STUN response seems
> like a bad idea to me, although I would have to see a pcap before I could
> make a guess why the response from firefox wasn't getting through.

Agreed. I think it is their full implementation reacting to Firefox aggressive nomination.

But Firefox not responding to incoming binding requests is also bad. I'll try to check if I can find any obvious error in our code in that regard.
See Also: → 1225248
(In reply to Nils Ohlmeier [:drno] from comment #3)
> (In reply to Hank Peng from comment #24)
> > Got the reply from linus:
> > a few things of note...
> > 
> > 1) firefox has changed their SDP a bit and it is causing linus to not
> > advertise bundle. we used to do bundle, but no longer do with FF42. linus
> > wants to see the ICE caps on the audio and video m-lines be the same, but
> > with FF42 they are different (the port numbers are different).
> 
> Sorry I'm not sure I understand what 'ICE caps' means in this case.
> As the offerer Firefox does not know if the answerer will support bundle and
> therefore we need to offer different port numbers and ICE streams per
> m-line, in case the answerer doesn't support bundle. To check for bundle
> support you should not look at port numbers, but at the 'a=group:BUNDLE'
> attribute.

this was taken out of context a bit, i was responding to a question of why bundle was disabled by linus.  we did some work a few months ago to get bundle to work with firefox and i was explaining why that was no longer working as intended.  originally the firefox SDP had the same port numbers for audio and video, and that is what we were using as our 'key' to know that bundle was supported.  we can update our code to detect bundle per the other attributes and generate a proper answer.
(In reply to Nils Ohlmeier [:drno] from comment #4)
> (In reply to Byron Campen [:bwc] from comment #2)
> >    Waiting for STUN responses before sending your own STUN response seems
> > like a bad idea to me, although I would have to see a pcap before I could
> > make a guess why the response from firefox wasn't getting through.
> 
> Agreed. I think it is their full implementation reacting to Firefox
> aggressive nomination.
> 
> But Firefox not responding to incoming binding requests is also bad. I'll
> try to check if I can find any obvious error in our code in that regard.

yes, the linus ICE stack won't send a binding response to a use-candidate binding request until linus has received a binding response to its own non-use-candidate binding request.  the code is trying to guarantee that neither side can complete ice negotiation until at least one binding request/response has successfully been exchanged in each direction.  linus will send a binding response to a non-use-candidate binding request right away.
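The gating rule described in the paragraph above can be sketched as a toy model. This is not Linus code: the class and method names are hypothetical, and releasing held requests as soon as the agent's own check succeeds is an assumption, since the thread does not say how or when the held responses are eventually sent.

```python
class GatedResponder:
    """Toy model of the response-gating rule described above (names hypothetical)."""

    def __init__(self):
        self.own_check_succeeded = False   # got a response to our own request?
        self.held = []                     # transaction ids of held requests

    def on_binding_request(self, txid, use_candidate):
        """Return the transaction ids that get a binding response right now."""
        if use_candidate and not self.own_check_succeeded:
            self.held.append(txid)         # hold until our own check succeeds
            return []
        return [txid]                      # answered immediately

    def on_binding_response(self):
        """Our own non-use-candidate check succeeded; release held requests.

        Assumption: held requests are answered at this point.
        """
        self.own_check_succeeded = True
        released, self.held = self.held, []
        return released


agent = GatedResponder()
print(agent.on_binding_request("req-1", use_candidate=False))  # answered at once
print(agent.on_binding_request("req-2", use_candidate=True))   # held
print(agent.on_binding_response())                             # held request released
```

If `on_binding_response` is never called, which matches the failing scenario in this bug, every use-candidate request stays in `held` indefinitely.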

i don't know what the fallout (if any) would be to just send the binding response.  presumably, the firefox side would think ICE was complete and the linus side would not.  later, if linus received a binding response from firefox then both sides would think ICE was complete.  if linus never received a binding response then ICE would fail on the linus side but not on the firefox side.

for this particular scenario, since linus does not receive the binding response the end result would seem to be unaffected as ICE would not complete on the linus side and the media flow would fail.
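The cases walked through above can be laid out as a small truth table. This is a deliberately simplified sketch with hypothetical names: it assumes firefox's use-candidate request always reaches linus (as reported in this bug), that firefox considers ICE complete once it gets a response to that request, and that linus considers ICE complete once it gets a response to its own request.

```python
def ice_outcome(linus_gates_response: bool, firefox_response_arrives: bool):
    """Return (firefox_complete, linus_complete) for the simplified exchange."""
    # linus completes only once it has a response to its own binding request.
    linus_complete = firefox_response_arrives
    # firefox completes once linus answers its use-candidate request. With
    # gating, linus answers only after its own check has succeeded.
    firefox_complete = firefox_response_arrives or not linus_gates_response
    return firefox_complete, linus_complete


for gated in (True, False):
    for arrives in (True, False):
        print(f"gated={gated} response_arrives={arrives} ->",
              ice_outcome(gated, arrives))
```

In both rows where the firefox response never arrives, linus does not complete, so the media flow fails whether or not the response is gated.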

for what it is worth, linus is happy to run in ice-lite mode as well which simplifies these scenarios somewhat.  we disable that for firefox as it is our understanding that firefox does not support an ice-lite peer.
(In reply to Nathan Buckles from comment #6)
> (In reply to Nils Ohlmeier [:drno] from comment #4)
> > (In reply to Byron Campen [:bwc] from comment #2)
> > >    Waiting for STUN responses before sending your own STUN response seems
> > > like a bad idea to me, although I would have to see a pcap before I could
> > > make a guess why the response from firefox wasn't getting through.
> > 
> > Agreed. I think it is their full implementation reacting to Firefox
> > aggressive nomination.
> > 
> > But Firefox not responding to incoming binding requests is also bad. I'll
> > try to check if I can find any obvious error in our code in that regard.
> 
> yes, the linus ICE stack won't send a binding response to a use-candidate
> binding request until linus has received a binding response to its own
> non-use-candidate binding request. 

This seems like it's going to cause a problem with aggressive mode. Can you
point to the part of RFC 5245 that says you should do this?
(In reply to Eric Rescorla (:ekr) from comment #7)
> (In reply to Nathan Buckles from comment #6)
> > (In reply to Nils Ohlmeier [:drno] from comment #4)
> > > (In reply to Byron Campen [:bwc] from comment #2)
> > > >    Waiting for STUN responses before sending your own STUN response seems
> > > > like a bad idea to me, although I would have to see a pcap before I could
> > > > make a guess why the response from firefox wasn't getting through.
> > > 
> > > Agreed. I think it is their full implementation reacting to Firefox
> > > aggressive nomination.
> > > 
> > > But Firefox not responding to incoming binding requests is also bad. I'll
> > > try to check if I can find any obvious error in our code in that regard.
> > 
> > yes, the linus ICE stack won't send a binding response to a use-candidate
> > binding request until linus has received a binding response to its own
> > non-use-candidate binding request. 
> 
> > This seems like it's going to cause a problem with aggressive mode. Can you
> point to the part of RFC 5245 that says you should do this?

i make no claims that this is the right behavior or per spec, was just attempting to describe the current behavior.  however, in this particular case whether linus sends the binding response or not doesn't seem like it would matter.  if linus does not receive a binding response from firefox then ICE will not complete on the linus side and the call would still fail.
Is there a packet capture from the device where firefox is running in this scenario?
I am afraid there was no packet capture. Added David Benham to the CC list. I wonder if the issue is still reproducible to get a packet capture.
Yes that would be super helpful. If not then I'll try to reproduce the problem next week some time.
backlog: --- → webrtc/webaudio+
Rank: 15
Priority: -- → P1
I just talked with Nils.  He'll look at this in early January.
Assignee: nobody → drno
I started talking to our IT department about getting a public IP address for trying to replicate this.
backlog: webrtc/webaudio+ → ---
Component: WebRTC: Networking → NetOps
Product: Core → Infrastructure & Operations
QA Contact: jbarnell
Version: Trunk → unspecified
As briefly discussed via email and IRC it would be great if I could get a public IP address temporarily patched in the MTV office onto outlet '2456 A' for testing this bug. I'm estimating that I would need it at maximum for a days, probably less.
Assignee: drno → jbircher
Assignee: jbircher → network-operations
Group: mozilla-employee-confidential
2456A -> switch2.df201-3.ops.mtv2.mozilla.net:3/0/39
Do you guys have any idea when you could look into providing the public IP?
Our partner Cisco is asking me regularly for updates on this bug and we would like to be able to report progress. So any rough estimation would be highly appreciated.
This will have to go through a security review and the Change Advisory Board to be implemented (because we don't currently have a way to deliver public IPs to the internal networking equipment, so it's a comprehensive change to the firewall config to make it possible), so this would probably be easier to handle in a separate bug with a dependency rather than polluting your code bug with the IP address stuff.  I would also suspect we're looking at at least a week and a half to get it implemented because of having to clear the above hurdles.
(In reply to Dave Miller [:justdave] (justdave@bugzilla.org) from comment #17)
> ... so this would probably be easier
> to handle in a separate bug with a dependency rather than polluting your
> code bug with the IP address stuff.

Thanks Dave for the suggestion. I created bug 1249168 as a blocker for this bug, where we can discuss the internal details without further polluting this original bug report.
Group: mozilla-employee-confidential
Looks like I should be able to test this with a public IP soon.
Finally started testing this on a public IP.
So I tested a lot of Spark calls from a machine on a public IP (no firewall whatsoever):
- with Firefox 42 and with a local build from today's latest sources (Nightly)
- with e10s on, with e10s off
- with ICE TCP on and ICE TCP off
- with direct calling and by joining an ongoing meeting
I have not been able to reproduce this problem at all.

The only thing which looked a little bit like this problem was that the very first call I made from the public IP did not get established. But in that case I had ICE logs turned on, and the PeerConnections simply got destroyed after gathering finished (I assume some JS code decided to hang up/give up on that call attempt). So the call never got as far as the call from the bug report.

So without further evidence or more log files I fear I have to close this as not reproducible.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → WORKSFORME
Product: Infrastructure & Operations → Infrastructure & Operations Graveyard