1231039 - STUN binding requests problems with Cisco Spark

Reporter

Description

•

9 years ago

Originally reported in bug 1225248, comments #19 to #25 (starting at https://bugzilla.mozilla.org/show_bug.cgi?id=1225248#c19).

It appears that we have some kind of confusion or race condition between Firefox and Cisco Spark's Linus, where Firefox sends STUN binding requests for the second media stream (usually video), but never gets a binding response from Linus.
Apparently Linus receives the binding request, but holds off sending a binding response, because it has itself sent a binding request to Firefox and hasn't received any response.

Nils Ohlmeier [:drno]

Reporter

Comment 1

•

9 years ago

(In reply to Hank Peng from comment #25)
> More from Nathan:
> Thach took a deeper look and pointed out a couple places that my analysis
> was incorrect.
> 
> linus is receiving STUN binding requests on the video port but is not
> responding to them b/c they are aggressive nomination and linus has not yet
> received a binding response. linus has sent a binding request on the video
> port but never receives a response.
> 
> so the root cause appears to be the same, linus does not receive a binding
> response from firefox, but hopefully the extra information will be useful.

Yes, that additional information is very helpful. I'm wondering if this problem is triggered by running Firefox on a machine with a public IP, or if it is just a timing/race thing.

It is plausible that we have a problem in this area. E.g. at least the statistics on about:webrtc definitely get confused with Linus re-using the same IP and port for both ICE streams. So far I thought only the statistics got confused, but maybe there is more to the problem.

Byron Campen [:bwc]

Comment 2

•

9 years ago

(In reply to Nils Ohlmeier [:drno] from comment #1)
> (In reply to Hank Peng from comment #25)
> > More from Nathan:
> > Thach took a deeper look and pointed out a couple places that my analysis
> > was incorrect.
> > 
> > linus is receiving STUN binding requests on the video port but is not
> > responding to them b/c they are aggressive nomination and linus has not yet
> > received a binding response. linus has sent a binding request on the video
> > port but never receives a response.
> > 
> > so the root cause appears to be the same, linus does not receive a binding
> > response from firefox, but hopefully the extra information will be useful.
> 
> Yes that additional information is very helpful. I'm wondering if this
> problem is triggered by running Firefox on a machine with public IP, or if
> it is just a timing/race thing.
> 
> It is plausible that we have a problem in this area. E.g. at least the
> statistics on about:webrtc definitely get confused with Linus re-using the
> same IP and port for both ICE streams. So far I thought this were only the
> statistics which got confused, but maybe there is more to the problem.

   Waiting for STUN responses before sending your own STUN response seems like a bad idea to me, although I would have to see a pcap before I could make a guess why the response from firefox wasn't getting through.

Nils Ohlmeier [:drno]

Reporter

Comment 3

•

9 years ago

(In reply to Hank Peng from comment #24)
> Got the reply from linus:
> a few things of note...
> 
> 1) firefox has changed their SDP a bit and it is causing linus to not
> advertise bundle. we used to do bundle, but no longer do with FF42. linus
> wants to see the ICE caps on the audio and video m-lines be the same, but
> with FF42 they are different (the port numbers are different).

Sorry, I'm not sure I understand what 'ICE caps' means in this case.
As the offerer, Firefox does not know whether the answerer will support bundle. We therefore need to offer different port numbers and ICE streams per m-line, in case the answerer doesn't support bundle. To check for bundle support you should not look at port numbers, but at the 'a=group:BUNDLE' attribute.
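
A minimal sketch of that check, in Python. The function names and the sample offer are made up for illustration; the SDP is hand-written and is not actual Firefox output. It only shows that bundle can be detected from the session-level group attribute even when the m-lines carry different ports.

```python
def bundle_groups(sdp: str) -> list[list[str]]:
    """Return the mid list of every session-level a=group:BUNDLE line."""
    groups = []
    for line in sdp.splitlines():
        parts = line.split()
        if not parts:
            continue
        if parts[0].startswith("m="):
            break  # group attributes are session-level, so stop at the first m-line
        if parts[0] == "a=group:BUNDLE":
            groups.append(parts[1:])
    return groups


def offers_bundle(sdp: str) -> bool:
    """True if the SDP contains at least one BUNDLE group."""
    return bool(bundle_groups(sdp))


def media_ports(sdp: str) -> list[int]:
    """Ports from the m-lines, shown only to contrast with the group check."""
    return [int(line.split()[1]) for line in sdp.splitlines() if line.startswith("m=")]


# Hand-written illustrative offer: BUNDLE is offered, yet the ports differ.
OFFER = "\r\n".join([
    "v=0",
    "o=- 1 0 IN IP4 0.0.0.0",
    "s=-",
    "t=0 0",
    "a=group:BUNDLE a0 v0",
    "m=audio 50000 UDP/TLS/RTP/SAVPF 109",
    "a=mid:a0",
    "m=video 50002 UDP/TLS/RTP/SAVPF 120",
    "a=mid:v0",
]) + "\r\n"

print(offers_bundle(OFFER))   # True
print(media_ports(OFFER))     # [50000, 50002]
```

Comparing the two ports would report "no bundle" for this offer, while the group attribute says bundle is on offer.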

Nils Ohlmeier [:drno]

Reporter

Comment 4

•

9 years ago

(In reply to Byron Campen [:bwc] from comment #2)
>    Waiting for STUN responses before sending your own STUN response seems
> like a bad idea to me, although I would have to see a pcap before I could
> make a guess why the response from firefox wasn't getting through.

Agreed. I think it is their full ICE implementation reacting to Firefox's aggressive nomination.

But Firefox not responding to incoming binding requests is also bad. I'll try to check if I can find any obvious error in our code in that regard.
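
For anyone going through a capture, here is a rough Python sketch of how to tell a nominating (USE-CANDIDATE) binding request from a plain one. The constants come from the STUN and ICE specs; the helper names and the hand-built sample packets are assumptions for illustration, and the parser skips integrity and fingerprint validation.

```python
import struct

MAGIC_COOKIE = 0x2112A442   # fixed value in every STUN header
BINDING_REQUEST = 0x0001    # STUN message type
PRIORITY = 0x0024           # ICE attribute
USE_CANDIDATE = 0x0025      # ICE attribute, zero-length flag


def parse_stun(packet: bytes) -> tuple[int, dict[int, bytes]]:
    """Return (message type, {attribute type: value}) for a STUN packet."""
    if len(packet) < 20:
        raise ValueError("too short for a STUN header")
    msg_type, msg_len, cookie = struct.unpack("!HHI", packet[:8])
    if cookie != MAGIC_COOKIE:
        raise ValueError("missing STUN magic cookie")
    attrs = {}
    offset, end = 20, min(20 + msg_len, len(packet))
    while offset + 4 <= end:
        attr_type, attr_len = struct.unpack("!HH", packet[offset:offset + 4])
        attrs[attr_type] = packet[offset + 4:offset + 4 + attr_len]
        # Attribute values are padded to a 4-byte boundary.
        offset += 4 + attr_len + (-attr_len % 4)
    return msg_type, attrs


def is_nominating_request(packet: bytes) -> bool:
    """True for a binding request that carries USE-CANDIDATE."""
    msg_type, attrs = parse_stun(packet)
    return msg_type == BINDING_REQUEST and USE_CANDIDATE in attrs


def build_request(use_candidate: bool) -> bytes:
    """Hand-built sample request (hypothetical, all-zero transaction id)."""
    attrs = struct.pack("!HHI", PRIORITY, 4, 100)
    if use_candidate:
        attrs += struct.pack("!HH", USE_CANDIDATE, 0)
    header = struct.pack("!HHI", BINDING_REQUEST, len(attrs), MAGIC_COOKIE)
    return header + bytes(12) + attrs


print(is_nominating_request(build_request(True)))    # True
print(is_nominating_request(build_request(False)))   # False
```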

Nils Ohlmeier [:drno]

Reporter

Updated

•

9 years ago

Comment 5

•

9 years ago

(In reply to Nils Ohlmeier [:drno] from comment #3)
> (In reply to Hank Peng from comment #24)
> > Got the reply from linus:
> > a few things of note...
> > 
> > 1) firefox has changed their SDP a bit and it is causing linus to not
> > advertise bundle. we used to do bundle, but no longer do with FF42. linus
> > wants to see the ICE caps on the audio and video m-lines be the same, but
> > with FF42 they are different (the port numbers are different).
> 
> Sorry I'm not sure I understand what 'ICE caps' means in this case.
> As the offerer Firefox does not know if the answerer will support bundle and
> therefore we need to offer different port numbers and ICE streams per
> m-line, in case the answerer doesn't support bundle. To check for bundle
> support you should not look at port numbers, but at the 'a=group:BUNDLE'
> attribute.

this was taken out of context a bit, i was responding to a question of why bundle was disabled by linus.  we did some work a few months ago to get bundle to work with firefox and i was explaining why that was no longer working as intended.  originally the firefox SDP had the same port numbers for audio and video, and that is what we were using as our 'key' to know that bundle was supported.  we can update our code to detect bundle per the other attributes and generate a proper answer.

Nathan Buckles

Comment 6

•

9 years ago

(In reply to Nils Ohlmeier [:drno] from comment #4)
> (In reply to Byron Campen [:bwc] from comment #2)
> >    Waiting for STUN responses before sending your own STUN response seems
> > like a bad idea to me, although I would have to see a pcap before I could
> > make a guess why the response from firefox wasn't getting through.
> 
> Agreed. I think it is their full implementation reacting to Firefox
> aggressive nomination.
> 
> But Firefox not responding to incoming binding requests is also bad. I'll
> try to check if I can find any obvious error in our code in that regards.

yes, the linus ICE stack won't send a binding response to a use-candidate binding request until linus has received a binding response to its own non-use-candidate binding request.  the code is trying to guarantee that neither side can complete ice negotiation until at least one binding request/response has successfully been exchanged in each direction.  linus will send a binding response to a non-use-candidate binding request right away.
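
the response policy described above, as a small python sketch.  this is a simplified model for discussion, not the linus code; the class and function names are invented.

```python
from dataclasses import dataclass


@dataclass
class PairState:
    """Per-candidate-pair state on the answering agent (hypothetical model)."""
    got_response_to_own_request: bool = False


def should_respond_now(pair: PairState, use_candidate: bool) -> bool:
    # Plain checks are answered right away. Nominating checks are held back
    # until our own check on this pair has been answered by the peer.
    return (not use_candidate) or pair.got_response_to_own_request


pair = PairState()
print(should_respond_now(pair, use_candidate=False))  # True
print(should_respond_now(pair, use_candidate=True))   # False

# Once the peer answers our own binding request, the held response can go out.
pair.got_response_to_own_request = True
print(should_respond_now(pair, use_candidate=True))   # True
```

with a peer that puts use-candidate on every check, the second case is the one that matters: if the peer's response to our own request never arrives, the pair stays in the "held" state.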

i don't know what the fallout (if any) would be to just send the binding response.  presumably, the firefox side would think ICE was complete and the linus side would not.  later, if linus received a binding response from firefox then both sides would think ICE was complete.  if linus never received a binding response then ICE would fail on the linus side but not on the firefox side.

for this particular scenario, since linus does not receive the binding response the end result would seem to be unaffected as ICE would not complete on the linus side and the media flow would fail.

for what it is worth, linus is happy to run in ice-lite mode as well which simplifies these scenarios somewhat.  we disable that for firefox as it is our understanding that firefox does not support an ice-lite peer.

Eric Rescorla (:ekr)

Comment 7

•

9 years ago

(In reply to Nathan Buckles from comment #6)
> (In reply to Nils Ohlmeier [:drno] from comment #4)
> > (In reply to Byron Campen [:bwc] from comment #2)
> > >    Waiting for STUN responses before sending your own STUN response seems
> > > like a bad idea to me, although I would have to see a pcap before I could
> > > make a guess why the response from firefox wasn't getting through.
> > 
> > Agreed. I think it is their full implementation reacting to Firefox
> > aggressive nomination.
> > 
> > But Firefox not responding to incoming binding requests is also bad. I'll
> > try to check if I can find any obvious error in our code in that regards.
> 
> yes, the linus ICE stack won't send a binding response to a use-candidate
> binding request until linus has received a binding response to it's own
> non-use-candidate binding request. 

This seems like it's going to cause a problem with aggressive mode. Can you
point to the part of RFC 5245 that says you should do this?

Nathan Buckles

Comment 8

•

9 years ago

(In reply to Eric Rescorla (:ekr) from comment #7)
> (In reply to Nathan Buckles from comment #6)
> > (In reply to Nils Ohlmeier [:drno] from comment #4)
> > > (In reply to Byron Campen [:bwc] from comment #2)
> > > >    Waiting for STUN responses before sending your own STUN response seems
> > > > like a bad idea to me, although I would have to see a pcap before I could
> > > > make a guess why the response from firefox wasn't getting through.
> > > 
> > > Agreed. I think it is their full implementation reacting to Firefox
> > > aggressive nomination.
> > > 
> > > But Firefox not responding to incoming binding requests is also bad. I'll
> > > try to check if I can find any obvious error in our code in that regards.
> > 
> > yes, the linus ICE stack won't send a binding response to a use-candidate
> > binding request until linus has received a binding response to it's own
> > non-use-candidate binding request. 
> 
> This sees like it's going to cause a problem with aggressive mode. Can you
> point to the part of RFC 5245 that says you should do this?

i make no claims that this is the right behavior or per spec, was just attempting to describe the current behavior.  however, in this particular case whether linus sends the binding response or not doesn't seem like it would matter.  if linus does not receive a binding response from firefox then ICE will not complete on the linus side and the call would still fail.

Byron Campen [:bwc]

Comment 9

•

9 years ago

Is there a packet capture from the device where firefox is running in this scenario?

Hank Peng

Comment 10

•

9 years ago

I am afraid there was no packet capture. Added David Benham to the CC list. I wonder whether the issue is still reproducible, so that we can get a packet capture.

Nils Ohlmeier [:drno]

Reporter

Comment 11

•

9 years ago

Yes, that would be super helpful. If not, I'll try to reproduce the problem sometime next week.

Randell Jesup [:jesup] (needinfo me)

Updated

•

9 years ago

backlog: --- → webrtc/webaudio+

Rank: 15

Priority: -- → P1

Maire Reavy [:mreavy]

Comment 12

•

9 years ago

I just talked with Nils.  He'll look at this in early January.

Assignee: nobody → drno

Nils Ohlmeier [:drno]

Reporter

Comment 13

•

8 years ago

I started talking to our IT department about getting a public IP address for trying to replicate this.

Nils Ohlmeier [:drno]

Reporter

Updated

•

8 years ago

backlog: webrtc/webaudio+ → ---

status-firefox45: affected → ---

Component: WebRTC: Networking → NetOps

Product: Core → Infrastructure & Operations

QA Contact: jbarnell

Version: Trunk → unspecified

Nils Ohlmeier [:drno]

Reporter

Comment 14

•

8 years ago

As briefly discussed via email and IRC, it would be great if I could get a public IP address temporarily patched in the MTV office onto outlet '2456 A' for testing this bug. I estimate that I would need it for a day at most, probably less.

Assignee: drno → jbircher

John B [:johnb]

Updated

•

8 years ago

Assignee: jbircher → network-operations

Group: mozilla-employee-confidential

Van Le [:van]

Comment 15

•

8 years ago

2456A -> switch2.df201-3.ops.mtv2.mozilla.net:3/0/39

Nils Ohlmeier [:drno]

Reporter

Comment 16

•

8 years ago

Do you guys have any idea when you could look into providing the public IP?
Our partner Cisco is regularly asking me for updates on this bug and we would like to be able to report progress. So any rough estimate would be highly appreciated.

Dave Miller [:justdave]

Comment 17

•

8 years ago

This will have to go through a security review and the Change Advisory Board to be implemented (because we don't currently have a way to deliver public IPs to the internal networking equipment, so it's a comprehensive change to the firewall config to make it possible), so this would probably be easier to handle in a separate bug with a dependency rather than polluting your code bug with the IP address stuff.  I would also suspect we're looking at at least a week and a half to get it implemented because of having to clear the above hurdles.

Nils Ohlmeier [:drno]

Reporter

Comment 18

•

8 years ago

(In reply to Dave Miller [:justdave] (justdave@bugzilla.org) from comment #17)
> ... so this would probably be easier
> to handle in a separate bug with a dependency rather than polluting your
> code bug with the IP address stuff.

Thanks Dave for the suggestion. I created bug 1249168 as a blocker for this bug, where we can discuss the internal details without further polluting this original bug report.

Group: mozilla-employee-confidential

Nils Ohlmeier [:drno]

Reporter

Comment 19

•

8 years ago

Looks like I should be able to test this with a public IP soon.

Nils Ohlmeier [:drno]

Reporter

Comment 20

•

8 years ago

Finally started testing this on a public IP.

Nils Ohlmeier [:drno]

Reporter

Comment 21

•

8 years ago

So I tested a lot of Spark calls from a machine on a public IP (no firewall whatsoever):
- with Firefox 42 and with a local build from today's latest sources (Nightly)
- with e10s on, with e10s off
- with ICE TCP on and ICE TCP off
- with direct calling and by joining an ongoing meeting
I have not been able to reproduce this problem at all.

The only thing which looked a little bit like this problem was that the very first call I made from the public IP did not get established. But in that case I had ICE logs turned on, and the PeerConnections simply got destroyed after gathering finished (I assume some JS code decided to hang up/give up on that call attempt). So that call never got as far as the call from the bug report.

So without further evidence or more log files I fear I have to close this as not reproducible.

Status: NEW → RESOLVED

Closed: 8 years ago

Resolution: --- → WORKSFORME

BMO Automation

Updated

•

2 years ago

Product: Infrastructure & Operations → Infrastructure & Operations Graveyard