Upgrade to RFC 5389 STUN

VERIFIED FIXED in Firefox 24

Status

()

Core
WebRTC: Networking
VERIFIED FIXED
4 years ago
4 years ago

People

(Reporter: ekr, Assigned: ekr)

Tracking

(Blocks: 1 bug)

unspecified
mozilla25
Points:
---

Firefox Tracking Flags

(firefox23 unaffected, firefox24+ verified, firefox25 fixed, fennec24+)

Details

(Whiteboard: [WebRTC][blocking-webrtc-])

Attachments

(2 attachments, 1 obsolete attachment)

(Assignee)

Description

4 years ago
Firefox currently generates RFC 3489 backward-compatible STUN requests,
but it actually wants to receive a 5389 XOR-MAPPED-ADDRESS. This creates
problems with servers which actually do backward compatibility.

We should just upgrade to RFC5389, which is what Google does.

Try removing:

                    'USE_RFC_3489_BACKWARDS_COMPATIBLE',

from nicer.gyp
(Assignee)

Comment 1

4 years ago
Created attachment 768958 [details] [diff] [review]
Convert to new STUN request
(Assignee)

Updated

4 years ago
Assignee: nobody → ekr
(Assignee)

Comment 2

4 years ago
Update: this patch works with the Firefox STUN server but not the Google servers because Google is offering MAPPED-ADDRESS instead of XOR-MAPPED-ADDRESS. Investigations continue but I think this may be a problem on Google's side.

Updated

4 years ago
OS: Mac OS X → All
Hardware: x86 → All
Whiteboard: [WebRTC] [blocking-webrtc-]
Duplicate of this bug: 877701
Adding a note here that one of the affected STUN servers is the following:

[{url:"stun:stun.l.google.com:19302"}]

Which causes the talky.io and tokbox demo to fail occasionally.

A couple of compatibility questions:

1. What is the web developer workaround here? Specifically, what STUN servers should we suggest web developers to use vs. not use due to this bug existing?

2. What's the breadth of STUN servers impacted here?
Given that this is causing call failures on at least two WebRTC demos on fxAndroid to desktop, I'm adding the tracking-fennec flag ? that was on the dupe of this bug.
Blocks: 750010
tracking-fennec: --- → ?

Updated

4 years ago
Whiteboard: [WebRTC] [blocking-webrtc-] → [WebRTC][blocking-webrtc-][android-webrtc+]

Updated

4 years ago
tracking-firefox24: --- → ?

Updated

4 years ago
status-firefox23: --- → unaffected
status-firefox24: --- → affected
status-firefox25: --- → affected
tracking-firefox24: ? → +
(Assignee)

Comment 6

4 years ago
I have reached out to Google to see about them fixing their (noncompliant) server.

Possibilities:

1. They just fix it and we land the above patch.
2. They fix it but make it conditional on the SOFTWARE field, which we just add and set to nICEr or something.
3. They don't fix it and we land a patch that accepts MAPPED-ADDRESS.
As Ekr says in comment 6, Google's server is noncompliant, and this issue is not a bug in our code (and it's definitely not specific to Android).
Whiteboard: [WebRTC][blocking-webrtc-][android-webrtc+] → [WebRTC][blocking-webrtc-][android-webrtc-]
(Assignee)

Comment 8

4 years ago
(In reply to Maire Reavy [:mreavy] from comment #7)
> As Ekr says in comment 6, Google's server is noncompliant, and this issue is
> not a bug in our code (and it's definitely not specific to Android).

This isn't quite correct.

There is a defect in our code in that we are offering 3489 STUN but not
accepting MAPPED-ADDRESS in response. (See the description).

However, even when this defect is fixed we will still have a problem with
Google's servers.
(In reply to Eric Rescorla (:ekr) from comment #8)
> (In reply to Maire Reavy [:mreavy] from comment #7)
> > As Ekr says in comment 6, Google's server is noncompliant, and this issue is
> > not a bug in our code (and it's definitely not specific to Android).
> 
> This isn't quite correct.
> 
> There is a defect in our code in that we are offering 3489 STUN but not
> accepting MAPPED-ADDRESS in response. (See the description).
> 
> However, even when this defect is fixed we will still have a problem with
> Google's servers.

Thanks for clarifying. Referring to the possibilities in comment 6: How likely is Google to fix their server code?  How much time/effort is it to do #3 (land a patch that accepts MAPPED-ADDRESS)?
Right, that was my understanding as well - there's a fix we need here on our side.

Also - although this isn't directly related to the Android code, it's impacting important demos for Android. The dupes demonstrate this is preventing usage of at least two WebRTC demos. That's enough of a case to block on. So I disagree on not blocking on this. We're getting going to negative reviews on Google Play if common WebRTC demos aren't working.
Whiteboard: [WebRTC][blocking-webrtc-][android-webrtc-] → [WebRTC][blocking-webrtc-][android-webrtc+]
(Assignee)

Comment 11

4 years ago
(In reply to Jason Smith [:jsmith] from comment #10)
> Right, that was my understanding as well - there's a fix we need here on our
> side.


No, that's actually not obviously true either. The STUN servers I
have seen (aside from Google) generally respond with XOR-MAPPED-ADDRESS.


> Also - although this isn't directly related to the Android code, it's
> impacting important demos for Android. The dupes demonstrate this is
> preventing usage of at least two WebRTC demos. That's enough of a case to
> block on. So I disagree on not blocking on this. We're getting going to
> negative reviews on Google Play if common WebRTC demos aren't working.

WebRTC for Android should not be preffed on in release builds in any case.

I don't really care about the flags here, but
Let's take this to email with release managers & fxandroid drivers cc-ed at this point. I'll start the email thread shortly.

Updated

4 years ago
Whiteboard: [WebRTC][blocking-webrtc-][android-webrtc+] → [WebRTC][blocking-webrtc-][android-webrtc?]
(Assignee)

Comment 13

4 years ago
Update:
I have contacted Google and they are going to look into fixing their
server. If they don't fix it, I have a workaround ready to go.
tracking-fennec: ? → 24+
(Assignee)

Comment 14

4 years ago
Created attachment 781383 [details] [diff] [review]
Merged patch to do 5389 and MAPPED-ADDRESS
(Assignee)

Updated

4 years ago
Attachment #768958 - Attachment is obsolete: true
(Assignee)

Updated

4 years ago
Attachment #781383 - Flags: review?(adam)
Is there an update on this? If we don't have a fix from google, then we'll need your workaround to land before Monday 08/05. Thank you.
(Assignee)

Comment 16

4 years ago
Google isn't going to fix this in a timely fashion. We need to land the patch above

Comment 17

4 years ago
Comment on attachment 781383 [details] [diff] [review]
Merged patch to do 5389 and MAPPED-ADDRESS

Review of attachment 781383 [details] [diff] [review]:
-----------------------------------------------------------------

r+ with nits.

stun_client_ctx.c:577: tab character
stun_client_ctx.c:578: tab character
stun_client_ctx.c:579: tab character
stun_client_ctx.c:580: tab character
stun_client_ctx.c:581: tab character
stun_client_ctx.c:582: tab character
stun_client_ctx.c:583: tab character
stun_client_ctx.c:584: tab character
stun_proc.c:220: tab character

::: media/mtransport/third_party/nICEr/src/ice/ice_candidate.c
@@ +655,5 @@
>      switch(cand->u.srvrflx.stun->state){
>        /* OK, we should have a mapped address */
>        case NR_STUN_CLIENT_STATE_DONE:
>          /* Copy the address */
> +        nr_transport_addr_copy(&cand->addr, &cand->u.srvrflx.stun->results.stun_binding_response.mapped_addr);

This change appears to make "nr_stun_client_stun_binding_response_stund_0_96_results" effectively deadwood. I think you want to finish this off by stripping it out of stun_client_ctx.h, and changing its one reference in stun_client_ctx.c to use stun_binding_response instead.
Attachment #781383 - Flags: review?(adam) → review+
(Assignee)

Comment 18

4 years ago
(In reply to Adam Roach [:abr] (Traveling through Aug 5) from comment #17)
> Comment on attachment 781383 [details] [diff] [review]
> Merged patch to do 5389 and MAPPED-ADDRESS
> 
> Review of attachment 781383 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> r+ with nits.
> 
> stun_client_ctx.c:577: tab character
> stun_client_ctx.c:578: tab character
> stun_client_ctx.c:579: tab character
> stun_client_ctx.c:580: tab character
> stun_client_ctx.c:581: tab character
> stun_client_ctx.c:582: tab character
> stun_client_ctx.c:583: tab character
> stun_client_ctx.c:584: tab character
> stun_proc.c:220: tab character

Done


> ::: media/mtransport/third_party/nICEr/src/ice/ice_candidate.c
> @@ +655,5 @@
> >      switch(cand->u.srvrflx.stun->state){
> >        /* OK, we should have a mapped address */
> >        case NR_STUN_CLIENT_STATE_DONE:
> >          /* Copy the address */
> > +        nr_transport_addr_copy(&cand->addr, &cand->u.srvrflx.stun->results.stun_binding_response.mapped_addr);
> 
> This change appears to make
> "nr_stun_client_stun_binding_response_stund_0_96_results" effectively
> deadwood. I think you want to finish this off by stripping it out of
> stun_client_ctx.h, and changing its one reference in stun_client_ctx.c to
> use stun_binding_response instead.

I don't think this is right. STUN and ICE are separate modules here and
some other STUN user might want this.

Comment 19

4 years ago
Okay, that makes sense.
(Assignee)

Comment 20

4 years ago
https://hg.mozilla.org/integration/mozilla-inbound/rev/2e3cdffc3e65
https://hg.mozilla.org/mozilla-central/rev/2e3cdffc3e65
Status: NEW → RESOLVED
Last Resolved: 4 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla25

Updated

4 years ago
status-firefox25: affected → fixed
Looking for QA to verify that this patch fixes the problems reported. If so, we'll request uplift to Fx24.  Thanks.
Keywords: verifyme
Aaron & I can take a look at this tomorrow.
Whiteboard: [WebRTC][blocking-webrtc-][android-webrtc?] → [WebRTC][blocking-webrtc-][android-webrtc?][webrtc-uplift]
Jason and I ran through a variety of testing combinations and demos this afternoon and this was our findings

Builds/Config: 

* Nightly (08/07) on Desktop (Mac OSX 10.8.4 & Windows 7 64-Bit) and Android (4.3/4.2)
* Nexus 4 (Android 4.3), Samsung Galaxy Nexus (Android 4.2) 
* Residential Wi-Fi and Office Wi-Fi (Mozilla Network)

Talky.io:

* We both are not able to see connected clients to a session. This will need further investigation into the issue. As it stands, I am re-opening bug 877701.
* We noticed this error: 

[14:08:47.308] NS_ERROR_UNEXPECTED: Unexpected error @ https://talky.io/talky.0ecfcb9a.js:19385

TokBox (http://tokbox.com/opentok/webrtc/demo)

* Confirmed working using the WebRTC demo using Desktop and Android
* The demo did not work on 4G data connection for Jason; perhaps expected. We will check with a follow-up to see if Wi-Fi works out for him

AppRTC (https://apprtc.appspot.com)

* Confirmed working for us using the WebRTC demo using Desktop and Android

* The demo did not work on 4G data connection for Jason; perhaps expected?

* Call went through with Android phone on office WiFi (Mozilla) to Desktop Firefox on (Mozilla-G) WiFi, but performance was extremely slow for video/audio output

* Residential WiFi Desktop to Android on the same network was flawless

* Toronto to Mountain View (Firefox on Android to Firefox on Android), call went through but the session yielded too much latency and was generally slow

* Interop call between Chrome Beta for Android (Mountain View) to Firefox on Android (Toronto), call went through but there was still much latency but far less than Firefox on Android to Firefox on Android

* Mountain View call to Toronto from Desktop to Android on first attempt call failed to establish a connection. On second attempt the call was very smooth, little latency

* Toronto call to Mountain View with Firefox on Android to Chrome on Desktop, the call went through with little latency
One additional note about the error below:

[14:08:47.308] NS_ERROR_UNEXPECTED: Unexpected error @ https://talky.io/talky.0ecfcb9a.js:19385

When I inspected the JS code, this appears to be failing during the creation of the peer connection object with the STUN server - [{url:"stun:stun.l.google.com:19302"}].

We need to deep dive to figure out why this is happening. bug 877701 has been reopened as a result.
Just retested also with tokbox over wifi on desktop firefox --> firefox for android - call went through for me as well.
(Assignee)

Comment 27

4 years ago
(In reply to Jason Smith [:jsmith] from comment #25)
> One additional note about the error below:
> 
> [14:08:47.308] NS_ERROR_UNEXPECTED: Unexpected error @
> https://talky.io/talky.0ecfcb9a.js:19385
> 
> When I inspected the JS code, this appears to be failing during the creation
> of the peer connection object with the STUN server -
> [{url:"stun:stun.l.google.com:19302"}].
> 
> We need to deep dive to figure out why this is happening. bug 877701 has
> been reopened as a result.

Hmmm.... Can you provide the entire JS? This seems like a separate issue.
Created attachment 787079 [details]
Code in Question
Also - looks like calls with talky.io are completely failing entirely, even on desktop firefox --> desktop firefox.

I get the feeling this patch might have broke talky.io entirely. We need a regression range to confirm.

Updated

4 years ago
Duplicate of this bug: 877701
Did some regression digging around the talky.io failure - this appears to be broken on release 23 with firing an NS_ERROR_FAILURE on creation of the peer connection & the NS_ERROR_UNEXPECTED on nightly before and after this patch. So it looks like this is confirmed to be a different issue. I'll file a separate bug for it.
Filed bug 902615 for the talky.io issue we've seen.
I think the analysis above confirms this patch is okay to take to Beta.

We do need to do a separate analysis on what's happening on bug 902615. Additionally, we need to continue to study ways to improve performance of calls (especially in the mobile environment) and figure out faster paths to diagnose call failures based on the analysis done above.
Status: RESOLVED → VERIFIED
Keywords: verifyme
(In reply to Jason Smith [:jsmith] from comment #33)
> I think the analysis above confirms this patch is okay to take to Beta.
> 
> We do need to do a separate analysis on what's happening on bug 902615.
> Additionally, we need to continue to study ways to improve performance of
> calls (especially in the mobile environment) and figure out faster paths to
> diagnose call failures based on the analysis done above.

Update on this - we went back and did some more testing around this. We used a test app that made use of the affected STUN server directly to ensure that we were hitting the affected path. talky.io was unavailable to use due to another site bug found during testing. What we found is summarized below.

1. Call 1 - Desktop to FxAndroid

Result - Call failed - iceConnectionState indicated "failed"

2. Call 2 - FxAndroid to FxAndroid

Result - Call established - video streaming performance was acceptable, but audio was chopped up by each syllable.

3. Call 3 - FxAndroid to FxAndroid (followup to same session from call 2)

Result - Call failed - likely the issue being seen in bug 904598.

4. Call 4 - FxAndroid to FxAndroid (cold start of a new session)

Result - Call established - video streaming performance was bad and audio was chopped up by each syllable.

Conclusion - There's a lot of unpredictability here, which is leading QA to think that we should avoid taking the uplift here. The calls did go through on some cases, but the unpredictable results questions whether this is safe to uplift.
(Assignee)

Comment 35

4 years ago
Can you please compare this to beta without this patch. I suppose it's possible this would cause problems with Android video once it is established, but it seems rather unlikely.
(In reply to Eric Rescorla (:ekr) from comment #35)
> Can you please compare this to beta without this patch. I suppose it's
> possible this would cause problems with Android video once it is
> established, but it seems rather unlikely.

When this was tested before this patch landed, the apprtc performance issues seen were already present in that build. Doing calls with the affected stun server also failed 100%.

Now we see the same performance issues on apprtc. Except doing calls with the affected stun server appears to establish on FxAndroid --> FxAndroid. So this is an improvement.
(Assignee)

Comment 37

4 years ago
(In reply to Jason Smith [:jsmith] from comment #36)
> (In reply to Eric Rescorla (:ekr) from comment #35)
> > Can you please compare this to beta without this patch. I suppose it's
> > possible this would cause problems with Android video once it is
> > established, but it seems rather unlikely.
> 
> When this was tested before this patch landed, the apprtc performance issues
> seen were already present in that build. Doing calls with the affected stun
> server also failed 100%.
> 
> Now we see the same performance issues on apprtc. Except doing calls with
> the affected stun server appears to establish on FxAndroid --> FxAndroid. So
> this is an improvement.

OK, so since this is a connectivity problem on desktop, this seems like
that's not enough of an argument against uplift....
Yeah, now that I think about this again, this is actually more favorable to take the patch given that we do get FxAndroid --> FxAndroid calls established with no other regressions seen.
:ekr,

Are we planning on uplifting this ? If yes, that should happen sooner than later to get maximum beta testing .
Flags: needinfo?(ekr)
(In reply to bhavana bajaj [:bajaj] from comment #39)
> :ekr,
> 
> Are we planning on uplifting this ? If yes, that should happen sooner than
> later to get maximum beta testing .

Please note we are planned to gtb for our next beta this afternoon PT (noon- 1PM) hence landing this asap will be helpful as Maire just confirmed that we are certain that this needs an uplift.
(Assignee)

Comment 41

4 years ago
I'm fine with uplifting it. What button do I push?
Flags: needinfo?(ekr)
(Assignee)

Comment 42

4 years ago
Comment on attachment 781383 [details] [diff] [review]
Merged patch to do 5389 and MAPPED-ADDRESS

[Approval Request Comment]
Bug caused by (feature/regressing bug #): This has been there since the beginning.

User impact if declined:
Failure during calls between two machines that are behind different NATs on sites which use Google's STUN servers.


Testing completed (on m-c, etc.): 
See above in bug.


Risk to taking this patch (and alternatives if risky):
Most probable risk is that some other server behaves in another unpredictable way and so some other calls don't work. Aside from that, the usual risks in taking any patch.

String or IDL/UUID changes made by this patch:
None.
Attachment #781383 - Flags: approval-mozilla-beta?
Comment on attachment 781383 [details] [diff] [review]
Merged patch to do 5389 and MAPPED-ADDRESS

Given we do not want ship android webrtc with this bug, approving this on beta given QA verification.

Please let QA know if there is any additional testing that may be needed here .
Attachment #781383 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
https://hg.mozilla.org/releases/mozilla-beta/rev/cadb54f63c80
status-firefox24: affected → fixed
Whiteboard: [WebRTC][blocking-webrtc-][android-webrtc?][webrtc-uplift] → [WebRTC][blocking-webrtc-]
I've verified that what I see on mozilla-beta with this patch on the latest candidate matches the results as seen on mozilla-central that Jason and I conducted last week.
status-firefox24: fixed → verified
You need to log in before you can comment on or make changes to this bug.