Last Comment Bug 888274 - Upgrade to RFC 5389 STUN
: Upgrade to RFC 5389 STUN
Status: VERIFIED FIXED
[WebRTC][blocking-webrtc-]
:
Product: Core
Classification: Components
Component: WebRTC: Networking (show other bugs)
: unspecified
: All All
: -- normal (vote)
: mozilla25
Assigned To: Eric Rescorla (:ekr)
:
Mentors:
: 877701 (view as bug list)
Depends on:
Blocks: android-webrtc
  Show dependency treegraph
 
Reported: 2013-06-28 07:06 PDT by Eric Rescorla (:ekr)
Modified: 2013-08-20 16:20 PDT (History)
11 users (show)
See Also:
Crash Signature:
(edit)
QA Whiteboard:
Iteration: ---
Points: ---
Has Regression Range: ---
Has STR: ---
unaffected
+
verified
fixed
24+


Attachments
Convert to new STUN request (2.19 KB, patch)
2013-06-28 07:51 PDT, Eric Rescorla (:ekr)
no flags Details | Diff | Review
Merged patch to do 5389 and MAPPED-ADDRESS (5.26 KB, patch)
2013-07-25 18:16 PDT, Eric Rescorla (:ekr)
adam: review+
bajaj.bhavana: approval‑mozilla‑beta+
Details | Diff | Review
Code in Question (612.64 KB, text/plain)
2013-08-07 12:59 PDT, Jason Smith [:jsmith]
no flags Details

Description Eric Rescorla (:ekr) 2013-06-28 07:06:46 PDT
Firefox currently generates RFC 3489 backward-compatible STUN requests,
but it actually wants to receive a 5389 XOR-MAPPED-ADDRESS. This creates
problems with servers which actually do backward compatibility.

We should just upgrade to RFC5389, which is what Google does.

Try removing:

                    'USE_RFC_3489_BACKWARDS_COMPATIBLE',

from nicer.gyp
Comment 1 Eric Rescorla (:ekr) 2013-06-28 07:51:32 PDT
Created attachment 768958 [details] [diff] [review]
Convert to new STUN request
Comment 2 Eric Rescorla (:ekr) 2013-06-28 08:44:13 PDT
Update: this patch works with the Firefox STUN server but not the Google servers because Google is offering MAPPED-ADDRESS instead of XOR-MAPPED-ADDRESS. Investigations continue but I think this may be a problem on Google's side.
Comment 3 Timothy B. Terriberry (:derf) 2013-07-22 09:09:28 PDT
*** Bug 877701 has been marked as a duplicate of this bug. ***
Comment 4 Jason Smith [:jsmith] 2013-07-22 09:23:42 PDT
Adding a note here that one of the affected STUN servers is the following:

[{url:"stun:stun.l.google.com:19302"}]

Which causes the talky.io and tokbox demo to fail occasionally.

A couple of compatibility questions:

1. What is the web developer workaround here? Specifically, what STUN servers should we suggest web developers to use vs. not use due to this bug existing?

2. What's the breadth of STUN servers impacted here?
Comment 5 Jason Smith [:jsmith] 2013-07-22 10:39:09 PDT
Given that this is causing call failures on at least two WebRTC demos on fxAndroid to desktop, I'm adding the tracking-fennec flag ? that was on the dupe of this bug.
Comment 6 Eric Rescorla (:ekr) 2013-07-23 13:34:58 PDT
I have reached out to Google to see about them fixing their (noncompliant) server.

Possibilities:

1. They just fix it and we land the above patch.
2. They fix it but make it conditional on the SOFTWARE field, which we just add and set to nICEr or something.
3. They don't fix it and we land a patch that accepts MAPPED-ADDRESS.
Comment 7 Maire Reavy [:mreavy] (Plz ni?:mreavy) 2013-07-24 08:40:38 PDT
As Ekr says in comment 6, Google's server is noncompliant, and this issue is not a bug in our code (and it's definitely not specific to Android).
Comment 8 Eric Rescorla (:ekr) 2013-07-24 10:10:21 PDT
(In reply to Maire Reavy [:mreavy] from comment #7)
> As Ekr says in comment 6, Google's server is noncompliant, and this issue is
> not a bug in our code (and it's definitely not specific to Android).

This isn't quite correct.

There is a defect in our code in that we are offering 3489 STUN but not
accepting MAPPED-ADDRESS in response. (See the description).

However, even when this defect is fixed we will still have a problem with
Google's servers.
Comment 9 Maire Reavy [:mreavy] (Plz ni?:mreavy) 2013-07-24 10:15:34 PDT
(In reply to Eric Rescorla (:ekr) from comment #8)
> (In reply to Maire Reavy [:mreavy] from comment #7)
> > As Ekr says in comment 6, Google's server is noncompliant, and this issue is
> > not a bug in our code (and it's definitely not specific to Android).
> 
> This isn't quite correct.
> 
> There is a defect in our code in that we are offering 3489 STUN but not
> accepting MAPPED-ADDRESS in response. (See the description).
> 
> However, even when this defect is fixed we will still have a problem with
> Google's servers.

Thanks for clarifying. Referring to the possibilities in comment 6: How likely is Google to fix their server code?  How much time/effort is it to do #3 (land a patch that accepts MAPPED-ADDRESS)?
Comment 10 Jason Smith [:jsmith] 2013-07-24 10:17:33 PDT
Right, that was my understanding as well - there's a fix we need here on our side.

Also - although this isn't directly related to the Android code, it's impacting important demos for Android. The dupes demonstrate this is preventing usage of at least two WebRTC demos. That's enough of a case to block on. So I disagree on not blocking on this. We're getting going to negative reviews on Google Play if common WebRTC demos aren't working.
Comment 11 Eric Rescorla (:ekr) 2013-07-24 10:22:15 PDT
(In reply to Jason Smith [:jsmith] from comment #10)
> Right, that was my understanding as well - there's a fix we need here on our
> side.


No, that's actually not obviously true either. The STUN servers I
have seen (aside from Google) generally respond with XOR-MAPPED-ADDRESS.


> Also - although this isn't directly related to the Android code, it's
> impacting important demos for Android. The dupes demonstrate this is
> preventing usage of at least two WebRTC demos. That's enough of a case to
> block on. So I disagree on not blocking on this. We're getting going to
> negative reviews on Google Play if common WebRTC demos aren't working.

WebRTC for Android should not be preffed on in release builds in any case.

I don't really care about the flags here, but
Comment 12 Jason Smith [:jsmith] 2013-07-24 10:38:59 PDT
Let's take this to email with release managers & fxandroid drivers cc-ed at this point. I'll start the email thread shortly.
Comment 13 Eric Rescorla (:ekr) 2013-07-25 08:59:33 PDT
Update:
I have contacted Google and they are going to look into fixing their
server. If they don't fix it, I have a workaround ready to go.
Comment 14 Eric Rescorla (:ekr) 2013-07-25 18:16:06 PDT
Created attachment 781383 [details] [diff] [review]
Merged patch to do 5389 and MAPPED-ADDRESS
Comment 15 Erin Lancaster [:elan] 2013-08-02 09:06:33 PDT
Is there an update on this? If we don't have a fix from google, then we'll need your workaround to land before Monday 08/05. Thank you.
Comment 16 Eric Rescorla (:ekr) 2013-08-02 14:59:10 PDT
Google isn't going to fix this in a timely fashion. We need to land the patch above
Comment 17 Adam Roach [:abr] 2013-08-04 13:31:16 PDT
Comment on attachment 781383 [details] [diff] [review]
Merged patch to do 5389 and MAPPED-ADDRESS

Review of attachment 781383 [details] [diff] [review]:
-----------------------------------------------------------------

r+ with nits.

stun_client_ctx.c:577: tab character
stun_client_ctx.c:578: tab character
stun_client_ctx.c:579: tab character
stun_client_ctx.c:580: tab character
stun_client_ctx.c:581: tab character
stun_client_ctx.c:582: tab character
stun_client_ctx.c:583: tab character
stun_client_ctx.c:584: tab character
stun_proc.c:220: tab character

::: media/mtransport/third_party/nICEr/src/ice/ice_candidate.c
@@ +655,5 @@
>      switch(cand->u.srvrflx.stun->state){
>        /* OK, we should have a mapped address */
>        case NR_STUN_CLIENT_STATE_DONE:
>          /* Copy the address */
> +        nr_transport_addr_copy(&cand->addr, &cand->u.srvrflx.stun->results.stun_binding_response.mapped_addr);

This change appears to make "nr_stun_client_stun_binding_response_stund_0_96_results" effectively deadwood. I think you want to finish this off by stripping it out of stun_client_ctx.h, and changing its one reference in stun_client_ctx.c to use stun_binding_response instead.
Comment 18 Eric Rescorla (:ekr) 2013-08-04 17:31:57 PDT
(In reply to Adam Roach [:abr] (Traveling through Aug 5) from comment #17)
> Comment on attachment 781383 [details] [diff] [review]
> Merged patch to do 5389 and MAPPED-ADDRESS
> 
> Review of attachment 781383 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> r+ with nits.
> 
> stun_client_ctx.c:577: tab character
> stun_client_ctx.c:578: tab character
> stun_client_ctx.c:579: tab character
> stun_client_ctx.c:580: tab character
> stun_client_ctx.c:581: tab character
> stun_client_ctx.c:582: tab character
> stun_client_ctx.c:583: tab character
> stun_client_ctx.c:584: tab character
> stun_proc.c:220: tab character

Done


> ::: media/mtransport/third_party/nICEr/src/ice/ice_candidate.c
> @@ +655,5 @@
> >      switch(cand->u.srvrflx.stun->state){
> >        /* OK, we should have a mapped address */
> >        case NR_STUN_CLIENT_STATE_DONE:
> >          /* Copy the address */
> > +        nr_transport_addr_copy(&cand->addr, &cand->u.srvrflx.stun->results.stun_binding_response.mapped_addr);
> 
> This change appears to make
> "nr_stun_client_stun_binding_response_stund_0_96_results" effectively
> deadwood. I think you want to finish this off by stripping it out of
> stun_client_ctx.h, and changing its one reference in stun_client_ctx.c to
> use stun_binding_response instead.

I don't think this is right. STUN and ICE are separate modules here and
some other STUN user might want this.
Comment 19 Adam Roach [:abr] 2013-08-04 17:35:08 PDT
Okay, that makes sense.
Comment 21 Carsten Book [:Tomcat] 2013-08-05 03:09:05 PDT
https://hg.mozilla.org/mozilla-central/rev/2e3cdffc3e65
Comment 22 Maire Reavy [:mreavy] (Plz ni?:mreavy) 2013-08-05 08:22:50 PDT
Looking for QA to verify that this patch fixes the problems reported. If so, we'll request uplift to Fx24.  Thanks.
Comment 23 Jason Smith [:jsmith] 2013-08-05 13:27:28 PDT
Aaron & I can take a look at this tomorrow.
Comment 24 Aaron Train [:aaronmt] 2013-08-07 12:31:51 PDT
Jason and I ran through a variety of testing combinations and demos this afternoon and this was our findings

Builds/Config: 

* Nightly (08/07) on Desktop (Mac OSX 10.8.4 & Windows 7 64-Bit) and Android (4.3/4.2)
* Nexus 4 (Android 4.3), Samsung Galaxy Nexus (Android 4.2) 
* Residential Wi-Fi and Office Wi-Fi (Mozilla Network)

Talky.io:

* We both are not able to see connected clients to a session. This will need further investigation into the issue. As it stands, I am re-opening bug 877701.
* We noticed this error: 

[14:08:47.308] NS_ERROR_UNEXPECTED: Unexpected error @ https://talky.io/talky.0ecfcb9a.js:19385

TokBox (http://tokbox.com/opentok/webrtc/demo)

* Confirmed working using the WebRTC demo using Desktop and Android
* The demo did not work on 4G data connection for Jason; perhaps expected. We will check with a follow-up to see if Wi-Fi works out for him

AppRTC (https://apprtc.appspot.com)

* Confirmed working for us using the WebRTC demo using Desktop and Android

* The demo did not work on 4G data connection for Jason; perhaps expected?

* Call went through with Android phone on office WiFi (Mozilla) to Desktop Firefox on (Mozilla-G) WiFi, but performance was extremely slow for video/audio output

* Residential WiFi Desktop to Android on the same network was flawless

* Toronto to Mountain View (Firefox on Android to Firefox on Android), call went through but the session yielded too much latency and was generally slow

* Interop call between Chrome Beta for Android (Mountain View) to Firefox on Android (Toronto), call went through but there was still much latency but far less than Firefox on Android to Firefox on Android

* Mountain View call to Toronto from Desktop to Android on first attempt call failed to establish a connection. On second attempt the call was very smooth, little latency

* Toronto call to Mountain View with Firefox on Android to Chrome on Desktop, the call went through with little latency
Comment 25 Jason Smith [:jsmith] 2013-08-07 12:49:05 PDT
One additional note about the error below:

[14:08:47.308] NS_ERROR_UNEXPECTED: Unexpected error @ https://talky.io/talky.0ecfcb9a.js:19385

When I inspected the JS code, this appears to be failing during the creation of the peer connection object with the STUN server - [{url:"stun:stun.l.google.com:19302"}].

We need to deep dive to figure out why this is happening. bug 877701 has been reopened as a result.
Comment 26 Jason Smith [:jsmith] 2013-08-07 12:53:00 PDT
Just retested also with tokbox over wifi on desktop firefox --> firefox for android - call went through for me as well.
Comment 27 Eric Rescorla (:ekr) 2013-08-07 12:54:42 PDT
(In reply to Jason Smith [:jsmith] from comment #25)
> One additional note about the error below:
> 
> [14:08:47.308] NS_ERROR_UNEXPECTED: Unexpected error @
> https://talky.io/talky.0ecfcb9a.js:19385
> 
> When I inspected the JS code, this appears to be failing during the creation
> of the peer connection object with the STUN server -
> [{url:"stun:stun.l.google.com:19302"}].
> 
> We need to deep dive to figure out why this is happening. bug 877701 has
> been reopened as a result.

Hmmm.... Can you provide the entire JS? This seems like a separate issue.
Comment 28 Jason Smith [:jsmith] 2013-08-07 12:59:36 PDT
Created attachment 787079 [details]
Code in Question
Comment 29 Jason Smith [:jsmith] 2013-08-07 13:00:25 PDT
Also - looks like calls with talky.io are completely failing entirely, even on desktop firefox --> desktop firefox.

I get the feeling this patch might have broke talky.io entirely. We need a regression range to confirm.
Comment 30 Jason Smith [:jsmith] 2013-08-07 13:17:17 PDT
*** Bug 877701 has been marked as a duplicate of this bug. ***
Comment 31 Jason Smith [:jsmith] 2013-08-07 13:20:21 PDT
Did some regression digging around the talky.io failure - this appears to be broken on release 23 with firing an NS_ERROR_FAILURE on creation of the peer connection & the NS_ERROR_UNEXPECTED on nightly before and after this patch. So it looks like this is confirmed to be a different issue. I'll file a separate bug for it.
Comment 32 Jason Smith [:jsmith] 2013-08-07 13:55:11 PDT
Filed bug 902615 for the talky.io issue we've seen.
Comment 33 Jason Smith [:jsmith] 2013-08-07 14:52:24 PDT
I think the analysis above confirms this patch is okay to take to Beta.

We do need to do a separate analysis on what's happening on bug 902615. Additionally, we need to continue to study ways to improve performance of calls (especially in the mobile environment) and figure out faster paths to diagnose call failures based on the analysis done above.
Comment 34 Jason Smith [:jsmith] 2013-08-13 12:01:22 PDT
(In reply to Jason Smith [:jsmith] from comment #33)
> I think the analysis above confirms this patch is okay to take to Beta.
> 
> We do need to do a separate analysis on what's happening on bug 902615.
> Additionally, we need to continue to study ways to improve performance of
> calls (especially in the mobile environment) and figure out faster paths to
> diagnose call failures based on the analysis done above.

Update on this - we went back and did some more testing around this. We used a test app that made use of the affected STUN server directly to ensure that we were hitting the affected path. talky.io was unavailable to use due to another site bug found during testing. What we found is summarized below.

1. Call 1 - Desktop to FxAndroid

Result - Call failed - iceConnectionState indicated "failed"

2. Call 2 - FxAndroid to FxAndroid

Result - Call established - video streaming performance was acceptable, but audio was chopped up by each syllable.

3. Call 3 - FxAndroid to FxAndroid (followup to same session from call 2)

Result - Call failed - likely the issue being seen in bug 904598.

4. Call 4 - FxAndroid to FxAndroid (cold start of a new session)

Result - Call established - video streaming performance was bad and audio was chopped up by each syllable.

Conclusion - There's a lot of unpredictability here, which is leading QA to think that we should avoid taking the uplift here. The calls did go through on some cases, but the unpredictable results questions whether this is safe to uplift.
Comment 35 Eric Rescorla (:ekr) 2013-08-13 12:15:43 PDT
Can you please compare this to beta without this patch. I suppose it's possible this would cause problems with Android video once it is established, but it seems rather unlikely.
Comment 36 Jason Smith [:jsmith] 2013-08-13 14:20:10 PDT
(In reply to Eric Rescorla (:ekr) from comment #35)
> Can you please compare this to beta without this patch. I suppose it's
> possible this would cause problems with Android video once it is
> established, but it seems rather unlikely.

When this was tested before this patch landed, the apprtc performance issues seen were already present in that build. Doing calls with the affected stun server also failed 100%.

Now we see the same performance issues on apprtc. Except doing calls with the affected stun server appears to establish on FxAndroid --> FxAndroid. So this is an improvement.
Comment 37 Eric Rescorla (:ekr) 2013-08-13 14:23:16 PDT
(In reply to Jason Smith [:jsmith] from comment #36)
> (In reply to Eric Rescorla (:ekr) from comment #35)
> > Can you please compare this to beta without this patch. I suppose it's
> > possible this would cause problems with Android video once it is
> > established, but it seems rather unlikely.
> 
> When this was tested before this patch landed, the apprtc performance issues
> seen were already present in that build. Doing calls with the affected stun
> server also failed 100%.
> 
> Now we see the same performance issues on apprtc. Except doing calls with
> the affected stun server appears to establish on FxAndroid --> FxAndroid. So
> this is an improvement.

OK, so since this is a connectivity problem on desktop, this seems like
that's not enough of an argument against uplift....
Comment 38 Jason Smith [:jsmith] 2013-08-13 14:29:44 PDT
Yeah, now that I think about this again, this is actually more favorable to take the patch given that we do get FxAndroid --> FxAndroid calls established with no other regressions seen.
Comment 39 bhavana bajaj [:bajaj] 2013-08-19 09:52:35 PDT
:ekr,

Are we planning on uplifting this ? If yes, that should happen sooner than later to get maximum beta testing .
Comment 40 bhavana bajaj [:bajaj] 2013-08-19 11:14:22 PDT
(In reply to bhavana bajaj [:bajaj] from comment #39)
> :ekr,
> 
> Are we planning on uplifting this ? If yes, that should happen sooner than
> later to get maximum beta testing .

Please note we are planned to gtb for our next beta this afternoon PT (noon- 1PM) hence landing this asap will be helpful as Maire just confirmed that we are certain that this needs an uplift.
Comment 41 Eric Rescorla (:ekr) 2013-08-19 11:21:13 PDT
I'm fine with uplifting it. What button do I push?
Comment 42 Eric Rescorla (:ekr) 2013-08-19 11:28:29 PDT
Comment on attachment 781383 [details] [diff] [review]
Merged patch to do 5389 and MAPPED-ADDRESS

[Approval Request Comment]
Bug caused by (feature/regressing bug #): This has been there since the beginning.

User impact if declined:
Failure during calls between two machines that are behind different NATs on sites which use Google's STUN servers.


Testing completed (on m-c, etc.): 
See above in bug.


Risk to taking this patch (and alternatives if risky):
Most probable risk is that some other server behaves in another unpredictable way and so some other calls don't work. Aside from that, the usual risks in taking any patch.

String or IDL/UUID changes made by this patch:
None.
Comment 43 bhavana bajaj [:bajaj] 2013-08-19 11:30:53 PDT
Comment on attachment 781383 [details] [diff] [review]
Merged patch to do 5389 and MAPPED-ADDRESS

Given we do not want ship android webrtc with this bug, approving this on beta given QA verification.

Please let QA know if there is any additional testing that may be needed here .
Comment 44 Randell Jesup [:jesup] 2013-08-19 14:08:40 PDT
https://hg.mozilla.org/releases/mozilla-beta/rev/cadb54f63c80
Comment 45 Aaron Train [:aaronmt] 2013-08-20 16:20:01 PDT
I've verified that what I see on mozilla-beta with this patch on the latest candidate matches the results as seen on mozilla-central that Jason and I conducted last week.

Note You need to log in before you can comment on or make changes to this bug.