Closed Bug 888274 Opened 11 years ago Closed 11 years ago

Upgrade to RFC 5389 STUN

Categories

(Core :: WebRTC: Networking, defect)

defect
Not set
normal

Tracking

()

VERIFIED FIXED
mozilla25
Tracking Status
firefox23 --- unaffected
firefox24 + verified
firefox25 --- fixed
fennec 24+ ---

People

(Reporter: ekr, Assigned: ekr)

References

Details

(Whiteboard: [WebRTC][blocking-webrtc-])

Attachments

(2 files, 1 obsolete file)

Firefox currently generates RFC 3489 backward-compatible STUN requests,
but it actually wants to receive a 5389 XOR-MAPPED-ADDRESS. This creates
problems with servers which actually do backward compatibility.

We should just upgrade to RFC5389, which is what Google does.

Try removing:

                    'USE_RFC_3489_BACKWARDS_COMPATIBLE',

from nicer.gyp
Attached patch Convert to new STUN request (obsolete) — Splinter Review
Assignee: nobody → ekr
Update: this patch works with the Firefox STUN server but not the Google servers because Google is offering MAPPED-ADDRESS instead of XOR-MAPPED-ADDRESS. Investigations continue but I think this may be a problem on Google's side.
OS: Mac OS X → All
Hardware: x86 → All
Whiteboard: [WebRTC] [blocking-webrtc-]
Adding a note here that one of the affected STUN servers is the following:

[{url:"stun:stun.l.google.com:19302"}]

Which causes the talky.io and tokbox demo to fail occasionally.

A couple of compatibility questions:

1. What is the web developer workaround here? Specifically, what STUN servers should we suggest web developers to use vs. not use due to this bug existing?

2. What's the breadth of STUN servers impacted here?
Given that this is causing call failures on at least two WebRTC demos on fxAndroid to desktop, I'm adding the tracking-fennec flag ? that was on the dupe of this bug.
tracking-fennec: --- → ?
Whiteboard: [WebRTC] [blocking-webrtc-] → [WebRTC][blocking-webrtc-][android-webrtc+]
I have reached out to Google to see about them fixing their (noncompliant) server.

Possibilities:

1. They just fix it and we land the above patch.
2. They fix it but make it conditional on the SOFTWARE field, which we just add and set to nICEr or something.
3. They don't fix it and we land a patch that accepts MAPPED-ADDRESS.
As Ekr says in comment 6, Google's server is noncompliant, and this issue is not a bug in our code (and it's definitely not specific to Android).
Whiteboard: [WebRTC][blocking-webrtc-][android-webrtc+] → [WebRTC][blocking-webrtc-][android-webrtc-]
(In reply to Maire Reavy [:mreavy] from comment #7)
> As Ekr says in comment 6, Google's server is noncompliant, and this issue is
> not a bug in our code (and it's definitely not specific to Android).

This isn't quite correct.

There is a defect in our code in that we are offering 3489 STUN but not
accepting MAPPED-ADDRESS in response. (See the description).

However, even when this defect is fixed we will still have a problem with
Google's servers.
(In reply to Eric Rescorla (:ekr) from comment #8)
> (In reply to Maire Reavy [:mreavy] from comment #7)
> > As Ekr says in comment 6, Google's server is noncompliant, and this issue is
> > not a bug in our code (and it's definitely not specific to Android).
> 
> This isn't quite correct.
> 
> There is a defect in our code in that we are offering 3489 STUN but not
> accepting MAPPED-ADDRESS in response. (See the description).
> 
> However, even when this defect is fixed we will still have a problem with
> Google's servers.

Thanks for clarifying. Referring to the possibilities in comment 6: How likely is Google to fix their server code?  How much time/effort is it to do #3 (land a patch that accepts MAPPED-ADDRESS)?
Right, that was my understanding as well - there's a fix we need here on our side.

Also - although this isn't directly related to the Android code, it's impacting important demos for Android. The dupes demonstrate this is preventing usage of at least two WebRTC demos. That's enough of a case to block on. So I disagree on not blocking on this. We're getting going to negative reviews on Google Play if common WebRTC demos aren't working.
Whiteboard: [WebRTC][blocking-webrtc-][android-webrtc-] → [WebRTC][blocking-webrtc-][android-webrtc+]
(In reply to Jason Smith [:jsmith] from comment #10)
> Right, that was my understanding as well - there's a fix we need here on our
> side.


No, that's actually not obviously true either. The STUN servers I
have seen (aside from Google) generally respond with XOR-MAPPED-ADDRESS.


> Also - although this isn't directly related to the Android code, it's
> impacting important demos for Android. The dupes demonstrate this is
> preventing usage of at least two WebRTC demos. That's enough of a case to
> block on. So I disagree on not blocking on this. We're getting going to
> negative reviews on Google Play if common WebRTC demos aren't working.

WebRTC for Android should not be preffed on in release builds in any case.

I don't really care about the flags here, but
Let's take this to email with release managers & fxandroid drivers cc-ed at this point. I'll start the email thread shortly.
Whiteboard: [WebRTC][blocking-webrtc-][android-webrtc+] → [WebRTC][blocking-webrtc-][android-webrtc?]
Update:
I have contacted Google and they are going to look into fixing their
server. If they don't fix it, I have a workaround ready to go.
tracking-fennec: ? → 24+
Attachment #768958 - Attachment is obsolete: true
Attachment #781383 - Flags: review?(adam)
Is there an update on this? If we don't have a fix from google, then we'll need your workaround to land before Monday 08/05. Thank you.
Google isn't going to fix this in a timely fashion. We need to land the patch above
Comment on attachment 781383 [details] [diff] [review]
Merged patch to do 5389 and MAPPED-ADDRESS

Review of attachment 781383 [details] [diff] [review]:
-----------------------------------------------------------------

r+ with nits.

stun_client_ctx.c:577: tab character
stun_client_ctx.c:578: tab character
stun_client_ctx.c:579: tab character
stun_client_ctx.c:580: tab character
stun_client_ctx.c:581: tab character
stun_client_ctx.c:582: tab character
stun_client_ctx.c:583: tab character
stun_client_ctx.c:584: tab character
stun_proc.c:220: tab character

::: media/mtransport/third_party/nICEr/src/ice/ice_candidate.c
@@ +655,5 @@
>      switch(cand->u.srvrflx.stun->state){
>        /* OK, we should have a mapped address */
>        case NR_STUN_CLIENT_STATE_DONE:
>          /* Copy the address */
> +        nr_transport_addr_copy(&cand->addr, &cand->u.srvrflx.stun->results.stun_binding_response.mapped_addr);

This change appears to make "nr_stun_client_stun_binding_response_stund_0_96_results" effectively deadwood. I think you want to finish this off by stripping it out of stun_client_ctx.h, and changing its one reference in stun_client_ctx.c to use stun_binding_response instead.
Attachment #781383 - Flags: review?(adam) → review+
(In reply to Adam Roach [:abr] (Traveling through Aug 5) from comment #17)
> Comment on attachment 781383 [details] [diff] [review]
> Merged patch to do 5389 and MAPPED-ADDRESS
> 
> Review of attachment 781383 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> r+ with nits.
> 
> stun_client_ctx.c:577: tab character
> stun_client_ctx.c:578: tab character
> stun_client_ctx.c:579: tab character
> stun_client_ctx.c:580: tab character
> stun_client_ctx.c:581: tab character
> stun_client_ctx.c:582: tab character
> stun_client_ctx.c:583: tab character
> stun_client_ctx.c:584: tab character
> stun_proc.c:220: tab character

Done


> ::: media/mtransport/third_party/nICEr/src/ice/ice_candidate.c
> @@ +655,5 @@
> >      switch(cand->u.srvrflx.stun->state){
> >        /* OK, we should have a mapped address */
> >        case NR_STUN_CLIENT_STATE_DONE:
> >          /* Copy the address */
> > +        nr_transport_addr_copy(&cand->addr, &cand->u.srvrflx.stun->results.stun_binding_response.mapped_addr);
> 
> This change appears to make
> "nr_stun_client_stun_binding_response_stund_0_96_results" effectively
> deadwood. I think you want to finish this off by stripping it out of
> stun_client_ctx.h, and changing its one reference in stun_client_ctx.c to
> use stun_binding_response instead.

I don't think this is right. STUN and ICE are separate modules here and
some other STUN user might want this.
Okay, that makes sense.
https://hg.mozilla.org/mozilla-central/rev/2e3cdffc3e65
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla25
Looking for QA to verify that this patch fixes the problems reported. If so, we'll request uplift to Fx24.  Thanks.
Keywords: verifyme
Aaron & I can take a look at this tomorrow.
Whiteboard: [WebRTC][blocking-webrtc-][android-webrtc?] → [WebRTC][blocking-webrtc-][android-webrtc?][webrtc-uplift]
Jason and I ran through a variety of testing combinations and demos this afternoon and this was our findings

Builds/Config: 

* Nightly (08/07) on Desktop (Mac OSX 10.8.4 & Windows 7 64-Bit) and Android (4.3/4.2)
* Nexus 4 (Android 4.3), Samsung Galaxy Nexus (Android 4.2) 
* Residential Wi-Fi and Office Wi-Fi (Mozilla Network)

Talky.io:

* We both are not able to see connected clients to a session. This will need further investigation into the issue. As it stands, I am re-opening bug 877701.
* We noticed this error: 

[14:08:47.308] NS_ERROR_UNEXPECTED: Unexpected error @ https://talky.io/talky.0ecfcb9a.js:19385

TokBox (http://tokbox.com/opentok/webrtc/demo)

* Confirmed working using the WebRTC demo using Desktop and Android
* The demo did not work on 4G data connection for Jason; perhaps expected. We will check with a follow-up to see if Wi-Fi works out for him

AppRTC (https://apprtc.appspot.com)

* Confirmed working for us using the WebRTC demo using Desktop and Android

* The demo did not work on 4G data connection for Jason; perhaps expected?

* Call went through with Android phone on office WiFi (Mozilla) to Desktop Firefox on (Mozilla-G) WiFi, but performance was extremely slow for video/audio output

* Residential WiFi Desktop to Android on the same network was flawless

* Toronto to Mountain View (Firefox on Android to Firefox on Android), call went through but the session yielded too much latency and was generally slow

* Interop call between Chrome Beta for Android (Mountain View) to Firefox on Android (Toronto), call went through but there was still much latency but far less than Firefox on Android to Firefox on Android

* Mountain View call to Toronto from Desktop to Android on first attempt call failed to establish a connection. On second attempt the call was very smooth, little latency

* Toronto call to Mountain View with Firefox on Android to Chrome on Desktop, the call went through with little latency
One additional note about the error below:

[14:08:47.308] NS_ERROR_UNEXPECTED: Unexpected error @ https://talky.io/talky.0ecfcb9a.js:19385

When I inspected the JS code, this appears to be failing during the creation of the peer connection object with the STUN server - [{url:"stun:stun.l.google.com:19302"}].

We need to deep dive to figure out why this is happening. bug 877701 has been reopened as a result.
Just retested also with tokbox over wifi on desktop firefox --> firefox for android - call went through for me as well.
(In reply to Jason Smith [:jsmith] from comment #25)
> One additional note about the error below:
> 
> [14:08:47.308] NS_ERROR_UNEXPECTED: Unexpected error @
> https://talky.io/talky.0ecfcb9a.js:19385
> 
> When I inspected the JS code, this appears to be failing during the creation
> of the peer connection object with the STUN server -
> [{url:"stun:stun.l.google.com:19302"}].
> 
> We need to deep dive to figure out why this is happening. bug 877701 has
> been reopened as a result.

Hmmm.... Can you provide the entire JS? This seems like a separate issue.
Attached file Code in Question
Also - looks like calls with talky.io are completely failing entirely, even on desktop firefox --> desktop firefox.

I get the feeling this patch might have broke talky.io entirely. We need a regression range to confirm.
Did some regression digging around the talky.io failure - this appears to be broken on release 23 with firing an NS_ERROR_FAILURE on creation of the peer connection & the NS_ERROR_UNEXPECTED on nightly before and after this patch. So it looks like this is confirmed to be a different issue. I'll file a separate bug for it.
Filed bug 902615 for the talky.io issue we've seen.
I think the analysis above confirms this patch is okay to take to Beta.

We do need to do a separate analysis on what's happening on bug 902615. Additionally, we need to continue to study ways to improve performance of calls (especially in the mobile environment) and figure out faster paths to diagnose call failures based on the analysis done above.
Status: RESOLVED → VERIFIED
Keywords: verifyme
(In reply to Jason Smith [:jsmith] from comment #33)
> I think the analysis above confirms this patch is okay to take to Beta.
> 
> We do need to do a separate analysis on what's happening on bug 902615.
> Additionally, we need to continue to study ways to improve performance of
> calls (especially in the mobile environment) and figure out faster paths to
> diagnose call failures based on the analysis done above.

Update on this - we went back and did some more testing around this. We used a test app that made use of the affected STUN server directly to ensure that we were hitting the affected path. talky.io was unavailable to use due to another site bug found during testing. What we found is summarized below.

1. Call 1 - Desktop to FxAndroid

Result - Call failed - iceConnectionState indicated "failed"

2. Call 2 - FxAndroid to FxAndroid

Result - Call established - video streaming performance was acceptable, but audio was chopped up by each syllable.

3. Call 3 - FxAndroid to FxAndroid (followup to same session from call 2)

Result - Call failed - likely the issue being seen in bug 904598.

4. Call 4 - FxAndroid to FxAndroid (cold start of a new session)

Result - Call established - video streaming performance was bad and audio was chopped up by each syllable.

Conclusion - There's a lot of unpredictability here, which is leading QA to think that we should avoid taking the uplift here. The calls did go through on some cases, but the unpredictable results questions whether this is safe to uplift.
Can you please compare this to beta without this patch. I suppose it's possible this would cause problems with Android video once it is established, but it seems rather unlikely.
(In reply to Eric Rescorla (:ekr) from comment #35)
> Can you please compare this to beta without this patch. I suppose it's
> possible this would cause problems with Android video once it is
> established, but it seems rather unlikely.

When this was tested before this patch landed, the apprtc performance issues seen were already present in that build. Doing calls with the affected stun server also failed 100%.

Now we see the same performance issues on apprtc. Except doing calls with the affected stun server appears to establish on FxAndroid --> FxAndroid. So this is an improvement.
(In reply to Jason Smith [:jsmith] from comment #36)
> (In reply to Eric Rescorla (:ekr) from comment #35)
> > Can you please compare this to beta without this patch. I suppose it's
> > possible this would cause problems with Android video once it is
> > established, but it seems rather unlikely.
> 
> When this was tested before this patch landed, the apprtc performance issues
> seen were already present in that build. Doing calls with the affected stun
> server also failed 100%.
> 
> Now we see the same performance issues on apprtc. Except doing calls with
> the affected stun server appears to establish on FxAndroid --> FxAndroid. So
> this is an improvement.

OK, so since this is a connectivity problem on desktop, this seems like
that's not enough of an argument against uplift....
Yeah, now that I think about this again, this is actually more favorable to take the patch given that we do get FxAndroid --> FxAndroid calls established with no other regressions seen.
:ekr,

Are we planning on uplifting this ? If yes, that should happen sooner than later to get maximum beta testing .
Flags: needinfo?(ekr)
(In reply to bhavana bajaj [:bajaj] from comment #39)
> :ekr,
> 
> Are we planning on uplifting this ? If yes, that should happen sooner than
> later to get maximum beta testing .

Please note we are planned to gtb for our next beta this afternoon PT (noon- 1PM) hence landing this asap will be helpful as Maire just confirmed that we are certain that this needs an uplift.
I'm fine with uplifting it. What button do I push?
Flags: needinfo?(ekr)
Comment on attachment 781383 [details] [diff] [review]
Merged patch to do 5389 and MAPPED-ADDRESS

[Approval Request Comment]
Bug caused by (feature/regressing bug #): This has been there since the beginning.

User impact if declined:
Failure during calls between two machines that are behind different NATs on sites which use Google's STUN servers.


Testing completed (on m-c, etc.): 
See above in bug.


Risk to taking this patch (and alternatives if risky):
Most probable risk is that some other server behaves in another unpredictable way and so some other calls don't work. Aside from that, the usual risks in taking any patch.

String or IDL/UUID changes made by this patch:
None.
Attachment #781383 - Flags: approval-mozilla-beta?
Comment on attachment 781383 [details] [diff] [review]
Merged patch to do 5389 and MAPPED-ADDRESS

Given we do not want ship android webrtc with this bug, approving this on beta given QA verification.

Please let QA know if there is any additional testing that may be needed here .
Attachment #781383 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
https://hg.mozilla.org/releases/mozilla-beta/rev/cadb54f63c80
Whiteboard: [WebRTC][blocking-webrtc-][android-webrtc?][webrtc-uplift] → [WebRTC][blocking-webrtc-]
I've verified that what I see on mozilla-beta with this patch on the latest candidate matches the results as seen on mozilla-central that Jason and I conducted last week.
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: