Closed Bug 857668 Opened 11 years ago Closed 3 years ago

Implement ALTERNATE-SERVER for TURN

Categories

(Core :: WebRTC: Networking, defect, P2)

x86
macOS
defect

Tracking

()

RESOLVED FIXED
90 Branch
Tracking Status
firefox90 --- fixed

People

(Reporter: ekr, Assigned: bwc)

References

(Blocks 1 open bug)

Details

(Whiteboard: [turn][fuzzing:queue:cdiehl])

Attachments

(20 files, 2 obsolete files)

48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
48 bytes, text/x-phabricator-request
Details | Review
      No description provided.
Whiteboard: [WebRTC][blocking-webrtc-][turn]
Whiteboard: [WebRTC][blocking-webrtc-][turn] → [WebRTC][blocking-webrtc-][turn][fuzzing:queue:cdiehl]
I am currently working on implementing this, could someone assign the bug to me?
Assignee: nobody → rohan16garg
rohan: thanks! make sure you tell hg who you are when you upload a patch, and once you have something to try we can help guide you through the process of testing, reviewing and landing a patch.
I ran "./mach mercurial-setup" which should have set the correct hg variables right?

As for the patch itself, I have it working up to the point where the request is redirected to the alternate server and the alternate server replies with a stale nonce error code. I just need to figure out why the code doesn't send another request with the new nonce specified in the error response.
(In reply to Rohan Garg from comment #3)
> I ran "./mach mercurial-setup" which should have set the correct hg
> variables right?

I'm not familiar with that command. The manual way to do this is go into ~/.hgrc and add something like:

    [ui]
    username = Rohan Garg <rohan16garg@gmail.com>

 
> As for the patch itself, I have it working up to the point where the request
> is redirected to the alternate server and the alternate server replies with
> a stale nonce error code. I just need to figure out why the code doesn't
> send another request with the new nonce specified in the error response.

Go ahead and upload your patch with a description starting with "WIP" (for "Work In Progress"), and then needinfo me (check the "Need more information from" box). I can't guarantee a fast response (this week is the WebRTC interim standards meeting, and then I'm travelling for the Memorial Day weekend), but I'll try to give it a look and see what's going on when I do get some time.
I have my work on github in this [1] branch.

Here's where I'm stuck :

The socket that is used to communicate with the TURN server is owned by the test. However, when the negotiation ends in a ALTERNATE-SERVER state for TCP connections, a new socket is required since the same socket cannot be reused for the new TCP connection.

What I propose is this :

Modify the nr_turn_client_allocate [2] function to have a
error_cb callback that can handle errors.

Once the restart function detects that it has to try an alternate server and
that it's operating in TCP mode, we cancel the current transaction and close
the socket. We then call the error call back. In the error callback we check
if the turn_client_ctx state is NR_TURN_CLIENT_STATE_CANCELLED and whether the
turn_client_ctx has a alternate turn server attribute, if so, we allocate a
new socket and start the whole thing from scratch.

How does that sound? :)


[1] https://github.com/shadeslayer/gecko-dev/tree/fix-for-857668
[2] http://dxr.mozilla.org/mozilla-central/source/media/mtransport/third_party/nICEr/src/stun/turn_client_ctx.c?from=turn_client_ctx.c#626
Oh, right, cons of above approach, breaks API/ABI, but I'm not sure how API/ABI breaks are handled in internal libraries in Firefox.
Flags: needinfo?(adam)
Byron: I'm at the WebRTC interim this week -- can you give this a look?
Flags: needinfo?(docfaraday)
Comment on attachment 8425039 [details] [diff] [review]
WIP: Make nICEr handle the ALTERNATE SERVER state

Review of attachment 8425039 [details] [diff] [review]:
-----------------------------------------------------------------

I think you're going to want to teach everything in the chain how to reconnect/restart, since replacing internal members is going to have lifecycle implications (I've called out some that I spotted, there may be others). Pointers to these socket objects are scattered all over the place, and getting them all is going to be time-consuming and error-prone. Another alternative would be to create a brand new candidate with all new sockets and such; a clean start. You would have to watch out for redirect loops though.

::: media/mtransport/third_party/nICEr/src/stun/stun_client_ctx.c
@@ +164,5 @@
>      }
>  
> +    if (ctx->my_addr.protocol == IPPROTO_TCP) {
> +      ctx->peer_addr.protocol=IPPROTO_TCP;
> +    }

You'll need to also call nr_transport_addr_fmt_addr_string here.

@@ +169,2 @@
>      nr_stun_client_reset(ctx);
> +    ctx->error_code = 0;

So, between the nr_stun_client_reset and setting the error code to 0, there's a whole bunch of stuff left. The label doesn't seem to change (and might be misleading now), and none of the auth stuff seems to change (not sure how much of this we should change, would have to sit and think about it).

::: media/mtransport/third_party/nICEr/src/stun/turn_client_ctx.c
@@ +231,5 @@
> +  if (tctx->turn_server_addr.protocol == IPPROTO_TCP) {
> +    r_log(NR_LOG_TURN, LOG_DEBUG, "TURN(%s): Closing TCP socket for restart", ctx->tctx->label);
> +    nr_socket_close(tctx->sock);
> +
> +    r = nr_ip4_port_to_transport_addr(0, 0, IPPROTO_TCP, &addr);

Error check?

@@ +232,5 @@
> +    r_log(NR_LOG_TURN, LOG_DEBUG, "TURN(%s): Closing TCP socket for restart", ctx->tctx->label);
> +    nr_socket_close(tctx->sock);
> +
> +    r = nr_ip4_port_to_transport_addr(0, 0, IPPROTO_TCP, &addr);
> +    r = nr_socket_local_create(&addr, &real_socket_);

We're probably going to want to use the same local address we were using before. Also, error check?

@@ +233,5 @@
> +    nr_socket_close(tctx->sock);
> +
> +    r = nr_ip4_port_to_transport_addr(0, 0, IPPROTO_TCP, &addr);
> +    r = nr_socket_local_create(&addr, &real_socket_);
> +    nr_socket_buffered_stun_create(real_socket_, 100000,

The only other call (excluding test code) I see to nr_socket_buffered_stun_create uses NR_STUN_MAX_MESSAGE_SIZE, and not 100000. Also, we need to do error handling here, and clean stuff up if we hit a failure (in this case, real_socket_).

@@ +237,5 @@
> +    nr_socket_buffered_stun_create(real_socket_, 100000,
> +                                   &buffered_socket_);
> +
> +    tctx->sock = buffered_socket_;
> +    ctx->stun->sock = buffered_socket_;

I'm pretty sure there's an nr_ice_socket around with an old buffered socket pointer in it, see here:

http://dxr.mozilla.org/mozilla-central/source/media/mtransport/third_party/nICEr/src/ice/ice_component.c#335

Also, the nr_ice_socket is hooked up to a signal based on the fd of the old buffered socket:

http://dxr.mozilla.org/mozilla-central/source/media/mtransport/third_party/nICEr/src/ice/ice_socket.c#221

Maybe the thing to do is build reconnect/restart logic into the nr_socket API, so we don't have to worry about lifecycle as much?

@@ +243,5 @@
> +    if (nr_stun_message_has_attribute(ctx->stun->response, NR_STUN_ATTR_ERROR_CODE, &ec)
> +       && ec->u.error_code.number == 300) {
> +          if (nr_stun_message_has_attribute(ctx->stun->response, NR_STUN_ATTR_ALTERNATE_SERVER, &as)) {
> +             nr_transport_addr_copy(&tctx->turn_server_addr, &as->u.alternate_server);
> +             tctx->turn_server_addr.protocol=IPPROTO_TCP;

Don't forget to call nr_transport_addr_fmt_addr_string here.

@@ +246,5 @@
> +             nr_transport_addr_copy(&tctx->turn_server_addr, &as->u.alternate_server);
> +             tctx->turn_server_addr.protocol=IPPROTO_TCP;
> +          }
> +    }
> +    nr_turn_client_connect(tctx);

Probably need to do error handling here, but ensure that R_WOULDBLOCK is not treated as a failure. (I imagine what you want to end up doing here is creating an nr_turn_client_restart/reconnect?)

@@ +325,5 @@
>  
>          ctx->retry_ct++;
> +      } else if (ctx->stun->error_code == 300) {
> +        r_log(NR_LOG_TURN, LOG_INFO, "TURN(%s): Alternate server recieved, restarting TURN", ctx->tctx->label);
> +        ctx->retry_ct = 0;

Hmm. What if we get into a redirect loop?
Attachment #8425039 - Flags: feedback-
Flags: needinfo?(docfaraday)
Since the socket is managed by applications consuming nICEr , I propose that nr_turn_client_allocate be modified so that it can call a error callback.

This error callback is a function in the application consuming nICEr and can check the error code of the TURN process and if a 300 error code is present, it proceeds to restart the process with a new socket.

Does this approach seem sensible?

Cheers
Sorry this took so long -- I'm a bit hesitant to make an architectural change that is so visible to the using application without first getting Eric's input, as he was responsible for the original design.
Flags: needinfo?(adam) → needinfo?(ekr)
(In reply to Rohan Garg from comment #10)
> Since the socket is managed by applications consuming nICEr , I propose that
> nr_turn_client_allocate be modified so that it can call a error callback.
> 
> This error callback is a function in the application consuming nICEr and can
> check the error code of the TURN process and if a 300 error code is present,
> it proceeds to restart the process with a new socket.
> 
> Does this approach seem sensible?

If I understand your question, no, I don't think that's right. The applications
consuming nICEr never see *any* socket. If you mean that the ICE part of
nICEr needs to handle a failure of the TURN part, that might be OK.
Flags: needinfo?(ekr)
Just curious if there has been any progress on this. I'm currently architecting our TURN server environment and I'd really like to see support for ALTERNATE_SERVER 300 in firefox. Is there any new information?
Flags: needinfo?(rohan16garg)
backlog: --- → webRTC+
Rank: 45
Priority: -- → P4
Whiteboard: [WebRTC][blocking-webrtc-][turn][fuzzing:queue:cdiehl] → [turn][fuzzing:queue:cdiehl]
I would really like to see support for ALTERNATE-SERVER in Firefox!!!
Assignee: rohan16garg → nobody
Rank: 45 → 29
Priority: P4 → P2
QA Contact: jsmith
Any news on this? Chromium implemented it in 2015 (https://bugs.chromium.org/p/webrtc/issues/detail?id=1986), Firefox is the missing piece to enable easier load balancing of TURN servers... Would be *awesome* if this would be implemented :)
Mass change P2->P3 to align with new Mozilla triage process.
Priority: P2 → P3

Since rohan16garg@gmail.com is not working on this any more I am removing the NI.

Flags: needinfo?(rohan16garg)

Hi Team, Any News on this? Is it implented in firefox?

Priority: P3 → P2

I tried a two different approaches to solving this, which involved directly re-using the stun context and some higher up. But I have come to the conclusion that bwc's review from comment #9 is still holds true.
It looks like the most sane way approaching this feature is to add some functions which can replicate a TURN server context, without the sockets in it. Then set or replace the IP address of the target TURN server in it that cloned context. And then feed this TURN server high up in the nICEr logic, and ignore the original turn context as failed.
That way we might also be able to build in loop detection, by checking for already failed TURN contexts with the same IP address etc in it.

Assignee: nobody → docfaraday

So first, we have to figure out how we're going to test this. Because we don't necessarily have multi-homed hosts in CI, and do not have a test STUN/TURN server on the internet that we administer (and could configure to do redirects), I think the most viable option is to have some wrapper code (eg; in the NAT simulator) that will send redirect responses when the destination addr/port matches a specific value (configured with a pref, I guess).

Blocks: 1710634
See Also: → 1710706

We need this because the test TURN server does not support stun redirects where
the address changes. Even if we were to add this support to the test STUN
server, it would not be helpful in CI since it would require having additional
IP addresses, which we cannot count on since we do not control the network
configuration of those machines.

Also, the fact that the code in this patch does not support MESSAGE-INTEGRITY
allows us to test the case where it is not present in a STUN redirect. Some
deployments rely on this.

Lastly, this code supports adding multiple (or 0) ALTERNATE-SERVER fields,
which is something that the test TURN server does not support.

Depends on D115276

Depends on D115277

Depends on D115279

This substantially simplifies the task of replacing the nr_socket a TURN
context is using.

Depends on D115280

STUN/300 is only meaningful for specific STUN method types, and the handling
depends on the spec that those methods are defined in. Doing any handling in
the base stun client code is not appropriate.

Depends on D115284

Also some logging that was useful in debugging this.

Depends on D115286

Leaking test-case was dom/media/webrtc/tests/mochitests/test_peerConnection_bug825703.html

Depends on D115287

Also some indentation fixup.

Depends on D115289

Depends on D115291

Nothing used this function before this bug, so this field has never been used
anyway. We handle challenge loops with nr_turn_stun_ctx.retry_ct.

Depends on D115292

Attachment #8425039 - Attachment is obsolete: true

We only use this with TCP sockets.

Depends on D115293

This avoids shutdown leaks in cases where these tests are run in isolation.

Depends on D116029

Attachment #9222264 - Attachment is obsolete: true
Pushed by bcampen@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/143bfb916b26
Some IPv6 fixes and refactoring in the test TURN server. r=mjf,ahal
https://hg.mozilla.org/integration/autoland/rev/2435ef0cd258
Teach test TURN server to open additional ports, on which it will respond with 300. r=mjf,ahal
https://hg.mozilla.org/integration/autoland/rev/08808a003231
Teach NAT simulator to do STUN redirects, configured by prefs. r=mjf
https://hg.mozilla.org/integration/autoland/rev/edaaced4f42b
Work a little harder to make sure we start tests with a reasonable pref environment. r=jib
https://hg.mozilla.org/integration/autoland/rev/76f4a828537e
Test cases for bug. r=jib
https://hg.mozilla.org/integration/autoland/rev/8aaf855abaed
Remove some unneeded non-owning references to nr_socket. r=mjf
https://hg.mozilla.org/integration/autoland/rev/9dc0836dcc9b
Remove some fields that serve no purpose besides making this code harder to maintain. r=mjf
https://hg.mozilla.org/integration/autoland/rev/86b7a16b487f
Add function for getting STUN multi-attributes. r=mjf
https://hg.mozilla.org/integration/autoland/rev/286b02c1eb23
Remove existing STUN/300 handling. r=mjf
https://hg.mozilla.org/integration/autoland/rev/2adc8298e2fa
Handle Allocate/300 with ALTERNATE-SERVER. r=mjf
https://hg.mozilla.org/integration/autoland/rev/97a666f07a57
Fix handling of IPv6 literals. r=mjf
https://hg.mozilla.org/integration/autoland/rev/5845d6f99b76
Fix a leak that the previous patch uncovered. r=mjf
https://hg.mozilla.org/integration/autoland/rev/ea6463ca6695
Let Allocate/300 without MESSAGE-INTEGRITY slide. r=mjf
https://hg.mozilla.org/integration/autoland/rev/4c540c962a19
Stop ignoring error responses from STUN/TURN servers. r=mjf
https://hg.mozilla.org/integration/autoland/rev/b8158a2544e1
Open a new TCP socket when we receive a redirect in the TCP case. r=mjf
https://hg.mozilla.org/integration/autoland/rev/c858c4bbd547
Implement redirect loop detection. r=mjf
https://hg.mozilla.org/integration/autoland/rev/c290554ad7e4
Remove this extra retry_ct field. r=mjf
https://hg.mozilla.org/integration/autoland/rev/798446339428
Use IPPROTO_TCP here. r=mjf
https://hg.mozilla.org/integration/autoland/rev/96849c0ceebe
Remove this check; sockets are sometimes closed without nICEr knowing about it. r=mjf
https://hg.mozilla.org/integration/autoland/rev/e643486292f0
Close these RTCPeerConnections. r=jib
Regressions: 1713239
Regressions: 1713300
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Creator:
Created:
Updated:
Size: