Closed Bug 1120065 Opened 5 years ago Closed 5 years ago

TURN allocations are not released when the PeerConnection is closed

Categories

(Core :: WebRTC: Networking, defect)

x86_64
Linux
defect
Not set

Tracking

()

RESOLVED FIXED
mozilla38
Tracking Status
firefox38 --- fixed

People

(Reporter: fippo, Assigned: drno)

References

Details

Attachments

(4 files, 3 obsolete files)

Attached file allocate.pcap
User Agent: Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/37.0.2062.120 Chrome/37.0.2062.120 Safari/537.36

Steps to reproduce:

Actually the same issue as http://code.google.com/p/chromium/issues/detail?id=406578 -- just in Firefox

Allocate relay candidates, e.g. using 
http://googlechrome.github.io/webrtc/samples/web/content/peerconnection/trickle-ice/
and close the peerconnection (done in the example).



Actual results:

The is no allocate request with a lifetime of 0. See the attached pcap.


Expected results:

This tends to be an issue for the TURN servers at some point.
Component: Untriaged → WebRTC: Networking
Product: Firefox → Core
Nils -- Can you look at this next week?
Status: UNCONFIRMED → NEW
Ever confirmed: true
Flags: needinfo?(drno)
Sure
Assignee: nobody → drno
Flags: needinfo?(drno)
So here is an early working copy. It is missing tests. And there is probably still room for improvement.. (e.g. as nr_turn_client_deallocate and nr_turn_client_send_indication both implement now sending a STUN message as a one shot should we create common code for them?)
Attachment #8553352 - Flags: feedback?(docfaraday)
One observation from testing this: this only seems to get called when PC.close() in JS gets called. Which I assume is better then nothing. But for some unknown reason the ICE ctx destroy functions never seem to get called if a window, a tab or the whole browser just get closed (w/o calling close() from JS before).
Result from further testing:
- calling PC.close() releases the TURN resource immediately
- reloading a page release immediately
- when closing a tab/window it seems to depend on the timing of the garbage collector
- only if you close the browser this does not get called
(In reply to Nils Ohlmeier [:drno] from comment #5)
> Result from further testing:
> - calling PC.close() releases the TURN resource immediately
> - reloading a page release immediately
> - when closing a tab/window it seems to depend on the timing of the garbage
> collector

I also imagine PC.close() can't be called if we just 'lose' a ref to a PC (pc = null;) and it gets GC'd.

My assumption is that it's related to whether or not we get an inner-window-destroyed event for the window (or if we get it before the object is gc'd).  However: PeerConnection.js holds only a weakref to the PC, so it *can* get GC'd on it's own before we process an inner-window-destroyed (I presume).  

Note that the JS implementation of PeerConnection::close() sets the IceConnection state, closes the Idps, and calls the Impl's Close().

We could move some of this into either the addPC()'s entry (the list), or perhaps better in the c++'s destructor (PCImpl)

> - only if you close the browser this does not get called
Works for PC.close(), closing and reloading of tabs/windows.
In case of quitting the browser the code actually gets invoked, but the request never make it to the wire. Probably because STS refuses new requests at that point.

Still missing: unit test code.
Attachment #8553352 - Attachment is obsolete: true
Attachment #8553352 - Flags: feedback?(docfaraday)
Attachment #8553960 - Flags: feedback?(docfaraday)
Just some dead code which I found while working on this.
Attachment #8553961 - Flags: review?(docfaraday)
Comment on attachment 8553960 [details] [diff] [review]
bug_1120065_release_turn_allocations.patch

Review of attachment 8553960 [details] [diff] [review]:
-----------------------------------------------------------------

Looks about the right shape/size. Some comments.

::: media/mtransport/third_party/nICEr/src/stun/turn_client_ctx.c
@@ +436,5 @@
>  
>    return(0);
>  }
>  
> +void nr_turn_client_set_auth_param(nr_turn_client_ctx *tctx,

This should probably be nr_turn_client_get_auth_param. Also, I worry that someone might come along and use this to build an auth_params that will outlive |tctx|. Maybe just do this inline, since we only use in one place?

@@ +448,5 @@
> +
> +  authp->authenticate = 1;
> +}
> +
> +int nr_turn_client_send_single_stun_request(nr_turn_client_ctx *ctx,

We could probably just call this ...send_stun_request

::: media/mtransport/third_party/nICEr/src/stun/turn_client_ctx.h
@@ +77,5 @@
>  #define NR_TURN_CLIENT_STATE_ALLOCATING      2
>  #define NR_TURN_CLIENT_STATE_ALLOCATED       3
>  #define NR_TURN_CLIENT_STATE_FAILED          4
>  #define NR_TURN_CLIENT_STATE_CANCELLED       5
> +#define NR_TURN_CLIENT_STATE_DEALLOCATING    6

Unused...
Attachment #8553960 - Flags: feedback?(docfaraday) → feedback+
Attachment #8553961 - Flags: review?(docfaraday) → review+
Updated with the feedback.
Attachment #8553960 - Attachment is obsolete: true
Attachment #8555125 - Flags: review?(docfaraday)
Added deallocation unit test and cleaned up the existing turn unit tests
Attachment #8555127 - Flags: review?(docfaraday)
Try run (although I would be surprised if that shows anything): https://treeherder.mozilla.org/#/jobs?repo=try&revision=774dfbed45c2
Comment on attachment 8555125 [details] [diff] [review]
bug_1120065_release_turn_allocations.patch

Review of attachment 8555125 [details] [diff] [review]:
-----------------------------------------------------------------

Just a couple of things to do before landing.

::: media/mtransport/third_party/nICEr/src/stun/turn_client_ctx.c
@@ +483,5 @@
> +    ABORT(r);
> +
> +  // We are only sending a single request here because we are in the process of
> +  // shutting everything down. Theoretically we should probably start a seperate
> +  // STUN transaction which outlives the TURN context.

This should be a c-style comment. Also, we should file a bug for handling retransmissions/EWOULDBLOCK, and put that bug number here.

@@ +487,5 @@
> +  // STUN transaction which outlives the TURN context.
> +  if ((r=nr_turn_client_send_stun_request(ctx, aloc, 0)))
> +    ABORT(r);
> +
> +  ctx->state = NR_TURN_CLIENT_STATE_DEALLOCATING;

Given that we cancel right after sending the deallocation (which updates |state|), and that we have switch statements (and if (... || ...) checks that resemble switch statements) scattered around on |state| that may need to be updated, we should probably not bother adding this new state.
Attachment #8555125 - Flags: review?(docfaraday) → review+
Comment on attachment 8555127 [details] [diff] [review]
bug_1120065_unit_test_turn_deallocation.patch

Review of attachment 8555127 [details] [diff] [review]:
-----------------------------------------------------------------

::: media/mtransport/test/turn_unittest.cpp
@@ -334,5 @@
>  }
>  
> -TEST_F(TurnClient, AllocateAndHold) {
> -  Allocate();
> -  PR_Sleep(20000);

As much as I dislike tests that just wait like this one, this is the only test we have that can be used to verify that refresh actually works.

@@ -342,5 @@
>    Allocate();
>    SendTo(relay_addr_);
>    ASSERT_TRUE_WAIT(received() == 100, 1000);
> -  PR_Sleep(10000); // Wait 10 seconds to make sure the
> -                   // CreatePermission has time to complete/fail.

Hmm. Since we're talking to ourselves here, I'm pretty sure we can send before the permission is created (the permission is only needed for one direction, not the other, and since both ways are identical... well, I guess it depends on the implementation of the TURN server).

I think that cleaning these up is going to be a bit more involved. We would need to stop talking to ourselves for this to be a good test, since I don't think it does a good job of testing whether CreatePermission worked in the first place. We should probably file a bug here, but leave this stuff alone.

@@ +396,5 @@
> +  ASSERT_TRUE_WAIT(received() == 100, 1000);
> +  Deallocate();
> +  turn_ctx_->state = NR_TURN_CLIENT_STATE_ALLOCATED;
> +  // If we are fast enough the socket should be still usable and we rely on
> +  // our assert, rather then on the TURN server to close the connection

What if we are not fast enough? Does the call to |nr_turn_client_send_indication| fail? Is this a possible intermittent?
Attachment #8555127 - Flags: review?(docfaraday) → review-
Blocks: 1126560
(In reply to Byron Campen [:bwc] from comment #13)
> This should be a c-style comment. Also, we should file a bug for handling
> retransmissions/EWOULDBLOCK, and put that bug number here.

Opened bug 1126560 for that.
 
> @@ +487,5 @@
> > +  // STUN transaction which outlives the TURN context.
> > +  if ((r=nr_turn_client_send_stun_request(ctx, aloc, 0)))
> > +    ABORT(r);
> > +
> > +  ctx->state = NR_TURN_CLIENT_STATE_DEALLOCATING;
> 
> Given that we cancel right after sending the deallocation (which updates
> |state|), and that we have switch statements (and if (... || ...) checks
> that resemble switch statements) scattered around on |state| that may need
> to be updated, we should probably not bother adding this new state.

My thinking here was when writing the unit test that without setting the state someone in the future could internally call deallocate without knowing what he is doing and cause harm. Changing the state would at least make sure that most other functions in here would stop working at that point and it does not cost us much.
(In reply to Byron Campen [:bwc] from comment #14)
> @@ -334,5 @@
> >  }
> >  
> > -TEST_F(TurnClient, AllocateAndHold) {
> > -  Allocate();
> > -  PR_Sleep(20000);
> 
> As much as I dislike tests that just wait like this one, this is the only
> test we have that can be used to verify that refresh actually works.

The problem I have with this test is that it does not test anything. It sleeps after the allocation. But it not even notice if the allocation when away, or if the refreshs failed. Besides how do you know that it actually did send something? The server might have overwritten the requested lifetimes.

I was wondering if there is a simple request we could send to ask if the allocation is still alive on the server?! Because just checking some variables in our TURN context is not much better then what we have right now.
 
> @@ -342,5 @@
> >    Allocate();
> >    SendTo(relay_addr_);
> >    ASSERT_TRUE_WAIT(received() == 100, 1000);
> > -  PR_Sleep(10000); // Wait 10 seconds to make sure the
> > -                   // CreatePermission has time to complete/fail.
> 
> Hmm. Since we're talking to ourselves here, I'm pretty sure we can send
> before the permission is created (the permission is only needed for one
> direction, not the other, and since both ways are identical... well, I guess
> it depends on the implementation of the TURN server).
> 
> I think that cleaning these up is going to be a bit more involved. We would
> need to stop talking to ourselves for this to be a good test, since I don't
> think it does a good job of testing whether CreatePermission worked in the
> first place. We should probably file a bug here, but leave this stuff alone.

I think the 10s are just excessive. Sure if the TURN will talk to itself or not with or without a permission request depends.
The real problem is that even outside of this test our TURN client does not wait for permission request response. We just ensure that we send the permission request first followed directly by the send request. But in a real world scenario it does not matter as we can start sending without waiting for the permission request response.

I also considered opening another port to send data to and from. But from looking at the TURN client code coverage that would not change a lot, as the current tests first sends and then receives.

> @@ +396,5 @@
> > +  ASSERT_TRUE_WAIT(received() == 100, 1000);
> > +  Deallocate();
> > +  turn_ctx_->state = NR_TURN_CLIENT_STATE_ALLOCATED;
> > +  // If we are fast enough the socket should be still usable and we rely on
> > +  // our assert, rather then on the TURN server to close the connection
> 
> What if we are not fast enough? Does the call to
> |nr_turn_client_send_indication| fail? Is this a possible intermittent?

Yes it fails if the connection has been closed. The Turn server I was testing against closed the connection about a 1s after the de-allocation request. But the test also can't rely on the server to do that, as a server for what ever reason might not do it.
So we would have to have a 'I don't care' flag for the send(). As this is the only place where I'm using the success flag I can probably change into that.
(In reply to Nils Ohlmeier [:drno] from comment #16)
> (In reply to Byron Campen [:bwc] from comment #14)
> > @@ -342,5 @@
> > >    Allocate();
> > >    SendTo(relay_addr_);
> > >    ASSERT_TRUE_WAIT(received() == 100, 1000);
> > > -  PR_Sleep(10000); // Wait 10 seconds to make sure the
> > > -                   // CreatePermission has time to complete/fail.
> > 
> > Hmm. Since we're talking to ourselves here, I'm pretty sure we can send
> > before the permission is created (the permission is only needed for one
> > direction, not the other, and since both ways are identical... well, I guess
> > it depends on the implementation of the TURN server).
> > 
> > I think that cleaning these up is going to be a bit more involved. We would
> > need to stop talking to ourselves for this to be a good test, since I don't
> > think it does a good job of testing whether CreatePermission worked in the
> > first place. We should probably file a bug here, but leave this stuff alone.
> 
> I think the 10s are just excessive. Sure if the TURN will talk to itself or
> not with or without a permission request depends.
> The real problem is that even outside of this test our TURN client does not
> wait for permission request response. We just ensure that we send the
> permission request first followed directly by the send request. But in a
> real world scenario it does not matter as we can start sending without
> waiting for the permission request response.
> 
> I also considered opening another port to send data to and from. But from
> looking at the TURN client code coverage that would not change a lot, as the
> current tests first sends and then receives.

I forgot to mention the other problem here: waiting for the permission request to succeed here is kind of pointless, because if the server really needs time to process the permission request I see two possible consequences:
a) the server drops the send indication which we send right after the permission request to the floor already and our assert would have failed, because we never received the first 100 bytes.
b) the server stored the send indication (high likely IMHO) and forwards it after the permission request got granted, but if we would receive the permission reply after 5 seconds the assert before the sleep would have failed by then.

So in summary I still don't get what this extra sleep is buying us. If at all we should remove the PR_Sleep() and increase the timeout on the asserts.
(In reply to Nils Ohlmeier [:drno] from comment #17)
> (In reply to Nils Ohlmeier [:drno] from comment #16)
> > (In reply to Byron Campen [:bwc] from comment #14)
> > > @@ -342,5 @@
> > > >    Allocate();
> > > >    SendTo(relay_addr_);
> > > >    ASSERT_TRUE_WAIT(received() == 100, 1000);
> > > > -  PR_Sleep(10000); // Wait 10 seconds to make sure the
> > > > -                   // CreatePermission has time to complete/fail.
> > > 
> > > Hmm. Since we're talking to ourselves here, I'm pretty sure we can send
> > > before the permission is created (the permission is only needed for one
> > > direction, not the other, and since both ways are identical... well, I guess
> > > it depends on the implementation of the TURN server).
> > > 
> > > I think that cleaning these up is going to be a bit more involved. We would
> > > need to stop talking to ourselves for this to be a good test, since I don't
> > > think it does a good job of testing whether CreatePermission worked in the
> > > first place. We should probably file a bug here, but leave this stuff alone.
> > 
> > I think the 10s are just excessive. Sure if the TURN will talk to itself or
> > not with or without a permission request depends.
> > The real problem is that even outside of this test our TURN client does not
> > wait for permission request response. We just ensure that we send the
> > permission request first followed directly by the send request. But in a
> > real world scenario it does not matter as we can start sending without
> > waiting for the permission request response.
> > 
> > I also considered opening another port to send data to and from. But from
> > looking at the TURN client code coverage that would not change a lot, as the
> > current tests first sends and then receives.
> 
> I forgot to mention the other problem here: waiting for the permission
> request to succeed here is kind of pointless, because if the server really
> needs time to process the permission request I see two possible consequences:
> a) the server drops the send indication which we send right after the
> permission request to the floor already and our assert would have failed,
> because we never received the first 100 bytes.
> b) the server stored the send indication (high likely IMHO) and forwards it
> after the permission request got granted, but if we would receive the
> permission reply after 5 seconds the assert before the sleep would have
> failed by then.
> 
> So in summary I still don't get what this extra sleep is buying us. If at
> all we should remove the PR_Sleep() and increase the timeout on the asserts.

   I looked at the spec again, and the TURN server is not supposed to let us use send indications until the permission is created, so as long as this holds, the asserts should only pass if the permission is successfully created, and the sleep is redundant.
(In reply to Byron Campen [:bwc] from comment #18)
>    I looked at the spec again, and the TURN server is not supposed to let us
> use send indications until the permission is created, so as long as this
> holds, the asserts should only pass if the permission is successfully
> created, and the sleep is redundant.

Hmm interesting. The means we are potentially loosing data, depending on the Turn server implementation. Which is usually not too bad, but do we send check requests through send indications?

For the purpose of the tests we could call nr_turn_client_ensure_perm() from the test before trying to send and send indication. That could allow us to wait for the result of the permission request. But the problem remains that we currently only write a log message in nr_turn_client_permissions_cb(). So without code changes in the client we could only sleep and verify that the status does not switch to failed.
Added an assert to the AllocateAndHold to have it at least verify something after sleeping.
Replaced the sleeps after the send asserts with longer timeouts, in case it is needed.
Attachment #8555127 - Attachment is obsolete: true
Attachment #8557307 - Flags: review?(docfaraday)
Comment on attachment 8557307 [details] [diff] [review]
bug_1120065_unit_test_turn_deallocation.patch

Review of attachment 8557307 [details] [diff] [review]:
-----------------------------------------------------------------

I'm ok with this. Before landing, please file a follow-up bug to improve the pre-existing test cases.
Attachment #8557307 - Flags: review?(docfaraday) → review+
Blocks: 1128128
Keywords: checkin-needed
https://hg.mozilla.org/mozilla-central/rev/6156b3b5f1e6
https://hg.mozilla.org/mozilla-central/rev/72073b4271d4
https://hg.mozilla.org/mozilla-central/rev/4332055a8b22
Status: NEW → RESOLVED
Closed: 5 years ago
Flags: in-testsuite+
Resolution: --- → FIXED
Target Milestone: --- → mozilla38
You need to log in before you can comment on or make changes to this bug.