Closed Bug 1004530 Opened 5 years ago Closed 5 years ago

Relayed candidates failing when using trickle ICE depending on network conditions

Categories

Product/Component: Core :: WebRTC: Signaling
Type: defect
Version: 32 Branch
Platform: x86 macOS
Priority: Not set

Tracking

Status: RESOLVED FIXED
Target Milestone: mozilla32

Tracking Status:
firefox31 --- fixed
firefox32 --- fixed

People

(Reporter: ggb, Assigned: bwc)

Attachments

(14 files, 17 obsolete files)

37.50 KB, text/plain
846.72 KB, application/x-bzip2
519.64 KB, application/x-bzip2
3.61 MB, application/x-bzip2
436.76 KB, text/plain
121.91 KB, text/x-log
1.41 MB, application/x-bzip2
1.91 MB, application/x-bzip2
124.23 KB, text/x-log
1.19 MB, application/x-bzip2
4.76 KB, patch (ekr: review+)
10.85 KB, patch (ekr: review+)
17.03 KB, patch (bwc: review+)
6.80 KB, patch (bwc: review+)
Attached file icetest_ff_logs.txt
Firefox-to-Firefox connections through TURN are sometimes failing (depending on network conditions, the failure rate ranges from 5% to 50% in our test environment).

If trickle ICE is disabled, it apparently never fails. Packet loss in the network, the round-trip time to the server, and the time elapsed between creating the PeerConnection and calling getLocalDescription all apparently affect the failure rate.

When the failure occurs, this line appears in the about:webrtc logs (complete logs are attached):
"stream1/audio) tried to trickle ICE in inappropriate state 5"

Reproduced with Firefox 29 and Firefox Nightly.

How to reproduce it:
* Block UDP traffic with ipfw to force usage of TURN TCP (see the command sketch after this list)
* Use any simple test page with two PeerConnections (this is the one I use: https://dl.dropboxusercontent.com/u/1806026/icetest.html, but you need to configure your own ICE server on it; I can provide one if needed)
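For reference, here is a minimal ipfw sketch of such a setup (the rule numbers, the TURN server address 203.0.113.10, and the dummynet delay are illustrative placeholders, not the exact configuration used in these tests):

  # Drop all UDP so ICE has to fall back to TURN over TCP
  sudo ipfw add 100 deny udp from any to any

  # Optionally delay traffic to/from the TURN server via dummynet pipes
  sudo ipfw add 200 pipe 1 ip from any to 203.0.113.10
  sudo ipfw add 201 pipe 2 ip from 203.0.113.10 to any
  sudo ipfw pipe 1 config delay 500ms
  sudo ipfw pipe 2 config delay 500ms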
Byron, Nils -- Can one of you try to reproduce this report?  Thanks.
I'll go look at the code.
So, the stream has been marked as failed, and I'm seeing lots of lines like the following in the log:

"Error in sendto IP4:192.168.1.82:49937"

This looks like a duplicate of bug 985493. I am cherry-picking fixes out of bug 891551 that are likely to fix this, more in a bit.
Actually, this one might not be a dupe, but a bug for which there is yet another fix in the ICE TCP patches. I'll try to cherry-pick that one too.
I'll try to reproduce later today.
@byron The sendto error is probably for the UDP candidates because I'm blocking UDP traffic with ipfw to force it to use the TURN TCP candidates (192.168.1.82 was my local address).
After further reading, our handling of candidate pair failures isn't quite right, and plays poorly with trickle. We are marking components as failed if all pairs are failed, without leaving any kind of grace period to allow new candidates (either remote or local) to trickle in. I'm working on a fix now.
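To make the intended behavior concrete, here is a minimal self-contained C sketch of the idea (all names and the toy integer clock are hypothetical; the real patch implements this on top of nICEr's component state and async timers):

  #include <stdio.h>

  /* Toy model: a component whose pairs have all failed is not marked
   * FAILED immediately; a grace-period deadline is recorded instead,
   * and any pair that trickles in before the deadline cancels it. */

  typedef enum { COMP_RUNNING, COMP_FAILED } comp_state;

  typedef struct {
    int num_pairs;
    int num_failed;
    int grace_deadline; /* 0 = no pending failure */
    comp_state state;
  } component;

  #define GRACE_PERIOD 2 /* seconds; the first cut used 2s */

  static void pair_failed(component *c, int now) {
    c->num_failed++;
    if (c->num_failed == c->num_pairs && !c->grace_deadline)
      c->grace_deadline = now + GRACE_PERIOD; /* schedule, don't fail yet */
  }

  static void pair_inserted(component *c) {
    c->num_pairs++;        /* a new (trickled) pair arrived... */
    c->grace_deadline = 0; /* ...so the pending failure is cancelled */
  }

  static void tick(component *c, int now) {
    if (c->grace_deadline && now >= c->grace_deadline &&
        c->num_failed == c->num_pairs)
      c->state = COMP_FAILED; /* grace period elapsed, still all-failed */
  }

  int main(void) {
    component c = {1, 0, 0, COMP_RUNNING};
    pair_failed(&c, 0); /* the only pair fails at t=0; grace period starts */
    pair_inserted(&c);  /* a candidate trickles in before the deadline */
    tick(&c, 3);        /* deadline was cancelled, so still RUNNING */
    printf("state=%s\n", c.state == COMP_RUNNING ? "RUNNING" : "FAILED");
    return 0;
  }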
First cut at a fix. Needs a little cleanup, but should be good enough for testing.
Assignee: nobody → docfaraday
Status: NEW → ASSIGNED
I'm finally able to reproduce the problem after adding a 1s delay to the network traffic of my TURN server.
Looks like your patch works if I have 500ms delay on my TURN server, but not with 1s delay. From looking at the code I'm assuming there is a 2s timeout in your code. Does this observation concern you?
(In reply to Nils Ohlmeier [:drno] from comment #10)
> Looks like your patch works if I have 500ms delay on my TURN server, but not
> with 1s delay. From looking at the code I'm assuming there is a 2s timeout
> in your code. Does this observation concern you?

Forget about that.
It looks like even with the patch applied and 500ms delay on the TURN server I don't get a very good success rate.
So, setting up a TURN candidate requires a few round trips, usually two allocate requests (first one is challenged), then a permission request. Also, we don't just blast out requests as fast as we can, since there is some throttling involved. So a 500ms delay (on top of any other delays) seems large enough that 2 seconds won't work some of the time. I can bump the grace period up to 4 seconds and we can see what happens.
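Back-of-the-envelope, assuming the injected 500ms applies to each packet in at least one direction (so roughly 0.5-1s per round trip): TCP handshake (1 RTT), initial Allocate answered with a 401 challenge (1 RTT), authenticated Allocate (1 RTT), then CreatePermission (1 RTT) puts us on the order of 2-4 seconds before the first check can even flow, and check pacing adds more on top, so a 2-second grace period is easily exceeded.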
Attachment #8416124 - Attachment is obsolete: true
Unfortunately I don't have hard numbers, but here is my experience from manual testing:
- without any patch and 500ms TURN delay you have a ~50% failure rate
- without any patch and 1s TURN delay it fails all the time
- with patch and 500ms TURN delay you still get failures
- with patch and 1s TURN delay it still fails all the time

So basically it looks to me like your fix is not really improving things here so far.
(In reply to Nils Ohlmeier [:drno] from comment #14)
> Unfortunately I don't have hard numbers, but here is my experience from
> manual testing:
> - without any patch and 500ms TURN delay you have a ~50% failure rate
> - without any patch and 1s TURN delay it fails all the time
> - with patch and 500ms TURN delay you still get failures
> - with patch and 1s TURN delay it still fails all the time
> 
> So basically it looks to me like your fix is not really improving things here
> so far.

PCAPs and logging please?
Attached file logs.tar.bz2
Logs and PCAP from a failed test run with the 4s patch applied and 1s delayed TURN server.
Hmm. The TURN TCP stuff is sometimes delayed more than a second; I see a bunch of CreatePermission requests getting a response 1.6-1.7 seconds later (these seem to be for requests that came in the same TCP packet as wrapped check requests and such). Can you try to get it to fail with a more reasonable 500ms delay? I would actually expect that to work. (Also, a 1 second delay on not just the binding setup stuff, but also the check request stuff, in each direction, could be causing problems)
Attached file logs500ms.tar.bz2
Logs and PCAP from a failed call with 500ms TURN delay. Unfortunately I only saw trickle candidates from one side.
(In reply to Byron Campen [:bwc] from comment #17)
> (Also, a 1 second
> delay on not just the binding setup stuff, but also the check request stuff,
> in each direction, could be causing problems)

Well, the delay is added at the IP layer for everything. Setting up something that delays based on the application layer would be a lot more effort.
Seeing a different kind of problem in this 500ms trace; we never send a CreatePermission request at all. Will look into why.
Looks like we formed a couple of TURN TCP candidate pairs, but we never tried them.
Hmm, also seeing a bunch of TCP connections timing out in the logs, but the SYN/ACK is only delayed by 500ms as expected. Wonder what that's about...
Ohhhhh. Looking at the traces, I see 7 TCP connections get established, then some AllocateRequest/Response (this all takes a good 1.15 seconds or so), and once all those responses come in, 5 more SYNs pop out for the remaining TCP candidates. By the time those get SYN/ACKed, 1.65 seconds total has gone by, which is long enough for our connect timeout to pop (1.5 seconds). I'm a little puzzled why this is happening; maybe there's some sort of limit we're bumping up against in our socket library? It isn't anything like hitting the maximum accept limit on the server, because the initial SYNs are delayed, not just the SYN/ACKs.
Add some logging to debug the weirdness we're seeing with delayed SYN.
Attachment #8416285 - Attachment is obsolete: true
Attached file logs500ms.tar.bz2
And another run with 500ms delay.
Warning: the log first contains one successful run. After that I reloaded the page and got a failure.
I think I see what is happening here, or at least another show-stopper bug when dealing with late trickle. I should have a patch in a few.
Fix another bug, where the check timer would stop being scheduled once we were in a late trickle situation.
Attachment #8416724 - Attachment is obsolete: true
So this patch seems to give me 100% success rate with trickled ICE TURN TCP candidates if I have 500ms delay to my TURN server.
With 1s delay it works sometimes, but not very often. But: I had three crashes when I was testing the non-successful cases with 1s delay. So before shipping we need to carefully test that this does not introduce more instability.
Fix bug where I wasn't canceling the grace period timer when destroying the ice_component.
Attachment #8416768 - Attachment is obsolete: true
That is likely to be your crash fix.
Attached file Mac OSX crash reports
With the latest patch applied to today's m-c I get crashes pretty much every time I reload the test page. These are the Mac OS X crash reports from those crashes.
Fix another bug where I didn't zero out the timer handle in the callback.
Attachment #8416846 - Attachment is obsolete: true
Ok. This last patch seems to work reliably with a 500ms delay and sometimes with a 1s delay. And I did not have any more crashes.
The only remaining problem is that the initial load of the test page almost always results in ICE failure (it looks like it fails to get TURN candidates), while a reload of the page always succeeds. I'm going to attach a log file of such a failure.
Attached file initial-failure.log
Failure on initial load. No trickle ICE gets sent. This happens with 500ms and 1000ms delay on the TURN relay.
Seeing connections time out in that log, similar to the weird delayed SYN I was seeing before (which remains unexplained). If you can get a pcap of this, and create a new ticket, I can work on that there. I'll need to put some polish on the patch here, and find someone to review it.
Attachment #8417551 - Attachment is obsolete: true
Attachment #8418138 - Attachment is obsolete: true
Needinfo myself to check back on try push.

https://tbpl.mozilla.org/?tree=Try&rev=ca5aaec97edd
Flags: needinfo?(docfaraday)
Comment on attachment 8418144 [details] [diff] [review]
Allow a grace period for trickle candidates to arrive when all candidate pairs have failed.

Review of attachment 8418144 [details] [diff] [review]:
-----------------------------------------------------------------

::: media/mtransport/third_party/nICEr/src/ice/ice_candidate_pair.c
@@ +279,5 @@
>              ABORT(r);
>  
>            nr_ice_candidate_pair_set_state(pair->pctx,pair,NR_ICE_PAIR_STATE_SUCCEEDED);
>  
> +          if(r=nr_ice_component_insert_pair(pair->remote->component,pair))

The intent was to replace every call to nr_ice_candidate_pair_insert with a call to a function that performed the same insert, plus things like canceling the grace period timer and restarting the check timer if needed.

@@ +490,5 @@
>      pair->state=state;
>  
>  
>      if(pair->state==NR_ICE_PAIR_STATE_FAILED){
> +      nr_ice_component_failed_pair(pair->remote->component);

This doesn't do anything synchronously anymore; at most it will schedule a timer.

::: media/mtransport/third_party/nICEr/src/ice/ice_component.c
@@ +943,5 @@
>    abort:
>      return(_status);
>    }
>  
> +static int nr_ice_component_all_pairs_failed(nr_ice_component *comp)

Removed some stuff that was extraneous to this function's purpose.

@@ +1113,5 @@
> +    if(pcomp->trickle_grace_period_timer)
> +      NR_async_timer_cancel(pcomp->trickle_grace_period_timer);
> +
> +    /* Make sure the check timer is still going. */
> +    nr_ice_media_stream_start_checks(pair->remote->stream->pctx, pair->remote->stream);

The check timer doesn't get rescheduled if there are no pairs that can be started, so when new pairs come in (due to trickle) we need to ensure that it is scheduled.
Attachment #8418144 - Flags: review?(ekr)
Flags: needinfo?(docfaraday)
Attachment #8418144 - Flags: review?(ekr) → review?(drno)
Comment on attachment 8418144 [details] [diff] [review]
Allow a grace period for trickle candidates to arrive when all candidate pairs have failed.

Review of attachment 8418144 [details] [diff] [review]:
-----------------------------------------------------------------

I would like to get an answer for my concern in ice_media_stream.c, before giving r+.

::: media/mtransport/third_party/nICEr/src/ice/ice_component.c
@@ +980,5 @@
>  
> +    if (nr_ice_component_all_pairs_failed(comp)){
> +      /* All pairs are failed, even after a grace period. Mark component as
> +       * failed. */
> +      nr_ice_media_stream_component_failed(comp->stream,comp);

Not essential for this patch/solution, but why does nr_ice_media_stream_component_failed take the stream and the component as args if stream is part of comp anyhow?

@@ +1098,5 @@
> +
> +
> +int nr_ice_component_insert_pair(nr_ice_component *pcomp, nr_ice_cand_pair *pair)
> +  {
> +    int r,_status;

Should _status be initialized with a default error return value, in case someone suddenly decides to goto abort?
Without this we could potentially return an uninitialized value at the end.

::: media/mtransport/third_party/nICEr/src/ice/ice_media_stream.c
@@ +400,5 @@
> +    /* This is used both to start checks initially, and also to restart them
> +     * if we're in a grace period state with only failed pairs, and a new
> +     * candidate has trickled in, necessitating a restart of the timer */
> +    if (!stream->timer)
> +      nr_ice_media_stream_check_timer_cb(0,0,stream);

So we have four possible states here:
1) state = CHECKS_FROZEN then the state gets set to Active, which removes any pending timers and then you re-install a timer here
2) state = CHECKS_ACTIVE skips over the set_state() call, and probably there are still timers running which don't get touched
3) state = CHECKS_COMPLETED skips set_state() again, but in that case I assume there is no timer any more and so this would start check timers again?!?!
4) state = CHECKS_FAILED skips set_state() again, and there is no timer (hopefully) any more, and so you restart it with this.

1, 2 and 4 make sense to me. Scenario 3 worries me.
Constrain nr_ice_media_stream_start_checks some
Attachment #8418144 - Attachment is obsolete: true
Attachment #8418144 - Flags: review?(drno)
Attachment #8421132 - Flags: review?(drno)
Comment on attachment 8421132 [details] [diff] [review]
Allow a grace period for trickle candidates to arrive when all candidate pairs have failed.

Review of attachment 8421132 [details] [diff] [review]:
-----------------------------------------------------------------

I would like to get feedback on the if condition in ice_media_stream.c before approving.

::: media/mtransport/third_party/nICEr/src/ice/ice_component.c
@@ +1113,5 @@
> +    if(pcomp->trickle_grace_period_timer)
> +      NR_async_timer_cancel(pcomp->trickle_grace_period_timer);
> +
> +    /* Make sure the check timer is still going. */
> +    nr_ice_media_stream_start_checks(pair->remote->stream->pctx, pair->remote->stream);

This call returns 0 or 1. Is it safe to ignore the return value?

::: media/mtransport/third_party/nICEr/src/ice/ice_media_stream.c
@@ +394,5 @@
>  /* Start checks for this media stream (aka check list) */
>  int nr_ice_media_stream_start_checks(nr_ice_peer_ctx *pctx, nr_ice_media_stream *stream)
>    {
> +    if (stream->ice_state > NR_ICE_MEDIA_STREAM_CHECKS_ACTIVE)
> +      return(1);

Does this mean a stream in COMPLETED or FAILED state can no longer accept new trickle candidates?
Sorry maybe I'm confused here with layering of streams and components.
(In reply to Nils Ohlmeier [:drno] from comment #42)
> Comment on attachment 8421132 [details] [diff] [review]
> Allow a grace period for trickle candidates to arrive when all candidate
> pairs have failed.
> 
> Review of attachment 8421132 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> I would like to get feedback on the if condition in ice_media_stream.c
> before approving.
> 
> ::: media/mtransport/third_party/nICEr/src/ice/ice_component.c
> @@ +1113,5 @@
> > +    if(pcomp->trickle_grace_period_timer)
> > +      NR_async_timer_cancel(pcomp->trickle_grace_period_timer);
> > +
> > +    /* Make sure the check timer is still going. */
> > +    nr_ice_media_stream_start_checks(pair->remote->stream->pctx, pair->remote->stream);
> 
> This call returns 0 or 1. Is it safe to ignore the return value?
>

   It is safe, although there may be some value to logging something. Let me look. 
> ::: media/mtransport/third_party/nICEr/src/ice/ice_media_stream.c
> @@ +394,5 @@
> >  /* Start checks for this media stream (aka check list) */
> >  int nr_ice_media_stream_start_checks(nr_ice_peer_ctx *pctx, nr_ice_media_stream *stream)
> >    {
> > +    if (stream->ice_state > NR_ICE_MEDIA_STREAM_CHECKS_ACTIVE)
> > +      return(1);
> 
> Does this mean a stream in COMPLETED or FAILED state can no longer accept
> new trickle candidates?
> Sorry maybe I'm confused here with layering of streams and components.

Yeah, that is what it means; it doesn't make sense to start sending new checks once we're in a sink state, at least not until we support renegotiation or something. The grace period delays the transition to FAILED.
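(For reference, the `>` comparison relies on the states being defined in the progression FROZEN < ACTIVE < COMPLETED < FAILED, assuming that is how the NR_ICE_MEDIA_STREAM_CHECKS_* constants are ordered, so it matches exactly the two sink states.)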
Attachment #8421132 - Attachment is obsolete: true
Attachment #8421132 - Flags: review?(drno)
Attachment #8421795 - Flags: review?(drno)
Comment on attachment 8421795 [details] [diff] [review]
Allow a grace period for trickle candidates to arrive when all candidate pairs have failed.

Review of attachment 8421795 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good to me.
I'll test it one more time to be safe.
But I would also suggest one final review by someone more familiar with the nICEr code base, e.g. ekr or abr.
Attachment #8421795 - Flags: review?(drno) → review+
Actually, on another scan through, I think I've found something subtly wrong. Will correct in a moment.
When a new pair is added, don't restart the check timer if the stream is still frozen.
Attachment #8421795 - Attachment is obsolete: true
Attachment #8422032 - Attachment is obsolete: true
Comment on attachment 8422034 [details] [diff] [review]
Allow a grace period for trickle candidates to arrive when all candidate pairs have failed.

Review of attachment 8422034 [details] [diff] [review]:
-----------------------------------------------------------------

Requesting a second pass; have explained a few of the changes in comments.

::: media/mtransport/third_party/nICEr/src/ice/ice_candidate_pair.c
@@ +279,5 @@
>              ABORT(r);
>  
>            nr_ice_candidate_pair_set_state(pair->pctx,pair,NR_ICE_PAIR_STATE_SUCCEEDED);
>  
> +          if(r=nr_ice_component_insert_pair(pair->remote->component,pair))

Inserts pair in check list, cancels grace period timer (if set), and ensures that check timer is running if the stream is active (it might have expired earlier for lack of pairs to start).

@@ +490,5 @@
>      pair->state=state;
>  
>  
>      if(pair->state==NR_ICE_PAIR_STATE_FAILED){
> +      nr_ice_component_failed_pair(pair->remote->component);

This just schedules the grace period timer, so returns void.

::: media/mtransport/third_party/nICEr/src/ice/ice_component.c
@@ +943,5 @@
>    abort:
>      return(_status);
>    }
>  
> +static int nr_ice_component_all_pairs_failed(nr_ice_component *comp)

Removed some extraneous code/comments that didn't really add anything.

@@ +975,2 @@
>  
> +static void nr_ice_component_trickle_wait_cb(NR_SOCKET s, int how, void *cb_arg)

Callback for grace period.
Attachment #8422034 - Flags: review?(ekr)
Logs and pcap for a test run with patch 8422034.
I see trickle ICE candidates come in, but the call never gets established. This is with a 500ms delay on the TURN server side.
Note: I'm testing from home and my TURN server is in the office. I haven't tested like this before. But the trickle candidates seem to indicate that the TURN server is reachable, so I don't think that is the issue.

Good news: I now see trickle ICE happening all the time (it used to not work on the first attempt most of the time).
Attachment #8422034 - Flags: review?(ekr)
Looks like the Pcap is useless because of the VPN. And apparently I have trouble reaching Mozilla's STUN server.
Let's assume my test scenario over VPN is borked. I'll re-test tomorrow in the office.
Attachment #8422034 - Flags: review?(ekr)
So even in the office I can't get a trickle ICE call working any more with patch 8422034 (and 500ms delay on the TURN server).

The attached logs and pcap captured one initial call where no trickle ICE happened at all. The second call in the logs showed trickle ICE candidates in the developer console, but the call did not get established.
Attached file frozen.log
This is the about:webrtc dump from a call attempt between two machines using apprtc where the local machine had UDP blocked. The local machine was running a fresh build from today's m-c. Maybe this is helpful in finding out how we end up in the frozen state.
Logs and PCAP from an apprtc call. Both sides on local m-c builds from today. Local side had latest patch applied. Remote side had UDP blocked.
Comment on attachment 8422034 [details] [diff] [review]
Allow a grace period for trickle candidates to arrive when all candidate pairs have failed.

Review of attachment 8422034 [details] [diff] [review]:
-----------------------------------------------------------------

Based on subsequent comments, it seems that this patch does not work. Re-request review when you believe it is fixed.
Attachment #8422034 - Flags: review?(ekr)
Looking at the logs, I'm seeing evidence that the failure was due to another bug that we have a patch in the works for.
Fix another crash bug in the timer management.
Attachment #8422034 - Attachment is obsolete: true
I got this patch to work three times in a row with a delay of 500ms, tab-to-tab call using apprtc, with UDP blocked. I was using ipfw on my OS X machine to block/delay traffic.
Ok, also ran the same test a few times without the patch, and it fails every time. So this seems to be doing what I'd expect.
Attachment #8425919 - Flags: review?(ekr)
I've done some 3-way testing with opentokrtc.com (using ipfw to drop UDP traffic to self via ephemeral ports, i.e. UDP host candidates), and it seems to at least connect with this patch, but without it things do not fare so well. I'm guessing having a media stream running imposes enough latency on the STUN traffic to cause late trickle.
So, after some more testing and reading of code, this bug is actually a little more serious than I originally realized. Not only do we not recover if a new local candidate is gathered after all existing pairs have failed, we also will not restart the check timer once all existing pairs have been started (happens very quickly). We _will_ restart the check timer if a remote trickle candidate comes in, however. 

What this means is that once side A trickles a TURN candidate, side B will form a handful of new pairs and restart the check timer; if side B gathers a TURN candidate before the last of these pairs is started, we're ok, but if it is gathered too late (after the last pair is started), nothing is ever done with it.
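A hedged timeline of the race (times illustrative only):

  t=0.00  side A trickles its TURN candidate; side B forms a handful of
          new pairs and restarts the check timer
  t=0.05  side B starts the last of those pairs; with nothing left to
          start, the check timer is not rescheduled
  t=0.50  side B finally gathers its own TURN candidate; new pairs are
          formed, but no check timer is running to ever start them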
Comment on attachment 8432913 [details] [diff] [review]
Part 2: Unit test for verifying that local candidates gathered after the check timer has stopped are ultimately scheduled.

Review of attachment 8432913 [details] [diff] [review]:
-----------------------------------------------------------------

This look roughly sane to you?
Attachment #8432913 - Flags: feedback?(ekr)
Comment on attachment 8432913 [details] [diff] [review]
Part 2: Unit test for verifying that local candidates gathered after the check timer has stopped are ultimately scheduled.

Review of attachment 8432913 [details] [diff] [review]:
-----------------------------------------------------------------

This looks reasonable.

::: media/mtransport/test/ice_unittest.cpp
@@ +68,5 @@
>  namespace {
>  
>  enum TrickleMode { TRICKLE_NONE, TRICKLE_SIMULATE, TRICKLE_REAL };
>  
> +typedef std::string(*CandidateFilter)(const std::string& candidate);

space before (

@@ +89,5 @@
> +  if (typ_loc != std::string::npos) {
> +    MOZ_ASSERT(typ_loc > 3);
> +    // Remove a digit from the port number
> +    result.erase(typ_loc - 3, 1);
> +  }

Suggest replace this with a fixed bogus candidate.
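(A hypothetical example of such a fixed bogus candidate, using a TEST-NET address; the exact string is up to the author:

  static const char kBogusCandidate[] =
      "candidate:0 1 UDP 2122252543 192.0.2.1 56789 typ host";
)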
Attachment #8432913 - Flags: feedback?(ekr) → feedback+
Comment on attachment 8425919 [details] [diff] [review]
Allow a grace period for trickle candidates to arrive when all candidate pairs have failed.

Review of attachment 8425919 [details] [diff] [review]:
-----------------------------------------------------------------

After talking to Justin, I think this would be simpler with a global timer for
overall failure. Also, I think I see a number of other errors here.

::: media/mtransport/third_party/nICEr/src/ice/ice_component.c
@@ +167,5 @@
>        NR_async_timer_cancel(component->keepalive_timer);
>      nr_stun_client_ctx_destroy(&component->keepalive_ctx);
>  
> +    if(component->trickle_grace_period_timer)
> +      NR_async_timer_cancel(component->trickle_grace_period_timer);

Please set to 0 just for sanity.

@@ +975,2 @@
>  
> +static void nr_ice_component_trickle_wait_cb(NR_SOCKET s, int how, void *cb_arg)

Why are you putting this timer on the component level rather than on the
ice_peer_ctx level? That seems like it would be easier.

@@ +985,5 @@
> +    }
> +  }
> +
> +void nr_ice_component_failed_pair(nr_ice_component *comp)
> +  {

This seems to assume you are doing trickle, but that's not certain.

@@ +986,5 @@
> +  }
> +
> +void nr_ice_component_failed_pair(nr_ice_component *comp)
> +  {
> +    if (nr_ice_component_all_pairs_failed(comp)){

This is a question, not a statement, right? If so, please rename
to something like "have_all_pairs_failed()" or "..._failed_p()"

@@ +1101,5 @@
> +  {
> +    int r,_status;
> +
> +    if (pair->state == NR_ICE_PAIR_STATE_FAILED ||
> +        pair->state == NR_ICE_PAIR_STATE_CANCELLED){

Why can you add pairs that are in SUCCEEDED or IN_PROGRESS?

@@ +1112,5 @@
> +
> +    if(pcomp->trickle_grace_period_timer) {
> +      NR_async_timer_cancel(pcomp->trickle_grace_period_timer);
> +      pcomp->trickle_grace_period_timer=0;
> +    }

This all seems like a pretty good argument for a global timer from first candidate since it doesn't need this reset logic.

@@ +1115,5 @@
> +      pcomp->trickle_grace_period_timer=0;
> +    }
> +
> +    /* If the stream isn't frozen, make sure the check timer is running. */
> +    if(pair->remote->stream->ice_state == NR_ICE_MEDIA_STREAM_CHECKS_ACTIVE){

Note that we are currently starting checks elsewhere for trickle candidates, so
this is rearchitecting code. You need to minimally remove that.

http://dxr.mozilla.org/mozilla-central/source/media/mtransport/third_party/nICEr/src/ice/ice_peer_ctx.c#309
Attachment #8425919 - Flags: review?(ekr) → review-
(In reply to Eric Rescorla (:ekr) from comment #66)
> Comment on attachment 8425919 [details] [diff] [review]
> Allow a grace period for trickle candidates to arrive when all candidate
> pairs have failed.
> 
> Review of attachment 8425919 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> After talking to Justin, I think this would be simpler with a global timer
> for
> overall failure. Also, I think I see a number of other errors here.
> 
> ::: media/mtransport/third_party/nICEr/src/ice/ice_component.c
> @@ +985,5 @@
> > +    }
> > +  }
> > +
> > +void nr_ice_component_failed_pair(nr_ice_component *comp)
> > +  {
> 
> This seems to assume you are doing trickle, but that's not certain.
> 

   Whether we accept trickle or not is not configurable, and remote trickle is what this grace period is about.

> @@ +986,5 @@
> > +  }
> > +
> > +void nr_ice_component_failed_pair(nr_ice_component *comp)
> > +  {
> > +    if (nr_ice_component_all_pairs_failed(comp)){
> 
> This is a question, not a statement, right? If so, please rename
> to something like "have_all_pairs_failed()" or "..._failed_p()"
> 

   Have renamed.

> @@ +1101,5 @@
> > +  {
> > +    int r,_status;
> > +
> > +    if (pair->state == NR_ICE_PAIR_STATE_FAILED ||
> > +        pair->state == NR_ICE_PAIR_STATE_CANCELLED){
> 
> Why can you add pairs that are in SUCCEEDED or IN_PROGRESS?
> 

   Looks like SUCCEEDED can happen with peer reflexive, but I'm not seeing a way for WAITING or IN_PROGRESS. I'm changing the check to be a little more restrictive.

> @@ +1112,5 @@
> > +
> > +    if(pcomp->trickle_grace_period_timer) {
> > +      NR_async_timer_cancel(pcomp->trickle_grace_period_timer);
> > +      pcomp->trickle_grace_period_timer=0;
> > +    }
> 
> This all seems like a pretty good argument for a global timer from first
> candidate since it doesn't need this reset logic.
> 

   Alright, have done, and put it in the ice_peer_ctx.

> @@ +1115,5 @@
> > +      pcomp->trickle_grace_period_timer=0;
> > +    }
> > +
> > +    /* If the stream isn't frozen, make sure the check timer is running. */
> > +    if(pair->remote->stream->ice_state == NR_ICE_MEDIA_STREAM_CHECKS_ACTIVE){
> 
> Note that we are currently starting checks elsewhere for trickle candidates,
> so
> this is rearchitecting code. You need to minimally remove that.
> 
> http://dxr.mozilla.org/mozilla-central/source/media/mtransport/third_party/
> nICEr/src/ice/ice_peer_ctx.c#309

   It looks like we were still relying on this code to start checks if ICE started without any remote candidates, but I can change the code I'm adding to unfreeze the stream when a new pair is added. However, we may want to consider a little more carefully what we want to do with streams that have all their candidates at the start.
Attachment #8425919 - Attachment is obsolete: true
Comment on attachment 8434586 [details] [diff] [review]
Part 1: Allow a grace period for trickle candidates to arrive when all candidate pairs have failed.

Review of attachment 8434586 [details] [diff] [review]:
-----------------------------------------------------------------

I have chosen to start the grace period timer when the ice_peer_ctx is first told to pair, regardless of whether any pairs are formed in this process. This seems to lead to sensible behavior if the other end simply never trickles anything, or sends us any checks (leading to triggered checks). Otherwise I think we'd just hang around indefinitely.

There's also the question of what to do about starting up extra streams that had some remote candidates in the SDP, but never get any trickle candidates, but I think that is separable.
Attachment #8434586 - Flags: review?(ekr)
Comment on attachment 8434574 [details] [diff] [review]
Part 3: Unit test that verifies that new pairs will start when local gather happens after all preceding pairs have failed, provided the grace period has not elapsed. Also a couple more tests that use a new test-case feature.

Review of attachment 8434574 [details] [diff] [review]:
-----------------------------------------------------------------

The extra tests were mainly to convince myself that the new ControlTrickle feature didn't completely suck. Maybe we want to replace some existing tests instead; I'd like to hear your thoughts on that.
Attachment #8434574 - Flags: review?(ekr)
Attachment #8432913 - Attachment is obsolete: true
Attachment #8434687 - Flags: review?(ekr)
Comment on attachment 8434687 [details] [diff] [review]
Part 2: Unit test for verifying that local candidates gathered after the check timer has stopped are ultimately scheduled.

Review of attachment 8434687 [details] [diff] [review]:
-----------------------------------------------------------------

LGTM
Attachment #8434687 - Flags: review?(ekr) → review+
Comment on attachment 8434574 [details] [diff] [review]
Part 3: Unit test that verifies that new pairs will start when local gather happens after all preceding pairs have failed, provided the grace period has not elapsed. Also a couple more tests that use a new test-case feature.

Review of attachment 8434574 [details] [diff] [review]:
-----------------------------------------------------------------

r- because IRC conversation indicates that some of these tests shouldn't screen out host/srflx candidates.

::: media/mtransport/test/ice_unittest.cpp
@@ +173,5 @@
> +    static void Trickle(NR_SOCKET s, int how, void *cb_arg) {
> +      static_cast<SchedulableTrickleCandidate*>(cb_arg)->Trickle();
> +    }
> +
> +    void Trickle();

Please use different names for these fxns. Not a great idea to rely on overloading when generating a pointer to a function to send to C code.

@@ +177,5 @@
> +    void Trickle();
> +
> +    std::string& Candidate() {
> +      return candidate_;
> +    }

suggest const std::string and make it const.

@@ +543,5 @@
> +         ++s) {
> +      for (auto cand = s->second.begin(); cand != s->second.end(); ++cand) {
> +        delete *cand;
> +      }
> +    }

Why not make this a map of ScopedDeletePtr<SchedulableTrickleCandidates>. Then you wouldn't need this. If you opt to keep explicit memory management, why not put it in the dtor.

@@ +791,5 @@
>    nsRefPtr<NrIceCtx> ice_ctx_;
>    std::vector<mozilla::RefPtr<NrIceMediaStream> > streams_;
>    std::map<std::string, std::vector<std::string> > candidates_;
> +  std::map<size_t, std::vector<SchedulableTrickleCandidate*> >
> +    controlled_trickle_candidates_;

Can we get a comment that explains what the key is here. Also, why can't this be a std::vector and we'll just fill it with null vectors in init.

@@ +1387,5 @@
> +
> +void DelayRelayCandidates(
> +    std::vector<SchedulableTrickleCandidate*>& candidates,
> +    unsigned int ms) {
> +  for (auto c = candidates.begin(); c!= candidates.end(); ++c) {

convention seems to be to use |i| for this iterator

@@ +1396,5 @@
> +    }
> +  }
> +}
> +
> +TEST_F(IceConnectTest, TestConnectTurnWithNormalTrickleDelay) {

These tests all seem to be TurnOnly. I believe that you didn't intend that, but please make names conform to behavior.
Attachment #8434574 - Flags: review?(ekr) → review-
(In reply to Eric Rescorla (:ekr) from comment #75)
> Comment on attachment 8434574 [details] [diff] [review]
> Part 3: Unit test that verifies that new pairs will start when local gather
> happens after all preceding pairs have failed, provided the grace period has
> not elapsed. Also a couple more tests that use a new test-case feature.
> 
> Review of attachment 8434574 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> r- because IRC conversation indicates that some of these tests shouldn't
> screen out host/srflx candidates.
> 
> ::: media/mtransport/test/ice_unittest.cpp
> @@ +173,5 @@
> > +    static void Trickle(NR_SOCKET s, int how, void *cb_arg) {
> > +      static_cast<SchedulableTrickleCandidate*>(cb_arg)->Trickle();
> > +    }
> > +
> > +    void Trickle();
> 
> Please use different names for these fxns. Not a great idea to rely on
> overloading when generating a pointer to a function to send to C code.

Ok.

> 
> @@ +177,5 @@
> > +    void Trickle();
> > +
> > +    std::string& Candidate() {
> > +      return candidate_;
> > +    }
> 
> suggest const std::string and make it const.
> 

   I wanted to allow candidate_ to be modified. I can add a second const accessor.

> @@ +543,5 @@
> > +         ++s) {
> > +      for (auto cand = s->second.begin(); cand != s->second.end(); ++cand) {
> > +        delete *cand;
> > +      }
> > +    }
> 
> Why not make this a map of ScopedDeletePtr<SchedulableTrickleCandidates>.
> Then you wouldn't need this. If you opt to keep explicit memory management,
> why not put it in the dtor.
> 

   These aren't safe to delete on main, because of the timer cancel. Also, our build config disables move semantics in the STL, so we wouldn't be able to do this without using RefPtr.

> @@ +791,5 @@
> >    nsRefPtr<NrIceCtx> ice_ctx_;
> >    std::vector<mozilla::RefPtr<NrIceMediaStream> > streams_;
> >    std::map<std::string, std::vector<std::string> > candidates_;
> > +  std::map<size_t, std::vector<SchedulableTrickleCandidate*> >
> > +    controlled_trickle_candidates_;
> 
> Can we get a comment that explains what the key is here. Also, why can't
> this be a std::vector and we'll just fill it with null vectors in init.
> 

   I can add the comment, but using a vector instead of a map requires more code, and the performance here is not terribly important. Also, using references to stuff within a map is a little less idiot-proof than the same for stuff in a vector.

> @@ +1387,5 @@
> > +
> > +void DelayRelayCandidates(
> > +    std::vector<SchedulableTrickleCandidate*>& candidates,
> > +    unsigned int ms) {
> > +  for (auto c = candidates.begin(); c!= candidates.end(); ++c) {
> 
> convention seems to be to use |i| for this iterator
> 

   I prefer to use something that corresponds roughly to the thing being iterated over, but that preference is not strong.

> @@ +1396,5 @@
> > +    }
> > +  }
> > +}
> > +
> > +TEST_F(IceConnectTest, TestConnectTurnWithNormalTrickleDelay) {
> 
> These tests all seem to be TurnOnly. I believe that you didn't intend that,
> but please make names conform to behavior.

   Righto.
Sorry, that should have been _more_ idiot-proof, not less.
Comment on attachment 8434586 [details] [diff] [review]
Part 1: Allow a grace period for trickle candidates to arrive when all candidate pairs have failed.

Review of attachment 8434586 [details] [diff] [review]:
-----------------------------------------------------------------

r- for over-aggressive pacing and the default issue.

::: media/mtransport/third_party/nICEr/src/ice/ice_candidate_pair.c
@@ +490,5 @@
>      pair->state=state;
>  
>  
>      if(pair->state==NR_ICE_PAIR_STATE_FAILED){
> +      nr_ice_component_check_if_failed(pair->remote->component);

I don't think the rename here is helpful. Please have nr_ice_component_failed_pair() call
nr_ice_component_check_if_failed. Also, reinstate the error check.

::: media/mtransport/third_party/nICEr/src/ice/ice_component.c
@@ +804,5 @@
>  
>          if(r=nr_ice_candidate_pair_create(pctx,lcand,pcand,&pair))
>            ABORT(r);
>  
> +        if(r=nr_ice_component_insert_pair(pcomp, pair))

This causes the stream to unfreeze (as below) so it's too aggressive. You want to retain it for the trickle ICE version, but not here.

@@ +972,4 @@
>  
> +void nr_ice_component_check_if_failed(nr_ice_component *comp)
> +  {
> +    if (!comp->stream->pctx->trickle_grace_period_timer &&

Please use nested ifs and add a log statement explaining what's going on.

@@ +1091,5 @@
> +
> +    /* Make sure the check timer is running. */
> +    if(pair->remote->stream->ice_state == NR_ICE_MEDIA_STREAM_CHECKS_ACTIVE ||
> +       pair->remote->stream->ice_state == NR_ICE_MEDIA_STREAM_CHECKS_FROZEN){
> +      if(nr_ice_media_stream_start_checks(pair->remote->stream->pctx, pair->remote->stream)) {

Please log that we are trying to start checks.

@@ +1099,5 @@
> +    }
> +
> +    _status=0;
> +  abort:
> +

Extra whitespace.

::: media/mtransport/third_party/nICEr/src/ice/ice_media_stream.c
@@ +393,5 @@
>  
>  /* Start checks for this media stream (aka check list) */
>  int nr_ice_media_stream_start_checks(nr_ice_peer_ctx *pctx, nr_ice_media_stream *stream)
>    {
> +    if (stream->ice_state > NR_ICE_MEDIA_STREAM_CHECKS_ACTIVE)

Add a comment here.

@@ +394,5 @@
>  /* Start checks for this media stream (aka check list) */
>  int nr_ice_media_stream_start_checks(nr_ice_peer_ctx *pctx, nr_ice_media_stream *stream)
>    {
> +    if (stream->ice_state > NR_ICE_MEDIA_STREAM_CHECKS_ACTIVE)
> +      return(1);

Please use ABORT() here.

@@ +400,3 @@
>      nr_ice_media_stream_set_state(stream,NR_ICE_MEDIA_STREAM_CHECKS_ACTIVE);
> +
> +    if (!stream->timer)

Let's log that we are doing this.

::: media/mtransport/third_party/nICEr/src/ice/ice_peer_ctx.c
@@ +314,5 @@
> +    pctx->trickle_grace_period_timer=0;
> +
> +    /* Trickle grace period is over; mark every component with only failed pairs
> +     * as failed. */
> +    stream=STAILQ_FIRST(&pctx->peer_streams);

Please add logging here.

@@ +356,5 @@
>       * up in UNPAIRED after creating some pairs. */
>      pctx->state = NR_ICE_PEER_STATE_PAIRED;
>  
> +    /* Start grace period timer for incoming trickle candidates */
> +    nr_ice_peer_ctx_start_trickle_timer(pctx);

So, your answer for how you do RFC5245 ICE with no trickle is
to set the grace period to 0? Not sure I love that. Either let's
have trickle be optional or have the default grace period be 0.
Attachment #8434574 - Attachment is obsolete: true
Attachment #8436136 - Flags: review?(ekr)
Comment on attachment 8436136 [details] [diff] [review]
Part 3: Unit test that verifies that new pairs will start when local gather happens after all preceding pairs have failed, provided the grace period has not elapsed. Also a couple more tests that use a new test-case feature.

Review of attachment 8436136 [details] [diff] [review]:
-----------------------------------------------------------------

LGTM
Attachment #8436136 - Flags: review?(ekr) → review+
Attachment #8434586 - Attachment is obsolete: true
Attachment #8434586 - Flags: review?(ekr)
Attachment #8436165 - Flags: review?(ekr)
Comment on attachment 8436165 [details] [diff] [review]
Part 1: Allow a grace period for trickle candidates to arrive when all candidate pairs have failed.

Review of attachment 8436165 [details] [diff] [review]:
-----------------------------------------------------------------

lgtm with revisions below

::: media/mtransport/third_party/nICEr/src/ice/ice_component.c
@@ +1100,5 @@
> +
> +    if(r=nr_ice_candidate_pair_insert(&pair->remote->stream->check_list,pair))
> +      ABORT(r);
> +
> +    /* Make sure the check timer is running, if the stream was previously

Please log here.

::: media/mtransport/third_party/nICEr/src/ice/ice_media_stream.c
@@ +394,5 @@
>  /* Start checks for this media stream (aka check list) */
>  int nr_ice_media_stream_start_checks(nr_ice_peer_ctx *pctx, nr_ice_media_stream *stream)
>    {
> +    int r,_status;
> +    if (stream->ice_state > NR_ICE_MEDIA_STREAM_CHECKS_ACTIVE)

Comment here please for why this check.

Also, please add an assert.

::: media/mtransport/third_party/nICEr/src/ice/ice_peer_ctx.c
@@ +349,5 @@
> +static void nr_ice_peer_ctx_start_trickle_timer(nr_ice_peer_ctx *pctx)
> +  {
> +    UINT4 grace_period_timeout;
> +
> +    if(!NR_reg_get_uint4(NR_ICE_REG_TRICKLE_GRACE_PERIOD,&grace_period_timeout)){

This starts the timer even if grace_period is 0.

Please also check grace_period.
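(A sketch of the requested guard, reusing the names from the hunk above:

  if(!NR_reg_get_uint4(NR_ICE_REG_TRICKLE_GRACE_PERIOD,&grace_period_timeout) &&
     grace_period_timeout){
    /* only arm the grace-period timer for a non-zero timeout */
    ...
  }
)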
Attachment #8436165 - Flags: review?(ekr) → review+
Attachment #8436165 - Attachment is obsolete: true
Comment on attachment 8436185 [details] [diff] [review]
Part 1: Allow a grace period for trickle candidates to arrive when all candidate pairs have failed.

Review of attachment 8436185 [details] [diff] [review]:
-----------------------------------------------------------------

Carry forward r=ekr, and the now-distant r=drno
Attachment #8436185 - Flags: review+
Comment on attachment 8437065 [details] [diff] [review]
Part 1: (Subset for uplift) Subset of part 1 that ensures that the check timer is running when new pairs are created.

Carry forward r+ for the code in this patch.
Attachment #8437065 - Flags: review+
Comment on attachment 8437065 [details] [diff] [review]
Part 1: (Subset for uplift) Subset of part 1 that ensures that the check timer is running when new pairs are created.

[Approval Request Comment]
Bug caused by (feature/regressing bug #): 

   Trickle ICE

User impact if declined: 

   Relatively common connectivity failures if there is a delay between the start of gathering on one side, and the start of ICE, particularly with relay candidates.

Testing completed (on m-c, etc.): 

   Unit-tests, and try. Posting a separate try run for the subset patch in a moment.

Risk to taking this patch (and alternatives if risky): 

   Fairly low.

String or IDL/UUID changes made by this patch:

   None.
Attachment #8437065 - Flags: approval-mozilla-beta?
Attachment #8437065 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Nils, does this need any more particular testing for beta 1? Is there anything in particular we can do for extra manual testing so we can try to detect possible regressions? Thanks!
Flags: needinfo?(drno)
Good question, Liz.
We are in the process of setting up the required infrastructure in our QA lab, as part of a new CI system, to prevent regressions on Nightly in the future.
Unfortunately it is not that easy to replicate this behavior. But it is a good point: we should probably set up the network infrastructure that would allow manual testing of problems like this at any time.
But to answer your question: since we don't have that infrastructure yet, there is nothing you could test for beta right now.
Flags: needinfo?(drno)
Duplicate of this bug: 1001671