Closed Bug 991037 Opened 10 years ago Closed 10 years ago

WebRTC: ICE candidate gathering becomes complete before all candidates processed

Categories

(Core :: WebRTC: Networking, defect, P1)

31 Branch
x86
macOS
defect

Tracking

()

RESOLVED FIXED
mozilla34

People

(Reporter: alan.ford, Assigned: bwc)

References

Details

Attachments

(6 files, 44 obsolete files)

15.48 KB, patch
Details | Diff | Splinter Review
52.18 KB, patch
Details | Diff | Splinter Review
56.81 KB, patch
bwc
: review+
Details | Diff | Splinter Review
244.33 KB, text/plain
Details
8.75 KB, patch
bwc
: review+
Details | Diff | Splinter Review
204.95 KB, patch
bwc
: review+
Details | Diff | Splinter Review
User Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.9; rv:31.0) Gecko/20100101 Firefox/31.0 (Beta/Release)
Build ID: 20140401030203

Steps to reproduce:

* PeerConnection ("pc") configured to use a STUN server (this problem can be more reliably reproduced with a STUN server that takes over 100ms to respond to a Binding Request)
* Create onicecandidate callback (we'll come back to this)
* Add local mediastream and call pc.createOffer with callback to then call pc.setLocalDescription with the provided sdp

The onicecandidate callback looks roughly like this:

function (evt) {
    console.log("Ice Gathering State: " + pc.iceGatheringState);
    if (evt.candidate) {
        console.log("Gathered ICE candidate: " + evt.candidate.candidate);
    } else if (pc.iceGatheringState == "complete") {
        console.log("Finished gathering candidates: " + pc.localDescription.sdp);
        sendMessage(pc.localDescription);
    }
}

So if it's called with a null event, signifying end of candidates, we are able to send the SDP. My application does not support Trickle ICE so I need to wait for all candidates before sending SDP. This is as per JSEP section 3.4. I put the pc.iceGatheringState check in there to see if this helped the problem (described below), but it did not.


Actual results:

At some point, Firefox decides ICE candidate gathering is complete, and calls onicecandidate with a null event. The iceGatheringState is also "complete".

However, ICE candidate gathering is not complete! After this null callback, I still get some number of onicecandidate callbacks providing srflx candidates. This can vary from 0 to 4, depending on how fast the STUN server responded.

The candidates provided via this callback are NOT included in the pc.localDescription SDP.

Furthermore, no onicecandidate callbacks are made for the candidates that ARE in the localDescription SDP.

Even if I wait for 1 second before accessing pc.localDescription, the new candidates are not added to the SDP.


Expected results:

Primary issue: onicecandidate(null) with iceGatheringState=="complete" should not be called until ICE candidate gathering is in fact complete.

Secondary issue: onicecandidate should be called for all ICE candidates.
Assignee: nobody → docfaraday
I'll give this a look today.
So, a couple of things that bear mentioning; onicecandidate is only supposed to be called for candidates that were not present in the SDP. onicecandidate is intended to drive trickle ICE only (ie; candidates gathered after the offer was generated). The other thing that bears mentioning is that we currently have no way to turn trickle off, and I'm pretty sure no such API is specified in the webrtc specs. What you'll probably need to do is take the candidates from onicecandidate and manually insert them into the SDP (yuck).

As for the gathering state changing too soon, that's definitely a bug, and I'm giving it a look now.

I'll have to look over the specs and talk to some people to see whether accumulating trickle candidates in the SDP internally until the local description is retrieved, and then firing onicecandidate after that, is something that would be feasible.
(In reply to Byron Campen [:bwc] from comment #2)
> So, a couple of things that bear mentioning; onicecandidate is only supposed
> to be called for candidates that were not present in the SDP. onicecandidate
> is intended to drive trickle ICE only (ie; candidates gathered after the
> offer was generated). The other thing that bears mentioning is that we
> currently have no way to turn trickle off, and I'm pretty sure no such API
> is specified in the webrtc specs. What you'll probably need to do is take
> the candidates from onicecandidate and manually insert them into the SDP
> (yuck).

That's not supposed to be necessary. If you call CreateOffer() after
the last candidate, you should get the offer with all the candidates.
I thought we wrote that code, though it's possible we didn't or it
doesn't; work.


> As for the gathering state changing too soon, that's definitely a bug, and
> I'm giving it a look now.
> 
> I'll have to look over the specs and talk to some people to see whether
> accumulating trickle candidates in the SDP internally until the local
> description is retrieved, and then firing onicecandidate after that, is
> something that would be feasible.

If I understand your comment, that's how our code is intended to work.

Note, however, that I believe taht the specs are moving to require all
trickle and so CreateOffer() would never have any candidates in it.
I'm having no luck reproducing an early gathering state change, but this may be a corner case caused only by a specific stun server implementation. Do you have a some content I can use to attempt to reproduce?
Flags: needinfo?(alan.ford)
Infuriatingly, while it seemed to be doing it reliably earlier, I've only got the state change to fail twice in 20-30 attempts now. Gut feeling is latency in STUN server responses, but I don't have any hard evidence for that. My logs threw out:

"Ice Gathering State: gathering" 
"Gathered ICE candidate: candidate:1 1 UDP 1694236671 1.2.3.4 61753 typ srflx raddr 192.168.0.12 rport 61753"
"Ice Gathering State: complete"
"Ice Gathering State: complete"
"Gathered ICE candidate: candidate:1 2 UDP 1694236670 1.2.3.4 62791 typ srflx raddr 192.168.0.12 rport 62791"

On a failing case. But in the main it was playing ball this evening with "gathering".

Regarding the other points, my understanding is as ekr says, calling createOffer when ICE is done should have all candidates - that's how Chrome works, at least. Section 3.4.1 of JSEP says: "applications that do not support this feature can simply wait for the indication that gathering is complete, and then create and send their offer, with all the candidates, at this time."
Flags: needinfo?(alan.ford)
Well, the next time you are able to repro, can you open up about:webrtc, click the logging button, and attach all the text on the page?
Also, try delaying your call to CreateOffer(), instead of delaying SetLocalDescription.
Wait.

I have a hypothesis: the code path when we init a new local candidate goes (async!) through sipcc. The code path for gathering state changes does not. This could definitely explain why you're seeing this, while the unit-tests are unable to reproduce.
Good thinking. We can fix that!
(In reply to Byron Campen [:bwc] from comment #6)
> Well, the next time you are able to repro, can you open up about:webrtc,
> click the logging button, and attach all the text on the page?

Will email it to you later (although one thing I did notice is that all the candidates - not just those in the initial SDP - are listed in the overview).

(In reply to Byron Campen [:bwc] from comment #7)
> Also, try delaying your call to CreateOffer(), instead of delaying
> SetLocalDescription.

Right, this gets interesting here.

All I am doing in the, when onicecandidate(null) is received, is accessing the PeerConnection's pc.localDescription.sdp. In Chrome, this always contains the latest local SDP, updated with all known ICE candidates.

But I realise reading ekr's response that in fact you expect pc.createOffer() to be called again at this point (JSEP section 3.4.1 is a little ambiguous on this, it just talks about "create and send the offer").

So I've tried this now, and onicecandidate(null) && pc.iceGatheringState == "complete" I now call createOffer again. On the createOffer callback, I call pc.setLocalDescription with the new SDP.

But pc.localDescription.sdp is not updated! When accessing it directly I get the old SDP. So I have no idea if this new SDP is applied or not? Given the about:webrtc output suggests the new ICE candidates are available, by sending over the full SDP returned by createOffer things seem to work ok... But this does not feel right that the local SDP is not being updated.

Also, FWIW, onIceCandidate in Chrome is called for every candidate, trickle or not. I cannot find anything obvious in the specs that suggest what the correct behaviour here is (although I can understand the logic of it being additional trickle candidates only).
(In reply to Alan Ford from comment #10)
> (In reply to Byron Campen [:bwc] from comment #6)
> > Well, the next time you are able to repro, can you open up about:webrtc,
> > click the logging button, and attach all the text on the page?
> 
> Will email it to you later (although one thing I did notice is that all the
> candidates - not just those in the initial SDP - are listed in the overview).
> 
> (In reply to Byron Campen [:bwc] from comment #7)
> > Also, try delaying your call to CreateOffer(), instead of delaying
> > SetLocalDescription.
> 
> Right, this gets interesting here.
> 
> All I am doing in the, when onicecandidate(null) is received, is accessing
> the PeerConnection's pc.localDescription.sdp. In Chrome, this always
> contains the latest local SDP, updated with all known ICE candidates.

Yes, the specification appears to require this, so that's a bug
in Firefox (or we should argue that the spec is wrong).

"The localDescription attribute must return the RTCSessionDescription that was most recently passed to setLocalDescription(), plus any local candidates that have been generated by the ICE Agent since then.
"



> So I've tried this now, and onicecandidate(null) && pc.iceGatheringState ==
> "complete" I now call createOffer again. On the createOffer callback, I call
> pc.setLocalDescription with the new SDP.

OK, so that's what I would have expected based on my understanding of the
code.


> But pc.localDescription.sdp is not updated! When accessing it directly I get
> the old SDP. So I have no idea if this new SDP is applied or not?

Well, the SDP is identical except for the ICE candidates, and
the ICE candidates are not affected by SetLocal().



> Also, FWIW, onIceCandidate in Chrome is called for every candidate, trickle
> or not. 

Chrome only does trickle.


> I cannot find anything obvious in the specs that suggest what the
> correct behaviour here is (although I can understand the logic of it being
> additional trickle candidates only).

This is currently unclear in the specs but I believe the consensus at IETF 89
was to go to trickle only.
Put this somewhere other than my laptop. Not expected to work.
Work in progress; is ugly, but builds and mostly works. Need to chase down a remaining mochitest failure, and lots of cleanup.
Status: UNCONFIRMED → NEW
Ever confirmed: true
Getting closer to complete. Still kinda ugly in places (the way we give sipcc cycles is clunky).
Attachment #8405105 - Attachment is obsolete: true
Priority: -- → P2
Target Milestone: --- → mozilla33
Hey Byron -- What's the state of this?  And how important is it?
Flags: needinfo?(docfaraday)
Well, the correct fix is to flatten out the sipcc threading, which is a fair bit of work. It may take some time before I am able to finish that. It is possible that we could be dropping local candidates because of this, which could cause failures. I suppose we could add a telemetry probe to determine whether we need to add a bandaid fix (maybe just an artificial delay on the end of candidates signal).
Flags: needinfo?(docfaraday)
Attachment #8411400 - Attachment is obsolete: true
Working much better now. Still needs cleanup, and we may need to take some action to ensure interop.
Attachment #8455603 - Attachment is obsolete: true
This seems to work pretty well, although I have hit some interop problems:

1. TokBox's MCU seems to think we're Chrome since we do full trickle, and does some sort of unilateral bundle thing where it reuses the same port for every candidate. This does not work.

2. Since JSEP goes much more quickly/smoothly now, we are exposing problems in apprtc where it will reorder SDP and trickle candidates, seemingly at random (the candidates that are placed ahead of the SDP are usually not the first candidates). I have verified that everything is emitted in the correct order on the browser, so this seems to be apprtc's problem.
A little cleanup. Still kinda ugly in that we need to call GiveCCAppCycles_m() at the appropriate places, and that determination is hard to make. Ideally, we'd want to hack away at sipcc until there weren't any message queues left, and get rid of the vast majority of the wrapper code.
Attachment #8459040 - Attachment is obsolete: true
Depends on: 1041832
Attachment #8459897 - Attachment is obsolete: true
Byron I assume I need both patches to give trickle a try, but the "Route ICE gathering..." seems to be rotten. Can you un-rot it please?
Flags: needinfo?(docfaraday)
Actually, that patch is stale. Let me remove it. Just use the other one.
Flags: needinfo?(docfaraday)
Attachment #8404123 - Attachment is obsolete: true
Looks like your path is missing a file. I get this when I do './mach build'

Reference to a file that doesn't exist in UNIFIED_SOURCES (/Users/nohlmeier/src/mozilla-central/media/webrtc/signaling/src/softphonewrapper/CC_SIPCCThread.cpp)
Flags: needinfo?(docfaraday)
Target Milestone: mozilla33 → mozilla35
Adding some files I missed.
Attachment #8461657 - Attachment is obsolete: true
Ok, this should work better.
Flags: needinfo?(docfaraday)
Attachment #8463499 - Flags: feedback?(ethanhugg)
Attachment #8463499 - Flags: feedback?(adam)
This shouldn't have been moved out of Fx34.
Priority: P2 → P1
Target Milestone: mozilla35 → mozilla34
Comment on attachment 8463499 [details] [diff] [review]
(WIP) Get rid of some sipcc threads, and run all sipcc logic on main.

Review of attachment 8463499 [details] [diff] [review]:
-----------------------------------------------------------------

I think this idea would work.

::: media/webrtc/signaling/src/sipcc/core/src-common/misc_apps_task.c
@@ +49,2 @@
>  {
> +    static const char fname[] = "MiscApp_task_prepare";

"MiscApp_prepare_task" although you could replace the fnames with __FUNCTION__ below since you're already making changes here.

::: media/webrtc/signaling/src/softphonewrapper/CC_SIPCCThread.cpp
@@ +1,1 @@
> +#include "CC_SIPCCThread.h"

I assume this file needs the same license header as the .h has.

@@ +1,3 @@
> +#include "CC_SIPCCThread.h"
> +
> +#include "CSFLog.h"

I think you'll need to add this file to the media/webrtc/moz.build signaling_non_unified_sources_2 section because of this inclusion.  At least I needed to do that on Linux.
Attachment #8463499 - Flags: feedback?(ethanhugg) → feedback+
Incorporate feedback.
Attachment #8463499 - Attachment is obsolete: true
Attachment #8463499 - Flags: feedback?(adam)
Comment on attachment 8464859 [details] [diff] [review]
(WIP) Get rid of some sipcc threads, and run all sipcc logic on main.

Review of attachment 8464859 [details] [diff] [review]:
-----------------------------------------------------------------

One thing that I'd like to have feedback on is where we call GiveCCappCycles; right now it is in the c++ wrapper code, but the way this code is written (lots of mutex locking) might make this prone to deadlock. On the other hand, calling it in PeerConnectionImpl etc is less deadlock-prone, but more prone to developer error (ie; adding some call into sipcc but forgetting to run the pump). Not sure what the lesser of two evils is here, but it feels like putting it in the wrapper is probably the right thing to do, we just have to be careful to get it right the first time.
Attachment #8464859 - Flags: feedback?(adam)
(In reply to Byron Campen [:bwc] from comment #30)
> One thing that I'd like to have feedback on is where we call
> GiveCCappCycles; right now it is in the c++ wrapper code, but the way this
> code is written (lots of mutex locking) might make this prone to deadlock.
> On the other hand, calling it in PeerConnectionImpl etc is less
> deadlock-prone, but more prone to developer error (ie; adding some call into
> sipcc but forgetting to run the pump). Not sure what the lesser of two evils
> is here, but it feels like putting it in the wrapper is probably the right
> thing to do, we just have to be careful to get it right the first time.

I'm still ingesting the patch, so I might be missing some of what you're trying to accomplish here. (FWIW, I think it would make things much easier for everyone if you broke this up into somewhat smaller patches -- there isn't any reason, for example, to roll the PeerConnection.js changes in with the threading refactor).

In any case, both of the approaches you mention seem to be vary labor intensive and difficult to maintain. Is there any reason we can't do an async dispatch of GiveCCappCycles back to the main thread from ccappTaskSendMsg(), so that every time we queue a message for sipcc to handle, there's a matching runnable in the main thread's queue? Yes, you'd need to wrap this dispatch in a C wrapper, but that just seems much cleaner than manually turning the crank piecemeal for each of these methods.

Not clearing the f?, since I really haven't looked at this in depth yet.
I'm trying to accomplish a number of things here, only one of which is fixing this bug (caused by the fact that gathered candidates are plumbed through sipcc, but gathering state changes are not). The biggest concern at the moment is the long delays caused by the synchronous dispatched sprinkled throughout VcmSIPCCBinding (I believe this to be the primary cause of our connectivity failures with tokbox). Adding an async message-processing pump would be ok, I think, but it could still get bogged down if main gets busy (say, generating a dozen DTLS identities for a webrtc conference call).

FWIW, my plan is to rip out the stuff in sipcc that we don't use, and refactor to simplify, so this will be a transitional form.
Some of the init logic was being run on the SIP thread, so I've moved it.
Attachment #8464859 - Attachment is obsolete: true
Attachment #8464859 - Flags: feedback?(adam)
Tests pass for me, need to verify other platforms, and then determine how we want to refactor.
Finally, fix the premature end-of-candidates signal. Also, revert an earlier change that broke some of the dtls mochitests.
Attachment #8467339 - Attachment is obsolete: true
Attachment #8468680 - Attachment is obsolete: true
Attachment #8468696 - Flags: feedback?(martin.thomson)
Attachment #8468696 - Flags: feedback?(adam)
Comment on attachment 8468696 [details] [diff] [review]
Part 0: Stop sync waiting until GMP is ready, since this happens on main now and will deadlock.

Review of attachment 8468696 [details] [diff] [review]:
-----------------------------------------------------------------

On the right track.  The lack of a positive signal from GMP when it is ready bothers me a lot though.

::: media/webrtc/signaling/src/media/VcmSIPCCBinding.cpp
@@ +243,3 @@
>  {
> +  // This code assumes that GMP has been initted, specifically by
> +  // PeerConnectionCtx::initGMP()

This looks like commented out code. Might want to wrap sooner.

@@ +248,3 @@
>    }
>  
> +  if (gSelf->mGMPService) {

the return early pattern will make this entire block much easier to follow

if (!gSelf->mGMPService) {
  return 0;
}

if (NS_FAILED(rv) || !hasPlugin) {
  return 0;
}

::: media/webrtc/signaling/src/peerconnection/PeerConnectionCtx.cpp
@@ +476,5 @@
> +  PeerConnectionCtx::gMainThread->Dispatch(WrapRunnableNM(&GMPReady_m),
> +                                           NS_DISPATCH_NORMAL);
> +};
> +
> +bool PeerConnectionCtx::initGMP()

No point having a return value if you are calling this like above.

@@ +492,5 @@
> +    return false;
> +  }
> +
> +  // presumes that all GMP dir scans have been queued for the GMPThread
> +  thread->Dispatch(WrapRunnableNM(&GMPReady), NS_DISPATCH_NORMAL);

I think that the idea of relying on the side-effects of do_GetService above makes me very nervous.  What if the init of GMP requires a dispatch to another thread to complete?

Can't you modify the API of the GMP service to provide a notification when it is done?  That's more work, but seems less fragile.

::: media/webrtc/signaling/src/peerconnection/PeerConnectionCtx.h
@@ +146,5 @@
> +  // offer creation (or SetRemote, when we're the answerer) until all of this is
> +  // ready to go, since blocking on this init is just begging for deadlock.
> +  nsCOMPtr<mozIGeckoMediaPluginService> mGMPService;
> +  bool mGMPReady;
> +  nsTArray<nsRefPtr<nsIRunnable>> mQueuedJSEPOperations;

Do we need to have a list here?  I know that generality is appealing, but the javascript side of things is holding its own queue, so we should only have a single outstanding action to complete.
Attachment #8468696 - Flags: feedback?(martin.thomson) → feedback+
(In reply to Martin Thomson [:mt] from comment #38)
> @@ +492,5 @@
> > +    return false;
> > +  }
> > +
> > +  // presumes that all GMP dir scans have been queued for the GMPThread
> > +  thread->Dispatch(WrapRunnableNM(&GMPReady), NS_DISPATCH_NORMAL);
> 
> I think that the idea of relying on the side-effects of do_GetService above
> makes me very nervous.  What if the init of GMP requires a dispatch to
> another thread to complete?
> 
> Can't you modify the API of the GMP service to provide a notification when
> it is done?  That's more work, but seems less fragile.
> 

   I could do that, but it is going to push back the sipcc thread flattening work. Maybe we can put in a separate bug?

> ::: media/webrtc/signaling/src/peerconnection/PeerConnectionCtx.h
> @@ +146,5 @@
> > +  // offer creation (or SetRemote, when we're the answerer) until all of this is
> > +  // ready to go, since blocking on this init is just begging for deadlock.
> > +  nsCOMPtr<mozIGeckoMediaPluginService> mGMPService;
> > +  bool mGMPReady;
> > +  nsTArray<nsRefPtr<nsIRunnable>> mQueuedJSEPOperations;
> 
> Do we need to have a list here?  I know that generality is appealing, but
> the javascript side of things is holding its own queue, so we should only
> have a single outstanding action to complete.

This is a singleton, so we could have multiple PCs trying to spin up at once.
Another test fix
Attachment #8468696 - Attachment is obsolete: true
Attachment #8468696 - Flags: feedback?(adam)
Break the pc.js queue simplification out into its own patch.
Break the code that implements full trickle behavior out into its own patch. Note that this makes bug 991037 happen much more reliably, since the candidate buffer is in PeerConnectionImpl instead of down in sipcc, meaning that when the local description is set, the full set of existing candidates is flung at sipcc at once, and all must beat the gathering complete signal, instead of just some small subset of the server reflexives.
Attachment #8467340 - Attachment is obsolete: true
The actual thread-flattening work, with some cleanup.
Attachment #8468701 - Attachment is obsolete: true
Attachment #8468912 - Attachment is obsolete: true
Attachment #8468740 - Attachment is obsolete: true
Attachment #8468899 - Attachment is obsolete: true
Attachment #8468901 - Flags: review?(martin.thomson)
Attachment #8468910 - Attachment is obsolete: true
Attachment #8468931 - Attachment is obsolete: true
Remove an unnecessary include, since informing sipcc of found ICE candidates is now the responsibility of PeerConnectionImpl, along with the rest of the JSEP stuff.
Attachment #8468933 - Attachment is obsolete: true
Attachment #8468913 - Attachment is obsolete: true
Comment on attachment 8468901 [details] [diff] [review]
Part 1: Simplify the command-queueing logic in pc.js, and make it less racy.

Review of attachment 8468901 [details] [diff] [review]:
-----------------------------------------------------------------

Looks fine, modulo the below.

I've considered rolling executeNext into callCB, but that can wait.

::: dom/media/PeerConnection.js
@@ +623,5 @@
>      this._impl.setLocalDescription(type, sdp);
>    },
>  
>    setRemoteDescription: function(desc, onSuccess, onError) {
> +    this._remoteType = desc.type;

Why did you move this forward?  If this fails, don't you need to reset the remote type?  Or is the PC so screwed up at that point it's not worth even trying to recover?
Attachment #8468901 - Flags: review?(martin.thomson) → review+
Attachment #8468937 - Attachment is obsolete: true
(In reply to Martin Thomson [:mt] from comment #51)
> Comment on attachment 8468901 [details] [diff] [review]
> Part 1: Simplify the command-queueing logic in pc.js, and make it less racy.
> 
> Review of attachment 8468901 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> Looks fine, modulo the below.
> 
> I've considered rolling executeNext into callCB, but that can wait.
> 

   I actually tried that, and in some cases we don't actually want to do it.

> ::: dom/media/PeerConnection.js
> @@ +623,5 @@
> >      this._impl.setLocalDescription(type, sdp);
> >    },
> >  
> >    setRemoteDescription: function(desc, onSuccess, onError) {
> > +    this._remoteType = desc.type;
> 
> Why did you move this forward?  If this fails, don't you need to reset the
> remote type?  Or is the PC so screwed up at that point it's not worth even
> trying to recover?

Well, I ripped it out of the message queueing because it simplified things considerably. Its only purpose is to determine where to store the SDP when setRemote/setLocal succeed, so I guess it doesn't matter what the value is if either fails; it will get reset to the appropriate value if either is called again. But, in the interest of consistency, I can restore to default in the event that we get a failure callback.
Attachment #8469399 - Attachment is obsolete: true
Attachment #8468901 - Attachment is obsolete: true
Fix bug where we weren't filtering trickle candidates for m-lines we weren't actually using.
Attachment #8469395 - Attachment is obsolete: true
Attachment #8469597 - Attachment is obsolete: true
Attachment #8469595 - Attachment is obsolete: true
Comment on attachment 8469643 [details] [diff] [review]
Part 1: Simplify the command-queueing logic in pc.js, and make it less racy.

Review of attachment 8469643 [details] [diff] [review]:
-----------------------------------------------------------------

Carry forward r=mt
Attachment #8469643 - Flags: review+
Make sure we're not closed before processing found ICE candidates.
Attachment #8469639 - Attachment is obsolete: true
Comment on attachment 8469661 [details] [diff] [review]
Part 2: Convert over to full trickle, which allows some simplification of code, and makes the following work much easier.

Review of attachment 8469661 [details] [diff] [review]:
-----------------------------------------------------------------

Trying something different with this request; I've sprinkled some comments around that should assist in understanding why I made the changes I made. Let's see if the tool barfs...

::: dom/media/PeerConnection.js
@@ -1046,5 @@
> -    if (this._dompc._iceGatheringState == "complete") {
> -        // If we are not trickling or we completed gathering prior
> -        // to setLocal, then trigger a call of onicecandidate here.
> -        this.foundIceCandidate(null);
> -    }

We decide when to send the end-of-candidates signal in c++ now, since it is a bit easier.

::: media/mtransport/nricemediastream.cpp
@@ -352,5 @@
> -      // draft-ivov-mmusic-trickle-ice-01.txt says to use port 9
> -      // but "::" instead of "0.0.0.0". Since we don't do any
> -      // IPv6 we are ignoring that for now.
> -      *addrp = "0.0.0.0";
> -      *portp = 9;

This bit has been moved to VcmSIPCCBinding, the rest has been discarded.

::: media/mtransport/nricemediastream.h
@@ -118,5 @@
> -class NrIceOpaque {
> - public:
> -  virtual ~NrIceOpaque() {}
> -};
> -

VcmSIPCCBinding is no longer in charge of informing sipcc of gathered ICE candidates, so it doesn't need a call handle anymore, meaning no need for this opaque stuff.

::: media/webrtc/signaling/include/CC_Call.h
@@ +298,5 @@
>  
>          virtual void addICECandidate(const std::string & candidate, const std::string & mid, unsigned short level, Timecard *) = 0;
>  
> +        virtual void foundICECandidate(const std::string & candidate, const std::string & mid, unsigned short level, Timecard *) = 0;
> +

We now have a full JSEP API here, and the only user is PeerConnectionImpl.

::: media/webrtc/signaling/src/media/VcmSIPCCBinding.cpp
@@ +526,5 @@
> +                                 uint16_t level) {
> +  sipcc::PeerConnectionWrapper pc(peerconnection.c_str());
> +  ENSURE_PC_NO_RET(pc, peerconnection.c_str());
> +
> +  pc.impl()->OnNewMline(level);

We need to tell PCImpl about m-lines so it can suppress trickle of candidates with a level that we aren't using.

@@ -581,5 @@
> - *  @param[out] ctx - the NrIceCtx
> - *  @param[out] stream - the NrIceStream
> - */
> -
> -static short vcmGetIceStream_m(cc_mcapid_t mcap_id,

This allowed VcmSIPCCBinding to register for the candidate signal, and set up the VcmOpaque, neither of which it needs to do anymore.

@@ -629,5 @@
> - *
> - *  @return 0 for success; VCM_ERROR for failure
> - *
> - */
> -static short vcmRxAllocICE_s(TemporaryRef<NrIceCtx> ctx_in,

We no longer need to run any of this on STS, since we aren't asking the NrIceMediaStream for the current set of candidates anymore.

@@ +568,5 @@
> +  // draft-ivov-mmusic-trickle-ice-01.txt says to use port 9
> +  // but "::" instead of "0.0.0.0". Since we don't do any
> +  // IPv6 we are ignoring that for now.
> +  std::string default_addr("0.0.0.0");
> +  int default_port = 9;

This used to live in NrIceMediaStream::GetDefaultCandidate

::: media/webrtc/signaling/src/peerconnection/PeerConnectionImpl.cpp
@@ +2055,5 @@
> +  // end-of-candidates event in IceGatheringStateChange_m.
> +  if (mIceGatheringState == PCImplIceGatheringState::Complete &&
> +      !mCandidateBuffer.empty()) {
> +    SendEndOfCandidates();
> +  }

We check whether it is time to send the end of candidates in two places:

1) When setLocal succeeds (here)
2) When the gathering state transitions to completed

@@ +2074,5 @@
> +
> +void
> +PeerConnectionImpl::SendEndOfCandidates() {
> +  // We dispatch this because real candidates do a dispatch in
> +  // PeerConnection::onCallEvent, and we don't want this to jump ahead.

Note: In this patch, this is still a race we lose, because the real candidates go through sipcc. The subsequent patch closes the gap, with a single dispatch in PeerConnection::onCallEvent for real candidates, and a single dispatch here for the end-of-candidates.

::: media/webrtc/signaling/src/peerconnection/PeerConnectionImpl.h
@@ +725,5 @@
>  
>    bool mHaveDataStream;
>  
> +  // The number of times we've called addStream() on sipcc (we do not decrement
> +  // for removeStream(), since that won't result in removal of an m-line)

Ok, this comment is wrong. Will fix after the initial round of review, since I don't want to re-type all of these notes for you.

::: media/webrtc/signaling/src/sipcc/core/gsm/fsmdef.c
@@ -1319,5 @@
>  
>      dcb->digest_alg[0] = '\0';
>      dcb->digest[0] = '\0';
> -
> -    sll_lite_init(&dcb->candidate_list);

This is the old pre-local-SDP holding tank for candidates that we don't need anymore.
Attachment #8469661 - Flags: feedback?(martin.thomson)
Attachment #8469594 - Flags: review?(martin.thomson)
Attachment #8469594 - Flags: review?(martin.thomson) → review+
Attachment #8469661 - Attachment is obsolete: true
Attachment #8469661 - Flags: feedback?(martin.thomson)
Rip out the windows message queue logic, and use the same implementation on all platforms.
Attachment #8468938 - Attachment is obsolete: true
Comment on attachment 8469661 [details] [diff] [review]
Part 2: Convert over to full trickle, which allows some simplification of code, and makes the following work much easier.

Review of attachment 8469661 [details] [diff] [review]:
-----------------------------------------------------------------

I can't see any problems here, but I see that this is now obsolete.

::: dom/media/PeerConnection.js
@@ +1077,5 @@
>  
>    onIceCandidate: function(level, mid, candidate) {
> +    this._dompc.logWarning("Got candidate: " + candidate, null, 0);
> +    if (candidate == "") {
> +      this.foundIceCandidate(null);

I think that this will go away at some point in the very near future.  Spec editors are being slow though.

::: media/mtransport/nricemediastream.h
@@ +169,5 @@
>    // might be holding RefPtrs but we want those writes to fail once
>    // the context has been destroyed.
>    void Close();
>  
> +  void SetLevel(uint16_t level) { level_ = level; }

Comment required here.  I assume that level == m= section index?  And the usual complaints regarding indexing apply (we start at 1 here).

::: media/webrtc/signaling/src/media/VcmSIPCCBinding.cpp
@@ +526,5 @@
> +                                 uint16_t level) {
> +  sipcc::PeerConnectionWrapper pc(peerconnection.c_str());
> +  ENSURE_PC_NO_RET(pc, peerconnection.c_str());
> +
> +  pc.impl()->OnNewMline(level);

Why don't we just not generate candidates at the levels we don't need?

::: media/webrtc/signaling/src/peerconnection/PeerConnectionImpl.cpp
@@ +2017,5 @@
> +                                     uint16_t level)
> +{
> +  ASSERT_ON_THREAD(mSTSThread);
> +  nsRefPtr<PeerConnectionImpl> pc(this);
> +  mThread->Dispatch(

This is inconsistent with other dispatches.  Do we have a single dispatch style yet?

@@ +2211,3 @@
>    }
>    WrappableJSErrorResult rv;
>    RUN_ON_THREAD(mThread,

What is the deal with these inconsistent dispatches?

::: media/webrtc/signaling/src/peerconnection/PeerConnectionMedia.cpp
@@ +219,5 @@
>      mIceStreams.push_back(dcStream);
>    }
>  
>    // TODO(ekr@rtfm.com): This is not connected to the PCCimpl.
>    // Will need to do that later.

Does this comment still apply?
(In reply to Martin Thomson [:mt] from comment #65)
> Comment on attachment 8469661 [details] [diff] [review]
> Part 2: Convert over to full trickle, which allows some simplification of
> code, and makes the following work much easier.
> 
> Review of attachment 8469661 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> I can't see any problems here, but I see that this is now obsolete.
> 
> ::: dom/media/PeerConnection.js
> @@ +1077,5 @@
> >  
> >    onIceCandidate: function(level, mid, candidate) {
> > +    this._dompc.logWarning("Got candidate: " + candidate, null, 0);
> > +    if (candidate == "") {
> > +      this.foundIceCandidate(null);
> 
> I think that this will go away at some point in the very near future.  Spec
> editors are being slow though.

   Really? What (if anything) will be in its place?

> ::: media/webrtc/signaling/src/media/VcmSIPCCBinding.cpp
> @@ +526,5 @@
> > +                                 uint16_t level) {
> > +  sipcc::PeerConnectionWrapper pc(peerconnection.c_str());
> > +  ENSURE_PC_NO_RET(pc, peerconnection.c_str());
> > +
> > +  pc.impl()->OnNewMline(level);
> 
> Why don't we just not generate candidates at the levels we don't need?
> 

   Because sipcc doesn't tell us how many m-lines it has until the local description is set, and we want to start gathering well before that. Ideally, sipcc would tell us when it decides it needs a new m-line, and we would spin up another NrIceMediaStream in response. This will probably be a work-item for multi-streaming support.


> ::: media/webrtc/signaling/src/peerconnection/PeerConnectionMedia.cpp
> @@ +219,5 @@
> >      mIceStreams.push_back(dcStream);
> >    }
> >  
> >    // TODO(ekr@rtfm.com): This is not connected to the PCCimpl.
> >    // Will need to do that later.
> 
> Does this comment still apply?

I have no idea what ekr meant by this comment.
I looked at the code, and I think ekr's comment refers to the fact that PeerConnectionMedia::IceStreamReady doesn't actually go anywhere:

http://dxr.mozilla.org/mozilla-central/source/media/webrtc/signaling/src/peerconnection/PeerConnectionMedia.cpp#529
Incorporate feedback. Will request review once everything settles down.
Attachment #8470228 - Attachment is obsolete: true
Comment on attachment 8470231 [details] [diff] [review]
Part 4: (WIP) Get rid of some sipcc threads, and run all sipcc logic on main.

Review of attachment 8470231 [details] [diff] [review]:
-----------------------------------------------------------------

::: media/webrtc/signaling/src/sipcc/cpr/android/cpr_android_init.c
@@ +174,5 @@
> +//    returnCode = pthread_mutex_init(&msgQueueListMutex, NULL);
> +//    if (returnCode != 0) {
> +//        CPR_ERROR("%s: MsgQueue Mutex init failure %d\n", fname, returnCode);
> +//        return CPR_FAILURE;
> +//    }

just use the delete key, don't be shy :)
Attachment #8470231 - Attachment is obsolete: true
Comment on attachment 8470986 [details] [diff] [review]
Part 4: (WIP) Get rid of some sipcc threads, and run all sipcc logic on main.

Review of attachment 8470986 [details] [diff] [review]:
-----------------------------------------------------------------

This is starting to shape up. I was forced to gut a lot of the message-queue threading primitives in order to get this to work on windows, but since we're using it all single-threaded anyway, it is not a big deal.
Attachment #8470986 - Flags: feedback?(martin.thomson)
Attachment #8470986 - Flags: feedback?(ethanhugg)
Attachment #8470986 - Flags: feedback?(adam)
When you say "shape up", does that mean it passes on try?  This is a very large changeset and I don't want to review this too many times.
It is passing on try, and not full of commented-out code.

https://tbpl.mozilla.org/?tree=Try&rev=660e608e382f
Comment on attachment 8470986 [details] [diff] [review]
Part 4: (WIP) Get rid of some sipcc threads, and run all sipcc logic on main.

Review of attachment 8470986 [details] [diff] [review]:
-----------------------------------------------------------------

This is on the right track.  I think that there are a few more things that could be deleted.

::: media/webrtc/signaling/src/media/VcmSIPCCBinding.cpp
@@ +444,5 @@
>  void vcmInit (void)
>  {
> +    // We do not do a thread assert here, since there are unit-tests that set
> +    // gMainThread to something other than main, but still call the init code
> +    // on main.

This function looks like a prime candidate for deletion to me.  Main advantage would be getting rid of the comment.

@@ +691,5 @@
> +    ice_media_stream(level-1);
> +  if (!stream)
> +    return VCM_ERROR;
> +
> +  nsresult rv = RUN_ON_THREAD(pc.impl()->media()->ice_ctx()->thread(),

train wreck

@@ +1980,1 @@
>  int vcmTxStartICE(cc_mcapid_t mcap_id,

I really hope that you haven't changed anything here.  I couldn't see any changes, but the diff is screwed up.

::: media/webrtc/signaling/src/peerconnection/PeerConnectionImpl.cpp
@@ +1762,5 @@
>  {
>    PC_AUTO_ENTER_API_CALL_NO_CHECK();
>  
> +  if (IsClosed()) {
> +    return NS_OK;

Thank you.

::: media/webrtc/signaling/src/sipcc/core/ccapp/ccapp_task.c
@@ +138,5 @@
>      }
>      return CPR_SUCCESS;
>  }
>  
> +int CCApp_service_queue()

A comment noting that this just pulls a single item from the queue and executes it might be worthwhile.

::: media/webrtc/signaling/src/sipcc/core/ccapp/ccapp_task.h
@@ +17,5 @@
>  
>  extern void addCcappListener(appListener* listener, int type);
>  appListener *getCcappListener(int type);
>  cpr_status_e ccappTaskSendMsg (uint32_t cmd, void *msg, uint16_t len, uint32_t usrInfo);
> +

Whitespace only change (an OK one).

::: media/webrtc/signaling/src/sipcc/core/ccapp/ccprovider.c
@@ +2169,5 @@
>      platform_initialized = FALSE;
>      CCAppShutdown();
> +    if (ccapp_thread) {
> +        (void)cprDestroyThread(ccapp_thread);
> +    }

Delete me.  And at this point, the function name above probably needs to change.

::: media/webrtc/signaling/src/sipcc/core/common/init.c
@@ +265,5 @@
>      debugInit();
>  
> +    CCApp_prepare_task();
> +    GSM_prepare_task();
> +#if 0

I almost missed this.  Please delete this block and the corresponding use of all these threads.  The code you've removed makes these dangerous to have lying around.

::: media/webrtc/signaling/src/sipcc/core/gsm/gsm.c
@@ +579,5 @@
>      gsm_shutdown();
>      dp_shutdown();
> +    if (gsm_thread) {
> +        (void) cprDestroyThread(gsm_thread);
> +    }

Delete me.  And change destroy_gsm_thread() to something more apt.

::: media/webrtc/signaling/src/sipcc/core/sipstack/ccsip_task.c
@@ +313,5 @@
>   * Return Value: SIP_OK or SIP_ERROR
>   *
>   */
>  cpr_status_e
>  SIPTaskSendMsg (uint32_t cmd, void *msg, uint16_t len, void *usr)

Delete me.

::: media/webrtc/signaling/src/sipcc/core/src-common/misc_apps_task.c
@@ +24,5 @@
>  void destroy_misc_app_thread(void);
>  extern cprThread_t misc_app_thread;
>  
>  cpr_status_e
>  MiscAppTaskSendMsg (uint32_t cmd, cprBuffer_t buf, uint16_t len)

I think that you need to delete this function.

::: media/webrtc/signaling/src/sipcc/cpr/android/cpr_android_init.c
@@ +174,5 @@
> +//    returnCode = pthread_mutex_init(&msgQueueListMutex, NULL);
> +//    if (returnCode != 0) {
> +//        CPR_ERROR("%s: MsgQueue Mutex init failure %d\n", fname, returnCode);
> +//        return CPR_FAILURE;
> +//    }

Delete all the codes.

::: media/webrtc/signaling/src/sipcc/cpr/common/cpr_ipc.c
@@ -327,5 @@
>          return CPR_FAILURE;
>      }
>  
> -#ifdef SIP_OS_WINDOWS
> -    ((cpr_msg_queue_t *)msgQueue)->handlePtr = thread;

I think that you need to delete this function too (cprSetMessageQueueThread).

@@ -402,5 @@
> -        }
> -    }
> -
> -    switch (msg.message) {
> -        case WM_CLOSE:

Note to self: check that gsm_shutdown(), sip_regmgr_destroy_cc_conns(), sip_shutdown() are all called properly at some appropriate stage.

::: media/webrtc/signaling/src/sipcc/cpr/linux/cpr_linux_init.c
@@ +172,1 @@
>  #if CPR_TIMERS_ENABLED

Do we even use this code with CPR_TIMERS_ENABLED on?  More code to delete?

::: media/webrtc/signaling/src/softphonewrapper/CC_SIPCCCall.cpp
@@ +41,5 @@
>      if(audioControl)
>      {
>           pMediaData->volume = audioControl->getDefaultVolume();
>      }
> +    GiveCCAppCycles_m();

I think that we need a better solution than this mess of calls to GiveCCAppCycles_m().  The immediate answer is a template, but I think that we can probably do better than that.

::: media/webrtc/signaling/src/softphonewrapper/CC_SIPCCThread.cpp
@@ +6,5 @@
> +
> +#include "CSFLog.h"
> +
> +extern "C" {
> +int CCApp_service_queue(int waitForMicroseconds);

ccapp_task.c doesn't include the waitForMicroseconds argument.  Am I missing something?

::: media/webrtc/signaling/src/softphonewrapper/CC_SIPCCThread.h
@@ +1,5 @@
> +/* This Source Code Form is subject to the terms of the Mozilla Public
> + * License, v. 2.0. If a copy of the MPL was not distributed with this
> + * file, You can obtain one at http://mozilla.org/MPL/2.0/. */
> +
> +#pragma once

I thought that we used #ifndef guards.  What's the rule here?
Attachment #8470986 - Flags: feedback?(martin.thomson) → feedback+
(In reply to Martin Thomson [:mt] from comment #74)
> Comment on attachment 8470986 [details] [diff] [review]
> Part 4: (WIP) Get rid of some sipcc threads, and run all sipcc logic on main.
> 
> Review of attachment 8470986 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> This is on the right track.  I think that there are a few more things that
> could be deleted.
> 

   I'll go purge a little more aggressively.

> ::: media/webrtc/signaling/src/media/VcmSIPCCBinding.cpp
> @@ +444,5 @@
> >  void vcmInit (void)
> >  {
> > +    // We do not do a thread assert here, since there are unit-tests that set
> > +    // gMainThread to something other than main, but still call the init code
> > +    // on main.
> 
> This function looks like a prime candidate for deletion to me.  Main
> advantage would be getting rid of the comment.
>

   Sure, but we're getting outside the scope of thread-flattening at that point. I could delete sipcc code all day every day for a week or two, but the time for that will come.

> @@ +691,5 @@
> > +    ice_media_stream(level-1);
> > +  if (!stream)
> > +    return VCM_ERROR;
> > +
> > +  nsresult rv = RUN_ON_THREAD(pc.impl()->media()->ice_ctx()->thread(),
> 
> train wreck
> 

   I can check to see what might make this better.

> @@ +1980,1 @@
> >  int vcmTxStartICE(cc_mcapid_t mcap_id,
> 
> I really hope that you haven't changed anything here.  I couldn't see any
> changes, but the diff is screwed up.
> 

   Yeah, I recommend applying the patches, and using git diff. That does a better job with a lot of this patch. (There's another nasty case of this in gsm.c that git diff --no-index -w displays well)

> ::: media/webrtc/signaling/src/sipcc/core/src-common/misc_apps_task.c
> @@ +24,5 @@
> >  void destroy_misc_app_thread(void);
> >  extern cprThread_t misc_app_thread;
> >  
> >  cpr_status_e
> >  MiscAppTaskSendMsg (uint32_t cmd, cprBuffer_t buf, uint16_t len)
> 
> I think that you need to delete this function.
>

   If I delete this, it's probably gonna cascade all over the codebase. We can do this later.

> @@ -402,5 @@
> > -        }
> > -    }
> > -
> > -    switch (msg.message) {
> > -        case WM_CLOSE:
> 
> Note to self: check that gsm_shutdown(), sip_regmgr_destroy_cc_conns(),
> sip_shutdown() are all called properly at some appropriate stage.
> 

   I'm reasonably certain we're ok here.

> ::: media/webrtc/signaling/src/sipcc/cpr/linux/cpr_linux_init.c
> @@ +172,1 @@
> >  #if CPR_TIMERS_ENABLED
> 
> Do we even use this code with CPR_TIMERS_ENABLED on?  More code to delete?
> 

   Maybe. Is this in scope?

> ::: media/webrtc/signaling/src/softphonewrapper/CC_SIPCCCall.cpp
> @@ +41,5 @@
> >      if(audioControl)
> >      {
> >           pMediaData->volume = audioControl->getDefaultVolume();
> >      }
> > +    GiveCCAppCycles_m();
> 
> I think that we need a better solution than this mess of calls to
> GiveCCAppCycles_m().  The immediate answer is a template, but I think that
> we can probably do better than that.
> 

   I've tried a few approaches, none of them awesome. We can either do this, or call GiveCCAppCycles_m in PeerConnectionImpl/Ctx, or maybe even wire up the message queue dispatch to cause an async call to GiveCCAppCycles_m. Just processing without queuing anything results in deadlocks, and just removing the locking is dicey too. I've resolved some of the deadlock problems, but I could take another crack at it. We should be wary that a lot of this code might not react correctly unless there is an unwinding of the stack. Ultimately, this is a transitional form.
 

> ::: media/webrtc/signaling/src/softphonewrapper/CC_SIPCCThread.cpp
> @@ +6,5 @@
> > +
> > +#include "CSFLog.h"
> > +
> > +extern "C" {
> > +int CCApp_service_queue(int waitForMicroseconds);
> 
> ccapp_task.c doesn't include the waitForMicroseconds argument.  Am I missing
> something?
> 

   Nope, just didn't propagate a change everywhere.

> ::: media/webrtc/signaling/src/softphonewrapper/CC_SIPCCThread.h
> @@ +1,5 @@
> > +/* This Source Code Form is subject to the terms of the Mozilla Public
> > + * License, v. 2.0. If a copy of the MPL was not distributed with this
> > + * file, You can obtain one at http://mozilla.org/MPL/2.0/. */
> > +
> > +#pragma once
> 
> I thought that we used #ifndef guards.  What's the rule here?

   I was just using the same convention used in the rest of softphonewrapper.
Attachment #8470252 - Flags: review?(martin.thomson)
Attachment #8470252 - Flags: review?(martin.thomson) → review+
Comment on attachment 8470986 [details] [diff] [review]
Part 4: (WIP) Get rid of some sipcc threads, and run all sipcc logic on main.

Review of attachment 8470986 [details] [diff] [review]:
-----------------------------------------------------------------

This looks fine but I couldn't get it to work with pc_test, does it work for you?  I am getting trickle failed messages like this:

ICE-PEER(PC:1408394452839897 (id=25 url=http://mozilla.github.io/webrtc-landing/pc_test.html):default)/STREAM(1408394452839897 (id=25 url=http://mozilla.github.io/webrtc-landing/pc_test.html): stream1/audio)/COMP(1): All pairs are failed, and grace period has elapsed. Marking component as failed.
Attachment #8470986 - Flags: feedback?(ethanhugg) → feedback+
(In reply to Ethan Hugg [:ehugg] from comment #76)
> Comment on attachment 8470986 [details] [diff] [review]
> Part 4: (WIP) Get rid of some sipcc threads, and run all sipcc logic on main.
> 
> Review of attachment 8470986 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> This looks fine but I couldn't get it to work with pc_test, does it work for
> you?  I am getting trickle failed messages like this:
> 
> ICE-PEER(PC:1408394452839897 (id=25
> url=http://mozilla.github.io/webrtc-landing/pc_test.html):default)/
> STREAM(1408394452839897 (id=25
> url=http://mozilla.github.io/webrtc-landing/pc_test.html):
> stream1/audio)/COMP(1): All pairs are failed, and grace period has elapsed.
> Marking component as failed.

   I don't think that page actually does trickle candidates. This would probably need to be fixed also.
Depends on: 1055260
Comment on attachment 8468911 [details] [diff] [review]
Part 3: (WIP) Bring signaling_unittests up-to-date with the full-trickle behavior.

Review of attachment 8468911 [details] [diff] [review]:
-----------------------------------------------------------------

So, this test case probably needs a pretty comprehensive rework, since the whole thing was written around the old partial trickle behavior. I don't think I'm going to have time to do this (and get the more exhaustive changes reviewed) before Thursday, and we really do need to get this stuff landed by then.

What's your take?
Attachment #8468911 - Flags: feedback?(martin.thomson)
Attachment #8470986 - Attachment is obsolete: true
Attachment #8470986 - Flags: feedback?(adam)
Comment on attachment 8468911 [details] [diff] [review]
Part 3: (WIP) Bring signaling_unittests up-to-date with the full-trickle behavior.

Review of attachment 8468911 [details] [diff] [review]:
-----------------------------------------------------------------

Looks sane.  I think that you probably want to include a test case that covers the non-trickle scenario.  That is, where you wait for the candidates to all appear before you proceed.  I'm not sure how you would implement that, but it's almost certainly something you can defer to another bug.

Most of this stuff can be handled by opening new bugs for the specific things you think need fixing.  I think that we can stand to risk having some corner cases regress as long as we have a plan to fix them promptly.  And as long as they are small enough and the fix really is prompt.  I don't see anything here that can't be addressed by just hacking on things until they work.  Getting the localDescription and remoteDescription attributes right is probably the one that stands out as most important here.

::: media/webrtc/signaling/test/signaling_unittests.cpp
@@ +1274,5 @@
> +      test_utils->sts_target(),
> +      WrapRunnable(&trickler,
> +                   &SignalingAgent::GetCandidatesByLevel_s,
> +                   &candidates));
> +    DoTrickleIceChrome(candidates);

Code duplication here between the Chrome and non-Chrome paths.  A boolean seems appropriate.  (Yeah, I know you didn't start it, but...)

@@ +2431,5 @@
>    a1_->CloseSendStreams();
>    a2_->CloseReceiveStreams();
>  }
>  
> +TEST_F(SignalingTest, DISABLED_FullCallTrickle)

Is this one that you plan to delete?

@@ +2734,5 @@
>  
>  }
>  
> +// SIPCC is not updating the SDP consistently when trickle is involved.
> +TEST_F(SignalingTest, DISABLED_OfferAnswerCheckDescriptions)

Whack a bug number on it and move on ;)
Attachment #8468911 - Flags: feedback?(martin.thomson) → feedback+
Attachment #8468911 - Attachment is obsolete: true
Attachment #8475445 - Attachment is obsolete: true
Comment on attachment 8475447 [details] [diff] [review]
Part 3: Bring signaling_unittests up-to-date with the full-trickle behavior, and a fair bit of cleanup.

Review of attachment 8475447 [details] [diff] [review]:
-----------------------------------------------------------------

Ok, I've done all that I can to make this diff display sensibly. Mostly readable. Some guideposts sprinkled throughout.

::: media/webrtc/signaling/test/signaling_unittests.cpp
@@ +523,2 @@
>    {
> +    for (auto it = sdp_lines_.begin(); it != sdp_lines_.end() && limit;) {

We don't need a "fast" map here, since at most we're talking about 40 entries or so.

@@ -557,4 @@
>        }
> -      std::string value = content.substr(objType.length() + 1);
> -      sdp_map_.insert(std::pair<std::string, SdpLine>(objType,
> -        std::make_pair(line_no,value)));

MakeKeyValue takes care of parsing |content| now.

@@ +570,5 @@
>      std::string key;
>      std::string value;
>      if(whiteSpace == std::string::npos) {
> +      //this is the line with no extra contents
> +      //example, v=0, a=sendrecv

This came from a copy/paste of this code in |Parse|

@@ +575,2 @@
>        key = content.substr(0,  content.size() - 2);
> +      value = "\r\n"; // Checking code assumes this is here.

There was a fair bit of special case logic around re-adding this; just going to be consistent and say there is always a trailing \r\n

@@ +595,5 @@
> +    }
> +    return sdp_lines_.end();
> +  }
> +
> +  void InsertLineAfter(const std::string &objType,

This is used to insert a=candidate lines.

@@ -624,5 @@
> -      } else {
> -        key = line.substr(0, whiteSpace);
> -        //<line_no>:<value>
> -        value = line.substr(whiteSpace+1);
> -      }

Copy/paste code, lives in |MakeKeyValue| now.

@@ -636,5 @@
> -        std::string cand = line.substr(2, line.size() - 4);
> -        ice_candidates_.insert(std::pair<int, std::string>(levels_, cand));
> -       } else {
> -        sdp_without_ice_ += line;
> -      }

We don't need quick access any longer, since full trickle happens on its own.

@@ -640,5 @@
> -      }
> -      if (line.find("m=") == 0) {
> -        // This is an m-line
> -        ++levels_;
> -      }

Don't need this either.

@@ -655,5 @@
> -
> -      SdpLine sdp_line_pair = (*it).second;
> -      std::string value;
> -      if(sdp_line_pair.second.length() == 0) {
> -        value = (*it).first + "\r\n";

Here's some of that special case \r\n code I was talking about earlier.

@@ -1141,5 @@
>      ASSERT_EQ(signaling_state(), endState);
>      offer_ = pObserver->lastString;
>    }
>  
> -void CreateAnswer(std::string offer,

This was unused.

@@ +1225,5 @@
> +    NORMAL_ENCODING,
> +    CHROME_ENCODING
> +  } TrickleEncoding;
> +
> +  void ReceiveTrickleCandidates(

This is a rename/refactor/merge of |DoTrickleIce|/|DoTrickleIceChrome|.

@@ -1658,5 @@
>      a1_->CreateOffer(options, OFFER_AV, sdpCheck);
>      a1_->SetLocal(TestObserver::OFFER, a1_->offer());
>    }
>  
> -  void OfferAnswer(sipcc::OfferOptions& options,

Refactored this code into four separate functions:
Offer
Answer
Trickle
WaitForCompleted

|OfferAnswer| is now a pretty small function that calls those four.

@@ +1743,5 @@
> +  }
> +
> +  void OfferAnswer(sipcc::OfferOptions& options,
> +                   uint32_t offerAnswerFlags,
> +                   bool finishAfterAnswer,

I'd like to remove this and have the tests using it simply call Offer, but that'll be later.

@@ +1785,4 @@
>    }
>  
> +  void OfferAnswerTrickleChrome(sipcc::OfferOptions& options,
> +                                uint32_t offerAnswerFlags,

No longer makes assumptions about media flags.

@@ +1879,5 @@
>      // Double-check that the offered SDP matches what we expect
>      CheckRtcpFbSdp(sdpWrapper.getSdp(), feedback);
>  
>      a2_->SetRemote(TestObserver::OFFER, sdpWrapper.getSdp());
> +    a2_->CreateAnswer(OFFER_AV | ANSWER_AV);

Lots of simple replacement follows.
Attachment #8475447 - Flags: review?(martin.thomson)
Blocks: 1055852
Comment on attachment 8475447 [details] [diff] [review]
Part 3: Bring signaling_unittests up-to-date with the full-trickle behavior, and a fair bit of cleanup.

Review of attachment 8475447 [details] [diff] [review]:
-----------------------------------------------------------------

This is a pretty big improvement on what was there.

::: media/webrtc/signaling/test/signaling_unittests.cpp
@@ +512,5 @@
>  
>  class ParsedSDP {
>   public:
> +
> +  ParsedSDP(const std::string &sdp)

This class is plain weird.

I can see why you would split the SDP into lines, but to then split it into everything before the first space and everything after is pretty strange.  Maybe I can understand the '=', or even ':' for 'a=' lines, but this...

Not that I'm going to ask you to fix it or anything, just shaking my head.

@@ +535,3 @@
>    {
> +    DeleteLines(objType);
> +  }

With the default argument on DeleteLines, is this really needed?

@@ +568,5 @@
>    {
>      size_t whiteSpace = content.find(' ');
>      std::string key;
>      std::string value;
>      if(whiteSpace == std::string::npos) {

ws after if

@@ +659,5 @@
> +  {
> +    std::string candidate_attribute("a=" + candidate + "\r\n");
> +    // InsertLineAfter is 0 indexed, but level is 1 indexed
> +    // This assumes that we have only media-level c lines.
> +    InsertLineAfter("c=IN", candidate_attribute, level - 1);

Much nicer, thanks.

@@ +1251,5 @@
>    }
>  
> +  void ReceiveTrickleCandidatesChrome(const std::multimap<int, std::string> &candidates) {
> +    return ReceiveTrickleCandidates(candidates, CHROME_ENCODING);
> +  }

I'm not a big fan of rerouting functions like this, I'd have preferred to see the CHROME_ENCODING thing all the way at the top (in the TEST_F call even).

@@ +1280,5 @@
> +    trickler.GetCandidatesByLevel(&candidates);
> +    ReceiveTrickleCandidates(candidates);
> +  }
> +
> +  void ReceiveTrickleCandidatesChrome(SignalingAgent &trickler) {

See above.  Still duplicating code here.

@@ +1743,5 @@
> +  }
> +
> +  void OfferAnswer(sipcc::OfferOptions& options,
> +                   uint32_t offerAnswerFlags,
> +                   bool finishAfterAnswer,

I only found one instance of this being false.

@@ +1754,5 @@
> +    if(finishAfterAnswer) {
> +      Answer(options, offerAnswerFlags, answerSdpCheck, trickleType);
> +      Trickle(trickleType);
> +      WaitForCompleted();
> +    }

Some tests close out streams, other don't.  I assume that the destructors take care of that for us.

@@ +1769,4 @@
>      a2_->SetLocal(TestObserver::ANSWER, a2_->answer());
>      ParsedSDP sdpWrapper(a2_->answer());
>      sdpWrapper.ReplaceLine("m=audio", "m=audio 65375 RTP/SAVPF 109 8 101\r\n");
>      sdpWrapper.AddLine("a=rtpmap:8 PCMA/8000\r\n");

Since this function is only used in the one place, I think that inlining this is appropriate.  We probably don't need to test all the different combinations here.  That should remove the need for a branch too.

This is also a very specific modification.  Would you mind changing the name of the test function to match?  OfferAndAnswerWithExtraCodec perhaps.
Attachment #8475447 - Flags: review?(martin.thomson) → review+
Simplify considerably; no more message queues.
Attachment #8474819 - Attachment is obsolete: true
Attachment #8475447 - Attachment is obsolete: true
Comment on attachment 8476279 [details] [diff] [review]
Part 3: Bring signaling_unittests up-to-date with the full-trickle behavior, and a fair bit of cleanup.

Review of attachment 8476279 [details] [diff] [review]:
-----------------------------------------------------------------

Carry forward r=mt
Attachment #8476279 - Flags: review+
Attachment #8476218 - Attachment is obsolete: true
Prevent message queues from being created.
Attachment #8476282 - Attachment is obsolete: true
Comment on attachment 8476364 [details] [diff] [review]
Part 4: Get rid of the sipcc threads/queues, and run all sipcc logic on main.

Review of attachment 8476364 [details] [diff] [review]:
-----------------------------------------------------------------

You'll want to pull down this patch queue and use git diff --no-index -w, since that makes a lot of this much easier to view.

::: media/webrtc/signaling/src/media/VcmSIPCCBinding.cpp
@@ -505,5 @@
> -                                 uint16_t level) {
> -  sipcc::PeerConnectionWrapper pc(peerconnection.c_str());
> -  ENSURE_PC_NO_RET(pc, peerconnection.c_str());
> -
> -  pc.impl()->OnNewMline(level);

We call this directly now.

@@ -737,5 @@
> -static short vcmSetIceCandidate_m(const char *peerconnection,
> -                                  const char *icecandidate,
> -                                  uint16_t level)
> -{
> -  CSFLogDebug( logTag, "%s: PC = %s", __FUNCTION__, peerconnection);

Moved code. Just use git diff --no-index -w. Lots of similar stuff in this patch.

::: media/webrtc/signaling/src/peerconnection/PeerConnectionCtx.cpp
@@ +397,5 @@
>    codecMask |= VCM_CODEC_RESOURCE_VP8;
>    //codecMask |= VCM_CODEC_RESOURCE_I420;
>    mCCM->setVideoCodecs(codecMask);
> +  mCCM->addCCObserver(this);
> +  ChangeSipccState(dom::PCImplSipccState::Starting);

We need to do these before |startSDPMode|, since that fires a callback at us synchronously.

::: media/webrtc/signaling/src/sipcc/core/ccapp/ccapp_task.c
@@ +126,2 @@
>  
> +    CCAPP_DEBUG(DEB_F_PREFIX"Received Cmd[%d] for app[%d]", DEB_F_PREFIX_ARGS(SIP_CC_PROV, __FUNCTION__),

Code moved from CCApp_Task loop

::: media/webrtc/signaling/src/sipcc/core/ccapp/ccprovider.c
@@ +2161,4 @@
>   *  Parameters:   none
>   *  Returns: none
>   */
> +void ccapp_shutdown()

I completely stopped caring about renaming after this.

::: media/webrtc/signaling/src/sipcc/core/common/init.c
@@ +299,3 @@
>      gsm_set_initialized();
>      PHNChangeState(STATE_CONNECTED);
> +    ui_set_sip_registration_state(CC_ALL_LINES, TRUE);

Magic function called by SIP task that is required for init. No more SIP task, so we must call ourselves.

@@ +320,5 @@
>  send_task_unload_msg(cc_srcs_t dest_id)
>  {
>      const char *fname = "send_task_unload_msg";
>      uint16_t len = 4;
> +    cprBuffer_t  msg;

We were just stomping this pointer with a new buffer below, or not using it at all, leaking it.

::: media/webrtc/signaling/src/sipcc/core/gsm/fsmdef.c
@@ -1263,5 @@
>      dcb->early_error_release = FALSE;
>  
>      dcb->active_feature = CC_FEATURE_NONE;
>  
> -    /* Release transient timers if any have been allocated */

This class had init code that tried to create a bunch of timers, and if any of them failed (for example, because the callback message queue was NULL), init would fail. I just removed these timers entirely, just to be sure.

@@ -4950,5 @@
> -        if (cprCancelTimer(dcb->ringback_delay_tmr) != CPR_SUCCESS) {
> -            FSM_DEBUG_SM(get_debug_string(FSMDEF_DBG_TMR_CANCEL_FAILED),
> -                         dcb->call_id, dcb->line, fname, "Ringback Delay",
> -                         cpr_errno);
> -        }

This part of the hunk is totally separate from the second part, conceptually.

@@ -4951,5 @@
> -            FSM_DEBUG_SM(get_debug_string(FSMDEF_DBG_TMR_CANCEL_FAILED),
> -                         dcb->call_id, dcb->line, fname, "Ringback Delay",
> -                         cpr_errno);
> -        }
> -    } else {

This displays confusingly. This line should be a substitution for '}', and the rest should be red.

@@ +7297,5 @@
>               * In either case, UI can be unlocked since we are now
>               * about to hold again.
>               */
>              fim_unlock_ui(dcb->call_id);
> +            /*

This is the old else clause

@@ -9231,5 @@
>      DEF_DEBUG(DEB_F_PREFIX"Disabling mass registration print", DEB_F_PREFIX_ARGS(SIP_REG, fname));
>      FSM_FOR_ALL_CBS(dcb, fsmdef_dcbs, FSMDEF_MAX_DCBS) {
>          fsmdef_init_dcb(dcb, CC_NO_CALL_ID, FSMDEF_CALL_TYPE_NONE,
>                          FSMDEF_NO_NUMBER, LSM_NO_LINE, NULL);
> -        /*

This is the offending init code.

::: media/webrtc/signaling/src/sipcc/core/gsm/gsm.c
@@ +266,2 @@
>  
> +cpr_status_e

git diff -w does a way better job of this.

::: media/webrtc/signaling/src/sipcc/core/gsm/h/fsm.h
@@ -355,5 @@
>      boolean spoof_ringout_requested;
>      /* TRUE if GSM has applied ringout due to CCMs request to show ringout UI */
>      boolean spoof_ringout_applied;
>  
> -    /* Timer to go on hook after any call error */

Removed this timer stuff from the struct as well.

::: media/webrtc/signaling/src/sipcc/core/src-common/misc_apps_task.c
@@ +26,5 @@
>  
>  cpr_status_e
>  MiscAppTaskSendMsg (uint32_t cmd, cprBuffer_t buf, uint16_t len)
>  {
> +    /* Mozilla hack: just ignore everything that is posted here */

I actually forgot to remove the MiscAppTask here, but I don't want to re-write my commentary. I'll remove that once we've cycled once.

::: media/webrtc/signaling/src/sipcc/cpr/common/cpr_ipc.c
@@ +296,5 @@
>   *       timers.  The work to fix this rare situation is not considered
>   *       worth the effort to fix....so just leaving as is.
>   */
>  cprRC_t
>  cprSendMessage (cprMsgQueue_t msgQueue, void *msg, void **ppUserData)

Maybe I'll remove this too once a review cycle is done.

@@ +343,5 @@
>   * @return depth of msgQueue
>   *
>   * @pre (msgQueue not_eq NULL)
>   */
>  uint16_t cprGetDepth (cprMsgQueue_t msgQueue)

I think I'll make this return 0 after we cycle.

::: media/webrtc/signaling/src/softphonewrapper/CC_SIPCCDevice.cpp
@@ +31,5 @@
>  
>      CC_SIPCCDevicePtr pSIPCCDevice = CC_SIPCCDevice::wrap(deviceHandle);
>  
> +    pSIPCCDevice->enableVideo(true);
> +    pSIPCCDevice->enableCamera(true);

Doing this in the c'tor leads to us wrapping around and trying to retrieve the CC_Device from the map (go look at Wrapper.h), which isn't there yet because the c'tor isn't finished yet. So, we do it in the factory method.
Attachment #8476364 - Flags: review?(martin.thomson)
Comment on attachment 8476364 [details] [diff] [review]
Part 4: Get rid of the sipcc threads/queues, and run all sipcc logic on main.

Review of attachment 8476364 [details] [diff] [review]:
-----------------------------------------------------------------

I see a number of places where you choose to delete the contents of the function and then not also remove the function.  I trust (and hope) that you remember these and go back to kill those functions with fire.  And ccapp_task.c is a horror.  Otherwise, I can't see how a patch that deletes this much code can be rejected.

I have some test failures to share, which maybe need to be looked at.
Attachment #8476364 - Flags: review?(martin.thomson) → review+
Yeah, I was leaving those empty functions there for context at their callsites later.
Nils seems to have missed a few edits in his patch, but this is the only problem I saw:

28642 INFO TEST-UNEXPECTED-FAIL | /tests/dom/media/tests/mochitest/test_dataChannel_basicAudioVideo.html | received ICE candidate after end of trickle 

Will retest and see.
Attached file bad.txt
Delete a little more code, and fix a comment.
Attachment #8476364 - Attachment is obsolete: true
Comment on attachment 8477649 [details] [diff] [review]
Part 4: Get rid of the sipcc threads/queues, and run all sipcc logic on main.

Review of attachment 8477649 [details] [diff] [review]:
-----------------------------------------------------------------

https://bugzilla.mozilla.org/attachment.cgi?oldid=8476364&action=interdiff&newid=8477649&headers=1
Looks like some suppressions aren't working anymore because I've removed the functions the suppressions are based on. I already had some fixes in bug 1055852, so I think it makes sense to land that along with this stuff.
FYI, here's an earlier try push that also had the leak fixes in it; LSAN seems happy here:

https://tbpl.mozilla.org/?tree=Try&rev=e9a2c772a580
Attachment #8477649 - Flags: review+
So, try looks mostly good, but even with a gathering timeout tweak and the removal of the STUN server from the picture, we're still getting lots of failures on B2G emulator. The cause simply seems to be outrageous delays in test execution, and otherwise extremely slow performance (note all the timecards recording that it takes 6-10 seconds to generate a DTLS identity). Not sure what can be done about this.

https://tbpl.mozilla.org/?tree=Try&rev=9d00b7f89e0a
Note also that the timecards report extremely large delays in the start of gathering; all this takes is a single dispatch back and forth from STS, with almost no actual work involved. This strongly implies that either STS or main have queueing delays of several seconds, sometimes pushing 20. This is just an impossible situation.
(In reply to Byron Campen [:bwc] from comment #101)
> Note also that the timecards report extremely large delays in the start of
> gathering; all this takes is a single dispatch back and forth from STS, with
> almost no actual work involved. This strongly implies that either STS or
> main have queueing delays of several seconds, sometimes pushing 20. This is
> just an impossible situation.

I've seen delays in opening audio devices on b2g emulator of similar timescales; generally the issue is that the AWS instances we use (single-threaded IIRC) combined with the emulator layer running under that and other load on the AWS instances can simply make them very not-real-time.  Combine that with the emulator reporting real times, not simulated times in the emulator (it's not a cycle-level CPU emulator), and stuff that generates realtime data (media sources in particular) or stuff that relies on timeouts can cause queue buildups (or simply fail timeouts)
So, taking a closer look:

https://tbpl.mozilla.org/php/getParsedLog.php?id=46617331&tree=Try#error0

and 

https://tbpl.mozilla.org/php/getParsedLog.php?id=46617331&tree=Try#error4

For some reason, we see a _three_ _minute_ halt in execution of the test.

The remaining instance of this failure I see in that try push, the delay was "only" two minutes. What is going on here?
Oh, this is neat; this log line is hit 33 times in that test run:

(ice/ERR) ICE(<pc name>)): no local addresses available

Something is very fishy here.
Attachment #8469643 - Attachment is obsolete: true
Attachment #8477649 - Attachment is obsolete: true
Comment on attachment 8479377 [details] [diff] [review]
Part 4: Get rid of the sipcc threads/queues, and run all sipcc logic on main

Review of attachment 8479377 [details] [diff] [review]:
-----------------------------------------------------------------

Carry forward r=mt
Attachment #8479377 - Flags: review+
Comment on attachment 8479371 [details] [diff] [review]
Part 1: Simplify the command-queueing logic in pc.js, and make it less racy

Review of attachment 8479371 [details] [diff] [review]:
-----------------------------------------------------------------

Carry forward r=mt
Attachment #8479371 - Flags: review+
Depends on: 963103, 1059878
Depends on: 1059867
Attachment #8479371 - Attachment is obsolete: true
Attachment #8479377 - Attachment is obsolete: true
Comment on attachment 8480808 [details] [diff] [review]
Part 1: Simplify the command-queueing logic in pc.js, and make it less racy.

Review of attachment 8480808 [details] [diff] [review]:
-----------------------------------------------------------------

Carry forward r=mt
Attachment #8480808 - Flags: review+
Comment on attachment 8480809 [details] [diff] [review]
Part 4: Get rid of the sipcc threads/queues, and run all sipcc logic on main.

Review of attachment 8480809 [details] [diff] [review]:
-----------------------------------------------------------------

Carry forward r=mt
Attachment #8480809 - Flags: review+
No longer depends on: 963103
Try is looking good, and Nils' test is detecting this bug and causing oranges.
Keywords: checkin-needed
Blocks: 1026037
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: