Closed Bug 906843 Opened 11 years ago Closed 11 years ago

Investigate and address signaling inefficiencies in WebRTC signaling

Categories

(Core :: WebRTC: Signaling, defect)

Other Branch
x86
macOS
defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla26

People

(Reporter: abr, Assigned: abr)

References

Details

Attachments

(3 files, 5 obsolete files)

We currently have relatively long call setup times reported for WebRTC calls, both anecdotal and emperical. Subjectively, it takes a long time after a user accepts a call before media flows. More empirically, we have had reports from "SIP over Websockets" implementors indicating that the response time required for webrtc causes SIP INVITE messages to be retransmitted, which indicates at least 500 ms of latency (which is much longer than we should expect).

We need to instrument the signaling (produce timestamps for messages as they pass from the JS layer down to SIPCC and back) to isolate where time is being consumed, and then work on reducing this time wherever necessary.
I'd like to have the instrumentation, investigation and identification of problems completed within the next 2 weeks. (I'm setting a deadline of 9/10.)  We can then file follow up bugs for any large issues we find.
Target Milestone: --- → mozilla26
Comment on attachment 795630 [details] [diff] [review]
Part 1: Convert unit test checks from ASSERT_TRUE_WAIT to condition variables

This patch converts all of the existing ASSERT_TRUE_WAIT calls to condition variable waits, so that they complete immediately upon the state change that we're interested in. This is necessary so that the (up to) 200 ms wait between variable value checks does not perturb the performance evaluation.

I've also converted the PR_Sleep() calls into (ironically) ASSERT_TRUE_WAIT calls, simply because waiting a full 10 seconds for packets to accumulate seemed a bit excessive.

In addition to making the signaling unittests appropriate for gathering call setup perf data, these changes reduce the time to run the full suite of tests from 4:33 to 2:11 on my system.
Attachment #795630 - Flags: review?(ekr)
Whiteboard: [leave-open]
I've got the PeerConnection instrumented, and we're not seeing any outrageous delays under normal circumstances. For example, here is a representative callee-side accounting of call setup in an optimized build, using the signaling_unittests driver (--gtest_filter="*.FullCall"):

>  Timestamp   | Delta       | Event                          | File                         | Function             
> ====================================================================================================================
>     0.000000 |    0.000000 | Constructor Completed          | PeerConnectionImpl.cpp:340   | PeerConnectionImpl   
>     0.044666 |    0.044666 | Initializing PC Ctx            | PeerConnectionImpl.cpp:592   | Initialize           
>     0.044666 |    0.000000 | Done Initializing PC Ctx       | PeerConnectionImpl.cpp:598   | Initialize           
>     0.044729 |    0.000063 | Generating DTLS Identity       | PeerConnectionImpl.cpp:635   | Initialize           
>     0.101421 |    0.056692 | Done Generating DTLS Identity  | PeerConnectionImpl.cpp:638   | Initialize           
>     0.328072 |    0.226651 | Ice state: waiting             | PeerConnectionImpl.cpp:1542  | IceStateChange_m     
>     0.334122 |    0.006050 | Set Remote Description         | PeerConnectionImpl.cpp:1081  | SetRemoteDescription 
>     0.334163 |    0.000041 | Processing set remote event    |           ccprovider.c:680   | processSessionEvent   
>     0.334168 |    0.000005 | Sending message to queue       |                ccapi.c:1238  | send_message_helper   
>     0.335962 |    0.001794 | Posting message to PC          |                   ui.c:1573  | post_message_helper   
>     0.336000 |    0.000038 | Operation Completed            | PeerConnectionImpl.cpp:1386  | onCallEvent           
>     0.336045 |    0.000045 | Create Answer                  | PeerConnectionImpl.cpp:1038  | CreateAnswer         
>     0.338277 |    0.002232 | Processing create answer event |           ccprovider.c:671   | processSessionEvent   
>     0.338279 |    0.000002 | Sending message to queue       |                ccapi.c:1238  | send_message_helper   
>     0.339914 |    0.001635 | Posting message to PC          |                   ui.c:1573  | post_message_helper   
>     0.339946 |    0.000032 | Operation Completed            | PeerConnectionImpl.cpp:1386  | onCallEvent           
>     0.340057 |    0.000111 | Set Local Description          | PeerConnectionImpl.cpp:1061  | SetLocalDescription   
>     0.340069 |    0.000012 | Processing set local event     |           ccprovider.c:676   | processSessionEvent   
>     0.340072 |    0.000003 | Sending message to queue       |                ccapi.c:1238  | send_message_helper   
>     0.341285 |    0.001213 | ICE Attributes Installed       |               fsmdef.c:3542  | fsmdef_ev_setlocaldesc
>     0.380895 |    0.039610 | Posting message to PC          |                   ui.c:1573  | post_message_helper   
>     0.380942 |    0.000047 | Operation Completed            | PeerConnectionImpl.cpp:1386  | onCallEvent           
>     0.452007 |    0.071065 | Ice state: connected           | PeerConnectionImpl.cpp:1548  | IceStateChange_m     
>     5.729998 |    5.277991 | Destructor Invoked             | PeerConnectionImpl.cpp:346   | ~PeerConnectionImpl 

The plan now is to clean up the implementation that creates this table (and possibly add another couple of events), and then land it in a way that produces this timecard output when "signaling" logging is set to 2 or higher. This will allow developers who are seeing higher-than-expected latency to provide concrete feedback that will help us isolate delay issues.

As EKR points out, one of the larger delays (ICE candidate gathering, which takes 227 ms in the example above) will be pretty much eliminated once Bug 842549 lands.
Target Milestone: mozilla26 → ---
Huh. I don't know why it changed the milestone when I did that. Fixing.
Target Milestone: --- → mozilla26
Scanning down the list of deltas in the table above, it is worth noting that the 40 ms marked "Posting message to PC" (4th line from the bottom) are caused by the call state transition to connected, which triggers the start of ICE checks. These checks are initiated on the STS thread, and are called synchronously from the SIPCC thread. The return value from this call isn't particularly important (failures are reported to the observer via a "SetState(ICE_CTX_FAILED)"), so conversion to an async call should be safe. On the other hand, since overall call setup time is going to be governed by the amount of time these checks take to finish, this may not be a win overall.
Attachment #795630 - Attachment is obsolete: true
Attachment #795630 - Flags: review?(ekr)
Attachment #796249 - Attachment is obsolete: true
Attachment #796248 - Attachment is obsolete: true
Attachment #798873 - Attachment description: Part 2: Instrument signaling for isolation of system delays → Instrument signaling for isolation of system delays
Attachment #798874 - Attachment description: Part 1: Convert unit test checks from ASSERT_TRUE_WAIT to condition variables → Convert unit test checks from ASSERT_TRUE_WAIT to condition variables
Attachment #798873 - Flags: review?(ethanhugg)
Comment on attachment 798873 [details] [diff] [review]
Instrument signaling for isolation of system delays

Review of attachment 798873 [details] [diff] [review]:
-----------------------------------------------------------------

Overall I think this will be fine, but I want to be able to build it and try it out on my Linux box before I r+.  Error and some warnings below.

::: media/webrtc/signaling/src/common/time_profiling/timecard.c
@@ +39,5 @@
> +                               tc->entries_allocated * sizeof(TimecardEntry));
> +  }
> +
> +  /* Trim the path component from the filename */
> +  const char *last_slash = file;

/home/ehugg/mozilla/fourth/media/webrtc/signaling/src/common/time_profiling/timecard.c:43:3: warning: ISO C90 forbids mixed declarations and code [-Wdeclaration-after-statement]

@@ +52,5 @@
> +    file++;
> +  }
> +
> +  /* Record the data into the timecard entry */
> +  TimecardEntry *entry = &tc->entries[tc->curr_entry];

/home/ehugg/mozilla/fourth/media/webrtc/signaling/src/common/time_profiling/timecard.c:56:3: warning: ISO C90 forbids mixed declarations and code [-Wdeclaration-after-statement]

@@ +103,5 @@
> +    }
> +  }
> +
> +  dprintf(descriptor, "\nTimecard created %4ld.%6.6d\n\n",
> +          tc->start_time.tv_sec, tc->start_time.tv_usec);

/home/ehugg/mozilla/fourth/media/webrtc/signaling/src/common/time_profiling/timecard.c:107:11: warning: format ‘%d’ expects argument of type ‘int’, but argument 4 has type ‘__suseconds_t’ [-Wformat]

@@ +133,5 @@
> +            offset.tv_sec, offset.tv_usec,
> +            delta.tv_sec, delta.tv_usec,
> +            (int)event_width, entry->event,
> +            (int)file_width, entry->file, entry->line,
> +            (int)function_width, entry->function);

/home/ehugg/mozilla/fourth/media/webrtc/signaling/src/common/time_profiling/timecard.c:137:13: warning: format ‘%d’ expects argument of type ‘int’, but argument 4 has type ‘__suseconds_t’ [-Wformat]
/home/ehugg/mozilla/fourth/media/webrtc/signaling/src/common/time_profiling/timecard.c:137:13: warning: format ‘%d’ expects argument of type ‘int’, but argument 6 has type ‘__suseconds_t’ [-Wformat]

::: media/webrtc/signaling/src/common/time_profiling/timecard.h
@@ +36,5 @@
> +} TimecardEntry;
> +
> +typedef struct Timecard {
> +  size_t curr_entry;
> +  size_t entries_allocated;

In file included from /home/ehugg/mozilla/fourth/media/webrtc/signaling/src/sipcc/core/ccapp/CCProvider.h:11:0,
                 from /home/ehugg/mozilla/fourth/media/webrtc/signaling/src/sipcc/core/ccapp/capability_set.c:6:
/home/ehugg/mozilla/fourth/media/webrtc/signaling/./src/common/time_profiling/timecard.h:39:3: error: unknown type name ‘size_t’
/home/ehugg/mozilla/fourth/media/webrtc/signaling/./src/common/time_profiling/timecard.h:40:3: error: unknown type name ‘size_t’
make[7]: *** [capability_set.o] Error 1

::: media/webrtc/signaling/src/peerconnection/PeerConnectionImpl.cpp
@@ +68,5 @@
>  static const char* logTag = "PeerConnectionImpl";
>  static const int DTLS_FINGERPRINT_LENGTH = 64;
>  static const int MEDIA_STREAM_MUTE = 0x80;
>  
> +PRLogModuleInfo *signalingLogInfo() {

Won't this function conflict with the GetSignalingLogInfo in CSFLog.cpp?  It seems that they both will create a LogModule with the same name.

@@ -659,5 @@
> -  while(PeerConnectionCtx::GetInstance()->sipcc_state() != kStarted) {
> -    PR_Sleep(100);
> -  }
> -#endif
> -

Should this be in the patch with the unittest changes or are you sure we'll push them together?
Comment on attachment 798873 [details] [diff] [review]
Instrument signaling for isolation of system delays

Review of attachment 798873 [details] [diff] [review]:
-----------------------------------------------------------------

::: media/webrtc/signaling/src/peerconnection/PeerConnectionImpl.cpp
@@ +68,5 @@
>  static const char* logTag = "PeerConnectionImpl";
>  static const int DTLS_FINGERPRINT_LENGTH = 64;
>  static const int MEDIA_STREAM_MUTE = 0x80;
>  
> +PRLogModuleInfo *signalingLogInfo() {

As far as I can tell, having duplicate log modules with the same name is perfectly fine; both are configured correctly and behave as expected. See, e.g., netwerk/protocol/websocket/BaseWebSocketChannel.cpp:32 and netwerk/protocol/websocket/WebSocketChannelParent.cpp:33. There is similar log module name duplication for CSP, MediaDecoder, Widget, WidgetDrag, WidgetFocus, WifiMonitor, cltsrv_log, nsWindow, nsWyciwygChannel, printing, proxy, and possibly others.

@@ -659,5 @@
> -  while(PeerConnectionCtx::GetInstance()->sipcc_state() != kStarted) {
> -    PR_Sleep(100);
> -  }
> -#endif
> -

This check hasn't been necessary for a very, very long time. Look at (the pre-patch) SignalingAgent::Init() and SignalingAgent::InitAllowFail() in signaling_unittests.cpp -- both do an ASSERT_TRUE_WAIT for this state.
Attachment #798873 - Attachment is obsolete: true
Attachment #798873 - Flags: review?(ethanhugg)
Ethan: good catches on the C90 issues. Those would have blown up on Windows. I'm going to push to try to make sure I didn't hose anything similar elsewhere:

https://tbpl.mozilla.org/?tree=Try&rev=f78b395ed259
Attachment #798983 - Attachment description: Part 2: Instrument signaling for isolation of system delays → Instrument signaling for isolation of system delays
Attachment #798983 - Flags: review?(ethanhugg)
Attachment #798983 - Attachment is obsolete: true
Attachment #798983 - Flags: review?(ethanhugg)
Well, that try was a huge flaming fireball of a crash-and-burn. Apparently, we can't use either dprint or gettimeofday. Also, the include paths for non-c++11 compilers didn't work out. Retrying with a respun patch:

https://tbpl.mozilla.org/?tree=Try&rev=81e92556395c
Comment on attachment 799097 [details] [diff] [review]
Instrument signaling for isolation of system delays

Third time's a charm, I hope. This builds for me under Ubuntu 12.
Attachment #799097 - Flags: review?(ethanhugg)
Comment on attachment 799097 [details] [diff] [review]
Instrument signaling for isolation of system delays

Review of attachment 799097 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good, just one nit.

::: media/webrtc/signaling/src/sipcc/core/ccapp/ccapi_call_info.c
@@ +811,5 @@
> +  static const char *fname="CCAPI_CallInfo_getMediaStreams";
> +  session_data_t *data = (session_data_t *)handle;
> +  Timecard *timecard = NULL;
> +
> +  CCAPP_DEBUG(DEB_F_PREFIX"Entering", DEB_F_PREFIX_ARGS(SIP_CC_PROV, fname));

I think we can use __FUNCTION__ instead of fname here that way it won't say the wrong function name.
Attachment #799097 - Flags: review?(ethanhugg) → review+
Comment on attachment 800212 [details] [diff] [review]
Shorten sleep period for ASSERT_TRUE_WAIT from 200ms to 10ms; change traffic checks to count packets rather than wait 10 seconds

Review of attachment 800212 [details] [diff] [review]:
-----------------------------------------------------------------

Here's an alternative to my previous "speeding up the unit tests" patch that doesn't rely on condition variables. Instead, we shorten the time period in the _WAIT macros from 200ms to 10 ms. This still perturbs the results in the performance testing a little bit, but not as severely as before.
Attachment #800212 - Flags: review?(ekr)
Comment on attachment 800212 [details] [diff] [review]
Shorten sleep period for ASSERT_TRUE_WAIT from 200ms to 10ms; change traffic checks to count packets rather than wait 10 seconds

Review of attachment 800212 [details] [diff] [review]:
-----------------------------------------------------------------

lgtm
Attachment #800212 - Flags: review?(ekr) → review+
Whiteboard: [leave-open]
https://hg.mozilla.org/mozilla-central/rev/882c6a6b3b35
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Blocks: 844071
I get format warnings on Ubuntu 64-bit now from timecard.c:

/home/ehugg/mozilla/work/media/webrtc/signaling/src/common/time_profiling/timecard.c: In function ‘print_timecard’:
/home/ehugg/mozilla/work/media/webrtc/signaling/src/common/time_profiling/timecard.c:92:11: warning: format ‘%lld’ expects argument of type ‘long long int’, but argument 2 has type ‘long int’ [-Wformat]
/home/ehugg/mozilla/work/media/webrtc/signaling/src/common/time_profiling/timecard.c:92:11: warning: format ‘%lld’ expects argument of type ‘long long int’, but argument 3 has type ‘long int’ [-Wformat]
/home/ehugg/mozilla/work/media/webrtc/signaling/src/common/time_profiling/timecard.c:121:12: warning: format ‘%lld’ expects argument of type ‘long long int’, but argument 2 has type ‘long int’ [-Wformat]
/home/ehugg/mozilla/work/media/webrtc/signaling/src/common/time_profiling/timecard.c:121:12: warning: format ‘%lld’ expects argument of type ‘long long int’, but argument 3 has type ‘long int’ [-Wformat]
/home/ehugg/mozilla/work/media/webrtc/signaling/src/common/time_profiling/timecard.c:121:12: warning: format ‘%lld’ expects argument of type ‘long long int’, but argument 4 has type ‘long int’ [-Wformat]
/home/ehugg/mozilla/work/media/webrtc/signaling/src/common/time_profiling/timecard.c:121:12: warning: format ‘%lld’ expects argument of type ‘long long int’, but argument 5 has type ‘long int’ [-Wformat]
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: