Closed Bug 865956 Opened 12 years ago Closed 8 years ago

Enable WebRTC C++ unit tests for b2g

Categories

(Core :: WebRTC, defect, P4)

x86
macOS
defect

Tracking


RESOLVED WONTFIX
backlog tech-debt

People

(Reporter: ekr, Assigned: jhlin)

References

Details

(Whiteboard: [WebRTC], [blocking-webrtc-][b2g-webrtc+])

Attachments

(21 files, 6 obsolete files)

Attached patch Patch
I've attached a patch that will get them building. It needs a bunch of cleanup but resolves the major issues. Note that this doesn't mean the unit tests all work.
Whiteboard: [WebRTC], [blocking-webrtc-][b2g-webrtc+]
Assigning to John, since he is currently working on this.
Assignee: nobody → jolin
With Eric's patch applied, the tests build and run on the Unagi device. Running the signaling unit tests failed when initializing the peer connection, but they pass when run under GDB. Bug 866010 was filed to investigate the problem.
Depends on: 866010
After changing the library linking configuration on my Unagi, signaling_unittests now runs. Closing bug 866010.
Quick summary: 46 tests so far, 39 passes & 7 failures. Failures:
[ FAILED ] SignalingTest.OfferModifiedAnswer (27955 ms)
[ FAILED ] SignalingTest.FullCallTrickle (33090 ms)
[ FAILED ] SignalingTest.FullCallTrickleChrome (50979 ms)
[ FAILED ] SignalingTest.OfferAnswerCheckDescriptions (47857 ms)
[ FAILED ] SignalingTest.CheckTrickleSdpChange (244976 ms)
[ FAILED ] SignalingTest.ipAddrAnyOffer (502596 ms)
[ FAILED ] SignalingTest.missingUfrag (139945 ms)
Can you please try running these tests individually and see if they pass? That will give us a clue as to whether these tests are actually busted or whether it's some sort of "we've run a lot of tests and we're tired" thing.
(In reply to Eric Rescorla (:ekr) from comment #6)
> Can you please try running these tests individually and see if they pass?
>
> That will give us a clue as to whether these tests are actually busted
> or whether it's some sort of "we've run a lot of tests and we're tired"
> thing.

I just ran those 7 tests individually and only one of them still fails:
[ FAILED ] SignalingTest.CheckTrickleSdpChange (24692 ms)
Also ran mediaconduit_unittests and transport_unittests; all test cases passed!
This is awesome. Can you do the following with NSPR_LOG_MODULES=signaling:5:
1. Run CheckTrickleSdpChange alone.
2. Run the whole signaling_unittests suite?
Thanks,
(In reply to Eric Rescorla (:ekr) from comment #12)
> This is awesome. Can you do the following with NSPR_LOG_MODULES=signaling:5:
>
> 1. Run CheckTrickleSdpChange alone.
> 2. Run the whole signaling_unittests suite?
>
> Thanks,

I believe all tests were already run with that. The first lines of my attachments are the commands used to run the tests. Can you check whether that's what you wanted and let me know if anything is missing or incorrect?
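For reference, a direct on-device invocation of a single test with that log module enabled would look roughly like this (illustrative only; the binary path assumes the test was pushed to /data as in the top output further down, and the filter value is just an example):

adb shell
NSPR_LOG_MODULES=signaling:5 /data/signaling_unittests --gtest_filter=SignalingTest.CheckTrickleSdpChange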
OK, then we just need the adb logcat output from those runs.
(In reply to Eric Rescorla (:ekr) from comment #14)
> OK, then we just need the adb logcat output from those runs.

I just ran CheckTrickleSdpChange several times again, and all runs passed. I noticed that the assertion which failed previously (http://mxr.mozilla.org/mozilla-central/source/media/webrtc/signaling/test/signaling_unittests.cpp#1094) repeatedly evaluates the result for up to 5 seconds before giving up. Could it be that the network condition was bad and the check timed out?
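For context, that check is a poll-until-timeout helper. The pattern is roughly the following; this is a simplified sketch, not the exact macro from gtest_utils.h, and the names and poll interval are illustrative:

// Simplified sketch of a poll-until-true-or-timeout check (illustrative only;
// the real helper in gtest_utils.h may differ in names and intervals).
#include "prthread.h"   // PR_Sleep
#include "prinrval.h"   // PR_MillisecondsToInterval

#define WAIT_FOR(expr, timeoutMs)                                    \
  do {                                                               \
    int waited = 0;                                                  \
    while (!(expr) && waited < (timeoutMs)) {                        \
      PR_Sleep(PR_MillisecondsToInterval(100)); /* poll interval */  \
      waited += 100;                                                 \
    }                                                                \
  } while (0)

// Usage in a test: poll for up to kDefaultTimeout (5000 ms by default), then
// let a normal ASSERT/EXPECT report failure if the condition never held:
//   WAIT_FOR(IceCompleted(), kDefaultTimeout);
//   ASSERT_TRUE(IceCompleted());

On a slow or heavily loaded device, the condition may simply not become true within that window, which would look exactly like an intermittent failure.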
Certainly possible. If it fails again, we can reopen this...
Also ran the full suite on the emulator. The results are quite similar to Unagi: several cases fail during a full run but pass when run individually. Some of these tests still fail when first run individually, but pass if the timeout value in the failing assertions is increased to 10 seconds. Eric, the results look promising to me. Should we consider including them in the tinderbox tests? (Or are they already included in desktop builds?)
- remove unrelated code changes in Eric's patch
- tiny changes in build files
Attachment #742968 - Flags: review?(slee)
Attachment #742968 - Flags: review?(ekr)
(In reply to John Lin[:jolin][:jhlin] from comment #17)
> Also ran the full suite on the emulator. The results are quite similar to Unagi:
> several cases fail during a full run but pass when run individually.
> Some of these tests still fail when first run individually, but pass if the
> timeout value in the failing assertions is increased to 10 seconds.

OK, so I think the first thing to do is to diagnose why this is happening. If you can post the logs I asked for in c12 then I can try to figure that out.

> Eric, the results look promising to me. Should we consider including them in
> the tinderbox tests? (Or are they already included in desktop builds?)

Many of them currently are. See: https://wiki.mozilla.org/Media/WebRTC/Testing

We're still trying to figure out how to run the remaining ones. However, we could consider running them in the Ad Hoc CI system that I run, once I get it building b2g.
Comment on attachment 742968 [details] [diff] [review]
Cleanup-ed patch based on Eric's draft

Build system changes should be reviewed by build peers. Ted, would you please review this patch? Thanks.
Attachment #742968 - Flags: review?(ted)
In the ZIP:
- *-gtest.log are the gtest output; *-logcat.log are the logcat output.
- webrtc-fullcall-* are logs of SignalingTest.FullCall, which fails in both full and individual runs on the emulator. After increasing kDefaultTimeout to 10000 ms in signaling_unittests.cpp, it passes.
- webrtc-5s-* are logs with kDefaultTimeout = 5000; webrtc-10s-* are logs with kDefaultTimeout = 10000.
Please let me know if any further info is needed.
one more thing: *-fullcall-* are logs of individual runs. The others are logs of a full run.
Attachment 742968 [details] [diff] breaks on the try server because of:
NameError: name 'build_with_gonk' is not defined while evaluating condition 'clang==1 or OS=="android" or build_with_gonk==1'...
Still looking for a fix for that...
Attachment #742968 - Attachment is obsolete: true
Attachment #742968 - Flags: review?(ted)
Attachment #742968 - Flags: review?(slee)
Attachment #742968 - Flags: review?(ekr)
Attachment #743504 - Flags: review?(ted)
Comment on attachment 743504 [details] [diff] [review]
Updated patch to fix build break on try server.

Review of attachment 743504 [details] [diff] [review]:
-----------------------------------------------------------------

This lgtm.
Attachment #743504 - Flags: review+
john: Do all the tests pass with kDefaultTimeout=10000ms? If so, I think I'm just going to write this off as slowness.
Blocks: 867828
(In reply to Eric Rescorla (:ekr) from comment #26)
> john:
> Do all the tests pass with kDefaultTimeout=10000ms?
>
> If so, I think I'm just going to write this off as slowness.

With kDefaultTimeout=10000ms, the following tests still failed when running signaling_unittests without a filter, but pass when run individually:
[ FAILED ] SignalingTest.FullCall
[ FAILED ] SignalingTest.FullCallVideoOnly
[ FAILED ] SignalingTest.OfferModifiedAnswer
[ FAILED ] SignalingTest.FullCallTrickle
[ FAILED ] SignalingTest.FullCallTrickleChrome
[ FAILED ] SignalingTest.AudioOnlyG711Call
[ FAILED ] SignalingTest.OfferAnswerCheckDescriptions
[ FAILED ] SignalingTest.CheckTrickleSdpChange
Attachment #743504 - Flags: review?(ted) → review+
The command used to run the tests:
EXTRA_TEST_ARGS="--addEnv='MOZ_WEBRTC_TESTS=1' --addEnv='NSPR_LOG_MODULES=signaling:5'" make -C ./media/webrtc/signaling/test/ cppunittests-remote

Failed tests:
[ FAILED ] SignalingTest.FullCall (25517 ms)
[ FAILED ] SignalingTest.FullCallVideoOnly (34686 ms)
[ FAILED ] SignalingTest.OfferModifiedAnswer (312207 ms)
[ FAILED ] SignalingTest.FullCallTrickle (64786 ms)
[ FAILED ] SignalingTest.FullCallTrickleChrome (326039 ms)
[ FAILED ] SignalingTest.FullChromeHandshake (260931 ms)
[ FAILED ] SignalingTest.OfferAnswerCheckDescriptions (226000 ms)
[ FAILED ] SignalingTest.CheckTrickleSdpChange (130034 ms)

Please use the test name (like 'FullCall') as a search pattern to jump quickly to the relevant part of the log.
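To narrow the same harness run to a single test, one option would be something along these lines; this assumes --addEnv forwards environment variables to the test binary, and relies on gtest honoring the GTEST_FILTER environment variable:

EXTRA_TEST_ARGS="--addEnv='GTEST_FILTER=SignalingTest.FullCall' --addEnv='NSPR_LOG_MODULES=signaling:5'" make -C ./media/webrtc/signaling/test/ cppunittests-remote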
Attachment #751626 - Attachment description: Partial rlog → Partial rlog. There is a 6-line shift because of locally added log code. That means '...signaling_unittests.cpp:1061: Failure' is actually line 1055 in m-c.
Attached file more log
Attached file Partial log with timing patch applied (obsolete) —
Attachment #752480 - Attachment is obsolete: true
Attachment #752043 - Attachment is obsolete: true
Attachment #752089 - Attachment is obsolete: true
Attachment #752541 - Attachment is obsolete: true
Attachment #752561 - Attachment description: Partial log with latest patch applied → Partial log with latest patch applied. Also adds webrtc_trace=16.
Attached file Log with 120s delay between tests (obsolete) —
Attachment #752626 - Attachment is obsolete: true
Attachment #742142 - Attachment is patch: true
Attachment #742142 - Attachment mime type: text/x-patch → text/plain
Attached file Log with STS messages
After adding more logging and running *FullCall* over the past few days, I think I've made some progress on tracking down the slowness problem in signaling_unittests.

* Test environment:
- emulator
- m-c changeset: 133070
- doubled timeout value (kDefaultTimeout = 5000 * 2)
- sleep time of WAIT_ (in gtest_utils.h) extended to 1000 ms
- a log line showing the current ICE state added in IceCompleted (in signaling_unittests.cpp); combined with the change above, this can be used as a per-second marker of waiting time

Here is what I have so far:

* Results:
1. FullCallVideoOnly (pass)
2. FullCallAudioOnly (pass) -> FullCallVideoOnly (pass)
3. FullCall (pass) -> FullCallAudioOnly (pass) -> FullCallVideoOnly (fail)

* Running FullCall affects the result of FullCallVideoOnly:
- In case 3 there are log lines showing some event took a very long time to complete, like webrtc-signaling-fullcalls.log line#38167: 5826880[42630d80]: STS process event took 9443ms
- It looks like the events needed to complete ICE will not be executed until the event mentioned above finishes.
- On the other hand, in cases 1 & 2 there is no such very long event-processing log.

I'm currently trying to connect gdb to the emulator so I can break and inspect the stack trace while the STS thread is processing that event, but it doesn't work yet and could take a while before I succeed. In the meantime, could you please help check the log in the attachment and see if it shows any hint about which event it could be? Any comments or suggestions are welcome.
Flags: needinfo?(ekr)
Flags: needinfo?(adam)
According to the log, a long time was spent in SSL_ConfigSecureServer() called by TransportLayerDtls::Setup(). It sounds like a crypto operation that needs lots of CPU cycles to me... My theory is that some other tasks eat so many cycles that the STS thread is starved. Eric & Adam, do you think this needs to be addressed? One problem I can think of is that running heavy tasks on the STS thread could interfere with real I/O tasks...
------------------------------------------------------------------------
15408:5826880[42630d80]: Flow[2cd76c20e04e8f1a:1,rtp(none)]; Layer[dtls]: SSL_ConfigSecureServer took 10784ms
15580:5826880[42630d80]: Flow[2cd76c20e04e8f1a:1,rtcp(none)]; Layer[dtls]: SSL_ConfigSecureServer took 11399ms
20246:5826880[42630d80]: Flow[e12bb0b3789abb6a:1,rtp(none)]; Layer[dtls]: SSL_ConfigSecureServer took 29578ms
20486:5826880[42630d80]: Flow[e12bb0b3789abb6a:1,rtcp(none)]; Layer[dtls]: SSL_ConfigSecureServer took 12992ms
21726:5826880[42630d80]: Flow[e12bb0b3789abb6a:2,rtp(none)]; Layer[dtls]: SSL_ConfigSecureServer took 60992ms
21902:5826880[42630d80]: Flow[e12bb0b3789abb6a:2,rtcp(none)]; Layer[dtls]: SSL_ConfigSecureServer took 10010ms
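(For reference, the "took Nms" numbers above presumably come from locally added timing instrumentation around the call. A minimal way to produce such a measurement, assuming NSPR's interval timers and a hypothetical wrapper function, would be a sketch like this; the actual instrumentation and logging macro in transportlayerdtls may differ:)

// Minimal timing sketch (illustrative only).
#include "prinrval.h"  // PR_IntervalNow, PR_IntervalToMilliseconds
#include "ssl.h"       // SSL_ConfigSecureServer, ssl_kea_rsa
#include "cert.h"      // CERTCertificate
#include "keyhi.h"     // SECKEYPrivateKey
#include <cstdio>

static SECStatus TimedConfigSecureServer(PRFileDesc* ssl_fd,
                                         CERTCertificate* cert,
                                         SECKEYPrivateKey* key) {
  PRIntervalTime start = PR_IntervalNow();
  SECStatus rv = SSL_ConfigSecureServer(ssl_fd, cert, key, ssl_kea_rsa);
  PRUint32 ms = PR_IntervalToMilliseconds(PR_IntervalNow() - start);
  fprintf(stderr, "SSL_ConfigSecureServer took %ums\n", ms);
  return rv;
}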
Flags: needinfo?(ekr)
Flags: needinfo?(adam)
Wow, that's taking a really long time. Does this get progressively worse as the system runs? CC'ing Brian Smith to see if he has any ideas about what's taking so long.
Flags: needinfo?(bsmith)
(In reply to Eric Rescorla (:ekr) from comment #46)
> Wow, that's taking a really long time. Does this get progressively worse as
> the system runs?

Not always. That's why I think it has something to do with other concurrently running tasks. It went like this when running *FullCall*:
-----------------------------------------------------------------------------
1438:5826880[42630d80]: Flow[602ceae143f1f9d0:1,rtp(none)]; Layer[dtls]: SSL_ConfigSecureServer took 1738ms
1462:5826880[42630d80]: Flow[602ceae143f1f9d0:1,rtcp(none)]; Layer[dtls]: SSL_ConfigSecureServer took 1634ms
1557:5826880[42630d80]: Flow[602ceae143f1f9d0:2,rtp(none)]; Layer[dtls]: SSL_ConfigSecureServer took 2890ms
1611:5826880[42630d80]: Flow[602ceae143f1f9d0:2,rtcp(none)]; Layer[dtls]: SSL_ConfigSecureServer took 1571ms
7966:5826880[42630d80]: Flow[b900061dd23c7c79:1,rtp(none)]; Layer[dtls]: SSL_ConfigSecureServer took 956ms
7981:5826880[42630d80]: Flow[b900061dd23c7c79:1,rtcp(none)]; Layer[dtls]: SSL_ConfigSecureServer took 937ms
15408:5826880[42630d80]: Flow[2cd76c20e04e8f1a:1,rtp(none)]; Layer[dtls]: SSL_ConfigSecureServer took 10784ms
15580:5826880[42630d80]: Flow[2cd76c20e04e8f1a:1,rtcp(none)]; Layer[dtls]: SSL_ConfigSecureServer took 11399ms
20246:5826880[42630d80]: Flow[e12bb0b3789abb6a:1,rtp(none)]; Layer[dtls]: SSL_ConfigSecureServer took 29578ms
20486:5826880[42630d80]: Flow[e12bb0b3789abb6a:1,rtcp(none)]; Layer[dtls]: SSL_ConfigSecureServer took 12992ms
21726:5826880[42630d80]: Flow[e12bb0b3789abb6a:2,rtp(none)]; Layer[dtls]: SSL_ConfigSecureServer took 60992ms
21902:5826880[42630d80]: Flow[e12bb0b3789abb6a:2,rtcp(none)]; Layer[dtls]: SSL_ConfigSecureServer took 10010ms

And running *FullCall*Only doesn't look that bad:
-----------------------------------------------------------------------------
1292:5826880[42630d80]: Flow[10f813c5aa6a782e:1,rtp(none)]; Layer[dtls]: SSL_ConfigSecureServer took 485ms
1303:5826880[42630d80]: Flow[10f813c5aa6a782e:1,rtcp(none)]; Layer[dtls]: SSL_ConfigSecureServer took 536ms
9607:5826880[42630d80]: Flow[40ea0a79b40b7c32:1,rtp(none)]; Layer[dtls]: SSL_ConfigSecureServer took 2348ms
9636:5826880[42630d80]: Flow[40ea0a79b40b7c32:1,rtcp(none)]; Layer[dtls]: SSL_ConfigSecureServer took 1357ms

Another thing I've noticed is that many ProcessThread threads occupy most of the CPU while the tests are running:
--------------------------------------------------------------------------
User 54%, System 45%, IOW 0%, IRQ 0%
User 558 + Nice 0 + Sys 473 + Idle 0 + IOW 0 + IRQ 0 + SIRQ 0 = 1031
PID TID PR CPU% S VSS RSS PCY UID Thread Proc
576 621 0 15% S 209092K 62748K fg root ProcessThread /data/signaling_unittests
576 614 0 15% S 209092K 62748K fg root ProcessThread /data/signaling_unittests
576 605 0 15% S 209092K 62748K fg root ProcessThread /data/signaling_unittests
576 597 0 14% S 209092K 62748K fg root ProcessThread /data/signaling_unittests
576 629 0 14% S 209092K 62752K fg root ProcessThread /data/signaling_unittests
576 637 0 14% S 209092K 62752K fg root ProcessThread /data/signaling_unittests
576 631 0 2% S 209092K 62752K fg root ProcessThread /data/signaling_unittests
576 623 0 2% S 209092K 62748K fg root ProcessThread /data/signaling_unittests
576 590 0 2% S 209092K 62748K fg root ProcessThread /data/signaling_unittests
576 599 0 2% S 209092K 62748K fg root ProcessThread /data/signaling_unittests
576 607 0 2% S 209092K 62748K fg root ProcessThread /data/signaling_unittests
576 608 0 2% S 209092K 62748K fg root ProcessThread /data/signaling_unittests

> CC'ing Brian Smith to see if he has any ideas about what's taking so long?
FYR, here are the numbers from running *FullCall* on Unagi. Not as bad as the emulator, but it shows a similar situation.
-----------------------------------------------------------------------
489:31721864[42830d80]: Flow[be93752e3591ef99:1,rtp(none)]; Layer[dtls]: SSL_ConfigSecureServer took 832ms
499:31721864[42830d80]: Flow[be93752e3591ef99:1,rtcp(none)]; Layer[dtls]: SSL_ConfigSecureServer took 723ms
511:31721864[42830d80]: Flow[be93752e3591ef99:2,rtp(none)]; Layer[dtls]: SSL_ConfigSecureServer took 835ms
542:31721864[42830d80]: Flow[be93752e3591ef99:2,rtcp(none)]; Layer[dtls]: SSL_ConfigSecureServer took 755ms
3158:31721864[42830d80]: Flow[602edd9b44bfaef6:1,rtp(none)]; Layer[dtls]: SSL_ConfigSecureServer took 533ms
3167:31721864[42830d80]: Flow[602edd9b44bfaef6:1,rtcp(none)]; Layer[dtls]: SSL_ConfigSecureServer took 593ms
5684:31721864[42830d80]: Flow[c1d7590257649a7a:1,rtp(none)]; Layer[dtls]: SSL_ConfigSecureServer took 1074ms
5700:31721864[42830d80]: Flow[c1d7590257649a7a:1,rtcp(none)]; Layer[dtls]: SSL_ConfigSecureServer took 1107ms
7030:31721864[42830d80]: Flow[d8f1e20b2e4e6c92:1,rtp(none)]; Layer[dtls]: SSL_ConfigSecureServer took 3249ms
7064:31721864[42830d80]: Flow[d8f1e20b2e4e6c92:1,rtcp(none)]; Layer[dtls]: SSL_ConfigSecureServer took 1283ms
7096:31721864[42830d80]: Flow[d8f1e20b2e4e6c92:2,rtp(none)]; Layer[dtls]: SSL_ConfigSecureServer took 1665ms
7125:31721864[42830d80]: Flow[d8f1e20b2e4e6c92:2,rtcp(none)]; Layer[dtls]: SSL_ConfigSecureServer took 1281ms
9747:31721864[42830d80]: Flow[109a803211dff031:1,rtp(none)]; Layer[dtls]: SSL_ConfigSecureServer took 4581ms
9808:31721864[42830d80]: Flow[109a803211dff031:1,rtcp(none)]; Layer[dtls]: SSL_ConfigSecureServer took 2820ms
9863:31721864[42830d80]: Flow[109a803211dff031:2,rtp(none)]; Layer[dtls]: SSL_ConfigSecureServer took 2577ms
9923:31721864[42830d80]: Flow[109a803211dff031:2,rtcp(none)]; Layer[dtls]: SSL_ConfigSecureServer took 2719ms
There may be interaction with the scheduler; if this is chewing with no IO for a long time it may get dropped in priority compared to IO-bound or shorter-time-between-sleep threads. You could avoid blocking IO operations by moving it (but NSS likely wouldn't allow that as it's single-thread-only IIRC), but that wouldn't help the latency. I would assume SSL_ConfigSecureServer() is doing some type of public-key generation, which is CPU-intensive (and IIRC we talked about pre-generating keys to avoid latency). And in fact, I see it calls ssl3_CreateRSAStepDownKeys() which calls SECKEY_CreateRSAPrivateKey() which calls PK11_GenerateKeyPair(). Some CPUs will be much faster than others, and if you're only getting a fraction of the CPU...
So, this isn't creating our actual public keys.... Those are created in DTLS identity. You could try disabling step-down (it shouldn't be needed) in transportlayerdtls.c with an SSL_OptionSet(ssl_fd, SSL_NO_STEP_DOWN, PR_TRUE). But if this is the issue, one wonders why we're not seeing a big hot spot DtlsIdentity::Generate().
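(A minimal sketch of what that change would look like — not the exact code; 'ssl_fd' stands in for whatever NSPR socket handle Setup() configures, and the error handling is just illustrative:)

// Illustrative: disable RSA step-down key generation on the DTLS socket.
// Step-down keys are only needed for export cipher suites, which DTLS-SRTP
// never uses, so skipping them avoids an expensive RSA key generation.
if (SSL_OptionSet(ssl_fd, SSL_NO_STEP_DOWN, PR_TRUE) != SECSuccess) {
  // log and bail out in whatever style Setup() already uses
  return false;
}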
(In reply to Eric Rescorla (:ekr) from comment #50)
> So, this isn't creating our actual public keys.... Those are created in DTLS
> identity.
>
> You could try disabling step-down (it shouldn't be needed) in
> transportlayerdtls.c
> with an SSL_OptionSet(ssl_fd, SSL_NO_STEP_DOWN, PR_TRUE).

I've tried it and ran *FullCall* again. SSL_NO_STEP_DOWN did significantly reduce the time SSL_ConfigSecureServer takes on the emulator (~150 ms max now).

I also added some trace logging in the webrtc code to observe the lifecycle of the ProcessThreads, because from the output of 'top -t' they occupy most of the CPU during the tests. It seems to me that some threads outlived the tests that created them and were only shut down when the whole test run ended. I wonder whether the tests would run faster if those threads could be stopped earlier. Could you take a look at the log in Attachment 756465 [details] and comment?

> But if this is the issue, one wonders why we're not seeing a big hot spot
> DtlsIdentity::Generate().
For the late ProcessThreads shutdown problem mentioned in comment 50, here is what I found. MediaPipeline dispatches deletion events to the main thread for some of its MediaSessionConduit objects:
http://mxr.mozilla.org/mozilla-central/source/media/webrtc/signaling/src/mediapipeline/MediaPipeline.h#334
http://mxr.mozilla.org/mozilla-central/source/media/webrtc/signaling/src/mediapipeline/MediaPipeline.h#448

Unfortunately, when running the unit tests, the main thread won't be freed until all tests are complete, so everything held by those conduits won't be released until then. IMHO, this shouldn't be a problem in real-life apps, where the main thread should never be occupied by long-running jobs.
(In reply to John Lin[:jolin][:jhlin] from comment #53)
> For the late ProcessThreads shutdown problem mentioned in comment 50,

Correction: I meant comment 52.

> here is what I found.
> MediaPipeline dispatches deletion events to the main thread for some of its
> MediaSessionConduit objects:
> http://mxr.mozilla.org/mozilla-central/source/media/webrtc/signaling/src/
> mediapipeline/MediaPipeline.h#334
> http://mxr.mozilla.org/mozilla-central/source/media/webrtc/signaling/src/
> mediapipeline/MediaPipeline.h#448
>
> Unfortunately, when running the unit tests, the main thread won't be freed
> until all tests are complete, so everything held by those conduits won't be
> released until then.
> IMHO, this shouldn't be a problem in real-life apps, where the main thread
> should never be occupied by long-running jobs.
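For reference, the "release on the main thread" pattern being described is essentially the following generic sketch using Gecko's runnable/dispatch APIs of that era; the class and member names are illustrative, and the actual MediaPipeline.h code differs in detail:

// Generic sketch of releasing an object on the main thread (illustrative).
#include "nsThreadUtils.h"  // nsRunnable, NS_DispatchToMainThread
#include "nsAutoPtr.h"      // nsRefPtr

class ConduitReleaseEvent : public nsRunnable {
 public:
  explicit ConduitReleaseEvent(MediaSessionConduit* aConduit)
    : mConduit(aConduit) {}
  NS_IMETHOD Run() {
    mConduit = nullptr;  // final Release() happens here, on the main thread
    return NS_OK;
  }
 private:
  nsRefPtr<MediaSessionConduit> mConduit;
};

// In the pipeline's shutdown path (sketch):
//   NS_DispatchToMainThread(new ConduitReleaseEvent(conduit));
// While gtest keeps the main thread busy running the rest of the suite, these
// events don't run, so the conduits (and the webrtc ProcessThreads they own)
// stay alive until the whole test binary is about to exit.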
Depends on: 884144
Clearing needinfo.
Flags: needinfo?(brian)
Some of the work Ethan did may have helped with some of the main-thread issues mentioned, perhaps.
backlog: --- → tech-debt
Rank: 45
Priority: -- → P4
The webrtc.org C++ unit tests for desktop have been enabled in bug 964133. Closing, as we won't put any more effort into B2G-related work.
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → WONTFIX