Closed Bug 853858 Opened 11 years ago Closed 11 years ago

Intermittent test_peerConnection_basicAudio.html, test_peerConnection_basicAudioVideo.html, test_peerConnection_basicAudioVideoCombined.html, test_peerConnection_basicVideo.html, test_peerConnection_bug827843.html, test_peerConnection_bug | Test timed out

Categories

(Core :: WebRTC: Signaling, defect, P1)

21 Branch
x86_64
Linux
defect

Tracking

()

RESOLVED FIXED
mozilla23
Tracking Status
firefox21 --- disabled
firefox22 + fixed
firefox23 --- fixed

People

(Reporter: abr, Assigned: abr)

References

Details

(Keywords: intermittent-failure, Whiteboard: [WebRTC],[blocking-webrtc-][qa-])

Attachments

(2 files, 1 obsolete file)

+++ This bug was initially created as a clone of Bug #839677 +++

RyanVM
https://tbpl.mozilla.org/php/getParsedLog.php?id=20886816&tree=Mozilla-Inbound
Ubuntu 12.04 x64 mozilla-inbound opt test mochitest-2 on 2013-03-20 08:43:11
slave: tst-linux64-ec2-324

09:06:53     INFO -  26689 ERROR TEST-UNEXPECTED-FAIL | /tests/dom/media/tests/mochitest/test_peerConnection_basicAudio.html | Test timed out.
Comment on attachment 729798 [details] [diff] [review]
Additional instrumentation for on-hook event

This bug is one of a class of bug where a call is torn down with an ONHOOK event, but no corresponding messaging happens towards the PeerConnection. Since no callback is issued, the corresponding tests time out.

This patch does three things to help shed some light on why calls are sometimes town down without sending an event back to the PeerConnection. (1) All calls to cc_onhook and friends now indicate the source file and line number from which it was called; (2) PeerConnectionImpl has additional logging messages that are printed when it is about to invoke endCall (which ultimately causes an ONHOOK event to be triggered); and (3) Logging of messages sent to the "ui" (PeerConnectionImpl) has been increased from Debug to Notice so that it will be emitted during mochi tests.
Attachment #729798 - Flags: review?(ethanhugg)
Comment on attachment 729798 [details] [diff] [review]
Additional instrumentation for on-hook event

Review of attachment 729798 [details] [diff] [review]:
-----------------------------------------------------------------

::: media/webrtc/signaling/src/sipcc/core/gsm/ccapi.c
@@ +1422,5 @@
>      pmsg->active_list  = active_list;
>  
>      CC_DEBUG_ENTRY(__FUNCTION__, src_id, dst_id, call_id, line, cc_msg_name(pmsg->msg_id));
> +    DEF_DEBUG("(%d/%d) On-hook called from %s:%d",
> +              line, call_id, filename, fileline);

Nit: I believe both line_t and callid_t are unsigned.

::: media/webrtc/signaling/src/sipcc/core/includes/phone_debug.h
@@ +108,5 @@
>  #define PHN_DEBUG_SNTP_PACKET if (DebugSNTPPacket)
>  
>  /* TNP adapter debugs */
>  #define TNP_DEBUG(format,...) if (TNPDebug) \
> +        CSFLogNotice("tnp", format, ## __VA_ARGS__)

You may want to mention here why you are redirecting debug to notice.
Attachment #729798 - Flags: review?(ethanhugg) → review+
Whiteboard: [WebRTC],[blocking-webrtc+] [qa-] → [WebRTC],[blocking-webrtc+]
Whiteboard: [WebRTC],[blocking-webrtc+] → [WebRTC],[blocking-webrtc-]
https://hg.mozilla.org/mozilla-central/rev/817fd65ab44b
Whiteboard: [WebRTC],[blocking-webrtc-] → [WebRTC],[blocking-webrtc-][leave open]
Okay, I think I see what's going on here, thanks to the log in Bug 854255 comment 7. The telltale clue is that things are cruising along just fine and then *BAM* out of left field, someone destroys the PC.

How can this happen? Well, the PC is rooted in the content javascript context. By my understanding, the only way it can be destroyed is by the content javascript relinquishing all references to it. Once that happens, the garbage collector will eventually find it and clean it up.

Turning to the test at hand; the mochitest itself starts the chain of events like this:

  runTest(function () {
    var test = new PeerConnectionTest();
    test.setMediaConstraints([{audio: true}, {video: true}],
                             [{audio: true}, {video: true}]);
    test.run();
  }, true);

That "PeerConnectionTest" object contains the two PCs that will be used for the test.

Now, keep in mind that the "test.run()" function call only executes the first step in the chain of events and then returns. And then the runTest() function returns. And the "test" variable goes out of scope.

That last step is critical: once the "test" variable goes out of scope, the entire graph of objects comprising PeerConnectionTest, its two PCs, their observers, and their supporting C++ objects becomes unrooted, and subject to collection the next time the garbage collector runs.

If that happens before the test is finished, then your PCs go away (along with all your testing objects), and no more callbacks get fired.

The following test cases all reuse this pattern, and will consequently share this pathology:

* test_peerConnection_basicAudio.html (this bug)
* test_peerConnection_basicAudioVideo.html (Bug 854255)
* test_peerConnection_basicAudioVideoCombined.html (Bug 845538)
* test_peerConnection_basicVideo.html
* test_peerConnection_bug827843.html
* test_peerConnection_bug840344.html (Bug 841496)

I'm duping those other bugs to this one, and renaming this one so it will match in tbpl for star purposes.

I think the solution is to move the "var test = new PeerConnectionTest();" into a global scope so it survives for the duration of the test. I'll attach a patch shortly.
Summary: Intermittent /tests/dom/media/tests/mochitest/test_peerConnection_basicAudio.html | Test timed out → Intermittent test_peerConnection_basicAudio.html, test_peerConnection_basicAudioVideo.html, test_peerConnection_basicAudioVideoCombined.html, test_peerConnection_basicVideo.html, test_peerConnection_bug827843.html, test_peerConnection_bug | Test timed out
This patch is untested -- I'm waiting for a tree clobber to finish so I can give it a try.
Comment on attachment 733333 [details] [diff] [review]
Move PeerConnectionTest instance reference into global scope

This works locally, although I've never been able to replicate the timeout on a local machine (so I can't verify that the problem is gone)...
Attachment #733333 - Flags: review?(rjesup)
Comment on attachment 733333 [details] [diff] [review]
Move PeerConnectionTest instance reference into global scope

Review of attachment 733333 [details] [diff] [review]:
-----------------------------------------------------------------

Go for it!
Attachment #733333 - Flags: review?(rjesup) → review+
https://hg.mozilla.org/integration/mozilla-inbound/rev/8daa66c443a3
Whiteboard: [WebRTC],[blocking-webrtc-][leave open] → [WebRTC],[blocking-webrtc-]
Sorry, backed this out because it's failing on Android and B2G:
https://hg.mozilla.org/integration/mozilla-inbound/rev/b76be1ee5350
Ah, I know why this failed. In global scope, we don't do the user agent checking, so this is always going to fail on alternative platforms right now.

A simple fix to this problem I think would be to define the var in global scope, but only create the object within the runTest object.
Comment on attachment 733756 [details] [diff] [review]
Move PeerConnectionTest instance reference into global scope

Either one should do for this sort of bustage fix

https://tbpl.mozilla.org/?tree=Try&rev=257ff0591dc6
(Oranges are not us; mostly infra)
Attachment #733756 - Flags: review?(jsmith)
Attachment #733756 - Flags: review?(hskupin)
Comment on attachment 733756 [details] [diff] [review]
Move PeerConnectionTest instance reference into global scope

Review of attachment 733756 [details] [diff] [review]:
-----------------------------------------------------------------

Yay for complicate dependencies. This looks fine now and should also pass on Android.
Attachment #733756 - Flags: review?(jsmith)
Attachment #733756 - Flags: review?(hskupin)
Attachment #733756 - Flags: review+
Attachment #733333 - Attachment is obsolete: true
Whiteboard: [WebRTC],[blocking-webrtc-] → [WebRTC],[blocking-webrtc-][webrtc-uplift]
Backed out for leaking the world on Windows mochitest-2.
https://hg.mozilla.org/integration/mozilla-inbound/rev/f7156b703f3d

https://tbpl.mozilla.org/php/getParsedLog.php?id=21469097&tree=Mozilla-Inbound

The tbpl leak analyzer says 10 DOMWINDOWS leaked.
Well, so much for my idea for working around this problem. We fix the platform issue and now we have leaks.

Clint - Why is mochitest blowing up here with the global reference we are doing in this patch?
Flags: needinfo?(ctalbert)
I looked into it in detail this morning and chatted with Ryan.  This really, really looks like some other random orange - the log is full of no-window and out-of-memory errors that don't seem connected to our code while trying to shut down (and that may be why cleanup didn't happen for a ton of stuff).  There's no way moving the scope of this JS var caused this, and all the pushes between this and the backout were green as well (or other known oranges).

I'm going to ask Ryan to let me re-land it (we'd discussed that I would likely do so earlier).
Flags: needinfo?(ctalbert)
https://hg.mozilla.org/integration/mozilla-inbound/rev/8262a337d5be

The leaks occurred in Win7 M2, and the logs show none of the tests in this patch ran in M2.  Relanded.
https://hg.mozilla.org/mozilla-central/rev/8262a337d5be
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla23
Comment on attachment 733756 [details] [diff] [review]
Move PeerConnectionTest instance reference into global scope

[Approval Request Comment]
Bug caused by (feature/regressing bug #): Initial test

User impact if declined: Random oranges in tests - this patch just updates the tests to make them not sensitive to GC.

Testing completed (on m-c, etc.): Just landed on m-c.

Risk to taking this patch (and alternatives if risky): minimal to no risk and the risk is only to tests run on 22 in automation.

String or IDL/UUID changes made by this patch:
Attachment #733756 - Flags: approval-mozilla-aurora?
Whiteboard: [WebRTC],[blocking-webrtc-][webrtc-uplift] → [WebRTC],[blocking-webrtc-][webrtc-uplift][qa-]
Comment on attachment 733756 [details] [diff] [review]
Move PeerConnectionTest instance reference into global scope

Basically NPOTB, and fixes orange. a+
Attachment #733756 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Per comment 29, removing webrtc-uplift flag
Whiteboard: [WebRTC],[blocking-webrtc-][webrtc-uplift][qa-] → [WebRTC],[blocking-webrtc-][qa-]
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: