Closed Bug 930481 Opened 7 years ago Closed 2 years ago

NOT LINUX, OS X Debug Only: Intermittent test_dataChannel_basicAudioVideo.html,test_dataChannel_basicAudioVideoCombined.html | Test timed out.

Categories

(Core :: WebRTC, defect, P5)

All
macOS
defect

Tracking

()

RESOLVED WORKSFORME
Blocking Flags:

People

(Reporter: cbook, Unassigned)

References

()

Details

(Keywords: intermittent-failure, Whiteboard: [leave-open][test disabled on OS X])

Attachments

(2 files, 1 obsolete file)

Windows 7 32-bit fx-team opt test mochitest-3 on 2013-10-24 04:47:24 PDT for push 18ccb2ac1b15

slave: t-w732-ix-031

https://tbpl.mozilla.org/php/getParsedLog.php?id=29594098&tree=Fx-Team

23315 ERROR TEST-UNEXPECTED-FAIL | /tests/dom/media/tests/mochitest/test_dataChannel_basicAudioVideo.html | Test timed out.
Return code: 1
This appears to be a dup of bug 932601
Depends on: 932601
Reassigning to match the component of the likely real bug 932601
Component: WebRTC: Audio/Video → DOM
Randell, this is currently the #3 top orange on trunk. Can you please take a look?
Flags: needinfo?(rjesup)
https://tbpl.mozilla.org/php/getParsedLog.php?id=30648993&tree=Mozilla-Inbound
OS: Windows 7 → All
Hardware: x86 → All
Summary: Intermittent TEST-UNEXPECTED-FAIL | /tests/dom/media/tests/mochitest/test_dataChannel_basicAudioVideo.html | Test timed out. → Intermittent test_dataChannel_basicAudioVideo.html,test_peerConnection_basicAudioVideo.html | Test timed out.
Disabled test_dataChannel_basicAudioVideo.html on OS X in http://hg.mozilla.org/integration/mozilla-inbound/rev/f09ec38cd808 so we'll have a chance to sometimes get mochitest-3 to run to completion.
Whiteboard: [test disabled on OS X]
Whiteboard: [test disabled on OS X] → [test disabled on OS X][leave open]
Ok - comment 4 is a different bug than comment 3 and 2.  Comments 0 and 1 don't seem to match either 2/3 or 4+.

Starting with comment 4 this became a frequent orange.  Note that comment 4 was 7 pushes and 4 hours before webrtc 3.43 landed, so it really can't be that.  Though the timing is dangerously coincidental.  Also note that just before 3.43 landed, bwc landed bug 936031, which modified SIPCC.
Per comment 145, moving back to webrtc and unlinking from the DOM bug that was the cause of TBPL stars in comment 2 and 3.

This is a top-orange (now hidden for Mac OSX) webrtc test that never exits.  The two tests flagged are debug builds, testing basicAudioVideo with and without DataChannels.  However, looking over a sub-set of these, it seems they're almost all the with-data-channels version (I didn't see any that weren't, and perhaps those all are actually other bugs), and all on OSX 10.6 or 10.7 - I see no 10.8's.  In fact, 90-95%+ are 10.6.  This may just speak to the problem being a timing issue.

I will note that SCTP has mac-specific locking code in it, but also I should note that no changes have been made to that anywhere near the time this problem started.

What we really need is a way to instrument the builds so when this does happen, we'll get more information on how to narrow it down.  "Test Timed out" doesn't tell much.

The instance I looked at (comparing a "good" and "bad" run) it stops after the second:

INFO -  23389 INFO TEST-PASS | /tests/dom/media/tests/ipc/test_ipc.html | /tests/dom/media/tests/mochitest/test_dataChannel_basicAudioVideo.html |  signalingState after local setRemoteDescription is 'stable'

line (note there are more than one such line in the 'good' case), and before the line (in the good one) with:

08:06:32     INFO -  23390 INFO TEST-PASS | /tests/dom/media/tests/ipc/test_ipc.html | /tests/dom/media/tests/mochitest/test_dataChannel_basicAudioVideo.html |  PeerConnectionWrapper (pcLocal) is in state: 'open'

(and definitely before you get the "media flowing" logs.  I haven't checked more to see if the pattern holds.


abr: (and bwc/jib/etc): any ideas?
Component: DOM → WebRTC
No longer depends on: 932601
Flags: needinfo?(rjesup) → needinfo?(adam)
Disabled test_dataChannel_basicAudioVideoCombined.html in https://hg.mozilla.org/integration/mozilla-inbound/rev/51953c9a8740 for https://tbpl.mozilla.org/php/getParsedLog.php?id=30673329&tree=Mozilla-Inbound on the very next push after I disabled the last one.
Summary: Intermittent test_dataChannel_basicAudioVideo.html,test_peerConnection_basicAudioVideo.html | Test timed out. → Intermittent test_dataChannel_basicAudioVideo.html,test_peerConnection_basicAudioVideo.html,test_dataChannel_basicAudioVideoCombined.html | Test timed out.
Ted: can you look too?  Something is very, very wrong on the test machine, it seems to be running like molasses and timing out.  Could this be caused by whatever is throwing all the compositor errors?

This one:
https://tbpl.mozilla.org/php/getParsedLog.php?id=30328363&full=1&branch=mozilla-central
shows something VERY interesting.

Starting here (note the times):
05:24:40     INFO -  23238 INFO TEST-PASS | /tests/dom/media/tests/ipc/test_ipc.html | /tests/dom/media/tests/mochitest/test_dataChannel_basicAudioVideo.html |  signalingState after local setRemoteDescription is 'stable'
05:24:40     INFO -  1888955584[105614db0]: [main|PeerConnectionImpl] PeerConnectionImpl.cpp:254: PeerConnectionObserverDispatch processing mCallState = 36 (), mFsmState = 18 (HAVE_REMOTE_OFFER)
05:24:40     INFO -  1888955584[105614db0]: [main|PeerConnectionImpl] PeerConnectionImpl.cpp:270: Message contains error: 4: Cannot add found ICE candidate in state HAVE_REMOTE_OFFER
05:24:40     INFO -  1888955584[105614db0]: [main|PeerConnectionImpl] PeerConnectionImpl.cpp:417: : **** UNHANDLED CALL STATE : 36 ()

we get a zillion of the compositor logs and then: (22 seconds later):

05:25:02     INFO -  23239 INFO TEST-PASS | /tests/dom/media/tests/ipc/test_ipc.html | /tests/dom/media/tests/mochitest/test_dataChannel_basicAudioVideo.html |  PeerConnectionWrapper (pcLocal) is in state: 'open'
05:25:02     INFO -  23240 INFO TEST-PASS | /tests/dom/media/tests/ipc/test_ipc.html | /tests/dom/media/tests/mochitest/test_dataChannel_basicAudioVideo.html |  PeerConnectionWrapper (pcRemote) is in state: 'open'
05:25:02     INFO -  23241 INFO TEST-PASS | /tests/dom/media/tests/ipc/test_ipc.html | /tests/dom/media/tests/mochitest/test_dataChannel_basicAudioVideo.html |  signalingState after remote setLocalDescription is 'stable'

12 seconds later:

05:25:14     INFO -  23242 INFO TEST-PASS | /tests/dom/media/tests/ipc/test_ipc.html | /tests/dom/media/tests/mochitest/test_dataChannel_basicAudioVideo.html |  PeerConnectionWrapper (pcLocal) has 2 local streams
05:25:14     INFO -  23243 INFO TEST-PASS | /tests/dom/media/tests/ipc/test_ipc.html | /tests/dom/media/tests/mochitest/test_dataChannel_basicAudioVideo.html |  PeerConnectionWrapper (pcLocal) has 1 remote streams

16 seconds later:

05:25:30     INFO -  23244 INFO TEST-PASS | /tests/dom/media/tests/ipc/test_ipc.html | /tests/dom/media/tests/mochitest/test_dataChannel_basicAudioVideo.html |  PeerConnectionWrapper (pcRemote) has 2 local streams
05:25:30     INFO -  23245 INFO TEST-PASS | /tests/dom/media/tests/ipc/test_ipc.html | /tests/dom/media/tests/mochitest/test_dataChannel_basicAudioVideo.html |  PeerConnectionWrapper (pcRemote) has 1 remote streams

23 seconds later:

05:25:53     INFO -  23246 INFO TEST-PASS | /tests/dom/media/tests/ipc/test_ipc.html | /tests/dom/media/tests/mochitest/test_dataChannel_basicAudioVideo.html |  Media flowing for pcLocal_local_audio
05:25:53     INFO -  23247 INFO TEST-PASS | /tests/dom/media/tests/ipc/test_ipc.html | /tests/dom/media/tests/mochitest/test_dataChannel_basicAudioVideo.html |  Media flowing for pcLocal_local_video
05:25:53     INFO -  23248 INFO TEST-PASS | /tests/dom/media/tests/ipc/test_ipc.html | /tests/dom/media/tests/mochitest/test_dataChannel_basicAudioVideo.html |  Media flowing for pcLocal_remote_video

35 seconds later:

05:26:28     INFO -  23249 INFO TEST-PASS | /tests/dom/media/tests/ipc/test_ipc.html | /tests/dom/media/tests/mochitest/test_dataChannel_basicAudioVideo.html |  Media flowing for pcRemote_local_audio
05:26:28     INFO -  23250 INFO TEST-PASS | /tests/dom/media/tests/ipc/test_ipc.html | /tests/dom/media/tests/mochitest/test_dataChannel_basicAudioVideo.html |  Media flowing for pcRemote_local_video
05:26:28     INFO -  23251 INFO TEST-PASS | /tests/dom/media/tests/ipc/test_ipc.html | /tests/dom/media/tests/mochitest/test_dataChannel_basicAudioVideo.html |  Media flowing for pcRemote_remote_video

(Note: the test "timed out" at 05:30:13, but obviously kept running...)
**** 6 minutes and 26 seconds later *****

05:32:54     INFO -  23252 INFO TEST-PASS | /tests/dom/media/tests/ipc/test_ipc.html | /tests/dom/media/tests/mochitest/test_dataChannel_basicAudioVideo.html |  Message correctly transmitted from pcLocal to pcRemote.

******** 28 minutes later **********************

06:00:50     INFO -  23253 INFO TEST-PASS | /tests/dom/media/tests/ipc/test_ipc.html | (SimpleTest/TestRunner.js) |  Received data is of instance Blob
06:00:50     INFO -  [Parent 902] WARNING: could not find a compositor to schedule composition: file ../../../gfx/layers/ipc/CompositableTransactionParent.cpp, line 225
06:00:50     INFO -  [Parent 902] WARNING: could not find a compositor to schedule composition: file ../../../gfx/layers/ipc/CompositableTransactionParent.cpp, line 225
06:00:50     INFO -  23254 INFO TEST-PASS | /tests/dom/media/tests/ipc/test_ipc.html | (SimpleTest/TestRunner.js) |  Received data has the correct size.
Flags: needinfo?(ted)
And then, suddenly, it stops. I'm taking the "needinfo" for me off of this, as it appears to have been an infra issue, not a code issue.
Flags: needinfo?(adam)
It stopped because [test disabled on OS X], not because it was something else's fault.
Flags: needinfo?(adam)
(In reply to Phil Ringnalda (:philor) from comment #161)
> It stopped because [test disabled on OS X], not because it was something
> else's fault.

Okay, that makes sense. I hope to have some time coming up pretty soon where I can look into this, but I'm still left a bit confused. The mochitest.ini file only suppresses test_dataChannel_basicAudioVideoCombined.html and test_dataChannel_basicAudioVideoCombined.html. It looks like test_peerConnection_basicAudioVideo.html is running just fine without any logs showing up here. I'm removing it from the list of problematic test cases in the bug title.

Ted: if I've overlooked something and test_peerConnection_basicAudioVideo.html is actually causing problems, please let me know.
Flags: needinfo?(adam)
Summary: Intermittent test_dataChannel_basicAudioVideo.html,test_peerConnection_basicAudioVideo.html,test_dataChannel_basicAudioVideoCombined.html | Test timed out. → Intermittent test_dataChannel_basicAudioVideo.html,test_dataChannel_basicAudioVideoCombined.html | Test timed out.
Flags: needinfo?(adam)
Drat. The logs have all timed out. Here's a try push with the tests turned on:
https://tbpl.mozilla.org/?tree=Try&rev=063eab918a1a
Flags: needinfo?(adam)
Drat. The logs have all timed out. Here's a try push with the tests turned on:
https://tbpl.mozilla.org/?tree=Try&rev=063eab918a1a
Keep forgetting to uncheck this...
Flags: needinfo?(adam)
I think we're picking up logs that are almost certainly *different* bugs here. Updating the title of the bug accordingly.
OS: All → Mac OS X
Summary: Intermittent test_dataChannel_basicAudioVideo.html,test_dataChannel_basicAudioVideoCombined.html | Test timed out. → OS X Debug Only: Intermittent test_dataChannel_basicAudioVideo.html,test_dataChannel_basicAudioVideoCombined.html | Test timed out.
Attached image yo_dawg_mochi.png
Okay, that's really odd. The screenshot in the failure log I got from my try push above has the mochitests running inside the mochitest frame. The outer mochi test that is running is tests/dom/media/tests/ipc/test_ipc.html, which was introduced in patch 3 of Bug 525444. The Linux test I discounted above *also* shows the browser claiming to be running that test (albeit without the nested mochi test in the frame). It's not conclusive evidence that tests/dom/media/tests/ipc/test_ipc.html is to blame, but it sure is suspicious.

Eitan: Could you glance at the screen shot and log and make a guess about what might be going on here?
Flags: needinfo?(eitan)
I think you are confusing this with content/media/webspeech/synth/ipc/test/test_ipc.html. The test above was introduced in bug Bug #869869.
Flags: needinfo?(eitan)
Whoops! Sorry, Eitan.

I *really* like Bug 869869 as the trigger here, since its patches landed 2013-11-07 08:53:58 CST, while the oranges on this bug took off like a rocket starting 2013-11-07 16:26:11 CST.

Shih-Chiang: any insights? This issue seems to coincide exactly with the landing of the UDP e10s work. For easy reference, a representative log is here: https://tbpl.mozilla.org/php/getParsedLog.php?id=33059201&tree=Try&full=1
Flags: needinfo?(schien)
I saw several log indicates packet send fail and drop packet by packet filter. It might be the reason of test case timeout, however, I don't know the detail of what goes wrong. We need add more log to know what's going on here.

[send fail]
>15:01:42     INFO -  749993984[11f713540]: [|WebrtcAudioSessionConduit] AudioConduit.cpp:761: SendRTCPPacket RTCP Packet Send Failed
>15:01:42     INFO -  751054848[11d59f6a0]: [|WebrtcVideoSessionConduit] VideoConduit.cpp:948: SendRTCPPacket RTCP Packet Send Failed

[drop packet]
>15:01:42     INFO -  [Parent 901] WARNING: NS_ENSURE_TRUE(allowed) failed: file ../../../../dom/network/src/UDPSocketParent.cpp, line 241
>15:01:42     INFO -  [Parent 901] WARNING: NS_ENSURE_TRUE(allowed) failed: file ../../../../dom/network/src/UDPSocketParent.cpp, line 241
Flags: needinfo?(schien)
Attachment #8361176 - Attachment is obsolete: true
Assignee: nobody → adam
Status: NEW → ASSIGNED
Comment on attachment 8361185 [details] [diff] [review]
Improved failure instrumentation for transport layers

I'd like to get this checked in so we have a better view of what's going on when we get a layer failure (i.e., which layer the failure came from, where in the code we detected it). See the log in comment 167 -- there's an indication of some kind of transport error, but there's not enough information to begin figuring out what happened.
Attachment #8361185 - Flags: review?(ekr)
(In reply to Shih-Chiang Chien [:schien] from comment #172)
> I saw several log indicates packet send fail and drop packet by packet
> filter. It might be the reason of test case timeout, however, I don't know
> the detail of what goes wrong. We need add more log to know what's going on
> here.
> 
> [send fail]
> >15:01:42     INFO -  749993984[11f713540]: [|WebrtcAudioSessionConduit] AudioConduit.cpp:761: SendRTCPPacket RTCP Packet Send Failed
> >15:01:42     INFO -  751054848[11d59f6a0]: [|WebrtcVideoSessionConduit] VideoConduit.cpp:948: SendRTCPPacket RTCP Packet Send Failed

That's interesting. The conditional that produces those lines is:

>     if(mTransport && (mTransport->SendRtpPacket(data, len) == NS_OK))

And, right now, SendRtpPacket is hardcoded to always return NS_OK. So the only way we can get these messages is if mTransport is NULL. Given that WebrtcAudioConduit::AttachTransport() complains if it's called with a null transport (and I don't find that complaint in the log), it seems safe to assume that AttachTransport is never being called.

This AttachTransport is called from the STS thread, using a task that has been queued by MediaPipeline::Init(). Looking at VcmSIPCCBinding.cpp, it looks like any failure to init the media pipeline is also met with an error response.

So I'm led to conclude that these failures are likely due to the MediaPipeline::Init_s() method not executing before the packets start flowing. I'll add some logging to try to confirm this hypothesis and do another try push.
Okay, so here's a try with some trivial logging added on either side of the suspect dispatch:

https://tbpl.mozilla.org/?tree=Try&rev=820f65339f03

TBPL seems to be acting very flaky today, though...
Okay, it turns out that, under OS X 10.6, the subtests under dom/media/tests/ipc/test_ipc.html always time out, and sometimes crash.

When they do timeout, test_ipc.html does not report the failure upwards, so the harness reports success. I've filed this issue as Bug 960729.

So the oranges we were seeing above were caused by these crashes; but, all of the greens appear to timeout rather than being actual success. In all my try pushes so far, I have not seen a single OS X 10.6 debug run without timeouts.
Depends on: 960729
Flags: needinfo?(adam)
Assignee: adam → nobody
Comment on attachment 8361185 [details] [diff] [review]
Improved failure instrumentation for transport layers

Review of attachment 8361185 [details] [diff] [review]:
-----------------------------------------------------------------

lgtm
Attachment #8361185 - Flags: review?(ekr) → review+
Adding limited diagnostics:

https://hg.mozilla.org/integration/mozilla-inbound/rev/52b8d17dac75
Whiteboard: [test disabled on OS X][leave open] → [leave-open][test disabled on OS X]
Abr: ok, the patch has landed.  Next steps?  Who should investigate further, if anyone?
Flags: needinfo?(adam)
(In reply to Randell Jesup [:jesup] from comment #182)
> Abr: ok, the patch has landed.  Next steps?  Who should investigate further,
> if anyone?

So, I kind of confused matters here. That patch actually would provide better diagnostic information for the failure we saw in Comment 167. But I'm pretty sure that's a different bug than this one (possibly the same as Bug 892630?). In any case, the transport failures that the patch instruments don't arise under OS X 10.6 (as far as I can tell), and are designed to give us something to investigate if the Comment 167 log shows up again.

In terms of making progress here, I think the steps are:

1) Have a solution landed for Bug 960729. I don't know who we need to lean on to make this happen, but we're kind of flying blind on this bug (and, I suspect, several others) until that is fixed.

2) Open a bug along the lines of "e10s doesn't ever work for OS X 10.6," and assign it to the original author of the e10s patch (I think that's Shih-Chiang).

3) Once the fix for *that* bug lands, re-enable this pair of tests on the testers, and see if it reoccurs -- I suspect that it has a cause in common with whatever is causing the other (undetected) failures, and that it will run clean once that's taken care of.
Flags: needinfo?(adam)
Probably jduell, not s-c.
Depends on: 978415
Summary: OS X Debug Only: Intermittent test_dataChannel_basicAudioVideo.html,test_dataChannel_basicAudioVideoCombined.html | Test timed out. → NOT LINUX, OS X Debug Only: Intermittent test_dataChannel_basicAudioVideo.html,test_dataChannel_basicAudioVideoCombined.html | Test timed out.
FWIW, the way to actually successfully restrict a bug to just those failures you want it to be about is to file the bugs for the failures you don't want it to be about.

Well, we'll still pick the wrong one off and on, but not as often as we will while this is the only bug on file.
(In reply to Phil Ringnalda (:philor) from comment #192)
> FWIW, the way to actually successfully restrict a bug to just those failures
> you want it to be about is to file the bugs for the failures you don't want
> it to be about.
> 
> Well, we'll still pick the wrong one off and on, but not as often as we will
> while this is the only bug on file.

Filed bug 1007614 for the Linux Version
backlog: --- → webRTC+
Rank: 45
Priority: -- → P4
Flags: needinfo?(ted)
Mass change P4->P5 to align with new Mozilla triage process.
Priority: P4 → P5
Status: ASSIGNED → RESOLVED
Closed: 2 years ago
Resolution: --- → WORKSFORME
You need to log in before you can comment on or make changes to this bug.