Closed Bug 979627 Opened 10 years ago Closed 5 years ago

Intermittent /tests/dom/media/tests/ipc/test_ipc.html | test_dataChannel_basicAudioVideo.html,test_peerConnection_basicAudioVideoCombined.html | Test timed out.

Categories

(Core :: WebRTC, defect, P3)

29 Branch
x86_64
Linux
defect

Tracking

()

RESOLVED WORKSFORME

People

(Reporter: KWierso, Assigned: jesup)

References

Details

(Keywords: intermittent-failure)

Attachments

(1 file)

https://tbpl.mozilla.org/php/getParsedLog.php?id=35622290&tree=Fx-Team
slave: tst-linux32-spot-325


15:19:52     INFO -  429 INFO TEST-INFO | dumping last 19 message(s)
15:19:52     INFO -  430 INFO TEST-INFO | if you need more context, please use SimpleTest.requestCompleteLog() in your test
15:19:52     INFO -  431 INFO TEST-PASS | /tests/dom/media/tests/ipc/test_ipc.html | Got first iframe load event.
15:19:52     INFO -  432 INFO TEST-PASS | /tests/dom/media/tests/ipc/test_ipc.html | Got second iframe load event.
15:19:52     INFO -  433 INFO TEST-PASS | /tests/dom/media/tests/ipc/test_ipc.html | /tests/dom/media/tests/mochitest/test_dataChannel_basicAudioVideo.html |  iceConnectionState starts at 'new'
15:19:52     INFO -  434 INFO TEST-PASS | /tests/dom/media/tests/ipc/test_ipc.html | /tests/dom/media/tests/mochitest/test_dataChannel_basicAudioVideo.html |  iceConnectionState starts at 'new'
15:19:52     INFO -  435 INFO TEST-PASS | /tests/dom/media/tests/ipc/test_ipc.html | /tests/dom/media/tests/mochitest/test_dataChannel_basicAudioVideo.html |  Initial local signalingState is stable
15:19:52     INFO -  436 INFO TEST-PASS | /tests/dom/media/tests/ipc/test_ipc.html | /tests/dom/media/tests/mochitest/test_dataChannel_basicAudioVideo.html |  Initial remote signalingState is stable
15:19:52     INFO -  437 INFO TEST-PASS | /tests/dom/media/tests/ipc/test_ipc.html | /tests/dom/media/tests/mochitest/test_dataChannel_basicAudioVideo.html |  DataChannelWrapper (pcLocal_channel_0) is of binary type 'blob'
15:19:52     INFO -  438 INFO TEST-PASS | /tests/dom/media/tests/ipc/test_ipc.html | /tests/dom/media/tests/mochitest/test_dataChannel_basicAudioVideo.html |  DataChannelWrapper (pcLocal_channel_0) is in state: 'connecting'
15:19:52     INFO -  439 INFO TEST-PASS | /tests/dom/media/tests/ipc/test_ipc.html | /tests/dom/media/tests/mochitest/test_dataChannel_basicAudioVideo.html |  Create datachannel does not change signaling state
15:19:52     INFO -  440 INFO TEST-PASS | /tests/dom/media/tests/ipc/test_ipc.html | /tests/dom/media/tests/mochitest/test_dataChannel_basicAudioVideo.html |  Local create offer does not change signaling state
15:19:52     INFO -  441 INFO TEST-PASS | /tests/dom/media/tests/ipc/test_ipc.html | /tests/dom/media/tests/mochitest/test_dataChannel_basicAudioVideo.html |  m=application is contained in the SDP
15:19:52     INFO -  442 INFO TEST-PASS | /tests/dom/media/tests/ipc/test_ipc.html | /tests/dom/media/tests/mochitest/test_dataChannel_basicAudioVideo.html |  signalingState after local setLocalDescription is 'have-local-offer'
15:19:52     INFO -  443 INFO TEST-PASS | /tests/dom/media/tests/ipc/test_ipc.html | /tests/dom/media/tests/mochitest/test_dataChannel_basicAudioVideo.html |  signalingState after remote setRemoteDescription is 'have-remote-offer'
15:19:52     INFO -  444 INFO TEST-PASS | /tests/dom/media/tests/ipc/test_ipc.html | /tests/dom/media/tests/mochitest/test_dataChannel_basicAudioVideo.html |  Remote create offer does not change signaling state
15:19:52     INFO -  445 INFO TEST-PASS | /tests/dom/media/tests/ipc/test_ipc.html | /tests/dom/media/tests/mochitest/test_dataChannel_basicAudioVideo.html |  iceConnectionState should not be undefined
15:19:52     INFO -  446 INFO TEST-PASS | /tests/dom/media/tests/ipc/test_ipc.html | /tests/dom/media/tests/mochitest/test_dataChannel_basicAudioVideo.html |  signalingState after local setRemoteDescription is 'stable'
15:19:52     INFO -  447 INFO TEST-PASS | /tests/dom/media/tests/ipc/test_ipc.html | /tests/dom/media/tests/mochitest/test_dataChannel_basicAudioVideo.html |  iceConnectionState should not be undefined
15:19:52     INFO -  448 INFO TEST-PASS | /tests/dom/media/tests/ipc/test_ipc.html | /tests/dom/media/tests/mochitest/test_dataChannel_basicAudioVideo.html |  iceConnectionState should not be undefined
15:19:52     INFO -  449 INFO TEST-PASS | /tests/dom/media/tests/ipc/test_ipc.html | /tests/dom/media/tests/mochitest/test_dataChannel_basicAudioVideo.html |  iceConnectionState should not be undefined
15:19:52     INFO -  Xlib:  extension "RANDR" missing on display ":0".
15:19:55     INFO -  450 INFO TEST-UNEXPECTED-FAIL | /tests/dom/media/tests/ipc/test_ipc.html | /tests/dom/media/tests/mochitest/test_dataChannel_basicAudioVideo.html |  Test timed out.
15:19:55     INFO -  [Child 2518] WARNING: Failed to unlock the wakelock.: '!rv.Failed()', file /builds/slave/fx-team-lx-d-00000000000000000/build/content/html/content/src/HTMLVideoElement.cpp, line 301
15:19:55     INFO -  [Child 2518] WARNING: Failed to unlock the wakelock.: '!rv.Failed()', file /builds/slave/fx-team-lx-d-00000000000000000/build/content/html/content/src/HTMLVideoElement.cpp, line 301
15:19:55     INFO -  [Child 2518] WARNING: Failed to unlock the wakelock.: '!rv.Failed()', file /builds/slave/fx-team-lx-d-00000000000000000/build/content/html/content/src/HTMLVideoElement.cpp, line 301
15:19:55     INFO -  [Child 2518] WARNING: Failed to unlock the wakelock.: '!rv.Failed()', file /builds/slave/fx-team-lx-d-00000000000000000/build/content/html/content/src/HTMLVideoElement.cpp, line 301
15:19:55     INFO -  ++DOMWINDOW == 8 (0x8cbb9f8) [pid = 2518] [serial = 13] [outer = 0x8c976d8]
15:19:55     INFO -  -1315825472[84ab998]: [main|PeerConnectionImpl] PeerConnectionImpl.cpp:1638: CloseInt: Closing PeerConnectionImpl 016128d25a14fc20; ending call
15:19:55     INFO -  -1399755968[af609670]: [CCAPP Task|def] ccapi.c:1445: SIPCC-CC_API: 1/3, cc_int_onhook: UI -> GSM: ONHOOK
15:19:55     INFO -  -1399755968[af609670]: [CCAPP Task|def] ccapi.c:1447: (1/3) On-hook called from /builds/slave/fx-team-lx-d-00000000000000000/build/media/webrtc/signaling/src/sipcc/core/ccapp/ccprovider.c:657
15:19:55     INFO -  -1468884160[b0012ff8]: [GSM Task|def] dcsm.c:532: SIPCC-DCSM: dcsm_process_event: DCSM 11  :(DCSM_READY:ONHOOK )
15:19:55     INFO -  -1468884160[b0012ff8]: [GSM Task|fsm_sm] sm.c:46: SIPCC-FSM: sm_process_event: DEF 3   : 0xb36cd8c3x: sm entry: (STABLE:ONHOOK)
15:19:55     INFO -  -1468884160[b0012ff8]: [GSM Task|fsm_sm] fsmdef.c:6711: SIPCC-FSM: fsmdef_ev_onhook: Entered.
15:19:55     INFO -  -1468884160[b0012ff8]: [GSM Task|fsm_sm] fsmdef.c:1986: SIPCC-FSM: 1/3, fsmdef_release: Entered. cause= NORMAL
15:19:55     INFO -  -1468884160[b0012ff8]: [GSM Task|fsm_sm] fsmdef.c:9197: SIPCC-FSM: fsmdef_notify_hook_event: Entered.
https://tbpl.mozilla.org/php/getParsedLog.php?id=35619072&tree=Mozilla-Inbound
Summary: Intermittent test_dataChannel_basicAudioVideo.html | Test timed out. → Intermittent test_dataChannel_basicAudioVideo.html,test_peerConnection_basicAudioVideoCombined.html | Test timed out.
The "test filename" is the (non-unique, be nice if people could use different names for their ipc runner file) thing between the first and second pipe, that's why you're not getting these suggested, but are getting all of the indexedDB/test_ipc.html bugs suggested.

https://tbpl.mozilla.org/php/getParsedLog.php?id=35640198&tree=Fx-Team
Summary: Intermittent test_dataChannel_basicAudioVideo.html,test_peerConnection_basicAudioVideoCombined.html | Test timed out. → Intermittent /tests/dom/media/tests/ipc/test_ipc.html | test_dataChannel_basicAudioVideo.html,test_peerConnection_basicAudioVideoCombined.html | Test timed out.
Is there a bug tracking all of these WebRTC IPC test timeouts? They're showing up pretty high in the Orange Factor results.
Flags: needinfo?(rjesup)
See bug 987548 for part of the answer.  We shouldn't be running all these tests under IPC in the first place.  But there are still issues to resolve here (and IPC test framework isn't properly detecting test failures and exposing it either).  This was discussed today in the webrtc meeting; nils (drno) is looking at it.
Flags: needinfo?(rjesup)
(In reply to Ryan VanderMeulen [:RyanVM UTC-4] from comment #73)
> Is there a bug tracking all of these WebRTC IPC test timeouts? They're
> showing up pretty high in the Orange Factor results.

Yes there is: bug 978415
Any updates on this bug? This is probably one of the top WebRTC failures we hit on a regular basis.
Flags: needinfo?(drno)
Looks like a race condition in the tests. I'll have a closer look in the next couple days. Leaving NI myself as reminder.
Assignee: nobody → drno
Flags: needinfo?(drno)
Can we just bite the bullet here and disable them on e10s/test_ipc?
Flags: needinfo?(mreavy)
Sorry, Ryan, this fell off our radar because it was in the wrong category ("Video/Audio" instead of "WebRTC").

Nils, Byron -- Are you actively working on patches that will address this, or is it going to be a while before this is fixed?   

e10s is going to be enabled by default around the end of the month, so I'm reluctant to turn off some of the tests for it.  But it's unclear if this test is helping us or just noise at this point until we fix it.  Since it's e10s related, we may be able to get some of gcp's help (or help from someone on the e10s team).
Component: Video/Audio → WebRTC
Flags: needinfo?(mreavy)
Flags: needinfo?(drno)
Flags: needinfo?(docfaraday)
I have not diagnosed what is going on here yet.
Flags: needinfo?(docfaraday)
Sorry I dropped the ball here. I'll have a look at the logs within the next couple of days...
Just adding some more debug messages to figure out what could be going wrong here.
Flags: needinfo?(drno)
Attachment #8508896 - Flags: review?(docfaraday)
Attachment #8508896 - Flags: review?(docfaraday) → review+
This has collected logs from quite a few different test failures. But the majority of the test failures seem to be around PC_REMOTE timing out while waiting for the data channel to open. Just from the log files I can't really figure out what is going wrong there. So I think we can only add more instrumentation to get an idea where things are failing.
Try run looks green. Lets get this in and hope we get an idea of what is going on here.
Blocks: 1093443
Nils -- Do you have a better idea of what's going on here (now that we have additional instrumentation)?
Flags: needinfo?(drno)
Thanks for the reminder Maire.

I'm really puzzled by this. Just from the test logs it looks like because of some race condition (?) sometimes the data channel callbacks don't get called. In most cases the test fails, because the registered callback for ondatachannel never gets called.
I don't really see a race condition in the test code there. My only explanation would be that the test is replacing an old callback with another callback, and maybe that confuses something so that neither the old nor the new callback ever gets called?!
But then there also occasional test failures where the data channel gets established, but receiving the send message fails.

The other possibility I guess could be that the data channel really never gets established... because of some strange packet loss, or... ?!
Flags: needinfo?(drno)
From looking at the log output from the last failure:

13:49:31 INFO - 1740 INFO TEST-PASS | dom/media/tests/mochitest/ipc/test_ipc.html | test_dataChannel_basicAudioVideo.html | data channel is 'open' after 'onopen'
13:49:31 INFO - 1741 INFO Run step 35: PC_REMOTE_VERIFY_DATA_CHANNEL_STATE
13:49:31 INFO - 1742 INFO TEST-UNEXPECTED-FAIL | dom/media/tests/mochitest/ipc/test_ipc.html | test_dataChannel_basicAudioVideo.html | Test timed out. - expected PASS 

This looks to me like our tests are still missing 'onopen' events, but the channel actually is open. Proper time stamps on the log messages would be super helpful here.
That message is being generated by the onopen event, so it's not a missing event, though it might be that we aren't getting ALL of the channels open correctly.  I note that we don't do anything about checking that the channel is already open when we get it back.

Those data channels seem to be quite flaky.
See Also: → 1118283
I stumbled in here from this try: https://treeherder.mozilla.org/#/jobs?repo=try&revision=a97b2ae27c02

And from reading the logs I get the feeling that audio and video is starving the datachannels out so they get super slow. Trying to confirm that now.
Well, it sort of looked like it from my try but the logs here suggest otherwise.

See comment 541 (e10s, no ipc) and comment 538, comment 544 (ipc); all of them have this, or similar, early on in the failing test:

(This from comment 538)
17:56:57     INFO -  -1456478464[7fd8b0820b60]: Sent
17:56:57     INFO -  -1456478464[7fd8b0820b60]: message with PPID = 50, SID = 1, flags: 0x0003 due to error = 0x00000000
17:56:57     INFO -  -1456478464[7fd8b0820b60]:  0x03
17:56:57     INFO -  -1456478464[7fd8b0820b60]:  0x00
17:56:57     INFO -  -1456478464[7fd8b0820b60]:  0x00
17:56:57     INFO -  -1456478464[7fd8b0820b60]:  0x00
17:56:57     INFO -  -1456478464[7fd8b0820b60]:  0x00
17:56:57     INFO -  -1456478464[7fd8b0820b60]:  0x00
17:56:57     INFO -  -1456478464[7fd8b0820b60]:  0x00
17:56:57     INFO -  -1456478464[7fd8b0820b60]:  0x00
17:56:57     INFO -  -1456478464[7fd8b0820b60]:  0x00
17:56:57     INFO -  -1456478464[7fd8b0820b60]:  0x09
17:56:57     INFO -  -1456478464[7fd8b0820b60]:  0x00
17:56:57     INFO -  -1456478464[7fd8b0820b60]:  0x00
17:56:57     INFO -  -1456478464[7fd8b0820b60]:  0x63
17:56:57     INFO -  -1456478464[7fd8b0820b60]:  0x68
17:56:57     INFO -  -1456478464[7fd8b0820b60]:  0x61
17:56:57     INFO -  -1456478464[7fd8b0820b60]:  0x6e
17:56:57     INFO -  -1456478464[7fd8b0820b60]:  0x6e
17:56:57     INFO -  -1456478464[7fd8b0820b60]:  0x65
17:56:57     INFO -  -1456478464[7fd8b0820b60]:  0x6c
17:56:57     INFO -  -1456478464[7fd8b0820b60]:  0x5f
17:56:57     INFO -  -1456478464[7fd8b0820b60]:  0x30
17:56:57     INFO -  -1456478464[7fd8b0820b60]:  0x00
17:56:57     INFO -  -1456478464[7fd8b0820b60]:  0x00
17:56:57     INFO -  -1456478464[7fd8b0820b60]:  0x00
17:56:57     INFO -  -1456478464[7fd8b0820b60]: Association change: SCTP_COMM_LOST
17:56:57     INFO -  -1456478464[7fd8b0820b60]: Association change: streams (in/out) = (16/256)
17:56:57     INFO -  -1456478464[7fd8b0820b60]: Association: ABORT =
17:56:57     INFO -  -1456478464[7fd8b0820b60]:  0x06
17:56:57     INFO -  -1456478464[7fd8b0820b60]:  0x00
17:56:57     INFO -  -1456478464[7fd8b0820b60]:  0x00
17:56:57     INFO -  -1456478464[7fd8b0820b60]:  0x04
17:56:57     INFO -  -1089930816[7fd8b0820140]: Closing all channels (connection 7fd89eff7240)
17:56:57     INFO -  -1089930816[7fd8b0820140]: Connection 7fd89eff7240/Channel 7fd891d04160: Closing stream 1
17:56:57     INFO -  -1089930816[7fd8b0820140]: Connection 7fd89eff7240: Resetting outgoing stream 1
17:56:57     INFO -  -1089930816[7fd8b0820140]: Destroying Data channel 1
17:56:57     INFO -  -1089930816[7fd8b0820140]: Connection 7fd89eff7240: Sending outgoing stream reset for 1 streams
17:56:57     INFO -  -1089930816[7fd8b0820140]: ***failed: setsockopt RESET, errno 2
17:56:57     INFO -  -1456478464[7fd8b0820b60]: Association change: SCTP_COMM_LOST
17:56:57     INFO -  -1456478464[7fd8b0820b60]: Association change: streams (in/out) = (256/16)
17:56:57     INFO -  -1456478464[7fd8b0820b60]: Association: ABORT =
17:56:57     INFO -  -1456478464[7fd8b0820b60]:  0x06
17:56:57     INFO -  -1456478464[7fd8b0820b60]:  0x00
17:56:57     INFO -  -1456478464[7fd8b0820b60]:  0x00
17:56:57     INFO -  -1456478464[7fd8b0820b60]:  0x04
17:56:57     INFO -  -1089930816[7fd8b0820140]: Closing all channels (connection 7fd89eff7ba0)
Flags: needinfo?(drno)
Thanks Andreas. That is an interesting observation. What irritates me most is the error code 0 in that log. I found this in another log

07:51:10 INFO - -1777345728[a34ffac0]: Peer address DTLS connection is now
07:51:10 INFO - -1777345728[a34ffac0]: SCTP_ADDR_UNREACHABLE
07:51:10 INFO - -1777345728[a34ffac0]: (error = 0x00000000).

In that case the connection got established and actually exchanged data successfully until out of the blue apparently the SCTP connection is dead.

I'm afraid the over all problem here is that JS land never gets notified about SCTP/datachannel problems (and that results in the tests timing out). Randell do you have any thoughts on when and how we are going to bubble up SCTP errors into JS land?
Flags: needinfo?(drno) → needinfo?(rjesup)
(In reply to Andreas Pehrson [:pehrsons] (Telenor) from comment #550)
> Well, it sort of looked like it from my try but the logs here suggest
> otherwise.
> 
> See comment 541 (e10s, no ipc) and comment 538, comment 544 (ipc); all of
> them have this, or similar, early on in the failing test:
> 
> (This from comment 538)
> 17:56:57     INFO -  -1456478464[7fd8b0820b60]: Sent
> 17:56:57     INFO -  -1456478464[7fd8b0820b60]: message with PPID = 50, SID
> = 1, flags: 0x0003 due to error = 0x00000000

So, looking around with dxr seems to indicate that this logging could be generated as a result of hitting one of the 4 places below (these are the places that result in a call to sctp_notify_send_failed with |send| set to 1):

https://dxr.mozilla.org/mozilla-central/source/netwerk/sctp/src/netinet/sctputil.c#5224
https://dxr.mozilla.org/mozilla-central/source/netwerk/sctp/src/netinet/sctp_pcb.c#5746
https://dxr.mozilla.org/mozilla-central/source/netwerk/sctp/src/netinet/sctputil.c#3922
https://dxr.mozilla.org/mozilla-central/source/netwerk/sctp/src/netinet/sctputil.c#5174

I'm having a lot of trouble finding any clue that would tell us which one of these four we're hitting; there seems to be absolutely no logging in this SCTP stack.
Hey Nils - I talked with Randell, and he'd like to take this.  It's clearly related to SCTP
Assignee: drno → rjesup
Rank: 25
Flags: firefox-backlog+
Priority: -- → P2
(In reply to Maire Reavy [:mreavy] (Plz needinfo me) from comment #604)
> Hey Nils - I talked with Randell, and he'd like to take this.  It's clearly
> related to SCTP

Sure. I should have freed it up a lot earlier already as I haven't touched it for a long time.
backlog: --- → webRTC+
Flags: firefox-backlog+
Mass change P2->P3 to align with new Mozilla triage process.
Priority: P2 → P3
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → WORKSFORME
Removing leave-open keyword from resolved bugs, per :sylvestre.
Keywords: leave-open

Seems to have gone away again.

Status: REOPENED → RESOLVED
Closed: 6 years ago5 years ago
Flags: needinfo?(rjesup)
Resolution: --- → WORKSFORME
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: