Closed
Bug 1375540
Opened 7 years ago
Closed 6 years ago
Intermittent TEST-UNEXPECTED-TIMEOUT | dom/media/tests/mochitest/test_peerConnection_basicH264Video.html | application timed out after 330/370 seconds with no output
Categories
(Core :: WebRTC, defect, P2)
Core
WebRTC
Tracking
()
RESOLVED
FIXED
mozilla61
People
(Reporter: intermittent-bug-filer, Assigned: bwc)
References
Details
(Keywords: intermittent-failure, Whiteboard: [stockwell fixed:product])
Filed by: philringnalda [at] gmail.com https://treeherder.mozilla.org/logviewer.html#?job_id=109225704&repo=mozilla-inbound https://queue.taskcluster.net/v1/task/e8Pxd4xmRkCHYmkwZoeEIA/runs/0/artifacts/public/logs/live_backing.log
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment 12•7 years ago
|
||
Mass change P1->P2 to align with new Mozilla triage process
Priority: P1 → P2
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment 29•6 years ago
|
||
There have been 32 failures in the past week, according to orange factor: Ocurrences per plaform: 8 on windows10-64 5 on linux32-stylo-disabled 5 on linux64-qr 4 on Linux 4 on Linux x64 3 on Windows 7 2 on windows10-64-ccov 1 on linux64-stylo-disabled Occurrences per build type: 24 on debug 5 on opt 2 on pgo 1 on asan Here a relevant log file and a snippet with the failure: https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-inbound&job_id=155142552&lineNumber=54732 22:15:18 INFO - 2102 INFO TEST-PASS | dom/media/tests/mochitest/test_peerConnection_basicH264Video.html | PeerConnectionWrapper (pcRemote): legal ICE state transition from connected to closed 54731 22:15:18 INFO - Buffered messages finished 54732 22:15:18 WARNING - TEST-UNEXPECTED-TIMEOUT | dom/media/tests/mochitest/test_peerConnection_basicH264Video.html | application timed out after 370 seconds with no output 54733 22:15:18 INFO - 2103 ERROR Force-terminating active process(es). 54734 22:15:18 INFO - Determining child pids from psutil... :drno, could you please take a look? Thank you!
Flags: needinfo?(drno)
Whiteboard: [stockwell needswork]
Comment hidden (Intermittent Failures Robot) |
Comment 31•6 years ago
|
||
Over the last 7 days there are 30 failures present on this bug. These happen on Linux, Linux x64, linux32-stylo-disabled, linux64-qr, linux64-stylo-disabled, windows10-64 and windows10-64-ccov Here is the most recent log: https://treeherder.mozilla.org/logviewer.html#?repo=autoland&job_id=157380001&lineNumber=24705 Here is a relevant snippet from aforementioned log: [task 2018-01-19T08:21:58.457Z] 08:21:58 INFO - TEST-FAIL | dom/media/tests/mochitest/test_peerConnection_basicH264Video.html | The author of the test has indicated that flaky timeouts are expected. Reason: WebRTC inherently depends on timeouts [task 2018-01-19T08:21:58.458Z] 08:21:58 INFO - Buffered messages finished [task 2018-01-19T08:21:58.459Z] 08:21:58 ERROR - TEST-UNEXPECTED-TIMEOUT | dom/media/tests/mochitest/test_peerConnection_basicH264Video.html | application timed out after 370 seconds with no output [task 2018-01-19T08:21:58.460Z] 08:21:58 ERROR - Force-terminating active process(es). [task 2018-01-19T08:21:58.461Z] 08:21:58 INFO - Determining child pids from psutil... [task 2018-01-19T08:21:58.462Z] 08:21:58 INFO - [3072, 3107, 4121]
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment 34•6 years ago
|
||
Looks like some problem around ending tracks. When looking at the Orange factor for the last three months it looks to me that landing transceivers in bug 1290948 made this problem a lot worse. Byron, any idea why transceivers could result in track ending events going missing? Could you have look why the mochitest appears to fail to tear down the test properly here?
Depends on: 1290948
Flags: needinfo?(drno) → needinfo?(docfaraday)
Comment hidden (Intermittent Failures Robot) |
Assignee | ||
Comment 36•6 years ago
|
||
(In reply to Nils Ohlmeier [:drno] from comment #34) > Looks like some problem around ending tracks. When looking at the Orange > factor for the last three months it looks to me that landing transceivers in > bug 1290948 made this problem a lot worse. > > Byron, any idea why transceivers could result in track ending events going > missing? > Could you have look why the mochitest appears to fail to tear down the test > properly here? The transceivers landing changed the timing of a lot of stuff in subtle ways, because our timing was not spec compliant before. It may be that the event listeners aren't guaranteed to be installed soon enough? Let me give it a look.
Assignee | ||
Comment 37•6 years ago
|
||
Ok, so this is worse than track ended events not firing. We are deadlocked due to the threading rat's nest of webrtc.org and GMP. Here's what I can tell: webrtc.org is spinning the main thread GMP is sync dispatching from two webrtc.org threads to the GMP thread (and both of those stacks have a mutex locked in webrtc.org, one of which is being waited on by yet another webrtc.org thread) GMP is sync dispatching to main We're deadlocking/spinning main, here: https://searchfox.org/mozilla-central/source/media/webrtc/trunk/webrtc/base/platform_thread.cc#261 GMP is sync dispatching to main here: 11:10:43 INFO - 7 xul.dll!mozilla::SyncRunnable::DispatchToThread(nsIEventTarget *,nsIRunnable *,bool) [SyncRunnable.h:996706d59171 : 98 + 0x15] 11:10:43 INFO - eip = 0x5c09f714 esp = 0x109bf354 ebp = 0x109bf364 11:10:43 INFO - Found by: call frame info 11:10:43 INFO - 8 xul.dll!mozilla::gmp::GMPServiceCreateHelper::GetOrCreate() [GMPService.cpp:996706d59171 : 85 + 0x11] 11:10:43 INFO - eip = 0x5d8484b2 esp = 0x109bf36c ebp = 0x109bf384 11:10:43 INFO - Found by: call frame info https://searchfox.org/mozilla-central/source/dom/media/gmp/GMPService.cpp#85 The MSG thread also appears to be stuck, possibly waiting on main to do something: 11:10:43 INFO - 6 xul.dll!mozilla::SystemClockDriver::WaitForNextIteration() [GraphDriver.cpp:996706d59171 : 398 + 0xf] 11:10:43 INFO - eip = 0x5d7614ba esp = 0x112ff48c ebp = 0x112ff51c 11:10:43 INFO - Found by: call frame info 11:10:43 INFO - 7 xul.dll!mozilla::MediaStreamGraphImpl::UpdateMainThreadState() [MediaStreamGraph.cpp:996706d59171 : 1305 + 0xd] 11:10:43 INFO - eip = 0x5d7eb356 esp = 0x112ff524 ebp = 0x112ff530 11:10:43 INFO - Found by: call frame info https://searchfox.org/mozilla-central/source/dom/media/GraphDriver.cpp#398 A sync dispatch caused by GMP, from one of webrtc.org's threads to the GMP thread, while holding VideoReceiver::receive_crit_: 11:10:43 INFO - 7 xul.dll!mozilla::SyncRunnable::DispatchToThread(nsIEventTarget *,nsIRunnable *,bool) [SyncRunnable.h:996706d59171 : 98 + 0x15] 11:10:43 INFO - eip = 0x5c09f714 esp = 0x13edf5d0 ebp = 0x13edf5e0 11:10:43 INFO - Found by: call frame info 11:10:43 INFO - 8 xul.dll!mozilla::WebrtcGmpVideoDecoder::Decode(webrtc::EncodedImage const &,bool,webrtc::RTPFragmentationHeader const *,webrtc::CodecSpecificInfo const *,__int64) [WebrtcGmpVideoCodec.cpp:996706d59171 : 812 + 0x33] 11:10:43 INFO - eip = 0x5c8befc0 esp = 0x13edf5e8 ebp = 0x13edf600 11:10:43 INFO - Found by: call frame info https://searchfox.org/mozilla-central/source/media/webrtc/signaling/src/media-conduit/WebrtcGmpVideoCodec.cpp#815 webrtc::VideoStreamDecoder is stuck waiting on VideoReceiver::receive_crit_ (see above): 11:10:43 INFO - 4 xul.dll!rtc::CritScope::CritScope(rtc::CriticalSection const *) [criticalsection.cc:996706d59171 : 183 + 0xf] 11:10:43 INFO - eip = 0x5eba22bb esp = 0x1242fd68 ebp = 0x1242fd70 ebx = 0x00000001 11:10:43 INFO - Found by: call frame info 11:10:43 INFO - 5 xul.dll!webrtc::vcm::VideoReceiver::SetReceiveChannelParameters(__int64) [video_receiver.cc:996706d59171 : 165 + 0xc] 11:10:43 INFO - eip = 0x5eb82f70 esp = 0x1242fd78 ebp = 0x1242fd84 11:10:43 INFO - Found by: call frame info 11:10:43 INFO - 6 xul.dll!webrtc::VideoStreamDecoder::OnRttUpdate(__int64,__int64) [video_stream_decoder.cc:996706d59171 : 139 + 0x10] 11:10:43 INFO - eip = 0x5eafe7f2 esp = 0x1242fd8c ebp = 0x1242fd98 11:10:43 INFO - Found by: call frame info https://searchfox.org/mozilla-central/source/media/webrtc/trunk/webrtc/modules/video_coding/video_receiver.cc#165 Yet another sync dispatch caused by GMP, from one of webrtc.org's threads to the GMP thread (similar to above): 11:10:43 INFO - 7 xul.dll!mozilla::SyncRunnable::DispatchToThread(nsIEventTarget *,nsIRunnable *,bool) [SyncRunnable.h:996706d59171 : 98 + 0x15] 11:10:43 INFO - eip = 0x5c09f714 esp = 0x1416eb88 ebp = 0x1416eb98 11:10:43 INFO - Found by: call frame info 11:10:43 INFO - 8 xul.dll!mozilla::WebrtcGmpVideoEncoder::Encode(webrtc::VideoFrame const &,webrtc::CodecSpecificInfo const *,std::vector<webrtc::FrameType,std::allocator<webrtc::FrameType> > const *) [WebrtcGmpVideoCodec.cpp:996706d59171 : 322 + 0x29] 11:10:43 INFO - eip = 0x5c8c0c64 esp = 0x1416eba0 ebp = 0x1416ebb4 11:10:43 INFO - Found by: call frame info https://searchfox.org/mozilla-central/source/media/webrtc/signaling/src/media-conduit/WebrtcGmpVideoCodec.cpp#325 IncomingVideoStream is also waiting forever it seems: 11:10:43 INFO - 5 xul.dll!webrtc::EventTimerWin::Wait(unsigned long) [event_timer_win.cc:996706d59171 : 41 + 0xc] 11:10:43 INFO - eip = 0x5eb948fe esp = 0x13b9f800 ebp = 0x13b9f808 11:10:43 INFO - Found by: call frame info 11:10:43 INFO - 6 xul.dll!webrtc::IncomingVideoStream::IncomingVideoStreamProcess() [incoming_video_stream.cc:996706d59171 : 67 + 0xf] 11:10:43 INFO - eip = 0x5eb14aaa esp = 0x13b9f810 ebp = 0x13b9f910 11:10:43 INFO - Found by: call frame info https://searchfox.org/mozilla-central/source/media/webrtc/trunk/webrtc/common_video/incoming_video_stream.cc#67
Flags: needinfo?(docfaraday) → needinfo?(rjesup)
Comment 38•6 years ago
|
||
Dan recently landed in bug 1429390 a change to make some pieces async around H264 GMP. Is this possibly related?
See Also: → 1429390
Comment 39•6 years ago
|
||
From looking at OrangeFactor I don't see the number of timeouts actually going down since bug 1429390 landed on 1/25. So this looks like another problem. Maybe this is the decoding side, which is still sync?
Comment hidden (Intermittent Failures Robot) |
Comment 41•6 years ago
|
||
There have been 31 failures in the last 7 days. For the failure pattern see Comment 31 Do you have any updates here?
Updated•6 years ago
|
Flags: needinfo?(drno)
Comment hidden (Intermittent Failures Robot) |
Comment 43•6 years ago
|
||
Well... I wish I had more of the stacks from that analysis in comment 37, or a link to the log. However... It appears that we're trying to Stop the GMP at the same time we're trying to get a reference to it; the Encode() and Decode() sync calls appear to be red herrings - those are just blocked because the gmp thread has blocked. Though it's odd we're calling Stop() on the GMP thread (why I want the full stacks...). In any case though, since sSingleton is a) threadsafe, and b) never nulled or changed until ClearOnShutdown runs, we can safely do (in GetOrCreate()): if (sSingleton) { // sSingleton is guaranteed to persist until ClearOnShutdown runs RefPtr<GeckoMediaPluginService> service(sSingleton.get()); return service.forget(); } sync dispatches are always dangerous, and one should (almost) never, ever sync dispatch to mainthread. That sync dispatch to main appears to go back to the original code by josh, before we got involved for OpenH264. Probably a fair bit of re-architecting would be needed to get rid of the dispatch entirely, however, if this is created once, it's hard for the mainthread to be trying to do a synchronous operation like Stop against it before it's created. The problem now is that every call to get a handle on it dispatches sync to main. My snippet above would remove that
Flags: needinfo?(rjesup)
Comment hidden (Intermittent Failures Robot) |
Updated•6 years ago
|
Summary: Intermittent TEST-UNEXPECTED-TIMEOUT | dom/media/tests/mochitest/test_peerConnection_basicH264Video.html | application timed out after 330 seconds with no output → Intermittent TEST-UNEXPECTED-TIMEOUT | dom/media/tests/mochitest/test_peerConnection_basicH264Video.html | application timed out after 330/370 seconds with no output
Comment hidden (Intermittent Failures Robot) |
Assignee | ||
Comment 46•6 years ago
|
||
This test also uses the fake h264 encoder/decoder, which is broken. Might be the cause here too. See bug 1407653.
See Also: → 1407653
Comment hidden (Intermittent Failures Robot) |
Comment 48•6 years ago
|
||
Update: There have been 40 failures in the last week. Summary: Intermittent TEST-UNEXPECTED-TIMEOUT | dom/media/tests/mochitest/test_peerConnection_basicH264Video.html | application timed out after 330 seconds with no output → Intermittent TEST-UNEXPECTED-TIMEOUT | dom/media/tests/mochitest/test_peerConnection_basicH264Video.html | application timed out after 330/370 seconds with no output The failures occur mostly on Linux /debug. See also: bug 1407653 , bug 1379378 Here is a relevant log file and a snippet with the failure: https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-inbound&job_id=166813809&lineNumber=25225 [task 2018-03-08T19:42:35.960Z] 19:42:35 INFO - PeerConnectionWrapper (pcRemote): oniceconnectionstatechange fired, new state is: closed [task 2018-03-08T19:42:35.962Z] 19:42:35 INFO - TEST-PASS | dom/media/tests/mochitest/test_peerConnection_basicH264Video.html | PeerConnectionWrapper (pcRemote): legal ICE state transition from connected to closed [task 2018-03-08T19:42:35.962Z] 19:42:35 INFO - Buffered messages finished [task 2018-03-08T19:42:35.968Z] 19:42:35 ERROR - TEST-UNEXPECTED-TIMEOUT | dom/media/tests/mochitest/test_peerConnection_basicH264Video.html | application timed out after 370 seconds with no output [task 2018-03-08T19:42:35.969Z] 19:42:35 ERROR - Force-terminating active process(es). [task 2018-03-08T19:42:35.969Z] 19:42:35 INFO - Determining child pids from psutil... [task 2018-03-08T19:42:35.969Z] 19:42:35 INFO - [3697, 2779, 2807] [task 2018-03-08T19:42:35.969Z] 19:42:35 INFO - ==> process 2730 launched child process 2750 [task 2018-03-08T19:42:35.972Z] 19:42:35 INFO - ==> process 2730 launched child process 2779 [task 2018-03-08T19:42:35.972Z] 19:42:35 INFO - ==> process 2730 launched child process 2807 [task 2018-03-08T19:42:35.974Z] 19:42:35 INFO - ==> process 2730 launched child process 3697 [task 2018-03-08T19:42:35.976Z] 19:42:35 INFO - Found child pids: set([3697, 2779, 2750, 2807]) [task 2018-03-08T19:42:35.976Z] 19:42:35 INFO - Failed to get child procs [task 2018-03-08T19:42:35.976Z] 19:42:35 INFO - Killing process: 3697
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Assignee | ||
Comment 51•6 years ago
|
||
(In reply to Randell Jesup [:jesup] from comment #43) > Well... I wish I had more of the stacks from that analysis in comment 37, or > a link to the log. However... > Here's a recent example of this happening. https://taskcluster-artifacts.net/KvTDnRFPTAGkoxMAF9JzwA/0/public/logs/live_backing.log > It appears that we're trying to Stop the GMP at the same time we're trying > to get a reference to it; the Encode() and Decode() sync calls appear to be > red herrings - those are just blocked because the gmp thread has blocked. > Though it's odd we're calling Stop() on the GMP thread (why I want the full > stacks...). I do not see any sign that we're trying to stop GMP. We are trying to stop the VideoReceiveStream webrtc.org thread though, synchronously on main (ugh). Which sucks when this thread is deadlocked due to sync dispatches... > In any case though, since sSingleton is a) threadsafe, and b) never nulled > or changed until ClearOnShutdown runs, we can safely do (in GetOrCreate()): > if (sSingleton) { > // sSingleton is guaranteed to persist until ClearOnShutdown runs > RefPtr<GeckoMediaPluginService> service(sSingleton.get()); > return service.forget(); > } StaticRefPtr::mRawPtr is just a bare pointer, so simply doing if (sSingleton) will be racy. We need to do at least three things here. 1. rtc::PlatformThread::Stop either needs to stop blocking, or we need to ensure that we only stop webrtc threads from threads other than main. We are blocking main every time we stop a receive stream, at the very least. This is not ok. 2. We need to stop Sync dispatching in mozilla::WebrtcGmpVideoDecoder::Decode. 3. We need to stop sync dispatching when creating the GMP service.
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Comment hidden (Intermittent Failures Robot) |
Assignee | ||
Comment 55•6 years ago
|
||
Looks like bug 1448863 fixed this.
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Updated•6 years ago
|
Assignee: nobody → docfaraday
status-firefox59:
--- → wontfix
status-firefox60:
--- → affected
status-firefox61:
--- → fixed
Flags: needinfo?(drno)
Target Milestone: --- → mozilla61
Updated•6 years ago
|
Whiteboard: [stockwell needswork] → [stockwell fixed:product]
Comment hidden (Intermittent Failures Robot) |
Updated•6 years ago
|
You need to log in
before you can comment on or make changes to this bug.
Description
•