Closed Bug 1375540 Opened 7 years ago Closed 6 years ago

Intermittent TEST-UNEXPECTED-TIMEOUT | dom/media/tests/mochitest/test_peerConnection_basicH264Video.html | application timed out after 330/370 seconds with no output

Categories

(Core :: WebRTC, defect, P2)

defect

Tracking


RESOLVED FIXED
mozilla61
Tracking Status
firefox59 --- wontfix
firefox60 --- fixed
firefox61 --- fixed

People

(Reporter: intermittent-bug-filer, Assigned: bwc)

References

Details

(Keywords: intermittent-failure, Whiteboard: [stockwell fixed:product])

Perhaps fallout from bug 1341285?
Rank: 16
Priority: -- → P1
Mass change P1->P2 to align with new Mozilla triage process
Priority: P1 → P2
There have been 32 failures in the past week, according to Orange Factor.

Occurrences per platform:

8 on windows10-64
5 on linux32-stylo-disabled
5 on linux64-qr
4 on Linux
4 on Linux x64
3 on Windows 7
2 on windows10-64-ccov
1 on linux64-stylo-disabled

Occurrences per build type:

24 on debug
5 on opt
2 on pgo
1 on asan

Here is a relevant log file and a snippet with the failure:
https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-inbound&job_id=155142552&lineNumber=54732

22:15:18     INFO -  2102 INFO TEST-PASS | dom/media/tests/mochitest/test_peerConnection_basicH264Video.html | PeerConnectionWrapper (pcRemote): legal ICE state transition from connected to closed
22:15:18     INFO -  Buffered messages finished
22:15:18  WARNING -  TEST-UNEXPECTED-TIMEOUT | dom/media/tests/mochitest/test_peerConnection_basicH264Video.html | application timed out after 370 seconds with no output
22:15:18     INFO -  2103 ERROR Force-terminating active process(es).
22:15:18     INFO -  Determining child pids from psutil...

:drno, could you please take a look? Thank you!
Flags: needinfo?(drno)
Whiteboard: [stockwell needswork]
Over the last 7 days there have been 30 failures on this bug. They happen on Linux, Linux x64, linux32-stylo-disabled, linux64-qr, linux64-stylo-disabled, windows10-64 and windows10-64-ccov.

Here is the most recent log: https://treeherder.mozilla.org/logviewer.html#?repo=autoland&job_id=157380001&lineNumber=24705

Here is a relevant snippet from the aforementioned log:

[task 2018-01-19T08:21:58.457Z] 08:21:58     INFO - TEST-FAIL | dom/media/tests/mochitest/test_peerConnection_basicH264Video.html | The author of the test has indicated that flaky timeouts are expected.  Reason: WebRTC inherently depends on timeouts 
[task 2018-01-19T08:21:58.458Z] 08:21:58     INFO - Buffered messages finished
[task 2018-01-19T08:21:58.459Z] 08:21:58    ERROR - TEST-UNEXPECTED-TIMEOUT | dom/media/tests/mochitest/test_peerConnection_basicH264Video.html | application timed out after 370 seconds with no output
[task 2018-01-19T08:21:58.460Z] 08:21:58    ERROR - Force-terminating active process(es).
[task 2018-01-19T08:21:58.461Z] 08:21:58     INFO - Determining child pids from psutil...
[task 2018-01-19T08:21:58.462Z] 08:21:58     INFO - [3072, 3107, 4121]
Looks like some problem around ending tracks. Looking at Orange Factor for the last three months, it appears that landing transceivers in bug 1290948 made this problem a lot worse.

Byron, any idea why transceivers could result in track ending events going missing?
Could you have a look at why the mochitest appears to fail to tear down the test properly here?
Depends on: 1290948
Flags: needinfo?(drno) → needinfo?(docfaraday)
(In reply to Nils Ohlmeier [:drno] from comment #34)
> Looks like some problem around ending tracks. Looking at Orange Factor for
> the last three months, it appears that landing transceivers in bug 1290948
> made this problem a lot worse.
> 
> Byron, any idea why transceivers could result in track ending events going
> missing?
> Could you have a look at why the mochitest appears to fail to tear down the
> test properly here?

The transceivers landing changed the timing of a lot of stuff in subtle ways, because our timing was not spec compliant before. It may be that the event listeners aren't guaranteed to be installed soon enough? Let me give it a look.
Ok, so this is worse than track ended events not firing. We are deadlocked due to the threading rat's nest of webrtc.org and GMP. Here's what I can tell:

- webrtc.org is spinning the main thread
- GMP is sync dispatching from two webrtc.org threads to the GMP thread (and both of those stacks have a mutex locked in webrtc.org, one of which is being waited on by yet another webrtc.org thread)
- GMP is sync dispatching to main (a simplified sketch of this cycle follows below)
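
The shape of the cycle boils down to "sync dispatch while holding a lock that the other end of the dispatch chain needs". Here is a minimal, standalone sketch of that shape, using std::thread/std::mutex/std::promise stand-ins rather than the real Gecko/webrtc.org primitives (the names below are illustrative, and the program intentionally hangs; it exists only to show the pattern):

  #include <future>
  #include <mutex>
  #include <thread>

  std::mutex gReceiveLock;  // stands in for VideoReceiver::receive_crit_

  int main() {
    std::promise<void> lockHeld;   // decode thread signals once it owns the lock
    std::promise<void> gmpDone;    // completion signal of the "sync dispatch"
    std::future<void> lockHeldFuture = lockHeld.get_future();
    std::future<void> gmpDoneFuture = gmpDone.get_future();

    // "webrtc.org decode thread": takes the receive lock, then "sync
    // dispatches" by blocking until the GMP-side work completes.
    std::thread decodeThread([&] {
      std::lock_guard<std::mutex> lock(gReceiveLock);
      lockHeld.set_value();
      gmpDoneFuture.wait();        // blocked until the GMP-side task finishes
    });

    // "GMP thread": in this toy version it needs the same lock before it can
    // finish the dispatched work (in the real hang it is instead blocked
    // sync-dispatching to a main thread that is spinning in
    // rtc::PlatformThread::Stop(), but the cycle has the same shape).
    std::thread gmpThread([&] {
      lockHeldFuture.wait();       // make sure the decode thread went first
      std::lock_guard<std::mutex> lock(gReceiveLock);  // never acquired
      gmpDone.set_value();         // never reached
    });

    decodeThread.join();           // never returns: A waits on B, B waits on A
    gmpThread.join();
    return 0;
  }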



We're deadlocking/spinning main, here: https://searchfox.org/mozilla-central/source/media/webrtc/trunk/webrtc/base/platform_thread.cc#261


GMP is sync dispatching to main here:

11:10:43     INFO -   7  xul.dll!mozilla::SyncRunnable::DispatchToThread(nsIEventTarget *,nsIRunnable *,bool) [SyncRunnable.h:996706d59171 : 98 + 0x15]
11:10:43     INFO -      eip = 0x5c09f714   esp = 0x109bf354   ebp = 0x109bf364
11:10:43     INFO -      Found by: call frame info
11:10:43     INFO -   8  xul.dll!mozilla::gmp::GMPServiceCreateHelper::GetOrCreate() [GMPService.cpp:996706d59171 : 85 + 0x11]
11:10:43     INFO -      eip = 0x5d8484b2   esp = 0x109bf36c   ebp = 0x109bf384
11:10:43     INFO -      Found by: call frame info

https://searchfox.org/mozilla-central/source/dom/media/gmp/GMPService.cpp#85


The MSG thread also appears to be stuck, possibly waiting on main to do something:

11:10:43     INFO -   6  xul.dll!mozilla::SystemClockDriver::WaitForNextIteration() [GraphDriver.cpp:996706d59171 : 398 + 0xf]
11:10:43     INFO -      eip = 0x5d7614ba   esp = 0x112ff48c   ebp = 0x112ff51c
11:10:43     INFO -      Found by: call frame info
11:10:43     INFO -   7  xul.dll!mozilla::MediaStreamGraphImpl::UpdateMainThreadState() [MediaStreamGraph.cpp:996706d59171 : 1305 + 0xd]
11:10:43     INFO -      eip = 0x5d7eb356   esp = 0x112ff524   ebp = 0x112ff530
11:10:43     INFO -      Found by: call frame info

https://searchfox.org/mozilla-central/source/dom/media/GraphDriver.cpp#398


A sync dispatch caused by GMP, from one of webrtc.org's threads to the GMP thread, while holding VideoReceiver::receive_crit_:

11:10:43     INFO -   7  xul.dll!mozilla::SyncRunnable::DispatchToThread(nsIEventTarget *,nsIRunnable *,bool) [SyncRunnable.h:996706d59171 : 98 + 0x15]
11:10:43     INFO -      eip = 0x5c09f714   esp = 0x13edf5d0   ebp = 0x13edf5e0
11:10:43     INFO -      Found by: call frame info
11:10:43     INFO -   8  xul.dll!mozilla::WebrtcGmpVideoDecoder::Decode(webrtc::EncodedImage const &,bool,webrtc::RTPFragmentationHeader const *,webrtc::CodecSpecificInfo const *,__int64) [WebrtcGmpVideoCodec.cpp:996706d59171 : 812 + 0x33]
11:10:43     INFO -      eip = 0x5c8befc0   esp = 0x13edf5e8   ebp = 0x13edf600
11:10:43     INFO -      Found by: call frame info

https://searchfox.org/mozilla-central/source/media/webrtc/signaling/src/media-conduit/WebrtcGmpVideoCodec.cpp#815


webrtc::VideoStreamDecoder is stuck waiting on VideoReceiver::receive_crit_ (see above):

11:10:43     INFO -   4  xul.dll!rtc::CritScope::CritScope(rtc::CriticalSection const *) [criticalsection.cc:996706d59171 : 183 + 0xf]
11:10:43     INFO -      eip = 0x5eba22bb   esp = 0x1242fd68   ebp = 0x1242fd70   ebx = 0x00000001
11:10:43     INFO -      Found by: call frame info
11:10:43     INFO -   5  xul.dll!webrtc::vcm::VideoReceiver::SetReceiveChannelParameters(__int64) [video_receiver.cc:996706d59171 : 165 + 0xc]
11:10:43     INFO -      eip = 0x5eb82f70   esp = 0x1242fd78   ebp = 0x1242fd84
11:10:43     INFO -      Found by: call frame info
11:10:43     INFO -   6  xul.dll!webrtc::VideoStreamDecoder::OnRttUpdate(__int64,__int64) [video_stream_decoder.cc:996706d59171 : 139 + 0x10]
11:10:43     INFO -      eip = 0x5eafe7f2   esp = 0x1242fd8c   ebp = 0x1242fd98
11:10:43     INFO -      Found by: call frame info

https://searchfox.org/mozilla-central/source/media/webrtc/trunk/webrtc/modules/video_coding/video_receiver.cc#165


Yet another sync dispatch caused by GMP, from one of webrtc.org's threads to the GMP thread (similar to above):

11:10:43     INFO -   7  xul.dll!mozilla::SyncRunnable::DispatchToThread(nsIEventTarget *,nsIRunnable *,bool) [SyncRunnable.h:996706d59171 : 98 + 0x15]
11:10:43     INFO -      eip = 0x5c09f714   esp = 0x1416eb88   ebp = 0x1416eb98
11:10:43     INFO -      Found by: call frame info
11:10:43     INFO -   8  xul.dll!mozilla::WebrtcGmpVideoEncoder::Encode(webrtc::VideoFrame const &,webrtc::CodecSpecificInfo const *,std::vector<webrtc::FrameType,std::allocator<webrtc::FrameType> > const *) [WebrtcGmpVideoCodec.cpp:996706d59171 : 322 + 0x29]
11:10:43     INFO -      eip = 0x5c8c0c64   esp = 0x1416eba0   ebp = 0x1416ebb4
11:10:43     INFO -      Found by: call frame info

https://searchfox.org/mozilla-central/source/media/webrtc/signaling/src/media-conduit/WebrtcGmpVideoCodec.cpp#325


IncomingVideoStream is also waiting forever it seems:

11:10:43     INFO -   5  xul.dll!webrtc::EventTimerWin::Wait(unsigned long) [event_timer_win.cc:996706d59171 : 41 + 0xc]
11:10:43     INFO -      eip = 0x5eb948fe   esp = 0x13b9f800   ebp = 0x13b9f808
11:10:43     INFO -      Found by: call frame info
11:10:43     INFO -   6  xul.dll!webrtc::IncomingVideoStream::IncomingVideoStreamProcess() [incoming_video_stream.cc:996706d59171 : 67 + 0xf]
11:10:43     INFO -      eip = 0x5eb14aaa   esp = 0x13b9f810   ebp = 0x13b9f910
11:10:43     INFO -      Found by: call frame info

https://searchfox.org/mozilla-central/source/media/webrtc/trunk/webrtc/common_video/incoming_video_stream.cc#67
Flags: needinfo?(docfaraday) → needinfo?(rjesup)
Dan recently landed a change in bug 1429390 to make some pieces around H264 GMP async. Is this possibly related?
See Also: → 1429390
From looking at OrangeFactor I don't see the number of timeouts actually going down since bug 1429390 landed on 1/25. So this looks like another problem. Maybe this is the decoding side, which is still sync?
There have been 31 failures in the last 7 days. 
For the failure pattern, see comment 31.
Do you have any updates here?
Flags: needinfo?(drno)
Well... I wish I had more of the stacks from that analysis in comment 37, or a link to the log.  However...

It appears that we're trying to Stop the GMP at the same time we're trying to get a reference to it; the Encode() and Decode() sync calls appear to be red herrings - those are just blocked because the gmp thread has blocked.  Though it's odd we're calling Stop() on the GMP thread (why I want the full stacks...).  

In any case though, since sSingleton is a) threadsafe, and b) never nulled or changed until ClearOnShutdown runs, we can safely do (in GetOrCreate()):
   if (sSingleton) {
     // sSingleton is guaranteed to persist until ClearOnShutdown runs
     RefPtr<GeckoMediaPluginService> service(sSingleton.get());
     return service.forget();
   }

Sync dispatches are always dangerous, and one should (almost) never, ever sync dispatch to the main thread.  That sync dispatch to main appears to go back to the original code by josh, before we got involved for OpenH264.

Probably a fair bit of re-architecting would be needed to get rid of the dispatch entirely. However, if this is created once, it's hard for the main thread to be trying to do a synchronous operation like Stop against it before it's created. The problem now is that every call to get a handle on it dispatches sync to main; my snippet above would remove that.
Flags: needinfo?(rjesup)
See Also: → 1379378
Summary: Intermittent TEST-UNEXPECTED-TIMEOUT | dom/media/tests/mochitest/test_peerConnection_basicH264Video.html | application timed out after 330 seconds with no output → Intermittent TEST-UNEXPECTED-TIMEOUT | dom/media/tests/mochitest/test_peerConnection_basicH264Video.html | application timed out after 330/370 seconds with no output
This test also uses the fake H264 encoder/decoder, which is broken. That might be the cause here too. See bug 1407653.
See Also: → 1407653
Update:
There have been 40 failures in the last week.

The failures occur mostly on Linux debug builds.

See also: bug 1407653, bug 1379378

Here is a relevant log file and a snippet with the failure:
https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-inbound&job_id=166813809&lineNumber=25225
[task 2018-03-08T19:42:35.960Z] 19:42:35     INFO - PeerConnectionWrapper (pcRemote): oniceconnectionstatechange fired, new state is: closed
[task 2018-03-08T19:42:35.962Z] 19:42:35     INFO - TEST-PASS | dom/media/tests/mochitest/test_peerConnection_basicH264Video.html | PeerConnectionWrapper (pcRemote): legal ICE state transition from connected to closed 
[task 2018-03-08T19:42:35.962Z] 19:42:35     INFO - Buffered messages finished
[task 2018-03-08T19:42:35.968Z] 19:42:35    ERROR - TEST-UNEXPECTED-TIMEOUT | dom/media/tests/mochitest/test_peerConnection_basicH264Video.html | application timed out after 370 seconds with no output
[task 2018-03-08T19:42:35.969Z] 19:42:35    ERROR - Force-terminating active process(es).
[task 2018-03-08T19:42:35.969Z] 19:42:35     INFO - Determining child pids from psutil...
[task 2018-03-08T19:42:35.969Z] 19:42:35     INFO - [3697, 2779, 2807]
[task 2018-03-08T19:42:35.969Z] 19:42:35     INFO - ==> process 2730 launched child process 2750
[task 2018-03-08T19:42:35.972Z] 19:42:35     INFO - ==> process 2730 launched child process 2779
[task 2018-03-08T19:42:35.972Z] 19:42:35     INFO - ==> process 2730 launched child process 2807
[task 2018-03-08T19:42:35.974Z] 19:42:35     INFO - ==> process 2730 launched child process 3697
[task 2018-03-08T19:42:35.976Z] 19:42:35     INFO - Found child pids: set([3697, 2779, 2750, 2807])
[task 2018-03-08T19:42:35.976Z] 19:42:35     INFO - Failed to get child procs
[task 2018-03-08T19:42:35.976Z] 19:42:35     INFO - Killing process: 3697
(In reply to Randell Jesup [:jesup] from comment #43)
> Well... I wish I had more of the stacks from that analysis in comment 37, or
> a link to the log.  However...
> 

Here's a recent example of this happening.

https://taskcluster-artifacts.net/KvTDnRFPTAGkoxMAF9JzwA/0/public/logs/live_backing.log

> It appears that we're trying to Stop the GMP at the same time we're trying
> to get a reference to it; the Encode() and Decode() sync calls appear to be
> red herrings - those are just blocked because the gmp thread has blocked. 
> Though it's odd we're calling Stop() on the GMP thread (why I want the full
> stacks...).  

   I do not see any sign that we're trying to stop GMP. We are trying to stop the VideoReceiveStream webrtc.org thread though, synchronously on main (ugh). Which sucks when this thread is deadlocked due to sync dispatches...
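
For illustration only: the blocking part of stopping such a thread could be pushed off the main thread, so main only gets an asynchronous completion signal. A rough standalone sketch of that idea with toy stand-ins (not the actual Gecko/webrtc.org code; a real fix would use an existing background task queue rather than a detached thread):

  #include <functional>
  #include <thread>
  #include <utility>

  // Stand-in for a webrtc.org receive-stream thread whose Stop() blocks
  // until the underlying thread has exited (like rtc::PlatformThread::Stop()).
  class ToyReceiveStream {
   public:
    void StopBlocking() { /* joins the underlying thread; may take a while */ }
  };

  // Instead of calling StopBlocking() on the main thread, push the blocking
  // part onto a worker and deliver a completion callback afterwards.
  void StopWithoutBlockingMain(ToyReceiveStream* aStream,
                               std::function<void()> aOnStopped) {
    std::thread([aStream, onStopped = std::move(aOnStopped)] {
      aStream->StopBlocking();  // the wait happens off the main thread
      onStopped();              // in Gecko this would be re-dispatched to main
    }).detach();
  }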

> In any case though, since sSingleton is a) threadsafe, and b) never nulled
> or changed until ClearOnShutdown runs, we can safely do (in GetOrCreate()):
>    if (sSingleton) {
>      // sSingleton is guaranteed to persist until ClearOnShutdown runs
>      RefPtr<GeckoMediaPluginService> service(sSingleton.get());
>      return service.forget();
>    }

   StaticRefPtr::mRawPtr is just a bare pointer, so simply doing if (sSingleton) will be racy.
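
One way to keep the fast path without the unsynchronized read would be to publish the pointer through an atomic once it has been created on the main thread (in Gecko terms presumably mozilla::Atomic rather than std::atomic, and the refcounting/lifetime details are glossed over here). A rough standalone sketch of the idea; the names are illustrative, not the actual GMPService code:

  #include <atomic>

  class GeckoMediaPluginServiceSketch;  // stand-in for the real service type

  // Written exactly once, on the main thread, after the service is fully
  // constructed; cleared only at shutdown after all users are gone.
  static std::atomic<GeckoMediaPluginServiceSketch*> sPublishedService{nullptr};

  // Safe to call from any thread: an acquire load either sees nullptr (fall
  // back to the existing create path) or a pointer to a fully constructed
  // service, with no data race on a bare pointer.
  GeckoMediaPluginServiceSketch* TryGetExistingService() {
    return sPublishedService.load(std::memory_order_acquire);
  }

  // Called on the main thread once creation has finished; pairs with the
  // acquire load above.
  void PublishService(GeckoMediaPluginServiceSketch* aService) {
    sPublishedService.store(aService, std::memory_order_release);
  }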


   We need to do at least three things here.

1. rtc::PlatformThread::Stop either needs to stop blocking, or we need to ensure that we only stop webrtc threads from threads other than main. We are blocking main every time we stop a receive stream, at the very least. This is not ok.

2. We need to stop sync dispatching in mozilla::WebrtcGmpVideoDecoder::Decode (a sketch of the async hand-off pattern follows after this list).

3. We need to stop sync dispatching when creating the GMP service.
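
For points 2 and 3, the general shape of the fix is to hand the work to the target thread together with a completion callback and return immediately, instead of blocking the caller the way SyncRunnable::DispatchToThread does. Here is a standalone sketch of that pattern with toy stand-ins (this is not the actual WebrtcGmpVideoDecoder/GMP code):

  #include <condition_variable>
  #include <cstdint>
  #include <functional>
  #include <mutex>
  #include <queue>
  #include <thread>
  #include <vector>

  // Toy single-threaded event target, standing in for the GMP thread.
  class ToyThread {
   public:
    ToyThread() : mRunner([this] { Run(); }) {}
    ~ToyThread() {
      Dispatch(nullptr);  // empty task = shut down
      mRunner.join();
    }
    void Dispatch(std::function<void()> aTask) {
      std::lock_guard<std::mutex> lock(mMutex);
      mQueue.push(std::move(aTask));
      mCondVar.notify_one();
    }
   private:
    void Run() {
      for (;;) {
        std::function<void()> task;
        {
          std::unique_lock<std::mutex> lock(mMutex);
          mCondVar.wait(lock, [this] { return !mQueue.empty(); });
          task = std::move(mQueue.front());
          mQueue.pop();
        }
        if (!task) {
          return;
        }
        task();
      }
    }
    std::mutex mMutex;
    std::condition_variable mCondVar;
    std::queue<std::function<void()>> mQueue;
    std::thread mRunner;
  };

  // "Decode": instead of blocking until the toy GMP thread has processed the
  // frame, hand it off together with a completion callback and return right
  // away, so the calling thread can never be wedged behind the GMP thread
  // (or, transitively, behind the main thread).
  void AsyncDecode(ToyThread& aGmpThread, std::vector<uint8_t> aFrame,
                   std::function<void(bool /*ok*/)> aOnDecoded) {
    aGmpThread.Dispatch(
        [frame = std::move(aFrame), onDecoded = std::move(aOnDecoded)] {
          bool ok = !frame.empty();  // pretend to decode
          onDecoded(ok);             // result flows back via callback, no blocking
        });
  }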
Depends on: 1448863
Looks like bug 1448863 fixed this.
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Assignee: nobody → docfaraday
Flags: needinfo?(drno)
Target Milestone: --- → mozilla61
Whiteboard: [stockwell needswork] → [stockwell fixed:product]