Closed Bug 1339568 Opened 3 years ago Closed 2 years ago

Intermittent shutdown hang in linux32/64 mochitest-media-e10s jobs

Categories

(Core :: WebRTC, defect, P3)

defect

Tracking

()

RESOLVED FIXED
mozilla60
Tracking Status
firefox58 --- disabled
firefox59 --- disabled
firefox60 --- fixed

People

(Reporter: gbrown, Assigned: gbrown)

References

(Depends on 1 open bug, Blocks 1 open bug)

Details

Attachments

(3 files)

All end in many instances of:

[task 2017-02-11T00:45:42.282168Z] 00:45:42     INFO - ###!!! [Child][MessageChannel] Error: (msgtype=0x4400FD,name=PContent::Msg_AccumulateChildKeyedHistogram) Closed channel: cannot send/recv
[task 2017-02-11T00:45:42.283184Z] 00:45:42     INFO - [Child 5240] WARNING: MsgDropped in ContentChild: file /home/worker/workspace/build/src/dom/ipc/ContentChild.cpp, line 2049
[task 2017-02-11T00:45:42.283380Z] 00:45:42     INFO - [Child 5240] WARNING: '!contentChild->SendAccumulateChildKeyedHistogram(keyedAccumulationsToSend)', file /home/worker/workspace/build/src/toolkit/components/telemetry/TelemetryIPCAccumulator.cpp, line 215
[task 2017-02-11T00:45:44.284873Z] 00:45:44     INFO - ###!!! [Child][MessageChannel] Error: (msgtype=0x4400FD,name=PContent::Msg_AccumulateChildKeyedHistogram) Closed channel: cannot send/recv
[task 2017-02-11T00:45:44.286157Z] 00:45:44     INFO - [Child 5240] WARNING: MsgDropped in ContentChild: file /home/worker/workspace/build/src/dom/ipc/ContentChild.cpp, line 2049
[task 2017-02-11T00:45:44.286317Z] 00:45:44     INFO - [Child 5240] WARNING: '!contentChild->SendAccumulateChildKeyedHistogram(keyedAccumulationsToSend)', file /home/worker/workspace/build/src/toolkit/components/telemetry/TelemetryIPCAccumulator.cpp, line 215


Alessio, I see you know TelemetryIPCAccumulator; do you know what's going wrong here?
Flags: needinfo?(alessio.placitelli)
Flags: needinfo?(alessio.placitelli)
(In reply to Geoff Brown [:gbrown] from comment #1)
> All end in many instances of:
> 
> [task 2017-02-11T00:45:42.282168Z] 00:45:42     INFO - ###!!!
> [Child][MessageChannel] Error:
> (msgtype=0x4400FD,name=PContent::Msg_AccumulateChildKeyedHistogram) Closed
> channel: cannot send/recv
> [task 2017-02-11T00:45:42.283184Z] 00:45:42     INFO - [Child 5240] WARNING:
> MsgDropped in ContentChild: file
> /home/worker/workspace/build/src/dom/ipc/ContentChild.cpp, line 2049
> [task 2017-02-11T00:45:42.283380Z] 00:45:42     INFO - [Child 5240] WARNING:
> '!contentChild->SendAccumulateChildKeyedHistogram(keyedAccumulationsToSend)',
> file
> /home/worker/workspace/build/src/toolkit/components/telemetry/
> TelemetryIPCAccumulator.cpp, line 215
> [task 2017-02-11T00:45:44.284873Z] 00:45:44     INFO - ###!!!
> [Child][MessageChannel] Error:
> (msgtype=0x4400FD,name=PContent::Msg_AccumulateChildKeyedHistogram) Closed
> channel: cannot send/recv
> [task 2017-02-11T00:45:44.286157Z] 00:45:44     INFO - [Child 5240] WARNING:
> MsgDropped in ContentChild: file
> /home/worker/workspace/build/src/dom/ipc/ContentChild.cpp, line 2049
> [task 2017-02-11T00:45:44.286317Z] 00:45:44     INFO - [Child 5240] WARNING:
> '!contentChild->SendAccumulateChildKeyedHistogram(keyedAccumulationsToSend)',
> file
> /home/worker/workspace/build/src/toolkit/components/telemetry/
> TelemetryIPCAccumulator.cpp, line 215
> 
> 
> Alessio, I see you know TelemetryIPCAccumulator; do you know what's going
> wrong here?

This seems to say that a child process (content process in this case) is still up and running, and it's trying to accumulate some scalar measurements. Telemetry sends the gathered measurements to the parent process every 2 seconds in [1] (exactly where it is displaying the scary WARNING).

It looks like the IPC subsystem can't deliver the messages to the parent process.

I'm not an IPC expert, but a few lines above where this mess begins [2] it says that the IPC channel is being closed. Could this be happening because the tests are finished but, somehow, the content process isn't shutting down?

[1] - http://searchfox.org/mozilla-central/rev/ac8a72f2d5204ced2004c93a9c193430c63a6a71/toolkit/components/telemetry/TelemetryIPCAccumulator.cpp#215
[2] - https://treeherder.mozilla.org/logviewer.html#?repo=mozilla-inbound&job_id=76487177&lineNumber=10724
I don't think whats shown in comment 1 is the reason for the hang.
This is just Telemetry trying to send messages to the parent when the connection is already closed (which i'll file a separate bug on).
No new failures of this type seen in bug 1204281 since Feb 22.
Assignee: nobody → gbrown
Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → WORKSFORME
Whiteboard: [stockwell unknown]
Summary: Intermittent hang in test-linux64/debug-mochitest-media-e10s jobs → Intermittent hang in linux64 mochitest-media-e10s jobs
Summary: Intermittent hang in linux64 mochitest-media-e10s jobs → Intermittent hang in linux32/64 mochitest-media-e10s jobs
Whiteboard: [stockwell unknown] → [stockwell needswork]
In https://public-artifacts.taskcluster.net/Li0bAQ7HQa-G6UTHPEhzoA/0/public/logs/live_backing.log

As before, we see a long line of:

task 2017-03-15T22:19:39.931491Z] 22:19:39     INFO - GECKO(3007) | ###!!! [Child][MessageChannel] Error: (msgtype=0x4600F8,name=PContent::Msg_AccumulateChildKeyedHistograms) Closed channel: cannot send/recv
...
[task 2017-03-15T23:26:00.679323Z] 23:26:00     INFO - GECKO(3007) | ###!!! [Child][MessageChannel] Error: (msgtype=0x4600F8,name=PContent::Msg_AccumulateChildKeyedHistograms) Closed channel: cannot send/recv

[taskcluster:error] Task timeout after 5400 seconds. Force killing container.


In this case there is also a SEGV with a stack:

[task 2017-03-15T22:19:37.717577Z] 22:19:37     INFO - GECKO(3007) | ASAN:DEADLYSIGNAL
[task 2017-03-15T22:19:37.718666Z] 22:19:37     INFO - GECKO(3007) | =================================================================
[task 2017-03-15T22:19:37.719447Z] 22:19:37     INFO - GECKO(3007) | ==3007==ERROR: AddressSanitizer: SEGV on unknown address 0x000000000000 (pc 0x7f20732ef1de bp 0x7f204893bd70 sp 0x7f204893bd60 T347)
[task 2017-03-15T22:19:37.720655Z] 22:19:37     INFO - GECKO(3007) | ==3007==The signal is caused by a WRITE memory access.
[task 2017-03-15T22:19:37.721366Z] 22:19:37     INFO - GECKO(3007) | ==3007==Hint: address points to the zero page.
[task 2017-03-15T22:19:37.931862Z] 22:19:37     INFO - GECKO(3007) | ###!!! [Child][MessageChannel] Error: (msgtype=0x4600F8,name=PContent::Msg_AccumulateChildKeyedHistograms) Closed channel: cannot send/recv
[task 2017-03-15T22:19:38.149602Z] 22:19:38     INFO - GECKO(3007) |     #0 0x7f20732ef1dd in mozilla::(anonymous namespace)::RunWatchdog(void*) /home/worker/workspace/build/src/toolkit/components/terminator/nsTerminator.cpp:159:5
[task 2017-03-15T22:19:38.151250Z] 22:19:38     INFO - GECKO(3007) |     #1 0x7f207f5d8d23 in _pt_root /home/worker/workspace/build/src/nsprpub/pr/src/pthreads/ptthread.c:216:5
[task 2017-03-15T22:19:38.152112Z] 22:19:38     INFO - GECKO(3007) |     #2 0x7f20838cbe99 in start_thread /build/eglibc-oqps9y/eglibc-2.15/nptl/pthread_create.c:308
[task 2017-03-15T22:19:38.194877Z] 22:19:38     INFO - GECKO(3007) |     #3 0x7f20829c736c in clone /build/eglibc-oqps9y/eglibc-2.15/misc/../sysdeps/unix/sysv/linux/x86_64/clone.S:112
[task 2017-03-15T22:19:38.195187Z] 22:19:38     INFO - GECKO(3007) | AddressSanitizer can not provide additional info.
[task 2017-03-15T22:19:38.195707Z] 22:19:38     INFO - GECKO(3007) | SUMMARY: AddressSanitizer: SEGV /home/worker/workspace/build/src/toolkit/components/terminator/nsTerminator.cpp:159:5 in mozilla::(anonymous namespace)::RunWatchdog(void*)
[task 2017-03-15T22:19:38.196058Z] 22:19:38     INFO - GECKO(3007) | Thread T347 (Shutdow~minator) created by T0 here:
[task 2017-03-15T22:19:38.199132Z] 22:19:38     INFO - GECKO(3007) |     #0 0x4a3a66 in __interceptor_pthread_create /builds/slave/moz-toolchain/src/llvm/projects/compiler-rt/lib/asan/asan_interceptors.cc:245:3
[task 2017-03-15T22:19:38.200199Z] 22:19:38     INFO - GECKO(3007) |     #1 0x7f207f5d5ac9 in _PR_CreateThread /home/worker/workspace/build/src/nsprpub/pr/src/pthreads/ptthread.c:457:14
[task 2017-03-15T22:19:38.200781Z] 22:19:38     INFO - GECKO(3007) |     #2 0x7f207f5d56de in PR_CreateThread /home/worker/workspace/build/src/nsprpub/pr/src/pthreads/ptthread.c:548:12
[task 2017-03-15T22:19:38.204357Z] 22:19:38     INFO - GECKO(3007) |     #3 0x7f20732ef9a7 in CreateSystemThread /home/worker/workspace/build/src/toolkit/components/terminator/nsTerminator.cpp:73:22
[task 2017-03-15T22:19:38.205230Z] 22:19:38     INFO - GECKO(3007) |     #4 0x7f20732ef9a7 in StartWatchdog /home/worker/workspace/build/src/toolkit/components/terminator/nsTerminator.cpp:395
[task 2017-03-15T22:19:38.205874Z] 22:19:38     INFO - GECKO(3007) |     #5 0x7f20732ef9a7 in Start /home/worker/workspace/build/src/toolkit/components/terminator/nsTerminator.cpp:359
[task 2017-03-15T22:19:38.206630Z] 22:19:38     INFO - GECKO(3007) |     #6 0x7f20732ef9a7 in mozilla::nsTerminator::Observe(nsISupports*, char const*, char16_t const*) /home/worker/workspace/build/src/toolkit/components/terminator/nsTerminator.cpp:450
[task 2017-03-15T22:19:38.212966Z] 22:19:38     INFO - GECKO(3007) |     #7 0x7f2069cae7cc in nsObserverList::NotifyObservers(nsISupports*, char const*, char16_t const*) /home/worker/workspace/build/src/xpcom/ds/nsObserverList.cpp:112:19
[task 2017-03-15T22:19:38.213844Z] 22:19:38     INFO - GECKO(3007) |     #8 0x7f2069cb21c4 in nsObserverService::NotifyObservers(nsISupports*, char const*, char16_t const*) /home/worker/workspace/build/src/xpcom/ds/nsObserverService.cpp:281:19
[task 2017-03-15T22:19:38.214580Z] 22:19:38     INFO - GECKO(3007) |     #9 0x7f2073160bb5 in nsAppStartup::Quit(unsigned int) /home/worker/workspace/build/src/toolkit/components/startup/nsAppStartup.cpp:461:19
[task 2017-03-15T22:19:38.215357Z] 22:19:38     INFO - GECKO(3007) |     #10 0x7f2069db0101 in NS_InvokeByIndex /home/worker/workspace/build/src/xpcom/reflect/xptcall/md/unix/xptcinvoke_asm_x86_64_unix.S:115
[task 2017-03-15T22:19:38.222120Z] 22:19:38     INFO - GECKO(3007) |     #11 0x7f206b59e4e4 in Invoke /home/worker/workspace/build/src/js/xpconnect/src/XPCWrappedNative.cpp:2010:12
[task 2017-03-15T22:19:38.222939Z] 22:19:38     INFO - GECKO(3007) |     #12 0x7f206b59e4e4 in Call /home/worker/workspace/build/src/js/xpconnect/src/XPCWrappedNative.cpp:1329
[task 2017-03-15T22:19:38.223706Z] 22:19:38     INFO - GECKO(3007) |     #13 0x7f206b59e4e4 in XPCWrappedNative::CallMethod(XPCCallContext&, XPCWrappedNative::CallMode) /home/worker/workspace/build/src/js/xpconnect/src/XPCWrappedNative.cpp:1296
[task 2017-03-15T22:19:38.224473Z] 22:19:38     INFO - GECKO(3007) |     #14 0x7f206b5a568c in XPC_WN_CallMethod(JSContext*, unsigned int, JS::Value*) /home/worker/workspace/build/src/js/xpconnect/src/XPCWrappedNativeJSOps.cpp:983:12
[task 2017-03-15T22:19:38.225030Z] 22:19:38     INFO - GECKO(3007) |     #15 0x28255d9dc623  (<unknown module>)
[task 2017-03-15T22:19:38.225677Z] 22:19:38     INFO - GECKO(3007) |     #16 0x621002af184f  (<unknown module>)
[task 2017-03-15T22:19:38.226199Z] 22:19:38     INFO - GECKO(3007) |     #17 0x28255d7968a5  (<unknown module>)
[task 2017-03-15T22:19:38.238518Z] 22:19:38     INFO - GECKO(3007) |     #18 0x7f20739ad852 in EnterBaseline(JSContext*, js::jit::EnterJitData&) /home/worker/workspace/build/src/js/src/jit/BaselineJIT.cpp:160:9
[task 2017-03-15T22:19:38.239525Z] 22:19:38     INFO - GECKO(3007) |     #19 0x7f20739ad0b7 in js::jit::EnterBaselineMethod(JSContext*, js::RunState&) /home/worker/workspace/build/src/js/src/jit/BaselineJIT.cpp:200:28
[task 2017-03-15T22:19:38.240191Z] 22:19:38     INFO - GECKO(3007) |     #20 0x7f2073741419 in js::RunScript(JSContext*, js::RunState&) /home/worker/workspace/build/src/js/src/vm/Interpreter.cpp:384:41
[task 2017-03-15T22:19:38.240494Z] 22:19:38     INFO - GECKO(3007) |     #21 0x7f20737724e6 in js::InternalCallOrConstruct(JSContext*, JS::CallArgs const&, js::MaybeConstruct) /home/worker/workspace/build/src/js/src/vm/Interpreter.cpp:466:15
[task 2017-03-15T22:19:38.241120Z] 22:19:38     INFO - GECKO(3007) |     #22 0x7f2073772cf2 in js::Call(JSContext*, JS::Handle<JS::Value>, JS::Handle<JS::Value>, js::AnyInvokeArgs const&, JS::MutableHandle<JS::Value>) /home/worker/workspace/build/src/js/src/vm/Interpreter.cpp:512:10
[task 2017-03-15T22:19:38.256168Z] 22:19:38     INFO - GECKO(3007) |     #23 0x7f20740e41c3 in JS_CallFunctionValue(JSContext*, JS::Handle<JSObject*>, JS::Handle<JS::Value>, JS::HandleValueArray const&, JS::MutableHandle<JS::Value>) /home/worker/workspace/build/src/js/src/jsapi.cpp:2828:12
[task 2017-03-15T22:19:38.256923Z] 22:19:38     INFO - GECKO(3007) |     #24 0x7f206c3d38c5 in nsFrameMessageManager::ReceiveMessage(nsISupports*, nsIFrameLoader*, bool, nsAString_internal const&, bool, mozilla::dom::ipc::StructuredCloneData*, mozilla::jsipc::CpowHolder*, nsIPrincipal*, nsTArray<mozilla::dom::ipc::StructuredCloneData>*) /home/worker/workspace/build/src/dom/base/nsFrameMessageManager.cpp:1091:14
[task 2017-03-15T22:19:38.257121Z] 22:19:38     INFO - GECKO(3007) |     #25 0x7f206c3d450a in nsFrameMessageManager::ReceiveMessage(nsISupports*, nsIFrameLoader*, bool, nsAString_internal const&, bool, mozilla::dom::ipc::StructuredCloneData*, mozilla::jsipc::CpowHolder*, nsIPrincipal*, nsTArray<mozilla::dom::ipc::StructuredCloneData>*) /home/worker/workspace/build/src/dom/base/nsFrameMessageManager.cpp:1121:29
[task 2017-03-15T22:19:38.257777Z] 22:19:38     INFO - GECKO(3007) |     #26 0x7f206c3d450a in nsFrameMessageManager::ReceiveMessage(nsISupports*, nsIFrameLoader*, bool, nsAString_internal const&, bool, mozilla::dom::ipc::StructuredCloneData*, mozilla::jsipc::CpowHolder*, nsIPrincipal*, nsTArray<mozilla::dom::ipc::StructuredCloneData>*) /home/worker/workspace/build/src/dom/base/nsFrameMessageManager.cpp:1121:29
[task 2017-03-15T22:19:38.258359Z] 22:19:38     INFO - GECKO(3007) |     #27 0x7f206c3d450a in nsFrameMessageManager::ReceiveMessage(nsISupports*, nsIFrameLoader*, bool, nsAString_internal const&, bool, mozilla::dom::ipc::StructuredCloneData*, mozilla::jsipc::CpowHolder*, nsIPrincipal*, nsTArray<mozilla::dom::ipc::StructuredCloneData>*) /home/worker/workspace/build/src/dom/base/nsFrameMessageManager.cpp:1121:29
[task 2017-03-15T22:19:38.258597Z] 22:19:38     INFO - GECKO(3007) |     #28 0x7f206c3d0fc9 in nsFrameMessageManager::ReceiveMessage(nsISupports*, nsIFrameLoader*, nsAString_internal const&, bool, mozilla::dom::ipc::StructuredCloneData*, mozilla::jsipc::CpowHolder*, nsIPrincipal*, nsTArray<mozilla::dom::ipc::StructuredCloneData>*) /home/worker/workspace/build/src/dom/base/nsFrameMessageManager.cpp:900:10
[task 2017-03-15T22:19:38.259146Z] 22:19:38     INFO - GECKO(3007) |     #29 0x7f206f6571ad in mozilla::dom::TabParent::ReceiveMessage(nsString const&, bool, mozilla::dom::ipc::StructuredCloneData*, mozilla::jsipc::CpowHolder*, nsIPrincipal*, nsTArray<mozilla::dom::ipc::StructuredCloneData>*) /home/worker/workspace/build/src/dom/ipc/TabParent.cpp:2411:14
[task 2017-03-15T22:19:38.259619Z] 22:19:38     INFO - GECKO(3007) |     #30 0x7f206f664d93 in mozilla::dom::TabParent::RecvAsyncMessage(nsString const&, nsTArray<mozilla::jsipc::CpowEntry>&&, IPC::Principal const&, mozilla::dom::ClonedMessageData const&) /home/worker/workspace/build/src/dom/ipc/TabParent.cpp:1604:8
[task 2017-03-15T22:19:38.274827Z] 22:19:38     INFO - GECKO(3007) |     #31 0x7f206b1478bf in mozilla::dom::PBrowserParent::OnMessageReceived(IPC::Message const&) /home/worker/workspace/build/src/obj-firefox/ipc/ipdl/PBrowserParent.cpp:1662:20
[task 2017-03-15T22:19:38.292249Z] 22:19:38     INFO - GECKO(3007) |     #32 0x7f206b2d5205 in mozilla::dom::PContentParent::OnMessageReceived(IPC::Message const&) /home/worker/workspace/build/src/obj-firefox/ipc/ipdl/PContentParent.cpp:2983:28
[task 2017-03-15T22:19:38.292427Z] 22:19:38     INFO - GECKO(3007) |     #33 0x7f206ab32320 in mozilla::ipc::MessageChannel::DispatchAsyncMessage(IPC::Message const&) /home/worker/workspace/build/src/ipc/glue/MessageChannel.cpp:1872:25
[task 2017-03-15T22:19:38.292521Z] 22:19:38     INFO - GECKO(3007) |     #34 0x7f206ab2ea72 in mozilla::ipc::MessageChannel::DispatchMessage(IPC::Message&&) /home/worker/workspace/build/src/ipc/glue/MessageChannel.cpp:1807:17
[task 2017-03-15T22:19:38.292678Z] 22:19:38     INFO - GECKO(3007) |     #35 0x7f206ab30f24 in mozilla::ipc::MessageChannel::RunMessage(mozilla::ipc::MessageChannel::MessageTask&) /home/worker/workspace/build/src/ipc/glue/MessageChannel.cpp:1680:5
[task 2017-03-15T22:19:38.292765Z] 22:19:38     INFO - GECKO(3007) |     #36 0x7f206ab3153e in mozilla::ipc::MessageChannel::MessageTask::Run() /home/worker/workspace/build/src/ipc/glue/MessageChannel.cpp:1713:15
[task 2017-03-15T22:19:38.297547Z] 22:19:38     INFO - GECKO(3007) |     #37 0x7f2069d9534c in nsThread::ProcessNextEvent(bool, bool*) /home/worker/workspace/build/src/xpcom/threads/nsThread.cpp:1269:14
[task 2017-03-15T22:19:38.298927Z] 22:19:38     INFO - GECKO(3007) |     #38 0x7f2069db0101 in NS_InvokeByIndex /home/worker/workspace/build/src/xpcom/reflect/xptcall/md/unix/xptcinvoke_asm_x86_64_unix.S:115
[task 2017-03-15T22:19:38.300389Z] 22:19:38     INFO - GECKO(3007) |     #39 0x7f206b59e4e4 in Invoke /home/worker/workspace/build/src/js/xpconnect/src/XPCWrappedNative.cpp:2010:12
[task 2017-03-15T22:19:38.301705Z] 22:19:38     INFO - GECKO(3007) |     #40 0x7f206b59e4e4 in Call /home/worker/workspace/build/src/js/xpconnect/src/XPCWrappedNative.cpp:1329
[task 2017-03-15T22:19:38.303080Z] 22:19:38     INFO - GECKO(3007) |     #41 0x7f206b59e4e4 in XPCWrappedNative::CallMethod(XPCCallContext&, XPCWrappedNative::CallMode) /home/worker/workspace/build/src/js/xpconnect/src/XPCWrappedNative.cpp:1296
[task 2017-03-15T22:19:38.304611Z] 22:19:38     INFO - GECKO(3007) |     #42 0x7f206b5a568c in XPC_WN_CallMethod(JSContext*, unsigned int, JS::Value*) /home/worker/workspace/build/src/js/xpconnect/src/XPCWrappedNativeJSOps.cpp:983:12
[task 2017-03-15T22:19:38.305332Z] 22:19:38     INFO - GECKO(3007) |     #43 0x7f2073772361 in CallJSNative /home/worker/workspace/build/src/js/src/jscntxtinlines.h:282:15
[task 2017-03-15T22:19:38.306698Z] 22:19:38     INFO - GECKO(3007) |     #44 0x7f2073772361 in js::InternalCallOrConstruct(JSContext*, JS::CallArgs const&, js::MaybeConstruct) /home/worker/workspace/build/src/js/src/vm/Interpreter.cpp:448
[task 2017-03-15T22:19:38.308027Z] 22:19:38     INFO - GECKO(3007) |     #45 0x7f207375add7 in CallFromStack /home/worker/workspace/build/src/js/src/vm/Interpreter.cpp:499:12
[task 2017-03-15T22:19:38.309654Z] 22:19:38     INFO - GECKO(3007) |     #46 0x7f207375add7 in Interpret(JSContext*, js::RunState&) /home/worker/workspace/build/src/js/src/vm/Interpreter.cpp:2954
[task 2017-03-15T22:19:38.313644Z] 22:19:38     INFO - GECKO(3007) |     #47 0x7f20737415ce in js::RunScript(JSContext*, js::RunState&) /home/worker/workspace/build/src/js/src/vm/Interpreter.cpp:394:12
[task 2017-03-15T22:19:38.314769Z] 22:19:38     INFO - GECKO(3007) |     #48 0x7f20737724e6 in js::InternalCallOrConstruct(JSContext*, JS::CallArgs const&, js::MaybeConstruct) /home/worker/workspace/build/src/js/src/vm/Interpreter.cpp:466:15
[task 2017-03-15T22:19:38.316170Z] 22:19:38     INFO - GECKO(3007) |     #49 0x7f2073772cf2 in js::Call(JSContext*, JS::Handle<JS::Value>, JS::Handle<JS::Value>, js::AnyInvokeArgs const&, JS::MutableHandle<JS::Value>) /home/worker/workspace/build/src/js/src/vm/Interpreter.cpp:512:10
[task 2017-03-15T22:19:38.321305Z] 22:19:38     INFO - GECKO(3007) |     #50 0x7f20740e41c3 in JS_CallFunctionValue(JSContext*, JS::Handle<JSObject*>, JS::Handle<JS::Value>, JS::HandleValueArray const&, JS::MutableHandle<JS::Value>) /home/worker/workspace/build/src/js/src/jsapi.cpp:2828:12
[task 2017-03-15T22:19:38.322444Z] 22:19:38     INFO - GECKO(3007) |     #51 0x7f206b58551f in nsXPCWrappedJSClass::CallMethod(nsXPCWrappedJS*, unsigned short, XPTMethodDescriptor const*, nsXPTCMiniVariant*) /home/worker/workspace/build/src/js/xpconnect/src/XPCWrappedJSClass.cpp:1214:23
[task 2017-03-15T22:19:38.323641Z] 22:19:38     INFO - GECKO(3007) |     #52 0x7f2069db17ea in PrepareAndDispatch /home/worker/workspace/build/src/xpcom/reflect/xptcall/md/unix/xptcstubs_x86_64_linux.cpp:120:28
[task 2017-03-15T22:19:38.324392Z] 22:19:38     INFO - GECKO(3007) |     #53 0x7f2069db07c6 in SharedStub (/home/worker/workspace/build/application/firefox/libxul.so+0x20897c6)
[task 2017-03-15T22:19:38.325009Z] 22:19:38     INFO - GECKO(3007) |     #54 0x7f206c446686 in nsGlobalWindow::CanClose() /home/worker/workspace/build/src/dom/base/nsGlobalWindow.cpp:8998:17
[task 2017-03-15T22:19:38.325731Z] 22:19:38     INFO - GECKO(3007) |     #55 0x7f20731603d9 in nsAppStartup::Quit(unsigned int) /home/worker/workspace/build/src/toolkit/components/startup/nsAppStartup.cpp:377:29
[task 2017-03-15T22:19:38.326745Z] 22:19:38     INFO - GECKO(3007) |     #56 0x7f2069db0101 in NS_InvokeByIndex /home/worker/workspace/build/src/xpcom/reflect/xptcall/md/unix/xptcinvoke_asm_x86_64_unix.S:115
[task 2017-03-15T22:19:38.327726Z] 22:19:38     INFO - GECKO(3007) |     #57 0x7f206b59e4e4 in Invoke /home/worker/workspace/build/src/js/xpconnect/src/XPCWrappedNative.cpp:2010:12
[task 2017-03-15T22:19:38.329401Z] 22:19:38     INFO - GECKO(3007) |     #58 0x7f206b59e4e4 in Call /home/worker/workspace/build/src/js/xpconnect/src/XPCWrappedNative.cpp:1329
[task 2017-03-15T22:19:38.330530Z] 22:19:38     INFO - GECKO(3007) |     #59 0x7f206b59e4e4 in XPCWrappedNative::CallMethod(XPCCallContext&, XPCWrappedNative::CallMode) /home/worker/workspace/build/src/js/xpconnect/src/XPCWrappedNative.cpp:1296
[task 2017-03-15T22:19:38.332891Z] 22:19:38     INFO - GECKO(3007) |     #60 0x7f206b5a568c in XPC_WN_CallMethod(JSContext*, unsigned int, JS::Value*) /home/worker/workspace/build/src/js/xpconnect/src/XPCWrappedNativeJSOps.cpp:983:12
[task 2017-03-15T22:19:38.333619Z] 22:19:38     INFO - GECKO(3007) |     #61 0x7f2073772361 in CallJSNative /home/worker/workspace/build/src/js/src/jscntxtinlines.h:282:15
[task 2017-03-15T22:19:38.334348Z] 22:19:38     INFO - GECKO(3007) |     #62 0x7f2073772361 in js::InternalCallOrConstruct(JSContext*, JS::CallArgs const&, js::MaybeConstruct) /home/worker/workspace/build/src/js/src/vm/Interpreter.cpp:448
[task 2017-03-15T22:19:38.335514Z] 22:19:38     INFO - GECKO(3007) |     #63 0x7f20739851ca in js::jit::DoCallFallback(JSContext*, js::jit::BaselineFrame*, js::jit::ICCall_Fallback*, unsigned int, JS::Value*, JS::MutableHandle<JS::Value>) /home/worker/workspace/build/src/js/src/jit/BaselineIC.cpp:2347:14
[task 2017-03-15T22:19:38.336759Z] 22:19:38     INFO - GECKO(3007) |     #64 0x28255d7a199f  (<unknown module>)
[task 2017-03-15T22:19:38.339030Z] 22:19:38     INFO - GECKO(3007) |     #65 0x62100401c15f  (<unknown module>)
[task 2017-03-15T22:19:38.339675Z] 22:19:38     INFO - GECKO(3007) |     #66 0x28255d7968a5  (<unknown module>)
[task 2017-03-15T22:19:38.340311Z] 22:19:38     INFO - GECKO(3007) |     #67 0x7f20739ad852 in EnterBaseline(JSContext*, js::jit::EnterJitData&) /home/worker/workspace/build/src/js/src/jit/BaselineJIT.cpp:160:9
[task 2017-03-15T22:19:38.340955Z] 22:19:38     INFO - GECKO(3007) |     #68 0x7f20739ad0b7 in js::jit::EnterBaselineMethod(JSContext*, js::RunState&) /home/worker/workspace/build/src/js/src/jit/BaselineJIT.cpp:200:28
[task 2017-03-15T22:19:38.341642Z] 22:19:38     INFO - GECKO(3007) |     #69 0x7f2073741419 in js::RunScript(JSContext*, js::RunState&) /home/worker/workspace/build/src/js/src/vm/Interpreter.cpp:384:41
[task 2017-03-15T22:19:38.343428Z] 22:19:38     INFO - GECKO(3007) |     #70 0x7f20737724e6 in js::InternalCallOrConstruct(JSContext*, JS::CallArgs const&, js::MaybeConstruct) /home/worker/workspace/build/src/js/src/vm/Interpreter.cpp:466:15
[task 2017-03-15T22:19:38.343833Z] 22:19:38     INFO - GECKO(3007) |     #71 0x7f2073772cf2 in js::Call(JSContext*, JS::Handle<JS::Value>, JS::Handle<JS::Value>, js::AnyInvokeArgs const&, JS::MutableHandle<JS::Value>) /home/worker/workspace/build/src/js/src/vm/Interpreter.cpp:512:10
[task 2017-03-15T22:19:38.344309Z] 22:19:38     INFO - GECKO(3007) |     #72 0x7f20740e41c3 in JS_CallFunctionValue(JSContext*, JS::Handle<JSObject*>, JS::Handle<JS::Value>, JS::HandleValueArray const&, JS::MutableHandle<JS::Value>) /home/worker/workspace/build/src/js/src/jsapi.cpp:2828:12
[task 2017-03-15T22:19:38.344764Z] 22:19:38     INFO - GECKO(3007) |     #73 0x7f206c3d38c5 in nsFrameMessageManager::ReceiveMessage(nsISupports*, nsIFrameLoader*, bool, nsAString_internal const&, bool, mozilla::dom::ipc::StructuredCloneData*, mozilla::jsipc::CpowHolder*, nsIPrincipal*, nsTArray<mozilla::dom::ipc::StructuredCloneData>*) /home/worker/workspace/build/src/dom/base/nsFrameMessageManager.cpp:1091:14
[task 2017-03-15T22:19:38.345212Z] 22:19:38     INFO - GECKO(3007) |     #74 0x7f206c3d450a in nsFrameMessageManager::ReceiveMessage(nsISupports*, nsIFrameLoader*, bool, nsAString_internal const&, bool, mozilla::dom::ipc::StructuredCloneData*, mozilla::jsipc::CpowHolder*, nsIPrincipal*, nsTArray<mozilla::dom::ipc::StructuredCloneData>*) /home/worker/workspace/build/src/dom/base/nsFrameMessageManager.cpp:1121:29
[task 2017-03-15T22:19:38.345983Z] 22:19:38     INFO - GECKO(3007) |     #75 0x7f206c3d450a in nsFrameMessageManager::ReceiveMessage(nsISupports*, nsIFrameLoader*, bool, nsAString_internal const&, bool, mozilla::dom::ipc::StructuredCloneData*, mozilla::jsipc::CpowHolder*, nsIPrincipal*, nsTArray<mozilla::dom::ipc::StructuredCloneData>*) /home/worker/workspace/build/src/dom/base/nsFrameMessageManager.cpp:1121:29
[task 2017-03-15T22:19:38.346722Z] 22:19:38     INFO - GECKO(3007) |     #76 0x7f206c3d450a in nsFrameMessageManager::ReceiveMessage(nsISupports*, nsIFrameLoader*, bool, nsAString_internal const&, bool, mozilla::dom::ipc::StructuredCloneData*, mozilla::jsipc::CpowHolder*, nsIPrincipal*, nsTArray<mozilla::dom::ipc::StructuredCloneData>*) /home/worker/workspace/build/src/dom/base/nsFrameMessageManager.cpp:1121:29
[task 2017-03-15T22:19:38.347468Z] 22:19:38     INFO - GECKO(3007) |     #77 0x7f206c3d0fc9 in nsFrameMessageManager::ReceiveMessage(nsISupports*, nsIFrameLoader*, nsAString_internal const&, bool, mozilla::dom::ipc::StructuredCloneData*, mozilla::jsipc::CpowHolder*, nsIPrincipal*, nsTArray<mozilla::dom::ipc::StructuredCloneData>*) /home/worker/workspace/build/src/dom/base/nsFrameMessageManager.cpp:900:10
[task 2017-03-15T22:19:38.348291Z] 22:19:38     INFO - GECKO(3007) |     #78 0x7f206f6571ad in mozilla::dom::TabParent::ReceiveMessage(nsString const&, bool, mozilla::dom::ipc::StructuredCloneData*, mozilla::jsipc::CpowHolder*, nsIPrincipal*, nsTArray<mozilla::dom::ipc::StructuredCloneData>*) /home/worker/workspace/build/src/dom/ipc/TabParent.cpp:2411:14
[task 2017-03-15T22:19:38.349057Z] 22:19:38     INFO - GECKO(3007) |     #79 0x7f206f664d93 in mozilla::dom::TabParent::RecvAsyncMessage(nsString const&, nsTArray<mozilla::jsipc::CpowEntry>&&, IPC::Principal const&, mozilla::dom::ClonedMessageData const&) /home/worker/workspace/build/src/dom/ipc/TabParent.cpp:1604:8
[task 2017-03-15T22:19:38.349796Z] 22:19:38     INFO - GECKO(3007) |     #80 0x7f206b1478bf in mozilla::dom::PBrowserParent::OnMessageReceived(IPC::Message const&) /home/worker/workspace/build/src/obj-firefox/ipc/ipdl/PBrowserParent.cpp:1662:20
[task 2017-03-15T22:19:38.350497Z] 22:19:38     INFO - GECKO(3007) |     #81 0x7f206b2d5205 in mozilla::dom::PContentParent::OnMessageReceived(IPC::Message const&) /home/worker/workspace/build/src/obj-firefox/ipc/ipdl/PContentParent.cpp:2983:28
[task 2017-03-15T22:19:38.351176Z] 22:19:38     INFO - GECKO(3007) |     #82 0x7f206ab32320 in mozilla::ipc::MessageChannel::DispatchAsyncMessage(IPC::Message const&) /home/worker/workspace/build/src/ipc/glue/MessageChannel.cpp:1872:25
[task 2017-03-15T22:19:38.351724Z] 22:19:38     INFO - GECKO(3007) |     #83 0x7f206ab2ea72 in mozilla::ipc::MessageChannel::DispatchMessage(IPC::Message&&) /home/worker/workspace/build/src/ipc/glue/MessageChannel.cpp:1807:17
[task 2017-03-15T22:19:38.352219Z] 22:19:38     INFO - GECKO(3007) |     #84 0x7f206ab30f24 in mozilla::ipc::MessageChannel::RunMessage(mozilla::ipc::MessageChannel::MessageTask&) /home/worker/workspace/build/src/ipc/glue/MessageChannel.cpp:1680:5
[task 2017-03-15T22:19:38.352663Z] 22:19:38     INFO - GECKO(3007) |     #85 0x7f206ab3153e in mozilla::ipc::MessageChannel::MessageTask::Run() /home/worker/workspace/build/src/ipc/glue/MessageChannel.cpp:1713:15
[task 2017-03-15T22:19:38.353101Z] 22:19:38     INFO - GECKO(3007) |     #86 0x7f2069d9534c in nsThread::ProcessNextEvent(bool, bool*) /home/worker/workspace/build/src/xpcom/threads/nsThread.cpp:1269:14
[task 2017-03-15T22:19:38.353573Z] 22:19:38     INFO - GECKO(3007) |     #87 0x7f2069d91c78 in NS_ProcessNextEvent(nsIThread*, bool) /home/worker/workspace/build/src/xpcom/threads/nsThreadUtils.cpp:389:10
[task 2017-03-15T22:19:38.354076Z] 22:19:38     INFO - GECKO(3007) |     #88 0x7f206ab39de1 in mozilla::ipc::MessagePump::Run(base::MessagePump::Delegate*) /home/worker/workspace/build/src/ipc/glue/MessagePump.cpp:96:21
[task 2017-03-15T22:19:38.354511Z] 22:19:38     INFO - GECKO(3007) |     #89 0x7f206aa9a940 in RunInternal /home/worker/workspace/build/src/ipc/chromium/src/base/message_loop.cc:238:10
[task 2017-03-15T22:19:38.354946Z] 22:19:38     INFO - GECKO(3007) |     #90 0x7f206aa9a940 in RunHandler /home/worker/workspace/build/src/ipc/chromium/src/base/message_loop.cc:231
[task 2017-03-15T22:19:38.355402Z] 22:19:38     INFO - GECKO(3007) |     #91 0x7f206aa9a940 in MessageLoop::Run() /home/worker/workspace/build/src/ipc/chromium/src/base/message_loop.cc:211
[task 2017-03-15T22:19:38.355871Z] 22:19:38     INFO - GECKO(3007) |     #92 0x7f206fcfe9af in nsBaseAppShell::Run() /home/worker/workspace/build/src/widget/nsBaseAppShell.cpp:156:27
[task 2017-03-15T22:19:38.356359Z] 22:19:38     INFO - GECKO(3007) |     #93 0x7f207315ff01 in nsAppStartup::Run() /home/worker/workspace/build/src/toolkit/components/startup/nsAppStartup.cpp:283:30
[task 2017-03-15T22:19:38.356794Z] 22:19:38     INFO - GECKO(3007) |     #94 0x7f207332040a in XREMain::XRE_mainRun() /home/worker/workspace/build/src/toolkit/xre/nsAppRunner.cpp:4492:22
[task 2017-03-15T22:19:38.357247Z] 22:19:38     INFO - GECKO(3007) |     #95 0x7f2073321e93 in XREMain::XRE_main(int, char**, mozilla::BootstrapConfig const&) /home/worker/workspace/build/src/toolkit/xre/nsAppRunner.cpp:4670:8
[task 2017-03-15T22:19:38.357809Z] 22:19:38     INFO - GECKO(3007) |     #96 0x7f207332321c in XRE_main(int, char**, mozilla::BootstrapConfig const&) /home/worker/workspace/build/src/toolkit/xre/nsAppRunner.cpp:4761:21
[task 2017-03-15T22:19:38.358318Z] 22:19:38     INFO - GECKO(3007) |     #97 0x4eb2b3 in do_main /home/worker/workspace/build/src/browser/app/nsBrowserApp.cpp:236:22
[task 2017-03-15T22:19:38.358792Z] 22:19:38     INFO - GECKO(3007) |     #98 0x4eb2b3 in main /home/worker/workspace/build/src/browser/app/nsBrowserApp.cpp:307
[task 2017-03-15T22:19:38.359294Z] 22:19:38     INFO - GECKO(3007) |     #99 0x7f20828f57ec in __libc_start_main /build/eglibc-oqps9y/eglibc-2.15/csu/libc-start.c:226
[task 2017-03-15T22:19:38.359763Z] 22:19:38     INFO - GECKO(3007) | ==3007==ABORTING
In https://public-artifacts.taskcluster.net/RtcFwht_R4G9lPvSAJtqaQ/0/public/logs/live_backing.log

[task 2017-03-15T21:47:05.079553Z] 21:47:05     INFO - GECKO(2806) | Hit MOZ_CRASH(Shutdown too long, probably frozen, causing a crash.) at /home/worker/workspace/build/src/toolkit/components/terminator/nsTerminator.cpp:159
[task 2017-03-15T21:47:05.081232Z] 21:47:05     INFO - GECKO(2806) | #01: ???[/home/worker/workspace/build/application/firefox/libnspr4.so +0x27cd0]
[task 2017-03-15T21:47:05.082298Z] 21:47:05     INFO - GECKO(2806) | #02: ???[/lib/x86_64-linux-gnu/libpthread.so.0 +0x76ba]
[task 2017-03-15T21:47:05.083510Z] 21:47:05     INFO - GECKO(2806) | #03: clone[/lib/x86_64-linux-gnu/libc.so.6 +0x10682d]
[task 2017-03-15T21:47:05.084731Z] 21:47:05     INFO - GECKO(2806) | #04: ??? (???:???)
[task 2017-03-15T21:47:05.086207Z] 21:47:05     INFO - GECKO(2806) | ExceptionHandler::GenerateDump cloned child 3054
[task 2017-03-15T21:47:05.087032Z] 21:47:05     INFO - GECKO(2806) | ExceptionHandler::SendContinueSignalToChild sent continue signal to child
[task 2017-03-15T21:47:05.088157Z] 21:47:05     INFO - GECKO(2806) | ExceptionHandler::WaitForContinueSignal waiting for continue signal...
[task 2017-03-15T21:47:05.263080Z] 21:47:05     INFO - GECKO(2806) | ###!!! [Child][MessageChannel] Error: (msgtype=0x4600F8,name=PContent::Msg_AccumulateChildKeyedHistograms) Closed channel: cannot send/recv
That nsTerminator was added in bug 1038342: it looks like it is crashing to tell us that shutdown is hung.
Summary: Intermittent hang in linux32/64 mochitest-media-e10s jobs → Intermittent shutdown hang in linux32/64 mochitest-media-e10s jobs
:Yoric - I like the idea of nsTerminator, but I'm frustrated by the way it is working here. It seems Firefox remains hung after the crash, so the test job eventually times out and we don't get a full crash report. Is that expected? Any idea what is causing this?
Flags: needinfo?(dteller)
Well, the only thing that the terminator tells you is that there is a timeout during shutdown, i.e. that shutdown has frozen for at least one minute. It doesn't attempt to investigate further.
Flags: needinfo?(dteller)
But shouldn't the terminator actually terminate the process? I suppose that's hard to guarantee.
jmaher - Any thoughts on how to proceed here? It appears there is a low-frequency shutdown hang affecting linux, e10s mochitest-media jobs. There's lots of telemetry logging during these shutdowns, but telemetry folks says that's expected under the circumstances. I don't see any other clues in the logging.

It is low enough in frequency that I am tempted to ignore it, but it is likely a product weakness, perhaps a product problem...would be nice to have the cause found and fixed.
Flags: needinfo?(jmaher)
I suspect that it's killing the process for real but that another process is somehow keeping the test alive. Hard to be sure now that there are so many processes around.

Geoff, is this only in ASAN builds or do you see the problem in other builds, too?
thanks for investigating this :gbrown- it looks like this is linux 32/64 debug and asan where the problem is seem primarily.

One sanity check would be to push to try with the telemetry turned off and see if this affects the failure rate.  This might be a good candidate for 'rr' since we are dealing with Linux.
Flags: needinfo?(jmaher)
(In reply to David Teller [:Yoric] (please use "needinfo") from comment #13)
> I suspect that it's killing the process for real but that another process is
> somehow keeping the test alive. Hard to be sure now that there are so many
> processes around.
> 
> Geoff, is this only in ASAN builds or do you see the problem in other
> builds, too?

I see it in non-asan builds too. For instance, this is Linux64/Debug: https://public-artifacts.taskcluster.net/RtcFwht_R4G9lPvSAJtqaQ/0/public/logs/live_backing.log

Notice we MOZ_CRASH in pid 2806:

[task 2017-03-15T21:47:05.079553Z] 21:47:05     INFO - GECKO(2806) | Hit MOZ_CRASH(Shutdown too long, probably frozen, causing a crash.) at /home/worker/workspace/build/src/toolkit/components/terminator/nsTerminator.cpp:159
[task 2017-03-15T21:47:05.081232Z] 21:47:05     INFO - GECKO(2806) | #01: ???[/home/worker/workspace/build/application/firefox/libnspr4.so +0x27cd0]
[task 2017-03-15T21:47:05.082298Z] 21:47:05     INFO - GECKO(2806) | #02: ???[/lib/x86_64-linux-gnu/libpthread.so.0 +0x76ba]
[task 2017-03-15T21:47:05.083510Z] 21:47:05     INFO - GECKO(2806) | #03: clone[/lib/x86_64-linux-gnu/libc.so.6 +0x10682d]
[task 2017-03-15T21:47:05.084731Z] 21:47:05     INFO - GECKO(2806) | #04: ??? (???:???)
[task 2017-03-15T21:47:05.086207Z] 21:47:05     INFO - GECKO(2806) | ExceptionHandler::GenerateDump cloned child 3054
[task 2017-03-15T21:47:05.087032Z] 21:47:05     INFO - GECKO(2806) | ExceptionHandler::SendContinueSignalToChild sent continue signal to child
[task 2017-03-15T21:47:05.088157Z] 21:47:05     INFO - GECKO(2806) | ExceptionHandler::WaitForContinueSignal waiting for continue signal...

but apparently log from the same pid over an hour later

....
[task 2017-03-15T23:03:44.274060Z] 23:03:44     INFO - GECKO(2806) | [Child 2871] WARNING: MsgDropped in ContentChild: file /home/worker/workspace/build/src/dom/ipc/ContentChild.cpp, line 2036
I'm not sure what else to do here. This hang continues to appear in bug 1204281, but happens too infrequently to debug effectively.
Assignee: gbrown → nobody
(In reply to Geoff Brown [:gbrown] from comment #11)
> But shouldn't the terminator actually terminate the process? I suppose
> that's hard to guarantee.

So the terminator is just using MOZ_CRASH, which is supposed to be a reliable crash. In the log from comment 7 I see:
[task 2017-03-15T21:47:05.087032Z] 21:47:05     INFO - GECKO(2806) | ExceptionHandler::SendContinueSignalToChild sent continue signal to child
[task 2017-03-15T21:47:05.088157Z] 21:47:05     INFO - GECKO(2806) | ExceptionHandler::WaitForContinueSignal waiting for continue signal...

Which is some debug logging we left in the Breakpad code, so the Breakpad signal handler is being triggered here, which is expected, since that's how we write a minidump from a crash. Those log prints happen here:
https://dxr.mozilla.org/mozilla-central/rev/05bfa2831c0ba4a26fa72328ffe6a99aba9c356a/toolkit/crashreporter/breakpad-client/linux/handler/exception_handler.cc#574

I don't think we have any logging *after* that unfortunately, so it's hard to tell if it is succeeding and something else is failing, or if it's the crash reporter code that's hanging.

None of that explains why we'd see the same thing in ASAN builds, though, since those are --disable-crashreporter.

I think we probably have a two-tiered failure here:
1) There's some sort of shutdown hang in these tests. Could be a bug in the tests or in the media code.
2) Additionally, there's some failure that causes Breakpad to hang or fail writing a minidump when the terminator tries to crash the process, which makes it hang more (and not produce a useful stack).
https://treeherder.mozilla.org/#/jobs?repo=try&revision=7b6213b4478f2eb07fab61fa9ea6e17709de982f suggests that the shutdown hang is "caused by" something in the range of tests { dom/media/tests/mochitest/test_getUserMedia_gumWithinGum.html .. dom/media/tests/mochitest/test_peerConnection_multiple_captureStream_canvas_2d.html } -- a range of about 50 tests.
Also, logging in that try push suggests that the hang occurs during ShutdownXPCOM().
Flags: needinfo?(gbrown)
Trying to reduce range of tests known to trigger the hang...

https://treeherder.mozilla.org/#/jobs?repo=try&revision=82076cc5988c77c10a8525c93eb23579248568bf
In my next iteration, with more tests skipped, I got a frequent crash, on shutdown:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=da9268f11c1bc2920a2f65b6c255bcc135009840

that seems interesting and unexpected - I'll file a separate bug for that.
Depends on: 1357466
Flags: needinfo?(gbrown)
no real progress on bug 1357466, these long running issues are continuing to linger week after week.
We have similar issues with Marionette jobs during shutdown. So it's not only media-tests. See bug 1368787.
:gbrown, any chance you could analyze if we are having similar failures here?
Flags: needinfo?(gbrown)
I can't think of a way to further verify the similarity of this bug to bug 1368787.

According to comment 2, the shutdown log spew only suggests "the tests are finished but, somehow, the content process isn't shutting down?" -- the content process is hung on shutdown. I don't know the cause of the hang in this bug (crash reports are not helpful, behavior morphs when test case is modified -- bug 1357466). Bug 1368787 also looks like a shutdown hang, but doesn't have a proper stack (for other reasons) and only has one failure report, so it's likely not reproducible.
Flags: needinfo?(gbrown)
this is marked stockwell needswork, but I don't see orange factor comments in here.  I assume this is related to bug 1204281
Failures continue in bug 1204281 and this remains an important issue, but *this* bug is not an intermittent-failure, so...
Whiteboard: [stockwell needswork]
Bug 1374164 is a shutdown crash following the same tests -- very likely related.
See Also: → 1374164
(In reply to Ted Mielczarek [:ted.mielczarek] (Away Sep 23rd-Oct 1st) from comment #17)
> (In reply to Geoff Brown [:gbrown] from comment #11)
> > But shouldn't the terminator actually terminate the process? I suppose
> > that's hard to guarantee.
> 
> So the terminator is just using MOZ_CRASH, which is supposed to be a
> reliable crash. In the log from comment 7 I see:
> [task 2017-03-15T21:47:05.087032Z] 21:47:05     INFO - GECKO(2806) |
> ExceptionHandler::SendContinueSignalToChild sent continue signal to child
> [task 2017-03-15T21:47:05.088157Z] 21:47:05     INFO - GECKO(2806) |
> ExceptionHandler::WaitForContinueSignal waiting for continue signal...
> 
> [...]
> 
> I think we probably have a two-tiered failure here:
> 1) There's some sort of shutdown hang in these tests. Could be a bug in the
> tests or in the media code.
> 2) Additionally, there's some failure that causes Breakpad to hang or fail
> writing a minidump when the terminator tries to crash the process, which
> makes it hang more (and not produce a useful stack).

Generating a shutdown hang in the content process locally, the crash reporter
runs as expected and terminates.

The second failure is rather that the terminator is terminating only the
parent process - bug 1404190.

The test harness waits for both browser and content process output to finish.
Disabling the first path at
http://searchfox.org/mozilla-central/rev/f54c1723befe6bcc7229f005217d5c681128fcad/testing/mozbase/mozprocess/mozprocess/processhandler.py#843
and instead using self.proc.wait() allows the harness to proceed and kill the
child process with SIGKILL.
See Also: → 1404190
Blocks: 1411358
Waiting for updates on bug 1357466.
Flags: needinfo?(gbrown)
A recent failure:

https://treeherder.mozilla.org/logviewer.html#?repo=try&job_id=143637687&lineNumber=58678

[task 2017-11-10T10:46:56.950Z] 10:46:56     INFO - TEST-START | Shutdown
[task 2017-11-10T10:46:56.951Z] 10:46:56     INFO - Passed:  25205
[task 2017-11-10T10:46:56.951Z] 10:46:56     INFO - Failed:  0
[task 2017-11-10T10:46:56.952Z] 10:46:56     INFO - Todo:    1580
[task 2017-11-10T10:46:56.952Z] 10:46:56     INFO - Mode:    e10s
[task 2017-11-10T10:46:56.952Z] 10:46:56     INFO - Slowest: 37648ms - /tests/dom/media/tests/mochitest/test_peerConnection_checkPacketDumpHook.html
[task 2017-11-10T10:46:56.969Z] 10:46:56     INFO - SimpleTest FINISHED
[task 2017-11-10T10:46:56.969Z] 10:46:56     INFO - TEST-INFO | Ran 1 Loops
[task 2017-11-10T10:46:56.970Z] 10:46:56     INFO - SimpleTest FINISHED
[task 2017-11-10T10:46:57.010Z] 10:46:57     INFO - GECKO(3159) | ++DOMWINDOW == 8 (0x7fde436a3800) [pid = 3213] [serial = 446] [outer = 0x7fde65034000]
[task 2017-11-10T10:46:57.252Z] 10:46:57     INFO - GECKO(3159) | --DOCSHELL 0x7fde569cc800 == 2 [pid = 3213] [id = {8dd1ca39-84e2-401f-8a95-ade47d5040d5}]
[task 2017-11-10T10:46:57.856Z] 10:46:57     INFO - GECKO(3159) | ###!!! [Child][MessageChannel] Error: (msgtype=0x760007,name=PTCPSocket::Msg_Close) Closed channel: cannot send/recv
[task 2017-11-10T10:46:57.861Z] 10:46:57     INFO - GECKO(3159) | [Child 3213, Main Thread] WARNING: MsgDropped in ContentChild: file /builds/worker/workspace/build/src/dom/ipc/ContentChild.cpp, line 2365
[task 2017-11-10T10:46:57.862Z] 10:46:57     INFO - GECKO(3159) | ###!!! [Child][MessageChannel] Error: (msgtype=0x760007,name=PTCPSocket::Msg_Close) Closed channel: cannot send/recv
[task 2017-11-10T10:46:57.868Z] 10:46:57     INFO - GECKO(3159) | [Child 3213, Main Thread] WARNING: MsgDropped in ContentChild: file /builds/worker/workspace/build/src/dom/ipc/ContentChild.cpp, line 2365
[task 2017-11-10T10:46:57.868Z] 10:46:57     INFO - GECKO(3159) | --DOCSHELL 0x7f5c7a5a0800 == 3 [pid = 3159] [id = {fef5d6b9-6636-40f4-a419-f706c9c78050}]
[task 2017-11-10T10:46:57.868Z] 10:46:57     INFO - GECKO(3159) | ###!!! [Child][MessageChannel] Error: (msgtype=0x760007,name=PTCPSocket::Msg_Close) Closed channel: cannot send/recv
[task 2017-11-10T10:46:57.868Z] 10:46:57     INFO - GECKO(3159) | [Child 3213, Main Thread] WARNING: MsgDropped in ContentChild: file /builds/worker/workspace/build/src/dom/ipc/ContentChild.cpp, line 2365
[task 2017-11-10T10:46:57.868Z] 10:46:57     INFO - GECKO(3159) | ###!!! [Child][MessageChannel] Error: (msgtype=0x760007,name=PTCPSocket::Msg_Close) Closed channel: cannot send/recv
[task 2017-11-10T10:46:57.868Z] 10:46:57     INFO - GECKO(3159) | [Child 3213, Main Thread] WARNING: MsgDropped in ContentChild: file /builds/worker/workspace/build/src/dom/ipc/ContentChild.cpp, line 2365
[task 2017-11-10T10:46:57.869Z] 10:46:57     INFO - GECKO(3159) | ###!!! [Child][MessageChannel] Error: (msgtype=0x760007,name=PTCPSocket::Msg_Close) Closed channel: cannot send/recv
[task 2017-11-10T10:46:57.869Z] 10:46:57     INFO - GECKO(3159) | [Child 3213, Main Thread] WARNING: MsgDropped in ContentChild: file /builds/worker/workspace/build/src/dom/ipc/ContentChild.cpp, line 2365
[task 2017-11-10T10:46:57.869Z] 10:46:57     INFO - GECKO(3159) | ###!!! [Child][MessageChannel] Error: (msgtype=0x760007,name=PTCPSocket::Msg_Close) Closed channel: cannot send/recv
[task 2017-11-10T10:46:57.869Z] 10:46:57     INFO - GECKO(3159) | [Child 3213, Main Thread] WARNING: MsgDropped in ContentChild: file /builds/worker/workspace/build/src/dom/ipc/ContentChild.cpp, line 2365
[task 2017-11-10T10:46:57.869Z] 10:46:57     INFO - GECKO(3159) | ###!!! [Child][MessageChannel] Error: (msgtype=0x760007,name=PTCPSocket::Msg_Close) Closed channel: cannot send/recv
[task 2017-11-10T10:46:57.869Z] 10:46:57     INFO - GECKO(3159) | [Child 3213, Main Thread] WARNING: MsgDropped in ContentChild: file /builds/worker/workspace/build/src/dom/ipc/ContentChild.cpp, line 2365
[task 2017-11-10T10:46:57.879Z] 10:46:57     INFO - GECKO(3159) | [Main Thread]: D/DataChannel Shutting down SCTP
[task 2017-11-10T10:46:57.881Z] 10:46:57     INFO - GECKO(3159) | --DOCSHELL 0x7f5c63eaa000 == 2 [pid = 3159] [id = {ce67000e-8c6a-4195-b052-854c42684b4e}]
[task 2017-11-10T10:46:57.883Z] 10:46:57     INFO - GECKO(3159) | ###!!! [Parent][RunMessage] Error: Channel closing: too late to send/recv, messages will be lost
[task 2017-11-10T10:46:57.890Z] 10:46:57     INFO - GECKO(3159) | --DOCSHELL 0x7f5c779ae000 == 1 [pid = 3159] [id = {7c042e0e-31fc-40c3-855a-029e2d864e4d}]
[task 2017-11-10T10:46:57.899Z] 10:46:57     INFO - GECKO(3159) | --DOCSHELL 0x7f5c7c480000 == 0 [pid = 3159] [id = {d1838a3a-d154-4b7c-8811-8fe5bea2b9f2}]
[task 2017-11-10T10:46:57.899Z] 10:46:57     INFO - GECKO(3159) | --DOMWINDOW == 11 (0x7f5c671ce800) [pid = 3159] [serial = 255] [outer = (nil)] [url = about:blank]
[task 2017-11-10T10:46:57.899Z] 10:46:57     INFO - GECKO(3159) | --DOMWINDOW == 10 (0x7f5c7360c800) [pid = 3159] [serial = 253] [outer = (nil)] [url = about:blank]
[task 2017-11-10T10:46:57.915Z] 10:46:57     INFO - GECKO(3159) | [Child 3213, Main Thread] WARNING: NS_ENSURE_TRUE(maybeContext) failed: file /builds/worker/workspace/build/src/xpcom/threads/nsThread.cpp, line 795
[task 2017-11-10T10:46:57.916Z] 10:46:57     INFO - GECKO(3159) | [Child 3213, Main Thread] WARNING: NS_ENSURE_TRUE(maybeContext) failed: file /builds/worker/workspace/build/src/xpcom/threads/nsThread.cpp, line 795
[task 2017-11-10T10:46:57.937Z] 10:46:57     INFO - GECKO(3159) | ###!!! [Parent][RunMessage] Error: Channel closing: too late to send/recv, messages will be lost
[task 2017-11-10T10:46:57.938Z] 10:46:57     INFO - GECKO(3159) | ###!!! [Parent][RunMessage] Error: Channel closing: too late to send/recv, messages will be lost
[task 2017-11-10T10:46:57.939Z] 10:46:57     INFO - GECKO(3159) | [Child 3247, Main Thread] WARNING: NS_ENSURE_TRUE(maybeContext) failed: file /builds/worker/workspace/build/src/xpcom/threads/nsThread.cpp, line 795
[task 2017-11-10T10:46:57.943Z] 10:46:57     INFO - GECKO(3159) | [Child 3247, Main Thread] WARNING: NS_ENSURE_TRUE(maybeContext) failed: file /builds/worker/workspace/build/src/xpcom/threads/nsThread.cpp, line 795
[task 2017-11-10T10:46:57.947Z] 10:46:57     INFO - GECKO(3159) | [Child 3247, Main Thread] WARNING: '!gThread', file /builds/worker/workspace/build/src/xpcom/threads/nsTimerImpl.cpp, line 399
[task 2017-11-10T10:46:58.113Z] 10:46:58     INFO - GECKO(3159) | [Child 3247, Main Thread] WARNING: '!gThread', file /builds/worker/workspace/build/src/xpcom/threads/nsTimerImpl.cpp, line 399
[task 2017-11-10T10:46:58.114Z] 10:46:58     INFO - GECKO(3159) | [Child 3247, Main Thread] WARNING: '!gThread', file /builds/worker/workspace/build/src/xpcom/threads/nsTimerImpl.cpp, line 399
[task 2017-11-10T10:46:58.133Z] 10:46:58     INFO - GECKO(3159) | [Child 3247, Main Thread] WARNING: '!gThread', file /builds/worker/workspace/build/src/xpcom/threads/nsTimerImpl.cpp, line 399
[task 2017-11-10T10:46:58.170Z] 10:46:58     INFO - GECKO(3159) | nsStringStats
[task 2017-11-10T10:46:58.170Z] 10:46:58     INFO - GECKO(3159) |  => mAllocCount:          19707
[task 2017-11-10T10:46:58.171Z] 10:46:58     INFO - GECKO(3159) |  => mReallocCount:           56
[task 2017-11-10T10:46:58.172Z] 10:46:58     INFO - GECKO(3159) |  => mFreeCount:           19707
[task 2017-11-10T10:46:58.172Z] 10:46:58     INFO - GECKO(3159) |  => mShareCount:           7163
[task 2017-11-10T10:46:58.172Z] 10:46:58     INFO - GECKO(3159) |  => mAdoptCount:            325
[task 2017-11-10T10:46:58.173Z] 10:46:58     INFO - GECKO(3159) |  => mAdoptFreeCount:        325
[task 2017-11-10T10:46:58.173Z] 10:46:58     INFO - GECKO(3159) |  => Process ID: 3247, Thread ID: 140002545702720
[task 2017-11-10T10:46:58.255Z] 10:46:58     INFO - GECKO(3159) | 1510310818249	Marionette	DEBUG	Received observer notification "xpcom-shutdown"
[task 2017-11-10T10:46:58.257Z] 10:46:58     INFO - GECKO(3159) | 1510310818250	Marionette	DEBUG	New connections will no longer be accepted
[task 2017-11-10T10:46:59.490Z] 10:46:59     INFO - GECKO(3159) | ###!!! [Child][MessageChannel] Error: (msgtype=0x2C00F0,name=PContent::Msg_AccumulateChildHistograms) Closed channel: cannot send/recv
[task 2017-11-10T10:46:59.492Z] 10:46:59     INFO - GECKO(3159) | [Child 3213, Main Thread] WARNING: MsgDropped in ContentChild: file /builds/worker/workspace/build/src/dom/ipc/ContentChild.cpp, line 2365
[task 2017-11-10T10:46:59.493Z] 10:46:59     INFO - GECKO(3159) | [Child 3213, Main Thread] WARNING: '!ipcActor->SendAccumulateChildHistograms(histogramsToSend)', file /builds/worker/workspace/build/src/toolkit/components/telemetry/ipc/TelemetryIPCAccumulator.cpp, line 296
[task 2017-11-10T10:46:59.494Z] 10:46:59     INFO - GECKO(3159) | ###!!! [Child][MessageChannel] Error: (msgtype=0x2C00F1,name=PContent::Msg_AccumulateChildKeyedHistograms) Closed channel: cannot send/recv
[task 2017-11-10T10:46:59.494Z] 10:46:59     INFO - GECKO(3159) | [Child 3213, Main Thread] WARNING: MsgDropped in ContentChild: file /builds/worker/workspace/build/src/dom/ipc/ContentChild.cpp, line 2365
...
[task 2017-11-10T10:48:01.290Z] 10:48:01     INFO - GECKO(3159) | Hit MOZ_CRASH(Shutdown too long, probably frozen, causing a crash.) at /builds/worker/workspace/build/src/toolkit/components/terminator/nsTerminator.cpp:163
[task 2017-11-10T10:48:01.387Z] 10:48:01     INFO - GECKO(3159) | #01: _pt_root [nsprpub/pr/src/pthreads/ptthread.c:219]
[task 2017-11-10T10:48:01.388Z] 10:48:01     INFO - 
[task 2017-11-10T10:48:01.388Z] 10:48:01     INFO - GECKO(3159) | #02: libpthread.so.0 + 0x76ba
[task 2017-11-10T10:48:01.389Z] 10:48:01     INFO - 
[task 2017-11-10T10:48:01.390Z] 10:48:01     INFO - GECKO(3159) | #03: libc.so.6 + 0x1073dd
[task 2017-11-10T10:48:01.390Z] 10:48:01     INFO - 
[task 2017-11-10T10:48:01.390Z] 10:48:01     INFO - GECKO(3159) | #04: ??? (???:???)
[task 2017-11-10T10:48:01.392Z] 10:48:01     INFO - GECKO(3159) | ExceptionHandler::GenerateDump cloned child 6006
[task 2017-11-10T10:48:01.392Z] 10:48:01     INFO - GECKO(3159) | ExceptionHandler::SendContinueSignalToChild sent continue signal to child
[task 2017-11-10T10:48:01.392Z] 10:48:01     INFO - GECKO(3159) | ExceptionHandler::WaitForContinueSignal waiting for continue signal...
...
[task 2017-11-10T11:45:56.572Z] 11:45:56     INFO - GECKO(3159) | ###!!! [Child][MessageChannel] Error: (msgtype=0x2C00F5,name=PContent::Msg_RecordDiscardedData) Closed channel: cannot send/recv
[task 2017-11-10T11:45:56.573Z] 11:45:56     INFO - GECKO(3159) | [Child 3213, Main Thread] WARNING: MsgDropped in ContentChild: file /builds/worker/workspace/build/src/dom/ipc/ContentChild.cpp, line 2365
[task 2017-11-10T11:45:56.574Z] 11:45:56     INFO - GECKO(3159) | [Child 3213, Main Thread] WARNING: '!ipcActor->SendRecordDiscardedData(discardedData)', file /builds/worker/workspace/build/src/toolkit/components/telemetry/ipc/TelemetryIPCAccumulator.cpp, line 315

[taskcluster:error] Task timeout after 5400 seconds. Force killing container.
[taskcluster 2017-11-10 11:45:57.437Z] === Task Finished ===
[taskcluster 2017-11-10 11:45:57.438Z] Unsuccessful task run with exit code: -1 completed in 5401.304 seconds


At a glance, it doesn't look like anything has changed: Telemetry's shutdown behavior is very annoying, nsTerminator is not helpful, and despite crashing, the process appears to remain alive indefinitely.

I suppose, now that bug 1357466 is "resolved", I (or anyone else) could try to narrow down the scope by repeating experiments like comment 18, comment 20, comment 21.

I am skeptical that my efforts will lead to resolution, but I can have another look at this next week, or perhaps the week after that.
Flags: needinfo?(gbrown)
In https://treeherder.mozilla.org/#/jobs?repo=try&revision=044068945d53bd9ae4d71046a30805e1071bfd7d, I ran only the tests in dom/media/tests/mochitest, and I ran them in 12 chunks (fewer tests per chunk). The hang was reproduced 3 times, all 3 times in chunk 5:

https://public-artifacts.taskcluster.net/LZUmnflAR02Vs-yCnruuQA/0/public/logs/live_backing.log
https://public-artifacts.taskcluster.net/dTgxwfOWSaGOfo-frGrGqg/0/public/logs/live_backing.log
https://public-artifacts.taskcluster.net/aal9X35XRSiX1TWHV50Eyw/0/public/logs/live_backing.log

That suggests the hang occurs on shutdown after running this set of 13 tests:

dom/media/tests/mochitest/test_peerConnection_basicAudioDynamicPtMissingRtpmap.html
dom/media/tests/mochitest/test_peerConnection_basicAudioNATRelay.html
dom/media/tests/mochitest/test_peerConnection_basicAudioNATRelayTCP.html
dom/media/tests/mochitest/test_peerConnection_basicAudioNATSrflx.html
dom/media/tests/mochitest/test_peerConnection_basicAudioPcmaPcmuOnly.html
dom/media/tests/mochitest/test_peerConnection_basicAudioRequireEOC.html
dom/media/tests/mochitest/test_peerConnection_basicAudioVideo.html
dom/media/tests/mochitest/test_peerConnection_basicAudioVideoCombined.html
dom/media/tests/mochitest/test_peerConnection_basicAudioVideoNoBundle.html
dom/media/tests/mochitest/test_peerConnection_basicAudioVideoNoBundleNoRtcpMux.html
dom/media/tests/mochitest/test_peerConnection_basicAudioVideoNoRtcpMux.html
dom/media/tests/mochitest/test_peerConnection_basicVideo.html
dom/media/tests/mochitest/test_peerConnection_checkPacketDumpHook.html
possibly we can disable a couple tests at a time and do a series of pushes- that would help pinpoint the problem.
https://treeherder.mozilla.org/#/jobs?repo=try&author=gbrown@mozilla.com&tochange=adb993aa7f47f2f4e12ce8644ba9a56539766d4b&fromchange=d280045706af10a5bd8106d493d650454de325fd

When running fewer than 13 tests per chunk, failure frequency appears to be drastically reduced: It is hard to discovery anything definitively. However -- if this is telling us anything -- it suggests that some combination of these tests predicts the shutdown hang:

dom/media/tests/mochitest/test_peerConnection_basicAudioDynamicPtMissingRtpmap.html
dom/media/tests/mochitest/test_peerConnection_basicAudioNATRelay.html
dom/media/tests/mochitest/test_peerConnection_basicAudioNATRelayTCP.html
dom/media/tests/mochitest/test_peerConnection_basicAudioNATSrflx.html

More experiments under way...
I disabled those 4 tests (comment 36) and produced no shutdown hangs in ~200 runs of mochitest-media-e10s-2:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=8112ce323b1f7014b81e61e1c82a1892da11020b

Compared to 5 shutdown hangs in ~200 runs with nothing extra disabled:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=b8d85dd91dbf96599140b1a85d8b410f0c5e2337


I tried disabling only:

dom/media/tests/mochitest/test_peerConnection_basicAudioNATRelay.html
dom/media/tests/mochitest/test_peerConnection_basicAudioNATRelayTCP.html
dom/media/tests/mochitest/test_peerConnection_basicAudioNATSrflx.html

but produced 1 shutdown hang:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=3e80de40f2745a1ae55f2d3dfd7c8e7b80b10387


I tried disabling only:

dom/media/tests/mochitest/test_peerConnection_basicAudioDynamicPtMissingRtpmap.html
dom/media/tests/mochitest/test_peerConnection_basicAudioNATRelay.html
dom/media/tests/mochitest/test_peerConnection_basicAudioNATRelayTCP.html

but produced 1 shutdown hang:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=3ed21ecc2038addd7be61564f81ff6c35a206fe8
The try results (see previous comment) are encouraging, but certainly not conclusive. Rather than burning more time on try, I'd like to check this in, keep the bug open, and monitor the effect on shutdown hangs. If there's no change, I'll back this out (enable these tests). If there is a significant improvement, I'll try to track down contributors to these tests, or ping the media team for follow-up.
Assignee: nobody → gbrown
Attachment #8929029 - Flags: review?(jmaher)
Keywords: leave-open
Comment on attachment 8929029 [details] [diff] [review]
disable 4 tests on linux debug

Review of attachment 8929029 [details] [diff] [review]:
-----------------------------------------------------------------

sold
Attachment #8929029 - Flags: review?(jmaher) → review+
Component: General → WebRTC
Product: Testing → Core
Pushed by gbrown@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/cb55df94911c
Disable 4 dom/media mochitests on Linux Debug, in hopes of avoiding a shutdown hang; r=jmaher
Rank: 25
Priority: -- → P3
Notifying the authors of these tests, that they are disabled.
Flags: needinfo?(drno)
Flags: needinfo?(docfaraday)
Thank you for narrowing this down, Geoff.

Last I saw, the process for disabling tests involved notifying the owners of the tests.
Can you check please that you haven't disabled any other tests without notification of the owner?
Flags: needinfo?(gbrown)
Fixing bug 1404190 is likely to make this easier to analyze.
Depends on: 1404190
Whiteboard: [disabled]
(In reply to Karl Tomlinson (:karlt) from comment #44)
> Thank you for narrowing this down, Geoff.
> 
> Last I saw, the process for disabling tests involved notifying the owners of
> the tests.
> Can you check please that you haven't disabled any other tests without
> notification of the owner?

Hi Karl. I do indeed normally check that the owner has been needinfo'd before disabling. In this case I didn't because this was a "trial" - comment 38 - to see if disabling these tests was effective. I intended to set needinfo today if going ahead with the disable, or otherwise re-enable the tests. Sorry for the confusion.
Flags: needinfo?(gbrown)
It appears that disabling these 4 tests is indeed effective: There have been no new shutdown hangs in mochitest-media on linux/debug since those tests were disabled on linux/debug.

https://brasstacks.mozilla.com/orangefactor/?display=Bug&bugid=1411358

Shutdown hangs in mochitest-media continue on linux/asan.

:drno -- Any idea how these tests might "cause" an intermittent shutdown hang on linux debug and asan?

dom/media/tests/mochitest/test_peerConnection_basicAudioDynamicPtMissingRtpmap.html
dom/media/tests/mochitest/test_peerConnection_basicAudioNATRelay.html
dom/media/tests/mochitest/test_peerConnection_basicAudioNATRelayTCP.html
dom/media/tests/mochitest/test_peerConnection_basicAudioNATSrflx.html
Attachment #8930253 - Flags: review?(jmaher)
Pushed by gbrown@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/5f5c4d700a16
Disable 4 dom/media mochitests on linux/asan also; r=jmaher
(In reply to Geoff Brown [:gbrown] from comment #47)
> :drno -- Any idea how these tests might "cause" an intermittent shutdown
> hang on linux debug and asan?
> 
> dom/media/tests/mochitest/
> test_peerConnection_basicAudioDynamicPtMissingRtpmap.html

This test does not ever establish a RTCPeerConnection.

> dom/media/tests/mochitest/test_peerConnection_basicAudioNATRelay.html
> dom/media/tests/mochitest/test_peerConnection_basicAudioNATRelayTCP.html
> dom/media/tests/mochitest/test_peerConnection_basicAudioNATSrflx.html

Where these three tests normally establish a RTCPeerConnection through our NAT emulator.

Because of that I would assume that we should be able to re-enable the first test. Worth a try?

In terms of why this happening: some time ago we added new Telemetry probes to our ICE stack, which are reporting in the form a Scalar. I'm pretty sure we have a similar intermittent bug report hanging around somewhere already.

I think the root question, assuming this new Scalar Telemetry probe reporting is causing these problems, is: 
- is something wrong in the Telemetry code when the content process tries to send Scalar data to a parent which is dead already?
- or is WebRTC's way of collecting/reporting the Scalar is some way wrong/off/buggy (e.g. don't even try to report any Telemetry data on shutdown or something like that)?

Michael: could you have look on what might be going on here?
Flags: needinfo?(drno) → needinfo?(mfroman)
Nils, I believe the similar intermittent bug report you're referring to is Bug 1351655, which is still open.  The last bit of info there was from Comment 28 (https://bugzilla.mozilla.org/show_bug.cgi?id=1351655#c28) which in part stated, "The real problem here is that we're calling XRE_ShutdownChildProcess() in the child process, but it's not shutting down. All of the other stuff (the telemetry timer, the IPC stuff) is just a symptom of that."
Flags: needinfo?(mfroman)
(In reply to Nils Ohlmeier [:drno] from comment #50)
> Because of that I would assume that we should be able to re-enable the first
> test. Worth a try?

I tried that in https://treeherder.mozilla.org/#/jobs?repo=try&revision=3e80de40f2745a1ae55f2d3dfd7c8e7b80b10387, but still got 1 hang.
(In reply to Joel Maher ( :jmaher) (UTC-5) from comment #14)
> One sanity check would be to push to try with the telemetry turned off and
> see if this affects the failure rate.

Even when the Telemetry folks are convinced that Telemetry has nothing to do with these hangs did someone ever try jmaher suggestion and see what happens when we disable Telemetry?
Flags: needinfo?(gbrown)
(In reply to Nils Ohlmeier [:drno] from comment #53)

No, I don't think we ever tried that. Let's see how this turns out:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=f36ea34d3fb2da665b4468a886edf72f9e68b441
Flags: needinfo?(gbrown)
(In reply to Geoff Brown [:gbrown] from comment #54)
> (In reply to Nils Ohlmeier [:drno] from comment #53)
> 
> No, I don't think we ever tried that. Let's see how this turns out:
> 
> https://treeherder.mozilla.org/#/
> jobs?repo=try&revision=f36ea34d3fb2da665b4468a886edf72f9e68b441

That reproduced the shutdown hang twice.

I repeated that experiment (the 4 tests enabled, telemetry disabled) with:

https://treeherder.mozilla.org/#/jobs?repo=try&revision=6d68dcd9b51c452814901970ab752a357576d423

and reproduced the shutdown hang another 7 times.

Example:

https://public-artifacts.taskcluster.net/WY1qyvO7TemHyS_InR8KRA/0/public/logs/live_backing.log

[task 2017-11-22T18:11:20.293Z] 18:11:20     INFO - TEST-OK | dom/media/tests/mochitest/test_peerConnection_twoVideoTracksInOneStream.html | took 3771ms
[task 2017-11-22T18:11:20.350Z] 18:11:20     INFO - GECKO(3052) | ++DOMWINDOW == 8 (0xbeccdc00) [pid = 3101] [serial = 453] [outer = 0xe68ef000]
[task 2017-11-22T18:11:20.424Z] 18:11:20     INFO - TEST-START | Shutdown
[task 2017-11-22T18:11:20.424Z] 18:11:20     INFO - Passed:  26043
[task 2017-11-22T18:11:20.424Z] 18:11:20     INFO - Failed:  0
[task 2017-11-22T18:11:20.425Z] 18:11:20     INFO - Todo:    1578
[task 2017-11-22T18:11:20.425Z] 18:11:20     INFO - Mode:    e10s
[task 2017-11-22T18:11:20.425Z] 18:11:20     INFO - Slowest: 40063ms - /tests/dom/media/tests/mochitest/test_peerConnection_checkPacketDumpHook.html
[task 2017-11-22T18:11:20.447Z] 18:11:20     INFO - SimpleTest FINISHED
[task 2017-11-22T18:11:20.447Z] 18:11:20     INFO - TEST-INFO | Ran 1 Loops
[task 2017-11-22T18:11:20.447Z] 18:11:20     INFO - SimpleTest FINISHED
...
[task 2017-11-22T18:12:25.830Z] 18:12:25     INFO - GECKO(3052) | Hit MOZ_CRASH(Shutdown too long, probably frozen, causing a crash.) at /builds/worker/workspace/build/src/toolkit/components/terminator/nsTerminator.cpp:171
[task 2017-11-22T18:13:02.410Z] 18:13:02     INFO - GECKO(3052) | #01: RunWatchdog [toolkit/components/terminator/nsTerminator.cpp:171]
[task 2017-11-22T18:13:02.412Z] 18:13:02     INFO - 
[task 2017-11-22T18:13:02.500Z] 18:13:02     INFO - GECKO(3052) | #02: _pt_root [nsprpub/pr/src/pthreads/ptthread.c:219]
[task 2017-11-22T18:13:02.501Z] 18:13:02     INFO - 
[task 2017-11-22T18:13:02.502Z] 18:13:02     INFO - GECKO(3052) | #03: libpthread.so.0 + 0x6295
[task 2017-11-22T18:13:02.502Z] 18:13:02     INFO - 
[task 2017-11-22T18:13:02.503Z] 18:13:02     INFO - GECKO(3052) | ExceptionHandler::GenerateDump cloned child 6020
[task 2017-11-22T18:13:02.503Z] 18:13:02     INFO - GECKO(3052) | ExceptionHandler::SendContinueSignalToChild sent continue signal to child
[task 2017-11-22T18:13:02.503Z] 18:13:02     INFO - GECKO(3052) | ExceptionHandler::WaitForContinueSignal waiting for continue signal...
[task 2017-11-22T18:19:12.516Z] 18:19:12     INFO - Buffered messages finished
[task 2017-11-22T18:19:12.516Z] 18:19:12    ERROR - TEST-UNEXPECTED-TIMEOUT | dom/media/tests/mochitest/test_peerConnection_twoVideoTracksInOneStream.html | application timed out after 370 seconds with no output
[task 2017-11-22T18:19:12.518Z] 18:19:12    ERROR - Force-terminating active process(es).
[task 2017-11-22T18:19:12.520Z] 18:19:12     INFO - Determining child pids from psutil
[task 2017-11-22T18:19:12.536Z] 18:19:12     INFO - Found child pids: []
[task 2017-11-22T18:19:12.537Z] 18:19:12     INFO - Killing process: 3052
[task 2017-11-22T18:19:12.594Z] 18:19:12     INFO - TEST-INFO | started process screentopng
[task 2017-11-22T18:19:13.248Z] 18:19:13     INFO - TEST-INFO | screentopng: exit 0
[task 2017-11-22T18:19:13.249Z] 18:19:13     INFO - psutil found pid 3052 dead
[task 2017-11-22T18:35:53.376Z] 18:35:53     INFO - Automation Error: mozprocess timed out after 1000 seconds running ['/builds/worker/workspace/build/venv/bin/python', '-u', '/builds/worker/workspace/build/tests/mochitest/runtests.py', '--total-chunks', '3', '--this-chunk', '2', '--appname=/builds/worker/workspace/build/application/firefox/firefox', '--utility-path=tests/bin', '--extra-profile-file=tests/bin/plugins', '--symbols-path=/builds/worker/workspace/build/symbols', '--certificate-path=tests/certs', '--setpref=webgl.force-enabled=true', '--quiet', '--log-raw=/builds/worker/workspace/build/blobber_upload_dir/mochitest-media_raw.log', '--log-errorsummary=/builds/worker/workspace/build/blobber_upload_dir/mochitest-media_errorsummary.log', '--use-test-media-devices', '--screenshot-on-fail', '--cleanup-crashes', '--marionette-startup-timeout=180', '--sandbox-read-whitelist=/builds/worker/workspace/build', '--log-raw=-', '--subsuite=media']
[task 2017-11-22T18:35:53.397Z] 18:35:53    ERROR - timed out after 1000 seconds of no output


In all cases, the shutdown hang is reported first as

dom/media/tests/mochitest/test_peerConnection_twoVideoTracksInOneStream.html | application timed out after 370 seconds with no output

test_peerConnection_twoVideoTracksInOneStream.html is the last test in that directory -- this is just reporting that there's been no output for 370 seconds after shutdown. With telemetry enabled, the telemetry shutdown chatter avoids that no-output timeout. But either way, with or without telemetry, the intermittent shutdown hang persists.
Geoff, maybe we can also add some more logging to the new system log you added, which could at least add all the open processes. That way we could at least see which kind of child processes are still open. 

Also lets be aware of bug 1404190. I don't know if that has something to do here, but as you can see we only run into those hangs during shutdown.
Flags: needinfo?(docfaraday)
(In reply to Henrik Skupin (:whimboo) from comment #58)
> Geoff, maybe we can also add some more logging to the new system log you
> added, which could at least add all the open processes. That way we could at
> least see which kind of child processes are still open. 

That system-info.log already includes a process list, but it is generated before running the tests, so that mechanism is not very useful here.

If the test harness detects a hang, it kills processes, which includes logging that shows which processes were found.  That's not happening here because nothing is timing out at the test harness or mozharness levels: The telemetry logging is acting as a keepalive that avoids all of those timeouts.

The only thing that is timing out is the taskcluster task; I think it would be awkward to log processes from that level.
With the 4 tests disabled, we are not seeing the shutdown hangs any more (or at least, not very frequently). I would still like to see
 - better termination on shutdown hang - bug 1404190;
 - root cause of this shutdown hang fixed.

But I'm probably not the person to resolve those issues.
Assignee: gbrown → nobody
See Also: → 1421830
Depends on: 1428390
I tried to enable the disabled tests, with the fix for bug 1428390, but still hit the shutdown hang, as before:

https://treeherder.mozilla.org/logviewer.html#?job_id=155459928&repo=try
(In reply to Georg Fritzsche [:gfritzsche] from comment #3)
> I don't think whats shown in comment 1 is the reason for the hang.
> This is just Telemetry trying to send messages to the parent when the
> connection is already closed (which i'll file a separate bug on).

Georg, have you filed such a bug? I cannot find it actually.

Also I wonder why telemetry is turned on for media tests. Aren't we turning of datareporting for all test suites?
Flags: needinfo?(gfritzsche)
Sending of Telemetry should be disabled for tests. If it's not that is a bug we want to fix.
Telemetry in general however is always enabled now, even if the data doesn't leave the Firefox client.
Flags: needinfo?(gfritzsche)
(In reply to Henrik Skupin (:whimboo) from comment #62)
> (In reply to Georg Fritzsche [:gfritzsche] from comment #3)
> > I don't think whats shown in comment 1 is the reason for the hang.
> > This is just Telemetry trying to send messages to the parent when the
> > connection is already closed (which i'll file a separate bug on).
> 
> Georg, have you filed such a bug? I cannot find it actually.

I'm not sure. Now there is bug 1432913.
(In reply to Georg Fritzsche [:gfritzsche] from comment #63)
> Sending of Telemetry should be disabled for tests. If it's not that is a bug
> we want to fix.

The kill switch for Telemetry is `datareporting.policy.dataSubmissionEnabled` right?
Flags: needinfo?(gfritzsche)
Correct, that is the main "can Firefox upload data?" switch.
Flags: needinfo?(gfritzsche)
Ok, so mochitests are using `prefs_general.js` from:
https://dxr.mozilla.org/mozilla-central/source/testing/profiles/prefs_general.js

Whether in that file nor in testing/mochitest I can find a reference for that pref. So I doubt that this is set, and as such mochitests have telemetry enabled.

Geoff, can you check that please?
Flags: needinfo?(gbrown)
Depends on: 1433485
Thanks :whimboo -- spun off the prefs conversation to bug 1433485.
Flags: needinfo?(gbrown)
Geoff, looks like it is time that we can try again to re-enable those tests. Lets hope that it will work this time.
Flags: needinfo?(gbrown)
Attached patch re-enable testsSplinter Review
https://treeherder.mozilla.org/#/jobs?repo=try&author=gbrown@mozilla.com&tochange=2f92496ba150896b03a386289c7a6394d40df2a0&fromchange=3548f6ca1327ca8bdb61e564a7b68df8797d32ae has 4 pushes: 2 different base-lines + 2 corresponding pushes with these tests enabled. 

There are significant failures in all cases (existing, known failures): to me, it looks like the enabled cases are no worse than the disabled cases.

Perhaps more significant: None of the failures are hang-on-shutdown. So I think it is reasonable to re-enable these tests. Thanks for the reminder!


Since I cannot reproduce this failure, once these tests are enabled, I think we can close this bug. I am not sure what "fixed" it. Perhaps bug 1433485 helped?
Assignee: nobody → gbrown
Flags: needinfo?(gbrown)
Attachment #8953494 - Flags: review?(hskupin)
Keywords: leave-open
That sounds good. And yes I'm sure that bug 1433485 had an impact here!

Regarding all the timeout failures, do you know that those are happening because the tab has crashed? See the attached PNG artifacts for each of those jobs! How does Mochitest handle such a situation? In Marionette we use "MOZ_CRASHREPORTER_SHUTDOWN" to shutdown Firefox in case of content crashes.
Flags: needinfo?(gbrown)
Attachment #8953494 - Flags: review?(hskupin) → review+
Pushed by gbrown@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/25cb6ef45dee
Re-enable some mochitest-media tests on linux/debug and linux/asan; r=whimboo
Whiteboard: [disabled]
(In reply to Henrik Skupin (:whimboo) from comment #71)
> That sounds good. And yes I'm sure that bug 1433485 had an impact here!
> 
> Regarding all the timeout failures, do you know that those are happening
> because the tab has crashed? See the attached PNG artifacts for each of
> those jobs! How does Mochitest handle such a situation? In Marionette we use
> "MOZ_CRASHREPORTER_SHUTDOWN" to shutdown Firefox in case of content crashes.

I did not notice that! Thank you so much. Follow-up in bug 1440719.
Flags: needinfo?(gbrown)
https://hg.mozilla.org/mozilla-central/rev/25cb6ef45dee
Status: REOPENED → RESOLVED
Closed: 3 years ago2 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla60
You need to log in before you can comment on or make changes to this bug.