Closed
Bug 1351655
Opened 8 years ago
Closed 7 years ago
Intermittent command timed out: 7200 seconds elapsed running ['/tools/buildbot/bin/python', '-u', 'scripts/scripts/desktop_unittest.py', '--mochitest-suite', 'mochitest-media', after many WARNING: MsgDropped in ContentChild
Categories
(Core :: WebRTC, defect, P3)
Core
WebRTC
Tracking
RESOLVED
FIXED
People
(Reporter: aryx, Assigned: pehrsons)
References
(Depends on 1 open bug)
Details
(Keywords: intermittent-failure, Whiteboard: low-volume)
Attachments
(1 file, 1 obsolete file)
https://treeherder.mozilla.org/logviewer.html#?job_id=87233213&repo=mozilla-inbound
03:37:34 INFO - GECKO(1784) | [Main Thread]: I/signaling [main|PeerConnectionImpl] PeerConnectionImpl.cpp:331: PeerConnectionImpl: PeerConnectionImpl constructor for
03:37:34 INFO - GECKO(1784) | (remote_bitrate_estimator_single_stream.cc:59): RemoteBitrateEstimatorSingleStream: Instantiating.
03:37:34 INFO - GECKO(1784) | [Socket Thread]: D/mtransport NrIceCtx static call to find local stun addresses
03:37:34 INFO - GECKO(1784) | [Main Thread]: I/signaling [main|PeerConnectionImpl] PeerConnectionImpl.cpp:331: PeerConnectionImpl: PeerConnectionImpl constructor for
03:37:34 INFO - GECKO(1784) | (remote_bitrate_estimator_single_stream.cc:59): RemoteBitrateEstimatorSingleStream: Instantiating.
03:37:34 INFO - GECKO(1784) | [Socket Thread]: D/mtransport NrIceCtx static call to find local stun addresses
03:37:34 INFO - GECKO(1784) | MEMORY STAT | vsize 4271MB | residentFast 304MB | heapAllocated 29MB
03:37:34 INFO - GECKO(1784) | [Main Thread]: I/signaling [main|PeerConnectionMedia] PeerConnectionMedia.cpp:209: OnProxyAvailable: Proxy Available: 0
03:37:34 INFO - GECKO(1784) | [Main Thread]: I/signaling [main|PeerConnectionMedia] PeerConnectionMedia.cpp:267: OnStunAddrsAvailable: receiving (5) stun addrs
03:37:34 INFO - GECKO(1784) | [Main Thread]: I/signaling [main|PeerConnectionMedia] PeerConnectionMedia.cpp:209: OnProxyAvailable: Proxy Available: 0
03:37:34 INFO - GECKO(1784) | [Main Thread]: I/signaling [main|PeerConnectionMedia] PeerConnectionMedia.cpp:267: OnStunAddrsAvailable: receiving (5) stun addrs
03:37:34 INFO - TEST-OK | dom/media/tests/mochitest/test_selftest.html | took 1145ms
03:37:34 INFO - GECKO(1784) | ++DOMWINDOW == 10 (0x11b2b2800) [pid = 1785] [serial = 527] [outer = 0x126914000]
03:37:34 INFO - GECKO(1784) | [Main Thread]: I/signaling [main|PeerConnectionImpl] PeerConnectionImpl.cpp:2942: CloseInt: Closing PeerConnectionImpl 2dccd38ebaa39a02; ending call
03:37:34 INFO - GECKO(1784) | [Main Thread]: I/jsep [1490783854859219 (id=2147484172 url=http://mochi.test:8888/tests/dom/media/tests/mochitest/test_selftest.html)]: stable -> closed
03:37:34 INFO - GECKO(1784) | [Main Thread]: I/signaling [main|PeerConnectionImpl] PeerConnectionImpl.cpp:2942: CloseInt: Closing PeerConnectionImpl dc1b7e7ad8661526; ending call
03:37:34 INFO - GECKO(1784) | [Main Thread]: I/jsep [1490783854866958 (id=2147484172 url=http://mochi.test:8888/tests/dom/media/tests/mochitest/test_selftest.html)]: stable -> closed
03:37:34 INFO - TEST-START | Shutdown
03:37:34 INFO - Passed: 35185
03:37:34 INFO - Failed: 0
03:37:34 INFO - Todo: 1398
03:37:34 INFO - Mode: e10s
03:37:34 INFO - Slowest: 12172ms - /tests/dom/media/tests/mochitest/test_ondevicechange.html
03:37:34 INFO - SimpleTest FINISHED
03:37:34 INFO - TEST-INFO | Ran 1 Loops
03:37:34 INFO - SimpleTest FINISHED
03:37:34 INFO - GECKO(1784) | --DOCSHELL 0x11b2a3800 == 2 [pid = 1785] [id = {284210b4-7253-414a-aa5d-9c223a738ce8}]
03:37:34 INFO - GECKO(1784) | ++DOMWINDOW == 11 (0x11b409000) [pid = 1785] [serial = 528] [outer = 0x126914000]
03:37:35 INFO - GECKO(1784) | 1490783855014 addons.xpi WARN Exception running bootstrap method shutdown on shield-recipe-client@mozilla.org: TypeError: log is null (resource://gre/modules/addons/XPIProvider.jsm -> jar:file:///builds/slave/test/build/application/NightlyDebug.app/Contents/Resources/browser/features/shield-recipe-client@mozilla.org.xpi!/bootstrap.js:96:5) JS Stack trace: this.shutdown@resource://gre/modules/addons/XPIProvider.jsm -> jar:file:///builds/slave/test/build/application/NightlyDebug.app/Contents/Resources/browser/features/shield-recipe-client@bootstrap.js:96:5 < callBootstrapMethod@XPIProvider.jsm:4957:11 < observe@XPIProvider.jsm:2924:15 < SpecialPowersObserver.prototype.receiveMessage@SpecialPowersObserver.jsm:244:7 < permitUnload@remote-browser.xml:357:13 < CanCloseWindow@browser.js:6376:36 < canClose@browser.js:5227:12 < SpecialPowersObserver.prototype.receiveMessage@SpecialPowersObserver.jsm:244:7
03:37:35 INFO - GECKO(1784) | --DOCSHELL 0x123579000 == 3 [pid = 1784] [id = {56ad6d42-ee07-1641-9eaa-1d9b695a7587}]
03:37:35 INFO - GECKO(1784) | [Parent 1784] WARNING: nsAppShell::Exit() called redundantly: file /home/worker/workspace/build/src/widget/cocoa/nsAppShell.mm, line 691
03:37:35 INFO - GECKO(1784) | --DOCSHELL 0x11e45f800 == 2 [pid = 1784] [id = {0764a66b-de5c-6e47-a707-13407cd46389}]
03:37:35 INFO - GECKO(1784) | --DOCSHELL 0x123580000 == 1 [pid = 1784] [id = {9c0aa61b-dd01-934d-9277-b8781f5e16a1}]
03:37:35 INFO - GECKO(1784) | ###!!! [Child][MessageChannel] Error: (msgtype=0xD60007,name=PTCPSocket::Msg_Close) Closed channel: cannot send/recv
03:37:35 INFO - GECKO(1784) | [Child 1785] WARNING: MsgDropped in ContentChild: file /home/worker/workspace/build/src/dom/ipc/ContentChild.cpp, line 2037
03:37:35 INFO - GECKO(1784) | ###!!! [Child][MessageChannel] Error: (msgtype=0xD60007,name=PTCPSocket::Msg_Close) Closed channel: cannot send/recv
03:37:35 INFO - GECKO(1784) | [Child 1785] WARNING: MsgDropped in ContentChild: file /home/worker/workspace/build/src/dom/ipc/ContentChild.cpp, line 2037
03:37:35 INFO - GECKO(1784) | ###!!! [Child][MessageChannel] Error: (msgtype=0xD60007,name=PTCPSocket::Msg_Close) Closed channel: cannot send/recv
03:37:35 INFO - GECKO(1784) | [Child 1785] WARNING: MsgDropped in ContentChild: file /home/worker/workspace/build/src/dom/ipc/ContentChild.cpp, line 2037
03:37:35 INFO - GECKO(1784) | ###!!! [Child][MessageChannel] Error: (msgtype=0xD60007,name=PTCPSocket::Msg_Close) Closed channel: cannot send/recv
03:37:35 INFO - GECKO(1784) | [Child 1785] WARNING: MsgDropped in ContentChild: file /home/worker/workspace/build/src/dom/ipc/ContentChild.cpp, line 2037
03:37:35 INFO - GECKO(1784) | ###!!! [Child][MessageChannel] Error: (msgtype=0xD60007,name=PTCPSocket::Msg_Close) Closed channel: cannot send/recv
03:37:35 INFO - GECKO(1784) | [Child 1785] WARNING: MsgDropped in ContentChild: file /home/worker/workspace/build/src/dom/ipc/ContentChild.cpp, line 2037
03:37:35 INFO - GECKO(1784) | ###!!! [Child][MessageChannel] Error: (msgtype=0xD60007,name=PTCPSocket::Msg_Close) Closed channel: cannot send/recv
03:37:35 INFO - GECKO(1784) | [Child 1785] WARNING: MsgDropped in ContentChild: file /home/worker/workspace/build/src/dom/ipc/ContentChild.cpp, line 2037
03:37:35 INFO - GECKO(1784) | ###!!! [Child][MessageChannel] Error: (msgtype=0xD60007,name=PTCPSocket::Msg_Close) Closed channel: cannot send/recv
Comment 1•8 years ago
Maybe this has something to do with the IPC for discovering local addresses?
Rank: 35
Component: Audio/Video → WebRTC: Networking
Priority: -- → P3
Updated•8 years ago
backlog: --- → tech-debt
Comment hidden (Intermittent Failures Robot)
Comment hidden (Intermittent Failures Robot)
Comment hidden (Intermittent Failures Robot)
Comment hidden (Intermittent Failures Robot)
Comment 6•8 years ago
This is particularly ugly because it ties up our extremely capacity-constrained and backlog-prone OSX test machines for 2 hours doing basically nothing. Any way we can bump the priority here?
Flags: needinfo?(mreavy)
Comment 7•8 years ago
To me it looks like the media mochitests do actually finish.
But then we first get the errors from closing the socket, which appear to be harmless, followed by an endless stream of retries to submit Telemetry histograms.
And then somewhere in between, the attempt to kill Firefox via a MOZ_CRASH because of a shutdown timeout also fails.
@mjf: could the Telemetry part be related to the ICE stats you landed some time ago?
18:14:12 INFO - TEST-OK | dom/media/tests/mochitest/test_selftest.html | took 1179ms
18:14:12 INFO - GECKO(2099) | ++DOMWINDOW == 10 (0x124f96000) [pid = 2100] [serial = 527] [outer = 0x1273d7000]
18:14:12 INFO - GECKO(2099) | [Main Thread]: I/signaling [main|PeerConnectionImpl] PeerConnectionImpl.cpp:2975: CloseInt: Closing PeerConnectionImpl 2b3a8bc7bd356aba; ending call
18:14:12 INFO - GECKO(2099) | [Main Thread]: I/jsep [1493428452254024 (id=2147484172 url=http://mochi.test:8888/tests/dom/media/tests/mochitest/test_selftest.html)]: stable -> closed
18:14:12 INFO - GECKO(2099) | [Main Thread]: I/signaling [main|PeerConnectionImpl] PeerConnectionImpl.cpp:2975: CloseInt: Closing PeerConnectionImpl f771ae1c84cd0d65; ending call
18:14:12 INFO - GECKO(2099) | [Main Thread]: I/jsep [1493428452261770 (id=2147484172 url=http://mochi.test:8888/tests/dom/media/tests/mochitest/test_selftest.html)]: stable -> closed
18:14:12 INFO - TEST-START | Shutdown
18:14:12 INFO - Passed: 34912
18:14:12 INFO - Failed: 0
18:14:12 INFO - Todo: 1420
18:14:12 INFO - Mode: e10s
18:14:12 INFO - Slowest: 14763ms - /tests/dom/media/tests/mochitest/test_peerConnection_stats.html
18:14:12 INFO - SimpleTest FINISHED
18:14:12 INFO - TEST-INFO | Ran 1 Loops
18:14:12 INFO - SimpleTest FINISHED
18:14:12 INFO - GECKO(2099) | --DOCSHELL 0x128eb8000 == 2 [pid = 2100] [id = {1abf6fe2-c18a-5141-8bf4-1b52dccac79a}]
18:14:12 INFO - GECKO(2099) | ++DOMWINDOW == 11 (0x126f7f000) [pid = 2100] [serial = 528] [outer = 0x1273d7000]
18:14:12 INFO - GECKO(2099) | --DOCSHELL 0x12670e000 == 3 [pid = 2099] [id = {b6de75eb-44e5-d246-b0ee-81f8cd82b938}]
18:14:12 INFO - GECKO(2099) | --DOCSHELL 0x121025800 == 2 [pid = 2099] [id = {1c0f92a7-3374-3b49-922c-b12f3b50495b}]
18:14:13 INFO - GECKO(2099) | --DOCSHELL 0x126710000 == 1 [pid = 2099] [id = {1e95adde-fd65-374f-82f4-7f10ff78a173}]
18:14:13 INFO - GECKO(2099) | --DOCSHELL 0x122040800 == 0 [pid = 2099] [id = {ab0f35ea-78e5-3748-925e-616730cd66b4}]
18:14:13 INFO - GECKO(2099) | ###!!! [Child][MessageChannel] Error: (msgtype=0xDA0007,name=PTCPSocket::Msg_Close) Closed channel: cannot send/recv
18:14:13 INFO - GECKO(2099) | [Child 2100] WARNING: MsgDropped in ContentChild: file /home/worker/workspace/build/src/dom/ipc/ContentChild.cpp, line 2099
18:14:13 INFO - GECKO(2099) | ###!!! [Child][MessageChannel] Error: (msgtype=0xDA0007,name=PTCPSocket::Msg_Close) Closed channel: cannot send/recv
18:14:13 INFO - GECKO(2099) | [Child 2100] WARNING: MsgDropped in ContentChild: file /home/worker/workspace/build/src/dom/ipc/ContentChild.cpp, line 2099
18:14:13 INFO - GECKO(2099) | ###!!! [Child][MessageChannel] Error: (msgtype=0xDA0007,name=PTCPSocket::Msg_Close) Closed channel: cannot send/recv
18:14:13 INFO - GECKO(2099) | [Child 2100] WARNING: MsgDropped in ContentChild: file /home/worker/workspace/build/src/dom/ipc/ContentChild.cpp, line 2099
18:14:13 INFO - GECKO(2099) | ###!!! [Child][MessageChannel] Error: (msgtype=0xDA0007,name=PTCPSocket::Msg_Close) Closed channel: cannot send/recv
18:14:13 INFO - GECKO(2099) | [Child 2100] WARNING: MsgDropped in ContentChild: file /home/worker/workspace/build/src/dom/ipc/ContentChild.cpp, line 2099
18:14:13 INFO - GECKO(2099) | ###!!! [Child][MessageChannel] Error: (msgtype=0xDA0007,name=PTCPSocket::Msg_Close) Closed channel: cannot send/recv
18:14:13 INFO - GECKO(2099) | [Child 2100] WARNING: MsgDropped in ContentChild: file /home/worker/workspace/build/src/dom/ipc/ContentChild.cpp, line 2099
18:14:13 INFO - GECKO(2099) | ###!!! [Child][MessageChannel] Error: (msgtype=0xDA0007,name=PTCPSocket::Msg_Close) Closed channel: cannot send/recv
18:14:13 INFO - GECKO(2099) | [Child 2100] WARNING: MsgDropped in ContentChild: file /home/worker/workspace/build/src/dom/ipc/ContentChild.cpp, line 2099
18:14:13 INFO - GECKO(2099) | ###!!! [Child][MessageChannel] Error: (msgtype=0xDA0007,name=PTCPSocket::Msg_Close) Closed channel: cannot send/recv
18:14:13 INFO - GECKO(2099) | [Child 2100] WARNING: MsgDropped in ContentChild: file /home/worker/workspace/build/src/dom/ipc/ContentChild.cpp, line 2099
18:14:13 INFO - GECKO(2099) | [Main Thread]: D/DataChannel Shutting down SCTP
18:14:13 INFO - GECKO(2099) | ###!!! [Parent][RunMessage] Error: Channel closing: too late to send/recv, messages will be lost
18:14:13 INFO - GECKO(2099) | [Child 2100] WARNING: nsAppShell::Exit() called redundantly: file /home/worker/workspace/build/src/widget/cocoa/nsAppShell.mm, line 691
18:14:13 INFO - GECKO(2099) | [Child 2100] WARNING: NS_ENSURE_TRUE(maybeContext) failed: file /home/worker/workspace/build/src/xpcom/threads/nsThread.cpp, line 1015
18:14:14 INFO - GECKO(2099) | 1493428454161 Marionette INFO Ceased listening
18:14:14 INFO - GECKO(2099) | [Parent 2099] WARNING: nsAppShell::Exit() called redundantly: file /home/worker/workspace/build/src/widget/cocoa/nsAppShell.mm, line 691
18:14:14 INFO - GECKO(2099) | [Parent 2099] WARNING: NS_ENSURE_TRUE(maybeContext) failed: file /home/worker/workspace/build/src/xpcom/threads/nsThread.cpp, line 1015
18:14:14 INFO - GECKO(2099) | ###!!! [Child][MessageChannel] Error: (msgtype=0x4800FB,name=PContent::Msg_AccumulateChildHistograms) Closed channel: cannot send/recv
18:14:14 INFO - GECKO(2099) | [Child 2100] WARNING: MsgDropped in ContentChild: file /home/worker/workspace/build/src/dom/ipc/ContentChild.cpp, line 2099
18:14:14 INFO - GECKO(2099) | [Child 2100] WARNING: '!ipcActor->SendAccumulateChildHistograms(accumulationsToSend)', file /home/worker/workspace/build/src/toolkit/components/telemetry/ipc/TelemetryIPCAccumulator.cpp, line 258
18:14:14 INFO - GECKO(2099) | ###!!! [Child][MessageChannel] Error: (msgtype=0x4800FC,name=PContent::Msg_AccumulateChildKeyedHistograms) Closed channel: cannot send/recv
18:14:14 INFO - GECKO(2099) | [Child 2100] WARNING: MsgDropped in ContentChild: file /home/worker/workspace/build/src/dom/ipc/ContentChild.cpp, line 2099
18:14:14 INFO - GECKO(2099) | [Child 2100] WARNING: '!ipcActor->SendAccumulateChildKeyedHistograms(keyedAccumulationsToSend)', file /home/worker/workspace/build/src/toolkit/components/telemetry/ipc/TelemetryIPCAccumulator.cpp, line 262
18:14:16 INFO - GECKO(2099) | ###!!! [Child][MessageChannel] Error: (msgtype=0x4800FC,name=PContent::Msg_AccumulateChildKeyedHistograms) Closed channel: cannot send/recv
Flags: needinfo?(mfroman)
Comment 8•8 years ago
@drno: The ICE telemetry I added uses scalars (Telemetry::ScalarAdd), not histograms, so it doesn't seem to be related.
However, after looking at TelemetryIPCAccumulator.cpp and remembering some of the lessons I learned adding the IPC for stun address discovery, it could be a bug in how the telemetry code handles one side of the IPC channel going away.
Flags: needinfo?(mfroman)
Comment 9•8 years ago
@chutten: Any chance this is a bug with one side of the telemetry IPC channel going away on shutdown?
Flags: needinfo?(chutten)
Comment 10•8 years ago
I think I have actually experienced this bug once during development/testing on Linux:
I don't remember anymore how I shut down Firefox in that case, but basically the Firefox window had already closed and I kept seeing this error message from Telemetry IPC over and over again.
Looking at the running processes, it looked like the content process was hanging (trying to deliver its telemetry data to the parent?). I think the parent process was already gone.
I'm wondering if the initial socket problem from shutting down WebRTC then makes Telemetry do something strange...
Comment 11•8 years ago
Telemetry IPC doesn't do anything to try and stay alive. There is that 2s timer trying to send the last batch of child telemetry, but it's on a best-effort basis when the process is shutting down.
If IPC is bork, it can't send. I think that's all it is. I don't think it has anything to do with the actual problem (but I've been wrong before)
Flags: needinfo?(chutten)
Comment 12•8 years ago
Bill - are there any known e10s issues where the master process dies, and the child never exits?
Any ideas what could be happening here? The telemetry stuff appears to just be the child trying to contact the master process.
Flags: needinfo?(wmccloskey)
We do have a mechanism to cause the child to die if the parent dies. You can search for "AbortOnError" [1]. However, we disable this mechanism late in shutdown [2]. At the same time, we call StartForceKillTimer(). However, StartForceKillTimer does nothing in DEBUG builds. It looks like all the failures here are DEBUG builds (except for one, which I think was mis-starred). DEBUG builds are handled differently because we try really hard to do leak checking in them.
So presumably we're getting to RecvShutdown in the content process. It's probably a good bet that we even Close the channel normally with the child process (given that I don't see any messages about SendShutdown failing to be sent). It seems like something else is causing the shutdown process to hang. Maybe we're in a nested event loop? It's also odd that it seems like both the parent and the child are hanging at shutdown. The parent eventually gets killed by the nsTerminator code.
It would help to have a crash dump here. I'm not sure why we don't have one. I would have expected MOZ_CRASH to generate one.
[1] http://searchfox.org/mozilla-central/search?q=AbortOnError&case=true&regexp=false&path=
[2] http://searchfox.org/mozilla-central/rev/35b37316149108d53a02fb8498e250ea8a22ab5d/dom/ipc/ContentChild.cpp#2875
Flags: needinfo?(wmccloskey)
Updated•8 years ago
backlog: tech-debt → ---
Rank: 35
Component: WebRTC: Networking → IPC
Flags: needinfo?(mreavy)
Priority: P3 → --
Comment hidden (Intermittent Failures Robot)
Comment hidden (Intermittent Failures Robot)
Comment hidden (Intermittent Failures Robot)
Comment hidden (Intermittent Failures Robot)
Comment 18•8 years ago
This is slightly less-bad than it used to be because now we're only burning 90min of machine time instead of 2hr like we used to (Yay Taskcluster migration), but it's still pretty harmful to our limited OSX machine pool when we get into this state.
Here's a recent log to show what happens - you can see 45+min of infinite looping at the end of the run.
https://public-artifacts.taskcluster.net/arTLstMpQ5yKH6wiARfZ5Q/0/public/logs/live_backing.log
Jim, any chance you could help us find an owner for this?
Flags: needinfo?(jmathies)
Comment 19•8 years ago
The first error I see is this; not sure if it plays in here, but it looks like some data coming over from the child isn't valid:
15:39:24 INFO - GECKO(1725) | [Parent 1725] WARNING: NS_FAILED internal_GetScalarByEnum for CHILD: file /home/worker/workspace/build/src/toolkit/components/telemetry/TelemetryScalar.cpp, line 2184
http://searchfox.org/mozilla-central/rev/5696c3e525fc8222674eed6a562f5fcbe804c4c7/toolkit/components/telemetry/TelemetryScalar.cpp#2184
The second error is a PTCPSocket message getting dropped when we receive a Close command on the child side but the channel has already been torn down. This is usually the result of lifetime design issues in the underlying protocol. On the child side we just drop these messages. I'd be surprised if PTCPSocket has lifetime issues, though!
http://searchfox.org/mozilla-central/rev/5696c3e525fc8222674eed6a562f5fcbe804c4c7/dom/ipc/ContentChild.cpp#2341
then -
15:39:26 INFO - GECKO(1725) | ###!!! [Child][MessageChannel] Error: (msgtype=0x2400F0,name=PContent::Msg_AccumulateChildHistograms) Closed channel: cannot send/recv
15:39:26 INFO - GECKO(1725) | [Child 1726] WARNING: MsgDropped in ContentChild: file /home/worker/workspace/build/src/dom/ipc/ContentChild.cpp, line 2344
15:39:26 INFO - GECKO(1725) | [Child 1726] WARNING: '!ipcActor->SendAccumulateChildHistograms(accumulationsToSend)', file /home/worker/workspace/build/src/toolkit/components/telemetry/ipc/TelemetryIPCAccumulator.cpp, line 299
This is another error on the child side where the parent has requested telemetry data but the channel the request is made on has been shut down on the child side. AFAICT this looks like the main content channel -
http://searchfox.org/mozilla-central/rev/5696c3e525fc8222674eed6a562f5fcbe804c4c7/toolkit/components/telemetry/ipc/TelemetryIPCAccumulator.cpp#332
In the middle of all this I see some odd messages about redundant calls to nsAppShell::Exit() in both processes.
Stephen, when you have a chance can you poke at this a bit please.
Flags: needinfo?(jmathies) → needinfo?(spohl.mozilla.bugs)
Comment hidden (Intermittent Failures Robot)
Comment hidden (Intermittent Failures Robot)
Comment hidden (Intermittent Failures Robot)
Comment hidden (Intermittent Failures Robot)
Comment 24•8 years ago
Here is what keeps being repeated every two seconds in the log[1]:
15:39:26 INFO - GECKO(1725) | ###!!! [Child][MessageChannel] Error: (msgtype=0x2400F1,name=PContent::Msg_AccumulateChildKeyedHistograms) Closed channel: cannot send/recv
15:39:26 INFO - GECKO(1725) | [Child 1726] WARNING: MsgDropped in ContentChild: file /home/worker/workspace/build/src/dom/ipc/ContentChild.cpp, line 2344
15:39:26 INFO - GECKO(1725) | [Child 1726] WARNING: '!ipcActor->SendAccumulateChildKeyedHistograms(keyedAccumulationsToSend)', file /home/worker/workspace/build/src/toolkit/components/telemetry/ipc/TelemetryIPCAccumulator.cpp, line 303
15:39:26 INFO - GECKO(1725) | ###!!! [Child][MessageChannel] Error: (msgtype=0x2400F5,name=PContent::Msg_RecordDiscardedData) Closed channel: cannot send/recv
15:39:26 INFO - GECKO(1725) | [Child 1726] WARNING: MsgDropped in ContentChild: file /home/worker/workspace/build/src/dom/ipc/ContentChild.cpp, line 2344
15:39:26 INFO - GECKO(1725) | [Child 1726] WARNING: '!ipcActor->SendRecordDiscardedData(discardedData)', file /home/worker/workspace/build/src/toolkit/components/telemetry/ipc/TelemetryIPCAccumulator.cpp, line 318
As was stated before, the MessageChannel has been torn down. We have also called nsAppShell::Exit() in the child, but the IPC timer continues to fire and we attempt to send accumulated telemetry data[1] every two seconds. The two types of data are keyed histograms[2] and discarded data[3].

Attempts to send this data continuously fail with "Closed channel: cannot send/recv"[4], because the MessageChannel is already torn down. We then report a warning that the message has been dropped and return immediately from ContentChild::ProcessingError[6]. We don't have a crash dump because we return here; otherwise, we would execute MOZ_CRASH[7]. This change was introduced in bug 1091766 to address the fact that we didn't use to execute shutdown properly, which also leaves us without leak logs.

Note however that even MOZ_CRASH appears to fail to exit/crash the process (as mentioned in comment 7): in the middle of the log, we attempt to crash the child process via MOZ_CRASH due to "Hit MOZ_CRASH(Shutdown too long, probably frozen, causing a crash.) at /home/worker/workspace/build/src/toolkit/components/terminator/nsTerminator.cpp:160", but then continue to process further IPC timer fires every two seconds.
In summary, it appears that we're desperately trying to submit telemetry data via IPC. So desperately that any calls to nsAppShell::Exit() don't work (because it has already been called) and even MOZ_CRASH doesn't crash the child.
It seems that we should quit trying to submit this data when the channel is torn down and unlock ourselves. I'm not clear how to achieve this. Maybe we should clear this data and pretend that there is nothing to send when the channel is being torn down? Or we should release whatever lock we seem to be holding to allow for nsAppShell::Exit() to complete?
:billm, :mccr8, do you have any thoughts on how to proceed here?
[1] https://public-artifacts.taskcluster.net/arTLstMpQ5yKH6wiARfZ5Q/0/public/logs/live_backing.log
[2] https://dxr.mozilla.org/mozilla-central/rev/20d57b9c4183973af4af5e078dff2aec0b74f928/toolkit/components/telemetry/ipc/TelemetryIPCAccumulator.cpp#334
[3] https://dxr.mozilla.org/mozilla-central/rev/20d57b9c4183973af4af5e078dff2aec0b74f928/toolkit/components/telemetry/ipc/TelemetryIPCAccumulator.cpp#303
[4] https://dxr.mozilla.org/mozilla-central/rev/20d57b9c4183973af4af5e078dff2aec0b74f928/toolkit/components/telemetry/ipc/TelemetryIPCAccumulator.cpp#318
[5] https://dxr.mozilla.org/mozilla-central/source/ipc/glue/MessageChannel.cpp#2442,2463-2464
[6] https://dxr.mozilla.org/mozilla-central/rev/20d57b9c4183973af4af5e078dff2aec0b74f928/dom/ipc/ContentChild.cpp#2384,2385
[7] https://dxr.mozilla.org/mozilla-central/rev/20d57b9c4183973af4af5e078dff2aec0b74f928/dom/ipc/ContentChild.cpp#2403
Flags: needinfo?(wmccloskey)
Flags: needinfo?(spohl.mozilla.bugs)
Flags: needinfo?(continuation)
Comment 25•8 years ago
It sounds like telemetry should listen for xpcom-shutdown and cancel the timer, so it won't try to send messages after IPC is shut down.
Flags: needinfo?(continuation)
Updated•8 years ago
Component: IPC → Telemetry
Product: Core → Toolkit
Comment 26•8 years ago
Alternatively, some kind of shutdown call could be added to ShutdownXPCOM, right after where we send xpcom-shutdown (NS_XPCOM_SHUTDOWN_OBSERVER_ID).
Comment 27•8 years ago
One active timer keeps the process alive?
The real problem here is that we're calling XRE_ShutdownChildProcess() in the child process, but it's not shutting down. All of the other stuff (the telemetry timer, the IPC stuff) is just a symptom of that.
I suspect that the "Shutdown too long, probably frozen, causing a crash." message is actually being printed by some other process. I would be very surprised if MOZ_CRASH somehow wasn't working.
I think we'll need to insert some sort of debugging to see what's happening inside ShutdownChildProcess().
Flags: needinfo?(wmccloskey)
Comment 29•8 years ago
(In reply to Bill McCloskey (:billm) from comment #28)
> I suspect that the "Shutdown too long, probably frozen, causing a crash."
> message is actually being printed by some other process. I would be very
> surprised if MOZ_CRASH somehow wasn't working.
Oh, I misread the log there. This seems to be printed by the parent process with pid 1725 while the child has pid 1726.
> I think we'll need to insert some sort of debugging to see what's happening
> inside ShutdownChildProcess().
Agreed. Who can own this?
Flags: needinfo?(wmccloskey)
Comment 30•8 years ago
I think one way this could be mitigated is by reinstating the way the parent process kills off the child if it takes too long, in debug builds. I got rid of it for leak checking, but not getting a log sometimes is probably better than hanging a machine.
Comment hidden (Intermittent Failures Robot)
> Agreed. Who can own this?
I don't know. I guess I'll forward this to Jim. It's some sort of event loop issue, although probably related to the Chromium event loop.
Flags: needinfo?(wmccloskey) → needinfo?(jmathies)
Comment hidden (Intermittent Failures Robot)
Comment 34•8 years ago
(In reply to Andrew McCreight [:mccr8] from comment #30)
> I think one way this could be mitigated is by reinstating the way the parent
> process kills off the child if it takes too long, in debug builds. I got rid
> of it for leak checking, but not getting a log sometimes is probably better
> than hanging a machine.
^
Attachment #8923495 - Flags: review?(continuation)
Comment 35•8 years ago
Comment on attachment 8923495 [details] [diff] [review]
Patch
Review of attachment 8923495 [details] [diff] [review]:
-----------------------------------------------------------------
I don't understand what this has to do with my comment. It looks like you are turning this warning into a crash? We hit this warning very frequently. I'd guess that the tree would be very orange if you did this.
Attachment #8923495 - Flags: review?(continuation) → review-
Comment 36•8 years ago
I was talking about the code that was touched in this patch:
https://hg.mozilla.org/mozilla-central/rev/260370115e37
Maybe we could change ~GeckoChildProcessHost() to force shutdown in NS_FREE_PERMANENT_DATA builds, but have a more generous timeout. It is two seconds now. Maybe if it was set to a minute or two in NS_FREE_PERMANENT_DATA that would be sufficient.
Comment 37•8 years ago
Comment on attachment 8923495 [details] [diff] [review]
Patch
This was an attempt at reverting your change from bug 1091766, but even so I now realize that we used to call QuickExit() instead of crashing. Marking as obsolete.
Attachment #8923495 - Attachment is obsolete: true
Updated•8 years ago
Flags: needinfo?(jmathies)
Priority: -- → P2
Whiteboard: low-volume
Comment hidden (Intermittent Failures Robot)
Comment hidden (Intermittent Failures Robot)
Comment hidden (Intermittent Failures Robot)
Comment 41•8 years ago
I'm not entirely sure if it's the same thing, but I had something like this happen to me today with a local build that I started via 'mach run', where I opened a JS fiddle. When I quit that Firefox, one content process did not shut down and I kept seeing lines like these on the command line:
###!!! [Child][MessageChannel] Error: (msgtype=0x2C00EF,name=PContent::Msg_AccumulateChildHistograms) Closed channel: cannot send/recv
[Child 17420, Main Thread] WARNING: MsgDropped in ContentChild: file /Users/nohlmeier/src/mozilla-central/dom/ipc/ContentChild.cpp, line 2339
[Child 17420, Main Thread] WARNING: '!ipcActor->SendAccumulateChildHistograms(histogramsToSend)', file /Users/nohlmeier/src/mozilla-central/toolkit/components/telemetry/ipc/TelemetryIPCAccumulator.cpp, line 296
###!!! [Child][MessageChannel] Error: (msgtype=0x2C00F0,name=PContent::Msg_AccumulateChildKeyedHistograms) Closed channel: cannot send/recv
[Child 17420, Main Thread] WARNING: MsgDropped in ContentChild: file /Users/nohlmeier/src/mozilla-central/dom/ipc/ContentChild.cpp, line 2339
[Child 17420, Main Thread] WARNING: '!ipcActor->SendAccumulateChildKeyedHistograms(keyedHistogramsToSend)', file /Users/nohlmeier/src/mozilla-central/toolkit/components/telemetry/ipc/TelemetryIPCAccumulator.cpp, line 300
###!!! [Child][MessageChannel] Error: (msgtype=0x2C00F4,name=PContent::Msg_RecordDiscardedData) Closed channel: cannot send/recv
[Child 17420, Main Thread] WARNING: MsgDropped in ContentChild: file /Users/nohlmeier/src/mozilla-central/dom/ipc/ContentChild.cpp, line 2339
[Child 17420, Main Thread] WARNING: '!ipcActor->SendRecordDiscardedData(discardedData)', file /Users/nohlmeier/src/mozilla-central/toolkit/components/telemetry/ipc/TelemetryIPCAccumulator.cpp, line 315
So I attached lldb to that one hanging content process and got the following backtrace:
(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
* frame #0: 0x00007fff6986be7e libsystem_kernel.dylib`__psynch_cvwait + 10
frame #1: 0x00007fff699a7662 libsystem_pthread.dylib`_pthread_cond_wait + 732
frame #2: 0x0000000109f9f0d7 libmozglue.dylib`mozilla::detail::ConditionVariableImpl::wait(this=0x000000010a3a80d0, lock=0x000000010a3a8050) at ConditionVariable_posix.cpp:118
frame #3: 0x0000000112acdddd XUL`mozilla::CondVar::Wait(this=0x000000010a3a80b0, aInterval=4294967295) at BlockingResourceBase.cpp:604
frame #4: 0x0000000112b05604 XUL`mozilla::ThreadEventQueue<mozilla::PrioritizedEventQueue<mozilla::LabeledEventQueue> >::GetEvent(this=0x000000010a3a8020, aMayWait=true, aPriority=0x00007ffee6111cf0) at ThreadEventQueue.cpp:139
frame #5: 0x0000000112b133ef XUL`nsThread::ProcessNextEvent(this=0x000000010a307190, aMayWait=true, aResult=0x00007ffee61122de) at nsThread.cpp:961
frame #6: 0x0000000112b3426c XUL`NS_ProcessNextEvent(aThread=0x000000010a307190, aMayWait=true) at nsThreadUtils.cpp:508
frame #7: 0x0000000112afea1f XUL`bool mozilla::SpinEventLoopUntil<(mozilla::ProcessFailureBehavior)1, mozilla::SharedThreadPool::SpinUntilEmpty()::$_0>(aPredicate=0x00007ffee6112348, aThread=0x0000000000000000)::$_0&&, nsIThread*) at nsThreadUtils.h:323
frame #8: 0x0000000112afe557 XUL`mozilla::SharedThreadPool::SpinUntilEmpty() at SharedThreadPool.cpp:80
frame #9: 0x0000000112afe4b9 XUL`mozilla::SharedThreadPoolShutdownObserver::Observe(this=0x000000010e12b9c0, aSubject=0x0000000000000000, aTopic="xpcom-shutdown-threads", aData=0x0000000000000000) at SharedThreadPool.cpp:49
frame #10: 0x0000000112a3aa89 XUL`nsObserverList::NotifyObservers(this=0x0000000165fdb0f8, aSubject=0x0000000000000000, aTopic="xpcom-shutdown-threads", someData=0x0000000000000000) at nsObserverList.cpp:112
frame #11: 0x0000000112a3cc4a XUL`nsObserverService::NotifyObservers(this=0x000000010cea89c0, aSubject=0x0000000000000000, aTopic="xpcom-shutdown-threads", aSomeData=0x0000000000000000) at nsObserverService.cpp:296
frame #12: 0x0000000112b6750c XUL`mozilla::ShutdownXPCOM(aServMgr=0x0000000000000000) at XPCOMInit.cpp:880
frame #13: 0x0000000112b67285 XUL`::NS_ShutdownXPCOM(aServMgr=0x0000000000000000) at XPCOMInit.cpp:810
frame #14: 0x000000011b06ae32 XUL`XRE_TermEmbedding() at nsEmbedFunctions.cpp:224
frame #15: 0x000000011360e8d2 XUL`mozilla::ipc::ScopedXREEmbed::Stop(this=0x000000010a3738e0) at ScopedXREEmbed.cpp:108
frame #16: 0x00000001175be43c XUL`mozilla::dom::ContentProcess::CleanUp(this=0x000000010a373000) at ContentProcess.cpp:267
frame #17: 0x000000011b06c068 XUL`XRE_InitChildProcess(aArgc=16, aArgv=0x00007ffee6112c00, aChildData=0x00007ffee6112b88) at nsEmbedFunctions.cpp:695
frame #18: 0x000000011b079717 XUL`mozilla::BootstrapImpl::XRE_InitChildProcess(this=0x000000010a32a190, argc=19, argv=0x00007ffee6112c00, aChildData=0x00007ffee6112b88) at Bootstrap.cpp:69
frame #19: 0x0000000109aed88c plugin-container`content_process_main(bootstrap=0x000000010a32a190, argc=19, argv=0x00007ffee6112c00) at plugin-container.cpp:63
frame #20: 0x0000000109aed945 plugin-container`main(argc=20, argv=0x00007ffee6112c00) at MozillaRuntimeMain.cpp:25
frame #21: 0x0000000109aed804 plugin-container`start + 52
Comment hidden (Intermittent Failures Robot)
Updated•7 years ago
Component: Telemetry → IPC
Priority: P2 → --
Product: Toolkit → Core
Updated•7 years ago
Priority: -- → P3
Comment hidden (Intermittent Failures Robot)
Assignee
Comment 44•7 years ago
I have encountered a similar issue that is highly reproducible on try (with my patches for bug 1299515 and a couple more).
Here's a run with some extra diagnostic logging of failed telemetry histograms and shutdown phases:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=69c99229d4f42355c625890d06378f33e70cac8b&selectedJob=157441247
I also reproduced this on a loaner and got similar stacks to drno's in comment 41, https://pastebin.mozilla.org/9076289
From the shutdown phase logs above, it's clear the child is stuck in XPCOM_SHUTDOWN_THREADS, and from the stacks it's clear that it's blocked on emptying a SharedThreadPool. Now, which SharedThreadPool that could be...
FWIW for the failed telemetry histogram I get this:
> [task 2018-01-19T17:15:40.230Z] 17:15:40 INFO - GECKO(2695) | Failed to send keyed histogram 1474, key:TelemetryIPCAccumulator::IPCTimerFired
But if we ignore those for a minute, there are also a couple of these:
> [task 2018-01-19T17:29:51.209Z] 17:29:51 INFO - GECKO(2695) | Failed to send keyed histogram 1474, key:gfxFontCache::gfxFontCache
Could be a lead?
Assignee
Comment 45•7 years ago
I should also say that there's another lead from the threads listed in the pastebin in comment 44, https://pastebin.mozilla.org/9076289, namely:
> (gdb) info threads
> Id Target Id Frame
> * 1 Thread 0x7fbff04c8740 (LWP 14846) "Web Content" pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
> 2 Thread 0x7fbfdc524700 (LWP 14869) "Chrome_~dThread" syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
> 3 Thread 0x7fbfd74ff700 (LWP 14870) "JS Watchdog" pthread_cond_timedwait@@GLIBC_2.3.2 ()
> at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:225
> 4 Thread 0x7fbfd6cfe700 (LWP 14871) "JS Helper" pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
> 5 Thread 0x7fbfd6afd700 (LWP 14872) "JS Helper" pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
> 6 Thread 0x7fbfd68fc700 (LWP 14873) "JS Helper" pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
> 7 Thread 0x7fbfd66fb700 (LWP 14874) "JS Helper" pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
> 8 Thread 0x7fbfd64fa700 (LWP 14875) "JS Helper" pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
> 9 Thread 0x7fbfd62f9700 (LWP 14876) "JS Helper" pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
> 10 Thread 0x7fbfd60f8700 (LWP 14877) "Socket Thread" pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
> 11 Thread 0x7fbfd51ff700 (LWP 14879) "Hang Monitor" pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
> 12 Thread 0x7fbfd49fe700 (LWP 14880) "Timer" pthread_cond_timedwait@@GLIBC_2.3.2 ()
> at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:225
> 13 Thread 0x7fbfd1b42700 (LWP 14890) "StyleThread#0" pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
> 14 Thread 0x7fbfcf18f700 (LWP 14891) "StyleThread#1" pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
> 15 Thread 0x7fbfcf14e700 (LWP 14892) "StyleThread#2" pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
> 16 Thread 0x7fbfce9ff700 (LWP 14893) "StyleThread#3" pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
> --- The below three belong to an instance of webrtc::Call
> 17 Thread 0x7fbfc91f8700 (LWP 15022) "call_worker_que" syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
> 18 Thread 0x7fbfc90f7700 (LWP 15023) "ModuleProcessTh" pthread_cond_timedwait@@GLIBC_2.3.2 ()
> at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:225
> 19 Thread 0x7fbfc8ff6700 (LWP 15024) "PacerThread" pthread_cond_timedwait@@GLIBC_2.3.2 ()
> at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:225
> ---
> --- The below three belong to another instance of webrtc::Call
> 20 Thread 0x7fbfc89fd700 (LWP 15027) "call_worker_que" syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
> 21 Thread 0x7fbfc86ff700 (LWP 15028) "ModuleProcessTh" pthread_cond_timedwait@@GLIBC_2.3.2 ()
> at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:225
> 22 Thread 0x7fbfc8ef5700 (LWP 15029) "PacerThread" pthread_cond_timedwait@@GLIBC_2.3.2 ()
> at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:225
> ---
"call_worker_queue": https://searchfox.org/mozilla-central/rev/2031c0f517185b2bd0e8f6f92f9491b3410c1f7f/media/webrtc/trunk/webrtc/call/call.cc#306
"ModuleProcessThread": https://searchfox.org/mozilla-central/rev/2031c0f517185b2bd0e8f6f92f9491b3410c1f7f/media/webrtc/trunk/webrtc/call/call.cc#279
"PacerThread": https://searchfox.org/mozilla-central/rev/2031c0f517185b2bd0e8f6f92f9491b3410c1f7f/media/webrtc/trunk/webrtc/call/call.cc#280
There are clearly two instances of webrtc::Call still alive.
Comment 46•7 years ago
(In reply to Andreas Pehrson [:pehrsons] from comment #45)
> There are clearly two instances of webrtc::Call still alive.
Since all of the WebRTC tests instantiate two PeerConnections, and each PeerConnection (with audio and/or video) has one webrtc::Call object, I don't think this is unexpected, unless Call objects in general should have been torn down at this point already. Having two during test execution is not suspicious in itself.
Assignee
Comment 47•7 years ago
The only SharedThreadPool in gfx/ is [1], but it seems to only be used on Mac.
For WebRTC there's one at [2], but it's only been around since July (bug 1355048).
It's stored on WebrtcVideoConduit [3], and deleted as a last resort in the WebrtcVideoConduit dtor [4], [5].
There's a comment in MSG about making sure SharedThreadPools are released in time, but it doesn't say much about the consequences, or whether it applies to _all_ instances. [6]
[1] https://searchfox.org/mozilla-central/rev/2031c0f517185b2bd0e8f6f92f9491b3410c1f7f/gfx/layers/PaintThread.cpp#173
[2] https://searchfox.org/mozilla-central/rev/2031c0f517185b2bd0e8f6f92f9491b3410c1f7f/media/webrtc/signaling/src/media-conduit/WebrtcMediaDataDecoderCodec.h#66-67
[3] https://searchfox.org/mozilla-central/rev/2031c0f517185b2bd0e8f6f92f9491b3410c1f7f/media/webrtc/signaling/src/media-conduit/VideoConduit.cpp#505
[4] https://searchfox.org/mozilla-central/rev/2031c0f517185b2bd0e8f6f92f9491b3410c1f7f/media/webrtc/signaling/src/media-conduit/VideoConduit.cpp#469
[5] https://searchfox.org/mozilla-central/rev/2031c0f517185b2bd0e8f6f92f9491b3410c1f7f/media/webrtc/signaling/src/media-conduit/VideoConduit.cpp#320
[6] https://searchfox.org/mozilla-central/rev/2031c0f517185b2bd0e8f6f92f9491b3410c1f7f/dom/media/MediaStreamGraph.cpp#1443-1449
Assignee
Comment 48•7 years ago
I have noticed that in the working case, we kill these last PeerConnectionImpls during nsCycleCollector::Shutdown, which is after XPCOM_SHUTDOWN_THREADS: https://searchfox.org/mozilla-central/rev/e9a067a401e41017766f4ab90bc3906603273d51/xpcom/build/XPCOMInit.cpp#969
That's late, and it's unclear why it works most of the time.
Assignee
Comment 49•7 years ago
(In reply to Andreas Pehrson [:pehrsons] from comment #48)
> I have noticed that in the working case, we kill these last
> PeerConnectionImpls during nsCycleCollector::Shutdown, which is after
> XPCOM_SHUTDOWN_THREADS:
> https://searchfox.org/mozilla-central/rev/
> e9a067a401e41017766f4ab90bc3906603273d51/xpcom/build/XPCOMInit.cpp#969
>
> That's late, but unclear why it works most of the time.
I got to repro this locally fairly easily after putting in some debug logging. But not under rr.
Hopefully this holds.
So far it seems like when we hang, PeerConnectionImpl::Close was never called.
That makes sense with what we've seen on try, since Close leads to CloseInt which leads to ShutdownMedia which destroys PeerConnectionMedia which holds the ref to webrtc::Call whose threads we know were alive during the shutdown hang.
Whether this is part of the reason for the shutdown hang is another question, but after calling PCImpl::Close on XPCOM_SHUTDOWN I cannot repro anymore. That could also be timing related, of course.
I'll put this on try as well, and if all is jolly I'll harden it a bit, probably with an async shutdown blocker, and wait for PeerConnectionMedia to be fully shut down (it bounces to the STS thread and back to main).
Assignee
Updated•7 years ago
Assignee
Comment 50•7 years ago
Comment 49 didn't help. However, I found that the one SharedThreadPool we're hanging on when I repro locally is "MediaTimer".
Assignee
Comment 51•7 years ago
I think this is from a new use of MediaTimer that I added in bug 1299515, and a SourceListener that doesn't get destroyed properly in MediaManager.
I have evidence that a SourceListener does not get destroyed, and that would hold a ref to the MediaTimer. I'm just not sure what holds a ref to the SourceListener.
If this holds true, I'm looking at a self-inflicted issue from bug 1299515, and the root cause for this bug remains.
Assignee
Comment 52•7 years ago
I think I understand what happens here now, and it follows on comment 51.
A (dom-object) MediaStreamTrack holds a strong ref to its LocalTrackSource which holds a strong ref to a SourceListener which holds a strong ref to a MediaTimer.
I have invariants for the last strong ref that I don't want to break by resetting that one, so I'm solving it with a WeakPtr in the LocalTrackSource (it's main thread only).
This then makes my issue irrelevant to this issue (I'll solve mine in bug 1299515 where I added the MediaTimer), though the symptoms are the same. If drno's stack in comment 41 is correct for this bug, this is indeed also due to a dangling reference to a SharedThreadPool somewhere. I'm adding code to log what SharedThreadPool that is, so that when this happens in the future we can better understand where to look.
https://treeherder.mozilla.org/#/jobs?repo=try&revision=d57d4a07fe897993e4da2a36d6068ee02d16ded8
Assignee
Updated•7 years ago
Keywords: leave-open
Comment hidden (mozreview-request)
Comment 54•7 years ago
mozreview-review
Comment on attachment 8947082 [details]
Bug 1351655 - Log which SharedThreadPools are not yet destroyed on shutdown-threads.
https://reviewboard.mozilla.org/r/216882/#review222694
Static analysis found 1 defect in this patch.
- 1 defect found by clang-tidy
You can run this analysis locally with:
- `./mach static-analysis check path/to/file.cpp` (C/C++)
If you see a problem in this automated review, please report it here: http://bit.ly/2y9N9Vx
::: xpcom/threads/SharedThreadPool.cpp:55
(Diff revision 1)
> + {
> + ReentrantMonitorAutoEnter mon(*sMonitor);
> + if (!sPools->Iter().Done()) {
> + nsAutoCString str;
> + for (auto i = sPools->Iter(); !i.Done(); i.Next()) {
> + str.AppendPrintf("\"%s\" ", nsAutoCString(i.Key()).get());
Warning: Escaped string literal can be written as a raw string literal [clang-tidy: modernize-raw-string-literal]
str.AppendPrintf("\"%s\" ", nsAutoCString(i.Key()).get());
^
R"("%s" )"
Comment 55•7 years ago
mozreview-review
Comment on attachment 8947082 [details]
Bug 1351655 - Log which SharedThreadPools are not yet destroyed on shutdown-threads.
https://reviewboard.mozilla.org/r/216882/#review222704
I'm not sure this will produce the sort of information you want. See below, and please do correct me if I am misunderstanding the situation--which I probably am!
::: commit-message-9a3b6:10
(Diff revision 1)
> +This patch attempts to help diagnose future cases like this by printing which
> +SharedThreadPools we are going to wait for to stderr.
If I understand comment 52 correctly, we think that somebody is holding a reference to a `SharedThreadPool`, and whoever is holding that reference is also involved in a reference cycle. So we can't drop the `SharedThreadPool` reference, and we can't drop the object holding the reference. Is that a correct summary of the situation?
I'm not clear on how this is going to narrow down the particular `SharedThreadPool` that's being kept alive. Isn't it possible (indeed, likely?) that when we observe `xpcom-shutdown-threads` here, that several thread pools may still be alive, regardless of whether we're in a situation that leaks or not? Therefore, we're just going to print out the same set of thread pools all the time.
If the above situation is correct, it seems what we really want is some sort of print while we're spinning the event loop telling us what set of `SharedThreadPool`s are still alive. Printing out the set each trip through the event loop could print quite a lot of logging, though, and printing every `N` trips through could miss informing us about the death of a pool before the event loop blocks completely. Maybe it'd be better than nothing?
But why not use the usual refcount logging machinery on `SharedThreadPool` to see who's taking references where and then figuring out who's taking a reference that's not released? I guess the problem with that would be getting the logs out of try/treeherder and whether we would actually dump logs in a timeout situation. Even so, that might be more workable than this?
Attachment #8947082 - Flags: review?(nfroyd)
Comment hidden (Intermittent Failures Robot)
Assignee
Comment 57•7 years ago
mozreview-review-reply
Comment on attachment 8947082 [details]
Bug 1351655 - Log which SharedThreadPools are not yet destroyed on shutdown-threads.
https://reviewboard.mozilla.org/r/216882/#review222704
> If I understand comment 52 correctly, we think that somebody is holding a reference to a `SharedThreadPool`, and whoever is holding that reference is also involved in a reference cycle. So we can't drop the `SharedThreadPool` reference, and we can't drop the object holding the reference. Is that a correct summary of the situation?
>
> I'm not clear on how this is going to narrow down the particular `SharedThreadPool` that's being kept alive. Isn't it possible (indeed, likely?) that when we observe `xpcom-shutdown-threads` here, that several thread pools may still be alive, regardless of whether we're in a situation that leaks or not? Therefore, we're just going to print out the same set of thread pools all the time.
>
> If the above situation is correct, it seems what we really want is some sort of print while we're spinning the event loop telling us what set of `SharedThreadPool`s are still alive. Printing out the set each trip through the event loop could print quite a lot of logging, though, and printing every `N` trips through could miss informing us about the death of a pool before the event loop blocks completely. Maybe it'd be better than nothing?
>
> But why not use the usual refcount logging machinery on `SharedThreadPool` to see who's taking references where and then figuring out who's taking a reference that's not released? I guess the problem with that would be getting the logs out of try/treeherder and whether we would actually dump logs in a timeout situation. Even so, that might be more workable than this?
Summary looks good.
It's anecdotal of course, but when I was able to reproduce this locally there was only the leaking thread pool ever printed. It seemed like all other thread pool references had been dropped in xpcom-will-shutdown or xpcom-shutdown.
I'm not sure the refcount machinery helps as we hang and are completely unresponsive, even to the HangMonitor. What makes the test machines get off this hang is a 90 minute timeout that force kills all processes. Does that retain the refcount logs anywhere? Also on opt builds?
Some thinking out loud,
Could we print the threadpools that still exist when we are about to wait for the event loop and it's empty? Perhaps we could do this in the predicate function we pass to SpinEventLoopUntil()? Any events to remove references to the threadpool should in all fairness have been dispatched during earlier shutdown phases, no?
I'm not really well-versed in these pieces of code so any guidance is useful.
Updated•7 years ago
Flags: needinfo?(nfroyd)
Comment 58•7 years ago
(In reply to Andreas Pehrson [:pehrsons] from comment #57)
> It's anecdotal of course, but when I was able to reproduce this locally
> there was only the leaking thread pool ever printed. It seemed like all
> other thread pool references had been dropped in xpcom-will-shutdown or
> xpcom-shutdown.
If this can be reproduced locally, are we adding the logging because we expect the results to be markedly different in automation vs. locally? Debugging it locally is probably kind of a pain, I suppose, since it looks like the hang is quite infrequent.
> I'm not sure the refcount machinery helps as we hang and are completely
> unresponsive, even to the HangMonitor. What makes the test machines get off
> this hang is a 90 minute timeout that force kills all processes. Does that
> retain the refcount logs anywhere? Also on opt builds?
No, I don't think the refcount logs are retained anywhere on hangs, nor do they exist for opt builds. Then again, it looks like the majority of hangs (did) happen on debug builds, with a few happening on PGO (?!), so I don't know that not existing for opt would be a huge problem.
> Some thinking out loud,
> Could we print the threadpools that still exist when we are about to wait
> for the event loop and it's empty? Perhaps we could do this in the predicate
> function we pass to SpinEventLoopUntil()?
I'm confused about the "wait for the event loop and it's empty" bit here: does the "it's" refer to the SharedThreadPool instance or the event loop? If the former, then the SharedThreadPool hashtable wouldn't have any thread pools in it, and so we wouldn't print anything, right? If the latter, I don't think we have access to that information in SpinEventLoopUntil()? Or am I totally misunderstanding the suggestion here?
Flags: needinfo?(nfroyd) → needinfo?(apehrson)
Assignee
Comment 59•7 years ago
(In reply to Nathan Froyd [:froydnj] from comment #58)
> (In reply to Andreas Pehrson [:pehrsons] from comment #57)
> > It's anecdotal of course, but when I was able to reproduce this locally
> > there was only the leaking thread pool ever printed. It seemed like all
> > other thread pool references had been dropped in xpcom-will-shutdown or
> > xpcom-shutdown.
>
> If this can be reproduced locally, are we adding the logging because we
> expect the results to be markedly different in automation vs. locally?
> Debugging it locally is probably kind of a pain, I suppose, since it looks
> like the hang is quite infrequent.
I want to add it because the ones we've seen in the wild on try are very rare intermittents, and no other clues in the logs point in the right direction.
The one I could repro was actually unrelated to this bug and something new I had put in. It was on its own hard enough to debug, but if we'd like to nail an intermittent issue like this one we need some help, like logs pointing to the SharedThreadPool instance that hasn't been released yet.
> > Some thinking out loud,
> > Could we print the threadpools that still exist when we are about to wait
> > for the event loop and it's empty? Perhaps we could do this in the predicate
> > function we pass to SpinEventLoopUntil()?
>
> I'm confused about the "wait for the event loop and it's empty" bit here:
> does the "it's" refer to the SharedThreadPool instance or the event loop?
> If the former, then the SharedThreadPool hashtable wouldn't have any thread
> pools in it, and so we wouldn't print anything, right? If the latter, I
> don't think we have access to that information in SpinEventLoopUntil()? Or
> am I totally misunderstanding the suggestion here?
I was thinking of the main thread event loop that we're running on, as the predicate passed to SpinEventLoopUntil() is called on every iteration [1].
If these are not viable, can you think of other approaches to debugging this?
[1] https://searchfox.org/mozilla-central/rev/9d47b5fb14152865ad1375eb3ee2e571d81ecdb9/xpcom/threads/SharedThreadPool.cpp#81
Flags: needinfo?(apehrson) → needinfo?(nfroyd)
Comment 60•7 years ago
(In reply to Andreas Pehrson [:pehrsons] from comment #59)
> (In reply to Nathan Froyd [:froydnj] from comment #58)
> > (In reply to Andreas Pehrson [:pehrsons] from comment #57)
> > > It's anecdotal of course, but when I was able to reproduce this locally
> > > there was only the leaking thread pool ever printed. It seemed like all
> > > other thread pool references had been dropped in xpcom-will-shutdown or
> > > xpcom-shutdown.
> >
> > If this can be reproduced locally, are we adding the logging because we
> > expect the results to be markedly different in automation vs. locally?
> > Debugging it locally is probably kind of a pain, I suppose, since it looks
> > like the hang is quite infrequent.
>
> I want to add it because the ones we've seen in the wild on try are very
> rare intermittents, and no other clues in the logs point in the right
> direction.
>
> The one I could repro was actually unrelated to this bug and something new I
> had put in. It was on its own hard enough to debug, but if we'd like to nail
> an intermittent issue like this one we need some help, like logs pointing to
> the SharedThreadPool instance that hasn't been released yet.
OK, that makes sense, thank you for the clarification.
> > > Some thinking out loud,
> > > Could we print the threadpools that still exist when we are about to wait
> > > for the event loop and it's empty? Perhaps we could do this in the predicate
> > > function we pass to SpinEventLoopUntil()?
> >
> > I'm confused about the "wait for the event loop and it's empty" bit here:
> > does the "it's" refer to the SharedThreadPool instance or the event loop?
> > If the former, then the SharedThreadPool hashtable wouldn't have any thread
> > pools in it, and so we wouldn't print anything, right? If the latter, I
> > don't think we have access to that information in SpinEventLoopUntil()? Or
> > am I totally misunderstanding the suggestion here?
>
> I was thinking of the main thread event loop that we're running on, as the
> predicate passed to SpinEventLoopUntil() is called on every iteration [1].
Sure, but I am still confused what you want to check here. You want to check the emptiness of the thread's event queue?
Anyway, sorry for the digressions. We have a patch, it will provide some diagnostic info, we might as well land it and see what happens.
> If these are not viable, can you think of other approaches to debugging this?
Flags: needinfo?(nfroyd)
Comment 61•7 years ago
mozreview-review
Comment on attachment 8947082 [details]
Bug 1351655 - Log which SharedThreadPools are not yet destroyed on shutdown-threads.
https://reviewboard.mozilla.org/r/216882/#review225678
Attachment #8947082 - Flags: review+
Comment hidden (Intermittent Failures Robot)
Comment 64•7 years ago
Pushed by pehrsons@gmail.com:
https://hg.mozilla.org/integration/autoland/rev/1776ec7c6471
Log which SharedThreadPools are not yet destroyed on shutdown-threads. r=froydnj
Assignee
Comment 65•7 years ago
(In reply to Nathan Froyd [:froydnj] from comment #60)
(...)
> > > > Some thinking out loud,
> > > > Could we print the threadpools that still exist when we are about to wait
> > > > for the event loop and it's empty? Perhaps we could do this in the predicate
> > > > function we pass to SpinEventLoopUntil()?
> > >
> > > I'm confused about the "wait for the event loop and it's empty" bit here:
> > > does the "it's" refer to the SharedThreadPool instance or the event loop?
> > > If the former, then the SharedThreadPool hashtable wouldn't have any thread
> > > pools in it, and so we wouldn't print anything, right? If the latter, I
> > > don't think we have access to that information in SpinEventLoopUntil()? Or
> > > am I totally misunderstanding the suggestion here?
> >
> > I was thinking of the main thread event loop that we're running on, as the
> > predicate passed to SpinEventLoopUntil() is called on every iteration [1].
>
> Sure, but I am still confused what you want to check here. You want to
> check the emptiness of the thread's event queue?
To only print this when the task queue is empty, yes, since the thread pool is expected to be shut down but the runnable to actually shut it down is never posted.
> Anyway, sorry for the digressions. We have a patch, it will provide some
> diagnostic info, we might as well land it and see what happens.
Thanks!
Comment 66•7 years ago
bugherder
Comment 67•7 years ago
The leave-open keyword is there and there is no activity for 6 months.
:pehrsons, maybe it's time to close this bug?
Flags: needinfo?(apehrson)
Assignee
Comment 68•7 years ago
Oddly enough this hasn't re-occurred. Well, if it ever does, we now have some logging to pinpoint which threadpool hasn't been shut down.
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Flags: needinfo?(apehrson)
Resolution: --- → FIXED
Updated•7 years ago
Keywords: leave-open