Closed Bug 1351655 Opened 8 years ago Closed 7 years ago

Intermittent command timed out: 7200 seconds elapsed running ['/tools/buildbot/bin/python', '-u', 'scripts/scripts/desktop_unittest.py', '--mochitest-suite', 'mochitest-media', after many WARNING: MsgDropped in ContentChild

Categories

(Core :: WebRTC, defect, P3)

Tracking

RESOLVED FIXED

People

(Reporter: aryx, Assigned: pehrsons)

References

(Depends on 1 open bug)

Details

(Keywords: intermittent-failure, Whiteboard: low-volume)

Attachments

(1 file, 1 obsolete file)

https://treeherder.mozilla.org/logviewer.html#?job_id=87233213&repo=mozilla-inbound

03:37:34 INFO - GECKO(1784) | [Main Thread]: I/signaling [main|PeerConnectionImpl] PeerConnectionImpl.cpp:331: PeerConnectionImpl: PeerConnectionImpl constructor for
03:37:34 INFO - GECKO(1784) | (remote_bitrate_estimator_single_stream.cc:59): RemoteBitrateEstimatorSingleStream: Instantiating.
03:37:34 INFO - GECKO(1784) | [Socket Thread]: D/mtransport NrIceCtx static call to find local stun addresses
03:37:34 INFO - GECKO(1784) | [Main Thread]: I/signaling [main|PeerConnectionImpl] PeerConnectionImpl.cpp:331: PeerConnectionImpl: PeerConnectionImpl constructor for
03:37:34 INFO - GECKO(1784) | (remote_bitrate_estimator_single_stream.cc:59): RemoteBitrateEstimatorSingleStream: Instantiating.
03:37:34 INFO - GECKO(1784) | [Socket Thread]: D/mtransport NrIceCtx static call to find local stun addresses
03:37:34 INFO - GECKO(1784) | MEMORY STAT | vsize 4271MB | residentFast 304MB | heapAllocated 29MB
03:37:34 INFO - GECKO(1784) | [Main Thread]: I/signaling [main|PeerConnectionMedia] PeerConnectionMedia.cpp:209: OnProxyAvailable: Proxy Available: 0
03:37:34 INFO - GECKO(1784) | [Main Thread]: I/signaling [main|PeerConnectionMedia] PeerConnectionMedia.cpp:267: OnStunAddrsAvailable: receiving (5) stun addrs
03:37:34 INFO - GECKO(1784) | [Main Thread]: I/signaling [main|PeerConnectionMedia] PeerConnectionMedia.cpp:209: OnProxyAvailable: Proxy Available: 0
03:37:34 INFO - GECKO(1784) | [Main Thread]: I/signaling [main|PeerConnectionMedia] PeerConnectionMedia.cpp:267: OnStunAddrsAvailable: receiving (5) stun addrs
03:37:34 INFO - TEST-OK | dom/media/tests/mochitest/test_selftest.html | took 1145ms
03:37:34 INFO - GECKO(1784) | ++DOMWINDOW == 10 (0x11b2b2800) [pid = 1785] [serial = 527] [outer = 0x126914000]
03:37:34 INFO - GECKO(1784) | [Main Thread]: I/signaling [main|PeerConnectionImpl] PeerConnectionImpl.cpp:2942: CloseInt: Closing PeerConnectionImpl 2dccd38ebaa39a02; ending call
03:37:34 INFO - GECKO(1784) | [Main Thread]: I/jsep [1490783854859219 (id=2147484172 url=http://mochi.test:8888/tests/dom/media/tests/mochitest/test_selftest.html)]: stable -> closed
03:37:34 INFO - GECKO(1784) | [Main Thread]: I/signaling [main|PeerConnectionImpl] PeerConnectionImpl.cpp:2942: CloseInt: Closing PeerConnectionImpl dc1b7e7ad8661526; ending call
03:37:34 INFO - GECKO(1784) | [Main Thread]: I/jsep [1490783854866958 (id=2147484172 url=http://mochi.test:8888/tests/dom/media/tests/mochitest/test_selftest.html)]: stable -> closed
03:37:34 INFO - TEST-START | Shutdown
03:37:34 INFO - Passed: 35185
03:37:34 INFO - Failed: 0
03:37:34 INFO - Todo: 1398
03:37:34 INFO - Mode: e10s
03:37:34 INFO - Slowest: 12172ms - /tests/dom/media/tests/mochitest/test_ondevicechange.html
03:37:34 INFO - SimpleTest FINISHED
03:37:34 INFO - TEST-INFO | Ran 1 Loops
03:37:34 INFO - SimpleTest FINISHED
03:37:34 INFO - GECKO(1784) | --DOCSHELL 0x11b2a3800 == 2 [pid = 1785] [id = {284210b4-7253-414a-aa5d-9c223a738ce8}]
03:37:34 INFO - GECKO(1784) | ++DOMWINDOW == 11 (0x11b409000) [pid = 1785] [serial = 528] [outer = 0x126914000]
03:37:35 INFO - GECKO(1784) | 1490783855014 addons.xpi WARN Exception running bootstrap method shutdown on shield-recipe-client@mozilla.org: TypeError: log is null (resource://gre/modules/addons/XPIProvider.jsm -> jar:file:///builds/slave/test/build/application/NightlyDebug.app/Contents/Resources/browser/features/shield-recipe-client@mozilla.org.xpi!/bootstrap.js:96:5) JS Stack trace:
this.shutdown@resource://gre/modules/addons/XPIProvider.jsm -> jar:file:///builds/slave/test/build/application/NightlyDebug.app/Contents/Resources/browser/features/shield-recipe-client@bootstrap.js:96:5 < callBootstrapMethod@XPIProvider.jsm:4957:11 < observe@XPIProvider.jsm:2924:15 < SpecialPowersObserver.prototype.receiveMessage@SpecialPowersObserver.jsm:244:7 < permitUnload@remote-browser.xml:357:13 < CanCloseWindow@browser.js:6376:36 < canClose@browser.js:5227:12 < SpecialPowersObserver.prototype.receiveMessage@SpecialPowersObserver.jsm:244:7
03:37:35 INFO - GECKO(1784) | --DOCSHELL 0x123579000 == 3 [pid = 1784] [id = {56ad6d42-ee07-1641-9eaa-1d9b695a7587}]
03:37:35 INFO - GECKO(1784) | [Parent 1784] WARNING: nsAppShell::Exit() called redundantly: file /home/worker/workspace/build/src/widget/cocoa/nsAppShell.mm, line 691
03:37:35 INFO - GECKO(1784) | --DOCSHELL 0x11e45f800 == 2 [pid = 1784] [id = {0764a66b-de5c-6e47-a707-13407cd46389}]
03:37:35 INFO - GECKO(1784) | --DOCSHELL 0x123580000 == 1 [pid = 1784] [id = {9c0aa61b-dd01-934d-9277-b8781f5e16a1}]
03:37:35 INFO - GECKO(1784) | ###!!! [Child][MessageChannel] Error: (msgtype=0xD60007,name=PTCPSocket::Msg_Close) Closed channel: cannot send/recv
03:37:35 INFO - GECKO(1784) | [Child 1785] WARNING: MsgDropped in ContentChild: file /home/worker/workspace/build/src/dom/ipc/ContentChild.cpp, line 2037
(the PTCPSocket::Msg_Close error and MsgDropped warning pair repeats several more times)
Maybe this has something to do with the IPC for discovering local addresses?
Rank: 35
Component: Audio/Video → WebRTC: Networking
Priority: -- → P3
backlog: --- → tech-debt
This is particularly ugly because it ties up our extremely capacity-constrained and backlog-prone OSX test machines for 2 hours doing basically nothing. Any way we can bump the priority here?
Flags: needinfo?(mreavy)
To me it looks like the media mochitests do actually finish. But then we first get the errors from closing the socket, followed by an endless stream of retries to submit Telemetry histograms. And somewhere in between, the attempt to kill Firefox via a MOZ_CRASH because of a shutdown timeout also fails. @mjf: could the Telemetry part be related to the ICE stats you landed some time ago?

18:14:12 INFO - TEST-OK | dom/media/tests/mochitest/test_selftest.html | took 1179ms
18:14:12 INFO - GECKO(2099) | ++DOMWINDOW == 10 (0x124f96000) [pid = 2100] [serial = 527] [outer = 0x1273d7000]
18:14:12 INFO - GECKO(2099) | [Main Thread]: I/signaling [main|PeerConnectionImpl] PeerConnectionImpl.cpp:2975: CloseInt: Closing PeerConnectionImpl 2b3a8bc7bd356aba; ending call
18:14:12 INFO - GECKO(2099) | [Main Thread]: I/jsep [1493428452254024 (id=2147484172 url=http://mochi.test:8888/tests/dom/media/tests/mochitest/test_selftest.html)]: stable -> closed
18:14:12 INFO - GECKO(2099) | [Main Thread]: I/signaling [main|PeerConnectionImpl] PeerConnectionImpl.cpp:2975: CloseInt: Closing PeerConnectionImpl f771ae1c84cd0d65; ending call
18:14:12 INFO - GECKO(2099) | [Main Thread]: I/jsep [1493428452261770 (id=2147484172 url=http://mochi.test:8888/tests/dom/media/tests/mochitest/test_selftest.html)]: stable -> closed
18:14:12 INFO - TEST-START | Shutdown
18:14:12 INFO - Passed: 34912
18:14:12 INFO - Failed: 0
18:14:12 INFO - Todo: 1420
18:14:12 INFO - Mode: e10s
18:14:12 INFO - Slowest: 14763ms - /tests/dom/media/tests/mochitest/test_peerConnection_stats.html
18:14:12 INFO - SimpleTest FINISHED
18:14:12 INFO - TEST-INFO | Ran 1 Loops
18:14:12 INFO - SimpleTest FINISHED
18:14:12 INFO - GECKO(2099) | --DOCSHELL 0x128eb8000 == 2 [pid = 2100] [id = {1abf6fe2-c18a-5141-8bf4-1b52dccac79a}]
18:14:12 INFO - GECKO(2099) | ++DOMWINDOW == 11 (0x126f7f000) [pid = 2100] [serial = 528] [outer = 0x1273d7000]
18:14:12 INFO - GECKO(2099) | --DOCSHELL 0x12670e000 == 3 [pid = 2099] [id = {b6de75eb-44e5-d246-b0ee-81f8cd82b938}]
18:14:12 INFO - GECKO(2099) | --DOCSHELL 0x121025800 == 2 [pid = 2099] [id = {1c0f92a7-3374-3b49-922c-b12f3b50495b}]
18:14:13 INFO - GECKO(2099) | --DOCSHELL 0x126710000 == 1 [pid = 2099] [id = {1e95adde-fd65-374f-82f4-7f10ff78a173}]
18:14:13 INFO - GECKO(2099) | --DOCSHELL 0x122040800 == 0 [pid = 2099] [id = {ab0f35ea-78e5-3748-925e-616730cd66b4}]
18:14:13 INFO - GECKO(2099) | ###!!! [Child][MessageChannel] Error: (msgtype=0xDA0007,name=PTCPSocket::Msg_Close) Closed channel: cannot send/recv
18:14:13 INFO - GECKO(2099) | [Child 2100] WARNING: MsgDropped in ContentChild: file /home/worker/workspace/build/src/dom/ipc/ContentChild.cpp, line 2099
(the PTCPSocket::Msg_Close error and MsgDropped warning pair repeats several more times)
18:14:13 INFO - GECKO(2099) | [Main Thread]: D/DataChannel Shutting down SCTP
18:14:13 INFO - GECKO(2099) | ###!!! [Parent][RunMessage] Error: Channel closing: too late to send/recv, messages will be lost
18:14:13 INFO - GECKO(2099) | [Child 2100] WARNING: nsAppShell::Exit() called redundantly: file /home/worker/workspace/build/src/widget/cocoa/nsAppShell.mm, line 691
18:14:13 INFO - GECKO(2099) | [Child 2100] WARNING: NS_ENSURE_TRUE(maybeContext) failed: file /home/worker/workspace/build/src/xpcom/threads/nsThread.cpp, line 1015
18:14:14 INFO - GECKO(2099) | 1493428454161 Marionette INFO Ceased listening
18:14:14 INFO - GECKO(2099) | [Parent 2099] WARNING: nsAppShell::Exit() called redundantly: file /home/worker/workspace/build/src/widget/cocoa/nsAppShell.mm, line 691
18:14:14 INFO - GECKO(2099) | [Parent 2099] WARNING: NS_ENSURE_TRUE(maybeContext) failed: file /home/worker/workspace/build/src/xpcom/threads/nsThread.cpp, line 1015
18:14:14 INFO - GECKO(2099) | ###!!! [Child][MessageChannel] Error: (msgtype=0x4800FB,name=PContent::Msg_AccumulateChildHistograms) Closed channel: cannot send/recv
18:14:14 INFO - GECKO(2099) | [Child 2100] WARNING: MsgDropped in ContentChild: file /home/worker/workspace/build/src/dom/ipc/ContentChild.cpp, line 2099
18:14:14 INFO - GECKO(2099) | [Child 2100] WARNING: '!ipcActor->SendAccumulateChildHistograms(accumulationsToSend)', file /home/worker/workspace/build/src/toolkit/components/telemetry/ipc/TelemetryIPCAccumulator.cpp, line 258
18:14:14 INFO - GECKO(2099) | ###!!! [Child][MessageChannel] Error: (msgtype=0x4800FC,name=PContent::Msg_AccumulateChildKeyedHistograms) Closed channel: cannot send/recv
18:14:14 INFO - GECKO(2099) | [Child 2100] WARNING: MsgDropped in ContentChild: file /home/worker/workspace/build/src/dom/ipc/ContentChild.cpp, line 2099
18:14:14 INFO - GECKO(2099) | [Child 2100] WARNING: '!ipcActor->SendAccumulateChildKeyedHistograms(keyedAccumulationsToSend)', file /home/worker/workspace/build/src/toolkit/components/telemetry/ipc/TelemetryIPCAccumulator.cpp, line 262
18:14:16 INFO - GECKO(2099) | ###!!! [Child][MessageChannel] Error: (msgtype=0x4800FC,name=PContent::Msg_AccumulateChildKeyedHistograms) Closed channel: cannot send/recv
Flags: needinfo?(mfroman)
@drno: The ICE telemetry I added uses scalars (Telemetry::ScalarAdd), not histograms, so this doesn't seem related. However, after looking at TelemetryIPCAccumulator.cpp and remembering some lessons learned while adding the IPC for stun address discovery, it could be a bug in how the telemetry code handles one side of the IPC channel going away.
Flags: needinfo?(mfroman)
@chutten: Any chance this is a bug with one side of the telemetry IPC channel going away on shutdown?
Flags: needinfo?(chutten)
I think I have actually experienced this bug once during development/testing on Linux: I don't remember anymore how I shut down Firefox in that case, but basically the Firefox window had closed already and I kept seeing this error message from Telemetry IPC over and over again. Looking at the running processes it looked like the content process was hanging (trying to deliver its telemetry data to the parent???). I think the parent process was gone already. I'm wondering if the initial socket problem from shutting down WebRTC then makes Telemetry do something strange...
Telemetry IPC doesn't do anything to try and stay alive. There is that 2s timer trying to send the last batch of child telemetry, but it's on a best-effort basis when the process is shutting down. If IPC is bork, it can't send. I think that's all it is. I don't think it has anything to do with the actual problem (but I've been wrong before)
Flags: needinfo?(chutten)
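[Editorial note: the best-effort batching described in the previous comment follows a simple pattern. Below is a minimal, self-contained C++ model of it, assuming a periodic flush and drop-on-closed-channel behavior; the names (Accumulator, FlushBatch, channelOpen) are illustrative stand-ins, not the actual TelemetryIPCAccumulator code.]

#include <cstdio>
#include <mutex>
#include <vector>

struct Accumulator {
  std::mutex lock;
  std::vector<int> pending;  // stand-in for histogram accumulations
  bool channelOpen = true;   // models whether the IPC channel is still up

  void Accumulate(int sample) {
    std::lock_guard<std::mutex> g(lock);
    pending.push_back(sample);
  }

  // Called by the ~2s timer. A closed channel simply drops the batch;
  // nothing retries, and nothing is meant to block shutdown.
  void FlushBatch() {
    std::vector<int> batch;
    {
      std::lock_guard<std::mutex> g(lock);
      batch.swap(pending);
    }
    if (batch.empty()) return;
    if (!channelOpen) {
      std::fprintf(stderr, "channel closed: dropping %zu samples\n",
                   batch.size());
      return;
    }
    // SendAccumulateChildHistograms(batch) would happen here in real code.
  }
};

int main() {
  Accumulator acc;
  acc.Accumulate(42);
  acc.channelOpen = false;  // simulate IPC torn down during shutdown
  acc.FlushBatch();         // reports the drop; does not hang
}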
Bill - are there any known e10s issues where the master process dies, and the child never exits? Any ideas what can be happening here? The telemetry stuff appears to just be it trying to contact the master process.
Flags: needinfo?(wmccloskey)
We do have a mechanism to cause the child to die if the parent dies. You can search for "AbortOnError" [1]. However, we disable this mechanism late in shutdown [2]. At the same time, we call StartForceKillTimer(). However, StartForceKillTimer does nothing in DEBUG builds. It looks like all the failures here are DEBUG builds (except for one, which I think was mis-starred). DEBUG builds are handled differently because we try really hard to do leak checking in them.

So presumably we're getting to RecvShutdown in the content process. It's probably a good bet that we even Close the channel normally with the child process (given that I don't see any messages about SendShutdown failing to be sent). It seems like something else is causing the shutdown process to hang. Maybe we're in a nested event loop?

It's also odd that it seems like both the parent and the child are hanging at shutdown. The parent eventually gets killed by the nsTerminator code.

It would help to have a crash dump here. I'm not sure why we don't have one. I would have expected MOZ_CRASH to generate one.

[1] http://searchfox.org/mozilla-central/search?q=AbortOnError&case=true&regexp=false&path=
[2] http://searchfox.org/mozilla-central/rev/35b37316149108d53a02fb8498e250ea8a22ab5d/dom/ipc/ContentChild.cpp#2875
Flags: needinfo?(wmccloskey)
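[Editorial note: a sketch of the watchdog behavior described above, assuming the DEBUG no-op is the key detail. RequestChildShutdown and its parameter are hypothetical names; the real logic lives around StartForceKillTimer() in ContentParent/GeckoChildProcessHost, and the build flavor would be a compile-time condition rather than a bool.]

#include <chrono>
#include <cstdio>
#include <thread>

// Models the parent-side shutdown path after SendShutdown() and a normal
// channel Close() have already happened.
void RequestChildShutdown(bool isDebugBuild) {
  if (isDebugBuild) {
    // DEBUG: the force-kill timer is a no-op so the child can finish leak
    // checking and write its leak log. If the child wedges, nothing reaps
    // it until the harness-level job timeout (the 2h/90min hangs here).
    std::fprintf(stderr, "debug build: not arming force-kill timer\n");
    return;
  }
  // Non-DEBUG: force-kill the child if it outlives a short grace period.
  std::thread([] {
    std::this_thread::sleep_for(std::chrono::seconds(5));
    std::fprintf(stderr, "grace period over: would KillHard(child) here\n");
  }).detach();
}

int main() {
  RequestChildShutdown(/* isDebugBuild = */ true);
}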
backlog: tech-debt → ---
Rank: 35
Component: WebRTC: Networking → IPC
Flags: needinfo?(mreavy)
Priority: P3 → --
This is slightly less bad than it used to be because now we're only burning 90min of machine time instead of 2hr like we used to (yay Taskcluster migration), but it's still pretty harmful to our limited OSX machine pool when we get into this state. Here's a recent log to show what happens - you can see 45+ min of infinite looping at the end of the run:
https://public-artifacts.taskcluster.net/arTLstMpQ5yKH6wiARfZ5Q/0/public/logs/live_backing.log

Jim, any chance you could help us find an owner for this?
Flags: needinfo?(jmathies)
First error I see is this; not sure if it plays in here, but it looks like some data coming over from the child isn't valid:

15:39:24 INFO - GECKO(1725) | [Parent 1725] WARNING: NS_FAILED internal_GetScalarByEnum for CHILD: file /home/worker/workspace/build/src/toolkit/components/telemetry/TelemetryScalar.cpp, line 2184

http://searchfox.org/mozilla-central/rev/5696c3e525fc8222674eed6a562f5fcbe804c4c7/toolkit/components/telemetry/TelemetryScalar.cpp#2184

Second error is a PTCPSocket message getting dropped when we receive a Close command on the child side but the channel has already been torn down. This is usually the result of lifetime design issues in the underlying protocol. On the child side we just drop these messages. I'd be surprised if PTCPSocket had lifetime issues though!

http://searchfox.org/mozilla-central/rev/5696c3e525fc8222674eed6a562f5fcbe804c4c7/dom/ipc/ContentChild.cpp#2341

Then:

15:39:26 INFO - GECKO(1725) | ###!!! [Child][MessageChannel] Error: (msgtype=0x2400F0,name=PContent::Msg_AccumulateChildHistograms) Closed channel: cannot send/recv
15:39:26 INFO - GECKO(1725) | [Child 1726] WARNING: MsgDropped in ContentChild: file /home/worker/workspace/build/src/dom/ipc/ContentChild.cpp, line 2344
15:39:26 INFO - GECKO(1725) | [Child 1726] WARNING: '!ipcActor->SendAccumulateChildHistograms(accumulationsToSend)', file /home/worker/workspace/build/src/toolkit/components/telemetry/ipc/TelemetryIPCAccumulator.cpp, line 299

This is another error on the child side where the parent has requested telemetry data but the channel the request is made on has been shut down on the child side. AFAICT this looks like the main content channel:

http://searchfox.org/mozilla-central/rev/5696c3e525fc8222674eed6a562f5fcbe804c4c7/toolkit/components/telemetry/ipc/TelemetryIPCAccumulator.cpp#332

In the middle of all this I see some odd messages about redundant calls to nsAppShell::Exit() in both processes.

Stephen, when you have a chance can you poke at this a bit please?
Flags: needinfo?(jmathies) → needinfo?(spohl.mozilla.bugs)
Here is what keeps being repeated every two seconds in the log[1]:

15:39:26 INFO - GECKO(1725) | ###!!! [Child][MessageChannel] Error: (msgtype=0x2400F1,name=PContent::Msg_AccumulateChildKeyedHistograms) Closed channel: cannot send/recv
15:39:26 INFO - GECKO(1725) | [Child 1726] WARNING: MsgDropped in ContentChild: file /home/worker/workspace/build/src/dom/ipc/ContentChild.cpp, line 2344
15:39:26 INFO - GECKO(1725) | [Child 1726] WARNING: '!ipcActor->SendAccumulateChildKeyedHistograms(keyedAccumulationsToSend)', file /home/worker/workspace/build/src/toolkit/components/telemetry/ipc/TelemetryIPCAccumulator.cpp, line 303
15:39:26 INFO - GECKO(1725) | ###!!! [Child][MessageChannel] Error: (msgtype=0x2400F5,name=PContent::Msg_RecordDiscardedData) Closed channel: cannot send/recv
15:39:26 INFO - GECKO(1725) | [Child 1726] WARNING: MsgDropped in ContentChild: file /home/worker/workspace/build/src/dom/ipc/ContentChild.cpp, line 2344
15:39:26 INFO - GECKO(1725) | [Child 1726] WARNING: '!ipcActor->SendRecordDiscardedData(discardedData)', file /home/worker/workspace/build/src/toolkit/components/telemetry/ipc/TelemetryIPCAccumulator.cpp, line 318

As was stated before, the MessageChannel has been torn down. We have also called nsAppShell::Exit() in the child, but the IPC timer continues to fire and we attempt to send accumulated telemetry data[1] every two seconds. The two types of data are keyed histograms[2] and discarded data[3]. These send attempts continually fail with "Closed channel: cannot send/recv"[4], because the MessageChannel is already torn down[5]. We then report a warning that the message has been dropped and return immediately from ContentChild::ProcessingError[6]. We don't have a crash dump because we return here; otherwise, we would execute MOZ_CRASH[7]. This change was introduced in bug 1091766 to address the fact that we didn't use to execute shutdown properly, which also left us without leak logs.

Note however that even MOZ_CRASH appears to fail to exit/crash the process (as mentioned in comment 7): in the middle of the log, we attempt to crash the child process via MOZ_CRASH due to "Hit MOZ_CRASH(Shutdown too long, probably frozen, causing a crash.) at /home/worker/workspace/build/src/toolkit/components/terminator/nsTerminator.cpp:160", but then continue to process further IPC timer fires every two seconds.

In summary, it appears that we're desperately trying to submit telemetry data via IPC. So desperately that any calls to nsAppShell::Exit() don't work (because it has already been called) and even MOZ_CRASH doesn't crash the child. It seems that we should quit trying to submit this data when the channel is torn down and unlock ourselves. I'm not clear how to achieve this. Maybe we should clear this data and pretend that there is nothing to send when the channel is being torn down? Or we should release whatever lock we seem to be holding to allow for nsAppShell::Exit() to complete? :billm, :mccr8, do you have any thoughts on how to proceed here?
[1] https://public-artifacts.taskcluster.net/arTLstMpQ5yKH6wiARfZ5Q/0/public/logs/live_backing.log
[2] https://dxr.mozilla.org/mozilla-central/rev/20d57b9c4183973af4af5e078dff2aec0b74f928/toolkit/components/telemetry/ipc/TelemetryIPCAccumulator.cpp#334
[3] https://dxr.mozilla.org/mozilla-central/rev/20d57b9c4183973af4af5e078dff2aec0b74f928/toolkit/components/telemetry/ipc/TelemetryIPCAccumulator.cpp#303
[4] https://dxr.mozilla.org/mozilla-central/rev/20d57b9c4183973af4af5e078dff2aec0b74f928/toolkit/components/telemetry/ipc/TelemetryIPCAccumulator.cpp#318
[5] https://dxr.mozilla.org/mozilla-central/source/ipc/glue/MessageChannel.cpp#2442,2463-2464
[6] https://dxr.mozilla.org/mozilla-central/rev/20d57b9c4183973af4af5e078dff2aec0b74f928/dom/ipc/ContentChild.cpp#2384,2385
[7] https://dxr.mozilla.org/mozilla-central/rev/20d57b9c4183973af4af5e078dff2aec0b74f928/dom/ipc/ContentChild.cpp#2403
Flags: needinfo?(wmccloskey)
Flags: needinfo?(spohl.mozilla.bugs)
Flags: needinfo?(continuation)
It sounds like telemetry should listen for xpcom-shutdown and cancel the timer, so it won't try to send messages after IPC is shut down.
Flags: needinfo?(continuation)
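[Editorial note: a minimal sketch of the suggestion above, in portable C++ rather than Gecko's nsITimer/nsIObserverService: a repeating flush timer whose shutdown hook cancels it, so no send is attempted after IPC teardown. RepeatingTimer and the main() wiring are illustrative, not the actual telemetry code.]

#include <chrono>
#include <condition_variable>
#include <cstdio>
#include <functional>
#include <mutex>
#include <thread>

class RepeatingTimer {
  std::mutex m;
  std::condition_variable cv;
  bool cancelled = false;
  std::thread worker;

 public:
  void Start(std::chrono::milliseconds interval, std::function<void()> onFire) {
    worker = std::thread([this, interval, onFire] {
      std::unique_lock<std::mutex> lock(m);
      // Fire every `interval` until Cancel() flips `cancelled`.
      while (!cv.wait_for(lock, interval, [this] { return cancelled; })) {
        onFire();  // would be the telemetry IPC flush
      }
    });
  }

  void Cancel() {
    {
      std::lock_guard<std::mutex> g(m);
      cancelled = true;
    }
    cv.notify_all();
    if (worker.joinable()) worker.join();
  }
};

int main() {
  RepeatingTimer ipcTimer;
  ipcTimer.Start(std::chrono::seconds(2),
                 [] { std::puts("flush accumulated telemetry over IPC"); });
  // On the "xpcom-shutdown" notification: flush once while the channel is
  // still alive, then cancel so the timer can never fire into a dead channel.
  ipcTimer.Cancel();
}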
Component: IPC → Telemetry
Product: Core → Toolkit
Alternatively, some kind of shutdown call could be added to ShutdownXPCOM, right after where we send xpcom-shutdown (NS_XPCOM_SHUTDOWN_OBSERVER_ID).
One active timer keeps the process alive?
The real problem here is that we're calling XRE_ShutdownChildProcess() in the child process, but it's not shutting down. All of the other stuff (the telemetry timer, the IPC stuff) is just a symptom of that. I suspect that the "Shutdown too long, probably frozen, causing a crash." message is actually being printed by some other process. I would be very surprised if MOZ_CRASH somehow wasn't working. I think we'll need to insert some sort of debugging to see what's happening inside ShutdownChildProcess().
Flags: needinfo?(wmccloskey)
(In reply to Bill McCloskey (:billm) from comment #28)
> I suspect that the "Shutdown too long, probably frozen, causing a crash."
> message is actually being printed by some other process. I would be very
> surprised if MOZ_CRASH somehow wasn't working.

Oh, I misread the log there. This seems to be printed by the parent process with pid 1725 while the child has pid 1726.

> I think we'll need to insert some sort of debugging to see what's happening
> inside ShutdownChildProcess().

Agreed. Who can own this?
Flags: needinfo?(wmccloskey)
I think one way this could be mitigated is by reinstating the way the parent process kills off the child if it takes too long, in debug builds. I got rid of it for leak checking, but not getting a log sometimes is probably better than hanging a machine.
> Agreed. Who can own this?

I don't know. I guess I'll forward this to Jim. It's some sort of event loop issue, although probably related to the Chromium event loop.
Flags: needinfo?(wmccloskey) → needinfo?(jmathies)
Attached patch Patch (obsolete) — Splinter Review
(In reply to Andrew McCreight [:mccr8] from comment #30)
> I think one way this could be mitigated is by reinstating the way the parent
> process kills off the child if it takes too long, in debug builds. I got rid
> of it for leak checking, but not getting a log sometimes is probably better
> than hanging a machine.

^
Attachment #8923495 - Flags: review?(continuation)
Comment on attachment 8923495 [details] [diff] [review]
Patch

Review of attachment 8923495 [details] [diff] [review]:
-----------------------------------------------------------------

I don't understand what this has to do with my comment. It looks like you are turning this warning into a crash? We hit this warning very frequently. I'd guess that the tree would be very orange if you did this.
Attachment #8923495 - Flags: review?(continuation) → review-
I was talking about the code that was touched in this patch:
https://hg.mozilla.org/mozilla-central/rev/260370115e37

Maybe we could change ~GeckoChildProcessHost() to force shutdown in NS_FREE_PERMANENT_DATA builds, but have a more generous timeout. It is two seconds now. Maybe if it was set to a minute or two in NS_FREE_PERMANENT_DATA that would be sufficient.
Comment on attachment 8923495 [details] [diff] [review]
Patch

This was an attempt at reverting your change from bug 1091766, but even so I now realize that we used to call QuickExit() instead of crashing. Marking as obsolete.
Attachment #8923495 - Attachment is obsolete: true
Flags: needinfo?(jmathies)
Priority: -- → P2
Whiteboard: low-volume
Depends on: 1404190
I'm not entirely sure if it's the same thing, but I had something like this happen to me today with a local build which I started via 'mach run' and opened a JS fiddle. When I quit that Firefox, one content process did not shut down and I kept seeing lines like these on the command line:

###!!! [Child][MessageChannel] Error: (msgtype=0x2C00EF,name=PContent::Msg_AccumulateChildHistograms) Closed channel: cannot send/recv
[Child 17420, Main Thread] WARNING: MsgDropped in ContentChild: file /Users/nohlmeier/src/mozilla-central/dom/ipc/ContentChild.cpp, line 2339
[Child 17420, Main Thread] WARNING: '!ipcActor->SendAccumulateChildHistograms(histogramsToSend)', file /Users/nohlmeier/src/mozilla-central/toolkit/components/telemetry/ipc/TelemetryIPCAccumulator.cpp, line 296
###!!! [Child][MessageChannel] Error: (msgtype=0x2C00F0,name=PContent::Msg_AccumulateChildKeyedHistograms) Closed channel: cannot send/recv
[Child 17420, Main Thread] WARNING: MsgDropped in ContentChild: file /Users/nohlmeier/src/mozilla-central/dom/ipc/ContentChild.cpp, line 2339
[Child 17420, Main Thread] WARNING: '!ipcActor->SendAccumulateChildKeyedHistograms(keyedHistogramsToSend)', file /Users/nohlmeier/src/mozilla-central/toolkit/components/telemetry/ipc/TelemetryIPCAccumulator.cpp, line 300
###!!! [Child][MessageChannel] Error: (msgtype=0x2C00F4,name=PContent::Msg_RecordDiscardedData) Closed channel: cannot send/recv
[Child 17420, Main Thread] WARNING: MsgDropped in ContentChild: file /Users/nohlmeier/src/mozilla-central/dom/ipc/ContentChild.cpp, line 2339
[Child 17420, Main Thread] WARNING: '!ipcActor->SendRecordDiscardedData(discardedData)', file /Users/nohlmeier/src/mozilla-central/toolkit/components/telemetry/ipc/TelemetryIPCAccumulator.cpp, line 315

So I attached lldb to that one hanging content process and got the following backtrace:

(lldb) bt
* thread #1, queue = 'com.apple.main-thread', stop reason = signal SIGSTOP
  * frame #0: 0x00007fff6986be7e libsystem_kernel.dylib`__psynch_cvwait + 10
    frame #1: 0x00007fff699a7662 libsystem_pthread.dylib`_pthread_cond_wait + 732
    frame #2: 0x0000000109f9f0d7 libmozglue.dylib`mozilla::detail::ConditionVariableImpl::wait(this=0x000000010a3a80d0, lock=0x000000010a3a8050) at ConditionVariable_posix.cpp:118
    frame #3: 0x0000000112acdddd XUL`mozilla::CondVar::Wait(this=0x000000010a3a80b0, aInterval=4294967295) at BlockingResourceBase.cpp:604
    frame #4: 0x0000000112b05604 XUL`mozilla::ThreadEventQueue<mozilla::PrioritizedEventQueue<mozilla::LabeledEventQueue> >::GetEvent(this=0x000000010a3a8020, aMayWait=true, aPriority=0x00007ffee6111cf0) at ThreadEventQueue.cpp:139
    frame #5: 0x0000000112b133ef XUL`nsThread::ProcessNextEvent(this=0x000000010a307190, aMayWait=true, aResult=0x00007ffee61122de) at nsThread.cpp:961
    frame #6: 0x0000000112b3426c XUL`NS_ProcessNextEvent(aThread=0x000000010a307190, aMayWait=true) at nsThreadUtils.cpp:508
    frame #7: 0x0000000112afea1f XUL`bool mozilla::SpinEventLoopUntil<(mozilla::ProcessFailureBehavior)1, mozilla::SharedThreadPool::SpinUntilEmpty()::$_0>(aPredicate=0x00007ffee6112348, aThread=0x0000000000000000)::$_0&&, nsIThread*) at nsThreadUtils.h:323
    frame #8: 0x0000000112afe557 XUL`mozilla::SharedThreadPool::SpinUntilEmpty() at SharedThreadPool.cpp:80
    frame #9: 0x0000000112afe4b9 XUL`mozilla::SharedThreadPoolShutdownObserver::Observe(this=0x000000010e12b9c0, aSubject=0x0000000000000000, aTopic="xpcom-shutdown-threads", aData=0x0000000000000000) at SharedThreadPool.cpp:49
    frame #10: 0x0000000112a3aa89 XUL`nsObserverList::NotifyObservers(this=0x0000000165fdb0f8, aSubject=0x0000000000000000, aTopic="xpcom-shutdown-threads", someData=0x0000000000000000) at nsObserverList.cpp:112
    frame #11: 0x0000000112a3cc4a XUL`nsObserverService::NotifyObservers(this=0x000000010cea89c0, aSubject=0x0000000000000000, aTopic="xpcom-shutdown-threads", aSomeData=0x0000000000000000) at nsObserverService.cpp:296
    frame #12: 0x0000000112b6750c XUL`mozilla::ShutdownXPCOM(aServMgr=0x0000000000000000) at XPCOMInit.cpp:880
    frame #13: 0x0000000112b67285 XUL`::NS_ShutdownXPCOM(aServMgr=0x0000000000000000) at XPCOMInit.cpp:810
    frame #14: 0x000000011b06ae32 XUL`XRE_TermEmbedding() at nsEmbedFunctions.cpp:224
    frame #15: 0x000000011360e8d2 XUL`mozilla::ipc::ScopedXREEmbed::Stop(this=0x000000010a3738e0) at ScopedXREEmbed.cpp:108
    frame #16: 0x00000001175be43c XUL`mozilla::dom::ContentProcess::CleanUp(this=0x000000010a373000) at ContentProcess.cpp:267
    frame #17: 0x000000011b06c068 XUL`XRE_InitChildProcess(aArgc=16, aArgv=0x00007ffee6112c00, aChildData=0x00007ffee6112b88) at nsEmbedFunctions.cpp:695
    frame #18: 0x000000011b079717 XUL`mozilla::BootstrapImpl::XRE_InitChildProcess(this=0x000000010a32a190, argc=19, argv=0x00007ffee6112c00, aChildData=0x00007ffee6112b88) at Bootstrap.cpp:69
    frame #19: 0x0000000109aed88c plugin-container`content_process_main(bootstrap=0x000000010a32a190, argc=19, argv=0x00007ffee6112c00) at plugin-container.cpp:63
    frame #20: 0x0000000109aed945 plugin-container`main(argc=20, argv=0x00007ffee6112c00) at MozillaRuntimeMain.cpp:25
    frame #21: 0x0000000109aed804 plugin-container`start + 52
Component: Telemetry → IPC
Priority: P2 → --
Product: Toolkit → Core
Priority: -- → P3
I have encountered a similar issue that is highly reproducible on try (with my patches for bug 1299515 and a couple more). Here's a run with some extra diagnostic logging of failed telemetry histograms and shutdown phases:
https://treeherder.mozilla.org/#/jobs?repo=try&revision=69c99229d4f42355c625890d06378f33e70cac8b&selectedJob=157441247

I also reproduced this on a loaner and got similar stacks to drno's in comment 41: https://pastebin.mozilla.org/9076289

From the shutdown phase logs above, it's clear the child is stuck in XPCOM_SHUTDOWN_THREADS, and from the stacks it's clear that it's blocked on emptying a SharedThreadPool. Now, which SharedThreadPool could that be...

FWIW for the failed telemetry histogram I get this:

> [task 2018-01-19T17:15:40.230Z] 17:15:40 INFO - GECKO(2695) | Failed to send keyed histogram 1474, key:TelemetryIPCAccumulator::IPCTimerFired

But if we ignore those for a minute, there are also a couple of these:

> [task 2018-01-19T17:29:51.209Z] 17:29:51 INFO - GECKO(2695) | Failed to send keyed histogram 1474, key:gfxFontCache::gfxFontCache

Could be a lead?
I should also say that there's another lead from the threads listed in the pastebin in comment 44, https://pastebin.mozilla.org/9076289, namely:

> (gdb) info threads
>   Id   Target Id         Frame
> * 1    Thread 0x7fbff04c8740 (LWP 14846) "Web Content" pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
>   2    Thread 0x7fbfdc524700 (LWP 14869) "Chrome_~dThread" syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
>   3    Thread 0x7fbfd74ff700 (LWP 14870) "JS Watchdog" pthread_cond_timedwait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:225
>   4    Thread 0x7fbfd6cfe700 (LWP 14871) "JS Helper" pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
>   5    Thread 0x7fbfd6afd700 (LWP 14872) "JS Helper" pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
>   6    Thread 0x7fbfd68fc700 (LWP 14873) "JS Helper" pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
>   7    Thread 0x7fbfd66fb700 (LWP 14874) "JS Helper" pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
>   8    Thread 0x7fbfd64fa700 (LWP 14875) "JS Helper" pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
>   9    Thread 0x7fbfd62f9700 (LWP 14876) "JS Helper" pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
>   10   Thread 0x7fbfd60f8700 (LWP 14877) "Socket Thread" pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
>   11   Thread 0x7fbfd51ff700 (LWP 14879) "Hang Monitor" pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
>   12   Thread 0x7fbfd49fe700 (LWP 14880) "Timer" pthread_cond_timedwait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:225
>   13   Thread 0x7fbfd1b42700 (LWP 14890) "StyleThread#0" pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
>   14   Thread 0x7fbfcf18f700 (LWP 14891) "StyleThread#1" pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
>   15   Thread 0x7fbfcf14e700 (LWP 14892) "StyleThread#2" pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
>   16   Thread 0x7fbfce9ff700 (LWP 14893) "StyleThread#3" pthread_cond_wait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
> --- the below three belong to an instance of webrtc::Call
>   17   Thread 0x7fbfc91f8700 (LWP 15022) "call_worker_que" syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
>   18   Thread 0x7fbfc90f7700 (LWP 15023) "ModuleProcessTh" pthread_cond_timedwait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:225
>   19   Thread 0x7fbfc8ff6700 (LWP 15024) "PacerThread" pthread_cond_timedwait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:225
> ---
> --- The below three belong to another instance of webrtc::Call
>   20   Thread 0x7fbfc89fd700 (LWP 15027) "call_worker_que" syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
>   21   Thread 0x7fbfc86ff700 (LWP 15028) "ModuleProcessTh" pthread_cond_timedwait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:225
>   22   Thread 0x7fbfc8ef5700 (LWP 15029) "PacerThread" pthread_cond_timedwait@@GLIBC_2.3.2 () at ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_timedwait.S:225
> ---

"call_worker_queue": https://searchfox.org/mozilla-central/rev/2031c0f517185b2bd0e8f6f92f9491b3410c1f7f/media/webrtc/trunk/webrtc/call/call.cc#306
"ModuleProcessThread": https://searchfox.org/mozilla-central/rev/2031c0f517185b2bd0e8f6f92f9491b3410c1f7f/media/webrtc/trunk/webrtc/call/call.cc#279
"PacerThread": https://searchfox.org/mozilla-central/rev/2031c0f517185b2bd0e8f6f92f9491b3410c1f7f/media/webrtc/trunk/webrtc/call/call.cc#280

There are clearly two instances of webrtc::Call still alive.
(In reply to Andreas Pehrson [:pehrsons] from comment #45)
> There are clearly two instances of webrtc::Call still alive.

Since all of the WebRTC tests instantiate two PeerConnections, and each PeerConnection (with audio and/or video) has one webrtc::Call object, I don't think this is unexpected. Except if the Call object in general should have been torn down at this point already. But having two during test execution is not suspicious.
I have noticed that in the working case, we kill these last PeerConnectionImpls during nsCycleCollector::Shutdown, which is after XPCOM_SHUTDOWN_THREADS:
https://searchfox.org/mozilla-central/rev/e9a067a401e41017766f4ab90bc3906603273d51/xpcom/build/XPCOMInit.cpp#969

That's late, but unclear why it works most of the time.
(In reply to Andreas Pehrson [:pehrsons] from comment #48)
> I have noticed that in the working case, we kill these last
> PeerConnectionImpls during nsCycleCollector::Shutdown, which is after
> XPCOM_SHUTDOWN_THREADS:
> https://searchfox.org/mozilla-central/rev/e9a067a401e41017766f4ab90bc3906603273d51/xpcom/build/XPCOMInit.cpp#969
>
> That's late, but unclear why it works most of the time.

I got to repro this locally fairly easily after putting in some debug logging. But not under rr. Hopefully this holds.

So far it seems like when we hang, PeerConnectionImpl::Close was never called. That makes sense with what we've seen on try, since Close leads to CloseInt, which leads to ShutdownMedia, which destroys PeerConnectionMedia, which holds the ref to webrtc::Call, whose threads we know were alive during the shutdown hang.

Whether this is part of the reason for the shutdown hang is another question, but after calling PCImpl::Close on XPCOM_SHUTDOWN I cannot repro anymore. That could also be timing related of course. I'll put this on try as well and if all is jolly I'll harden it a bit, probably with an async shutdown blocker, and wait for PeerConnectionMedia to be fully shut down (it bounces to the STS thread and back to main).
Assignee: nobody → apehrson
Blocks: 1299515
Status: NEW → ASSIGNED
Rank: 25
Component: IPC → WebRTC
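[Editorial note: the "async shutdown blocker" hardening mentioned two comments up follows a simple pattern; below is a conceptual stand-in. The real mechanism would be Gecko's nsIAsyncShutdown service; the class and method names here are illustrative, not the actual API.]

#include <condition_variable>
#include <mutex>
#include <thread>

class ShutdownBlocker {
  std::mutex m;
  std::condition_variable cv;
  int outstanding = 0;

 public:
  void AddBlocker() {
    std::lock_guard<std::mutex> g(m);
    ++outstanding;
  }
  // Called once PeerConnectionMedia has bounced to the STS thread and back,
  // i.e. is fully shut down.
  void RemoveBlocker() {
    {
      std::lock_guard<std::mutex> g(m);
      --outstanding;
    }
    cv.notify_all();
  }
  // The shutdown phase waits here instead of racing past live media objects
  // (and their webrtc::Call threads).
  void WaitForBlockers() {
    std::unique_lock<std::mutex> lock(m);
    cv.wait(lock, [this] { return outstanding == 0; });
  }
};

int main() {
  ShutdownBlocker blocker;
  blocker.AddBlocker();  // a PeerConnection is still being torn down
  std::thread media([&] { blocker.RemoveBlocker(); });  // teardown finishes
  blocker.WaitForBlockers();  // shutdown proceeds only after this
  media.join();
}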
Comment 49 didn't help. However, I found that the one SharedThreadPool we're hanging on when I repro locally is "MediaTimer".
I think this is from a new use of MediaTimer that I added in bug 1299515, and a SourceListener that doesn't get destroyed properly in MediaManager. I have evidence that a SourceListener does not get destroyed, and that would hold a ref to the MediaTimer. I'm just not sure what holds a ref to the SourceListener. If this holds true I'm just looking at a self-inflicted issue from bug 1299515 and the root cause for this bug remains.
I think I understand what happens here now, and it follows on comment 51.

A (dom-object) MediaStreamTrack holds a strong ref to its LocalTrackSource, which holds a strong ref to a SourceListener, which holds a strong ref to a MediaTimer. I have invariants for the last strong ref that I don't want to break by resetting it, so I'm solving it with a WeakPtr in the LocalTrackSource (it's main-thread only). This makes my issue irrelevant to this bug (I'll solve mine in bug 1299515 where I added the MediaTimer), though the symptoms are the same.

If drno's stack in comment 41 is correct for this bug, this is indeed also due to a dangling reference to a SharedThreadPool somewhere. I'm adding code to log which SharedThreadPool that is, so that when this happens in the future we can better understand where to look.

https://treeherder.mozilla.org/#/jobs?repo=try&revision=d57d4a07fe897993e4da2a36d6068ee02d16ded8
Keywords: leave-open
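[Editorial note: a sketch of the ownership fix in the previous comment, with std::weak_ptr standing in for Gecko's WeakPtr. Class names mirror the comment but the real classes are far more involved; the point is only that the DOM-side chain no longer transitively owns the SourceListener, so it cannot keep the MediaTimer's thread pool alive.]

#include <memory>

struct MediaTimer {};  // owns a SharedThreadPool in the real code

struct SourceListener {
  std::shared_ptr<MediaTimer> timer;  // the invariant strong ref stays put
};

struct LocalTrackSource {
  // Was effectively a strong ref, completing the keep-alive chain.
  std::weak_ptr<SourceListener> listener;  // main-thread only, like WeakPtr

  void Stop() {
    if (auto l = listener.lock()) {
      // Listener still alive: tell it to stop. If it's gone, do nothing.
    }
  }
};

struct MediaStreamTrack {  // the DOM object
  std::shared_ptr<LocalTrackSource> source;
};

int main() {
  auto listener = std::make_shared<SourceListener>();
  listener->timer = std::make_shared<MediaTimer>();
  MediaStreamTrack track{std::make_shared<LocalTrackSource>()};
  track.source->listener = listener;
  listener.reset();      // MediaManager drops its strong ref at shutdown...
  track.source->Stop();  // ...and a lingering track no longer keeps it alive
}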
Comment on attachment 8947082 [details]
Bug 1351655 - Log which SharedThreadPools are not yet destroyed on shutdown-threads.

https://reviewboard.mozilla.org/r/216882/#review222694

Static analysis found 1 defect in this patch.

- 1 defect found by clang-tidy

You can run this analysis locally with:
- `./mach static-analysis check path/to/file.cpp` (C/C++)

If you see a problem in this automated review, please report it here: http://bit.ly/2y9N9Vx

::: xpcom/threads/SharedThreadPool.cpp:55
(Diff revision 1)
> +  {
> +    ReentrantMonitorAutoEnter mon(*sMonitor);
> +    if (!sPools->Iter().Done()) {
> +      nsAutoCString str;
> +      for (auto i = sPools->Iter(); !i.Done(); i.Next()) {
> +        str.AppendPrintf("\"%s\" ", nsAutoCString(i.Key()).get());

Warning: Escaped string literal can be written as a raw string literal [clang-tidy: modernize-raw-string-literal]

    str.AppendPrintf("\"%s\" ", nsAutoCString(i.Key()).get());
                     ^
                     R"("%s" )"
Comment on attachment 8947082 [details]
Bug 1351655 - Log which SharedThreadPools are not yet destroyed on shutdown-threads.

https://reviewboard.mozilla.org/r/216882/#review222704

I'm not sure this will produce the sort of information you want. See below, and please do correct me if I am misunderstanding the situation--which I probably am!

::: commit-message-9a3b6:10
(Diff revision 1)
> +This patch attempts to help diagnose future cases like this by printing which
> +SharedThreadPools we are going to wait for to stderr.

If I understand comment 52 correctly, we think that somebody is holding a reference to a `SharedThreadPool`, and whoever is holding that reference is also involved in a reference cycle. So we can't drop the `SharedThreadPool` reference, and we can't drop the object holding the reference. Is that a correct summary of the situation?

I'm not clear on how this is going to narrow down the particular `SharedThreadPool` that's being kept alive. Isn't it possible (indeed, likely?) that when we observe `xpcom-shutdown-threads` here, several thread pools may still be alive, regardless of whether we're in a situation that leaks or not? Therefore, we're just going to print out the same set of thread pools all the time.

If the above situation is correct, it seems what we really want is some sort of print while we're spinning the event loop telling us what set of `SharedThreadPool`s are still alive. Printing out the set each trip through the event loop could print quite a lot of logging, though, and printing every `N` trips through could miss informing us about the death of a pool before the event loop blocks completely. Maybe it'd be better than nothing?

But why not use the usual refcount logging machinery on `SharedThreadPool` to see who's taking references where and then figure out who's taking a reference that's not released? I guess the problem with that would be getting the logs out of try/treeherder and whether we would actually dump logs in a timeout situation. Even so, that might be more workable than this?
Attachment #8947082 - Flags: review?(nfroyd)
Comment on attachment 8947082 [details]
Bug 1351655 - Log which SharedThreadPools are not yet destroyed on shutdown-threads.

https://reviewboard.mozilla.org/r/216882/#review222704

> If I understand comment 52 correctly, we think that somebody is holding a reference to a `SharedThreadPool`, and whoever is holding that reference is also involved in a reference cycle. So we can't drop the `SharedThreadPool` reference, and we can't drop the object holding the reference. Is that a correct summary of the situation?
> (...)

Summary looks good.

It's anecdotal of course, but when I was able to reproduce this locally there was only the leaking thread pool ever printed. It seemed like all other thread pool references had been dropped in xpcom-will-shutdown or xpcom-shutdown.

I'm not sure the refcount machinery helps, as we hang and are completely unresponsive, even to the HangMonitor. What makes the test machines get off this hang is a 90 minute timeout that force kills all processes. Does that retain the refcount logs anywhere? Also on opt builds?

Some thinking out loud: could we print the threadpools that still exist when we are about to wait for the event loop and it's empty? Perhaps we could do this in the predicate function we pass to SpinEventLoopUntil()? Any events to remove references to the threadpool should in all fairness have been dispatched during earlier shutdown phases, no?

I'm not really well-versed in these pieces of code so any guidance is useful.
Flags: needinfo?(nfroyd)
(In reply to Andreas Pehrson [:pehrsons] from comment #57)
> It's anecdotal of course, but when I was able to reproduce this locally
> there was only the leaking thread pool ever printed. It seemed like all
> other thread pool references had been dropped in xpcom-will-shutdown or
> xpcom-shutdown.

If this can be reproduced locally, are we adding the logging because we expect the results to be markedly different in automation vs. locally? Debugging it locally is probably kind of a pain, I suppose, since it looks like the hang is quite infrequent.

> I'm not sure the refcount machinery helps as we hang and are completely
> unresponsive, even to the HangMonitor. What makes the test machines get off
> this hang is a 90 minute timeout that force kills all processes. Does that
> retain the refcount logs anywhere? Also on opt builds?

No, I don't think the refcount logs are retained anywhere on hangs, nor do they exist for opt builds. Then again, it looks like the majority of hangs (did) happen on debug builds, with a few happening on PGO (?!), so I don't know that not existing for opt would be a huge problem.

> Some thinking out loud,
> Could we print the threadpools that still exist when we are about to wait
> for the event loop and it's empty? Perhaps we could do this in the predicate
> function we pass to SpinEventLoopUntil()?

I'm confused about the "wait for the event loop and it's empty" bit here: does the "it's" refer to the SharedThreadPool instance or the event loop? If the former, then the SharedThreadPool hashtable wouldn't have any thread pools in it, and so we wouldn't print anything, right? If the latter, I don't think we have access to that information in SpinEventLoopUntil()? Or am I totally misunderstanding the suggestion here?
Flags: needinfo?(nfroyd) → needinfo?(apehrson)
(In reply to Nathan Froyd [:froydnj] from comment #58)
> (In reply to Andreas Pehrson [:pehrsons] from comment #57)
> > It's anecdotal of course, but when I was able to reproduce this locally
> > there was only the leaking thread pool ever printed. It seemed like all
> > other thread pool references had been dropped in xpcom-will-shutdown or
> > xpcom-shutdown.
>
> If this can be reproduced locally, are we adding the logging because we
> expect the results to be markedly different in automation vs. locally?
> Debugging it locally is probably kind of a pain, I suppose, since it looks
> like the hang is quite infrequent.

I want to add it because the ones we've seen in the wild on try are very rare intermittents, and no other clues in the logs point in the right direction.

The one I could repro was actually unrelated to this bug and something new I had put in. It was on its own hard enough to debug, but if we'd like to nail an intermittent issue like this one we need some help, like logs pointing to the SharedThreadPool instance that hasn't been released yet.

> > Some thinking out loud,
> > Could we print the threadpools that still exist when we are about to wait
> > for the event loop and it's empty? Perhaps we could do this in the predicate
> > function we pass to SpinEventLoopUntil()?
>
> I'm confused about the "wait for the event loop and it's empty" bit here:
> does the "it's" refer to the SharedThreadPool instance or the event loop?
> If the former, then the SharedThreadPool hashtable wouldn't have any thread
> pools in it, and so we wouldn't print anything, right? If the latter, I
> don't think we have access to that information in SpinEventLoopUntil()? Or
> am I totally misunderstanding the suggestion here?

I was thinking of the main thread event loop that we're running on, as the predicate passed to SpinEventLoopUntil() is called on every iteration [1].

If these are not viable, can you think of other approaches to debugging this?

[1] https://searchfox.org/mozilla-central/rev/9d47b5fb14152865ad1375eb3ee2e571d81ecdb9/xpcom/threads/SharedThreadPool.cpp#81
Flags: needinfo?(apehrson) → needinfo?(nfroyd)
(In reply to Andreas Pehrson [:pehrsons] from comment #59)
> (In reply to Nathan Froyd [:froydnj] from comment #58)
> (...)
> I want to add it because the ones we've seen in the wild on try are very
> rare intermittents, and no other clues in the logs point in the right
> direction.
>
> The one I could repro was actually unrelated to this bug and something new I
> had put in. It was on its own hard enough to debug, but if we'd like to nail
> an intermittent issue like this one we need some help, like logs pointing to
> the SharedThreadPool instance that hasn't been released yet.

OK, that makes sense, thank you for the clarification.

> I was thinking of the main thread event loop that we're running on, as the
> predicate passed to SpinEventLoopUntil() is called on every iteration [1].

Sure, but I am still confused what you want to check here. You want to check the emptiness of the thread's event queue?

Anyway, sorry for the digressions. We have a patch, it will provide some diagnostic info, we might as well land it and see what happens.

> If these are not viable, can you think of other approaches to debugging this?
Flags: needinfo?(nfroyd)
Comment on attachment 8947082 [details]
Bug 1351655 - Log which SharedThreadPools are not yet destroyed on shutdown-threads.

https://reviewboard.mozilla.org/r/216882/#review225678
Attachment #8947082 - Flags: review+
Pushed by pehrsons@gmail.com:
https://hg.mozilla.org/integration/autoland/rev/1776ec7c6471
Log which SharedThreadPools are not yet destroyed on shutdown-threads. r=froydnj
(In reply to Nathan Froyd [:froydnj] from comment #60)
(...)
> > > > Some thinking out loud,
> > > > Could we print the threadpools that still exist when we are about to wait
> > > > for the event loop and it's empty? Perhaps we could do this in the predicate
> > > > function we pass to SpinEventLoopUntil()?
> > >
> > > I'm confused about the "wait for the event loop and it's empty" bit here:
> > > does the "it's" refer to the SharedThreadPool instance or the event loop?
> > > If the former, then the SharedThreadPool hashtable wouldn't have any thread
> > > pools in it, and so we wouldn't print anything, right? If the latter, I
> > > don't think we have access to that information in SpinEventLoopUntil()? Or
> > > am I totally misunderstanding the suggestion here?
> >
> > I was thinking of the main thread event loop that we're running on, as the
> > predicate passed to SpinEventLoopUntil() is called on every iteration [1].
>
> Sure, but I am still confused what you want to check here. You want to
> check the emptiness of the thread's event queue?

To only print this when the task queue is empty, yes. Since the threadpool is expecting to be shut down but the runnable to actually shut it down is never posted. See the sketch below.

> Anyway, sorry for the digressions. We have a patch, it will provide some
> diagnostic info, we might as well land it and see what happens.

Thanks!
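[Editorial note: a self-contained model of the diagnostic being discussed: spin the event loop until all pools are released, naming the survivors on each pass. gPools, gEvents and ProcessNextEvent are simplified stand-ins for sPools, the main-thread event queue and NS_ProcessNextEvent; the real logic lives around SharedThreadPool::SpinUntilEmpty().]

#include <cstdio>
#include <deque>
#include <functional>
#include <map>
#include <string>

std::map<std::string, int> gPools = {{"MediaTimer", 1}};  // name -> refcnt
std::deque<std::function<void()>> gEvents;                // main-thread queue

bool ProcessNextEvent() {
  if (gEvents.empty()) return false;  // queue drained: we would block here
  auto event = std::move(gEvents.front());
  gEvents.pop_front();
  event();
  return true;
}

void SpinUntilPoolsEmpty() {
  while (!gPools.empty()) {
    // Predicate-style hook: name what we are still waiting for on every
    // trip, so a wedged run at least identifies the offending pool.
    for (const auto& entry : gPools)
      std::fprintf(stderr, "SharedThreadPool \"%s\" still alive (refcnt %d)\n",
                   entry.first.c_str(), entry.second);
    if (!ProcessNextEvent())
      break;  // nothing left to run: this is the hang the bug describes
  }
}

int main() {
  // Happy path: an event eventually releases the pool and the spin ends.
  gEvents.push_back([] { gPools.erase("MediaTimer"); });
  SpinUntilPoolsEmpty();
}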
The leave-open keyword is there and there is no activity for 6 months. :pehrsons, maybe it's time to close this bug?
Flags: needinfo?(apehrson)
Oddly enough this hasn't re-occurred. Well, if it ever does, we now have some logging to pinpoint which threadpool hasn't been shut down.
Status: ASSIGNED → RESOLVED
Closed: 7 years ago
Flags: needinfo?(apehrson)
Resolution: --- → FIXED