"MessageChannel destroyed without being closed" / Intermittent PROCESS-CRASH | Main app process exited normally | application crashed [@ mozilla::BlockingResourceBase::CheckAcquire()]
Categories
(Core :: IPC, defect, P2)
Tracking
()
People
(Reporter: intermittent-bug-filer, Unassigned)
References
Details
(Keywords: crash, intermittent-failure, sec-moderate)
Crash Data
Filed by: ccoroiu [at] mozilla.com https://treeherder.mozilla.org/logviewer.html#?job_id=201165754&repo=mozilla-central https://queue.taskcluster.net/v1/task/KdCeOh5KTK-TvWL6dz5VZQ/runs/0/artifacts/public/logs/live_backing.log 11:23:47 INFO - TEST-INFO | Main app process: exit 0 11:23:47 INFO - runtests.py | Application ran for: 0:00:09.841000 11:23:47 INFO - zombiecheck | Reading PID log: c:\users\task_1537786638\appdata\local\temp\tmpu7m26rpidlog 11:23:47 INFO - ==> process 3196 launched child process 6008 ("Z:\task_1537786638\build\application\firefox\firefox.exe" -contentproc --channel="3196.0.1810225244\218909907" -childID 1 -isForBrowser -prefsHandle 2660 -prefMapHandle 2656 -prefsLen 1 -prefMapSize 187592 -schedulerPrefs 0001,2 -parentBuildID 20180924094825 -greomni "Z:\task_1537786638\build\application\firefox\omni.ja" -appomni "Z:\task_1537786638\build\application\firefox\browser\omni.ja" -appdir "Z:\task_1537786638\build\application\firefox\browser" - 3196 "\\.\pipe\gecko-crash-server-pipe.3196" 2736 tab) 11:23:47 INFO - ==> process 3196 launched child process 6248 ("Z:\task_1537786638\build\application\firefox\firefox.exe" -contentproc --channel="3196.6.1862187654\253409131" -childID 2 -isForBrowser -prefsHandle 2928 -prefMapHandle 2004 -prefsLen 151 -prefMapSize 187592 -schedulerPrefs 0001,2 -parentBuildID 20180924094825 -greomni "Z:\task_1537786638\build\application\firefox\omni.ja" -appomni "Z:\task_1537786638\build\application\firefox\browser\omni.ja" -appdir "Z:\task_1537786638\build\application\firefox\browser" - 3196 "\\.\pipe\gecko-crash-server-pipe.3196" 2972 tab) 11:23:47 INFO - ==> process 3196 launched child process 7596 ("Z:\task_1537786638\build\application\firefox\firefox.exe" -contentproc --channel="3196.12.223626118\2098101904" -childID 3 -isForBrowser -prefsHandle 2528 -prefMapHandle 2644 -prefsLen 191 -prefMapSize 187592 -schedulerPrefs 0001,2 -parentBuildID 20180924094825 -greomni "Z:\task_1537786638\build\application\firefox\omni.ja" -appomni "Z:\task_1537786638\build\application\firefox\browser\omni.ja" -appdir "Z:\task_1537786638\build\application\firefox\browser" - 3196 "\\.\pipe\gecko-crash-server-pipe.3196" 1780 tab) 11:23:47 INFO - ==> process 3196 launched child process 6344 ("Z:\task_1537786638\build\application\firefox\plugin-container.exe" --channel="3196.19.1838875392\1275939334" "Z:\task_1537786638\build\application\firefox\gmp-clearkey\0.1" -greomni "Z:\task_1537786638\build\application\firefox\omni.ja" -appomni "Z:\task_1537786638\build\application\firefox\browser\omni.ja" -appdir "Z:\task_1537786638\build\application\firefox\browser" - 3196 "\\.\pipe\gecko-crash-server-pipe.3196" 4004 geckomediaplugin) 11:23:47 INFO - zombiecheck | Checking for orphan process with PID: 6008 11:23:47 INFO - zombiecheck | Checking for orphan process with PID: 6248 11:23:47 INFO - zombiecheck | Checking for orphan process with PID: 6344 11:23:47 INFO - zombiecheck | Checking for orphan process with PID: 7596 11:23:47 INFO - mozcrash Copy/paste: Z:\task_1537786638\build\win32-minidump_stackwalk.exe c:\users\task_1537786638\appdata\local\temp\tmpl4rdyq.mozrunner\minidumps\e320d84c-afad-4559-9ddb-1ecf8644a679.dmp Z:\task_1537786638\build\symbols 11:23:54 INFO - mozcrash Saved minidump as Z:\task_1537786638\build\blobber_upload_dir\e320d84c-afad-4559-9ddb-1ecf8644a679.dmp 11:23:54 INFO - mozcrash Saved app info as Z:\task_1537786638\build\blobber_upload_dir\e320d84c-afad-4559-9ddb-1ecf8644a679.extra 11:23:54 INFO - PROCESS-CRASH | Main app process exited normally | application crashed [@ mozilla::BlockingResourceBase::CheckAcquire()] 11:23:54 INFO - Crash dump filename: c:\users\task_1537786638\appdata\local\temp\tmpl4rdyq.mozrunner\minidumps\e320d84c-afad-4559-9ddb-1ecf8644a679.dmp 11:23:54 INFO - Operating system: Windows NT 11:23:54 INFO - 10.0.15063 11:23:54 INFO - CPU: amd64 11:23:54 INFO - family 6 model 63 stepping 2 11:23:54 INFO - 8 CPUs 11:23:54 INFO - 11:23:54 INFO - GPU: UNKNOWN 11:23:54 INFO - 11:23:54 INFO - Crash reason: EXCEPTION_ACCESS_VIOLATION_READ 11:23:54 INFO - Crash address: 0xf8 11:23:54 INFO - Assertion: Unknown assertion type 0x00000000 11:23:54 INFO - Process uptime: 5 seconds 11:23:54 INFO - 11:23:54 INFO - Thread 1 (crashed) 11:23:54 INFO - 0 xul.dll!mozilla::BlockingResourceBase::CheckAcquire() [BlockingResourceBase.cpp:095ec59a8800360154df6668394803e5a84d031c : 273 + 0x0] 11:23:54 INFO - rax = 0x00007f7d37e9f273 rdx = 0x00000000000000b8 11:23:54 INFO - rcx = 0x00000000000000e8 rbx = 0x000000ce49bff180 11:23:54 INFO - rsi = 0x000002562fc04260 rdi = 0x00000000000000b8 11:23:54 INFO - rbp = 0x000000ce49bff039 rsp = 0x000000ce49bfeff0 11:23:54 INFO - r8 = 0x000002562fc04260 r9 = 0x0000000000000028 11:23:54 INFO - r10 = 0x00007ffa1b343c48 r11 = 0x000000ce49bff190 11:23:54 INFO - r12 = 0x0000000000000000 r13 = 0x0000000000000001 11:23:54 INFO - r14 = 0x00000000000000b8 r15 = 0x0000000000000000 11:23:54 INFO - rip = 0x00007ffa1664e111 11:23:54 INFO - Found by: given as instruction pointer in context 11:23:54 INFO - 1 xul.dll!mozilla::OffTheBooksMutex::Lock() [BlockingResourceBase.cpp:095ec59a8800360154df6668394803e5a84d031c : 384 + 0x9] 11:23:54 INFO - rbx = 0x000000ce49bff180 rbp = 0x000000ce49bff039 11:23:54 INFO - rsp = 0x000000ce49bff0a0 r12 = 0x0000000000000000 11:23:54 INFO - r13 = 0x0000000000000001 r14 = 0x00000000000000b8 11:23:54 INFO - r15 = 0x0000000000000000 rip = 0x00007ffa1665558a 11:23:54 INFO - Found by: call frame info 11:23:54 INFO - 2 xul.dll!void mozilla::Maybe<mozilla::BaseAutoLock<mozilla::Mutex &> >::emplace<mozilla::Mutex &>(class mozilla::Mutex & const) [Maybe.h:095ec59a8800360154df6668394803e5a84d031c : 599 + 0x12] 11:23:54 INFO - rbx = 0x000000ce49bff180 rbp = 0x000000ce49bff039 11:23:54 INFO - rsp = 0x000000ce49bff0d0 r12 = 0x0000000000000000 11:23:54 INFO - r13 = 0x0000000000000001 r14 = 0x00000000000000b8 11:23:54 INFO - r15 = 0x0000000000000000 rip = 0x00007ffa16bdab09 11:23:54 INFO - Found by: call frame info 11:23:54 INFO - 3 xul.dll!mozilla::ipc::IToplevelProtocol::ToplevelState::GetMessageEventTarget(IPC::Message const &) [ProtocolUtils.cpp:095ec59a8800360154df6668394803e5a84d031c : 982 + 0x5] 11:23:54 INFO - rbx = 0x000000ce49bff180 rbp = 0x000000ce49bff039 11:23:54 INFO - rsp = 0x000000ce49bff100 r12 = 0x0000000000000000 11:23:54 INFO - r13 = 0x0000000000000001 r14 = 0x00000000000000b8 11:23:54 INFO - r15 = 0x0000000000000000 rip = 0x00007ffa16be3942 11:23:54 INFO - Found by: call frame info 11:23:54 INFO - 4 xul.dll!mozilla::ipc::MessageChannel::MessageTask::Post() [MessageChannel.cpp:095ec59a8800360154df6668394803e5a84d031c : 2084 + 0x19] 11:23:54 INFO - rbx = 0x000000ce49bff180 rbp = 0x000000ce49bff039 11:23:54 INFO - rsp = 0x000000ce49bff1e0 r12 = 0x0000000000000000 11:23:54 INFO - r13 = 0x0000000000000001 r14 = 0x00000000000000b8 11:23:54 INFO - r15 = 0x0000000000000000 rip = 0x00007ffa16be7ca0 11:23:54 INFO - Found by: call frame info 11:23:54 INFO - 5 xul.dll!mozilla::ipc::MessageChannel::OnMessageReceivedFromLink(IPC::Message &&) [MessageChannel.cpp:095ec59a8800360154df6668394803e5a84d031c : 1384 + 0xd] 11:23:54 INFO - rbx = 0x000000ce49bff180 rbp = 0x000000ce49bff039 11:23:54 INFO - rsp = 0x000000ce49bff210 r12 = 0x0000000000000000 11:23:54 INFO - r13 = 0x0000000000000001 r14 = 0x00000000000000b8 11:23:54 INFO - r15 = 0x0000000000000000 rip = 0x00007ffa16be6536 11:23:54 INFO - Found by: call frame info 11:23:54 INFO - 6 xul.dll!mozilla::ipc::ProcessLink::OnMessageReceived(IPC::Message &&) [MessageLink.cpp:095ec59a8800360154df6668394803e5a84d031c : 289 + 0xc] 11:23:54 INFO - rbx = 0x000000ce49bff180 rbp = 0x000000ce49bff039 11:23:54 INFO - rsp = 0x000000ce49bff510 r12 = 0x0000000000000000 11:23:54 INFO - r13 = 0x0000000000000001 r14 = 0x00000000000000b8 11:23:54 INFO - r15 = 0x0000000000000000 rip = 0x00007ffa16be5c32 11:23:54 INFO - Found by: call frame info 11:23:54 INFO - 7 xul.dll!IPC::Channel::ChannelImpl::ProcessIncomingMessages(base::MessagePumpForIO::IOContext *,unsigned long) [ipc_channel_win.cc:095ec59a8800360154df6668394803e5a84d031c : 440 + 0xd] 11:23:54 INFO - rbx = 0x000000ce49bff180 rbp = 0x000000ce49bff039 11:23:54 INFO - rsp = 0x000000ce49bff550 r12 = 0x0000000000000000 11:23:54 INFO - r13 = 0x0000000000000001 r14 = 0x00000000000000b8 11:23:54 INFO - r15 = 0x0000000000000000 rip = 0x00007ffa16b9e072 11:23:54 INFO - Found by: call frame info 11:23:54 INFO - 8 xul.dll!IPC::Channel::ChannelImpl::OnIOCompleted(base::MessagePumpForIO::IOContext *,unsigned long,unsigned long) [ipc_channel_win.cc:095ec59a8800360154df6668394803e5a84d031c : 544 + 0x15] 11:23:54 INFO - rbx = 0x000000ce49bff180 rbp = 0x000000ce49bff039 11:23:54 INFO - rsp = 0x000000ce49bff630 r12 = 0x0000000000000000 11:23:54 INFO - r13 = 0x0000000000000001 r14 = 0x00000000000000b8 11:23:54 INFO - r15 = 0x0000000000000000 rip = 0x00007ffa16b9d96b 11:23:54 INFO - Found by: call frame info 11:23:54 INFO - 9 xul.dll!base::MessagePumpForIO::WaitForIOCompletion(unsigned long,base::MessagePumpForIO::IOHandler *) [message_pump_win.cc:095ec59a8800360154df6668394803e5a84d031c : 492 + 0xe] 11:23:54 INFO - rbx = 0x000000ce49bff180 rbp = 0x000000ce49bff039 11:23:54 INFO - rsp = 0x000000ce49bff680 r12 = 0x0000000000000000 11:23:54 INFO - r13 = 0x0000000000000001 r14 = 0x00000000000000b8 11:23:54 INFO - r15 = 0x0000000000000000 rip = 0x00007ffa16b9923e 11:23:54 INFO - Found by: call frame info 11:23:54 INFO - 10 xul.dll!base::MessagePumpForIO::DoRunLoop() [message_pump_win.cc:095ec59a8800360154df6668394803e5a84d031c : 457 + 0x8] 11:23:54 INFO - rbx = 0x000000ce49bff180 rbp = 0x000000ce49bff039 11:23:54 INFO - rsp = 0x000000ce49bff6f0 r12 = 0x0000000000000000 11:23:54 INFO - r13 = 0x0000000000000001 r14 = 0x00000000000000b8 11:23:54 INFO - r15 = 0x0000000000000000 rip = 0x00007ffa16b98803 11:23:54 INFO - Found by: call frame info 11:23:54 INFO - 11 xul.dll!base::MessagePumpWin::RunWithDispatcher(base::MessagePump::Delegate *,base::MessagePumpWin::Dispatcher *) [message_pump_win.cc:095ec59a8800360154df6668394803e5a84d031c : 56 + 0x6] 11:23:54 INFO - rbx = 0x000000ce49bff180 rbp = 0x000000ce49bff039 11:23:54 INFO - rsp = 0x000000ce49bff720 r12 = 0x0000000000000000 11:23:54 INFO - r13 = 0x0000000000000001 r14 = 0x00000000000000b8 11:23:54 INFO - r15 = 0x0000000000000000 rip = 0x00007ffa16b99042 11:23:54 INFO - Found by: call frame info 11:23:54 INFO - 12 xul.dll!MessageLoop::RunHandler() [message_loop.cc:095ec59a8800360154df6668394803e5a84d031c : 318 + 0x5] 11:23:54 INFO - rbx = 0x000000ce49bff180 rbp = 0x000000ce49bff039 11:23:54 INFO - rsp = 0x000000ce49bff770 r12 = 0x0000000000000000 11:23:54 INFO - r13 = 0x0000000000000001 r14 = 0x00000000000000b8 11:23:54 INFO - r15 = 0x0000000000000000 rip = 0x00007ffa16bafcae 11:23:54 INFO - Found by: call frame info 11:23:54 INFO - 13 xul.dll!MessageLoop::Run() [message_loop.cc:095ec59a8800360154df6668394803e5a84d031c : 298 + 0x8] 11:23:54 INFO - rbx = 0x000000ce49bff180 rbp = 0x000000ce49bff039 11:23:54 INFO - rsp = 0x000000ce49bff7a0 r12 = 0x0000000000000000 11:23:54 INFO - r13 = 0x0000000000000001 r14 = 0x00000000000000b8 11:23:54 INFO - r15 = 0x0000000000000000 rip = 0x00007ffa16baf862 11:23:54 INFO - Found by: call frame info 11:23:54 INFO - 14 xul.dll!base::Thread::ThreadMain() [thread.cc:095ec59a8800360154df6668394803e5a84d031c : 181 + 0xa] 11:23:54 INFO - rbx = 0x000000ce49bff180 rbp = 0x000000ce49bff039 11:23:54 INFO - rsp = 0x000000ce49bff7f0 r12 = 0x0000000000000000 11:23:54 INFO - r13 = 0x0000000000000001 r14 = 0x00000000000000b8 11:23:54 INFO - r15 = 0x0000000000000000 rip = 0x00007ffa16bb0dbd 11:23:54 INFO - Found by: call frame info 11:23:54 INFO - 15 xul.dll!static unsigned long `anonymous namespace'::ThreadFunc(void *) [platform_thread_win.cc:095ec59a8800360154df6668394803e5a84d031c : 34 + 0x9] 11:23:54 INFO - rbx = 0x000000ce49bff180 rbp = 0x000000ce49bff039 11:23:54 INFO - rsp = 0x000000ce49bff990 r12 = 0x0000000000000000 11:23:54 INFO - r13 = 0x0000000000000001 r14 = 0x00000000000000b8 11:23:54 INFO - r15 = 0x0000000000000000 rip = 0x00007ffa16b99bcf 11:23:54 INFO - Found by: call frame info 11:23:54 INFO - 16 kernel32.dll!SymCache::FCheckCompilerVersion(unsigned long,unsigned long) + 0xc4 11:23:54 INFO - rbx = 0x000000ce49bff180 rbp = 0x000000ce49bff039 11:23:54 INFO - rsp = 0x000000ce49bff9c0 r12 = 0x0000000000000000 11:23:54 INFO - r13 = 0x0000000000000001 r14 = 0x00000000000000b8 11:23:54 INFO - r15 = 0x0000000000000000 rip = 0x00007ffa49df2774 11:23:54 INFO - Found by: call frame info
Comment 1•6 years ago
|
||
seems to be eme related.
Comment 3•6 years ago
|
||
It looks like IToplevelProtocol::mState is null, while the channel is polling (and handling messages for) its channel on the I/O thread. The only time it should be possible for mState to be null is during the IToplevelProtocol destructor, which runs on the actor thread. Which means this is potentially a use-after-free; see also bug 1478849.
Comment 4•6 years ago
|
||
The main thread stack is informative: 33 xul.dll!static void MOZ_ReportCrash(const char *, const char *, int) [Assertions.h:095ec59a8800360154df6668394803e5a84d031c : 178 + 0x13] 34 xul.dll!mozilla::ipc::MessageChannel::Clear() [MessageChannel.cpp:095ec59a8800360154df6668394803e5a84d031c : 813 + 0x14] 35 xul.dll!mozilla::ipc::MessageChannel::~MessageChannel() [MessageChannel.cpp:095ec59a8800360154df6668394803e5a84d031c : 681 + 0x8] 36 xul.dll!mozilla::ipc::IToplevelProtocol::ToplevelState::~ToplevelState() + 0x19 37 xul.dll!mozilla::ipc::IToplevelProtocol::ToplevelState::`scalar deleting destructor'(unsigned int) + 0x14 38 xul.dll!mozilla::ipc::IToplevelProtocol::~IToplevelProtocol() [ProtocolUtils.cpp:095ec59a8800360154df6668394803e5a84d031c : 689 + 0x18] 39 xul.dll!mozilla::gmp::GMPContentParent::Release() [GMPContentParent.h:095ec59a8800360154df6668394803e5a84d031c : 25 + 0x4a] 40 xul.dll!nsTHashtable<nsBaseHashtableET<nsUint64HashKey,RefPtr<mozilla::gmp::GMPContentParent> > >::s_ClearEntry(PLDHashTable *,PLDHashEntryHdr *) [nsTHashtable.h:095ec59a8800360154df6668394803e5a84d031c : 468 + 0xe] 41 xul.dll!PLDHashTable::~PLDHashTable() [PLDHashTable.cpp:095ec59a8800360154df6668394803e5a84d031c : 335 + 0xc] 42 xul.dll!mozilla::gmp::GMPServiceChild::~GMPServiceChild() [GMPServiceChild.cpp:095ec59a8800360154df6668394803e5a84d031c : 460 + 0xc] 43 xul.dll!mozilla::gmp::GMPServiceChild::`scalar deleting destructor'(unsigned int) + 0x14 44 xul.dll!mozilla::gmp::GeckoMediaPluginServiceChild::Observe(nsISupports *,char const *,char16_t const *) [GMPServiceChild.cpp:095ec59a8800360154df6668394803e5a84d031c : 387 + 0x22] 45 xul.dll!nsObserverList::NotifyObservers(nsISupports *,char const *,char16_t const *) [nsObserverList.cpp:095ec59a8800360154df6668394803e5a84d031c : 111 + 0x19] 46 xul.dll!nsObserverService::NotifyObservers(nsISupports *,char const *,char16_t const *) [nsObserverService.cpp:095ec59a8800360154df6668394803e5a84d031c : 295 + 0x11] 47 xul.dll!mozilla::ShutdownXPCOM(nsIServiceManager *) [XPCOMInit.cpp:095ec59a8800360154df6668394803e5a84d031c : 905 + 0x1e] It's in the middle of intentionally crashing, here: https://searchfox.org/mozilla-central/rev/881a3c5664ede5e08ee986d76433bc5c4b5680e6/ipc/glue/MessageChannel.cpp#813 MOZ_CRASH("MessageChannel destroyed without being closed " \ "(mChannelState == ChannelConnected)."); Which is what I said in comment #3. Similar bugs with other process types: bug 1473972 (NPAPI), bug 1372850 (GPU). These all seem to be caused by browser shutdown. And then there's ContentParent::KillHard, which was triggering this assertion until it was silenced in bug 1423261; I need to look at that some more to see if that might also be a use-after-free.
Comment 5•6 years ago
|
||
The assertion in MessageChannel::Clear should at least make this harder to exploit, because we don't get to the point of deallocating the MessageChannel or ToplevelState or IToplevelProtocol. We're still accessing a partially destroyed object, from a thread other than the one destroying it, so I don't want to call that “safe”, but it's not entirely a use-after-free, I think. The ContentParent::KillHard case is more worrying, because the assertion was turned off in that case, so the objects will be freed. If it's possible for a message to be received on the socket (even if the process is dead, a message could still be in the OS-level buffers), that's probably a use-after-free, and there may be other reasons why the I/O thread could try to access it. But one thing that makes ContentParent a little different is that it's refcounted and registered as an observer for various things, with strong references; those are removed in ActorDestroy, and (from looking at the autogenerated code that calls it) the channel should already be closed at that point, regardless of how the process exited. So I'm still not sure I understand what was going on in bug 1423261.
Updated•6 years ago
|
Updated•6 years ago
|
Comment 6•6 years ago
|
||
I spent some time looking through the crash reports. The GPU case is rare but does occasionally happen, and does happen on Windows (i.e., isn't the result of an unsupported pref-flip on Mac, which is why bug 1372850 was WONTFIXed): bp-0bfdc18f-7fb9-4d98-a5ad-b03ca0181002 (note that GPUChild is created in the parent and content processes, while GPUParent is in the GPU process, and the parent-process GPUChild seems to use the I/O thread as its actor thread, which isn't what I would've done but apparently works in practice). The NPAPI case (bug 1473972) is in the plugin child process, not the parent; I still don't know what exactly is going on there (and in fact there's a lot I don't know yet about how top-level actor shutdown works). It's also not Mac-only, but it is mostly Mac and very low frequency on Windows, which is odd. And the GMP case, which this bug was initially filed for, I haven't seen on crash-stats yet. In the above cases, as mentioned, the “destroyed without being closed” assertion should mostly protect us security-wise, and they're low-frequency as crashes. As for the ContentParent case, those seem to all happen during browser shutdown, which I think explains how the ContentParent becomes unreferenced while the channel is still open (see also bug 968031 for why it's managed by the CC, which is why the destruction is (usually?) deferred until the shutdown CC). This could possibly be fixed by reordering shutdown to stop the I/O thread earlier, but what else would that impact? Alternately, ContentParent could deliberately leak itself in this case: hold a circular reference not reported to the CC and clear it only in ActorDestroy.
Comment 7•6 years ago
|
||
This bug has information about the general family of “channel destroyed without being closed” bugs; the non-KillHard cases might not be exploitable because we crash first and the memory is never freed, but I'm reasonably sure they're at least undefined behavior, and there could be something worse hiding there, and there isn't a simple fix for those. So I'm going to file a separate bug for the KillHard case and quietly avoid mentioning the other issues.
Updated•6 years ago
|
Comment 8•6 years ago
|
||
The other cases probably are exploitable. Consider the finding in comment #3: this crashed because the UniquePtr was already null, but if it races that instead, and reads the original pointer value but uses it after the memory is freed and reallocated, then the attacker can control everything derived from instance variables starting at [1], which includes a vtable call (GetConstructedEventTarget or GetSpecificMessageEventTarget). Manipulating scheduling to make that work may be difficult, although some of these cases (like for plugin processes) aren't necessarily tied to browser shutdown. [1] https://searchfox.org/mozilla-central/rev/3d989d097fa35afe19e814c6d5bc2f2bf10867e1/ipc/glue/ProtocolUtils.cpp#982 I have a vague idea for how to fix this, although it may not survive contact with reality: 1. MessageChannel becomes thread-safe refcounted and shared between the top-level actor and the transport; destruction might need to send runnables to the actor thread if it's not there 2. MessageChannel::mWorkerLoop either becomes a thread-safe weak pointer (which means WorkerLoop becomes thread-safe refcounted) or else *all* accesses to must hold mMonitor (to synchronize with WillDestroyCurrentMessageLoop). 3a. Either MessageChannel::mListener (the IToplevelProtocol*) becomes another thread-safe weak pointer, which means replacing every top-level actor's lifetime handling (usually non-thread-safe refcounting or unique ownership) with thread-safe recounting 3b. Or ToplevelState::mEventTargetMap moves into MessageChannel, and that should be enough to let MessageChannel::mListener to be a non-thread-safe WeakPtr accessed only from the actor thread (which does not change the actor's lifetimes) 4. Test cases where the channel isn't closed properly and outlives the top-level actor and where the process continues to exist and send messages.
Comment 9•6 years ago
|
||
For reference, bug 1398070 is another case where we were warned that something was wrong with the lifetimes here, and we reacted by silencing the assertion because we apparently didn't realize the full implications.
Comment 14•5 years ago
|
||
The signature in Bug 1547654 which is a dupe of this bug is now the top crash on 68 nightly, even though it appears there are only 64 installs/3954 crashes.
Updated•5 years ago
|
Comment 16•5 years ago
|
||
The signature in Bug 1547654 which is a dupe of this bug is now the top crash on 68 nightly, even though it appears there are only 64 installs/3954 crashes.
It looks like there's only 3 machines involved, and 97% of the crashes are from an iMac in Finland, which I presume is running some script or has a really persistent owner. Might be Flash related.
Comment 17•5 years ago
|
||
Adding a signature seen during nightly triage, all on Mac 10.15. This machine appears to be a MBP, and there are 2 unique users crashing with the same Moz Crash reason as seen in Comment 4.
Comment 18•5 years ago
|
||
We talked about this at the IPC meeting a few weeks ago. The current idea is to make the Unsound_IsClosed
sound, using locking, and move the check up into the IToplevelProtocol
destructor to crash before the object is torn down enough to cause undefined behavior if there's a race. We'd still crash, but it won't be security sensitive anymore, and the patch should be less invasive (more upliftable) than trying to fix this fully.
Longer-term I guess we should move away from things caring about destruction order like this, especially in ways that cause memory unsafety (or require deliberate crashing to avoid the unsafety): clearly we don't have enough testing to even understand why those invariants are being violated, let alone catch it before release.
Updated•5 years ago
|
Updated•5 years ago
|
Comment 21•5 years ago
|
||
I don't understand why bug 1498648 was considered to be related to this.
Comment 24•5 years ago
|
||
I've been looking at this again, and I'm wondering if we could just (in addition to making the IsClosed
flag atomic) call MessageChannel::Clear
in the IToplevelProtocol
destructor, before dispatching the runnable to delete the channel… or, to similar effect, dispatch the deletion to the current event loop and then re-dispatch to the I/O thread, to ensure it happens-after the rest of the IToplevelProtocol
/MessageChannel
destructor (which, in the bad case, will crash).
Comment 27•4 years ago
|
||
I've figured out a reliable "STR": comment out the Clear()
call in MessageChannel::NotifyMaybeChannelError
, then kill a content process (e.g. with the shell command kill
). I combined this with a sleep in ~IToplevelProtocol
to ensure the use would happen after the free.
With this, I've discovered that the second idea from comment #24 doesn't work, because VRManagerParent
seems to be destroyed while the actor thread is shutting down and it's not possible to self-dispatch when the nsThread::mEventTarget
has already been cleared.
Comment 28•4 years ago
|
||
…and calling MessageChannel::Clear
directly from ~IToplevelProtocol
has the slight problem that HangMonitorParent
is deleted on the main thread and not its own actor thread. This is a problem if it means that MessageChannel::Clear
does the mWorkerLoop->RemoveDestructionObserver
call on the wrong thread.
That seems to not happen normally because it would already have been cleared by the error handling that I commented out to reproduce the bug, but I wonder whether that's strictly necessary or if it happens rarely in actual use; the thread check in RemoveDestructionObserver
is debug-only so we wouldn't notice.
Comment 29•4 years ago
|
||
(In reply to Jed Davis [:jld] ⟨⏰|UTC-7⟩ ⟦he/him⟧ from comment #28)
HangMonitorParent
is deleted on the main thread and not its own actor thread.
Also the PBackground
parent actor.
I tried commenting out the debug assertion in RemoveDestructionObserver
to see what would happen, and now I have a new crash that doesn't make much sense: a ContentParent
where mTrans
is never set to non-null, and the channel pointed to by the ProcessLink
is freed by the GeckoChildProcessHost
destructor and then used by MessageChannel::Clear
when the ContentParent
is destroyed.
Comment 30•4 years ago
|
||
From looking into bug 1608466, I think there is a fundamental race with this method on the shutdown of many toplevel actors.
When an actor is destroyed, it dispatches a message to the IO thread to destroy the IPC::Channel
transport object in it's destructor (either directly within ~IToplevelProtocol
, or through the GeckoChildProcessHost::Destroy
method). This destructor is called before the destructors of base classes & members, so the message is dispatched before the ~MessageChannel
destructor is run. The ~MessageChannel
destructor will then try to call Unsound_IsClosed
here, but that method calls through a raw pointer to access a property of IPC::Channel
, which may have already been freed on the IO thread.
So, in effect, we have two tightly connected objects, the MessageChannel
and IPC::Channel
, which have completely unrelated lifecycles and don't keep track of one another, so it's somewhat unsurprising that we frequently crash here.
We probably want to make the lifecycle relationship between the ProcessLink
object and the IPC::Channel
more explicit, perhaps by transferring ownership of the IPC::Channel
object into the MessageChannel
when we create it, and away from the ProcessHost
(or IPDL actor via ManagedEndpoint) which initially opened it.
Updated•4 years ago
|
Comment 33•2 years ago
|
||
This is a few years old, and we've had lots of improvements by Nika to channel lifetimes in the meanwhile.
Updated•10 months ago
|
Description
•