Closed Bug 1493656 Opened 6 years ago Closed 2 years ago

"MessageChannel destroyed without being closed" / Intermittent PROCESS-CRASH | Main app process exited normally | application crashed [@ mozilla::BlockingResourceBase::CheckAcquire()]

Categories

(Core :: IPC, defect, P2)

defect

Tracking

()

RESOLVED WORKSFORME

People

(Reporter: intermittent-bug-filer, Unassigned)

References

Details

(Keywords: crash, intermittent-failure, sec-moderate)

Crash Data

Filed by: ccoroiu [at] mozilla.com

https://treeherder.mozilla.org/logviewer.html#?job_id=201165754&repo=mozilla-central

https://queue.taskcluster.net/v1/task/KdCeOh5KTK-TvWL6dz5VZQ/runs/0/artifacts/public/logs/live_backing.log

11:23:47     INFO - TEST-INFO | Main app process: exit 0
11:23:47     INFO - runtests.py | Application ran for: 0:00:09.841000
11:23:47     INFO - zombiecheck | Reading PID log: c:\users\task_1537786638\appdata\local\temp\tmpu7m26rpidlog
11:23:47     INFO - ==> process 3196 launched child process 6008 ("Z:\task_1537786638\build\application\firefox\firefox.exe" -contentproc --channel="3196.0.1810225244\218909907" -childID 1 -isForBrowser -prefsHandle 2660 -prefMapHandle 2656 -prefsLen 1 -prefMapSize 187592 -schedulerPrefs 0001,2 -parentBuildID 20180924094825 -greomni "Z:\task_1537786638\build\application\firefox\omni.ja" -appomni "Z:\task_1537786638\build\application\firefox\browser\omni.ja" -appdir "Z:\task_1537786638\build\application\firefox\browser" - 3196 "\\.\pipe\gecko-crash-server-pipe.3196" 2736 tab)
11:23:47     INFO - ==> process 3196 launched child process 6248 ("Z:\task_1537786638\build\application\firefox\firefox.exe" -contentproc --channel="3196.6.1862187654\253409131" -childID 2 -isForBrowser -prefsHandle 2928 -prefMapHandle 2004 -prefsLen 151 -prefMapSize 187592 -schedulerPrefs 0001,2 -parentBuildID 20180924094825 -greomni "Z:\task_1537786638\build\application\firefox\omni.ja" -appomni "Z:\task_1537786638\build\application\firefox\browser\omni.ja" -appdir "Z:\task_1537786638\build\application\firefox\browser" - 3196 "\\.\pipe\gecko-crash-server-pipe.3196" 2972 tab)
11:23:47     INFO - ==> process 3196 launched child process 7596 ("Z:\task_1537786638\build\application\firefox\firefox.exe" -contentproc --channel="3196.12.223626118\2098101904" -childID 3 -isForBrowser -prefsHandle 2528 -prefMapHandle 2644 -prefsLen 191 -prefMapSize 187592 -schedulerPrefs 0001,2 -parentBuildID 20180924094825 -greomni "Z:\task_1537786638\build\application\firefox\omni.ja" -appomni "Z:\task_1537786638\build\application\firefox\browser\omni.ja" -appdir "Z:\task_1537786638\build\application\firefox\browser" - 3196 "\\.\pipe\gecko-crash-server-pipe.3196" 1780 tab)
11:23:47     INFO - ==> process 3196 launched child process 6344 ("Z:\task_1537786638\build\application\firefox\plugin-container.exe" --channel="3196.19.1838875392\1275939334" "Z:\task_1537786638\build\application\firefox\gmp-clearkey\0.1" -greomni "Z:\task_1537786638\build\application\firefox\omni.ja" -appomni "Z:\task_1537786638\build\application\firefox\browser\omni.ja" -appdir "Z:\task_1537786638\build\application\firefox\browser" - 3196 "\\.\pipe\gecko-crash-server-pipe.3196" 4004 geckomediaplugin)
11:23:47     INFO - zombiecheck | Checking for orphan process with PID: 6008
11:23:47     INFO - zombiecheck | Checking for orphan process with PID: 6248
11:23:47     INFO - zombiecheck | Checking for orphan process with PID: 6344
11:23:47     INFO - zombiecheck | Checking for orphan process with PID: 7596
11:23:47     INFO - mozcrash Copy/paste: Z:\task_1537786638\build\win32-minidump_stackwalk.exe c:\users\task_1537786638\appdata\local\temp\tmpl4rdyq.mozrunner\minidumps\e320d84c-afad-4559-9ddb-1ecf8644a679.dmp Z:\task_1537786638\build\symbols
11:23:54     INFO - mozcrash Saved minidump as Z:\task_1537786638\build\blobber_upload_dir\e320d84c-afad-4559-9ddb-1ecf8644a679.dmp
11:23:54     INFO - mozcrash Saved app info as Z:\task_1537786638\build\blobber_upload_dir\e320d84c-afad-4559-9ddb-1ecf8644a679.extra
11:23:54     INFO - PROCESS-CRASH | Main app process exited normally | application crashed [@ mozilla::BlockingResourceBase::CheckAcquire()]
11:23:54     INFO - Crash dump filename: c:\users\task_1537786638\appdata\local\temp\tmpl4rdyq.mozrunner\minidumps\e320d84c-afad-4559-9ddb-1ecf8644a679.dmp
11:23:54     INFO - Operating system: Windows NT
11:23:54     INFO -                   10.0.15063 
11:23:54     INFO - CPU: amd64
11:23:54     INFO -      family 6 model 63 stepping 2
11:23:54     INFO -      8 CPUs
11:23:54     INFO - 
11:23:54     INFO - GPU: UNKNOWN
11:23:54     INFO - 
11:23:54     INFO - Crash reason:  EXCEPTION_ACCESS_VIOLATION_READ
11:23:54     INFO - Crash address: 0xf8
11:23:54     INFO - Assertion: Unknown assertion type 0x00000000
11:23:54     INFO - Process uptime: 5 seconds
11:23:54     INFO - 
11:23:54     INFO - Thread 1 (crashed)
11:23:54     INFO -  0  xul.dll!mozilla::BlockingResourceBase::CheckAcquire() [BlockingResourceBase.cpp:095ec59a8800360154df6668394803e5a84d031c : 273 + 0x0]
11:23:54     INFO -     rax = 0x00007f7d37e9f273   rdx = 0x00000000000000b8
11:23:54     INFO -     rcx = 0x00000000000000e8   rbx = 0x000000ce49bff180
11:23:54     INFO -     rsi = 0x000002562fc04260   rdi = 0x00000000000000b8
11:23:54     INFO -     rbp = 0x000000ce49bff039   rsp = 0x000000ce49bfeff0
11:23:54     INFO -      r8 = 0x000002562fc04260    r9 = 0x0000000000000028
11:23:54     INFO -     r10 = 0x00007ffa1b343c48   r11 = 0x000000ce49bff190
11:23:54     INFO -     r12 = 0x0000000000000000   r13 = 0x0000000000000001
11:23:54     INFO -     r14 = 0x00000000000000b8   r15 = 0x0000000000000000
11:23:54     INFO -     rip = 0x00007ffa1664e111
11:23:54     INFO -     Found by: given as instruction pointer in context
11:23:54     INFO -  1  xul.dll!mozilla::OffTheBooksMutex::Lock() [BlockingResourceBase.cpp:095ec59a8800360154df6668394803e5a84d031c : 384 + 0x9]
11:23:54     INFO -     rbx = 0x000000ce49bff180   rbp = 0x000000ce49bff039
11:23:54     INFO -     rsp = 0x000000ce49bff0a0   r12 = 0x0000000000000000
11:23:54     INFO -     r13 = 0x0000000000000001   r14 = 0x00000000000000b8
11:23:54     INFO -     r15 = 0x0000000000000000   rip = 0x00007ffa1665558a
11:23:54     INFO -     Found by: call frame info
11:23:54     INFO -  2  xul.dll!void mozilla::Maybe<mozilla::BaseAutoLock<mozilla::Mutex &> >::emplace<mozilla::Mutex &>(class mozilla::Mutex & const) [Maybe.h:095ec59a8800360154df6668394803e5a84d031c : 599 + 0x12]
11:23:54     INFO -     rbx = 0x000000ce49bff180   rbp = 0x000000ce49bff039
11:23:54     INFO -     rsp = 0x000000ce49bff0d0   r12 = 0x0000000000000000
11:23:54     INFO -     r13 = 0x0000000000000001   r14 = 0x00000000000000b8
11:23:54     INFO -     r15 = 0x0000000000000000   rip = 0x00007ffa16bdab09
11:23:54     INFO -     Found by: call frame info
11:23:54     INFO -  3  xul.dll!mozilla::ipc::IToplevelProtocol::ToplevelState::GetMessageEventTarget(IPC::Message const &) [ProtocolUtils.cpp:095ec59a8800360154df6668394803e5a84d031c : 982 + 0x5]
11:23:54     INFO -     rbx = 0x000000ce49bff180   rbp = 0x000000ce49bff039
11:23:54     INFO -     rsp = 0x000000ce49bff100   r12 = 0x0000000000000000
11:23:54     INFO -     r13 = 0x0000000000000001   r14 = 0x00000000000000b8
11:23:54     INFO -     r15 = 0x0000000000000000   rip = 0x00007ffa16be3942
11:23:54     INFO -     Found by: call frame info
11:23:54     INFO -  4  xul.dll!mozilla::ipc::MessageChannel::MessageTask::Post() [MessageChannel.cpp:095ec59a8800360154df6668394803e5a84d031c : 2084 + 0x19]
11:23:54     INFO -     rbx = 0x000000ce49bff180   rbp = 0x000000ce49bff039
11:23:54     INFO -     rsp = 0x000000ce49bff1e0   r12 = 0x0000000000000000
11:23:54     INFO -     r13 = 0x0000000000000001   r14 = 0x00000000000000b8
11:23:54     INFO -     r15 = 0x0000000000000000   rip = 0x00007ffa16be7ca0
11:23:54     INFO -     Found by: call frame info
11:23:54     INFO -  5  xul.dll!mozilla::ipc::MessageChannel::OnMessageReceivedFromLink(IPC::Message &&) [MessageChannel.cpp:095ec59a8800360154df6668394803e5a84d031c : 1384 + 0xd]
11:23:54     INFO -     rbx = 0x000000ce49bff180   rbp = 0x000000ce49bff039
11:23:54     INFO -     rsp = 0x000000ce49bff210   r12 = 0x0000000000000000
11:23:54     INFO -     r13 = 0x0000000000000001   r14 = 0x00000000000000b8
11:23:54     INFO -     r15 = 0x0000000000000000   rip = 0x00007ffa16be6536
11:23:54     INFO -     Found by: call frame info
11:23:54     INFO -  6  xul.dll!mozilla::ipc::ProcessLink::OnMessageReceived(IPC::Message &&) [MessageLink.cpp:095ec59a8800360154df6668394803e5a84d031c : 289 + 0xc]
11:23:54     INFO -     rbx = 0x000000ce49bff180   rbp = 0x000000ce49bff039
11:23:54     INFO -     rsp = 0x000000ce49bff510   r12 = 0x0000000000000000
11:23:54     INFO -     r13 = 0x0000000000000001   r14 = 0x00000000000000b8
11:23:54     INFO -     r15 = 0x0000000000000000   rip = 0x00007ffa16be5c32
11:23:54     INFO -     Found by: call frame info
11:23:54     INFO -  7  xul.dll!IPC::Channel::ChannelImpl::ProcessIncomingMessages(base::MessagePumpForIO::IOContext *,unsigned long) [ipc_channel_win.cc:095ec59a8800360154df6668394803e5a84d031c : 440 + 0xd]
11:23:54     INFO -     rbx = 0x000000ce49bff180   rbp = 0x000000ce49bff039
11:23:54     INFO -     rsp = 0x000000ce49bff550   r12 = 0x0000000000000000
11:23:54     INFO -     r13 = 0x0000000000000001   r14 = 0x00000000000000b8
11:23:54     INFO -     r15 = 0x0000000000000000   rip = 0x00007ffa16b9e072
11:23:54     INFO -     Found by: call frame info
11:23:54     INFO -  8  xul.dll!IPC::Channel::ChannelImpl::OnIOCompleted(base::MessagePumpForIO::IOContext *,unsigned long,unsigned long) [ipc_channel_win.cc:095ec59a8800360154df6668394803e5a84d031c : 544 + 0x15]
11:23:54     INFO -     rbx = 0x000000ce49bff180   rbp = 0x000000ce49bff039
11:23:54     INFO -     rsp = 0x000000ce49bff630   r12 = 0x0000000000000000
11:23:54     INFO -     r13 = 0x0000000000000001   r14 = 0x00000000000000b8
11:23:54     INFO -     r15 = 0x0000000000000000   rip = 0x00007ffa16b9d96b
11:23:54     INFO -     Found by: call frame info
11:23:54     INFO -  9  xul.dll!base::MessagePumpForIO::WaitForIOCompletion(unsigned long,base::MessagePumpForIO::IOHandler *) [message_pump_win.cc:095ec59a8800360154df6668394803e5a84d031c : 492 + 0xe]
11:23:54     INFO -     rbx = 0x000000ce49bff180   rbp = 0x000000ce49bff039
11:23:54     INFO -     rsp = 0x000000ce49bff680   r12 = 0x0000000000000000
11:23:54     INFO -     r13 = 0x0000000000000001   r14 = 0x00000000000000b8
11:23:54     INFO -     r15 = 0x0000000000000000   rip = 0x00007ffa16b9923e
11:23:54     INFO -     Found by: call frame info
11:23:54     INFO - 10  xul.dll!base::MessagePumpForIO::DoRunLoop() [message_pump_win.cc:095ec59a8800360154df6668394803e5a84d031c : 457 + 0x8]
11:23:54     INFO -     rbx = 0x000000ce49bff180   rbp = 0x000000ce49bff039
11:23:54     INFO -     rsp = 0x000000ce49bff6f0   r12 = 0x0000000000000000
11:23:54     INFO -     r13 = 0x0000000000000001   r14 = 0x00000000000000b8
11:23:54     INFO -     r15 = 0x0000000000000000   rip = 0x00007ffa16b98803
11:23:54     INFO -     Found by: call frame info
11:23:54     INFO - 11  xul.dll!base::MessagePumpWin::RunWithDispatcher(base::MessagePump::Delegate *,base::MessagePumpWin::Dispatcher *) [message_pump_win.cc:095ec59a8800360154df6668394803e5a84d031c : 56 + 0x6]
11:23:54     INFO -     rbx = 0x000000ce49bff180   rbp = 0x000000ce49bff039
11:23:54     INFO -     rsp = 0x000000ce49bff720   r12 = 0x0000000000000000
11:23:54     INFO -     r13 = 0x0000000000000001   r14 = 0x00000000000000b8
11:23:54     INFO -     r15 = 0x0000000000000000   rip = 0x00007ffa16b99042
11:23:54     INFO -     Found by: call frame info
11:23:54     INFO - 12  xul.dll!MessageLoop::RunHandler() [message_loop.cc:095ec59a8800360154df6668394803e5a84d031c : 318 + 0x5]
11:23:54     INFO -     rbx = 0x000000ce49bff180   rbp = 0x000000ce49bff039
11:23:54     INFO -     rsp = 0x000000ce49bff770   r12 = 0x0000000000000000
11:23:54     INFO -     r13 = 0x0000000000000001   r14 = 0x00000000000000b8
11:23:54     INFO -     r15 = 0x0000000000000000   rip = 0x00007ffa16bafcae
11:23:54     INFO -     Found by: call frame info
11:23:54     INFO - 13  xul.dll!MessageLoop::Run() [message_loop.cc:095ec59a8800360154df6668394803e5a84d031c : 298 + 0x8]
11:23:54     INFO -     rbx = 0x000000ce49bff180   rbp = 0x000000ce49bff039
11:23:54     INFO -     rsp = 0x000000ce49bff7a0   r12 = 0x0000000000000000
11:23:54     INFO -     r13 = 0x0000000000000001   r14 = 0x00000000000000b8
11:23:54     INFO -     r15 = 0x0000000000000000   rip = 0x00007ffa16baf862
11:23:54     INFO -     Found by: call frame info
11:23:54     INFO - 14  xul.dll!base::Thread::ThreadMain() [thread.cc:095ec59a8800360154df6668394803e5a84d031c : 181 + 0xa]
11:23:54     INFO -     rbx = 0x000000ce49bff180   rbp = 0x000000ce49bff039
11:23:54     INFO -     rsp = 0x000000ce49bff7f0   r12 = 0x0000000000000000
11:23:54     INFO -     r13 = 0x0000000000000001   r14 = 0x00000000000000b8
11:23:54     INFO -     r15 = 0x0000000000000000   rip = 0x00007ffa16bb0dbd
11:23:54     INFO -     Found by: call frame info
11:23:54     INFO - 15  xul.dll!static unsigned long `anonymous namespace'::ThreadFunc(void *) [platform_thread_win.cc:095ec59a8800360154df6668394803e5a84d031c : 34 + 0x9]
11:23:54     INFO -     rbx = 0x000000ce49bff180   rbp = 0x000000ce49bff039
11:23:54     INFO -     rsp = 0x000000ce49bff990   r12 = 0x0000000000000000
11:23:54     INFO -     r13 = 0x0000000000000001   r14 = 0x00000000000000b8
11:23:54     INFO -     r15 = 0x0000000000000000   rip = 0x00007ffa16b99bcf
11:23:54     INFO -     Found by: call frame info
11:23:54     INFO - 16  kernel32.dll!SymCache::FCheckCompilerVersion(unsigned long,unsigned long) + 0xc4
11:23:54     INFO -     rbx = 0x000000ce49bff180   rbp = 0x000000ce49bff039
11:23:54     INFO -     rsp = 0x000000ce49bff9c0   r12 = 0x0000000000000000
11:23:54     INFO -     r13 = 0x0000000000000001   r14 = 0x00000000000000b8
11:23:54     INFO -     r15 = 0x0000000000000000   rip = 0x00007ffa49df2774
11:23:54     INFO -     Found by: call frame info
seems to be eme related.
Component: WebRTC: Audio/Video → Audio/Video: Playback
sounds like a race somewhere.
Component: Audio/Video: Playback → IPC
It looks like IToplevelProtocol::mState is null, while the channel is polling (and handling messages for) its channel on the I/O thread.

The only time it should be possible for mState to be null is during the IToplevelProtocol destructor, which runs on the actor thread.  Which means this is potentially a use-after-free; see also bug 1478849.
Group: dom-core-security
See Also: → 1478849
The main thread stack is informative:

33  xul.dll!static void MOZ_ReportCrash(const char *, const char *, int) [Assertions.h:095ec59a8800360154df6668394803e5a84d031c : 178 + 0x13]
34  xul.dll!mozilla::ipc::MessageChannel::Clear() [MessageChannel.cpp:095ec59a8800360154df6668394803e5a84d031c : 813 + 0x14]
35  xul.dll!mozilla::ipc::MessageChannel::~MessageChannel() [MessageChannel.cpp:095ec59a8800360154df6668394803e5a84d031c : 681 + 0x8]
36  xul.dll!mozilla::ipc::IToplevelProtocol::ToplevelState::~ToplevelState() + 0x19
37  xul.dll!mozilla::ipc::IToplevelProtocol::ToplevelState::`scalar deleting destructor'(unsigned int) + 0x14
38  xul.dll!mozilla::ipc::IToplevelProtocol::~IToplevelProtocol() [ProtocolUtils.cpp:095ec59a8800360154df6668394803e5a84d031c : 689 + 0x18]
39  xul.dll!mozilla::gmp::GMPContentParent::Release() [GMPContentParent.h:095ec59a8800360154df6668394803e5a84d031c : 25 + 0x4a]
40  xul.dll!nsTHashtable<nsBaseHashtableET<nsUint64HashKey,RefPtr<mozilla::gmp::GMPContentParent> > >::s_ClearEntry(PLDHashTable *,PLDHashEntryHdr *) [nsTHashtable.h:095ec59a8800360154df6668394803e5a84d031c : 468 + 0xe]
41  xul.dll!PLDHashTable::~PLDHashTable() [PLDHashTable.cpp:095ec59a8800360154df6668394803e5a84d031c : 335 + 0xc]
42  xul.dll!mozilla::gmp::GMPServiceChild::~GMPServiceChild() [GMPServiceChild.cpp:095ec59a8800360154df6668394803e5a84d031c : 460 + 0xc]
43  xul.dll!mozilla::gmp::GMPServiceChild::`scalar deleting destructor'(unsigned int) + 0x14
44  xul.dll!mozilla::gmp::GeckoMediaPluginServiceChild::Observe(nsISupports *,char const *,char16_t const *) [GMPServiceChild.cpp:095ec59a8800360154df6668394803e5a84d031c : 387 + 0x22]
45  xul.dll!nsObserverList::NotifyObservers(nsISupports *,char const *,char16_t const *) [nsObserverList.cpp:095ec59a8800360154df6668394803e5a84d031c : 111 + 0x19]
46  xul.dll!nsObserverService::NotifyObservers(nsISupports *,char const *,char16_t const *) [nsObserverService.cpp:095ec59a8800360154df6668394803e5a84d031c : 295 + 0x11]
47  xul.dll!mozilla::ShutdownXPCOM(nsIServiceManager *) [XPCOMInit.cpp:095ec59a8800360154df6668394803e5a84d031c : 905 + 0x1e]

It's in the middle of intentionally crashing, here: https://searchfox.org/mozilla-central/rev/881a3c5664ede5e08ee986d76433bc5c4b5680e6/ipc/glue/MessageChannel.cpp#813

                MOZ_CRASH("MessageChannel destroyed without being closed " \
                          "(mChannelState == ChannelConnected).");

Which is what I said in comment #3.

Similar bugs with other process types: bug 1473972 (NPAPI), bug 1372850 (GPU).  These all seem to be caused by browser shutdown.

And then there's ContentParent::KillHard, which was triggering this assertion until it was silenced in bug 1423261; I need to look at that some more to see if that might also be a use-after-free.
The assertion in MessageChannel::Clear should at least make this harder to exploit, because we don't get to the point of deallocating the MessageChannel or ToplevelState or IToplevelProtocol.  We're still accessing a partially destroyed object, from a thread other than the one destroying it, so I don't want to call that “safe”, but it's not entirely a use-after-free, I think.

The ContentParent::KillHard case is more worrying, because the assertion was turned off in that case, so the objects will be freed.  If it's possible for a message to be received on the socket (even if the process is dead, a message could still be in the OS-level buffers), that's probably a use-after-free, and there may be other reasons why the I/O thread could try to access it.

But one thing that makes ContentParent a little different is that it's refcounted and registered as an observer for various things, with strong references; those are removed in ActorDestroy, and (from looking at the autogenerated code that calls it) the channel should already be closed at that point, regardless of how the process exited.  So I'm still not sure I understand what was going on in bug 1423261.
Priority: P5 → P2
I spent some time looking through the crash reports.

The GPU case is rare but does occasionally happen, and does happen on Windows (i.e., isn't the result of an unsupported pref-flip on Mac, which is why bug 1372850 was WONTFIXed): bp-0bfdc18f-7fb9-4d98-a5ad-b03ca0181002 (note that GPUChild is created in the parent and content processes, while GPUParent is in the GPU process, and the parent-process GPUChild seems to use the I/O thread as its actor thread, which isn't what I would've done but apparently works in practice).

The NPAPI case (bug 1473972) is in the plugin child process, not the parent; I still don't know what exactly is going on there (and in fact there's a lot I don't know yet about how top-level actor shutdown works).  It's also not Mac-only, but it is mostly Mac and very low frequency on Windows, which is odd.

And the GMP case, which this bug was initially filed for, I haven't seen on crash-stats yet.

In the above cases, as mentioned, the “destroyed without being closed” assertion should mostly protect us security-wise, and they're low-frequency as crashes.


As for the ContentParent case, those seem to all happen during browser shutdown, which I think explains how the ContentParent becomes unreferenced while the channel is still open (see also bug 968031 for why it's managed by the CC, which is why the destruction is (usually?) deferred until the shutdown CC).  This could possibly be fixed by reordering shutdown to stop the I/O thread earlier, but what else would that impact?  Alternately, ContentParent could deliberately leak itself in this case: hold a circular reference not reported to the CC and clear it only in ActorDestroy.
This bug has information about the general family of “channel destroyed without being closed” bugs; the non-KillHard cases might not be exploitable because we crash first and the memory is never freed, but I'm reasonably sure they're at least undefined behavior, and there could be something worse hiding there, and there isn't a simple fix for those.   So I'm going to file a separate bug for the KillHard case and quietly avoid mentioning the other issues.
Summary: Intermittent PROCESS-CRASH | Main app process exited normally | application crashed [@ mozilla::BlockingResourceBase::CheckAcquire()] → “MessageChannel destroyed without being closed” / Intermittent PROCESS-CRASH | Main app process exited normally | application crashed [@ mozilla::BlockingResourceBase::CheckAcquire()]
Summary: “MessageChannel destroyed without being closed” / Intermittent PROCESS-CRASH | Main app process exited normally | application crashed [@ mozilla::BlockingResourceBase::CheckAcquire()] → "MessageChannel destroyed without being closed" / Intermittent PROCESS-CRASH | Main app process exited normally | application crashed [@ mozilla::BlockingResourceBase::CheckAcquire()]
The other cases probably are exploitable.  Consider the finding in comment #3: this crashed because the UniquePtr was already null, but if it races that instead, and reads the original pointer value but uses it after the memory is freed and reallocated, then the attacker can control everything derived from instance variables starting at [1], which includes a vtable call (GetConstructedEventTarget or GetSpecificMessageEventTarget).  Manipulating scheduling to make that work may be difficult, although some of these cases (like for plugin processes) aren't necessarily tied to browser shutdown.

[1] https://searchfox.org/mozilla-central/rev/3d989d097fa35afe19e814c6d5bc2f2bf10867e1/ipc/glue/ProtocolUtils.cpp#982



I have a vague idea for how to fix this, although it may not survive contact with reality:

1. MessageChannel becomes thread-safe refcounted and shared between the top-level actor and the transport; destruction might need to send runnables to the actor thread if it's not there

2. MessageChannel::mWorkerLoop either becomes a thread-safe weak pointer (which means WorkerLoop becomes thread-safe refcounted) or else *all* accesses to must hold mMonitor (to synchronize with WillDestroyCurrentMessageLoop).

3a. Either MessageChannel::mListener (the IToplevelProtocol*) becomes another thread-safe weak pointer, which means replacing every top-level actor's lifetime handling (usually non-thread-safe refcounting or unique ownership) with thread-safe recounting

3b. Or ToplevelState::mEventTargetMap moves into MessageChannel, and that should be enough to let MessageChannel::mListener to be a non-thread-safe WeakPtr accessed only from the actor thread (which does not change the actor's lifetimes)

4. Test cases where the channel isn't closed properly and outlives the top-level actor and where the process continues to exist and send messages.
For reference, bug 1398070 is another case where we were warned that something was wrong with the lifetimes here, and we reacted by silencing the assertion because we apparently didn't realize the full implications.

The signature in Bug 1547654 which is a dupe of this bug is now the top crash on 68 nightly, even though it appears there are only 64 installs/3954 crashes.

Crash Signature: [@ mozilla::BlockingResourceBase::CheckAcquire()] → [@ mozilla::BlockingResourceBase::CheckAcquire()] [@ mozilla::ipc::MessageChannel::Clear | mozilla::ipc::MessageChannel::~MessageChannel | mozilla::ipc::IToplevelProtocol::ToplevelState::~ToplevelState]

The signature in Bug 1547654 which is a dupe of this bug is now the top crash on 68 nightly, even though it appears there are only 64 installs/3954 crashes.

It looks like there's only 3 machines involved, and 97% of the crashes are from an iMac in Finland, which I presume is running some script or has a really persistent owner. Might be Flash related.

Adding a signature seen during nightly triage, all on Mac 10.15. This machine appears to be a MBP, and there are 2 unique users crashing with the same Moz Crash reason as seen in Comment 4.

Crash Signature: [@ mozilla::BlockingResourceBase::CheckAcquire()] [@ mozilla::ipc::MessageChannel::Clear | mozilla::ipc::MessageChannel::~MessageChannel | mozilla::ipc::IToplevelProtocol::ToplevelState::~ToplevelState] → [@ mozilla::BlockingResourceBase::CheckAcquire()] [@ mozilla::ipc::MessageChannel::Clear | mozilla::ipc::MessageChannel::~MessageChannel |mozilla::ipc::IToplevelProtocol::ToplevelState::~ToplevelState] [@ mozilla::ipc::MessageChannel::Clear | mozilla::i…

We talked about this at the IPC meeting a few weeks ago. The current idea is to make the Unsound_IsClosed sound, using locking, and move the check up into the IToplevelProtocol destructor to crash before the object is torn down enough to cause undefined behavior if there's a race. We'd still crash, but it won't be security sensitive anymore, and the patch should be less invasive (more upliftable) than trying to fix this fully.

Longer-term I guess we should move away from things caring about destruction order like this, especially in ways that cause memory unsafety (or require deliberate crashing to avoid the unsafety): clearly we don't have enough testing to even understand why those invariants are being violated, let alone catch it before release.

Crash Signature: mozilla::ipc::MessageChannel::~MessageChannel | mozilla::ipc::IToplevelProtocol::~IToplevelProtocol ] → mozilla::ipc::MessageChannel::~MessageChannel | mozilla::ipc::IToplevelProtocol::~IToplevelProtocol ] [@ bool mozilla::ipc::ProcessLink::Unsound_IsClosed()]
Crash Signature: mozilla::ipc::MessageChannel::~MessageChannel | mozilla::ipc::IToplevelProtocol::~IToplevelProtocol ] [@ bool mozilla::ipc::ProcessLink::Unsound_IsClosed()] → mozilla::ipc::MessageChannel::~MessageChannel | mozilla::ipc::IToplevelProtocol::~IToplevelProtocol ] [@ bool mozilla::ipc::ProcessLink::Unsound_IsClosed()]

I don't understand why bug 1498648 was considered to be related to this.

No longer blocks: 1498648

I've been looking at this again, and I'm wondering if we could just (in addition to making the IsClosed flag atomic) call MessageChannel::Clear in the IToplevelProtocol destructor, before dispatching the runnable to delete the channel… or, to similar effect, dispatch the deletion to the current event loop and then re-dispatch to the I/O thread, to ensure it happens-after the rest of the IToplevelProtocol/MessageChannel destructor (which, in the bad case, will crash).

Bug 1557739 looks at least vaguely related.

See Also: → 1557739

I've figured out a reliable "STR": comment out the Clear() call in MessageChannel::NotifyMaybeChannelError, then kill a content process (e.g. with the shell command kill). I combined this with a sleep in ~IToplevelProtocol to ensure the use would happen after the free.

With this, I've discovered that the second idea from comment #24 doesn't work, because VRManagerParent seems to be destroyed while the actor thread is shutting down and it's not possible to self-dispatch when the nsThread::mEventTarget has already been cleared.

…and calling MessageChannel::Clear directly from ~IToplevelProtocol has the slight problem that HangMonitorParent is deleted on the main thread and not its own actor thread. This is a problem if it means that MessageChannel::Clear does the mWorkerLoop->RemoveDestructionObserver call on the wrong thread.

That seems to not happen normally because it would already have been cleared by the error handling that I commented out to reproduce the bug, but I wonder whether that's strictly necessary or if it happens rarely in actual use; the thread check in RemoveDestructionObserver is debug-only so we wouldn't notice.

(In reply to Jed Davis [:jld] ⟨⏰|UTC-7⟩ ⟦he/him⟧ from comment #28)

HangMonitorParent is deleted on the main thread and not its own actor thread.

Also the PBackground parent actor.

I tried commenting out the debug assertion in RemoveDestructionObserver to see what would happen, and now I have a new crash that doesn't make much sense: a ContentParent where mTrans is never set to non-null, and the channel pointed to by the ProcessLink is freed by the GeckoChildProcessHost destructor and then used by MessageChannel::Clear when the ContentParent is destroyed.

From looking into bug 1608466, I think there is a fundamental race with this method on the shutdown of many toplevel actors.

When an actor is destroyed, it dispatches a message to the IO thread to destroy the IPC::Channel transport object in it's destructor (either directly within ~IToplevelProtocol, or through the GeckoChildProcessHost::Destroy method). This destructor is called before the destructors of base classes & members, so the message is dispatched before the ~MessageChannel destructor is run. The ~MessageChannel destructor will then try to call Unsound_IsClosed here, but that method calls through a raw pointer to access a property of IPC::Channel, which may have already been freed on the IO thread.

So, in effect, we have two tightly connected objects, the MessageChannel and IPC::Channel, which have completely unrelated lifecycles and don't keep track of one another, so it's somewhat unsurprising that we frequently crash here.

We probably want to make the lifecycle relationship between the ProcessLink object and the IPC::Channel more explicit, perhaps by transferring ownership of the IPC::Channel object into the MessageChannel when we create it, and away from the ProcessHost (or IPDL actor via ManagedEndpoint) which initially opened it.

Crash Signature: mozilla::ipc::MessageChannel::~MessageChannel | mozilla::ipc::IToplevelProtocol::~IToplevelProtocol ] [@ bool mozilla::ipc::ProcessLink::Unsound_IsClosed()] → mozilla::ipc::MessageChannel::~MessageChannel | mozilla::ipc::IToplevelProtocol::~IToplevelProtocol ] [@ bool mozilla::ipc::ProcessLink::Unsound_IsClosed()] [@ mozilla::ipc::MessageChannel::Clear()]

This is a few years old, and we've had lots of improvements by Nika to channel lifetimes in the meanwhile.

Status: NEW → RESOLVED
Crash Signature: mozilla::ipc::MessageChannel::~MessageChannel | mozilla::ipc::IToplevelProtocol::~IToplevelProtocol ] [@ bool mozilla::ipc::ProcessLink::Unsound_IsClosed()] [@ mozilla::ipc::MessageChannel::Clear()] → mozilla::ipc::MessageChannel::~MessageChannel | mozilla::ipc::IToplevelProtocol::~IToplevelProtocol ] [@ bool mozilla::ipc::ProcessLink::Unsound_IsClosed()] [@ mozilla::ipc::MessageChannel::Clear()]
Closed: 2 years ago
Resolution: --- → WORKSFORME
Group: dom-core-security
You need to log in before you can comment on or make changes to this bug.