Closed Bug 1471154 Opened 7 years ago Closed 3 years ago

IPC reentrancy issues due to nested event loops

Categories

(Core :: IPC, defect, P2)

61 Branch
defect

Tracking

()

RESOLVED WORKSFORME
Tracking Status
firefox-esr60 - wontfix
firefox61 - wontfix
firefox62 --- wontfix
firefox63 --- wontfix
firefox64 --- wontfix
firefox65 --- fix-optional

People

(Reporter: philipp, Unassigned)

Details

(4 keywords)

Crash Data

This bug was filed from the Socorro interface and is report bp-0f841114-eb0d-4a1a-8c62-22d220180626. ============================================================= Top 10 frames of crashing thread: 0 mozglue.dll MOZ_CrashOOL mfbt/Assertions.cpp:33 1 xul.dll mozilla::ipc::MessageChannel::DebugAbort ipc/glue/MessageChannel.cpp:2810 2 xul.dll mozilla::ipc::MessageChannel::~MessageChannel ipc/glue/MessageChannel.cpp:566 3 xul.dll mozilla::ipc::IToplevelProtocol::ToplevelState::~ToplevelState 4 xul.dll mozilla::ipc::IToplevelProtocol::ToplevelState::`scalar deleting destructor' 5 xul.dll mozilla::ipc::IToplevelProtocol::~IToplevelProtocol ipc/glue/ProtocolUtils.cpp:687 6 xul.dll mozilla::dom::ContentParent::cycleCollection::DeleteCycleCollectable dom/ipc/ContentParent.h:316 7 xul.dll SnowWhiteKiller::~SnowWhiteKiller xpcom/base/nsCycleCollector.cpp:2729 8 xul.dll nsCycleCollector::FreeSnowWhite xpcom/base/nsCycleCollector.cpp:2914 9 xul.dll nsCycleCollector_doDeferredDeletion xpcom/base/nsCycleCollector.cpp:4293 ============================================================= this browser crash signature is starting to show up cross-platform starting with firefox 61. all the reports come with MOZ_CRASH Reason "mismatched CxxStackFrame ctor/dtors".
Hey Jed, any idea what might have changed in 61 to have caused this?
Flags: needinfo?(jld)
bp-d5753d9b-47c3-42ec-9e2b-7fd3c0180703 has a clearer stack, and this looks like a use-after-free: the ContentParent is getting cycle collected inside its own message callback. The thing that changed in 61 might be bug 1451363; I'd need to take a closer look at those patches and see what happened with lifetimes.
Group: dom-core-security
Flags: needinfo?(jld)
guessing sec-moderate because it seems to be caught by release assertions, but would be sec-high if there's other symptoms caused by the underlying problem or ways to get around the assert.
Crash volume on 61 doesn't seem high enough to worry about tracking. Let's target 62 for any fixes here.
Did you get a chance to look at this more?
Flags: needinfo?(jld)
Trying to make sense of the stack from bp-d5753d9b-47c3-42ec-9e2b-7fd3c0180703: Frames 73-78: we're receiving a MessageManager message Frames 70-72: but it's preempted by an unrelated microtask? Frames 47-49: which adds a node to a XUL document Frames 28-30: the XBL callback gets slow-script'ed Frames 9-13: XULWindow spins a nested event loop while the modal dialog is up, which allows the CC to run What I haven't extrapolated yet is where we lose the last reference keeping the ContentParent alive. There's a protocol for shutdown (ShutDownProcess SendShutdown, RecvFinishShutdown) that would have to have happened within the nested event loop, maybe? Or maybe there's a case where the ContentParent can become garbage without being shut down? And I'm not seeing how the ToplevelState change affects this. I'm going to hand this off to the people who know those patches (and the CC, and maybe also our event loop situation) better.
Flags: needinfo?(nfroyd)
Flags: needinfo?(continuation)
Flags: needinfo?(jld)
I think that bug 1451363 just changed the signature. Searching for "DebugAbort" for the past six months: https://crash-stats.mozilla.com/search/?signature=~DebugAbort&date=%3E%3D2018-02-14T11%3A05%3A14.000Z&date=%3C2018-08-14T12%3A05%3A14.000Z&_sort=-date&_facets=signature&_columns=date&_columns=signature&_columns=product&_columns=version&_columns=build_id&_columns=platform#facet-signature reveals a lot of similar-looking stacks, including: https://crash-stats.mozilla.com/signature/?signature=mozilla%3A%3Aipc%3A%3AMessageChannel%3A%3ADebugAbort%20%7C%20mozilla%3A%3Aipc%3A%3AMessageChannel%3A%3A~MessageChannel%20%7C%20mozilla%3A%3Adom%3A%3APContentParent%3A%3A~PContentParent&date=%3E%3D2018-02-14T06%3A05%3A14.000Z&date=%3C2018-08-14T08%3A05%3A14.000Z&_columns=date&_columns=product&_columns=version&_columns=build_id&_columns=platform&_columns=reason&_columns=address&_columns=install_time&_sort=-date&page=1#summary https://crash-stats.mozilla.com/signature/?_sort=-date&signature=Abort%20%7C%20mismatched%20CxxStackFrame%20ctor%2Fdtors%20%7C%20mozalloc_abort%20%7C%20NS_DebugBreak%20%7C%20mozilla%3A%3Aipc%3A%3AMessageChannel%3A%3ADebugAbort%20%7C%20mozilla%3A%3Aipc%3A%3AMessageChannel%3A%3A%7EMessageChannel%20%7C%20mozilla%3A%3Adom%3A%3APContentParent%3A%3A%7EPContentParent&date=%3E%3D2018-02-14T11%3A05%3A14.000Z&date=%3C2018-08-14T12%3A05%3A14.000Z with e.g. https://crash-stats.mozilla.com/report/index/19f6856d-948c-4910-be37-706270180813 https://crash-stats.mozilla.com/report/index/41bebd6e-dc82-44e5-a53f-694830180813 which have roughly the same 15-20ish initial frames as the crash stack in comment 6. I'm in the same place as jld; I don't see how the last reference to the ContentParent would be getting lost. Perhaps we should be holding a strong reference to it somewhere around: https://hg.mozilla.org/releases/mozilla-release/annotate/f5ea361ceb75d5cdb9c44112ae6d37117b05e626/dom/base/nsFrameMessageManager.cpp#l899
Flags: needinfo?(nfroyd)
Yeah, this does feel like an existing crash.
Flags: needinfo?(continuation)
That's a good point about the crash signature; thanks for pointing that out. I was going to suggest putting a death grip on the TabParent in https://searchfox.org/mozilla-central/rev/2466b82b729765fb0a3ab62f812c1a96a7362478/dom/ipc/TabParent.cpp#2551, because that would fix bp-d5753d9b-47c3-42ec-9e2b-7fd3c0180703 (TabParent holds a strong reference to its manager), but then I looked at bp-41bebd6e-dc82-44e5-a53f-694830180813 from comment #7 and it's a completely different actor (PPSMContentDownloader) and a different situation (doing a modal dialog deliberately, not as a side-effect of being slow-script'ed). The only thing in common among all the crashes I've looked at was nsXULWindow::ShowModal and its nested event loop.
I'm not sure what to do with this. If I understand correctly, any IPDL actor that spins a nested event loop inside a callback is at risk of having other IPDL callbacks run reentrantly, including actor destruction. We'll release-assert if a top-level actor is destroyed that way, as seen here, but if some general-case actor gets __delete__'ed while it's on the stack…? Or just tries to modify its state while it's inconsistent? Is everything that can spin an event loop (or run script, or microtasks, or…) prepared to deal with that? I have my doubts. (Also, there's another case besides modal dialogs: sync dispatch, as seen in bp-3d1f725b-e224-41b8-ab93-8a1670180711.) On the other hand, it's hard to believe that all the people who've worked on this before would've overlooked that, so maybe I don't understand correctly. Do we need a way to defer MessageTask runnables to the top-level event loop? (And hope that doesn't break any event ordering dependencies.) Alternately, is there something I'm missing about ContentParent that allows it to be CC'ed without going through proper shutdown? (This, at least, would be more easily fixable: by making IToplevelProtocol refcounted and having MessageChannel::DispatchMessage take a reference.)
(In reply to Jed Davis [:jld] (⏰UTC-6) from comment #10) > Do we need a way to defer MessageTask runnables to the top-level event loop? > (And hope that doesn't break any event ordering dependencies.) I'm pretty sure that we've manually implemented this in plugin-land in a few cases inside the actor's Recv function; it would just enqueue the "real" handler to the main event loop.
This is what happens with the last reference to a ContentParent: https://searchfox.org/mozilla-central/rev/e126996d9b0a3d7653de205898517f4f5b632e7f/dom/ipc/ContentParent.cpp#1869 (see also the autogenerated PContentParent::OnChannel{Close,Error}). Also, ContentParent registers itself as various kinds of observer, with strong references, in Init(); it shouldn't be collectible until ActorDestroy() is called and removes those. But I still don't entirely understand what's going on here, because I don't see what references would be keeping the ContentParent's refcount nonzero (why it's CC'ed instead of just destroyed) after it's been ActorDestroy()ed and had all its subactors (e.g., TabParent) destroyed; but maybe there's a reference loop I'm not seeing that includes a TabParent or something. In any case, if that runnable could be protected from nested event loops then that should at least prevent the problems with ContentParent destruction. However: MessageLoop::PostTask just dispatches to the normal XPCOM event target (bug 1269056). MessageLoop::PostNonNestableTask was removed as dead code in bug 1260806, but MessageLoop::PostIdleTask still exists and touches something called deferred_non_nestable_work_queue_, *but* that may not have anything to do with XPCOM event loop nesting. There's stuff in dom/plugins about the possibility of plugins running a nested instance of *glib*'s event loop but I don't think that's related. There are also comments about avoiding plugin callbacks that might be nested inside sync calls from the plugin, but this doesn't seem to be generally nesting-proof if I understand correctly: https://searchfox.org/mozilla-central/rev/e126996d9b0a3d7653de205898517f4f5b632e7f/dom/plugins/ipc/PluginInstanceChild.cpp#2977
There is a "design pattern" for making a runnable fire only on a sufficiently non-nested event loop: requeue it with a timer if it's the wrong kind of context. For example: https://searchfox.org/mozilla-central/rev/5a18fb5aeeec99f1ca1c36a697082c221189a3b9/ipc/glue/MessageChannel.cpp#2720-2726 This seems potentially bad for performance if it's done to every MessageTask without coalescing, and MessageTask runnables need to stay ordered relative to each other so there's some additional subtlety, but in principle it should work. Also, the predicate is a little more complicated than "not nested", because if we're waiting for a response to an outbound `intr` call, then we *do* want to dispatch MessageTasks; that can be checked by scanning the MessageChannel's InterruptFrame stack. (And tying this to the MessageChannel also means that, if there's more than one toplevel actor associated with the threads, one of them spinning an event loop won't block others from making progress.) But for safely disposing of a ContentParent or other toplevel actor, the simple timer hack ought to work fine.
On further thought, the timer thing wouldn't be necessary for the full solution; the MessageChannel::CxxStackFrame machinery can check for the stack returning to a safe state and dispatch a runnable to try to drain the queue. (The comments on CxxStackFrame warn that it implies the presence of a channel on the actual stack, but that the converse may not be true. I think it's sound to use it like I'm considering at the points where a runnable would be invoked, but not (e.g.) immediately after a frame has been popped.)
I found something interesting: https://searchfox.org/mozilla-central/rev/c56977420df7a1b692ce0f7e499ddb364d9fd7b2/ipc/chromium/src/base/message_loop.h#56-59 It looks like we might have been protected from this at some point in the past (or, at least, Chromium itself never had this problem), because Chromium's version of nested event loops will handle OS events but not tasks (runnables), unless there's explicit opt-in. At least according to that comment. But XPCOM event loops don't appear to do that, and the half-finished event loop unification means we're definitly dealing with XPCOM semantics now.
Priority: -- → P2
Requires some re-architecting of IPC message loops, as such this is low priority.

April seems to have accidentally found someone on twitter who can reproduce this reliably: https://twitter.com/William_Leech/status/1121150565273231360 .

If rearchitecturing the event loop is not an option, are there smaller changes we can make, like moving away from synchronous (+ nested event loop) uses of modal filepickers or something?

Also, this is marked sectype-uaf and the crash address looks... not-near-null. If we can't fix the crash, can we make it a nullptr one instead of something that's potentially exploitable, or am I just misreading things? (quite possible!)

Flags: needinfo?(jmathies)
Flags: needinfo?(jld)

was digging down the same thread as Gijs apparently. One of those crashes is bp-85304835-ee72-4de7-974e-0f6f50190424. The crash address being not near null is OK because it's the instruction pointer for the breakpoint.

In that crash the user has April's Certainly Something extension as well as "British English Dictionary (Forked by Marco Pinto)". Given the steps there may be something involving Web Extension APIs going on here.

(In reply to :Gijs (he/him) from comment #17)

If rearchitecturing the event loop is not an option, are there smaller changes we can make, like moving away from synchronous (+ nested event loop) uses of modal filepickers or something?

Also, this is marked sectype-uaf and the crash address looks... not-near-null. If we can't fix the crash, can we make it a nullptr one instead of something that's potentially exploitable, or am I just misreading things? (quite possible!)

This is complicated. The general case of this problem is a UAF, and needs some kind of major change (which, given the current state of IPC planning, probably means refcounting all actors rather than changing the use of event loops).

For the more specific case of destroying a ContentParent while it's in use — which is the case that started this bug and the case reported on Twitter — that's not inherently security-sensitive because we release-assert about it, and there is a relatively simple way to avoid the crash (see comment #13). However, I'm concerned about the possibility of a fix for this case pointing out the existence of the larger problem.

I think I know some of what happened with the STR in the tweet:

  1. The drag-and-drop caused navigation to the cert file, which caused a call to PSMContentDownloaderParent::RecvOnStopRequest (on an actor managed by the ContentParent for the file:/// process)
  2. PSMContentDownloaderParent showed a XUL modal (which wasn't noticed by the user?), spinning a nested event loop
  3. Clicking on the <input type="file"> nested a file open dialog inside that, which seems as if it shouldn't matter, but maybe does
  4. Something causes the file content process to be shut down and cycle collected

The crash reports show that something caused a XUL modal, nested inside an IPC callback for something in a PContent actor tree, but the stack trace doesn't go far enough to directly indicate the cause.

I can reproduce the XUL modal and nested event loop by trying to open a certificate file; attaching with a debugger shows the PContentParent::OnMessageReceived on the stack outside the nested event loop, and I can see that the ContentParent in question belongs to the file content process. I haven't yet been able to cause the ContentParent to be destroyed in the nested event loop.

I notice that almost all reports for this bug's crash signature are on Windows (with a few on Mac that may not be going through the XUL modal path); I wonder if there's something OS-specific about this.

(Leaving needinfo while I decide what to do about this.)

Flags: needinfo?(jmathies)

I think I know why I couldn't reproduce this by manually killing the process: the piece of code I linked in comment #13 delays notifying the actor thread of link errors until it's not in the nested event loop.

What I can do is attach a debugger, get the ContentParent* value, and inject a call of ShutDownProcess(0) on it. That doesn't crash immediately, but it does cause a real use-after-free when I close the PSM modal window:

Thread 1 "firefox" received signal SIGSEGV, Segmentation fault.
0x00007f547aafa84c in mozilla::ipc::IProtocol::GetIPCChannel (this=<optimized out>) at /home/jld/src/obj.gecko-dev/obj-x86_64-pc-linux-gnu/dist/include/mozilla/ipc/ProtocolUtils.h:274
274       MessageChannel* GetIPCChannel() { return mState->GetIPCChannel(); }
(gdb) x/i $rip
=> 0x7f547aafa84c <mozilla::ipc::IPDLParamTraits<mozilla::psm::PPSMContentDownloaderParent>::Write(IPC::Message*, mozilla::ipc::IProtocol*, mozilla::psm::PPSMContentDownloaderParent* const&)+84>:
    mov    (%rax),%rcx
(gdb) p/x $rax
$3 = 0xe5e5e5e5e5e5e5e5

That's a use-after-free of the PSMContentDownloaderParent, not the ContentParent; it's freed via PContentParent::DeallocSubtreeContentParent::DeallocPPSMContentDownloaderParent → dropping the last reference. The ContentParent may have other refs besides the ones released via ActorDestroy, or in principle this could happen if there doesn't happen to be a cycle collection before the modal window is closed. This also means that a patch to hold the last last ContentParent ref until the event loop is un-nested, like the one I've written and was trying to test, would just change the cases where we currently crash safely into UAFs.

This could probably be worked around in the specific case of PSMContentDownloaderParent by making it take a death grip while nesting the event loop, but there may be other instances of this problem.

Flags: needinfo?(jld)

Bug 1540731 should take care of this in most cases, and in particular it empirically seems to fix the PSMContentDownloaderParent UAF from the last comment, but I think there are still some actors that do weird things with their lifecycles and might have security implications from nested event loops; I'll have to check on that. In any case, this is still a live issue for all current non-Nightly releases, and bug 1540731 is an invasive change that we normally wouldn't uplift, so this will need to remain hidden for a while.

For the specific case of ContentParent, the crash should now be fixable by overriding ActorDealloc to release the self-reference. That fix isn't security-sensitive itself, but linking it to the crash signature (vs. passing it off as cleanup) might give too much visibility to the general problem with nested event loops.

I filed bug 1562769 publicly for the ContentParent fix, not mentioning the crash and marking it as just an enhancement.

Summary: Crash in mozilla::ipc::MessageChannel::DebugAbort | mozilla::ipc::MessageChannel::~MessageChannel | mozilla::ipc::IToplevelProtocol::ToplevelState::~ToplevelState → IPC reentrancy issues due to nested event loops

It looks like there were a ton of fixes landed for this bug in 2019, and I can't see any crashes on versions newer than 68 in the last 6 months, so let's close this.

Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → WORKSFORME
Group: dom-core-security
You need to log in before you can comment on or make changes to this bug.