Closed Bug 1478849 Opened 3 years ago Closed 3 years ago
Sanitizer: heap-use-after-free [@ IPC::Channel::Unsound _Is Closed] with READ of size 8
The attached crash information was submitted via the ASan Nightly Reporter on mozilla-central-asan-nightly revision 63.0a1-20180725103029-https://hg.mozilla.org/mozilla-central/rev/0b9c0b211eb3d84c5afdb64f4a91d079d0aa4c75. For detailed crash information, see attachment.
It looks like maybe ~GMPServiceChild is touching a channel after it was destroyed, via mContentParents. Maybe ~ContentParent needs to call RemoveGMPContentParent()? Nils, any idea who could look at this?
Byron, could you please take a look at this? Ask cpearce for advice if needed.
Wow, look at all those bare pointers in the IPC code.
So, it is extremely hard to tell what is going on here, but I have a guess. The only place we create a DeleteTask<IPC::Channel> is here: https://searchfox.org/mozilla-central/source/ipc/glue/ProtocolUtils.cpp#691 GMPContentParent, the thing we're destroying on main (the crashing thread) happens to be an IToplevelProtocol. What I think has happened is we are destroying a GMPContentParent, and are currently running the d'tor code for the IToplevelProtocol. This dispatches a DeleteTask<IPC::Channel> to Chrome_~dThread. Most of the time, this is fine, because that runnable takes a little bit of time to run. But when it is executed before the GMPContentParent is done tearing down, we get a UAF. This is a bug in the IPC code, I think.
The quick-and-dirty fix to this particular crash would be to destroy |mState| before dispatching the DeleteTask, I think. But given the web of bare pointers in this code, I am not sure.
Jim, could you help me to find an IPC expert have a look at this and Byron guess work in comment #6?
I've looked at the code, and comment #6 looks right. The ToplevelState contains the MessageChannel, whose destructor looks at MessageChannel::mLink, which in this case is-a ProcessLink, which has a raw pointer (“weak reference”) to the Transport a.k.a. IPC::Channel. So that means that IToplevelProtocol::mTrans (owning reference to the Transport) must outlive that object's IProtocol::mState (borrowed reference to the Transport). And it doesn't, but comment #6's suggestion of clearing mState before sending the runnable looks reasonable. Also, IToplevelProtocol::mTrans itself doesn't seem to be used much; it looks doable to move the Transport ownership into ProcessLink. That might want to be a followup instead. (Maybe not directly relevant, but I see that bug 1283744 has some historical context on Transport management.)
Assignee: nobody → jld
Component: Audio/Video → IPC
(In reply to Jed Davis [:jld] (⏰UTC-6) from comment #9) > Also, IToplevelProtocol::mTrans itself doesn't seem to be used much; it > looks doable to move the Transport ownership into ProcessLink. That might > want to be a followup instead. Not quite. The transport owner isn't always IToplevelProtocol::mTrans; for the initial actor (like PContent or PGPU) it seems to be ChildThread::channel_ on the child side and ChildProcessHost::channel_ on the parent side. IToplevelProtocol::mTrans appears to be used only for toplevel actors created via Endpoint::Bind. Simplifying that is *definitely* outside the scope of this bug.
Attachment #9001807 - Flags: review?(continuation) → review+
About the possibility of regressions from other raw-pointer issues, I haven't noticed any while looking through the code, but to be a little more rigorous about it: This patch causes the destruction of mState to happen-before the destruction of mTrans, when previously they were unordered, so if there's a vulnerability from that ordering then it was already possible without the patch (especially because that ordering seems to have been what usually happened in practice, e.g. in our ASan CI runs). It also causes mState to be destroyed before the other members of IToplevelProtocol and IProtocol, but they're all POD (integers, enums, raw pointers not touched by destructors) so there shouldn't be any interactions there.
Also, don't rely on this without double-checking it, but I think this isn't usefully exploitable: the only thing we're doing to the freed `Channel` is reading the `channel_impl_` pointer, and then that's used only to read a `bool` which is used to decide whether or not to MOZ_CRASH. There is a virtual call to MessageLink::Unsound_IsClosed in this, but the object it's called on is the ProcessLink, which isn't freed (MessageChannel::mLink) until later in MessageChannel::Clear. The methods that are called on the Channel and ChannelImpl are non-virtual.
Comment on attachment 9001807 [details] [diff] [review] Patch This bug doesn't have a severity rating yet, but I'm requesting sec-approval in case it needs it. [Security approval request comment] * How easily could an exploit be constructed based on the patch? Not simple, but probably doable: it's a use-after-free race condition between two threads in the parent process, and it looks like content could trigger it by creating/destroying EME or WebRTC contexts but it's not trivial to delay the thread that needs to be slower. There'd also need to be some parent-process heap manipulation to do anything with it. And see also comment #13 about severity. * Do comments in the patch, the check-in comment, or tests included in the patch paint a bulls-eye on the security problem? There are no comments in this version, but it's hard to hide: it's pretty obviously rearranging destruction order near something async/thread related. * Which older supported branches are affected by this flaw? All of them. This code hasn't changed much since bug 1283744 (and it may have been more broken before that). * Do you have backports for the affected branches? If not, how different, hard to create, and risky will they be? The patch crossports cleanly to beta and esr60. * How likely is this patch to cause regressions; how much testing does it need? Not likely; see comment #12.
Fwiw, unlike all other use-after-free reports in the ASan Nightly Project, this one is actually frequent. I got multiple reports from different people, in total 6 reports. That might sound low, but given a total population of around 100 clients, this is very frequent for such an issue. So this might indeed have some stability impact as well. :jld, what is the MOZ_CRASH that you spoke of in comment 13? Can we see that in crash-stats?
going with sec-moderate based on comment 13 Although it's not directly this crash, comment 5 is worrying.
https://searchfox.org/mozilla-central/rev/71ef4447db179639be9eff4471f32a95423962d7/ipc/glue/MessageChannel.cpp#804-831 There are a lot of "MessageChannel destroyed without being closed" crashes; all the ones I looked at were during shutdown, when the cycle collector destroyed a ContentParent, and I don't know if they're caused by this bug or because the channel is actually still open.
Comment on attachment 9001807 [details] [diff] [review] Patch sec-moderate, so doesn't need approval. Landed: https://hg.mozilla.org/integration/mozilla-inbound/rev/9c34c00e44d0b021fb1d5ac4bf464b9de4955374
Please request Beta and ESR60 approval on this when you get a chance.
Comment on attachment 9001807 [details] [diff] [review] Patch Approval Request Comment [Feature/Bug causing the regression]: Unclear, because it's a very old bug — at least since the bug 1283744 cleanup and probably before that. [User impact if declined]: This is a use-after-free that can be triggered by content; I have doubts about its exploitability, but I wouldn't bet on that. See comment #16; also see comment #15 about frequency in real-world use. [Is this code covered by automated tests?]: Yes; this is core IPC code. [Has the fix been verified in Nightly?]: Landed in m-c in comment #19. [Needs manual test from QE? If yes, steps to reproduce]: No. It's an intermittent bug and we don't have STR. [List of other uplifts needed for the feature/fix]: None. [Is the change risky?]: Low risk. [Why is the change risky/not risky?]: This essentially just imposes an ordering on some things we were previously doing in parallel; see comment #12 for more analysis. [String changes made/needed]: None.
Comment on attachment 9001807 [details] [diff] [review] Patch Fixes an IPC security issue. Approved for 62.0b20 and ESR 60.2.
When you find some time, it would be very helpful if you could answer the following questions so I can work on improving ASan Nightly further. In particular, many of the reports we saw about use-after-free were sent only once, not multiple times. So I'm asking myself what changes to ASan Nightly I can make to improve overall uaf detection rate: 1) Do you think more frequent GC/CC would have helped detecting this problem more frequently? 2) Do you think GC/CC on specific DOM events (e.g. beforeunload, pagehide, DOMContentLoaded) would have helped? 3) Do you think triggering the event loop in the specific child processes would have helped here? (e.g. dummy XMLHttpRequest onbeforeunload, useful in fuzzing). 4) Do you have any other ideas to reproduce this particular problem better or do you think any of the above points (or other approaches) might help to find other use-after-free issues in areas that you work in? Any feedback would be greatly appreciated. Thanks!
60 predates bug 1451363 and the state factoring, but I forgot about that while digging through older history. Mea culpa. Pre-1451363, the MessageChannel is an instance variable of each autogenerated abstract actor class (the PFooSide classes, which inherit from IToplevelProtocol), named mChannel but not the same as IProtocol::mChannel, and accessible through the virtual GetIPCChannel methods. Which means that the MessageChannel and ProcessLink are already destroyed before entering ~IToplevelProtocol, and this bug is actually a regression caused by bug 1451363, and ESR 60 isn't affected.
(In reply to Christian Holler (:decoder) from comment #24) > 4) Do you have any other ideas to reproduce this particular problem better > or do you think any of the above points (or other approaches) might help to > find other use-after-free issues in areas that you work in? For this UAF, an async runnable/task needs to run before something else that's done synchronously, which more or less means that the first thread needs to be preempted. This is the kind of thing that rr chaos mode approaches more generally, but for this specific case we might get some mileage out of doing a sched_yield() and/or a short sleep when posting a runnable/task to another thread.
Attachment #9001807 - Flags: approval-mozilla-esr60+ → approval-mozilla-esr60-
Whiteboard: [post-critsmash-triage] → [post-critsmash-triage][adv-main62+]
You need to log in before you can comment on or make changes to this bug.