Closed Bug 1277614 Opened 8 years ago Closed 8 years ago

Crash in mozilla::dom::PBlobChild::DestroySubtree

Categories

(Core :: IPC, defect, P1)

Unspecified
Windows 10
defect

Tracking

()

RESOLVED FIXED
mozilla50
Tracking Status
e10s m9+ ---
firefox47 --- disabled
firefox48 --- fixed
firefox49 --- fixed
firefox-esr45 --- wontfix
firefox50 --- fixed

People

(Reporter: marcia, Assigned: gkrizsanits)

References

Details

(Keywords: crash, csectype-uaf, sec-high, Whiteboard: [adv-main48-] btpp-active, e10s-only)

Crash Data

Attachments

(2 files, 3 obsolete files)

This bug was filed from the Socorro interface and is 
report bp-a9aedac8-9df7-47c0-9908-139d92160602.
=============================================================

Seen while looking at crash-stats. http://bit.ly/1TYfPTu to the crashes on nightly, which are both Mac and Windows. Hopefully I filed this in the right bucket...

Frame 	Module 	Signature 	Source
0 	xul.dll 	mozilla::dom::PBlobChild::DestroySubtree(mozilla::ipc::IProtocolManager<mozilla::ipc::IProtocol>::ActorDestroyReason) 	obj-firefox/ipc/ipdl/PBlobChild.cpp:626
1 	xul.dll 	mozilla::dom::PContentChild::DestroySubtree(mozilla::ipc::IProtocolManager<mozilla::ipc::IProtocol>::ActorDestroyReason) 	obj-firefox/ipc/ipdl/PContentChild.cpp:10453
2 	xul.dll 	mozilla::dom::PContentChild::OnChannelClose() 	obj-firefox/ipc/ipdl/PContentChild.cpp:10370
3 	xul.dll 	mozilla::ipc::MessageChannel::NotifyChannelClosed() 	ipc/glue/MessageChannel.cpp:2232
4 	xul.dll 	mozilla::ipc::MessageChannel::NotifyMaybeChannelError() 	ipc/glue/MessageChannel.cpp:2071
5 	xul.dll 	mozilla::ipc::MessageChannel::OnNotifyMaybeChannelError() 	ipc/glue/MessageChannel.cpp:2107
6 	xul.dll 	nsRunnableMethodImpl<void ( mozilla::ipc::MessageChannel::*)(void), 0, 1>::Run() 	obj-firefox/dist/include/nsThreadUtils.h:749
7 	xul.dll 	nsThread::ProcessNextEvent(bool, bool*) 	xpcom/threads/nsThread.cpp:1029
8 	xul.dll 	NS_ProcessNextEvent(nsIThread*, bool) 	xpcom/glue/nsThreadUtils.cpp:290
9 	xul.dll 	mozilla::ipc::MessagePump::Run(base::MessagePump::Delegate*) 	ipc/glue/MessagePump.cpp:132
10 	xul.dll 	mozilla::ipc::MessagePumpForChildProcess::Run(base::MessagePump::Delegate*) 	ipc/glue/MessagePump.cpp:317
11 	xul.dll 	MessageLoop::RunHandler() 	ipc/chromium/src/base/message_loop.cc:228
12 	xul.dll 	MessageLoop::Run() 	ipc/chromium/src/base/message_loop.cc:208
13 	xul.dll 	nsBaseAppShell::Run() 	widget/nsBaseAppShell.cpp:156
14 	xul.dll 	nsAppShell::Run() 	widget/windows/nsAppShell.cpp:262
15 	xul.dll 	XRE_RunAppShell 	toolkit/xre/nsEmbedFunctions.cpp:827
16 	xul.dll 	mozilla::ipc::MessagePumpForChildProcess::Run(base::MessagePump::Delegate*) 	ipc/glue/MessagePump.cpp:285
17 	xul.dll 	MessageLoop::RunHandler() 	ipc/chromium/src/base/message_loop.cc:228
18 	xul.dll 	MessageLoop::Run() 	ipc/chromium/src/base/message_loop.cc:208
19 	xul.dll 	XRE_InitChildProcess 	toolkit/xre/nsEmbedFunctions.cpp:657
20 	plugin-container.exe 	NS_internal_main(int, char**) 	ipc/app/MozillaRuntimeMain.cpp:11
21 	plugin-container.exe 	wmain 	toolkit/xre/nsWindowsWMain.cpp:123
22 	plugin-container.exe 	__scrt_common_main_seh 	f:/dd/vctools/crt/vcstartup/src/startup/exe_common.inl:255
Looking at the code where it is crashing, this may be a generic IPC issue, rather than something related specifically to Blobs, so I'm going to move this to IPC.
Component: DOM: Content Processes → IPC
Depends on: 1277946
This is a frequent crash in IPC code.
tracking-e10s: --- → ?
One of the IPC footgun patterns is to have an ActorDestroy method that indirectly (e.g., via dropping references) tries to do things to another actor that's later in the child array currently being destroyed; I wonder if there's an instance of that going on here.  See also bug 1226477 comment #35.
Who should own this?
Flags: needinfo?(jmathies)
Flags: needinfo?(continuation)
I have a ni? out on baku in the blocking sec bug.
Flags: needinfo?(jmathies)
Flags: needinfo?(continuation)
(In reply to David Baron :dbaron: ⌚️UTC+2 (review requests must explain patch) (vacation June 4-12) from comment #5)
> Who should own this?

I don't know if we have an primary owner of Blob code. bent wrote most of it, and Andrea Marchesini has done a bit of work on it. Gabor Kriszanits might be interested helping out here since this is dom/ipc related.
Flags: needinfo?(gkrizsanits)
(In reply to Jim Mathies [:jimm] from comment #7)
> (In reply to David Baron :dbaron: ⌚️UTC+2 (review requests must explain
> patch) (vacation June 4-12) from comment #5)
> > Who should own this?
> 
> I don't know if we have an primary owner of Blob code. bent wrote most of
> it, and Andrea Marchesini has done a bit of work on it. Gabor Kriszanits
> might be interested helping out here since this is dom/ipc related.

Sure. Can I have CC to the blocking sec bug?
Assignee: nobody → gkrizsanits
Flags: needinfo?(gkrizsanits)
Gabor, I CCed you. I wonder if this is a regression. I'm OK to review a patch. I was planning to work on this bug, but not before the work-week.
Flags: needinfo?(gkrizsanits)
Whiteboard: btpp-active
To avoid the scenario in Comment 4 I think we should set a flag when we start deallocating the children. And from that point on we ignore remove requests from children (we're about to remove all of them anyway). Instead of inventing a new flag or state for this I'm using the dying state here. I'm not sure if that's the right approach and what invariants I'm breaking with this. Bill, what do you think?
Flags: needinfo?(gkrizsanits)
Attachment #8761647 - Flags: feedback?(wmccloskey)
Priority: -- → P1
Can you include an example of how this changes the generated code (as a diff)? It's a lot easier to review that way.
Flags: needinfo?(gkrizsanits)
Attached file generated (obsolete) —
Sorry, I thought I did but it seems like I forgot :(
Flags: needinfo?(gkrizsanits)
Attachment #8761960 - Flags: feedback?(wmccloskey)
Comment on attachment 8761960 [details]
generated

Bugzilla cannot recognize a regular diff's as a patch so I'm unchecking it.
Attachment #8761960 - Attachment is patch: false
Why do you think the issue is in the IPC code and not in the PBlob? It seems that this happens only with PBlobChild::DestroySubtree.
Flags: needinfo?(gkrizsanits)
(In reply to Andrea Marchesini (:baku) from comment #14)
> Why do you think the issue is in the IPC code and not in the PBlob? It seems
> that this happens only with PBlobChild::DestroySubtree.

What Jed describes in Comment 4 and in the comment he links to there seems like something we should guard against anyway. This issue can explain the crash, and I cannot see any other explanation for it. Looking into the linked bugs through Comment 4 it does not seem like a Blob only issue by the way. We might be able to find a workaround for the crash at Blob level but I would prefer to fix the footgun in IPC itself so this won't happen again in another, similar situation in the future.

What I think might be happening here is that when we start the iteration in DeallocSubTree:
        for (auto iter = (mManagedPBlobStreamChild).Iter(); (!((iter).Done())); (iter).Next()) {

During that iteration it's fine to remove an element from the hash table (nsTHashtable<nsPtrHashKey<Protocol>>) via the iterator (iterator.Remove())

but if we trigger a PBlobChild::RemoveManagee call (in case of RemoteBlobSliceImpl this can happen) during the iteration, that removes an element directly by (mManagedPBlobStreamChild).RemoveEntry(actor); and that can totally confuse the iterator I think.

(the difference is between the two kind of removal is that the later also calls ShrinkIfAppropriate() on the map https://dxr.mozilla.org/mozilla-central/source/xpcom/glue/PLDHashTable.cpp#633)
Flags: needinfo?(gkrizsanits)
Thanks for you comment. I was/am only afraid that we are hiding some existing Blob+IPC bug.
I see the point of what you do. Sounds good to me.
Bill, I'm not sure you've seen it but this is an m9 :( do you think you will have some time this week to discuss this bug/patch?
Flags: needinfo?(wmccloskey)
Sorry for the delay. This doesn't look right to me though. My understanding from Jed's comment is that we get into trouble in DestroySubtree and never reach DeallocSubtree. So:

In DestroySubtree, we accumulate a bunch of children in |kids|. Then we iterate through them and call ActorDestroy. If one of those ActorDestroy methods calls Send__delete__ on a later sibling, then we'll delete it right away and remove it from mManagedPBlobStreamChild. However, we'll still iterate over it later because it's still in |kids|. At this point we crash with a UAF.

I wonder if we could modify DestroySubtree to verify that each element of |kids| is still in mManagedPBlobStreamChild before calling DestroySubtree on it. That should be quick since mManagedPBlobStreamChild is a hashtable.

The fact that we'll still send the __delete__ message is a bit unfortunate. In the shutdown case, the actual Send call will fail, which I think is fine. If this isn't during shutdown, then I guess we might end up crashing with a route error. Not sure how big a problem that should be. At least it's not a UAF.
Flags: needinfo?(wmccloskey)
(In reply to Bill McCloskey (:billm) from comment #18)
> Sorry for the delay. This doesn't look right to me though. My understanding
> from Jed's comment is that we get into trouble in DestroySubtree and never
> reach DeallocSubtree. So:

Not for the Blob but we do reach it for the child otherwise it would not remove itself from mManagedPBlobStreamChild no?

> 
> In DestroySubtree, we accumulate a bunch of children in |kids|. Then we
> iterate through them and call ActorDestroy. If one of those ActorDestroy
> methods calls Send__delete__ on a later sibling, then we'll delete it right
> away and remove it from mManagedPBlobStreamChild.

That's exactly what my patch is supposed to prevent. With my patch that child would
not remove itself from mManagedPBlobStreamChild and we would not destroy it yet.

> 
> I wonder if we could modify DestroySubtree to verify that each element of
> |kids| is still in mManagedPBlobStreamChild before calling DestroySubtree on
> it. That should be quick since mManagedPBlobStreamChild is a hashtable.

That was my initial plan, but let's say we do that, that would fix this problem,
but I would be still worried about the problem I described in Comment 15 which
we might hit in some cases. My solution would fix both problem at once.

> 
> The fact that we'll still send the __delete__ message is a bit unfortunate.
> In the shutdown case, the actual Send call will fail, which I think is fine.
> If this isn't during shutdown, then I guess we might end up crashing with a
> route error. Not sure how big a problem that should be. At least it's not a
> UAF.

I think this part should be another bug but once we set that flag (again I'm not sure
the state is the right place to indicate that we started destroying the blob and its
subtree) we should be able to prevent that message to be sent too.
Flags: needinfo?(wmccloskey)
Sorry, I made a mistake. I shouldn't have talked about mManagedPBlobStreamChild since that's not related to what's happening here.

I think the problem is that there are two blobs that are siblings. When one blob's ActorDestroy function is called, it ends up invoking Send__delete__ on a second blob. The problem is in PContentChild::DestroySubtree when it iterates over mManagedPBlobChild.

I don't see how the problem described in comment 15 can happen. DeallocSubTree is basically designed to recursively call methods like DeallocPBlobStreamChild or DeallocPBlobChild. However, those methods don't seem to do much besides dealloc memory. The destructors they call don't do anything. That's pretty typical for this sort of IPDL code.
Flags: needinfo?(wmccloskey)
Attaching diff for the generated code too in a bit.
Attachment #8761647 - Attachment is obsolete: true
Attachment #8761960 - Attachment is obsolete: true
Attachment #8761647 - Flags: feedback?(wmccloskey)
Attachment #8761960 - Flags: feedback?(wmccloskey)
Attachment #8763076 - Flags: review?(wmccloskey)
Attached file diff for the generated code (obsolete) —
I think there are some special characters in the previous diff for some reason that prevents bugzilla to open it. I'm trying to upload a fixed version I hope this helps.
Attachment #8763077 - Attachment is obsolete: true
Attachment #8763076 - Flags: review?(wmccloskey) → review+
Comment on attachment 8763076 [details] [diff] [review]
guarding child iteration in DestroySubtree

Approval Request Comment
[Feature/regressing bug #]: hard to tell
[User impact if declined]: it's a crashing bug with high rates and it's an e10s release blocker
[Describe test coverage new/current, TreeHerder]: I could never reproduce the issue locally so there are no tests for it but we think the problem is well understood and the fix might cover other similar cases. It's on inbound right now.
[Risks and why]: The patch is adding a trivial hash map check. I see no risk at all.
[String/UUID change made/needed]: none
Attachment #8763076 - Flags: approval-mozilla-beta?
Attachment #8763076 - Flags: approval-mozilla-aurora?
Adding the signatures from the duped bug.
Crash Signature: [@ mozilla::dom::PBlobChild::DestroySubtree] → [@ mozilla::dom::PBlobChild::DestroySubtree] [@ @0x0 | mozilla::dom::PContentChild::DestroySubtree] [@ mozilla::dom::PContentChild::DestroySubtree]
https://hg.mozilla.org/mozilla-central/rev/f96486bb6405
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla50
Comment on attachment 8763076 [details] [diff] [review]
guarding child iteration in DestroySubtree

Anything to fix e10s crashes, taking it.
Should be in 48 beta 3
Attachment #8763076 - Flags: approval-mozilla-beta?
Attachment #8763076 - Flags: approval-mozilla-beta+
Attachment #8763076 - Flags: approval-mozilla-aurora?
Attachment #8763076 - Flags: approval-mozilla-aurora+
No longer depends on: 1277946
The mozilla::dom::PBlobChild::DestroySubtree signature, which is by far the most common, does seem to have gone away when these patches landed, which is great. mozilla::dom::PContentChild::DestroySubtree only seems to have happened once recently, so I guess that's okay, too. @0x0 | mozilla::dom::PContentChild::DestroySubtree is still happening, but it looks like it is happening somewhere else in the code, so I'll undupe a bug to track that separately.
No need for this on ESR, as it is e10s-specific.
Whiteboard: btpp-active → btpp-active, e10s-only
After uplift, there are still 5 reports in the last week for this signature on aurora 49: @0x0 | mozilla::dom::PContentChild::DestroySubtree 
and 62 reports in the last week for mozilla::dom::PBlobChild::DestroySubtree . 

Does this need a follow up bug? Did this fix most if not all of the issue in aurora?
Flags: needinfo?(gkrizsanits)
(In reply to Liz Henry (:lizzard) (needinfo? me) from comment #37)
> Does this need a follow up bug? Did this fix most if not all of the issue in
> aurora?

I don't see any mozilla::dom::PBlobChild::DestroySubtree crashes on builds post 6-23, when this was fixed. Bug 1281223 covers the resididual @0x0 | mozilla::dom::PContentChild::DestroySubtree crashes, which seem like a separate issue.
Flags: needinfo?(gkrizsanits)
See Also: → 1281223
Whiteboard: btpp-active, e10s-only → [adv-main48-] btpp-active, e10s-only
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: