Closed Bug 1281223 Opened 8 years ago Closed 6 years ago

Crash in @0x0 | mozilla::dom::PContentChild::DestroySubtree

Categories

(Core :: DOM: Content Processes, defect)

49 Branch
x86
Windows 7
defect
Not set
critical

Tracking

()

RESOLVED WONTFIX
Tracking Status
firefox49 --- wontfix
firefox50 --- wontfix

People

(Reporter: marcia, Assigned: cyu)

References

Details

(Keywords: crash)

Crash Data

Attachments

(1 file, 2 obsolete files)

This bug was filed from the Socorro interface and is 
report bp-94332e1d-2933-45c2-a69a-ae6f12160617.
=============================================================

Seen while looking at Aurora crash stats. Link to crashes: http://bit.ly/28S2KL8. Some dupes in the mix but appears to be a new signature in Aurora. Windows 7 and Win XP showing up as the top operating systems.


Frame 	Module 	Signature 	Source
0 		@0x0 	
1 	xul.dll 	mozilla::dom::PContentChild::DestroySubtree(mozilla::ipc::IProtocolManager<mozilla::ipc::IProtocol>::ActorDestroyReason) 	obj-firefox/ipc/ipdl/PContentChild.cpp:10764
2 	xul.dll 	mozilla::dom::PContentChild::OnChannelClose() 	obj-firefox/ipc/ipdl/PContentChild.cpp:10370
3 	xul.dll 	mozilla::ipc::MessageChannel::NotifyChannelClosed() 	ipc/glue/MessageChannel.cpp:2232
4 	xul.dll 	mozilla::ipc::MessageChannel::OnNotifyMaybeChannelError() 	ipc/glue/MessageChannel.cpp:2107
5 	xul.dll 	nsRunnableMethodImpl<bool ( mozilla::ipc::MessageChannel::*)(void), 0, 1>::Run() 	obj-firefox/dist/include/nsThreadUtils.h:749
6 	xul.dll 	nsThread::ProcessNextEvent(bool, bool*) 	xpcom/threads/nsThread.cpp:1029
7 	xul.dll 	mozilla::ipc::MessagePump::Run(base::MessagePump::Delegate*) 	ipc/glue/MessagePump.cpp:100
8 	xul.dll 	mozilla::ipc::MessagePumpForChildProcess::Run(base::MessagePump::Delegate*) 	ipc/glue/MessagePump.cpp:317
9 	xul.dll 	_SEH_epilog4
Crash Signature: [@ @0x0 | mozilla::dom::PContentChild::DestroySubtree] → [@ @0x0 | mozilla::dom::PContentChild::DestroySubtree] [@ mozilla::dom::PContentChild::DestroySubtree]
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → DUPLICATE
I think at least some of these crashes are separate from bug 1277614, because they have continued after patches for it landed. The first step will be figuring out exactly where this crash is in the IPDL generated code.
Status: RESOLVED → REOPENED
Crash Signature: [@ @0x0 | mozilla::dom::PContentChild::DestroySubtree] [@ mozilla::dom::PContentChild::DestroySubtree] → [@ @0x0 | mozilla::dom::PContentChild::DestroySubtree]
Resolution: DUPLICATE → ---
Kan-ru, could someone on your team look at the minidump disassembly and try to figure out where things are going wrong? My best guess is that the crash is here:

          // Accumulate kids into a stable structure to iterate over
>         ManagedPFlyWebPublishedServerChild(kids);
          for (uint32_t i = 0; (i) < ((kids).Length()); (++(i))) {

But I don't see how we could crash there. So maybe it's somewhere inside ManagedPFlyWebPublishedServerChild?
Flags: needinfo?(kchen)
Cervantes?
Flags: needinfo?(kchen) → needinfo?(cyu)
The most common case for executing 0x0 is calling pure virtual function. Because we don't empty the nsTArray for the managees, we could end up calling DestroySubtree() of the managees and call the pure virtual function. I think we can at empty the managed container to bandage patch this crash to make it not crash as long as the PContentChild instance is still alive (maybe because of ref counting).
Assignee: nobody → cyu
Flags: needinfo?(cyu)
QA Contact: cyu
(In reply to Cervantes Yu [:cyu] [:cervantes] from comment #5)
> Because we don't empty the nsTArray for the managees, we could end up
s/nsTArray/ManagedContainer
> calling DestroySubtree() of the managees and call the pure virtual function.
Forget
Forget comment #5 and #6. The hashtable of managees are Clear()'d in DeallocSubtree(). Calling pure virtual function happens by the end of DestroySubtree() when ActorDestroy() is called. So this is a user-after-free of PContentChild.
QA Contact: cyu
Hmm. As far as I know, we never delete a ContentChild. We QuickExit before that can possibly happen.
See Also: → 1277614
Given that MessageChannel holds a weak pointer to the listener, even ContentChild is deleted in any unexpected way, we should see pure virtual call of MessageListener::OnChannelClose() instead of ContentChild::ActorDestroy(). I suggest we add diagnostic code to see the content of ContentChild in similar crashes to check if ContentChild is dead with the WeakPtr to it not nulled out.
Attached patch Diagnostic patch (obsolete) — Splinter Review
This adds the heap area for the ContentChild instance to the crash dump to allow us to see its content in the crash dump and help us debug this crash.
Attachment #8770114 - Flags: review?(wmccloskey)
Comment on attachment 8770114 [details] [diff] [review]
Diagnostic patch

Review of attachment 8770114 [details] [diff] [review]:
-----------------------------------------------------------------

Minidumps for the content process are generated in the parent, so I don't think this code will have any effect.

Additionally, as far as I know, we never delete a ContentChild in a release build. Perhaps a reasonable diagnostic would be to MOZ_RELEASE_CRASH in the ~ContentChild if NS_FREE_PERMANENT_DATA is defined.

However, I think a better approach would be to look at one of the minidumps we have in Visual Studio and try to figure out from the assembly where we're actually crashing.
Attachment #8770114 - Flags: review?(wmccloskey) → review-
I loaded several dumps and the crashes all happen in the call instruction of:

    // Finally, destroy "us".
    ActorDestroy(why);
013B9D2D  mov         eax,dword ptr [edi]  
013B9D2F  mov         ecx,edi  
013B9D31  push        ebx  
013B9D32  call        dword ptr [eax+2B4h]
013B9D38  pop         edi  
013B9D39  pop         esi  
013B9D3A  pop         ebp  
013B9D3B  pop         ebx  
}

I am pretty sure that it's pure virtual function called with the ContentChild instance so I guess we need to MOZ_CRASH() in the destructor to know why ContentChild is destroyed.
Attached patch Diagnostic patch V2 (obsolete) — Splinter Review
I decide to use a stronger debug instrumentation, though it's hacky. This guarantees that whenever the vptr of ContentChild instance is overwritten, whether the instance is deleted unexpectedly, or the address is corrupted, the process will crash. We only need to watch for crash reason EXCEPTION_BREAKPOINT and check the crash stack.
Attachment #8770114 - Attachment is obsolete: true
Attachment #8773255 - Flags: review?(wmccloskey)
Comment on attachment 8773255 [details] [diff] [review]
Diagnostic patch V2

Review of attachment 8773255 [details] [diff] [review]:
-----------------------------------------------------------------

I'm worried this will mess up debuggers. Why don't we try a MOZ_RELEASE_ASSERT first? If that doesn't work, we can do something like this.

::: dom/ipc/ContentChild.cpp
@@ +604,5 @@
> +  // Setting debug register DR0.
> +  context.Dr0 = reinterpret_cast<DWORD>(aAddr);
> +
> +  // Setting control register DR7.
> +  // Bit 0: only enable local-level debugging for DR0.

Wikipedia says the local registers are cleared on "task switch"? Does that mean context switch? If so, this isn't going to work.

@@ +608,5 @@
> +  // Bit 0: only enable local-level debugging for DR0.
> +  context.Dr7 |= 1 << 0;
> +  // Bit 16 and 17: break on write (01b) at the address in DR0.
> +  context.Dr7 |= 1 << 16;
> +  context.Dr7 &= ~(1 << 17);

Don't you have to set bits 18 and 19 to 11 since you want to watch 4 bytes?
Attachment #8773255 - Flags: review?(wmccloskey) → review-
(In reply to Bill McCloskey (:billm) from comment #15)
> Comment on attachment 8773255 [details] [diff] [review]
> Diagnostic patch V2
> 
> Review of attachment 8773255 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> I'm worried this will mess up debuggers. Why don't we try a
We can skip installing the breakpoint if IsDebuggerPresent() is present.
> MOZ_RELEASE_ASSERT first? If that doesn't work, we can do something like
> this.
> 
I just want to capture the bug in one shot that doesn't require another round of debugging on the user side. But due to the problem of the current patch, I'd prefer deploying a simpler patch first to see if it captures the bug.

> ::: dom/ipc/ContentChild.cpp
> @@ +604,5 @@
> > +  // Setting debug register DR0.
> > +  context.Dr0 = reinterpret_cast<DWORD>(aAddr);
> > +
> > +  // Setting control register DR7.
> > +  // Bit 0: only enable local-level debugging for DR0.
> 
> Wikipedia says the local registers are cleared on "task switch"? Does that
> mean context switch? If so, this isn't going to work.
> 
"Task switch" here refers to hardware context switch, which only saves general-purpose registers. Owing to that, mainstream kernels use software context switch instead. The registers are saved/restored in context switches. The problem with the patch is that it doesn't work correctly in a multithreaded program. We may need to use DllMain() to set the debug registers to all threads.

> @@ +608,5 @@
> > +  // Bit 0: only enable local-level debugging for DR0.
> > +  context.Dr7 |= 1 << 0;
> > +  // Bit 16 and 17: break on write (01b) at the address in DR0.
> > +  context.Dr7 |= 1 << 16;
> > +  context.Dr7 &= ~(1 << 17);
> 
> Don't you have to set bits 18 and 19 to 11 since you want to watch 4 bytes?

Yes, this needs to set these 2 bits to watch 4 byes.
Attachment #8773255 - Attachment is obsolete: true
Attachment #8774283 - Flags: review?(wmccloskey)
Attachment #8774283 - Flags: review?(wmccloskey) → review+
https://hg.mozilla.org/integration/mozilla-inbound/rev/d45c9f5c4bc84a796e60eb39e4712d434ade89c2
Bug 1281223 - Debug instrumentation for capturing accidental destruction of the ContentChild instance. r=billm
Keywords: leave-open
https://treeherder.mozilla.org/logviewer.html#?job_id=32785114&repo=mozilla-inbound -> bustage and so i had to back this out
Flags: needinfo?(cyu)
Backout by cbook@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/0b40f1aa596a
Backed out changeset d45c9f5c4bc8 for bustage on a CLOSED TREE
Can't call MOZ_CRASH() or anything else calling it in dtor because it will be treated as potential memory leak by MSVC compiler. Need to find a way to crash without warning C4722.
Flags: needinfo?(cyu)
(In reply to Cervantes Yu [:cyu] [:cervantes] from comment #22)
> https://treeherder.mozilla.org/#/jobs?repo=try&revision=6ec44e25d922

NS_RUNTIMEABORT() looks good on winxp build. Will proceed with this.
https://hg.mozilla.org/integration/mozilla-inbound/rev/e7e76c9521024792f6d9b430913bdaff3cb11254
Bug 1281223 - Debug instrumentation for catching accidental destruction of the ContentChild instance. r=billm
Crash volume for signature '@0x0 | mozilla::dom::PContentChild::DestroySubtree':
 - nightly (version 51): 0 crashes from 2016-08-01.
 - aurora  (version 50): 5 crashes from 2016-08-01.
 - beta    (version 49): 338 crashes from 2016-08-02.
 - release (version 48): 0 crashes from 2016-07-25.
 - esr     (version 45): 0 crashes from 2016-05-02.

Crash volume on the last weeks (Week N is from 08-22 to 08-28):
            W. N-1  W. N-2  W. N-3
 - nightly       0       0       0
 - aurora        0       0       0
 - beta         33     182     121
 - release       0       0       0
 - esr           0       0       0

Affected platform: Windows

Crash rank on the last 7 days:
           Browser     Content   Plugin
 - nightly
 - aurora  #225
 - beta    #1426
 - release
 - esr
Closing because no crashes reported for 12 weeks.
Status: REOPENED → RESOLVED
Closed: 8 years ago6 years ago
Resolution: --- → WONTFIX
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: