Closed
Bug 1281223
Opened 8 years ago
Closed 6 years ago
Crash in @0x0 | mozilla::dom::PContentChild::DestroySubtree
Categories
(Core :: DOM: Content Processes, defect)
Tracking
()
RESOLVED
WONTFIX
People
(Reporter: marcia, Assigned: cyu)
References
Details
(Keywords: crash)
Crash Data
Attachments
(1 file, 2 obsolete files)
1003 bytes,
patch
|
billm
:
review+
|
Details | Diff | Splinter Review |
This bug was filed from the Socorro interface and is
report bp-94332e1d-2933-45c2-a69a-ae6f12160617.
=============================================================
Seen while looking at Aurora crash stats. Link to crashes: http://bit.ly/28S2KL8. Some dupes in the mix but appears to be a new signature in Aurora. Windows 7 and Win XP showing up as the top operating systems.
Frame Module Signature Source
0 @0x0
1 xul.dll mozilla::dom::PContentChild::DestroySubtree(mozilla::ipc::IProtocolManager<mozilla::ipc::IProtocol>::ActorDestroyReason) obj-firefox/ipc/ipdl/PContentChild.cpp:10764
2 xul.dll mozilla::dom::PContentChild::OnChannelClose() obj-firefox/ipc/ipdl/PContentChild.cpp:10370
3 xul.dll mozilla::ipc::MessageChannel::NotifyChannelClosed() ipc/glue/MessageChannel.cpp:2232
4 xul.dll mozilla::ipc::MessageChannel::OnNotifyMaybeChannelError() ipc/glue/MessageChannel.cpp:2107
5 xul.dll nsRunnableMethodImpl<bool ( mozilla::ipc::MessageChannel::*)(void), 0, 1>::Run() obj-firefox/dist/include/nsThreadUtils.h:749
6 xul.dll nsThread::ProcessNextEvent(bool, bool*) xpcom/threads/nsThread.cpp:1029
7 xul.dll mozilla::ipc::MessagePump::Run(base::MessagePump::Delegate*) ipc/glue/MessagePump.cpp:100
8 xul.dll mozilla::ipc::MessagePumpForChildProcess::Run(base::MessagePump::Delegate*) ipc/glue/MessagePump.cpp:317
9 xul.dll _SEH_epilog4
Reporter | ||
Updated•8 years ago
|
Crash Signature: [@ @0x0 | mozilla::dom::PContentChild::DestroySubtree] → [@ @0x0 | mozilla::dom::PContentChild::DestroySubtree]
[@ mozilla::dom::PContentChild::DestroySubtree]
Reporter | ||
Updated•8 years ago
|
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → DUPLICATE
Comment 2•8 years ago
|
||
I think at least some of these crashes are separate from bug 1277614, because they have continued after patches for it landed. The first step will be figuring out exactly where this crash is in the IPDL generated code.
Status: RESOLVED → REOPENED
Crash Signature: [@ @0x0 | mozilla::dom::PContentChild::DestroySubtree]
[@ mozilla::dom::PContentChild::DestroySubtree] → [@ @0x0 | mozilla::dom::PContentChild::DestroySubtree]
Resolution: DUPLICATE → ---
Kan-ru, could someone on your team look at the minidump disassembly and try to figure out where things are going wrong? My best guess is that the crash is here:
// Accumulate kids into a stable structure to iterate over
> ManagedPFlyWebPublishedServerChild(kids);
for (uint32_t i = 0; (i) < ((kids).Length()); (++(i))) {
But I don't see how we could crash there. So maybe it's somewhere inside ManagedPFlyWebPublishedServerChild?
Flags: needinfo?(kchen)
Assignee | ||
Comment 5•8 years ago
|
||
The most common case for executing 0x0 is calling pure virtual function. Because we don't empty the nsTArray for the managees, we could end up calling DestroySubtree() of the managees and call the pure virtual function. I think we can at empty the managed container to bandage patch this crash to make it not crash as long as the PContentChild instance is still alive (maybe because of ref counting).
Assignee: nobody → cyu
Flags: needinfo?(cyu)
QA Contact: cyu
Assignee | ||
Comment 6•8 years ago
|
||
(In reply to Cervantes Yu [:cyu] [:cervantes] from comment #5)
> Because we don't empty the nsTArray for the managees, we could end up
s/nsTArray/ManagedContainer
> calling DestroySubtree() of the managees and call the pure virtual function.
Assignee | ||
Comment 7•8 years ago
|
||
Forget
Assignee | ||
Comment 8•8 years ago
|
||
Forget comment #5 and #6. The hashtable of managees are Clear()'d in DeallocSubtree(). Calling pure virtual function happens by the end of DestroySubtree() when ActorDestroy() is called. So this is a user-after-free of PContentChild.
Assignee | ||
Updated•8 years ago
|
QA Contact: cyu
Hmm. As far as I know, we never delete a ContentChild. We QuickExit before that can possibly happen.
Assignee | ||
Comment 10•8 years ago
|
||
Given that MessageChannel holds a weak pointer to the listener, even ContentChild is deleted in any unexpected way, we should see pure virtual call of MessageListener::OnChannelClose() instead of ContentChild::ActorDestroy(). I suggest we add diagnostic code to see the content of ContentChild in similar crashes to check if ContentChild is dead with the WeakPtr to it not nulled out.
Assignee | ||
Comment 11•8 years ago
|
||
This adds the heap area for the ContentChild instance to the crash dump to allow us to see its content in the crash dump and help us debug this crash.
Attachment #8770114 -
Flags: review?(wmccloskey)
Comment on attachment 8770114 [details] [diff] [review]
Diagnostic patch
Review of attachment 8770114 [details] [diff] [review]:
-----------------------------------------------------------------
Minidumps for the content process are generated in the parent, so I don't think this code will have any effect.
Additionally, as far as I know, we never delete a ContentChild in a release build. Perhaps a reasonable diagnostic would be to MOZ_RELEASE_CRASH in the ~ContentChild if NS_FREE_PERMANENT_DATA is defined.
However, I think a better approach would be to look at one of the minidumps we have in Visual Studio and try to figure out from the assembly where we're actually crashing.
Attachment #8770114 -
Flags: review?(wmccloskey) → review-
Assignee | ||
Comment 13•8 years ago
|
||
I loaded several dumps and the crashes all happen in the call instruction of:
// Finally, destroy "us".
ActorDestroy(why);
013B9D2D mov eax,dword ptr [edi]
013B9D2F mov ecx,edi
013B9D31 push ebx
013B9D32 call dword ptr [eax+2B4h]
013B9D38 pop edi
013B9D39 pop esi
013B9D3A pop ebp
013B9D3B pop ebx
}
I am pretty sure that it's pure virtual function called with the ContentChild instance so I guess we need to MOZ_CRASH() in the destructor to know why ContentChild is destroyed.
Assignee | ||
Comment 14•8 years ago
|
||
I decide to use a stronger debug instrumentation, though it's hacky. This guarantees that whenever the vptr of ContentChild instance is overwritten, whether the instance is deleted unexpectedly, or the address is corrupted, the process will crash. We only need to watch for crash reason EXCEPTION_BREAKPOINT and check the crash stack.
Attachment #8770114 -
Attachment is obsolete: true
Attachment #8773255 -
Flags: review?(wmccloskey)
Comment on attachment 8773255 [details] [diff] [review]
Diagnostic patch V2
Review of attachment 8773255 [details] [diff] [review]:
-----------------------------------------------------------------
I'm worried this will mess up debuggers. Why don't we try a MOZ_RELEASE_ASSERT first? If that doesn't work, we can do something like this.
::: dom/ipc/ContentChild.cpp
@@ +604,5 @@
> + // Setting debug register DR0.
> + context.Dr0 = reinterpret_cast<DWORD>(aAddr);
> +
> + // Setting control register DR7.
> + // Bit 0: only enable local-level debugging for DR0.
Wikipedia says the local registers are cleared on "task switch"? Does that mean context switch? If so, this isn't going to work.
@@ +608,5 @@
> + // Bit 0: only enable local-level debugging for DR0.
> + context.Dr7 |= 1 << 0;
> + // Bit 16 and 17: break on write (01b) at the address in DR0.
> + context.Dr7 |= 1 << 16;
> + context.Dr7 &= ~(1 << 17);
Don't you have to set bits 18 and 19 to 11 since you want to watch 4 bytes?
Attachment #8773255 -
Flags: review?(wmccloskey) → review-
Assignee | ||
Comment 16•8 years ago
|
||
(In reply to Bill McCloskey (:billm) from comment #15)
> Comment on attachment 8773255 [details] [diff] [review]
> Diagnostic patch V2
>
> Review of attachment 8773255 [details] [diff] [review]:
> -----------------------------------------------------------------
>
> I'm worried this will mess up debuggers. Why don't we try a
We can skip installing the breakpoint if IsDebuggerPresent() is present.
> MOZ_RELEASE_ASSERT first? If that doesn't work, we can do something like
> this.
>
I just want to capture the bug in one shot that doesn't require another round of debugging on the user side. But due to the problem of the current patch, I'd prefer deploying a simpler patch first to see if it captures the bug.
> ::: dom/ipc/ContentChild.cpp
> @@ +604,5 @@
> > + // Setting debug register DR0.
> > + context.Dr0 = reinterpret_cast<DWORD>(aAddr);
> > +
> > + // Setting control register DR7.
> > + // Bit 0: only enable local-level debugging for DR0.
>
> Wikipedia says the local registers are cleared on "task switch"? Does that
> mean context switch? If so, this isn't going to work.
>
"Task switch" here refers to hardware context switch, which only saves general-purpose registers. Owing to that, mainstream kernels use software context switch instead. The registers are saved/restored in context switches. The problem with the patch is that it doesn't work correctly in a multithreaded program. We may need to use DllMain() to set the debug registers to all threads.
> @@ +608,5 @@
> > + // Bit 0: only enable local-level debugging for DR0.
> > + context.Dr7 |= 1 << 0;
> > + // Bit 16 and 17: break on write (01b) at the address in DR0.
> > + context.Dr7 |= 1 << 16;
> > + context.Dr7 &= ~(1 << 17);
>
> Don't you have to set bits 18 and 19 to 11 since you want to watch 4 bytes?
Yes, this needs to set these 2 bits to watch 4 byes.
Assignee | ||
Comment 17•8 years ago
|
||
Attachment #8773255 -
Attachment is obsolete: true
Attachment #8774283 -
Flags: review?(wmccloskey)
Attachment #8774283 -
Flags: review?(wmccloskey) → review+
Assignee | ||
Comment 18•8 years ago
|
||
https://hg.mozilla.org/integration/mozilla-inbound/rev/d45c9f5c4bc84a796e60eb39e4712d434ade89c2
Bug 1281223 - Debug instrumentation for capturing accidental destruction of the ContentChild instance. r=billm
Assignee | ||
Updated•8 years ago
|
Keywords: leave-open
Comment 19•8 years ago
|
||
https://treeherder.mozilla.org/logviewer.html#?job_id=32785114&repo=mozilla-inbound -> bustage and so i had to back this out
Flags: needinfo?(cyu)
Comment 20•8 years ago
|
||
Backout by cbook@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/0b40f1aa596a
Backed out changeset d45c9f5c4bc8 for bustage on a CLOSED TREE
Assignee | ||
Comment 21•8 years ago
|
||
Can't call MOZ_CRASH() or anything else calling it in dtor because it will be treated as potential memory leak by MSVC compiler. Need to find a way to crash without warning C4722.
Flags: needinfo?(cyu)
Assignee | ||
Comment 22•8 years ago
|
||
Assignee | ||
Comment 23•8 years ago
|
||
(In reply to Cervantes Yu [:cyu] [:cervantes] from comment #22)
> https://treeherder.mozilla.org/#/jobs?repo=try&revision=6ec44e25d922
NS_RUNTIMEABORT() looks good on winxp build. Will proceed with this.
Assignee | ||
Comment 24•8 years ago
|
||
https://hg.mozilla.org/integration/mozilla-inbound/rev/e7e76c9521024792f6d9b430913bdaff3cb11254
Bug 1281223 - Debug instrumentation for catching accidental destruction of the ContentChild instance. r=billm
Comment 25•8 years ago
|
||
bugherder |
Comment 26•8 years ago
|
||
Crash volume for signature '@0x0 | mozilla::dom::PContentChild::DestroySubtree':
- nightly (version 51): 0 crashes from 2016-08-01.
- aurora (version 50): 5 crashes from 2016-08-01.
- beta (version 49): 338 crashes from 2016-08-02.
- release (version 48): 0 crashes from 2016-07-25.
- esr (version 45): 0 crashes from 2016-05-02.
Crash volume on the last weeks (Week N is from 08-22 to 08-28):
W. N-1 W. N-2 W. N-3
- nightly 0 0 0
- aurora 0 0 0
- beta 33 182 121
- release 0 0 0
- esr 0 0 0
Affected platform: Windows
Crash rank on the last 7 days:
Browser Content Plugin
- nightly
- aurora #225
- beta #1426
- release
- esr
status-firefox50:
--- → affected
Comment 27•6 years ago
|
||
Closing because no crashes reported for 12 weeks.
Status: REOPENED → RESOLVED
Closed: 8 years ago → 6 years ago
Resolution: --- → WONTFIX
Updated•6 years ago
|
You need to log in
before you can comment on or make changes to this bug.
Description
•