Closed Bug 1138520 Opened 5 years ago Closed 4 years ago

shutdownhang in mozilla::dom::ContentParent::Observe

Categories

(Core :: DOM: Content Processes, defect, critical)

x86
Windows NT
defect
Not set
critical

Tracking

()

RESOLVED FIXED
Tracking Status
firefox37 + disabled
firefox38 + disabled
firefox39 + fixed
firefox40 --- fixed

People

(Reporter: kairo, Assigned: billm)

References

Details

(Keywords: crash, topcrash-win)

Crash Data

Attachments

(2 files)

This bug was filed from the Socorro interface and is 
report bp-0112a5f7-4dad-492c-985b-1b17c2150226.
=============================================================

Top frames - Thread 0:
Frame 	Module 	Signature 	Source
0 	ntdll.dll 	KiFastSystemCallRet 	
1 	ntdll.dll 	NtWaitForSingleObject 	
2 	kernelbase.dll 	WaitForSingleObjectEx 	
3 	kernel32.dll 	WaitForSingleObjectExImplementation 	
4 	kernel32.dll 	WaitForSingleObject 	
5 	nss3.dll 	PR_Wait 	nsprpub/pr/src/threads/prmon.c
6 	xul.dll 	nsThread::ProcessNextEvent(bool, bool*) 	xpcom/threads/nsThread.cpp
7 	xul.dll 	NS_ProcessNextEvent(nsIThread*, bool) 	xpcom/glue/nsThreadUtils.cpp
8 	xul.dll 	mozilla::dom::ContentParent::Observe(nsISupports*, char const*, wchar_t const*) 	dom/ipc/ContentParent.cpp
9 	xul.dll 	nsObserverList::NotifyObservers(nsISupports*, char const*, wchar_t const*) 	xpcom/ds/nsObserverList.cpp
10 	xul.dll 	nsObserverService::NotifyObservers(nsISupports*, char const*, wchar_t const*) 	xpcom/ds/nsObserverService.cpp
11 	xul.dll 	nsXREDirProvider::DoShutdown() 	toolkit/xre/nsXREDirProvider.cpp
12 	xul.dll 	ScopedXPCOMStartup::~ScopedXPCOMStartup() 	toolkit/xre/nsAppRunner.cpp
13 	xul.dll 	ScopedXPCOMStartup::`scalar deleting destructor'(unsigned int) 	
14 	xul.dll 	XREMain::XRE_main(int, char** const, nsXREAppData const*) 	toolkit/xre/nsAppRunner.cpp
15 	xul.dll 	XRE_main 	toolkit/xre/nsAppRunner.cpp
[...]

More reports and stats at https://crash-stats.mozilla.com/report/list?signature=shutdownhang+%7C+WaitForSingleObjectEx+%7C+WaitForSingleObject+%7C+PR_Wait+%7C+nsThread%3A%3AProcessNextEvent%28bool%2C+bool%2A%29+%7C+NS_ProcessNextEvent%28nsIThread%2A%2C+bool%29+%7C+mozilla%3A%3Adom%3A%3AContentParent%3A%3AObserve%28nsISupports%2A%2C+char+const%2A%2C+wchar_t+const%2A%29

This is the #11 crash in Firefox 37.0b1 and the second-most common shutdownhang in that build (after bug 1124880).
The reports come from all Windows versions with a bit more emphasis on XP than usual, I'm at a loss to find any data that could help narrow this down, unfortunately.
[Tracking Requested - why for this release]:

Odd that this is a content process but not e10s, so it's likely to be the thumbnailing content process.

blame  of http://hg.mozilla.org/releases/mozilla-beta/annotate/276fd03ae5ac/dom/ipc/ContentParent.cpp#l2776 has this on billm: we're in the while (mIPCOpen) loop for a long time/forever.
Component: IPC → DOM: Content Processes
Flags: needinfo?(wmccloskey)
Is there any way we could get stacks for the content processes as well? It's possible that the code here is busted, but more likely the content process is failing to shut down.
Flags: needinfo?(wmccloskey)
Although we have a 5 second timer that's supposed to kill content processes at shutdown, so something clearly isn't going right. I assume that this shutdown hang detector triggers after more than 5 seconds?
Flags: needinfo?(benjamin)
Yeah, 60 seconds.
Flags: needinfo?(benjamin)
We can't get content process stacks from socorro because we're not even sure that the process still exists. You'd have to write explicit code to try and collect the content process minidump.
I guess the real problem is that the timer isn't doing its job. I'll look into that first.

Just our of general interest, where is the code for the shutdown hang detection?
[Tracking Requested - why for this release]: Tracking this for 37 and up since it's a high volume crash. I haven't actually checked crash-stats to see if 38 and 39 are affected but it seems likely.
Bill would you mind assigning yourself to the bug if you're taking it? Release management is trying to make sure all bugs in beta (and maybe aurora too) have assignees. Thanks!
Flags: needinfo?(wmccloskey)
Assignee: nobody → wmccloskey
Flags: needinfo?(wmccloskey)
We might want to back bug 1103036 out on beta.
That may break some of the child process leak logging stuff I landed, but working around that will just be a matter of backing out a few mochitest configuration changes.
Bill - Are we going to bug out bug 1103036 along with the mochitest config changes that mccr8 mentions above? Can you make that happen this week?
Flags: needinfo?(wmccloskey)
QA Whiteboard: [triaged]
The patch has been backed out on beta.
Flags: needinfo?(wmccloskey)
I forgot to mark this as a topcrash so I'm doing that now.

For Firefox 37 this crash is #4 and dropping since the backout.
For Firefox 38 this crash is at #275 and dropping.
For Firefox 39 this crash is at #32 and stable.
Keywords: topcrash-win
Crash Signature: [@ shutdownhang | WaitForSingleObjectEx | WaitForSingleObject | PR_Wait | nsThread::ProcessNextEvent(bool, bool*) | NS_ProcessNextEvent(nsIThread*, bool) | mozilla::dom::ContentParent::Observe(nsISupports*, char const*, wchar_t const*)] → [@ shutdownhang | WaitForSingleObjectEx | WaitForSingleObject | PR_Wait | nsThread::ProcessNextEvent(bool, bool*) | NS_ProcessNextEvent(nsIThread*, bool) | mozilla::dom::ContentParent::Observe(nsISupports*, char const*, wchar_t const*)] [@ shutdownhang |…
Crash Signature: , bool) | mozilla::dom::ContentParent::Observe(nsISupports*, char const*, wc... ] → , bool) | mozilla::dom::ContentParent::Observe(nsISupports*, char const*, wc... ]
Crash Signature: , wchar_t const*)] [@ shutdownhang | WaitForSingleObjectEx | WaitForSingleObject | PR_Wait | mozilla::ReentrantMonitor::Wait(unsigned int) | nsThread::ProcessNextEvent(bool, bool*) | NS_ProcessNextEvent(nsIThread*, bool) | mozilla::dom::ContentParent::Ob… → , wchar_t const*) ] [@ shutdownhang | WaitForSingleObjectEx | WaitForSingleObject | PR_Wait | mozilla::ReentrantMonitor::Wait(unsigned int) | nsThread::ProcessNextEvent(bool, bool*) | NS_ProcessNextEvent(nsIThread*, bool) | mozilla::dom::ContentParent::O…
Crash Signature: , bool) | mozilla::dom::ContentParent::Observe(nsISupports*, char const*, wc... ] → , bool) | mozilla::dom::ContentParent::Observe(nsISupports*, char const*, wc... ] [@ shutdownhang | WaitForSingleObjectEx | WaitForSingleObject | PR_Wait | mozilla::ReentrantMonitor::Wait(unsigned int) | nsThread::ProcessNextEvent(bool, bool*) | NS_Proce…
Bill, can we backout bug 1103036 for 38 too? Thanks
Flags: needinfo?(wmccloskey)
IMHO, if we can back out for 38 without problems, we should see to do that on 39 as well if possible, esp. as 39 will have an extremely short beta cycle.
Looks like the backout on Firefox 37 worked as I'm seeing no crashes reported on that branch with Firefox 37.0b7 or later and this doesn't register at all in the top-300 crashes on Socorro for Fx37.
(In reply to Anthony Hughes, QA Mentor (:ashughes) from comment #17)
> Looks like the backout on Firefox 37 worked as I'm seeing no crashes
> reported on that branch with Firefox 37.0b7 or later and this doesn't
> register at all in the top-300 crashes on Socorro for Fx37.

Right. It's still a top issue on 38 (and probably on 39 once the async plugin init fallout has been cleaned up).
This is just a patch to try to understand why we're hanging. I'll back out on beta once my try push finishes.
Attachment #8592008 - Flags: review?(nchen)
Comment on attachment 8592008 [details] [diff] [review]
investigative patch

LGTM!
Attachment #8592008 - Flags: review?(nchen) → review+
I had to add a few extra conditions so that this would pass try:
https://hg.mozilla.org/integration/mozilla-inbound/rev/c13eef766b75
Keywords: leave-open
This broke ASAN builds because they have disabled the crash reporter.  I pushed a follow-up to fix that:

https://hg.mozilla.org/integration/mozilla-inbound/rev/b35bfa6c9c0e
Crash Signature: , bool*) | NS_ProcessNextEvent(nsIThread*, bool) | mozilla::dom::ContentParent::Observe(nsISupports*, cha ] → , bool*) | NS_ProcessNextEvent(nsIThread*, bool) | mozilla::dom::ContentParent::Observe(nsISupports*, cha ] [@ shutdownhang | WaitForSingleObjectEx | PR_Wait | mozilla::ReentrantMonitor::Wait(unsigned int) | nsThread::ProcessNextEvent(bool, bool*) | NS_P…
Attached patch possible fixSplinter Review
The crash reports suggest that we're calling KillHard but it's not causing ActorDestroy to be called. That's surprising. There may be some issue with the Windows IPC code that's causing this. However, I'd rather fix this right away than investigate further. I'm pretty sure this patch should eliminate the hangs.
Attachment #8594078 - Flags: review?(nchen)
Comment on attachment 8594078 [details] [diff] [review]
possible fix

Review of attachment 8594078 [details] [diff] [review]:
-----------------------------------------------------------------

Good idea!
Attachment #8594078 - Flags: review?(nchen) → review+
Comment on attachment 8594078 [details] [diff] [review]
possible fix

This patch seems to have fixed the problem. I don't see any new reports since 4/21.

Approval Request Comment
[Feature/regressing bug #]: bug 1103036
[User impact if declined]: Shutdown hangs
[Describe test coverage new/current, TreeHerder]: On m-c for a while.
[Risks and why]: Patch is very low risk.
[String/UUID change made/needed]: None.
Attachment #8594078 - Flags: approval-mozilla-aurora?
Blocks: 1121145
Comment on attachment 8594078 [details] [diff] [review]
possible fix

Approved for uplift to aurora; it will be great if this helps eliminate some shutdown hangs.
Attachment #8594078 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Removing leave-open keyword from resolved bugs, per :sylvestre.
Keywords: leave-open
You need to log in before you can comment on or make changes to this bug.