shutdownhang in mozilla::layers::CompositorParent::ShutDown()

RESOLVED FIXED in Firefox 42

Status

defect
--
critical
RESOLVED FIXED
5 years ago
4 years ago

People

(Reporter: whimboo, Assigned: bas.schouten)

Tracking

(Depends on 1 bug, 4 keywords)

Trunk
mozilla43

Firefox Tracking Flags

(e10s?, firefox36+ wontfix, firefox37 wontfix, firefox38+ wontfix, firefox39+ wontfix, firefox38.0.5 wontfix, firefox40+ wontfix, firefox41+ wontfix, firefox42+ fixed, firefox43 fixed)

Details

(Whiteboard: [firefox-ui-tests][gfx-noted], crash signature)

Attachments

(1 attachment)

We see constant shutdown hangs for our Mozmill tests in mozilla::layers::CompositorParent::ShutDown(). It seems to happen mostly on Windows, especially XP. I'm currently working on reducing the tests in question, but it's a bit tricky.

Here is the crash report for the shutdown hang:
bp-50fa6da5-01ab-4ed6-ba0b-6fca22150129.

Teodor, can you please help me and check older release/beta builds of Firefox? It would be good to know when this started.
[Tracking Requested - why for this release]:
I see over 2000 crashes with this signature in the last 7 days across supported versions. All reporting this problem on shutdown.
Whiteboard: [qa-automation-blocked] → [qa-automation-blocked][mozmill]
Given that this shutdown hang and crash might be related to Flash protected mode, let's also CC Benjamin.
(In reply to Teodor Druta from comment #2)
> I think I found the regressor for this crash

I don't think that's the right one, unless you have a reproducible test case and actually tested builds and backing out patches one by one.

The shutdownhang|... signatures replaced the "RunWatchdog" signatures of bug 1103833 when bug 1104317 was solved on the crash-stats server side. In turn, the "RunWatchdog" signatures came into being when bug 1038342 was fixed by killing processes that hang for more than 60 seconds on shutdown.

So, all in all, earlier versions would hang there for a long time, while versions starting with 36 crash. Those crashes were previously reported as "RunWatchdog"; since January 21, when the fix for bug 1104317 was pushed live in Socorro production, they are reported with a "shutdownhang" signature, which actually gives better insight into what was hanging.
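To illustrate the watchdog idea described above, here is a minimal sketch of a shutdown watchdog thread. It is not Gecko's nsTerminator code: the class name `ShutdownWatchdog` and its methods are hypothetical, and the timeout is a parameter so that the sketch can run quickly (the behavior described above uses 60 seconds). A real watchdog would deliberately crash the process on timeout so that a crash report gets submitted; this sketch only records the verdict.

```cpp
#include <chrono>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical model of a shutdown watchdog; not the actual nsTerminator code.
class ShutdownWatchdog {
 public:
  explicit ShutdownWatchdog(std::chrono::milliseconds aTimeout)
      : mTimeout(aTimeout), mThread([this] { Run(); }) {}

  ~ShutdownWatchdog() {
    NotifyShutdownComplete();
    if (mThread.joinable()) {
      mThread.join();
    }
  }

  // Called by the shutdown code once it has finished cleanly.
  void NotifyShutdownComplete() {
    {
      std::lock_guard<std::mutex> lock(mMutex);
      mDone = true;
    }
    mCondVar.notify_all();
  }

  // Blocks until the watchdog thread has decided.
  // Returns true if shutdown did not complete within the timeout.
  bool WaitForVerdict() {
    if (mThread.joinable()) {
      mThread.join();
    }
    return mTimedOut;
  }

 private:
  void Run() {
    std::unique_lock<std::mutex> lock(mMutex);
    // wait_for returns the predicate's value: false means mDone was still
    // unset when the timeout elapsed, i.e. shutdown is considered hung.
    mTimedOut = !mCondVar.wait_for(lock, mTimeout, [this] { return mDone; });
  }

  const std::chrono::milliseconds mTimeout;
  std::mutex mMutex;
  std::condition_variable mCondVar;
  bool mDone = false;
  bool mTimedOut = false;
  std::thread mThread;  // Declared last so the other members exist before it starts.
};
```

In this model, a shutdown that calls `NotifyShutdownComplete()` in time yields `false` from `WaitForVerdict()`, while a shutdown that never reports completion yields `true` once the timeout expires.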
(In reply to Robert Kaiser (:kairo@mozilla.com) from comment #6)
> (In reply to Teodor Druta from comment #2)
> > I think I found the regressor for this crash
> 
> I don't think that's the right one, unless you have a reproducible test case
> and actually tested builds and backing out patches one by one.

Please read my comment 0. It clearly states that we have reproducible steps to trigger this hang. And we know that we didn't crash formerly. None of the bugs you are referring to here have any impact on the hang problem.
The thing that landed between build1 and build2 was a backout, and Flash protected mode is relatively unlikely to be related to this. I don't trust the regression range from comment 2-3.

nical, can you tell from the crash report which is the compositor thread and/or why it's failing to shut down and hanging?
Flags: needinfo?(nical.bugzilla)
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #8)
> nical, can you tell from the crash report which is the compositor thread
> and/or why it's failing to shut down and hanging?

I don't know which is the compositor thread. CompositorParent::ShutDown waits (spinning the event loop) until the compositor thread is destroyed. That is triggered by the CompositorThreadHolder being destroyed, which means that both the global sCompositorThreadHolder variable and CompositorParent's mCompositorThreadHolder must be null. The global variable was just set to null in the stack, so it looks like we haven't nulled out the CompositorParent's mCompositorThreadHolder variable. This should have happened in CompositorParent::DeferredDestroy, which is scheduled on the main thread after the compositor thread is done cleaning up its stuff. That cleanup happens in CompositorParent::RecvStop, which runs on the compositor thread and is triggered by CompositorChild::SendStop() on the main thread, which is called by CompositorChild::Destroy, which in turn is called by nsBaseWidget::DestroyCompositor.

What a happy mess :)

The first thing I'd look at is whether we lose the reference to the CompositorChild in nsBaseWidget without calling DestroyCompositor. Then I'd check whether something could have prevented any of the functions I mentioned above from being called.
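The lifetime rule described above can be sketched with standard smart pointers. This is only a model under stated assumptions, not the Gecko implementation: `CompositorThreadHolderModel`, `CompositorParentModel`, `EventQueue`, and `ShutDownModel` are hypothetical names, `std::shared_ptr` stands in for Gecko's reference counting, and the spin loop is bounded by `aMaxSpins` so that the sketch terminates where the real code would keep waiting.

```cpp
#include <deque>
#include <functional>
#include <memory>
#include <utility>

// Hypothetical stand-in for the object whose destruction ends the wait.
struct CompositorThreadHolderModel {};

// Hypothetical stand-in for a CompositorParent holding its own reference.
struct CompositorParentModel {
  std::shared_ptr<CompositorThreadHolderModel> mCompositorThreadHolder;

  // Models DeferredDestroy: drops this parent's reference on the main thread.
  void DeferredDestroy() { mCompositorThreadHolder.reset(); }
};

// Pending main-thread events, processed one per spin.
using EventQueue = std::deque<std::function<void()>>;

// Models ShutDown: drop the global reference, then spin the event loop until
// the holder is gone. Returns true if the holder was destroyed, false if the
// model gave up after aMaxSpins (the real code would hang instead).
bool ShutDownModel(std::shared_ptr<CompositorThreadHolderModel>& aGlobalHolder,
                   EventQueue& aMainThreadEvents, int aMaxSpins) {
  std::weak_ptr<CompositorThreadHolderModel> watcher = aGlobalHolder;
  aGlobalHolder.reset();  // The global reference is nulled out first.

  for (int spin = 0; spin < aMaxSpins && !watcher.expired(); ++spin) {
    if (!aMainThreadEvents.empty()) {
      std::function<void()> event = std::move(aMainThreadEvents.front());
      aMainThreadEvents.pop_front();
      event();  // For example, a queued DeferredDestroy.
    }
  }
  return watcher.expired();
}
```

If a `DeferredDestroy` call is queued, the model finishes as soon as that event runs. If nothing ever queues it, for instance because the Destroy chain was never started, the parent's reference stays alive and the wait never completes.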
Flags: needinfo?(nical.bugzilla)
Nicolas, would a full minidump be helpful for you?
Flags: needinfo?(nical.bugzilla)
(In reply to Henrik Skupin (:whimboo) from comment #10)
> Nicolas, would a full minidump be helpful for you?

I don't have time to work on this unless I sacrifice other bugs, so unless Bas or Milan want to bump the priority, assume I am not going to fix this in the short term (sorry).
Flags: needinfo?(nical.bugzilla)
Milan or Bas, could you find someone else to work on this? thanks
FYI, beta 6 gtb is today...
Flags: needinfo?(milan)
Flags: needinfo?(bas)
Safe to assume this won't get resolved by beta 6.  We don't even seem to know what it is and where it is, and we don't have a regression range we trust.
Flags: needinfo?(milan)
Duplicate of this bug: 1125643
Summary: Crash in shutdownhang | WaitForSingleObjectEx | WaitForSingleObject | PR_Wait | nsThread::ProcessNextEvent(bool, bool*) | NS_ProcessNextEvent(nsIThread*, bool) | mozilla::layers::CompositorParent::ShutDown() → shutdownhang in mozilla::layers::CompositorParent::ShutDown()
(In reply to Milan Sreckovic [:milan] from comment #13)
> Safe to assume this won't get resolved by beta 6.  We don't even seem to
> know what it is and where it is, and we don't have a regression range we
> trust.

Milan, can you please have a look at my comment 10? We could provide a full minidump here if that is of any kind of help. If not someone from us would have to spend some more time to reduce the Mozmill test even further. Please let me know if the mini dump path would work.
Flags: needinfo?(milan)
Need to clear up other beta bugs first, this is not likely to get looked at in the next couple of days.
Whiteboard: [qa-automation-blocked][mozmill] → [qa-automation-blocked][mozmill][gfx-noted]
Crash Signature: [@ shutdownhang | WaitForSingleObjectEx | WaitForSingleObject | PR_Wait | nsThread::ProcessNextEvent(bool, bool*) | NS_ProcessNextEvent(nsIThread*, bool) | mozilla::layers::CompositorParent::ShutDown()] → [@ shutdownhang | WaitForSingleObjectEx | WaitForSingleObject | PR_Wait | nsThread::ProcessNextEvent(bool, bool*) | NS_ProcessNextEvent(nsIThread*, bool) | mozilla::layers::CompositorParent::ShutDown()] [@ shutdownhang | WaitForSingleObjectEx | PR_Wait |…
Crash Signature: , bool) | mozilla::layers::CompositorParent::ShutDown()] → , bool) | mozilla::layers::CompositorParent::ShutDown()] [@ shutdownhang | WaitForSingleObjectEx | WaitForSingleObject | PR_Wait | mozilla::ReentrantMonitor::Wait(unsigned int) | nsThread::ProcessNextEvent(bool, bool*) | NS_ProcessNextEvent(nsIThread*, b…
The number of crashes has dropped with the last beta release, so I'm going to remove our blocking whiteboard entry for now.
Whiteboard: [qa-automation-blocked][mozmill][gfx-noted] → [mozmill][gfx-noted]
Since it decreased, I am going to mark it as wontfix for 36. It is not tracked for 37. Don't hesitate to submit for tracking if it spikes.
Crash Signature: , bool) | mozilla::layers::CompositorParent::ShutDown()] → , bool) | mozilla::layers::CompositorParent::ShutDown()] [@ shutdownhang | ntdll.dll@0x3c6bc]
This build hasn't crashed (on shutdown!) for me yet.  And I've done it a few times.

Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:38.0) Gecko/20100101 Firefox/38.0 ID:20150215030238 CSet: e0cb32a0b1aa

Which is a welcome change.  My last crash today was the restart for this build.  But I've shutdown three times now, and no crash.
Crash Signature: , bool) | mozilla::layers::CompositorParent::ShutDown()] [@ shutdownhang | WaitForSingleObjectEx | PR_Wait | mozilla::ReentrantMonitor::Wait(unsigned int) | nsThread::ProcessNextEvent(bool, bool*) | NS_ProcessNextEvent(nsIThread*, bool) | mozilla::layers… → , bool) | mozilla::layers::CompositorParent::ShutDown()] [@ shutdownhang | WaitForSingleObjectEx | PR_Wait | mozilla::ReentrantMonitor::Wait(unsigned int) | nsThread::ProcessNextEvent(bool, bool*) | NS_ProcessNextEvent(nsIThread*, bool) | mozilla::laye…
sorry, wrong bug.
Crash Signature: , bool) | mozilla::layers::CompositorParent::ShutDown()] [@ shutdownhang | WaitForSingleObjectEx | PR_Wait | mozilla::ReentrantMonitor::Wait(unsigned int) | nsThread::ProcessNextEvent(bool, bool*) | NS_ProcessNextEvent(nsIThread*, bool) | mozilla::dom:… → , bool) | mozilla::layers::CompositorParent::ShutDown()] [@ shutdownhang | WaitForSingleObjectEx | WaitForSingleObject | PR_Wait | nsThread::ProcessNextEvent(bool, bool*) | NS_ProcessNextEvent(nsIThread*, bool) | mozilla::layers::CompositorParent::Shut…
Tracking for the current release as it is #3.
Keywords: topcrash-win
[Tracking Requested - why for this release]:

Combined signatures put this in the top 5 on Nightly (Fx40).
Crash Signature: , bool) | mozilla::layers::CompositorParent::ShutDown() ] [@ shutdownhang | ntdll.dll@0x3c6bc] → , bool) | mozilla::layers::CompositorParent::ShutDown() ] [@ shutdownhang | ntdll.dll@0x3c6bc] [@ shutdownhang | WaitForSingleObjectEx | PR_Wait | mozilla::ReentrantMonitor::Wait(unsigned int) | nsThread::ProcessNextEvent(bool, bool*) | NS_ProcessNext…
Blocks: 1121145
Tracking as it is still one of the most important issues.
Hello, I am marking this as major since it's a known crash and happens a lot.

PS: I am on Windows 7 and crashed yesterday.
Severity: critical → major
OS: Windows XP → All
Version: 36 Branch → Trunk
Can we also get this into the release notes so that people are aware of it?
Flags: needinfo?(milan)
Sorry, I accidentally cleared the needinfo for Milan; I've added it back.
Flags: needinfo?(milan)
Please don't downgrade to major. Also, we know this happens quite a bit; otherwise it would not have a topcrash flag and be marked tracking for a number of releases, so there is no need to flag more than that. The real issue is that we need to find out what's really going on there. I think the only way you can really get this fixed faster is to provide us with a scenario that can reliably reproduce the issue. So far we haven't heard of any such case.
Severity: major → critical
As for other releases it is too late for 38.
QA Whiteboard: [@ shutdownhang | WaitForSingleObjectEx | PR_Wait | mozilla::ReentrantMonitor::Wait(unsigned int) | nsThread::ProcessNextEvent(bool, bool*) | NS_ProcessNextEvent(nsIThread*, bool) | nsThread::Shutdown() ]
Tracy, I don't think the nsThread::Shutdown you added to the whiteboard is the same thing.
heh, didn't mean to add it to whiteboard.


I compared stack signature with another in this bug and they were identical.
Crash Signature: , bool*) | NS_ProcessNextEvent(nsIThread*, bool) | mozilla::dom::ContentParent::Observe(nsISupports*, char const*, wchar_t cons ] → , bool*) | NS_ProcessNextEvent(nsIThread*, bool) | mozilla::dom::ContentParent::Observe(nsISupports*, char const*, wchar_t cons ] [@ shutdownhang | WaitForSingleObjectEx | PR_Wait | mozilla::ReentrantMonitor::Wait(unsigned int) | nsThread::ProcessNextEve…
QA Whiteboard: [@ shutdownhang | WaitForSingleObjectEx | PR_Wait | mozilla::ReentrantMonitor::Wait(unsigned int) | nsThread::ProcessNextEvent(bool, bool*) | NS_ProcessNextEvent(nsIThread*, bool) | nsThread::Shutdown() ]
This hang is happening 100% of the time for me when I exit the browser with a page open (so there's a content process). This is with a DMD* Linux debug build, running over VNC, with an OSX client. I haven't tried with a non-DMD build yet.

* https://developer.mozilla.org/en-US/docs/Mozilla/Performance/DMD
It looks like my hang is some kind of fallout from my own patches, so I don't know how useful my being able to reproduce it is, but here's the stack on the compositor thread in case it is useful:

#0  0x00007f70c0cf98bf in pthread_cond_wait@@GLIBC_2.3.2 () from /usr/lib/libpthread.so.0
(gdb) bt
#0  0x00007f70c0cf98bf in pthread_cond_wait@@GLIBC_2.3.2 () from /usr/lib/libpthread.so.0
#1  0x00007f70ba16c191 in ConditionVariable::Wait (this=0x7f70a5acdbbc) at /home/amccreight/mc/ipc/chromium/src/base/condition_variable_posix.cc:40
#2  0x00007f70ba18474e in base::WaitableEvent::TimedWait (this=0x7f70a924e1d8, max_time=...) at /home/amccreight/mc/ipc/chromium/src/base/waitable_event_posix.cc:195
#3  0x00007f70ba18491b in base::WaitableEvent::Wait (this=0x7f70a5acdbbc) at /home/amccreight/mc/ipc/chromium/src/base/waitable_event_posix.cc:201
#4  0x00007f70ba17444f in base::MessagePumpDefault::Run (this=0x7f70a924e1c0, delegate=0x7f70a5acdd48) at /home/amccreight/mc/ipc/chromium/src/base/message_pump_default.cc:60
#5  0x00007f70ba1735d4 in RunHandler (this=0x7f70a5acdbbc) at /home/amccreight/mc/ipc/chromium/src/base/message_loop.cc:226
#6  MessageLoop::Run (this=0x7f70a5acdbbc) at /home/amccreight/mc/ipc/chromium/src/base/message_loop.cc:200
#7  0x00007f70ba17e519 in base::Thread::ThreadMain (this=0x7f70a924e160) at /home/amccreight/mc/ipc/chromium/src/base/thread.cc:170
#8  0x00007f70ba17e76f in ThreadFunc (closure=0x7f70a5acdbbc) at /home/amccreight/mc/ipc/chromium/src/base/platform_thread_posix.cc:39

That does not look useful but you never know.
Bill and I looked at this a little bit, but we weren't able to figure much out. Something is going wrong in the sequence of messages back and forth between the parent and child process, so the child doesn't shut down, but the ContentParent either gets far enough or doesn't notice the failure so it removes the xpcom-shutdown observer before shutdown, and thus never kills the nonresponsive child.

My steps to reproduce are something like:
1. Make a debug build, and also add ac_add_options --enable-dmd
2. Start the browser with DMD on like this: ./mach run --dmd --mode=live --sample-below=1
3. Open a random webpage (I used http://news.ycombinator.com/ but maybe it doesn't matter). Let it load at least a little bit.
4. Exit.

That hangs around 95% of the time for me.
The DMD changes there shouldn't affect anything except the performance, making it a good amount slower, so presumably there's some kind of race condition.
Lee, if you follow instructions in comment 33, can you reproduce?
Flags: needinfo?(milan) → needinfo?(lsalzman)
(In reply to Milan Sreckovic [:milan] from comment #35)
> Lee, if you follow instructions in comment 33, can you reproduce?

I am having trouble reproducing this following those instructions. I'm not seeing any hangs with DMD enabled.
Flags: needinfo?(lsalzman)
This could still squeak into 39 but we are heading into beta 4 now.  

It looks like people have had a crack at fixing it several times and Andrew has a good possible way to reproduce a related crash. 

Milan I realize there may be other higher priority issues; we should come back to this and not let it drop though. I'll keep tracking this for 39 for the moment.
Flags: needinfo?(milan)
It kind of feels like the issue I'm seeing is e10s-specific, and thus unrelated to whatever is happening on release. But it is hard to know.
Agreed.  I'll assign to :nical just in case he can get to it; we do want a strong 39.
Assignee: nobody → nical.bugzilla
Flags: needinfo?(milan)
bp-feaa716f-75c4-45a2-b146-7b5d72150619
	19/06/2015	11:17 a.m.

Crashing Thread
Frame 	Module 	Signature 	Source
0 	xul.dll 	mozilla::`anonymous namespace'::RunWatchdog(void*) 	toolkit/components/terminator/nsTerminator.cpp
1 	nss3.dll 	PR_NativeRunThread 	nsprpub/pr/src/threads/combined/pruthr.c
2 	nss3.dll 	pr_root 	nsprpub/pr/src/md/windows/w95thred.c
3 	msvcr120.dll 	_callthreadstartex 	f:\dd\vctools\crt\crtw32\startup\threadex.c:376
4 	msvcr120.dll 	_threadstartex 	f:\dd\vctools\crt\crtw32\startup\threadex.c:354
5 	kernel32.dll 	BaseThreadInitThunk 	
6 	ntdll.dll 	RtlUserThreadStart 	
7 	kernel32.dll 	BasepReportFault 	
8 	kernel32.dll 	BasepReportFault
(In reply to alex_mayorga from comment #40)
> bp-feaa716f-75c4-45a2-b146-7b5d72150619
> 	19/06/2015	11:17 a.m.
> 
> Crashing Thread
> Frame 	Module 	Signature 	Source
> 0 	xul.dll 	mozilla::`anonymous namespace'::RunWatchdog(void*) 
> toolkit/components/terminator/nsTerminator.cpp
> 1 	nss3.dll 	PR_NativeRunThread 	nsprpub/pr/src/threads/combined/pruthr.c
> 2 	nss3.dll 	pr_root 	nsprpub/pr/src/md/windows/w95thred.c
> 3 	msvcr120.dll 	_callthreadstartex 
> f:\dd\vctools\crt\crtw32\startup\threadex.c:376
> 4 	msvcr120.dll 	_threadstartex 
> f:\dd\vctools\crt\crtw32\startup\threadex.c:354
> 5 	kernel32.dll 	BaseThreadInitThunk 	
> 6 	ntdll.dll 	RtlUserThreadStart 	
> 7 	kernel32.dll 	BasepReportFault 	
> 8 	kernel32.dll 	BasepReportFault

Alex, you had this as a startup crash? Weird, the crash report itself shows a session of more than an hour. Does the startup crash persist? Does safe mode work?
(In reply to alex_mayorga from comment #40)
> bp-feaa716f-75c4-45a2-b146-7b5d72150619
> 	19/06/2015	11:17 a.m.
> 
> Crashing Thread
> Frame 	Module 	Signature 	Source
> 0 	xul.dll 	mozilla::`anonymous namespace'::RunWatchdog(void*) 
> toolkit/components/terminator/nsTerminator.cpp
> 1 	nss3.dll 	PR_NativeRunThread 	nsprpub/pr/src/threads/combined/pruthr.c
> 2 	nss3.dll 	pr_root 	nsprpub/pr/src/md/windows/w95thred.c
> 3 	msvcr120.dll 	_callthreadstartex 
> f:\dd\vctools\crt\crtw32\startup\threadex.c:376
> 4 	msvcr120.dll 	_threadstartex 
> f:\dd\vctools\crt\crtw32\startup\threadex.c:354
> 5 	kernel32.dll 	BaseThreadInitThunk 	
> 6 	ntdll.dll 	RtlUserThreadStart 	
> 7 	kernel32.dll 	BasepReportFault 	
> 8 	kernel32.dll 	BasepReportFault

Seeing as how the crash might be related to your Intel VGA driver (8.15.10.2696, from 2013), could you try updating to the current version (15.33.36.64.4226, June 2015 via https://goo.gl/np2lji) and see if it still crashes?
Alex, did this just start happening for you with the nightly?  Because if it did, and you have time, before you update the driver, it would be beyond awesome if you could run mozregression (https://developer.mozilla.org/en-US/docs/Mozilla/Debugging/Existing_Tools#MozRegression) to help us find out exactly when it started happening.
¡Hola Milan!

It seems to still be a thing on Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:41.0) Gecko/20100101 Firefox/41.0 ID:20150625030202 CSet: 0b2f5e8b7be5

bp-31608b8c-69d0-49ac-8085-4c6002150625

I'm not entirely sure of STR though...

Is this the tab crash that happens when I shut down the computer without closing Nightly first?
Flags: needinfo?(milan)
(In reply to alex_mayorga from comment #44)
> ¡Hola Milan!
> 
> It seems to still be a thing on Mozilla/5.0 (Windows NT 6.1; Win64; x64;
> rv:41.0) Gecko/20100101 Firefox/41.0 ID:20150625030202 CSet: 0b2f5e8b7be5
> 
> bp-31608b8c-69d0-49ac-8085-4c6002150625
> 
> I'm not entirely sure of STR though...
> 
> Is this the tab crash that happens when I shutdown the computer without
> closing Nightly first?

Did you try updating your Intel VGA driver as I mentioned in comment 42?
(In reply to Arthur K. from comment #45)
> (In reply to alex_mayorga from comment #44)
> > ¡Hola Milan!
> > 
> > It seems to still be a thing on Mozilla/5.0 (Windows NT 6.1; Win64; x64;
> > rv:41.0) Gecko/20100101 Firefox/41.0 ID:20150625030202 CSet: 0b2f5e8b7be5
> > 
> > bp-31608b8c-69d0-49ac-8085-4c6002150625
> > 
> > I'm not entirely sure of STR though...
> > 
> > Is this the tab crash that happens when I shutdown the computer without
> > closing Nightly first?
> 
> Did you try updating your Intel VGA driver as I mentioned in comment 42?

¡Hola Arthur!

I just tried updating with win64_153336.exe and got the following very uninformative message:

"Error
This computer does not meet the minimum requirements for installing the software.
<OK>"

=(
(In reply to alex_mayorga from comment #46)
> (In reply to Arthur K. from comment #45)
> > (In reply to alex_mayorga from comment #44)
> > > ¡Hola Milan!
> > > 
> > > It seems to still be a thing on Mozilla/5.0 (Windows NT 6.1; Win64; x64;
> > > rv:41.0) Gecko/20100101 Firefox/41.0 ID:20150625030202 CSet: 0b2f5e8b7be5
> > > 
> > > bp-31608b8c-69d0-49ac-8085-4c6002150625
> > > 
> > > I'm not entirely sure of STR though...
> > > 
> > > Is this the tab crash that happens when I shutdown the computer without
> > > closing Nightly first?
> > 
> > Did you try updating your Intel VGA driver as I mentioned in comment 42?
> 
> ¡Hola Arthur!
> 
> I just tried updating with win64_153336.exe and got the following very
> uninformative message:
> 
> "Error
> This computer does not meet the minimum requirements for installing the
> software.
> <OK>"
> 
> =(

Hmm, based on your crash report DeviceID and what your old driver said, it should have been right. Can you please grab GPU-Z 0.8.4 and tell me what it says in the Name area? Also, what does it say in Display Adapter under Device Manager?
(In reply to alex_mayorga from comment #46)
> (In reply to Arthur K. from comment #45)
> > (In reply to alex_mayorga from comment #44)
> > > ¡Hola Milan!
> > > 
> > > It seems to still be a thing on Mozilla/5.0 (Windows NT 6.1; Win64; x64;
> > > rv:41.0) Gecko/20100101 Firefox/41.0 ID:20150625030202 CSet: 0b2f5e8b7be5
> > > 
> > > bp-31608b8c-69d0-49ac-8085-4c6002150625
> > > 
> > > I'm not entirely sure of STR though...
> > > 
> > > Is this the tab crash that happens when I shutdown the computer without
> > > closing Nightly first?
> > 
> > Did you try updating your Intel VGA driver as I mentioned in comment 42?
> 
> ¡Hola Arthur!
> 
> I just tried updating with win64_153336.exe and got the following very
> uninformative message:
> 
> "Error
> This computer does not meet the minimum requirements for installing the
> software.
> <OK>"
> 
> =(

Well, originally I thought this was an HD4000 but it seems to be an HD3000. Try these drivers please: https://goo.gl/2SCvaR
¡Hola Arthur!

win64_152824.exe did work

I disobeyed the installer and left Nightly running during the update.

This resulted in the following crash:

Report ID 	Date Submitted
bp-4388b86f-bd1b-4444-a05c-be5682150625
	25/06/2015	04:04 p.m.

That is seemingly https://bugzilla.mozilla.org/show_bug.cgi?id=1133623
Wontfixing for 39. This recent activity appears to be on 41.
I pressed the "Update Nightly" in the "hamburger-menu". This caused this crash:
https://crash-stats.mozilla.com/report/index/7ea1fd3e-9f3f-41ad-b7e4-e24842150710
[Tracking Requested - why for this release]: See above comment, it happened for 42.0a1.
Note I had e10s on.
The original report talked about mozmill tests - did we ever run into a problem with the debug build?
Flags: needinfo?(milan) → needinfo?(hskupin)
Tracking for 42 as it is a top crash... but I am unhappy that we have been tracking it since 36...
Wontfix for 40 as I don't think we will have a fix in time for this release...
(In reply to Sylvestre Ledru [:sylvestre] PTO => July 10th from comment #54)
> Tracking for 42 as it is a top crash... but I am unhappy that we have been
> tracking it since 36...
> Wontfix for 40 as I don't think we will have a fix in time for this
> release...

Agreed, but we don't know how to fix it.
For me the behavior varies a lot. It stopped happening maybe yesterday, but it was hanging/crashing for several weeks before that. Prior to those weeks it hadn't happened for a long time, but there was also an earlier period when it did.
Current signatures on aurora:

#3 shutdownhang | WaitForSingleObjectEx | WaitForSingleObject | PR_Wait | nsThread::ProcessNextEvent(bool, bool*) | NS_ProcessNextEvent(nsIThread*, bool) | mozilla::layers::CompositorParent::ShutDown()

#6 shutdownhang | WaitForSingleObjectEx | WaitForSingleObject | PR_Wait | mozilla::ReentrantMonitor::Wait(unsigned int) | nsThread::ProcessNextEvent(bool, bool*) | NS_ProcessNextEvent(nsIThread*, bool) | mozilla::layers::CompositorParent::ShutDown()

#12 shutdownhang | WaitForSingleObjectEx | PR_Wait | mozilla::ReentrantMonitor::Wait(unsigned int) | nsThread::ProcessNextEvent(bool, bool*) | NS_ProcessNextEvent(nsIThread*, bool) | mozilla::layers::CompositorParent::ShutDown()
Those were from the browser process list.
Nical, could there be like an image bridge that's still holding onto the CompositorThreadHolder?
Flags: needinfo?(nical.bugzilla)
I managed to reproduce this problem locally with fairly high reliability. I added a printf that shows the address of sCompositorThreadHolder before we null it out, and here's where it gets interesting. When we're in the hung state, here's the data on sCompositorThreadHolder:

-		(CompositorThreadHolder*)0x10998660	0x10998660 {mRefCnt={mValue={...} } mHelperForMainThreadDestruction={...} mCompositorThread=0x109d8520 {...} }	mozilla::layers::CompositorThreadHolder *
-		mRefCnt	{mValue={...} }	mozilla::ThreadSafeAutoRefCnt
-		mValue	{...}	mozilla::Atomic<unsigned int,2,void>
-		mozilla::detail::AtomicBaseIncDec<unsigned int,2>	{...}	mozilla::detail::AtomicBaseIncDec<unsigned int,2>
-		mozilla::detail::AtomicBase<unsigned int,2>	{mValue={...} }	mozilla::detail::AtomicBase<unsigned int,2>
-		mValue	{...}	std::atomic<unsigned int>
-		std::atomic_uint	{_My_val=2 }	std::atomic_uint
		_My_val	2	unsigned long
		mHelperForMainThreadDestruction	{...}	mozilla::layers::HelperForMainThreadDestruction
+		mCompositorThread	0x109d8520 {startup_data_=0x004fbc30 {options={message_loop_type=??? stack_size=??? transient_hang_timeout=...} ...} ...}	base::Thread * const

In other words, there are two references to the CompositorThreadHolder lying around somewhere that are not being cleaned up. Given that, it is no surprise that this hang occurs.
I've confirmed that at this point the ImageBridgeParent singleton is properly destroyed.
Talked about it with Bas on skype.
Flags: needinfo?(nical.bugzilla)
So I've concluded this is due to an ImageBridgeParent still being alive.

Here's the stack that creates the offending ImageBridge:

>	xul.dll!mozilla::layers::ImageBridgeParent::ImageBridgeParent(MessageLoop * aLoop, IPC::Channel * aTransport, unsigned long aChildProcessId) Line 74	C++
 	xul.dll!mozilla::layers::ImageBridgeParent::Create(IPC::Channel * aTransport, unsigned long aChildProcessId) Line 196	C++
 	xul.dll!mozilla::dom::ContentParent::AllocPImageBridgeParent(IPC::Channel * aTransport, unsigned long aOtherProcess) Line 3163	C++
 	xul.dll!mozilla::dom::PContentParent::OnMessageReceived(const IPC::Message & msg__) Line 5657	C++
 	xul.dll!mozilla::ipc::MessageChannel::DispatchAsyncMessage(const IPC::Message & aMsg) Line 1373	C++
 	xul.dll!mozilla::ipc::MessageChannel::DispatchMessageW(const IPC::Message & aMsg) Line 1294	C++
 	xul.dll!mozilla::ipc::MessageChannel::OnMaybeDequeueOne() Line 1266	C++
 	xul.dll!DispatchToMethod<mozilla::ipc::MessageChannel,bool (__thiscall mozilla::ipc::MessageChannel::*)(void)>(mozilla::ipc::MessageChannel * obj, bool (void) * method, const Tuple0 & arg) Line 388	C++
 	xul.dll!RunnableMethod<mozilla::ipc::MessageChannel,bool (__thiscall mozilla::ipc::MessageChannel::*)(void),Tuple0>::Run() Line 310	C++
 	xul.dll!mozilla::ipc::MessageChannel::RefCountedTask::Run() Line 456	C++
 	xul.dll!mozilla::ipc::MessageChannel::DequeueTask::Run() Line 473	C++
 	xul.dll!MessageLoop::RunTask(Task * task) Line 365	C++

This ImageBridge is essentially created as a child of a content parent being created.
We've concluded this image bridge is the result of a content process that's created for background thumbnail generation. It appears that the ContentParent for this process and an image bridge are not being shut down properly. There are a couple of message sending errors in the console. It's possible this has something to do with those messages being dropped because we're shutting down, but this is hard to know for sure.
Note that some of these hangs would have been crashes prior to bug 1175521.  That bug only wallpapered over the crash, but see https://bugzilla.mozilla.org/show_bug.cgi?id=1175521#c7 in particular for perhaps something that can help in this bug.
Duplicate of this bug: 1174741
(In reply to Milan Sreckovic [:milan] from comment #53)
> The original report talked about mozmill tests - did we ever run into a
> problem with the debug build?

We are no longer running Mozmill tests. They have been partly replaced with the new Marionette tests, and the coverage is still low. So we haven't seen this particular problem yet. Sorry.
Flags: needinfo?(hskupin)
I vote we close this bug as incomplete and reopen if the issue returns. Any objections?
This is reproducible by Bas. So I don't see why we should close it.
(In reply to Henrik Skupin (:whimboo) from comment #69)
> This is reproducible by Bas. So I don't see why we should close it.

My understanding was that this was only reproducible under Mozmill and if we're no longer running Mozmill then the crash is basically irrelevant. If we have a way to reproduce it outside Mozmill then I agree that we should continue to investigate. However that is not clear to me in reading this bug report.
(In reply to Anthony Hughes, QA Mentor (:ashughes) from comment #70)
> My understanding was that this was only reproducible under Mozmill and if
> we're no longer running Mozmill then the crash is basically irrelevant. If
> we have a way to reproduce it outside Mozmill then I agree that we should
> continue to investigate. However that is not clear to me in reading this bug
> report.

The signatures in this bug appear quite a bit in crash data from "the wild", so this is surely not irrelevant. I don't know which paths we found for reproducing "in-house", though.
tracking-e10s: --- → ?
(In reply to Anthony Hughes, QA Mentor (:ashughes) from comment #70)
> (In reply to Henrik Skupin (:whimboo) from comment #69)
> > This is reproducible by Bas. So I don't see why we should close it.
> 
> My understanding was that this was only reproducible under Mozmill and if
> we're no longer running Mozmill then the crash is basically irrelevant. If
> we have a way to reproduce it outside Mozmill then I agree that we should
> continue to investigate. However that is not clear to me in reading this bug
> report.

No, I can reproduce this simply by making my content process die fairly early in its creation.
Bas, since you can reproduce this, can you look for a fix, or work with :nical on it?
Flags: needinfo?(bas)
This is currently the #1 crash on aurora 42 so it's definitely happening in the wild.
¡Hola!

This just bit me today on Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:43.0) Gecko/20100101 Firefox/43.0 ID:20150902030229 CSet: fb720c90eb49590ba55bf52a8a4826ffff9f528b

bp-4ec5b87f-e076-4260-9349-60e452150902
	02/09/2015	12:50 p.m.

Crashes while restarting to update Nightly.

It was particularly bad as about:sessionrestore was wiped out clean so there was data loss =(
Combined signatures put this at the #1 crash on Nightly (Fx43).
The one from comment 75 is a startup crash? That escalated quickly.
Hrm, I stopped being able to reproduce this on nightly.
Flags: needinfo?(bas)
FF 41 beta 7, x86 build
Add to crash signature:
[@ shutdownhang | WaitForSingleObjectEx | WaitForSingleObject | PR_Wait | mozilla::ReentrantMonitor::Wait(unsigned int) | nsThread::ProcessNextEvent(bool, bool*) | NS_ProcessNextEvent(nsIThread*, bool) | mozilla::MediaShutdownManager::Shutdown() ]

My Crash Report
https://crash-stats.mozilla.com/report/index/ba869ad3-7266-48d6-96d5-b10a22150905

My bug report:
https://bugzilla.mozilla.org/show_bug.cgi?id=1201639
(In reply to mkdante381 from comment #80)
> FF 41 beta7 x86build
> Add to crash signature:
> [@ shutdownhang | WaitForSingleObjectEx | WaitForSingleObject | PR_Wait |
> mozilla::ReentrantMonitor::Wait(unsigned int) |
> nsThread::ProcessNextEvent(bool, bool*) | NS_ProcessNextEvent(nsIThread*,
> bool) | mozilla::MediaShutdownManager::Shutdown() ]
> 
> My Crash Report
> https://crash-stats.mozilla.com/report/index/ba869ad3-7266-48d6-96d5-b10a22150905
> 
> My bug report:
> https://bugzilla.mozilla.org/show_bug.cgi?id=1201639

So your crash report shows you're using "AdapterDriverVersion: 15.201.1151.0" which is a beta (15.8) driver. How does it behave using the stable 15.7.1 driver?
As mentioned earlier on this bug, the Mozmill tests are dead. But interestingly, we now hit the same crash on Windows machines with our Firefox UI Update tests. See bug 1202375 for details. The tests are run in VMs with default software installed. There is no specific graphics driver present besides the one Windows comes with.

Crash report: d5e84fc1-60bb-425b-9e21-5e48a2150907

Those crashes happen multiple times a day on different boxes, and I think they might somewhat be reproducible.
Whiteboard: [mozmill][gfx-noted] → [firefox-ui-tests][gfx-noted]
Bas, is there anything you can do here? Maybe work with Henrik to figure out a repro?
Bas, assigning to you, I need :nical to look at something else for the next week or so.
Assignee: nical.bugzilla → bas
Flags: needinfo?(bas)
Most likely this is still the same issue as it was before when I -could- reproduce it, i.e. an ImageBridgeParent not being cleaned up the way it should after a content process crash.

I can't reproduce this anymore but it would still at the very least be helpful to know if in the cases where we are seeing this a content process crash has occurred.
Flags: needinfo?(bas)
Too late to fix this in 41.
I started being able to reproduce this again. This happens for me currently -most- of the time when the content process crashes early in creation.

It seems that on a 'successful' content process crash we get notified of a channel error and our PImageBridgeParent actor and its subtree get successfully destroyed.

On an 'unsuccessful' content process crash (i.e. which triggers this bug for me), we never get OnChannelError called on the image bridge parent and as a result it and its subtree just leak.
So, this occurs when things die before the channel ever gets connected. I'm going to suggest a patch that will fix this shutdown hang, but it's very important to realize that in the current situation, when that happens (i.e. a channel never gets connected because a content process dies early), we leak any actors whose channels have not been connected yet. CC'ing Brad to make sure the e10s folks are aware of this happening.
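The following is a minimal sketch of the failure mode described above and of the idea in the patch title ("Only acquire a hold on the compositor thread once the channel is connected"). It is not the actual Gecko code: ThreadHolder, BridgeParent, HoldPolicy and both Simulate functions are hypothetical names invented for illustration, and the sketch assumes that shutdown can only finish once every hold has been released and that OnChannelError is only ever delivered for a channel that was connected.

```cpp
#include <cassert>

// Hypothetical stand-in for whatever keeps the compositor thread alive.
// Assumption: shutdown can only complete once no holds remain.
class ThreadHolder {
public:
  void Acquire() { ++mHolds; }
  void Release() { assert(mHolds > 0); --mHolds; }
  bool CanShutDown() const { return mHolds == 0; }
private:
  int mHolds = 0;
};

// When the actor takes its hold on the thread.
enum class HoldPolicy { OnConstruction, OnConnect };

// Hypothetical actor, loosely modeled on the image bridge parent in this bug.
class BridgeParent {
public:
  BridgeParent(ThreadHolder& aHolder, HoldPolicy aPolicy)
    : mHolder(aHolder), mPolicy(aPolicy) {
    if (mPolicy == HoldPolicy::OnConstruction) {
      TakeHold();
    }
  }

  // Called once the channel to the content process is connected.
  void OnChannelConnected() {
    if (mPolicy == HoldPolicy::OnConnect) {
      TakeHold();
    }
  }

  // Assumption: only delivered for a channel that was connected.
  void OnChannelError() { DropHold(); }

private:
  void TakeHold() {
    if (!mHasHold) { mHolder.Acquire(); mHasHold = true; }
  }
  void DropHold() {
    if (mHasHold) { mHolder.Release(); mHasHold = false; }
  }

  ThreadHolder& mHolder;
  HoldPolicy mPolicy;
  bool mHasHold = false;
};

// Content process dies before the channel connects: no callbacks fire.
// Returns whether shutdown could still complete afterwards.
bool SimulateEarlyCrash(HoldPolicy aPolicy) {
  ThreadHolder holder;
  BridgeParent actor(holder, aPolicy);
  return holder.CanShutDown();
}

// Content process dies after the channel connected: the error is delivered.
bool SimulateLateCrash(HoldPolicy aPolicy) {
  ThreadHolder holder;
  BridgeParent actor(holder, aPolicy);
  actor.OnChannelConnected();
  actor.OnChannelError();
  return holder.CanShutDown();
}
```

In this model, SimulateEarlyCrash(HoldPolicy::OnConstruction) leaves a hold outstanding, which corresponds to the hang, while HoldPolicy::OnConnect never takes the hold in the first place. The sketch only covers the hold on the thread; as noted above, the actor itself would still leak in the early-crash case.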
Flags: needinfo?(blassey.bugs)
Status: NEW → ASSIGNED
Attachment #8659871 - Flags: review?(nical.bugzilla) → review+
Well that's odd... that should not really be possible... hmmmmm.
Flags: needinfo?(bas)
So.. the try run of this is clear (https://hg.mozilla.org/try/rev/c7c5b82af460) and I looked at a lot of code and can't find out what could possibly cause this. So I can only conclude it might be related to clobbering or something, but it seems odd... I'm going to push this again and will stick around to see what happens. Very sorry to the sheriff if I break things again :).
Flags: needinfo?(blassey.bugs)
https://hg.mozilla.org/mozilla-central/rev/5be65754c0d0
Status: ASSIGNED → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla43
(In reply to mkdante381 from comment #80)
> FF 41 beta7 x86build
> Add to crash signature:
> [@ shutdownhang | WaitForSingleObjectEx | WaitForSingleObject | PR_Wait |
> mozilla::ReentrantMonitor::Wait(unsigned int) |
> nsThread::ProcessNextEvent(bool, bool*) | NS_ProcessNextEvent(nsIThread*,
> bool) | mozilla::MediaShutdownManager::Shutdown() ]
> 
> My Crash Report
> https://crash-stats.mozilla.com/report/index/ba869ad3-7266-48d6-96d5-
> b10a22150905
> 
> My bug report:
> https://bugzilla.mozilla.org/show_bug.cgi?id=1201639

Catalyst 15.9 was released for Linux, so I would guess it'll be released for Windows soon as well. If you're still on the 15.8 beta, it might help in your case if the problem is driver related.
Comment on attachment 8659871 [details] [diff] [review]
Only acquire a hold on the compositor thread once the channel is connected

Approval Request Comment
[Feature/regressing bug #]: OMTC
[User impact if declined]: Shutdown hangs if child process crashes early
[Describe test coverage new/current, TreeHerder]: Nightly
[Risks and why]: Low, merely delaying
[String/UUID change made/needed]: None
Attachment #8659871 - Flags: approval-mozilla-aurora?
Comment on attachment 8659871 [details] [diff] [review]
Only acquire a hold on the compositor thread once the channel is connected

Fix a shutdown hang, taking it!
Attachment #8659871 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
This bug is now fixed for me when I am on a site with an HTML5 video and quit Firefox from the Australis hamburger menu. The remaining problem is mainly with Adobe Flash.

Go to YouTube (you must force Flash there), another site with Flash video, or a site with non-video Flash content.
Open a random video or Flash page.
Pause the video.
Quit Firefox.
Sometimes Firefox crashes with a signature starting with "[@ shutdownhang |". Earlier this also happened on sites with only HTML5 videos.

The problem is with "plugin-container.exe". On my computer with an AMD R9 270X and the Catalyst 15.8 beta, Adobe Flash is unstable. Sometimes the Flash process gets suspended and then Firefox hangs, so I have to kill the "plugin-container.exe" process. After quitting Firefox from the Australis menu, Firefox sometimes fails to kill the "plugin-container.exe" process and then crashes. Playing HTML5 videos is not a problem now, although earlier the crash also occurred when quitting Firefox on a site with HTML5 videos.

I now use this Greasemonkey script: https://greasyfork.org/pl/scripts/5433-force-flash-wmode and I added a preference to Firefox:
new > string
Preference name: plugins.force.wmode
Value: direct

Flash is more stable now. The problem is mainly with acceleration in Adobe Flash.
(In reply to mkdante381 from comment #99)
> Now This bug is fixed, when I am on site with HTML5 movie and Shutdown
> Firefox from Hamburger Australis Menu. Problem is still mainly with Adobe
> Flash
> 
> GO to Youtube(You must force Flash on YT) or other site with Flash, or site
> with content Flash(no movies).
> Go to random movie or site with Flash
> Pause movie
> Shutdown FF
> Sometime Firefox crash with signature "[@ shutdownhang |", erlier also on
> site with only HTML5 movies 
> 
> This is problem with "plugin-container.exe". On my computer with AMD R9 270X
> and Catalyst 15.8beta is problem with Adobe Flash. Flash is unstable.
> Sometimes Flash process is suspended. Then Firefox is hanging itself. I must
> kill process "plugin-container.exe". After shutdown Firefox from Australis
> Menu, Firefox sometime not kill process "plugin-coantainer.exe" and FF
> crash. No problem with play HTML5 movies, but this problem was earlier, when
> shutdown FF on site with html 5 movies.
> 
> Now I use script for Greasemonkey:
> https://greasyfork.org/pl/scripts/5433-force-flash-wmode and I added
> preference to Firefox:
> new > string
> Preference name: plugins.force.wmode
> Value: direct 
> 
> Now Flash is more stable. Problem is mainly with acceleration Adobe Flash

So, I will state again, can you repro with the stable 15.7 Catalyst driver? Maybe the 15.8 Catalyst beta driver has a problem.
No, with the latest Catalyst 15.7.1 it is even worse. The 15.8 beta fixes the bug "[424127] The Firefox browser may crash while opening multiple tabs (2 or more)" (source: http://support.amd.com/en-us/kb-articles/Pages/latest-catalyst-windows-beta.aspx), but it does not fix Flash acceleration.
(In reply to mkdante381 from comment #101)
> Nope with latest Catalyst 15.7.1 is even worse. 15.8beta fix bug "[424127]
> The Firefox browser may crash while opening multiple tabs (2 or more)"
> source:
> http://support.amd.com/en-us/kb-articles/Pages/latest-catalyst-windows-beta.
> aspx but not fix acceleration flash

I don't see them yet released on AMD's site but Catalyst 15.8 seems to have been gleaned by the folks at Station Drivers (http://goo.gl/qRK54c). Give them a try.
No longer blocks: 1207979
Depends on: 1207979
(In reply to Arthur K. from comment #103)
> (In reply to mkdante381 from comment #101)
> > Nope with latest Catalyst 15.7.1 is even worse. 15.8beta fix bug "[424127]
> > The Firefox browser may crash while opening multiple tabs (2 or more)"
> > source:
> > http://support.amd.com/en-us/kb-articles/Pages/latest-catalyst-windows-beta.
> > aspx but not fix acceleration flash
> 
> I don't see them yet released on AMD's site but Catalyst 15.8 seems to have
> been gleaned by the folks at Station Drivers (http://goo.gl/qRK54c). Give
> them a try.

I use latest beta...
http://support.amd.com/en-us/kb-articles/Pages/latest-catalyst-windows-beta.aspx