Closed Bug 924622 Opened 6 years ago Closed 6 years ago

Intermittent Android Shutdown "ABORT: mismatched CxxStackFrame ctor/dtors" [@ mozalloc_abort(char const*)]

Categories

(Core :: Graphics: Layers, defect)

defect
Not set

Tracking

()

RESOLVED FIXED
mozilla33
tracking-b2g backlog
Tracking Status
e10s + ---
firefox29 --- wontfix
firefox30 --- wontfix
firefox31 --- fixed
firefox32 --- fixed
firefox33 --- fixed
firefox-esr24 --- unaffected
b2g-v1.3 --- wontfix
b2g-v1.3T --- wontfix
b2g-v1.4 --- wontfix
b2g-v2.0 --- fixed
b2g-v2.1 --- fixed

People

(Reporter: RyanVM, Assigned: bjacob)

References

Details

(Keywords: assertion, crash, intermittent-failure)

Crash Data

Attachments

(4 files, 9 obsolete files)

945 bytes, patch
Details | Diff | Splinter Review
24.34 KB, patch
sotaro
: review+
Details | Diff | Splinter Review
25.15 KB, patch
Details | Diff | Splinter Review
4.20 KB, patch
nical
: review+
mattwoodrow
: review+
Details | Diff | Splinter Review
Don't know if those profiling errors are related, but they look bad.

https://tbpl.mozilla.org/php/getParsedLog.php?id=28830191&tree=Mozilla-Inbound

Android 2.2 Armv6 Tegra mozilla-inbound opt test jsreftest-1 on 2013-10-08 08:07:31 PDT for push d1cd57876e48
slave: tegra-067

REFTEST FINISHED: Slowest test took 22989ms (http://10.250.49.158:30067/jsreftest/tests/jsreftest.html?test=js1_5/GC/regress-203278-2.js)
REFTEST INFO | Result summary:
REFTEST INFO | Successful: 49119 (49119 pass, 0 load only)
REFTEST INFO | Unexpected: 0 (0 unexpected fail, 0 unexpected pass, 0 unexpected asserts, 0 unexpected fixed asserts, 0 failed load, 0 exception)
REFTEST INFO | Known problems: 1104 (20 known fail, 0 known asserts, 952 random, 118 skipped, 14 slow)
REFTEST INFO | Total canvas count = 0
REFTEST TEST-START | Shutdown
INFO | automation.py | Application ran for: 0:14:27.382977
INFO | zombiecheck | Reading PID log: /tmp/tmp5MkjLRpidlog
mozcrash INFO | Downloading symbols from: http://ftp.mozilla.org/pub/mozilla.org/mobile/tinderbox-builds/mozilla-inbound-android-armv6/1381238513/fennec-27.0a1.en-US.android-arm-armv6.crashreporter-symbols.zip
/data/anr/traces.txt not found
PROCESS-CRASH | Shutdown | application crashed [@ mozalloc_abort(char const*)]
Crash dump filename: /tmp/tmpokHVgJ/643deeb3-a3b4-be66-34160430-4a921b7e.dmp
Operating system: Android
                  0.0.0 Linux 2.6.32.9-00002-gd8084dc-dirty #1 SMP PREEMPT Wed Feb 2 11:32:06 PST 2011 armv7l nvidia/harmony/harmony/harmony:2.2/FRF91/20110202.102810:eng/test-keys
CPU: arm
     2 CPUs

Crash reason:  SIGSEGV
Crash address: 0x0

Thread 16 (crashed)
 0  libmozalloc.so!mozalloc_abort(char const*) [mozalloc_abort.cpp:d1cd57876e48 : 30 + 0x8]
     r4 = 0x00000000    r5 = 0xffffffff    r6 = 0xafd424a8    r7 = 0xafd42550
     r8 = 0x554332d9    r9 = 0x554337fc   r10 = 0x00000000    fp = 0x5548f5bc
     sp = 0x5b3ff840    lr = 0x4498ea20    pc = 0x4498ea28
    Found by: given as instruction pointer in context
 1  libxul.so!NS_DebugBreak [nsDebugImpl.cpp:d1cd57876e48 : 430 + 0x6]
     r4 = 0x00000000    r5 = 0xffffffff    r6 = 0xafd424a8    r7 = 0xafd42550
     r8 = 0x554332d9    r9 = 0x554337fc   r10 = 0x00000000    fp = 0x5548f5bc
     sp = 0x5b3ff848    pc = 0x54a19344
    Found by: call frame info
 2  libxul.so!mozilla::ipc::MessageChannel::DebugAbort(char const*, int, char const*, char const*, bool) const [MessageChannel.cpp:d1cd57876e48 : 1539 + 0x1e]
     r4 = 0x4ed02860    r5 = 0x554337fc    r6 = 0x5543353f    r7 = 0x55433544
     r8 = 0x553f900a    r9 = 0x55492451   r10 = 0x5b3ffdf0    fp = 0x55492451
     sp = 0x5b3ffc78    pc = 0x5468d99c
    Found by: call frame info
 3  libxul.so!mozilla::ipc::MessageChannel::~MessageChannel() [MessageChannel.cpp:d1cd57876e48 : 79 + 0x2a]
     r4 = 0x57c7a730    r5 = 0x00000000    r6 = 0x55754db4    r7 = 0x00000010
     r8 = 0x5af54080    r9 = 0x55492451   r10 = 0x5b3ffdf0    fp = 0x55492451
     sp = 0x5b3ffcc8    pc = 0x5468dff8
    Found by: call frame info
 4  libxul.so!mozilla::layers::PImageBridgeParent::~PImageBridgeParent() [PImageBridgeParent.cpp:d1cd57876e48 : 82 + 0x6]
     r4 = 0x57c7a700    r5 = 0x00000000    r6 = 0x55754db4    r7 = 0x00000010
     r8 = 0x5af54080    r9 = 0x55492451   r10 = 0x5b3ffdf0    fp = 0x55492451
     sp = 0x5b3ffcd8    pc = 0x546ed2f8
    Found by: call frame info
 5  libxul.so!mozilla::layers::ImageBridgeParent::~ImageBridgeParent() [ImageBridgeParent.cpp:d1cd57876e48 : 57 + 0x2]
     r4 = 0x57c7a700    r5 = 0x00000000    r6 = 0x55754db4    r7 = 0x00000010
     r8 = 0x5af54080    r9 = 0x55492451   r10 = 0x5b3ffdf0    fp = 0x55492451
     sp = 0x5b3ffce0    pc = 0x54aa3bdc

10-08 08:28:46.291 E/Profiler( 1776): BPUnw: [6 total] thread_register_for_profiling(me=0x701db8, stacktop=0x5deffe4b)
10-08 08:28:46.291 E/Profiler( 1776): BPUnw: [5 total] thread_unregister_for_profiling(me=0x701db8) 
10-08 08:28:46.551 E/Profiler( 1776): BPUnw: [4 total] thread_unregister_for_profiling(me=0x2f50f0) 
10-08 08:28:46.561 I/Gecko   ( 1776): ###!!! [MessageChannel][Parent][../../../ipc/glue/MessageChannel.cpp:79] Assertion (mCxxStackFrames.empty()) failed.  mismatched CxxStackFrame ctor/dtors 
10-08 08:28:46.561 I/Gecko   ( 1776):   MessageChannel 'backtrace':
10-08 08:28:46.561 I/Gecko   ( 1776):   [(0) in sync PImageBridge::Msg_Stop(actor=2147483647) ]
10-08 08:28:46.561 I/Gecko   ( 1776):   remote Interrupt stack guess: 0
10-08 08:28:46.561 I/Gecko   ( 1776):   deferred stack size: 0
10-08 08:28:46.561 I/Gecko   ( 1776):   out-of-turn Interrupt replies stack size: 0
10-08 08:28:46.561 I/Gecko   ( 1776):   Pending queue size: 0, front to back:
10-08 08:28:46.561 I/Gecko   ( 1776): ###!!! ABORT: mismatched CxxStackFrame ctor/dtors: file ../../../ipc/glue/MessageChannel.cpp, line 1539
10-08 08:28:46.561 E/Gecko   ( 1776): mozalloc_abort: ###!!! ABORT: mismatched CxxStackFrame ctor/dtors: file ../../../ipc/glue/MessageChannel.cpp, line 1539
10-08 08:28:47.541 I/DEBUG   (  937): *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***
10-08 08:28:47.541 I/DEBUG   (  937): Build fingerprint: 'nvidia/harmony/harmony/harmony:2.2/FRF91/20110202.102810:eng/test-keys'
10-08 08:28:47.541 I/DEBUG   (  937): pid: 1776, tid: 1834  >>> org.mozilla.fennec <<<
10-08 08:28:47.541 I/DEBUG   (  937): signal 11 (SIGSEGV), fault addr 00000000

10-08 08:28:46.291 E/Profiler( 1776): BPUnw: [6 total] thread_register_for_profiling(me=0x701db8, stacktop=0x5deffe4b)
10-08 08:28:46.291 E/Profiler( 1776): BPUnw: [5 total] thread_unregister_for_profiling(me=0x701db8) 
10-08 08:28:46.551 E/Profiler( 1776): BPUnw: [4 total] thread_unregister_for_profiling(me=0x2f50f0) 
10-08 08:28:46.561 E/Gecko   ( 1776): mozalloc_abort: ###!!! ABORT: mismatched CxxStackFrame ctor/dtors: file ../../../ipc/glue/MessageChannel.cpp, line 1539
10-08 08:28:52.591 W/InputManagerService( 1020): Got RemoteException sending setActive(false) notification to pid 1776 uid 10033
Something is very wrong here!

The gecko main thread (thread 6) is at this stack:

 3  libnss3.so!PR_Wait [ptsynch.c:d1cd57876e48 : 582 + 0x6]
     r4 = 0x4ed397c0    r5 = 0x0021a740    r6 = 0x00000001    sp = 0x4f3198c8
     pc = 0x51a1b8a8
    Found by: call frame info
 4  libxul.so!mozilla::ReentrantMonitor::Wait(unsigned int) [ReentrantMonitor.h:d1cd57876e48 : 89 + 0x6]
     r4 = 0x4f3198e4    r5 = 0x4f3198ec    r6 = 0x4edb4250    sp = 0x4f3198d8
     pc = 0x54156f68
    Found by: call frame info
 5  libxul.so!mozilla::layers::ImageBridgeChild::DestroyBridge() [ImageBridgeChild.cpp:d1cd57876e48 : 632 + 0xa]
     r4 = 0x4f3198e4    r5 = 0x4f3198ec    r6 = 0x4edb4250    sp = 0x4f3198e0
     pc = 0x54aa30d8
    Found by: call frame info
 6  libxul.so!mozilla::layers::ImageBridgeChild::ShutDown() [ImageBridgeChild.cpp:d1cd57876e48 : 583 + 0x2]
     r4 = 0x5599138c    r5 = 0x4ed72550    r6 = 0x00000037    r7 = 0x000000dc
     r8 = 0x55754db4    sp = 0x4f319918    pc = 0x54aa3150
    Found by: call frame info
 7  libxul.so!gfxPlatform::Shutdown() [gfxPlatform.cpp:d1cd57876e48 : 485 + 0x2]
     r4 = 0x5599138c    r5 = 0x4ed72550    r6 = 0x00000037    r7 = 0x000000dc
     r8 = 0x55754db4    sp = 0x4f319920    pc = 0x54a6febc
    Found by: call frame info
 8  libxul.so!nsComponentManagerImpl::KnownModule::~KnownModule() [nsComponentManager.h:d1cd57876e48 : 224 + 0x2]
     r4 = 0x4edab0e0    r5 = 0x4ed72550    r6 = 0x00000037    r7 = 0x000000dc
     r8 = 0x55754db4    sp = 0x4f319928    pc = 0x54a07800
    Found by: call frame info
 9  libxul.so!nsAutoPtr<nsComponentManagerImpl::KnownModule>::~nsAutoPtr() [nsAutoPtr.h:d1cd57876e48 : 77 + 0x6]
     r4 = 0x4edab0e0    r5 = 0x4ed72550    r6 = 0x00000037    r7 = 0x000000dc
     r8 = 0x55754db4    sp = 0x4f319930    pc = 0x54a0ac64
    Found by: call frame info
10  libxul.so!nsTArray_Impl<nsAutoPtr<nsComponentManagerImpl::KnownModule>, nsTArrayInfallibleAllocator>::Clear() [nsTArray.h:d1cd57876e48 : 534 + 0x6]
     r4 = 0x4ed722ac    r5 = 0x4ed72550    r6 = 0x00000037    r7 = 0x000000dc
     r8 = 0x55754db4    sp = 0x4f319940    pc = 0x54a0aca0
    Found by: call frame info
11  libxul.so!nsComponentManagerImpl::Shutdown() [nsComponentManager.cpp:d1cd57876e48 : 808 + 0x6]
     r4 = 0x4ed72200    r5 = 0x549dd98c    r6 = 0x55754db4    r7 = 0x4f319984
     r8 = 0x4edab7c0   r10 = 0x5551de64    sp = 0x4f319960    pc = 0x54a0add4
    Found by: call frame info
13  libxul.so!ScopedXPCOMStartup::~ScopedXPCOMStartup() [nsAppRunner.cpp:d1cd57876e48 : 1130 + 0x6]
     r4 = 0x4ed3a14c    r5 = 0x55754db4    r6 = 0x55980e5c    r7 = 0x00000000
     r8 = 0x55754db4    r9 = 0x4497ef84   r10 = 0x00000000    fp = 0x00225120
     sp = 0x4f3199b0    pc = 0x53bc964c
    Found by: call frame info
14  libxul.so!XREMain::XRE_main(int, char**, nsXREAppData const*) [nsAppRunner.cpp:d1cd57876e48 : 3961 + 0x6]
     r4 = 0x4f3199f4    r5 = 0x00000000    r6 = 0x4ed3a14c    r7 = 0x00000000
     r8 = 0x55754db4    r9 = 0x4497ef84   r10 = 0x00000000    fp = 0x00225120
     sp = 0x4f3199c8    pc = 0x53bcd3fc
    Found by: call frame info

So we are *really* late in the shutdown process, after all threads should have been stopped, and we're now calling the XPCOM module destructors. The main thread should not be accepting any new events at this point.

According to the comments in gfxPlatform::Shutdown() though, we're trying to shut down IPDL protocols at this stage by calling ImageBridgeChild::ShutDown and CompositorParent::ShutDown. This looks obviously unsafe; the code dates back to bug 763234. This needs to be happening much earlier in the shutdown process.

The "mismatched CxxStackFrame" assertion probably means that we're calling a method on a dead MessageChannel, which is a member of the ImageBridgeParent which is likely dead also. But given the refcounting of ImageBridgeParent, it's not clear to me why it would be dead.
OS: Android → All
Hardware: ARM → All
Summary: Intermittent Android PROCESS-CRASH | Shutdown | application crashed [@ mozalloc_abort(char const*)] after "ABORT: mismatched CxxStackFrame ctor/dtors" → Intermittent PROCESS-CRASH | Shutdown | application crashed [@ mozalloc_abort(char const*)] after "ABORT: mismatched CxxStackFrame ctor/dtors"
See Also: → 948600
Duplicate of this bug: 959080
Summary: Intermittent PROCESS-CRASH | Shutdown | application crashed [@ mozalloc_abort(char const*)] after "ABORT: mismatched CxxStackFrame ctor/dtors" → Intermittent Android PROCESS-CRASH | application crashed [@ mozalloc_abort(char const*)] after "ABORT: mismatched CxxStackFrame ctor/dtors"
Duplicate of this bug: 931014
As Cosmin filed as bug 959080 we have also seen this at least once in our Mozmill tests. A crash report can be found here: bp-ebeafe13-adce-4637-9c55-7742c2140113. This crash affects Firefox 29.0a1 down to 27.0 builds from 20131209204824. Looks like a regression since the 26.0 release.

Cosmin, please check where the crash happened and if it is somewhat reproducible.
Crash Signature: [@ mozalloc_abort(char const*) | Abort | NS_DebugBreak | mozilla::ipc::MessageChannel::DebugAbort(char const*, int, char const*, char const*, bool) const ]
Flags: needinfo?(cosmin.malutan)
Keywords: regression
This crashed during a mozilla-aurora_remote testrun, with Firefox 28.0a2 locale de, on mm-osx-107-3 node, build id 20140111004004.

I will run testruns with those configurations, but I doubt it will fail.
Flags: needinfo?(cosmin.malutan)
This assertion usually means that somebody is operating on a dead object. In this case I strongly suspect that we're dual-deleting an ImageBridgeParent object.

See also bug 959080 where this is seen on mac as well, so it seems to affect every platform with OMTC enabled.

Here is the full relevant stack; we're on the image-bridge child thread:

18:59:56     INFO -   0  libmozalloc.so!mozalloc_abort(char const*) [mozalloc_abort.cpp:747121b2bb50 : 30 + 0x4]
18:59:56     INFO -   1  libxul.so!NS_DebugBreak [nsDebugImpl.cpp:747121b2bb50 : 427 + 0x5]
18:59:56     INFO -   2  libxul.so!mozilla::ipc::MessageChannel::DebugAbort(char const*, int, char const*, char const*, bool) const [MessageChannel.cpp:747121b2bb50 : 1703 + 0x13]
18:59:56     INFO -   3  libxul.so!mozilla::ipc::MessageChannel::~MessageChannel() [MessageChannel.cpp:747121b2bb50 : 
18:59:56     INFO -   4  libxul.so!mozilla::layers::PImageBridgeParent::~PImageBridgeParent() [PImageBridgeParent.cpp:747121b2bb50 : 89 + 0x7]
18:59:56     INFO -   5  libxul.so!mozilla::layers::ImageBridgeParent::~ImageBridgeParent() [ImageBridgeParent.cpp:747121b2bb50 : 59 + 0x3]
18:59:56     INFO -   6  libxul.so!mozilla::layers::ImageBridgeParent::~ImageBridgeParent() [ImageBridgeParent.cpp:747121b2bb50 : 59 + 0x3]
18:59:56     INFO -   7  libxul.so!mozilla::detail::RefCounted<mozilla::layers::ISurfaceAllocator, (mozilla::detail::RefCountAtomicity)0>::Release() const [RefPtr.h:747121b2bb50 : 82 + 0xb]
18:59:56     INFO -   8  libxul.so!mozilla::layers::DeleteImageBridgeSync [StaticPtr.h:747121b2bb50 : 158 + 0x7]
... event loop

The main thread is *very* late in shutdown, we're already destroying modules:

18:59:56     INFO -   5  libxul.so!mozilla::layers::ImageBridgeChild::DestroyBridge() [ImageBridgeChild.cpp:747121b2bb50 : 589 + 0x9]
18:59:56     INFO -   6  libxul.so!mozilla::layers::ImageBridgeChild::ShutDown() [ImageBridgeChild.cpp:747121b2bb50 : 540 + 0x3]
18:59:56     INFO -   7  libxul.so!gfxPlatform::Shutdown() [gfxPlatform.cpp:747121b2bb50 : 570 + 0x3]
18:59:56     INFO -   8  libxul.so!nsComponentManagerImpl::KnownModule::~KnownModule() [nsComponentManager.h:747121b2bb50 : 226 + 0x1]
18:59:56     INFO -   9  libxul.so!nsAutoPtr<nsComponentManagerImpl::KnownModule>::~nsAutoPtr() [nsAutoPtr.h:747121b2bb50 : 78 + 0x5]
18:59:56     INFO -  10  libxul.so!nsTArray_Impl<nsAutoPtr<nsComponentManagerImpl::KnownModule>, nsTArrayInfallibleAllocator>::Clear() [nsTArray.h:747121b2bb50 : 531 + 0x3]
18:59:56     INFO -  11  libxul.so!nsComponentManagerImpl::Shutdown() [nsComponentManager.cpp:747121b2bb50 : 790 + 0x7]
18:59:56     INFO -  12  libxul.so!mozilla::ShutdownXPCOM(nsIServiceManager*) [nsXPComInit.cpp:747121b2bb50 : 808 + 0x3]
Crash Signature: [@ mozalloc_abort(char const*) | Abort | NS_DebugBreak | mozilla::ipc::MessageChannel::DebugAbort(char const*, int, char const*, char const*, bool) const ]
Component: IPC → Graphics: Layers
Keywords: regression
This signature is currently #8 on Firefox 27 beta on mac, but many of the crashes are an unrelated crash in plugin code, so it's hard to know what the real severity of this crash is. Shutdown crashes tend to be less important in general.
Crash Signature: [@ mozalloc_abort(char const*) | Abort | NS_DebugBreak | mozilla::ipc::MessageChannel::DebugAbort(char const*, int, char const*, char const*, bool) const ]
Flags: needinfo?(nical.bugzilla)
To answer Benjamin's question from bug 959080, the crash with Mozmill happens without browser.remote.tabs enabled. It's a default opt build.

(In reply to Cosmin Malutan from comment #34)
> This crashed during a mozilla-aurora_remote testrun, with Firefox 28.0a2
> locale de, on mm-osx-107-3 node, build id 20140111004004.

This is not that helpful. Important here would be the test we are failing in.
I know it doesn't help a lot but the job is gone, it didn't failed again when I tried to reproduce it.
If it will crash again I will return with more info.
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #2)
> 
> So we are *really* late in the shutdown process, after all threads should
> have been stopped, and we're now calling the XPCOM module destructors. The
> main thread should not be accepting any new events at this point.

I don't think this is supposed to post anything in the main thread's event loop.
It does post stuff in the ImageBridgeChild thread which is a chromium-style thread, which is about to be destroyed after everything is shut down.
Do you mean that the mutex or something esle here is implicitly waking up XPCOM stuff that is already dead? Or is it that we can't touch a chromium thread at this point?

> 
> According to the comments in gfxPlatform::Shutdown() though, we're trying to
> shut down IPDL protocols at this stage by calling ImageBridgeChild::ShutDown
> and CompositorParent::ShutDown. This looks obviously unsafe; the code dates
> back to bug 763234. This needs to be happening much earlier in the shutdown
> process.

I don't know much about the shutdown process, but the only requirements here is that ImageBridge's shutdown happens before CompositorParent's (because shutting down CompositorParent destroys the thread in which CompositorParent lives). And that it happens when we close the browser (We create and destroy ImageBridge along with the browser).

> 
> The "mismatched CxxStackFrame" assertion probably means that we're calling a
> method on a dead MessageChannel, which is a member of the ImageBridgeParent
> which is likely dead also. But given the refcounting of ImageBridgeParent,
> it's not clear to me why it would be dead.

ImageBridgeParent's refcount is about to go to zero (it is happening in ImageBridgeParent::DeferredDestroy)

I wonder what is closing the channel.
Flags: needinfo?(nical.bugzilla)
> Do you mean that the mutex or something esle here is implicitly waking up
> XPCOM stuff that is already dead?

No.

 Or is it that we can't touch a chromium
> thread at this point?

This. No secondary threads are supposed to exist at this point; all threads are supposed to be joined or abandoned during the xpcom-shutdown-threads phase.
Summary: Intermittent Android PROCESS-CRASH | application crashed [@ mozalloc_abort(char const*)] after "ABORT: mismatched CxxStackFrame ctor/dtors" → Intermittent PROCESS-CRASH | application crashed [@ mozalloc_abort(char const*)] after "ABORT: mismatched CxxStackFrame ctor/dtors"
https://tbpl.mozilla.org/php/getParsedLog.php?id=34561608&tree=Mozilla-Central

This continues to fail frequently. Do we care?
Flags: needinfo?(nical.bugzilla)
Flags: needinfo?(benjamin)
Duplicate of this bug: 970640
Yes, it's not test-only; it's showing up in the wild as topcrash bug 970100. This is important.
Blocks: 970100
Flags: needinfo?(benjamin)
I think we do care but I am a bit swamped. From the discussion on this bug, my understanding is that gfxPlatform::ShutDown is already too late to tear down ImageBridge. It might just be a matter of finding another place to call ImageBridgeParent::ShutDown() that is during the browser's shutdown, prior to gfxPlatform::ShutDown() and at a point where we can still use a chromium thread's event loop. I don't know where that would be, though.
Flags: needinfo?(nical.bugzilla)
https://tbpl.mozilla.org/php/getParsedLog.php?id=35433705&tree=Fx-Team

(In reply to Nicolas Silva [:nical] from comment #70)
> I think we do care but I am a bit swamped. From the discussion on this bug,
> my understanding is that gfxPlatform::ShutDown is already too late to tear
> down ImageBridge. It might just be a matter of finding another place to call
> ImageBridgeParent::ShutDown() that is during the browser's shutdown, prior
> to gfxPlatform::ShutDown() and at a point where we can still use a chromium
> thread's event loop. I don't know where that would be, though.

Any suggestions who might? :)
Flags: needinfo?(nical.bugzilla)
Blocks: 936226
Duplicate of this bug: 983141
Milan, can we please find an owner for this?
Flags: needinfo?(milan)
Benoit, let's design a shutdown sequence to take care of this (see comment 70) and fix it for 31.
Assignee: nobody → bjacob
Flags: needinfo?(nical.bugzilla)
Flags: needinfo?(milan)
https://tbpl.mozilla.org/php/getParsedLog.php?id=36407544&tree=Fx-Team

12:33:03     INFO -  ###!!! [MessageChannel][Parent][/builds/slave/fx-team-osx64-0000000000000000/build/ipc/glue/MessageChannel.cpp:229] Assertion (mCxxStackFrames.empty()) failed.  mismatched CxxStackFrame ctor/dtors
12:33:03     INFO -    MessageChannel 'backtrace':
12:33:03     INFO -    [(0) in sync PImageBridge::Msg_Stop(actor=2147483647) ]
12:33:03     INFO -    remote Interrupt stack guess: 0
12:33:03     INFO -    deferred stack size: 0
12:33:03     INFO -    out-of-turn Interrupt replies stack size: 0
12:33:03     INFO -    Pending queue size: 0, front to back:
12:33:03     INFO -  [919] ###!!! ABORT: mismatched CxxStackFrame ctor/dtors: file /builds/slave/fx-team-osx64-0000000000000000/build/ipc/glue/MessageChannel.cpp, line 1722
12:33:03     INFO -  [919] ###!!! ABORT: mismatched CxxStackFrame ctor/dtors: file /builds/slave/fx-team-osx64-0000000000000000/build/ipc/glue/MessageChannel.cpp, line 1722
We have a lot of orange in e10s caused by bug 989567, which shows up as follows:

https://tbpl.mozilla.org/php/getParsedLog.php?id=37399660&tree=Holly#error0

Maybe 1/2 debug builds tests this problem. I'm told that we need to fix this bug in order to fix that. When will someone be able to work on this?
I hit this just now on OS X 10.9. I was debugging another intermittent orange at the time. I had chaos mode enabled, and I maxed out my CPU cores using the following command:

> parallel yes {1} '>' /dev/null ::: A B C D E F G H I J K L M N O P

I'm posting this in the hopes that it will help someone reproduce this bug.
Just hit it again after running the test I'm debugging several more times. It seems pretty easy to reproduce this way.
Milan, was there a specific reason why this was assigned to me? This looks like a better fit for nical, who already commented above, and who wrote the patch on bug 763234 that is referred to in comment 2 above.

Un-assigning myself to reflect current reality --- feel free to reassign to me.
Assignee: bjacob → nobody
Flags: needinfo?(milan)
Will assign somebody by tomorrow.
Flags: needinfo?(milan)
Assignee: nobody → nical.bugzilla
I am new to XPCOM stuff so I took a rather naive approach, don't hesitate to tell me if It's completely wrong, there might be dependencies that I am not seeing.

looking at stack traces and at nsXPComInit.cpp, I see that gfxPlatform::Shutdown happens in
(nsComponentManagerImpl::gComponentManager)->Shutdown();
which takes place after nsThreadManager::get()->Shutdown(); In ShutdownXPCOM.
That's pretty much what Benjamin points out in comment 2.

So I added an Observer on NS_XPCOM_WILL_SHUTDOWN_OBSERVER_ID which is triggered before nsThreadManager shutdown, and moved the gfx ipdl related stuff there.
Attachment #8404598 - Flags: review?(benjamin)
Comment on attachment 8404598 [details] [diff] [review]
Shut down gfx IPDL protocols before the shutdown of XPCOM threads

I recommend using "xpcom-shutdown" instead of "xpcom-will-shutdown".

You also don't need mGfxIpcShutdownObserver at all here: the observer service holds a strong ref so holding another ref is unnecessary.

Also note that bholley just changed the ordering of module destructors here in bug 913138; that's not sufficient for this bug but you should be aware of it.
Attachment #8404598 - Flags: review?(benjamin) → review+
try push with a bunch of m4 jobs on osx https://tbpl.mozilla.org/?tree=Try&rev=2bcdeb0c90f4
Blocks: 986113
This blocks a 1.4 blocker (bug 986113).
blocking-b2g: --- → 1.4?
I pushed this patch to holly and I think it still has some issues. I see crashes like this:

https://tbpl.mozilla.org/php/getParsedLog.php?id=37591939&tree=Holly#error0

It looks like maybe an issue where code is running after the image bridge is shut down but it still expects the image bridge to be alive?

There's also a stack where it looks like we're still trying to do IPC too late:

https://tbpl.mozilla.org/php/getParsedLog.php?id=37590588&full=1&branch=holly#error0
Attached patch sleepSplinter Review
Oh, I should mention that holly is used for e10s testing.

I can reproduce the second issue by applying this patch on top of the main one in this bug. Then start a browser, open a new e10s window from the file menu (assuming OMTC is enabled), and quit. I see a crash every time on shutdown.
(In reply to Bill McCloskey (:billm) from comment #193)
> I pushed this patch to holly and I think it still has some issues. I see
> crashes like this:
> 
> https://tbpl.mozilla.org/php/getParsedLog.php?id=37591939&tree=Holly#error0
> 
> It looks like maybe an issue where code is running after the image bridge is
> shut down but it still expects the image bridge to be alive?
> 
> There's also a stack where it looks like we're still trying to do IPC too
> late:
> 
> https://tbpl.mozilla.org/php/getParsedLog.
> php?id=37590588&full=1&branch=holly#error0

Now its media that is shutting down after ImageBridge which is a problem.
Same thing for Widget. 

I have a patch that moves the shutdown of Media and widget to NS_XPCOM_WILL_SHUTDOWN_OBSERVER_ID.

Now I am also hitting a problem with resources that need ImageBridge to get deallocated and that are cycle collected in nsCycleCollector_shutdown which happens way after the shutdown of XPCOM threads.

I suppose it won't be possible to shutdown the cycle collector sooner, but perhaps we could trigger a cycle collection between NS_XPCOM_WILL_SHUTDOWN_OBSERVER_ID and NS_XPCOM_SHUTDOWN_OBSERVER_ID?

Chris, is it ok to move the shutdown of Media to NS_XPCOM_WILL_SHUTDOWN_OBSERVER_ID?
Flags: needinfo?(cpearce)
This patch seems to fix it for me. It does the following things:
* Have Gfx ipc shut down at NS_XPCOM_SHUTDOWN_OBSERVER_ID so that it happens before XPCOM's thread shutdown
* Move Media's shut down to NS_XPCOM_WILL_SHUTDOWN_OBSERVER_ID so that it gets destroyed before gfx's ipc.
* Move Widget's shut down to NS_XPCOM_WILL_SHUTDOWN_OBSERVER_ID so that it gets destroyed before gfx's ipc.
* Make it possible for ImageContainer to be cycle collected after ImageBridgeChild's destruction without crashing by adding some checks in ImageBridge. If this happens, the gfx IPDL actors already have cleaned up the memory so nothing gets leaked.

I asked a lot of reviewers because this fiddles with several modules: Benjamin for the nsXPCOmInit stuff in general, Chris for moving media's shutdown, Sotaro for the ImageBridgeChild stuff and Matt for Widget.
Attachment #8405225 - Attachment is obsolete: true
Attachment #8405252 - Flags: review?(sotaro.ikeda.g)
Attachment #8405252 - Flags: review?(matt.woodrow)
Attachment #8405252 - Flags: review?(cpearce)
Attachment #8405252 - Flags: review?(benjamin)
(In reply to :Ehsan Akhgari (lagging on bugmail, needinfo? me!) from comment #192)
> This blocks a 1.4 blocker (bug 986113).

I don't see how an intermittent failure in a debug test would stop us from shipping a phone.  If there is an underlying problem, let's open a bug for it with an STR.  Using 1.4+ to mean "this is important to work on now" is wrong.
(In reply to Milan Sreckovic [:milan] from comment #205)
> (In reply to :Ehsan Akhgari (lagging on bugmail, needinfo? me!) from comment
> #192)
> > This blocks a 1.4 blocker (bug 986113).
> 
> I don't see how an intermittent failure in a debug test would stop us from
> shipping a phone.  If there is an underlying problem, let's open a bug for
> it with an STR.  Using 1.4+ to mean "this is important to work on now" is
> wrong.

Please see bug 986113 comment 26.  As far as I understand, while this bug is intermittently hit on our testing infrastructure, it's 100% reproducible on e10s and maybe on b2g too.  Bug 986113 basically is us detecting several pending tasks during shutdown, and this is one of them while there are others to fix too (please see the dependency list of that bug.)
Blocks a blocker
blocking-b2g: 1.4? → 1.4+
Comment on attachment 8405252 [details] [diff] [review]
Shut down gfx IPDL protocols after Media and Widget, and before the shutdown of XPCOM threads

diff --git a/gfx/layers/ipc/ImageBridgeChild.h b/gfx/layers/ipc/ImageBridgeChild.h

+ * check whether it is in the IBC thread and dispatching itself in the IBC thread
+ * if is is not is dangerous if we don't also check that ImageBridge is created.

Something's wrong with this sentence.

I'm a little concerned with this in general. What exactly is the dependency of media and widget on the gfx shutdown sequence? Rather than expressing this dependency by moving things around in will-shutdown/shutdown/shutdown-threads, can we explicitly call from XPCOM shutdown into the gfx shutdown code at the correct time (after xpcom-shutdown, before xpcom-shutdown-threads)?

I continue to resist the urge to do a full-blown dependency system for startup and shutdown; I'd prefer to make as the shutdown sequence explicitly-ordered if that won't cause other problems.
(In reply to Preeti Raghunath(:Preeti) from comment #207)
> Blocks a blocker

Not on b2g, see below.


(In reply to :Ehsan Akhgari (lagging on bugmail, needinfo? me!) from comment #206)
> ...
> 
> Please see bug 986113 comment 26.  As far as I understand, while this bug is
> intermittently hit on our testing infrastructure, it's 100% reproducible on
> e10s and maybe on b2g too.  Bug 986113 basically is us detecting several
> pending tasks during shutdown, and this is one of them while there are
> others to fix too (please see the dependency list of that bug.)

"... and maybe on b2g too" and we are making it a blocker for b2g 1.4 release?

What is the STR for reproducing this on a b2g device?  100% or otherwise?
blocking-b2g: 1.4+ → 1.4?
(In reply to Milan Sreckovic [:milan] from comment #209)
> (In reply to :Ehsan Akhgari (lagging on bugmail, needinfo? me!) from comment
> #206)
> > ...
> > 
> > Please see bug 986113 comment 26.  As far as I understand, while this bug is
> > intermittently hit on our testing infrastructure, it's 100% reproducible on
> > e10s and maybe on b2g too.  Bug 986113 basically is us detecting several
> > pending tasks during shutdown, and this is one of them while there are
> > others to fix too (please see the dependency list of that bug.)
> 
> "... and maybe on b2g too" and we are making it a blocker for b2g 1.4
> release?
> 
> What is the STR for reproducing this on a b2g device?  100% or otherwise?

I don't have the answer to this question (and I'm just echoing what Bill said anyway, so needinfo-ing him).

Note that due to the nature of bug 986113, we need to fix a number of bugs and gradually make our way towards making this assertion not fire.  In an ideal world, we'd know exactly which bugs those are, but we don't have any way of knowing that.
Flags: needinfo?(wmccloskey)
Mass tracking-e10s flag change. Filter bugmail on "2be0fcce-e36a-4e2c-aa80-0e3d33eb5406".
tracking-e10s: --- → +
I don't know of any way to trigger this on builds that we ship to users. Bug 986113 is about an assertion, so it only triggers on debug builds. Gregor marked it as blocking, presumably because it's a priority to get debug builds tested on tinderbox. It's pretty easy to make the work_queue_.empty() assertion trip in debug builds. I think if you apply the patch I've posted here and then kill the parent during the sleep call, we'll get the work_queue_.empty() pretty consistently.
Flags: needinfo?(wmccloskey)
Nical, I pushed the new patch to holly.
https://tbpl.mozilla.org/?tree=Holly&rev=d41bfe60b03d
I still see a lot of problems in debug runs during shutdown.
Comment on attachment 8405252 [details] [diff] [review]
Shut down gfx IPDL protocols after Media and Widget, and before the shutdown of XPCOM threads

Review of attachment 8405252 [details] [diff] [review]:
-----------------------------------------------------------------

I think you're better off having a single xpcom-shutdown observer that is responsible for explicitly shutting down these interdependent things in the correct order. I bet there's others we'll discover in future, maybe like the WebRTC media code.

i.e. a single observer that calls MediaShutdownManager::Shutdown(), then CompositorParent::ShutDown(), and then, ImageBridgeChild::ShutDown().
Attachment #8405252 - Flags: review?(cpearce) → review-
(In reply to Nicolas Silva [:nical] from comment #198)
> Chris, is it ok to move the shutdown of Media to
> NS_XPCOM_WILL_SHUTDOWN_OBSERVER_ID?

I don't think this is a problem per se, but as I said above, I think explicitly codifying the shutdown dependency order in a single xpcom-shutdown observer is the way most robust solution here.
Flags: needinfo?(cpearce)
(In reply to Chris Pearce (:cpearce) from comment #225)
> Comment on attachment 8405252 [details] [diff] [review]
> Shut down gfx IPDL protocols after Media and Widget, and before the shutdown
> of XPCOM threads
> 
> Review of attachment 8405252 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> I think you're better off having a single xpcom-shutdown observer that is
> responsible for explicitly shutting down these interdependent things in the
> correct order. I bet there's others we'll discover in future, maybe like the
> WebRTC media code.
> 
> i.e. a single observer that calls MediaShutdownManager::Shutdown(), then
> CompositorParent::ShutDown(), and then, ImageBridgeChild::ShutDown().

Sounds good, where would this observer be? in xpcom/?
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #208)
> I'm a little concerned with this in general. What exactly is the dependency
> of media and widget on the gfx shutdown sequence?

Among other things, Widget owns the LayerManager and the GLContext that is used by the compositor. when the nsBaseWidget object is destroyed there is a lot of gfx/ipc stuff going on to go through all of the layers resources and cleanup things that have a GL resource. This really needs to happen before we loose the ability to use ImageBridge/CompositorParent. Note that destroying a widget is not always part of gecko's shutdown, it also happens any time you close a window.

Media holds strong references to resources allocated by the gfx IPDL protocols (in order to be able to share the video frames with the compositor without making copies), and more generally talks directly to the ImageBridge (to send video frames to the compositor, allocate them, destroy them, etc.).

> Rather than expressing
> this dependency by moving things around in
> will-shutdown/shutdown/shutdown-threads, can we explicitly call from XPCOM
> shutdown into the gfx shutdown code at the correct time (after
> xpcom-shutdown, before xpcom-shutdown-threads)?
> 
> I continue to resist the urge to do a full-blown dependency system for
> startup and shutdown; I'd prefer to make as the shutdown sequence
> explicitly-ordered if that won't cause other problems.

Sounds like what Chris asked, I am happy with this although I am not sure where that code should live. At this point I suppose it would make sense to just call these shutdown functions directly from ShutdownXPCOM, but I can Imagine people not liking the idea.
(In reply to Bill McCloskey (:billm) from comment #221)
> Nical, I pushed the new patch to holly.
> https://tbpl.mozilla.org/?tree=Holly&rev=d41bfe60b03d
> I still see a lot of problems in debug runs during shutdown.

Looking at them: It happens in some IPDL event-lopp code being run in a thread while XPCOM is doing the threads shutdown on the main thread. I don't know which IPDL thread is running that late but it should not be ImageBridge nor Compositor parent since the while point of the patch is to make sure those two are cleaned up earlier, so i'd track it as another bug.

There is also 
ASSERTION: unexpected event topic: 'strcmp(aTopic, NS_XPCOM_WILL_SHUTDOWN_OBSERVER_ID) == 0'
which is a mistake in my patch (I was supposed to check NS_XPCOM_SHUTDOWN_OBSERVER_ID in gfxPlatform instead).
v2: gfx ipc shutdown notifies NS_XPCOM_GFX_IPC_SHUTDOWN_OBSERVER_ID so that modules that depend on gfx's ipc get a chance to shutdown in time.
Attachment #8405252 - Attachment is obsolete: true
Attachment #8405252 - Flags: review?(sotaro.ikeda.g)
Attachment #8405252 - Flags: review?(matt.woodrow)
Attachment #8405252 - Flags: review?(benjamin)
Attachment #8406047 - Flags: review?(sotaro.ikeda.g)
Attachment #8406047 - Flags: review?(matt.woodrow)
Attachment #8406047 - Flags: review?(cpearce)
Attachment #8406047 - Flags: review?(benjamin)
Comment on attachment 8406047 [details] [diff] [review]
Shut down gfx IPDL protocols after Media and Widget, and before the shutdown of XPCOM threads

This is still pretty generic. Rather than having a new observer topic can we just explicitly call a static MediaShutdownManager::Shutdown() method from http://hg.mozilla.org/mozilla-central/annotate/215080b813a7/xpcom/build/nsXPComInit.cpp#l878 ?
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #233)
> Comment on attachment 8406047 [details] [diff] [review]
> Shut down gfx IPDL protocols after Media and Widget, and before the shutdown
> of XPCOM threads
> 
> This is still pretty generic. Rather than having a new observer topic can we
> just explicitly call a static MediaShutdownManager::Shutdown() method from
> http://hg.mozilla.org/mozilla-central/annotate/215080b813a7/xpcom/build/
> nsXPComInit.cpp#l878 ?

For media it's easy since MediaShutdownManager is a singleton. For widget, it's more annoying because each widget has its own observer listening to the shutdown event. I don't think it's worth the effort of keeping track of every widget manually just for the sake of shutting them down manually.
We can add a specific observer topic for widgets, that we'd fire explicitly in ShutdownXPCOM. But then we wouldn't gain much compared with the current patch.
Flags: needinfo?(cpearce)
woops the needinfo was not intentional
Flags: needinfo?(cpearce)
v3, simpler, invoke gfx's ipc shutdown directly in ShutdownXPCOM and doesn't modify widget and media.
Attachment #8406047 - Attachment is obsolete: true
Attachment #8406047 - Flags: review?(sotaro.ikeda.g)
Attachment #8406047 - Flags: review?(matt.woodrow)
Attachment #8406047 - Flags: review?(cpearce)
Attachment #8406047 - Flags: review?(benjamin)
Attachment #8406262 - Flags: review?(sotaro.ikeda.g)
Attachment #8406262 - Flags: review?(benjamin)
Attachment #8406262 - Flags: review?(benjamin) → review+
Comment on attachment 8406262 [details] [diff] [review]
v3:Shut down gfx IPDL protocols after Media and Widget, and before the shutdown of XPCOM threads

Review of attachment 8406262 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good. Criteria of adding IsCreated() checks to ImageBridgeChild seem not clear. Can you explain about it?
Attachment #8406262 - Flags: review?(sotaro.ikeda.g) → review+
(In reply to Sotaro Ikeda [:sotaro] from comment #246)
> Comment on attachment 8406262 [details] [diff] [review]
> v3:Shut down gfx IPDL protocols after Media and Widget, and before the
> shutdown of XPCOM threads
> 
> Review of attachment 8406262 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> Looks good. Criteria of adding IsCreated() checks to ImageBridgeChild seem
> not clear. Can you explain about it?

It seems like adding IsCreated() to all ImageBridgeChild functions except start up static functions.
(In reply to Sotaro Ikeda [:sotaro] from comment #248)
> (In reply to Sotaro Ikeda [:sotaro] from comment #246)
> > Comment on attachment 8406262 [details] [diff] [review]
> > v3:Shut down gfx IPDL protocols after Media and Widget, and before the
> > shutdown of XPCOM threads
> > 
> > Review of attachment 8406262 [details] [diff] [review]:
> > -----------------------------------------------------------------
> > 
> > Looks good. Criteria of adding IsCreated() checks to ImageBridgeChild seem
> > not clear. Can you explain about it?
> 
> It seems like adding IsCreated() to all ImageBridgeChild functions except
> start up static functions.

Something like that. I added IsCreated checks to static methods that we may (unintentionally) end up calling after ImageBridge is destroyed (example, some media resources holding on to ImageClients and TextureClients that are cycle collected after the shutdown of ImageBridge). I tried to fiddle with cycle collection to make sure all of these resources are collected earlier than the gfx ipc shutdown, but I didn't manage to get rid of all of it. So the safest route is to make sure these ImageBridge methods don't crash Gecko when they are called too late. The alternative would have been to force the callers to check ImageBridgeChild::IsCreate() before doing anything with the ImageBridge but it's a bit heavy and over time people will forget to do it.
I have been a bit to optimistic when adding that assertion (that caused the backout)
removed it and relanded. https://hg.mozilla.org/integration/mozilla-inbound/rev/e17d61d6acf6
I am a bit worried about failing that much though, I'll look into it.
https://hg.mozilla.org/mozilla-central/rev/e17d61d6acf6

When you're reasonably comfortable with this patch's bake time, we definitely want this uplifted as far as Release Management will let it go.
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla31
blocking-b2g: 1.4? → backlog
(In reply to TBPL Robot from comment #271)

That was on m-c tip :(
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Target Milestone: mozilla31 → ---