Closed Bug 924622 Opened 11 years ago Closed 10 years ago

Intermittent Android Shutdown "ABORT: mismatched CxxStackFrame ctor/dtors" [@ mozalloc_abort(char const*)]

Categories

(Core :: Graphics: Layers, defect)

defect
Not set
normal

Tracking

()

RESOLVED FIXED
mozilla33
tracking-b2g backlog
Tracking Status
e10s + ---
firefox29 --- wontfix
firefox30 --- wontfix
firefox31 --- fixed
firefox32 --- fixed
firefox33 --- fixed
firefox-esr24 --- unaffected
b2g-v1.3 --- wontfix
b2g-v1.3T --- wontfix
b2g-v1.4 --- wontfix
b2g-v2.0 --- fixed
b2g-v2.1 --- fixed

People

(Reporter: RyanVM, Assigned: bjacob)

References

Details

(Keywords: assertion, crash, intermittent-failure)

Crash Data

Attachments

(4 files, 9 obsolete files)

945 bytes, patch
Details | Diff | Splinter Review
24.34 KB, patch
sotaro
: review+
Details | Diff | Splinter Review
25.15 KB, patch
Details | Diff | Splinter Review
4.20 KB, patch
nical
: review+
mattwoodrow
: review+
Details | Diff | Splinter Review
Don't know if those profiling errors are related, but they look bad.

https://tbpl.mozilla.org/php/getParsedLog.php?id=28830191&tree=Mozilla-Inbound

Android 2.2 Armv6 Tegra mozilla-inbound opt test jsreftest-1 on 2013-10-08 08:07:31 PDT for push d1cd57876e48
slave: tegra-067

REFTEST FINISHED: Slowest test took 22989ms (http://10.250.49.158:30067/jsreftest/tests/jsreftest.html?test=js1_5/GC/regress-203278-2.js)
REFTEST INFO | Result summary:
REFTEST INFO | Successful: 49119 (49119 pass, 0 load only)
REFTEST INFO | Unexpected: 0 (0 unexpected fail, 0 unexpected pass, 0 unexpected asserts, 0 unexpected fixed asserts, 0 failed load, 0 exception)
REFTEST INFO | Known problems: 1104 (20 known fail, 0 known asserts, 952 random, 118 skipped, 14 slow)
REFTEST INFO | Total canvas count = 0
REFTEST TEST-START | Shutdown
INFO | automation.py | Application ran for: 0:14:27.382977
INFO | zombiecheck | Reading PID log: /tmp/tmp5MkjLRpidlog
mozcrash INFO | Downloading symbols from: http://ftp.mozilla.org/pub/mozilla.org/mobile/tinderbox-builds/mozilla-inbound-android-armv6/1381238513/fennec-27.0a1.en-US.android-arm-armv6.crashreporter-symbols.zip
/data/anr/traces.txt not found
PROCESS-CRASH | Shutdown | application crashed [@ mozalloc_abort(char const*)]
Crash dump filename: /tmp/tmpokHVgJ/643deeb3-a3b4-be66-34160430-4a921b7e.dmp
Operating system: Android
                  0.0.0 Linux 2.6.32.9-00002-gd8084dc-dirty #1 SMP PREEMPT Wed Feb 2 11:32:06 PST 2011 armv7l nvidia/harmony/harmony/harmony:2.2/FRF91/20110202.102810:eng/test-keys
CPU: arm
     2 CPUs

Crash reason:  SIGSEGV
Crash address: 0x0

Thread 16 (crashed)
 0  libmozalloc.so!mozalloc_abort(char const*) [mozalloc_abort.cpp:d1cd57876e48 : 30 + 0x8]
     r4 = 0x00000000    r5 = 0xffffffff    r6 = 0xafd424a8    r7 = 0xafd42550
     r8 = 0x554332d9    r9 = 0x554337fc   r10 = 0x00000000    fp = 0x5548f5bc
     sp = 0x5b3ff840    lr = 0x4498ea20    pc = 0x4498ea28
    Found by: given as instruction pointer in context
 1  libxul.so!NS_DebugBreak [nsDebugImpl.cpp:d1cd57876e48 : 430 + 0x6]
     r4 = 0x00000000    r5 = 0xffffffff    r6 = 0xafd424a8    r7 = 0xafd42550
     r8 = 0x554332d9    r9 = 0x554337fc   r10 = 0x00000000    fp = 0x5548f5bc
     sp = 0x5b3ff848    pc = 0x54a19344
    Found by: call frame info
 2  libxul.so!mozilla::ipc::MessageChannel::DebugAbort(char const*, int, char const*, char const*, bool) const [MessageChannel.cpp:d1cd57876e48 : 1539 + 0x1e]
     r4 = 0x4ed02860    r5 = 0x554337fc    r6 = 0x5543353f    r7 = 0x55433544
     r8 = 0x553f900a    r9 = 0x55492451   r10 = 0x5b3ffdf0    fp = 0x55492451
     sp = 0x5b3ffc78    pc = 0x5468d99c
    Found by: call frame info
 3  libxul.so!mozilla::ipc::MessageChannel::~MessageChannel() [MessageChannel.cpp:d1cd57876e48 : 79 + 0x2a]
     r4 = 0x57c7a730    r5 = 0x00000000    r6 = 0x55754db4    r7 = 0x00000010
     r8 = 0x5af54080    r9 = 0x55492451   r10 = 0x5b3ffdf0    fp = 0x55492451
     sp = 0x5b3ffcc8    pc = 0x5468dff8
    Found by: call frame info
 4  libxul.so!mozilla::layers::PImageBridgeParent::~PImageBridgeParent() [PImageBridgeParent.cpp:d1cd57876e48 : 82 + 0x6]
     r4 = 0x57c7a700    r5 = 0x00000000    r6 = 0x55754db4    r7 = 0x00000010
     r8 = 0x5af54080    r9 = 0x55492451   r10 = 0x5b3ffdf0    fp = 0x55492451
     sp = 0x5b3ffcd8    pc = 0x546ed2f8
    Found by: call frame info
 5  libxul.so!mozilla::layers::ImageBridgeParent::~ImageBridgeParent() [ImageBridgeParent.cpp:d1cd57876e48 : 57 + 0x2]
     r4 = 0x57c7a700    r5 = 0x00000000    r6 = 0x55754db4    r7 = 0x00000010
     r8 = 0x5af54080    r9 = 0x55492451   r10 = 0x5b3ffdf0    fp = 0x55492451
     sp = 0x5b3ffce0    pc = 0x54aa3bdc

10-08 08:28:46.291 E/Profiler( 1776): BPUnw: [6 total] thread_register_for_profiling(me=0x701db8, stacktop=0x5deffe4b)
10-08 08:28:46.291 E/Profiler( 1776): BPUnw: [5 total] thread_unregister_for_profiling(me=0x701db8) 
10-08 08:28:46.551 E/Profiler( 1776): BPUnw: [4 total] thread_unregister_for_profiling(me=0x2f50f0) 
10-08 08:28:46.561 I/Gecko   ( 1776): ###!!! [MessageChannel][Parent][../../../ipc/glue/MessageChannel.cpp:79] Assertion (mCxxStackFrames.empty()) failed.  mismatched CxxStackFrame ctor/dtors 
10-08 08:28:46.561 I/Gecko   ( 1776):   MessageChannel 'backtrace':
10-08 08:28:46.561 I/Gecko   ( 1776):   [(0) in sync PImageBridge::Msg_Stop(actor=2147483647) ]
10-08 08:28:46.561 I/Gecko   ( 1776):   remote Interrupt stack guess: 0
10-08 08:28:46.561 I/Gecko   ( 1776):   deferred stack size: 0
10-08 08:28:46.561 I/Gecko   ( 1776):   out-of-turn Interrupt replies stack size: 0
10-08 08:28:46.561 I/Gecko   ( 1776):   Pending queue size: 0, front to back:
10-08 08:28:46.561 I/Gecko   ( 1776): ###!!! ABORT: mismatched CxxStackFrame ctor/dtors: file ../../../ipc/glue/MessageChannel.cpp, line 1539
10-08 08:28:46.561 E/Gecko   ( 1776): mozalloc_abort: ###!!! ABORT: mismatched CxxStackFrame ctor/dtors: file ../../../ipc/glue/MessageChannel.cpp, line 1539
10-08 08:28:47.541 I/DEBUG   (  937): *** *** *** *** *** *** *** *** *** *** *** *** *** *** *** ***
10-08 08:28:47.541 I/DEBUG   (  937): Build fingerprint: 'nvidia/harmony/harmony/harmony:2.2/FRF91/20110202.102810:eng/test-keys'
10-08 08:28:47.541 I/DEBUG   (  937): pid: 1776, tid: 1834  >>> org.mozilla.fennec <<<
10-08 08:28:47.541 I/DEBUG   (  937): signal 11 (SIGSEGV), fault addr 00000000

10-08 08:28:46.291 E/Profiler( 1776): BPUnw: [6 total] thread_register_for_profiling(me=0x701db8, stacktop=0x5deffe4b)
10-08 08:28:46.291 E/Profiler( 1776): BPUnw: [5 total] thread_unregister_for_profiling(me=0x701db8) 
10-08 08:28:46.551 E/Profiler( 1776): BPUnw: [4 total] thread_unregister_for_profiling(me=0x2f50f0) 
10-08 08:28:46.561 E/Gecko   ( 1776): mozalloc_abort: ###!!! ABORT: mismatched CxxStackFrame ctor/dtors: file ../../../ipc/glue/MessageChannel.cpp, line 1539
10-08 08:28:52.591 W/InputManagerService( 1020): Got RemoteException sending setActive(false) notification to pid 1776 uid 10033
Something is very wrong here!

The gecko main thread (thread 6) is at this stack:

 3  libnss3.so!PR_Wait [ptsynch.c:d1cd57876e48 : 582 + 0x6]
     r4 = 0x4ed397c0    r5 = 0x0021a740    r6 = 0x00000001    sp = 0x4f3198c8
     pc = 0x51a1b8a8
    Found by: call frame info
 4  libxul.so!mozilla::ReentrantMonitor::Wait(unsigned int) [ReentrantMonitor.h:d1cd57876e48 : 89 + 0x6]
     r4 = 0x4f3198e4    r5 = 0x4f3198ec    r6 = 0x4edb4250    sp = 0x4f3198d8
     pc = 0x54156f68
    Found by: call frame info
 5  libxul.so!mozilla::layers::ImageBridgeChild::DestroyBridge() [ImageBridgeChild.cpp:d1cd57876e48 : 632 + 0xa]
     r4 = 0x4f3198e4    r5 = 0x4f3198ec    r6 = 0x4edb4250    sp = 0x4f3198e0
     pc = 0x54aa30d8
    Found by: call frame info
 6  libxul.so!mozilla::layers::ImageBridgeChild::ShutDown() [ImageBridgeChild.cpp:d1cd57876e48 : 583 + 0x2]
     r4 = 0x5599138c    r5 = 0x4ed72550    r6 = 0x00000037    r7 = 0x000000dc
     r8 = 0x55754db4    sp = 0x4f319918    pc = 0x54aa3150
    Found by: call frame info
 7  libxul.so!gfxPlatform::Shutdown() [gfxPlatform.cpp:d1cd57876e48 : 485 + 0x2]
     r4 = 0x5599138c    r5 = 0x4ed72550    r6 = 0x00000037    r7 = 0x000000dc
     r8 = 0x55754db4    sp = 0x4f319920    pc = 0x54a6febc
    Found by: call frame info
 8  libxul.so!nsComponentManagerImpl::KnownModule::~KnownModule() [nsComponentManager.h:d1cd57876e48 : 224 + 0x2]
     r4 = 0x4edab0e0    r5 = 0x4ed72550    r6 = 0x00000037    r7 = 0x000000dc
     r8 = 0x55754db4    sp = 0x4f319928    pc = 0x54a07800
    Found by: call frame info
 9  libxul.so!nsAutoPtr<nsComponentManagerImpl::KnownModule>::~nsAutoPtr() [nsAutoPtr.h:d1cd57876e48 : 77 + 0x6]
     r4 = 0x4edab0e0    r5 = 0x4ed72550    r6 = 0x00000037    r7 = 0x000000dc
     r8 = 0x55754db4    sp = 0x4f319930    pc = 0x54a0ac64
    Found by: call frame info
10  libxul.so!nsTArray_Impl<nsAutoPtr<nsComponentManagerImpl::KnownModule>, nsTArrayInfallibleAllocator>::Clear() [nsTArray.h:d1cd57876e48 : 534 + 0x6]
     r4 = 0x4ed722ac    r5 = 0x4ed72550    r6 = 0x00000037    r7 = 0x000000dc
     r8 = 0x55754db4    sp = 0x4f319940    pc = 0x54a0aca0
    Found by: call frame info
11  libxul.so!nsComponentManagerImpl::Shutdown() [nsComponentManager.cpp:d1cd57876e48 : 808 + 0x6]
     r4 = 0x4ed72200    r5 = 0x549dd98c    r6 = 0x55754db4    r7 = 0x4f319984
     r8 = 0x4edab7c0   r10 = 0x5551de64    sp = 0x4f319960    pc = 0x54a0add4
    Found by: call frame info
13  libxul.so!ScopedXPCOMStartup::~ScopedXPCOMStartup() [nsAppRunner.cpp:d1cd57876e48 : 1130 + 0x6]
     r4 = 0x4ed3a14c    r5 = 0x55754db4    r6 = 0x55980e5c    r7 = 0x00000000
     r8 = 0x55754db4    r9 = 0x4497ef84   r10 = 0x00000000    fp = 0x00225120
     sp = 0x4f3199b0    pc = 0x53bc964c
    Found by: call frame info
14  libxul.so!XREMain::XRE_main(int, char**, nsXREAppData const*) [nsAppRunner.cpp:d1cd57876e48 : 3961 + 0x6]
     r4 = 0x4f3199f4    r5 = 0x00000000    r6 = 0x4ed3a14c    r7 = 0x00000000
     r8 = 0x55754db4    r9 = 0x4497ef84   r10 = 0x00000000    fp = 0x00225120
     sp = 0x4f3199c8    pc = 0x53bcd3fc
    Found by: call frame info

So we are *really* late in the shutdown process, after all threads should have been stopped, and we're now calling the XPCOM module destructors. The main thread should not be accepting any new events at this point.

According to the comments in gfxPlatform::Shutdown() though, we're trying to shut down IPDL protocols at this stage by calling ImageBridgeChild::ShutDown and CompositorParent::ShutDown. This looks obviously unsafe; the code dates back to bug 763234. This needs to be happening much earlier in the shutdown process.

The "mismatched CxxStackFrame" assertion probably means that we're calling a method on a dead MessageChannel, which is a member of the ImageBridgeParent which is likely dead also. But given the refcounting of ImageBridgeParent, it's not clear to me why it would be dead.
OS: Android → All
Hardware: ARM → All
Summary: Intermittent Android PROCESS-CRASH | Shutdown | application crashed [@ mozalloc_abort(char const*)] after "ABORT: mismatched CxxStackFrame ctor/dtors" → Intermittent PROCESS-CRASH | Shutdown | application crashed [@ mozalloc_abort(char const*)] after "ABORT: mismatched CxxStackFrame ctor/dtors"
See Also: → 948600
Summary: Intermittent PROCESS-CRASH | Shutdown | application crashed [@ mozalloc_abort(char const*)] after "ABORT: mismatched CxxStackFrame ctor/dtors" → Intermittent Android PROCESS-CRASH | application crashed [@ mozalloc_abort(char const*)] after "ABORT: mismatched CxxStackFrame ctor/dtors"
As Cosmin filed as bug 959080 we have also seen this at least once in our Mozmill tests. A crash report can be found here: bp-ebeafe13-adce-4637-9c55-7742c2140113. This crash affects Firefox 29.0a1 down to 27.0 builds from 20131209204824. Looks like a regression since the 26.0 release.

Cosmin, please check where the crash happened and if it is somewhat reproducible.
Crash Signature: [@ mozalloc_abort(char const*) | Abort | NS_DebugBreak | mozilla::ipc::MessageChannel::DebugAbort(char const*, int, char const*, char const*, bool) const ]
Flags: needinfo?(cosmin.malutan)
Keywords: regression
This crashed during a mozilla-aurora_remote testrun, with Firefox 28.0a2 locale de, on mm-osx-107-3 node, build id 20140111004004.

I will run testruns with those configurations, but I doubt it will fail.
Flags: needinfo?(cosmin.malutan)
This assertion usually means that somebody is operating on a dead object. In this case I strongly suspect that we're dual-deleting an ImageBridgeParent object.

See also bug 959080 where this is seen on mac as well, so it seems to affect every platform with OMTC enabled.

Here is the full relevant stack; we're on the image-bridge child thread:

18:59:56     INFO -   0  libmozalloc.so!mozalloc_abort(char const*) [mozalloc_abort.cpp:747121b2bb50 : 30 + 0x4]
18:59:56     INFO -   1  libxul.so!NS_DebugBreak [nsDebugImpl.cpp:747121b2bb50 : 427 + 0x5]
18:59:56     INFO -   2  libxul.so!mozilla::ipc::MessageChannel::DebugAbort(char const*, int, char const*, char const*, bool) const [MessageChannel.cpp:747121b2bb50 : 1703 + 0x13]
18:59:56     INFO -   3  libxul.so!mozilla::ipc::MessageChannel::~MessageChannel() [MessageChannel.cpp:747121b2bb50 : 
18:59:56     INFO -   4  libxul.so!mozilla::layers::PImageBridgeParent::~PImageBridgeParent() [PImageBridgeParent.cpp:747121b2bb50 : 89 + 0x7]
18:59:56     INFO -   5  libxul.so!mozilla::layers::ImageBridgeParent::~ImageBridgeParent() [ImageBridgeParent.cpp:747121b2bb50 : 59 + 0x3]
18:59:56     INFO -   6  libxul.so!mozilla::layers::ImageBridgeParent::~ImageBridgeParent() [ImageBridgeParent.cpp:747121b2bb50 : 59 + 0x3]
18:59:56     INFO -   7  libxul.so!mozilla::detail::RefCounted<mozilla::layers::ISurfaceAllocator, (mozilla::detail::RefCountAtomicity)0>::Release() const [RefPtr.h:747121b2bb50 : 82 + 0xb]
18:59:56     INFO -   8  libxul.so!mozilla::layers::DeleteImageBridgeSync [StaticPtr.h:747121b2bb50 : 158 + 0x7]
... event loop

The main thread is *very* late in shutdown, we're already destroying modules:

18:59:56     INFO -   5  libxul.so!mozilla::layers::ImageBridgeChild::DestroyBridge() [ImageBridgeChild.cpp:747121b2bb50 : 589 + 0x9]
18:59:56     INFO -   6  libxul.so!mozilla::layers::ImageBridgeChild::ShutDown() [ImageBridgeChild.cpp:747121b2bb50 : 540 + 0x3]
18:59:56     INFO -   7  libxul.so!gfxPlatform::Shutdown() [gfxPlatform.cpp:747121b2bb50 : 570 + 0x3]
18:59:56     INFO -   8  libxul.so!nsComponentManagerImpl::KnownModule::~KnownModule() [nsComponentManager.h:747121b2bb50 : 226 + 0x1]
18:59:56     INFO -   9  libxul.so!nsAutoPtr<nsComponentManagerImpl::KnownModule>::~nsAutoPtr() [nsAutoPtr.h:747121b2bb50 : 78 + 0x5]
18:59:56     INFO -  10  libxul.so!nsTArray_Impl<nsAutoPtr<nsComponentManagerImpl::KnownModule>, nsTArrayInfallibleAllocator>::Clear() [nsTArray.h:747121b2bb50 : 531 + 0x3]
18:59:56     INFO -  11  libxul.so!nsComponentManagerImpl::Shutdown() [nsComponentManager.cpp:747121b2bb50 : 790 + 0x7]
18:59:56     INFO -  12  libxul.so!mozilla::ShutdownXPCOM(nsIServiceManager*) [nsXPComInit.cpp:747121b2bb50 : 808 + 0x3]
Crash Signature: [@ mozalloc_abort(char const*) | Abort | NS_DebugBreak | mozilla::ipc::MessageChannel::DebugAbort(char const*, int, char const*, char const*, bool) const ]
Component: IPC → Graphics: Layers
Keywords: regression
This signature is currently #8 on Firefox 27 beta on mac, but many of the crashes are an unrelated crash in plugin code, so it's hard to know what the real severity of this crash is. Shutdown crashes tend to be less important in general.
Crash Signature: [@ mozalloc_abort(char const*) | Abort | NS_DebugBreak | mozilla::ipc::MessageChannel::DebugAbort(char const*, int, char const*, char const*, bool) const ]
Flags: needinfo?(nical.bugzilla)
To answer Benjamin's question from bug 959080, the crash with Mozmill happens without browser.remote.tabs enabled. It's a default opt build.

(In reply to Cosmin Malutan from comment #34)
> This crashed during a mozilla-aurora_remote testrun, with Firefox 28.0a2
> locale de, on mm-osx-107-3 node, build id 20140111004004.

This is not that helpful. Important here would be the test we are failing in.
I know it doesn't help a lot but the job is gone, it didn't failed again when I tried to reproduce it.
If it will crash again I will return with more info.
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #2)
> 
> So we are *really* late in the shutdown process, after all threads should
> have been stopped, and we're now calling the XPCOM module destructors. The
> main thread should not be accepting any new events at this point.

I don't think this is supposed to post anything in the main thread's event loop.
It does post stuff in the ImageBridgeChild thread which is a chromium-style thread, which is about to be destroyed after everything is shut down.
Do you mean that the mutex or something esle here is implicitly waking up XPCOM stuff that is already dead? Or is it that we can't touch a chromium thread at this point?

> 
> According to the comments in gfxPlatform::Shutdown() though, we're trying to
> shut down IPDL protocols at this stage by calling ImageBridgeChild::ShutDown
> and CompositorParent::ShutDown. This looks obviously unsafe; the code dates
> back to bug 763234. This needs to be happening much earlier in the shutdown
> process.

I don't know much about the shutdown process, but the only requirements here is that ImageBridge's shutdown happens before CompositorParent's (because shutting down CompositorParent destroys the thread in which CompositorParent lives). And that it happens when we close the browser (We create and destroy ImageBridge along with the browser).

> 
> The "mismatched CxxStackFrame" assertion probably means that we're calling a
> method on a dead MessageChannel, which is a member of the ImageBridgeParent
> which is likely dead also. But given the refcounting of ImageBridgeParent,
> it's not clear to me why it would be dead.

ImageBridgeParent's refcount is about to go to zero (it is happening in ImageBridgeParent::DeferredDestroy)

I wonder what is closing the channel.
Flags: needinfo?(nical.bugzilla)
> Do you mean that the mutex or something esle here is implicitly waking up
> XPCOM stuff that is already dead?

No.

 Or is it that we can't touch a chromium
> thread at this point?

This. No secondary threads are supposed to exist at this point; all threads are supposed to be joined or abandoned during the xpcom-shutdown-threads phase.
Summary: Intermittent Android PROCESS-CRASH | application crashed [@ mozalloc_abort(char const*)] after "ABORT: mismatched CxxStackFrame ctor/dtors" → Intermittent PROCESS-CRASH | application crashed [@ mozalloc_abort(char const*)] after "ABORT: mismatched CxxStackFrame ctor/dtors"
https://tbpl.mozilla.org/php/getParsedLog.php?id=34561608&tree=Mozilla-Central

This continues to fail frequently. Do we care?
Flags: needinfo?(nical.bugzilla)
Flags: needinfo?(benjamin)
Yes, it's not test-only; it's showing up in the wild as topcrash bug 970100. This is important.
Blocks: 970100
Flags: needinfo?(benjamin)
I think we do care but I am a bit swamped. From the discussion on this bug, my understanding is that gfxPlatform::ShutDown is already too late to tear down ImageBridge. It might just be a matter of finding another place to call ImageBridgeParent::ShutDown() that is during the browser's shutdown, prior to gfxPlatform::ShutDown() and at a point where we can still use a chromium thread's event loop. I don't know where that would be, though.
Flags: needinfo?(nical.bugzilla)
https://tbpl.mozilla.org/php/getParsedLog.php?id=35433705&tree=Fx-Team

(In reply to Nicolas Silva [:nical] from comment #70)
> I think we do care but I am a bit swamped. From the discussion on this bug,
> my understanding is that gfxPlatform::ShutDown is already too late to tear
> down ImageBridge. It might just be a matter of finding another place to call
> ImageBridgeParent::ShutDown() that is during the browser's shutdown, prior
> to gfxPlatform::ShutDown() and at a point where we can still use a chromium
> thread's event loop. I don't know where that would be, though.

Any suggestions who might? :)
Flags: needinfo?(nical.bugzilla)
Blocks: 936226
Milan, can we please find an owner for this?
Flags: needinfo?(milan)
Benoit, let's design a shutdown sequence to take care of this (see comment 70) and fix it for 31.
Assignee: nobody → bjacob
Flags: needinfo?(nical.bugzilla)
Flags: needinfo?(milan)
https://tbpl.mozilla.org/php/getParsedLog.php?id=36407544&tree=Fx-Team

12:33:03     INFO -  ###!!! [MessageChannel][Parent][/builds/slave/fx-team-osx64-0000000000000000/build/ipc/glue/MessageChannel.cpp:229] Assertion (mCxxStackFrames.empty()) failed.  mismatched CxxStackFrame ctor/dtors
12:33:03     INFO -    MessageChannel 'backtrace':
12:33:03     INFO -    [(0) in sync PImageBridge::Msg_Stop(actor=2147483647) ]
12:33:03     INFO -    remote Interrupt stack guess: 0
12:33:03     INFO -    deferred stack size: 0
12:33:03     INFO -    out-of-turn Interrupt replies stack size: 0
12:33:03     INFO -    Pending queue size: 0, front to back:
12:33:03     INFO -  [919] ###!!! ABORT: mismatched CxxStackFrame ctor/dtors: file /builds/slave/fx-team-osx64-0000000000000000/build/ipc/glue/MessageChannel.cpp, line 1722
12:33:03     INFO -  [919] ###!!! ABORT: mismatched CxxStackFrame ctor/dtors: file /builds/slave/fx-team-osx64-0000000000000000/build/ipc/glue/MessageChannel.cpp, line 1722
We have a lot of orange in e10s caused by bug 989567, which shows up as follows:

https://tbpl.mozilla.org/php/getParsedLog.php?id=37399660&tree=Holly#error0

Maybe 1/2 debug builds tests this problem. I'm told that we need to fix this bug in order to fix that. When will someone be able to work on this?
I hit this just now on OS X 10.9. I was debugging another intermittent orange at the time. I had chaos mode enabled, and I maxed out my CPU cores using the following command:

> parallel yes {1} '>' /dev/null ::: A B C D E F G H I J K L M N O P

I'm posting this in the hopes that it will help someone reproduce this bug.
Just hit it again after running the test I'm debugging several more times. It seems pretty easy to reproduce this way.
Milan, was there a specific reason why this was assigned to me? This looks like a better fit for nical, who already commented above, and who wrote the patch on bug 763234 that is referred to in comment 2 above.

Un-assigning myself to reflect current reality --- feel free to reassign to me.
Assignee: bjacob → nobody
Flags: needinfo?(milan)
Will assign somebody by tomorrow.
Flags: needinfo?(milan)
Assignee: nobody → nical.bugzilla
I am new to XPCOM stuff so I took a rather naive approach, don't hesitate to tell me if It's completely wrong, there might be dependencies that I am not seeing.

looking at stack traces and at nsXPComInit.cpp, I see that gfxPlatform::Shutdown happens in
(nsComponentManagerImpl::gComponentManager)->Shutdown();
which takes place after nsThreadManager::get()->Shutdown(); In ShutdownXPCOM.
That's pretty much what Benjamin points out in comment 2.

So I added an Observer on NS_XPCOM_WILL_SHUTDOWN_OBSERVER_ID which is triggered before nsThreadManager shutdown, and moved the gfx ipdl related stuff there.
Attachment #8404598 - Flags: review?(benjamin)
Comment on attachment 8404598 [details] [diff] [review]
Shut down gfx IPDL protocols before the shutdown of XPCOM threads

I recommend using "xpcom-shutdown" instead of "xpcom-will-shutdown".

You also don't need mGfxIpcShutdownObserver at all here: the observer service holds a strong ref so holding another ref is unnecessary.

Also note that bholley just changed the ordering of module destructors here in bug 913138; that's not sufficient for this bug but you should be aware of it.
Attachment #8404598 - Flags: review?(benjamin) → review+
try push with a bunch of m4 jobs on osx https://tbpl.mozilla.org/?tree=Try&rev=2bcdeb0c90f4
Blocks: 986113
This blocks a 1.4 blocker (bug 986113).
blocking-b2g: --- → 1.4?
I pushed this patch to holly and I think it still has some issues. I see crashes like this:

https://tbpl.mozilla.org/php/getParsedLog.php?id=37591939&tree=Holly#error0

It looks like maybe an issue where code is running after the image bridge is shut down but it still expects the image bridge to be alive?

There's also a stack where it looks like we're still trying to do IPC too late:

https://tbpl.mozilla.org/php/getParsedLog.php?id=37590588&full=1&branch=holly#error0
Attached patch sleep — — Splinter Review
Oh, I should mention that holly is used for e10s testing.

I can reproduce the second issue by applying this patch on top of the main one in this bug. Then start a browser, open a new e10s window from the file menu (assuming OMTC is enabled), and quit. I see a crash every time on shutdown.
(In reply to Bill McCloskey (:billm) from comment #193)
> I pushed this patch to holly and I think it still has some issues. I see
> crashes like this:
> 
> https://tbpl.mozilla.org/php/getParsedLog.php?id=37591939&tree=Holly#error0
> 
> It looks like maybe an issue where code is running after the image bridge is
> shut down but it still expects the image bridge to be alive?
> 
> There's also a stack where it looks like we're still trying to do IPC too
> late:
> 
> https://tbpl.mozilla.org/php/getParsedLog.
> php?id=37590588&full=1&branch=holly#error0

Now its media that is shutting down after ImageBridge which is a problem.
Same thing for Widget. 

I have a patch that moves the shutdown of Media and widget to NS_XPCOM_WILL_SHUTDOWN_OBSERVER_ID.

Now I am also hitting a problem with resources that need ImageBridge to get deallocated and that are cycle collected in nsCycleCollector_shutdown which happens way after the shutdown of XPCOM threads.

I suppose it won't be possible to shutdown the cycle collector sooner, but perhaps we could trigger a cycle collection between NS_XPCOM_WILL_SHUTDOWN_OBSERVER_ID and NS_XPCOM_SHUTDOWN_OBSERVER_ID?

Chris, is it ok to move the shutdown of Media to NS_XPCOM_WILL_SHUTDOWN_OBSERVER_ID?
Flags: needinfo?(cpearce)
This patch seems to fix it for me. It does the following things:
* Have Gfx ipc shut down at NS_XPCOM_SHUTDOWN_OBSERVER_ID so that it happens before XPCOM's thread shutdown
* Move Media's shut down to NS_XPCOM_WILL_SHUTDOWN_OBSERVER_ID so that it gets destroyed before gfx's ipc.
* Move Widget's shut down to NS_XPCOM_WILL_SHUTDOWN_OBSERVER_ID so that it gets destroyed before gfx's ipc.
* Make it possible for ImageContainer to be cycle collected after ImageBridgeChild's destruction without crashing by adding some checks in ImageBridge. If this happens, the gfx IPDL actors already have cleaned up the memory so nothing gets leaked.

I asked a lot of reviewers because this fiddles with several modules: Benjamin for the nsXPCOmInit stuff in general, Chris for moving media's shutdown, Sotaro for the ImageBridgeChild stuff and Matt for Widget.
Attachment #8405225 - Attachment is obsolete: true
Attachment #8405252 - Flags: review?(sotaro.ikeda.g)
Attachment #8405252 - Flags: review?(matt.woodrow)
Attachment #8405252 - Flags: review?(cpearce)
Attachment #8405252 - Flags: review?(benjamin)
(In reply to :Ehsan Akhgari (lagging on bugmail, needinfo? me!) from comment #192)
> This blocks a 1.4 blocker (bug 986113).

I don't see how an intermittent failure in a debug test would stop us from shipping a phone.  If there is an underlying problem, let's open a bug for it with an STR.  Using 1.4+ to mean "this is important to work on now" is wrong.
(In reply to Milan Sreckovic [:milan] from comment #205)
> (In reply to :Ehsan Akhgari (lagging on bugmail, needinfo? me!) from comment
> #192)
> > This blocks a 1.4 blocker (bug 986113).
> 
> I don't see how an intermittent failure in a debug test would stop us from
> shipping a phone.  If there is an underlying problem, let's open a bug for
> it with an STR.  Using 1.4+ to mean "this is important to work on now" is
> wrong.

Please see bug 986113 comment 26.  As far as I understand, while this bug is intermittently hit on our testing infrastructure, it's 100% reproducible on e10s and maybe on b2g too.  Bug 986113 basically is us detecting several pending tasks during shutdown, and this is one of them while there are others to fix too (please see the dependency list of that bug.)
Blocks a blocker
blocking-b2g: 1.4? → 1.4+
Comment on attachment 8405252 [details] [diff] [review]
Shut down gfx IPDL protocols after Media and Widget, and before the shutdown of XPCOM threads

diff --git a/gfx/layers/ipc/ImageBridgeChild.h b/gfx/layers/ipc/ImageBridgeChild.h

+ * check whether it is in the IBC thread and dispatching itself in the IBC thread
+ * if is is not is dangerous if we don't also check that ImageBridge is created.

Something's wrong with this sentence.

I'm a little concerned with this in general. What exactly is the dependency of media and widget on the gfx shutdown sequence? Rather than expressing this dependency by moving things around in will-shutdown/shutdown/shutdown-threads, can we explicitly call from XPCOM shutdown into the gfx shutdown code at the correct time (after xpcom-shutdown, before xpcom-shutdown-threads)?

I continue to resist the urge to do a full-blown dependency system for startup and shutdown; I'd prefer to make as the shutdown sequence explicitly-ordered if that won't cause other problems.
(In reply to Preeti Raghunath(:Preeti) from comment #207)
> Blocks a blocker

Not on b2g, see below.


(In reply to :Ehsan Akhgari (lagging on bugmail, needinfo? me!) from comment #206)
> ...
> 
> Please see bug 986113 comment 26.  As far as I understand, while this bug is
> intermittently hit on our testing infrastructure, it's 100% reproducible on
> e10s and maybe on b2g too.  Bug 986113 basically is us detecting several
> pending tasks during shutdown, and this is one of them while there are
> others to fix too (please see the dependency list of that bug.)

"... and maybe on b2g too" and we are making it a blocker for b2g 1.4 release?

What is the STR for reproducing this on a b2g device?  100% or otherwise?
blocking-b2g: 1.4+ → 1.4?
(In reply to Milan Sreckovic [:milan] from comment #209)
> (In reply to :Ehsan Akhgari (lagging on bugmail, needinfo? me!) from comment
> #206)
> > ...
> > 
> > Please see bug 986113 comment 26.  As far as I understand, while this bug is
> > intermittently hit on our testing infrastructure, it's 100% reproducible on
> > e10s and maybe on b2g too.  Bug 986113 basically is us detecting several
> > pending tasks during shutdown, and this is one of them while there are
> > others to fix too (please see the dependency list of that bug.)
> 
> "... and maybe on b2g too" and we are making it a blocker for b2g 1.4
> release?
> 
> What is the STR for reproducing this on a b2g device?  100% or otherwise?

I don't have the answer to this question (and I'm just echoing what Bill said anyway, so needinfo-ing him).

Note that due to the nature of bug 986113, we need to fix a number of bugs and gradually make our way towards making this assertion not fire.  In an ideal world, we'd know exactly which bugs those are, but we don't have any way of knowing that.
Flags: needinfo?(wmccloskey)
Mass tracking-e10s flag change. Filter bugmail on "2be0fcce-e36a-4e2c-aa80-0e3d33eb5406".
tracking-e10s: --- → +
I don't know of any way to trigger this on builds that we ship to users. Bug 986113 is about an assertion, so it only triggers on debug builds. Gregor marked it as blocking, presumably because it's a priority to get debug builds tested on tinderbox. It's pretty easy to make the work_queue_.empty() assertion trip in debug builds. I think if you apply the patch I've posted here and then kill the parent during the sleep call, we'll get the work_queue_.empty() pretty consistently.
Flags: needinfo?(wmccloskey)
Nical, I pushed the new patch to holly.
https://tbpl.mozilla.org/?tree=Holly&rev=d41bfe60b03d
I still see a lot of problems in debug runs during shutdown.
Comment on attachment 8405252 [details] [diff] [review]
Shut down gfx IPDL protocols after Media and Widget, and before the shutdown of XPCOM threads

Review of attachment 8405252 [details] [diff] [review]:
-----------------------------------------------------------------

I think you're better off having a single xpcom-shutdown observer that is responsible for explicitly shutting down these interdependent things in the correct order. I bet there's others we'll discover in future, maybe like the WebRTC media code.

i.e. a single observer that calls MediaShutdownManager::Shutdown(), then CompositorParent::ShutDown(), and then, ImageBridgeChild::ShutDown().
Attachment #8405252 - Flags: review?(cpearce) → review-
(In reply to Nicolas Silva [:nical] from comment #198)
> Chris, is it ok to move the shutdown of Media to
> NS_XPCOM_WILL_SHUTDOWN_OBSERVER_ID?

I don't think this is a problem per se, but as I said above, I think explicitly codifying the shutdown dependency order in a single xpcom-shutdown observer is the way most robust solution here.
Flags: needinfo?(cpearce)
(In reply to Chris Pearce (:cpearce) from comment #225)
> Comment on attachment 8405252 [details] [diff] [review]
> Shut down gfx IPDL protocols after Media and Widget, and before the shutdown
> of XPCOM threads
> 
> Review of attachment 8405252 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> I think you're better off having a single xpcom-shutdown observer that is
> responsible for explicitly shutting down these interdependent things in the
> correct order. I bet there's others we'll discover in future, maybe like the
> WebRTC media code.
> 
> i.e. a single observer that calls MediaShutdownManager::Shutdown(), then
> CompositorParent::ShutDown(), and then, ImageBridgeChild::ShutDown().

Sounds good, where would this observer be? in xpcom/?
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #208)
> I'm a little concerned with this in general. What exactly is the dependency
> of media and widget on the gfx shutdown sequence?

Among other things, Widget owns the LayerManager and the GLContext that is used by the compositor. when the nsBaseWidget object is destroyed there is a lot of gfx/ipc stuff going on to go through all of the layers resources and cleanup things that have a GL resource. This really needs to happen before we loose the ability to use ImageBridge/CompositorParent. Note that destroying a widget is not always part of gecko's shutdown, it also happens any time you close a window.

Media holds strong references to resources allocated by the gfx IPDL protocols (in order to be able to share the video frames with the compositor without making copies), and more generally talks directly to the ImageBridge (to send video frames to the compositor, allocate them, destroy them, etc.).

> Rather than expressing
> this dependency by moving things around in
> will-shutdown/shutdown/shutdown-threads, can we explicitly call from XPCOM
> shutdown into the gfx shutdown code at the correct time (after
> xpcom-shutdown, before xpcom-shutdown-threads)?
> 
> I continue to resist the urge to do a full-blown dependency system for
> startup and shutdown; I'd prefer to make as the shutdown sequence
> explicitly-ordered if that won't cause other problems.

Sounds like what Chris asked, I am happy with this although I am not sure where that code should live. At this point I suppose it would make sense to just call these shutdown functions directly from ShutdownXPCOM, but I can Imagine people not liking the idea.
(In reply to Bill McCloskey (:billm) from comment #221)
> Nical, I pushed the new patch to holly.
> https://tbpl.mozilla.org/?tree=Holly&rev=d41bfe60b03d
> I still see a lot of problems in debug runs during shutdown.

Looking at them: It happens in some IPDL event-lopp code being run in a thread while XPCOM is doing the threads shutdown on the main thread. I don't know which IPDL thread is running that late but it should not be ImageBridge nor Compositor parent since the while point of the patch is to make sure those two are cleaned up earlier, so i'd track it as another bug.

There is also 
ASSERTION: unexpected event topic: 'strcmp(aTopic, NS_XPCOM_WILL_SHUTDOWN_OBSERVER_ID) == 0'
which is a mistake in my patch (I was supposed to check NS_XPCOM_SHUTDOWN_OBSERVER_ID in gfxPlatform instead).
v2: gfx ipc shutdown notifies NS_XPCOM_GFX_IPC_SHUTDOWN_OBSERVER_ID so that modules that depend on gfx's ipc get a chance to shutdown in time.
Attachment #8405252 - Attachment is obsolete: true
Attachment #8405252 - Flags: review?(sotaro.ikeda.g)
Attachment #8405252 - Flags: review?(matt.woodrow)
Attachment #8405252 - Flags: review?(benjamin)
Attachment #8406047 - Flags: review?(sotaro.ikeda.g)
Attachment #8406047 - Flags: review?(matt.woodrow)
Attachment #8406047 - Flags: review?(cpearce)
Attachment #8406047 - Flags: review?(benjamin)
Comment on attachment 8406047 [details] [diff] [review]
Shut down gfx IPDL protocols after Media and Widget, and before the shutdown of XPCOM threads

This is still pretty generic. Rather than having a new observer topic can we just explicitly call a static MediaShutdownManager::Shutdown() method from http://hg.mozilla.org/mozilla-central/annotate/215080b813a7/xpcom/build/nsXPComInit.cpp#l878 ?
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #233)
> Comment on attachment 8406047 [details] [diff] [review]
> Shut down gfx IPDL protocols after Media and Widget, and before the shutdown
> of XPCOM threads
> 
> This is still pretty generic. Rather than having a new observer topic can we
> just explicitly call a static MediaShutdownManager::Shutdown() method from
> http://hg.mozilla.org/mozilla-central/annotate/215080b813a7/xpcom/build/
> nsXPComInit.cpp#l878 ?

For media it's easy since MediaShutdownManager is a singleton. For widget, it's more annoying because each widget has its own observer listening to the shutdown event. I don't think it's worth the effort of keeping track of every widget manually just for the sake of shutting them down manually.
We can add a specific observer topic for widgets, that we'd fire explicitly in ShutdownXPCOM. But then we wouldn't gain much compared with the current patch.
Flags: needinfo?(cpearce)
woops the needinfo was not intentional
Flags: needinfo?(cpearce)
v3, simpler, invoke gfx's ipc shutdown directly in ShutdownXPCOM and doesn't modify widget and media.
Attachment #8406047 - Attachment is obsolete: true
Attachment #8406047 - Flags: review?(sotaro.ikeda.g)
Attachment #8406047 - Flags: review?(matt.woodrow)
Attachment #8406047 - Flags: review?(cpearce)
Attachment #8406047 - Flags: review?(benjamin)
Attachment #8406262 - Flags: review?(sotaro.ikeda.g)
Attachment #8406262 - Flags: review?(benjamin)
Attachment #8406262 - Flags: review?(benjamin) → review+
Comment on attachment 8406262 [details] [diff] [review]
v3:Shut down gfx IPDL protocols after Media and Widget, and before the shutdown of XPCOM threads

Review of attachment 8406262 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good. Criteria of adding IsCreated() checks to ImageBridgeChild seem not clear. Can you explain about it?
Attachment #8406262 - Flags: review?(sotaro.ikeda.g) → review+
(In reply to Sotaro Ikeda [:sotaro] from comment #246)
> Comment on attachment 8406262 [details] [diff] [review]
> v3:Shut down gfx IPDL protocols after Media and Widget, and before the
> shutdown of XPCOM threads
> 
> Review of attachment 8406262 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> Looks good. Criteria of adding IsCreated() checks to ImageBridgeChild seem
> not clear. Can you explain about it?

It seems like adding IsCreated() to all ImageBridgeChild functions except start up static functions.
(In reply to Sotaro Ikeda [:sotaro] from comment #248)
> (In reply to Sotaro Ikeda [:sotaro] from comment #246)
> > Comment on attachment 8406262 [details] [diff] [review]
> > v3:Shut down gfx IPDL protocols after Media and Widget, and before the
> > shutdown of XPCOM threads
> > 
> > Review of attachment 8406262 [details] [diff] [review]:
> > -----------------------------------------------------------------
> > 
> > Looks good. Criteria of adding IsCreated() checks to ImageBridgeChild seem
> > not clear. Can you explain about it?
> 
> It seems like adding IsCreated() to all ImageBridgeChild functions except
> start up static functions.

Something like that. I added IsCreated checks to static methods that we may (unintentionally) end up calling after ImageBridge is destroyed (example, some media resources holding on to ImageClients and TextureClients that are cycle collected after the shutdown of ImageBridge). I tried to fiddle with cycle collection to make sure all of these resources are collected earlier than the gfx ipc shutdown, but I didn't manage to get rid of all of it. So the safest route is to make sure these ImageBridge methods don't crash Gecko when they are called too late. The alternative would have been to force the callers to check ImageBridgeChild::IsCreate() before doing anything with the ImageBridge but it's a bit heavy and over time people will forget to do it.
I have been a bit to optimistic when adding that assertion (that caused the backout)
removed it and relanded. https://hg.mozilla.org/integration/mozilla-inbound/rev/e17d61d6acf6
I am a bit worried about failing that much though, I'll look into it.
https://hg.mozilla.org/mozilla-central/rev/e17d61d6acf6

When you're reasonably comfortable with this patch's bake time, we definitely want this uplifted as far as Release Management will let it go.
Status: NEW → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla31
blocking-b2g: 1.4? → backlog
(In reply to TBPL Robot from comment #271)

That was on m-c tip :(
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Target Milestone: mozilla31 → ---
Very reluctantly backed out in https://hg.mozilla.org/mozilla-central/rev/26209c172150 because we're pretty sure that made it considerably worse, rather than better.
(In reply to Phil Ringnalda (:philor) from comment #304)
> Very reluctantly backed out in
> https://hg.mozilla.org/mozilla-central/rev/26209c172150 because we're pretty
> sure that made it considerably worse, rather than better.

:( These crash are definitely happening before the shutdown of XPCOM threads. Benjamin, are there other ways to hit this assertion you can think of ?
Flags: needinfo?(benjamin)
Reading through the log of https://tbpl.mozilla.org/php/getParsedLog.php?id=37889317&full=1&branch=mozilla-inbound

The gecko main thread (thread 6) is in:

 4  libxul.so!mozilla::layers::ImageBridgeChild::DestroyBridge() [ImageBridgeChild.cpp:ebcacae1532c : 618 + 0xa]
 5  libxul.so!mozilla::layers::ImageBridgeChild::ShutDown() [ImageBridgeChild.cpp:ebcacae1532c : 570 + 0x2]
 6  libxul.so!mozilla::ShutdownXPCOM(nsIServiceManager*) [nsXPComInit.cpp:ebcacae1532c : 750 + 0x2]

The crashing thread (thread 25) is at:

http://hg.mozilla.org/integration/mozilla-inbound/file/ebcacae1532c/ipc/glue/MessageChannel.cpp#l228

 0  libmozalloc.so!mozalloc_abort(char const*) [mozalloc_abort.cpp:ebcacae1532c : 30 + 0x8]
 1  libxul.so!NS_DebugBreak [nsDebugImpl.cpp:ebcacae1532c : 421 + 0x6]
 2  libxul.so!mozilla::ipc::MessageChannel::DebugAbort(char const*, int, char const*, char const*, bool) const [MessageChannel.cpp:ebcacae1532c : 1722 + 0x1e]
 3  libxul.so!mozilla::ipc::MessageChannel::~MessageChannel() [MessageChannel.cpp:ebcacae1532c : 229 + 0x2a]
 4  libxul.so!mozilla::layers::PImageBridgeParent::~PImageBridgeParent() [PImageBridgeParent.cpp:ebcacae1532c : 93 + 0x6]
 5  libxul.so!mozilla::layers::ImageBridgeParent::~ImageBridgeParent() [ImageBridgeParent.cpp:ebcacae1532c : 61 + 0x16]
 6  libxul.so!mozilla::layers::ImageBridgeParent::~ImageBridgeParent() [ImageBridgeParent.cpp:ebcacae1532c : 61 + 0x2]
 7  libxul.so!mozilla::AtomicRefCountedWithFinalize<mozilla::layers::ISurfaceAllocator>::Release() [AtomicRefCountedWithFinalize.h:ebcacae1532c : 46 + 0x16]
 8  libxul.so!mozilla::layers::DeleteImageBridgeSync [StaticPtr.h:ebcacae1532c : 158 + 0x6]
 9  libxul.so!RunnableFunction<void (*)(IPC::Channel*, int), Tuple2<IPC::Channel*, int> >::Run() + 0x1e
10  libxul.so!MessageLoop::RunTask(Task*) [message_loop.cc:ebcacae1532c : 344 + 0xa]
11  libxul.so!MessageLoop::DeferOrRunPendingTask(MessageLoop::PendingTask const&) [message_loop.cc:ebcacae1532c : 352 + 0x6]
12  libxul.so!MessageLoop::DoWork() [message_loop.cc:ebcacae1532c : 430 + 0x2]

Use-after-free of the channel is almost certainly still what's happening here. Is this the compositor thread?

The other gecko threads are:
* thread 10 in an IPC message loop, probably the IPC thread, blocked
* thread 13, the GC thread, blocked
* thread 14, the JS watchdog, blocked
* thread 15, 18, 19, 20, 23, 26, XPCOM threads, blocked
* thread 16, the hang monitor, blocked
* thread 17, the background hang monitor, blocked
* thread 21, 22, a DOM worker, blocked
* thread 24, a chromium thread waiting for something. I don't know what this is but it's unlikely to be a big deal

So I really think the thing to do here is to fprintf(stderr) in ImageBridgeParent::~ImageBridgeParent() and see whether we are actually deleting this object more than once. AIUI there should only ever be one ImageBridgeParent object per android process, right? So if we see that printf more than once, we know we have a problem and can start taking stacks of that case.

AtomicRefCountedWithFinalize::Release is a little scary, but assuming that ISurfaceAllocator::ShrinkShmemSectionHeap can't ever refcount itself I don't see any obvious refcounting errors in the code.

Given that the patches make this more reproducible, has anyone tried just catching this in a debugger?
Flags: needinfo?(benjamin)
Thread 6 (main thread) is sending a synchronous proxy to thread 25 (the ImageBridgeChild thread) so the main thread is expected to be waiting in with this stack. when the other thread is there.

Maybe there are still protocols managed by PImageBridge alive, and we as we destroy the ImageBridgeChild it forces the destruction of the manadged protocols which generates some messages and we destroy PImageBridgeParent while these messages are still inflight. Could that make us hit this assertion?

I haven't managed to reproduce it on Linux + OMTC yet, I'll try on android.
try push with a new version of the patch that makes the ImageBridge's shutdown a bit more bullet-proof by forcing all it's managed protocols to shutdown before we destroy it, plus a hack to assert that ImageBridgeParents are not deleted more than once.

https://tbpl.mozilla.org/?tree=Try&rev=5377221d5161
Did you mean for that try push to run some tests?
Flags: needinfo?(nical.bugzilla)
(In reply to Bill McCloskey (:billm) from comment #352)
> Did you mean for that try push to run some tests?

heh yes :) my bad
Flags: needinfo?(nical.bugzilla)
The only difference compared to the previous version is in StopImageBridgeSync
Attachment #8406262 - Attachment is obsolete: true
Attachment #8411681 - Flags: review?(sotaro.ikeda.g)
https://tbpl.mozilla.org/php/getParsedLog.php?id=38403588&tree=Fx-Team

These last two manual ones were on debug builds, but AFAICT it's the same underlying issue. Let me know if you want them in a new bug.
Attachment #8411681 - Flags: review?(sotaro.ikeda.g) → review+
(In reply to TBPL Robot from comment #446)

FWIW, this instance occurred on a push that was in between the landing in comment 423 and the backout in comment 440.
The only difference with the previous patch is the explicit Release() calls in DispatchReleaseImageClient and DispatchReleaseTextureClient.

I also filed bug 1002451 to avoid falling in this trap again.
Attachment #8411681 - Attachment is obsolete: true
Attachment #8413733 - Flags: review?(sotaro.ikeda.g)
Attachment #8413733 - Flags: review?(sotaro.ikeda.g) → review+
Looks like a change in the ordering of #includes cause the build to fail on some platform. This should fix it: https://hg.mozilla.org/integration/mozilla-inbound/rev/8dda04e44e3e
Backed out: https://hg.mozilla.org/integration/mozilla-inbound/rev/3aa6bab70380

Looks like we are not seeing the end of this soon.
This got backed out.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Attached patch patch v6 (obsolete) — — Splinter Review
Sorry for the review spam, Sotaro. The differences with the previous version are in ImageBridgeShutdownStep1 and ImageBridgeShutdownStep2 (just renamed from StopImageBridgeSync and DeleteImageBridgeSync) with the addition of a another sync Stop ipc message sent after sync WillStop + a spin of the ImageBridge's event loop to make sure all messages have been processed before we delete ImageBridgeParent.
Attachment #8413733 - Attachment is obsolete: true
Attachment #8415320 - Flags: review?(sotaro.ikeda.g)
Attachment #8415320 - Flags: review?(sotaro.ikeda.g) → review+
I forgot to merge the #include ordering fix into the patch before landing
https://hg.mozilla.org/integration/mozilla-inbound/rev/74ef5120ae2f
Nicolas, do you have an idea when you might be able to look at this again? :-)

This intermittent failure is #2 on OrangeFactor, and as such worthy of escalation due to https://wiki.mozilla.org/Sheriffing/Test_Disabling_Policy
Flags: needinfo?(nical.bugzilla)
(In reply to Ed Morley [:edmorley UTC+0] from comment #582)
> Nicolas, do you have an idea when you might be able to look at this again?
> :-)
> 
> This intermittent failure is #2 on OrangeFactor, and as such worthy of
> escalation due to https://wiki.mozilla.org/Sheriffing/Test_Disabling_Policy

I am still looking at this. The fact that all of my previous attempts at fixing this failed makes me think that the real cause of the crash is (at least partly) not what I thought it was. At the moment I am out of ideas, so I am stuffing the code with logs and kicking some try pushes to see if any useful info comes out. Hard to tell how long it will take for me to better identify the problem and find a solution that makes it past inbound.
I may stay silent in the bug for a few days but I am still looking into it.
Flags: needinfo?(nical.bugzilla)
My latest attempt is looking good on try so far. One of the issues I was previously hitting was that I (wrongly) assumed that after a synchronous IPDL message returns, the other side (receiver) is done handling the message, which is almost true but not quite. So doing this fails (a lot on android 4, not so much elsewhere):

child->SendSomeSyncMessage()
parent = nullptr; // crash because parent's dtor is racing with the end of the ipdl code that handles th reception of SomeSyncMessage

https://tbpl.mozilla.org/?tree=Try&rev=d3040dd60df0
Attached patch v7 (obsolete) — — Splinter Review
The most important change here is that the last reference of ImageBridgeParent is nulled out on the compositor thread for the reason I presented above.
Attachment #8415320 - Attachment is obsolete: true
Comment on attachment 8417283 [details] [diff] [review]
v7

Review of attachment 8417283 [details] [diff] [review]:
-----------------------------------------------------------------

darn, forgot to set the r? flag
Attachment #8417283 - Flags: review?(sotaro.ikeda.g)
Attachment #8417283 - Flags: review?(sotaro.ikeda.g) → review+
The rebase was non-trivial so here is another try push https://tbpl.mozilla.org/?tree=Try&rev=7b6ecc72a79d
(In reply to Nicolas Silva [:nical] from comment #689)
> The rebase was non-trivial so here is another try push
> https://tbpl.mozilla.org/?tree=Try&rev=7b6ecc72a79d

So green. So wow!
Thank you for persisting with this :-)
Nicolas, I don't see you on IRC, but hopefully you don't mind if I go ahead and push this :)
Keywords: checkin-needed
There's breakage on B2G (non-unified builds):
https://tbpl.mozilla.org/php/getParsedLog.php?id=39207487&tree=Mozilla-Inbound#error0
07:08:17     INFO -  ../../../gecko/xpcom/build/nsXPComInit.cpp: In function 'nsresult mozilla::ShutdownXPCOM(nsIServiceManager*)':
07:08:17    ERROR -  ../../../gecko/xpcom/build/nsXPComInit.cpp:797: error: 'mozilla::layers::SharedBufferManagerChild' has not been declared
(after a good looking try push: https://tbpl.mozilla.org/?tree=Try&rev=dd1d8724cf39 in which the red stuff seems to be infrastructure problems)
I had to back this out for causing frequent mochitest-e10s-2 shutdown hangs as tracked in bug 1007284.
https://hg.mozilla.org/integration/mozilla-inbound/rev/8a5a9a06f59a
But not a single Android/OSX crash on inbound while this was landed!
Quick update: I have a patch that fixes the crash but regresses M-e10s 2. In the case of this M-e10s failure, if I intentionally leak the compositor thread, we don't crash anymore, except that the leak is caught and debug tests transform into a see of oranges.
So it looks like something is still chatting with the compositor thread after we shut it down, but only in M-e10s 2 and only 40% of the time or so.
when the M-e10s 2 crash happens, the crashing thread has nothing in its stack trace, and the only thread with a meaningful stack trace is the main thread, which is in ParentImpl::ShutdownBackgroundThread which is triggered by something reacting to NS_XPCOM_SHUTDOWN_THREADS_OBSERVER_ID.
I tweaked the shutdown order some more: this one looks promising: https://tbpl.mozilla.org/?tree=Try&rev=031b669d1e23 (e10s M2 retriggers will confirm, and then I'll do a push with all platforms)
Very few differences with the previous version.
One of the differences is having the destruction of the compositor thread be triggered right after the NS_XPCOM_SHUTDOWN_THREADS_OBSERVER_ID rather than just before (since something is still trying to interact with the thread in some way when reacting to this event). Also the deletion of the ImageBridgeParent actor happens before destroying the thread, but not right after the shutdown handshake (otherwise bad timing can make us delete the actor while some ipdl code is still running).
Attachment #8417283 - Attachment is obsolete: true
Attachment #8423135 - Flags: review?(sotaro.ikeda.g)
Comment on attachment 8423135 [details] [diff] [review]
v10293801 same things with a few tweaks to the shutdown order to make it pass tests.

Review of attachment 8423135 [details] [diff] [review]:
-----------------------------------------------------------------

review+ if the comments are addressed.

::: gfx/thebes/gfxPlatform.cpp
@@ +505,5 @@
>      gfxPrefs::DestroySingleton();
>      gfxFont::DestroySingletons();
>  
>      delete gPlatform;
>      gPlatform = nullptr;

Is there no problem to delete the above objects before CompositorParent::ShutDown() call?

::: xpcom/build/nsXPComInit.cpp
@@ +796,5 @@
> +        layers::ImageBridgeChild::ShutDown();
> +#ifdef MOZ_WIDGET_GONK
> +        layers::SharedBufferManagerChild::ShutDown();
> +#endif
> +        layers::AsyncTransactionTracker::Finalize();

AsyncTransactionTracker::Finalize() need to be called after CompositorParent::ShutDown().
Attachment #8423135 - Flags: review?(sotaro.ikeda.g) → review+
This caused mochitest crashes that contributed to a large bustage pileup on inbound and led to having to revert to a last-good cset, inconveniencing other innocent developers as well.

https://hg.mozilla.org/integration/mozilla-inbound/rev/eb2a6f7785a2
another try push: https://tbpl.mozilla.org/?tree=Try&rev=a83141e578d6

in my last try pushes I have seen a "X bad drawable" fatal assertions on M-e10s showing up, which I haven't figured out. It  happens also when we don't use textureCLientX11 (so it's not related to the lifetime of a TextureClient, at least). I can also fairly easily reproduce the X11 crash without my patches so I don't know if I am making it more likely to reproduce. I didn't have that a week ago and I didn't modify anything X11 related so I'm inclined to think it's not my patches, but this got backed out so many times that I have become rather pessimistic about whether random breakage comes from this patch or not.
Bug 1005056 is tracking the X11 issues, FWIW. They're pretty frequent in crashtest-ipc as well.
https://hg.mozilla.org/mozilla-central/rev/7a05dad0a4a2
Status: REOPENED → RESOLVED
Closed: 10 years ago10 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla32
*cues Hallelujah chorus*

If anybody deserves a badge for persistence, it's you, Nicolas :)
Indeed, fantastic job! :-)
\o/ Congrats!

Nit: some fprintf's were accidentally added by this patch.
(In reply to Benoit Jacob [:bjacob] from comment #1100)
> \o/ Congrats!
> 
> Nit: some fprintf's were accidentally added by this patch.
Flags: needinfo?(nical.bugzilla)
(In reply to Benoit Jacob [:bjacob] from comment #1100)
> \o/ Congrats!
> 
> Nit: some fprintf's were accidentally added by this patch.

Oh boy. --> Bug 1016321
Flags: needinfo?(nical.bugzilla)
Nicolas, can you take a look at the feasibility of uplifting this to Aurora/Beta? These crashes are pretty frequent on 30/31 as well. Thanks!
Blocks: 984320
Tweaking summary to avoid false positives on TBPL.
Summary: Intermittent PROCESS-CRASH | application crashed [@ mozalloc_abort(char const*)] after "ABORT: mismatched CxxStackFrame ctor/dtors" → Intermittent Android Shutdown "ABORT: mismatched CxxStackFrame ctor/dtors" [@ mozalloc_abort(char const*)]
(In reply to Ryan VanderMeulen [:RyanVM UTC-4] from comment #1105)
> Nicolas, can you take a look at the feasibility of uplifting this to
> Aurora/Beta? These crashes are pretty frequent on 30/31 as well. Thanks!

Nical, is it possible to uplift to 30? To uplift Bug 1006957 to b2g v1.4, that bug seems to need this bug's fix.
Flags: needinfo?(nical.bugzilla)
Blocks: 1006957
(In reply to Sotaro Ikeda [:sotaro] from comment #1111)
> (In reply to Ryan VanderMeulen [:RyanVM UTC-4] from comment #1105)
> > Nicolas, can you take a look at the feasibility of uplifting this to
> > Aurora/Beta? These crashes are pretty frequent on 30/31 as well. Thanks!
> 
> Nical, is it possible to uplift to 30? To uplift Bug 1006957 to b2g v1.4,
> that bug seems to need this bug's fix.

I changed a patch for b2g v1.4 in Bug 1006957. It does not need this patch now.
Flags: needinfo?(nical.bugzilla)
No longer blocks: 1006957
Attached patch aurora uplift — — Splinter Review
try push https://tbpl.mozilla.org/?tree=Try&rev=bddc86e9e47a

I am supposed to be on vacation, so if you want this landed before Wednesday, I'll let you check the try push and land it if it's good, otherwise I'll land it when I come back.
All over it :)
Flags: needinfo?(ryanvm)
Comment on attachment 8430771 [details] [diff] [review]
aurora uplift

The Try run is solid green (modulo some suites that aren't expected to pass on !trunk).

[Approval Request Comment]
Bug caused by (feature/regressing bug #): Unknown
User impact if declined: Hangs/crashes at shutdown. Earlier comments in this bug suggest that the crash signature has been seen in the wild. At least 5 intermittent oranges have gone away since this landed on trunk.
Testing completed (on m-c, etc.): On m-c for nearly a week. Aurora patch green on Try.
Risk to taking this patch (and alternatives if risky): No known regressions since this landed on trunk. Per discussion with Sotaro, it is very likely that any problems with the patch would have turned up on the Try run as our test suite has proven to be very sensitive to shutdown ordering issues.
String or IDL/UUID changes made by this patch: None
Attachment #8430771 - Flags: approval-mozilla-aurora?
Flags: needinfo?(ryanvm)
There's no way we're going to have a beta patch ready in time, so marking it wontfix for Firefox 30. Still hopeful about getting this onto b2g30, though. But no need to rush that either.
Attachment #8430771 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Bug 997699 makes getting this onto b2g30 tricky.
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
One thing that's happening here (I don't know if it's the only thing) is that top-level protocol actors now must be created and destroyed on the main thread. Some more details in bug 1028383 where we're adding assertions to guard that and make such bugs easier to debug. Concretely, when we're creating/destroying top-level actors on multiple threads, they can race to access the mOpenActors linked list. That's apparently a fairly recent thing from Nuwa. See bug 976479 comment 94.

So this patch fixes a couple place where we were destroying ImageBridgeParent and ImageBridgeChild actors off the main thread, and adds more assertions to guard that.
Attachment #8443715 - Flags: review?(nical.bugzilla)
Attachment #8443715 - Flags: review?(matt.woodrow)
Right so only the ImageBridgeChild actor destruction really is moved to the main thread. The parent side fix is mostly unrelated: I couldnt see a reason why we would want the mTransport to be destroyed off main thread, so I removed that, similar to what we do in CompositorParent. What do you think?
Attachment #8443715 - Flags: review?(nical.bugzilla) → review+
Assignee: nical.bugzilla → bjacob
Attachment #8443715 - Flags: review?(matt.woodrow) → review+
(In reply to Benoit Jacob [:bjacob] from comment #1138)
> Created attachment 8443715 [details] [diff] [review]
> Destroy ImageBridge actors on the main thread

Is this waiting on something to land?
Flags: needinfo?(bjacob)
Target Milestone: mozilla32 → ---
Yes. I have this 10-deep patch queue fixing the whole shitdown sequence and I can't easily land it bit by bit because that only makes a different intermittent go on fire. I have to fix it all at once. My last try push was not very far from the mark,

https://tbpl.mozilla.org/?tree=Try&rev=c74e623e8b70

and this new one should get us closer still:

https://tbpl.mozilla.org/?tree=Try&rev=9a085d9e9f06
Flags: needinfo?(bjacob)
Blocks: 1008254
Blocks: 986738
Should be fixed by bug 774388. The timing seems to match: bug 774388 landed on Sunday night on inbound and the last occurrences here are from Monday on other trees.
Depends on: 774388
Status: REOPENED → RESOLVED
Closed: 10 years ago10 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla33
blocking-b2g: backlog → ---
You need to log in before you can comment on or make changes to this bug.