Closed Bug 774622 Opened 12 years ago Closed 12 years ago

crash with abort message: "mismatched CxxStackFrame ctor/dtors: file /builds/slave/m-cen-andrd-ntly/build/ipc/glue/RPCChannel.cpp, line 656" on quitting Nightly

Categories

(Core :: Graphics: Layers, defect)

16 Branch
defect
Not set
critical

Tracking

()

VERIFIED FIXED
mozilla18
Tracking Status
firefox15 --- unaffected
firefox16 + verified
firefox17 --- verified
firefox18 --- verified
fennec 16+ ---

People

(Reporter: AdrianT, Assigned: nical)

References

Details

(4 keywords, Whiteboard: [native-crash])

Crash Data

Attachments

(1 file, 2 obsolete files)

This bug was filed from the Socorro interface and is 
report bp-2b3ea40a-be14-424e-b07d-6145d2120717 .

Other crashes:
https://crash-stats.mozilla.com/report/index/bp-6337ba57-4ad8-450b-9737-bd7772120717
https://crash-stats.mozilla.com/report/index/bp-ace5405c-2498-4f42-a283-ebe1c2120717
============================================================= 
Nightly 16.0a1 2012-07-16
Google Nexus (Android 4.0.4)

Steps to reproduce:
1) Load planet.mozilla.org.
2) Wait for the page to load.
3) Quit Firefox

Actual result:
A crash happens. I got 3 different crashes from 4 tries. Adding the crash signatures to the same bug as they may be related since the same steps to reproduce apply.

Issue is not reproducible on Aurora
============================================================= 
0 	libmozalloc.so 	mozalloc_abort 	memory/mozalloc/mozalloc_abort.cpp:19
1 	libc.so 	libc.so@0x11f6e 	
2 	dalvik-heap (deleted) 	dalvik-heap @0x7b511f 	
3 	dalvik-mark-stack (deleted) 	dalvik-mark-stack @0x30eaf40 	
4 	org.mozilla.fennec-2.apk 	org.mozilla.fennec-2.apk@0x124556b 	
5 	org.mozilla.fennec-2.apk 	org.mozilla.fennec-2.apk@0x33696a 	
6 	dalvik-heap (deleted) 	dalvik-heap @0xfad5f63 	
7 	org.mozilla.fennec-2.apk 	org.mozilla.fennec-2.apk@0x703633 	
8 	libxul.so 	XPCJSRuntime::Get 	js/xpconnect/src/xpcprivate.h:645
9 	libxul.so 	XPCCallContext::~XPCCallContext 	js/xpconnect/src/XPCCallContext.cpp:302
10 	libxul.so 	nsXPCWrappedJSClass::DelegatedQueryInterface 	js/xpconnect/src/XPCWrappedJSClass.cpp:745
11 		@0xffffffff 	
12 	libxul.so 	XPCCallContext::~XPCCallContext 	js/xpconnect/src/XPCCallContext.cpp:334
13 	libxul.so 	nsXPCWrappedJSClass::DelegatedQueryInterface 	js/xpconnect/src/XPCWrappedJSClass.cpp:745
14 		@0x5f4c5d3e 	
15 	libmozglue.so 	libmozglue.so@0x7041 	
16 	libmozglue.so 	libmozglue.so@0x8fbd 	
17 	libmozglue.so 	libmozglue.so@0x90fd 	
18 	libmozglue.so 	libmozglue.so@0x9c7b 	
19 	libmozglue.so 	libmozglue.so@0x20ffa 	
20 	libmozglue.so 	libmozglue.so@0x199d7 	
21 	libmozglue.so 	libmozglue.so@0x7041 	
22 	libmozglue.so 	libmozglue.so@0x8fbd 	
23 	libc.so 	libc.so@0x12132 	
24 	libc.so 	libc.so@0x436ae 	
25 	libmozglue.so 	libmozglue.so@0x19b29 	
26 	libmozglue.so 	libmozglue.so@0x19b41 	
27 	libxul.so 	std::__node_alloc::allocate 	_alloc.h:158
28 	libxul.so 	std::priv::_STLP_alloc_proxy<unsigned int, IPC::Message, std::allocator<IPC::Mes 	_alloc.h:307
29 		@0x3 	
30 	libxul.so 	mozilla::ipc::RPCChannel::DebugAbort 	ipc/glue/RPCChannel.cpp:656
31 	libxul.so 	mozilla::ipc::RPCChannel::~RPCChannel 	ipc/glue/RPCChannel.cpp:80
32 	libxul.so 	mozilla::layers::PCompositorParent::~PCompositorParent 	obj-firefox/ipc/ipdl/PCompositorParent.cpp:88
33 	libxul.so 	mozilla::layers::CompositorParent::~CompositorParent 	gfx/layers/ipc/CompositorParent.cpp:106
34 	libxul.so 	mozilla::layers::CompositorParent::~CompositorParent 	gfx/layers/ipc/CompositorParent.cpp:106
35 	libxul.so 	mozilla::layers::CompositorParent::Release 	CompositorParent.h:58
36 	libxul.so 	RunnableFunction<void , Tuple2<nsRefPtr<mozilla::layers::CompositorParent>, nsRe 	nsAutoPtr.h:874
37 	libxul.so 	RunnableFunction<void , Tuple2<nsRefPtr<mozilla::layers::CompositorParent>, nsRe 	ipc/chromium/src/base/task.h:411
38 	libxul.so 	MessageLoop::RunTask 	ipc/chromium/src/base/message_loop.cc:327
39 	libxul.so 	MessageLoop::DeferOrRunPendingTask 	ipc/chromium/src/base/message_loop.cc:334
40 	libxul.so 	MessageLoop::DoWork 	ipc/chromium/src/base/message_loop.cc:434
41 	libxul.so 	mozilla::ipc::MessagePump::Run 	ipc/glue/MessagePump.cpp:86
42 	libxul.so 	MessageLoop::RunInternal 	ipc/chromium/src/base/message_loop.cc:208
43 	libxul.so 	MessageLoop::Run 	ipc/chromium/src/base/message_loop.cc:201
44 	libxul.so 	nsBaseAppShell::Run 	widget/xpwidgets/nsBaseAppShell.cpp:163
45 	libxul.so 	nsAppStartup::Run 	toolkit/components/startup/nsAppStartup.cpp:257
46 	libxul.so 	XREMain::XRE_mainRun 	toolkit/xre/nsAppRunner.cpp:3787
47 	libxul.so 	XREMain::XRE_main 	toolkit/xre/nsAppRunner.cpp:3864
48 	libxul.so 	XRE_main 	toolkit/xre/nsAppRunner.cpp:3940
49 	libxul.so 	GeckoStart 	toolkit/xre/nsAndroidStartup.cpp:73
50 	libmozglue.so 	libmozglue.so@0x10aad 	
51 	dalvik-LinearAlloc (deleted) 	dalvik-LinearAlloc @0x29918e 	
52 	libdvm.so 	libdvm.so@0x1ec32 	
53 	dalvik-heap (deleted) 	dalvik-heap @0xcb3bae 	
54 	libdvm.so 	libdvm.so@0x58eed 	
55 	data@app@org.mozilla.fennec-2.apk@classes.dex 	data@app@org.mozilla.fennec-2.apk@classes.dex@0x134e61 	
56 	libmozglue.so 	libmozglue.so@0x10a5b 	
57 	data@app@org.mozilla.fennec-2.apk@classes.dex 	data@app@org.mozilla.fennec-2.apk@classes.dex@0x11d4b9 	
58 	libc.so 	libc.so@0x14a43 	
59 	libdvm.so 	libdvm.so@0x98e3d 	
60 	libc.so 	libc.so@0x158a7 	
61 	libmozglue.so 	libmozglue.so@0x10a5b 	
62 	data@app@org.mozilla.fennec-2.apk@classes.dex 	data@app@org.mozilla.fennec-2.apk@classes.dex@0x11d4b9 	
63 	libc.so 	libc.so@0x158a7 	
64 	libmozglue.so 	libmozglue.so@0x10a5b 	
65 	data@app@org.mozilla.fennec-2.apk@classes.dex 	data@app@org.mozilla.fennec-2.apk@classes.dex@0x11d4b9 	
66 	libc.so 	libc.so@0x15f09 	
67 	libdvm.so 	libdvm.so@0x5ae8d 	
68 	dalvik-LinearAlloc (deleted) 	dalvik-LinearAlloc @0x347e 	
69 	dalvik-heap (deleted) 	dalvik-heap @0xddde1e 	
70 	dalvik-heap (deleted) 	dalvik-heap @0xddde1e 	
71 	libdvm.so 	libdvm.so@0xa2532 	
72 	dalvik-heap (deleted) 	dalvik-heap @0xddde1e 	
73 	dalvik-LinearAlloc (deleted) 	dalvik-LinearAlloc @0x2991a2 	
74 	dalvik-LinearAlloc (deleted) 	dalvik-LinearAlloc @0x29918e 	
75 	dalvik-LinearAlloc (deleted) 	dalvik-LinearAlloc @0x29918e 	
76 	data@app@org.mozilla.fennec-2.apk@classes.dex 	data@app@org.mozilla.fennec-2.apk@classes.dex@0x148534 	
77 	libdvm.so 	libdvm.so@0x8e8ab 	
78 	libdvm.so 	libdvm.so@0x74897 	
79 	dalvik-heap (deleted) 	dalvik-heap @0xcb3bae 	
80 	dalvik-LinearAlloc (deleted) 	dalvik-LinearAlloc @0x29918e 	
81 	dalvik-LinearAlloc (deleted) 	dalvik-LinearAlloc @0x29918e 	
82 	libdvm.so 	libdvm.so@0x58d43 	
83 	dalvik-heap (deleted) 	dalvik-heap @0xcb3bae 	
84 	libdvm.so 	libdvm.so@0x5ac1d 	
85 	dalvik-LinearAlloc (deleted) 	dalvik-LinearAlloc @0x29918e 	
86 	data@app@org.mozilla.fennec-2.apk@classes.dex 	data@app@org.mozilla.fennec-2.apk@classes.dex@0x9ccb8 	
87 	dalvik-heap (deleted) 	dalvik-heap @0xcb3bae 	
88 	libdvm.so 	libdvm.so@0x1edbe 	
89 	libdvm.so 	libdvm.so@0x30a8e 	
90 	libdvm.so 	libdvm.so@0xb2f8e 	
91 	dalvik-LinearAlloc (deleted) 	dalvik-LinearAlloc @0x29b4b6 	
92 	libdvm.so 	libdvm.so@0x34272 	
93 	libdvm.so 	libdvm.so@0x5f987 	
94 	dalvik-LinearAlloc (deleted) 	dalvik-LinearAlloc @0x29b4b6 	
95 	dalvik-heap (deleted) 	dalvik-heap @0xc9b946 	
96 	data@app@org.mozilla.fennec-2.apk@classes.dex 	data@app@org.mozilla.fennec-2.apk@classes.dex@0x134a9c 	
97 	libdvm.so 	libdvm.so@0x6c8ff 	
98 	dalvik-LinearAlloc (deleted) 	dalvik-LinearAlloc @0x29b4b6 	
99 	dalvik-heap (deleted) 	dalvik-heap @0xc9b946 	
100 	libdvm.so 	libdvm.so@0xb7c56 	
101 	libdvm.so 	libdvm.so@0x5f987 	
102 	libdvm.so 	libdvm.so@0x6c923 	
103 	libdvm.so 	libdvm.so@0xb7c56 	
104 	libdvm.so 	libdvm.so@0x5f987 	
105 	libdvm.so 	libdvm.so@0xb2f8e 	
106 	libdvm.so 	libdvm.so@0x5fa37 	
107 	libdvm.so 	libdvm.so@0x5f987 	
108 	libc.so 	libc.so@0x12c4e 	
109 	libc.so 	libc.so@0x127a2
(In reply to adrian tamas from comment #0)
> https://crash-stats.mozilla.com/report/index/bp-6337ba57-4ad8-450b-9737-bd7772120717
This one looks like bug 763805/bug 763813.
Keywords: regression
Summary: crash in mozalloc_abort on quitting Nightly → crash with abort message: "mismatched CxxStackFrame ctor/dtors: file /builds/slave/m-cen-andrd-ntly/build/ipc/glue/RPCChannel.cpp, line 656" on quitting Nightly
Whiteboard: [native-crash]
Based on crash stats, it first appeared in 16.0a1/20120715. The regression range is:
http://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=0602e44ac248&tochange=57abb5f70e01
It might be a regression from bug 745148.
Depends on: 764756
Chris looks like a regression from your bug, can you take a look?
Assignee: nobody → jones.chris.g
tracking-fennec: ? → 16+
Crash Signature: [@ mozalloc_abort | XPCJSRuntime::Get] [@ mozalloc_abort | nsAString_internal::Assign ] [@ mozalloc_abort | nsSVGUtils::GetFontXHeight ] → [@ mozalloc_abort | XPCJSRuntime::Get] [@ mozalloc_abort | nsAString_internal::Assign] [@ mozalloc_abort | nsSVGUtils::GetFontXHeight] [@ mozalloc_abort | NS_IsMainThread_P] [@ mozalloc_abort | libwebcore.so@0x3acf63] [@ mozalloc_abort | dalvik-bitm…
Crash Signature: dalvik-bitmap-2 (deleted)@0x14d11f] [@ mozalloc_abort] → dalvik-bitmap-2 (deleted)@0x14d11f] [@ mozalloc_abort] [@ libmozalloc.so@0x1704 ]
The mozalloc_abort | XPCJSRuntime::Get signature is #7 top crasher in 16.0a2 and #5 in 17.0a1. This bug ranks probably higher with its other signatures that are shared with other bugs.
Keywords: topcrash
Good build: 2012-07-14
Bad build: 2012-07-15

Possible regression range:
http://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=0602e44ac248&tochange=57abb5f70e01
That range has a lot of changesets in it, please use tinderbox-builds.
The presence of bug 767775 and bug 745148 in that list looks suspicious. Any thoughts, Chris?
Crash Signature: dalvik-bitmap-2 (deleted)@0x14d11f] [@ mozalloc_abort] [@ libmozalloc.so@0x1704 ] → dalvik-bitmap-2 (deleted)@0x14d11f] [@ mozalloc_abort | nsThreadManager::GetIsMainThread] [@ mozalloc_abort | arena_avail_tree_insert] [@ mozalloc_abort | js::ObjectImpl::markChildren] [@ mozalloc_abort] [@ libmozalloc.so@0x1704 ]
Crashes on abort account for 4.4% of all crashes in 15.0b3 and 17.7% in 16.0a2.
Guessing this bug is the only crash regression in 16.0, it accounts for around 13% of all crashes in 16.0a2, making it #1 top crasher in 16.0a2.
I've hit this crash a couple of times on the Galaxy Nexus, Android 4.1 while quitting while I was on http://buienradar.nl/ .

Adrian, could you try and narrow the regression range further?
Here is the regression range of comment 12:
http://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=011147826130&tochange=9046ecf4db8f
Neither bug 767775, nor bug 745148 belong to it.
I think it's a regression from bug 763234.
In inbound tinderbox builds:

The good build: 2012-07-13 73e6fad01985
The bad build: 2012-07-13 f9723f26c8a9

Pushlog: http://hg.mozilla.org/integration/mozilla-inbound/pushloghtml?fromchange=73e6fad01985&tochange=f9723f26c8a9
Blocks: 763234
Bug 763234 expects that no one uses the CompositorParents or whatever is living on the compositor thread after gfxPlatform has been destroyed. Could it be related to this?
Crash Signature: dalvik-bitmap-2 (deleted)@0x14d11f] [@ mozalloc_abort | nsThreadManager::GetIsMainThread] [@ mozalloc_abort | arena_avail_tree_insert] [@ mozalloc_abort | js::ObjectImpl::markChildren] [@ mozalloc_abort] [@ libmozalloc.so@0x1704 ] → dalvik-bitmap-2 (deleted)@0x14d11f] [@ mozalloc_abort | dalvik-bitmap-2 (deleted)@0x12011f] [@ mozalloc_abort | 0 (deleted)@0x11f911f] [@ mozalloc_abort | nsThreadManager::GetIsMainThread] [@ mozalloc_abort | arena_avail_tree_insert] [@ mozalloc_abor…
In 16.0a2/20120819, it accounts for 46% of all crashes.
(In reply to Scoobidiver from comment #20)
> In 16.0a2/20120819, it accounts for 46% of all crashes.

I'll ping cjones now and see if anybody else may have more time to look at this.
I am looking at it.

I had troubles debugging this until recently because I couldn't get symbols to show up in gdb (see bug 781668).
If it is very important that we get a temporary hack to remove the crash ASAP, one (very bad) possibility is to leak CompositorParent. CompositorParent is created a startup and destroyed when exiting firefox. The crash happens in its destructor.
Has anyone been able to create reliable STR for this bug?

(In reply to Nicolas Silva [:nical] from comment #22)
> If it is very important that we get a temporary hack to remove the crash
> ASAP, one (very bad) possibility is to leak CompositorParent.
> CompositorParent is created a startup and destroyed when exiting firefox.
> The crash happens in its destructor.

Shutdown crashes are also extremely frustrating because that's exactly what we want to do at shutdown.  Another workaround is determine after what point in shutdown this crash is occurring, and then _exit(0) instead of aborting if we hit this assertion at that point.

STR definitely the way to start here.
(In reply to Chris Jones [:cjones] [:warhammer] from comment #23)
> STR definitely the way to start here.

Adding needURLs and qawanted.
Keywords: needURLs, qawanted
Sorry, I cannot draw URLs for all abort crashes, and this looks like it's becoming just a collection of all abort crashes, which probably can have any number of different causes. I doubt there's any use in pulling URLs for all of those.

I also don't know how to pull URLs specifically for one abort message, which may be what could be reasonable here.
Crash Signature: mozalloc_abort | js::ObjectImpl::markChildren] [@ mozalloc_abort | js_NewStringCopyZ] [@ mozalloc_abort | js_NewString] [@ mozalloc_abort | js::TraceChildren] [@ mozalloc_abort] [@ libmozalloc.so@0x1704 ] → mozalloc_abort | js::ObjectImpl::markChildren] [@ mozalloc_abort | js_NewStringCopyZ] [@ mozalloc_abort | js_NewString] [@ mozalloc_abort | js::TraceChildren] [@ mozalloc_abort | __wrap_malloc] [@ mozalloc_abort] [@ libmozalloc.so@0x1704 ]
The steps from Comment 0 can still be used to reproduce the issue using Nightly 2012-08-27 on Galaxy Note running Android 4.0.4.

Steps to reproduce:
1) Load planet.mozilla.org.
2) Wait for the page to load.
3) Quit Firefox

Reproducibility rate is about 40-50% of the cases. From ~10 tries I got 4 crashes.
Keywords: qawanted
FWIW, I don't manage to reproduce it with debug builds while I can reproduce it on optimized builds (on a galaxy nexus and a nexus 7).
I can reproduce with the suggested steps in comment #28 pretty easily (08/28, GN:4.1.1)

I/GeckoApp( 2680): pause
I/GeckoApp( 2680): application paused
I/Gecko   ( 2680): ###!!! ABORT: mismatched CxxStackFrame ctor/dtors: file ../../../ipc/glue/RPCChannel.cpp, line 656
Keywords: reproducible
I see that we are sometimes deleting the compositor thread before we delete CompositorParent. When this happens the nsWindow are already deleted, so it is likely that something dispatches a NewRunnableFunction or NewRunnableMethod that takes the CompositorParent in parameter (this uses a ref ptr internally) which keeps it alive for too long.

I'm almost there
Target Milestone: --- → Firefox 17
Crash Signature: mozalloc_abort | js::ObjectImpl::markChildren] [@ mozalloc_abort | js_NewStringCopyZ] [@ mozalloc_abort | js_NewString] [@ mozalloc_abort | js::TraceChildren] [@ mozalloc_abort | __wrap_malloc] [@ mozalloc_abort] [@ libmozalloc.so@0x1704 ] → mozalloc_abort | js::ObjectImpl::markChildren] [@ mozalloc_abort | js_NewStringCopyZ] [@ mozalloc_abort | js_NewString] [@ mozalloc_abort | js::TraceChildren] [@ mozalloc_abort | __wrap_malloc] [@ mozalloc_abort] [@ libmozalloc.so@0x1704 ] [@ mozal…
Version: Firefox 16 → Firefox 18
Crash Signature: mozalloc_abort | ProcessGeneric] [@ mozalloc_abort | mozilla::dom::CSS2PropertiesBinding::set_borderImageRepeat] [@ mozalloc_abort | exn_toSource] [@ mozalloc_abort | browser.db (deleted)@0x156451] [@ mozalloc_abort | browser.db (deleted)@0x356e6c] [… → mozalloc_abort | ProcessGeneric ] [@ mozalloc_abort | mozilla::dom::CSS2PropertiesBinding::set_borderImageRepeat ] [@ mozalloc_abort | exn_toSource ] [@ mozalloc_abort | browser.db (deleted)@0x156451 ] [@ mozalloc_abort | browser.db (deleted)@0x356e6c…
Christian, it's a regression from 16.0, not from 18.0.
Version: Firefox 18 → Firefox 16
It seems that the problem is coming from the CompositorParent being kept alive longer than expected because of a dispatched Task, as I said earlier.

This patch fixes the crash on my nexus 7. I just pushed it to try.
What it does is it manages a reference count for the thread, so that when CompositorParent::DestroyThread is called it only decrements the ref count, and only when the ref count reaches zero we delete the thread. It turned out that managing a static int counter by hand required less code than creating a reference counted subclass of base::Thread so I did it by hand.
Assignee: jones.chris.g → nsilva
Attachment #656574 - Flags: review?(jones.chris.g)
Crash Signature: (deleted)@0x356e6c ] [@ mozalloc_abort | JSCompartment::sweepBreakpoints ] → (deleted)@0x356e6c ] [@ mozalloc_abort | JSCompartment::sweepBreakpoints ] [@ mozalloc_abort | libfreebl3.so (deleted)@0x20341 | XPCJSRuntime::Get]
oh well, ran into the cxx stackframe abort again with the patch applied, so I probably just made it less likely to happen :(
Chris, do you know if it is important that CompositorParent's destructor gets called before the compositor thread is destroyed? If so my patch still has some value...
I have (yet another) theory:

The abort that causes the crash is a piece of ipdl code that checks that a stack mCxxStackFrame is empty.
this stack is pusehd and poped only through RAII with CxxStackFrame (see RPCChannel.h) when sending and receiving messages, like this

RPCChannel::Send(Message* msg)
{
    Message copy = *msg;
    CxxStackFrame f(*this, OUT_MESSAGE, &copy); // push
    return AsyncChannel::Send(msg);
} // (RAII) pop

with a sync message we expect to see the following sequence (not always true actually):

Sender.RPCChannel.mCxxStackFrame.push(OUT_MESSAGE)
Receiver.RPCChannel.mCxxStackFrame.push(IN_MESSAGE)
// the message is handled on the receiver's thread
Receiver.RPCChannel.mCxxStackFrame.pop(IN_MESSAGE)
Sender.RPCChannel.mCxxStackFrame.pop(OUT_MESSAGE)

However, if the stack is poped after sending back to the sender the signal that the message has beene handled, it is actually possible that the sender thread start running again before the receiver thread reaches the end of the scope and pops its stack.

In our case the crash happens always in the destructor of CompositorParent very shortly after PCompositorParent::SendStop, because we dispatch a task on the main thread that holds a refPtr to the compositor parent, that calls SendStop and then returns. When the task returns CompositorParent is deleted because the task's RefPtr is destroyed.
so ~CompositorParent happens very shortly after SendStop.

if i printf_stderr something in ~CxxStackFrame just before poping the stack this makes it way easier to reproduce.

Chris, does it make sense? You know ipdl way better than me.
Is the CompositorParent setting itself to be destroyed during ActorDestroy?  What's the expected destruction sequence.
Comment on attachment 656574 [details] [diff] [review]
Fixes fennec exit crash by reference counting the compositor thread.

Clearing review request since doesn't seem the entire fix.
Attachment #656574 - Flags: review?(jones.chris.g)
Still crashing on the latest Nightly: https://crash-stats.mozilla.com/report/index/bp-a2f48d7a-8ba0-4eb2-8257-b07ed2120831
Crash Signature: (deleted)@0x356e6c ] [@ mozalloc_abort | JSCompartment::sweepBreakpoints ] [@ mozalloc_abort | libfreebl3.so (deleted)@0x20341 | XPCJSRuntime::Get] → (deleted)@0x356e6c ] [@ mozalloc_abort | JSCompartment::sweepBreakpoints ] [@ mozalloc_abort | libfreebl3.so (deleted)@0x20341 | XPCJSRuntime::Get] [@ mozalloc_abort | UnmarkGrayChildren ]
When the widget is destroyed (in its dtor), CompositorChild first sends a WillStop message, telling the parent side that it has to get rid of its resources now. After SendWillStop, a task is dispatched on the main thread to defer the second phase of the destruction (it's the same 2-steps destruction sequence principle as ImageBridge's). This task is a RunnableFunction that executes DeferredDestroyCompositor (nsBaseWdget.cpp) which calls compositorChild->Destroy() and immediately after, loses the last reference to the CompositorParent and the CompositorChild which deletes them. Destroy() calles SendStop() which synchronously finishes the destruction sequence.

in short:

nsBaseWidget::dtor
| CompositorChild::SendWillStop [sync]
| | CompositorParent::RecvWillStop
| Dispatch DeferredDestroyCompositor
...
(on fennec it appears that we run into gfxPlatform::ShutDown here, in which case:)
gfxPlatform::ShutDown
| delete compositorThread
...
DeferredDestroyCompositor
| CompositorChild::SendStop [sync]
| | compsitorParent::RecvStop
| CompositorChild::dtor
| CompositorParent::dtor

I think we have at least two bugs here, the first being that the compositor thread can be destroyed way too early, so my first patch fixes this.
The second bug is that since CompositorParent::dtor (and therefore RPCChannel's dtor) is called immediately after SendStop, we have a race between the CompositorParent's destructor and the RPCChannel's poping its mCxxStackFrames on the other side.
Crash Signature: (deleted)@0x356e6c ] [@ mozalloc_abort | JSCompartment::sweepBreakpoints ] [@ mozalloc_abort | libfreebl3.so (deleted)@0x20341 | XPCJSRuntime::Get] [@ mozalloc_abort | UnmarkGrayChildren ] → (deleted)@0x356e6c ] [@ mozalloc_abort | JSCompartment::sweepBreakpoints ] [@ mozalloc_abort | libfreebl3.so (deleted)@0x20341 | XPCJSRuntime::Get] [@ mozalloc_abort | UnmarkGrayChildren ] [@ mozalloc_abort | __swrite | libxul.so@0x10d3d37] [@ mozall…
The parent and child worker threads each get their own set of stack frames.  If we're sharing them somehow, that would be a fantastically bad bug! :/
They're not sharing their stack frames, but they both get deleted in the main thread at the end of DeferredDestroyCompositor. In our case we are having a problem with CompositorParent being destroyed on the main thread while its CxxStackFrames has not yet be poped on the compositor thread after a sync message from the main thread to the compositor thread.
Oh, so we're trying to delete the CompositorParent on the main thread while it's still active on the compositor thread?  That's also very bad! :)
Linux is affected and the stack trace is not buggy. See https://crash-stats.mozilla.com/report/list?signature=mozalloc_abort+|+NS_DebugBreak_P+|+mozilla%3A%3Aipc%3A%3ARPCChannel%3A%3ADebugAbort
Component: General → Graphics: Layers
OS: Android → All
Product: Firefox for Android → Core
Hardware: ARM → All
Target Milestone: Firefox 17 → ---
Version: Firefox 16 → 16 Branch
I have a patch in whice we dispatch a task on the Compositor thread from within CompositorParent::RecvStop in order to make sure that a nsRefPtr of the compositor parent stays there and keeps the compositor parent alive until the end of the ipdl code around RecvStop.

So this solves the problem of deleting the CompositorParent while it is still active on the compositor thread.

With this patch we then have the compositor thread's ref count that reaches zero within its own thread, so we end up having to dispatch a task on the main thread so that later deletes the compositor thread.

The destruction sequence is becoming quite messy:
When a widget is destroyed it triggers most of the compositor parent's clean up, and dispatches a task on the main thread to terminate the cleanup.
This is not extremely nice because we then have part of the compositor stuff being destroyed before the destruction of gfxPlatform, and the rest being destroyed after.
When the second half of the destruction happens, we need to have the compositor thread dispatch a task to himself to avoid deleting the compositor parent too soon. When we reach this task on the compositor thread we then need to dispatch a task from the compositor thread to the main thread so that the main thread deletes the compositor thread.

The way we end up dispatching tasks all over the place makes it hard for us too have guarantees that we destroy the modules in the right order. For instance, Chris, do we have a constraint on how soon we must have deleted the last object that uses ipdl?
(In reply to Nicolas Silva [:nical] from comment #45)
> When we reach this task on the compositor thread
> we then need to dispatch a task from the compositor thread to the main
> thread so that the main thread deletes the compositor thread.

I'm concerned with dispatching a task from the compositor thread to the main thread after we've destroyed the channel and at which points the main thread is tearing down the browser. We have no guarantee that the main thread will always be in a proper state to respond to this event and wont be far along it's shutdown sequence.

I think the right thing to do here is to destroy the associated compositor parent before the widget. To do this we need to find a way to make sure all the pending messages are processed.
This patch seems to fix it for me on fennec debug and optimized.

I agree with BenWa that fixing the problem at this level is probably not the best solution, yet we have to do something about this crash in the short term and fixing the entire destruction sequence may be very tricky.
Attachment #656574 - Attachment is obsolete: true
Attachment #658588 - Flags: review?(jones.chris.g)
Crash Signature: libxul.so@0x1148aa3 | NS_IsMainThread_P] [@ mozalloc_abort | org.mozilla.fennec_aurora-2.apk@0xf24f40] [@ mozalloc_abort | __swrite | libxul.so@0x1148aa3 | 0 (deleted)@0x11f911f] → libxul.so@0x1148aa3 | NS_IsMainThread_P] [@ mozalloc_abort | org.mozilla.fennec_aurora-2.apk@0xf24f40] [@ mozalloc_abort | __swrite | libxul.so@0x1148aa3 | 0 (deleted)@0x11f911f] [@ mozalloc_abort | __swrite | libxul.so@0x1148b03] [@ mozalloc_abort | …
Crash Signature: dalvik-bitmap-2 (deleted)@0x111f63] [@ mozalloc_abort | org.mozilla.firefox_beta-1.apk@0x6d2038] [@ mozalloc_abort | NS_DebugBreak_P | mozilla::ipc::RPCChannel::DebugAbort] → dalvik-bitmap-2 (deleted)@0x111f63] [@ mozalloc_abort | org.mozilla.firefox_beta-1.apk@0x6d2038] [@ mozalloc_abort | __swrite | libxul.so@0x10d3cf7] [@ mozalloc_abort | org.mozilla.firefox_beta-2.apk@0x6d2038] [@ mozalloc_abort | org.mozilla.firefox_b…
Comment on attachment 658588 [details] [diff] [review]
Fixes fennec exit crash by reference counting the compositor thread and making sure a CompositorParent is not deleted while running code on the compositor thread.

>diff --git a/gfx/layers/ipc/CompositorParent.cpp b/gfx/layers/ipc/CompositorParent.cpp

>+static int sCompositorThreadRefCount = 0;

Document the semantics of this.

>+static void KeepCompositorParentAlive(CompositorParent* aNowReadyToDie)

Call this DeferredDelete.

>+{
>+  aNowReadyToDie->Release();
>+}
>+
>+static void DeleteCompositorThread()
>+{
>+  if (NS_IsMainThread()){
>+    if (sCompositorThread) {
>+      delete sCompositorThread;  

No need to null check, |delete| does the right thing.

> void CompositorParent::DestroyThread()
> {

>+  --sCompositorThreadRefCount;
>+  if(sCompositorThreadRefCount == 0) {

Nit: |if (0 == --sRefCount)|

> CompositorParent::~CompositorParent()
> {

>+  --sCompositorThreadRefCount;
>+  if (sCompositorThreadRefCount == 0) {
>+    DeleteCompositorThread();
>+  }

Don't duplicate this code.  Refactor it into a helper function.
Attachment #658588 - Flags: review?(jones.chris.g)
Comment on attachment 660164 [details] [diff] [review]
Fixes fennec exit crash by reference counting the compositor thread and making sure a CompositorParent is not deleted while running code on the compositor thread.

No need to re-request review for small changes like these.
Attachment #660164 - Flags: review?(jones.chris.g) → review+
https://hg.mozilla.org/mozilla-central/rev/8361bd47b3fe
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla18
Please nominate for Aurora/Beta asap (when you're comfortable with the bake time on Nightly).
I cannot reproduce any @mozalloc_abort crashes on the latest Nightly build. 

(In reply to Alex Keybl [:akeybl] from comment #54)
> Please nominate for Aurora/Beta asap (when you're comfortable with the bake
> time on Nightly).

This crashes are annoying because occur almost for each Firefox quit. Due to the high number of crashes both on m-a and m-b, I nominate this to Aurora and Beta. We cannot deliver an app which crashes much more frequently than it does any other released Betas so far.
Attachment #660164 - Flags: approval-mozilla-beta?
Attachment #660164 - Flags: approval-mozilla-aurora?
(In reply to Cristian Nicolae (:xti) from comment #55)
> Due to the high number of crashes both on m-a and m-b, I nominate this to Aurora
> and Beta. 
That is not exactly what Alex asked. It's instead: does this patch fix the issue AND not add regressions that are worse than the benefit?
You answered yes to the first part, what about the second part?
Status: RESOLVED → VERIFIED
(In reply to Scoobidiver from comment #56)
what about the second part?

So far the app is working fine. I tried different scenarios to reproduce this bug, but no crash occurred. Atm there isn't any noticeable regression, so I guess that this patch is great benefit.
I haven't had any problem since this patch so I feel comfortable having it landed on aurora and beta. What would worry me most is having another problem in the destruction sequence that we have not seen yet (the tear-down sequence of omtc is a mess), but it would have to be fixed on top of this anyway.
Comment on attachment 660164 [details] [diff] [review]
Fixes fennec exit crash by reference counting the compositor thread and making sure a CompositorParent is not deleted while running code on the compositor thread.

[Triage Comment]
Fix for a new top crash in FF16, which we believe is low risk based upon current testing. Approving for branches.
Attachment #660164 - Flags: approval-mozilla-beta?
Attachment #660164 - Flags: approval-mozilla-beta+
Attachment #660164 - Flags: approval-mozilla-aurora?
Attachment #660164 - Flags: approval-mozilla-aurora+
Crash Signature: org.mozilla.firefox_beta-1.apk@0x3d2038] [@ mozalloc_abort | NS_DebugBreak_P | mozilla::ipc::RPCChannel::DebugAbort] → org.mozilla.firefox_beta-1.apk@0x3d2038] [@ mozalloc_abort | NS_DebugBreak_P | mozilla::ipc::RPCChannel::DebugAbort] [@ mozalloc_abort | __swrite | libxul.so@0x10d56b7] [@ mozalloc_abort | __swrite | libxul.so@0x10d56b7 | 0 (deleted)@0x11f911f] [@ moz…
Comment on attachment 660164 [details] [diff] [review]
Fixes fennec exit crash by reference counting the compositor thread and making sure a CompositorParent is not deleted while running code on the compositor thread.

Review of attachment 660164 [details] [diff] [review]:
-----------------------------------------------------------------

::: gfx/layers/ipc/CompositorParent.cpp
@@ +84,5 @@
> +
> +static void DeleteCompositorThread()
> +{
> +  if (NS_IsMainThread()){
> +    delete sCompositorThread;  

Trailing whitespace

@@ +231,5 @@
> +  // after this function returns while some ipdl code still needs to run on 
> +  // this thread.
> +  // We must keep the compositor parent alive untill the code handling message 
> +  // reception is finished on this thread.
> +  this->AddRef(); // Corresponds to DeferredDeleteCompositorParent's Release

NS_ADDREF_THIS() is more common.

@@ +232,5 @@
> +  // this thread.
> +  // We must keep the compositor parent alive untill the code handling message 
> +  // reception is finished on this thread.
> +  this->AddRef(); // Corresponds to DeferredDeleteCompositorParent's Release
> +  CompositorLoop()->PostTask(FROM_HERE, 

More trailing whitespace

@@ +233,5 @@
> +  // We must keep the compositor parent alive untill the code handling message 
> +  // reception is finished on this thread.
> +  this->AddRef(); // Corresponds to DeferredDeleteCompositorParent's Release
> +  CompositorLoop()->PostTask(FROM_HERE, 
> +                           NewRunnableFunction(&DeferredDeleteCompositorParent,

Weird indentation
This crash doesn't occur anymore on the latest Aurora and Beta builds.

--
Device: Galaxy Note
OS: Android 4.0.4
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: