Last Comment Bug 753665 - shutdown crash in nsWindow::FindTopLevel
: shutdown crash in nsWindow::FindTopLevel
Status: VERIFIED FIXED
[native-crash]
: crash, reproducible, topcrash
Product: Core
Classification: Components
Component: Widget: Android (show other bugs)
: Trunk
: ARM Android
: -- critical (vote)
: mozilla16
Assigned To: Kartikaya Gupta (email:kats@mozilla.com)
:
Mentors:
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2012-05-10 00:08 PDT by Scoobidiver (away)
Modified: 2012-07-12 06:33 PDT (History)
11 users (show)
ryanvm: in‑testsuite-
See Also:
Crash Signature:
QA Whiteboard:
Iteration: ---
Points: ---
Has Regression Range: ---
Has STR: ---
verified
verified
verified
.N+


Attachments
(1/4) Fix bug in error-handling code (892 bytes, patch)
2012-05-22 14:13 PDT, Kartikaya Gupta (email:kats@mozilla.com)
blassey.bugs: review-
Details | Diff | Splinter Review
(2/4) Add some missing kungFuDeathGrips (1.47 KB, patch)
2012-05-25 13:12 PDT, Kartikaya Gupta (email:kats@mozilla.com)
blassey.bugs: review+
Details | Diff | Splinter Review
(3/4) Heap-allocate the nsTouchEvent objects (5.45 KB, patch)
2012-05-25 13:15 PDT, Kartikaya Gupta (email:kats@mozilla.com)
bugmail: review-
Details | Diff | Splinter Review
(4/4) Cleanup (1.50 KB, patch)
2012-05-25 13:16 PDT, Kartikaya Gupta (email:kats@mozilla.com)
blassey.bugs: review+
Details | Diff | Splinter Review
(1/4) Fix bug in error-handling code (v2) (1.06 KB, patch)
2012-05-25 14:09 PDT, Kartikaya Gupta (email:kats@mozilla.com)
blassey.bugs: review+
Details | Diff | Splinter Review
Actual fix (904 bytes, patch)
2012-06-18 14:08 PDT, Kartikaya Gupta (email:kats@mozilla.com)
blassey.bugs: review+
akeybl: approval‑mozilla‑aurora+
bugzilla: approval‑mozilla‑beta+
Details | Diff | Splinter Review

Description Scoobidiver (away) 2012-05-10 00:08:08 PDT
It happens from time to time.

Signature 	nsWindow::FindTopLevel More Reports Search
UUID	8d979a38-e2ff-4325-bd07-23bbf2120510
Date Processed	2012-05-10 05:50:49
Uptime	143
Install Age	3.0 minutes since version was first installed.
Install Time	2012-05-10 05:47:36
Product	FennecAndroid
Version	15.0a1
Build ID	20120509030514
Release Channel	nightly
OS	Linux
OS Version	0.0.0 Linux 2.6.35.10-g3f43272 #1 PREEMPT Tue Jun 14 21:11:23 CST 2011 armv7l
Build Architecture	arm
Build Architecture Info	
Crash Reason	SIGSEGV
Crash Address	0x544c
App Notes 	
AdapterVendorID: bravo, AdapterDeviceID: HTC Desire.
AdapterDescription: 'Model: 'HTC Desire', Product: 'htc_bravo', Manufacturer: 'HTC', Hardware: 'bravo''.
HTC HTC Desire
htc_wwe/htc_bravo/bravo:2.3.3/GRI40/96875.1:user/release-keys
EMCheckCompatibility	True

Frame 	Module 	Signature 	Source
0 	libxul.so 	nsWindow::FindTopLevel 	widget/android/nsWindow.cpp:217
1 	libxul.so 	nsWindow::~nsWindow 	widget/android/nsWindow.cpp:203
2 	libxul.so 	nsWindow::~nsWindow 	widget/android/nsWindow.cpp:210
3 	libxul.so 	nsBaseWidget::Release 	widget/xpwidgets/nsBaseWidget.cpp:82
4 	libxul.so 	nsWindow::Release 	widget/android/nsWindow.cpp:89
5 	libxul.so 	nsCOMPtr_base::~nsCOMPtr_base 	obj-firefox/xpcom/build/nsCOMPtr.cpp:81
6 	libxul.so 	nsTouchEvent::~nsTouchEvent 	nsCOMPtr.h:480
7 	libxul.so 	nsDOMTouchEvent::~nsDOMTouchEvent 	content/events/src/nsDOMTouchEvent.cpp:243
8 	libxul.so 	nsDOMTouchEvent::~nsDOMTouchEvent 	content/events/src/nsDOMTouchEvent.cpp:246
9 	libxul.so 	nsDOMEvent::Release 	content/events/src/nsDOMEvent.cpp:217
10 	libxul.so 	nsDOMUIEvent::Release 	content/events/src/nsDOMUIEvent.cpp:123
11 	libxul.so 	nsDOMTouchEvent::Release 	content/events/src/nsDOMTouchEvent.cpp:270
12 	libxul.so 	XPCJSRuntime::GCCallback 	js/xpconnect/src/XPCJSRuntime.cpp:623
13 	libxul.so 	js::GC 	js/src/jsgc.cpp:3697
14 	libxul.so 	JS_GC 	js/src/jsapi.cpp:2865
15 	libxul.so 	nsXREDirProvider::DoShutdown 	toolkit/xre/nsXREDirProvider.cpp:850
16 	libxul.so 	ScopedXPCOMStartup::~ScopedXPCOMStartup 	toolkit/xre/nsAppRunner.cpp:1125
17 	libxul.so 	XREMain::XRE_main 	toolkit/xre/nsAppRunner.cpp:3879
18 	libxul.so 	XRE_main 	toolkit/xre/nsAppRunner.cpp:3933
...

More reports at:
https://crash-stats.mozilla.com/report/list?signature=nsWindow%3A%3AFindTopLevel
Comment 1 Kartikaya Gupta (email:kats@mozilla.com) 2012-05-10 06:39:31 PDT
Oh, excellent. sewardj found this using valgrind yesterday too, he said he got the following from valgrind.

==1798== Invalid read of size 4
==1798==    at 0x36618BD6: nsWindow::~nsWindow() (nsWindow.cpp:204)
==1798==    by 0x36618C83: nsWindow::~nsWindow() (nsWindow.cpp:210)
==1798==    by 0x3661CDAF: nsBaseWidget::Release() (nsBaseWidget.cpp:82)
==1798==    by 0x36616F59: nsWindow::Release() (nsWindow.cpp:89)
==1798==    by 0x3673DB4D: nsCOMPtr_base::~nsCOMPtr_base() (nsCOMPtr.cpp:81)
==1798==    by 0x3615196D: nsDOMTouchEvent::~nsDOMTouchEvent() (nsCOMPtr.h:480)
==1798==    by 0x361519FF: nsDOMTouchEvent::~nsDOMTouchEvent() (nsDOMTouchEvent.cpp:246)
==1798==    by 0x36137D5D: nsDOMEvent::Release() (nsDOMEvent.cpp:217)
==1798==    by 0x3613A501: nsDOMUIEvent::Release() (nsDOMUIEvent.cpp:123)
==1798==    by 0x361513E5: nsDOMTouchEvent::Release() (nsDOMTouchEvent.cpp:270)
==1798==    by 0x364547EF: XPCJSRuntime::GCCallback(JSRuntime*, JSGCStatus) (XPCJSRuntime.cpp:623)
==1798==    by 0x36A28BC7: Collect(JSRuntime*, bool, long long, js::JSGCInvocationKind, js::gcreason::Reason) (jsgc.cpp:3697)
==1798==  Address 0x21249a34 is 140 bytes inside a block of size 384 free'd
==1798==    at 0x4804F44: free (vg_replace_malloc.c:427)
==1798==    by 0x2275F699: moz_free (mozalloc.cpp:81)
==1798==    by 0x36618C89: nsWindow::~nsWindow() (mozalloc.h:253)
==1798==    by 0x3661CDAF: nsBaseWidget::Release() (nsBaseWidget.cpp:82)
==1798==    by 0x36616F59: nsWindow::Release() (nsWindow.cpp:89)
==1798==    by 0x3673DB4D: nsCOMPtr_base::~nsCOMPtr_base() (nsCOMPtr.cpp:81)
==1798==    by 0x35F13EFD: nsDeviceContext::~nsDeviceContext() (nsCOMPtr.h:480)
==1798==    by 0x35F58565: nsPresContext::~nsPresContext() (nsDeviceContext.h:60)
==1798==    by 0x35F586B7: nsPresContext::~nsPresContext() (nsPresContext.cpp:335)
==1798==    by 0x35F554B9: nsPresContext::Release() (nsPresContext.cpp:345)
==1798==    by 0x36137E5D: nsDOMEvent::~nsDOMEvent() (nsAutoPtr.h:908)
==1798==    by 0x36140E01: nsDOMPageTransitionEvent::~nsDOMPageTransitionEvent() (nsDOMPageTransitionEvent.h:47)


==1798== Invalid write of size 4
==1798==    at 0x36618BE0: nsWindow::~nsWindow() (nsWindow.cpp:205)
==1798==    by 0x36618C83: nsWindow::~nsWindow() (nsWindow.cpp:210)
==1798==    by 0x3661CDAF: nsBaseWidget::Release() (nsBaseWidget.cpp:82)
==1798==    by 0x36616F59: nsWindow::Release() (nsWindow.cpp:89)
==1798==    by 0x3673DB4D: nsCOMPtr_base::~nsCOMPtr_base() (nsCOMPtr.cpp:81)
==1798==    by 0x3615196D: nsDOMTouchEvent::~nsDOMTouchEvent() (nsCOMPtr.h:480)
==1798==    by 0x361519FF: nsDOMTouchEvent::~nsDOMTouchEvent() (nsDOMTouchEvent.cpp:246)
==1798==    by 0x36137D5D: nsDOMEvent::Release() (nsDOMEvent.cpp:217)
==1798==    by 0x3613A501: nsDOMUIEvent::Release() (nsDOMUIEvent.cpp:123)
==1798==    by 0x361513E5: nsDOMTouchEvent::Release() (nsDOMTouchEvent.cpp:270)
==1798==    by 0x364547EF: XPCJSRuntime::GCCallback(JSRuntime*, JSGCStatus) (XPCJSRuntime.cpp:623)
==1798==    by 0x36A28BC7: Collect(JSRuntime*, bool, long long, js::JSGCInvocationKind, js::gcreason::Reason) (jsgc.cpp:3697)
==1798==  Address 0x21249a34 is 140 bytes inside a block of size 384 free'd
==1798==    at 0x4804F44: free (vg_replace_malloc.c:427)
==1798==    by 0x2275F699: moz_free (mozalloc.cpp:81)
==1798==    by 0x36618C89: nsWindow::~nsWindow() (mozalloc.h:253)
==1798==    by 0x3661CDAF: nsBaseWidget::Release() (nsBaseWidget.cpp:82)
==1798==    by 0x36616F59: nsWindow::Release() (nsWindow.cpp:89)
==1798==    by 0x3673DB4D: nsCOMPtr_base::~nsCOMPtr_base() (nsCOMPtr.cpp:81)
==1798==    by 0x35F13EFD: nsDeviceContext::~nsDeviceContext() (nsCOMPtr.h:480)
==1798==    by 0x35F58565: nsPresContext::~nsPresContext() (nsDeviceContext.h:60)
==1798==    by 0x35F586B7: nsPresContext::~nsPresContext() (nsPresContext.cpp:335)
==1798==    by 0x35F554B9: nsPresContext::Release() (nsPresContext.cpp:345)
==1798==    by 0x36137E5D: nsDOMEvent::~nsDOMEvent() (nsAutoPtr.h:908)
==1798==    by 0x36140E01: nsDOMPageTransitionEvent::~nsDOMPageTransitionEvent() (nsDOMPageTransitionEvent.h:47)

It seems to me that the mParent pointer on nsWindow could become invalid because it's just a nsWindow* and not some fancy refptr. If that becomes invalid then it would cause this behaviour (both the crash and the invalid memory read/write).
Comment 2 Kartikaya Gupta (email:kats@mozilla.com) 2012-05-16 14:23:02 PDT
Just ran into this on my GN device with an m-c build. I spent some time panning around http://i.imgur.com/61GmV.jpg, and then hit the home button to go back to the homescreen, and then soon (within a second or two) re-entered fennec by hitting the homescreen icon. It crashed within a second or so of coming up. I had gdb attached, so I could verify it was this same crash.
Comment 3 Kartikaya Gupta (email:kats@mozilla.com) 2012-05-22 07:08:03 PDT
Unfortunately I'm unable to repro using the steps in comment 2. Seems like it happens randomly.
Comment 4 Kartikaya Gupta (email:kats@mozilla.com) 2012-05-22 11:48:36 PDT
Looking at the stack dumps this is actually a shutdown crash and therefore probably less important. Still should be fixed though.

To provide a little more context for what I said in comment 1: the actual error pointed to be the valgrind output in comment 1 refers to these lines of code in the destructor:

    if (top->mFocus == this)
        top->mFocus = nsnull;

It looks like "top" has already been freed, so reading/writing top->mFocus is the problem. "top" comes from FindTopLevel(), which as far as I can see, will only return a garbage value if the chain of mParent references has gone bad.

I looked through the nsWindow code to see how mParent might be corrupted, and the only way I can see that might happen is if (1) a window is destructed without having Destroy() called on it first, or (2) a window has a child added to it after Destroy() was called on it but before it is destructed. Need to trace the code some more...
Comment 5 Kartikaya Gupta (email:kats@mozilla.com) 2012-05-22 14:13:07 PDT
Created attachment 626178 [details] [diff] [review]
(1/4) Fix bug in error-handling code

After staring at this code all day I found one bug, but I don't think it's the cause of this problem. Putting it here anyway, but will request review if/when I do find the actual problem and have a patch for it.
Comment 6 Chris Peterson [:cpeterson] 2012-05-25 10:36:37 PDT
This crash is currently Fennec 15.0a1's #3 topcrash for the past 3 days.
Comment 7 Kartikaya Gupta (email:kats@mozilla.com) 2012-05-25 13:10:34 PDT
The more I stare at this code the less things make sense. But I found some more bugs!
Comment 8 Kartikaya Gupta (email:kats@mozilla.com) 2012-05-25 13:11:15 PDT
Comment on attachment 626178 [details] [diff] [review]
(1/4) Fix bug in error-handling code

This is likely not the cause of this crash, but it's a bug nonetheless.
Comment 9 Kartikaya Gupta (email:kats@mozilla.com) 2012-05-25 13:12:20 PDT
Created attachment 627333 [details] [diff] [review]
(2/4) Add some missing kungFuDeathGrips

Some of the DispatchEvent calls weren't gripping the window hard enough. This may or may not be the cause of this crash.
Comment 10 Kartikaya Gupta (email:kats@mozilla.com) 2012-05-25 13:15:30 PDT
Created attachment 627335 [details] [diff] [review]
(3/4) Heap-allocate the nsTouchEvent objects

This one may actually be the cause of the crash, but I'm very unsure about this. What I'm seeing is that the nsTouchEvent object is stack-allocated, and gets passed to DispatchEvent. This eventually ends up creating a nsDOMTouchEvent object with that stack-allocated nsTouchEvent object. When the nsDOMTouchEvent object destructor runs, it calls delete on the stored pointer (see constructor/destructor at http://mxr.mozilla.org/mozilla-central/source/content/events/src/nsDOMTouchEvent.cpp#189). Calling delete on stack-allocated objects can results in all sorts of weird behaviour including this crash. Making these objects heap allocated should in theory fix it.

However the nsTouchEvent pointers get passed around in a lot of places and I don't what the ownership model for these event objects are so this change could end up leaking these objects too. Any feedback here would be helpful.
Comment 11 Kartikaya Gupta (email:kats@mozilla.com) 2012-05-25 13:16:07 PDT
Created attachment 627336 [details] [diff] [review]
(4/4) Cleanup

Minor things I noticed after staring at this code for so long
Comment 12 Kartikaya Gupta (email:kats@mozilla.com) 2012-05-25 13:21:50 PDT
Comment on attachment 627335 [details] [diff] [review]
(3/4) Heap-allocate the nsTouchEvent objects

wesj pointed out that the nsDOMTouchEvent has an mEventIsInternal flag that guards against this. And yet I'm seeing that delete get hit in gdb. I will need to investigate this further.
Comment 13 Chris Peterson [:cpeterson] 2012-05-25 13:55:43 PDT
Comment on attachment 626178 [details] [diff] [review]
(1/4) Fix bug in error-handling code

Review of attachment 626178 [details] [diff] [review]:
-----------------------------------------------------------------

::: widget/android/nsWindow.cpp
@@ +244,5 @@
>      for (PRUint32 i = 0; i < mChildren.Length(); ++i) {
>          // why do we still have children?
>          ALOG("### Warning: Destroying window %p and reparenting child %p to null!", (void*)this, (void*)mChildren[i]);
>          mChildren[i]->SetParent(nsnull);
> +        i--;    // we just removed a child, so decrement counter

kats, you are decrementing the loop counter before it is incremented, so i is always 0. Could this for loop be replaced with something like while (mChildren.Length() > 0)?
Comment 14 Brad Lassey [:blassey] (use needinfo?) 2012-05-25 13:58:47 PDT
Comment on attachment 626178 [details] [diff] [review]
(1/4) Fix bug in error-handling code

Review of attachment 626178 [details] [diff] [review]:
-----------------------------------------------------------------

::: widget/android/nsWindow.cpp
@@ +244,5 @@
>      for (PRUint32 i = 0; i < mChildren.Length(); ++i) {
>          // why do we still have children?
>          ALOG("### Warning: Destroying window %p and reparenting child %p to null!", (void*)this, (void*)mChildren[i]);
>          mChildren[i]->SetParent(nsnull);
> +        i--;    // we just removed a child, so decrement counter

As Chris said I think his would be better as:
while (mChildren.Length())
   mChildren[0]->SetParent(nsnull);
Comment 15 Kartikaya Gupta (email:kats@mozilla.com) 2012-05-25 14:09:43 PDT
Created attachment 627351 [details] [diff] [review]
(1/4) Fix bug in error-handling code (v2)

Fixed up to use a while loop
Comment 16 Kartikaya Gupta (email:kats@mozilla.com) 2012-05-26 08:09:19 PDT
Landed the three patches, but this bug should be left open since I'm not sure if those patches will actually fix the crash.

https://hg.mozilla.org/integration/mozilla-inbound/rev/88787676b14a
https://hg.mozilla.org/integration/mozilla-inbound/rev/20a421502a41
https://hg.mozilla.org/integration/mozilla-inbound/rev/a8cc00d4ca48
Comment 18 JP Rosevear [:jpr] 2012-05-29 05:24:14 PDT
Looks like there was a couple crashes in the 27th nightly.
Comment 19 Jeff Muizelaar [:jrmuizel] 2012-05-29 19:52:07 PDT
Yeah, here's a crash from the 29th: https://crash-stats.mozilla.com/report/index/b718547d-7e69-4970-afb8-78f3a2120529
Comment 20 Johnathan Nightingale [:johnath] 2012-06-01 12:45:47 PDT
Re-triaging to strip out non-OMG-ON-FIRE release blockers - pushing this to .N+ but we'd still love to look at a safe patch.
Comment 21 Naoki Hirata :nhirata (please use needinfo instead of cc) 2012-06-14 09:31:15 PDT
7th on aurora https://crash-stats.mozilla.com/topcrasher/byversion/FennecAndroid/15.0a2/7

agree w/ Comment 20.
Comment 22 Brad Lassey [:blassey] (use needinfo?) 2012-06-14 14:36:17 PDT
Kat's the is the #12 top crasher on trunk and #11 on beta so I'd say it is not fixed.

However, being that it is out of the top 10, I'm renominating it to get it off the .N+ list
Comment 23 Kartikaya Gupta (email:kats@mozilla.com) 2012-06-18 14:07:01 PDT
Ok, I finally got STR for this that seem to work reliably.

1. Start fennec
2. Load http://people.mozilla.org/~kgupta/bug/753665.html
3. Pinch zoom in
4. Quit fennec from the menu

The shutdown procedure crashes Fennec; if you have gdb attached you can see it is this crash. What's happening is there are two windows, one a child of the other. The child nsWindow has ::Destroy called on it, which removes it from the parent's child list. Then the parent has ::Destroy called on it, which does the normal cleanup. The parent is then destructed, leaving a dangling mParent reference in the child. Then the child is garbage-collected, and it tries accessing the bad mParent reference.

This STR depends on touch events because the touch events keep a nsCOMPtr reference to the child window, and that's what makes the child window get destructed after the parent window; if the windows are destructed in the reverse order then there is no crash.

The fix is to ensure that in nsWindow::Destroy(), the mParent reference is nulled out. Patch coming up.
Comment 24 Kartikaya Gupta (email:kats@mozilla.com) 2012-06-18 14:08:55 PDT
Created attachment 634161 [details] [diff] [review]
Actual fix

Call SetParent(nsnull) which more cleanly disconnects a child from it's parent, instead of manually removing the child from the parent's child list. In particular, this ensures that mParent gets set to null for the child.
Comment 25 Kartikaya Gupta (email:kats@mozilla.com) 2012-06-18 19:31:08 PDT
https://hg.mozilla.org/integration/mozilla-inbound/rev/4180a4a2ebf2
Comment 26 Kartikaya Gupta (email:kats@mozilla.com) 2012-06-18 19:32:22 PDT
Comment on attachment 634161 [details] [diff] [review]
Actual fix

[Approval Request Comment]
Bug caused by (feature/regressing bug #): 
User impact if declined: fennec sometimes crash during shutdown
Testing completed (on m-c, etc.): just landed on inbound, pre-nomming to get it on the radar. we should probably wait to get crash stats data to confirm this fixes the issue before uplifting.
Risk to taking this patch (and alternatives if risky): low risk, mobile only
String or UUID changes made by this patch: none
Comment 27 Ed Morley [:emorley] 2012-06-19 01:18:13 PDT
https://hg.mozilla.org/mozilla-central/rev/4180a4a2ebf2
Comment 28 Alex Keybl [:akeybl] 2012-06-19 20:00:20 PDT
Comment on attachment 634161 [details] [diff] [review]
Actual fix

In preparation for possible 14.0.1 inclusion, let's land on Aurora 15.
Comment 29 Kartikaya Gupta (email:kats@mozilla.com) 2012-06-20 06:46:14 PDT
https://hg.mozilla.org/releases/mozilla-aurora/rev/a25e23ff8455
Comment 30 Kartikaya Gupta (email:kats@mozilla.com) 2012-06-22 12:28:09 PDT
https://hg.mozilla.org/releases/mozilla-beta/rev/766e5964701c
Comment 31 Adrian Tamas (:AdrianT) 2012-07-12 06:33:58 PDT
Unable to get Firefox to crash following the steps in comment 23.

Verified on:
Nightly 16.0a1 2012-07-12/Aurora 15.0a2 2012-07-11/Firefox Mobile 14 beta 12 build 1
HTC Desire
Android 2.2.2

Note You need to log in before you can comment on or make changes to this bug.