shutdown crash in nsWindow::FindTopLevel

VERIFIED FIXED in Firefox 14

Status

()

Core
Widget: Android
--
critical
VERIFIED FIXED
5 years ago
5 years ago

People

(Reporter: Scoobidiver (away), Assigned: kats)

Tracking

({crash, reproducible, topcrash})

Trunk
mozilla16
ARM
Android
crash, reproducible, topcrash
Points:
---
Bug Flags:
in-testsuite -

Firefox Tracking Flags

(firefox14 verified, firefox15 verified, firefox16 verified, blocking-fennec1.0 .N+)

Details

(Whiteboard: [native-crash], crash signature)

Attachments

(5 attachments, 1 obsolete attachment)

(Reporter)

Description

5 years ago
It happens from time to time.

Signature 	nsWindow::FindTopLevel More Reports Search
UUID	8d979a38-e2ff-4325-bd07-23bbf2120510
Date Processed	2012-05-10 05:50:49
Uptime	143
Install Age	3.0 minutes since version was first installed.
Install Time	2012-05-10 05:47:36
Product	FennecAndroid
Version	15.0a1
Build ID	20120509030514
Release Channel	nightly
OS	Linux
OS Version	0.0.0 Linux 2.6.35.10-g3f43272 #1 PREEMPT Tue Jun 14 21:11:23 CST 2011 armv7l
Build Architecture	arm
Build Architecture Info	
Crash Reason	SIGSEGV
Crash Address	0x544c
App Notes 	
AdapterVendorID: bravo, AdapterDeviceID: HTC Desire.
AdapterDescription: 'Model: 'HTC Desire', Product: 'htc_bravo', Manufacturer: 'HTC', Hardware: 'bravo''.
HTC HTC Desire
htc_wwe/htc_bravo/bravo:2.3.3/GRI40/96875.1:user/release-keys
EMCheckCompatibility	True

Frame 	Module 	Signature 	Source
0 	libxul.so 	nsWindow::FindTopLevel 	widget/android/nsWindow.cpp:217
1 	libxul.so 	nsWindow::~nsWindow 	widget/android/nsWindow.cpp:203
2 	libxul.so 	nsWindow::~nsWindow 	widget/android/nsWindow.cpp:210
3 	libxul.so 	nsBaseWidget::Release 	widget/xpwidgets/nsBaseWidget.cpp:82
4 	libxul.so 	nsWindow::Release 	widget/android/nsWindow.cpp:89
5 	libxul.so 	nsCOMPtr_base::~nsCOMPtr_base 	obj-firefox/xpcom/build/nsCOMPtr.cpp:81
6 	libxul.so 	nsTouchEvent::~nsTouchEvent 	nsCOMPtr.h:480
7 	libxul.so 	nsDOMTouchEvent::~nsDOMTouchEvent 	content/events/src/nsDOMTouchEvent.cpp:243
8 	libxul.so 	nsDOMTouchEvent::~nsDOMTouchEvent 	content/events/src/nsDOMTouchEvent.cpp:246
9 	libxul.so 	nsDOMEvent::Release 	content/events/src/nsDOMEvent.cpp:217
10 	libxul.so 	nsDOMUIEvent::Release 	content/events/src/nsDOMUIEvent.cpp:123
11 	libxul.so 	nsDOMTouchEvent::Release 	content/events/src/nsDOMTouchEvent.cpp:270
12 	libxul.so 	XPCJSRuntime::GCCallback 	js/xpconnect/src/XPCJSRuntime.cpp:623
13 	libxul.so 	js::GC 	js/src/jsgc.cpp:3697
14 	libxul.so 	JS_GC 	js/src/jsapi.cpp:2865
15 	libxul.so 	nsXREDirProvider::DoShutdown 	toolkit/xre/nsXREDirProvider.cpp:850
16 	libxul.so 	ScopedXPCOMStartup::~ScopedXPCOMStartup 	toolkit/xre/nsAppRunner.cpp:1125
17 	libxul.so 	XREMain::XRE_main 	toolkit/xre/nsAppRunner.cpp:3879
18 	libxul.so 	XRE_main 	toolkit/xre/nsAppRunner.cpp:3933
...

More reports at:
https://crash-stats.mozilla.com/report/list?signature=nsWindow%3A%3AFindTopLevel
Oh, excellent. sewardj found this using valgrind yesterday too, he said he got the following from valgrind.

==1798== Invalid read of size 4
==1798==    at 0x36618BD6: nsWindow::~nsWindow() (nsWindow.cpp:204)
==1798==    by 0x36618C83: nsWindow::~nsWindow() (nsWindow.cpp:210)
==1798==    by 0x3661CDAF: nsBaseWidget::Release() (nsBaseWidget.cpp:82)
==1798==    by 0x36616F59: nsWindow::Release() (nsWindow.cpp:89)
==1798==    by 0x3673DB4D: nsCOMPtr_base::~nsCOMPtr_base() (nsCOMPtr.cpp:81)
==1798==    by 0x3615196D: nsDOMTouchEvent::~nsDOMTouchEvent() (nsCOMPtr.h:480)
==1798==    by 0x361519FF: nsDOMTouchEvent::~nsDOMTouchEvent() (nsDOMTouchEvent.cpp:246)
==1798==    by 0x36137D5D: nsDOMEvent::Release() (nsDOMEvent.cpp:217)
==1798==    by 0x3613A501: nsDOMUIEvent::Release() (nsDOMUIEvent.cpp:123)
==1798==    by 0x361513E5: nsDOMTouchEvent::Release() (nsDOMTouchEvent.cpp:270)
==1798==    by 0x364547EF: XPCJSRuntime::GCCallback(JSRuntime*, JSGCStatus) (XPCJSRuntime.cpp:623)
==1798==    by 0x36A28BC7: Collect(JSRuntime*, bool, long long, js::JSGCInvocationKind, js::gcreason::Reason) (jsgc.cpp:3697)
==1798==  Address 0x21249a34 is 140 bytes inside a block of size 384 free'd
==1798==    at 0x4804F44: free (vg_replace_malloc.c:427)
==1798==    by 0x2275F699: moz_free (mozalloc.cpp:81)
==1798==    by 0x36618C89: nsWindow::~nsWindow() (mozalloc.h:253)
==1798==    by 0x3661CDAF: nsBaseWidget::Release() (nsBaseWidget.cpp:82)
==1798==    by 0x36616F59: nsWindow::Release() (nsWindow.cpp:89)
==1798==    by 0x3673DB4D: nsCOMPtr_base::~nsCOMPtr_base() (nsCOMPtr.cpp:81)
==1798==    by 0x35F13EFD: nsDeviceContext::~nsDeviceContext() (nsCOMPtr.h:480)
==1798==    by 0x35F58565: nsPresContext::~nsPresContext() (nsDeviceContext.h:60)
==1798==    by 0x35F586B7: nsPresContext::~nsPresContext() (nsPresContext.cpp:335)
==1798==    by 0x35F554B9: nsPresContext::Release() (nsPresContext.cpp:345)
==1798==    by 0x36137E5D: nsDOMEvent::~nsDOMEvent() (nsAutoPtr.h:908)
==1798==    by 0x36140E01: nsDOMPageTransitionEvent::~nsDOMPageTransitionEvent() (nsDOMPageTransitionEvent.h:47)


==1798== Invalid write of size 4
==1798==    at 0x36618BE0: nsWindow::~nsWindow() (nsWindow.cpp:205)
==1798==    by 0x36618C83: nsWindow::~nsWindow() (nsWindow.cpp:210)
==1798==    by 0x3661CDAF: nsBaseWidget::Release() (nsBaseWidget.cpp:82)
==1798==    by 0x36616F59: nsWindow::Release() (nsWindow.cpp:89)
==1798==    by 0x3673DB4D: nsCOMPtr_base::~nsCOMPtr_base() (nsCOMPtr.cpp:81)
==1798==    by 0x3615196D: nsDOMTouchEvent::~nsDOMTouchEvent() (nsCOMPtr.h:480)
==1798==    by 0x361519FF: nsDOMTouchEvent::~nsDOMTouchEvent() (nsDOMTouchEvent.cpp:246)
==1798==    by 0x36137D5D: nsDOMEvent::Release() (nsDOMEvent.cpp:217)
==1798==    by 0x3613A501: nsDOMUIEvent::Release() (nsDOMUIEvent.cpp:123)
==1798==    by 0x361513E5: nsDOMTouchEvent::Release() (nsDOMTouchEvent.cpp:270)
==1798==    by 0x364547EF: XPCJSRuntime::GCCallback(JSRuntime*, JSGCStatus) (XPCJSRuntime.cpp:623)
==1798==    by 0x36A28BC7: Collect(JSRuntime*, bool, long long, js::JSGCInvocationKind, js::gcreason::Reason) (jsgc.cpp:3697)
==1798==  Address 0x21249a34 is 140 bytes inside a block of size 384 free'd
==1798==    at 0x4804F44: free (vg_replace_malloc.c:427)
==1798==    by 0x2275F699: moz_free (mozalloc.cpp:81)
==1798==    by 0x36618C89: nsWindow::~nsWindow() (mozalloc.h:253)
==1798==    by 0x3661CDAF: nsBaseWidget::Release() (nsBaseWidget.cpp:82)
==1798==    by 0x36616F59: nsWindow::Release() (nsWindow.cpp:89)
==1798==    by 0x3673DB4D: nsCOMPtr_base::~nsCOMPtr_base() (nsCOMPtr.cpp:81)
==1798==    by 0x35F13EFD: nsDeviceContext::~nsDeviceContext() (nsCOMPtr.h:480)
==1798==    by 0x35F58565: nsPresContext::~nsPresContext() (nsDeviceContext.h:60)
==1798==    by 0x35F586B7: nsPresContext::~nsPresContext() (nsPresContext.cpp:335)
==1798==    by 0x35F554B9: nsPresContext::Release() (nsPresContext.cpp:345)
==1798==    by 0x36137E5D: nsDOMEvent::~nsDOMEvent() (nsAutoPtr.h:908)
==1798==    by 0x36140E01: nsDOMPageTransitionEvent::~nsDOMPageTransitionEvent() (nsDOMPageTransitionEvent.h:47)

It seems to me that the mParent pointer on nsWindow could become invalid because it's just a nsWindow* and not some fancy refptr. If that becomes invalid then it would cause this behaviour (both the crash and the invalid memory read/write).
Just ran into this on my GN device with an m-c build. I spent some time panning around http://i.imgur.com/61GmV.jpg, and then hit the home button to go back to the homescreen, and then soon (within a second or two) re-entered fennec by hitting the homescreen icon. It crashed within a second or so of coming up. I had gdb attached, so I could verify it was this same crash.
blocking-fennec1.0: --- → ?
Assignee: nobody → bugmail.mozilla
blocking-fennec1.0: ? → +
Unfortunately I'm unable to repro using the steps in comment 2. Seems like it happens randomly.
Looking at the stack dumps this is actually a shutdown crash and therefore probably less important. Still should be fixed though.

To provide a little more context for what I said in comment 1: the actual error pointed to be the valgrind output in comment 1 refers to these lines of code in the destructor:

    if (top->mFocus == this)
        top->mFocus = nsnull;

It looks like "top" has already been freed, so reading/writing top->mFocus is the problem. "top" comes from FindTopLevel(), which as far as I can see, will only return a garbage value if the chain of mParent references has gone bad.

I looked through the nsWindow code to see how mParent might be corrupted, and the only way I can see that might happen is if (1) a window is destructed without having Destroy() called on it first, or (2) a window has a child added to it after Destroy() was called on it but before it is destructed. Need to trace the code some more...
Summary: crash in nsWindow::FindTopLevel → shutdown crash in nsWindow::FindTopLevel
Created attachment 626178 [details] [diff] [review]
(1/4) Fix bug in error-handling code

After staring at this code all day I found one bug, but I don't think it's the cause of this problem. Putting it here anyway, but will request review if/when I do find the actual problem and have a patch for it.
This crash is currently Fennec 15.0a1's #3 topcrash for the past 3 days.
The more I stare at this code the less things make sense. But I found some more bugs!
Comment on attachment 626178 [details] [diff] [review]
(1/4) Fix bug in error-handling code

This is likely not the cause of this crash, but it's a bug nonetheless.
Attachment #626178 - Attachment description: Fix bug in error-handling code → (1/4) Fix bug in error-handling code
Attachment #626178 - Flags: review?(blassey.bugs)
Created attachment 627333 [details] [diff] [review]
(2/4) Add some missing kungFuDeathGrips

Some of the DispatchEvent calls weren't gripping the window hard enough. This may or may not be the cause of this crash.
Attachment #627333 - Flags: review?(blassey.bugs)
Created attachment 627335 [details] [diff] [review]
(3/4) Heap-allocate the nsTouchEvent objects

This one may actually be the cause of the crash, but I'm very unsure about this. What I'm seeing is that the nsTouchEvent object is stack-allocated, and gets passed to DispatchEvent. This eventually ends up creating a nsDOMTouchEvent object with that stack-allocated nsTouchEvent object. When the nsDOMTouchEvent object destructor runs, it calls delete on the stored pointer (see constructor/destructor at http://mxr.mozilla.org/mozilla-central/source/content/events/src/nsDOMTouchEvent.cpp#189). Calling delete on stack-allocated objects can results in all sorts of weird behaviour including this crash. Making these objects heap allocated should in theory fix it.

However the nsTouchEvent pointers get passed around in a lot of places and I don't what the ownership model for these event objects are so this change could end up leaking these objects too. Any feedback here would be helpful.
Attachment #627335 - Flags: review?(blassey.bugs)
Created attachment 627336 [details] [diff] [review]
(4/4) Cleanup

Minor things I noticed after staring at this code for so long
Attachment #627336 - Flags: review?(blassey.bugs)
Comment on attachment 627335 [details] [diff] [review]
(3/4) Heap-allocate the nsTouchEvent objects

wesj pointed out that the nsDOMTouchEvent has an mEventIsInternal flag that guards against this. And yet I'm seeing that delete get hit in gdb. I will need to investigate this further.
Attachment #627335 - Flags: review?(blassey.bugs) → review-
Comment on attachment 626178 [details] [diff] [review]
(1/4) Fix bug in error-handling code

Review of attachment 626178 [details] [diff] [review]:
-----------------------------------------------------------------

::: widget/android/nsWindow.cpp
@@ +244,5 @@
>      for (PRUint32 i = 0; i < mChildren.Length(); ++i) {
>          // why do we still have children?
>          ALOG("### Warning: Destroying window %p and reparenting child %p to null!", (void*)this, (void*)mChildren[i]);
>          mChildren[i]->SetParent(nsnull);
> +        i--;    // we just removed a child, so decrement counter

kats, you are decrementing the loop counter before it is incremented, so i is always 0. Could this for loop be replaced with something like while (mChildren.Length() > 0)?
Comment on attachment 626178 [details] [diff] [review]
(1/4) Fix bug in error-handling code

Review of attachment 626178 [details] [diff] [review]:
-----------------------------------------------------------------

::: widget/android/nsWindow.cpp
@@ +244,5 @@
>      for (PRUint32 i = 0; i < mChildren.Length(); ++i) {
>          // why do we still have children?
>          ALOG("### Warning: Destroying window %p and reparenting child %p to null!", (void*)this, (void*)mChildren[i]);
>          mChildren[i]->SetParent(nsnull);
> +        i--;    // we just removed a child, so decrement counter

As Chris said I think his would be better as:
while (mChildren.Length())
   mChildren[0]->SetParent(nsnull);
Attachment #626178 - Flags: review?(blassey.bugs) → review-
Attachment #627333 - Flags: review?(blassey.bugs) → review+
Created attachment 627351 [details] [diff] [review]
(1/4) Fix bug in error-handling code (v2)

Fixed up to use a while loop
Attachment #626178 - Attachment is obsolete: true
Attachment #627351 - Flags: review?(blassey.bugs)
Attachment #627351 - Flags: review?(blassey.bugs) → review+
Attachment #627336 - Flags: review?(blassey.bugs) → review+
Landed the three patches, but this bug should be left open since I'm not sure if those patches will actually fix the crash.

https://hg.mozilla.org/integration/mozilla-inbound/rev/88787676b14a
https://hg.mozilla.org/integration/mozilla-inbound/rev/20a421502a41
https://hg.mozilla.org/integration/mozilla-inbound/rev/a8cc00d4ca48
Whiteboard: [native-crash] → [native-crash][leave open]
Target Milestone: --- → mozilla15
https://hg.mozilla.org/mozilla-central/rev/88787676b14a
https://hg.mozilla.org/mozilla-central/rev/20a421502a41
https://hg.mozilla.org/mozilla-central/rev/a8cc00d4ca48
Flags: in-testsuite-

Comment 18

5 years ago
Looks like there was a couple crashes in the 27th nightly.
Yeah, here's a crash from the 29th: https://crash-stats.mozilla.com/report/index/b718547d-7e69-4970-afb8-78f3a2120529
blocking-fennec1.0: + → .N+
Re-triaging to strip out non-OMG-ON-FIRE release blockers - pushing this to .N+ but we'd still love to look at a safe patch.
7th on aurora https://crash-stats.mozilla.com/topcrasher/byversion/FennecAndroid/15.0a2/7

agree w/ Comment 20.
Keywords: topcrash
Kat's the is the #12 top crasher on trunk and #11 on beta so I'd say it is not fixed.

However, being that it is out of the top 10, I'm renominating it to get it off the .N+ list
blocking-fennec1.0: .N+ → ?
blocking-fennec1.0: ? → .N+
Ok, I finally got STR for this that seem to work reliably.

1. Start fennec
2. Load http://people.mozilla.org/~kgupta/bug/753665.html
3. Pinch zoom in
4. Quit fennec from the menu

The shutdown procedure crashes Fennec; if you have gdb attached you can see it is this crash. What's happening is there are two windows, one a child of the other. The child nsWindow has ::Destroy called on it, which removes it from the parent's child list. Then the parent has ::Destroy called on it, which does the normal cleanup. The parent is then destructed, leaving a dangling mParent reference in the child. Then the child is garbage-collected, and it tries accessing the bad mParent reference.

This STR depends on touch events because the touch events keep a nsCOMPtr reference to the child window, and that's what makes the child window get destructed after the parent window; if the windows are destructed in the reverse order then there is no crash.

The fix is to ensure that in nsWindow::Destroy(), the mParent reference is nulled out. Patch coming up.
Created attachment 634161 [details] [diff] [review]
Actual fix

Call SetParent(nsnull) which more cleanly disconnects a child from it's parent, instead of manually removing the child from the parent's child list. In particular, this ensures that mParent gets set to null for the child.
Attachment #634161 - Flags: review?(blassey.bugs)
Attachment #634161 - Flags: review?(blassey.bugs) → review+
https://hg.mozilla.org/integration/mozilla-inbound/rev/4180a4a2ebf2
status-firefox14: --- → affected
status-firefox15: --- → affected
status-firefox16: --- → fixed
Whiteboard: [native-crash][leave open] → [native-crash]
Comment on attachment 634161 [details] [diff] [review]
Actual fix

[Approval Request Comment]
Bug caused by (feature/regressing bug #): 
User impact if declined: fennec sometimes crash during shutdown
Testing completed (on m-c, etc.): just landed on inbound, pre-nomming to get it on the radar. we should probably wait to get crash stats data to confirm this fixes the issue before uplifting.
Risk to taking this patch (and alternatives if risky): low risk, mobile only
String or UUID changes made by this patch: none
Attachment #634161 - Flags: approval-mozilla-beta?
Attachment #634161 - Flags: approval-mozilla-aurora?
(Reporter)

Updated

5 years ago
Keywords: reproducible
https://hg.mozilla.org/mozilla-central/rev/4180a4a2ebf2
Status: NEW → RESOLVED
Last Resolved: 5 years ago
Resolution: --- → FIXED
Target Milestone: mozilla15 → mozilla16
Comment on attachment 634161 [details] [diff] [review]
Actual fix

In preparation for possible 14.0.1 inclusion, let's land on Aurora 15.
Attachment #634161 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
https://hg.mozilla.org/releases/mozilla-aurora/rev/a25e23ff8455
status-firefox15: affected → fixed
Attachment #634161 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
https://hg.mozilla.org/releases/mozilla-beta/rev/766e5964701c
status-firefox14: affected → fixed
Unable to get Firefox to crash following the steps in comment 23.

Verified on:
Nightly 16.0a1 2012-07-12/Aurora 15.0a2 2012-07-11/Firefox Mobile 14 beta 12 build 1
HTC Desire
Android 2.2.2
Status: RESOLVED → VERIFIED
status-firefox14: fixed → verified
status-firefox15: fixed → verified
status-firefox16: fixed → verified
You need to log in before you can comment on or make changes to this bug.