Closed Bug 753665 Opened 13 years ago Closed 13 years ago

shutdown crash in nsWindow::FindTopLevel

Categories

(Core Graveyard :: Widget: Android, defect)

ARM
Android
defect
Not set
critical

Tracking

(firefox14 verified, firefox15 verified, firefox16 verified, blocking-fennec1.0 .N+)

VERIFIED FIXED
mozilla16
Tracking Status
firefox14 --- verified
firefox15 --- verified
firefox16 --- verified
blocking-fennec1.0 --- .N+

People

(Reporter: scoobidiver, Assigned: kats)

Details

(Keywords: crash, reproducible, topcrash, Whiteboard: [native-crash])

Crash Data

Attachments

(5 files, 1 obsolete file)

It happens from time to time. Signature nsWindow::FindTopLevel More Reports Search UUID 8d979a38-e2ff-4325-bd07-23bbf2120510 Date Processed 2012-05-10 05:50:49 Uptime 143 Install Age 3.0 minutes since version was first installed. Install Time 2012-05-10 05:47:36 Product FennecAndroid Version 15.0a1 Build ID 20120509030514 Release Channel nightly OS Linux OS Version 0.0.0 Linux 2.6.35.10-g3f43272 #1 PREEMPT Tue Jun 14 21:11:23 CST 2011 armv7l Build Architecture arm Build Architecture Info Crash Reason SIGSEGV Crash Address 0x544c App Notes AdapterVendorID: bravo, AdapterDeviceID: HTC Desire. AdapterDescription: 'Model: 'HTC Desire', Product: 'htc_bravo', Manufacturer: 'HTC', Hardware: 'bravo''. HTC HTC Desire htc_wwe/htc_bravo/bravo:2.3.3/GRI40/96875.1:user/release-keys EMCheckCompatibility True Frame Module Signature Source 0 libxul.so nsWindow::FindTopLevel widget/android/nsWindow.cpp:217 1 libxul.so nsWindow::~nsWindow widget/android/nsWindow.cpp:203 2 libxul.so nsWindow::~nsWindow widget/android/nsWindow.cpp:210 3 libxul.so nsBaseWidget::Release widget/xpwidgets/nsBaseWidget.cpp:82 4 libxul.so nsWindow::Release widget/android/nsWindow.cpp:89 5 libxul.so nsCOMPtr_base::~nsCOMPtr_base obj-firefox/xpcom/build/nsCOMPtr.cpp:81 6 libxul.so nsTouchEvent::~nsTouchEvent nsCOMPtr.h:480 7 libxul.so nsDOMTouchEvent::~nsDOMTouchEvent content/events/src/nsDOMTouchEvent.cpp:243 8 libxul.so nsDOMTouchEvent::~nsDOMTouchEvent content/events/src/nsDOMTouchEvent.cpp:246 9 libxul.so nsDOMEvent::Release content/events/src/nsDOMEvent.cpp:217 10 libxul.so nsDOMUIEvent::Release content/events/src/nsDOMUIEvent.cpp:123 11 libxul.so nsDOMTouchEvent::Release content/events/src/nsDOMTouchEvent.cpp:270 12 libxul.so XPCJSRuntime::GCCallback js/xpconnect/src/XPCJSRuntime.cpp:623 13 libxul.so js::GC js/src/jsgc.cpp:3697 14 libxul.so JS_GC js/src/jsapi.cpp:2865 15 libxul.so nsXREDirProvider::DoShutdown toolkit/xre/nsXREDirProvider.cpp:850 16 libxul.so ScopedXPCOMStartup::~ScopedXPCOMStartup toolkit/xre/nsAppRunner.cpp:1125 17 libxul.so XREMain::XRE_main toolkit/xre/nsAppRunner.cpp:3879 18 libxul.so XRE_main toolkit/xre/nsAppRunner.cpp:3933 ... More reports at: https://crash-stats.mozilla.com/report/list?signature=nsWindow%3A%3AFindTopLevel
Oh, excellent. sewardj found this using valgrind yesterday too, he said he got the following from valgrind. ==1798== Invalid read of size 4 ==1798== at 0x36618BD6: nsWindow::~nsWindow() (nsWindow.cpp:204) ==1798== by 0x36618C83: nsWindow::~nsWindow() (nsWindow.cpp:210) ==1798== by 0x3661CDAF: nsBaseWidget::Release() (nsBaseWidget.cpp:82) ==1798== by 0x36616F59: nsWindow::Release() (nsWindow.cpp:89) ==1798== by 0x3673DB4D: nsCOMPtr_base::~nsCOMPtr_base() (nsCOMPtr.cpp:81) ==1798== by 0x3615196D: nsDOMTouchEvent::~nsDOMTouchEvent() (nsCOMPtr.h:480) ==1798== by 0x361519FF: nsDOMTouchEvent::~nsDOMTouchEvent() (nsDOMTouchEvent.cpp:246) ==1798== by 0x36137D5D: nsDOMEvent::Release() (nsDOMEvent.cpp:217) ==1798== by 0x3613A501: nsDOMUIEvent::Release() (nsDOMUIEvent.cpp:123) ==1798== by 0x361513E5: nsDOMTouchEvent::Release() (nsDOMTouchEvent.cpp:270) ==1798== by 0x364547EF: XPCJSRuntime::GCCallback(JSRuntime*, JSGCStatus) (XPCJSRuntime.cpp:623) ==1798== by 0x36A28BC7: Collect(JSRuntime*, bool, long long, js::JSGCInvocationKind, js::gcreason::Reason) (jsgc.cpp:3697) ==1798== Address 0x21249a34 is 140 bytes inside a block of size 384 free'd ==1798== at 0x4804F44: free (vg_replace_malloc.c:427) ==1798== by 0x2275F699: moz_free (mozalloc.cpp:81) ==1798== by 0x36618C89: nsWindow::~nsWindow() (mozalloc.h:253) ==1798== by 0x3661CDAF: nsBaseWidget::Release() (nsBaseWidget.cpp:82) ==1798== by 0x36616F59: nsWindow::Release() (nsWindow.cpp:89) ==1798== by 0x3673DB4D: nsCOMPtr_base::~nsCOMPtr_base() (nsCOMPtr.cpp:81) ==1798== by 0x35F13EFD: nsDeviceContext::~nsDeviceContext() (nsCOMPtr.h:480) ==1798== by 0x35F58565: nsPresContext::~nsPresContext() (nsDeviceContext.h:60) ==1798== by 0x35F586B7: nsPresContext::~nsPresContext() (nsPresContext.cpp:335) ==1798== by 0x35F554B9: nsPresContext::Release() (nsPresContext.cpp:345) ==1798== by 0x36137E5D: nsDOMEvent::~nsDOMEvent() (nsAutoPtr.h:908) ==1798== by 0x36140E01: nsDOMPageTransitionEvent::~nsDOMPageTransitionEvent() (nsDOMPageTransitionEvent.h:47) ==1798== Invalid write of size 4 ==1798== at 0x36618BE0: nsWindow::~nsWindow() (nsWindow.cpp:205) ==1798== by 0x36618C83: nsWindow::~nsWindow() (nsWindow.cpp:210) ==1798== by 0x3661CDAF: nsBaseWidget::Release() (nsBaseWidget.cpp:82) ==1798== by 0x36616F59: nsWindow::Release() (nsWindow.cpp:89) ==1798== by 0x3673DB4D: nsCOMPtr_base::~nsCOMPtr_base() (nsCOMPtr.cpp:81) ==1798== by 0x3615196D: nsDOMTouchEvent::~nsDOMTouchEvent() (nsCOMPtr.h:480) ==1798== by 0x361519FF: nsDOMTouchEvent::~nsDOMTouchEvent() (nsDOMTouchEvent.cpp:246) ==1798== by 0x36137D5D: nsDOMEvent::Release() (nsDOMEvent.cpp:217) ==1798== by 0x3613A501: nsDOMUIEvent::Release() (nsDOMUIEvent.cpp:123) ==1798== by 0x361513E5: nsDOMTouchEvent::Release() (nsDOMTouchEvent.cpp:270) ==1798== by 0x364547EF: XPCJSRuntime::GCCallback(JSRuntime*, JSGCStatus) (XPCJSRuntime.cpp:623) ==1798== by 0x36A28BC7: Collect(JSRuntime*, bool, long long, js::JSGCInvocationKind, js::gcreason::Reason) (jsgc.cpp:3697) ==1798== Address 0x21249a34 is 140 bytes inside a block of size 384 free'd ==1798== at 0x4804F44: free (vg_replace_malloc.c:427) ==1798== by 0x2275F699: moz_free (mozalloc.cpp:81) ==1798== by 0x36618C89: nsWindow::~nsWindow() (mozalloc.h:253) ==1798== by 0x3661CDAF: nsBaseWidget::Release() (nsBaseWidget.cpp:82) ==1798== by 0x36616F59: nsWindow::Release() (nsWindow.cpp:89) ==1798== by 0x3673DB4D: nsCOMPtr_base::~nsCOMPtr_base() (nsCOMPtr.cpp:81) ==1798== by 0x35F13EFD: nsDeviceContext::~nsDeviceContext() (nsCOMPtr.h:480) ==1798== by 0x35F58565: nsPresContext::~nsPresContext() (nsDeviceContext.h:60) ==1798== by 0x35F586B7: nsPresContext::~nsPresContext() (nsPresContext.cpp:335) ==1798== by 0x35F554B9: nsPresContext::Release() (nsPresContext.cpp:345) ==1798== by 0x36137E5D: nsDOMEvent::~nsDOMEvent() (nsAutoPtr.h:908) ==1798== by 0x36140E01: nsDOMPageTransitionEvent::~nsDOMPageTransitionEvent() (nsDOMPageTransitionEvent.h:47) It seems to me that the mParent pointer on nsWindow could become invalid because it's just a nsWindow* and not some fancy refptr. If that becomes invalid then it would cause this behaviour (both the crash and the invalid memory read/write).
Just ran into this on my GN device with an m-c build. I spent some time panning around http://i.imgur.com/61GmV.jpg, and then hit the home button to go back to the homescreen, and then soon (within a second or two) re-entered fennec by hitting the homescreen icon. It crashed within a second or so of coming up. I had gdb attached, so I could verify it was this same crash.
blocking-fennec1.0: --- → ?
Assignee: nobody → bugmail.mozilla
blocking-fennec1.0: ? → +
Unfortunately I'm unable to repro using the steps in comment 2. Seems like it happens randomly.
Looking at the stack dumps this is actually a shutdown crash and therefore probably less important. Still should be fixed though. To provide a little more context for what I said in comment 1: the actual error pointed to be the valgrind output in comment 1 refers to these lines of code in the destructor: if (top->mFocus == this) top->mFocus = nsnull; It looks like "top" has already been freed, so reading/writing top->mFocus is the problem. "top" comes from FindTopLevel(), which as far as I can see, will only return a garbage value if the chain of mParent references has gone bad. I looked through the nsWindow code to see how mParent might be corrupted, and the only way I can see that might happen is if (1) a window is destructed without having Destroy() called on it first, or (2) a window has a child added to it after Destroy() was called on it but before it is destructed. Need to trace the code some more...
Summary: crash in nsWindow::FindTopLevel → shutdown crash in nsWindow::FindTopLevel
After staring at this code all day I found one bug, but I don't think it's the cause of this problem. Putting it here anyway, but will request review if/when I do find the actual problem and have a patch for it.
This crash is currently Fennec 15.0a1's #3 topcrash for the past 3 days.
The more I stare at this code the less things make sense. But I found some more bugs!
Comment on attachment 626178 [details] [diff] [review] (1/4) Fix bug in error-handling code This is likely not the cause of this crash, but it's a bug nonetheless.
Attachment #626178 - Attachment description: Fix bug in error-handling code → (1/4) Fix bug in error-handling code
Attachment #626178 - Flags: review?(blassey.bugs)
Some of the DispatchEvent calls weren't gripping the window hard enough. This may or may not be the cause of this crash.
Attachment #627333 - Flags: review?(blassey.bugs)
This one may actually be the cause of the crash, but I'm very unsure about this. What I'm seeing is that the nsTouchEvent object is stack-allocated, and gets passed to DispatchEvent. This eventually ends up creating a nsDOMTouchEvent object with that stack-allocated nsTouchEvent object. When the nsDOMTouchEvent object destructor runs, it calls delete on the stored pointer (see constructor/destructor at http://mxr.mozilla.org/mozilla-central/source/content/events/src/nsDOMTouchEvent.cpp#189). Calling delete on stack-allocated objects can results in all sorts of weird behaviour including this crash. Making these objects heap allocated should in theory fix it. However the nsTouchEvent pointers get passed around in a lot of places and I don't what the ownership model for these event objects are so this change could end up leaking these objects too. Any feedback here would be helpful.
Attachment #627335 - Flags: review?(blassey.bugs)
Attached patch (4/4) CleanupSplinter Review
Minor things I noticed after staring at this code for so long
Attachment #627336 - Flags: review?(blassey.bugs)
Comment on attachment 627335 [details] [diff] [review] (3/4) Heap-allocate the nsTouchEvent objects wesj pointed out that the nsDOMTouchEvent has an mEventIsInternal flag that guards against this. And yet I'm seeing that delete get hit in gdb. I will need to investigate this further.
Attachment #627335 - Flags: review?(blassey.bugs) → review-
Comment on attachment 626178 [details] [diff] [review] (1/4) Fix bug in error-handling code Review of attachment 626178 [details] [diff] [review]: ----------------------------------------------------------------- ::: widget/android/nsWindow.cpp @@ +244,5 @@ > for (PRUint32 i = 0; i < mChildren.Length(); ++i) { > // why do we still have children? > ALOG("### Warning: Destroying window %p and reparenting child %p to null!", (void*)this, (void*)mChildren[i]); > mChildren[i]->SetParent(nsnull); > + i--; // we just removed a child, so decrement counter kats, you are decrementing the loop counter before it is incremented, so i is always 0. Could this for loop be replaced with something like while (mChildren.Length() > 0)?
Comment on attachment 626178 [details] [diff] [review] (1/4) Fix bug in error-handling code Review of attachment 626178 [details] [diff] [review]: ----------------------------------------------------------------- ::: widget/android/nsWindow.cpp @@ +244,5 @@ > for (PRUint32 i = 0; i < mChildren.Length(); ++i) { > // why do we still have children? > ALOG("### Warning: Destroying window %p and reparenting child %p to null!", (void*)this, (void*)mChildren[i]); > mChildren[i]->SetParent(nsnull); > + i--; // we just removed a child, so decrement counter As Chris said I think his would be better as: while (mChildren.Length()) mChildren[0]->SetParent(nsnull);
Attachment #626178 - Flags: review?(blassey.bugs) → review-
Attachment #627333 - Flags: review?(blassey.bugs) → review+
Fixed up to use a while loop
Attachment #626178 - Attachment is obsolete: true
Attachment #627351 - Flags: review?(blassey.bugs)
Attachment #627351 - Flags: review?(blassey.bugs) → review+
Attachment #627336 - Flags: review?(blassey.bugs) → review+
Whiteboard: [native-crash] → [native-crash][leave open]
Target Milestone: --- → mozilla15
Looks like there was a couple crashes in the 27th nightly.
blocking-fennec1.0: + → .N+
Re-triaging to strip out non-OMG-ON-FIRE release blockers - pushing this to .N+ but we'd still love to look at a safe patch.
Kat's the is the #12 top crasher on trunk and #11 on beta so I'd say it is not fixed. However, being that it is out of the top 10, I'm renominating it to get it off the .N+ list
blocking-fennec1.0: .N+ → ?
blocking-fennec1.0: ? → .N+
Ok, I finally got STR for this that seem to work reliably. 1. Start fennec 2. Load http://people.mozilla.org/~kgupta/bug/753665.html 3. Pinch zoom in 4. Quit fennec from the menu The shutdown procedure crashes Fennec; if you have gdb attached you can see it is this crash. What's happening is there are two windows, one a child of the other. The child nsWindow has ::Destroy called on it, which removes it from the parent's child list. Then the parent has ::Destroy called on it, which does the normal cleanup. The parent is then destructed, leaving a dangling mParent reference in the child. Then the child is garbage-collected, and it tries accessing the bad mParent reference. This STR depends on touch events because the touch events keep a nsCOMPtr reference to the child window, and that's what makes the child window get destructed after the parent window; if the windows are destructed in the reverse order then there is no crash. The fix is to ensure that in nsWindow::Destroy(), the mParent reference is nulled out. Patch coming up.
Attached patch Actual fixSplinter Review
Call SetParent(nsnull) which more cleanly disconnects a child from it's parent, instead of manually removing the child from the parent's child list. In particular, this ensures that mParent gets set to null for the child.
Attachment #634161 - Flags: review?(blassey.bugs)
Attachment #634161 - Flags: review?(blassey.bugs) → review+
Comment on attachment 634161 [details] [diff] [review] Actual fix [Approval Request Comment] Bug caused by (feature/regressing bug #): User impact if declined: fennec sometimes crash during shutdown Testing completed (on m-c, etc.): just landed on inbound, pre-nomming to get it on the radar. we should probably wait to get crash stats data to confirm this fixes the issue before uplifting. Risk to taking this patch (and alternatives if risky): low risk, mobile only String or UUID changes made by this patch: none
Attachment #634161 - Flags: approval-mozilla-beta?
Attachment #634161 - Flags: approval-mozilla-aurora?
Keywords: reproducible
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED
Target Milestone: mozilla15 → mozilla16
Comment on attachment 634161 [details] [diff] [review] Actual fix In preparation for possible 14.0.1 inclusion, let's land on Aurora 15.
Attachment #634161 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
Attachment #634161 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Unable to get Firefox to crash following the steps in comment 23. Verified on: Nightly 16.0a1 2012-07-12/Aurora 15.0a2 2012-07-11/Firefox Mobile 14 beta 12 build 1 HTC Desire Android 2.2.2
Product: Core → Core Graveyard
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: