Closed Bug 530070 Opened 15 years ago Closed 14 years ago

Firefox crashes in [@ nsWindow::GetParentWindow(int)] called from nsLayoutUtils::TranslateWidgetToView

Categories

(Core :: Widget: Win32, defect)

1.9.2 Branch
x86
Windows XP
defect
Not set
critical

Tracking

()

RESOLVED FIXED
mozilla1.9.3a5
Tracking Status
status1.9.2 --- ?
status1.9.1 --- ?

People

(Reporter: jimm, Assigned: MatsPalmgren_bugz)

References

Details

(Keywords: crash, topcrash, Whiteboard: [sg:critical?] [no steps to reproduce][ccbr][critsmash:patch])

Crash Data

Attachments

(3 files, 2 obsolete files)

Summary: Crash in [@ nsWindow::GetParentWindow] from [@ nsLayoutUtils::TranslateWidgetToView] → Crash in [@ nsWindow::GetParentWindow(int)] from [@ nsLayoutUtils::TranslateWidgetToView]
Severity: normal → critical
Just the stack for reference:

0  	xul.dll  	nsWindow::GetParentWindow  	 widget/src/windows/nsWindow.cpp:1063
1 	xul.dll 	nsWindow::GetParent 	widget/src/windows/nsWindow.cpp:1025
2 	xul.dll 	GetWidgetOffset 	layout/base/nsLayoutUtils.cpp:839
3 	xul.dll 	nsLayoutUtils::TranslateWidgetToView 	layout/base/nsLayoutUtils.cpp:860
4 	xul.dll 	nsLayoutUtils::GetEventCoordinatesRelativeTo 	layout/base/nsLayoutUtils.cpp:680
5 	xul.dll 	PresShell::HandleEvent 	layout/base/nsPresShell.cpp:6127

I think we should close this bug for now until we know if it's not exploidable.
Group: core-security
should we remove the brackets from nsLayoutUtils::TranslateWidgetToView in the summary? If we ever get a crash at that address we'll be fooling crash-stats into linking here.

If the line number in the stack is correct it seems like we got a bogus (already destroyed?) widget out of the HWND. At a brief glance it doesn't look like we're doing anything all that dangerous with it, just trying to read its size and position. We're not writing to it, writing to a location returned by it, or calling any virtual functions.
Whiteboard: [sg:investigate]
I take that back, GetBounds() is virtual.
Whiteboard: [sg:investigate] → [sg:critical?]
(In reply to comment #2)
> should we remove the brackets from nsLayoutUtils::TranslateWidgetToView in the
> summary? If we ever get a crash at that address we'll be fooling crash-stats
> into linking here.

Using the | character to separate the different frames is the way we normally go, right? If not feel free to update the summary again.
Summary: Crash in [@ nsWindow::GetParentWindow(int)] from [@ nsLayoutUtils::TranslateWidgetToView] → Firefox crashes in [@ nsWindow::GetParentWindow(int) | nsLayoutUtils::TranslateWidgetToView]
Bug 506108 comment 7 answers that. Reverting Summary.
Summary: Firefox crashes in [@ nsWindow::GetParentWindow(int) | nsLayoutUtils::TranslateWidgetToView] → Firefox crashes in [@ nsWindow::GetParentWindow(int)] called from nsLayoutUtils::TranslateWidgetToView
Mats, can you look into this? No steps to reproduce, so may not be fixable.
Whiteboard: [sg:critical?] → [sg:critical?] [no steps to reproduce]
Assignee: nobody → jmathies
Assignee: jmathies → matspal
Attached patch wipSplinter Review
I found one problem, probably not the crash cause though.
In OnDestroy we set mOnDestroyCalled to PR_TRUE (which means GetParent()
will return nsnull) before calling  nsBaseWidget::Destroy() which
prevents it from removing us from the parent's child list.
(STR: drag a tab onto the desktop (creating a new window for it) --
each time the child list grows by one)).

The child window in question has mWnd=NULL and mOnDestroyCalled=PR_TRUE
so I don't think it can cause any trouble (other than the temporary
"leak").
The nsWindow parent relationship looks a bit fragile (storing the native
parent HWND on a property on the child HWND), it's a "weak ref" and we
depend on Windows to remove these properties in a timely fashion
(can an event be dispatched to a child window after/during a WM_DESTROY
on a parent?).

Since we can't find the error, I suggest we add protective code to avoid
crashing.  I would re-instate the "unnecessary" SetParent(nsnull) in
OnDestroy as a first step.
http://mxr.mozilla.org/mozilla-central/source/widget/src/windows/nsWindow.cpp#6039
Perhaps also do SetParent(nsnull) on all children in nsBaseWidget::Destroy().

If the crashing continues I suggest we give up the current design and
use a strong ref mParent (as we do on gtk2).
(In reply to comment #8)
> If the crashing continues I suggest we give up the current design and
> use a strong ref mParent (as we do on gtk2).

^^ that would be my preference as well, but I found this wasn't simple to implement. I didn't spend a lot of time on it though. If we have the time now, I'd say taking a shot at remove the prop is the best bet to fixing all these issues.
Another option would be to add the SetParent(nsnull) per above +
NS_ADDREF/RELEASE the parent in SetParent.  That would probably
fix the crash but we get a leak instead?
Attached patch wip10Splinter Review
Here's a mParent solution for trunk... I think I can cook this down to a small
low-risk fix for branches.
Assignee: nobody → matspal
Attached patch Low-risk patch for 1.9.2 (obsolete) — Splinter Review
Keep a strong ref on the parent window, but don't use it for anything.
Attachment #435131 - Flags: review?(jmathies)
+    parent = aParent ? (HWND)aParent->GetNativeData(NS_NATIVE_WINDOW) : NULL;

nit - let's stick with nsnull here.

We're going to land this on m-c first for a bit right?
Attachment #435131 - Flags: review?(jmathies) → review+
Whiteboard: [sg:critical?] [no steps to reproduce] → [sg:critical?] [no steps to reproduce] [needs landing]
(In reply to comment #13)
> +    parent = aParent ? (HWND)aParent->GetNativeData(NS_NATIVE_WINDOW) : NULL;
> 
> nit - let's stick with nsnull here.

I'm sticking with the established convention here - to use NULL
for Windows types such as HWND, HCURSOR, HHOOK, HBRUSH etc
and nsnull for Mozilla types.

> We're going to land this on m-c first for a bit right?

I've landed the 1.9.2 patch on m-c for baking:
http://hg.mozilla.org/mozilla-central/rev/1fb768ca019e

Leaving the bug open to revert that and land the correct trunk patch later.
Whiteboard: [sg:critical?] [no steps to reproduce] [needs landing] → [sg:critical?] [no steps to reproduce]
Backed out changeset 1fb768ca019e due to unexpected leak:
1   nsBaseDragService
521 nsNativeDragTarget
Whiteboard: [sg:critical?] [no steps to reproduce] → [sg:critical?] [no steps to reproduce][ccbr]
Attached patch Low-risk patch for 1.9.2, v2 (obsolete) — Splinter Review
It appears there are circular strong references between
nsWindow/nsNativeDragTarget/nsDragService.
I haven't figured out exactly why it leaked but calling
'mDragService->EndDragSession(PR_FALSE)' when the DnD is canceled
and doing the widget Destroy() call in nsXULWindow::Destroy()
fixes it.  This patch is slightly more risky than the last,
but still low-risk I think.

TryServer unit tests pass without leaking.
I also tested the TryServer builds manually on all 3 platforms,
doing nasty stuff like detaching tabs with a plugin in them and such.
Attachment #435131 - Attachment is obsolete: true
Attachment #439455 - Flags: review?(jmathies)
+  else {
+    // Tell the drag service we're done so it can release objects.
+    mDragService->EndDragSession(PR_FALSE);
+  }

This feels like a bit of a hack. Is there any assurance DragOver will always get called after the drag operation canceled? What about unit tests?
Whiteboard: [sg:critical?] [no steps to reproduce][ccbr] → [sg:critical?] [no steps to reproduce][ccbr][critsmash:patch]
> Leaving the bug open to revert that and land the correct trunk patch later.

Please mark this bug as fixed when you reland the patch on trunk.  Leaving bugs open for followups makes it hard to track what security holes affect a given branch, and makes it hard to track regressions.  The new bug can be security-sensitive but [sg:nse].
Whiteboard: [sg:critical?] [no steps to reproduce][ccbr][critsmash:patch] → [sg:critical?] [no steps to reproduce][ccbr]
Whiteboard: [sg:critical?] [no steps to reproduce][ccbr] → [sg:critical?] [no steps to reproduce][ccbr][critsmash:patch]
Whiteboard: [sg:critical?] [no steps to reproduce][ccbr][critsmash:patch] → [sg:critical?] [no steps to reproduce][ccbr][critsmash:patch][needs mats to address review comment]
(In reply to comment #17)
> +  else {
> +    // Tell the drag service we're done so it can release objects.
> +    mDragService->EndDragSession(PR_FALSE);
> +  }
> 
> This feels like a bit of a hack. Is there any assurance DragOver will always
> get called after the drag operation canceled? What about unit tests?

Mats, is it ok if I take this bug over? Id like to dig into this a bit and understand why this is needed.
> Is there any assurance DragOver will always
> get called after the drag operation canceled?

Not that I can see, no.
So, here's a version that doesn't depend on reaching DragOver.
I'm replacing the mDragCancelled flag with a DragCancel() call
that does the cleanup directly.
Attachment #439455 - Attachment is obsolete: true
Attachment #446003 - Flags: review?(jmathies)
Attachment #439455 - Flags: review?(jmathies)
Comment on attachment 446003 [details] [diff] [review]
Low-risk patch for 1.9.2, v3

thanks for the update!
Attachment #446003 - Flags: review?(jmathies) → review+
What has to be done for trunk here? The mixture of trunk and branch comments is a bit confusing.
I landed the 1.9.2 patch on trunk for baking, but had to back out due
to unexpected Tinderbox orange on Linux.  The previously intermittent orange
bug 527099 and bug 551538 (Mooth), and bug 542580 (Mo5) became perma-orange.
I've debugged it I understand why it occurs with the patch -- adding the
nsWindow::Destroy call in nsXULWindow::Destroy makes SetPrompt(PR_FALSE),
which comes later, fail which in turn messes up window focus/activation.
I'm going to file a separate bug on this, hoping it'll fix those
"unable to restore focus" errors that is causing intermittent orange
on Linux.  I don't know yet if that fix will be safe for branch though.

I've pushed the patch without the explicit Destroy to Try to see if it
still leaks (we've changed mDragCancelled since then which might have helped).  I'll report back with the results...
I mean SetModal(PR_FALSE) of course, not SetPrompt.
BTW, that call is in nsXULWindow::ShowModal() after exiting the modal loop.
Yes, it still leaks nsNativeDragTarget without the explicit 
mWindow->Destroy().  :-(
http://tinderbox.mozilla.org/showlog.cgi?log=MozillaTry/1274669543.1274670507.10879.gz

Jim, feel free to figure out if there's a better way to plug that leak.
I'm going to see if I can fix the focus problem on Linux (bug 567704)
in a safe way so we can keep the Destroy() call.
There's a fix in bug 567704 for the focus problem, it passed TryServer
unit tests together with this patch.
Depends on: 567704
Whiteboard: [sg:critical?] [no steps to reproduce][ccbr][critsmash:patch][needs mats to address review comment] → [sg:critical?] [no steps to reproduce][ccbr][critsmash:patch][needs fix in bug 567704 first]
Is 1.9.1 affected by the same issue?
status1.9.1: --- → ?
All branches as far as I can tell from crash-stats.
Whiteboard: [sg:critical?] [no steps to reproduce][ccbr][critsmash:patch][needs fix in bug 567704 first] → [sg:critical?] [no steps to reproduce][ccbr][critsmash:patch]
Landed the lower-risk fix on trunk, will ask for branch approvals
(together with bug 567704) after baking it for a while.
http://hg.mozilla.org/mozilla-central/rev/9a3a1fa8233a

I filed bug 569583 for the additional (riskier) trunk changes (in wip10).
Status: NEW → RESOLVED
Closed: 14 years ago
status1.9.2: --- → ?
Resolution: --- → FIXED
Target Milestone: --- → mozilla1.9.3a5
Looks like this might have caused blocking regression bug 585287.
Blocks: 585287
Depends on: 585961
Blocks: 573408
No longer blocks: 573408
Depends on: 573408
Crash Signature: [@ nsWindow::GetParentWindow(int)]
Group: core-security → core-security-release
Group: core-security-release
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: