Closed Bug 479749 Opened 11 years ago Closed 11 years ago

Crash with MozMill test when entering and leaving Private Browsing mode immediately [@ libobjc.A.dylib@0x15688][@ FrameIsInActiveWindow][@ nsNativeThemeCocoa::GetScrollbarDrawInfo]

Categories

(Core :: Widget: Cocoa, defect, P2, critical)

x86
macOS
defect

Tracking

()

VERIFIED FIXED
mozilla1.9.2a1

People

(Reporter: whimboo, Assigned: smichaud)

References

Details

(Keywords: crash, testcase, verified1.9.1, Whiteboard: [sg:critical?][mozmill])

Crash Data

Attachments

(5 files, 2 obsolete files)

Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US; rv:1.9.2a1pre) Gecko/20090222 Minefield/3.2a1pre

With my upcoming MozMill test for entering/leaving the Private Browsing mode (see bug 479314) I'm able to crash the browser reproducible. Following three lines seem to be the cause:

controller.click(new elementslib.Elem(controller.menus['tools-menu'].privateBrowsingItem));  
controller.sleep(1000);
controller.click(new elementslib.Elem(controller.menus['tools-menu'].privateBrowsingItem));

As you can see I enter the PB mode, wait 1000ms, and leave the PB mode again.

Crash report: bp-0206bd86-6c9d-4091-9e6c-d97872090223

Here the first 10 frames:

0  	libobjc.A.dylib  	libobjc.A.dylib@0x15688  	
1 	XUL 	nsNativeThemeCocoa::GetScrollbarDrawInfo 	widget/src/cocoa/nsNativeThemeCocoa.mm:1212
2 	XUL 	nsNativeThemeCocoa::GetMinimumWidgetSize 	widget/src/cocoa/nsNativeThemeCocoa.mm:2182
3 	XUL 	nsIFrame::AddCSSMinSize 	layout/xul/base/src/nsBox.cpp:735
4 	XUL 	nsBoxFrame::GetMinSize 	layout/xul/base/src/nsBoxFrame.cpp:873
5 	XUL 	nsBoxFrame::GetPrefSize 	layout/xul/base/src/nsBoxFrame.cpp:824
6 	XUL 	nsSprocketLayout::GetPrefSize 	layout/xul/base/src/nsSprocketLayout.cpp:1337
7 	XUL 	nsBoxFrame::GetPrefSize 	layout/xul/base/src/nsBoxFrame.cpp:818
8 	XUL 	nsSliderFrame::GetPrefSize 	layout/xul/base/src/nsSliderFrame.cpp:1107
9 	XUL 	nsSprocketLayout::GetPrefSize 	layout/xul/base/src/nsSprocketLayout.cpp:1337
10 	XUL 	nsBoxFrame::GetPrefSize 	layout/xul/base/src/nsBoxFrame.cpp:818
Henrik, does it happen only with the private browsing menu item, or is it reproducible with all other items?
Also, FWIW you can switch to using the shortcut keys for now if you want to proceed with the Mozmill test until this gets fixed.
Attached file stack with gdb
This could be security related. Continuing twice after the assertion results in a bad access:

(gdb) c
Continuing.
###!!! ASSERTION: XPConnect is being called on a scope without a 'Components' property!: 'Error', file /data/mozilla/build/mozilla-central/mozilla/js/src/xpconnect/src/xpcwrappednativescope.cpp, line 764

Program received signal SIGTRAP, Trace/breakpoint trap.
0x947b5e42 in __kill ()
(gdb) c
Continuing.
Reading symbols for shared libraries . done
++WEBSHELL 0x1e63b240 == 15
++DOMWINDOW == 32 (0x1e63bbf0) [serial = 36] [outer = 0x0]
++DOMWINDOW == 33 (0x1e63f300) [serial = 37] [outer = 0x1e63bbc0]
WARNING: Unable to test style tree integrity -- no content node: file /data/mozilla/build/mozilla-central/mozilla/layout/base/nsCSSFrameConstructor.cpp, line 9043
WARNING: GetGeckoKeyCodeFromChar has failed.: file /data/mozilla/build/mozilla-central/mozilla/widget/src/cocoa/nsChildView.mm, line 4350
WARNING: Unable to test style tree integrity -- no content node: file /data/mozilla/build/mozilla-central/mozilla/layout/base/nsCSSFrameConstructor.cpp, line 9043

Program received signal EXC_BAD_ACCESS, Could not access memory.
Reason: KERN_PROTECTION_FAILURE at address: 0x00000020
0x94586688 in objc_msgSend ()
Group: core-security
I'll try to get a minimized testcase for tomorrow. Right now I haven't it tested with other menu items but I think it's PB mode related.
I guess the crash in bp-0206bd86-6c9d-4091-9e6c-d97872090223 is
actually taking place in FrameIsInActiveWindow(), and is caused by
calling objc_msgSend on a deleted 'win' object.

But we'd be a lot better off if we had a reproducible testcase.
Whiteboard: [sg:investigate]
Blake: Is the assertion in comment 4 dangerous?
Attached file mozmill test
With this test I'm able to crash Firefox each time we leave the PB mode. The following steps you have to do:

1. Create a fresh profile
2. Install MozMill 1.0.2 (https://addons.mozilla.org/de/firefox/addon/9018)
3. Open MozMill IDE and load test
4. Run test

If Firefox doesn't crash please close all windows and restart Firefox. Do the steps 3 and 4 again. No idea but probably it happens because of the "Know your rights" notification bar.
Keywords: testcase
Here's a gdb stack trace that shows my guess from comment #6 is
correct.

I can easily reproduce Henrik's STR from comment #8, though it always
takes me two tries.  (I	do the STR once (on a "new" build) and get no
crash.  Then I close all windows, restart Firefox, and go through the
STR again (with the same build) -- this time I always crash.)

I quickly found a simple fix.  But I became dissatisfied with it and
have been trying to dig deeper.  Now I think I have a better fix, and
I'll post a patch tomorrow.

As you'll see, this is primarily a Cocoa widgets bug.  But it seems to
be triggered by the contents of a window being destroyed in an unusual
order -- the crashes happen when an nsCocoaWindow object is destroyed
before the nsChildView objects it contains.
Whiteboard: [sg:investigate] → [sg:critical?]
Attached patch Partial fix (obsolete) — Splinter Review
Here's my patch -- the one that fixes the crash Henrik reported and
for which I posted a gdb stack trace in comment #9.

A tryserver build will follow in a while.

But we're not out of the woods yet.  Now that the first crash is
cleared up, I'm getting a second, unrelated crash (still using
Henrik's STR from comment #8).  I'll post a gdb trace of the second
crash in my next comment.

Sounds like MozMill will keep us busy for quite a while :-)
Here's a gdb stack trace (with debug symbols) of this bug's second
crash.  As you can see, it's also from accessing a deleted object (in
this case an NSView object -- the mWidget object associated with an
nsThebesDeviceContext object).

Turns out this crash is a little harder to reproduce than the first
one (though only a little):  Before you use MozMill to run the
test_Crash.js test, you need to load some bug page at
bugzilla.mozilla.org -- this bug's page works just fine.
OK, this patch seems to do the trick.  I've tested it with a window
containing several very busy pages loaded into different tabs, and I
didn't see any problems.

I haven't (yet) tested it in a debug build.  And I suspect it won't
fix the assertion mentioned in comment #4 ... but that's probably not
a Cocoa widgets problem.

(I also suspect that other widgets implementations may have trouble
with this bug's STR.)

A tryserver build will follow in a while (probably quite a long
while).

Henrik, please test this patch as hard as you can, and let us know if
there's anything I missed.
Attachment #363901 - Attachment is obsolete: true
Attachment #363923 - Flags: review?(joshmoz)
(In reply to comment #10)
> Sounds like MozMill will keep us busy for quite a while :-)

That's great. I'll provide more of those tests. Believe me.

(In reply to comment #12)
> Henrik, please test this patch as hard as you can, and let us know if
> there's anything I missed.

Will do and even check it on Windows and Linux. I should better file new bugs if it can be seen on other platforms too, right?
Assignee: joshmoz → smichaud
Status: NEW → ASSIGNED
Flags: blocking1.9.1?
Comment on attachment 363923 [details] [diff] [review]
Fix for Cocoa-widgets-specific problems

Comment on my own patch:

This:

  mContext->Init(nsnull);

should really be this:

  if (mContext)
    mContext->Init(nsnull);

I'll fix it in what I land (if I haven't done so in an earlier
revision).
Side note (about nsCocoaWindow/nsChildView parents and children):

I saw in my tests for this bug that (as best I can tell) an
nsCocoaWindow object can never have nsChildView children
(nsCocoaWindow::GetFirstChild() and nsCocoaWindow::GetLastChild() will
never return an nsChildView object).  And an nsChildView object can
never have an nsCocoaWindow parent (nsChildView::GetParent() will
never return an nsCocoaWindow object).

If true, this is puzzling.  And I don't think it's always been true.

Josh and/or Roc, do you have any idea what this is all about?

The following code exists at the start of the nsCocoaWindow
destructor:

  // notify the children that we're gone
  for (nsIWidget* kid = mFirstChild; kid; kid = kid->GetNextSibling()) {
    nsCocoaWindow* childWindow = static_cast<nsCocoaWindow*>(kid);
    childWindow->mParent = nsnull;
  }

Notice that it assumes any child of an nsCocoaWindow object will also
be an nsCocoaWindow object.  When I saw this (on my way through
tracking down the first crash), I figured this code must be wrong, and
that I could fix the crash by adding code that made 'kid' an
nsChildView object if its nsWindowType was 'eWindowType_child'.

But it didn't fix the crash, and in fact was never exercised.

Then I noticed that, in nsNativeThemeCocoa.mm's
NativeWindowForFrame(), topLevelWidget is always an nsChildView
object.  Which means that it isn't really a "top level widget", as far
as I can tell.
(In reply to comment #13)

> I should better file new bugs if it can be seen on other platforms
> too, right?

Yes, I think that would be best.
I've now done some tests with debug builds -- an ordinary one (without
my patch), and a debug build containing my patch.

I don't see the assertion from comment #4 in either of them.  So I
think it may be an unrelated problem.
(In reply to comment #7)
> Blake: Is the assertion in comment 4 dangerous?

No. It happens when you use an XPConnect property or object from a window that has been closed. We won't crash, although most properties will be missing, so things might not work as expected.
This tryserver build works fine. Running the crash test for several times doesn't crash the browser anymore. Builds without this patch on Windows and Linux aren't affected. So it qualifies for an OS X only bug. Thanks Steven!
Summary: Crash with MozMill test when entering and leaving Private Browsing mode immediately [@ libobjc.A.dylib@0x15688][@ nsNativeThemeCocoa::GetScrollbarDrawInfo] → Crash with MozMill test when entering and leaving Private Browsing mode immediately [@ libobjc.A.dylib@0x15688][@ FrameIsInActiveWindow][@ nsNativeThemeCocoa::GetScrollbarDrawInfo]
Whiteboard: [sg:critical?] → [sg:critical?][has patch][needs review josh]
Steven, your proposed patch looks alright if we do need to protect against such a destruction order, but I'm concerned that this might just be covering up another bug. It seems likely that destroying a widget before its children is the real bug and if widget implementations shouldn't have to protect against such a destruction order I'd rather not. It adds complexity and overhead and will cover up destruction order bugs in the future.

roc - can you weigh in on this?
Josh and Roc, please also look at my question in comment #15.
I would also have expected the children of nsCocoaWindow to be nsChildViews. Maybe no children are actually hooked up? This was added by Josh in bug 362952.

I'd like to know what situations cause the parent widget to be destroyed before its children, but it looks like we've been doing it for a long time --- isn't that what bug 362952 was about? --- so I'm inclined to take Steven's patch as-is.

The long-term solution is of course to get rid of child widgets so these issues will not arise.
Flags: blocking1.9.1? → blocking1.9.1+
Priority: -- → P2
Comment on attachment 363923 [details] [diff] [review]
Fix for Cocoa-widgets-specific problems

+  mContext->Init(nsnull);

We should probably do that in ::Destroy for the base widget since it is the base widget that contains mContext. In any case, the location of this null init strikes me as out-of-place, the product of a deeper design flaw. Either the base widget should be doing a better job cleaning up after itself or we need to do better null checking for the widget's validity in other places.

In "ensureWindowData" you're making sure mWindow is valid, then after a couple calls that wouldn't change validity there is another mWindow null check. You can remove the latter pre-existing null check.
Attachment #363923 - Flags: review?(joshmoz)
>+  mContext->Init(nsnull);
>
> We should probably do that in ::Destroy for the base widget since it
> is the base widget that contains mContext.

Agreed.  My new patch does this.

> In any case, the location of this null init strikes me as
> out-of-place, the product of a deeper design flaw. Either the base
> widget should be doing a better job cleaning up after itself or we
> need to do better null checking for the widget's validity in other
> places.

I'm not sure I understand.  nsIWidget::Destroy() is exactly the place
for this kind of thing -- making sure resources we've passed to
external modules (in this case the (weak) pointer to ourselves used to
initialize an nsIDeviceContext) are "invalidated" when we're
destroyed.

> In "ensureWindowData" you're making sure mWindow is valid, then
> after a couple calls that wouldn't change validity there is another
> mWindow null check. You can remove the latter pre-existing null
> check.

Fixed.

Side note:  When I moved mContext->Init(nsnull) to
nsBaseWidget::Destroy(), at first it seemed not to work (it no longer
fixed the second crash).  Then I discovered that
nsChildView::Destroy() calls nsBaseWidget::OnDestroy() before
nsBaseWidget::Destroy() (and nsBaseWidget::OnDestroy nulls out
mContext).

This is wrong, and isn't how any of the other widgets implementations
work.  All of them (except Cocoa widgets) use Destroy() to tear down
the widget, and then use OnDestroy() to signal that the job is done
(so that the base widget can release its own resources).  In fact the
OS2 nsWindow destructor has a nice little comment explaining this:  "A
call of Destroy() destroys the PM window.  This triggers an
OnDestroy(), which frees resources."

I've changed nsChildView::Destroy() and nsCocoaWindow::Destroy()
accordingly.

A tryserver build will follow in a few hours.
Attachment #363923 - Attachment is obsolete: true
Attachment #366029 - Flags: review?(joshmoz)
"I'm not sure I understand.  nsIWidget::Destroy() is exactly the place for this kind of thing"

I was only objecting to the actual class it was in (nsChildView) - nsBaseWidget::Destroy might very well be fine.
Attachment #366029 - Flags: review?(joshmoz) → review+
Whiteboard: [sg:critical?][has patch][needs review josh] → [sg:critical?][has patch]
Attachment #366029 - Flags: superreview?(roc)
Attachment #366029 - Flags: superreview?(roc) → superreview+
Landed on trunk:
http://hg.mozilla.org/mozilla-central/rev/253c445cc416
Status: ASSIGNED → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
Whiteboard: [sg:critical?][has patch] → [sg:critical?]
Verified fixed with Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US; rv:1.9.2a1pre) Gecko/20090310 Minefield/3.2a1pre. Thanks Steven!

How long should it bake on trunk before we wanna get this on 1.9.1?
Status: RESOLVED → VERIFIED
Target Milestone: --- → mozilla1.9.2a1
> How long should it bake on trunk before we wanna get this on 1.9.1?

A week?

> Thanks Steven!

You're most welcome.
Whiteboard: [sg:critical?] → [sg:critical?][needs baking on trunk]
Comment on attachment 366029 [details] [diff] [review]
Fix rev1 (follow Josh's suggestions)

This has now been baking on the trunk for a week, with no reported
problems.
Attachment #366029 - Flags: approval1.9.1?
(Following up comment #31)

This patch is basically trivial -- the fix for the second crash
(comment #11) really is trivial, and the fix for the first crash
(comment #9) is little more than the extension of a previous fix (that
for bug 362952).  So it's low risk.

MozMill tests cannot (yet) be automated.
The last part of this patch:

+  // Clear the device context's mWidget field -- otherwise it may get accessed
+  // after it's been deleted.  See bug 479749.
+  if (mContext)
+    mContext->Init(nsnull);

caused bug 482928, which causes popups to appear on the wrong monitor after a tab is closed. In this case, the device context being cleared is still being used, and popups no longer have a widget from which to calculate the screen from.
Attachment #366029 - Flags: approval1.9.1?
(In reply to comment #33)

So I suppose that part of the patch should be backed out.

This will partly reopen this bug -- the "second crash" from comment
#11 will start happening again.  I don't know how often this crash
happens (apparently it's no longer possible to search in the top ten
crash stack entries at crash-stats.mozilla.com).  But I suspect it's a
less serious problem the one this part of my patch has triggered.
At least the new problem is more visible.

I don't have any way to test with multiple monitors, so I'm not the
best person to try to come up with an alternative.
(Following up comment #34)

But it's odd that the code which decides which monitor a popup should
appear on relies on a widget that's already been Destroy()ed.

So I think that part of my patch just uncovered one or more other
bugs, and didn't itself cause this problem.
My understanding is that device contexts are shared between widgets. At least the view code suggests that widgets are initialized with contexts that come from the view manager.
And so it's odd that a device context that refers to a Destory()ed widget is still around.
The "second crash" happens because the device context's mWidget
field contains an NSView object that's been deleted.
> an NSView object

A ChildView object (which is a subclass of NSView).
So what actions are necessary now? Backing-out a part of the patch or coming with a follow-up?
Blocks: 479314
Before anything else, we need to resolve the ambiguities WRT bug
482928.

Does bug 482928 happen on both the trunk and the 1.9.1 branch?

Henrik, are you once again able to reproduce bug 482928 anywhere (on
the trunk and/or the 1.9.1 branch)?

I can't help with this, since I can't test with multiple monitors.
Bug 482928 only happens on trunk and yes, I'm able to reproduce it constantly. There are already a lot of bug reports filed in the last days. But this should be handled in bug 482928. I just wondered about your statement if this should be backed-out. I'll watch bug 482928...
Henrik, what you're saying is still partly ambiguous:

Are you currently able to reproduce bug 482928 on the 1.9.1 branch?

That's directly relevant to this bug -- since if bug 482928 happens on
the 1.9.1 branch, bug 482928 wasn't triggered by my patch for this
bug.

As for partially backing out my patch for this bug (if it *did*
trigger bug 482928):

I made this suggestion (in comment #34) before I'd fully thought
through the problem.  Shortly afterwards (in comment #35) I realized
that my patch, though it may have triggered bug 482928, almost
certainly didn't *cause* it -- instead, it seems to have uncovered one
or more other bugs that are the true cause of bug 482828.

As I said above, I don't have multiple monitors to test with.  And I'm
currently taking time off from Mozilla Corp to get the Java Embedding
Plugin working on SnowLeopard.

But in a couple of weeks I should have more time to work on this bug
(and bug 482928).  This may be needed ... if nobody else is
able/willing to try to track down the true cause(s) of bug 482928.
Steven, as I have said in comment 43, it only happens on trunk. So no, I'm not able to reproduce it on 1.9.1. Take your time...
> Steven, as I have said in comment 43, it only happens on trunk. So
> no, I'm not able to reproduce it on 1.9.1.

Sorry.  What you said in comment #42 *wasn't* ambiguous.  I somehow
missed the "only" :-(

> Take your time...

I will :-)
Let's get this landed and watch out for bug 482928 there as soon as it does.
Landed on the 1.9.1 branch:
http://hg.mozilla.org/releases/mozilla-1.9.1/rev/41b607a57022

There's now a working patch for bug 482928.  With luck it'll get landed on the 1.9.1 branch before the beta4 code freeze.
Keywords: fixed1.9.1
Whiteboard: [sg:critical?][needs baking on trunk] → [sg:critical?]
Verified fixed with Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US; rv:1.9.1b4pre) Gecko/20090410 Shiretoko/3.5b4pre ID:20090410035055. Thanks Steven!
Whiteboard: [sg:critical?] → [sg:critical?][mozmill]
Crash Signature: [@ libobjc.A.dylib@0x15688] [@ FrameIsInActiveWindow] [@ nsNativeThemeCocoa::GetScrollbarDrawInfo]
Group: core-security
Crash Signature: [@ libobjc.A.dylib@0x15688] [@ FrameIsInActiveWindow] [@ nsNativeThemeCocoa::GetScrollbarDrawInfo] → [@ libobjc.A.dylib@0x15688] [@ FrameIsInActiveWindow] [@ nsNativeThemeCocoa::GetScrollbarDrawInfo]
You need to log in before you can comment on or make changes to this bug.