Closed Bug 422827 Opened 14 years ago Closed 14 years ago

crash on quit [@ libobjc.A.dylib@0x146e8][@ nsMenuX::RemoveAll()]

Categories

(Core :: Widget: Cocoa, defect, P2)

All
macOS
defect

Tracking

()

VERIFIED FIXED
mozilla1.9

People

(Reporter: samuel.sidler+old, Assigned: smichaud)

References

()

Details

(Keywords: crash, topcrash)

Crash Data

Attachments

(4 files, 3 obsolete files)

Our #22 topcrash (#6 on Mac) in beta 4 happens at libobjc.A.dylib@0x146e8, but more specifically, has nsMenuX::RemoveAll() fairly regularly in stacks. There might be other crashes at libobjc.A.dylib@0x146e8 intermixed with this, but comments indicate a lot of users are crashing on quit.

Sample stack from bp-018175d9-f072-11dc-b290-001a4bd43e5c:

Frame  	Signature  	Source
0 	libobjc.A.dylib@0x146e8 	
1 	AppKit@0x3574d 	
2 	AppKit@0x353d0 	
3 	AppKit@0x3640b 	
4 	nsMenuX::RemoveAll() 	mozilla/widget/src/cocoa/nsMenuX.mm:414
5 	nsMenuX::~nsMenuX() 	mozilla/widget/src/cocoa/nsMenuX.mm:111
6 	nsMenuX::Release() 	mozilla/widget/src/cocoa/nsMenuX.mm:87
7 	ReleaseObjects(void*, void*) 	nsCOMArray.cpp:151
8 	nsVoidArray::EnumerateForwards(int (*)(void*, void*), void*) 	nsVoidArray.cpp:678
9 	nsCOMArray_base::Clear() 	nsCOMArray.cpp:158
10 	nsMenuBarX::~nsMenuBarX() 	mozilla/widget/src/cocoa/nsMenuBarX.mm:217
11 	nsMenuBarX::Release() 	mozilla/widget/src/cocoa/nsMenuBarX.mm:67
12 	nsCOMPtr_base::~nsCOMPtr_base() 	nsCOMPtr.cpp:81
13 	nsCocoaWindow::~nsCocoaWindow() 	mozilla/widget/src/cocoa/nsCocoaWindow.mm:542
14 	nsBaseWidget::Release() 	mozilla/widget/src/xpwidgets/nsBaseWidget.cpp:69
15 	nsCOMPtr_base::assign_with_AddRef(nsISupports*) 	nsCOMPtr.cpp:531
16 	nsXULWindow::Destroy() 	mozilla/xpfe/appshell/src/nsXULWindow.cpp:713
17 	nsWebShellWindow::Destroy() 	mozilla/xpfe/appshell/src/nsWebShellWindow.cpp:840
18 	nsAppShellService::DestroyHiddenWindow() 	mozilla/xpfe/appshell/src/nsAppShellService.cpp:209
19 	nsAppStartup::DestroyHiddenWindow() 	mozilla/toolkit/components/startup/src/nsAppStartup.cpp:160
20 	ScopedXPCOMStartup::~ScopedXPCOMStartup() 	mozilla/toolkit/xre/nsAppRunner.cpp:898
21 	XRE_main 	mozilla/toolkit/xre/nsAppRunner.cpp:3190
22 	main 	mozilla/browser/app/nsBrowserApp.cpp:158
23 	start 	crt.c:272
24 	start 	
25 	@0x1 	


Sample comments with the same stack:

wasn't doing much. just quited, then used spaces to switch screens. also shut down Neo Office

Was just quitting Firefox

Program did not crash. I quit out of it.

printed a PDF (with no physical printer defined) ... then quit out of Firefox

gmail, using firebug, tried to close the browser and it crashed. firebug and gmail have issues, but both are so useful that i can't live without either of them

crashed on quit with 5 tabs opened

Closed Firefox. Two tabs were open.



My guess is that this is a Mac-only crash on quit.
Flags: blocking1.9?
This stack is a bit screwy -- nsMenuX::RemoveAll() just returns
NS_ERROR_NOT_IMPLEMENTED.

But (allowing for some corruption) I'd guess that mMenusArray has
gotten out of date, and contains some items that have already been
destroyed.
Here's a stack trace, made in gdb, using a build with debug symbols.

I'm going to try to find an STR.
STR:

1) Open Minefield.

2) Choose File : Print and then PDF : Save as PDF.

3) Choose Save

4) Choose Quit from the Minefield menu or do command-q.

5) Crash
This STR only "works" on OS X 10.5.2 (not on 10.4.11).
> printed a PDF (with no physical printer defined) ... then quit out of Firefox

Bingo!
Flags: blocking1.9? → blocking1.9+
Priority: -- → P2
This is an old bug, and has a complex regression history.

My STR from comment #3 "works" in the same way (with minor variations)
back to the 2006-11-30-04-trunk Minefield nightly.

Between the 2006-11-22-04-trunk Minefield nightly and the
2006-11-29-04-trunk nightly (inclusive), you get a crash under
DeviceContextImpl::Release() immediately upon printing.

The last nightly where printing works entirely normally (on OS X
10.5.2) is the 2006-11-21-07-trunk nightly.

So what probably started the trouble was the patch for bug 323934
("turn on cocoa-cairo by default"), possibly assisted by the patch for
"fixing cocoa-cairo plugins" (bug number is given as 35732, which must
be wrong).

And what probably triggered/caused the current problem is a change
(among other things) to nsMenuX.mm that has nothing specific to do
with printing -- the patch for bug 351230.
Here's another gdb stack trace, showing that this kind of crash can be
triggered by an Objective-C exception.

My hunch is that the specific exception will vary.
Assignee: joshmoz → smichaud
I've got a revision of my STR from comment #3.  I've found that it
only works if you've already printed to PDF in the same location, and
choose Replace to replace the previous file.

1) Open Minefield.

2) Choose File : Print and then PDF : Save as PDF.

3) Choose Save, then choose to replace the already existing file with
   the same name.

4) Choose Quit from the Minefield menu or do command-q.

5) Crash
Here's a fix for this bug (bug 422827) and bug 423669.

Or more accurately, it's a workaround for an obscure Apple bug that's
specific to Leopard (OS X 10.5.X).  (For more details see the comments
in the patch.)

This patch is definitely hackish (because it digs into undocumented
stuff).  But there's no other way to get this particular problem fixed
quickly.  And the patch itself is low-risk:  It's conceptually very
simple, and will fail gracefully if Apple changes the "signatures" of
the undocumented methods that it "hooks" (if that happens the patch
will simply have no effect).

I hope that Apple will eventually fix this problem.  And when they do
this patch can be backed out.  But (as best I can tell) this bug
doesn't effect any Apple products, so they don't have much incentive
to be quick in their work.

Here's a tryserver build made with this patch:

https://build.mozilla.org/tryserver-builds/2008-03-20_13:11-smichaud@pobox.com-bugzilla422827/smichaud@pobox.com-bugzilla422827-firefox-try-mac.dmg
Attachment #310834 - Flags: review?(joshmoz)
Comment on attachment 310834 [details] [diff] [review]
Fix (workaround for an Apple bug)

Oops, I made a dumb mistake in this patch.  I'll post a new one (and a
new tryserver build) shortly.
Attachment #310834 - Flags: review?(joshmoz)
Attached patch Fix rev1 (fix bad leak) (obsolete) — Splinter Review
This new patch fixes a bad leak.

Here's a tryserver build made with it:

https://build.mozilla.org/tryserver-builds/2008-03-20_15:06-smichaud@pobox.com-bugzilla422827-rev1/smichaud@pobox.com-bugzilla422827-rev1-firefox-try-mac.dmg
Attachment #310834 - Attachment is obsolete: true
Attachment #310864 - Flags: review?(joshmoz)
Duplicate of this bug: 423669
(In reply to comment #11)
> Created an attachment (id=310864) [details]
> Fix rev1 (fix bad leak)
> 
> This new patch fixes a bad leak.
> 
> Here's a tryserver build made with it:

Note: this tryserver build Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US; rv:1.9b5pre) Gecko/2008032015 Minefield/3.0b5pre has fixed my crash with a similar/same stack in bug 423669
Josh, I think we should consider bumping this bug to P1 and including
my patch in beta5.  This crash _really_ is a topcrasher, and I'm
pretty sure there are lots of other ways to trigger it than the ones
we've already found.

Look at individual crash reports at the following two URLs:

http://crash-stats.mozilla.com/report/list?range_unit=weeks&branch=1.9&range_value=2&signature=objc_msgSend
http://crash-stats.mozilla.com/report/list?range_unit=weeks&branch=1.9&range_value=2&signature=libobjc.A.dylib%400x146e8
OK.  I'll bump this to a P1, and we need to get josh to review this ASAP.  I'll make sure he's aware of the issue (i.e., I'm dialing his cell right now).
Priority: P2 → P1
Target Milestone: --- → mozilla1.9beta5
OK.  Josh will look at this tonight and get the review done as soon as possible.  Note here to beltzner and josh, it's entirely possible this isn't a beta blocker, so feel free to bump this down if you disagree.  I just want to get this on the radar of the beta if it is indeed a #6 top crasher with lots of possible entry points.  
And... adding beltzner to cc.  

Beltzner, take a look at Comment #16.
> #6 top crasher

Actually, if you count both URLs from comment #14 together (as you
probably should), and if you discount the
nsObjCExceptionLogAbort(NSException*) topcrasher (which can have many
causes and may be tailing off), this is now the #1 Mac topcrasher.
Priority: P1 → P2
Target Milestone: mozilla1.9beta5 → ---
Oops, didn't mean to take this off the beta blocker list.
Priority: P2 → P1
Target Milestone: --- → mozilla1.9beta5
I started reviewing this, it is going to take me more time than I have tonight. I don't think this needs to be a beta blocker, moved to P2.
Priority: P1 → P2
Target Milestone: mozilla1.9beta5 → mozilla1.9
Even though this is (probably) the #1 Mac topcrasher, my patch should
bake on the trunk for a few days before it gets into a "release".  If
we don't have those few days before beta5 (including at least one
beta5 "RC"), I'd agree with Josh that this should be postponed until
the first RC.

(For all its apparent complexity, this is conceptually a very simple
and well-isolated fix.  It should be fine to land it in an RC.)
Attached patch Fix rev2 (fix potential problem) (obsolete) — Splinter Review
The call to super in nsMenuX_nsMenu_removeItem should come first.  Otherwise there's a (small) possibility that aItem will have been deleted by the time the call is made.

I'll post another tryserver build shortly.
Attachment #310864 - Attachment is obsolete: true
Attachment #310990 - Flags: review?(joshmoz)
Attachment #310864 - Flags: review?(joshmoz)
Severity: critical → major
Crashes always get "critical" severity.
Severity: major → critical
In an attempt to get a better idea of whether or not this is really
our #1 Mac topcrasher, I searched for all 1.9-branch and platform-OS-X
crashes with exactly "nsMenuX::RemoveAll()" in the top 10 stack
frames, over the last month.

I got a total of just under 4200 ... which tends to confirm what I said.

http://crash-stats.mozilla.com/?do_query=1&branch=1.9&platform=mac&query_search=stack&query_type=exact&query=nsMenuX%3A%3ARemoveAll%28%29&date=&range_value=1&range_unit=months
Oops -- for comparison with the topcrasher stats at http://crash-stats.mozilla.com/topcrasher/bybranch/1.9 I really should have chosen "2 weeks" instead of "1 month".

The total for 2 weeks is about 2650 -- impressive but still probably not the #1 topcrasher.

http://crash-stats.mozilla.com/?do_query=1&branch=1.9&platform=mac&query_search=stack&query_type=exact&query=nsMenuX%3A%3ARemoveAll%28%29&date=&range_value=2&range_unit=weeks
+// Class for numerical-index keys in gShadowKeyEquivDB (each is the unsigned
+// int equivalent of an NSMenuItem's address).  Using this kind of key is more

Needs a comment about what to do for 64-bit, or better yet, a 64-bit-safe solution.

+    gShadowKeyEquivDB = [[NSMutableDictionary alloc] init];
+  if (!gShadowKeyEquivDB)

Generally we assume those allocations succeed.

+  if ((mTables = [[NSMutableIndexSet alloc] init]) == nil)

Same here. Also I don't really like comparisons against assignments, no reason not to just make it more clear with one extra line of code. Irrelevant here since that test can just go away.

Can we move most of the additions to nsMenuX.mm to the bottom of the file?

We need to file a bug with Apple about this, please paste the text of your filing here if you can.
I tried to make this completely 64-bit safe ... but wasn't able to.
There's an NSNumber method that I'd have needed to use that's only
available when building on OS X 10.5.X (or presumably with a 10.5
SDK).

But I _did_ put in defintions for NSInteger and NSUInteger, which I
"borrowed" the QTKit framework's QTKitDefines.h (the only place
they're defined on 10.4).  I've assumed that we'll only ever be
building 32-bit binaries on 10.4, which I think is both reasonable and
true.  Please let me know, Josh, if you think otherwise.

I chose not wrap these definitions in the ifdef you suggested, Josh
(#if MAC_OSX_VERSION_MAX_ALLOWED <= MAC_OSX_VERSION_10_4) because I'm
not entirely sure how that's supposed to work.  (On the face of it, it
seems to be intended for code that's not supposed to run on 10.5.)
Attachment #310990 - Attachment is obsolete: true
Attachment #311648 - Flags: review?(joshmoz)
Attachment #310990 - Flags: review?(joshmoz)
Another way to trigger this problem (in current nightly builds):

1) Run Minefield and close all windows.

2) Do Command-o or File : Open File -- a native app-modal file picker
   will appear.

3) Do Command-h or Minefield : Hide Minefield.

4) Click on the Minefield doc icon -- Minefield will be unhidden, a
   new browser window will appear under the file picker (still
   displayed), and the file picker will lose focus to the browser
   window.

5) Click on the file picker's Cancel button.

6) Quit (Command-q or Minefield : Quit).

7) Crash.
(Following up comment #29)

Forgot to mention that my patch also fixes this.
Using Steven's tryserver build in Comment 31 and following the STR in Comment 29, I do not crash. Build ID: Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US; rv:1.9b5pre) Gecko/2008032504 Minefield/3.0b5pre.
Here's the Apple bug report that I'm about to post:

For some time OS X has had an undocumented NSKeyEquivalents protocol
of the NSMenu interface.  There appear to problems with two of these
methods on OS X 10.5.X (or with the code that calls them):

+ (void)_addItem:(NSMenuItem *)aItem toTable:(NSMapTable *)aTable
+ (void)_removeItem:(NSMenuItem *)aItem fromTable:(NSMapTable *)aTable

There have been many cases (thousands) of crashes in Firefox 3
pre-release builds in the second of those two methods
(_removeItem:fromTable:), when code called from _removeItem:fromTable:
tries to access an NSMenuItem object that has already been deleted
(sometimes (always?) when it tries to send a _setChangedFlags: message
to that object).

These crashes happen only on OS X 10.5.X (never on 10.4.X).  And they
stop happening when I "hook" these methods (via method swizzling) and
add code to "retain" aItem on calls to _addItem:toTable:, and to
"release" it on calls to _removeItem:fromTable:.

That this simple change clears up the problem tells me that there are
bugs in Apple's code -- either in these two methods, or in the code
that calls these methods.

Apple's implementations of _addItem:toTable: and
_removeItem:fromTable: appear to assume that a call to the first (on a
given NSMenuItem) will always be matched by a call to the second (on
the same NSMenuItem before it gets deleted).  So these methods don't
retain and release the NSMenuItem.

But this assumption isn't always true.  The crashes in
_removeItem:fromTable: in Firefox 3 pre-release builds (when it's
called on a different NSMenuItem object) show that a NSMenuItem object
can be "added", and then deleted, without ever having been "removed".

This problem is being tracked in Mozilla.org's bug database at the
following URL:

https://bugzilla.mozilla.org/show_bug.cgi?id=422827
My report's been given radar #5820190.
Attachment #311648 - Flags: review?(joshmoz) → review+
Attachment #311648 - Flags: superreview?(vladimir)
Attachment #311648 - Flags: superreview?(vladimir) → superreview+
Landed on trunk:

Checking in widget/src/cocoa/nsCocoaUtils.h;
/cvsroot/mozilla/widget/src/cocoa/nsCocoaUtils.h,v  <--  nsCocoaUtils.h
new revision: 1.6; previous revision: 1.5
done
Checking in widget/src/cocoa/nsMenuX.mm;
/cvsroot/mozilla/widget/src/cocoa/nsMenuX.mm,v  <--  nsMenuX.mm
new revision: 1.65; previous revision: 1.64
done
Checking in widget/src/cocoa/nsToolkit.h;
/cvsroot/mozilla/widget/src/cocoa/nsToolkit.h,v  <--  nsToolkit.h
new revision: 1.18; previous revision: 1.17
done
Checking in widget/src/cocoa/nsToolkit.mm;
/cvsroot/mozilla/widget/src/cocoa/nsToolkit.mm,v  <--  nsToolkit.mm
new revision: 1.32; previous revision: 1.31
done
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Backed out because of a leak test failure ... which now appears to have been spurious.

I'll wait for a while and try again.
Landed again on trunk.  This time no leak test failure.

Checking in widget/src/cocoa/nsCocoaUtils.h;
/cvsroot/mozilla/widget/src/cocoa/nsCocoaUtils.h,v  <--  nsCocoaUtils.h
new revision: 1.8; previous revision: 1.7
done
Checking in widget/src/cocoa/nsMenuX.mm;
/cvsroot/mozilla/widget/src/cocoa/nsMenuX.mm,v  <--  nsMenuX.mm
new revision: 1.67; previous revision: 1.66
done
Checking in widget/src/cocoa/nsToolkit.h;
/cvsroot/mozilla/widget/src/cocoa/nsToolkit.h,v  <--  nsToolkit.h
new revision: 1.20; previous revision: 1.19
done
Checking in widget/src/cocoa/nsToolkit.mm;
/cvsroot/mozilla/widget/src/cocoa/nsToolkit.mm,v  <--  nsToolkit.mm
new revision: 1.34; previous revision: 1.33
done
Duplicate of this bug: 425585
Verified fixed using "my testcase" from Bug 425585 and Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.5; en-US; rv:1.9pre) Gecko/2008032804 Minefield/3.0pre

Crashed with Beta 5 RC2 - does not crash with the Nightly --> Verified fixed
Status: RESOLVED → VERIFIED
Crash Signature: [@ libobjc.A.dylib@0x146e8] [@ nsMenuX::RemoveAll()]
You need to log in before you can comment on or make changes to this bug.