Closed Bug 582466 Opened 14 years ago Closed 14 years ago

Crash in [@ pthread_mutex_lock] called by mozilla::TimeStamp::Now on shutdown on OS X

Categories

(Core :: Widget: Cocoa, defect)

x86
macOS
defect
Not set
critical

Tracking

()

RESOLVED FIXED
mozilla2.0b7
Tracking Status
blocking2.0 --- final+

People

(Reporter: jrmuizel, Assigned: smichaud)

References

Details

(Keywords: crash, regression, topcrash)

Crash Data

Attachments

(1 file)

This is fallout from bug 558306.

Looks like the timer thread is still running after TimeStamp::Shutdown has been called.  So our lock is null, and we crash.  But Shutdown is a static object's destructor; I'm a little surprised we call it before joining the TimerThread...  Chris, any idea what's up with that?
Blocks: 558306
blocking2.0: --- → ?
Component: General → XPCOM
QA Contact: general → xpcom
Severity: normal → critical
Keywords: crash, regression
Summary: Crash in pthread_mutex_lock called by mozilla::TimeStamp::Now on shutdown → Crash in [@ pthread_mutex_lock] called by mozilla::TimeStamp::Now on shutdown
I don't see any code calling TimerThread::Shutdown(), where the join is supposed to happen.
Scratch that, I found its calling context in gdb.  The nsThread::Shutdown code looks OK to me (assuming PR_JoinThread() is working correctly).  I wonder if something is exit()ing the main thread, or perhaps raising an exception?  That might cause static dtors to be run without joining the timer thread.  I'm not sure how implicit joins are ordered wrt other at-exit() code on Mac.

Jeff, it would help a lot if you could check whether we get to TimerThread::Shutdown().
cjones: if you load crash stats you can click to see other threads,

Thread 0
Frame 	Module 	Signature [Expand] 	Source
0 	XUL 	__tcf_2 	js/src/xpconnect/wrappers/FilteringWrapper.cpp:149
1 	libSystem.B.dylib 	__cxa_finalize 	
2 	libSystem.B.dylib 	exit 	
3 	AppKit 	-[NSApplication terminate:] 	
4 	AppKit 	-[NSApplication _terminateSendShould:] 	
5 	CoreFoundation 	__invoking___ 	
6 	CoreFoundation 	-[NSInvocation invoke] 	
7 	Foundation 	__NSFireTimer 	
8 	CoreFoundation 	CFRunLoopRunSpecific 	
9 	CoreFoundation 	CFRunLoopRunInMode 	
10 	Foundation 	-[NSRunLoop runMode:beforeDate:] 	
11 	XUL 	nsAppShell::ProcessNextNativeEvent 	widget/src/cocoa/nsAppShell.mm:665
12 	XUL 	nsBaseAppShell::OnProcessNextEvent 	widget/src/xpwidgets/nsBaseAppShell.cpp:156
13 	XUL 	nsAppShell::OnProcessNextEvent 	widget/src/cocoa/nsAppShell.mm:834
14 	XUL 	nsThread::ProcessNextEvent 	xpcom/threads/nsThread.cpp:517
15 	XUL 	NS_ProcessPendingEvents_P 	nsThreadUtils.cpp:200
16 	XUL 	nsBaseAppShell::NativeEventCallback 	widget/src/xpwidgets/nsBaseAppShell.cpp:126
17 	XUL 	nsAppShell::ProcessGeckoEvents 	widget/src/cocoa/nsAppShell.mm:394
18 	CoreFoundation 	CFRunLoopRunSpecific 	
19 	CoreFoundation 	CFRunLoopRunInMode 	
20 	HIToolbox 	RunCurrentEventLoopInMode 	
21 	HIToolbox 	ReceiveNextEventCommon 	
22 	HIToolbox 	BlockUntilNextEventMatchingListInMode 	
23 	AppKit 	_DPSNextEvent 	
24 	AppKit 	-[NSApplication nextEventMatchingMask:untilDate:inMode:dequeue:] 	
25 	AppKit 	-[NSApplication run] 	
26 	XUL 	nsAppShell::Run 	widget/src/cocoa/nsAppShell.mm:747
27 	XUL 	nsAppStartup::Run 	toolkit/components/startup/src/nsAppStartup.cpp:191
28 	XUL 	XRE_main 	toolkit/xre/nsAppRunner.cpp:3625
29 	firefox-bin 	main 	browser/app/nsBrowserApp.cpp:158
30 	firefox-bin 	firefox-bin@0xbf5 	
31 		@0x1 

indeed we were being force quit.

jeff: please provide steps to reproduce, namely how precisely do you "quit"? the use of the mouse, keyboard, menus, context menus, dock are quite significant.
Doy, I knew that.  This doesn't look good; sez Apple

-----
terminate:

Terminates the receiver.

- (void)terminate:(id)sender
[snip]
Discussion

This method is typically invoked when the user chooses Quit or Exit from the application’s menu. 

When invoked, this method performs several steps to process the termination request. First, it asks the application’s document controller (if one exists) to save any unsaved changes in its documents. During this process, the document controller can cancel termination in response to input from the user. If the document controller does not cancel the operation, this method then calls the delegate’s applicationShouldTerminate: method. If applicationShouldTerminate: returns NSTerminateCancel, the termination process is aborted and control is handed back to the main event loop. If the method returns NSTerminateLater, the application runs its run loop in the NSModalPanelRunLoopMode mode until the replyToApplicationShouldTerminate: method is called with the value YES or NO. If the applicationShouldTerminate: method returns NSTerminateNow, this method posts a NSApplicationWillTerminateNotification notification to the default notification center.

Do not bother to put final cleanup code in your application’s main() function—it will never be executed. If cleanup is necessary, perform that cleanup in the delegate’s applicationWillTerminate: method.
-----

Maybe we should try to play games with NSTerminateCancel followed by our own shutdown code?  Note that any "correct" fix will regress Tshutdown.

The other option is just to go to back to leaking the timer lock.  We were apparently doing this before Vlad changed the timer shutdown code to run from a static dtor.
fwiw, we have a dozen other bugs talking about this problem in case you want to read them, they're a trail of tears.

what it basically amounts to is:
* even though we knew we could not use static xpcom things, we really can't use static xpcom things or really static any-things
I haven't been able to reproduce this after I needed to
-[NSApplication terminate:] should only get called if you quit from the Dock menu.

Is that what you were doing, Jeff?
Also, could you have been using non-standard settings?  Particularly timer-related settings?
> Is that what you were doing, Jeff?

My guess is the answer is no.

Others have been seeing stacks with the following crucial part in
them (including with Safari):

3    AppKit             -[NSApplication terminate:]
4    AppKit             -[NSApplication _terminateSendShould:]
5    CoreFoundation     __invoking___
6    CoreFoundation     -[NSInvocation invoke]
7    Foundation         __NSFireTimer
8    CoreFoundation     __CFRunLoopRun
9    CoreFoundation     CFRunLoopRunSpecific
10   CoreFoundation     CFRunLoopRunInMode

http://stackoverflow.com/questions/1094108/help-with-a-mac-crash-log-what-causes-nsapplication-terminatesendshould
http://forums.macosxhints.com/archive/index.php/t-52129.html

My hunch is this is an Apple bug -- a very old one.
Needless to say, there's probably no way we can work around this until/unless we can reproduce it.
One more thing, Jeff.

If you ever see this again (even just once), look in the Console for
any interesting error messages (particularly any Objective-C
exceptions).
(Following up comment #10)

Forget comment #10.  The stack chunk I quoted appears even during
"normal" uses of '-[NSApplication terminate:]'.  I found this by
breaking (in gdb) on '-[NSApplication terminate:]', then quitting from
the Dock menu.

One should test before one comments :-(

I'd still like to know, Jeff, if the crashes you saw happened when you
quit from the Dock menu.
Keywords: topcrash
Hi, hope this helps (if not please delete). I had this crash yesterday. http://crash-stats.mozilla.com/report/index/bp-3247721c-1c2d-4f08-bdd9-42f0f2100819 I shut down my MBP, when a message appeared saying Shutdown had been stopped by application Crash Reporter. Had 2 apptabs Gmail and a forum, normal tab google reader.

About:support info @ http://pastebin.com/yUcQ5Fu2  Please email me if need more info.
(In reply to comment #14)

The information you provide could be helpful.  Thanks!

By the way, I forgot to mention above that shutting down your computer
(or logging out from it) while Firefox is running is another way
(besides quitting from the Dock menu) to trigger a call to
-[NSApplication terminate:].
I've just found a close to 100% reliable way to reproduce this bug in
a recent Minefield nightly (on OS X 10.5.8).  You also crash in FF
3.6.8, but Breakpad usually doesn't come up.

1) Start Minefield.

2) Choose Bookmarks : Bookmarks Toolbar : Latest Headlines.

3) Scroll all the way to the bottom of the Latest Headlines menu and
   choose "Open All in Tabs".  (Doing this currently opens 77 tabs.)

4) Choose Quit from the Dock menu.

   After 5-10 seconds you'll crash and Breakpad will come up.

Here are a couple of my crash reports:

bp-8a815fad-85bb-4f02-8c77-3d7462100819
bp-0ed55c01-a762-4670-b980-697902100819
Assignee: nobody → smichaud
Status: NEW → ASSIGNED
Summary: Crash in [@ pthread_mutex_lock] called by mozilla::TimeStamp::Now on shutdown → Crash in [@ pthread_mutex_lock] called by mozilla::TimeStamp::Now on shutdown on OS X
Apparently NSApplication:terminate calls exit(). Does it shut down XPCOM first? Assuming that XPCOM was shut down correctly, that thread should not be running. If XPCOM was not shut down, that's the obvious problem here, and we should make sure it *is* shut down.

Why can't we just unwind the stack normally, instead of calling exit() from within a deeply nested event handler?!
Assignee: smichaud → nobody
Component: XPCOM → Widget: Cocoa
QA Contact: xpcom → cocoa
Assignee: nobody → smichaud
The general problem (the difficulty of making [NSApplication
terminate:] work together with XPCOM shutdown) is very old, and has
come up several times before -- notably in bug 384284.

The difficulty is that there's currently no solution to the general
problem.  So whenever this has come up before, we've been forced to
hack around it ... or (if the bug wasn't bad enough or didn't happen
too often) just to throw our hands up.

The first thing that needs to be understood is that we don't call
[NSApplication terminate:] -- the OS does (when someone quits from the
Dock menu or shuts down (or logs out from) their computer while
Firefox is running).  So we have to somehow handle the call.  And we
can't simply ignore it -- otherwise (for example) quitting from the
Dock menu wouldn't work at all.

Calls (from the OS) to [NSApplication terminate:] are currently
handled by code in toolkit/xre/MacApplicationDelegate.mm --
specifically in [MacApplicationDelegate applicationShouldTerminate:].
We currently (under certain circumstances) allow the user to abort the
termination (by returning NSTerminateCancel), or we allow it to
proceed (by first calling
nsIAppStartup::Quit(nsIAppStartup::eForceQuit) and then returning
NSTerminateNow).

But XPCOM isn't completely (or "completely enough") shut down by the
time nsIAppStartup::Quit() returns, and this is the root cause of all
our troubles.

To truly fix this problem we need either of the following:

1) A synchronous counterpart/variant of nsIAppStartup::Quit()

2) A callback that tells us the browser has finished shutting down
   (from which we could call [NSApplication
   replyToApplicationShouldTerminate:YES])
The previous comment oversimplifies one thing -- the OS calls various
handlers (like applicationShouldTerminate:) *before* it calls
[NSApplication terminate:].  What these handlers return determines
when/if the OS calls [NSApplication terminate:].  By the time
[NSApplication terminate:] actually is called, there's nothing we can
do.
> 1) A synchronous counterpart/variant of nsIAppStartup::Quit()
>
> 2) A callback that tells us the browser has finished shutting down
>    (from which we could call [NSApplication
>    replyToApplicationShouldTerminate:YES])

The callback would need to be called on the main thread.
nsIAppStartup quit doesn't and can't synchronously shut down. It signals Firefox that it should shut down.

What happens if applicationShouldTerminate returns NSTerminateLater? The goal here is for us to exit back out to the main event loop, which terminates when nsIAppStartup.Quit is called, and exit back out through XRE_main.
> What happens if applicationShouldTerminate returns NSTerminateLater?

Then (presumably) the OS spins its event loops, waiting for us to call
[NSApplication replyToApplicationTerminate:].

If we could get a callback from Gecko telling us XPCOM had finished
shutting down (or was "close enough" to shutting down), we could then
call [NSApplication replyToApplicationShouldTerminate:YES].
Could we just quit (by exiting from main) and avoid calling replyToApplicationTerminate at all?
> Could we just quit (by exiting from main) and avoid calling
> replyToApplicationTerminate at all?

I doubt it.

And why should we?  At the same point we'd be calling exit() we can
call [NSApplication replyToApplicationShouldTerminate:YES], or call a
callback which calls it.
>> Could we just quit (by exiting from main) and avoid calling
>> replyToApplicationTerminate at all?
>
> I doubt it.

I suspect this would lead to OS-level things not being cleaned up properly.
Why do you think that? Exiting from main is the normal way, and it doesn't show any obvious problems. Let's try it?
> Let's try it?

If it does cause problems, they may not be obvious right away.

What's your objection to the callback?
And needless to say, the OS may have saved up a bunch of tasks to run on the call to [NSApplication replyToApplicationShouldTerminate:YES], but not on exit().
> What's your objection to the callback?

Are you worried that some threads may still be running after we exit from main()?
Not threads, no. But if calling replyToApplicationShouldTerminate could re-enter our code in any way, we'd be screwed. Unless we're absolutely certain it cannot, I think we're safer never calling it (and I really doubt that MacOS isn't perfectly capable of dealing with applications that exit: after all, they exit or crash on a regular basis).
OK, I just tried changing every occurrence of NSTerminateNow in
[MacApplicationDelegate applicationShouldTerminate:] to
NSTerminateLater.  It doesn't work -- the browser hangs when you quit
from the Dock menu.

> But if calling replyToApplicationShouldTerminate could re-enter our
> code in any way, we'd be screwed.

It'd be easy enough to guarantee that no Gecko calls will be made.
All we'd need to do is set a global variable (say in
MacApplicationDelegate.mm) indicating whether or not we'd returned
NSTerminateLater from applicationShouldTerminate:, then call
replyToApplicationShouldTerminate:YES from the callback if this
variable had been set.
> and I really doubt that MacOS isn't perfectly capable of dealing
> with applications that exit: after all, they exit or crash on a
> regular basis

Under normal circumstances, sure.  In fact this is how I assume we've
been quitting Firefox from the main menu (or using Cmd+Q) for years.

What's not normal is doing this after the OS has started taking steps
to call [NSApplication terminate:].
>> But if calling replyToApplicationShouldTerminate could re-enter our
>> code in any way, we'd be screwed.
>
> It'd be easy enough to guarantee that no Gecko calls will be made.
> All we'd need to do is set a global variable (say in
> MacApplicationDelegate.mm) indicating whether or not we'd returned
> NSTerminateLater from applicationShouldTerminate:, then call
> replyToApplicationShouldTerminate:YES from the callback if this
> variable had been set.

As for the call to replyToApplicationShouldTerminate *itself* causing
our code to be re-entered -- yes, we'd need to be careful in
nsAppShell's applicationWillTerminate: handler.  For example, we could
no longer call nsBaseAppShell::Exit() from there.  But I'm quite sure
we could make this handler work without ever calling Gecko.
At this point I'd appreciate someone's help:  I have only the vaguest
idea where the "XPCOM is now destroyed" callback should be called
from.  Please make suggestions.
blocking2.0: ? → final+
anyone have suggestions?  looks like this would fix the #7 top crash in 4.0b5.

more reports at 
http://crash-stats.mozilla.com/report/list?signature=pthread_mutex_lock
> anyone have suggestions?

I do, of course (the second alternative from comment #18):

> 2) A callback that tells us the browser has finished shutting down
>    (from which we could call [NSApplication
>    replyToApplicationShouldTerminate:YES])

And in fact I suspect this is the only way we can fix the general
problem	(to make [NSApplication	terminate:] get	along with FF's
shutdown procedure).

But I haven't yet had a chance to start work on it.  And, because
implementing my suggestion may be somewhat complicated, we'd need to
land it soon (in the next beta or the one after that), to reduce the
chances that it wouldn't completely fix	the problem.

Josh and I would like to postpone dealing with this bug until after
the 4.0 release (possibly in a 4.1 or 4.5 release).

Yes, this is a 4.0b5 topcrasher (#7 on all platforms and #2 on the
Mac).  But it isn't entirely random -- it only happens when quitting
from the Dock menu or shutting down (or logging out from) the OS while
FF is running, and then only when you have lots of tabs open.  So we
could relnote it.  And, because it only happens on quitting, nobody
will lose work because of it.

So I think it'd be possible to postpone dealing with this until FF 4.1
or 4.5.
(In reply to comment #36)

> Josh and I would like to postpone dealing with this bug until after
> the 4.0 release (possibly in a 4.1 or 4.5 release).

When we discussed this I didn't know it was so high up in our top crash list, I thought the number of crashes was very small. In any case, after reading this, I don't think we should wait until after Firefox 4 to fix this.
yeah, something has made this worse in 4.0

checking --- pthread_mutex_lock 20100930-crashdata.csv
found in: 4.0b6 3.6.10 4.0b4 4.0b3 3.6.8 3.5.13 4.0b5 3.6.6 3.6.3 3.6.9 4.0b7pre 3.6 3.6.7 3.5.9 3.5.7 3.5.5 3.5.1 3.0.19 3.0.1 3.0
release total-crashes
              pthread_mutex_lock crashes
                         pct.
all     336929  511     0.00151664
4.0b6   23602   366     0.0155072
3.6.10  195929  62      0.000316441
4.0b4   2075    16      0.00771084
4.0b3   1089    14      0.0128558
3.6.8   20011   13      0.000649643
3.5.13  16321   7       0.000428895
4.0b5   2759    5       0.00181225
3.6.6   5965    5       0.000838223
3.6.3   11144   5       0.000448672
3.6.9   4111    4       0.000972999
4.0b7pre2278    3       0.00131694
does the risk level of the possible changes here warrant changing Final+ to BetaN+ ?  e.g. does it need a beta to shake out side effect?
> does the risk level of the possible changes here warrant changing Final+ to
> BetaN+ ?  e.g. does it need a beta to shake out side effect?

Probably.  If we're going to fix this in FF 4, this should be my highest priority (among Mozilla bugs).
also saw some feedback from beta Input and crash data that some users hitting this signature are doing so when  "...Switching to tab groups view and it crashed"

like:
http://crash-stats.mozilla.com/report/index/37aa9157-5fe3-45bb-9910-527022101001
     Firefox    4.0b6     Mac OS X    10.6.4  --bugs   452318,564089,582466   

would that kind of action be a different bug?
> would that kind of action be a different bug?

Probably not.
 "...Switching to tab groups view, and switching between tab groups" might be a good test case for this bug then.
>> would that kind of action be a different bug?
>
> Probably not.

That stack also has [NSApplication terminate:] on the main thread, so
something must have happened to trigger an OS-forced shutdown
sequence.  I know that quitting from the Dock menu and logging out (or
shutting down the OS) will trigger this.  But I wouldn't be surprised
if there are also other ways for it to happen.

> "...Switching to tab groups view, and switching between tab groups"
> might be a good test case for this bug then.

I've already got a very good way to reproduce this bug from comment
#16.
Attached patch FixSplinter Review
My callback idea didn't work out:  Even with a callback, OS-initiated
termination still ends up aborting the browser during the call to
nsAppShell::Run() (more specifically during the call to [NSApp run]
inside nsAppShell::Run()).  This means that the call to
nsAppStartup::Run() in XRE_Main() never returns, which in turn means
that ScopedXPCOMStartup's destructor and NS_ShutdownXPCOM() never get
called.

So for a while all hope seemed lost.  But then I discovered that
(during OS-initiated termination) [NSApplication terminate:] doesn't
get called until after [MacApplicationDelegate
applicationShouldTerminate:] returns, and [NSApplication terminate:]
doesn't do anything we need except to post an
NSApplicationWillTerminateNotification to everyone (including
ourselves) that's registered for it.

Which leads to a very simple, elegant and robust solution -- hook
[NSApplication terminate:] and replace it with a method that does
exactly (and only) what we need!

OS-initiated termination works this way ([NSApplication terminate:] is
called after the OS posts NSApplicationShouldTerminateNotification) on
all three OSes I tested (10.4.11, 10.5.8 and 10.6.4), in both 32-bit
and 64-bit mode.  So it seems very likely it will continue to work
this way even in future major OS X releases (and almost certainly in
future minor releases of OS X 10.6).

I really regret I didn't think of this earlier :-(

A tryserver build will follow in a few hours.
Attachment #483340 - Flags: review?(joshmoz)
Here's a tryserver build made with my patch from comment #45:
http://ftp.mozilla.org/pub/mozilla.org/firefox/tryserver-builds/smichaud@pobox.com-0e7766678bfd/tryserver-macosx64/firefox-4.0b8pre.en-US.mac64.dmg

Please try it out!

There were no non-spurious failures during any of the tryserver tests.
Attachment #483340 - Flags: review?(joshmoz) → review+
Status: ASSIGNED → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Just to pre-empt any doubts about the following failures in
mochitests-1/5:

test_bug557087-3.html | element should be focused - got [object
  HTMLBodyElement], expected [object HTMLButtonElement]
(and so forth)

These failures happen on both OS X
(http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1287161410.1287163611.22993.gz)
and Linux
(http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1287161977.1287162653.19127.gz).
So it can't have been caused by my patch (which is OS-X-only).
I've just realized that my patch will break Cocoa embedders (like
Camino) on the trunk ... if there *are* any Cocoa embedders on the
trunk.

Smokey, does Camino have any builds from trunk code?

The breakage is fairly major, though not critical -- the
NSApplicationShouldTerminateNotification will no longer be sent, which
means (for example) that the user will no longer be able to cancel a
Quit when more than one tab is open.

I already have a patch that should fix this problem (by not replacing
the OS's [NSApplication terminate:] method when running in a Cocoa
embedder).  But first I'd like to have an idea of how urgent this
problem is.

(By the way, the underlying reason for the breakage is that Cocoa
embedders call [NSApplication terminate:] themselves when they want to
quit, and it behaves differently in this case (as opposed to when the
OS calls it as part of a shutdown sequence initiated (say) from the
Dock menu).)
> I already have a patch that should fix this problem (by not
> replacing the OS's [NSApplication terminate:] method when running in
> a Cocoa embedder).  But first I'd like to have an idea of how urgent
> this problem is.

It'd also be nice to have a Camino trunk build to test my new patch
in.
(In reply to comment #49)
> I've just realized that my patch will break Cocoa embedders (like
> Camino) on the trunk ... if there *are* any Cocoa embedders on the
> trunk.
>
> Smokey, does Camino have any builds from trunk code?

bsmedberg, maybe you could field this question? This kind of question in bugzilla is getting pretty awkward for us.
Target Milestone: --- → mozilla2.0b8
Depends on: 604901
I've opened a new bug on the breakage in Cocoa embedders -- bug 604901.
Target Milestone: mozilla2.0b8 → mozilla2.0b7
Crash Signature: [@ pthread_mutex_lock]
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: