Closed Bug 705154 Opened 13 years ago Closed 13 years ago

Crash in chromehang | mach_msg_trap | ABORT: HangMonitor triggered

Categories

(Core :: XPCOM, defect)

x86
macOS
defect
Not set
critical

RESOLVED FIXED
mozilla11

People

(Reporter: bc, Assigned: benjamin)

Details

(Keywords: crash, regression)

Attachments

(2 files)

This bug was filed from the Socorro interface and is 
report bp-a75ba7c2-ab2e-4bc5-8e2c-eff802111124 .
============================================================= 

Also bp-00403124-e021-4bfa-b7b2-0a2fd2111124

0 	libmozalloc.dylib 	mozalloc_abort 	memory/mozalloc/mozalloc_abort.cpp:66
1 	XUL 	NS_DebugBreak_P 	xpcom/base/nsDebugImpl.cpp:388
2 	XUL 	mozilla::HangMonitor::ThreadMain 	xpcom/threads/HangMonitor.cpp:111
3 	libnspr4.dylib 	_pt_root 	nsprpub/pr/src/pthreads/ptthread.c:187
4 	libSystem.B.dylib 	_pthread_start 	
5 	libSystem.B.dylib 	thread_start

I did have some memory intensive pages loaded, but I've been doing that for several days and have not seen this abort.
Component: General → XPCOM
Product: Firefox → Core
QA Contact: general → xpcom
Version: unspecified → Trunk
This started just after I updated to today's Nightly. I've been crashing regularly every few minutes since. I've been disabling extensions one by one to see if that helps. I looked at the push log for the last couple of days but didn't see anything that stood out as a possible cause.
Keywords: regression
Every crash that has this stack trace has chromehang in its crash signature. The second term in the crash signature is the first frame in thread 0.
Summary: crash chromehang | ABORT: HangMonitor triggered → Crash in chromehang | mach_msg_trap
Summary: Crash in chromehang | mach_msg_trap → Crash in chromehang | mach_msg_trap | ABORT: HangMonitor triggered
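For readers unfamiliar with how the summary above is put together, here is a minimal sketch of the composition described in the previous comment. `BuildSignature` is a hypothetical helper, not Socorro's actual code, and appending the abort message as a third term is an assumption taken from this bug's final summary.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper (not Socorro code): joins signature terms with " | ".
// prefix        - leading marker, e.g. "chromehang"
// thread0Frames - frames of thread 0, innermost first
// abortMessage  - optional trailing term (assumption based on the summary)
std::string BuildSignature(const std::string& prefix,
                           const std::vector<std::string>& thread0Frames,
                           const std::string& abortMessage) {
  assert(!prefix.empty() && "a signature needs a leading term");
  std::string signature = prefix;
  if (!thread0Frames.empty()) {
    // The second term is the first frame of thread 0.
    signature += " | " + thread0Frames.front();
  }
  if (!abortMessage.empty()) {
    signature += " | " + abortMessage;
  }
  return signature;
}
```

With the frames from this bug, `BuildSignature("chromehang", {"mach_msg_trap", "mach_msg"}, "ABORT: HangMonitor triggered")` yields the summary text used above.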
Hi,

Mainly updating to add myself to the CC-list for this bug.

FWIW, I filed Bug 705003 which might be related.  I saw HangMonitor Aborts right after the Bug 429592 patches were incorporated into the tinderbox builds.  Most of my Nightly Crash Reports were unsuccessfully sent (about:crashes says cannot find most of those OOIDs), but one seems to have made it through:

https://crash-stats.mozilla.com/report/index/bp-305eb917-2c82-4245-9f0d-ad1ad2111124

... which apparently now has a pointer to this very bug-report.

I also posted to the nightly discussion mail-list, hopefully to fore-warn people that this would likely be seen in the next "official" Nightly builds.

BTW Starting Nightly via CLI with -safe-mode did *not* help anything with this bug, eventually we still got 'pop'ed.  ;)

(I went back to a tinderbox build before the 429592 patches were applied.)

HTH
sci-fi, thanks. If you open the crash report from about:crashes and hit reload a few times it will probably be submitted. That's what I have to do. That definitely looks like a candidate.
Hi,

Thank you to :bc: for the clue how to re-re-…-submit the about:crashes reports.  Seems mine are all finally recorded.

Here's a list of the 13 reports I have, sectioned according to "Signature" and "Build ID"[1] fields:

Build ID	20111123101127
@ chromehang | TSFNTFont::GetFormat() const 
https://crash-stats.mozilla.com/report/index/bp-22e7fd4f-c853-4e70-ab86-3e4202111124

Build ID	20111123111426
@ chromehang | TSFNTFont::GetFormat() const 
https://crash-stats.mozilla.com/report/index/bp-0055ed42-8ce4-4dbc-a4bf-760e72111124
https://crash-stats.mozilla.com/report/index/bp-9b09a418-0f0a-4893-ac4a-ccc692111124

Build ID	20111123101127
@ chromehang | mach_msg_trap 
https://crash-stats.mozilla.com/report/index/bp-2a77d2f6-39ad-4374-88c7-9c1b72111124
https://crash-stats.mozilla.com/report/index/bp-21f7503b-a058-44c9-a6fc-440042111124
https://crash-stats.mozilla.com/report/index/bp-462950ca-a71b-4f1d-8417-2f2b62111124
https://crash-stats.mozilla.com/report/index/bp-75ba6b07-0c1e-4d08-b124-133472111124
https://crash-stats.mozilla.com/report/index/bp-4a252859-0489-4894-9dc8-df6492111124
https://crash-stats.mozilla.com/report/index/bp-a16632b3-80f9-4fcc-89ed-90fec2111124

Build ID	20111123111426
@ chromehang | mach_msg_trap 
https://crash-stats.mozilla.com/report/index/bp-088853f0-0766-4999-a578-19ada2111124
https://crash-stats.mozilla.com/report/index/bp-521bc212-cbc6-4c54-8e01-6f05a2111124
https://crash-stats.mozilla.com/report/index/bp-15a08077-fc70-41fc-b962-9d8e52111124
https://crash-stats.mozilla.com/report/index/bp-305eb917-2c82-4245-9f0d-ad1ad2111124

[1] - fetched from <https://ftp.mozilla.org/pub/mozilla.org/firefox/tinderbox-builds/mozilla-central-macosx64/?C=M;O=D>

Bug 705154 seems to be for the "mach_msg_trap" signature.

If "TSFNTFont::GetFormat" is not related, could we perhaps handle it in Bug 705003 to give it a "personality"?

(I'll post this list to both bugs.)
I only get this crash if I let Firefox sit idle for a few minutes. If I constantly use it, I don't get any crash. If I step away from the computer, it crashes.
No longer blocks: hang-detector
Depends on: hang-detector
No longer depends on: hang-detector
A side note, which might not be related: I also experienced some system crashes today on my 17-inch, Late 2006 iMac with 10.7.2 and ATI Radeon X1600 128 MB.  The crash was in plugin-container. Unfortunately I didn't get the info in time, since the system crashed again and didn't give detailed info the second time.

Confirming the crash happens with Nightly in the background.
This bug looks as if the hang detector might be malfunctioning, but none of the crash reports have a usable stack on thread 0 (which is the interesting thread). I propose to disable the hang detector on Mac for this weekend's nightlies so that Ted and I can loop back around on Tuesday to figure out why we aren't getting better stacks. It may be that we need to get symbols for OS libraries.
Tagging a few possible reviewers of the temporary disablement, but if there's somebody else around who can review please feel free.
Attachment #576996 - Flags: review?(smichaud)
Attachment #576996 - Flags: review?(jmathies)
Attachment #576996 - Flags: review?(gavin.sharp)
Please ignore the xpcom/ bits of this patch, they are for a different bug.
Assignee: nobody → benjamin
Comment on attachment 576996 [details] [diff] [review]
Disable the hang monitor on mac, rev. 1

(it'd be nice if the #ifndef DEBUG was an #ifdef instead, easier to read that way IMO)
Attachment #576996 - Flags: review?(gavin.sharp) → review+
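To illustrate the readability point in the review comment above, here is a small sketch of the two equivalent preprocessor forms. The function names and the 30-second value are hypothetical; the actual contents of the patch are not shown in this bug.

```cpp
// Hypothetical value for illustration only; not taken from the patch.
constexpr int kHypotheticalReleaseTimeoutSecs = 30;

// Negated form (#ifndef): the reader has to invert the condition to see
// which branch applies to debug builds.
constexpr int DefaultTimeoutNegated() {
#ifndef DEBUG
  return kHypotheticalReleaseTimeoutSecs;
#else
  return 0;  // 0 means "monitor disabled" in this sketch
#endif
}

// Positive form (#ifdef): the debug case is named first, which is the
// ordering the reviewer suggests.
constexpr int DefaultTimeoutPositive() {
#ifdef DEBUG
  return 0;  // 0 means "monitor disabled" in this sketch
#else
  return kHypotheticalReleaseTimeoutSecs;
#endif
}

// Both spellings select the same value in any given build.
static_assert(DefaultTimeoutNegated() == DefaultTimeoutPositive(),
              "the two forms must agree");
```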
I'm also getting this quite frequently when I leave Firefox in the background. Here are some of my crash signatures:
http://crash-stats.mozilla.com/report/index/bp-41e2788b-82f0-41be-bf84-6e76f2111125
http://crash-stats.mozilla.com/report/index/bp-535b250a-1a4b-4aa2-a630-4a8712111125
http://crash-stats.mozilla.com/report/index/bp-4ab0bfbc-983d-40ea-9a17-dd1792111125

They all have CoreFoundation@0x4c901 in the main thread which translates to:
> atos -l 0x0 -o /System/Library/Frameworks/CoreFoundation.framework/Versions/A/CoreFoundation 0x4c901
CFDictionaryRemoveAllValues (in CoreFoundation) + 17
https://hg.mozilla.org/mozilla-central/rev/2729a78cd35e

Once we hit unlabeled addresses we're not walking the stack correctly, and I really want to see what is "above" all this on the stack. Leaving this bug open to track the real problem and reenable the hang monitor.
Status: NEW → ASSIGNED
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #8)
> This bug appears as if the hang detector might be malfunctioning, but none
> of the crash reports have a usable stack on thread 0 (which is the
> interesting thread). I propose to disable the hang detector on mac for this
> weekend's nightlies so that Ted and I can loop back around on Tuesday to
> figure out why we aren't getting better stacks. It may be that we need to
> get symbols for OS libraries.

Why aren't we backing this out rather than putting in band-aids like this?
Why would we back it out when the pref was specifically designed so that we could disable it? It's still giving valuable data on Windows/Linux.
(In reply to Benjamin Smedberg  [:bsmedberg] from comment #15)
> Why would we back it out when the pref was specifically designed so that we
> could disable it? It's still giving valuable data on Windows/Linux.

Valuable data == user crashes.  This feature has resulted in almost an order of magnitude increase in crashes on Windows:

https://bugzilla.mozilla.org/show_bug.cgi?id=429592#c117

Like all regressions, this should be backed out or disabled by default on all platforms.
Comment on attachment 576996 [details] [diff] [review]
Disable the hang monitor on mac, rev. 1

http://hg.mozilla.org/mozilla-central/rev/2729a78cd35e
Attachment #576996 - Flags: review?(smichaud)
Attachment #576996 - Flags: review?(jmathies)
(In reply to John Daggett (:jtd) from comment #16)
> Valuable data == user crashes.

Nightly user crashes. This feature's goal was to turn hangs into crashes so that we could track them - that necessarily involves an increase in crash reports. Assuming the functionality is working as expected (an assumption that apparently might not hold true on Mac), there's no reason to back it out solely because the crash count increased. We do need to investigate the crash reports, of course...
Status: ASSIGNED → NEW
(In reply to John Daggett (:jtd) from comment #16)
> https://bugzilla.mozilla.org/show_bug.cgi?id=429592#c117

Sorry, I wasn't up to date on the comments in that bug when I wrote my last reply - it's obviously not a simple tradeoff, and the discussion there is much more nuanced. Forget I said anything!
Status: NEW → ASSIGNED
The stack which I didn't account for is:

#0  0x00007fff863cad7a in mach_msg_trap ()
#1  0x00007fff863cb3ed in mach_msg ()
#2  0x00007fff8060a902 in __CFRunLoopRun ()
#3  0x00007fff80609d8f in CFRunLoopRunSpecific ()
#4  0x00007fff8587574e in RunCurrentEventLoopInMode ()
#5  0x00007fff85875553 in ReceiveNextEventCommon ()
#6  0x00007fff8587540c in BlockUntilNextEventMatchingListInMode ()
#7  0x00007fff83dd6eb2 in _DPSNextEvent ()
#8  0x00007fff83dd6801 in -[NSApplication nextEventMatchingMask:untilDate:inMode:dequeue:] ()
#9  0x00007fff83d9c68f in -[NSApplication run] ()
#10 0x00000001028b8fd4 in nsAppShell::Run (this=0x100303a20) at /builds/mozilla-central/src/widget/src/cocoa/nsAppShell.mm:780
#11 0x0000000102616df5 in nsAppStartup::Run (this=0x1177ed5b0) at /builds/mozilla-central/src/toolkit/components/startup/nsAppStartup.cpp:220
#12 0x0000000101444ff0 in XRE_main (argc=3, argv=0x7fff5fbff860, aAppData=0x1000071c0) at /builds/mozilla-central/src/toolkit/xre/nsAppRunner.cpp:3558
#13 0x0000000100001aeb in do_main (exePath=0x7fff5fbff430 "/builds/mozilla-central/ff-debug/dist/NightlyDebug.app/Contents/MacOS/", argc=3, argv=0x7fff5fbff860) at /builds/mozilla-central/src/browser/app/nsBrowserApp.cpp:201
#14 0x0000000100001d4a in main (argc=3, argv=0x7fff5fbff860) at /builds/mozilla-central/src/browser/app/nsBrowserApp.cpp:287

So the Cocoa version of nsAppShell doesn't delegate to nsBaseAppShell::Run, which means that the XPCOM event loop is not the outermost event loop at all on Mac.

I think this can be fixed by overriding -[NSApplication nextEventMatchingMask:untilDate:inMode:dequeue:] in a subclass and suspending the hang monitor when appropriate, but I also need to write down how all the different native event loops work, because each one is a little bit different and together they are a nightmare.
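The idea of suspending the monitor while the main thread waits for native events can be modeled roughly as follows. This is a simplified sketch, not the code in xpcom/threads/HangMonitor.cpp: `HangMonitorModel` and its methods are hypothetical, and time is passed in explicitly so the logic can be followed without a real watchdog thread.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical model of a hang watchdog's bookkeeping.
class HangMonitorModel {
 public:
  using Clock = std::chrono::steady_clock;

  explicit HangMonitorModel(std::chrono::seconds timeout) : mTimeout(timeout) {
    assert(timeout.count() > 0 && "a non-positive timeout would flag every event");
  }

  // Main thread: called when it starts processing an event.
  void NotifyActivity(Clock::time_point now) {
    mLastActivity = now;
    mWatching = true;
  }

  // Main thread: called just before blocking for the next native event.
  // Waiting for input is idleness, not a hang.
  void Suspend() { mWatching = false; }

  // Watchdog thread: the periodic check that would trigger the abort.
  bool IsHung(Clock::time_point now) const {
    return mWatching && (now - mLastActivity) > mTimeout;
  }

 private:
  std::chrono::seconds mTimeout;
  Clock::time_point mLastActivity{};
  bool mWatching = false;
};
```

In this model, an event loop that blocks in the OS without first calling `Suspend()` looks identical to a hung main thread once the timeout passes, which matches the reports above of aborts only when the browser sits idle.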
Comment on attachment 578045 [details] [diff] [review]
Suspend the hang monitor by overriding a method in the event loop, rev. 1

I haven't tested this.  But it looks reasonable to me, and it should do no harm.
Attachment #578045 - Flags: review?(smichaud) → review+
Mac patch checked in for mozilla11.  Leaving the bug open for the remaining patch.
https://hg.mozilla.org/mozilla-central/rev/1b3f17ffa656
Target Milestone: --- → mozilla11
The other one landed already ;-)
Status: ASSIGNED → RESOLVED
Closed: 13 years ago
Resolution: --- → FIXED