Closed Bug 463594 Opened 17 years ago Closed 7 years ago

Mac Leak & Bloat test occasionally fails on shutdown due to Assertion failure: 0 == rv in ptsynch.c

Tracking

(Not tracked)

Status:

RESOLVED WORKSFORME

People

(Reporter: standard8, Unassigned)

Details

Attachments

(4 files)

dtrace probe outputs where we see classinits across 2 threads 17 years ago Andrew Sutherland [:asuth] (he/him) 125.63 KB, text/plain		Details
dtrace probe outputs where we see classinits across 1 distinct thread id 17 years ago Andrew Sutherland [:asuth] (he/him) 123.34 KB, text/plain		Details
the dtrace probes with dubious ("wrong") pthread_create return result retrieval 17 years ago Andrew Sutherland [:asuth] (he/him) 682 bytes, text/plain		Details
[checked in] Disable OS X and OE address books. 17 years ago Mark Banner (:standard8) 611 bytes, patch	Bienvenu : review+	Details \| Diff \| Splinter Review

Mark Banner (:standard8)

Reporter

Description

•

17 years ago

MacOSX 10.4 comm-central bloat build is frequently turning red at the moment. The symptoms are that on the --trace-malloc test (the second test) we see in the log: --DOMWINDOW == 7 (0x3ebc7da0) [serial = 14] [outer = 0x3eba8560] [url = about:blank] --DOMWINDOW == 6 (0x36ac1360) [serial = 1] [outer = 0x0] [url = chrome://messenger/content/hiddenWindow.xul] --DOMWINDOW == 5 (0x3b29c630) [serial = 4] [outer = 0x0] [url = chrome://messenger/content/hiddenWindow.xul] --DOMWINDOW == 4 (0x39ae90c0) [serial = 3] [outer = 0x39a7dea0] [url = about:blank] --DOMWINDOW == 3 (0x3eba8590) [serial = 13] [outer = 0x0] [url = chrome://messenger/content/messengercompose/messengercompose.xul] --DOMWINDOW == 2 (0x3ef43570) [serial = 15] [outer = 0x355332c0] [url = chrome://messenger/content/msgAccountCentral.xul] --DOMWINDOW == 1 (0x355332f0) [serial = 5] [outer = 0x0] [url = chrome://messenger/content/msgAccountCentral.xul] --DOMWINDOW == 0 (0x39a7ded0) [serial = 2] [outer = 0x0] [url = chrome://messenger/content/messenger.xul] Assertion failure: 0 == rv, at /Volumes/Build/comm-central-bloat-macosx/build/mozilla/nsprpub/pr/src/pthreads/ptsynch.c:194 ERROR FAIL Exited with code 6 during test run The assertion is causing us to abort, hence the error code. The first instance this occurred was: http://tinderbox.mozilla.org/showlog.cgi?log=Thunderbird/1225997344.1225998204.7949.gz Started 2008/11/06 10:49, finished 2008/11/06 11:04 There were no relevant changes for that build. Given that gloda landed on the 4th, I don't think that is likely to be the issue. I'm also having a hard time seeming other potential regressions from patches in mozilla-central. I'm going to try a clobber and see how it goes - there were some m-c patches checked in and back out. In theory these shouldn't affect things...

Mark Banner (:standard8)

Reporter

Comment 1

•

17 years ago

We now have a stack for this assertion. It seems to be failing in trace malloc shutdown which is strange, and we can't see anything obvious. (gdb) bt #0 0x9003d66c in kill () #1 0x9010e8cf in raise () #2 0x9010d422 in abort () #3 0x00676b33 in PR_DestroyLock (lock=0x310ed50) at /Volumes/Build/comm-central-bloat-macosx/build/mozilla/nsprpub/pr/src/pthreads/ptsynch.c:194 #4 0x01877fab in NS_TraceMallocShutdown () at /Volumes/Build/comm-central-bloat-macosx/build/mozilla/tools/trace-malloc/lib/nsTraceMalloc.c:1508 #5 0x017daea6 in cxa_atexit_wrapper () #6 0x900103e1 in __cxa_finalize () #7 0x900102e8 in exit () #8 0x00002072 in _start () #9 0x00001f91 in start () (gdb) print lock $6 = (struct PRLock *) 0x310ed50 (gdb) print *lock $7 = { mutex = { __sig = 1297437784, __opaque = "\000\000\000\000\001\000\000\000####\003*", '\0' <repeats 22 times>, "\0033\000" }, notified = { length = 0, cv = {{ cv = 0x0, times = 0 }, { cv = 0x0, times = 0 }, { cv = 0x0, times = 0 }, { cv = 0x0, times = 0 }, { cv = 0x0, times = 0 }, { cv = 0x0, times = 0 }}, link = 0x0 }, locked = 0, owner = 0xa000d000 }

Philippe M. Chiasson (:gozer)

Comment 2

•

17 years ago

Should have looked if there were other threads running, and turns out, there is one: (gdb) info threads 2 core thread 1 0x900248c7 in semaphore_wait_signal_trap () * 1 core thread 0 0x9003d66c in kill () (gdb) thread 2 [Switching to thread 2 (core thread 1)] 0x900248c7 in semaphore_wait_signal_trap () (gdb) bt #0 0x900248c7 in semaphore_wait_signal_trap () #1 0x90001582 in pthread_mutex_lock () #2 0x00676be2 in PR_Lock (lock=0x310ed50) at /Volumes/Build/comm-central-bloat-macosx/build/mozilla/nsprpub/pr/src/pthreads/ptsynch.c:207 #3 0x0187947d in FreeCallback (ptr=0x3cb26310, start=0, end=0, t=0x3dbe1690) at /Volumes/Build/comm-central-bloat-macosx/build/mozilla/tools/trace-malloc/lib/nsTraceMalloc.c:1947 #4 0x0187773d in my_malloc_logger (type=12, arg1=41943040, arg2=1018323728, arg3=0, result=0, num_hot_frames_to_skip=0) at /Volumes/Build/comm-central-bloat-macosx/build/mozilla/tools/trace-malloc/lib/nsTraceMalloc.c:1088 #5 0x9001004d in malloc_zone_free () #6 0x90a5b8cc in _free_internal () #7 0x90a5e11b in _destroyInitializingClassList () #8 0x90a5e0e1 in _objc_pthread_destroyspecific () #9 0x90025400 in _pthread_tsd_cleanup () #10 0x90024f88 in pthread_exit () #11 0x927f32f8 in forkThreadForFunction () #12 0x90024227 in _pthread_body ()

Mark Banner (:standard8)

Reporter

Comment 3

•

17 years ago

I've still not got any idea what could be causing this. The profile we run with has one pop account, we don't do things like talking to the server, and afaik we're not trying to load the start page either. So all we're doing is starting up the main mail window, address book and compose windows. We also have multiple threads running. Assuming it is something that mailnews is doing (afaik FF mac box doesn't have this problem), then all I can think of would be gloda, although that shouldn't be doing indexing on startup, I'm wondering if its hooking into mozStorage and that is causing an issue. gozer, could you try disabling gloda locally on that mac box? In mailnews/db/Makefile.in remove the 'gloda' option in DIRS, and I expect it'll need a clobber as well. We can run like that for a couple of days, obviously if it fails again its not gloda, if it doesn't, then it gives us something to look at in more detail.

Philippe M. Chiasson (:gozer)

Comment 4

•

17 years ago

$> hg diff mailnews/db/Makefile.in diff -r 8d9666167d33 mailnews/db/Makefile.in --- a/mailnews/db/Makefile.in Mon Nov 17 19:00:34 2008 +0100 +++ b/mailnews/db/Makefile.in Mon Nov 17 10:44:59 2008 -0800 @@ -42,7 +42,7 @@ include $(DEPTH)/config/autoconf.mk -DIRS = msgdb gloda +DIRS = msgdb include $(topsrcdir)/config/rules.mk' Also clobbered, let's see what happens.

Mark Banner (:standard8)

Reporter

Comment 5

•

17 years ago

So thanks to our wonderfully tidy build system, the patch you did manually won't work: you need to remove db/gloda from mailnews/Makefile.in.

Philippe M. Chiasson (:gozer)

Comment 6

•

17 years ago

$ hg diff diff -r 4210261cd843 mailnews/Makefile.in --- a/mailnews/Makefile.in Tue Nov 18 16:24:51 2008 +0000 +++ b/mailnews/Makefile.in Tue Nov 18 08:31:42 2008 -0800 @@ -44,7 +44,7 @@ MODULE = mailnews -PARALLEL_DIRS = base/search/src db/msgdb db/gloda news local mime compose imap addrbook import extensions +PARALLEL_DIRS = base/search/src db/msgdb news local mime compose imap addrbook import extensions ifeq ($(OS_ARCH),WINNT) ifndef GNU_CC diff -r 4210261cd843 mailnews/db/Makefile.in --- a/mailnews/db/Makefile.in Tue Nov 18 16:24:51 2008 +0000 +++ b/mailnews/db/Makefile.in Tue Nov 18 08:31:42 2008 -0800 @@ -42,7 +42,7 @@ include $(DEPTH)/config/autoconf.mk -DIRS = msgdb gloda +DIRS = msgdb include $(topsrcdir)/config/rules.mk Also clobbered, let's see what happens this time.

Mark Banner (:standard8)

Reporter

Comment 7

•

17 years ago

So since we disabled gloda, its now failed another few times. Let's re-enable gloda for now, and I'll have another think as to how we can test/resolve this.

Philippe M. Chiasson (:gozer)

Comment 8

•

17 years ago

Done, and clobbered.

Andrew Sutherland [:asuth] (he/him)

Comment 9

•

17 years ago

Attached file dtrace probe outputs where we see classinits across 2 threads — Details

It would appear that our fancy Mac Address Book stuff is the "problem". I place problem in quotes because one could argue that a custom memory allocator should not shut itself down while other threads that depend on it are still shutting themselves down. A brief investigation into the TLS cleanup code going on suggests that objc-initialize.m::_class_initialize must be getting called on (at least) one thread which is not the main thread, leaving work for the cleanup mechanism. It appears that AddressBook/CoreData actually creates threads 3 times during the bloat test, but that only 2 of these actually result in objective-c class initializations. Furthermore, depending on the timing, the same physical thread may be reused so that we only see one distinct thread id that has class initializations occur. (Actually, maybe they all result in initialization but reuse reliably happens in one of the cases, I didn't do the timestamp legwork... (see below)) In all 3 cases nsAbOSXDirectory::Init(char const*) is calling AddressBook`+[ABAddressBook sharedAddressBook]. I am attaching the results of my dtrace probes that generate a callstack for every call to _class_initialize not made on the main thread, plus various attempts to interleave information about who is creating threads and when. Do not trust the "pthread_create:return" results, just the "thread_terminate_self:lwp-create" results. Also, keep in mind that my mac mini is dual-core, I didn't set any core affinity, and the output from the two cores is not perfectly interleaved; watch the timestamps (and processor id's), because the file is not strictly ordered!

Andrew Sutherland [:asuth] (he/him)

Comment 10

•

17 years ago

Attached file dtrace probe outputs where we see classinits across 1 distinct thread id — Details

Andrew Sutherland [:asuth] (he/him)

Comment 11

•

17 years ago

Attached file the dtrace probes with dubious ("wrong") pthread_create return result retrieval — Details

Mark Banner (:standard8)

Reporter

Comment 12

•

17 years ago

I don't quite see/understand why we might be getting the AddressBook/CoreData created off the main thread. AFAIK all the address book code that is relevant is working off the main thread, unless there's something hidden in rdf, which I doubt. If nsAbOSXDirectory::Init is being called 3 times, that's a little bit of a surprise. Did you run the probes on both the runs, or just one of them? I know that on the second run, the localstore.rdf may be causing the initialisation to be called early, and then lost, hence the second Init. Not sure where the third one is off hand, though looking at the traces it could just be rdf's fault. This doesn't seem to be affecting released builds, although not having crash-stats on Mac at the moment its a little hard to say. Also, given the fact that this may be rdf induced, and we're hoping on removing that soon, I'm thinking that the best option may be just to disable the OS X address book for the leak and bloat tests. I'm certainly thinking about doing it at least temporarily to confirm this as the real issue. Disabling would also have the advantage that for individuals testing on their machines, what they have in the Address Book app wouldn't affect their leak & bloat test results.

Mark Banner (:standard8)

Reporter

Comment 13

•

17 years ago

Attached patch [checked in] Disable OS X and OE address books. — Details — Splinter Review

Given the current indications that this is actually do with the OS X address book interfaces, I'd like to disable the OS X address book (and OE one for reasons cited in my previous comment), at least temporarily. We can then confirm that this is the cause of the problem, and then consider where we want to go with fixes/re-enabling.

Attachment #351787 - Flags: review?(bienvenu)

David :Bienvenu

Updated

•

17 years ago

Attachment #351787 - Flags: review?(bienvenu) → review+

timeless

Updated

•

17 years ago

Attachment #351750 - Attachment mime type: text/x-log → text/plain

timeless

Updated

•

17 years ago

Attachment #351751 - Attachment mime type: text/x-log → text/plain

timeless

Updated

•

17 years ago

Attachment #351752 - Attachment mime type: text/x-dsrc → text/plain

timeless

Comment 14

•

17 years ago

http://www.mail-archive.com/cocoa-dev@lists.apple.com/msg17262.html This isn't something you're asking for, it's something that happens

Mark Banner (:standard8)

Reporter

Comment 15

•

17 years ago

Comment on attachment 351787 [details] [diff] [review] [checked in] Disable OS X and OE address books. Checked in: http://hg.mozilla.org/comm-central/rev/9d25c19183ef Lets see what the mac box does now.

Attachment #351787 - Attachment description: Disable OS X and OE address books. → [checked in] Disable OS X and OE address books.

Mark Banner (:standard8)

Reporter

Comment 16

•

17 years ago

(In reply to comment #15) > Lets see what the mac box does now. Too early to say wrt failures, but we have seen approx 56.2kb drop in Lk (Trace malloc leaks). The differences in Lk (as you might expect) are to do with the Mac OS X address book and can be seen here at the end of the log: http://tinderbox.mozilla.org/showlog.cgi?log=Thunderbird/1228677504.1228678351.31742.gz#err8

Wayne Mery (:wsmwk)

Comment 17

•

17 years ago

Mark, is this ready to land?

Wayne Mery (:wsmwk)

Comment 18

•

13 years ago

(In reply to Mark Banner (:standard8) from comment #16) > (In reply to comment #15) > > Lets see what the mac box does now. > > Too early to say wrt failures, but we have seen approx 56.2kb drop in Lk > (Trace malloc leaks). The differences in Lk (as you might expect) are to do > with the Mac OS X address book and can be seen here at the end of the log: > http://tinderbox.mozilla.org/showlog.cgi?log=Thunderbird/1228677504. > 1228678351.31742.gz#err8 closeable?

Flags: needinfo?

Mark Banner (:standard8)

Reporter

Comment 19

•

13 years ago

Nope, the assertion is still present on Tinderbox, but is permanent on 10.7 and I don't think we've got a separate bug for that (https://tbpl.mozilla.org/?tree=Thunderbird-Trunk&noignore=1).

Flags: needinfo?

Mark Banner (:standard8)

Reporter

Comment 20

•

7 years ago

Pretty sure we got rid of leak & bloat builders ages ago.

Status: NEW → RESOLVED

Closed: 7 years ago

Resolution: --- → WORKSFORME

You need to log in before you can comment on or make changes to this bug.