Last Comment Bug 463594 - Mac Leak & Bloat test occasionally fails on shutdown due to Assertion failure: 0 == rv in ptsynch.c
: Mac Leak & Bloat test occasionally fails on shutdown due to Assertion failure...
Status: NEW
:
Product: Thunderbird
Classification: Client Software
Component: General (show other bugs)
: Trunk
: x86 Mac OS X
: -- normal (vote)
: ---
Assigned To: Nobody; OK to take it and work on it
:
Mentors:
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2008-11-07 02:40 PST by Mark Banner (:standard8)
Modified: 2012-12-30 01:27 PST (History)
6 users (show)
See Also:
Crash Signature:
(edit)
QA Whiteboard:
Iteration: ---
Points: ---


Attachments
dtrace probe outputs where we see classinits across 2 threads (125.63 KB, text/plain)
2008-12-07 01:14 PST, Andrew Sutherland [:asuth]
no flags Details
dtrace probe outputs where we see classinits across 1 distinct thread id (123.34 KB, text/plain)
2008-12-07 01:16 PST, Andrew Sutherland [:asuth]
no flags Details
the dtrace probes with dubious ("wrong") pthread_create return result retrieval (682 bytes, text/plain)
2008-12-07 01:17 PST, Andrew Sutherland [:asuth]
no flags Details
[checked in] Disable OS X and OE address books. (611 bytes, patch)
2008-12-07 08:09 PST, Mark Banner (:standard8)
mozilla: review+
Details | Diff | Review

Description Mark Banner (:standard8) 2008-11-07 02:40:09 PST
MacOSX 10.4 comm-central bloat build is frequently turning red at the moment. The symptoms are that on the --trace-malloc test (the second test) we see in the log:

--DOMWINDOW == 7 (0x3ebc7da0) [serial = 14] [outer = 0x3eba8560] [url = about:blank]
--DOMWINDOW == 6 (0x36ac1360) [serial = 1] [outer = 0x0] [url = chrome://messenger/content/hiddenWindow.xul]
--DOMWINDOW == 5 (0x3b29c630) [serial = 4] [outer = 0x0] [url = chrome://messenger/content/hiddenWindow.xul]
--DOMWINDOW == 4 (0x39ae90c0) [serial = 3] [outer = 0x39a7dea0] [url = about:blank]
--DOMWINDOW == 3 (0x3eba8590) [serial = 13] [outer = 0x0] [url = chrome://messenger/content/messengercompose/messengercompose.xul]
--DOMWINDOW == 2 (0x3ef43570) [serial = 15] [outer = 0x355332c0] [url = chrome://messenger/content/msgAccountCentral.xul]
--DOMWINDOW == 1 (0x355332f0) [serial = 5] [outer = 0x0] [url = chrome://messenger/content/msgAccountCentral.xul]
--DOMWINDOW == 0 (0x39a7ded0) [serial = 2] [outer = 0x0] [url = chrome://messenger/content/messenger.xul]
Assertion failure: 0 == rv, at /Volumes/Build/comm-central-bloat-macosx/build/mozilla/nsprpub/pr/src/pthreads/ptsynch.c:194
ERROR FAIL Exited with code 6 during test run

The assertion is causing us to abort, hence the error code.

The first instance this occurred was:
http://tinderbox.mozilla.org/showlog.cgi?log=Thunderbird/1225997344.1225998204.7949.gz
Started 2008/11/06 10:49, finished 2008/11/06 11:04

There were no relevant changes for that build. Given that gloda landed on the 4th, I don't think that is likely to be the issue.

I'm also having a hard time seeming other potential regressions from patches in mozilla-central.

I'm going to try a clobber and see how it goes - there were some m-c patches checked in and back out. In theory these shouldn't affect things...
Comment 1 Mark Banner (:standard8) 2008-11-11 22:19:24 PST
We now have a stack for this assertion. It seems to be failing in trace malloc shutdown which is strange, and we can't see anything obvious.

(gdb) bt
#0  0x9003d66c in kill ()
#1  0x9010e8cf in raise ()
#2  0x9010d422 in abort ()
#3  0x00676b33 in PR_DestroyLock (lock=0x310ed50) at /Volumes/Build/comm-central-bloat-macosx/build/mozilla/nsprpub/pr/src/pthreads/ptsynch.c:194
#4  0x01877fab in NS_TraceMallocShutdown () at /Volumes/Build/comm-central-bloat-macosx/build/mozilla/tools/trace-malloc/lib/nsTraceMalloc.c:1508
#5  0x017daea6 in cxa_atexit_wrapper ()
#6  0x900103e1 in __cxa_finalize ()
#7  0x900102e8 in exit ()
#8  0x00002072 in _start ()
#9  0x00001f91 in start ()
(gdb) print lock
$6 = (struct PRLock *) 0x310ed50
(gdb) print *lock
$7 = {
  mutex = {
    __sig = 1297437784, 
    __opaque = "\000\000\000\000\001\000\000\000####\003*", '\0' <repeats 22 times>, "\0033\000"
  }, 
  notified = {
    length = 0, 
    cv = {{
        cv = 0x0, 
        times = 0
      }, {
        cv = 0x0, 
        times = 0
      }, {
        cv = 0x0, 
        times = 0
      }, {
        cv = 0x0, 
        times = 0
      }, {
        cv = 0x0, 
        times = 0
      }, {
        cv = 0x0, 
        times = 0
      }}, 
    link = 0x0
  }, 
  locked = 0, 
  owner = 0xa000d000
}
Comment 2 Philippe M. Chiasson (:gozer) 2008-11-12 09:46:44 PST
Should have looked if there were other threads running, and turns out, there is one:

(gdb) info threads
  2 core thread 1  0x900248c7 in semaphore_wait_signal_trap ()
* 1 core thread 0  0x9003d66c in kill ()
(gdb) thread 2
[Switching to thread 2 (core thread 1)]
0x900248c7 in semaphore_wait_signal_trap ()
(gdb) bt
#0  0x900248c7 in semaphore_wait_signal_trap ()
#1  0x90001582 in pthread_mutex_lock ()
#2  0x00676be2 in PR_Lock (lock=0x310ed50) at /Volumes/Build/comm-central-bloat-macosx/build/mozilla/nsprpub/pr/src/pthreads/ptsynch.c:207
#3  0x0187947d in FreeCallback (ptr=0x3cb26310, start=0, end=0, t=0x3dbe1690) at /Volumes/Build/comm-central-bloat-macosx/build/mozilla/tools/trace-malloc/lib/nsTraceMalloc.c:1947
#4  0x0187773d in my_malloc_logger (type=12, arg1=41943040, arg2=1018323728, arg3=0, result=0, num_hot_frames_to_skip=0) at /Volumes/Build/comm-central-bloat-macosx/build/mozilla/tools/trace-malloc/lib/nsTraceMalloc.c:1088
#5  0x9001004d in malloc_zone_free ()
#6  0x90a5b8cc in _free_internal ()
#7  0x90a5e11b in _destroyInitializingClassList ()
#8  0x90a5e0e1 in _objc_pthread_destroyspecific ()
#9  0x90025400 in _pthread_tsd_cleanup ()
#10 0x90024f88 in pthread_exit ()
#11 0x927f32f8 in forkThreadForFunction ()
#12 0x90024227 in _pthread_body ()
Comment 3 Mark Banner (:standard8) 2008-11-16 05:49:45 PST
I've still not got any idea what could be causing this. The profile we run with has one pop account, we don't do things like talking to the server, and afaik we're not trying to load the start page either.

So all we're doing is starting up the main mail window, address book and compose windows. We also have multiple threads running. Assuming it is something that mailnews is doing (afaik FF mac box doesn't have this problem), then all I can think of would be gloda, although that shouldn't be doing indexing on startup, I'm wondering if its hooking into mozStorage and that is causing an issue.

gozer, could you try disabling gloda locally on that mac box? In mailnews/db/Makefile.in remove the 'gloda' option in DIRS, and I expect it'll need a clobber as well.

We can run like that for a couple of days, obviously if it fails again its not gloda, if it doesn't, then it gives us something to look at in more detail.
Comment 4 Philippe M. Chiasson (:gozer) 2008-11-17 10:45:42 PST
$> hg diff mailnews/db/Makefile.in
diff -r 8d9666167d33 mailnews/db/Makefile.in
--- a/mailnews/db/Makefile.in   Mon Nov 17 19:00:34 2008 +0100
+++ b/mailnews/db/Makefile.in   Mon Nov 17 10:44:59 2008 -0800
@@ -42,7 +42,7 @@

 include $(DEPTH)/config/autoconf.mk

-DIRS           = msgdb gloda
+DIRS           = msgdb

 include $(topsrcdir)/config/rules.mk'

Also clobbered, let's see what happens.
Comment 5 Mark Banner (:standard8) 2008-11-18 01:43:21 PST
So thanks to our wonderfully tidy build system, the patch you did manually won't work: you need to remove db/gloda from mailnews/Makefile.in.
Comment 6 Philippe M. Chiasson (:gozer) 2008-11-18 08:32:02 PST
$ hg diff
diff -r 4210261cd843 mailnews/Makefile.in
--- a/mailnews/Makefile.in      Tue Nov 18 16:24:51 2008 +0000
+++ b/mailnews/Makefile.in      Tue Nov 18 08:31:42 2008 -0800
@@ -44,7 +44,7 @@

 MODULE       = mailnews

-PARALLEL_DIRS  = base/search/src db/msgdb db/gloda news local mime compose imap addrbook import extensions
+PARALLEL_DIRS  = base/search/src db/msgdb news local mime compose imap addrbook import extensions

 ifeq ($(OS_ARCH),WINNT)
 ifndef GNU_CC
diff -r 4210261cd843 mailnews/db/Makefile.in
--- a/mailnews/db/Makefile.in   Tue Nov 18 16:24:51 2008 +0000
+++ b/mailnews/db/Makefile.in   Tue Nov 18 08:31:42 2008 -0800
@@ -42,7 +42,7 @@

 include $(DEPTH)/config/autoconf.mk

-DIRS           = msgdb gloda
+DIRS           = msgdb

 include $(topsrcdir)/config/rules.mk

Also clobbered, let's see what happens this time.
Comment 7 Mark Banner (:standard8) 2008-11-19 00:10:10 PST
So since we disabled gloda, its now failed another few times. Let's re-enable gloda for now, and I'll have another think as to how we can test/resolve this.
Comment 8 Philippe M. Chiasson (:gozer) 2008-11-19 07:56:24 PST
Done, and clobbered.
Comment 9 Andrew Sutherland [:asuth] 2008-12-07 01:14:55 PST
Created attachment 351750 [details]
dtrace probe outputs where we see classinits across 2 threads

It would appear that our fancy Mac Address Book stuff is the "problem".  I place problem in quotes because one could argue that a custom memory allocator should not shut itself down while other threads that depend on it are still shutting themselves down.

A brief investigation into the TLS cleanup code going on suggests that objc-initialize.m::_class_initialize must be getting called on (at least) one thread which is not the main thread, leaving work for the cleanup mechanism.

It appears that AddressBook/CoreData actually creates threads 3 times during the bloat test, but that only 2 of these actually result in objective-c class initializations.  Furthermore, depending on the timing, the same physical thread may be reused so that we only see one distinct thread id that has class initializations occur.  (Actually, maybe they all result in initialization but reuse reliably happens in one of the cases, I didn't do the timestamp legwork... (see below))

In all 3 cases nsAbOSXDirectory::Init(char const*) is calling AddressBook`+[ABAddressBook sharedAddressBook].

I am attaching the results of my dtrace probes that generate a callstack for every call to _class_initialize not made on the main thread, plus various attempts to interleave information about who is creating threads and when.  Do not trust the "pthread_create:return" results, just the "thread_terminate_self:lwp-create" results.  Also, keep in mind that my mac mini is dual-core, I didn't set any core affinity, and the output from the two cores is not perfectly interleaved; watch the timestamps (and processor id's), because the file is not strictly ordered!
Comment 10 Andrew Sutherland [:asuth] 2008-12-07 01:16:10 PST
Created attachment 351751 [details]
dtrace probe outputs where we see classinits across 1 distinct thread id
Comment 11 Andrew Sutherland [:asuth] 2008-12-07 01:17:05 PST
Created attachment 351752 [details]
the dtrace probes with dubious ("wrong") pthread_create return result retrieval
Comment 12 Mark Banner (:standard8) 2008-12-07 06:31:53 PST
I don't quite see/understand why we might be getting the AddressBook/CoreData created off the main thread. AFAIK all the address book code that is relevant is working off the main thread, unless there's something hidden in rdf, which I doubt.

If nsAbOSXDirectory::Init is being called 3 times, that's a little bit of a surprise. Did you run the probes on both the runs, or just one of them?

I know that on the second run, the localstore.rdf may be causing the initialisation to be called early, and then lost, hence the second Init. Not sure where the third one is off hand, though looking at the traces it could just be rdf's fault.

This doesn't seem to be affecting released builds, although not having crash-stats on Mac at the moment its a little hard to say.

Also, given the fact that this may be rdf induced, and we're hoping on removing that soon, I'm thinking that the best option may be just to disable the OS X address book for the leak and bloat tests.

I'm certainly thinking about doing it at least temporarily to confirm this as the real issue.

Disabling would also have the advantage that for individuals testing on their machines, what they have in the Address Book app wouldn't affect their leak & bloat test results.
Comment 13 Mark Banner (:standard8) 2008-12-07 08:09:16 PST
Created attachment 351787 [details] [diff] [review]
[checked in] Disable OS X and OE address books.

Given the current indications that this is actually do with the OS X address book interfaces, I'd like to disable the OS X address book (and OE one for reasons cited in my previous comment), at least temporarily.

We can then confirm that this is the cause of the problem, and then consider where we want to go with fixes/re-enabling.
Comment 14 timeless 2008-12-07 09:04:44 PST
http://www.mail-archive.com/cocoa-dev@lists.apple.com/msg17262.html

This isn't something you're asking for, it's something that happens
Comment 15 Mark Banner (:standard8) 2008-12-07 11:16:10 PST
Comment on attachment 351787 [details] [diff] [review]
[checked in] Disable OS X and OE address books.

Checked in: http://hg.mozilla.org/comm-central/rev/9d25c19183ef

Lets see what the mac box does now.
Comment 16 Mark Banner (:standard8) 2008-12-07 13:45:48 PST
(In reply to comment #15)
> Lets see what the mac box does now.

Too early to say wrt failures, but we have seen approx 56.2kb drop in Lk (Trace malloc leaks). The differences in Lk (as you might expect) are to do with the Mac OS X address book and can be seen here at the end of the log: http://tinderbox.mozilla.org/showlog.cgi?log=Thunderbird/1228677504.1228678351.31742.gz#err8
Comment 17 Wayne Mery (:wsmwk, NI for questions) 2009-05-03 14:40:14 PDT
Mark, is this ready to land?
Comment 18 Wayne Mery (:wsmwk, NI for questions) 2012-12-29 09:12:52 PST
(In reply to Mark Banner (:standard8) from comment #16)
> (In reply to comment #15)
> > Lets see what the mac box does now.
> 
> Too early to say wrt failures, but we have seen approx 56.2kb drop in Lk
> (Trace malloc leaks). The differences in Lk (as you might expect) are to do
> with the Mac OS X address book and can be seen here at the end of the log:
> http://tinderbox.mozilla.org/showlog.cgi?log=Thunderbird/1228677504.
> 1228678351.31742.gz#err8

closeable?
Comment 19 Mark Banner (:standard8) 2012-12-30 01:27:07 PST
Nope, the assertion is still present on Tinderbox, but is permanent on 10.7 and I don't think we've got a separate bug for that (https://tbpl.mozilla.org/?tree=Thunderbird-Trunk&noignore=1).

Note You need to log in before you can comment on or make changes to this bug.