[Mac] Same-thread deadlock with trace-malloc (causing random red on OS X 10.5.2 mozilla-central leak test build)

RESOLVED FIXED in mozilla1.9.2a1

Status

Product: Core
Component: XPCOM
Priority: P1
Severity: normal
Status: RESOLVED FIXED
Reported: 10 years ago
Modified: 9 years ago

People

(Reporter: Axel Hecht, Assigned: roc)

Tracking

Keywords: fixed1.9.1, hang
Version: Trunk
Target Milestone: mozilla1.9.2a1
Hardware: x86
OS: Mac OS X
Points: ---
Bug Flags: blocking1.9.1 -

Firefox Tracking Flags

(Not tracked)

Attachments

(8 attachments)

(Reporter)

Description

10 years ago
We saw a rando-red on OS X 10.5.2 mozilla-central leak test build, where the builds were failing on bm-xserve 16-18, but succeeded on 22. Failed on moz2-darwin9-slave02.

Clobbered the failing slaves to see if that helps.
(Reporter)

Comment 1

10 years ago
The next build after the clobber on moz2-darwin9-slave02 succeeded, but on bm-xserve17 there is a new failure, http://production-master.build.mozilla.org:8010/builders/OS%20X%2010.5.2%20mozilla-central%20leak%20test%20build/builds/7619, in the first test_alive step instead of the fifth.
Nick and I were looking at this a little bit last night. We determined that at least two slaves (moz2-darwin9-slave07 and bm-xserve19 IIRC) were capable of both passing and failing the leak test. Given that the clobbers haven't helped fix the problem, I'd say this is a legitimate, intermittent problem.
What's the failure mode look like? Every time I've looked at it, all I see is "address already in use", which looks bogus.
(Reporter)

Comment 4

10 years ago
The "address already in use" one is for the first test_alive. The ones I and Mossop looked at where in test_alive5, and those were actually dying with a slave command timeout. Which, I think, we're interpreting as a hang.
Gotcha. So it sounds like we have this hang in test_alive5, and then when it fails, the next time it runs we fail trying to start the httpd. catlee points out bug 478150, which could be the cause of the second failure.
You can ignore the "address in use" failures. They're only happening on bm-xserve17, and appear to be caused by hung processes. We're tracking this one in bug 478150.
I've closed the tree until we can identify what's going on with this.
The first build that started to go red intermittently built http://hg.mozilla.org/mozilla-central/pushloghtml?changeset=561dc7d5c746 which was pushed by dbaron.

In the log for that, I see a lot of "failed to bind socket" for what I think is the SSL tunnel, but not totally sure.  Don't think that's what is causing this though.

Fatal error was:
command timed out: 3600 seconds without output, killing pid 65209
process killed by signal 9
program finished with exit code -1

I see these two errors in the following two builds that went red (but not in the first):
WARNING: NS_ENSURE_TRUE(browserChrome) failed: file /builds/moz2_slave/mozilla-central-macosx-debug/build/docshell/base/nsDocShell.cpp, line 9128
WARNING: Something wrong when creating the docshell for a frameloader!: file /builds/moz2_slave/mozilla-central-macosx-debug/build/content/base/src/nsFrameLoader.cpp, line 903


It should also be noted that the change just before this one built the tracemonkey merge:
http://hg.mozilla.org/mozilla-central/pushloghtml?changeset=1ed5d7300265; however, it was green. Since this has been random, though, it's possible that it (or anything before it) is the culprit.
(In reply to comment #8)
> I see these two errors in the following two builds that went red (but not in
> the first):
> WARNING: NS_ENSURE_TRUE(browserChrome) failed: file
> /builds/moz2_slave/mozilla-central-macosx-debug/build/docshell/base/nsDocShell.cpp,
> line 9128
> WARNING: Something wrong when creating the docshell for a frameloader!: file
> /builds/moz2_slave/mozilla-central-macosx-debug/build/content/base/src/nsFrameLoader.cpp,
> line 903
bug 478129
(In reply to comment #8)
> The first build that started to go red intermittently built
> http://hg.mozilla.org/mozilla-central/pushloghtml?changeset=561dc7d5c746 which
> was pushed by dbaron.

When roc was looking into this yesterday, I thought he found (more-intermittent) occurrences going back a few days earlier.
(In reply to comment #10)
> When roc was looking into this yesterday, I thought he found
> (more-intermittent) occurrences going back a few days earlier.
For the browserChrome / docshell issues in comment 9, yes, but there was a long streak of greenness before our recent string of random red.
So is this bug about a hang or a crash, and on what test?
(In reply to comment #8)
> The first build that started to go red intermittently built
> http://hg.mozilla.org/mozilla-central/pushloghtml?changeset=561dc7d5c746 which
> was pushed by dbaron.

And just to be clear, the four changesets there are:
 (1) something that should affect unit test machines only
 (2) something that I don't think is executed in any of our tests
 (3) something that affects reftests only
 (4) something that affects MathML code only
so unless we're handling MathML on startup, I don't think any of them should affect the tests running on the leak test machine.
(In reply to comment #12)
> So is this bug about a hang or a crash, and on what test?
We get:
command timed out: 3600 seconds without output, killing pid 65209
process killed by signal 9
program finished with exit code -1
every time it's red, so I'm not sure if that is a hang or a crash according to buildbot.
Created attachment 362079 [details]
gdb backtrace from moz2-darwin-slave07

The process was using one core's worth of CPU time; I then attached with gdb and used 'thread apply all bt' to get this output. Slave is moz2-darwin-slave07.
Attachment #362079 - Attachment description: gdb backtrace → gdb backtrace from moz2-darwin-slave07
Created attachment 362087 [details]
gdb backtrace from bm-xserve17

Same as comment #16 except from bm-xserve17 (quad core xserve vs dual core mini). Slight correction to that comment: it was consuming all the available CPU; same for this one.
(In reply to comment #10)
> When roc was looking into this yesterday, I thought he found
> (more-intermittent) occurrences going back a few days earlier.

Here's an occurrence on Jan 27:
http://tinderbox.mozilla.org/showlog.cgi?log=Firefox/1232982889.1232988151.18515.gz

I didn't look backwards any further than that.
Which thread(s) is/are using the cpu time?
Thank you Nick! The bm-xserve17 stack trace is better because it shows more of the stacks.

It looks like we're deadlocked in spin-locks, that's why the CPU is pegged.

Too bad we didn't get the stack for thread 0x10b or 0x5103. But there are 5 threads in __spin_lock, presumably called from here:
http://src.gnu-darwin.org/DarwinSourceArchive/expanded/Libc/Libc-262.3.2/pthreads/pthread.c
The question is, who is holding the sem_pool_lock? I can't tell from these stacks. Of course, no thread should ever block while holding a spin-lock so this situation should be impossible. And none of the threads whose stacks we can see are actually doing anything but waiting for the next event or for more work, so none of them should be holding higher-level locks.
Created attachment 362104 [details]
Activity Monitor Sample

Not sure if this helps for comment #19 or not; ps -M is fairly unhelpful since it doesn't identify the threads:

$ ps -M 44641
USER     PID   TT  %CPU STAT PRI     STIME     UTIME COMMAND
cltbld 44641 s001   96.7 R      0   6:29.03  16:28.01 /builds/moz2_slave/mozilla-central-macosx-debug/build/obj-firefox/dist/Minefie
       44641         0.0 S     21   0:00.00   0:00.00        
       44641        93.3 R      0   6:28.75  16:47.45        
       44641         0.0 S     32   0:00.00   0:00.00        
       44641        49.4 R      0   6:19.80  16:27.45        
       44641        50.6 R      0   6:40.28  15:55.25        
       44641         0.0 S     32   0:00.00   0:00.01        
       44641        83.4 R      5   4:58.40  12:32.79        
       44641         0.0 S     32   0:00.00   0:00.00        
       44641         0.0 S     32   0:00.00   0:00.00        
       44641         0.0 S     32   0:00.00   0:00.00        

I jumped ship to bm-xserve18 for this, stack is similar.
Created attachment 362105 [details]
gdb backtrace from bm-xserve18

Taken a little after the Activity Sample.
Comment on attachment 362087 [details]
gdb backtrace from bm-xserve17

>Thread 2 (process 52568 thread 0x1a03):
>#0  0x9350b9e6 in mach_msg_trap ()
>#1  0x935131dc in mach_msg ()
>#2  0x001213bc in google_breakpad::ExceptionHandler::WaitForMessage (exception_handler_class=0xb0f800) at /builds/buildbot/moz2-slave/mozilla-central-macosx-debug/build/toolkit/crashreporter/google-breakpad/src/client/mac/handler/exception_handler.cc:451
>#3  0x9353cc55 in _pthread_start ()
>#4  0x9353cb12 in thread_start ()
As far as I can tell, this is indicative of breakpad trying to come up.  It isn't around normally in my builds, so maybe we should try and actually disable breakpad by setting MOZ_CRASHREPORTER_DISABLE=1 instead of MOZ_CRASHREPORTER_NO_REPORT=1 like we currently do.
Taking this bug for now but it's looking like a code issue and will move to the appropriate place later.
Status: NEW → ASSIGNED
Priority: -- → P1
Comment #21 matches comment #22: 5 threads all spinning in __spin_lock. Too bad 3 of the 5 threads have no stack trace :-(. I hope we don't have to install a debug build of Darwin on the slaves!!!

Disabling crashreporter sounds like a great thing to try...
(In reply to comment #23)
> As far as I can tell, this is indicative of breakpad trying to come up.  It
> isn't around normally in my builds, so maybe we should try and actually disable
> breakpad by setting MOZ_CRASHREPORTER_DISABLE=1 instead of
> MOZ_CRASHREPORTER_NO_REPORT=1 like we currently do.

I'll set this up in our staging environment, might take a couple of hours for results.
Created attachment 362120 [details]
gdb backtrace from moz2-darwin-slave03 with MOZ_CRASHREPORTER_DISABLE=1
(In reply to comment #23)
> As far as I can tell, this is indicative of breakpad trying to come up.  It
> isn't around normally in my builds, so maybe we should try and actually disable
> breakpad by setting MOZ_CRASHREPORTER_DISABLE=1 instead of
> MOZ_CRASHREPORTER_NO_REPORT=1 like we currently do.

No, that's just Breakpad sitting around waiting for something to happen. It's normal. You don't see it in your builds because you don't enable the crashreporter at runtime, presumably. (With MOZILLA_OFFICIAL.)
I reproduced this locally and got a good stack trace showing where the problem is:

#0  0xffff027f in ___spin_lock () at /System/Library/Frameworks/System.framework/PrivateHeaders/i386/cpu_capabilities.h:234
#1  0x92877e19 in pthread_mutex_unlock ()
#2  0x928791b3 in dladdr ()
#3  0x0064390c in NS_DescribeCodeAddress (aPC=0x928a0f52, aDetails=0xb00a08b0) at /Users/roc/mozilla-checkin/xpcom/base/nsStackWalk.cpp:1503
#4  0x0066a1f6 in calltree (stack=0x1c63590, num_stack_entries=12, t=0x1c578a0) at /Users/roc/mozilla-checkin/tools/trace-malloc/lib/nsTraceMalloc.c:699
#5  0x0066ace3 in backtrace (t=0x1c578a0, skip=2) at /Users/roc/mozilla-checkin/tools/trace-malloc/lib/nsTraceMalloc.c:967
#6  0x0066c62b in MallocCallback (ptr=0x1c63550, size=64, start=0, end=0, t=0x1c578a0) at /Users/roc/mozilla-checkin/tools/trace-malloc/lib/nsTraceMalloc.c:1795
#7  0x0066b01c in my_malloc_logger (type=10, arg1=212992, arg2=64, arg3=0, result=29766992, num_hot_frames_to_skip=0) at /Users/roc/mozilla-checkin/tools/trace-malloc/lib/nsTraceMalloc.c:1086
#8  0x9287104c in malloc_zone_malloc ()
#9  0x9287a812 in realloc ()
#10 0x928a2448 in new_sem_from_pool ()
#11 0x928a20e6 in _pthread_cond_wait ()
#12 0x928ecaaf in pthread_cond_timedwait ()
#13 0x00b67e60 in pt_TimedWait (cv=0x1c62f64, ml=0x1c618a0, timeout=1000) at /Users/roc/mozilla-checkin/nsprpub/pr/src/pthreads/ptsynch.c:280
#14 0x00b6837b in PR_WaitCondVar (cvar=0x1c62f60, timeout=1000) at /Users/roc/mozilla-checkin/nsprpub/pr/src/pthreads/ptsynch.c:407
#15 0x05187e79 in XPCJSRuntime::WatchdogMain (arg=0x1c5d520) at /Users/roc/mozilla-checkin/js/src/xpconnect/src/xpcjsruntime.cpp:824
#16 0x00b6f411 in _pt_root (arg=0x1c63250) at /Users/roc/mozilla-checkin/nsprpub/pr/src/pthreads/ptthread.c:221
#17 0x928a1095 in _pthread_start ()
#18 0x928a0f52 in thread_start ()
running trace-malloc code in the realloc while the spinlock is held is bad bad bad bad...
So this is a bug in trace-malloc, right?
Component: Release Engineering → XPCOM
Product: mozilla.org → Core
QA Contact: release → xpcom
Version: other → Trunk
roc, isn't there another thread that's holding or blocking on these same locks?  And I'm guessing that thread would be violating our portability guidelines about static initialization.
Assuming the bug is in trace-malloc (which I do not know for sure), do we just need to do some checking like we do in realloc for my_malloc_logger?
http://mxr.mozilla.org/mozilla-central/source/tools/trace-malloc/lib/nsTraceMalloc.c#1170
This bug should be fixable without touching trace-malloc.  If it's not, then we have to declare trace-malloc dead.
I'm not going to hold the tree closed for this issue anymore.
Status: ASSIGNED → NEW
Summary: rando-red on OS X 10.5.2 mozilla-central leak test build, slave dependent? → rando-red on OS X 10.5.2 mozilla-central leak test build
David, I suspect that this __spin_lock is not reentrant. So this thread is probably just deadlocking on itself; it's taken the lock in new_sem_from_pool, and it's trying to take it again, but it can't because the lock is already held.

I see two options, both hacky:

1) Avoid calling NS_DescribeCodeAddress when new_sem_from_pool is on the stack. We could do that by determining the address ranges corresponding to some set of functions (at least malloc_zone_malloc, realloc, and new_sem_from_pool), and caching those so that we don't have to call NS_DescribeCodeAddress for addresses in those functions (all before we enable the malloc logger). A rough sketch of this follows below.

2) Try to avoid having new_sem_from_pool calling realloc during normal browser execution. We might be able to do this by doing a series of thread operations that forces a large number of these semaphores to be added to the pool (before we enable the malloc logger).
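
For illustration, here is a minimal sketch of what option 1 could look like. The names (CodeRange, gCriticalRange, PCInCriticalRange), the choice of symbol, and the window size are illustrative assumptions, not the actual patch:

/* Resolve a known-dangerous libc entry point once, before the malloc logger
   is enabled, and remember an address window around it.  The stack walker
   can then skip symbol lookup for any PC that falls inside that window.
   The symbol name and the 0x1000 window size are guesses for illustration. */
#include <dlfcn.h>
#include <stdint.h>

typedef struct { uintptr_t start, end; } CodeRange;

static CodeRange gCriticalRange;          /* static, so zero-initialized */

static void InitCriticalRange(void)
{
    void *sym = dlsym(RTLD_DEFAULT, "pthread_cond_wait");
    if (sym) {
        gCriticalRange.start = (uintptr_t)sym;
        gCriticalRange.end   = gCriticalRange.start + 0x1000;
    }
}

static int PCInCriticalRange(uintptr_t pc)
{
    return gCriticalRange.end != 0 &&
           pc >= gCriticalRange.start &&
           pc <  gCriticalRange.end;
}

The malloc hook would call PCInCriticalRange() on each walked frame and skip the NS_DescribeCodeAddress call (or bail out entirely) when it returns nonzero.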
(In reply to comment #36)
> David, I suspect that this __spin_lock is not reentrant. So this thread is
> probably just deadlocking on itself; it's taken the lock in new_sem_from_pool,
> and it's trying to take it again, but it can't because the lock is already
> held.
Isn't the whole point of locking to prevent threads from accessing a resource simultaneously?  If __spin_lock wasn't supposed to be reentrant across threads, how could it possibly accomplish that?  (Unless you meant it was reentered on the same thread, in which case this comment can be disregarded)
(In reply to comment #37)
> (In reply to comment #36)
> > David, I suspect that this __spin_lock is not reentrant. So this thread is
> > probably just deadlocking on itself; it's taken the lock in new_sem_from_pool,
> > and it's trying to take it again, but it can't because the lock is already
> > held.
> Isn't the whole point of locking to prevent threads from accessing a resource
> simultaneously?  If __spin_lock wasn't supposed to be reentrant across
> threads, how could it possibly accomplish that?  (Unless you meant it was
> reentered on the same thread, in which case this comment can be disregarded)

It's the same thread. But more, a spin lock should not be held across any significant out-call such as malloc or realloc. This is an Apple bug. Is it possible to file it (is it in some Darwin code that is open source)?

The fix is to avoid reallocing memory shared among threads. Make racers malloc new memory, the winner spinlocks after malloc succeeds and updates the shared pointer, the losers then lock, see they've lost, free the memory they allocated, and proceed (if enough space) or try again.

/be
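
To make the suggested pattern concrete, here is a rough sketch (not Apple's code; all names are invented, and C11 atomics stand in for the real spin lock) of growing a shared pool without ever calling malloc, realloc, or free while the spin lock is held:

#include <stdatomic.h>
#include <stdlib.h>
#include <string.h>

typedef struct {
    void  *items;
    size_t capacity;                           /* in bytes */
} Pool;

static Pool pool;                              /* shared state */
static atomic_flag gLock = ATOMIC_FLAG_INIT;   /* stand-in for sem_pool_lock */

static void spin_lock(void)   { while (atomic_flag_test_and_set(&gLock)) { } }
static void spin_unlock(void) { atomic_flag_clear(&gLock); }

/* Grow the pool to at least `needed` bytes.  Allocation and freeing happen
   only with the lock released; the lock is held just long enough to copy
   and publish. */
static int grow_pool(size_t needed)
{
    for (;;) {
        spin_lock();
        size_t current = pool.capacity;
        spin_unlock();
        if (current >= needed)
            return 0;                          /* someone else already grew it */

        void *fresh = malloc(needed);          /* lock NOT held here */
        if (!fresh)
            return -1;

        spin_lock();
        if (pool.capacity < needed) {          /* we won the race */
            if (pool.items)
                memcpy(fresh, pool.items, pool.capacity);  /* no allocation */
            void *old = pool.items;
            pool.items = fresh;
            pool.capacity = needed;
            spin_unlock();
            free(old);                         /* free outside the lock too */
            return 0;
        }
        spin_unlock();
        free(fresh);                           /* we lost: discard and re-check */
    }
}

The key property is that the only work done while gLock is held is a memcpy and a pointer swap, neither of which can re-enter the allocator.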
(In reply to comment #38)
> It's same thread. But more, a spin lock should not be held across any
> significant out-call such as malloc or realloc. This is an Apple bug. Is it
> possible to file it (is it in some Darwin code that is open source)?

Seems to be, since I found the code on the Web (see link above). Not sure where or how I'd file it. We can say they shouldn't do that, but they may just disagree.

Anyway, we definitely need to work around it. Anyone got a preference between the options 1 and 2 in comment #36? Or another idea?

Updated

10 years ago
Keywords: hang
Summary: rando-red on OS X 10.5.2 mozilla-central leak test build → [Mac] Same-thread deadlock with trace-malloc (causing random red on OS X 10.5.2 mozilla-central leak test build)
option 1 sounds like it would help performance a bit, even, and seems easier to maintain/extend as needed than option 2.
When did this start happening? Did some of our code change to trigger it, or did the build slaves get updated? It seems odd that we'd all of a sudden start tripping over an OS X bug.
Recent changes to the Macs are relatively trivial:
https://wiki.mozilla.org/index.php?title=ReferencePlatforms%2FMac-10.5&diff=126122&oldid=103287
Can't just be that anyway, or we'd see this error on mozilla-1.9.1 too. roc did some digging on how often this was happening on m-c I think.
It started happening more frequently after the last tracemonkey merge, but it's happened before then too.
Created attachment 362428 [details] [diff] [review]
fix

I don't like this patch very much but it seems like the best option.
Assignee: nobody → roc
Attachment #362428 - Flags: review?(dbaron)
Comment on attachment 362428 [details] [diff] [review]
fix

gCriticalRange should be initialized to {0, 0}, shouldn't it?

Comment 46

10 years ago
If "pthread_cond_wait$UNIX2003" cannot be found, every NS_StackWalk will call dlsym (since gCriticalRange remains empty).  Is that expensive?
Shouldn't this be in NS_DescribeCodeAddress instead?  Then trace-malloc could get the failed symbols later.

That said, could you attach stacks for all threads?  I still don't understand what the problem is, since the stacks I've seen show both threads spinning on the same lock.  Who's holding that lock?
(In reply to comment #45)
> (From update of attachment 362428 [details] [diff] [review])
> gCriticalRange should be initialized to {0, 0}, shouldn't it?

It's static so it should already be initialized to 0,0.

(In reply to comment #46)
> If "pthread_cond_wait$UNIX2003" cannot be found, every NS_StackWalk will call
> dlsym (since gCriticalRange remains empty).  Is that expensive?

Yes, but a) the symbol will be found and b) tracemalloc performance is not a big deal, especially since if the symbol is not found we're likely to deadlock anyway...

(In reply to comment #47)
> Shouldn't this be in NS_DescribeCodeAddress instead?  Then trace-malloc could
> get the failed symbols later.

If new_sem_from_pool is on the stack it's not safe to get *any* code addresses. So we'd have to walk the stack, save all the PCs, continue, and some time later do the PC to symbol conversion. I guess we could do that but it seems like more redesign work than it's worth.
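
A bare-bones sketch of the deferral being described (hypothetical types and names, not a proposed patch): the hook would record only raw PCs into a fixed-size buffer, and dladdr() would only be called much later, from a safe context:

#include <dlfcn.h>
#include <stdint.h>
#include <stdio.h>

#define MAX_FRAMES 32

typedef struct {
    uintptr_t pcs[MAX_FRAMES];
    int       count;
} RawStack;

/* Called from the malloc hook: no dladdr, no symbol lookup, no allocation. */
static void CaptureRawStack(RawStack *out, const uintptr_t *walked, int n)
{
    out->count = (n > MAX_FRAMES) ? MAX_FRAMES : n;
    for (int i = 0; i < out->count; i++)
        out->pcs[i] = walked[i];
}

/* Called later (e.g. when the trace-malloc log is dumped), where it is safe
   to call dladdr() and pretty-print each frame. */
static void SymbolicateLater(const RawStack *s)
{
    for (int i = 0; i < s->count; i++) {
        Dl_info info;
        if (dladdr((void *)s->pcs[i], &info) && info.dli_sname)
            printf("#%d %p %s\n", i, (void *)s->pcs[i], info.dli_sname);
        else
            printf("#%d %p\n", i, (void *)s->pcs[i]);
    }
}

As the follow-up comment notes, the hard part is guaranteeing that nothing on the capture path allocates or takes a lock.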
(In reply to comment #49)
> If new_sem_from_pool is on the stack it's not safe to get *any* code addresses.
> So we'd have to walk the stack, save all the PCs, continue, and some time later
> do the PC to symbol conversion. I guess we could do that but it seems like more
> redesign work than it's worth.

Part of the annoyance of this is that we'd have to document that the stack-walk callback must avoid calling almost all external functions, including malloc. So actually figuring out how to save the PCs and do the symbol conversion later would be a pain.
Hmm. Of course, it's not the stack-walk that's the problem, it's everything under ReallocCallback that is at risk of deadlock here. That includes TM_SUPPRESS_TRACING_AND_ENTER_LOCK and PL_HashTableAdd, both of which I assume can do pthread_mutex_lock, which can acquire the sem_pool_lock spinlock.

So we really need a different approach, where the stack-walker can notify its callers that we're in a dangerous situation and the caller should just bail out.
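
A rough shape of that idea (illustrative names only, not the patch that follows): the walker reports when a frame lands in the dangerous range, and the malloc callback returns before taking any trace-malloc locks or touching its tables:

#include <stddef.h>
#include <stdint.h>

enum { WALK_OK = 0, WALK_IN_CRITICAL_CODE = 1 };

/* Imagined stack walker: fills pcs[] and returns WALK_IN_CRITICAL_CODE when
   any frame falls inside the cached dangerous range (see the earlier
   gCriticalRange sketch).  Body elided; only the interface shape matters. */
static int WalkTheStack(uintptr_t *pcs, int max, int *count)
{
    (void)pcs; (void)max;
    *count = 0;
    return WALK_OK;
}

static void MallocCallback(void *ptr, size_t size)
{
    uintptr_t pcs[64];
    int n;
    if (WalkTheStack(pcs, 64, &n) == WALK_IN_CRITICAL_CODE)
        return;   /* bail out ASAP: no locks, no hash tables */
    /* ...normal bookkeeping (locking, hash table insert, etc.) would follow... */
    (void)ptr; (void)size;
}

The point is that every callback that can run under new_sem_from_pool's spin lock has an early, lock-free exit.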
Created attachment 362503 [details] [diff] [review]
fix v2

This is better. Make sure the malloc callbacks just bail out ASAP.
Attachment #362503 - Flags: review?(dbaron)
Attachment #362503 - Flags: review?(dbaron) → review-
Comment on attachment 362503 [details] [diff] [review]
fix v2

trace-malloc is going to start producing bad numbers if you bail this early, I think.

I also don't see any reason to have the stack walking code fail when the problem is in the symbol lookup code.  We should be changing NS_DescribeCodeAddress instead.
I propose we disable the trace-malloc leak test on Mac.
Sorry, trying to be on vacation and missed the comments between the last review request and this one.  I can try to look at this tomorrow afternoon or later.
(In reply to comment #55)
> I propose we disable the trace-malloc leak test on Mac.

Alternatively, give trace-malloc a mode where it doesn't record the stacks, and use that mode in the leak tests on Mac.  Then we'll only get numbers and not the diagnostic output about what code is causing the leaks.

Comment 58

10 years ago
How about doing what trace-refcnt does?  trace-refcnt generates "fast stacks" that can be run through fix-macosx-stack.pl later to create "correct stacks".  The leak machines could even be set up to run fix-macosx-stack.pl automatically.
(In reply to comment #58)
> How about doing what trace-refcnt does?  trace-refcnt generates "fast stacks"
> that can be run through fix-macosx-stack.pl later to create "correct stacks". 
> The leak machines could even be set up to run fix-macosx-stack.pl
> automatically.

This is the same code.  nsTraceRefcnt just doesn't get called from within OS threading stuff.
(In reply to comment #51)
> Hmm. Of course, it's not the stack-walk that's the problem, it's everything
> under ReallocCallback that is at risk of deadlock here. That includes
> TM_SUPPRESS_TRACING_AND_ENTER_LOCK and PL_HashTableAdd, both of which I assume
> can do pthread_mutex_lock, which can acquire the sem_pool_lock spinlock.

So if this is really a problem, then I think we should just disable trace-malloc on Mac and be done with it.

If it's not, I think we should either do what I said in comment 57, or use roc's second approach but instead handle it in trace-malloc by using an alternative to NS_DescribeCodeAddress that just prints the address (rather than bailing).


Sorry for too many comments here; I made the mistake of trying to deal with the email that I could in 10 minutes instead of just ignoring it all.
(In reply to comment #60)
> (In reply to comment #51)
> > Hmm. Of course, it's not the stack-walk that's the problem, it's everything
> > under ReallocCallback that is at risk of deadlock here. That includes
> > TM_SUPPRESS_TRACING_AND_ENTER_LOCK and PL_HashTableAdd, both of which I assume
> > can do pthread_mutex_lock, which can acquire the sem_pool_lock spinlock.
> 
> So if this is really a problem, then I think we should just disable
> trace-malloc on Mac and be done with it.

Why, because this patch is too ugly? I don't think it's going to disturb trace-malloc results in any meaningful way. We'll just ignore this one realloc which is causing us problems.

> If it's not, I think we should either do what I said in comment 57, or use
> roc's second approach but instead handle it in trace-malloc by using an
> alternative to NS_DescribeCodeAddress that just prints the address (rather than
> bailing).

I'm not sure we can print with no risk of hitting the deadlock.
I'm in favor of giving roc's latest patch a try, FWIW.

/be
Comment on attachment 362503 [details] [diff] [review]
fix v2

OK, r=dbaron.  Please monitor the trace-malloc numbers (leaks, allocs, max heap) to see if they become more variable as a result of this patch.
Attachment #362503 - Flags: review- → review+
Because of the user-visible risk during printing (comment #61), and the ongoing intermittent burning on the tinderbox tree, nominating as "blocking1.9.1?".
Flags: blocking1.9.1?
There's no user-visible risk, because "print" there meant "print trace-malloc output to console" and also because we don't ship trace-malloc builds to users.
Flags: blocking1.9.1? → blocking1.9.1-
Whiteboard: [needs landing]
Pushed changeset 90d9e5943324 and followup bustage fix changeset 65a461a480b7 to mozilla-central.
Status: NEW → RESOLVED
Last Resolved: 10 years ago
Resolution: --- → FIXED
Whiteboard: [needs landing]
Target Milestone: --- → mozilla1.9.2a1
Comment on attachment 362503 [details] [diff] [review]
fix v2

Do we want this for branch?
Attachment #362503 - Flags: approval1.9.1?
We may as well take it on branch. There is zero risk to any shipping product.
Attachment #362503 - Flags: approval1.9.1? → approval1.9.1+