Closed Bug 442192 Opened 13 years ago Closed 12 years ago

Trace Malloc xpcshell randomly crashes running mochitest

(Core :: XPCOM, defect)

(Reporter: bent.mozilla, Assigned: dbaron)

(Keywords: fixed1.9.1)

(2 files, 2 obsolete files)

Attached file Stack trace
Stack attached. I'm guessing that trace malloc has some sort of thread safety hazard from those comments ("same stack"), but I don't really know this code at all.
I'm seeing this crash on windows too, fairly regularly, while the browser_bug420160.js browser-chrome test is run.
OS: Mac OS X → All
Version: unspecified → Trunk
Ok, I'm now crashing 90% of the time, basically making it impossible for me to run mochitests.
Unless you really need trace-malloc, you can use 'ac_add_options --enable-logrefcnt' to avoid these crashes.
I think the problem here is that xpcshell doesn't call NS_TraceMallocStartupArgs before creating additional threads; trace-malloc assumes that that will happen.  Without that, we're using the same structure for the thread-local storage object on every thread.

I have a patch that should fix it, although there's probably a better (and maybe more-threadsafe) way to do the initialization.
This patch has bitrotted in the meantime. David, could you please update it? It would be nice to see this issue fixed.
Ben, I compiled with the suggested 'ac_add_options --enable-logrefcnt' option, but it still doesn't work. For now I'll disable both and build without trace-malloc.
Attached patch patch (obsolete) — Splinter Review
Here's an updated patch (although the merging was trivial, and you could have applied the patch using an appropriate -F option rather than asking for a new one).
Assignee: nobody → dbaron
Attachment #342577 - Attachment is obsolete: true
Comment on attachment 347623 [details] [diff] [review]

Brendan, does this seem like a reasonable approach?

Rather than requiring any app using the trace-malloc stack tracing functions to call NS_TraceMallocStartupArgs, this instead tries to initialize the thread-local storage index on demand.  To avoid having to use atomic operations every time, I used two variables:
 * tls_index_initialized, which changes once in the lifetime of the app
 * tls_index_initializing, on which we do an atomic increment if tls_index_initialized is false, to ensure that only one thread initializes the TLS index

Does this approach seem like it should work?  Is there any better alternative to just looping if the first check says we should initialize and the second doesn't?
Attachment #347623 - Flags: review?(brendan)
Alternatively, could you use PR_CallOnce?

I was hoping to avoid having to hit atomic operations in the normal case if I could do so safely.
Comment on attachment 347623 [details] [diff] [review]

I'm on record as objecting to hand-coded user-mode spinloops in code that might run on a uniprocessor. What if pthread_key_create results in a preemption, and another thread camped on that while (!tls_index_initialized); loop runs? In the worst case it will have to spin until the end of its timeslice.

I was wondering (mentioned to dbaron in person already) why PR_CallOnce couldn't be optimized this way too, if it made sense (on SMP machines, or with better OS-specific primitives that avoid spinning too long). Asking wtc for his thoughts.

Attachment #347623 - Flags: review?(wtc)
I think a reasonable alternative here is probably to just make and document the assumption that the first call to malloc will be before there is more than one thread.
Actually, maybe the better fix is to use PR_CallOnce inside a boolean check, i.e.:

if (!tls_index_initialized) {
  PR_CallOnce(/* ... */);
  tls_index_initialized = PR_TRUE;
}

which we could potentially enter more than once.
Comment on attachment 347623 [details] [diff] [review]

>+    if (!tls_index_initialized) {
>+        if (PR_AtomicIncrement(&tls_index_initializing) == 1) {
>+            TM_CREATE_TLS_INDEX(tls_index);
>+            tls_index_initialized = PR_TRUE;
>+        } else {
>+            while (!tls_index_initialized)
>+                ;
>+        }
>     }

Yes, this kind of hand-written synchronization code
is difficult to get right, and the spin loop can be
problematic because the thread scheduler is not aware
of it.  We have seen thread starvation caused by such
a spin loop on older versions of Solaris.

I remember TraceMalloc can't use PRLock.  If so, you
won't be able to use PR_CallOnce because it is
implemented with PRLock and PRCondVar.
Attachment #347623 - Flags: review?(wtc) → review-
Comment on attachment 347623 [details] [diff] [review]

>+    if (!tls_index_initialized) {
>+        if (PR_AtomicIncrement(&tls_index_initializing) == 1) {
>+            TM_CREATE_TLS_INDEX(tls_index);
>+            tls_index_initialized = PR_TRUE;
>+        } else {
>+            while (!tls_index_initialized)
>+                ;
>+        }
>     }

This is essentially how PR_CallOnce is implemented, except
without the lock and condition variable.

In your spin while loop, you should call a thread yield
function.  Try PR_Sleep(PR_INTERVAL_NO_WAIT), or use
platform-specific functions (sched_yield() for Unix,
and Sleep(0) for Windows).
Can we please please get this fixed? Comment #13 sounds completely reasonable to me.

Right now mochitests in trace-malloc builds are just broken.
(In reply to comment #13)
> I think a reasonable alternative here is probably to just make and document the
> assumption that the first call to malloc will be before there is more than one
> thread.

Yes, this is reasonable as roc says. I'll r+ an alternative patch.

Attached patch patch — Splinter Review
OK, here's the revised patch.  I stuck some warning fixes in the middle.
Attachment #347623 - Attachment is obsolete: true
Attachment #366748 - Flags: review?(brendan)
Attachment #347623 - Flags: review?(brendan)
Attachment #366748 - Flags: review?(brendan) → review+
Comment on attachment 366748 [details] [diff] [review]

>+    /*
>+     * XXX Perhaps we should call backtrace() so we can check for
>+     * immediate_abort. However, the only current contexts where
>+     * immediate_abort will be true do not call free(), so for now,
>+     * let's avoid the cost of backtrace().
>+     */

Rather than XXX deadwood, this comment should cite bug 478195. r=me with that. Thanks!


I'll land on 1.9.1 as well.
Closed: 12 years ago
Resolution: --- → FIXED
Martijn, could you verify the fix for mc and/or 1.9.1? I don't have any builds available right now.
Target Milestone: --- → mozilla1.9.2a1
Verified for trunk.