Closed Bug 442192 Opened 17 years ago Closed 17 years ago

Trace Malloc xpcshell randomly crashes running mochitest

Tracking

()

Status:

VERIFIED FIXED

Milestone:

mozilla1.9.2a1

People

(Reporter: bent.mozilla, Assigned: dbaron)

References

Details

(Keywords: fixed1.9.1)

Attachments

(2 files, 2 obsolete files)

Stack trace 17 years ago Ben Turner (not reading bugmail, use the needinfo flag!) 6.37 KB, text/plain		Details
possible patch 17 years ago David Baron :dbaron: (⌚️UTC-5, no longer working on Mozilla) 2.92 KB, patch		Details \| Diff \| Splinter Review
patch 17 years ago David Baron :dbaron: (⌚️UTC-5, no longer working on Mozilla) 2.90 KB, patch	wtc : review-	Details \| Diff \| Splinter Review
patch 17 years ago David Baron :dbaron: (⌚️UTC-5, no longer working on Mozilla) 6.29 KB, patch	brendan : review+	Details \| Diff \| Splinter Review

Ben Turner (not reading bugmail, use the needinfo flag!)

Reporter

Description

•

17 years ago

Attached file Stack trace — Details

Stack attached. I'm guessing that trace malloc has some sort of thread safety hazard from those comments ("same stack"), but I don't really know this code at all.

Martijn Wargers (dead)

Comment 1

•

17 years ago

I'm seeing this crash on windows too, fairly regularly, while the browser_bug420160.js browser-chrome test is run.

OS: Mac OS X → All

Serge Gautherie (:sgautherie)

Updated

•

17 years ago

Version: unspecified → Trunk

Martijn Wargers (dead)

Comment 2

•

17 years ago

Ok, I'm now crashing 90% of the time, basically making it impossible for me to run mochitests.

Ben Turner (not reading bugmail, use the needinfo flag!)

Reporter

Comment 3

•

17 years ago

unless you really need tracemalloc you can use 'ac_add_options --enable-logrefcnt' to avoid these crashes.

David Baron :dbaron: (⌚️UTC-5, no longer working on Mozilla)

Assignee

Comment 4

•

17 years ago

I think the problem here is that xpcshell doesn't call NS_TraceMallocStartupArgs before creating additional threads; trace-malloc assumes that that will happen. Without that, we're using the same structure for the thread-local storage object on every thread. I have a patch that should fix it, although there's probably a better (and maybe more-threadsafe) way to do the initialization.

David Baron :dbaron: (⌚️UTC-5, no longer working on Mozilla)

Assignee

Comment 5

•

17 years ago

Attached patch possible patch (obsolete) — Details — Splinter Review

Henrik Skupin [:whimboo][⌚️UTC+1]

Comment 6

•

17 years ago

This patch has been bitrotted meanwhile. David, could you please update? It would be nice to see this issue fixed.

Henrik Skupin [:whimboo][⌚️UTC+1]

Comment 7

•

17 years ago

Ben, I compiled with the suggested option 'ac_add_options --enable-logrefcnt' but it still doesn't work. For now I'll disable both and will build without trace-malloc.

David Baron :dbaron: (⌚️UTC-5, no longer working on Mozilla)

Assignee

Comment 8

•

17 years ago

Attached patch patch (obsolete) — Details — Splinter Review

Here's an updated patch (although the merging was trivial, and you could have applied the patch using an appropriate -F option rather than asking for a new one).

Assignee: nobody → dbaron

Attachment #342577 - Attachment is obsolete: true

Status: NEW → ASSIGNED

David Baron :dbaron: (⌚️UTC-5, no longer working on Mozilla)

Assignee

Comment 9

•

17 years ago

Comment on attachment 347623 [details] [diff] [review] patch Brendan, does this seem like a reasonable approach? Rather than requiring any app using the trace-malloc stack tracing functions to call NSTraceMallocStartupArgs, this instead tries to initialize the thread-local storage index on demand. To avoid having to use atomic operations every time, I used two variables: * tls_index_initialized, which changes once in the lifetime of the app * tls_index_initializing, on which we do an atomic increment if tls_index_initialized is false, to ensure that only one thread initializes the TLS index Does this approach seem like it should work? Is there any better alternative to just looping if the first check says we should initialize and the second doesn't?

Attachment #347623 - Flags: review?(brendan)

Brendan Eich [:brendan]

Comment 10

•

17 years ago

Alternatively, could you use PR_CallOnce? /be

David Baron :dbaron: (⌚️UTC-5, no longer working on Mozilla)

Assignee

Comment 11

•

17 years ago

I was hoping to avoid having to hit atomic operations in the normal case if I could do so safely.

Brendan Eich [:brendan]

Comment 12

•

17 years ago

Comment on attachment 347623 [details] [diff] [review] patch I'm on record as objecting to hand-coded user-mode spinloops in code that might run on a uniprocessor. What if pthread_key_create results in a preemption and another thread camped on that while (!tls_index_initialized); loop runs? It will have to run till end of timeslice in the worst case. I was wondering (mentioned to dbaron in person already) too why PR_CallOnce couldn't optimize this way too if it made sense (on SMP machines, or with better OS-specific primitives that avoid spinning too long). Asking wtc for his thoughts. /be

Attachment #347623 - Flags: review?(wtc)

David Baron :dbaron: (⌚️UTC-5, no longer working on Mozilla)

Assignee

Comment 13

•

17 years ago

I think a reasonable alternative here is probably to just make and document the assumption that the first call to malloc will be before there is more than one thread.

David Baron :dbaron: (⌚️UTC-5, no longer working on Mozilla)

Assignee

Comment 14

•

17 years ago

Actually, maybe the better fix is to use PR_CallOnce inside a boolean check, i.e.: if (!tls_index_initialized) { PR_CallOnce(...); tls_index_initialized = PR_TRUE; } which we could potentially enter more than once.

Wan-Teh Chang

Comment 15

•

17 years ago

Comment on attachment 347623 [details] [diff] [review] patch >+ if (!tls_index_initialized) { >+ if (PR_AtomicIncrement(&tls_index_initializing) == 1) { >+ TM_CREATE_TLS_INDEX(tls_index); >+ tls_index_initialized = PR_TRUE; >+ } else { >+ while (!tls_index_initialized) >+ ; >+ } > } Yes, this kind of hand-written synchronization code is difficult to get right, and the spin loop can be problematic because the thread scheduler is not aware of it. We have seen thread starvation caused by such a spin loop on older versions of Solaris. I remember TraceMalloc can't use PRLock. If so, you won't be able to use PR_CallOnce because it is implemented with PRLock and PRCondVar.

Wan-Teh Chang

Updated

•

17 years ago

Attachment #347623 - Flags: review?(wtc) → review-

Wan-Teh Chang

Comment 16

•

17 years ago

Comment on attachment 347623 [details] [diff] [review] patch >+ if (!tls_index_initialized) { >+ if (PR_AtomicIncrement(&tls_index_initializing) == 1) { >+ TM_CREATE_TLS_INDEX(tls_index); >+ tls_index_initialized = PR_TRUE; >+ } else { >+ while (!tls_index_initialized) >+ ; >+ } > } This is essentially how PR_CallOnce is implemented, without using a lock and condition variable. In your spin while loop, you should call a thread yield function. Try PR_Sleep(PR_INTERVAL_NO_WAIT), or use platform-specific functions (sched_yield() for Unix, and Sleep(0) for Windows).

Robert O'Callahan (:roc) (email my personal email if necessary)

Comment 17

•

17 years ago

Can we please please get this fixed? Comment #13 sounds completely reasonable to me. Right now mochitests in trace-malloc builds are just broken.

Brendan Eich [:brendan]

Comment 18

•

17 years ago

(In reply to comment #13) > I think a reasonable alternative here is probably to just make and document the > assumption that the first call to malloc will be before there is more than one > thread. Yes, this is reasonable as roc says. I'll r+ an alternative patch. /be

David Baron :dbaron: (⌚️UTC-5, no longer working on Mozilla)

Assignee

Comment 19

•

17 years ago

Attached patch patch — Details — Splinter Review

OK, here's the revised patch. I stuck some warning fixes in the middle.

Attachment #347623 - Attachment is obsolete: true

Attachment #366748 - Flags: review?(brendan)

Attachment #347623 - Flags: review?(brendan)

Brendan Eich [:brendan]

Updated

•

17 years ago

Attachment #366748 - Flags: review?(brendan) → review+

Brendan Eich [:brendan]

Comment 20

•

17 years ago

Comment on attachment 366748 [details] [diff] [review] patch >+ /* >+ * XXX Perhaps we should call backtrace() so we can check for >+ * immediate_abort. However, the only current contexts where >+ * immediate_abort will be true do not call free(), so for now, >+ * let's avoid the cost of backtrace(). >+ */ Rather than XXX deadwood, this comment should cite bug 478195. r=me with that. Thanks! /be

David Baron :dbaron: (⌚️UTC-5, no longer working on Mozilla)

Assignee

Comment 21

•

17 years ago

http://hg.mozilla.org/mozilla-central/rev/c94701d73e02 I'll land on 1.9.1 as well.

Status: ASSIGNED → RESOLVED

Closed: 17 years ago

Resolution: --- → FIXED

David Baron :dbaron: (⌚️UTC-5, no longer working on Mozilla)

Assignee

Comment 22

•

17 years ago

http://hg.mozilla.org/releases/mozilla-1.9.1/rev/d37bff0e22b5

David Baron :dbaron: (⌚️UTC-5, no longer working on Mozilla)

Assignee

Updated

•

17 years ago

Keywords: fixed1.9.1

Henrik Skupin [:whimboo][⌚️UTC+1]

Comment 23

•

17 years ago

Martijn, could you verify the fix for mc and/or 1.9.1? I don't have any builds available right now.

Target Milestone: --- → mozilla1.9.2a1

Martijn Wargers (dead)

Comment 24

•

17 years ago

Verified for trunk.

Status: RESOLVED → VERIFIED

You need to log in before you can comment on or make changes to this bug.