Last Comment Bug 435683 - libjemalloc deadlocks with LD_PRELOAD libraries
: libjemalloc deadlocks with LD_PRELOAD libraries
Status: RESOLVED FIXED
:
Product: Core
Classification: Components
Component: Memory Allocator (show other bugs)
: unspecified
: All Linux
: -- major with 3 votes (vote)
: mozilla2.0b5
Assigned To: Mike Hommey [:glandium]
:
Mentors:
: 452754 497117 501573 502727 503206 (view as bug list)
Depends on:
Blocks: 586962
  Show dependency treegraph
 
Reported: 2008-05-25 14:25 PDT by Josh Triplett
Modified: 2010-11-05 14:29 PDT (History)
22 users (show)
See Also:
Crash Signature:
(edit)
QA Whiteboard:
Iteration: ---
Points: ---
Has Regression Range: ---
Has STR: ---
-


Attachments
a possible patch (2.93 KB, patch)
2008-06-29 05:23 PDT, Mike Hommey [:glandium]
no flags Details | Diff | Splinter Review
a possible patch (2.94 KB, patch)
2008-06-29 05:25 PDT, Mike Hommey [:glandium]
no flags Details | Diff | Splinter Review
preloadtest.c (1.69 KB, text/plain)
2009-12-03 02:42 PST, Peter Åstrand
no flags Details
Use syscall for mmap and munmap (2.93 KB, patch)
2010-05-30 08:28 PDT, Mike Hommey [:glandium]
jasone: review+
Details | Diff | Splinter Review
Disable ncpus with MOZ_MEMORY_NARENAS_DEFAULT_ONE (1.19 KB, patch)
2010-05-31 09:20 PDT, Mike Hommey [:glandium]
no flags Details | Diff | Splinter Review
Use syscall() for mmap and munmap and disable ncpus in jemalloc to work around deadlocks with LD_PRELOADed libraries (4.41 KB, patch)
2010-08-13 08:23 PDT, Mike Hommey [:glandium]
jasone: review+
benjamin: approval2.0+
Details | Diff | Splinter Review

Description Josh Triplett 2008-05-25 14:25:54 PDT
Please see http://bugs.gentoo.org/show_bug.cgi?id=213080 and http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=475166 for background on this bug.

libjemalloc's initialization calls mmap to get memory.  aoss overrides mmap so it can handle mmap'ed OSS devices, and the aoss mmap calls calloc.  This causes a deadlock.

Solving this deadlock seems non-trivial.  The analysis in the Gentoo bug points out that using dlopen/dlsym to obtain the real mmap won't work, because those allocate memory.

As a workaround, perhaps libjemalloc could have a small pool of global memory with which to satisfy early requests, and then it could call mmap when that pool runs out?
Comment 1 Mike Hommey [:glandium] 2008-06-28 23:50:26 PDT
The analysis in the gentoo bug is half wrong. The problem is with open(), not mmap(). And the problem is that libjemalloc does open() /proc/cpuinfo to know how many processors there are, which can be dangerous when open is caught by something else than libc... others functions that are used during jemalloc initialization are getenv and readlink. Note it does avoid using open() to read /etc/malloc.conf, which is supposed to be a symbolic link pointing nowhere, where the linked path is the set of malloc options (which just sounds awful). I don't know if this has been done specifically to avoid open() or not, but it sounds like it.

The same code as for solaris for malloc_ncpus should just work. I'll test that.
Comment 2 Mike Hommey [:glandium] 2008-06-29 00:30:20 PDT
Erf, sysconf seems to do what malloc_ncpus did:
#0  0x00007fc6c4c7c384 in __lll_lock_wait () from /lib/libpthread.so.0
#1  0x00007fc6c4c77c5c in _L_lock_1054 () from /lib/libpthread.so.0
#2  0x00007fc6c4c77b30 in pthread_mutex_lock () from /lib/libpthread.so.0
#3  0x00007fc6c597987d in malloc_init_hard () at jemalloc.c:5201
#4  0x00007fc6c597a114 in malloc_init () at jemalloc.c:5180
#5  0x00007fc6c597a50e in malloc (size=140491697409504) at jemalloc.c:5732
#6  0x00007fc6c517278a in __fopen_internal () from /lib/libc.so.6
#7  0x00007fc6c51dcf12 in get_nprocs () from /lib/libc.so.6
#8  0x00007fc6c51ae12d in sysconf () from /lib/libc.so.6
Comment 3 Mike Hommey [:glandium] 2008-06-29 00:47:48 PDT
Note this happens not only with aoss, but with libtrash, too (for the same reasons)
Comment 4 Mike Hommey [:glandium] 2008-06-29 05:14:53 PDT
Also note it does end up blocking on mmap with aoss, but that is after you make the open non-blocking first.
Comment 5 Mike Hommey [:glandium] 2008-06-29 05:23:26 PDT
Created attachment 327317 [details] [diff] [review]
a possible patch

No idea who should be reviewing this. The idea is to have an arbitrarily sized static buffer to allocate memory from when jemalloc is still initializing itself.

realloc() is not taken care of here, but until some LD_PRELOADable library does use realloc in either open or mmap, it should be safe.

The gotos are put there to avoid using __builtin_expect and help with branch prediction.

The static buffer is 64k. On amd64, dlsym needs 32 bytes, and aoss initialization takes 8k. 64k should be too big already.
Comment 6 Mike Hommey [:glandium] 2008-06-29 05:25:43 PDT
Created attachment 327318 [details] [diff] [review]
a possible patch

This one actually builds...
Comment 7 Alexander Sack 2008-07-11 02:35:53 PDT
(In reply to comment #5)
> Created an attachment (id=327317) [details]
> a possible patch
> 
> No idea who should be reviewing this. The idea is to have an arbitrarily sized
> static buffer to allocate memory from when jemalloc is still initializing
> itself.
> 

try stuart or jasone

Comment 8 Jason Evans 2008-07-11 16:07:25 PDT
Comment on attachment 327318 [details] [diff] [review]
a possible patch

I don't think this patch should be used.  First off, if we were to use an approach of this nature, I think it would be better to combine malloc_initialized and malloc_initializing into a single multi-state variable, in order to avoid extra branches during allocation (in the common case, where initialization is completed).

Second, we don't actually use ncpus for anything right now, so a simpler solution would be to completely disable all ncpus-related code if MOZ_MEMORY_NARENAS_DEFAULT_ONE is defined.

As for the readlink() call for /etc/malloc.conf, we don't really need that, so if it's a problem, we can just remove it.
Comment 9 Jason Evans 2008-07-11 16:22:14 PDT
Does aoss really wrap mmap(2) with a function that calls malloc(3)?  If so, why not change aoss to use a static allocation buffer instead of modifying malloc?
Comment 10 Jason Evans 2008-07-11 16:26:18 PDT
(In reply to comment #1)
> Note it does avoid using open() to read
> /etc/malloc.conf, which is supposed to be a symbolic link pointing nowhere,
> where the linked path is the set of malloc options (which just sounds awful). I
> don't know if this has been done specifically to avoid open() or not, but it
> sounds like it.

This is a performance hack that jemalloc inherited from phkmalloc (the previous FreeBSD malloc implementation).  The hack provides a way to get configurable system-wide malloc configuration with only a single system call (readlink() vs. open()+read()+close()).  As noted in comment #8, we don't really need this hack in the browser.
Comment 11 Mike Hommey [:glandium] 2008-07-11 16:30:16 PDT
Note 2 things:
- ncpus does change the number of arenas,
- hardsetting ncpus to 1 solves the open() issues, but *not* the mmap() one with aoss, so the static memory pool approach is still necessary.

Since the static memory pool is still necessary, there is no need to disable ncpus.

Also note that there doesn't seem to be any place where having the two variables as a single multi-state variable would save extra branches during allocation.
Comment 12 Mike Hommey [:glandium] 2008-07-11 16:33:55 PDT
(In reply to comment #9)
> Does aoss really wrap mmap(2) with a function that calls malloc(3)?  If so, why
> not change aoss to use a static allocation buffer instead of modifying malloc?

1. Because aoss is not alone doing this kind of stuff
2. Because aoss allocates the memory it will return for the mmap(2) call with malloc.
Comment 13 Arun Prasannan (polar) 2008-11-12 18:49:08 PST
Is this related?
(Mozilla/5.0 (X11; U; Linux i686; rv:1.9.1b2pre) Gecko/20081111 SeaMonkey/2.0a2pre)

#0  0xb8044f74 in __lll_lock_wait () from /lib/libpthread.so.0
#1  0xb8040878 in _L_lock_95 () from /lib/libpthread.so.0
#2  0xb804026a in pthread_mutex_lock () from /lib/libpthread.so.0
#3  0x08049f89 in malloc_mutex_lock (mutex=0xfffffe00) at jemalloc.c:1396
#4  0x08054ae3 in malloc_init_hard () at jemalloc.c:5379
#5  0x08055745 in malloc_init () at jemalloc.c:5358
#6  0x08055e9a in calloc (num=1, size=20) at jemalloc.c:6094
#7  0xb7d5f31c in _dlerror_run () from /lib/libdl.so.2
#8  0xb7d5ed6c in dlsym () from /lib/libdl.so.2
#9  0xb80749f7 in getLibraryFunction () from /lib/libsafe.so.2
#10 0xb8074e57 in memcpy () from /lib/libsafe.so.2
#11 0x08054b93 in malloc_init_hard () at jemalloc.c:5453
#12 0x08055745 in malloc_init () at jemalloc.c:5358
#13 0x08055e9a in calloc (num=1, size=20) at jemalloc.c:6094
#14 0xb7d5f31c in _dlerror_run () from /lib/libdl.so.2
#15 0xb7d5ed6c in dlsym () from /lib/libdl.so.2
#16 0xb80749f7 in getLibraryFunction () from /lib/libsafe.so.2
#17 0xb8074e57 in memcpy () from /lib/libsafe.so.2
#18 0xb8046737 in __libc_sigaction () from /lib/libpthread.so.0
#19 0xb803d4f1 in __pthread_initialize_minimal_internal () from /lib/libpthread.so.0
#20 0xb803cd48 in _init () from /lib/libpthread.so.0
#21 0xb8088ca0 in call_init () from /lib/ld-linux.so.2
#22 0xb8088e27 in _dl_init_internal () from /lib/ld-linux.so.2
#23 0xb807a84f in _dl_start_user () from /lib/ld-linux.so.2


Works with LD_PRELOAD=/lib/libc.so.6
Comment 14 Arun Prasannan (polar) 2008-11-12 18:51:59 PST
Oops.

Sorry, I didn't mean to commit that. The answer to my question is no; removing libsafe helped.
Comment 15 Mike Hommey [:glandium] 2008-11-12 22:17:36 PST
(In reply to comment #14)
> Oops.
> 
> Sorry, I didn't mean to commit that. The answer to my question is no; removing
> libsafe helped.

The answer is actually yes, libsafe triggers the same deadlock in jemalloc that aoss does.
Comment 16 Arun Prasannan (polar) 2008-11-13 06:08:52 PST
In which case, bug 462686 is a duplicate?
Comment 17 Mike Hommey [:glandium] 2008-11-13 06:20:47 PST
*** Bug 462686 has been marked as a duplicate of this bug. ***
Comment 18 David Snyder 2008-12-08 14:11:43 PST
As the poster for bug #462686, I can report that firefox-3.1b2 WORKSFORME even when libsafe is loaded on the system (Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1b2) Gecko/20081201 Firefox/3.1b2).  Thanks to all for the work on this.
Comment 19 David Snyder 2009-03-19 18:21:55 PDT
Like the now marked duplicate bug #462686, the latest version beta version of firefox (firefox-3.1b3) once again fails to load when libsafe is used as an LD_PRELOAD condition.

Steps to reproduce:
1) Create the file, ld.so.preload, in /etc and put the line, '/lib/libsafe.so.2' in it.  Or put the environmental variable, LD_PRELOAD=/lib/libsafe.so.2, into the shell before executing the firefox-3.1b3 program from the command line.
2) Attempt to start FF 3.1b3 (e.g. LD_PRELOAD=/lib/libsafe.so.2 /usr/lib/firefox-3.1b3/firefox -P beta3.1)

Actual results:
FF3.1b3 doesn't show, except that top shows that the loading of
FF3.1b3 has partially loaded.

Expected Results:  
FF3.1b3 to show up.

Workaround:
If libsafe is present in the /etc file, ld.so.preload, FF3.1b3 can be started successfully by starting it with the environmental variable LD_PRELOAD to '/lib/libc.so.6' (e.g., LD_PRELOAD=/lib/libc.so.6 /usr/lib/firefox-3.1b3/firefox -P beta3.1).
Comment 20 David Snyder 2009-03-19 18:29:13 PDT
Forgot to provide the stack backtrace from gdb when FF3.1b3 locks up (i.e., FF3.1b3 is invoked without the LD_PRELOAD=/lib/libc.so.6 being used).  The backtrace is as follows:

(gdb) bt
#0  0xb7fa8f74 in __lll_lock_wait () from /lib/libpthread.so.0
#1  0xb7fa4878 in _L_lock_95 () from /lib/libpthread.so.0
#2  0xb7fa426a in pthread_mutex_lock () from /lib/libpthread.so.0
#3  0x0804f79b in ?? ()
#4  0x08056bc0 in ?? ()
#5  0x00000000 in ?? ()

Looks like the problem is, once again, at the __lll_lock_wait function (as was the case with bug #462686).
Comment 21 David Snyder 2009-03-19 19:34:40 PDT
Build Identifier: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1b4pre) Gecko/20090319 Shiretoko/3.5b4pre

Tested out the b4 pre-build, and it exhibits the same problem as the 3.1b3 builds reported above.  The system this was tested on (and all previous reports) was Slackware 12.2.
Comment 22 David Snyder 2009-04-30 16:39:14 PDT
FF3.5b4 still locks up at the same point as FF3.1b3 (see Comment #20)

Build Identifier: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1b4) Gecko/20090423 Firefox/3.5b4

Work Around: see Comment #19
Comment 23 David Snyder 2009-06-20 17:09:41 PDT
FF3.5rc{1,2} both lock up at the _same_ point as reported above.  Would really like to see FF3.5 working with libsafe because _everytime_ I've run FF3.5 without libsafe, my computer locks up after a few hours.  And this _only_ occurs when I run FF3.5 (and, previously, FF3.1) without libsafe.

Build Identifier: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1) Gecko/20090616 Firefox/3.5

Work Around: see Comment #19
Comment 24 Karl Tomlinson (:karlt) 2009-06-28 16:54:49 PDT
From http://bugs.debian.org/cgi-bin/bugreport.cgi?bug=475166:

#4  0x00002aaaabedf205 in calloc (num=1024, size=8) at jemalloc.c:5215
#5  0x00002aaaaacc853a in ?? () from /usr/lib/libaoss.so
#6  0x00002aaaaacc9205 in open () from /usr/lib/libaoss.so
#7  0x00002aaaabedcf30 in malloc_init_hard () at jemalloc.c:4931

|open| should be async-signal-safe (according to POSIX).
If there is an implementation of open that calls calloc, isn't that a bug?
Comment 25 Andrea Mazzoleni 2009-07-11 08:31:05 PDT
*** Bug 502727 has been marked as a duplicate of this bug. ***
Comment 26 Mike Hommey [:glandium] 2009-07-11 12:04:25 PDT
*** Bug 497117 has been marked as a duplicate of this bug. ***
Comment 27 Matthias Versen [:Matti] 2009-07-12 17:57:04 PDT
*** Bug 501573 has been marked as a duplicate of this bug. ***
Comment 28 Chris Chambers 2009-07-30 10:44:05 PDT
*** Bug 503206 is a duplicate of this ???
Comment 29 Chris Chambers 2009-07-31 14:19:31 PDT
*** Bug 503206 has been marked as a duplicate of this bug. ***
Comment 30 Chris Chambers 2009-08-12 09:15:18 PDT
Any news re this ? . And FYI the Arch forum newbie thread 'Firefox 3.5 will not start' has 2822 hits as of todays date. So affects many ! 

BW Chris.
Comment 31 Karl Tomlinson (:karlt) 2009-08-12 13:41:42 PDT
Has a bug been filed at http://www.alsa-project.org/ re open in aoss?
Comment 32 Peter Åstrand 2009-12-02 00:06:10 PST
>Has a bug been filed at http://www.alsa-project.org/ re open in aoss?

That doesn't make sense, since the bug is really in libjemalloc. Until this bug is solved, Firefox with libjemalloc will deadlock with pretty much *any* LD_PRELOAD library. In addition to aoss, libsafe, cyclades-serial-client, and libtrash, this also includes libpulsedsp.so which is used by the PulseAudio "padsp", as well as most commercial LD_PRELOAD libraries. This includes typical Sun Ray and ThinLinc environments. 

This bug is pretty critical, IMHO. RHEL isn't yet building with libjemalloc, but Fedora, Novell SLED as well as the binaries from firefox.com are affected.
Comment 33 Reed Loden [:reed] (use needinfo?) 2009-12-02 01:18:30 PST
*** Bug 452754 has been marked as a duplicate of this bug. ***
Comment 34 Reed Loden [:reed] (use needinfo?) 2009-12-02 01:54:07 PST
Mike, can we get a new patch for jemalloc based on review comments?
Comment 35 Mike Hommey [:glandium] 2009-12-02 02:03:53 PST
(In reply to comment #34)
> Mike, can we get a new patch for jemalloc based on review comments?

See comment #11
Comment 36 Reed Loden [:reed] (use needinfo?) 2009-12-02 02:12:15 PST
Comment on attachment 327318 [details] [diff] [review]
a possible patch

Very well. Jason, can you take another look at this patch based on Mike's response in comment #11 to your original review?
Comment 37 Jason Evans 2009-12-02 10:00:31 PST
I (In reply to comment #36)
> (From update of attachment 327318 [details] [diff] [review])
> Very well. Jason, can you take another look at this patch based on Mike's
> response in comment #11 to your original review?

Regarding extra branches, malloc_initialized *and* malloc_initializing are checked for every allocation, so indeed there are extra branches for every allocation introduced by this patch.  I came up with a solution to the problem last summer that I integrated into standalone jemalloc:

    http://www.canonware.com/cgi-bin/hg_jemalloc/rev/20ff57483b5f

Recursive allocation is handled without any extra branches in the fast path, and no static buffer is used.

Regarding aoss calling malloc(3) from inside an mmap(2) system call wrapper, let me state things more bluntly than in comment #9: aoss is messing with low level system call interfaces, and it is aoss's responsibility to assure that it does not directly introduce bootstrapping issues.  This is an aoss problem, not a jemalloc problem.
Comment 38 Reed Loden [:reed] (use needinfo?) 2009-12-02 10:38:04 PST
(In reply to comment #37)
> Regarding aoss calling malloc(3) from inside an mmap(2) system call wrapper,
> let me state things more bluntly than in comment #9: aoss is messing with low
> level system call interfaces, and it is aoss's responsibility to assure that it
> does not directly introduce bootstrapping issues.  This is an aoss problem, not
> a jemalloc problem.

It's been pointed out in this bug and in the various dupes that aoss isn't the only one that has problems with jemalloc and LD_PRELOAD. Do all the libraries affected have the same problem, or are there really two problems here -- one in jemalloc and one in aoss?
Comment 39 Mike Hommey [:glandium] 2009-12-02 11:36:03 PST
(In reply to comment #37)
> I came up with a solution to the problem
> last summer that I integrated into standalone jemalloc:
> 
>     http://www.canonware.com/cgi-bin/hg_jemalloc/rev/20ff57483b5f

This changeset changes too many unrelated things, would you mind providing a minimal patch ?
 
> Recursive allocation is handled without any extra branches in the fast path,
> and no static buffer is used.
> 
> Regarding aoss calling malloc(3) from inside an mmap(2) system call wrapper,
> let me state things more bluntly than in comment #9: aoss is messing with low
> level system call interfaces, and it is aoss's responsibility to assure that it
> does not directly introduce bootstrapping issues.  This is an aoss problem, not
> a jemalloc problem.

Reality is that there are LD_PRELOAD libraries that are doing it, and that you're not going to instantly fix them. Which means there is no other way than fixing the issue in Firefox if you want your users to stop complaining.
Comment 40 Peter Åstrand 2009-12-02 11:40:20 PST
I don't know the internal details of aoss, but I trust comment #1 which states that it is open() that is the problem - not mmap(). 

I would say that at this point, nothing indicates that we have multiple problems. The fundamental bug is that when using jemalloc, LD_PRELOAD libraries can't do much at all. This includes calling dlsym(). Without access to such functions, I don't see how preload libraries have any chance at all of doing their job properly. 

I think that the problem should be viewed from an API side: Many functions of the POSIX API needs to allocate memory. This includes, but is not limited to, dlsym(). What jemalloc effectively does is to require that open() never allocates anything. AFAIK, POSIX have no such requirements, so the jemalloc demands can't be justified.
Comment 41 Mike Hommey [:glandium] 2009-12-02 11:48:49 PST
(In reply to comment #40)
> I don't know the internal details of aoss, but I trust comment #1 which states
> that it is open() that is the problem - not mmap(). 

See comment #4
Comment 42 Jason Evans 2009-12-02 11:59:16 PST
(In reply to comment #38)
> It's been pointed out in this bug and in the various dupes that aoss isn't the
> only one that has problems with jemalloc and LD_PRELOAD. Do all the libraries
> affected have the same problem, or are there really two problems here -- one in
> jemalloc and one in aoss?

Yes, there are two separate problems:

1) The version of jemalloc in Firefox cannot handle recursive allocation caused by calls to system library interfaces such as sysctl(2).

2) aoss calls malloc(3) from within an mmap(2) wrapper, which is a major problem because it means a low level allocation function is recursing into a high level allocation function.

We can (and should) fix (1), but (2) isn't our fault, and it should be fixed at the source of the problem (aoss).

I can eventually create a patch for this bug, but I am terribly busy right now (1-week-old son, prepping home for sale, graduating, moving 1000 miles in 2 weeks, starting a new job, etc.), so if this is a burning issue, someone else will have to look through the standalone jemalloc source code and extract the relevant parts.
Comment 43 Peter Åstrand 2009-12-03 02:42:26 PST
Created attachment 415831 [details]
preloadtest.c

The problem is not specific to malloc(). I've attached a minimal LD_PRELOAD test library which demonstrates that the problem exists with open() as well. With Firefox and jemalloc, it hangs during open of /proc/cpuinfo. If this library needs to be "fixed", please tell me how.
Comment 44 Mike Hommey [:glandium] 2009-12-03 02:51:35 PST
(In reply to comment #43)
> Created an attachment (id=415831) [details]
> preloadtest.c
> 
> The problem is not specific to malloc(). I've attached a minimal LD_PRELOAD
> test library which demonstrates that the problem exists with open() as well.
> With Firefox and jemalloc, it hangs during open of /proc/cpuinfo. If this
> library needs to be "fixed", please tell me how.

Nothing new here, this is what happens with aoss and others: jemalloc initialization calls open, which ends up being the one in the LD_PRELOADed library, which calls dlsym, which calls malloc, which ends up in jemalloc, which locks, since it is initializing.
Comment 45 Karl Tomlinson (:karlt) 2009-12-03 02:53:38 PST
Comment on attachment 415831 [details]
preloadtest.c

>int
>open(const char *pathname, int flags, ...)
>{
>    static int (*func) (const char *, int, mode_t) = NULL;
>    va_list args;
>    mode_t mode;
>
>    fprintf(stderr, "preloadtest: open: %s\n", pathname);
>
>    if (!func) {
>	func = (int (*)(const char *, int, mode_t)) dlsym(REAL_LIBC, "open");

|open| must be async-signal-safe.  See man 7 signal.
If it is not async-signal-safe, then other problems will also result.
dlsym is not async-signal-safe, so cannot be called from open.

I'm only guessing, but I wonder whether calling dlsym from an __attribute__((constructor)) function and keeping the result may be an option.
Comment 46 Peter Åstrand 2009-12-03 03:52:53 PST
(In reply to comment #44)

> Nothing new here, this is what happens with aoss and others: jemalloc
> initialization calls open, which ends up being the one in the LD_PRELOADed
> library, which calls dlsym, which calls malloc, which ends up in jemalloc,
> which locks, since it is initializing.

I know, but since the discussion about mmap() comes up now and then, I wanted to demonstrate by example that it happens with open() as well.
Comment 47 Peter Åstrand 2009-12-03 03:55:19 PST
(In reply to comment #45)
> >	func = (int (*)(const char *, int, mode_t)) dlsym(REAL_LIBC, "open");
> 
> |open| must be async-signal-safe.  See man 7 signal.
> If it is not async-signal-safe, then other problems will also result.
> dlsym is not async-signal-safe, so cannot be called from open.

Thanks for pointing this out, but does it have anything todo with this specific problem? Ie, does async-signal-safe also mean that you cannot alloc?
Comment 48 Mike Hommey [:glandium] 2009-12-24 01:31:52 PST
(In reply to comment #42)
> 2) aoss calls malloc(3) from within an mmap(2) wrapper, which is a major
> problem because it means a low level allocation function is recursing into a
> high level allocation function.
> 
> We can (and should) fix (1), but (2) isn't our fault, and it should be fixed at
> the source of the problem (aoss).

I gave a bit more thought to the above and I would add this:
- aoss is not guaranteed to be alone in this case, better be safe and fix (2) than sorry.
- glibc deals fine with LD_PRELOADED mmap calling malloc, and most people are only testing against glibc.
Comment 49 Mike Hommey [:glandium] 2009-12-24 05:45:51 PST
There is another deadlock with padsp (pulseaudio wrapper), which is more problematic: padsp calls dlsym to get the original symbols for the functions it diverts, which is the typical use. Except that it does so on demand. What then happens is that something in nsXULStub calls access() before jemalloc is initialized. access() is caught by padsp, which locks a mutex, and resolves the original symbol with dlsym(). dlsym() ends up allocating memory, which triggers jemalloc initialization code, that opens /proc/cpuinfo. And open gets into padsp again, which locks the same mutex as above before resolving the open symbol with dlsym().

Only posting for documentation, I don't think this should be addressed at jemalloc level.
Comment 50 Mike Hommey [:glandium] 2010-02-04 22:44:11 PST
Almost the same problem as comment #49 with artsdsp, except it doesn't deadlock, but crashes. Same succession of events, and it ends up crashing because, as being uninitialized, it tries to call the original open function, which at the time, it null.
Comment 51 kevin.till 2010-03-31 09:53:08 PDT
(In reply to comment #45)
> 
> I'm only guessing, but I wonder whether calling dlsym from an
> __attribute__((constructor)) function and keeping the result may be an option.

Tried that, however, the constructor function is not called in the case of jemalloc::open() call. Is it because jemalloc is a shared library? GCC document is not very clear about it. 

BTW, can we fix the jemalloc soon?  All user using preload library that overrides open() will see hang!   Thanks!
Comment 52 Mike Hommey [:glandium] 2010-05-30 06:24:06 PDT
(In reply to comment #37)
>     http://www.canonware.com/cgi-bin/hg_jemalloc/rev/20ff57483b5f

This now redirects to a gitweb summary, which doesn't help to know what the diff was originally. Could you either provide the patch or a working url ?
Comment 53 Mike Hommey [:glandium] 2010-05-30 08:28:52 PDT
Created attachment 448274 [details] [diff] [review]
Use syscall for mmap and munmap

A much better approach, which should also work around the padsp and artsdsp problems. It combines the suggestion from comment 8 to disable all ncpus code with MOZ_MEMORY_NARENAS_DEFAULT_ONE, and replaces the calls to mmap() an munmap() from the libc, which, as we saw with e.g. aoss may be overridden with a function calling dlsym, which in turn is calling malloc, with direct calls to the system calls.

There may be a need for some additional #ifdefs though, for platforms where the syscall(SYS_m(un)map, ...) form is unsupported.
Comment 54 Mike Hommey [:glandium] 2010-05-31 06:30:56 PDT
Comment on attachment 448274 [details] [diff] [review]
Use syscall for mmap and munmap

This actually doesn't work on various platforms because of how the mmap syscall works. For example, on x86, the patch as is leads to crash, and the call needs to be modified, or SYS_mmap2 to be used. On scratchbox, only the latter is actually available, apparently (at least in the headers). On s390, the syscall only takes one argument, that is a pointer to a struct containing all the parameters. On alpha, I haven't figured yet how to get something working. And on hurd, there is no mmap syscall.
There are also architectures where some checks need to be done on the returned value.

All in all, it would probably be better to use SYS_mmap2 when available and to only use the syscall for a given list of architecture.
Comment 55 Mike Hommey [:glandium] 2010-05-31 07:02:08 PDT
FTR, on alpha, it's a bug in glibc's syscall() function.
Comment 56 Mike Hommey [:glandium] 2010-05-31 09:20:31 PDT
Created attachment 448385 [details] [diff] [review]
Disable ncpus with MOZ_MEMORY_NARENAS_DEFAULT_ONE

Would it be okay to land that part already ? I have something that should be working for the other part, but I need to test it on all our (Debian) architectures first.
Comment 57 Mike Hommey [:glandium] 2010-08-13 08:23:16 PDT
Created attachment 465667 [details] [diff] [review]
Use syscall() for mmap and munmap and disable ncpus in jemalloc to work around deadlocks with LD_PRELOADed libraries

This is the patch I've been using for a while on Debian and that works on all Debian supported architectures.
Comment 58 Mike Hommey [:glandium] 2010-08-14 00:44:08 PDT
Comment on attachment 465667 [details] [diff] [review]
Use syscall() for mmap and munmap and disable ncpus in jemalloc to work around deadlocks with LD_PRELOADed libraries

This is very important for Linux systems, where the use of /dev/dsp wrappers is widespread, and currently leads to deadlocks.
Comment 59 Benjamin Smedberg AWAY UNTIL 2-AUG-2016 [:bsmedberg] 2010-08-18 08:57:52 PDT
Comment on attachment 465667 [details] [diff] [review]
Use syscall() for mmap and munmap and disable ncpus in jemalloc to work around deadlocks with LD_PRELOADed libraries

a=me for checkin within the next week
Comment 60 Mike Hommey [:glandium] 2010-08-18 16:08:28 PDT
http://hg.mozilla.org/mozilla-central/rev/2c3644af7354

Note You need to log in before you can comment on or make changes to this bug.