jemalloc can be deadlock with fork(2)

RESOLVED FIXED in mozilla2.0b10

Status

()

defect
RESOLVED FIXED
9 years ago
8 years ago

People

(Reporter: kazuki.ohta, Assigned: kazuki.ohta)

Tracking

({hang})

unspecified
mozilla2.0b10
x86
Linux
Points:
---

Firefox Tracking Flags

(Not tracked)

Details

Attachments

(1 attachment)

(Assignee)

Description

9 years ago
User-Agent:       Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; ja-JP-mac; rv:1.9.1.6) Gecko/20091201 Firefox/3.5.6
Build Identifier: 

Hi, I read the source code of jemalloc, and found the deadlock condition although it's very rare.

Current jemalloc code calls pthread_atfork() like this.

  pthread_atfork(_malloc_prefork, _malloc_postfork, _malloc_postfork);

And the actual prefork() & postfork() function is like this.

void _malloc_prefork(void)
{
 	unsigned i;
        /* Acquire all mutexes in a safe order. */
        malloc_spin_lock(&arenas_lock);
	for (i = 0; i < narenas; i++) {
                if (arenas[i] != NULL)
                        malloc_spin_lock(&arenas[i]->lock);
        }
        malloc_spin_unlock(&arenas_lock);
        malloc_mutex_lock(&base_mtx);
	malloc_mutex_lock(&huge_mtx);
}
void _malloc_postfork(void)
{
        unsigned i;
	/* Release all mutexes, now that fork() has completed. */
	malloc_mutex_unlock(&huge_mtx);
	malloc_mutex_unlock(&base_mtx);
	malloc_spin_lock(&arenas_lock);
        for (i = 0; i < narenas; i++) {
                if (arenas[i] != NULL)
                        malloc_spin_unlock(&arenas[i]->lock);
        }
        malloc_spin_unlock(&arenas_lock);
}

The problem is, "unlocking arenas_lock" in prefork().
When prefork function is called, the another thread can run.
Therefore, there's a following deadlock scenario.

(1) prefork() unlocks arenas_lock
(2) another thread acquires arenas_lock before finish executing the prefork()
(3) prefork() finishes
(4) the child process tries to acquire arenas_lock at postfork() and the deadlock occurs.

To solve this problem, the following patch is needed.

diff -r 3e80559aea9e memory/jemalloc/jemalloc.c
--- a/memory/jemalloc/jemalloc.c	Fri Jan 01 23:00:18 2010 +0900
+++ b/memory/jemalloc/jemalloc.c	Sat Jan 02 01:48:42 2010 +0900
@@ -6208,7 +6208,6 @@
 		if (arenas[i] != NULL)
 			malloc_spin_lock(&arenas[i]->lock);
 	}
-	malloc_spin_unlock(&arenas_lock);
 
 	malloc_mutex_lock(&base_mtx);
 
@@ -6226,7 +6225,6 @@
 
 	malloc_mutex_unlock(&base_mtx);
 
-	malloc_spin_lock(&arenas_lock);
 	for (i = 0; i < narenas; i++) {
 		if (arenas[i] != NULL)
 			malloc_spin_unlock(&arenas[i]->lock);


Reproducible: Didn't try
(Assignee)

Updated

9 years ago
Summary: jemmaloc can be deadlock with fork(2) → jemalloc can be deadlock with fork(2)
Thanks.

I have seen something similar with libflashplayer.so calling g_spawn_sync (which forks) from two different threads.

One thread was waiting on malloc_spin_lock(&arenas_lock) in _malloc_postfork,
while another was waiting on malloc_spin_lock(&arenas[i]->lock) in _malloc_prefork.
Status: UNCONFIRMED → NEW
Ever confirmed: true
(Assignee)

Comment 2

9 years ago
Thanks for your comment. The current jemalloc code could cause the deadlock described above. Hope my patch will be included in trunk.

Actually I'm now using jemalloc in an external project, and contains my fix. For 1 year, there's no problem for fork()ing.
_malloc_prefork and _malloc_postfork compete for arenas_lock/arenas[i]->lock

This looks good to me and matches upstream
http://canonware.com/cgi-bin/gitweb.cgi?p=jemalloc.git;a=blob;f=jemalloc/src/jemalloc.c;h=2aebc51dd19d791f66e35053fa291d51fd9cd4e1#l1708
Attachment #496047 - Flags: review+
Comment on attachment 496047 [details] [diff] [review]
patch from comment 0

Requesting approval2.0.
Although this is a rare hang and probably near impossible to reproduce, the code is definitely wrong, and the hang does happen and could be happening more frequently in 2.0 since the plugin container now uses jemalloc (bug 556198).
Attachment #496047 - Flags: approval2.0?

Comment 5

8 years ago
Did you check with jasone about this change?

Comment 6

8 years ago
The change looks correct to me.

Updated

8 years ago
Attachment #496047 - Flags: approval2.0? → approval2.0+
Passed tryserver.
Keywords: checkin-needed
Assignee: nobody → kazuki.ohta

Updated

8 years ago
Assignee: kazuki.ohta → karlt

Comment 8

8 years ago
Bug 437451 has landed the exact same patch, so I assume this is a dupe of it.
Status: NEW → RESOLVED
Last Resolved: 8 years ago
Keywords: checkin-needed
Resolution: --- → DUPLICATE
Duplicate of bug: 437451
Got the wrong bug? Bug 437451 doesn't seem related at all.
(This patch landed with the wrong bug number.)

http://hg.mozilla.org/mozilla-central/rev/3f8ae66d0015
Assignee: karlt → kazuki.ohta
Resolution: DUPLICATE → FIXED
Target Milestone: --- → mozilla2.0b10
(Assignee)

Comment 11

8 years ago
This bug is not fixed yet, right?
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
The attached patch was pushed, as comment 10 shows.  I see no reason why this wouldn't be fixed.
Status: REOPENED → RESOLVED
Last Resolved: 8 years ago8 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.