Last Comment Bug 571209 - Make chunksize in jemalloc a compile-time constant under MOZ_MEMORY
: Make chunksize in jemalloc a compile-time constant under MOZ_MEMORY
Status: RESOLVED FIXED
:
Product: Core
Classification: Components
Component: Memory Allocator (show other bugs)
: Trunk
: All All
: -- normal (vote)
: mozilla11
Assigned To: Igor Bukanov
:
Mentors:
Depends on: 560818
Blocks: 586962
  Show dependency treegraph
 
Reported: 2010-06-10 05:25 PDT by Igor Bukanov
Modified: 2011-11-30 03:53 PST (History)
10 users (show)
See Also:
Crash Signature:
(edit)
QA Whiteboard:
Iteration: ---
Points: ---
Has Regression Range: ---
Has STR: ---


Attachments
v1 (8.96 KB, patch)
2011-11-28 06:32 PST, Igor Bukanov
no flags Details | Diff | Splinter Review
v2 (10.25 KB, patch)
2011-11-28 12:56 PST, Igor Bukanov
justin.lebar+bug: review+
Details | Diff | Splinter Review
v3 (9.89 KB, patch)
2011-11-28 16:57 PST, Igor Bukanov
no flags Details | Diff | Splinter Review
v4 (9.77 KB, patch)
2011-11-28 17:32 PST, Igor Bukanov
no flags Details | Diff | Splinter Review
v5 (9.71 KB, patch)
2011-11-29 02:59 PST, Igor Bukanov
justin.lebar+bug: review+
Details | Diff | Splinter Review

Description Igor Bukanov 2010-06-10 05:25:34 PDT
jemalloc during initialization allows to change the default chunk size using a configuration file or an environment variable MALLOC_OPTIONS. This may trigger rather inefficient allocation of chunks for JS GC. With the bug 560818 fixed the code assumes that the size of GC chunk (a compile-time constant) matches chunksize so posix_memalign comming from jemalloc would do efficient chunk allocation.

Thus I suggest to make chunksize a compile-time constant when building under MOZ_MEMORY so JS GC may assume the optimal chunk size. This may also speedup things a bit as, for example, the free call in jemalloc starts with CHUNK_ADDR2BASE(ptr) or ((void *)((uintptr_t)(ptr) & ~chunksize_mask)). Replacing chunksize_mask with compile-time constant would avoid an extra memory load per each free call.
Comment 1 Stuart Parmenter 2010-06-10 10:42:24 PDT
I'd rather leave it runtime configurable so that we can play with different settings and find optimal ones more easily.
Comment 2 Mike Shaver (:shaver -- probably not reading bugmail closely) 2010-06-10 10:47:32 PDT
The point here is that there are other parts of the code that create relationships to the chunk size, and that you won't find optimal ones by simply changing one part of that chain.

What sort of experiments do you find that you're currently doing with the runtime configurability?  It might be that we can support those more directly, if changing the relations in the source and rebuilding is too much work.
Comment 3 Igor Bukanov 2011-11-28 06:32:08 PST
Created attachment 577242 [details] [diff] [review]
v1

The patch makes chunksize and other related parameters in jemalloc a compile-time constant. The effect of the patch is visible in the following program that allocated and frees arrays of various sizes:

int
main(int argc, char **argv, char **envp)
{
    for (size_t loop = 0; loop != 10; ++loop) {
        const size_t N = 1024 * 1024;
        void **array = static_cast<void **>(malloc(N * sizeof(void *)));
        for (size_t i = 0; i != N; ++i) {
            array[i] = malloc(16 * (N % 128));
        }
        for (size_t i = 0; i != N; ++i) {
            free(array[i]);
        }
        free(array);
    }
    return 0;
}

Without the patch the program compiled with GCC -O3 for 32 bit on a 3.0 GHz Linux box runs like (the stats for the fastest run among 10) :

       1043.840247 task-clock                #    1.000 CPUs utilized
                 1 context-switches          #    0.000 M/sec
                 0 CPU-migrations            #    0.000 M/sec
            15,720 page-faults               #    0.015 M/sec
     3,228,595,566 cycles                    #    3.093 GHz
       952,114,163 stalled-cycles-frontend   #   29.49% frontend cycles idle
       157,464,458 stalled-cycles-backend    #    4.88% backend  cycles idle
     5,756,110,782 instructions              #    1.78  insns per cycle
                                             #    0.17  stalled cycles per insn
     1,252,664,517 branches                  # 1200.054 M/sec
           376,755 branch-misses             #    0.03% of all branches

       1.044286911 seconds time elapsed


With the patch that changes into:

        992.405996 task-clock                #    1.000 CPUs utilized
                 1 context-switches          #    0.000 M/sec
                 1 CPU-migrations            #    0.000 M/sec
            10,611 page-faults               #    0.011 M/sec
     3,069,489,210 cycles                    #    3.093 GHz
       879,431,589 stalled-cycles-frontend   #   28.65% frontend cycles idle
       136,361,360 stalled-cycles-backend    #    4.44% backend  cycles idle
     5,515,353,611 instructions              #    1.80  insns per cycle
                                             #    0.16  stalled cycles per insn
     1,185,909,581 branches                  # 1194.984 M/sec
           372,580 branch-misses             #    0.03% of all branches

       0.992851027 seconds time elapsed

Which gives a 5% improvement speed-wise. This shows that an option for a compile-time constant configuration is useful from performance point of view alone.
Comment 4 Justin Lebar (not reading bugmail) 2011-11-28 06:48:37 PST
> Which gives a 5% improvement speed-wise. This shows that an option for a compile-time constant 
> configuration is useful from performance point of view alone.

This seems like a good idea to me.  We've done very little experimentation tuning jemalloc's parameters, and I think our time at this point would probably be better spent getting jemalloc 2 working, since we know that's a large speedup.  But if we did want to play with the parameters, we still could with this patch.

igor, I'm happy to review whenever you're ready.
Comment 5 Igor Bukanov 2011-11-28 12:56:23 PST
Created attachment 577334 [details] [diff] [review]
v2

The patch adds JEMALLOC_STATIC_SIZES that, when defined to 1, turns various size-related parameters into compile-time constants and disables their runtime configuration. If JEMALLOC_STATIC_SIZES is not defined but MOZ_MEMERY is set, JEMALLOC_STATIC_SIZES is defined to 1.

Under JEMALLOC_STATIC_SIZES I define some parameters not via preprocessor defines, but using less portable static const declarations that the C compiler is not required to treat as compile-time constants. This is done when the parameter name coincides with a name in some jemalloc struct. In this case the define would require first to rename the parameter greatly increasing the patch size. So instead I trust that any optimizing compiler would at least inline the const value.
Comment 6 Justin Lebar (not reading bugmail) 2011-11-28 14:23:30 PST
>+/*
>+ * JEMALLOC_STATIC_SIZES controls whether we want to be able to fully
>+ * configure memory allocator parameters at runtime (undefined or 0) or
>+ * for maximal performance we want to define most of the parameters as
>+ * compile-time constants (value 1).
>+ */
>+#ifndef JEMALLOC_STATIC_SIZES
>+#  ifdef MOZ_MEMORY
>+#    define JEMALLOC_STATIC_SIZES 1
>+#  else
>+#    define JEMALLOC_STATIC_SIZES 0
>+#  endif
>+#elif JEMALLOC_STATIC_SIZES != 0 && JEMALLOC_STATIC_SIZES != 1
>+#  error "When defined JEMALLOC_STATIC_SIZES must be 0 or 1"
>+#endif

Why not just harcode JEMALLOC_STATIC_SIZES to 1 and avoid these checks?

>+#if JEMALLOC_STATIC_SIZES
>+
> /* VM page size. */
>+#define pagesize_2pow			((size_t) 12)

Please comment that this must divide the hardware page size, otherwise we'll
abort on startup.

>+#define PAGESIZE_DEFAULT		((size_t) 1 << pagesize_2pow)
>+static const size_t	pagesize = 	PAGESIZE_DEFAULT;

Can we just use |pagesize| instead of PAGESIZE_DEFAULT?  I'd rather not
introduce a global define which is only available when JEMALLOC_STATIC_SIZES is
defined.

>+static const size_t	pagesize_mask = ((size_t) 1 << pagesize_2pow) - 1;
>+
>+/* Various quantum-related settings. */
>+
>+#define QUANTUM_DEFAULT			((size_t) 1 << QUANTUM_2POW_MIN)

Nit: One too many tabs.

>+static const size_t	quantum	=	QUANTUM_DEFAULT;

Similarly, can we just use |quantum| instead of QUANTUM_DEFAULT?

>+/* Number of (2^n)-spaced sub-page bins. */
>+static const unsigned	nsbins =	(unsigned)
>+					(pagesize_2pow -
>+					 SMALL_MAX_2POW_DEFAULT -
>+					 1);

Nit: "1);" doesn't need its own line.

>+
>+#else /* !JEMALLOC_STATIC_SIZES */
>+
> static size_t		pagesize;
> static size_t		pagesize_mask;
> static size_t		pagesize_2pow;
> 
>-/* Various bin-related settings. */

Why was this comment removed?  It's a little silly, I agree, but better to be
consistent with the other comments that are still here in this section.

>-static size_t		bin_maxclass; /* Max size class for bins. */
>-static unsigned		ntbins; /* Number of (2^n)-spaced tiny bins. */
>-static unsigned		nqbins; /* Number of quantum-spaced bins. */
>-static unsigned		nsbins; /* Number of (2^n)-spaced sub-page bins. */
>+static size_t		quantum;
>+static size_t		quantum_mask;
>+
> static size_t		small_min;
> static size_t		small_max;
> 
>-/* Various quantum-related settings. */
>-static size_t		quantum;
>-static size_t		quantum_mask; /* (quantum - 1). */
>+static size_t		bin_maxclass;
>+static unsigned		ntbins;
>+static unsigned		nqbins;
>+static unsigned		nsbins;

Not sure why you changed this, but could you please preserve the comments on
n{t,q,s}bins?

>+/*
>+ * Compute the header size such that it is large enough to contain the page map
>+ * and enough nodes for the worst case: one node per non-header page plus one
>+ * extra for situations where we briefly have one more node allocated than we
>+ * will need.
>+ */
>+#define calculate_arena_header_size()					\
>+	(sizeof(arena_chunk_t) + sizeof(arena_chunk_map_t) * (chunk_npages - 1))
>+
>+#define calculate_arena_header_pages()					\
>+	((calculate_arena_header_size() >> pagesize_2pow) +		\
>+	 (calculate_arena_header_size() & pagesize_mask) ? 1 : 0)
>+
>+/* Max size class for arenas. */
>+#define calculate_arena_maxclass()					\
>+	(chunksize - (arena_chunk_header_npages << pagesize_2pow))
>+
>+#if JEMALLOC_STATIC_SIZES
>+#define CHUNKSIZE_DEFAULT		((size_t) 1 << CHUNK_2POW_DEFAULT)
>+static const size_t	chunksize =	CHUNKSIZE_DEFAULT;

Just use |chunksize| instead of CHUNKSIZE_DEFAULT?

>+static const size_t	chunksize_mask =CHUNKSIZE_DEFAULT - 1;
>+static const size_t	chunk_npages =	CHUNKSIZE_DEFAULT >> pagesize_2pow;
>+#define arena_chunk_header_npages	calculate_arena_header_pages()
>+#define arena_maxclass			calculate_arena_maxclass()

Can we make these const size_t's?  I've got to imagine the compiler can fold
these expressions; do you have evidence to the contrary?

>+#else
> static size_t		chunksize;
>-static size_t		chunksize_mask; /* (chunksize - 1). */
>+static size_t		chunksize_mask;
> static size_t		chunk_npages;
> static size_t		arena_chunk_header_npages;
>-static size_t		arena_maxclass; /* Max size class for arenas. */
>+static size_t		arena_maxclass;
>+#endif
> 
> /********/
> /*
>  * Chunks.
>  */
> 
> #ifdef MALLOC_VALIDATE
> static malloc_rtree_t *chunk_rtree;
> #endif
>@@ -1212,21 +1294,27 @@ static bool	opt_abort = false;
> #ifdef MALLOC_FILL
> static bool	opt_junk = false;
> #endif
> #endif
> static size_t	opt_dirty_max = DIRTY_MAX_DEFAULT;
> #ifdef MALLOC_BALANCE
> static uint64_t	opt_balance_threshold = BALANCE_THRESHOLD_DEFAULT;
> #endif
> static bool	opt_print_stats = false;
>+#if JEMALLOC_STATIC_SIZES
>+#define opt_quantum_2pow	QUANTUM_2POW_MIN
>+#define opt_small_max_2pow	SMALL_MAX_2POW_DEFAULT
>+#define opt_chunk_2pow		CHUNK_2POW_DEFAULT
>+#else
> static size_t	opt_quantum_2pow = QUANTUM_2POW_MIN;
> static size_t	opt_small_max_2pow = SMALL_MAX_2POW_DEFAULT;
> static size_t	opt_chunk_2pow = CHUNK_2POW_DEFAULT;
>+#endif
> #ifdef MALLOC_PAGEFILE
> static bool	opt_pagefile = false;
> #endif
> #ifdef MALLOC_UTRACE
> static bool	opt_utrace = false;
> #endif
> #ifdef MALLOC_SYSV
> static bool	opt_sysv = false;
> #endif
>@@ -5552,44 +5640,47 @@ malloc_init_hard(void)
> #endif
> 
> 	/* Get page size and number of CPUs */
> #ifdef MOZ_MEMORY_WINDOWS
> 	{
> 		SYSTEM_INFO info;
> 
> 		GetSystemInfo(&info);
> 		result = info.dwPageSize;
>-
>-		pagesize = (unsigned) result;
>-
>+		
> #ifndef MOZ_MEMORY_NARENAS_DEFAULT_ONE
> 		ncpus = info.dwNumberOfProcessors;
> #endif
> 	}
> #else
> #ifndef MOZ_MEMORY_NARENAS_DEFAULT_ONE
> 	ncpus = malloc_ncpus();
> #endif
> 
> 	result = sysconf(_SC_PAGESIZE);
> 	assert(result != -1);
>-
>-	pagesize = (unsigned) result;
>-#endif
>-
>-	/*
>-	 * We assume that pagesize is a power of 2 when calculating
>-	 * pagesize_mask and pagesize_2pow.
>-	 */
>+#endif
>+
>+	/* We assume that the page size is a power of 2. */
> 	assert(((result - 1) & result) == 0);
>-	pagesize_mask = result - 1;
>+#if JEMALLOC_STATIC_SIZES
>+	if (pagesize % (size_t) result) {
>+		_malloc_message(_getprogname(),
>+				"Compile-time page size does not divide the runtime one.\n",
>+				"", "");
>+		abort();
>+	}

Thanks for this check.

>+#else	
>+	pagesize = (size_t) result;
>+	pagesize_mask = (size_t) result - 1;
> 	pagesize_2pow = ffs((int)result) - 1;
>-
>+#endif

>+#if !JEMALLOC_STATIC_SIZES
> 				case 'k':
> 					/*
> 					 * Chunks always require at least one
> 					 * header page, so chunks can never be
> 					 * smaller than two pages.
> 					 */
> 					if (opt_chunk_2pow > pagesize_2pow + 1)
> 						opt_chunk_2pow--;
> 					break;
> 				case 'K':
> 					if (opt_chunk_2pow + 1 <
> 					    (sizeof(size_t) << 3))
> 						opt_chunk_2pow++;
> 					break;
>+#endif

If you wanted to be awesome, you could print a warning when one of these
options is specified and STATIC_SIZES is defined.  But since basically nobody
uses this, it's not a big deal.
Comment 7 Igor Bukanov 2011-11-28 16:32:20 PST
(In reply to Justin Lebar [:jlebar] from comment #6)
> >+/*
> >+ * JEMALLOC_STATIC_SIZES controls whether we want to be able to fully
> >+ * configure memory allocator parameters at runtime (undefined or 0) or
> >+ * for maximal performance we want to define most of the parameters as
> >+ * compile-time constants (value 1).
> >+ */
> >+#ifndef JEMALLOC_STATIC_SIZES
> >+#  ifdef MOZ_MEMORY
> >+#    define JEMALLOC_STATIC_SIZES 1
> >+#  else
> >+#    define JEMALLOC_STATIC_SIZES 0
> >+#  endif
> >+#elif JEMALLOC_STATIC_SIZES != 0 && JEMALLOC_STATIC_SIZES != 1
> >+#  error "When defined JEMALLOC_STATIC_SIZES must be 0 or 1"
> >+#endif
> 
> Why not just harcode JEMALLOC_STATIC_SIZES to 1 and avoid these checks?

I will simply drop elif check - it is not necessary given that the rest of the patch uses #if JEMALLOC_STATIC_SIZES so the compiler will reject any-non-integer value and any integer non-zero integer is treated as true.

> >+#define PAGESIZE_DEFAULT		((size_t) 1 << pagesize_2pow)
> >+static const size_t	pagesize = 	PAGESIZE_DEFAULT;
> 
> Can we just use |pagesize| instead of PAGESIZE_DEFAULT?
...
> Similarly, can we just use |quantum| instead of QUANTUM_DEFAULT?

QUANTUM_DEFAULT is necessary as in C one cannot use a const when defining another const. That is, the following fragment is not a valid C code:

static const size_t	quantum	=	((size_t) 1 << QUANTUM_2POW_MIN);
static const size_t	quantum_mask =	quantum - 1;

So I added QUANTUM_DEFAULT. But pagesize is not used in that position, so PAGESIZE_DEFAULT can be indeed removed.

> >+static const size_t	chunksize =	CHUNKSIZE_DEFAULT;
> 
> Just use |chunksize| instead of CHUNKSIZE_DEFAULT?
> >+static const size_t	chunksize_mask =CHUNKSIZE_DEFAULT - 1;

Using chunksize in chunksize_mask does not work.

> 
> >+static const size_t	chunk_npages =	CHUNKSIZE_DEFAULT >> pagesize_2pow;
> >+#define arena_chunk_header_npages	calculate_arena_header_pages()
> >+#define arena_maxclass			calculate_arena_maxclass()
> 
> Can we make these const size_t's?  

This is not possible again for that const-not-a-real-const-in-C problem.
Comment 8 Justin Lebar (not reading bugmail) 2011-11-28 16:45:13 PST
> I will simply drop elif check

What's the benefit of having this at all, as compared to a blanket |#define JEMALLOC_STATIC_SIZES 1|?  We never compile jemalloc without MOZ_MALLOC, right?

> This is not possible again for that const-not-a-real-const-in-C problem.

I'm not versed in this C gotcha.  Would you mind elaborating?  The following code

  #include <stddef.h>
  static const size_t pagesize_2pow = (size_t) 12;
  static const size_t pagesize = (size_t) 1 << pagesize_2pow;
  static const size_t pagesize_mask = pagesize - 1;

  int foo()
  {
    return pagesize_mask;
  }

compiles to

0000000000000000 <foo()>:
   0:	b8 ff 0f 00 00       	mov    eax,0xfff
   5:	c3                   	ret    

Isn't that what we want?
Comment 9 Igor Bukanov 2011-11-28 16:56:09 PST
(In reply to Justin Lebar [:jlebar] from comment #8)
> > I will simply drop elif check
> 
> What's the benefit of having this at all, as compared to a blanket |#define
> JEMALLOC_STATIC_SIZES 1|?  We never compile jemalloc without MOZ_MALLOC,
> right?

This allows to compile with the runtime configuration enabled while having MOZ_MEMORY. This is how I tested the patch.

> I'm not versed in this C gotcha.  Would you mind elaborating?  The following
> code
> 
>   #include <stddef.h>
>   static const size_t pagesize_2pow = (size_t) 12;
>   static const size_t pagesize = (size_t) 1 << pagesize_2pow;
>   static const size_t pagesize_mask = pagesize - 1;
> 
>   int foo()
>   {
>     return pagesize_mask;
>   }
> 
> compiles to

With GCC 4.6 I got:

x.c:3:1: error: initializer element is not constant
x.c:4:1: error: initializer element is not constant

IIRC GCC is right here - this is not a valid C code.
Comment 10 Igor Bukanov 2011-11-28 16:57:09 PST
Created attachment 577429 [details] [diff] [review]
v3

patch with comments addressed
Comment 11 Justin Lebar (not reading bugmail) 2011-11-28 17:05:56 PST
> This allows to compile with the runtime configuration enabled while having MOZ_MEMORY. This is how I 
> tested the patch.

But in order to compile with runtime configuration enabled, you had to modify some source code, or otherwise send a #define into jemalloc.c, right?  My suggestion is just that, when you want to compile with runtime configuration enabled, you should modify the #define, just like you have to do with, for example, MALLOC_DECOMMIT (and, indeed, with every other #define'd option in jemalloc.c).

> With GCC 4.6 I got:
>
> x.c:3:1: error: initializer element is not constant
> x.c:4:1: error: initializer element is not constant

Huh.  I'm also on gcc 4.6, and I've tried -O0 and -O2, with and without -pedantic.  clang also doesn't complain.
Comment 12 Justin Lebar (not reading bugmail) 2011-11-28 17:08:22 PST
Ah, I need -std=gnu99 in order for gcc to flag it as an error.
Comment 13 Igor Bukanov 2011-11-28 17:09:27 PST
(In reply to Justin Lebar [:jlebar] from comment #11)
> > This allows to compile with the runtime configuration enabled while having MOZ_MEMORY. This is how I 
> > tested the patch.
> 
> But in order to compile with runtime configuration enabled, you had to
> modify some source code, or otherwise send a #define into jemalloc.c, right?

No - I just added -DJEMALLOC_STATIC_SIZES=0 to CFLAGS when compiling.

> > With GCC 4.6 I got:
> >
> > x.c:3:1: error: initializer element is not constant
> > x.c:4:1: error: initializer element is not constant
> 
> Huh.  I'm also on gcc 4.6, and I've tried -O0 and -O2, with and without
> -pedantic.  clang also doesn't complain.

Do you compile with GCC or G++ and the file extension is .c?
Comment 14 Igor Bukanov 2011-11-28 17:14:00 PST
(In reply to Justin Lebar [:jlebar] from comment #12)
> Ah, I need -std=gnu99 in order for gcc to flag it as an error.

Hm, For me just gcc x.c or gcc -std=<every_c_version_from_the_manual_on_fedora_15> x.c generates the error.
Comment 15 Justin Lebar (not reading bugmail) 2011-11-28 17:15:20 PST
> No - I just added -DJEMALLOC_STATIC_SIZES=0 to CFLAGS when compiling.

That was what I meant by "or otherwise send a #define into jemalloc.c."

Sorry to nit over this little thing, but I really don't think this one flag is worth privileging in this way over all the other flags which are hardcoded into jemalloc.
Comment 16 Justin Lebar (not reading bugmail) 2011-11-28 17:16:13 PST
> Hm, For me just gcc x.c or gcc -std=<every_c_version_from_the_manual_on_fedora_15> x.c generates the 
> error.

You're right; the problem was that the file extension was cpp.  clang really doesn't complain, fwiw.
Comment 17 Igor Bukanov 2011-11-28 17:32:45 PST
Created attachment 577443 [details] [diff] [review]
v4

The new version names the preprocessor define as MALLOC_STATIC_SIZES and its usage should follow the rest of defines.
Comment 18 Justin Lebar (not reading bugmail) 2011-11-28 18:23:06 PST
+#if defined MOZ_MEMORY && !defined MALLOC_STATIC_SIZES
+#  define MALLOC_STATIC_SIZES 1
+#endif

Can it just be |#define MALLOC_STATIC_SIZES 1| without the MOZ_MEMORY and !MALLOC_STATIC_SIZES guards?

We never compile jemalloc without MOZ_MEMORY (right?), so this is just a red herring.  And the !MALLOC_STATIC_SIZES guard is only useful if we want to disable this feature from configure.  Which we might want to do for some weird architecture or something, but YAGNI.
Comment 19 Igor Bukanov 2011-11-29 02:59:57 PST
Created attachment 577536 [details] [diff] [review]
v5

The new version just defines MALLOC_STATIC_SIZES as 1 and fixes a bad typo in  calculate_arena_header_pages() when I forgot () around b ? c : d in a + (b ? c : d) so it became (a + b) ? c : d. Surprisingly the tryserver has shown the problem only in very few cases that I initially ignored as the bug lead to an infinite loop that translates into a timeouts on he tryserver.
Comment 20 Justin Lebar (not reading bugmail) 2011-11-29 07:01:16 PST
Comment on attachment 577536 [details] [diff] [review]
v5

r=me.  Thanks, Igor.
Comment 22 Marco Bonardo [::mak] 2011-11-30 03:53:53 PST
https://hg.mozilla.org/mozilla-central/rev/67451553bcbb

Note You need to log in before you can comment on or make changes to this bug.