Bug 1291355 (stylo-jemalloc)

stylo: Measure the impact of multiple jemalloc arenas and evaluate the performance tradeoff for enabling them

RESOLVED WORKSFORME

Status

Core
Memory Allocator
P1
normal
RESOLVED WORKSFORME
a year ago
2 months ago

People

(Reporter: bholley, Assigned: erahm)

Tracking

Firefox Tracking Flags

(Not tracked)

Details

Attachments

(2 attachments)

We currently enable multiple jemalloc arenas for stylo to eliminate the locking overhead for allocations.

A recent email thread on this topic is as follows:

froydnj:

> I'd need more context on why it was added in the first place.  One of
> the things I'd be curious about would be why the block comment
> immediately above that define is no longer valid--were Stylo folks
> finding that the allocator arena lock was a bottleneck with Stylo
> doing parallel layout-y things?  You said on IRC that Cameron added
> it; maybe he has more context here?

heycam:

> Yes, there was a marked drop off in performance as the number of threads
> was increased (it peaked at 2 or 3 threads IIRC) without turning on this
> jemalloc option.

froydnj:

> Thanks.  More questions. :)
>
> What are the threads here?  Number of threads for Stylo to do
> off-main-thread layout?  How bad is the performance dropoff?  How many
> threads do we want to run with here?  Do we have any idea how much the
> memory fragmentation story increases (e.g. what is the usual size of
> the jemalloc arenas)?  I'm all about the performance win of Stylo, but
> if it comes at serious memory fragmentation costs, it's probably not
> going to be a win for our 32-bit users.

heycam:

> I didn't write down the numbers, so working from memory from a couple of
> months ago, but I think performance peaked at 2 or 3 threads, then went
> down again to be similar to single threaded performance at 6 threads or
> something.
>
> I don't really know what the plan is for deciding how many threads to
> run with.  For the moment we use round(num_cpus * 3 / 4), but I imagine
> we'll need to test with real workloads to see how the choice of
> threadcount interacts with all the other off main thread stuff Gecko is
> doing.
>
> I have no idea about [fragmentation].  What would be a way I can measure the
> jemalloc arena sizes in a non-stylo Firefox with and without this
> option?
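The thread-count heuristic heycam quotes, round(num_cpus * 3 / 4), can be sketched as plain arithmetic (this is not the actual stylo implementation, just the formula):

```rust
// Sketch of the quoted heuristic: round(num_cpus * 3 / 4), clamped to at
// least one thread. Invented function name; not the real stylo code.
fn style_thread_count(num_cpus: usize) -> usize {
    // (n * 3 + 2) / 4 rounds n * 3 / 4 to the nearest integer (half up).
    std::cmp::max(1, (num_cpus * 3 + 2) / 4)
}

fn main() {
    for cpus in [1, 2, 4, 6, 8] {
        println!("{} CPUs -> {} style threads", cpus, style_thread_count(cpus));
    }
}
```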
Depends on: 1291356
This should have been filed somewhere else, IMHO.
Component: CSS Parsing and Computation → Memory Allocator
> > I have no idea about [fragmentation].  What would be a way I can measure the
> > jemalloc arena sizes in a non-stylo Firefox with and without this
> > option?

A good (though imperfect) proxy for this would be |explicit/heap-overhead/bin-unused| in the about:memory report. The total for |explicit/heap-overhead| is probably interesting too, as that also includes |bookkeeping| (jemalloc's internal memory usage).
Summary: Measure the impact of multiple jemalloc arenas and evaluate the performance tradeoff for enabling them → stylo: Measure the impact of multiple jemalloc arenas and evaluate the performance tradeoff for enabling them

Updated

11 months ago
Priority: -- → P3
Alias: stylo-jemalloc
Blocks: 1330412
No longer blocks: 1243581
So, I just profiled this a bit, on mac and linux, with varying numbers of threads. Some findings:
(1) I get pretty equivalent numbers with and without MOZ_MEMORY_NARENAS_DEFAULT_ONE (i.e. making the #define at [1] unconditional). That is to say, I cannot reproduce Cameron's original findings. It may be because of bug 1335941, whereby the atomic contention is changing the interactions between the threads and drowning out the signal somehow. Or maybe something in our more-modern stylo architecture makes this less of an issue (which would be great!). We'll need to optimize and implement more stuff to find out.
(2) The parallel traversal seems about 15% slower with jemalloc4, which is not all that inconsistent with the results in bug 1219914 comment 44. These are early numbers, but they sure don't look good for jemalloc4.

[1] http://searchfox.org/mozilla-central/rev/d20e4431d0cf40311afa797868bc5c58c54790a2/memory/mozjemalloc/jemalloc.c#152
Ah, forgot a detail: jemalloc 4 as we configure it doesn't have multiple arenas enabled by default. You can change that in memory/build/jemalloc_config.cpp by removing "narenas:1" from MOZ_MALLOC_OPTIONS.
(In reply to Mike Hommey [:glandium] from comment #4)
> Ah, forgot at detail: jemalloc 4 as we configure it doesn't have multiple
> arenas enabled by default. You can change that in
> memory/build/jemalloc_config.cpp by removing "narenas:1" from
> MOZ_MALLOC_OPTIONS.

Ok, I'll measure that when I get back to remeasuring this stuff, ideally after bug 1335941 lands.
Priority: P3 → P2
Blocks: 1243581
smaug just did some profiling and said that multiple jemalloc arenas made Speedometer's innerHTML microbenchmark 10% slower, which is significant. That might be optimizable somehow, but it doesn't bode well for multiple arenas so far...
Eric and Julian are looking into this stuff. Getting to the bottom of this is very high priority, because if we need to do any custom arena stuff, we need to start designing it now.
Assignee: nobody → erahm
Priority: P2 → P1
jemalloc or mozjemalloc?

Comment 9

3 months ago
Note, it was just some particular sub-sub-test of Speedometer (bug 1347525), on linux. I don't know how a different malloc impl would affect the overall Speedometer score.
But what is clear is that anything running in Talos isn't, in general, enough for measurements.

(jemalloc4 seems to be also slower than the current mozjemalloc (without stylo enabled stuff))
Also, comment 6 goes the opposite way of what used to be said, which was that multiple arenas made things faster... From comment 6 alone, I'd say what you want is per-page or per-something arenas, not per-thread arenas (which is essentially what mozjemalloc multiple arenas are). While not exposed to Gecko, jemalloc (not mozjemalloc) has an API to get an arena and do allocations in that arena (which is, BTW, one of the good reasons to want to switch to jemalloc).

Comment 11

3 months ago
One thing to remember, need to test this all on different platforms. Memory allocations seem to be in general faster on linux than on MacOS and Windows, so perhaps jemalloc and mozjemalloc behave quite differently on different OSes.
(In reply to Mike Hommey [:glandium] (VAC: Apr 20-May 4) from comment #10)
> Also, comment 6 goes the opposite way of what used to be said, which was
> that multiple arenas made things faster...

It made things faster for parallel style system performance. smaug is measuring something unrelated (single-thread innerHTML performance).

> From comment 6 alone, I'd say
> what you want is per-page or per-something arenas, not per-thread arenas
> (which is essentially what mozjemalloc multiple arenas do).

The point is to avoid locking contention during the parallel traversal. per-page doesn't help with that, but a special style-system arena that was keyed by thread would.
I still seem to be getting unreliable stacks when profiling, so I tested with DUMP_STYLE_STATISTICS=1 instead.

I did a test run of loading the myspace album art test 5 times (fresh start each time) and dumping style statistics with multiple arenas enabled in mozjemalloc. I then split out the traversal_time_ms metric (I believe that's what we care about). I repeated this without multiple arenas. The results seem to indicate one arena is faster.

> Multiple Arenas
> 	Run 1		Run 2		Run 3		Run 4		Run 5		Average	
> 	1.202983956	0.8736519958	0.6747109583	1.075475011	1.014448004	0.968253985	
> 	0.8492839988	1.255198964	1.157934021	1.200310013	1.199208025	1.132387004	
> 	1.161297027	1.219962025	1.278500014	1.354901004	1.227325993	1.248397212	
> 	34.46693602	27.65948197	34.41471403	28.23663497	27.24114002	30.4037814	
> 	1.535873977	1.517786004	1.559063967	0.9672119631	1.055503963	1.327087975	
> 	1825.349511	1719.367721	1867.61279	1653.857773	1923.593605	1797.95628	
> 							
> One Arena						
> 	Run 1		Run 2		Run 3		Run 4		Run 5		Average		Delta
> 	0.8032089681	1.298232004	0.9981150506	1.084592019	1.032647968	1.043359202	0.07510521682
> 	0.8614579565	0.7898909971	1.031210006	1.362927025	1.133467013	1.035790599	-0.09659640491
> 	0.7925549871	0.8089640178	0.7779609878	1.176575024	0.7930260035	0.8698162041	-0.3785810084
> 	33.69354497	28.15638104	24.87073396	28.59738801	34.56510697	29.97663099	-0.427150412
> 	1.547398977	1.599415031	1.90692296	1.554740011	1.513577008	1.624410797	0.2973228227
> 	1700.755901	1722.02633	1793.781668	1715.529505	1805.659356	1747.550552	-50.40572801

Example command line:
>  DUMP_STYLE_STATISTICS=1 ./mach run file:///home/erahm/Downloads/myspace/myspace.com/www.myspace.com/albumart.html > stylo_multiple_arenas_perf_1.txt
>  DUMP_STYLE_STATISTICS=1 ./mach run file:///home/erahm/Downloads/myspace/myspace.com/www.myspace.com/albumart.html > stylo_multiple_arenas_perf_2.txt
>  DUMP_STYLE_STATISTICS=1 ./mach run file:///home/erahm/Downloads/myspace/myspace.com/www.myspace.com/albumart.html > stylo_multiple_arenas_perf_3.txt
>  DUMP_STYLE_STATISTICS=1 ./mach run file:///home/erahm/Downloads/myspace/myspace.com/www.myspace.com/albumart.html > stylo_multiple_arenas_perf_4.txt
>  DUMP_STYLE_STATISTICS=1 ./mach run file:///home/erahm/Downloads/myspace/myspace.com/www.myspace.com/albumart.html > stylo_multiple_arenas_perf_5.txt
> cat stylo_multiple_arenas_perf_*.txt | grep traversal_time_ms | cut -d ',' -f 3 | pr -5 -t
Sorry I wasn't clear, this is on a desktop w/
- Ubuntu 16.04 LTS
- clang 4.0.
- CPU: Intel(R) Core(TM) i7-4770 CPU @ 3.40GHz
- 32GB RAM

config:
> ac_add_options --enable-stylo
> ac_add_options --disable-stylo-build-bindgen
For the Obama test we see better perf with multiple arenas.

> Multiple Arenas							
> 	Run 1		Run 2		Run 3		Run 4		Run 5		Average	
> 	0.6914929836	1.024849014	1.304770994	1.078581961	0.7182080299	0.9635805967	
> 	0.8394150063	1.143292	0.8399279905	1.260520949	0.8171049994	0.980052189	
> 	0.7503610104	0.9098720038	0.9900479927	1.15671102	1.115054998	0.9844094049	
> 	31.57984099	33.03947899	32.99107699	32.70683304	35.11506302	33.08645861	
> 							
> One Arena							
> 	Run 1		Run 2		Run 3		Run 4		Run 5		Average		Delta
> 	0.7722550072	1.428720017	1.032161003	0.7012890419	1.329610008	1.052807015	0.08922641864
> 	1.053375017	0.8312190184	1.222328981	1.186915964	1.234023017	1.1055724	0.1255202107
> 	0.8040550165	0.8170910296	1.552518981	0.7702309522	1.132268982	1.015232992	0.03082358744
> 	103.652606	104.5992	100.010437	101.929293	99.28652795	101.8956128	68.8091542
Thanks for the measurements! It's really great to be making progress here.

For our purposes, I think we only care about the "big" traversal, which generally seems to be the last one in the list. For posterity, can you note how many worker threads you were using? Assuming you weren't passing anything for STYLO_THREADS, I'm guessing the answer would be 6 based on [1] with 8 logical CPUs (4 cores x 2 hyperthreads), but it's good to be sure.

So in general your results validate heycam's findings. The myspace testcase (assuming you're using the non-"orig" one) has a small DOM and an enormous stylesheet, so the time will be dominated by selector matching, which doesn't allocate much. So we're looking at something like 1750 vs 1800 ms, which to me means that multiple arenas don't help with contention here, and they happen to be a bit slower in general, as we've measured elsewhere.

Conversely, the wikipedia testcase has a large DOM, and so a significant amount of time is spent allocating style structs for each element. On that testcase we see multiple arenas taking us from 100ms to 33ms, basically a 3x improvement, which is huge. That's on the order of the entire win that we'd expect from the parallelism, so stylo isn't much of a win without it.

The next step here is to dig into what exactly in jemalloc is slow in the single-arena case. Is there anything obvious that can be fixed to mostly-eliminate the contention? If not, we have two options:
(1) Try to get multiple jemalloc arenas enabled, which potentially has significant performance implications across the platform.
(2) Allocate and manage style structs using manual, per-thread + per-type arenas in Rust.

My guess is that (2) is going to be a lot easier (and has the advantages discussed in bug 1347852), though it has a few downsides:
* It doesn't completely eliminate the overhead, since we do some amount of malloc-ing for other stuff.
* It's more complex, and we don't get to re-use Arc.
* There's some degree of memory waste, unless we implement something less-arena-like and more-malloc-like.

[1] http://searchfox.org/mozilla-central/rev/24eec01c4e43c0f029b0148900b5f8ed508f7e21/servo/components/style/gecko/global_style_data.rs#36
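A minimal sketch of what option (2) could look like: a per-type bump arena that each worker thread would own privately, so allocation touches no lock and the whole arena is freed at once after the traversal. All names here are invented; a real design would also need interior mutability (to hand out shared references to style structs) and thread-local instances.

```rust
// Hypothetical per-thread, per-type arena sketch (not the actual stylo design).
// Allocation is a lock-free bump into the current chunk; chunks are never
// reallocated, so handed-out references stay valid for the arena's lifetime.
struct TypedArena<T> {
    chunks: Vec<Vec<T>>, // filled chunks plus the current one
    chunk_cap: usize,
}

impl<T> TypedArena<T> {
    fn new(chunk_cap: usize) -> Self {
        TypedArena { chunks: vec![Vec::with_capacity(chunk_cap)], chunk_cap }
    }

    fn alloc(&mut self, value: T) -> &mut T {
        if self.chunks.last().unwrap().len() == self.chunk_cap {
            // Current chunk is full: start a new one. Old chunks keep their
            // buffers in place, so earlier allocations are never moved.
            self.chunks.push(Vec::with_capacity(self.chunk_cap));
        }
        let chunk = self.chunks.last_mut().unwrap();
        chunk.push(value);
        chunk.last_mut().unwrap()
    }

    fn len(&self) -> usize {
        self.chunks.iter().map(|c| c.len()).sum()
    }
}

fn main() {
    let mut arena = TypedArena::new(4);
    for i in 0..10u32 {
        let v = arena.alloc(i * i);
        assert_eq!(*v, i * i);
    }
    println!("allocated {} values in the arena", arena.len());
}
```

Dropping the `TypedArena` frees everything in one shot, which is the "more arena-like, more memory waste" tradeoff mentioned above.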
Depends on: 1356988
Blocks: 1356988
No longer depends on: 1356988
Blocks: 1356991
No longer blocks: 1356988
(In reply to Eric Rahm [:erahm] from comment #15)
> For the Obama test we see better perf with multiple arenas.

I'm assuming that "Multiple arenas" means

  #ifndef MOZ_STYLO
  #define MOZ_MEMORY_NARENAS_DEFAULT_ONE
  #endif

and "Single arena" means

  //#ifndef MOZ_STYLO
  #define MOZ_MEMORY_NARENAS_DEFAULT_ONE
  //#endif

I can reproduce Eric's results for the Obama test.  Numbers are for
traversal_time_ms.  I ran each test twice, for STYLO_THREADS={1,2,3,4,6,8}.
There's a lot of noise but the trend is clear.

Multiple arenas:

  1 thr 181.9  187.9
  2 thr  97.7  111.6
  3 thr  74.9   75.5
  4 thr  54.2   65.9
  6 thr  47.1   48.6
  8 thr  57.8   46.6

Single arena:

  1 thr 177.3  181.8
  2 thr 109.4  109.7
  3 thr  99.9  106.3
  4 thr 110.1  111.2
  6 thr 145.0  135.0
  8 thr 170.1  173.8

The multiple arena case continues to improve out to at least 8 threads.  The
single arena case does benefit from parallelism but gets worse beyond 3 threads.
This is on a low-end Xeon with 4 cores and 8 threads, Fedora 25, Clang 3.9.
I spent much of the day chasing this, but mostly got confused.  Here's a
summary.  Maybe someone can point at a better direction.

* System: Fedora 25, x86_64, clang 3.9, rustc 1.16.0, low end Xeon, 2.5 GHz

* Using the Obama test, since that shows marked slowdowns with a single arena.

* From comment #15 and comment #17 there is clearly something ungood going on.

My goal was to compare single-vs-multiple arena behaviour whilst profiling with
Valgrind's Callgrind tool.  This collects counts of instructions, memory
accesses, cache misses, branch mispredicts and global bus operations, and gives
us a good chance of spotting bad interactions with the microarchitecture.

Because Callgrind is slow and to avoid getting swamped with irrelevant data, I
hacked up the patch shown in the next comment.  This is intended to enable data
collection only when traverse_subtree() is active, which IIUC is what the cost
"traversal_time_ms" pertains to.  I was aided in this enterprise by emilio.  But
it produces costs of only about 17 million instructions for the test case, which
is far too small to account for the 100-ish millisecond cost.  So something is
not right there.

Some questions:

(1) Is there a way to reduce the number of processes in the picture?  Preferably
    to 1?  Having to guess which process' profile to look at is confusing.

(2) Is there a way to have Firefox quit after processing of the test page is
    complete?  This would remove noise resulting from idling / other activities
    between finishing drawing and shutdown.

(3) Is traverse_subtree() the right place to bracket profiler start/stop hints?
    If not, where is a better place?  Would it be better to put them around
    guaranteed single-threaded regions, so as to reduce confusion resulting from
    individual calls to traverse_subtree() being split into multiple threads?
Flags: needinfo?(bobbyholley)
Created attachment 8859194 [details] [diff] [review]
Restricts Callgrind data collection to traverse_subtree(), per comment
(In reply to Julian Seward [:jseward] from comment #18)
> But it produces costs of only about 17 million instructions for the test case,
> which is far too small to account for the 100-ish millisecond cost.

So instead I compared multiple-vs-single arena profiles for the process
containing around 650-680 calls to traverse_subtree.  Results are

  What                 Multi       Single     Simulated?

  insns              11.83 G      12.11 G
  D1 misses         153.4  M     150.4  M     y
  LL misses          11.26 M      10.60 M     y
  Mispredicts        74.2  M      77.5  M     y
  Global bus events  22.2  M      25.2  M

Given that the entries marked "Simulated" are derived from Callgrind's
simulation of caches and branch predictors, not from the hardware, and also
given that Valgrind serialises threaded execution, I wouldn't pay too much
attention to this.

The only thing that particularly stands out is the change in global bus events,
since that signifies locking.  Indeed, the Multi version has 8.4 million calls
to each of pthread_mutex_{lock,unlock}, the single one has 9.9 million such
calls.  Almost all of that change seems to come from 
"<F as malloc::boxed::FnBox<A>>::call_box", called from
"std::sys::imp::thread::Thread::new::thread_start".

Does that make any sense to anyone?
That seems kind of weird.  Are we starting new threads for each bit of the traversal or something?  I would think that the rayon thread pool would spin up N threads and then not spin up any more...
Julian and I spent a bit of time debugging comment 20's results over IRC.  We discovered that libxul has several symbols that demangle to "<F as alloc::boxed::FnBox<A>>::call_box".  They all look like:

_ZN50_$LT$F$u20$as$u20$alloc..boxed..FnBox$LT$A$GT$$GT$8call_box17h7aeb0dbca779e468E

differing only in the hash at the end.  valgrind is chopping off the hash, and lumping all the functions together in the callgraph, so even though Thread::new::thread_start is calling one flavor of ::call_box, that particular flavor isn't getting disambiguated from all the others.

Needless to say, this sort of thing is extremely frustrating for trying to debug Rust code.

There are a few ::call_box instances that call pthread_mutex_{lock,unlock}; it's difficult to find callers for them because they're apparently only called via traits, i.e. function pointers.  I tried starting a Stylo build, hoping to set some breakpoints, but only got crashes while trying to twiddle refcounts from Stylo.

I'll keeping trying.  Julian pointed out that even if we figure out what's causing those extra global bus events, and even if all those extra bus events are causing some sort of slowdown, it doesn't do a great job explaining why changing the number of arenas makes any difference.  So we're still a bit in the dark on that one.
(In reply to Julian Seward [:jseward] from comment #18)
> (1) Is there a way to reduce the number of processes in the picture? 
> Preferably
>     to 1?  Having to guess which process' profile to look at is confusing.

yes, just pass --disable-e10s to mach.

> 
> (2) Is there a way to have Firefox quit after processing of the test page is
>     complete?  This would remove noise resulting from idling / other
> activities
>     between finishing drawing and shutdown.

You could panic!() at the completion of a traversal that takes more than 30ms or so.

> 
> (3) Is traverse_subtree() the right place to bracket profiler start/stop
> hints?

traverse_dom is probably better, since it has less initialization overhead. But note that there are two versions, one in sequential.rs and one in parallel.rs. Those functions are the top-level driver of the traversal. In the parallel case, that is where we kick over to the worker threads.

>     If not, where is a better place?  Would it be better to put them around
>     guaranteed single-threaded regions, so as to reduce confusion resulting
> from
>     individual calls to traverse_subtree() being split into multiple threads?

I'm not sure what you mean by "guaranteed single-threaded regions". If you want the actual work of style computation (which may occur in parallel on several threads), then you should look at compute_style. If you want only the cascade logic and not the selector matching (since I suspect the former is more responsible for the allocations), you should look at cascade_primary_or_pseudo.
Flags: needinfo?(bobbyholley)
(In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #23)
> > (2) Is there a way to have Firefox quit after processing of the test page is
> >     complete?  This would remove noise resulting from idling / other
> > activities
> >     between finishing drawing and shutdown.
> 
> You could panic!() at the completion of a traversal that takes more than
> 30ms or so.

I don't want to gate it on any specific time period since that's going to
change, dramatically so when running on any Valgrind-based profiling tool.

There's surely some place in the Stylo machinery where it knows that it has
finished all traversal(s) (or whatever it is that we want to profile here).
I could put a panic!() there.  But .. where is that place?
Flags: needinfo?(bobbyholley)
(In reply to Julian Seward [:jseward] from comment #24)
> (In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #23)
> > > (2) Is there a way to have Firefox quit after processing of the test page is
> > >     complete?  This would remove noise resulting from idling / other
> > > activities
> > >     between finishing drawing and shutdown.
> > 
> > You could panic!() at the completion of a traversal that takes more than
> > 30ms or so.
> 
> I don't want to gate it on any specific time period since that's going to
> change, dramatically so when running on any Valgrind-based profiling tool.
> 

Ok - maybe gate it on the number of elements styled instead? That's also in the traversal statistics.

> There's surely some place in the Stylo machinery where it knows that it has
> finished all traversal(s) (or whatever it is that we want to profile here).
> I could put a panic!() there.  But .. where is that place?

Gecko calls into Servo to do styling for lots of reasons, most of which are for tiny subtrees or when there's very little work to be performed. I could probably come up with something, but for our purposes here it's just much easier to do the hacky thing and base it on the amount of work done.
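The "base it on the amount of work done" hack could be as small as this sketch (function name and threshold are invented; in practice the threshold would be tuned to the testcase):

```rust
// Hedged sketch: panic right after a "big" traversal so the Callgrind
// profile ends at the interesting point. BIG_TRAVERSAL_ELEMENTS is an
// invented value, not anything from the stylo tree.
const BIG_TRAVERSAL_ELEMENTS: usize = 10_000;

fn maybe_stop_profiling(elements_styled: usize) {
    if elements_styled >= BIG_TRAVERSAL_ELEMENTS {
        panic!("profiling stop: styled {} elements", elements_styled);
    }
}

fn main() {
    maybe_stop_profiling(100); // small traversal: keep going
    println!("still running after a small traversal");
}
```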
Flags: needinfo?(bobbyholley)
See Also: → bug 1359897
Depends on: 1356701
Depends on: 1354215
No longer blocks: 1243581, 1330412
I profiled Fx/Stylo running the Obama test for both single and multiple arenas,
and studied counts of instructions, global memory operations, pthread_mutex_*
calls and malloc calls, and looked at jemalloc and Rayon hot paths.  My interim
conclusions follow.  I should point out that as a result of (5), much of this is
my best guess, but nothing more.  The important points are (4) and (5).

(1) The problem [slowdown in single arena case] is not caused by mozjemalloc's
    locking strategy.  In both configurations mozjemalloc does one
    pthread_mutex_lock and _unlock per malloc and per free.

(2) This is instead caused by excessive inter-core cache line movement resulting
    from the use of a single arena:

    - If all threads use a single malloc lock, the containing line is going to
      be bounced around all cores.

    - The same goes for mozjemalloc's metadata, and in particular the use of the
      structure |run| in |arena_malloc_small|.

    - The same goes for the data in the blocks themselves.  Imagine one thread
      frees a block, and that area is quickly reallocated to a different thread
      on a different core.  When the new thread initialises the block, the
      hardware will have to move the associated line(s) to the new core (that
      is, take a bunch of L2 misses) because there is no way to tell it [on
      Intels] that the old contents of the block are no longer needed.

    This behaviour will be heavily dependent on the exact access patterns from
    Stylo and on how the kernel schedules threads across cores.  This might go
    some way to explaining the apparently inconsistent malloc-perf results that
    we've seen previously.

(3) So far I have not spotted any way to improve the parallel performance of
    mozjemalloc configured for single arenas.

(4) As a result of (2) and (3), I think that single-arena mozjemalloc is a poor
    starting point for a scalable high performance parallel allocator.  We are
    in effect using jemalloc in a way for which it was not intended.  We would
    be better off switching to multi-arena by default and fixing whatever perf
    problems we can.

(5) We have poor tooling for investigating such phenomena (inter core cache line
    transfers).  So far the only tool I have found that can do this is Linux's
    "perf" tool, and even that is hard to use and provides non-exact attribution
    of costs to specific source locations.

Some secondary conclusions:

(6) The single-arena case appears to cause more activity in Rayon (v0.6.0).  In
    particular Rayon makes a very large number of calls to steal_work(), around
    4 million in the multi-arena case and 5.5 million in the single-arena case.
    This is visible when profiling with Valgrind/Callgrind.  Whether this is real
    or merely an artefact of the profiling, I don't know.

(7) I tried to investigate (6) by hacking on the vendored Rayon
    (third_party/rust/rayon) but was thwarted by the build system, which refused
    to compile the modified sources due to checksum mismatches.  We need a way
    to make this possible.

(8) Ongoing efforts to reduce the allocation rate are worthwhile regardless of
    the results of this investigation.

(9) I wrote a small test program to assess the effect of cache line sharing, to
    see if the effects are real, and to give a small test case to try out with
    tools (see attachment).  It creates an array of counters with variable
    spacing and increments them from different threads.  Increasing the spacing
    from 4 bytes (all in the same line) to 64 bytes (all in different lines)
    reduces the run time by a factor of 6 on a 4 core, 8 thread Xeon.  The
    following commands are interesting:

    for ((sp = 0; sp < 100; sp = sp+8 )) ; \
      do /usr/bin/time ./contention 8 $sp ; echo ; done

    perf stat -e cycles:u,instructions:u,branches:u,branch-misses:u,L1-dcache-load-misses:u,L1-icache-load-misses:u,LLC-load-misses:u,LLC-store-misses:u,l2_rqsts.miss:u ./contention 8 0

    and the same, but with "8 64" instead of "8 0"

Next steps are:

(10) Cut out mozjemalloc into a standalone test program and try some synthetic
     workloads on it, also watching it with "perf".

(11) See if I can get anything useful out of "perf" for Fx/Stylo as a whole

(12) Look at multi-arena Stylo perf regressions

(13) See if it is possible to modify Valgrind/Callgrind to collect cache line
     transfer numbers.

(14) Look into modifying the in-tree Rayon, and/or trying out a more recent
     version (0.7.0).
Created attachment 8863319 [details]
contention.c

Small program for experimenting with cache line contention.
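The attachment is a C program; the same experiment can be sketched in Rust (thread count, stride, and iteration count below are arbitrary choices). With a stride of 1, all counters pack into shared cache lines; with a stride of 8 u64 slots (64 bytes), each counter gets its own line.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::thread;
use std::time::{Duration, Instant};

// Sketch of the cache-line contention experiment: N threads each hammer one
// counter in a shared array; `stride` (in u64 slots) controls whether the
// counters share cache lines (stride 1) or sit 64 bytes apart (stride 8).
fn run(n_threads: usize, stride: usize, iters: u64) -> (Duration, u64) {
    let counters: Vec<AtomicU64> =
        (0..n_threads * stride).map(|_| AtomicU64::new(0)).collect();
    let start = Instant::now();
    thread::scope(|s| {
        for t in 0..n_threads {
            let slot = &counters[t * stride];
            s.spawn(move || {
                for _ in 0..iters {
                    slot.fetch_add(1, Ordering::Relaxed);
                }
            });
        }
    });
    let total = counters.iter().map(|c| c.load(Ordering::Relaxed)).sum();
    (start.elapsed(), total)
}

fn main() {
    let (packed, _) = run(4, 1, 1_000_000); // false sharing: one line
    let (padded, _) = run(4, 8, 1_000_000); // one cache line per counter
    println!("packed: {:?}  padded: {:?}", packed, padded);
}
```

On hardware showing the effect described in comment 26, the packed run should be markedly slower than the padded one, though the exact ratio depends on the machine.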

Comment 28

3 months ago
(In reply to Julian Seward [:jseward] from comment #26)
> (4) As a result of (2) and (3), I think that single-arena mozjemalloc is a
> poor
>     starting point for a scalable high performance parallel allocator.  We
> are
>     in effect using jemalloc in a way for which it was not intended.  We
> would
>     be better off switching to multi-arena by default and fixing whatever
> perf
>     problems we can.
> 

Sounds reasonable, but we can't switch to a slower-on-main-thread malloc.

Do we know performance characteristics of some other malloc impls than jemalloc?
This is excellent analysis, Julian, thank you! The cache line contention issue makes total sense, and the thrashing to page in garbage memory after free() gives us important insight into our path forward (see below).

My guess is that the main-thread perf regression from enabling multiple jemalloc arenas (the one that smaug measured) is due to the required TLS lookup on every malloc. I would bet we could eliminate that by simply having a global flag that we set before a parallel stylo traversal to enable/disable multiple arenas during that time only.
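That toggling idea could look roughly like this (all names invented; the real change would live inside the allocator's arena-selection path, not Gecko-side Rust):

```rust
use std::sync::atomic::{AtomicBool, Ordering};

// Hypothetical sketch: a global flag flipped around the parallel traversal.
// The allocation path only pays per-thread arena selection (and the TLS
// lookup it implies) while the flag is set; otherwise the main arena is used.
static PARALLEL_TRAVERSAL_ACTIVE: AtomicBool = AtomicBool::new(false);

fn set_parallel_traversal(active: bool) {
    PARALLEL_TRAVERSAL_ACTIVE.store(active, Ordering::Release);
}

fn choose_arena(thread_index: usize, n_arenas: usize) -> usize {
    if PARALLEL_TRAVERSAL_ACTIVE.load(Ordering::Acquire) {
        thread_index % n_arenas // per-thread arena during the traversal
    } else {
        0 // main arena on the common single-threaded path
    }
}

fn main() {
    assert_eq!(choose_arena(3, 4), 0); // flag off: everyone uses arena 0
    set_parallel_traversal(true);
    assert_eq!(choose_arena(3, 4), 3); // flag on: per-thread arena
    set_parallel_traversal(false);
    println!("arena selection toggled around the traversal");
}
```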

However, I'm concerned that this (and the whole multiple-jemalloc-arena business in general) would just mask the underlying problem here. Specifically, the multiple-jemalloc-arenas approach makes sense if threads mostly own their data - that is to say, the mapping between thread id and data under manipulation remains relatively stable. This is fine for something like DOM Workers, but totally falls apart with Rayon/Work-Stealing, where the association between thread and data is very loose.

More concretely: multiple jemalloc arenas will work fine for the initial style pass, where they basically guarantee that threads will only free heap data that they own. But when processing restyles, we have only a 1/N chance that a given element will be restyled by the same OS thread that originally styled it. So each time the DOM is restyled, we'll free a bunch of memory in other threads' arenas, and allocate new style structs in different arenas than their neighbors. This will trend towards zero locality over many restyles, which is a terrible characteristic to have: it will be hard to detect on our benchmarks (which is why we originally determined that multiple jemalloc arenas fixed the problem), but will cause ever-worsening performance for users as they spend time on highly-dynamic pages.

So I think we're going to need to roll our own custom arenas. I'll get working on a design.
Oh, and Julian - continuing to understand this problem, especially the different characteristics of rayon 0.6 and 0.7, sounds great. You should be able to modify the in-tree rayon with a [replace] and then do |./mach vendor rust|. This is a patch for what a successful replace looks like:

https://gist.github.com/bholley/85dc2277aaa732bce04cc9e672915595
See also https://groups.google.com/d/msg/mozilla.dev.platform/_xQyW1UQEv8/Zy14-DdFBgAJ and https://groups.google.com/d/msg/mozilla.dev.platform/TSRH1EU-JVo/6GY6P6UnAQAJ
(In reply to Olli Pettay [:smaug] from comment #28)
> Sounds reasonable, but we can't switch to slower-on-main-thread malloc.

This statement really depends on how much slower the malloc would be; it's not an absolute rule.

> Do we know performance characteristics of some other malloc impls than
> jemalloc?

Only anecdotally.  tcmalloc (http://goog-perftools.sourceforge.net/doc/tcmalloc.html) is faster than glibc's default malloc (with some pretty graphs to prove it, though it's not clear how old those graphs are), but (anecdotally) consumes more memory than jemalloc.  jemalloc is supposed to do a better job than dlmalloc (http://g.oswego.edu/dl/html/malloc.html), which is the basis of glibc's memory allocator, called ptmalloc (http://www.malloc.de/en/).

Comment 33

3 months ago
(In reply to Nathan Froyd [:froydnj] from comment #32)
> (In reply to Olli Pettay [:smaug] from comment #28)
> > Sounds reasonable, but we can't switch to slower-on-main-thread malloc.
> 
> This statement really depends on how much slower the malloc would be; it's
> not an absolute rule.

Let's say that if it would slow down Quantum Flow perf metrics, it's a no-no.
Changing the Gecko-wide allocator is totally out of scope for this bug. I'm pretty sure we can get the best of both worlds by simply toggling multiple jemalloc arenas when we start and stop the parallel traversal. The question is whether that's good enough, or whether we should arena-allocate anyway. We might even want to do both.
(In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #34)
> Changing the Gecko-wide allocator is totally out of scope for this bug. I'm
> pretty sure we can get the best of both worlds by simply toggling multiple
> jemalloc arenas when we start and stop the parallel traversal.

Emilio and I discussed this option a bit more earlier today, and he convinced me that it might be more workable than I thought - at least worth investigating.

From Julian's analysis, it seems like there are two reasons for the bad performance with multiple arenas:
(A) Two threads simultaneously allocating (and trying to initialize) memory in the same cache line.
(B) One thread freeing memory, and another thread subsequently reusing it,  causing us to do unnecessary work transferring the garbage data between cache lines.

Multiple jemalloc arenas solve (A) but not necessarily (B). However, emilio pointed out that we can likely defer most frees until after the parallel traversal, especially since the style context generally holds the ComputedValues alive until after the traversal is complete (though this doesn't help Servo).

It wouldn't be perfect, but it would be a lot easier than arena allocation, and so potentially a good place to start (we _may_ still want arenas to deal with the sequential allocation overhead, but I want to see how low I can get that first). I'll try to have a look at this tomorrow.

A lot of the efficacy here depends on how much of the problem we think is (A) vs (B). Julian, do you have a sense of this?
Flags: needinfo?(jseward)
Depends on: 1361258
(In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #29)
> My guess is that the main-thread perf regression from enabling multiple
> jemalloc arenas (the one that smaug measured) is due to the required TLS
> lookup on every malloc.  [..]

Having peered at the jemalloc sources, I agree.  And it's potentially 
dangerous, if TLS on some platform (Windows?) is significantly more
expensive than on others.  Do you know if anyone has tried multi-arena
on Windows?

> [..] Rayon/Work-Stealing, where the association between thread and data
> is very loose.

Hmm, that's an interesting point -- I hadn't thought of that.
Maybe some refinement of Rayon, to give it placement or locality hints,
or some such, would be something we should consider.
(In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #35)
> (A) Two threads simultaneously allocating (and trying to initialize) memory
> in the same cache line.
> (B) One thread freeing memory, and another thread subsequently reusing it, 
> causing us to do unnecessary work transferring the garbage data between
> cache lines.
> 
> Multiple jemalloc arenas solve (A) but not necessarily (B).

For reasons I don't [yet] understand, multi-arena jemalloc appears to lock
and unlock for every allocation/free.  Hence even in this case we'll be
bouncing around at least the cache line containing the lock.  I had gotten
the impression that multi-arena meant "one arena per thread" but perhaps
that's not so.

> A lot of the efficacy here depends on how much of the problem we think is
> (A) vs (B). Julian, do you have a sense of this?

Unfortunately no.  Without tooling to find these misses we're pretty much
shooting in the dark.  I'll see if I can hack something up (and generally
keep digging.)
Flags: needinfo?(jseward)
(In reply to Julian Seward [:jseward] from comment #37)
> (In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #35)
> > (A) Two threads simultaneously allocating (and trying to initialize) memory
> > in the same cache line.
> > (B) One thread freeing memory, and another thread subsequently reusing it, 
> > causing us to do unnecessary work transferring the garbage data between
> > cache lines.
> > 
> > Multiple jemalloc arenas solve (A) but not necessarily (B).
> 
> For reasons I don't [yet] understand, multi-arena jemalloc appears to lock
> and unlock for every allocation/free.  Hence even in this case we'll be
> bouncing around at least the cache line containing the lock.  I had gotten
> the impression that multi-arena meant "one arena per thread" but perhaps
> that's not so.

This is indeed not so, but (newer?) versions of jemalloc provide for thread-specific arenas (called "caches" in the documentation, so maybe they're not the same thing): http://jemalloc.net/jemalloc.3.html#tcache.create
(In reply to Nathan Froyd [:froydnj] from comment #38)
> (In reply to Julian Seward [:jseward] from comment #37)
> > For reasons I don't [yet] understand, multi-arena jemalloc appears to lock
> > and unlock for every allocation/free.  [..]

> This is indeed not so, but (newer?) versions of jemalloc provide for
> thread-specific arenas (called "caches" in the documentation, so maybe
> they're not the same thing...

Right.  So what I *expected* to see was a fast path in which the calling
thread uses the TLS mechanism to get a pointer to its own private data
area, which is the only potentially expensive bit, then tries
to allocate in there, and only takes a lock if it has to fall back 
allocating from central facilities for whatever reason.

But that's not what it does.

> [..] but (newer?) versions of jemalloc provide for thread-specific arenas

I'm getting the impression that we're not running the most recent or most
capable version of jemalloc for some reason, but I don't know the back story.
Can you give me a pointer to it?
Flags: needinfo?(nfroyd)
(In reply to Julian Seward [:jseward] from comment #39)
> I'm getting the impression that we're not running the most recent or most
> capable version of jemalloc for some reason, but I don't know the back story.
> Can you give me a pointer to it?

I think bug 1219914 is the primary impediment. (Bug 762449 may have other blockers.)
(In reply to Julian Seward [:jseward] from comment #39)
> (In reply to Nathan Froyd [:froydnj] from comment #38)
> > (In reply to Julian Seward [:jseward] from comment #37)
> > > For reasons I don't [yet] understand, multi-arena jemalloc appears to lock
> > > and unlock for every allocation/free.  [..]
> 
> > This is indeed not so, but (newer?) versions of jemalloc provide for
> > thread-specific arenas (called "caches" in the documentation, so maybe
> > they're not the same thing...
> 
> Right.  So what I *expected* to see was a fast path in which the calling
> thread uses the TLS mechanism to get a pointer to its own private data
> area, which is the only potentially expensive bit, then tries
> to allocate in there, and only takes a lock if it has to fall back 
> allocating from central facilities for whatever reason.

Well, that would make sense to me, because there's always the possibility that thread A frees memory from the arena that thread B is allocating from. That can happen regardless of whether we use multiple arenas or not, so I'd expect there to be a lock acquisition for main-thread allocations with or without multiple arenas. Are you saying there's _more_ locking with multi-arenas, or just that the locking isn't eliminated? Our stylo performance with multiple arenas seems fine; it's more a question of whether we regress main-thread performance at all.

> 
> But that's not what it does.
> 
> > [..] but (newer?) versions of jemalloc provide for thread-specific arenas
> 
> I'm getting the impression that we're not running the most recent or most
> capable version of jemalloc for some reason, but I don't know the back story.
> Can you give me a pointer to it?

Upgrading mozjemalloc is a total black hole and out of scope for us. We should consider ourselves to be stuck with mozjemalloc for the stylo timeframe.
Flags: needinfo?(nfroyd) → needinfo?(jseward)
> Upgrading mozjemalloc is a total black hole and out of scope for us. We should
> consider ourselves to be stuck with mozjemalloc for the stylo timeframe.

I agree. glandium has spent lots of time in the past trying to get jemalloc4 up to snuff, without success. Changing the allocator is *hard*, because *everything* depends on it.
(In reply to Andrew McCreight [:mccr8] from comment #40)
> (In reply to Julian Seward [:jseward] from comment #39)
> > I'm getting the impression that we're not running the most recent or most
> > capable version of jemalloc for some reason, but I don't know the back story.
> > Can you give me a pointer to it?
> 
> I think bug 1219914 is the primary impediment. (Bug 762449 may have other
> blockers.)

See also bug 1357301 comment 8 and bug 1357301 comment 9. It's a horribly broken mess.
(In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #41)

> Well, that would make sense to me, because there's always the possibility
> that thread A frees memory from the arena that thread B is allocating from.
> That can happen regardless of whether we use multiple arenas or not, so I'd
> expect there to be a lock acquisition for main-thread allocations with or
> without multiple arenas.

Ah yes.  I hadn't thought properly about this.  Sorry for the noise.

> Are you saying there's _more_ locking with
> multi-arenas, or just that the locking isn't eliminated?

Just that locking isn't eliminated or even reduced -- it's the same as
with single arenas.  But I see now that's hard to avoid.

> Upgrading mozjemalloc is a total black hole and out of scope for us.

I understand.  I wasn't proposing that.  I just wanted to understand
the history a bit.
Flags: needinfo?(jseward)
Do you have any cycles today or tomorrow to poke at making multiple arenas dynamically toggleable (i.e. the arenas are always there, but we only do the TLS lookup when a static bool is set)? I was hoping to get to it today but a lot of stuff came up, and not sure I will tomorrow either.

We can expose a Gecko_SetMultipleJemallocArenasEnabled(bool) method in ServoBindings.{cpp.h} to flip the bit, and then invoke that function in parallel.rs.
Flags: needinfo?(jseward)
(And then the key would be to measure the performance on smaug's microbenchmarks with the extra arenas existing but toggled off).
(In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #45)
> Do you have any cycles today or tomorrow to poke at making multiple arenas
> dynamically toggleable

Yes, looking at it now.
Flags: needinfo?(jseward)
(In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #45)
> [..] making multiple arenas dynamically toggleable [..]

Patch is at bug 1361258 comment 1.  Looks good for the parallel traversal
but I haven't tested smaug's microbenchmarks with it yet.
(In reply to Julian Seward [:jseward] from comment #37)
> I had gotten
> the impression that multi-arena meant "one arena per thread" but perhaps
> that's not so.

IIRC, it's one arena per cpu (or maybe 2), and threads get an arena assigned randomly, with the hope that they'll be evenly distributed. This means threads can still fight for the same arena.


(In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #45)
> Do you have any cycles today or tomorrow to poke at making multiple arenas
> dynamically toggleable (i.e. the arenas are always there, but we only do the
> TLS lookup when a static bool is set)? I was hoping to get to it today but a
> lot of stuff came up, and not sure I will tomorrow either.
> 
> We can expose a Gecko_SetMultipleJemallocArenasEnabled(bool) method in
> ServoBindings.{cpp.h} to flip the bit, and then invoke that function in
> parallel.rs.

Nice hack to test things out, but it sounds quite horrible as something to land. It seems to me what's needed here is an API where code can get a jemalloc arena, and use it for allocations. Jemalloc 4 has an API for that, and we've been wanting to use it for e.g. isolating some kinds of allocations from others. Considering the state of things with jemalloc 4, we could consider adding a similar feature to mozjemalloc. Then code can opt in to allocating its own arena per thread.
(In reply to Mike Hommey [:glandium] from comment #49)

In the longer term (post 57, not in scope of this bug) I suspect we are
going to have to take the hit of moving to an allocator which is as
thread friendly as possible, since we don't want to be serialising parallel
code either by allocator locking or by the allocator moving memory between
threads and causing a lot of coherence misses.  Also, having a way to
observe and quantify coherence misses (cross core traffic) would be helpful.
(In reply to Mike Hommey [:glandium] from comment #49)
> (In reply to Julian Seward [:jseward] from comment #37)
> > I had gotten
> > the impression that multi-arena meant "one arena per thread" but perhaps
> > that's not so.
> 
> IIRC, it's one arena per cpu (or maybe 2), and threads get an arena
> associated randomly, with the hope they'll be evenly distributed. This means
> threads still can fight for the same arena.
> 
> 
> (In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #45)
> > Do you have any cycles today or tomorrow to poke at making multiple arenas
> > dynamically toggleable (i.e. the arenas are always there, but we only do the
> > TLS lookup when a static bool is set)? I was hoping to get to it today but a
> > lot of stuff came up, and not sure I will tomorrow either.
> > 
> > We can expose a Gecko_SetMultipleJemallocArenasEnabled(bool) method in
> > ServoBindings.{cpp.h} to flip the bit, and then invoke that function in
> > parallel.rs.
> 
> Nice hack to test things out, but sounds quite horrible as something to
> land.

Why, exactly? It seems like this basically gets us multiple arenas for stylo without infecting the rest of Gecko with it, which is exactly what we want.

> It seems to me what's needed here is an API where code can get a
> jemalloc arena, and use it for allocations. Jemalloc 4 has an API for that,
> and we've been wanting to use it for e.g. isolating some kinds of
> allocations from others. Considering the state of things with jemalloc 4, we
> could consider adding a similar feature to mozjemalloc. Then code can opt-in
> to allocate their own arena-per-thread.

How would we pass that arena through Rust stdlib code that just does malloc?
(In reply to Julian Seward [:jseward] from comment #50)
> (In reply to Mike Hommey [:glandium] from comment #49)
> 
> In the longer term (post 57, not in scope of this bug) I suspect we are
> going to have to take the hit of moving to an allocator which is as
> thread friendly as possible

I don't think that argument is going to hold water. Gecko is single-threaded enough that we can't make single-thread allocation performance or memory usage worse.
(In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #52)
> (In reply to Julian Seward [:jseward] from comment #50)
> > (In reply to Mike Hommey [:glandium] from comment #49)
> > 
> > In the longer term (post 57, not in scope of this bug) I suspect we are
> > going to have to take the hit of moving to an allocator which is as
> > thread friendly as possible
> 
> I don't think that argument is going to hold water. Gecko is single-threaded
> enough that we can't make single-thread allocation performance or memory
> usage worse.

My bad -- ambiguous phrasing.  By "take the hit" I meant "deal with the hassle of",
not "accept the performance loss".
In any case, I'd like to get this sorted out one way or another within the next few days. We're targeting building stylo by default for end-of-week, and we can't do that with the MOZ_STYLO #define in mozjemalloc.c.

Originally smaug had measured a main-thread performance hit from multiple arenas, but from bug 1361258 comment 22 it sounds like he can't reproduce that anymore.

We basically have two options:
(1) Take some more measurements on multiple arenas, and enable them everywhere. This probably involves doing a bit more measurement on malloc-bound main-thread testcases to be sure we don't measure any impact of the TLS lookup, and measuring on AWSY and whatnot to make sure we don't increase fragmentation.
(2) Enable Julian's "adaptive" solution.

What do people prefer?
Flags: needinfo?(mh+mozilla)
Flags: needinfo?(jseward)
Flags: needinfo?(bugs)

Comment 55

3 months ago
(1) perhaps.
I (and others) should have rerun some tests with MOZ_STYLO, since right now I can't seem to get a similar regression. Did I make some mistake there? Did I run with jemalloc? Did I build stylo (which itself seems to cause a major perf regression, bug 1362982)?
I apologize if I caused wasted time :(
Flags: needinfo?(bugs)

Comment 56

3 months ago
Does 'building stylo by default for end-of-week' mean not enabled by default, but all the code, including the additions to Element will be compiled? I assume so.
Has anyone measured performance with stylo builds (even if stylo wasn't enabled)?
Flags: needinfo?(bobbyholley)
(In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #54)
> (1) Take some more measurements on multiple arenas, and enable them
> everywhere. This probably involves doing a bit more measurement on
> malloc-bound main-thread testcases to be sure we don't measure any impact of
> the TLS lookup, and measuring on AWSY and whatnot to make sure we don't
> increase fragmentation.
> (2) Enable Julian's "adaptive" solution.

(1) is preferable, since it's less complexity, and it's where we'd like
    to get to eventually, but it is also more risky.  I am particularly
    concerned that the TLS lookups may be slow on other P1 platforms.

(2) is safer.

I suggest the following: check whether we can find any slowdowns with
multiple arenas on OSX, Windows and Android.  Also check AWSY etc.
If any problems appear, do (2), else do (1).

I can check OSX easily enough.  I *could* do Win32 on a VM, but I'm not
happy to trust perf results from a VM, and really we need to test Win64.
Can anyone test it natively?  And can anyone test Android?
Flags: needinfo?(jseward)
(In reply to Olli Pettay [:smaug] from comment #56)
> Does 'building stylo by default for end-of-week' mean not enabled by
> default, but all the code, including the additions to Element will be
> compiled? I assume so.

Yes.

> Has anyone measured performance with stylo builds (even if stylo wasn't
> enabled)?

That would be a good thing to do. I've looked through all the MOZ_STYLO code and have been assuming that none of it would affect non-stylo perf, but it's certainly possible that the extra word in Element might change something. It's even possible that you might have been testing with this, and that this would explain the difference between your numbers and Julian's.

If either you or Julian could run some measurements with MOZ_STYLO enabled but layout.css.servo.enabled set to false, that would be splendid.
Flags: needinfo?(bobbyholley)
(In reply to Julian Seward [:jseward] from comment #57)
> (In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #54)
> > (1) Take some more measurements on multiple arenas, and enable them
> > everywhere. This probably involves doing a bit more measurement on
> > malloc-bound main-thread testcases to be sure we don't measure any impact of
> > the TLS lookup, and measuring on AWSY and whatnot to make sure we don't
> > increase fragmentation.
> > (2) Enable Julian's "adaptive" solution.
> 
> (1) is preferable, since it's less complexity, and it's where we'd like
>     to get to eventually, but it is also more risky.  I am particularly
>     concerned that the TLS lookups may be slow on other P1 platforms.
> 
> (2) is safer.
> 
> I suggest the following: check whether we can find any slowdowns with
> multiple arenas on OSX, Windows and Android.  Also check AWSY etc.
> If any problems appear, do (2), else do (1).

This sounds good to me.
 
> I can check OSX easily enough.  I *could* do Win32 on a VM, but I'm not
> happy to trust perf results from a VM, and really we need to test Win64.
> Can anyone test it natively?  And can anyone test Android?

I'd suggest measuring the perf first on osx. If there's no regression spotted, write up a step-by-step guide to perform the tests on windows, and get somebody with a native device to do the test (I think Nathan has one?). May also be worth testing both win32 and win64, though perhaps win32 + macosx64 is good enough.
https://bugzilla.mozilla.org/show_bug.cgi?id=1361258#c24 suggests that
(1) might have a space problem, with Win7-32 RSS regressions.  What sort
of latitude do we have w.r.t. space?  Does anybody know why only Win7-32
would regress on space?
To reduce the overhead of multiple arenas, could we use just one arena in the chrome process and multiple arenas in the content processes?

Comment 62

3 months ago
IIRC based on some profiles ehsan has seen, stylo would be particularly useful in the parent process, and less in child processes.
Yes, and we're getting rid of the old style system shortly after we ship in any case.
(In reply to Olli Pettay [:smaug] from comment #62)
> IIRC based on some profiles ehsan has seen, stylo would be particularly
> useful in the parent process, and less in child processes.

That's correct.
(In reply to Julian Seward [:jseward] from comment #57)
> (In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #54)
> > (1) Take some more measurements on multiple arenas, and enable them
> > everywhere. This probably involves doing a bit more measurement on
> > malloc-bound main-thread testcases to be sure we don't measure any impact of
> > the TLS lookup, and measuring on AWSY and whatnot to make sure we don't
> > increase fragmentation.
> > (2) Enable Julian's "adaptive" solution.
> 
> (1) is preferable, since it's less complexity, and it's where we'd like
>     to get to eventually, but it is also more risky.  I am particularly
>     concerned that the TLS lookups may be slow on other P1 platforms.

Are we talking about an additional TLS lookup per malloc?  TLS lookups do show up in profiles in other code every now and then, is there any reason to believe they wouldn't here?
Yeah, in general I'm still of the opinion that 'adaptive' is the right path forward here.
(In reply to :Ehsan Akhgari (super long backlog, slow to respond) from comment #65)
> > I am particularly concerned that the TLS lookups may be slow on other P1 platforms.
> 
> Are we talking about an additional TLS lookup per malloc?  TLS lookups do
> show up in profiles in other code every now and then, is there any reason to
> believe they wouldn't here?

Yes, one TLS lookup per malloc.  For Linux and Mac it is done as an in-line read of a
__thread variable, so I'm not convinced it would show up in a profile.  For Windows
it's a call to TlsGetValue which I hope would get inlined.

If the platform doesn't support TLS, which is the case when MOZ_MEMORY_ANDROID
is defined, then an arena is chosen by hashing: pthread_self() % narenas,
which is potentially a 32-cycle hit, if the processor's % implementation
can produce only one bit per cycle.  This is why I'm concerned.
(In reply to Julian Seward [:jseward] from comment #67)
> (In reply to :Ehsan Akhgari (super long backlog, slow to respond) from
> comment #65)
> > > I am particularly concerned that the TLS lookups may be slow on other P1 platforms.
> > 
> > Are we talking about an additional TLS lookup per malloc?  TLS lookups do
> > show up in profiles in other code every now and then, is there any reason to
> > believe they wouldn't here?
> 
> Yes, one TLS lookup per malloc.  For Linux and Mac it is done as an in-line
> read of a
> __thread variable, so I'm not convinced it would show up in a profile.  For
> Windows
> it's a call to TlsGetValue which I hope would get inlined.

Is it? How can the read be inline when we're not linked statically? I'd assume we need at least a function call to __tls_get_addr, even on Linux.
Copy/paste from my email earlier today:
I don't see why we should rush multiple arenas for stylo (adaptive or not). You can still build stylo by default without addressing that.

The adaptive approach is too hackish. It makes *everything* use multiple arenas while stylo does its work, which means *every* other running thread will get memory from arenas other than arenas[0] while multiple arenas are enabled for stylo. In turn, on long-running processes this means increased fragmentation for any long-term allocations that end up on !arenas[0], unfreed runs, etc.

As I said to Julian yesterday, a middle ground, since rust can't allow us to pass an arena to every allocation, would be to have stylo threads opt in to using an arena other than arenas[0]. That is, in terms of the current patch in the bug, we'd keep the
Gecko_SetMultipleJemallocArenasEnabled thing, but every thread would still have its attributed arena stay arenas[0], until some other function is called that would attribute a different arena. That function would then be run at rust-thread-spawn time.
Flags: needinfo?(mh+mozilla)
(In reply to Mike Hommey [:glandium] from comment #69)
> I don't see why rush things for doing multiple-arenas for stylo somehow
> (adaptive or not). You can still build stylo by default without addressing
> that.

Yes, true, but you can't get much parallel speedup out of it with a single arena.
See comment 17 for some motivating numbers.

> The adaptive approach is too hackish. It makes *everything* use multiple
> arenas during the time stylo does its work. [..]

In effect it makes the heap behaviour more scheduling-sensitive (and hence, in
the case of space problems, harder to reproduce) than it is at present.  To what
extent that is a problem, I have no idea.

> As I said Julian yesterday, a middle ground, since rust can't allow us to
> pass an arena to every allocation, would be to have stylo threads opt-in to
> use an arena other than arenas[0].  [..]

That would be more principled, but it would take longer to do.  We'd need to
hack Rayon to provide us with hooks at worker thread start and exit, unless it
already provides those.  Also, this would necessitate a TLS lookup for every
malloc call, even those made by non-Stylo threads, in order to find out which
arena to use.  By contrast, the current proposal has a fixed overhead of just
one (perfectly predictable) branch per call.
(In reply to Julian Seward [:jseward] from comment #70)
> Also, this would necessitate a TLS lookup for every
> malloc call, even those made by non-Stylo threads, in order to find out which
> arena to use.  By contrast, the current proposal has a fixed overhead of just
> one (perfectly predictable) branch per call.

No, you can still gate everything on Gecko_SetMultipleJemallocArenasEnabled having been called.
(In reply to Julian Seward [:jseward] from comment #70)
> That would be more principled, but it would take longer to do.  We'd need to
> hack Rayon to provide us with hooks at worker thread start and exit, unless
> it
> already provides those.

https://docs.rs/rayon/0.7.0/rayon/struct.Configuration.html#method.start_handler
(In reply to Mike Hommey [:glandium] from comment #69)
> Copy/paste from my email earlier today:
> I don't see why rush things for doing multiple-arenas for stylo somehow
> (adaptive or not). You can still build stylo by default without addressing
> that.

The point of building stylo by default is so that people don't need special MOZ_STYLO builds to work on stylo, and that includes performance profiling and talos runs. The performance of stylo without multiple arenas is bad enough that I consider this a blocker to building by default.

> 
> The adaptive approach is too hackish. It makes *everything* use multiple
> arenas during the time stylo does its work. Which means *every* other
> running thread will get memory from arenas other than arenas[0] while
> multiple-arenas are enabled for stylo. In turn, this means on long running
> processes, increased fragmentation for any long term allocations that might
> end up on !arenas[0], unfreed runs, etc.

That's a great point - I didn't think about that.

> 
> As I said Julian yesterday, a middle ground, since rust can't allow us to
> pass an arena to every allocation, would be to have stylo threads opt-in to
> use an arena other than arenas[0]. That is, in terms of the current patch in
> the bug, we'd keep the
> Gecko_SetMultipleJemallocArenasEnabled thing, but every thread would still
> have its attributed arena stay arenas[0], until some other function is
> called that would attribute a different arena. That function would then be
> run at rust-thread-spawn time.

This sounds like a great plan, and I think it should be easy to implement. Glandium's points in comment 71 and comment 72 are spot-on.
Note that this dovetails very nicely with the recent merge of rayon 0.7, which includes froydnj's work to add the start/exit handlers. Great work by everyone involved in getting this over the line. :-)
Initial patches (from Mike) for the comment 69 proposal, plus some measurements
are at bug 1361258 comment 35.
(In reply to Julian Seward [:jseward] from comment #75)
> Initial patches (from Mike) for the comment 69 proposal, plus some
> measurements
> are at bug 1361258 comment 35.

Do the current patches solve the kludge-y modulus bucketing issue, and guarantee that a separate arena is associated with each thread?
Flags: needinfo?(jseward)
(In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #76)
> (In reply to Julian Seward [:jseward] from comment #75)
> > Initial patches (from Mike) for the comment 69 proposal, plus some
> > measurements
> > are at bug 1361258 comment 35.
> 
> Do the current patches solve the kludge-y modulus bucketing issue, and
> guarantee that a separate arena is associated with each thread?

No, but next iteration will, actually.

Do we care about Android at the moment?
Flags: needinfo?(bobbyholley)

Updated

3 months ago
Flags: needinfo?(jseward)
(In reply to Mike Hommey [:glandium] from comment #77)
> Do we care about Android at the moment?

No. We won't be enabling Stylo V1 on Android. We'll enable it on Android in a later release.
(In reply to Mike Hommey [:glandium] from comment #77)
> (In reply to Bobby Holley (:bholley) (busy with Stylo) from comment #76)
> > (In reply to Julian Seward [:jseward] from comment #75)
> > > Initial patches (from Mike) for the comment 69 proposal, plus some
> > > measurements
> > > are at bug 1361258 comment 35.
> > 
> > Do the current patches solve the kludge-y modulus bucketing issue, and
> > guarantee that a separate arena is associated with each thread?
> 
> No, but next iteration will, actually.

Great!

> 
> Do we care about Android at the moment?

Not for now, and probably not for 57. We will for 58 (or shortly thereafter).
Flags: needinfo?(bobbyholley)
It seems like this bug is useful as a tracking bug, but its title should be updated for that, because what the title talks about was done in bug 1361258.
It's not clear to me what this bug tracks, exactly. I think we have a reasonable solution in bug 1361258, and should file new issues if there's more work to do.
Status: NEW → RESOLVED
Last Resolved: 2 months ago
Resolution: --- → WORKSFORME
No longer blocks: 1356991