Open Bug 1578352 Opened 11 months ago Updated 3 months ago

Cranelift: jemalloc makes parallel compilation slower than serial compilation

Categories

(Core :: JavaScript: WebAssembly, defect, P3)

x86_64
Linux
defect

People

(Reporter: lth, Assigned: jseward)

References

(Blocks 1 open bug)

Attachments

(1 file)

Attached file parallel.tar

I will attach a simple benchmark extracted from embenchen that compiles a wasm module 20 times and prints the sum of the best 10 runtimes. With a build configured with --disable-debug --enable-optimize --without-intl-api --enable-release, we see this on my Lenovo tower (I've removed the uninteresting part of the output):

$ ~/m-cl/js/src/build-release/dist/bin/js --wasm-compiler=ion compile.js
307.897705078125
$ ~/m-cl/js/src/build-release/dist/bin/js --wasm-compiler=ion compile.js
308.416259765625
$ ~/m-cl/js/src/build-release/dist/bin/js --wasm-compiler=ion compile.js
316.816650390625

Now without threads:

$ ~/m-cl/js/src/build-release/dist/bin/js --no-threads --wasm-compiler=ion compile.js
968.870849609375
$ ~/m-cl/js/src/build-release/dist/bin/js --no-threads --wasm-compiler=ion compile.js
967.937744140625
$ ~/m-cl/js/src/build-release/dist/bin/js --no-threads --wasm-compiler=ion compile.js
968.996337890625

So far so good: threads give > 3x speedup on this system with default thread selection policies. Now for Cranelift, first with threads:

$ ~/m-cl/js/src/build-release/dist/bin/js --wasm-compiler=cranelift compile.js
3170.4775390625
$ ~/m-cl/js/src/build-release/dist/bin/js --wasm-compiler=cranelift compile.js
2966.201171875
$ ~/m-cl/js/src/build-release/dist/bin/js --wasm-compiler=cranelift compile.js
3208.7294921875

Oof, that's a little sad: 10x slower than Ion! Then finally without threads:

$ ~/m-cl/js/src/build-release/dist/bin/js --wasm-compiler=cranelift --no-threads compile.js
2447.20556640625
$ ~/m-cl/js/src/build-release/dist/bin/js --wasm-compiler=cranelift --no-threads compile.js
2444.6923828125
$ ~/m-cl/js/src/build-release/dist/bin/js --wasm-compiler=cranelift --no-threads compile.js
2441.205322265625

Hey, that's faster! That's not good at all.

Cranelift should be faster with parallel compilation.
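
For reference, the measurement scheme in the attached compile.js is simply: compile the module 20 times and report the sum of the 10 best times, in milliseconds. The attachment itself is JS; the sketch below is Rust, purely to illustrate that scheme (compile_once is a hypothetical stand-in for whatever performs one compilation), so it is clear what the numbers above mean.

use std::time::Instant;

// Illustrative only: mirrors the measurement scheme of the attached
// compile.js (20 compiles, sum of the 10 best times, in milliseconds).
fn sum_of_best_10(mut compile_once: impl FnMut()) -> f64 {
    let mut times: Vec<f64> = (0..20)
        .map(|_| {
            let start = Instant::now();
            compile_once();
            start.elapsed().as_secs_f64() * 1000.0
        })
        .collect();
    times.sort_by(|a, b| a.partial_cmp(b).unwrap());
    times[..10].iter().sum()
}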

The above numbers are with the standard register allocator. An amusing fact: with the minimal register allocator, non-threaded compile times drop to about 1900ms (because the allocator is faster), but threaded compile times increase further, frequently exceeding 4000ms for the sum of the 10 best runs. This suggests there is some terrible contention on a central data structure that the minimal allocator hits more often than the regular allocator does, and that contention may of course also be the source of the slowdown for parallel compiles with the regular allocator.

OK, so there are some NUMA effects: running with numactl --cpunodebind=0 cuts the threaded time roughly in half, which gets us a speedup, but a pretty feeble one. And there is still contention beyond that: reducing the number of threads with --thread-count=4 brings the time down from about 1750ms to about 1150ms.

I can't reproduce this on my single-chip Xeon (Intel(R) Xeon(R) CPU E3-1505M v5 @ 2.80GHz; 4 cores, 8 threads, 8MB LLC, clamped to a max of 2.6 GHz).

Running ../dist/bin/js --wasm-compiler=cranelift --thread-count=<N> for N in 1 .. 8:

N  Result

1  3208.03759765625
2  1597.113037109375
3  1075.590576171875
4   841.673828125 
5   835.941650390625 
6   838.021240234375
7   806.493896484375
8   808.149658203125

So it scales out to the number of cores OK, and who knows what effect hyperthreading has for N in 5 .. 8.

Also I cannot see any signs of contention or related badness, either with
/usr/bin/perf, or with Valgrind/Callgrind + MESI protocol simulation.

For the 4-, 8- and 16-thread configs, perf says that we spend a lot of time in these three:

 cranelift_codegen::regalloc::context::Context::run
 cranelift_codegen::context::Context::compile
 cranelift_bforest::path::Path<F>::find

and the distribution of costs across them is pretty much the same regardless
of the number of threads.

Weird; I can reproduce this on my machine, even when I pin the CPUs with taskset (I don't have multiple NUMA nodes on my machine, according to numactl --hardware).

That's when compiling AngryBots with Cranelift:

1 9179.98095703125
2 5613.599853515625
3 4841.315185546875
4 6068.06396484375
5 7901.22509765625
6 9809.276123046875
7 11687.159912109375
8 13741.26318359375
9 14125.906982421875
10 15259.072998046875
11 16063.001953125
12 16903.181884765625

(I get consistent speedups when compiling with Ion.)

Benjamin, can you run with perf top or something equivalent and show what the top few items are? I can provide a script if that helps. I'd like to know whether whatever is hot in your/Lars's builds is the same as what's hot in mine.

FWIW, I'm on Fedora 30, which I updated yesterday:

Linux yojimbo 5.2.9-200.fc30.x86_64 #1 SMP Fri Aug 16 21:37:45 UTC 2019 x86_64 x86_64 x86_64 GNU/Linux

I have a lot of these:

Processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 79
model name      : Intel(R) Xeon(R) CPU E5-2637 v4 @ 3.50GHz
stepping        : 1
microcode       : 0xb000036
cpu MHz         : 1197.748
cache size      : 15360 KB
physical id     : 0
siblings        : 8
core id         : 0
cpu cores       : 4
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 20
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb cat_l3 cdp_l3 invpcid_single pti intel_ppin ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm cqm rdt_a rdseed adx smap intel_pt xsaveopt cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local dtherm ida arat pln pts md_clear flush_l1d
bugs            : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs
bogomips        : 6984.11
clflush size    : 64
cache_alignment : 64
address sizes   : 46 bits physical, 48 bits virtual
power management:

I was going to move the branch to newer Cranelift master + SpiderMonkey head but I got stuck in rebase hell and will probably not try that until tomorrow.

$ numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 8 9 10 11
node 0 size: 15953 MB
node 0 free: 3705 MB
node 1 cpus: 4 5 6 7 12 13 14 15
node 1 size: 16101 MB
node 1 free: 4724 MB
node distances:
node   0   1 
  0:  10  21 
  1:  21  10 

This is typically what I saw for the event 'cycles:u':

10.00%  cranelift_codegen::regalloc::context::Context::run
 8.56%  cranelift_codegen::context::Context::compile
 6.06%  cranelift_bforest::path::Path<F>::find
 3.96%  cranelift_codegen::regalloc::live_value_tracker::LiveValueTracker::process_inst
 3.61%  cranelift_codegen::regalloc::coloring::Context::visit_inst
 3.33%  cranelift_codegen::regalloc::liverange::GenLiveRange<PO>::extend_in_ebb
 2.51%  cranelift_codegen::redundant_reload_remover::RedundantReloadRemover::do_redundant_fill_removal_on_function
 2.21%  cranelift_codegen::isa::constraints::RecipeConstraints::satisfied
 1.79%  <cranelift_codegen::isa::enc_tables::Encodings as core::iter::traits::iterator::Iterator>::next
 1.59%  cranelift_entity::list::EntityList<T>::push
 1.43%  <cranelift_codegen::isa::x86::Isa as cranelift_codegen::isa::TargetIsa>::legal_encodings
 1.40%  cranelift_codegen::regalloc::live_value_tracker::LiveValueTracker::ebb_top
 1.35%  cranelift_codegen::ir::instructions::InstructionData::arguments
 1.33%  _int_malloc
 1.32%  cranelift_codegen::regalloc::liverange::GenLiveRange<PO>::livein_local_end
 1.29%  cranelift_codegen::ir::dfg::DataFlowGraph::ctrl_typevar
 1.21%  cranelift_bforest::path::Path<F>::prev
 1.18%  cranelift_codegen::regalloc::coalescing::DomForest::push_node
 1.10%  <cranelift_entity::map::SecondaryMap<K,V> as core::ops::index::IndexMut<K>>::index_mut
 1.08%  cranelift_frontend::ssa::SSABuilder::run_state_machine
 1.05%  cranelift_codegen::regalloc::diversion::RegDiversions::reg
 0.97%  _int_free

Cranelift tip as of this morning, rebased minimal-allocator code (new branch: minimal_allocator), with numactl --cpunodebind=1, --enable-release build, Rust 1.37, rest of toolchain updated yesterday.

Cranelift with minimal allocator:

--no-threads    2057.667724609375
--cpu-count=1   2062.95849609375
--cpu-count=2   1404.244140625
--cpu-count=3   1394.11572265625
--cpu-count=4   1661.387939453125
--cpu-count=5   1814.15185546875
--cpu-count=6   1965.01953125
--cpu-count=7   2118.407470703125
--cpu-count=8   2310.4306640625

Cranelift with coloring allocator:

--no-threads  2542.047119140625
--cpu-count=1 2549.7900390625
--cpu-count=2 1513.36865234375
--cpu-count=3 1244.17578125
--cpu-count=4 1276.791748046875
--cpu-count=5 1432.982666015625
--cpu-count=6 1652.73828125
--cpu-count=7 1828.1484375
--cpu-count=8 1982.426513671875

Ion:

--no-threads 972.300537109375
--cpu-count=1 971.326904296875
--cpu-count=2 576.32861328125
--cpu-count=3 432.686279296875
--cpu-count=4 350.447998046875
--cpu-count=5 339.8291015625
--cpu-count=6 335.18994140625
--cpu-count=7 334.6337890625
--cpu-count=8 330.3974609375

My CL results are similar to Benjamin's: at more than three cores (= 5 threads), Cranelift starts slowing down, while Ion keeps winning all the way. And the minimal allocator slows down more quickly than the coloring allocator, suggesting some kind of contention related to instruction count.

(Perf results to follow.)

Cranelift, coloring allocator, --cpu-count=8, numactl as above.

  30.42%  JS Helper  libpthread-2.29.so  [.] __pthread_mutex_lock
   5.39%  JS Helper  libpthread-2.29.so  [.] __pthread_mutex_unlock_usercnt
   4.54%  JS Helper  [kernel.kallsyms]   [k] do_syscall_64
   3.17%  JS Helper  js                  [.] cranelift_codegen::regalloc::context::Context::graph_coloring
   2.74%  JS Helper  js                  [.] cranelift_codegen::context::Context::compile
   2.72%  JS Helper  [kernel.kallsyms]   [k] try_to_wake_up
   1.93%  JS Helper  js                  [.] cranelift_bforest::path::Path<F>::find
   1.87%  JS Helper  [kernel.kallsyms]   [k] entry_SYSCALL_64
   1.70%  JS Helper  [kernel.kallsyms]   [k] syscall_return_via_sysret
   1.59%  JS Helper  [kernel.kallsyms]   [k] psi_task_change
   1.21%  JS Helper  js                  [.] arena_t::MallocSmall
   1.18%  JS Helper  js                  [.] arena_dalloc
   1.18%  JS Helper  js                  [.] cranelift_codegen::regalloc::live_value_tracker::LiveValueTracker::proc
   1.11%  JS Helper  libpthread-2.29.so  [.] __lll_lock_wait
   1.07%  JS Helper  [kernel.kallsyms]   [k] queued_spin_lock_slowpath
   1.07%  JS Helper  js                  [.] cranelift_codegen::regalloc::coloring::Context::visit_inst
   1.02%  JS Helper  js                  [.] cranelift_codegen::regalloc::liverange::GenLiveRange<PO>::extend_in_ebb

Ditto, minimal allocator:

  34.64%  JS Helper  libpthread-2.29.so  [.] __pthread_mutex_lock
   5.82%  JS Helper  libpthread-2.29.so  [.] __pthread_mutex_unlock_usercnt
   5.26%  JS Helper  [kernel.kallsyms]   [k] do_syscall_64
   3.46%  JS Helper  [kernel.kallsyms]   [k] try_to_wake_up
   2.86%  JS Helper  js                  [.] cranelift_codegen::context::Context::compile
   2.35%  JS Helper  [kernel.kallsyms]   [k] entry_SYSCALL_64
   2.03%  JS Helper  [kernel.kallsyms]   [k] syscall_return_via_sysret
   1.88%  JS Helper  [kernel.kallsyms]   [k] psi_task_change
   1.66%  JS Helper  js                  [.] cranelift_codegen::redundant_reload_remover::RedundantReloadRemover::do
   1.37%  JS Helper  libpthread-2.29.so  [.] __lll_lock_wait
   1.35%  JS Helper  [kernel.kallsyms]   [k] queued_spin_lock_slowpath
   1.26%  JS Helper  js                  [.] arena_dalloc
   1.24%  JS Helper  js                  [.] arena_t::MallocSmall
   1.09%  JS Helper  js                  [.] cranelift_codegen::isa::constraints::RecipeConstraints::satisfied
   1.08%  JS Helper  [kernel.kallsyms]   [k] get_futex_value_locked
   1.06%  JS Helper  [kernel.kallsyms]   [k] futex_wait_setup
Summary: Cranelift: Parallel compilation slower than serial compilation → Cranelift: jemalloc makes parallel compilation slower than serial compilation

With --disable-jemalloc, at --cpu-count=8, the time drops from 1982ms to 524ms. perf looks more reasonable:

   9.44%  JS Helper  js                  [.] cranelift_codegen::regalloc::context::Context::graph_coloring
   8.04%  JS Helper  js                  [.] cranelift_codegen::context::Context::compile
   5.84%  JS Helper  js                  [.] cranelift_bforest::path::Path<F>::find
   3.79%  JS Helper  js                  [.] cranelift_codegen::regalloc::live_value_tracker::LiveValueTracker::proc
   3.48%  JS Helper  js                  [.] cranelift_codegen::regalloc::coloring::Context::visit_inst
   3.14%  JS Helper  js                  [.] cranelift_codegen::regalloc::liverange::GenLiveRange<PO>::extend_in_ebb
   2.34%  JS Helper  js                  [.] cranelift_codegen::redundant_reload_remover::RedundantReloadRemover::do
   2.21%  JS Helper  js                  [.] cranelift_codegen::isa::constraints::RecipeConstraints::satisfied
   1.96%  JS Helper  js                  [.] <cranelift_codegen::isa::enc_tables::Encodings as core::iter::traits::i
   1.67%  JS Helper  js                  [.] cranelift_codegen::regalloc::liverange::GenLiveRange<PO>::livein_local_
   1.64%  JS Helper  js                  [.] <cranelift_codegen::isa::x86::Isa as cranelift_codegen::isa::TargetIsa>
   1.64%  JS Helper  js                  [.] cranelift_bforest::path::Path<F>::prev
   1.38%  JS Helper  js                  [.] cranelift_entity::list::EntityList<T>::push
   1.36%  JS Helper  js                  [.] <cranelift_entity::map::SecondaryMap<K,V> as core::ops::index::IndexMut
   1.26%  JS Helper  js                  [.] cranelift_codegen::regalloc::diversion::RegDiversions::reg
   1.25%  JS Helper  libc-2.29.so        [.] _int_malloc
   1.24%  JS Helper  js                  [.] cranelift_codegen::regalloc::coalescing::DomForest::push_node
   1.18%  JS Helper  js                  [.] cranelift_codegen::ir::instructions::InstructionData::arguments
   1.09%  JS Helper  js                  [.] cranelift_codegen::ir::dfg::DataFlowGraph::ctrl_typevar
   1.07%  JS Helper  js                  [.] cranelift_codegen::regalloc::live_value_tracker::LiveValueVec::push
   1.06%  JS Helper  libc-2.29.so        [.] _int_free

Mike, any suggestions for what we can do here (short of going after all the allocation sites)? These results are so bad it's easy to suspect that we've made a config mistake somewhere.

Flags: needinfo?(mh+mozilla)

This is definitely a jemalloc-vs-glibc-malloc effect. Testing two optimised builds that differ only in whether --disable-jemalloc is set, I get the following numbers. #thr is set with --thread-count=<N>.

        glibc
#thr    malloc   jemalloc
   1      3250       3354
   2      1635       1866
   3      1100       1422
   4       868       1405
   5       847       1486
   6       858       1757

Am investigating.

For a single iteration of the test case's outer loop, I see both builds call free about 186000 times.

For the jemalloc build, that results in around 467000 global bus locks (lock prefixed insns, more or less), which come almost entirely from 186000 calls to pthread_mutex_lock and the same number to pthread_mutex_unlock.

For the glibc build, that results in only 60000 global bus locks, and no calls to pthread_mutex_anything.

This suggests to me that glibc malloc has some kind of thread-local-pool magic going on. It wouldn't surprise me. Mike, does jemalloc have anything similar? I remember some kind of parallel-performance trickery that got hacked up during the Stylo development, but I can't remember any of the details, and I'm sure jemalloc has changed since then.
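
To make the suspected failure mode concrete: the pattern the profiles point at is many threads doing lots of small, short-lived allocations, which serializes on a shared arena lock unless the allocator has per-thread caches or arenas. Below is a stand-alone sketch of that pattern (illustrative only, not taken from this bug) that can be run against different allocators to compare how they scale with thread count.

use std::thread;
use std::time::Instant;

// Each thread does many small, short-lived allocations, roughly the shape of
// Cranelift's per-function churn. With a single contended arena lock, wall
// time grows with thread count; with per-thread caches/arenas it should stay
// roughly flat.
fn churn(iters: usize) -> u64 {
    let mut acc = 0u64;
    for i in 0..iters {
        let v: Vec<u64> = vec![i as u64; 8 + (i & 63)]; // small block
        acc = acc.wrapping_add(v[0]);
    } // block freed here, every iteration
    acc
}

fn main() {
    for threads in [1usize, 2, 4, 8] {
        let start = Instant::now();
        let handles: Vec<_> = (0..threads)
            .map(|_| thread::spawn(|| churn(1_000_000)))
            .collect();
        let total: u64 = handles.into_iter().map(|h| h.join().unwrap()).sum();
        println!("{} threads: {:?} (checksum {})", threads, start.elapsed(), total);
    }
}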

In any case, Cranelift's allocation rate is sky-high compared to Ion compiling the same wasm: per measurements from 14 Aug, CL allocates 5.5x as many bytes, in 14.9x as many blocks. That is almost certainly making the situation much worse, so improving it would be an obvious first step.
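
The kind of change that helps here is unglamorous: hoist per-function and per-instruction temporaries into the long-lived compilation context and reuse their capacity, or keep small fixed-size cases off the heap entirely. The sketch below shows the generic buffer-reuse pattern; it is not the actual Cranelift code, and the names are made up.

// Hypothetical compilation context owning reusable scratch storage, so the
// per-function hot path performs no malloc/free once capacities have warmed up.
struct CompileCtx {
    worklist: Vec<u32>,    // made-up fields, illustrative only
    live_values: Vec<u32>,
}

impl CompileCtx {
    fn new() -> Self {
        CompileCtx { worklist: Vec::new(), live_values: Vec::new() }
    }

    fn compile_function(&mut self, insts: &[u32]) -> usize {
        // clear() keeps the existing allocation; no round trip to the allocator.
        self.worklist.clear();
        self.live_values.clear();
        self.worklist.extend_from_slice(insts);
        while let Some(inst) = self.worklist.pop() {
            if inst % 2 == 0 {
                self.live_values.push(inst); // stand-in for real analysis work
            }
        }
        self.live_values.len()
    }
}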

mozjemalloc is not great in allocation-heavy, thread-heavy situations. It's always better to avoid allocating at all ;). Anyway, a short-term solution is to enable per-thread arenas for the threads Cranelift uses. But make sure those threads are not recycled, because otherwise the per-thread arenas would be leaked.

Flags: needinfo?(mh+mozilla)
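
To illustrate what the per-thread-arena suggestion above buys, conceptually: if each compilation thread satisfies its hot-path allocations from state that only it touches, the shared lock disappears from the profile. The real fix would bind each helper thread to its own jemalloc arena when the thread is set up; the toy Rust sketch below (made-up names, not the mozjemalloc API) just shows the thread-local property, including why a recycled thread is a problem.

use std::cell::RefCell;

// Toy per-thread buffer pool: the fast path touches only thread-local state,
// which is the property a per-thread jemalloc arena provides for real.
thread_local! {
    static POOL: RefCell<Vec<Vec<u8>>> = RefCell::new(Vec::new());
}

fn get_buffer(len: usize) -> Vec<u8> {
    POOL.with(|p| p.borrow_mut().pop())
        .map(|mut b| { b.clear(); b.reserve(len); b })
        .unwrap_or_else(|| Vec::with_capacity(len))
}

fn put_buffer(buf: Vec<u8>) {
    // The pool lives as long as its thread; if the thread is kept around
    // (recycled) for unrelated work, the pooled memory sticks around too,
    // the analogue of the leaked per-thread arenas mentioned above.
    POOL.with(|p| p.borrow_mut().push(buf));
}

fn main() {
    let mut b = get_buffer(4096);
    b.extend_from_slice(b"scratch data");
    put_buffer(b); // returned to this thread's pool for reuse
}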

valgrind --tool=callgrind --collect-bus=yes shows a lot of global bus events
resulting from acquiring/releasing the mozjemalloc lock. The following
patches reduce the number of blocks allocated, and hence the contention for
said lock:

https://github.com/CraneStation/cranelift/pull/998
https://github.com/CraneStation/cranelift/pull/999
https://github.com/CraneStation/cranelift/pull/1000
https://bugzilla.mozilla.org/show_bug.cgi?id=1579661
https://github.com/CraneStation/cranelift/pull/1010

For wasm_lua_binarytrees.js, this reduces the number of blocks allocated from
283,756 to 154,816, that is, to around 55% of the original number.
Instruction count is reduced by about 1.2%.

Here are some results. Summary: for four threads, CL on glibc malloc gets a parallel speedup of 3.73, where speedup is single-thread time divided by N-thread time (so, for jemalloc before the patches, 3354/1405 ≈ 2.39 at four threads). With jemalloc, that speedup is 2.39 before this series of patches and 3.08 with them applied. So, roughly, we recover half of the parallel performance lost relative to glibc malloc.

In more detail:

These compare the original performance (comment 13) against a version patched
as per comment 16, and also including
https://github.com/CraneStation/cranelift/pull/1025 but with the regressing
https://github.com/CraneStation/cranelift/pull/671 backed out.

The columns to compare are the jemalloc parallel speedup for Before/After.

Note that the numbers for 5 and 6 threads are largely meaningless, because this processor has only 4 cores, so 5 and 6 threads entail the uncertain slowdowns that come with hyperthreaded execution.

Before

        glibc malloc            jemalloc
#thr    time(ms) speedup    time(ms) speedup
   1       3250    1.00       3354   1.00
   2       1635    1.99       1866   1.80
   3       1100    2.95       1422   2.36
   4        868    3.74       1405   2.39
   5        847    3.83       1486   2.26
   6        858    3.78       1757   1.90


After

        glibc malloc            jemalloc
#thr    time(ms) speedup    time(ms) speedup
   1       3129    1.00       3233   1.00
   2       1591    1.96       1727   1.87
   3       1062    2.94       1242   2.60
   4        838    3.73       1050   3.08
   5        827    3.78       1072   3.01
   6        812    3.85       1133   2.85

Bug 1586791 is directly relevant here. I imagine that fixing it will further
mitigate the parallel and serial jemalloc overheads.

Assignee: nobody → jseward
Priority: P2 → P3
Severity: normal → S3