[Cranelift] BatchCompiler::new called too often?
Categories
(Core :: JavaScript: WebAssembly, enhancement, P3)
Tracking

| Release | Tracking | Status |
|---|---|---|
| firefox71 | --- | fixed |
People
(Reporter: jseward, Assigned: jseward)
Details
Attachments
(3 files)
Whilst trying to track down why Cranelift has high heap turnover (not
residency) compared to Ion, I have come to wonder if we are creating and
destroying the basic compiler structure, BatchCompiler, too often.
For a run compiling wasm_lua_binarytrees.js, CL processes around 680 wasm
functions. However, even when running SpiderMonkey with --no-threads,
BatchCompiler::new() is called 177 times. This strikes me as .. somehow
suboptimal. I would have expected that one BatchCompiler is allocated per
compilation thread -- that is, in this case, just once.
Assignee
Comment 1•5 years ago
Following hints from Benjamin and Lars, I played around with
JitOptions.wasmBatchIonThreshold in ModuleGenerator::compileFuncDef
for the Tier::Optimized case. Results are:
| Threshold value | Bytes alloc'd (MB) | Blocks alloc'd | Insn count (millions) |
|---|---|---|---|
| 1100 (default) | 88.8 | 153,523 | 1543.4 |
| 2200 | 78.9 | 130,553 | 1531.3 |
| 3300 | 72.6 | 120,320 | 1526.2 |
| 4400 | 69.1 | 115,070 | 1523.4 |
| 5500 | 66.3 | 111,427 | 1520.9 |
| 11000 | 57.1 | 102,777 | 1514.7 |
| infinity | 33.5 | 93,438 | 1501.1 |
This is a striking result and suggests there's a lot of room for improvement.
An immediate change is to set a higher threshold for the CL case, up to the
point where available parallelism and/or latency is compromised.
This is with glibc malloc. I would expect the insn count win to be
significantly bigger with mozjemalloc, since that poisons freed memory.
The numbers imply that the compiler's cached state is associated with the
batch unit, not with the compilation thread. Is that true? If so, fixing
that would seem valuable.
Might increasing the threshold for Ion also reduce Ion's allocation rate?
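To make the batching mechanism concrete, here is a minimal sketch of threshold-driven batching (names and the accumulation rule are illustrative assumptions, not SpiderMonkey's actual code): functions accumulate into the current batch until their combined bytecode size crosses the threshold, and each flushed batch corresponds to one BatchCompiler allocation, so a higher threshold means fewer allocations.

```rust
// Hypothetical model of threshold-driven batching. `count_batches` is an
// invented helper, not SpiderMonkey's API; it only illustrates why raising
// the threshold reduces the number of BatchCompiler::new() calls.
fn count_batches(func_sizes: &[usize], threshold: usize) -> usize {
    let mut batches = 0;
    let mut current = 0;
    for &size in func_sizes {
        current += size;
        // Flush the batch once the accumulated bytecode crosses the threshold.
        if current >= threshold {
            batches += 1;
            current = 0;
        }
    }
    // A final partial batch still needs a compiler.
    if current > 0 {
        batches += 1;
    }
    batches
}

fn main() {
    // 680 functions of ~300 bytecode bytes each (made-up sizes): a higher
    // threshold yields far fewer batches, hence fewer compiler allocations.
    let sizes = vec![300usize; 680];
    println!("threshold 1100 -> {} batches", count_batches(&sizes, 1100));
    println!("threshold 5000 -> {} batches", count_batches(&sizes, 5000));
}
```

With these made-up uniform sizes the model gives on the order of 170 batches at threshold 1100, which is in the same ballpark as the 177 observed calls.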
Comment 2•5 years ago
(In reply to Julian Seward [:jseward] from comment #1)
> Might increasing the threshold for Ion also reduce Ion's allocation rate?
Nice work. I believe Benjamin initially set the thresholds based on some benchmarking. As with the values we use to pre-size the assembler output buffers, I would sure love for us to gather the data that might be used to tune these parameters on users' computers and ship the data home by telemetry.

For the assembler output sizes, it's simple static measurements, not hard. Here it's a little trickier -- it might amount to running experiments locally by perturbing the thresholds, a kind of hill climbing maybe, and then shipping home some data about e.g. the CPU and how well it fares. Or maybe we don't need to ship the data home; we just need to adjust limits dynamically on the local computer. (And in some sense the same is true for the assembler output, although there the local variation is much smaller and may depend mostly on CPU feature set.)
Assignee
Comment 3•5 years ago
Related to bug 1578352.
Comment 4•5 years ago
> I would sure love for us to gather the data that might be used to tune these parameters on users' computer
I would like this very much too. Ideally, running local benchmarking to adjust miscellaneous parameters (think: GC params too) just after a major (as in semver's major) update would be quite nice.
> I believe Benjamin initially set the thresholds based on some benchmarking.
Indeed, I did this for Ion, but not for Cranelift.
Thanks for doing this investigation. What we want here is to raise the batch size until it degrades parallel compilation performance, probably for 4 threads since this is the most common case.
Another thing discussed on IRC would be to have a pool of BatchCompiler owned by the module compilation thread; then this coordinating thread would dispatch a BatchCompiler in the context of the task, and the worker could clear it and use it then. Once module compilation is done, the coordinating thread could deallocate all the BatchCompilers.
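The pooling idea above can be sketched as follows. This is a minimal sketch under stated assumptions: BatchCompiler's fields and the pool API are invented for illustration and are not Cranelift's actual interfaces. The coordinating thread owns the pool; a worker checks a compiler out, clears its per-batch state, uses it, and returns it, so buffer allocations are amortized across the whole module compilation.

```rust
use std::sync::Mutex;

// Stand-in for Cranelift's BatchCompiler (fields invented for illustration).
struct BatchCompiler {
    scratch: Vec<u8>, // represents the compiler's reusable working buffers
}

impl BatchCompiler {
    fn new() -> Self {
        BatchCompiler { scratch: Vec::new() }
    }
    // Reset per-batch state but keep the allocated capacity for reuse.
    fn clear(&mut self) {
        self.scratch.clear();
    }
}

// Hypothetical pool owned by the coordinating (module compilation) thread.
struct CompilerPool {
    free: Mutex<Vec<BatchCompiler>>,
}

impl CompilerPool {
    fn new() -> Self {
        CompilerPool { free: Mutex::new(Vec::new()) }
    }
    // Reuse a pooled compiler if one is available; allocate only on a miss.
    fn take(&self) -> BatchCompiler {
        let mut c = self
            .free
            .lock()
            .unwrap()
            .pop()
            .unwrap_or_else(BatchCompiler::new);
        c.clear();
        c
    }
    // Return the compiler so the next batch can reuse its buffers.
    fn put(&self, c: BatchCompiler) {
        self.free.lock().unwrap().push(c);
    }
}

fn main() {
    let pool = CompilerPool::new();
    let mut c = pool.take();
    c.scratch.extend_from_slice(b"batch 1 output");
    pool.put(c);
    // The second take reuses the same allocation instead of calling new(),
    // and clear() has emptied the per-batch state.
    let c2 = pool.take();
    assert!(c2.scratch.is_empty());
}
```

Once module compilation is done, dropping the pool on the coordinating thread deallocates all the BatchCompilers at once, matching the lifetime described above.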
Assignee
Comment 5•5 years ago
Assignee
Comment 6•5 years ago
What do people think of this patch, ignoring the actual numbers involved?
(To arrive at a reasonable threshold, I'll do more measurements, per bbouvier's
suggestion).
Assignee
Updated•5 years ago
Comment 7•5 years ago
Comment on attachment 9099636 [details] [diff] [review]
in-principle patch for contemplation

Review of attachment 9099636 [details] [diff] [review]:

Yes, that's the idea. Thanks for looking into this!
Comment 8•5 years ago
Ditto.
Assignee
Comment 9•5 years ago
This shows performance numbers, for different values of
wasmBatchCraneliftThreshold, for --thread-count=1 .. 8, and for glibc
malloc vs mozjemalloc.
Based on those I would suggest changing the value from 1100 as at present to
5000. When using mozjemalloc (I assume this will be how we're configured when
running "for real"), this has the following effects:
- 1-core (single thread) performance is improved by about 3%
- 4 core performance is improved by about 12%
- parallel speedup (4 vs 1 core) improves from 2.99 x to 3.32 x
Total heap allocation falls from 88.8MB in 153.5k blocks to 68.2MB in 113.3k
blocks.
Minor further gains in run time are available by increasing the threshold to
10000 and above. However, that potentially worsens latency considerably.
Most end-user systems are slower than this machine, and, in a realistic
scenario (running the whole browser), fewer than 4 cores are likely to be
available for compiling wasm. So I am inclined towards the lower threshold
(5000) to minimise the excess-latency risk.
Does that sound reasonable? If so, the patch doesn't need to be changed \o/.
Comment 10•5 years ago
According to your table headings, mozjemalloc is always faster than glibc malloc. Are the headings reversed?
Assignee
Comment 11•5 years ago
(In reply to Lars T Hansen [:lth] from comment #10)
> Are the headings reversed?
Er, yes, duh. Mozjemalloc is always slower than glibc malloc.
Assignee
Updated•5 years ago
Assignee
Updated•5 years ago
Assignee
Comment 12•5 years ago
Whilst trying to track down why Cranelift has high heap turnover (not
residency) compared to Ion, I have come to wonder if we are creating and
destroying the basic compiler structure, BatchCompiler, too often.
This patch reduces the overall Cranelift heap overhead by increasing the
batching threshold from 1100 wasm stream bytes to 5000. This is a compromise
between reducing overhead and potential loss of parallelism. The effect is,
for wasm_lua_binarytrees: a 1.3% reduction in instruction count, 26.5% fewer
blocks allocated and 22.1% fewer bytes allocated.
Comment 13•5 years ago
Pushed by jseward@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/374f3ab2e461
[Cranelift] BatchCompiler::new called too often? r=bbouvier.
Comment 14•5 years ago
bugherder