Closed Bug 1586791 Opened Last month Closed 28 days ago

[Cranelift] BatchCompiler::new called too often?

Categories

(Core :: Javascript: WebAssembly, enhancement, P3)

enhancement

Tracking

()

RESOLVED FIXED
mozilla71
Tracking Status
firefox71 --- fixed

People

(Reporter: jseward, Assigned: jseward)

Details

Attachments

(3 files)

Whilst trying to track down why Cranelift has high heap turnover (not
residency) compared to Ion, I have come to wonder if we are creating and
destroying the basic compiler structure, BatchCompiler, too often.

For a run compiling wasm_lua_binarytrees.js, CL processes around 680 wasm
functions. However, even when running SpiderMonkey with --no-threads,
BatchCompiler::new() is called 177 times. This strikes me as .. somehow
suboptimal. I would have expected that one BatchCompiler is allocated per
compilation thread -- that is, in this case, just once.

Following hints from Benjamin and Lars, I played around with
JitOptions.wasmBatchIonThreshold in ModuleGenerator::compileFuncDef for
the Tier::Optimized case. Results are:

     Thresh  Bytes allocd    Blocks     Insn Count
     Value   (MBytes)        Allocd    (Millions)

      1100       88.8       153,523      1543.4  (DEFAULT SETTING)
      2200       78.9       130,553      1531.3
      3300       72.6       120,320      1526.2
      4400       69.1       115,070      1523.4
      5500       66.3       111,427      1520.9
     11000       57.1       102,777      1514.7
  infinity       33.5        93,438      1501.1

This is a striking result and suggests there's a lot of room for improvement.
An immediate change is to set a higher threshold for the CL case, up to the
point where available parallelism and/or latency is compromised.

This is with glibc malloc. I would expect the insn count win to be
significantly bigger with mozjemalloc, since that poisons freed memory.

The numbers imply that the compiler's cached state is associated with the
batch unit, not with the compilation thread. Is that true? If so, fixing
that would seem valuable.

Might increasing the threshold for Ion also reduce Ion's allocation rate?

(In reply to Julian Seward [:jseward] from comment #1)

Might increasing the threshold for Ion also reduce Ion's allocation rate?

Nice work. I believe Benjamin initially set the thresholds based on some benchmarking. In the same way as for the values we use to pre-size the assembler output buffers, I would sure love for us to gather the data that might be used to tune these parameters on users' computer and ship the data home by telemetry. For the assembler output sizes, it's simple static measurements, not hard. Here, it's a little trickier -- it might amount to running experiments locally by perturbing the thresholds, a kind of hill climbing maybe, and then shipping home some data about eg the CPU and how well it fares. Or maybe we don't need to ship the data home, we just need to adjust limits dynamically on the local computer. (And in some sense the same is true for the assembler output, although there local variation is much smaller and may depend mostly on CPU feature set.)

Related-to bug 1578352.

I would sure love for us to gather the data that might be used to tune these parameters on users' computer

I would like this very much too. Ideally, running local benchmarking to adjust misc parameters (think: GC params too) just after a major (as in semver's major) update would quite nice.

I believe Benjamin initially set the thresholds based on some benchmarking.

Indeed, I did this for Ion, but not for Cranelift.

Thanks for doing this investigation. What we want here is to raise the batch size until it degrades parallel compilation performance, probably for 4 threads since this is the most common case.

Another thing discussed on IRC would be to have a pool of BatchCompiler owned by the module compilation thread; then this coordinating thread would dispatch a BatchCompiler in the context of the task, and the worker could clear it and use it then. Once module compilation is done, the coordinating thread could deallocate all the BatchCompilers.

What do people think of this patch, ignoring the actual numbers involved?
(To arrive at a reasonable threshold, I'll do more measurements, per bbouvier's
suggestion).

Attachment #9099636 - Flags: feedback?(bbouvier)
Comment on attachment 9099636 [details] [diff] [review]
in-principle patch for contemplation

Review of attachment 9099636 [details] [diff] [review]:
-----------------------------------------------------------------

Yes, that's the idea. Thanks for looking into this!
Attachment #9099636 - Flags: feedback?(bbouvier) → feedback+

Ditto.

This shows performance numbers, for different values of
wasmBatchCraneliftThreshold, for --thread-count=1 .. 8, and for glibc
malloc vs mozjemalloc.

Based on those I would suggest changing the value from 1100 as at present to
5000. When using mozjemalloc (I assume this will be how we're configured when
running "for real"), this has the following effects:

  • 1-core (single thread) performance is improved by about 3%
  • 4 core performance is improved by about 12%
  • parallel speedup (4 vs 1 core) improves from 2.99 x to 3.32 x

Total heap allocation falls from 88.8MB in 153.5k blocks to 68.2MB in 113.3k
blocks.

Minor further gains in run time are available by increasing the threshold to
10000 and above. However, that potentially worsens latency considerably.
Most end-user systems are slower than this machine, and, in a realistic
scenario (running the whole browser), fewer than 4 cores are likely to be
available for compiling wasm. So I am inclined towards the lower threshold
(5000) to minimise the excess-latency risk.

Does that sound reasonable? If so, the patch doesn't need to be changed \o/.

According to your table headings, mozjemalloc is always faster than glibc malloc. Are the headings reversed?

(In reply to Lars T Hansen [:lth] from comment #10)

Are the headings reversed?

Er, yes, duh. Mozjemalloc is always slower than glibc malloc.

Priority: -- → P3
Assignee: nobody → jseward

Whilst trying to track down why Cranelift has high heap turnover (not
residency) compared to Ion, I have come to wonder if we are creating and
destroying the basic compiler structure, BatchCompiler, too often.

This patch reduces the overall Cranelift heap overhead by increasing the
batching threshold from 1100 wasm stream bytes to 5000. This is a compromise
between reducing overhead and potential loss of parallelism. The effect is,
for wasm_lua_binarytrees: a 1.3% reduction in instruction count, 26.5% fewer
blocks allocated and 22.1% fewer bytes allocated.

Pushed by jseward@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/374f3ab2e461
[Cranelift] BatchCompiler::new called too often?  r=bbouvier.
Status: NEW → RESOLVED
Closed: 28 days ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla71
You need to log in before you can comment on or make changes to this bug.