Closed Bug 1586791 Opened 5 years ago Closed 5 years ago

[Cranelift] BatchCompiler::new called too often?

Tracking

()

Status:

RESOLVED FIXED

Milestone:

mozilla71

Tracking Flags:

Tracking

Status

firefox71

---

fixed

People

(Reporter: jseward, Assigned: jseward)

Details

Attachments

(3 files)

in-principle patch for contemplation 5 years ago Julian Seward [:jseward] 3.27 KB, patch	bbouvier : feedback+	Details \| Diff \| Splinter Review
perf numbers (160 total) 5 years ago Julian Seward [:jseward] 2.58 KB, text/plain		Details
Bug 1586791 - [Cranelift] BatchCompiler::new called too often? r=bbouvier. 5 years ago Julian Seward [:jseward] 47 bytes, text/x-phabricator-request		Details \| Review

Julian Seward [:jseward]

Assignee

Description

•

5 years ago

Whilst trying to track down why Cranelift has high heap turnover (not
residency) compared to Ion, I have come to wonder if we are creating and
destroying the basic compiler structure, BatchCompiler, too often.

For a run compiling wasm_lua_binarytrees.js, CL processes around 680 wasm
functions. However, even when running SpiderMonkey with --no-threads,
BatchCompiler::new() is called 177 times. This strikes me as .. somehow
suboptimal. I would have expected that one BatchCompiler is allocated per
compilation thread -- that is, in this case, just once.

Julian Seward [:jseward]

Assignee

Comment 1

•

5 years ago

Following hints from Benjamin and Lars, I played around with
JitOptions.wasmBatchIonThreshold in ModuleGenerator::compileFuncDef for
the Tier::Optimized case. Results are:

     Thresh  Bytes allocd    Blocks     Insn Count
     Value   (MBytes)        Allocd    (Millions)

      1100       88.8       153,523      1543.4  (DEFAULT SETTING)
      2200       78.9       130,553      1531.3
      3300       72.6       120,320      1526.2
      4400       69.1       115,070      1523.4
      5500       66.3       111,427      1520.9
     11000       57.1       102,777      1514.7
  infinity       33.5        93,438      1501.1

This is a striking result and suggests there's a lot of room for improvement.
An immediate change is to set a higher threshold for the CL case, up to the
point where available parallelism and/or latency is compromised.

This is with glibc malloc. I would expect the insn count win to be
significantly bigger with mozjemalloc, since that poisons freed memory.

The numbers imply that the compiler's cached state is associated with the
batch unit, not with the compilation thread. Is that true? If so, fixing
that would seem valuable.

Might increasing the threshold for Ion also reduce Ion's allocation rate?

Lars T Hansen [:lth]

Comment 2

•

5 years ago

(In reply to Julian Seward [:jseward] from comment #1)

Might increasing the threshold for Ion also reduce Ion's allocation rate?

Nice work. I believe Benjamin initially set the thresholds based on some benchmarking. In the same way as for the values we use to pre-size the assembler output buffers, I would sure love for us to gather the data that might be used to tune these parameters on users' computer and ship the data home by telemetry. For the assembler output sizes, it's simple static measurements, not hard. Here, it's a little trickier -- it might amount to running experiments locally by perturbing the thresholds, a kind of hill climbing maybe, and then shipping home some data about eg the CPU and how well it fares. Or maybe we don't need to ship the data home, we just need to adjust limits dynamically on the local computer. (And in some sense the same is true for the assembler output, although there local variation is much smaller and may depend mostly on CPU feature set.)

Julian Seward [:jseward]

Assignee

Comment 3

•

5 years ago

Related-to bug 1578352.

Benjamin Bouvier [:bbouvier] (inactive)

Comment 4

•

5 years ago

I would sure love for us to gather the data that might be used to tune these parameters on users' computer

I would like this very much too. Ideally, running local benchmarking to adjust misc parameters (think: GC params too) just after a major (as in semver's major) update would quite nice.

I believe Benjamin initially set the thresholds based on some benchmarking.

Indeed, I did this for Ion, but not for Cranelift.

Thanks for doing this investigation. What we want here is to raise the batch size until it degrades parallel compilation performance, probably for 4 threads since this is the most common case.

Another thing discussed on IRC would be to have a pool of BatchCompiler owned by the module compilation thread; then this coordinating thread would dispatch a BatchCompiler in the context of the task, and the worker could clear it and use it then. Once module compilation is done, the coordinating thread could deallocate all the BatchCompilers.

Julian Seward [:jseward]

Assignee

Comment 5

•

5 years ago

Attached patch in-principle patch for contemplation — Details — Splinter Review

Julian Seward [:jseward]

Assignee

Comment 6

•

5 years ago

What do people think of this patch, ignoring the actual numbers involved?
(To arrive at a reasonable threshold, I'll do more measurements, per bbouvier's
suggestion).

Julian Seward [:jseward]

Assignee

Updated

•

5 years ago

Attachment #9099636 - Flags: feedback?(bbouvier)

Benjamin Bouvier [:bbouvier] (inactive)

Comment 7

•

5 years ago

Comment on attachment 9099636 [details] [diff] [review]
in-principle patch for contemplation

Review of attachment 9099636 [details] [diff] [review]:
-----------------------------------------------------------------

Yes, that's the idea. Thanks for looking into this!

Attachment #9099636 - Flags: feedback?(bbouvier) → feedback+

Lars T Hansen [:lth]

Comment 8

•

5 years ago

Ditto.

Julian Seward [:jseward]

Assignee

Comment 9

•

5 years ago

Attached file perf numbers (160 total) — Details

This shows performance numbers, for different values of
wasmBatchCraneliftThreshold, for --thread-count=1 .. 8, and for glibc
malloc vs mozjemalloc.

Based on those I would suggest changing the value from 1100 as at present to
5000. When using mozjemalloc (I assume this will be how we're configured when
running "for real"), this has the following effects:

1-core (single thread) performance is improved by about 3%
4 core performance is improved by about 12%
parallel speedup (4 vs 1 core) improves from 2.99 x to 3.32 x

Total heap allocation falls from 88.8MB in 153.5k blocks to 68.2MB in 113.3k
blocks.

Minor further gains in run time are available by increasing the threshold to
10000 and above. However, that potentially worsens latency considerably.
Most end-user systems are slower than this machine, and, in a realistic
scenario (running the whole browser), fewer than 4 cores are likely to be
available for compiling wasm. So I am inclined towards the lower threshold
(5000) to minimise the excess-latency risk.

Does that sound reasonable? If so, the patch doesn't need to be changed \o/.

Lars T Hansen [:lth]

Comment 10

•

5 years ago

According to your table headings, mozjemalloc is always faster than glibc malloc. Are the headings reversed?

Julian Seward [:jseward]

Assignee

Comment 11

•

5 years ago

•

Edited

(In reply to Lars T Hansen [:lth] from comment #10)

Are the headings reversed?

Er, yes, duh. Mozjemalloc is always slower than glibc malloc.

Julian Seward [:jseward]

Assignee

Updated

•

5 years ago

Priority: -- → P3

Julian Seward [:jseward]

Assignee

Updated

•

5 years ago

Assignee: nobody → jseward

Julian Seward [:jseward]

Assignee

Comment 12

•

5 years ago

Attached file Bug 1586791 - [Cranelift] BatchCompiler::new called too often? r=bbouvier. — Details

This patch reduces the overall Cranelift heap overhead by increasing the
batching threshold from 1100 wasm stream bytes to 5000. This is a compromise
between reducing overhead and potential loss of parallelism. The effect is,
for wasm_lua_binarytrees: a 1.3% reduction in instruction count, 26.5% fewer
blocks allocated and 22.1% fewer bytes allocated.

Pulsebot

Comment 13

•

5 years ago

Pushed by jseward@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/374f3ab2e461
[Cranelift] BatchCompiler::new called too often?  r=bbouvier.

Stefan Hindli [:stefan_hindli]

Comment 14

•

5 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/374f3ab2e461

Status: NEW → RESOLVED

Closed: 5 years ago

status-firefox71: --- → fixed

Resolution: --- → FIXED

Target Milestone: --- → mozilla71

You need to log in before you can comment on or make changes to this bug.

Bugzilla

Quick Search

[Cranelift] BatchCompiler::new called too often?

Categories

(Core :: JavaScript: WebAssembly, enhancement, P3)

Tracking

()

People

(Reporter: jseward, Assigned: jseward)

References

Details

Crash Data

Security

(public)

User Story

Attachments

(3 files)

Description

Comment 1

Comment 2

Comment 3

Comment 4

Comment 5

Comment 6

Updated

Comment 7

Comment 8

Comment 9

Comment 10

Comment 11

Updated

Updated

Comment 12

Comment 13

Comment 14

Attachment

General

Description

File Name

Content Type