Closed Bug 1625891 Opened 4 years ago Closed 2 years ago

Baldr: Bounds check elimination for indirect calls

Categories

(Core :: JavaScript: WebAssembly, enhancement, P3)

enhancement

Tracking

()

RESOLVED WONTFIX

People

(Reporter: lth, Unassigned)

References

(Blocks 1 open bug)

Details

Attachments

(1 file)

Consider:

  (module
    (type $t0 (func (result i32)))
    (func $f1 (result i32) (i32.const 0))
    (func $f2 (result i32) (i32.const 1))
    (table $tbl funcref (elem $f1 $f2))
    (func $f3 (param i32) (result i32)
      (call_indirect (type $t0) $tbl (local.get 0))
      drop
      (call_indirect (type $t0) $tbl (local.get 0))))`));

The ion code generated for the two indirect calls shows that we bounds check the index against the table length in both calls, yet the index has not changed and tables cannot shrink, so the second test must pass if the first one did. To do better, the bounds check may have to be open-coded so that Ion can see it. At the moment, there is ad-hoc BCE in LIRGenerator::visitWasmCall where we remove the bounds check if the index is constant and below the table minimum, but we want something better.

Blocks: 1340235

It would be useful to have an idea about the impact of this optimization before trying to do the work. One thing that's easy to measure is the static ratio of variable-index accesses to constant-index accesses; slightly harder but much more interesting would be the dynamic ratio. It will be hard to generalize this: source language, compilation strategy, optimization level, and the program itself all play a part. But it doesn't hurt to look at code that people care about, such as autocad, zoom, pspdfkit, and probably some rust codebase, tbd. It doesn't have to be a web-only codebase, either.

The evidence is fairly strong that the non-constant-index indirect calls are the rule (profiling code to appear):

web.autocad.com, the roller assembly sample + a single rectangular selection before quitting:

Bounds check eliminated: 0
Bounds checked, constant index: 0
Bounds checked, variable index: 5473719

pspdfkit.com, just click around in the standalone demo for a little while:

Bounds check eliminated: 0
Bounds checked, constant index: 0
Bounds checked, variable index: 7346718

ruffle.rs (flash emulator), running a demo:

Bounds check eliminated: 0
Bounds checked, constant index: 0
Bounds checked, variable index: 6816046

At the same time, 7M calls over, say, 10s is nothing, so these applications will not benefit much from an optimization unless those calls are very heavily clustered.

Raybench, from bug 1340106, might benefit more, though rendering time here was 8.3s so this is just 45M indirect calls per second:

Bounds check eliminated: 0
Bounds checked, constant index: 0
Bounds checked, variable index: 369114997

The benchmark on bug 1340235 is now a 404, unfortunately.

Run this with --disable-e10s in the browser, since the profiling data
are captured only in the main process.

Depends on: 1742930
No longer blocks: 1340235
Blocks: 1742930
No longer depends on: 1742930

Another run on oskarstalberg.com/Townscaper gives broadly consistent results: while during warmup many bounds checks can be eliminated, as the game progresses all calls are to variable indices.

Over in bug 1340235, I noticed that the table has known length - initial and max are the same. We should find out if this is common, because if it is then the table bound can be baked into the table bounds check, it need not be loaded from tls. (Edit: I'm doing this particular change as part of bug 1743586.)

Edit: According to bug 1743586 comment 5, known table length (min==max) is the common case, and code has landed to bake a known limit into the bounds check.

Depends on: 1745939

I think the easiest test on whether it pays off to optimize table bounds checks more is to turn them off completely and run some benchmarks and some content, if we can find content where we will be able to observe the difference.

Obvious cases:

If they don't show improvements with bounds checks eliminated then nothing will and we should be done. If they show huge improvements then we should consider further work. We should check this on multiple microarchitectures.

The fib benchmark performs 10 indirect calls in a row in the base case of the recursion, so disabling bounds checks should help quite a bit. It also has the most to gain from BCE, since all calls use the same known table index. BCE would eliminate the last nine checks in this innermost block.

The raybench benchmarks traverses a scene graph and calls a virtual intersect method on each object. Most of these do a fair amount of computation. The index is usually a constant vtable index off a variable index representing the start of the object's vtable, the latter index is loaded from the object (and emscripten then masks it, at least in the code for this benchmark). There should be some benefit from disabling bounds checking, but BCE would probably have a hard time removing the checks in practice.

Results, showing running time reduction for removing the bounds check altoghether:

fib arm64/Apple-M1: 6%
fib x64/Intel-i7: 20% (yeah, actually)
raybench arm64/Apple-M1: 5%
raybench x64/intel-i7: 2% (if we're charitable)

The advantage of removing bounds checks might be reduced if we rewrote our code generator so that we would jump to an out-of-line trap and fall through to success code, as conditional forward branches are statically predicted as not taken (for sure on Intel but I believe even on the M1).

I don't think there's much to be gained from fixing this in practice.

Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → WONTFIX
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: