Closed Bug 1625891 Opened 5 years ago Closed 4 years ago

Baldr: Bounds check elimination for indirect calls

Tracking

()

Status:

RESOLVED WONTFIX

People

(Reporter: lth, Unassigned)

References

(Blocks 1 open bug)

Details

Attachments

(1 file)

WIP: Bug 1625891 - Profiling code. 4 years ago Lars T Hansen [:lth] 48 bytes, text/x-phabricator-request		Details \| Review

Lars T Hansen [:lth]

Reporter

Description

•

5 years ago

Consider:

  (module
    (type $t0 (func (result i32)))
    (func $f1 (result i32) (i32.const 0))
    (func $f2 (result i32) (i32.const 1))
    (table $tbl funcref (elem $f1 $f2))
    (func $f3 (param i32) (result i32)
      (call_indirect (type $t0) $tbl (local.get 0))
      drop
      (call_indirect (type $t0) $tbl (local.get 0))))`));

The ion code generated for the two indirect calls shows that we bounds check the index against the table length in both calls, yet the index has not changed and tables cannot shrink, so the second test must pass if the first one did. To do better, the bounds check may have to be open-coded so that Ion can see it. At the moment, there is ad-hoc BCE in LIRGenerator::visitWasmCall where we remove the bounds check if the index is constant and below the table minimum, but we want something better.

Lars T Hansen [:lth]

Reporter

Updated

•

4 years ago

Blocks: 1340235

Lars T Hansen [:lth]

Reporter

Comment 1

•

4 years ago

•

Edited

It would be useful to have an idea about the impact of this optimization before trying to do the work. One thing that's easy to measure is the static ratio of variable-index accesses to constant-index accesses; slightly harder but much more interesting would be the dynamic ratio. It will be hard to generalize this: source language, compilation strategy, optimization level, and the program itself all play a part. But it doesn't hurt to look at code that people care about, such as autocad, zoom, pspdfkit, and probably some rust codebase, tbd. It doesn't have to be a web-only codebase, either.

Lars T Hansen [:lth]

Reporter

Comment 2

•

4 years ago

The evidence is fairly strong that the non-constant-index indirect calls are the rule (profiling code to appear):

web.autocad.com, the roller assembly sample + a single rectangular selection before quitting:

Bounds check eliminated: 0
Bounds checked, constant index: 0
Bounds checked, variable index: 5473719

pspdfkit.com, just click around in the standalone demo for a little while:

Bounds check eliminated: 0
Bounds checked, constant index: 0
Bounds checked, variable index: 7346718

ruffle.rs (flash emulator), running a demo:

Bounds check eliminated: 0
Bounds checked, constant index: 0
Bounds checked, variable index: 6816046

At the same time, 7M calls over, say, 10s is nothing, so these applications will not benefit much from an optimization unless those calls are very heavily clustered.

Raybench, from bug 1340106, might benefit more, though rendering time here was 8.3s so this is just 45M indirect calls per second:

Bounds check eliminated: 0
Bounds checked, constant index: 0
Bounds checked, variable index: 369114997

The benchmark on bug 1340235 is now a 404, unfortunately.

Lars T Hansen [:lth]

Reporter

Comment 3

•

4 years ago

Attached file WIP: Bug 1625891 - Profiling code. — Details

Run this with --disable-e10s in the browser, since the profiling data
are captured only in the main process.

Lars T Hansen [:lth]

Reporter

Updated

•

4 years ago

Depends on: 1742930

Lars T Hansen [:lth]

Reporter

Updated

•

4 years ago

No longer blocks: 1340235

Lars T Hansen [:lth]

Reporter

Updated

•

4 years ago

Blocks: 1742930

No longer depends on: 1742930

Lars T Hansen [:lth]

Reporter

Comment 4

•

4 years ago

Another run on oskarstalberg.com/Townscaper gives broadly consistent results: while during warmup many bounds checks can be eliminated, as the game progresses all calls are to variable indices.

Lars T Hansen [:lth]

Reporter

Comment 5

•

4 years ago

•

Edited

Over in bug 1340235, I noticed that the table has known length - initial and max are the same. We should find out if this is common, because if it is then the table bound can be baked into the table bounds check, it need not be loaded from tls. (Edit: I'm doing this particular change as part of bug 1743586.)

Edit: According to bug 1743586 comment 5, known table length (min==max) is the common case, and code has landed to bake a known limit into the bounds check.

Lars T Hansen [:lth]

Reporter

Updated

•

4 years ago

Depends on: 1745939

Lars T Hansen [:lth]

Reporter

Comment 6

•

4 years ago

I think the easiest test on whether it pays off to optimize table bounds checks more is to turn them off completely and run some benchmarks and some content, if we can find content where we will be able to observe the difference.

Obvious cases:

the fib benchmark from bug 1340235
the raybench benchmark from bug 1340106

If they don't show improvements with bounds checks eliminated then nothing will and we should be done. If they show huge improvements then we should consider further work. We should check this on multiple microarchitectures.

Lars T Hansen [:lth]

Reporter

Comment 7

•

4 years ago

•

Edited

The fib benchmark performs 10 indirect calls in a row in the base case of the recursion, so disabling bounds checks should help quite a bit. It also has the most to gain from BCE, since all calls use the same known table index. BCE would eliminate the last nine checks in this innermost block.

The raybench benchmarks traverses a scene graph and calls a virtual intersect method on each object. Most of these do a fair amount of computation. The index is usually a constant vtable index off a variable index representing the start of the object's vtable, the latter index is loaded from the object (and emscripten then masks it, at least in the code for this benchmark). There should be some benefit from disabling bounds checking, but BCE would probably have a hard time removing the checks in practice.

Results, showing running time reduction for removing the bounds check altoghether:

fib arm64/Apple-M1: 6%
fib x64/Intel-i7: 20% (yeah, actually)
raybench arm64/Apple-M1: 5%
raybench x64/intel-i7: 2% (if we're charitable)

The advantage of removing bounds checks might be reduced if we rewrote our code generator so that we would jump to an out-of-line trap and fall through to success code, as conditional forward branches are statically predicted as not taken (for sure on Intel but I believe even on the M1).

Lars T Hansen [:lth]

Reporter

Comment 8

•

4 years ago

I don't think there's much to be gained from fixing this in practice.

Status: NEW → RESOLVED

Closed: 4 years ago

Resolution: --- → WONTFIX

Ben Visness [:bvisness]

Updated

•

1 month ago

Bugzilla

Baldr: Bounds check elimination for indirect calls

Categories

(Core :: JavaScript: WebAssembly, enhancement, P3)

Tracking

()

People

(Reporter: lth, Unassigned)

References

(Blocks 1 open bug)

Details

Crash Data

Security

(public)

User Story

Attachments

(1 file)

Description

Updated

Comment 1

Comment 2

Comment 3

Updated

Updated

Updated

Comment 4

Comment 5

Updated

Comment 6

Comment 7

Comment 8

Updated

Attachment

General

Description

File Name

Content Type