Closed Bug 1761439 Opened 2 years ago Closed 2 years ago

[exploration] Faster mem64 bounds checking

Categories

(Core :: JavaScript: WebAssembly, task, P3)

Tracking

()

RESOLVED INACTIVE

People

(Reporter: lth, Assigned: lth)

References

(Depends on 1 open bug, Blocks 1 open bug)

Details

Attachments

(1 file)

1.70 KB, application/x-javascript
Attached file strchr.js

Here's a simple microbenchmark (basically, a naive strchr that stays within a reasonable L2 cache). It runs itself with mem32 and mem64 and prints results.

Memory32 is about 17% faster than Memory64 on my Xeon system. Disabling spectre mitigations for the Memory64 run makes very little difference - about 1%.

The memory accesses are in a loop, but Ion correctly hoists the bounds check limit load out of the loop (verified in disassembly), so the cost here really seems to come from the compare and branch inside the loop. There could be some microarchitectural slowdown in this particular case from having two conditional branches on the same cache line where before we had one: the bounds check is close to the top-of-loop interrupt check. But it's hard to say for sure.
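To make the difference in loop shape concrete, here is a minimal Python model of the two variants. It models control flow only and says nothing about timing. It assumes (this is not stated above) that the mem32 path carries no explicit check and relies on the access itself faulting, while the mem64 path compares the index against a loop-invariant limit before every load. The names `Trap`, `strchr_mem32` and `strchr_mem64` are made up for this sketch and are not taken from strchr.js.

```python
class Trap(Exception):
    """Stands in for a wasm out-of-bounds trap (hypothetical name)."""


def strchr_mem64(memory: bytes, limit: int, start: int, needle: int) -> int:
    """Naive strchr with an explicit bounds check before each load.

    Returns the index of `needle`, or -1 if a NUL byte is seen first.
    `start` is assumed to be non-negative.
    """
    index = start
    while True:
        # Explicit compare-and-branch inside the loop. `limit` is
        # loop-invariant, like the hoisted bounds check limit load.
        if index >= limit:
            raise Trap(f"out-of-bounds access at {index}")
        byte = memory[index]
        if byte == needle:
            return index
        if byte == 0:
            return -1
        index += 1


def strchr_mem32(memory: bytes, start: int, needle: int) -> int:
    """Same loop without an explicit check (assumed mem32 shape).

    The out-of-bounds case is caught by the access itself; here
    IndexError plays the role of the hardware fault.
    `start` is assumed to be non-negative.
    """
    index = start
    try:
        while True:
            byte = memory[index]  # no compare-and-branch before the load
            if byte == needle:
                return index
            if byte == 0:
                return -1
            index += 1
    except IndexError:
        raise Trap(f"out-of-bounds access at {index}") from None
```

Both functions return the same results and trap in the same cases; the only difference is the extra conditional branch per iteration in the mem64 version, which is the cost under discussion.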

The benchmark and hardware are fine for initial exploration but for serious analysis we need more code and more consumer-oriented hardware: non-Xeon Intel, and some ARM64 thing.

Two solutions that have been proposed are moving traps out-of-line (for the interrupt and bounds check) so that static prediction works better, and making the branch across the trap a short branch (for better code density).

Other bounds checking strategies are also possible, but require more elaborate changes.

Priority: -- → P3
Severity: -- → N/A

I'll build up a table of results here. Times are in ms. The Xeon is an E5-2637 running Fedora. The i7 is a 2018-era MacBook Pro and the M1 is a first-model M1 MacBook Pro (2021), both running macOS 12.3.

Program      strchr

Xeon i32      924
Xeon i64     1100
Xeon i32/i64 0.84

i7 i32       1274
i7 i64       1886
i7 i32/i64   0.68

M1 i32        822
M1 i64       1230
M1 i32/i64   0.67

For strchr on the Xeon, see comment 0.

For strchr on the i7 and the M1 the slowdown is greater still, suggesting that it's definitely worthwhile to do something about the bounds checking overhead. Again, disabling spectre mitigations doesn't really move the needle much at all (on the M1, literally not at all).
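For reference, the i32/i64 rows are the i32 time divided by the i64 time. A small sketch that recomputes them from the strchr times above and also expresses the gap as extra time spent by the mem64 run (the helper name `ratio_and_slowdown` is made up for this example):

```python
# Times in ms, copied from the strchr table above.
times_ms = {
    "Xeon": {"i32": 924, "i64": 1100},
    "i7": {"i32": 1274, "i64": 1886},
    "M1": {"i32": 822, "i64": 1230},
}


def ratio_and_slowdown(i32_ms: float, i64_ms: float) -> tuple[float, int]:
    """Return (i32/i64 ratio, percent extra time taken by the i64 run)."""
    ratio = round(i32_ms / i64_ms, 2)
    slowdown_pct = round((i64_ms / i32_ms - 1) * 100)
    return ratio, slowdown_pct


summary = {
    machine: ratio_and_slowdown(t["i32"], t["i64"])
    for machine, t in times_ms.items()
}

for machine, (ratio, slowdown_pct) in summary.items():
    print(f"{machine}: i32/i64 = {ratio:.2f}, i64 takes {slowdown_pct}% longer")
```

On these numbers the mem64 run takes roughly 19% longer on the Xeon, 48% longer on the i7 and 50% longer on the M1.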

Here's another table. It's the same engine, but the generated code now branches out-of-line to the trap and falls through to the test-passed case. Here I've turned off spectre mitigations (because they don't fit this pattern):

Program      strchr

Xeon i32      863
Xeon i64      988
Xeon i32/i64 0.87
New/old i32  0.93
New/old i64  0.90

The ratio between i32 and i64 is about the same (assuming no spectre mitigation), but we see a 7% speedup on the i32 case just from getting the interrupt trap out-of-line, and a 10% speedup on the i64 case from getting both the interrupt trap and the bounds check trap out-of-line.
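The New/old rows can be checked the same way. This sketch assumes the "old" times are the Xeon numbers from the first table (924 and 1100 ms), which is what the reported ratios are consistent with:

```python
# Xeon strchr times in ms: first table ("old") and out-of-line-trap table ("new").
old_ms = {"i32": 924, "i64": 1100}
new_ms = {"i32": 863, "i64": 988}

# New time as a fraction of the old time, per memory type.
new_over_old = {k: round(new_ms[k] / old_ms[k], 2) for k in old_ms}

# Percent of running time saved by moving the traps out-of-line.
time_saved_pct = {k: round((1 - new_ms[k] / old_ms[k]) * 100) for k in old_ms}

# i32/i64 ratio with the new code generation.
new_i32_over_i64 = round(new_ms["i32"] / new_ms["i64"], 2)

print(new_over_old, time_saved_pct, new_i32_over_i64)
```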

Depends on: 1680243

Branching out-of-line to the trap is a big win if spectre mitigation is disabled, and a patch is currently pending on bug 1680243. Further work here should assume those fixes.

Depends on: 1707955

We need to be able to turn off spectre mitigations for bounds checking to be affordable.

I've looked into this for some time and I think it's going to be hard to beat the compare-and-branch in general. Most other schemes require a bunch of ALU operations and either have extremely large unmapped reservations or many loads or a branch. I will close this and transfer the blockers to the parent bug.

Status: ASSIGNED → RESOLVED
Closed: 2 years ago
Resolution: --- → INACTIVE