Open Bug 1671873 Opened 5 years ago Updated 4 years ago

SIMD optimization x64/x86: Improved constant generation

Status:

NEW

People

(Reporter: lth, Unassigned)

References

(Blocks 1 open bug)

Details

Lars T Hansen [:lth]

Reporter

Description

•

5 years ago

•

Edited

At the moment, the code that loads SIMD constants into a register is pretty limited in its strategies: if the constant is zero or ~0, a single instruction is emitted to synthesize that constant (pxor r, r or pcmpeqw r, r); otherwise it is loaded RIP-relative (x64) / from a patchable address (x86).
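
The current behaviour described above can be modelled roughly as follows. This is only an illustrative sketch in Python, not the actual code generator: the function name `current_strategy`, the `arch` strings and the returned descriptions are all hypothetical.

```python
def current_strategy(const_bytes: bytes, arch: str = "x64") -> str:
    """Describe how a 128-bit constant would be materialized today.

    Hypothetical model: only all-zeros and all-ones are synthesized,
    everything else is loaded from memory.
    """
    if len(const_bytes) != 16:
        raise ValueError("expected a 128-bit (16-byte) constant")
    if const_bytes == bytes(16):
        # XOR of a register with itself clears it.
        return "pxor r, r"
    if const_bytes == b"\xff" * 16:
        # Comparing a register with itself for equality sets every bit.
        return "pcmpeqw r, r"
    if arch == "x64":
        return "load RIP-relative"
    return "load from patchable address"
```

Any constant other than the two special cases falls through to the load, which is the limitation this bug is about.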

Some previous (and quick-and-dirty) measurements I did showed that the break-even point for loading vs synthesizing on x64 is at roughly two instructions. https://github.com/WebAssembly/simd/issues/369#issuecomment-710242680 has a mechanism for doing a more sophisticated analysis using a uop simulator.

That comment also appears to point to some LLVM code generation strategies that we need to consider. Both before and after LLVM gained support for v128.const, it might generate code that is not simply "v128.const", and we want to do well on whatever LLVM emits.

Anyhow, many more constants could probably be synthesized using short instruction sequences, and a more sophisticated analysis might show those sequences to be competitive with the load.

For example, splat4(0x7fffffff) would be pcmpeqw r, r; psrld r, 1; this is somewhat likely to beat the RIP-relative load. There is a simple pattern analysis underlying this: any lane value whose bits match ^0+1+$ can be derived from the same sequence, just with a different shift count.
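
The pattern analysis could look something like the sketch below, which checks whether a 32-bit lane value is a run of zeros followed by a run of ones and, if so, derives the shift count. The function name is hypothetical, and the all-ones case (no leading zeros) is treated here as the degenerate form with no shift, which is an assumption beyond the ^0+1+$ pattern itself.

```python
from typing import List, Optional


def synthesize_low_ones_splat32(lane: int) -> Optional[List[str]]:
    """Return an instruction sequence for splat4(lane), or None.

    Succeeds only when the 32-bit lane is zero bits followed by one bits.
    """
    if not 0 < lane <= 0xFFFFFFFF:
        return None
    # A value of the form 0...01...1 becomes a single power of two when
    # incremented, so it shares no set bits with its successor.
    if lane & (lane + 1) != 0:
        return None
    shift = 32 - lane.bit_length()  # number of leading zero bits
    seq = ["pcmpeqw r, r"]          # all ones in every lane
    if shift:
        seq.append(f"psrld r, {shift}")  # logical right shift per 32-bit lane
    return seq
```

For 0x7fffffff this yields the two-instruction sequence from the example; values that do not fit the pattern return None and would still be loaded.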

Also, constants synthesized for specific ops should be aware of whether the op wants integer or floating-point values; this is especially true for zero (XORPS vs PXOR to clear the register).
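
As a small illustration of the domain point, a selector for the zeroing idiom might look like this. The function name and the "int"/"float" tags are hypothetical; a real implementation would presumably take the domain from the consuming operation.

```python
def zeroing_idiom(domain: str) -> str:
    """Pick the register-clearing instruction matching the consumer's domain."""
    idioms = {
        "int": "pxor r, r",     # consumer is an integer SIMD op
        "float": "xorps r, r",  # consumer is a floating-point SIMD op
    }
    try:
        return idioms[domain]
    except KeyError:
        raise ValueError(f"unknown domain: {domain!r}") from None
```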

Lars T Hansen [:lth]

Reporter

Updated

•

5 years ago

Depends on: 1671907

Lars T Hansen [:lth]

Reporter

Updated

•

5 years ago

Severity: -- → N/A

Priority: -- → P3

Ryan Hunt [:rhunt]

Updated

•

4 years ago

Summary: Improved x86 constant generation → Improved x86 SIMD constant generation

Lars T Hansen [:lth]

Reporter

Updated

•

4 years ago

Blocks: 1690460
No longer blocks: wasm-simd

Lars T Hansen [:lth]

Reporter

Updated

•

4 years ago

Comment 1

•

4 years ago

Some older notes read:

On the Xeon, PCMPEQW + PSLLW to load 0xFF00^8 (0xFF00 in each of the eight 16-bit lanes) is probably a little bit faster on x64 than a constant load that hits the cache, but it's a very slight edge, maybe 0.3% on this benchmark. That could definitely be noise, or be mistaken for noise, so it is probably not worthwhile.

This suggests that on x64, we should use single-instruction generation if we can but not otherwise.

The numbers might be different on different hardware or on x86.
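
To make the two-instruction sequence from these notes concrete, here is a minimal lane-level simulation in Python, together with the policy the measurement suggests. The helper names and the `max_instructions` parameter are hypothetical, and the default threshold of one instruction reflects only the x64 observation above.

```python
from typing import List


def pcmpeqw_self() -> List[int]:
    """PCMPEQW r, r: every 16-bit lane equals itself, so all lanes become 0xFFFF."""
    return [0xFFFF] * 8


def psllw(lanes: List[int], count: int) -> List[int]:
    """PSLLW: shift each 16-bit lane left by count, shifting in zeros.

    Counts greater than 15 clear the lanes.
    """
    if count > 15:
        return [0] * len(lanes)
    return [(lane << count) & 0xFFFF for lane in lanes]


def should_synthesize(num_instructions: int, max_instructions: int = 1) -> bool:
    """Synthesize only when the sequence fits within the instruction budget."""
    return num_instructions <= max_instructions
```

With these helpers, `psllw(pcmpeqw_self(), 8)` produces 0xFF00 in all eight lanes, and `should_synthesize(2)` is false under the default budget, so that constant would still be loaded. The budget is a parameter because the break-even point may differ on other hardware or on x86.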

Lars T Hansen [:lth]

Reporter

Comment 2

•

4 years ago

Some further notes on a Google bug (https://crbug.com/v8/11033), which also has pointers to broader discussion, point to using splats in some cases:

4 32-bit lanes are identical
first pair and second pair of lanes are identical
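
Detecting the two cases listed above could be sketched as follows. The function name and return tags are hypothetical, and reading "first pair and second pair of lanes are identical" as "the low 64 bits equal the high 64 bits" is an interpretation, not something the notes spell out.

```python
import struct


def classify_splat(const_bytes: bytes) -> str:
    """Classify a 128-bit constant as a 32-bit splat, a 64-bit splat, or neither."""
    if len(const_bytes) != 16:
        raise ValueError("expected a 128-bit (16-byte) constant")
    # Interpret the constant as four little-endian 32-bit lanes.
    lanes = struct.unpack("<4I", const_bytes)
    if lanes[0] == lanes[1] == lanes[2] == lanes[3]:
        return "splat32"
    # Assumed reading: lanes (0, 1) as a pair equal lanes (2, 3) as a pair.
    if lanes[0:2] == lanes[2:4]:
        return "splat64"
    return "none"
```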

I'm inclined to think that these optimizations pertain mostly to platforms where v128.const is fairly expensive and we may or may not find them useful.

Lars T Hansen [:lth]

Reporter

Updated

•

4 years ago

Summary: Improved x86 SIMD constant generation → SIMD optimization x64/x86: Improved constant generation


Categories

(Core :: JavaScript: WebAssembly, enhancement, P3)
