SIMD optimization x64/x86: Improved constant generation
Categories
(Core :: JavaScript: WebAssembly, enhancement, P3)
Tracking
()
People
(Reporter: lth, Unassigned)
References
(Blocks 1 open bug)
Details
At the moment, the code that loads SIMD constants into a register is pretty limited in its strategies: if the constant is zero or ~0, a single instruction is emitted to synthesize that constant (pxor r, r or pcmpeqw r, r); otherwise it is loaded RIP-relative (x64) or from a patchable address (x86).
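To make the current behavior concrete, here is a minimal sketch of the three-way choice described above. The names ConstStrategy and chooseStrategy are hypothetical and are not taken from the actual code; the sketch only models the decision, not the instruction emission.

```cpp
#include <array>
#include <cstdint>

// Hypothetical names, for illustration only.
enum class ConstStrategy { Zero, AllOnes, LoadFromMemory };

// Classify a 128-bit constant the way the description above says the
// current code does: zero and ~0 are synthesized with one instruction,
// everything else is loaded from memory.
ConstStrategy chooseStrategy(const std::array<uint8_t, 16>& bytes) {
    bool allZero = true;
    bool allOnes = true;
    for (uint8_t b : bytes) {
        allZero = allZero && (b == 0x00);
        allOnes = allOnes && (b == 0xFF);
    }
    if (allZero) return ConstStrategy::Zero;     // pxor r, r
    if (allOnes) return ConstStrategy::AllOnes;  // pcmpeqw r, r
    return ConstStrategy::LoadFromMemory;        // RIP-relative / patchable load
}
```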
Some previous (and quick-and-dirty) measurements I did showed that the break-even point for loading vs synthesizing on x64 is at roughly two instructions. https://github.com/WebAssembly/simd/issues/369#issuecomment-710242680 has a mechanism for doing a more sophisticated analysis using a uop simulator.
It looks like that comment also points to some LLVM code generation strategies that we may need to consider. Both before and after LLVM gained v128.const support, it might generate code that is not simply "v128.const", and we want to do well on whatever LLVM emits.
Anyhow, many more constants could probably be synthesized using short instruction sequences and might be more competitive with a more sophisticated analysis.
For example, splat4(0x7fffffff) would be pcmpeqw r, r; psrld r, 1; this is somewhat likely to beat the RIP-relative load. There's a simple pattern analysis underlying this: any lane value matching ^0+1+$ in binary can be derived from the same sequence, just with a different shift count.
Also, constants synthesized for specific ops should be aware of whether the op wants integer or fp values. This is especially true for zero (XORPS vs PXOR to clear the register).
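The pattern analysis for the ^0+1+$ case could look roughly like the sketch below. The function name lowOnesShift32 is hypothetical, and the sketch assumes the caller has already established that all four 32-bit lanes hold the same value. It returns the PSRLD shift count to apply after PCMPEQW, or -1 if the lane does not fit the pattern.

```cpp
#include <cstdint>

// Hypothetical helper: if `lane` has the binary shape 0...01...1 (at least
// one leading 0 and at least one trailing 1), return the right-shift count
// that turns an all-ones lane into `lane`. Otherwise return -1.
int lowOnesShift32(uint32_t lane) {
    // Zero and all-ones are the existing single-instruction cases.
    if (lane == 0 || lane == UINT32_MAX) return -1;
    // A run of low ones plus one is a power of two, so the AND is zero.
    if ((lane & (lane + 1)) != 0) return -1;
    // Count leading zero bits; this is the shift count.
    int shift = 0;
    while ((lane >> (31 - shift)) == 0) ++shift;
    return shift;
}
```

With this helper, 0x7fffffff yields a shift of 1, matching the pcmpeqw r, r; psrld r, 1 sequence above. A mirrored check for ^1+0+$ values would select a left shift instead.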
Comment 1•4 years ago
Some older notes read:
On the Xeon, PCMPEQW + PSLLW to load 0xFF00^8 is probably a little bit faster on x64 than a constant load that hits the cache, but it's a very slight edge, maybe 0.3% on this benchmark. That could definitely be noise, or be mistaken for noise, so it is probably not worthwhile.
This suggests that on x64, we should use single-instruction generation if we can but not otherwise.
The numbers might be different on different hardware or on x86.
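As a sanity check on the PCMPEQW + PSLLW sequence mentioned in these notes, here is a scalar model of the two instructions operating on eight 16-bit lanes. It assumes a shift count of 8, which is not stated in the notes but is the count that produces 0xFF00 in each lane. The names Lanes16, pcmpeqwSelf and psllw are made up for this sketch.

```cpp
#include <array>
#include <cstdint>

using Lanes16 = std::array<uint16_t, 8>;

// Scalar model of PCMPEQW r, r: a register compared with itself is equal
// in every lane, so every lane becomes all ones.
Lanes16 pcmpeqwSelf() {
    Lanes16 r;
    r.fill(0xFFFF);
    return r;
}

// Scalar model of PSLLW r, imm: shift each 16-bit lane left by `count`;
// counts above 15 clear the lane.
Lanes16 psllw(Lanes16 r, unsigned count) {
    for (uint16_t& lane : r) {
        lane = count > 15 ? 0 : static_cast<uint16_t>(lane << count);
    }
    return r;
}
```

Calling psllw(pcmpeqwSelf(), 8) gives 0xFF00 in all eight lanes, i.e. the 0xFF00^8 constant.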
Comment 2•4 years ago
Some further notes on a Google bug (https://crbug.com/v8/11033) point to using splats in some cases (that bug has further pointers to broader discussion):
- 4 32-bit lanes are identical
- first pair and second pair of lanes are identical
I'm inclined to think that these optimizations pertain mostly to platforms where v128.const is fairly expensive and we may or may not find them useful.
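The two splat cases listed above can be detected with a few lane comparisons. The sketch below uses hypothetical names (SplatKind, classifySplat) and reads "first pair and second pair of lanes are identical" as "the low 64 bits equal the high 64 bits"; that reading is an assumption.

```cpp
#include <array>
#include <cstdint>

// Hypothetical classification of a 128-bit constant given as four 32-bit lanes.
enum class SplatKind { None, Splat32, Splat64 };

SplatKind classifySplat(const std::array<uint32_t, 4>& lanes) {
    // All four 32-bit lanes identical: a 32-bit splat suffices.
    if (lanes[0] == lanes[1] && lanes[1] == lanes[2] && lanes[2] == lanes[3]) {
        return SplatKind::Splat32;
    }
    // Lanes (0,1) equal lanes (2,3): the 64-bit halves match.
    if (lanes[0] == lanes[2] && lanes[1] == lanes[3]) {
        return SplatKind::Splat64;
    }
    return SplatKind::None;
}
```

The 32-bit check comes first because every 32-bit splat also satisfies the 64-bit condition.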