Optimization: Use SIMD for inlined bulk-memory operations
Categories
(Core :: JavaScript: WebAssembly, enhancement, P3)
Tracking

| | Tracking | Status |
|---|---|---|
| firefox93 | --- | fixed |
People
(Reporter: rhunt, Assigned: yury)
Attachments
(2 files)
This shouldn't block shipping bulk-memory.
Bug 1594204 added a basic OOL path for memory.copy/memory.fill that uses up to 4-byte loads/stores on 32-bit systems and up to 8-byte loads/stores on 64-bit systems. We can do better if we utilize the SIMD register bank.
I originally avoided this for expediency. By only using GPRs we could re-use our existing wasmLoad/wasmStore infrastructure for bounds checking and trap handling. SIMD will require more work.
I had a version of a patch to do this by adding a Scalar::Int128 and extending wasmLoad/Store to utilize it. It was a bit hacky and didn't fit at all with the baseline compiler (which uses ValueType for the stack). It's not clear that we should care about this optimization in the baseline compiler, so maybe that's fine.
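For illustration, here is a minimal C++ sketch of the idea (not SpiderMonkey code; the names and the intrinsic-based formulation are assumptions): the inlined path moves 16 bytes at a time through the SIMD register bank and falls back to narrower moves for the tail, assuming bounds checks and trap handling have already happened.

```cpp
#include <emmintrin.h>  // SSE2 intrinsics, standing in for the emitted moves
#include <cstddef>
#include <cstdint>

// Hypothetical sketch of an inlined memory.copy with a SIMD fast path.
// Assumes `len` is a small compile-time-known constant (the inlined case)
// and that bounds checks/trap handling happened before this point.
static void inlineCopy(uint8_t* dst, const uint8_t* src, size_t len) {
  size_t i = 0;
  // 16-byte chunks through the SIMD register bank.
  for (; i + 16 <= len; i += 16) {
    __m128i v = _mm_loadu_si128(reinterpret_cast<const __m128i*>(src + i));
    _mm_storeu_si128(reinterpret_cast<__m128i*>(dst + i), v);
  }
  // Tail: byte-at-a-time here for brevity; the existing OOL path uses
  // 8/4/2/1-byte GPR moves instead.
  for (; i < len; i++) {
    dst[i] = src[i];
  }
}
```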
Comment 1 • 3 years ago (Assignee)
Comment 2 • 3 years ago (Assignee)
> I had a version of a patch to do this by adding a Scalar::Int128 and extending wasmLoad/Store to utilize it. It was a bit hacky and didn't fit at all with the baseline compiler (which uses ValueType for the stack).
Since we now have Scalar::Simd128, we could just partially enable the ENABLE_WASM_SIMD fragments, but that becomes too complicated and adds overhead just for copy/fill when ENABLE_WASM_SIMD is off. The patch therefore enables inlining with SIMD ops only when ENABLE_WASM_SIMD is on.
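As a rough sketch of that dispatch (illustrative names, not the actual patch), the lowering picks the widest access size for the inlined loop, only going through the SIMD register bank when SIMD support is available:

```cpp
#include <cstddef>

// Hypothetical helper: widest access size for the inlined copy/fill loop.
enum class AccessSize : size_t { I8 = 1, I32 = 4, I64 = 8, V128 = 16 };

static AccessSize widestAccess(bool simdEnabled, bool is64Bit) {
  // The real patch only emits V128 accesses when ENABLE_WASM_SIMD is
  // compiled in; that build-time condition is modeled as a runtime
  // flag here.
  if (simdEnabled) {
    return AccessSize::V128;  // 16-byte moves via the SIMD register bank
  }
  // GPR-only fallback: 8-byte moves on 64-bit targets, 4-byte on 32-bit.
  return is64Bit ? AccessSize::I64 : AccessSize::I32;
}
```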
I did some microbenchmarking on x64 (attaching test):
| | before | after | speedup |
|---|---|---|---|
| ion aligned | 1921 | 1315 | 46.06% |
| ion unaligned | 2086 | 1475 | 41.42% |
| baseline aligned | 2078 | 1504 | 38.16% |
| baseline unaligned | 2158 | 1634 | 32.06% |
Comment 3 • 3 years ago (Assignee)
Comment 5 • 3 years ago (Reporter)
Yury, IIUC your test is checking size=128/129 byte copies and fills. Would it be possible to test smaller sizes, like 16, 17, 32, 33 bytes? Just curious whether that changes the picture. I'd also be curious whether we could test this on ARM64 to ensure we get reasonable codegen there too.
Comment 6 • 3 years ago (Assignee)
(In reply to Ryan Hunt [:rhunt] from comment #5)
> Yury, IIUC your test is checking size=128/129 byte copies and fills. Would it be possible to test smaller sizes, like 16, 17, 32, 33 bytes? Just curious whether that changes the picture. I'd also be curious whether we could test this on ARM64 to ensure we get reasonable codegen there too.
The benchmark checked lengths from 1 to 65, I think. The 128/129 is the destination offset. ARM64 testing is pending.
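A standalone C++ analogue of that setup might look like the following (the actual attached test is a script run in the JS shell; the iteration count, buffer layout, and source offset here are made up for illustration):

```cpp
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <cstring>

int main() {
  static uint8_t heap[64 * 1024] = {};

  // Destination offsets 128 (16-byte aligned) and 129 (unaligned),
  // copy lengths 1..65, as described in comments 5 and 6.
  for (size_t dstOff : {128, 129}) {
    auto start = std::chrono::steady_clock::now();
    for (int iter = 0; iter < 200000; iter++) {
      for (size_t len = 1; len <= 65; len++) {
        memcpy(heap + dstOff, heap + 4096, len);  // stand-in for memory.copy
      }
    }
    auto elapsed = std::chrono::steady_clock::now() - start;
    // Read the destination so the copies are not optimized away.
    volatile uint8_t sink = heap[dstOff];
    (void)sink;
    printf("dstOff=%zu: %lld ms\n", dstOff,
           static_cast<long long>(
               std::chrono::duration_cast<std::chrono::milliseconds>(elapsed)
                   .count()));
  }
  return 0;
}
```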
Comment 7 • 3 years ago

Pushed by ydelendik@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/e8b9b45734fa
Use SIMD in inlined memory.{fill,copy} ops. r=rhunt
Comment 8 • 3 years ago (bugherder)