It is clear that `mem.copy` is a poor match for compiler-generated `memcpy`. For this bug I agree we should go with Ryan's byte-copying scheme.

For the future, I'd like to propose `mem.copy_hinted`:

* `mem.copy_hinted` has the same operands as `mem.copy`, but also carries some hints:
  - alignment of the src index (a small int, 1/2/4/8 .. 256, or perhaps the log2 thereof)
  - alignment of the dst index (same)
  - a boolean indicating that the areas definitely do not overlap

* An implementation can choose to ignore the hints and handle it the same as `mem.copy`; this will be safe but slow.

The behaviour of `mem.copy_hinted` is defined as follows:

* If `mem.copy` with the same arguments would trap at runtime, then `mem.copy_hinted` will too. However, the state of both src and dst areas is undefined after this. Areas outside the src and dst areas will be unchanged.

* If `mem.copy` with the same arguments would not trap at runtime, but not all the hints are true, then `mem.copy_hinted` will also not trap, but the resulting state of the src and dst areas is undefined afterwards. Areas outside the src and dst areas will be unchanged.

* If `mem.copy` with the same arguments would not trap at runtime, and all the hints are true, then `mem.copy_hinted` will also not trap, and will produce the same results as `mem.copy`. This implies that areas outside the src and dst areas will be unchanged.

* In all cases, `mem.copy_hinted` gives no guarantees about what happens if multiple threads access the src or dst areas concurrently, beyond any guarantees we might have if the copy had been generated by the front end as a sequence of vanilla wasm loads/stores.

This allows the front end compiler to hand useful information to the back end, without sacrificing safety at the wasm level, and it gives the back end wide latitude in choosing an implementation.

The use of alignment hints allows the implementation to use aligned multiword loads/stores as it sees fit. I believe natural alignment of those accesses is important to get performance close to native memcpy. In particular, misaligned accesses interact poorly with store forwarding (Intel Optimization Manual, Sec 3.6.5, Store Forwarding).

The use of a (non-)overlapping hint facilitates removal of run-time direction checks. It also makes it feasible to use cache-line-preload-and-zero instructions on targets that have them, e.g. POWER "dcbz".

One might ask, why bother at all? Why not just tell front ends to emit memcpy inline as wasm?

* Wasm implementations may decide not to do the copy inline, if code size is an issue.

* Wasm on 32-bit targets typically requires an explicit range check per access. Getting really good code if the front end compiler emitted the copy inline would require the wasm backend to merge the range checks together, which sounds complex and fragile. Using `mem.copy_hinted` avoids that problem.

* `mem.copy_hinted` could be implemented using vector loads/stores (e.g. 128 or 256 bit) that don't exist in wasm itself.
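To make the three behavioural cases above concrete, here is a minimal Python sketch of how they could be modelled. It is illustrative only and not part of the proposal. The names `Trap`, `Outcome`, `would_trap`, `hints_hold`, `classify` and `copy_ignoring_hints` are hypothetical. It assumes that alignments are passed in bytes rather than as log2, that `mem.copy` traps when either range runs past the end of memory, and that an unset overlap boolean makes no claim and so counts as true.

```python
from enum import Enum


class Trap(Exception):
    """Models a wasm trap (hypothetical name for this sketch)."""


class Outcome(Enum):
    TRAP = "traps; src/dst contents undefined"
    UNDEFINED = "no trap; src/dst contents undefined"
    SAME_AS_COPY = "no trap; same result as mem.copy"


def would_trap(mem_size, dst, src, n):
    # Assumption: mem.copy traps when either range runs past the end of memory.
    return src + n > mem_size or dst + n > mem_size


def hints_hold(dst, src, n, src_align, dst_align, no_overlap):
    # Alignments are given in bytes here (1, 2, 4, ... 256), not as log2.
    aligned = src % src_align == 0 and dst % dst_align == 0
    overlaps = n > 0 and src < dst + n and dst < src + n
    # Assumption: no_overlap=False makes no claim, so it cannot be violated.
    return aligned and not (no_overlap and overlaps)


def classify(mem_size, dst, src, n, src_align, dst_align, no_overlap):
    """Return which of the three behavioural cases a call falls into."""
    if would_trap(mem_size, dst, src, n):
        return Outcome.TRAP
    if not hints_hold(dst, src, n, src_align, dst_align, no_overlap):
        return Outcome.UNDEFINED
    return Outcome.SAME_AS_COPY


def copy_ignoring_hints(mem, dst, src, n,
                        src_align=1, dst_align=1, no_overlap=False):
    """Safe-but-slow fallback: ignore the hints and behave like mem.copy."""
    if would_trap(len(mem), dst, src, n):
        raise Trap("out-of-bounds copy")
    # The right-hand slice is a snapshot, so overlapping ranges copy correctly.
    mem[dst:dst + n] = mem[src:src + n]


if __name__ == "__main__":
    # Aligned, disjoint, in bounds: must match mem.copy.
    print(classify(64, 16, 0, 8, 8, 8, True))
    # Front end claimed "no overlap" but the ranges do overlap.
    print(classify(64, 4, 0, 8, 4, 4, True))
```

The fallback is permitted in every case because "undefined" contents of the src and dst areas allow any result, including the one `mem.copy` would produce. A faster implementation would instead branch on the hints, for example skipping the direction check when the overlap boolean is set.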
Bug 1570112 Comment 29 Edit History