Bug 1570112 Comment 29 Edit History

Note: The actual edited comment in the bug view page will always show the original commenter’s name and original timestamp.

Original comment by

Julian Seward [:jseward]

on 2019-09-04 00:53:05 PDT

It is clear that `mem.copy` is a poor match for compiler generated `memcpy`.
For this bug I agree we should go with Ryan's byte-copying scheme.

For the future, I'd like to propose `mem.copy_hinted`:

* `mem.copy_hinted` has the same operands as `mem.copy`, but also carries some hints:
  - alignment of src index (small int, 1/2/4/8 .. 256, or perhaps log2 thereof)
  - alignment of dst index (same)
  - a boolean indicating that the areas definitely do not overlap

* An implementation can choose to ignore the hints and handle it the same as
  `mem.copy`; this will be safe but slow.
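
As a rough illustration of the "log2 thereof" option, here is a minimal sketch of how an alignment hint in the range 1 .. 256 could be stored as a small exponent.  The names `encode_align` and `decode_align` are hypothetical and not part of any proposal; this only shows that nine values (0 .. 8) cover the whole range.

```python
def encode_align(align: int) -> int:
    """Return log2(align) for a power-of-two alignment in 1..256 (hypothetical encoding)."""
    if not (1 <= align <= 256) or align & (align - 1) != 0:
        raise ValueError(f"alignment must be a power of two in 1..256, got {align}")
    return align.bit_length() - 1


def decode_align(log2_align: int) -> int:
    """Inverse of encode_align: turn the stored exponent back into a byte alignment."""
    if not (0 <= log2_align <= 8):
        raise ValueError(f"log2 alignment must be in 0..8, got {log2_align}")
    return 1 << log2_align
```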

The behaviour of `mem.copy_hinted` is defined as follows:

* If `mem.copy` with the same arguments would trap at runtime, then
  `mem.copy_hinted` will too.  However, the state of both src and dst areas is
  undefined after this.  Areas outside the src and dst areas will be
  unchanged.

* If `mem.copy` with the same arguments would not trap at runtime, but not all
  the hints are true, then `mem.copy_hinted` will also not trap, but the
  resulting state of the src and dst areas is undefined afterwards.  Areas
  outside the src and dst areas will be unchanged.

* If `mem.copy` with the same arguments would not trap at runtime, and all the
  hints are true, then `mem.copy_hinted` will also not trap, and will produce
  the same results as `mem.copy`.  This implies that areas outside the src
  and dst areas will be unchanged.

* In all cases, `mem.copy_hinted` gives no guarantees about what happens if
  multiple threads access the src or dst areas concurrently, beyond any
  guarantees we might have if the copy had been generated by the front end as
  a sequence of vanilla wasm loads/stores.
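
The three single-threaded cases above can be sketched as a small classifier.  This is only a model under stated assumptions: it assumes `mem.copy` traps exactly when either area runs past the end of memory, and the names `Outcome` and `classify` are made up for illustration.

```python
from enum import Enum


class Outcome(Enum):
    TRAP_AREAS_UNDEFINED = "traps; src and dst areas undefined"
    NO_TRAP_AREAS_UNDEFINED = "no trap; src and dst areas undefined"
    SAME_AS_MEM_COPY = "no trap; same result as mem.copy"


def classify(mem_size, dst, src, length, src_align=1, dst_align=1, no_overlap=False):
    """Say which of the three cases a given mem.copy_hinted call falls into."""
    # Assumption: mem.copy traps exactly when either area is out of bounds.
    if src + length > mem_size or dst + length > mem_size:
        return Outcome.TRAP_AREAS_UNDEFINED
    # A hint is "true" if the claim it makes holds for these operands.
    aligned = src % src_align == 0 and dst % dst_align == 0
    disjoint = src + length <= dst or dst + length <= src
    if aligned and (disjoint or not no_overlap):
        return Outcome.SAME_AS_MEM_COPY
    return Outcome.NO_TRAP_AREAS_UNDEFINED
```

In every case, bytes outside the src and dst areas are left alone; the classifier only describes what may happen inside them.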

This allows the front end compiler to hand useful information to the back end,
without sacrificing safety at the wasm level, and it gives the back end wide
latitude in choosing an implementation.

The use of alignment hints allows the implementation to use aligned multiword
loads/stores as it sees fit.  I believe natural alignment of them is important
to get performance close to native memcpy.  In particular, misaligned accesses
interact poorly with store forwarding (Intel Optimization Manual, Sec 3.6.5,
Store Forwarding).
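
To make the role of the alignment hints concrete, here is a minimal sketch of how a back end might pick an access width from them.  It models memory as a `bytearray` and assumes the hints are true, the areas are in bounds and they do not overlap; the function names and the `max_width=8` cap are assumptions for illustration only.

```python
def widest_access(src_align: int, dst_align: int, max_width: int = 8) -> int:
    """Largest access width (bytes) that stays naturally aligned on both sides."""
    return min(src_align, dst_align, max_width)


def copy_aligned_chunks(mem: bytearray, dst: int, src: int, length: int,
                        src_align: int, dst_align: int) -> int:
    """Forward copy in naturally aligned chunks, then a byte-wise tail.

    Returns the number of wide (multi-byte) accesses used.
    """
    width = widest_access(src_align, dst_align)
    offset = 0
    wide_accesses = 0
    # Each chunk stands in for one aligned multi-byte load plus store.
    while width > 1 and length - offset >= width:
        mem[dst + offset:dst + offset + width] = mem[src + offset:src + offset + width]
        offset += width
        wide_accesses += 1
    # Whatever is left is copied one byte at a time.
    while offset < length:
        mem[dst + offset] = mem[src + offset]
        offset += 1
    return wide_accesses
```

With both hints at 8, a 20-byte copy needs two 8-byte accesses plus a 4-byte tail; with either hint at 1, the same copy degrades to 20 single-byte accesses.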

The use of a (non-)overlapping hint facilitates removal of run-time direction
checks.  It also makes it feasible to use cache-line-preload-and-zero
instructions on targets that have them, e.g. POWER "dcbz".
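
A minimal sketch of the direction check in question, again modelling memory as a `bytearray`.  Both function names are hypothetical; the second one is only correct when the non-overlap hint is true.

```python
def copy_with_direction_check(mem: bytearray, dst: int, src: int, length: int) -> None:
    """memmove-style copy: choose the direction at run time so overlap is handled."""
    if dst <= src:
        for i in range(length):            # forward is safe when dst is below src
            mem[dst + i] = mem[src + i]
    else:
        for i in reversed(range(length)):  # backward is needed when dst is above src
            mem[dst + i] = mem[src + i]


def copy_forward_only(mem: bytearray, dst: int, src: int, length: int) -> None:
    """Copy with no direction check; valid only if the areas do not overlap."""
    for i in range(length):
        mem[dst + i] = mem[src + i]
```

If the hint is false and the areas do overlap, `copy_forward_only` can read bytes it has already overwritten, which is one way the "undefined state of the src and dst areas" case can arise in practice.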

One might ask, why bother at all?  Why not just tell front ends to emit memcpy
inline as wasm?

* Wasm implementations may decide to not do the copy in-line, if code size is
  an issue.

* Wasm on 32-bit targets typically requires an explicit range check per
  access.  Getting really good code if the front end compiler emitted the copy
  in-line would require the wasm backend to merge the range checks together,
  which sounds complex and fragile.  Using `mem.copy_hinted` avoids that
  problem.

* `mem.copy_hinted` could be implemented using vector loads/stores (e.g. 128
  or 256 bit) that don't exist in wasm itself.
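
The range-check point can be illustrated with a small model.  This is a sketch, not a description of any real back end: `Trap` and both functions are hypothetical, memory is a `bytearray`, and the returned value simply counts range checks performed.

```python
class Trap(Exception):
    """Stand-in for a wasm out-of-bounds trap."""


def copy_checked_per_access(mem: bytearray, dst: int, src: int, length: int) -> int:
    """Models an inline byte copy where every load and store is range-checked."""
    checks = 0
    for i in range(length):
        checks += 2  # one check for the load, one for the store
        if src + i >= len(mem) or dst + i >= len(mem):
            raise Trap("out of bounds access")
        mem[dst + i] = mem[src + i]
    return checks


def copy_checked_once(mem: bytearray, dst: int, src: int, length: int) -> int:
    """Models a single copy instruction: one range check per area, up front."""
    if src + length > len(mem) or dst + length > len(mem):
        raise Trap("out of bounds copy")
    # The right-hand slice is materialised first, so overlap is handled.
    mem[dst:dst + length] = mem[src:src + length]
    return 2
```

For an 8-byte copy the first version performs 16 checks and the second performs 2, without the back end having to prove that the per-access checks can be merged.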

Revision 1 by

Julian Seward [:jseward]

on 2019-09-04 01:02:47 PDT

It is clear that `mem.copy` is a poor match for compiler generated `memcpy`.
For this bug I agree we should go with Ryan's byte-copying scheme.

For the future, I'd like to propose `mem.copy_hinted`:

* `mem.copy_hinted` has the same operands as `mem.copy`, but also carries some hints:
  - alignment of src index (small int, 1/2/4/8 .. 256, or perhaps log2 thereof)
  - alignment of dst index (same)
  - a boolean indicating that the areas definitely do not overlap

* An implementation can choose to ignore the hints and handle it the same as
  `mem.copy`; this will be safe but slow.

The behaviour of `mem.copy_hinted` is defined as follows:

* If `mem.copy` with the same arguments would trap at runtime, then
  `mem.copy_hinted` will too.  However, the state of both src and dst areas is
  undefined after this.  Areas outside the src and dst areas will be
  unchanged.

* If `mem.copy` with the same arguments would not trap at runtime, but not all
  the hints are true, then `mem.copy_hinted` will also not trap, but the
  resulting state of the src and dst areas is undefined afterwards.  Areas
  outside the src and dst areas will be unchanged.

* If `mem.copy` with the same arguments would not trap at runtime, and all the
  hints are true, then `mem.copy_hinted` will also not trap, and will produce
  the same results as `mem.copy`.  This implies that areas outside the src
  and dst areas will be unchanged.

* In all cases, `mem.copy_hinted` gives no guarantees about what happens if
  multiple threads access the src or dst areas concurrently, beyond any
  guarantees we might have if the copy had been generated by the front end as
  a sequence of vanilla wasm loads/stores.

This allows the front end compiler to hand useful information to the back end,
without sacrificing safety at the wasm level, and it gives the back end wide
latitude in choosing an implementation.

The use of alignment hints allows the implementation to use aligned multibyte
loads/stores as it sees fit.  I believe natural alignment of them is important
to get performance close to native memcpy.  In particular, misaligned accesses
interact poorly with store forwarding (Intel Optimization Manual, Sec 3.6.5,
Store Forwarding).

The use of a (non-)overlapping hint facilitates removal of run-time direction
checks.  It also makes it feasible to use cache-line-preload-and-zero
instructions on targets that have them, e.g. POWER "dcbz".

One might ask, why bother at all?  Why not just tell front ends to emit memcpy
inline as wasm?

* Wasm implementations may decide to not do the copy in-line, if code size is
  an issue.

* Wasm on 32-bit targets typically requires an explicit range check per
  access.  Getting really good code if the front end compiler emitted the copy
  in-line would require the wasm backend to merge the range checks together,
  which sounds complex and fragile.  Using `mem.copy_hinted` avoids that
  problem.

* `mem.copy_hinted` could be implemented using vector loads/stores (e.g. 128
  or 256 bit) that don't exist in wasm itself.
