Closed Bug 1570112 Opened 5 months ago Closed 27 days ago

Performance of `memory.copy` is quite bad compared to wasm-written `memcpy` function

Categories

(Core :: Javascript: WebAssembly, enhancement, P2)

enhancement

Tracking

()

RESOLVED FIXED
Tracking Status
firefox70 --- affected

People

(Reporter: acrichto, Assigned: rhunt)

References

(Blocks 1 open bug)

Details

Attachments

(19 obsolete files)

Recently the Rust compiler updated to LLVM 9 from LLVM 8. There were some major changes in LLVM 9 around threads and WebAssembly, and one of the indirect pieces is that LLVM 9 now fully supports the threads and bulk memory proposals of WebAssembly. To use threads with LLVM 9 you are required to enable the bulk-memory feature in LLVM by default.

With the bulk-memory feature comes the memory.copy instruction. Rust code ends up generating quite a few calls to memcpy in general for all platforms. The performance of Rust relatively crucially relies on efficient implementations of memcpy as well. For most platforms LLVM will lower specific memcpy values (e.g. a constant move of 16 bytes) into inline optimized instructions, and it looks like for WebAssembly LLVM 9 is lowering almost all memcpy calls into memory.copy instructions. This means that Rust-compiled programs through LLVM 9 have a significant number of memory.copy instructions, many of which are performance critical.

The main "Rust and threads" example that we've historically used is a parallel raytracer, and so with the LLVM 9 update that was my main test case for ensuring that the Rust toolchain works with LLVM 9. Once all was said and done, however, I was pretty surprised at the performance! The previous LLVM 8-compiled code would render one frame in ~300ms on my machine (with the max number of threads), but the same code compiled with LLVM 9 would render a frame in ~2000ms.

After some profiling use perf.html I ended up realizing that 80+% of the time was spent in the native memcpy implementation, a seemingly large amount of that around some sort of synchronization as well. Overall, it definitely seemed like the memory.copy instructions emitted by LLVM with LLVM 9 were the culprit for most of the slowdown if not all of it.

I've prepared a github repository with precompiled versions of the raytracing example, and you can see the two rendered versions online:

Note that those require Nightly firefox with SharedArrayBuffer enabled to work, so they won't work in stable Firefox.

In talking with Luke it sounded like a bug was the best place to report this. If y'all need any more information from me though please just let me know! I suspect that the whole parallel raytracing idea isn't actually needed to showcase the performance slowdown here, it's likely only necessary to use memory.copy but I figured it'd be good to show the whole example.

Also FWIW my platform is Windows 10 where I originally ran these benchmarks.

Also in case it's useful, this is a perf.html capture of rendering two frames on my computer

It looks like DOMWorker#1 is spending 75% of its time in WasmMemoryObject::volatileMemoryLength (called by mem.copy) which is contending a lock with the other worker threads. I'm not sure what requirements there are for bounds checking with shared memory, but it seems like finding a more relaxed way to acquire the memory bounds would be good.

V8 implements this by performing the bounds check in JIT'ed code before the memcpy runtime function [1], and doesn't appear to use any locking [2]. I'm guessing the equivalent for us would be to bounds check on TlsData::boundsCheckLimit?

[1] https://cs.chromium.org/chromium/src/v8/src/compiler/wasm-compiler.cc?l=4849&rcl=781babd0c9ef5d62a23b0a54ed64586c3236db6c
[2] https://cs.chromium.org/chromium/src/v8/src/compiler/wasm-compiler.cc?l=3681&rcl=781babd0c9ef5d62a23b0a54ed64586c3236db6c

Bugbug thinks this bug is a enhancement, but please change it back in case of error.

Type: defect → enhancement

That could certainly help as both a short time perf boost and valuable in the long-term. What we really need to do (which was known and discussed in the wasm CG when this feature was standardized) is optimize the case where memory.copy has a small, constant byte size into an inline set of loads and stores. This is non-trivial so it's a question of whether we need to do this before or after CraneLift; I suppose it depends on how quickly LLVM 9-generated code will get into the wild.

On principle, we should not hold up performance work on the production engines because of Cranelift.

Let's see if we can prioritize this one and similar performance issues (memory.fill probably has the same issue).

Blocks: wasm-bulkmem
Priority: -- → P2
Assignee: nobody → rhunt

Bug 1457855 has a patch for inlined mem.fill on both Baseline and Ion. It never landed though. Might be worth a look.

Thanks for that! So far I focused on mem.copy as that seems to be the only thing used in the ray tracing example (unless I messed up my analysis).

Here are my current patches [1]. They inline mem.copy and mem.fill when length is constant and within a certain bound. (note: not constant value for mem.fill as in Julian's patches, I hadn't thought of that). The patches work with Baseline and Ion on all platforms.

I need to do some more tuning with the generated code to ensure it's acceptable on all platforms. Feedback is welcome.

[1] https://treeherder.mozilla.org/#/jobs?repo=try&revision=13a8070bf469a73e538c8fc0f937bafd0c048431

Forgot to add, there's also a patch that switches SAB to use a reader-writer lock so that multiple threads can perform the OOL-mem.copy simultaneously. This helped improve the raytracing benchmark, but didn't completely fix the problem. With the inlining patches though, we're back to parity.

(In reply to Ryan Hunt [:rhunt] from comment #7)

Here are my current patches [1].

I had a quick look at the mem.copy patch. What's the status w.r.t. alignment?
From reading the patch, it seems that it assumes the start point of both src
and dst is 4-aligned. Maybe I misunderstand? You could prefix the main
vector loop (the 32-bit loads/stores) with a 8+16 bit preamble, in the way that
it currently has a 16+8 bit postamble (for lack of a better word), but even then
you'd be assuming that src_start % 4 == dst_start % 4.

On further consideration .. the generated code would be OK if we allow misaligned
32- and 16-bit loads/stores on the target. But do we?

Drive-by nit: the while (remainder >= sizeof(uint16_t)) loop, and its
8-bit equivalent, only execute at most once. Maybe change them to ifs?

At the moment we assume that unaligned integer loads and stores are OK on all tier-1 platforms, and that hasn't bitten us yet. (FP accesses are different.)

Remember that there's no requirement for regular loads and stores to be aligned in wasm. An i32.load that says it is aligned may actually be unaligned. On a chip where unaligned accesses might be an issue we thus have to choose whether to generate a word access and hope for the best, or four byte accesses and assemble them into a word. We do the former unless the load is flagged as unaligned, but I suspect we should probably just generate the word access always unless we have concrete evidence that that's the wrong thing on the device in question.

The "wrong thing" is perf loss from an OS fixup, or (like for FP on ARM) simply a signal. In practice, consumer devices should have CPUs that handle the problem in hardware, but we've not actually added telemetry to substantiate this. We could do that.

(In reply to Julian Seward [:jseward] from comment #9)

(In reply to Ryan Hunt [:rhunt] from comment #7)

Here are my current patches [1].

I had a quick look at the mem.copy patch. What's the status w.r.t. alignment?
From reading the patch, it seems that it assumes the start point of both src
and dst is 4-aligned. Maybe I misunderstand? You could prefix the main
vector loop (the 32-bit loads/stores) with a 8+16 bit preamble, in the way that
it currently has a 16+8 bit postamble (for lack of a better word), but even then
you'd be assuming that src_start % 4 == dst_start % 4.

On further consideration .. the generated code would be OK if we allow misaligned
32- and 16-bit loads/stores on the target. But do we?

Don't have much to add to what Lars said, that was my understanding.

Drive-by nit: the while (remainder >= sizeof(uint16_t)) loop, and its
8-bit equivalent, only execute at most once. Maybe change them to ifs?

That's a good point, I wrote it as a loop to keep it similar to the main 32-bit load/store loop, but 'if's would be clearer here. I'll try to make that change.

I'd like to use RWLock in JS. It seems like the normal pattern is to have a
common class in mozglue/ that is shared between JS and XPCOM, so that's what
this commit does.

I've seen some contention between threads performing 'mem.copy' on the same
memory due to the lock in SAB. This lock is to perform atomic updates to the
length of the SAB while new memory is commited during growth.

By switching to a RWLock, multiple threads can be reading the length as long
as no thread is currently growing the memory, which should be rare.

Depends on D43889

This commit adds a path to Ion and Baseline to inline stores for 'mem.fill'
instructions when the length parameter is a constant within a hardcoded
limit.

The hardcoded limit is derived from light benchmarking of a synthetic test
case.

To ensure memory safety a bounds check is done on the start index of the
instruction. The limit on length is then picked to be less than the offset
guard limit so that we catch overflow past the start index.

The bulk memory operations spec specifies that partial OOB's 'mem.fill' should
result in a partial write up to the OOB's and then a trap. We rely on signal
handling to duplicate this behavior.

The copy is done in a bytewise manner. It's possible that we could perform more
efficient copies if we had a constant 'value', but that will need to be a
follow up.

Depends on D43890

This commit adds a path to Ion and Baseline to inline load and stores for the
'mem.copy' instruction when the length is a constant within a hardcoded limit.

Like the previous commit, the limit is chosen from light benchmarking, memory
safety is performed with bounds checking and the offset guard, and partial
writes need to be preserved.

Unlike the previous commit, the loads and stores are performed with 64, 32, 16,
and 8 bit transfers, depending on the length of the copy. We rely on alignment
issues being fixed up by processors or the OS, as we do with normal Wasm load
or store operations.

The direction of the copy is determined by the source and destination pointers
to properly handle overlapping regions. Unfortunately that means we have to
emit two sequences of loads and stores. I've tried to limit duplication in the
macro assembler and keep clarity, but there might be a better way.

Depends on D43891

Here's a snippet of the generated code for a copy and a fill on x64 and ARM64.

(module
	(memory 1 65536)
	(func $a (param i32) (param i32)
		local.get 0
		local.get 1
		i32.const 16
		memory.copy
	)
	(export "a" $a)
	(func $b (param i32) (param i32)
		local.get 0
		local.get 1
		i32.const 8
		memory.fill
	)
	(export "b" $b)
)
js> wasmDis(m.a)
<omitted>
00000016  3b f7                     cmp %edi, %esi
00000018  0f 8d 17 00 00 00         jnl 0x0000000000000035
0000001E  49 8b 44 37 08            movq 0x08(%r15,%rsi,1), %rax
00000023  49 89 44 3f 08            movq %rax, 0x08(%r15,%rdi,1)
00000028  49 8b 04 37               movq (%r15,%rsi,1), %rax
0000002C  49 89 04 3f               movq %rax, (%r15,%rdi,1)
00000030  e9 12 00 00 00            jmp 0x0000000000000047
00000035  49 8b 04 37               movq (%r15,%rsi,1), %rax
00000039  49 89 04 3f               movq %rax, (%r15,%rdi,1)
0000003D  49 8b 44 37 08            movq 0x08(%r15,%rsi,1), %rax
00000042  49 89 44 3f 08            movq %rax, 0x08(%r15,%rdi,1)
<omitted>
js> wasmDis(m.b)
<omitted>
00000016  41 88 34 3f               movb %sil, (%r15,%rdi,1)
0000001A  41 88 74 3f 01            movb %sil, 0x01(%r15,%rdi,1)
0000001F  41 88 74 3f 02            movb %sil, 0x02(%r15,%rdi,1)
00000024  41 88 74 3f 03            movb %sil, 0x03(%r15,%rdi,1)
00000029  41 88 74 3f 04            movb %sil, 0x04(%r15,%rdi,1)
0000002E  41 88 74 3f 05            movb %sil, 0x05(%r15,%rdi,1)
00000033  41 88 74 3f 06            movb %sil, 0x06(%r15,%rdi,1)
00000038  41 88 74 3f 07            movb %sil, 0x07(%r15,%rdi,1)
<omitted>
js> wasmDis(m.a)
<omitted>
0x2c02646ab064  6b01001f  cmp     w0, w1
0x2c02646ab068  5400010b  b.lt    #+0x20 (addr 0x2c02646ab088)
0x2c02646ab06c  f8606aa2  ldr     x2, [x21, x0]
0x2c02646ab070  f8216aa2  str     x2, [x21, x1]
0x2c02646ab074  910022b0  add     x16, x21, #0x8 (8)
0x2c02646ab078  f8606a02  ldr     x2, [x16, x0]
0x2c02646ab07c  8b0102b0  add     x16, x21, x1
0x2c02646ab080  f9000602  str     x2, [x16, #8]
0x2c02646ab084  14000007  b       #+0x1c (addr 0x2c02646ab0a0)
0x2c02646ab088  910022b0  add     x16, x21, #0x8 (8)
0x2c02646ab08c  f8606a02  ldr     x2, [x16, x0]
0x2c02646ab090  8b0102b0  add     x16, x21, x1
0x2c02646ab094  f9000602  str     x2, [x16, #8]
0x2c02646ab098  f8606aa2  ldr     x2, [x21, x0]
0x2c02646ab09c  f8216aa2  str     x2, [x21, x1]
<omitted>
js> wasmDis(m.b)
<omitted>
0x2c02646ab120  b9404fe1  ldr     w1, [sp, #76]
0x2c02646ab124  38216aa0  strb    w0, [x21, x1]
0x2c02646ab128  8b0102b0  add     x16, x21, x1
0x2c02646ab12c  39000600  strb    w0, [x16, #1]
0x2c02646ab130  8b0102b0  add     x16, x21, x1
0x2c02646ab134  39000a00  strb    w0, [x16, #2]
0x2c02646ab138  8b0102b0  add     x16, x21, x1
0x2c02646ab13c  39000e00  strb    w0, [x16, #3]
0x2c02646ab140  8b0102b0  add     x16, x21, x1
0x2c02646ab144  39001200  strb    w0, [x16, #4]
0x2c02646ab148  8b0102b0  add     x16, x21, x1
0x2c02646ab14c  39001600  strb    w0, [x16, #5]
0x2c02646ab150  8b0102b0  add     x16, x21, x1
0x2c02646ab154  39001a00  strb    w0, [x16, #6]
0x2c02646ab158  8b0102b0  add     x16, x21, x1
0x2c02646ab15c  39001e00  strb    w0, [x16, #7]
<omitted>

I see why SharedArrayRawBuffer::byteLength() takes a Lock parameter as a way to help the caller remember to think about raciness, and that lock's ordering guarantees justify accessing length_ without any synchronization instructions. But could we also have a SharedArrayRawBuffer::volatileByteLength() (called by WasmMemoryObject::volatileByteLength()) that does not take a Lock and uses atomic instructions instead to ensure proper ordering w.r.t resize?

(In reply to Luke Wagner [:luke] from comment #17)

I see why SharedArrayRawBuffer::byteLength() takes a Lock parameter as a way to help the caller remember to think about raciness, and that lock's ordering guarantees justify accessing length_ without any synchronization instructions. But could we also have a SharedArrayRawBuffer::volatileByteLength() (called by WasmMemoryObject::volatileByteLength()) that does not take a Lock and uses atomic instructions instead to ensure proper ordering w.r.t resize?

As I understand it, the one requirement that's important to keep is that CommitBufferMemory (which reduces to memmap/VirtualAlloc) and updating the length are observed in the correct order [1].

My understanding of atomics and memory ordering isn't strong enough to be confident in the required code to preserve this. But I'm open to suggestions.

[1] https://searchfox.org/mozilla-central/rev/8ea946dcf51f0d6400362cc1d49c8d4808eeacf1/js/src/vm/SharedArrayObject.cpp#124

Benjamin and Julian pointed out an important issue here regarding behavior on the OOB's boundary.

The specification for execution of 'memory.copy' maps it recursively to a series of i32.load8_u and i32.store8 [1]. This means that a 'memory.copy' that goes across an OOB boundary must write up until the boundary before trapping.

The 'memory.copy' patch expands to an appropriate amount of 64bit, 32bit, 16bit, and 8bit stores. The problem here is that this can leave a hole at the end of a copy if it stretches across an OOB.

Julian pointed out that we could add an alignment check against the src and dest to determine if they are on a natural word alignment, and if so perform an optimized unrolled copy, as we'd have a guarantee that we won't leave a hole.

This would result in us having four versions of this inlined copy emitted though. (Copy down vs. copy up) && (Word copy vs. byte copy). Which could add a lot of code bloat.

I'm not sure what's the desired solution here. The specification seems to be limiting an optimization opportunity here. I can't say if it's worth it or not though, I don't know what's being balanced here.

At least for now, I think we can land a byte-wise version of inlined copies and pursue more optimizations after.

[1] https://github.com/WebAssembly/bulk-memory-operations/blob/master/document/core/exec/instructions.rst#memorycopy

I should note another suggestion by Benjamin.

Unless we note extra metadata that this is a mem.fill, and the signal handler does slightly more work to fill up up to the actual incorrect index...

I haven't looked into that yet to know if it's feasible or not.

(In reply to Ryan Hunt [:rhunt] from comment #20)

[..] and the signal handler does slightly more work to fill up up to the actual incorrect index... [..]

That sounds pretty scary in the presence of multiple threads.

I've update the 'mem.copy' patch to perform bytewise copies. I also did some more formal local benchmarking and adjusted the limit for inlining. I've collected the details here [1].

[1] https://docs.google.com/spreadsheets/d/1vf9EeRRi9gTc0JokKUcWwvL8wR0yKR24xL5QP4j6xbM/edit?usp=sharing

Attachment #9088934 - Attachment description: Bug 1570112 - Wasm: Inline 'mem.fill' when length is a constant within hardcoded limit. r?bbouvier → Bug 1570112 - Wasm: Inline 'mem.fill' when length is a constant within hardcoded limit. r?jseward
Attachment #9088935 - Attachment description: Bug 1570112 - Wasm: Inline 'mem.copy' when length is a constant within hardcoded limit. r?bbouvier → Bug 1570112 - Wasm: Inline 'mem.copy' when length is a constant within hardcoded limit. r?jseward

That's a nice speedup for the small sizes, so bytewise copy seems like a good way to start. Do you suppose you could add a column to your measurements for 8- and 16-byte register copies? And also SM vs. V8?


If indeed there is still a big speedup to be had, I think there is a scheme that allows larger copies while still avoiding fancy signal handler magic or more code duplication (in fact, it'd remove existing code duplication):

  • we only inline+unroll up to some small constant size such that all the copied values can fit into registers at once (including SIMD registers, there should be plenty)
  • the inlined codegen first loads all the source bytes into the registers, then stores all the registers into the destination address
    • b/c if there is no trap, iiuc, it's not observable which order the registers were stored, even with shared memory (b/c there is no synchronization to guarantee the order in which stores are observed)
    • this avoids branching on src > dest
  • to handle traps:
    • put the highest store first; if it succeeds, then all subsequent stores will succeed; if it fails, then there have been no side effects thus far (*)
    • all loads/stores are given the (new) Trap code Trap::MemCopy
    • in HandleTrap, for a Trap::MemCopy, instead of redirecting pc to segment.trapCode(), redirect instead to a new segment.memCopyTrapCode() stub
    • the memCopyTrapCode stub needs to call Instance::memCopy. For the stub to materialize the correct arguments:
      • 2 fixed registers are chosen per architecture (analogous to, say, WasmTableCallSigReg) to hold the destByteOffset and srcByteoffset
      • a new MemCopyTrapSite is added that is like TrapSite but additionally stores the constant len (and probably, to simplify/shrink things, a pcLength so a single MemCopyTrapSite can cover all the loads/stores of a single memory.copy).
        • Note that trap site metadata is not stored in one big sorted vector, but N sorted vectors, one per Trap code. This avoids needing to store a Trap per TrapSite, but it's also quite convenient for our purposes here, because can simply store a sorted vector of MemCopyTrapSite for Trap::MemCopy without bloating all the other kinds of Traps.
      • in the inline codegen of memory.copy, these 2 registers are fixed via the defineFixed() lowering policy (in Ion)
      • the memCopyTrapCode stub builds a call frame, just like GenerateThrowStub, but passing the 2 fixed registers as args to a new C++ builtin WasmTrappingMemCopy which:
        • gets the len argument from wasmTrapData()
        • calls Instance::memCopy(), which does the Right Thing and, we can assert, ends by throwing an exception
        • ends with return WasmHandleThrow()

So, iiuc, with this scheme there would be no branches and the only slowdown compared to what a native compiler would do for memmove is the extra bit of register pressure for the 2 fixed registers.

(*) This assumes that an unaligned store that spans a page boundary and faults does not successfully write any bytes to the non-faulting page. I think that is the case, but it'd be good to confirm that.

I don't know if it's actually worth the effort, though, hence me asking for additional measurements above.

While I appreciate that the signal handle-y bits of Luke's scheme would be doable, it seems to me like considerable complexity to provide "in the event of a trap, copy as much as possible" semantics that I would guess LLVM's back end doesn't care about. I compare the proposal to the situation where LLVM codes short copies in-line as a sequence of vanilla wasm loads and stores. Then it wouldn't have said semantics (unless all those transactions were single-byte), but would it actually matter?

So I would ask Alex: what semantics do you need for these LLVM-generated memory.copy insns, regarding the state of the copy after a trap?

Flags: needinfo?(acrichton)

Yeah, memcpy is easy: overlap and oob are UB, so lowering to a sequence of accesses is fine. memory.copy in wasm is really not that at all. I'd be inclined to argue that LLVM9 simply gets this wrong.

For now, let's do the naive bit with byte copies (plus the lock fix), if it gets us closer to parity with what LLVM8 would have done.

I've been trying to think of a fix where we conservatively detect that the operation is not-overlapping and not-at-the-bounds-of-memory in which case we can generate as good code as memcpy would have, and otherwise fall back to ool code (conservative == in some cases we would go ool when we wouldn't really need to). But anything I've come up requires too elaborate a guard.

(In reply to Lars T Hansen [:lth] from comment #25)
Yeah, no argument regarding starting with inline byte copies for now. It might be worth filing a separate bug that goes into the perf backlog to consider word-sized copies, though, since I expect, whether it's Right or Wrong, this will be used for this purpose, so it'd be good to have better perf than competitors on it.

(plus the lock fix)

On this topic, do you think the RWLock alternative in comment 17 would work? It seems like it'd be a bit faster too.

(In reply to Julian Seward [:jseward] from comment #24)

So I would ask Alex: what semantics do you need for these LLVM-generated memory.copy insns, regarding the state of the copy after a trap?

From my perspective at least the use cases I'm thinking of are all source from Rust originally so anything which traps indicates a really big error that should tear down everything as quickly as possible in which case the semantics of exactly what happens doesn't actually matter. For us wrt Rust we just need the perf to look good :)

That's basically to say that we'd go towards whatever semantics make the implementation the fastest and reasonable to specify across all engines. As to which precise semantics are chosen I don't think it matters too too much for Rust's use case.

Flags: needinfo?(acrichton)

After some thinking, we can do larger length copies without new signal semantics when src < dest in the current patches. In that case, we are required to copy down and so we cannot perform a partial write even if we did it per-byte. So we could emit a quicker path for that case, and maybe get a probabilistic speed up?

I ran into this trying to write a test case for partial copying as you can only get a partial write to happen if a load is OOBs and all my testing was with storing OOBs.

We also should be able to eliminate a bounds check by only testing the greater of src or dest since we need to compute that to determine the copy direction already.

It is clear that mem.copy is a poor match for compiler generated memcpy.
For this bug I agree we should go with Ryan's byte-copying scheme.

For the future, I'd like to propose mem.copy_hinted:

  • mem.copy_hinted has the same operands as mem.copy, but also carries some hints:

    • alignment of src index (small int, 1/2/4/8 .. 256, perhaps or log2 thereof)
    • alignment of dst index (same)
    • a boolean indicating that the areas definitely do not overlap
  • An implementation can choose to ignore the hints and handle it the same as
    mem.copy; this will be safe but slow.

The behaviour of mem.copy_hinted is defined thusly:

  • If mem.copy with the same arguments would trap at runtime, then
    mem.copy_hinted will too. However, the state of both src and dst areas is
    undefined after this. Areas outside the src and dst areas will be
    unchanged.

  • If mem.copy with the same arguments would not trap at runtime, but not all
    the hints are true, then mem.copy_hinted will also not trap, but the
    resulting state of the src and dst areas is undefined afterwards. Areas
    outside the src and dst areas will be unchanged.

  • If mem.copy with the same arguments would not trap at runtime, and all the
    hints are true, then mem.copy_hinted will also not trap, and will produce
    the same results as mem.copy. (Implied thereby is): Areas outside the src
    and dst areas will be unchanged.

  • In all cases, mem.copy_hinted gives no guarantees about what happens if
    multiple threads access the src or dst areas concurrently, beyond any
    guarantees we might have if the copy had been generated by the front end as
    a sequence of vanilla wasm loads/stores.

This allows the front end compiler to hand useful information to the back end,
without sacrificing safety at the wasm level, and it gives the back end wide
latitude in choosing an implementation.

The use of alignment hints allows the implementation to use aligned multibyte
loads/stores as it sees fit. I believe natural alignment of them is important
to get performance close to native memcpy. In particular, misaligned accesses
interact poorly with store forwarding (Intel Optimization Manual, Sec 3.6.5,
Store Forwarding).

The use of a (non-)overlapping hint facilitates removal of run time direction
checks. It also makes it feasible to use cache-line-preload-and-zero
instructions on targets that have it, eg POWER "dcbz".

One might ask, why bother at all? Why not just tell front ends to emit memcpy
inline as wasm?

  • Wasm implementations may decide to not do the copy in-line, if code size is
    an issue.

  • Wasm on 32 bit targets typically requires an explicit range check per
    access. Getting really good code if the front end compiler emitted the copy
    in-line would require the wasm backend to merge the range checks together,
    which sounds complex and fragile. Using memory.copy_hinted avoids that
    problem.

  • memory.copy_hinted could be implemented using vector loads/stores (eg, 128
    or 256 bit) that don't exist in wasm itself.

If our reaction from implementing memory.copy is to propose a new version, then, given that the feature is still in Stage 3, which is intended for implementor feedback, then I thinkwe should bring this feedback to the CG and propose changes to memory.copy.

That being said, the wasm spec has thus far vigorously attempted to avoid nondeterministic behavior, so I don't think what you proposed would fly. What if instead memory.copy was proposed to not do any writes if it traps. If you couple that with what I was proposing in comment 23, then only the first two bullets remain.

What if instead memory.copy was proposed to not do any writes if it traps.

I think the current behavior is intended to be forward-compatible with good performance to the world where we have memory.protect on shared memory, and the behavior is synced between data segment initialization and memory.copy (and fill and init). We've changed so many things over the life of this proposal, it's hard to keep them straight and remember what lead to what. One discussion was here: https://github.com/WebAssembly/bulk-memory-operations/issues/49.

On a side note, I'm a bit dubious of memory.protect if the only way to implement it (without terrible perf) is to use hardware memory protection. But it seems like memory.protect would have fairly messy semantics in racy write cases like memory.copy with any semantics, and we could just deal with that question if/when we did memory.protect.

It seems like, if our reaction to the current trapping semantics is to want to propose a new instruction, though, we really should reconsider the current instructions' trapping semantics (both memory.copy and memory init). This is basically implementation feedback, which is the point of Stage 3, and I wouldn't be surprised if V8 had a similar experience (since there's nothing especially FF-specific here) and desire to improve.

But just as a basic technical question: would "trap before any writes" semantics combined with the abovementioned approach of "load all source values into registers, and then write high-to-low" be a valid way to optimally inline memory.copy?

I think avoiding nondeterministic behavior makes sense. Would it be possible to modify memory.copy/fill to take an alignment immediate like Julian proposed, but trap on a misaligned src/dest instead of viewing it as a hint? (this would match the alignment behavior of atomics)

The main point of this would be to get some of the benefits of mem.copy_hinted with deterministic behavior. The drawback would be an alignment check when alignment != 0.

I'm not sure if it's reasonable to also take a noOverlap flag and trap when that fails as well. The overhead of a check there might be too much?

(In reply to Luke Wagner [:luke] from comment #32)

On a side note, I'm a bit dubious of memory.protect if the only way to implement it (without terrible perf) is to use hardware memory protection. But it seems like memory.protect would have fairly messy semantics in racy write cases like memory.copy with any semantics, and we could just deal with that question if/when we did memory.protect.

It seems like, if our reaction to the current trapping semantics is to want to propose a new instruction, though, we really should reconsider the current instructions' trapping semantics (both memory.copy and memory init). This is basically implementation feedback, which is the point of Stage 3, and I wouldn't be surprised if V8 had a similar experience (since there's nothing especially FF-specific here) and desire to improve.

But just as a basic technical question: would "trap before any writes" semantics combined with the abovementioned approach of "load all source values into registers, and then write high-to-low" be a valid way to optimally inline memory.copy?

I agree that this would be valid. I would be slightly concerned about complexity and register pressure, but don't have enough experience to say if that would be a problem.

I should also add that I hit a small roadblock with the patches around emitting correct trap metadata. I have to rework the patches a bit, but should hopefully have new versions soon.

Ahh, interesting, so the idea is that you declare an alignment, memory.copy traps if it's wrong, and once we know it's right, we can copy in register sizes up to that alignment quanta because on trap we're perfectly to-spec. My worry is that, a common use of C memcpy is to use it to efficiently copy bytes of unknown alignment and so the compiler would need to emit an alignment of 1 in many cases, requiring us to fall back to the byte copies.

FWIW, if we simply limited the size of inline memory.copy to 128 bytes (so 8×128-bit SSE/NEON registers) I expect we'd get the vast majority of the win and so, given 16/32 SIMD registers (on x64/ARM, resp) register pressure would not be too much of an issue, I expect. In particular, my local g++ -O3 only does an inline copy up to 16 bytes (!) before switching to a call and clang -O3 does inline copy of up to 128 bytes. And, of course, this would avoid both the src<dst and any alignment branches.

I've updated the benchmark with my latest patches and added some comparison between us an V8. It's on the second page.

One interesting point is that we're faster when we inline, but seem to have much more overhead in our OOL path than V8.

I did a quick profile and it seems like we're spending some significant time finding the memory length and memory base inside of WasmInstance::memCopy. (~3-4ms / ~8ms total). We should be able to easily pass the memoryBase from TlsData into the instance call. Using an atomic access to the memoryLength may help as well.

V8 looks like they do some trickery with shared memory by caching the memoryLength per-thread, and sending out an interrupt to all other threads running wasm to update their local copy when growing memory. That may be an option as well.

[1] https://docs.google.com/spreadsheets/d/1vf9EeRRi9gTc0JokKUcWwvL8wR0yKR24xL5QP4j6xbM/edit?usp=sharing

Attachment #9088934 - Attachment description: Bug 1570112 - Wasm: Inline 'mem.fill' when length is a constant within hardcoded limit. r?jseward → Bug 1570112 - Wasm: Inline 'memory.fill' when length is a constant within hardcoded limit. r?jseward
Attachment #9088935 - Attachment description: Bug 1570112 - Wasm: Inline 'mem.copy' when length is a constant within hardcoded limit. r?jseward → Bug 1570112 - Wasm: Inline 'memory.copy' when length is a constant within hardcoded limit. r?jseward

The baseline compiler doesn't have enough information inside the plain load/store
path in order to remove bounds checks from memory.copy and memory.fill after
the first access of dest and src.

This commit passes an override to prevent bounds checks after the first
accesses. This is safe because we limit the allowable size of an inlined copy
to be less than the guard page size.

Depends on D43892

I don't believe we have coverage here. I confirmed that these tests fail with
my earlier larger transfer width patches and pass with the current patches.

Depends on D44930

(In reply to Luke Wagner [:luke] from comment #32)

On a side note, I'm a bit dubious of memory.protect if the only way to implement it (without terrible perf) is to use hardware memory protection. But it seems like memory.protect would have fairly messy semantics in racy write cases like memory.copy with any semantics, and we could just deal with that question if/when we did memory.protect.

It seems like, if our reaction to the current trapping semantics is to want to propose a new instruction, though, we really should reconsider the current instructions' trapping semantics (both memory.copy and memory init). This is basically implementation feedback, which is the point of Stage 3, and I wouldn't be surprised if V8 had a similar experience (since there's nothing especially FF-specific here) and desire to improve.

But just as a basic technical question: would "trap before any writes" semantics combined with the abovementioned approach of "load all source values into registers, and then write high-to-low" be a valid way to optimally inline memory.copy?

Is there a concrete spec or some discussion about memory.protect? Just curious, as I've heard it mentioned before but haven't seen anything.

(In reply to Ryan Hunt [:rhunt] from comment #40)
Nope, I think it's basically just the idea of "what if wasm had a mprotect()-like instruction so that, e.g., null (0) pointer accesses could fault?".

As mentioned above, we have some room to improve our 'memory.copy/fill' cost when using a function call.

bench-copy-4.js spidermonkey v8
ool ~9ms ~4ms

Note: the above is also without any mutex/rwlock as this is a non-shared memory benchmark.

After profiling and some testing, it appears much of the cost is in acquiring the base pointer, length, and shared flag from WasmMemoryObject.

By storing this information in wasm::Instance and skipping access to WasmMemoryObject from Instance::memCopy, we reduce down to ~4.5ms. I'm guessing this is from better cache line utilization from not having to chase as many pointers. Passing this information directly from the JIT'ed code (where we have HeapReg) results in slightly better performance, but I'm not sure the complexity is worth it.

The tricky part is keeping this information in sync with the canonical copy in WasmMemoryObject.

  • The shared memory flag is trivial as it's constant.
  • The memory base pointer is trivial as we are already notified of a moving grow operation.
  • The heap length is non-trivial as we have no easy way to be notified when a shared memory has been grown. The moving-grow observer list isn't sufficient as that is inside of WasmMemoryObject and only contains instances from that thread. The only object that is shared between threads is SharedArrayRawBuffer.

I spent some time digging into WasmArrayRawBuffer and SharedArrayRawBuffer, to see if we could store a list of instances in each to notify upon a grow operation with new memory parameters. This was pretty painful as there's a lot of logic entangled in the heap management code. Most objects can be destroyed and recreated during a grow, so there's not a clear place for this kind of information.

Additionally, some details like whether a memory is huge or not technically lives in WasmMemoryObject, but is required in the computation of boundsCheckLimit.

It would be really useful if there was some central place for this logic. I saw this comment [1] indicating there has been some thought here.

I'm thinking of a wasm::Memory object that actually maps/manages the heap for shared/local memories, is owned by WasmMemoryObject, but could be shared between threads when the memory is shared.

From some light searching, it seems like the main difficulty is that while ArrayBufferObject can have user owned memory, SharedArrayBufferObject cannot.

Is this a restriction that could be reasonably lifted and/or would the above refactoring be desirable?

I'm unsure if the difficulty of the refactoring is worthwhile just for the above microbenchmark, but it might be if it intersects with other plans.

[1] https://searchfox.org/mozilla-central/rev/250f5cc9fb8bdcbb6b23d2a06acfd48addb2f99b/js/src/vm/ArrayBufferObject.h#115

I have filed [1] to discuss potential changes to the spec.

[1] https://github.com/WebAssembly/bulk-memory-operations/issues/111

I've update the benchmark, with comparisons of all the different proposed solutions. [1]

I've hacked these within Spidermonkey on baseline with 64bit and huge memory. This was mostly for ease of hacking, the generated code isn't significantly impacted from not being in Ion.

  1. OOL call - I've tried to make it as fast as possible by passing in the heap base and length to the instance call. Heap base is easily acquired with HeapReg. Heap length isn't readily available due to comment 42, so I pass in boundsCheckLimit as a proxy instead. This is incorrect, but should be equivalent for perf testing purposes. With these changes, the implementation is very close to V8's and our time falls in line with theirs at about ~4s.

  2. Inline bytewise - The patches that are currently up. A comparison of src/dest to determine copy direction and two corresponding sequences of loads and stores per byte. This approach is competitive at small sizes, but loses at 16 bytes.

  3. Inline align guard - Close to (2) with an added alignment check so that copies can be performed per Int32. This performs better than (2), but is questionable whether it would make sense to change the spec to make this possible.

  4. Inline register buffer - Luke's proposed solution that won't write anything if any part is out of bounds. This approach has the best overall performance.

  5. Plain wasm loads/stores - An equivalent benchmark with memory.copy replaced with wasm loads and stores. For some reason, this actually performs better than (4) for 4bytes and 8bytes. Comparing the disassembly, the only difference is register allocation and scheduling. The same amount and type of instructions are emitted. This makes me believe that (4) can be improved to be just as good here.

One additional consideration. The module bytecode size for using memory.copy vs. plain loads and stores increased roughly 10x. So there is a size savings with using memory.copy. This benchmark also leads me to believe that we can achieve equivalent and better performance with memory.copy for small constant sizes if we change the trapping behavior, so overall I think we should continue pursuing this spec change.

[1] https://docs.google.com/spreadsheets/d/1vf9EeRRi9gTc0JokKUcWwvL8wR0yKR24xL5QP4j6xbM/edit#gid=1662652199

I really appreciate the thorough investigation, nice work! Two small follow-up questions regarding option #4:

  • for Size = 4 and 8 cases, are SIMD registers or GPRs used? (Is there any perf difference? One might imagine using XMM avoids GPR pressure.)
  • For Size = 16 and above, are full 16-byte SIMD registers used?

Filed some comments on the patch re MOVSX vs MOVZX and the use of REP MOVSB -- small potatoes, though I think the bytewise copy can do better than it currently does in a number of cases.

How are you performing time measurements? I looked at your JS code and did not see anything; is this with time at the shell or did I miss something? Assuming the times are in seconds the answer is probably not important, but I'm curious even so.

(In reply to Luke Wagner [:luke] from comment #45)

I really appreciate the thorough investigation, nice work! Two small follow-up questions regarding option #4:

  • for Size = 4 and 8 cases, are SIMD registers or GPRs used? (Is there any perf difference? One might imagine using XMM avoids GPR pressure.)

I tried the patches both with SIMD registers and with GPRs when comparing against plain Wasm loads/stores and did not notice a difference. The benchmark is fairly synthetic so a real application might have different characteristics.

  • For Size = 16 and above, are full 16-byte SIMD registers used?

Yes, for Size = 16 the patch emits unaligned SSE 16byte loads and stores.

(In reply to Lars T Hansen [:lth] from comment #46)

Filed some comments on the patch re MOVSX vs MOVZX and the use of REP MOVSB -- small potatoes, though I think the bytewise copy can do better than it currently does in a number of cases.

Thanks! I'll take a look at those.

How are you performing time measurements? I looked at your JS code and did not see anything; is this with time at the shell or did I miss something? Assuming the times are in seconds the answer is probably not important, but I'm curious even so.

Yes, I'm just using time with the shell assuming that the overhead of starting the shell will be roughly equivalent across runs. I'm open to other suggestions, I didn't see anything in the JS shell that was obvious for benchmarking.

Canonically the most precise is to call performance.now() before your benchmark does its thing and then again after, and subtract the former value from the latter to give you a time measurement in milliseconds but typically with at least (IIUC) microsecond precision, in the shell. You can then print that:

let then = performance.now()
bench.run()
let now = performance.now()
print("RUNNING_TIME " + (now - then))

This API has been neutered in the browser and does not deliver that kind of precision there for security reasons.

I agree that for your tests the shell overhead is probably not dominating so it won't matter too much, but with the above method you can do many runs in the same process etc, and you will factor out both shell overhead and general wasm compilation overhead, and you can control for warmup of both CPU cache and various jit caches.

Attached patch mem-copy-bench.patch (obsolete) — Splinter Review

I spent time today hacking in a version that uses rep movs and added it to the benchmark. It was surprisingly tricky to implement correctly. Additionally, it seems to have very poor performance (about 10x worse). It's possible I may have missed something.

I also changed the bytewise copy to use zero extension instead of sign extension. I did not see any observable difference unfortunately.

And finally, I investigated the discrepancy between the non-trapping register buffer approach and plain wasm loads and stores. It turns out that my load-store benchmark was incorrect and storing the destination address to the address at the value loaded from the source address. Stack machines are tricky.. Fixing this aligned plain load and store performance with the register buffer approach, making the register buffer approach the overall fastest for every measured size.

I'm also now attaching the patch that was used to generate these benchmarks if anyone wants to look at it. I'll also be adding the disassembly to notes on the attached spreadsheet so the curious can understand exactly what is being generated and compared for each approach.

I'm also going to drop the reviews on the attached patches until the uncertainty for this bug is resolved.

Re REP MOVS, there are microarchitectural effects here. As per the Intel manual, it probably is not effective for lengths under 76 (!) bytes. And it appears to be quite sensitive to how we manipulate the direction flag, or perhaps the interleaving of high->low and low->high copies. Your benchmark does 6 copies from low->high and then 6 copies from high->low, and each high->low copy (pointer decrement) sets the direction flag before the op and clears it after, preserving the direction flag as "cleared" (as per ABI). However, for the 128 test case, if I instead do 12 copies in one direction (either one, so long as I set the flag in the correct meaning before all the copies and clear it after I'm done), the running time on my system drops from 23.5s to 6.4s, ie by a factor of 4. This still doesn't beat the time for the SIMD (register buffer) code however - in any scenario, though it gets fairly close in the unidirectional high->low case - so I might not bother to try to figure out if I'm seeing memory subsystem effects or effects of diddling the flag bit incessantly...

Attachment #9090896 - Attachment is obsolete: true
Attachment #9090895 - Attachment is obsolete: true
Attachment #9088934 - Attachment is obsolete: true
Attachment #9088933 - Attachment is obsolete: true
Attachment #9088932 - Attachment is obsolete: true
Attachment #9088935 - Attachment is obsolete: true

Depends on D50373

Depends on D50375

Depends on D50376

Depends on D50378

Depends on D50379

Depends on D50380

TODO: Support non-x64

Depends on D50383

I've attached patches that:

  1. Makes our current OOL 'memory.copy' and 'memory.fill' faster by directly passing the heap base ptr and acquiring the heap length relative to the pointer
  2. Adds a byte-wise copy path for 'memory.copy' and 'memory.fill' for Ion and Baseline
  3. Adds a rep-movsb rep-stosb copy path for 'memory.copy' and 'memory.fill' for Ion and Baseline
  4. Adds a SIMD copy path for 'memory.copy' for Ion and Baseline

I've also written a benchmark that tests the bulk-memory instructions against the emscripten generated memcpy and memset, and unrolled plain loads/stores [1]. Results from a local run are here [2].

[1] https://github.com/eqrion/wasm-bulk-bench
[2] https://docs.google.com/spreadsheets/d/1Ri4NHfsqEHi4g58HSJcNoeJCh9wKU7_PQwMnn_0Vtfo/edit#gid=1674387044

(In reply to Ryan Hunt [:rhunt] from comment #65)

I've attached patches that:

  1. Makes our current OOL 'memory.copy' and 'memory.fill' faster by directly passing the heap base ptr and acquiring the heap length relative to the pointer
  2. Adds a byte-wise copy path for 'memory.copy' and 'memory.fill' for Ion and Baseline
  3. Adds a rep-movsb rep-stosb copy path for 'memory.copy' and 'memory.fill' for Ion and Baseline
  4. Adds a SIMD copy path for 'memory.copy' for Ion and Baseline

I've also written a benchmark that tests the bulk-memory instructions against the emscripten generated memcpy and memset, and unrolled plain loads/stores [1]. Results from a local run are here [2].

[1] https://github.com/eqrion/wasm-bulk-bench
[2] https://docs.google.com/spreadsheets/d/1Ri4NHfsqEHi4g58HSJcNoeJCh9wKU7_PQwMnn_0Vtfo/edit#gid=1674387044

Some observations from the data.

  1. With improvements to our OOL implementation
  • We are competitive to plain memcpy from bytes=8 and higher.
  • Plain loads/stores are superior until bytes=32.
  1. With inline byte-wise copies
  • We are superior to plain memcpy until bytes=32 (at which point we should switch to our OOL implementation).
  • Plain loads/stores have equivalent performance for all measured sizes.
  1. With inline transactional copies
  • We have superior performance to plain memcpy and plain loads/stores for all measured sizes (currently up to 128).

This was tested on my x64 Macbook Pro. We should test on other platforms as well.

Nice results! Q: re (3), what does "transactional" mean in this context? /me is unclear.

(In reply to Julian Seward [:jseward] (PTO 28 Oct - 8 Nov inclusive) from comment #67)

Nice results! Q: re (3), what does "transactional" mean in this context? /me is unclear.

Yes, should have made that clear. This is the approach that loads all src bytes into registers using SIMD then stores them to dst from high to low, so that we trap if anything is OOBs. 'transactional' in the sense that we either do the full copy or we trap and don't write anything.

Depends on: 1591047

Comment on attachment 9103768 [details]
Bug 1570112 - Remove locking from shared Wasm memory.

Revision D50374 was moved to bug 1591047. Setting attachment 9103768 [details] to obsolete.

Attachment #9103768 - Attachment is obsolete: true

Comment on attachment 9103769 [details]
Bug 1570112 - Split memCopy/memFill implementations for shared/non-shared modules.

Revision D50375 was moved to bug 1591047. Setting attachment 9103769 [details] to obsolete.

Attachment #9103769 - Attachment is obsolete: true

Comment on attachment 9103770 [details]
Bug 1570112 - Expose WasmArrayRawBuffer internally.

Revision D50376 was moved to bug 1591047. Setting attachment 9103770 [details] to obsolete.

Attachment #9103770 - Attachment is obsolete: true

Comment on attachment 9103771 [details]
Bug 1570112 - Track length in WasmArrayRawBuffer.

Revision D50377 was moved to bug 1591047. Setting attachment 9103771 [details] to obsolete.

Attachment #9103771 - Attachment is obsolete: true

Comment on attachment 9103772 [details]
Bug 1570112 - Pass heapBase to memCopy/memFill and use that to acquire length.

Revision D50378 was moved to bug 1591047. Setting attachment 9103772 [details] to obsolete.

Attachment #9103772 - Attachment is obsolete: true

Comment on attachment 9103773 [details]
Bug 1570112 - Fast path Ion loading of memoryBase for memcopy/memfill.

Revision D50379 was moved to bug 1591047. Setting attachment 9103773 [details] to obsolete.

Attachment #9103773 - Attachment is obsolete: true
Attachment #9096754 - Attachment is obsolete: true
Attachment #9103767 - Attachment is obsolete: true
Depends on: 1594204
Attachment #9103775 - Attachment is obsolete: true
Attachment #9103776 - Attachment is obsolete: true
Attachment #9103777 - Attachment is obsolete: true
Attachment #9103778 - Attachment is obsolete: true

Comment on attachment 9103774 [details]
Bug 1570112 - Split out 'emitMemCopy' function for dedicated optimizations.

Revision D50380 was moved to bug 1594204. Setting attachment 9103774 [details] to obsolete.

Attachment #9103774 - Attachment is obsolete: true

With bug 1591047 and bug 1594204 landed, I don't see any difference in performance between LLVM8/9 on attached links. There may be future performance improvements, but for now I am going to close this.

Status: NEW → RESOLVED
Closed: 27 days ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.