Performance of `memory.copy` is quite bad compared to a wasm-written `memcpy` function
Categories
(Core :: JavaScript: WebAssembly, enhancement, P2)
Tracking
Release | Tracking | Status |
---|---|---|
firefox70 | --- | affected |
People
(Reporter: acrichto, Assigned: rhunt)
References
Details
Attachments
(19 obsolete files)
Recently the Rust compiler updated from LLVM 8 to LLVM 9. There were some major changes in LLVM 9 around threads and WebAssembly, and one of the indirect pieces is that LLVM 9 now fully supports the threads and bulk-memory proposals of WebAssembly. To use threads with LLVM 9, the bulk-memory feature is now enabled in LLVM by default.
With the bulk-memory feature comes the `memory.copy` instruction. Rust code ends up generating quite a few calls to `memcpy` in general on all platforms, and the performance of Rust relies fairly crucially on efficient implementations of `memcpy`. For most platforms LLVM will lower specific `memcpy` calls (e.g. a constant move of 16 bytes) into inline optimized instructions, and it looks like for WebAssembly LLVM 9 is lowering almost all `memcpy` calls into `memory.copy` instructions. This means that Rust programs compiled through LLVM 9 have a significant number of `memory.copy` instructions, many of which are performance critical.
The main "Rust and threads" example that we've historically used is a parallel raytracer, and so with the LLVM 9 update that was my main test case for ensuring that the Rust toolchain works with LLVM 9. Once all was said and done, however, I was pretty surprised at the performance! The previous LLVM 8-compiled code would render one frame in ~300ms on my machine (with the max number of threads), but the same code compiled with LLVM 9 would render a frame in ~2000ms.
After some profiling using perf.html I ended up realizing that 80+% of the time was spent in the native memcpy implementation, with a seemingly large amount of that around some sort of synchronization as well. Overall, it definitely seemed like the `memory.copy` instructions emitted by LLVM 9 were the culprit for most of the slowdown, if not all of it.
I've prepared a github repository with precompiled versions of the raytracing example, and you can see the two rendered versions online:
Note that those require Nightly Firefox with SharedArrayBuffer enabled to work, so they won't work in stable Firefox.
In talking with Luke it sounded like a bug was the best place to report this. If y'all need any more information from me though please just let me know! I suspect that the whole parallel raytracing idea isn't actually needed to showcase the performance slowdown here; it's likely only necessary to use `memory.copy`, but I figured it'd be good to show the whole example.
Also FWIW my platform is Windows 10 where I originally ran these benchmarks.
Reporter
Comment 1•5 years ago
Also in case it's useful, this is a perf.html capture of rendering two frames on my computer
Assignee
Comment 2•5 years ago
It looks like DOMWorker#1 is spending 75% of its time in WasmMemoryObject::volatileMemoryLength (called by mem.copy) which is contending a lock with the other worker threads. I'm not sure what requirements there are for bounds checking with shared memory, but it seems like finding a more relaxed way to acquire the memory bounds would be good.
V8 implements this by performing the bounds check in JIT'ed code before the memcpy runtime function [1], and doesn't appear to use any locking [2]. I'm guessing the equivalent for us would be to bounds check on TlsData::boundsCheckLimit?
[1] https://cs.chromium.org/chromium/src/v8/src/compiler/wasm-compiler.cc?l=4849&rcl=781babd0c9ef5d62a23b0a54ed64586c3236db6c
[2] https://cs.chromium.org/chromium/src/v8/src/compiler/wasm-compiler.cc?l=3681&rcl=781babd0c9ef5d62a23b0a54ed64586c3236db6c
Comment 3•5 years ago
Bugbug thinks this bug is an enhancement, but please change it back in case of error.
Comment 4•5 years ago
That could certainly help, both as a short-term perf boost and as something valuable in the long term. What we really need to do (which was known and discussed in the wasm CG when this feature was standardized) is optimize the case where memory.copy has a small, constant byte size into an inline set of loads and stores. This is non-trivial, so it's a question of whether we need to do this before or after Cranelift; I suppose it depends on how quickly LLVM 9-generated code will get into the wild.
Comment 5•5 years ago
On principle, we should not hold up performance work on the production engines because of Cranelift.
Let's see if we can prioritize this one and similar performance issues (memory.fill probably has the same issue).
Comment 6•5 years ago
Bug 1457855 has a patch for inlined mem.fill on both Baseline and Ion. It never landed though. Might be worth a look.
Assignee
Comment 7•5 years ago
Thanks for that! So far I've focused on mem.copy as that seems to be the only thing used in the ray tracing example (unless I messed up my analysis).
Here are my current patches [1]. They inline mem.copy and mem.fill when the length is constant and within a certain bound (note: not a constant value for mem.fill as in Julian's patches; I hadn't thought of that). The patches work with Baseline and Ion on all platforms.
I need to do some more tuning with the generated code to ensure it's acceptable on all platforms. Feedback is welcome.
[1] https://treeherder.mozilla.org/#/jobs?repo=try&revision=13a8070bf469a73e538c8fc0f937bafd0c048431
Assignee
Comment 8•5 years ago
Forgot to add, there's also a patch that switches SAB to use a reader-writer lock so that multiple threads can perform the OOL-mem.copy simultaneously. This helped improve the raytracing benchmark, but didn't completely fix the problem. With the inlining patches though, we're back to parity.
Comment 9•5 years ago
(In reply to Ryan Hunt [:rhunt] from comment #7)
Here are my current patches [1].
I had a quick look at the mem.copy patch. What's the status w.r.t. alignment? From reading the patch, it seems that it assumes the start point of both src and dst is 4-aligned. Maybe I misunderstand? You could prefix the main vector loop (the 32-bit loads/stores) with an 8+16 bit preamble, in the way that it currently has a 16+8 bit postamble (for lack of a better word), but even then you'd be assuming that `src_start % 4 == dst_start % 4`.
On further consideration... the generated code would be OK if we allow misaligned 32- and 16-bit loads/stores on the target. But do we?
Drive-by nit: the `while (remainder >= sizeof(uint16_t))` loop, and its 8-bit equivalent, only execute at most once. Maybe change them to `if`s?
Comment 10•5 years ago
At the moment we assume that unaligned integer loads and stores are OK on all tier-1 platforms, and that hasn't bitten us yet. (FP accesses are different.)
Remember that there's no requirement for regular loads and stores to be aligned in wasm. An i32.load that says it is aligned may actually be unaligned. On a chip where unaligned accesses might be an issue we thus have to choose whether to generate a word access and hope for the best, or four byte accesses and assemble them into a word. We do the former unless the load is flagged as unaligned, but I suspect we should probably just generate the word access always unless we have concrete evidence that that's the wrong thing on the device in question.
The "wrong thing" is perf loss from an OS fixup, or (like for FP on ARM) simply a signal. In practice, consumer devices should have CPUs that handle the problem in hardware, but we've not actually added telemetry to substantiate this. We could do that.
Assignee
Comment 11•5 years ago
(In reply to Julian Seward [:jseward] from comment #9)
(In reply to Ryan Hunt [:rhunt] from comment #7)
Here are my current patches [1].
I had a quick look at the mem.copy patch. What's the status w.r.t. alignment? From reading the patch, it seems that it assumes the start point of both src and dst is 4-aligned. Maybe I misunderstand? You could prefix the main vector loop (the 32-bit loads/stores) with an 8+16 bit preamble, in the way that it currently has a 16+8 bit postamble (for lack of a better word), but even then you'd be assuming that `src_start % 4 == dst_start % 4`.
On further consideration... the generated code would be OK if we allow misaligned 32- and 16-bit loads/stores on the target. But do we?
Don't have much to add to what Lars said, that was my understanding.
Drive-by nit: the `while (remainder >= sizeof(uint16_t))` loop, and its 8-bit equivalent, only execute at most once. Maybe change them to `if`s?
That's a good point, I wrote it as a loop to keep it similar to the main 32-bit load/store loop, but 'if's would be clearer here. I'll try to make that change.
Assignee
Comment 12•5 years ago
I'd like to use RWLock in JS. It seems like the normal pattern is to have a
common class in mozglue/ that is shared between JS and XPCOM, so that's what
this commit does.
Assignee
Comment 13•5 years ago
I've seen some contention between threads performing 'mem.copy' on the same
memory due to the lock in SAB. This lock is to perform atomic updates to the
length of the SAB while new memory is committed during growth.
By switching to a RWLock, multiple threads can be reading the length as long
as no thread is currently growing the memory, which should be rare.
Depends on D43889
Assignee
Comment 14•5 years ago
This commit adds a path to Ion and Baseline to inline stores for 'mem.fill'
instructions when the length parameter is a constant within a hardcoded
limit.
The hardcoded limit is derived from light benchmarking of a synthetic test
case.
To ensure memory safety a bounds check is done on the start index of the
instruction. The limit on length is then picked to be less than the offset
guard limit so that we catch overflow past the start index.
The bulk memory operations spec specifies that a partially-OOB 'mem.fill' should
result in a partial write up to the OOB boundary and then a trap. We rely on signal
handling to duplicate this behavior.
The fill is done in a bytewise manner. It's possible that we could perform more
efficient fills if we had a constant 'value', but that will need to be a
follow-up.
Depends on D43890
Assignee
Comment 15•5 years ago
This commit adds a path to Ion and Baseline to inline load and stores for the
'mem.copy' instruction when the length is a constant within a hardcoded limit.
Like the previous commit, the limit is chosen from light benchmarking, memory
safety is performed with bounds checking and the offset guard, and partial
writes need to be preserved.
Unlike the previous commit, the loads and stores are performed with 64, 32, 16,
and 8 bit transfers, depending on the length of the copy. We rely on alignment
issues being fixed up by processors or the OS, as we do with normal Wasm load
or store operations.
The direction of the copy is determined by the source and destination pointers
to properly handle overlapping regions. Unfortunately that means we have to
emit two sequences of loads and stores. I've tried to limit duplication in the
macro assembler and keep clarity, but there might be a better way.
Depends on D43891
Assignee
Comment 16•5 years ago
Here's a snippet of the generated code for a copy and a fill on x64 and ARM64.
(module
(memory 1 65536)
(func $a (param i32) (param i32)
local.get 0
local.get 1
i32.const 16
memory.copy
)
(export "a" $a)
(func $b (param i32) (param i32)
local.get 0
local.get 1
i32.const 8
memory.fill
)
(export "b" $b)
)
js> wasmDis(m.a)
<omitted>
00000016 3b f7 cmp %edi, %esi
00000018 0f 8d 17 00 00 00 jnl 0x0000000000000035
0000001E 49 8b 44 37 08 movq 0x08(%r15,%rsi,1), %rax
00000023 49 89 44 3f 08 movq %rax, 0x08(%r15,%rdi,1)
00000028 49 8b 04 37 movq (%r15,%rsi,1), %rax
0000002C 49 89 04 3f movq %rax, (%r15,%rdi,1)
00000030 e9 12 00 00 00 jmp 0x0000000000000047
00000035 49 8b 04 37 movq (%r15,%rsi,1), %rax
00000039 49 89 04 3f movq %rax, (%r15,%rdi,1)
0000003D 49 8b 44 37 08 movq 0x08(%r15,%rsi,1), %rax
00000042 49 89 44 3f 08 movq %rax, 0x08(%r15,%rdi,1)
<omitted>
js> wasmDis(m.b)
<omitted>
00000016 41 88 34 3f movb %sil, (%r15,%rdi,1)
0000001A 41 88 74 3f 01 movb %sil, 0x01(%r15,%rdi,1)
0000001F 41 88 74 3f 02 movb %sil, 0x02(%r15,%rdi,1)
00000024 41 88 74 3f 03 movb %sil, 0x03(%r15,%rdi,1)
00000029 41 88 74 3f 04 movb %sil, 0x04(%r15,%rdi,1)
0000002E 41 88 74 3f 05 movb %sil, 0x05(%r15,%rdi,1)
00000033 41 88 74 3f 06 movb %sil, 0x06(%r15,%rdi,1)
00000038 41 88 74 3f 07 movb %sil, 0x07(%r15,%rdi,1)
<omitted>
js> wasmDis(m.a)
<omitted>
0x2c02646ab064 6b01001f cmp w0, w1
0x2c02646ab068 5400010b b.lt #+0x20 (addr 0x2c02646ab088)
0x2c02646ab06c f8606aa2 ldr x2, [x21, x0]
0x2c02646ab070 f8216aa2 str x2, [x21, x1]
0x2c02646ab074 910022b0 add x16, x21, #0x8 (8)
0x2c02646ab078 f8606a02 ldr x2, [x16, x0]
0x2c02646ab07c 8b0102b0 add x16, x21, x1
0x2c02646ab080 f9000602 str x2, [x16, #8]
0x2c02646ab084 14000007 b #+0x1c (addr 0x2c02646ab0a0)
0x2c02646ab088 910022b0 add x16, x21, #0x8 (8)
0x2c02646ab08c f8606a02 ldr x2, [x16, x0]
0x2c02646ab090 8b0102b0 add x16, x21, x1
0x2c02646ab094 f9000602 str x2, [x16, #8]
0x2c02646ab098 f8606aa2 ldr x2, [x21, x0]
0x2c02646ab09c f8216aa2 str x2, [x21, x1]
<omitted>
js> wasmDis(m.b)
<omitted>
0x2c02646ab120 b9404fe1 ldr w1, [sp, #76]
0x2c02646ab124 38216aa0 strb w0, [x21, x1]
0x2c02646ab128 8b0102b0 add x16, x21, x1
0x2c02646ab12c 39000600 strb w0, [x16, #1]
0x2c02646ab130 8b0102b0 add x16, x21, x1
0x2c02646ab134 39000a00 strb w0, [x16, #2]
0x2c02646ab138 8b0102b0 add x16, x21, x1
0x2c02646ab13c 39000e00 strb w0, [x16, #3]
0x2c02646ab140 8b0102b0 add x16, x21, x1
0x2c02646ab144 39001200 strb w0, [x16, #4]
0x2c02646ab148 8b0102b0 add x16, x21, x1
0x2c02646ab14c 39001600 strb w0, [x16, #5]
0x2c02646ab150 8b0102b0 add x16, x21, x1
0x2c02646ab154 39001a00 strb w0, [x16, #6]
0x2c02646ab158 8b0102b0 add x16, x21, x1
0x2c02646ab15c 39001e00 strb w0, [x16, #7]
<omitted>
Comment 17•5 years ago
I see why `SharedArrayRawBuffer::byteLength()` takes a `Lock` parameter as a way to help the caller remember to think about raciness, and that lock's ordering guarantees justify accessing `length_` without any synchronization instructions. But could we also have a `SharedArrayRawBuffer::volatileByteLength()` (called by `WasmMemoryObject::volatileByteLength()`) that does not take a `Lock` and uses atomic instructions instead to ensure proper ordering w.r.t. resize?
Assignee
Comment 18•5 years ago
(In reply to Luke Wagner [:luke] from comment #17)
I see why `SharedArrayRawBuffer::byteLength()` takes a `Lock` parameter as a way to help the caller remember to think about raciness, and that lock's ordering guarantees justify accessing `length_` without any synchronization instructions. But could we also have a `SharedArrayRawBuffer::volatileByteLength()` (called by `WasmMemoryObject::volatileByteLength()`) that does not take a `Lock` and uses atomic instructions instead to ensure proper ordering w.r.t. resize?
As I understand it, the one requirement that's important to keep is that CommitBufferMemory (which reduces to memmap/VirtualAlloc) and updating the length are observed in the correct order [1].
My understanding of atomics and memory ordering isn't strong enough to be confident in the required code to preserve this. But I'm open to suggestions.
Assignee
Comment 19•5 years ago
Benjamin and Julian pointed out an important issue here regarding behavior at the OOB boundary.
The specification for execution of 'memory.copy' maps it recursively to a series of `i32.load8_u` and `i32.store8` [1]. This means that a 'memory.copy' that goes across the OOB boundary must write up until the boundary before trapping.
The 'memory.copy' patch expands to an appropriate number of 64-bit, 32-bit, 16-bit, and 8-bit stores. The problem here is that this can leave a hole at the end of a copy if it stretches across the OOB boundary.
Julian pointed out that we could add an alignment check against the src and dest to determine if they are on a natural word alignment, and if so perform an optimized unrolled copy, as we'd have a guarantee that we won't leave a hole.
This would result in us having four versions of this inlined copy emitted, though: (copy down vs. copy up) && (word copy vs. byte copy), which could add a lot of code bloat.
I'm not sure what the desired solution is here. The specification seems to be limiting an optimization opportunity, but I can't say whether it's worth it or not; I don't know what's being balanced here.
At least for now, I think we can land a byte-wise version of inlined copies and pursue more optimizations after.
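For concreteness, here is a small shell sketch of the boundary behavior described above. This is only a sketch: it assumes a SpiderMonkey shell with bulk memory enabled and uses the shell's wasmTextToBinary helper, and the expected output reflects the partial-write draft semantics discussed in this comment rather than any final spec text.
// memory.copy whose destination straddles the end of a 1-page memory.
// Under the draft semantics, the in-bounds byte is written before the trap.
const bytes = wasmTextToBinary(`(module
  (memory (export "mem") 1)
  (func (export "copy") (param i32 i32 i32)
    local.get 0
    local.get 1
    local.get 2
    memory.copy))`);
const { exports } = new WebAssembly.Instance(new WebAssembly.Module(bytes));
const view = new Uint8Array(exports.mem.buffer);
view[0] = 0xff;              // source byte
try {
  exports.copy(65535, 0, 2); // dest = last valid index, copy runs off the end
} catch (e) {
  print(e);                  // a WebAssembly.RuntimeError for the OOB access
}
print(view[65535]);          // 255 under the draft partial-write semantics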
Assignee
Comment 20•5 years ago
I should note another suggestion by Benjamin.
Unless we note extra metadata that this is a mem.fill, and the signal handler does slightly more work to fill up to the actual incorrect index...
I haven't looked into that yet to know if it's feasible or not.
Comment 21•5 years ago
(In reply to Ryan Hunt [:rhunt] from comment #20)
[..] and the signal handler does slightly more work to fill up to the actual incorrect index... [..]
That sounds pretty scary in the presence of multiple threads.
Assignee
Comment 22•5 years ago
I've updated the 'mem.copy' patch to perform bytewise copies. I also did some more formal local benchmarking and adjusted the limit for inlining. I've collected the details here [1].
[1] https://docs.google.com/spreadsheets/d/1vf9EeRRi9gTc0JokKUcWwvL8wR0yKR24xL5QP4j6xbM/edit?usp=sharing
Comment 23•5 years ago
That's a nice speedup for the small sizes, so bytewise copy seems like a good way to start. Do you suppose you could add a column to your measurements for 8- and 16-byte register copies? And also SM vs. V8?
If indeed there is still a big speedup to be had, I think there is a scheme that allows larger copies while still avoiding fancy signal handler magic or more code duplication (in fact, it'd remove existing code duplication):
- we only inline+unroll up to some small constant size such that all the copied values can fit into registers at once (including SIMD registers, there should be plenty)
- the inlined codegen first loads all the source bytes into the registers, then stores all the registers into the destination address
  - b/c if there is no trap, iiuc, it's not observable which order the registers were stored, even with shared memory (b/c there is no synchronization to guarantee the order in which stores are observed)
  - this avoids branching on src > dest
- to handle traps:
  - put the highest store first; if it succeeds, then all subsequent stores will succeed; if it fails, then there have been no side effects thus far (*)
  - all loads/stores are given the (new) `Trap` code `Trap::MemCopy`
  - in `HandleTrap`, for a `Trap::MemCopy`, instead of redirecting pc to `segment.trapCode()`, redirect instead to a new `segment.memCopyTrapCode()` stub
  - the `memCopyTrapCode` stub needs to call `Instance::memCopy`. For the stub to materialize the correct arguments:
    - 2 fixed registers are chosen per architecture (analogous to, say, `WasmTableCallSigReg`) to hold the `destByteOffset` and `srcByteoffset`
    - a new `MemCopyTrapSite` is added that is like `TrapSite` but additionally stores the constant `len` (and probably, to simplify/shrink things, a `pcLength` so a single `MemCopyTrapSite` can cover all the loads/stores of a single `memory.copy`)
      - Note that trap site metadata is not stored in one big sorted vector, but N sorted vectors, one per `Trap` code. This avoids needing to store a `Trap` per `TrapSite`, but it's also quite convenient for our purposes here, because we can simply store a sorted vector of `MemCopyTrapSite` for `Trap::MemCopy` without bloating all the other kinds of `Trap`s.
    - in the inline codegen of `memory.copy`, these 2 registers are fixed via the `defineFixed()` lowering policy (in Ion)
    - the `memCopyTrapCode` stub builds a call frame, just like `GenerateThrowStub`, but passing the 2 fixed registers as args to a new C++ builtin `WasmTrappingMemCopy` which:
      - gets the `len` argument from `wasmTrapData()`
      - calls `Instance::memCopy()`, which does the Right Thing and, we can assert, ends by throwing an exception
      - ends with `return WasmHandleThrow()`

So, iiuc, with this scheme there would be no branches, and the only slowdown compared to what a native compiler would do for memmove is the extra bit of register pressure for the 2 fixed registers.

(*) This assumes that an unaligned store that spans a page boundary and faults does not successfully write any bytes to the non-faulting page. I think that is the case, but it'd be good to confirm that.
I don't know if it's actually worth the effort, though, hence me asking for additional measurements above.
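As a non-normative aid, here is a tiny JS model of the load-everything-then-store-high-to-low idea sketched above; the engine-side details (trap codes, stubs, fixed registers) are deliberately not modeled.
// Model of the register-buffer scheme: perform every load before any store,
// then store from the highest destination address downward, so any fault
// happens before a single byte of the destination has been modified.
function inlineCopyModel(mem, dest, src, len) {
  const regs = [];
  for (let i = 0; i < len; i++) {
    if (src + i >= mem.length) throw new RangeError("out of bounds");
    regs.push(mem[src + i]);          // all loads happen first
  }
  for (let i = len - 1; i >= 0; i--) {
    if (dest + i >= mem.length) throw new RangeError("out of bounds");
    mem[dest + i] = regs[i];          // highest store first: all-or-nothing
  }
}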
Comment 24•5 years ago
While I appreciate that the signal handle-y bits of Luke's scheme would be doable, it seems to me like considerable complexity to provide "in the event of a trap, copy as much as possible" semantics that I would guess LLVM's back end doesn't care about. I compare the proposal to the situation where LLVM codes short copies in-line as a sequence of vanilla wasm loads and stores. Then it wouldn't have said semantics (unless all those transactions were single-byte), but would it actually matter?
So I would ask Alex: what semantics do you need for these LLVM-generated memory.copy insns, regarding the state of the copy after a trap?
Comment 25•5 years ago
Yeah, memcpy is easy: overlap and oob are UB, so lowering to a sequence of accesses is fine. memory.copy in wasm is really not that at all. I'd be inclined to argue that LLVM9 simply gets this wrong.
For now, let's do the naive bit with byte copies (plus the lock fix), if it gets us closer to parity with what LLVM8 would have done.
I've been trying to think of a fix where we conservatively detect that the operation is not-overlapping and not-at-the-bounds-of-memory, in which case we can generate as good code as memcpy would have, and otherwise fall back to ool code (conservative == in some cases we would go ool when we wouldn't really need to). But anything I've come up with requires too elaborate a guard.
Comment 26•5 years ago
(In reply to Lars T Hansen [:lth] from comment #25)
Yeah, no argument regarding starting with inline byte copies for now. It might be worth filing a separate bug that goes into the perf backlog to consider word-sized copies, though, since I expect, whether it's Right or Wrong, this will be used for this purpose, so it'd be good to have better perf than competitors on it.
(plus the lock fix)
On this topic, do you think the RWLock alternative in comment 17 would work? It seems like it'd be a bit faster too.
Reporter
Comment 27•5 years ago
(In reply to Julian Seward [:jseward] from comment #24)
So I would ask Alex: what semantics do you need for these LLVM-generated memory.copy insns, regarding the state of the copy after a trap?
From my perspective, at least, the use cases I'm thinking of are all sourced from Rust originally, so anything which traps indicates a really big error that should tear down everything as quickly as possible, in which case the semantics of exactly what happens don't actually matter. For us wrt Rust we just need the perf to look good :)
That's basically to say that we'd go towards whatever semantics make the implementation the fastest and reasonable to specify across all engines. As to which precise semantics are chosen I don't think it matters too too much for Rust's use case.
Assignee
Comment 28•5 years ago
After some thinking, we can do larger-length copies without new signal semantics when `src < dest` in the current patches. In that case, we are required to copy down, and so we cannot perform a partial write even if we did it per-byte. So we could emit a quicker path for that case, and maybe get a probabilistic speed up?
I ran into this trying to write a test case for partial copying, as you can only get a partial write to happen if a load is OOB, and all my testing was with OOB stores.
We also should be able to eliminate a bounds check by only testing the greater of `src` or `dest`, since we need to compute that to determine the copy direction already.
Comment 29•5 years ago
It is clear that `mem.copy` is a poor match for compiler generated `memcpy`.
For this bug I agree we should go with Ryan's byte-copying scheme.
For the future, I'd like to propose `mem.copy_hinted`:
- `mem.copy_hinted` has the same operands as `mem.copy`, but also carries some hints:
  - alignment of src index (small int, 1/2/4/8 .. 256 perhaps, or log2 thereof)
  - alignment of dst index (same)
  - a boolean indicating that the areas definitely do not overlap
- An implementation can choose to ignore the hints and handle it the same as `mem.copy`; this will be safe but slow.

The behaviour of `mem.copy_hinted` is defined thusly:
- If `mem.copy` with the same arguments would trap at runtime, then `mem.copy_hinted` will too. However, the state of both src and dst areas is undefined after this. Areas outside the src and dst areas will be unchanged.
- If `mem.copy` with the same arguments would not trap at runtime, but not all the hints are true, then `mem.copy_hinted` will also not trap, but the resulting state of the src and dst areas is undefined afterwards. Areas outside the src and dst areas will be unchanged.
- If `mem.copy` with the same arguments would not trap at runtime, and all the hints are true, then `mem.copy_hinted` will also not trap, and will produce the same results as `mem.copy`. (Implied thereby is): Areas outside the src and dst areas will be unchanged.
- In all cases, `mem.copy_hinted` gives no guarantees about what happens if multiple threads access the src or dst areas concurrently, beyond any guarantees we might have if the copy had been generated by the front end as a sequence of vanilla wasm loads/stores.

This allows the front end compiler to hand useful information to the back end, without sacrificing safety at the wasm level, and it gives the back end wide latitude in choosing an implementation.

The use of alignment hints allows the implementation to use aligned multibyte loads/stores as it sees fit. I believe natural alignment of them is important to get performance close to native memcpy. In particular, misaligned accesses interact poorly with store forwarding (Intel Optimization Manual, Sec 3.6.5, Store Forwarding).

The use of a (non-)overlapping hint facilitates removal of run time direction checks. It also makes it feasible to use cache-line-preload-and-zero instructions on targets that have it, eg POWER "dcbz".

One might ask, why bother at all? Why not just tell front ends to emit memcpy inline as wasm?
- Wasm implementations may decide to not do the copy in-line, if code size is an issue.
- Wasm on 32 bit targets typically requires an explicit range check per access. Getting really good code if the front end compiler emitted the copy in-line would require the wasm backend to merge the range checks together, which sounds complex and fragile. Using `memory.copy_hinted` avoids that problem.
- `memory.copy_hinted` could be implemented using vector loads/stores (eg, 128 or 256 bit) that don't exist in wasm itself.
Comment 30•5 years ago
If our reaction from implementing `memory.copy` is to propose a new version then, given that the feature is still in Stage 3, which is intended for implementor feedback, I think we should bring this feedback to the CG and propose changes to `memory.copy`.
That being said, the wasm spec has thus far vigorously attempted to avoid nondeterministic behavior, so I don't think what you proposed would fly. What if instead `memory.copy` was proposed to not do any writes if it traps? If you couple that with what I was proposing in comment 23, then only the first two bullets remain.
Comment 31•5 years ago
What if instead `memory.copy` was proposed to not do any writes if it traps.
I think the current behavior is intended to be forward-compatible, with good performance, to a world where we have `memory.protect` on shared memory, and the behavior is synced between data segment initialization and memory.copy (and fill and init). We've changed so many things over the life of this proposal, it's hard to keep them straight and remember what led to what. One discussion was here: https://github.com/WebAssembly/bulk-memory-operations/issues/49.
Comment 32•5 years ago
On a side note, I'm a bit dubious of `memory.protect` if the only way to implement it (without terrible perf) is to use hardware memory protection. But it seems like `memory.protect` would have fairly messy semantics in racy write cases like `memory.copy` with any semantics, and we could just deal with that question if/when we did `memory.protect`.
It seems like, if our reaction to the current trapping semantics is to want to propose a new instruction, though, we really should reconsider the current instructions' trapping semantics (both `memory.copy` and memory init). This is basically implementation feedback, which is the point of Stage 3, and I wouldn't be surprised if V8 had a similar experience (since there's nothing especially FF-specific here) and desire to improve.
But just as a basic technical question: would "trap before any writes" semantics combined with the abovementioned approach of "load all source values into registers, and then write high-to-low" be a valid way to optimally inline `memory.copy`?
Assignee
Comment 33•5 years ago
I think avoiding nondeterministic behavior makes sense. Would it be possible to modify `memory.copy`/`fill` to take an alignment immediate like Julian proposed, but trap on a misaligned `src`/`dest` instead of viewing it as a hint? (This would match the alignment behavior of atomics.)
The main point of this would be to get some of the benefits of `mem.copy_hinted` with deterministic behavior. The drawback would be an alignment check when `alignment != 0`.
I'm not sure if it's reasonable to also take a `noOverlap` flag and trap when that fails as well. The overhead of a check there might be too much?
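For scale, the check being discussed could be as cheap as a single mask test before the wide-copy path. A sketch only; the function name and shape are purely illustrative:
// With an alignment immediate `align` (a power of two), the engine would emit
// roughly this test and, per the proposal, trap when it fails.
function alignmentCheckPasses(src, dest, align) {
  return ((src | dest) & (align - 1)) === 0;
}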
Assignee
Comment 34•5 years ago
(In reply to Luke Wagner [:luke] from comment #32)
On a side note, I'm a bit dubious of `memory.protect` if the only way to implement it (without terrible perf) is to use hardware memory protection. But it seems like `memory.protect` would have fairly messy semantics in racy write cases like `memory.copy` with any semantics, and we could just deal with that question if/when we did `memory.protect`.
It seems like, if our reaction to the current trapping semantics is to want to propose a new instruction, though, we really should reconsider the current instructions' trapping semantics (both `memory.copy` and memory init). This is basically implementation feedback, which is the point of Stage 3, and I wouldn't be surprised if V8 had a similar experience (since there's nothing especially FF-specific here) and desire to improve.
But just as a basic technical question: would "trap before any writes" semantics combined with the abovementioned approach of "load all source values into registers, and then write high-to-low" be a valid way to optimally inline `memory.copy`?
I agree that this would be valid. I would be slightly concerned about complexity and register pressure, but don't have enough experience to say if that would be a problem.
Assignee
Comment 35•5 years ago
I should also add that I hit a small roadblock with the patches around emitting correct trap metadata. I have to rework the patches a bit, but should hopefully have new versions soon.
Comment 36•5 years ago
Ahh, interesting, so the idea is that you declare an alignment, `memory.copy` traps if it's wrong, and once we know it's right, we can copy in register sizes up to that alignment quanta because on trap we're perfectly to-spec. My worry is that a common use of C `memcpy` is to efficiently copy bytes of unknown alignment, and so the compiler would need to emit an alignment of 1 in many cases, requiring us to fall back to the byte copies.
FWIW, if we simply limited the size of inline `memory.copy` to 128 bytes (so 8×128-bit SSE/NEON registers) I expect we'd get the vast majority of the win, and given 16/32 SIMD registers (on x64/ARM, resp.) register pressure would not be too much of an issue. In particular, my local `g++ -O3` only does an inline copy up to 16 bytes (!) before switching to a call, and `clang -O3` does an inline copy of up to 128 bytes. And, of course, this would avoid both the `src<dst` and any alignment branches.
Assignee
Comment 37•5 years ago
I've updated the benchmark with my latest patches and added some comparison between us and V8. It's on the second page.
One interesting point is that we're faster when we inline, but seem to have much more overhead in our OOL path than V8.
I did a quick profile and it seems like we're spending some significant time finding the memory length and memory base inside of WasmInstance::memCopy. (~3-4ms / ~8ms total). We should be able to easily pass the memoryBase from TlsData into the instance call. Using an atomic access to the memoryLength may help as well.
V8 looks like they do some trickery with shared memory by caching the memoryLength per-thread, and sending out an interrupt to all other threads running wasm to update their local copy when growing memory. That may be an option as well.
[1] https://docs.google.com/spreadsheets/d/1vf9EeRRi9gTc0JokKUcWwvL8wR0yKR24xL5QP4j6xbM/edit?usp=sharing
Assignee
Comment 38•5 years ago
The baseline compiler doesn't have enough information inside the plain load/store
path to remove bounds checks from `memory.copy` and `memory.fill` after
the first access of dest and src.
This commit passes an override to prevent bounds checks after the first
accesses. This is safe because we limit the allowable size of an inlined copy
to be less than the guard page size.
Depends on D43892
Assignee
Comment 39•5 years ago
I don't believe we have coverage here. I confirmed that these tests fail with
my earlier larger transfer width patches and pass with the current patches.
Depends on D44930
Assignee
Comment 40•5 years ago
(In reply to Luke Wagner [:luke] from comment #32)
On a side note, I'm a bit dubious of `memory.protect` if the only way to implement it (without terrible perf) is to use hardware memory protection. But it seems like `memory.protect` would have fairly messy semantics in racy write cases like `memory.copy` with any semantics, and we could just deal with that question if/when we did `memory.protect`.
It seems like, if our reaction to the current trapping semantics is to want to propose a new instruction, though, we really should reconsider the current instructions' trapping semantics (both `memory.copy` and memory init). This is basically implementation feedback, which is the point of Stage 3, and I wouldn't be surprised if V8 had a similar experience (since there's nothing especially FF-specific here) and desire to improve.
But just as a basic technical question: would "trap before any writes" semantics combined with the abovementioned approach of "load all source values into registers, and then write high-to-low" be a valid way to optimally inline `memory.copy`?
Is there a concrete spec or some discussion about `memory.protect`? Just curious, as I've heard it mentioned before but haven't seen anything.
Comment 41•5 years ago
(In reply to Ryan Hunt [:rhunt] from comment #40)
Nope, I think it's basically just the idea of "what if wasm had a mprotect()-like instruction so that, e.g., null (0) pointer accesses could fault?".
Assignee
Comment 42•5 years ago
As mentioned above, we have some room to improve our 'memory.copy/fill' cost when using a function call.
bench-copy-4.js | spidermonkey | v8 |
---|---|---|
ool | ~9ms | ~4ms |
Note: the above is also without any mutex/rwlock as this is a non-shared memory benchmark.
After profiling and some testing, it appears much of the cost is in acquiring the base pointer, length, and shared flag from WasmMemoryObject.
By storing this information in wasm::Instance and skipping access to WasmMemoryObject from Instance::memCopy, we reduce down to ~4.5ms. I'm guessing this is from better cache line utilization from not having to chase as many pointers. Passing this information directly from the JIT'ed code (where we have HeapReg) results in slightly better performance, but I'm not sure the complexity is worth it.
The tricky part is keeping this information in sync with the canonical copy in WasmMemoryObject.
- The shared memory flag is trivial as it's constant.
- The memory base pointer is trivial as we are already notified of a moving grow operation.
- The heap length is non-trivial as we have no easy way to be notified when a shared memory has been grown. The moving-grow observer list isn't sufficient as that is inside of WasmMemoryObject and only contains instances from that thread. The only object that is shared between threads is SharedArrayRawBuffer.
I spent some time digging into WasmArrayRawBuffer and SharedArrayRawBuffer, to see if we could store a list of instances in each to notify upon a grow operation with new memory parameters. This was pretty painful as there's a lot of logic entangled in the heap management code. Most objects can be destroyed and recreated during a grow, so there's not a clear place for this kind of information.
Additionally, some details, like whether a memory is huge or not, technically live in WasmMemoryObject but are required in the computation of boundsCheckLimit.
It would be really useful if there was some central place for this logic. I saw this comment [1] indicating there has been some thought here.
I'm thinking of a wasm::Memory object that actually maps/manages the heap for shared/local memories, is owned by WasmMemoryObject, but could be shared between threads when the memory is shared.
From some light searching, it seems like the main difficulty is that while ArrayBufferObject can have user owned memory, SharedArrayBufferObject cannot.
Is this a restriction that could be reasonably lifted and/or would the above refactoring be desirable?
I'm unsure if the difficulty of the refactoring is worthwhile just for the above microbenchmark, but it might be if it intersects with other plans.
Assignee
Comment 43•5 years ago
I have filed [1] to discuss potential changes to the spec.
[1] https://github.com/WebAssembly/bulk-memory-operations/issues/111
Assignee
Comment 44•5 years ago
I've updated the benchmark with comparisons of all the different proposed solutions [1].
I've hacked these within SpiderMonkey on baseline with 64-bit and huge memory. This was mostly for ease of hacking; the generated code isn't significantly impacted by not being in Ion.
1. OOL call - I've tried to make it as fast as possible by passing in the heap base and length to the instance call. Heap base is easily acquired with HeapReg. Heap length isn't readily available due to comment 42, so I pass in boundsCheckLimit as a proxy instead. This is incorrect, but should be equivalent for perf testing purposes. With these changes, the implementation is very close to V8's and our time falls in line with theirs at about ~4s.
2. Inline bytewise - The patches that are currently up. A comparison of src/dest to determine copy direction and two corresponding sequences of loads and stores per byte. This approach is competitive at small sizes, but loses at 16 bytes.
3. Inline align guard - Close to (2) with an added alignment check so that copies can be performed per Int32. This performs better than (2), but it's questionable whether it would make sense to change the spec to make this possible.
4. Inline register buffer - Luke's proposed solution that won't write anything if any part is out of bounds. This approach has the best overall performance.
5. Plain wasm loads/stores - An equivalent benchmark with memory.copy replaced with wasm loads and stores. For some reason, this actually performs better than (4) for 4 bytes and 8 bytes. Comparing the disassembly, the only difference is register allocation and scheduling; the same number and type of instructions are emitted. This makes me believe that (4) can be improved to be just as good here.

One additional consideration: the module bytecode size increased roughly 10x when using plain loads and stores rather than memory.copy, so there is a size savings with using memory.copy. This benchmark also leads me to believe that we can achieve equivalent or better performance with memory.copy for small constant sizes if we change the trapping behavior, so overall I think we should continue pursuing this spec change.
Comment 45•5 years ago
I really appreciate the thorough investigation, nice work! Two small follow-up questions regarding option #4:
- for Size = 4 and 8 cases, are SIMD registers or GPRs used? (Is there any perf difference? One might imagine using XMM avoids GPR pressure.)
- For Size = 16 and above, are full 16-byte SIMD registers used?
Comment 46•5 years ago
Filed some comments on the patch re MOVSX vs MOVZX and the use of REP MOVSB -- small potatoes, though I think the bytewise copy can do better than it currently does in a number of cases.
How are you performing time measurements? I looked at your JS code and did not see anything; is this with `time` at the shell or did I miss something? Assuming the times are in seconds the answer is probably not important, but I'm curious even so.
Assignee
Comment 47•5 years ago
(In reply to Luke Wagner [:luke] from comment #45)
I really appreciate the thorough investigation, nice work! Two small follow-up questions regarding option #4:
- for Size = 4 and 8 cases, are SIMD registers or GPRs used? (Is there any perf difference? One might imagine using XMM avoids GPR pressure.)
I tried the patches both with SIMD registers and with GPRs when comparing against plain Wasm loads/stores and did not notice a difference. The benchmark is fairly synthetic so a real application might have different characteristics.
- For Size = 16 and above, are full 16-byte SIMD registers used?
Yes, for Size = 16 the patch emits unaligned SSE 16byte loads and stores.
Assignee
Comment 48•5 years ago
(In reply to Lars T Hansen [:lth] from comment #46)
Filed some comments on the patch re MOVSX vs MOVZX and the use of REP MOVSB -- small potatoes, though I think the bytewise copy can do better than it currently does in a number of cases.
Thanks! I'll take a look at those.
How are you performing time measurements? I looked at your JS code and did not see anything; is this with `time` at the shell or did I miss something? Assuming the times are in seconds the answer is probably not important, but I'm curious even so.
Yes, I'm just using `time` with the shell, assuming that the overhead of starting the shell will be roughly equivalent across runs. I'm open to other suggestions; I didn't see anything in the JS shell that was obvious for benchmarking.
Comment 49•5 years ago
Canonically the most precise is to call performance.now() before your benchmark does its thing and then again after, and subtract the former value from the latter to give you a time measurement in milliseconds but typically with at least (IIUC) microsecond precision, in the shell. You can then print that:
let then = performance.now()
bench.run()
let now = performance.now()
print("RUNNING_TIME " + (now - then))
This API has been neutered in the browser and does not deliver that kind of precision there for security reasons.
I agree that for your tests the shell overhead is probably not dominating so it won't matter too much, but with the above method you can do many runs in the same process etc, and you will factor out both shell overhead and general wasm compilation overhead, and you can control for warmup of both CPU cache and various jit caches.
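For instance, building on the snippet above, one way to factor out warmup and get stable numbers (`bench.run()` here stands in for whatever the benchmark actually exposes):
// Warm up the JITs and CPU caches, then report the best of several timed runs.
for (let i = 0; i < 3; i++) bench.run();
let best = Infinity;
for (let i = 0; i < 10; i++) {
  let then = performance.now();
  bench.run();
  best = Math.min(best, performance.now() - then);
}
print("BEST_RUNNING_TIME " + best);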
Assignee
Comment 50•5 years ago
I spent time today hacking in a version that uses `rep movs` and added it to the benchmark. It was surprisingly tricky to implement correctly. Additionally, it seems to have very poor performance (about 10x worse). It's possible I may have missed something.
I also changed the bytewise copy to use zero extension instead of sign extension. I did not see any observable difference unfortunately.
And finally, I investigated the discrepancy between the non-trapping register buffer approach and plain wasm loads and stores. It turns out that my load-store benchmark was incorrect and was storing the destination address to the address given by the value loaded from the source address. Stack machines are tricky... Fixing this aligned plain load and store performance with the register buffer approach, making the register buffer approach the overall fastest for every measured size.
I'm also now attaching the patch that was used to generate these benchmarks if anyone wants to look at it. I'll also be adding the disassembly to notes on the attached spreadsheet so the curious can understand exactly what is being generated and compared for each approach.
Assignee
Comment 51•5 years ago
I'm also going to drop the reviews on the attached patches until the uncertainty for this bug is resolved.
Comment 52•5 years ago
Re REP MOVS, there are microarchitectural effects here. As per the Intel manual, it probably is not effective for lengths under 76 (!) bytes. And it appears to be quite sensitive to how we manipulate the direction flag, or perhaps the interleaving of high->low and low->high copies. Your benchmark does 6 copies from low->high and then 6 copies from high->low, and each high->low copy (pointer decrement) sets the direction flag before the op and clears it after, preserving the direction flag as "cleared" (as per ABI). However, for the 128 test case, if I instead do 12 copies in one direction (either one, so long as I set the flag in the correct meaning before all the copies and clear it after I'm done), the running time on my system drops from 23.5s to 6.4s, ie by a factor of 4. This still doesn't beat the time for the SIMD (register buffer) code however - in any scenario, though it gets fairly close in the unidirectional high->low case - so I might not bother to try to figure out if I'm seeing memory subsystem effects or effects of diddling the flag bit incessantly...
Assignee
Comment 53•5 years ago
Assignee
Comment 54•5 years ago
Depends on D50373
Assignee
Comment 55•5 years ago
Depends on D50374
Assignee
Comment 56•5 years ago
Depends on D50375
Assignee
Comment 57•5 years ago
Depends on D50376
Assignee
Comment 58•5 years ago
Depends on D50377
Assignee
Comment 59•5 years ago
Depends on D50378
Assignee
Comment 60•5 years ago
Depends on D50379
Assignee
Comment 61•5 years ago
Depends on D50380
Assignee
Comment 62•5 years ago
Depends on D50381
Assignee
Comment 63•5 years ago
Depends on D50382
Assignee
Comment 64•5 years ago
TODO: Support non-x64
Depends on D50383
Assignee
Comment 65•5 years ago
I've attached patches that:
- Makes our current OOL 'memory.copy' and 'memory.fill' faster by directly passing the heap base ptr and acquiring the heap length relative to the pointer
- Adds a byte-wise copy path for 'memory.copy' and 'memory.fill' for Ion and Baseline
- Adds a rep-movsb rep-stosb copy path for 'memory.copy' and 'memory.fill' for Ion and Baseline
- Adds a SIMD copy path for 'memory.copy' for Ion and Baseline
I've also written a benchmark that tests the bulk-memory instructions against the emscripten generated memcpy and memset, and unrolled plain loads/stores [1]. Results from a local run are here [2].
[1] https://github.com/eqrion/wasm-bulk-bench
[2] https://docs.google.com/spreadsheets/d/1Ri4NHfsqEHi4g58HSJcNoeJCh9wKU7_PQwMnn_0Vtfo/edit#gid=1674387044
Assignee
Comment 66•5 years ago
(In reply to Ryan Hunt [:rhunt] from comment #65)
I've attached patches that:
- Makes our current OOL 'memory.copy' and 'memory.fill' faster by directly passing the heap base ptr and acquiring the heap length relative to the pointer
- Adds a byte-wise copy path for 'memory.copy' and 'memory.fill' for Ion and Baseline
- Adds a rep-movsb rep-stosb copy path for 'memory.copy' and 'memory.fill' for Ion and Baseline
- Adds a SIMD copy path for 'memory.copy' for Ion and Baseline
I've also written a benchmark that tests the bulk-memory instructions against the emscripten generated memcpy and memset, and unrolled plain loads/stores [1]. Results from a local run are here [2].
[1] https://github.com/eqrion/wasm-bulk-bench
[2] https://docs.google.com/spreadsheets/d/1Ri4NHfsqEHi4g58HSJcNoeJCh9wKU7_PQwMnn_0Vtfo/edit#gid=1674387044
Some observations from the data.
- With improvements to our OOL implementation
  - We are competitive with plain memcpy from bytes=8 and higher.
  - Plain loads/stores are superior until bytes=32.
- With inline byte-wise copies
  - We are superior to plain memcpy until bytes=32 (at which point we should switch to our OOL implementation).
  - Plain loads/stores have equivalent performance for all measured sizes.
- With inline transactional copies
  - We have superior performance to plain memcpy and plain loads/stores for all measured sizes (currently up to 128).
This was tested on my x64 Macbook Pro. We should test on other platforms as well.
Comment 67•5 years ago
Nice results! Q: re (3), what does "transactional" mean in this context? /me is unclear.
Assignee
Comment 68•5 years ago
(In reply to Julian Seward [:jseward] (PTO 28 Oct - 8 Nov inclusive) from comment #67)
Nice results! Q: re (3), what does "transactional" mean in this context? /me is unclear.
Yes, I should have made that clear. This is the approach that loads all src bytes into registers using SIMD, then stores them to dst from high to low, so that we trap if anything is OOB. 'Transactional' in the sense that we either do the full copy or we trap and don't write anything.
Comment 69•5 years ago
Comment on attachment 9103768 [details]
Bug 1570112 - Remove locking from shared Wasm memory.
Revision D50374 was moved to bug 1591047. Setting attachment 9103768 [details] to obsolete.
Comment 70•5 years ago
Comment on attachment 9103769 [details]
Bug 1570112 - Split memCopy/memFill implementations for shared/non-shared modules.
Revision D50375 was moved to bug 1591047. Setting attachment 9103769 [details] to obsolete.
Comment 71•5 years ago
Comment on attachment 9103770 [details]
Bug 1570112 - Expose WasmArrayRawBuffer internally.
Revision D50376 was moved to bug 1591047. Setting attachment 9103770 [details] to obsolete.
Comment 72•5 years ago
Comment on attachment 9103771 [details]
Bug 1570112 - Track length in WasmArrayRawBuffer.
Revision D50377 was moved to bug 1591047. Setting attachment 9103771 [details] to obsolete.
Comment 73•5 years ago
Comment on attachment 9103772 [details]
Bug 1570112 - Pass heapBase to memCopy/memFill and use that to acquire length.
Revision D50378 was moved to bug 1591047. Setting attachment 9103772 [details] to obsolete.
Comment 74•5 years ago
Comment on attachment 9103773 [details]
Bug 1570112 - Fast path Ion loading of memoryBase for memcopy/memfill.
Revision D50379 was moved to bug 1591047. Setting attachment 9103773 [details] to obsolete.
Comment 75•5 years ago
Comment on attachment 9103774 [details]
Bug 1570112 - Split out 'emitMemCopy' function for dedicated optimizations.
Revision D50380 was moved to bug 1594204. Setting attachment 9103774 [details] to obsolete.
Assignee
Comment 76•5 years ago
With bug 1591047 and bug 1594204 landed, I don't see any difference in performance between LLVM 8 and LLVM 9 on the attached links. There may be future performance improvements, but for now I am going to close this.