Closed Bug 1570112 Opened 6 years ago Closed 5 years ago

Performance of `memory.copy` is quite bad compared to wasm-written `memcpy` function

Status: RESOLVED FIXED

Tracking Flags: firefox70 (tracking: ---, status: affected)

People

(Reporter: acrichto, Assigned: rhunt)

Attachments

(19 obsolete files)

Bug 1570112 - Factor out PlatformRWLock from XPCOM and expose in JS. r?froydnj 6 years ago Ryan Hunt [:rhunt] 47 bytes, text/x-phabricator-request
Bug 1570112 - Use RWLock to allow concurrent access to shared WasmMemory length from multiple threads. r?bbouvier 6 years ago Ryan Hunt [:rhunt] 47 bytes, text/x-phabricator-request
Bug 1570112 - Wasm: Inline 'memory.fill' when length is a constant within hardcoded limit. r?jseward 6 years ago Ryan Hunt [:rhunt] 47 bytes, text/x-phabricator-request
Bug 1570112 - Wasm: Inline 'memory.copy' when length is a constant within hardcoded limit. r?jseward 6 years ago Ryan Hunt [:rhunt] 47 bytes, text/x-phabricator-request
Bug 1570112 - Wasm: Suppress bounds check in baseline after the first check of dest and src. r?jseward 6 years ago Ryan Hunt [:rhunt] 47 bytes, text/x-phabricator-request
Bug 1570112 - Wasm: Add a test for partial writing of bulk memory operations. r?jseward 6 years ago Ryan Hunt [:rhunt] 47 bytes, text/x-phabricator-request
mem-copy-bench.patch 5 years ago Ryan Hunt [:rhunt] 24.09 KB, patch
Bug 1570112 - Add a test for partial writing of bulk memory operations. 5 years ago Ryan Hunt [:rhunt] 47 bytes, text/x-phabricator-request
Bug 1570112 - Remove locking from shared Wasm memory. 5 years ago Ryan Hunt [:rhunt] 47 bytes, text/x-phabricator-request
Bug 1570112 - Split memCopy/memFill implementations for shared/non-shared modules. 5 years ago Ryan Hunt [:rhunt] 47 bytes, text/x-phabricator-request
Bug 1570112 - Expose WasmArrayRawBuffer internally. 5 years ago Ryan Hunt [:rhunt] 47 bytes, text/x-phabricator-request
Bug 1570112 - Track length in WasmArrayRawBuffer. 5 years ago Ryan Hunt [:rhunt] 47 bytes, text/x-phabricator-request
Bug 1570112 - Pass heapBase to memCopy/memFill and use that to acquire length. 5 years ago Ryan Hunt [:rhunt] 47 bytes, text/x-phabricator-request
Bug 1570112 - Fast path Ion loading of memoryBase for memcopy/memfill. 5 years ago Ryan Hunt [:rhunt] 47 bytes, text/x-phabricator-request
Bug 1570112 - Split out 'emitMemCopy' function for dedicated optimizations. 5 years ago Ryan Hunt [:rhunt] 47 bytes, text/x-phabricator-request
Bug 1570112 - Assembler support for 'rep movs' and 'rep stos'. 5 years ago Ryan Hunt [:rhunt] 47 bytes, text/x-phabricator-request
Bug 1570112 - Add JIT fast paths for memory.copy with byte-by-byte trapping semantics. 5 years ago Ryan Hunt [:rhunt] 47 bytes, text/x-phabricator-request
Bug 1570112 - Add JIT fast paths for memory.fill with byte-by-byte trapping semantics. 5 years ago Ryan Hunt [:rhunt] 47 bytes, text/x-phabricator-request
Bug 1570112 - Add JIT fast path for memory.copy with transactional trapping semantics. 5 years ago Ryan Hunt [:rhunt] 47 bytes, text/x-phabricator-request
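
Several of the patch titles above distinguish "byte-by-byte" from "transactional" trapping semantics for memory.copy. The following is only a simplified model of that distinction, not SpiderMonkey code: `Trap`, `copy_bytewise` and `copy_transactional` are hypothetical names, linear memory is modelled as a plain byte slice, and the byte-wise variant assumes a forward copy of non-overlapping ranges.

```rust
/// Hypothetical stand-in for a wasm trap; not a real engine type.
#[derive(Debug, PartialEq)]
pub struct Trap;

/// Byte-by-byte model: each byte is bounds-checked as it is copied, so a
/// trap can leave earlier writes visible (a "partial write").
/// Assumption: forward copy only, ranges do not overlap.
pub fn copy_bytewise(mem: &mut [u8], dst: usize, src: usize, len: usize) -> Result<(), Trap> {
    for i in 0..len {
        let s = src.checked_add(i).ok_or(Trap)?;
        let d = dst.checked_add(i).ok_or(Trap)?;
        if s >= mem.len() || d >= mem.len() {
            return Err(Trap);
        }
        mem[d] = mem[s];
    }
    Ok(())
}

/// Transactional model: both ranges are checked up front, so a trap
/// leaves memory untouched and the copy itself needs no per-byte checks.
pub fn copy_transactional(mem: &mut [u8], dst: usize, src: usize, len: usize) -> Result<(), Trap> {
    let src_end = src.checked_add(len).ok_or(Trap)?;
    let dst_end = dst.checked_add(len).ok_or(Trap)?;
    if src_end > mem.len() || dst_end > mem.len() {
        return Err(Trap);
    }
    // copy_within has memmove-like behaviour, so overlap is fine here.
    mem.copy_within(src..src_end, dst);
    Ok(())
}
```

In this model, an out-of-bounds copy into the last two bytes of an 8-byte memory writes those two bytes before trapping in the byte-wise variant, while the transactional variant traps without modifying anything.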

Alex Crichton [:acrichto]

Reporter

Description

6 years ago

Recently the Rust compiler updated from LLVM 8 to LLVM 9. LLVM 9 brought some major changes around threads and WebAssembly, and one indirect consequence is that it now fully supports the threads and bulk memory proposals of WebAssembly. To use threads with LLVM 9 you are also required to enable the bulk-memory feature in LLVM.

With the bulk-memory feature comes the memory.copy instruction. Rust code ends up generating quite a few calls to memcpy on all platforms, and Rust's performance relies fairly heavily on an efficient memcpy implementation. For most platforms LLVM will lower specific memcpy calls (e.g. a constant copy of 16 bytes) into inline optimized instructions, but it looks like for WebAssembly LLVM 9 is lowering almost all memcpy calls into memory.copy instructions. This means that Rust programs compiled through LLVM 9 contain a significant number of memory.copy instructions, many of which are performance critical.
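
As a rough illustration of the kind of Rust code involved (a sketch, not taken from the raytracer; `copy16` and `copy_prefix` are hypothetical names): the first function is a constant 16-byte copy of the sort LLVM would usually expand inline on most targets, and the second is a variable-length copy that goes through memcpy. Going by the description above, both would be expected to end up as memory.copy on WebAssembly when compiled through LLVM 9 with bulk-memory enabled, though that has not been verified here.

```rust
/// Constant-size copy: the length (16) is known at compile time.
pub fn copy16(dst: &mut [u8; 16], src: &[u8; 16]) {
    dst.copy_from_slice(src);
}

/// Variable-size copy: copies as many bytes as fit and returns that count.
pub fn copy_prefix(dst: &mut [u8], src: &[u8]) -> usize {
    let n = dst.len().min(src.len());
    dst[..n].copy_from_slice(&src[..n]);
    n
}
```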

The main "Rust and threads" example that we've historically used is a parallel raytracer, and so with the LLVM 9 update that was my main test case for ensuring that the Rust toolchain works with LLVM 9. Once all was said and done, however, I was pretty surprised at the performance! The previous LLVM 8-compiled code would render one frame in ~300ms on my machine (with the max number of threads), but the same code compiled with LLVM 9 would render a frame in ~2000ms.

After some profiling using perf.html I realized that 80+% of the time was spent in the native memcpy implementation, with a seemingly large share of that in some sort of synchronization. Overall, it definitely seemed like the memory.copy instructions emitted by LLVM 9 were the culprit for most of the slowdown, if not all of it.

I've prepared a GitHub repository with precompiled versions of the raytracing example, and you can see the two rendered versions online:

Note that those require Firefox Nightly with SharedArrayBuffer enabled, so they won't work in stable Firefox.

In talking with Luke it sounded like a bug was the best place to report this. If y'all need any more information from me, please just let me know! I suspect that the whole parallel raytracing setup isn't actually needed to showcase the slowdown; it's likely only necessary to use memory.copy, but I figured it'd be good to show the whole example.
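
A smaller reproduction along those lines could look roughly like the sketch below. It is an assumption about how such a test might be structured, not the benchmark attached to this bug: `copy_loop`, `copy_builtin` and `time_copy` are hypothetical names, and the timing uses `std::time::Instant`, which works natively but would need replacing with a browser timer on wasm32-unknown-unknown. The hand-written loop stands in for a wasm-written memcpy, while the slice copy is the path that would become memory.copy under bulk-memory.

```rust
use std::hint::black_box;
use std::time::{Duration, Instant};

/// Hand-written byte loop, standing in for a "wasm-written memcpy".
pub fn copy_loop(dst: &mut [u8], src: &[u8]) {
    for (d, s) in dst.iter_mut().zip(src) {
        // black_box discourages the optimizer from recognizing the loop
        // and turning it back into a memcpy call.
        *d = black_box(*s);
    }
}

/// Built-in slice copy, which LLVM treats as memcpy.
pub fn copy_builtin(dst: &mut [u8], src: &[u8]) {
    dst.copy_from_slice(src);
}

/// Times `iters` calls of `copy` on freshly allocated `size`-byte buffers.
pub fn time_copy(copy: fn(&mut [u8], &[u8]), size: usize, iters: u32) -> Duration {
    let src = vec![0xABu8; size];
    let mut dst = vec![0u8; size];
    let start = Instant::now();
    for _ in 0..iters {
        copy(black_box(&mut dst[..]), black_box(&src[..]));
    }
    start.elapsed()
}
```

Calling `time_copy(copy_loop, size, iters)` and `time_copy(copy_builtin, size, iters)` for a few small sizes (say 16 to 4096 bytes) would give a first comparison; small constant sizes are the interesting case, since those are the copies the description says are performance critical.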

Also, FWIW, I originally ran these benchmarks on Windows 10.

Alex Crichton [:acrichto]

Reporter

Comment 1

6 years ago

Also in case it's useful, this is a perf.html capture of rendering two frames on my computer