Open Bug 1521448 Opened 5 years ago Updated 2 years ago

Improve performance of possibly-shared safe-for-races memcpy/memmove on non-equal types

Categories

(Core :: JavaScript Engine, enhancement, P3)

People

(Reporter: lth, Unassigned)

From the commit message on bug 1394420:

The performance story is not completely satisfactory:

On the one hand, we don't regress anything because copying
unshared-to-unshared we do not use the new primitives but instead the
C++ compiler's optimized memcpy and standard memory loads and stores.

On the other hand, performance with shared memory is lower than
performance with unshared memory. TypedArray.prototype.set() is a
good test case. When the source and target arrays have the same type,
the engine uses a memcpy; shared memory copying is 3x slower than
unshared memory for 100,000 8K copies (Uint8). However, when the
source and target arrays are slightly different types (Uint8 vs Int8)
the engine uses individual loads and stores, which for shared memory
turns into two calls per byte being moved; in this case, shared memory
is 127x slower than unshared memory. (All numbers on x64 Linux.)

Can we live with the very significant slowdown in the latter case? It
depends on the applications we envision for shared memory. Primarily,
shared memory will be used as wasm heap memory, in which case most
applications that need to move data will use all Uint8Array arrays and
the slowdown is OK. But it is clearly a type of performance cliff.

We can reduce the overhead by jit-generating more code, specifically
code to perform the load, convert, and store in common cases. More
interestingly, and simpler, we can probably use memcpy in all cases by
copying first (fairly fast) and then running a local fixup. A bug
should be filed for this but IMO we're OK with the current solution.

(Memcpy can also be further sped up in platform-specific ways by
generating cleverer code that uses REP MOVS or SIMD or similar.)

It's probably not urgent to fix this but eventually somebody will stumble across the problem.

The speculation above that we can do a fast, raw copy to the destination buffer and then fix it up is probably wrong in general: if the destination memory is shared, other agents could observe the intermediate, unconverted state. If the destination memory is unshared, however, the agent performing the copy and fixup is the agent that owns the memory, so the approach should be OK, modulo errors that can happen during fixup and leave memory in an unfixed state.
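A minimal C++ sketch of the copy-then-fixup idea for an unshared destination, using Int8 to Uint8Clamped as the example conversion. Everything here is hypothetical and not taken from the engine: raceSafeMemcpy merely stands in for the safe-for-races primitive (plain std::memcpy is not race-safe), and copyInt8ToUint8Clamped is an illustrative name.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for the engine's safe-for-races copy primitive.
// Plain std::memcpy is NOT safe for racy memory; it is used only so that
// this sketch compiles and runs.
inline void raceSafeMemcpy(void* dst, const void* src, std::size_t nbytes) {
    std::memcpy(dst, src, nbytes);
}

// Copy `count` Int8 elements from a possibly-shared source into an UNSHARED
// Uint8Clamped destination: one bulk copy, then a local fixup pass.
// Assumption: no other agent can observe `unsharedDst` between the two steps.
inline void copyInt8ToUint8Clamped(std::uint8_t* unsharedDst,
                                   const std::int8_t* maybeSharedSrc,
                                   std::size_t count) {
    if (count == 0) {
        return;
    }
    assert(unsharedDst != nullptr && maybeSharedSrc != nullptr);

    // Step 1: raw byte copy (the fast part).
    raceSafeMemcpy(unsharedDst, maybeSharedSrc, count);

    // Step 2: local fixup. A byte with the top bit set was a negative Int8
    // value, which clamps to 0; non-negative values are already correct.
    for (std::size_t i = 0; i < count; ++i) {
        if (unsharedDst[i] & 0x80) {
            unsharedDst[i] = 0;
        }
    }
}
```

This fixup cannot fail, so the unfixed-state concern does not arise in this particular case; a conversion whose fixup can fail partway would need extra care.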

Clearly, for the many copies that do not actually change the representation (e.g. int8 -> uint8), we can also use a fast memcpy.
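One way to detect such copies, sketched in C++ under the assumption that a conversion between two non-clamped integer types of the same width leaves every bit pattern unchanged. The Elem enum and the helper names are hypothetical, not the engine's own; Uint8Clamped is deliberately left out because storing into it can change values.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical element-type tag; Uint8Clamped is intentionally omitted.
enum class Elem { Int8, Uint8, Int16, Uint16, Int32, Uint32, Float32, Float64 };

// Size in bytes of one element of the given type.
inline std::size_t elemSize(Elem t) {
    switch (t) {
        case Elem::Int8:
        case Elem::Uint8:
            return 1;
        case Elem::Int16:
        case Elem::Uint16:
            return 2;
        case Elem::Int32:
        case Elem::Uint32:
        case Elem::Float32:
            return 4;
        case Elem::Float64:
            return 8;
    }
    return 0;  // unreachable for valid tags
}

inline bool isInteger(Elem t) {
    return t != Elem::Float32 && t != Elem::Float64;
}

// True when converting src -> dst keeps every bit pattern intact, so a
// (race-safe) memcpy could replace the per-element load/convert/store loop.
inline bool sameRepresentation(Elem src, Elem dst) {
    if (src == dst) {
        return true;
    }
    return isInteger(src) && isInteger(dst) && elemSize(src) == elemSize(dst);
}
```

A copy routine could consult sameRepresentation(src, dst) first and fall back to individual loads and stores only when it returns false.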

Severity: normal → S3