Open Bug 862249 Opened 11 years ago Updated 7 months ago
Poor performance for simple memcpy/memset loops in JavaScript
Related to bug 730880 and bug 730873, I started working on a simple polyfill that provides proposed syntaxes (and behavior) for memset and memmove operations on typed arrays. This is intended for use in JSIL and Emscripten applications. However, even with a fairly well tuned version of the pure-JS memmove operation, performance is dramatically inferior to TypedArray.set, even though TypedArray.set has the disadvantage of creating GC pressure and GC pauses.

I can see three ways to address this:

1) Find ways to improve the codegen for native JS memcpy/memset loops so they are at least close to the performance of the builtin .set() method.
2) Optimize the destArray.set(new TypedArray(srcArray, srcOffset, srcCount), destOffset) pattern so that it does not cause GC pauses.
3) Close bugs 730880 and 730873, exposing high-performance operations that can be used for all of these use cases.

From digging around with ionflags in the shell, I don't see any bailout or invalidation problems with the native JS loops; the problem seems to be that the generated code (from Ion) still contains bounds checks, and in general the copy/set loops contain far more instructions than they should.

Note that destArray.set(new TypedArray(srcArray, srcOffset, srcCount), destOffset) is not currently an acceptable solution because neither SpiderMonkey nor V8 seems to optimize out the cost of creating the temporary typed array. You still get GC pressure, and the resulting GC pauses are very undesirable in realtime scenarios like games. If the GC pressure can be removed, it becomes an adequate solution for memcpy, but not for memset; memset still requires manually copying values into an array with a loop.

Related note: is this improved by asm.js? That is, do the bounds checks go away because the size of the typed arrays is known in advance, or because the asm.js compiler is able to determine in advance that indices will be in bounds?
Feedback on the proposed syntax in the polyfill (NativeMemoryOperations.js) is of course welcomed as well.
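For illustration, here is a minimal sketch of the two patterns discussed above. The function names are hypothetical (they are not the actual NativeMemoryOperations.js API), and the .set variant uses subarray to build the temporary view; the allocation it implies is the source of the GC pressure described in comment 0.

```javascript
// Pattern 1: builtin .set with a temporary view. The copy itself is fast,
// but the temporary typed array object creates GC pressure.
function memcpySet(dest, destOffset, src, srcOffset, count) {
  dest.set(src.subarray(srcOffset, srcOffset + count), destOffset);
}

// Pattern 2: handrolled loop. No allocation, but (per this bug) Ion keeps
// per-element bounds checks on both the load and the store.
function memcpyLoop(dest, destOffset, src, srcOffset, count) {
  for (var i = 0; i < count; i++)
    dest[destOffset + i] = src[srcOffset + i];
}

// memset has no allocation-free builtin analogue in this pattern, so a
// manual loop is required either way.
function memsetLoop(dest, offset, value, count) {
  for (var i = 0; i < count; i++)
    dest[offset + i] = value;
}
```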
I'm trying to do a memcpy in JS and it seems to still be utterly impossible to do it in a performant way in SpiderMonkey. No matter what I do, a handrolled memcpy loop gets two bounds checks per operation - one on the read, one on the write. .set is still incredibly slow and has GC overhead to boot. Is there a magic incantation I can use here? This is responsible for SpiderMonkey being 4-8x slower than v8 on a use case some of my customers care about. I've tried dozens of different formulations and added all sorts of explicit bounds checks at the top of the loop, and nothing. This is still a far cry from what SpiderMonkey should be doing at this point, which is detecting a memcpy loop and emitting an intrinsic, the way a C compiler would.
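One of the formulations alluded to above (an explicit range check hoisted to the top of the loop, in hopes the JIT drops the per-element checks) might look like the following. This is an assumed shape; the actual benchmark attached to this bug may differ.

```javascript
// Hoist a single range check before the loop. In principle the JIT could use
// this to prove the per-element accesses in bounds and elide its own checks;
// per this comment, SpiderMonkey did not.
function memcpyChecked(dest, destOffset, src, srcOffset, count) {
  if (srcOffset < 0 || destOffset < 0 ||
      srcOffset + count > src.length ||
      destOffset + count > dest.length)
    throw new RangeError("memcpy out of bounds");
  for (var i = 0; i < count; i++)
    dest[destOffset + i] = src[srcOffset + i];
}
```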
(In reply to K. Gadd (:kael) from comment #1) > I'm trying to do a memcpy in JS and it seems to still be utterly impossible > to do it in a performant way in SpiderMonkey. No matter what I do, a > handrolled memcpy loop gets two bounds checks per operation - one on the > read, one on the write. Do you have a simple micro-benchmark you can attach here?
Do you mean there's something wrong with the existing microbenchmark attached to this bug?
(In reply to K. Gadd (:kael) from comment #3) > Do you mean there's something wrong with the existing microbenchmark > attached to this bug? My bad, I saw a 1.14 MB attachment and thought "this is not a micro-benchmark", but 99% of it is js.exe. I also filed bug 894881 a while ago but unfortunately it fell off the radar; will look into this again.
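For reference, a minimal micro-benchmark of the kind being discussed (illustrative only; the attachment on this bug may measure something different) just times .set against a handrolled loop over the same buffers:

```javascript
// Crude wall-clock timer; good enough to show an order-of-magnitude gap.
function bench(label, fn, iters) {
  var t0 = Date.now();
  for (var k = 0; k < iters; k++) fn();
  console.log(label + ": " + (Date.now() - t0) + " ms");
}

var src = new Uint8Array(4096), dst = new Uint8Array(4096);
for (var i = 0; i < src.length; i++) src[i] = i & 0xff;

bench("TypedArray.set", function () { dst.set(src); }, 1000);
bench("handrolled loop", function () {
  for (var j = 0; j < src.length; j++) dst[j] = src[j];
}, 1000);
```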
Assignee: general → jdemooij
Status: NEW → ASSIGNED
I should note that on a whim I tested out the specific formulation of memcpy used by emscripten, and SpiderMonkey optimizes out the bounds check for that loop, so that's an improvement. However, it's not obvious to me why emscripten's memcpy is the only fast one, because it actually *leaves out* important bounds checks, seemingly assuming that the VM will insert them (of course, it does) - adding the checks you'd think should be there deopts your loop. Despite the bounds checks being gone (an improvement, certainly!) the loop still seems to generate kind of miserable code according to JIT Inspector:

for (var sourceEnd = (offset + 12) | 0, i = offset, j = 0; i < sourceEnd; i++, j++)
  scratchBytes[j] = bytes[i];

->

[InterruptCheckImplicit]
[OsiPoint]
[CompareAndBranch:lt]
  cmpl %esi, %edx
  jge ((324))
[LoadTypedArrayElement]
  movzbl 0(%ecx,%edx,1), %edi
[MoveGroup]
  xchgl %edi, %eax
  movl 0x74(%esp), %ecx
[StoreTypedArrayElement]
  movb %al, 0(%ecx,%edi,1)
[AddI]
  addl $0x1, %edx
[AddI:OverflowCheck]
  addl $0x1, %edi
  jo ((369))
[MoveGroup]
  movl %ecx, 0x74(%esp)
  movl 0x70(%esp), %ecx
  movl %edi, %eax
[Goto]
  jmp ((384))
##link ((384)) jumps to ((384))

Maybe the actual generated code is much better (not sure how accurate JIT Inspector is), but that loop seems like it could be considerably better. TypedArray.set is still a poor alternative here since it produces a native call (not inlined).
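As a runnable function, the emscripten-style formulation quoted in this comment amounts to the following (a hedged reconstruction; the function name and signature are illustrative). Note it adds no explicit range checks: the VM inserts its own, which, per this comment, Ion can then hoist.

```javascript
// Emscripten-style copy: precompute the end index with |0 and advance
// separate read (i) and write (j) cursors.
function copyEmscriptenStyle(dest, src, offset, len) {
  for (var end = (offset + len) | 0, i = offset, j = 0; i < end; i++, j++)
    dest[j] = src[i];
}
```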
Whiteboard: [games:p?] → [games:p2]
Whiteboard: [games:p2] → [games:p?]
Severity: major → --