Open Bug 862249 Opened 11 years ago Updated 7 months ago
Poor performance for simple memcpy/memset loops in JavaScript
Related to bug 730880 and bug 730873, I started working on a simple polyfill that provides proposed syntaxes (and behavior) for memset and memmove operations on typed arrays. This is intended for use in JSIL and Emscripten applications. However, even with a fairly well tuned version of the pure-JS memmove operation, performance is dramatically inferior to TypedArray.set, even though TypedArray.set has the disadvantage of creating GC pressure and GC pauses.

I can see three ways to address this:

1) Find ways to improve the codegen for native JS memcpy/memset loops so they are at least close to the performance of the builtin .set() method.
2) Optimize the destArray.set(new TypedArray(srcArray, srcOffset, srcCount), destOffset) pattern so that it does not cause GC pauses.
3) Close bugs 730880 and 730873, exposing high-performance operations that can be used for all of these use cases.

From digging around with ionflags in the shell, I don't see any bailout or invalidation problems with the native JS loops; the problem seems to be that the generated code (from Ion) still contains bounds checks, and in general the copy/set loops contain far more instructions than they should.

Note that destArray.set(new TypedArray(srcArray, srcOffset, srcCount), destOffset) is not currently an acceptable solution because neither SpiderMonkey nor V8 seems to optimize out the cost of creating the temporary typed array. You still get GC pressure, and the resulting GC pauses are very undesirable in realtime scenarios like games. If the GC pressure can be removed, it becomes an adequate solution for memcpy, but not for memset; memset still requires manually copying values into an array with a loop.

Related note: is this improved by asm.js? That is, do the bounds checks go away because the size of the typed arrays is known in advance, or because the asm.js compiler is able to determine in advance that indices will be in bounds?
Feedback on the proposed syntax in the polyfill (NativeMemoryOperations.js) is of course welcomed as well.
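For illustration, here is a minimal sketch of the two patterns discussed above. The function names are hypothetical (they are not the actual NativeMemoryOperations.js API), and the .set variant uses subarray to build the temporary view; the allocation it implies is the source of the GC pressure described in comment 0.

```javascript
// Pattern 1: builtin .set with a temporary view. The copy itself is fast,
// but the temporary typed array object creates GC pressure.
function memcpySet(dest, destOffset, src, srcOffset, count) {
  dest.set(src.subarray(srcOffset, srcOffset + count), destOffset);
}

// Pattern 2: handrolled loop. No allocation, but (per this bug) Ion keeps
// per-element bounds checks on both the load and the store.
function memcpyLoop(dest, destOffset, src, srcOffset, count) {
  for (var i = 0; i < count; i++)
    dest[destOffset + i] = src[srcOffset + i];
}

// memset has no allocation-free builtin analogue in this pattern, so a
// manual loop is required either way.
function memsetLoop(dest, offset, value, count) {
  for (var i = 0; i < count; i++)
    dest[offset + i] = value;
}
```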
I'm trying to do a memcpy in JS and it seems to still be utterly impossible to do it in a performant way in SpiderMonkey. No matter what I do, a handrolled memcpy loop gets two bounds checks per operation - one on the read, one on the write. .set is still incredibly slow and has GC overhead to boot. Is there a magic incantation I can use here? This is responsible for SpiderMonkey being 4-8x slower than v8 on a use case some of my customers care about. I've tried dozens of different formulations and added all sorts of explicit bounds checks at the top of the loop, and nothing. This is still a far cry from what SpiderMonkey should be doing at this point, which is detecting a memcpy loop and emitting an intrinsic, the way a C compiler would.
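One of the formulations alluded to above (an explicit range check hoisted to the top of the loop, in hopes the JIT drops the per-element checks) might look like the following. This is an assumed shape; the actual benchmark attached to this bug may differ.

```javascript
// Hoist a single range check before the loop. In principle the JIT could use
// this to prove the per-element accesses in bounds and elide its own checks;
// per this comment, SpiderMonkey did not.
function memcpyChecked(dest, destOffset, src, srcOffset, count) {
  if (srcOffset < 0 || destOffset < 0 ||
      srcOffset + count > src.length ||
      destOffset + count > dest.length)
    throw new RangeError("memcpy out of bounds");
  for (var i = 0; i < count; i++)
    dest[destOffset + i] = src[srcOffset + i];
}
```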
(In reply to K. Gadd (:kael) from comment #1) > I'm trying to do a memcpy in JS and it seems to still be utterly impossible > to do it in a performant way in SpiderMonkey. No matter what I do, a > handrolled memcpy loop gets two bounds checks per operation - one on the > read, one on the write. Do you have a simple micro-benchmark you can attach here?
Do you mean there's something wrong with the existing microbenchmark attached to this bug?
(In reply to K. Gadd (:kael) from comment #3) > Do you mean there's something wrong with the existing microbenchmark > attached to this bug? My bad, I saw a 1.14 MB attachment and thought "this is not a micro-benchmark", but 99% of it is js.exe. I also filed bug 894881 a while ago but unfortunately it fell off the radar; will look into this again.
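For reference, a minimal micro-benchmark of the kind being discussed (illustrative only; the attachment on this bug may measure something different) just times .set against a handrolled loop over the same buffers:

```javascript
// Crude wall-clock timer; good enough to show an order-of-magnitude gap.
function bench(label, fn, iters) {
  var t0 = Date.now();
  for (var k = 0; k < iters; k++) fn();
  console.log(label + ": " + (Date.now() - t0) + " ms");
}

var src = new Uint8Array(4096), dst = new Uint8Array(4096);
for (var i = 0; i < src.length; i++) src[i] = i & 0xff;

bench("TypedArray.set", function () { dst.set(src); }, 1000);
bench("handrolled loop", function () {
  for (var j = 0; j < src.length; j++) dst[j] = src[j];
}, 1000);
```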
Assignee: general → jdemooij
Status: NEW → ASSIGNED
I should note that on a whim I tested out the specific formulation of memcpy used by emscripten, and SpiderMonkey optimizes out the bounds check for that loop, so that's an improvement. However, it's not obvious to me why emscripten's memcpy is the only fast one, because it actually *leaves out* important bounds checks, seemingly assuming that the VM will insert them (of course, it does) - adding the checks you'd think should be there deopts your loop. Despite the bounds checks being gone (an improvement, certainly!) the loop still seems to generate kind of miserable code according to JIT Inspector:

for (var sourceEnd = (offset + 12) | 0, i = offset, j = 0; i < sourceEnd; i++, j++)
  scratchBytes[j] = bytes[i];

->

[InterruptCheckImplicit]
[OsiPoint]
[CompareAndBranch:lt]
  cmpl %esi, %edx
  jge ((324))
[LoadTypedArrayElement]
  movzbl 0(%ecx,%edx,1), %edi
[MoveGroup]
  xchgl %edi, %eax
  movl 0x74(%esp), %ecx
[StoreTypedArrayElement]
  movb %al, 0(%ecx,%edi,1)
[AddI]
  addl $0x1, %edx
[AddI:OverflowCheck]
  addl $0x1, %edi
  jo ((369))
[MoveGroup]
  movl %ecx, 0x74(%esp)
  movl 0x70(%esp), %ecx
  movl %edi, %eax
[Goto]
  jmp ((384))
##link ((384)) jumps to ((384))

Maybe the actual generated code is much better (not sure how accurate JIT Inspector is), but that loop seems like it could be considerably better. TypedArray.set is still a poor alternative here since it produces a native call (not inlined).
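As a runnable function, the emscripten-style formulation quoted in this comment amounts to the following (a hedged reconstruction; the function name and signature are illustrative). Note it adds no explicit range checks: the VM inserts its own, which, per this comment, Ion can then hoist.

```javascript
// Emscripten-style copy: precompute the end index with |0 and advance
// separate read (i) and write (j) cursors.
function copyEmscriptenStyle(dest, src, offset, len) {
  for (var end = (offset + len) | 0, i = offset, j = 0; i < end; i++, j++)
    dest[j] = src[i];
}
```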
Whiteboard: [games:p?] → [games:p2]
Whiteboard: [games:p2] → [games:p?]
Severity: major → --