Closed Bug 601072 Opened 15 years ago Closed 2 years ago

Unsigned Multiplication in Javascript is very slow compared to C

Tracking

()

Status:

RESOLVED FIXED

People

(Reporter: alexis, Unassigned)

References

(Depends on 1 open bug)

Details

Attachments

(6 files)

Javascript Code Sample 15 years ago Alexis Shaw 814 bytes, text/html		Details
C code sample -- note x86 dependant -- GCC inline asm. 15 years ago Alexis Shaw 1.17 KB, text/plain		Details
Inlined Javascript version. 15 years ago Alexis Shaw 788 bytes, text/html		Details
Optomised non inlined version 15 years ago Alexis Shaw 835 bytes, text/html		Details
Work around slow array generation in FF. 15 years ago Alexis Shaw 846 bytes, text/html		Details
Assembly log 15 years ago Boris Zbarsky [:bzbarsky] 73.15 KB, text/plain		Details

Alexis Shaw

Reporter

Description

•

15 years ago

User-Agent: Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en-US; rv:1.9.2.10) Gecko/20100914 Firefox/3.6.10 Build Identifier: Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en-US; rv:1.9.2.10) Gecko/20100914 Firefox/3.6.10 Unsigned Multiplication in javascript is very slow. In Firefox 3.6 the code example takes 46026 ms In Chrome 6.0.472.62 it takes 11358 ms C it takes 310ms. Please Find attached code samples in both C and Javascript. Reproducible: Always

Alexis Shaw

Reporter

Comment 1

•

15 years ago

Attached file Javascript Code Sample — Details

Alexis Shaw

Reporter

Comment 2

•

15 years ago

Attached file C code sample -- note x86 dependant -- GCC inline asm. — Details

Jan de Mooij [:jandem]

Comment 3

•

15 years ago

Somehow I suspect most of the time here is spent in object allocation and accessing globals (a variable is global if you omit the "var" keyword, globals are much slower than locals). I had to reduce the number of iterations from 1 << 26 to 1 << 24 to avoid a slow script warning in Safari and Firefox. With this, I get the following numbers: Chrome: 2379 FF4b7pre: 2752 Safari: 5506 As you can see, FF4 will be much faster here :)

Alexis Shaw

Reporter

Comment 4

•

15 years ago

Attached file Inlined Javascript version. — Details

This is an inlined version of the javascript.

Alexis Shaw

Reporter

Comment 5

•

15 years ago

Attached file Optomised non inlined version — Details

bug with erlier version was using globals. this version uses locals.

Jesse Ruderman

Updated

•

15 years ago

Summary: Unsigned Multiplication in Javascript is very slow. → Unsigned Multiplication in Javascript is very slow compared to C

Alexis Shaw

Reporter

Comment 6

•

15 years ago

Attached file Work around slow array generation in FF. — Details

Alexis Shaw

Reporter

Comment 7

•

15 years ago

So with optimisations the new numbers are: Inlined Chrome:4124 FF3.6:3510 Safari:5005 Normal Chrome:4562 FF3.6: 4041 Safari:4088 note: this is using a passed in and recycled array, if you make a new array you get the following numbers Chrome:4604 FF 3.6: 25564 Safari:16047

Boris Zbarsky [:bzbarsky]

Updated

•

15 years ago

Attachment #480042 - Attachment mime type: application/octet-stream → text/plain

Boris Zbarsky [:bzbarsky]

Comment 8

•

15 years ago

So on that last benchmark, we're within a factor of 2x of unoptimized C (with handcrafted inline assembly for the hot path, which makes it not very C, btw... though if I just use unsigned long long and skip the assembly gcc seems to actually do a much better job optimizing; the unoptimized code is about the same, but the optimized is about 2x faster than with the inlined assembly), which we generally consider "pretty good". Interestingly, rewriting imul like so: var v=a*b; out[1]= v % (1 << z); out[0]= v - out[1]; actually makes it _slower_ (lots of time spent under fmod). I get an additional 20% speedup or so if I do this: out[0]=(v/(1<<z)) | 0; instead of using Math.floor, at least on m-c. I think there's been some more optimization work on TM to make floor() possibly behave better. Also, hoisting the 1<<z out of the loop makes this faster yet; bug 545406 might well help with that. We could try a builtin, but it's not clear to me that the cost of setting up the function call, etc, will really make up for things. Note that in gcc unoptimized it's the cost of not inlining the function that really makes the time shoot up, and we can't inline a C builtin into the trace, obviously. It's also not clear to me how to sanely create a builtin that returns two separate values. What would the actual JS-facing API look like? For reference, the version that returns a new array every time is totally gc-system bound, at least on Mac: 37% of the time is spent doing free() calls on the finalizer thread, 38% of the time is spent allocating arrays (about 2/3 of this in the realloc js_NewPreallocatedArray calls via growDenseArrayElements), 14% is under js_GC. That adds up to 89% for those of us counting.... In Chrome the "create new arrays" testcase does about as well as the other; I'd love to know how they manage that. ;)

Alexis Shaw

Reporter

Comment 9

•

15 years ago

Could I get the generated code for inside the loop? perhaps we can just special case this and get better? Note the asm is not really hand crafted, I wanted an unsigned multiply, and I do not know how to do that in C. note that a function that returns has been handled, destructing assignment. just return an array, then say to use destructing assignment on them. also note that with a function you can make this work with z>26 all the way upto Z=53, if you wanted to. and why cannot you inline a builtin into the trace? of course checking in the trace conditions that the code actually is a number?

Alexis Shaw

Reporter

Comment 10

•

15 years ago

Also , perhaps instead of defining this function, you could define a function mul that does multiplication on an array of integers with an input z for the bit window size, for example mul(a,b,z) a,b arrays of numbers, or error. a[i],b[i] converted to integers mod 2^z ai[i], bi[i] treating ai[i] and bi[i] as numbers a= sum^k ai[i]*2^z^k and b=sum^k bi[i]*2^z^k compute v=a*b and express in the form v[i]=sum^k v[i]*2^z^k veturn v[i], taking the loop out of the equation, if that is easier, it is probably more efficient.

OS: Mac OS X → Windows 7

Boris Zbarsky [:bzbarsky]

Comment 11

•

15 years ago

> I wanted an unsigned multiply, and I do not know how to do that in C. Just declare the operands as unsigned (or cast them to unsigned while multiplying). > just return an array, That runs into the same GC issue as the scripted function returning an array. > also note that with a function you can make this work with z>26 Sure. That's a cost to keep in mind; it's not clear to me how much this impacts the sorts of code that might use this thing. > and why cannot you inline a builtin into the trace? You can, and we have, for common cases for things that are used often and that are easy to express in LIR. That's a somewhat high bar, though... In general, more general things that are useful in more cases are more likely candidates to be added than specialized things. :) I'll attach the assembly in a few minutes.

Boris Zbarsky [:bzbarsky]

Comment 12

•

15 years ago

Attached file Assembly log — Details

This is the initial trace; there are several branches we compile later as we detect type-stability issues as we go, as far as I can tell. But this is the only thing that includes the whole loop, since the branches start part way through it. The lines starting with 0x... are the generated assembly (x86-64 in my case). The rest is annotations explaining what LIR is being compiled. You can get just the assembly by grepping this log for " 0x" (two leading spaces).

Alexis Shaw

Reporter

Comment 13

•

15 years ago

>> just return an array, >That runs into the same GC issue as the scripted function returning an array. You are not actually returning an array, you are passing an array to a distructing-assignment, this means that you are just returning multiple values, if there is o destructing assignment there, you can then just ignore it. >> also note that with a function you can make this work with z>26 >Sure. That's a cost to keep in mind; it's not clear to me how much this >impacts the sorts of code that might use this thing. The long multiplication algorithm has cost: C(z,b)=(b/z)^2 so C(26,b)/C(32,b) =~1.51 For Karatsuba multiplication it has cost: C(z,b)=(b/z)^1.585 so C(26,b)/C(32,b) =~1.38 (note this is for numbers >400 bits, and that this is a recursive algorithm) > I wanted an unsigned multiply, and I do not know how to do that in C. I see how to do that now, just not with it at that time. anyway, it only made the cost worse.

Alexis Shaw

Reporter

Comment 14

•

15 years ago

There is obviously a way you can make this faster. probably by a lot. I do not know how to implement this though.

OS: Windows 7 → Windows XP

Boris Zbarsky [:bzbarsky]

Comment 15

•

15 years ago

> if there is o destructing assignment there, you can then just ignore it. A C-implemented native has no way to tell what's going to happen to the result. Often enough neither can the tracer, unless the destructuring assignment is right there. Optimizing this is not as easy as you make it sound. ;)

Alexis Shaw

Reporter

Comment 16

•

15 years ago

>> if there is o destructing assignment there, you can then just ignore it. > >A C-implemented native has no way to tell what's going to happen to the result. > >Often enough neither can the tracer, unless the destructuring assignment is >right there. Optimizing this is not as easy as you make it sound. ;) Well we only guarantee that the statement is right there or performance cost. we do not have to optimise for every case, just the case that is most often used. have 2 versions, one that optomises this case: [a,b] = SomeFunctionValue() and a version that optimizes everything else.

Brendan Eich [:brendan]

Comment 17

•

15 years ago

Alexis's comment 16 is right on, we always meant to use an old Scheme trick for return-of-array-to-destructuring-assignment-consumer case, mentioned by Lars Hansen in http://wiki.ecmascript.org/doku.php?id=proposals:group_assignment: "(There is actually a standard trick here in which a continuation that receives multiple values is marked specially so that an invocation that returns multiple values will not cons up an object, but instead return the values through a faster protocol. With proper tail recursion this works very well, but even without it will cover many cases in practice.) — Lars T Hansen 2006/03/03 11:19" Please feel free to file a bug on this. /be

Status: UNCONFIRMED → NEW

Ever confirmed: true

Boris Zbarsky [:bzbarsky]

Updated

•

15 years ago

Depends on: 601662

Boris Zbarsky [:bzbarsky]

Comment 18

•

15 years ago

I filed bug 601662.

Nobody; OK to take it and work on it

Assignee

Updated

•

11 years ago

Assignee: general → nobody

BMO Automation

Updated

•

3 years ago

Severity: normal → S3

Matthew Gaudet (he/him) [:mgaudet]

Updated

•

2 years ago

Status: NEW → RESOLVED

Closed: 2 years ago

Resolution: --- → FIXED

You need to log in before you can comment on or make changes to this bug.