Closed
Bug 601072
Opened 15 years ago
Closed 2 years ago
Unsigned Multiplication in Javascript is very slow compared to C
Categories
(Core :: JavaScript Engine, enhancement)
Tracking
()
RESOLVED
FIXED
People
(Reporter: alexis, Unassigned)
References
(Depends on 1 open bug)
Details
Attachments
(6 files)
User-Agent: Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en-US; rv:1.9.2.10) Gecko/20100914 Firefox/3.6.10
Build Identifier: Mozilla/5.0 (Macintosh; U; Intel Mac OS X 10.6; en-US; rv:1.9.2.10) Gecko/20100914 Firefox/3.6.10
Unsigned Multiplication in javascript is very slow.
In Firefox 3.6 the code example takes 46026 ms
In Chrome 6.0.472.62 it takes 11358 ms
C it takes 310ms.
Please Find attached code samples in both C and Javascript.
Reproducible: Always
Reporter | ||
Comment 1•15 years ago
|
||
Reporter | ||
Comment 2•15 years ago
|
||
Comment 3•15 years ago
|
||
Somehow I suspect most of the time here is spent in object allocation and accessing globals (a variable is global if you omit the "var" keyword, globals are much slower than locals).
I had to reduce the number of iterations from 1 << 26 to 1 << 24 to avoid a slow script warning in Safari and Firefox. With this, I get the following numbers:
Chrome: 2379
FF4b7pre: 2752
Safari: 5506
As you can see, FF4 will be much faster here :)
Reporter | ||
Comment 4•15 years ago
|
||
This is an inlined version of the javascript.
Reporter | ||
Comment 5•15 years ago
|
||
bug with erlier version was using globals. this version uses locals.
Updated•15 years ago
|
Summary: Unsigned Multiplication in Javascript is very slow. → Unsigned Multiplication in Javascript is very slow compared to C
Reporter | ||
Comment 6•15 years ago
|
||
Reporter | ||
Comment 7•15 years ago
|
||
So with optimisations the new numbers are:
Inlined
Chrome:4124
FF3.6:3510
Safari:5005
Normal
Chrome:4562
FF3.6: 4041
Safari:4088
note: this is using a passed in and recycled array, if you make a new array you get the following numbers
Chrome:4604
FF 3.6: 25564
Safari:16047
![]() |
||
Updated•15 years ago
|
Attachment #480042 -
Attachment mime type: application/octet-stream → text/plain
![]() |
||
Comment 8•15 years ago
|
||
So on that last benchmark, we're within a factor of 2x of unoptimized C (with handcrafted inline assembly for the hot path, which makes it not very C, btw... though if I just use unsigned long long and skip the assembly gcc seems to actually do a much better job optimizing; the unoptimized code is about the same, but the optimized is about 2x faster than with the inlined assembly), which we generally consider "pretty good".
Interestingly, rewriting imul like so:
var v=a*b;
out[1]= v % (1 << z);
out[0]= v - out[1];
actually makes it _slower_ (lots of time spent under fmod).
I get an additional 20% speedup or so if I do this:
out[0]=(v/(1<<z)) | 0;
instead of using Math.floor, at least on m-c. I think there's been some more optimization work on TM to make floor() possibly behave better.
Also, hoisting the 1<<z out of the loop makes this faster yet; bug 545406 might well help with that.
We could try a builtin, but it's not clear to me that the cost of setting up the function call, etc, will really make up for things. Note that in gcc unoptimized it's the cost of not inlining the function that really makes the time shoot up, and we can't inline a C builtin into the trace, obviously.
It's also not clear to me how to sanely create a builtin that returns two separate values. What would the actual JS-facing API look like?
For reference, the version that returns a new array every time is totally gc-system bound, at least on Mac: 37% of the time is spent doing free() calls on the finalizer thread, 38% of the time is spent allocating arrays (about 2/3 of this in the realloc js_NewPreallocatedArray calls via growDenseArrayElements), 14% is under js_GC. That adds up to 89% for those of us counting....
In Chrome the "create new arrays" testcase does about as well as the other; I'd love to know how they manage that. ;)
Reporter | ||
Comment 9•15 years ago
|
||
Could I get the generated code for inside the loop? perhaps we can just special case this and get better?
Note the asm is not really hand crafted, I wanted an unsigned multiply, and I do not know how to do that in C. note that a function that returns has been handled, destructing assignment. just return an array, then say to use destructing assignment on them. also note that with a function you can make this work with z>26 all the way upto Z=53, if you wanted to. and why cannot you inline a builtin into the trace? of course checking in the trace conditions that the code actually is a number?
Reporter | ||
Comment 10•15 years ago
|
||
Also , perhaps instead of defining this function, you could define a function mul that does multiplication on an array of integers with an input z for the bit window size, for example
mul(a,b,z)
a,b arrays of numbers, or error.
a[i],b[i] converted to integers mod 2^z ai[i], bi[i]
treating
ai[i] and bi[i] as numbers a= sum^k ai[i]*2^z^k and b=sum^k bi[i]*2^z^k
compute v=a*b
and express in the form v[i]=sum^k v[i]*2^z^k
veturn v[i], taking the loop out of the equation, if that is easier, it is probably more efficient.
OS: Mac OS X → Windows 7
![]() |
||
Comment 11•15 years ago
|
||
> I wanted an unsigned multiply, and I do not know how to do that in C.
Just declare the operands as unsigned (or cast them to unsigned while multiplying).
> just return an array,
That runs into the same GC issue as the scripted function returning an array.
> also note that with a function you can make this work with z>26
Sure. That's a cost to keep in mind; it's not clear to me how much this impacts the sorts of code that might use this thing.
> and why cannot you inline a builtin into the trace?
You can, and we have, for common cases for things that are used often and that are easy to express in LIR. That's a somewhat high bar, though...
In general, more general things that are useful in more cases are more likely candidates to be added than specialized things. :)
I'll attach the assembly in a few minutes.
![]() |
||
Comment 12•15 years ago
|
||
This is the initial trace; there are several branches we compile later as we detect type-stability issues as we go, as far as I can tell. But this is the only thing that includes the whole loop, since the branches start part way through it.
The lines starting with 0x... are the generated assembly (x86-64 in my case). The rest is annotations explaining what LIR is being compiled.
You can get just the assembly by grepping this log for " 0x" (two leading spaces).
Reporter | ||
Comment 13•15 years ago
|
||
>> just return an array,
>That runs into the same GC issue as the scripted function returning an array.
You are not actually returning an array, you are passing an array to a distructing-assignment, this means that you are just returning multiple values, if there is o destructing assignment there, you can then just ignore it.
>> also note that with a function you can make this work with z>26
>Sure. That's a cost to keep in mind; it's not clear to me how much this
>impacts the sorts of code that might use this thing.
The long multiplication algorithm has cost:
C(z,b)=(b/z)^2
so C(26,b)/C(32,b) =~1.51
For Karatsuba multiplication it has cost:
C(z,b)=(b/z)^1.585
so C(26,b)/C(32,b) =~1.38
(note this is for numbers >400 bits, and that this is a recursive algorithm)
> I wanted an unsigned multiply, and I do not know how to do that in C.
I see how to do that now, just not with it at that time. anyway, it only made the cost worse.
Reporter | ||
Comment 14•15 years ago
|
||
There is obviously a way you can make this faster. probably by a lot.
I do not know how to implement this though.
OS: Windows 7 → Windows XP
![]() |
||
Comment 15•15 years ago
|
||
> if there is o destructing assignment there, you can then just ignore it.
A C-implemented native has no way to tell what's going to happen to the result.
Often enough neither can the tracer, unless the destructuring assignment is right there. Optimizing this is not as easy as you make it sound. ;)
Reporter | ||
Comment 16•15 years ago
|
||
>> if there is o destructing assignment there, you can then just ignore it.
>
>A C-implemented native has no way to tell what's going to happen to the result.
>
>Often enough neither can the tracer, unless the destructuring assignment is
>right there. Optimizing this is not as easy as you make it sound. ;)
Well we only guarantee that the statement is right there or performance cost. we do not have to optimise for every case, just the case that is most often used. have 2 versions, one that optomises this case:
[a,b] = SomeFunctionValue()
and a version that optimizes everything else.
Comment 17•15 years ago
|
||
Alexis's comment 16 is right on, we always meant to use an old Scheme trick for return-of-array-to-destructuring-assignment-consumer case, mentioned by Lars Hansen in http://wiki.ecmascript.org/doku.php?id=proposals:group_assignment:
"(There is actually a standard trick here in which a continuation that receives multiple values is marked specially so that an invocation that returns multiple values will not cons up an object, but instead return the values through a faster protocol. With proper tail recursion this works very well, but even without it will cover many cases in practice.)
— Lars T Hansen 2006/03/03 11:19"
Please feel free to file a bug on this.
/be
Status: UNCONFIRMED → NEW
Ever confirmed: true
![]() |
||
Comment 18•15 years ago
|
||
I filed bug 601662.
Assignee | ||
Updated•11 years ago
|
Assignee: general → nobody
Updated•3 years ago
|
Severity: normal → S3
Updated•2 years ago
|
Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
You need to log in
before you can comment on or make changes to this bug.
Description
•