Closed
Bug 878170
Opened 12 years ago
Closed 12 years ago
OdinMonkey: experiment usage of float32 instead of doubles in asm.js
Categories: Core :: JavaScript Engine (defect)
Status: RESOLVED INCOMPLETE
People
(Reporter: bbouvier, Assigned: bbouvier)
Attachments (4 files, 5 obsolete files)
- 2.74 KB, application/javascript
- 113.24 KB, patch
- 36.30 KB, patch
- 886.15 KB, application/octet-stream
Video game engines tend to use float32 instead of double, as single precision operations are expected to execute faster on processors.
This bug is an experiment in using float32 instead of doubles in OdinMonkey / asm.js mode.
The idea behind this quick-and-dirty proof of concept is to replace double precision instructions with single precision instructions at code generation. The main issue with this approach is that the interpreter still uses double precision, so any interaction between the interpreter and the asm.js code requires a single-to-double precision conversion (resp. the other way around) using cvtss2sd (resp. cvtsd2ss). The speed-ups obtained from single precision instructions might therefore be offset by the conversion overhead. Moreover, as code generation is also shared with Ion and the Baseline compiler, those have to be disabled during testing.
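As a rough illustration of where those conversions would be needed, here is a trivial, hypothetical asm.js module (not code from the patch); print is the JS shell's print function, and the comments mark where the experimental code generation would have to convert:

function AsmModule(stdlib, ffi) {
  "use asm";
  var log = ffi.log;
  function scale(x) {
    x = +x;        // argument arrives from the interpreter as a double;
                   // the patched code generation would narrow it here (cvtsd2ss)
    var y = 0.0;
    y = x * 2.0;   // compiled as a single precision multiply under the patch
    log(+y);       // FFI call: the float32 value is widened back to double (cvtss2sd)
    return +y;     // return value widened to double for the interpreter
  }
  return scale;
}
var scale = AsmModule(this, { log: function (v) { print(v); } });
scale(1.5);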
The interactions with the interpreter are the following:
- calling an AsmJS function (need to convert double arguments to float arguments) and returning float values from an AsmJS function
- storing / loading global values
- storing / loading heap values
- calling FFI
- calling Math builtins
I have some ideas to avoid conversions:
- for math builtins, reimplement them (in C++) using float instead of double.
- if Emscripten doesn't touch global values and heap values outside of asm.js code (i.e. in the interpreter), the related conversions can be disabled, which would be a great win.
I will post patches and benchmarks soon.
Comment 1•12 years ago (Assignee)
This first patch replaces all used occurrences of double by float, in asm.js mode only. The interpreter continues to use double precision numbers, so there is a conversion every time we cross from one side to the other. These conversions have been reduced to 2 cases:
- calling an FFI function;
- entering and returning from an asm.js module.
Conversions for heap accesses and global variable accesses were not necessary after all, if we consider that only the module accesses these variables (which is the case with Emscripten, if I understand correctly).
As parts also used by IonMonkey and Baseline have been modified, they cannot be used with this patch. Therefore, jit-tests are effectively run only on the asm.js directory, with the following command:
../jit-test/jit_test.py --args="--no-ion --no-baseline" ./js asm.js
Also, a lot of tests were comparing doubles against the results of asm.js functions. As floats have less precision, I have introduced an assertAlmostEqual function, which is equivalent to assertEq but tests equality up to a certain epsilon (abs(x-y) < epsilon). At this time, all asm.js tests pass. Comments in the test files explain where modifications had to be made to account for the loss of precision.
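A minimal sketch of that helper (the actual function in the patch may differ; the default epsilon here is illustrative):

function assertAlmostEqual(actual, expected, epsilon) {
  // Same spirit as assertEq, but tolerates an absolute error of epsilon.
  if (epsilon === undefined)
    epsilon = 1e-6;
  if (!(Math.abs(actual - expected) < epsilon))
    throw new Error("assertAlmostEqual failed: got " + actual + ", expected " + expected);
}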
I would like to emphasize that this patch is not meant to *add* float32 support, but to produce benchmarks comparing float32 / double execution times. Benchmarks coming soon (spoiler alert: float32 instructions are faster).
Attachment #757728 - Flags: feedback?
Comment 2•12 years ago (Assignee)
TL;DR: Micro benchmarks: 25% speedup; Bullet: almost no speedup; Box2D: 56% speedup
The attached file contains the micro benchmarks I used to measure pure local execution times. These benchmarks are useful for checking that the generated instructions are actually float instructions.
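For reference, the kernels are simple asm.js loops in this spirit (a hypothetical sketch, not the actual attachment; under the patch the multiply and add would be emitted as mulss / addss, and print is the JS shell's print function):

function BasicBench(stdlib) {
  "use asm";
  function run(n) {
    n = n | 0;
    var x = 0.5;
    var i = 0;
    for (i = 0; (i | 0) < (n | 0); i = (i + 1) | 0) {
      x = x * 0.5 + 0.25;  // double arithmetic, compiled as float32 with the patch
    }
    return +x;
  }
  return run;
}
var run = BasicBench(this);
var start = Date.now();
run(100000000);
print("elapsed: " + (Date.now() - start) + "ms");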
Here are some results regarding these benchmarks:
- Basic: basic maths operations, directly implemented by x64 instructions
With doubles: 1574ms
With floats: 1170ms
Speedup: 25.7%
- Builtin: usage of maths built-ins
With doubles: 1594ms
With floats: 1193ms
Speedup: 25%
- FFI calls: regular call of a FFI, to measure impact of conversion
With doubles: 110ms
With floats: 107ms
No speedup nor decrease
- Globals: global variable accesses
With doubles: 2448ms
With floats: 2450ms
No speedup nor decrease
- Heap: heap accesses (stores)
With doubles: 2.18ms
With floats: 1.66ms
Speedup: ~24%
I have also run the AreWeFastYet asmjs-apps benchmarks, to compare:
- Bullet 4:
With doubles: 9579ms
With floats: 9363ms
Speedup: ~2%
I suspect there's something wrong. Bullet is supposed to use floats rather than doubles, so the expected speedup is way higher.
- Box2d 4:
With doubles: 16139ms
With floats: 6979ms
Speedup: ~56.8%
Not sure if this is really good or if there is something wrong here (overflow, ...).
Zlib 4 shows no significant differences.
Attachment #757741 - Flags: feedback?(luke)
Comment 3•12 years ago
I suspect box2d and bullet may be running a different path than with doubles. No speedup, or a speedup bigger than in the microbenchmarks, are both surprising. Both box2d and bullet are incremental, each step depends on the previous, so imprecision can grow and alter what happens.
We can try the skinning, linpack and lua-scimark benchmarks, which all use floating point math and do not work incrementally, so there is less risk of measuring the wrong thing.
Comment 4•12 years ago (Assignee)
Linpack doesn't terminate and lua-scimark throws an error on startup, so there might be something wrong with the generated code. I am going to investigate this.
Comment 5•12 years ago
(In reply to Benjamin Bouvier [:bbouvier] from comment #2)
...
FWIW some of the Emscripten translated code for float32 operations
appears open to further optimization.
For example (from box2d):
$46 = (HEAPF32[tempDoublePtr1 >> 2] = ($14 < $28 ? $14 : $28) - $42, HEAP32[tempDoublePtr11 >> 2] | 0);
HEAP32[$45 >> 2] = 0 | $46;
Can you see any reason that this could not be simplified to:
HEAPF32[$45 >> 2] = ($14 < $28 ? $14 : $28) - $42;
The code appears to be a translation of an LLVM bitcast. If the above cannot easily be simplified by Emscripten, then perhaps it would be possible to add a new JS function that bitcasts a float32 to an int32, and have the asm.js assembler compile it to more optimal machine code.
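For context, the tempDoublePtr pattern reinterprets a float32's bits as an int32 through a scratch heap slot. A plain JS equivalent of the suggested helper could look like the sketch below (names are illustrative; no such builtin existed in the engine at the time):

var scratchBuffer = new ArrayBuffer(4);
var scratchF32 = new Float32Array(scratchBuffer);
var scratchI32 = new Int32Array(scratchBuffer);
function bitcastFloat32ToInt32(f) {
  scratchF32[0] = f;         // store the float32 bits
  return scratchI32[0] | 0;  // reload the same 4 bytes as an int32
}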
Comment 6•12 years ago
Yeah, we could optimize that specific pattern. I think we are lucky here that an optimization would be simple, though.
Optimally we would want LLVM to never do unnecessary bitcasts of this nature. It seems its motivation is to convert two 32-bit floats to a 64-bit int and then store that in a single instruction. In JS, of course, we turn that into two 32-bit operations anyhow.
Comment 7•12 years ago
Is there a way to get profiling data on box2d regarding how important it would be to optimize that pattern?
Comment 8•12 years ago
(In reply to Alon Zakai (:azakai) from comment #7)
> Is there a way to get profiling data on box2d regarding how important it
> would be to optimize that pattern?
Thank you for the explanation. They do appear to occur in pairs.
Box2d code often updates pairs of float32 slots. A simple line count on the un-minified code shows that almost 10% of heap accesses use tempDoublePtr. Not sure if they are in critical paths, but they could be.
grep tempDoublePtr box2d.js | wc
462 6660 50671
grep HEAP box2d.js | wc
5059 49221 275313
Comment 9•12 years ago
Oh, wow :) I did not realize that. 10% looks potentially very significant.
I'll investigate optimizing this in emscripten. I do suspect though that we might need to fix this issue at the LLVM level, and that might be significant work.
Comment 10•12 years ago (Assignee)
After this week of debugging, I (finally) have a version that seems to produce consistent results for lua (on a basic script, not scimark yet) and for linpack. Benchmarks coming soon.
All float32 and float64 operations in asm.js mode are replaced by float32 operations. In particular, this means that float64 values are actually float32 with this patch. This raised issues when dealing with the float64 heap in interpreted mode (since float64 values are internally stored as floats), so I had to add conversions when reading and writing on the interpreter side. As a result, the conversions that were avoided in asm.js mode are now present when the interpreter reads / writes values in the float64 heap.
Using floats everywhere has drawbacks. For instance, linpack uses gettimeofday (compiled to js) and was rounding time values (since epoch time is greater than the highest integer value that can be represented as a float without loss), which altered the run paths of the program.
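A small illustration of that rounding problem (hypothetical numbers): above 2^24, consecutive integers are no longer distinguishable at float32 precision, so nearby epoch timestamps collapse to the same value, while shifting them by a fixed offset keeps the differences representable.

var f32 = new Float32Array(1);
f32[0] = 1370000000123;  // some epoch time in ms
var a = f32[0];
f32[0] = 1370000000124;  // 1 ms later
var b = f32[0];
print(a === b);          // true: the 1 ms difference is lost at float32 precision
f32[0] = 123; var a2 = f32[0];
f32[0] = 124; var b2 = f32[0];
print(a2 === b2);        // false: offset-shifted times keep their differences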
Also, emscripten initializes memory with double -> bytes conversions that confused the runtime. I had to modify one line in emscripten so that the memory is correctly loaded.
Benchmarks and details coming soon.
Attachment #757728 - Attachment is obsolete: true
Attachment #757728 - Flags: feedback?
Comment 11•12 years ago
Perhaps a set of float32 functions could be defined, using loads and stores to a float32 array to convert between double and float32 precision. These functions would accept double arguments and return double results, but perform the operations at float32 precision. Emscripten could be modified to use them, and it would be possible to test them using the standard benchmarks.
Then start replacing calls to these slow functions with fast inline float32 operations. Perhaps, for a start, just target asm.js code to limit the extent of the work.
It might be possible to split out the argument and result conversion between float32 and double precision into separate operations, and then optimize away redundant [float32->double]->[double->float32] operations, leaving the backend to emit fast float32 operations without the intermediate double values.
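A rough sketch of this idea (illustrative names, plain JS outside asm.js): every helper takes and returns doubles, but rounds intermediate results to float32 precision by bouncing them through a Float32Array. A backend could later recognize the toFloat32(toFloat32(x) OP toFloat32(y)) pattern and emit a single float32 instruction, eliding the intermediate widenings.

var f32Scratch = new Float32Array(1);
function toFloat32(x) {       // double -> float32 -> double
  f32Scratch[0] = x;
  return +f32Scratch[0];
}
function addFloat32(a, b) {   // float32-precision add with double arguments and result
  return toFloat32(toFloat32(a) + toFloat32(b));
}
function mulFloat32(a, b) {   // float32-precision multiply
  return toFloat32(toFloat32(a) * toFloat32(b));
}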
Comment 12•12 years ago
Two things:
1. Following up on comment 9, I optimized out almost all uses of bitcasts (the tempDoublePtr stuff). Surprisingly, this had a very small effect on performance. It looks like they did not have much of an impact on performance after all (but removing them is still great for code size).
2. Regarding comment 11, that would work in theory; however, it would likely slow down existing JITs, and we need to be very careful not to do that. It would certainly be interesting to test, but I'm not sure we would want to actually do it. Anyhow, first we need to wait for data on how important float32 is in the first place.
Comment 13•12 years ago
(In reply to Alon Zakai (:azakai) from comment #12)
> Two things:
>
> 1. Following up to comment 9, I optimized out almost all uses of bitcasts
> (tempDoublePtr stuff). Surprisingly this had a very small effect on
> performance. It looks like they were not much of a bad impact on performance
> (but removing them is great for code size even if not).
That's great, thanks.
> 2. Regarding comment 11, that would work in theory, however it would likely
> slow down existing JITs, and we need to be very careful not to do that. It
> would certainly be interesting to test, but I'm not sure we would want to
> actually do it. Anyhow, first we need to wait on data for how important
> float32 is in the first place.
If this is just an exercise to understand the potential performance difference, then perhaps it would be quicker to just recompile the original C code, written to use float32, to use doubles instead, and compare the performance of the native code. This might give quick and reliable results for all architectures.
Comment 14•12 years ago
Might be relevant here: according to this
http://www.anandtech.com/show/6971/exploring-the-floating-point-performance-of-modern-arm-processors
the newest ARM processors (A15, Krait 300) seem to do quite well on 64-bit math compared to 32-bit.
Comment 15•12 years ago (Assignee)
This is a simpler version of the previous patch, which keeps the original internal representation of doubles in the heap. With the previous patch, doubles in the heap were actually floats, and conversions occurred in interpreter mode every time a double was read or written. This one keeps the representation of doubles in the heap as doubles, and performs conversions on heap accesses in asm.js.
Attachment #760042 - Attachment is obsolete: true
Comment 16•12 years ago (Assignee)
Some benchmarks.
TL;DR: if the initial C++ code compiled with emscripten uses floats, there is a great speedup. If it uses doubles, it makes things worse.
I followed Alon's advice to use linpack and lua_scimark for benchmarking. I noticed that they both use doubles as their internal representation of real numbers, which means that emscripten will use a Float64Array to load and store them. So I instrumented the code to obtain the percentage of float32 heap accesses and the percentage of float64 heap accesses that are generated (among the total number of heap accesses). These, along with calls to FFI, are the main reasons for slowdown in this experiment (as they need conversions).
For instance:
F64[ptr >> 3] = ffiFunction(F64[ptr >> 3]) // needs 4 conversions: read from the double array (double to float), parameter conversion (float to double), return value conversion (double to float), store into the double array (float to double)
Benchmarks are launched with |--no-ion --no-baseline|, as the patch messes with the internals of code generation, which is shared between the JIT engines and OdinMonkey.
The microbenchmarks attached in this bug are still showing ~25% improvements.
Box2d:
- generates 30.4% (of the total num. of heap accesses) of Float32 (F32) heap accesses, 0% of Float64 (F64) heap accesses. This means the only present conversions might be for FFI calls.
- 7.2s with floats, 16s with doubles => 50% speed up
- either it runs a different path, or all the avoided conversions make it *really* faster.
Bullet:
- generates 44% of F32, 0% of F64.
- 9.5s with floats, 10s with doubles => 5% speed up
- we expect more than that (probably runs a different path)
Skinning: (noticed it also uses floats)
- generates 10% of F32, 0% of F64.
- 7.2s with floats, 10.9s with doubles => 34% speedup
- it is a microbenchmark, so code alignment and other micro-level phenomena could explain part of that difference.
Linpack:
- Linpack allows choosing which representation of real numbers to use. It is initially set to doubles.
- I had to chase a bug that dramatically modified the execution path: the time since epoch is used in this benchmark. These times are converted to real numbers in the code, so during conversion to float32, rounding occurred and reused times were not correct. Solution: shift epoch times by a given offset (only differences of times are useful).
- The initial version generates 81% of F64 accesses and no F32 accesses => a lot of conversions are carried out.
- 799 Mflops with floats, 1471 Mflops with doubles => 84% worse
- I tried to recompile Linpack using floats as the representation of real numbers; this way, fewer conversions would be needed.
- The float version generates 16.7% of F64 accesses and 64% of F32 accesses.
- 1053 Mflops with floats, 878 Mflops with doubles => 20% speedup
- Overall, the double representation seems to produce better results, but Mflops depends on times, and times are less precise with floats...
Lua_scimark:
- Lua also allows choosing the floating point representation of real numbers, so I ran both experiments.
- Reals as doubles: 2.93% of F64 accesses, 0% of F32 accesses.
- 14s with floats, 13.6s with doubles => 3% worse
- Reals as floats: 0.25% of F64 accesses, 2.79% of F32 accesses.
- 12.8s with floats, 14.58s with doubles => 12.3% better
These preliminary results clearly show that if the original C / C++ code uses floats, the resulting Emscripten-generated code will be faster. Otherwise, the necessary conversions seem to be killing performance. I will investigate bug 877338 (as each load and store of a float64 needs a conversion, solving that bug might help reduce the number of conversions). I will be glad to read your comments and reactions, or any other approaches that could yield better speedups.
Comment 17•12 years ago
(In reply to Benjamin Bouvier [:bbouvier] from comment #16)
> Box2d:
> - generates 30.4% (of the total num. of heap accesses) of Float32 (F32) heap
> accesses, 0% of Float64 (F64) heap accesses. This means the only present
> conversions might be for FFI calls.
> - 7.2s with floats, 16s with doubles => 50% speed up
> - either it runs a different path, or all the avoided conversions make it
> *really* faster.
Hmm, I still find this hard to believe. First because it is better than the microbenchmarks testing this specifically, and second because Box2D is one of our fastest benchmarks currently: we are just 25% slower than native on my machine. How do we compare to native after the 50% speedup on your machine?
Are you on a 64-bit OS where we use the signal stuff?
We can instrument the benchmarks to output data, if we suspect they run a different path. I believe Box2D for example simulates a few objects, and we can print their locations at the last frame or every X frames.
Comment 18•12 years ago
Validating the output of box2d sounds like a great idea. So far, though, the results are sounding pretty good.
Other than that, I wouldn't worry too much about trying to minimize float64 conversions. When we do this for real, there would be no such conversion; C++ doubles would use float64, C++ floats would use float32.
Updated•12 years ago
Attachment #757741 - Flags: feedback?(luke)
Comment 19•12 years ago (Assignee)
Attachment #760668 - Attachment is obsolete: true
Comment 20•12 years ago (Assignee)
(In reply to Alon Zakai (:azakai) from comment #17)
> Hmm, I still find this hard to believe. First that it is better than
> microbenchmarks testing this specifically, and second that Box2D is one of
> our fastest benchmarks currently, we are just 25% slower than native on my
> machine. How do we compare to native after the 50% speedup on your machine?
native (clang++ -O2): real 0m4.347s
native (g++ -O2): real 0m4.025s
js --no-ion --no-baseline (doubles): real 0m16.509s
js --no-ion --no-baseline (floats): real 0m6.944s
So actually, with the float version, we are in the range of 2x slower than the native version. With the double version, we are way worse than the native version.
> Are you on a 64-bit OS where we use the signal stuff?
Yes. However, if by the signal stuff you mean the signal handler used for out-of-bounds heap accesses, it shouldn't show up if all heap accesses are in bounds, which I assume is the case for this benchmark.
> We can instrument the benchmarks to output data, if we suspect they run a
> different path. I believe Box2D for example simulates a few objects, and we
> can print their locations at the last frame or every X frames.
I activated the DEBUG variable in the Benchmark.cpp file to print the y location of the object. There is a problem here: while the native version (compiled with either g++ -O2 or clang++ -O2) shows values that are either 0 or 10, the initial js version (i.e. without the float32 patch) shows other values (0, plus values in the interval [5, 11]). For comparison, I also launched it in node and the results fall in an even bigger interval.
Could there be something wrong, either in IonMonkey / OdinMonkey or Emscripten? Or do you think it's just related to the random numbers generator?
Comment 21•12 years ago (Assignee)
TL;DR: the speedups are real and validated: >50% speedup on box2d, 30% on skinning. The Float shell behaves exactly like the native version on Box2d.
After cleaning up the printf instructions in the main source file of Box2D, I confirmed that the position of the top object is correct at the end of the run. So the run path is the same.
I checked the position values of the Float js shell against the native shell. They are always really close, if not equal. I added instrumentation to see the positions of every object in the scene, both for the native and js shell versions: on workloads 3 and 4, fewer than 0.01% of positions differed between the native and the Float js shell. The difference was 0.00001 in every case. The Double shell had more differences, around 0.01 at the end of the run.
The Float shell uses float math operations for the math builtins, avoiding conversions for math builtin calls. Another shell using float operations everywhere but double operations for the math builtins had almost the same performance on Box2d as the current shell.
Also, perf records of Box2d runs showed that in the Double shell (current), 32.31% of the time is spent in the libm cos function and 23.55% in the libm sin function. For the Float shell, these drop to 5.54% in cosf and 5.16% in sinf. The conclusion is that float math operations are way faster, and this is the reason for the 50% speedup on Box2d. Hence, the results are validated :)
I tried to understand why there was no significant speedup for Bullet. perf showed that Bullet spends almost all its CPU time in the asm.js code (only 2.76% in the libm cos function). When using the Float shell, it spends only 0.25% in the libm cosf function, so the overall speedup is not significant.
For Skinning (30% speedup), perf showed that almost no Math builtins are called. Skinning therefore takes advantage of the faster float operations on basic arithmetic.
Comment 22•12 years ago (Assignee)
Attachment #761539 - Attachment is obsolete: true
Comment 23•12 years ago (Assignee)
Rebased so that all tests pass with --no-ion --no-baseline.
Attachment #762452 - Attachment is obsolete: true
Comment 24•12 years ago (Assignee)
Apply x64 patch first and then this one to enjoy the hack.
This one *only works* for x86. If you want to compile for x64, just apply the first patch and not this one. (yes, this is really quick and dirty)
Comment 25•12 years ago (Assignee)
So here are the x86 results (on my x64 architecture):
- Bullet 3: before 3.1s / after 2.9s => speedup 6%
- Bullet 4: before 12.0s / after 11.4s => speedup 5.39%
- Skinning 4: before 15.3s / after 12.8s => speedup 16%
- Linpack: before 835 MFlops / after 846 MFlops => speedup 1.3%
- Lua Scimark: before 13.7s / after 12.8s => speedup 6%
And last but not least...
- Box2d workload 3: before 5.19s / after 1.6s => speedup 69%
- Box2d workload 4: before 23.9s / after 7.6s => speedup 68%
The results have been verified the same way as before (check positions of all objects on workload 3 and compare with the native compiled version) and validate the hack.
These are the results for a cross-compiled shell for x86. As the benchmarks are run on my x64 machine, the results won't be as meaningful as the ones we could get on a real x86 platform. The benchmarks are posted in case somebody wants to test them on platforms other than Linux.
When running the shell, don't forget to deactivate Baseline and Ion (--no-ion --no-baseline), since the patches mess with code generation, which is shared by Baseline, Ion and Odin.
Comment 26•12 years ago
Do we know why libm stuff is so much slower with doubles? Does this happen on other platforms/libc versions? It is surprising that sin/cos should take 5x longer with doubles, in principle.
Comment 27•12 years ago (Assignee)
(In reply to Alon Zakai (:azakai) from comment #26)
> Do we know why libm stuff is so much slower with doubles? Does this happen
> on other platforms/libc versions? It is surprising that sin/cos should take
> 5x longer with doubles, in principle.
I would love to have results from other platforms and libc versions too, especially as libc 2.17, which I use, includes some math improvements.
A quick search on the Web didn't give me interesting results. The thing that shows up most often regarding float vs double performance is loading time: as a float has fewer bits than a double, it is faster to load and store.
My hypothesis is that cos / sin are computed using Taylor series. Fewer terms are needed to reach float precision than double precision, so the computations finish sooner. More generally, needing less precision makes it possible to use cheaper approximations; a toy illustration follows the links below.
For instance, GNU libc uses a fairly simple Taylor series sum for sinf ([1] and [2]), whereas the sin routine has a more complicated control flow and several table lookups (which imply several loads) [3].
[1] http://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/ieee754/flt-32/s_sinf.c;h=916e3455711f700141759fc3107c6b4bee1a44a7;hb=c758a6861537815c759cba2018a3b1abb1943842
[2] http://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/ieee754/flt-32/k_sinf.c;h=e65fb988b79718c966d6c8e57b38bcfd8ae3a1ba;hb=c758a6861537815c759cba2018a3b1abb1943842
[3] http://sourceware.org/git/?p=glibc.git;a=blob;f=sysdeps/ieee754/dbl-64/s_sin.c;h=7b9252f81e0960e1811926be254336ec1a4c5dc6;hb=c758a6861537815c759cba2018a3b1abb1943842
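Toy illustration (hypothetical code, not glibc's actual algorithm): a truncated Taylor series for sin. For an argument reduced to a small range, float32 precision (~1e-7) is reached with roughly half the terms needed for double precision (~1e-16), so a float routine can stop earlier.

function sinTaylor(x, terms) {
  var sum = 0;
  var term = x;
  for (var n = 1; n <= terms; n++) {
    sum += term;
    term *= -(x * x) / ((2 * n) * (2 * n + 1)); // next Taylor term of sin(x)
  }
  return sum;
}
// sinTaylor(0.5, 4) already matches Math.sin(0.5) to float32 precision;
// double precision needs roughly twice as many terms.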
Comment 28•12 years ago
(In reply to Benjamin Bouvier [:bbouvier] from comment #27)
> (In reply to Alon Zakai (:azakai) from comment #26)
> > Do we know why libm stuff is so much slower with doubles? Does this happen
> > on other platforms/libc versions? It is surprising that sin/cos should take
> > 5x longer with doubles, in principle.
>
> For instance, GNU Lib C uses a quite simple Taylor series sum for sinf ([1]
> and [2]), compared to the sin method which has a more complicated flow
> structure and several lookups (which imply several loads) [3].
The double routines have also had a lot more attention paid to their accuracy; more accuracy usually implies slower computation. (Though there has been significant effort spent on making fast, accurate libm routines.)
Comment 29•12 years ago
Wow. Well, given that, my first thought is that this is starting to seem unrepresentative. Box2D is likely inefficient if it constantly calls sin and cos. If we spend 50% of the time in sin/cos, then a native build would likely spend at least 25%. That seems very unreasonable for a physics engine.
We could in principle use the float32 C code for sin in Box2D, avoiding the use of native sin(). Perhaps emscripten should do that in general, if we can't optimize the browser's calls to sin(). (If sin is 5x slower than sinf, then sinf compiled as asm.js should beat sin.)
But can't we optimize the browser's? We could pick the best open source implementation (preferring a tradeoff of more speed and less precision) and ship it with Firefox.
(But all this is quite separate from the issue of having or not having float32s.)
Comment 30•12 years ago (Assignee)
I think we can close this one, as the experiment showed great speedups on certain benchmarks. Float32 support is now being implemented as a general optimization in Ion, and the Odin support is also being done.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
Updated•12 years ago
Resolution: FIXED → INCOMPLETE