Closed
Bug 919020
Opened 11 years ago
Closed 11 years ago
Float32: investigate matrix inversions benchmark performance on ARM
Categories
(Core :: JavaScript Engine, defect)
RESOLVED
WORKSFORME
People
(Reporter: bbouvier, Unassigned)
Attachments
(1 file)
28.37 KB,
application/javascript
This benchmark I made creates a bunch of random 3x3 matrices, inverts them, and multiplies each original matrix by its inverse. The script contains gl-matrix [0] and a port of it that uses Float32 operations instead. It prints comparative results: reg means 'regular version' and f32 means 'float32 version'. The attached script is a folded version of [1]: it contains gl-matrix, then gl-matrix with the Math.fround calls, then the benchmark itself (at the end).
It has been written so that all function calls get inlined and no bailouts happen (except one for each kind at the beginning, because one of the functions gets compiled and thus changes nature).
On x64, there is a speedup. I suspect the same holds on x86. However, on ARM there is a slowdown (at least on my Samsung Galaxy S3 and on one of the devices Douglas uses).
There's another version of the benchmark that doesn't inline all function calls (because the callees are too big), available at [2]. On this one, the Float32 version (which doesn't inline all calls, because of Math.fround) is faster than the non-Float32 version (which inlines all calls).
That looks weird and I think we could do better in the case of the fully inlined version. Moreover, it could help general performance on ARM.
[0] https://github.com/toji/gl-matrix
[1] http://people.mozilla.org/~bbouvier/inversions-inlined.html
[2] http://people.mozilla.org/~bbouvier/inversions.html
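For readers without the attachment, here is a minimal sketch of what "a port that uses Float32 operations" could look like for one 3x3 case. It is not the attached benchmark and not gl-matrix's code: the function names (randomMat3, invertF32, multiplyF32) are hypothetical, and the only assumption taken from the description is that every intermediate arithmetic result is wrapped in Math.fround.

```javascript
// Hypothetical sketch: names and structure are illustrative, not the attachment's.
const f = Math.fround;

// Random 3x3 matrix stored flat; a boosted diagonal keeps it invertible.
function randomMat3() {
  const m = new Float32Array(9);
  for (let i = 0; i < 9; i++) {
    m[i] = Math.random() + (i % 4 === 0 ? 2 : 0); // indices 0, 4, 8 are the diagonal
  }
  return m;
}

// Cofactor-based inverse; every intermediate is rounded to float32.
function invertF32(out, a) {
  const a00 = a[0], a01 = a[1], a02 = a[2];
  const a10 = a[3], a11 = a[4], a12 = a[5];
  const a20 = a[6], a21 = a[7], a22 = a[8];

  const b01 = f(f(a22 * a11) - f(a12 * a21));
  const b11 = f(f(a12 * a20) - f(a22 * a10));
  const b21 = f(f(a21 * a10) - f(a11 * a20));

  let det = f(f(f(a00 * b01) + f(a01 * b11)) + f(a02 * b21));
  if (!det) return null; // singular matrix
  det = f(1 / det);

  out[0] = f(b01 * det);
  out[1] = f(f(f(a02 * a21) - f(a22 * a01)) * det);
  out[2] = f(f(f(a12 * a01) - f(a02 * a11)) * det);
  out[3] = f(b11 * det);
  out[4] = f(f(f(a22 * a00) - f(a02 * a20)) * det);
  out[5] = f(f(f(a02 * a10) - f(a12 * a00)) * det);
  out[6] = f(b21 * det);
  out[7] = f(f(f(a01 * a20) - f(a21 * a00)) * det);
  out[8] = f(f(f(a11 * a00) - f(a01 * a10)) * det);
  return out;
}

// Plain triple-loop product, again rounding each step to float32.
function multiplyF32(out, a, b) {
  for (let r = 0; r < 3; r++) {
    for (let c = 0; c < 3; c++) {
      let s = f(0);
      for (let k = 0; k < 3; k++) {
        s = f(s + f(a[r * 3 + k] * b[k * 3 + c]));
      }
      out[r * 3 + c] = s;
    }
  }
  return out;
}

// Usage: the product of a matrix and its inverse should be close to identity.
const m = randomMat3();
const inv = invertF32(new Float32Array(9), m);
const prod = multiplyF32(new Float32Array(9), m, inv);
console.log(Array.from(prod));
```

The 'reg' variant would be the same code with the Math.fround wrappers removed, so all intermediates stay in double precision.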
Comment 1•11 years ago
On my rMBP with OS X 10.8, the inlined version is about 5% faster than the baseline. The non-inlined one is about 12.5% slower than the baseline.
On my Samsung Galaxy S4, the inlined version is about 54% slower than the baseline. The non-inlined version is between 6% and 9% faster. It's hard to tell exactly, because the numbers vary from run to run.
However. Something seems a bit off with the benchmark: the numbers for reg keep changing in a curious pattern. They start out with some number, then over a few seconds go up by about 4%. After that, they start falling, and keep doing that. This happens in both versions of the benchmark, as is probably to be expected.
A concrete example: On my rMBP, reg starts out at 343, then dips to 341, then, over the next 20 seconds or so, increases until it reaches 354, then it begins to fall. It takes maybe 30 seconds until it reaches 343 again. The fall decelerates and it takes about a minute for it to fall below 340. I haven't seen it stop at any point, but the lowest number I was patient enough to wait for was 337.
Reporter | ||
Comment 2•11 years ago
(In reply to Till Schneidereit [:till] from comment #1)
> However. Something seems a bit off with the benchmark: the numbers for reg
> keep changing in a curious pattern. They start out with some number, then
> over a few seconds go up by about 4%. After that, they start falling, and
> keep doing that. This happens in both versions of the benchmark, as is
> probably to be expected.
That's indeed kind of bizarre. What we should expect (and what I see locally in the shell, and in the browser on a new profile with no other tabs or extensions) is the following:
- the first run is slower than the other ones (compilation and recompilations)
- next runs are faster.
My bad, I only show average values everywhere. So it makes sense that the average keeps falling, as the first run might be especially slow compared to the others. I guess I should use a more stable indicator of the time distribution, like the median or the average over the last X runs.
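A minimal sketch of the two alternatives mentioned here, the median and the average over the last X runs. The helper names and the sample timings are made up for illustration; they are not taken from the benchmark.

```javascript
// Hypothetical helpers; sample timings are illustrative, not measured.
function mean(xs) {
  let sum = 0;
  for (const x of xs) sum += x;
  return sum / xs.length;
}

function median(xs) {
  const sorted = Array.from(xs).sort((a, b) => a - b); // numeric sort on a copy
  const mid = sorted.length >> 1;
  return sorted.length % 2 === 1
    ? sorted[mid]
    : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Average over the last n runs only, so warm-up runs drop out over time.
function meanOfLast(xs, n) {
  if (n <= 0) throw new RangeError("n must be positive");
  return mean(xs.slice(-n));
}

// The first run includes compilation, so it is much slower than the rest.
const runTimes = [900, 350, 340, 345, 342];
console.log(mean(runTimes).toFixed(1));          // prints "455.4"
console.log(median(runTimes));                   // prints 345
console.log(meanOfLast(runTimes, 3).toFixed(1)); // prints "342.3"
```

With a cumulative mean, the slow first run keeps pulling the reported number up and the value drifts down as more runs accumulate; the median and the windowed mean are not affected by that single outlier.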
Till, did you try the shell (way more stable) or the browser version?
Flags: needinfo?(till)
Reporter | ||
Comment 3•11 years ago
Till told me on IRC he tried the browser version, not the shell version.
Interestingly, on ARM, I locally have speedups when I cross compile and run it with qemu. I will try to run it directly on my phone without the browser now, to see if that makes a difference.
Flags: needinfo?(till)
Reporter | ||
Comment 4•11 years ago
After some experiments with another benchmark (Markov Chains), I see better speedups when I inline the code by hand (i.e. copying and pasting the code and replacing variable names) instead of making a function call that would get inlined by IonMonkey (for float32).
Not sure why the Double version isn't affected by that, though.
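To make the distinction concrete, here is a small sketch of the two shapes being compared: a float32 helper called from a loop (which the JIT may or may not inline) versus the same arithmetic pasted into the loop body. The functions are hypothetical and unrelated to the Markov Chains benchmark; they only illustrate that both forms compute the same value, so any timing difference would come from the compiler rather than from the arithmetic.

```javascript
// Hypothetical example; not code from the Markov Chains benchmark.

// Small float32 helper that the engine is expected to inline.
function scaleAddF32(a, b, s) {
  return Math.fround(a + Math.fround(b * s));
}

// Version 1: relies on the engine inlining the helper call.
function sumWithCall(xs, ys, s) {
  let acc = Math.fround(0);
  for (let i = 0; i < xs.length; i++) {
    acc = Math.fround(acc + scaleAddF32(xs[i], ys[i], s));
  }
  return acc;
}

// Version 2: the helper body is copied into the loop by hand.
function sumInlinedByHand(xs, ys, s) {
  let acc = Math.fround(0);
  for (let i = 0; i < xs.length; i++) {
    acc = Math.fround(acc + Math.fround(xs[i] + Math.fround(ys[i] * s)));
  }
  return acc;
}

const xs = new Float32Array([1.5, 2.25, 3.125]);
const ys = new Float32Array([0.1, 0.2, 0.3]);
const s = Math.fround(0.5);
console.log(sumWithCall(xs, ys, s) === sumInlinedByHand(xs, ys, s));
```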
Reporter | ||
Comment 5•11 years ago
Float32 has made some progress on ARM and is now faster on several devices (38% speedup on Samsung Galaxy S3, 42% on Nexus 4).
Closing as WORKSFORME, as I am not sure which bug fixed this (though I suspect float hashing for constant generation, cheers Douglas!).
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → WORKSFORME