Closed Bug 578538
Opened 14 years ago · Closed 14 years ago
JM: Avoid unnecessary stores in prolog of arithmetic fast paths
Categories: Core :: JavaScript Engine, defect
Tracking: RESOLVED FIXED
People: (Reporter: dmandelin, Assigned: dvander)
Attachments (1 file, 1 obsolete file): 47.99 KB, patch
The fast path we emit for |+| starts like this:
mov edi, [ebx+0x68]
mov [ebx+0x80], edi **
mov esi, [ebx+0x6c]
mov [ebx+0x84], esi **
mov edx, [ebx+0x60]
mov [ebx+0x78], edx **
mov ecx, [ebx+0x64]
mov [ebx+0x7c], ecx **
Lines marked (**) are unnecessary stores to stack slots. I find that we are noticeably slower than JSC on these arithmetic ops; e.g., the arithmetic slowdown is about 10-15% of our total time on cordic. dvander says this pattern was estimated to cost us about 5% of our total time on SunSpider (SS), so these stores are likely the cause of the observed arithmetic slowdown on cordic. Thus, we must fix.
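As a rough illustration of the fix, here is a minimal sketch (invented helper names, not the real FrameState API) of per-slot sync tracking: the prolog emits only the loads, and the store-backs marked (**) are deferred until a stub transition actually needs them.

    // Hypothetical sketch, not the real FrameState API: per-slot sync
    // tracking lets the prolog emit only the loads, deferring the
    // store-backs (the ** lines above) to the rarely-taken stub path.
    struct SlotState {
        int reg = -1;       // register currently holding the slot, or -1
        bool synced = true; // memory copy at the slot's offset is current
    };

    struct FrameSketch {
        SlotState slots[64];

        int  allocateRegister();           // assumed helpers, declared only
        void emitLoad(int reg, int slot);  // mov reg, [ebx + offsetOf(slot)]
        void emitStore(int slot, int reg); // mov [ebx + offsetOf(slot)], reg

        // Bring a slot into a register for the fast path. No store-back:
        // the memory copy is still valid, so the ** stores are skipped.
        int loadForFastPath(int slot) {
            SlotState &s = slots[slot];
            if (s.reg < 0) {
                s.reg = allocateRegister();
                emitLoad(s.reg, slot);
            }
            return s.reg;
        }

        // Only when transitioning to a stub call does a dirty slot have
        // to be written out.
        void syncForStubCall(int slot) {
            SlotState &s = slots[slot];
            if (s.reg >= 0 && !s.synced) {
                emitStore(slot, s.reg);
                s.synced = true;
            }
        }
    };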
JM wanted SS win: 40 ms
Comment 1•14 years ago
This depends on bug 574930, which allows cheap branching and merging of FrameStates. The stores are caused by an intentional frame.syncAllRegs() at the top of jsop_binary(), because there is no sane way to generate branched arithmetic paths with the current FrameState model.
This also requires a (much easier!) rewrite of jsop_binary() once the FrameState changes are implemented. I am currently thinking about the FrameState implementation.
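For illustration, a minimal sketch of what cheap branching and merging could look like (all names invented here, not the bug 574930 design): if a FrameState snapshot is a plain copy of the allocation state, each arm of the diamond can start from the same state without a full sync.

    // Illustration only (invented names): with a cheap snapshot/restore on
    // the frame, jsop_binary() no longer needs syncAllRegs() up front --
    // each path diverges from the same recorded allocation state, so the
    // merge point is consistent for free.
    struct Snapshot { int regOfSlot[64]; };

    struct Frame {
        int regOfSlot[64];
        Snapshot snapshot() const {
            Snapshot s;
            for (int i = 0; i < 64; i++) s.regOfSlot[i] = regOfSlot[i];
            return s;                      // cheap: plain copy, no stores
        }
        void restore(const Snapshot &s) {
            for (int i = 0; i < 64; i++) regOfSlot[i] = s.regOfSlot[i];
        }
    };

    void emitIntFastPath(Frame &);   // assumed codegen helpers
    void emitDoubleSlowPath(Frame &);

    // (jumps/labels for the diamond omitted for brevity)
    void emitBinaryOp(Frame &frame) {
        Snapshot entry = frame.snapshot();
        emitIntFastPath(frame);      // may reallocate registers freely
        frame.restore(entry);        // rewind to the branch point
        emitDoubleSlowPath(frame);   // both paths now agree at the join
    }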
Assignee: general → sstangl
Depends on: 574930
Assignee
Comment 2•14 years ago
We should be able to do this without invasive changes to FrameState - the diamond pattern is fairly localized and small. We need to think about it more - didn't we have a patch at some point that grabbed registers up-front?
Even local peephole optimization might be good. If we know there are copies on the stack, they can be synced on the stub transition rather than up-front.
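A minimal sketch of that lazy-sync idea, again with invented names: record dirty slots in a mask, keep the fast path store-free, and emit the syncs only on the out-of-line stub transition.

    #include <cstdint>

    // Invented names, building on the FrameSketch idea from comment 0's
    // sketch: the fast path emits no stores; the out-of-line stub path
    // pays for them instead.
    struct LazyFrame {
        uint64_t dirtyMask = 0;  // bit i set: slot i's reg differs from memory
        void syncSlot(int slot); // assumed: emits the store for one slot
    };

    void beginOutOfLine();       // assumed codegen helpers
    void endOutOfLine();
    void emitStubCall();

    void syncOnStubTransition(LazyFrame &frame) {
        beginOutOfLine();        // rarely-taken path: stores live here
        uint64_t dirty = frame.dirtyMask;
        for (int slot = 0; dirty; slot++, dirty >>= 1)
            if (dirty & 1)
                frame.syncSlot(slot);
        emitStubCall();
        endOutOfLine();
    }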
Comment 3•14 years ago
Any patch written to speed up this case will be rendered useless when the FrameState is changed, and FrameState must change to support sensible branched fastpaths at all.
It would be sensible to eliminate these stores only if the measured win from eliminating them significantly outweighs what we measure with them active.
Assignee
Comment 4•14 years ago
This patch totally doesn't work yet, at all. Its goal is to get rid of the very expensive state flushing at the top of adds; instead, it aggressively tries to allocate four registers while minimizing spills.
For example, print(T1 + T2, T1 + T2) used to have two sync blocks for ADD like this:
[jaeger] Insns movl 0x60(%ebx), %edx
[jaeger] Insns movl %edx, 0x80(%ebx)
[jaeger] Insns movl 0x64(%ebx), %ecx
[jaeger] Insns movl %ecx, 0x84(%ebx)
[jaeger] Insns movl $0xffff0006, 0x7c(%ebx)
[jaeger] Insns movl $0x0, 0x78(%ebx)
[jaeger] Insns movl %esi, 0x70(%ebx)
[jaeger] Insns movl %edi, 0x74(%ebx)
[jaeger] Insns movl 0x68(%ebx), %edx
[jaeger] Insns movl %edx, 0x88(%ebx)
[jaeger] Insns movl 0x6c(%ebx), %ecx
[jaeger] Insns movl %ecx, 0x8c(%ebx)
[jaeger] Insns movl 0x60(%ebx), %eax
[jaeger] Insns movl %eax, 0x80(%ebx)
[jaeger] Insns movl 0x64(%ebx), %eax
[jaeger] Insns movl %eax, 0x84(%ebx)
[jaeger] Insns movl $0xffff0006, 0x7c(%ebx)
[jaeger] Insns movl $0x0, 0x78(%ebx)
[jaeger] Insns movl %esi, 0x70(%ebx)
[jaeger] Insns movl %edi, 0x74(%ebx)
I suspect most of this is extra work that does not _need_ to happen; we emit it to be maximally defensive. Anyway, this patch reduces it to:
[jaeger] Insns movl 0x64(%ebx), %edx
[jaeger] Insns movl 0x6c(%ebx), %ecx
[jaeger] Insns movl 0x60(%ebx), %eax
[jaeger] Insns movl %edi, 0x74(%ebx)
[jaeger] Insns movl 0x68(%ebx), %edi
...
[jaeger] Insns movl %esi, 0x70(%ebx)
[jaeger] Insns movl 0x60(%ebx), %esi
This is still sub-optimal (more on this later), but I suspect the end result will be better. One downside is that the old method guaranteed synced doubles, whereas the new method does not. For example, a function return value will be in ECX:EDX, and we have to sync that to memory to load it as a double.
That's fine, except that the result of an ALU op will be "int or double". In the case of a double result, we forcefully sync, but load the low bits into a register in order to rejoin with the integer path. To load the double, we have to re-sync this register, which is wasteful: we cannot tell that, if the value is a double, it was already necessarily synced.
After talking with Sean, we both would like to see FP register allocation, and that would completely solve this problem. But it can be a follow-up bug, and it's unclear what the perf wins would be right now.
The path stuff Sean & Chris were working on - minimal, lightweight abstraction of FrameState - seems great for tackling this specific problem. Later on we should revisit this, get it as small and lightweight as possible, and see what ideal register allocation gives us. Sooner if this patch turns out not to be enough.
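A rough sketch of the up-front allocation strategy described above, with hypothetical helpers (this is not the patch itself): grab type and data registers for both operands before emitting any arithmetic, preferring already-synced victims so evictions cost no spill stores.

    // Hypothetical helpers, not the patch: all four registers are chosen
    // before the fast path is emitted, so the arithmetic itself contains
    // no spills. (A real allocator would also pin each register so later
    // picks don't evict it.)
    struct Frame2 {
        int regHolding(int slot, bool typeWord);  // -1 if not in a reg
        int cheapestEvictable();                  // prefers synced slots
        void loadInto(int reg, int slot, bool typeWord);
    };

    int ensureInReg(Frame2 &f, int slot, bool typeWord) {
        int r = f.regHolding(slot, typeWord);
        if (r < 0) {
            r = f.cheapestEvictable(); // synced victim => eviction is free
            f.loadInto(r, slot, typeWord);
        }
        return r;
    }

    struct BinaryRegs { int lhsType, lhsData, rhsType, rhsData; };

    BinaryRegs allocForBinary(Frame2 &f, int lhs, int rhs) {
        return { ensureInReg(f, lhs, true), ensureInReg(f, lhs, false),
                 ensureInReg(f, rhs, true), ensureInReg(f, rhs, false) };
    }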
Assignee: sstangl → dvander
Status: NEW → ASSIGNED
Assignee
Comment 5•14 years ago
Down to failing nbody, raytrace, aes.
On math-cordic, looks like a 3ms win (~15%) so far.
Attachment #459349 - Attachment is obsolete: true
Assignee
Comment 6•14 years ago
15ms win.
Patch is split into two parts: one to allocate registers up-front, and another to minimize syncs when loading doubles.
http://hg.mozilla.org/users/danderson_mozilla.com/moo/rev/51ed7672df50
http://hg.mozilla.org/users/danderson_mozilla.com/moo/rev/87e07ff8196c
We were hoping for more, and I think we can get there with bug 574930, so I'm leaving this bug open and investigating further.
Assignee
Comment 7•14 years ago
Another follow-up: we were storing out double constants rather than using a literal pool. Fixing this was a ~2ms (7%) win on math-partial-sums and a ~2.5ms (16%) win on math-spectral-norm.
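A minimal sketch of a double literal pool of this kind (illustrative, not the JM code): constants are deduplicated bitwise into a per-script table, and the fast path loads them with a single movsd instead of storing two 32-bit immediates to the stack and reloading.

    #include <cstddef>
    #include <cstring>
    #include <vector>

    struct DoublePool {
        std::vector<double> entries;

        // Return the pool index for d, adding it on first use. Bitwise
        // comparison so NaNs and -0.0 dedupe correctly.
        std::size_t indexOf(double d) {
            for (std::size_t i = 0; i < entries.size(); i++)
                if (std::memcmp(&entries[i], &d, sizeof(double)) == 0)
                    return i;
            entries.push_back(d);
            return entries.size() - 1;
        }
    };

    // Codegen (pseudo-asm): one load from the pool...
    //     movsd xmm0, [poolBase + 8 * indexOf(d)]
    // ...instead of the old sequence of immediate stores plus a reload:
    //     mov [ebx+off],   lo32(d)
    //     mov [ebx+off+4], hi32(d)
    //     movsd xmm0, [ebx+off]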
math-spectral-norm now looks largely gated on argv & calls. Same for math-partial-sums, which could also benefit from special-casing Math.pow().
http://hg.mozilla.org/users/danderson_mozilla.com/moo/rev/783991695a4d
Assignee
Comment 8•14 years ago
I played around with this by reducing math-* from SunSpider to microbenchmarks. If we get our other wins in place (fast calls, special-casing some functions on Math, more fast-paths), we'll be as fast or faster.
There's probably not a whole lot more to do here. Experimenting with ideas from bug 574930 might be interesting in the future but I'm convinced that it's time to move on to bigger wins.
Filed follow-up bug 582152 on improving double-loading performance.