Closed Bug 578538 Opened 14 years ago Closed 14 years ago

JM: Avoid unnecessary stores in prolog of arithmetic fast paths

Categories

(Core :: JavaScript Engine, defect)

Tracking

RESOLVED FIXED

People

(Reporter: dmandelin, Assigned: dvander)

Attachments

(1 file, 1 obsolete file)

The fast path we emit for |+| starts like this:

    mov edi, [ebx+0x68]
    mov [ebx+0x80], edi   **
    mov esi, [ebx+0x6c]
    mov [ebx+0x84], esi   **
    mov edx, [ebx+0x60]
    mov [ebx+0x78], edx   **
    mov ecx, [ebx+0x64]
    mov [ebx+0x7c], ecx   **

Lines marked (**) are unnecessary stores to stack slots. I find that we are noticeably slower on these arithmetic ops than JSC; e.g., the arithmetic slowdown is about 10-15% of our total time on cordic. dvander says this pattern was estimated to cost us about 5% of our total time on SS, so I think these stores are likely the cause of the observed arithmetic slowdown on cordic. Thus, we must fix.

JM wanted SS win: 40 ms
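To make the complaint concrete: a minimal sketch, in a made-up two-flag model of per-value bookkeeping (not the real FrameState API), of why these stores are pure waste. A value that was just loaded from its frame slot still has a fresh copy in memory, so a blanket "sync everything" emits a store for it even though skipping the store loses nothing:

    // Hypothetical, simplified model of the bookkeeping involved -- not
    // SpiderMonkey's actual FrameState API.
    #include <cstdio>

    struct Value {
        bool inRegister;   // a register currently holds this value
        bool slotIsFresh;  // the frame slot already has an up-to-date copy
    };

    // Blanket sync: counts a store for every register-resident value,
    // needed or not (the (**) pattern above).
    static int syncAll(const Value *vals, int n) {
        int stores = 0;
        for (int i = 0; i < n; i++)
            if (vals[i].inRegister)
                stores++;
        return stores;
    }

    // Smarter sync: skip values whose frame slot is already up to date.
    static int syncDirtyOnly(const Value *vals, int n) {
        int stores = 0;
        for (int i = 0; i < n; i++)
            if (vals[i].inRegister && !vals[i].slotIsFresh)
                stores++;
        return stores;
    }

    int main() {
        // The |+| prolog above: four values just loaded from their slots,
        // so every slot is still fresh and no store is actually required.
        Value prolog[4] = {{true, true}, {true, true}, {true, true}, {true, true}};
        printf("blanket sync: %d stores, dirty-only sync: %d stores\n",
               syncAll(prolog, 4), syncDirtyOnly(prolog, 4));
        return 0;
    }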
This depends on bug 574930, which allows for cheap branching and merging of FrameStates. The stores are caused by an intentional frame.syncAllRegs() at the top of jsop_binary(), because there is no sane way to generate arithmetic paths with the current FrameState model. Fixing this also requires a (much easier!) rewrite of jsop_binary() once the FrameState changes are implemented. I am currently thinking through the FrameState implementation.
Assignee: general → sstangl
Depends on: 574930
We should be able to do this without invasive changes to FrameState - the diamond pattern is fairly localized and small. Need to think about it more - didn't we have a patch at some point that grabbed registers up-front? Even local peephole optimization might be good. If we know there are copies on the stack, they can be synced on the stub transition rather than up-front.
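A hedged sketch of that diamond, using a toy assembler that just records strings (every name here is invented; none of this is methodjit API). The inline path guards on the operand types and adds entirely in registers; the syncing stores move to the rarely-taken out-of-line stub path, and both paths rejoin afterwards:

    #include <cstdio>
    #include <string>
    #include <vector>

    // Toy "assembler" that only records what it would emit.
    struct Masm {
        std::vector<std::string> insns;
        void emit(const std::string &s) { insns.push_back(s); }
    };

    // The diamond: fast inline path with no stores, stub path that
    // syncs operands only when actually taken, then a shared rejoin.
    static void emitFastAdd(Masm &m) {
        m.emit("  test tags of lhs/rhs; jne stubCall  ; guard: both ints?");
        m.emit("  add  edx, ecx                       ; inline path: no stores");
        m.emit("rejoin:");
        m.emit("  ...                                 ; next op continues here");
        m.emit("stubCall:                             ; out-of-line, rarely taken");
        m.emit("  mov  [ebx+0x78], edx                ; sync operands only here,");
        m.emit("  mov  [ebx+0x7c], ecx                ; on the stub transition");
        m.emit("  call stub");
        m.emit("  jmp  rejoin");
    }

    int main() {
        Masm m;
        emitFastAdd(m);
        for (size_t i = 0; i < m.insns.size(); i++)
            printf("%s\n", m.insns[i].c_str());
        return 0;
    }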
Any patch written to speed up this case will be rendered useless when the FrameState is changed, and the FrameState must change to support sensible branched fast paths at all. It only makes sense to eliminate these stores now if the measurements we get with them eliminated are significantly more valuable than the measurements we get with them in place.
Attached patch WIP v1 (obsolete) — Splinter Review
Totally doesn't work yet, at all. The goal of this patch is to get rid of the very expensive state flushing at the top of adds. Instead, it aggressively tries to allocate four registers while minimizing spills.

For example, print(T1 + T2, T1 + T2) used to have two sync blocks for ADD like this:

    [jaeger] Insns movl 0x60(%ebx), %edx
    [jaeger] Insns movl %edx, 0x80(%ebx)
    [jaeger] Insns movl 0x64(%ebx), %ecx
    [jaeger] Insns movl %ecx, 0x84(%ebx)
    [jaeger] Insns movl $0xffff0006, 0x7c(%ebx)
    [jaeger] Insns movl $0x0, 0x78(%ebx)
    [jaeger] Insns movl %esi, 0x70(%ebx)
    [jaeger] Insns movl %edi, 0x74(%ebx)
    [jaeger] Insns movl 0x68(%ebx), %edx
    [jaeger] Insns movl %edx, 0x88(%ebx)
    [jaeger] Insns movl 0x6c(%ebx), %ecx
    [jaeger] Insns movl %ecx, 0x8c(%ebx)
    [jaeger] Insns movl 0x60(%ebx), %eax
    [jaeger] Insns movl %eax, 0x80(%ebx)
    [jaeger] Insns movl 0x64(%ebx), %eax
    [jaeger] Insns movl %eax, 0x84(%ebx)
    [jaeger] Insns movl $0xffff0006, 0x7c(%ebx)
    [jaeger] Insns movl $0x0, 0x78(%ebx)
    [jaeger] Insns movl %esi, 0x70(%ebx)
    [jaeger] Insns movl %edi, 0x74(%ebx)

I suspect most of this is extra stuff happening that does not _need_ to happen. We put it there for best defense. Anyway, this patch reduces it to:

    [jaeger] Insns movl 0x64(%ebx), %edx
    [jaeger] Insns movl 0x6c(%ebx), %ecx
    [jaeger] Insns movl 0x60(%ebx), %eax
    [jaeger] Insns movl %edi, 0x74(%ebx)
    [jaeger] Insns movl 0x68(%ebx), %edi
    ...
    [jaeger] Insns movl %esi, 0x70(%ebx)
    [jaeger] Insns movl 0x60(%ebx), %esi

This is still sub-optimal (more on this later), but I suspect the end result will be better.

One downside is that the old method guaranteed synced doubles, whereas the new method does not. For example, a function return value will be in ECX:EDX, and we have to sync that to memory to load it as a double. That's fine, except the result of an ALU op will be "int or double". In the case of a double result, we forcefully sync, but load the low bits into a register in order to rejoin with the integer path. To load the double, we have to re-sync this register, which is wasteful. We cannot distinguish the fact that it was already necessarily synced if a double.

After talking with Sean, we both would like to see FP register allocation, and that would completely solve this problem. But it can be a follow-up bug, and it's unclear what the perf wins would be right now.

The path stuff Sean & Chris were working on - minimal, lightweight abstraction of FrameState - seems great for tackling this specific problem. Later on we should revisit this, get it as small and lightweight as possible, and see what ideal register allocation gives us. Sooner if this patch turns out not to be enough.
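As an illustration of "allocate four registers up-front while minimizing spills", here is a sketch under assumptions of my own (free registers are cheapest; next come registers whose occupant is already synced and can be evicted without a store; dirty registers cost a spill store). The patch's actual heuristics may differ:

    #include <cstdio>
    #include <vector>

    struct RegState {
        bool free;    // nothing lives in this register
        bool synced;  // if occupied, the frame-slot copy is up to date
        bool taken;   // already claimed by this allocation
    };

    // Claim |want| registers, cheapest first: free, then evictable at no
    // cost, then dirty. Returns how many spill stores would be emitted.
    static int allocateUpFront(std::vector<RegState> &regs, int want) {
        int spills = 0, got = 0;
        for (int pass = 0; pass < 3 && got < want; pass++) {
            for (size_t r = 0; r < regs.size() && got < want; r++) {
                if (regs[r].taken)
                    continue;
                bool match = (pass == 0) ? regs[r].free
                           : (pass == 1) ? (!regs[r].free && regs[r].synced)
                                         : !regs[r].free;
                if (!match)
                    continue;
                if (pass == 2)
                    spills++;        // evicting a dirty value costs a store
                regs[r].taken = true;
                got++;
            }
        }
        return spills;
    }

    int main() {
        // Hypothetical GPR file: two free, one synced occupant, three dirty.
        std::vector<RegState> regs = {
            {true,  false, false}, {true,  false, false},
            {false, true,  false}, {false, false, false},
            {false, false, false}, {false, false, false},
        };
        printf("spill stores needed to claim 4 registers: %d\n",
               allocateUpFront(regs, 4));
        return 0;
    }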
Assignee: sstangl → dvander
Status: NEW → ASSIGNED
Attached patch WIP v2 — Splinter Review
Down to failing nbody, raytrace, aes. On math-cordic, looks like a 3ms win (~15%) so far.
Attachment #459349 - Attachment is obsolete: true
15ms win. Patch is split into two parts: one to allocate registers up-front, and another to minimize syncs when loading doubles.

http://hg.mozilla.org/users/danderson_mozilla.com/moo/rev/51ed7672df50
http://hg.mozilla.org/users/danderson_mozilla.com/moo/rev/87e07ff8196c

We were hoping for more, and I think we can get there with bug 574930, so I'm leaving this bug open and investigating further.
Another follow-up: we were storing out double constants rather than using a literal pool. Fixing this was a ~2ms (7%) win on math-partial-sums and a ~2.5ms (16%) win on math-spectral-norm. math-spectral-norm now looks largely gated on argv & calls; same for math-partial-sums, which could also benefit from special-casing Math.pow().

http://hg.mozilla.org/users/danderson_mozilla.com/moo/rev/783991695a4d
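For reference, a sketch of what a literal pool buys here (names invented; the actual changeset may do this differently). Rather than materializing a double constant by storing its two 32-bit halves into stack slots at every use, each unique constant is recorded once in a side table and the generated code loads it with a single movsd from the table's address:

    #include <cstdio>
    #include <vector>

    class DoublePool {
        std::vector<double> pool;
      public:
        // Return the index (hence address) of |d|, deduplicating so that
        // repeated uses of, say, 0.5 share one slot.
        size_t indexOf(double d) {
            for (size_t i = 0; i < pool.size(); i++)
                if (pool[i] == d)
                    return i;
            pool.push_back(d);
            return pool.size() - 1;
        }
        const double *base() const { return pool.data(); }
    };

    int main() {
        DoublePool pool;
        size_t half  = pool.indexOf(0.5);
        size_t again = pool.indexOf(0.5);  // same slot, no new entry
        // A JIT would emit something like: movsd xmm0, [poolBase + 8*half]
        printf("0.5 lives at pool[%zu] (reused: %s), value %f\n",
               half, half == again ? "yes" : "no", pool.base()[half]);
        return 0;
    }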
I played around with this by reducing math-* from SunSpider to microbenchmarks. If we get our other wins in place (fast calls, special-casing some Math functions, more fast paths), we'll be as fast or faster. There's probably not a whole lot more to do here. Experimenting with ideas from bug 574930 might be interesting in the future, but I'm convinced it's time to move on to bigger wins. Filed follow-up bug 582152 on improving double-loading performance.
Status: ASSIGNED → RESOLVED
Closed: 14 years ago
No longer depends on: 574930
Resolution: --- → FIXED