It appears that IonMonkey is still slower on empty for-loops. This for loop optimization might be the key to some remaining sunspider benchmarks.
The performance problem we observe on misc-basic-emptyloop seems to be cause by the allocation of an extra register to do the increment of the loop counter. move rdx -> rbx addi rbx, c -> rbx move rbx -> rdx This allocation is likely caused by the block resume point which captures the loop counter. This cause the allocation of an extra register because LAddI instruction can bail out and the loop counter should be restored. We have one solution to handle such cases, which is to ensure all bailout happen before we modify any input register. This solution alone will only solve the loop-counter issue, but it will not help in case of extra counter registered within the loop, because we might still have to resume before other counters modifications. In addition, we can register when an idempotent MIR/LIR instruction is a bijection (under the non-bailing domain), in which case we can handle multiple counters. This will increase the bailing cost but might reduce the register pressure caused by resume points. In this case, removing the resume point of the loop-increment block won't help because it capture the same state of the counter, and thus will still imply having an extra register for saving the state of the counter.
(In reply to Nicolas B. Pierron [:pierron] [:nbp] from comment #1) > In addition, we > can register when an idempotent MIR/LIR instruction is a bijection (under > the non-bailing domain), in which case we can handle multiple counters. > This will increase the bailing cost but might reduce the register pressure > caused by resume points. Brian Hackett worked on something similar in Bug 773666, where he made an attempt to do so for add and sub.
(In reply to Nicolas B. Pierron [:pierron] [:nbp] from comment #1) > The performance problem we observe on misc-basic-emptyloop seems to be cause > by the allocation of an extra register to do the increment of the loop > counter. > > move rdx -> rbx > addi rbx, c -> rbx > move rbx -> rdx Before going any deeper … I made a simple hack to test these hypothesis: 1/ using rdx instead of doing the 2 moves: no gain. 2/ using aligned basic block for loop edges: no gain. 3/ removing the overflow check: 120ms --> 90ms IonMonkey: (hacked: aligned, less registers) 90ms ; rcx = len 0x7ffff7f8aed0: movabs $0x7ffff7fbc010,%r11 0x7ffff7f8aeda: cmpl $0x0,(%r11) 0x7ffff7f8aede: jne 0x7ffff7f8af56 0x7ffff7f8aee4: cmp %ecx,%edx 0x7ffff7f8aee6: jge 0x7ffff7f8aef4 ; mov edx, ebx 0x7ffff7f8aeec: add $0x1,%edx ; edx <-> ebx ; mov ebx, edx ; jo <bailout> 0x7ffff7f8aeef: jmpq 0x7ffff7f8aed0 JägerMonkey: 70ms ; r9 = len 0x7ffff7f8a129: int3 // added for dumping this code. => 0x7ffff7f8a12a: movabs $0x7ffff7fbc010,%r15 0x7ffff7f8a134: cmpl $0x0,(%r15) 0x7ffff7f8a138: jne 0x7ffff7f8a27e 0x7ffff7f8a13e: sub $0xffffffff,%r12d ; really ?! 0x7ffff7f8a142: movabs $0xfff8800000000000,%r10 0x7ffff7f8a14c: or %r12,%r10 0x7ffff7f8a14f: mov %r10,0x70(%rbx) 0x7ffff7f8a153: cmp %r9d,%r12d 0x7ffff7f8a156: jl 0x7ffff7f8a129 0x7ffff7f8a15c: jmpq 0x7ffff7f8a179 I will look at the order of blocks and see if replacing the conditional forward jump by a conditional backward jump can help. Then it might be interesting to test (2) again.
(In reply to Nicolas B. Pierron [:pierron] [:nbp] from comment #3) > > 1/ using rdx instead of doing the 2 moves: no gain. > 2/ using aligned basic block for loop edges: no gain. > 3/ removing the overflow check: 120ms --> 90ms 4/ Using a conditional jump to jump to the top of the loop. 5/ Using sub -1 instead of add +1 Here are the following test I made, I put the following list after remarking that a configuration which was uninteresting at some point became interesting when combined. current jm: 70ms current: 120ms 1: 120ms 1 & 2: 120ms 1 & 2 & 3: 90ms 3 & 4: 90ms 3 & 4 & 5: 90ms 0x7ffff7f8aec9: mov %rdx,%rbx 0x7ffff7f8aecc: sub $0xffffffff,%ebx 0x7ffff7f8aecf: mov %rbx,%rdx 0x7ffff7f8aed2: movabs $0x7ffff7fbc010,%r11 0x7ffff7f8aedc: cmpl $0x0,(%r11) 0x7ffff7f8aee0: jne 0x7ffff7f8af50 0x7ffff7f8aee6: cmp %ecx,%edx 0x7ffff7f8aee8: jl 0x7ffff7f8aec9 *1* & 3 & 4: 60ms Removing the extra moves statements seems to provide a huge win this time. Changing the moves to be 32 bits move instead of 64 bits does change the previous results. I guess this reduce the number of useful instructions for the loop. The JM equivalent, which does a write within the loop, is likely optimized out by the CPU which analyze that it is the only owner of the cache line, and that nobody can read the data which is constantly erased by the latest iteration. 0x7ffff7f8aeca: add $0x1,%edx 0x7ffff7f8aecd: movabs $0x7ffff7fbc010,%r11 0x7ffff7f8aed7: cmpl $0x0,(%r11) 0x7ffff7f8aedb: jne 0x7ffff7f8af4b 0x7ffff7f8aee1: cmp %ecx,%edx 0x7ffff7f8aee3: jl 0x7ffff7f8aeca 0x7ffff7f8aee9: 1 & 2 & 3 & 4: 60ms
(In reply to Nicolas B. Pierron [:pierron] [:nbp] from comment #3) > 3/ removing the overflow check: 120ms --> 90ms Enabling range analysis does the same work, and remove the overflow check of this loop. We cannot enable range analysis by default yet because a few patches are missing. See Bug 765119 for details.
This bug is not fully fixed by range analysis, but other hack to reduce the loop size are just kind of optimizations which are not reachable by our current code generation. With usual JS code, these tight loops are unlikely to be triggered.