IonMonkey: Optimize empty for-loops. (AWFY: 2x slower than JM on x86)

RESOLVED DUPLICATE of bug 765119

Status

()

Core
JavaScript Engine
RESOLVED DUPLICATE of bug 765119
5 years ago
5 years ago

People

(Reporter: nbp, Assigned: nbp)

Tracking

(Blocks: 1 bug)

Firefox Tracking Flags

(Not tracked)

Details

(Whiteboard: [ion:t])

(Assignee)

Description

5 years ago
It appears that IonMonkey is still slower on empty for-loops.  This for loop optimization might be the key to some remaining sunspider benchmarks.
(Assignee)

Updated

5 years ago
Assignee: general → nicolas.b.pierron
Status: NEW → ASSIGNED
Hardware: x86_64 → x86
Whiteboard: [ion:t]
(Assignee)

Comment 1

5 years ago
The performance problem we observe on misc-basic-emptyloop seems to be cause by the allocation of an extra register to do the increment of the loop counter.

move rdx -> rbx
addi rbx, c -> rbx
move rbx -> rdx

This allocation is likely caused by the block resume point which captures the loop counter.  This cause the allocation of an extra register because LAddI instruction can bail out and the loop counter should be restored.

We have one solution to handle such cases, which is to ensure all bailout happen before we modify any input register.

This solution alone will only solve the loop-counter issue, but it will not help in case of extra counter registered within the loop, because we might still have to resume before other counters modifications.  In addition, we can register when an idempotent MIR/LIR instruction is a bijection (under the non-bailing domain), in which case we can handle multiple counters.  This will increase the bailing cost but might reduce the register pressure caused by resume points.

In this case, removing the resume point of the loop-increment block won't help because it capture the same state of the counter, and thus will still imply having an extra register for saving the state of the counter.
(Assignee)

Comment 2

5 years ago
(In reply to Nicolas B. Pierron [:pierron] [:nbp] from comment #1)
> In addition, we
> can register when an idempotent MIR/LIR instruction is a bijection (under
> the non-bailing domain), in which case we can handle multiple counters. 
> This will increase the bailing cost but might reduce the register pressure
> caused by resume points.

Brian Hackett worked on something similar in Bug 773666, where he made an attempt to do so for add and sub.
(Assignee)

Comment 3

5 years ago
(In reply to Nicolas B. Pierron [:pierron] [:nbp] from comment #1)
> The performance problem we observe on misc-basic-emptyloop seems to be cause
> by the allocation of an extra register to do the increment of the loop
> counter.
> 
> move rdx -> rbx
> addi rbx, c -> rbx
> move rbx -> rdx

Before going any deeper … I made a simple hack to test these hypothesis:

1/ using rdx instead of doing the 2 moves: no gain.
2/ using aligned basic block for loop edges: no gain.
3/ removing the overflow check: 120ms --> 90ms

IonMonkey: (hacked: aligned, less registers) 90ms
   ; rcx = len
   0x7ffff7f8aed0:      movabs $0x7ffff7fbc010,%r11
   0x7ffff7f8aeda:      cmpl   $0x0,(%r11)
   0x7ffff7f8aede:      jne    0x7ffff7f8af56
   0x7ffff7f8aee4:      cmp    %ecx,%edx
   0x7ffff7f8aee6:      jge    0x7ffff7f8aef4
   ; mov edx, ebx
   0x7ffff7f8aeec:      add    $0x1,%edx ; edx <-> ebx
   ; mov ebx, edx
   ; jo <bailout>
   0x7ffff7f8aeef:      jmpq   0x7ffff7f8aed0

JägerMonkey: 70ms
   ; r9 = len
   0x7ffff7f8a129:      int3 // added for dumping this code.
=> 0x7ffff7f8a12a:      movabs $0x7ffff7fbc010,%r15
   0x7ffff7f8a134:      cmpl   $0x0,(%r15)
   0x7ffff7f8a138:      jne    0x7ffff7f8a27e
   0x7ffff7f8a13e:      sub    $0xffffffff,%r12d

   ; really ?!
   0x7ffff7f8a142:      movabs $0xfff8800000000000,%r10
   0x7ffff7f8a14c:      or     %r12,%r10
   0x7ffff7f8a14f:      mov    %r10,0x70(%rbx)

   0x7ffff7f8a153:      cmp    %r9d,%r12d
   0x7ffff7f8a156:      jl     0x7ffff7f8a129
   0x7ffff7f8a15c:      jmpq   0x7ffff7f8a179


I will look at the order of blocks and see if replacing the conditional forward jump by a conditional backward jump can help.  Then it might be interesting to test (2) again.
(Assignee)

Comment 4

5 years ago
(In reply to Nicolas B. Pierron [:pierron] [:nbp] from comment #3)
> 
> 1/ using rdx instead of doing the 2 moves: no gain.
> 2/ using aligned basic block for loop edges: no gain.
> 3/ removing the overflow check: 120ms --> 90ms
4/ Using a conditional jump to jump to the top of the loop.
5/ Using sub -1 instead of add +1

Here are the following test I made, I put the following list after remarking that a configuration which was uninteresting at some point became interesting when combined.

current jm: 70ms
current: 120ms
1: 120ms
1 & 2: 120ms
1 & 2 & 3: 90ms
3 & 4: 90ms
3 & 4 & 5: 90ms

   0x7ffff7f8aec9:      mov    %rdx,%rbx
   0x7ffff7f8aecc:      sub    $0xffffffff,%ebx
   0x7ffff7f8aecf:      mov    %rbx,%rdx
   0x7ffff7f8aed2:      movabs $0x7ffff7fbc010,%r11
   0x7ffff7f8aedc:      cmpl   $0x0,(%r11)
   0x7ffff7f8aee0:      jne    0x7ffff7f8af50
   0x7ffff7f8aee6:      cmp    %ecx,%edx
   0x7ffff7f8aee8:      jl     0x7ffff7f8aec9

*1* & 3 & 4: 60ms

  Removing the extra moves statements seems to provide a huge win this time.  Changing the moves to be 32 bits move instead of 64 bits does change the previous results.  I guess this reduce the number of useful instructions for the loop. The JM equivalent, which does a write within the loop, is likely optimized out by the CPU which analyze that it is the only owner of the cache line, and that nobody can read the data which is constantly erased by the latest iteration.

   0x7ffff7f8aeca:      add    $0x1,%edx
   0x7ffff7f8aecd:      movabs $0x7ffff7fbc010,%r11
   0x7ffff7f8aed7:      cmpl   $0x0,(%r11)
   0x7ffff7f8aedb:      jne    0x7ffff7f8af4b
   0x7ffff7f8aee1:      cmp    %ecx,%edx
   0x7ffff7f8aee3:      jl     0x7ffff7f8aeca
   0x7ffff7f8aee9:

1 & 2 & 3 & 4: 60ms
(Assignee)

Comment 5

5 years ago
(In reply to Nicolas B. Pierron [:pierron] [:nbp] from comment #3)
> 3/ removing the overflow check: 120ms --> 90ms

Enabling range analysis does the same work, and remove the overflow check of this loop.  We cannot enable range analysis by default yet because a few patches are missing. See Bug 765119 for details.
Depends on: 765119
(Assignee)

Comment 6

5 years ago
This bug is not fully fixed by range analysis, but other hack to reduce the loop size are just kind of optimizations which are not reachable by our current code generation.  With usual JS code, these tight loops are unlikely to be triggered.
Status: ASSIGNED → RESOLVED
Last Resolved: 5 years ago
Resolution: --- → DUPLICATE
Duplicate of bug: 765119
You need to log in before you can comment on or make changes to this bug.