Last Comment Bug 688078 - IonMonkey: Poor register allocation on benchmark
: IonMonkey: Poor register allocation on benchmark
Status: RESOLVED FIXED
:
Product: Core
Classification: Components
Component: JavaScript Engine (show other bugs)
: unspecified
: All All
: -- normal (vote)
: ---
Assigned To: Jan de Mooij [:jandem] (PTO until July 31)
:
Mentors:
Depends on: 709731
Blocks: 670624
  Show dependency treegraph
 
Reported: 2011-09-20 18:11 PDT by David Anderson [:dvander]
Modified: 2012-01-30 13:22 PST (History)
8 users (show)
See Also:
Crash Signature:
(edit)
QA Whiteboard:
Iteration: ---
Points: ---
Has Regression Range: ---
Has STR: ---


Attachments
benchmark (411 bytes, application/javascript)
2011-09-20 18:11 PDT, David Anderson [:dvander]
no flags Details
WIP 1 (7.99 KB, patch)
2011-11-10 09:44 PST, Jan de Mooij [:jandem] (PTO until July 31)
no flags Details | Diff | Splinter Review
Fix (8.44 KB, patch)
2011-11-12 00:06 PST, Jan de Mooij [:jandem] (PTO until July 31)
no flags Details | Diff | Splinter Review

Description David Anderson [:dvander] 2011-09-20 18:11:57 PDT
Created attachment 561364 [details]
benchmark

IonMonkey w/ LSRA does rather poorly on the attached integer benchmark.

Crankshaft: 260ms
Ion LSRA:   350ms
Ion Greedy: 284ms
TI -m -n:   315ms

Assembly dumps are below. It seems like LSRA has eleven mysterious, probably unnecessary stores in the tightest loop. The Greedy allocator has 5 stores and 6 loads (it effectively spills at loop edges). Crankshaft has one load and zero stores.

We should find out what's going on here.

Crankshaft emits the following assembly:
      => 0x5c8231e3:	mov    %esi,%ebx
      => 0x5c8231e5:	and    $0xffff,%ebx
      => 0x5c8231eb:	mov    %eax,%edi
      => 0x5c8231ed:	add    %ebx,%edi
      => 0x5c8231ef:	mov    %esi,%ebx
      => 0x5c8231f1:	sar    $0x10,%ebx
      => 0x5c8231f4:	mov    %ecx,%edx
      => 0x5c8231f6:	add    %ebx,%edx
      => 0x5c8231f8:	mov    %edi,%ebx
      => 0x5c8231fa:	sar    $0x10,%ebx
      => 0x5c8231fd:	add    %ebx,%edx
      => 0x5c8231ff:	shl    $0x10,%edx
      => 0x5c823202:	and    $0xffff,%edi
      => 0x5c823208:	or     %edi,%edx
      => 0x5c82320a:	mov    -0x3c(%ebp),%ebx
      => 0x5c82320d:	or     %edx,%ebx

IM LSRA:
         0xf73ef1d8:	mov    %ebp,0x7c(%esp)
      => 0xf73ef1dc:	mov    %edx,%ebp
         0xf73ef1de:	and    $0xffff,%edx
         0xf73ef1e4:	mov    %ecx,0x78(%esp)
         0xf73ef1e8:	add    %edx,%ecx
         0xf73ef1ea:	jo     0xf73ef013
         0xf73ef1f0:	mov    %ebp,%edx
         0xf73ef1f2:	sar    $0x10,%ebp
         0xf73ef1f5:	mov    %ebx,0x74(%esp)
         0xf73ef1f9:	add    %ebp,%ebx
         0xf73ef1fb:	jo     0xf73ef018
         0xf73ef201:	mov    %ecx,%ebp
         0xf73ef203:	sar    $0x10,%ecx
         0xf73ef206:	mov    %ebx,0x70(%esp)
         0xf73ef20a:	add    %ecx,%ebx
         0xf73ef20c:	jo     0xf73ef01d
         0xf73ef212:	mov    %ebx,0x70(%esp)
         0xf73ef216:	shl    $0x10,%ebx
         0xf73ef219:	mov    %ebp,0x70(%esp)
         0xf73ef21d:	and    $0xffff,%ebp
         0xf73ef223:	mov    %ebx,0x70(%esp)
         0xf73ef227:	or     %ebp,%ebx
         0xf73ef229:	mov    %esi,0x70(%esp)
         0xf73ef22d:	or     %ebx,%esi
         0xf73ef22f:	mov    %edx,0x6c(%esp)
         0xf73ef233:	add    $0x1,%edx
         0xf73ef236:	jo     0xf73ef022
         0xf73ef23c:	mov    0x74(%esp),%ebx
         0xf73ef240:	mov    0x78(%esp),%ecx

IM Greedy:
         0xf73ef1ba:	mov    0x9c(%esp),%ecx
         0xf73ef1c1:	mov    0x78(%esp),%edx
         0xf73ef1c5:	mov    0x6c(%esp),%ebx
      => 0xf73ef1c9:	mov    0x70(%esp),%ebp
         0xf73ef1cd:	mov    0x74(%esp),%esi
         0xf73ef1d1:	mov    0x7c(%esp),%edi
         0xf73ef1d5:	cmp    %ecx,%esi
         0xf73ef1d7:	jl     0xf73ef1ff
         0xf73ef1dd:	mov    %edi,%esi
         0xf73ef1df:	mov    %edx,%edi
         0xf73ef1e1:	add    $0x1,%edi
         0xf73ef1e4:	jo     0xf73ef00e
         0xf73ef1ea:	mov    %edi,0x78(%esp)
         0xf73ef1ee:	mov    %edi,%edx
         0xf73ef1f0:	mov    %edi,0x78(%esp)
         0xf73ef1f4:	mov    %esi,%edi
         0xf73ef1f6:	mov    %esi,0x6c(%esp)
         0xf73ef1fa:	jmp    0xf73ef16e
         0xf73ef1ff:	mov    %esi,%edx
         0xf73ef201:	and    $0xffff,%edx
         0xf73ef207:	add    %edx,%ebx
         0xf73ef209:	jo     0xf73ef013
         0xf73ef20f:	mov    %esi,%edx
         0xf73ef211:	sar    $0x10,%edx
         0xf73ef214:	add    %edx,%ebp
         0xf73ef216:	jo     0xf73ef018
         0xf73ef21c:	mov    %ebx,%edx
         0xf73ef21e:	sar    $0x10,%edx
         0xf73ef221:	add    %edx,%ebp
         0xf73ef223:	jo     0xf73ef01d
         0xf73ef229:	shl    $0x10,%ebp
         0xf73ef22c:	and    $0xffff,%ebx
         0xf73ef232:	or     %ebx,%ebp
         0xf73ef234:	or     %ebp,%edi
         0xf73ef236:	add    $0x1,%esi
         0xf73ef239:	jo     0xf73ef022
         0xf73ef23f:	mov    %esi,0x74(%esp)
         0xf73ef243:	mov    %edi,0x7c(%esp)
         0xf73ef247:	jmp    0xf73ef1ba
Comment 1 Jan de Mooij [:jandem] (PTO until July 31) 2011-11-09 04:35:40 PST
I looked at this, since this regalloc problem is really common. Consider this function:

function f(x) {
    x >>= 1;
    return x;
}

The LIR:
 
// ...
0 movegroup ()[arg:8 -> =eax] <|@
6 unbox ([i:4 (r)]) (arg:12), (=eax) <|@
0 movegroup () <|@
7 shiftop ([i:7 (=eax)]) (=eax), (c) <|@
8 box ([t:8 (=ecx)], [d:7 (r)]) (=eax) <|@
0 movegroup ()[=eax -> =edx] <|@
9 return () (=ecx), (=edx) <|@

Here we have the following (problematic) intervals:

interval 1: [12, 15]
interval 2: [15, 19] (requirement: SAME_AS interval 1)

The algorithm proceeds as follows:

1) process [12, 15]. Assign register eax
2) process [15, 19]. We have to assign register eax (due to the SAME_AS requirement). However, register eax is still in use by interval 1. So interval 1 is split into [12, 14] and [15, 15].
3) process [15, 15]. This interval has no requirement or hint, so it's (eagerly) spilled.

I think the basic problem here is that interval 1 and 2 overlap but require the same register. adrake, what do you think is the best fix here? Maybe the first interval should be [12, 14] because the current instruction has an interval with SAME_AS requirement?
Comment 2 Jan de Mooij [:jandem] (PTO until July 31) 2011-11-10 04:14:29 PST
I'm working on this now, per our discussion yesterday. The testcase is interesting because there are 5 values we really want to keep in registers in the inner loop:

1: hoisted i & 0xFFFF
2: hoisted i >> 16
3: j
4: t
5: b

On x86 this leaves only 3 registers for the temporaries inside the loop and |i| and |a| from the outer loop.
Comment 3 Jan de Mooij [:jandem] (PTO until July 31) 2011-11-10 04:25:39 PST
(In reply to Jan de Mooij (:jandem) from comment #2)
> On x86 this leaves only 3 registers for the temporaries inside the loop and
> |i| and |a| from the outer loop.

Typo, 2 registers.

Note that the Crankshaft info in comment 0 is not entirely correct. They have 4 loads and 1 store (at the end of the inner loop, right before the jump). They spill (store and load) t and reload b and two other values that are not used inside the inner loop.
Comment 4 Jan de Mooij [:jandem] (PTO until July 31) 2011-11-10 09:44:02 PST
Created attachment 573551 [details] [diff] [review]
WIP 1

This patch makes two changes to the linear scan allocator to reduce spills. It passes jit-tests with --ion-eager, but I still need to clean it up and test it better before asking for review.

For the attached benchmark:

d8         : 313 ms
IM LSRA new: 351 ms
TI+JM      : 383 ms
IM Greedy  : 427 ms
IM LSRA old: 435 ms

The remaining difference with Crankshaft is probably caused by the overflow checks for add (bug 699883). Here's the code we generate for the inner loop:
--
0x9d61c4:	cmp    %eax,%ebx
0x9d61c6:	jge    0x9d622c
0x9d61cc:	mov    %ebx,%eax
0x9d61ce:	and    $0xffff,%ebx
0x9d61d4:	mov    %edx,0x7c(%esp)
0x9d61d8:	add    %ebx,%edx
0x9d61da:	jo     0x9d600e
0x9d61e0:	mov    %eax,%ebx
0x9d61e2:	sar    $0x10,%eax
0x9d61e5:	mov    %ebp,0x78(%esp)
0x9d61e9:	add    %eax,%ebp
0x9d61eb:	jo     0x9d6013
0x9d61f1:	mov    %edx,%eax
0x9d61f3:	sar    $0x10,%edx
0x9d61f6:	add    %edx,%ebp
0x9d61f8:	jo     0x9d6018
0x9d61fe:	shl    $0x10,%ebp
0x9d6201:	and    $0xffff,%eax
0x9d6207:	or     %eax,%ebp
0x9d6209:	mov    %edi,%eax
0x9d620b:	or     %ebp,%edi
0x9d620d:	mov    %ebx,%edx
0x9d620f:	add    $0x1,%ebx
0x9d6212:	jo     0x9d601d
0x9d6218:	mov    0x78(%esp),%ebp
0x9d621c:	mov    0x7c(%esp),%edx
0x9d6220:	mov    0x9c(%esp),%eax
0x9d6227:	jmp    0x9d61c4
--
3 loads and 2 stores. It's interesting that these stores are for the loop invariant entries (i & 0xffff) and (i >> 16). We should be able to get rid of them by moving the stores to the outer loop.

Wimmer's 2005 paper (not 2010) mentions they do this by moving the split position before the loop block. I'll look into that tomorrow; if it works, we have 3 loads (like Crankshaft) but 0 stores (Crankshaft has 1).
Comment 5 Jan de Mooij [:jandem] (PTO until July 31) 2011-11-12 00:06:20 PST
Created attachment 574009 [details] [diff] [review]
Fix

Passes jit-tests and a few hours of fuzzing with anion. I think we should take this patch, it helps this benchmark (and other loops) a lot. We can think about other fixes when we see benchmarks where we do significantly worse than others.
Comment 6 Jan de Mooij [:jandem] (PTO until July 31) 2011-12-12 01:34:41 PST
Comment on attachment 574009 [details] [diff] [review]
Fix

Bug 709731 will fix this for the most part (but we still want some changes from this patch)
Comment 7 Paul Wright 2012-01-04 09:18:09 PST
Was this patch (or parts of it, from comment 6) ever reviewed / pushed?  Is there any benefit to doing so at this point?
Comment 8 Jan de Mooij [:jandem] (PTO until July 31) 2012-01-04 12:35:20 PST
(In reply to Paul Wright from comment #7)
> Is there any benefit to doing so at this point?

No, the patch was obsoleted by bug 709731. Regalloc for the inner loop is much better now:
--
0xd5b52b:	cmp    %ecx,%ebp
0xd5b52d:	jge    0xd5b590
0xd5b533:	mov    %ebp,%eax
0xd5b535:	and    $0xffff,%eax
0xd5b53b:	mov    %esi,%ecx
0xd5b53d:	add    %eax,%ecx
0xd5b53f:	jo     0xd5b01d
0xd5b545:	mov    %ebp,%eax
0xd5b547:	sar    $0x10,%eax
0xd5b54a:	mov    %edi,%esi
0xd5b54c:	add    %eax,%esi
0xd5b54e:	jo     0xd5b022
0xd5b554:	mov    %ecx,%eax
0xd5b556:	sar    $0x10,%eax
0xd5b559:	add    %eax,%esi
0xd5b55b:	jo     0xd5b027
0xd5b561:	shl    $0x10,%esi
0xd5b564:	and    $0xffff,%ecx
0xd5b56a:	or     %ecx,%esi
0xd5b56c:	mov    %edx,%eax
0xd5b56e:	or     %esi,%eax
0xd5b570:	mov    %ebp,%ecx
0xd5b572:	add    $0x1,%ecx
0xd5b575:	jo     0xd5b02c
0xd5b57b:	mov    0x4c(%esp),%esi
0xd5b57f:	mov    %ecx,%ebp
0xd5b581:	mov    0x48(%esp),%ecx
0xd5b585:	mov    %eax,%edx
0xd5b587:	mov    0x44(%esp),%eax
0xd5b58b:	jmp    0xd5b52b
--
Still not optimal though:

1) Range analysis is needed to get rid of the overflow checks for most adds.

2) We're spilling values that are used in the inner loop, instead of spilling values used by the outer loop (for instance, we don't touch ebx inside the inner loop, but we should).

To fix this, we could insert use positions for intervals used inside the loop at the loop backedge, like JVM's allocator. However, this is not high priority since regalloc is reasonable now and comparable to v8 and the greedy allocator.

Note You need to log in before you can comment on or make changes to this bug.