Closed
Bug 688078
Opened 13 years ago
Closed 12 years ago
IonMonkey: Poor register allocation on benchmark
Categories
(Core :: JavaScript Engine, defect)
Core
JavaScript Engine
Tracking
()
RESOLVED
FIXED
People
(Reporter: dvander, Assigned: jandem)
References
Details
Attachments
(2 files, 1 obsolete file)
411 bytes,
application/javascript
|
Details | |
8.44 KB,
patch
|
Details | Diff | Splinter Review |
IonMonkey w/ LSRA does rather poorly on the attached integer benchmark. Crankshaft: 260ms Ion LSRA: 350ms Ion Greedy: 284ms TI -m -n: 315ms Assembly dumps are below. It seems like LSRA has eleven mysterious, probably unnecessary stores in the tightest loop. The Greedy allocator has 5 stores and 6 loads (it effectively spills at loop edges). Crankshaft has one load and zero stores. We should find out what's going on here. Crankshaft emits the following assembly: => 0x5c8231e3: mov %esi,%ebx => 0x5c8231e5: and $0xffff,%ebx => 0x5c8231eb: mov %eax,%edi => 0x5c8231ed: add %ebx,%edi => 0x5c8231ef: mov %esi,%ebx => 0x5c8231f1: sar $0x10,%ebx => 0x5c8231f4: mov %ecx,%edx => 0x5c8231f6: add %ebx,%edx => 0x5c8231f8: mov %edi,%ebx => 0x5c8231fa: sar $0x10,%ebx => 0x5c8231fd: add %ebx,%edx => 0x5c8231ff: shl $0x10,%edx => 0x5c823202: and $0xffff,%edi => 0x5c823208: or %edi,%edx => 0x5c82320a: mov -0x3c(%ebp),%ebx => 0x5c82320d: or %edx,%ebx IM LSRA: 0xf73ef1d8: mov %ebp,0x7c(%esp) => 0xf73ef1dc: mov %edx,%ebp 0xf73ef1de: and $0xffff,%edx 0xf73ef1e4: mov %ecx,0x78(%esp) 0xf73ef1e8: add %edx,%ecx 0xf73ef1ea: jo 0xf73ef013 0xf73ef1f0: mov %ebp,%edx 0xf73ef1f2: sar $0x10,%ebp 0xf73ef1f5: mov %ebx,0x74(%esp) 0xf73ef1f9: add %ebp,%ebx 0xf73ef1fb: jo 0xf73ef018 0xf73ef201: mov %ecx,%ebp 0xf73ef203: sar $0x10,%ecx 0xf73ef206: mov %ebx,0x70(%esp) 0xf73ef20a: add %ecx,%ebx 0xf73ef20c: jo 0xf73ef01d 0xf73ef212: mov %ebx,0x70(%esp) 0xf73ef216: shl $0x10,%ebx 0xf73ef219: mov %ebp,0x70(%esp) 0xf73ef21d: and $0xffff,%ebp 0xf73ef223: mov %ebx,0x70(%esp) 0xf73ef227: or %ebp,%ebx 0xf73ef229: mov %esi,0x70(%esp) 0xf73ef22d: or %ebx,%esi 0xf73ef22f: mov %edx,0x6c(%esp) 0xf73ef233: add $0x1,%edx 0xf73ef236: jo 0xf73ef022 0xf73ef23c: mov 0x74(%esp),%ebx 0xf73ef240: mov 0x78(%esp),%ecx IM Greedy: 0xf73ef1ba: mov 0x9c(%esp),%ecx 0xf73ef1c1: mov 0x78(%esp),%edx 0xf73ef1c5: mov 0x6c(%esp),%ebx => 0xf73ef1c9: mov 0x70(%esp),%ebp 0xf73ef1cd: mov 0x74(%esp),%esi 0xf73ef1d1: mov 0x7c(%esp),%edi 0xf73ef1d5: cmp %ecx,%esi 0xf73ef1d7: jl 0xf73ef1ff 0xf73ef1dd: mov %edi,%esi 0xf73ef1df: mov %edx,%edi 0xf73ef1e1: add $0x1,%edi 0xf73ef1e4: jo 0xf73ef00e 0xf73ef1ea: mov %edi,0x78(%esp) 0xf73ef1ee: mov %edi,%edx 0xf73ef1f0: mov %edi,0x78(%esp) 0xf73ef1f4: mov %esi,%edi 0xf73ef1f6: mov %esi,0x6c(%esp) 0xf73ef1fa: jmp 0xf73ef16e 0xf73ef1ff: mov %esi,%edx 0xf73ef201: and $0xffff,%edx 0xf73ef207: add %edx,%ebx 0xf73ef209: jo 0xf73ef013 0xf73ef20f: mov %esi,%edx 0xf73ef211: sar $0x10,%edx 0xf73ef214: add %edx,%ebp 0xf73ef216: jo 0xf73ef018 0xf73ef21c: mov %ebx,%edx 0xf73ef21e: sar $0x10,%edx 0xf73ef221: add %edx,%ebp 0xf73ef223: jo 0xf73ef01d 0xf73ef229: shl $0x10,%ebp 0xf73ef22c: and $0xffff,%ebx 0xf73ef232: or %ebx,%ebp 0xf73ef234: or %ebp,%edi 0xf73ef236: add $0x1,%esi 0xf73ef239: jo 0xf73ef022 0xf73ef23f: mov %esi,0x74(%esp) 0xf73ef243: mov %edi,0x7c(%esp) 0xf73ef247: jmp 0xf73ef1ba
Assignee | ||
Comment 1•13 years ago
|
||
I looked at this, since this regalloc problem is really common. Consider this function: function f(x) { x >>= 1; return x; } The LIR: // ... 0 movegroup ()[arg:8 -> =eax] <|@ 6 unbox ([i:4 (r)]) (arg:12), (=eax) <|@ 0 movegroup () <|@ 7 shiftop ([i:7 (=eax)]) (=eax), (c) <|@ 8 box ([t:8 (=ecx)], [d:7 (r)]) (=eax) <|@ 0 movegroup ()[=eax -> =edx] <|@ 9 return () (=ecx), (=edx) <|@ Here we have the following (problematic) intervals: interval 1: [12, 15] interval 2: [15, 19] (requirement: SAME_AS interval 1) The algorithm proceeds as follows: 1) process [12, 15]. Assign register eax 2) process [15, 19]. We have to assign register eax (due to the SAME_AS requirement). However, register eax is still in use by interval 1. So interval 1 is split into [12, 14] and [15, 15]. 3) process [15, 15]. This interval has no requirement or hint, so it's (eagerly) spilled. I think the basic problem here is that interval 1 and 2 overlap but require the same register. adrake, what do you think is the best fix here? Maybe the first interval should be [12, 14] because the current instruction has an interval with SAME_AS requirement?
Assignee | ||
Comment 2•13 years ago
|
||
I'm working on this now, per our discussion yesterday. The testcase is interesting because there are 5 values we really want to keep in registers in the inner loop: 1: hoisted i & 0xFFFF 2: hoisted i >> 16 3: j 4: t 5: b On x86 this leaves only 3 registers for the temporaries inside the loop and |i| and |a| from the outer loop.
Assignee: general → jdemooij
Status: NEW → ASSIGNED
Assignee | ||
Comment 3•13 years ago
|
||
(In reply to Jan de Mooij (:jandem) from comment #2) > On x86 this leaves only 3 registers for the temporaries inside the loop and > |i| and |a| from the outer loop. Typo, 2 registers. Note that the Crankshaft info in comment 0 is not entirely correct. They have 4 loads and 1 store (at the end of the inner loop, right before the jump). They spill (store and load) t and reload b and two other values that are not used inside the inner loop.
Assignee | ||
Comment 4•13 years ago
|
||
This patch makes two changes to the linear scan allocator to reduce spills. It passes jit-tests with --ion-eager, but I still need to clean it up and test it better before asking for review. For the attached benchmark: d8 : 313 ms IM LSRA new: 351 ms TI+JM : 383 ms IM Greedy : 427 ms IM LSRA old: 435 ms The remaining difference with Crankshaft is probably caused by the overflow checks for add (bug 699883). Here's the code we generate for the inner loop: -- 0x9d61c4: cmp %eax,%ebx 0x9d61c6: jge 0x9d622c 0x9d61cc: mov %ebx,%eax 0x9d61ce: and $0xffff,%ebx 0x9d61d4: mov %edx,0x7c(%esp) 0x9d61d8: add %ebx,%edx 0x9d61da: jo 0x9d600e 0x9d61e0: mov %eax,%ebx 0x9d61e2: sar $0x10,%eax 0x9d61e5: mov %ebp,0x78(%esp) 0x9d61e9: add %eax,%ebp 0x9d61eb: jo 0x9d6013 0x9d61f1: mov %edx,%eax 0x9d61f3: sar $0x10,%edx 0x9d61f6: add %edx,%ebp 0x9d61f8: jo 0x9d6018 0x9d61fe: shl $0x10,%ebp 0x9d6201: and $0xffff,%eax 0x9d6207: or %eax,%ebp 0x9d6209: mov %edi,%eax 0x9d620b: or %ebp,%edi 0x9d620d: mov %ebx,%edx 0x9d620f: add $0x1,%ebx 0x9d6212: jo 0x9d601d 0x9d6218: mov 0x78(%esp),%ebp 0x9d621c: mov 0x7c(%esp),%edx 0x9d6220: mov 0x9c(%esp),%eax 0x9d6227: jmp 0x9d61c4 -- 3 loads and 2 stores. It's interesting that these stores are for the loop invariant entries (i & 0xffff) and (i >> 16). We should be able to get rid of them by moving the stores to the outer loop. Wimmer's 2005 paper (not 2010) mentions they do this by moving the split position before the loop block. I'll look into that tomorrow; if it works, we have 3 loads (like Crankshaft) but 0 stores (Crankshaft has 1).
Assignee | ||
Comment 5•13 years ago
|
||
Passes jit-tests and a few hours of fuzzing with anion. I think we should take this patch, it helps this benchmark (and other loops) a lot. We can think about other fixes when we see benchmarks where we do significantly worse than others.
Attachment #573551 -
Attachment is obsolete: true
Attachment #574009 -
Flags: review?(dvander)
Assignee | ||
Updated•13 years ago
|
Attachment #574009 -
Flags: review?(dvander) → review?(sstangl)
Assignee | ||
Comment 6•13 years ago
|
||
Comment on attachment 574009 [details] [diff] [review] Fix Bug 709731 will fix this for the most part (but we still want some changes from this patch)
Attachment #574009 -
Flags: review?(sstangl)
Comment 7•13 years ago
|
||
Was this patch (or parts of it, from comment 6) ever reviewed / pushed? Is there any benefit to doing so at this point?
Assignee | ||
Comment 8•13 years ago
|
||
(In reply to Paul Wright from comment #7) > Is there any benefit to doing so at this point? No, the patch was obsoleted by bug 709731. Regalloc for the inner loop is much better now: -- 0xd5b52b: cmp %ecx,%ebp 0xd5b52d: jge 0xd5b590 0xd5b533: mov %ebp,%eax 0xd5b535: and $0xffff,%eax 0xd5b53b: mov %esi,%ecx 0xd5b53d: add %eax,%ecx 0xd5b53f: jo 0xd5b01d 0xd5b545: mov %ebp,%eax 0xd5b547: sar $0x10,%eax 0xd5b54a: mov %edi,%esi 0xd5b54c: add %eax,%esi 0xd5b54e: jo 0xd5b022 0xd5b554: mov %ecx,%eax 0xd5b556: sar $0x10,%eax 0xd5b559: add %eax,%esi 0xd5b55b: jo 0xd5b027 0xd5b561: shl $0x10,%esi 0xd5b564: and $0xffff,%ecx 0xd5b56a: or %ecx,%esi 0xd5b56c: mov %edx,%eax 0xd5b56e: or %esi,%eax 0xd5b570: mov %ebp,%ecx 0xd5b572: add $0x1,%ecx 0xd5b575: jo 0xd5b02c 0xd5b57b: mov 0x4c(%esp),%esi 0xd5b57f: mov %ecx,%ebp 0xd5b581: mov 0x48(%esp),%ecx 0xd5b585: mov %eax,%edx 0xd5b587: mov 0x44(%esp),%eax 0xd5b58b: jmp 0xd5b52b -- Still not optimal though: 1) Range analysis is needed to get rid of the overflow checks for most adds. 2) We're spilling values that are used in the inner loop, instead of spilling values used by the outer loop (for instance, we don't touch ebx inside the inner loop, but we should). To fix this, we could insert use positions for intervals used inside the loop at the loop backedge, like JVM's allocator. However, this is not high priority since regalloc is reasonable now and comparable to v8 and the greedy allocator.
Reporter | ||
Updated•12 years ago
|
Status: ASSIGNED → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
You need to log in
before you can comment on or make changes to this bug.
Description
•