Closed
Bug 555255
Opened 14 years ago
Closed 14 years ago
Support rematerializing ALU operations
Categories
(Core Graveyard :: Nanojit, defect)
Core Graveyard
Nanojit
Tracking
(Not tracked)
RESOLVED
FIXED
Future
People
(Reporter: edwsmith, Unassigned)
References
Details
(Whiteboard: PACMAN, fixed-in-nanojit, fixed-in-tamarin, fixed-in-tracemonkey)
Attachments
(5 files, 4 obsolete files)
4.43 KB, patch | n.nethercote: review+ | Details | Diff | Splinter Review
3.94 KB, patch | n.nethercote: review+ | Details | Diff | Splinter Review
2.35 KB, patch | rreitmai: review+, wmaddox: feedback+ | Details | Diff | Splinter Review
1.70 KB, patch | jbramley: review+, n.nethercote: feedback+ | Details | Diff | Splinter Review
4.17 KB, patch | n.nethercote: review+ | Details | Diff | Splinter Review
Bug 542016 (removing LIR_addp) is one of several bugs we have completed recently that boost the effectiveness of CSE. As expected, this also increases register pressure. In the case of addp it was particularly acute, because better CSE in very long basic blocks produces this structure:

    ptr = ...
    p2  = ptr + 4
    ... = p2
    ...
    pN  = ptr + 4*N
    ... = pN
    // and later in the same (long) block:
    ... = p2
    ... = pN

(imagine many derived pointers). This naturally causes lots more spilling and reloading of the values for p2..pN.

Proposed solution: in examples like this, we end up in asm_restore() doing a spill (which is a load, since we generate code bottom-up) for LIR_add(ptr, const) where ptr is already in a register. In effect, if we're spilling a unary ALU op ("add R, const" is unary in this case from the register allocator's point of view), *and* the input is already in a register, then we can do better by generating the instruction instead of reloading from the spill slot. (A 1-cycle ALU op should be cheaper than a load, not to mention the secondary effect of eliminating the store too.)
Reporter
Comment 1 • 14 years ago

Here's a working example for ARM. In the long-basic-block example (N on the order of 50-100), it succeeds in cutting the stack size (and thus, spills) by half. I'd like to get early feedback on the approach, but the patch has a couple of obvious problems:

* Generality: is it smart to generalize this to all unary and pseudo-unary (reg op const) operations? If so, note that this patch only handles add.
* Safety: asm_restore() isn't allowed to modify condition codes, so on x86/x64, maybe only add/sub should be supported.
* Checking the register-allocation status in canRemat(), and then counting on it remaining the same in asm_restore(), is fragile.
* Priority: should we first spill constants, then look for 1-cycle ALU ops? findVictim() right now treats all instructions equally if canRemat() == true.

I also want to try this on x86 on TR and TM to see what the impact is before getting too far with it.
Assignee: nobody → edwsmith
Status: NEW → ASSIGNED
Reporter
Updated • 14 years ago
Attachment #435227 -
Flags: feedback?(nnethercote)
Reporter
Comment 2 • 14 years ago

(In reply to comment #0)
> Bug 542016 (removing LIR_addp) is one of several we have completed recently

Well, credit where due: "we" as in "Nick".
Comment 3 • 14 years ago

(In reply to comment #1)
> Here's a working example for ARM. In the long-basic-block example (N on the
> order of 50-100), it succeeds in cutting the stack size (and thus, spills) by
> half.

It'd be interesting to know if it has benefits other than in the specific case you were looking at.

> * generality: is it smart to generalize this to all unary and pseudo-unary (reg
> op const) operations? if so, this patch only handles add

My guess is yes. On a modern machine a single ALU op is just about always going to be better than a spill/restore pair, right?

> * safety: asm_restore isn't allowed to modify condition codes, so on x86/64,
> maybe only add/sub should be supported.

Is that because 'lea' is the only x86 arithmetic instruction that doesn't set CCs? (Can't remember off the top of my head.)

> * checking the register allocation status in canRemat(), and then counting on
> it remaining the same in asm_restore(), is fragile.

I've thought before that canRemat() should really be backend-specific, but never bothered to do it. (E.g. on i386 we can rematerialize some LIR_param instructions but that's not in canRemat().) Perhaps this would provide the motivation for that. If that's done then keeping them in sync isn't too bad, especially if there's a comment that they must be in sync.

> * priority. should we first spill constants, then look for 1-cycle ALU ops?
> findVictim() right now treats all instructions equally if canRemat() == true.

Some 1-cycle ALU ops might be preferable to constants, especially 64-bit constants on 32-bit platforms. findVictim() already has some notion of priority, so it shouldn't be hard to integrate canRemat() with that.

> I also want to try this on x86 on TR and TM to see what the impact is before
> getting too far with it.

Definitely, but it seems like a good idea to me.

(In reply to comment #2)
> well, credit where due: "we" as in "Nick"

You wrote the patch! I just said "yes please" :)
Comment 4 • 14 years ago

BTW, I don't like 'isAddConst()' as a name -- there's no indication that oprnd1 must be in a register. Maybe 'isAddRegConst()'?
Reporter
Comment 5 • 14 years ago

(In reply to comment #3)
> > well, credit where due: "we" as in "Nick"
>
> You wrote the patch! I just said "yes please" :)

I was referring to the many CSE improvements, especially aliasing. Group hug.
Reporter
Comment 6 • 14 years ago

(In reply to comment #3)
> My guess is yes. In a modern machine a single ALU op is just about always
> going to be better than a spill/restore pair, right?

That's what I'm thinking.

> Is that because 'lea' is the only x86 arithmetic instruction that doesn't set
> CCs? (Can't remember off the top of my head.)

Exactly. Although lea can also handle left-shifts by 1/2/3 bits, so we could cover add/sub/lsh. If asm_restore() could take a hint about whether the CCs are live, then we could handle other cases.

(In reply to comment #4)
> BTW, I don't like 'isAddConst()' as a name -- there's no indication that oprnd1
> must be in a register. Maybe 'isAddRegConst()'?

isAddConst() was just expedient for prototyping. I'll take a crack at generalizing and pick a better name.
Reporter
Updated • 14 years ago
Attachment #435227 -
Flags: feedback?(nnethercote)
Reporter
Updated • 14 years ago
Target Milestone: --- → Future
Reporter
Comment 7 • 14 years ago

In the X64 backend, each of the rematerialize cases in asm_restore() calls

    ins->clearReg();
    /* generate code */

I think the clearReg() call is redundant: when asm_restore() returns, the assembler calls clearReg() anyway and deallocates the register.
Reporter
Comment 8 • 14 years ago

If we could keep track of whether the CCs are "live" (whether it's okay to clobber them), then asm_restore() could handle many more cases on x86 and x64. The general pattern would be:

    if (is binary op && lhs in register && (rhs in register || rhs is constant)
        && can-clobber-CCs) {
        mov r, lhs
        alu-op r, rhs     // or: alu-op r, const
    }

The only case we can handle while preserving CCs is LEA, which also happens not to need to clobber any registers (it's Intel's only 3-address-form instruction, afaik).
Reporter
Comment 9 • 14 years ago

X64 stats, when addp and addl are rematerialized:

    avmshell-d -Dnodebugger -Ojit -Dverifyonly main.swf

              code size   loads    sum(framesize)
              ---------   ------   --------------
    before    1287K       16,211   238,896
    after     1270K       14,908   229,952
Reporter
Comment 10 • 14 years ago

x86-32 stats, same configuration & test:

              loads    sum(framesize)
              ------   --------------
    before    15,190   154,112
    after     14,779   152,816
Reporter
Comment 11 • 14 years ago
Moves canRemat() to each backend where it can be specialized. Early results seem promising even if purely from a code-size / frame-size standpoint. No performance numbers yet.
Attachment #435227 -
Attachment is obsolete: true
Reporter
Comment 12 • 14 years ago

ARM stats:

    avmshell-dd -Dnodebugger -Ojit -Dverifyonly main.swf

              loads    sum(framesize)
              ------   --------------
    before    15,649   164,944
    after     14,582   160,928
Comment 13 • 14 years ago

(In reply to comment #7)
> I think the clearReg() call is redundant. when asm_restore() returns, the
> assembler calls clearReg() anyway and deallocates the register.

True. The other backends do it too (at least I checked i386 and ARM and they both do, although the ARM backend uses the old deprecated_markAsClear()).
Comment 14 • 14 years ago

I took a look at the patch:
- I got compile warnings because there was no default case in the switch in canRematLEA().
- I got compile errors because _nvprof() shouldn't be used in Nativei386.cpp -- other uses are within "#ifdef PERFM".
- Codegen was barely changed for TM. It appears that LIR_addl results are very rarely spilled, and of those that are, most don't have an immediate RHS.
Reporter
Comment 15 • 14 years ago
Precursor to tailoring rematerialization code to each CPU.
Attachment #439942 -
Flags: review?(nnethercote)
Reporter
Comment 16 • 14 years ago

This patch does two things, neither of which should affect generated code.
1. In case LIR_alloc in gen(), replace inlined code with a call to evict(), since evict() does exactly what the inlined code does.
2. In the backends, remove ins->clearReg() or deprecated_markAsClear() calls from asm_restore(), since evict() takes care of the same thing as soon as asm_restore() returns.
Attachment #440008 -
Flags: review?(nnethercote)
Reporter
Comment 17 • 14 years ago

(In reply to comment #3)
> (In reply to comment #1)
> > * generality: is it smart to generalize this to all unary and pseudo-unary (reg
> > op const) operations? if so, this patch only handles add
>
> My guess is yes. In a modern machine a single ALU op is just about always
> going to be better than a spill/restore pair, right?

I'm starting to experiment with the other cases ARM handles: add/sub/or/xor/and. For TR I expect the common cases to be add (field calculations), or (pointer tagging), and and (tag removal).

It's also worth noting that there's nothing special about binary ops with a constant RHS. It could be a win to rematerialize ALU operations even if the left- and right-hand sides are in other registers. On RISC, that's one instruction; on x86, that's two in general, although maybe the CPU fuses them. But even two instructions (mov dest, lhs; alu-op dest, rhs) are probably preferable to a spill/reload.
Reporter
Updated • 14 years ago
Summary: Support rematerializing unary operations → Support rematerializing ALU operations
Comment 18 • 14 years ago

Comment on attachment 439942 [details] [diff] [review]
Moves canRemat() from Assembler.cpp to each backend

>diff -r 4c5914a48530 nanojit/NativeARM.cpp
>--- a/nanojit/NativeARM.cpp  Sun Apr 18 21:17:37 2010 -0700
>+++ b/nanojit/NativeARM.cpp  Mon Apr 19 12:03:46 2010 -0400
>@@ -1230,6 +1230,16 @@
>     }
> }
>
>+/**
>+ * these instructions don't have to be saved & reloaded to spill,
>+ * they can just be recalculated cheaply
>+ */
>+bool
>+Assembler::canRemat(LIns* ins)
>+{
>+    return ins->isImmAny() || ins->isop(LIR_alloc);
>+}
>+

Rather than repeating that comment once per backend, can you just put it once above the declaration in Assembler.h? A capital letter at the start and a full stop at the end wouldn't hurt either :)

r+ with that nit fixed.
Attachment #439942 -
Flags: review?(nnethercote) → review+
Comment 19 • 14 years ago

Comment on attachment 440008 [details] [diff] [review]
Remove unnecessary clearReg() calls from asm_restore()

Looks good, and passes trace-tests on TM.
Attachment #440008 -
Flags: review?(nnethercote) → review+
Reporter
Comment 20 • 14 years ago

(In reply to comment #18)
> Rather than repeating that comment once per backend, can you just put it once
> above the declaration in Assembler.h? A capital letter at the start and full
> stop at the end wouldn't hurt either :)

Agree, will fix before pushing.
Reporter
Comment 21 • 14 years ago

Comment on attachment 439942 [details] [diff] [review]
Moves canRemat() from Assembler.cpp to each backend

Pushed canRemat() refactoring patch.
NJ: http://hg.mozilla.org/projects/nanojit-central/rev/c12082c4c489
Reporter
Comment 22 • 14 years ago

Comment on attachment 440008 [details] [diff] [review]
Remove unnecessary clearReg() calls from asm_restore()

Pushed clearReg() cleanup.
NJ: http://hg.mozilla.org/projects/nanojit-central/rev/e89860f89d85
Reporter
Comment 23 • 14 years ago

Comment on attachment 439942 [details] [diff] [review]
Moves canRemat() from Assembler.cpp to each backend

TR: http://hg.mozilla.org/tamarin-redux/rev/5bb289b47a0e
Reporter
Comment 24 • 14 years ago

Comment on attachment 440008 [details] [diff] [review]
Remove unnecessary clearReg() calls from asm_restore()

TR: http://hg.mozilla.org/tamarin-redux/rev/9f165d31d470
Comment 25 • 14 years ago

TM: http://hg.mozilla.org/tracemonkey/rev/b5861f701077
TM: http://hg.mozilla.org/tracemonkey/rev/305ab44897e8
Reporter
Comment 26 • 14 years ago

Updated (combined) work in progress, with _nvprof cruft removed.

1. In CodegenLIR, use LIR_addp instead of LIR_orp for tagging pointers, since LIR_addp can use LEA.
2. In x86 & x64, rematerialize LIR_addp using LEA.
3. In ARM, do that too, and also handle sub/and/or/xor -- these are the instructions the ARM backend already supported with immediate const RHS operands.

On each CPU, this results in either no change or a net reduction in instructions and frame size, and performance changes that are either in the noise or slightly positive (on TR). The last thing I want to check is the impact on JIT speed. The new logic only runs when searching for a spillable register, but if that happens enough then it could hurt JIT speed. Tests with large code are interesting (md5? esc-compiling-itself?)
Attachment #439571 -
Attachment is obsolete: true
Reporter
Comment 27 • 14 years ago
I did some testing to see if the op(R,R) cases were worth handling, and they resulted in almost no hits. Given the results on bug 559949 that found that the majority of *all* binary instructions have a constant RHS, this finding makes sense.
Reporter
Updated • 14 years ago
Whiteboard: PACMAN
Comment 28 • 14 years ago
What does "PACMAN" in the whiteboard mean?
Reporter
Comment 29 • 14 years ago

We've got an optimization effort going; PACMAN means this bug is flagged as low-risk and possibly helping some programs. (Pac-Man goes for all the little nuggets and avoids the high-risk monsters.)
Reporter
Comment 30 • 14 years ago
Modifies the JIT frontend to use LIR_add instructions for pointer tagging, instead of LIR_or instructions. Adds can be folded with other adds, and also can be rematerialized with LEA on x86/x64, unlike LIR_or. There is a risk that tagging an already-tagged pointer now changes the value, whereas 'or' would have left it unchanged. However, doing such a thing would only be hiding a latent bug, would only work if the existing tag were the same one being or'd, and would be totally fragile in the face of changes to the Atom tag values. I reviewed the code carefully and I'm confident we aren't doing any such illegal re-tagging. The same concern applies if we change un-tagging from AND to SUB, which we *will* want to do at some point, because it allows field-offset calculations to be folded together with the untagging operation.
Attachment #440612 -
Attachment is obsolete: true
Attachment #441035 -
Flags: review?(rreitmai)
Attachment #441035 -
Flags: feedback?(wmaddox)
Reporter
Comment 31 • 14 years ago
The ARM backend already supported single-instruction folding of immediates into add/sub/and/or/xor instructions. This patch enables the same instructions to be rematerialized without spilling them.
Attachment #441037 -
Flags: review?(Jacob.Bramley)
Attachment #441037 -
Flags: feedback?(nnethercote)
Reporter
Comment 32 • 14 years ago

This similar patch for x86 and x64 deals only with adds. I'm showing slight improvements in code size and stack-frame size on TR, and some small speedups on some tests, barely outside the noise. No discernible slowdowns in compiler speed. I suggest this is a worthwhile patch for the code-size and stack-frame wins, as long as there isn't a JIT-speed slowdown (from checking the additional cases in canRemat()). I've tested TR JIT speed but not TM.
Attachment #441039 -
Flags: review?(nnethercote)
Comment 33 • 14 years ago
http://hg.mozilla.org/mozilla-central/rev/305ab44897e8
Status: ASSIGNED → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Updated • 14 years ago
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Comment 34 • 14 years ago

Comment on attachment 441037 [details] [diff] [review]
Rematerialize ALU+IMM operations on ARM

Ah, cool! Yep, the patch looks good to me.
Attachment #441037 -
Flags: review?(Jacob.Bramley) → review+
Reporter
Comment 35 • 14 years ago

* Updated Nativei386 and NativeX64 to avoid unnecessary clearReg() calls.
* Use isInRegMask(BaseRegs) on X64, to avoid using R12 illegally in an LEA instruction.
Attachment #441039 -
Attachment is obsolete: true
Attachment #441039 -
Flags: review?(nnethercote)
Reporter
Updated • 14 years ago
Attachment #441514 -
Flags: review?(nnethercote)
Updated • 14 years ago
Attachment #441035 -
Flags: review?(rreitmai) → review+
Updated • 14 years ago
Attachment #441035 -
Flags: feedback?(wmaddox) → feedback+
Comment 36 • 14 years ago

Comment on attachment 441514 [details] [diff] [review]
(v2) Rematerialize ADD+IMM operations using LEA on x64 and x86

>diff -r 6dba613b302f nanojit/NativeX64.cpp
>--- a/nanojit/NativeX64.cpp  Mon Apr 26 11:22:29 2010 -0400
>+++ b/nanojit/NativeX64.cpp  Mon Apr 26 12:42:50 2010 -0400
>@@ -503,6 +503,7 @@
>
>     void Assembler::LEARIP(R r, I32 d) { emitrm(X64_learip,r,d,(Register)0); asm_output("lea %s, %d(rip)",RQ(r),d); }
>
>+    void Assembler::LEARM(R r, I d, R b)  { emitrm(X64_learm,r,d,b); asm_output("leal %s, %d(%s)",RL(r),d,RL(b)); }
>     void Assembler::LEAQRM(R r, I d, R b) { emitrm(X64_leaqrm,r,d,b); asm_output("leaq %s, %d(%s)",RQ(r),d,RQ(b)); }
>     void Assembler::MOVLRM(R r, I d, R b) { emitrm(X64_movlrm,r,d,b); asm_output("movl %s, %d(%s)",RL(r),d,RQ(b)); }
>     void Assembler::MOVQRM(R r, I d, R b) { emitrm(X64_movqrm,r,d,b); asm_output("movq %s, %d(%s)",RQ(r),d,RQ(b)); }

Can you change LEARM to LEALRM to match MOVLRM, please?

>+    // return true if this is an instruction we can compute without setting CC's,
>+    // and without clobbering any input register. Only LEA fits those requirements.

That comment reads oddly. How about:

+    // Return true if we can generate code for this instruction that neither
+    // sets CCs nor clobbers any input register. LEA is the only native
+    // instruction that fits those requirements.

>+    bool canRematLEA(LIns* ins)
>+    {
>+        switch (ins->opcode()) {
>+        case LIR_addi:
>+            return ins->oprnd1()->isInRegMask(BaseRegs) && ins->oprnd2()->isImmI();
>+        case LIR_addq: {
>+            LIns* rhs;
>+            return ins->oprnd1()->isInRegMask(BaseRegs) &&
>+                   (rhs = ins->oprnd2())->isImmQ() &&
>+                   isS32(ins->oprnd2()->immQ());
>+        }

'rhs' is dead.

>+        // maybe sub?  R = subl/q rL, const  =>  leal/q R, [rL + -const]
>+        // maybe lsh?  R = lshl/q rL, 1/2/3  =>  leal/q R, [rL * 2/4/8]
>+        default:
>+            ;

Do you plan to do these or just leave this comment in as a placeholder?

>+        }
>+        return false;
>+    }

>diff -r 6dba613b302f nanojit/Nativei386.cpp
>--- a/nanojit/Nativei386.cpp  Mon Apr 26 11:22:29 2010 -0400
>+++ b/nanojit/Nativei386.cpp  Mon Apr 26 12:42:50 2010 -0400
>+
>+    // return true if this is an instruction we can compute without setting CC's,
>+    // and without clobbering any input register. Only LEA fits those requirements.

As above.

>+    bool canRematLEA(LIns* ins)
>+    {
>+        if (ins->isop(LIR_addi))
>+            return ins->oprnd1()->isInReg() && ins->oprnd2()->isImmI();
>+        // maybe sub?  R = subl rL, const  =>  leal R, [rL + -const]
>+        // maybe lsh?  R = lshl rL, 1/2/3  =>  leal R, [rL * 2/4/8]
>+        return false;

As above.

Passes trace-tests on TM, makes very slight improvements to codegen.
Attachment #441514 -
Flags: review?(nnethercote) → review+
Updated • 14 years ago
Attachment #441037 -
Flags: feedback?(nnethercote) → feedback+
Reporter
Comment 37 • 14 years ago

Comment on attachment 441035 [details] [diff] [review]
Use ADD instead of OR for pointer tagging, to aid better code-gen.

TR: pointer tagging with add instead of or:
http://hg.mozilla.org/tamarin-redux/rev/1ebae3634f69
Reporter
Comment 38 • 14 years ago

NJ: better rematerialization for ARM
http://hg.mozilla.org/projects/nanojit-central/rev/c5fca9078e37
Reporter
Comment 39 • 14 years ago

NJ: better rematerialization for i386 and x64
http://hg.mozilla.org/projects/nanojit-central/rev/e5db9525afbc
Reporter
Updated • 14 years ago
Whiteboard: PACMAN → PACMAN, fixed-in-nanojit
Reporter
Comment 40 • 14 years ago

TR:
http://hg.mozilla.org/tamarin-redux/rev/342639bd6c8a (ARM)
http://hg.mozilla.org/tamarin-redux/rev/12a643e24e00 (i386, x64)
Whiteboard: PACMAN, fixed-in-nanojit → PACMAN, fixed-in-nanojit, fixed-in-tamarin
Reporter
Updated • 14 years ago
Assignee: edwsmith → nobody
Comment 41 • 14 years ago

http://hg.mozilla.org/tracemonkey/rev/a653310f9fd4
http://hg.mozilla.org/tracemonkey/rev/d8425f0d1429
Whiteboard: PACMAN, fixed-in-nanojit, fixed-in-tamarin → PACMAN, fixed-in-nanojit, fixed-in-tamarin, fixed-in-tracemonkey
Comment 42 • 14 years ago

http://hg.mozilla.org/mozilla-central/rev/d8425f0d1429
Status: REOPENED → RESOLVED
Closed: 14 years ago → 14 years ago
Resolution: --- → FIXED
Assignee
Updated • 10 years ago
Product: Core → Core Graveyard