Closed Bug 555255 Opened 10 years ago Closed 10 years ago

Support rematerializing ALU operations

Categories

(Core Graveyard :: Nanojit, defect)

defect
Not set

Tracking

(Not tracked)

RESOLVED FIXED
Future

People

(Reporter: edwsmith, Unassigned)

References

Details

(Whiteboard: PACMAN, fixed-in-nanojit, fixed-in-tamarin, fixed-in-tracemonkey)

Attachments

(5 files, 4 obsolete files)

Bug 542016 (removing LIR_addp) is one of several we have completed recently that boost the effectiveness of CSE.  As expected, this boosts register pressure.  In the case of addp, it was particularly acute due to having better CSE in very long basic blocks having this structure:


   ptr = ...
   p2 = ptr + 4
   ... = p2
   ...
   pN = ptr + 4*N
   ... = PN
   // and later in the same (long) block: 
   ... = p2
   ... = pN

(imagine many derived pointers).

this naturally causes lots more spilling and reloading of the values for p2..pN.

proposed solution: in examples like this, we end up in asm_restore doing a spill (which is a load, since we generate code bottom-up) for LIR_add(ptr, const) where ptr is already in a register.

in effect, if we're spilling a unary ALU op (add R, const is "unary" in this case from the register allocator's p.o.v), *and* if the input register is already in a register, then we can do better by generating the instruction instead of spilling.  (a 1-cycle ALU op should be cheaper than a load, not to mention the secondary effect of eliminating the store too).
Here's a working example for ARM.  In the long-basic-block example (N on the order of 50-100), it succeeds in cutting the stack size (and thus, spills) by half.

I'd like to get early feedback on the approach, but the patch has a couple of obvious problems:

* generality: is it smart to generalize this to all unary and pseudo-unary (reg op const) operations?  if so, this patch only handles add

* safety: asm_restore isn't allowed to modify condition codes, so on x86/64, maybe only add/sub should be supported.

* checking the register allocation status in canRemat(), and then counting on it remaining the same in asm_restore(), is fragile.

* priority.  should we first spill constants, then look for 1-cycle ALU ops? findVictim() right now treats all instructions equally if canRemat() == true.

I also want to try this on x86 on TR and TM to see what the impact is before getting too far with it.
Assignee: nobody → edwsmith
Status: NEW → ASSIGNED
Attachment #435227 - Flags: feedback?(nnethercote)
(In reply to comment #0)
> Bug 542016 (removing LIR_addp) is one of several we have completed recently

well, credit where due:  "we" as in "Nick"
(In reply to comment #1)
> 
> Here's a working example for ARM.  In the long-basic-block example (N on the
> order of 50-100), it succeeds in cutting the stack size (and thus, spills) by
> half.

It'd be interesting to know if it has benefits other than in the specific case you were looking at.

> * generality: is it smart to generalize this to all unary and pseudo-unary (reg
> op const) operations?  if so, this patch only handles add

My guess is yes.  In a modern machine a single ALU op is just about always going to be better than a spill/restore pair, right?

> * safety: asm_restore isn't allowed to modify condition codes, so on x86/64,
> maybe only add/sub should be supported.

Is that because 'lea' is the only x86 arithmetic instruction that doesn't set CCs?  (Can't remember off the top of my head.)
 
> * checking the register allocation status in canRemat(), and then counting on
> it remaining the same in asm_restore(), is fragile.

I've thought before that canRemat() should really be backend-specific, but never bothered to do it.  (Eg. on i386 we can rematerialize some LIR_param instructions but that's not in canRemat().)  Perhaps this would provide the motivation for that.  If that's done then keeping them in sync isn't too bad, esp. if there's a comment that they must be in sync.

> * priority.  should we first spill constants, then look for 1-cycle ALU ops?
> findVictim() right now treats all instructions equally if canRemat() == true.

Some 1-cycle ALU ops might be preferable to constants, esp. 64-bit constants on 32-bit platforms.  findVictim() already has some notion of priority, so it shouldn't be hard to integrate canRemat() with that.

> I also want to try this on x86 on TR and TM to see what the impact is before
> getting too far with it.

Definitely, but it seems like a good idea to me.


(In reply to comment #2)
> (In reply to comment #0)
> > Bug 542016 (removing LIR_addp) is one of several we have completed recently
> 
> well, credit where due:  "we" as in "Nick"

You wrote the patch!  I just said "yes please" :)
BTW, I don't like 'isAddConst()' as a name -- there's no indication that oprnd1 must be in a register.  Maybe 'isAddRegConst()'?
(In reply to comment #3)
> > well, credit where due:  "we" as in "Nick"
> 
> You wrote the patch!  I just said "yes please" :)

i was referring to the many CSE improvements, especially aliasing.  group hug.
(In reply to comment #3)
> My guess is yes.  In a modern machine a single ALU op is just about always
> going to be better than a spill/restore pair, right?

that's what i'm thinking

> Is that because 'lea' is the only x86 arithmetic instruction that doesn't set
> CCs?  (Can't remember off the top of my head.)

exactly.  although, lea can also handle left-shifts by 1/2/3 bits, so we could cover add/sub/lsh.  If asm_restore could take a hint about whether the CC's are live, then we could handle other cases.

(In reply to comment #4)
> BTW, I don't like 'isAddConst()' as a name -- there's no indication that oprnd1
> must be in a register.  Maybe 'isAddRegConst()'?

isAddConst() was just expedient for prototyping.  I'll take a crack at generalizing and pick a better name.
Attachment #435227 - Flags: feedback?(nnethercote)
Target Milestone: --- → Future
In the X64 backend, each of the rematerialize cases in asm_restore() call

    ins->clearReg();
    /* generate code */

I think the clearReg() call is redundant.  when asm_restore() returns, the assembler calls clearReg() anyway and deallocates the register.
If we could keep track of whether the CC's are "live" (whether its okay to clobber them), then asm_restore could handle many more cases on x86 and x64.  the general pattern would be:

if (is binary op && lhs in register && (rhs in register || rhs is constant) && can-clobber-cc's) {
   mov r, lhs
   alu-op r, rhs     or alu-op r, const
}

the only case we can handle with preserving CC's is LEA, which also happens to not need to clobber any registers (its intel's only 3-address-form instruction, afaik).
X64 stats, when addp and addl are rematerialized

avmshell-d  -Dnodebugger -Ojit -Dverifyonly main.swf

        code size   loads   sum(framesize)
        ---------   -----   --------------
before  1287K       16,211  238,896
after   1270K       14,908  229,952
x86-32 stats, same configuration & test

x86     loads   sum(framesize)
        -----   --------------
before  15,190  154,112
after   14,779  152,816
Moves canRemat() to each backend where it can be specialized.  Early results seem promising even if purely from a code-size / frame-size standpoint.  No performance numbers yet.
Attachment #435227 - Attachment is obsolete: true
ARM stats

avmshell-dd -Dnodebugger -Ojit -Dverifyonly main.swf

        loads   sum(framesize)
        -----   --------------
before  15,649  164,944
after   14,582  160,928
(In reply to comment #7)
> In the X64 backend, each of the rematerialize cases in asm_restore() call
> 
>     ins->clearReg();
>     /* generate code */
> 
> I think the clearReg() call is redundant.  when asm_restore() returns, the
> assembler calls clearReg() anyway and deallocates the register.

True.  The other backends do it too (at least I checked i386 and ARM and they both do, although the ARM backend uses the old deprecated_markAsClear()).
I took a look at the patch:

- I got compile warnings because there was no default case in the switch in canRematLEA()

- I got compile errors because _nvprof() shouldn't be used in Nativei386.cpp -- other uses are within "#ifdef PERFM".

- Codegen was barely changed for TM.  It appears that LIR_addl results are very rarely spilled, and of those that are, most don't have an immediate RHS.
Precursor to tailoring rematerialization code to each CPU.
Attachment #439942 - Flags: review?(nnethercote)
Depends on: 560300
This patch does two things, neither of which should affect generated code.

1. in case LIR_alloc in gen(), replace inlined code with a call to evict(), since evict() does exactly what the inlined code does.

2. in backends, remove ins->clearReg() or deprecated_markAsClear() calls from asm_restore(), since evict() takes care of the same thing as soon as asm_restore() returns.
Attachment #440008 - Flags: review?(nnethercote)
(In reply to comment #3)
> (In reply to comment #1)

> > * generality: is it smart to generalize this to all unary and pseudo-unary (reg
> > op const) operations?  if so, this patch only handles add
> 
> My guess is yes.  In a modern machine a single ALU op is just about always
> going to be better than a spill/restore pair, right?

I'm starting to experiment with the other cases ARM handles: add/sub/or/xor/and.  For TR i expect the common cases to be add (field calculations), or (pointer tagging), and and (tag removal).  

Its also worth noting that there's nothing special about binary-ops with constant RHS.  It could be a win to rematerialize ALU operations even if the left and right hand side are in other registers.  on RISC, thats 1 instruction.  on x86, thats two in general, altho maybe the cpu fuses them.  but even 2 instructions (mov dest,lsh; aluop dest, rhs) is probably preferrable to a spill/reload.
Summary: Support rematerializing unary operations → Support rematerializing ALU operations
Comment on attachment 439942 [details] [diff] [review]
Moves canRemat() from Assembler.cpp to each backend

>diff -r 4c5914a48530 nanojit/NativeARM.cpp
>--- a/nanojit/NativeARM.cpp	Sun Apr 18 21:17:37 2010 -0700
>+++ b/nanojit/NativeARM.cpp	Mon Apr 19 12:03:46 2010 -0400
>@@ -1230,6 +1230,16 @@
>     }
> }
> 
>+/**
>+ * these instructions don't have to be saved & reloaded to spill,
>+ * they can just be recalculated cheaply
>+ */
>+bool
>+Assembler::canRemat(LIns* ins)
>+{
>+    return ins->isImmAny() || ins->isop(LIR_alloc);
>+}
>+

Rather than repeating that comment once per backend, can you just put once it above the declaration in Assembler.h?  A capital letter at the start and full stop at the end wouldn't hurt either :)

r+ with that nit fixed.
Attachment #439942 - Flags: review?(nnethercote) → review+
Comment on attachment 440008 [details] [diff] [review]
Remove unnecessary clearReg() calls from asm_restore()

Looks good, and passes trace-tests on TM.
Attachment #440008 - Flags: review?(nnethercote) → review+
(In reply to comment #18)
> Rather than repeating that comment once per backend, can you just put once it
> above the declaration in Assembler.h?  A capital letter at the start and full
> stop at the end wouldn't hurt either :)

Agree, will fix before pushing.
Comment on attachment 439942 [details] [diff] [review]
Moves canRemat() from Assembler.cpp to each backend

pushed canRemat() refactoring patch

NJ: http://hg.mozilla.org/projects/nanojit-central/rev/c12082c4c489
Comment on attachment 440008 [details] [diff] [review]
Remove unnecessary clearReg() calls from asm_restore()

pushed clearReg() cleanup
NJ: http://hg.mozilla.org/projects/nanojit-central/rev/e89860f89d85
Comment on attachment 439942 [details] [diff] [review]
Moves canRemat() from Assembler.cpp to each backend

TR: http://hg.mozilla.org/tamarin-redux/rev/5bb289b47a0e
Comment on attachment 440008 [details] [diff] [review]
Remove unnecessary clearReg() calls from asm_restore()

TR: http://hg.mozilla.org/tamarin-redux/rev/9f165d31d470
Depends on: 560578
Updated (combined) work in progress, with _nvprof cruft removed.  

1. in CodegenLIR, use LIR_addp instead of LIR_orp for tagging pointers, since LIR_addp can use LEA.
2. in x86 & x64, remateraialize LIR_addp using LEA
3. in arm, do that too, and also handle sub/and/or/xor -- these are the instructions the ARM backend already supported with immediate const RHS operands.

On each cpu, this results in either no change, or a net reduction in instructions and frame size, and performance changes that are either in the noise or slightly positive.  (on TR).

The last thing I want to check is the impact on JIT speed.  The new logic only runs when searching for a spillable register, but if that happens enough then it could hurt jit speed.  tests with large code are interesting (md5? esc-compiling-itself?)
Attachment #439571 - Attachment is obsolete: true
I did some testing to see if the op(R,R) cases were worth handling, and they resulted in almost no hits.  Given the results on bug 559949 that found that the majority of *all* binary instructions have a constant RHS, this finding makes sense.
Blocks: 560966
Whiteboard: PACMAN
What does "PACMAN" in the whiteboard mean?
We've got an optimization effort going, PACMAN means this bug is flagged as low-risk and possibly helping some-programs.  (pacman is going for all the little nuggets and avoiding the high risk monsters)
Modifies the JIT frontend to use LIR_add instructions for pointer tagging, instead of LIR_or instructions.  Adds can be folded with other adds, and also can be rematerialized with LEA on x86/x64, unlike LIR_or.

There is a risk that tagging an already-tagged pointer now changes the value, whereas 'or' would have left it unchanged.  However, doing such a thing would only be hiding a latent bug, would only work if the existing tag were the same one being or'd, and would be totally fragile in the face of changes to the Atom tag values.  I reviewed the code carefully and I'm confident we aren't doing any such illegal re-tagging.

The same concern applies if we change un-tagging from AND to SUB, which we *will* want to do at some point, because it allows field-offset calculations to be folded together with the untagging operation.
Attachment #440612 - Attachment is obsolete: true
Attachment #441035 - Flags: review?(rreitmai)
Attachment #441035 - Flags: feedback?(wmaddox)
The ARM backend already supported single-instruction folding of immediates into add/sub/and/or/xor instructions.  This patch enables the same instructions to be rematerialized without spilling them.
Attachment #441037 - Flags: review?(Jacob.Bramley)
Attachment #441037 - Flags: feedback?(nnethercote)
Similar patch for x86 and x64 only deals with adds.  

I'm showing slight improvements in code size and stack-frame size on TR, and some small speedups on some tests, barely outside the noise.  No discernable slowdowns in compiler speed. 

I suggest this is a worthwhile patch for the code-size and stack-frame wins, as long as there isn't a jit-speed slowdown (from checking the additional cases in canRemat()).  I've tested TR jit speed but not TM.
Attachment #441039 - Flags: review?(nnethercote)
http://hg.mozilla.org/mozilla-central/rev/305ab44897e8
Status: ASSIGNED → RESOLVED
Closed: 10 years ago
Resolution: --- → FIXED
Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Comment on attachment 441037 [details] [diff] [review]
Rematerialize ALU+IMM operations on ARM

Ah, cool! Yep, the patch looks good to me.
Attachment #441037 - Flags: review?(Jacob.Bramley) → review+
* Updated Nativei386 and NativeX64 to avoid unnecessary clearReg() calls
* use isInRegMask(BaseRegs) on X64, to avoid using R12 illegally in an LEA instruction.
Attachment #441039 - Attachment is obsolete: true
Attachment #441039 - Flags: review?(nnethercote)
Attachment #441514 - Flags: review?(nnethercote)
Attachment #441035 - Flags: review?(rreitmai) → review+
Attachment #441035 - Flags: feedback?(wmaddox) → feedback+
Comment on attachment 441514 [details] [diff] [review]
(v2) Rematerialize ADD+IMM operations using LEA on x64 and x86

>diff -r 6dba613b302f nanojit/NativeX64.cpp
>--- a/nanojit/NativeX64.cpp	Mon Apr 26 11:22:29 2010 -0400
>+++ b/nanojit/NativeX64.cpp	Mon Apr 26 12:42:50 2010 -0400
>@@ -503,6 +503,7 @@
> 
>     void Assembler::LEARIP(R r, I32 d)          { emitrm(X64_learip,r,d,(Register)0); asm_output("lea %s, %d(rip)",RQ(r),d); }
> 
>+    void Assembler::LEARM(R r, I d, R b)        { emitrm(X64_learm,r,d,b); asm_output("leal %s, %d(%s)",RL(r),d,RL(b)); }
>     void Assembler::LEAQRM(R r, I d, R b)       { emitrm(X64_leaqrm,r,d,b); asm_output("leaq %s, %d(%s)",RQ(r),d,RQ(b)); }
>     void Assembler::MOVLRM(R r, I d, R b)       { emitrm(X64_movlrm,r,d,b); asm_output("movl %s, %d(%s)",RL(r),d,RQ(b)); }
>     void Assembler::MOVQRM(R r, I d, R b)       { emitrm(X64_movqrm,r,d,b); asm_output("movq %s, %d(%s)",RQ(r),d,RQ(b)); }

Can you change LEARM to LEALRM to match MOVLRM, please?


>+    // return true if this is an instruction we can compute without setting CC's,
>+    // and without clobbering any input register.  Only LEA fits those requirements.

That comment reads oddly.  How about:

+    // Return true if we can generate code for this instruction that neither sets CCs nor clobbers any input register.
+    // LEA is the only native instruction that fits those requirements.


>+    bool canRematLEA(LIns* ins)
>+    {
>+        switch (ins->opcode()) {
>+        case LIR_addi:
>+            return ins->oprnd1()->isInRegMask(BaseRegs) && ins->oprnd2()->isImmI();
>+        case LIR_addq: {
>+            LIns* rhs;
>+            return ins->oprnd1()->isInRegMask(BaseRegs) &&
>+                   (rhs = ins->oprnd2())->isImmQ() &&
>+                   isS32(ins->oprnd2()->immQ());
>+        }

'rhs' is dead.


>+        // maybe sub? R = subl/q rL, const  =>  leal/q R, [rL + -const]
>+        // maybe lsh? R = lshl/q rL, 1/2/3  =>  leal/q R, [rL * 2/4/8]
>+        default:
>+            ;

Do you plan to do these or just leave this comment in as a placeholder?

>+        }
>+        return false;
>+    }


>diff -r 6dba613b302f nanojit/Nativei386.cpp
>--- a/nanojit/Nativei386.cpp	Mon Apr 26 11:22:29 2010 -0400
>+++ b/nanojit/Nativei386.cpp	Mon Apr 26 12:42:50 2010 -0400
>+
>+    // return true if this is an instruction we can compute without setting CC's,
>+    // and without clobbering any input register.  Only LEA fits those requirements.

As above.


>+    bool canRematLEA(LIns* ins)
>+    {
>+        if (ins->isop(LIR_addi))
>+            return ins->oprnd1()->isInReg() && ins->oprnd2()->isImmI();
>+        // maybe sub? R = subl rL, const  =>  leal R, [rL + -const]
>+        // maybe lsh? R = lshl rL, 1/2/3  =>  leal R, [rL * 2/4/8]
>+        return false;

As above.

Passes trace-tests on TM, makes very slight improvements to codegen.
Attachment #441514 - Flags: review?(nnethercote) → review+
Attachment #441037 - Flags: feedback?(nnethercote) → feedback+
Comment on attachment 441035 [details] [diff] [review]
Use ADD instead of OR for pointer tagging, to aid better code-gen.

TR: pointer tagging with add instead of or:
http://hg.mozilla.org/tamarin-redux/rev/1ebae3634f69
NJ: better rematerialization for ARM
http://hg.mozilla.org/projects/nanojit-central/rev/c5fca9078e37
NJ: better rematerialization for i386 and x64
http://hg.mozilla.org/projects/nanojit-central/rev/e5db9525afbc
Whiteboard: PACMAN → PACMAN, fixed-in-nanojit
TR:
http://hg.mozilla.org/tamarin-redux/rev/342639bd6c8a  (ARM)
http://hg.mozilla.org/tamarin-redux/rev/12a643e24e00 (i386, x64)
Whiteboard: PACMAN, fixed-in-nanojit → PACMAN, fixed-in-nanojit, fixed-in-tamarin
Assignee: edwsmith → nobody
http://hg.mozilla.org/tracemonkey/rev/a653310f9fd4
http://hg.mozilla.org/tracemonkey/rev/d8425f0d1429
Whiteboard: PACMAN, fixed-in-nanojit, fixed-in-tamarin → PACMAN, fixed-in-nanojit, fixed-in-tamarin, fixed-in-tracemonkey
http://hg.mozilla.org/mozilla-central/rev/d8425f0d1429
Status: REOPENED → RESOLVED
Closed: 10 years ago10 years ago
Resolution: --- → FIXED
Product: Core → Core Graveyard
You need to log in before you can comment on or make changes to this bug.