Closed Bug 553518 (opened 14 years ago, closed 14 years ago)

nanojit: avoid 'test r,r' where possible on i386

Categories: Core Graveyard :: Nanojit (defect)
Platform:   x86, macOS
Priority:   Not set
Severity:   normal

Tracking: (Not tracked)
Status:   RESOLVED FIXED

People: (Reporter: n.nethercote, Assigned: n.nethercote)

Whiteboard: fixed-in-nanojit, fixed-in-tracemonkey, fixed-in-tamarin

Attachments: (1 file, 2 obsolete files)

Attached patch patch (obsolete) — Splinter Review
On i386, we get this sequence a lot in TM as part of the array access code:

      or4 = or and12, eq13
  ......   or eax,edx
      eq12 = eq or4, 0       # codegen'd with the xt
      xt9: xt eq12 -> pc=...... imacpc=(nil) sp+32 rp+0 (GuardID=006)
  ......   test eax,eax
  ......   je    ......

Turns out that 'test eax,eax' is totally unnecessary, because the 'or' has
already set the condition codes appropriately.  We also see the same pattern
with 'and' instead of 'or'.
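
In other words, the transformation is simply this (registers as in the trace
above; the elided branch target is unchanged):

      ; before                       ; after
      or   eax,edx                   or   eax,edx
      test eax,eax   ; redundant
      je   ......                    je   ......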

This patch looks for this pattern and omits the 'test' when appropriate.  It
reduces the number of instructions executed by TM for SunSpider by about
0.2%, which isn't a lot but is still worth doing.  Timing changes are below
SunSpider's noise threshold.

The same idea could be applied to X64, but it currently doesn't even use
'test'.  I haven't converted it to do so because I still don't understand
the X64 backend's instruction encoding.
Attachment #433508 - Flags: review?(edwsmith)
I think this works as written, but it assumes the Assembler is reading directly from a LirReader right out of a LirBuffer.  In other words, the optimization assumes that lhsIsRightBeforeCond==true means the next LIns* returned by reader.read() will be lhs.  This is true if reader is a LirReader.

If the LirFilter passed into Assembler is not a LirReader and is doing anything "interesting" (reordering instructions, or reading from a LIns** array of possibly reordered instructions), then that assumption breaks with subtle failure modes.

That said, there are probably a bunch of nice peephole optimizations the backends could do that are hard with the restricted LirFilter behavior.  My favorite example is pattern matching of operands to use x86 addressing modes.  If the rhs of an arith op is a load, we could fold it; if the address of an indirect call is a load, we could fold that too.  Just looking at the LIns* expression tree is not enough, because folding a load into a later instruction effectively moves it down past any intervening stores.  Peeking at the next instruction would make it safe.

Maybe we should rename LirFilter.pos() as LirFilter.peek()?  If we only need one-instruction lookahead then it might be enough.
It's a shame to pass up an optimization now to guard against something we *might* do in the future.

idea 1: What if we put in some safeguards that make this kind of peephole checking safe?  E.g. add an assembler.expectedIns field, and assert in gen() if the next instruction is not the expected one.  In the patch, we'd set expectedIns = lhs, and in the main loop, add something like

NanoAssert(!expectedIns || expectedIns == ins);
expectedIns = 0;

idea 2: do the lookahead in the main loop in gen using reader.pos():

   for (LInsp ins = reader->read(); ...) {
       this->nextIns = reader->pos();

   // and in the x86 backend:
   if (lhs == this->nextIns) { /* do optimization */ }

Now the burden is on potential nontrivial LirFilter implementations to correctly implement a 1-instruction lookahead, which I think is a fair tradeoff.
> The same idea could be applied to X64, but it currently doesn't even use
> 'test'.  I haven't converted it to do so because I still don't understand
> the X64 backend's instruction encoding.

I will scrub and expand the comments if you give me an idea of where the clarification would be most helpful.

The way to read the constants is right-to-left, with the bytes left-justified.  For example (remember, instructions are little-endian):
                          msb            lsb,len
        X64_xorqr8  = 0x00F0834800000004LL,   // 64bit xor r ^= int64(imm8)

        bytes  (as stored):   00  F0 83  48   00 00 00   04
        read right-to-left:   04 | 00 00 00 | 48 | 83 F0 | 00
                              len    (pad)    rex  instr   imm8

Generally, the instructions are encoded conservatively and then trimmed by shrinking them when possible.  E.g. rexrb() inserts the high bits of the given registers into the REX byte, and if those bits are all zero (making the REX prefix optional) it drops the REX byte by subtracting 1 from the length byte in the instruction.
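
To make that concrete, here is a minimal sketch of the trimming idea (NOT the real nanojit code -- the function name and details are illustrative, following the byte layout described above):

    #include <stdint.h>

    // Sketch only: 'op' is one of the X64_* constants, with the instruction
    // bytes left-justified, the length in the low byte, and REX as the first
    // emitted byte.  rnum/bnum are the register numbers for reg and r/m.
    static uint64_t rexrb_sketch(uint64_t op, int rnum, int bnum) {
        int len   = (int)(op & 0xff);                // instruction length in bytes
        int shift = 64 - 8*len;                      // bit offset of the REX byte
        uint64_t rex = ((op >> shift) & 0xff)        // stored REX template
                     | (uint64_t)((rnum & 8) >> 1)   // REX.R <- high bit of reg
                     | (uint64_t)((bnum & 8) >> 3);  // REX.B <- high bit of r/m
        if (rex != 0x40)                             // REX carries information: keep it
            return (op & ~(0xffULL << shift)) | (rex << shift);
        // A bare 0x40 prefix is optional: zero that byte and shorten the
        // instruction by subtracting 1 from the length byte.
        return (op & ~(0xffULL << shift)) - 1;
    }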
(In reply to comment #3)
> 
> I will scrub and expand the comments if you give me an idea of where the
> clarification would be most helpful.

The code looks pretty good, it's the x86-64 encoding itself that gives me trouble :/  Once I work through it more I'll let you know if the comments can be improved.

As for your earlier comment, yes, I need to be more careful when checking what the previous generated native instruction was.  I'll think about your suggestions next week and see what I like best.  Thanks for identifying the risk!

BTW, I forgot to mention that this optimization was due to a suggestion from Julian.
Attached patch patch v2 (obsolete) — Splinter Review
(In reply to comment #2)

> idea 2: do the lookahead in the main loop in gen using reader.pos():
> 
>    for (LInsp ins = reader->read(); ...) {
>        this->nextIns = reader->pos();
> 
>    // and in the x86 backend:
>    if (lhs == this->nextIns) { /* do optimization */ }
> 
> Now the burden is on potential nontrivial LirFilter implementations to
> correctly implement a 1-instruction lookahead, which I think is a fair
> tradeoff.

The gen() loop can do the lookahead; the LirFilters don't have to do anything special.

I've done just this, although it turns out you need two instructions of lookahead for this optimisation, because the sequence involved is or/eq/xf.  It even gained me one more removed 'test' for SunSpider;  presumably it was missed by the old patch because the two instructions were on different code pages.
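
Concretely, the check in the i386 backend has roughly this shape (a sketch only -- predicate names approximate the LIR API, and lookahead[2] means two reader->read() calls ahead of the guard; since nanojit assembles backwards, that is the instruction whose native code lands immediately before the compare-and-branch):

    // Sketch only.  'cond' is the eq (compared against 0) that is being
    // folded into the guard's branch; lhs is its first operand.
    LIns* lhs = cond->oprnd1();
    LIns* rhs = cond->oprnd2();
    bool flagsAlreadySet =
        cond->isop(LIR_eq) && rhs->isconst() && rhs->imm32() == 0 &&
        (lhs->isop(LIR_or) || lhs->isop(LIR_and)) && lhs == lookahead[2];
    if (!flagsAlreadySet)
        TEST(r, r);    // otherwise the OR/AND has already set ZF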

The patch also:
- Changes pos() to lastIns(), for two reasons.  First, that's all it's ever
  used for: checking the last instruction in a buffer.  Second, it's wrong
  if you try to do more than that.  For example, if you use StackFilter you
  need to skip over some store instructions, but pos() doesn't do that.

- Moves the big gen() comment outside the loop, and adds a note about
  lookahead.

- Renames some 'i'/'_i' variables as 'ins'/'_ins' in LirReader.
Attachment #433508 - Attachment is obsolete: true
Attachment #433875 - Flags: feedback?(edwsmith)
Attachment #433508 - Flags: review?(edwsmith)
Hmm, there's a complication that my patch doesn't handle.  There may be dead instructions in the sequence, e.g.:

  or
  (dead instruction)
  eq
  xf

We want to ignore the dead instruction, but the attached patch doesn't do that;  lookahead2 will point to it.  But doing so is tricky because we can't make the is-it-dead? judgment until the last minute, because e.g. lookahead1 might be the only user of lookahead2.  The patch misses some optimisation opportunities for TM because of this.

I imagine this can be worked around but I won't have time to look at it again until tomorrow.
(In reply to comment #6)
> 
> I imagine this can be worked around

Hmm, maybe not.  We need to know if lookahead2 is dead in order to generate the code for lookahead0, but we don't know if lookahead2 is dead until the code for lookahead0 and lookahead1 has been generated.  I.e. a circular dependency.  I'll need to be more inventive...
Maybe moot, but... another thing you need to guarantee is that any spill stores or reloads that happen to intervene don't change the condition codes.
Attachment #433875 - Attachment is patch: true
Attachment #433875 - Attachment mime type: application/octet-stream → text/plain
another possibility, higher up the tricky/crazy-talk scale:

do the peephole optimization at the x86 level instead of LIR:
1. De-macroize the code that emits compares and other ALU ops.
2. Emit test r,r as normal.  Now NIns* points to that instruction.
3. In the emitter for ALU and CMP ops, look at NIns*.  If it is pointing to test r,r, then we know the Z flag is wanted and we can overwrite the test r,r instruction.

Step 3 might be a bit simpler if we remember the previous value of NIns and restore it.  A "Z-wanted" flag and a register field holding 'r' might avoid re-parsing the code at NIns, but I bet the assembler is more robust if we actually do re-parse the instruction.

Since I put it out there, I'll also argue with myself: arguably we're better off laying the groundwork for recognizing phrases of LIR than laying the groundwork for machine-level peephole optimizations.
(In reply to comment #9)
> 
> Since I put it out there, I'll also argue with myself: arguably we're better
> off laying the groundwork for recognizing phrases of LIR than laying the
> groundwork for machine-level peephole optimizations.

Yeah, I think so.  I've already experimented with using the lookahead to do memory operands (a la comment 1).  That's actually where I encountered the problem with dead instructions polluting the lookahead chain, rather than in the 'test r,r' case.
Thinking more about memory operands... imagine we have something like this:

   ld1 = ld x[16]         # rhs,  a.k.a. lookahead[2]
   ult1 = ult y, ld1      # cond, a.k.a. lookahead[1]

if 'ld1' isn't in a register (ie. it's not used after 'ult1') then it
should be a memory operand.  That's easy.

But TM sometimes generates code like this:
 
   ld1 = ld x[16]         # rhs,  a.k.a. lookahead[3]
   <dead non-store>       #              lookahead[2]
   ult1 = ult y, ld1      # cond, a.k.a. lookahead[1]

We can just use brute force and check lookahead[3] as well (assuming
lookahead[2] isn't a store).  Great!

Except sometimes TM also generates this:

   ld1 = ld x[16]         # rhs,  a.k.a. lookahead[3]
   gt1 = gt z, ld1        #              lookahead[2]
   ult1 = ult y, ld1      # cond, a.k.a. lookahead[1]

where 'gt1' gets used after 'ult1'.  In that case you don't want to use a
memory operand for 'ld1', because it's used multiple times.  Although it's
probably not that bad if you do.  I guess you could check whether
lookahead[2] uses 'ld1' and let that contribute to the decision.

(Right now I expect Julian to start piping up about LIR's lack of expression
trees.)
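
A sketch of that decision, with hypothetical helper names (indices as in the examples above, which need a lookahead of size 4; the is-the-load-already-in-a-register test from the easy case is elided):

    // Sketch only.  rhs is the load feeding the compare; lookahead[2] is the
    // instruction between the load and the compare, if any.
    bool canFoldLoad(LIns* rhs, LIns* const* lookahead) {
        if (rhs == lookahead[2])               // easy case: load is adjacent
            return true;
        return rhs == lookahead[3]
            && !lookahead[2]->isStore()        // a store could alias the load
            && lookahead[2]->oprnd1() != rhs   // a second use of the load would
            && lookahead[2]->oprnd2() != rhs;  //   rather keep it in a register
    }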

Furthermore, I'm not convinced it's worth going to much effort to use
memory operands.  Once you get into the guts of the microarchitecture, is
there really going to be much difference between 

  mov eax,(ebx)
  add ecx,eax

and

  add ecx,(ebx)

?

Of course, using the memory operand might avoid a register spill, but after
experimenting and handling a lot of the possible cases I managed to reduce
the number of spills for all of SunSpider by only 0.5%.  Whoop-de-doo.  This
is consistent with previous experiments I did, where removing one i386 GP reg
entirely resulted in a ~1% slow-down -- in other words, register pressure
isn't that high in TM.
Attached patch patch v3Splinter Review
This version generalizes the lookahead to N_LOOKAHEAD instructions (currently 3, which includes the current instruction).  It also fixes ReverseLister to not print the LIR_start instruction multiple times(!)

No memory operands yet, I think that's beyond the scope of this bug.  No X64 changes yet either.
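
The generalized lookahead in gen() has roughly this shape (a sketch; it assumes the reader keeps returning the LIR_start sentinel once exhausted, and gen1() is a stand-in for the per-instruction codegen):

    // Sketch only.  lookahead[0] is the instruction being generated;
    // lookahead[1..N_LOOKAHEAD-1] are the ones the reader will produce next.
    static const int N_LOOKAHEAD = 3;
    LInsp lookahead[N_LOOKAHEAD];
    for (int i = 0; i < N_LOOKAHEAD; i++)            // prime the window
        lookahead[i] = reader->read();
    while (!lookahead[0]->isop(LIR_start)) {
        gen1(lookahead[0]);                          // may consult lookahead[]
        for (int i = 0; i < N_LOOKAHEAD - 1; i++)    // shift the window down
            lookahead[i] = lookahead[i+1];
        lookahead[N_LOOKAHEAD - 1] = reader->read();
    }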
Attachment #433875 - Attachment is obsolete: true
Attachment #434164 - Flags: review?(edwsmith)
Attachment #433875 - Flags: feedback?(edwsmith)
>    ld1 = ld x[16]         # rhs,  a.k.a. lookahead[3]
>    gt1 = gt z, ld1        #              lookahead[2]
>    ult1 = ult y, ld1      # cond, a.k.a. lookahead[1]
> 
> where 'gt1' gets used after 'ult1'.  In that case you don't want to use a
> memory operand for 'ld1', because it's used multiple times.  Although it's
> probably not that bad if you do.  I guess you could check whether
> lookahead[2] uses 'ld1' and let that contribute to the decision.

A simple thing which seems to work well is to force all nodes whose
value is used more than once to be computed into a register.  This
effectively partitions the dag into a bunch of trees.  It does however
require a usage count for each (value) node, which AIUI nanojit
doesn't need to compute atm.


> Furthermore, I'm not convinced it's worth going to much effort to use
> memory operands.  Once you get into the guts of the microarchitecture, is
> there really going to be much difference between 
> 
>   mov eax,(ebx)
>   add ecx,eax
> 
> and
> 
>   add ecx,(ebx)

They probably do both produce 2 uops (load, add).  Maybe the secondary
effect of having smaller code is worthwhile though, considering that
at least some of the v8 benchmarks hammer the icache w/ TM.
Comment on attachment 434164 [details] [diff] [review]
patch v3

Looks sound.  Ad hoc testing on TR shows it does remove some null checks; a common example is code like this:

    var p:T = someArray[i]  // implicit downcast to type T
    p.f()   // implicit null check on p

generates code like this:

    p = [some call that returns a tagged pointer]
    call #downcast (p, #T)  // throws an exception on failure
    p2 = and p, -8  // remove the tag
*   test p2,p2
    jeq -> /* null pointer exception */

the test marked * will be removed because the peephole checker sees the add.

In the future, the code shape might change.  Removing the tag isn't strictly necessary, since we know what the tag value is (it is 1) and can fold the tag removal into the addressing modes that dereference p.  (That also requires the null-pointer check to become a compare like p == 1 or p < 4, which would still be fewer instructions but might not hit this optimization anymore.)

If the loop shifting the lookahead is hot, it might be cheaper to make the lookahead size == 4 and implement a queue with a pointer that wraps around.  But I doubt it's hot.
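
For what it's worth, the wrap-around queue might look like this (a sketch, not part of the patch):

    // Sketch only: with a power-of-two size, the per-instruction shift loop
    // becomes one store and a masked increment.
    static const int N = 4;        // must be a power of two
    LInsp ring[N];
    int head = 0;                  // ring[head] is the current instruction

    // Per iteration: consume the current slot, refill it with the next
    // unread instruction, then bump the head.  lookahead[k] is then
    // ring[(head + k) & (N - 1)].
    LInsp cur = ring[head];
    ring[head] = reader->read();
    head = (head + 1) & (N - 1);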
Attachment #434164 - Flags: review?(edwsmith) → review+
> the test marked * will be removed because the peephole checker sees the add.

er... s/add/and/
(In reply to comment #13)
> 
> A simple thing which seems to work well is to force all nodes whose
> value is used more than once to be computed into a register.  This
> effectively partitions the dag into a bunch of trees.

Well, yes...

> It does however
> require a usage count for each (value) node, which AIUI nanojit
> doesn't need to compute atm.

Exactly.  AFAICT computing this requires an extra pass, the cost of which would probably outweigh the benefits of a small number of memory operands.

> They probably do both produce 2 uops (load, add).  Maybe the secondary
> effect of having smaller code is worthwhile though, considering that
> at least some of the v8 benchmarks hammer the icache w/ TM.

This kind of NJ codegen improvement isn't going to fix that problem (see the bottom of comment 11).  I've done a few of them now but they all tend to do stuff like "avoid an extra mov in 1 case in 1000" and reduce total native code size by something like 0.1%.

So that is an argument for not spending more time on this.  (I possibly shouldn't have even bothered, although it's plausible that avoiding 'test' instructions is a bigger microarchitecture win than avoiding movs.)  I'll land the patch as-is and then probably not worry about memory operands.
http://hg.mozilla.org/tracemonkey/rev/d9d99f65a1b6
Whiteboard: fixed-in-nanojit → fixed-in-nanojit, fixed-in-tracemonkey
TR: http://hg.mozilla.org/tamarin-redux/rev/53ce5a07c026
Whiteboard: fixed-in-nanojit, fixed-in-tracemonkey → fixed-in-nanojit, fixed-in-tracemonkey, fixed-in-tamarin
http://hg.mozilla.org/mozilla-central/rev/d9d99f65a1b6
Status: ASSIGNED → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Product: Core → Core Graveyard