Closed Bug 553518 (opened 14 years ago, closed 14 years ago)

nanojit: avoid 'test r,r' where possible on i386

Categories: Core Graveyard :: Nanojit (defect)
Platform:   x86, macOS
Priority:   Not set
Severity:   normal

Tracking: (Not tracked)
Status:   RESOLVED FIXED

People: (Reporter: n.nethercote, Assigned: n.nethercote)

Whiteboard: fixed-in-nanojit, fixed-in-tracemonkey, fixed-in-tamarin

Attachments: (1 file, 2 obsolete files)

Attached patch patch (obsolete) — Splinter Review
On i386, we get this sequence a lot in TM as part of the array access code:

      or4 = or and12, eq13
  ......   or eax,edx
      eq12 = eq or4, 0       # codegen'd with the xt
      xt9: xt eq12 -> pc=...... imacpc=(nil) sp+32 rp+0 (GuardID=006)
  ......   test eax,eax
  ......   je    ......

Turns out that 'test eax,eax' is totally unnecessary, because the 'or' has
already set the condition codes appropriately.  We also see the same pattern
with 'and' instead of 'or'.
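
In other words, the transformation is simply this (registers as in the trace
above; the elided branch target is unchanged):

      ; before                       ; after
      or   eax,edx                   or   eax,edx
      test eax,eax   ; redundant
      je   ......                    je   ......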

This patch looks for this pattern and omits the 'test' when appropriate.  It
reduces the number of instructions executed by TM for SunSpider by about
0.2%, which isn't a lot but is still worth doing.  Timing changes are below
SunSpider's noise threshold.

The same idea could be applied to X64, but it currently doesn't even use
'test'.  I haven't converted it to do so because I still don't understand
the X64 backend's instruction encoding.
Attachment #433508 - Flags: review?(edwsmith)
I think this works as written, but it assumes the Assembler is reading directly from a LirReader right out of a LirBuffer.  In other words, the optimization assumes that lhsIsRightBeforeCond==true means the next LIns* returned by reader.read() will be lhs.  This is true if reader is a LirReader.

If the LirFilter passed into Assembler is not a LirReader and is doing anything "interesting" (reordering instructions, or reading from a LIns** array of possibly reordered instructions), then that assumption breaks with subtle failure modes.

That said, there are probably a bunch of nice peephole optimizations the backends could do that are hard with the restricted LirFilter behavior.  My favorite example is pattern matching of operands to use x86 addressing modes.  If the rhs of an arith op is a load, we could fold it; if the address of an indirect call is a load, we could fold that too.  Just looking at the LIns* expression tree is not enough, because folding a load into a later instruction effectively moves it down past any intervening stores.  Peeking at the next instruction would make it safe.

Maybe we should rename LirFilter.pos() as LirFilter.peek()?  If we only need one-instruction lookahead then it might be enough.
It's a shame to pass up an optimization now to guard against something we *might* do in the future.

idea 1: What if we put in some safeguards that make this kind of peephole checking safe?  E.g. add an assembler.expectedIns field, and assert in gen() if the next instruction is not the expected one.  In the patch, we'd set expectedIns = lhs, and in the main loop, add something like

NanoAssert(!expectedIns || expectedIns == ins);
expectedIns = 0;

idea 2: do the lookahead in the main loop in gen using reader.pos():

   for (LInsp ins = reader->read(); ...) {
       this->nextIns = reader->pos();

   // and in the x86 backend:
   if (lhs == this->nextIns) { /* do optimization */ }

Now the burden is on potential nontrivial LirFilter implementations to correctly implement a 1-instruction lookahead, which I think is a fair tradeoff.
> The same idea could be applied to X64, but it currently doesn't even use
> 'test'.  I haven't converted it to do so because I still don't understand
> the X64 backend's instruction encoding.

I will scrub and expand the comments if you give me an idea of where the clarification would be most helpful.

The way to read the constants is right-to-left, with the bytes left-justified.  For example (remember, instructions are little-endian):
                          msb            lsb,len
        X64_xorqr8  = 0x00F0834800000004LL,   // 64bit xor r ^= int64(imm8)

        bytes  (as stored):   00  F0 83  48   00 00 00   04
        read right-to-left:   04 | 00 00 00 | 48 | 83 F0 | 00
                              len    (pad)    rex  instr   imm8

Generally, the instructions are encoded conservatively and then trimmed by shrinking them when possible.  E.g. rexrb() inserts the high bits of the given registers into the REX byte, and if those bits are all zero (making the REX prefix optional) it drops the REX byte by subtracting 1 from the length byte in the instruction.
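
To make that concrete, here is a minimal sketch of the trimming idea (NOT the real nanojit code -- the function name and details are illustrative, following the byte layout described above):

    #include <stdint.h>

    // Sketch only: 'op' is one of the X64_* constants, with the instruction
    // bytes left-justified, the length in the low byte, and REX as the first
    // emitted byte.  rnum/bnum are the register numbers for reg and r/m.
    static uint64_t rexrb_sketch(uint64_t op, int rnum, int bnum) {
        int len   = (int)(op & 0xff);                // instruction length in bytes
        int shift = 64 - 8*len;                      // bit offset of the REX byte
        uint64_t rex = ((op >> shift) & 0xff)        // stored REX template
                     | (uint64_t)((rnum & 8) >> 1)   // REX.R <- high bit of reg
                     | (uint64_t)((bnum & 8) >> 3);  // REX.B <- high bit of r/m
        if (rex != 0x40)                             // REX carries information: keep it
            return (op & ~(0xffULL << shift)) | (rex << shift);
        // A bare 0x40 prefix is optional: zero that byte and shorten the
        // instruction by subtracting 1 from the length byte.
        return (op & ~(0xffULL << shift)) - 1;
    }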
(In reply to comment #3)
> 
> I will scrub and expand the comments if you give me an idea of where the
> clarification would be most helpful.

The code looks pretty good, it's the x86-64 encoding itself that gives me trouble :/  Once I work through it more I'll let you know if the comments can be improved.

As for your earlier comment, yes, I need to be more careful when checking what the previous generated native instruction was.  I'll think about your suggestions next week and see what I like best.  Thanks for identifying the risk!

BTW, I forgot to mention that this optimization was due to a suggestion from Julian.
Attached patch patch v2 (obsolete) — Splinter Review
(In reply to comment #2)

> idea 2: do the lookahead in the main loop in gen using reader.pos():
> 
>    for (LInsp ins = reader->read(); ...) {
>        this->nextIns = reader->pos();
> 
>    // and in the x86 backend:
>    if (lhs == this->nextIns) { /* do optimization */ }
> 
> Now the burden is on potential nontrivial LirFilter implementations to
> correctly implement a 1-instruction lookahead, which I think is a fair
> tradeoff.

The gen() loop can do the lookahead; the LirFilters don't have to do anything special.

I've done just this, although it turns out you need two instructions of lookahead for this optimisation, because the sequence involved is or/eq/xf.  It even gained me one more removed 'test' for SunSpider;  presumably it was missed by the old patch because the two instructions were on different code pages.
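
Concretely, the check in the i386 backend has roughly this shape (a sketch only -- predicate names approximate the LIR API, and lookahead[2] means two reader->read() calls ahead of the guard; since nanojit assembles backwards, that is the instruction whose native code lands immediately before the compare-and-branch):

    // Sketch only.  'cond' is the eq (compared against 0) that is being
    // folded into the guard's branch; lhs is its first operand.
    LIns* lhs = cond->oprnd1();
    LIns* rhs = cond->oprnd2();
    bool flagsAlreadySet =
        cond->isop(LIR_eq) && rhs->isconst() && rhs->imm32() == 0 &&
        (lhs->isop(LIR_or) || lhs->isop(LIR_and)) && lhs == lookahead[2];
    if (!flagsAlreadySet)
        TEST(r, r);    // otherwise the OR/AND has already set ZF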

The patch also:
- Changes pos() to lastIns(), for two reasons.  First, that's all it's ever
  used for: checking the last instruction in a buffer.  Second, it's wrong
  if you try to do more than that.  For example, if you use StackFilter you
  need to skip over some store instructions, but pos() doesn't do that.

- Moves the big gen() comment outside the loop, and adds a note about
  lookahead.

- Renames some 'i'/'_i' variables as 'ins'/'_ins' in LirReader.
Attachment #433508 - Attachment is obsolete: true
Attachment #433875 - Flags: feedback?(edwsmith)
Attachment #433508 - Flags: review?(edwsmith)
Hmm, there's a complication that my patch doesn't handle.  There may be dead instructions in the sequence, e.g.:

  or
  (dead instruction)
  eq
  xf

We want to ignore the dead instruction, but the attached patch doesn't do that;  lookahead2 will point to it.  But doing so is tricky because we can't make the is-it-dead? judgment until the last minute, because e.g. lookahead1 might be the only user of lookahead2.  The patch misses some optimisation opportunities for TM because of this.

I imagine this can be worked around but I won't have time to look at it again until tomorrow.
(In reply to comment #6)
> 
> I imagine this can be worked around

Hmm, maybe not.  We need to know if lookahead2 is dead in order to generate the code for lookahead0, but we don't know if lookahead2 is dead until the code for lookahead0 and lookahead1 has been generated.  I.e. a circular dependency.  I'll need to be more inventive...
Maybe moot, but... another thing you need to guarantee is that any spill stores or reloads that happen to intervene don't change the condition codes.
Attachment #433875 - Attachment is patch: true
Attachment #433875 - Attachment mime type: application/octet-stream → text/plain
another possibility, higher up the tricky/crazy-talk scale:

do the peephole optimization at the x86 level instead of LIR:
1. De-macroize the code that emits compares and other ALU ops.
2. Emit test r,r as normal.  Now NIns* points to that instruction.
3. In the emitter for ALU and CMP ops, look at NIns*.  If it is pointing to test r,r, then we know the Z flag is wanted and we can overwrite the test r,r instruction.

Step 3 might be a bit simpler if we remember the previous value of NIns and restore it.  A "Z-wanted" flag and a register field holding 'r' might avoid re-parsing the code at NIns, but I bet the assembler is more robust if we actually do re-parse the instruction.

Since I put it out there, I'll also argue with myself: arguably we're better off laying the groundwork for recognizing phrases of LIR than laying the groundwork for machine-level peephole optimizations.
(In reply to comment #9)
> 
> Since I put it out there, I'll also argue with myself: arguably we're better
> off laying the groundwork for recognizing phrases of LIR than laying the
> groundwork for machine-level peephole optimizations.

Yeah, I think so.  I've already experimented with using the lookahead to do memory operands (a la comment 1).  That's actually where I encountered the problem with dead instructions polluting the lookahead chain, rather than in the 'test r,r' case.
Thinking more about memory operands... imagine we have something like this:

   ld1 = ld x[16]         # rhs,  a.k.a. lookahead[2]
   ult1 = ult y, ld1      # cond, a.k.a. lookahead[1]

if 'ld1' isn't in a register (ie. it's not used after 'ult1') then it
should be a memory operand.  That's easy.

But TM sometimes generates code like this:
 
   ld1 = ld x[16]         # rhs,  a.k.a. lookahead[3]
   <dead non-store>       #              lookahead[2]
   ult1 = ult y, ld1      # cond, a.k.a. lookahead[1]

We can just use brute force and check lookahead[3] as well (assuming
lookahead[2] isn't a store).  Great!

Except sometimes TM also generates this:

   ld1 = ld x[16]         # rhs,  a.k.a. lookahead[3]
   gt1 = gt z, ld1        #              lookahead[2]
   ult1 = ult y, ld1      # cond, a.k.a. lookahead[1]

where 'gt1' gets used after 'ult1'.  In that case you don't want to use a
memory operand for 'ld1', because it's used multiple times.  Although it's
probably not that bad if you do.  I guess you could check whether
lookahead[2] uses 'ld1' and let that contribute to the decision.

(Right now I expect Julian to start piping up about LIR's lack of expression
trees.)
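
A sketch of that decision, with hypothetical helper names (indices as in the examples above, which need a lookahead of size 4; the is-the-load-already-in-a-register test from the easy case is elided):

    // Sketch only.  rhs is the load feeding the compare; lookahead[2] is the
    // instruction between the load and the compare, if any.
    bool canFoldLoad(LIns* rhs, LIns* const* lookahead) {
        if (rhs == lookahead[2])               // easy case: load is adjacent
            return true;
        return rhs == lookahead[3]
            && !lookahead[2]->isStore()        // a store could alias the load
            && lookahead[2]->oprnd1() != rhs   // a second use of the load would
            && lookahead[2]->oprnd2() != rhs;  //   rather keep it in a register
    }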

Furthermore, I'm not convinced it's worth going to much effort to use
memory operands.  Once you get into the guts of the microarchitecture, is
there really going to be much difference between 

  mov eax,(ebx)
  add ecx,eax

and

  add ecx,(ebx)

?

Of course, using the memory operand might avoid a register spill, but after
experimenting and handling a lot of the possible cases I managed to reduce
the number of spills for all of SunSpider by only 0.5%.  Whoop-de-doo.  This
is consistent with previous experiments I did, where removing one i386 GP reg
entirely resulted in a ~1% slow-down -- in other words, register pressure
isn't that high in TM.
Attached patch patch v3Splinter Review
This version generalizes the lookahead to N_LOOKAHEAD instructions (currently 3, which includes the current instruction).  It also fixes ReverseLister to not print the LIR_start instruction multiple times(!)

No memory operands yet, I think that's beyond the scope of this bug.  No X64 changes yet either.
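
The generalized lookahead in gen() has roughly this shape (a sketch; it assumes the reader keeps returning the LIR_start sentinel once exhausted, and gen1() is a stand-in for the per-instruction codegen):

    // Sketch only.  lookahead[0] is the instruction being generated;
    // lookahead[1..N_LOOKAHEAD-1] are the ones the reader will produce next.
    static const int N_LOOKAHEAD = 3;
    LInsp lookahead[N_LOOKAHEAD];
    for (int i = 0; i < N_LOOKAHEAD; i++)            // prime the window
        lookahead[i] = reader->read();
    while (!lookahead[0]->isop(LIR_start)) {
        gen1(lookahead[0]);                          // may consult lookahead[]
        for (int i = 0; i < N_LOOKAHEAD - 1; i++)    // shift the window down
            lookahead[i] = lookahead[i+1];
        lookahead[N_LOOKAHEAD - 1] = reader->read();
    }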
Attachment #433875 - Attachment is obsolete: true
Attachment #434164 - Flags: review?(edwsmith)
Attachment #433875 - Flags: feedback?(edwsmith)
>    ld1 = ld x[16]         # rhs,  a.k.a. lookahead[3]
>    gt1 = gt z, ld1        #              lookahead[2]
>    ult1 = ult y, ld1      # cond, a.k.a. lookahead[1]
> 
> where 'gt1' gets used after 'ult1'.  In that case you don't want to use a
> memory operand for 'ld1', because it's used multiple times.  Although it's
> probably not that bad if you do.  I guess you could check whether
> lookahead[2] uses 'ld1' and let that contribute to the decision.

A simple thing which seems to work well is to force all nodes whose
value is used more than once to be computed into a register.  This
effectively partitions the dag into a bunch of trees.  It does however
require a usage count for each (value) node, which AIUI nanojit
doesn't need to compute atm.


> Furthermore, I'm not convinced it's worth going to much effort to use
> memory operands.  Once you get into the guts of the microarchitecture, is
> there really going to be much difference between 
> 
>   mov eax,(ebx)
>   add ecx,eax
> 
> and
> 
>   add ecx,(ebx)

They probably do both produce 2 uops (load, add).  Maybe the secondary
effect of having smaller code is worthwhile though, considering that
at least some of the v8 benchmarks hammer the icache w/ TM.
Comment on attachment 434164 [details] [diff] [review]
patch v3

Looks sound.  Ad hoc testing on TR shows it does remove some null checks; a common example is code like this:

    var p:T = someArray[i]  // implicit downcast to type T
    p.f()   // implicit null check on p

generates code like this:

    p = [some call that returns a tagged pointer]
    call #downcast (p, #T)  // throws an exception on failure
    p2 = and p, -8  // remove the tag
*   test p2,p2
    jeq -> /* null pointer exception */

the test marked * will be removed because the peephole checker sees the add.

In the future, the code shape might change.  Removing the tag isn't strictly necessary, since we know what the tag value is (it is 1) and can fold the tag removal into the addressing modes that dereference p.  (That also requires the null-pointer check to become a compare like p == 1 or p < 4, which would still be fewer instructions but might not hit this optimization anymore.)

If the loop shifting the lookahead is hot, it might be cheaper to make the lookahead size == 4 and implement a queue with a pointer that wraps around.  But I doubt it's hot.
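
For what it's worth, the wrap-around queue might look like this (a sketch, not part of the patch):

    // Sketch only: with a power-of-two size, the per-instruction shift loop
    // becomes one store and a masked increment.
    static const int N = 4;        // must be a power of two
    LInsp ring[N];
    int head = 0;                  // ring[head] is the current instruction

    // Per iteration: consume the current slot, refill it with the next
    // unread instruction, then bump the head.  lookahead[k] is then
    // ring[(head + k) & (N - 1)].
    LInsp cur = ring[head];
    ring[head] = reader->read();
    head = (head + 1) & (N - 1);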
Attachment #434164 - Flags: review?(edwsmith) → review+
> the test marked * will be removed because the peephole checker sees the add.

er... s/add/and/
(In reply to comment #13)
> 
> A simple thing which seems to work well is to force all nodes whose
> value is used more than once to be computed into a register.  This
> effectively partitions the dag into a bunch of trees.

Well, yes...

> It does however
> require a usage count for each (value) node, which AIUI nanojit
> doesn't need to compute atm.

Exactly.  AFAICT computing this requires an extra pass, the cost of which would probably outweigh the benefits of a small number of memory operands.

> They probably do both produce 2 uops (load, add).  Maybe the secondary
> effect of having smaller code is worthwhile though, considering that
> at least some of the v8 benchmarks hammer the icache w/ TM.

This kind of NJ codegen improvement isn't going to fix that problem (see the bottom of comment 11).  I've done a few of them now but they all tend to do stuff like "avoid an extra mov in 1 case in 1000" and reduce total native code size by something like 0.1%.

So that is an argument for not spending more time on this.  (I possibly shouldn't have even bothered, although it's plausible that avoiding 'test' instructions is a bigger microarchitecture win than avoiding movs.)  I'll land the patch as-is and then probably not worry about memory operands.
http://hg.mozilla.org/tracemonkey/rev/d9d99f65a1b6
Whiteboard: fixed-in-nanojit → fixed-in-nanojit, fixed-in-tracemonkey
TR: http://hg.mozilla.org/tamarin-redux/rev/53ce5a07c026
Whiteboard: fixed-in-nanojit, fixed-in-tracemonkey → fixed-in-nanojit, fixed-in-tracemonkey, fixed-in-tamarin
http://hg.mozilla.org/mozilla-central/rev/d9d99f65a1b6
Status: ASSIGNED → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Product: Core → Core Graveyard