Bug 553518
Opened 14 years ago
Closed 14 years ago
nanojit: avoid 'test r,r' where possible on i386
Categories
(Core Graveyard :: Nanojit, defect)
Tracking
(Not tracked)
RESOLVED
FIXED
People
(Reporter: n.nethercote, Assigned: n.nethercote)
Details
(Whiteboard: fixed-in-nanojit, fixed-in-tracemonkey, fixed-in-tamarin)
Attachments
(1 file, 2 obsolete files)
patch, 18.88 KB (edwsmith: review+)
On i386, we get this sequence a lot in TM as part of the array access code:

    or4 = or and12, eq13                               or eax,edx
    eq12 = eq or4, 0        # codegen'd with the xt
    xt9: xt eq12 -> pc=...... imacpc=(nil) sp+32 rp+0 (GuardID=006)
                                                       test eax,eax
                                                       je ......

Turns out that 'test eax,eax' is totally unnecessary, because the 'or' has already set the condition codes appropriately. We also see the same pattern with 'and' instead of 'or'.

This patch looks for this pattern and omits the 'test' when appropriate. It reduces the number of instructions executed by TM for SunSpider by about 0.2%, which isn't a lot but is still worth doing. Timing changes are below SunSpider's noise threshold.

The same idea could be applied to X64, but it currently doesn't even use 'test'. I haven't converted it to do so because I still don't understand the X64 backend's instruction encoding.
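The condition for dropping the 'test' can be sketched like this. This is a minimal illustration with made-up names (`canSkipTest`, `LastAlu`, the `Op` enum are not nanojit's), not the actual patch: the instruction just emitted must be a flag-setting ALU op ('or' or 'and') whose destination is the same register the branch wants to test against zero.

```cpp
#include <cassert>

// Hypothetical sketch: decide whether the 'test r,r' before a
// conditional branch can be dropped because the instruction just
// emitted already set the Z flag for that register.
enum Op { OP_OR, OP_AND, OP_MOV, OP_LEA };

struct LastAlu {
    Op  op;      // last ALU op emitted
    int destReg; // its destination register
};

// On x86, OR and AND set ZF according to their result; MOV and LEA
// leave the flags untouched, so a 'test' would still be required.
bool canSkipTest(const LastAlu& last, int condReg) {
    bool setsFlags = (last.op == OP_OR || last.op == OP_AND);
    return setsFlags && last.destReg == condReg;
}
```

The MOV/LEA cases are included only to show why the check must be conservative: any intervening non-flag-setting instruction keeps the 'test' necessary.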
Attachment #433508 -
Flags: review?(edwsmith)
Comment 1•14 years ago
I think this works as written, but it assumes the Assembler is reading directly from a LirReader right out of a LirBuffer. In other words, the optimization assumes that lhsIsRightBeforeCond==true means the next LIns* returned by reader.read() will be lhs. This will be true if reader is a LirReader. If the LirFilter passed into Assembler is not a LirReader and is doing anything "interesting" (reordering instructions, or reading from a LIns** array of possibly reordered instructions), then that assumption breaks with subtle failure modes.

That said, there are probably a bunch of nice peephole optimizations the backends could do that are hard with the restricted LirFilter behavior. My favorite example is pattern matching of operands to use x86 addressing modes. If the rhs of an arith op is a load, we could fold it. Or, if the address of an indirect call is a load, we can fold it too. Just looking at the LIns* expression tree is not enough, because it can move stores down past loads. Peeking at the next instruction would make it safe.

Maybe we should rename LirFilter.pos() as LirFilter.peek()? If we only need one-instruction lookahead then it might be enough.
Comment 2•14 years ago
It's a shame to pass up an optimization now to guard against something we *might* do in the future.

Idea 1: what if we put in some safeguards that make this kind of peephole checking safe? E.g. add an assembler.expectedIns field, and assert in gen() if the next instruction is not the expected next instruction. In the patch, we'd set expectedIns = lhs, and in the main loop, add something like:

    NanoAssert(!expectedIns || expectedIns == ins);
    expectedIns = 0;

Idea 2: do the lookahead in the main loop in gen() using reader.pos():

    for (LInsp ins = reader->read(); ...) {
        this->nextIns = reader->pos();

    // and in the x86 backend:
    if (lhs == this->nextIns) { /* do optimization */ }

Now the burden is on potential nontrivial LirFilter implementations to correctly implement a 1-instruction lookahead, which I think is a fair tradeoff.
Comment 3•14 years ago
> The same idea could be applied to X64, but it currently doesn't even use
> 'test'. I haven't converted it to do so because I still don't understand
> the X64 backend's instruction encoding.
I will scrub and expand the comments if you give me an idea of where the clarification would be most helpful.
The way to read the constants is right-to-left, with the bytes left-justified. For example (remember, instructions are little-endian):

                 msb            lsb len
    X64_xorqr8 = 0x00F0834800000004LL,   // 64bit xor r ^= int64(imm8)

    bytes (as written):  00 F0 83 48 00 00 00 04
    bytes (as emitted):  04 00 00 00 48 83 F0 00
                         len          rex instr imm8
Generally, the instructions are encoded conservatively and then trimmed by shrinking them when possible. E.g. rexrb() inserts the high bit of the given registers into the REX byte, and if the high bits are all zero (making the REX byte optional) it drops it by subtracting 1 from the length byte in the instruction.
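That layout can be sketched as follows. The helper names (`insLen`, `insByte`) are hypothetical and the real emitter works differently, but the decoding of the packed constant matches the description above: the lowest byte is the length, and the instruction bytes sit above a 4-byte prefix, read low-to-high.

```cpp
#include <cassert>
#include <cstdint>

// Sketch: how a packed nanojit-x64 opcode constant is laid out.
const uint64_t X64_xorqr8 = 0x00F0834800000004LL; // xor r ^= int64(imm8)

// The instruction length lives in the least-significant byte.
int insLen(uint64_t op) {
    return (int)(op & 0xFF);
}

// Byte the CPU sees at position i (0-based), reading the constant
// little-endian and skipping the 4-byte length/padding prefix.
uint8_t insByte(uint64_t op, int i) {
    return (uint8_t)(op >> (8 * (4 + i)));
}
```

For X64_xorqr8 this yields a length of 4 and the byte sequence 48 (REX.W), 83 F0 (opcode + ModRM), plus the trailing imm8 slot.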
Assignee
Comment 4•14 years ago
(In reply to comment #3)
> I will scrub and expand the comments if you give me an idea of where the
> clarification would be most helpful.

The code looks pretty good; it's the x86-64 encoding itself that gives me trouble :/ Once I work through it more I'll let you know if the comments can be improved.

As for your earlier comment: yes, I need to be more careful when checking what the previous generated native instruction was. I'll think about your suggestions next week and see what I like best. Thanks for identifying the risk!

BTW, I forgot to mention that this optimization was due to a suggestion from Julian.
Assignee
Comment 5•14 years ago
(In reply to comment #2)
> idea 2: do the lookahead in the main loop in gen using reader.pos():
>
> for (LInsp ins = reader->read(); ...) {
>     this->nextIns = reader->pos();
>
> // and in the x86 backend:
> if (lhs == this->nextIns) { /* do optimization */ }
>
> Now the burden is on potential nontrivial LirFilter implementations to
> correctly implement a 1-instruction lookahead, which i think is a fair
> tradeoff.

The gen() loop can do the lookahead; the LirFilters don't have to do anything special. I've done just this, although it turns out you need two instructions of lookahead for this optimisation, because the sequence involved is or/eq/xf. It even gained me one more removed 'test' for SunSpider; presumably it was missed by the old patch because the two instructions were on different code pages.

The patch also:

- Changes pos() to lastIns(), for two reasons. First, that's all it's ever used for: checking the last instruction in a buffer. Second, it's wrong if you try to do more than that. For example, if you use StackFilter you need to skip over some store instructions, but pos() doesn't do that.
- Moves the big gen() comment outside the loop, and adds a note about lookahead.
- Renames some 'i'/'_i' variables as 'ins'/'_ins' in LirReader.
Attachment #433508 -
Attachment is obsolete: true
Attachment #433875 -
Flags: feedback?(edwsmith)
Attachment #433508 -
Flags: review?(edwsmith)
Assignee
Comment 6•14 years ago
Hmm, there's a complication that my patch doesn't handle. There may be dead instructions in the sequence, e.g.:

    or
    (dead instruction)
    eq
    xf

We want to ignore the dead instruction, but the attached patch doesn't do that; lookahead2 will point to it. But doing so is tricky because we can't make the is-it-dead? judgment until the last minute, because e.g. lookahead1 might be the only user of lookahead2. The patch is missing some optimisation opportunities for TM because of this. I imagine this can be worked around but I won't have time to look at it again until tomorrow.
Assignee
Comment 7•14 years ago
(In reply to comment #6)
> I imagine this can be worked around

Hmm, maybe not. We need to know if lookahead2 is dead in order to generate the code for lookahead0, but we don't know if lookahead2 is dead until the code for lookahead0 and lookahead1 is generated. I.e. a circular dependency. I'll need to be more inventive...
Comment 8•14 years ago
Maybe moot, but... another thing you need to guarantee is that any spill stores or reloads that happen to intervene don't change the condition codes.
Updated•14 years ago
Attachment #433875 -
Attachment is patch: true
Attachment #433875 -
Attachment mime type: application/octet-stream → text/plain
Comment 9•14 years ago
Another possibility, higher up the tricky/crazy-talk scale: do the peephole optimization at the x86 level instead of LIR:

1. de-macroize the code that emits compares and other ALU ops
2. emit test r,r as normal; now NIns* points to that instruction
3. in the emitter for ALU and CMP ops, look at NIns*. If it is pointing to test r,r, then we know the Z flag is wanted and we can overwrite the test r,r instruction.

Step 3 might be a bit simpler if we remember the previous value of NIns and restore it. A "Z-wanted" flag and a register field holding 'r' might avoid re-parsing the code at NIns, but I bet the assembler is more robust if we actually do re-parse the instruction.

Since I put it out there, I'll also argue with myself: arguably we're better off laying the groundwork for recognizing phrases of LIR than laying the groundwork for machine-level peephole optimizations.
Assignee
Comment 10•14 years ago
(In reply to comment #9)
> since i put it out there, i'll also argue with myself: arguably we're better
> off laying the groundwork for recognizing phrases of LIR than laying the
> groundwork for machine-level peephole optimizations.

Yeah, I think so. I've already experimented with using the lookahead to do memory operands (a la comment 1). That's actually where I encountered the problem with dead instructions polluting the lookahead chain, rather than in the 'test r,r' case.
Assignee
Comment 11•14 years ago
Thinking more about memory operands... imagine we have something like this:

    ld1  = ld x[16]      # rhs,  a.k.a. lookahead[2]
    ult1 = ult y, ld1    # cond, a.k.a. lookahead[1]

If 'ld1' isn't in a register (i.e. it's not used after 'ult1') then it should be a memory operand. That's easy. But TM sometimes generates code like this:

    ld1  = ld x[16]      # rhs,  a.k.a. lookahead[3]
    <dead non-store>     #       lookahead[2]
    ult1 = ult y, ld1    # cond, a.k.a. lookahead[1]

We can just use brute force and check lookahead[3] as well (assuming lookahead[2] isn't a store). Great! Except sometimes TM also generates this:

    ld1  = ld x[16]      # rhs,  a.k.a. lookahead[3]
    gt1  = gt z, ld1     #       lookahead[2]
    ult1 = ult y, ld1    # cond, a.k.a. lookahead[1]

where 'gt1' gets used after 'ult1'. In that case you don't want to use a memory operand for 'ld1', because it's used multiple times. Although it's probably not that bad if you do. I guess you could check whether lookahead[2] uses 'ld1' and let that contribute to the decision. (Right now I expect Julian to start piping up about LIR's lack of expression trees.)

Furthermore, I'm unconvinced whether it's worth going to much effort to use memory operands. Once you get into the guts of the microarchitecture, is there really going to be much difference between

    mov eax,(ebx)
    add ecx,eax

and

    add ecx,(ebx)

? Of course, using the memory operand might avoid a register spill, but after experimenting and getting a lot of the possible cases I managed to reduce the number of spills for all of SunSpider by 0.5%. Whoop-de-doo. This is consistent with previous experiments I did, where removing one i386 GP reg entirely resulted in ~1% slow-down -- in other words, register pressure isn't that high in TM.
Assignee
Comment 12•14 years ago
This version generalizes the lookahead to N_LOOKAHEAD instructions (currently 3, which includes the current instruction). It also fixes ReverseLister so that it doesn't print the LIR_start instruction multiple times(!) No memory operands yet; I think that's beyond the scope of this bug. No X64 changes yet either.
Attachment #433875 -
Attachment is obsolete: true
Attachment #434164 -
Flags: review?(edwsmith)
Attachment #433875 -
Flags: feedback?(edwsmith)
Comment 13•14 years ago
> ld1 = ld x[16] # rhs, a.k.a. lookahead[3]
> gt1 = gt z, ld1 # lookahead[2]
> ult1 = ult y, ld1 # cond, a.k.a. lookahead[1]
>
> where 'gt1' gets used after 'ult1'. In that case you don't want to use a
> memory operand for 'ld1', because it's used multiple times. Although it's
> probably not that bad if you do. I guess you could check whether
> lookahead[2] uses 'ld1' and let that contribute to the decision.

A simple thing which seems to work well is to force all nodes whose value is used more than once to be computed into a register. This effectively partitions the dag into a bunch of trees. It does however require a usage count for each (value) node, which AIUI nanojit doesn't need to compute atm.

> Furthermore, I'm unconvinced whether it's worth going to much effort to use
> memory operands. Once you get into the guts of the microarchitecture is
> there really going to be much difference between
>
> mov eax,(ebx)
> add ecx,eax
>
> and
>
> add ecx,(ebx)

They probably do both produce 2 uops (load, add). Maybe the secondary effect of having smaller code is worthwhile though, considering that at least some of the v8 benchmarks hammer the icache w/ TM.
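The use-counting idea above can be sketched like this (hypothetical types, not nanojit code): count how many times each value is used as an operand, and force any multi-use value into a register, which splits the DAG into trees whose interior nodes are all single-use and thus safely foldable into addressing modes.

```cpp
#include <cassert>
#include <map>
#include <vector>

// Sketch: each instruction has an id and a list of operand ids.
struct SimpleIns {
    int id;
    std::vector<int> operands;
};

// One extra pass over the code to count uses of each value node.
std::map<int, int> countUses(const std::vector<SimpleIns>& code) {
    std::map<int, int> uses;
    for (const SimpleIns& ins : code)
        for (int op : ins.operands)
            uses[op]++;
    return uses;
}

// Multi-use values must be computed into a register; single-use
// values are tree-internal and can be folded into the consumer.
bool mustUseRegister(const std::map<int, int>& uses, int id) {
    auto it = uses.find(id);
    return it != uses.end() && it->second > 1;
}
```

In the gt/ult example above, 'ld1' has two uses, so this rule would keep it in a register rather than folding it into a memory operand; this is the extra pass whose cost is weighed in comment 16.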
Comment 14•14 years ago
Comment on attachment 434164 [details] [diff] [review]
patch v3

Looks sound. Ad hoc testing on TR shows it does remove some null checks. A common example is code like this:

    var p:T = someArray[i]   // implicit downcast to type T
    p.f()                    // implicit null check on p

which generates code like this:

    p = [some call that returns a tagged pointer]
    call #downcast (p, #T)   // throws an exception on failure
    p2 = and p, -8           // remove the tag
  * test p2,p2
    jeq -> /* null pointer exception */

The test marked * will be removed because the peephole checker sees the add.

In the future, code shape might change. Removing the tag isn't strictly necessary since we know what the value is (it is 1) and can fold tag removal into addressing modes that dereference p. (Which also requires the null pointer check to be a compare p == 1 or p < 4, maybe, which will still be fewer instructions but might not hit this optimization anymore.)

If the loop-shifting lookahead is hot, it might be cheaper to make the lookahead size == 4 and implement a queue with a pointer that wraps around. But I doubt it's hot.
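The wrap-around queue suggestion might look roughly like this (a sketch with hypothetical names): with a power-of-two size, the index wraps with a mask, so advancing the window is one store and one masked increment instead of shifting every slot.

```cpp
#include <cassert>

// Sketch: lookahead queue of size 4 with a wrapping head pointer.
// at(0) is the current instruction, at(k) is k instructions ahead.
const int LOOKAHEAD_SIZE = 4;            // must be a power of two
const int LOOKAHEAD_MASK = LOOKAHEAD_SIZE - 1;

struct RingWindow {
    int buf[LOOKAHEAD_SIZE] = {0, 0, 0, 0};
    int head = 0;                        // index of the current slot

    int at(int k) const { return buf[(head + k) & LOOKAHEAD_MASK]; }

    // drop the current instruction and append the next one read;
    // the freed slot is reused for the newest entry
    void step(int nextIns) {
        buf[head] = nextIns;
        head = (head + 1) & LOOKAHEAD_MASK;
    }
};
```

Whether this beats the shifting loop matters only if the loop shows up in profiles; as noted above, it probably doesn't.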
Attachment #434164 -
Flags: review?(edwsmith) → review+
Comment 15•14 years ago
> the test marked * will be removed because the peephole checker sees the add.
er... s/add/and/
Assignee
Comment 16•14 years ago
(In reply to comment #13)
> A simple thing which seems to work well is to force all nodes whose
> value is used more than once to be computed into a register. This
> effectively partitions the dag into a bunch of trees.

Well, yes...

> It does however
> require a usage count for each (value) node, which AIUI nanojit
> doesn't need to compute atm.

Exactly. AFAICT computing this requires an extra pass, the cost of which would probably outweigh the benefits of a small number of memory operands.

> They probably do both produce 2 uops (load, add). Maybe the secondary
> effect of having smaller code are worthwhile though, considering that
> at least some of the v8 benchmarks hammer the icache w/ TM.

This kind of NJ codegen improvement isn't going to fix that problem (see the bottom of comment 11). I've done a few of them now, but they all tend to do stuff like "avoid an extra mov 1 case in 1000" and reduce total native code size by like 0.1%. So that is an argument for not spending more time on this. (I possibly shouldn't have even bothered, although it's plausible that avoiding 'test' instructions is a bigger microarchitecture win than avoiding movs.)

I'll land the patch as-is and then probably not worry about memory operands.
Assignee
Comment 17•14 years ago
http://hg.mozilla.org/projects/nanojit-central/rev/ec4d959e1cc9
Whiteboard: fixed-in-nanojit
Assignee
Comment 18•14 years ago
http://hg.mozilla.org/tracemonkey/rev/d9d99f65a1b6
Whiteboard: fixed-in-nanojit → fixed-in-nanojit, fixed-in-tracemonkey
Comment 19•14 years ago
TR: http://hg.mozilla.org/tamarin-redux/rev/53ce5a07c026
Whiteboard: fixed-in-nanojit, fixed-in-tracemonkey → fixed-in-nanojit, fixed-in-tracemonkey, fixed-in-tamarin
Comment 20•14 years ago
http://hg.mozilla.org/mozilla-central/rev/d9d99f65a1b6
Status: ASSIGNED → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Updated•10 years ago
Product: Core → Core Graveyard