Closed Bug 559949 Opened 15 years ago Closed 14 years ago

nanojit: encode immediate forms compactly

Categories

(Core Graveyard :: Nanojit, defect)

Type: defect
Priority: Not set
Severity: minor

Tracking

(Not tracked)

Status: RESOLVED WONTFIX
Target Milestone: Future

People

(Reporter: n.nethercote, Assigned: n.nethercote)

References

Details

(Whiteboard: PACMAN)

Attachments

(3 files)

Binary operations with an immediate as the second operand are very common in LIR: comparisons against zero, bitwise-AND with a constant mask, shifts by a constant amount, and so on. Furthermore, ExprFilter pushes constants on the LHS to the RHS, and all(?) the backends treat these immediate forms specially. Encoding immediate forms more compactly would have multiple benefits:

- Fewer explicit immediates in the LIR would reduce its size, sometimes reducing the number of code chunk allocations required and reducing cache stress and page faults.
- Fewer instructions would go through the pipeline, saving time, especially in CseFilter, for which immediates are a hot case.

The downside is that it would introduce representational redundancies. This is mitigated by the fact that just about everywhere that touches the second operand first checks whether it's a constant and treats that case specially, so in practice I don't think this will be much of a burden.

As for how best to do it, my initial thought was to add new opcodes: addli, subli, etc. I started implementing that but after a while concluded that there is another way that might be better: introduce a new type, LInspOrImm, meaning "LInsp or immediate". (Suggestions for a better name are welcome.) This would be a word-sized value: if the bottom bit is zero it's an LInsp; if the bottom bit is one, the top 31/63 bits are an immediate. So only very large immediates could not be encoded. The representation would of course be hidden behind some inline accessor functions. LInspOrImm would only be used as the second operand of the relevant operations; most existing LInsp fields would stay the same (e.g. the first operand of these binary operations). I'm not certain this is better than explicit opcodes, but it avoids the need for many new opcodes, which is nice.
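A rough sketch of what that could look like, assuming LIns objects are at least 2-byte aligned so pointers always have a clear low bit (LInspOrImm and its accessors are just the proposal above, not existing nanojit code):

  #include <stdint.h>

  class LIns;   // nanojit's instruction type

  // Word-sized operand: low bit clear => LIns*, low bit set => immediate
  // stored in the upper 31/63 bits.  Immediates too big to fit would still
  // need an explicit immediate instruction.
  class LInspOrImm {
      uintptr_t bits;
  public:
      static bool fitsImm(intptr_t imm) {
          return INTPTR_MIN/2 <= imm && imm <= INTPTR_MAX/2;
      }
      static LInspOrImm fromIns(LIns* ins) {
          LInspOrImm r; r.bits = uintptr_t(ins); return r;            // low bit is 0
      }
      static LInspOrImm fromImm(intptr_t imm) {
          LInspOrImm r; r.bits = (uintptr_t(imm) << 1) | 1; return r;
      }
      bool     isImm() const { return (bits & 1) != 0; }              // a single bit-test
      intptr_t imm()   const { return intptr_t(bits) >> 1; }          // arithmetic shift restores the sign
      LIns*    ins()   const { return (LIns*)bits; }
  };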
Another slight disadvantage is that it will make LIR dumps harder to read, as it won't be possible to name constants that are encoded in immediate form. In better news, I did a quick analysis and found that for TM on SS and V8, for almost all benchmarks between 70% and 90% of binary operations have a constant on their RHS(!) So the potential effects of compaction are promising.
(In reply to comment #1)
> Another slight disadvantage is that it will make LIR dumps harder to read, as
> it won't be possible to name constants that are encoded in immediate form.

We normally print references to immediates as the immediate value, right? I could live with this problem.

> In better news, I did a quick analysis and found that for TM on SS and V8, for
> almost all benchmarks between 70% and 90% of binary operations have a constant
> on their RHS(!) So the potential effects of compaction are promising.

The hard case for RISC backends is when the instruction needs a scratch register to hold an immediate value. This happens with load/store and far jump/call now, and would also start happening with these new immediate LIR instructions.

A possible fix is to generate LIns* immediates into a LirBuffer during assembly time. Constants like this could never spill into the stack frame (therefore we'd never have to mark them with LIR_live), because they'd always be rematerialized. (I also want to do this for callee-saved registers, so LIR frontends don't have to mess around with them.)
On TR x64, the percentage of RHS constants is about 75% on the small benchmarks, and about 65% on a broader set of Flash tests & apps. I used x64 to distinguish pointers from ints.

opcode  %const   count
------  ------   -----
ins2        74   22720  (OVERALL)
orq        100    4430
gtl        100      15
gel        100      11
andq       100    1575
addq       100    4459
lel         98     760
eql         97     924
eqq         97    2951
ltl         89     607
xorl        88     622
addl        88     495
andl        69     233
subl        69      55
mull        68      22
orl         37     124
ltul         9      62
ltuq         0    1890
ltq          0      79
leq          0       2

And again, on a set of 40 broader tests & apps from the web:

opcode  %const   count
------  ------   -----
ins2        65  584763  (OVERALL)
orq        100   20473
gtul       100     343
gtl        100    1934
gel        100      43
andq       100   16193
eqq         99   86471
subl        97   60932
eql         96   40431
addl        92   98590
xorl        84     496
andl        82    4433
ltul        73    1179
lel         67    2183
mull        62    1267
addq        54  101004
ltl         51    2266
orl         35    2358
ltuq         0   11579
ltq          0     230
leul         0   47537
leq          0       9
Testing the % of values that are signed-8-bit, in the second group:

opcode  %isS8   count
------  -----   -----
orq       100   20473
lel       100    1479
gel       100      43
eqq       100   86273
andq      100   16193
subl       99   59115
eql        98   38918
xorl       96     421
addq       96   54605
ltl        95    1172
gtl        94    1934
orl        85     838
addl       85   90908
ltul       80     866
gtul       76     343
andl       44    3649
mull       43     787
Another advantage of the LInspOrImm approach, as compared to the status quo: isImmI() tests are just a bit-test, instead of a deref+test.
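For example, using the LInspOrImm sketch after comment #0 (rhsIsImm() is just an illustrative helper; LIns::isImmI() is the existing check):

  // Status quo: the operand is always an LIns*, so spotting an immediate
  // means loading the operand's opcode word from memory first.
  bool rhsIsImm(LIns* rhs)       { return rhs->isImmI(); }   // deref + test

  // Proposed: the answer is encoded in the operand word itself.
  bool rhsIsImm(LInspOrImm rhs)  { return rhs.isImm(); }     // single bit-test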
(In reply to comment #2)
> we normally print references to immediates as the immediate value, right?

Not if you've given it a name. E.g. currently you can do this:

  MASK = imml 2
  x = and y MASK

With immediate forms that'll just be this:

  x = and y 2

The flipside is that CSE often causes immediate names to be applied in places where they shouldn't be. E.g. if you had a constant 2 elsewhere that wasn't a mask, it could end up being labelled as MASK anyway. So not being able to name constants is not much of a problem overall.

> The hard case for RISC backends is when the instruction needs a scratch
> register to hold an immediate value. this happens with load/store and far
> jump/call now, and would also start happening with these new immediate LIR
> instructions.

Looking at the backends, AFAICT (the kind of range checks involved are sketched after this comment):

- i386/X64: all immediates can be directly encoded
- ARM:
- PPC/MIPS: looks like 16-bit immediates can be encoded directly
- Sparc/ARM: all immediates can be directly encoded, except for MUL

It doesn't seem that hard; we have registerAllocTmp(), which can find a scratch register.

> A possible fix is to generate LIns* immediates into a LirBuffer during assembly
> time. constants like this could never spill into the stack frame (therefore
> we'd never have to mark them with LIR_live), because they'd always be
> rematerialized.

Sounds complicated...
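The sort of range checks involved look roughly like this (a sketch, not actual backend code; the helper names are invented, and the real encodings have more wrinkles):

  #include <stdint.h>

  // PPC/MIPS-style: the immediate must fit in a signed 16-bit field.
  static bool fitsSImm16(int32_t v) {
      return -32768 <= v && v <= 32767;
  }

  // ARM-style data-processing operand: an 8-bit value rotated right by an
  // even amount -- the source of ARM's "odd rules".
  static bool fitsARMImm(uint32_t v) {
      for (int rot = 0; rot < 32; rot += 2) {
          uint32_t r = (v << rot) | (v >> ((32 - rot) & 31));  // rotating left undoes the rotate right
          if (r <= 0xff)
              return true;
      }
      return false;
  }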
Yeah, CSE collisions make constant naming mostly pointless. I would prefer "and y 2".
(In reply to comment #6)
> Looking at the backends, AFAICT:
> - i386/X64: all immediates can be directly encoded
> - ARM:
> - PPC/MIPS: looks like 16-bit immediates can be encoded directly
> - Sparc/ARM: all immediates can be directly encoded, except for MUL

ARM has some odd rules, but should handle all the common cases. I noticed this when working on rematerialization.

> > A possible fix is to generate LIns* immediates into a LirBuffer during assembly
> > time. constants like this could never spill into the stack frame (therefore
> > we'd never have to mark them with LIR_live), because they'd always be
> > rematerialized.
>
> Sounds complicated...

Maybe, yeah. The advantage of assigning a register to a LIns*, vs just using a scratch register, comes when the immediate is used more than once in close proximity -- the complicated approach avoids emitting the code that generates the same constant twice. It might not matter in practice; we need data. I have the same approach in mind for dealing with callee-saved registers, so we'll see how it plays out in that case anyway.
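To make the scratch-register fallback concrete, its shape is roughly this (a sketch only: registerAllocTmp() is the real helper mentioned in comment #6, with its exact signature assumed here; the emit* helpers and fitsSImm16() from the earlier sketch are invented; and since nanojit's assembler generates code in reverse, a real backend would emit these steps backwards):

  // Hypothetical backend helper: use the immediate form when the constant
  // fits, otherwise materialize it into a scratch register first.
  void asm_arith_imm(Assembler& a, Register rd, Register rs, int32_t imm) {
      if (fitsSImm16(imm)) {
          emitArithRegImm(a, rd, rs, imm);            // encode the immediate directly
      } else {
          Register tmp = a.registerAllocTmp(GpRegs);  // find a scratch register
          emitLoadImm(a, tmp, imm);                   // rematerialize the constant...
          emitArithRegReg(a, rd, rs, tmp);            // ...and use the register form
      }
  }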
Depends on: 560712
This is a very rough patch implementing the extra opcodes. It passes TM trace-tests and gets rid of most of the immediates, but it doesn't generate good code for some of the more complicated patterns.

I've concluded that this approach is not worth it. Having two versions of various opcodes is a *real* pain in ExprFilter and in the backends; you end up having to do lots of pattern matching twice (see the sketch below). So now the question is whether the LInsp-or-immediate operand type is better, or something else.

All this will only help NJ compile-time... I did some measurements of SunSpider on TM:

- We hit CseFilter::insImmI() 307788 times
- We hit LirBufWriter::insImmI() 53274 times, i.e. 83% of immi's are CSE'd
- We hit LirBuffer::makeRoom() 748306 times, i.e. that's how many instructions we write into the LirBuffer

So this confirms that there are a lot of immediates being pushed down the writer pipeline unnecessarily, and so it may be worth continuing the investigation.
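To show the duplication: every identity ExprFilter knows ends up being matched once for the existing form and once for the immediate form. A contrived sketch (ExprFilter::ins2() and isImmI()/immI() follow existing nanojit naming; LIR_addl is approximate, and LIR_addii and the ins2i() entry point are just stand-ins for what the patch adds):

  // Existing form: one check in ExprFilter::ins2() handles "x + 0 -> x".
  LIns* ExprFilter::ins2(LOpcode op, LIns* a, LIns* b) {
      if (op == LIR_addl && b->isImmI() && b->immI() == 0)
          return a;                         // addl x, (immi 0)  ->  x
      return out->ins2(op, a, b);
  }

  // Immediate form: the same identity needs a second copy, and likewise for
  // every other identity here and for the pattern matching in the backends.
  LIns* ExprFilter::ins2i(LOpcode op, LIns* a, int32_t imm) {
      if (op == LIR_addii && imm == 0)
          return a;                         // addii x, 0  ->  x
      return out->ins2i(op, a, imm);
  }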
(In reply to comment #9)
> - We hit CseFilter::insImmI() 307788 times
> - We hit LirBufWriter::insImmI() 53274 times, i.e. 83% of immi's are CSE'd
> - We hit LirBuffer::makeRoom() 748306 times, i.e. that's how many instructions
>   we write into the LirBuffer

Whoops, I measured 3d-cube.js 26 times. The real numbers are: 64641, 50481, 187095. Immediates are still clearly dominating.

I did some analysis of what the immediates look like. On i386, here are counts of the most common immediates that were successfully CSE'd:

 1:           0:  15340  (30.4%, 30.4%)
 2:           2:   6143  (12.2%, 42.6%)
 3:           1:   5382  (10.7%, 53.2%)
 4:          -4:   3497  ( 6.9%, 60.1%)
 5:           7:   2938  ( 5.8%, 66.0%)
 6:          16:   2838  ( 5.6%, 71.6%)
 7:  -149938176:   2261  ( 4.5%, 76.1%)
 8:   136542112:   1914  ( 3.8%, 79.9%)
 9:       65535:   1841  ( 3.6%, 83.5%)
10:           4:   1515  ( 3.0%, 86.5%)
11:   136547520:    602  ( 1.2%, 87.7%)
12:   136550624:    532  ( 1.1%, 88.8%)
13:           8:    467  ( 0.9%, 89.7%)
14:           3:    446  ( 0.9%, 90.6%)
15:           6:    281  ( 0.6%, 91.1%)
16:  -149926904:    262  ( 0.5%, 91.6%)
17:  -149938048:    260  ( 0.5%, 92.2%)
18:          32:    178  ( 0.4%, 92.5%)
19:          12:    133  ( 0.3%, 92.8%)
20:           5:    120  ( 0.2%, 93.0%)
21:          -1:    106  ( 0.2%, 93.2%)
22:         248:     70  ( 0.1%, 93.4%)
23:          15:     68  ( 0.1%, 93.5%)
24:         424:     67  ( 0.1%, 93.6%)
25:         280:     67  ( 0.1%, 93.8%)
26:  -149926848:     67  ( 0.1%, 93.9%)
27:          20:     67  ( 0.1%, 94.0%)
28:  -149927072:     66  ( 0.1%, 94.2%)
29:   163956460:     63  ( 0.1%, 94.3%)
30:  -149927240:     63  ( 0.1%, 94.4%)
31:   163956652:     63  ( 0.1%, 94.5%)
32:   163956724:     63  ( 0.1%, 94.6%)
33:   163956788:     63  ( 0.1%, 94.8%)
34:   163956532:     63  ( 0.1%, 94.9%)
35:          11:     61  ( 0.1%, 95.0%)

Seeing this, I'm now leaning towards some kind of constant pool approach.
Constant pool? Really? Looks to me that a small immediate takes care of 90% or so.
I tried handling small constants (0..255) specially in CseFilter, just putting them in an array instead of a hash table (sketched below). It reduced instruction counts by 1.002x for a couple of SS benchmarks (3d-raytrace and crypto-md5).

The other idea I had was to avoid dead stages in the pipeline by making each insXYZ() function call onto the next stage that actually does something. E.g. if you call insImm() in TM it passes through FuncFilter and ExprFilter without doing anything. If you instead had in each stage an array of insXYZ() functions you might be able to set things up to avoid these do-nothing stages. But again it might not be much of a win.
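A minimal sketch of the array idea (not the attached patch; SmallImmCache and its hookup into CseFilter::insImmI() are invented for illustration):

  #include <stddef.h>
  #include <stdint.h>

  class LIns;   // nanojit instruction

  // Direct-mapped cache for the constants 0..255, bypassing the hash table.
  // It would need clearing whenever CseFilter's own tables are cleared.
  class SmallImmCache {
      LIns* slot[256];
  public:
      SmallImmCache() { for (int i = 0; i < 256; i++) slot[i] = NULL; }
      LIns* find(int32_t imm) const {
          return (0 <= imm && imm <= 255) ? slot[imm] : NULL;
      }
      void add(int32_t imm, LIns* ins) {
          if (0 <= imm && imm <= 255) slot[imm] = ins;
      }
  };

  // In CseFilter::insImmI(), roughly:
  //     if (LIns* k = smallImms.find(imm)) return k;   // hit: no hashing at all
  //     LIns* ins = <existing hash-table path>;
  //     smallImms.add(imm, ins);
  //     return ins;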
(In reply to comment #12)
> I tried handling small constants (0..255) specially in CseFilter, just putting
> them in an array instead of a hash table. It reduced instruction counts by
> 1.002x for a couple of SS benchmarks (3d-raytrace and crypto-md5).

For the record, here it is. It's probably not worth landing.
Attachment #442243 - Attachment description: partly working patch → partly working patch introducing new opcodes (eqii, addii, etc)
Again, for the record. Again, probably a lot of effort and upheaval for little benefit.
Whiteboard: PACMAN
Severity: normal → minor
OS: Mac OS X → All
Hardware: x86 → All
Target Milestone: --- → Future
This isn't worth fixing. I spun off the "handle small immediates specially in CseFilter" part as bug 609121.
Status: ASSIGNED → RESOLVED
Closed: 14 years ago
Resolution: --- → WONTFIX
Product: Core → Core Graveyard