Closed
Bug 559949
Opened 15 years ago
Closed 14 years ago
nanojit: encode immediate forms compactly
Categories
(Core Graveyard :: Nanojit, defect)
Core Graveyard
Nanojit
Tracking
(Not tracked)
RESOLVED
WONTFIX
Future
People
(Reporter: n.nethercote, Assigned: n.nethercote)
References
Details
(Whiteboard: PACMAN)
Attachments
(3 files)
104.35 KB,
patch
|
Details | Diff | Splinter Review | |
10.64 KB,
patch
|
Details | Diff | Splinter Review | |
62.18 KB,
patch
|
Details | Diff | Splinter Review |
Binary operations with an immediate as the second operand are very common in LIR. Eg. comparisons against zero, bitwise-AND with a constant mask, shift by a constant amount.
Furthermore, ExprFilter pushes constant on the LHS to the RHS, and all(?) the backends treat these immediate forms specially.
If we encoded immediate forms more compactly it would have multiple benefits:
- Fewer explicit immediates in the LIR would reduce its size, sometimes reducing the number of code chunk allocations required and reducing cache stress and page faults.
- Fewer instructions would go through the pipeline, saving time, esp. in CseFilter, for which immediates are a hot case.
The downside is that it would introduce representational redundancies. This is mitigated by the fact that just about everywhere that touches the second operand first checks if it's a constant and treats that case specially. So in practice I don't think this will be much of a burden.
As for how best to do it, my initial thought was to add new opcodes: addli, subli, etc. I started implementing that but after a while concluded that there is another way that might be better. And that's to introduce a new type, LInspOrImm, which means "LInsp or immediate". (Suggestions for a better name are welcome.) This would be a word-sized value, if the bottom bit is zero then it's an LInsp, if the bottom bit is one then the top 31/63 bits are an immediate. So only very large immediates could not be encoded. This representation would of course be hidden behind some inline accessor functions. LInspOrImm would only be used as the second operand of the relevant operations, most existing LInsp fields would stay the same (eg. the first operand of these binary operations). I'm not certain this is better than explicit opcodes but it avoids the need for many new opcodes, which is nice.
![]() |
Assignee | |
Comment 1•15 years ago
|
||
Another slight disadvantage is that it will make LIR dumps harder to read, as it won't be possible to name constants that are encoded in immediate form.
In better news, I did a quick analysis and found that for TM on SS and V8, for almost all benchmarks between 70% and 90% of binary operations have a constant on their RHS(!) So the potential effects of compaction are promising.
Comment 2•15 years ago
|
||
(In reply to comment #1)
> Another slight disadvantage is that it will make LIR dumps harder to read, as
> it won't be possible to name constants that are encoded in immediate form.
we normally print references to immediates as the immediate value, right? i could live with this problem.
> In better news, I did a quick analysis and found that for TM on SS and V8, for
> almost all benchmarks between 70% and 90% of binary operations have a constant
> on their RHS(!) So the potential effects of compaction are promising.
The hard case for RISC backends is when the instruction needs a scratch register to hold an immediate value. this happens with load/store and far jump/call now, and would also start happening with these new immediate LIR instructions.
A possible fix is to generate LIns* immediates into a LirBuffer during assembly time. constants like this could never spill into the stack frame (therefore we'd never have to mark them with LIR_live), because they'd always be rematerialized.
(I also want to do this for callee-saved registers, so LIR frontends dont have to mess around with them).
Comment 3•15 years ago
|
||
On TR x64, % of RHS constants are about 75% on the small benchmarks, and about 65% on a broader set of flash tests & apps. I used X64 to distinguish pointers from ints.
opcode %const count
------ ------ -----
ins2 74 22720 (OVERALL)
orq 100 4430
gtl 100 15
gel 100 11
andq 100 1575
addq 100 4459
lel 98 760
eql 97 924
eqq 97 2951
ltl 89 607
xorl 88 622
addl 88 495
andl 69 233
subl 69 55
mull 68 22
orl 37 124
ltul 9 62
ltuq 0 1890
ltq 0 79
leq 0 2
And again, on a set of 40 broader tests & apps from the web
opcode %const count
------ ------ -----
ins2 65 584763 (OVERALL)
orq 100 20473
gtul 100 343
gtl 100 1934
gel 100 43
andq 100 16193
eqq 99 86471
subl 97 60932
eql 96 40431
addl 92 98590
xorl 84 496
andl 82 4433
ltul 73 1179
lel 67 2183
mull 62 1267
addq 54 101004
ltl 51 2266
orl 35 2358
ltuq 0 11579
ltq 0 230
leul 0 47537
leq 0 9
Comment 4•15 years ago
|
||
Testing the % of values that are signed-8-bit, in the second group:
opcode %isS8 count
------ ------ -----
orq 100 20473
lel 100 1479
gel 100 43
eqq 100 86273
andq 100 16193
subl 99 59115
eql 98 38918
xorl 96 421
addq 96 54605
ltl 95 1172
gtl 94 1934
orl 85 838
addl 85 90908
ltul 80 866
gtul 76 343
andl 44 3649
mull 43 787
![]() |
Assignee | |
Comment 5•15 years ago
|
||
Another advantage of the LInspOrImm approach, as compared to the status quo: isImmI() tests are just a bit-test, instead of a deref+test.
![]() |
Assignee | |
Comment 6•15 years ago
|
||
(In reply to comment #2)
>
> we normally print references to immediates as the immediate value, right?
Not if you've given it a name. Eg. currently you can do this:
MASK = imml 2
x = and y MASK
With immediate forms that'll just be this:
x = and y 2
The flipside of this is that often CSE causes immediate names to be applied in places it shouldn't be. Eg. if you had a constant 2 elsewhere that wasn't a MASK, it could end up being labelled as MASK anyway. So not being able to name constants is not much of a problem overall.
> The hard case for RISC backends is when the instruction needs a scratch
> register to hold an immediate value. this happens with load/store and far
> jump/call now, and would also start happening with these new immediate LIR
> instructions.
Looking at the backends, AFAICT:
- i386/X64: all immediates can be directly encoded
- ARM:
- PPC/MIPS: looks like 16-bit immediates can be encoded directly
- Sparc/ARM: all immediates can be directly encoded, except for MUL
It doesn't seem that hard, we have registerAllocTmp() which can find a scratch register.
> A possible fix is to generate LIns* immediates into a LirBuffer during assembly
> time. constants like this could never spill into the stack frame (therefore
> we'd never have to mark them with LIR_live), because they'd always be
> rematerialized.
Sounds complicated...
Comment 7•15 years ago
|
||
Yeah, cse collisions make constant naming mostly pointless. I would prefer and y 2.
Comment 8•15 years ago
|
||
(In reply to comment #6)
> Looking at the backends, AFAICT:
> - i386/X64: all immediates can be directly encoded
> - ARM:
> - PPC/MIPS: looks like 16-bit immediates can be encoded directly
> - Sparc/ARM: all immediates can be directly encoded, except for MUL
arm has some odd rules, but should handle all the common cases. Noticed this when working on rematerialization.
> > A possible fix is to generate LIns* immediates into a LirBuffer during assembly
> > time. constants like this could never spill into the stack frame (therefore
> > we'd never have to mark them with LIR_live), because they'd always be
> > rematerialized.
>
> Sounds complicated...
maybe, yeah. the advantage of assigning a register to a LIns*, vs just using a scratch register, comes when the immediate is used more than once in close proximity -- the complicated approach will avoid emitting code twice to generate the same constant. might not matter in practice, need data.
I have the same approach in mind for dealing with callee-saved registers, so we'll see how it plays out in that case anyway.
![]() |
Assignee | |
Comment 9•15 years ago
|
||
This is a very rough patch implementing the extra opcodes. It passes TM trace-tests and gets rid of most of the immediates, but it doesn't generate good code for some of the more complicated patterns.
I've concluded that this approach is not worth it. Having two versions of various opcodes is a *real* pain in ExprFilter and in the backends, you end up having to lots of pattern matching twice.
So now the question is whether the LInsp-or-immediate operand type is better, or something else. All this will only help NJ compile-time... I did some measurements of Sunspider on TM:
- We hit CseFilter::insImmI() 307788 times
- We hit LirBufWriter::insImmI() 53274 times, i.e 83% of immi's are CSE'd
- We hit LirBuffer::makeRoom() 748306 times, ie. that's how many instructions we write into the LirBuffer
So this confirms that there are a lot of immediates being pushed down the writer pipeline unnecessarily, and so it may be worth continuing investigation.
![]() |
Assignee | |
Comment 10•15 years ago
|
||
(In reply to comment #9)
>
> - We hit CseFilter::insImmI() 307788 times
> - We hit LirBufWriter::insImmI() 53274 times, i.e 83% of immi's are CSE'd
> - We hit LirBuffer::makeRoom() 748306 times, ie. that's how many instructions
> we write into the LirBuffer
Whoops, I measured 3d-cube.js 26 times. The real numbers are: 64641, 50481, 187095. Immediates are still clearly dominating.
I did some analysis of what the immediates look like. On i386, here are counts of the most common immediates that were successfully CSE'd:
1: 0: 15340 (30.4%, 30.4%)
2: 2: 6143 (12.2%, 42.6%)
3: 1: 5382 (10.7%, 53.2%)
4: -4: 3497 ( 6.9%, 60.1%)
5: 7: 2938 ( 5.8%, 66.0%)
6: 16: 2838 ( 5.6%, 71.6%)
7: -149938176: 2261 ( 4.5%, 76.1%)
8: 136542112: 1914 ( 3.8%, 79.9%)
9: 65535: 1841 ( 3.6%, 83.5%)
10: 4: 1515 ( 3.0%, 86.5%)
11: 136547520: 602 ( 1.2%, 87.7%)
12: 136550624: 532 ( 1.1%, 88.8%)
13: 8: 467 ( 0.9%, 89.7%)
14: 3: 446 ( 0.9%, 90.6%)
15: 6: 281 ( 0.6%, 91.1%)
16: -149926904: 262 ( 0.5%, 91.6%)
17: -149938048: 260 ( 0.5%, 92.2%)
18: 32: 178 ( 0.4%, 92.5%)
19: 12: 133 ( 0.3%, 92.8%)
20: 5: 120 ( 0.2%, 93.0%)
21: -1: 106 ( 0.2%, 93.2%)
22: 248: 70 ( 0.1%, 93.4%)
23: 15: 68 ( 0.1%, 93.5%)
24: 424: 67 ( 0.1%, 93.6%)
25: 280: 67 ( 0.1%, 93.8%)
26: -149926848: 67 ( 0.1%, 93.9%)
27: 20: 67 ( 0.1%, 94.0%)
28: -149927072: 66 ( 0.1%, 94.2%)
29: 163956460: 63 ( 0.1%, 94.3%)
30: -149927240: 63 ( 0.1%, 94.4%)
31: 163956652: 63 ( 0.1%, 94.5%)
32: 163956724: 63 ( 0.1%, 94.6%)
33: 163956788: 63 ( 0.1%, 94.8%)
34: 163956532: 63 ( 0.1%, 94.9%)
35: 11: 61 ( 0.1%, 95.0%)
Seeing this, I'm now leaning towards some kind of constant pool approach.
Comment 11•15 years ago
|
||
Constant pool? Really? Looks to me that a small immediate takes care of 90% or so.
![]() |
Assignee | |
Comment 12•15 years ago
|
||
I tried handling small constants (0..255) specially in CseFilter, just putting them in an array instead of a hash table. It reduced instruction counts by 1.002x for a couple of SS benchmarks (3d-raytrace and crypto-md5).
The other idea I had was to avoid dead stages in the pipeline by making each insXYZ() function call onto the next stage that actually does something. Eg. if you call insImm() in TM it passes through FuncFilter and ExprFilter without doing anything. If you instead had in each stage an array of insXYZ() functions you might be able to set things up to avoid these do-nothing stages. But again it might not be much of a win.
![]() |
Assignee | |
Comment 13•15 years ago
|
||
(In reply to comment #12)
> I tried handling small constants (0..255) specially in CseFilter, just putting
> them in an array instead of a hash table. It reduced instruction counts by
> 1.002x for a couple of SS benchmarks (3d-raytrace and crypto-md5).
For the record, here it is. It's probably not worth landing.
![]() |
Assignee | |
Updated•15 years ago
|
Attachment #442243 -
Attachment description: partly working patch → partly working patch introducing new opcodes (eqii, addii, etc)
![]() |
Assignee | |
Comment 14•15 years ago
|
||
Again, for the record. Again, probably a lot of effort and upheaval for little benefit.
Updated•15 years ago
|
Whiteboard: PACMAN
Updated•15 years ago
|
Severity: normal → minor
OS: Mac OS X → All
Hardware: x86 → All
Target Milestone: --- → Future
![]() |
Assignee | |
Comment 15•14 years ago
|
||
This isn't worth fixing. I spun off the "handle small immediates specially in CseFilter" part as bug 609121.
Status: ASSIGNED → RESOLVED
Closed: 14 years ago
Resolution: --- → WONTFIX
Updated•11 years ago
|
Product: Core → Core Graveyard
You need to log in
before you can comment on or make changes to this bug.
Description
•