
nanojit: encode immediate forms compactly

RESOLVED WONTFIX

Status

Core Graveyard
Nanojit
--
minor
RESOLVED WONTFIX
8 years ago
4 years ago

People

(Reporter: njn, Assigned: njn)

Tracking

unspecified
Future

Firefox Tracking Flags

(Not tracked)

Details

(Whiteboard: PACMAN)

Attachments

(3 attachments)

(Assignee)

Description

8 years ago
Binary operations with an immediate as the second operand are very common in LIR.  Eg. comparisons against zero, bitwise-AND with a constant mask, shift by a constant amount.

Furthermore, ExprFilter pushes constants from the LHS to the RHS, and all(?) the backends treat these immediate forms specially.

If we encoded immediate forms more compactly it would have multiple benefits:

- Fewer explicit immediates in the LIR would reduce its size, sometimes reducing the number of code chunk allocations required and reducing cache stress and page faults.

- Fewer instructions would go through the pipeline, saving time, esp. in CseFilter, for which immediates are a hot case.

The downside is that it would introduce representational redundancies.  This is mitigated by the fact that just about everywhere that touches the second operand first checks if it's a constant and treats that case specially.  So in practice I don't think this will be much of a burden.

As for how best to do it, my initial thought was to add new opcodes:  addli, subli, etc.  I started implementing that, but after a while concluded that there is another way that might be better.  And that's to introduce a new type, LInspOrImm, which means "LInsp or immediate".  (Suggestions for a better name are welcome.)  This would be a word-sized value:  if the bottom bit is zero it's an LInsp; if the bottom bit is one, the top 31/63 bits are an immediate.  So only very large immediates could not be encoded.  This representation would of course be hidden behind some inline accessor functions.  LInspOrImm would only be used as the second operand of the relevant operations; most existing LInsp fields would stay the same (eg. the first operand of these binary operations).  I'm not certain this is better than explicit opcodes, but it avoids the need for many new opcodes, which is nice.
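A minimal sketch of the tagged-word idea in C++ (the names LInsOrImm, fromImm, isImm, etc. are illustrative only, not nanojit's actual API):

```cpp
#include <cassert>
#include <cstdint>

struct LIns;  // opaque stand-in for nanojit's instruction type

// Hypothetical tagged operand: bottom bit 0 means the word is an LIns
// pointer (pointers are aligned, so bit 0 is naturally zero); bottom
// bit 1 means the top 31/63 bits hold a signed immediate.
class LInsOrImm {
    uintptr_t word;
public:
    static LInsOrImm fromIns(LIns* p) {
        LInsOrImm r;
        r.word = reinterpret_cast<uintptr_t>(p);
        return r;
    }
    static LInsOrImm fromImm(intptr_t imm) {
        LInsOrImm r;
        r.word = (static_cast<uintptr_t>(imm) << 1) | 1;  // tag in bit 0
        return r;
    }
    // The isImmI() equivalent: a single bit-test, no dereference.
    bool isImm() const { return word & 1; }
    intptr_t immValue() const {
        // arithmetic right shift recovers the sign of the payload
        return static_cast<intptr_t>(word) >> 1;
    }
    LIns* ins() const { return reinterpret_cast<LIns*>(word); }
};
```

Only values needing more than 31/63 bits fail to fit, and the accessors keep the tagging invisible to callers.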
(Assignee)

Comment 1

8 years ago
Another slight disadvantage is that it will make LIR dumps harder to read, as it won't be possible to name constants that are encoded in immediate form.

In better news, I did a quick analysis and found that for TM on SS and V8, for almost all benchmarks between 70% and 90% of binary operations have a constant on their RHS(!)  So the potential effects of compaction are promising.

Comment 2

8 years ago
(In reply to comment #1)
> Another slight disadvantage is that it will make LIR dumps harder to read, as
> it won't be possible to name constants that are encoded in immediate form.

We normally print references to immediates as the immediate value, right?  I could live with this problem.

> In better news, I did a quick analysis and found that for TM on SS and V8, for
> almost all benchmarks between 70% and 90% of binary operations have a constant
> on their RHS(!)  So the potential effects of compaction are promising.

The hard case for RISC backends is when the instruction needs a scratch register to hold an immediate value.  This happens with load/store and far jump/call now, and would also start happening with these new immediate LIR instructions.

A possible fix is to generate LIns* immediates into a LirBuffer at assembly time.  Constants like this could never spill into the stack frame (therefore we'd never have to mark them with LIR_live), because they'd always be rematerialized.

(I also want to do this for callee-saved registers, so LIR frontends don't have to mess around with them.)

Comment 3

8 years ago
On TR x64, the percentage of RHS constants is about 75% on the small benchmarks, and about 65% on a broader set of Flash tests & apps.   I used x64 to distinguish pointers from ints.

opcode  %const  count
------  ------  -----
ins2    74      22720  (OVERALL)
orq     100     4430
gtl     100     15
gel     100     11
andq    100     1575
addq    100     4459
lel     98      760
eql     97      924
eqq     97      2951
ltl     89      607
xorl    88      622
addl    88      495
andl    69      233
subl    69      55
mull    68      22
orl     37      124
ltul    9       62
ltuq    0       1890
ltq     0       79
leq     0       2

And again, on a set of 40 broader tests & apps from the web:

opcode  %const  count
------  ------  -----
ins2    65      584763  (OVERALL)
orq     100     20473
gtul    100     343
gtl     100     1934
gel     100     43
andq    100     16193
eqq     99      86471
subl    97      60932
eql     96      40431
addl    92      98590
xorl    84      496
andl    82      4433
ltul    73      1179
lel     67      2183
mull    62      1267
addq    54      101004
ltl     51      2266
orl     35      2358
ltuq    0       11579
ltq     0       230
leul    0       47537
leq     0       9

Comment 4

8 years ago
Testing the % of values that are signed-8-bit, in the second group:

opcode  %isS8   count
------  ------  -----
orq     100     20473
lel     100     1479
gel     100     43
eqq     100     86273
andq    100     16193
subl    99      59115
eql     98      38918
xorl    96      421
addq    96      54605
ltl     95      1172
gtl     94      1934
orl     85      838
addl    85      90908
ltul    80      866
gtul    76      343
andl    44      3649
mull    43      787
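The signed-8-bit fit test tallied above can be sketched as follows (the helper name fitsS8 is illustrative, not nanojit's):

```cpp
#include <cassert>
#include <cstdint>

// A value fits in a signed 8-bit immediate iff it survives a round-trip
// through int8_t, i.e. it lies in [-128, 127].  (Narrowing out-of-range
// values is implementation-defined pre-C++20, but modular in practice.)
static bool fitsS8(int32_t v) {
    return static_cast<int8_t>(v) == v;
}
```

This matters because x86/x64 encodes sign-extended imm8 operands in one byte, so these cases compact especially well.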
(Assignee)

Comment 5

8 years ago
Another advantage of the LInspOrImm approach, as compared to the status quo:  isImmI() tests are just a bit-test, instead of a deref+test.
(Assignee)

Comment 6

8 years ago
(In reply to comment #2)
> 
> we normally print references to immediates as the immediate value, right?

Not if you've given it a name.  Eg. currently you can do this:

  MASK = imml 2
  x = and y MASK
  
With immediate forms that'll just be this:

  x = and y 2

The flipside of this is that CSE often causes immediate names to be applied in places they shouldn't be.  Eg. if you had a constant 2 elsewhere that wasn't a mask, it could end up being labelled as MASK anyway.  So not being able to name constants is not much of a problem overall.


> The hard case for RISC backends is when the instruction needs a scratch
> register to hold an immediate value.  this happens with load/store and far
> jump/call now, and would also start happening with these new immediate LIR
> instructions.

Looking at the backends, AFAICT:
- i386/X64: all immediates can be directly encoded 
- ARM:
- PPC/MIPS: looks like 16-bit immediates can be encoded directly
- Sparc/ARM: all immediates can be directly encoded, except for MUL

It doesn't seem that hard; we have registerAllocTmp(), which can find a scratch register.


> A possible fix is to generate LIns* immediates into a LirBuffer during assembly
> time.  constants like this could never spill into the stack frame (therefore
> we'd never have to mark them with LIR_live), because they'd always be
> rematerialized.

Sounds complicated...

Comment 7

8 years ago
Yeah, CSE collisions make constant naming mostly pointless.  I would prefer "and y 2".

Comment 8

8 years ago
(In reply to comment #6)
> Looking at the backends, AFAICT:
> - i386/X64: all immediates can be directly encoded 
> - ARM:
> - PPC/MIPS: looks like 16-bit immediates can be encoded directly
> - Sparc/ARM: all immediates can be directly encoded, except for MUL

ARM has some odd rules, but should handle all the common cases.  I noticed this when working on rematerialization.

> > A possible fix is to generate LIns* immediates into a LirBuffer during assembly
> > time.  constants like this could never spill into the stack frame (therefore
> > we'd never have to mark them with LIR_live), because they'd always be
> > rematerialized.
> 
> Sounds complicated...

Maybe, yeah.  The advantage of assigning a register to a LIns*, vs. just using a scratch register, comes when the immediate is used more than once in close proximity -- the complicated approach will avoid emitting code twice to generate the same constant.  Might not matter in practice; need data.

I have the same approach in mind for dealing with callee-saved registers, so we'll see how it plays out in that case anyway.
(Assignee)

Updated

8 years ago
Depends on: 560712
(Assignee)

Comment 9

8 years ago
Created attachment 442243 [details] [diff] [review]
partly working patch introducing new opcodes (eqii, addii, etc)

This is a very rough patch implementing the extra opcodes.  It passes TM trace-tests and gets rid of most of the immediates, but it doesn't generate good code for some of the more complicated patterns.

I've concluded that this approach is not worth it.  Having two versions of various opcodes is a *real* pain in ExprFilter and in the backends; you end up having to do lots of pattern matching twice.

So now the question is whether the LInsp-or-immediate operand type is better, or something else.  All this will only help NJ compile time... I did some measurements of SunSpider on TM:

- We hit CseFilter::insImmI() 307788 times
- We hit LirBufWriter::insImmI() 53274 times, i.e. 83% of immi's are CSE'd
- We hit LirBuffer::makeRoom() 748306 times, i.e. that's how many instructions we write into the LirBuffer

So this confirms that a lot of immediates are being pushed down the writer pipeline unnecessarily, and so it may be worth continuing the investigation.
(Assignee)

Comment 10

8 years ago
(In reply to comment #9)
> 
> - We hit CseFilter::insImmI() 307788 times
> - We hit LirBufWriter::insImmI() 53274 times, i.e. 83% of immi's are CSE'd
> - We hit LirBuffer::makeRoom() 748306 times, i.e. that's how many instructions
> we write into the LirBuffer

Whoops, I measured 3d-cube.js 26 times.  The real numbers are:  64641, 50481, 187095.  Immediates are still clearly dominating.

I did some analysis of what the immediates look like.  On i386, here are counts of the most common immediates that were successfully CSE'd:

   1:          0:  15340 (30.4%, 30.4%)
   2:          2:   6143 (12.2%, 42.6%)
   3:          1:   5382 (10.7%, 53.2%)
   4:         -4:   3497 ( 6.9%, 60.1%)
   5:          7:   2938 ( 5.8%, 66.0%)
   6:         16:   2838 ( 5.6%, 71.6%)
   7: -149938176:   2261 ( 4.5%, 76.1%)
   8:  136542112:   1914 ( 3.8%, 79.9%)
   9:      65535:   1841 ( 3.6%, 83.5%)
  10:          4:   1515 ( 3.0%, 86.5%)
  11:  136547520:    602 ( 1.2%, 87.7%)
  12:  136550624:    532 ( 1.1%, 88.8%)
  13:          8:    467 ( 0.9%, 89.7%)
  14:          3:    446 ( 0.9%, 90.6%)
  15:          6:    281 ( 0.6%, 91.1%)
  16: -149926904:    262 ( 0.5%, 91.6%)
  17: -149938048:    260 ( 0.5%, 92.2%)
  18:         32:    178 ( 0.4%, 92.5%)
  19:         12:    133 ( 0.3%, 92.8%)
  20:          5:    120 ( 0.2%, 93.0%)
  21:         -1:    106 ( 0.2%, 93.2%)
  22:        248:     70 ( 0.1%, 93.4%)
  23:         15:     68 ( 0.1%, 93.5%)
  24:        424:     67 ( 0.1%, 93.6%)
  25:        280:     67 ( 0.1%, 93.8%)
  26: -149926848:     67 ( 0.1%, 93.9%)
  27:         20:     67 ( 0.1%, 94.0%)
  28: -149927072:     66 ( 0.1%, 94.2%)
  29:  163956460:     63 ( 0.1%, 94.3%)
  30: -149927240:     63 ( 0.1%, 94.4%)
  31:  163956652:     63 ( 0.1%, 94.5%)
  32:  163956724:     63 ( 0.1%, 94.6%)
  33:  163956788:     63 ( 0.1%, 94.8%)
  34:  163956532:     63 ( 0.1%, 94.9%)
  35:         11:     61 ( 0.1%, 95.0%)

Seeing this, I'm now leaning towards some kind of constant pool approach.

Comment 11

8 years ago
Constant pool? Really? Looks to me that a small immediate takes care of 90% or so.
(Assignee)

Comment 12

8 years ago
I tried handling small constants (0..255) specially in CseFilter, just putting them in an array instead of a hash table.  It reduced instruction counts by 1.002x for a couple of SS benchmarks (3d-raytrace and crypto-md5).
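A minimal sketch of that fast path, under simplifying assumptions (the names Cse, findImmI, and LIns are illustrative, not nanojit's actual CseFilter API, and nodes are heap-allocated here rather than arena-allocated):

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>

struct LIns { int32_t imm; };  // toy stand-in for an immediate instruction

// Small constants (0..255) get direct-mapped array slots, so the common
// case skips hashing entirely; everything else falls back to a hash table.
class Cse {
    LIns* small[256] = {};                    // one slot per small constant
    std::unordered_map<int32_t, LIns*> big;   // all other immediates
public:
    LIns* findImmI(int32_t v) {
        if (v >= 0 && v <= 255) {
            if (!small[v]) small[v] = new LIns{v};  // array index, no hashing
            return small[v];
        }
        LIns*& slot = big[v];
        if (!slot) slot = new LIns{v};
        return slot;
    }
};

// Tiny self-check: repeated lookups of the same value yield the same node.
static bool cseWorks() {
    Cse c;
    return c.findImmI(7) == c.findImmI(7)
        && c.findImmI(-4) == c.findImmI(-4)
        && c.findImmI(7)->imm == 7;
}
```

Given the distribution in comment 10 (where values like 0, 1, and 2 dominate), the array covers the bulk of lookups, which is why the remaining win is so small.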

The other idea I had was to avoid dead stages in the pipeline by making each insXYZ() function call the next stage that actually does something.  Eg. if you call insImm() in TM it passes through FuncFilter and ExprFilter without doing anything.  If you instead had in each stage an array of insXYZ() functions, you might be able to set things up to avoid these do-nothing stages.  But again, it might not be much of a win.
(Assignee)

Comment 13

8 years ago
Created attachment 442630 [details] [diff] [review]
patch treating small constants specially in CseFilter

(In reply to comment #12)
> I tried handling small constants (0..255) specially in CseFilter, just putting
> them in an array instead of a hash table.  It reduced instruction counts by
> 1.002x for a couple of SS benchmarks (3d-raytrace and crypto-md5).

For the record, here it is.  It's probably not worth landing.
(Assignee)

Updated

8 years ago
Attachment #442243 - Attachment description: partly working patch → partly working patch introducing new opcodes (eqii, addii, etc)
(Assignee)

Comment 14

8 years ago
Created attachment 442631 [details] [diff] [review]
patch partly implementing the tagged operand type

Again, for the record.  Again, probably a lot of effort and upheaval for little benefit.

Updated

7 years ago
Whiteboard: PACMAN

Updated

7 years ago
Severity: normal → minor
OS: Mac OS X → All
Hardware: x86 → All
Target Milestone: --- → Future
(Assignee)

Comment 15

7 years ago
This isn't worth fixing.  I spun off the "handle small immediates specially in CseFilter" part as bug 609121.
Status: ASSIGNED → RESOLVED
Last Resolved: 7 years ago
Resolution: --- → WONTFIX
Component: Nanojit → Nanojit
Product: Core → Core Graveyard