Closed Bug 444682 Opened 12 years ago Closed 11 years ago

Add support for non-constant displacements for insLoad (insStore already supports those).


(Tamarin Graveyard :: Baseline JIT (CodegenLIR), defect)






(Reporter: gal, Unassigned)




(1 file, 1 obsolete file)

insLoad only supports constant displacements at the moment. It really should support non-constant displacements for array loads.
After double-checking the code, both LIR_ld* and LIR_st* only support constant displacements. insStore() allows a LIns* displacement, but later the Assembler requires the displacement to be const.

So both should be extended. The insLoad() pipeline function already supports LIns* displacements, so the various filters, including the Assembler, need to be compelled not to assume the displ is const.

David is on it (for both load & store), but x86 only. Maybe someone from Adobe can help out with ARM :)
Assignee: nobody → danderson
AFAIK ARM doesn't have a [reg1+reg2] addressing mode, so I think there I'll just have to emit a separate ADD.  Maybe someone can correct me on this since my knowledge of the platform is NULL.
LIR already can do any addressing mode, you just have an expression for the pointer and use displ=0.  adding more insLoad() and insStore() helpers is fine. 
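The point above can be sketched concretely: a non-constant displacement is expressible today by folding it into the base-pointer expression and using displ=0. A minimal sketch; the struct and helper names here are illustrative stand-ins, not the real nanojit API:

```cpp
#include <cassert>

// Illustrative stand-ins for nanojit's LIR nodes; not the real API.
struct LIns { int op; LIns* a; LIns* b; int imm; };
enum { LIR_add = 1, LIR_ld = 2 };

LIns* ins2(int op, LIns* a, LIns* b) { return new LIns{op, a, b, 0}; }
LIns* insLoad(LIns* base, int disp)  { return new LIns{LIR_ld, base, nullptr, disp}; }

// Load with a non-constant index: fold base+index into the pointer
// expression and leave the load's own displacement at 0.
LIns* loadIndexed(LIns* base, LIns* index) {
    return insLoad(ins2(LIR_add, base, index), 0);
}
```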

at the core of this is whether we really want more LIR_ld/st primitives that have more addressing modes. You could argue that even having them with a built-in implied "add p+displ" is corrupt.

e.g. we have
   LIR_st  val,base,disp   // disp must point to a const
   LIR_sti val,base,#disp  // disp is -128..127

we could add more addressing modes, of course.  or we could eliminate LIR_st, and do:

   LIR_sti val,base,#disp

and when disp is outside the range expressible in LIR, base points to an add expression. (All other addressing modes are just different ptr expressions.)

which direction should we go? I tend to lean towards having fewer LIR variants and letting the cpu-specific assembler do pattern matching to fit addressing modes. (The +/- 127 offset of LIR_sti is just a byproduct of the LIR encoding; it doesn't imply anything about what the underlying cpu supports.)

note that we currently don't have LIR_ldi that's like LIR_sti, but if we did it would definitely reduce LIR size... loads from small offsets are really common.

opinions welcome... making LIR nice and clean and small is one problem we're solving, and taking advantage of the cpu's addressing modes is another.
I'm in favor of assembler-specific matching. I was planning on peeking ahead in Nativei386 to see if the value was a LIR_add, so in fact the ADD would already be there on ARM and I wouldn't need to change anything.
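The peek-ahead idea reduces to a tiny matcher: if a load's base pointer is an add, the backend picks a [reg+reg] mode and leaves the add dead; otherwise it falls back to plain [reg]. A hypothetical sketch, not the actual Nativei386 code:

```cpp
#include <cassert>
#include <cstddef>

// Illustrative LIR node; not the real nanojit type.
struct LIns { int op; LIns* a; LIns* b; };
enum { LIR_add = 1 };

struct AddrMode { LIns* base; LIns* index; };  // index == nullptr means plain [reg]

// If the pointer expression is an add, split it into base+index so the
// backend can emit a [reg1+reg2] addressing mode and skip the ADD.
AddrMode matchAddr(LIns* ptr) {
    if (ptr->op == LIR_add)
        return {ptr->a, ptr->b};
    return {ptr, nullptr};
}
```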
we can divide and conquer:

in LIR:

change LIR_st/ld to be like LIR_sti, i.e. the displacement is always an 8-bit immediate. insLoad() and insStore() helpers can generate add expressions when the immediate doesn't fit or if another addr mode is needed. This eliminates insStorei() and LIR_sti.
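Such a helper could look roughly like this (hypothetical names; the real insLoad()/insStore() signatures differ): keep the displacement when it fits in the 8-bit field, otherwise fold it into the base as an add of an immediate.

```cpp
#include <cassert>
#include <utility>

// Illustrative LIR node; not the real nanojit type.
struct LIns { int op; LIns* a; LIns* b; int imm; };
enum { LIR_add = 1, LIR_imm = 2 };

LIns* insImm(int v)                  { return new LIns{LIR_imm, nullptr, nullptr, v}; }
LIns* ins2(int op, LIns* a, LIns* b) { return new LIns{op, a, b, 0}; }

// Returns the (base, displacement) pair to store in the instruction:
// small displacements use the 8-bit immediate field directly; large ones
// are folded into the base pointer as an add expression, with disp=0.
std::pair<LIns*, int> fitDisplacement(LIns* base, int disp) {
    if (disp >= -128 && disp <= 127)
        return {base, disp};
    return {ins2(LIR_add, base, insImm(disp)), 0};
}
```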

in Assembler

the assembler should match base-pointer expressions to cpu-specific addressing modes.
In hotpath we use only LOAD address and STORE address, value. The rest is done in the backend by pattern matching. Here, for space reasons, a small displacement makes sense, but the logic to parse/match it should definitely be in the backend (IMHO).
ARM has immediate offset, register offset, and PC-relative addressing, and the barrel shifter can be used in some cases as well, so there is definitely some pattern matching to be done in the ARM backend.
ARM does have [reg+reg*scale]. It's just written differently.
E.g. int *a; y=a[x] gets translated to: ldr r0, [r0, r1, asl #2]

And for Thumb: lsl r1, r1, #2; ldr r0, [r1, r0]

In my IR I've solved the address generation issue by letting both the xLOAD and the xSTORE ops reference a common xREF op which holds base and index. This way this gets CSEd early on (t[i]=t[i]+1 needs only one xREF). But the assembler may later decide to merge the memory operands and leave the xREF op dead. The decision should be based on the instruction set, whether the CPU AGU is "free" (no extra uops generated for complex addressing modes), and of course whether the xREF is invariant or not relative to the xLOAD/xSTORE.
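A minimal sketch of the xREF scheme as described (the code itself is hypothetical, not LuaJIT's actual representation): a REF node holding base+index is CSEd by identity, so two accesses to t[i] share one REF.

```cpp
#include <cassert>
#include <map>
#include <utility>

// Illustrative IR node; not LuaJIT's actual representation.
struct Ins { int op; Ins* a; Ins* b; };
enum { REF = 1, LOAD = 2, STORE = 3 };

// CSE table keyed on (base, index) identity.
std::map<std::pair<Ins*, Ins*>, Ins*> refCSE;

Ins* makeRef(Ins* base, Ins* index) {
    auto key = std::make_pair(base, index);
    auto it = refCSE.find(key);
    if (it != refCSE.end())
        return it->second;           // CSE hit: t[i]=t[i]+1 reuses one REF
    Ins* r = new Ins{REF, base, index};
    refCSE[key] = r;
    return r;
}
```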

For x86 merging the addressing mode can be done for the load/store itself and even for most opcodes referencing a load. This helps to save scratch registers. It's also usually faster (LEA is expensive on some CPUs) and more compact.

E.g. instead of:

lea esi, [ebx+edi*8]
movsd xmm6, [esi]
addsd xmm7, xmm6

one could generate this:

movsd xmm6, [ebx+edi*8]
addsd xmm7, xmm6

or even this (if the load has only one use):

addsd xmm7, [ebx+edi*8]
Yep, we do the same thing in hotpath as mike does. We have LOAD STORE and then ADDR to generate addresses. ADDR is hoisted out of the loop (usually), and the register allocator decides whether pressure warrants local materialization or allocate a loop register. In the latter case ADDR goes dead.
Does some pattern matching on loads to take advantage of addressing modes.  Doesn't really help performance but it makes the code a little shorter and easier to read.

Example reductions:

add eax, ebx
mov eax, [eax] 
mov eax, [eax+ebx]

shl ebx, 2
add eax, ebx
mov eax, [eax]
mov eax, [eax+ebx*4]
Attachment #332433 - Flags: review?(edwsmith)
Comment on attachment 332433 [details] [diff] [review]
instruction reduction for a common load pattern

needs rebasing, from what I can tell
Attachment #332433 - Flags: review?(edwsmith) → review-
Attached patch patch v2
Coming back to this since dmandelin ran into it in his regex patch.  Passes TraceMonkey's tests, is about a 15-20ms win on SunSpider for us.
Attachment #332433 - Attachment is obsolete: true
Attachment #346343 - Flags: review?(rreitmai)
Attachment #346343 - Flags: review?(gal) → review+
Pre-emptive push since our tree is closing today:

Will leave the bug open for rreitmai's review
Wow that's a nice win!

I'm a little confused by :
			if (rL == NULL || rL->reg == UnknownReg)
				rleft = getBaseReg(lhs, d, rmask(rr));

did you mean for the call to be findSpecificReg()?

Also, would the forms Mike discussed earlier (xREF) have helped in writing this logic? That is, we add a LIR_ref instruction which points to base,disp.
Attachment #346343 - Flags: review?(rreitmai) → review+
Thanks, I indeed meant findSpecificRegFor. I like Mike's idea; sniffing LIR_add is pretty similar, as it'll CSE separately and can be left dead by asm_ld. Perhaps a distinction would make things cleaner though.

I like the idea of applying this optimization to more instructions, for example code like:

mov eax, [esi+edi*2+8]
add ecx, eax

Could be reduced similar to Mike's example.  I'm not sure but it feels like we'd have to check how many uses (or how far apart uses) there are of such expressions.  If the LIR behind "[esi+edi*2+8]" got used frequently, it'd make more sense to put it in a register instead of addressing it each time.
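The use-count intuition above could reduce to a heuristic like this (purely illustrative; no such function exists in the patch):

```cpp
#include <cassert>

// Fuse the address expression into the memory operand only when it is
// cheap to recompute: a single use always fuses; with multiple uses,
// fuse only when register pressure makes keeping it live expensive.
bool shouldFuse(int useCount, bool highRegPressure) {
    if (useCount == 1)
        return true;
    return highRegPressure;
}
```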
The main advantage of the xREF approach is that it helps alias analysis.
An xLOAD and an xSTORE referencing the same memory cell must have the
same xREF (since it's CSEd). Alias analysis involves a lot of searches
through chains of loads and stores -- you better make that fast. Here the
xREFs really help, since you can identify a matching load/store by the
identity of the reference operand alone.

With the proper data structures in place, basic alias analysis needs
only around 10 lines of code (disambiguation based on const/variable
keys). Extended analysis is well under 100 lines and also does stuff
like disambiguating a[i] <-> a[i+1] and a[i] <-> b[i] (adding dynamic
disambiguation guards).
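The "around 10 lines" of basic disambiguation described above might look something like this sketch (hypothetical types; the real LuaJIT code differs): identical CSEd REFs must alias, const-vs-const keys disambiguate exactly, and anything involving a variable key stays conservative.

```cpp
#include <cassert>

// Illustrative reference operand: base object plus a const-or-variable key.
struct Ref { const void* base; bool keyIsConst; int key; };
enum Alias { NoAlias, MayAlias, MustAlias };

Alias aliases(const Ref* a, const Ref* b) {
    if (a == b) return MustAlias;              // same CSEd REF: same cell
    if (a->base != b->base) return MayAlias;   // different bases: unknown here
    if (a->keyIsConst && b->keyIsConst)        // const keys disambiguate exactly
        return (a->key == b->key) ? MustAlias : NoAlias;
    return MayAlias;                           // variable key: stay conservative
}
```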

It's important to keep the load/store varieties separate in the IR (e.g.
LuaJIT 2.x has ALOAD, HLOAD, ULOAD, FLOAD etc.). The semantics of the
language dictate that an array slot cannot alias a hash slot and so on.
This is a big win when you add loop-invariant code motion (LICM aka
hoisting). Many hash loads are invariant and this cannot be blocked by
e.g. array stores with variant keys.

You really want to keep as much of this information in the IR, either
explicitly or in a side structure. Otherwise you loose the ability for
easy disambiguation. In the end you get the same mess as in C/C++, where
almost everything can alias everything else. This makes alias analysis
really expensive and it often fails to disambiguate important cases. You
can do much better when the source language operates at a higher level.

Ok, I realize nanojit doesn't do AA or LICM yet. Just some food for
thought, since it may impact your basic design decisions.

About memory operand fusion:

- The AGU is free on all contemporary x86/x64 CPUs. I.e. it's not
  cheaper to do mov ecx, [ebp] than mov ecx, [esi+4*edi+8].

- But lea can be expensive, so you really want to fuse the load and the
  reference. The AGU bypass to the ALU, found on earlier CPUs, had to go.
  E.g. the P4 splits a lea up into a sequence of shift and add uops.
  Here are uop count and latency in cycles on a P4:

  3/4   lea ebp, [esi+4*edi+8]     2/1   add eax, [esi+4*edi+8]
  1/2   mov ecx, [ebp]
  1/0.5 add eax, ecx
  -----                            ---
  5/6.5                        vs. 2/1

  Ouch! The Core2 and later restored the fast lea by adding special uops
  to the ALU. But a lea still adds 1 extra uop and 1 cycle of latency.

- Fusing loads into x86 ModRM instructions is very beneficial since it
  significantly reduces register pressure. In LuaJIT 2.x I'm fusing
  pretty much everything (from guards to FP operands) against everything
  else that is referencable (FP constants and all kinds of loads, even
  loads of internal fields and spill slots). I'm also fusing cumulative
  offsets into array indexes. This also reduces register pressure and it
  would really fly once I add ABCelim.

- Fusion heuristics are a topic for another message. But I wouldn't worry
  about that before you do LICM. My recommendation: just fuse as much as
  you can in the beginning. Later make it dependent on register pressure
  and the potential spill and restore costs.

Here are some examples for memory operand fusion in action:

A simple reduction over an array:

  local x = 0; for i=1,#t do x = x + t[i] end

x86 machine code for the inner loop, with memory operand fusion:

  cmp esi, edx                          // Bounds check
  ja ->EXIT_2
  cmp dword [ecx+esi*8+0x4], -0x0d      // Type check, fused load, fused ref
  ja ->EXIT_1
  addsd xmm7, [ecx+esi*8]               // Add, fused load, fused ref
  add esi, +0x01
  cmp eax, esi
  jge ->LOOP
  jmp ->EXIT_0

Here's a more involved example from SciMark FFT:

  for i=1,2*n-1,2*dual2 do
    local j = i+dual2
    local ir, ii = v[i], v[i+1]
    local jr, ji = v[j], v[j+1]
    v[j], v[j+1] = ir - jr, ii - ji
    v[i], v[i+1] = ir + jr, ii + ji
  end

x86 machine code for the inner loop, with memory operand fusion:

  mov ebx, [esp+0x10]                   // Restore of bound
  mov ebp, [esp+0x14]                   // Restore of dual2
  cmp esi, ebx                          // Bounds check
  ja ->EXIT_9
  cmp dword [edx+esi*8+0x4],-0x0d       // Type check, fused load, fused ref
  ja ->EXIT_8
  movsd xmm7, [edx+esi*8]               // Load, fused ref
  lea edi, [esi+0x1]
  cmp edi, ebx                          // Bounds check
  ja ->EXIT_7
  cmp dword [edx+edi*8+0x4],-0x0d       // Type check, fused load, fused ref
  ja ->EXIT_6
  movsd xmm6, [edx+edi*8]               // Load, fused ref
  add ebp, esi
  jo ->EXIT_5                           // Overflow check
  cmp ebp, ebx                          // Bounds check
  ja ->EXIT_4
  cmp dword [edx+ebp*8+0x4],-0x0d       // Type check, fused load, fused ref
  ja ->EXIT_3
  movsd xmm4, [edx+ebp*8]               // Load, fused ref
  lea ebx, [ebp+0x1]
  cmp ebx, [esp+0x10]                   // Bounds check, fused spill slot
  ja ->EXIT_2
  cmp dword [edx+ebp*8+0xc],-0x0d       // Type check, fused load, fused ref
  ja ->EXIT_1
  movsd xmm5, [edx+ebp*8+0x8]           // Load, fused ref
  movaps xmm3, xmm7
  subsd xmm3, xmm4
  movaps xmm2, xmm6
  subsd xmm2, xmm5
  movsd [edx+ebp*8+0x8], xmm2           // Store, fused ref
  movsd [edx+ebp*8], xmm3               // Store, fused ref
  addsd xmm7, xmm4
  addsd xmm6, xmm5
  movsd [edx+edi*8], xmm6               // Store, fused ref
  movsd [edx+esi*8], xmm7               // Store, fused ref
  add esi, ecx
  cmp eax, esi
  jge ->LOOP
  jmp ->EXIT_0

[ABCelim would avoid all spills and eliminate 4 of the exits.]
Thanks for the detailed comment Mike and I agree that there is merit in introducing this form.  

If nothing else we may gain a slight reduction in buffer space (or at the very least probably not grow it) by being able to re-use LIR_refs.

It would also mean that we could remove the special forms of ld/st with constant displacement (or at least replace them with ref/refi).

BTW, we do have some liveness analysis going on, not to mention cse.  So this change might lead to cleaner code in those areas.
Flags: in-testsuite-
Flags: flashplayer-triage+
Flags: flashplayer-qrb?
Assignee: dvander → nobody
Component: Tracing Virtual Machine → JIT Compiler (NanoJIT)
QA Contact: tracing-vm → nanojit
Attachment #346343 - Flags: superreview?(edwsmith)
Attachment #346343 - Flags: superreview?(edwsmith) → superreview-
Comment on attachment 346343 [details] [diff] [review]
patch v2

obsolete, I think.
Did we actually land this?
Not that I'm aware of, at least in TR. And it might be at odds with the newly proposed change to add an immediate displacement to LIR_ld. (It's not strictly at odds, but it pulls in a different direction.)
This did land and it's in our branch still AFAIK.
Closed: 11 years ago
Resolution: --- → FIXED
removing QRB request, bug verified
Flags: flashplayer-qrb?