Closed Bug 444682 (Opened 15 years ago, Closed 14 years ago)
Add support for non-constant displacements for insLoad (insStore already supports those).
Categories
(Tamarin Graveyard :: Baseline JIT (CodegenLIR), defect)
Tracking: (Not tracked)
Status: VERIFIED FIXED
People
(Reporter: gal, Unassigned)
Attachments
(1 file, 1 obsolete file)
6.57 KB, patch; rreitmai: review+, gal: review+, edwsmith: superreview-
insLoad only supports constant displacements at the moment. It really should support non-constant displacements for array loads.
Comment 1•15 years ago
After double-checking the code, both LIR_ld* and LIR_st* only support constant displacements. insStore() allows a LIns* displacement, but later the Assembler requires the displacement to be const, so both should be extended. The insLoad() pipeline function already supports LIns* displacements, so the various filters, including the Assembler, need to be changed so they don't assume the displacement is const.
Comment 2 (Reporter)•15 years ago
David is on it (for both load & store), but x86 only. Maybe someone from Adobe can help out with ARM :)
Assignee: nobody → danderson
Comment 3•15 years ago

AFAIK ARM doesn't have a [reg1+reg2] addressing mode, so I think there I'll just have to emit a separate ADD. Maybe someone can correct me on this since my knowledge of the platform is NULL.
Comment 4•15 years ago
LIR already can do any addressing mode: you just have an expression for the pointer and use displ=0. Adding more insLoad() and insStore() helpers is fine. At the core of this is whether we really want more LIR_ld/st primitives that have more addressing modes. You could argue that even having them with a built-in implied "add p+displ" is corrupt. E.g., we have:

LIR_st  val, base, disp    // disp must point to a const
LIR_sti val, base, #disp   // disp is -128..127

We could add more addressing modes, of course. Or we could eliminate LIR_st and do:

LIR_sti val, base, #disp

and when disp is outside the range expressible in LIR, base points to an add expression. (All other addressing modes are just different ptr expressions.)

Which direction should we go? I tend to lean towards having fewer LIR variants and letting the cpu-specific assembler do pattern matching to fit addressing modes. (The +/- 127 offset of LIR_sti is just a byproduct of the LIR encoding; it doesn't imply anything about what the underlying cpu supports.)

Note that we currently don't have a LIR_ldi that's like LIR_sti, but if we did it would definitely reduce LIR size... loads from small offsets are really common.

Opinions welcome... making LIR nice and clean and small is one problem we're solving, and taking advantage of the cpu's addressing modes is another.
Comment 5•15 years ago

I'm in favor of assembler-specific matching. I was planning on peeking ahead in Nativei386 to see if the value was a LIR_add, so in fact the ADD would already be there on ARM and I wouldn't need to change anything.
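A rough sketch of the kind of peephole described above (every name below is an illustrative stand-in, not the real Nativei386/Assembler interface): if the pointer feeding a load is a single-use add, the backend folds its two operands straight into a [base+index] memory operand, and the add itself is never emitted.

// Sketch only: all types and helpers are stand-ins, not the real NanoJIT API.
struct LIns;
typedef int Register;

bool     isAdd(LIns* ins);                 // node is a LIR_add
bool     hasSingleUse(LIns* ins);          // only this load consumes it
LIns*    addLhs(LIns* ins);
LIns*    addRhs(LIns* ins);
Register regFor(LIns* ins);                // ask the register allocator
void     emitLoadBaseDisp(Register dst, Register base, int disp);
void     emitLoadBaseIndex(Register dst, Register base, Register index);

// Lower a load whose displacement has already been folded to 0: match the
// pointer expression to [base+index] when possible, otherwise fall back to
// the plain [base+disp] form.
void lowerLoad(Register dst, LIns* ptr, int disp)
{
    if (disp == 0 && isAdd(ptr) && hasSingleUse(ptr)) {
        emitLoadBaseIndex(dst, regFor(addLhs(ptr)), regFor(addRhs(ptr)));
        return;                            // the add dies without being emitted
    }
    emitLoadBaseDisp(dst, regFor(ptr), disp);
}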
Comment 6•15 years ago
We can divide and conquer:

In LIR: change LIR_st/ld to be like LIR_sti, i.e. the displacement is always an 8-bit immediate. insLoad() and insStore() helpers can generate add expressions when the immediate doesn't fit or if another addressing mode is needed. This eliminates insStorei() and LIR_sti.

In the Assembler: the assembler should match base pointer expressions to cpu-specific addressing modes.
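A minimal sketch of the LIR-side helper in that plan (the names and writer interface here are assumptions, not the real LirWriter API): keep a single load form whose displacement is always an 8-bit immediate, and have the helper push anything else into an explicit add on the base pointer, where the assembler's pattern matching can pick it back up.

#include <cstdint>

// Sketch only: illustrative stand-ins for the LirWriter interface.
struct LIns;
LIns* insAdd(LIns* a, LIns* b);               // base + offset as a LIR add
bool  isConstInt(LIns* ins, int32_t* out);    // constant displacement?
LIns* insLoad8(LIns* base, int8_t disp);      // the one load form: 8-bit disp

// Accept any displacement, constant or not, and normalize it to the single
// small-immediate load form.
LIns* insLoad(LIns* base, LIns* disp)
{
    int32_t k;
    if (isConstInt(disp, &k) && k >= -128 && k <= 127)
        return insLoad8(base, (int8_t)k);     // fits in the instruction
    // Too big or not constant: fold it into the pointer expression.
    return insLoad8(insAdd(base, disp), 0);
}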
Comment 7 (Reporter)•15 years ago
In hotpath we use only LOAD address and STORE address, value. The rest is done in the backend by pattern matching. Here a small displacement makes sense for space reasons, but the logic to parse/match it should definitely be in the backend (IMHO).
Comment 8 (Reporter)•15 years ago
ARM has immediate offset, register offset and PC-relative addressing, and the barrel shifter can be used in some cases as well, so there is definitely some pattern matching to be done in the ARM backend.
Comment 9•15 years ago

ARM does have [reg+reg*scale]. It's just written differently. E.g. int *a; y=a[x] gets translated to:

ldr r0, [r0, r1, asl #2]

And for Thumb:

lsl r1, r1, #2
ldr r0, [r1, r0]

In my IR I've solved the address generation issue by letting both the xLOAD and the xSTORE ops reference a common xREF op which holds base and index. This way it gets CSEd early on (t[i]=t[i]+1 needs only one xREF). But the assembler may later decide to merge the memory operands and leave the xREF op dead. The decision should be based on the instruction set, whether the CPU's AGU is "free" (no extra uops generated for complex addressing modes), and of course whether the xREF is invariant or not relative to the xLOAD/xSTORE.

For x86, merging the addressing mode can be done for the load/store itself and even for most opcodes referencing a load. This helps to save scratch registers. It's also usually faster (LEA is expensive on some CPUs) and more compact. E.g. instead of:

lea esi, [ebx+edi*8]
movsd xmm6, [esi]
addsd xmm7, xmm6

one could generate this:

movsd xmm6, [ebx+edi*8]
addsd xmm7, xmm6

or even this (if the load has only one use):

addsd xmm7, [ebx+edi*8]
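For comparison with sniffing LIR_add directly, here is a rough sketch of the xREF shape described above (the node layout and names are made up for illustration; this is neither LuaJIT's nor nanojit's real IR): the address computation is its own CSE-able node, loads and stores point at it, and the backend later decides whether to materialize it once or to fold it into each memory operand and leave it dead.

#include <cstdint>

// Sketch only: made-up node layout for illustration.
enum Op { XREF, XLOAD, XSTORE };

struct IRNode {
    Op       op;
    IRNode*  ref;    // XLOAD/XSTORE: the shared XREF; unused for XREF
    IRNode*  a;      // XREF: base       XSTORE: value to store
    IRNode*  b;      // XREF: index      otherwise unused
    int32_t  disp;   // XREF: constant displacement folded in
};

// t[i]=t[i]+1 builds one XREF that both the load and the store share, so CSE
// happens once and a later pass can match memory ops by the identity of their
// ref pointer alone.
bool sameAddress(const IRNode* load, const IRNode* store)
{
    return load->ref == store->ref;   // identical (CSEd) XREF means same cell
}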
Comment 10 (Reporter)•15 years ago
Yep, we do the same thing in hotpath as Mike does. We have LOAD, STORE, and then ADDR to generate addresses. ADDR is hoisted out of the loop (usually), and the register allocator decides whether pressure warrants local materialization or allocating a loop register. In the latter case ADDR goes dead.
Comment 11•15 years ago

Does some pattern matching on loads to take advantage of addressing modes. Doesn't really help performance, but it makes the code a little shorter and easier to read. Example reductions:

add eax, ebx
mov eax, [eax]
-->
mov eax, [eax+ebx]

shl ebx, 2
add eax, ebx
mov eax, [eax]
-->
mov eax, [eax+ebx*4]
Attachment #332433 -
Flags: review?(edwsmith)
Comment 12•15 years ago
Comment on attachment 332433 [details] [diff] [review]: instruction reduction for a common load pattern

Needs rebasing, from what I can tell.
Attachment #332433 -
Flags: review?(edwsmith) → review-
Comment 13•15 years ago

Coming back to this since dmandelin ran into it in his regex patch. Passes TraceMonkey's tests and is about a 15-20ms win on SunSpider for us.
Attachment #332433 -
Attachment is obsolete: true
Attachment #346343 -
Flags: review?(rreitmai)
Updated•15 years ago
Attachment #346343 -
Flags: review?(gal)
Updated (Reporter)•15 years ago
Attachment #346343 -
Flags: review?(gal) → review+
Comment 14•15 years ago

Pre-emptive push since our tree is closing today: http://hg.mozilla.org/tracemonkey/rev/54246bd0b617
Will leave the bug open for rreitmai's review.
Comment 15•15 years ago
Wow, that's a nice win! I'm a little confused by:

if (rL == NULL || rL->reg == UnknownReg)
    rleft = getBaseReg(lhs, d, rmask(rr));

Did you mean for the call to be findSpecificReg()?

Also, would the forms Mike discussed earlier (xREF) have helped in writing this logic? That is, we add a LIR_ref instruction which points to base,disp.
Updated•15 years ago
Attachment #346343 -
Flags: review?(rreitmai) → review+
Comment 16•15 years ago

Thanks, I indeed meant findSpecificRegFor. I like Mike's idea; sniffing LIR_add is pretty similar, as it'll CSE separately and can be left dead by asm_ld. Perhaps a distinction would make things cleaner, though.

I like the idea of applying this optimization to more instructions. For example, code like:

mov eax, [esi+edi*2+8]
add ecx, eax

could be reduced similarly to Mike's example. I'm not sure, but it feels like we'd have to check how many uses (or how far apart the uses) there are of such expressions. If the LIR behind "[esi+edi*2+8]" got used frequently, it'd make more sense to put it in a register instead of recomputing the address each time.
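The use-count check mentioned above could start out as simple as this sketch (useCount() and isAddressExpr() are assumptions, e.g. filled in by a prepass over the LIR; nothing here is the real nanojit interface): fold the address when this memory op is its only consumer, otherwise leave it to be computed once into a register.

// Sketch only: hypothetical helpers, not the real nanojit interface.
struct LIns;
int  useCount(LIns* ins);        // how many instructions read this value
bool isAddressExpr(LIns* ins);   // an add / shift+add shape we can fold

// Fuse the address into the memory operand only when we're its sole user;
// a frequently reused address is cheaper to compute once into a register.
bool shouldFuseAddress(LIns* addr)
{
    return isAddressExpr(addr) && useCount(addr) == 1;
}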
Comment 17•15 years ago
The main advantage of the xREF approach is that it helps alias analysis.
An xLOAD and an xSTORE referencing the same memory cell must have the
same xREF (since it's CSEd). Alias analysis involves a lot of searches
through chains of loads and stores -- you better make that fast. Here the
xREFs really help, since you can identify a matching load/store by the
identity of the reference operand alone.
With the proper data structures in place, basic alias analysis needs
only around 10 lines of code (disambiguation based on const/variable
keys). Extended analysis is well under 100 lines and also does stuff
like disambiguating a[i] <-> a[i+1] and a[i] <-> b[i] (adding dynamic
disambiguation guards).
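A sketch of what that roughly-10-line disambiguation might look like (types and names are made up; it assumes xREF-style references as described above): two references to the same base with different constant keys cannot alias, two with the same constant key must alias, and anything involving a variable key stays a conservative "maybe".

#include <cstdint>

// Sketch only: basic alias check on const/variable keys.
struct Ref {
    const void* base;        // identity of the base object (the CSEd ref)
    bool        keyIsConst;
    int32_t     key;         // valid only when keyIsConst
};

enum Alias { NO_ALIAS, MUST_ALIAS, MAY_ALIAS };

Alias alias(const Ref& a, const Ref& b)
{
    if (a.base != b.base)
        return MAY_ALIAS;                              // bases not proven equal or distinct
    if (a.keyIsConst && b.keyIsConst)
        return a.key == b.key ? MUST_ALIAS : NO_ALIAS; // disambiguated by constant keys
    return MAY_ALIAS;                                  // variable key: give up here
}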
It's important to keep the load/store varieties separate in the IR (e.g.
LuaJIT 2.x has ALOAD, HLOAD, ULOAD, FLOAD etc.). The semantics of the
language dictate that an array slot cannot alias a hash slot and so on.
This is a big win when you add loop-invariant code motion (LICM aka
hoisting). Many hash loads are invariant and this cannot be blocked by
e.g. array stores with variant keys.
You really want to keep as much of this information in the IR, either
explicitly or in a side structure. Otherwise you lose the ability for
easy disambiguation. In the end you get the same mess as in C/C++, where
almost everything can alias everything else. This makes alias analysis
really expensive and it often fails to disambiguate important cases. You
can do much better when the source language operates at a higher level.
Ok, I realize nanojit doesn't do AA or LICM yet. Just some food for
thought, since it may impact your basic design decisions.
About memory operand fusion:
- The AGU is free on all contemporary x86/x64 CPUs. I.e. it's not
cheaper to do mov ecx, [ebp] than mov ecx, [esi+4*edi+8].
- But lea can be expensive, so you really want to fuse the load and the
reference. The AGU bypass to the ALU, found on earlier CPUs, had to go.
E.g. the P4 splits a lea up into a sequence of shift and add uops.
Here are uop count and latency in cycles on a P4:
3/4    lea ebp, [esi+4*edi+8]        2/1   add eax, [esi+4*edi+8]
1/2    mov ecx, [ebp]
1/0.5  add eax, ecx
-----                                ---
5/6.5                           vs.  2/1
Ouch! The Core2 and later restored the fast lea by adding special uops
to the ALU. But a lea still adds 1 extra uop and 1 cycle of latency.
- Fusing loads into x86 ModRM instructions is very beneficial since it
significantly reduces register pressure. In LuaJIT 2.x I'm fusing
pretty much everything (from guards to FP operands) against everything
else that is referencable (FP constants and all kinds of loads, even
loads of internal fields and spill slots). I'm also fusing cumulative
offsets into array indexes. This also reduces register pressure and it
would really fly once I add ABCelim.
- Fusion heuristics are a topic for another message. But I wouldn't worry
about that before you do LICM. My recommendation: just fuse as much as
you can in the beginning. Later make it dependent on register pressure
and the potential spill and restore costs.
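A later, pressure-aware version of that heuristic might look something like this sketch (entirely hypothetical inputs and cutoff; neither JIT necessarily does it this way): fuse by default, and only dedicate a register to the address when one is free without spilling and the address is reused enough to pay for it.

// Sketch only: hypothetical inputs, not a real heuristic from either JIT.
struct FuseCtx {
    bool regFreeWithoutSpill;   // can we pin the address without evicting anything?
    int  uses;                  // memory ops that reference this address
};

bool shouldFuse(const FuseCtx& c)
{
    const int kReuseThreshold = 4;      // arbitrary cutoff, for illustration only
    if (c.regFreeWithoutSpill && c.uses >= kReuseThreshold)
        return false;                   // hold the address in a register instead
    return true;                        // default: fuse into the memory operand
}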
Here are some examples for memory operand fusion in action:
A simple reduction over an array:
local x = 0; for i=1,#t do x = x + t[i] end
x86 machine code for the inner loop, with memory operand fusion:
->LOOP:
cmp esi, edx // Bounds check
ja ->EXIT_2
cmp dword [ecx+esi*8+0x4], -0x0d // Type check, fused load, fused ref
ja ->EXIT_1
addsd xmm7, [ecx+esi*8] // Add, fused load, fused ref
add esi, +0x01
cmp eax, esi
jge ->LOOP
jmp ->EXIT_0
Here's a more involved example from SciMark FFT:
for i=1,2*n-1,2*dual2 do
local j = i+dual2
local ir, ii = v[i], v[i+1]
local jr, ji = v[j], v[j+1]
v[j], v[j+1] = ir - jr, ii - ji
v[i], v[i+1] = ir + jr, ii + ji
end
x86 machine code for the inner loop, with memory operand fusion:
->LOOP:
mov ebx, [esp+0x10] // Restore of bound
mov ebp, [esp+0x14] // Restore of dual2
cmp esi, ebx // Bounds check
ja ->EXIT_9
cmp dword [edx+esi*8+0x4],-0x0d // Type check, fused load, fused ref
ja ->EXIT_8
movsd xmm7, [edx+esi*8] // Load, fused ref
lea edi, [esi+0x1]
cmp edi, ebx // Bounds check
ja ->EXIT_7
cmp dword [edx+edi*8+0x4],-0x0d // Type check, fused load, fused ref
ja ->EXIT_6
movsd xmm6, [edx+edi*8] // Load, fused ref
add ebp, esi
jo ->EXIT_5 // Overflow check
cmp ebp, ebx // Bounds check
ja ->EXIT_4
cmp dword [edx+ebp*8+0x4],-0x0d // Type check, fused load, fused ref
ja ->EXIT_3
movsd xmm4, [edx+ebp*8] // Load, fused ref
lea ebx, [ebp+0x1]
cmp ebx, [esp+0x10] // Bounds check, fused spill slot
ja ->EXIT_2
cmp dword [edx+ebp*8+0xc],-0x0d // Type check, fused load, fused ref
ja ->EXIT_1
movsd xmm5, [edx+ebp*8+0x8] // Load, fused ref
movaps xmm3, xmm7
subsd xmm3, xmm4
movaps xmm2, xmm6
subsd xmm2, xmm5
movsd [edx+ebp*8+0x8], xmm2 // Store, fused ref
movsd [edx+ebp*8], xmm3 // Store, fused ref
addsd xmm7, xmm4
addsd xmm6, xmm5
movsd [edx+edi*8], xmm6 // Store, fused ref
movsd [edx+esi*8], xmm7 // Store, fused ref
add esi, ecx
cmp eax, esi
jge ->LOOP
jmp ->EXIT_0
[ABCelim would avoid all spills and eliminate 4 of the exits.]
Comment 18•15 years ago
Thanks for the detailed comment, Mike. I agree that there is merit in introducing this form. If nothing else we may gain a slight reduction in buffer space (or at the very least probably not grow it) by being able to re-use LIR_ref's. It would also mean that we could remove the special forms of ld/st with constant displacement (or at least replace them with ref/refi).

BTW, we do have some liveness analysis going on, not to mention CSE, so this change might lead to cleaner code in those areas.
Updated•14 years ago
Flags: in-testsuite-
Flags: flashplayer-triage+
Flags: flashplayer-qrb?
Assignee: dvander → nobody
Component: Tracing Virtual Machine → JIT Compiler (NanoJIT)
QA Contact: tracing-vm → nanojit
Attachment #346343 -
Flags: superreview?(edwsmith)
Updated•14 years ago
Attachment #346343 -
Flags: superreview?(edwsmith) → superreview-
Comment 19•14 years ago
Comment on attachment 346343 [details] [diff] [review]: patch v2

Obsolete, I think.
Comment 20 (Reporter)•14 years ago
Did we actually land this?
Comment 21•14 years ago
Not that I'm aware of, at least in TR. And it might be at odds with the newly proposed change to add an immediate displacement to LIR_ld. (It's not strictly at odds, but it pulls in a different direction.)
Comment 22•14 years ago

This did land and it's in our branch still AFAIK.
Updated (Reporter)•14 years ago
Status: NEW → RESOLVED
Closed: 14 years ago
Resolution: --- → FIXED
Updated•14 years ago
Status: RESOLVED → VERIFIED