444682 - Add support for non-constant displacements for insLoad (insStore already supports those).

Reporter

Description

•

17 years ago

insLoad only supports constant displacements at the moment. It really should support non-constant displacements for array loads.

Edwin Smith

Comment 1

•

17 years ago

after double checking the code, both LIR_ld* and LIR_st* only support constant displacements. insStore() allows a LIns* displacement but later, Assembler requires the displacement to be const. so both should be extended. the insLoad() pipeline function already supports LIns* displacements, so the various filters including Assembler need to be compelled not to assume the displ is const.

Andreas Gal :gal

Reporter

Comment 2

•

17 years ago

David is on in (for both, load & store), but x86 only. Maybe someone from Adobe can help out with ARM :)

Assignee: nobody → danderson

David Anderson [:dvander] - inactive, e-mail if emergency

Comment 3

•

17 years ago

AFAIK ARM doesn't have a [reg1+reg2] addressing mode, so I think there I'll just have to emit a separate ADD. Maybe someone can correct me on this since my knowledge of the platform is NULL.

Edwin Smith

Comment 4

•

17 years ago

LIR already can do any addressing mode, you just have an expression for the pointer and use displ=0. adding more insLoad() and insStore() helpers is fine. at the core of this is whether we really want more LIR_ld/st primitives that have more addressing modes. you could argue that even having them with a builtin implied "add p+displ" is corrupt. eg: we have LIR_st val,base,disp // disp must point to a const LIT_sti val,base,#disp // disp is -128..127 we could add more addressing modes, of course. or we could eliminate LIR_st, and do: LIR_sti val,base,#disp and when disp is outside the range expressable in LIR, base points to an add expression. (all other addressing modes are just different ptr expressions). which direction should we go? I tend to lean towards having fewer LIR variants and let the cpu-specific assembler do pattern matching to fit addressing modes. (the +/- 127 offset of LIR_sti is just a byproduct of the LIR encoding, it doesn't imply anything about what the underlying cpu supports) note that we currently don't have LIR_ldi that's like LIR_sti, but if we did it would definitely reduce LIR size... loads from small offsets are really common. opinions welcome... making LIR nice and clean and small is one problem were solving, and taking advantage of the cpu's addressing modes is another.

David Anderson [:dvander] - inactive, e-mail if emergency

Comment 5

•

17 years ago

I'm favor of assembler specific matching. I was planning on peeking ahead in Nativei386 to see if the value was a LIR_add, so in fact the ADD would already be there on ARM and I wouldn't need to change anything.

Edwin Smith

Comment 6

•

17 years ago

we can divide and conquor: in LIR: change LIR_st/ld to be like LIR_sti, ie the displacement is always an 8bit immediate. insLoad() and insStore() helpers can generate add expressions when the immedate doesnt fit or if another addr mode is needed. This eliminates insStorei() and LIR_sti. in Assembler assembler should match base pointer expressions to cpu specific addressing modes.

Andreas Gal :gal

Reporter

Comment 7

•

17 years ago

In hotpath we use only LOAD address and STORE address, value. The rest is done in the backend by pattern matching. Here for space reasons a small displacement makes sense, but the logic to parse/match it should definitively be in the backend (IMHO).

Andreas Gal :gal

Reporter

Comment 8

•

17 years ago

ARM has immediate offset, register offset and PC-relative, and the barrel shifter can be used in some cases as well, so there is definitively some pattern matching to be done in the ARM backend.

Mike Pall

Comment 9

•

17 years ago

ARM does have [reg+reg*scale]. It's just written differently. E.g. int *a; y=a[x] gets translated to: ldr r0, [r0, r1, asl #2] And for Thumb: lsl r1, r1, #2; ldr r0, [r1, r0] In my IR I've solved the address generation issue by letting both the xLOAD and the xSTORE ops reference a common xREF op which holds base and index. This way this gets CSEd early on (t[i]=t[i]+1 needs only one xREF). But the assembler may later decide to merge the memory operands and leave the xREF op dead. The decision should be based on the instruction set, whether the CPU AGU is "free" (no extra uops generated for complex addressing modes), and of course whether the xREF is invariant or not relative to the xLOAD/xSTORE. For x86 merging the addressing mode can be done for the load/store itself and even for most opcodes referencing a load. This helps to save scratch registers. It's also usually faster (LEA is expensive on some CPUs) and more compact. E.g. instead of: lea esi, [ebx+edi*8] movsd xmm6, [esi] addsd xmm7, xmm6 one could generate this: movsd xmm6, [ebx+edi*8] addsd xmm7, xmm6 or even this (if the load has only one use): addsd xmm7, [ebx+edi*8]

Andreas Gal :gal

Reporter

Comment 10

•

17 years ago

Yep, we do the same thing in hotpath as mike does. We have LOAD STORE and then ADDR to generate addresses. ADDR is hoisted out of the loop (usually), and the register allocator decides whether pressure warrants local materialization or allocate a loop register. In the latter case ADDR goes dead.

David Anderson [:dvander] - inactive, e-mail if emergency

Comment 11

•

17 years ago

Attached patch instruction reduction for a common load pattern (obsolete) — Details — Splinter Review

Does some pattern matching on loads to take advantage of addressing modes. Doesn't really help performance but it makes the code a little shorter and easier to read. Example reductions: add eax, ebx mov eax, [eax] --> mov eax, [eax+ebx] shl ebx, 2 add eax, ebx mov eax, [eax] --> mov eax, [eax+ebx*4]

Attachment #332433 - Flags: review?(edwsmith)

Edwin Smith

Comment 12

•

17 years ago

Comment on attachment 332433 [details] [diff] [review] instruction reduction for a common load pattern needs rebasing, from what i can tell

Attachment #332433 - Flags: review?(edwsmith) → review-

David Mandelin [:dmandelin]

Updated

•

17 years ago

Blocks: 461050

David Anderson [:dvander] - inactive, e-mail if emergency

Comment 13

•

17 years ago

Attached patch patch v2 — Details — Splinter Review

Coming back to this since dmandelin ran into it in his regex patch. Passes TraceMonkey's tests, is about a 15-20ms win on SunSpider for us.

Attachment #332433 - Attachment is obsolete: true

Attachment #346343 - Flags: review?(rreitmai)

David Anderson [:dvander] - inactive, e-mail if emergency

Updated

•

17 years ago

Attachment #346343 - Flags: review?(gal)

Andreas Gal :gal

Reporter

Updated

•

17 years ago

Attachment #346343 - Flags: review?(gal) → review+

David Anderson [:dvander] - inactive, e-mail if emergency

Comment 14

•

17 years ago

Pre-emptive push since our tree is closing today: http://hg.mozilla.org/tracemonkey/rev/54246bd0b617 Will leave the bug open for rreitmai's review

Rick Reitmaier

Comment 15

•

17 years ago

Wow that's a nice win! I'm a little confused by : if (rL == NULL || rL->reg == UnknownReg) rleft = getBaseReg(lhs, d, rmask(rr)); did you mean for the call to be findSpecificReg() ? Also, would have the forms Mike discussed earlier (xREF) have helped in writing this logic? That is, we add a LIR_ref instruction which points to base,disp.

Rick Reitmaier

Updated

•

17 years ago

Attachment #346343 - Flags: review?(rreitmai) → review+

David Anderson [:dvander] - inactive, e-mail if emergency

Comment 16

•

17 years ago

Thanks, I indeed meant findSpecificRegFor. I like Mike's idea, sniffing LIR_add is pretty similar as it'll CSE separately and can be left dead by asm_ld. Perhaps a distinction would make things cleaner though. I like the idea of applying this optimization to more instructions, for example code like: mov eax, [esi+edi*2+8] add ecx, eax Could be reduced similar to Mike's example. I'm not sure but it feels like we'd have to check how many uses (or how far apart uses) there are of such expressions. If the LIR behind "[esi+edi*2+8]" got used frequently, it'd make more sense to put it in a register instead of addressing it each time.

Mike Pall

Comment 17

•

17 years ago

The main advantage of the xREF approach is that it helps alias analysis. An xLOAD and an xSTORE referencing the same memory cell must have the same xREF (since it's CSEd). Alias analysis involves a lot of searches through chains of loads and stores -- you better make that fast. Here the xREFs really help, since you can identify a matching load/store by the identity of the reference operand alone. With the proper data structures in place, basic alias analysis needs only around 10 lines of code (disambiguation based on const/variable keys). Extended analysis is well under 100 lines and also does stuff like disambiguating a[i] <-> a[i+1] and a[i] <-> b[i] (adding dynamic disambiguation guards). It's important to keep the load/store varieties separate in the IR (e.g. LuaJIT 2.x has ALOAD, HLOAD, ULOAD, FLOAD etc.). The semantics of the language dictate that an array slot cannot alias a hash slot and so on. This is a big win when you add loop-invariant code motion (LICM aka hoisting). Many hash loads are invariant and this cannot be blocked by e.g. array stores with variant keys. You really want to keep as much of this information in the IR, either explicitly or in a side structure. Otherwise you loose the ability for easy disambiguation. In the end you get the same mess as in C/C++, where almost everything can alias everything else. This makes alias analysis really expensive and it often fails to disambiguate important cases. You can do much better when the source language operates at a higher level. Ok, I realize nanojit doesn't do AA or LICM yet. Just some food for thought, since it may impact your basic design decisions. About memory operand fusion: - The AGU is free on all contemporary x86/x64 CPUs. I.e. it's not cheaper to do mov ecx, [ebp] than mov ecx, [esi+4*edi+8]. - But lea can be expensive, so you really want to fuse the load and the reference. The AGU bypass to the ALU, found on earlier CPUs, had to go. E.g. the P4 splits a lea up into a sequence of shift and add uops. Here are uop count and latency in cycles on a P4: 3/4 lea ebp, [esi+4*edi+8] 2/1 add eax, [esi+4*edi+8] 1/2 mov ecx, [ebp] 1/0.5 add eax, ecx ----- --- 5/6.5 vs. 2/1 Ouch! The Core2 and later restored the fast lea by adding special uops to the ALU. But a lea still adds 1 extra uop and 1 cycle of latency. - Fusing loads into x86 ModRM instructions is very beneficial since it significantly reduces register pressure. In LuaJIT 2.x I'm fusing pretty much everything (from guards to FP operands) against everything else that is referencable (FP constants and all kinds of loads, even loads of internal fields and spill slots). I'm also fusing cumulative offsets into array indexes. This also reduces register pressure and it would really fly once I add ABCelim. - Fusion heuristics are a topic for another message. But I wouldn't worry about that before you do LICM. My recommendation: just fuse as much as you can in the beginning. Later make it dependent on register pressure and the potential spill and restore costs. Here are some examples for memory operand fusion in action: A simple reduction over an array: local x = 0; for i=1,#t do x = x + t[i] end x86 machine code for the inner loop, with memory operand fusion: ->LOOP: cmp esi, edx // Bounds check ja ->EXIT_2 cmp dword [ecx+esi*8+0x4], -0x0d // Type check, fused load, fused ref ja ->EXIT_1 addsd xmm7, [ecx+esi*8] // Add, fused load, fused ref add esi, +0x01 cmp eax, esi jge ->LOOP jmp ->EXIT_0 Here's a more involved example from SciMark FFT: for i=1,2*n-1,2*dual2 do local j = i+dual2 local ir, ii = v[i], v[i+1] local jr, ji = v[j], v[j+1] v[j], v[j+1] = ir - jr, ii - ji v[i], v[i+1] = ir + jr, ii + ji end x86 machine code for the inner loop, with memory operand fusion: ->LOOP: mov ebx, [esp+0x10] // Restore of bound mov ebp, [esp+0x14] // Restore of dual2 cmp esi, ebx // Bounds check ja ->EXIT_9 cmp dword [edx+esi*8+0x4],-0x0d // Type check, fused load, fused ref ja ->EXIT_8 movsd xmm7, [edx+esi*8] // Load, fused ref lea edi, [esi+0x1] cmp edi, ebx // Bounds check ja ->EXIT_7 cmp dword [edx+edi*8+0x4],-0x0d // Type check, fused load, fused ref ja ->EXIT_6 movsd xmm6, [edx+edi*8] // Load, fused ref add ebp, esi jo ->EXIT_5 // Overflow check cmp ebp, ebx // Bounds check ja ->EXIT_4 cmp dword [edx+ebp*8+0x4],-0x0d // Type check, fused load, fused ref ja ->EXIT_3 movsd xmm4, [edx+ebp*8] // Load, fused ref lea ebx, [ebp+0x1] cmp ebx, [esp+0x10] // Bounds check, fused spill slot ja ->EXIT_2 cmp dword [edx+ebp*8+0xc],-0x0d // Type check, fused load, fused ref ja ->EXIT_1 movsd xmm5, [edx+ebp*8+0x8] // Load, fused ref movaps xmm3, xmm7 subsd xmm3, xmm4 movaps xmm2, xmm6 subsd xmm2, xmm5 movsd [edx+ebp*8+0x8], xmm2 // Store, fused ref movsd [edx+ebp*8], xmm3 // Store, fused ref addsd xmm7, xmm4 addsd xmm6, xmm5 movsd [edx+edi*8], xmm6 // Store, fused ref movsd [edx+esi*8], xmm7 // Store, fused ref add esi, ecx cmp eax, esi jge ->LOOP jmp ->EXIT_0 [ABCelim would avoid all spills and eliminate 4 of the exits.]

Rick Reitmaier

Comment 18

•

17 years ago

Thanks for the detailed comment Mike and I agree that there is merit in introducing this form. If nothing else we may gain a slight reduction in buffer space (or at the very least probably not grow it) by being able to re-use LIR_ref's. It would also mean that we could remove the special forms (or at replace them with ref/refi) of ld/st with constant displacement. BTW, we do have some liveness analysis going on, not to mention cse. So this change might lead to cleaner code in those areas.

Chris Peyer

Updated

•

17 years ago

Flags: in-testsuite-

Flags: flashplayer-triage+

Flags: flashplayer-qrb?

Dan Smith

Updated

•

17 years ago

Assignee: dvander → nobody

Component: Tracing Virtual Machine → JIT Compiler (NanoJIT)

QA Contact: tracing-vm → nanojit

Dan Smith

Updated

•

16 years ago

Attachment #346343 - Flags: superreview?(edwsmith)

Edwin Smith

Updated

•

16 years ago

Attachment #346343 - Flags: superreview?(edwsmith) → superreview-

Edwin Smith

Comment 19

•

16 years ago

Comment on attachment 346343 [details] [diff] [review] patch v2 obsolete, I think.

Andreas Gal :gal

Reporter

Comment 20

•

16 years ago

Did we actually land this?

Edwin Smith

Comment 21

•

16 years ago

not that i'm aware of, at least in TR. and it might be at odds with the newly proposed change to add an immediate displacement to LIR_ld. (its not strictly at odds, but it pulls in a different direction)

David Anderson [:dvander] - inactive, e-mail if emergency

Comment 22

•

16 years ago

This did land and it's in our branch still AFAIK.

Andreas Gal :gal

Reporter

Updated

•

16 years ago

Status: NEW → RESOLVED

Closed: 16 years ago

Resolution: --- → FIXED

Chris Peyer

Updated

•

16 years ago

Status: RESOLVED → VERIFIED

Chris Peyer

Comment 23

•

16 years ago

removing QRB request, bug verified

Flags: flashplayer-qrb?

instruction reduction for a common load pattern 17 years ago David Anderson [:dvander] - inactive, e-mail if emergency 6.44 KB, patch	edwsmith : review-	Details \| Diff \| Splinter Review
patch v2 17 years ago David Anderson [:dvander] - inactive, e-mail if emergency 6.57 KB, patch	rreitmai : review+ gal : review+ edwsmith : superreview-	Details \| Diff \| Splinter Review