Closed Bug 475998 Opened 16 years ago Closed 13 years ago

TM: strange performance behavior, possibly related to stack alignment

Categories: Core :: JavaScript Engine, defect, P2
Platform: x86, macOS
Status: RESOLVED WONTFIX
People: Reporter: gal; Assignee: Unassigned
Attachments: 4 files

No description provided.
Without PERF_BUG set: 1131 1132 1128
With:                 1122 1118 1115
The behavior was observed relative to http://hg.mozilla.org/tracemonkey/rev/76a7344912ae
This needs investigation; maybe it won't block, but it is getting in the way of landing fixes. /be
Flags: blocking1.9.1?
On my machine, I didn't see any unexpected difference in the overall number. It was about .5% slower with the additional check. It could be compiler dependent. Note that I've never tried to track down such a small perf difference before. If it shows up consistently with a larger magnitude in one benchmark, we could reasonably find it. Otherwise, I'm not so sure.
OK, I've solidly reproduced it with bitops-bitwise-and.js. I can get 3 different regression values by bumping the value of the variable 'global' in js_ExecuteTree by multiples of 16. The effect seems to be periodic mod 64: if we start at a value that gives the best perf, we see:

bump by  0: baseline perf
bump by 16: slowdown by ~10%
bump by 32: baseline perf
bump by 48: slowdown by 20-30%
bump by 64: baseline perf
...

I do this by changing the line that defines 'global' to something like:

double* global = (double*)alloca((globalFrameSize+1+8) * sizeof(double))+2;

varying the 2 at the end to change the bump.

I have proved that the slowdown is all in the trace. The trace is very short. It is:

0x243f80: nop
0x243f81: nopw   0x0(%eax,%eax,1)
0x243f8a: push   %edi
0x243f8b: push   %esi
0x243f8c: push   %ebx
0x243f8d: push   %ebp
0x243f8e: push   %ebp
0x243f8f: mov    %esp,%ebp
0x243f91: sub    $0x8,%esp
0x243f94: mov    (%ecx),%eax
0x243f96: mov    0xc(%ecx),%edi
0x243f99: mov    0x8(%ecx),%esi
0x243f9c: mov    (%edi),%ebx
0x243f9e: sub    $0x1000,%ebx
0x243fa4: mov    %ebx,(%edi)
0x243fa6: cmp    $0x0,%ebx
0x243fa9: jle    0x26ffdc
0x23dfaf: mov    0x268(%esi),%edi
0x23dfb5: mov    0x260(%esi),%ebx
0x23dfbb: and    %ebx,%edi
0x23dfbd: mov    %edi,0x268(%esi)
0x23dfc3: add    $0x1,%ebx
0x23dfc6: jo     0x269fe8
0x243fb5: mov    0x260(%esi),%ebx
0x243fbb: and    %ebx,%edi
0x243fbd: mov    %edi,0x268(%esi)
0x243fc3: add    $0x1,%ebx
0x243fc6: jo     0x26ffe8
0x243fcc: mov    %ebx,0x260(%esi)
0x243fd2: mov    %ebx,(%eax)
0x243fd4: movl   $0x23c34600,0x8(%eax)
0x243fdb: cmp    $0x23c34600,%ebx
0x243fe1: jge    0x26fff4
0x243fe7: mov    %edi,0x268(%esi)
0x243fed: mov    %ebx,0x260(%esi)
0x243ff3: jmp    0x243f94
0x243ff8: mov    %ebp,%esp
0x243ffa: pop    %ebp
0x243ffb: pop    %ebp
0x243ffc: pop    %ebx
0x243ffd: pop    %esi
0x243ffe: pop    %edi
0x243fff: ret

esi is 'global'. I nop'd out the first load through esi and the slowdown effect seemed to decrease.

On one of the slowed-down runs, the dynamic memory access history for one iteration is:

ld 0xbffff080
ld 0xbffff08c
ld 0xbffff088
ld 0x0030b130
st 0x0030b130
ld 0xbfffcd28 (through esi/global)
ld 0xbfffcd20 (through esi/global)
st 0xbfffcd28 (through esi/global)
st 0xbfffcd20 (through esi/global)
st 0xbfffcf80
ld 0xbfffcf88
st 0xbfffcd28 (through esi/global)
st 0xbfffcd20 (through esi/global)

In a faster run, this is identical except that the global addresses end in cd18 and cd10. This looks a lot like a cache effect, but toward the beginning of this I did some Shark runs with a few basic cache-related HPCs and didn't find anything. The only thing I've learned that way so far is that the total run-time difference is about equal to 1 cycle per run through this trace and is accounted for exactly by a difference in the number of stall cycles. The extra stall cycles are about half LD_ST and about half reservation-stations-full.
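For anyone trying to reproduce the mod-64 periodicity outside the engine, here is a minimal stand-alone sketch (not TraceMonkey code; globalFrameSize is a made-up stand-in for the real value in js_ExecuteTree) that shows how varying the trailing bump shifts the address of 'global' modulo 64, in the same 16-byte steps described above:

#include <alloca.h>
#include <cstdint>
#include <cstdio>

// Mirrors the 'global' definition quoted above, with the trailing constant
// turned into a parameter. The pointer is only inspected, never dereferenced.
static void probe(int bump) {
    const int globalFrameSize = 16;   // hypothetical size, for illustration only
    double* global = (double*)alloca((globalFrameSize + 1 + 8) * sizeof(double)) + bump;
    printf("bump %d doubles (%2d bytes): global mod 64 = %2u\n",
           bump, bump * 8, (unsigned)((uintptr_t)global % 64));
}

int main() {
    // Each bump of 2 doubles moves 'global' by 16 bytes, so four such bumps
    // cycle through one full 64-byte period.
    for (int bump = 0; bump <= 8; bump += 2)
        probe(bump);
    return 0;
}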
Note: we store loop-carrying values at the loop edge and re-fetch them, so we have a fair number of store-load dependencies along that loop edge. Not sure whether this somehow plays into this.
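As a toy illustration of that kind of loop-edge dependency (ordinary C++, not trace code), the loop-carried value below is written back to memory at the bottom of each iteration and re-read at the top, so every iteration's load waits on the previous iteration's store:

#include <cstdint>

// The volatile pointer stands in for the memory the trace writes at the loop
// edge and re-fetches at the loop header.
uint32_t spin(volatile uint32_t* slot, uint32_t limit) {
    while (*slot < limit) {   // re-fetch the loop-carried value
        uint32_t v = *slot;
        v += 1;
        *slot = v;            // store it back at the loop edge
    }
    return *slot;
}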
(In reply to comment #6)
> The trace is very short. It is:
> 0x243f80: nop
> 0x243f81: nopw   0x0(%eax,%eax,1)
> 0x243f8a: push   %edi

David, would you post the generated code for the same trace when slow and when fast, exactly as you've posted above? The offsets are important. Thanks,
- moh
Moh, the code is exactly the same (same code, same offsets). The only thing that changes is the value of esp. Andreas
0x23dfaf: mov    0x268(%esi),%edi
0x23dfb5: mov    0x260(%esi),%ebx

The two destinations will end up in two different cache lines if esi = 64N-8. However, this cache-line miss seems to be a one-time event per trace call, and the impact is not expected to be high; I suppose the loop back edge is taken frequently enough. Is it possible to get the dynamic memory access history of a few consecutive iterations within the trace?
This part of the code is executed repeatedly. ECX is stable. All memory loads in each iteration should be from the same address. None of the branches is taken. We keep looping along the unconditional jmp at the end.

0x243f94: mov    (%ecx),%eax
0x243f96: mov    0xc(%ecx),%edi
0x243f99: mov    0x8(%ecx),%esi
0x243f9c: mov    (%edi),%ebx
0x243f9e: sub    $0x1000,%ebx
0x243fa4: mov    %ebx,(%edi)
0x243fa6: cmp    $0x0,%ebx
0x243fa9: jle    0x26ffdc
0x23dfaf: mov    0x268(%esi),%edi
0x23dfb5: mov    0x260(%esi),%ebx
0x23dfbb: and    %ebx,%edi
0x23dfbd: mov    %edi,0x268(%esi)
0x23dfc3: add    $0x1,%ebx
0x23dfc6: jo     0x269fe8
0x243fb5: mov    0x260(%esi),%ebx
0x243fbb: and    %ebx,%edi
0x243fbd: mov    %edi,0x268(%esi)
0x243fc3: add    $0x1,%ebx
0x243fc6: jo     0x26ffe8
0x243fcc: mov    %ebx,0x260(%esi)
0x243fd2: mov    %ebx,(%eax)
0x243fd4: movl   $0x23c34600,0x8(%eax)
0x243fdb: cmp    $0x23c34600,%ebx
0x243fe1: jge    0x26fff4
0x243fe7: mov    %edi,0x268(%esi)
0x243fed: mov    %ebx,0x260(%esi)
0x243ff3: jmp    0x243f94
Why is there a gap in the addresses of these two consecutive instructions?

0x23dfc6: jo     0x269fe8
????
0x243fb5: mov    0x260(%esi),%ebx
Damn, I didn't even notice that. The code is crossing a code cache page boundary. A jmp is inserted automatically to transition to the other code page. So every iteration we take an unconditional branch in the middle of the code above. Note that the jump always occurs, in the slow case as well as the fast case.
There's a discontinuity here. The stream 0x243fa6, 0x243fa9, 0x243fb5 ... seems fine. Where is the page crossing? And are the unconditional jumps that handle the page crossing not shown here?

0x243fa6: cmp    $0x0,%ebx
0x243fa9: jle    0x26ffdc
0x23dfaf: mov    0x268(%esi),%edi
....
0x23dfc6: jo     0x269fe8
0x243fb5: mov    0x260(%esi),%ebx
0x243fbb: and    %ebx,%edi
There should be at most one page crossing. David, could you disassemble with TRACEMONKEY=verbose? That shows what code is what a bit better.
The generated native code begins on line 116: 0x255f80 [prologue]. I cut it at 3000 lines because it was over the length limit. The cut portion is about 6000 lines of the same kind of enter-trace spew as you see at the end of what I post here.
I tried to time individual trace runs, but I couldn't reproduce the behavior that way. I just found that if I patch all the loads through esi to use different offsets (and move around data before and after the call so the loads access the correct data), I can reproduce the full perf regression behavior. I can even move the 'global' pointer (esi) on trace to be a global array (so it occupies a completely different part of the address space), and the perf regression is still there (the offsets needed are different, but the general 64-period behavior remains).
I just did some HPC work with 2 configurations where the slow one is 50% slower, ~900M cycles instead of ~600M. I found that the slow config had 46M more LOAD_BLOCK.STA events. Could this be responsible for the difference?
Does it point you to the load instruction which is blocked by an ambiguous store?
(In reply to comment #19)
> Does it point you to the load instruction which is blocked by an ambiguous
> store?

Shark is supposed to be able to, but I can't get it to work, and I still don't have my Windows box. I might be able to find it indirectly.
By the way, the address gap is because I put together the trace you saw by copy-pasting two separate sets of output. Just pretend all the high bits are the same.
I don't see a negative perf impact from this patch, so if it makes this issue go away we might want to take it.
==> mov 0(edi),ebx
    cmp ebx,0
    jle 0x2b7fdc
??? mov edi,664(esi)
??? mov ebx,656(esi)

The first store can cause stalls on the marked loads. If accessing the global area through absolute addresses is feasible, it should help the HW disambiguator and avoid the stalls. On the other hand, the instructions might get longer.
It is still not clear why the processor's address disambiguator, which is equipped with address prediction, should not be able to handle the case. The target addresses of all stores are invariant within the loop (although they are calculated within the loop, their values are the same across iterations, so the predictor should catch them, except possibly the first time). These are all the stores:

mov 0(edi),ebx
mov 664(esi),edi
mov 656(esi),ebx
mov 0(eax),ebx
mov 8(eax),600000000
mov 664(esi),edi
mov 656(esi),ebx

Two other events of interest here are:

o MEMORY_DISAMBIGUATION.RESET (excessive fails? 20-cycle penalty)
o MEMORY_DISAMBIGUATION.SUCCESS

Measuring those would help us better understand the problem. It also shows the opportunity for loop-invariant code motion by the JIT, possibly through a post pass.
(In reply to comment #25)
> It is still not clear why the processor's address disambiguator, which is
> equipped with address prediction, should not be able to handle the case.

I agree. Interestingly, with Andreas's patch to access global area memory through absolute addresses, I got the 'slow' behavior (+300M cycles) even though there were fewer LOAD_BLOCKs (+14M vs. the fast case, i.e., 40M fewer than a 'slow' run without Andreas's patch).

I did measure the other two events. MEMORY_DISAMBIGUATION.RESET came to about 400 in either configuration. In the fast case, I got about 20M more MEMORY_DISAMBIGUATION.SUCCESS events -- I think it was about 20M for a 'slow' case and 40M for a 'fast' one.

Removing this block from the beginning part (which contains a load Moh mentioned as suspect) had an interesting effect:

mov ebx,0(edi)
sub ebx,4096
mov 0(edi),ebx
cmp ebx,0
jle 0x2bbfdc

Without this piece, I don't get the 'fastest' speed any more, nor the 'slowest'. Instead, I get 2 speeds that are intermediate. I looked at the lowest one, and the number of LOAD_BLOCK.STA events was low. Almost all the resource stall cycles seem to be LD_ST.

I'm wondering if maybe more than one thing is going on here. Could it be that there is a bit of a problem with the one store causing LOAD_BLOCKs, but also that there are just too many loads and stores in this code, creating a long critical path?
David,

Would it be possible to try hoisting the following loop-invariant instructions outside the loop? I.e., the loop-carrying values Andreas mentioned in Comment #7.

mov eax,0(ecx)
mov edi,12(ecx)
mov esi,8(ecx)

It might be that some hashing/conflicting of the virtual addresses plays a role in the processor's disambiguator/predictor.
(In reply to comment #27)
> Would it be possible to try hoisting the following loop-invariant
> instructions outside the loop? I.e., the loop-carrying values Andreas
> mentioned in Comment #7.

There isn't a consistent effect. I think the exact positions that cause the regression move around slightly (as with many minor changes), but the issue is still there in general. The LOAD_BLOCK count goes up to 100M: maybe some other load is getting blocked, and this exposes it more to nearby stores?
I tried skipping the next two loads, the ones through global/esi, and that gave an immediate perf improvement (+20% vs. the best time I had previously observed) and consistent perf for all positionings of global. Apparently storing those two addresses just before the backward jmp and then reloading them at the top is bad. I tried deleting the two stores at the end after that, but that had no effect, which makes sense as those stores are never read.
To make the case easier to follow:

         nop1
         nop9
         push edi
         push esi
         push ebx
         push ebp
         push ebp
         mov  ebp,esp
         sub  esp,8
100 +--> mov  eax,0[ecx]
101 |    mov  edi,12[ecx]
102 |    mov  esi,8[ecx]
103 |    mov  ebx,0[edi]
104 |    sub  ebx,4096
105 |    mov  0[edi],ebx
106 |    cmp  ebx,0
107 |    jle  -->
108 |
109 |    mov  edi,664[esi]
110 |    mov  ebx,656[esi]
111 |    and  edi,ebx
112 |    mov  664[esi],edi
113 |    add  ebx,1
114 |    jo   -->
115 |
116 |    mov  656[esi],ebx
117 |    mov  0[eax],ebx
118 |    mov  8[eax],600000000
119 |    cmp  ebx,600000000
120 |    jnl  -->
121 |
122 |    mov  664[esi],edi
123 |    mov  656[esi],ebx
124 |
125 +--- jmp  0x0
         mov  esp,ebp
         pop  ebp
         pop  ebp
         pop  ebx
         pop  esi
         pop  edi
         ret
I believe I've isolated the case to a stand-alone test. Will update you.
I have attached a patch that allocates InterpState and global slots using malloc and eliminates ->gp altogether. David, could you give this a spin?
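For context, a minimal sketch of the idea only (hypothetical names and layout, not the attached patch): put InterpState and the global slots in a single heap block, so the globals no longer live in an alloca'd stack buffer whose alignment depends on the interpreter's stack depth, and no separate gp pointer is needed to reach them.

#include <cstdlib>

struct InterpStateSketch {
    // ... the real InterpState fields would live here ...
    double globals[1];   // global slots appended to the same allocation
};

static InterpStateSketch* allocInterpState(size_t globalFrameSize) {
    size_t bytes = sizeof(InterpStateSketch) + globalFrameSize * sizeof(double);
    return static_cast<InterpStateSketch*>(calloc(1, bytes));
}

static void freeInterpState(InterpStateSketch* state) {
    free(state);
}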
(In reply to comment #33)
> I have attached a patch that allocates InterpState and global slots using
> malloc and eliminates ->gp altogether. David, could you give this a spin?

I get a slower run time (almost as slow as the slowest I've seen in my tests of this issue) on the small test. On SS as a whole, things are about the same.
I'm still working on this issue. My isolated case shows the problem only on Core-2 Duo. The problem does not exist on Pentium-4 or Nehalem and does not seem to be the "4K aliasing" problem. Will let you know when it is fully root caused.
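For reference on the term (general background, not taken from this bug): "4K aliasing" refers to a false store-to-load dependency that some cores flag when a load and an earlier in-flight store agree in the low 12 address bits, i.e. the two accesses sit at the same offset within different 4 KiB pages. A quick way to check a pair of addresses:

#include <cstdint>

// True when the load and store addresses differ but share their low 12 bits,
// the condition under which the 4K-aliasing false dependency can trigger.
static bool may4kAlias(uintptr_t loadAddr, uintptr_t storeAddr) {
    return loadAddr != storeAddr && ((loadAddr ^ storeAddr) & 0xFFFu) == 0;
}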
Flags: wanted1.9.1+
Flags: blocking1.9.1?
Flags: blocking1.9.1-
Priority: -- → P2
The problem is largely due to the two stores labeled 122 and 123 in Comment #30. David also noticed this in Comment #29. These stores may stall the execution of the next iteration. Both of these stores are unnecessary and can be eliminated through a redundant-store elimination phase. Store 122 is a redundant copy of 112, and 123 accomplishes what 116 has already done. Note that esi holds the global pointer. This, I believe, makes it possible to infer that these two stores do not conflict with the stores labeled 117 and 118. On an isolated test case based on the above sequence, I get a 30% gain from elimination of these two stores. This seems to be a recurring pattern, and in short traces it can have a significant impact.

Here's a sequence generated for check-access-nbody that shows the same pattern:

+--> mov  -4(ebp),ecx
|    mov  eax,ecx
|    mov  esi,0(eax)
|    mov  eax,-48(esi)
|    mov  edi,-40(esi)
|    mov  ebx,-24(esi)
|    mov  edx,0(esi)
|    mov  ecx,8(esi)
|    cmp  edx,ecx
|    mov  ecx,-4(ebp)
|    jl   -->
|
|    add  edi,1
|    jo   -->
|
|    mov  -40(esi),edi
|    mov  0(esi),edi
|    mov  8(esi),eax
|    cmp  edi,eax
|    jnl  -->
|
|    mov  -48(esi),eax
|    mov  -40(esi),edi
|    mov  -24(esi),ebx
|    mov  esp,ebp
+--- jmp

All three stores at the end of the above loop (offsets -48, -40, -24) can be eliminated. When all stores in the loop are based on the same register (such as above, where all stores are based on esi, and esi is loop invariant), a simple disambiguation based on the offsets would enable the required redundant-store elimination. This may also be possible at the time these stores are inserted - a backward analysis is easy in this case. In this particular case, some of the loads from these addresses can also be moved outside the loop, but general loop-invariant code motion can cause register pressure and spilling, and would need rematerialization. But redundant-store elimination is easier and should always be helpful when it applies. I checked with Ed, and Tamarin has a filter for deadvars elimination.
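To make the proposed transformation concrete, here is an illustrative sketch only (it is neither Tamarin's deadvars filter nor nanojit code; Ins, markRedundantGlobalStores, and the instruction classification are made up): a single forward pass over one loop body, with stores modeled as base register plus offset, that drops a store when it rewrites an esi-relative slot with a register whose value has not changed since the previous store to that slot.

#include <cstdint>
#include <map>
#include <string>
#include <vector>

struct Ins {
    // StoreToGlobal: "mov off[esi],reg"; DefReg: anything that rewrites 'reg'
    // (ALU ops and loads alike); OtherStore: stores through another base
    // register such as eax; Other: compares, branches, etc.
    enum Kind { StoreToGlobal, DefReg, OtherStore, Other } kind;
    std::string reg;        // register stored or redefined, e.g. "edi"
    int32_t     offset;     // esi-relative offset for StoreToGlobal, e.g. 664
    bool        redundant = false;
};

static void markRedundantGlobalStores(std::vector<Ins>& body) {
    // slot offset -> register whose current value is already stored there
    std::map<int32_t, std::string> lastWriter;
    for (Ins& ins : body) {
        switch (ins.kind) {
        case Ins::StoreToGlobal: {
            auto it = lastWriter.find(ins.offset);
            if (it != lastWriter.end() && !it->second.empty() && it->second == ins.reg)
                ins.redundant = true;          // same register, unchanged value
            else
                lastWriter[ins.offset] = ins.reg;
            break;
        }
        case Ins::DefReg:
            for (auto& slot : lastWriter)      // slots backed by this register
                if (slot.second == ins.reg)    // no longer match memory
                    slot.second.clear();
            break;
        case Ins::OtherStore:
            // Per the reasoning above, stores through eax are assumed not to
            // alias the esi-based global area; a conservative version would
            // clear lastWriter here instead.
            break;
        case Ins::Other:
            break;
        }
    }
}

Run over the listing in Comment #30, this marks stores 122 and 123 as redundant (shadows of 112 and 116) and leaves everything else alone.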
(In reply to comment #36)
> But redundant-store elimination is easier and should always be helpful when
> it applies.

Nice! RSE sounds like it would fit in pretty well with the nanojit design (although I could imagine it requiring an additional forward pass, but I'm OK with that, especially with the parallel compiler).
Can we just yoink the RSE filter from Tamarin for our use, or am I misunderstanding?
Ed, how much work is involved in porting Tamarin's filter, and can it handle the two code sequences in Comment #36 and Comment #30 (#30 is harder)?
We do have a store filter for the stack (StackFilter). Similarly, our global area (gp) is alias free and could get a simple GlobalFilter.
The dead store elimination in Tamarin works on the LIR code, so if the stores in question are represented in LIR, it might work for this problem. It is generic, so it could move into nanojit, but it isn't a filter because it can iterate when there are loop backedges (# of passes is typically 1 + max-loop-nest-depth). At some point in the past, someone (Graydon?) tried to refactor LIR_loop to use LIR_j -> LIR_label instead. There was a slowdown, but I still think this task should be attempted again; if LIR more closely represents what's going on, then we can apply optimizations that don't need specialized trace knowledge.
Depends on: 482377
Obsolete with the removal of tracejit.
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → WONTFIX