Closed Bug 475998 Opened 16 years ago Closed 13 years ago

TM: strange performance behavior, possibly related to stack alignment

Categories: Core :: JavaScript Engine, defect, P2
Platform: x86, macOS
Status: RESOLVED WONTFIX
People: Reporter: gal; Assignee: Unassigned
Attachments: 4 files

No description provided.
Without PERF_BUG set: 1131 1132 1128
With:                 1122 1118 1115
The behavior was observed relative to http://hg.mozilla.org/tracemonkey/rev/76a7344912ae
This needs investigation; maybe it won't block, but it is getting in the way of landing fixes. /be
Flags: blocking1.9.1?
On my machine, I didn't see any unexpected difference in the overall number. It was about .5% slower with the additional check. It could be compiler dependent. Note that I've never tried to track down such a small perf difference before. If it shows up consistently with a larger magnitude in one benchmark, we could reasonably find it. Otherwise, I'm not so sure.
OK, I've solidly reproduced it with bitops-bitwise-and.js. I can get 3 different regression values by bumping the value of the variable 'global' in js_ExecuteTree by multiples of 16. The effect seems to be periodic mod 64: if we start at a value that gives the best perf, we see:

bump by  0: baseline perf
bump by 16: slowdown by ~10%
bump by 32: baseline perf
bump by 48: slowdown by 20-30%
bump by 64: baseline perf
...

I do this by changing the line that defines 'global' to something like:

double* global = (double*)alloca((globalFrameSize+1+8) * sizeof(double))+2;

varying the 2 at the end to change the bump.

I have proved that the slowdown is all in the trace. The trace is very short. It is:

0x243f80: nop
0x243f81: nopw   0x0(%eax,%eax,1)
0x243f8a: push   %edi
0x243f8b: push   %esi
0x243f8c: push   %ebx
0x243f8d: push   %ebp
0x243f8e: push   %ebp
0x243f8f: mov    %esp,%ebp
0x243f91: sub    $0x8,%esp
0x243f94: mov    (%ecx),%eax
0x243f96: mov    0xc(%ecx),%edi
0x243f99: mov    0x8(%ecx),%esi
0x243f9c: mov    (%edi),%ebx
0x243f9e: sub    $0x1000,%ebx
0x243fa4: mov    %ebx,(%edi)
0x243fa6: cmp    $0x0,%ebx
0x243fa9: jle    0x26ffdc
0x23dfaf: mov    0x268(%esi),%edi
0x23dfb5: mov    0x260(%esi),%ebx
0x23dfbb: and    %ebx,%edi
0x23dfbd: mov    %edi,0x268(%esi)
0x23dfc3: add    $0x1,%ebx
0x23dfc6: jo     0x269fe8
0x243fb5: mov    0x260(%esi),%ebx
0x243fbb: and    %ebx,%edi
0x243fbd: mov    %edi,0x268(%esi)
0x243fc3: add    $0x1,%ebx
0x243fc6: jo     0x26ffe8
0x243fcc: mov    %ebx,0x260(%esi)
0x243fd2: mov    %ebx,(%eax)
0x243fd4: movl   $0x23c34600,0x8(%eax)
0x243fdb: cmp    $0x23c34600,%ebx
0x243fe1: jge    0x26fff4
0x243fe7: mov    %edi,0x268(%esi)
0x243fed: mov    %ebx,0x260(%esi)
0x243ff3: jmp    0x243f94
0x243ff8: mov    %ebp,%esp
0x243ffa: pop    %ebp
0x243ffb: pop    %ebp
0x243ffc: pop    %ebx
0x243ffd: pop    %esi
0x243ffe: pop    %edi
0x243fff: ret

esi is 'global'. I nop'd out the first load through esi and the slowdown effect seemed to decrease.

On one of the slowed-down runs, the dynamic memory access history for one iteration is:

ld 0xbffff080
ld 0xbffff08c
ld 0xbffff088
ld 0x0030b130
st 0x0030b130
ld 0xbfffcd28 (through esi/global)
ld 0xbfffcd20 (through esi/global)
st 0xbfffcd28 (through esi/global)
st 0xbfffcd20 (through esi/global)
st 0xbfffcf80
ld 0xbfffcf88
st 0xbfffcd28 (through esi/global)
st 0xbfffcd20 (through esi/global)

In a faster run, this is identical except that the global addresses end in cd18 and cd10. This looks a lot like a cache effect, but toward the beginning of this I did some Shark runs with a few basic cache-related HPCs and didn't find anything. The only thing I've learned that way so far is that the total run-time difference is about equal to 1 cycle per run through this trace and is accounted for exactly by a difference in the number of stall cycles. The extra stall cycles are about half LD_ST and about half reservation-stations-full.
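For anyone trying to reproduce the mod-64 periodicity outside the engine, here is a minimal stand-alone sketch (not TraceMonkey code; globalFrameSize is a made-up stand-in for the real value in js_ExecuteTree) that shows how varying the trailing bump shifts the address of 'global' modulo 64, in the same 16-byte steps described above:

#include <alloca.h>
#include <cstdint>
#include <cstdio>

// Mirrors the 'global' definition quoted above, with the trailing constant
// turned into a parameter. The pointer is only inspected, never dereferenced.
static void probe(int bump) {
    const int globalFrameSize = 16;   // hypothetical size, for illustration only
    double* global = (double*)alloca((globalFrameSize + 1 + 8) * sizeof(double)) + bump;
    printf("bump %d doubles (%2d bytes): global mod 64 = %2u\n",
           bump, bump * 8, (unsigned)((uintptr_t)global % 64));
}

int main() {
    // Each bump of 2 doubles moves 'global' by 16 bytes, so four such bumps
    // cycle through one full 64-byte period.
    for (int bump = 0; bump <= 8; bump += 2)
        probe(bump);
    return 0;
}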
Note: we store loop-carrying values at the loop edge and re-fetch them, so we have a fair number of store-load dependencies along that loop edge. Not sure whether this somehow plays into this.
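As a toy illustration of that kind of loop-edge dependency (ordinary C++, not trace code), the loop-carried value below is written back to memory at the bottom of each iteration and re-read at the top, so every iteration's load waits on the previous iteration's store:

#include <cstdint>

// The volatile pointer stands in for the memory the trace writes at the loop
// edge and re-fetches at the loop header.
uint32_t spin(volatile uint32_t* slot, uint32_t limit) {
    while (*slot < limit) {   // re-fetch the loop-carried value
        uint32_t v = *slot;
        v += 1;
        *slot = v;            // store it back at the loop edge
    }
    return *slot;
}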
(In reply to comment #6)
> The trace is very short. It is:
> 0x243f80: nop
> 0x243f81: nopw   0x0(%eax,%eax,1)
> 0x243f8a: push   %edi

David, would you post the generated code for the same trace when slow and when fast, exactly as you've posted above? The offsets are important. Thanks,
- moh
Moh, the code is exactly the same (same code, same offsets). The only thing that changes is the value of esp. Andreas
0x23dfaf: mov    0x268(%esi),%edi
0x23dfb5: mov    0x260(%esi),%ebx

The two destinations will end up in two different cache lines if esi = 64N-8. However, this cache-line miss seems to be a one-time event per trace call, and the impact is not expected to be high; I suppose the loop back edge is taken frequently enough. Is it possible to get the dynamic memory access history of a few consecutive iterations within the trace?
This part of the code is executed repeatedly. ECX is stable. All memory loads in each iteration should be from the same address. None of the branches is taken. We keep looping along the unconditional jmp at the end.

0x243f94: mov    (%ecx),%eax
0x243f96: mov    0xc(%ecx),%edi
0x243f99: mov    0x8(%ecx),%esi
0x243f9c: mov    (%edi),%ebx
0x243f9e: sub    $0x1000,%ebx
0x243fa4: mov    %ebx,(%edi)
0x243fa6: cmp    $0x0,%ebx
0x243fa9: jle    0x26ffdc
0x23dfaf: mov    0x268(%esi),%edi
0x23dfb5: mov    0x260(%esi),%ebx
0x23dfbb: and    %ebx,%edi
0x23dfbd: mov    %edi,0x268(%esi)
0x23dfc3: add    $0x1,%ebx
0x23dfc6: jo     0x269fe8
0x243fb5: mov    0x260(%esi),%ebx
0x243fbb: and    %ebx,%edi
0x243fbd: mov    %edi,0x268(%esi)
0x243fc3: add    $0x1,%ebx
0x243fc6: jo     0x26ffe8
0x243fcc: mov    %ebx,0x260(%esi)
0x243fd2: mov    %ebx,(%eax)
0x243fd4: movl   $0x23c34600,0x8(%eax)
0x243fdb: cmp    $0x23c34600,%ebx
0x243fe1: jge    0x26fff4
0x243fe7: mov    %edi,0x268(%esi)
0x243fed: mov    %ebx,0x260(%esi)
0x243ff3: jmp    0x243f94
Why is there a gap in the addresses of these two consecutive instructions?

0x23dfc6: jo     0x269fe8
????
0x243fb5: mov    0x260(%esi),%ebx
Damn, I didn't even notice that. The code is crossing a code cache page boundary. A jmp is inserted automatically to transition to the other code page. So every iteration we take an unconditional branch in the middle of the code above. Note that the jump always occurs, in the slow case as well as the fast case.
There's a discontinuity here. The stream 0x243fa6, 0x243fa9, 0x243fb5 ... seems fine. Where is the page crossing? And are the unconditional jumps that handle the page crossing not shown here?

0x243fa6: cmp    $0x0,%ebx
0x243fa9: jle    0x26ffdc
0x23dfaf: mov    0x268(%esi),%edi
....
0x23dfc6: jo     0x269fe8
0x243fb5: mov    0x260(%esi),%ebx
0x243fbb: and    %ebx,%edi
There should be at most one page crossing. David, could you disassemble with TRACEMONKEY=verbose? That shows what code is what a bit better.
The generated native code begins on line 116: 0x255f80 [prologue]. I cut it at 3000 lines because it was over the length limit. The cut portion is about 6000 lines of the same kind of enter-trace spew as you see at the end of what I post here.
I tried to time individual trace runs, but I couldn't reproduce the behavior that way. I just found that if I patch all the loads through esi to use different offsets (and move around data before and after the call so the loads access the correct data), I can reproduce the full perf regression behavior. I can even move the 'global' pointer (esi) on trace to be a global array (so it occupies a completely different part of the address space), and the perf regression is still there (the offsets needed are different, but the general 64-period behavior remains).
I just did some HPC work with 2 configurations where the slow one is 50% slower, ~900M cycles instead of ~600M. I found that the slow config had 46M more LOAD_BLOCK.STA events. Could this be responsible for the difference?
Does it point you to the load instruction which is blocked by an ambiguous store?
(In reply to comment #19)
> Does it point you to the load instruction which is blocked by an ambiguous
> store?

Shark is supposed to be able to, but I can't get it to work, and I still don't have my Windows box. I might be able to find it indirectly.
By the way, the address gap is because I put together the trace you saw by copy-pasting two separate sets of output. Just pretend all the high bits are the same.
I don't see a negative perf impact from this patch, so if it makes this issue go away we might want to take it.
==> mov 0(edi),ebx
    cmp ebx,0
    jle 0x2b7fdc
??? mov edi,664(esi)
??? mov ebx,656(esi)

The first store can cause stalls on the marked loads. If accessing the global area through absolute addresses is feasible, it should help the HW disambiguator and avoid the stalls. On the other hand, the instructions might get longer.
It is still not clear why the processor's address disambiguator, which is equipped with address prediction, should not be able to handle the case. The target addresses of all stores are invariant within the loop (although they are calculated within the loop, their values are the same across iterations, so the predictor should catch them, except possibly the first time). These are all the stores:

mov 0(edi),ebx
mov 664(esi),edi
mov 656(esi),ebx
mov 0(eax),ebx
mov 8(eax),600000000
mov 664(esi),edi
mov 656(esi),ebx

Two other events of interest here are:

o MEMORY_DISAMBIGUATION.RESET (excessive fails? 20-cycle penalty)
o MEMORY_DISAMBIGUATION.SUCCESS

Measuring those would help us better understand the problem. It also shows the opportunity for loop-invariant code motion by the JIT, possibly through a post pass.
(In reply to comment #25)
> It is still not clear why the processor's address disambiguator, which is
> equipped with address prediction, should not be able to handle the case.

I agree. Interestingly, with Andreas's patch to access global area memory through absolute addresses, I got the 'slow' behavior (+300M cycles) even though there were fewer LOAD_BLOCKs (+14M vs. the fast case, i.e., 40M fewer than a 'slow' run without Andreas's patch).

I did measure the other two events. MEMORY_DISAMBIGUATION.RESET came to about 400 in either configuration. In the fast case, I got about 20M more MEMORY_DISAMBIGUATION.SUCCESS events -- I think it was about 20M for a 'slow' case and 40M for a 'fast' one.

Removing this block from the beginning part (which contains a load Moh mentioned as suspect) had an interesting effect:

mov ebx,0(edi)
sub ebx,4096
mov 0(edi),ebx
cmp ebx,0
jle 0x2bbfdc

Without this piece, I don't get the 'fastest' speed any more, nor the 'slowest'. Instead, I get 2 speeds that are intermediate. I looked at the lowest one, and the number of LOAD_BLOCK.STA events was low. Almost all the resource stall cycles seem to be LD_ST.

I'm wondering if maybe more than one thing is going on here. Could it be that there is a bit of a problem with the one store causing LOAD_BLOCKs, but also that there are just too many loads and stores in this code, creating a long critical path?
David,

Would it be possible to try hoisting the following loop-invariant instructions outside the loop? I.e., the loop-carrying values Andreas mentioned in Comment #7.

mov eax,0(ecx)
mov edi,12(ecx)
mov esi,8(ecx)

It might be that some hashing/conflicting of the virtual addresses plays a role in the processor's disambiguator/predictor.
(In reply to comment #27)
> Would it be possible to try hoisting the following loop-invariant
> instructions outside the loop? I.e., the loop-carrying values Andreas
> mentioned in Comment #7.

There isn't a consistent effect. I think the exact positions that cause the regression move around slightly (as with many minor changes), but the issue is still there in general. The LOAD_BLOCK count goes up to 100M: maybe some other load is getting blocked, and this exposes it more to nearby stores?
I tried skipping the next two loads, the ones through global/esi, and that gave an immediate perf improvement (+20% vs. the best time I had previously observed) and consistent perf for all positionings of global. Apparently storing those two addresses just before the backward jmp and then reloading them at the top is bad. I tried deleting the two stores at the end after that, but that had no effect, which makes sense as those stores are never read.
To make the case easier to follow:

         nop1
         nop9
         push edi
         push esi
         push ebx
         push ebp
         push ebp
         mov  ebp,esp
         sub  esp,8
100 +--> mov  eax,0[ecx]
101 |    mov  edi,12[ecx]
102 |    mov  esi,8[ecx]
103 |    mov  ebx,0[edi]
104 |    sub  ebx,4096
105 |    mov  0[edi],ebx
106 |    cmp  ebx,0
107 |    jle  -->
108 |
109 |    mov  edi,664[esi]
110 |    mov  ebx,656[esi]
111 |    and  edi,ebx
112 |    mov  664[esi],edi
113 |    add  ebx,1
114 |    jo   -->
115 |
116 |    mov  656[esi],ebx
117 |    mov  0[eax],ebx
118 |    mov  8[eax],600000000
119 |    cmp  ebx,600000000
120 |    jnl  -->
121 |
122 |    mov  664[esi],edi
123 |    mov  656[esi],ebx
124 |
125 +--- jmp  0x0
         mov  esp,ebp
         pop  ebp
         pop  ebp
         pop  ebx
         pop  esi
         pop  edi
         ret
I believe I've isolated the case to a stand-alone test. Will update you.
I have attached a patch that allocates InterpState and global slots using malloc and eliminates ->gp altogether. David, could you give this a spin?
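For context, a minimal sketch of the idea only (hypothetical names and layout, not the attached patch): put InterpState and the global slots in a single heap block, so the globals no longer live in an alloca'd stack buffer whose alignment depends on the interpreter's stack depth, and no separate gp pointer is needed to reach them.

#include <cstdlib>

struct InterpStateSketch {
    // ... the real InterpState fields would live here ...
    double globals[1];   // global slots appended to the same allocation
};

static InterpStateSketch* allocInterpState(size_t globalFrameSize) {
    size_t bytes = sizeof(InterpStateSketch) + globalFrameSize * sizeof(double);
    return static_cast<InterpStateSketch*>(calloc(1, bytes));
}

static void freeInterpState(InterpStateSketch* state) {
    free(state);
}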
(In reply to comment #33)
> I have attached a patch that allocates InterpState and global slots using
> malloc and eliminates ->gp altogether. David, could you give this a spin?

I get a slower run time (almost as slow as the slowest I've seen in my tests of this issue) on the small test. On SS as a whole, things are about the same.
I'm still working on this issue. My isolated case shows the problem only on Core-2 Duo. The problem does not exist on Pentium-4 or Nehalem and does not seem to be the "4K aliasing" problem. Will let you know when it is fully root caused.
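For reference on the term (general background, not taken from this bug): "4K aliasing" refers to a false store-to-load dependency that some cores flag when a load and an earlier in-flight store agree in the low 12 address bits, i.e. the two accesses sit at the same offset within different 4 KiB pages. A quick way to check a pair of addresses:

#include <cstdint>

// True when the load and store addresses differ but share their low 12 bits,
// the condition under which the 4K-aliasing false dependency can trigger.
static bool may4kAlias(uintptr_t loadAddr, uintptr_t storeAddr) {
    return loadAddr != storeAddr && ((loadAddr ^ storeAddr) & 0xFFFu) == 0;
}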
Flags: wanted1.9.1+
Flags: blocking1.9.1?
Flags: blocking1.9.1-
Priority: -- → P2
The problem is largely due to the two stores labeled 122 and 123 in Comment #30. David also noticed this in Comment #29. These stores may stall the execution of the next iteration. Both of these stores are unnecessary and can be eliminated through a redundant-store elimination phase. Store 122 is a redundant copy of 112, and 123 accomplishes what 116 has already done. Note that esi holds the global pointer. This, I believe, makes it possible to infer that these two stores do not conflict with the stores labeled 117 and 118. On an isolated test case based on the above sequence, I get a 30% gain from elimination of these two stores. This seems to be a recurring pattern, and in short traces it can have a significant impact.

Here's a sequence generated for check-access-nbody that shows the same pattern:

+--> mov  -4(ebp),ecx
|    mov  eax,ecx
|    mov  esi,0(eax)
|    mov  eax,-48(esi)
|    mov  edi,-40(esi)
|    mov  ebx,-24(esi)
|    mov  edx,0(esi)
|    mov  ecx,8(esi)
|    cmp  edx,ecx
|    mov  ecx,-4(ebp)
|    jl   -->
|
|    add  edi,1
|    jo   -->
|
|    mov  -40(esi),edi
|    mov  0(esi),edi
|    mov  8(esi),eax
|    cmp  edi,eax
|    jnl  -->
|
|    mov  -48(esi),eax
|    mov  -40(esi),edi
|    mov  -24(esi),ebx
|    mov  esp,ebp
+--- jmp

All three stores at the end of the above loop (offsets -48, -40, -24) can be eliminated. When all stores in the loop are based on the same register (such as above, where all stores are based on esi, and esi is loop invariant), a simple disambiguation based on the offsets would enable the required redundant-store elimination. This may also be possible at the time these stores are inserted - a backward analysis is easy in this case. In this particular case, some of the loads from these addresses can also be moved outside the loop, but general loop-invariant code motion can cause register pressure and spilling, and would need rematerialization. But redundant-store elimination is easier and should always be helpful when it applies. I checked with Ed, and Tamarin has a filter for deadvars elimination.
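To make the proposed transformation concrete, here is an illustrative sketch only (it is neither Tamarin's deadvars filter nor nanojit code; Ins, markRedundantGlobalStores, and the instruction classification are made up): a single forward pass over one loop body, with stores modeled as base register plus offset, that drops a store when it rewrites an esi-relative slot with a register whose value has not changed since the previous store to that slot.

#include <cstdint>
#include <map>
#include <string>
#include <vector>

struct Ins {
    // StoreToGlobal: "mov off[esi],reg"; DefReg: anything that rewrites 'reg'
    // (ALU ops and loads alike); OtherStore: stores through another base
    // register such as eax; Other: compares, branches, etc.
    enum Kind { StoreToGlobal, DefReg, OtherStore, Other } kind;
    std::string reg;        // register stored or redefined, e.g. "edi"
    int32_t     offset;     // esi-relative offset for StoreToGlobal, e.g. 664
    bool        redundant = false;
};

static void markRedundantGlobalStores(std::vector<Ins>& body) {
    // slot offset -> register whose current value is already stored there
    std::map<int32_t, std::string> lastWriter;
    for (Ins& ins : body) {
        switch (ins.kind) {
        case Ins::StoreToGlobal: {
            auto it = lastWriter.find(ins.offset);
            if (it != lastWriter.end() && !it->second.empty() && it->second == ins.reg)
                ins.redundant = true;          // same register, unchanged value
            else
                lastWriter[ins.offset] = ins.reg;
            break;
        }
        case Ins::DefReg:
            for (auto& slot : lastWriter)      // slots backed by this register
                if (slot.second == ins.reg)    // no longer match memory
                    slot.second.clear();
            break;
        case Ins::OtherStore:
            // Per the reasoning above, stores through eax are assumed not to
            // alias the esi-based global area; a conservative version would
            // clear lastWriter here instead.
            break;
        case Ins::Other:
            break;
        }
    }
}

Run over the listing in Comment #30, this marks stores 122 and 123 as redundant (shadows of 112 and 116) and leaves everything else alone.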
(In reply to comment #36)
> But redundant-store elimination is easier and should always be helpful when
> it applies.

Nice! RSE sounds like it would fit in pretty well with the nanojit design (although I could imagine it requiring an additional forward pass, but I'm OK with that, especially with the parallel compiler).
Can we just yoink the RSE filter from Tamarin for our use, or am I misunderstanding?
Ed, how much work is involved in porting Tamarin's filter, and can it handle the two code sequences in Comment #36 and Comment #30 (#30 is harder)?
We do have a store filter for the stack (StackFilter). Similarly, our global area (gp) is alias free and could get a simple GlobalFilter.
The dead store elimination in Tamarin works on the LIR code, so if the stores in question are represented in LIR, it might work for this problem. It is generic, so it could move into nanojit, but it isn't a filter because it can iterate when there are loop backedges (# of passes is typically 1 + max-loop-nest-depth). At some point in the past, someone (Graydon?) tried to refactor LIR_loop to use LIR_j -> LIR_label instead. There was a slowdown, but I still think this task should be attempted again; if LIR more closely represents what's going on, then we can apply optimizations that don't need specialized trace knowledge.
Depends on: 482377
Obsolete with the removal of tracejit.
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → WONTFIX