Closed Bug 475998 Opened 16 years ago Closed 13 years ago
TM: strange performance behavior, possibly related to stack alignment
Categories: Core :: JavaScript Engine, defect, P2
Status: RESOLVED WONTFIX
People: Reporter: gal; Unassigned
Attachments (4 files)
1.19 KB, patch
171.84 KB, text/plain
5.62 KB, patch
11.68 KB, patch
No description provided.
Reporter
Comment 1•16 years ago
Reporter
Comment 2•16 years ago
Without PERF_BUG set:
1131
1132
1128
With PERF_BUG set:
1122
1118
1115
Reporter
Comment 3•16 years ago
The behavior was observed relative to http://hg.mozilla.org/tracemonkey/rev/76a7344912ae
Comment 4•16 years ago
This needs investigation; maybe it won't block, but it is getting in the way of landing fixes.
/be
Flags: blocking1.9.1?
Comment 5•16 years ago
On my machine, I didn't see any unexpected difference in the overall number. It was about .5% slower with the additional check. It could be compiler dependent.
Note that I've never tried to track down such a small perf difference before. If it shows up consistently with a larger magnitude in one benchmark, we could reasonably find it. Otherwise, I'm not so sure.
Comment 6•16 years ago
OK, I've solidly reproduced it with bitops-bitwise-and.js. I can get 3 different regression values by bumping the value of the variable 'global' in js_ExecuteTree by multiples of 16. The effect seems to be periodic mod 64: if we start at a value that gives the best perf, we see:
bump by 0: baseline perf
bump by 16: slowdown by ~10%
bump by 32: baseline perf
bump by 48: slowdown by 20-30%
bump by 64: baseline perf
...
I do this by changing the line that defines 'global' to something like:
double* global = (double*)alloca((globalFrameSize+1+8) * sizeof(double))+2;
varying the 2 at the end to change the bump.
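As a side note, here is a minimal stand-alone sketch (not the real js_ExecuteTree code; the frame size and bump values are arbitrary stand-ins) showing how bumping the alloca'd pointer walks its address through the 8-byte residues mod 64. Each 16-byte step corresponds to changing the trailing +2 above by 2, which is why the observed pattern repeats every four steps (64 bytes):

// Hypothetical illustration of the alignment experiment above.
// Each unit of 'bump' moves 'global' by sizeof(double) = 8 bytes.
#include <alloca.h>
#include <cstdint>
#include <cstdio>

static void showAlignment(size_t globalFrameSize, int bump) {
    double* global = (double*) alloca((globalFrameSize + 1 + 8) * sizeof(double)) + bump;
    printf("bump %d -> global mod 64 = %u\n", bump, (unsigned)((uintptr_t) global % 64));
}

int main() {
    for (int bump = 0; bump < 8; ++bump)
        showAlignment(512, bump);   // 512 is an arbitrary stand-in frame size
    return 0;
}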
I have proved that the slowdown is all in the trace. The trace is very short. It is:
0x243f80: nop
0x243f81: nopw 0x0(%eax,%eax,1)
0x243f8a: push %edi
0x243f8b: push %esi
0x243f8c: push %ebx
0x243f8d: push %ebp
0x243f8e: push %ebp
0x243f8f: mov %esp,%ebp
0x243f91: sub $0x8,%esp
0x243f94: mov (%ecx),%eax
0x243f96: mov 0xc(%ecx),%edi
0x243f99: mov 0x8(%ecx),%esi
0x243f9c: mov (%edi),%ebx
0x243f9e: sub $0x1000,%ebx
0x243fa4: mov %ebx,(%edi)
0x243fa6: cmp $0x0,%ebx
0x243fa9: jle 0x26ffdc
0x23dfaf: mov 0x268(%esi),%edi
0x23dfb5: mov 0x260(%esi),%ebx
0x23dfbb: and %ebx,%edi
0x23dfbd: mov %edi,0x268(%esi)
0x23dfc3: add $0x1,%ebx
0x23dfc6: jo 0x269fe8
0x243fb5: mov 0x260(%esi),%ebx
0x243fbb: and %ebx,%edi
0x243fbd: mov %edi,0x268(%esi)
0x243fc3: add $0x1,%ebx
0x243fc6: jo 0x26ffe8
0x243fcc: mov %ebx,0x260(%esi)
0x243fd2: mov %ebx,(%eax)
0x243fd4: movl $0x23c34600,0x8(%eax)
0x243fdb: cmp $0x23c34600,%ebx
0x243fe1: jge 0x26fff4
0x243fe7: mov %edi,0x268(%esi)
0x243fed: mov %ebx,0x260(%esi)
0x243ff3: jmp 0x243f94
0x243ff8: mov %ebp,%esp
0x243ffa: pop %ebp
0x243ffb: pop %ebp
0x243ffc: pop %ebx
0x243ffd: pop %esi
0x243ffe: pop %edi
0x243fff: ret
esi is 'global'. I nop'd out the first load through esi and the slowdown effect seemed to decrease. On one of the slowed-down runs, the dynamic memory access history for one iteration is:
ld 0xbffff080
ld 0xbffff08c
ld 0xbffff088
ld 0x0030b130
st 0x0030b130
ld 0xbfffcd28 (through esi/global)
ld 0xbfffcd20 (through esi/global)
st 0xbfffcd28 (through esi/global)
st 0xbfffcd20 (through esi/global)
st 0xbfffcf80
ld 0xbfffcf88
st 0xbfffcd28 (through esi/global)
st 0xbfffcd20 (through esi/global)
In a faster run, this is identical except that the global addresses end in cd18 and cd10.
This looks a lot like a cache effect, but toward the beginning of this I did some Shark runs with a few basic cache-related HPCs and didn't find anything. The only thing I've learned that way so far is that the total run-time difference is about equal to 1 cycle per run through this trace and is accounted for exactly by a difference in the number of stall cycles. The extra stall cycles are about half LD_ST and half "reservation stations full".
Reporter
Comment 7•16 years ago
Note: we store loop-carried values at the loop edge and re-fetch them, so we have a fair number of store-load dependencies along that loop edge. Not sure whether this somehow plays into this.
Comment 8•16 years ago
(In reply to comment #6)
> The trace is very short.
> It is:
> 0x243f80: nop
> 0x243f81: nopw 0x0(%eax,%eax,1)
> 0x243f8a: push %edi
David,
Would you post the generated code for the same trace when slow and fast, exactly as you've posted above? The offsets are important.
thanks,
- moh
Reporter
Comment 9•16 years ago
Moh, the code is exactly the same (same code, same offsets). The only thing that changes is the value of esp.
Andreas
Comment 10•16 years ago
0x23dfaf: mov 0x268(%esi),%edi
0x23dfb5: mov 0x260(%esi),%ebx
The two destinations will end up in two different cache lines if esi=64N-8.
However, this cache line miss seems to be a one-time event per trace call, and the impact is not expected to be high. I suppose the loop back edge is taken frequently enough. Is it possible to get the dynamic memory access history of a few consecutive iterations within the trace?
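As a quick way to check this kind of reasoning, here is a small helper (a sketch; assumes 64-byte cache lines) that reports whether two accesses through the same base fall in the same line, e.g. sameCacheLine(esi, 0x260, 0x268) for the two loads above:

#include <algorithm>
#include <cstddef>
#include <cstdint>

// True when both accesses (of 'accessSize' bytes each) at base+off1 and
// base+off2 fall entirely within the same 64-byte cache line.
static bool sameCacheLine(uintptr_t base, uintptr_t off1, uintptr_t off2,
                          size_t accessSize = 4) {
    uintptr_t first = base + std::min(off1, off2);
    uintptr_t last  = base + std::max(off1, off2) + accessSize - 1;
    return (first / 64) == (last / 64);
}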
Reporter
Comment 11•16 years ago
This part of the code is executed repeatedly. ECX is stable. All memory loads in each iteration should be from the same address. None of the branches is taken. We keep looping along the unconditional jmp at the end.
0x243f94: mov (%ecx),%eax
0x243f96: mov 0xc(%ecx),%edi
0x243f99: mov 0x8(%ecx),%esi
0x243f9c: mov (%edi),%ebx
0x243f9e: sub $0x1000,%ebx
0x243fa4: mov %ebx,(%edi)
0x243fa6: cmp $0x0,%ebx
0x243fa9: jle 0x26ffdc
0x23dfaf: mov 0x268(%esi),%edi
0x23dfb5: mov 0x260(%esi),%ebx
0x23dfbb: and %ebx,%edi
0x23dfbd: mov %edi,0x268(%esi)
0x23dfc3: add $0x1,%ebx
0x23dfc6: jo 0x269fe8
0x243fb5: mov 0x260(%esi),%ebx
0x243fbb: and %ebx,%edi
0x243fbd: mov %edi,0x268(%esi)
0x243fc3: add $0x1,%ebx
0x243fc6: jo 0x26ffe8
0x243fcc: mov %ebx,0x260(%esi)
0x243fd2: mov %ebx,(%eax)
0x243fd4: movl $0x23c34600,0x8(%eax)
0x243fdb: cmp $0x23c34600,%ebx
0x243fe1: jge 0x26fff4
0x243fe7: mov %edi,0x268(%esi)
0x243fed: mov %ebx,0x260(%esi)
0x243ff3: jmp 0x243f94
Comment 12•16 years ago
Why is there a gap in the addresses of these two consecutive instructions?
0x23dfc6: jo 0x269fe8
????
0x243fb5: mov 0x260(%esi),%ebx
Reporter
Comment 13•16 years ago
Damn, I didn't even notice that. The code is crossing a code cache page boundary. A jmp is inserted automatically to transition to the other code page. So every iteration we take an unconditional branch in the middle of the code above.
Note that the jump always occurs, in the slow case as well as the fast case.
Comment 14•16 years ago
There's a discontinuity here. The stream 0x243fa6, 0x243fa9, 0x243fb5 ... seems fine. Where is the page crossing? And the unconditional jumps to handle page crossing are not shown here?
0x243fa6: cmp $0x0,%ebx
0x243fa9: jle 0x26ffdc
0x23dfaf: mov 0x268(%esi),%edi
....
0x23dfc6: jo 0x269fe8
0x243fb5: mov 0x260(%esi),%ebx
0x243fbb: and %ebx,%edi
Reporter
Comment 15•16 years ago
There should be at most one page crossing. David, could you disassemble with TRACEMONKEY=verbose? That shows what code is what a bit better.
Comment 16•16 years ago
The generated native code begins on line 116: 0x255f80 [prologue]. I cut it at 3000 lines because it was over the length limit. The cut portion is about 6000 lines of the same kind of enter-trace spew as you see at the end of what I post here.
Comment 17•16 years ago
I tried to time individual trace runs, but I couldn't reproduce the behavior that way.
I just found that if I patch all the loads through esi to use different offsets (and move around data before and after the call so the loads access the correct data), I can reproduce the full perf regression behavior. I can even move the 'global' pointer (esi) on trace to be a global array (so it occupies a completely different part of the address space), and the perf regression is still there (the offsets needed are different, but the general 64-period behavior remains).
Comment 18•16 years ago
I just did some HPC work with 2 configurations where the slow one is 50% slower, ~900M cycles instead of ~600M. I found that the slow config had 46M more LOAD_BLOCK.STA events. Could this be responsible for the difference?
Comment 19•16 years ago
Does it point you to the load instruction which is blocked by an ambiguous store?
Comment 20•16 years ago
(In reply to comment #19)
> Does it point you to the load instruction which is blocked by an ambiguous
> store?
Shark is supposed to be able to, but I can't get it to work, and I still don't have my Windows box. I might be able to find it indirectly.
Comment 21•16 years ago
By the way, the address gap is because I put together the trace you saw by copy-pasting two separate sets of output. Just pretend all the high bits are the same.
Reporter
Comment 22•16 years ago
Reporter
Comment 23•16 years ago
I don't see a negative perf impact from this patch, so if it makes this issue go away we might want to take it.
Comment 24•16 years ago
==> mov 0(edi),ebx
cmp ebx,0
jle 0x2b7fdc
??? mov edi,664(esi)
??? mov ebx,656(esi)
The first store can cause stalls on the marked loads. If making the global area a static array (accessed through absolute addresses) is feasible, it should help the HW disambiguator and avoid the stalls. On the other hand, the instructions might get longer.
Comment 25•16 years ago
It is still not clear why the processor's address disambiguator, which is equipped with address prediction, should not be able to handle this case. The target addresses of all the stores are invariant within the loop (although they are calculated within the loop, their values are the same across iterations, so the predictor should catch them, except possibly the first time).
These are all the stores:
mov 0(edi),ebx
mov 664(esi),edi
mov 656(esi),ebx
mov 0(eax),ebx
mov 8(eax),600000000
mov 664(esi),edi
mov 656(esi),ebx
Two other events of interest here are:
o MEMORY_DISAMBIGUATION.RESET (excessive fails? 20 cycle penalty)
o MEMORY_DISAMBIGUATION.SUCCESS
Measuring those would help us better understand the problem. It also shows the opportunity for loop-invariant code motion by the JIT, possibly through a post pass.
Comment 26•16 years ago
(In reply to comment #25)
> It is still not clear why the processor's address disambiguator, which is
> equipped with address prediction, should not be able to handle the case.
I agree. Interestingly, with Andreas's patch to access global area memory through absolute addresses, I got the 'slow' behavior (+300M cycles) even though there were fewer LOAD_BLOCKs (+14M vs. the fast, i.e., 40M fewer than a 'slow' run without Andreas's patch).
I did measure the other two events. MEMORY_DISAMBIGUATION.RESET came to about 400 in either configuration. In the fast case, I got +20M MEMORY_DISAMBIGUATION.SUCCESS; I think it was about 20M for a 'slow' case and 40M for a 'fast' one.
Removing this block from the beginning part (which contains a load Moh mentioned as suspect) had an interesting effect:
mov ebx,0(edi)
sub ebx,4096
mov 0(edi),ebx
cmp ebx,0
jle 0x2bbfdc
Without this piece, I don't get the 'fastest' speed any more, nor the 'slowest'. Instead, I get 2 speeds that are intermediate. I looked at the lowest one and the number of LOAD_BLOCK.STA was low. Almost all the resource stall cycles seem to be LD_ST.
I'm wondering if maybe more than one thing is going on here. Could it be that there is a bit of a problem with the one store causing LOAD_BLOCKs, but also that there are just too many loads and stores in this code, creating a long critical path?
Comment 27•16 years ago
David,
Would it be possible to try hoisting of the following loop-invariant
instructions outside the loop? I.e., the loop-carried values Andreas mentioned
in Comment #7.
mov eax,0(ecx)
mov edi,12(ecx)
mov esi,8(ecx)
It might be that some hashing/conflict of the virtual addresses plays a role in the processor's disambiguator/predictor.
Comment 28•16 years ago
(In reply to comment #27)
> Would it be possible to try hoisting of the following loop-invariant
> instructions outside the loop? I.e., the loop-carried values Andreas mentioned
> in Comment #7.
There isn't a consistent effect. I think the exact positions that cause the regression move around slightly (as with many minor changes) but the issue is still there in general. The LOAD_BLOCK count goes up to 100M: maybe some other load is getting blocked and this exposes it more to nearby stores?
Comment 29•16 years ago
I tried skipping the next two loads, the ones through global/esi, and that gave an immediate perf improvement (+20% vs. the best time I had previously observed) and consistent perf for all positionings of global. Apparently storing to those two addresses just before the backward jmp and then reloading them at the top is bad.
I tried deleting the two stores at the end after that, but that had no effect, which makes sense as those stores are never read.
Comment 30•16 years ago
To make the case easier to follow:
nop1
nop9
push edi
push esi
push ebx
push ebp
push ebp
mov ebp,esp
sub esp,8
100 +--> mov eax,0[ecx]
101 | mov edi,12[ecx]
102 | mov esi,8[ecx]
103 | mov ebx,0[edi]
104 | sub ebx,4096
105 | mov 0[edi],ebx
106 | cmp ebx,0
107 | jle -->
108 |
109 | mov edi,664[esi]
110 | mov ebx,656[esi]
111 | and edi,ebx
112 | mov 664[esi],edi
113 | add ebx,1
114 | jo -->
115 |
116 | mov 656[esi],ebx
117 | mov 0[eax],ebx
118 | mov 8[eax],600000000
119 | cmp ebx,600000000
120 | jnl -->
121 |
122 | mov 664[esi],edi
123 | mov 656[esi],ebx
124 |
125 +--- jmp 0x0
mov esp,ebp
pop ebp
pop ebp
pop ebx
pop esi
pop edi
ret
Comment 31•16 years ago
I believe I've isolated the case to a stand-alone test. Will update you.
Reporter
Comment 32•16 years ago
Reporter
Comment 33•16 years ago
I have attached a patch that allocates InterpState and global slots using malloc and eliminates ->gp altogether. David, could you give this a spin?
Comment 34•16 years ago
(In reply to comment #33)
> I have attached a patch that allocates InterpState and global slots using
> malloc and eliminates ->gp altogether. David, could you give this a spin?
I get a slower run time (almost as slow as the slowest I've seen in my tests of this issue) on the small test. On SS as a whole, things are about the same.
Comment 35•16 years ago
I'm still working on this issue. My isolated case shows the problem only on Core-2 Duo. The problem does not exist on Pentium-4 or Nehalem and does not seem to be the "4K aliasing" problem. Will let you know when it is fully root caused.
Updated•16 years ago
Flags: wanted1.9.1+
Flags: blocking1.9.1?
Flags: blocking1.9.1-
Comment 36•16 years ago
The problem is largely due to the two stores labeled 122 and 123 in Comment #30. David also noticed this in Comment #29. These stores may stall the execution of the next iteration. Both of these stores are unnecessary and can be eliminated through a redundant-store elimination phase. Store 122 is a redundant copy of 112, and 123 accomplishes what 116 has already done. Note that esi holds the global pointer. This, I believe, makes it possible to infer that these two stores do not conflict with the stores labeled 117 and 118. On an isolated test case based on the above sequence, I get a 30% gain from eliminating these two stores.
This seems to be a recurring pattern, and in short traces it can have a significant impact. Here's a sequence generated for check-access-nbody that shows the same pattern.
+--> mov -4(ebp),ecx
| mov eax,ecx
| mov esi,0(eax)
| mov eax,-48(esi)
| mov edi,-40(esi)
| mov ebx,-24(esi)
| mov edx,0(esi)
| mov ecx,8(esi)
| cmp edx,ecx
| mov ecx,-4(ebp)
| jl -->
|
| add edi,1
| jo -->
|
| mov -40(esi),edi
| mov 0(esi),edi
| mov 8(esi),eax
| cmp edi,eax
| jnl -->
|
| mov -48(esi),eax
| mov -40(esi),edi
| mov -24(esi),ebx
| mov esp,ebp
|
+--- jmp
All three stores at the end of the above loop (offsets -48, -40, -24) can be eliminated. When all stores in the loop are based on the same register (as above, where all stores are based on esi, and esi is loop-invariant), a simple disambiguation based on the offsets would enable the required redundant-store elimination. This may also be possible at the time these stores are inserted - a backward analysis is easy in this case. In this particular case, some of the loads from these addresses can also be moved outside the loop, but general loop-invariant code motion can cause register pressure and spilling, and would need rematerialization. Redundant-store elimination is easier and should always be helpful when it applies.
I checked with Ed, and Tamarin has a filter for deadvars elimination.
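As a concrete sketch of what such a redundant-store elimination pass could look like (illustrative only; this is not nanojit's or Tamarin's actual code, and the toy IR below only models accesses through the single loop-invariant base register, assuming everything else is known not to alias it):

#include <map>
#include <vector>

// Toy linear IR: every Load/Store goes through the same loop-invariant base
// register (esi in the traces above); Def marks any other redefinition of a
// register. Accesses through other bases are assumed not to alias and are
// simply not fed into the pass.
struct Ins {
    enum Kind { Store, Load, Def, Other } kind;
    int reg = -1;     // register stored from / loaded into / redefined
    int offset = 0;   // byte offset from the invariant base
    bool redundant = false;
};

static void eliminateRedundantStores(std::vector<Ins>& body) {
    std::map<int, int> memHolds;  // offset -> register whose value the slot currently holds

    auto forgetReg = [&](int reg) {
        for (auto it = memHolds.begin(); it != memHolds.end(); )
            it = (it->second == reg) ? memHolds.erase(it) : ++it;
    };

    for (Ins& ins : body) {
        switch (ins.kind) {
        case Ins::Store:
            if (memHolds.count(ins.offset) && memHolds[ins.offset] == ins.reg)
                ins.redundant = true;            // the slot already holds this register's value
            else
                memHolds[ins.offset] = ins.reg;  // remember what the slot now holds
            break;
        case Ins::Load:
            forgetReg(ins.reg);                  // the load redefines reg...
            memHolds[ins.offset] = ins.reg;      // ...and reg now matches this slot
            break;
        case Ins::Def:
            forgetReg(ins.reg);                  // reg gets a new value elsewhere
            break;
        case Ins::Other:
            break;
        }
    }
}

Run over the esi-based operations of the sequence in Comment #30, this flags the stores labeled 122 and 123: 664[esi] already holds edi from 112 and 656[esi] already holds ebx from 116, and neither register is redefined in between.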
Comment 37•16 years ago
(In reply to comment #36)
> But, redundant-store elimination is easier and should always be helpful when it
> applies.
Nice! RSE sounds like it would fit in pretty well with the nanojit design (although I could imagine it requiring an additional forward pass, but I'm OK with that, especially with the parallel compiler).
Comment 38•16 years ago
Can we just yoink the RSE filter from Tamarin for our use, or am I misunderstanding?
Comment 39•16 years ago
Ed,
How much work is involved in porting Tamarin's filter, and can it handle the two code sequences in Comment #36 and Comment #30 (#30 is harder)?
Reporter
Comment 40•16 years ago
We do have a store filter for the stack (StackFilter). Similarly, our global area (gp) is alias free and could get a simple GlobalFilter.
Comment 41•16 years ago
The dead store elimination in Tamarin works on the LIR code, so if the stores in question are represented in LIR, it might work for this problem. It is generic, so it could move into nanojit, but it isn't a filter because it can iterate when there are loop backedges (# of passes is typically 1 + max-loop-nest-depth).
At some point in the past, someone (Graydon?) tried to refactor LIR_loop to use LIR_j -> LIR_label instead. There was a slowdown, but I still think this task should be attempted again; if LIR more closely represents what's going on, then we can apply optimizations that don't need specialized trace knowledge.
Comment 42•13 years ago
Obsolete with the removal of tracejit.
Status: NEW → RESOLVED
Closed: 13 years ago
Resolution: --- → WONTFIX