Bug 582152 Opened 11 years ago Closed 11 years ago
JM: Sync earlier to avoid load stalls on doubles
Moh, any idea what sort of penalty we're paying in that example snippet? We're also getting patterns where it's just one store - either to the low or high bits - again followed immediately by a 64-bit load.
The stores in the given example are NOT forwarded to the load that follows, and there are penalties involved that are processor-dependent. In general, for a load to forward data from a store, the data that the load needs must be fully contained within a single store (and then various alignment conditions may also apply). I'm sure you have already consulted the IA optimization manual, Table 3-1, page 148: http://www.intel.com/assets/pdf/manual/248966.pdf In particular, Example 3-30 is of interest.

With SSE2, instead of the original sequence:

  mov   [ebx + 0x80], eax
  mov   [ebx + 0x84], ecx
  movsd xmm0, [ebx + 0x80]

a much better sequence is:

  movd     xmm0, eax
  movd     xmm1, ecx
  unpcklps xmm0, xmm1

The attached test shows the gain to be >4x on Core 2 and 7.5x on Pentium 4:

Core 2:
  run1 is 4.43 times slower than run2 (624:141)
  run1 is 4.31 times slower than run2 (608:141)
  run1 is 4.57 times slower than run2 (640:140)
  run1 is 4.00 times slower than run2 (624:156)

Pentium 4:
  run1 is 7.54 times slower than run2 (1531:203)
  run1 is 7.62 times slower than run2 (1547:203)
  run1 is 7.62 times slower than run2 (1547:203)
  run1 is 7.06 times slower than run2 (1547:219)

On Penryn (SSE 4.1), we have other good instructions for this purpose. More specifically, instead of the original sequence:

  mov   [ebx + 0x80], eax
  mov   [ebx + 0x84], ecx
  movsd xmm0, [ebx + 0x80]

a much better sequence is:

  movd   xmm0, eax
  pinsrd xmm0, ecx, 1

The attached test shows the gain to be ~8x on Nehalem (Core i7):

  run1 is 6.91 times slower than run2 (546:79)
  run1 is 8.82 times slower than run2 (547:62)
  run1 is 8.82 times slower than run2 (547:62)
  run1 is 6.82 times slower than run2 (532:78)

-moh
We might be doing this in nanojit.
Moh, that is totally awesome. I had looked at the manual, but only section 3.8 ("Floating point Considerations"), somehow missed 3.6. Thanks for your insight. Will have a go at this tomorrow.
Thanks, Moh! I did look at Table 3-1. I didn't see a row for the case of 32-bit aligned stores followed by a 64-bit load, but the gist of the surrounding discussion did suggest there would be a penalty. I guess for Fx4 we are allowed to assume SSE2, but we don't have SSE4 detection yet.

I have a few questions about this replacement:

> With SSE2, instead of the original sequence:
>
>   mov   [ebx + 0x80], eax
>   mov   [ebx + 0x84], ecx
>   movsd xmm0, [ebx + 0x80]
>
> a much better sequence is:
>
>   movd     xmm0, eax
>   movd     xmm1, ecx
>   unpcklps xmm0, xmm1

The short version of the question is: does this basically fix our problem, or do we still need to think about moving this earlier in the instruction stream? My guess is that this fixes it and there isn't much benefit to trying to schedule it earlier.

This is what I'm thinking: if I read Appendix C right, on Core those instructions can issue one per cycle, the movd will basically finish immediately, and the unpcklps has a latency of 1 or 2. So the whole sequence looks like it costs 3 cycles or so. Is it true that this code sequence costs mostly in terms of issue bandwidth, so it doesn't matter too much where it is placed? If we do anything that incurs a penalty, the penalty will be much more than 3 cycles, so the SSE versions strictly dominate.

Is it correct that the stores and the load have to be separated pretty far apart (say, at least 10-20 instructions) to avoid the penalty? It would probably be pretty hard for us to achieve that kind of separation. And even then, we still have to pay a 3-cycle wait for the L1 hit at the end anyway, assuming that's on the critical path.
Right. The sequence involves only direct reg-reg instructions, and on out-of-order systems, such as Core-2, the instruction placement generally doesn't matter much. On Atom, however, instruction scheduling is more important. But, the gain of the SSE2 sequence on Atom is already > 2x. I think we should be fine with the SSE2 sequence for now. If you see stalls on real code, let me know.
Hrm, this patch passes trace-tests but is 10x slower. Shark points to this code:

(gdb) x/10i 0x2c9450
0x2c9450: movd     xmm1,esi
0x2c9454: movd     xmm2,eax
0x2c9458: unpcklpd xmm1,xmm2
0x2c945c: mulsd    xmm0,xmm1
0x2c9460: movsd    QWORD PTR [ebx+0x98],xmm0
Okay, this patch is broken; investigating.
unpcklpd generates a 128-bit vector from the low 64-bit halves of two xmms, no?
Yeah, that was the new bug. Catching some more optimization cases and posting results shortly. Looks like a 2-3ms win on my microbenchmark.
Can you post the new disassembly? I want to use this too.
The bug was that I accidentally emitted a 0x66 prefix before the UNPCKLPS op (which is just 0x0F 0x14), turning it into UNPCKLPD.

> movd     xmm1,esi
> movd     xmm2,eax
> unpcklps xmm1,xmm2
> mulsd    xmm0,xmm1
> movsd    QWORD PTR [ebx+0x98],xmm0

This was a really solid win: about 14ms on graphs (updating now), and 22ms on my MacBook. We should investigate SSE4.1 detection and using PINSRD; this is apparently very hot.

http://hg.mozilla.org/users/danderson_mozilla.com/moo/rev/aaa0f0daa528
Filed follow-up for PINSRD as bug 582785. Thanks again, Moh!
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED