Closed Bug 582152 Opened 11 years ago Closed 11 years ago

JM: Sync earlier to avoid load stalls on doubles

Categories

(Core :: JavaScript Engine, defect)

x86
macOS
defect
Not set
normal

Tracking

()

RESOLVED FIXED

People

(Reporter: dvander, Unassigned)

References

Details

Attachments

(3 files)

Reducing math-cordic to not have any slow paths (inlining its calls, taking out >> since bug 578517 is WIP), we're still 2ms slower than JavaScriptCore - 13ms versus 11ms.

Sharking points to double loads, reminiscent of bug 578538 comment #7. The problem is that a value tracked on the stack might be in registers, and not yet synced to memory. We can get code like this:

> mov [ebx + 0x80], eax
> mov [ebx + 0x84], ecx
> movsd xmm0, [ebx + 0x80]

Sometimes only half of the value needs to be written, so there's a 32-bit store and then a 64-bit load. I'm not sure whether this is worse, but Shark doesn't like it either.

There's lots of ways to try tackling this, the easiest is probably to forcefully sync the top two stack slots before checking types in arithmetic ops. It's probably not a big win yet though, so attaching to PunyWins.
Moh, any idea what sort of penalty we're paying in that example snippet? We're also getting patterns where it's just one store - either to the low or high bits - again followed immediately by a 64-bit load.
The stores in the given example are NOT forwarded to the load that follows and there are penalties involved that are processor dependent. In general, for a load to forward data from a store, the data that the load needs must be fully contained within a single store (and then various alignment conditions may also exist).

I'm sure you have already consulted the IA optimization manual, Table 3.1, Page 148: http://www.intel.com/assets/pdf/manual/248966.pdf
In particular, Example 3:30 is of interest.

With SSE2,  instead of the original sequence:

	mov [ebx + 0x80], eax
	mov [ebx + 0x84], ecx
	movsd xmm0, [ebx + 0x80]

a much better sequence is:

	movd xmm0, eax
        movd xmm1, ecx
        unpcklps xmm0, xmm1

The attached test shows the gain to be >4x on Core 2 and 7.5x on Pentium4:

Core2:
run1 is 4.43 times slower than run2 (624:141)
run1 is 4.31 times slower than run2 (608:141)
run1 is 4.57 times slower than run2 (640:140)
run1 is 4.00 times slower than run2 (624:156)

Pentium-4:
run1 is 7.54 times slower than run2 (1531:203)
run1 is 7.62 times slower than run2 (1547:203)
run1 is 7.62 times slower than run2 (1547:203)
run1 is 7.06 times slower than run2 (1547:219)

On Penryn (SSE 4.1), we have other good instructions for this purpose.
More specifically, instead of the original sequence:

	mov [ebx + 0x80], eax
	mov [ebx + 0x84], ecx
	movsd xmm0, [ebx + 0x80]

a much better sequence is:
	movd   xmm0, eax 
	pinsrd xmm0, ecx, 1 

The attached test shows the gain to be ~8x on Nehalem (Core i7):
        run1 is 6.91 times slower than run2 (546:79)
        run1 is 8.82 times slower than run2 (547:62)
        run1 is 8.82 times slower than run2 (547:62)
        run1 is 6.82 times slower than run2 (532:78)	

-moh
We might be doing this in nanojit.
Moh, that is totally awesome. I had looked at the manual, but only section 3.8 ("Floating point Considerations"), somehow missed 3.6.

Thanks for your insight. Will have a go at this tomorrow.
Thanks, Moh! I did look at Table 3-1. I didn't see a row for the case of 32-bit aligned stores followed by a 64-bit load, but the gist of the surrounding discussion did suggest there would be a penalty. 

I guess for Fx4 we are allowed to assume SSE2, but we don't have SSE4 detection yet. I have a few questions about this replacement:

> With SSE2,  instead of the original sequence:
>
>    mov [ebx + 0x80], eax
>    mov [ebx + 0x84], ecx
>    movsd xmm0, [ebx + 0x80]
>
> a much better sequence is:
>
>    movd xmm0, eax
>    movd xmm1, ecx
>    unpcklps xmm0, xmm1

The short version of the question is: does this basically fix our problem, or do we still need to think about moving this earlier in the instruction stream? My guess is that this fixes and there isn't much benefit to trying to schedule it earlier. This is what I'm thinking:

If I read Appendix C right, on Core those instructions can issue one per cycle, and the movd will basically finish immediately, and the unpcklps has a latency of 1 or 2. So the whole sequence looks like it costs 3 cycles or so. Is it true that this code sequence costs mostly in terms of issue bandwidth, so it doesn't matter too much where it is placed?

If we do anything that incurs a penalty, the penalty will be much more than 3 cycles, so the SSE versions strictly dominate. Is it correct that the stores and loads have to separated pretty far apart (say, at least 10-20 instructions) to avoid the penalty? It would probably be pretty hard for us to achieve that kind of separation. And even then, we still have to pay a 3 cycle wait for the L1 hit at the end anyway, assuming that's on the critical path.
Right. The sequence involves only direct reg-reg instructions, and on out-of-order systems, such as Core-2, the instruction placement generally doesn't matter much. On Atom, however, instruction scheduling is more important. But, the gain of the SSE2 sequence on Atom is already > 2x. I think we should be fine with the SSE2 sequence for now. If you see stalls on real code, let me know.
Attached patch attemptSplinter Review
Hrm, this patch passes trace-tests but is 10X slower. Shark points to this code:

(gdb) x/10i 0x2c9450
0x2c9450:	movd   xmm1,esi
0x2c9454:	movd   xmm2,eax
0x2c9458:	unpcklpd xmm1,xmm2
0x2c945c:	mulsd  xmm0,xmm1
0x2c9460:	movsd  QWORD PTR [ebx+0x98],xmm0
Okay, this patch is broken - investigating
unpckld generates a 128bit vector from two 64-bit xmms no?
Yeah, that was the new bug. Catching some more optimization cases and posting results shortly. Looks like a 2-3ms win on my microbenchmark.
Can you post the new disassembly? I want to use this too.
Bug was that I accidentally emitted an 0x66 prefix before the UNPCKLPS op (which is just 0x0F 0x14).

> movd   xmm1,esi
> movd   xmm2,eax
> unpcklps xmm1,xmm2
> mulsd  xmm0,xmm1
> movsd  QWORD PTR [ebx+0x98],xmm0

This was a really solid win, like 14ms on graphs (updating now), and 22ms on my macbook.

We should investigate SSE4.1 detection and using PINSRD, this is apparently very hot.

http://hg.mozilla.org/users/danderson_mozilla.com/moo/rev/aaa0f0daa528
Blocks: JaegerSpeed
No longer blocks: JaegerPunyWins
Filed follow-up for PINSRD as bug 582785. Thanks again, Moh!
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.