Bug 582152 Opened 11 years ago Closed 11 years ago
JM: Sync earlier to avoid load stalls on doubles
Moh, any idea what sort of penalty we're paying in that example snippet? We're also getting patterns where it's just one store - either to the low or high bits - again followed immediately by a 64-bit load.
The stores in the given example are NOT forwarded to the load that follows, and there are penalties involved that are processor-dependent. In general, for a load to forward data from a store, the data that the load needs must be fully contained within a single store (and then various alignment conditions may also apply). I'm sure you have already consulted the IA optimization manual, Table 3-1, page 148: http://www.intel.com/assets/pdf/manual/248966.pdf In particular, Example 3-30 is of interest.

With SSE2, instead of the original sequence:

  mov   [ebx + 0x80], eax
  mov   [ebx + 0x84], ecx
  movsd xmm0, [ebx + 0x80]

a much better sequence is:

  movd     xmm0, eax
  movd     xmm1, ecx
  unpcklps xmm0, xmm1

The attached test shows the gain to be >4x on Core 2 and 7.5x on Pentium 4:

Core 2:
  run1 is 4.43 times slower than run2 (624:141)
  run1 is 4.31 times slower than run2 (608:141)
  run1 is 4.57 times slower than run2 (640:140)
  run1 is 4.00 times slower than run2 (624:156)

Pentium 4:
  run1 is 7.54 times slower than run2 (1531:203)
  run1 is 7.62 times slower than run2 (1547:203)
  run1 is 7.62 times slower than run2 (1547:203)
  run1 is 7.06 times slower than run2 (1547:219)

On Penryn (SSE 4.1), we have other good instructions for this purpose. More specifically, instead of the original sequence:

  mov   [ebx + 0x80], eax
  mov   [ebx + 0x84], ecx
  movsd xmm0, [ebx + 0x80]

a much better sequence is:

  movd   xmm0, eax
  pinsrd xmm0, ecx, 1

The attached test shows the gain to be ~8x on Nehalem (Core i7):

  run1 is 6.91 times slower than run2 (546:79)
  run1 is 8.82 times slower than run2 (547:62)
  run1 is 8.82 times slower than run2 (547:62)
  run1 is 6.82 times slower than run2 (532:78)

-moh
We might be doing this in nanojit.
Moh, that is totally awesome. I had looked at the manual, but only section 3.8 ("Floating point Considerations"), somehow missed 3.6. Thanks for your insight. Will have a go at this tomorrow.
Thanks, Moh! I did look at Table 3-1. I didn't see a row for the case of 32-bit aligned stores followed by a 64-bit load, but the gist of the surrounding discussion did suggest there would be a penalty. I guess for Fx4 we are allowed to assume SSE2, but we don't have SSE4 detection yet.

I have a few questions about this replacement:

> With SSE2, instead of the original sequence:
>
>   mov   [ebx + 0x80], eax
>   mov   [ebx + 0x84], ecx
>   movsd xmm0, [ebx + 0x80]
>
> a much better sequence is:
>
>   movd     xmm0, eax
>   movd     xmm1, ecx
>   unpcklps xmm0, xmm1

The short version of the question is: does this basically fix our problem, or do we still need to think about moving this earlier in the instruction stream? My guess is that this fixes it and there isn't much benefit to trying to schedule it earlier.

This is what I'm thinking: if I read Appendix C right, on Core those instructions can issue one per cycle, the movd will basically finish immediately, and the unpcklps has a latency of 1 or 2. So the whole sequence looks like it costs 3 cycles or so. Is it true that this code sequence costs mostly in terms of issue bandwidth, so it doesn't matter too much where it is placed? If we do anything that incurs a penalty, the penalty will be much more than 3 cycles, so the SSE versions strictly dominate.

Is it correct that the stores and the load have to be separated pretty far apart (say, at least 10-20 instructions) to avoid the penalty? It would probably be pretty hard for us to achieve that kind of separation. And even then, we still have to pay a 3-cycle wait for the L1 hit at the end anyway, assuming that's on the critical path.
Right. The sequence involves only direct reg-reg instructions, and on out-of-order systems, such as Core-2, the instruction placement generally doesn't matter much. On Atom, however, instruction scheduling is more important. But, the gain of the SSE2 sequence on Atom is already > 2x. I think we should be fine with the SSE2 sequence for now. If you see stalls on real code, let me know.
Hrm, this patch passes trace-tests but is 10x slower. Shark points to this code:

(gdb) x/10i 0x2c9450
0x2c9450: movd     xmm1,esi
0x2c9454: movd     xmm2,eax
0x2c9458: unpcklpd xmm1,xmm2
0x2c945c: mulsd    xmm0,xmm1
0x2c9460: movsd    QWORD PTR [ebx+0x98],xmm0
Okay, this patch is broken; investigating.
unpcklpd generates a 128-bit vector from the low 64-bit halves of two xmms, no?
Yeah, that was the new bug. Catching some more optimization cases and posting results shortly. Looks like a 2-3ms win on my microbenchmark.
Can you post the new disassembly? I want to use this too.
The bug was that I accidentally emitted a 0x66 prefix before the UNPCKLPS op (which is just 0x0F 0x14), turning it into UNPCKLPD.

> movd     xmm1,esi
> movd     xmm2,eax
> unpcklps xmm1,xmm2
> mulsd    xmm0,xmm1
> movsd    QWORD PTR [ebx+0x98],xmm0

This was a really solid win: about 14ms on graphs (updating now), and 22ms on my MacBook. We should investigate SSE4.1 detection and using PINSRD; this is apparently very hot.

http://hg.mozilla.org/users/danderson_mozilla.com/moo/rev/aaa0f0daa528
Filed follow-up for PINSRD as bug 582785. Thanks again, Moh!
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → FIXED