Closed Bug 1060435 Opened 10 years ago Closed 8 years ago

[meta] Implement SIMD Flappy Bird demo

Categories

(Core :: JavaScript Engine: JIT, defect)

x86_64
Linux
defect
Not set
normal

Tracking

()

RESOLVED FIXED

People

(Reporter: bbouvier, Unassigned)

References

Details

Attachments

(6 files, 4 obsolete files)

      No description provided.
Depends on: 1060437
Depends on: 1060438
Depends on: 1060441
Attached file fbird.js (obsolete) —
Conversion to asm.js of the non-SIMD and SIMD versions fbird update loops.

Had to emulator the unimplemented operations, so SIMD gives only about a 20% improvement at present.

The biggest performance issue is 'select'.

Saw some clues that the lack of 'splat' could be hitting performance.

There is a mod operation in the hot loop that takes about half the loop instructions to implement. Would like to pad the accel vector to 16 bytes so we can use a mask to avoid this.

The array accesses are outside the main loop and emulating them might be acceptable for performance.

This is a quick conversion - it's not pretty.

Let's reassess the performance when 'select' is implemented.
Attached file fbird.js (obsolete) —
* Padded the accel vector to 16 elements, and used a bit and mask to wrap.

* Used signMask to move the 'select' emulation out of the hot block.

Unfortunately the SIMD version is just not giving us a boost. Here's the hot block now 95% or JS cycles:

   0x00007f15796051c7: mov    %r12d,%r14d
   0x00007f15796051ca: shl    $0x2,%r14d
   0x00007f15796051ce: add    $0xc3500,%r14d
   0x00007f15796051d5: movss  (%r15,%r14,1),%xmm6
   0x00007f15796051db: sub    $0x10,%rsp      \
   0x00007f15796051df: movss  %xmm6,(%rsp)     |
   0x00007f15796051e4: movss  %xmm6,0x4(%rsp)  |
   0x00007f15796051ea: movss  %xmm6,0x8(%rsp)  | splat
   0x00007f15796051f0: movss  %xmm6,0xc(%rsp)  | emulation
   0x00007f15796051f6: movaps (%rsp),%xmm6    /
   0x00007f15796051fa: add    $0x10,%rsp
   0x00007f15796051fe: add    $0x1,%r12d
   0x00007f1579605202: and    $0xf,%r12d
   0x00007f1579605206: movaps %xmm6,%xmm8
   0x00007f157960520a: mulps  %xmm2,%xmm8
   0x00007f157960520e: movaps %xmm3,%xmm7
   0x00007f1579605211: mulps  %xmm8,%xmm7
   0x00007f1579605215: movaps %xmm5,%xmm8
   0x00007f1579605219: mulps  %xmm1,%xmm8
   0x00007f157960521d: addps  %xmm8,%xmm7
   0x00007f1579605221: addps  %xmm7,%xmm4
   0x00007f1579605224: mulps  %xmm1,%xmm6
   0x00007f1579605227: addps  %xmm6,%xmm5
   0x00007f157960522a: movdqa %xmm4,%xmm6
   0x00007f157960522e: cmpnleps %xmm0,%xmm6
   0x00007f1579605232: movmskps %xmm6,%r14d
   0x00007f1579605236: test   %r14d,%r14d
   0x00007f1579605239: je     0x7f1579605363

I suspect the splat emulation is killing performance and will working on this next.
Attachment #8481764 - Attachment is obsolete: true
Any ideas why the block in comment 2 is not giving a performance improvement over the non-SIMD version?

Here's the source code for this block:
 accel = toF(f32[(accelIndex << 2) + maxBirdsx8 >>2]);
 // Work around unimplement splat - this might be hitting performance.
 accelx4 = f4(accel, accel, accel, accel);
 accelIndex = (accelIndex + 1) & accelMask;
 posDeltax4 = f4mul(point5x4, f4mul(accelx4, subTimeDeltaSquaredx4));
 posDeltax4 = f4add(posDeltax4, f4mul(newVelx4, subTimeDeltax4));
 newPosx4 = f4add(newPosx4, posDeltax4);
 newVelx4 = f4add(newVelx4, f4mul(accelx4, subTimeDeltax4));
 cmpx4 = f4greaterThan(newPosx4, maxPosx4);
 mask = cmpx4.signMask;
 if (mask) { ...
Flags: needinfo?(sunfish)
>    0x00007f15796051c7: mov    %r12d,%r14d

We don't yet align loops, so it's possible that we're paying something for that here, but it's hard to predict.

>    0x00007f15796051ca: shl    $0x2,%r14d
>    0x00007f15796051ce: add    $0xc3500,%r14d

This is arithmetic we'd really like to fold into the address, which is a problem we know about, but the scalar version probably has the same problem 

>    0x00007f15796051d5: movss  (%r15,%r14,1),%xmm6
>    0x00007f15796051db: sub    $0x10,%rsp      \
>    0x00007f15796051df: movss  %xmm6,(%rsp)     |
>    0x00007f15796051e4: movss  %xmm6,0x4(%rsp)  |
>    0x00007f15796051ea: movss  %xmm6,0x8(%rsp)  | splat
>    0x00007f15796051f0: movss  %xmm6,0xc(%rsp)  | emulation
>    0x00007f15796051f6: movaps (%rsp),%xmm6    /
>    0x00007f15796051fa: add    $0x10,%rsp

This does seem to be the most likely culprit.

>    0x00007f15796051fe: add    $0x1,%r12d
>    0x00007f1579605202: and    $0xf,%r12d
>    0x00007f1579605206: movaps %xmm6,%xmm8
>    0x00007f157960520a: mulps  %xmm2,%xmm8
>    0x00007f157960520e: movaps %xmm3,%xmm7

It looks like we could avoid this copy by commuting the next mul. I think our commuting heuristics in Lowering.cpp ought to get this case, though we probably haven't hooked them up yet for SIMD ops. It'd be nice to do that, though it's probably not a huge thing here.

>    0x00007f1579605211: mulps  %xmm8,%xmm7
>    0x00007f1579605215: movaps %xmm5,%xmm8
>    0x00007f1579605219: mulps  %xmm1,%xmm8
>    0x00007f157960521d: addps  %xmm8,%xmm7
>    0x00007f1579605221: addps  %xmm7,%xmm4
>    0x00007f1579605224: mulps  %xmm1,%xmm6
>    0x00007f1579605227: addps  %xmm6,%xmm5
>    0x00007f157960522a: movdqa %xmm4,%xmm6

This should ideally be movaps rather than movdqa to avoid domain crossing penalties, but it only affects some processors, like Nehalem, and it isn't a huge penalty in any case.

>    0x00007f157960522e: cmpnleps %xmm0,%xmm6
>    0x00007f1579605232: movmskps %xmm6,%r14d
>    0x00007f1579605236: test   %r14d,%r14d

On a machine with SSE4.1, we could use the ptest instruction here, but it's also probably not top priority.

>    0x00007f1579605239: je     0x7f1579605363
> 
> I suspect the splat emulation is killing performance and will working on
> this next.

Sounds right to me.
Flags: needinfo?(sunfish)
Depends on: 1060789
Attached file fbird.js (obsolete) —
Add in the 'select' code, commented out as it's not ready.

Thank you for the feedback. Implementing splat has improved performance, but the SIMD version is only around 75% faster which is a little frustrating. Shall go over it all again. Here's the hot block now:

   0x00007fa5a6700189:	mov    %r12d,%r14d
   0x00007fa5a670018c:	shl    $0x2,%r14d
   0x00007fa5a6700190:	add    $0xc3500,%r14d
   0x00007fa5a6700197:	movss  (%r15,%r14,1),%xmm6
   0x00007fa5a670019d:	shufps $0x0,%xmm6,%xmm6
   0x00007fa5a67001a1:	add    $0x1,%r12d
   0x00007fa5a67001a5:	and    $0xf,%r12d
   0x00007fa5a67001a9:	movaps %xmm6,%xmm8
=> 0x00007fa5a67001ad:	mulps  %xmm2,%xmm8
   0x00007fa5a67001b1:	movaps %xmm3,%xmm7
   0x00007fa5a67001b4:	mulps  %xmm8,%xmm7
   0x00007fa5a67001b8:	movaps %xmm5,%xmm8
   0x00007fa5a67001bc:	mulps  %xmm0,%xmm8
   0x00007fa5a67001c0:	addps  %xmm8,%xmm7
   0x00007fa5a67001c4:	addps  %xmm7,%xmm4
   0x00007fa5a67001c7:	mulps  %xmm0,%xmm6
   0x00007fa5a67001ca:	addps  %xmm6,%xmm5
   0x00007fa5a67001cd:	movdqa %xmm4,%xmm6
   0x00007fa5a67001d1:	cmpnleps %xmm1,%xmm6
   0x00007fa5a67001d5:	movmskps %xmm6,%r14d
   0x00007fa5a67001d9:	test   %r14d,%r14d
   0x00007fa5a67001dc:	je     0x7fa5a6700306

Shall work on 'select' next unless someone beats me to it. It will avoid some branching which might help. I can optimize the heap access here, move the offset into the instruction, but would need to use a patch that is not expected to land soon, shell give it a try just out of curiosity.
Attachment #8481773 - Attachment is obsolete: true
Here's an untested patch which hooks up the commuting heuristics to eliminate copies, in case anyone wants to try it.
Attached file fbird.js (obsolete) —
* Remove the select workaround, not that we have an implementation.

Here's the hot block now, at around 95% of JS cycles, also with the commute patch applied:

   0x00007ff5d33011aa: mov    %r12d,%r14d
   0x00007ff5d33011ad: shl    $0x2,%r14d
   0x00007ff5d33011b1: add    $0xc3500,%r14d
   0x00007ff5d33011b8: movss  (%r15,%r14,1),%xmm6
   0x00007ff5d33011be: shufps $0x0,%xmm6,%xmm6
   0x00007ff5d33011c2: add    $0x1,%r12d
   0x00007ff5d33011c6: and    $0xf,%r12d
   0x00007ff5d33011ca: movaps %xmm2,%xmm8
   0x00007ff5d33011ce: mulps  %xmm6,%xmm8
   0x00007ff5d33011d2: movaps %xmm3,%xmm7
   0x00007ff5d33011d5: mulps  %xmm8,%xmm7
   0x00007ff5d33011d9: movaps %xmm5,%xmm8
   0x00007ff5d33011dd: mulps  %xmm0,%xmm8
   0x00007ff5d33011e1: addps  %xmm8,%xmm7
   0x00007ff5d33011e5: addps  %xmm4,%xmm7
   0x00007ff5d33011e8: mulps  %xmm0,%xmm6
   0x00007ff5d33011eb: addps  %xmm5,%xmm6
   0x00007ff5d33011ee: movdqa %xmm7,%xmm4
   0x00007ff5d33011f2: cmpnleps %xmm1,%xmm4
   0x00007ff5d33011f6: movaps 0x10(%rsp),%xmm8
   0x00007ff5d33011fc: subps  %xmm6,%xmm8
   0x00007ff5d3301200: movdqa %xmm8,%xmm5 \
   0x00007ff5d3301205: andps  %xmm4,%xmm5 |
   0x00007ff5d3301208: movdqa %xmm4,%xmm9 | select
   0x00007ff5d330120d: andnps %xmm6,%xmm9 |
   0x00007ff5d3301211: orps   %xmm9,%xmm5 /
   0x00007ff5d3301215: add    $0x1,%r13d
   0x00007ff5d3301219: movaps %xmm7,%xmm4
   0x00007ff5d330121c: jmpq   0x7ff5d330119f

Shall give the slat and select patches some more love next - try and get them ready to review.

How does the 'select' implementation look? Can we do any better?
Attachment #8481799 - Attachment is obsolete: true
Flags: needinfo?(sunfish)
(In reply to Douglas Crosher [:dougc] from comment #7)
> Created attachment 8481876 [details]
> fbird.js
> 
> * Remove the select workaround, not that we have an implementation.
> 
> Here's the hot block now, at around 95% of JS cycles, also with the commute
> patch applied:
> 
>    0x00007ff5d33011aa: mov    %r12d,%r14d
>    0x00007ff5d33011ad: shl    $0x2,%r14d
>    0x00007ff5d33011b1: add    $0xc3500,%r14d
>    0x00007ff5d33011b8: movss  (%r15,%r14,1),%xmm6
>    0x00007ff5d33011be: shufps $0x0,%xmm6,%xmm6
>    0x00007ff5d33011c2: add    $0x1,%r12d
>    0x00007ff5d33011c6: and    $0xf,%r12d
>    0x00007ff5d33011ca: movaps %xmm2,%xmm8
>    0x00007ff5d33011ce: mulps  %xmm6,%xmm8
>    0x00007ff5d33011d2: movaps %xmm3,%xmm7
>    0x00007ff5d33011d5: mulps  %xmm8,%xmm7
>    0x00007ff5d33011d9: movaps %xmm5,%xmm8
>    0x00007ff5d33011dd: mulps  %xmm0,%xmm8
>    0x00007ff5d33011e1: addps  %xmm8,%xmm7
>    0x00007ff5d33011e5: addps  %xmm4,%xmm7
>    0x00007ff5d33011e8: mulps  %xmm0,%xmm6
>    0x00007ff5d33011eb: addps  %xmm5,%xmm6
>    0x00007ff5d33011ee: movdqa %xmm7,%xmm4
>    0x00007ff5d33011f2: cmpnleps %xmm1,%xmm4
>    0x00007ff5d33011f6: movaps 0x10(%rsp),%xmm8
>    0x00007ff5d33011fc: subps  %xmm6,%xmm8
>    0x00007ff5d3301200: movdqa %xmm8,%xmm5 \
>    0x00007ff5d3301205: andps  %xmm4,%xmm5 |
>    0x00007ff5d3301208: movdqa %xmm4,%xmm9 | select
>    0x00007ff5d330120d: andnps %xmm6,%xmm9 |
>    0x00007ff5d3301211: orps   %xmm9,%xmm5 /
>    0x00007ff5d3301215: add    $0x1,%r13d
>    0x00007ff5d3301219: movaps %xmm7,%xmm4
>    0x00007ff5d330121c: jmpq   0x7ff5d330119f
> 
> Shall give the slat and select patches some more love next - try and get
> them ready to review.
> 
> How does the 'select' implementation look? Can we do any better?

If we have SSE4.1, we can use blendvps.

Otherwise, andps+andnps+orps is indeed the way to do it. It might be possible to eliminate some of those copies with some cleverness though.

It looks like the commute patch, for its part, didn't do anything. I'll look into that when I get a chance.

For completeness, it looks like we also have a reload, so the register allocation can provably be improved.
Flags: needinfo?(sunfish)
After optimizing the 'select' sequence, moving the index offsets into instructions I see around a 30% performance improvement using SIMD - not great.

However if the 'select' sequence is moved out of the hot loop, by testing with signMask, the I see a 55% performance improvement using SIMD - little better.

Here's the winning loop:

7f72b6c04158 a fbird.js:363:56: Function updateAllSimd - Block 3
   0x00007f72b6c04158:	cmp    0xc(%rsp),%ebp
   0x00007f72b6c0415c:	jge    0x7f72b6c041d7

7f72b6c04162 50 fbird.js:363:56: Function updateAllSimd - Block 4
   0x00007f72b6c04162:	mov    %ebx,%esi
   0x00007f72b6c04164:	and    $0x3c,%esi
   0x00007f72b6c04167:	movss  0xc3500(%r15,%rsi,1),%xmm8
   0x00007f72b6c04171:	shufps $0x0,%xmm8,%xmm8
   0x00007f72b6c04176:	add    $0x4,%ebx
   0x00007f72b6c04179:	movaps %xmm2,%xmm9
   0x00007f72b6c0417d:	mulps  %xmm8,%xmm9
   0x00007f72b6c04181:	movaps %xmm3,%xmm7
   0x00007f72b6c04184:	mulps  %xmm9,%xmm7
   0x00007f72b6c04188:	movaps %xmm5,%xmm9
   0x00007f72b6c0418c:	mulps  %xmm0,%xmm9
   0x00007f72b6c04190:	addps  %xmm9,%xmm7
   0x00007f72b6c04194:	addps  %xmm6,%xmm7
   0x00007f72b6c04197:	mulps  %xmm0,%xmm8
   0x00007f72b6c0419b:	addps  %xmm5,%xmm8
   0x00007f72b6c0419f:	movdqa %xmm7,%xmm6
   0x00007f72b6c041a3:	cmpnleps %xmm1,%xmm6
   0x00007f72b6c041a7:	movmskps %xmm6,%esi
   0x00007f72b6c041aa:	test   %esi,%esi
   0x00007f72b6c041ac:	jne    0x7f72b6c041bb

7f72b6c041b2 9 fbird.js:0:0: Function updateAllSimd - Block 5
   0x00007f72b6c041b2:	movaps %xmm8,%xmm5
   0x00007f72b6c041b6:	jmpq   0x7f72b6c041cc

7f72b6c041bb 11 fbird.js:380:28: Function updateAllSimd - Block 6
   0x00007f72b6c041bb:	movaps %xmm4,%xmm5 \
   0x00007f72b6c041be:	subps  %xmm8,%xmm5 / negate
   0x00007f72b6c041c2:	andps  %xmm6,%xmm5 \
   0x00007f72b6c041c5:	andnps %xmm8,%xmm6 | select in a cold path
   0x00007f72b6c041c9:	orps   %xmm6,%xmm5 /

7f72b6c041cc b fbird.js:380:28: Function updateAllSimd - Block 7
   0x00007f72b6c041cc:	add    $0x1,%ebp
   0x00007f72b6c041cf:	movaps %xmm7,%xmm6
   0x00007f72b6c041d2:	jmpq   0x7f72b6c04158


Whereas this is actually slower, the difference being the 'select' sequence is in the hot path:

7ff87f48a183 60 fbird.js:361:56: Function updateAllSimd - Block 4
   0x00007ff87f48a183:	mov    %ebx,%esi
   0x00007ff87f48a185:	and    $0x3c,%esi
   0x00007ff87f48a188:	movss  0xc3500(%r15,%rsi,1),%xmm7
   0x00007ff87f48a192:	shufps $0x0,%xmm7,%xmm7
   0x00007ff87f48a196:	add    $0x4,%ebx
   0x00007ff87f48a199:	movaps %xmm2,%xmm9
   0x00007ff87f48a19d:	mulps  %xmm7,%xmm9
   0x00007ff87f48a1a1:	movaps %xmm3,%xmm8
   0x00007ff87f48a1a5:	mulps  %xmm9,%xmm8
   0x00007ff87f48a1a9:	movaps %xmm5,%xmm9
   0x00007ff87f48a1ad:	mulps  %xmm0,%xmm9
   0x00007ff87f48a1b1:	addps  %xmm9,%xmm8
   0x00007ff87f48a1b5:	addps  %xmm6,%xmm8
   0x00007ff87f48a1b9:	mulps  %xmm0,%xmm7
   0x00007ff87f48a1bc:	addps  %xmm5,%xmm7
   0x00007ff87f48a1bf:	movdqa %xmm8,%xmm6
   0x00007ff87f48a1c4:	cmpnleps %xmm1,%xmm6
   0x00007ff87f48a1c8:	movaps %xmm4,%xmm5
   0x00007ff87f48a1cb:	subps  %xmm7,%xmm5 - negate
   0x00007ff87f48a1ce:	andps  %xmm6,%xmm5 \
   0x00007ff87f48a1d1:	andnps %xmm7,%xmm6 | select
   0x00007ff87f48a1d4:	orps   %xmm6,%xmm5 /
   0x00007ff87f48a1d7:	add    $0x1,%ebp
   0x00007ff87f48a1da:	movaps %xmm8,%xmm6
   0x00007ff87f48a1de:	jmpq   0x7ff87f48a179

Sorry, no immediate explanation.

Note that for this demo the hot code accounts for only around 20% of the process cycles and this explains in part the poor performance improvement from improvement from using SIMD. The improvement to the JS speed is being washed out by the rest of the code.

So we need to get the 'splat' support landed to see any improvement on this demo, and getting the 'select' support landed would be nice but it needs to be in a cold path so it's not critical.
Try build of all the patches in my queue that I suggest landing to support the Flappy Bird demo. This will also support the Mandelbrot demo:
https://tbpl.mozilla.org/?tree=Try&rev=2931483a0721

The ordering of the patches might also be helpful to know. Most of the patches are reviewed and look ready to land. The 'splat' optimization is ready for review and seems in good shape. The 'select' patch is still rough. Completing the 'splat' and 'select' support would help clean up the demo source code a little and might be achievable.

The float32x4 array support is still missing. Luckily the array access is not in the hot loop in the Flappy Bird demo and we can substitute slower float32 array accesses. Would be great if it could be done too. I notice a lot of float canonicalization on the float32 array stores, and it might not be trivial to get this working for a SIMD array store.
Flags: needinfo?(benj)
(In reply to Douglas Crosher [:dougc] from comment #10)
> Try build of all the patches in my queue that I suggest landing to support
> the Flappy Bird demo. This will also support the Mandelbrot demo:
> https://tbpl.mozilla.org/?tree=Try&rev=2931483a0721

The build has some failures for the SIMD 'select' operation - that patch is a bit rough.

Tested the WinXP build and it runs the Mandelbrot and Flappy Bird SIMD demos. The builds are available at:
https://ftp-ssl.mozilla.org/pub/mozilla.org/firefox/try-builds/dtc-moz@scieneer.com-2931483a0721/
Attached file fbird.js
Attachment #8481876 - Attachment is obsolete: true
Explored the movdqa move noted in comment 4 which should ideally be a movaps and it looks like a problem copying operands that feed into MUST_REUSE_INPUT definitions. Perhaps this is a new issue for SIMD. It would seem more appropriate for this move to use the operands type rather than the type of the definition it feeds into.
Attachment #8482154 - Flags: review?(sunfish)
Comment on attachment 8482154 [details] [diff] [review]
Operand copies, to must-use definitions, should use the operand type not the target definitions type.

Review of attachment 8482154 [details] [diff] [review]:
-----------------------------------------------------------------

Good catch!
Attachment #8482154 - Flags: review?(sunfish) → review+
Just uploaded the demo to https://github.com/bnjbvr/simd-demos
A few patches still need to land before it's usable in nightlies.
Best not land the 'Operand copies, to must-use definitions, should use the operand type not the target definitions type.' patch yet. See some register allocator integrity check failures here that this patch might be causing, or the checks might need updating too.
Alright, I now think all required patches have landed (except for the patch included in this bug). Just need to add it to Peter's repo for demoing and we should be good for closing it.
Flags: needinfo?(benj)
Conversions to run in the JS shell. Helps to isolate some issues and test performance.
This version inlines the 'select' operation in the hot loop. It does not perform as well as expected.
This version uses signMask to move the 'select' operation out of the main loop and is around 3 times faster than the non-SIMD reference version.
No longer depends on: 1060441
No longer depends on: 1060438
It's been implemented, it runs as a test under tests/asm.js/simd-fbirds.js and as a benchmark under AWFY: https://arewefastyet.com/#machine=29&view=single&suite=asmjs-ubench&subtest=fbirds-native
Status: NEW → RESOLVED
Closed: 8 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: