[meta] IonMonkey: Use SIMD to optimize gaussian blur.

NEW
Unassigned

Status

Core
JavaScript Engine
5 years ago
2 years ago

People

(Reporter: nbp, Unassigned)

Tracking

(Depends on: 3 bugs)

Firefox Tracking Flags

(Not tracked)

Details

Attachments

(1 attachment)

(Reporter)

Description

5 years ago
Gaussian blur (Kraken benchmark) is interesting because it contains a loop that repeats the same operation 4 times, and the current MIR optimizations, such as GVN & LICM, are able to move most of the boilerplate away.

These 4 identical operations are performed on Doubles (the inputs might be Int32), as inferred by type inference, and they can be divided into 2 contiguous, 16-byte-aligned inputs (each being 2 values from the element vector).

The goal would be to merge r & g and b & a into 2 xmm registers, and to manipulate each pair as one entity. Unlike our current usage of xmm registers, we would have to introduce a new MIR type to reflect the fact that 2 values / 2 doubles are packed into one FloatRegister. This would have an impact on snapshots, and on register allocation in order to handle any spill.
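To make the pairing concrete, here is a minimal sketch, not the engine's code, of the four-channel accumulation written once with scalars and once with r & g and b & a each held in a two-lane value. `PackedDouble`, the `packed*` helpers, and both `blurPixel*` functions are hypothetical names introduced for illustration; the struct only models an xmm register holding two doubles, and the pixel layout (4 doubles per tap) is an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical model of one xmm register holding two doubles.
struct PackedDouble {
    double lo;
    double hi;
};

// Lane-wise helpers: each stands in for one packed instruction on a real register.
inline PackedDouble packedMul(PackedDouble a, double k) { return {a.lo * k, a.hi * k}; }
inline PackedDouble packedAdd(PackedDouble a, PackedDouble b) { return {a.lo + b.lo, a.hi + b.hi}; }
inline PackedDouble packedDiv(PackedDouble a, double d) { return {a.lo / d, a.hi / d}; }

// Scalar reference: four independent multiply-adds per kernel tap.
// px holds r, g, b, a for each tap; weights holds one kernel weight per tap.
void blurPixelScalar(const std::vector<double>& px, const std::vector<double>& weights,
                     double norm, double out[4]) {
    assert(px.size() == 4 * weights.size());
    double r = 0, g = 0, b = 0, a = 0;
    for (std::size_t i = 0; i < weights.size(); ++i) {
        r += px[4 * i + 0] * weights[i];
        g += px[4 * i + 1] * weights[i];
        b += px[4 * i + 2] * weights[i];
        a += px[4 * i + 3] * weights[i];
    }
    out[0] = r / norm; out[1] = g / norm; out[2] = b / norm; out[3] = a / norm;
}

// Packed variant: r & g share one two-lane value, b & a share another,
// so each tap needs two multiply-adds instead of four.
void blurPixelPacked(const std::vector<double>& px, const std::vector<double>& weights,
                     double norm, double out[4]) {
    assert(px.size() == 4 * weights.size());
    PackedDouble rg{0, 0}, ba{0, 0};
    for (std::size_t i = 0; i < weights.size(); ++i) {
        PackedDouble pixelRG{px[4 * i + 0], px[4 * i + 1]};  // models one 16-byte load
        PackedDouble pixelBA{px[4 * i + 2], px[4 * i + 3]};  // models the next 16-byte load
        rg = packedAdd(rg, packedMul(pixelRG, weights[i]));
        ba = packedAdd(ba, packedMul(pixelBA, weights[i]));
    }
    rg = packedDiv(rg, norm);
    ba = packedDiv(ba, norm);
    out[0] = rg.lo; out[1] = rg.hi; out[2] = ba.lo; out[3] = ba.hi;
}
```

Both functions perform the same per-lane operations in the same order, so they should produce identical results; only the number of operations issued per tap differs.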

How we should schedule such things:

1/ Add SIMD support to the assembler for loading 2 Values, checking whether they are boxed ints, unboxing them, multiplying them, adding them, dividing them, and storing them.

2/ Evaluate the performance gain by hacking the engine and doing a manual substitution of the compilation result. This will determine if the previous set of patches should land or not, and if we should continue any investigation and implementation.

3/ As this is a complex integration, I suggest adding a flag to enable or disable this feature in the JS Shell, since adding such a feature might have consequences on multiple aspects of the code.

4.1/ Support a simple test case for copying values with xmm registers.  See whether xmm registers are still worthwhile for a plain copy, or whether we would need to add heuristics to a later optimization phase.

4.2/ Add register allocator & snapshot support for MIRType_PackedValue and MIRType_PackedDouble. Add one instruction each for loading and storing.

5/ Add the rest of the MIR / LIR, as well as the new phase(s) for converting vectorized code, such as gaussian blur, into SIMD-powered assembly.
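As an illustration of what step 1 asks of the assembler, the sketch below loads 2 adjacent Values, checks each for a boxed int, unboxes to doubles, and stores doubles back. Everything here is an assumption made for the example: the `Value` alias, the tag constants, and all function names are hypothetical, and the boxing layout is a simplified NaN-boxing scheme, not necessarily the engine's actual encoding.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

// Hypothetical 64-bit boxed value: the upper 17 bits hold a tag, and an int32
// payload sits in the low 32 bits. Any other bit pattern is treated as a double.
using Value = std::uint64_t;
constexpr unsigned kTagShift = 47;          // assumed tag position
constexpr std::uint64_t kInt32Tag = 0x1FFF1; // assumed tag for boxed int32

inline Value boxInt32(std::int32_t i) {
    return (kInt32Tag << kTagShift) | static_cast<std::uint32_t>(i);
}

inline Value boxDouble(double d) {
    Value v;
    std::memcpy(&v, &d, sizeof v);  // doubles are stored as their raw bits
    return v;
}

inline bool isBoxedInt32(Value v) {
    return (v >> kTagShift) == kInt32Tag;
}

// Hypothetical model of one xmm register holding two doubles.
struct PackedDouble {
    double lo;
    double hi;
};

// Unbox one lane: convert a boxed int32 to double, otherwise reinterpret the bits.
inline double unboxLane(Value v) {
    if (isBoxedInt32(v)) {
        return static_cast<double>(static_cast<std::int32_t>(static_cast<std::uint32_t>(v)));
    }
    double d;
    std::memcpy(&d, &v, sizeof d);
    return d;
}

// Load 2 adjacent Values (16 bytes) and unbox them into one packed pair.
inline PackedDouble loadPacked(const Value* elems) {
    return {unboxLane(elems[0]), unboxLane(elems[1])};
}

// Store both lanes back as doubles.
inline void storePacked(Value* elems, PackedDouble p) {
    elems[0] = boxDouble(p.lo);
    elems[1] = boxDouble(p.hi);
}
```

The per-lane branch in `unboxLane` is written for readability; a real SIMD sequence would more likely compare both tags at once and blend the results, which is the kind of support step 1 would need to add.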
Doing SIMD-based parallelization is an upcoming goal for the ParallelArray project, I believe.  I've been curious whether the work there could also be used for vectorizing normal JS, and how often that would be helpful.
(Reporter)

Updated

5 years ago
Depends on: 832777
(Reporter)

Updated

5 years ago
Depends on: 832778
(Reporter)

Updated

5 years ago
Depends on: 832779
This is interesting for sure, but do we know why we are currently slower than V8 on gaussian-blur? There may be some simpler, lower-hanging fruit there.
(Reporter)

Comment 3

5 years ago
I finished testing the prototype, which unboxes packed ints (if the values are ints) into doubles and performs packed multiplications, additions, and divisions before storing the doubles back to memory.

The prototype (x64 only) is available at:
https://github.com/nbp/mozilla-central/branches/ionmonkey-fosdem-2013

The current results show a 20% improvement (187.1 ms --> 149.7 ms) over 100 runs of Gaussian blur.

Currently this prototype does the unboxing with SIMD, and it will surely benefit from the surely-Double arrays patch that Brian is working on.
(Reporter)

Comment 4

5 years ago
Created attachment 706909 [details]
Code generated by the prototype.

This is the codegen output of the current (as of the date of this message) version of
https://github.com/nbp/mozilla-central/branches/ionmonkey-fosdem-2013

This modification adds a few LIR nodes to avoid doing a separate register allocation; it reuses the register allocation made for Float registers to allocate Packed-double registers.  The snapshot encoding and the *fake* optimization are done in the last MIR step of the graph.

This optimization relies on the MIR id & op to substitute/mutate the instructions so that they work on Packed doubles.  It adds a few arch-specific LIR nodes, targeted by the Lowering, in order to allocate temporary registers such as the one needed for unboxing Int32s with SIMD.

Snapshots are bad, but no bailout occurs during gaussian blur, so the prototype just encodes packed doubles as doubles.
(In reply to Nicolas B. Pierron [:pierron] [:nbp] from comment #3)
> Currently this prototype does the unboxing with SIMD, and it will surely
> benefit from the surely-Double arrays patch on which Brian is working on.

That is bug 833898, and has a patch up for review if you want to test with it.
(Reporter)

Updated

5 years ago
Duplicate of this bug: 837734
Whiteboard: [ARM-opt]
(Reporter)

Comment 7

4 years ago
???

If you look at the attached patches of the dependent bugs, this is far from being an ARM optimization.
Whiteboard: [ARM-opt]
(Assignee)

Updated

3 years ago
Assignee: general → nobody