Gaussian blur (Kraken benchmark) is interesting because it contains a loop which repeats the same operation 4 times, and the current MIR optimizations, such as GVN & LICM, are able to move most of the boilerplate away. Type inference shows these 4 identical operations are made on Doubles (the inputs might be Int32), and they can be divided into 2 contiguous, 16-byte-aligned inputs (2 values from the element vector). The goal would be to merge r & g and b & a into 2 xmm registers and manipulate each pair as one entity (a minimal sketch of the packed arithmetic follows after the plan below). As opposed to our current usage of xmm registers, we would have to introduce a new MIR type to reflect the fact that 2 Values / 2 doubles are packed into one FloatRegister. This would have an impact on snapshots and on the register allocation, to handle any spill.

How we should schedule this work:

1/ Add SIMD support to the assembler: loading 2 Values, checking whether they are boxed ints, unboxing them, multiplying, adding, and dividing them, and storing them back.

2/ Evaluate the performance gain by hacking the engine and manually substituting the compilation result. This will determine whether the previous set of patches should land, and whether we should continue the investigation and implementation.

3/ As this is a complex integration, I suggest adding a flag in the JS Shell to enable or disable this feature, since adding it might have consequences on multiple aspects of the code.

4.1/ Support a simple test case for copying Values with xmm registers. See whether xmm registers are still valuable for a plain copy, or whether we would need to add heuristics to a later optimization phase.

4.2/ Add register allocator & snapshot support for MIRType_PackedValue and MIRType_PackedDouble. Add one instruction each for loading and storing.

5/ Add the rest of the MIR / LIR, as well as the new phase(s) for converting vectorizable code, such as gaussian blur, into SIMD-powered assembly.
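To make the packed-pair idea concrete, here is a minimal standalone sketch using SSE2 intrinsics. The function names, array layout, and weight handling are hypothetical, not part of the proposed patches; the point is only that r & g and b & a each occupy one 16-byte-aligned xmm register and are multiplied, accumulated, and divided as single entities:

  // Hypothetical sketch: pixels stored as 4 contiguous doubles [r, g, b, a],
  // 16-byte aligned, so each channel pair fits one xmm register.
  #include <emmintrin.h>

  // Accumulate one weighted pixel: acc[c] += pixel[c] * weight for each channel.
  void accumulateWeighted(const double *pixel, double weight, double *acc) {
      __m128d w  = _mm_set1_pd(weight);
      __m128d rg = _mm_add_pd(_mm_load_pd(acc),
                              _mm_mul_pd(_mm_load_pd(pixel), w));     // r & g at once
      __m128d ba = _mm_add_pd(_mm_load_pd(acc + 2),
                              _mm_mul_pd(_mm_load_pd(pixel + 2), w)); // b & a at once
      _mm_store_pd(acc, rg);
      _mm_store_pd(acc + 2, ba);
  }

  // Normalize the accumulated pixel by the kernel sum: out[c] = acc[c] / sum.
  void normalize(const double *acc, double sum, double *out) {
      __m128d s = _mm_set1_pd(sum);
      _mm_store_pd(out,     _mm_div_pd(_mm_load_pd(acc),     s));
      _mm_store_pd(out + 2, _mm_div_pd(_mm_load_pd(acc + 2), s));
  }

This is roughly what the generated assembly would compute; the hard part is teaching MIR/LIR, the register allocator, and snapshots to express it, which is what the plan above covers.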
SIMD-based parallelization is an upcoming goal for the ParallelArray project, I believe. I've been curious whether the work there could also be used for vectorizing normal JS, and how often that would be helpful.
This is interesting for sure, but do we know why we are currently slower than V8 on gaussian-blur? There may be some simpler, lower-hanging fruit there.
I finished testing the prototype, which unboxes packed ints (when the Values are ints) into doubles and does packed multiplications, additions, and divisions before storing the doubles back to memory. The prototype (only works on x64) is available at: https://github.com/nbp/mozilla-central/branches/ionmonkey-fosdem-2013 The current results show a 20% improvement (187.1ms --> 149.7ms) over 100 runs of Gaussian blur. Currently this prototype does the unboxing with SIMD, and it will surely benefit from the surely-Double arrays patch that Brian is working on.
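Conceptually, the SIMD unboxing path looks something like the following standalone SSE2 sketch. The tag constant mirrors the x64 NaN-boxing layout but is an assumption here, as are all the names; in the actual engine the failure path would be a bailout, not a boolean return:

  #include <emmintrin.h>
  #include <cstdint>

  // Assumed upper 32 bits of a NaN-boxed int32 Value on x64 (illustrative).
  static const int kInt32TagHi = (int)0xFFF88000u;

  // Load two adjacent boxed Values, check both carry the int32 tag, and
  // convert the two payloads into one packed double.
  bool unboxPairToPackedDouble(const uint64_t *vals, __m128d *out) {
      // boxed, seen as 32-bit lanes: [payload0, tag0, payload1, tag1]
      __m128i boxed = _mm_load_si128((const __m128i *)vals);
      // Regroup to [payload0, payload1, tag0, tag1].
      __m128i split = _mm_shuffle_epi32(boxed, _MM_SHUFFLE(3, 1, 2, 0));
      // Compare every lane against the int32 tag; only lanes 2 & 3 matter.
      __m128i eq = _mm_cmpeq_epi32(split, _mm_set1_epi32(kInt32TagHi));
      if ((_mm_movemask_ps(_mm_castsi128_ps(eq)) & 0xC) != 0xC)
          return false;  // one of the Values is not a boxed int32: bail out
      // Convert the two int32 payloads (low lanes) to two packed doubles.
      *out = _mm_cvtepi32_pd(split);
      return true;
  }

With surely-Double arrays, the tag check and conversion could be skipped entirely and the pair loaded straight into an xmm register, which is why that patch should compose well with this one.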
Created attachment 706909 [details] Code generated by the prototype.

This is the codegen output of the current (as of the date of this message) version of https://github.com/nbp/mozilla-central/branches/ionmonkey-fosdem-2013

This modification adds a few LIR nodes to avoid redoing the register allocation: it reuses the register allocation made on Float registers to allocate Packed-double registers. The snapshot encoding and the *fake* optimization are done on the last MIR step of the graph. This optimization relies on the MIR id & op to substitute/mutate the instructions so that they work on packed doubles (an illustrative sketch follows below). It adds a few arch-specific LIR nodes which are targeted by the Lowering, in order to allocate temporary registers such as the ones needed for unboxing Int32s with SIMD. Snapshots are bad, but no bailout occurs during gaussian blur, so the prototype just encodes packed doubles as doubles.
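For readers unfamiliar with the approach, the *fake* substitution step amounts to something like the following. All the types and helpers here (Instruction, Op, toPackedOp) are hypothetical stand-ins for illustration, not IonMonkey's actual MIR API:

  #include <vector>

  enum class Op { MulDouble, AddDouble, DivDouble,
                  MulPackedD, AddPackedD, DivPackedD };

  struct Instruction {
      unsigned id;  // MIR instruction id, used to target specific instructions
      Op op;
  };

  // Map a scalar-double opcode to its packed-double counterpart.
  static Op toPackedOp(Op op) {
      switch (op) {
        case Op::MulDouble: return Op::MulPackedD;
        case Op::AddDouble: return Op::AddPackedD;
        case Op::DivDouble: return Op::DivPackedD;
        default:            return op;
      }
  }

  // Walk the (flattened) graph and mutate in place the instructions whose
  // ids were marked, by hand, as vectorizable during the manual analysis.
  void substitutePackedDoubles(std::vector<Instruction> &graph,
                               const std::vector<unsigned> &targets) {
      for (Instruction &ins : graph)
          for (unsigned t : targets)
              if (ins.id == t)
                  ins.op = toPackedOp(ins.op);  // operands are reused as-is
      }

Matching on hard-coded ids is obviously prototype-only; a real phase would need an analysis that discovers the packable pairs itself.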
(In reply to Nicolas B. Pierron [:pierron] [:nbp] from comment #3) > Currently this prototype does the unboxing with SIMD, and it will surely > benefit from the surely-Double arrays patch that Brian is working on. That is bug 833898, and it has a patch up for review if you want to test with it.
??? If you look at the attached patches of the dependent bugs, this is far from being an ARM optimization.