Unfortunately, I cannot test this patch. It removes three unnecessary instructions from the inner loop of the multiplication by using a different variant of the multiply-and-accumulate instruction, one that performs an additional addition and is available in ARM architecture version 6 and above. I think that covers all the CPUs we care about.
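For readers following along, here is a rough C sketch of the kind of inner loop this is about. It is not the patch itself: `mul_add_row` is a hypothetical name, and I am assuming the instruction meant is UMAAL, which computes a 32x32-bit product plus two 32-bit addends into a 64-bit result. The point of the sketch is only that the product, the accumulator word, and the carry can be summed in one step without overflowing 64 bits.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not from the patch): c[0..n-1] += a[0..n-1] * b.
 * Returns the carry word that falls out of the top limb. */
uint32_t mul_add_row(uint32_t *c, const uint32_t *a, size_t n, uint32_t b)
{
    uint32_t carry = 0;
    for (size_t i = 0; i < n; i++) {
        /* Worst case: (2^32-1)^2 + 2*(2^32-1) = 2^64 - 1, so the sum
         * always fits in 64 bits. A multiply-accumulate with a second
         * addend can do this whole line in one instruction. */
        uint64_t t = (uint64_t)a[i] * b + c[i] + carry;
        c[i] = (uint32_t)t;           /* low word goes back to the result */
        carry = (uint32_t)(t >> 32);  /* high word carries into the next limb */
    }
    return carry;
}
```

Whether a compiler actually emits a single instruction for the loop body depends on the target and flags; the hand-written assembly in the patch does not rely on that.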
The above was written in a more optimistic time. Consider this a placeholder for the next day or two while I work on optimizing further using our Raspberry Pi builds.
Created attachment 8780317: Basic loop. Faster than my experiment with software pipelining. This is a basic loop that proved faster than my software-pipelined loop.
Update: Measuring performance on Treeherder is hard, so running experiments will require a dedicated Raspberry Pi. It's possible the pipelined version was actually faster and something else interfered with the experiment, but I have no way to tell.