When doing an ADD under heavy register pressure, JM reuses an operand register for the result. To ensure that the operand can be rematerialized in the slow path if the ADD overflows, JM ensures that the operand is synced, issuing a store before the operation if necessary. In the JM branch, this happens three times in the am3 loop on v8-crypto. These stores are unnecessary, since even if the ADD overflows, the original operands can still be recovered:
add x y                  ; x = x + y (may overflow)
> neg x                  ; slow path: x = -(x + y)
> add x y                ; x = -x
> neg x                  ; x = x, original operand recovered
> sync x, y, call stub
This works even if the negs overflow (x == 0x80000000), since the identities hold exactly under two's-complement wrapping.
This also applies to INC/DEC operations, which currently use an extra register and issue an extra move to hold the old value in case of overflow. Not such a big deal for am3; maybe wait for INC/DEC ops to be coalesced into ADDs (bug 610618?). This should also be doable for SUB, but that case is more complicated (SUB is not commutative) and doesn't bite on am3.
Created attachment 493743 [details] [diff] [review]
JM patch for ADD
Applied this patch to JM, but it could apply to TM too. This only handles ADD operations.
(In reply to comment #1)
> Created attachment 493743 [details] [diff] [review]
> JM patch for ADD
> Applied this patch to JM, but could apply to TM too. This only handles ADD
Does the patch speed up v8-crypto or some other benchmark of interest?
With type inference this gets me a couple ms on v8-crypto (will apply to TM to measure without inference). This plus a fix for types in arginc (http://hg.mozilla.org/projects/jaegermonkey/rev/30ffdc01adf2) removed 9 memory accesses from the loop in am3, and cut 6ms from my time on v8-crypto (62ms -> 56ms).
This loop has very high register pressure, so it is a good regalloc test case. It also has a pretty long linear body, so the current regalloc does a good job with it. After the above two patches, there are still 33 memory accesses in this loop, 13 of which are necessary for getelem/setelems. Of the remaining 20, 7 can be removed with a linear scan regalloc (WIP), and several more by using ebp as a GPR.