Created attachment 609983 [details] [diff] [review] /home/mrosenberg/patches/optMod-r0.patch X % (1<<y) can in general be converted into X &((1<<y)-1), with a few extra checks to account for negative X values. In addition, since on ARM, x % y is implemented by calling __eabi_divmod, I've also optimized the case X % ((1<<y)-1) into a small loop that processes the elements more than one bit at a time.

Comment on attachment 609983 [details] [diff] [review] /home/mrosenberg/patches/optMod-r0.patch I'm expecting Sean to review the parts that aren't large blobs of ARM Macro assembler, and Jacob to take the rest, although feel free to comment on anything that seems wrong/out of place.

Comment on attachment 609983 [details] [diff] [review] /home/mrosenberg/patches/optMod-r0.patch Review of attachment 609983 [details] [diff] [review]: ----------------------------------------------------------------- r+ with comments addressed ::: js/src/ion/MCallOptimize.cpp @@ +3,5 @@ > * > * ***** BEGIN LICENSE BLOCK ***** > * Version: MPL 1.1/GPL 2.0/LGPL 2.1 > * > + * The contents of this file are subject t_ the Mozilla Public License Version nit: accidental change from 'o' to '_'. ::: js/src/ion/arm/LIR-arm.h @@ +161,5 @@ > +class LModPowTwoI : public LInstructionHelper<1,1,0> > +{ > + public: > + LIR_HEADER(ModPowTwoI); > + const int32 shift_; nit: Above "public:", with accessor. @@ +163,5 @@ > + public: > + LIR_HEADER(ModPowTwoI); > + const int32 shift_; > + > + LModPowTwoI(const LAllocation &lhs, int32 shift) : shift_(shift) nit: declaration pattern @@ +173,5 @@ > +class LModMaskI : public LInstructionHelper<1,1,1> > +{ > + public: > + LIR_HEADER(ModMaskI); > + const int32 shift_; nit: Above "public:", with accessor. @@ +175,5 @@ > + public: > + LIR_HEADER(ModMaskI); > + const int32 shift_; > + > + LModMaskI(const LAllocation &lhs, const LDefinition &temp1, int32 shift) : shift_(shift) nit: ": shift_(shift)" on its own line with two-space indent. ::: js/src/ion/arm/Lowering-arm.cpp @@ +298,5 @@ > + JS_FLOOR_LOG2(shift, rhs); > + if (1 << shift == rhs) { > + LModPowTwoI *lir = new LModPowTwoI(useRegister(mod->lhs()), shift); > + return (assignSnapshot(lir) && define(lir, mod)); > + } else if ((1 << (shift+1)) - 1 == rhs) { As with x64, this branch can be taken if rhs == 0x80000000. Is that OK? ::: js/src/ion/shared/LIR-x86-shared.h @@ +84,5 @@ > +{ > + public: > + LIR_HEADER(ModPowTwoI); > + const int shift_; > + LModPowTwoI(const LAllocation &lhs, int32 shift) : shift_(shift) { nits: ": shift_(shift)" on its own line with two-space indent before ":". "const int shift_" at top of declaration before "public:". Accessor for shift_. Vertical whitespace between LIR_HEADER(ModPowTwoI): and LModPowTwoI(...). ::: js/src/ion/x64/Lowering-x64.cpp @@ +180,5 @@ > bool > LIRGeneratorX64::lowerModI(MMod *mod) > { > + if (mod->rhs()->isConstant()) { > + int32 rhs = mod->rhs()->toConstant()->value().toInt32(); mod->rhs()->isConstant() does not imply that the rhs is an Int32 -- needs a check. Non-integers have ToNumber() called on them -- it may be worth checking whether those are integers. For example, (3 % "2") === 1. @@ +183,5 @@ > + if (mod->rhs()->isConstant()) { > + int32 rhs = mod->rhs()->toConstant()->value().toInt32(); > + int32 shift; > + JS_FLOOR_LOG2(shift, rhs); > + if (1 << shift == rhs) { non-nit: If rhs is 0x80000000 and shift and rhs are int32, then 1 << shift == rhs, but rhs == -1. The modulus is not equivalent to bitand with the mask for that value. Applies to Lowering-arm also. @@ +187,5 @@ > + if (1 << shift == rhs) { > + LModPowTwoI *lir = new LModPowTwoI(useRegisterAtStart(mod->lhs()), shift); > + return assignSnapshot(lir) && defineReuseInput(lir, mod, 0); > + } > + } It may also be worth handling the case of rhs === 1 || rhs === -1. If the lhs is Int32 (not double, which breaks in case of Infinity), then lhs % [1 | -1] === lhs, and we can skip the operation entirely. Perhaps handled in GVN? @@ +192,2 @@ > LModI *lir = new LModI(useFixed(mod->lhs(), rax), useRegister(mod->rhs())); > return assignSnapshot(lir) && defineFixed(lir, mod, LAllocation(AnyRegister(rdx))); lowerModI should be in Lowering-x86-shared. X64 aliases 'eax' to 'rax' and 'edx' to 'rdx', so using the 32-bit names will work on both platforms. Additionally, it seems that the x64 version uses useRegisterAtStart(), while the x86 version uses useRegister(). useRegister() should be sufficient.

Comment on attachment 609983 [details] [diff] [review] /home/mrosenberg/patches/optMod-r0.patch Review of attachment 609983 [details] [diff] [review]: ----------------------------------------------------------------- I'm not finished with this yet, but I won't have time to finish it today so I'm submitting a partial review now. ::: js/src/ion/arm/CodeGenerator-arm.cpp @@ +562,5 @@ > return true; > } > > bool > +CodeGeneratorARM::visitModPowTwoI(LModPowTwoI *ins) I thought about this for a while because it annoyed me that it takes so many instructions to do this, and some are conditional too. In the end, I found this: asr r1, r0, #(shift-1) add r1, r0, r1, LSR #(32-shift) bic r1, r1, #(1<<shift-1) sub r1, r0, r1 (in = r0, out = r1) (The method came from an excellent book called "Hacker's Delight"; I just ported it to ARM.) @@ +572,5 @@ > + masm.ma_b(&fin, Assembler::Zero); > + masm.ma_rsb(Imm32(0), out, NoSetCond, Assembler::Signed); > + masm.ma_and(Imm32((1<<ins->shift_)-1), out); > + masm.ma_rsb(Imm32(0), out, SetCond, Assembler::Signed); > + if (!bailoutIf(Assembler::Zero, ins->snapshot())) { Does this exist just to test the negative-zero case? If so, it might be nice to run the test only for negative inputs, since positive inputs with a result of 0 might be common, and will never produce negative zero. ::: js/src/ion/arm/MacroAssembler-arm.cpp @@ +722,5 @@ > return Always; > } > > +void > +MacroAssemblerARM::ma_mod_mask(Register src, Register dest, Register hold, int32 shift) I don't like the naming of this. It doesn't handle masks, it just handles values that are one less than a power of two. @@ +728,5 @@ > + // MATH: > + // We wish to compute x % (1<<y) - 1 for a known constant, y. > + // first, let b = (1<<y) and C = (1<<y)-1, then think of the 32 bit dividend as > + // a number in base b, namely c_0*1 + c_1*b + c_2*b^2 ... c_n*b^n > + // now, since both addition and multiplication commute with modulus, I don't think this is true. Consider this (with made-up values to demonstrate): ---- Addition ---- (a*b + c*d) % m => (a*b)%m + (c*d)%m (5*2 + 3*2) % 4 => (5*2)%4 + (3*2)%4 16 % 4 => 10 %4 + 6 %4 0 => 2 + 2 Thinking logically, addition commutes with modulus in a sense, but only if you apply the modulus to the final result as well. That is: (a*b + c*d) % m => ((a*b)%m + (c*d)%m) % m We see this with fixed-width registers, since every arithmetic operator effectively has a mod(2^width) effect when it write the results back to the register. ---- Both ---- (a*b + c*d) % m => (a%m)*(b%m) + (c%m)*(d%m) (5*2 + 3*2) % 4 => (5%4)*(2%4) + (3%4)*(2%4) 16 % 4 => 1 * 2 + 3 * 2 0 => 7 That clearly doesn't match up. Have I completely misunderstood what you were trying to say here? @@ +735,5 @@ > + // now, since b == C + 1, b % C == 1, and b^n % C == 1 > + // this means that the whole thing simplifies to: > + // c_0 + c_1 + c_2 ... c_n % C > + // each c_n can easily be computed by a shift/bitextract, and the modulus can be maintained > + // by simply subtracting by C whenever the number gets over C. You can use that technique for any value, not just for one-less-than-powers-of-two values. For example: uint32_t n = 1022; uint32_t m = 10; for (uint32_t p = clz(m); p >= 0; p--) { while (n >= (m<<p)) { n -= (m<<p); } } return n; // The result will be 1022 % 10 = 2. // The same technique can be used to perform division by counting the // iterations (modified by p). At a glance, the above pseudo-code is similar to your implementation, except that I don't handle negatives. I haven't looked closely at it, though. It's not necessary to use complicated maths to explain this.

(In reply to Jacob Bramley [:jbramley] from comment #3) > Comment on attachment 609983 [details] [diff] [review] > /home/mrosenberg/patches/optMod-r0.patch > > Review of attachment 609983 [details] [diff] [review]: > ----------------------------------------------------------------- > > I'm not finished with this yet, but I won't have time to finish it today so > I'm submitting a partial review now. Fine by me. > > asr r1, r0, #(shift-1) > add r1, r0, r1, LSR #(32-shift) > bic r1, r1, #(1<<shift-1) > sub r1, r0, r1 > > (in = r0, out = r1) > > (The method came from an excellent book called "Hacker's Delight"; I just > ported it to ARM.) > That is neat, I'll have to sit down and figure out how it works. > @@ +572,5 @@ > > + masm.ma_b(&fin, Assembler::Zero); > > + masm.ma_rsb(Imm32(0), out, NoSetCond, Assembler::Signed); > > + masm.ma_and(Imm32((1<<ins->shift_)-1), out); > > + masm.ma_rsb(Imm32(0), out, SetCond, Assembler::Signed); > > + if (!bailoutIf(Assembler::Zero, ins->snapshot())) { > > Does this exist just to test the negative-zero case? If so, it might be nice > to run the test only for negative inputs, since positive inputs with a > result of 0 might be common, and will never produce negative zero. I could try that. As it is, we don't execute the bailout if we got a result of positive zero, since the only instruction that executed that set a condition code was the initial movs, and that did not set the zero flag due to jumping to the end of this LInstruction very early. Were you thinking of something along the lines of (Assuming we keep this implementation, rather than the unconditional one you gave above): movs dest, src; rsbmi dest, dest, #0 and dest dest #mask bpl fin rsbs dest, dest, #0 ldrz pc, bailout_addr fin: > ::: js/src/ion/arm/MacroAssembler-arm.cpp > @@ +722,5 @@ > > return Always; > > } > > > > +void > > +MacroAssemblerARM::ma_mod_mask(Register src, Register dest, Register hold, int32 shift) > > I don't like the naming of this. It doesn't handle masks, it just handles > values that are one less than a power of two. yeah, someone suggested that name very late at night, and it was a whole lot shorter than ma_mod_pow_two_minus_one(). In general, I'm really bad at choosing names, so any suggestions are welcome. > > @@ +728,5 @@ > > + // MATH: > > + // We wish to compute x % (1<<y) - 1 for a known constant, y. > > + // first, let b = (1<<y) and C = (1<<y)-1, then think of the 32 bit dividend as > > + // a number in base b, namely c_0*1 + c_1*b + c_2*b^2 ... c_n*b^n > > + // now, since both addition and multiplication commute with modulus, > > I don't think this is true. Consider this (with made-up values to > demonstrate): > > ---- Addition ---- > > (a*b + c*d) % m => (a*b)%m + (c*d)%m > (5*2 + 3*2) % 4 => (5*2)%4 + (3*2)%4 > 16 % 4 => 10 %4 + 6 %4 > 0 => 2 + 2 > > Thinking logically, addition commutes with modulus in a sense, but only > if you apply the modulus to the final result as well. That is: > yeah, I was speaking of congruence classes modulo some constant, rather than the actual modulus operator, which converts all elements of a congruence class to a single canonical value (or one of two in the case of JS) > (a*b + c*d) % m => ((a*b)%m + (c*d)%m) % m > > We see this with fixed-width registers, since every arithmetic operator > effectively has a mod(2^width) effect when it write the results back to > the register. > > ---- Both ---- > > (a*b + c*d) % m => (a%m)*(b%m) + (c%m)*(d%m) > > (5*2 + 3*2) % 4 => (5%4)*(2%4) + (3%4)*(2%4) > 16 % 4 => 1 * 2 + 3 * 2 > 0 => 7 > 1*2 == 2; 3*2 == 6; 2+6 == 8 \equiv 0 mod 4 > For example: > > uint32_t n = 1022; > uint32_t m = 10; > for (uint32_t p = clz(m); p >= 0; p--) { > while (n >= (m<<p)) { > n -= (m<<p); > } > } > return n; > > // The result will be 1022 % 10 = 2. > // The same technique can be used to perform division by counting the > // iterations (modified by p). > > At a glance, the above pseudo-code is similar to your implementation, > except that I don't handle negatives. I haven't looked closely at it, > though. It's not necessary to use complicated maths to explain this. I've thought about the similarities between my implementation, and the one that is already used for div. I believe that the actual algorithm used for div is very similar to unrolling this loop (I think they also get rid of the inner loop by observing that it will run either 0 or 1 times) The advantage of my algorithm is the number of iterations you need to perform. replacing 10 with 15, this will (presuming you meant p = clz(m)-clz(n)): b = 1022 m = 15 p = clz(15) - clz(1022) -> p = 6 1022 >= 960 -> true n -= m<<p -> 1022 -= 960 -> n = 62 62 >= 960 -> false p-- -> p = 5 62 >= 480 -> false p-- -> p = 4 62 >= 240 -> false p-- -> p = 3 62 >= 120 -> false p-- -> p = 2 62 >= 60 -> true n -= m << p -> 62 -= 60 -> n=2 2 >= 60 -> false p-- -> p = 1 2 >= 30 -> false p-- -> p = 0 2 >= 15 -> false my algorithm expressed in C is roughly: uint32_t n = 1022; uint32_t m = 15; uint32_t log2_m = 32 - clz(m) uint32_t acc = 0; uint32 tmp; while (n != 0) { acc += n & m; if (acc >= m) acc -= m; n >>= log2_m; } return acc; which will have a trace that looks like: n = 1022 m = 15 log2_m = 32 - clz(m) -> log2_m = 4 1022 != 0 -> true acc += n & m -> 0 += 1022 & 15 -> acc = 14 acc >= m -> 14 >= 15 -> false n >>= log2_m -> 1022 >= 4 -> 63 n != 0 -> 63 != 0 -> true acc += n & m -> 14 += 15 -> acc = 29 acc >= m -> 29 >= 15 -> true acc -= m -> 29 -= 14 -> acc = 14 n >>= log2_m -> 63 >>= 4 -> n = 3 n != 0 -> 3 != 0 -> true acc += n & m -> 14 += 3 & 15 -> acc = 17 acc >= m -> 17 >= 15 -> true acc -= m -> 17 -= 15 -> acc = 2 n >>= log2_m -> 3 >>= 4 -> n = 0 n != 0 -> 0 != 0 -> false In this example, it is only a few evaluations shorter, but I think in general, it does less work because each step of the loop processes at *least* two bits of n, whereas decrementing p by one each time only processes a single bit.

(In reply to Marty Rosenberg [:mjrosenb] from comment #4) > (In reply to Jacob Bramley [:jbramley] from comment #3) > > @@ +572,5 @@ > > > + masm.ma_b(&fin, Assembler::Zero); > > > + masm.ma_rsb(Imm32(0), out, NoSetCond, Assembler::Signed); > > > + masm.ma_and(Imm32((1<<ins->shift_)-1), out); > > > + masm.ma_rsb(Imm32(0), out, SetCond, Assembler::Signed); > > > + if (!bailoutIf(Assembler::Zero, ins->snapshot())) { > > > > Does this exist just to test the negative-zero case? If so, it might be nice > > to run the test only for negative inputs, since positive inputs with a > > result of 0 might be common, and will never produce negative zero. > I could try that. As it is, we don't execute the bailout if we got a result > of positive zero, since the only instruction that executed that set a > condition code was the initial movs, and that did not set the zero flag due > to jumping to the end of this LInstruction very early. > > Were you thinking of something along the lines of (Assuming we keep this > implementation, rather than the unconditional one you gave above): > > ... I was, though now I see it, it probably wouldn't be any faster at all than just executing the check. I think what I was thinking was this: cmp src, #0 bpl 1f rsb dest, src, #0 and dest, dest, #mask rsbs dest, dest, #0 ldreq pc, bailout_addr b fin 1: and dest, src, #mask fin: The positive case is then really quick (if the branch is predicted), and the only conditional instructions are branches. > > > +void > > > +MacroAssemblerARM::ma_mod_mask(Register src, Register dest, Register hold, int32 shift) > > > > I don't like the naming of this. It doesn't handle masks, it just handles > > values that are one less than a power of two. > yeah, someone suggested that name very late at night, and it was a whole lot > shorter than > ma_mod_pow_two_minus_one(). In general, I'm really bad at choosing names, > so any suggestions are welcome. Well if it were me it'd be ma_mod_pow_two_minus_one, but you already said that that is too long. How about "ma_mod_max_uintn"? > > > > @@ +728,5 @@ > > > + // MATH: > > > + // We wish to compute x % (1<<y) - 1 for a known constant, y. > > > + // first, let b = (1<<y) and C = (1<<y)-1, then think of the 32 bit dividend as > > > + // a number in base b, namely c_0*1 + c_1*b + c_2*b^2 ... c_n*b^n > > > + // now, since both addition and multiplication commute with modulus, > > > > I don't think this is true. Consider this (with made-up values to > > demonstrate): > > > > ---- Addition ---- > > > > (a*b + c*d) % m => (a*b)%m + (c*d)%m > > (5*2 + 3*2) % 4 => (5*2)%4 + (3*2)%4 > > 16 % 4 => 10 %4 + 6 %4 > > 0 => 2 + 2 > > > > Thinking logically, addition commutes with modulus in a sense, but only > > if you apply the modulus to the final result as well. That is: > > > yeah, I was speaking of congruence classes modulo some constant, rather than > the actual modulus operator, which converts all elements of a congruence > class to a single canonical value (or one of two in the case of JS) Ok, I believe you, though I still think the comment is misleading :-) > In this example, it is only a few evaluations shorter, but I think in > general, it does less work because each step of the loop processes at > *least* two bits of n, whereas decrementing p by one each time only > processes a single bit. Fair enough. You have clearly given it more thought than I have!

landed (with the wrong bug number in the title, my bad): http://hg.mozilla.org/projects/ionmonkey/rev/fbb1bab307bf