Closed Bug 600459 Opened 15 years ago Closed 12 years ago

nanojit: better codegen for div/mod by a constant

Categories

(Core Graveyard :: Nanojit, defect)

x86
macOS
defect
Not set
normal

Tracking

(Not tracked)

RESOLVED WONTFIX

People

(Reporter: n.nethercote, Unassigned)

Details

3d-raytrace does 'x % 3' quite a lot. GCC generates this code for 'x % 3' where 'x' is a signed int32: // x % 3 // ------ // movl %eax, rx // movl rK, $1431655766 // imull %rK // movl rtmp, rx // sarl rtmp, $31 // subl %edx, rtmp // leal %edx, (%edx,%edx,2) // subl rx, %edx You can do similar things for other small constant values. The explanation is something like this: first: x % N == x - (x / N) * N now if we have: x * K == (x / N) << 32 (true for a special value K) then: (x * K) >> 32 == x / N That's my almost-not-quite understanding of it. Better explanations are welcome! I wrote a microbenchmark in asm and found that the above code is approximately 3x faster than doing x%3 with the IDIV instruction on x86. (This roughly matches with what I've heard, which is that IDIV takes ~40 cycles, whereas IMUL takes ~10 cycles, and normal shifts etc take ~1 cycle.) It would be nice if the i386/X64 back-ends did this.
> now if we have: x * K == (x / N) << 32 (true for a special value K) That should be "approximately equal", not "equal". And the special value K is (1 << 32) / (double)N, rounded up to the nearest integer (at least for unsigned division). In what follows, assume C rules for '/' (so it's integer division if both sides are integers, and floating point otherwise). This works because if we set M = (double)(1 << 32) and L = (1<<32)/(double)N then L <= K < L+1 and we have: x * L <= x*K < x*(L+1) x * L / M <= x * K / M < x *(L+1) / M x / (double)N <= x * K / M < x / (double)N + x / M floor(x / (double)N) <= floor(x * K / M) < floor(x / (double)N + x / M) Now floor(x * K / M) is exactly (x*K)>>32. And floor(x / (double)N) = x / N. So we have: x / N <= (x * K) >> 32 < floor(x / (double)N + x / M) But if you recall, x is a 32-bit unsigned integer and M is 2^32. So x/M < 1, and hence the expression on the right hand side is at most x / N + 1. Since the inequality on that side is strict and (x * K) >> 32 is in fact an integer, it must be equal to x / N. There are apparently complications if signs are involved; I haven't thought through them yet....
An extremely detailed analysis can be found here: http://www.cs.uiowa.edu/~jones/bcd/divide.html
(In reply to comment #0) > It would be nice if the i386/X64 back-ends did this. Would be useful to know if similar speedups are likely on other architectures -- I suspect the answer is yes, in which case we could do it at the LIR level.
(In reply to comment #3) > > Would be useful to know if similar speedups are likely on other architectures > -- I suspect the answer is yes, in which case we could do it at the LIR level. Does it rely on 64-bit multiplication? We don't have that in LIR at the moment.
There's also 43 pages of discussion and code on this topic in Henry S Warren's excellent book "Hacker's Delight".
> Does it rely on 64-bit multiplication? Yes. More precisely, it depends on being able to take two 32-bit numbers and generate a pair of 32-bit numbers corresponding to the high and low 32 bits of the product.
(In reply to comment #6) > > Does it rely on 64-bit multiplication? At least x86, arm and ppc have instructions to do 32 *s 32 -> 64 and 32 *u 32 -> 64 multiplication, so this might be worth adding at the LIR level. Even if you can't, HD sec 8.4.2 shows how to compute the high half of such multiplications in "16 basic RISC instructions", including 4 32 * 32 -> 32 multiplications. Not sure if that will provide a net win, though.
(In reply to comment #7) > At least x86, arm and ppc have instructions to do 32 *s 32 -> 64 and > 32 *u 32 -> 64 multiplication, so this might be worth adding at the > LIR level. Yeah, the gotcha is that (IIRC) nanojit only supports 64-bit int types on 64-bit architectures, so this would require a bit of hackery... I suppose we could have some way to have the lo-and-hi results put in specified registers, but it might (indeed) be simpler to do at the backend level given this. Oh well.
(In reply to comment #8) > > Yeah, the gotcha is that (IIRC) nanojit only supports 64-bit int types on > 64-bit architectures And even on 64-bit architectures there is no 64-bit multiplication; the only 64-bit operations are those relating to pointers (add, sub, and, or, shifts, etc). So I agree that doing this in back-ends would be best!
(In reply to comment #8 and comment #9) (The following is speculative, but I _think_ it's true): > (In reply to comment #7) > > At least x86, arm and ppc have instructions to do 32 *s 32 -> 64 and > > 32 *u 32 -> 64 multiplication, so this might be worth adding at the > > LIR level. > > Yeah, the gotcha is that (IIRC) nanojit only supports 64-bit int types on > 64-bit architectures, so this would require a bit of hackery... I think it's not as bad as that. You don't need the full 64 bit result. What the algorithms require is the high half of the 64 bit result, hence there's no need to have any 64-bit anything. As an example, on ARM, given this int div3 ( int x ) { return x / 3; } gcc-4.5.1 produces this div3: ;; 'x' is in r0 ldr r3, .L2 ;; r3 := some magic constant smull r2, r3, r0, r3 sub r0, r3, r0, asr #31 bx lr The smull is doing "r3(hi):r2(lo) := r0 s* r3", but the r2 value is ignored. Indeed: * HD sec 10.3 says "The basic trick is to multiply by a sort of reciprocal of the divisor 'd', approximately (2^32) / d, and extract the leftmost 32 bits of the product." * The algorithms in HD sec 10 are phrased in terms of mulhs and mulhu, which are 32 x 32 -> 32 primitives, producing the high half of the signed/unsigned results respectively.
Product: Core → Core Graveyard
Nanojit has been dead for several years. Its Bugzilla component has been moved to the graveyard (bug 984276). I checked all the open bugs. They're all uninteresting, so I'm WONTFIXing them all. Apologies for the bugspam.
Status: ASSIGNED → RESOLVED
Closed: 12 years ago
Resolution: --- → WONTFIX
You need to log in before you can comment on or make changes to this bug.