600459 - nanojit: better codegen for div/mod by a constant

Reporter

Description

•

15 years ago

3d-raytrace does 'x % 3' quite a lot. GCC generates this code for 'x % 3' where 'x' is a signed int32: // x % 3 // ------ // movl %eax, rx // movl rK, $1431655766 // imull %rK // movl rtmp, rx // sarl rtmp, $31 // subl %edx, rtmp // leal %edx, (%edx,%edx,2) // subl rx, %edx You can do similar things for other small constant values. The explanation is something like this: first: x % N == x - (x / N) * N now if we have: x * K == (x / N) << 32 (true for a special value K) then: (x * K) >> 32 == x / N That's my almost-not-quite understanding of it. Better explanations are welcome! I wrote a microbenchmark in asm and found that the above code is approximately 3x faster than doing x%3 with the IDIV instruction on x86. (This roughly matches with what I've heard, which is that IDIV takes ~40 cycles, whereas IMUL takes ~10 cycles, and normal shifts etc take ~1 cycle.) It would be nice if the i386/X64 back-ends did this.

Boris Zbarsky [:bzbarsky]

Comment 1

•

15 years ago

> now if we have: x * K == (x / N) << 32 (true for a special value K) That should be "approximately equal", not "equal". And the special value K is (1 << 32) / (double)N, rounded up to the nearest integer (at least for unsigned division). In what follows, assume C rules for '/' (so it's integer division if both sides are integers, and floating point otherwise). This works because if we set M = (double)(1 << 32) and L = (1<<32)/(double)N then L <= K < L+1 and we have: x * L <= x*K < x*(L+1) x * L / M <= x * K / M < x *(L+1) / M x / (double)N <= x * K / M < x / (double)N + x / M floor(x / (double)N) <= floor(x * K / M) < floor(x / (double)N + x / M) Now floor(x * K / M) is exactly (x*K)>>32. And floor(x / (double)N) = x / N. So we have: x / N <= (x * K) >> 32 < floor(x / (double)N + x / M) But if you recall, x is a 32-bit unsigned integer and M is 2^32. So x/M < 1, and hence the expression on the right hand side is at most x / N + 1. Since the inequality on that side is strict and (x * K) >> 32 is in fact an integer, it must be equal to x / N. There are apparently complications if signs are involved; I haven't thought through them yet....

Steven Johnson

Comment 2

•

15 years ago

An extremely detailed analysis can be found here: http://www.cs.uiowa.edu/~jones/bcd/divide.html

Steven Johnson

Comment 3

•

15 years ago

(In reply to comment #0) > It would be nice if the i386/X64 back-ends did this. Would be useful to know if similar speedups are likely on other architectures -- I suspect the answer is yes, in which case we could do it at the LIR level.

Nicholas Nethercote [inactive]

Reporter

Comment 4

•

15 years ago

(In reply to comment #3) > > Would be useful to know if similar speedups are likely on other architectures > -- I suspect the answer is yes, in which case we could do it at the LIR level. Does it rely on 64-bit multiplication? We don't have that in LIR at the moment.

Julian Seward [:jseward]

Comment 5

•

15 years ago

There's also 43 pages of discussion and code on this topic in Henry S Warren's excellent book "Hacker's Delight".

Boris Zbarsky [:bzbarsky]

Comment 6

•

15 years ago

> Does it rely on 64-bit multiplication? Yes. More precisely, it depends on being able to take two 32-bit numbers and generate a pair of 32-bit numbers corresponding to the high and low 32 bits of the product.

Julian Seward [:jseward]

Comment 7

•

15 years ago

(In reply to comment #6) > > Does it rely on 64-bit multiplication? At least x86, arm and ppc have instructions to do 32 *s 32 -> 64 and 32 *u 32 -> 64 multiplication, so this might be worth adding at the LIR level. Even if you can't, HD sec 8.4.2 shows how to compute the high half of such multiplications in "16 basic RISC instructions", including 4 32 * 32 -> 32 multiplications. Not sure if that will provide a net win, though.

Steven Johnson

Comment 8

•

15 years ago

(In reply to comment #7) > At least x86, arm and ppc have instructions to do 32 *s 32 -> 64 and > 32 *u 32 -> 64 multiplication, so this might be worth adding at the > LIR level. Yeah, the gotcha is that (IIRC) nanojit only supports 64-bit int types on 64-bit architectures, so this would require a bit of hackery... I suppose we could have some way to have the lo-and-hi results put in specified registers, but it might (indeed) be simpler to do at the backend level given this. Oh well.

Nicholas Nethercote [inactive]

Reporter

Comment 9

•

15 years ago

(In reply to comment #8) > > Yeah, the gotcha is that (IIRC) nanojit only supports 64-bit int types on > 64-bit architectures And even on 64-bit architectures there is no 64-bit multiplication; the only 64-bit operations are those relating to pointers (add, sub, and, or, shifts, etc). So I agree that doing this in back-ends would be best!

Julian Seward [:jseward]

Comment 10

•

15 years ago

(In reply to comment #8 and comment #9) (The following is speculative, but I _think_ it's true): > (In reply to comment #7) > > At least x86, arm and ppc have instructions to do 32 *s 32 -> 64 and > > 32 *u 32 -> 64 multiplication, so this might be worth adding at the > > LIR level. > > Yeah, the gotcha is that (IIRC) nanojit only supports 64-bit int types on > 64-bit architectures, so this would require a bit of hackery... I think it's not as bad as that. You don't need the full 64 bit result. What the algorithms require is the high half of the 64 bit result, hence there's no need to have any 64-bit anything. As an example, on ARM, given this int div3 ( int x ) { return x / 3; } gcc-4.5.1 produces this div3: ;; 'x' is in r0 ldr r3, .L2 ;; r3 := some magic constant smull r2, r3, r0, r3 sub r0, r3, r0, asr #31 bx lr The smull is doing "r3(hi):r2(lo) := r0 s* r3", but the r2 value is ignored. Indeed: * HD sec 10.3 says "The basic trick is to multiply by a sort of reciprocal of the divisor 'd', approximately (2^32) / d, and extract the leftmost 32 bits of the product." * The algorithms in HD sec 10 are phrased in terms of mulhs and mulhu, which are 32 x 32 -> 32 primitives, producing the high half of the signed/unsigned results respectively.

Nobody; OK to take it and work on it

Assignee

Updated

•

12 years ago

Product: Core → Core Graveyard

Nicholas Nethercote [inactive]

Reporter

Comment 11

•

12 years ago

Nanojit has been dead for several years. Its Bugzilla component has been moved to the graveyard (bug 984276). I checked all the open bugs. They're all uninteresting, so I'm WONTFIXing them all. Apologies for the bugspam.

Status: ASSIGNED → RESOLVED

Closed: 12 years ago

Resolution: --- → WONTFIX

Bugzilla

nanojit: better codegen for div/mod by a constant

Categories

(Core Graveyard :: Nanojit, defect)

Tracking

(Not tracked)

People

(Reporter: n.nethercote, Unassigned)

References

Details

Crash Data

Security

(public)

User Story

Description

Comment 1

Comment 2

Comment 3

Comment 4

Comment 5

Comment 6

Comment 7

Comment 8

Comment 9

Comment 10

Updated

Comment 11