Closed
Bug 600459
Opened 15 years ago
Closed 12 years ago
nanojit: better codegen for div/mod by a constant
Categories
(Core Graveyard :: Nanojit, defect)
Tracking
(Not tracked)
RESOLVED
WONTFIX
People
(Reporter: n.nethercote, Unassigned)
Details
3d-raytrace does 'x % 3' quite a lot. GCC generates this code for 'x % 3'
where 'x' is a signed int32:
// x % 3
// ------
// movl %eax, rx
// movl rK, $1431655766
// imull %rK
// movl rtmp, rx
// sarl rtmp, $31
// subl %edx, rtmp
// leal %edx, (%edx,%edx,2)
// subl rx, %edx
You can do similar things for other small constant values. The explanation
is something like this:
first: x % N == x - (x / N) * N
now if we have: x * K == (x / N) << 32 (true for a special value K)
then: (x * K) >> 32 == x / N
That's my almost-not-quite understanding of it. Better explanations are
welcome!
I wrote a microbenchmark in asm and found that the above code is
approximately 3x faster than doing x%3 with the IDIV instruction on x86. (This roughly matches with what I've heard, which is that IDIV takes ~40 cycles, whereas IMUL takes ~10 cycles, and normal shifts etc take ~1 cycle.)
It would be nice if the i386/X64 back-ends did this.
Comment 1•15 years ago
|
||
> now if we have: x * K == (x / N) << 32 (true for a special value K)
That should be "approximately equal", not "equal". And the special value K is (1 << 32) / (double)N, rounded up to the nearest integer (at least for unsigned division). In what follows, assume C rules for '/' (so it's integer division if both sides are integers, and floating point otherwise).
This works because if we set M = (double)(1 << 32) and L = (1<<32)/(double)N then L <= K < L+1 and we have:
x * L <= x*K < x*(L+1)
x * L / M <= x * K / M < x *(L+1) / M
x / (double)N <= x * K / M < x / (double)N + x / M
floor(x / (double)N) <= floor(x * K / M) < floor(x / (double)N + x / M)
Now floor(x * K / M) is exactly (x*K)>>32. And floor(x / (double)N) = x / N.
So we have:
x / N <= (x * K) >> 32 < floor(x / (double)N + x / M)
But if you recall, x is a 32-bit unsigned integer and M is 2^32. So x/M < 1, and hence the expression on the right hand side is at most x / N + 1. Since the inequality on that side is strict and (x * K) >> 32 is in fact an integer, it must be equal to x / N.
There are apparently complications if signs are involved; I haven't thought through them yet....
Comment 2•15 years ago
|
||
An extremely detailed analysis can be found here:
http://www.cs.uiowa.edu/~jones/bcd/divide.html
Comment 3•15 years ago
|
||
(In reply to comment #0)
> It would be nice if the i386/X64 back-ends did this.
Would be useful to know if similar speedups are likely on other architectures -- I suspect the answer is yes, in which case we could do it at the LIR level.
| Reporter | ||
Comment 4•15 years ago
|
||
(In reply to comment #3)
>
> Would be useful to know if similar speedups are likely on other architectures
> -- I suspect the answer is yes, in which case we could do it at the LIR level.
Does it rely on 64-bit multiplication? We don't have that in LIR at the moment.
Comment 5•15 years ago
|
||
There's also 43 pages of discussion and code on this topic
in Henry S Warren's excellent book "Hacker's Delight".
Comment 6•15 years ago
|
||
> Does it rely on 64-bit multiplication?
Yes. More precisely, it depends on being able to take two 32-bit numbers and generate a pair of 32-bit numbers corresponding to the high and low 32 bits of the product.
Comment 7•15 years ago
|
||
(In reply to comment #6)
> > Does it rely on 64-bit multiplication?
At least x86, arm and ppc have instructions to do 32 *s 32 -> 64 and
32 *u 32 -> 64 multiplication, so this might be worth adding at the
LIR level. Even if you can't, HD sec 8.4.2 shows how to compute the
high half of such multiplications in "16 basic RISC instructions",
including 4 32 * 32 -> 32 multiplications. Not sure if that will
provide a net win, though.
Comment 8•15 years ago
|
||
(In reply to comment #7)
> At least x86, arm and ppc have instructions to do 32 *s 32 -> 64 and
> 32 *u 32 -> 64 multiplication, so this might be worth adding at the
> LIR level.
Yeah, the gotcha is that (IIRC) nanojit only supports 64-bit int types on 64-bit architectures, so this would require a bit of hackery... I suppose we could have some way to have the lo-and-hi results put in specified registers, but it might (indeed) be simpler to do at the backend level given this. Oh well.
| Reporter | ||
Comment 9•15 years ago
|
||
(In reply to comment #8)
>
> Yeah, the gotcha is that (IIRC) nanojit only supports 64-bit int types on
> 64-bit architectures
And even on 64-bit architectures there is no 64-bit multiplication; the only 64-bit operations are those relating to pointers (add, sub, and, or, shifts, etc). So I agree that doing this in back-ends would be best!
Comment 10•15 years ago
|
||
(In reply to comment #8 and comment #9)
(The following is speculative, but I _think_ it's true):
> (In reply to comment #7)
> > At least x86, arm and ppc have instructions to do 32 *s 32 -> 64 and
> > 32 *u 32 -> 64 multiplication, so this might be worth adding at the
> > LIR level.
>
> Yeah, the gotcha is that (IIRC) nanojit only supports 64-bit int types on
> 64-bit architectures, so this would require a bit of hackery...
I think it's not as bad as that. You don't need the full 64 bit
result. What the algorithms require is the high half of the
64 bit result, hence there's no need to have any 64-bit anything.
As an example, on ARM, given this
int div3 ( int x ) { return x / 3; }
gcc-4.5.1 produces this
div3:
;; 'x' is in r0
ldr r3, .L2 ;; r3 := some magic constant
smull r2, r3, r0, r3
sub r0, r3, r0, asr #31
bx lr
The smull is doing "r3(hi):r2(lo) := r0 s* r3", but the r2 value is
ignored.
Indeed:
* HD sec 10.3 says "The basic trick is to multiply by a sort of
reciprocal of the divisor 'd', approximately (2^32) / d, and
extract the leftmost 32 bits of the product."
* The algorithms in HD sec 10 are phrased in terms of mulhs and mulhu,
which are 32 x 32 -> 32 primitives, producing the high half of
the signed/unsigned results respectively.
| Assignee | ||
Updated•12 years ago
|
Product: Core → Core Graveyard
| Reporter | ||
Comment 11•12 years ago
|
||
Nanojit has been dead for several years. Its Bugzilla component has been moved to the graveyard (bug 984276).
I checked all the open bugs. They're all uninteresting, so I'm WONTFIXing them all. Apologies for the bugspam.
Status: ASSIGNED → RESOLVED
Closed: 12 years ago
Resolution: --- → WONTFIX
You need to log in
before you can comment on or make changes to this bug.
Description
•