Closed Bug 976110 Opened 6 years ago Closed 4 years ago

IonMonkey: Optimize integer divide by constant

Categories

(Core :: JavaScript Engine: JIT, enhancement)

42 Branch
enhancement
Not set

Tracking

()

RESOLVED FIXED
mozilla42

People

(Reporter: sunfish, Assigned: collares, Mentored)

References

Details

(Whiteboard: [lang=c++][js:p2])

Attachments

(3 files, 19 obsolete files)

19.72 KB, patch
collares
: review+
RyanVM
: checkin+
Details | Diff | Splinter Review
10.12 KB, patch
sunfish
: review+
RyanVM
: checkin+
Details | Diff | Splinter Review
4.46 KB, patch
sunfish
: review+
RyanVM
: checkin+
Details | Diff | Splinter Review
IonMonkey currently lowers integer division by a constant to a regular divide instruction, which is very slow. There are well-known techniques (see Hacker's Delight, or your favorite C++ compiler) for divison by constants which produce much faster code.

This is a significant source of slowness in asmjs-ubench benchmarks fasta, copy, and corrections, for example.
We can also use this kind of optimizations for:
  Math.floor(a / b)
  (a / b) | 0
Whiteboard: [mentor=sunfish][lang=c++]
I'd like to work on this. I have my copy of Hacker's Delight ready :) 

I'll try to show up in #ionmonkey later today.
Sounds great! I'm sunfish on #ionmonkey :-).
Assignee: nobody → mau
Whiteboard: [mentor=sunfish][lang=c++] → [mentor=sunfish][lang=c++][js:p2]
Status: NEW → ASSIGNED
Looks like v8 just landed this,

https://code.google.com/p/v8/source/detail?r=19749

On 2 asm.js benchmarks they have very large speedups, copy and fasta. Copy is 30% faster and now beats us and native by a wide margin.
Maurício, are you still working on this? :)
Sorry for the delay. I'm working on this, yes, and I think I'll have something up for review in a couple of days.
Sorry for the delay again! I'll clean up some duplication in the patch and post it as soon as I get some sleep. Here are the results I see on asmjs-ubench:

$ python harness.py ~/clean-inbound/js/src/opt/js/src/js 
copy - 7701
corrections - 6616
fannkuch - 3091
fasta - 11713
life - 4115
memops - 8143
primes - 8296
skinning - 15015

$ python harness.py ~/mozilla-inbound/js/src/opt/js/src/js 
copy - 4905
corrections - 6502
fannkuch - 3094
fasta - 11711
life - 4073
memops - 8153
primes - 7248
skinning - 14858

In other words, copy runs faster but fasta doesn't. That's probably because the division and modulus operations in fasta are done on unsigned integers, and unsigned div/mod ops generate different LIR instructions I haven't yet applied the optimization to. I'll try to finish this part over the weekend.
This implements the reciprocal multiplication technique for signed division by positive divisors, and it passes jit-tests and jstests with --ion-eager. Please consider this a "feedback?" if you'd like to wait until the unsigned case is done as well.
Attachment #8394985 - Flags: review?(sunfish)
Comment on attachment 8394985 [details] [diff] [review]
signed.patch (on top of 10f1951a1f1f)

Review of attachment 8394985 [details] [diff] [review]:
-----------------------------------------------------------------

This patch looks good! I think doing just signed first is fine.

I haven't read through the algorithm closely yet, but I'll do that soon. One comment that I do have is that I'd like to see the code that computes the magic number factored out to a platform-independent place, such as js/src/jit/shared/CodeGenerator-shared.* as it'll be useful for other targets as well.
Comment on attachment 8394985 [details] [diff] [review]
signed.patch (on top of 10f1951a1f1f)

Review of attachment 8394985 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/jit/shared/CodeGenerator-x86-shared.cpp
@@ +962,5 @@
> +    // integer; we will compute (M - 2^32) * n + (2^32 * n) instead of
> +    // M * n if this is the case (cf. item (a) above).
> +    int64_t magic = (int64_t(1) << (shift+32))/d + 1;
> +    if (magic >= INT32_MAX)
> +        magic -= int64_t(1) << 32;

Testing for >= INT32_MAX is a little odd. Should this be > INT32_MAX?

Actually, instead of making magic an int64_t and manually subtracting 1<<32 to bring it into the int32_t range, and implicitly converting it to int32_t below, could you just convert it to int32_t right here? A truncating conversion would automatically apply the mod 1<<32.

::: js/src/jit/shared/Lowering-x86-shared.cpp
@@ +137,5 @@
>      if (div->rhs()->isConstant()) {
>          int32_t rhs = div->rhs()->toConstant()->value().toInt32();
>  
> +        // Division by powers of two can be done by shifting, and division by
> +        // other numbers can be done by a reciprocal multiplication technique.

Since your patch doesn't implement division by negative powers of 2, please restore that part of this comment.

@@ +153,5 @@
>              }
>              if (div->fallible() && !assignSnapshot(lir, Bailout_BaselineInfo))
>                  return false;
>              return defineReuseInput(lir, div, 0);
> +        } else if (rhs > 0) {

Do I understand correctly that the code above doesn't handle the <= -2 case, which is why this rhs > 0 check is needed? That's fine, if you don't want to do that with this patch, but it'd be nice to add an assert to generateTruncatedDivConstant verifying that the denominator is positive.
It seems that V8 got a huge improvement by doing something similar on misc-f32-exp:
http://arewefastyet.com/?a=b&view=regress#machine=11&view=breakdown&suite=misc

Will this also give us a similar boost?
https://github.com/haytjes/arewefastyet/blob/master/benchmarks/misc/tests/assorted/misc-f32-exp.js

Apparently the divide by 120 is really expensive here?
Trying out by hand, it seems that replacing 120 by float(1/120) (computed by hand before) and transforming this into a float32 multiplication makes us win 50ms.
Comment on attachment 8394985 [details] [diff] [review]
signed.patch (on top of 10f1951a1f1f)

Review of attachment 8394985 [details] [diff] [review]:
-----------------------------------------------------------------

Here are a few additional comments.

::: js/src/jit/shared/CodeGenerator-x86-shared.cpp
@@ +962,5 @@
> +    // integer; we will compute (M - 2^32) * n + (2^32 * n) instead of
> +    // M * n if this is the case (cf. item (a) above).
> +    int64_t magic = (int64_t(1) << (shift+32))/d + 1;
> +    if (magic >= INT32_MAX)
> +        magic -= int64_t(1) << 32;

Also, this algorithm doesn't appear to handle the case of a negative denominator. Code elsewhere protects it, by only forming LConstantDivI for positive demoniators, but it'd be nice to mention this in a comment here, and perhaps also an assert.

@@ +987,5 @@
> +bool
> +CodeGeneratorX86Shared::visitDivConstantI(LDivConstantI *ins) {
> +    // The denominator, d > 0, can't be a power of 2 (see LDivPowTwoI)
> +    Register lhs = ToRegister(ins->numerator());
> +    int32_t d = ins->denominator();

Please add an assert to check this assumption in this comment, that d is positive and not a power of 2.

@@ +994,5 @@
> +    generateTruncatedDivConstant(lhs, d, ins->mir()->canBeNegativeDividend());
> +
> +    if (!ins->mir()->isTruncated()) {
> +        // Multiply the obtained value by d to verify the answer is an
> +        // integer (and bail otherwise). This cannot overflow.

What happens in the case of a numerator of 0x80000000 and a denominator of -1? The divide itself will overflow in that case, and it looks like this mul check won't detect that. Is there a plan for handling this overflow?

@@ +1134,5 @@
> +bool
> +CodeGeneratorX86Shared::visitModConstantI(LModConstantI *ins) {
> +    // The modulus can't be a power of 2 (see LDivPowTwoI)
> +    Register lhs = ToRegister(ins->numerator());
> +    int32_t mod = ins->modulus();

Please add an assertion for this comment too.
Attachment #8394985 - Flags: review?(sunfish) → review-
Thanks for the review! I haven't uploaded a patch that fixed the other comments yet because I decided it would be better to handle the unsigned and negative divisor cases. I then got discouraged by the amount of duplicated code I would need to add to support those cases. Here are comments on the two main issues that discouraged me:

1) I think a Div operation should be converted into a Mul/Neg operation before codegen if we're dividing by -1, in order to reduce duplication at the codegen layer (no other division case can overflow, so getting rid of this case leads to less cases in codegen). I wasn't able to figure out the best compile phase to do this, though. Is this a reasonable thing to do? If so, at what level should this be done?
2) Implementing unsigned division really requires emitting different code, because of the potential for overflow in newer places. It will involve some unfortunate duplication; I think I'll just implement the best thing I can come up with in a few hours' time and then ask for your feedback, as I'll suffer from design paralysis again otherwise.

I'll (hopefully) upload a new version of the patch tomorrow.
(In reply to Maurício Collares Neto [:mauricioc] from comment #14)
> Thanks for the review! I haven't uploaded a patch that fixed the other
> comments yet because I decided it would be better to handle the unsigned and
> negative divisor cases. I then got discouraged by the amount of duplicated
> code I would need to add to support those cases. Here are comments on the
> two main issues that discouraged me:
> 
> 1) I think a Div operation should be converted into a Mul/Neg operation
> before codegen if we're dividing by -1, in order to reduce duplication at
> the codegen layer (no other division case can overflow, so getting rid of
> this case leads to less cases in codegen). I wasn't able to figure out the
> best compile phase to do this, though. Is this a reasonable thing to do? If
> so, at what level should this be done?

Yes, I like this idea. I suggest doing this at Lowering time. We could do it earlier too, but I don't think that's super important right now. Since the correctness of later code will depend on this, it's nice to do it in Lowering to make sure it catches everything, even things which might get produced during optimization, and even if parts of the optimizer are disabled.

> 2) Implementing unsigned division really requires emitting different code,
> because of the potential for overflow in newer places. It will involve some
> unfortunate duplication; I think I'll just implement the best thing I can
> come up with in a few hours' time and then ask for your feedback, as I'll
> suffer from design paralysis again otherwise.

This sounds fine. Limited amounts of duplication are manageable, and posting what you have is a good step for moving forward in any case. If I think there's too much duplication, I'll work to help you find a way to factor it out.

> I'll (hopefully) upload a new version of the patch tomorrow.

Great!
In preparation for optimizing division by negative constants, I need to make sure division by negative powers of two are handled appropriately.
Attachment #8409097 - Flags: review?(sunfish)
Attachment #8409097 - Attachment is obsolete: true
Attachment #8409097 - Flags: review?(sunfish)
Attachment #8409187 - Flags: review?(sunfish)
Attachment #8394985 - Attachment is obsolete: true
Attachment #8409188 - Flags: review?(sunfish)
Comment on attachment 8409187 [details] [diff] [review]
Part 0: Optimize division and modulus by negative powers of two (on top of 280c066afbf1)

Review of attachment 8409187 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/jit/shared/CodeGenerator-x86-shared.cpp
@@ -1020,5 @@
>  
>          // Negative numbers need a negate, bitmask, negate
>          masm.bind(&negative);
> -        // visitModI has an overflow check here to catch INT_MIN % -1, but
> -        // here the rhs is a power of 2, and cannot be -1, so the check is not generated.

I think pointing out that overflow can't happen in a mod by a constant power of 2 is useful. Would you mind leaving this comment here, or rewording it if you think it isn't clear?

::: js/src/jit/shared/Lowering-x86-shared.cpp
@@ +142,5 @@
> +        // important case to optimize. Note that other optimizations
> +        // are also possible: division by other constants can be
> +        // optimized by a reciprocal multiplication technique.
> +        int32_t shift = FloorLog2(Abs(rhs));
> +        if (rhs != 0 && uint32_t(1) << shift == Abs(rhs)) {

My reading of this code (and the similar Mod code below) is that INT32_MIN is indeed handled correctly, though it's subtle. At a quick glance, I don't think we have good test coverage of this; would you mind adding some basic JIT tests to ensure that division and mod by INT32_MIN are handled correctly?
Attachment #8409187 - Flags: review?(sunfish) → review+
Comment on attachment 8409188 [details] [diff] [review]
Part 1: Optimize signed integer division by constants

Review of attachment 8409188 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good! I just have a few small comments below.

::: js/src/jit/shared/CodeGenerator-x86-shared.cpp
@@ +976,5 @@
> +                return false;
> +
> +            // If lhs is zero and the divisor is negative, the answer should have
> +            // been -0.
> +            masm.testl(lhs, lhs);

This testl instruction is not needed when d >= 0. Please do a separate d < 0 check and move this inside it.

::: js/src/jit/shared/LIR-x86-shared.h
@@ +124,5 @@
>  
>      const LDefinition *remainder() {
>          return getDef(0);
>      }
> +

spurious whitespace
Attachment #8409188 - Flags: review?(sunfish) → review+
Whiteboard: [mentor=sunfish][lang=c++][js:p2] → [mentor=sunfish][lang=c++][js:p2][leave open]
Attachment #8409265 - Flags: review?(sunfish) → review+
Blocks: 998704
The testcase in the patch for INT32_MIN cases is a good start, but I think it'd be useful to cover even more cases. I filed bug 998704 to track this.
This regressed asmjs-ubench-memops by 49% on x86. Switching the roles of edx and eax for the DivOrModConstant operation (which is generated by the %1000 line in memops.cpp), thus making edx be the output register for the modulus, undoes the regression and turns the patch into a small perf win.

Even without the changes in this bug, if I manually replace the %1e3 in memops.js by %1024, thus forcing IonMonkey to generate a ModPowTwo instruction, the code gets *slower* (8.2s -> 10s) on x86! If I fix edx as an output register for the LModPowTwoI instruction [0], the regression once again becomes a win.

I don't understand Ion's asm.js regalloc well enough (or at all) to make a decision on this. Help?

[0] By replacing "return defineReuseInput(lir, mod, 0);" by "return defineFixed(lir, mod, LAllocation(AnyRegister(edx)));" and adding an extra "masm.mov(lhs, edx);" in the codegen.
Flags: needinfo?(sunfish)
> [0] By replacing "return defineReuseInput(lir, mod, 0);" by "return
> defineFixed(lir, mod, LAllocation(AnyRegister(edx)));" and adding an extra
> "masm.mov(lhs, edx);" in the codegen.

CodeGeneratorX86Shared::visitDivOrModConstantI uses the output register to determine whether to do a div or a mod; did your experiment accidentally change the computation from a mod to a div?
Flags: needinfo?(sunfish)
Sorry, I expressed myself poorly; the footnote in the previous comment describes what I did to lock the output register of ModPowTwoI (not DivOrModConstantI). In other words, besides the changes in the footnote, I was using a clean mozilla-inbound checkin from a few days ago (not including my patch).

For the experiment involving DivOrModConstantI, I verified that there were no divisions by constants in memops.js (by looking at ion.cfg after deleting the mod line), and then hardcoded visitDivOrModConstantI to calculate a mod operation no matter the output register. Not exactly pretty, but it was enough for the purposes of the experiment.
It hit me soon after reading the commit log that the emitted code does an extra mov for no reason. I also took the opportunity to make two variable naming changes for readability.
Attachment #8409464 - Flags: review?(sunfish)
Comment on attachment 8409464 [details] [diff] [review]
Part 2: Clean up signed integer division by constants (on top of 56fe9fa6abcd)

Review of attachment 8409464 [details] [diff] [review]:
-----------------------------------------------------------------

LGTM.
Attachment #8409464 - Flags: review?(sunfish) → review+
I filed bug 999257 to track the asmjs-ubench/memops.js regression on 32-bit x86 seen on AWFY. It appears to be an unrelated problem.
I should have done this to begin with, but now the code to compute magic constants works for signed and unsigned division. This patch should effectively be a no-op for the signed case (maxLog = 31 in computeDivisionConstants). I ended up adjusting comments slightly to point out differences between the signed and unsigned cases.
Attachment #8413004 - Flags: review?(sunfish)
This doesn't pass jit-tests yet, so please don't review it carefully. I just wanted suggestions on clean ways to reduce code duplication, along with comments on the fragility of reusing of L[Div|Mod]PowTwoI for the unsigned case.

To elaborate on the second problem: All I need to reuse L[Div|Mod]PowTwoI is that isUnsigned implies !canBeNegativeDividend. However, this doesn't hold in trunk currently (the assertion in the patch fails). Should I fix the analysis bug and make one flag imply the other (and keep the asserts), or should I add an extra flag in LDivPowTwoI for robustness?
Attachment #8413012 - Flags: feedback?(sunfish)
Comment on attachment 8413004 [details] [diff] [review]
Bug 976110 - Part 3: Generalize reciprocal multiplication constant calculation

Review of attachment 8413004 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/jit/shared/CodeGenerator-shared.cpp
@@ +1039,3 @@
>      // The original presentation of this technique appears in Hacker's
>      // Delight, a book by Henry S. Warren, Jr.. A proof of correctness
> +    // for our version follows. Think of L as either 31 or 32.

This description of L seems a little vague to me. Please explicitly connect L in the comment to maxLog in the code. Also, it'd be good to explicitly connect the values 31 and 32 to signed and unsigned int32 (it's implied here, and mentioned below, but it'd be nice to have it explicit at the top).
Attachment #8413004 - Flags: review?(sunfish) → review-
Comment on attachment 8413012 [details] [diff] [review]
Part 4: Optimize unsigned division by constants. unreviewed

Review of attachment 8413012 [details] [diff] [review]:
-----------------------------------------------------------------

I think it's better to fix the analysis so that canBeNegativeDividend is never set when isUnsigned is set. It may just be that MDiv's constructor sets canBeNegativeDividend_ to true by default, and the NewAsmJS wrapper for creating unsigned divisions is neglecting to clear it.
Attachment #8413012 - Flags: feedback?(sunfish)
Depends on: 1011283
Any plans to continue on this?  This optimization is somewhat disproportionately important to the asm.js ubench tests on awfy.
Yes, sorry! I'll post a new version of the patch today. I stopped working on it because of other commitments, but it's done up to the range analysis imprecision; I'll show up in #ionmonkey in an hour or so if I can't figure out how to fix it.
Same thing as before, but with (hopefully) clearer comments.
Attachment #8413004 - Attachment is obsolete: true
Attachment #8435228 - Flags: review?(sunfish)
Comment on attachment 8435228 [details] [diff] [review]
Part 3: Generalize reciprocal multiplication constant calculation

Review of attachment 8435228 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good! Thanks for elaborating in the comments.
Attachment #8435228 - Flags: review?(sunfish) → review+
As far as I can tell, the canBeNegativeDividend_ information for Div/Mod MIR nodes was imprecise for two reasons:

1) Before calling tryUseUnsignedOperands(), range analysis can't see the correct lower bound for the LHS of the division in ((x>>>0)/8)|0, say, because of the Ursh workaround on http://dxr.mozilla.org/mozilla-central/source/js/src/jit/RangeAnalysis.cpp#529

2) After tryUseUnsignedOperands(), the Ursh node is discarded and the unsigned cast is only represented in the MIR by the MDiv/MMod unsigned_ flag. That is, after truncation, a MDiv/MMod node with unsigned_ == true effectively represents a "cast LHS and RHS and then divide". Thus, LHS (getOperand(0)) and dividend are actually different concepts, and canBeNegativeDividend_ needs to be set unconditionally after calling tryUseUnsignedOperands(). 

I've made it so that canBeNegativeDividend_ is always true for unsigned division ops. Let me know if you prefer to take a different route in solving this problem. Thanks for all the help on IRC!
Attachment #8413012 - Attachment is obsolete: true
Attachment #8435349 - Flags: review?(sunfish)
sunfish suggested asserting !unsigned_ in canBeNegativeDividend(), due to the ambiguity. I like that very much.
Attachment #8435349 - Attachment is obsolete: true
Attachment #8435349 - Flags: review?(sunfish)
Attachment #8435394 - Flags: review?(sunfish)
Comment on attachment 8435394 [details] [diff] [review]
Part 4: Optimize unsigned division by constants

Review of attachment 8435394 [details] [diff] [review]:
-----------------------------------------------------------------

Looks great! Just a few minor comments below.

::: js/src/jit/shared/CodeGenerator-x86-shared.cpp
@@ +882,5 @@
> +        if (ins->mir()->isTruncated()) {
> +            masm.xorl(output, output);
> +            return true;
> +        } else
> +            return bailout(ins->snapshot());

Style nit: since the true block of the if has braces, the else block should too.

@@ +916,5 @@
> +        // Finish the computation.
> +        masm.addl(eax, edx);
> +        masm.sarl(Imm32(rmc.shiftAmount - 1), edx);
> +    } else
> +        masm.sarl(Imm32(rmc.shiftAmount), edx);

Ditto style nit.

@@ +932,5 @@
> +            if (!bailoutIf(Assembler::NotEqual, ins->snapshot()))
> +                return false;
> +        } else {
> +            // Unsigned mod can return a value >= 2^31.
> +            masm.testl(output, output);

This testl instruction is redundant because you can just use the flags produced by the sub above. It would be good to have a comment here and a comment after the sub noting what's going on in that case.
Attachment #8435394 - Flags: review?(sunfish) → review+
I moved code around a little bit to put the bailout check for the modulus operation closer to the subtraction, and added comments.
Attachment #8435394 - Attachment is obsolete: true
Attachment #8436181 - Flags: review?(sunfish)
Whiteboard: [mentor=sunfish][lang=c++][js:p2][leave open] → [mentor=sunfish][lang=c++][js:p2]
Attachment #8436181 - Flags: review?(sunfish) → review+
Attachment #8435228 - Flags: checkin?
Attachment #8436181 - Flags: checkin?
Hi,

would it be possible to get a try link for this checkin? thanks!
Mentor: sunfish
Whiteboard: [mentor=sunfish][lang=c++][js:p2] → [lang=c++][js:p2]
Hi Mauricio, I apologize for the delay getting these patches checked in. Unfortunately, they no longer apply cleanly. There are a few merge conflicts now, and the patches use Bailout_BaselineInfo, which was recently removed. Would you mind updating these patches for the latest trunk?
Attachment #8435228 - Flags: checkin?
Attachment #8436181 - Flags: checkin?
Attachment #8409264 - Flags: checkin+
Attachment #8409265 - Flags: checkin+
Attachment #8409464 - Flags: checkin+
It looks like cset c7925215ca32 is the cause of a 50% regression on asmjs-ubench-memops on 32-bit OSX, but only for Odin, not for --no-asmjs.  Very peculiar.
Also: not on x64.
True. Dan investigated this and filed the asmjs-ubench-memops regression as bug 999257.
Unfortunately, with bug 1011283, the patches for this feature are now disabled (though most of the code is still in the tree). The problem appears to be a pre-existing register allocation bug, and not a bug in the patches here. Hopefully we can find a fix for the register allocation bug so that we can re-enable this feature.
Depends on: 1058083
I've rebased the two patches now that sunfish worked around bug 1011283 in bug 1058083 (thanks!); Jan says he is fine with doing the LSRA check in this patch as well. I don't have Level 1 access so I can't provide a TBPL link, but I've checked this passes 32-bit and 64-bit jit-tests.
Attachment #8435228 - Attachment is obsolete: true
Attachment #8436181 - Attachment is obsolete: true
Depends on: 1067610
Attachment #8485093 - Attachment is obsolete: true
Attachment #8592407 - Attachment is obsolete: true
Attachment #8630832 - Flags: review?(sunfish)
Now that the LSRA is gone and bug 1067610 is fixed (thanks, bhackett!), I think this is safe to land. I'm flagging the patches for review in case you want to read them again, but they are just a rebased version of the patches that were approved last June. 

The patches pass jit-tests with --ion, and here's an asmjs-ubench comparison:

                   before after   delta
copy                 5151  5083  -1,32%
corrections          7044  6952  -1,30%
fannkuch             3410  3457   1,38%
fasta               12344  8117 -34,24%
life                 4820  4796  -0,51%
memops               6147  6049  -1,59%
primes               7312  7109  -2,78%
skinning            13697 13420  -2,02%
mandelbrot-native     396   385  -2,94%
mandelbrot-polyfill   557   555  -0,30%
fbirds-native         172   168  -2,13%
fbirds-polyfill       394   391  -0,76%
Attachment #8592410 - Attachment is obsolete: true
Attachment #8630833 - Flags: review?(sunfish)
It looks like these patches cause the following testcase to be miscompiled, printing two different numbers when it should print the same number twice.

var asmdiv = (function(m) {
    "use asm"
    function f(x) {
        x = x|0;
        var z = 0;
        z = ((x>>>0) / 2)>>>0;
        return z|0;
    }
    return f;
})()

var plaindiv = function(x) {
    x = x|0;
    var z = 0;
    z = ((x>>>0) / 2)>>>0;
    return z|0;
}

var k = 0xf0000000;

print(asmdiv(k));
print(plaindiv(k));

$ js test.js 
-805306368
1342177280
$ 

Can you take a look?
Nice catch, thanks!

Curiously, both values returned by your testcase are incorrect. The right answer is 2013265920, which is what the second case prints on my computer before the fix. Let me know if I should debug this as well.

The bug with this division by powers of two is my fault, and it's a really dumb bug: I forgot to update the LDivPowTwoI instruction to use an unsigned shift in the unsigned case :( It gives the correct result in non-asm.js code because the >>>0 after the division gets compiled to a fallible Ursh, which then causes a bailout:

[Codegen] instruction DivPowTwoI
[Codegen] sarl       $1, %eax
[Codegen] instruction ShiftI:ursh
[Codegen] testl      %eax, %eax
[Codegen] js         .Lfrom577

But there's another recently-introduced complication: Bug 1180874 (commit 596ee431b3d2) changed lowering to emit the *signed* divide-by-constant LIR instructions in the unsigned case as well. This causes an overflow in computeDivisionConstants for divisors >= 2^31, but I think it also breaks division for dividends >= 2^31 and many divisors, as the magic constants for unsigned division are a bit different (cf. part 3, where a maxLog parameter was added for this purpose).

Moreover, bug 1180874 only optimizes non-asm.js code, as asm.js compilation emits UDiv/UMod instructions directly, bypassing the optimization made there. I've updated part 4, but my new patch will only work if bug 1180874 is backed out. Since both patches have the same goal, I think that's reasonable.
Attachment #8630833 - Attachment is obsolete: true
Attachment #8630833 - Flags: review?(sunfish)
Attachment #8631393 - Flags: review?(sunfish)
Please add the tests to your patch as well?
Group: mozilla-employee-confidential
Now with tests! They caught another signed shift thinko.

I've verified that current trunk fails the division-by-3 test I've added. Jit-tests, including the one I added here, pass with this patch (together with a backout of bug 1180874, as mentioned before).
Attachment #8631393 - Attachment is obsolete: true
Attachment #8631393 - Flags: review?(sunfish)
Attachment #8631417 - Flags: review?(sunfish)
Ryan, why was this bug marked as Mozilla confidential?
Also can we move these new patches to a new bug. That way this bug doesn't track code that are in multiple releases.
(In reply to Jan de Mooij [:jandem] from comment #63)
> Ryan, why was this bug marked as Mozilla confidential?

I have no idea how I managed to check that. Sorry.
Group: mozilla-employee-confidential
Comment on attachment 8630832 [details] [diff] [review]
Part 3: Generalize reciprocal multiplication constant calculation

Review of attachment 8630832 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good; just a few whitespace nits below.

::: js/src/jit/shared/CodeGenerator-shared.cpp
@@ +1577,5 @@
> +CodeGeneratorShared::computeDivisionConstants(uint32_t d, int maxLog) {
> +    MOZ_ASSERT(maxLog >= 2 && maxLog <= 32);
> +    // In what follows, 0 < d < 2^maxLog and d is not a power of 2.
> +    MOZ_ASSERT(d < (uint64_t(1) << maxLog) && (d & (d - 1)) != 0);
> + 

Nit: trailing whitespace

::: js/src/jit/x86-shared/CodeGenerator-x86-shared.cpp
@@ +1089,4 @@
>      MOZ_ASSERT((Abs(d) & (Abs(d) - 1)) != 0);
>  
>      // We will first divide by Abs(d), and negate the answer if d is negative.
> +    // If desired, this can be avoided by generalizing computeDivisionConstants. 

trailing whitespace
Attachment #8630832 - Flags: review?(sunfish) → review+
Comment on attachment 8631417 [details] [diff] [review]
Part 4: Optimize unsigned division by constants

Review of attachment 8631417 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good. Just a few minor comments to address below.

::: js/src/jit-test/tests/ion/bug976110.js
@@ +64,5 @@
> +
> +var kmod = (3<<30) + 2;
> +
> +assertEq(asmmod3(kmod), 2);
> +assertEq(plainmod3(kmod), 2);

We should add a test for division by something like 7 too, so that we exercise the rmc.multiplier > UINT32_MAX case.

::: js/src/jit/x86-shared/BaseAssembler-x86-shared.h
@@ +1322,5 @@
> +    void mull_r(RegisterID multiplier)
> +    {
> +        spew("mull       %s", GPReg32Name(multiplier));
> +        m_formatter.oneByteOp(OP_GROUP3_Ev, multiplier, GROUP3_OP_MUL);
> +    }    

trailing whitespace
Attachment #8631417 - Flags: review?(sunfish) → review+
Depends on: 1182203
Attachment #8630832 - Attachment is obsolete: true
Attachment #8631417 - Attachment is obsolete: true
Status: ASSIGNED → RESOLVED
Closed: 4 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla42
Version: unspecified → 42 Branch
You need to log in before you can comment on or make changes to this bug.