Last Comment Bug 716504 - IonMonkey: LMulI should not bailout if the result is positive zero
: IonMonkey: LMulI should not bailout if the result is positive zero
Status: RESOLVED FIXED
:
Product: Core
Classification: Components
Component: JavaScript Engine (show other bugs)
: unspecified
: All All
: -- normal (vote)
: ---
Assigned To: Jan de Mooij [:jandem]
:
: Jason Orendorff [:jorendorff]
Mentors:
Depends on:
Blocks: IonSpeed
  Show dependency treegraph
 
Reported: 2012-01-09 04:38 PST by Jan de Mooij [:jandem]
Modified: 2012-01-10 14:38 PST (History)
3 users (show)
See Also:
Crash Signature:
(edit)
QA Whiteboard:
Iteration: ---
Points: ---
Has Regression Range: ---
Has STR: ---


Attachments
Patch (14.09 KB, patch)
2012-01-09 08:45 PST, Jan de Mooij [:jandem]
dvander: review+
marty.rosenberg: review+
Details | Diff | Splinter Review

Description Jan de Mooij [:jandem] 2012-01-09 04:38:31 PST
Currently we always bailout if both operands are not constant and the result is 0. We should only bailout if the result will be -0, not if it's +0.

This should make ss-spectral-norm faster.
Comment 1 Hannes Verschore [:h4writer] 2012-01-09 08:44:41 PST
When implementing MUL it was deliberaty choosen not to do that (yet). It's also how JM/TM did it before and from the look of AWFY JM don't have a problem with that benchmark? So could it be the problem is somewhere else? Ofcourse when doing a loop and bailing on the first iteration will decrease speed for the first few iterations, but shouldn't OSR kick in to get back on the fast path? Now if it indeed does decreases performance a lot, this should get fixed. 
It is possible to check it rather easy by adding the checks (to check if the operands are negative) after the muli instruction and jump over the those checks when result isn't zero. This will of course degrade performance of normal multiplation a bit (because of taking the jmp). A better solution is to have those checks out of line.
Comment 2 Jan de Mooij [:jandem] 2012-01-09 08:45:22 PST
Created attachment 587012 [details] [diff] [review]
Patch

Jump to an OOL path if the result is 0. The OOL path then bails out if lhs or rhs is negative.

This patch helps spectral-norm, but only with the patch for bug 706472 applied, due to inlining different functions.
Comment 3 Hannes Verschore [:h4writer] 2012-01-09 08:56:15 PST
Cool, that is already implemented. Just noticing but this uses an extra register. So extra pressure one the registers. Even when canBeNegativeZero() is set to false!
Comment 4 Jan de Mooij [:jandem] 2012-01-09 09:01:31 PST
(In reply to Hannes Verschore from comment #1)
> When implementing MUL it was deliberaty choosen not to do that (yet). It's
> also how JM/TM did it before and from the look of AWFY JM don't have a
> problem with that benchmark?

JM also checks whether the lhs/rhs is negative, see jsop_binary_full (look for the "Restore original value" comment). Also, even if JM didn't do this, it'd fall back to a stub, which is more efficient than bailing out to the interpreter.

> Ofcourse when doing a loop and bailing on the first iteration will decrease
> speed for the first few iterations, but shouldn't OSR kick in to get back on
> the fast path? Now if it indeed does decreases performance a lot, this
> should get fixed. 

Yeah, OSR is "forbidden" once we bailout somewhere. We'll have to fix that too, but in the meantime this fix is still worthwhile.

> It is possible to check it rather easy by adding the checks (to check if the
> operands are negative) after the muli instruction and jump over the those
> checks when result isn't zero. This will of course degrade performance of
> normal multiplation a bit (because of taking the jmp). A better solution is
> to have those checks out of line.

Yep, see the patch I just attached.
Comment 5 Jan de Mooij [:jandem] 2012-01-09 09:09:12 PST
(In reply to Hannes Verschore from comment #3)
> Cool, that is already implemented. Just noticing but this uses an extra
> register. So extra pressure one the registers. Even when canBeNegativeZero()
> is set to false!

The register for the original LHS value is also used by the snapshot, so it doesn't really matter. It's an extra register once we have mul's without overflow or -0 checks (because we don't need a snapshot then), but I didn't think handling that case was worth the complexity now.
Comment 6 Marty Rosenberg [:mjrosenb] 2012-01-09 11:43:00 PST
Comment on attachment 587012 [details] [diff] [review]
Patch

Review of attachment 587012 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/ion/arm/CodeGenerator-arm.cpp
@@ +472,2 @@
>              masm.ma_cmp(ToRegister(lhs), Imm32(0));
> +            masm.ma_b(ool->entry(), Assembler::Equal);

I don't remember why we're comparing lhs with zero, shouldn't we be comparing dest with zero?

@@ +488,5 @@
> +    JS_ASSERT(result != lhs);
> +    JS_ASSERT(result != rhs);
> +
> +    // Result is -0 if lhs or rhs is negative.
> +    masm.ma_orr(lhs, rhs, result, SetCond);

use ma_cmn.  It adds its two arguments together, setting the condition bits accordingly, and discards the results.  This way we don't have to stomp over our poor return register.

@@ +492,5 @@
> +    masm.ma_orr(lhs, rhs, result, SetCond);
> +    if (!bailoutIf(Assembler::Signed, ins->snapshot()))
> +        return false;
> +
> +    masm.ma_mov(Imm32(0), result);

This can go away as well.

Is it actually worth it to put two instructions into an OOL path?
Comment 7 Jan de Mooij [:jandem] 2012-01-09 12:26:18 PST
(In reply to Marty Rosenberg [:Marty] from comment #6)
> I don't remember why we're comparing lhs with zero, shouldn't we be
> comparing dest with zero?

Yes, good catch. On x86/x64 dest and lhs are the same, but on ARM it should be dest.

> Is it actually worth it to put two instructions into an OOL path?

It's probably good for consistency.. On x86/x64 we have more instructions in the OOL path, and I'd like to use a forward (unlikely) branch there for the edge case (result == 0).  I don't know how this works on ARM, are forward branches seen as unlikely to be taken?
Comment 8 Jan de Mooij [:jandem] 2012-01-10 07:40:33 PST
https://hg.mozilla.org/projects/ionmonkey/rev/23ce5e04429d

Got rid of the OOL path on ARM, as suggested in comment 6. It'd have only 2 instructions and especially on ARM the compile-time overhead is probably not worth it.
Comment 9 Hannes Verschore [:h4writer] 2012-01-10 13:32:28 PST
Uhmm I was just wondering about this and I think there is a better way.
We have a scratch register, right?

So we could do the following:

 bool
 CodeGeneratorX86Shared::visitMulI(LMulI* mul) {
   ...

+  if (mul->canBeNegativeZero())
+    masm.mov(lhs, ScratchReg);

   masm.imull(ToOperand(rhs), ToRegister(lhs));

   ...
 }

 bool
 CodeGeneratorX86Shared::visitMulNegativeZeroCheck(MulNegativeZeroCheck *ool)
 {
     LMulI *ins = ool->ins();
     Register result = ToRegister(ins->output());
-    Register lhsCopy = ToRegister(ins->lhsCopy());
+    Register lhsCopy = ScratchReg;
     Operand rhs = ToOperand(ins->rhs());
     JS_ASSERT(lhsCopy != result);
 
     // Result is -0 if lhs or rhs is negative.
     masm.movl(lhsCopy, result);
     masm.orl(rhs, result);
     if (!bailoutIf(Assembler::Signed, ins->snapshot()))
         return false;
 
     masm.xorl(result, result);
     masm.jmp(ool->rejoin());
     return true;
 }


As result:
1) LMulNegativeZeroCheck can become LMul again
2) lowerForALU can get used again
3) We get the extra register back
4) We only create a copy of the lhs when needed.

One problem atm is that ScratchRegister isn't defined in CodeGenerator-shared.
(I think because it is a different value on X86 and X64)
Comment 10 Jan de Mooij [:jandem] 2012-01-10 13:56:09 PST
(In reply to Hannes Verschore from comment #9)
> Uhmm I was just wondering about this and I think there is a better way.
> We have a scratch register, right?

Not on x86.

Also note that in the most common cases there's no extra move/register atm. Assuming LHS is in eax and the result goes into ecx, we'll emit something like:

mov  eax, ecx
imul $rhs, ecx

The original lhs, eax, is used by the snapshot, other instructions and to recover the lhs. If we used a scratch reg here, we'd have

mov  eax, ecx
mov  ecx, scratch
imul $rhs, ecx

We'd still need eax for the snapshot and other uses of the lhs by other instructions.
Comment 11 Hannes Verschore [:h4writer] 2012-01-10 14:38:34 PST
(In reply to Jan de Mooij (:jandem) from comment #10)
> (In reply to Hannes Verschore from comment #9)
> > Uhmm I was just wondering about this and I think there is a better way.
> > We have a scratch register, right?
> 
> Not on x86.
> 
Ah really. Oh ok. Thank you for explanation. I should probably look a bit more into the register allocator too ;)

Note You need to log in before you can comment on or make changes to this bug.