Last Comment Bug 694509 - IonMonkey: Add function calls for ARM
: IonMonkey: Add function calls for ARM
Status: RESOLVED FIXED
:
Product: Core
Classification: Components
Component: JavaScript Engine (show other bugs)
: unspecified
: x86_64 Linux
: -- normal (vote)
: ---
Assigned To: general
:
Mentors:
Depends on:
Blocks:
  Show dependency treegraph
 
Reported: 2011-10-14 00:45 PDT by Marty Rosenberg [:mjrosenb]
Modified: 2012-01-30 13:24 PST (History)
4 users (show)
See Also:
Crash Signature:
(edit)
QA Whiteboard:
Iteration: ---
Points: ---
Has Regression Range: ---
Has STR: ---


Attachments
add function calls and some more stuff. (48.09 KB, patch)
2011-10-14 00:45 PDT, Marty Rosenberg [:mjrosenb]
no flags Details | Diff | Review
add in another fix or two; add some basic constant cleanup to the vfp code (66.66 KB, patch)
2011-10-16 23:03 PDT, Marty Rosenberg [:mjrosenb]
Jacob.Bramley: review+
Details | Diff | Review
Fix nits, add comments. (68.20 KB, patch)
2011-10-17 09:15 PDT, Marty Rosenberg [:mjrosenb]
no flags Details | Diff | Review

Description Marty Rosenberg [:mjrosenb] 2011-10-14 00:45:38 PDT
Created attachment 567026 [details] [diff] [review]
add function calls and some more stuff.

After applying this patch, there are 47 failures on jit-tests.
as the saying goes, the first 90% is almost done.  the second 90% will be the real challenge.
Of the remaining failures, there are about a half dozen due to missing 
CodeGeneratorARM::visitTableSwitch
about a dozen due to not supporting multiply
and another dozen due to only supporting 255 floating point immediates.
the rest are misc. assertions and other fun stuff.
Comment 1 Marty Rosenberg [:mjrosenb] 2011-10-16 23:03:56 PDT
Created attachment 567395 [details] [diff] [review]
add in another fix or two; add some basic constant cleanup to the vfp code
Comment 2 Jacob Bramley [:jbramley] 2011-10-17 07:07:07 PDT
Comment on attachment 567395 [details] [diff] [review]
add in another fix or two; add some basic constant cleanup to the vfp code

Review of attachment 567395 [details] [diff] [review]:
-----------------------------------------------------------------

I'm mostly nit-picking but there are one or two minor glitches. r+ with my comments fixed.

::: js/src/ion/arm/Assembler-arm.cpp
@@ +1074,2 @@
>  
>  // our encoding actually allows just the src and the dest (and theiyr types)

s/theiyr/their/

(I didn't spot that before.)

@@ +1083,5 @@
> +    vfp_size sz = isDouble;
> +    if (vd.isFloat() && vm.isFloat()) {
> +        // Doing a float -> float conversion
> +        if (vm.isSingle())
> +            sz = isSingle;

Curly brackets.

@@ +1137,1 @@
>          length *= 2;

Curly brackets.

@@ +1149,2 @@
>      if (!vd.isDouble()) {
>          // totally do not know how to handle this right now

Will it ever happen? If vd.isSingle() then it's easy as the instruction is the same, but with different sz. If !vd.isFloat() then we can probably ASSERT_NOT_REACHED. Not important for this patch, though.

::: js/src/ion/arm/CodeGenerator-arm.cpp
@@ +276,5 @@
> +    // With no assurance that this is going to be a local branch, I am wary to
> +    // implement this.  Moreover, If it isn't a local branch, it will be large
> +    // and possibly slow.  I believe that the correct way to handle this is to
> +    // subclass label into a fatlabel, where we generate enough room for a load
> +    // before the branch

I'm not entirely sure of the semantics of bailouts, but something I always wanted to implement in NanoJIT (and JaegerMonkey) was a exit path generator, so we can generate exit routines, trampolines and suchlike at run-time (using ARMAssembler or nanojit/NativeARM as appropriate) and (hopefully) within range of local branches in generated code. Even if we have to emit several copies, this will be much faster than full-range branches.

It's not necessary for this patch, of course.

@@ +289,5 @@
> +    // reserved, or a jump table is not optimal for this frame size or
> +    // platform. Whatever, we will generate a lazy bailout.
> +    OutOfLineBailout *ool = new OutOfLineBailout(snapshot, masm.framePushed());
> +    if (!addOutOfLineCode(ool))
> +        return false;

Curly brackets.

@@ +342,2 @@
>      } else {
> +        masm.ma_sub(ToRegister(lhs), ToOperand(rhs), ToRegister(dest), SetCond);

If !ins->shapshot(), SetCond isn't necessary. Since it's probably more efficient not to set them, it might be wise to set them only if ins->snapshot(). (Not important for now.)

@@ +643,5 @@
>  
>      // Lower value with low value
>      if (mir->low() != 0)
> +
> +        masm.ma_sub(ToRegister(input), Imm32(mir->low()), ToRegister(index));

Curly brackets.

@@ +901,5 @@
> +        if (dpun.s.hi == 0) {
> +            // to zero a register, load 1.0, then execute dN <- dN - dN
> +            VFPImm dblEnc(0x3FF00000);
> +            masm.as_vimm(ToFloatRegister(out), dblEnc);
> +            masm.as_vsub(ToFloatRegister(out), ToFloatRegister(out), ToFloatRegister(out));

I think you want an 'else' clause in here, or a 'return true'.

@@ +1018,5 @@
> +    masm.ma_ldr(DTRAddr(objreg, DtrOffImm(offsetof(JSFunction, flags))),
> +                tokreg);
> +
> +    masm.ma_tst(Imm32(JSFUN_KINDMASK), tokreg);
> +    // Call me a skeptic

Skeptic! Use CMP!

@@ +1025,1 @@
>          return false;

Curly brackets.

@@ +1035,2 @@
>  
> +    masm.ma_tst(objreg, objreg);

Go on, use CMP. You know you want to.

@@ +1049,1 @@
>      // Nestle %esp up to the argument vector.

Tricky on ARM. (No %esp.)

@@ +1049,3 @@
>      // Nestle %esp up to the argument vector.
>      if (unused_stack)
> +        masm.ma_add(StackPointer, Imm32(unused_stack), StackPointer);

Curly brackets.

@@ +1080,5 @@
>  
>          // Argument fixup needed. Create a frame with correct |nargs| and then call.
>          masm.bind(&thunk);
> +        // r0 should be safe, since everything has been saved by the register
> +        // allocator

Is there some way to ASSERT(!allocated(r0)) ?

::: js/src/ion/arm/MacroAssembler-arm.cpp
@@ +615,5 @@
> +
> +void
> +MacroAssemblerARM::ma_ldrb(DTRAddr addr, Register rt, Index mode, Condition cc)
> +{
> +    as_dtr(IsLoad, 32, mode, rt, addr, cc);

Size should be 8.

@@ +632,5 @@
> +}
> +void
> +MacroAssemblerARM::ma_ldrsb(EDtrAddr addr, Register rt, Index mode, Condition cc)
> +{
> +    as_extdtr(IsLoad, 16, true, mode, rt, addr, cc);

Size should be 8.

@@ +997,5 @@
>  void
> +MacroAssemblerARM::ma_callIon(const Register r)
> +{
> +    // the stack is presently 8 byte aligned
> +    // we want to decrement sp by 8, and write pc into the new sp

It's usually the responsibility of the caller to store LR (the return address). Is this intentional? (BLX is still the right way to do the call, even if you ignore LR.)

It's probably a good idea to comment that the PC stored will be the address of the instruction _after_ the BLX. This is correct, but the behaviour is not necessarily obvious.

::: js/src/ion/arm/Trampoline-arm.cpp
@@ +189,5 @@
> +IonCode *
> +IonCompartment::generateArgumentsRectifier(JSContext *cx)
> +{
> +    MacroAssembler masm(cx);
> +#if 0

Now that does make the review easier :-) I haven't reviewed the #if 0 block because I've assumed that you're about to change it.

@@ +341,2 @@
>          // And now the stack is all fixed up.
> +        masm.ma_add(sp, Imm32(bailoutFrameSize+12), sp);

Why +12?
Comment 3 Marty Rosenberg [:mjrosenb] 2011-10-17 09:15:45 PDT
Created attachment 567472 [details] [diff] [review]
Fix nits, add comments.
Comment 4 David Anderson [:dvander] 2011-10-17 14:43:27 PDT
> I'm not entirely sure of the semantics of bailouts, but something I always
> wanted to implement in NanoJIT (and JaegerMonkey) was a exit path generator,
> so we can generate exit routines, trampolines and suchlike at run-time
> (using ARMAssembler or nanojit/NativeARM as appropriate) and (hopefully)
> within range of local branches in generated code. Even if we have to emit
> several copies, this will be much faster than full-range branches.
> 

In IonMonkey, we capture the register state (in a "snapshot") at every position that can leave JIT code. When a guard fails, a bailout occurs, which takes the snapshot, recovers a js::StackFrame, and resumes execution in the interpreter. So we're not really worried about the cost of the branch if taken. Instead we're optimizing the gaurds for space, for example:

   cmp r1, <shape>
   bne _guard_L132
...
 guard_L131:
   push <snapshot_131>
   blx ion::Bailout
 guard_L132:
   push <snapshot_132>
   blx ion::Bailout
...

We make these jump tables shared between all code buffers, to avoid having bloated out-of-line paths.
Comment 5 Jacob Bramley [:jbramley] 2011-10-18 01:04:48 PDT
(In reply to David Anderson [:dvander] from comment #4)

That's a good explanation, thank you.

> We make these jump tables shared between all code buffers, to avoid
> having bloated out-of-line paths.

What I was saying was that it'd be nice if we guarantee that the jump
tables are within range of a simple B<cond> instruction by generating
copies once we reach a suitable code size threshold. For example:

    cmp r1, <shape>  // Some guard test.
    bne     _guard_L1
    ... // Lots of code, but not so much that _guard_L1 cannot be
    ... // reached using B.
    b       end_of_guard_table // Jump over the guard/snapshot table.

_guard_L1:
    ...     // Snapshot code.
_guard_L<n>:    // More guards.
    ...

end_of_guard_table:
    cmp r1, <shape>  // Some guard test.
    bne     _guard_L101

_guard_L101:
    ...     // Guard code, maybe the same as _guard_L1.

That's how we used to do literal pools in Jaeger Monkey, to work around
the limited load offset available to LDR. This is the same problem,
except that branches get a much larger offset, so in practice we'd
probably only end up with 2-3 copies of the table. However, in JM we
often ended up with allocated memory such that B could not always be
used to branch to other bits of generated code.

If we don't do this, then all the guards have to look like this (as
Marty was describing in his comment):

    cmp     r1, <shape>
    ldr     ip, <_guard_L1> // Load 32-bit guard address.
    bxne    ip

We would have to do that everywhere just in case the snapshot code isn't
in range of the guard test.

Note that all branches in JM looked like that. I tried to address it
with bug 586297, but weird test failures held that one up and I'm not
sure for how long it will be relevant anyway. Marty has been building
Ion Monkey from the start with the goal of using short branches wherever
possible.
Comment 6 David Anderson [:dvander] 2011-10-18 01:30:51 PDT
Ah I see, thanks for explaining. It sounds like then, if we can't guarantee the addresses are in-range, the shared bailout table might not be either a space saving *or* a critical-path optimization on ARM?

If so: the shared table is completely optional, just avoiding it on ARM should make guards fit in range since the table will be inlined at the bottom of the script.
Comment 7 Jacob Bramley [:jbramley] 2011-10-18 01:59:48 PDT
Oh, sorry, there's a much simpler answer:

   cmp     r1, <shape>
   ldrne   pc, =_guard_L1   // Full range branch.

If we don't need to patch these later, it's easy to make that a branch, like this:

   cmp     r1, <shape>
   bne     _guard_L1

If we do need to patch it later, there are still ways to achieve that (as in bug 586297), but it's hard. It does cost a literal pool slot to use the LDR, but that's not in the inline path at least.

Note You need to log in before you can comment on or make changes to this bug.