Closed Bug 1133529 Opened 5 years ago Closed 3 years ago

Implement Tail Call Optimization in OdinMonkey

Categories

(Core :: JavaScript Engine: JIT, defect, P3)

Tracking

RESOLVED WONTFIX

People

(Reporter: marc.nieper, Assigned: marc.nieper)

References

Details

Attachments

(3 files, 13 obsolete files)

40.11 KB, image/png
49.08 KB, image/png
62.21 KB, patch
At the moment, the asm.js specification does not allow proper tail calls because most call sites require a coercion before returning.

In order to evaluate an extension to the asm.js specification that allows proper tail calls, this bug is about implementing the infrastructure for tail call optimization in OdinMonkey and using it in cases like `return f()|0`, where the coercion can easily be optimized away to make it an actual tail call.

Doing this, however, can't be required of asm.js implementations, as it might make a general ES6 implementation fail on such code (due to stack overflows). Thus, after this evaluation and after revising the asm.js specification, proper tail calls should be implemented (and will have to be, given the ES6 requirements).
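As a concrete illustration, here is a hedged sketch (valid asm.js as specified today; Module and count are illustrative names) of the kind of coerced call site this bug targets:

function Module() {
  "use asm";
  function count(n) {
    n = n|0;
    if ((n|0) == 0)
      return 0;
    // Valid asm.js today, but the |0 coercion makes this a non-tail
    // call: each iteration pushes a frame, so a large n overflows the
    // stack unless the engine optimizes the coercion away.
    return count((n - 1)|0)|0;
  }
  return count;
}

Calling Module()(10000000) would typically throw a RangeError today; with the optimization evaluated here, the coerced self-call could be compiled as a jump.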
So, thinking about this a little bit more: the hard case is, IIUC, when the callee has a different number of stack args, since you then need to adjust the caller's stack frame size in place before jumping to the callee.  But once you do this, the callee cannot just 'ret', b/c that would wind up back in the caller's caller with the wrong stack depth.  On the bright side, this wouldn't be a problem if all args are in registers (which will often be the case for x64/ARM, but not x86).

I can think of a few possible solutions (none zero-cost), but I assume this is an old problem with a well-known space of solutions.  What do you propose?
Status: UNCONFIRMED → NEW
Ever confirmed: true
As all functions inside the module are known at compile time, we can easily assume a common size for the stack frame, namely the maximum, can't we?

We just have to do adjustments for calling between asm.js and non-asm.js code.

(A different, more radical approach would be to ignore the calling conventions of the targeted platform completely and use a stack layout that puts it into the hand of the callee to pop off all arguments and the return address from the stack.)
P.S.: In any case, whatever convention is going to be used inside OdinMonkey should be one that will eventually allow tail call elimination in this scenario:

> function f() {return g();}
> function g() {return f();}

where f() is in an asm.js module and g() is in non-asm.js code.
(In reply to Marc Nieper-Wißkirchen from comment #2)
> As all functions inside the module are known at compile time, we can easily
> assume a common size for the stack frame, namely the maximum, can't we?

Hah, good point!  I'd worry about one single mega-arity function ruining it for everyone, but perhaps (1) we can take the max over only PTC callers/callees, (2) tweak Emscripten to not accidentally emit PTCs, and (3) issue console warnings when the max frame size is huge despite 1 and 2.

(In reply to Marc Nieper-Wißkirchen from comment #3)
This will be difficult and will likely need to be part of the broader PTC effort (which will have to deal with a host of such execution-mode-transition corner cases).
I think (3) is the most important point. Using a single mega-arity function (> 8 arguments?) while all other functions have much smaller signatures doesn't look very sensible.

What about the other idea of using a special tail-call-tailored calling convention?

E.g. in Intel-style x86 assembler syntax (ignoring argument passing in registers):

> caller:
>   push continue
>   push 1 ; two arguments for f
>   push 2
>   jmp f
> continue:
>   ...
>
> f:
>   ; 1st argument is at [esp+4]
>   ; 2nd argument is at [esp]
>   ; return address is [esp+8]
>   ...
>   add esp,8 ; remove arguments from stack
>   ; leave return address
>   push 1    ; three arguments for g
>   push 2
>   push 3
>   jmp g     ; tail-call g
>
> g:
>   ...
>   add esp,12 ; remove arguments from stack
>   ret        ; return (to continue)
(In reply to Marc Nieper-Wißkirchen from comment #5)
> I think (3) is the most important point. Using a single mega-arity function
> (> 8 arguments?) while all other functions have much smaller signatures
> doesn't look very sensible.

I guess the question is how do programs use tail calls.  I realized we could maybe go further and detect SCCs of the callgraph, and then each SCC could have its own max-depth, although function-pointer calls could quickly make these SCCs imprecise.

I guess another problem with this technique is that it likely doesn't generalize to calls going into or out of an asm.js module.
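Going back to the SCC idea above, here is a hedged sketch in plain JS of the per-SCC max computation (Tarjan's algorithm over the static call graph; perSccMaxFrameSize, graph and frameSize are illustrative names, not anything in the engine):

// graph maps function name -> array of callee names; frameSize maps
// function name -> frame bytes.
function perSccMaxFrameSize(graph, frameSize) {
  var index = 0, stack = [], info = {}, maxOf = {};
  function strongConnect(v) {
    info[v] = { index: index, lowlink: index, onStack: true };
    index++;
    stack.push(v);
    (graph[v] || []).forEach(function (w) {
      if (!(w in info)) {
        strongConnect(w);
        info[v].lowlink = Math.min(info[v].lowlink, info[w].lowlink);
      } else if (info[w].onStack) {
        info[v].lowlink = Math.min(info[v].lowlink, info[w].index);
      }
    });
    if (info[v].lowlink === info[v].index) {  // v roots an SCC
      var members = [], w;
      do {
        w = stack.pop();
        info[w].onStack = false;
        members.push(w);
      } while (w !== v);
      var max = 0;
      members.forEach(function (m) { max = Math.max(max, frameSize[m] || 0); });
      members.forEach(function (m) { maxOf[m] = max; });  // SCC-wide max
    }
  }
  Object.keys(graph).forEach(function (v) {
    if (!(v in info)) strongConnect(v);
  });
  return maxOf;
}

For example, perSccMaxFrameSize({f: ['g'], g: ['f'], h: []}, {f: 16, g: 64, h: 8}) yields {f: 64, g: 64, h: 8}: the f/g cycle shares one fixed frame size while h keeps its own.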

> What about the other idea of using a special tail-call-tailored calling
> convention?

IIUC, this would require either we use this convention for all calls or we compile a second PTCable version of functions called via PTC.  In the former case, I worry that this will be significantly slower: it's more instructions and we lose the help of the return-address-predictor for branch prediction of the 'ret' (since we're not pairing it with 'call').  In the latter case, the obvious worry is code size (particularly since, if a func-ptr-table is tail-called, all the functions in the table are PTCable).  Again, if Emscripten can reliably avoid PTC and we expect PTC-using programs to be smaller, then this could be fine.
> I guess the question is how do programs use tail calls.  I realized we could
> maybe go further and detect SCCs of the callgraph, and then each SCC could
> have its own max-depth, although function-pointer calls could quickly make
> these SCCs imprecise.

I think the basic use case of PTC in asm.js would be a byte-code interpreter (maybe the Emterpreter?). At the moment, it can be coded as follows (written w/o asm.js type coercions):

loop: for (;; ip = ip + 1) {
  switch (instructions[ip]) {
  case 0:
    // perform action for byte code 0
    break;
  case 1:
    // perform action for byte code 1
    break;
  }
}

This is not very efficient. At the beginning of the switch loop, the compiler has to insert code to check that instructions[ip] is in the range of the case labels, a check that is superfluous as long as the byte code is correct. Secondly, every instruction in the byte code requires two jumps: a direct one to the top of the switch and an indirect one to a particular case. This indirect jump is very costly because the branch predictor will mostly generate misses.

With proper tail calls, one could code this as:

function f0(ip) {
  // perform action for byte code 0
  return instructions[ip](ip + 1);
}

function f1(ip) {
  // perform action for byte code 1
  return instructions[ip](ip + 1);
}

The goal will have to be to make code like this really fast. However, code like this involves mostly function pointers, so any prediction of how the program uses tail calls will most likely be of minor help.
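For concreteness, here is a hedged sketch of what such a dispatcher might look like in actual asm.js, using a function table (Interp, code, T and the 2-bit opcode mask are all invented for illustration; asm.js tables need a power-of-two size and a single signature):

function Interp(stdlib, ffi, buffer) {
  "use asm";
  var code = new stdlib.Int32Array(buffer);  // the bytecode stream
  function halt(ip) {
    ip = ip|0;
    return 0;
  }
  function op0(ip) {
    ip = ip|0;
    // perform action for byte code 0, then dispatch the next opcode;
    // with PTC this coerced call would compile to one indirect jump.
    return T[code[(ip << 2) >> 2] & 3]((ip + 1)|0)|0;
  }
  function op1(ip) {
    ip = ip|0;
    // perform action for byte code 1
    return T[code[(ip << 2) >> 2] & 3]((ip + 1)|0)|0;
  }
  function run(ip) {
    ip = ip|0;
    return T[code[(ip << 2) >> 2] & 3]((ip + 1)|0)|0;
  }
  var T = [halt, op0, op1, halt];  // power-of-two size, one signature
  return run;
}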
 
> IIUC, this would require either we use this convention for all calls or we
> compile a second PTCable version of functions called via PTC.  In the former
> case, I worry that this will be significantly slower: it's more instructions
> and we lose the help of the return-address-predictor for branch prediction
> of the 'ret' (since we're not pairing it with 'call').  In the latter case

That's a good point I forgot. But I can come up with two other proposals that may work:

1) We use a red zone below the stack pointer for arguments. That is, the calling sequence would be:

**
caller:
  mov [esp-8],1   ; 1st argument
  mov [esp-12],2  ; 2nd argument
  call f          ; pushes the return address, so esp drops by 4

f:
  ; 1st argument at [esp-4]
  ; 2nd argument at [esp-8]
  ...
  ; no clean-up of arguments needed
  mov [esp-4],1   ; 1st argument
  mov [esp-8],2   ; 2nd argument
  mov [esp-12],3  ; 3rd argument
  jmp g           ; tail-call

g:
  ; 1st argument at [esp-4]
  ...
  ; last argument at [esp-12]

  sub esp,12      ; protect the red-zone arguments
  call h          ; call with no arguments
  ...
  add esp,12      ; restore stack frame

  ...
  ret             ; return to last non-tail call
**

This approach should be safe on architectures where the stack pointer is not used for hardware interrupts.

2) We use different stacks for code and data. The code stack is the hardware stack (pointed to by esp on x86), which is used by call and ret. The data stack uses an otherwise unused register (e.g. ebp) pointing into a separate region of memory. It is the responsibility of the callee to clean up its arguments on the data stack.

One can also mix approach 2) with the original idea: give each function a stack frame as if it had, say, 8 arguments, and use the original calling convention of the platform. Functions whose number of arguments exceeds 8 have their extra arguments passed on the extra data stack.

The number 8 above could, of course, be replaced by a number determined dynamically by the asm.js compiler.
(In reply to Marc Nieper-Wißkirchen from comment #7)
Ah, thanks for reminding me of the interpreter use case.  Note that it's also an example where we want to do absolutely zero extra work in the call/prologue/epilogue -- nothing more than a jump -- if we want to beat the while-switch and so I think a good use case for the fixed-frame-size strategy.

Extending what I was saying earlier, to handle the general case (we haven't seen all callers/callees or highly variable frame size), we probably need some tail-call-friendly ABI (more on that below).  But in the case, as we have with asm.js, where we've seen the whole callgraph and the frame size is reasonably homogeneous, we could consider the fixed-frame-size an internal optimization.

> This approach should be safe on architectures where the stack pointer is not
> used for hardware interrupts.

We actually do use signal handlers (for sampling profiling and for bounds-check-removal of the asm.js heap loads/stores on 64-bit) and have run into these sort of problems before when we've put anything below sp.

I'd be curious to know what ABI the Scheme JITs use.
> (In reply to Marc Nieper-Wißkirchen from comment #7)
> Ah, thanks for reminding me of the interpreter use case.  Note that it's
> also an example where we want to do absolutely zero extra work in the
> call/prologue/epilogue -- nothing more than a jump -- if we want to beat the
> while-switch and so I think a good use case for the fixed-frame-size
> strategy.

As a first step we could implement this, because the development costs of this strategy seem to be the lowest: no new ABI has to be invented, and we could simply take the maximum frame size in an asm.js module if there are any tail calls (otherwise we fall back on the non-TCO'ing code).

If this works and we get good performance on the interpreter use case, this code path is easily kept from affecting existing code as long as we do not optimize pseudo-tail calls like return f()|0. As Emscripten does not generate any proper tail calls yet (because it can't, due to the currently restricted asm.js subset), it would have zero impact on the performance of Emscripten-generated (or other legacy) asm.js code.

> I'd be curious to know what ABI the Scheme JITs use.

I know that recent versions of GNU lightning have support for emitting tail calls. They work with a predefined stack frame size for all functions inside the tail-calling trampoline. This is okay for a code generator directly targeting GNU lightning because it can control the size of the functions it generates.

Then there is the fastcc calling convention of LLVM. I think it used to use the strategy that the callee pops its arguments off the stack and moves the return address up or down the stack in case the next tail-called function has a different number of arguments. It may be different in recent implementations.

As to Scheme compilers, I don't know a real-world example well enough. But if I were to write one, most of my calls would be tail calls in CPS form: besides proper tail calls, Scheme has the call/cc primitive, which captures the current continuation. So I have to be able to jump around the call stack as I like (like exceptions in both directions, or setjmp/longjmp in both directions). With just tail calls this is easy to implement.
P.S.: I have been spending some time browsing through Ion's and Odin's source code. Where can I ask questions that may come up, without spamming this bug report and without keeping people from work?
There is an #asm.js IRC channel (irc.mozilla.org) for asm.js-related discussion and #ionmonkey for compiler-backend-related discussion; also feel free to just ping me (I'm 'luke').
Another option is to keep the return address in a register.  So when a function is called, with parameters on the stack or in registers, the callee function can save the return address register to the stack (for example the first time it encounters a non-tail call).  When the callee function must return or tail-call another function, it gets the return address back into the return address register.

This is the strategy that was used for a long time by the Gambit compiler (when it had a m68k backend) and also by my experimental JS to x86-32 compiler (js86) I wrote 3 years ago, which outperforms v8 by 35% on fib(42), a very function-call-intensive benchmark.  With current Intel processors, avoiding call/ret may be a bad idea because (if I recall correctly) there's a small return-address stack on the processor that is managed by the call/ret instructions to do branch prediction for the ret.  It would be nice to test this.  Note that the 35% quoted above is on a Sandy Bridge i7, so it is fairly recent.

Marc

(In reply to Luke Wagner [:luke] from comment #8)
> > This approach should be safe on architectures where the stack pointer is not
> > used for hardware interrupts.
> 
> We actually do use signal handlers (for sampling profiling and for
> bounds-check-removal of the asm.js heap loads/stores on 64-bit) and have run
> into these sort of problems before when we've put anything below sp.

This can be solved by keeping in the sp register the address of the real top of stack minus some constant k.  So accesses to the stack must always add k as an offset.  This allows writing "above" the real top of stack (but no more than k), which will still be under the sp.  With this trick you have to use push/pop/call carefully (or not at all).

Of course another approach is to use another register as the stack pointer.  This can actually be beneficial in a system which does heap allocation by bumping a pointer.  The sp register can be used for heap allocation (allocating downwards, using compact and fast "push" instructions to fill the allocated object and move the heap allocation pointer without an extra add/sub instruction).

> I'd be curious to know what ABI the Scheme JITs use.

As I explained in a previous message, in Gambit all parameters (including the return address which is a parameter) are passed in registers.  These registers are saved lazily to the stack (so the stack frame is really just a spill zone with no preassigned location for the parameters and local variables).  A tail-call will move the return address back to the return address register (possibly it is still there if it hasn't been spilled).
This patch lets the asm.js validator validate an extended version of the asm.js subset of JavaScript that allows proper tail calls. These are the additional rules:

1) A call site coercion is not required for tail calls if the return type of a function has already been defined by a previous return statement.

2) If the first function (not counting a change heap function) has no proper tail call, no proper tail call is allowed in the module.

3) If the first function (not counting a change heap function) has a proper tail call, all argument vectors of all functions in the module must have already appeared in a call in the first function.

These rules allow an asm.js compiler to discern already at the beginning of the module whether the module contains proper tail calls or not. In the case of tail calls, it already knows at the beginning of the module the stack size needed for all inner function calls appearing in the module, so compilation strategies with a fixed stack frame size for passing arguments are readily supported.
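A hedged sketch of a small module satisfying these rules as I read them (names illustrative):

function Module() {
  "use asm";
  // First function: contains a proper tail call (rule 2), and every
  // argument vector tail-called in the module, here (int), already
  // appears in a call inside it (rule 3).
  function first(n) {
    n = n|0;
    if ((n|0) == 0)
      return 0;               // a previous return defines the type...
    return first((n - 1)|0);  // ...so no call-site coercion (rule 1)
  }
  function loop(n) {
    n = n|0;
    if ((n|0) == 0)
      return 0;
    return loop((n - 1)|0);   // allowed: the (int) vector was declared
  }
  return loop;
}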
Attachment #8565895 - Flags: review?(luke)
Comment on attachment 8565895 [details] [diff] [review]
Changeset to implement validator for asm.js modules with proper tail calls

Review of attachment 8565895 [details] [diff] [review]:
-----------------------------------------------------------------

Great job jumping into the code and finding the right cut points right away.  I think, for the validation scheme you've described, this implementation makes sense.  I wonder if, when we hash out the final exact validation rules (on a new specifiction topic), we won't prefer to make the "PTC declaration" function a little more like the change-heap function: a rigid grammar and not generally callable.  For now it makes sense to experiment with what you have here.  Switching from review? (which is for landing) to feedback?, and f+ b/c this all looks great.  To save some tedious style refactoring later, you might want to check out https://wiki.mozilla.org/JavaScript:SpiderMonkey:Coding_Style, particularly concerning bracing.

::: js/src/asmjs/AsmJSValidate.cpp
@@ +5160,5 @@
> +CheckArgsIfDeclared(ModuleCompiler &m, ParseNode *usepn, Signature &sig, PropertyName *name) {
> +    if (!m.hasTailCalls()) {
> +        return true;
> +    }
> +    for (unsigned i = 0; i < m.numCallSignatures(); i++) {  

If this is the final strategy we go with, we'd likely want to use a HashSet<Signature>.
Attachment #8565895 - Flags: review?(luke) → feedback+
I had another thought on this:

The current strategy we've discussed (a priori max frame size) puts the burden on the compiler not to accidentally generate mega-arity functions.  What if, instead, we used this hybrid strategy:
 - we require, in the PTC-function, the list of all PTCable functions (along with their signatures)
 - we pick some internal max-stack-arg-bytes limit (say, 64 bytes) and, based on this limit, the ABI and the signature, we can judge, AOT, whether a given PTCable function fits in the limit
 - when a function fits in the limit, we have this zero-overhead PTC ABI (jmp to call, ret to return)
 - when a function does not fit, we adopt a more general (still to be chosen) ABI that allows dynamic frame sizes
 - b/c the list of all *functions* (not signatures) is known AOT, we know, when compiling any given function definition, whether to generate the normal ABI or the PTC-high-arity ABI.

Of course, we can start by setting the max-stack-arg-bytes limit to ∞ and work from there.
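A hedged sketch of that AOT judgment in plain JS (the register count, slot size and names are invented; the real computation would follow the actual ABI and full signature):

var NUM_ARG_REGS = 6;          // assumption: x64-like integer arg regs
var MAX_STACK_ARG_BYTES = 64;  // the internal limit discussed above
function fitsZeroOverheadPTC(numIntArgs) {
  // Arguments beyond the registers spill to the stack, 4 bytes each.
  var stackArgBytes = Math.max(0, numIntArgs - NUM_ARG_REGS) * 4;
  return stackArgBytes <= MAX_STACK_ARG_BYTES;
}

With these illustrative numbers, any function of up to 22 int arguments would get the zero-overhead PTC ABI.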
(In reply to Luke Wagner [:luke] from comment #16)
> The current strategy we've discussed (a priori max frame size) puts the
> burden on the compiler not to accidentally generate mega-arity functions. 
> What if, instead, we used this hybrid strategy:
>  - we require, in the PTC-function, the list of all PTCable functions (along
> with their signatures)
>  - we pick some internal max-stack-arg-bytes limit (say, 64 bytes) and,
> based on this limit, the ABI and the signature, we can judge, AOT, whether a
> given PTCable function fits in the limit
>  - when a function fits in the limit, we have this zero-overhead PTC ABI
> (jmp to call, ret to return)
>  - when a function does not fit, we adopt a more general (still to be
> chosen) ABI that allows dynamic frame sizes
>  - b/c the list of all *functions* (not signatures) is known AOT, we know,
> when compiling any given function definition, whether to generate the normal
> ABI or the PTC-high-arity ABI.

I think this idea is backwards-compatible with the current approach: at the moment, we bail out when we tail call a function whose argument vector has not been defined in the signature function. In a later iteration, following your idea, we would only bail out when the function has been used in a non-tail call earlier (and the asm.js engine would switch to another ABI for that function in case it has a mega-arity not declared earlier).

However, I wonder about the use cases. A compiler that does not want to generate a few mega-arity tail-calling functions could easily use its own "ABI" by passing variables on the heap.
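For example (a hedged sketch of that workaround; H and the slot offset are illustrative): keep the tail-call signature small and thread extra state through the heap:

function Module(stdlib, ffi, buffer) {
  "use asm";
  var H = new stdlib.Int32Array(buffer);
  // Only two "register-friendly" parameters; a third logical argument
  // travels in heap slot 0 instead of widening the signature.
  function big(a, b) {
    a = a|0;
    b = b|0;
    if ((a|0) == 0)
      return (b + (H[0 >> 2]|0))|0;
    H[0 >> 2] = (H[0 >> 2]|0) + 1;  // update the heap-passed argument
    return big((a - 1)|0, b);
  }
  return big;
}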
(In reply to Marc Nieper-Wißkirchen from comment #17)
> However, I wonder about the use cases. A compiler that does not want to
> generate a few mega-arity tail-calling functions could easily use its own
> "ABI" by passing variables on the heap.

That's what actually led me to the current proposal: compilers generating asm.js would have to either pass the problem on to the user or invent their own ABI.  But we can do a much more efficient job in the engine and avoid the inevitable bug reports (b/c *someone* is going to try and then complain).  Plus, we can tune this over time and based on the architecture (more regs = higher max-stack-arg-bytes limit).
This changeset includes the previous one. On top of the old one, I made MAsmJSCall a maybe-MControlInstruction, depending on whether the boolean flag isTailCall is set or not. The asm.js validator does not emit an MAsmJSReturn but an MAsmJSCall with this flag set in case of tail calls.

The code generator already understands this modification of MAsmJSCall. At the moment, it generates a call plus a jump to the return label.

This localizes the work that remains to be done to CodeGeneratorShared::emitAsmJSCall.

To do that, I need a bit of information on the way Ion internally models the stack in an architecture-independent way.
Attachment #8565895 - Attachment is obsolete: true
Attachment #8566438 - Flags: feedback?
Attachment #8566438 - Flags: feedback? → feedback?(luke)
Comment on attachment 8566438 [details] [diff] [review]
WIP: implement proper tail calls to code generation

Review of attachment 8566438 [details] [diff] [review]:
-----------------------------------------------------------------

Looks good, glad to see we can have something optionally act like a control instruction.

::: js/src/jit/MIR.h
@@ +1142,4 @@
>      }
>      void initOperand(size_t index, MDefinition *operand) {
>          // FixedList doesn't initialize its elements, so do an unchecked init.
> +        operands_[index].initUnchecked(operand, static_cast<T*>(this));

This static_cast shouldn't be necessary.
Attachment #8566438 - Flags: feedback?(luke) → feedback+
This patch builds upon the previous one. Here I've included some ideas on how tail calls could work by fixing input slots to stack offsets.

This is not yet runnable code; the register allocator (which would possibly become a "register/arg-slot" allocator) will most likely need to be adjusted as well.
Attachment #8567040 - Flags: feedback?(sunfish)
Attachment #8567040 - Flags: feedback?(luke)
Comment on attachment 8567040 [details] [diff] [review]
Put stack args in tail calls in LUses

Review of attachment 8567040 [details] [diff] [review]:
-----------------------------------------------------------------

As we discussed on irc, my suggestion is to wait on the ARGUMENT LUse, and instead just let the register allocator put stack arguments wherever it likes, and use a MoveGroup to move everything into its final place, for now. I think that'll be easier.

To elaborate a little, the register allocator has bookkeeping to tell it where FIXED *register* uses are, so that it can avoid leaving values in those registers at those locations, but it doesn't yet have such bookkeeping for memory uses. While it'd be fairly easy to teach it to emit a store to copy the value into its desired location, it'd be more work to teach it to be careful if it needs the old value in that memory slot later on. The MoveGroup mechanism knows how to handle dependency cycles, so it'll be easier to get up and running.

I agree that optimizing tail calls will be very valuable, so we'll definitely want to look into extending the register allocator to support what's needed once we have the base functionality done.
Attachment #8567040 - Flags: feedback?(sunfish)
This changeset implements proper tail calls for the OdinMonkey compiler when the caller and the callee have compatible (that is equal) signatures. It is currently flagged feedback but may be reflagged review.

The following has to be added to the asm.js specification: A call site coercion is not required for tail calls if the return type of a function has already been defined by a previous return statement and if the callee has the same signature as the caller. In this case, ES6 semantics for tail calls are guaranteed.

I have removed the need of an extra declaration as the first function, in which one had to list all callee signatures, and the related code.

RATIONALE: Tail calls to functions with the same signature have zero overhead regardless of the calling convention used, and can thus be enabled unconditionally. For most use cases of asm.js as a compilation target that need proper tail calls, this restriction seems to be fine. If I understand correctly, this restricted type of tail call is also proposed for the Rust language.

The behaviour of more general tail calls and their implementation will depend heavily on the calling convention. Thus it seems premature to add a special declaration for tail calls to the asm.js specification; it would most likely be biased toward one or the other calling convention.

The internal ABI used by asm.js does not have to be the OS's ABI. So one should evaluate whether there exists an ABI on all supported architectures (e.g. with a "callee cleans up" convention) that is fast for non-tail calls and allows fast tail calls as well. In that case, we may allow arbitrary tail calls unconditionally (without any declaration).

TODO: To make tail calls that pass most of their arguments unchanged really fast, the register allocator will have to understand argument slots.
Attachment #8566438 - Attachment is obsolete: true
Attachment #8567040 - Attachment is obsolete: true
Attachment #8567040 - Flags: feedback?(luke)
Attachment #8567497 - Flags: feedback?(sunfish)
Attachment #8567497 - Flags: feedback?(luke)
A basic example:

function Module() {
  "use asm";
  function f(n, m) {
    n = n|0;
    m = m|0;
    if ((n|0) == 0) {
      return m|0;
    }
    return f((n - 1)|0, (n + m)|0);
  }
  return f;
}
var m = Module();
console.log(m(50000, 0)); // outputs 50000*50001/2
> This can be solved by keeping in the sp register the address of the real top
> of stack minus some constant k.  So accesses to the stack must always add k
> as an offset.  This allows writing "above" the real top of stack (but no
> more than k), which will still be under the sp.  With this trick you have to
> use push/pop/call carefully (or not at all).

I think it is crucial for the x86/x64 architecture and traditional, non-tail calling code to have properly nested call/ret pairs. If the stack pointer has an offset, using call/ret without added performance penalties in non-tail calling code seems to be hard to implement.

Marc

P.S.: With regard to using a red zone: The Linux ABI on x64 has one, so problems with signal handlers must have already been solved somewhere.

Patching of tail calls in BacktrackingAllocator has to happen after reifying to prevent bogus moves.
Attachment #8567497 - Attachment is obsolete: true
Attachment #8567497 - Flags: feedback?(sunfish)
Attachment #8567497 - Flags: feedback?(luke)
Attachment #8567536 - Flags: feedback?(sunfish)
Attachment #8567536 - Flags: feedback?(luke)
One more change: give functions a second entry point in case they are tail-called. This way, the return address isn't pushed twice onto the stack. Furthermore, with this technique, the profiling prologue of the callee is skipped in case of tail calls.

(This extra entry point can also help in later iterations of the standard to implement different calling conventions for tail versus non-tail calls.)
Attachment #8567536 - Attachment is obsolete: true
Attachment #8567536 - Flags: feedback?(sunfish)
Attachment #8567536 - Flags: feedback?(luke)
Attachment #8567627 - Flags: feedback?(sunfish)
Attachment #8567627 - Flags: feedback?(luke)
This changeset is to be applied after the last non-obsoleted one. Apart from correcting a mistake (one code alignment in GenerateAsmJSFunctionPrologue was superfluous), it makes each frame in asm.js at least 64 bytes long. This makes it possible to skip any stack manipulation when tail calling between functions whose frame sizes do not have to exceed these 64 bytes. (Time will show whether 64 bytes is a good rule of thumb or whether, say, 128 bytes is more convenient.)

To give an understanding of the generated code, here is an example of a simple loop, once written as a for-loop and once in recursive style. As one can see, the tail calls themselves in the recursive version have zero overhead compared to the for-loop version. However, as one can also see, the register allocator is not yet good enough to give the recursive version the same performance: it does a needless move and a needless save and restore on the stack. Once this is resolved, efficient spaghetti code would be possible in asm.js, and Emscripten could possibly go without its clever relooper algorithm.

function Module() {
  "use asm";

  function f(n) {                          // subq $56, %rsp
    n = n|0;
    var m = 1;                             // movl $0x1, %ecx
    var r = 0;                             // xorl %eax, %eax 
    for (; (m|0) <= (n|0); m = (m + 1)|0)  // .F
                                           // cmpl %edi, %ecx
                                           // jg .M
      r = (r + m)|0;                       // addl %ecx, %eax
                                           // addl $1, %ecx
                                           // jmp .F
    return r|0;                            // .M
                                           // addq $56, %rsp
                                           // ret
  }

  function g(n) {                          // subq $56, %rsp
    n = n|0;
    return h(n, 1, 0)|0;                   // mov $0x1, %esi 
                                           // xorl %edx, %edx
                                           // call .N
                                           // addq $56, %rsp
                                           // ret
  }

  function h(n, r, m) {                    // .N
                                           // subq $56, %rsp
                                           // .H
                                           // movl %esi, 0x2c(%rsp)
    n = n|0;
    r = r|0;
    m = m|0;
    if (0) return 0;
    if ((m|0) <= (n|0))                   // cmpl %edi, %edx
                                          // jg .L
      return h(n, (r + m)|0, (m + 1)|0);  // movl 0x2c(%rsp), %eax
                                          // movl %eax, %esi
                                          // addl %edx, %esi
                                          // addl $1, %edx
                                          // jmp .H
    return r|0;                           // .L
                                          // movl 0x28(%rsp), %eax
                                          // addq $56, %rsp
                                          // ret
  }

  return {
    f: f,
    g: g
  }
}

var m = Module();
Attachment #8567704 - Flags: feedback?(sunfish)
Attachment #8567704 - Flags: feedback?(luke)
Visualization for the function h in the example code.
Visualization for the function h in the example code.
Comment on attachment 8567704 [details] [diff] [review]
Fix wrong and superfluous code alignment. Skip most frame allocations and overflow checking for tail calls.

Review of attachment 8567704 [details] [diff] [review]:
-----------------------------------------------------------------

It was my impression from the discussion that the plan is that functions will know whether they can be tail-called or not. Is that true? If so, could functions skip the 64 byte minimum if they aren't going to be tail-called?

::: js/src/jit/shared/CodeGenerator-shared.cpp
@@ +83,5 @@
>          MOZ_ASSERT(graph->argumentSlotCount() == 0);
>          frameDepth_ += gen->maxAsmJSStackArgBytes();
>  
> +        if (frameDepth_ < (signed) (64 - sizeof(AsmJSFrame)))
> +            frameDepth_ = 64 - sizeof(AsmJSFrame);

It'd be good to factor out the magic number here and elsewhere into a named constant, and/or possibly even some of these magic calculations into helper functions.
Attachment #8567704 - Flags: feedback?(sunfish)
Comment on attachment 8567627 [details] [diff] [review]
Changeset that implements proper tail calls in asmjs when the callee has same signature as the caller

Review of attachment 8567627 [details] [diff] [review]:
-----------------------------------------------------------------

Overall, this is very cool! Below are some comments I had on an initial readthrough of the codegen parts.

::: js/src/jit/BacktrackingAllocator.cpp
@@ +1374,5 @@
> +BacktrackingAllocator::patchTailCalls()
> +{
> +    if (!compilingAsmJS())
> +        return true;
> +        JitSpew(JitSpew_RegAlloc, "Patch Tailcalls");

Style nit: de-indent.

@@ +1382,5 @@
> +            return false;
> +
> +        LBlock *block = graph.getBlock(i);
> +        for (LInstructionIterator iter = block->begin(); iter != block->end(); ++iter) {
> +            if (iter->isAsmJSCall()) {

To reduce indentation below, use

    if (!iter->isAsmJSCall())
        continue;

@@ +1388,5 @@
> +                MAsmJSCall *mir = ins->mir();
> +                if (mir->numStackArgs() == 0)
> +                    continue;
> +
> +                LMoveGroup *moveGroup = LMoveGroup::New(alloc());

I believe you can actually call getInputMoveGroup(ins) here to allocate the LMoveGroup instead of doing it manually. That will allow setInputMoves to be called, which will make this LMoveGroup look a little more like regular LMoveGroups, in case some unsuspecting code sees it.

@@ +1396,5 @@
> +                    if (*from != *to)
> +                        moveGroup->add(from, to,
> +                            LDefinition::TypeFrom(mir->getOperand(j + mir->numArgs())->type()));
> +                }
> +                ins->setNumOperands(mir->numArgs());

Please add a comment here saying "truncate the operands list to exclude the stack arguments, which are now handled by the MoveGroup" or so.

::: js/src/jit/LIR-Common.h
@@ +6490,4 @@
>      }
>  
>      bool isCall() const {
> +        return !mir()->isTailCall();

A comment saying something like "tail calls only appear at the ends of control paths, so values are never live past them, so the register allocator doesn't need to think of them as calls" would be useful here.

::: js/src/jit/MIR.h
@@ +1142,4 @@
>      }
>      void initOperand(size_t index, MDefinition *operand) {
>          // FixedList doesn't initialize its elements, so do an unchecked init.
> +        operands_[index].initUnchecked(operand, static_cast<T*>(this));

Why is this static_cast needed? initUnchecked's second argument is MDefinition, which is already a base class for MInstruction and MControlInstruction.

@@ +12501,4 @@
>  };
>  
>  class MAsmJSCall MOZ_FINAL
> +  : public MVariadicInstruction<MControlInstruction>,

How practical would it be to make a separate MAsmJSTailCall class, with common functionality between MAsmJSTailCall and plain MAsmJSCall factored out into a utility class? That would avoid the awkwardness of making all MAsmJSCall instructions inherit from MControlInstruction.

@@ +12619,2 @@
>      bool possiblyCalls() const MOZ_OVERRIDE {
>          return true;

It is interesting to think about whether tail calls should have possiblyCalls return false. possiblyCalls is used in LICM to determine whether registers are likely to be spilled, which doesn't apply to tail calls since they by definition aren't in explicit loops, and it's used in codegen to optimize frame layout, which might be interesting.
Attachment #8567627 - Flags: feedback?(sunfish)
(In reply to Dan Gohman [:sunfish] from comment #31)
> Comment on attachment 8567704 [details] [diff] [review]

Thanks a lot for reviewing!
 
> It was my impression from the discussion that the plan is that functions
> will know whether they can be tail-called or not. Is that true? If so, could
> functions skip the 64 byte minimum if they aren't going to be tail-called?

As long as we keep the restriction that tail-calling is only possible between functions of the same signature (see Rust, see LLVM), it seems overkill and pretty unattractive to force all tail callees to be declared in advance.

However, the 64-byte minimum should have little or no impact in the non-tail-calling context: for leaf functions it does not matter, and non-leaf functions can put their callees' arguments at a position on the stack such that some of the 64 bytes are reused by the callees.

> 
> ::: js/src/jit/shared/CodeGenerator-shared.cpp
> @@ +83,5 @@
> >          MOZ_ASSERT(graph->argumentSlotCount() == 0);
> >          frameDepth_ += gen->maxAsmJSStackArgBytes();
> >  
> > +        if (frameDepth_ < (signed) (64 - sizeof(AsmJSFrame)))
> > +            frameDepth_ = 64 - sizeof(AsmJSFrame);
> 
> It'd be good to factor out the magic number here and elsewhere into a named
> constant, and/or possibly even some of these magic calculations into helper
> functions.

Good idea.
Attached patch changes.txt (obsolete) — Splinter Review
Improved the code according to the review, except for the refactoring of MAsmJSCall. I am not sure whether having another common abstract base class is really more lucid, and my first try wasn't successful.
Attachment #8567627 - Attachment is obsolete: true
Attachment #8567704 - Attachment is obsolete: true
Attachment #8567627 - Flags: feedback?(luke)
Attachment #8567704 - Flags: feedback?(luke)
Attachment #8568151 - Flags: feedback?(sunfish)
Attachment #8568151 - Flags: feedback?(luke)
Another general comment: A compiler targeting asmjs, e.g. a compiler from LLVM to asmjs, could want to compile a function

f(int a, int b, int c) { int x; int y; int z; /* code with a lot of branching */ }

into a series of functions

f_i(int a, int b, int c, int x, int y, int z) { /* code that tail calls the f_j */ }

where the original branch statements have been replaced with tail calls and all the local variables have ended up as arguments to keep track of the state.
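A hedged sketch of what the generated f_i could look like under the same-signature restriction discussed above (block structure and bodies are invented for illustration):

function Module() {
  "use asm";
  // f's entry block: the locals x, y, z ride along as arguments, and
  // every branch is a same-signature tail call, i.e. effectively a goto.
  function f_0(a, b, c, x, y, z) {
    a = a|0; b = b|0; c = c|0;
    x = x|0; y = y|0; z = z|0;
    if ((c|0) == 0)
      return x|0;                            // exit block
    if ((a|0) < (b|0))
      return f_1(a, b, c, (x + 1)|0, y, z);  // "goto" block 1
    return f_0(a, b, (c - 1)|0, x, y, z);    // back edge
  }
  function f_1(a, b, c, x, y, z) {
    a = a|0; b = b|0; c = c|0;
    x = x|0; y = y|0; z = z|0;
    if (0) return 0;                         // defines the return type
    return f_0((a + 1)|0, b, (c - 1)|0, x, y, z);
  }
  return f_0;
}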

This is a scenario we should support very well. This leads to a question: does the register allocator distinguish deeply between argument slots and stack slots? It looks a bit as if they should be viewed on an equal footing.
This patch finally honours the promise given in comment #33, namely that even when setting MinimumAsmJSFrameBytes to a high number (I previously had 64 as a hard-coded value), the stack does not grow by more than this fixed amount compared to running without this setting.

The idea is the following: when a callee is entered (via a non-tail call), it reserves at least MinimumAsmJSFrameBytes on the stack, even if it only needs fewer bytes, say X, in its frame. When this callee makes another non-tail call, it puts the arguments of its callee at about sp + MinimumAsmJSFrameBytes - X. Thus the callee's callee reuses part of the stack frame.

In case the callee wants to make a tail call, it gives up ownership of its frame up to MinimumAsmJSFrameBytes to its callee's stack, so in this case, too, there is no extra growth of the stack.

(Of course all calculations above have to be properly aligned.)
Attachment #8568564 - Flags: review?(sunfish)
Attachment #8568564 - Flags: review?(luke)
I'm afraid I won't be able to carefully review these for another two weeks.  Also, r? usually means "review for landing", and we should propose/discuss on a new specifiction topic first.

One question, though: so in this patch the callee is reserving the max space needed for a tail call.  I had been assuming that the caller would do this: we'd know a priori which functions perform tail calls and so, when calling a function that performs a tail-call, you reserve not just the space for the callee's stack args, but the max tail-call-args space.  The advantages that I imagined were that tail-call-performing functions don't have to do any stack shuffling in their prologue or at tail-call sites.  I'm sure I don't understand all the constraints, though; could you compare and contrast this with the strategy in this patch?
> I'm afraid I won't be able to carefully review these for another two weeks. 
> Also, r? usually means "review for landing", and we should propose/discuss
> on a new specifiction topic first.

So I should start a topic on specifiction with the intended changes to the specification? Is the
current version of the specification to be found somewhere? (The one on http://asmjs.org/spec/latest/ doesn't include the changeHeap functionality, for example. Or should I use it as a reference nonetheless?)
 
> One question, though: so in this patch the callee is reserving the max space
> needed for a tail call.  I had been assuming that the caller would do this:
> we'd know a priori which functions perform tail calls and so, when calling a
> function that performs a tail-call, you reserve not just the space for the
> callee's stack args, but the max tail-call-args space.  The advantages that
> I imagined were that tail-call-performing functions don't have to do any
> stack shuffling in their prologue or at tail-call sites.  I'm sure I don't
> understand all the constraints, though; could you compare and contrast this
> with the strategy in this patch?

The current patches don't need the assumption of a special declaration function in the module's prologue. The restriction they impose is that only tail-calls to callees with the same signature are possible (which is a good first case that should be specified and should be enough for almost all use cases).

Unless a tail-called callee needs an excessive amount of spill space on the stack, the MinimumAsmJSFrameBytes ensures that no stack-shuffling is needed whatsoever when executing a tail-call.
(In reply to Marc Nieper-Wißkirchen from comment #38)
> So I should start a topic on specifiction with the intended changes to the
> specification? Is the
> current version of the specification to be found somewhere? (The one on
> http://asmjs.org/spec/latest/ doesn't include the changeHeap functionality,
> for example. Or should I use it as a reference nonetheless?)

Yeah, the asmjs.org spec is due for an update with the recent extensions proposed on specifiction.  I realized, though, we probably need more implementation and experimentation in this bug before we post the topic, since it affects what we end up proposing.

> The current patches don't need the assumption of a special declaration
> function in the module's prologue. The restriction they impose is that only
> tail-calls to callees with the same signature are possible (which is a good
> first case that should be specified and should be enough for almost all use
> cases).

I think it makes sense to incrementally experiment, but I don't think we should incrementally specify for this one feature.  If we want to handle mega-arity functions (which it's easy to see arising), then I think we do need a priori knowledge of which functions are tail-callable.

> Unless a tail-called callee needs an excessive amount of spill space on the
> stack, the MinimumAsmJSFrameBytes ensures that no stack-shuffling is needed
> whatsoever when executing a tail-call.

Right, but if I start in 'f', and non-tail-call 'g' and then g tail-calls 'h', iiuc, it is g's prologue in this patch that reserves the frame bytes so, at a minimum, the return address (pushed by the f->g call) has to be moved before g tail-calls h.  That is, I assume that, for any function that wants to perform a tail-call, we'd want this stack layout:

  base of stack ... |                   | stack args | AsmJSFrame |                  | <- sp
                     <--- max tail-call arg bytes -->              <-- frameDepth -->

and the "max tail-call arg bytes" are ensured *before* calling this function.  Also, I think this patch reserves tail-call space in all function prologues; I think we shouldn't need to change the layout of normal non-tail-calling frames.
> Yeah, the asmjs.org spec is due for an update with the recent extensions
> proposed on specifiction.  I realized, though, we probably need more
> implementation and experimentation in this bug before we post the topic,
> since it affects what we end up proposing.

Alright, then I will write up something here what I think could be proposed on specifiction so that all reading this bug know what is currently implemented and what is currently being tested.

> I think it makes sense to incrementally experiment, but I don't think we
> should incrementally specify for this one feature.  If we want to handle
> mega-arity functions (which it's easy to see arising), then I think we do
> need a priori knowledge of which functions are tail-callable.

My point is that I thought that handling the mega-arity case was costly. But it has zero cost in the restricted case where the caller has the same mega-arity. I currently don't see a case where this restriction would come at a non-zero cost for compiler writers.

So my proposal boils down to specifying this restricted case in the asmjs specification and not to handle arbitrary tail calls. (As I wrote above, this is the same strategy of Rust and LLVM if I am not mistaken.)

Or do you see a use case for tail-calling functions of a different signature?
 
> Right, but if I start in 'f', and non-tail-call 'g' and then g tail-calls
> 'h', iiuc, it is g's prologue in this patch that reserves the frame bytes
> so, at a minimum, the return address (pushed by the f->g call) has to be
> moved before g tail-calls h.  That is, I assume that, for any function that

This patch never reserves extra bytes in the arg space. It reserves extra bytes in the spill space, which are passed down to the tail call. When doing a non-tail call, these extra bytes are added to the spIncrement we already have in AsmJSPassStackArg and AsmJSCall. [Have to hurry now, will draw a diagram when I'm back.]

> and the "max tail-call arg bytes" are ensured *before* calling this
> function.  Also, I think this patch reserves tail-call space in all function
> prologues; I think we shouldn't need to change the layout of normal
> non-tail-calling frames.

Even without the patch, all functions reserve some stack space in their prologues, namely for the base pointer, just in case profiling is enabled. So it is not a fundamental change, nor does the layout of non-tail-calling frames change. (I probably didn't word it well; maybe we can chat on IRC so that I can explain what I mean here.)
(In reply to Marc Nieper-Wißkirchen from comment #40)
> So my proposal boils down to specifying this restricted case in the asmjs
> specification and not to handle arbitrary tail calls. (As I wrote above,
> this is the same strategy of Rust and LLVM if I am not mistaken.)

Oh, sorry, I totally missed this design point: so you can only tail-call with your own signature?  That would certainly simplify things.

> Or do you see a use case for tail-calling functions of a different signature?

I do not, but I don't have much tail-calling experience :)  So this makes sense when compiling a single function with continuations, since the signatures will all be the same, as you've pointed out above.  But what about when you are compiling a PTC in the source language between two different-arity functions?
(In reply to Luke Wagner [:luke] from comment #41)
> (In reply to Marc Nieper-Wißkirchen from comment #40)
> > So my proposal boils down to specifying this restricted case in the asmjs
> > specification and not to handle arbitrary tail calls. (As I wrote above,
> > this is the same strategy of Rust and LLVM if I am not mistaken.)
> 
> Oh, sorry, I totally missed this design point: so you can only tail-call
> with your own signature?  That would certainly simplify things.

Yes. These tail calls are easy to do without any change to whatever ABI, so are asmjs-like. :-)

Arbitrary tail calls are hard to do with the current ABI (that's why we came up with the idea of declaring them in advance). But having to declare them in one particular way just because of the ABI choice of one implementation, namely OdinMonkey, while another implementation might want a different declaration of tail calls to implement them efficiently in its ABI, is a bit smelly, isn't it? That's why I am proposing to drop them.

> > Or do you see a use case for tail-calling functions of a different signature?
> 
> I do not, but I don't have much tail-calling experience :)  So this makes
> sense when compiling a single function with continuations, since the
> signatures will all be the same, as you've pointed out above.  But what
> about when you are compiling a PTC in the source language between two
> different-arity functions?

Firstly, if it is a dynamic source language allowing a varying number of arguments (as in the case of JavaScript or Scheme), we cannot map its functions 1:1, so the signature of the function in the source language will not necessarily correspond to the one of the function it is compiled to (e.g. all but a fixed number of parameters are passed through the heap).

Secondly, if the tail calls are static, so that the calling sequence/connected component of callers/callees is known at compile time, a compiler could insert dummy arguments to make all signatures alike (and passing arguments in tail calls should have a zero cost if they are passed unchanged). Effectively, in this case, the compiler will do in advance what we thought asmjs would do in our initial proposal.

Thirdly, if some of the tail calls are dynamic, we already have a restriction imposed, namely that the signature of the dynamic callee has to be fixed in advance (functions in one function table have to be compatible), so this basically boils down to the second case.

Lastly, asmjs would currently have the same expressive power as a typical assembler language were it not for the missing gotos and computed gotos. Tail calls between functions with the same signature are effectively gotos (computed ones in the case of dynamic calls), so once I have this restricted set of tail calls, I can basically map every assembler algorithm 1:1 to asmjs. (I can even recover a register set, namely the first arguments of every call.) So whatever I can compile down to assembler efficiently should be equally compilable to asmjs. (Emscripten could be extended to support the musttail calling convention of LLVM, so that everything LLVM handles well, asmjs likewise will.)

I think this last point is the most important one: If tail calls are supposed to be the replacement for assembler jumps, we have to get this special case really fast without any overhead (i.e. we have to be so good that relooping in Emscripten won't be worth the effort anymore). During the recent days while implementing this stuff, I have come to the conclusion that allowing arbitrary tail calls just diverts from this important goal.
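To illustrate the second point, a hedged sketch of the dummy-argument trick (module boilerplate omitted; g logically takes one argument and f two, padded to the common signature (int, int) -> int):

function f(a, b) {
  a = a|0;
  b = b|0;
  if ((a|0) == 0)
    return b|0;
  return g((a - 1)|0, 0);  // the 0 pads g's logical one-arg signature
}
function g(a, dummy) {
  a = a|0;
  dummy = dummy|0;
  if (0) return 0;         // defines g's return type up front
  return f(a, 1);
}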

Finally, here comes the promised stack layout for a function f non-tail calling g, which tail-calls h (view in a wide window):

                                        <--- MinimalAsmJSFrameSize ---------------------------------------------------------->
  base of stack ... | stack args for f | AsmJSFrame | spilling area for f | area for AsmJSPassStackArg/AsmJSCall | ... unused | <- sp after prologue of f
                                                                          | stack args for g | AsmJSFrame | spilling area for g | area for AsmJSPassStackArg/AsmJSCall | ... unused | <- after prologue of g
                                                                          | stack args for h | AsmJSFrame | spilling area for h ...                                                 | <- after prologue of h

So MinimalAsmJSFrameSize is needed so that a tail call between two functions whose stack frames don't have to be larger needs no shuffling of the stack pointer.
> (Emscripten could be extended to support the
> musttail calling convention of LLVM, so that everything LLVM handles well,
> asmjs likewise will.)

We definitely want to do this!

> I think this last point is the most important one: If tail calls are
> supposed to be the replacement for assembler jumps, we have to get this
> special case really fast without any overhead (i.e. we have to be so good
> that relooping in Emscripten won't be worth the effort anymore). During the
> recent days while implementing this stuff, I have come to the conclusion
> that allowing arbitrary tail calls just diverts from this important goal.

The thing with the relooper is, when it works well, it works *really* well. Explicitly structured control structures (if, if-else, while, do-while, etc.) are very concise and compressible at the asm.js level. And, they're very convenient for generating good code in the JS engine level. There are efficient ways to construct SSA form (as compared to constructing SSA on a CFG), and it keeps basic blocks naturally grouped by routine. I don't know how we can get all these benefits if we break each basic block into a separate function using tail calls. If you have ideas here, let's talk :).

Of course, the relooper doesn't always work well. When control constructs get complex, it often falls back to merging many control paths together and using variables to record which logical control path to take, often at great cost in register pressure. Tail calls seem like they should make it easier to generate much more efficient code in these cases. And, many functional languages would find it much more convenient to generate tail-call code directly.

So it looks to me like the goal here should be a hybrid model, where we use the relooper when it works well, and tail calls when they work well, so that we get the best of both worlds. We want to optimize tail calls as much as we can, but we also want them to be as flexible as possible, so that we can use them for all the most demanding cases, with as few hurdles as possible. It's totally fine to require functions participating in tail calls to have matching signatures for now, but it seems like in the future, we'd eventually want to relax this restriction. Does that make sense?
Comment on attachment 8568151 [details] [diff] [review]
changes.txt

Review of attachment 8568151 [details] [diff] [review]:
-----------------------------------------------------------------

Yes, stack slots and memory slots are on roughly equal footing in the register allocator, in the sense that both can be directly used by an ANY LUse. I expect that once we do the optimizations to eliminate the memory-to-memory copying in the LMoveGroup, we'll be able to do quite well here.

::: js/src/asmjs/AsmJSFrameIterator.cpp
@@ +315,5 @@
> +    masm.bind(&labels->tailCallEntry);
> +    if (frameDepthBeyondMinimum > 0)
> +        // The caller has not reserved enough bytes on the stack as we need
> +        // more than the minimum. So we reserve it now.
> +        masm.subPtr(Imm32(frameDepthBeyondMinimum), StackPointer);

Style nit: in this if statement and a few others in this patch, the style rules around here say that you need braces around the if body if there are multiple lines, and that includes comments.
Attachment #8568151 - Flags: feedback?(sunfish) → feedback+
I ran http://kripken.github.io/Massive/ against my patch that enables tail calls. There was virtually no difference in score compared to an unmodified nightly version.

Of course, this did not measure the efficiency of the tail call implementation, but it did show that always allocating a minimum of spill space (and freeing the unused part of it before passing control to non-tail callees) didn't affect the overall execution time of real-world code.

Are there other benchmarks I should run?

Measuring the performance of the generated code when tail calls are used is much harder because there are no reference values. For a test, I coded a simple bytecode interpreter in two versions and let both execute bytecode calculating Fibonacci numbers. The first version uses a large switch statement and the second uses dynamic tail calls to execute the series of bytecode instructions. I got a 20% speed increase for the version doing tail calls.
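
A rough sketch of the two dispatch styles (plain JS with a hypothetical two-instruction bytecode; this is not the actual test code):

// Version 1: one big loop with a switch on the current opcode.
function runSwitch(code) {
  var pc = 0, acc = 0;
  while (true) {
    switch (code[pc]) {
      case 0: acc = acc + 1; pc = pc + 1; break; // INC
      default: return acc;                       // HALT
    }
  }
}

// Version 2: one handler per opcode; each handler tail-calls the next
// one through a table, turning every dispatch into a computed jump.
function opInc(code, pc, acc) {
  return handlers[code[pc + 1]](code, pc + 1, acc + 1);
}
function opHalt(code, pc, acc) {
  return acc;
}
var handlers = [opInc, opHalt];

// Example: handlers[code[0]](code, 0, 0) with code = [0, 0, 0, 1] yields 3.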

A better test would be if we could test my patch against a real-world bytecode interpreter like Emterpreter. For that, however, Emscripten would have to be available in a version that does tail calls.
Does anyone have an idea how to feature-detect support for tail calls in asm.js as long as not all engines implement them? Is there a way to detect whether a particular module has been validated as an asm.js module in a given browser?

Of course, it is possible to check for stack overflows by running a piece of test code first, but this only gives a lower limit of the implementation's stack size.
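
A crude probe along those lines might look like this (plain JS sketch; the helper name and depth are made up, and as said it only establishes a lower bound):

function deepTailCallsWork(depth) {
  function f(n) { return n <= 0 ? true : g(n - 1); }
  function g(n) { return f(n); }
  try {
    return f(depth); // e.g. depth = 1e7
  } catch (e) {
    return false; // stack overflow: tail calls were not eliminated at this depth
  }
}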
So to check my understanding, it sounds like there are two use cases here:
 1. new, better intraprocedural-ish control flow primitive for when the relooper fails (or isn't used; e.g. runtime asm.js generation)
 2. compiler target for real source languages with PTC
and we're only targeting 1 directly with this feature but indirectly allowing 2 to be solved by saying that 1 gives the to-asm.js compiler a super-efficient building block for doing its own PTC ABI for solving 2.  That's pretty cool and does seem in the asm.js spirit.  And, as you already said, we could grow asm.js to support 2 directly if the need arose.

The only thing I wish we could do better was avoiding the stack adjustment before/after all non-tail calls.  Massive has pretty high variance in my experience so it'd be hard to see a .5% variance.  The asmjs-apps-*-throughput benchmarks on https://github.com/haytjes/arewefastyet run from the shell could do a better job, esp using cachegrind to get exact insn counts.  What seems unfortunate is that this overhead isn't fundamentally necessary: if we made multiple passes we could figure out the right stack frame size for each PTC-SCC and just use it (adjusting sp only for non-tail-calls out of PTCable functions).

Question for Marc (Feeley): I know you've been compiling Scheme to JS for a while and had to work around lack of PTC; would this extension give you what you need to avoid the horrible workarounds you have to do now?  IIUC, the extension is pretty simple: you can tail-call functions that have the same signature and we'll compile that to a 'jmp'.
Flags: needinfo?(feeley)
(In reply to Luke Wagner [:luke] from comment #47)
> Question for Marc (Feeley): I know you've been compiling Scheme to JS for a
> while and had to work around lack of PTC; would this extension give you what
> you need to avoid the horrible workarounds you have to do now?  IIUC, the
> extension is pretty simple: you can tail-call functions that have the same
> signature and we'll compile that to a 'jmp'.

Yes, having this would remove the need for a trampoline.  Currently the Gambit Scheme compiler with the JS backend compiles each basic-block to a 0-parameter JS function, and if a function A needs to jump to a function B, it returns the function (pointer) B to the trampoline and the trampoline then calls B.  I've attached below the code generated for fib without the trampoline so you can see how control is transferred from one BB to another.  This would probably be a good benchmark for asmjs once PTC is implemented.

var r0; // register 0, typically contains the "return address"
var r1; // register 1, typically contains arg 1 and the result
var stack = []; // the Scheme stack
var sp = -1; // the stack pointer

function bb1_fib() { // entry-point
  if (r1 < 2) {
    return r0();
  } else {
    return bb3_fib();
  }
}

function bb3_fib() {
  stack[sp+1] = r0;
  stack[sp+2] = r1;
  r1 = r1 - 1;
  r0 = bb2_fib;
  if (r1 < 2) {
    (sp += 2);
    return r0();
  } else {
    (sp += 2);
    return bb3_fib();
  }
}

function bb2_fib() { // return-point
  stack[sp+1] = r1;
  r1 = stack[sp] - 2;
  r0 = bb5_fib;
  if (r1 < 2) {
    ++sp;
    return r0();
  } else {
    ++sp;
    return bb3_fib();
  }
}

function bb5_fib() { // return-point
  r1 = stack[sp] + r1;
  (sp -= 3);
  return stack[sp+1]();
}

function underflow() { // primordial continuation
  print(r1); // print the result
}

r0 = underflow;
r1 = 15; // compute fib(15)  <<<< replace with 40 for a beefier test
bb1_fib();
Flags: needinfo?(feeley)
Marc, could you also post the trampoline version so that I can run a comparison? To get a decent result with the tail-call version, it is probably advisable to pass r0, r1 and sp around as arguments so that they end up in CPU registers.
> The only thing I wish we could do better was avoiding the stack adjustment
> before/after all non-tail calls.  Massive has pretty high variance in my
> experience so it'd be hard to see a .5% variance.  The
> asmjs-apps-*-throughput benchmarks on
> https://github.com/haytjes/arewefastyet run from the shell could do a better
> job, esp using cachegrind to get exact insn counts.  What seems unfortunate
> is that this overhead isn't fundamentally necessary: if we made multiple
> passes we could figure out the right stack frame size for each PTC-SCC and
> just use it (adjusting sp only for non-tail-calls out of PTCable functions).

I am going to run the asmjs-apps benchmarks and compare the results between a vanilla and a patched js shell so that we have a better feeling for the implications. I think we can make one of the following choices:

1) Use the current version: tail calls are just jumps if the caller and the callee do not need overly large stack frames. Normal function calls adjust the stack pointer at the call site and, in the case of callees with overly large stack frames, also in the callee's prologue.

2) Use the current version but do not adjust the stack pointer for normal calls at the call site. This increases stack allocation but means one instruction less for normal calls.

3) Don't change the implementation for non-tail calls at all. Instead adjust the stack pointer before and after each jump in case of a tail call.

4) Be flexible and choose 1) or 3) depending on whether the first function in the module gives a hint as to whether tail calls are used or not.

5) Impose on the asmjs subset the requirement that all tail callees and tail callers have to be declared in the first function. Then choose 1) or 3), where the choice varies from function to function.

6) Choose 1) or 3) and run a second pass of the asmjs compiler in the background to collect extra data and hot-swap the code with a better version.

My current thinking about these choices is as follows:

To choose 1), we need at least results from the other benchmark. If these results are positive, I would propose this solution.

2) also sounds like a plausible choice. For example, in the native Windows 64 ABI, a register parameter area of 32 bytes is reserved at each call site whether the function has a need for it or not. So reserving a small extra bit of stack space at each call does not seem to have dire implications. In fact, with tail calls, no sane program should need more than 1000 levels of stack frames. The effect of 2) on normal calls is even smaller than in 1). Only normal calls into callees with overly large stack frames need an extra stack pointer adjustment at their entry. But such callees are probably not often called in a tight inner loop.

3) is, in my opinion, a very bad choice. If a tail call is to be a substitute for a jump and a basic building block in asmjs code, we don't want more code than necessary on either side of the jump. In particular, we don't want stack overflow checks at tail call entries. On the other hand, time-critical normal functions can be inlined by the compiler, and the overhead of two adjustments to the stack pointer seems negligible compared with the call/ret instructions, the stack overflow check and the parameter passing.

4) would be a novel feature in the asmjs spec. At the moment I don't see language constructs in the asmjs specification that are just there to give the compiler a hint on how to compile functions. Also, 4) would be so coarse that it is merely a flag saying whether I want tail calls enabled or whether I want my previous code handled exactly as before.

5) would blow up the code a lot in the case of many tail calls. For programs doing a lot of tail calls and non-tail calls between functions, it would possibly not be much better in performance than the much simpler 4). Also, like 4), it does not really feel asmjs-like. Finally, we would write something into the spec that is possibly completely superfluous once a better ABI is installed that handles tail calls and non-tail calls equally well (e.g. a calling convention with a separate stack for arguments - is there a reason that speaks against it?).

6) is probably a thing of the future and should be subsumed into a goal of an optimizing asmjs compiler that analyses more than just tail calls, e.g. does multifunction register allocation, etc.
(In reply to Marc Nieper-Wißkirchen from comment #49)
> Marc, could you also post the trampoline version so that I can run a
> comparison?

Here are two versions with trampolines (the "poll" one is optimized so that it doesn't return to the trampoline at every call).

// fib-trampoline.js

var r0;
var r1;
var stack = [];
var sp = -1;

function bb1_fib() { // entry-point
  if (r1 < 2) {
    return r0;
  } else {
    return bb3_fib;
  }
}

function bb3_fib() {
  stack[sp+1] = r0;
  stack[sp+2] = r1;
  r1 = r1 - 1;
  r0 = bb2_fib;
  if (r1 < 2) {
    (sp += 2);
    return r0;
  } else {
    (sp += 2);
    return bb3_fib;
  }
}

function bb2_fib() { // return-point
  stack[sp+1] = r1;
  r1 = stack[sp] - 2;
  r0 = bb5_fib;
  if (r1 < 2) {
    ++sp;
    return r0;
  } else {
    ++sp;
    return bb3_fib;
  }
}

function bb5_fib() { // return-point
  r1 = stack[sp] + r1;
  (sp -= 3);
  return stack[sp+1];
}

function trampoline(pc) {
  while (pc) {
    pc = pc();
  }
}

r0 = false;
r1 = 15; // compute fib(15)  <<<< replace with 40 for a beefier test
trampoline(bb1_fib);
print(r1); // print the result



// fib-trampoline-poll.js

var r0;
var r1;
var stack = [];
var sp = -1;
var count = 0;

function poll(dest) {
  count = 100;
  return dest;
}

function bb1_fib() { // entry-point
  if (r1 < 2) {
    return r0;
  } else {
    return bb3_fib();
  }
}

function bb3_fib() {
  stack[sp+1] = r0;
  stack[sp+2] = r1;
  r1 = r1 - 1;
  r0 = bb2_fib;
  if (r1 < 2) {
    (sp += 2);
    return r0;
  } else {
    (sp += 2);
    if (--count < 0) {
      return poll(bb3_fib);
    } else {
      return bb3_fib();
    }
  }
}

function bb2_fib() { // return-point
  stack[sp+1] = r1;
  r1 = stack[sp] - 2;
  r0 = bb5_fib;
  if (r1 < 2) {
    ++sp;
    return r0;
  } else {
    ++sp;
    return bb3_fib();
  }
}

function bb5_fib() { // return-point
  r1 = stack[sp] + r1;
  (sp -= 3);
  return stack[sp+1]();
}

function underflow() { // primordial continuation
  print(r1); // print the result
  return false; // exit trampoline
}

function trampoline(pc) {
  while (pc) {
    pc = pc();
  }
}

r0 = underflow;
r1 = 15; // compute fib(15)  <<<< replace with 40 for a beefier test
trampoline(bb1_fib);
(In reply to Marc Nieper-Wißkirchen from comment #49)
> To get a decent result with the tail-call version, it is
> probably advisable to pass r0, r1 and sp around as arguments so that they
> end up in CPU registers.

Gambit actually has about 6 or 7 live "registers", so I'm worried this will generate very large JS source code (even if the actual machine code is good because all of the calls just propagate the incoming parameter to the outgoing parameter without needing a move instruction).  In any case, here is the code with these registers as function parameters.  Note that in asmjs the "stack" parameter would probably go away and sp would index a global array.

// fib-regs-in-params.js

function bb1_fib(stack,sp,r0,r1,r2,r3,r4,r5) { // entry-point
  if (r1 < 2) {
    return r0(stack,sp,r0,r1,r2,r3,r4,r5);
  } else {
    return bb3_fib(stack,sp,r0,r1,r2,r3,r4,r5);
  }
}

function bb3_fib(stack,sp,r0,r1,r2,r3,r4,r5) {
  stack[sp+1] = r0;
  stack[sp+2] = r1;
  r1 = r1 - 1;
  r0 = bb2_fib;
  if (r1 < 2) {
    (sp += 2);
    return r0(stack,sp,r0,r1,r2,r3,r4,r5);
  } else {
    (sp += 2);
    return bb3_fib(stack,sp,r0,r1,r2,r3,r4,r5);
  }
}

function bb2_fib(stack,sp,r0,r1,r2,r3,r4,r5) { // return-point
  stack[sp+1] = r1;
  r1 = stack[sp] - 2;
  r0 = bb5_fib;
  if (r1 < 2) {
    ++sp;
    return r0(stack,sp,r0,r1,r2,r3,r4,r5);
  } else {
    ++sp;
    return bb3_fib(stack,sp,r0,r1,r2,r3,r4,r5);
  }
}

function bb5_fib(stack,sp,r0,r1,r2,r3,r4,r5) { // return-point
  r1 = stack[sp] + r1;
  (sp -= 3);
  return stack[sp+1](stack,sp,r0,r1,r2,r3,r4,r5);
}

function underflow(stack,sp,r0,r1,r2,r3,r4,r5) { // primordial continuation
  print(r1); // print the result
}

bb1_fib([], -1, underflow, 15, false, false, false, false); // compute fib(15)  <<<< replace with 40 for a beefier test
I did some measurements with time, valgrind and the asmjs-apps-*-throughput benchmarks from arewefastyet as suggested by Luke. The results are as follows:

I. Without MinimumAsmJSFrameBytes. Without TailCalleeSpace.

box2d.js
17.48 seconds
11,124,470,819 instructions

bullet.js
17.47 seconds
15,582,041,757 instructions

lua_binarytrees.js
11.67 seconds
10,296,515,797 instructions

zlib.js
15.97 seconds
10,845,512,490 instructions



II. With MinimumAsmJSFrameBytes. Without TailCalleeSpace.

box2d.js
17.31 seconds
11,139,985,367 instructions

bullet.js
17.35 seconds
15,633,331,278 instructions

lua_binarytrees.js
11.55 seconds
10,305,304,795 instructions

zlib.js
15.75 seconds
10,856,213,890 instructions



III. With MinimumAsmJSFrameBytes. With TailCalleeSpace.

box2d.js
17.47 seconds
11,131,740,166 instructions

bullet.js
17.30 seconds
15,603,779,672 instructions

lua_binarytrees.js
11.59 seconds
10,355,138,300 instructions

zlib.js
16.10 seconds
10,845,592,942 instructions



Here, I. tests the generated code without any adjustments to make tail calls faster (so this corresponds to 3) in my above list of choices).

II. tests the generated code with a minimum frame size so that most tail calls are just a jump but without compactifying the stack at non-tail calls (so leaving 0-56 bytes of waste for each stack frame level). So this is 2) in my above list.

III. tests the code I have been proposing, namely having a minimum frame size and compactifying the stack for non-tail calls.

Evidently, one has to take all the numbers above with a grain of salt. II. appears slightly faster than I. although it executes a few more instructions (some sp adjustments). III. is, as expected, slightly slower than I. (and II.), but all performance differences are in the 2% range, and the error range seems to be about 1% judging from the figures above.

Interpreting these numbers, I'd suggest we are safe with variant II. (option 2) in my text above) if something like 0 to 56 bytes of extra waste on the stack does not hurt.
The benchmarks in comment #53 only measure the performance of code without PTCs running under OdinMonkey with PTCs enabled. To measure the performance gain when code uses PTCs, I implemented a flag "-s PTC=1" to "emcc" of Emscripten so that in conjunction with "-s EMTERPRETIFY=1", the Emscripten compiler suite outputs an emterpreted version that uses tail calls and dynamic jumps instead of a huge switch statement in its bytecode interpreter.

You can find the code here: https://github.com/mnieper/emscripten/tree/emterpreter-ptc. It is based on the incoming branch of the upstream repository.

This allowed me to run a number of further tests:

mnieper@macaulay:~/emsdk-portable/emscripten/incoming/tests$ ./runner.py benchmark
LLVM_3_2 not defined, using our LLVM instead (/home/mnieper/emsdk-portable/clang/fastcomp/build_incoming_64/bin)
Running Emscripten benchmarks... [ Wed Mar  4 10:41:19 2015 | em: commit 74661c91817752439a8d2cf91d478eda7331a3fe | llvm: /home/mnieper/emsdk-portable/clang/fastcomp/build_incoming_64/bin ]
test_conditionals (test_benchmark.benchmark) ... 
       sm-f32: mean: 5.681 (+-0.212) secs  median: 5.617  range: 5.437-6.167  (noise: 3.731%)  (15 runs)
    sm-emterp: mean: 64.207 (+-1.356) secs  median: 63.825  range: 62.873-68.502  (noise: 2.112%)  (15 runs)   Relative: 11.30 X slower
   sm-emterp-ptc: mean: 62.366 (+-1.204) secs  median: 62.098  range: 60.486-64.597  (noise: 1.931%)  (15 runs)   Relative: 10.98 X slower
ok
test_copy (test_benchmark.benchmark) ... 
       sm-f32: mean: 7.123 (+-0.427) secs  median: 6.836  range: 6.678-7.699  (noise: 5.991%)  (3 runs)
    sm-emterp: mean: 87.784 (+-0.486) secs  median: 87.448  range: 87.323-88.456  (noise: 0.554%)  (3 runs)   Relative: 12.32 X slower
   sm-emterp-ptc: mean: 84.747 (+-0.853) secs  median: 84.212  range: 83.729-85.817  (noise: 1.007%)  (3 runs)   Relative: 11.90 X slower
ok
test_corrections (test_benchmark.benchmark) ... 
       sm-f32: mean: 11.437 (+-0.278) secs  median: 11.244  range: 11.174-11.822  (noise: 2.432%)  (3 runs)
    sm-emterp: mean: 93.304 (+-0.640) secs  median: 92.950  range: 92.462-94.012  (noise: 0.686%)  (3 runs)   Relative: 8.16 X slower
   sm-emterp-ptc: mean: 88.834 (+-1.853) secs  median: 87.738  range: 86.494-91.026  (noise: 2.086%)  (3 runs)   Relative: 7.77 X slower
ok
test_fannkuch (test_benchmark.benchmark) ... 
       sm-f32: mean: 5.831 (+-0.065) secs  median: 5.791  range: 5.753-5.911  (noise: 1.110%)  (3 runs)
    sm-emterp: mean: 90.942 (+-0.274) secs  median: 90.757  range: 90.655-91.312  (noise: 0.302%)  (3 runs)   Relative: 15.60 X slower
   sm-emterp-ptc: mean: 64.138 (+-1.308) secs  median: 63.259  range: 62.758-65.896  (noise: 2.040%)  (3 runs)   Relative: 11.00 X slower
ok
test_fasta_double (test_benchmark.benchmark) ... ok
test_fasta_double_full (test_benchmark.benchmark) ... ok
test_fasta_float (test_benchmark.benchmark) ... 
       sm-f32: mean: 19.645 (+-0.093) secs  median: 19.587  range: 19.533-19.761  (noise: 0.475%)  (3 runs)
    sm-emterp: mean: 174.973 (+-4.345) secs  median: 172.810  range: 169.030-179.299  (noise: 2.483%)  (3 runs)   Relative: 8.91 X slower
   sm-emterp-ptc: mean: 114.986 (+-0.399) secs  median: 114.749  range: 114.483-115.459  (noise: 0.347%)  (3 runs)   Relative: 5.85 X slower
ok
test_ifs (test_benchmark.benchmark) ... 
       sm-f32: mean: 2.273 (+-0.065) secs  median: 2.261  range: 2.188-2.423  (noise: 2.841%)  (15 runs)
    sm-emterp: mean: 46.910 (+-1.326) secs  median: 46.428  range: 45.479-50.623  (noise: 2.826%)  (15 runs)   Relative: 20.64 X slower
   sm-emterp-ptc: mean: 43.280 (+-1.046) secs  median: 42.713  range: 42.240-45.486  (noise: 2.418%)  (15 runs)   Relative: 19.04 X slower
ok
test_life (test_benchmark.benchmark) ... ok
test_linpack_double (test_benchmark.benchmark) ... ok
test_linpack_float (test_benchmark.benchmark) ... 
       sm-f32: mean: 0.130 (+-0.001) secs  median: 0.129  range: 0.129-0.131  (noise: 0.701%)  (3 runs)
    sm-emterp: mean: 3.819 (+-0.061) secs  median: 3.776  range: 3.775-3.905  (noise: 1.593%)  (3 runs)   Relative: 29.44 X slower
   sm-emterp-ptc: mean: 3.674 (+-0.004) secs  median: 3.672  range: 3.668-3.679  (noise: 0.121%)  (3 runs)   Relative: 28.32 X slower
ok
test_memops (test_benchmark.benchmark) ... 
       sm-f32: mean: 20.045 (+-0.033) secs  median: 20.024  range: 20.006-20.087  (noise: 0.166%)  (3 runs)
    sm-emterp: mean: 241.931 (+-0.929) secs  median: 241.283  range: 241.098-243.228  (noise: 0.384%)  (3 runs)   Relative: 12.07 X slower
   sm-emterp-ptc: mean: 224.046 (+-4.162) secs  median: 222.443  range: 218.169-227.253  (noise: 1.858%)  (3 runs)   Relative: 11.18 X slower
ok
test_primes (test_benchmark.benchmark) ... 
       sm-f32: mean: 11.219 (+-0.060) secs  median: 11.182  range: 11.148-11.294  (noise: 0.532%)  (3 runs)
    sm-emterp: mean: 62.426 (+-0.048) secs  median: 62.406  range: 62.359-62.466  (noise: 0.076%)  (3 runs)   Relative: 5.56 X slower
   sm-emterp-ptc: mean: 59.251 (+-0.528) secs  median: 58.887  range: 58.748-59.981  (noise: 0.891%)  (3 runs)   Relative: 5.28 X slower
ok
test_skinning (test_benchmark.benchmark) ... 
       sm-f32: mean: 12.694 (+-0.042) secs  median: 12.666  range: 12.651-12.752  (noise: 0.333%)  (3 runs)
    sm-emterp: mean: 444.586 (+-2.220) secs  median: 443.046  range: 442.513-447.665  (noise: 0.499%)  (3 runs)   Relative: 35.02 X slower
   sm-emterp-ptc: mean: 264.433 (+-0.351) secs  median: 264.185  range: 264.175-264.929  (noise: 0.133%)  (3 runs)   Relative: 20.83 X slower
ok
test_zzz_box2d (test_benchmark.benchmark) ... 
<building and saving box2d_O2_bd82d6778c6d15b8746c1a32290da1a0js__home_mnieper_emsdk_portable_clang_fastcomp_build_incoming_64_bin__ into cache> 
       sm-f32: mean: 12.980 (+-0.177) secs  median: 12.855  range: 12.845-13.230  (noise: 1.362%)  (3 runs)
<load box2d_O2_bd82d6778c6d15b8746c1a32290da1a0js__home_mnieper_emsdk_portable_clang_fastcomp_build_incoming_64_bin__ from cache> 
    sm-emterp: mean: 180.739 (+-1.074) secs  median: 179.980  range: 179.910-182.256  (noise: 0.594%)  (3 runs)   Relative: 13.92 X slower
<load box2d_O2_bd82d6778c6d15b8746c1a32290da1a0js__home_mnieper_emsdk_portable_clang_fastcomp_build_incoming_64_bin__ from cache> 
   sm-emterp-ptc: mean: 163.074 (+-0.814) secs  median: 162.502  range: 162.396-164.218  (noise: 0.499%)  (3 runs)   Relative: 12.56 X slower
ok
test_zzz_bullet (test_benchmark.benchmark) ... 
<building and saving bullet_O2_bd82d6778c6d15b8746c1a32290da1a0js__home_mnieper_emsdk_portable_clang_fastcomp_build_incoming_64_bin__ into cache> 
       sm-f32: mean: 14.419 (+-0.066) secs  median: 14.393  range: 14.325-14.470  (noise: 0.459%)  (3 runs)
<load bullet_O2_bd82d6778c6d15b8746c1a32290da1a0js__home_mnieper_emsdk_portable_clang_fastcomp_build_incoming_64_bin__ from cache> 
    sm-emterp: mean: 228.772 (+-2.096) secs  median: 227.350  range: 226.625-231.615  (noise: 0.916%)  (3 runs)   Relative: 15.87 X slower
<load bullet_O2_bd82d6778c6d15b8746c1a32290da1a0js__home_mnieper_emsdk_portable_clang_fastcomp_build_incoming_64_bin__ from cache> 
   sm-emterp-ptc: mean: 213.220 (+-1.580) secs  median: 212.265  range: 211.262-215.131  (noise: 0.741%)  (3 runs)   Relative: 14.79 X slower
ok
test_zzz_java_nbody (test_benchmark.benchmark) ... ok
test_zzz_lua_binarytrees (test_benchmark.benchmark) ... 
<building and saving lua_O2_bd82d6778c6d15b8746c1a32290da1a0js__home_mnieper_emsdk_portable_clang_fastcomp_build_incoming_64_bin__ into cache> 
       sm-f32: mean: 9.524 (+-0.103) secs  median: 9.483  range: 9.378-9.605  (noise: 1.083%)  (3 runs)
<load lua_O2_bd82d6778c6d15b8746c1a32290da1a0js__home_mnieper_emsdk_portable_clang_fastcomp_build_incoming_64_bin__ from cache> 
    sm-emterp: mean: 128.352 (+-2.240) secs  median: 127.388  range: 125.211-130.278  (noise: 1.745%)  (3 runs)   Relative: 13.48 X slower
<load lua_O2_bd82d6778c6d15b8746c1a32290da1a0js__home_mnieper_emsdk_portable_clang_fastcomp_build_incoming_64_bin__ from cache> 
   sm-emterp-ptc: mean: 123.599 (+-3.775) secs  median: 121.092  range: 119.502-128.612  (noise: 3.054%)  (3 runs)   Relative: 12.98 X slower
ok
test_zzz_lua_scimark (test_benchmark.benchmark) ... 
<load lua_O2_bd82d6778c6d15b8746c1a32290da1a0js__home_mnieper_emsdk_portable_clang_fastcomp_build_incoming_64_bin__ from cache> 
       sm-f32: mean: 10.262 (+-0.133) secs  median: 10.168  range: 10.152-10.449  (noise: 1.299%)  (3 runs)
<load lua_O2_bd82d6778c6d15b8746c1a32290da1a0js__home_mnieper_emsdk_portable_clang_fastcomp_build_incoming_64_bin__ from cache> 
    sm-emterp: mean: 185.228 (+-2.801) secs  median: 183.502  range: 181.818-188.679  (noise: 1.512%)  (3 runs)   Relative: 18.05 X slower
<load lua_O2_bd82d6778c6d15b8746c1a32290da1a0js__home_mnieper_emsdk_portable_clang_fastcomp_build_incoming_64_bin__ from cache> 
   sm-emterp-ptc: mean: 183.101 (+-5.603) secs  median: 180.312  range: 175.439-188.679  (noise: 3.060%)  (3 runs)   Relative: 17.84 X slower
ok
test_zzz_zlib (test_benchmark.benchmark) ... 
<building and saving zlib_O2_bd82d6778c6d15b8746c1a32290da1a0js__home_mnieper_emsdk_portable_clang_fastcomp_build_incoming_64_bin__ into cache> 
       sm-f32: mean: 14.507 (+-0.123) secs  median: 14.450  range: 14.336-14.621  (noise: 0.848%)  (3 runs)
<load zlib_O2_bd82d6778c6d15b8746c1a32290da1a0js__home_mnieper_emsdk_portable_clang_fastcomp_build_incoming_64_bin__ from cache> 
    sm-emterp: mean: 216.764 (+-0.796) secs  median: 216.455  range: 215.639-217.381  (noise: 0.367%)  (3 runs)   Relative: 14.94 X slower
<load zlib_O2_bd82d6778c6d15b8746c1a32290da1a0js__home_mnieper_emsdk_portable_clang_fastcomp_build_incoming_64_bin__ from cache> 
   sm-emterp-ptc: mean: 195.907 (+-0.588) secs  median: 195.620  range: 195.099-196.482  (noise: 0.300%)  (3 runs)   Relative: 13.50 X slower
ok

----------------------------------------------------------------------
Ran 20 tests in 14980.009s

OK

These tests show that the speed gain for a bytecode interpreter like emterpreter is between 3% and 40%, depending on the problem, when it is allowed to do tail calls. I haven't yet checked the machine code IonMonkey is generating for each of the procedures of the bytecode interpreter, so there still might be room for further optimizations (e.g. in the register allocator).
(Sorry for the delay; back from GDC-related business.)
These are fantastic results, esp for the Emterpreter, and they really illustrate the point of "Lambda: the ultimate GOTO" ;)

So I was thinking more about Dan's point above that one of the few remaining advantages of 'goto' over PTCs is that the compiler backend can do all sorts of LICM/GVN which move ops all over the CFG.  Now, there's no reason the compiler *couldn't* inline, but (1) that goes against the grain of AOT/single-pass, and (2) if it's based on normal inlining heuristics in the JIT, you won't get the actual goto-CFG, just a bunch of partial inlinings that could generate factors more code.

But what about this: what if mutually-recursive PTC SCCs were necessarily adjacent in the asm.js module (and all at the beginning, before all normal non-PTCable functions)?  Then a reasonable AOT could just parse all the PTCable functions in the SCC into the same MIRGraph.  (Or, if it didn't want to, or it was hard or the MIRGraph got too big, fall back to simple functions.)

As a side benefit, we could reason about a PTC SCC as a single unit (giving the AOT strictly more info than (5) (b/c not just args but also Max(framePushed) matters)) and use trivial goto (w/ no penalty for large frames or the wasting-frame-size vs. inc/dec at normal calls tradeoff).  Also, there are no weird "declarations" or "hints", it's just an ordering requirement.

Thoughts?
Ok, so Dan pointed out an obvious hole in my scheme: w/o having seen the whole module, every PTCable function is a possible entry point, so you don't have any dominance and nothing can be particularly optimized.  You could imagine doing a first-pass compilation that codegens each function separately and then, after finishing the whole asm.js module and observing only one entry function, trigger background compilations that inlines (or perhaps based on profiling, etc).
Comment on attachment 8568151 [details] [diff] [review]
changes.txt

Great work so far, but I'll clear reviews pending replies to above comment.
Attachment #8568151 - Flags: feedback?(luke) → feedback+
Attachment #8568564 - Flags: review?(luke)
For a small bytecode interpreter, an optimizing AOT compiler could (and probably should) inline all mutually tail-recursive functions into one big function. For asm.js code that is the output of, say, a Scheme compiler, things are different because nearly every function in the module would be part of the tail-call graph, and we don't want the whole module to be compiled into just one MIRGraph/function.

Currently, I don't see a future-proof way to tell an asm.js AOT-compiler how to do such optimizations while still being single-pass. Thus I propose to postpone the following to future work:

1. Implement the infrastructure for a second pass in OdinMonkey that is run when the processor is idle, and hot-swap the code generated by the first pass once the optimized code of the second pass is ready. (This has nothing to do with tail calls in particular but allows for many kinds of whole-program optimizations, e.g. reordering of function arguments, etc.)

2. Experiment with non-native calling conventions in asm.js-generated code and use the best one (there is no reason that we have to use the System V ABI on x64, for example). (This could help to accelerate non-PTC code. For PTC code it offers possibilities like having separate return and data stacks so that we don't need any stack adjustments; even for non-PTC code, this would allow leaf functions without any stack pointer adjustments.)

--

That said, I would add code to utilize the red zone of the stack where available (e.g. Linux) before pushing my patches above to the trunk. Utilizing the red zone means fewer stack adjustments (even for code w/o any PTCs). Unfortunately, Windows does not have a red zone, so this optimization doesn't work on all major platforms.
(In reply to Marc Nieper-Wißkirchen from comment #58)
> Currently, I don't see a future-proof way to tell an asm.js AOT-compiler how
> to do such optimizations while still being single-pass.

So right now we have a loop:

  while (...) {
    lir = ParseValidateOptimizeAndLower();
    Codegen(lir);
  }

I was thinking this could change to (roughly):

  while (...) {
    vector<LIRGraph*> scc;
    max_frame_depth = 0;
    while (in same SCC) {
      lir = ParseValidateOptimizeAndLower();
      max_frame_depth = max(max_frame_depth, lir->frameDepth());
      scc.push_back(lir);
    }
    for (lir in scc) {
      lir->setFrameDepth(max_frame_depth);
      Codegen(lir);
    }
  }

With this, we could know the exact arg/frame size of the whole SCC before final codegen.  This would fail if we had hundreds of MB of MIR/LIR alive at once, but I think we can punt on that problem for now if we're mostly dealing with goto and normal-sized Scheme programs.  For reference, in the early days, we parsed the entire asm.js module into memory and that worked on 32-bit for a multi-million line C++ codebase (though it wasn't ideal).
After an IRC chat, it does sound reasonable to start with the fully general strategy outlined above (in particular, not compacting on normal calls so that, when the frame size is below the limit, normal calls/prologues are not affected) and then treat SCC-based compilation as a possible later optimization to experiment with.

Given a pretty reasonable impl strategy, I think the next step is to discuss in a new issue on discourse.specifiction.org/c/asm-js to let others comment.
Latest patch based on the latest upstream version from mozilla-central. Gives each function a minimum frame size (currently 64 bytes, hardcoded in js/src/jit/shared/Assembler-shared.h) so that tail calls between functions not needing a larger frame size can become a simple jump.
Attachment #8568151 - Attachment is obsolete: true
Attachment #8568564 - Attachment is obsolete: true
Attachment #8568564 - Flags: review?(sunfish)
Attachment #8581856 - Flags: feedback?(luke)
Change the minimum frame size to 96 bytes to account for 99% of the functions in a typical asm.js module (measured with http://beta.unity3d.com/jonas/DT2/).
Attachment #8581856 - Attachment is obsolete: true
Attachment #8581856 - Flags: feedback?(luke)
Attachment #8582244 - Flags: feedback?(luke)
Comment on attachment 8582244 [details] [diff] [review]
Patch to implement PTCs in asm.js

Review of attachment 8582244 [details] [diff] [review]:
-----------------------------------------------------------------

I just did a quick read and, though I'd need to read and understand more carefully before landing (particularly the regalloc stuff), this looks generally good.  Let's see what others say in http://discourse.specifiction.org/t/request-for-comments-add-a-restricted-subset-of-proper-tail-calls-to-asm-js/787.

::: js/src/jit/MIR.h
@@ +1153,5 @@
>      }
>  };
>  
> +template <class T = MInstruction>
> +class MVariadicInstruction : public T

If you split out just this change into a separate patch we can get that reviewed and landed ahead of the rest of the patch.
Attachment #8582244 - Flags: feedback?(luke) → feedback+
Attached patch MVariadicTemplate (obsolete) — Splinter Review
Attachment #8583741 - Flags: review?(luke)
The general changes to MIR.h could be as simple as in the preceding patch. (The simplification with respect to the previous approach is that the existing code can otherwise remain untouched. The PTC-enabled MAsmJSCall can then derive from MVariadicTemplate<MControlInstruction>.)
Comment on attachment 8583741 [details] [diff] [review]
MVariadicTemplate

Review of attachment 8583741 [details] [diff] [review]:
-----------------------------------------------------------------

Ah, nice change; this makes the patch decidedly smaller, but probably still worth landing separately so you can be familiar with the process.  Could you follow these steps so that we can simply set checkin-needed in the keywords?  https://developer.mozilla.org/en-US/docs/Mercurial_FAQ#How_can_I_generate_a_patch_for_somebody_else_to_check-in_for_me.3F

::: js/src/jit/MIR.h
@@ +1153,5 @@
>      }
>  };
>  
> +template <class T>
> +class MVariadicTemplate : public T

How about MVariadicT?
Attachment #8583741 - Flags: review?(luke) → review+
Attached patch mvariadict.patch (obsolete) — Splinter Review
Rename MVariadicTemplate to MVariadicT. Follow https://developer.mozilla.org/en-US/docs/Mercurial_FAQ#How_can_I_generate_a_patch_for_somebody_else_to_check-in_for_me.3F.
Attachment #8583741 - Attachment is obsolete: true
Attachment #8584335 - Flags: checkin?(luke)
Attachment #8584335 - Flags: checkin?(luke) → review+
Keywords: checkin-needed
Whiteboard: [leave open]
Oh, btw, for future patches, you should add
  [diff]
  git = 1
  showfunc = 1
  unified = 8
to your .hgrc so you get 8 lines of context in your patches (currently 0).
Attachment #8584335 - Flags: checkin+
Assignee: nobody → marc.nieper
Implement the missing parts on top of the checked-in mvariadic.patch. Follow the guidelines (including unified=8).
Attachment #8582244 - Attachment is obsolete: true
Attachment #8584335 - Attachment is obsolete: true
http://discourse.specifiction.org/t/request-for-comments-add-a-restricted-subset-of-proper-tail-calls-to-asm-js/787 has been lying dormant for quite a while. There haven't been any negative comments.

Thus I propose to check-in the remaining parts of the patch to enable the restricted set of proper tail calls as discussed above in asm.js code.
I know my last comment on that issue was in favor of landing, but that was right before WebAssembly kicked off into high gear.  The experimentation work here has been extremely valuable, and has contributed to proposed post-v.1 extensions to WebAssembly:
  https://github.com/WebAssembly/design/blob/master/FutureFeatures.md#signature-restricted-proper-tail-calls
  https://github.com/WebAssembly/design/blob/master/FutureFeatures.md#more-expressive-control-flow
However, I'd like to hold off on adding asm.js features at this time to focus on WebAssembly v.1.  Definitely we'll want to resume this conversation shortly thereafter, though.
Blocks: es6
Nice to have at some point, but I don't think this should block ES6; tail call optimizations in WebAssembly will likely be different from the ones in JS, and not added to asm.js anyways. It's a future trail to explore for wasm, though.
Blocks: wasm
No longer blocks: es6
Component: JavaScript Engine → JavaScript Engine: JIT
Priority: -- → P3
This bug actually involved some tweaks to asm.js.  asm.js is basically frozen at this point, and wasm will very likely get a proper tail-call operator, so WONTFIXing this particular bug.  I really appreciate all the work that went into this, though, and we'll definitely be referring back to it when we implement tail-call in wasm.
Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → WONTFIX
Whiteboard: [leave open]