Bug 1573179 Comment 3 Edit History

Note: The actual edited comment in the bug view page will always show the original commenter’s name and original timestamp.

Right, you would need to have some kind of dynamic check at each `return_call` / `return_call_indirect` site to know whether to push the restore-the-WasmTlsReg frame, or whether it was already pushed.  It is necessary to have such a frame pushed before the first `return_call` / `return_call_indirect` that changes the current WasmTlsReg, but it would be sufficient to ensure the frame is pushed before any `return_call` / `return_call_indirect`.

As far as how to know when you need such a frame, I see two options.  One, examine the caller frame directly, e.g. compare the saved return address to the well-known value.  Two, store the stack pointer of the most recent preserve-`WasmTlsReg`-frame in some thread-local state.  The trampoline would preserve not only the WasmTlsReg but also the previous value of the saved stack pointer.

I think it would be easiest to compare the return address; it could be immediate or loaded from the WasmTlsReg.

Just to spitball a potential implementation:

```c++
static void *RestoreWasmTlsRegStubAddr;
static void *RestoreWasmTlsRegReturnAddr;

// "Caller" should jump directly in, passing continuation as ABINonArgReturnVolatileReg.
void emitRestoreWasmTlsRegStub() {
  // RestoreWasmTlsRegStubAddr is here
  masm.call(ABINonArgReturnVolatileReg);

  // RestoreWasmTlsRegReturnAddr is here

#if defined(JS_CODEGEN_ARM64)
  masm.pop(WasmTlsReg, returnAddress);
  masm.abiret();
#else
  masm.Pop(WasmTlsReg);
  masm.ret();
#endif
}
```

In the general case there may be stack args, so the interstitial frame would need to be before the stack args.  The simplest would be to shuffle after preparing the frame and stack args, at each `return_call[_indirect]` site:

```c++
void emitTailCallPrologue() {
  Label done;
  CodeLabel afterStubCall;
  unsigned raOffset = masm.framePushed() + offsetof(Frame, returnAddress);
  masm.branchPtr(
      Condition::eq, Address(masm.getStackPointer(), raOffset),
      // FIXME: probably won't fit in an imm32, could load from WasmTlsReg instead
      Imm32(RestoreWasmTlsRegReturnAddr), &done);
  
  masm.mov(&afterStubCall, ABINonArgReturnVolatileReg);
  masm.jump(RestoreWasmTlsRegStubAddr);
  masm.bind(&afterStubCall);
  masm.addCodeLabel(&afterStubCall);

  // FIXME: Here, shuffle up the the masm.framePushed() + sizeof(Frame)
  // bytes of the current activation by two words.  Set the first word of the
  // callee-owned portion of the frame to the WasmTlsReg, and the second
  // to the returnAddress currently stored in the Frame.  Set the returnAddress
  // of the shuffled frame to RestoreWasmTlsRegReturnAddr.

  masm.bind(&done);
}
```

For a tail call from a function without stack args to a function without stack args, we could do a bit better: we'd still have to shuffle the Frame but not the masm.framePushed() bytes.

WDYT?
Right, you would need to have some kind of dynamic check at each `return_call` / `return_call_indirect` site to know whether to push the restore-the-WasmTlsReg frame, or whether it was already pushed.  It is necessary to have such a frame pushed before the first `return_call` / `return_call_indirect` that changes the current WasmTlsReg, but it would be sufficient to ensure the frame is pushed before any `return_call` / `return_call_indirect`.

As far as how to know when you need such a frame, I see two options.  One, examine the caller frame directly, e.g. compare the saved return address to the well-known value.  Two, store the stack pointer of the most recent preserve-`WasmTlsReg`-frame in some thread-local state.  The trampoline would preserve not only the WasmTlsReg but also the previous value of the saved stack pointer.

I think it would be easiest to compare the return address; it could be immediate or loaded from the WasmTlsReg.

Just to spitball a potential implementation:

```c++
static void *RestoreWasmTlsRegStubAddr;
static void *RestoreWasmTlsRegReturnAddr;

// "Caller" should jump directly in, passing continuation as ABINonArgReturnVolatileReg.
void emitRestoreWasmTlsRegStub() {
  // RestoreWasmTlsRegStubAddr is here
  masm.call(ABINonArgReturnVolatileReg);

  // RestoreWasmTlsRegReturnAddr is here

#if defined(JS_CODEGEN_ARM64)
  masm.pop(WasmTlsReg, lr);
  masm.abiret();
#else
  masm.Pop(WasmTlsReg);
  masm.ret();
#endif
}
```

In the general case there may be stack args, so the interstitial frame would need to be before the stack args.  The simplest would be to shuffle after preparing the frame and stack args, at each `return_call[_indirect]` site:

```c++
void emitTailCallPrologue() {
  Label done;
  CodeLabel afterStubCall;
  unsigned raOffset = masm.framePushed() + offsetof(Frame, returnAddress);
  masm.branchPtr(
      Condition::eq, Address(masm.getStackPointer(), raOffset),
      // FIXME: probably won't fit in an imm32, could load from WasmTlsReg instead
      Imm32(RestoreWasmTlsRegReturnAddr), &done);
  
  masm.mov(&afterStubCall, ABINonArgReturnVolatileReg);
  masm.jump(RestoreWasmTlsRegStubAddr);
  masm.bind(&afterStubCall);
  masm.addCodeLabel(&afterStubCall);

  // FIXME: Here, shuffle up the the masm.framePushed() + sizeof(Frame)
  // bytes of the current activation by two words.  Set the first word of the
  // callee-owned portion of the frame to the WasmTlsReg, and the second
  // to the returnAddress currently stored in the Frame.  Set the returnAddress
  // of the shuffled frame to RestoreWasmTlsRegReturnAddr.

  masm.bind(&done);
}
```

For a tail call from a function without stack args to a function without stack args, we could do a bit better: we'd still have to shuffle the Frame but not the masm.framePushed() bytes.

WDYT?
Right, you would need to have some kind of dynamic check at each `return_call` / `return_call_indirect` site to know whether to push the restore-the-WasmTlsReg frame, or whether it was already pushed.  It is necessary to have such a frame pushed before the first `return_call` / `return_call_indirect` that changes the current WasmTlsReg, but it would be sufficient to ensure the frame is pushed before any `return_call` / `return_call_indirect`.

As far as how to know when you need such a frame, I see two options.  One, examine the caller frame directly, e.g. compare the saved return address to the well-known value.  Two, store the stack pointer of the most recent preserve-`WasmTlsReg`-frame in some thread-local state.  The trampoline would preserve not only the WasmTlsReg but also the previous value of the saved stack pointer.

I think it would be easiest to compare the return address; it could be immediate or loaded from the WasmTlsReg.

Just to spitball a potential implementation:

```c++
static void *RestoreWasmTlsRegStubAddr;
static void *RestoreWasmTlsRegReturnAddr;

// "Caller" should jump directly in, passing continuation as ABINonArgReturnVolatileReg.
void emitRestoreWasmTlsRegStub() {
  // RestoreWasmTlsRegStubAddr is here
  masm.call(ABINonArgReturnVolatileReg);

  // RestoreWasmTlsRegReturnAddr is here

#if defined(JS_CODEGEN_ARM64)
  masm.pop(WasmTlsReg, lr);
  masm.abiret();
#else
  masm.Pop(WasmTlsReg);
  masm.ret();
#endif
}
```

In the general case there may be stack args, so the interstitial frame would need to be before the stack args.  The simplest would be to shuffle after preparing the frame and stack args, at each `return_call[_indirect]` site:

```c++
void emitTailCallPrologue() {
  Label done;
  CodeLabel afterStubCall;
  unsigned raOffset = masm.framePushed() + offsetof(Frame, returnAddress);
  masm.branchPtr(
      Condition::eq, Address(masm.getStackPointer(), raOffset),
      // FIXME: probably won't fit in an imm32, could load from WasmTlsReg instead
      Imm32(RestoreWasmTlsRegReturnAddr), &done);
  
  masm.mov(&afterStubCall, ABINonArgReturnVolatileReg);
  masm.jump(RestoreWasmTlsRegStubAddr);
  masm.bind(&afterStubCall);
  masm.addCodeLabel(&afterStubCall);

  // FIXME: Here, shuffle up the the masm.framePushed() + sizeof(Frame)
  // bytes of the current activation by two words.  Set the first word of the
  // callee-owned portion of the frame to the WasmTlsReg, and the second
  // to the returnAddress currently stored in the Frame.  Set the returnAddress
  // of the shuffled frame to RestoreWasmTlsRegReturnAddr.

  masm.bind(&done);
}
```

For a tail call from a function without stack args to a function without stack args, we could do a bit better: we'd still have to shuffle the Frame but not the masm.framePushed() bytes.

WDYT?

Edit: A couple of considerations.  Shuffling the frame allows for the code generator to still know how large the frame is; the shuffle in effect just inserts an additional frame that the current function doesn't have to worry about.  Probably suboptimal but if we have to shuffle the return address anyway because the caller and callee don't have the same number of stack args, we're losing anyway[*], so it's not so bad.

The other consideration is that of course we could build the frame without jumping through the stub, but then we'd destroy the return address predictor.  My understanding is that it's fastest to trampoline.

[*] We could change the calling convention to have the return address and the Frame pushed before the stack args.  For everything but x86, this would work fine.  For x86, you'd have to shuffle the return address into place at function entry, for all non-tail calls, which is probably enough of a lose to not consider it, but I mention it for completeness.  Guile does this, FWIW, but I still don't know if it's a good idea.
Right, you would need to have some kind of dynamic check at each `return_call` / `return_call_indirect` site to know whether to push the restore-the-WasmTlsReg frame, or whether it was already pushed.  It is necessary to have such a frame pushed before the first `return_call` / `return_call_indirect` that changes the current WasmTlsReg, but it would be sufficient to ensure the frame is pushed before any `return_call` / `return_call_indirect`.

As far as how to know when you need such a frame, I see two options.  One, examine the caller frame directly, e.g. compare the saved return address to the well-known value.  Two, store the stack pointer of the most recent preserve-`WasmTlsReg`-frame in some thread-local state.  The trampoline would preserve not only the WasmTlsReg but also the previous value of the saved stack pointer.

I think it would be easiest to compare the return address; it could be immediate or loaded from the WasmTlsReg.

Just to spitball a potential implementation:

```c++
static void *RestoreWasmTlsRegStubAddr;
static void *RestoreWasmTlsRegReturnAddr;

// "Caller" should jump directly in, passing continuation as ABINonArgReturnVolatileReg.
void emitRestoreWasmTlsRegStub() {
  // RestoreWasmTlsRegStubAddr is here
  masm.call(ABINonArgReturnVolatileReg);

  // RestoreWasmTlsRegReturnAddr is here

#if defined(JS_CODEGEN_ARM64)
  masm.pop(WasmTlsReg, lr);
  masm.abiret();
#else
  masm.Pop(WasmTlsReg);
  masm.ret();
#endif
}
```

In the general case there may be stack args, so the interstitial frame would need to be before the stack args.  The simplest would be to shuffle after preparing the frame and stack args, at each `return_call[_indirect]` site:

```c++
void emitTailCallPrologue() {
  Label done;
  CodeLabel afterStubCall;
  unsigned raOffset = masm.framePushed() + offsetof(Frame, returnAddress);
  masm.branchPtr(
      Condition::eq, Address(masm.getStackPointer(), raOffset),
      // FIXME: probably won't fit in an imm32, could load from WasmTlsReg instead
      Imm32(RestoreWasmTlsRegReturnAddr), &done);
  
  masm.mov(&afterStubCall, ABINonArgReturnVolatileReg);
  masm.jump(RestoreWasmTlsRegStubAddr);
  masm.bind(&afterStubCall);
  masm.addCodeLabel(&afterStubCall);

  // FIXME: Here, shuffle up the the masm.framePushed() + sizeof(Frame)
  // bytes of the current activation by two words.  Set the first word of the
  // callee-owned portion of the frame to the WasmTlsReg, and the second
  // to the returnAddress currently stored in the Frame.  Set the returnAddress
  // of the shuffled frame to RestoreWasmTlsRegReturnAddr.

  masm.bind(&done);
}
```

For a tail call from a function without stack args to a function without stack args, we could do a bit better: we'd still have to shuffle the Frame but not the masm.framePushed() bytes.

WDYT?

Edit: A couple of considerations.  Shuffling the frame allows for the code generator to still know how large the frame is; the shuffle in effect just inserts an additional frame that the current function doesn't have to worry about.  Probably suboptimal but if we have to shuffle the return address anyway because the caller and callee don't have the same number of stack args, we're losing anyway[*], so it's not so bad.

The other consideration is that of course we could build the frame without jumping through the stub, but then we'd destroy the return address predictor.  My understanding is that it's fastest to trampoline.

[*] We could change the calling convention to have the return address and the Frame pushed before the stack args.  For everything but x86, this would work fine.  For x86, you'd have to shuffle the return address into place at function entry, for all non-tail calls, which is probably enough of a lose to not consider it, but I mention it for completeness.  Guile does this, FWIW, but I still don't know if it's a good idea.  But this consideration is tangential to the WasmTlsReg issue, since the WasmTlsReg frame probably wouldn't be the same size (?).
Right, you would need to have some kind of dynamic check at each `return_call` / `return_call_indirect` site to know whether to push the restore-the-WasmTlsReg frame, or whether it was already pushed.  It is necessary to have such a frame pushed before the first `return_call` / `return_call_indirect` that changes the current WasmTlsReg, but it would be sufficient to ensure the frame is pushed before any `return_call` / `return_call_indirect`.

As far as how to know when you need such a frame, I see two options.  One, examine the caller frame directly, e.g. compare the saved return address to the well-known value.  Two, store the stack pointer of the most recent preserve-`WasmTlsReg`-frame in some thread-local state.  The trampoline would preserve not only the WasmTlsReg but also the previous value of the saved stack pointer.

I think it would be easiest to compare the return address; it could be immediate or loaded from the WasmTlsReg.

Just to spitball a potential implementation:

```c++
static void *RestoreWasmTlsRegStubAddr;
static void *RestoreWasmTlsRegReturnAddr;

// "Caller" should jump directly in, passing continuation as ABINonArgReturnVolatileReg.
void emitRestoreWasmTlsRegStub() {
  // RestoreWasmTlsRegStubAddr is here
  masm.call(ABINonArgReturnVolatileReg);

  // RestoreWasmTlsRegReturnAddr is here

#if defined(JS_CODEGEN_ARM64)
  masm.pop(WasmTlsReg, lr);
  masm.abiret();
#else
  masm.Pop(WasmTlsReg);
  masm.ret();
#endif
}
```

In the general case there may be stack args, so the interstitial frame would need to be before the stack args.  The simplest would be to shuffle after preparing the frame and stack args, at each `return_call[_indirect]` site:

```c++
void emitTailCallPrologue() {
  Label done;
  CodeLabel afterStubCall;
  unsigned raOffset = masm.framePushed() + offsetof(Frame, returnAddress);
  masm.branchPtr(
      Condition::eq, Address(masm.getStackPointer(), raOffset),
      // FIXME: probably won't fit in an imm32, could load from WasmTlsReg instead
      Imm32(RestoreWasmTlsRegReturnAddr), &done);
  
  masm.mov(&afterStubCall, ABINonArgReturnVolatileReg);
  masm.jump(RestoreWasmTlsRegStubAddr);
  masm.bind(&afterStubCall);
  masm.addCodeLabel(&afterStubCall);

  // FIXME: Here, shuffle up the the masm.framePushed() + sizeof(Frame)
  // bytes of the current activation by two words.  Set the first word of the
  // callee-owned portion of the frame to the WasmTlsReg, and the second
  // to the returnAddress currently stored in the Frame.  Set the returnAddress
  // of the shuffled frame to RestoreWasmTlsRegReturnAddr.

  masm.bind(&done);
}
```

For a tail call from a function without stack args to a function without stack args, we could do a bit better: we'd still have to shuffle the Frame but not the masm.framePushed() bytes.

WDYT?

Edit: A couple of considerations.  Shuffling the frame allows for the code generator to still know how large the frame is; the shuffle in effect just inserts an additional frame that the current function doesn't have to worry about.  Probably suboptimal but if we have to shuffle the return address anyway because the caller and callee don't have the same number of stack args, we're losing anyway[*], so it's not so bad.

The other consideration is that of course we could build the frame without jumping through the stub, but then we'd destroy the return address predictor.  My understanding is that it's fastest to trampoline.

[*] We could change the calling convention to have the return address and the Frame pushed before the stack args.  For everything but x86, this would work fine.  For x86, you'd have to shuffle the return address into place at at all non-tail call sites, which is probably enough of a lose to not consider it, but I mention it for completeness.  Guile does this, FWIW, but I still don't know if it's a good idea.  But this consideration is tangential to the WasmTlsReg issue, since the WasmTlsReg frame probably wouldn't be the same size (?).

Back to Bug 1573179 Comment 3