Closed Bug 1573179 Opened 5 years ago Closed 2 years ago

Change WebAssembly calling convention for pinned WasmTlsReg versus tail calls

Categories

(Core :: JavaScript: WebAssembly, task, P3)

task

Tracking

()

RESOLVED INCOMPLETE

People

(Reporter: wingo, Unassigned)

References

(Blocks 2 open bugs)

Details

Right now, WebAssembly has a kind of pinned register, the WasmTlsReg. When webassembly code needs to access the world, it does so through that pointer. It's expected that callers preserve that register.

For background: direct calls to functions defined in the same module will preserve WasmTlsReg implicitly; they just use it without modifying it. Direct calls to imports may change WasmTlsReg, so the compiler saves and restores this register around the call. Similarly for indirect calls.

However, tail calls invalidate this logic. An intra-module direct call can make an inter-module tail call.

One option to resolve this problem is to effectively remove the optimization for direct calls. It would be a bit sad though.

Another idea would be, the restore-the-WasmTlsReg is really a property of the continuation -- if a tail call is going to change the WasmTlsReg, it should ideally check the current continuation (previous frame) to see if the WasmTlsReg is going to be restored. If there is code that will restore the reg, then it just jumps to the callee. If not, it should add a small frame that will restore the WasmTlsReg, and then call the callee. In that way the cost would be born by the tail callers, not anyone else.

There is another option, which is to add something to the wasm spec to mark functions which themselves may make tail calls. It would be nice to avoid that, though.

Inserting a thunk at, iiuc, the return_call-site is a cool idea! The worry would be that mutual recursion between two instances would pile up thunk frames, but I think that could be avoided by saying that return_call only creates a thunk if there is not already one, since all that matters is that the final return switches back to the original WasmTlsReg value, and that will be ensured by the initial thunk.

Right, you would need to have some kind of dynamic check at each return_call / return_call_indirect site to know whether to push the restore-the-WasmTlsReg frame, or whether it was already pushed. It is necessary to have such a frame pushed before the first return_call / return_call_indirect that changes the current WasmTlsReg, but it would be sufficient to ensure the frame is pushed before any return_call / return_call_indirect.

As far as how to know when you need such a frame, I see two options. One, examine the caller frame directly, e.g. compare the saved return address to the well-known value. Two, store the stack pointer of the most recent preserve-WasmTlsReg-frame in some thread-local state. The trampoline would preserve not only the WasmTlsReg but also the previous value of the saved stack pointer.

I think it would be easiest to compare the return address; it could be immediate or loaded from the WasmTlsReg.

Just to spitball a potential implementation:

static void *RestoreWasmTlsRegStubAddr;
static void *RestoreWasmTlsRegReturnAddr;

// "Caller" should jump directly in, passing continuation as ABINonArgReturnVolatileReg.
void emitRestoreWasmTlsRegStub() {
  // RestoreWasmTlsRegStubAddr is here
  masm.call(ABINonArgReturnVolatileReg);

  // RestoreWasmTlsRegReturnAddr is here

#if defined(JS_CODEGEN_ARM64)
  masm.pop(WasmTlsReg, lr);
  masm.abiret();
#else
  masm.Pop(WasmTlsReg);
  masm.ret();
#endif
}

In the general case there may be stack args, so the interstitial frame would need to be before the stack args. The simplest would be to shuffle after preparing the frame and stack args, at each return_call[_indirect] site:

void emitTailCallPrologue() {
  Label done;
  CodeLabel afterStubCall;
  unsigned raOffset = masm.framePushed() + offsetof(Frame, returnAddress);
  masm.branchPtr(
      Condition::eq, Address(masm.getStackPointer(), raOffset),
      // FIXME: probably won't fit in an imm32, could load from WasmTlsReg instead
      Imm32(RestoreWasmTlsRegReturnAddr), &done);
  
  masm.mov(&afterStubCall, ABINonArgReturnVolatileReg);
  masm.jump(RestoreWasmTlsRegStubAddr);
  masm.bind(&afterStubCall);
  masm.addCodeLabel(&afterStubCall);

  // FIXME: Here, shuffle up the the masm.framePushed() + sizeof(Frame)
  // bytes of the current activation by two words.  Set the first word of the
  // callee-owned portion of the frame to the WasmTlsReg, and the second
  // to the returnAddress currently stored in the Frame.  Set the returnAddress
  // of the shuffled frame to RestoreWasmTlsRegReturnAddr.

  masm.bind(&done);
}

For a tail call from a function without stack args to a function without stack args, we could do a bit better: we'd still have to shuffle the Frame but not the masm.framePushed() bytes.

WDYT?

Edit: A couple of considerations. Shuffling the frame allows for the code generator to still know how large the frame is; the shuffle in effect just inserts an additional frame that the current function doesn't have to worry about. Probably suboptimal but if we have to shuffle the return address anyway because the caller and callee don't have the same number of stack args, we're losing anyway[*], so it's not so bad.

The other consideration is that of course we could build the frame without jumping through the stub, but then we'd destroy the return address predictor. My understanding is that it's fastest to trampoline.

[*] We could change the calling convention to have the return address and the Frame pushed before the stack args. For everything but x86, this would work fine. For x86, you'd have to shuffle the return address into place at at all non-tail call sites, which is probably enough of a lose to not consider it, but I mention it for completeness. Guile does this, FWIW, but I still don't know if it's a good idea. But this consideration is tangential to the WasmTlsReg issue, since the WasmTlsReg frame probably wouldn't be the same size (?).

Priority: -- → P3

Thanks for the detailed writeup! Yes, that makes sense.

Two random impl notes:

  • I assume there's also a branch "did WasmTlsReg change during this tail-call" that guards this whole thunking logic too?
  • For the "FIXME" at Imm32(RestoreWasmTlsRegReturnAddr): I think what you'd want to do here is add a new SymbolicAddress for a singular, process-wide "tail call thunk" so that it could be patched into the machine code directly (b/c the machine code is shared by many instances on many (window/worker) threads). This new symbolic address could be associated with a "builtin thunk" like all the other process-wide, dynamically-generated-at-startup-time thunks in WasmBuiltins.cpp, via wasm::SymbolicAddressTarget (which would grow a special case to ignore the fact that the tail-call thunk is not wrapping an underlying C++ function).

As an note on ordering: b/c there's an interesting interaction with multi-value-return here, and since there's a pressing need for multi-value in general: would there be any way to do multi-value first, before tail calls, so that we can fully understand that generalization, and have that tested, before we add tail calls?

Luke Wagner [:luke] from comment #4)

  • I assume there's also a branch "did WasmTlsReg change during this tail-call" that guards this whole thunking logic too?

If I understand correctly, I think the above implementation is sufficient to preserve WasmTlsReg across tail-call boundaries. But, it may not be necessary: if the callee has the same WasmTlsReg there's no need to ensure there's a restoration thunk, so we could add a dynamic check to skip the thunking.

As an note on ordering: b/c there's an interesting interaction with multi-value-return here, and since there's a pressing need for multi-value in general: would there be any way to do multi-value first, before tail calls, so that we can fully understand that generalization, and have that tested, before we add tail calls?

Certainly :) I will take a look at your in-progress patch today.

(In reply to Andy Wingo [:wingo] from comment #5)

But, it may not be necessary: if the callee has the same WasmTlsReg there's no need to ensure there's a restoration thunk, so we could add a dynamic check to skip the thunking.

Yes, that was my aim: performance, not correctness. And 99% of the time (and probably 100% for any given branch-predictable callsite) TLS won't be changing.

Certainly :) I will take a look at your in-progress patch today.

Thanks!

(bizarre duplicate comment...)

Depends on: wasm-abi-202x
No longer depends on: wasm-abi-202x

We likely won't do it this way.

Status: NEW → RESOLVED
Closed: 2 years ago
Resolution: --- → INCOMPLETE
You need to log in before you can comment on or make changes to this bug.