Pin TLS in the baseline compiler
Categories
(Core :: JavaScript: WebAssembly, enhancement, P3)
Tracking
(firefox100: status fixed)
People
(Reporter: lth, Assigned: lth)
References
(Blocks 1 open bug)
Attachments
(7 files)
+++ This bug was initially created as a clone of Bug #1715459 +++
See bug 1715459 for some preliminary work; bug 1714086 for original TC + analysis.
Back in the day, we decided not to pin the TLS in the baseline compiler because it overconstrains register allocation, especially on x86-32. Instead, the TLS has a home location in the stack frame: it is spilled on entry to the function and reloaded whenever it is needed. We never measured whether this was good or bad; it was just one of those things we had to do to move the work along.
Based on exploratory work (attached), pinning the TLS in the baseline compiler results in a 5% decrease in baseline code size on x86-64 (sample of one application: Zen Garden) and will probably have similar savings on ARM64; pinning will therefore help reduce code bloat, which is an issue for large wasm apps. (More test cases would be good.)
As noted in bug 1715459 comment 3, there may be significant regalloc problems on x86-32 as a result, so this work is not exactly easy: pinning the TLS leaves four usable registers. As Julian put it, we'd be "programming like it's 1977". We should seriously consider moving some baseline operations into C++ callouts on this platform only, or otherwise specializing the open-coded implementations on this platform to allow more redundant operations (shortening value lifetimes), or somehow allowing for memory operands. Candidates are mostly GC-proposal operations, memory64 operations, and 64-bit atomic operations.
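To make the tradeoff concrete, here is a minimal toy sketch (not SpiderMonkey code; the register choices, kTlsFrameSlot, and kGlobalAreaOffset are illustrative assumptions) of what a single instance-field access costs under the two schemes: with the TLS in a frame slot every access pays a reload, while with a pinned TLS register the access is a single memory operand off that register.

// Toy model, not SpiderMonkey code. kTlsFrameSlot, kGlobalAreaOffset and the
// register choices below are illustrative assumptions only.
#include <cstdio>
#include <string>
#include <vector>

// TLS kept in a stack-frame slot: reload it before every use.
static std::vector<std::string> accessWithFrameSlotTls() {
  return {
      "mov rax, [rbp - kTlsFrameSlot]      ; reload TLS from its frame home",
      "mov rcx, [rax + kGlobalAreaOffset]  ; the actual access",
  };
}

// TLS pinned in a dedicated register (hypothetically r14): no reload needed.
static std::vector<std::string> accessWithPinnedTls() {
  return {
      "mov rcx, [r14 + kGlobalAreaOffset]  ; access directly off the pinned reg",
  };
}

int main() {
  std::printf("frame-slot TLS: %zu instructions per access\n",
              accessWithFrameSlotTls().size());
  std::printf("pinned TLS:     %zu instructions per access\n",
              accessWithPinnedTls().size());
}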
Comment 1•3 years ago (Assignee)
(This works for x64, arm, arm64, mips64. x86 is WIP / subsequent
patches; many things work but atomics and GC features require a lot
more work.)
By pinning the Tls register we simplify code, sometimes substantially,
and somewhat crucially we generate much less code for debugging.
However, this also complicates register allocation on x86, where we
are now down to four usable registers, two having been taken by
scratch and tls (there is no heap register) and one by fp. Frequently
this is too little (five was already too little). Several techniques
can be used to work around this:
- Free registers early if they are not used.
- Stash values into the save area in the tls (now that the tls is always there); not yet an option for reference values.
- Push values onto the value stack while we're operating.
- Use WasmTlsReg as a scratch for short regions and reload it at the end of the region (see the sketch after this list). The region must not contain any code that could conceivably assume that the register holds the tls value. This is harder than it sounds; any write barrier or instance call will require WasmTlsReg to have its correct value.
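As a sketch of that last technique, here is a small, self-contained toy (hypothetical names, not the baseline compiler's actual interface): an RAII region borrows the pinned TLS register as a scratch and reloads it when the region ends, and the assert models the invariant that nothing inside the region may rely on the register holding the tls value.

// Toy sketch, not SpiderMonkey API. Models borrowing the pinned TLS register
// as a scratch for a short region and reloading it at the end of the region.
#include <cassert>

struct ToyBaselineCompiler {
  bool tlsRegHoldsTls = true;

  void useTlsRegAsScratch() { tlsRegHoldsTls = false; }
  void reloadTlsFromFrame() { tlsRegHoldsTls = true; }

  // Anything that needs the instance (write barriers, instance calls) must
  // see the correct TLS value in the register.
  void emitWriteBarrier() { assert(tlsRegHoldsTls); }
};

class BorrowTlsRegAsScratch {
  ToyBaselineCompiler& bc_;
 public:
  explicit BorrowTlsRegAsScratch(ToyBaselineCompiler& bc) : bc_(bc) {
    bc_.useTlsRegAsScratch();
  }
  ~BorrowTlsRegAsScratch() { bc_.reloadTlsFromFrame(); }  // end of region
};

int main() {
  ToyBaselineCompiler bc;
  {
    BorrowTlsRegAsScratch borrow(bc);
    // ... short open-coded sequence; must not emit barriers or instance calls ...
  }
  bc.emitWriteBarrier();  // safe again: the TLS register was reloaded above
}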
Comment 2•3 years ago (Assignee)
The value stash is a spill area in the tls, used for some x86
operations to store I64 values. It can be generalized for other value
types and multiple slots. (No reference types yet, though - that
requires rooting or barriers.)
Depends on D139457
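A rough layout sketch of the idea, with hypothetical names and sizes (this is not the real TlsData definition): the stash is just a couple of fixed slots in the TLS, so the compiler can address a parked value as (pinned TLS register + constant offset).

// Hypothetical layout sketch; field names and sizes are assumptions, not the
// real TlsData definition.
#include <cstddef>
#include <cstdint>

struct ToyTlsData {
  // ... real instance/TLS fields elided ...
  uint64_t valueStash[2];  // spill slots for non-reference values (e.g. I64 on x86-32)
};

// The operand the compiler emits for a stashed value is simply
// (tls register + small constant offset).
constexpr size_t stashSlotOffset(size_t slot) {
  return offsetof(ToyTlsData, valueStash) + slot * sizeof(uint64_t);
}

static_assert(stashSlotOffset(1) - stashSlotOffset(0) == sizeof(uint64_t),
              "slots are contiguous");

int main() {}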
Comment 3•3 years ago (Assignee)
Some simple x86-32 changes to allow Tls to be pinned.
Depends on D139458
Comment 4•3 years ago (Assignee)
Fairly simple changes to allow open-coded array.get to work with pinned
Tls on x86-32.
Depends on D139459
Comment 5•3 years ago (Assignee)
Quite complicated changes to allow array.new and array.set to work
with pinned Tls in the baseline compiler on x86-32.
Especially for array.new, the argument could be made that the operation
should be moved into C++, possibly even on all platforms: the open-coded
version already makes a callout for the allocation and callouts for the
barriers (if needed), so a C++ implementation could inline some of that
code and would quite possibly be faster, and since the current open-coded
path always pays for at least one call anyway there would be no added cost
on the Wasm side.
This solution uses the value stash pretty liberally. An alternative would
be to recompute some values when they are needed.
Depends on D139460
Comment 6•3 years ago (Assignee)
Remove some remaining x86-32 related FIXMEs in the comments.
Depends on D139461
Comment 7•3 years ago (Assignee)
These WIP patches still fail a few tests on x86-32:
- 64-bit atomics
- memory64 (though it seems mostly atomics are the problem here too)
- struct.set (and possibly struct.new, but haven't gotten that far yet)
Comment 8•3 years ago (Assignee)
Chatted with Jan a bit and there is in fact a more general abstraction than the "stash" area here. If the TLS register is pinned and there is a stash area in the TLS at reasonable offsets, then we should think about the stash area as "TLS-based registers". We should be able to take the addresses of these "registers", and every such address will have the form (reg + small-offset) where the reg is the pinned TLS. There could be any number of these registers, but typically just a few will be enough. There could be at least two types, "bits" and "references", where the "reference" type would be a slot that is some kind of root that does not require any barrier.
As for the lifetime of these, there is a spectrum. It is easiest if they are managed so that they are never live across calls at all, but this is probably not practical. An intermediate point is that they are never live across calls that can re-enter the instance (so write barriers for sure, and probably allocations and other callouts to the runtime). For full generality, they must be saved and restored across calls; caller-saves is probably easiest, but truly, nothing is easy about this. The full-generality case would best be avoided.
The flip side of this is that this abstraction decreases the number of hardware registers available, so more masm operations must be able to operate on memory operands. These could be x86-32 only or somewhat cross-platform. The (reg + small-offset) form makes these memory operands fairly easy to handle.
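A minimal sketch of how that abstraction might look, with hypothetical names (this is not a SpiderMonkey interface): each "TLS-based register" is a kind plus a small fixed offset, and taking its address yields exactly the (reg + small-offset) form described above.

// Illustrative sketch only; names are hypothetical, not a SpiderMonkey interface.
#include <cstdint>

enum class TlsSlotKind {
  Bits,       // plain bits, no GC interaction
  Reference,  // a root-like slot that needs no barrier on write
};

struct ToyAddress {   // stands in for a (base register + small offset) operand
  int baseReg;
  int32_t offset;
};

struct TlsBasedReg {
  TlsSlotKind kind;
  int32_t offset;     // small, fixed offset of the slot within the TLS

  // Every address has the form (pinned TLS register + small offset), so masm
  // operations that accept memory operands can use it directly.
  ToyAddress address(int pinnedTlsReg) const { return {pinnedTlsReg, offset}; }
};

int main() {
  constexpr int kPinnedTlsReg = 14;                 // hypothetical register number
  TlsBasedReg r{TlsSlotKind::Bits, /*offset=*/64};  // one slot in the stash area
  ToyAddress a = r.address(kPinnedTlsReg);
  (void)a;
}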
Comment 9•3 years ago (Assignee)
We can effectively pin the Instance* on register-rich architectures
with very little effort, so let's do that. This amounts to reserving
the WasmTlsReg, reloading it only when necessary, performing moves
from WasmTlsReg to some other GPR when the abstractions demand it, and
otherwise using the WasmTlsReg register directly. This will get rid
of all extraneous Instance* loads for the new breakable point, and
many others besides.
This is not possible on x86-32 because there are too few registers.
This may be possible on ARM32 with modest work to accommodate 64-bit
atomics and memory64. However, I'm not going to bother since it's not
meaningful to spend effort on optimizations on ARM32.
Drive-by fix: addressOfGlobalVar() needs to take a RegPtr, not a
RegI32, for its temp. This should have no effect on generated code.
Depends on D140859
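A toy sketch of the opportunistic scheme (the needTls helper and register numbering are assumptions, not the compiler's real interface): on register-rich targets the request for the instance pointer just hands back the pinned register, while on x86-32 it still loads a scratch register from the frame.

// Toy sketch; needTls() and the register numbering are assumptions, not the
// baseline compiler's real interface.
enum Reg : int { RegScratch = 2, RegTls = 14 };

struct ToyMasm {
  int emittedLoads = 0;
  Reg loadTlsFromFrame() {
    ++emittedLoads;       // a real compiler would emit a memory load here
    return RegScratch;
  }
};

#if defined(JS_CODEGEN_X86)
constexpr bool kTlsIsPinned = false;  // too few registers to pin on x86-32
#else
constexpr bool kTlsIsPinned = true;   // x64/arm64/mips64: reserve WasmTlsReg
#endif

// Returns a register known to hold the Instance*/TLS pointer, emitting a
// reload only when the value is not pinned.
Reg needTls(ToyMasm& masm) {
  return kTlsIsPinned ? RegTls : masm.loadTlsFromFrame();
}

int main() {
  ToyMasm masm;
  Reg tls = needTls(masm);  // no load emitted on register-rich targets
  (void)tls;
}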
Comment 10•3 years ago
Pushed by lhansen@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/da5ef551f0ef Opportunistically pin Instance* in baseline. r=yury
Comment 11•3 years ago
bugherder