Pin TLS in the baseline compiler
Categories
(Core :: JavaScript: WebAssembly, enhancement, P3)
Tracking
(firefox100: status fixed)
People
(Reporter: lth, Assigned: lth)
References
(Blocks 1 open bug)
Attachments
(7 files)
+++ This bug was initially created as a clone of Bug #1715459 +++
See bug 1715459 for some preliminary work; bug 1714086 for original TC + analysis.
Back in the day, we decided not to pin the TLS in the baseline compiler because it overconstrains register allocation, especially on x86-32. Instead, the TLS has a home location in the stack frame: it is spilled on entry to the function and reloaded whenever it is needed. We never measured whether this was good or bad; it was just one of those things we had to do to move the work along.
Based on exploratory work (attached), pinning the TLS in the baseline compiler results in a 5% decrease in baseline code size on x86-64 (sample of one application: Zen Garden) and will probably have similar savings on ARM64; pinning will therefore help reduce code bloat, which is an issue for large wasm apps. (More test cases would be good.)
As noted in bug 1715459 comment 3, there may be significant regalloc problems on x86-32 as a result, so this work is not exactly easy: pinning the TLS leaves four usable registers. As Julian put it, we'd be "programming like it's 1977". We should seriously consider moving some baseline operations into C++ callouts on this platform only, or otherwise specializing the open-coded implementations on this platform to allow more redundant operations (shortening value lifetimes), or somehow allowing for memory operands. Candidates are mostly GC-proposal operations, memory64 operations, and 64-bit atomic operations.
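To make the tradeoff concrete, here is a minimal toy sketch (not SpiderMonkey code; the register choices, kTlsFrameSlot, and kGlobalAreaOffset are illustrative assumptions) of what a single instance-field access costs under the two schemes: with the TLS in a frame slot every access pays a reload, while with a pinned TLS register the access is a single memory operand off that register.

// Toy model, not SpiderMonkey code. kTlsFrameSlot, kGlobalAreaOffset and the
// register choices below are illustrative assumptions only.
#include <cstdio>
#include <string>
#include <vector>

// TLS kept in a stack-frame slot: reload it before every use.
static std::vector<std::string> accessWithFrameSlotTls() {
  return {
      "mov rax, [rbp - kTlsFrameSlot]      ; reload TLS from its frame home",
      "mov rcx, [rax + kGlobalAreaOffset]  ; the actual access",
  };
}

// TLS pinned in a dedicated register (hypothetically r14): no reload needed.
static std::vector<std::string> accessWithPinnedTls() {
  return {
      "mov rcx, [r14 + kGlobalAreaOffset]  ; access directly off the pinned reg",
  };
}

int main() {
  std::printf("frame-slot TLS: %zu instructions per access\n",
              accessWithFrameSlotTls().size());
  std::printf("pinned TLS:     %zu instructions per access\n",
              accessWithPinnedTls().size());
}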
Comment 1•3 years ago (Assignee)
(This works for x64, arm, arm64, mips64. x86 is WIP / subsequent
patches; many things work but atomics and GC features require a lot
more work.)
By pinning the Tls register we simplify code, sometimes substantially,
and somewhat crucially we generate much less code for debugging.
However, this also complicates register allocation on x86, where we
are now down to four usable registers, two having been taken by
scratch and tls (there is no heap register) and one by fp. Frequently
this is too little (five was already too little). Several techniques
can be used to work around this:
- Free registers early if they are not used.
- Stash values into the save area in the tls (now that the tls is always there); not yet an option for reference values.
- Push values onto the value stack while we're operating.
- Use WasmTlsReg as a scratch for short regions and reload it at the end of the region (see the sketch after this list). The region must not contain any code that could conceivably assume that the register holds the tls value. This is harder than it sounds; any write barrier or instance call will require WasmTlsReg to have its correct value.
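As a sketch of that last technique, here is a small, self-contained toy (hypothetical names, not the baseline compiler's actual interface): an RAII region borrows the pinned TLS register as a scratch and reloads it when the region ends, and the assert models the invariant that nothing inside the region may rely on the register holding the tls value.

// Toy sketch, not SpiderMonkey API. Models borrowing the pinned TLS register
// as a scratch for a short region and reloading it at the end of the region.
#include <cassert>

struct ToyBaselineCompiler {
  bool tlsRegHoldsTls = true;

  void useTlsRegAsScratch() { tlsRegHoldsTls = false; }
  void reloadTlsFromFrame() { tlsRegHoldsTls = true; }

  // Anything that needs the instance (write barriers, instance calls) must
  // see the correct TLS value in the register.
  void emitWriteBarrier() { assert(tlsRegHoldsTls); }
};

class BorrowTlsRegAsScratch {
  ToyBaselineCompiler& bc_;
 public:
  explicit BorrowTlsRegAsScratch(ToyBaselineCompiler& bc) : bc_(bc) {
    bc_.useTlsRegAsScratch();
  }
  ~BorrowTlsRegAsScratch() { bc_.reloadTlsFromFrame(); }  // end of region
};

int main() {
  ToyBaselineCompiler bc;
  {
    BorrowTlsRegAsScratch borrow(bc);
    // ... short open-coded sequence; must not emit barriers or instance calls ...
  }
  bc.emitWriteBarrier();  // safe again: the TLS register was reloaded above
}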
Comment 2•3 years ago (Assignee)
The value stash is a spill area in the tls, used for some x86
operations to store I64 values. It can be generalized for other value
types and multiple slots. (No reference types yet, though - that
requires rooting or barriers.)
Depends on D139457
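A rough layout sketch of the idea, with hypothetical names and sizes (this is not the real TlsData definition): the stash is just a couple of fixed slots in the TLS, so the compiler can address a parked value as (pinned TLS register + constant offset).

// Hypothetical layout sketch; field names and sizes are assumptions, not the
// real TlsData definition.
#include <cstddef>
#include <cstdint>

struct ToyTlsData {
  // ... real instance/TLS fields elided ...
  uint64_t valueStash[2];  // spill slots for non-reference values (e.g. I64 on x86-32)
};

// The operand the compiler emits for a stashed value is simply
// (tls register + small constant offset).
constexpr size_t stashSlotOffset(size_t slot) {
  return offsetof(ToyTlsData, valueStash) + slot * sizeof(uint64_t);
}

static_assert(stashSlotOffset(1) - stashSlotOffset(0) == sizeof(uint64_t),
              "slots are contiguous");

int main() {}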
Comment 3•3 years ago (Assignee)
Some simple x86-32 changes to allow Tls to be pinned.
Depends on D139458
Comment 4•3 years ago (Assignee)
Fairly simple changes to allow open-coded array.get to work with pinned
Tls on x86-32.
Depends on D139459
Comment 5•3 years ago (Assignee)
Quite complicated changes to allow array.new and array.set to work
with pinned Tls in the baseline compiler on x86-32.
Especially for array.new, the argument could be made that the operation
should be moved into C++, possibly even on all platforms: the open-coded
version already makes a callout for the allocation and callouts for the
barriers (if needed), so a C++ implementation could inline some of that
code and would quite possibly be faster, and since the current open-coded
path always pays for at least one call anyway there would be no added cost
on the Wasm side.
This solution uses the value stash pretty liberally. An alternative would
be to recompute some values when they are needed.
Depends on D139460
Comment 6•3 years ago (Assignee)
Remove some remaining x86-32 related FIXMEs in the comments.
Depends on D139461
Comment 7•3 years ago (Assignee)
These WIP patches still fail a few tests on x86-32:
- 64-bit atomics
- memory64 (though it seems mostly atomics are the problem here too)
- struct.set (and possibly struct.new, but haven't gotten that far yet)
Comment 8•3 years ago (Assignee)
Chatted with Jan a bit and there is in fact a more general abstraction than the "stash" area here. If the TLS register is pinned and there is a stash area in the TLS at reasonable offsets, then we should think about the stash area as "TLS-based registers". We should be able to take the addresses of these "registers", and every such address will have the form (reg + small-offset) where the reg is the pinned TLS. There could be any number of these registers, but typically just a few will be enough. There could be at least two types, "bits" and "references", where the "reference" type would be a slot that is some kind of root that does not require any barrier.
As for the lifetime of these, there is a spectrum. It is easiest if they are managed so that they are never live across calls at all, but this is probably not practical. An intermediate point is that they are never live across calls that can re-enter the instance (so write barriers for sure, and probably allocations and other callouts to the runtime). For full generality, they must be saved and restored across calls; caller-saves is probably easiest, but truly, nothing is easy about this. The full-generality case would best be avoided.
The flip side of this is that this abstraction decreases the number of hardware registers available, so more masm operations must be able to operate on memory operands. These could be x86-32 only or somewhat cross-platform. The (reg + small-offset) form makes these memory operands fairly easy to handle.
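A minimal sketch of how that abstraction might look, with hypothetical names (this is not a SpiderMonkey interface): each "TLS-based register" is a kind plus a small fixed offset, and taking its address yields exactly the (reg + small-offset) form described above.

// Illustrative sketch only; names are hypothetical, not a SpiderMonkey interface.
#include <cstdint>

enum class TlsSlotKind {
  Bits,       // plain bits, no GC interaction
  Reference,  // a root-like slot that needs no barrier on write
};

struct ToyAddress {   // stands in for a (base register + small offset) operand
  int baseReg;
  int32_t offset;
};

struct TlsBasedReg {
  TlsSlotKind kind;
  int32_t offset;     // small, fixed offset of the slot within the TLS

  // Every address has the form (pinned TLS register + small offset), so masm
  // operations that accept memory operands can use it directly.
  ToyAddress address(int pinnedTlsReg) const { return {pinnedTlsReg, offset}; }
};

int main() {
  constexpr int kPinnedTlsReg = 14;                 // hypothetical register number
  TlsBasedReg r{TlsSlotKind::Bits, /*offset=*/64};  // one slot in the stash area
  ToyAddress a = r.address(kPinnedTlsReg);
  (void)a;
}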
Comment 9•3 years ago (Assignee)
We can effectively pin the Instance* on register-rich architectures
with very little effort, so let's do that. This amounts to reserving
the WasmTlsReg, reloading it only when necessary, performing moves
from WasmTlsReg to some other GPR when the abstractions demand it, and
otherwise using the WasmTlsReg register directly. This will get rid
of all extraneous Instance* loads for the new breakable point, and
many others besides.
This is not possible on x86-32 because there are too few registers.
This may be possible on ARM32 with modest work to accommodate 64-bit
atomics and memory64. However, I'm not going to bother since it's not
meaningful to spend effort on optimizations on ARM32.
Drive-by fix: addressOfGlobalVar() needs to take a RegPtr, not a
RegI32, for its temp. This should have no effect on generated code.
Depends on D140859
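A toy sketch of the opportunistic scheme (the needTls helper and register numbering are assumptions, not the compiler's real interface): on register-rich targets the request for the instance pointer just hands back the pinned register, while on x86-32 it still loads a scratch register from the frame.

// Toy sketch; needTls() and the register numbering are assumptions, not the
// baseline compiler's real interface.
enum Reg : int { RegScratch = 2, RegTls = 14 };

struct ToyMasm {
  int emittedLoads = 0;
  Reg loadTlsFromFrame() {
    ++emittedLoads;       // a real compiler would emit a memory load here
    return RegScratch;
  }
};

#if defined(JS_CODEGEN_X86)
constexpr bool kTlsIsPinned = false;  // too few registers to pin on x86-32
#else
constexpr bool kTlsIsPinned = true;   // x64/arm64/mips64: reserve WasmTlsReg
#endif

// Returns a register known to hold the Instance*/TLS pointer, emitting a
// reload only when the value is not pinned.
Reg needTls(ToyMasm& masm) {
  return kTlsIsPinned ? RegTls : masm.loadTlsFromFrame();
}

int main() {
  ToyMasm masm;
  Reg tls = needTls(masm);  // no load emitted on register-rich targets
  (void)tls;
}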
Comment 10•3 years ago
Pushed by lhansen@mozilla.com: https://hg.mozilla.org/integration/autoland/rev/da5ef551f0ef Opportunistically pin Instance* in baseline. r=yury
Comment 11•3 years ago
bugherder