Closed Bug 1715459 Opened 3 years ago Closed 2 years ago

Do not emit redundant TLS reloads for breakpoints

Categories

(Core :: JavaScript: WebAssembly, defect, P3)

defect

Tracking

()

RESOLVED WONTFIX

People

(Reporter: lth, Assigned: lth)

References

(Blocks 2 open bugs, Regression)

Details

(Keywords: regression)

Attachments

(1 obsolete file)

See bug 1714086 for TC + analysis.

The code emitted for breakpoints currently always loads the TLS pointer from the frame, then calls the breakpoint handler (or performs a NOP). Most of the time the TLS reload will be redundant in practice; we should only emit it when/if needed.

The baseline compiler doesn't have any infrastructure for that type of optimization currently and its invariants around the tls register are a little hazy anyway, so implementing this optimization could be a fair amount of work / might have to wait for more of the baseline compiler cleanup to be completed.

That said, optimizing this just within straight-line code would be a good start, and that might not be too hard.

I don't think there's any good reason why the baseline compiler shouldn't just keep the tls in the Tls register at all times, and I'm going to get to the bottom of this.

Assignee: nobody → lhansen
Status: NEW → ASSIGNED

In brief, the Tls is volatile because it is needed in the register pool on x86 and ARM-32; on these platforms we run out of registers running the test suite if the Tls is taken out of the pool. On x86 there are many, many failures - on ARM-32 there are only problems with atomics and memory64 tests (4 failed test files in total). On ARM we may be able to work around this easily, but on x86 it may take real work. Will investigate a little further.

It is possible that a compromise here is that the Tls register is non-volatile on the 64-bit register-rich platforms; that would help reduce code size for debugging and baseline in general on these platforms, which are after all the most important ones. This further complicates the compiler a little, but not very much.

It's possible to pin WasmTlsReg in the baseline compiler. The hard parts to fix are atomics and some of the open-coded GC operations on x86-32 (and I expect memory64, ditto, although there too it seems to be mostly about atomics), where the constraints are extremely tight - there are only four registers once we've taken away the scratch, the tls, and the FP. In practice some values must be in "memory" but the stack abstraction makes "memory" a little hard to get to.

The problem of temp memory can be solved by using the tls stash area, previously only used for x86 but useful for all if generalized. And we may have to allow the tls to be pushed onto the stack in limited regions, but this will make it harder to find bugs.

I've implemented this for all non-x86-32 code and also ported the array set / array new code for x86-32, as a PoC. The resulting code is surprisingly clean, but it's real work, and abstraction boundaries are breaking down and must shift somehow.

It is far from obvious that we should be investing very much in x86-32 baseline performance and it may very well be that, in the same way that we have call-outs for 64-bit arithmetic on this platform, we should call out to these hairy register-constrained functions that themselves contain calls (for allocation and/or barriers).

Blocks: 1756030

This is likely worth finishing. Experiments with Zen Garden, showing x64 machine code size with baseline compiler only, release build:

Normal code:
Before pinning tls: 75.3MB
After pinning tls: 71.5MB (5% reduction from non-pinning)

Debug code:
Before pinning tls: 200.8MB (2.7x bloat from non-debug code)
After pinning tls: 147.7MB (26% reduction from non-pinning; "only" 2.1x bloat from non-debug code)

A different direction: why is insertBreakablePoint loading the tls register at all?

The breakpoint is a callsite that is either a nop, or it points to the stub generated by GenerateDebugTrapStub (possibly via an intermediate far jump island). That stub does not inspect the WasmTlsReg, but saves all registers and then jumps to WasmHandleDebugTrap in WasmBuiltins.cpp. That function does not inspect register content at all, but instead uses GetNearestEffectiveTls() to get the Tls and Instance.

Commenting out the loading of the Tls in insertBreakablePoint does not break any tests, and reduces code size from 200.8MB to 151.5MB - the difference between that and the 147.7MB reported above being accounted for by the other tls reloads that my more complicated patch removed.

The loaded tls value is never needed by anyone, is not part of the
calling protocol for the stub, and unnecessarily bloats debug code
size by a lot (see analysis on the bug).

See Also: → 1756739
Pushed by lhansen@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/c395fad3a8b0
Do not load the WasmTlsReg before a breakpoint trap. r=yury

Backed out for causing SM wasm related bustage

Backout link

Push with failures

Failure log

Failure line(s): TEST-UNEXPECTED-FAIL | js/src/jit-test/tests/debug/wasm-11.js | Unknown (code -11, args "--no-blinterp --no-baseline --no-ion --more-compartments") [0.0 s]

Flags: needinfo?(lhansen)

It fails on x86-32 Linux here, but ditto Windows tests were not run so there's no reason to assume this is Linux-only.

The failing tests talks specifically about the TLS being evicted, so this is a meaningful failure - it means the TLS register is actually depended on by somebody, yet we did not see that when we inspected the code. Whatever it is, it is a further indication that the TLS register must be consistent at wasm calls, no matter the type of the call. (We do the same for builtin calls, where it also should not matter.) I don't understand this yet, but when I do it will show up as documentation through bug 1744513.

The bug that created this test case (and added the comment that Yury dug up, see the patch on the present bug) was

changeset:   389110:bc25a1c06597
user:        Luke Wagner <luke@mozilla.com>
date:        Tue Mar 28 09:12:52 2017 -0500
summary:     Bug 1350744 - Baldr: load TLS reg before calling debug trap handler (r=bbouvier)

In turn this points to bug 1334504, which was a massive ABI change that greatly affected how we handle TLS.

The fix in bug 1350744 was perhaps sensible, and it is perhaps more interesting to ask why Dmitry added the TLS load in the debug-generated code later.

IIUC, Dmitry's change did not increase debug code size: it only moved a load from the far jump island into the baseline-emitted code, and I believe everybody gets a far jump island. So if we're concerned about code size, trying to revert Dmitry's change is likely not the correct solution. We would want a solution that does not have such a load, or has a load that is definitely shared among many. It turns out that was wrong. Very few far jump islands are created, so Dmitry's change significantly (33%) increased debug code size.

Flags: needinfo?(lhansen)

The DebugTrapStub generator calls GenerateExitPrologue which calls SetExitFP which calls LoadActivation which generates code that reads WasmTlsReg. So there's that mystery solved.

So this means that either we piggyback the loading of the WasmTlsReg onto the slow and shared path of a debug callout (see bug 1756951), which I think we could do, or we complete the work on pinning the TLS register in the baseline compiler (see bug 1756739), or we try to revert Dmitry's change (this will require a change to the stack frame layout in the baseline compiler to place the saved Tls value at a known offset, or will require us to store it in the FrameWithTls instead, probably a good thing).

Blocks: 1712893

I believe I know how to revert Dmitry's change - get rid of the special Tls slot in the baseline compiler's frame and use the CalleeTlsSlot in the incoming FrameWithTls instead, and then effectively reinstate the code that was removed in https://phabricator.services.mozilla.com/D83064#change-20B6m8zUtgWT, but referencing the CalleeTlsSlot instead - but I'm not going to do so now, since nothing will really matter until bug 1756951 is fixed, and that will cause the code at the breakable point to change anyway and the redundant reload will likely disappear organically.

Severity: N/A → S3
Status: ASSIGNED → RESOLVED
Type: enhancement → defect
Closed: 2 years ago
Keywords: regression
Regressed by: 1639153, 1742053
Resolution: --- → WONTFIX
Attachment #9264945 - Attachment is obsolete: true
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: