Closed Bug 1756951 Opened 10 months ago Closed 9 months ago

Cloning of wasm debug code causes code memory OOM

Categories

(Core :: JavaScript: WebAssembly, defect, P2)

defect

Tracking

()

RESOLVED FIXED
100 Branch
Tracking Status
firefox100 --- fixed

People

(Reporter: lth, Assigned: lth)

References

(Blocks 4 open bugs)

Details

Attachments

(9 files, 1 obsolete file)

For a deeper analysis, see bug 1756030 comment 1 et seq. Cloning debug code on large wasm sites will cause us to run out of code memory and will make it impossible to debug the site.

This bug is about the cloning specifically, and how to enable wasm debugging without cloning the full code or suffer OOM from the cloning. Temporary workarounds for the fact that we trigger debug code / cloning inappropriately should be discussed elsewhere.

Some possible solutions:

  • remove the limits on code space (for cloned code?) so that cloning remains possible so long as there is address space available
  • generate debug code that does not require cloning, ie, has some type of indirection for breakpoints (with the existing architecture for breakpoints)
  • generate debug code lazily, ie, has some type of indirection for calls (with the existing architecture for breakpoints)
  • reorganize the debugger so that it can share code for breakpoints, ie, change the breakpointing architecture

Non-solutions:

  • copy-on-write of code will reduce RAM footprint but will still require more address space, and is really a per-process solution while we have a problem with multiple threads inside the same process
  • generating smaller debug code is possible. However, suppose we can shrink the code by 30% by pinning TLS (bug 1715459) and using short jumps across conditional traps. The reality is that LibreOffice will still require 530MB for its debug code on x64, and much more on arm64. With just a few threads this will still max out the 2GB code allowance.

Removing the limits on code space (sketch):

This would be a 64-bit only solution, and would have some security implications (more code == more options for jit-spray attacks), but if it's allowed specifically for debugging code and within a COEP+COOP process it would probably be acceptable risk. The main issue would be that code pointers into and out of a wasm blob would potentially not fit in a 32-bit offset in a call or jump instruction or in the 32-bit value of a pointer load. (We have this problem to a small extent already - on ARM64 the offset range for call/jump is smaller.) Ignoring tiering, since this is baseline code only, all wasm stubs would be affected. Entry stubs would probably have to be co-located with code that calls into them, and would make far jumps. Exit stubs should probably be co-located with the code that calls into them too, ie, with the cloned debug code, and make far jumps to the true callees.

For this, we'd first start by segregating wasm allocations in a separate 2GB space from JS allocations, to shake out bugs as much as possible.

(There's also reason to worry about how some far jumps are encoded. Right now they are encoded pc-relative; but if code is serialized and deserialized in separate chunks then the distance between source and target will likely change. And if they were to be encoded absolute then everything would have to be re-linked on deserialization. Not sure yet how much this matters.)

An alternative is that it ought to be possible to have one code arena per worker rather than one per process, but this doesn't really change anything, as compiled wasm code can be shared between workers and hence in effect all these arenas are per-process once wasm becomes involved.

Generating debug code that does not need to be cloned (sketch):

In this case each breakpoint call is to some handler that decides what to do next. We could make it PC-relative or indirect via TLS; PC-relative leads to less code so assume that. Assume we attach some common code to each function or to the module and that the breakpoint code calls that. Every bytecode would lead to a call. The callee would very quickly determine whether breakpoints are enabled at all (check a byte in a table hanging off the TLS, there could be a byte per function) and return if not; if debugging is enabled for the function, a more complicated computation would be carried out, effectively there would be a callout to the debug trap handler, as now. This architecture would significantly slow down execution of debug code but completely remove the need to clone code. The stub handler would require access to the TLS though that seems to be an invariant at present anyway (bug 1715459).

A fast path would be CALL; table = LOAD tls[x]; byte = LOAD table[x]; CMP byte, 0; JNZ handler; RET with the risk of address arithmetic on ARM64 since x may not fit in an immediate (a bit table would mitigate this at the cost of bit manipulation). Register availability is a concern but the scratch should be available at this point.

A different fast path would first test a flag in the TLS that says whether debugging is enabled at all, to avoid the dependent load and address computation. Still, it's a handful of instructions and control flow.

A further tweak on this is that we could keep the patchable NOP roughly as we have it now, and then patch it with the address of a local code sequence that performs the above checks. The patching would then happen simultaneously for all threads. When no threads are being debugged, performance is as it is now. When at least one thread is being debugged, all threads slow down a fair bit to take the call to the pre-handler, but most will then return directly.

Generating debug code lazily (sketch):

We may be able to piggyback on the existing tiering architecture to generate debug code when we enable debugging for a function; however, this works only when stepping into the function but not when stepping out to a function that was not previously enabled for debugging (because we have no OSR). Setting a breakpoint in a function can't enable debugging for all callers, because they are no computable in general (other than the set "all functions").

We're unlikely to implement OSR for the sake of debugging, even though the baseline architecture allows it to be done somewhat easily; after all, everything is on the stack and debugging does not change what's on the stack. We'd need a level of indirection directly after every function entry to trigger compilation of a debug version and return to the corresponding point in the debug version.

Reorganize the debugger (non-sketch):

I don't know how to do this, it would in any case allow the debugger to work on shared code and would likely be a fair amount of work.

Edit: Yury says that allowing the debugger to work on shared, patched code was the original intent, and that he does not understand why we need to clone code - maybe the debugger architecture changed, maybe there was a misunderstanding. This needs investigation.

Here's a sketch for the code that always calls out to a local stub that determines whether further filtering is possible. The overhead of this relative to the current debug code is significant, almost 3x (measured on doubly-recursive fib(35) on my Xeon tower). That may be affordable as those who debug will have beefy hardware and the current debug code is "only" 2x slower than the optimized code for this tests, but it suggests that it would be beneficial to investigate the extra tweak where we start out with a NOP and patch it with a call to the local stub so that we don't slow down any threads or functions until we start debugging at least one of them (code would still not be cloned, and patching could probably happen without synchronizing the threads).

Here's a different take on the same idea: at the breakpoint site, unconditionally call indirect via Tls to the handler, which will either do something (debugging is happening) or just return (nothing is happening). No testing of flags is required. (Ignore the change to WasmGC.cpp, this just works around a problem that I didn't want to fix right now.)

On Linux / Xeon this is no faster on my test case than the previous attempt, and in a production setting will be slower still due to bug 1649109.

(We want better benchmarks, and we want to benchmark on consumer hardware, for a realistic assessment of both ideas.)

Blocks: 1712893

In principle, the cloning of shared patchable code is not necessary under the assumption that (a) jit code can be writable and executable at the same time and (b) patching can be performed with copy-atomicity. In both cases the condition is necessary for one thread to be able to patch code while other threads are running, as code is shared.

Regarding condition (a) I'm not sure what the situation is; we are clearly wanting to go in the direction of W^X for SpiderMonkey but it's not clear that we couldn't sidestep that for tools if the OS allows it. A workaround if code is always W^X is that all threads are stopped when patching needs to happen. I'm fairly sure we don't do this or have the capability to do it at this point.

Regarding (b) the breakpoint site is a nopPatchableToCall, ie, a branch with an up-to 32-bit offset. On ARM and ARM64 this is a single aligned instruction (limiting the effective offset to something smaller than 32-bit). On x86 and x64 it's a five-byte NOP. On MIPS64 it's a six-byte instruction sequence (though the last is just the delay slot and will never be updated). On Loong64 it's a four-byte instruction sequence. On ARM and ARM64 the patching is copy-atomic. On x86 and x64 the sequence could be copy-atomic if we were to take extra steps to perform only a single (8-byte, usually) write and, on some older systems at least, always ensure that the write does not cross cache lines and code could be R+X for a short window and then W+X for a short window and breakpoint sites are never so close together that we risk colliding when we write. cmpxchg8b should work on x86; movq on x64. On MIPS64 and Loong64 it looks pretty grim, with the current implementation, but given the size of the patch sequence we're likely to blow the code budget anyway, for the larger applications.

With the above conditions met, since the patching is racy it is probably necessary either for the debugger to ensure that only one thread is patching at a time or for the patching to be truly atomic via the use of CMPXCHG or similar mechanism. (Using CMPXCHG would also fix the problem of breakpoint sites being too close together on x86.) Clearly one thread needs to be able to know whether it is the last observer to leave a function/frame and should remove breakpoints or not.

The current debugging architecture is not obviously thread-aware. It has per-Debugger-object data structures to track whether a function or site has a breakpoint, but nothing (that I've found) that coordinates breakpoints across multiple workers, say, or even across multiple Debugger objects on the same thread. It's hard to know whether such coordination is truly necessary or whether the debugger runs in only one thread and is able to handle breakpoint traps from worker threads within its main thread.

Attached file simple-debug.js

Test case for the perf experiments.

Here's another variant. This moves the debugging-is-enabled condition in-line: for each debug step, we load a value from Tls and test it, and conditionally skip across the call. This is a slowdown of 50% relative to the current debugging code (320ms vs 220ms on my Xeon).

On x64 the optimal sequence is something like these 11 bytes

    CMP [WasmTlsReg+offs], 0  ;; REX 83 MODRM OFFS IB
    JZ  L                     ;; 74 OFFS
    CALL [WasmTlsReg+offs]    ;; REX FF MODRM OFFS
L:

on the assumption that the debug flag and handler (which can be the same field) can be placed near the start of the Tls so that 8-bit offsets are enough. Note this calculation depends on pinning the Tls register. But if we pull this off, there is only a small net increase in code size: currently the breakpoint sequence on this platform is 9 bytes, to load the Tls and then perform a call.

On ARM64 it's a little bit worse, here we assume the flag and the handler pointer are indeed the same:

    LDR temp, [WasmTlsReg+offs]
    CBZ temp, L
    BL temp
L:

which takes us from 8 bytes now (load tls + call) to 12 bytes. Still, within reason - since we don't have to clone the code at all.

Assignee: nobody → lhansen
Status: NEW → ASSIGNED
Depends on: 1753692
Assignee: lhansen → nobody
Status: ASSIGNED → NEW

As Yury pointed out on a call, it's possible to do even better by dedicating a register to the trap handler. In that case, the loads above would not be necessary. On ARM64 we have x28 (the PSP) which is not used by the baseline compiler. On x64 we don't really have an available register but since it's baseline code we could probably find one. On ARM and x86 it's going to be tough to do this, though, so we could either disable debugging on those platforms or pay for the extra load from Tls at each breakable location.

I'm not sure yet whether such heroics are necessary. The high bit is to stop cloning code, which we do by one of the methods above; after that it becomes a question of how much optimization is required for "good enough". At the moment, debug code for LibreOffice is 530MB on x64. With the additional bloat from the above, it may be as much as 700MB if we don't pin the Tls register. This is so far from our code limit that we can afford to get rid of the cloning first and then implement Tls pinning and additional optimizations over time. It's not even obvious that single-stepping huge Wasm applications is even meaningful without good source-level debugging support.

Assignee: nobody → lhansen
Status: NEW → ASSIGNED

Note, x64-specific WIP, unoptimized.

Instead of emitting a patchable call at entry/exit/breakpoints, emit a
conditional call to a per-instance handler that the debugger can
install. The new code is larger than the code it replaces but we can
optimize this later (see analysis on the bug).

Now code is constant and can be shared even when debugging.

Depends on D140851

The two patches implement the idea from comment 5 and are sufficient to allow Photoshop to get to the point where the OPFS is not available and it stops even with the console open while loading / compiling. LibreOffice appears to have bitrotted - it throws an error about already having consumed an ArrayBuffer - but I need to dig into that a little more to make sure it's not something I did.

The second patch is probably finished. The first patch needs a little more work to make the breakable-point code as small as possible per-platform, and it is probably also platform-specific to x86/x64 at this point. Also, some adjustments need to be made to some debug code that checks that all wasm call instructions are of a known form. A new form is introduced here.

Attachment #9267404 - Attachment description: WIP: Bug 1756951 - Part 1: Do not patch code when debugging → Bug 1756951 - Part 1: Do not patch code when debugging. r?yury
Attachment #9267415 - Attachment description: WIP: Bug 1756951 - Part 2: Do not clone code for debugging → Bug 1756951 - Part 2: Do not clone code for debugging. r?yury
Attached file simple-debug-step.js (obsolete) —

A simple benchmark to measure the cost of global filtering. If WithBreakpoints==true then a breakpoint is set in the unrelated function "dummy", this will cause every breakable point to go through the filtering mechanism.

The overhead of this when filtering is done in the trap handler is about 1000x; thus we will need better filtering than this in practice.

Here's a simple fast filter that sits on top of my two patches, this is not complete but it demonstrates what a fast path can look like. The inline code checks to see if the breakpointHandler is non-null, as before. If so, it calls relative to a stub attached to the function. The stub loads a table pointer from the Instance and then tests a bit in the table, and if the bit is clear it just returns again - this means no debugging is enabled for this function. (There is one bit per function in this design.) If the bit is set, then the stub jumps indirect to the breakpoint trap handler for further filtering.

With this filter, fib(35) is < 3x slower with debugging enabled for the unrelated function (see previous comment) than it is with no debugging enabled. This seems like something we can live with if it means we don't have to patch code.

Some design notes:

  • There must be one table per instance and the instance's code can be shared, so the table must somehow be instance-relative and we cannot bake the memory location of the table slot into the code.
  • The table is created when the instance is created and never needs to change (though it is updated obviously, as the debugger adds and removes functions from it); it can be destroyed when the instance is destroyed. There is only the one owner; no reference counting is needed.
  • It is probably possible to inline the table into the instance so that we can load the slot off the Instance* that's already in a register, but this is a complication we should argue for separately.
  • There is one filter stub per function and it knows the function index, but it does not know the code offset it was called from and therefore there is only one bit per function in the table. A more precise filter would have one bit per code location in the module or a Bloom filter that combines function offset and function index, perhaps; this would allow more code to run at full speed when breakpoints are set but we're not stepping. It is definitely not appealing to pass the function index from the call site, as that bloats the size of the breakable point. However, we do have a return address, so a filter that incorporates the actual code address into the filter table (or the machine code offset, ditto, which we can compute) would be somewhat interesting. We would then compute the slot and bit offset from the return address. This would require more work, and more registers, and again we should argue for this complication separately.

I will prepare a third patch that implements the simple per-function-index filter and add that to the queue.

Current benchmark results, Release shell, test case as above but tweaked a litte (will be updated), fib(35):

On x64 Xeon:
Ion: 70ms
Baseline: 100ms
Baseline with old debug code, debugging not enabled: 155ms
Baseline with old debug code, debugging enabled on unrelated function: 155ms
Baseline with new debug code, debugging not enabled: 245ms (1.6x)
Baseline with new debug code, debugging enabled on unrelated function: 655ms (2.7x slowdown over new no-debug case, 4.2x slowdown over old)

On ARM64 M1:
Ion: 66ms
Baseline: 78ms
Baseline with old debug code, debugging not enabled: 133ms
Baseline with old debug code, debugging enabled on unrelated function: 133ms
Baseline with new debug code, debugging not enabled: 196ms (1.5x)
Baseline with new debug code, debugging enabled on unrelated function: 330ms (1.7x over new no-debug case, 2.5x over old)

Once debugging is enabled for a function we'll make a call to C++ filtering for every site in the function, so setting a breakpoint in a function will slow that function down greatly (not measured, but earlier results suggest three orders of magnitude slowdown).

A more precise filter would have one bit per code location in the module or a Bloom filter that combines function offset and function index, perhaps

Or it could have a bit per address range. For fast filtering, consider using a portion of the return address directly as an index into a bit table. The shift and mask could be adapted to the module size by the compiler, there would be very little logic. Now we'd only have a slowdown for the area directly around the breakpoint. The filter would however be slightly more expensive. If filtering cost is an issue, there could be two tables and a two-level filter: one per-function and one per-address-range. I guess a Bloom filter combines those, but I'm not sure it's worth the bother to implement that - non-code memory is relatively cheap.

Attached file simple-debug-step.js
Attachment #9267870 - Attachment is obsolete: true
Pushed by lhansen@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/5ab04623c826
Part 1: Do not patch code when debugging. r=yury
https://hg.mozilla.org/integration/autoland/rev/10a616ca8ab2
Part 2: Do not clone code for debugging. r=yury
Blocks: 1760264

Going to add a cleanup patch here.

Keywords: leave-open

The insertBreakablePoint function need not be inline, and the code
to create the breakpoint pad can be moved from endFunction to its
own function; both the functions can be grouped with the other
debug functions in WasmBaselineCompile.cpp

Keywords: leave-open
Status: ASSIGNED → RESOLVED
Closed: 9 months ago
Resolution: --- → FIXED
Target Milestone: --- → 100 Branch
Regressions: 1760535
Depends on: 1760627
Regressions: 1760627
Regressions: 1761606
You need to log in before you can comment on or make changes to this bug.