Bug 1756951 Comment 0 Edit History

For a deeper analysis, see bug 1756030 comment 1 et seq.  Cloning debug code on large wasm sites will cause us to run out of code memory and will make it impossible to debug the site.

This bug is about the cloning specifically, and how to enable wasm debugging without cloning the full code or suffering OOM from the cloning.  Temporary workarounds for the fact that we trigger debug code / cloning inappropriately should be discussed elsewhere.

Some possible solutions:

- remove the limits on code space (for cloned code?) so that cloning remains possible so long as there is address space available
- generate debug code that does not require cloning, ie, using some type of indirection for breakpoints (with the existing architecture for breakpoints)
- generate debug code lazily, ie, using some type of indirection for calls (with the existing architecture for breakpoints)
- reorganize the debugger so that it can share code for breakpoints, ie, change the breakpointing architecture

Non-solutions:

- copy-on-write of code will reduce RAM footprint but will still require more address space, and it is really a per-process solution, while our problem is with multiple threads inside the same process
- generating smaller debug code is possible.  However, suppose we can shrink the code by 30% by pinning TLS (bug 1715459) and using short jumps across conditional traps.  The reality is that LibreOffice will still require 530MB for its debug code on x64, and much more on arm64.  With just a few threads this will still max out the 2GB code allowance (see the arithmetic sketch below).
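
A minimal back-of-the-envelope check of that last claim, using only the numbers quoted in this comment (530MB of debug code per clone after shrinking, and a 2GB = 2048MB per-process code allowance); the constant names are illustrative only:

```cpp
// Rough arithmetic, not code from the tree.
constexpr double kDebugCloneMB = 530.0;      // post-shrink debug code, LibreOffice on x64
constexpr double kCodeAllowanceMB = 2048.0;  // per-process executable-code allowance
// 2048 / 530 ≈ 3.86: three debug clones fit, a fourth exhausts the allowance,
// and arm64 debug code is larger still.
constexpr double kClonesThatFit = kCodeAllowanceMB / kDebugCloneMB;
```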

**Removing the limits on code space (sketch):**

This would be a 64-bit-only solution, and it would have some security implications (more code == more options for jit-spray attacks), but if it's allowed specifically for debugging code and within a COEP+COOP process it would probably be an acceptable risk.  The main issue would be that code pointers into and out of a wasm blob would potentially not fit in a 32-bit offset in a call or jump instruction or in the 32-bit value of a pointer load.  (We have this problem to a small extent already - on ARM64 the offset range for call/jump is smaller.)  Ignoring tiering, since this is baseline code only, all wasm stubs would be affected.  Entry stubs would probably have to be co-located with the code that calls into them, and would make far jumps.  Exit stubs should probably be co-located with the code that calls into them too, ie, with the cloned debug code, and make far jumps to the true callees.

For this, we'd start by segregating wasm allocations into a 2GB space separate from JS allocations, to shake out bugs as much as possible.

(There's also reason to worry about how some far jumps are encoded.  Right now they are encoded pc-relative; but if code is serialized and deserialized in separate chunks then the distance between source and target will likely change.  And if they were to be encoded absolute then everything would have to be re-linked on deserialization.  Not sure yet how much this matters.)

An alternative would be to have one code arena per worker rather than one per process, but this doesn't really change anything: compiled wasm code can be shared between workers, so in effect all these arenas are per-process once wasm becomes involved.
 
**Generating debug code that does not need to be cloned (sketch):**

In this case each breakpoint call is to some handler that decides what to do next.  We could make it PC-relative or indirect via TLS; PC-relative leads to less code, so assume that.  Assume we attach some common code to each function or to the module and that the breakpoint code calls that.  Every bytecode would lead to a call.  The callee would very quickly determine whether breakpoints are enabled at all (check a byte in a table hanging off the TLS, there could be a byte per function) and return if not; if debugging is enabled for the function, a more complicated computation would be carried out: effectively there would be a callout to the debug trap handler, as now.  This architecture would significantly slow down execution of debug code but would completely remove the need to clone code.  The stub handler would require access to the TLS, though that seems to be an invariant at present anyway (bug 1715459).

A fast path would be `CALL; table = LOAD tls[x]; byte = LOAD table[x]; CMP byte, 0; JNZ handler; RET`, with the risk of address arithmetic on ARM64 since x may not fit in an immediate (a bit table would mitigate this at the cost of bit manipulation).  Register availability is a concern, but the scratch register should be available at this point.

A different fast path would first test a flag in the TLS that says whether debugging is enabled at all, to avoid the dependent load and address computation.  Still, it's a handful of instructions and control flow.
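
To make the shape of that callee concrete, here is a minimal sketch of the two fast paths combined, written as plain C++; `Instance`, its fields, `DebugCheck`, and `HandleDebugTrap` are illustrative stand-ins rather than the actual SpiderMonkey types, and in reality this would be a hand-emitted stub, not a C++ function:

```cpp
#include <cstdint>

// Hypothetical layout: a cheap "any debugging at all" flag plus a
// byte-per-function table, both reachable from the TLS/instance pointer.
struct Instance {
  bool debuggingEnabled;             // flag tested by the second fast path
  const uint8_t* perFuncDebugFlags;  // byte per function, as described above
};

// Stand-in for the existing callout into the debugger (the slow path).
void HandleDebugTrap(Instance* tls, uint32_t funcIndex, uint32_t bytecodeOffset);

// Shared pre-handler: every bytecode in debug code calls this, and it returns
// immediately unless debugging is actually enabled for this function.
void DebugCheck(Instance* tls, uint32_t funcIndex, uint32_t bytecodeOffset) {
  if (!tls->debuggingEnabled) {
    return;  // avoids the dependent load and the address computation
  }
  if (tls->perFuncDebugFlags[funcIndex] == 0) {
    return;  // debugging is on somewhere, but not in this function
  }
  HandleDebugTrap(tls, funcIndex, bytecodeOffset);  // as in the current design
}
```

Using a bit table instead of a byte table only changes the second load and the compare; the overall shape stays the same.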

A further tweak on this is that we could keep the patchable NOP roughly as we have it now, and then patch it with the address of a local code sequence that performs the above checks.  The patching would then happen simultaneously for all threads.  When no threads are being debugged, performance is as it is now.  When at least one thread is being debugged, all threads slow down a fair bit to take the call to the pre-handler, but most will then return directly.
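
A sketch of the toggle this implies, assuming a list of patchable sites and a pair of patching primitives; `PatchSite`, `PatchCallTo`, `PatchNop`, and `kDebugCheckEntry` are made-up names, and a real implementation would need the usual care around cross-thread code modification and cache flushing:

```cpp
#include <atomic>
#include <cstdint>
#include <vector>

// Hypothetical bookkeeping and patching primitives (not the real masm API).
struct PatchSite { uint8_t* addr; };
void PatchCallTo(PatchSite site, const void* target);  // overwrite the NOP with a call
void PatchNop(PatchSite site);                         // restore the NOP

extern std::vector<PatchSite> breakpointSites;  // all patchable sites in the shared code
extern const void* kDebugCheckEntry;            // the local code sequence doing the checks above

static std::atomic<int> debuggedThreads{0};

// Only the first/last transition touches the code, and the shared code is
// patched once for all threads.  Undebugged threads then pay only for calling
// the pre-handler and returning; with no debugged threads the NOPs are restored.
void OnThreadDebugStateChanged(bool nowDebugging) {
  if (nowDebugging) {
    if (debuggedThreads.fetch_add(1) == 0) {
      for (PatchSite site : breakpointSites) {
        PatchCallTo(site, kDebugCheckEntry);
      }
    }
  } else if (debuggedThreads.fetch_sub(1) == 1) {
    for (PatchSite site : breakpointSites) {
      PatchNop(site);
    }
  }
}
```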

**Generating debug code lazily (sketch):**

We may be able to piggyback on the existing tiering architecture to generate debug code when we enable debugging for a function; however, this works only when stepping into the function but not when stepping out to a function that was not previously enabled for debugging (because we have no OSR).  Setting a breakpoint in a function can't enable debugging for all callers, because they are not computable in general (other than the set "all functions").

We're unlikely to implement OSR for the sake of debugging, even though the baseline architecture allows it to be done somewhat easily; after all, everything is on the stack and debugging does not change what's on the stack.  We'd need a level of indirection directly after every function entry to trigger compilation of a debug version and return to the corresponding point in the debug version.
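
For what that entry indirection might look like, here is an illustrative-only sketch; `Instance`, its fields, `CompileDebugTier`, and `MaybeEnterDebugTier` are hypothetical names, and in generated code this would be a small check emitted right after the prologue rather than a call into C++:

```cpp
#include <cstdint>

// Hypothetical per-instance state for lazy debug tiering.
struct Instance {
  const uint8_t* perFuncDebugFlags;  // nonzero => debugging requested for the function
  void** debugTierEntry;             // corresponding entry point in the debug version, or null
};

// Stand-in for compiling the debug version of one function and returning the
// point in it that corresponds to "just after entry".
void* CompileDebugTier(Instance* tls, uint32_t funcIndex);

// Returns null to keep running the normal code, or the address to jump to in
// the (possibly freshly compiled) debug version of this function.
void* MaybeEnterDebugTier(Instance* tls, uint32_t funcIndex) {
  if (!tls->perFuncDebugFlags[funcIndex]) {
    return nullptr;
  }
  if (!tls->debugTierEntry[funcIndex]) {
    tls->debugTierEntry[funcIndex] = CompileDebugTier(tls, funcIndex);
  }
  return tls->debugTierEntry[funcIndex];
}
```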

**Reorganize the debugger (non-sketch):**

I don't know how to do this; it would in any case allow the debugger to work on shared code, and it would likely be a fair amount of work.

Edit: Yury says that allowing the debugger to work on shared, patched code was the original intent, and that he does not understand why we need to clone code - maybe the debugger architecture changed, maybe there was a misunderstanding.  This needs investigation.
