Closed Bug 1714086 Opened 4 months ago Closed 2 months ago

Make wasm debug-code builds not use 40x the time and 10x the memory of normal builds

Categories

(Core :: JavaScript: WebAssembly, defect, P2)

defect

Tracking


RESOLVED FIXED
92 Branch
Tracking Status
firefox-esr78 --- wontfix
firefox-esr91 --- wontfix
firefox90 --- wontfix
firefox91 --- wontfix
firefox92 --- fixed

People

(Reporter: lth, Assigned: lth)

References

(Depends on 2 open bugs, Blocks 1 open bug, Regression)

Details

(Keywords: regression)

Attachments

(2 files, 2 obsolete files)

(Spun off from bug 1714072.)

Debug code and metadata are very large and take a very long time to create.

The very slow compilation is due to the presence of the debugger. Contrast these two programs:

var g = newGlobal({newCompartment: true});
g.eval(`new WebAssembly.Module(read("/home/lhansen/moz/rustc_binary.wasm","binary"))`);

which compiles the test case in about 0.5s, and this one:

var g = newGlobal({newCompartment: true});
var dbg = new Debugger(g);
g.eval(`new WebAssembly.Module(read("/home/lhansen/moz/rustc_binary.wasm","binary"))`);

which takes more than 20s.

The memory use is also due to the debugger. In the first case, memory use ends up at 215MB, while in the second case, the shell grows to about 1.9GB.

(More data at comment 3.)

A quick scan of the code emitted for debugging shows that a very large fraction of it is tls reloads; if these could be removed / avoided, the code would shrink dramatically. Also, there are a lot of back-to-back breakpoints in the emitted code, possibly for operations that emit no machine code at all:

00000065  4c 8b 74 24 10            movq 0x10(%rsp), %r14
0000006A  0f 1f 44 00 00            nopl %eax, (%rax,%rax,1)
0000006F  4c 8b 74 24 10            movq 0x10(%rsp), %r14
00000074  0f 1f 44 00 00            nopl %eax, (%rax,%rax,1)

Memory consumption will be due to:

  • the code bloat above
  • creating a stack map for every breakpoint (even if they are merged, and even if the test case probably does not use reftypes and won't have maps)
  • appending the CallSiteDesc to the vector of those for every breakpoint
  • (other things)

Focus here should be on:

  • emit no pointless breakpoints; elide them if possible, or can they be merged if back-to-back?
  • avoid tls reload if this can be done cheaply
  • look for nonlinear algorithms that will bite when code or metadata become very large (eg, masm buffers are presized for "normal" compiles)
  • dig in with a profiler to look for other trouble spots
Attached patch no-redundant-breakpoints.patch (obsolete) — Splinter Review

Simple idea for how to avoid at least some redundant breakpoints.
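
For concreteness, a minimal sketch of the idea (not the actual patch; ToyMasm, emitBreakpointStub, and BreakpointEmitter are made-up stand-ins for the baseline compiler's masm and debug-trap emission): skip a breakable point entirely when no machine code has been emitted since the previous one.

    #include <cstdint>
    #include <vector>

    // Stand-in for the macro assembler; only what the sketch needs.
    struct ToyMasm {
      std::vector<uint8_t> code;
      uint32_t currentOffset() const { return uint32_t(code.size()); }
      void emitBreakpointStub() {
        // Pretend stub: a 5-byte tls reload plus a 5-byte breakpoint site
        // (cf. the disassembly in comment 0).
        for (int i = 0; i < 10; i++) {
          code.push_back(0x90);
        }
      }
    };

    // Emit a breakable point only if some machine code has been emitted since
    // the previous one; otherwise the previous breakpoint already covers this
    // opcode and a second breakpoint (plus its stack map) would be redundant.
    struct BreakpointEmitter {
      uint32_t lastOffset = UINT32_MAX;

      void emitBreakablePoint(ToyMasm& masm) {
        uint32_t here = masm.currentOffset();
        if (here == lastOffset) {
          return;  // back-to-back with the previous breakpoint: elide it
        }
        masm.emitBreakpointStub();
        lastOffset = masm.currentOffset();
      }
    };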

Assignee: nobody → lhansen

FWIW (need to look into this later), running memory measurements in about:memory in a debug build asserts (repeatably) after a couple of reloads with the console open:

2022	  MOZ_ASSERT(gcTotal == rtStats.gcHeapChunkTotal);
(gdb) p gcTotal
$1 = 33554416
(gdb) p rtStats.gcHeapChunkTotal
$2 = 33554432
(gdb) bt
#0  0x00007fffe4b4f3db in xpc::ReportJSRuntimeExplicitTreeStats(JS::RuntimeStats const&, nsTSubstring<char> const&, nsIHandleReportCallback*, nsISupports*, bool, unsigned long*) (rtStats=..., rtPath=..., handleReport=0x7fff761e49e0, data=0x0, anonymize=false, rtTotalOut=0x7fffffff9230) at js/xpconnect/src/XPCJSRuntime.cpp:2022
#1  0x00007fffe4b55976 in xpc::JSReporter::CollectReports(nsBaseHashtable<nsUint64HashKey, nsTString<char>, nsTString<char>, nsDefaultConverter<nsTString<char>, nsTString<char> > >*, nsBaseHashtable<nsUint64HashKey, nsTString<char>, nsTString<char>, nsDefaultConverter<nsTString<char>, nsTString<char> > >*, nsIHandleReportCallback*, nsISupports*, bool) (windowPaths=0x7fffffffa940, topWindowPaths=0x7fffffffa918, handleReport=0x7fff761e49e0, data=0x0, anonymize=false) at js/xpconnect/src/XPCJSRuntime.cpp:2273
#2  0x00007fffe5af6698 in nsWindowMemoryReporter::CollectReports(nsIHandleReportCallback*, nsISupports*, bool) (this=0x7fffcbb66800, aHandleReport=0x7fff761e49e0, aData=0x0, aAnonymize=false) at dom/base/nsWindowMemoryReporter.cpp:576
#3  0x00007fffe30d837d in nsMemoryReporterManager::DispatchReporter(nsIMemoryReporter*, bool, nsIHandleReportCallback*, nsISupports*, bool)::$_0::operator()() const (this=0x7fffb3614668) at xpcom/base/nsMemoryReporterManager.cpp:1844
#4  0x00007fffe30d8289 in mozilla::detail::RunnableFunction<nsMemoryReporterManager::DispatchReporter(nsIMemoryReporter*, bool, nsIHandleReportCallback*, nsISupports*, bool)::$_0>::Run() (this=0x7fffb3614640) at obj-x86_64-pc-linux-gnu/dist/include/nsThreadUtils.h:534
#5  0x00007fffe326eb5a in mozilla::RunnableTask::Run() (this=0x7fffc0bed780) at xpcom/threads/TaskController.cpp:482

Re comment 0, comment 1: an experiment.

Normal operation for a release build on my machine with the debugger disabled: 0.5s, 0.18GB.
Normal operation for a release build on my machine with the debugger enabled: 19.5s, 1.4GB. (So I guess it's 40x slower, not 10x)
With breakpoints and their stackmaps elided when the PC is the same for the new breakpoint as for the previous breakpoint: 10.6s, 0.86GB.
With breakpoints elided when redundant, but stackmaps emitted as before: 18.9s, 1.3GB.
With breakpoints elided when redundant, and stackmaps elided when they do not contain pointers: 0.5s, 0.18GB.

That is, virtually 100% of the overhead comes from emitting stack maps even though the code has no reference types.

Stack maps are always emitted when the debugger is active, whether the frames contain reference types or not. This accounts for the huge time cost, though I'm not sure why it should cost so much space, because I thought stack maps that were the same were commoned. (So maybe I was wrong, or maybe they're not the same.)

Either way, I'm not sure why the stack maps are emitted when the debugger is active when there are no pointers in the frame. Yury, do you know?

Flags: needinfo?(ydelendik)
Summary: Make wasm debug-code builds not use 10x the time and 10x the memory of normal builds → Make wasm debug-code builds not use 40x the time and 10x the memory of normal builds

Do not emit a breakpoint and a stack map when they would be adjacent to the previous
ones in the emitted code. Presumably a single-stepping debugger will be seen to 'skip'
instructions; notably this is true for x.CONST, DROP, ELSE, and END (in some cases).
The purpose is to shrink the debug code and make it faster to create.

Drive-by fix: better labels on the stack maps for the breakpoints.

Attachment #9224716 - Attachment is obsolete: true

Hm, there was an interesting change in bug 1628426 (multi-value related), where we went from being interested in stack maps for debug frames with ref types to all stack maps for debug frames. This was a complicated patch doing many things, and I suspect it may accidentally have done some damage here.

Regressed by: 1628426

The Bugbug bot thinks this bug is a defect, but please change it back in case of error.

Type: enhancement → defect

> ... why the stack maps are emitted when the debugger is active when there are no pointers in the frame. Yury, do you know?

I'm not sure about the stack maps or pointers -- they appeared after the debug frame work. As you noticed, it looks like the stack maps are unrelated to debug mode.

Flags: needinfo?(ydelendik)
Duplicate of this bug: 1656549

In addition to not generating stack maps just because we're doing debugging (which needs to be investigated but has some uncertainty), we could look into making stack map creation more efficient. In createStackMap, most of the time is spent from the point where we heap-allocate the map, and especially in the allocation itself, it seems. Therefore, speeding up allocation (using an arena, say) or avoiding allocation by sharing stack maps that are alike aside from the PC they relate to would be interesting candidates. In particular, I would expect one stack map to be equal to the previous one created fairly often.
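
As a concrete (hypothetical) illustration of the sharing idea, a tiny function-local cache could hand back an existing map when a newly requested one has identical content; ToyStackMap and StackMapCache below are made-up types, and the real StackMap would need refcounting or arena ownership to make the sharing safe.

    #include <deque>
    #include <memory>
    #include <vector>

    // Toy stack map: one bit per word of frame.  The real StackMap is more
    // compact; this is only to make the caching idea concrete.
    struct ToyStackMap {
      std::vector<bool> bits;
      bool sameAs(const ToyStackMap& other) const { return bits == other.bits; }
    };

    // Function-local LRU of recently created maps.  When a requested map has
    // the same content as a cached one, the cached map is shared (here via
    // shared_ptr) instead of heap-allocating a fresh copy.
    class StackMapCache {
      static constexpr size_t kCapacity = 10;
      std::deque<std::shared_ptr<ToyStackMap>> cache_;

     public:
      std::shared_ptr<ToyStackMap> getOrCreate(const ToyStackMap& wanted) {
        for (auto it = cache_.begin(); it != cache_.end(); ++it) {
          if ((*it)->sameAs(wanted)) {
            auto hit = *it;
            cache_.erase(it);
            cache_.push_front(hit);  // move to front (most recently used)
            return hit;
          }
        }
        auto fresh = std::make_shared<ToyStackMap>(wanted);
        cache_.push_front(fresh);
        if (cache_.size() > kCapacity) {
          cache_.pop_back();  // evict the least recently used map
        }
        return fresh;
      }
    };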

In WasmBaselineCompile.cpp there used to be this little thing in endFunction() just before stack maps are created:

    if (env_.debugEnabled()) {
      // If a return type is a ref, we need to note that in the stack maps
      // generated here.  Note that this assumes that DebugFrame::result* and
      // DebugFrame::cachedReturnJSValue_ are either both ref-typed or they are
      // both not ref-typed.  It can't represent the situation where one is and
      // the other isn't.
      HasRefTypedDebugFrame refDebugFrame = HasRefTypedDebugFrame::No;
      for (ValType result : funcType().results()) {
        if (result.isReference()) {
          refDebugFrame = HasRefTypedDebugFrame::Yes;
          break;
        }
      }
      ...

and then refDebugFrame would be passed to createStackMap(). A stack map would be elided if refDebugFrame==No AND the "extra" space in the frame contained no references. For many of the other calls to createStackMap(), the map would be elided simply if the extra space contained no references. The extra space is related to trap handling.

In the new code from bug 1628426, these heuristics / optimizations have all been disabled: a map is now always created for every one of those calls to createStackMap() if the debugger is enabled. The reason for that appears to have to do with handling multiple values correctly. When there are multiple values, the JS return value will be a reference (it's an array), so that's one reason. But the commit message also says that this fixes another bug, where an i64 reified as a BigInt (in the single-value return case) will now be traced properly.

It seems to me that the new code improves on the old code in one respect in addition to fixing bugs: it is unambiguous whether we'll get a map or not. I don't quite understand why the code shown above was not used in BaseCompiler::emitFunction, say, to compute a single value that could be used for all calls to createStackMap(). Possibly that would have led to more maps being created.

In any event, if we are to improve on the current situation by creating fewer maps then the above code needs to come back into existence, but it needs to be computed centrally and used everywhere, and it must at least take into account whether the function has multiple values or creates a BigInt value. There may be other conditions.
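
A sketch of what that central computation might look like, using a simplified stand-in for the wasm result types (ToyValType and debugFrameMayHoldJSPointer are illustrative, not the real ValType/HasRefTypedDebugFrame API): compute once per function whether the DebugFrame's cached JS return value can ever be a heap pointer, and feed that single answer to every createStackMap() call.

    #include <vector>

    // Simplified stand-in for wasm value types.
    enum class ToyValType { I32, I64, F32, F64, Ref };

    // Computed once per function and then used for every createStackMap()
    // call: the DebugFrame must be exposed to the GC if its cached JS return
    // value can be a heap pointer.  That is the case for any reference
    // result, for i64 (reified as a BigInt), and for multiple results
    // (reified as an Array).
    bool debugFrameMayHoldJSPointer(const std::vector<ToyValType>& results) {
      if (results.size() > 1) {
        return true;  // multi-value: the JS return value is an Array
      }
      for (ToyValType t : results) {
        if (t == ToyValType::Ref || t == ToyValType::I64) {
          return true;  // ref result, or i64 reified as a BigInt
        }
      }
      return false;  // no heap pointer can appear in the return value slot
    }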

Some more data:

  • A small LRU cache (10 elements) that is function-local is very effective at reducing allocation of stack maps and reducing run time in the test case. The cache has a hit rate of 98.4%, and the running time of the program drops from 19.5s to 3.0s.
  • The LRU cache does not reduce the memory footprint all that much, however (from 1.4GB to 1.1GB in the task manager). The stack maps have a very compact representation and most of them will be very small, so this is not completely shocking, but then the question is, where does the memory go?
  • What does reduce memory consumption significantly is not storing the stack maps anywhere, ie, the stackMaps_ data structure seemingly occupies a relatively large amount of memory. This could be the consequence of using vector storage and being unlucky with some doubling algorithm, but more investigation is needed.

This is not complete because it neuters StackMap::destroy(); to do
better we need refcounts or some type of arena to manage the memory
for the maps. The arena is a little tricky, since compilation is
multi-threaded, but something that allows arenas to be merged when
compilation is done would work fine.
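
A rough sketch of what such a mergeable arena could look like (illustrative only; alignment and oversized allocations are ignored, and the real engine would have to integrate this with StackMap ownership): each compilation thread bump-allocates maps from its own arena, and the chunks are spliced into a module-level arena when the thread finishes.

    #include <cstddef>
    #include <cstdint>
    #include <memory>
    #include <vector>

    // A per-thread bump arena for stack maps.  Each compilation thread
    // allocates from its own arena with no locking, and when compilation is
    // done the chunks are moved into a module-level arena so the maps stay
    // alive for the lifetime of the module and can be freed all at once.
    class StackMapArena {
      static constexpr size_t kChunkSize = 64 * 1024;
      std::vector<std::unique_ptr<uint8_t[]>> chunks_;
      size_t used_ = kChunkSize;  // force a new chunk on first allocation

     public:
      void* alloc(size_t bytes) {
        if (used_ + bytes > kChunkSize) {
          chunks_.push_back(std::make_unique<uint8_t[]>(kChunkSize));
          used_ = 0;
        }
        void* p = chunks_.back().get() + used_;
        used_ += bytes;
        return p;
      }

      // Splice another (per-thread) arena's chunks into this one when that
      // thread's compilation is done; the donor arena is left empty.
      void mergeFrom(StackMapArena&& other) {
        for (auto& chunk : other.chunks_) {
          chunks_.push_back(std::move(chunk));
        }
        other.chunks_.clear();
        other.used_ = kChunkSize;
      }
    };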

Heap profiling:

Base case: 2.2GB

  • object code 592MB
  • assembler buffers 522MB
  • maplet vector 521MB
  • stack maps 216MB
  • metadata 138MB
  • call site targets 130MB
  • misc 100MB

Filter redundant stack maps (virtually all, comment 12): 2.0GB

  • object code 592MB
  • assembler buffers 522MB
  • maplet vector 521MB
  • metadata 138MB
  • call site targets 130MB
  • misc 100MB

Filter "redundant" breakpoints (comment 4): 1.1GB

  • object code 336MB
  • maplet vector 261MB
  • assembler buffers 256MB
  • metadata 70MB
  • call site targets 65MB
  • misc 110MB

After that it becomes a little hazy. Not emitting any tls reloads for the remaining breakpoints should show a significant reduction in code size but does not, yet emitting neither tls reloads nor calls for breakpoints drops code size by 128MB. The tls reload is 5 bytes and so is the call, so not doing tls reloads should save about 64MB; possibly there's some granularity effect in the allocator (or in the observations) here, TBD.

Beyond the optimizations attempted above, the next best step we can take is to shrink the maplet size, for an additional 130MB savings over the filtered breakpoints optimization. At the moment the maplet is 16 bytes on x64, as it holds two pointers. It should be possible to shrink this to 8 bytes by encoding the two values as offsets into buffers: an offset into a StackMap table or arena for the StackMap*, and an offset into the module's executable memory area for the program counter. (I don't think we need to do anything new with process-wide tables.)
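
A sketch of the 8-byte encoding (the struct and field names are illustrative, not the actual wasm::StackMaps representation):

    #include <cstdint>

    struct StackMap;  // the existing, heap-allocated stack map type

    // Current shape: two pointers, 16 bytes on x64.
    struct FatMaplet {
      const uint8_t* nativeCodePC;  // address in the module's executable memory
      const StackMap* map;          // pointer to the heap-allocated map
    };

    // Compact shape: two 32-bit offsets, 8 bytes.  The PC becomes an offset
    // from the module's code base and the map becomes an offset into a
    // module-level stack map table or arena; both are resolved on lookup.
    struct CompactMaplet {
      uint32_t codeOffset;  // PC minus the module's code base
      uint32_t mapOffset;   // byte offset into the stack map table/arena
    };

    static_assert(sizeof(CompactMaplet) == 8, "maplet should shrink to 8 bytes");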

I think I'll clean up the patch in comment 4 and offer it for review, since that takes the edge off the problem, but it'll be attractive to investigate further.

Depends on: 1715444
Depends on: 1715456
Depends on: 1715459
See Also: → 1715514
Attachment #9224924 - Attachment is obsolete: true

Cf. comment 10: in principle we can partially revert / further refine the change that always creates a stack map when debugging is enabled. We need a map when the JS return value field may become a pointer that needs to be traced. This is true for multi-value but, as Wingo notes, also when we return an i64, because that can turn into a BigInt. And of course it's true for any ref return type. In the current universe, where multi-value and reftypes are rare (and i64-typed functions are fairly rare too), such a change would drastically reduce the number of stack maps, the time used to compute them, and the memory occupied by them.

However, in the universe of the future, with ubiquitous use of reftypes or memory64, the problem will come back.

Since the more subtle attribute is also a bit brittle (it depends on how JS chooses to represent wasm values), it seems to me that it is not a great idea to do further work here; we should focus instead on (a) not turning on debug code just because the console is open, (b) properly documenting why we're currently doing what we're doing (bug 1715514), and (c) reducing the memory consumption of stack maps and the time it takes to compute and allocate them (bug 1715456).

(Edit: I'm doubting myself. Why would we need a stack map for every breakpoint (== safepoint) if it is only at or near the return point that the pointer fields in the DebugFrame might become populated? It is true that a "pointer field" is introduced in the DebugFrame in bug 1628426. But this field is known to be null (really uninitialized, though there's a flag that says so) throughout almost the entire function. Generating a stack map everywhere is correct but unnecessary; the field can be ignored because it is known to be null, and null need never be updated by the GC.)

Bug 1628426 fixed a bug wherein a pointer into the JS heap from a wasm
DebugFrame was not properly traced by the GC unless the wasm return
type was a reference type. The pointer represents the wasm return
value and can be a JS pointer value in the case of multiple return
values (because these are represented as an array) or when the wasm
value is an i64 (represented as a BigInt) or a reference type (itself
a pointer).

The fix in that bug was to always emit a stack map for every safe
point when there is a DebugFrame present, ie, any time when we're
compiling code for debugging. This will have the effect of always
exposing the DebugFrame to GC, and so it will always be traced
properly.

When we're compiling code for debugging, though, there's a safe point
for every wasm breakable point, ie, for every opcode, and so there are
a lot of safe points. In effect, the change massively increases the
time and space for debug-enabled code.

However, it is only necessary to expose the frame to the GC (ie to
emit a stackmap) when the field in the DebugFrame can contain a
non-null pointer, which is only at the end of the function as it is
about to return. In effect, the old structure of the code where
endFunction() computes whether a stack map is required was correct; it
was the computation of whether the map is required that was incorrect.

This patch reverts the changes from bug 1628426 (while slightly
improving on the naming). It then ensures that a map is always
created at the end of the function when there is a DebugFrame so that
any pointer in the DebugFrame will be traced. We could further refine
that condition to exclude cases where we know the DebugFrame will not
contain a JS pointer (because the wasm return type would not lead to
the creation of a heap datum), but that optimization would be brittle
and does not seem important anyway.

This patch also further refines comments updated in bug 1715514 that
describe how tracing of the DebugFrame is handled.

We may want to try to push this into 91 because of 91esr.

Pushed by lhansen@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/f09fc260c40c
Do not emit a stack map for DebugFrames unless necessary. r=yury
Status: NEW → RESOLVED
Closed: 2 months ago
Resolution: --- → FIXED
Target Milestone: --- → 92 Branch