Create fewer indirect stubs
Categories: Core :: JavaScript: WebAssembly, enhancement, P3
People: Reporter: lth, Assigned: lth
References: Blocks 2 open bugs
Attachments: 2 files, 8 obsolete files
For the indirect stubs optimization we must create an indirect stub for a function F stored into a table if F is going to be called from another function G that is not in the same instance as F (because the call must switch to F's instance). If this is not going to happen then no stub is needed, and the same-instance call can proceed at great speed.
Obviously, whether F is going to be called from such a G is statically uncomputable, and we don't want to discover this fact at the call point - that would just be another stub to trigger stub creation, and we don't want a stub there if we can avoid it. So we have heuristics in place to decide on stub creation when F is stored into the table by a passive or active initializer, a table.set/.grow/.fill/.copy instruction, or (from JS) Table.set/.grow.
The current heuristic gives F a stub in a table in an instance if:
- the table is imported or exported
- F is imported and is stored via a passive or active initializer (i.e. via Instance::initElems)
- F is stored via wasm table.set, table.fill, table.grow, or table.copy
- F is stored via JS new Table, Table.p.set, or Table.p.grow (there is no Table.p.fill)
The third bullet should be optimizable, as it would not be necessary to create a stub if the function's instance is the same instance as the table's owning instance and the table is private.
Whether the table is imported or exported is currently a static check. But going back to earlier principles, a better heuristic for whether a function can avoid having a stub is whether the table is referenced only from a single instance and contains only functions from within that instance. This is something that can be discovered at run-time during module instantiation (when a table changes state from not-referenced to singly-referenced to multiply-referenced) or during table creation, initialization, and update (when a table changes state from has-no-functions to functions-from-one-instance to functions-from-several-instances).
Once a table enters the multiply-referenced or functions-from-several-instances states, indirect stubs could be created for all its functions. In the worst case, we get no more stubs than now and perform no more computation, but we can get better call performance for what is dynamically a single-module program, as far as the table is concerned.
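The two run-time state machines above could be modeled roughly as follows. This is an illustrative Python sketch only, not SpiderMonkey code; the class and method names are invented for the example.

```python
# Illustrative model of the proposed run-time table states; all names invented.
# A table starts not-referenced and with no functions. Stubs for all of the
# table's functions are only needed once it becomes "public": referenced by
# more than one instance, or holding functions from more than one instance.

class TableState:
    def __init__(self):
        self.referencing_instances = set()   # instances that can see the table
        self.function_instances = set()      # instances owning stored functions

    def note_instance_ref(self, instance_id):
        # Called at module instantiation when an instance gains a reference.
        self.referencing_instances.add(instance_id)

    def note_function_store(self, owner_instance_id):
        # Called at table creation/initialization/update when a function is
        # stored; we record which instance owns the stored function.
        self.function_instances.add(owner_instance_id)

    def needs_stubs(self):
        # Multiply-referenced, or functions-from-several-instances:
        # indirect stubs must be created for all the table's functions.
        return (len(self.referencing_instances) > 1
                or len(self.function_instances) > 1)

# A dynamically single-module program never upgrades the table:
t = TableState()
t.note_instance_ref("A")
t.note_function_store("A")
assert not t.needs_stubs()    # private table, same-instance functions: fast

t.note_instance_ref("B")      # a second instance imports the table
assert t.needs_stubs()        # now public: create stubs
```

In the worst case this tracking converges to the same stub set as the current heuristic; in the common single-module case it never leaves the stub-free state.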
Before doing all of this work it would be meaningful to perform a simple corpus analysis to ensure that the optimization will pay off. Some of the work must be done regardless: we must mark tables as private and then detect when they would be upgraded to public, but this does not involve rejiggering the stubs code.
Finally, if it turns out that this optimization pays off then we should reconsider whether it's worth it for the stub to contain a dynamic same-instance check. It might not be: the check slows down cross-instance calls, and it only pays for itself if many stubbed calls are dynamically to a same-instance function.
Comment 1•3 years ago
When wasm table instructions operate on a private table and store a pointer
to an instance-local function, no stub need be created, so check this case.
This missing optimization was an oversight with the initial landing of
the indirect-stubs code.
Depends on D133865
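The check this patch adds can be sketched as follows; Python standing in for the real C++, with invented predicate and parameter names.

```python
# Invented-name sketch of the stub decision for the wasm table instructions
# (table.set / table.fill / table.grow / table.copy).
def needs_indirect_stub(table_is_public, func_instance, table_instance):
    # A public table (under the current heuristic: imported or exported) may
    # be called through from another instance, so the stored function needs
    # an indirect stub that can switch instances.
    if table_is_public:
        return True
    # A private table storing one of its own instance's functions can only be
    # reached same-instance: no stub needed, call_indirect stays fast.
    return func_instance != table_instance

assert needs_indirect_stub(False, "A", "A") == False  # the optimized case
assert needs_indirect_stub(False, "B", "A") == True   # cross-instance store
assert needs_indirect_stub(True, "A", "A") == True    # public table
```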
Comment 2•3 years ago
This tidies up some of the naming in Dmitry's patch. When we create
stubs we are not interested in whether the table is imported or
exported per se, but in whether it is public (shared by multiple
instances) or not. At the moment that happens to coincide with the table being imported or exported, but we want to move away from that.
Depends on D133865
Comment 3•3 years ago
Depends on D134163
Comment 4•3 years ago
A little profiling (profiling code to appear). I instrumented Firefox and loaded a bunch of wasm content from the web. Virtually every table is imported or exported, so it will not get the full benefit of the call_indirect optimization but will get a stub. However, in no case did I see a multi-instance program, so the dynamic same-instance optimization in the stub should basically always kick in (and is completely crucial at this time). This gives the call_indirect optimization a very moderate speedup over the old code; see https://bugzilla.mozilla.org/show_bug.cgi?id=1639153#c138.
If this is all we wanted, we should have implemented the dynamic same-instance optimization inline (load target tls; compare to private tls; if same, jump to fast code, otherwise save state before the call and restore after), and we would have seen more of a speedup for that, with lower complexity and lower memory use.
However, since I never observe multi-instance programs, the better heuristic for when to create stubs really should pay off because we should not get call_indirect stubs in most cases - the table is "private" to the one instance, and the functions would remain stub-free.
Comment 6•3 years ago
This applies to the changeset before "Dmitry's patch" (i.e. the call_indirect stub optimization) landed, rev d4bd94bc7b58345d02f59b22f35ba6269d8fd2b0.
NOTE: there's a bug here; wasm/simd/ad-hack.js has a failure for a complex case involving a call_indirect to an imported function with a complicated signature (line 1631). This probably has to do with how stack arguments are handled, and I've probably cut too many corners in the present patch: caller and callee need to agree on stack arg offsets, specifically about whether space is made for the caller/callee tls.
The patch implements wasmCallIndirect by a dynamic equality test on
the tls: if the target tls is the same as the caller tls then a fast
call is made without any saving and restoring of state. This is as
fast on my machine as the table-is-private path of Dmitry's patch,
which is basically the best we can hope for, but it is that fast even
when the table has been imported or exported. No stubs are created,
no gnarly heuristics are needed.
Importantly, this does not slow down cross-instance calls much at all relative to the original code, unlike Dmitry's patch, which seriously penalizes cross-instance calls. There's a small slowdown, but it is quite acceptable, and with the null-check optimization removed from the slow path (it is not needed on the fast path because the tls is known not to be null) we'll regain that perf, and maybe more.
The main problem here relative to Dmitry's patch is that we don't have
a clean path going forward to removing the tls from the function
table, nor toward a code-only representation for funcrefs so that
call_ref can be fast.
At the same time, I think this opens up some new perspectives, because
we have better perf and less complexity here than we get with Dmitry's
patch.
Comment 7•3 years ago
(Bugs in the patch from comment 6 have been fixed, see updated commit msg in patch.)
Comment 9•3 years ago
Comment 10•3 years ago
Comment 11•3 years ago
Depends on D135331
Comment 12•3 years ago
Depends on D135332
Comment 13•3 years ago
Depends on D135337
Comment 14•3 years ago
This changes MacroAssembler::wasmCallIndirect to implement dual-path
call code for call_indirect: if the caller's tls equals the callee's
tls, no context switch will be needed and a fast call can be used,
otherwise a slow call with a context switch must be used. This speeds
up call_indirect significantly in the vast majority of cases at a
small cost in code size.
As a result of this, wasmCallIndirect has two call instructions and
therefore two safepoints, and this complication bubbles up to the
baseline compiler, the code generator, and lowering. The main issue is
that a LIR node only has one safepoint, so we must generate a second,
synthetic LIR node for the second safepoint.
Drive-by fix: the InterModule attribute in the baseline compiler is
not really about whether a call is inter-module, but about whether the
register state and realm must be restored after a call. The change to
call_indirect exposes this incorrectness: such calls may be
intermodule, but the compiler never needs to restore the register
state or the realm - the macroassembler does this, as needed, on the
slow path.
Drive-by fix: minor cleanup of the emitted code, notably, better
pointer scaling on ARM64.
Drive-by fix: remove some redundant parameters in lowering to reduce
confusion about whether a MIR node is updated for some LIR operations.
Depends on D137846
Comment 15•3 years ago
Comment on attachment 9262708 [details]
Bug 1743760 - Dual-path call code (no out-of-line code). r?rhunt
Revision D138052 was moved to bug 1754377. Setting attachment 9262708 [details] to obsolete.
Comment 16•3 years ago
Fixed with the landing of bug 1753061.