Closed Bug 1743760 Opened 2 years ago Closed 2 years ago

Create fewer indirect stubs

Categories

(Core :: JavaScript: WebAssembly, enhancement, P3)

enhancement

Tracking

()

RESOLVED FIXED

People

(Reporter: lth, Assigned: lth)

References

(Blocks 2 open bugs)

Details

Attachments

(2 files, 8 obsolete files)

For the indirect stubs optimization we must create an indirect stub for a function F stored into a table if F is going to be called from another function G that is not in the same instance as F (because the call must switch to F's instance). If this is not going to be happen the no stub is needed and the same-instance call can proceed at great speed.

Obviously whether F is going to be called from such a G is statically uncomputable, and we don't want to discover this fact at the call point - that would just be another stub to trigger stub creation, and we don't want a stub there if we can avoid it. So we have heuristics in place to decide on stub creation when F is stored into the table by a passive or active initializer, table.set/.grow/.fill/.copy instruction, or (from JS) Table.set/.grow.

The current heuristic gives F a stub in a table in an instance if:

  • the table is imported or exported
  • F is imported and is stored via a passive or active initializer (ie via Instance::initElems)
  • F is stored via wasm table.set, table.fill, table.grow, table.copy
  • F is stored via JS new Table, Table.p.set, Table.p.grow (there's noTable.p.fill)

The third bullet should be optimizable, as it would not be necessary to create a stub if the function's instance is the same instance as the table's owning instance and the table is private.

Whether the table is imported or exported is currently a static check. But going back to earlier princples, a better heuristic for whether a function can avoid having a stub is whether the table is referenced only from a single instance and contains only functions from within that instance. This is something that can be discovered at run-time during module instantiation (when a table changes state from not-referenced to singly-referenced to multiply-referenced) or during table creation, initialization, and update (when a table changes state from has-no-functions to functions-from-one-instance to functions-from-several-instances).

Once a table enters the multiply-referenced or functions-from-several-instances states, indirect stubs could be created for all its functions. In the worst case, we get no more stubs than now and perform no more computation, but we can get better call performance for what is dynamically a single-module program, as far as the table is concerned.

Before doing all of the work it would be meaningful to perform a simple corpus analysis to ensure that the optimization will pay off. Some of the work must be done: we must mark modules as private and then detect when they would be upgraded to public, but this does not involve rejiggering the stubs code.

Finally, if it turns out that this optimization pays off then we should reconsider whether it's worth it for the stub to contain a dynamic same-instance check. It might not, and as this check slows down cross-instance calls; it only pays for itself if many stubbed calls are dynamically to a same-instance function.

Summary: Make the "table-is-private" check dynamic rather than static → Create fewer indirect stubs
Blocks: 1744984

When wasm table instructions operate on a private table and store a pointer
to an instance-local function, no stub need be created, so check this case.
This missing optimization was an oversight with the initial landing of
the indirect-stubs code.

Depends on D133865

This tidies up some of the naming in Dmitry's patch. When we create
stubs we are not interested in whether the table is imported or
exported per se, but in whether it is public (shared by multiple
instances) or not. At the moment, that happens to be true if it were
imported or exported, but we want to move away from that.

Depends on D133865

Assignee: nobody → lhansen
Status: NEW → ASSIGNED

Depends on D134163

Keywords: leave-open

A little profiling (profiling code to appear). I instrumented Firefox and loaded a bunch of wasm content from the web. Virtually every table is imported or exported and will not get the full benefit of the call_indirect optimization, but will get a stub. However, in no case did I see multi-instance programs, so the dynamic same-instance optimization in the stub should basically always kick in (and is completely crucial at this time). This gives the call_indirect optimization a very moderate speedup over the old code, see https://bugzilla.mozilla.org/show_bug.cgi?id=1639153#c138. If this is all we wanted we should have implemented the dynamic-same-instance optimization inline (load target tls; compare to private tls; if same, jump to fast code, otherwise save state before call and restore after) and we would have seen more of a speedup (for that) and lower complexity and lower memory use.

However, since I never observe multi-instance programs, the better heuristic for when to create stubs really should pay off because we should not get call_indirect stubs in most cases - the table is "private" to the one instance, and the functions would remain stub-free.

See comment 4 on the bug for more information.

This applies to the changeset before "Dmitry's patch" (ie the
call_indirect stub optimization) landed, rev
d4bd94bc7b58345d02f59b22f35ba6269d8fd2b0.

NOTE there's a bug here, wasm/simd/ad-hack.js has a failure for a
complex case involving a call_indirect to an imported function with a
complicated signature (line 1631). This probably has to do with how
stack arguments are handled and I've probably cut too many corners in
the present patch - caller and callee need to agree on stack arg
offsets, specifically about whether space is made for caller/callee
tls.

The patch implements wasmCallIndirect by a dynamic equality test on
the tls: if the target tls is the same as the caller tls then a fast
call is made without any saving and restoring of state. This is as
fast on my machine as the table-is-private path of Dmitry's patch,
which is basically the best we can hope for, but it is that fast even
when the table has been imported or exported. No stubs are created,
no gnarly heuristics are needed.

Importantly, this does not slow down cross-instance calls much at all
relative to the original code, unlike Dmitry's patch which seriously
penalizes cross-instance calls. There's a small slowdown, but quite
acceptable, and with the null-check optimization removed from the slow
path (it is not needed on the fast path because the tls is known not
to be null) we'll regain that perf, and maybe more.

The main problem here relative to Dmitry's patch is that we don't have
a clean path going forward to removing the tls from the function
table, nor toward a code-only representation for funcrefs so that
call_ref can be fast.

At the same time, I think this opens up some new perspectives, because
we have better perf and less complexity here than we get with Dmitry's
patch.

(Bugs in the patch from comment 6 have been fixed, see updated commit msg in patch.)

Depends on D135331

Depends on D135332

Depends on D135337

See Also: → 1753061
Blocks: 1754018

This changes MacroAssembler::wasmCallIndirect to implement dual-path
call code for call_indirect: if the caller's tls equals the callee's
tls, no context switch will be needed and a fast call can be used,
otherwise a slow call with a context switch must be used. This speeds
up call_indirect significantly in the vast majority of cases at a
small cost in code size.

As a result of this, wasmCallIndirect has two call instructions and
therefore two safepoints, and this complication bubbles up to the
baseline compiler, the codegenerator, and lowering. The main issue is
that a LIR node only has one safepoint, so we must generate a second,
synthetic LIR node for the second safepoint.

Drive-by fix: the InterModule attribute in the baseline compiler is
not really about whether a call is inter-module, but about whether the
register state and realm must be restored after a call. The change to
call_indirect exposes this incorrectness: such calls may be
intermodule, but the compiler never needs to restore the register
state or the realm - the macroassembler does this, as needed, on the
slow path.

Drive-by fix: minor cleanup of the emitted code, notably, better
pointer scaling on ARM64.

Drive-by fix: remove some redundant parameters in lowering to reduce
confusion about whether a MIR node is updated for some LIR operations.

Depends on D137846

Attachment #9258018 - Attachment is obsolete: true
Attachment #9258017 - Attachment is obsolete: true
Attachment #9256319 - Attachment is obsolete: true
Attachment #9258006 - Attachment is obsolete: true
Attachment #9256341 - Attachment is obsolete: true
Attachment #9262708 - Attachment description: WIP: Bug 1743760 - Dual-path call code (no out-of-line code) → Bug 1743760 - Dual-path call code (no out-of-line code). r?rhunt

Comment on attachment 9262708 [details]
Bug 1743760 - Dual-path call code (no out-of-line code). r?rhunt

Revision D138052 was moved to bug 1754377. Setting attachment 9262708 [details] to obsolete.

Attachment #9262708 - Attachment is obsolete: true
Attachment #9255932 - Attachment is obsolete: true
Attachment #9258007 - Attachment is obsolete: true

Fixed with the landing of bug 1753061.

Status: ASSIGNED → RESOLVED
Closed: 2 years ago
Resolution: --- → FIXED
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: