Closed Bug 1743760 Opened 2 years ago Closed 2 years ago

Create fewer indirect stubs

Tracking

()

Status:

RESOLVED FIXED

People

(Reporter: lth, Assigned: lth)

References

(Blocks 2 open bugs)

Details

Attachments

(2 files, 8 obsolete files)

WIP: Bug 1743760 - Avoid indirect stub for wasm instructions 2 years ago Lars T Hansen [:lth] 48 bytes, text/x-phabricator-request		Details \| Review
Bug 1743760 - Clean up some names. r?rhunt 2 years ago Lars T Hansen [:lth] 48 bytes, text/x-phabricator-request		Details \| Review
WIP: Bug 1743760 - Notes about how to create fewer stubs 2 years ago Lars T Hansen [:lth] 48 bytes, text/x-phabricator-request		Details \| Review
WIP: Bug 1743760 - Profile table lengths and compilation types 2 years ago Lars T Hansen [:lth] 48 bytes, text/x-phabricator-request		Details \| Review
WIP: Bug 1743760 - Dynamically test for same-instance always. 2 years ago Lars T Hansen [:lth] 48 bytes, text/x-phabricator-request		Details \| Review
WIP: Bug 1743760 - Port the better bounds check code 2 years ago Lars T Hansen [:lth] 48 bytes, text/x-phabricator-request		Details \| Review
WIP: Bug 1743760 - Move slow path of call-indirect ool in Ion 2 years ago Lars T Hansen [:lth] 48 bytes, text/x-phabricator-request		Details \| Review
WIP: Bug 1743760 - Adapt hack to x86-32 and arm64. 2 years ago Lars T Hansen [:lth] 48 bytes, text/x-phabricator-request		Details \| Review
WIP: Bug 1743760 - Better scaling code on ARM64 2 years ago Lars T Hansen [:lth] 48 bytes, text/x-phabricator-request		Details \| Review
Bug 1743760 - Dual-path call code (no out-of-line code). r?rhunt 2 years ago Lars T Hansen [:lth] 48 bytes, text/x-phabricator-request		Details \| Review

Lars T Hansen [:lth]

Assignee

Description

•

2 years ago

•

Edited

For the indirect stubs optimization we must create an indirect stub for a function F stored into a table if F is going to be called from another function G that is not in the same instance as F (because the call must switch to F's instance). If this is not going to be happen the no stub is needed and the same-instance call can proceed at great speed.

Obviously whether F is going to be called from such a G is statically uncomputable, and we don't want to discover this fact at the call point - that would just be another stub to trigger stub creation, and we don't want a stub there if we can avoid it. So we have heuristics in place to decide on stub creation when F is stored into the table by a passive or active initializer, table.set/.grow/.fill/.copy instruction, or (from JS) Table.set/.grow.

The current heuristic gives F a stub in a table in an instance if:

the table is imported or exported
F is imported and is stored via a passive or active initializer (ie via Instance::initElems)
F is stored via wasm table.set, table.fill, table.grow, table.copy
F is stored via JS new Table, Table.p.set, Table.p.grow (there's noTable.p.fill)

The third bullet should be optimizable, as it would not be necessary to create a stub if the function's instance is the same instance as the table's owning instance and the table is private.

Whether the table is imported or exported is currently a static check. But going back to earlier princples, a better heuristic for whether a function can avoid having a stub is whether the table is referenced only from a single instance and contains only functions from within that instance. This is something that can be discovered at run-time during module instantiation (when a table changes state from not-referenced to singly-referenced to multiply-referenced) or during table creation, initialization, and update (when a table changes state from has-no-functions to functions-from-one-instance to functions-from-several-instances).

Once a table enters the multiply-referenced or functions-from-several-instances states, indirect stubs could be created for all its functions. In the worst case, we get no more stubs than now and perform no more computation, but we can get better call performance for what is dynamically a single-module program, as far as the table is concerned.

Before doing all of the work it would be meaningful to perform a simple corpus analysis to ensure that the optimization will pay off. Some of the work must be done: we must mark modules as private and then detect when they would be upgraded to public, but this does not involve rejiggering the stubs code.

Finally, if it turns out that this optimization pays off then we should reconsider whether it's worth it for the stub to contain a dynamic same-instance check. It might not, and as this check slows down cross-instance calls; it only pays for itself if many stubbed calls are dynamically to a same-instance function.

Lars T Hansen [:lth]

Assignee

Updated

•

2 years ago

Summary: Make the "table-is-private" check dynamic rather than static → Create fewer indirect stubs

Lars T Hansen [:lth]

Assignee

Updated

•

2 years ago

Blocks: 1744984

Lars T Hansen [:lth]

Assignee

Comment 1

•

2 years ago

Attached file WIP: Bug 1743760 - Avoid indirect stub for wasm instructions (obsolete) — Details

When wasm table instructions operate on a private table and store a pointer
to an instance-local function, no stub need be created, so check this case.
This missing optimization was an oversight with the initial landing of
the indirect-stubs code.

Depends on D133865

Lars T Hansen [:lth]

Assignee

Comment 2

•

2 years ago

Attached file Bug 1743760 - Clean up some names. r?rhunt — Details

This tidies up some of the naming in Dmitry's patch. When we create
stubs we are not interested in whether the table is imported or
exported per se, but in whether it is public (shared by multiple
instances) or not. At the moment, that happens to be true if it were
imported or exported, but we want to move away from that.

Depends on D133865

Phabricator Automation

Updated

•

2 years ago

Assignee: nobody → lhansen

Status: NEW → ASSIGNED

Lars T Hansen [:lth]

Assignee

Comment 3

•

2 years ago

Attached file WIP: Bug 1743760 - Notes about how to create fewer stubs (obsolete) — Details

Depends on D134163

Lars T Hansen [:lth]

Assignee

Updated

•

2 years ago

Keywords: leave-open

Lars T Hansen [:lth]

Assignee

Comment 4

•

2 years ago

A little profiling (profiling code to appear). I instrumented Firefox and loaded a bunch of wasm content from the web. Virtually every table is imported or exported and will not get the full benefit of the call_indirect optimization, but will get a stub. However, in no case did I see multi-instance programs, so the dynamic same-instance optimization in the stub should basically always kick in (and is completely crucial at this time). This gives the call_indirect optimization a very moderate speedup over the old code, see https://bugzilla.mozilla.org/show_bug.cgi?id=1639153#c138. If this is all we wanted we should have implemented the dynamic-same-instance optimization inline (load target tls; compare to private tls; if same, jump to fast code, otherwise save state before call and restore after) and we would have seen more of a speedup (for that) and lower complexity and lower memory use.

However, since I never observe multi-instance programs, the better heuristic for when to create stubs really should pay off because we should not get call_indirect stubs in most cases - the table is "private" to the one instance, and the functions would remain stub-free.

Lars T Hansen [:lth]

Assignee

Comment 5

•

2 years ago

Attached file WIP: Bug 1743760 - Profile table lengths and compilation types — Details

See comment 4 on the bug for more information.

Lars T Hansen [:lth]

Assignee

Comment 6

•

2 years ago

Attached file WIP: Bug 1743760 - Dynamically test for same-instance always. (obsolete) — Details

This applies to the changeset before "Dmitry's patch" (ie the
call_indirect stub optimization) landed, rev
d4bd94bc7b58345d02f59b22f35ba6269d8fd2b0.

NOTE there's a bug here, wasm/simd/ad-hack.js has a failure for a
complex case involving a call_indirect to an imported function with a
complicated signature (line 1631). This probably has to do with how
stack arguments are handled and I've probably cut too many corners in
the present patch - caller and callee need to agree on stack arg
offsets, specifically about whether space is made for caller/callee
tls.

The patch implements wasmCallIndirect by a dynamic equality test on
the tls: if the target tls is the same as the caller tls then a fast
call is made without any saving and restoring of state. This is as
fast on my machine as the table-is-private path of Dmitry's patch,
which is basically the best we can hope for, but it is that fast even
when the table has been imported or exported. No stubs are created,
no gnarly heuristics are needed.

Importantly, this does not slow down cross-instance calls much at all
relative to the original code, unlike Dmitry's patch which seriously
penalizes cross-instance calls. There's a small slowdown, but quite
acceptable, and with the null-check optimization removed from the slow
path (it is not needed on the fast path because the tls is known not
to be null) we'll regain that perf, and maybe more.

The main problem here relative to Dmitry's patch is that we don't have
a clean path going forward to removing the tls from the function
table, nor toward a code-only representation for funcrefs so that
call_ref can be fast.

At the same time, I think this opens up some new perspectives, because
we have better perf and less complexity here than we get with Dmitry's
patch.

Lars T Hansen [:lth]

Assignee

Comment 7

•

2 years ago

(Bugs in the patch from comment 6 have been fixed, see updated commit msg in patch.)

Pulsebot

Comment 8

•

2 years ago

Pushed by lhansen@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/b16a7f303241
Clean up some names. r=rhunt

Andreea Pavel [:apavel]

Comment 9

•

2 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/b16a7f303241

Lars T Hansen [:lth]

Assignee

Comment 10

•

2 years ago

Attached file WIP: Bug 1743760 - Port the better bounds check code (obsolete) — Details

Lars T Hansen [:lth]

Assignee

Comment 11

•

2 years ago

Attached file WIP: Bug 1743760 - Move slow path of call-indirect ool in Ion (obsolete) — Details

Depends on D135331

Lars T Hansen [:lth]

Assignee

Comment 12

•

2 years ago

Attached file WIP: Bug 1743760 - Adapt hack to x86-32 and arm64. (obsolete) — Details

Depends on D135332

Lars T Hansen [:lth]

Assignee

Comment 13

•

2 years ago

Attached file WIP: Bug 1743760 - Better scaling code on ARM64 (obsolete) — Details

Depends on D135337

Lars T Hansen [:lth]

Assignee

Updated

•

2 years ago

Updated

•

2 years ago

Blocks: 1754018

Lars T Hansen [:lth]

Assignee

Comment 14

•

2 years ago

Attached file Bug 1743760 - Dual-path call code (no out-of-line code). r?rhunt (obsolete) — Details

This changes MacroAssembler::wasmCallIndirect to implement dual-path
call code for call_indirect: if the caller's tls equals the callee's
tls, no context switch will be needed and a fast call can be used,
otherwise a slow call with a context switch must be used. This speeds
up call_indirect significantly in the vast majority of cases at a
small cost in code size.

As a result of this, wasmCallIndirect has two call instructions and
therefore two safepoints, and this complication bubbles up to the
baseline compiler, the codegenerator, and lowering. The main issue is
that a LIR node only has one safepoint, so we must generate a second,
synthetic LIR node for the second safepoint.

Drive-by fix: the InterModule attribute in the baseline compiler is
not really about whether a call is inter-module, but about whether the
register state and realm must be restored after a call. The change to
call_indirect exposes this incorrectness: such calls may be
intermodule, but the compiler never needs to restore the register
state or the realm - the macroassembler does this, as needed, on the
slow path.

Drive-by fix: minor cleanup of the emitted code, notably, better
pointer scaling on ARM64.

Drive-by fix: remove some redundant parameters in lowering to reduce
confusion about whether a MIR node is updated for some LIR operations.

Depends on D137846

Phabricator Automation

Updated

•

2 years ago

Attachment #9258018 - Attachment is obsolete: true

Phabricator Automation

Updated

•

2 years ago

Attachment #9258017 - Attachment is obsolete: true

Phabricator Automation

Updated

•

2 years ago

Attachment #9256319 - Attachment is obsolete: true

Phabricator Automation

Updated

•

2 years ago

Attachment #9258006 - Attachment is obsolete: true

Phabricator Automation

Updated

•

2 years ago

Attachment #9256341 - Attachment is obsolete: true

Phabricator Automation

Updated

•

2 years ago

Attachment #9262708 - Attachment description: WIP: Bug 1743760 - Dual-path call code (no out-of-line code) → Bug 1743760 - Dual-path call code (no out-of-line code). r?rhunt

Phabricator Automation

Comment 15

•

2 years ago

Comment on attachment 9262708 [details]
Bug 1743760 - Dual-path call code (no out-of-line code). r?rhunt

Revision D138052 was moved to bug 1754377. Setting attachment 9262708 [details] to obsolete.

Attachment #9262708 - Attachment is obsolete: true

Phabricator Automation

Updated

•

2 years ago

Attachment #9255932 - Attachment is obsolete: true

Phabricator Automation

Updated

•

2 years ago

Attachment #9258007 - Attachment is obsolete: true

Lars T Hansen [:lth]

Assignee

Comment 16

•

2 years ago

Fixed with the landing of bug 1753061.

Status: ASSIGNED → RESOLVED

Closed: 2 years ago

Resolution: --- → FIXED

BugBot [:suhaib / :marco/ :calixte]

Updated

•

2 years ago

Keywords: leave-open

You need to log in before you can comment on or make changes to this bug.