Closed Bug 1829411 Opened 6 months ago Closed 3 months ago

Ion compiled polymorphic calls are 10x slower than V8 (InvokeFunction)

Categories: Core :: JavaScript Engine: JIT, defect, P2
Status: RESOLVED FIXED in 117 Branch (firefox117: fixed)
People: (Reporter: jrmuizel, Assigned: iain)
References: (Blocks 4 open bugs)
Keywords: perf-alert; Whiteboard: [sp3]
Attachments: (10 files)

The following JS runs in:
SM: 122ms (https://share.firefox.dev/3N4c0Q3)
V8: 10ms (https://share.firefox.dev/3H4FGcc)
JSC: 6.1ms

function S(a) {this.a = a }
S.prototype.f = function() {return "fodogo"}
function R(a) {this.a = a }
R.prototype.f = (function() {return "fodogo"}).bind({a: 12})
var objs = []
var count = 100;
for (var i = 0; i < count; i++) {
        if (i % 13 == 0) {
                objs.push(new S(i));
        } else {
                objs.push(new R(i))
        }
}
function s(a) {
        return a.f();
}

function benchmark() {
        for (var c = 0; c < 10000; c++) {
                for (var i = 0; i < count; i++) {
                        s(objs[i])
                }
        }
}
var start = performance.now();
benchmark()
var end = performance.now();
print(`took ${end - start}ms`)

We end up spending lots of time in InvokeFunction:
https://searchfox.org/mozilla-central/rev/26790fecfcda622dab234b28859da721b80f3a35/js/src/jit/CodeGenerator.cpp#5558-5564

Fixing this should be worth about 2-3% on React-Stockcharts and 2% when loading the Google sheet from bug 1828961.

Whiteboard: [sp3]
Blocks: 1828961
Blocks: sm-opt-jits
No longer blocks: sm-runtime
Severity: -- → S4
Priority: -- → P2
Summary: Ion compiled polymorphic calls into C++ are 10x slower than V8 → Ion compiled polymorphic calls are 10x slower than V8

This also shows up prominently when https://github.com/WebKit/Speedometer/pull/142 is applied to Angular.

We're spending 10x as much time as V8 in:

function callHook(currentView, initPhase, arr, i) {
    const isInitHook = arr[i] < 0, hook = arr[i + 1], directive = currentView[isInitHook ? -arr[i] : arr[i]];
    if (isInitHook) {
        if (currentView[2] >> 11 < currentView[18] >> 16 && (3 & currentView[2]) === initPhase) {
            currentView[2] += 2048;
            try {
                hook.call(directive);
            } finally {}
        }
    } else try {
        hook.call(directive);
    } finally {}
}

Reducing the overhead to V8's level would get us 6.5% on this benchmark.

Unfortunately we don't have call ICs in Ion, and our fallback code (LCallGeneric) only handles plain scripted functions well. This means we have a perf cliff for anything that isn't monomorphic in Baseline and isn't just calling a scripted function, because it goes through the generic C++ call path. This is especially slow if that path has to call back into JIT code.
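To make the cliff concrete, here's a hedged, illustrative sketch (names are invented, not from the engine) of a polymorphic call site that mixes the three callee kinds discussed in this bug: a plain scripted function, a bound function, and a native. Before the trampoline work, only the scripted case stayed on the fast jit-to-jit path; the others fell back to InvokeFunction in C++.

```javascript
// Illustrative only: one megamorphic call site, three kinds of callees.
function scripted() { return 1; }                            // plain scripted function
const bound = function () { return this.x; }.bind({ x: 2 }); // bound function
const native = Number;                                       // native; Number() === 0

function callAll(fns) {
  let sum = 0;
  for (const f of fns) {
    // Polymorphic call site: sees scripted, bound, and native callees,
    // so Ion's generic call path must handle all three.
    sum += f();
  }
  return sum;
}

console.log(callAll([scripted, bound, native])); // → 3
```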

The simplest thing short-term might be to have a call trampoline where we can support scripted functions, native functions, bound functions, maybe even call/apply, etc.

A trampoline sounds like a good approach. Like you said, the main thing to avoid is the call from C++ back into JIT.

(In reply to Jan de Mooij [:jandem] from comment #2)

> scripted functions, native functions, bound functions, maybe even call/apply, etc.

I agree that all of these are important, in decreasing order of priority, except that I would put "native functions" at the end of the list. Anyway, I think each of these would result in a speedup on their own, so they can probably be landed in parts.
I can try to create a breakdown of the InvokeFunction overhead by function type, if it would be useful. My intuition is that most InvokeFunction-overhead samples which have CallJSNative in them also have one of fun_apply / fun_call / BoundFunctionObject::call in them, and the ones that don't (like calling into DOM functions) have to go into C++ anyway.

(In reply to Markus Stange [:mstange] from comment #3)

> I can try to create a breakdown of the InvokeFunction overhead by function type, if it would be useful.

Yes please, that would be very useful to determine which cases we want to support first. I agree that for natives the overhead for the InvokeFunction call path is less important and we should focus on the JIT-ABI calls first.

This trampoline would also be good to have for Wasm => JS calls, it's basically the same situation.

Here's a profile of a full sp3 run, focused on InvokeFunction, with Trampoline: EnterJIT and MaybeEnterInterpreterTrampoline samples removed: https://share.firefox.dev/44DDHpr

I think it shows that bound functions are the most important case, followed by apply and then call. Scripted functions seem to be very low on the list (I think this is the js::InternalCallOrConstruct -> js::RunScript path?). EventTarget.dispatch and HTMLElement.click need to go into C++ anyway.

Here's the profile narrowed further down after removing samples in the two DOM calls, samples in ensureHasJitScript and BaselineCompile, and samples in the various DOM constructors: https://share.firefox.dev/3VFuTeF
It's mostly CallJSNative, which has a 61% / 14% / 5% breakdown into bound functions, apply, and call.

Summary: Ion compiled polymorphic calls are 10x slower than V8 → Ion compiled polymorphic calls are 10x slower than V8 (InvokeFunction)
Assignee: nobody → iireland

This patch implements the initial trampoline, with the same coverage as the existing code (inline support for functions with a jit entry, InvokeFunction for everything else).

Depends on D182945

We already use this to make the simulator work with CallAnyNative in baseline ICs. We can reuse the same mechanism for Ion.

Depends on D182947

Depends on D182949

We need to pick the second scratch register for arm32 carefully here to avoid clobbering lr or one of the arguments.

To avoid changing the second scratch register for trampolines generated after this one, I added an RAII type, then rewrote some existing users. I left js::jit::InitMacroAssemblerForICStub unchanged because it isn't a natural fit for RAII.

Depends on D182950

Depends on D182951

I didn't manage to write a test that failed on my laptop. Even a 32-bit x86 build let me push 499K bound arguments a few dozen times recursively without complaint before throwing an over-recursion error.

This is the most precise way to implement the check. Given that JIT_ARGS_LENGTH_MAX includes a significant safety margin, we could instead consider replacing it with something like:

// Reject bound functions with more than MaxBoundArgs bound arguments.
const uint32_t MaxBoundArgs = 10;
const uint32_t MaxFlags = (MaxBoundArgs + 1) << BoundFunctionObject::NumBoundArgsShift;
masm.branch32(Assembler::AboveOrEqual, flagsSlot, Imm32(MaxFlags), vmCall);

Since the caller already guarantees that argc <= JIT_ARGS_LENGTH_MAX, this would guarantee that the new argc will be <= JIT_ARGS_LENGTH_MAX + MaxBoundArgs, which should still comfortably fit within 4K.
(Also, since we are pushing each argument one at a time, there's no actual risk of skipping past the guard page on Windows no matter how many bound arguments there are.)
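The arithmetic behind the proposed cheaper guard can be sketched as follows. This is a model, not engine code: NUM_BOUND_ARGS_SHIFT stands in for BoundFunctionObject::NumBoundArgsShift, and the JIT_ARGS_LENGTH_MAX value here is an illustrative assumption; both real values live in SpiderMonkey.

```javascript
// Assumed stand-ins for engine constants (illustrative values only).
const NUM_BOUND_ARGS_SHIFT = 16;
const JIT_ARGS_LENGTH_MAX = 4000;
const MaxBoundArgs = 10;
const MaxFlags = (MaxBoundArgs + 1) << NUM_BOUND_ARGS_SHIFT;

// The flags word encodes numBoundArgs in its high bits, so a single
// unsigned compare (flags >= MaxFlags → take the VM call) rejects any
// function with more than MaxBoundArgs bound arguments, regardless of
// the low flag bits.
function passesGuard(numBoundArgs, otherFlagBits = 0) {
  const flags = (numBoundArgs << NUM_BOUND_ARGS_SHIFT) | otherFlagBits;
  return flags < MaxFlags;
}

// Since the caller guarantees argc <= JIT_ARGS_LENGTH_MAX, the new argc
// after prepending bound args is bounded by:
const maxNewArgc = JIT_ARGS_LENGTH_MAX + MaxBoundArgs;
console.log(passesGuard(10), passesGuard(11), maxNewArgc); // → true false 4010
```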

Depends on D182952

Pushed by iireland@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/22c619c76135
Add empty trampoline stubs r=jandem
https://hg.mozilla.org/integration/autoland/rev/577a38f77e43
Support jit calls with generic trampoline r=jandem
https://hg.mozilla.org/integration/autoland/rev/bd4c301a38f4
Support bound function calls with generic trampoline r=jandem
https://hg.mozilla.org/integration/autoland/rev/7be12e98068e
Add RedirectedCallAnyNative r=jandem
https://hg.mozilla.org/integration/autoland/rev/e064d9dfa37b
Support native calls with generic trampoline r=jandem
https://hg.mozilla.org/integration/autoland/rev/03ce8d2f3e2f
Support fun_call in generic trampoline r=jandem
https://hg.mozilla.org/integration/autoland/rev/582f22b99bc8
Add AutoNonDefaultSecondScratchRegister r=jandem
https://hg.mozilla.org/integration/autoland/rev/fd08a16ebe46
Add tests r=jandem
https://hg.mozilla.org/integration/autoland/rev/838d0c98acb1
Don't exceed JIT_ARGS_LENGTH_MAX r=jandem
Attached file testcase

(This is just the HTML equivalent of the shell testcase from comment 0, with adjusted numbers, so that it's easier to run. I'm attaching it for completeness' sake.)

I tried to reproduce the gbemu regression in the shell on Linux and in the browser on Windows. I couldn't reliably see the 5% difference.

On closer examination of gbemu, the difference is pretty clear after all. The extra time is spent in the self time of the IonGenericCall trampoline, on functions which had direct Ion-to-Ion calls before the patches in this bug.
Before: https://share.firefox.dev/44lj94A , after: https://share.firefox.dev/44mqLUp

I believe the problem in gbemu is this line, which is calling into individual opcode handler functions to emulate various instructions. We've added a tiny bit of overhead to the jit-to-jit path by jumping into the trampoline at all. This looks like the absolute worst case: there are a huge number of calls, and the functions we're calling are very small, so we're mostly measuring call overhead.

There isn't really anything we can do about this, and I don't care at all about gbemu, so I think we just ignore this regression and go on with our lives. The improvements elsewhere more than make up for it.

This nicely improved a couple of DevTools tests as well 👍 Thanks Iain!

== Change summary for alert #38992 (as of Tue, 11 Jul 2023 13:39:08 GMT) ==

Improvements:

Ratio Test Platform Options Absolute values (old vs new)
14% damp source-map.constructor.DAMP macosx1015-64-shippable-qr e10s fission stylo webrender 119.47 -> 102.96
14% damp source-map.constructor.DAMP macosx1015-64-shippable-qr e10s fission stylo webrender-sw 120.15 -> 103.86
12% damp source-map.constructor.DAMP windows10-64-shippable-qr e10s fission stylo webrender-sw 117.23 -> 103.18
11% damp source-map.constructor.DAMP windows10-64-shippable-qr e10s fission stylo webrender 119.57 -> 106.72
8% damp source-map.constructor.DAMP linux1804-64-shippable-qr e10s fission stylo webrender-sw 146.82 -> 134.97
... ... ... ... ...
4% damp custom.jsdebugger.pretty-print.DAMP linux1804-64-shippable-qr e10s fission stylo webrender-sw 398.57 -> 382.87

For up to date results, see: https://treeherder.mozilla.org/perfherder/alerts?id=38992

== Change summary for alert #38999 (as of Wed, 12 Jul 2023 23:30:14 GMT) ==

Improvements:

Ratio Test Platform Options Absolute values (old vs new) Performance Profiles
10% ares6 macosx1015-64-shippable-qr fission webrender 34.05 -> 30.72 Before/After
9% ares6 windows10-64-shippable-qr fission webrender 32.18 -> 29.16 Before/After
7% ares6 linux1804-64-shippable-qr fission webrender 57.35 -> 53.12 Before/After

For up to date results, see: https://treeherder.mozilla.org/perfherder/alerts?id=38999

Blocks: 1844400