Closed Bug 1829411 Opened 6 months ago Closed 3 months ago

Ion compiled polymorphic calls are 10x slower than V8 (InvokeFunction)

Categories: Core :: JavaScript Engine: JIT, defect, P2
Status: RESOLVED FIXED in 117 Branch (firefox117: fixed)
People: (Reporter: jrmuizel, Assigned: iain)
References: (Blocks 4 open bugs)
Keywords: perf-alert; Whiteboard: [sp3]
Attachments: (10 files)

The following JS runs in:
SM: 122ms (https://share.firefox.dev/3N4c0Q3)
V8: 10ms (https://share.firefox.dev/3H4FGcc)
JSC: 6.1ms

function S(a) {this.a = a }
S.prototype.f = function() {return "fodogo"}
function R(a) {this.a = a }
R.prototype.f = (function() {return "fodogo"}).bind({a: 12})
var objs = []
var count = 100;
for (var i = 0; i < count; i++) {
        if (i % 13 == 0) {
                objs.push(new S(i));
        } else {
                objs.push(new R(i))
        }
}
function s(a) {
        return a.f();
}

function benchmark() {
        for (var c = 0; c < 10000; c++) {
                for (var i = 0; i < count; i++) {
                        s(objs[i])
                }
        }
}
var start = performance.now();
benchmark()
var end = performance.now();
print(`took ${end - start}ms`)

We end up spending lots of time in InvokeFunction:
https://searchfox.org/mozilla-central/rev/26790fecfcda622dab234b28859da721b80f3a35/js/src/jit/CodeGenerator.cpp#5558-5564

Fixing this should be worth about 2-3% on React-Stockcharts and 2% when loading the Google sheet from bug 1828961.

Whiteboard: [sp3]
Blocks: 1828961
Blocks: sm-opt-jits
No longer blocks: sm-runtime
Severity: -- → S4
Priority: -- → P2
Summary: Ion compiled polymorphic calls into C++ are 10x slower than V8 → Ion compiled polymorphic calls are 10x slower than V8

This also shows up prominently when https://github.com/WebKit/Speedometer/pull/142 is applied to Angular.

We're spending 10x as much time as V8 in:

function callHook(currentView, initPhase, arr, i) {
    const isInitHook = arr[i] < 0, hook = arr[i + 1], directive = currentView[isInitHook ? -arr[i] : arr[i]];
    if (isInitHook) {
        if (currentView[2] >> 11 < currentView[18] >> 16 && (3 & currentView[2]) === initPhase) {
            currentView[2] += 2048;
            try {
                hook.call(directive);
            } finally {}
        }
    } else try {
        hook.call(directive);
    } finally {}
}

Reducing the overhead to V8's level would get us 6.5% on this benchmark.

Unfortunately we don't have call ICs in Ion, and our fallback code (LCallGeneric) only handles plain scripted functions well. This means we have a perf cliff for anything that isn't monomorphic in Baseline and isn't just calling a scripted function, because it goes through the generic C++ call path. This is especially slow if that path has to call back into JIT code.
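To make the cliff concrete, here's a hedged, illustrative sketch (names are invented, not from the engine) of a polymorphic call site that mixes the three callee kinds discussed in this bug: a plain scripted function, a bound function, and a native. Before the trampoline work, only the scripted case stayed on the fast jit-to-jit path; the others fell back to InvokeFunction in C++.

```javascript
// Illustrative only: one megamorphic call site, three kinds of callees.
function scripted() { return 1; }                            // plain scripted function
const bound = function () { return this.x; }.bind({ x: 2 }); // bound function
const native = Number;                                       // native; Number() === 0

function callAll(fns) {
  let sum = 0;
  for (const f of fns) {
    // Polymorphic call site: sees scripted, bound, and native callees,
    // so Ion's generic call path must handle all three.
    sum += f();
  }
  return sum;
}

console.log(callAll([scripted, bound, native])); // → 3
```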

The simplest thing short-term might be to have a call trampoline where we can support scripted functions, native functions, bound functions, maybe even call/apply, etc.

A trampoline sounds like a good approach. Like you said, the main thing to avoid is the call from C++ back into JIT.

(In reply to Jan de Mooij [:jandem] from comment #2)

> scripted functions, native functions, bound functions, maybe even call/apply, etc.

I agree that all of these are important, in decreasing order of priority, except that I would put "native functions" at the end of the list. Anyway, I think each of these would result in a speedup on their own, so they can probably be landed in parts.
I can try to create a breakdown of the InvokeFunction overhead by function type, if it would be useful. My intuition is that most InvokeFunction-overhead samples which have CallJSNative in them also have one of fun_apply / fun_call / BoundFunctionObject::call in them, and the ones that don't (like calling into DOM functions) have to go into C++ anyway.

(In reply to Markus Stange [:mstange] from comment #3)

> I can try to create a breakdown of the InvokeFunction overhead by function type, if it would be useful.

Yes please, that would be very useful to determine which cases we want to support first. I agree that for natives the overhead for the InvokeFunction call path is less important and we should focus on the JIT-ABI calls first.

This trampoline would also be good to have for Wasm => JS calls, it's basically the same situation.

Here's a profile of a full sp3 run, focused on InvokeFunction, with Trampoline: EnterJIT and MaybeEnterInterpreterTrampoline samples removed: https://share.firefox.dev/44DDHpr

I think it shows that bound functions are the most important case, followed by apply and then call. Scripted functions seem to be very low on the list (I think this is the js::InternalCallOrConstruct -> js::RunScript path?). EventTarget.dispatch and HTMLElement.click need to go into C++ anyway.

Here's the profile narrowed further down after removing samples in the two DOM calls, samples in ensureHasJitScript and BaselineCompile, and samples in the various DOM constructors: https://share.firefox.dev/3VFuTeF
It's mostly CallJSNative, which has a 61% / 14% / 5% breakdown into bound functions, apply, and call.

Summary: Ion compiled polymorphic calls are 10x slower than V8 → Ion compiled polymorphic calls are 10x slower than V8 (InvokeFunction)
Assignee: nobody → iireland

This patch implements the initial trampoline, with the same coverage as the existing code (inline support for functions with a jit entry, InvokeFunction for everything else).

Depends on D182945

We already use this to make the simulator work with CallAnyNative in baseline ICs. We can reuse the same mechanism for Ion.

Depends on D182947

Depends on D182949

We need to pick the second scratch register for arm32 carefully here to avoid clobbering lr or one of the arguments.

To avoid changing the second scratch register for trampolines generated after this one, I added an RAII type, then rewrote some existing users. I left js::jit::InitMacroAssemblerForICStub unchanged because it isn't a natural fit for RAII.

Depends on D182950

Depends on D182951

I didn't manage to write a test that failed on my laptop. Even a 32-bit x86 build let me push 499K bound arguments a few dozen times recursively without complaint before throwing an over-recursion error.

This is the most precise way to implement the check. Given that JIT_ARGS_LENGTH_MAX includes a significant safety margin, we could instead consider replacing it with something like:

// Reject bound functions with more than MaxBoundArgs bound arguments.
const uint32_t MaxBoundArgs = 10;
const uint32_t MaxFlags = (MaxBoundArgs + 1) << BoundFunctionObject::NumBoundArgsShift;
masm.branch32(Assembler::AboveOrEqual, flagsSlot, Imm32(MaxFlags), vmCall);

Since the caller already guarantees that argc <= JIT_ARGS_LENGTH_MAX, this would guarantee that the new argc will be <= JIT_ARGS_LENGTH_MAX + MaxBoundArgs, which should still comfortably fit within 4K.
(Also, since we are pushing each argument one at a time, there's no actual risk of skipping past the guard page on Windows no matter how many bound arguments there are.)
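The arithmetic behind the proposed cheaper guard can be sketched as follows. This is a model, not engine code: NUM_BOUND_ARGS_SHIFT stands in for BoundFunctionObject::NumBoundArgsShift, and the JIT_ARGS_LENGTH_MAX value here is an illustrative assumption; both real values live in SpiderMonkey.

```javascript
// Assumed stand-ins for engine constants (illustrative values only).
const NUM_BOUND_ARGS_SHIFT = 16;
const JIT_ARGS_LENGTH_MAX = 4000;
const MaxBoundArgs = 10;
const MaxFlags = (MaxBoundArgs + 1) << NUM_BOUND_ARGS_SHIFT;

// The flags word encodes numBoundArgs in its high bits, so a single
// unsigned compare (flags >= MaxFlags → take the VM call) rejects any
// function with more than MaxBoundArgs bound arguments, regardless of
// the low flag bits.
function passesGuard(numBoundArgs, otherFlagBits = 0) {
  const flags = (numBoundArgs << NUM_BOUND_ARGS_SHIFT) | otherFlagBits;
  return flags < MaxFlags;
}

// Since the caller guarantees argc <= JIT_ARGS_LENGTH_MAX, the new argc
// after prepending bound args is bounded by:
const maxNewArgc = JIT_ARGS_LENGTH_MAX + MaxBoundArgs;
console.log(passesGuard(10), passesGuard(11), maxNewArgc); // → true false 4010
```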

Depends on D182952

Pushed by iireland@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/22c619c76135
Add empty trampoline stubs r=jandem
https://hg.mozilla.org/integration/autoland/rev/577a38f77e43
Support jit calls with generic trampoline r=jandem
https://hg.mozilla.org/integration/autoland/rev/bd4c301a38f4
Support bound function calls with generic trampoline r=jandem
https://hg.mozilla.org/integration/autoland/rev/7be12e98068e
Add RedirectedCallAnyNative r=jandem
https://hg.mozilla.org/integration/autoland/rev/e064d9dfa37b
Support native calls with generic trampoline r=jandem
https://hg.mozilla.org/integration/autoland/rev/03ce8d2f3e2f
Support fun_call in generic trampoline r=jandem
https://hg.mozilla.org/integration/autoland/rev/582f22b99bc8
Add AutoNonDefaultSecondScratchRegister r=jandem
https://hg.mozilla.org/integration/autoland/rev/fd08a16ebe46
Add tests r=jandem
https://hg.mozilla.org/integration/autoland/rev/838d0c98acb1
Don't exceed JIT_ARGS_LENGTH_MAX r=jandem
Attached file testcase

(This is just the HTML equivalent of the shell testcase from comment 0, with adjusted numbers, so that it's easier to run. I'm attaching it for completeness' sake.)

I tried to reproduce the gbemu regression in the shell on Linux and in the browser on Windows. I couldn't reliably see the 5% difference.

On closer examination of gbemu, the difference is pretty clear after all. The extra time is spent in the self time of the IonGenericCall trampoline, on functions which had direct Ion-to-Ion calls before the patches in this bug.
Before: https://share.firefox.dev/44lj94A , after: https://share.firefox.dev/44mqLUp

I believe the problem in gbemu is this line, which is calling into individual opcode handler functions to emulate various instructions. We've added a tiny bit of overhead to the jit-to-jit path by jumping into the trampoline at all. This looks like the absolute worst case: there are a huge number of calls, and the functions we're calling are very small, so we're mostly measuring call overhead.

There isn't really anything we can do about this, and I don't care at all about gbemu, so I think we just ignore this regression and go on with our lives. The improvements elsewhere more than make up for it.

This nicely improved a couple of DevTools tests as well 👍 Thanks Iain!

== Change summary for alert #38992 (as of Tue, 11 Jul 2023 13:39:08 GMT) ==

Improvements:

Ratio Test Platform Options Absolute values (old vs new)
14% damp source-map.constructor.DAMP macosx1015-64-shippable-qr e10s fission stylo webrender 119.47 -> 102.96
14% damp source-map.constructor.DAMP macosx1015-64-shippable-qr e10s fission stylo webrender-sw 120.15 -> 103.86
12% damp source-map.constructor.DAMP windows10-64-shippable-qr e10s fission stylo webrender-sw 117.23 -> 103.18
11% damp source-map.constructor.DAMP windows10-64-shippable-qr e10s fission stylo webrender 119.57 -> 106.72
8% damp source-map.constructor.DAMP linux1804-64-shippable-qr e10s fission stylo webrender-sw 146.82 -> 134.97
... ... ... ... ...
4% damp custom.jsdebugger.pretty-print.DAMP linux1804-64-shippable-qr e10s fission stylo webrender-sw 398.57 -> 382.87

For up to date results, see: https://treeherder.mozilla.org/perfherder/alerts?id=38992

== Change summary for alert #38999 (as of Wed, 12 Jul 2023 23:30:14 GMT) ==

Improvements:

Ratio Test Platform Options Absolute values (old vs new) Performance Profiles
10% ares6 macosx1015-64-shippable-qr fission webrender 34.05 -> 30.72 Before/After
9% ares6 windows10-64-shippable-qr fission webrender 32.18 -> 29.16 Before/After
7% ares6 linux1804-64-shippable-qr fission webrender 57.35 -> 53.12 Before/After

For up to date results, see: https://treeherder.mozilla.org/perfherder/alerts?id=38999

Blocks: 1844400