Ion compiled polymorphic calls are 10x slower than V8 (InvokeFunction)
Categories
(Core :: JavaScript Engine: JIT, defect, P2)
Tracking
Release | Tracking | Status |
---|---|---|
firefox117 | --- | fixed |
People
(Reporter: jrmuizel, Assigned: iain)
References
(Blocks 4 open bugs)
Details
(Keywords: perf-alert, Whiteboard: [sp3])
Attachments
(10 files)
9 × text/x-phabricator-request (48 bytes each)
1 × text/html (923 bytes)
The following JS runs in:
SM: 122ms (https://share.firefox.dev/3N4c0Q3)
V8: 10ms (https://share.firefox.dev/3H4FGcc)
JSC: 6.1ms
function S(a) { this.a = a }
S.prototype.f = function() { return "fodogo" }
function R(a) { this.a = a }
R.prototype.f = (function() { return "fodogo" }).bind({a: 12})

var objs = []
var count = 100;
for (var i = 0; i < count; i++) {
  if (i % 13 == 0) {
    objs.push(new S(i));
  } else {
    objs.push(new R(i))
  }
}

function s(a) {
  return a.f();
}

function benchmark() {
  for (var c = 0; c < 10000; c++) {
    for (var i = 0; i < count; i++) {
      s(objs[i])
    }
  }
}

var start = performance.now();
benchmark()
var end = performance.now();
print(`took ${end - start}ms`)
We end up spending lots of time in InvokeFunction:
https://searchfox.org/mozilla-central/rev/26790fecfcda622dab234b28859da721b80f3a35/js/src/jit/CodeGenerator.cpp#5558-5564
Fixing this should be worth about 2-3% on React-Stockcharts and 2% when loading the Google sheet from bug 1828961.
Comment 1•1 year ago (Reporter)
This also shows up prominently when https://github.com/WebKit/Speedometer/pull/142 is applied to Angular.
We're spending 10x as much time as V8 in:
function callHook(currentView, initPhase, arr, i) {
  const isInitHook = arr[i] < 0, hook = arr[i + 1], directive = currentView[isInitHook ? -arr[i] : arr[i]];
  if (isInitHook) {
    if (currentView[2] >> 11 < currentView[18] >> 16 && (3 & currentView[2]) === initPhase) {
      currentView[2] += 2048;
      try {
        hook.call(directive);
      } finally {}
    }
  } else try {
    hook.call(directive);
  } finally {}
}
Reducing the overhead to V8's level would get us 6.5% on this benchmark.
Comment 2•1 year ago
Unfortunately we don't have call ICs in Ion and our fallback code (LCallGeneric) only handles plain scripted functions well. This means that we have a perf cliff for anything that isn't monomorphic in Baseline and isn't just calling a scripted function, because it goes through the generic C++ call path. This is especially slow if it has to call into JIT code again.
The simplest thing short-term might be to have a call trampoline where we can support scripted functions, native functions, bound functions, maybe even call/apply, etc.
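To make the idea concrete, here is a minimal, self-contained C++ sketch of the case analysis such a trampoline could perform. This is a conceptual model only, not SpiderMonkey code: the enum, function names, and per-case handling are assumptions for illustration (the real trampoline would be emitted with MacroAssembler), but the overall shape matches the cases listed above.

// Conceptual model of a generic call trampoline's dispatch; all names are illustrative.
#include <cstdio>

enum class CalleeKind {
  ScriptedWithJitEntry,  // plain scripted function that already has JIT code
  BoundFunction,         // bound function wrapping some target
  FunCall,               // Function.prototype.call
  FunApply,              // Function.prototype.apply
  Native,                // C++ native
  Other                  // anything else
};

// Hypothetical dispatch: handle the common JIT-friendly cases inline and only
// fall back to the slow C++ path (InvokeFunction) for the rest.
const char* dispatch(CalleeKind kind) {
  switch (kind) {
    case CalleeKind::ScriptedWithJitEntry:
      return "jump to the callee's jit entry";
    case CalleeKind::BoundFunction:
      return "push bound args, then handle the bound target";
    case CalleeKind::FunCall:
      return "shift this/args, then handle the target";
    case CalleeKind::FunApply:
      return "expand the argument array, then handle the target";
    case CalleeKind::Native:
      return "call the native through the JIT ABI";
    default:
      return "fall back to InvokeFunction (C++)";
  }
}

int main() {
  printf("%s\n", dispatch(CalleeKind::BoundFunction));
  return 0;
}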
Comment 3•1 year ago
A trampoline sounds like a good approach. Like you said, the main thing to avoid is the call from C++ back into JIT.
(In reply to Jan de Mooij [:jandem] from comment #2)
> scripted functions, native functions, bound functions, maybe even call/apply, etc.
I agree that all of these are important, in decreasing order of priority, except that I would put "native functions" at the end of the list. Anyway, I think each of these would result in a speedup on their own, so they can probably be landed in parts.
I can try to create a breakdown of the InvokeFunction overhead by function type, if it would be useful. My intuition is that most InvokeFunction-overhead samples which have CallJSNative in them also have one of fun_apply / fun_call / BoundFunctionObject::call in them, and the ones that don't (like calling into DOM functions) have to go into C++ anyway.
Comment 4•1 year ago
(In reply to Markus Stange [:mstange] from comment #3)
> I can try to create a breakdown of the InvokeFunction overhead by function type, if it would be useful.

Yes please, that would be very useful to determine which cases we want to support first. I agree that for natives the overhead for the InvokeFunction call path is less important and we should focus on the JIT-ABI calls first.
This trampoline would also be good to have for Wasm => JS calls, it's basically the same situation.
Comment 5•1 year ago
Here's a profile of a full sp3 run, focused on InvokeFunction, with Trampoline: EnterJIT and MaybeEnterInterpreterTrampoline samples removed: https://share.firefox.dev/44DDHpr

I think it shows that bound functions are the most important case, followed by apply and then call. Scripted functions seem to be very low on the list (I think this is the js::InternalCallOrConstruct -> js::RunScript path?). EventTarget.dispatch and HTMLElement.click need to go into C++ anyway.

Here's the profile narrowed down further, after removing samples in the two DOM calls, samples in ensureHasJitScript and BaselineCompile, and samples in the various DOM constructors: https://share.firefox.dev/3VFuTeF

It's mostly CallJSNative, which has a 61% / 14% / 5% breakdown into bound functions, apply, and call.
Comment 6•1 year ago (Assignee)
Comment 7•1 year ago (Assignee)
This patch implements the initial trampoline, with the same coverage as the existing code (inline support for functions with a jit entry, InvokeFunction for everything else).
Depends on D182945
Comment 8•1 year ago (Assignee)
Depends on D182946
Comment 9•1 year ago (Assignee)
We already use this to make the simulator work with CallAnyNative in baseline ICs. We can reuse the same mechanism for Ion.
Depends on D182947
Comment 10•1 year ago (Assignee)
Depends on D182948
Comment 11•1 year ago (Assignee)
Depends on D182949
Comment 12•1 year ago (Assignee)
We need to pick the second scratch register for arm32 carefully here to avoid clobbering lr or one of the arguments.

To avoid changing the second scratch register for trampolines generated after this one, I added an RAII type, then rewrote some existing users. I left js::jit::InitMacroAssemblerForICStub unchanged because it isn't a natural fit for RAII.
Depends on D182950
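For illustration, a minimal, self-contained sketch of the RAII idea described above, assuming the second scratch register is plain mutable state on a MacroAssembler-like object. FakeMasm, AutoSecondScratchRegister, and the register values below are made up for this sketch; only the save-in-constructor / restore-in-destructor pattern mirrors what the real AutoNonDefaultSecondScratchRegister is described as doing.

// Self-contained sketch; none of these types are the real SpiderMonkey API.
#include <cassert>

using Register = int;                       // stand-in for a real register type
constexpr Register DefaultSecondScratch = 1;

struct FakeMasm {                           // stand-in for MacroAssembler state
  Register secondScratch = DefaultSecondScratch;
};

class AutoSecondScratchRegister {           // hypothetical RAII guard
  FakeMasm& masm_;
  Register saved_;
 public:
  AutoSecondScratchRegister(FakeMasm& masm, Register reg)
      : masm_(masm), saved_(masm.secondScratch) {
    masm_.secondScratch = reg;              // pick a register that won't clobber lr or args
  }
  ~AutoSecondScratchRegister() {
    masm_.secondScratch = saved_;           // later trampolines see the default again
  }
};

int main() {
  FakeMasm masm;
  {
    AutoSecondScratchRegister guard(masm, /*reg=*/7);
    assert(masm.secondScratch == 7);        // trampoline generated with the new scratch
  }
  assert(masm.secondScratch == DefaultSecondScratch);  // restored on scope exit
  return 0;
}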
Comment 13•1 year ago (Assignee)
Depends on D182951
Comment 14•1 year ago (Assignee)
I didn't manage to write a test that failed on my laptop. Even a 32-bit x86 build let me push 499K bound arguments a few dozen times recursively without complaining, and then threw over-recursed.
This is the most precise way to implement the check. Given that JIT_ARGS_LENGTH_MAX includes a significant safety margin, we could instead consider replacing it with something like:
const uint32_t MaxBoundArgs = 10;
const uint32_t MaxFlags = (MaxBoundArgs + 1) << BoundFunctionObject::NumBoundArgsShift;
masm.branch32(Assembler::AboveOrEqual, flagsSlot, Imm32(MaxFlags), vmCall);
Since the caller already guarantees that argc <= JIT_ARGS_LENGTH_MAX, this would guarantee that the new argc will be <= JIT_ARGS_LENGTH_MAX + MaxBoundArgs, which should still comfortably fit within 4K.
(Also, since we are pushing each argument one at a time, there's no actual risk of skipping past the guard page on Windows no matter how many bound arguments there are.)
Depends on D182952
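For illustration, here is a self-contained sketch of what that branch would guarantee, assuming (as the snippet above implies) that the bound-argument count is stored in the flags word above NumBoundArgsShift. The shift value below is made up; only the comparison logic is the point.

// Sketch of the flags comparison; the layout constants are assumptions.
#include <cassert>
#include <cstdint>

constexpr uint32_t NumBoundArgsShift = 16;  // illustrative value, not the real layout
constexpr uint32_t MaxBoundArgs = 10;
constexpr uint32_t MaxFlags = (MaxBoundArgs + 1) << NumBoundArgsShift;

// Mirrors branch32(AboveOrEqual, flags, MaxFlags, vmCall): flags >= MaxFlags
// jumps to the VM call, so the inline path only ever sees <= MaxBoundArgs
// bound arguments, and argc can grow by at most MaxBoundArgs.
bool takesInlinePath(uint32_t flags) { return flags < MaxFlags; }

int main() {
  assert(takesInlinePath(3u << NumBoundArgsShift));    // 3 bound args: inline path
  assert(!takesInlinePath(11u << NumBoundArgsShift));  // 11 bound args: VM call
  return 0;
}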
Comment 15•1 year ago
Pushed by iireland@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/22c619c76135 Add empty trampoline stubs r=jandem
https://hg.mozilla.org/integration/autoland/rev/577a38f77e43 Support jit calls with generic trampoline r=jandem
https://hg.mozilla.org/integration/autoland/rev/bd4c301a38f4 Support bound function calls with generic trampoline r=jandem
https://hg.mozilla.org/integration/autoland/rev/7be12e98068e Add RedirectedCallAnyNative r=jandem
https://hg.mozilla.org/integration/autoland/rev/e064d9dfa37b Support native calls with generic trampoline r=jandem
https://hg.mozilla.org/integration/autoland/rev/03ce8d2f3e2f Support fun_call in generic trampoline r=jandem
https://hg.mozilla.org/integration/autoland/rev/582f22b99bc8 Add AutoNonDefaultSecondScratchRegister r=jandem
https://hg.mozilla.org/integration/autoland/rev/fd08a16ebe46 Add tests r=jandem
https://hg.mozilla.org/integration/autoland/rev/838d0c98acb1 Don't exceed JIT_ARGS_LENGTH_MAX r=jandem
Comment 16•1 year ago
Comment 17•1 year ago
(This is just the HTML equivalent of the shell testcase from comment 0 (with adjusted numbers) so that it's easier to run. I'm attaching it for completeness' sake.)
Comment 18•1 year ago
bugherder:
https://hg.mozilla.org/mozilla-central/rev/22c619c76135
https://hg.mozilla.org/mozilla-central/rev/577a38f77e43
https://hg.mozilla.org/mozilla-central/rev/bd4c301a38f4
https://hg.mozilla.org/mozilla-central/rev/7be12e98068e
https://hg.mozilla.org/mozilla-central/rev/e064d9dfa37b
https://hg.mozilla.org/mozilla-central/rev/03ce8d2f3e2f
https://hg.mozilla.org/mozilla-central/rev/582f22b99bc8
https://hg.mozilla.org/mozilla-central/rev/fd08a16ebe46
https://hg.mozilla.org/mozilla-central/rev/838d0c98acb1
Comment 19•1 year ago
Some improvements:
9% on Ares6 overall, driven by 30% improvements on ML* sub-tests
Jetstream2:
48% on ML-Average
1-3% on prepack-wtb-*
6% on acorn-wtb-First
3% on babylon-wtb-First
4% on lebab-wtb-*
2-4% on prepack-wtb-* tests
Some improvements on HashSet-wasm-* sub-tests
Speedometer2:
1-3% on EmberJS-Debug-TodoMVC-* sub-tests
1-3% on React-Redux-TodoMVC sub-tests
https://treeherder.mozilla.org/perfherder/graphs?highlightAlerts=1&highlightChangelogData=1&highlightCommonAlerts=0&selected=3912916,1724463917&series=mozilla-central,3735871,1,13&series=mozilla-central,3735871,1,13&series=autoland,3912916,1,13&timerange=5184000&zoom=1688760028993,1689230279721,54.72130885966385,77.43369081458626
https://treeherder.mozilla.org/perfherder/graphs?highlightAlerts=1&highlightChangelogData=1&highlightCommonAlerts=0&selected=3912918,1724463919&series=mozilla-central,3735873,1,13&series=mozilla-central,3735873,1,13&series=autoland,3912918,1,13&timerange=5184000&zoom=1688769844935,1689200909817,57.16859932979882,72.29210131316091
Regressions:
5% on gbemu-Average
Comment 20•1 year ago (Reporter)
I tried to reproduce the gbemu regression in the shell on Linux and the browser on Windows. I couldn't reliably see the 5% difference.
Comment 21•1 year ago
On closer examination of gbemu the difference is pretty clear after all. The extra time is spent in the self time of the IonGenericCall trampoline, on functions which had direct Ion-to-Ion calls before the patches in this bug.

Before: https://share.firefox.dev/44lj94A
After: https://share.firefox.dev/44mqLUp
Comment 22•1 year ago (Assignee)
I believe the problem in gbemu is this line, which is calling into individual opcode handler functions to emulate various instructions. We've added a tiny bit of overhead to the jit-to-jit path by jumping into the trampoline at all. This looks like the absolute worst case: there are a huge number of calls, and the functions we're calling are very small, so we're mostly measuring call overhead.
There isn't really anything we can do about this, and I don't care at all about gbemu, so I think we just ignore this regression and go on with our lives. The improvements elsewhere more than make up for it.
Comment 23•1 year ago
This nicely improved a couple of DevTools tests as well 👍 Thanks Iain!
== Change summary for alert #38992 (as of Tue, 11 Jul 2023 13:39:08 GMT) ==
Improvements:
Ratio | Test | Platform | Options | Absolute values (old vs new) |
---|---|---|---|---|
14% | damp source-map.constructor.DAMP | macosx1015-64-shippable-qr | e10s fission stylo webrender | 119.47 -> 102.96 |
14% | damp source-map.constructor.DAMP | macosx1015-64-shippable-qr | e10s fission stylo webrender-sw | 120.15 -> 103.86 |
12% | damp source-map.constructor.DAMP | windows10-64-shippable-qr | e10s fission stylo webrender-sw | 117.23 -> 103.18 |
11% | damp source-map.constructor.DAMP | windows10-64-shippable-qr | e10s fission stylo webrender | 119.57 -> 106.72 |
8% | damp source-map.constructor.DAMP | linux1804-64-shippable-qr | e10s fission stylo webrender-sw | 146.82 -> 134.97 |
... | ... | ... | ... | ... |
4% | damp custom.jsdebugger.pretty-print.DAMP | linux1804-64-shippable-qr | e10s fission stylo webrender-sw | 398.57 -> 382.87 |
For up to date results, see: https://treeherder.mozilla.org/perfherder/alerts?id=38992
Comment 24•1 year ago
(In reply to Pulsebot from comment #15)
> Pushed by iireland@mozilla.com:
> https://hg.mozilla.org/integration/autoland/rev/22c619c76135
> Add empty trampoline stubs r=jandem
> https://hg.mozilla.org/integration/autoland/rev/577a38f77e43
> Support jit calls with generic trampoline r=jandem
> https://hg.mozilla.org/integration/autoland/rev/bd4c301a38f4
> Support bound function calls with generic trampoline r=jandem
> https://hg.mozilla.org/integration/autoland/rev/7be12e98068e
> Add RedirectedCallAnyNative r=jandem
> https://hg.mozilla.org/integration/autoland/rev/e064d9dfa37b
> Support native calls with generic trampoline r=jandem
> https://hg.mozilla.org/integration/autoland/rev/03ce8d2f3e2f
> Support fun_call in generic trampoline r=jandem
> https://hg.mozilla.org/integration/autoland/rev/582f22b99bc8
> Add AutoNonDefaultSecondScratchRegister r=jandem
> https://hg.mozilla.org/integration/autoland/rev/fd08a16ebe46
> Add tests r=jandem
> https://hg.mozilla.org/integration/autoland/rev/838d0c98acb1
> Don't exceed JIT_ARGS_LENGTH_MAX r=jandem

== Change summary for alert #38999 (as of Wed, 12 Jul 2023 23:30:14 GMT) ==
Improvements:
Ratio | Test | Platform | Options | Absolute values (old vs new) | Performance Profiles |
---|---|---|---|---|---|
10% | ares6 | macosx1015-64-shippable-qr | fission webrender | 34.05 -> 30.72 | Before/After |
9% | ares6 | windows10-64-shippable-qr | fission webrender | 32.18 -> 29.16 | Before/After |
7% | ares6 | linux1804-64-shippable-qr | fission webrender | 57.35 -> 53.12 | Before/After |
For up to date results, see: https://treeherder.mozilla.org/perfherder/alerts?id=38999