Bug 1639153 Comment 138 Edit History
(Some initial notes; I will edit these as more data become available.)

Base rev: mozilla-central 7c3ea3514425. I applied the latest (Jul 8) revision of [D117123](https://phabricator.services.mozilla.com/D117123) and built it with `../configure --disable-debug --enable-optimize --enable-release --enable-debug-symbols`. I also made a comparable build from the base rev.

The benchmark is the call_indirect_ubench.js attached above. I run the JS shell with `--wasm-compiler=ion` and no other switches, passing the benchmark file on the command line.

Consistently on my Xeon E5-2637 system with Fedora 33, in the JS shell, comparing with-the-patch to without-the-patch:
- "external" calls (i.e., cross-module indirect calls): running time drops by 14% (highly surprising; these ought to be slower)
- "internal" calls (i.e., same-module indirect calls): running time drops by 23% (almost precisely what Dmitry has reported)
- "direct" calls: running time increases by 6% (very surprising; these ought to be invariant)

On a Core i7 with macOS 11 (a 2018 MacBook Pro), same setup:
- "external" calls slow down slightly, but not by much (very nice)
- "internal" calls speed up by about 14% (less than on the Xeon, and less than what Dmitry reported)
- "direct" calls are roughly invariant (as desired)

On an Apple M1 with macOS 11 (M1 MacBook Pro):
- "external" calls slow down a lot: running time goes from 270 to 430, an increase of about 60%
- "internal" calls speed up by about 15% (comparable to the i7)
- "direct" calls are roughly invariant (as desired)

Another set of benchmarks: doubly-recursive fibonacci(40) using indirect calls.
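To make the shape of this second benchmark concrete, here is a small Python model. It is not the attached wasm code, and it omits everything that makes wasm indirect calls costly (bounds check, signature check, instance switching). It contrasts recursion that names the callee directly with recursion that loads its target from a table. `TABLE`, `FIB_SLOT` and `call_count` are hypothetical names used only for illustration. The call count suggests why this benchmark mostly measures call overhead: the naive recursion performs C(n) = 2·F(n+1) − 1 calls, which is about 331 million calls for fib(40).

```python
# Illustrative model only: the real benchmark is wasm compiled by Ion.
# A list of callables stands in for a wasm function table; TABLE and
# FIB_SLOT are hypothetical names.

FIB_SLOT = 0
TABLE = []

def fib_direct(n):
    # Direct style: the recursive calls name the callee directly.
    if n < 2:
        return n
    return fib_direct(n - 1) + fib_direct(n - 2)

def fib_indirect(n):
    # Indirect style: each recursive call loads its target from the table.
    if n < 2:
        return n
    f = TABLE[FIB_SLOT]
    return f(n - 1) + f(n - 2)

TABLE.append(fib_indirect)

def call_count(n):
    # Number of calls made by the doubly-recursive fib(n):
    # C(n) = 2*F(n+1) - 1, computed iteratively instead of by recursion.
    a, b = 0, 1  # F(0), F(1)
    for _ in range(n + 1):
        a, b = b, a + b
    # After n+1 steps, a == F(n+1).
    return 2 * a - 1

# Use a small n here; fib(40) via naive recursion is far too slow in Python.
print(fib_direct(20), fib_indirect(20))
print(call_count(40))
```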
There are four cases:
- DIR: direct calls
- IIP: indirect calls, intra-module, with a private table
- IIE: indirect calls, intra-module, but with a shared (exported) table
- IIX: indirect calls, inter-module (requires two modules and a shared table)

Running times (lower is better):

```
        Baseline                     Improved
        DIR   IIP   IIE   IIX        DIR   IIP   IIE   IIX
Xeon    753  1410  1425  1425        754  1070  1300  2080
i7      647  1130  1130  1140        645   836  1100  1520
M1      650  1050  1045  1045        650   770   980  1420
```

On all CPUs, in the baseline case the indirect programs all have roughly the same performance, which is what we want to see, and the direct case is much faster. Also, the direct cases have the same performance with the old and the new code, as expected and desired.

For the indirect cases with the improved code, there is a significant cost to the stubs: the IIX case takes almost twice as long as the IIP case. That said, the IIX case is "only" about 33% (i7) to 46% (Xeon) slower than the IIX case of the baseline code.

The IIE case shows off an optimization where we can't statically say that a call is intra-module but can detect it dynamically (by comparing the Tls values). This optimization is very effective and will be worthwhile *if* we think that this case will be common.
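The stub-cost comparisons can be rechecked directly from the fibonacci table. This sketch simply hard-codes the published numbers. `pct_change` is a hypothetical helper, and the units are whatever the benchmark prints.

```python
# Running times copied from the fibonacci table (lower is better).
baseline = {
    "Xeon": {"DIR": 753, "IIP": 1410, "IIE": 1425, "IIX": 1425},
    "i7":   {"DIR": 647, "IIP": 1130, "IIE": 1130, "IIX": 1140},
    "M1":   {"DIR": 650, "IIP": 1050, "IIE": 1045, "IIX": 1045},
}
improved = {
    "Xeon": {"DIR": 754, "IIP": 1070, "IIE": 1300, "IIX": 2080},
    "i7":   {"DIR": 645, "IIP":  836, "IIE": 1100, "IIX": 1520},
    "M1":   {"DIR": 650, "IIP":  770, "IIE":  980, "IIX": 1420},
}

def pct_change(new, old):
    # Percentage change in running time, rounded; positive means slower.
    return round(100 * (new - old) / old)

for cpu in baseline:
    b, i = baseline[cpu], improved[cpu]
    print(cpu,
          "IIX/IIP (improved): %.2f" % (i["IIX"] / i["IIP"]),
          "IIX vs baseline IIX: %+d%%" % pct_change(i["IIX"], b["IIX"]),
          "IIP vs baseline IIP: %+d%%" % pct_change(i["IIP"], b["IIP"]))
```

With these numbers, improved IIX is about 1.8x to 1.9x improved IIP on every CPU. It is 46% (Xeon), 33% (i7) and 36% (M1) slower than baseline IIX. The same arithmetic gives IIP a 24% to 27% reduction in running time.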
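The dynamic check behind the IIE result can be modelled at a very high level. A table slot records both the callee's code and the Tls it expects. At the call site, the caller compares that Tls with its own. If they match, the call is known to be intra-module and goes straight to the callee; otherwise it takes a slower path that switches Tls around the call.

This is a conceptual sketch under stated assumptions, not SpiderMonkey's code. `Instance`, `TableEntry`, `Caller` and `call_indirect` are hypothetical names. The real fast and slow paths are machine code emitted by Ion, and they also involve signature checks and more context switching than the single Tls swap shown here.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Instance:
    # Hypothetical stand-in for a module instance and its Tls data.
    name: str

@dataclass
class TableEntry:
    # Hypothetical table slot: the callee's code plus the Tls it expects.
    code: Callable[[int], int]
    tls: Instance

class Caller:
    def __init__(self, tls: Instance):
        self.current_tls = tls
        self.fast_calls = 0
        self.slow_calls = 0

    def call_indirect(self, table: List[TableEntry], index: int, arg: int) -> int:
        entry = table[index]
        if entry.tls is self.current_tls:
            # Same instance, detected dynamically: call with no context switch.
            self.fast_calls += 1
            return entry.code(arg)
        # Different instance: switch Tls for the call, then restore it.
        self.slow_calls += 1
        saved = self.current_tls
        self.current_tls = entry.tls
        try:
            return entry.code(arg)
        finally:
            self.current_tls = saved

a, b = Instance("A"), Instance("B")
# A shared (exported) table holding one function from each instance.
shared_table = [TableEntry(lambda x: x + 1, a), TableEntry(lambda x: x * 2, b)]
caller = Caller(a)
print(caller.call_indirect(shared_table, 0, 10))  # 11, fast path
print(caller.call_indirect(shared_table, 1, 10))  # 20, slow path
print(caller.fast_calls, caller.slow_calls)       # 1 1
print(caller.current_tls.name)                    # A
```

In this model, an IIE-style program (shared table, but every callee belongs to the caller's instance) always hits the fast path. An IIX-style program pays for the switch on every call.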