Bug 1639153 Comment 138 Edit History


(Some initial notes, I will edit these as more data become available.)

Base rev: mozilla-central 7c3ea3514425.  Applied the latest (Jul 8) rev of [D117123](https://phabricator.services.mozilla.com/D117123).  Built it with `../configure --disable-debug --enable-optimize --enable-release --enable-debug-symbols` and also made a comparable build from the base rev.

The benchmark is the call_indirect_ubench.js attached above.  I run the JS shell with `--wasm-compiler=ion`, no other switches, loading the benchmark file on the command line.

Consistently on my Xeon E5-2637 system with Fedora 33 in the JS shell, comparing with-the-patch to without-the-patch:

- "external" calls (i.e. cross-module indirect calls) drop running time by 14% (highly surprising, these ought to be slower)
- "internal" calls (i.e. same-module indirect calls) drop running time by 23% (almost precisely what Dmitry has reported)
- "direct" calls increase running time by 6% (very surprising, these ought to be invariant)

On a Core-i7 with macOS 11 (this is a 2018 MacBook Pro), same comparison:

- "external" calls slow down slightly, but really not much (very nice)
- "internal" calls speed up by about 14% (so less than on the Xeon and less than what Dmitry reported)
- "direct" calls are roughly invariant (as desired)

On an Apple M1 with macOS 11 (M1 MacBook Pro):

- "external" calls slow down by a lot (running time goes from 270 to 430, i.e. increasing by 60%)
- "internal" calls speed up by about 15% (comparable to the i7)
- "direct" calls are roughly invariant (as desired)

Another set of benchmarks.  This is doubly-recursive fibonacci(40), where the recursive calls are made in one of four ways:

- DIR: direct calls
- IIP: indirect calls, intra-module with a private table
- IIE: indirect calls, intra-module but with a shared (exported) table
- IIX: indirect calls, inter-module (requires two modules and a shared table)

```
            Baseline                        Improved
       DIR   IIP   IIE   IIX           DIR   IIP   IIE   IIX
Xeon   753  1410  1425  1425           754  1070  1300  2080
i7     647  1130  1130  1140           645   836  1100  1520
M1     650  1050  1045  1045           650   770   980  1420
```
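
To make the relative changes in the table easier to check, here is a small sketch that recomputes them.  The numbers are copied from the table; the helper names (`percentChange`, `summarize`) are mine and are not part of the benchmark.

```javascript
// Running times copied from the table above (units as reported there).
const times = {
  Xeon: { baseline: { DIR: 753, IIP: 1410, IIE: 1425, IIX: 1425 },
          improved: { DIR: 754, IIP: 1070, IIE: 1300, IIX: 2080 } },
  i7:   { baseline: { DIR: 647, IIP: 1130, IIE: 1130, IIX: 1140 },
          improved: { DIR: 645, IIP: 836,  IIE: 1100, IIX: 1520 } },
  M1:   { baseline: { DIR: 650, IIP: 1050, IIE: 1045, IIX: 1045 },
          improved: { DIR: 650, IIP: 770,  IIE: 980,  IIX: 1420 } },
};

// Percent change in running time, improved relative to baseline,
// rounded to a whole percent.  Negative means the patched build is faster.
function percentChange(baseline, improved) {
  return Math.round(((improved - baseline) / baseline) * 100);
}

// Hypothetical helper: per-case percent change for one CPU, plus the
// ratio of IIX to IIP running time in the patched build.
function summarize(cpu) {
  const { baseline, improved } = times[cpu];
  const row = {};
  for (const c of Object.keys(baseline)) {
    row[c] = percentChange(baseline[c], improved[c]);
  }
  row.IIXoverIIP = Number((improved.IIX / improved.IIP).toFixed(2));
  return row;
}

for (const cpu of Object.keys(times)) {
  console.log(cpu, JSON.stringify(summarize(cpu)));
}
```

This only restates the table as ratios; it says nothing about run-to-run variance, which the table does not report.
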
On all CPUs, the baseline indirect cases all have roughly the same performance, which is what we want to see, and the direct case is much faster.  Also, the direct cases have the same performance with the old and the new code, as expected and desired.

For the indirect cases with the improved code, we see that there's a significant cost to the stubs: the IIX case takes almost twice as long as the IIP case.  That said, going by the table, the IIX case is "only" about 33% to 46% slower than the IIX case of the baseline code (the Xeon is the worst, 2080 vs 1425).

The IIE case shows off an optimization where we can't statically say that a call is intra-module but we can detect it dynamically (by comparing the Tls values).  This optimization is very effective and will be worthwhile *if* we think that this case will be common.
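
For readers who want a picture of the dynamic check, here is a toy model in plain JS.  It is not SpiderMonkey code and it measures nothing; every name in it (`Instance`, `makeEntry`, `callIndirect`, `stats`) is hypothetical.  It only shows the shape of the idea: a shared table entry carries the callee's instance identity (standing in for the Tls value), and the caller compares it with its own before deciding between the cheap path and the context-switching path.

```javascript
// Toy model only: NOT SpiderMonkey code.  All names are hypothetical.
// An Instance stands in for a module instance; its identity plays the
// role of the per-instance Tls value.
class Instance {
  constructor(name) {
    this.name = name;
  }
}

// Counters recording which path each indirect call took.
const stats = { fast: 0, stub: 0 };

// A shared-table entry records the callee's code and its owning instance.
function makeEntry(instance, fn) {
  return { tls: instance, code: fn };
}

// Indirect call through a shared table from code running in `callerTls`.
function callIndirect(callerTls, table, index, arg) {
  const entry = table[index];
  if (entry.tls === callerTls) {
    // Same instance detected at run time: no context switch needed.
    stats.fast++;
    return entry.code(callerTls, table, arg);
  }
  // Different instance: take the slower path that switches context.
  stats.stub++;
  return entry.code(entry.tls, table, arg);
}

// Doubly-recursive fib whose recursive calls go through table slot `targetSlot`.
function makeFib(targetSlot) {
  return function fib(tls, table, n) {
    if (n < 2) return n;
    return callIndirect(tls, table, targetSlot, n - 1) +
           callIndirect(tls, table, targetSlot, n - 2);
  };
}

// Reset the counters, run fib(n) via slot 0, and report which paths were used.
function run(table, callerTls, n) {
  stats.fast = 0;
  stats.stub = 0;
  const result = callIndirect(callerTls, table, 0, n);
  return { result, fast: stats.fast, stub: stats.stub };
}

const a = new Instance("A");
const b = new Instance("B");

// IIE-like: shared table, but caller and callee are the same instance.
const sharedSameModule = [makeEntry(a, makeFib(0))];
// IIX-like: two instances that call each other through the shared table.
const sharedTwoModules = [makeEntry(a, makeFib(1)), makeEntry(b, makeFib(0))];

console.log(run(sharedSameModule, a, 20));
console.log(run(sharedTwoModules, a, 20));
```

In the first configuration every call takes the fast path; in the second, every call after the initial one crosses instances and takes the stub path.  That is the distinction between the IIE and IIX columns, modulo everything a toy like this leaves out.
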
