1616464 - [meta] Firefox’s score is lower than Safari on WASM benchmark in JetStream 2

Tetsuharu OHZEKI [:tetsuharu] (UTC+9), Reporter
Description • 5 years ago

Summary

On the WASM benchmarks in JetStream 2, Firefox’s score is lower than Safari’s, especially for startup time.

Environments

I ran the benchmarks on macOS 10.15.3 on my MacBook Pro (15-inch, 2017, 2.8 GHz Quad-Core Intel Core i7, 16 GB 2133 MHz LPDDR3).

Firefox Nightly 75.0a1
- https://hg.mozilla.org/mozilla-central/rev/df596657bebcb96b917d75ff452316bbe8140a1a
Google Chrome Canary 82.0.4061.2
Safari Technology Preview 100

Results

Overall Score

|                | Firefox | Google Chrome | Safari  |
|----------------|---------|---------------|---------|
| gcc-loops-wasm | 22.961  | 17.667        | 37.058  |
| HashSet-wasm   | 32.416  | 42.379        | 69.485  |
| quicksort-wasm | 307.729 | 353.553       | 487.950 |
| richards-wasm  | 59.732  | 52.863        | 94.830  |
| tsf-wasm       | 83.519  | 44.201        | 103.717 |

Startup Score

|                | Firefox | Google Chrome | Safari   |
|----------------|---------|---------------|----------|
| gcc-loops-wasm | 500     | 200           | 833.333  |
| HashSet-wasm   | 312.500 | 333.333       | 833.333  |
| quicksort-wasm | 833.333 | 1000          | 1666.667 |
| richards-wasm  | 384.615 | 555.556       | 1250     |
| tsf-wasm       | 357.143 | 178.571       | 714.286  |

Runtime Score

|                | Firefox | Google Chrome | Safari  |
|----------------|---------|---------------|---------|
| gcc-loops-wasm | 1.054   | 1.561         | 1.648   |
| HashSet-wasm   | 3.362   | 5.388         | 5.794   |
| quicksort-wasm | 113.636 | 125           | 142.857 |
| richards-wasm  | 9.276   | 5.030         | 7.194   |
| tsf-wasm       | 19.531  | 10.941        | 15.060  |
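
The three tables appear to be related: each overall score matches the geometric mean of the startup and runtime scores, and every startup score equals 5000 divided by a whole number of milliseconds. Both relations are inferred from the numbers above, not quoted from the JetStream 2 driver, so the following is only a sanity check under that assumption. The function names are illustrative.

```javascript
// Firefox column from the tables above: [startup, runtime, overall].
const firefox = {
  "gcc-loops-wasm": [500, 1.054, 22.961],
  "HashSet-wasm": [312.5, 3.362, 32.416],
  "quicksort-wasm": [833.333, 113.636, 307.729],
  "richards-wasm": [384.615, 9.276, 59.732],
  "tsf-wasm": [357.143, 19.531, 83.519],
};

// Assumption: overall = sqrt(startup * runtime), i.e. the geometric mean.
function geometricMean(startup, runtime) {
  return Math.sqrt(startup * runtime);
}

// Assumption: score = 5000 / time in milliseconds, so time = 5000 / score.
function scoreToMillis(score) {
  return 5000 / score;
}

for (const [name, [startup, runtime, overall]] of Object.entries(firefox)) {
  console.log(
    name,
    "predicted overall:", geometricMean(startup, runtime).toFixed(3),
    "reported:", overall,
    "startup ms:", scoreToMillis(startup).toFixed(1)
  );
}
```

Under that reading, Firefox's gcc-loops startup score of 500 corresponds to 10 ms, against about 6 ms for Safari's 833.333.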

Steps to Reproduce

Use https://github.com/WebKit/webkit/tree/63d9e9877c5d15243f5c9c9753fef20599b9f461/PerformanceTests/JetStream2 and run only WASM benchmarks.

I applied the following change to run only the WASM benchmarks.

diff --git a/PerformanceTests/JetStream2/JetStreamDriver.js b/PerformanceTests/JetStream2/JetStreamDriver.js
index dd7fae9fbd..0d202b24e4 100644
--- a/PerformanceTests/JetStream2/JetStreamDriver.js
+++ b/PerformanceTests/JetStream2/JetStreamDriver.js
@@ -1630,7 +1630,7 @@ let runSeaMonster = true;
 let runCodeLoad = true;
 let runWasm = true;

-if (false) {
+if (true) {
     runOctane = false;
     runARES = false;
     runWSL = false;
@@ -1642,7 +1642,7 @@ if (false) {
     runWorkerTests = false;
     runSeaMonster = false;
     runCodeLoad = false;
-    runWasm = false;
+    runWasm = true;
 }

 if (typeof testList !== "undefined") {

Tetsuharu OHZEKI [:tetsuharu] (UTC+9), Reporter
Updated • 5 years ago

Blocks: jetstream2

Tetsuharu OHZEKI [:tetsuharu] (UTC+9), Reporter
Updated • 5 years ago

Keywords: parity-safari, perf

Lars T Hansen [:lth]
Comment 1 • 5 years ago

The startup score should not worry us a lot, since Safari has an interpreter and hence has approximately zero startup overhead. We already know that we can baseline-compile faster than the bits can arrive from the net and that improving baseline compilation speed will not really do anything in the real world. What we don't know is whether we're aggressive enough about selecting the baseline compiler -- in some cases (for small enough programs) we go straight to ion.

The runtime scores should be investigated. It is possible that we lose sometimes because we baseline-compile and then get stuck in a baseline-compiled loop (no OSR), but there could be other problems that limit performance.

Priority: -- → P3

Lars T Hansen [:lth]
Updated • 5 years ago

Severity: normal → S3

Lars T Hansen [:lth]
Updated • 4 years ago

Assignee: nobody → lhansen

Status: NEW → ASSIGNED

Lars T Hansen [:lth]
Comment 2 • 4 years ago

Unassigning as we're retriaging everything.

Assignee: lhansen → nobody

Status: ASSIGNED → NEW

Lars T Hansen [:lth]
Updated • 3 years ago

Blocks: 1755624

Lars T Hansen [:lth]
Comment 3 • 3 years ago • Edited

(This comment will be updated.)

Comparing Fx 101 Nightly (no code caching) and Safari 15.4 under controlled conditions on my 2018 i7 quad-core MacBook Pro with macOS 12.3, I see Firefox lagging on execution time on all tests except tsf-wasm. The ratio of Safari/Firefox runtime score is 1.48 (gcc-loops), 1.6 (hashset), 1.31 (quicksort), 1.35 (richards). For tsf-wasm the ratio Firefox/Safari runtime score is 1.37.

The results are noisy: subsequent reloads improved times for both Firefox and Safari, Safari more so.

Disabling the baseline JIT in Firefox, the overall scores tank because we take a lot longer to load, but runtime scores improve. Hashset in particular improves a great deal, to the point where it's well ahead of Safari. This suggests Hashset gets stuck in baseline-compiled code (or the benchmark runs for such a short time that Ion code does not run); see next for more evidence of that. The others improve modestly.

Our total score with baseline-only is the same as with tiered compilation, because startup time is so much better. Running times are worse across the board except for HashSet, but this seems to matter less for the total score.

Enabling code caching does not help, suggesting benchmarks are not processed with streaming compilation in a way that allows caching. Placing the benchmark behind a real web server (as opposed to the python HTTP server) does not seem to make a difference.

Re startup scores, I basically don't care about these for reasons outlined earlier, but we're roughly comparable for gcc-loops, hashset, and quicksort, lagging on richards, and very far ahead on tsf. Richards is tiny and probably not indicative of anything. So I consider startup time a non-issue for now.

Action items:

- Figure out why code caching does not seem to work. Caching would improve the startup time and run time of subsequent runs, as we would go straight to Ion code. [done: caching is not working because the driver loads a blob that it then compiles from a bytearray]
- Figure out why quicksort doesn't improve all that much with Ion (the others are fine). [done: bad regalloc + codegen in inner loops]
- Figure out why we are using tiered compilation for these basically very tiny programs on what is truly a fast computer. Real wasm programs have tens of megabytes of bytecode, not a couple hundred kilobytes. We need to optimize for real programs. [done: looks ok, more or less]
- Profile some of these programs to discover why the scores are poor even with Ion-only compilation. Is there a lot of wasm<->JS traffic? Are we dying because of some unoptimized path? What else might be going on?

It's important not to take these programs too seriously; a real program would not get stuck in baseline code the way HashSet does, only microbenchmarks do that. But there's insight to be had from looking deeper.

Lars T Hansen [:lth]
Updated • 3 years ago

Assignee: nobody → lhansen

Status: NEW → ASSIGNED

Lars T Hansen [:lth]
Comment 4 • 3 years ago

The driver bypasses our code caching, so that explains why caching isn't working.

    get runnerCode() {
        let str = "";
        if (isInBrowser) {
            str += `
                var xhr = new XMLHttpRequest();
                xhr.open('GET', wasmBlobURL, true);
                xhr.responseType = 'arraybuffer';
                xhr.onload = function() {
                    Module.wasmBinary = xhr.response;
                    doRun();
                };
                xhr.send(null);
            `;
        } else {
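
For contrast, a loader that gives the engine a chance to cache compiled code would hand the network response directly to the streaming API instead of an ArrayBuffer. The following is a minimal sketch, not a patch to the JetStream driver. `loadWasmStreaming` and `loadWasmFromBytes` are hypothetical names, and whether caching actually kicks in depends on the browser and on how the file is served.

```javascript
// Hypothetical helper: compile and instantiate a module from a URL using the
// standard streaming API. The engine receives the fetch Response itself
// rather than a bytearray. The server must send Content-Type: application/wasm.
async function loadWasmStreaming(url, imports = {}) {
  const { module, instance } = await WebAssembly.instantiateStreaming(
    fetch(url),
    imports
  );
  return { module, instance };
}

// Roughly what the XHR path above amounts to, for comparison: download the
// bytes first, then compile from the resulting ArrayBuffer.
async function loadWasmFromBytes(url, imports = {}) {
  const response = await fetch(url);
  const bytes = await response.arrayBuffer();
  return WebAssembly.instantiate(bytes, imports);
}
```

Both helpers resolve to an object with `module` and `instance` properties, so a caller could switch between them without other changes.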

Lars T Hansen [:lth]
Comment 5 • 3 years ago • Edited

Baseline vs Ion: Quicksort doesn't improve much, but it's still uncertain whether this is due to memory bandwidth or other issues. It's a tiny program, so it could be that we go straight to Ion, but if so, why do we see a change? Other programs are fine.

Xeon baseline-only
gcc-loops 0.54 0.54 0.54
hashset 3.47 3.50 3.49
quicksort 102.0 94.3 104.0

Xeon ion-only
gcc-loops 1.27 1.25 1.25 - 2.3x speedup (assuming some sort of linearity)
hashset 6.39 5.98 6.33 - 1.8x speedup
quicksort 132 128 128 - 1.3x speedup

i7 baseline-only
gcc-loops 0.76 0.76 0.76
hashset 4.66 4.68 4.70
quicksort 114 114 116

i7 ion-only
gcc-loops 1.43 1.43 1.50 - 1.9x
hashset 8.08 8.14 7.99 - 1.7x
quicksort 152 139 135 - 1.2x

i7 safari
gcc-loops 1.98 1.94 1.91 - 1.4x over nightly
hashset 7.22 7.27 7.19 - slower
quicksort 161 179 179 - 1.3x over nightly

(I don't have times for Richards here but there were reasonable improvements with Ion.)

gcc-loops and quicksort are microbenchmarks that could be affected by poor register allocation on our part.
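
The Ion-over-baseline speedups quoted above can be reproduced from the listed runs, assuming they are ratios of the three-run means (the comment does not say how they were computed). A small sketch, with the data copied from the listings and all names illustrative:

```javascript
// Runtime scores (three runs each) copied from the listings above.
const runs = {
  xeon: {
    baseline: { "gcc-loops": [0.54, 0.54, 0.54], hashset: [3.47, 3.50, 3.49], quicksort: [102.0, 94.3, 104.0] },
    ion: { "gcc-loops": [1.27, 1.25, 1.25], hashset: [6.39, 5.98, 6.33], quicksort: [132, 128, 128] },
  },
  i7: {
    baseline: { "gcc-loops": [0.76, 0.76, 0.76], hashset: [4.66, 4.68, 4.70], quicksort: [114, 114, 116] },
    ion: { "gcc-loops": [1.43, 1.43, 1.50], hashset: [8.08, 8.14, 7.99], quicksort: [152, 139, 135] },
  },
};

const mean = (xs) => xs.reduce((a, b) => a + b, 0) / xs.length;

// Ratio of the mean Ion-only score to the mean baseline-only score.
function speedup(machine, benchmark) {
  const { baseline, ion } = runs[machine];
  return mean(ion[benchmark]) / mean(baseline[benchmark]);
}

for (const machine of Object.keys(runs)) {
  for (const benchmark of Object.keys(runs[machine].baseline)) {
    console.log(machine, benchmark, speedup(machine, benchmark).toFixed(1) + "x");
  }
}
```

Rounded to one decimal, these ratios agree with the 2.3x / 1.8x / 1.3x (Xeon) and 1.9x / 1.7x / 1.2x (i7) figures in the listings.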

Lars T Hansen [:lth]
Comment 6 • 3 years ago

It appears that, on the Xeon at least:

- tsf is compiled tiered
- richards is compiled with ion only
- quicksort is compiled with ion only
- hashset is compiled tiered
- gcc-loops is compiled tiered

Lars T Hansen [:lth]
Comment 7 • 3 years ago • Edited

Looking at the machine code for quicksort, it looks like it's mostly subjected to bad register allocation. The inner while loops must be very lean. However, here's the first "while" loop (this is with a patch from bug 1680243 applied but it doesn't change anything material):

00000070  41 83 7e 40 00            cmpl $0x00, 0x40(%r14)  ;; check
00000075  0f 85 e7 00 00 00         jnz 0x0000000000000162  ;;   interrupts
0000007B  8b 4c 24 14               movl 0x14(%rsp), %ecx   ;; i
0000007F  83 c1 01                  add $0x01, %ecx         ;; i+1
00000082  8b 5c 24 14               movl 0x14(%rsp), %ebx   ;; i (again)
00000086  44 8d 04 9f               lea (%rdi,%rbx,4), %r8d ;; a+i*4
0000008A  43 8b 1c 07               movl (%r15,%r8,1), %ebx ;; a[i]
0000008E  3b d8                     cmp %eax, %ebx
00000090  0f 8d 0a 00 00 00         jnl 0x00000000000000A0
00000096  8b 7c 24 1c               movl 0x1C(%rsp), %edi   ;; a redundantly reloaded
0000009A  89 4c 24 14               movl %ecx, 0x14(%rsp)   ;; i=i+1
0000009E  eb d0                     jmp 0x0000000000000070

and the second one is hardly much better:

000000A0  41 83 7e 40 00            cmpl $0x00, 0x40(%r14)   ;; check
000000A5  0f 85 be 00 00 00         jnz 0x0000000000000169   ;;   interrupts
000000AB  8b fa                     mov %edx, %edi           ;; overwrite a with j
000000AD  83 c7 ff                  add $-0x01, %edi         ;; j-1
000000B0  44 8b 54 24 1c            movl 0x1C(%rsp), %r10d   ;; load a
000000B5  45 8d 0c 92               lea (%r10,%rdx,4), %r9d  ;; a+j*4
000000B9  47 8b 14 0f               movl (%r15,%r9,1), %r10d ;; a[j]
000000BD  41 3b c2                  cmp %r10d, %eax
000000C0  0f 8d 04 00 00 00         jnl 0x00000000000000CA
000000C6  8b d7                     mov %edi, %edx           ;; j=j-1
000000C8  eb d6                     jmp 0x00000000000000A0

There are few live variables in these loops and they should all be kept in registers, but they are not. Also, traditional induction variable analysis would likely simplify the code; instead of computing a+j*4 every iteration (say), we'd have a temp and just add 4 to it. Sinking the i+1 and j-1 calculations might also help (see below for more on this).

It's worth recording the wasm code that gives rise to these to show that the strange hoisting of the i+1 is not an Ion problem per se but is in the source. Allocating i to a stack location is an Ion problem however.

 000320: 03 40                      |     loop
 000322: 20 01                      |       local.get 1
 000324: 41 01                      |       i32.const 1
 000326: 6a                         |       i32.add
 000327: 21 06                      |       local.set 6
 000329: 20 00                      |       local.get 0
 00032b: 20 01                      |       local.get 1
 00032d: 41 02                      |       i32.const 2
 00032f: 74                         |       i32.shl
 000330: 6a                         |       i32.add
 000331: 22 08                      |       local.tee 8
 000333: 28 02 00                   |       i32.load 2 0
 000336: 22 09                      |       local.tee 9
 000338: 20 05                      |       local.get 5
 00033a: 48                         |       i32.lt_s
 00033b: 04 40                      |       if
 00033d: 20 06                      |         local.get 6
 00033f: 21 01                      |         local.set 1
 000341: 0c 01                      |         br 1
 000343: 0b                         |       end
 000344: 0b                         |     end
 000345: 03 40                      |     loop
 000347: 20 03                      |       local.get 3
 000349: 41 7f                      |       i32.const 4294967295
 00034b: 6a                         |       i32.add
 00034c: 21 07                      |       local.set 7
 00034e: 20 05                      |       local.get 5
 000350: 20 00                      |       local.get 0
 000352: 20 03                      |       local.get 3
 000354: 41 02                      |       i32.const 2
 000356: 74                         |       i32.shl
 000357: 6a                         |       i32.add
 000358: 22 0a                      |       local.tee 10
 00035a: 28 02 00                   |       i32.load 2 0
 00035d: 22 0b                      |       local.tee 11
 00035f: 48                         |       i32.lt_s
 000360: 04 40                      |       if
 000362: 20 07                      |         local.get 7
 000364: 21 03                      |         local.set 3
 000366: 0c 01                      |         br 1
 000368: 0b                         |       end
 000369: 0b                         |     end
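
Read back into source form, the two wasm loops look like the inner scans of a quicksort partition. The sketch below is a reconstruction, not the benchmark's actual source: `a`, `i`, `j` and `pivot` are guessed names for locals 0, 1, 3 and 5. The second function only illustrates what the induction-variable rewrite mentioned above would do, advancing a byte offset by 4 instead of recomputing a+i*4 on every iteration.

```javascript
// Reconstructed scans over a plain array of 32-bit integers.
function scanIndexed(a, i, j, pivot) {
  while (a[i] < pivot) i++; // first loop: a[i] < pivot, then i = i + 1
  while (pivot < a[j]) j--; // second loop: pivot < a[j], then j = j - 1
  return [i, j];
}

// Same scans with the address computation strength-reduced: keep a byte
// offset into linear memory and step it by 4 each iteration.
function scanStrengthReduced(memory, base, i, j, pivot) {
  let p = base + i * 4;
  while (memory.getInt32(p, true) < pivot) p += 4; // wasm memory is little-endian
  let q = base + j * 4;
  while (pivot < memory.getInt32(q, true)) q -= 4;
  return [(p - base) / 4, (q - base) / 4];
}

// Example data laid out at byte offset 16 of a stand-in "linear memory".
const values = [1, 2, 5, 9, 3, 8, 10];
const base = 16;
const memory = new DataView(new ArrayBuffer(64));
values.forEach((v, k) => memory.setInt32(base + k * 4, v, true));

console.log(scanIndexed(values, 0, values.length - 1, 5));
console.log(scanStrengthReduced(memory, base, 0, values.length - 1, 5));
```

Both calls stop at the same pair of indices. The second form needs only one live register per scan for the address, which is the kind of lean loop body the machine code above fails to achieve.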

Lars T Hansen [:lth]
Comment 8 • 3 years ago

For Richards, more than 60% of the time is in JS. The "fast entry trampoline" for optimized JS->wasm calls accounts for 6% of the remaining, and then wasm for about 32%, according to the Firefox profiler. Contrary to the "in-depth analysis" in the benchmark, I see no evidence that this is a great test of JS->wasm calls. If we're lagging on perf here, I think we should look at JS perf first.

Lars T Hansen [:lth]
Comment 9 • 3 years ago

In summary:

- Startup time suffers because there's no caching of optimized machine code, since the benchmark runner is not cache-friendly.
- tsf is much faster in Firefox than in Safari.
- hashset suffers because it gets stuck in baseline code (and there's no caching to the rescue on subsequent runs); Ion code is faster than Safari. Real programs will not get stuck in baseline code, mostly.
- quicksort suffers from bad register allocation (at least).
- richards is mostly a JS benchmark and can be ignored IMO (although, to be fair, Chrome beats Firefox too).
- gcc-loops is a microbenchmark that is probably also suffering from bad register allocation; I haven't really bothered to look.

I will file separate bugs to track the actionable bits from the above, referencing this bug. I may then close this one; not sure yet.

Lars T Hansen [:lth]
Updated • 3 years ago

Depends on: 1763375

Lars T Hansen [:lth]
Updated • 3 years ago

Depends on: 1763384

Lars T Hansen [:lth]
Updated • 3 years ago

Assignee: lhansen → nobody

Status: ASSIGNED → NEW

Type: defect → enhancement

Summary: Firefox’s score is lower than Safari on WASM benchmark in JetStream 2 → [meta] Firefox’s score is lower than Safari on WASM benchmark in JetStream 2

BugBot [:suhaib / :marco/ :calixte]
Updated • 3 years ago

Keywords: meta

Ryan Hunt [:rhunt]
Updated • 1 year ago

Blocks: wasm-perf-gap

Ryan Hunt [:rhunt]
Comment 10 • 2 months ago

We're now focusing on jetstream3. Closing this now.

Status: NEW → RESOLVED

Closed: 2 months ago

Resolution: --- → FIXED