[meta] Firefox’s score is lower than Safari on WASM benchmark in JetStream 2
Categories
(Core :: JavaScript: WebAssembly, enhancement, P3)
Tracking
()
Tracking | Status | |
---|---|---|
firefox75 | --- | affected |
People
(Reporter: tetsuharu, Unassigned)
References
(Depends on 2 open bugs, Blocks 3 open bugs)
Details
(Keywords: meta, parity-safari, perf)
Summary
On the WASM benchmarks in JetStream 2, Firefox's score is lower than Safari's, especially for startup time.
Environments
I ran the benchmark on macOS 10.15.3 on my MacBook Pro (15-inch, 2017, 2.8 GHz Quad-Core Intel Core i7, 16 GB 2133 MHz LPDDR3).
- Firefox Nightly 75.0a1
- Google Chrome Canary 82.0.4061.2
- Safari Technology Preview 100
Results
Overall Score
| Benchmark | Firefox | Google Chrome | Safari |
|---|---|---|---|
| gcc-loops-wasm | 22.961 | 17.667 | 37.058 |
| HashSet-wasm | 32.416 | 42.379 | 69.485 |
| quicksort-wasm | 307.729 | 353.553 | 487.950 |
| richards-wasm | 59.732 | 52.863 | 94.830 |
| tsf-wasm | 83.519 | 44.201 | 103.717 |
Startup Score
| Benchmark | Firefox | Google Chrome | Safari |
|---|---|---|---|
| gcc-loops-wasm | 500 | 200 | 833.333 |
| HashSet-wasm | 312.500 | 333.333 | 833.333 |
| quicksort-wasm | 833.333 | 1000 | 1666.667 |
| richards-wasm | 384.615 | 555.556 | 1250 |
| tsf-wasm | 357.143 | 178.571 | 714.286 |
Runtime Score
| Benchmark | Firefox | Google Chrome | Safari |
|---|---|---|---|
| gcc-loops-wasm | 1.054 | 1.561 | 1.648 |
| HashSet-wasm | 3.362 | 5.388 | 5.794 |
| quicksort-wasm | 113.636 | 125 | 142.857 |
| richards-wasm | 9.276 | 5.030 | 7.194 |
| tsf-wasm | 19.531 | 10.941 | 15.060 |
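The three tables are related: the numbers are consistent with the overall score being the geometric mean of the startup and runtime scores. The sketch below checks that against one row. The `toScore` conversion (5000 divided by a time in milliseconds) is an assumption about how the driver turns times into scores; it is not stated in this report, although the startup values (500, 833.333, 1250, ...) fit it. The function names are illustrative.

```javascript
// Assumption: a sub-score is 5000 / time-in-milliseconds (not stated in the report).
function toScore(timeMs) {
  return 5000 / timeMs;
}

// Geometric mean, computed in log space to avoid overflow on long lists.
function geometricMean(values) {
  const logSum = values.reduce((sum, v) => sum + Math.log(v), 0);
  return Math.exp(logSum / values.length);
}

// Firefox, gcc-loops-wasm, taken from the tables above.
const firefoxGccLoops = { startup: 500, runtime: 1.054, reportedOverall: 22.961 };

const recomputed = geometricMean([firefoxGccLoops.startup, firefoxGccLoops.runtime]);
console.log("recomputed overall:", recomputed.toFixed(3));
console.log("reported overall:  ", firefoxGccLoops.reportedOverall);
```

Because the mean is geometric, a 2x change in the startup score moves the overall score exactly as much as a 2x change in the runtime score, which is why the very large startup numbers dominate the comparison here.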
Steps to Reproduce
Use https://github.com/WebKit/webkit/tree/63d9e9877c5d15243f5c9c9753fef20599b9f461/PerformanceTests/JetStream2 and run only WASM benchmarks.
I made the following change to run only the WASM benchmarks.
diff --git a/PerformanceTests/JetStream2/JetStreamDriver.js b/PerformanceTests/JetStream2/JetStreamDriver.js
index dd7fae9fbd..0d202b24e4 100644
--- a/PerformanceTests/JetStream2/JetStreamDriver.js
+++ b/PerformanceTests/JetStream2/JetStreamDriver.js
@@ -1630,7 +1630,7 @@ let runSeaMonster = true;
 let runCodeLoad = true;
 let runWasm = true;
 
-if (false) {
+if (true) {
     runOctane = false;
     runARES = false;
     runWSL = false;
@@ -1642,7 +1642,7 @@ if (false) {
     runWorkerTests = false;
     runSeaMonster = false;
     runCodeLoad = false;
-    runWasm = false;
+    runWasm = true;
 }
 
 if (typeof testList !== "undefined") {
Comment 1•4 years ago
The startup score should not worry us a lot, since Safari has an interpreter and hence has approximately zero startup overhead. We already know that we can baseline-compile faster than the bits can arrive from the net and that improving baseline compilation speed will not really do anything in the real world. What we don't know is whether we're aggressive enough about selecting the baseline compiler -- in some cases (for small enough programs) we go straight to ion.
The runtime scores should be investigated. It is possible that we lose sometimes because we baseline-compile and then get stuck in a baseline-compiled loop (no OSR), but there could be other problems that limit performance.
Comment 2•3 years ago
Unassigning as we're retriaging everything.
Comment 3•2 years ago
(This comment will be updated.)
Comparing Fx 101 Nightly (no code caching) and Safari 15.4 under controlled conditions on my 2018 i7 quad-core MacBook Pro with macOS 12.3, I see Firefox lagging on execution time on all tests except tsf-wasm. The ratio of Safari/Firefox runtime score is 1.48 (gcc-loops), 1.6 (hashset), 1.31 (quicksort), 1.35 (richards). For tsf-wasm the ratio Firefox/Safari runtime score is 1.37.
The results are noisy; subsequent reloads improved times for both Firefox and Safari, Safari more so.
Disabling the baseline JIT in Firefox, the overall scores tank because we take a lot longer to load, but runtime scores improve. Hashset in particular improves a great deal, to the point where it's well ahead of Safari. This suggests Hashset gets stuck in baseline-compiled code (or the benchmark runs for such a short time that Ion code does not run); see next for more evidence of that. The others improve modestly.
Our total score with baseline-only is the same as with tiered compilation, because startup time is so much better. Running times are worse across the board except for HashSet, but this seems to matter less for the total score.
Enabling code caching does not help, suggesting benchmarks are not processed with streaming compilation in a way that allows caching. Placing the benchmark behind a real web server (as opposed to the python HTTP server) does not seem to make a difference.
Re startup scores, I basically don't care about these for reasons outlined earlier, but we're roughly comparable for gcc-loops, hashset, and quicksort, lagging on richards, and very far ahead on tsf. Richards is tiny and probably not indicative of anything. So I consider startup time a non-issue for now.
Action items:
- Figure out why code caching does not seem to work. Caching would improve the startup time and run time of subsequent runs, as we would go straight to Ion code. [done: caching is not working because the driver loads a blob that it then compiles from a bytearray]
- Figure out why quicksort doesn't improve all that much with Ion (the others are fine) [done: bad regalloc + codegen in inner loops]
- Figure out why we are using tiered compilation for these basically very tiny programs on what is truly a fast computer. Real wasm programs have tens of megabytes of bytecode, not a couple hundred kilobytes. We need to optimize for real programs. [done: looks ok, more or less]
- Profile some of these programs to discover why the scores are poor even with Ion-only compilation. Is there a lot of wasm<->JS traffic? Are we dying because of some unoptimized path? What else might be going on?
It's important not to take these programs too seriously; a real program would not get stuck in baseline code the way HashSet does, only microbenchmarks do that. But there's insight to be had from looking deeper.
Comment 4•2 years ago
The driver bypasses our code caching, so that explains why caching isn't working.
get runnerCode() {
    let str = "";
    if (isInBrowser) {
        str += `
            var xhr = new XMLHttpRequest();
            xhr.open('GET', wasmBlobURL, true);
            xhr.responseType = 'arraybuffer';
            xhr.onload = function() {
                Module.wasmBinary = xhr.response;
                doRun();
            };
            xhr.send(null);
        `;
    } else {
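For contrast with the XHR-to-ArrayBuffer path in the driver excerpt above, here is a minimal sketch of a loader that hands the network response directly to the engine through the standard `WebAssembly.instantiateStreaming` API, which is the shape of loading that streaming compilation expects. The function name `instantiateCacheFriendly` and the injectable `fetchFn` parameter are hypothetical and not part of the JetStream driver. Whether a particular URL (for example a blob URL) is eligible for code caching is engine-specific and is not established by this sketch.

```javascript
// Hypothetical helper, not taken from the JetStream driver.
// Prefers streaming instantiation; falls back to the ArrayBuffer path
// (the one the driver effectively uses) when streaming is unavailable.
async function instantiateCacheFriendly(url, imports = {}, fetchFn = fetch) {
  const responsePromise = fetchFn(url);

  if (typeof WebAssembly.instantiateStreaming === "function") {
    // The engine sees the Response itself, so it can compile while bytes arrive.
    return WebAssembly.instantiateStreaming(responsePromise, imports);
  }

  // Fallback: download everything first, then compile from raw bytes.
  const response = await responsePromise;
  const bytes = await response.arrayBuffer();
  return WebAssembly.instantiate(bytes, imports);
}
```

A caller would use it roughly as `instantiateCacheFriendly(wasmBlobURL, importObject).then(({ instance }) => ...)`. Streaming instantiation also requires the response to be served with the `application/wasm` MIME type.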
Comment 5•2 years ago
Baseline vs Ion: Quicksort doesn't improve much, but it's still uncertain whether this is due to memory bandwidth or other issues. It's a tiny program, so it could be that we go straight to Ion, but if so, why do we see a change? Other programs are fine.
Xeon baseline-only
gcc-loops 0.54 0.54 0.54
hashset 3.47 3.50 3.49
quicksort 102.0 94.3 104.0
Xeon ion-only
gcc-loops 1.27 1.25 1.25 - 2.3x speedup (assuming some sort of linearity)
hashset 6.39 5.98 6.33 - 1.8x speedup
quicksort 132 128 128 - 1.3x speedup
i7 baseline-only
gcc-loops 0.76 0.76 0.76
hashset 4.66 4.68 4.70
quicksort 114 114 116
i7 ion-only
gcc-loops 1.43 1.43 1.50 - 1.9x
hashset 8.08 8.14 7.99 - 1.7x
quicksort 152 139 135 - 1.2x
i7 safari
gcc-loops 1.98 1.94 1.91 - 1.4x over nightly
hashset 7.22 7.27 7.19 - slower
quicksort 161 179 179 - 1.3x over nightly
(I don't have times for Richards here but there were reasonable improvements with Ion.)
gcc-loops and quicksort are microbenchmarks that could be affected by poor register allocation on our part.
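The speedup annotations in the Xeon listing above can be reproduced by dividing the mean ion-only runtime score by the mean baseline-only runtime score, under the same "some sort of linearity" assumption noted there. A small sketch follows; the data is copied from the Xeon rows and the helper names are illustrative.

```javascript
// Runtime scores copied from the Xeon rows above (three runs each).
const xeon = {
  baseline: {
    "gcc-loops": [0.54, 0.54, 0.54],
    hashset: [3.47, 3.50, 3.49],
    quicksort: [102.0, 94.3, 104.0],
  },
  ion: {
    "gcc-loops": [1.27, 1.25, 1.25],
    hashset: [6.39, 5.98, 6.33],
    quicksort: [132, 128, 128],
  },
};

function mean(values) {
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Higher score is better, so speedup = ion score / baseline score.
function speedup(baselineRuns, ionRuns) {
  return mean(ionRuns) / mean(baselineRuns);
}

for (const name of Object.keys(xeon.baseline)) {
  const ratio = speedup(xeon.baseline[name], xeon.ion[name]);
  console.log(`${name}: ${ratio.toFixed(1)}x`);
}
```

Swapping in the i7 rows gives the i7 annotations in the same way.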
Comment 6•2 years ago
It appears that on the Xeon at least:
- tsf is compiled tiered
- richards is compiled with ion only
- quicksort is compiled with ion only
- hashset is compiled tiered
- gcc-loops is compiled tiered
Comment 7•2 years ago
Looking at the machine code for quicksort, it looks like it mostly suffers from bad register allocation. The inner while loops must be very lean. However, here's the first "while" loop (this is with a patch from bug 1680243 applied but it doesn't change anything material):
00000070 41 83 7e 40 00 cmpl $0x00, 0x40(%r14) ;; check
00000075 0f 85 e7 00 00 00 jnz 0x0000000000000162 ;; interrupts
0000007B 8b 4c 24 14 movl 0x14(%rsp), %ecx ;; i
0000007F 83 c1 01 add $0x01, %ecx ;; i+1
00000082 8b 5c 24 14 movl 0x14(%rsp), %ebx ;; i (again)
00000086 44 8d 04 9f lea (%rdi,%rbx,4), %r8d ;; a+i*4
0000008A 43 8b 1c 07 movl (%r15,%r8,1), %ebx ;; a[i]
0000008E 3b d8 cmp %eax, %ebx
00000090 0f 8d 0a 00 00 00 jnl 0x00000000000000A0
00000096 8b 7c 24 1c movl 0x1C(%rsp), %edi ;; a redundantly reloaded
0000009A 89 4c 24 14 movl %ecx, 0x14(%rsp) ;; i=i+1
0000009E eb d0 jmp 0x0000000000000070
and the second one is hardly much better:
000000A0 41 83 7e 40 00 cmpl $0x00, 0x40(%r14) ;; check
000000A5 0f 85 be 00 00 00 jnz 0x0000000000000169 ;; interrupts
000000AB 8b fa mov %edx, %edi ;; overwrite a with j
000000AD 83 c7 ff add $-0x01, %edi ;; j-1
000000B0 44 8b 54 24 1c movl 0x1C(%rsp), %r10d ;; load a
000000B5 45 8d 0c 92 lea (%r10,%rdx,4), %r9d ;; a+j*4
000000B9 47 8b 14 0f movl (%r15,%r9,1), %r10d ;; a[j]
000000BD 41 3b c2 cmp %r10d, %eax
000000C0 0f 8d 04 00 00 00 jnl 0x00000000000000CA
000000C6 8b d7 mov %edi, %edx ;; j=j-1
000000C8 eb d6 jmp 0x00000000000000A0
There are few live variables in these loops and they should all be kept in registers, but they are not. Also, traditional induction variable analysis would likely simplify the code; instead of computing a+j*4 every iteration (say), we'd have a temp and just add 4 to it. Sinking the i+1 and j-1 calculations might also help (see below for more on this).
It's worth recording the wasm code that gives rise to these to show that the strange hoisting of the i+1 is not an Ion problem per se but is in the source. Allocating i to a stack location is an Ion problem however.
000320: 03 40 | loop
000322: 20 01 | local.get 1
000324: 41 01 | i32.const 1
000326: 6a | i32.add
000327: 21 06 | local.set 6
000329: 20 00 | local.get 0
00032b: 20 01 | local.get 1
00032d: 41 02 | i32.const 2
00032f: 74 | i32.shl
000330: 6a | i32.add
000331: 22 08 | local.tee 8
000333: 28 02 00 | i32.load 2 0
000336: 22 09 | local.tee 9
000338: 20 05 | local.get 5
00033a: 48 | i32.lt_s
00033b: 04 40 | if
00033d: 20 06 | local.get 6
00033f: 21 01 | local.set 1
000341: 0c 01 | br 1
000343: 0b | end
000344: 0b | end
000345: 03 40 | loop
000347: 20 03 | local.get 3
000349: 41 7f | i32.const 4294967295
00034b: 6a | i32.add
00034c: 21 07 | local.set 7
00034e: 20 05 | local.get 5
000350: 20 00 | local.get 0
000352: 20 03 | local.get 3
000354: 41 02 | i32.const 2
000356: 74 | i32.shl
000357: 6a | i32.add
000358: 22 0a | local.tee 10
00035a: 28 02 00 | i32.load 2 0
00035d: 22 0b | local.tee 11
00035f: 48 | i32.lt_s
000360: 04 40 | if
000362: 20 07 | local.get 7
000364: 21 03 | local.set 3
000366: 0c 01 | br 1
000368: 0b | end
000369: 0b | end
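The induction-variable point made about the machine code above can be illustrated with a small JavaScript model of the first scan loop. This is only a sketch of the transformation, not Ion's implementation: a `DataView` stands in for wasm linear memory, and the function names are hypothetical. The first version recomputes `base + i*4` on every iteration, as the generated code does; the second carries the byte address as the loop variable and steps it by 4.

```javascript
// Model of the first loop: advance i while a[i] < pivot.
// `mem` is a DataView standing in for linear memory; `base` is the byte
// address of the array. Like the original loop, this relies on an element
// >= pivot existing to stop the scan.
function scanUpIndexed(mem, base, i, pivot) {
  // Address is recomputed from i on every iteration (shift + add).
  while (mem.getInt32(base + (i << 2), true) < pivot) {
    i = i + 1;
  }
  return i;
}

function scanUpStrengthReduced(mem, base, i, pivot) {
  // The byte address is the induction variable; each step is a single add.
  let addr = base + (i << 2);
  while (mem.getInt32(addr, true) < pivot) {
    addr += 4;
  }
  // Recover the element index once, after the loop.
  return (addr - base) >> 2;
}

// Usage: five little-endian i32 values stored at byte offset 8.
const memory = new DataView(new ArrayBuffer(32));
[1, 2, 3, 10, 4].forEach((v, k) => memory.setInt32(8 + k * 4, v, true));
console.log(scanUpIndexed(memory, 8, 0, 5), scanUpStrengthReduced(memory, 8, 0, 5));
```

Both functions return the same index; the second form keeps one live value in the loop instead of two, which also reduces register pressure.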
Comment 8•2 years ago
For Richards, more than 60% of the time is in JS. The "fast entry trampoline" for optimized JS->wasm calls accounts for 6% of the remaining, and then wasm for about 32%, according to the Firefox profiler. Unlike the "in-depth analysis" in the benchmark I see no evidence that this is a great test of JS->wasm calls. If we're lagging on perf here I think we should look at JS perf first.
Comment 9•2 years ago
In summary:
- startup time suffers because there's no caching of optimized machine code because the benchmark runner is not cache-friendly
- tsf is much faster in firefox than safari
- hashset suffers because it gets stuck in baseline code (and there's no caching to the rescue on subsequent runs); ion code is faster than safari. real programs will not get stuck in baseline code, mostly
- quicksort suffers from bad register allocation (at least)
- richards is mostly a JS benchmark and can be ignored IMO (although to be fair, Chrome beats firefox too)
- gcc-loops is a microbenchmark that is probably also suffering from bad register allocation, I haven't really bothered to look
I will file separate bugs to track the actionable bits from the above, referencing this bug. I may then close this one; not sure yet.