Open Bug 1382643 Opened 7 years ago Updated 2 years ago

Emunittest WebAssembly Mandelbrot benchmark is +32.9% slower in Firefox compared to Chrome

Categories

(Core :: JavaScript Engine: JIT, defect, P3)

56 Branch
x86_64
macOS
defect

Tracking

()

People

(Reporter: jujjyl, Unassigned)

References

(Blocks 3 open bugs)

Details

(Keywords: triage-deferred)

Attachments

(3 files)

Running the Mandelbrot benchmark (singlethreaded, no SIMD.js) from Emscripten in Wasm on MacBook Pro (13-inch, 2016, Four Thunderbolt 3 Ports) with macOS Sierra 10.12.5 (16F73), at

https://s3.amazonaws.com/mozilla-games/emunittest-public/2017-07-20-Mandelbrot/mandelbrot.html?playback&novsync&fakegl

gives the following numbers:

Chrome Canary Version 61.0.3161.0 (Official Build) canary (64-bit)

var results = {
	"totalTime": 16366.320000000002,
	"totalRenderTime": 16256.320000000002,
	"wrongPixels": 27,
	"result": "FAIL",
	"cpuTime": 16140.170000000022,
	"cpuIdle": 0.6567907127811394,
	"fps": 24.605814846164442,
	"pageLoadTime": 95.69500000000005,
	"numStutterEvents": 10,
	"excessFrametime": 9786.130000000023,
	"pageLoadTimeToFrame1": 110.00000000000011,
	"pageLoadTimeToFrame10": 216.14,
	"timeUntilSmoothFramerate": 3254.0700000000006
};

Firefox Nightly 56.0a1 (2017-07-18) (64-bit)

var results = {
	"totalTime": 22135.94,
	"totalRenderTime": 21675.09,
	"wrongPixels": 27,
	"result": "FAIL",
	"cpuTime": 21455.934999999987,
	"cpuIdle": 0.8137451793741922,
	"fps": 18.454363972652477,
	"pageLoadTime": 441.94500000000005,
	"numStutterEvents": 12,
	"excessFrametime": 14900.706666666716,
	"pageLoadTimeToFrame1": 460.85,
	"pageLoadTimeToFrame10": 604.3249999999999,
	"timeUntilSmoothFramerate": 6926.73
};

Looking at the hotspots in Firefox in profiler, it looks quite tight in the Mandelbrot kernel: https://perfht.ml/2gN8g82, where

> 96.0% of time in ComputeMandelbrot
> 2.6% of time in _logf

Source code is in Emscripten repository at https://github.com/kripken/emscripten/blob/master/tests/pthread/test_pthread_mandelbrot.cpp#L108

Code is compiled with

> cd $EMSCRIPTEN
> em++ tests/pthread/test_pthread_mandelbrot.cpp --shell-file tests/pthread/test_pthread_mandelbrot_shell.html -o mandelbrot.html -DSINGLETHREADED=1 -O3 --profiling-funcs -s WASM=1

For a visual mode to see what is happening, run https://s3.amazonaws.com/mozilla-games/emunittest-public/2017-07-20-Mandelbrot/mandelbrot.html?playback&novsync , the numbers above are from fakegl (no rendering) mode. Also in that mode, it is expected that the reftest gives red. In the visual mode the reftest is expected to succeed.
An asm.js version of the same benchmark is at

https://s3.amazonaws.com/mozilla-games/emunittest-public/2017-07-20-Mandelbrot/mandelbrot_asm.html?playback&novsync&fakegl

to confirm that asm.js and wasm timings are about the same:

> asm.js: "cpuTime": 21380.575000000026,
> wasm: "cpuTime": 21455.934999999987,

There are no function pointers in the hot path, and also the application does not need to mask trapping instructions, and there are no FFIs to perform, so none of the three known wasm regressions are in effect, so good to see perf is then equal, at least on a small kernel like this.
Blocks: wasm-perf
Component: JavaScript Engine → JavaScript Engine: JIT
On my Haswell Windows PC (Intel Core i7-5960X 3.00GHz with 8/16 cores/threads), the benchmark gives

FF Nightly 56.0a1 (2017-07-19) (64-bit)

var results = {
	"totalTime": 21691.355,
	"totalRenderTime": 20601.760000000002,
	"wrongPixels": 0,
	"result": "PASS (no per-pixel test - fakegl mode)",
	"cpuTime": 20488.324999999993,
	"cpuIdle": 0.47117333664694255,
	"fps": 19.415816901080294,
	"pageLoadTime": 1067.345,
	"numStutterEvents": 14,
	"excessFrametime": 13905.955000000044,
	"pageLoadTimeToFrame1": 1089.595,
	"pageLoadTimeToFrame10": 1236.6250000000002,
	"timeUntilSmoothFramerate": null
};

Chrome Canary Version 61.0.3147.0 (Official Build) canary (64-bit)

var results = {
	"totalTime": 16373.96,
	"totalRenderTime": 15195.835,
	"wrongPixels": 0,
	"result": "PASS (no per-pixel test - fakegl mode)",
	"cpuTime": 15123.20500000004,
	"cpuIdle": 0.4789799310138142,
	"fps": 26.32300232267592,
	"pageLoadTime": 1163.6399999999999,
	"numStutterEvents": 13,
	"excessFrametime": 8675.363333333347,
	"pageLoadTimeToFrame1": 1178.125,
	"pageLoadTimeToFrame10": 1289.7399999999998,
	"timeUntilSmoothFramerate": null
};

so the issue does not seem to be restricted to OS X, but on Windows Firefox is +35.48% slower, about the same factor as the OS X Macbook.
Just a drive-by because I had my profiler open.

Here's a count of samples in the hottest part of the wasm benchmark in today's Win64 Nightly. (Note that the counts likely mean time spent in the _previous_ instruction)

14ee11696f1 add     eax,dword ptr [rsp+30h]       1
14ee11696f5 mov     dword ptr [rsp+28h],eax       1
14ee11696f9 movss   xmm0,dword ptr [r15+rax]      10
14ee11696ff movss   dword ptr [rsp+24h],xmm0      23
14ee1169705 xor     eax,eax                       1
14ee1169707 movss   xmm3,dword ptr [14ee117a1d8]  905   ; 2.f
14ee116970f mulss   xmm3,xmm2                     28
14ee1169713 mulss   xmm2,xmm2                     21
14ee1169717 mulss   xmm3,dword ptr [rsp+24h]      14
14ee116971d movss   xmm0,dword ptr [rsp+24h]      883
14ee1169723 mulss   xmm0,dword ptr [rsp+24h]      25
14ee1169729 movss   dword ptr [rsp+24h],xmm0      19
14ee116972f subss   xmm2,dword ptr [rsp+24h]      10
14ee1169735 movss   xmm0,dword ptr [rsp+38h]      1486
14ee116973b addss   xmm0,xmm3                     36
14ee116973f movaps  xmm4,xmm0                     22
14ee1169742 movss   dword ptr [rsp+24h],xmm0      10
14ee1169748 mulss   xmm4,dword ptr [rsp+24h]      928
14ee116974e movaps  xmm3,xmm1                     4513
14ee1169751 addss   xmm3,xmm2                     34
14ee1169755 movaps  xmm0,xmm3                     305
14ee1169758 mulss   xmm0,xmm3                     9
14ee116975c addss   xmm4,xmm0                     1937
14ee1169760 movss   xmm0,dword ptr [14ee117a1f8]  3226   ; 4.f
14ee1169768 ucomiss xmm4,xmm0                     11
14ee116976b ja      14ee1169786                   1809
14ee1169771 add     eax,1                         905
14ee1169774 cmp     eax,dword ptr [rsp+0F8h]      36
14ee116977b jae     14ee1169932                   0
14ee1169781 movaps  xmm2,xmm3                     8
14ee1169784 jmp     14ee1169707                   12
14ee1169786 mov     ecx,dword ptr [rsp+0F0h]      55
14ee116978d add     ecx,eax                       35

Wasm experts, do you see any room for improvement in this code?
(In reply to David Major [:dmajor] from comment #3)
> 14ee1169717 mulss   xmm3,dword ptr [rsp+24h]      14
> 14ee116971d movss   xmm0,dword ptr [rsp+24h]      883
> 14ee1169723 mulss   xmm0,dword ptr [rsp+24h]      25
> 14ee1169729 movss   dword ptr [rsp+24h],xmm0      19

Just glancing at the code, this bit looks pretty inefficient (3 loads of this float32 instead of 1).
(In reply to Jan de Mooij [:jandem] from comment #4)
> (In reply to David Major [:dmajor] from comment #3)
> > 14ee1169717 mulss   xmm3,dword ptr [rsp+24h]      14
> > 14ee116971d movss   xmm0,dword ptr [rsp+24h]      883
> > 14ee1169723 mulss   xmm0,dword ptr [rsp+24h]      25
> > 14ee1169729 movss   dword ptr [rsp+24h],xmm0      19
> 
> Just glancing at the code, this bit looks pretty inefficient (3 loads of
> this float32 instead of 1).

True, although based on the sample counts the subsequent ones are pretty cheap after we've paid the cost of the first.

If I squint, I can see four "squaring" operations in there, corresponding to https://github.com/kripken/emscripten/blob/master/tests/pthread/test_pthread_mandelbrot.cpp#L127 and https://github.com/kripken/emscripten/blob/master/tests/pthread/test_pthread_mandelbrot.cpp#L136 -- how difficult would it be to re-use `v_real*v_real` and `v_imag*v_imag` from the bottom of one loop to the top of the next?
(In reply to David Major [:dmajor] from comment #5)
> (In reply to Jan de Mooij [:jandem] from comment #4)
> > (In reply to David Major [:dmajor] from comment #3)
> If I squint, I can see four "squaring" operations in there, corresponding to
> https://github.com/kripken/emscripten/blob/master/tests/pthread/
> test_pthread_mandelbrot.cpp#L127 and
> https://github.com/kripken/emscripten/blob/master/tests/pthread/
> test_pthread_mandelbrot.cpp#L136 -- how difficult would it be to re-use
> `v_real*v_real` and `v_imag*v_imag` from the bottom of one loop to the top
> of the next?

That looks like a great optimization, good point. Though what the performance of the demo is is not too important for the purposes of this bug, as long as its apples to apples. I can look at optimizing it some time later.
If the host system is AVX capable, could Wasm compilation emit the three-operand variants of instructions to avoid some redundant movs? Will Cretonne be able to do that?
(In reply to Jukka Jylänki from comment #6)
> (In reply to David Major [:dmajor] from comment #5)
> > (In reply to Jan de Mooij [:jandem] from comment #4)
> > > (In reply to David Major [:dmajor] from comment #3)
> > If I squint, I can see four "squaring" operations in there, corresponding to
> > https://github.com/kripken/emscripten/blob/master/tests/pthread/
> > test_pthread_mandelbrot.cpp#L127 and
> > https://github.com/kripken/emscripten/blob/master/tests/pthread/
> > test_pthread_mandelbrot.cpp#L136 -- how difficult would it be to re-use
> > `v_real*v_real` and `v_imag*v_imag` from the bottom of one loop to the top
> > of the next?
> 
> That looks like a great optimization, good point. Though what the
> performance of the demo is is not too important for the purposes of this
> bug, as long as its apples to apples. I can look at optimizing it some time
> later.

I was thinking that the code generation might do this. I don't have Chrome on this machine to check, but if their optimizer can do this and we can't, that could easily explain 30%.
(In reply to Jukka Jylänki from comment #7)
> If the host system is AVX capable, could Wasm compilation emit the
> three-operand variants of instructions to avoid some redundant movs? Will
> Cretonne be able to do that?

Ion can do this if the CPU has AVX support: http://searchfox.org/mozilla-central/rev/3a3af33f513071ea829debdfbc628caebcdf6996/js/src/jit/x86-shared/Lowering-x86-shared.cpp#140-153  No idea why we're not seeing AVX instructions on dmajor's machine. /me will leave this to the wasm experts.
(In reply to Jukka Jylänki from comment #7)
> If the host system is AVX capable, could Wasm compilation emit the
> three-operand variants of instructions to avoid some redundant movs? Will
> Cretonne be able to do that?

We could generate AVX code, but it's disabled in general because it was actually regressing our code, iirc. Nobody has investigated why this happened, so AVX codegen is still disabled at the moment.
(In reply to Jan de Mooij [:jandem] from comment #4)
> (In reply to David Major [:dmajor] from comment #3)
> > 14ee1169717 mulss   xmm3,dword ptr [rsp+24h]      14
> > 14ee116971d movss   xmm0,dword ptr [rsp+24h]      883
> > 14ee1169723 mulss   xmm0,dword ptr [rsp+24h]      25
> > 14ee1169729 movss   dword ptr [rsp+24h],xmm0      19
> 
> Just glancing at the code, this bit looks pretty inefficient (3 loads of
> this float32 instead of 1).

Is this regalloc deciding to use memory locations instead of registers? We should have many more available xmm registers here, right?
(In reply to David Major [:dmajor] from comment #3)
> 14ee1169742 movss   dword ptr [rsp+24h],xmm0
> 14ee1169748 mulss   xmm4,dword ptr [rsp+24h]

This kind of pattern looks odd as well, wouldn't it be better with

> 14ee1169742 movss   dword ptr [rsp+24h],xmm0
> 14ee1169748 mulss   xmm4,xmm0

because we still have the variable in register, and this way there won't be a data dependency between the store and the next load? (I don't know if that data dependency is trivial to the CPU to handle though without penalty) Also, is register-register mulss shorter to encode compared to register-memory mulss?
Attached file standalone.zip
I managed to make a standalone test case, to be run in the shell (.js file); also including the cpp file in case there's something terribly wrong in there.
The backtracking regalloc is suspected to have regressed mandelbrot very significantly: see bug 1138876.
Keywords: triage-deferred
Priority: -- → P3

I've been investigating this issue: most of the execution time is spend within a loop (no surprise here). As per the paper "Linear Scan Register Allocation on SSA Form", the vregs within the loop are live over the entire loop.
Within the loop, there is a WasmCall instruction. According the LWasmCall::isCallPreserved, all physical registers are clobbered by the WasmCall. Hence assigning a register for the vregs fails (indicated by log "rdx collides with fixed use ...") and they're added to spilledBundles.
Later, tryAllocatingRegistersForSpillBundles attempts to find a physical register but again fails due to the WasmCall. As a consequence, the vregs within the loop are spilled to the stack.
My thoughts towards alleviating the issue: the lifetime of a vreg might require a hole during a WasmCall.
Furthermore, if LWasmCall::isCallPreserved had information about callee-saved registers (depending on the ABI of the OS/arch) at least some of the vregs wouldn't need to be spilled.
If this sounds about right and someone is up to mentoring this bug I'd give it a try.

Thanks for the investigation, sounds very plausible. This is probably pretty hard, and Julian is working on this very problem on a different bug, so let's wait for him to finish that first.

Possibly related to bug 1714280.

See Also: → 1714280

It is unfortunately indeed the case that our regalloc generates an utterly
wretched piece of code. The hottest two blocks, representing 87% of the
executed instructions, are below. They are mostly reloads and spills.

=-=-=-=-=-=-=-=-=-=-=-=-=-= begin SB rank 0 =-=-=-=-=-=-=-=-=-=-=-=-=-=
  0: (167061703 66.33%)   167061703 66.33%      0x2b767065e98d 
==== SB 68386 (evchecks 0) [tid 0] 0x2b767065e98d UNKNOWN_FUNCTION UNKNOWN_OBJECT+0x0
        0x2B767065E98D:  movl 44(%rsp),%eax
        0x2B767065E991:  addl 88(%rsp),%eax
        0x2B767065E995:  xorps %xmm0,%xmm0
        0x2B767065E998:  cvtsi2ss %eax,%xmm0
        0x2B767065E99C:  mulss 76(%rsp),%xmm0
        0x2B767065E9A2:  addss 84(%rsp),%xmm0
        0x2B767065E9A8:  movl 44(%rsp),%eax
        0x2B767065E9AC:  shll 0x2:I8, %eax
        0x2B767065E9AF:  movl %eax,40(%rsp)
        0x2B767065E9B3:  movl 60(%rsp),%eax
        0x2B767065E9B7:  addl 40(%rsp),%eax
        0x2B767065E9BB:  movl %eax,36(%rsp)
        0x2B767065E9BF:  movss (%r15,%rax),%xmm1
        0x2B767065E9C5:  movsd 592(%r14),%xmm2
        0x2B767065E9CE:  cvtsd2ss %xmm2,%xmm2
        0x2B767065E9D2:  ucomiss %xmm2,%xmm1
        0x2B767065E9D5:  jp-32 0x2B767065E9E1 
=-=-=-=-=-=-=-=-=-=-=-=-=-=  end SB rank 0  =-=-=-=-=-=-=-=-=-=-=-=-=-=

=-=-=-=-=-=-=-=-=-=-=-=-=-= begin SB rank 1 =-=-=-=-=-=-=-=-=-=-=-=-=-=
  1: (216101528 85.80%)    49039825 19.47%      0x2b767065ec71 
==== SB 68386 (evchecks 0) [tid 0] 0x2b767065ec71 UNKNOWN_FUNCTION UNKNOWN_OBJECT+0x0
        0x2B767065EC71:  movl 44(%rsp),%eax
        0x2B767065EC75:  addl $1, %eax
        0x2B767065EC78:  movl %eax,44(%rsp)
        0x2B767065EC7C:  cmpl 48(%rbp),%eax
        0x2B767065EC7F:  jne-32 0x2B767065E980 
=-=-=-=-=-=-=-=-=-=-=-=-=-=  end SB rank 1  =-=-=-=-=-=-=-=-=-=-=-=-=-=

Do you have the ability to check this on ARM64? I'm curious whether you're seeing good regalloc there, as we saw on the other program.

I built the js shell with ac_add_options --target=aarch64-linux-android and ran the mandelbrot benchmark on a Pixel device. The regalloc logs contain plenty of ... collides with fixed use v0 [197,198) as well. I'll upload the log and ion.json files so you can see for yourselves.

Attached file jitspew.log
Attached file ion.json.zip

Thanks. Nothing immediately actionable here, we'll wait for other investigations to complete I guess.

Blocks: 1755624
Severity: normal → S3
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: