Closed Bug 1708743 (simd-avx2) Opened 1 year ago Closed 4 months ago

x64/x86 SIMD: AVX2 support

Categories

(Core :: JavaScript: WebAssembly, enhancement, P3)

x86_64
All
enhancement

Tracking

()

RESOLVED FIXED
101 Branch
Tracking Status
firefox101 --- fixed

People

(Reporter: lth, Assigned: yury)

References

(Blocks 2 open bugs)

Details

Attachments

(7 files)

We should experiment with AVX2 support for SIMD on x86 and x64. Almost 60% of our Windows users have AVX2 (https://firefoxgraphics.github.io/telemetry/#view=system), and this can only increase. With AVX2 we would get lower register pressure from the 3-address ops, and better code generation in some cases.

My understanding is that plain AVX is probably not worth the bother, but I'll try to back that up with links to discussions.

We used to have AVX support for SIMD.js but we turned it off for two reasons:

  • the AVX unit is frequently powered down and powering it up is expensive. It is only code that uses Serious SIMD that benefits from getting AVX codegen.
  • the YMM registers caused some weird stalls at task switch time on MacOS.

The parameters may have changed here (hypothetical fixes to MacOS + Mac is moving to ARM64; maybe the cost of enabling/disabling AVX is lower in current chips or OSs) and in addition it may be that we can stick to SSE4.1 in the baseline compiler to avoid stalls during startup. There must be other ideas.

I will try to dig up old bugs that pertain to the known problems so that we can investigate how to avoid them.

One thing I don't know yet is whether there's a penalty for mixing AVX code (instructions have a special prefix) with non-AVX code, so that it would be necessary to do AVX encodings for everything before we could assess performance.

https://github.com/WebAssembly/simd/issues/342#issuecomment-834805766 suggests (this has to be checked) that AVX-encoded instructions no longer requires memory operands to be aligned. At the moment, for example, we can't do PADDD offs(basereg), destreg because we don't normally know if the effective address is aligned. Being able to do so without worrying alignment would save us from doing a load, and save us from dedicating a register to the loaded value.

Blocks: 1713056

The alignment requirements are documented in Intel Manual Vol I, Chapter 14 PROGRAMMING WITH AVX, FMA AND AVX2
In short, only explicitly aligned avx instructions trigger a GP fault; "regular" avx instructions won't.

14.9 MEMORY ALIGNMENT
[...]
With the exception of explicitly aligned 16 or 32 byte SIMD load/store instructions, most VEX-encoded, 
arithmetic and data processing instructions operate in a flexible environment regarding memory address 
alignment, i.e. VEX-encoded instruction with 32-byte or 16-byte load semantics will support unaligned load 
operation by default. Memory arguments for most instructions with VEX prefix operate normally without
causing #GP(0) on any byte-granularity alignment (unlike Legacy SSE instructions). The instructions that 
require explicit memory alignment requirements are listed in Table 14-22
[...]

Finding interesting behavior for (func (param v128 v128) (result v128) local.get 0 local.get 1 i32x4.add):

with AVX:

00000024  66 0f 6f d8               movdqa %xmm0, %xmm3
00000028  66 0f 6f d3               movdqa %xmm3, %xmm2
0000002C  c5 e9 fe c1               vpaddd %xmm1, %xmm2, %xmm0

without AVX:

00000024  66 0f fe c1               paddd %xmm1, %xmm0

Is it regalloc issue?

That looks like a common regalloc problem, yes. In this case, the regalloc will choose to move xmm0 to xmm2 since xmm0 is used both for the result and for one operand that has to be live throughout ("useRegister"); this is a conflict.

The extra move via xmm3 is a regalloc bug that I see quite often, though it does not look like there's a bug on file for it. I think this is a missing optimization in how multiple movegroups are not sensibly merged because they are separated by an LParameter node, but this is just a guess.

If you introduce a dummy parameter 0 and then use parameters 1 and 2 for the inputs to the add you'll likely see better code. You see this hack used in some of the whitebox tests for code generation sometimes, to avoid this specific problem.

In principle, if both the lhs and rhs are marked as useRegisterAtStart instead of useRegister then the regalloc should be able to generate good code here since xmm0 can be reused. But you have to be careful about how you do this, see lowerForFPU for some sample code.

If you introduce a dummy parameter 0 and then use parameters 1 and 2 for the inputs to the add you'll likely see better code. You see this hack used in some of the whitebox tests for code generation sometimes, to avoid this specific problem.

Correct, the avoiding parameter 0 bypasses the problem. The problem appears only for simple functions where the result register is one used parameter registers (e.g. it can be param 1 if param0 has type 'i32').

Assignee: nobody → ydelendik
Attachment #9250296 - Attachment description: WIP: Bug 1708743 - Enable some AVX for SIMD, JS shell preference. → Bug 1708743 - Enable some AVX for SIMD, JS shell preference. r?lth
Status: NEW → ASSIGNED

Locally, if I add test-also=--enable-avx to the directive.txt all wasm/simd/ tests pass. What will be the right way to subject avx to all simd test?

(In reply to Yury Delendik (:yury) from comment #7)

Locally, if I add test-also=--enable-avx to the directive.txt all wasm/simd/ tests pass. What will be the right way to subject avx to all simd test?

What happens if you flip the default value of avxEnabled to true and run all jit-tests and jstests? I think that's the only test that really matters. (Maybe we should do a phone call about this.)

Keywords: leave-open
Pushed by ydelendik@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/08aba69fa7b0
Enable some AVX for SIMD, JS shell preference. r=lth

I forced enabling AVX at try server https://treeherder.mozilla.org/jobs?repo=try&revision=8ae9a3914f6303d1d511a7784ba1e08de56b80e3&selectedTaskRun=S52NoQcxTBGINHjARFoFMQ.0

Locally and try server following jit-test (all codegen-x64) fail:

  • wasm/simd/bitselect-x64-ion-codegen.js
  • wasm/simd/cvt-x64-ion-codegen.js
  • wasm/simd/ion-analysis.js
  • wasm/simd/ion-bug1688713.js
  • wasm/simd/shuffle-x86-ion-codegen.js
  • wasm/simd/splat-x64-ion-codegen.js
Alias: simd-avx2
Depends on: 1747304

This seems relevant: https://bytecodealliance.zulipchat.com/#narrow/stream/217117-cranelift/topic/x64.20SIMD.20alignment, see esp later comments about optimization advice re vzeroupper and discussion about mixing SSE and AVX encodings. I think the conclusion is, Thou shalt benchmark carefully, but worth digging into the Intel docs maybe or googling for related information.

Agree. The vzeroupper has to come in play when 256-bit / full-YMM registers are used. If nothing executes 256-bit instruction then we are okay.

Though it looks like we have to inspect entire FF code and guarantee somehow "Clean Upper State" during execution of SpiderMonkey JIT code regardless of this bug -- nowadays the dependencies might use AVX2, e.g. codecs or (bergamot) intrinsics. From what I saw so far we don't do YMM in JIT itself.

(Per https://cdrdv2.intel.com/v1/dl/getContent/671488?explicitVersion=true&wapkw=intel%2064%20and%20ia-32%20architectures%20optimization%20reference%20manual , Section 15.3 Mixing AVX code with SSE code)

Allows AVX SIMD instructions on x86/x64. Mostly as experiment for benchmarking --
if success, it will be on if available.

Depends on D135560

Pushed by ydelendik@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/25fe9d00fa32
[wasm] Use IsAvxPresent flag in build ID. r=lth
Depends on: 1749788
Pushed by ydelendik@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/17a3b8d1c89a
Add javascript.options.wasm_simd_avx preference. r=jandem
Depends on: 1750049

Adds CPUID detection, if avxPresent set.

The isAvxPresent function interface is modified to check if AVX V2 is active.

Pushed by ydelendik@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/26ccca8d2d7e
Add HasAVX2() to assembler; detect via CPUID. r=lth
Blocks: 1750846
No longer blocks: 1750846
Depends on: 1750846
Depends on: 1752231
Depends on: 1753115
Depends on: 1753318
Depends on: 1753452

The x64 architectures optimization manual, section 15.3 "Mixing AVX code with SSE code", talks about runtime penalties when 256-bit and 128-bit AVX operations are mixed. After checking our code, and inserting ymm-upper-parts-are-dirty-check and running test suites, looks like we don't use 256-bit registers or properly return into the Clean (VZEROUPPER) state.

The compilers (GCC/clang) normally insert VZEROUPPER instructions on the boundaries between YMM and XMM usages, unless explicitly instructed otherwise, so libraries such as bergamot/intgemm don't represent a danger of breaking invariant of the clean state.

Depends on: 1755721
Depends on: 1757244
Depends on: 1759909

Support for AVX is implemented, and majority of masm operations were optimized to support VEX encoding. Also special case AVX2 variants were added if available.

Keywords: leave-open
Pushed by ydelendik@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/e922c3afe45f
Enable AVX support by default in release. r=lth
Status: ASSIGNED → RESOLVED
Closed: 4 months ago
Resolution: --- → FIXED
Target Milestone: --- → 101 Branch
No longer depends on: 1753452
You need to log in before you can comment on or make changes to this bug.