Wasm SIMD on ARM: Register sets
Categories
(Core :: JavaScript Engine: JIT, task, P3)
Tracking
()
People
(Reporter: lth, Unassigned)
References
(Blocks 1 open bug)
Details
Attachments
(2 obsolete files)
When we add SIMD for wasm we will need to increase the size of the representation of floating point registers on ARM and ARM64. (MIPS may be the same.) On both these platforms we're already using the full 64 bits to represent FP registers (in the case of ARM the reason is obscure, see bug 1610320 comment 1, but even so it's true) and we'll need something better. On ARM64 we can use __uint128_t, but that is not available on 32-bit platforms, and it's a portability hazard anyway.
I have tried to use std::bitset, and that's almost right, but we rely heavily on find-first-bit and find-last-bit for our register allocator and it does not have that, nor does it have good mechanisms for extracting slices that we could run our own findfirst/findlast functions on. And it likes exceptions.
So I've created a new class, Bitset96, and adapted the x64 float registers to use this, and it passes all tests (so far). (x64 is ironic because it doesn't need this change.) I have changed the regalloc/architecture interface slightly so that regalloc can work either with an integer representation or a Bitset96 representation. All of this is straightforward.
The main problem is that a lot of things that are now constexpr in the regalloc can no longer be that, because Bitset96 is not integral. In particular, there's now a risk of ordering dependencies in static const initializers, and indeed I ran into one of those. There may be a way around this for somebody who knows more C++ than I do, it's worth investigating.
Reporter | ||
Comment 1•5 years ago
|
||
A new type Bitset96
is introduced to represent bitsets up to 96
bits, with accessors and operators that make it suitable for our
register allocator. (std::bitset is not quite suitable.)
The register allocator is abstracted slightly so that it can work
both with Bitset96 and integer representations of FP registers
x64 floating registers and masks are then encoded using Bitset96.
(This is not necessary -- on x64 an integer will be just fine. This
is just a POC.)
The main fallout is that many things that used to be constexpr no
longer are. This becomes an order-of-initialization hazard.
Reporter | ||
Comment 2•5 years ago
|
||
To fix an initialization order hazard introduced by the previous
patch, tentatively move a definition into the file that performs the
initializations it depends on.
Depends on D60648
Reporter | ||
Comment 3•5 years ago
|
||
nbp, what do you think about this solution?
Comment 4•5 years ago
|
||
(In reply to Lars T Hansen [:lth] from comment #0)
I have tried to use std::bitset, and that's almost right, but we rely heavily on find-first-bit and find-last-bit for our register allocator and it does not have that, nor does it have good mechanisms for extracting slices that we could run our own findfirst/findlast functions on. And it likes exceptions.
If I'm reading the code right, libc++ supports fast (ctz/clz) find
on bitsets, but such support isn't mandated by the standard. (libstdc++ 7.x does not appear to have it, and I don't have the code for MSVC's STL in front of me.)
The slices thing would be hard regardless.
Comment 5•5 years ago
|
||
When seeing the amount of code, I am worried about the inlining rules of compilers. Most of the code manipulating register set is inlined across multiple depth of function because they are mostly implemented with 1 instruction.
Going with BitSet96 seems to add a lot of overhead which might change the inlining behavior.
In the review of the POC, I suggested to maybe use simd128. However, after checking there is no equivalent of lzc/ctz, except maybe with AVX512 extension. Maybe there is a way to implement these operations using pshufb, or just counting the last 32 bits, which sounds a bit hazardous.
Reporter | ||
Comment 6•5 years ago
|
||
Thanks. I really like the SIMD idea, and I followed up in a comment on the patch. It is possible that what we actually want to do is not to use SIMD per se, but we want to avoid the bit shifting, which is really just an artifact of the sets currently being represented by wide integers. When we shift something across "lanes" (ie across multiples of TotalPhys) it's just a convenience. When we shift "1", which we do a lot, we're just creating masks or inserting bits in particular lanes, and there's no reason this has to be done by simulating shifts. I think the next step here is to see whether the shift operators are really necessary at all.
Updated•5 years ago
|
Reporter | ||
Comment 7•5 years ago
|
||
I'm going to lower this to P3 (sorry for not setting the priority in the first place) since, as comment 0 explains, we only need this for ARM, and whether we'll need it for ARM depends on whether we continue to treat ARM as tier-1.
Comment 8•5 years ago
|
||
(In reply to Lars T Hansen [:lth] from comment #7)
I'm going to lower this to P3 […]
P1/P2 are about the scheduling of the work. Usually when a bug is assigned, I set it to P1 as assumed to be merged in the current version:
https://firefox-bug-handling.mozilla.org/triage-bugzilla
Reporter | ||
Comment 9•5 years ago
|
||
(In reply to Nicolas B. Pierron [:nbp] from comment #8)
(In reply to Lars T Hansen [:lth] from comment #7)
I'm going to lower this to P3 […]
P1/P2 are about the scheduling of the work. Usually when a bug is assigned, I set it to P1 as assumed to be merged in the current version:
https://firefox-bug-handling.mozilla.org/triage-bugzilla
Fair enough, and thanks for the link. I think this bug fits better into the "This doesn’t fit into a P1, P2, P3, P4, or P5 framework" category of that page probably - it's investigative, we may find it useful to fix this, but at the moment we can wait. And since we don't know for sure what we'll do it's P3 by that criterion. I'll amend the title slightly to indicate this.
Reporter | ||
Updated•5 years ago
|
Reporter | ||
Comment 10•5 years ago
|
||
Unclear whether we'll ever prioritize getting Wasm SIMD to work on ARMv7, but the patches are useful for now, so keeping the bug open.
Reporter | ||
Comment 11•4 years ago
|
||
If we ever do wasm simd for ARM, we'll use the same hack as for ARM64: the baseline compiler will allocate doubles and widen them to vectors internally, and there will be a couple of special paths in the stubs that handle vector registers. We will not need the wider register sets.
Updated•4 years ago
|
Updated•4 years ago
|
Description
•