Closed Bug 1610770 Opened 5 years ago Closed 4 years ago

Wasm SIMD on ARM: Register sets

Categories

(Core :: JavaScript Engine: JIT, task, P3)

task

Tracking

()

RESOLVED WONTFIX

People

(Reporter: lth, Unassigned)

References

(Blocks 1 open bug)

Details

Attachments

(2 obsolete files)

When we add SIMD for wasm we will need to increase the size of the representation of floating point registers on ARM and ARM64. (MIPS may be the same.) On both these platforms we're already using the full 64 bits to represent FP registers (in the case of ARM the reason is obscure, see bug 1610320 comment 1, but even so it's true) and we'll need something better. On ARM64 we can use __uint128_t, but that is not available on 32-bit platforms, and it's a portability hazard anyway.

I have tried to use std::bitset, and that's almost right, but we rely heavily on find-first-bit and find-last-bit for our register allocator and it does not have that, nor does it have good mechanisms for extracting slices that we could run our own findfirst/findlast functions on. And it likes exceptions.

So I've created a new class, Bitset96, and adapted the x64 float registers to use this, and it passes all tests (so far). (x64 is ironic because it doesn't need this change.) I have changed the regalloc/architecture interface slightly so that regalloc can work either with an integer representation or a Bitset96 representation. All of this is straightforward.

The main problem is that a lot of things that are now constexpr in the regalloc can no longer be that, because Bitset96 is not integral. In particular, there's now a risk of ordering dependencies in static const initializers, and indeed I ran into one of those. There may be a way around this for somebody who knows more C++ than I do, it's worth investigating.

A new type Bitset96 is introduced to represent bitsets up to 96
bits, with accessors and operators that make it suitable for our
register allocator. (std::bitset is not quite suitable.)

The register allocator is abstracted slightly so that it can work
both with Bitset96 and integer representations of FP registers

x64 floating registers and masks are then encoded using Bitset96.
(This is not necessary -- on x64 an integer will be just fine. This
is just a POC.)

The main fallout is that many things that used to be constexpr no
longer are. This becomes an order-of-initialization hazard.

To fix an initialization order hazard introduced by the previous
patch, tentatively move a definition into the file that performs the
initializations it depends on.

Depends on D60648

nbp, what do you think about this solution?

Flags: needinfo?(nicolas.b.pierron)

(In reply to Lars T Hansen [:lth] from comment #0)

I have tried to use std::bitset, and that's almost right, but we rely heavily on find-first-bit and find-last-bit for our register allocator and it does not have that, nor does it have good mechanisms for extracting slices that we could run our own findfirst/findlast functions on. And it likes exceptions.

If I'm reading the code right, libc++ supports fast (ctz/clz) find on bitsets, but such support isn't mandated by the standard. (libstdc++ 7.x does not appear to have it, and I don't have the code for MSVC's STL in front of me.)

The slices thing would be hard regardless.

When seeing the amount of code, I am worried about the inlining rules of compilers. Most of the code manipulating register set is inlined across multiple depth of function because they are mostly implemented with 1 instruction.

Going with BitSet96 seems to add a lot of overhead which might change the inlining behavior.

In the review of the POC, I suggested to maybe use simd128. However, after checking there is no equivalent of lzc/ctz, except maybe with AVX512 extension. Maybe there is a way to implement these operations using pshufb, or just counting the last 32 bits, which sounds a bit hazardous.

Flags: needinfo?(nicolas.b.pierron)

Thanks. I really like the SIMD idea, and I followed up in a comment on the patch. It is possible that what we actually want to do is not to use SIMD per se, but we want to avoid the bit shifting, which is really just an artifact of the sets currently being represented by wide integers. When we shift something across "lanes" (ie across multiples of TotalPhys) it's just a convenience. When we shift "1", which we do a lot, we're just creating masks or inserting bits in particular lanes, and there's no reason this has to be done by simulating shifts. I think the next step here is to see whether the shift operators are really necessary at all.

Priority: -- → P1

I'm going to lower this to P3 (sorry for not setting the priority in the first place) since, as comment 0 explains, we only need this for ARM, and whether we'll need it for ARM depends on whether we continue to treat ARM as tier-1.

Priority: P1 → P3

(In reply to Lars T Hansen [:lth] from comment #7)

I'm going to lower this to P3 […]

P1/P2 are about the scheduling of the work. Usually when a bug is assigned, I set it to P1 as assumed to be merged in the current version:
https://firefox-bug-handling.mozilla.org/triage-bugzilla

(In reply to Nicolas B. Pierron [:nbp] from comment #8)

(In reply to Lars T Hansen [:lth] from comment #7)

I'm going to lower this to P3 […]

P1/P2 are about the scheduling of the work. Usually when a bug is assigned, I set it to P1 as assumed to be merged in the current version:
https://firefox-bug-handling.mozilla.org/triage-bugzilla

Fair enough, and thanks for the link. I think this bug fits better into the "This doesn’t fit into a P1, P2, P3, P4, or P5 framework" category of that page probably - it's investigative, we may find it useful to fix this, but at the moment we can wait. And since we don't know for sure what we'll do it's P3 by that criterion. I'll amend the title slightly to indicate this.

Summary: Abstract the representation of floating point registers → Exploration: Abstract the representation of floating point registers, to benefit ARM
No longer blocks: 1478632
Blocks: wasm-simd

Unclear whether we'll ever prioritize getting Wasm SIMD to work on ARMv7, but the patches are useful for now, so keeping the bug open.

Assignee: lhansen → nobody
Status: ASSIGNED → NEW
Summary: Exploration: Abstract the representation of floating point registers, to benefit ARM → Wasm SIMD on ARM: Register sets
Depends on: 1610328

If we ever do wasm simd for ARM, we'll use the same hack as for ARM64: the baseline compiler will allocate doubles and widen them to vectors internally, and there will be a couple of special paths in the stubs that handle vector registers. We will not need the wider register sets.

Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → WONTFIX
Attachment #9122304 - Attachment is obsolete: true
Attachment #9122302 - Attachment is obsolete: true
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: