Closed Bug 1610770 Opened 5 years ago Closed 4 years ago

Wasm SIMD on ARM: Register sets

Tracking

()

Status:

RESOLVED WONTFIX

People

(Reporter: lth, Unassigned)

References

(Blocks 1 open bug)

Details

Attachments

(2 obsolete files)

Bug 1610770 - Introduce and use Bitset96 for FP register sets (POC) 5 years ago Lars T Hansen [:lth] 47 bytes, text/x-phabricator-request		Details \| Review
Bug 1610770 - Move defn of AllRegs from Trampoline to Architecture (Hack) 5 years ago Lars T Hansen [:lth] 47 bytes, text/x-phabricator-request		Details \| Review

Lars T Hansen [:lth]

Reporter

Description

•

5 years ago

When we add SIMD for wasm we will need to increase the size of the representation of floating point registers on ARM and ARM64. (MIPS may be the same.) On both these platforms we're already using the full 64 bits to represent FP registers (in the case of ARM the reason is obscure, see bug 1610320 comment 1, but even so it's true) and we'll need something better. On ARM64 we can use __uint128_t, but that is not available on 32-bit platforms, and it's a portability hazard anyway.

I have tried to use std::bitset, and that's almost right, but we rely heavily on find-first-bit and find-last-bit for our register allocator and it does not have that, nor does it have good mechanisms for extracting slices that we could run our own findfirst/findlast functions on. And it likes exceptions.

So I've created a new class, Bitset96, and adapted the x64 float registers to use this, and it passes all tests (so far). (x64 is ironic because it doesn't need this change.) I have changed the regalloc/architecture interface slightly so that regalloc can work either with an integer representation or a Bitset96 representation. All of this is straightforward.

The main problem is that a lot of things that are now constexpr in the regalloc can no longer be that, because Bitset96 is not integral. In particular, there's now a risk of ordering dependencies in static const initializers, and indeed I ran into one of those. There may be a way around this for somebody who knows more C++ than I do, it's worth investigating.

Lars T Hansen [:lth]

Reporter

Comment 1

•

5 years ago

Attached file Bug 1610770 - Introduce and use Bitset96 for FP register sets (POC) (obsolete) — Details

A new type Bitset96 is introduced to represent bitsets up to 96
bits, with accessors and operators that make it suitable for our
register allocator. (std::bitset is not quite suitable.)

The register allocator is abstracted slightly so that it can work
both with Bitset96 and integer representations of FP registers

x64 floating registers and masks are then encoded using Bitset96.
(This is not necessary -- on x64 an integer will be just fine. This
is just a POC.)

The main fallout is that many things that used to be constexpr no
longer are. This becomes an order-of-initialization hazard.

Lars T Hansen [:lth]

Reporter

Comment 2

•

5 years ago

Attached file Bug 1610770 - Move defn of AllRegs from Trampoline to Architecture (Hack) (obsolete) — Details

To fix an initialization order hazard introduced by the previous
patch, tentatively move a definition into the file that performs the
initializations it depends on.

Depends on D60648

Lars T Hansen [:lth]

Reporter

Comment 3

•

5 years ago

nbp, what do you think about this solution?

Flags: needinfo?(nicolas.b.pierron)

Nathan Froyd [:froydnj]

Comment 4

•

5 years ago

(In reply to Lars T Hansen [:lth] from comment #0)

I have tried to use std::bitset, and that's almost right, but we rely heavily on find-first-bit and find-last-bit for our register allocator and it does not have that, nor does it have good mechanisms for extracting slices that we could run our own findfirst/findlast functions on. And it likes exceptions.

If I'm reading the code right, libc++ supports fast (ctz/clz) find on bitsets, but such support isn't mandated by the standard. (libstdc++ 7.x does not appear to have it, and I don't have the code for MSVC's STL in front of me.)

The slices thing would be hard regardless.

Nicolas B. Pierron [:nbp]

Comment 5

•

5 years ago

When seeing the amount of code, I am worried about the inlining rules of compilers. Most of the code manipulating register set is inlined across multiple depth of function because they are mostly implemented with 1 instruction.

Going with BitSet96 seems to add a lot of overhead which might change the inlining behavior.

In the review of the POC, I suggested to maybe use simd128. However, after checking there is no equivalent of lzc/ctz, except maybe with AVX512 extension. Maybe there is a way to implement these operations using pshufb, or just counting the last 32 bits, which sounds a bit hazardous.

Flags: needinfo?(nicolas.b.pierron)

Lars T Hansen [:lth]

Reporter

Comment 6

•

5 years ago

Thanks. I really like the SIMD idea, and I followed up in a comment on the patch. It is possible that what we actually want to do is not to use SIMD per se, but we want to avoid the bit shifting, which is really just an artifact of the sets currently being represented by wide integers. When we shift something across "lanes" (ie across multiples of TotalPhys) it's just a convenience. When we shift "1", which we do a lot, we're just creating masks or inserting bits in particular lanes, and there's no reason this has to be done by simulating shifts. I think the next step here is to see whether the shift operators are really necessary at all.

Nicolas B. Pierron [:nbp]

Updated

•

5 years ago

Priority: -- → P1

Lars T Hansen [:lth]

Reporter

Comment 7

•

5 years ago

I'm going to lower this to P3 (sorry for not setting the priority in the first place) since, as comment 0 explains, we only need this for ARM, and whether we'll need it for ARM depends on whether we continue to treat ARM as tier-1.

Priority: P1 → P3

Nicolas B. Pierron [:nbp]

Comment 8

•

5 years ago

(In reply to Lars T Hansen [:lth] from comment #7)

I'm going to lower this to P3 […]

P1/P2 are about the scheduling of the work. Usually when a bug is assigned, I set it to P1 as assumed to be merged in the current version:
https://firefox-bug-handling.mozilla.org/triage-bugzilla

Lars T Hansen [:lth]

Reporter

Comment 9

•

5 years ago

(In reply to Nicolas B. Pierron [:nbp] from comment #8)

(In reply to Lars T Hansen [:lth] from comment #7)

I'm going to lower this to P3 […]

P1/P2 are about the scheduling of the work. Usually when a bug is assigned, I set it to P1 as assumed to be merged in the current version:
https://firefox-bug-handling.mozilla.org/triage-bugzilla

Fair enough, and thanks for the link. I think this bug fits better into the "This doesn’t fit into a P1, P2, P3, P4, or P5 framework" category of that page probably - it's investigative, we may find it useful to fix this, but at the moment we can wait. And since we don't know for sure what we'll do it's P3 by that criterion. I'll amend the title slightly to indicate this.

Lars T Hansen [:lth]

Reporter

Updated

•

5 years ago

Summary: Abstract the representation of floating point registers → Exploration: Abstract the representation of floating point registers, to benefit ARM

Lars T Hansen [:lth]

Reporter

Updated

•

5 years ago

No longer blocks: 1478632

Lars T Hansen [:lth]

Reporter

Updated

•

5 years ago

Blocks: wasm-simd

Lars T Hansen [:lth]

Reporter

Comment 10

•

5 years ago

Unclear whether we'll ever prioritize getting Wasm SIMD to work on ARMv7, but the patches are useful for now, so keeping the bug open.

Assignee: lhansen → nobody

Status: ASSIGNED → NEW

Summary: Exploration: Abstract the representation of floating point registers, to benefit ARM → Wasm SIMD on ARM: Register sets

Lars T Hansen [:lth]

Reporter

Updated

•

5 years ago

Depends on: 1610328

Lars T Hansen [:lth]

Reporter

Comment 11

•

4 years ago

If we ever do wasm simd for ARM, we'll use the same hack as for ARM64: the baseline compiler will allocate doubles and widen them to vectors internally, and there will be a couple of special paths in the stubs that handle vector registers. We will not need the wider register sets.

Status: NEW → RESOLVED

Closed: 4 years ago

Resolution: --- → WONTFIX

Phabricator Automation

Updated

•

4 years ago

Attachment #9122304 - Attachment is obsolete: true

Phabricator Automation

Updated

•

4 years ago

Attachment #9122302 - Attachment is obsolete: true

You need to log in before you can comment on or make changes to this bug.