Open Bug 1401675 (wasm-multi-value) Opened 2 years ago Updated 2 months ago

[Meta] Implement multi-value


(Core :: Javascript: WebAssembly, enhancement, P3)




Tracking Status
firefox57 --- wontfix


(Reporter: luke, Unassigned)


(Depends on 2 open bugs)


(Keywords: leave-open, meta)


(7 files, 2 obsolete files)

Priority: -- → P3

As a first refactoring step, it's useful for readBlockType() to be able to call readValType() which means factoring this logic out of static functions in WasmValidate.cpp.

Assignee: nobody → luke
Attachment #9042346 - Flags: review?(lhansen)
Comment on attachment 9042346 [details] [diff] [review]

Review of attachment 9042346 [details] [diff] [review]:

Seems reasonable, with bug fixed.

::: js/src/wasm/WasmValidate.h
@@ +636,5 @@
> +    uint8_t code = uncheckedReadFixedU8();
> +    switch (code) {
> +      case ValType::AnyRef:
> +      case ValType::Ref:
> +        return ValType(ValType::Code(code), uncheckedReadVarU32());

I don't think we should read an argument for AnyRef, ie, there should not be a fallthrough for that case.
Attachment #9042346 - Flags: review?(lhansen) → review+

This patch removes the duplication between TypeAndValue/ControlStackEntry by using mozilla::Pair<> to optimize away the empty class. Confirmed no size regression locally with static asserts.

Attachment #9043040 - Flags: review?(lhansen)
Attached patch factor-subtype-check (obsolete) — Splinter Review

This patch factors out a function checkIsSubtypeOf() (with "check" prefix because it effectfully reports) which simplifies its clients' logic.

The patch also renames a dangling "Any" to "TVar" to match the previous bulk-renaming of StackType::Any to StackType::TVar.

Also, two MOZ_UNLIKELYs are removed because their paths may actually hit on validating wasm modules and I don't know how pessimal MOZ_UNLIKELY makes the unlikely branch.

Attachment #9043054 - Flags: review?(lhansen)

This patch goes a bit further by inlining typeMismatch() into its one caller (checkIsSubtypeOf()). Also, now that popAnyType() exists, popWithType() can take a ValType, not a StackType, which removes a branch in the common case.

Attachment #9043054 - Attachment is obsolete: true
Attachment #9043054 - Flags: review?(lhansen)
Attachment #9043068 - Flags: review?(lhansen)
Comment on attachment 9043040 [details] [diff] [review]

Review of attachment 9043040 [details] [diff] [review]:

Attachment #9043040 - Flags: review?(lhansen) → review+
Comment on attachment 9043068 [details] [diff] [review]

Review of attachment 9043068 [details] [diff] [review]:

Generally this seems very nice and is a welcome simplification.

I'm concerned that removing the guards on gcTypesEnabled() is not the right thing (the assertion in isRefSubtypeOf() is not a substitute for the run-time checks, it only asserts that those checks have been done).  But if I understand correctly, readValType is now used for all type reading and will apply the proper filtering, so I guess I'm OK with this.
Attachment #9043068 - Flags: review?(lhansen) → review+
Keywords: leave-open
Pushed by
Baldr: factor out Decoder::readValType() (r=lth)
Baldr: use Pair to optimize out empty values instead of specialization (r=lth)
Baldr: factor out OpIter::checkIsSubtype() (r=lth)

(In reply to Lars T Hansen [:lth] from comment #7)

But if I understand correctly, readValType is now used for all type reading and
will apply the proper filtering, so I guess I'm OK with this.

That is my understanding as well: it shouldn't be possible to get a ref ValType in the first place when !gcTypesEnabled.

Why didn't I write it this way to begin with...

Attachment #9044742 - Flags: review?(lhansen)

I often forget how readEnd()/popEnd()/readFunctionEnd() and the corresponding Ion/Baseline operations interleave. Adding a separate LabelKind for the outermost (function body) block I think makes this a bit clearer and easier to read by putting this whole sequence in one function.

Attachment #9044745 - Flags: review?(lhansen)

Unrelated, this patch removes some header dependencies to avoid pulling in some heavies.

Attachment #9044747 - Flags: review?(lhansen)

Just a WIP in case anyone is curious what multi-value is looking like. It's mostly obvious changes, it just spreads everywhere (like everywhere ExprType is used). I have WasmValidate.cpp compiling and half of WasmIonCompile.cpp.

Attachment #9044742 - Flags: review?(lhansen) → review+
Attachment #9044747 - Flags: review?(lhansen) → review+
Comment on attachment 9044745 [details] [diff] [review]

Review of attachment 9044745 [details] [diff] [review]:

Attachment #9044745 - Flags: review?(lhansen) → review+
Pushed by
Baldr: factor popWithType() (r=lth)
Baldr: simplify how function bodies end (r=lth)
Baldr: remove some unneeded header dependencies (r=lth)

Ok, finally arrived at WasmBaselineCompile.cpp and I see that the main challenge is the joinReg strategy: it is specialized for 0-or-1 block result types and now we have N. I think the simplest option here would be to stop using a join register and unconditionally store branch values into their target stack slots before jumping. While this produces slower code, my impression is that this feature is lightly used and thus it won't matter in practice. Are you ok with replacing the join registers with stack stores Lars?

Flags: needinfo?(lhansen)

I don't understand what you mean by "target stack slots". The target for a computed value is always a register, which may then be pushed onto the eval stack after the join point of a diamond (this is usually a no-op, it only records the existence of the value in the register), and then only subsequently may it be pushed onto the real CPU stack if register pressure requires it or we reach a sync before the value is consumed.

I also don't think this is "lightly used" unfortunately; binaryen tends to produce value-producing code that allows blocks and other control structures to return values. But that's a subject for telemetry :)

(As a general observation, moving the result value into memory even in the case of single-value producing blocks takes us in the wrong direction; we're already moving too much information through memory. For multi-results it may be different because I expect those to be fairly rare, comparatively speaking.)

We should probably discuss various options more synchronously, i'll ping you re a meeting time.

Flags: needinfo?(lhansen)

I think you're right that they are more widely used than I initially thought. Also, I had been thinking about this only in terms of branch values, which are rather esoteric, but the optimization also applies to plain old block fallthrough values, which are common.

Just to have some data, I instrumented a browser to measure the % of blocks which produce a value and the % increase in total stack loads and stores (including both locals and Stk::Mem*) and saw (behold my markdown!):

App % blocks w/ value % more stack loads/stores
Tanks 15.7% 3.0%
Earth 15.0% 3.2%
Meet 18.2% 4.5%
Ammo 14.0% 1.8%
PSPDFKit 26.5% 6.2%
Figma 20.5% 4.9%

Having thought about this a little bit more, I think we want to lightly generalize the joinreg mechanism to multiple values, and to have a flexible strategy for where values are stored when they go into a join "register". For the first value, it should continue to go into a register, and on register-rich platforms we could even consider multiple join registers. But for values beyond a certain point, we should allocate locations in the frame to store values into when they would otherwise be moved into a register, and then when we push the joinregs after the join point the values could be read out of those locations and manipulated in the standard way. There will be a little bit of extra work for those values, but that's probably OK; zero to two values will probably cover almost all cases anyway, and most of those will still be zero and one.

The main problem with allocating space to receive multiple values is that we don't know how much space we need, and of which types, until the end of the function. Since the Frame and the locals areas are fixed-size we could allocate the space for multiple values before or after the locals, depending on what's easiest, and then address both the MV area and the Locals area off the FP instead of the SP, while the dynamic part of the frame would still be accessed off the SP. This would allow us not to worry about how big the area for multiple values is. There's some reengineering required, but most/all of this logic should be localized to the BaseStackFrame already.

The same space to receive multiple values could also be used for receiving multiple values via calls, though it depends on what your thoughts are about the ABI for that; plus it's easier to set up the value-return area for calls as part of the outgoing arguments, so it's a minor consideration.

(Something that worries me more than register / stack slot management is that with multiple values, some hot data structures in the baseline compiler will become variable-size and may require heap allocation or significantly more space, neither of which is a pleasant prospect.)

Depends on: 1529622

In terms of stack location allocation in the baseline compiler, I am not sure I see the difference between multiple-value results and the expression stack. Both spill to the stack as needed, precisely because we don't know the frame size until the end, and both have the same stack discipline.

I am going to give this a try: when entering a block with nargs nvals > JoinRegisterCount, we decrement the stack pointer appropriately to reserve space for the extra values. When returning nvals > JoinRegisterCount, write the extra values to the stack at the block's saved stack height; since we eagerly reserve space there's no shuffling needed. In the block there are appropriate Stk entries for all block parameters (stack-spilled or no), as well as return values from a nested end.

For the call ABI -- I am not entirely sure about this but I would propose reserving space for nresults > ReturnRegCount before any spilled arguments, with stack return value N having a higher address than stack return value N+1. The stack return area should be word-aligned, which is already the case for baseline frames AFAIU. At return sites, appropriate Stk entries are present indicating the stack return values if any; I think it works out OK here so that no shuffling is needed.

As far as variable-sized data structures go. AFAIU the important bit is the ExprType for return types for blocks and function calls, and in the future for block parameters. I agree that creating a bunch of potentially-heap-allocated data structures would be a lose. I would propose instead a stack of block result types, and block result types become an index-and-count into this stack. Two u32s should be sufficient for references to result types. The stack itself can be supplied by the caller (like stk_) if needed. I'll try this refactor first to see if it's perf-neutral.

Sounds pretty plausible. When you say to reserve space for results "before" any spilled arguments, do you mean at higher addresses than the spilled args (what I think) or at lower?

(In reply to Lars T Hansen [:lth] from comment #24)

When you say to reserve space for results "before" any spilled arguments, do you mean at higher addresses than the spilled args (what I think) or at lower?

HIgher, as you were thinking. Allows for sensible handling of stack return args in tail calls.

Just an update here, it took me more time than expected to adapt the compiler to expect that blocks and functions could have multiple results. I ended up making it so that wasm::OpIter keeps a ValTypeVector stack of result and parameter types for all live blocks. Code refers to this stack using class ResultTypeRef { size_t stackBase; uint32_t resultCount; } and class FuncTypeRef : public ResultTypeRef { uint32_t paramCount; } types; these types mostly replace ExprType.

Changing FuncType to have a vector of results instead of an ExprType return value took a bit of time though; I'll probably have to split this patch.

I haven't started with codegen yet, obviously.

Anyway, things are en route and hopefully I'll post patches on Monday.

Just to check, because I worry about scalars being changed to vectors in hot places: is your work based on the WIP patch, and using BlockType/ResultType (which are intended to store indices into the type section), and the ValTypeVector is being introduced somewhere that the WIP patch hadn't gotten to (e.g., the results in a function type)?

:luke I am not sure but I may have wasted some time with this patch doing a wrong thing. I had started with your patch but when bashing it against the baseline compiler I may have gone off the rails.

The WIP patch there keeps a stack of the types of the live blocks. When a block is entered, the opiter pushes its result types onto the stack, then its parameter types. A block type is represented as a cursor into that stack of types. Interestingly Ion doesn't seem to care much at all about the types, but the baseline compiler does need them.

The advantage of putting types on a stack is that you can deal with them very uniformly. The 0 parameter, 0 or 1 return cases fall out really nicely, with no branches like if you had to pick apart a PackedTypeCode. But, you (sometimes) need to copy around a bigger cursor into the stack than your ResultType, as it needs a stack base and number of values, whereas your ResultType is always a single word. Also you need to copy types onto the stack; perhaps this isn't an issue though given that most blocks will have zero parameters and zero or one return values.

See ResultTypeRef and FuncTypeRef in the patch. Also, the FuncTypeRef / ResultTypeRef formulation avoids introducing the concept of a BlockType into the source.

I am going to let the idea marinate a bit and in the mean-time factor out the patches to have FuncType return a const ValTypeVector& results() rather than an ExprType ret(). What is your instinct about type-code versus type stack?

Ah, ok, thanks for explaining. I was worried about sticking vectors into vectors, but I see what's going on. Pushing block params/results onto the stack has some nice regularity properties. One worry, maybe it's too theoretical, I dunno, is that there could be large block signatures that are used repeatedly in a function (noting that a block with signature T^N --> T^N doesn't need to contain N instructions, it can just be empty, and so there could be O(N*M) copying for M bytes of bytecode). Another theoretical O(N*M) case is if you deeply nest blocks with big signatures.

I think you're right that the super-common case is and will be []->[] and []->[valtype], but I think inlining+branch-predication will make the packed-word representation no worse than the more homogeneous representation.

Are there any problems introduced by representing block types as indices into the type section that are solved by copying them onto the stack?

Also, the FuncTypeRef / ResultTypeRef formulation avoids introducing the concept of a BlockType into the source.

This is more aesthetic, but personally I think the relationship between valtype, resulttype, functype and blocktype is complicated enough that I found it nice to have ValType/ResultType/FuncType/BlockType C++ types that correspond 1:1 with what's written in the spec (e.g., for blocktype).

Alias: wasm-multi-value
Depends on: 1576900
Depends on: 1577508

@luke: current draft here -- A few tests failing for some reason, will ping you for review when all are done, but I decided to try to the pointer-tagging approach; early thoughts welcome.

Depends on: 1577757
Component: JavaScript Engine → Javascript: WebAssembly
Attachment #9088116 - Attachment is obsolete: true
Depends on: 1578418

In the baseline compiler, when finishing a block normally or jumping to the end of a block, with the approach of comment 23 we are guaranteed that parameters to the branch will not alias any stack-spilled result, because the space for the stack result is in the outer block.

However I am having a hard time convincing myself this is the case for jumping back to a loop header with parameters. If we model block parameters as being on the expression stack at block entry, that means they may get popped off in the course of the block, and that new values may get pushed onto the stack, so the locations of the parameters of the br may alias the locations of the parameters of the loop.

I guess the standard thing to do, besides a parallel-move solution, is push copies onto the stack so they don't alias, then shuffle down.

Good question. First, iiuc the problem correctly, it seems like this would be a problem for both forward and backwards branches because, in both cases, you have the situation where the stack contains a sequence of values V1...Vn and you need to copy V(n-k+1) down to v1...vk, which means you have to worry about clobbering the overlap when k > n/2. I think what's useful here is, assuming we do a sync() for these more-than-one-param-or-result cases, so that everything is fully flushed to the stack, we don't have a generalized parallel-move, because the values are contiguous; rather, I think we can do what memmove does and copy in the right order (which has a fixed direction).

@luke With the approach of comment 23, I think it is not a problem for forwards branches, for the reason I mentioned in the first paragraph of comment 32. Stack locations for different blocks are disjoint, and the spilled results go in the outer block (the continuation of the block returning multiple values), and the outer block reserves the stack space before entering the inner block.

I think also that reasoning in terms of value counts isn't sufficient, as different value types have different stack sizes AFAIU. i32 versus i64 on armv7, for instance.

I am also not sure that the values will necessarily be in order on the stack after a sync(). The reason being, consider you have (block (param i32 i32) (result i32 i32) end). Assuming we only put a single value in a register, the first value is in a register and the second is in memory. If you sync() before popping at block-exit, then the first value ends up after the second: badness. You can try to pop all return values to registers before syncing, but we don't have as many registers as return values, in general.

Thinking more broadly... I was planning on having the return convention being similar to the calling convention, where values go in registers as long as they are available, then get spilled on the stack. However I think that violates an invariant of the baseline register allocator: if temporary value stk[N] is spilled to the stack, then for all M &lt; N, stk[M] is also not in a register.

So... I think for nvalues > 1, we need to allocate all values onto the stack. We can still allow calls to return the first few values in registers, but the baseline compiler will eagerly write them out to the stack after a return, having reserved space for them before the call. A sad asymmetry with regards to the nvalues <= 1 cases, but oh well. If we take this approach, your sync()-then-memmove() approach works.

Alternately!! We could instead use a multi-value ABI in baseline in which values 1..count-N are on the stack, in order starting from 1, and values count-N+1..count are in registers. Then we get the memmove shuffle and we can allow the top value(s) to stay in registers, and it works for parameters as well as results.

I think we could use the same ABI for multi-result calls as well across baldr. I will give this a go. I had thought that it was somehow more important for the first return values to be in registers but it's impossible in the baseline compiler and I think for ion/cranelift it doesn't matter.

The disadvantage of course would be that not all results are alike; you might run out of GPRs but still have FPRs. Adopting this approach across baldr might be too restrictive. But... for baseline, it would be nice to simplify :)

@luke there is a problem with Stk elements that are constants, and thus not on the machine stack. We can't just shuffle down the stack arguments because not all arguments are on the stack. However, given that the isMem() values are in the same order on the stack as the result locations, we can partition them into two contiguous sets: those that will be shuffled down, and those which are shuffled up, and we can thus scatter mem values into result locations without an intermediate copy.

Just an update here -- I am still plugging away at this. There are some annoying edge cases like this one:

(loop (param i32 i32 i32)
  (drop) (drop) (drop)
  (i32.const 42) (i32.const 42) (i32.const 42) (br 1))

Here we have a jump where the size of the target dynamic stack is larger than the source (because the const values aren't on the machine stack).

But, I am close.

Depends on: 1584097
Depends on: 1584112
Depends on: 1585909
Depends on: 1585990
Depends on: 1586207
Depends on: 1586249
Depends on: 1590083
Depends on: 1592926
Depends on: 1592958
Depends on: 1592973
Depends on: 1592974
Depends on: 1592983
Depends on: 1593015
Depends on: 1593027
Depends on: 1593247
Depends on: 1593618
See Also: → 1594655
Depends on: 1595031
Assignee: luke → nobody
Keywords: meta
Summary: Baldr: implement multi-value → [Meta] Implement multi-value
Depends on: 1595825
You need to log in before you can comment on or make changes to this bug.