[Meta] Implement multi-value
Categories: Core :: JavaScript: WebAssembly, enhancement, P2
Tracking: firefox57 wontfix
People: Reporter: luke; Assigned: wingo
Keywords: meta
Attachments (7 files, 2 obsolete files):
- patch, 16.98 KB (lth: review+)
- patch, 5.49 KB (lth: review+)
- patch, 10.26 KB (lth: review+)
- patch, 1.70 KB (lth: review+)
- patch, 5.14 KB (lth: review+)
- patch, 5.88 KB (lth: review+)
- patch, 69.68 KB
Updated•7 years ago
Reporter
Comment 1•6 years ago
As a first refactoring step, it's useful for readBlockType() to be able to call readValType() which means factoring this logic out of static functions in WasmValidate.cpp.
Comment 2•6 years ago
Comment on attachment 9042346 [details] [diff] [review] factor-read-val-types
Review of attachment 9042346 [details] [diff] [review]:
-----------------------------------------------------------------
Seems reasonable, with bug fixed.
::: js/src/wasm/WasmValidate.h
@@ +636,5 @@
> +  uint8_t code = uncheckedReadFixedU8();
> +  switch (code) {
> +    case ValType::AnyRef:
> +    case ValType::Ref:
> +      return ValType(ValType::Code(code), uncheckedReadVarU32());
I don't think we should read an argument for AnyRef, ie, there should not be a fallthrough for that case.
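The decoding rule lth is asking for — only `Ref` carries a trailing type index, while `AnyRef` does not — can be sketched in a standalone model. All names, type codes, and the varuint reader below are simplified stand-ins for the real WasmValidate.cpp machinery, not the actual SpiderMonkey definitions:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, simplified model of the reviewed decoder logic: type codes
// are one byte, and only Ref carries a trailing varuint32 type index.
enum Code : uint8_t { I32 = 0x7f, AnyRef = 0x6f, Ref = 0x6e };

struct ValType {
    Code code;
    uint32_t refIndex;  // only meaningful when code == Ref
};

struct Decoder {
    const uint8_t* p;
    uint8_t readFixedU8() { return *p++; }
    uint32_t readVarU32() {  // minimal LEB128 reader
        uint32_t result = 0; int shift = 0; uint8_t b;
        do { b = *p++; result |= uint32_t(b & 0x7f) << shift; shift += 7; }
        while (b & 0x80);
        return result;
    }
    ValType readValType() {
        uint8_t code = readFixedU8();
        switch (code) {
            case Ref:
                // Only Ref reads a type-index argument...
                return ValType{Code(code), readVarU32()};
            case AnyRef:  // ...AnyRef deliberately shares no fallthrough
            case I32:     // with Ref; it takes no argument.
                return ValType{Code(code), 0};
            default:
                assert(false && "unknown type code");
                return ValType{Code(code), 0};
        }
    }
};
```

With the fallthrough removed, a stream like `0x6e 0x05` decodes as `Ref` with index 5, while a bare `0x6f` decodes as `AnyRef` without consuming any further bytes.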
Reporter
Comment 3•6 years ago
This patch removes the duplication between TypeAndValue/ControlStackEntry by using mozilla::Pair<> to optimize away the empty class. Confirmed no size regression locally with static asserts.
Reporter
Comment 4•6 years ago
This patch factors out a function checkIsSubtypeOf() (with "check" prefix because it effectfully reports) which simplifies its clients' logic.
The patch also renames a dangling "Any" to "TVar" to match the previous bulk-renaming of StackType::Any to StackType::TVar.
Also, two MOZ_UNLIKELYs are removed because their paths may actually hit on validating wasm modules and I don't know how pessimal MOZ_UNLIKELY makes the unlikely branch.
Reporter
Comment 5•6 years ago
This patch goes a bit further by inlining typeMismatch() into its one caller (checkIsSubtypeOf()). Also, now that popAnyType() exists, popWithType() can take a ValType, not a StackType, which removes a branch in the common case.
Comment 6•6 years ago
Comment on attachment 9043040 [details] [diff] [review] remove-nothing-specialization
Review of attachment 9043040 [details] [diff] [review]:
-----------------------------------------------------------------
Sweet.
Comment 7•6 years ago
Comment on attachment 9043068 [details] [diff] [review] factor-subtype-check
Review of attachment 9043068 [details] [diff] [review]:
-----------------------------------------------------------------
Generally this seems very nice and is a welcome simplification. I'm concerned that removing the guards on gcTypesEnabled() is not the right thing (the assertion in isRefSubtypeOf() is not a substitute for the run-time checks; it only asserts that those checks have been done). But if I understand correctly, readValType is now used for all type reading and will apply the proper filtering, so I guess I'm OK with this.
Reporter
Updated•6 years ago
Pushed by lwagner@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/17988b9f30d4 Baldr: factor out Decoder::readValType() (r=lth)
https://hg.mozilla.org/integration/mozilla-inbound/rev/6ae6def2e6ed Baldr: use Pair to optimize out empty values instead of specialization (r=lth)
https://hg.mozilla.org/integration/mozilla-inbound/rev/d29b60b57b23 Baldr: factor out OpIter::checkIsSubtype() (r=lth)
Reporter
Comment 9•6 years ago
(In reply to Lars T Hansen [:lth] from comment #7)
> But if I understand correctly, readValType is now used for all type reading
> and will apply the proper filtering, so I guess I'm OK with this.
That is my understanding as well: it shouldn't be possible to get a ref ValType in the first place when !gcTypesEnabled.
Comment 10•6 years ago
bugherder
Comment 11•6 years ago
bugherder
Reporter
Comment 12•6 years ago
Why didn't I write it this way to begin with...
Reporter
Comment 13•6 years ago
I often forget how readEnd()/popEnd()/readFunctionEnd() and the corresponding Ion/Baseline operations interleave. I think adding a separate LabelKind for the outermost (function body) block makes this a bit clearer and easier to read by putting the whole sequence in one function.
Reporter
Comment 14•6 years ago
Unrelated, this patch removes some header dependencies to avoid pulling in some heavies.
Reporter
Comment 15•6 years ago
Just a WIP in case anyone is curious what multi-value is looking like. It's mostly obvious changes, it just spreads everywhere (like everywhere ExprType is used). I have WasmValidate.cpp compiling and half of WasmIonCompile.cpp.
Updated•6 years ago
Comment 16•6 years ago
Comment on attachment 9044745 [details] [diff] [review] simplify-func-body
Review of attachment 9044745 [details] [diff] [review]:
-----------------------------------------------------------------
Nice.
Comment 17•6 years ago
Pushed by lwagner@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/e5fb09214df8 Baldr: factor popWithType() (r=lth)
https://hg.mozilla.org/integration/mozilla-inbound/rev/f6323e932e3e Baldr: simplify how function bodies end (r=lth)
https://hg.mozilla.org/integration/mozilla-inbound/rev/6b4ac180a1e3 Baldr: remove some unneeded header dependencies (r=lth)
Reporter
Comment 18•6 years ago
Ok, finally arrived at WasmBaselineCompile.cpp, and I see that the main challenge is the joinReg strategy: it is specialized for 0-or-1 block result types, and now we have N. I think the simplest option here would be to stop using a join register and unconditionally store branch values into their target stack slots before jumping. While this produces slower code, my impression is that this feature is lightly used and thus it won't matter in practice. Are you ok with replacing the join registers with stack stores, Lars?
Comment 19•6 years ago
bugherder
Comment 20•6 years ago
I don't understand what you mean by "target stack slots". The target for a computed value is always a register, which may then be pushed onto the eval stack after the join point of a diamond (this is usually a no-op, it only records the existence of the value in the register), and then only subsequently may it be pushed onto the real CPU stack if register pressure requires it or we reach a sync before the value is consumed.
I also don't think this is "lightly used" unfortunately; binaryen tends to produce value-producing code that allows blocks and other control structures to return values. But that's a subject for telemetry :)
(As a general observation, moving the result value into memory even in the case of single-value producing blocks takes us in the wrong direction; we're already moving too much information through memory. For multi-results it may be different because I expect those to be fairly rare, comparatively speaking.)
We should probably discuss the various options more synchronously; I'll ping you about a meeting time.
Reporter
Comment 21•6 years ago
I think you're right that they are more widely used than I initially thought. Also, I had been thinking about this only in terms of branch values, which are rather esoteric, but the optimization also applies to plain old block fallthrough values, which are common.
Just to have some data, I instrumented a browser to measure the % of blocks which produce a value and the % increase in total stack loads and stores (including both locals and Stk::Mem*) and saw (behold my markdown!):
App | % blocks w/ value | % more stack loads/stores
---|---|---
Tanks | 15.7% | 3.0%
Earth | 15.0% | 3.2%
Meet | 18.2% | 4.5%
Ammo | 14.0% | 1.8%
PSPDFKit | 26.5% | 6.2%
Figma | 20.5% | 4.9%
Comment 22•6 years ago
Having thought about this a little bit more, I think we want to lightly generalize the joinreg mechanism to multiple values, and to have a flexible strategy for where values are stored when they go into a join "register". For the first value, it should continue to go into a register, and on register-rich platforms we could even consider multiple join registers. But for values beyond a certain point, we should allocate locations in the frame to store values into when they would otherwise be moved into a register, and then when we push the joinregs after the join point the values could be read out of those locations and manipulated in the standard way. There will be a little bit of extra work for those values, but that's probably OK; zero to two values will probably cover almost all cases anyway, and most of those will still be zero and one.
The main problem with allocating space to receive multiple values is that we don't know how much space we need, and of which types, until the end of the function. Since the Frame and the locals areas are fixed-size we could allocate the space for multiple values before or after the locals, depending on what's easiest, and then address both the MV area and the Locals area off the FP instead of the SP, while the dynamic part of the frame would still be accessed off the SP. This would allow us not to worry about how big the area for multiple values is. There's some reengineering required, but most/all of this logic should be localized to the BaseStackFrame already.
The same space to receive multiple values could also be used for receiving multiple values via calls, though it depends on what your thoughts are about the ABI for that; plus it's easier to set up the value-return area for calls as part of the outgoing arguments, so it's a minor consideration.
(Something that worries me more than register / stack slot management is that with multiple values, some hot data structures in the baseline compiler will become variable-size and may require heap allocation or significantly more space, neither of which is a pleasant prospect.)
Assignee
Comment 23•5 years ago
In terms of stack location allocation in the baseline compiler, I am not sure I see the difference between multiple-value results and the expression stack. Both spill to the stack as needed, precisely because we don't know the frame size until the end, and both have the same stack discipline.
I am going to give this a try: when entering a block with nvals > JoinRegisterCount, we decrement the stack pointer appropriately to reserve space for the extra values. When returning nvals > JoinRegisterCount, we write the extra values to the stack at the block's saved stack height; since we eagerly reserve space, there's no shuffling needed. In the block there are appropriate Stk entries for all block parameters (stack-spilled or not), as well as return values from a nested end.
For the call ABI -- I am not entirely sure about this, but I would propose reserving space for nresults > ReturnRegCount before any spilled arguments, with stack return value N having a higher address than stack return value N+1. The stack return area should be word-aligned, which is already the case for baseline frames AFAIU. At return sites, appropriate Stk entries are present indicating the stack return values, if any; I think it works out OK here so that no shuffling is needed.
As far as variable-sized data structures go: AFAIU the important bit is the ExprType for return types for blocks and function calls, and in the future for block parameters. I agree that creating a bunch of potentially-heap-allocated data structures would be a lose. I would propose instead a stack of block result types, where a block result type becomes an index-and-count into this stack. Two u32s should be sufficient for references to result types. The stack itself can be supplied by the caller (like stk_) if needed. I'll try this refactor first to see if it's perf-neutral.
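The index-and-count representation proposed here can be sketched in a standalone model. Everything below is illustrative, not the actual SpiderMonkey code: ValType is a stub enum, and BlockTypeStack stands in for the caller-supplied vector:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Sketch: all live blocks' result types live in one shared vector, and a
// block's result type is just an (index, count) pair into it -- two u32s,
// no per-block heap allocation.
enum class ValType : uint8_t { I32, I64, F32, F64 };

struct ResultTypeRef {
    uint32_t stackBase;    // index of first result type in the shared stack
    uint32_t resultCount;  // number of result types
};

class BlockTypeStack {
    std::vector<ValType> types_;
public:
    // On block entry, push the block's result types and return a cheap ref.
    ResultTypeRef push(const std::vector<ValType>& results) {
        ResultTypeRef ref{uint32_t(types_.size()), uint32_t(results.size())};
        types_.insert(types_.end(), results.begin(), results.end());
        return ref;
    }
    // On block exit, pop the types; the discipline matches block nesting.
    void pop(ResultTypeRef ref) {
        assert(types_.size() == ref.stackBase + ref.resultCount);
        types_.resize(ref.stackBase);
    }
    ValType result(ResultTypeRef ref, uint32_t i) const {
        assert(i < ref.resultCount);
        return types_[ref.stackBase + i];
    }
};
```

Because refs are plain index/count pairs, copying one around costs two u32s regardless of how many result types the block has.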
Comment 24•5 years ago
Sounds pretty plausible. When you say to reserve space for results "before" any spilled arguments, do you mean at higher addresses than the spilled args (what I think) or at lower?
Assignee
Comment 25•5 years ago
(In reply to Lars T Hansen [:lth] from comment #24)
> When you say to reserve space for results "before" any spilled arguments,
> do you mean at higher addresses than the spilled args (what I think) or at lower?
Higher, as you were thinking. It allows for sensible handling of stack return args in tail calls.
Assignee
Comment 26•5 years ago
Just an update here; it took me more time than expected to adapt the compiler to expect that blocks and functions can have multiple results. I ended up making it so that wasm::OpIter keeps a ValTypeVector stack of result and parameter types for all live blocks. Code refers to this stack using class ResultTypeRef { size_t stackBase; uint32_t resultCount; } and class FuncTypeRef : public ResultTypeRef { uint32_t paramCount; } types; these types mostly replace ExprType.
Changing FuncType to have a vector of results instead of an ExprType return value took a bit of time, though; I'll probably have to split this patch.
I haven't started with codegen yet, obviously.
Anyway, things are en route and hopefully I'll post patches on Monday.
Reporter
Comment 27•5 years ago
Just to check, because I worry about scalars being changed to vectors in hot places: is your work based on the WIP patch, using BlockType/ResultType (which are intended to store indices into the type section), with the ValTypeVector being introduced somewhere that the WIP patch hadn't gotten to (e.g., the results in a function type)?
Assignee
Comment 28•5 years ago
Assignee
Comment 29•5 years ago
:luke I am not sure but I may have wasted some time with this patch doing a wrong thing. I had started with your patch but when bashing it against the baseline compiler I may have gone off the rails.
The WIP patch there keeps a stack of the types of the live blocks. When a block is entered, the opiter pushes its result types onto the stack, then its parameter types. A block type is represented as a cursor into that stack of types. Interestingly Ion doesn't seem to care much at all about the types, but the baseline compiler does need them.
The advantage of putting types on a stack is that you can deal with them very uniformly. The 0 parameter, 0 or 1 return cases fall out really nicely, with no branches like if you had to pick apart a PackedTypeCode. But, you (sometimes) need to copy around a bigger cursor into the stack than your ResultType, as it needs a stack base and number of values, whereas your ResultType is always a single word. Also you need to copy types onto the stack; perhaps this isn't an issue though given that most blocks will have zero parameters and zero or one return values.
See ResultTypeRef and FuncTypeRef in the patch. Also, the FuncTypeRef / ResultTypeRef formulation avoids introducing the concept of a BlockType into the source.
I am going to let the idea marinate a bit and, in the meantime, factor out the patches to have FuncType return a const ValTypeVector& results() rather than an ExprType ret(). What is your instinct about type-code versus type stack?
Reporter
Comment 30•5 years ago
Ah, ok, thanks for explaining. I was worried about sticking vectors into vectors, but I see what's going on. Pushing block params/results onto the stack has some nice regularity properties. One worry, maybe it's too theoretical, I dunno, is that there could be large block signatures that are used repeatedly in a function (noting that a block with signature T^N --> T^N doesn't need to contain N instructions, it can just be empty, and so there could be O(N*M) copying for M bytes of bytecode). Another theoretical O(N*M) case is if you deeply nest blocks with big signatures.
I think you're right that the super-common case is and will be []->[] and []->[valtype], but I think inlining + branch prediction will make the packed-word representation no worse than the more homogeneous representation.
Are there any problems introduced by representing block types as indices into the type section that are solved by copying them onto the stack?
> Also, the FuncTypeRef / ResultTypeRef formulation avoids introducing the
> concept of a BlockType into the source.
This is more aesthetic, but personally I think the relationship between valtype, resulttype, functype, and blocktype is complicated enough that I found it nice to have ValType/ResultType/FuncType/BlockType C++ types that correspond 1:1 with what's written in the spec (e.g., for blocktype).
Assignee
Updated•5 years ago
Assignee
Comment 31•5 years ago
@luke: current draft here -- https://phabricator.services.mozilla.com/D43977#change-sGLmZ16e1sci. A few tests are failing for some reason; I will ping you for review when all are done, but I decided to try the pointer-tagging approach. Early thoughts welcome.
Updated•5 years ago
Assignee
Comment 32•5 years ago
In the baseline compiler, when finishing a block normally or jumping to the end of a block, with the approach of comment 23 we are guaranteed that parameters to the branch will not alias any stack-spilled result, because the space for the stack result is in the outer block.
However, I am having a hard time convincing myself this is the case for jumping back to a loop header with parameters. If we model block parameters as being on the expression stack at block entry, they may get popped off in the course of the block, and new values may get pushed onto the stack, so the locations of the parameters of the br may alias the locations of the parameters of the loop.
I guess the standard thing to do, besides a parallel-move solution, is push copies onto the stack so they don't alias, then shuffle down.
Reporter
Comment 33•5 years ago
Good question. First, if I understand the problem correctly, it seems like this would be a problem for both forward and backwards branches because, in both cases, the stack contains a sequence of values V1...Vn and you need to copy V(n-k+1)...Vn down to V1...Vk, which means you have to worry about clobbering the overlap when k > n/2. I think what's useful here is that, assuming we do a sync() for these more-than-one-param-or-result cases so that everything is fully flushed to the stack, we don't need a generalized parallel move, because the values are contiguous; rather, I think we can do what memmove does and copy in the right order (which has a fixed direction).
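The direction argument can be made concrete with a small standalone sketch. Slot contents are modeled as plain ints here; this illustrates only the copy order, not the actual baseline-compiler data structures:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// "Memmove-style" shuffle: after a sync(), the branch's k values sit fully
// flushed at the top of a contiguous run of n stack slots and must move
// down to the bottom of that run. Because every source slot is at a higher
// index than its destination, a single ascending copy never clobbers a
// value it still needs -- no generalized parallel move required.
void shuffleResultsDown(std::vector<int>& slots, size_t n, size_t k) {
    assert(k <= n && n <= slots.size());
    size_t base = slots.size() - n;  // slot of V1
    size_t src = base + (n - k);     // slot of V(n-k+1)
    for (size_t i = 0; i < k; ++i) {
        slots[base + i] = slots[src + i];  // dest index < src index: safe
    }
    slots.resize(base + k);  // the branch target keeps only k values
}
```

With n = 5 and k = 3 the source and destination ranges overlap (k > n/2), which is exactly the case where copying in the wrong direction would clobber values.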
Assignee
Comment 34•5 years ago
@luke With the approach of comment 23, I think it is not a problem for forward branches, for the reason I mentioned in the first paragraph of comment 32. Stack locations for different blocks are disjoint, the spilled results go in the outer block (the continuation of the block returning multiple values), and the outer block reserves the stack space before entering the inner block.
I think also that reasoning in terms of value counts isn't sufficient, as different value types have different stack sizes AFAIU: i32 versus i64 on armv7, for instance.
I am also not sure that the values will necessarily be in order on the stack after a sync(). The reason being: consider (block (param i32 i32) (result i32 i32) end). Assuming we only put a single value in a register, the first value is in a register and the second is in memory. If you sync() before popping at block exit, the first value ends up after the second: badness. You can try to pop all return values to registers before syncing, but we don't have as many registers as return values, in general.
Thinking more broadly... I was planning on having the return convention be similar to the calling convention, where values go in registers as long as registers are available, then get spilled on the stack. However, I think that violates an invariant of the baseline register allocator: if temporary value stk[N] is spilled to the stack, then for all M < N, stk[M] is not in a register either.
So... I think for nvalues > 1, we need to allocate all values onto the stack. We can still allow calls to return the first few values in registers, but the baseline compiler will eagerly write them out to the stack after a return, having reserved space for them before the call. A sad asymmetry with respect to the nvalues <= 1 cases, but oh well. If we take this approach, your sync()-then-memmove() approach works.
Assignee
Comment 35•5 years ago
Alternately!! We could instead use a multi-value ABI in baseline in which values 1..count-N are on the stack, in order starting from 1, and values count-N+1..count are in registers. Then we get the memmove shuffle, we can allow the top value(s) to stay in registers, and it works for parameters as well as results.
I think we could use the same ABI for multi-result calls as well across baldr. I will give this a go. I had thought that it was somehow more important for the first return values to be in registers, but that's impossible in the baseline compiler and I think for ion/cranelift it doesn't matter.
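The location assignment for this alternate ABI could be sketched as follows. This is an illustrative model of the proposal in the comment (first count-N values on the stack in order, top N in registers), not the ABI that was eventually shipped; ResultLoc and assignResultLocs are hypothetical names:

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

struct ResultLoc {
    bool inRegister;
    uint32_t index;  // register number, or stack-slot index from the base
};

// Given `count` results and `numRegs` result registers, put the FIRST
// count-numRegs values in stack slots (in order) and only the LAST values
// in registers, so the stacked portion already matches the eval-stack
// layout and the memmove-style shuffle still applies.
std::vector<ResultLoc> assignResultLocs(uint32_t count, uint32_t numRegs) {
    std::vector<ResultLoc> locs;
    uint32_t stacked = count > numRegs ? count - numRegs : 0;
    for (uint32_t i = 0; i < count; ++i) {
        if (i < stacked) {
            locs.push_back({false, i});           // early results: stack slots
        } else {
            locs.push_back({true, i - stacked});  // top results: registers
        }
    }
    return locs;
}
```

For example, four results with one result register would place values 1-3 in stack slots 0-2 and value 4 in register 0; a single result with registers available stays in a register, preserving the common nvalues <= 1 behavior.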
Assignee
Comment 36•5 years ago
The disadvantage of course would be that not all results are alike; you might run out of GPRs but still have FPRs. Adopting this approach across baldr might be too restrictive. But... for baseline, it would be nice to simplify :)
Assignee
Comment 37•5 years ago
@luke There is a problem with Stk elements that are constants, and thus not on the machine stack. We can't just shuffle down the stack arguments because not all arguments are on the stack. However, given that the isMem() values are in the same order on the stack as the result locations, we can partition them into two contiguous sets, those that will be shuffled down and those that will be shuffled up, and we can thus scatter mem values into result locations without an intermediate copy.
Assignee
Comment 38•5 years ago
Just an update here -- I am still plugging away at this. There are some annoying edge cases like this one:
(loop (param i32 i32 i32)
  (drop) (drop) (drop)
  (i32.const 42) (i32.const 42) (i32.const 42) (br 1))
Here we have a jump where the size of the target dynamic stack is larger than the source (because the const values aren't on the machine stack).
But, I am close.
Updated•5 years ago
Assignee
Comment 39•4 years ago
Now that there are no more to-do items, closing this meta-bug :)
Updated•4 years ago
Comment 40•4 years ago
Removing dev-doc-needed from here, as we can track the docs in https://bugzilla.mozilla.org/show_bug.cgi?id=1628321
Description