Cranelift: Basic alias analysis
Categories
(Core :: JavaScript: WebAssembly, enhancement, P5)
People
(Reporter: lth, Assigned: jseward)
References
Details
Attachments
(1 file, 1 obsolete file)
Cranelift currently does no alias analysis, and this prevents us from generating good code in a number of cases. We probably want something that allows us to distinguish accesses to globals, tables, and memories from each other; that takes type information into account when it's available; and that allows reads from immutable globals to be unaffected by stores to mutable globals of the same type. (Later, also accesses to GC objects.)
We could use a bit vector as Ion does, but we could also use a short bit vector (for expressing load/store effects) combined with some space identifier, or something else altogether. For now we have a closed set of simple types only, but with the reftypes proposal this grows a little, and with the function-references and GC proposals it becomes open-ended. For example, reads from a table of (void()(int32)) should not be affected by writes to a table of (float64()(float64,float64)), to state the obvious.
There's some serious design work to be undertaken here.
Reporter
Updated•6 years ago
Assignee
Comment 1•6 years ago
Designing the location-description lattice is obviously important. That said, it's largely parametric, in that CL's optimisers can treat it as an abstract type. All they care about is the answer to the question "do the following two location-descriptions definitely not overlap?" With a bit of careful design, we can let CL's clients specify the lattice and its associated no-overlap? predicate.
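To make the "parametric lattice" idea concrete, here is a minimal Rust sketch of what such a client-supplied design could look like. All names here (AliasLattice, no_overlap, WasmLoc, WasmLattice) are invented for illustration; this is not Cranelift's actual API.

```rust
/// Trait a Cranelift client would implement for its own location lattice.
/// CL itself never inspects `Loc`; it only asks the one question it cares about.
trait AliasLattice {
    type Loc;
    /// "Do these two location descriptions definitely not overlap?"
    /// Must be conservative: return `false` when in doubt.
    fn no_overlap(&self, a: &Self::Loc, b: &Self::Loc) -> bool;
}

/// A toy Wasm-flavoured client: distinguish memories, tables, and globals.
#[derive(Clone, Copy)]
enum WasmLoc {
    Memory,      // the (untyped) linear memory
    Table(u32),  // table index
    Global(u32), // global index
    Unknown,     // "don't know": overlaps everything
}

struct WasmLattice;

impl AliasLattice for WasmLattice {
    type Loc = WasmLoc;
    fn no_overlap(&self, a: &WasmLoc, b: &WasmLoc) -> bool {
        use WasmLoc::*;
        match (a, b) {
            (Unknown, _) | (_, Unknown) => false, // conservative
            (Global(x), Global(y)) => x != y,     // distinct globals never alias
            (Table(x), Table(y)) => x != y,       // distinct tables never alias
            (Memory, Memory) => false,            // untyped memory: assume overlap
            _ => true,                            // different kinds never alias
        }
    }
}

fn main() {
    let l = WasmLattice;
    assert!(l.no_overlap(&WasmLoc::Global(0), &WasmLoc::Global(1)));
    assert!(!l.no_overlap(&WasmLoc::Memory, &WasmLoc::Memory));
    assert!(l.no_overlap(&WasmLoc::Memory, &WasmLoc::Global(0)));
    assert!(!l.no_overlap(&WasmLoc::Unknown, &WasmLoc::Global(0)));
}
```

The point of the trait boundary is that richer lattices (reftypes, typed tables, GC objects) can be added on the client side without touching the optimiser.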
But there's another aspect to this. Currently CL's GVN pass makes no attempt to common up loads unless they are marked readonly. It will even fail to common up identical back-to-back loads in the same EBB with nothing in between. So GVN is going to have to be enhanced to be able to common up identical loads separated by arbitrary but (per location-annotations) provably non-overlapping stores. I haven't looked at LICM yet -- in a way, that's actually more important -- but I suspect it has similar limitations.
Assignee
Comment 2•6 years ago
Some notes re transformation and alias sets. This is my analysis so far. I need to look at what Ion does in order to see what we're trying to beat.
Some notation. Let each load L be annotated with a set of abstract locations it reads, LocsR(L), or a "don't know" marker. Similarly each store S is annotated with LocsW(S), a set of abstract locations written, or "don't know". "don't know" can be modelled by a set containing all possible abstract locations.
In the absence of interprocedural analysis we must also assume that LocsW(any call insn) == "don't know".
For a real implementation we'd need to refine the representation of locations, per comment 1.
For an ebb with N instructions we can compute a summary LocsW(ebb), which describes the overall set of locations written by ebb:
LocsW(ebb) = union [ LocsW(ebb.insn[i]) | i <- 0 .. N-1 ]
Given LocsW(ebb) for each ebb in a function, we can consider two transformations:
LICM: lifting invariant loads out of a loop
Let L(addr) be a load from addr. Compute the aggregate write-set for the loop like this:
LocsWAggregate = union [ LocsW(ebb) | ebb <- ebbs that form part of the loop ]
Then L is loop-invariant if (1) addr is loop invariant and (2) intersect(LocsR(L), LocsWAggregate) == empty_set.
What (2) means is "L does not read any abstract location written by the loop".
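The per-ebb summary and the LICM side condition above can be sketched in a few lines of Rust. This models location sets as plain bit vectors (one bit per abstract location, all-ones for "don't know"); the names and encoding are illustrative, not from the actual patch.

```rust
// One bit per abstract location; !0 encodes the "don't know" marker.
type LocSet = u64;

/// LocsW(ebb): union of the per-instruction write-sets of one ebb.
fn block_write_summary(insn_writes: &[LocSet]) -> LocSet {
    insn_writes.iter().fold(0, |acc, w| acc | w)
}

/// Condition (2) above: the load reads nothing the loop writes.
/// (Condition (1), address invariance, would be checked elsewhere.)
fn load_is_hoistable(locs_r: LocSet, loop_block_writes: &[LocSet]) -> bool {
    // LocsWAggregate = union over all ebbs forming the loop.
    let aggregate: LocSet = loop_block_writes.iter().fold(0, |a, w| a | w);
    locs_r & aggregate == 0
}

fn main() {
    // A loop whose two ebbs write abstract locations {0} and {3}.
    let loop_w = [block_write_summary(&[1 << 0]), block_write_summary(&[1 << 3])];
    assert!(load_is_hoistable(1 << 5, &loop_w));  // reads loc 5: hoistable
    assert!(!load_is_hoistable(1 << 3, &loop_w)); // reads loc 3: clobbered
    assert!(!load_is_hoistable(!0, &loop_w));     // "don't know": stays put
}
```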
GVN/CSE: commoning up redundant loads.
Curiously this seems more complex and expensive than LICMing invariant loads.
Let L1(addr1) be a load from addr1 and L2(addr2) be a load from addr2. Furthermore let L1 dominate L2.
To replace L2 with the value read by L1, we need to:
(1) prove addr1 and addr2 are the same (outside the scope of this analysis; presumably GVN would tell us this)
(2) prove that, for all paths from L1 to L2, no abstract location read by L2 is written by any ebb on the path.
More formally:
Let path_ebbs = all ebbs reachable on any path from L1 to L2
Let LocsWAggregate = union [ LocsW(ebb) | ebb <- path_ebbs ]
Then L2(addr2) can be replaced by L1's value if addr1 == addr2 and intersect(LocsR(L2), LocsWAggregate) == empty_set.
As with LICM we can make use of pre-computed LocsW values for each ebb. What's not currently clear to me is how to avoid repeatedly computing union [ LocsW(ebb) | ebb <- path_ebbs ] for each load-pair under consideration for commoning up.
Commoning up loads within a single ebb is easy because the find-all-paths problem is trivial. Maybe this could be a simple first thing to implement.
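That "simple first thing" -- single-ebb redundant-load removal -- can be sketched as a linear scan over the ebb, keeping a set of available loads and killing any whose read-set may overlap an intervening store. The toy IR below is invented for illustration, with location sets again as bit vectors.

```rust
type LocSet = u64;

#[derive(Clone)]
enum Inst {
    Load { addr: u32, reads: LocSet },
    Store { writes: LocSet },
}

/// Returns the indices of loads that duplicate an earlier, still-valid load.
fn redundant_loads(block: &[Inst]) -> Vec<usize> {
    // Available loads: (address, read-set) pairs we could forward from.
    let mut avail: Vec<(u32, LocSet)> = Vec::new();
    let mut dups = Vec::new();
    for (i, inst) in block.iter().enumerate() {
        match inst {
            Inst::Load { addr, reads } => {
                if avail.iter().any(|(a, r)| a == addr && r == reads) {
                    dups.push(i); // identical load still available: redundant
                } else {
                    avail.push((*addr, *reads));
                }
            }
            Inst::Store { writes } => {
                // Kill every available load whose read-set may overlap.
                avail.retain(|(_, r)| r & writes == 0);
            }
        }
    }
    dups
}

fn main() {
    let b = [
        Inst::Load { addr: 1052, reads: 1 << 2 },
        Inst::Load { addr: 1052, reads: 1 << 2 }, // back-to-back dup
        Inst::Store { writes: 1 << 2 },           // clobbers location 2
        Inst::Load { addr: 1052, reads: 1 << 2 }, // no longer redundant
    ];
    assert_eq!(redundant_loads(&b), vec![1]);
}
```

Extending this beyond one ebb is exactly where the all-paths write-set problem above comes in.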
There is also the question of what to do about load pairs whose addresses are demonstrably the same but have different read-set annotations (LocsR(..) values). Can this happen? It sounds fundamentally wrong.
Reporter
Comment 3•6 years ago
Note that memory is untyped and you can have different-width accesses to the same location, so the address is by itself not an identity of the operation. This probably speaks to your last sentence. You can have a byte access and a word access and I think their LocsR sets would not be identical though their addresses would be, and this is perfectly reasonable code.
Don't forget we'll have atomics -- special rules for hoisting and reordering. For the time being we can perhaps model these as writes that kill everything loaded from flat memory and require every outstanding store to be performed (i.e. they potentially overlap with every address). Even if there are no interferences they can't be hoisted out of a loop.
Assignee
Comment 4•6 years ago
Yes, so atomics will be exempt from LICM and GVN, and we'll have to flush any deferred loads prior to them. A corner case, but not a conceptual blocker.
What I'm struggling with right now is, for GVN, the need to inspect all ebbs reachable between two loads we might want to common up, to see if there are any intervening stores that invalidate the commoning. This sounds expensive. (Per third last para of comment 2) is there some way to avoid redoing that inspection for every load-pair under consideration? I get the impression that Ion has a way to do this more cheaply (in a big-O sense), but I don't know what it is. Anybody know?
Comment 5•6 years ago
(In reply to Julian Seward [:jseward] from comment #2)
LICM: lifting invariant loads out of a loop
I will note that IonMonkey is wrongly doing so, as the read might fail if the loop condition is not valid. Failure to read is less of an issue with Wasm, unless the loop condition checks the allocated space and the invariant load is out-of-bounds.
The proper way to do LICM would be to unroll the first loop cycle, or to duplicate the loop condition in a pre-entry block, and lift instructions in this loop header.
[…] What's not currently clear to me is how to avoid repeatedly computing union [ LocsW(ebb) | ebb <- path_ebbs ] for each load-pair under consideration for commoning up.
The EmulateStateOf<MemoryView>::run of Scalar Replacement is similar to a precise alias analysis for non-escaped memory, and it works by having the equivalent of the union stored at the entry of each basic block.
For Cranelift, I think this can even be used as having one for the entry of each ebb. As opposed to IonMonkey's Alias Analysis and EmulateStateOf, which can both work in a single pass, I suggest having 2 passes: a first pass to consider all the stores and propagate the info until all union [ LocsW(ebb) | ebb <- * ] are stable, and then a second pass to annotate all loads with their associated stores.
Also, I am not sure what you mean by union, but I would expect something like a mapping from an offset range of a memory bank to an instruction, or the last extended basic block to store at the given offset-range. Something like map [ (memory, range) -> Option<ebb> ] or map [ (memory, range) -> Option<inst> ].
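A tiny Rust sketch of the representation nbp describes -- per-ebb entry state mapping a (memory bank, offset range) to the last instruction known to store there. All names here (EntryState, record_store, store_for) are illustrative, not Cranelift or SpiderMonkey API.

```rust
use std::collections::HashMap;

type MemBank = u32;
type InstId = u32;

/// Entry state of one ebb: for each (bank, offset range), the last store
/// instruction known to write it, i.e. map [ (memory, range) -> Option<inst> ].
#[derive(Default)]
struct EntryState {
    last_store: HashMap<(MemBank, u32, u32), Option<InstId>>,
}

impl EntryState {
    fn record_store(&mut self, bank: MemBank, start: u32, end: u32, inst: InstId) {
        self.last_store.insert((bank, start, end), Some(inst));
    }

    /// The store a load at this exact (bank, range) would observe, if known.
    fn store_for(&self, bank: MemBank, start: u32, end: u32) -> Option<InstId> {
        self.last_store.get(&(bank, start, end)).copied().flatten()
    }
}

fn main() {
    let mut s = EntryState::default();
    s.record_store(0, 1052, 1056, 7); // inst7 stores to [1052,1056) of memory 0
    assert_eq!(s.store_for(0, 1052, 1056), Some(7));
    assert_eq!(s.store_for(0, 2000, 2004), None);
}
```

In the two-pass scheme, pass 1 would iterate these states to a fixed point over the CFG, and pass 2 would consult them to annotate each load.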
Assignee
Comment 6•6 years ago
(In reply to Nicolas B. Pierron [:nbp] from comment #5)
[..] as the read might fail if the loop condition is not valid [..]
Ah, good point. Lars came up with the following example:
int f(int* p, int lim) {
    int sum = 0;
    for (int i = 0; i < lim; i++) {
        sum += *p;
    }
    return sum;
}
for which gcc generates if (lim < 0) then 0 else lim * *p; hence correctly avoiding dereferencing p if the loop body would never have been entered.
Assignee
Comment 7•6 years ago
Pages 402/403 of the Muchnick book (from 1997) describe, effectively, this problem, and a related one, and fixes for both. The fixes are based on adding some extra conditions based on dominance relations relating to instructions to be lifted out.
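One of the standard fixes of this kind is to require that the instruction's block dominate all loop exits, so the hoisted instruction was guaranteed to execute on every iteration anyway. A minimal sketch, with an invented CFG encoding (this is not the exact condition Muchnick states, just the common dominance-based guard):

```rust
/// Only hoist an instruction out of a loop if the block containing it
/// dominates every loop exit; otherwise the instruction (e.g. a trapping
/// load) might execute when the original program would never have run it.
fn safe_to_hoist(
    inst_block: u32,
    loop_exits: &[u32],
    dominates: impl Fn(u32, u32) -> bool,
) -> bool {
    loop_exits.iter().all(|&e| dominates(inst_block, e))
}

fn main() {
    // Toy dominance relation: block 0 dominates everything;
    // every other block dominates only itself.
    let dom = |a: u32, b: u32| a == 0 || a == b;
    assert!(safe_to_hoist(0, &[1, 2], dom));  // loop header: always executed
    assert!(!safe_to_hoist(2, &[1, 2], dom)); // conditional block: don't hoist
}
```

The pre-entry-block duplication nbp suggests in comment 5 is the other way to discharge the same obligation.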
Comment 8•6 years ago
Baldr today doesn't hoist linear-memory loads even in that very simple loop, and its GVN doesn't eliminate redundant linear-memory loads even in the simplest cases. The trapping semantics create a lot of ways for users to observe exactly when and where a load happens. And alias analysis in linear memory is particularly difficult to pull off in the general case, so even if it's theoretically possible in very simple cases, it's not clear that it's worth the compile time if it doesn't help in complex cases.
Are there specific testcases or examples that we're looking at here? I'm interested in learning more about what kinds of problems we're looking to solve here.
Assignee
Comment 9•6 years ago
The initial goal is to generate decent code for Wasm global variable accesses. I assume these are guaranteed trap-free. Most of the discussion above is really about me trying to understand the big-picture general-case constraints, even if -- in the end -- we mostly or entirely sidestep the trapping problems, by restricting ourselves to a safe subset of loads/stores.
Reporter
Updated•6 years ago
Assignee
Comment 10•6 years ago
Here are some initial results from our initial set of 6 perf cases, for the case of lifting invariant loads out of loops. The takeaway is that there is (almost) no win to be had; what we already do with marking some loads as readonly gives all the available win. That said, for non-Wasm uses of CL the results might be different.
To measure this, I added to each load and store an alias-set description that distinguishes Wasm memories, globals and other things, and distinguishes globals from each other. I extended LICM so as to compute the alias set of all stores in each loop, then compared that against the alias set of each load in the loop, and classified the load into one of 6 categories. For cleanness, I also added a new MemFlag::Volatile to mark reads of the wasm interrupt flag, although that isn't strictly necessary here. The 6 categories are:
rejected_NONINVAR: rejected for lifting, because args (address) non-invariant
rejected_VOLATILE: rejected because marked volatile (wasm loop-head trap)
rejected_TRAPPING: rejected because the load might trap
rejected_ALIASED: rejected because load might alias a store in the loop
approved_READONLY: accepted for lifting because marked readonly
approved_NO_ALIAS: accepted because not aliased by any stores in the loop
Numbers are below. It's the NO_ALIAS set that's of interest here. Some other tests (tanks, ZenGarden) do show a few more loads in this category, but still basically zero.
Patch for reference is attached. One good thing to come out of this is that it seems easy to add alias-set descriptors which, on the one hand, allow CL's client to specify alias sets in considerable detail, whilst on the other hand being entirely opaque to CL -- CL has no understanding of the meaning of the set. So it's a pretty clean interface extension. See enum AliasSet in the patch.
------- Rejected for LICM -------- --- Accepted for LICM ---
Test NONINVAR VOLATILE TRAPPING ALIASED READONLY NO_ALIAS
wasm_box2d.js 1996 328 143 320 1914 0
fib.js no loops
wasm_lua_bin..js 2278 705 322 141 2704 1
raybench.js 425 173 119 126 508 0
wasm_zlib.c.js 2659 570 112 24 3436 0
rust-fannkuch.js 68 15 8 16 94 0
Assignee
Comment 11•6 years ago
Patch used to generate the measurements in comment 10.
Reporter
Comment 12•6 years ago
Are the classification sets nonoverlapping? I'm just curious whether alias analysis would appear to be more effective if we did not have the readonly flag. (Clearly readonly solves some cases alias analysis could not solve.)
Assignee
Comment 13•6 years ago
Are the classification sets nonoverlapping?
The AliasSet values in the patch form an (implied) partial order. If two sets s1 and s2 are incomparable then they denote non-overlapping locations; but if s1 <= s2 then s2 denotes a set of locations which is a subset of s1's locations.
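The partial-order reading can be made concrete with a small Rust sketch. The encoding below is invented (it is not the enum AliasSet from the attached patch): incomparable values denote provably disjoint locations, and s1 <= s2 means s2 refines s1.

```rust
#[derive(Clone, Copy)]
enum AliasSet {
    Any,         // bottom: all locations ("don't know")
    AnyGlobal,   // any Wasm global
    Global(u32), // one specific global
    Memory,      // the linear memory
}

/// Partial order: le(s1, s2) means s2's locations are a subset of s1's.
fn le(s1: AliasSet, s2: AliasSet) -> bool {
    use AliasSet::*;
    match (s1, s2) {
        (Any, _) => true,
        (AnyGlobal, AnyGlobal) | (AnyGlobal, Global(_)) => true,
        (Global(a), Global(b)) => a == b,
        (Memory, Memory) => true,
        _ => false,
    }
}

/// Incomparable sets denote provably non-overlapping locations.
fn no_overlap(s1: AliasSet, s2: AliasSet) -> bool {
    !le(s1, s2) && !le(s2, s1)
}

fn main() {
    use AliasSet::*;
    assert!(no_overlap(Global(0), Global(1))); // incomparable: disjoint
    assert!(!no_overlap(AnyGlobal, Global(0))); // comparable: may overlap
    assert!(no_overlap(Memory, AnyGlobal));     // distinct kinds: disjoint
    assert!(!no_overlap(Any, Memory));          // "don't know" overlaps all
}
```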
There's currently no way to indicate an empty set of locations (no top point in the partial order). This would seem to be necessary to replace readonly, so as it stands AliasSet can't replace readonly -- unless we simulate that by marking read-only loads as reading from a location-set that is never applied to any store. That would be very inconvenient, though, in that to identify a load as readonly we'd have to compare the load's location set against all dominating store location sets.
All that said, it's obvious from the numbers that readonly and location sets do overlap to some extent. Here are the numbers for zlib.c.js:
LICM: rejected_NONINVAR = 2659
LICM: rejected_VOLATILE = 570
LICM: rejected_TRAPPING = 112
LICM: rejected_ALIASED = 24
LICM: approved_READONLY = 3436
LICM: approved_NO_ALIAS = 0
readonly allows many loads to be lifted out. Disabling that mechanism causes the alias-based mechanism to detect at least some of them as liftable:
LICM: rejected_NONINVAR = 2764
LICM: rejected_VOLATILE = 578
LICM: rejected_TRAPPING = 16
LICM: rejected_ALIASED = 3773
LICM: approved_READONLY = 0
LICM: approved_NO_ALIAS = 420
Clearly 420 as detected by alias analysis is a long way from 3436 based simply on readonly annotations. I suspect alias analysis would get much closer to that 3436 number if all loads and stores were consistently annotated, which they are not. In a different tree I started an audit with the aim of marking all loads and stores that enter the CL pipeline, but that doesn't apply here.
Comment 14•6 years ago
Thanks for the numbers and analysis.
- Did you need to also modify the Cranelift code in Spidermonkey (baldrdash)? We generate a few loads/stores in there (for TLS, indirect globals, ...), and while they might all be marked as readonly, I don't know for sure if this is the case.
- I had a random thought that maybe the AliasSet information may get lost during legalization, if a load/store is expanded into several ones. (I wonder if this can happen; if the instruction id remains the same as the actual load/store, then we should be fine. But if it changes, we're losing information.) Would it be worth it to see if all the loads/stores are annotated with alias set information, just before we generate the code? (that is, on the final IR graph)
- Is it worth considering using the alias sets during GVN too, since, if effective, this can reduce memory traffic, lower register pressure, etc.? Or are the poor results on LICM (when compared with readonly) sufficient to prove that this would be ineffective too?
Assignee
Comment 15•6 years ago
Did you need to also modify the Cranelift code in Spidermonkey (baldrdash)?
I didn't modify it for these experiments. I did look at it though, as part of the add-an-alias-set-for-all-memory-accesses audit mentioned in the last para of comment 13. There are a bunch of memory accesses created by js/src/wasm/cranelift/src/wasm2clif.rs that, as you say, are VM accesses and similar. If we do choose to add alias sets then we should mark all of those accesses too.
I had a random thought that maybe the AliasSet information may get lost during legalization, if a load/store is expanded into several ones.
This is a good point. Does such an expansion actually happen? In the patch I just added a SecondaryMap<Inst, AliasSet> to DataFlowGraph to hold the alias info, so that I didn't have to make the InstructionDetails any bigger. I assume that if a load/store gets moved (etc), then its Inst value will change, and so the SecondaryMap will have to be adjusted too. My patch doesn't do that.
Would it be worth it to see if all the loads/stores are annotated with alias set information, just before we generate the code?
Yes. If we do this for real, then checking this in the verifier would definitely be worthwhile. We'd have to be a bit careful about spills/reloads generated by regalloc, since they will be unannotated.
Is it worth considering using the alias sets during GVN too, since if effective, this can reduce memory traffic, lower register pressure etc.? Or are the poor results on LICM (when compared with readonly) sufficient to prove that this would be ineffective too?
That's something I'd like to verify independently of the LICM experiments. I suspect the answer is "no", but that's just a guess.
Assignee
Comment 16•6 years ago
I implemented removal of redundant loads within individual ebbs. Doing it across entire functions looks difficult because you'd have to prove no-aliasing on all possible paths between candidate load-pairs. A compromise position would be to extend the individual-ebb logic to work across tree-shaped ebb groups, but I didn't do that.
The takeaway is that, as with licm-ing loads, it's pointless. But there are a couple of marginally interesting observations. Firstly, the loads that are removed have no intervening stores, so the presence of alias sets here is pointless. Secondly, in both idioms below, it looks as if the duplicated loads are some kind of front end (wasm->clif) artefact which we might be able to fix up without needing a general redundant-load removal pass.
I profiled wasm_box2d.js and wasm_zlib.c.js:
static loads dyn insns dyn insns
removed without opt with opt
wasm_box2d.js 61 11,935,682,104 11,912,546,024
wasm_zlib.c.js 35 15,999,525,900 15,999,515,811
Looking at the places where loads are removed, it's clear there are two distinct idioms involved. The most common is at the start of a function: it's always instN for N almost zero, so I assume this is some prologue-related effect. In one case there were three identical loads in sequence:
inst2 v6 = load.i32 notrap aligned v1+1052 <---
inst4 v8 = load.i32 notrap aligned v1+1052 <--- dup
inst5 v9 = iadd.i32 v8, v0
inst7 store.i32 notrap aligned v9, v1+1052
inst9 v12 = load.i32 notrap aligned v1+1052
inst11 v14 = iadd_imm.i32 v12, 15
inst13 v16 = band_imm.i32 v14, -16
inst15 store.i32 notrap aligned v16, v1+1052
inst16 jump ebb2(v6)
remove inst4 as dup of inst2
The less common idiom is some kind of FP thing (involving FP negation? .. is that what the xor is for?):
inst144 v130 = load.f64 notrap aligned v3+1072 <---
inst145 v131 = bxor.f64 v130, v700
inst146 v132 = fdemote.f32 v131
inst147 v133 = fcmp.f32 gt v110, v132
inst148 v134 = bint.i32 v133
inst150 v136 = load.f64 notrap aligned v3+1072 <--- dup
inst151 v137 = fdemote.f32 v136
inst152 v138 = fcmp.f32 gt v137, v110
inst153 v139 = bint.i32 v138
inst154 v140 = band.i32 v134, v139
inst157 brnz.i32 v140, ebb8(v105, v437, v530, v552, v108)
inst158 v143 = iconst.i32 2865
...
remove inst150 as dup of inst144
Does anybody know what either of these two idioms are?
Assignee
Comment 17•6 years ago
WIP patch containing the initial implementations and instrumentation, FTR.
Comment 18•6 years ago
The first excerpt looks like "stackAlloc" in wasm_box2d.wasm. It's allocating some stack space and doing alignment:
(func (;25;) (type 4) (param i32) (result i32)
  (local i32)
  block (result i32)  ;; label = @1
    global.get 7
    local.set 1
    global.get 7
    local.get 0
    i32.add
    global.set 7
    global.get 7
    i32.const 15
    i32.add
    i32.const -16
    i32.and
    global.set 7
    local.get 1
  end)
[...]
(export "stackAlloc" (func 25))
The code in Emscripten is in src/library.js and looks like this:
$stackAlloc: function(size) {
  size = size|0;
  var ret = 0;
  ret = STACKTOP;
  STACKTOP = (STACKTOP + size)|0;
  STACKTOP = (STACKTOP + 15)&-16;
#if ASSERTIONS || STACK_OVERFLOW_CHECK >= 2
  if ((STACKTOP|0) >= (STACK_MAX|0)) abortStackOverflow(size|0);
#endif
  return ret|0;
},
Two possible conclusions to draw here:
- Emscripten's code here could be better optimized to avoid redundant global get/set traffic
- If we do look into alias analysis, perhaps globals are a good place to start, as producers may not see them as expensive. Also, we can be fully precise with globals.
Comment 19•6 years ago
The second is Emscripten insisting on using an imported global variable named Infinity rather than just using an f32.const to produce an Infinity. Here's the code in wasm_box2d.wasm:
local.get 13
global.get 11
f64.neg
f32.demote_f64
f32.gt
local.get 13
global.get 11
f32.demote_f64
f32.lt
i32.and
The loads are loading from global 11:
(global (;11;) (mut f64) (global.get 3))
which is initialized as a copy of global 3:
(import "global" "Infinity" (global (;3;) f64))
which is importing an Infinity global.
This also prevents the f64.neg and both f32.demotes from being constant-folded away.
Assignee
Comment 20•6 years ago
Dan, thanks for identifying those idioms.
I should add, per comments 10 and 16, that I'm abandoning this line of enquiry, at least for now. If the situation and/or expected workload changes, we can easily enough resume it.
Reporter
Updated•5 years ago
Reporter
Updated•4 years ago