Closed Bug 1552739 Opened 6 years ago Closed 4 years ago

Cranelift: Basic alias analysis

Categories

(Core :: JavaScript: WebAssembly, enhancement, P5)

Tracking


RESOLVED INACTIVE

People

(Reporter: lth, Assigned: jseward)

References

Details

Attachments

(1 file, 1 obsolete file)

Cranelift currently does no alias analysis, and this prevents us from generating good code in a number of cases. We probably want something that allows us to distinguish accesses to globals, tables, and memories from each other; that takes type information into account when it's available; and that allows reads from immutable globals to be unaffected by stores to mutable globals of the same type. (Later, also accesses to GC objects.)

We could use a bit vector as Ion does, but we could also use a short bit vector (for expressing load/store effects) combined with some space identifier, or something else altogether. For now we have a closed set of simple types only, but with the reftypes proposal this grows a little, and with the function-references and gc proposals it becomes open-ended. For example, reads from a table of (void()(int32)) should not be affected by writes to a table of (float64()(float64,float64)), to state the obvious.

There's some serious design work to be undertaken here.

Assignee: nobody → jseward

Designing the location-description lattice is obviously important. That said it's largely parametric in that CL's optimisers can treat it as an abstract type. All they care about is the answer to the question "do the following two location-descriptions definitely not overlap?" With a bit of careful design, we can let CL's clients specify the lattice and its associated no-overlap? predicate.
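The opaque-lattice idea can be sketched as a trait whose only obligation is the no-overlap predicate. This is purely illustrative (the names AliasLattice, WasmLoc and WasmLattice are invented here, not Cranelift API):

```rust
// Sketch: the client supplies the location-description type and the
// no-overlap predicate; the optimiser treats Loc as fully abstract.
trait AliasLattice {
    type Loc;
    /// "Do the following two location-descriptions definitely not overlap?"
    /// Must be conservative: answer `false` when in doubt.
    fn no_overlap(a: &Self::Loc, b: &Self::Loc) -> bool;
}

// A toy client lattice for wasm-like locations.
enum WasmLoc {
    Heap,        // linear memory
    Global(u32), // one specific global, by index
    AnyGlobal,   // some global, unknown which
    DontKnow,    // could be anything
}

struct WasmLattice;

impl AliasLattice for WasmLattice {
    type Loc = WasmLoc;
    fn no_overlap(a: &WasmLoc, b: &WasmLoc) -> bool {
        use WasmLoc::*;
        match (a, b) {
            (DontKnow, _) | (_, DontKnow) => false,  // conservative
            (Heap, Heap) => false,                   // heap may alias itself
            (Heap, _) | (_, Heap) => true,           // heap vs. globals: disjoint
            (Global(i), Global(j)) => i != j,        // distinct globals: disjoint
            (AnyGlobal, _) | (_, AnyGlobal) => false,
        }
    }
}
```

The optimisers would only ever call no_overlap, so the lattice's internal structure stays entirely the client's business.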

But there's another aspect to this. Currently CL's GVN pass makes no attempt to common up loads unless they are marked readonly. It will even fail to common up identical back-to-back loads in the same EBB with nothing in between. So GVN will have to be enhanced to common up identical loads separated by arbitrary but (per location annotations) provably non-overlapping stores. I haven't looked at LICM yet -- in a way, that's actually more important -- but I suspect it has similar limitations.

Some notes re transformation and alias sets. This is my analysis so far. I need to look at what Ion does in order to see what we're trying to beat.

Some notation. Let each load L be annotated with a set of abstract locations it reads, LocsR(L), or a "don't know" marker. Similarly each store S is annotated with LocsW(S), a set of abstract locations written, or "don't know". "don't know" can be modelled by a set containing all possible abstract locations.

In the absence of interprocedural analysis we must also assume that LocsW(any call insn) == "don't know".
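A minimal sketch of this notation, assuming (purely for illustration) that the closed set of abstract locations fits in a 64-bit bitset, with "don't know" modelled as the all-ones set:

```rust
// Illustrative only: abstract location sets as a fixed-size bitset.
#[derive(Clone, Copy, PartialEq)]
struct Locs(u64);

impl Locs {
    const EMPTY: Locs = Locs(0);
    const DONT_KNOW: Locs = Locs(!0); // set of all possible locations

    fn single(bit: u32) -> Locs { Locs(1u64 << bit) }
    fn union(self, other: Locs) -> Locs { Locs(self.0 | other.0) }
    fn intersect(self, other: Locs) -> Locs { Locs(self.0 & other.0) }
    fn is_empty(self) -> bool { self.0 == 0 }
}

// Without interprocedural analysis, LocsW(any call insn) == "don't know".
fn locs_w_of_call() -> Locs { Locs::DONT_KNOW }
```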

For a real implementation we'd need to refine the representation of locations, per comment 1.


For an ebb with N instructions we can compute a summary LocsW(ebb), which describes the overall set of locations written by ebb:

LocsW(ebb) = union [ LocsW(ebb.insn[i]) | i <- 0 .. N-1 ]

Given LocsW(ebb) for each ebb in a function, we can consider two transformations:


LICM: lifting invariant loads out of a loop

Let L(addr) be a load from addr. Compute the aggregate write-set for the loop like this:

LocsWAggregate = union [ LocsW(ebb) | ebb <- ebbs that form part of the loop ]

Then L is loop-invariant if (1) addr is loop invariant and (2) intersect(LocsR(L), LocsWAggregate) == empty_set.

What (2) means is "L does not read any abstract location written by the loop".
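Conditions (1) and (2) can be sketched directly, again with location sets as u64 bitsets (a hypothetical representation; the real one would be richer):

```rust
/// LocsWAggregate = union [ LocsW(ebb) | ebb <- ebbs in the loop ]
fn loop_write_set(locs_w_per_ebb: &[u64]) -> u64 {
    locs_w_per_ebb.iter().fold(0, |acc, w| acc | w)
}

/// Condition (2): the load reads no abstract location written by the loop,
/// i.e. intersect(LocsR(L), LocsWAggregate) == empty_set.
/// Condition (1), invariance of addr, is assumed checked elsewhere.
fn load_is_hoistable(locs_r_of_load: u64, locs_w_aggregate: u64) -> bool {
    locs_r_of_load & locs_w_aggregate == 0
}
```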


GVN/CSE: commoning up redundant loads.

Curiously this seems more complex and expensive than LICMing invariant loads.

Let L1(addr1) be a load from addr1 and L2(addr2) be a load from addr2. Furthermore let L1 dominate L2.

To replace L2 with the value read by L1, we need to:

(1) prove addr1 and addr2 are the same (outside the scope of this analysis; presumably GVN would tell us this)

(2) prove that, for all paths from L1 to L2, no abstract location read by L2 is written by any ebb on the path.

More formally:

Let path_ebbs = all ebbs reachable on any path from L1 to L2

Let LocsWAggregate = union [ LocsW(ebb) | ebb <- path_ebbs ]

Then L2(addr2) can be replaced by L1's value if addr1 == addr2 and intersect(LocsR(L2), LocsWAggregate) == empty_set.

As with LICM we can make use of pre-computed LocsW values for each ebb. What's not currently clear to me is how to avoid repeatedly computing union [ LocsW(ebb) | ebb <- path_ebbs ] for each load-pair under consideration for commoning up.

Commoning up loads within a single ebb is easy because the find-all-paths problem is trivial. Maybe this could be a simple first thing to implement.
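The single-ebb case can be sketched as a forward walk that remembers, per address, the last defining load and the write-set seen since it. Everything here (the Inst shape, addresses and location sets as plain integers) is a simplification for illustration:

```rust
use std::collections::HashMap;

enum Inst {
    Load { addr: u32, locs_r: u64 },
    Store { locs_w: u64 },
}

/// Returns the indices of loads that duplicate an earlier load from the
/// same address, with no intervening store to an overlapping location.
fn redundant_loads(ebb: &[Inst]) -> Vec<usize> {
    let mut dups = Vec::new();
    // addr -> (index of the defining load, writes seen since it)
    let mut avail: HashMap<u32, (usize, u64)> = HashMap::new();
    for (i, inst) in ebb.iter().enumerate() {
        match inst {
            Inst::Load { addr, locs_r } => {
                if let Some((_, writes)) = avail.get(addr) {
                    if writes & locs_r == 0 {
                        dups.push(i); // can reuse the earlier load's value
                        continue;
                    }
                }
                avail.insert(*addr, (i, 0));
            }
            Inst::Store { locs_w } => {
                // Accumulate this store into every outstanding write-set.
                for (_, writes) in avail.values_mut() {
                    *writes |= *locs_w;
                }
            }
        }
    }
    dups
}
```

Extending this across a tree-shaped ebb group would mean propagating the avail map into successors; the all-paths case is where it gets expensive.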

There is also the question of what to do about load pairs whose addresses are demonstrably the same but have different read-set annotations (LocsR(..) values). Can this happen? It sounds fundamentally wrong.

Note that memory is untyped and you can have different-width accesses to the same location, so the address by itself does not identify the operation. This probably speaks to your last sentence: you can have a byte access and a word access whose LocsR sets are not identical even though their addresses are, and this is perfectly reasonable code.

Don't forget we'll have atomics -- special rules for hoisting and reordering. For the time being we can perhaps model these as writes that kill everything loaded from flat memory and require every outstanding store to be performed (i.e. they potentially overlap with every address). Even if there are no interferences, they can't be hoisted out of a loop.

Yes, so atomics will be exempt from LICM and GVN, and we'll have to flush any deferred loads prior to them. A corner case, but not a conceptual blocker.

What I'm struggling with right now is, for GVN, the need to inspect all ebbs reachable between two loads we might want to common up, to see if there are any intervening stores that invalidate the commoning. This sounds expensive. Per the third-last paragraph of comment 2: is there some way to avoid redoing that inspection for every load-pair under consideration? I get the impression that Ion has a way to do this more cheaply (in a big-O sense), but I don't know what it is. Anybody know?

(In reply to Julian Seward [:jseward] from comment #2)

LICM: lifting invariant loads out of a loop

I will note that IonMonkey is wrong to do so, as the read might fail if the loop condition does not hold. A failing read is less of an issue with wasm, unless the loop condition checks the allocated space and the invariant load is out of bounds.

The proper way to do LICM would be to unroll the first loop iteration, or to duplicate the loop condition in a pre-entry block, and lift instructions into that block.

[…] What's not currently clear to me is how to avoid repeatedly computing union [ LocsW(ebb) | ebb <- path_ebbs ] for each load-pair under consideration for commoning up.

The EmulateStateOf<MemoryView>::run of Scalar Replacement is similar to a precise alias analysis for non-escaped memory; it works by storing the equivalent of the union at the entry of each basic block.

For Cranelift, I think this can even be used by having one at the entry of each ebb. As opposed to IonMonkey's Alias Analysis and EmulateStateOf, which can both work in a single pass, I suggest having two passes: a first pass that considers all the stores and propagates the info until all union [ LocsW(ebb) | ebb <- * ] are stable, and then a second pass that annotates all loads with their associated stores.

Also, I am not sure what you mean by union, but I would expect something like a mapping from an offset range of a memory bank to an instruction, or to the last extended basic block that stores at the given offset range. Something like map [ (memory, range) -> Option<ebb> ] or map [ (memory, range) -> Option<inst> ].

(In reply to Nicolas B. Pierron [:nbp] from comment #5)

[..] as the read might fail if the loop condition is not valid [..]

Ah, good point. Lars came up with the following example:

int f(int* p, int lim) {
  int sum = 0;
  for (int i=0; i < lim; i++) {
     sum += *p;
  }
  return sum;
}

for which gcc generates, in effect, lim <= 0 ? 0 : lim * *p, hence correctly avoiding dereferencing p if the loop body would never have been entered.

Pages 402/403 of the Muchnick book (from 1997) describe, effectively, this problem, and a related one, and fixes for both. The fixes are based on adding some extra conditions based on dominance relations relating to instructions to be lifted out.
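The pre-entry-block fix from comment 5 can be illustrated on Lars's example. This is a hand-written sketch of the two control-flow shapes (Rust references can't fault, so this shows only the structure, not the trapping behaviour):

```rust
// Original shape: top-tested loop. Hoisting the load above the loop
// would execute it even when the body never runs.
fn sum_top_tested(p: &i32, lim: i32) -> i32 {
    let mut sum = 0;
    let mut i = 0;
    while i < lim {
        sum += *p; // candidate for hoisting
        i += 1;
    }
    sum
}

// Rotated shape: the loop condition is duplicated in a pre-entry guard,
// so the hoisted load executes only when the body runs at least once.
fn sum_rotated(p: &i32, lim: i32) -> i32 {
    let mut sum = 0;
    if 0 < lim {
        let v = *p; // safely hoisted: body is known to execute
        let mut i = 0;
        loop {
            sum += v;
            i += 1;
            if !(i < lim) { break; }
        }
    }
    sum
}
```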

Baldr today doesn't hoist linear-memory loads out of even that very simple loop, and its GVN doesn't eliminate redundant linear-memory loads even in the simplest cases. The trapping semantics create a lot of ways for users to observe exactly when and where a load happens. And alias analysis on linear memory is particularly difficult to pull off in the general case, so even if it's theoretically possible in very simple cases, it's not clear it's worth the compile time if it doesn't help in complex cases.

Are there specific testcases or examples that we're looking at here? I'm interested in learning more about what kinds of problems we're looking to solve.

The initial goal is to generate decent code for Wasm global variable accesses. I assume these are guaranteed trap-free. Most of the discussion above is really about me trying to understand the big-picture general-case constraints, even if -- in the end -- we mostly or entirely sidestep the trapping problems, by restricting ourselves to a safe subset of loads/stores.

Priority: P3 → P1

Here are some initial results from our set of 6 perf cases, for the
case of lifting invariant loads out of loops. The takeaway is that there is
(almost) no win to be had; what we already do, marking some loads as
readonly, gives all the available win. That said, for non-wasm uses of CL
the results might be different.

To measure this, I added to each load and store an alias-set description that
distinguishes wasm memories, globals and other things, and distinguishes
globals from each other. I extended LICM to compute the alias set of
all stores in each loop, compare that against the alias set of each
load in the loop, and classify the load into one of 6 categories. For
cleanliness, I also added a new MemFlag::Volatile to mark reads of the wasm
interrupt flag, although that isn't strictly necessary here. The 6 categories
are:

rejected_NONINVAR: rejected for lifting, because args (address) non-invariant
rejected_VOLATILE: rejected because marked volatile (wasm loop-head trap)
rejected_TRAPPING: rejected because the load might trap
rejected_ALIASED: rejected because load might alias a store in the loop
approved_READONLY: accepted for lifting because marked readonly
approved_NO_ALIAS: accepted because not aliased by any stores in the loop

Numbers are below. It's the NO_ALIAS set that's of interest here. Some other
tests (tanks, ZenGarden) do show a few more loads in this category, but still
basically zero.

Patch for reference is attached. One good thing to come out of this is that
it seems easy to add alias-set descriptors which, on the one hand, allow CL's
client to specify alias sets in considerable detail, whilst on the other hand
being entirely opaque to CL -- CL has no understanding of the meaning of the
set. So it's a pretty clean interface extension. See enum AliasSet in the
patch.

                  ------- Rejected for LICM --------     --- Accepted for LICM ---
Test              NONINVAR VOLATILE TRAPPING ALIASED     READONLY  NO_ALIAS

wasm_box2d.js     1996     328      143      320         1914      0
fib.js  no loops
wasm_lua_bin..js  2278     705      322      141         2704      1
raybench.js        425     173      119      126          508      0
wasm_zlib.c.js    2659     570      112       24         3436      0
rust-fannkuch.js    68      15        8       16           94      0

Patch used to generate the measurements in comment 10.

Are the classification sets nonoverlapping? I'm just curious whether alias analysis would appear to be more effective if we did not have the readonly flag. (Clearly readonly solves some cases alias analysis could not solve.)

Are the classification sets nonoverlapping?

The AliasSet values in the patch form an (implied) partial order. If two
sets s1 and s2 are incomparable then they denote non-overlapping
locations; but if s1 <= s2 then s2 denotes a set of locations which is a
subset of s1's locations.
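That ordering convention can be sketched as follows. This is a hypothetical reconstruction of the idea only; the real enum AliasSet lives in the attached patch and may differ in detail:

```rust
// s1 <= s2 means s2's locations are a subset of s1's;
// incomparable values denote non-overlapping location sets.
#[derive(Clone, Copy, PartialEq)]
enum AliasSet {
    Any,          // all locations ("don't know")
    AnyGlobal,    // some wasm global
    Global(u32),  // one specific wasm global
    Heap,         // linear memory
}

/// Returns true iff s1 <= s2 in the implied partial order.
fn le(s1: AliasSet, s2: AliasSet) -> bool {
    use AliasSet::*;
    match (s1, s2) {
        (Any, _) => true,                                   // Any is the bottom point
        (AnyGlobal, AnyGlobal) | (AnyGlobal, Global(_)) => true,
        (Global(i), Global(j)) => i == j,
        (Heap, Heap) => true,
        _ => false,
    }
}

/// Incomparable => definitely disjoint locations.
fn no_overlap(a: AliasSet, b: AliasSet) -> bool {
    !le(a, b) && !le(b, a)
}
```

Note there is no top point here, matching the observation below that an empty location set (as readonly would need) can't currently be expressed.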

There's currently no way to indicate an empty set of locations (no top point
in the partial order). This would seem to be necessary to replace readonly,
so as it stands AliasSet can't replace readonly -- unless we simulate that
by marking read-only loads as reading from a location-set that is never
applied to any store. That would be very inconvenient though in that, to
identify a load as readonly, we'd have to compare the load's location set
against all dominating store location sets.

All that said, it's obvious from the numbers that readonly and location sets
do overlap to some extent. Here are the numbers for zlib.c.js:

LICM: rejected_NONINVAR = 2659
LICM: rejected_VOLATILE = 570
LICM: rejected_TRAPPING = 112
LICM: rejected_ALIASED  = 24

LICM: approved_READONLY = 3436
LICM: approved_NO_ALIAS = 0

readonly allows many loads to be lifted out. Disabling that mechanism
causes the alias-based mechanism to detect at least some of them as liftable:

LICM: rejected_NONINVAR = 2764
LICM: rejected_VOLATILE = 578
LICM: rejected_TRAPPING = 16
LICM: rejected_ALIASED  = 3773

LICM: approved_READONLY = 0
LICM: approved_NO_ALIAS = 420

Clearly 420 as detected by alias analysis is a long way from 3436 based simply
on readonly annotations. I suspect alias analysis would get much closer to
that 3436 number if all loads and stores were consistently annotated, which
they are not. In a different tree I started an audit with the aim of marking
all loads and stores that enter the CL pipeline, but that doesn't apply here.

Thanks for the numbers and analysis.

  1. Did you need to also modify the Cranelift code in Spidermonkey (baldrdash)? We generate a few loads/stores in there (for TLS, indirect globals,...), and while they might be just all marked as readonly, I don't know for sure if this is the case.

  2. I had a random thought that maybe the AliasSet information may get lost during legalization, if a load/store is expanded into several ones. (I wonder if this can happen; if the instruction id remains the same as the actual load/store, then we should be fine. But if it changes, we're losing information.) Would it be worth it to see if all the loads/stores are annotated with alias set information, just before we generate the code? (that is, on the final IR graph)

  3. Is it worth considering using the alias sets during GVN too, since, if effective, this can reduce memory traffic, lower register pressure, etc.? Or are the poor results on LICM (compared with readonly) sufficient to prove that this would be ineffective too?

Did you need to also modify the Cranelift code in Spidermonkey (baldrdash)?

I didn't modify it for these experiments. I did look at it though, as part of
the add-an-alias-set-for-all-memory-accesses audit mentioned in last para of
comment 13. There are a bunch of memory accesses created by
js/src/wasm/cranelift/src/wasm2clif.rs that, as you say, are VM accesses and
similar. If we do choose to add alias sets then we should mark all of those
accesses too.

I had a random thought that maybe the AliasSet information may get lost
during legalization, if a load/store is expanded into several ones.

This is a good point. Does such an expansion actually happen? In the patch I
just added a SecondaryMap<Inst, AliasSet> to DataFlowGraph to hold the
alias info, so that I didn't have to make the InstructionDetails any bigger.
I assume that if a load/store gets moved (etc), then its Inst value will
change, and so the SecondaryMap will have to be adjusted too. My patch
doesn't do that.

Would it be worth it to see if all the loads/stores are annotated with alias
set information, just before we generate the code?

Yes. If we do this for real, then checking this in the verifier would be
definitely worthwhile. We'd have to be a bit careful about spills/reloads
generated by regalloc, since they will be unannotated.

Is it worth considering using the alias sets during GVN too, since if
effective, this can reduce memory traffic, lower register pressure etc etc.?

That's something I'd like to try to verify independently from the LICM
experiments. I suspect the answer is "no" but that's just a guess.

I implemented removal of redundant loads within individual ebbs. Doing it
across entire functions looks difficult because you'd have to prove
no-aliasing on all possible paths between candidate load-pairs. A compromise
position would be to extend the individual-ebb logic to work across
tree-shaped ebb groups, but I didn't do that.

The takeaway is that, as with licm-ing loads, it's pointless. But there are a
couple of marginally interesting observations. Firstly, the loads that are
removed have no intervening stores, so the presence of alias sets here is
pointless. Secondly, in both idioms below, it looks as if the duplicated
loads are some kind of front end (wasm->clif) artefact which we might be able
to fix up without needing a general redundant-load removal pass.

I profiled wasm_box2d.js and wasm_zlib.c.js:

                static loads      dyn insns           dyn insns
                   removed        without opt         with opt
wasm_box2d.js         61          11,935,682,104      11,912,546,024
wasm_zlib.c.js        35          15,999,525,900      15,999,515,811

Looking at the places where loads are removed, it's clear there are two
distinct idioms involved. The most common is at the start of a function: it's
always instN for N almost zero, so I assume this is some prologue-related
effect. In one case there were three identical loads in sequence:

inst2    v6 = load.i32 notrap aligned v1+1052  <---
inst4    v8 = load.i32 notrap aligned v1+1052  <--- dup
inst5    v9 = iadd.i32 v8, v0
inst7    store.i32 notrap aligned v9, v1+1052
inst9    v12 = load.i32 notrap aligned v1+1052
inst11   v14 = iadd_imm.i32 v12, 15
inst13   v16 = band_imm.i32 v14, -16
inst15   store.i32 notrap aligned v16, v1+1052
inst16   jump ebb2(v6)
remove inst4 as dup of inst2

The less common idiom is some kind of FP thing (involving FP negation? Is
that what the xor is for?):

inst144   v130 = load.f64 notrap aligned v3+1072  <---
inst145   v131 = bxor.f64 v130, v700
inst146   v132 = fdemote.f32 v131
inst147   v133 = fcmp.f32 gt v110, v132
inst148   v134 = bint.i32 v133
inst150   v136 = load.f64 notrap aligned v3+1072  <--- dup
inst151   v137 = fdemote.f32 v136
inst152   v138 = fcmp.f32 gt v137, v110
inst153   v139 = bint.i32 v138
inst154   v140 = band.i32 v134, v139
inst157   brnz.i32 v140, ebb8(v105, v437, v530, v552, v108)
inst158   v143 = iconst.i32 2865
...
remove inst150 as dup of inst144

Does anybody know what either of these two idioms are?

WIP patch containing the initial implementations and instrumentation, FTR.

Attachment #9074114 - Attachment is obsolete: true

The first excerpt looks like "stackAlloc" in wasm_box2d.wasm. It's allocating some stack space and doing alignment:

  (func (;25;) (type 4) (param i32) (result i32)
    (local i32)
    block (result i32)  ;; label = @1
      global.get 7
      local.set 1
      global.get 7
      local.get 0
      i32.add
      global.set 7
      global.get 7
      i32.const 15
      i32.add
      i32.const -16
      i32.and
      global.set 7
      local.get 1
    end)
    [...]
    (export "stackAlloc" (func 25))

The code in Emscripten is in src/library.js and looks like this:

  $stackAlloc: function(size) {
    size = size|0;
    var ret = 0;
    ret = STACKTOP;
    STACKTOP = (STACKTOP + size)|0;
    STACKTOP = (STACKTOP + 15)&-16;
#if ASSERTIONS || STACK_OVERFLOW_CHECK >= 2
    if ((STACKTOP|0) >= (STACK_MAX|0)) abortStackOverflow(size|0);
#endif
    return ret|0;
  },

Two possible conclusions to draw here:

  • Emscripten's code here could be better optimized to avoid redundant global get/set traffic
  • If we do look into alias analysis, perhaps globals are a good place to start, as producers may not see them as expensive. Also, we can be fully precise with globals.

The second is Emscripten insisting on using an imported global variable named Infinity rather than just using an f32.const to produce an Infinity. Here's the code in wasm_box2d.wasm:

  local.get 13
  global.get 11
  f64.neg
  f32.demote_f64
  f32.gt
  local.get 13
  global.get 11
  f32.demote_f64
  f32.lt
  i32.and

The loads are loading from global 11:

(global (;11;) (mut f64) (global.get 3))

which is initialized as a copy of global 3:

(import "global" "Infinity" (global (;3;) f64))

which is importing an Infinity global.

This also prevents the f64.neg and both f32.demotes from being constant-folded away.

Dan, thanks for identifying those idioms.

I should add, per comments 10 and 16, that I'm abandoning this line of enquiry, at least for now. If the situation and/or expected workload changes, we can easily enough resume it.

Not currently active, though not resolved.

Priority: P1 → P3
Priority: P3 → P5
Blocks: cranelift
Status: NEW → RESOLVED
Closed: 4 years ago
Resolution: --- → INACTIVE