Closed Bug 1313576 Opened 5 years ago Closed 5 years ago
Wasm baseline: Simple bounds check elimination
On 32-bit systems (ARM, mainly) the baseline compiler will be at a more significant disadvantage to the ion compiler because we have no bounds check elimination. Eliminating bounds checks within straight-line code (for same base+offset minimally, perhaps for same base alone) is an appealing approach. Luke writes, on the Wasm reflector: > So one observation is that a large portion of the bounds-check-elimination > allowed by the 4gb hack can be avoided by just having a small guard region > at the end of wasm memory. Our initial experiments show that basic analyses > (nothing to do with loops, just eliminating checks on loads/stores w/ the same > base pointer) can already eliminate roughly half of bounds checks. And > probably this could get better. No data yet on how much of that required analysis in extended basic blocks, but it's a useful observation to keep in mind.
Just to get an idea of what the maximal potential is for simple elimination (which probably means matching against address expressions of the form get_local and get_local+offset in some kind of cache of recent, dominating checks) I added some simple static and dynamic instrumentation to the baseline compiler, details below. Results for the embenchen suite are here: https://docs.google.com/spreadsheets/d/1x3bfbJkfYKPYbmpjynl_5e3CUc-ITdCf6TfhjAba23M/edit?usp=sharing In summary, - With the optimization that landed in bug 1316831, *all* bounds checks on constant accesses are removed already for *all* these programs - For three of 10 programs in the suite, more than 70% of the *dynamic* accesses fit the simple patterns - For six of 10 programs in the suite, more than 50% fit, ditto - For nine of 10 programs, more than 30% of the accesses fit, ditto These numbers don't tell us what fraction we can realistically hope to eliminate, but the potential is at least not limited by the address expressions being too hard to analyze. The generic framework for these profile counters will be a patch on bug 1320645. The specific counting code will be attached here.
PoC. This implements a simple online bce algorithm as a bitvector problem, see the comment block in the patch. It tracks prior bounds checks on locals that have not since been updated and eliminates redundant checks for small offsets. I think this is roughly right, and it passes all testing so far, but I'm not going to offer it for review right now as it needs some naming changes and a generalization to functions with more locals. The cost of this optimization is probably quite small: There are slight overheads in get_local and set_local, and then a little dance at control flow points. Control nodes grow by two fields, which are words now but may be larger eventually. I have updated the spreadsheet referenced in my earlier comment with data regarding dynamic checks eliminated. 20% is a common number but eg the bullet benchmark sees a 42% reduction while others see very few removals indeed. For the programs that see few removals of checks there are a couple of possibilities yet to be explored: - They use more than 32 local variables (I'm lazy, so the PoC uses a length-32 bitvector and does not eliminate accesses where the base pointer is a local above 31) - They update locals instead of using constant base addresses and offsets in straight-line code like loop bodies - They use more complex addressing expressions The first problem can be fixed here, the second problem *might* be fixed in emscripten, and the third problem we'll probably just have to live with, unless there are very common expressions that we can easily track.
Making the bitset 64 bits improved only bullet, which is already doing quite well.
When you match "get_local+offset", are you performing and eliminating bounds checks on the local (not including offset), when offset < wasm::GuardSize? This is what we do for Ion and it's quite effective because you often have many loads/stores with the same base. Also, just to be sure, when you say "get_local+offset", this is just the offset immediate of a load/store opcode, not a general i32.add'ed offset, right? (I ask it's often tempting to do the latter but there's the stubborn corner case where the base address is out-of-bounds but the i32.add wraps it back in-bounds.)
(In reply to Luke Wagner [:luke] from comment #6) > When you match "get_local+offset", are you performing and eliminating bounds > checks on the local (not including offset), when offset < wasm::GuardSize? > This is what we do for Ion and it's quite effective because you often have > many loads/stores with the same base. Yes. > Also, just to be sure, when you say "get_local+offset", this is just the > offset immediate of a load/store opcode, not a general i32.add'ed offset, > right? Yes. The latter data set (actual checks removed) does away with the local+offset case, folding it into the local case, by disregarding offsets below the guard size.
Some final comments here: - The bce solver marks a local as non-known-good whenever it's updated but there are some assignments that should instead leave the local marked as known-good: - rhs is a known-good constant pointer - rhs is a known-good local (ie we're making a copy) I doubt that these are important cases though they may be worth investigating. It would be cheap to implement the tracking logic, so cost is not an argument against doing it. - The bce solver tracks locals only, but sync() calls will turn locals on the stack into enregistered values, and we use sync() liberally currently (bug 1316817; bug 1316802; bug 1318654). In principle there is therefore the chance that information is lost if a sync() intervenes between the time the address expression is pushed and the time it is consumed. I doubt this is much of a problem, because the natural way of using a local as a pointer is to load it directly before using it, and the pointer expression is on the stack top, loaded after the expression is computed for the store. The only way of getting a sync there is to do something like this: compute expr get_local for the pointer <do something that syncs but leaves no value> store - Looking at some of the codes that don't benefit from bce, eg fannkuch, it becomes apparent that address expressions in even simple loops tend to be complex, partly as a result of updating a pointer as part of the addressing expression (using eg tee_local). This is typical: (i32.store (get_local $12) (i32.load (tee_local $12 (i32.add (get_local $6) (i32.shl (get_local $1) (i32.const 2)))))) While emscripten might in principle generate baseline-friendlier code I don't think that should be a goal unless other goals (Ion performance; byte code size) are met at the same time, and the bce wins are clearly worth it.
I added some ARM results to the spreadsheet, resulting from running with and without the bce (an updated patch with a small bug fix and 64-bit sets). Too few iterations to be scientific, but I see a 10% speedup on the bullet benchmark, which is not suprising given the number of bounds checks that were removed. There are smaller improvements on other benchmarks, but the numbers are fairly noisy. There seems to be a rough correspondence between the percentage of checks removed and the percentage of improvement in running times between baseline and ion, though, so for ARM we can probably roughly assume that removing a check results in a speedup. This does not seem to be the case for x86, which doesn't improve very much from bce, presumably a result of better branch prediction.
Simple forward bounds check elimination. There's a large block comment in the patch that explains how this works, please read that for more information. I still need to gather better performance data on ARM to see if this is truly worthwhile (for a baseline compiler). Clearly the removal rates on some codes (I believe Bullet is very OO and it is credible that it should benefit from this optimization) are very appealing, but there should be actual performance wins too, and I've had a hard time repeating the speedups I reported earlier. Still, asking for review in order to vet the approach.
While you're measuring cost/benefit, it'd also be interesting to note whether there is any significant compile-time hit.
Performance looks good. Compilation time first: On ARM , running with --no-threads, compilation *speeds up* with the BCE patch from about 2040ms to about 1980ms. The timings are noisy but the trend is clear. Compilation is measured as the time it takes to run 'new WebAssembly.Module()' on the bytearray which has just been read by os.file.readFile() in a JS shell program. On x86  32-bit, the trend is the same: with BCE, compilation drops from about 760ms to about 730ms. A reasonable interpretation of the speedup may be that enough bounds checks are removed that there is less code and metainformation to generate and copy but I don't have data to back that up. (For hack value: The Ion compilation time is 14600ms on ARM, 5600ms on x86.) As for run time: I've attempted to quantify  measurement noise by comparing a program against itself, it looks like noise is within 1% measured this way, on both platforms. The following times are for 11 runs on x86, 5 runs on ARM. On ARM I see some speedups outside the noise range on box2d (2%) and bullet (4.5%) and skinning (3.1%). These also speed up on x86 (1.6%, 3%, 2.4%). And according to earlier profiling, BCE removes many checks in these programs. On x86 I also see speedups on zlib and scimark. There are a few mysterious slowdowns. All but ifs (a microbenchmark that completes in 200ms) disappear with repeated runs, but the speedups do not, so I think we're OK.  Jetson TK1 4 x Tegra (Cortex-A15) @ 2.3GHz, 2GB RAM, SSD, lightly loaded (ssh access, no X11), Ubuntu 14  2013 MacBook Pro, lightly loaded (no browsers running)  https://github.com/lars-t-hansen/embenchen, see asm_v_wasm/wasm_bench.sh, run as follows to compare two shells with default problem size, see the program for more information: JS_SHELL1=path/to/one/shell JS_SHELL2=path/to/another/shell ./wasm_bench.sh -m baseline -n 11
Comment on attachment 8828088 [details] [diff] [review] bug1313576-simple-bce-v2.patch Review of attachment 8828088 [details] [diff] [review]: ----------------------------------------------------------------- Wow, this was a surprisingly easy to understand patch; nice work! ::: js/src/wasm/WasmBaselineCompile.cpp @@ +5531,5 @@ > > popStackOnBlockExit(block.framePushed); > popValueStackTo(block.stackSize); > > + // bceSafe_ stays the same along the fallthrough path. Could you append "because branches to loops branch to the top". Also, by that logic, can the `block.bceSafeOnExit &= bceSafe_` above be removed? @@ +6530,5 @@ > +// For an access through a local variable + an offset we can eliminate the > +// bounds check if the local variable has already been checked and has not been > +// updated since, and the offset is less than the guard limit. > +// > +// To track locals for which we can eliminate cheks we use a bit vector bceSafe_ s/cheks/checks/ @@ +6561,5 @@ > +// > +// - After an if-then-else, set bceSafe_ to the if-then-else's bceSafeOnExit. > +// > +// - After an if-then, set bceSafe_ to the if-then's bceSafeOnExit AND'ed with > +// the if-then's bceSafeOnEntry. When debugEnabled_, at every debug trap location, a local can be mutated (eventually). Could you mention this case in this list? I suspect the right cut point here to disable when debugEnabled_ isin bceCheckLocal().
Attachment #8828088 - Flags: review?(luke) → review+
https://hg.mozilla.org/integration/mozilla-inbound/rev/dc0c0108e86dcb172a6b160c0438c5a5b4d1e18b Bug 1313576 - wasm baseline, simple bounds check elimination. r=luke
You need to log in before you can comment on or make changes to this bug.