Closed Bug 777572 Opened 13 years ago Closed 11 years ago
IonMonkey: Fix access-fannkuch performance
Categories: Core :: JavaScript Engine, defect
Status: RESOLVED WORKSFORME
People: (Reporter: dvander, Unassigned)
Whiteboard: [ion:t]
Attachments: 1 file (1.57 KB, application/javascript)
We're about 3ms slower than JM+TI on my machine (7ms -> 9.7ms)
Attached is a version of the benchmark that takes longer to run, but still shows a performance gap:
JM+TI: 629ms
Ion: 784ms
V8: 565ms
Reporter
Updated•13 years ago
OS: Linux → All
Hardware: x86_64 → All

Comment 1•13 years ago
FWIW, my times are rather different:
~/bench/js; js.ion.opt.pristine --no-ion ./fannkuch.js
632
~/bench/js; js.ion.opt.pristine ./fannkuch.js
527
~/bench/js; d8 ./fannkuch.js
495
Reporter
Updated•13 years ago
Whiteboard: [js:p1:fx18]
Reporter
Updated•13 years ago
Whiteboard: [js:p1:fx18] → [ion:p1:fx18]
Reporter
Comment 2•13 years ago
See also bug 763958, which has a reduced test case.
Comment 3•13 years ago
On 32-bit, I get the following numbers for the attached testcase:
d8 (old) : 446 ms
js --no-ion : 480 ms
js --no-jm : 513 ms
js : 648 ms
JM+Ion is slower than either Ion or JM alone because we enter a hot inner loop via OSR (the one at line 34). With --no-jm, we compile the script 6x instead of 2x, but the last compilation enters the loop at line 32, an outer loop, so it runs faster.
The main problem with entering the inner loop via OSR is that it creates many phi nodes, and for a lot of them this results in a move from one stack slot to another (bug 779860). If I forbid entering the inner loop via OSR, JM+Ion is indeed about as fast as Ion.
I wrote a (hackish) patch for bug 779860 to give OSR phi-nodes a PASSTHROUGH policy, so that the phi reuses the allocation of the first operand. This brings us down to 568 ms. Doing the same for phis at loop headers gets us to 515 ms. Unfortunately, the latter fails some jit-tests so it will need more work, but it at least confirms that the move groups are indeed the biggest problem.
There's also a smaller problem: after a bounds check fails, we mark all bounds checks in the same script as non-movable the next time we compile the script. If I comment out the two lines for this in IonBuilder::addBoundsCheck, on top of the other patch, we're down to 462 ms. I get 448 ms with --no-jm; I'm pretty sure that's faster just because entering an outer loop via OSR allows us to hoist more bounds checks. A more fine-grained heuristic may be worthwhile.
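For reference, the loop shape being discussed looks roughly like this (an illustrative, made-up sketch, not the actual benchmark source; the function name and loop bounds are invented, and the comments map it to the line numbers above):

function fannkuchLike(n) {
  var perm = new Array(n);
  var count = new Array(n);
  var checksum = 0;
  for (var outer = 0; outer < n * n; outer++) {  // outer loop ("line 32")
    for (var i = 0; i < n; i++) {                // hot inner loop ("line 34"):
      perm[i] = i;                               // it runs the most iterations, so OSR
      count[i] = (count[i] | 0) + i;             // enters compiled code here, and every
    }                                            // live value needs an OSR phi node
    checksum += perm[n - 1] + count[0];
  }
  return checksum;
}
print(fannkuchLike(8));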
Comment 4•13 years ago
Thinking about this some more, it would be really beneficial for access-fannkuch (and other benchmarks like gaussian-blur) if we tried harder to enter *outer* loops via OSR instead of *inner* loops. It has some other advantages and should be easy to prototype, so I'll try that first.
Comment 5•13 years ago
The main remaining problems after fixing bug 780973:
(1) Compilation time (on the real benchmark, not the modified/attached one; hopefully bug 774253 will help).
(2) There's a hoisted bounds check that fails: we reach a SETELEM for the first time and arrayWriteHole is not (yet) set. The result is that we invalidate, and the next time we don't optimize bounds checks in the script. If I change Ion to always optimize bounds checks, we're as fast as V8 on the attached (modified) benchmark and faster than JM+TI. (The pattern is sketched below.)
Not sure how to fix (2), any ideas?
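To make (2) concrete, here is an illustrative, made-up sketch of the pattern (not the benchmark's exact code): the array starts out empty, so the very first execution of the SETELEM writes past the current length. Because arrayWriteHole was not set yet, the store was compiled with a plain (hoistable) bounds check, the hoisted check fails immediately, we invalidate, and on the next compile bounds checks in the script are no longer optimized.

function populate(n) {
  var a = [];        // length 0, so every write below grows the array
  for (var i = 0; i < n; i++)
    a[i] = i;        // the first execution of this SETELEM is out of bounds
  return a;
}
populate(100);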
Comment 6•12 years ago
(In reply to Jan de Mooij (:jandem) from comment #5)
> the next time don't optimize bounds checks in the script.
What causes it not to optimize bounds checks the next time around?
Comment 7•12 years ago
(In reply to David Mandelin [:dmandelin] from comment #6)
> (In reply to Jan de Mooij (:jandem) from comment #5)
> > the next time don't optimize bounds checks in the script.
>
> What causes it not to optimize bounds checks the next time around?
Both JM and Ion set script->failedBoundsCheck whenever we bail out due to a failing bounds check. During compilation, if that flag is set, we don't optimize (LICM/GVN) bounds checks, to avoid cases where bounds check hoisting causes us to bail out all the time.
I discussed this a bit with bhackett a few days ago - what would work is to store this bit per loop instead of per script. The main problem is that after running LICM, GVN, etc., when we bail out we don't easily know from which loop the bounds check was hoisted, if any. Another problem is where to store this bit for every loop - probably either ScriptAnalysis (cleared on GC) or script->types.
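To illustrate why the per-script flag is too coarse, here is a made-up sketch (invented code, not from the benchmark): only the second loop ever writes out of bounds, but once its bailout sets script->failedBoundsCheck, the next compile also refuses to hoist the bounds check out of the first, always-in-bounds loop.

function twoLoops(src, n) {
  var sum = 0;
  for (var i = 0; i < src.length; i++)
    sum += src[i];     // always in bounds; this check could safely be hoisted
  var out = [];
  for (var j = 0; j < n; j++)
    out[j] = sum + j;  // grows |out|; its (hoisted) check is the one that fails
  return out;
}
twoLoops([1, 2, 3], 100);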
Comment 8•12 years ago
(In reply to Jan de Mooij (:jandem) from comment #7)
> (In reply to David Mandelin [:dmandelin] from comment #6)
> > (In reply to Jan de Mooij (:jandem) from comment #5)
> > > the next time don't optimize bounds checks in the script.
> >
> > What causes it not to optimize bounds checks the next time around?
>
> Both JM and Ion set script->failedBoundsCheck whenever we bailout due to a
> failing bounds check. During compilation, if that flag is set, we don't
> optimize (LICM/GVN) bounds checks, to avoid cases where bounds check
> hoisting causes us to bailout all the time.
>
> I discussed this a bit with bhackett a few days ago - what would work is to
> store this bit per-loop instead of per-script. Main problem is that after
> running LICM, GVN etc we don't easily know when we bailout from which loop
> the bounds check was hoisted, if any. Another problem is where to store this
> bit for every loop - probably either ScriptAnalysis (cleared on GC) or
> script->types.
Can you help me understand the details of what's going on? First, which loop/setelem in the benchmark isn't getting fully optimized?
I gather that arrayWriteHole is set when you write to an array past |length-1|, and that it's used to add the array to the |growArrays| list. But I don't immediately see how that flows into IM--I just see that it's used in |hoistArrayLengthCheck|, which looks like a JM+TI optimization. Basically, my question is, how does the arrayWriteHole condition affect IM? Does it control whether you end up calling CodeGenerator::visitStoreElement(Hole)T?
Is this what's happening?
- First time around, compile with visitStoreElement. Hoist the bounds check.
- Run the code and fail the bounds check
- Second time around, compile with ??? (which one is it? and which bounds check isn't getting hoisted?)
Comment 9•12 years ago
(In reply to David Mandelin [:dmandelin] from comment #8)
> Basically, my question is, how does the arrayWriteHole condition affect IM?
> Does it control whether you end up calling
> CodeGenerator::visitStoreElement(Hole)T?
Yeah, exactly. If arrayWriteHole is *not* set, we assume all writes will be in-bounds. This is the fastest case because we can use separate MIR instructions for the bounds check (MBoundsCheck) and the dense-array elements vector (MElements). Using these instructions is awesome because we can LICM/GVN/etc. them separately.
If we *do* suddenly write to an out-of-bounds index, the bounds check will fail. We invalidate and rejoin the interpreter, the interpreter properly sets the arrayWriteHole flag, and once the script has warmed up again we compile again.
OTOH, if during compilation we see that arrayWriteHole *is* set, we know we are likely going to write out-of-bounds values, so we use MStoreElementHole. This instruction has the bounds check "built in", and if it fails it jumps to an OOL (out-of-line) path to store the out-of-bounds value (with a VM call to resize the array if needed). The downside, of course, is that we can no longer LICM/GVN the bounds check, since it's not a separate instruction.
This distinction works well in practice because most SETELEMs are always in-bounds, and the ones that are not can be handled using MStoreElementHole.
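As a made-up illustration of the two flavors (the mapping to MIR instructions follows the explanation above; the code itself is invented):

function scaleAndAppend(arr, n) {
  for (var i = 0; i < arr.length; i++)
    arr[i] = arr[i] * 2;  // always in bounds: MBoundsCheck + MElements +
                          // MStoreElement, so the check can be hoisted by LICM/GVN
  var out = [];
  for (var j = 0; j < n; j++)
    out[j] = j;           // grows |out|, so arrayWriteHole gets set and this is
                          // compiled as MStoreElementHole (check not hoistable)
  return arr.length + out.length;
}
scaleAndAppend([1, 2, 3], 100);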
What happens on ss-fannkuch is as follows:
(1) We compile the script. We see that arrayWriteHole is not set for the SETELEM at line 27, so we use MBoundsCheck + MStoreElement. (arrayWriteHole is not set because we have not executed this particular SETELEM yet.)
(2) The bounds check fails. We bail out, set script->failedBoundsCheck, invalidate the script, and go back to the interpreter. The interpreter then executes the SETELEM, sets arrayWriteHole, etc.
(3) We interpret, JM-compile, Ion-compile again. This time, arrayWriteHole is set so we (correctly) choose StoreElementHole.
The problem, however, is that when we set script->failedBoundsCheck in step 2, we mark *all* MBoundsChecks in the script as non-movable the next time we compile the script (we don't allow LICM/GVN on them). This hurts other, totally unrelated GETELEMs/SETELEMs.
Both JITs use script->failedBoundsCheck to prevent us from bailing out all the time: in some cases a hoisted bounds check fails, whereas it wouldn't have failed without hoisting.
So basically everything works fine, except that script->failedBoundsCheck causes us to deoptimize too much: we disallow moving *any* bounds checks in the script and this hurts us.
Comment 10•12 years ago
I see. Thanks for explaining. I can imagine a few different ways of solving this:
- Tracking bounds check failures per loop, like you and Brian discussed, seems like it should totally work. I guess I don't know the details of how instructions are stored, so I don't know how easy it would be to track through optimizations, but in principle it seems doable.
- I notice the loop in question has a 'populate' structure, i.e., it iterates i from 1 to n and writes i into an array at index i. It seems like you could recognize that pattern and do various interesting things with it, including remembering to hoist the bounds check.
- What's the status of the bounds analysis? It seems like we should be able to eliminate the bounds check entirely for this loop by sizing the array on entry (see the sketch after this list).
- It seems like it could work to turn off bounds check optimizations only after a failedBoundsCheck has occurred twice, but I think that might lead to too many recompilations.
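A made-up sketch of the 'populate' pattern and the pre-sizing idea from the list above (illustrative only; this is not an existing optimization, and the pre-sizing is done by hand here): if the array is known to hold n elements before the loop, then 1 <= i < n == perm.length for every write, so the bounds check is provably redundant and could be eliminated (or at least safely hoisted).

function populatePresized(n) {
  var perm = [];
  for (var k = 0; k < n; k++)  // pre-size: after this loop, perm.length == n
    perm.push(0);
  for (var i = 1; i < n; i++)
    perm[i] = i;               // provably in bounds given the pre-sizing above
  return perm;
}
populatePresized(100);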
Reporter
Updated•12 years ago
Whiteboard: [ion:p1:fx18] → [ion:t]
Comment 11•11 years ago
Using Sunspider, I get:
Nightly 29.0a1 (2014-01-22) - fannkuch: 8.5ms +/- 4.4%
Chrome 32 - fannkuch: 11.9ms +/- 7.2%
Comment 12•11 years ago
On the attached benchmark I get 529ms on Nightly and 633ms on Chrome 34.
Nightly is also faster on all 3 machines on AWFY, so I'm marking this bug as wfm.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → WORKSFORME