Closed Bug 777572 Opened 13 years ago Closed 11 years ago
IonMonkey: Fix access-fannkuch performance
Categories: Core :: JavaScript Engine, defect
Status: RESOLVED WORKSFORME
People: (Reporter: dvander, Unassigned)
Whiteboard: [ion:t]
Attachments: 1 file (1.57 KB, application/javascript)
We're about 3ms slower than JM+TI on my machine (7ms -> 9.7ms)
Attached is a version of the benchmark that takes longer to run, but still shows a performance gap:
JM+TI: 629ms
Ion: 784ms
V8: 565ms
Reporter
Updated•13 years ago
OS: Linux → All
Hardware: x86_64 → All

Comment 1•13 years ago
FWIW, my times are rather different:
~/bench/js; js.ion.opt.pristine --no-ion ./fannkuch.js
632
~/bench/js; js.ion.opt.pristine ./fannkuch.js
527
~/bench/js; d8 ./fannkuch.js
495
Reporter
Updated•13 years ago
Whiteboard: [js:p1:fx18]
Reporter
Updated•13 years ago
Whiteboard: [js:p1:fx18] → [ion:p1:fx18]
Reporter
Comment 2•13 years ago
See also bug 763958, which has a reduced test case.
Comment 3•13 years ago
On 32-bit, I get the following numbers for the attached testcase:
d8 (old) : 446 ms
js --no-ion : 480 ms
js --no-jm : 513 ms
js : 648 ms
JM+Ion is slower than either Ion or JM alone because we enter a hot inner loop via OSR (the one at line 34). With --no-jm, we compile the script 6x instead of 2x, but the last compilation enters the loop at line 32, an outer loop, so it runs faster.
The main problem with entering the inner loop via OSR is that it creates many phi nodes, and for a lot of them this results in a move from one stack slot to another (bug 779860). If I forbid entering the inner loop via OSR, JM+Ion is indeed about as fast as Ion.
I wrote a (hackish) patch for bug 779860 to give OSR phi-nodes a PASSTHROUGH policy, so that the phi reuses the allocation of the first operand. This brings us down to 568 ms. Doing the same for phis at loop headers gets us to 515 ms. Unfortunately, the latter fails some jit-tests so it will need more work, but it at least confirms that the move groups are indeed the biggest problem.
There's also a smaller problem: after a bounds check fails, we mark all bounds checks in the same script as non-movable the next time we compile the script. If I comment out the two lines for this in IonBuilder::addBoundsCheck, on top of the other patch, we're down to 462 ms. I get 448 ms with --no-jm; I'm pretty sure that's faster just because entering an outer loop via OSR allows us to hoist more bounds checks. A more fine-grained heuristic may be worthwhile.
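For reference, the loop shape being discussed looks roughly like this (an illustrative, made-up sketch, not the actual benchmark source; the function name and loop bounds are invented, and the comments map it to the line numbers above):

function fannkuchLike(n) {
  var perm = new Array(n);
  var count = new Array(n);
  var checksum = 0;
  for (var outer = 0; outer < n * n; outer++) {  // outer loop ("line 32")
    for (var i = 0; i < n; i++) {                // hot inner loop ("line 34"):
      perm[i] = i;                               // it runs the most iterations, so OSR
      count[i] = (count[i] | 0) + i;             // enters compiled code here, and every
    }                                            // live value needs an OSR phi node
    checksum += perm[n - 1] + count[0];
  }
  return checksum;
}
print(fannkuchLike(8));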
Comment 4•13 years ago
Thinking about this some more, it would be really beneficial for access-fannkuch (and other benchmarks like gaussian-blur) if we tried harder to enter *outer* loops via OSR instead of *inner* loops. It has some other advantages and should be easy to prototype, so I'll try that first.
Comment 5•13 years ago
The main remaining problems after fixing bug 780973:
(1) Compilation time (on the real benchmark, not the modified/attached one; hopefully bug 774253 will help).
(2) There's a hoisted bounds check that fails: we reach a SETELEM for the first time and arrayWriteHole is not (yet) set. The result is that we invalidate, and the next time we don't optimize bounds checks in the script. If I change Ion to always optimize bounds checks, we're as fast as V8 on the attached (modified) benchmark and faster than JM+TI. (The pattern is sketched below.)
Not sure how to fix (2), any ideas?
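To make (2) concrete, here is an illustrative, made-up sketch of the pattern (not the benchmark's exact code): the array starts out empty, so the very first execution of the SETELEM writes past the current length. Because arrayWriteHole was not set yet, the store was compiled with a plain (hoistable) bounds check, the hoisted check fails immediately, we invalidate, and on the next compile bounds checks in the script are no longer optimized.

function populate(n) {
  var a = [];        // length 0, so every write below grows the array
  for (var i = 0; i < n; i++)
    a[i] = i;        // the first execution of this SETELEM is out of bounds
  return a;
}
populate(100);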
Comment 6•12 years ago
(In reply to Jan de Mooij (:jandem) from comment #5)
> the next time don't optimize bounds checks in the script.
What causes it not to optimize bounds checks the next time around?
Comment 7•12 years ago
(In reply to David Mandelin [:dmandelin] from comment #6)
> (In reply to Jan de Mooij (:jandem) from comment #5)
> > the next time don't optimize bounds checks in the script.
>
> What causes it not to optimize bounds checks the next time around?
Both JM and Ion set script->failedBoundsCheck whenever we bail out due to a failing bounds check. During compilation, if that flag is set, we don't optimize (LICM/GVN) bounds checks, to avoid cases where bounds check hoisting causes us to bail out all the time.
I discussed this a bit with bhackett a few days ago - what would work is to store this bit per loop instead of per script. The main problem is that after running LICM, GVN, etc., when we bail out we don't easily know from which loop the bounds check was hoisted, if any. Another problem is where to store this bit for every loop - probably either ScriptAnalysis (cleared on GC) or script->types.
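To illustrate why the per-script flag is too coarse, here is a made-up sketch (invented code, not from the benchmark): only the second loop ever writes out of bounds, but once its bailout sets script->failedBoundsCheck, the next compile also refuses to hoist the bounds check out of the first, always-in-bounds loop.

function twoLoops(src, n) {
  var sum = 0;
  for (var i = 0; i < src.length; i++)
    sum += src[i];     // always in bounds; this check could safely be hoisted
  var out = [];
  for (var j = 0; j < n; j++)
    out[j] = sum + j;  // grows |out|; its (hoisted) check is the one that fails
  return out;
}
twoLoops([1, 2, 3], 100);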
Comment 8•12 years ago
(In reply to Jan de Mooij (:jandem) from comment #7)
> (In reply to David Mandelin [:dmandelin] from comment #6)
> > (In reply to Jan de Mooij (:jandem) from comment #5)
> > > the next time don't optimize bounds checks in the script.
> >
> > What causes it not to optimize bounds checks the next time around?
>
> Both JM and Ion set script->failedBoundsCheck whenever we bailout due to a
> failing bounds check. During compilation, if that flag is set, we don't
> optimize (LICM/GVN) bounds checks, to avoid cases where bounds check
> hoisting causes us to bailout all the time.
>
> I discussed this a bit with bhackett a few days ago - what would work is to
> store this bit per-loop instead of per-script. Main problem is that after
> running LICM, GVN etc we don't easily know when we bailout from which loop
> the bounds check was hoisted, if any. Another problem is where to store this
> bit for every loop - probably either ScriptAnalysis (cleared on GC) or
> script->types.
Can you help me understand the details of what's going on? First, which loop/setelem in the benchmark isn't getting fully optimized?
I gather that arrayWriteHole is set when you write to an array past |length-1|, and that it's used to add the array to the |growArrays| list. But I don't immediately see how that flows into IM--I just see that it's used in |hoistArrayLengthCheck|, which looks like a JM+TI optimization. Basically, my question is, how does the arrayWriteHole condition affect IM? Does it control whether you end up calling CodeGenerator::visitStoreElement(Hole)T?
Is this what's happening?
- First time around, compile with visitStoreElement. Hoist the bounds check.
- Run the code and fail the bounds check
- Second time around, compile with ??? (which one is it? and which bounds check isn't getting hoisted?)
Comment 9•12 years ago
(In reply to David Mandelin [:dmandelin] from comment #8)
> Basically, my question is, how does the arrayWriteHole condition affect IM?
> Does it control whether you end up calling
> CodeGenerator::visitStoreElement(Hole)T?
Yeah, exactly. If arrayWriteHole is *not* set, we assume all writes will be in-bounds. This is the fastest case because we can use separate MIR instructions for the bounds check (MBoundsCheck) and the dense-array elements vector (MElements). Using these instructions is awesome because we can LICM/GVN/etc. them separately.
If we *do* suddenly write to an out-of-bounds index, the bounds check will fail. We invalidate and rejoin the interpreter, the interpreter properly sets the arrayWriteHole flag, and once the script has warmed up again we compile again.
OTOH, if during compilation we see that arrayWriteHole *is* set, we know we are likely going to write out-of-bounds values, so we use MStoreElementHole. This instruction has the bounds check "built in", and if it fails it jumps to an OOL (out-of-line) path to store the out-of-bounds value (with a VM call to resize the array if needed). The downside, of course, is that we can no longer LICM/GVN the bounds check, since it's not a separate instruction.
This distinction works well in practice because most SETELEMs are always in-bounds, and the ones that are not can be handled using MStoreElementHole.
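As a made-up illustration of the two flavors (the mapping to MIR instructions follows the explanation above; the code itself is invented):

function scaleAndAppend(arr, n) {
  for (var i = 0; i < arr.length; i++)
    arr[i] = arr[i] * 2;  // always in bounds: MBoundsCheck + MElements +
                          // MStoreElement, so the check can be hoisted by LICM/GVN
  var out = [];
  for (var j = 0; j < n; j++)
    out[j] = j;           // grows |out|, so arrayWriteHole gets set and this is
                          // compiled as MStoreElementHole (check not hoistable)
  return arr.length + out.length;
}
scaleAndAppend([1, 2, 3], 100);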
What happens on ss-fannkuch is as follows:
(1) We compile the script. We see that arrayWriteHole is not set for the SETELEM at line 27, so we use MBoundsCheck + MStoreElement. (arrayWriteHole is not set because we have not executed this particular SETELEM yet.)
(2) The bounds check fails. We bail out, set script->failedBoundsCheck, invalidate the script, and go back to the interpreter. The interpreter then executes the SETELEM, sets arrayWriteHole, etc.
(3) We interpret, JM-compile, Ion-compile again. This time, arrayWriteHole is set so we (correctly) choose StoreElementHole.
The problem, however, is that when we set script->failedBoundsCheck in step 2, we mark *all* MBoundsChecks in the script as non-movable the next time we compile the script (we don't allow LICM/GVN on them). This hurts other, totally unrelated GETELEMs/SETELEMs.
Both JITs use script->failedBoundsCheck to prevent us from bailing out all the time: in some cases a hoisted bounds check fails, whereas it wouldn't have failed without hoisting.
So basically everything works fine, except that script->failedBoundsCheck causes us to deoptimize too much: we disallow moving *any* bounds checks in the script and this hurts us.
Comment 10•12 years ago
I see. Thanks for explaining. I can imagine a few different ways of solving this:
- Tracking bounds check failures per loop, like you and Brian discussed, seems like it should totally work. I guess I don't know the details of how instructions are stored, so I don't know how easy it would be to track through optimizations, but in principle it seems doable.
- I notice the loop in question has a 'populate' structure, i.e., it iterates i from 1 to n and writes i into an array at index i. It seems like you could recognize that pattern and do various interesting things with it, including remembering to hoist the bounds check.
- What's the status of the bounds analysis? It seems like we should be able to eliminate the bounds check entirely for this loop by sizing the array on entry (see the sketch after this list).
- It seems like it could work to turn off bounds check optimizations only after a failedBoundsCheck has occurred twice, but I think that might lead to too many recompilations.
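A made-up sketch of the 'populate' pattern and the pre-sizing idea from the list above (illustrative only; this is not an existing optimization, and the pre-sizing is done by hand here): if the array is known to hold n elements before the loop, then 1 <= i < n == perm.length for every write, so the bounds check is provably redundant and could be eliminated (or at least safely hoisted).

function populatePresized(n) {
  var perm = [];
  for (var k = 0; k < n; k++)  // pre-size: after this loop, perm.length == n
    perm.push(0);
  for (var i = 1; i < n; i++)
    perm[i] = i;               // provably in bounds given the pre-sizing above
  return perm;
}
populatePresized(100);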
Reporter
Updated•12 years ago
Whiteboard: [ion:p1:fx18] → [ion:t]
Comment 11•11 years ago
Using Sunspider, I get:
Nightly 29.0a1 (2014-01-22) - fannkuch: 8.5ms +/- 4.4%
Chrome 32 - fannkuch: 11.9ms +/- 7.2%
Comment 12•11 years ago
On the attached benchmark I get 529ms on Nightly and 633ms on Chrome 34.
Nightly is also faster on all 3 machines on AWFY, so I'm marking this bug as wfm.
Status: NEW → RESOLVED
Closed: 11 years ago
Resolution: --- → WORKSFORME