Open Bug 1377710 Opened 7 years ago Updated 6 months ago

Compiling lots of unused branches can be expensive

Categories

(Core :: JavaScript Engine, enhancement, P5)

enhancement

Tracking

()

Performance Impact low
Tracking Status
firefox57 --- wontfix
firefox58 --- wontfix
firefox59 --- ?

People

(Reporter: ehsan.akhgari, Unassigned)

References

(Blocks 1 open bug)

Details

(Keywords: perf)

Attachments

(6 files, 2 obsolete files)

Attached file microbenchmark
See the microbenchmark. Nightly on my machine gets around 550-850ms (the numbers can vary vastly), Chrome Dev channel gets around 5-8ms. We seem to be spending a lot of time in PruneUnusedBranches() and MDefinition::justReplaceAllUsesWith(). Note that I wrote this microbenchmark while trying to see if another patch improves the speed of another function, so I don't necessarily know whether optimizing this case is interesting or not but I figured I should file this.
I should be the right person to look into this issue, but I am currently focusing on enabling the bytecode cache.
Flags: needinfo?(tcampbell)
I'm seeing a completely different profile now. What I see is in https://perfht.ml/2uKD2Rq (on Win 10 2017-07-04 nightly) is that we bailout from Ion while looking up foo() and get stuck in baseline. I'll investigate this more as I look at other issues with IonBuilder perf. Clearing ni? for now as this was not related to specific realworld perf.
Flags: needinfo?(tcampbell)
Oops.. I was reading non-symbolicated profile wrong. Behavior is the same as before.
Assignee: nobody → nicolas.b.pierron
Status: NEW → ASSIGNED
I can reproduce this issue in the JS shell. Disabling Branch Pruning causes the time to go from 550ms to 30ms. Which is still a lot of time for such a simple function.
This patch add extra MIR flags which are only used for the analysis of Branch Pruning. Previously branch pruning had a loop to check for the usage of Phi, by visiting the linked list of uses, and iterate for at most 128 iterations throught the list of uses. This rewrite is caching the is-used state on the flags of the MPhi, and benefit from the Post-order visit of the graph to only propagate the is-used state one edge at a time. In addition, we still need to loop over the back-edges, until we reached the fix-point where all the MPhi have the is-used state. (which should be definitely less costly) The previous computation had a way higher complexity, but was only applied on the dead branches, where this code has a tighter loop which runs on all blocks (even the live one), but with a smaller complexity overall.
Attachment #8892131 - Flags: review?(tcampbell)
Counting Phi nodes as part of the set of instructions is not helpful, mostly because Phi are converted into move group in the previous blocks. Also Mphi cannot be effectful, so there is no need to go through the overhead of checking it.
Attachment #8892136 - Flags: review?(tcampbell)
Previously the cost of each block was computed when one candidate was detected, which might have caused the code to re-visit the blocks multiple times. This pre-compute the stats for each basic blocks in tight and predictible loops, and reduce a lot the remaining overhead seen while running Branch Pruning.
Attachment #8892137 - Flags: review?(tcampbell)
Comment on attachment 8892125 [details] [diff] [review] Branch Pruning: Check the compile info associated with the resume point. Review of attachment 8892125 [details] [diff] [review]: ----------------------------------------------------------------- Why is this better, considering MResumePoint::isObservableOperand just calls CompileInfo::isObservableSlot anyway? Will a later patch fix this? MResumePoint::isObservableOperand is not inlined so it will be slower..
Responding to Comment 11, note that it basically reverts Bug 1342016, which was a performance improvement. I do believe reverting those lines would be slower.
This fails jsapi-tests. > 8:03.41 c:\Users\tcampbell\Projects\gecko.ion\js\src\jsapi-tests/testJitMinimalFunc.h(49): error C2664: 'js::jit::CompileInfo::CompileInfo(js::jit::CompileInfo &&)': cannot convert argument 1 from 'int' to 'const js::jit::CompileInfo &' > 8:03.41 c:\Users\tcampbell\Projects\gecko.ion\js\src\jsapi-tests/testJitMinimalFunc.h(49): note: Reason: cannot convert from 'int' to 'const js::jit::CompileInfo' > 8:03.41 c:\Users\tcampbell\Projects\gecko.ion\js\src\jsapi-tests/testJitMinimalFunc.h(49): note: No constructor could take the source type, or constructor overload resolution was ambiguous
Comment on attachment 8892125 [details] [diff] [review] Branch Pruning: Check the compile info associated with the resume point. Review of attachment 8892125 [details] [diff] [review]: ----------------------------------------------------------------- Clearing review for now - we should avoid replacing an inlined call with an out-of-line call.
Attachment #8892125 - Flags: review?(jdemooij)
Attachment #8892125 - Attachment is obsolete: true
(In reply to Ted Campbell [:tcampbell] from comment #13) > This fails jsapi-tests. > > > 8:03.41 c:\Users\tcampbell\Projects\gecko.ion\js\src\jsapi-tests/testJitMinimalFunc.h(49): error C2664: 'js::jit::CompileInfo::CompileInfo(js::jit::CompileInfo &&)': cannot convert argument 1 from 'int' to 'const js::jit::CompileInfo &' Fixed locally as part of the modification of the CompileInfo, which now requires a TempAllocator& argument.
Comment on attachment 8892126 [details] [diff] [review] Cache isObservableSlot and isRecoverableSlot on the CompileInfo. Review of attachment 8892126 [details] [diff] [review]: ----------------------------------------------------------------- ::: js/src/jit/CompileInfo.h @@ +600,5 @@ > > + // Recall whether the slots have to be kept for stack iteration or if they > + // have to be recovered on bailout. > + Vector<bool, 0, JitAllocPolicy> isObservable_; > + Vector<bool, 0, JitAllocPolicy> isRecoverable_; How much memory do these things wind up consuming? What's the tradeoff? ::: js/src/jit/MIR-inl.h @@ +4,5 @@ > + * License, v. 2.0. If a copy of the MPL was not distributed with this > + * file, You can obtain one at http://mozilla.org/MPL/2.0/. */ > + > +#ifndef jit_MIR_inl_h > +#define jit_MIR_inl_h Per Bug 1388045 Comment 2, could these just go in MIR.h, as part of that bug?
(In reply to Sean Stangl [:sstangl] from comment #16) > ::: js/src/jit/CompileInfo.h > @@ +600,5 @@ > > > > + // Recall whether the slots have to be kept for stack iteration or if they > > + // have to be recovered on bailout. > > + Vector<bool, 0, JitAllocPolicy> isObservable_; > > + Vector<bool, 0, JitAllocPolicy> isRecoverable_; > > How much memory do these things wind up consuming? What's the tradeoff? They are only allocated once, without resize, so these only consumes ~2 byte per slot. This can be quite large for functions with tons of locals, but currently we consume way more space with all the resume points that rely on this info. Thus these 2 allocations per inlined JSScript are by far neglectable. The trade-off is between (before) call, cache misses and branch miss-predictions, vs. (after) one cache miss and few neglectable allocations. > ::: js/src/jit/MIR-inl.h > @@ +4,5 @@ > > + * License, v. 2.0. If a copy of the MPL was not distributed with this > > + * file, You can obtain one at http://mozilla.org/MPL/2.0/. */ > > + > > +#ifndef jit_MIR_inl_h > > +#define jit_MIR_inl_h > > Per Bug 1388045 Comment 2, could these just go in MIR.h, as part of that bug? If needed, yes.
Comment on attachment 8892137 [details] [diff] [review] Branch Pruning: Pre-compute and cache the cost of basic blocks. Review of attachment 8892137 [details] [diff] [review]: ----------------------------------------------------------------- Saving the instruction traversal seems like a nice win. The whole traversal seems rough and I wonder why we don't use real dominator information. Incremental improvement on existing code already, so I won't hold up this patch. ::: js/src/jit/IonAnalysis.cpp @@ +298,5 @@ > + size_t numInst; > + // Number of effectful instructions per block. > + size_t numEffectful; > + // Number of predecessors (excluding OSR blocks). > + size_t numPred; Not sure we gain a lot by caching numPred/numSucc, but is fine this way. @@ +334,5 @@ > + auto& info = infos[bid]; > + info.numInst = numInst; > + info.numEffectful = numEffectful; > + info.numPred = block->numPredecessors(); > + info.numPred = block->numSuccessors(); TYPO: info.numSucc @@ +421,5 @@ > // Iterate over the approximated set of dominated blocks and count > // the number of instructions which are dominated. Note that this > // approximation has issues with OSR blocks, but this should not be > // a big deal. > + if (!CollectBlocksInfos(mir, graph, infos)) Move before comment maybe? @@ +435,1 @@ > // Iterate over dominated blocks, and visit exit blocks as well. I'm not sure why we don't use BuildDominatorTree information rather than this rough approximation?
Attachment #8892137 - Flags: review?(tcampbell) → review+
Comment on attachment 8892136 [details] [diff] [review] Branch Pruning: Do not count Phi nodes as part of the weight of basic blocks. Review of attachment 8892136 [details] [diff] [review]: ----------------------------------------------------------------- Ignoring PHIs seems like the right call. The LLVM inliner seems to treat PHIs as zero-cost as well. Do we need to tweak |JitOptions.branchPruningInstFactor| based on average PHI density?
Attachment #8892136 - Flags: review?(tcampbell) → review+
Comment on attachment 8892131 [details] [diff] [review] Branch Pruning: Use MIR flags to track usage of Phi instructions. Review of attachment 8892131 [details] [diff] [review]: ----------------------------------------------------------------- I like the with general approach of having analysis flags to avoid repeated work. As we discussed on IRC, I'm having trouble following the BPPredRemoved logic. At the very least this transform needs better commenting to explain the invariant. I think you are also investigating if there is a hole or not. Will be nice to get rid of another quadratic analysis :) ::: js/src/jit/IonAnalysis.cpp @@ +93,5 @@ > + op->setInWorklist(); > + if (!worklist.append(op->toPhi())) > + return false; > + } > + B) If BPPredRemoved is set, we check all pred blocks for their marked status. Is this guaranteed to only have the expected blocks? I see either i) BPPredRemoved is uneeded, ii) there is a potential edge case, or iii) this is all very correct but needs documentation about why. @@ +109,5 @@ > + // operands would be marked in the following loop. > + phi->getOperand(p)->setUseRemovedUnchecked(); > + } > + } > + Can the block above and below be shared? @@ +200,5 @@ > // between the |block| and its successor |succ|. > MDefinition* def = phi->getOperand(predIndex); > if (def->isUseRemoved()) > continue; > A) Set on Phi without indicated which predIndex we are. @@ +517,5 @@ > if (!block->isMarked() && !block->unreachable()) > continue; > > + if (!FlagAllOperandsAsHavingRemovedUses(mir, block)) > + return false; Do other users of FlagAllOperandsAsHavingRemovedUses need to be fixed too? ::: js/src/jit/MIR.h @@ +211,5 @@ > + \ > + /* Used for branch pruning optimizarion. They mirror usage of resume point > + * and other instructions on phi instructions to cache the analysis. > + */ \ > + _(BPUsed) \ I'm not a fan of have local analysis local flags on MIR types. I wish we had a better system managing these sorts of IR analysis temporary metadata. Maybe explicitly mention they only are valid within branch pruning pass?
Attachment #8892131 - Flags: review?(tcampbell)
Comment on attachment 8892126 [details] [diff] [review] Cache isObservableSlot and isRecoverableSlot on the CompileInfo. Patch looks OK -- would just like to get rid of MIR-inl.h. Will re-review with that changed (addressed in a previous comment, also possibly addressed by another bug).
Attachment #8892126 - Flags: review?(sstangl)
This patch duplicates the EliminatePhis loop which flags all unused phi instructions. This is slightly too conservative about the observability in case of the presence of try blocks, which should be fixed in a follow-up issue. In the mean time, this patch solves the issue by flagging all Phi non-observable & non-used phis as being unused. Thus, FlagPhiInputsAsHavingRemovedUses only checks for phi which are lacking the Unused flag, to set the UseRemoved flag on the operand. FlagUnusedPhi is equivalent to the previous patch, and does not try to do it in the same loop, and it re-uses the existing logic for removing redundant Phis, and also for flag phis as unused. Checking the speed of the original test case, this is now executing in 29.8ms instead of 30ms when branch pruning is disabled. (comment 4)
Attachment #8899562 - Flags: review?(tcampbell)
Attachment #8892131 - Attachment is obsolete: true
Comment on attachment 8899562 [details] [diff] [review] Flag unused Phis before flagging phis as having removed uses. Review of attachment 8899562 [details] [diff] [review]: ----------------------------------------------------------------- This is a lot clearer :) Nice compile time improvement too. Please add a comment near FlagAllOperandsAsHavingRemovedUses that FlagUnusedPhis should be called first. ::: js/src/jit/IonAnalysis.cpp @@ +1221,5 @@ > // use from an actual instruction. > + // > + // At the end of this phase, every Phi which is not flag as being > + // unused is observed. Thus, the removal of of used Phi should > + // introduce the addition of UseRemoved flag to any of its operands, s/not flag as/not flagged as/ s/of of/of/ s/operands,/operands./
Attachment #8899562 - Flags: review?(tcampbell) → review+
(In reply to Ted Campbell [:tcampbell] from comment #18) > @@ +435,1 @@ > > // Iterate over dominated blocks, and visit exit blocks as well. > > I'm not sure why we don't use BuildDominatorTree information rather than > this rough approximation? Computing the dominator tree would be as costly as iterating over the dominated set, by counting that we have a balanced number of predecessors and successors. Moreover, after branch pruning, the dominator computation would no longer be valid as branch pruning might remove edges, causing more block to dominates others. (In reply to Ted Campbell [:tcampbell] from comment #19) > Do we need to tweak |JitOptions.branchPruningInstFactor| based on average > PHI density? Technically, yes. In practice, I do not think that this matter much knowing this is the small factor, and that this was a way to discard large single blocks with no Phi (octane/mandreel if I recall correctly). I might come back to increase this number in a follow-up bug if this causes performance issues on our benchmarks.
Pushed by npierron@mozilla.com: https://hg.mozilla.org/integration/mozilla-inbound/rev/c9985a01db87 Branch Pruning: Do not count Phi nodes as part of the weight of basic blocks. r=tcampbell https://hg.mozilla.org/integration/mozilla-inbound/rev/be2075c60ec8 Branch Pruning: Pre-compute and cache the cost of basic blocks. r=tcampbell https://hg.mozilla.org/integration/mozilla-inbound/rev/e5a213c7b80e Flag unused Phis before flagging Phi inputs as having removed uses. r=tcampbell https://hg.mozilla.org/integration/mozilla-inbound/rev/35c735592ec2 Branch Pruning: Add pathological test case. r=nbp
Pushed by npierron@mozilla.com: https://hg.mozilla.org/integration/mozilla-inbound/rev/e3b8f85ad004 FlagUnusedPhi should not assert that Unused flag does not exists as it is now re-run multiple times. r=jandem
Backed out for asserting at js/src/jit/IonAnalysis.cpp:2835 while running js/src/jit/IonAnalysis.cpp:283: https://hg.mozilla.org/integration/mozilla-inbound/rev/801f38f144051335ce76152cb123cc8ea688e218 https://hg.mozilla.org/integration/mozilla-inbound/rev/16a30269573302a5724ba30bffff326844ed20a4 https://hg.mozilla.org/integration/mozilla-inbound/rev/59ac198566094b6394a8bdccff13ec2069b5bcdc https://hg.mozilla.org/integration/mozilla-inbound/rev/4d9598ac42647e2d58f6baabeb0cf321aab5e6a2 https://hg.mozilla.org/integration/mozilla-inbound/rev/c15951f74658fbd01d307cf973f1ebaa87ac7758 Push with failures: https://treeherder.mozilla.org/#/jobs?repo=mozilla-inbound&revision=e3b8f85ad0049a8ff809d1f5672182196f186d2c&filter-resultStatus=testfailed&filter-resultStatus=busted&filter-resultStatus=exception&filter-resultStatus=retry&filter-resultStatus=usercancel&filter-resultStatus=runnable Failure log assertion: https://treeherder.mozilla.org/logviewer.html#?job_id=126438203&repo=mozilla-inbound Assertion failure: !block->isMarked(), at /home/worker/workspace/build/src/js/src/jit/IonAnalysis.cpp:2835 TEST-UNEXPECTED-FAIL | js/src/jit-test/tests/gc/bug-1282986.js | Assertion failure: !block->isMarked(), at /home/worker/workspace/build/src/js/src/jit/IonAnalysis.cpp:2835 (code -11, args "--baseline-eager") [4.5 s] Failure log js/src/jit-test/tests/ion/branch-pruning-abuse.js: https://treeherder.mozilla.org/logviewer.html#?job_id=126438171&repo=mozilla-inbound TEST-UNEXPECTED-FAIL | js/src/jit-test/tests/ion/branch-pruning-abuse.js | Timeout (code -9, args "--ion-eager --ion-offthread-compile=off") [150.0 s] TEST-UNEXPECTED-FAIL | js/src/jit-test/tests/ion/branch-pruning-abuse.js | Timeout (code -9, args "--ion-eager --ion-offthread-compile=off --non-writable-jitcode --ion-check-range-analysis --ion-extra-checks --no-sse3 --no-threads") [150.0 s]
Flags: needinfo?(nicolas.b.pierron)
(In reply to Ted Campbell [:tcampbell] from comment #23) > > This is a lot clearer :) Nice compile time improvement too. Hum … apparently this is not the case, as I fixed the typo from comment 18. We seems to now be spending a lot of time under the justReplaceAllUsesWith, which is under FlagUnusedPhi. (~190 ms on the test case) Note, that the compilation time issue compared to v8, is that we have resume points for each block entry, and in this case we spam the graph with tons of resume point capturing hundreds of locals. As we add these resume points, we also add Phis, to simplify the graph construction. I will look if we have any simple way to avoid adding these Phis node eagerly in IonBuilder. I will look if there is a way to avoid generating Phis under IonBuilder. (In reply to Sebastian Hengst [:aryx][:archaeopteryx] (needinfo on intermittent or backout) from comment #27) > Failure log assertion: > https://treeherder.mozilla.org/logviewer.html#?job_id=126438203&repo=mozilla- > inbound > Assertion failure: !block->isMarked(), at > /home/worker/workspace/build/src/js/src/jit/IonAnalysis.cpp:2835 > TEST-UNEXPECTED-FAIL | js/src/jit-test/tests/gc/bug-1282986.js | Assertion > failure: !block->isMarked(), at > /home/worker/workspace/build/src/js/src/jit/IonAnalysis.cpp:2835 (code -11, > args "--baseline-eager") [4.5 s] Note, this is highly intermittent, and only emerged recently. I now managed to reproduce it under rr, and I will investigate.
Comment on attachment 8899562 [details] [diff] [review] Flag unused Phis before flagging phis as having removed uses. Review of attachment 8899562 [details] [diff] [review]: ----------------------------------------------------------------- ::: js/src/jit/IonAnalysis.cpp @@ +469,5 @@ > + Observability observability = graph.hasTryBlock() > + ? ConservativeObservability > + : AggressiveObservability; > + if (!FlagUnusedPhis(mir, graph, observability)) > + return true; self-nit: This should be "return false;" This caused the various intermittent "Assertion failures !block->isMarked()" failures under OOMTest function. @@ +2339,5 @@ > + Observability observability = graph.hasTryBlock() > + ? ConservativeObservability > + : AggressiveObservability; > + if (!FlagUnusedPhis(mir, graph, observability)) > + return true; self-nit: same here.
Depends on: 1394749
Keywords: perf
Priority: -- → P3
Whiteboard: [qf:p3]
Assignee: nicolas.b.pierron → nobody
Status: ASSIGNED → NEW
Flags: needinfo?(nicolas.b.pierron)
Priority: P3 → P5
Performance Impact: --- → P3
Whiteboard: [qf:p3]
Severity: normal → S3
Blocks: sm-jits
Severity: S3 → N/A
You need to log in before you can comment on or make changes to this bug.

Attachment

General

Created:
Updated:
Size: