Closed Bug 1229813 Opened 4 years ago Closed 3 years ago

Enable branch pruning by default.

Categories

(Core :: JavaScript Engine: JIT, defect)


Tracking


RESOLVED FIXED
mozilla50
Tracking Status
firefox45 --- affected
firefox50 --- fixed

People

(Reporter: nbp, Assigned: nbp)

References

(Blocks 1 open bug)

Details

Attachments

(3 files, 1 obsolete file)

No description provided.
As explained in Bug 1209515, branch pruning is a trade-off between memory
usage and speed improvements: it might consume a bit more memory, based
on the ratio of live (non-relazified) and Ion-compiled functions (Bug
1209515 comment 16).  In exchange, branch pruning can provide some noticeable
improvements on benchmarks and on for-of loops, by removing unused
branches. (Bug 1209515 comment 11)
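As a hypothetical illustration (not code from either bug), this is the kind of pattern where pruning a never-taken branch helps, since the error-handling path is never exercised by the observed inputs:

```javascript
// Hypothetical example: a branch that baseline hit counts would show as
// never taken, making it a pruning candidate for Ion.
function sum(arr) {
  let total = 0;
  for (const x of arr) {
    // If every observed call passes numeric arrays, this branch keeps a
    // zero hit count and can be replaced by a bailout, shrinking the graph.
    if (typeof x !== "number")
      throw new TypeError("expected a number");
    total += x;
  }
  return total;
}
```

If a non-numeric value ever shows up, the pruned branch bails out to Baseline instead of executing the throw directly.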

As a side-effect, this will force us to ensure that the code coverage
information used by the release management team is correct, as we would be
relying on it for optimization as well.
Attachment #8694796 - Flags: review?(jdemooij)
(In reply to Nicolas B. Pierron [:nbp] from comment #1)
> As explained in Bug 1209515, branche pruning is a trade-off between memory
> usage and speed improvements, which might consume a bit more memory, based
> on the ratio of live (non-relazified) and ion compiled functions.

I have just one question: how does this affect function relazification? Are there functions we'll no longer relazify?
(In reply to Jan de Mooij [:jandem] from comment #2)
> I have just one question: how does this affect function relazification? Are
> there functions we'll no longer relazify?

I do not see anything in the changes that were made which could change the relazification behaviour.  If there is, then this should be fixed.

The PCCount map is owned by the compartment for some legacy reasons, but the live-range of the PCCount data is less than the live-range of the JSScript.  So if we finalize a JSScript, we should also remove the memory used by the profiling mechanism.
Comment on attachment 8694796 [details] [diff] [review]
Enable branch pruning.

Review of attachment 8694796 [details] [diff] [review]:
-----------------------------------------------------------------

(In reply to Nicolas B. Pierron [:nbp] from comment #3)
> I do not see anything in the changes that were made which could change the
> relazification behaviour.  If there is, then this should be fixed.

Thanks, I think I misread the sentence in comment 0 mentioning non-relazified functions.

This is exciting, great work!
Attachment #8694796 - Flags: review?(jdemooij) → review+
This issue seems to be blocked on a few other issues.

I think that some of these other issues are pre-existing issues which are simply exposed by the branch removal driven by our code coverage counters.

https://treeherder.mozilla.org/#/jobs?repo=try&revision=a04447a27780
Depends on: 1233176
Depends on: 1233786
This is a nice win on some benchmarks (Kraken beat-detection for instance, some assorted tests), but also a 28% regression on Kraken crypto-aes, 136% on the trace-compiler in the assorted test suite, and some other assorted tests got slower as well. Hopefully there's a single issue causing most of these.
Flags: needinfo?(nicolas.b.pierron)
I've backed out this patch, as we have other major regressions on asmjs-µbench that we should investigate, but with the upcoming end-of-year, this is probably not the best time to fix that in the tree.

Also philor reported to me that some devtools issues might be related to this patch:
https://treeherder.mozilla.org/logviewer.html#?job_id=18948769&repo=mozilla-inbound

https://hg.mozilla.org/integration/mozilla-inbound/rev/24a27da1a369
Blocks: 1129313
(In reply to Jan de Mooij [:jandem] from comment #7)
> This is a nice win on some benchmarks (Kraken beat-detection for instance,
> some assorted tests), but also a 28% regression on Kraken crypto-aes, 136%
> on the trace-compiler in the assorted test suite, and some other assorted
> tests got slower as well. Hopefully there's a single issue causing most of
> these.

I looked at the trace-compiler, and I managed to turn its regression into a 20% improvement.
However, I did not manage to find any good heuristic to win crypto-aes back without losing the huge improvements on misc-bugs-1131099-underscore.

The issue comes from a branch which is taken ~40% of the time overall, but which is in fact not taken for the first >10k iterations, before execution switches to the other branch.  This causes us to bail out 10 times before starting a new compilation, while waiting in Baseline.

Also, each bailout is probably costly, as the function bails from the first loop and we might spin a bit in the second loop which follows the exit of the first loop.
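A hypothetical reduction of this pattern (the 10k switch point is an assumption for illustration, not measured from crypto-aes): the branch looks cold while the hit counts are collected, then suddenly becomes the common path.

```javascript
// A branch that is cold for the first >10k iterations, then becomes hot.
function phase(i) {
  return i < 10000 ? "setup" : "steady";  // illustrative switch point
}

function run(n) {
  let a = 0, b = 0;
  for (let i = 0; i < n; i++) {
    if (phase(i) === "setup")
      a++;  // the only branch seen while hit counts are collected
    else
      b++;  // pruned as "cold", then hit -> bailout, recompile, repeat
  }
  return [a, b];
}
```

When the pruned else-branch is finally reached, the script bails back to Baseline, and it takes several such bailouts before a new Ion compilation without the pruning is committed.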

I think we can solve this issue with 2 optimizations:
 1. Start a recompilation at the time of the first bailout, and only commit the result of the new compilation if we spent too much time in Baseline.  Unfortunately, this sounds like an issue which can only be solved properly with Bug 825268.

 2. Allow multiple OSR entry points, such that we can enter the second loop via OSR.
(In reply to Nicolas B. Pierron [:nbp] from comment #10)
> This causes us to bail 10 times before starting a new compilation, while waiting in baseline.

I know it's not ideal, but could we have a (maybe temporary) heuristic like: if we have a large-ish script and bailout/recompile (because we reached a cold branch) more than X times, we disable PGO for that script?

Also, usually we recompile anyway when we reach cold code, because of empty TI barriers etc. That's not happening for these branches?
(In reply to Jan de Mooij [:jandem] from comment #11)
> (In reply to Nicolas B. Pierron [:nbp] from comment #10)
> > This causes us to bail 10 times before starting a new compilation, while waiting in baseline.
> 
> I know it's not ideal, but could we have a (maybe temporary) heuristic like:
> if we have a large-ish script and bailout/recompile (because we reached a
> cold branch) more than X times, we disable PGO for that script?
> 
> Also, usually we recompile anyway when we reach cold code, because of empty
> TI barriers etc. That's not happening for these branches?

No, these branches contain math, thus TI does not have much to invalidate here.

Here is the information I have so far to feed into any heuristic:

For crypto-aes, the branch that we should not remove (~56ms -> ~70ms):
 - predecessor count: 6540
 - approx. number of instructions: 6
 - approx. number of dominated instructions: 12
 - number of instructions returning objects: 4
 - isLoopExit: false
 - isExitBlock: false

For the underscore benchmark, the branch that we should remove (~360ms -> ~150ms):
 - predecessor count: 160
 - approx. number of instructions: 5
 - approx. number of dominated instructions: 81
 - number of instructions returning objects: 0
 - isLoopExit: false
 - isExitBlock: false

So far, I really cannot find any logical heuristic which will accept the removal of the underscore benchmark branch while refusing the removal of the crypto-aes branch.

The logic implies that:
 - predecessor count: Higher implies less likely to enter the current branch.
 - approx. number of (dominated) instructions: higher implies that we get more benefit if we can get rid of it.
 - number of instructions returning objects: higher implies that we are more likely to invalidate the code if we bailout.
 - isLoopExit: Keep if true.
 - isExitBlock: Keep if true.

What I tried so far was addition/multiplication between "predecessor count" and "number of (dominated) instructions".  I have not yet tried squaring the number of dominated instructions, but this sounds like a dirty hack, as I cannot explain how squaring this number relates to instruction cache behaviour.
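Plugging the two data points listed above into candidate heuristics shows the difficulty; the metric values are the ones from this comment, while the threshold is purely an illustrative assumption:

```javascript
// Data points from this comment.
const cryptoAes  = { predCount: 6540, dominatedInst: 12 };  // must KEEP
const underscore = { predCount: 160,  dominatedInst: 81 };  // must REMOVE

// Candidate 1: plain product. It ranks crypto-aes (78480) as a *better*
// removal candidate than underscore (12960), which is backwards.
const product = b => b.predCount * b.dominatedInst;

// Candidate 2: square the dominated-instruction count (the "dirty hack"
// mentioned above). Now underscore (1049760) scores above crypto-aes
// (941760), so an illustrative threshold of 1e6 separates them.
const squared = b => b.predCount * b.dominatedInst ** 2;
```

The separation with the squared metric is narrow (~10%), which hints at why the eventual fix needed more metrics than these two.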
(In reply to Nicolas B. Pierron [:nbp] from comment #12)
> For crypto-aes, the branch that we should not remove (~56ms -> ~70ms):

Hm, 56ms -> 70ms is not a huge regression and if it's just this one, I could live with it, especially compared to the big wins we see on other tests. What's the effect on all of Kraken? Are the asm.js regressions similar?
The square relation on the number of instructions seems to work well for all benchmarks (kk-crypto-aes, misc-trace-compiler, octane-mandreel) except for misc-interpreter.

Forbidding branch pruning on case statements helps a bit, but makes the benchmark bi-modal (~250ms / ~7500ms).
I think the problem is that we made an ion-eager work-around at the beginning of IonMonkey, to avoid OSR re-entry after a bailout.  The problem is that the code from which we just bailed is still valid, and we should accept OSR-ing back into it after a few iterations in Baseline.
Ok, I think I found another issue which might explain why we have such a performance cliff.

The heuristics are only part of the cause of the performance cliff.

When we first compile the interpreter function, we compile the switch case with a few branches which still have 0-counters, but the surrounding hit counts are still low enough that we cannot tell whether the branch is indeed cold, or merely unlikely.  Thus we decide to keep it in Ion.

The problem is that once we keep it in Ion, we no longer increment the counters.  So when we happen to bail out for a different reason, such as a type-inference issue, we might still not visit this unlikely branch while back in Baseline.

Thus, the next Ion compilation will still consider this branch as unused, and as the context of this branch has become hotter than the branch itself, we will later remove this unlikely branch, thinking that it is unused.

We should add counters in Ion, and remove them once the blocks are considered hot enough that the counters would have a performance impact at runtime.  I will open another bug for this issue.
Depends on: 1242462
An attempt to increment the profile counters caused a 200% regression on misc-typedobj-utf8array-typedobj.js, because the remaining for loops are well typed, and incrementing the counters in Ion adds a lot of overhead, as the basic blocks are flagged as cold, but not cold enough to be removed.
With Bug 1242462, the current heuristics are the ones which avoided costly
bailouts on the benchmarks used on AWFY, and also gave an extra performance
boost on my machine while doing some manual tuning.

Note that due to the asynchronous nature of Ion compilation/execution, this
might lead to noisy results.  For example, changing the weird constant from
250k to 500k moves the misc-bugs-1131099-underscore benchmark back to its
original value, on my laptop.
Attachment #8711739 - Flags: review?(jdemooij)
(In reply to Jan de Mooij [:jandem] from comment #7)
> […] Hopefully there's a single issue causing most of these.

Indeed, this was a single big issue which is solved in Bug 1242462.
Flags: needinfo?(nicolas.b.pierron)
Comment on attachment 8711739 [details] [diff] [review]
Improve Branch Pruning heuristics.

Review of attachment 8711739 [details] [diff] [review]:
-----------------------------------------------------------------

Makes sense.

::: js/src/jit/IonAnalysis.cpp
@@ +311,5 @@
>                  MBasicBlock* pred = block->getPredecessor(p);
>                  if (pred->getHitState() == MBasicBlock::HitState::Count)
>                      predCount += pred->getHitCount();
>                  isLoopExit |= pred->isLoopHeader() && pred->backedge() != *block;
> +                isCaseStatement |= pred->numSuccessors() > 2;

We may still want to prune branches if the number of cases (numSuccessors) is small (< 8 or so). Follow-up bug to investigate?

@@ +323,5 @@
> +            int numInOutEdges = block->numPredecessors();
> +            ReversePostorderIterator it(block);
> +            do {
> +                for (MDefinitionIterator def(*it); def; def++)
> +                    numDominatedInst++;

So comparing the id() of the first and last instruction in the block, as we did before, was wrong? If it works maybe we can use that and assert in debug builds that it's correct.

@@ +328,2 @@
>  
> +                numInOutEdges += it->numSuccessors() - it->numPredecessors();

numSuccessors and numPredecessors are size_t, so the subtraction will underflow if there are more predecessors than successors. Casting to int is a bit more explicit:

int(it->numSuccessors()) - int(it->numPredecessors())
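To see the wraparound concretely, the same arithmetic can be emulated in JavaScript, where `>>> 0` coerces a value to unsigned 32-bit (an illustration of the C++ `size_t` pitfall, not code from the patch):

```javascript
// size_t is unsigned, so successors - predecessors wraps around to a huge
// value when predecessors is larger, instead of going negative.
const successors = 1, predecessors = 3;

const asUnsigned = (successors - predecessors) >>> 0;  // wraps: 4294967294
const asSigned   = (successors - predecessors) | 0;    // what we want: -2
```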

@@ +328,4 @@
>  
> +                numInOutEdges += it->numSuccessors() - it->numPredecessors();
> +                it++;
> +            } while(numInOutEdges > 0 && it != graph.rpoEnd());

Nit: space after 'while'.

@@ +356,1 @@
>                  shouldBailout = false;

Does this mean we will not eliminate the else-branch here:

  if (alwaysTrue())
      x = y;
  else
      f();

Both predCount and numDominatedInst will be very small. We don't increment predCount while we're in Ion right?

I was hoping we could prune the else branch, because having that f() call will block LICM, etc.
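A sketch of that concern (hypothetical JS; `alwaysTrue` is passed as a parameter to stand in for a condition the compiler cannot fold away):

```javascript
// While the cold else-branch stays in the graph, the call inside it keeps
// LICM from hoisting the loop-invariant computation out of the loop.
function hot(arr, y, alwaysTrue) {
  let x = 0;
  for (let i = 0; i < arr.length; i++) {
    if (alwaysTrue)
      x = y * 2;      // loop-invariant: hoistable once the call is pruned
    else
      sideEffect();   // a call Ion cannot move; never executed here
    arr[i] += x;
  }
  return arr;
}
```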
Attachment #8711739 - Flags: review?(jdemooij) → review+
Would it be possible to start fuzzing the JS Shell with the new command line argument "--ion-pgo=on"?
Flags: needinfo?(gary)
Flags: needinfo?(choller)
I've been fuzzing this for a month now and found a few bugs. Just let us know if the flag is removed or turned on by default :)
Flags: needinfo?(choller)
(In reply to Christian Holler (:decoder) from comment #22)
> I've been fuzzing this for a month now and found a few bugs. Just let us
> know if the flag is removed or turned on by default :)

Thanks for fuzzing, I will fix the bugs and turn it on by default after that :)
Depends on: 1257929
Depends on: 1261826
Depends on: 1263645
Attachment #8753350 - Flags: review?(hv1989)
Comment on attachment 8753350 [details] [review]
Add branch pruning config to awfy.

Running every 12h on the shells currently.
Attachment #8753350 - Flags: review?(hv1989) → review+
And if nothing goes wrong, it is also running on the win 8 machine. That one sadly has no queue yet, so it runs every time.
https://arewefastyet.com/#machine=17
Keywords: leave-open
(In reply to Jan de Mooij [:jandem] from comment #19)
> ::: js/src/jit/IonAnalysis.cpp
> @@ +311,5 @@
> >                  MBasicBlock* pred = block->getPredecessor(p);
> >                  if (pred->getHitState() == MBasicBlock::HitState::Count)
> >                      predCount += pred->getHitCount();
> >                  isLoopExit |= pred->isLoopHeader() && pred->backedge() != *block;
> > +                isCaseStatement |= pred->numSuccessors() > 2;
> 
> We may still want to prune branches if the number of cases (numSuccessors)
> is small (< 8 or so). Follow-up bug to investigate?

I am making a patch to test various configurations of the heuristics without having to recompile.  I will try to find interesting configurations with some auto-tuner.

> @@ +323,5 @@
> > +            int numInOutEdges = block->numPredecessors();
> > +            ReversePostorderIterator it(block);
> > +            do {
> > +                for (MDefinitionIterator def(*it); def; def++)
> > +                    numDominatedInst++;
> 
> So comparing the id() of the first and last instruction in the block, as we
> did before, was wrong? If it works maybe we can use that and assert in debug
> builds that it's correct.

Comparing the ids can give incorrect values, because we depend on the order of creation of the instructions, and some blocks are filled with tons of instructions where we only add/replace the last instruction later, thus leading to incorrect results.  Also, the way our type inference works, we can cycle multiple times in IonBuilder, creating large id() gaps across basic blocks.

> @@ +356,1 @@
> >                  shouldBailout = false;
> 
> Does this mean we will not eliminate the else-branch here:
> 
>   if (alwaysTrue())
>       x = y;
>   else
>       f();
> 
> Both predCount and numDominatedInst will be very small. We don't increment
> predCount while we're in Ion right?
> 
> I was hoping we could prune the else branch, because having that f() call
> will block LICM, etc.

Yes, this was the intent, and the current heuristic is unfortunately the one which works best on all benchmarks.  I tried not removing the branch if there is a call or .apply, but this does not seem to please Octane at the moment.  I will investigate more later.
Pushed by npierron@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/0ace95ea7f4c
Improve Branch Pruning heuristics. r=jandem
Attachment #8711739 - Flags: checkin+
This improves the heuristics by adding an arbitrary scoring system, which
has weights to deal with various aspects such as the predecessor count,
the number of dominated instructions, the number of dominated blocks,
and the number of effectful instructions.

Above this new scoring system, I added a big comment to explain the logic
behind it, such that we can later add more metrics if needed. I tried my
best to make it both simple and almost logical.

For the moment, this set of metrics seems to be enough to not regress
kraken-crypto-aes, while maintaining other improvements on for-of, and the
underscore benchmark.

I had to update one of the test cases to let it run a higher number of times,
to ensure that we have enough predecessor hits to safely remove the branch.

Different values can be tested with the JIT_OPTION_branchPruning*
environment variables.
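The general shape of such a weighted scoring can be sketched as follows; the field names, weights, and threshold are hypothetical stand-ins for the tunable JIT_OPTION_branchPruning* parameters, not the values landed in the patch:

```javascript
// Hypothetical sketch of a weighted branch-pruning score over the metrics
// listed above; higher score = more confidence the branch can be removed.
function branchScore(block, weights) {
  return weights.hitCount        * block.predHitCount
       + weights.dominatedInst   * block.numDominatedInst
       + weights.dominatedBlocks * block.numDominatedBlocks
       + weights.effectful       * block.numEffectfulInst;
}

// A branch is pruned when its score exceeds a tunable threshold.
function shouldPrune(block, weights, threshold) {
  return branchScore(block, weights) > threshold;
}
```

Exposing the weights and threshold as options is what makes the auto-tuning mentioned earlier possible without recompiling the shell.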

About the example:

  if (alwaysTrue())
      x = y;
  else
      f()

In such a case, the heuristic would be able to remove the else branch even in
small functions (which are compiled rapidly).  On the other hand,

  if (alwaysFalse())
      x = y;
  else
      f()

might need a higher number of iterations to remove the then-part if x is a
local variable, because the assignment would not appear as an effectful instruction.
Attachment #8765012 - Flags: review?(jdemooij)
Attachment #8711739 - Attachment is obsolete: true
Comment on attachment 8765012 [details] [diff] [review]
Improve Branch Pruning heuristics.

Review of attachment 8765012 [details] [diff] [review]:
-----------------------------------------------------------------

Thanks for looking into the regression! A lot of heuristics, but we can always tweak later if needed.

::: js/src/jit/IonAnalysis.cpp
@@ +327,5 @@
> +            int numInOutEdges = block->numPredecessors();
> +            size_t branchSpan = 0;
> +            ReversePostorderIterator it(block);
> +            do {
> +                // Iterates over dominated blocks, and visit exit block as well.

Nit: s/Iterates/Iterate/, exit blocks

@@ +350,5 @@
> +            // be costly if we were to bailout. The following heuristics are
> +            // made to prevent bailouts in branches when we estimate that the
> +            // confidence is not enough to compensate for the cost of a bailout.
> +            //
> +            //   1. Confidences for removal varies with the number of hit counts

Nit: s/Confidences/Confidence/ here and below

@@ +369,4 @@
>              //
> +            //   4. Confidences for removal varies with the number of effectful
> +            //      instructions. The reason being that an effectful instruction
> +            //      can remove optimizations opportunities based on Scalar

Nit: s/optimizations/optimization/

@@ +401,5 @@
>              if (isLoopExit)
>                  shouldBailout = false;
> +
> +            // Interpreters are often implemented as a table switch within a for
> +            // loop. What might happen is that the interpreter heat up in a

Nit: s/heat/heats/
Attachment #8765012 - Flags: review?(jdemooij) → review+
Pushed by npierron@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/6e994818e8f6
Improve Branch Pruning heuristics. r=jandem
Flags: needinfo?(nicolas.b.pierron)
Keywords: leave-open
Depends on: 1282822
As a summary:

Regressions
misc 0.7 bugs-612930-solve-sudoku linux32 opt shell     4.05%
misc 0.7 bugs-663087-decompress-gzip linux32 opt shell  4.40%
misc 0.7 bugs-847389-jpeg2000 linux64 opt shell         2.11%
octane 2.0.1 DeltaBlue linux32 opt shell                2.19%
ss 1.0.1 aes linux32 opt shell                          2.25%

Improvements: 
dart 0.1 FluidMotion linux32 opt shell                 14.30%
dart 0.1 summary linux32 opt shell                     13.51%
jetstream 1.0 box2d linux64 opt                         2.29%
jetstream 1.0 richards linux64 opt                      7.92%
jetstream 1.0 richards windows10-64 opt                 4.77%
jetstream 1.0 typescript windows10-64 opt               5.39%
kraken 1.1 beat-detection linux32 opt shell             7.20%
kraken 1.1 beat-detection linux64 opt shell             8.31%
kraken 1.1 crypto-aes linux32 opt shell                 5.11%
kraken 1.1 crypto-aes linux64 opt shell                 5.12%
kraken 1.1 crypto-ccm linux32 opt shell                 2.51%
kraken 1.1 stanford-crypto-aes windows10-64 opt         3.37%
misc 0.7 basic-array-forof linux32 opt shell           48.85%
misc 0.7 basic-array-forof linux64 opt shell           51.55%
misc 0.7 bugs-1131099-lodash1 linux32 opt shell         8.10%
misc 0.7 bugs-1131099-lodash1 linux64 opt shell         8.34%
misc 0.7 bugs-1131099-lodash2 linux32 opt shell        10.09%
misc 0.7 bugs-1131099-lodash2 linux64 opt shell        12.23%
misc 0.7 bugs-1131099-underscore linux32 opt shell     48.28%
misc 0.7 bugs-1131099-underscore linux64 opt shell     48.78%
misc 0.7 bugs-608733-trace-compiler linux32 opt shell  20.83%
misc 0.7 bugs-608733-trace-compiler linux64 opt shell  20.31%
misc 0.7 bugs-654410-nezulator linux32 opt shell        2.02%
misc 0.7 map-iterator linux32 opt shell                20.23%
misc 0.7 map-iterator linux64 opt shell                20.98%
octane 2.0.1 Richards linux32 opt shell                 2.48%
octane 2.0.1 Richards linux64 opt                       7.78%
octane 2.0.1 Richards linux64 opt shell                 4.28%
octane 2.0.1 Richards windows10-64 opt                  1.76%
webaudio 0.2 Simple mixing with gains linux64 opt       4.13%

An up to date list is visible at
https://treeherder.allizom.org/perf.html#/alerts?id=2800
https://hg.mozilla.org/mozilla-central/rev/2c0f213c8eec
Status: ASSIGNED → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla50
Depends on: 1304569