Closed Bug 1209515 Opened 4 years ago Closed 4 years ago

Attempt to use Code Coverage for pruning branches in IonMonkey.

Categories

(Core :: JavaScript Engine: JIT, defect)


Tracking


RESOLVED FIXED
mozilla45
Tracking Status
firefox44 --- affected
firefox45 --- fixed

People

(Reporter: nbp, Assigned: nbp)

References

Details

Attachments

(12 files, 4 obsolete files)

2.66 KB, patch
bbouvier
: review+
Details | Diff | Splinter Review
3.41 KB, patch
bhackett1024
: review+
Details | Diff | Splinter Review
3.39 KB, patch
jandem
: review+
Details | Diff | Splinter Review
16.12 KB, patch
bhackett1024
: review+
Details | Diff | Splinter Review
4.96 KB, patch
bhackett1024
: review+
Details | Diff | Splinter Review
13.70 KB, patch
Details | Diff | Splinter Review
28.67 KB, patch
bhackett1024
: review+
Details | Diff | Splinter Review
3.56 KB, patch
jandem
: review+
Details | Diff | Splinter Review
5.35 KB, patch
jandem
: review+
Details | Diff | Splinter Review
2.63 KB, patch
jandem
: review+
Details | Diff | Splinter Review
4.00 KB, patch
bhackett1024
: review+
Details | Diff | Splinter Review
1.68 KB, patch
bbouvier
: review+
Details | Diff | Splinter Review
Now that Code Coverage is working fine in the Interpreter and in Baseline, we can make a second attempt at removing unused branches, as previously tried by Wei Wu in Bug 877872.

Also, removing the branch before Scalar Replacement should bring a 33% improvement on the for-of micro benchmark [1], as previously achieved by GVN in Bug 1165348, but this time without any of the overhead related to re-looping compiler phases.

[1] https://arewefastyet.com/#machine=28&view=single&suite=misc&subtest=basic-array-forof&start=1433992864&end=1434080570
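To make the target concrete, here is a hedged JS sketch of the kind of for-of micro benchmark this refers to (`sumForOf` is a hypothetical name, not the actual benchmark code): each iteration allocates an iterator-result object in the iterator's next(), and once the never-taken "done" side-exit branch is pruned, Scalar Replacement can remove that allocation.

```javascript
// Hypothetical reduction of a for-of micro benchmark: each loop step
// calls the Array iterator's next(), which allocates a {value, done}
// result object that Scalar Replacement can sink once the cold
// side-exit branch is pruned.
function sumForOf(arr) {
  var sum = 0;
  for (var x of arr)
    sum += x;
  return sum;
}

var data = [];
for (var i = 0; i < 1000; i++)
  data.push(i);
console.log(sumForOf(data)); // 499500
```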
With the current set of patches (part 0 to 5), running octane 997 times:

                  No PGO     With PGO
       Richards:   34683 ->   37175  7.2%  (++)
      DeltaBlue:   67346 ->   67583  0.4%
         Crypto:   32423 ->   32504  0.2%
       RayTrace:  115766 ->  112063  -3.2% (-)
    EarleyBoyer:   35002 ->   34644  -1.0% (-)
         RegExp:    4043 ->    4027  -0.4%
          Splay:   21158 ->   21202  0.2%
   SplayLatency:   23876 ->   23648  -1.0% (-)
   NavierStokes:   40103 ->   40013  -0.2%
          PdfJS:   20711 ->   20654  -0.3%
       Mandreel:   32616 ->   36339  11.4% (++)
MandreelLatency:   41156 ->   40076  -2.6% (-)
        Gameboy:   67691 ->   67690  -0.0% (whoa this is precise)
       CodeLoad:   20648 ->   19954  -3.4% (-)
          Box2D:   56376 ->   56558  0.3%
           zlib:   89513 ->   89535  0.0%
     Typescript:   35278 ->   36306  2.9%  (+)
          Score:   34944 ->   35143  0.6%  (+)

I guess CodeLoad is affected by this modification, as it implies collecting code coverage information while running the Interpreter and Baseline.

I will investigate Raytrace in case there is something obvious.

Also, the modifications to the test cases made in part 3 highlight that we should now get the 33% improvement on for-of loops (mentioned in comment 0) by removing the corner-case branch which does a side-exit.  This lets Scalar Replacement remove the object allocation created by the next function of the Array iterator.
The following corresponds to a little instrumented build that I made on top of the previous patches, which reports:
 - blocks: the total number of basic blocks produced by IonBuilder.
 - pruned: the number of basic blocks which either got removed or whose content got discarded.
 - %: the ratio of pruned branches over the number of blocks.
 - bailouts: the number of first-execution bailouts (produced by this optimization) which are taken.

Richards:     (blocks: 1350,   pruned: 54    (  3%), bailouts: 2)
DeltaBlue:    (blocks: 3238,   pruned: 77    (  2%), bailouts: 0)
Crypto:       (blocks: 6103,   pruned: 2310  ( 37%), bailouts: 22)
RayTrace:     (blocks: 3022,   pruned: 61    (  2%), bailouts: 0)
EarleyBoyer:  (blocks: 4705,   pruned: 1054  ( 22%), bailouts: 1)
RegExp:       (blocks: 406,    pruned: 116   ( 28%), bailouts: 2)
Splay:        (blocks: 1993,   pruned: 204   ( 10%), bailouts: 12)
NavierStokes: (blocks: 1287,   pruned: 169   ( 13%), bailouts: 4)
PdfJS:        (blocks: 11904,  pruned: 3831  ( 32%), bailouts: 16)
Mandreel:     (blocks: 7164,   pruned: 4984  ( 69%), bailouts: 31)
Gameboy:      (blocks: 16995,  pruned: 9940  ( 58%), bailouts: 59)
CodeLoad:     (blocks: 59,     pruned: 9     ( 15%), bailouts: 0)
Box2D:        (blocks: 12317,  pruned: 1699  ( 13%), bailouts: 47)
zlib:         (blocks: 43,     pruned: 4     (  9%), bailouts: 0)
Typescript:   (blocks: 107399, pruned: 28070 ( 26%), bailouts: 67)

Note that the current implementation only causes a recompilation after 10 bailouts of the same IonScript, even for Bailout_FirstExecution, which means that EarleyBoyer will bail out but not yet recompile the function.

We see that the RayTrace / CodeLoad benchmarks have no bailouts, which would mean that the slowdown observed in the previous comment is not caused by the code produced by this optimization but by the instrumentation which feeds this optimization phase.  Thus we should probably look into improving the collection of code coverage in Baseline.
(In reply to Nicolas B. Pierron [:nbp] from comment #8)
> I will investigate Raytrace in case there is something obvious.

After some investigation with "perf" on RayTrace, the Baseline profiles seem similar, which suggests that the overhead of incrementing code coverage counters is insignificant.  On the other hand, I noticed that raytrace.js:682:19 (Block14) is now taking a larger part of the profile, jumping from 0.6% to 1.6%.

One explanation would be that the code added in part 3 (ConvertToBailingBlock), which flags all resume point operands as having removed uses, is too conservative.  I took this conservative approach to avoid iterating over all instructions' operands of all removed basic blocks.  I will check if I can take a more precise approach to this problem.
(In reply to Nicolas B. Pierron [:nbp] from comment #10)
> One explanation would be that the code added in part 3
> (ConvertToBailingBlock) which is flagging all resume point operands has
> having removed uses is too conservative.  I took this conservative approach
> to avoid iterating over all instructions' operands of all removed basic
> blocks.  I will check if I can take a precise approach to this problem.

Ok, after lots of issues & rewrites to ensure I have the right set of less-conservative rules for marking removed uses, I finally got some encouraging results.

With the current set of patches (part 0 to 6(¹)), running octane 930 times:

                  No PGO     With PGO
       Richards:   34652 ->   37175  7.3%   (++)
      DeltaBlue:   67232 ->   67737  0.8%   (~+)
         Crypto:   32377 ->   32871  1.5%   (+)
       RayTrace:  115413 ->  114950  -0.4%
    EarleyBoyer:   34630 ->   34534  -0.3%
         RegExp:    4030 ->    4007  -0.6%  (~-)
          Splay:   21092 ->   20971  -0.6%  (~-)
   SplayLatency:   23917 ->   23752  -0.7%  (~-)
   NavierStokes:   40104 ->   39989  -0.3%
          PdfJS:   20361 ->   20786  2.1%   (+)
       Mandreel:   32436 ->   35749  10.2%  (++)
MandreelLatency:   40464 ->   41345  2.2%   (+)
        Gameboy:   67114 ->   67514  0.6%   (~+)
       CodeLoad:   20507 ->   20294  -1.0%  (-)
          Box2D:   56158 ->   57087  1.7%   (+)
           zlib:   89239 ->   89206  -0.0%
     Typescript:   34540 ->   36118  4.6%   (+)
          Score:   34728 ->   35266  1.5%   (++)

These less-conservative rules should not change the number of blocks / pruned branches / bailouts reported in comment 9.  These rules were only made to prevent other optimizations from optimizing out some instructions if we ever remove a use from the graph, such that we keep them in their original state for bailouts.

(¹) I will attach the patch for part 6 as soon as I have improved a few comments and fixed a few other patches attached here.
Delta:
 - Fix an issue when accessing getNearestPCCounts near the end of the script.
Attachment #8672751 - Attachment is obsolete: true
Delta:
 - Change the removal of blocks to be more like an analysis / transformation
 phase, and make the RemoveUses flagging precise. (With some nice ASCII art,
 because this is way easier to visualize than to read.)
Attachment #8672753 - Attachment is obsolete: true
Delta:
 - Check that we are effectively collecting results for the current script
 before any attempt to increment any counters.
Attachment #8673566 - Attachment is obsolete: true
Previously, counting code coverage in the interpreter required making use of
the interruptible flag, which basically did debug checks between every
instruction.

This change instruments the interpreter's macros to only increment counters
at branch locations.
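As an illustration only (a toy interpreter sketch, not the SpiderMonkey macros; all opcode names here are made up), counting only at branch opcodes instead of on every instruction looks roughly like:

```javascript
// Toy bytecode interpreter: coverage counters are bumped only when a
// branch-like opcode executes, instead of between every instruction.
function run(code, counts) {
  var pc = 0, stack = [];
  for (;;) {
    var op = code[pc];
    switch (op.kind) {
      case "push":
        stack.push(op.value);
        pc++;
        break;
      case "goto":
        // Branch location: the only place we touch the counters.
        counts[op.target] = (counts[op.target] || 0) + 1;
        pc = op.target;
        break;
      case "stop":
        return stack.pop();
    }
  }
}

var counts = {};
var result = run([
  {kind: "goto", target: 1},
  {kind: "push", value: 7},
  {kind: "stop"},
], counts);
console.log(result, counts); // 7 { '1': 1 }
```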
I did a bit of instrumentation (JitCode, IonScript, PCCounts) to approximate the memory usage, knowing that this optimization relies on code coverage instrumentation.  Here are the numbers from running the Octane benchmark (¹):

               No PGO           With PGO

ionScript:
  count:        1808               1884
  size:      8338096            6517784
  mean:         4611               3459
jitCode:
  count:       45130              47839
  size:     32441336           30991584
  mean:          718                647
pcCounts:
  count:           0              15671
  size:            0            2618768
  mean:            0                167

  count (the number of constructor called)
  size (the size of the allocation)
  mean (= size / count)

Note that the interesting part is not the sum: the version running with branch pruning enabled effectively compiles more frequently, but it compiles smaller code, as it is able to remove branches.

What is interesting is that this gives an overview of the number of bytes consumed per JSScript (167 bytes), which has to be compared with the delta of memory ""saved"" (²) by enabling this optimization (= 1223 = (4611 - 3459) + (718 - 647)).  This means that we are effectively saving memory if we Ion-compile at least 13.6% of the non-relazified JSScripts.

(¹) Octane is not a memory benchmark, but I used it to provide an approximation of the memory consumed for various data structures.

(²) Yes, ""saved"".  I was the first to be surprised, as I did not consider the shrinkage of the compiled code before comment 9.
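A quick sanity check of the arithmetic above (the numbers are the Octane approximations from this comment, so treat the result as a rough estimate):

```javascript
// Per-script cost of the coverage data vs. per-compile memory saved,
// using the measured mean sizes reported above.
var pcCountsPerScript = 167;                 // bytes of PCCounts per JSScript
var savedPerIonCompile = (4611 - 3459)       // smaller IonScript allocations
                       + (718 - 647);        // smaller JitCode allocations
console.log(savedPerIonCompile);             // 1223
// Break-even fraction of scripts that must be Ion-compiled:
console.log(pcCountsPerScript / savedPerIonCompile); // ~0.136, i.e. ~13.6%
```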
Attachment #8672750 - Flags: review?(benj)
Comment on attachment 8681243 [details] [diff] [review]
part 1 - IonBuilder: Attach hit counts on the MIRGraph.

Review of attachment 8681243 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/jit/MIRGraph.h
@@ +610,5 @@
> +    }
> +    uint64_t getHitCount() const {
> +        return hitCount_;
> +    }
> +    enum HitState { HS_NotDefined, HS_Count, HS_Frequency };

Note for the reviewer: Currently HS_Frequency is not used, but I intend to use it after converting the absolute counts into frequencies of visits.  The idea being that we can then tell the register allocator about cold/hot branches.
Attachment #8681243 - Flags: review?(bhackett1024)
Comment on attachment 8672752 [details] [diff] [review]
part 2 - Ensure that MPhi removal considers removed uses.

Review of attachment 8672752 [details] [diff] [review]:
-----------------------------------------------------------------

As PruneBranches comes before the MPhi removal, we need to annotate an MPhi if we removed one of its uses, and as we remove basic blocks, we need to propagate this information to the inputs of the removed phi instructions.
Attachment #8672752 - Flags: review?(bhackett1024)
Attachment #8672750 - Flags: review?(benj) → review+
This optimization (branch pruning) does almost the same as PGO in static
compilers, except that instead of compiling the unused branches in
out-of-line code, we just get rid of the code and bail out to Baseline if we
ever reach such code.

This optimization does not provide much speed-up alone, but it enables the
rest of the compiler to do cleverer optimizations such as Scalar Replacement
and Range Analysis.
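For illustration, a hedged JS sketch of the kind of code this helps (`checkedDiv` is a hypothetical example, not from the patch): the error path is never executed while profiling, so its basic blocks have a zero hit count and can be replaced by a bailout to Baseline.

```javascript
// The throwing branch is never taken during warm-up, so its hit count
// stays at zero and branch pruning can replace it with a bailout,
// shrinking the Ion-compiled code.
function checkedDiv(a, b) {
  if (b === 0)                      // cold branch: candidate for pruning
    throw new Error("division by zero");
  return a / b;
}

var acc = 0;
for (var i = 1; i <= 10000; i++)    // warm up with the hot path only
  acc += checkedDiv(i, 2);
console.log(acc); // 25002500
```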

Delta:
 - Improve comments.
 - rename a few functions.
Attachment #8681368 - Flags: review?(bhackett1024)
Attachment #8681244 - Attachment is obsolete: true
Attachment #8681245 - Flags: review?(bhackett1024)
Attachment #8673567 - Flags: review?(jdemooij)
Attachment #8681249 - Flags: review?(bhackett1024)
(In reply to Nicolas B. Pierron [:nbp] from comment #17)
> > +    enum HitState { HS_NotDefined, HS_Count, HS_Frequency };

Drive-by comment: we should use enum class everywhere instead of having these prefixes.
Attachment #8673567 - Flags: review?(jdemooij) → review+
Comment on attachment 8681243 [details] [diff] [review]
part 1 - IonBuilder: Attach hit counts on the MIRGraph.

Review of attachment 8681243 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/jit/MIRGraph.h
@@ +608,5 @@
> +        hitCount_ = count;
> +        hitState_ = HS_Count;
> +    }
> +    uint64_t getHitCount() const {
> +        return hitCount_;

Assert(hitState_ == HS_Count)?

@@ +610,5 @@
> +    }
> +    uint64_t getHitCount() const {
> +        return hitCount_;
> +    }
> +    enum HitState { HS_NotDefined, HS_Count, HS_Frequency };

ditto re: Jan about the 'enum' thing.  It's also not obvious what these enum values mean and how they relate to hitCount_, so some comments would be nice.

Maybe it will be clear later, but will there ever be any time when we'll be putting different types of data on different MIR blocks?  If not then it might be clearer to remove HitState so that there can be a consistent interpretation for the count information associated with the blocks.

@@ +679,5 @@
>  
>      BytecodeSite* trackedSite_;
>  
> +    // Record the number of times a block got visited. Note, due to inlined
> +    // scripts these numbers might not continuous.

might not be

::: js/src/jsscript.cpp
@@ +1415,5 @@
>      return elem;
>  }
>  
>  const js::PCCounts*
> +ScriptCounts::getNearestPCCounts(size_t offset) const

What is this function trying to do?  It seems from its behavior and from its use in getHitCount that it's supposed to find the count entry that is either at offset or at the closest point preceding it, which is different from what "nearest" means.

Maybe name this getImmediatePrecedingPCCounts?  This should also have a comment that because IIRC there are PC counts after jumps and at jump targets, the result counts will be in the same basic block as the supplied pc.

@@ +1438,5 @@
>      return elem;
>  }
>  
> +const js::PCCounts*
> +ScriptCounts::getNearestThrowCounts(size_t offset) const

Ditto re: renaming getNearestPCCounts, though the comment there would not apply here.
Attachment #8681243 - Flags: review?(bhackett1024) → review+
Attachment #8672752 - Flags: review?(bhackett1024) → review+
Comment on attachment 8681368 [details] [diff] [review]
part 3 - IonMonkey: Add branch pruning based on code coverage counters.

Review of attachment 8681368 [details] [diff] [review]:
-----------------------------------------------------------------

Nice!

::: js/src/jit/IonAnalysis.cpp
@@ +313,5 @@
> +            block->markUnchecked();
> +        }
> +
> +        // When removing a loop header, we should ensure that its backedge is
> +        // removed first, otherwise this trigger an assertion in

triggers
Attachment #8681368 - Flags: review?(bhackett1024) → review+
Comment on attachment 8681245 [details] [diff] [review]
part 4 - Increment code coverage counters on bailouts.

Review of attachment 8681245 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/jit/BaselineBailouts.cpp
@@ +793,5 @@
>  
> +    // When pgo is enabled, increment the counter of the block in which we
> +    // resume, as Ion does not keep track of the code coverage.
> +    if (!js_JitOptions.disablePgo && script->hasScriptCounts())
> +        script->incHitCount(pc);

Why is this necessary?  When Ion is running the basic block counts are underapproximations anyways, right?
(In reply to Brian Hackett (:bhackett) from comment #23)
> Comment on attachment 8681245 [details] [diff] [review]
> part 4 - Increment code coverage counters on bailouts.
> 
> Review of attachment 8681245 [details] [diff] [review]:
> -----------------------------------------------------------------
> 
> ::: js/src/jit/BaselineBailouts.cpp
> @@ +793,5 @@
> >  
> > +    // When pgo is enabled, increment the counter of the block in which we
> > +    // resume, as Ion does not keep track of the code coverage.
> > +    if (!js_JitOptions.disablePgo && script->hasScriptCounts())
> > +        script->incHitCount(pc);
> 
> Why is this necessary?  When Ion is running the basic block counts are
> underapproximations anyways, right?

This is needed because if we only visit this branch after a bailout caused by the branch pruning, then we might never count this branch and recompile without it.  Incrementing the preceding jump target solves this issue, as the next time we re-compile, the corresponding basic block will have a non-zero count.
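A hedged JS illustration of the scenario (`maybeLog` is a made-up function): the negative branch is never taken during warm-up, so it gets pruned; when a later call finally hits it we bail out, and without bumping the hit count at the resume pc, the recompiled code would prune the branch and bail out again.

```javascript
function maybeLog(x) {
  if (x < 0)          // never taken during warm-up, so Ion prunes it
    return -x;        // reaching this triggers a first-execution bailout
  return x;
}

for (var i = 0; i < 10000; i++)
  maybeLog(i);        // warm up: the pruned branch keeps a zero count

console.log(maybeLog(-5)); // 5  (the bailout path; its counter must be bumped)
```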
Comment on attachment 8681249 [details] [diff] [review]
part 6 - Interpreter: Instrument JOF_JUMP opcodes instead of using interpreter's interrupts.

Review of attachment 8681249 [details] [diff] [review]:
-----------------------------------------------------------------

This doesn't seem equivalent to the interrupt method.  Say we have a branch in the bytecode sequence like:

(0) ... 
(1) ifeq 3
(2) ...
(3) ...

If we have PC counts at the start of all basic blocks then all but (1) here will have counts.  With interrupts we'll update counts at all those opcodes no matter the control flow, but with this patch AFAICS if the branch at (1) is not taken then counts will be updated at (2) but not (3), since there is no explicit branch instruction associated with the fallthrough from (2) to (3).

Is this understanding correct?
Comment on attachment 8681245 [details] [diff] [review]
part 4 - Increment code coverage counters on bailouts.

Review of attachment 8681245 [details] [diff] [review]:
-----------------------------------------------------------------

OK, though there should be a comment describing why this is necessary and how it interacts with the FirstExecution bailout.
Attachment #8681245 - Flags: review?(bhackett1024) → review+
(In reply to Brian Hackett (:bhackett) from comment #25)
> This doesn't seem equivalent to the interrupt method.  Say we have a branch
> in the bytecode sequence like:
> 
> (0) ... 
> (1) ifeq 3
> (2) ...
> (3) ...
> 
> If we have PC counts at the start of all basic blocks then all but (1) here
> will have counts.  With interrupts we'll update counts at all those opcodes
> no matter the control flow, but with this patch AFAICS if the branch at (1)
> is not taken then counts will be updated at (2) but not (3), since there is
> no explicit branch instruction associated with the fallthrough from (2) to
> (3).
> 
> Is this understanding correct?

Hum, yes … I did not consider this case.  I will add a test case for the code coverage, and maybe insert an opcode for a no-op jump instruction to indicate such a fallthrough.
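A minimal JS reproduction of the missed fallthrough might look like this (a sketch of a possible test case, not the one that landed):

```javascript
// When x is truthy, control falls through from the then-block to the
// join point without executing any jump opcode, so jump-only
// instrumentation would miss the join point's counter.
function f(x) {
  var r = 0;    // (0)
  if (x)        // (1) ifeq: jumps past the then-block when x is falsy
    r = 1;      // (2) then-block; falls through to (3) with no jump
  return r;     // (3) join point
}
console.log(f(true), f(false)); // 1 0
```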
Attachment #8681249 - Flags: review?(bhackett1024)
(In reply to Brian Hackett (:bhackett) from comment #21)
> > +    enum HitState { HS_NotDefined, HS_Count, HS_Frequency };
> 
> ditto re: Jan about the 'enum' thing.  It's also not obvious what these enum
> values mean and how they relate to hitCount_, so some comments would be nice.
> 
> Maybe it will be clear later, but will there ever be any time when we'll be
> putting different types of data on different MIR blocks?  If not then it
> might be clearer to remove HitState so that there can be a consistent
> interpretation for the count information associated with the blocks.

Probably not different types, but it might not be present on some scripts if, for obscure reasons, we failed to allocate the PCCounts vector of a JSScript or an inlined JSScript.  This might cause some blocks to have PCCounts while others don't.
The last 3 patches do not change the semantics of the compiled program, but
are used to ensure that we have consistent hit counts on the basic blocks.
Before these patches, we had something like:

  Before-Loop:  42 hit
  OSR:           0 hit (By definition)
  PreHeader:  1001 hit (One extra because we OSR'd at the loopEntry)
  LoopHeader:   42 hit
  Backedge:   1000 hit
  After-Loop:   41 hit (We have not yet exited the loop)
Attachment #8682515 - Flags: review?(jdemooij)
This adds an extra heuristic which is needed to avoid both a major
regression on kraken-gaussian-blur and a noticeable regression on
kraken-ai-astar, while still removing the large unused branch of
Mandreel. (Otherwise we would lose all of the 10% improvement.)

I tried to associate some logic with the rules which keep the benchmark
scores shown in comment 11, while avoiding regressions on Kraken and
SunSpider.
Attachment #8682521 - Flags: review?(bhackett1024)
Comment on attachment 8682522 [details] [diff] [review]
part 11 - Increase JitSpewer mask size to avoid overflow.

Review of attachment 8682522 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/jit/JitSpewer.cpp
@@ +87,5 @@
>  // IonSpewer singleton.
>  static IonSpewer ionspewer;
>  
>  static bool LoggingChecked = false;
> +static uint64_t LoggingBits = 0;

Unless you've deleted it in another patch, you also need to change this line: https://dxr.mozilla.org/mozilla-central/source/js/src/jit/JitSpewer.cpp#492

@@ +627,5 @@
>  bool
>  jit::JitSpewEnabled(JitSpewChannel channel)
>  {
>      MOZ_ASSERT(LoggingChecked);
> +    return (LoggingBits & (uint64_t(1) << uint32_t(channel))) && !filteredOutCompilations;

Can you add either a MOZ_ASSERT (or a static_assert) that channel (resp. the highest value in JitSpewChannel) is less than 64, to make sure that the << doesn't overflow?

@@ +634,5 @@
>  void
>  jit::EnableChannel(JitSpewChannel channel)
>  {
>      MOZ_ASSERT(LoggingChecked);
> +    LoggingBits |= (uint64_t(1) << uint32_t(channel));

nit: maybe the outer parenthesis could be deleted here? (if the compiler warns without them, or you want to keep them, feel free to ignore this comment)
Attachment #8682522 - Flags: review?(benj) → review+
Attachment #8682515 - Flags: review?(jdemooij) → review+
Comment on attachment 8682511 [details] [diff] [review]
part 8 - IonBuilder: newOsrPreheader should not use the hit-count of the loop.

Review of attachment 8682511 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/jit/IonBuilder.cpp
@@ +7339,5 @@
>      MBasicBlock* preheader = newBlock(predecessor, loopEntry);
>      if (!osrBlock || !preheader)
>          return nullptr;
>  
> +    // In order to keep the logical progression of the hit count, we use the

Nit: "logical progression of the hit count" is very vague. Maybe just:

// Give the pre-header the same hit count as the code before the loop.
Attachment #8682511 - Flags: review?(jdemooij) → review+
Attachment #8682510 - Flags: review?(jdemooij) → review+
Comment on attachment 8682521 [details] [diff] [review]
part 10 - Prevent PruneUnusedBranches from being greedy while removing branches which have only been visited a few times.

Review of attachment 8682521 [details] [diff] [review]:
-----------------------------------------------------------------

This is OK, but the heuristic is a little bizarre and it would be nice to remove it sometime in the future.
Attachment #8682521 - Flags: review?(bhackett1024) → review+
Duplicate of this bug: 877878