Closed Bug 1304081 Opened 3 years ago Closed 3 years ago

Cancelling Ion compilation sometimes takes a long time

Category: Core :: JavaScript Engine: JIT (defect)
Status: RESOLVED FIXED
Target Milestone: mozilla52
Tracking Status: firefox50 fixed, firefox51 fixed, firefox52 fixed
People: Reporter: jonco; Assigned: jonco
References: Blocks 1 open bug
Attachments: 2 files

In bug 1269695 I noticed that we sometimes take a very long time to cancel Ion compilation.  I added tracing to print the times and the point at which cancellation was detected:

CancelOffThreadIonCompile waited 635 mS for Prune unused branches (main loop)
CancelOffThreadIonCompile waited 1158 mS for Prune unused branches (removal loop)
CancelOffThreadIonCompile total wait of 1794 mS in 16 iterations
CancelOffThreadIonCompile waited 318 mS for Prune unused branches (removal loop)
CancelOffThreadIonCompile total wait of 318 mS in 1 iterations
CancelOffThreadIonCompile waited 144 mS for not cancelled
CancelOffThreadIonCompile total wait of 144 mS in 1 iterations
CancelOffThreadIonCompile waited 490 mS for Prune unused branches (removal loop)
CancelOffThreadIonCompile total wait of 490 mS in 6 iterations
Is it taking the lock that takes so long, or is it really the cancellation itself?
(In reply to Hannes Verschore [:h4writer] from comment #1)
It's the cancellation.  I wrote a patch to record the time and reason string when calling shouldCancel() and report long durations.  Here's a snippet:

Long cancel time: 19 mS from Lowering (main loop) to Lowering (main loop)
Long cancel time: 11 mS from Reordering to Make loops contiguous
Long cancel time: 15 mS from Prune Unused Branches to Fold Tests
Long cancel time: 16 mS from Start to Prune Unused Branches
Long cancel time: 296 mS from Scalar Replacement (main loop) to Scalar Replacement (main loop)
Long cancel time: 325 mS from Prune Unused Branches to Fold Tests
Long cancel time: 1566 mS from Start to Prune Unused Branches
Long cancel time: 11 mS from Eliminate Phis (worklist) to Eliminate phis
Long cancel time: 30 mS from Backtracking Allocation to Backtracking Allocation
Long cancel time: 11 mS from Start to Prune Unused Branches

I think this means we need to call shouldCancel() more often.
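The tracing described above can be sketched as follows. This is an illustrative reconstruction, not the actual patch: the class name `CancelTracer` and the threshold constant are hypothetical, and the real instrumentation lives inside SpiderMonkey's `MIRGenerator::shouldCancel()`. The idea is simply to record a timestamp and reason string at each check and report long gaps between consecutive checks.

```cpp
#include <cassert>
#include <chrono>
#include <cstdio>
#include <string>
#include <thread>   // only needed by the usage example below

// Hypothetical sketch of the tracing patch: record the time and reason
// string at each shouldCancel() call and report long gaps between
// consecutive calls, producing output like the log lines quoted above.
class CancelTracer {
    using Clock = std::chrono::steady_clock;
    Clock::time_point lastCheck_ = Clock::now();
    std::string lastReason_ = "Start";

  public:
    static constexpr int LongCancelThresholdMs = 10;

    // Returns the gap in ms since the previous check; prints when it is long.
    int check(const std::string& reason) {
        Clock::time_point now = Clock::now();
        int ms = (int)std::chrono::duration_cast<std::chrono::milliseconds>(
                     now - lastCheck_).count();
        if (ms >= LongCancelThresholdMs)
            printf("Long cancel time: %d mS from %s to %s\n",
                   ms, lastReason_.c_str(), reason.c_str());
        lastCheck_ = now;
        lastReason_ = reason;
        return ms;
    }
};
```

A gap is only observed at the *next* check, which is why the fix below is to add more `shouldCancel()` call sites rather than to change the tracer.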
(In reply to Jon Coppeard (:jonco) from comment #0)
> CancelOffThreadIonCompile waited 635 mS for Prune unused branches (main loop)
> CancelOffThreadIonCompile waited 1158 mS for Prune unused branches (removal
> loop)

I guess the cost of the "analysis loop" comes from the heuristics that go through every instruction to collect information about the instructions within the blocks, and potentially also from the dummy heuristics that compute the dominated blocks.

The removal has to go through all the blocks and their instructions and flag their operands as having removed uses.  If this takes time, it implies that we have tons of dead code in the graph we are compiling.  The only way to get that is probably by inlining code into unused branches.  Maybe we should teach IonBuilder not to inline functions into cold blocks.

(The lack of a huge number of FirstExecution bailouts in bz's profiles [Bug 1269695 comment 17] suggests that Branch Pruning is actually right to remove these blocks.)

(In reply to Jon Coppeard (:jonco) from comment #2)
> I think this means we need to call shouldCancel() more often.

That too.

I guess the problem is similar to enoughBallast calls, where we have to guard against any unbounded loops which might allocate.
(In reply to Nicolas B. Pierron [:nbp] from comment #3)
> (In reply to Jon Coppeard (:jonco) from comment #2)
> > I think this means we need to call shouldCancel() more often.
> 
> I guess the problem is similar to enoughBallast calls, where we have to
> guard against any unbounded loops which might allocate.

To make it clear: even if this is not deterministic, I think we should treat time as a resource, just as we treat memory as one.

We could add debug-build instrumentation similar to what Bug 1264948 did for the LifoAlloc, to ensure that we have "enough time" to continue the execution.  One advantage of the LifoAlloc instrumentation is that the stack trace alone is enough to identify the issue.

What we could do is instrument the MInstruction / MDefinition / MOperands iterators to check whether we have exhausted our allocated time slice since the last call to the "enough time" function.  A debug-only assertion failure would then report the stack trace of every loop that takes more than X ms (*) to execute.  The only issue is systems that are already overloaded, so unless we find a way to detect that, we would have to enable this only behind a flag.

(*) As discussed on IRC, our incremental GC has 10 ms slices, thus we should probably look at being able to cancel in less than 6ms.
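The "enough time" idea above can be sketched like this. All names here (`TimeBudget`, `refresh`, `assertEnoughTime`) are hypothetical, by analogy with the `enoughBallast()` pattern: a budget is refreshed at every `shouldCancel()` call, and hot iterators assert that the slice has not been exhausted, so an assertion failure pinpoints exactly the loops that run too long between checks.

```cpp
#include <cassert>
#include <chrono>
#include <thread>   // only needed by the usage example below

// Sketch of a debug-only "enough time" check, treating time as a resource
// the way enoughBallast() treats LifoAlloc memory. Refreshed at each
// shouldCancel() call; hot loop iterators assert the slice is not exhausted.
class TimeBudget {
    using Clock = std::chrono::steady_clock;
    Clock::time_point sliceStart_ = Clock::now();
    std::chrono::milliseconds slice_;

  public:
    // Default target: cancellable well within a 10 ms incremental GC slice,
    // hence the 6 ms figure discussed above.
    explicit TimeBudget(int sliceMs = 6) : slice_(sliceMs) {}

    // Called from shouldCancel(): start a fresh slice.
    void refresh() { sliceStart_ = Clock::now(); }

    bool exhausted() const {
        return Clock::now() - sliceStart_ > slice_;
    }

    // Debug-only check a loop iterator would perform; the resulting stack
    // trace identifies the loop that went too long without a cancel check.
    void assertEnoughTime() const {
#ifdef DEBUG
        assert(!exhausted() && "loop ran too long without shouldCancel()");
#endif
    }
};
```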
Attached patch trace-cancel (Splinter Review)
Here's a patch I'm using to report long durations between calls to shouldCancel().
Using the previous patch as a guide I added more calls to shouldCancel().

There are still many reported instances of long durations between calls, but this seems to fix the long pauses seen when loading the Google Docs page.
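The shape of the fix can be sketched as below. This is an illustrative reduction, not SpiderMonkey code: `MIRGeneratorSketch` stands in for the real `MIRGenerator`, whose `shouldCancel()` polls a flag set by `CancelOffThreadIonCompile` on the main thread. The point is that every potentially long pass loop polls once per iteration, so the main thread never waits long for the off-thread compile to notice cancellation.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative stand-in for MIRGenerator: the main thread sets
// cancelRequested; the compilation thread polls it via shouldCancel().
struct MIRGeneratorSketch {
    std::atomic<bool> cancelRequested{false};
    const char* cancelReason = nullptr;

    bool shouldCancel(const char* where) {
        if (cancelRequested.load(std::memory_order_relaxed)) {
            cancelReason = where;  // record where cancellation was observed
            return true;
        }
        return false;
    }
};

// A pass over the blocks of a graph; returns false if cancelled.
// The per-block shouldCancel() call is the pattern the patch adds.
bool PruneUnusedBranchesSketch(MIRGeneratorSketch* mir,
                               const std::vector<int>& blocks) {
    for (size_t i = 0; i < blocks.size(); i++) {
        if (mir->shouldCancel("Prune unused branches (main loop)"))
            return false;
        // ... per-block analysis work would go here ...
    }
    return true;
}
```

The review comments below refine exactly this trade-off: checks in tiny inner loops (few iterations, little work) add overhead without helping, while checks in outer loops are redundant when the inner loop already polls on every entry.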
Attachment #8793274 - Flags: review?(nicolas.b.pierron)
Attachment #8793274 - Flags: review?(nicolas.b.pierron) → review+
Comment on attachment 8793274 [details] [diff] [review]
bug1304081-more-should-cancel

Review of attachment 8793274 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/jit/IonAnalysis.cpp
@@ +307,5 @@
>          size_t numPred = block->numPredecessors();
>          size_t i = 0;
>          for (; i < numPred; i++) {
> +            if (mir->shouldCancel("Prune unused branches (inner loop 1)"))
> +                return false;

nit: This sounds unlikely to trigger as the number of predecessors is usually quite low, and the amount of work is small.

@@ +1469,5 @@
>      Vector<MPhi*, 0, SystemAllocPolicy> phisWithEmptyInputTypes;
>  
>      for (PostorderIterator block(graph.poBegin()); block != graph.poEnd(); block++) {
>          if (mir->shouldCancel("Specialize Phis (main loop)"))
>              return false;

nit: The check on the outer loop should no longer be needed, as we check in the inner loop, and we are likely to enter the inner loop to cancel the compilation.

@@ +2166,5 @@
>      MDefinitionVector worklist(graph.alloc());
>  
>      for (ReversePostorderIterator block(graph.rpoBegin()); block != graph.rpoEnd(); block++) {
> +        if (mir->shouldCancel("MakeMRegExpHoistable outer loop"))
> +            return false;

nit: Remove this check, there is always one MDefinition in each block, so the inner loop will always be executed.
Pushed by jcoppeard@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/ab46d921bf6b
Add more use of shouldCancel() to avoid pauses when cancelling off thread Ion compilation r=nbp
https://hg.mozilla.org/mozilla-central/rev/ab46d921bf6b
Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla52
I suggest that we backport this patch to all branches, as this bug can cause noticeable performance issues.
Comment on attachment 8793274 [details] [diff] [review]
bug1304081-more-should-cancel

Approval Request Comment
[Feature/regressing bug #]: I'm going to blame bug 1209515, since that could have used more shouldCancel() checks, but it's really a more general problem than that.
[User impact if declined]: Possibility of unresponsiveness / jank.
[Describe test coverage new/current, TreeHerder]: On m-c since 22nd September.
[Risks and why]: Low.
[String/UUID change made/needed]: None.
Attachment #8793274 - Flags: approval-mozilla-beta?
Attachment #8793274 - Flags: approval-mozilla-aurora?
Hi Jon, Nicolas, do we have data from Nightly that suggests this is indeed showing perf improvements? The size of the patch is worrisome to me, and if we can justify the uplift with some perf data / telemetry probes showing less jank, it would be much easier to take this in Beta50.
Flags: needinfo?(nicolas.b.pierron)
Flags: needinfo?(jcoppeard)
Comment on attachment 8793274 [details] [diff] [review]
bug1304081-more-should-cancel

Perf improvements that we can uplift to Aurora51 to stabilize before deciding to take it in Beta50.
Attachment #8793274 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
(In reply to Ritu Kothari (:ritu) from comment #12)
I think it's too soon to tell from telemetry, but see comment 1 for data showing that loading a particular google doc sometimes caused GC slices of over 1 second due to this issue.
Flags: needinfo?(jcoppeard)
(In reply to Jon Coppeard (:jonco) from comment #15)
> (In reply to Ritu Kothari (:ritu) from comment #12)
> I think it's too soon to tell from telemetry, but see comment 1 for data
> showing that loading a particular google doc sometimes caused GC slices of
> over 1 second due to this issue.

These GC slices are on the main thread, and we might wait for another thread for about 1 second on Google Docs.
Flags: needinfo?(nicolas.b.pierron)
Comment on attachment 8793274 [details] [diff] [review]
bug1304081-more-should-cancel

If this improves perf of Firefox when using Google Docs, it's a risk worth taking. So far no problems from Nightly and Aurora, so let's uplift to Beta50.
Attachment #8793274 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Thanks guys! What telemetry probes would we use to monitor this?
Flags: needinfo?(jcoppeard)
(In reply to Ritu Kothari (:ritu) from comment #18)
> Thanks guys! What telemetry probes would we use to monitor this?

GC_SWEEP_MS and GC_MAX_PAUSE_MS are worth keeping an eye on.
Flags: needinfo?(jcoppeard)
Duplicate of this bug: 1007927
See Also: → 1358890