Closed Bug 1304081 Opened 3 years ago Closed 3 years ago

Cancelling Ion compilation sometimes takes a long time

Category: Core :: JavaScript Engine: JIT (defect)
Status: RESOLVED FIXED
Target Milestone: mozilla52
Tracking Status: firefox50 fixed, firefox51 fixed, firefox52 fixed
People: Reporter: jonco; Assigned: jonco
References: Blocks 1 open bug
Attachments: 2 files

In bug 1269695 I noticed that we sometimes take a very long time to cancel Ion compilation.  I added tracing to print the times and the point at which cancellation was detected:

CancelOffThreadIonCompile waited 635 mS for Prune unused branches (main loop)
CancelOffThreadIonCompile waited 1158 mS for Prune unused branches (removal loop)
CancelOffThreadIonCompile total wait of 1794 mS in 16 iterations
CancelOffThreadIonCompile waited 318 mS for Prune unused branches (removal loop)
CancelOffThreadIonCompile total wait of 318 mS in 1 iterations
CancelOffThreadIonCompile waited 144 mS for not cancelled
CancelOffThreadIonCompile total wait of 144 mS in 1 iterations
CancelOffThreadIonCompile waited 490 mS for Prune unused branches (removal loop)
CancelOffThreadIonCompile total wait of 490 mS in 6 iterations
Is it taking the lock that takes so long, or is it really the cancellation itself?
(In reply to Hannes Verschore [:h4writer] from comment #1)
It's the cancellation.  I wrote a patch to record the time and reason string when calling shouldCancel() and report long durations.  Here's a snippet:

Long cancel time: 19 mS from Lowering (main loop) to Lowering (main loop)
Long cancel time: 11 mS from Reordering to Make loops contiguous
Long cancel time: 15 mS from Prune Unused Branches to Fold Tests
Long cancel time: 16 mS from Start to Prune Unused Branches
Long cancel time: 296 mS from Scalar Replacement (main loop) to Scalar Replacement (main loop)
Long cancel time: 325 mS from Prune Unused Branches to Fold Tests
Long cancel time: 1566 mS from Start to Prune Unused Branches
Long cancel time: 11 mS from Eliminate Phis (worklist) to Eliminate phis
Long cancel time: 30 mS from Backtracking Allocation to Backtracking Allocation
Long cancel time: 11 mS from Start to Prune Unused Branches

I think this means we need to call shouldCancel() more often.
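The tracing described above can be sketched as follows. This is an illustrative reconstruction, not the actual patch: the class name `CancelTracer` and the threshold constant are hypothetical, and the real instrumentation lives inside SpiderMonkey's `MIRGenerator::shouldCancel()`. The idea is simply to record a timestamp and reason string at each check and report long gaps between consecutive checks.

```cpp
#include <cassert>
#include <chrono>
#include <cstdio>
#include <string>
#include <thread>   // only needed by the usage example below

// Hypothetical sketch of the tracing patch: record the time and reason
// string at each shouldCancel() call and report long gaps between
// consecutive calls, producing output like the log lines quoted above.
class CancelTracer {
    using Clock = std::chrono::steady_clock;
    Clock::time_point lastCheck_ = Clock::now();
    std::string lastReason_ = "Start";

  public:
    static constexpr int LongCancelThresholdMs = 10;

    // Returns the gap in ms since the previous check; prints when it is long.
    int check(const std::string& reason) {
        Clock::time_point now = Clock::now();
        int ms = (int)std::chrono::duration_cast<std::chrono::milliseconds>(
                     now - lastCheck_).count();
        if (ms >= LongCancelThresholdMs)
            printf("Long cancel time: %d mS from %s to %s\n",
                   ms, lastReason_.c_str(), reason.c_str());
        lastCheck_ = now;
        lastReason_ = reason;
        return ms;
    }
};
```

A gap is only observed at the *next* check, which is why the fix below is to add more `shouldCancel()` call sites rather than to change the tracer.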
(In reply to Jon Coppeard (:jonco) from comment #0)
> CancelOffThreadIonCompile waited 635 mS for Prune unused branches (main loop)
> CancelOffThreadIonCompile waited 1158 mS for Prune unused branches (removal
> loop)

I guess the cost of the "analysis loop" comes from the heuristics that go through every instruction to collect information about the instructions within the blocks, and potentially also from the dummy heuristics that compute the dominated blocks.

The removal has to go through all the blocks and their instructions and flag their operands as having removed uses.  If this takes time, it implies that we have tons of dead code in the graph we are compiling.  The only way to get that is probably by inlining code into unused branches.  Maybe we should teach IonBuilder not to inline functions into cold blocks.

(The lack of a huge number of FirstExecution bailouts in bz's profiles [Bug 1269695 comment 17] suggests that Branch Pruning is actually right to remove these blocks.)

(In reply to Jon Coppeard (:jonco) from comment #2)
> I think this means we need to call shouldCancel() more often.

That too.

I guess the problem is similar to enoughBallast calls, where we have to guard against any unbounded loops which might allocate.
(In reply to Nicolas B. Pierron [:nbp] from comment #3)
> (In reply to Jon Coppeard (:jonco) from comment #2)
> > I think this means we need to call shouldCancel() more often.
> 
> I guess the problem is similar to enoughBallast calls, where we have to
> guard against any unbounded loops which might allocate.

To make it clear: even if this is not deterministic, I think we should treat time as a resource, just as we treat memory as one.

We could add debug-build instrumentation similar to what Bug 1264948 did for the LifoAlloc, to ensure that we have "enough time" to continue the execution.  One advantage of the LifoAlloc instrumentation is that the stack trace alone is enough to identify the issue.

What we could do is instrument the MInstruction / MDefinition / MOperands iterators to check whether we have exhausted our allocated time slice since the last call to the "enough time" function.  A debug-only assertion failure would then report the stack trace of every loop that takes more than X ms (*) to execute.  The only issue is systems that are already overloaded, so unless we find a way to detect that, we would have to enable this only behind a flag.

(*) As discussed on IRC, our incremental GC has 10 ms slices, thus we should probably look at being able to cancel in less than 6ms.
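The "enough time" idea above can be sketched like this. All names here (`TimeBudget`, `refresh`, `assertEnoughTime`) are hypothetical, by analogy with the `enoughBallast()` pattern: a budget is refreshed at every `shouldCancel()` call, and hot iterators assert that the slice has not been exhausted, so an assertion failure pinpoints exactly the loops that run too long between checks.

```cpp
#include <cassert>
#include <chrono>
#include <thread>   // only needed by the usage example below

// Sketch of a debug-only "enough time" check, treating time as a resource
// the way enoughBallast() treats LifoAlloc memory. Refreshed at each
// shouldCancel() call; hot loop iterators assert the slice is not exhausted.
class TimeBudget {
    using Clock = std::chrono::steady_clock;
    Clock::time_point sliceStart_ = Clock::now();
    std::chrono::milliseconds slice_;

  public:
    // Default target: cancellable well within a 10 ms incremental GC slice,
    // hence the 6 ms figure discussed above.
    explicit TimeBudget(int sliceMs = 6) : slice_(sliceMs) {}

    // Called from shouldCancel(): start a fresh slice.
    void refresh() { sliceStart_ = Clock::now(); }

    bool exhausted() const {
        return Clock::now() - sliceStart_ > slice_;
    }

    // Debug-only check a loop iterator would perform; the resulting stack
    // trace identifies the loop that went too long without a cancel check.
    void assertEnoughTime() const {
#ifdef DEBUG
        assert(!exhausted() && "loop ran too long without shouldCancel()");
#endif
    }
};
```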
Attached patch trace-cancel (Splinter Review)
Here's a patch I'm using to report long durations between calls to shouldCancel().
Using the previous patch as a guide I added more calls to shouldCancel().

There are still many reported instances of long durations between calls, but this seems to fix the long pauses seen when loading the Google Docs page.
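The shape of the fix can be sketched as below. This is an illustrative reduction, not SpiderMonkey code: `MIRGeneratorSketch` stands in for the real `MIRGenerator`, whose `shouldCancel()` polls a flag set by `CancelOffThreadIonCompile` on the main thread. The point is that every potentially long pass loop polls once per iteration, so the main thread never waits long for the off-thread compile to notice cancellation.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative stand-in for MIRGenerator: the main thread sets
// cancelRequested; the compilation thread polls it via shouldCancel().
struct MIRGeneratorSketch {
    std::atomic<bool> cancelRequested{false};
    const char* cancelReason = nullptr;

    bool shouldCancel(const char* where) {
        if (cancelRequested.load(std::memory_order_relaxed)) {
            cancelReason = where;  // record where cancellation was observed
            return true;
        }
        return false;
    }
};

// A pass over the blocks of a graph; returns false if cancelled.
// The per-block shouldCancel() call is the pattern the patch adds.
bool PruneUnusedBranchesSketch(MIRGeneratorSketch* mir,
                               const std::vector<int>& blocks) {
    for (size_t i = 0; i < blocks.size(); i++) {
        if (mir->shouldCancel("Prune unused branches (main loop)"))
            return false;
        // ... per-block analysis work would go here ...
    }
    return true;
}
```

The review comments below refine exactly this trade-off: checks in tiny inner loops (few iterations, little work) add overhead without helping, while checks in outer loops are redundant when the inner loop already polls on every entry.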
Attachment #8793274 - Flags: review?(nicolas.b.pierron)
Attachment #8793274 - Flags: review?(nicolas.b.pierron) → review+
Comment on attachment 8793274 [details] [diff] [review]
bug1304081-more-should-cancel

Review of attachment 8793274 [details] [diff] [review]:
-----------------------------------------------------------------

::: js/src/jit/IonAnalysis.cpp
@@ +307,5 @@
>          size_t numPred = block->numPredecessors();
>          size_t i = 0;
>          for (; i < numPred; i++) {
> +            if (mir->shouldCancel("Prune unused branches (inner loop 1)"))
> +                return false;

nit: This sounds unlikely to trigger as the number of predecessors is usually quite low, and the amount of work is small.

@@ +1469,5 @@
>      Vector<MPhi*, 0, SystemAllocPolicy> phisWithEmptyInputTypes;
>  
>      for (PostorderIterator block(graph.poBegin()); block != graph.poEnd(); block++) {
>          if (mir->shouldCancel("Specialize Phis (main loop)"))
>              return false;

nit: The check on the outer loop should no longer be needed, as we check in the inner loop, and we are likely to enter the inner loop to cancel the compilation.

@@ +2166,5 @@
>      MDefinitionVector worklist(graph.alloc());
>  
>      for (ReversePostorderIterator block(graph.rpoBegin()); block != graph.rpoEnd(); block++) {
> +        if (mir->shouldCancel("MakeMRegExpHoistable outer loop"))
> +            return false;

nit: Remove this check, there is always one MDefinition in each block, so the inner loop will always be executed.
Pushed by jcoppeard@mozilla.com:
https://hg.mozilla.org/integration/mozilla-inbound/rev/ab46d921bf6b
Add more use of shouldCancel() to avoid pauses when cancelling off thread Ion compilation r=nbp
https://hg.mozilla.org/mozilla-central/rev/ab46d921bf6b
Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → mozilla52
I suggest that we backport this patch to all branches, as this bug can cause noticeable performance issues.
Comment on attachment 8793274 [details] [diff] [review]
bug1304081-more-should-cancel

Approval Request Comment
[Feature/regressing bug #]: I'm going to blame bug 1209515, since that could have used more shouldCancel() checks, but it's really a more general problem than that.
[User impact if declined]: Possibility of unresponsiveness / jank.
[Describe test coverage new/current, TreeHerder]: On m-c since 22nd September.
[Risks and why]: Low.
[String/UUID change made/needed]: None.
Attachment #8793274 - Flags: approval-mozilla-beta?
Attachment #8793274 - Flags: approval-mozilla-aurora?
Hi Jon, Nicolas, do we have data from Nightly that suggests this is indeed showing perf improvements? The size of the patch is worrisome to me, and if we can justify the uplift with some perf data / telemetry probes showing less jank, it would be much easier to take this in Beta50.
Flags: needinfo?(nicolas.b.pierron)
Flags: needinfo?(jcoppeard)
Comment on attachment 8793274 [details] [diff] [review]
bug1304081-more-should-cancel

Perf improvements that we can uplift to Aurora51 to stabilize before deciding to take it in Beta50.
Attachment #8793274 - Flags: approval-mozilla-aurora? → approval-mozilla-aurora+
(In reply to Ritu Kothari (:ritu) from comment #12)
I think it's too soon to tell from telemetry, but see comment 1 for data showing that loading a particular google doc sometimes caused GC slices of over 1 second due to this issue.
Flags: needinfo?(jcoppeard)
(In reply to Jon Coppeard (:jonco) from comment #15)
> (In reply to Ritu Kothari (:ritu) from comment #12)
> I think it's too soon to tell from telemetry, but see comment 1 for data
> showing that loading a particular google doc sometimes caused GC slices of
> over 1 second due to this issue.

These GC slices are on the main thread, and we might wait for another thread for about 1 second on Google Docs.
Flags: needinfo?(nicolas.b.pierron)
Comment on attachment 8793274 [details] [diff] [review]
bug1304081-more-should-cancel

If this improves perf of Firefox when using Google Docs, it's a risk worth taking. So far no problems from Nightly and Aurora, so let's uplift to Beta50.
Attachment #8793274 - Flags: approval-mozilla-beta? → approval-mozilla-beta+
Thanks guys! What telemetry probes would we use to monitor this?
Flags: needinfo?(jcoppeard)
(In reply to Ritu Kothari (:ritu) from comment #18)
> Thanks guys! What telemetry probes would we use to monitor this?

GC_SWEEP_MS and GC_MAX_PAUSE_MS are worth keeping an eye on.
Flags: needinfo?(jcoppeard)
Duplicate of this bug: 1007927
See Also: → 1358890