finalisation slice can run late

Status: ASSIGNED
Priority: P3
Severity: normal
Opened: 2 years ago
Last modified: a year ago

People

(Reporter: pbone, Assigned: pbone)

Tracking

({perf})

Firefox Tracking Flags

(Not tracked)

Details

(Whiteboard: [qf:p3])

Attachments

(2 attachments)

(Assignee)

Description

2 years ago
The JS shell runs GC slices when the amount of allocated memory crosses a threshold.  If a slice ends after sweeping has finished but before finalisation (or perhaps part-way through sweeping), the allocated memory will have dropped, so the next slice won't run for some time (not until the allocated memory rises again).  This means that if other zones want to begin a collection, they either have to interrupt this unfinished collection or, due to bug 1367455, they will be delayed.  Barriers may also still be in place (I'm not sure), which could slow throughput.
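
Roughly, the trigger behaviour described above looks like this (a minimal sketch, not SpiderMonkey code; names such as gcBytes, sliceThreshold and runSlice, and the constants, are placeholders):

#include <cstddef>

// Minimal sketch of an allocation-driven slice trigger and why it can
// strand the finalisation slice. Illustrative only, not engine code.
struct Zone {
    size_t gcBytes = 0;                      // currently allocated GC memory
    size_t sliceThreshold = 8 * 1024 * 1024; // allocation level that starts a slice
    bool collecting = false;                 // an incremental GC is in progress

    // Called on allocation; this is the only thing that starts a slice.
    void maybeTriggerSlice() {
        if (gcBytes >= sliceThreshold)
            runSlice();
    }

    void runSlice() {
        collecting = true;
        // ... marking/sweeping within the slice budget ...
        // Sweeping frees memory, so gcBytes falls back below the threshold
        // even though the finalisation slice has not run yet.
        gcBytes /= 2; // stand-in for memory released by sweeping
        // `collecting` stays true, but nothing calls runSlice() again until
        // allocation pushes gcBytes back over sliceThreshold, so the final
        // slice runs late and other zones' collections are held up.
    }
};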
(Assignee)

Updated

2 years ago
Assignee: nobody → pbone
Status: NEW → ASSIGNED
(Assignee)

Updated

2 years ago
See Also: → bug 1367455
Whiteboard: [qf]
(Assignee)

Comment 1

2 years ago
Here are the two regressions that I'm attempting to fix:

https://treeherder.mozilla.org/perf.html#/alerts?id=7016
(Assignee)

Updated

2 years ago
Blocks: 1376904
(Assignee)

Updated

2 years ago
No longer blocks: 1376904
Comment 2

Paul, is this something that only shows up in the shell, or does this pattern happen in the browser as well?  From the description it seems like the browser in general could be affected by this too.
Flags: needinfo?(pbone)
(Assignee)

Comment 3

a year ago
Created attachment 8885084 [details]
Profile timeline

This is a screenshot from the Gecko profiler.  This could be the same behaviour happening in the browser (either from the same or a different cause).  I'm also suspicious about the integration between the GC and the Gecko profiler; I'm currently trying to work out how all this fits together so that I understand _completely_ what it is I'm seeing.
Flags: needinfo?(pbone)
(Assignee)

Comment 4

a year ago
Created attachment 8885085 [details]
Profile from Splay Latency

Here is the Gecko profile file for Splay Latency.
Whiteboard: [qf] → [qf:p3]
(Assignee)

Comment 5

a year ago
Here's what I know:

 + The same behavior, with the delayed final slice, does occur in the browser.

 + The GC following the one with the delayed slice becomes non-incremental due to GCBytesTrigger.  This also happens in
   Beta (I think my earlier patch is now in Beta).  In my tests with the shell this was because the heap filled up during
   the collection, since the start of the collection was delayed.  But there are also other possibilities.

 + My patch, which I discussed with jonco, attempts to avoid this by adding a new monotonic counter for the amount of
   memory used; the counter is left alone as memory is freed.  It is used to trigger the final slice in a reasonable
   time and complete the zonal GC sooner, therefore not delaying the following GC.  (A sketch of this idea appears
   after this list.)

 + My patch works: it does avoid this problem.  However, _all_ the GCs seem to switch to non-incremental (rather than
   only the delayed GCs) due to GCBytesTrigger.  This creates more frequent and longer pauses, lowering the benchmark
   scores further.  This shouldn't happen if my original guess was right that the delayed GC became non-incremental
   because it was delayed.  Now none are being delayed and yet ALL are non-incremental, so this could also have a
   separate cause.

 + My patched version seems to have slices that "look" further apart, but I haven't measured this accurately.  That
   should not be caused by no longer delaying the final slice; that doesn't make sense.  However, it does suggest my
   new counter is not counting as fast as the previous one, which could cause the GC to fail to reclaim memory quickly
   enough and therefore switch to non-incremental mode.

 + Since lowering the slice budget to 5ms, the whole system is possibly more likely to fail to reclaim enough memory
   soon enough, as the trigger for slices wasn't changed to correspond to this.  I propose lowering the heap thresholds
   to begin a collection earlier, and therefore hope to run enough slices before the heap fills up.  Ideally we should
   calculate some of these parameters dynamically, based on a projection of how much work remains during a collection
   and how much space is left in the heap.
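
As a rough illustration of the monotonic-counter idea above (this is not the actual patch; all the names and constants below are placeholders):

#include <cstddef>

// Sketch: a monotonic "allocated so far" counter that, unlike the live-byte
// count, is never decremented when memory is freed, so it can still cross a
// threshold and trigger the finalisation slice after sweeping has shrunk
// the heap. Illustrative only.
struct ZoneTriggers {
    size_t gcBytes = 0;            // live bytes; falls when memory is freed
    size_t bytesAllocatedEver = 0; // monotonic; never decremented
    size_t sliceThreshold = 8 * 1024 * 1024;
    size_t monotonicThreshold = 0; // set when a collection starts

    void noteAllocation(size_t nbytes) {
        gcBytes += nbytes;
        bytesAllocatedEver += nbytes;
    }

    void noteFree(size_t nbytes) {
        gcBytes -= nbytes;
        // bytesAllocatedEver is deliberately left alone.
    }

    void collectionStarted() {
        // Require only a modest amount of further allocation before the
        // next (possibly final) slice must run.
        monotonicThreshold = bytesAllocatedEver + sliceThreshold / 4;
    }

    bool shouldRunSlice(bool collecting) const {
        if (gcBytes >= sliceThreshold)
            return true;
        // The monotonic counter fires even after sweeping has reduced
        // gcBytes, so an in-progress GC is finished promptly.
        return collecting && bytesAllocatedEver >= monotonicThreshold;
    }
};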

There are some things I can test:

 + Adjust the threshold for my new counter: lower it and see if the behavior continues.

 + Search for places where I should be incrementing the counter but am not.

 + Re-increase the GC pause time to 10ms and see if the problem still occurs.

 + Lower the thresholds that start a collection.
(Assignee)

Updated

a year ago
Blocks: 1367455
(Assignee)

Updated

a year ago
See Also: → bug 1386219
Priority: -- → P3
Comment 6

(In reply to Paul Bone [:pbone] from comment #5)
> + My patch works: it does avoid this problem.  However, _all_ the GCs seem to switch to non-incremental (rather than
>   only the delayed GCs) due to GCBytesTrigger.  This creates more frequent and longer pauses, lowering the benchmark
>   scores further.  This shouldn't happen if my original guess was right that the delayed GC became non-incremental
>   because it was delayed.  Now none are being delayed and yet ALL are non-incremental, so this could also have a
>   separate cause.

It seems like it's important to understand what's going on here. If you look at a profile where this is happening, does the information on thresholds make sense? That is, if this is GCBytesTrigger, then timings.trigger_amount and timings.trigger_threshold should be getting set. Are these significantly different from before your patch? (It just strikes me that if your new counter were never incremented at all, due to some bug, then it would have this observed behavior.)
Flags: needinfo?(pbone)
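
As a rough illustration of the check suggested in comment 6 (not the real statistics code; the struct and function names below are hypothetical):

#include <cstdio>
#include <cstddef>

struct TriggerStats {
    size_t triggerAmount = 0;    // counter value when the GC was triggered
    size_t triggerThreshold = 0; // threshold it was compared against
};

// Hypothetical hook at the point where GCBytesTrigger forces a
// non-incremental collection.
void reportNonIncremental(const TriggerStats& stats) {
    std::fprintf(stderr,
                 "non-incremental GC: trigger_amount=%zu trigger_threshold=%zu\n",
                 stats.triggerAmount, stats.triggerThreshold);
    // If trigger_amount is stuck at 0 (or never grows between GCs), the new
    // counter is probably not being incremented, which would explain every
    // GC turning non-incremental.
}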
(Assignee)

Comment 7

a year ago
Good thinking. I'll check that the next time I get to look at this, probably not until Monday the 25th.
(Assignee)

Updated

a year ago
Flags: needinfo?(pbone)