[meta] Recent minor GC performance regressions

RESOLVED WORKSFORME

Status

()

RESOLVED WORKSFORME
2 years ago
2 years ago

People

(Reporter: jonco, Assigned: jonco)

Tracking

Firefox Tracking Flags

(Not tracked)

Details

(Assignee)

Description

2 years ago
There have been a few performance regressions detected over the last few days which are probably related to my recent changes to minor GC:

1) GC_MINOR_US regression median ~320 -> ~791 uS

http://alerts.telemetry.mozilla.org/index.html#/detectors/1/metrics/1305/alerts/?from=2016-08-13&to=2016-08-13

2) GC_NURSERY_BYTES shows larger nursery size

http://alerts.telemetry.mozilla.org/index.html#/detectors/1/metrics/1931/alerts/?from=2016-08-13&to=2016-08-13

3) Talos damp regressions

https://bugzilla.mozilla.org/show_bug.cgi?id=1295490

4) shu reports local octane regressions


Changesets blamed by 1 and 2 include the following changes:

 - dynamic nusery chunk allocation (bug 1291292)
 - pre-tenuring changes (bug 1293262)
 - resize heuristic changes (bug 1293239)

The talos regression blames the resize heuristic changes only.

The increase of GC_NURSERY_BYTES shows that the resize heuristic changes did have an effect in the wild, but the increase is not necessarily a bad thing.

The GC_MINOR_US increase seems bad but is in fact inconclusive.  As nursery sizes increase we may end up doing fewer longer minor GCs.  The important consideration is the total time spent collecting but we can't tell that from this data.

However, it does seem that something is up.

I think it would make sense to revert the changes to the resizing heuristic and see if things return to normal.  This will tell us whether that or the dynamic allocation changes are to blame, and we can work from there.

Terrence, what do you think?
(Assignee)

Updated

2 years ago
Flags: needinfo?(terrence)
I agree with all of that.
Flags: needinfo?(terrence)
(Assignee)

Updated

2 years ago
Blocks: 1297646
(Assignee)

Comment 2

2 years ago
The nursery resize heuristic change was backed out incorrectly by 491bc7e98f2a on 19th August and then again correctly by 5c56991cb9a5 on 22nd August.  In a few days we will see the telemetry results.

Shu's reported regression was found to be due to not disabling poisoning while benchmarking.

The damp regressions seem to have mostly gone away since the backout (I'll put the full details in the specific bugs).
(Assignee)

Comment 3

2 years ago
Timeline of changes to help make sense of the perfherder graphs:

2016-08-11 10:45:24 BST -- Bug 1293239 lands
2016-08-12 21:49:03 BST -- Bug lands 1291292
2016-08-19 10:47:48 BST -- Broken backout of Bug 1293239
2016-08-22 17:47:27 BST -- Actual backout of Bug 1293239
(Assignee)

Comment 4

2 years ago
This has all been fixed now.
Status: NEW → RESOLVED
Last Resolved: 2 years ago
Resolution: --- → WORKSFORME

Comment 5

2 years ago
The first Telemetry Alerts are coming in to let us know that the measures have changed back (like GC_MINOR_US[1]).

I haven't seen an alert for GC_NURSERY_BYTES yet, but sometimes it takes a few extra days.

[1]: http://alerts.telemetry.mozilla.org/index.html#/detectors/1/metrics/1305/alerts/?from=2016-10-14&to=2016-10-14
You need to log in before you can comment on or make changes to this bug.