Open Bug 1728273 Opened 3 years ago Updated 2 years ago

USER_INACTIVE GCs do not collect all zones

Categories: Core :: JavaScript: GC, defect, P1

Status: REOPENED

People: Reporter: pbone; Assigned: sfink (NeedInfo)

References: Blocks 1 open bug

Keywords: perf-alert

Attachments: 1 file

I don't know if this is a side effect of some of my refactoring, but I noticed via the profiler that USER_INACTIVE GCs do not collect all zones. It seems like they should, but maybe there's some other reason they don't.

jonco, do you know of any reason why USER_INACTIVE GCs might not currently collect all zones?

Flags: needinfo?(jcoppeard)
Assignee: nobody → pbone
Status: NEW → ASSIGNED

jonco answered via phabricator that these should be full GCs.

Flags: needinfo?(jcoppeard)
Pushed by pbone@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/891ac85a6625
All shrinking GCs are full GCs r=jonco,smaug
Status: ASSIGNED → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED
Target Milestone: --- → 94 Branch

Backed out for causing Bug 1729890

Status: RESOLVED → REOPENED
Resolution: FIXED → ---
Target Milestone: 94 Branch → ---
Flags: needinfo?(pbone)

pbone: When looking into this, I had one question that isn't really relevant to the real problem (of increased memory usage), but it seems relevant here: why does this make all USER_INACTIVE GCs full GCs instead of just all shrinking GCs?

For example, we have a GCRunner fire with reason=USER_INACTIVE, but we discover IsRegularRateTimerTicking() again (I guess the user did something?). If mWantAtLeastRegularGC is set, then we might end up doing a non-shrinking USER_INACTIVE GC. It looks like this would happen if we originally wanted a GC for a different reason. That shouldn't be forced to be a full GC, should it?

In short, shouldn't the updated test be for aShrinking == ShrinkingGC rather than aReason == JS::GCReason::USER_INACTIVE?

I'm still trying to remind myself of how all this logic works, and I see no reason for this to cause the observed problem, but this looked a little weird so I thought I'd ask.

(In reply to Steve Fink [:sfink] [:s:] from comment #8)

> pbone: When looking into this, I had one question that isn't really relevant to the real problem (of increased memory usage), but it seems relevant here: why does this make all USER_INACTIVE GCs full GCs instead of just all shrinking GCs?

> For example, we have a GCRunner fire with reason=USER_INACTIVE, but we discover IsRegularRateTimerTicking() again (I guess the user did something?). If mWantAtLeastRegularGC is set, then we might end up doing a non-shrinking USER_INACTIVE GC. It looks like this would happen if we originally wanted a GC for a different reason. That shouldn't be forced to be a full GC, should it?

Oh, I think you're right; I forgot these would still be USER_INACTIVE.

> In short, shouldn't the updated test be for aShrinking == ShrinkingGC rather than aReason == JS::GCReason::USER_INACTIVE?

Or maybe both? But I don't think you can have a shrinking GC without USER_INACTIVE at the moment.

> I'm still trying to remind myself of how all this logic works, and I see no reason for this to cause the observed problem, but this looked a little weird so I thought I'd ask.

Thanks. Since I've been working on this lately and trying to clean up/document the logic, I don't mind an extra set of eyes to make sure I've got it right, or even questions when something's not completely clear.

Cheers.

(In reply to Cristian Tuns from comment #7)

> Backout merged to central https://hg.mozilla.org/mozilla-central/rev/e6a5efd2b988

== Change summary for alert #31372 (as of Sun, 19 Sep 2021 19:46:16 GMT) ==

Improvements:

Ratio Test Platform Options Absolute values (old vs new)
8% Explicit Memory linux1804-64-shippable-qr fission tp6 838,953,235.69 -> 772,642,542.00
8% Explicit Memory macosx1015-64-shippable-qr fission tp6 717,904,196.22 -> 663,977,456.01
7% Explicit Memory windows10-64-2004-shippable-qr fission tp6 682,592,078.67 -> 631,929,184.85
7% Explicit Memory macosx1015-64-shippable-qr tp6 750,795,029.97 -> 701,188,146.86
6% Heap Unclassified windows10-64-2004-shippable-qr tp6 81,840,163.62 -> 77,062,944.45
... ... ... ... ...
3% Heap Unclassified macosx1015-64-shippable-qr fission tp6 140,228,940.02 -> 135,557,908.64

For up to date results, see: https://treeherder.mozilla.org/perfherder/alerts?id=31372

I'm going to continue looking into this. In particular, I looked at the before/after profiles, and not only does the after (with patch) profile show too few GCs, but the before profile shows more GCs than seems necessary. I might be wrong, though; it depends on why they're running. They appear to be running because we CC'd, so I'm working on a patch to track and report the various CC reasons, to try to get a better picture of what's happening.

(Cancelling the needinfo? pbone because it was for the backout that is now complete.)

Flags: needinfo?(pbone)

Changing severity to S4 because this is not blocking anybody from using Firefox.

Severity: -- → S4
Priority: -- → P1

(In reply to Cristian Tuns from comment #6)

> Backed out for causing Bug 1729890

== Change summary for alert #31374 (as of Sun, 19 Sep 2021 20:27:04 GMT) ==

Improvements:

Ratio Test Platform Options Absolute values (old vs new)
30% facebook ContentfulSpeedIndex android-hw-g5-7-0-arm7-shippable-qr warm webrender 899.83 -> 628.25
12% instagram LastVisualChange android-hw-g5-7-0-arm7-shippable-qr warm webrender 506.00 -> 444.67
12% instagram fnbpaint android-hw-g5-7-0-arm7-shippable-qr warm webrender 465.25 -> 411.75
11% instagram fcp android-hw-g5-7-0-arm7-shippable-qr warm webrender 458.29 -> 406.58
11% instagram loadtime android-hw-g5-7-0-arm7-shippable-qr warm webrender 454.71 -> 404.54
... ... ... ... ...
2% amazon-search loadtime android-hw-g5-7-0-arm7-shippable-qr warm webrender 2,024.67 -> 1,982.00

For up to date results, see: https://treeherder.mozilla.org/perfherder/alerts?id=31374

sfink and I swapped notes about this and basically reached the same conclusion: when the USER_INACTIVE GC was a zonal GC, the scheduler would always follow it with a delayed full GC. That full GC would then trigger a CC, which would trigger a delayed CC_WAITING GC; that GC was zonal and would trigger another full GC. This cycle kept memory low but was perhaps a bit too aggressive. When the patch here made the USER_INACTIVE GC a full GC, it would not set up the CC (or the CC wouldn't schedule the CC_WAITING GC), so the cycle never got established and memory usage increased.

sfink and I also decided that he should take this bug and I'll focus on some other regressions.

Assignee: pbone → sphink

There's an r+ patch that didn't land, and there has been no activity in this bug for 2 weeks.
:sfink, could you have a look please?
For more information, please visit auto_nag documentation.

Flags: needinfo?(sphink)
Flags: needinfo?(jcoppeard)
Flags: needinfo?(jcoppeard)

Note to self: the CC_WAITING reason that pbone referred to in his excellent comment 14 summary is the one that I renamed to CC_FINISHED.
