Closed Bug 1689394 Opened 3 years ago Closed 3 years ago

GC_MINOR_US telemetry changes around 21-01-08

Tracking

()

Status:

RESOLVED WONTFIX

Tracking Flags:

Tracking

Status

firefox-esr78

---

unaffected

firefox85

---

unaffected

firefox86

---

wontfix

firefox87

---

wontfix

firefox88

---

wontfix

firefox89

---

fix-optional

People

(Reporter: jonco, Assigned: jonco)

References

(Regression)

Details

(Keywords: regression)

Attachments

(7 files)

Screenshot 2021-01-28 at 15.57.40.png 3 years ago Jon Coppeard (:jonco) 199.56 KB, image/png		Details
Screenshot_2021-01-29 Evolution Dashboard.png 3 years ago Jon Coppeard (:jonco) 143.39 KB, image/png		Details
Bug 1689394 - Make updating tracing context information unconditional on the tracer kind r?sfink 3 years ago Jon Coppeard (:jonco) 48 bytes, text/x-phabricator-request		Details \| Review
Bug 1689394 - Make sure the non-marking path is inlined in TraceEdgeInternal r?sfink 3 years ago Jon Coppeard (:jonco) 48 bytes, text/x-phabricator-request		Details \| Review
Bug 1689394 - Simplify TraceKind definition to reduce code generated by switch statements r?sfink 3 years ago Jon Coppeard (:jonco) 48 bytes, text/x-phabricator-request		Details \| Review
Bug 1689394 - Don't inline error paths in marking r?sfink 3 years ago Jon Coppeard (:jonco) 48 bytes, text/x-phabricator-request		Details \| Review
Bug 1689394 - Increase idle time nursery collection timeout to make it less agressive r?sfink 3 years ago Jon Coppeard (:jonco) 48 bytes, text/x-phabricator-request		Details \| Review

Jon Coppeard (:jonco)

Assignee

Description

•

3 years ago

Attached image Screenshot 2021-01-28 at 15.57.40.png — Details

Bug 1685132 was meant to simplify the implementation of tracing while also increasing performance. Looking at telemetry shows it had mixed results:

https://glam.telemetry.mozilla.org/firefox/probe/gc_minor_us/explore?ref=20210127093943

Long minor GCs improved a lot - e.g. the 75th percentile is down to 791uS from 1330uS. However short collections regressed - e.g. the 25th percentile increased from 77uS to 114uS.

Our aim with these is to keep them bellow one millisecond, so this isn't quite as bad as it looks from the attached graph (note the log scale emphases changes at the low end).

My theory is that edges traced via TraceEdge slowed down (maybe due to making this a virtual dispatch) whereas edges traced internally to the tracer became faster (maybe due to simplifying value handling). That would mean quick collections where root tracing dominates became slower, where long collections where evacuation dominates became faster, explaining the results.

Fortunately there's more we can do here to improve things.

Steve Fink [:sfink] [:s:]

Comment 1

•

3 years ago

That graph still looks like a net positive to me. The percentiles squeezed together some, reducing overall variance. But most importantly, the top percentiles dipped a bit. Which is far more important in my mind than fast collections becoming a little less fast.

Jon Coppeard (:jonco)

Assignee

Comment 2

•

3 years ago

Attached image Screenshot_2021-01-29 Evolution Dashboard.png — Details

The results on TMO are different for the 5th percentile. The change on the 7th January isn't nearly as pronounced (note this has a linear Y axis scale).

BugBot [:suhaib / :marco/ :calixte]

Updated

•

3 years ago

Keywords: regression

BugBot [:suhaib / :marco/ :calixte]

Comment 3

•

3 years ago

Set release status flags based on info from the regressing bug 1685132

status-firefox85: --- → unaffected

status-firefox86: --- → affected

status-firefox87: --- → affected

status-firefox-esr78: --- → unaffected

Jon Coppeard (:jonco)

Assignee

Comment 4

•

3 years ago

(In reply to Steve Fink [:sfink] [:s:] from comment #1)
Agreed, slightly longer fast collections are unlikely to be noticeable. I'd still like to improve this though.

status-firefox85: unaffected → ---

status-firefox86: affected → ---

status-firefox87: affected → ---

status-firefox-esr78: unaffected → ---

Jon Coppeard (:jonco)

Assignee

Comment 5

•

3 years ago

Attached file Bug 1689394 - Make updating tracing context information unconditional on the tracer kind r?sfink — Details

One thing we have to do when tracaing is udpate context information (e.g. edge name) if the tracer is a kind that requires it. It's simpler and more efficient to give all tracers this context and perform an unconditional write to the stack.

Mostly we can get away without saving/restoring context information too. This adds AutoClearTracingContext for the one place we need to do this because of nested use of the same tracer while tracing something else.

Jon Coppeard (:jonco)

Assignee

Comment 6

•

3 years ago

Attached file Bug 1689394 - Make sure the non-marking path is inlined in TraceEdgeInternal r?sfink — Details

Previously this called a separate function in another file to do the virtual dispatch. This moves the DoCallback functions to Marking.cpp and marks them inline.

Depends on D103500

Jon Coppeard (:jonco)

Assignee

Comment 7

•

3 years ago

Attached file Bug 1689394 - Simplify TraceKind definition to reduce code generated by switch statements r?sfink — Details

Currently we arrange for the low thre bits of the TraceKind value be set for all trace kinds greater than seven (see definition of JS::TraceKind). This is to save a branch creating a GCCellPtr in GCCellPtr::checkedCast.

This has the side effect of generating an 80 entry lookup table when we switch on TraceKind when there are only 12 distinct trace kinds. And clang doesn't use a branch in checkCast anyway.

The patch changes TraceKind so that the values increment by one each time. This affects performance by increasing inlining opportunities in the marking code.

Depends on D103501

Jon Coppeard (:jonco)

Assignee

Comment 8

•

3 years ago

Attached file Bug 1689394 - Don't inline error paths in marking r?sfink — Details

Not related to the rest of this bug, I just happened to find it while looking at the generated code. It seems clang inlines a bunch of error paths which should only very rarely be taken. The patch adds hints to stop this from happening.

Depends on D103502

Pulsebot

Comment 9

•

3 years ago

Pushed by jcoppeard@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/5ad5e86d2358
Make updating tracing context information unconditional on the tracer kind r=sfink
https://hg.mozilla.org/integration/autoland/rev/c2aced1a8fad
Make sure the non-marking path is inlined in TraceEdgeInternal r=sfink
https://hg.mozilla.org/integration/autoland/rev/7b413c436848
Simplify TraceKind definition to reduce code generated by switch statements r=sfink
https://hg.mozilla.org/integration/autoland/rev/3cf752580f1b
Don't inline error paths in marking r=sfink

Jon Coppeard (:jonco)

Assignee

Comment 10

•

3 years ago

Let's see what effect these patches have.

Keywords: leave-open

Noemi Erli[:noemi_erli]

Comment 11

•

3 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/5ad5e86d2358
https://hg.mozilla.org/mozilla-central/rev/c2aced1a8fad
https://hg.mozilla.org/mozilla-central/rev/7b413c436848
https://hg.mozilla.org/mozilla-central/rev/3cf752580f1b

Jon Coppeard (:jonco)

Assignee

Updated

•

3 years ago

Priority: -- → P1

Jon Coppeard (:jonco)

Assignee

Updated

•

3 years ago

Severity: -- → N/A

BugBot [:suhaib / :marco/ :calixte]

Comment 12

•

3 years ago

Set release status flags based on info from the regressing bug 1685132

status-firefox85: --- → unaffected

status-firefox86: --- → affected

status-firefox87: --- → affected

status-firefox-esr78: --- → unaffected

Pascal Chevrel:pascalc (PTO until April 26)

Updated

•

3 years ago

status-firefox86: affected → fix-optional

Jon Coppeard (:jonco)

Assignee

Comment 13

•

3 years ago

Well, those changes didn't help.

It's possible that bug 1685128 also had an effect here. We can try making that less aggressive and see what effect that has.

Jon Coppeard (:jonco)

Assignee

Updated

•

3 years ago

Depends on: 1694372

Jon Coppeard (:jonco)

Assignee

Comment 14

•

3 years ago

Attached file Bug 1689394 - Increase idle time nursery collection timeout to make it less agressive r?sfink — Details

Depends on D106126

Steve Fink [:sfink] [:s:]

Comment 15

•

3 years ago

As I commented in the review, this tweak will only make the checks for an underused nursery more frequent. It will not have a direct effect on idle-time minor GCs triggered because the nursery is filling up. I think I got it a little wrong, though; despite the name, Nursery::isUnderused doesn't consider the size of the nursery at all, other than to return false if we're already at the minimum size. timeToCheckWhetherWeCanShrinkTheNursery would be a more accurate description.

I'm a little worried that this parameter change won't even affect the underused check much at all, though. Consider the situation, which is probably quite common, where you do a bunch of setup and interaction and then go idle. While things are active, tasks are running so the microtask queue is frequently getting drained. The last task executed is unlikely to trigger a minor GC due to the underuse check because there was probably a recent check shortly before in an earlier task. Then we go idle, stop running any tasks, and therefore stop checking for underuse entirely. So the underuse check frequency only has an effect while things are active, and never gets consulted when the system is idle. Furthermore, while things are active we're likely to be allocating stuff and so the nursery won't be underused.

If we want to shrink an underused nursery, then we probably want the behavior where after we go idle for a while, we collect the nursery (no matter how full it is), and then periodically notice that it's not being used much and shrink it. If we can't rely on tasks to force the latter checks, we depend on major GC timing to trigger minor GCs that will cause the nursery to shrink. I don't know if we do enough major GCs during idle time to get the behavior we want?

Anyway, I feel I've probably gone a little too far out on the speculation limb without looking at data, so I'll just leave it here for consideration for now.

Steve Fink [:sfink] [:s:]

Comment 16

•

3 years ago

Another thing: increasing this parameter may shift the histogram to make the 5th percentile even more expensive. It will reduce minor GCs, but preferentially those that tend to be very fast (the idle time ones that we're only doing in hopes of being able to shrink the nursery). If you just cut out the fastest collections, your 5th percentile will shift up to include the remaining more expensive ones.

Jon Coppeard (:jonco)

Assignee

Comment 17

•

3 years ago

(In reply to Steve Fink [:sfink] [:s:] from comment #15)

As I commented in the review, this tweak will only make the checks for an underused nursery more frequent. It will not have a direct effect on idle-time minor GCs triggered because the nursery is filling up. I think I got it a little wrong, though; despite the name, Nursery::isUnderused doesn't consider the size of the nursery at all, other than to return false if we're already at the minimum size. timeToCheckWhetherWeCanShrinkTheNursery would be a more accurate description.

It should make them less frequent. I assume this is what you meant. Indeed, it doesn't affect the eager collection of nearly-full nurseries. The naming isn't great but I was struggling to come up with something better.

I'm a little worried that this parameter change won't even affect the underused check much at all, though. Consider the situation, which is probably quite common, where you do a bunch of setup and interaction and then go idle. While things are active, tasks are running so the microtask queue is frequently getting drained. The last task executed is unlikely to trigger a minor GC due to the underuse check because there was probably a recent check shortly before in an earlier task. Then we go idle, stop running any tasks, and therefore stop checking for underuse entirely. So the underuse check frequency only has an effect while things are active, and never gets consulted when the system is idle. Furthermore, while things are active we're likely to be allocating stuff and so the nursery won't be underused.

Yes this is true. So this is mainly going to come into effect when the browser is not idle but has stopped running JS, or the allocation rate has fallen and left as with a large nursery that isn't filling quickly.

If we want to shrink an underused nursery, then we probably want the behavior where after we go idle for a while, we collect the nursery (no matter how full it is), and then periodically notice that it's not being used much and shrink it. If we can't rely on tasks to force the latter checks, we depend on major GC timing to trigger minor GCs that will cause the nursery to shrink. I don't know if we do enough major GCs during idle time to get the behavior we want?

Yes, this would be desirable as well. On workers we do a shrinking GC after they have been idle for a bit. This doesn't happen for the main thread AFAIK.

Another thing: increasing this parameter may shift the histogram to make the 5th percentile even more expensive. It will reduce minor GCs, but preferentially those that tend to be very fast (the idle time ones that we're only doing in hopes of being able to shrink the nursery). If you just cut out the fastest collections, your 5th percentile will shift up to include the remaining more expensive ones.

Yes, that is a possibility. I'd still like to try it and see what happens.

Pulsebot

Comment 18

•

3 years ago

Pushed by jcoppeard@mozilla.com:
https://hg.mozilla.org/integration/autoland/rev/25225e079907
Increase idle time nursery collection timeout to make it less agressive r=sfink

Natalia Csoregi [:nataliaCs]

Comment 19

•

3 years ago

bugherder

https://hg.mozilla.org/mozilla-central/rev/25225e079907

Julien Cristau [:jcristau] (back April 22)

Updated

•

3 years ago

status-firefox86: fix-optional → wontfix

status-firefox87: affected → fix-optional

status-firefox88: --- → affected

Tim Spurway [:tspurway]

Updated

•

3 years ago

Severity: N/A → S3

status-firefox87: fix-optional → wontfix

status-firefox88: affected → wontfix

status-firefox89: --- → fix-optional

Jon Coppeard (:jonco)

Assignee

Updated

•

3 years ago

Regressions: 1703108

Jon Coppeard (:jonco)

Assignee

Comment 20

•

3 years ago

I'm not planning on doing anything more in this bug.

Status: NEW → RESOLVED

Closed: 3 years ago

Resolution: --- → WONTFIX

BugBot [:suhaib / :marco/ :calixte]

Updated

•

3 years ago

Keywords: leave-open

BMO Automation

Updated

•

3 years ago

Has Regression Range: --- → yes

You need to log in before you can comment on or make changes to this bug.