Closed Bug 1495734 Opened 6 years ago Closed 3 years ago

Long pauses on reloading a page that was using large amounts of memory

Categories

(Core :: JavaScript: GC, enhancement, P3)

Tracking

RESOLVED INVALID
Performance Impact high
Tracking Status
firefox64 --- affected

People

(Reporter: jesup, Unassigned)

References

(Blocks 1 open bug)

Details

(Keywords: perf:pageload)

I left a TTI test sitting on arstechnica overnight with the profiler running. I hit reload, waited for TTI, then captured a profile. I'm seeing GC Slices of up to 7 seconds(!), and a GC Major lasting 12 seconds with a max pause of nearly 3 seconds.

https://perfht.ml/2zLWfY3

This may be a key to what happened: ─2,423.52 MB (99.95%) ── js-non-window/gc-heap/decommitted-arenas (If I had to bet, the 'leak' is allocations by adsafeprotected.com)

This implies the reload caused a large amount of memory to be freed - which is good, but in the process it janked the content process horribly (and likely slowed the reload).

We should check for any minor fubars that might have contributed to this (not scheduling compacting for some reason, perhaps), but likely some more fundamental work will be needed to avoid long pauses like this.

Also, I wouldn't be shocked if in the end we have to make some interactivity vs memory tradeoffs.  Right now almost all such tradeoffs I believe are static; perhaps we should consider making them somewhat dynamic.  Moving parts of GC/CC to run on idle queues is an example, but one could go much further in how GC/CC are scheduled and how they actually do the work; how memory areas and generations work, etc.
The long pauses here are due to grey marking, which is currently non-incremental, and what we think is the cycle collector getting impatient and just giving the GC a really big budget to work with.

So we know about #1 already, it's in our plans to fix.  I don't know enough to say about #2.
There's also just a heck of a lot of marking work in the first GC event. AFAIK the CC can run the GC, then do some CC, then some more GC to try to free everything. I don't know if this can be changed to avoid some of this marking work (if the CC can tell the GC that some of these objects aren't "roots" before the GC begins). But I don't know enough about CC vs GC to say.
Also note that after the "initial" Major GC, there's a 3.3-second hang running ForgetSkippable right in the middle of the load:
https://perfht.ml/2Qor7Do

Virtually all of it is in UnMarkGrayThingsRecursively (no surprise); 21% is directly in UnMarkGrayTracer::onChild.
That's fairly expected behavior when you are freeing that much memory all at once. Maybe we could let the GCs run for much longer when there's work to do, or ramp up the slice times even more. We'd have to somehow detect that this wasn't due to a lot of allocation, or we'd end up falling behind.
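One possible reading of the "detect that this wasn't due to a lot of allocation" condition, as a rough sketch only: compare how fast the page is allocating with how fast the collector is reclaiming, and only let the collection stretch out when the collector is comfortably ahead. The function name, the rate inputs, and the safety factor are all hypothetical; nothing here is taken from the actual GC scheduling code.

```python
def can_extend_collection(alloc_bytes_per_s: float,
                          reclaim_bytes_per_s: float,
                          safety_factor: float = 2.0) -> bool:
    """Guess whether a collection can safely be spread over a longer period.

    Hypothetical heuristic: stretching the collection is only safe when the
    collector reclaims memory at least `safety_factor` times faster than the
    page allocates it. Otherwise the collector would fall behind.
    """
    if alloc_bytes_per_s <= 0:
        # No allocation pressure, so there is nothing to fall behind on.
        return True
    return reclaim_bytes_per_s >= safety_factor * alloc_bytes_per_s
```

In the reload case from this bug, the page is mostly freeing memory rather than allocating it, so a check like this would presumably allow the longer collection.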

With process per origin, hopefully we'd be able to just kill the process without GCing. Though I suppose if you hit 'reload' we'd want to reuse the process, so it doesn't actually help.

> Moving parts of GC/CC to run on idle queues is an example

We already do this, but it doesn't help much when you have that much collector work to do.
> Right now almost all such tradeoffs I believe are static; perhaps we should consider making them somewhat dynamic. 

We do something like this in the CC, where we increase slice times the longer the CC goes on. We still do eventually give up on being incremental and finish it all out. I think the GC may have a similar mechanism.
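As a rough illustration of the ramp-up described above (slice budget grows the longer the collection runs, then the collector gives up on being incremental), here is a minimal sketch. All names and constants are hypothetical and chosen only to show the shape of the idea; they are not the values or the code used by the CC or GC.

```python
import math

# Hypothetical constants, for illustration only; not Firefox's values.
BASE_BUDGET_MS = 5.0        # slice budget at the start of a collection
MAX_BUDGET_MS = 40.0        # largest budget while still incremental
RAMP_DURATION_MS = 2000.0   # elapsed time over which the budget ramps up
GIVE_UP_AFTER_MS = 10000.0  # past this point, finish non-incrementally


def slice_budget_ms(elapsed_ms: float) -> float:
    """Return the time budget for the next slice.

    The budget grows linearly from BASE_BUDGET_MS to MAX_BUDGET_MS as the
    collection gets older. math.inf means "stop slicing and finish now".
    """
    if elapsed_ms >= GIVE_UP_AFTER_MS:
        return math.inf
    fraction = min(max(elapsed_ms, 0.0) / RAMP_DURATION_MS, 1.0)
    return BASE_BUDGET_MS + fraction * (MAX_BUDGET_MS - BASE_BUDGET_MS)
```

The unbounded final budget is the part that matters for this bug: once a scheme like this gives up on slicing, the remaining work runs as one long pause.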
Flags: needinfo?(bugs)
Yes, the GC does have a similar mechanism. Slice time increases the closer we get to forcing the CC.
Flags: needinfo?(bugs)
Note that the profile is created with "Responsiveness" monitoring on (as far as I see), and that disturbs idleness checks pretty badly, so normal GC/CC scheduling won't be happening (bug 1482240).
But I'm sure we'd see pretty bad slices even without that.
Whiteboard: [qf] → [qf:p1:f67]
Whiteboard: [qf:p1:f67] → [qf:p1]
Blocks: GCScheduling
Priority: -- → P3
Whiteboard: [qf:p1] → [qf:p1:pageload]

We collected a new profile https://share.firefox.dev/3xogvuF by keeping the page loaded for about 5 minutes and then doing a reload.

The GC major was around 2 seconds, and it was quite responsive. So we are closing this bug.

Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → INVALID
Performance Impact: --- → P1
Keywords: perf:pageload
Whiteboard: [qf:p1:pageload]