Closed Bug 1366877 (BrowserChrome_GC) Opened 7 years ago Closed 5 years ago

[Meta] Quantum Release Criteria: Figure out why Browser chrome CC/GC pauses longer than 150 ms

Categories

(Core :: JavaScript: GC, enhancement, P5)


Tracking


RESOLVED INCOMPLETE
Performance Impact ?

People

(Reporter: jgong, Unassigned)

References

(Blocks 1 open bug)

Details

(Keywords: meta, perf)

User Story

We need to profile and find out why we are slow on Nightly, with browser chrome CC/GC pauses longer than 150 ms.
Blocks: QuantumFlow
Whiteboard: [qf:investigate][qf:meta]
Benjamin, do you have any ideas about how to deal with page faults here? If the user's machine is swapping a lot, it's hard to make GC fast. Telemetry suggests that page faults are a big reason why GCs take a long time. But since Windows doesn't have an API that allows us to distinguish between hard and soft page faults, it's hard to even know.
Flags: needinfo?(benjamin)
Oh, we have telemetry about page faults?

One thing which would hopefully help is smaller chrome zone(s): in other words, several chrome zones, and then doing more zone-level GCs and fewer full GCs. The parent process currently does more full GCs than child processes (partially because it is hard to get non-full GCs to work well enough in the parent process, and partially because it doesn't matter too much whether we do a zone or full GC in the parent, since it ends up being all-system-zone anyhow).

These kinds of bugs need some links showing the data.
(In reply to Olli Pettay [:smaug] from comment #2)
> Oh, we have telemetry about page faults?

We have pretty extensive telemetry on GC. You can see examples in bug 1314828.

> One thing which would hopefully help is smaller chrome zone(s).

But what would the zones be? Most people only have one XUL window open. And with bug 1186409, we won't be able to split JSMs apart at all. Even without that bug, I'm not sure how we would split them.
One zone for each top-level ChromeWindow, and one for JSMs and other JS components.
I wasn't aware we had telemetry about page faults, but that sounds really interesting. How does the OS expose that data to us? What is the difference between a hard and a soft page fault?

I'm going to ask a rather uninformed set of questions which are a combination of short-term/quantum and long-term architecture.

* how long is a single page fault? Does a single page fault threaten our time budget (150ms now, eventually 16ms) all by itself?
* what can we do to reduce the size/increase the locality of GC heaps?
* Do GC heaps take up entire pages now?
* Can we tell when things are getting paged out and change our behavior somehow?
* Is it possible/likely that we're leaking memory causing infinite heap growth and therefore paging, or is this memory usage real/normal?

Happy to chat about ways telemetry or other experiments could help narrow things down if that would be helpful.
Flags: needinfo?(benjamin)
(In reply to Benjamin Smedberg [:bsmedberg] from comment #5)
> I wasn't aware we had telemetry about page faults, but that sounds really
> interesting.

See http://gecko.readthedocs.io/en/latest/toolkit/components/telemetry/telemetry/data/main-ping.html#gc

> How does the OS expose that data to us? What is the difference
> between a hard and a soft page fault?

GetProcessMemoryInfo on Windows and getrusage on POSIX. They tell you the total page faults since startup, so you can do a subtraction to find the number of page faults during a time interval.

I think a soft page fault is counted anytime the OS needs to update a page table mapping. For example, if a page is mapped COW and you write to it, I expect that would count as a page fault. A hard fault is when you have to read a page from disk.

Wikipedia has a little background, although it calls them minor/major instead of soft/hard:
https://en.wikipedia.org/wiki/Page_fault
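
For concreteness, here's roughly how those counters can be read and subtracted around a slice. This is a plain sketch of the OS APIs, not the actual Gecko wrapper, and the function name is made up:

  #include <cstdint>
  #if defined(_WIN32)
  #  include <windows.h>
  #  include <psapi.h>  // link against psapi.lib
  #else
  #  include <sys/resource.h>
  #endif

  // Cumulative page-fault count for this process since startup. Windows only
  // reports a combined count; POSIX splits it into ru_minflt (soft) and
  // ru_majflt (hard), which we sum here to match the Windows number.
  static uint64_t GetPageFaultCount() {
  #if defined(_WIN32)
    PROCESS_MEMORY_COUNTERS pmc;
    if (GetProcessMemoryInfo(GetCurrentProcess(), &pmc, sizeof(pmc))) {
      return pmc.PageFaultCount;
    }
    return 0;
  #else
    struct rusage usage;
    if (getrusage(RUSAGE_SELF, &usage) == 0) {
      return uint64_t(usage.ru_minflt) + uint64_t(usage.ru_majflt);
    }
    return 0;
  #endif
  }

  // Usage: sample before and after a GC slice and subtract.
  //   uint64_t before = GetPageFaultCount();
  //   ... run the slice ...
  //   uint64_t faultsDuringSlice = GetPageFaultCount() - before;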

> * how long is a single page fault? Does a single page fault threaten our
> time budget (150ms now, eventually 16ms) all by itself?

Soft page faults are fast: something like 1000 to 2000 cycles. Hard faults essentially cost as much as a disk access. So something on the order of 10ms per fault for spinning disks and 150 microseconds for SSDs. We can get thousands of page faults per GC slice (although we don't know if they're hard or soft), so it's possible they have a substantial effect on the GC time.
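
Back-of-the-envelope, using those numbers: 2000 soft faults at ~2000 cycles each is a millisecond or two, which is negligible. But 2000 hard faults at ~150 microseconds each (SSD) is already ~300 ms, and on a spinning disk even ~15 faults at ~10 ms apiece would eat the whole 150 ms budget.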

> * what can we do to reduce the size/increase the locality of GC heaps?

We've done a lot of work to reduce the size of GC heaps. There's probably very little low-hanging fruit left there.

Locality doesn't really matter much, since a GC has to touch all live data. I think the best we could do to avoid these page faults is to GC the infrequently used tabs much less often. We have some work going on in that direction for Quantum.

> * Do GC heaps take up entire pages now?

The GC reserves 1MB chunks, which are used exclusively for GCed memory.
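
(With 4KB pages, that's 256 pages per chunk, so walking even a single paged-out chunk can mean a couple hundred hard faults.)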

> * Can we tell when things are getting paged out and change our behavior
> somehow?

I guess we could ask the user to close some tabs. As I said, there isn't a great way on Windows to tell if a page fault is hard or soft. Although now I wonder if we could somehow measure how much I/O Firefox has done during a GC.

> * Is it possible/likely that we're leaking memory causing infinite heap
> growth and therefore paging, or is this memory usage real/normal?

I guess we could try to correlate page faults with heap size, and see if the people with lots of page faults have huge heaps.

> Happy to chat about ways telemetry or other experiments could help narrow
> things down if that would be helpful.

I guess I was hoping you might have some special Windows knowledge that would help here. Although maybe we have enough Mac/Linux users to try to get a sense of what percentage of page faults are hard versus soft. I'm just not sure how valid that data would be.
(In reply to Benjamin Smedberg [:bsmedberg] from comment #5)
> I wasn't aware we had telemetry about page faults, but that sounds really
> interesting. How does the OS expose that data to us? What is the difference
> between a hard and a soft page fault?

Windows has an API for retrieving a count of page faults, but it does not distinguish between hard and soft. Other OSes generally do. It looks like you can do some extra work to tell them apart on Windows, but I don't know whether that's available to user programs: https://blogs.technet.microsoft.com/askperf/2008/06/10/the-basics-of-page-faults/

A page fault happens when you access a memory address that is not currently mapped into your virtual address space. Hard page faults require I/O; soft page faults do not.

A hard fault is when the page must be read off of a swap file or a memory mapped file. A soft fault is when you grab a page that is already in memory and twiddle the MMU state to include it in the virtual address space. For example, you allocate memory and then touch it, and you have some unused physical memory, so the OS maps one of those pages into your address space.

> I'm going to ask a rather uninformed set of questions which are a
> combination of short-term/quantum and long-term architecture.
> 
> * how long is a single page fault? Does a single page fault threaten our
> time budget (150ms now, eventually 16ms) all by itself?

A soft (minor) page fault is no big deal, and I would not expect it to be a problem.

A hard (major) page fault can take a widely varying amount of time. In some situations, it could be seconds (eg if your swap disk needs to spin up). I don't have a good sense for the overall distribution. If you're going to a rotating disk, it's not going to be pretty. If you are on an SSD, or the page is in the disk controller's cache or something, it's probably not going to be too bad.

> * what can we do to reduce the size/increase the locality of GC heaps?

Optimize our chrome JS to use less memory and/or better locality.

But in terms of what we could do internally, we could do more aggressive compaction. Or more aggressive collection. Or somehow take into account the available memory and try harder to stay within physical RAM and avoid swapping, though I don't know of any way to know directly how much memory is "ok" to use. Or do the stuff smaug is suggesting.

> * Do GC heaps take up entire pages now?

Yes. The GC heap takes up many 4KB pages, and will always grab at least a page at a time.

> * Can we tell when things are getting paged out and change our behavior
> somehow?

It would at least be interesting to use paging as a signal to lower memory thresholds and thereby GC more aggressively, in hopes that we'll avoid swapping. (This will also increase GC overhead / decrease throughput and cause more frequent pauses, but many short GC pauses are preferable to long swapping-induced pauses.)
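
A minimal sketch of what that feedback loop might look like (POSIX only; all of these names and the threshold are made up, and the real GC trigger machinery is more involved):

  #include <sys/resource.h>
  #include <algorithm>

  // Hypothetical: lower the heap-growth factor (how much the heap may grow
  // before we trigger a GC) when we observe a burst of hard page faults.
  static long gLastMajorFaults = 0;

  double AdjustHeapGrowthFactor(double growthFactor) {
    struct rusage usage;
    if (getrusage(RUSAGE_SELF, &usage) != 0) {
      return growthFactor;
    }
    long delta = usage.ru_majflt - gLastMajorFaults;
    gLastMajorFaults = usage.ru_majflt;

    // Invented threshold; it would need tuning against real telemetry.
    const long kPagingFaultsPerInterval = 100;
    if (delta > kPagingFaultsPerInterval) {
      // We appear to be paging: GC sooner (more overhead, shorter swap pauses).
      return std::max(1.1, growthFactor * 0.9);
    }
    // No paging pressure: drift back toward the normal growth factor.
    return std::min(3.0, growthFactor * 1.05);
  }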

> * Is it possible/likely that we're leaking memory causing infinite heap
> growth and therefore paging, or is this memory usage real/normal?

It is possible. I'm not sure how likely it is. Perhaps we could add telemetry about memory usage, binned by GC pause times? And/or add memory usage into billm's detailed GC telemetry if it isn't there already.
If you're looking for windows expertise, I'll point you at any of dmajor, ccorcoran, aklotz, and mhowell.

I'm not sure I have many great suggestions here: to the extent possible, the goal would be to map any particular GC > 150ms to a root cause, where the root causes could be:

* Firefox got paged out and is just coming back (transient paging) (P2, scheduling might help or budgeting smaller GC slices?)
* Firefox is using so much memory that it's just paging (permanent paging) (P1, need to do something about this even if it's hard)
* We're not paging, but the heap is too large and we cannot possibly do a GC in 150ms (P1, need to do something about this even if it's hard)
* We're CPU-starved because of other stuff on the system (P3, perhaps ignorable)
* ? other

Perhaps the answer is that we cannot meet our goals without a different kind of GC, perhaps a GC that runs on another thread? Is this a crazy suggestion or just hard?
(In reply to Benjamin Smedberg [:bsmedberg] from comment #8)
> If you're looking for windows expertise, I'll point you at any of dmajor,
> ccorcoran, aklotz, and mhowell.
> 
> I'm not sure I have many great suggestions here: to the extent possible, the
> goal would be to map any particular GC > 150ms to a root cause, where the
> root causes could be:

I think we need to distinguish between problems that the GC can do something about and things that it can't.

> * Firefox got paged out and is just coming back (transient paging) (P2,
> scheduling might help or budgeting smaller GC slices?)

- If you suspended your laptop in the middle of a GC, there's nothing we can do.

- If Firefox was paged out, then there's a question of whether there's anything we want to do about it. If it would have paused anyway when it did anything, even without GC, then there's nothing to be done. (This is the case where the memory being touched is memory the browser needs to run, so the hard page faults would have happened regardless.)

The portion that would be worth doing something about is when we are only pausing because the GC touches memory that wouldn't otherwise be used. The GC scans live memory, but it doesn't have to scan *all* live memory. The best case for this situation (where we're paging Firefox back in) would be to scan only the memory that is going to be used by the Firefox code in the near-ish future (eventually we'll have to scan all of it, but hopefully taking the page faults gradually will keep us below the noticeable thresholds).

Unfortunately, all of these cases are likely to get blamed on the GC, because it's what we're timing.

Subdividing the heap is one way of avoiding long pauses from swapping stuff in. That's exactly what Zones are -- a subdivision of the overall heap into partitions relevant to different parts of the browser. (In a content process, they're associated with tabs.) We sometimes need to scan the entire heap in order to free up stuff that we wouldn't otherwise know is dead. Bug 1285355 is for reducing/removing those full scans.

So as Olli said, one way to improve here is to make more Zones. But it's not obvious how to break down chrome heaps in this way, as billm pointed out in comment 3. Another option would be to subdivide Zones and be able to collect parts of them. We already do some of that by doing generational garbage collection. We might look into schemes where we do more generations or generation-like things. For example, we could gradually move older stuff into arenas holding only very long-lived read-only things, and remember outgoing pointers in a separate data structure to avoid reading those pages. But we haven't really looked into that yet.

Few of these mechanisms will help our core metric here, btw -- they allow postponing the really slow collections, but over the course of a week, you're likely to have to do it at least once. And if I understand correctly, that one collection will put us down for a failure for the whole week. I wonder if we could adjust the metric: CC/GC pauses are irrelevant, no matter how long they take, if no updates are required and no user/network input is seen. It's harder to measure, but really *noticeable* pauses are the only thing we should care about. (Handwaving: after a GC pause, record a telemetry value only if the event loop is empty, after refilling it from whatever IPC or OS event polling we do.)

> * Firefox is using so much memory that it's just paging (permanent paging)
> (P1, need to do something about this even if it's hard)

Again, we need to subdivide this case. If we have a memory leak and are going to OOM, then any strenuous attempts the GC makes to improve things will *hurt* our telemetry-measured score. In terms of both the metric and the user's experience, probably the best thing to do would be to stop GCing completely and OOM as soon as possible. (Well, for the user, it would be best to make sure to save the sessionstore first.)

If it's high but nonfatal memory usage, then forcing more aggressive nonincremental GCs will make our metric worse, but probably be best for the user experience. (If we can collect enough to stop paging, things will be better. The pauses during heavy swapping are worse than taking a single longer pause to GC and then running smoothly, assuming we can collect enough for that.)

> * We're not paging, but the heap is too large and we cannot possibly do a GC
> in 150ms (P1, need to do something about this even if it's hard)

This is a matter of making the GC more incremental. Much of it can be broken into smaller pieces regardless of the overall heap size, but some parts cannot, and we are working on those. Bug 1167452, bug 1323083, bug 1367099, and some recently closed bugs are part of this effort.

> * We're CPU-starved because of other stuff on the system (P3, perhaps
> ignorable)
> * ? other
> 
> Perhaps the answer is that we cannot meet our goals without a different kind
> of GC, perhaps a GC that runs on another thread? Is this a crazy suggestion
> or just hard?

Concurrent GC is bug 1232802. It is just hard. It also doesn't magically fix everything; if you're swapping, then a concurrent GC will still be paging in memory and causing the main thread to wait on more page faults. It is out of scope for 57, but much of the work of making the GC more incremental is necessary for concurrent GC as well. (Specifically, the way to make things incremental is to barrier accesses to GC things, and concurrent GC requires barriers on a superset of these accesses. The concurrent GC just does different things in those barriers.)
Alias: QRC_BrowserChrome_GC
User Story: (updated)
Summary: [Meta] Figure out why Browser chrome CC/GC pauses longer than 150 ms → [Meta] Quantum Release Criteria: Figure out why Browser chrome CC/GC pauses longer than 150 ms
Blocks: Quantum
Keywords: meta
No longer blocks: QuantumFlow
Alias: QRC_BrowserChrome_GC → BrowserChrome_GC
Blocks: QRC_FX57
No longer blocks: Quantum
Keywords: perf
Priority: -- → P5
Status: NEW → RESOLVED
Closed: 5 years ago
Resolution: --- → INCOMPLETE
Performance Impact: --- → ?
Whiteboard: [qf:investigate][qf:meta]