
[perf] Massive performance issues and recent regressions on JS GC

Status: NEW
Assignee: Unassigned

Product: Core
Component: JavaScript: GC
Reported: 3 years ago
Last updated: 2 years ago

People: (Reporter: Mark Straver, Unassigned)

Tracking: (Blocks: 1 bug, {perf})
Version: Trunk
Hardware: x86_64 Windows 7
Points: ---

Firefox Tracking Flags: (Not tracked)

Attachments: (3 attachments)

(Reporter)

Description

3 years ago
Created attachment 8527306 [details]
js_pause-graph.html

User Agent: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:36.0) Gecko/20100101 Firefox/36.0
Build ID: 20141123030204

Steps to reproduce:

I've kept an eye on GC performance using the Mozilla-authored js-pause script (attached) as Firefox progressed. The script was created when incremental GC was first introduced, to demonstrate its advantages.

Over time, GC has become progressively worse. I looked at the performance in the current release (33.1) and, seeing it really struggling, dropped the same script into Nightly today, only to find it performs even worse there than in -release.


Actual results:

The execution of the script is extremely slow for long periods of time, visually struggling to draw in a timely fashion, with accompanying high measured frame drawing times (>100ms per frame, with spikes). Even when not visibly struggling, the graph shows regular (roughly once per second) spikes that should not be there, and that will undoubtedly affect HTML5 game performance.


Expected results:

Execution should be smooth with only minimal increase of drawing times when GC is in progress.
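For readers without the attachment, a probe of this kind can be sketched as follows. This is a hypothetical stand-in, not the attached js_pause-graph.html; all names and sizes are illustrative, and a plain loop stands in for the browser's frame loop:

```javascript
// Hypothetical stand-in for a js-pause-style probe (NOT the attached
// js_pause-graph.html; names and sizes are illustrative). Each "frame"
// allocates short-lived garbage and records how long the frame took;
// GC pauses show up as spikes in the recorded frame times.
function runProbe(frames, garbagePerFrame) {
  const frameTimes = [];
  let previous = null; // keep one frame's garbage alive briefly
  for (let i = 0; i < frames; i++) {
    const start = Date.now();
    const garbage = new Array(garbagePerFrame);
    for (let j = 0; j < garbagePerFrame; j++) {
      garbage[j] = { frame: i, index: j }; // short-lived allocations
    }
    previous = garbage; // the prior frame's array becomes collectable here
    frameTimes.push(Date.now() - start);
  }
  return frameTimes;
}

const times = runProbe(100, 10000);
const worst = Math.max(...times);
console.log(`frames: ${times.length}, worst frame: ${worst}ms`);
```

In a browser, the loop would be driven by requestAnimationFrame so that spikes line up with painted frames; plotting frameTimes over time gives the kind of graph discussed in this bug.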
Confirmed: the timing is mostly bimodal for me, with periods of fairly stable 18ms entries, then periods of fairly stable ~100ms entries. Additionally, I see irregular but fairly frequent spikes of 200+ms.

Looking at GC logs, I see iGC, but with very long slices. Here's an example log entry:

GC(T+3572.1) Max Pause: 85.7ms, Total Time: 2715.9ms, Zones Collected: 1, Total Zones: 21, Total Compartments: 460, Minor GCs: 50, MMU (20ms): 0%, MMU (50ms): 0%, SCC Sweep Total: 0.1ms, SCC Sweep Max Pause: 0.1ms, Allocated: 572MB, +Chunks: 119, -Chunks: 0
    Slice: 0, Pause: 73.0 (When: 0.0ms, Reason: MAYBEGC): Mark: 14.0ms, Mark Roots: 13.8ms
    Slice: 1, Pause: 59.6 (When: 80.4ms, Reason: REFRESH_FRAME): 
    Slice: 2, Pause: 60.6 (When: 145.0ms, Reason: REFRESH_FRAME): 
    Slice: 3, Pause: 59.9 (When: 210.3ms, Reason: REFRESH_FRAME): 
    Slice: 4, Pause: 64.5 (When: 275.2ms, Reason: REFRESH_FRAME): 
    Slice: 5, Pause: 60.4 (When: 345.2ms, Reason: REFRESH_FRAME): 
    Slice: 6, Pause: 61.0 (When: 410.3ms, Reason: REFRESH_FRAME): 
    Slice: 7, Pause: 60.2 (When: 476.1ms, Reason: REFRESH_FRAME): 
    Slice: 8, Pause: 65.1 (When: 541.0ms, Reason: REFRESH_FRAME): 
    Slice: 9, Pause: 60.7 (When: 610.8ms, Reason: REFRESH_FRAME): 
    Slice: 10, Pause: 61.1 (When: 676.4ms, Reason: REFRESH_FRAME): 
    Slice: 11, Pause: 60.6 (When: 742.2ms, Reason: REFRESH_FRAME): 
    Slice: 12, Pause: 61.9 (When: 807.5ms, Reason: REFRESH_FRAME): 
    Slice: 13, Pause: 61.6 (When: 874.3ms, Reason: REFRESH_FRAME): 
    Slice: 14, Pause: 62.4 (When: 940.6ms, Reason: REFRESH_FRAME): 
    Slice: 15, Pause: 60.9 (When: 1007.8ms, Reason: REFRESH_FRAME): 
    Slice: 16, Pause: 61.8 (When: 1074.7ms, Reason: REFRESH_FRAME): 
    Slice: 17, Pause: 61.1 (When: 1141.2ms, Reason: REFRESH_FRAME): 
    Slice: 18, Pause: 62.2 (When: 1206.9ms, Reason: REFRESH_FRAME): 
    Slice: 19, Pause: 62.2 (When: 1273.9ms, Reason: REFRESH_FRAME): 
    Slice: 20, Pause: 61.1 (When: 1340.9ms, Reason: REFRESH_FRAME): 
    Slice: 21, Pause: 62.7 (When: 1406.7ms, Reason: REFRESH_FRAME): 
    Slice: 22, Pause: 61.1 (When: 1474.1ms, Reason: REFRESH_FRAME): 
    Slice: 23, Pause: 10.1 (When: 1538.2ms, Reason: REFRESH_FRAME): Mark: 10.0ms
    Slice: 24, Pause: 62.7 (When: 1552.9ms, Reason: REFRESH_FRAME): 
    Slice: 25, Pause: 61.7 (When: 1620.5ms, Reason: REFRESH_FRAME): 
    Slice: 26, Pause: 62.9 (When: 1686.9ms, Reason: REFRESH_FRAME): 
    Slice: 27, Pause: 62.0 (When: 1754.7ms, Reason: REFRESH_FRAME): 
    Slice: 28, Pause: 63.8 (When: 1821.4ms, Reason: REFRESH_FRAME): 
    Slice: 29, Pause: 62.9 (When: 1890.8ms, Reason: REFRESH_FRAME): 
    Slice: 30, Pause: 10.1 (When: 1956.3ms, Reason: REFRESH_FRAME): Mark: 10.0ms
    Slice: 31, Pause: 63.5 (When: 1971.4ms, Reason: REFRESH_FRAME): 
    Slice: 32, Pause: 10.1 (When: 2036.1ms, Reason: REFRESH_FRAME): Mark: 10.0ms
    Slice: 33, Pause: 62.6 (When: 2051.0ms, Reason: REFRESH_FRAME): 
    Slice: 34, Pause: 64.0 (When: 2118.4ms, Reason: REFRESH_FRAME): 
    Slice: 35, Pause: 62.6 (When: 2186.8ms, Reason: REFRESH_FRAME): 
    Slice: 36, Pause: 63.7 (When: 2254.2ms, Reason: REFRESH_FRAME): 
    Slice: 37, Pause: 62.9 (When: 2413.2ms, Reason: REFRESH_FRAME): 
    Slice: 38, Pause: 10.1 (When: 2476.6ms, Reason: REFRESH_FRAME): Mark: 10.0ms
    Slice: 39, Pause: 63.2 (When: 2495.9ms, Reason: REFRESH_FRAME): 
    Slice: 40, Pause: 62.7 (When: 2563.9ms, Reason: REFRESH_FRAME): 
    Slice: 41, Pause: 64.5 (When: 2631.8ms, Reason: REFRESH_FRAME): 
    Slice: 42, Pause: 85.7 (When: 2703.7ms, Reason: REFRESH_FRAME): 
    Slice: 43, Pause: 63.7 (When: 2794.2ms, Reason: REFRESH_FRAME): 
    Slice: 44, Pause: 62.9 (When: 2866.4ms, Reason: REFRESH_FRAME): 
    Slice: 45, Pause: 10.1 (When: 2993.1ms, Reason: REFRESH_FRAME): Mark: 9.0ms
    Slice: 46, Pause: 10.1 (When: 3015.3ms, Reason: REFRESH_FRAME): Mark: 9.1ms
    Slice: 47, Pause: 10.1 (When: 3064.4ms, Reason: REFRESH_FRAME): Mark: 9.9ms
    Slice: 48, Pause: 10.1 (When: 3080.9ms, Reason: REFRESH_FRAME): Mark: 9.9ms
    Slice: 49, Pause: 10.1 (When: 3100.4ms, Reason: REFRESH_FRAME): Mark: 9.9ms
    Slice: 50, Pause: 10.1 (When: 3113.7ms, Reason: REFRESH_FRAME): Mark: 9.9ms
    Slice: 51, Pause: 10.1 (When: 3126.8ms, Reason: REFRESH_FRAME): Mark: 9.9ms
    Slice: 52, Pause: 10.1 (When: 3139.8ms, Reason: REFRESH_FRAME): Mark: 9.9ms
    Slice: 54, Pause: 2.2 (When: 3169.9ms, Reason: REFRESH_FRAME): Sweep: 1.9ms, Mark During Sweeping: 0.1ms, Finalize Start Callback: 0.3ms, Sweep Compartments: 0.3ms, Sweep type information: 0.1ms, Finalize End Callback: 0.8ms
    Totals: Wait Background Thread: 0.1ms, Mark: 140.8ms, Mark Roots: 13.8ms, Mark Delayed: 2.1ms, Sweep: 1.9ms, Mark During Sweeping: 0.1ms, Finalize Start Callback: 0.3ms, Sweep Compartments: 0.3ms, Sweep type information: 0.1ms, Finalize End Callback: 0.8ms
Status: UNCONFIRMED → NEW
Ever confirmed: true
Flags: needinfo?(terrence)
As a side-effect of bug 1100652, we're no longer triggering minor collections here because of a "full store buffer". Because of that, we're not triggering the pre-tenuring heuristic, so we're constantly ending up with the half-million-element global Array in the whole-cell buffer. This simply takes 40ms to mark: that is the unaccounted time and the reason for the increased latency.

I've wanted to make this particular heuristic more robust for quite a while now; this seems like a good bug to do it in.
Blocks: 1008333
Flags: needinfo?(terrence)
(Reporter)

Comment 3

3 years ago
Considering bug 1100652 is very recent, I think we're looking at multiple regression windows. As I said, GC is problematic in current m-r as well: although less severe than in Nightly, it still shows the spikes and heavy GC lag for a number of seconds. It's more "burst-y" because of an OMTC issue in m-r (and smooths out to roughly the same constant values seen in Nightly if OMTC is disabled), but it is still very much present.
Thankfully, the biggest problem appears to be an artifact of how the test is implemented and should not impact most normal workloads.
(Reporter)

Comment 5

3 years ago
(In reply to Terrence Cole [:terrence] from comment #4)
> Thankfully, the biggest problem appears to be an artifact of how the test is
> implemented and should not impact most normal workloads.

I don't think that's exactly true. The test itself may synthetically cause GC to occur, but it's not uncommon for GC to be triggered very regularly under normal workloads (especially games).

I've made some simple tests (I don't have time to do a mozregression to find out exactly when it started being a problem, but my initial gut feeling is that it's related to generational GC) using the test and different garbage sizes ranging from 256k to 5M. I'll upload the resulting graphs. To avoid the regression introduced by bug 1100652 I've run these tests on m-r (FF 34.0).

Even at the smallest size that hits the high water mark for GC (256k), it shows spiking that shouldn't be there and that introduces GC-caused stuttering. This gets worse with 512k and 1M, none of which seem to trigger any iGC. 2M starts triggering iGC and actually improves the average response time a little, but quickly loses that edge again in the next step up. Of note is that the larger the garbage size, the longer it takes for the iGC to trigger (compare 2M, 3M and 5M) and the longer it takes for this GC to complete.

I'll also upload the result of the default test on a pre-GGC version (I used FF 24 ESR for this as an easy, release-grade reference point) which uses a 6M garbage size (larger than any of the ones I ran through this time) and shows very clean iGC and minimal frame delays on GC.
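The garbage-size sweep described above can be sketched roughly as follows. This is a simplified model, not the attached test page, and the sizes in the loop are scaled-down stand-ins for the actual 256k to 5M runs:

```javascript
// Rough sketch of a garbage-size sweep (illustrative only; the actual runs
// used the attached test page with sizes from 256k up to 5M objects).
function makeGarbage(count) {
  const arr = new Array(count);
  for (let i = 0; i < count; i++) arr[i] = { index: i };
  return arr; // caller drops the reference, so this is immediately collectable
}

// One timed "frame" at a given garbage size; GC work triggered by the
// allocation churn shows up in the elapsed time.
function timedFrame(count) {
  const start = Date.now();
  makeGarbage(count);
  return Date.now() - start;
}

for (const size of [256e3, 512e3, 1e6]) {
  console.log(`${size} objects: ${timedFrame(size)}ms`);
}
```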
(Reporter)

Comment 6

3 years ago
Created attachment 8536156 [details]
Garbage collection comparison graph, FF34
(Reporter)

Comment 7

3 years ago
Created attachment 8536157 [details]
Reference graph before regressions
(In reply to Mark Straver from comment #5)
> (In reply to Terrence Cole [:terrence] from comment #4)
> > Thankfully, the biggest problem appears to be an artifact of how the test is
> > implemented and should not impact most normal workloads.
> 
> I don't think that's exactly true. The test itself may synthetically cause GC
> to occur, but it's not uncommon for GC to be triggered very regularly under
> normal workloads (especially games).

No, I mean the mechanism that the test is using to keep things live does not map well to the mechanisms typically used in games and other workloads that frequently GC. Specifically, this test is slow because it keeps all objects in a single global array including many short lived allocations. Most workloads are not strictly flat and generally have high infant mortality on the stack, not in heap structures. Thus, our poor performance on this test is not likely to indicate poor performance on normal workloads, even workloads that GC frequently.

> I've made some simple tests (I don't have time to do a mozregression to find
> out exactly when it started being a problem, but my initial gut feeling is
> that it's related to generational GC) using the test and different garbage
> sizes ranging from 256k to 5M. I'll upload the resulting graphs. To avoid
> the regression introduced by bug 1100652 I've run these tests on m-r (FF
> 34.0).

No need to do a regression test for this: the slowness is caused by all the cross-generation edges from the single global array to short-lived nursery allocations.
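The distinction drawn here, a single long-lived array retaining everything versus allocations that die young, can be illustrated with a small sketch (a simplified model with illustrative names, not the actual bench code):

```javascript
// Pattern the test uses: every allocation is stored into one long-lived global
// array, so nursery objects stay reachable from tenured memory and each store
// creates a tenured->nursery edge the write barrier must record.
const globalArray = [];
function allocateFlat(n) {
  for (let i = 0; i < n; i++) {
    globalArray[i % n] = { value: i }; // retained: survives the loop iteration
  }
}

// Pattern typical workloads use: most objects die young inside a call and never
// escape to a long-lived structure ("high infant mortality" on the stack).
function allocateStackLived(n) {
  let checksum = 0;
  for (let i = 0; i < n; i++) {
    const tmp = { value: i }; // dead as soon as the iteration ends
    checksum += tmp.value;
  }
  return checksum;
}

allocateFlat(1000);
console.log(allocateStackLived(1000)); // prints 499500
```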
We discussed this at today's GC meeting. What we'd like to do is make insertion of a large object into the whole-cell buffer automatically forward to the slots buffer. I'll try to work on this some today.
Reproduction of this using the gc-ubench with a 30s run, N=8K M=8M shows:

510.523 ms/s : ▃▁▄▁▁▁▁▁ : globalArrayNewObject - var foo = [new Object(), ....]

<50% utilization, with the majority of time going to very long pauses: ouch! The cause appears to be MarkWholeCells taking ~500ms.
Keywords: perf
Retesting this in today's Nightly, the line is much smoother due to Silk, but I still see brief spikes up to 33.33ms every half second or so, with occasional larger (50ms) spikes. Then after a few seconds we seem to 'fall behind' and start to rapidly alternate between 16.67ms and 33.33ms before ending up in a fairly stable section of 33.33ms (still with occasional higher spikes). Then eventually we go back down to 16.67ms for a bit, and the process repeats.

Behavior is similar with javascript.options.mem.gc_dynamic_mark_slice disabled and javascript.options.mem.gc_incremental_slice_ms set to 7.

Terrence, do you know if we're still ending up with that large buffer in the nursery? It sounded like you wanted to address that sooner rather than later.
Flags: needinfo?(terrence)
On reflection, the thing I had in mind is not sufficiently less complex than page barriers to warrant the investment at this time. On the other hand, my store-buffer traversal inlining appears to have brought the whole-object store buffer marking time down from 40ms to 10ms -- still not great, but good enough that this isn't the low-hanging fruit anymore.
Flags: needinfo?(terrence)