Closed Bug 1314828 Opened 8 years ago Closed 3 years ago

Analyze GC telemetry data

Categories: Core :: JavaScript: GC, defect, P3
Status: RESOLVED FIXED
People: Reporter: billm, Assigned: billm
Keywords: triage-deferred
My first attempt to analyze this data is as follows:
1. Look at some random users and see what the most common problems are.
2. Write a function to automatically group GCs into buckets based on what appears to be slowest about that GC. The buckets I'm using are:
PageFaults: there are more than 5000 faults in the slowest slice
COMPARTMENT_REVIVED: first slice reason is COMPARTMENT_REVIVED
CC_FORCED1: only one slice, and the reason is CC_FORCED
CC_FORCED+: more than one slice, and the last slice reason is CC_FORCED
KeepAtomsSet: nonincremental_reason is KeepAtomsSet
GCBytesTrigger: nonincremental_reason is GCBytesTrigger
MallocBytesTrigger: nonincremental_reason is MallocBytesTrigger
Compact: slowest slice is >=75% in compaction
Sweep: slowest slice is >=75% in sweeping
MinorGCsToEvictNursery: slowest slice is >=75% in minor_gcs_to_evict_nursery
Other: everything else
3. Throw away GCs where the max_pause is >30s. The machine probably went to sleep during the GC or something (I hope).
4. See how many GCs are in each bucket.
5. See how many GCs where the max pause was >= 50ms (or 500ms or 5s) are in each bucket.
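The bucketing in step 2 can be sketched as a classification function. This is only an illustration of the rules listed above; the record layout (`gc['slices']`, per-slice `'pause'`, `'reason'`, `'page_faults'`, and `'phases'` fields, and `gc['nonincremental_reason']`) is an assumption, not the actual telemetry ping schema.

```python
def classify_gc(gc):
    """Assign a GC record to one bucket, checking the rules in the order
    listed above. Field names are illustrative, not the real schema."""
    slices = gc['slices']
    # The "slowest slice" is the one with the longest pause.
    slowest = max(slices, key=lambda s: s['pause'])
    if slowest['page_faults'] > 5000:
        return 'PageFaults'
    if slices[0]['reason'] == 'COMPARTMENT_REVIVED':
        return 'COMPARTMENT_REVIVED'
    if len(slices) == 1 and slices[0]['reason'] == 'CC_FORCED':
        return 'CC_FORCED1'
    if len(slices) > 1 and slices[-1]['reason'] == 'CC_FORCED':
        return 'CC_FORCED+'
    nonincremental = gc.get('nonincremental_reason')
    if nonincremental in ('KeepAtomsSet', 'GCBytesTrigger', 'MallocBytesTrigger'):
        return nonincremental
    # Phase-dominated buckets: >=75% of the slowest slice's pause
    # spent in a single phase.
    for phase, bucket in (('compact', 'Compact'),
                          ('sweep', 'Sweep'),
                          ('minor_gcs_to_evict_nursery', 'MinorGCsToEvictNursery')):
        if slowest['phases'].get(phase, 0) >= 0.75 * slowest['pause']:
            return bucket
    return 'Other'
```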
You can see the results here:
https://gist.github.com/bill-mccloskey/f94c25ad00e851698680586b42399d00
Let's say we are interested in GCs longer than 500ms. Then the sizes of the buckets are as follows:
[('Compact', 153),
('GCBytesTrigger', 86),
('PageFaults', 2747),
('KeepAtomsSet', 1258),
('CC_FORCED1', 52),
('MallocBytesTrigger', 97),
('Sweep', 1182),
('COMPARTMENT_REVIVED', 2373),
('MinorGCsToEvictNursery', 17),
('Other', 269),
('CC_FORCED+', 154)]
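Sorting those counts largest-first makes the ranking easier to see; a quick sketch (the list literal is copied verbatim from the output above):

```python
# Bucket counts for GCs longer than 500 ms, quoted from the analysis above.
buckets = [('Compact', 153), ('GCBytesTrigger', 86), ('PageFaults', 2747),
           ('KeepAtomsSet', 1258), ('CC_FORCED1', 52), ('MallocBytesTrigger', 97),
           ('Sweep', 1182), ('COMPARTMENT_REVIVED', 2373),
           ('MinorGCsToEvictNursery', 17), ('Other', 269), ('CC_FORCED+', 154)]

# Rank buckets by count, largest first.
ranked = sorted(buckets, key=lambda b: b[1], reverse=True)
for name, count in ranked:
    print(f'{name:24s}{count:5d}')
```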
From this perspective, I think we can get the most bang for our buck by addressing COMPARTMENT_REVIVED and KeepAtomsSet. These are pretty big buckets and I suspect they would be easier to fix than Sweep or PageFaults (which we can dig into after the others are fixed).
Comment 1 • 8 years ago (Assignee)
I'd like to look at this again now that bug 1318384 has landed.
Flags: needinfo?(wmccloskey)
Comment 2 • 8 years ago (Assignee)
It looks like bug 1318384 was totally successful! Normally it would be a little difficult to make a direct comparison since we get different amounts of data on different days. But the difference is so stark that it doesn't really matter.
Data from Nov. 1:
Across "worst" GCs reported from the content process (those with the worst max pause), 32399 were caused by COMPARTMENT_REVIVED.
Data from Nov. 25 and 26:
Across "worst" GCs reported from the content process (those with the worst max pause), 11 were caused by COMPARTMENT_REVIVED.
And keep in mind that the total number of GCs recorded in the second data set was much higher (more than twice as high for some reason, maybe because of the US holiday).
Great job Jon! Based on the new data, bug 1213977 is the next easiest target.
https://gist.github.com/bill-mccloskey/2fe74101cb4e807e31c0f4215d3be2b9
Flags: needinfo?(wmccloskey)
Comment 3 • 8 years ago (Assignee)
I ran another analysis now that bug 1213977 is fixed. The keepAtoms stuff totally disappeared. Hopefully we'll be able to see a difference in GC_MAX_PAUSE_MS in the telemetry evolution view. Unfortunately, it seems to be over a week behind, so we'll need to wait.
The two biggest categories are now "PageFaults" and "Sweep". We can try to fix the page faults category by avoiding GCs of infrequently used zones. That way we won't be paging in data that we're probably not going to use soon. This isn't something we can make quick short-term progress on though.
I broke down the Sweep category to try to understand what parts of sweeping are slow. The two big sub-components are "Mark During Sweeping" and "Sweep Miscellaneous". For "Mark During Sweeping", I saw a fairly even mix of "Mark Weak", "Mark Gray", and "Mark Gray and Weak" (although "Mark Gray" was somewhat larger than the other two).
I filed bug 1323078 to break down the "misc" category into finer-grained phases.
Bug 1167452 already covers weak marking.
Gray marking will be tricky. The easier approach is probably bug 1323087. That will mark more objects black and fewer objects gray. I also filed bug 1323083 to incrementalize gray marking. We should implement bug 1323087 first and see how much it helps. If it's not enough, we can try to do bug 1323083.
Comment 4 • 8 years ago (Assignee)
Here's the latest gist:
https://gist.github.com/bill-mccloskey/7d61e025c3c66f5fbfc19067fad941f7
I wish I knew how to make it public, but at least I can see it...
Comment 5 • 8 years ago (Assignee)
I filed bug 1323306 based on some more analysis of weak marking.
Updated • 7 years ago
Keywords: triage-deferred
Priority: -- → P3
Comment 6 • 3 years ago
There's always more work to be done here, but that will continue elsewhere.
Status: NEW → RESOLVED
Closed: 3 years ago
Resolution: --- → FIXED