Closed Bug 707162 Opened 13 years ago Closed 12 years ago

regression in GC scheduling between Firefox 9 and 10?

Categories

(Core :: JavaScript Engine, defect)

defect
Not set
major

Tracking

()

RESOLVED FIXED
Tracking Status
firefox9 --- unaffected
firefox10 + fixed
firefox11 + fixed

People

(Reporter: mccr8, Assigned: igor)

References

Details

(Keywords: hang, regression, Whiteboard: [Snappy:P1][qa-])

Attachments

(2 files)

On moz.dev.platform [1], bz reported seeing some annoying GC/CC problems after running the browser for a few days:

GC mode: full, timestamp: 0, duration: 1064 ms.
CC timestamp: 6852, collected: 38 (38 waiting for GC), suspected: 173,
duration: 1851 ms.
GC mode: full, timestamp: 21641, duration: 1638 ms.
CC timestamp: 27133, collected: 400 (400 waiting for GC), suspected:
783, duration: 492 ms.
GC mode: full, timestamp: 32826, duration: 1692 ms.
CC timestamp: 38322, collected: 388 (388 waiting for GC), suspected:
578, duration: 494 ms.
GC mode: full, timestamp: 43309, duration: 987 ms.
CC timestamp: 48840, collected: 50 (50 waiting for GC), suspected: 189,
duration: 530 ms. 

The GC runs for a second, then 5 seconds later the CC runs for almost two seconds, then around 15 seconds later there's another pair of GCs and CCs.  Then 5 seconds later there's another pair, etc.

The interesting thing is that he sees this problem in Aurora (10) and Nightly (11), but not Beta (9).  The lengths of the GCs and CCs are about the same; they are just running more often, which makes the long pause times more noticeable.

From this snippet of the error console, the CC is running precisely 5 seconds after the GC, and never hitting 1000 suspected nodes, so the trigger for the CC must be the GC ending in DOMGCCallback.  That code changed slightly in 11 (though it shouldn't have any effect), but is the same in 9 and 10: 5 seconds after a full GC, a CC is triggered.

Is there any difference you can think of in GC scheduling between 9 and 10?

Do we run fewer compartmental GCs in 10?  Maybe more things are GC-allocated, so the GC gets triggered more?

I notice that one way the GC can be triggered after a full GC is that JS_GetGCParameter(cx->runtime, JSGC_UNUSED_CHUNKS) > 0.  Did the behavior of that parameter change between 9 and 10?

[1] http://groups.google.com/group/mozilla.dev.platform/browse_thread/thread/c2c498c48824e1db#
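For concreteness, here is a minimal sketch of the two triggers discussed above: rescheduling a GC while the runtime still holds unused chunks, and otherwise firing the CC on a 5-second timer. All names are placeholders; this is not the actual nsJSEnvironment.cpp code, just the shape of the logic.

// Sketch only: Runtime, GetUnusedChunks, ScheduleGC and ScheduleCC stand
// in for the real browser/JS engine pieces.
struct Runtime { unsigned unusedChunks; };

static const unsigned kDelayMs = 5000;  // the "5 seconds later" interval

static unsigned GetUnusedChunks(Runtime* rt) { return rt->unusedChunks; }
static void ScheduleGC(unsigned delayMs) { (void)delayMs; /* arm a GC timer */ }
static void ScheduleCC(unsigned delayMs) { (void)delayMs; /* arm a CC timer */ }

// Called when a full GC finishes (think of the DOM GC callback).
void OnFullGCEnd(Runtime* rt)
{
    if (GetUnusedChunks(rt) > 0) {
        // Leftover empty chunks: schedule another GC to release them.
        ScheduleGC(kDelayMs);
    } else {
        // No more GC work pending: run the cycle collector 5 seconds later.
        ScheduleCC(kDelayMs);
    }
}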
I think we need to find out why the CC runs more often. Our GC scheduling heuristics do not affect the frequency of the CC.
As far as I can tell, if a full GC finishes and there's no new pending GC, the CC is run 5 seconds later:
http://mxr.mozilla.org/mozilla-central/source/dom/base/nsJSEnvironment.cpp#3407
(In reply to Andrew McCreight [:mccr8] from comment #2)
> As far as I can tell, if a full GC finishes and there's no new pending GC,
> the CC is run 5 seconds later:
> http://mxr.mozilla.org/mozilla-central/source/dom/base/nsJSEnvironment.
> cpp#3407

Right, I completely missed that we schedule the full CC after any full GC.

So this could be a regression from bug 688641. In that bug the empty chunks are created in the background in advance. But this means that when the GC happens, there is a chance of having an empty chunk that was just allocated on the helper thread. So even if the GC itself has not released any empty chunks, we would still have a non-zero JS_GetGCParameter(cx->runtime, JSGC_UNUSED_CHUNKS) and another GC would be scheduled.
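A tiny sketch of that interaction, with made-up names, purely to illustrate the race being described:

// Background pre-allocation (the behavior added in bug 688641) keeps the
// empty-chunk count non-zero even when the GC itself freed nothing.
#include <atomic>

static std::atomic<unsigned> gEmptyChunkCount{0};

// Helper thread: allocate an empty chunk ahead of demand.
void HelperThreadPreallocateChunk()
{
    gEmptyChunkCount.fetch_add(1);
}

// Main thread, in the GC-end callback: this can now be true after almost
// every GC, so a follow-up GC keeps getting scheduled.
bool ShouldScheduleAnotherGC()
{
    return gEmptyChunkCount.load() > 0;
}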
Sounds plausible to me.  bz said he'd run on Nightly for a while until he hits this problem again.  With that log we can see if these calls are from the API or somewhere else.  If bug 706227 were in place, we'd hopefully be able to tell precisely why the GC was being invoked by the browser.
There are a few reports that the website in bug 694883 is smooth on Aurora and janky on Nightly (though one person didn't notice a difference).  Furthermore, Jesse reports that the janks correspond to global GCs.  Putting those two things together, this could be related to this bug.
Blocks: 694883
My last comment is somewhat mixed up, but in this case two wrongs make a right.  The reports that Nightly was bad and Aurora was good were from the end of October, when Nightly was 10 and Aurora was 9.  Jesse's comment is more recent, so it is for Firefox 11.
I'm experiencing several long hangs with Firefox 10 (some days after the Aurora update, not just after); I don't know if they are due to GC scheduling.

This is during normal browsing:
> GC mode: full, timestamp: 1323114586035000, duration: 104 ms.
> GC mode: full, timestamp: 1323114590378000, duration: 99 ms.
> CC timestamp: 1323114591066000, collected: 9378 (9378 waiting for GC), suspected: 1351, duration: 37 ms.
> GC mode: full, timestamp: 1323114595166000, duration: 102 ms.
> CC timestamp: 1323114597705000, collected: 3510 (3510 waiting for GC), suspected: 7909, duration: 43 ms.
> GC mode: full, timestamp: 1323114599417000, duration: 129 ms.
> CC timestamp: 1323114602832000, collected: 15811 (15811 waiting for GC), suspected: 5053, duration: 76 ms.
> GC mode: full, timestamp: 1323114604885000, duration: 107 ms.
> CC timestamp: 1323114609909000, collected: 581 (581 waiting for GC), suspected: 1747, duration: 32 ms.
> GC mode: full, timestamp: 1323114614012000, duration: 103 ms.
> CC timestamp: 1323114618331000, collected: 5077 (5077 waiting for GC), suspected: 1693, duration: 40 ms.
> GC mode: full, timestamp: 1323114621630000, duration: 107 ms.
> CC timestamp: 1323114624451000, collected: 2661 (2661 waiting for GC), suspected: 1032, duration: 36 ms.
> GC mode: full, timestamp: 1323114626822000, duration: 102 ms.
> CC timestamp: 1323114631926000, collected: 894 (894 waiting for GC), suspected: 449, duration: 103 ms.
> GC mode: full, timestamp: 1323114636027000, duration: 102 ms.
> CC timestamp: 1323114641055000, collected: 52 (52 waiting for GC), suspected: 258, duration: 28 ms.
> CC timestamp: 1323114755094000, collected: 0 (52 waiting for GC), suspected: 1255, duration: 104 ms.
> CC timestamp: 1323114876402000, collected: 42 (94 waiting for GC), suspected: 1287, duration: 103 ms.
> CC timestamp: 1323114933759000, collected: 21 (115 waiting for GC), suspected: 1263, duration: 32 ms.

This is when the hangs happen (I was unable to catch some even longer pauses, 3-4 seconds I think, before these because the console was full):
> GC mode: full, timestamp: 1323117541516000, duration: 429 ms.
> CC timestamp: 1323117544203000, collected: 323020 (323020 waiting for GC), suspected: 48981, duration: 1001 ms.
> GC mode: full, timestamp: 1323117546010000, duration: 386 ms.
> CC timestamp: 1323117549835000, collected: 28358 (28358 waiting for GC), suspected: 20346, duration: 634 ms.
> GC mode: full, timestamp: 1323117550318000, duration: 350 ms.
> GC mode: full, timestamp: 1323117554669000, duration: 352 ms.
> CC timestamp: 1323117555377000, collected: 95 (95 waiting for GC), suspected: 1369, duration: 581 ms.
> GC mode: full, timestamp: 1323117559011000, duration: 342 ms.
> CC timestamp: 1323117564799000, collected: 42 (42 waiting for GC), suspected: 316, duration: 787 ms.
> GC mode: full, timestamp: 1323117579424000, duration: 392 ms.
> CC timestamp: 1323117585010000, collected: 78 (78 waiting for GC), suspected: 713, duration: 585 ms.
> CC timestamp: 1323117617634000, collected: 21 (99 waiting for GC), suspected: 1027, duration: 592 ms.
> CC timestamp: 1323117648264000, collected: 2605 (2704 waiting for GC), suspected: 3434, duration: 780 ms.
> GC mode: full, timestamp: 1323117650658000, duration: 355 ms.
> GC mode: full, timestamp: 1323117655547000, duration: 375 ms.
> CC timestamp: 1323117656296000, collected: 30211 (30211 waiting for GC), suspected: 12956, duration: 666 ms.
> GC mode: full, timestamp: 1323117660653000, duration: 357 ms.
> CC timestamp: 1323117661923000, collected: 10064 (10064 waiting for GC), suspected: 1223, duration: 628 ms.
> GC mode: full, timestamp: 1323117666284000, duration: 361 ms.
> CC timestamp: 1323117671885000, collected: 2938 (2938 waiting for GC), suspected: 1589, duration: 600 ms.
> CC timestamp: 1323117681864000, collected: 12512 (12512 waiting for GC), suspected: 1457, duration: 621 ms.
> GC mode: full, timestamp: 1323117686228000, duration: 363 ms.
> CC timestamp: 1323117691847000, collected: 2666 (2666 waiting for GC), suspected: 513, duration: 619 ms.
> GC mode: full, timestamp: 1323117696184000, duration: 404 ms.
That has a bunch of CCs in a row, with a really huge number of suspected nodes, so it is likely a different issue.  File a new bug and CC me, please.
I get pretty similar GC and CC logs to those in comment 1:

GC(T+123003.1) Type:Glob, Total:275.0, Wait:1.0, Mark:212.7, Sweep:57.4, FinObj:5.2, FinStr:0.6, FinScr:1.3, FinShp:7.8, DisCod:4.8, DisAnl:19.2, XPCnct:8.9, Destry:0.4, End:10.3, +Chu:0, -Chu:0, Reason:  API
CC(T+123010.3) collected: 208 (208 waiting for GC), suspected: 270, duration: 350 ms.
GC(T+123029.7) Type:Glob, Total:268.8, Wait:1.2, Mark:210.4, Sweep:54.6, FinObj:3.8, FinStr:0.5, FinScr:1.3, FinShp:7.2, DisCod:4.0, DisAnl:17.2, XPCnct:8.3, Destry:3.4, End:8.5, +Chu:0, -Chu:0, Reason:Maybe
CC(T+123035.0) collected: 0 (0 waiting for GC), suspected: 291, duration: 338 ms.
GC(T+123041.0) Type:Glob, Total:359.5, Wait:1.1, Mark:253.6, Sweep:90.1, FinObj:5.9, FinStr:0.6, FinScr:1.4, FinShp:9.8, DisCod:6.6, DisAnl:33.5, XPCnct:15.2, Destry:0.4, End:27.0, +Chu:0, -Chu:0, Reason:  API
CC(T+123046.4) collected: 1500 (1500 waiting for GC), suspected: 945, duration: 341 ms.
GC(T+123050.7) Type:Glob, Total:262.5, Wait:1.2, Mark:211.2, Sweep:47.4, FinObj:3.9, FinStr:0.5, FinScr:1.2, FinShp:7.5, DisCod:4.1, DisAnl:13.2, XPCnct:8.1, Destry:0.4, End:8.4, +Chu:0, -Chu:0, Reason:  API
CC(T+123056.0) collected: 0 (0 waiting for GC), suspected: 108, duration: 336 ms.
GC(T+123071.3) Type:Glob, Total:268.4, Wait:1.3, Mark:214.3, Sweep:50.3, FinObj:4.1, FinStr:0.6, FinScr:1.3, FinShp:7.6, DisCod:3.9, DisAnl:14.0, XPCnct:8.4, Destry:1.7, End:8.6, +Chu:0, -Chu:0, Reason:Maybe
CC(T+123077.3) collected: 0 (0 waiting for GC), suspected: 189, duration: 985 ms.
GC(T+123091.7) Type:Glob, Total:266.0, Wait:1.2, Mark:210.3, Sweep:52.0, FinObj:3.9, FinStr:0.5, FinScr:1.2, FinShp:7.5, DisCod:4.6, DisAnl:15.4, XPCnct:8.2, Destry:1.8, End:8.3, +Chu:0, -Chu:0, Reason:Maybe
CC(T+123097.0) collected: 0 (0 waiting for GC), suspected: 244, duration: 351 ms.

What jumps out is that every few CCs there's one that takes about three times as long as the others, without a large number of suspected items that might explain it.

This has been happening for at least the last couple of weeks now, probably longer, and it gets worse the longer I leave the browser open.
The long CCs could be bug 702813.
Igor: What do you think is the right solution for this problem?  Increasing the empty_chunks threshold (like I did in my patch in bug 694883), removing that trigger, something else?

With the lights demo, I did notice that the number of empty chunks would spike up to 19 or so after the GC freed a lot of things, so maybe that trigger is still useful.
I think we should try to reduce the "triggering a GC outside of the VM based on VM internals" behavior.
Can we move this check to MaybeGC if we can't get rid of it?
If spiking becomes a problem we can solve this in other ways (limit the number of empty chunks that survive a GC?)
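As a sketch of the "limit the number of empty chunks that survive a GC" idea (assumed names, not an actual SpiderMonkey patch):

#include <cstddef>
#include <vector>

struct Chunk { /* a GC chunk; details omitted */ };

static void ReleaseChunkToOS(Chunk* chunk) { delete chunk; }

// At the end of a GC, keep at most maxSurvivors empty chunks in the pool
// and give the rest back to the OS right away, so the post-GC
// "unused chunks > 0" trigger stops firing after every collection.
void TrimEmptyChunkPool(std::vector<Chunk*>& pool, std::size_t maxSurvivors)
{
    while (pool.size() > maxSurvivors) {
        ReleaseChunkToOS(pool.back());
        pool.pop_back();
    }
}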
(In reply to Andrew McCreight [:mccr8] from comment #12)
> Igor: What do you think is the right solution for this problem?  Increasing
> the empty_chunks threshold (like I did in my patch in bug 694883), removing
> that trigger, something else?

We can try not to run the GC but rather just remove empty chunks a few seconds after the GC. But first we should try Gregor's suggestion and keep just a very few chunks after the GC. Hopefully, with the background chunk allocation, that will not harm the benchmarks. I will try to implement a patch today or tomorrow.
The ro.me demo also does a lot of empty-chunk GCs, even with my hacky patch to set the threshold to 6 empty chunks.  I don't think I saw it do any compartmental GCs.
Igor, any progress here?
Assignee: general → igor
Whiteboard: [Snappy] → [Snappy:P1]
(In reply to Andrew McCreight [:mccr8] from comment #16)
> Igor, any progress here?

I could remove empty chunk pooling without a performance regression on Mac. So I will change the code so that it does not schedule a GC but rather just removes empty chunks 5 seconds later.
Depends on: 702251
Depends on: 711623
Blocks: 711900
Raising severity (only major because it's not a complete freeze) and setting the hang keyword because the Firefox UI stalls while the GC takes place.

Once I have received the example profile, I will start looking for the regression range.
Severity: normal → major
(In reply to Andrew McCreight [:mccr8] from comment #0)
> GC mode: full, timestamp: 0, duration: 1064 ms.
> CC timestamp: 6852, collected: 38 (38 waiting for GC), suspected: 173,
> duration: 1851 ms.
> 
> From this snippet of the error console, the CC is running precisely 5
> seconds after the GC, and never hitting 1000 suspected nodes, so the trigger
> for the CC must be the GC ending in DOMGCCallback.  That code changed

Does that mean that the collected count has to be larger than 1000 for the CC to get triggered? If not, which entry are you referring to?

Also, do we have good documentation on how the GC/CC works and how it can be debugged? For the latter, I'm not able to dump the messages to the console; they only end up in the Error Console, which gets cleared out fairly often.

Since there are some ideas floating around, what would be the most important information QA could help with? We would appreciate any feedback.
Igor, could you explain what's happening here?

The problem you found in bug 711623 is definitely a bug. But when I apply Andrew's patch in bug 706227, most of the GCs I see are to free expired chunks--not from MaybeGC. This is also what Andrew is seeing in bug 694883.

As an easy fix, can't we just boost the unused chunk threshold to 1 in the DOMGCCallback?
(In reply to Bill McCloskey (:billm) from comment #20)
> 
> As an easy fix, can't we just boost the unused chunk threshold to 1 in the
> DOMGCCallback?

Let's indeed try setting the threshold to trigger the GC only if there are at least 5 empty chunks. Later I plan to remove that GC trigger and replace it with a separate call that just releases empty chunks and decommits empty arenas. But those patches will take a few more days.
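In terms of the check from comment 0, the stop-gap amounts to a larger cutoff; a sketch (the value 5 comes from this comment, the surrounding names are placeholders):

// Stop-gap sketch: only reschedule a GC when there are "many" leftover
// empty chunks, instead of whenever the count is non-zero.
static const unsigned kUnusedChunksGCThreshold = 5;

bool ShouldScheduleChunkReleaseGC(unsigned unusedChunks)
{
    return unusedChunks >= kUnusedChunksGCThreshold;
}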
(In reply to Igor Bukanov from comment #21)
> Let's indeed try setting the threshold to trigger the GC only if there are
> at least 5 empty chunks. Later I plan to remove that GC trigger and replace
> it with a separate call that just releases empty chunks and decommits empty
> arenas. But those patches will take a few more days.

I think we should just try to remove that GC trigger, as the MaybeGC heuristic does roughly the same job. That is, both try to schedule a GC when there is some GC memory that can be released. However, MaybeGC should in theory be more precise, as it accounts for decommitted chunks.
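For contrast, a MaybeGC-style check keys off heap usage relative to the per-compartment trigger rather than off the chunk pool. A rough sketch follows; the 3/4 factor and field names are illustrative assumptions, not the real constants:

#include <cstddef>

struct Compartment {
    std::size_t gcBytes;         // bytes currently allocated for GC things
    std::size_t gcTriggerBytes;  // level at which a GC is forced
};

// Collect a bit early when a compartment approaches its trigger, so the
// forced (and more disruptive) GC rarely has to happen.
bool MaybeGCWanted(const Compartment& c)
{
    return c.gcBytes > c.gcTriggerBytes / 4 * 3;
}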
Attached patch: Test patch
billm suggested this patch as a possible easy fix.
I don't know about general browsing, but when I was investigating the Lights demo (in bug 694883), I was regularly seeing 4 or so chunks being active when it did the check, which is why I set the chunk threshold to 6.
(In reply to David Mandelin from comment #23)
> Created attachment 583007 [details] [diff] [review]
> Test patch
> 
> billm suggested this patch as a possible easy fix.

I doubt this would work. The chunks are released on the background thread, while the condition is checked on the main thread after the background thread is started from the GC_END callback. So even for a shrinking GC the check may see a number of empty chunks.
Another (possible) datapoint, using Google's "spinning balls" demo: "Our Opera 11.60 checked in with a score of 46, the latest Firefox 11 nightly build with 48 and the current stable Firefox 8 with 126." http://www.conceivablytech.com/9892/products/chrome-gets-improved-memory-performance-but-firefox-leads

This benchmark measures GC pause times.  http://v8.googlecode.com/svn/branches/bleeding_edge/benchmarks/spinning-balls/index.html
I ran the benchmark myself in a not super scientific way, and I saw these scores: Firefox 8 had 4060 frames and a score of 115, Firefox 9 had 3962 frames and a score of 53, Firefox 10 had 3815 frames and a score of 56.  So, it looks like the framerate is basically the same, but the score is lower somehow.  I don't know what the score means in terms of actual pauses, if anything.
Looking at GC_REASON in Telemetry (which only works in 11+ for now), it looks like there's a 3:1 ratio of GCs being caused by the API to GCs being caused by MaybeGC.  The other reasons hardly show up at all.
Depends on: 712853
We're starting to think that too many unused-chunks GCs are the main problem. I don't actually get that problem, so I can't really test it. But it would be great if someone who is getting too much GC now could check on that. It's probably pretty easy to fix.
Attached patch: wip
The patch separates shrinking of the GC buffers (that is, the release of empty GC chunks and the decommit of empty GC arenas) from the GC itself. With the patch, the DOM GC callback schedules just the shrinking, not a GC, 5 seconds after the last GC that left some available empty chunks.

The patch is based on the patches from bug 702251.
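The shape of that change, as a sketch with hypothetical helper names:

// After a GC that leaves empty chunks, arm a timer that only shrinks the
// GC buffers instead of running a whole new GC. All names are placeholders.
static const unsigned kShrinkDelayMs = 5000;

static void ReleaseEmptyChunks() { /* return empty GC chunks to the OS */ }
static void DecommitEmptyArenas() { /* decommit pages of empty arenas */ }

static void ShrinkGCBuffers()
{
    ReleaseEmptyChunks();
    DecommitEmptyArenas();
}

static void ScheduleTimer(unsigned delayMs, void (*callback)())
{
    (void)delayMs;
    (void)callback;  // arm a one-shot timer (placeholder)
}

// Called from the DOM GC callback when a GC ends with unused chunks.
void OnGCEndWithUnusedChunks()
{
    ScheduleTimer(kShrinkDelayMs, ShrinkGCBuffers);
}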
Depends on: 713916
I filed bug 713916 about separating GC buffer shrinking from the GC itself, leaving this bug for investigating the regression itself.
Boris, are you still seeing this same problem?  A (possible) fix has been in Nightly since 12-30.
I've been using a test build that Olli built on Jan 4, and it's been pretty good so far.  I can try using normal nightlies again just to make sure....
Yes, it would be great to know that the timing fix has actually fixed your issue.  Thanks!
OK, which nightly do you want me to test?  Today's, or Jan 1, or something else?
The current one is fine.  Just something after 12-30.
OK, running the Jan 11 nightly.  Will report back in a few days...
How is this now? It seems we never had a clear characterization of the problem, but from what I hear, there was one, and now it's better. Is that true?
The problem was that after running the browser for a while, the GC and CC were being run more often in 10 than in 9.  They were not taking any longer than they were before.

Our theory was that this was caused by the GCs scheduled to clear out empty chunks, which Igor fixed on Aurora and Beta by disabling the patch that caused this problem, and on Nightly by introducing a new way to free those chunks that didn't require a full GC.

Separately, there have been other people who get really long CCs (1s+) in 10 that they don't get in 9.  That seems to be a different issue.
I haven't managed to get a nightly to stay up for more than 2 days without crashing since comment 39, which is why I haven't provided any data yet...  :(

Two days in, things are typically looking good (GCs every 30-40s).
(In reply to Andrew McCreight [:mccr8] from comment #41)
> Our theory was that this was caused by the GCs scheduled to clear out empty
> chunks, which Igor fixed on Aurora and Beta by disabling the patch that
> caused this problem, and on Nightly by introducing a new way to free those
> chunks that didn't require a full GC.

So this bug can probably be marked fixed, only awaiting verification from bz?

> Separately, there have been other people who get really long CCs (1s+) in 10
> that they don't get in 9.  That seems to be a different issue.

Do we have a separate bug on that? Do we need to work on it per se or will some other work make that go away (I know there's a couple different things going on in GC/CC land these days)?
(In reply to Robert Kaiser (:kairo@mozilla.com) from comment #43)
> (In reply to Andrew McCreight [:mccr8] from comment #41)
> > Our theory was that this was caused by the GCs scheduled to clear out empty
> > chunks, which Igor fixed on Aurora and Beta by disabling the patch that
> > caused this problem, and on Nightly by introducing a new way to free those
> > chunks that didn't require a full GC.
> 
> So this bug can probably be marked fixed, only awaiting verification from bz?

Yes, I think so.

> > Separately, there have been other people who get really long CCs (1s+) in 10
> > that they don't get in 9.  That seems to be a different issue.
> 
> Do we have a separate bug on that? Do we need to work on it per se or will
> some other work make that go away (I know there's a couple different things
> going on in GC/CC land these days)?

There are a few other related bugs I guess I should find and link here.  It is being actively investigated to try to figure out if there is a real problem or not.  We're having daily meetings about it.
(In reply to Andrew McCreight [:mccr8] from comment #44)
> (In reply to Robert Kaiser (:kairo@mozilla.com) from comment #43)
> > So this bug can probably be marked fixed, only awaiting verification from bz?
> 
> Yes, I think so.

Igor, as the assignee, do you agree? If so, please mark it as such. (And yay for us thinking we found the main offender here!)
(In reply to Robert Kaiser (:kairo@mozilla.com) from comment #45)
> Igor, as the assignee, do you agree? 

Marking the bug as fixed based on the reports above.
Status: NEW → RESOLVED
Closed: 12 years ago
Resolution: --- → FIXED
OK.  I've had this build up for 4 days now, and my GC time differences are still sane.  So yes, I think this is fixed.
(In reply to Andrew McCreight [:mccr8] from comment #41)
> The problem was that after running the browser for a while, the GC and CC
> were being run more often in 10 than in 9.  They were not taking any longer
> than they were before.
> 
> Our theory was that this was caused by the GCs scheduled to clear out empty
> chunks, which Igor fixed on Aurora and Beta by disabling the patch that
> caused this problem, and on Nightly by introducing a new way to free those
> chunks that didn't require a full GC.
> 
> Separately, there have been other people who get really long CCs (1s+) in 10
> that they don't get in 9.  That seems to be a different issue.

Based upon this, it appears that we fixed this for FF10/11.
Given that the testcase for this is probably fairly broad, what should QA be looking for with regard to verification?
Whiteboard: [Snappy:P1] → [Snappy:P1][qa?]
I don't think there's anything in particular to do here.
Whiteboard: [Snappy:P1][qa?] → [Snappy:P1][qa-]