Closed Bug 725110 — Opened 9 years ago, Closed 6 years ago

Insanely long GC times leading to intermittent hangs of a few seconds

Categories

(Core :: JavaScript Engine, defect)

Platform: x86, macOS
Severity: critical
Tracking

RESOLVED INCOMPLETE

Tracking Status
firefox12 - ---
firefox13 - ---
People

(Reporter: cww, Unassigned)

Details

(Keywords: regression)

Attachments

(4 files)

After about an hour of browsing on http://hg.mozilla.org/mozilla-central/rev/b077059c575a (and earlier nightlies), I'm getting very long GC times leading to hangs of over a second every few minutes.  This is making browsing/working very difficult.

I have no extensions enabled other than about:telemetry.  Attached a graph.  Basically as time goes on, more and more of the GCs take > 1 s until pretty much all of them do and I'm forced to restart.
Version: unspecified → Trunk
I have also noticed that my browser has hung a lot more over the last two days - to the point where I have to kill it. I'm currently on Nightly 2012-02-06.

Attached is the same GC_MS chart from about:telemetry that Cheng attached to his post.
Do you know which Nightly first showed this behavior?
bug 716014 was in Nightlies for a day or two, but was backed out.

(I think I get higher GC times too.)
I definitely have a post-backout build (I'm using tonight's nightly.)

I haven't been using nightlies for a while so I don't know.  I know I had problems with builds from the 5th onwards.
Version: Trunk → unspecified
I pushed some code to tryserver that optionally dumps out the entire GC heap on every GC. The builds should show up here in a few hours. (Hopefully it compiles on MacOS.)
  https://tbpl.mozilla.org/?tree=Try&rev=adb613250d8e

To enable it, you have to turn on javascript.options.mem.gc_dump in about:config. It should create files with names like "gc-dump-xyz.abc.log" in your temp directory.

Eventually I would like to land this on trunk.

Cww, could you try this out? The dumping is really slow, so don't enable it until your GCs are already bad. Also, turning it on will cause the Wait portion of the GC time to explode (since it's accounting for the dumping there.)
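For anyone hunting for those dump files, here is a small sketch for locating the newest one (the "gc-dump-*.log" name pattern is quoted from the comment above; `tempfile.gettempdir()` is only a guess at where your temp directory lives, and the helper name is made up for illustration):

```python
import glob
import os
import tempfile

def newest_gc_dump(tmpdir=None):
    """Return the most recently modified gc-dump-*.log file, or None.

    tmpdir defaults to the platform temp directory, which is where the
    instrumented build is said to write its dumps.
    """
    tmpdir = tmpdir or tempfile.gettempdir()
    dumps = glob.glob(os.path.join(tmpdir, "gc-dump-*.log"))
    return max(dumps, key=os.path.getmtime) if dumps else None

print(newest_gc_dump())  # path of the latest dump, or None if none exist
```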
I don't have specific STR but it happens frequently enough in Aurora (12.0a2 2012-02-04) that I often have another browser running.

How can I get debug info for you?
Hmm, Aurora...
mrz, could you try to find a regression range? Try a Nightly from, say, January 29.
(I'm a bit worried that the stuff I landed on January 30 could have caused this.
Though I don't know why; I and others have used those patches for a few weeks.)
Also, just to clarify, this bug is about long GC times even when the JS GC heap is small. If you have a ton of tabs open, or if about:memory reports a GC heap larger than 1 GB, then you're experiencing a long-standing problem unrelated to this bug report.
Ah, ok. What is a "small" JS GC heap?
(global gc is about 150ms on this machine when js heap is 300+MB)
Telemetry doesn't show any regression, I think.
I got a heap dump from Cww. It happened around the same time as some slow GCs, but we're not 100% sure that it's a slow one. I'm working on improving how the data is gathered.

The main thing that pops out is that there are tons of shapes. Here are the top ten kinds of objects in the heap, along with how many there are:

1118662 shape
 516456 Function
 398806 string
 371084 Object
 175300 Array
 162571 script
 134561 type_object
 122768 base_shape
  89440 Call
  82189 HTMLDivElement

The problem seems to be distributed across all compartments. He has a chrome compartment with 125423 shapes, 36254 base shapes, and 8493 objects. The same trend seems to hold true across most of the other compartments. The Gmail compartment has 136590 shapes, 14595 base shapes, and 42345 objects. There was one compartment where the number of shapes was "only" about 25% more than the number of objects, but it was an outlier.

Brian, can you think of any reason why we would end up with so many shapes, across all compartments? Some sort of fallback mode that could be triggered globally? I'm going to try to look at the individual shapes more closely to see how long the shape chains are and how much sharing there is between objects.
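To put those counts in perspective, here is a throwaway sketch computing the shape-to-object ratios for the two compartments quoted above (the numbers are copied from this comment; the dict layout is purely illustrative, not the dump's actual format):

```python
# Counts quoted from the heap dump summary in this comment.
compartments = {
    "chrome": {"shapes": 125423, "base_shapes": 36254, "objects": 8493},
    "gmail":  {"shapes": 136590, "base_shapes": 14595, "objects": 42345},
}

for name, c in compartments.items():
    ratio = c["shapes"] / c["objects"]
    # chrome comes out near 15 shapes per object; Gmail closer to 3.
    print(f"{name}: {ratio:.1f} shapes per object")
```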
Oops. Brian, see comment above.
(In reply to Bill McCloskey (:billm) from comment #11)
> Brian, can you think of any reason why we would end up with so many shapes,
> across all compartments? Some sort of fallback mode that could be triggered
> globally? I'm going to try to look at the individual shapes more closely to
> see how long the shape chains are and how much sharing there is between
> objects.

No, this seems pretty weird.  How many are dictionary vs. normal shapes?  Is there an addon which could be involved?
Cww doesn't have any addons aside from about:telemetry.
I made a new build that should end up here:
  https://tbpl.mozilla.org/?tree=Try&rev=13182321c748

It now grabs only one dump and then the flag is disabled. You can enable it again to get another dump. It also includes the error console output for that GC inside the dump, so there's no need to get that info.

Also, it saves the flags for shapes, base shapes, and type objects.

Finally, it breaks down some of the GC times in more detail.
Sorry to be dense, but how do I run that version and what do you need me to do to get your data?
(In reply to matthew zeier [:mrz] from comment #16)
> Sorry to be dense, but how do I run that version and what do you need me to
> do to get your data?

Eventually, the builds should show up here:
https://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/wmccloskey@mozilla.com-13182321c748/

Once you've started the browser, wait until the GCs get slow. Then go to about:config and enable the javascript.options.mem.gc_dump flag. When the next GC happens, it will generate a dump file. The dump will be named "gc-dump-xyz.abc.log" in your temp directory.

But before that, how many tabs do you have open, and are you running any add-ons? Also, when did the GC start to get slow?
Severity: normal → critical
(In reply to Bill McCloskey (:billm) from comment #17)
> (In reply to matthew zeier [:mrz] from comment #16)
> > Sorry to be dense, but how do I run that version and what do you need me to
> > do to get your data?
> 
> Eventually, the builds should show up here:
> https://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/
> wmccloskey@mozilla.com-13182321c748/

These builds are now available. mrz/Lawrence/cww - please move ahead with testing when you have the chance.
(In reply to Bill McCloskey (:billm) from comment #17)
> Once you've started the browser, wait until the GCs get slow. Then go to
> about:config and enable the javascript.options.mem.gc_dump flag. When the
> next GC happens, it will generate a dump file. The dump will be named
> "gc-dump-xyz.abc.log" in your temp directory.

To clarify,

When this happens, I beachball.  Once I regain control, you want me to do this and let it run until the next beachball and then upload?

Is there data in that dump that I wouldn't want in a public bug?
(In reply to matthew zeier [:mrz] from comment #19)
> When this happens, I beachball.  Once I regain control, you want me to do
> this and let it run until the next beachball and then upload?

Have you enabled the javascript.options.mem.log preference in about:config? That will print every GC and CC in the error console, along with how long they took. Before we do this, it would be helpful if you could say how long your GCs are taking and how many tabs you have open.

If the GC is indeed slow, and it's not just because you have tons of tabs, you can wait until they start to get slow, based on the error console (anything over 500ms is slow if you don't have many tabs open). Then you can turn on the javascript.options.mem.gc_dump preference. The next GC will generate a dump file.

> Is there data in that dump that I wouldn't want in a public bug?

You probably don't want to post it to a bug. At the very least, it shows the tabs you have open.
Here are my last 5 mins of GC after using 2/14 nightly http://hg.mozilla.org/mozilla-central/rev/60edf587f4af

Emailing about:memory.

GC(T+101247.2) Type:Glob, Total:2359.5, Wait:9.1, Mark:1228.7, Sweep:972.6, FinObj:172.9, FinStr:21.5, FinScr:3.9, FinShp:125.4, DisCod:87.4, DisAnl:531.8, XPCnct:8.9, Destry:0.1, End:156.6, +Chu:0, -Chu:2, Reason:CC_WAITING

GC(T+101265.7) Type:Glob, Total:1047.6, Wait:0.5, Mark:439.4, Sweep:605.0, FinObj:12.1, FinStr:1.5, FinScr:4.7, FinShp:17.2, DisCod:20.5, DisAnl:524.1, XPCnct:8.4, Destry:0.1, End:9.6, +Chu:0, -Chu:0, Reason:CC_WAITING

GC(T+101328.9) Type:Glob, Total:2834.0, Wait:5.4, Mark:1331.8, Sweep:1493.7, FinObj:127.8, FinStr:22.5, FinScr:4.1, FinShp:162.1, DisCod:85.9, DisAnl:1066.2, XPCnct:7.5, Destry:0.1, End:9.3, +Chu:2, -Chu:0, Reason:MAYBEGC

GC(T+101338.8) Type:Glob, Total:892.4, Wait:0.6, Mark:445.1, Sweep:444.1, FinObj:9.8, FinStr:1.1, FinScr:3.9, FinShp:15.8, DisCod:11.0, DisAnl:379.9, XPCnct:7.1, Destry:0.1, End:8.5, +Chu:0, -Chu:1, Reason:CC_WAITING

GC(T+101406.0) Type:Glob, Total:1886.7, Wait:0.9, Mark:575.7, Sweep:1306.3, FinObj:26.6, FinStr:5.4, FinScr:4.2, FinShp:21.2, DisCod:114.0, DisAnl:1111.4, XPCnct:7.5, Destry:0.1, End:9.9, +Chu:0, -Chu:0, Reason:SET_NEW_DOCUMENT

GC(T+101415.9) Type:Glob, Total:1168.2, Wait:0.7, Mark:445.5, Sweep:719.0, FinObj:9.7, FinStr:1.2, FinScr:4.1, FinShp:15.4, DisCod:20.7, DisAnl:641.2, XPCnct:7.4, Destry:0.1, End:8.4, +Chu:0, -Chu:0, Reason:CC_WAITING

GC(T+101498.1) Type:Glob, Total:2186.0, Wait:1.0, Mark:979.5, Sweep:1200.7, FinObj:38.9, FinStr:8.6, FinScr:4.1, FinShp:20.3, DisCod:76.7, DisAnl:1018.1, XPCnct:18.2, Destry:0.1, End:21.7, +Chu:2, -Chu:0, Reason:MAYBEGC

GC(T+101505.2) Type:Glob, Total:1238.6, Wait:0.6, Mark:446.7, Sweep:785.7, FinObj:9.6, FinStr:0.9, FinScr:3.9, FinShp:15.1, DisCod:11.0, DisAnl:721.6, XPCnct:7.3, Destry:0.1, End:11.6, +Chu:0, -Chu:2, Reason:CC_WAITING
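For anyone wanting to aggregate logs like these, here is a rough sketch of a parser (the field names — Total, Mark, Sweep, DisAnl, and so on — are taken verbatim from the lines above; the generic key:value split is an assumption about the format, not SpiderMonkey's own parsing code):

```python
import re

# One of the GC log lines pasted above, used as sample input.
LINE = ("GC(T+101247.2) Type:Glob, Total:2359.5, Wait:9.1, Mark:1228.7, "
        "Sweep:972.6, FinObj:172.9, FinStr:21.5, FinScr:3.9, FinShp:125.4, "
        "DisCod:87.4, DisAnl:531.8, XPCnct:8.9, Destry:0.1, End:156.6, "
        "+Chu:0, -Chu:2, Reason:CC_WAITING")

def parse_gc_line(line):
    """Return {field: value} for one GC log line, numeric where possible."""
    fields = {}
    for key, val in re.findall(r"([+\-]?\w+):([\w.]+)", line):
        try:
            fields[key] = float(val)  # phase timings in milliseconds
        except ValueError:
            fields[key] = val         # e.g. Type:Glob, Reason:CC_WAITING
    return fields

rec = parse_gc_line(LINE)
# In these logs DisAnl (discarding analysis data) dominates the sweep phase.
print(rec["Total"], rec["DisAnl"], rec["Reason"])
```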
Any news about this?
I haven't had time to look at this in a while. I did a more careful comparison between Cww's logs and some logs I took from my own browsing. They weren't really that different. Both logs turn out to have lots of shapes, so I no longer think that's the problem.

I also looked at how many objects are actually being collected during the slow collections. Cww's logs seem to have more stuff being collected than mine, about 4x more. Virtually all the objects being collected are from the chrome compartment. I looked at a random sample of them, and many of them seem to be generated by XPConnect when chrome JS needs to access C++ objects like cookies and session history.

However, 4x more objects to collect doesn't seem like enough to explain the increased GC time. My collections were ~50ms, while Cww's were multi-second.

I'd like to get more data here, but I don't have time to do it right now.
Cww, are you still seeing this using Nightlies? Plenty of memory leaks have been fixed since this bug was filed.
Attached image 2/28 GC_MS graph
Still seeing long GCs (this is after ~1 day of browsing?). CC times have been great, however.
What's the status of this bug?

- Is there still a problem?
- If so, what is the problem?
- And is it actionable?
It sounds similar to bug 720512.
Sounds quite different to me. bug 720512 is about CC.
(In reply to David Mandelin from comment #26)
> What's the status of this bug?
> 
> - Is there still a problem?
> - If so, what is the problem?
> - And is it actionable?

:Cww has a reproducible problem where, after a day or so of usage with no add-ons, his GC times grow to over one second. It was still happening as of 2/28. We can try to get more data by creating instrumented browsers and looking over dump files for anomalies, but there's nothing specific that we can do.
(In reply to Olli Pettay [:smaug] from comment #28)
> Sounds quite different to me. bug 720512 is about CC.

Sure, but after the CC improvements have landed, my symptoms are now that the GC times get longer and longer.  There are a half million disconnected DOM nodes leaking, but the CC times are still pretty good.  You just get really long CC times once you close or reload the Gmail tab, until eventually that gets cleared up.
FWIW, still happening to me.  Let me know what you want me to do; I've taken no further action.
When you get in this state, and your CC times are okay, what happens when you close (or just reload) a Gmail tab?  Do you get super crazy CC times for a few minutes?
(In reply to Bill McCloskey (:billm) from comment #29)
> (In reply to David Mandelin from comment #26)
> > What's the status of this bug?
> > 
> > - Is there still a problem?
> > - If so, what is the problem?
> > - And is it actionable?
> 
> :Cww has a reproducible problem where, after a day or so of usage with no
> add-ons, his GC times grow to over one second. It was still happening as of
> 2/28. We can try to get more data by creating instrumented browsers and
> looking over dump files for anomalies, but there's nothing specific that we
> can do.

What do you recommend doing for 12? How much is there a risk that this is a regression that will affect many users? Is there a decent chance of being able to fix Cww's problem for 12?
(In reply to Andrew McCreight [:mccr8] from comment #32):

Yes. My CC times went NUTS (2.7s, 4.8s, 5.7s, 3.2s, 1.8s).  My GC times also went through the roof (usually between 500-1500 ms; they went to 8000 ms).  CC eventually recovered; GC still shows intermittent 1200 ms blips.

Same pattern with Google Voice and iGoogle (although neither was as bad as Gmail).

(In reply to David Mandelin from comment #33):

Here's the thing: right now, I don't think we can trust telemetry data on 11 (and possibly 12) because of server-side issues, and we can't trust telemetry in general to give us a picture of what happens beyond a few minutes after startup because of an issue with the way we upload data.  Both of those need fixing.  Once we do, I think the thing to do is to figure out how many people are like me: how many people on Firefox have a mean GC of 800 ms or higher (mine is 1122 ms right now, for the record).  If it's a significant population, we should probably dive deeper. If it's not, let's just go Office Space on this machine and I'll get a new one.
Sorry for the bugspam:

(In reply to [:Cww] from comment #34):

Just wanted to correct myself.  After shift-reloading Gmail, iGoogle and Google Voice apptabs, I said my CC times went down.  That's partly true.  I've been seeing mostly <50ms CCs but every minute or so there's a 1300-2000ms one.
Getting CC logs would be very useful. Use the allTraces option:
https://wiki.mozilla.org/Performance:Leak_Tools#Cycle_collector_heap_dump
The CC log I got from Cww earlier showed that something is keeping a DOM element alive, and that kept a large-ish CC graph alive.
Based on comment 37, it looks like both the long GC and CC times might be due to something special keeping a big graph of DOM elements alive. This keeps coming up on my tracking list, so it's decision time. Bill, Olli:

 - Do you think this is likely to be a regression that affects many users, meaning we should keep it on the tracking list and start actively investigating, or

 - Do you think it's a rare bad case, meaning I'll take it off the tracking list and you can work on it on a "when you have time" basis?
(In reply to David Mandelin from comment #38)
> Based on comment 37, it looks like both the long GC and CC times might be
> due to something special keeping a big graph of DOM elements alive. This
> keeps coming up on my tracking list, so it's decision time.

Here's what we know so far:

1. Cww is getting really long GCs. However, he has had very serious collector pause times since well before Firefox 12. In that sense, I'm not sure this is definitely a regression for him (although maybe your old CC times were not quite this bad, Cww?).

2. Some other people were experiencing problems. The times that Lawrence Mandel posted are not unexpectedly long given a large heap. There are a few outliers there (~1% of GCs are over 800 ms), but they could be due to swapping. Nobody else has posted any data.

3. Telemetry suggests that collection times may have gone up, but it sounds like the data is still pretty untrustworthy.

I think the reasonable thing to do here is to keep watching it and to devote modest resources to figuring out what's happening.
Some more data gathering:

This attachment is a modified version of the memchaser add-on. To run it, you need to be running a nightly starting with the one that will be generated tonight. (Getting a new one on Monday is probably the easiest thing to do.)

If you install this add-on, it will create a memchaser/ directory inside your profile directory. It puts some log files here. A new one will be created each time you start the browser. Cww, can you try out this add-on and send me the log file when the GC times get big?
Thanks for the update. I'm going to take it off the release drivers' radar for 12, but I'll leave it tracking 13+.
It's gotten better since last week (although I'm still getting occasional long GC times).  I'll send you logs at the end of the day.
(In reply to [:Cww] from comment #42)
> It's gotten better since last week (although I'm still getting occasional
> long GC times).  I'll send you logs at the end of the day

Can we get an update of how this issue is looking on 13? Do we have more usable Telemetry data now on 13 that can help with debugging here?
This bug has gotten pretty quiet, and I haven't heard about it being a problem for other users or in general. I'm declaring it non-actionable and taking it off tracking. We can continue only if we get new information about the problem.
Assignee: general → nobody
Tons of GC/CC changes in the last 3 years. Let's file a new bug if there are issues still being seen.
Status: NEW → RESOLVED
Closed: 6 years ago
Resolution: --- → INCOMPLETE