Mozilla/5.0 (Macintosh; Intel Mac OS X 10.7; rv:14.0) Gecko/20120502 Firefox/14.0a2
I have already seen this issue some weeks ago and talked to Olli. Since then it has recurred only sporadically. The last time was at the end of last week, when I didn't have time for a deeper analysis. Today it happened again, and I tried to find some steps to reproduce it, but so far I haven't had any luck.
The following information is everything I have at the moment:
* It happens in my daily profile
* Resident memory usage grows up to 2GB within an hour or so
* GC durations go up to around 4s (!) and make the browser unusable
* It happens when I have the following Jenkins pages for Mozmill CI loaded (they are only reachable via the MV VPN):
http://release3.qa.mtv1.mozilla.com:8080 (daily tests)
http://release8-osx-107.qa.mtv1.mozilla.com:8080/ (l10n tests)
At first I thought it was related to auto-refresh being enabled, but that doesn't seem to be the reason. Another assumption is that it happens when new job requests are added via the Jenkins API; that's what I will test next. Otherwise I will try to reduce my profile as much as possible so I have a minimized profile at hand.
In the CC log I sent to Olli there were about 32,544 nsDOMStorageItem objects and over 100,000 nsXPCWrappedJS (nsISupports) objects listed. For the first issue Olli has already filed bug 751529. For the latter no action has been taken yet. I will leave it up to Olli, since he is currently analyzing my CC log.
I hope that I can come up with some better STR soon.
Henrik, while you investigate STR, can you also check to see if:
A) This is any better or worse in Firefox 15 Nightly
B) This has gotten worse over time (for example, Firefox 12, 11, 10, etc)
CC times or GC times are 4s? Though I guess both could be bad given how much memory it is using up.
Created attachment 620738 [details]
It's GC only. But I wanted to add a MemChaser log which can hopefully give some more information about what is getting executed.
There are a number of long CCs (one is 12 seconds!), but the GCs do look a little longer. I'd guess that GC times are long because the page is using an obscene amount of memory. If you could get the contents of about:memory?verbose when you are having this problem, that might help.
To some extent, the long GCs are a symptom of all the xpconnect stuff going on. We're spending ~100ms in root marking, which is very unusual. Another ~90ms is in mark_other. This means that we have to enumerate a bunch of xpconnect gray roots, and we're also spending a lot of time in gray marking.
I'm not sure why the sweeping times are so long. That's been a persistent issue for us, though, so it's known.
Created attachment 622357 [details]
Updated MemChaser log, now with a memory consumption of about 2GB of RAM. The log shows the constant rise of the usage over time.
I think that this leak happens when new jobs are added to the queue. I will have to test this now on my local instance. If I can prove it, I will give detailed instructions on how to reproduce it.
The same effect can also be seen in Safari. I had the Jenkins website open on our master for a couple of days now and Safari claimed about 3.25GB of memory.
For people with VPN access: simply open the following URL and keep the browser open for a couple of days. You will have to catch the time when new job requests get added, i.e. when builds have finished. That's early in the day PT, so you should keep the browser open.
When you have the Jenkins web page open, please make sure to not change any configuration aspects. Thanks.
Olli, would you mind giving an update based on the latest CC log I have sent you? You mentioned about 170,000 XPCVariant objects; is that something we should file as a new bug?
Over to tech evang; given comment 7, this doesn't sound like our bug.
It would be great if pages which created giant heaps didn't cause GC pauses, but we already have bugs on that.
Created attachment 666566 [details]
Detailed MemChaser log
This is a detailed log of MemChaser which I have been meaning to attach for ages. Sorry for the delay. It hopefully contains all the information we need to sort out this issue.
I have had Jenkins open for an hour or so and the memory usage of Firefox rose from 500MB to 1.1GB. Closing the Jenkins tab froze Firefox for about 1 minute. As the log shows, there are CC cycles of about 27s in duration, but also iGCs of 6s.
I wonder if we can do something here beyond leaving it in the evangelism section.
The MemChaser log doesn't give much information.
GC and CC logs are needed.
IIRC you once gave me some logs, but I don't have them on this laptop.
But Justin is right. If the web app leaks, then it leaks, and that will at some point slow down performance in any kind of browser.
We can of course try to make CC and GC faster, but the web app needs to be fixed.
Olli, if you have VPN access to MV please open http://10.250.73.243:8080/ and enable 'auto refresh' at the top right corner. Leave the tab open for a while and you will notice the effect really soon.
The non-closing CC times could probably be improved, as the CC is looking at 200k GC objects and not freeing any.
DOM merging could potentially help the closing CC, as there are 1.2 million ref counted nodes in the graph, but it depends on what exactly they are.
I'm not sure if anything can be done for the GC. The page is just allocating a huge amount of memory and never letting it go. A large percentage of the GCs are nonincremental because the page is allocating so much (and I guess the GC is falling behind).
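To illustrate what such a page-side leak typically looks like, here is a minimal hypothetical sketch (this is NOT the actual Jenkins code; all names are made up): an auto-refreshing dashboard only has to keep one stray reference per refresh for memory to grow without bound, and the GC can never free what is still reachable.

```javascript
// Hypothetical sketch of an auto-refresh leak pattern; names are invented
// and plain objects stand in for detached DOM rows.

const history = [];          // global cache that is never pruned

function makeRow(jobId) {
  // stands in for a detached DOM row plus its event-handler closures
  return { jobId, payload: new Array(1000).fill(jobId) };
}

function refresh(jobs) {
  const rows = jobs.map(makeRow);
  history.push(rows);        // BUG: every refresh retains the old rows too
  return rows;
}

// each auto-refresh tick allocates a fresh set of rows...
for (let tick = 0; tick < 100; tick++) {
  refresh([1, 2, 3]);
}

// ...and none of the earlier generations can ever be collected
console.log(history.length); // 100 generations of rows are still reachable
```

From the browser's point of view all of these objects are legitimately alive (black in CC terms), which matches smaug's earlier observation that the nodes "can't be released".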
Olli, I can get a log for you, if you don't regularly VPN in, as I'll be in the MV office.
Downloadable from Dropbox, there is a GC/CC log of the issue with an iGC of about 200ms and a low CC duration. That should make it easier for Olli to process.
Olli, here is an updated version of the logs with the XUL cache pref removed.
Henrik, which version of FF are you using?
Henrik, we think an addon might be involved. Could you try this again with a fresh profile?
Also, can you try to get minimal steps to reproduce this problem? I tried it myself, but I'm not getting the same kind of objects that are showing up in your CC log.
We're seeing some weird objects that we're not sure where they came from. Our theory is that they are weak map related, because we don't log weak maps very well right now.
Right now this log has been made with Mozilla/5.0 (Macintosh; Intel Mac OS X 10.7; rv:17.0) Gecko/17.0 Firefox/17.0 ID:20120930042009
I can try a fresh profile with memchaser installed only.
Actually, could you turn off MemChaser as well? I'm afraid it might affect things, since it uses Jetpack. Could you just use the CC times in the Error Console instead?
Here are the logs with a fresh profile:
Without MemChaser I'm not able to reproduce this issue. Here are the logs.
I haven't managed to reproduce this using a trunk build.
The CC graph stays really small (<500) almost all the time; only the automatic page reload increases it temporarily, and CC times are between 6ms and 46ms.
> Without memchaser I'm not able to reproduce this issue.
Updating the bug metadata to reflect that this is caused by MemChaser.
This would be a significant issue if the problem turned out to be Jetpack and not simply MemChaser.
Kris says that MemChaser is using version 1.9 of the add-on SDK, which is pretty recent. So yeah, the real question here is whether this is an add-on SDK problem or something MemChaser-specific.
If what we're worried about is memory usage after http://10.250.73.243:8080/ is closed, it looks like it's mostly heap-unclassified:
I'm getting about 60MB growth in heap-unclassified with MemChaser installed, and a bit of growth elsewhere, as opposed to nothing without it installed. I'll try stripping down MemChaser to narrow down the cause when I get a chance.
This might be similar to Bug 740162 where I also had massive amounts of memory being used by jenkins. I didn't have any addons, though.
I have uploaded a 0.5pre XPI of MemChaser which I will test over the next few days to see if it changes the behavior here:
Created attachment 706358 [details]
Looks like we have a lot of orphaned nodes for both the Jenkins instances:
* 97,530,664 B (20.59%) ── orphan-nodes
* 110,193,800 B (23.27%) ── orphan-nodes
I have to say this is with a profile with all extensions disabled.
Created attachment 706361 [details]
cc log from the same state as the about:memory dump.
So this is indeed the Jenkins CI page which is leaking or keeping orphaned elements around. Closing the tab and minimizing the memory a couple of times restores the previous memory usage.
The latest GC/CC logs can be found here:
For everyone interested, here is the log of the conversation between Olli and myself on IRC:
14:29 smaug: whimboo: any chance you could create new GC/CC logs? hopefully immediately when you start to see the leak, so that the logs don't get huge
14:29 smaug will be offline soon, but could perhaps analyze the logs tomorrow
14:30 whimboo: smaug: would your addon help here
14:30 whimboo: ?
14:31 smaug: whimboo: if you know what to look at, then yes
14:31 smaug: you could check if documents are leaked
14:31 smaug: whimboo: it is possible that a gc log is needed too, but my tool is for cc logs only
14:31 whimboo: k
14:31 whimboo: i will check right now
14:31 smaug: my local cc analyzing scripts do just the same things as about:cc
14:32 whimboo: smaug: do you want to put this xpi up somewhere?
14:32 whimboo: i always have trouble to find it
14:34 smaug: whimboo: well, I was going to put it to AMO but that was too difficult
14:34 smaug: perhaps I should get it to tree
14:34 smaug: make it a bit prettier first
14:34 whimboo: that would be great
14:34 smaug: it is very tiny addon
14:35 whimboo: smaug: k. so i will let you have a look at
14:35 whimboo: i will produce the cc now
14:38 whimboo: smaug: cc is about 1MB. that's ok?
14:38 whimboo: gc 314MB
14:38 smaug: 1MB is ok
14:38 smaug: 314GC, huh
14:39 smaug: that is less than tiny
14:39 smaug: er, "less than tiny" doesn't make sense
14:40 whimboo: heh. so i will attach the CC log to the bug
14:41 whimboo: smaug: log is up
14:42 whimboo: smaug: so good thing is with Nightly: 82.10 MB (13.19%) ── heap-unclassified
14:42 whimboo: that's way less than with aurora
14:45 smaug: whimboo: that is not too big cc graph
14:45 smaug: 8000 objects
14:45 smaug: sure, too big, but still...
14:45 smaug: but ok, looking
14:45 whimboo: sweet
14:45 whimboo: if it's too large I think i can give you a smaller one later
14:46 whimboo: given that I know how to produce it
14:46 smaug: so we're leaking all the XBL document
14:46 smaug: whimboo: do you have XUL cache disabled?
14:46 smaug: whimboo: it is small enough
14:47 whimboo: oops. that was a copy of an old profile
14:47 whimboo: yes it was disabled
14:47 whimboo: but should we leak in such a case?
14:47 smaug: whimboo: disabling XUL cache is not supported setup
14:47 smaug: yes, we don't care about that case
14:48 smaug: since everything behaves differently anyway
14:48 smaug: it is development only setup
14:48 whimboo: ok, so let me restart then
14:48 smaug: whimboo: we're talking about nglayout.debug.disable_xul_cache here, right :)
14:49 whimboo: i know :)
14:50 whimboo: smaug: looks like it's starting again
14:50 whimboo: 2.03 MB (01.32%) ── orphan-nodes
14:51 whimboo: smaug: shall i wait until it's a bit higher?
14:52 smaug: whimboo: you could wait a bit...
14:53 whimboo: smaug: k, how long would you prefer? until it reaches about 10MB or so?
14:54 smaug: whimboo: do you have about:cc installed?
14:54 whimboo: smaug: yes
14:54 whimboo: but not sure what to do with it
14:54 smaug: could you run cc and see the size of the graph
14:55 smaug: graph size ~10000 should be enough
14:55 smaug: whimboo: run CC few times to get stable results
14:55 whimboo: smaug: where do i see that?
14:55 smaug: whimboo: in the addon
14:55 whimboo: smaug: sure.. number of objects?
14:56 whimboo: or roots, edges
14:56 smaug: whimboo: number of objects
14:56 whimboo: k, at 353 right now
14:56 smaug: whimboo: I'd expect something like 10000
14:57 smaug: though, the previous log had 8000
14:57 smaug: whimboo: if you get it to 5000
14:57 whimboo: smaug: orphaned nodes are already at 15MB
14:57 smaug: sure
14:57 smaug: JS keeping nodes alive
14:57 whimboo: so you still want to have the 10,000 limit?
14:57 smaug: but the js is black (certainly alive) so those nodes can't be released
14:58 smaug: whimboo: 5000 could be enough
14:58 whimboo: k lets wait when I reach that
14:59 whimboo: smaug: so when I run cc it always drops down to about 200 again
14:59 whimboo: seems I can't get over 500
14:59 smaug: whimboo: ahaa
15:00 smaug: whimboo: so do you actually get leak
15:00 smaug: I mean if you close the tab
15:00 smaug: now that xul cache is enabled
15:01 whimboo: smaug: shall I close the tab and run cc?
15:01 smaug: whimboo: yes, and check what about:memory says
15:02 whimboo: smaug: hm, i think the addon initializes wrongly
15:02 whimboo: when I tried to run 'Run cycle collector' it did nothing
15:02 whimboo: now after closing the tab I get 220000 objects
15:03 smaug: that is very possible
15:03 smaug: whimboo: run cc few times
15:03 whimboo: oh, right
15:03 whimboo: yes. it cleans up everything
15:03 whimboo: back at 200 objects
15:03 smaug: so it was garbage that we collected
15:03 whimboo: and about:memory doesn't list the page anymore
15:04 whimboo: that means it's really the website which is leaking memory
15:04 smaug: and memory usage is back to normal levels?
15:04 whimboo: just a sec. will repeat with the other tab
15:04 smaug: well, "leaking", but yes, the website keeps objects alive
15:05 whimboo: smaug: err, now I'm at 1GB
15:06 whimboo: is that because of your addon?
15:06 whimboo: seems like
15:06 whimboo: 391.25 MB (61.12%) -- top(about:cc, id=28)/active/window(about:cc)
15:07 whimboo: smaug: wohoo. back to normal.. 240MB
15:07 smaug: whimboo: well, about:cc may use a lot of memory if the graph is huge
15:08 whimboo: smaug: so if I can help to analyze which js is holding that please tell me the steps
15:08 smaug: thinking...
15:09 smaug: whimboo: perhaps the best is to get cc and gc logs
15:09 smaug: before they are too huge
15:09 smaug: especially gc log is needed
15:09 smaug: since it is probably page keeping some js alive which ends up keeping nodes alive ...
15:10 whimboo: ok, will give those in a bit
15:11 whimboo: smaug: is 50MB for GC fine?
15:12 smaug: whimboo: uh, that is quite big one
15:12 whimboo: gna. let me restart then
15:13 smaug: whimboo: so, I need to go soon, but could you record this conversation to the bug, so that it is clear that it is the page which is leaking
15:13 whimboo: smaug: i can't get it under 40MB
15:14 smaug: whimboo: put it to some server and I'll download it from there?
15:14 whimboo: sure
15:14 whimboo: thanks
15:14 whimboo: will comment in the bug
15:14 smaug: and put cc log too
15:14 smaug: thanks
15:14 whimboo: sure
Comment on attachment 706361 [details]
This was collected with XUL cache disabled.
Updated URLs for our own pages:
Open one of them and wait until jobs are running. The number of orphaned nodes will increase constantly and reach 10MB after a couple of minutes.
Not sure why I haven't commented on this bug with the location of the CC and GC logs I have uploaded to my people account. Olli, whenever you have time, please have a look at them.
Requesting info-needed from Olli for my latest cc/gc logs for his analysis. I hope you will have a bit of time soon. Thanks.
Olli, can you find a bit of time to have a look?
I got an old version of Firefox (15.0) to crash in js::gc::MarkObjectUnbarried after adding hundreds of job requests to the Jenkins instance. I have now updated Firefox to the latest release and will check again after Easter.
Seems like Kris isn't working on it. Back into the pool.
I got the crash again today with the latest release of Firefox:
As the crash info shows, Firefox crashes due to an OOM. Is there something we can do to prevent this from happening?
Sorry about not dealing with this. Henrik, could you re-test and report whether anything has changed?
Yes, this is still an issue with Aurora. I know that information from Nightly would be better, but I can't deliver that today. If you need it, I can check tomorrow. Otherwise I would really appreciate it if you could test it yourself.
I left our Jenkins page open for half a day with the latest Aurora (26.0a2) and now I have a memory usage of 1.6GB, of which the referenced web page alone uses 650MB! 480MB is in orphaned nodes, which cannot even be released via GC.
I would love it if we could get some tracking on this bug. It has been waiting for months now. :(
1,004.78 MB (63.90%) -- window-objects
│ ├────642.13 MB (40.83%) -- top(http://mm-ci-master.qa.scl3.mozilla.com:8080/, id=83)
│ │ ├──637.28 MB (40.53%) -- active
│ │ │ ├──636.97 MB (40.51%) -- window(http://mm-ci-master.qa.scl3.mozilla.com:8080/)
│ │ │ │ ├──483.43 MB (30.74%) -- dom
│ │ │ │ │ ├──482.45 MB (30.68%) ── orphan-nodes
│ │ │ │ │ └────0.98 MB (00.06%) ++ (4 tiny)
│ │ │ │ ├──145.40 MB (09.25%) -- js-compartment(http://mm-ci-master.qa.scl3.mozilla.com:8080/)
│ │ │ │ │ ├──130.26 MB (08.28%) -- gc-heap
│ │ │ │ │ │ ├──129.73 MB (08.25%) -- objects
│ │ │ │ │ │ │ ├───79.33 MB (05.04%) ── ordinary
│ │ │ │ │ │ │ ├───34.48 MB (02.19%) ── function
│ │ │ │ │ │ │ └───15.92 MB (01.01%) ── dense-array
│ │ │ │ │ │ └────0.54 MB (00.03%) ++ (3 tiny)
│ │ │ │ │ └───15.13 MB (00.96%) ++ (5 tiny)
│ │ │ │ └────8.15 MB (00.52%) ++ (3 tiny)
│ │ │ └────0.31 MB (00.02%) ++ window(about:blank)
│ │ └────4.85 MB (00.31%) ++ js-zone(0x7fbb3c6f4800)
Yeah, I know. I or someone should find time to find the memory leak in Jenkins.
(Of course it would be nice if the Jenkins devs would do it.)
Henrik, have you tried contacting the Jenkins developers? This pretty clearly is a leak in their page.
Sure, I even talked with the lead developer. But sadly they never followed-up on it and on IRC I didn't receive helpful answers.
But let's wait for a moment. We are currently upgrading our local Jenkins instance to the latest LTS (see https://github.com/mozilla/mozmill-ci/pull/260); nearly a year has passed since the last upgrade. Let's see if the new version is better. I will report back in the next couple of days.
So we updated to the latest Jenkins LTS version today in our production environment, but the issue still persists. With that in mind I will try to reproduce this with a local Jenkins installation with plugins disabled; one of those might also be causing this problem.
Not sure where all this unclassified memory comes from, but I suspect it is from this bug. Has anything regressed here in terms of collecting orphaned nodes? I'm not sure what I should do to get more details.
2,202.88 MB (100.0%) -- explicit
├──1,035.84 MB (47.02%) ── heap-unclassified
Orphaned nodes held alive by JS should be tracked in about:memory now.
Andrew, what do you mean by "now"? Which version of Firefox? I'm currently running 28.0a2 and the memory situation is still kinda bad, as shown above.
I'm suggesting that the heap-unclassified probably isn't orphan nodes, but it could be. The only way to figure out what heap-unclassified is, is to run a DMD build.
And orphan nodes get reported -- look for "orphan-nodes" in about:memory. It's possible that some are being missed, but in general it's not immediately obvious that high heap-unclassified implies orphan nodes. As mccr8 says, DMD is the way to analyze this.
We need some DMD analysis here, I think.
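For whoever picks this up, the DMD workflow is roughly the following. This is sketched from memory of the Mozilla docs, so the exact options and file names should be double-checked against the current DMD documentation; the dump filename below is illustrative only.

```shell
# 1. Add DMD support to the build (this line goes into your mozconfig):
#      ac_add_options --enable-dmd

# 2. Run the DMD-enabled build with DMD active:
DMD=1 ./mach run

# 3. Reproduce the memory growth, then in about:memory use
#    "Measure and save..."; with DMD active this also writes DMD
#    dump files (e.g. dmd-<pid>.json.gz) alongside the report.

# 4. Post-process the dump to see which allocation stacks make up
#    heap-unclassified:
./dmd.py dmd-<pid>.json.gz
```

The output attributes heap-unclassified bytes to allocation stacks, which would tell us whether the unclassified memory here is orphan nodes or something else entirely.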
But I could take a look at fresh cc logs - both "concise" and "verbose" (when created using about:memory).
If possible, create logs with a clean profile (no addons).
Someone who can reproduce this needs to do some DMD analysis here.
Early last week we updated our Jenkins installation to the most recent LTS version, so I can now check whether anything has changed. I set up a new profile with only MemChaser installed. I will watch the memory usage throughout the day and report back with logs as requested.
I will also try to get a DMD build working.
Created attachment 8445286 [details]
Ok, so this is still a problem with the latest Jenkins. Memory usage of a fresh profile with only MemChaser installed went up from about 150MB to 1900MB, and Firefox was barely responding at that point. For now I have the MemChaser logs; the other requested information I will provide once Firefox is again using about 1GB of memory.
(In reply to Olli Pettay [:smaug] from comment #56)
> But I could take a look at fresh cc logs - both "concise" and "verbose"
> (when created using about:memory).
Ok, so I saved a concise log:
Sadly all the memory was released right afterward, so for a verbose log I will have to wait again until the memory usage is at a certain level. It looks like it really is a memory leak on that site.
Here is the verbose cc log:
Olli, can you please have a look?
I did look at the logs, and that was discussed on IRC.
I still see this problem with current Nightly builds. The important new fact is that even when I close the tab, the memory doesn't go down but rises from about 2GB to 5GB! Then Firefox sits at 100% CPU load, and memory goes up and down, but the application is no longer usable. It completely hangs.
I have seen this as well.
Could you perhaps upload new GC/CC logs?
We haven't made any progress here, so I doubt anything will happen in the future. On top of that, we are going to stop using Jenkins, most likely within the next 3 months. After that we won't be able to reproduce this behavior anymore, and I'm not even sure we can still reproduce it now.
So I'm closing this bug as incomplete for now.