Last Comment Bug 654028 - 70MB of collectible garbage not cleaned up
: 70MB of collectible garbage not cleaned up
Status: VERIFIED FIXED
[MemShrink:P2] [Fixed in 7 by Bug 656...
: regression, verified-aurora, verified-beta
Product: Core
Classification: Components
Component: JavaScript Engine (show other bugs)
: unspecified
: All All
: -- normal with 3 votes (vote)
: ---
Assigned To: general
:
:
Mentors:
Depends on: 656120
Blocks: 627200 mlk-fx6
  Show dependency treegraph
 
Reported: 2011-05-01 16:45 PDT by Alon Zakai (:azakai)
Modified: 2013-12-27 14:21 PST (History)
27 users (show)
See Also:
Crash Signature:
(edit)
QA Whiteboard:
Iteration: ---
Points: ---
Has Regression Range: ---
Has STR: ---
unaffected
affected
fixed


Attachments
reduced testcase (674 bytes, text/html)
2011-05-03 13:42 PDT, Alon Zakai (:azakai)
no flags Details
patch (764 bytes, patch)
2011-05-09 14:09 PDT, Gregor Wagner [:gwagner]
no flags Details | Diff | Splinter Review
patch (18.38 KB, patch)
2011-05-10 12:54 PDT, Gregor Wagner [:gwagner]
no flags Details | Diff | Splinter Review

Description Alon Zakai (:azakai) 2011-05-01 16:45:15 PDT
STR:

1. Open Firefox with a new profile, no addons or plugins. I tested on Ubuntu 10.10.
2. Visit http://linux.slashdot.org/story/11/04/30/230259/Firefox-On-Linux-Gets-Faster-Builds-mdash-To-Be-Fast-As-Windows
3. Click 'Post a comment'.
6. Type some stuff, for example just fill a line with one character, then another line with another.
7. Memory usage rises, from 40MB to even 100MB or more, quite quickly, in a way that is obviously tied to the typing. The CPU is also maxed out and things get sluggish (after a few lines of typing).

For comparison, on Chrome I get 30% CPU, which also seems abnormal, but memory usage does not rise (it starts at 50MB and stays in the 50's).

A user reported this on Slashdot, and was even kind enough to make a screencast of it: http://www.youtube.com/watch?v=64pX97-THXc

As a Slashdot user myself I must say Slashdot seems sluggish on Firefox, this issue might be a big part of that. (There are also other issues, like bug 648278 - apparently Slashdot doesn't test on Firefox at all?)
Comment 1 Alon Zakai (:azakai) 2011-05-01 16:45:54 PDT
I will do some profiling with valgrind to try to find out what is going on here.
Comment 2 Alon Zakai (:azakai) 2011-05-02 17:39:08 PDT
For some reason, we do *not* do a GC while typing in Slashdot comment fields. The memory increase here is caused by GC-able stuff, and it doesn't get cleaned up.

This doesn't seem to happen on a test webpage I created with a textarea. I will investigate the Slashdot page some more.
Comment 3 Alon Zakai (:azakai) 2011-05-02 18:09:33 PDT
Some more observations:

1. Just scrolling up and down is enough to increase memory by a few MB every few seconds.

2. Doing something like control-shift-J to open the JS error console will trigger a GC. This does not clear away all the memory, but a few seconds later a second set of GCs are done, which finally do clear things back to normal. (I am looking at a printf that shows up each time js_GC() is called.)
Comment 4 Alon Zakai (:azakai) 2011-05-03 13:42:47 PDT
Created attachment 529820 [details]
reduced testcase

Reduced testcase.

The attachment is an HTML file that uses onscroll to create lots of collectable garbage.

We do not do any GCing at all until 70MB of garbage is created (can change N in the testcase to set how fast garbage is created). At that point, memory does not increase further due to GC, however, that 70MB is *not* freed no matter how long you scroll and/or wait.

But, if you open the JS error console (control-shift-J), then 5-10 seconds later all that 70MB will be cleaned up, I guess opening that window triggers a different form of GC?
Comment 5 Alon Zakai (:azakai) 2011-05-03 13:56:23 PDT
Forgot to write proper STR for the testcase:

1. Load page
2. Scroll down by pressing down arrow. Over 70MB of garbage should be created in less than a second.
3. Wait however long you want, and watch that memory usage remains the same, that is, the 70MB is never cleaned up.

Optionally, also do

4. Press control-shift-J to open the JS console.
5. Wait 10 seconds.
6. See the 70MB of garbage cleaned up.
Comment 6 Hugh Nougher [:Hughman] 2011-05-03 22:46:52 PDT
I can confirm the reduced testcase and steps from comment 5 but only in Nightly. Aurora and 4.0.1 work as expected.
Mozilla/5.0 (Windows NT 6.1; WOW64; rv:6.0a1) Gecko/20110503 Firefox/6.0a1

about:memory says that its js/gc-heap that contains the memory.
Manual trigger of GC does the same as opening the JS Console (few seconds before memory drops).
Comment 7 Boris Zbarsky [:bz] (still a bit busy) 2011-05-04 20:01:51 PDT
> but only in Nightly. Aurora and 4.0.1 work as expected.

Hmm.  Can we figure out a regression range here?
Comment 8 Hugh Nougher [:Hughman] 2011-05-07 02:29:35 PDT
Last good nightly: 2011-04-26 First bad nightly: 2011-04-27

Pushlog: http://hg.mozilla.org/mozilla-central/pushloghtml?fromchange=c5e8cc100248&tochange=c833fb1623ca
Comment 9 Hugh Nougher [:Hughman] 2011-05-07 03:37:01 PDT
Tracemonkey for tm to m-c merge in above changesets:
Last good nightly: 2011-04-13 First bad nightly: 2011-04-14

Pushlog: http://hg.mozilla.org/tracemonkey/pushloghtml?fromchange=cd5910633dd8&tochange=dc6ce44bedd4
Comment 10 Boris Zbarsky [:bz] (still a bit busy) 2011-05-07 20:05:58 PDT
Looks likely to be a background finalization problem.  Hugh, thanks for hunting those ranges down!

Gregor, can you take a look?
Comment 11 Gregor Wagner [:gwagner] 2011-05-09 11:00:35 PDT
(In reply to comment #10)
> Looks likely to be a background finalization problem.  Hugh, thanks for
> hunting those ranges down!
> 
> Gregor, can you take a look?

I tried the reduced testcase but after navigating away the JS heap size went back to normal again.
I also tried the slashdot page and saw the incredible increase during typing. 
Once I navigate away the JS heap size decreases again (with a delay of about 40sec). I didn't see the CPU intensive part though. It stays between 5 and 10% for me.

Alon you mention that slashdot is sluggish on your system. What system do you have? (CPU, RAM, OS)

We try to perform GC when we detect idle time but it seems that doesn't work for low-end systems. 
The background finalization patch moves a huge amount of work to a second thread that is executed in parallel to the main thread after the GC. Maybe this affected the idle time detection on low-end systems.
Comment 12 Boris Zbarsky [:bz] (still a bit busy) 2011-05-09 11:05:51 PDT
Gregor, this bug is about the fact that GC doesn't run while the page is just there; navigating away will trigger GC.
Comment 13 Alon Zakai (:azakai) 2011-05-09 11:49:34 PDT
(In reply to comment #11)
> (In reply to comment #10)
> > Looks likely to be a background finalization problem.  Hugh, thanks for
> > hunting those ranges down!
> > 
> > Gregor, can you take a look?
> 
> I tried the reduced testcase but after navigating away the JS heap size went
> back to normal again.

As bz said, the issue here is that, without navigating away, the 70MB is never released (but it used to).

> I didn't see the CPU intensive part though. It stays between 5 and
> 10% for me.
> 
> Alon you mention that slashdot is sluggish on your system. What system do
> you have? (CPU, RAM, OS)
> 

The sluggishness is relative I guess, I originally tested on a slower 2GHz Lenovo laptop, but I also see the main issue here (the 70MB not being reclaimed) on a powerful MacBook, and other people reproduce it as well elsewhere.

(Side issue: Comparing browsers on the same machine, we use significantly more CPU (2x) and RAM (up to 70MB) than Chrome on Slashdot.)
Comment 14 Gregor Wagner [:gwagner] 2011-05-09 14:09:05 PDT
Created attachment 531133 [details] [diff] [review]
patch

Ok I see what the problem is. 
The background patch increased the memory footprint and the trigger mechanism in MaybeGC might be too loose.

We can fix this by changing the trigger for the GC in MaybeGC.
This patch has to be tested but it solves the problem for me.
Comment 15 Alon Zakai (:azakai) 2011-05-09 14:40:28 PDT
The patch does not fix the problem for me with the STR from comment 5.
Comment 16 Gregor Wagner [:gwagner] 2011-05-09 15:08:17 PDT
(In reply to comment #15)
> The patch does not fix the problem for me with the STR from comment 5.

Does it never work or just sometimes? 
What happens is you keep scrolling with about 10sec pauses in between?
For me it releases it maybe 8 out of 10 times after scrolling.
Comment 17 Alon Zakai (:azakai) 2011-05-09 15:18:29 PDT
I tried 3 times, with greater than 10 second pauses, and it didn't work in any of those. Maybe I was just unlucky.
Comment 18 Richard Ayotte 2011-05-09 15:24:55 PDT
Tightening the GC could help but isn't the allocation of 70 MB for the input of a few hundred bytes of text the real problem? The 70MB shouldn't be allocated in the first place. I'm not sure how much memory is needed but intuition tells me that 70MB is way too much, probably by a factor of 1000.
Comment 19 Alon Zakai (:azakai) 2011-05-09 15:43:55 PDT
(In reply to comment #18)
> Tightening the GC could help but isn't the allocation of 70 MB for the input
> of a few hundred bytes of text the real problem?

The text is not related to the 70MB. That 70MB is generated by the JavaScript in the testcase page, which purposefully creates many many MB of garbage JS objects. Which simulates what Slashdot does, every time you press a key or scroll on Slashdot, lots and lots of JS garbage is created (I don't know why they wrote it that way).

That many MB of garbage are created on Slashdot and in the testcase page, is expected (see the JavaScript that creates the garbage in each case). The bug however is that we do not clean it up eventually (unless you do something like browse to somewhere else).
Comment 20 Richard Ayotte 2011-05-09 16:49:09 PDT
(In reply to comment #19)
> (In reply to comment #18)
> > Tightening the GC could help but isn't the allocation of 70 MB for the input
> > of a few hundred bytes of text the real problem?
> 
> The text is not related to the 70MB. That 70MB is generated by the
> JavaScript in the testcase page, which purposefully creates many many MB of
> garbage JS objects.

What's interesting though is that the testcase only allocates an extra 3MB in Chrome.
Comment 21 Boris Zbarsky [:bz] (still a bit busy) 2011-05-09 17:25:27 PDT
Chrome runs GC much more often than we do; it almost certainly runs it multiple times while the testcase runs.
Comment 22 Gregor Wagner [:gwagner] 2011-05-10 12:48:04 PDT
We also perform a few GCs during the testcase but at the end we stop executing and if you don't move the mouse any more we don't even get MaybeGC calls.
So it's random how full the JS heap is after you stop scrolling.
Comment 23 Gregor Wagner [:gwagner] 2011-05-10 12:54:38 PDT
Created attachment 531433 [details] [diff] [review]
patch

This patch runs a GC every 5 seconds if we have any calls to MaybeGC.
Note that we need 2 GCs in order to give back the memory to the OS because of background finalization.

It should solve the reduced testcase issue but the slashdot situation does not get better. Even after full GCs the memory doesn't go down. This seems to be a leak.
Comment 24 Igor Bukanov 2011-05-10 13:22:44 PDT
Our GC heuristics depend on the size of the GC heap after the last GC. The background finalization broke that as at the end of js_GC() we now call setGCLastBytes with the value that does not account for any arena finalized in the background.

I guess as a simple hack we could try to move setGCLastBytes at the end of the background finalization.
Comment 25 Igor Bukanov 2011-05-10 13:23:56 PDT
(In reply to comment #24)
> I guess as a simple hack we could try to move setGCLastBytes at the end of
> the background finalization.

Or at least try to call it again accounting for the newly allocated GC arenas.
Comment 26 Gregor Wagner [:gwagner] 2011-05-10 13:36:31 PDT
The original problem here is that the 70MB during typing in SlashDot are not reclaimed. They don't even get reclaimed if we perform full GCs!
So lets focus on this issue here.

The reduced testcase shows a different (well known) problem: We don't perform GCs during idle time. I filed bug 656120 for that.
Comment 27 Alon Zakai (:azakai) 2011-05-10 14:36:07 PDT
(In reply to comment #26)
> The original problem here is that the 70MB during typing in SlashDot are not
> reclaimed. They don't even get reclaimed if we perform full GCs!
> So lets focus on this issue here.
> 
> The reduced testcase shows a different (well known) problem: We don't
> perform GCs during idle time. I filed bug 656120 for that.

Gregor, in what situation are you seeing that even full GCs do not reclaim all the memory?

In my testing, both on Slashdot and in the testcase, memory is reclaimed if something like the JS error console is opened and closed, which triggers a GC.

For example, load the Slashdot page linked to in comment #0, scroll up and down until memory goes from 95 to 130, wait a few minutes (or longer), see that memory stays up. Then press control-shift-J to open the JS error console, and close it, and see that 5-10 seconds later memory goes down to 95.

Or, instead of scrolling, click to add a comment and hold down a key (so that the key repeats and lots of text is entered), again wait until memory goes from 95 to 130, and continue as in the last paragraph, with the same results.

Does memory not get reclaimed in these STRs for you?
Comment 28 Gregor Wagner [:gwagner] 2011-05-11 08:41:12 PDT
(In reply to comment #27)
> 
> Gregor, in what situation are you seeing that even full GCs do not reclaim
> all the memory?
> 
> In my testing, both on Slashdot and in the testcase, memory is reclaimed if
> something like the JS error console is opened and closed, which triggers a
> GC.
> 
> For example, load the Slashdot page linked to in comment #0, scroll up and
> down until memory goes from 95 to 130, wait a few minutes (or longer), see
> that memory stays up. Then press control-shift-J to open the JS error
> console, and close it, and see that 5-10 seconds later memory goes down to
> 95.
> 
> Or, instead of scrolling, click to add a comment and hold down a key (so
> that the key repeats and lots of text is entered), again wait until memory
> goes from 95 to 130, and continue as in the last paragraph, with the same
> results.
> 
> Does memory not get reclaimed in these STRs for you?

Yes you are right. The memory goes down again after some time. I guess I didn't wait long enough.
I just tested on my netbook and after a few lines I almost couldn't type any more. With my usual typing speed I was always one word ahead of what I saw on the screen. Very annoying!
Comment 29 Alon Zakai (:azakai) 2011-05-11 10:07:51 PDT
> I just tested on my netbook and after a few lines I almost couldn't type any
> more. With my usual typing speed I was always one word ahead of what I saw
> on the screen. Very annoying!

Yes, the overall Firefox user experience on Slashdot is not good. I filed an evangelism bug for that, bug 656347.
Comment 30 Nicholas Nethercote [:njn] 2011-05-11 16:21:15 PDT
(In reply to comment #28)
> 
> Yes you are right. The memory goes down again after some time. I guess I
> didn't wait long enough.

I'm about to land bug 654041 which adds buttons to about:memory to force GC and CC.  Should be very helpful to avoid this not-waiting-long-enough problem :)
Comment 31 Nicholas Nethercote [:njn] 2011-06-20 18:28:19 PDT
Bug 656120 just landed.  Can we confirm whether it helped with the reduced test case?
Comment 32 Alon Zakai (:azakai) 2011-06-21 13:56:04 PDT
I confirmed that this bug is now fixed, memory is reclaimed properly (after a short pause). Marking resolved. Great stuff, Gregor!

We still don't have good performance on Slashdot, due to large amounts of garbage it creates all the time (scrolling, keypresses, etc.). However, at least that garbage is now reclaimed.
Comment 33 Asa Dotzler [:asa] 2011-07-17 22:55:53 PDT
So this bug is resolved fixed because we got the fix for Firefox 7 over in Bug 656120. But this is a regression in Firefox 6 and we don't seem to have a fix there. Is that correct? Are we going to just live with it for 6?
Comment 34 Nicholas Nethercote [:njn] 2011-07-17 23:39:31 PDT
(In reply to comment #33)
> So this bug is resolved fixed because we got the fix for Firefox 7 over in
> Bug 656120. But this is a regression in Firefox 6 and we don't seem to have
> a fix there. Is that correct? Are we going to just live with it for 6?

I don't see what else we can do.  Even if a spot-fix could be devised, if the patch from bug 656120 was deemed too risky for FF6 why would a spot-fix be deemed acceptable, esp. since we are much later in the cycle?
Comment 35 Asa Dotzler [:asa] 2011-07-17 23:41:50 PDT
That's great information, njn. I'm not making any judgements here, just looking for data so we can make the best use of our time in the big group triage meeting tomorrow. Thanks for the follow-up.
Comment 36 Christopher Blizzard (:blizzard) 2011-08-01 15:02:00 PDT
Not tracking for Firefox 6 - too late for that train.
Comment 37 Anthony Hughes (:ashughes) [GFX][QA][Mentor] 2011-09-22 13:39:24 PDT
qa+: QA verification needed on Firefox 7 using the attached testcase and the steps in comment 5
Comment 38 Mihaela Velimiroviciu (:mihaelav) 2011-09-23 04:27:17 PDT
Verified on Win7/64 bit with Firefox 7RC, Aurora and Nightly:
Build identifier: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:7.0) Gecko/20100101 Firefox/7.0
Build identifier: Mozilla/5.0 (Windows NT 6.1; WOW64; rv:8.0a2) Gecko/20110914 Firefox/8.0a2
Build identifier: Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:9.0a1) Gecko/20110922 Firefox/9.0a1

In the reduced test case page, when scrolling page up/down, memory usage increases with about 70 MB (on Aurora and 7RC) or 100MB (on Nightly), but decreases back in a few seconds after the scrolling is stopped.
Comment 39 Mihaela Velimiroviciu (:mihaelav) 2011-09-23 06:30:13 PDT
LINUX:
Mozilla/5.0 (X11; Linux x86_64; rv:9.0a1) Gecko/20110922 Firefox/9.0a1
Mozilla/5.0 (X11; Linux x86_64; rv:8.0a2) Gecko/20110916 Firefox/8.0a2
Mozilla/5.0 (X11; Linux x86_64; rv:7.0) Gecko/20100101 Firefox/7.0

MAC:
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:9.0a1) Gecko/20110922 Firefox/9.0a1
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:8.0a2) Gecko/20110921 Firefox/8.0a2
Mozilla/5.0 (Macintosh; Intel Mac OS X 10.6; rv:7.0) Gecko/20100101 Firefox/7.0

Same as in comment 38 with Firefox 7RC on Ubuntu and MAC 10.6
Comment 40 Mihaela Velimiroviciu (:mihaelav) 2011-09-23 07:19:50 PDT
Verified as fixed also on Win XP:
Mozilla/5.0 (Windows NT 5.1; rv:7.0) Gecko/20100101 Firefox/7.0
Comment 41 Robert O'Callahan (:roc) (email my personal email if necessary) 2012-09-21 03:37:22 PDT
http://www.cs.umd.edu/~mwh/papers/expositor-submitted.pdf claims that there is an unfixed underlying bug that caused this bug, a race on gcChunksWaitingToExpire.
Comment 42 Gregor Wagner [:gwagner] 2012-09-21 07:39:11 PDT
(In reply to Robert O'Callahan (:roc) (Mozilla Corporation) from comment #41)
> http://www.cs.umd.edu/~mwh/papers/expositor-submitted.pdf claims that there
> is an unfixed underlying bug that caused this bug, a race on
> gcChunksWaitingToExpire.

gcChunksWaitingToExpire and the mentioned timer-based code path don't exist any more. But they are right, bug 631733 in combination with bug 650326 caused some problems.

Note You need to log in before you can comment on or make changes to this bug.